Off-Policy Policy Gradient with State Distribution Correction

Yao Liu Department of Computer Science, Stanford University Adith Swaminathan Microsoft Research Alekh Agarwal Emma Brunskill Department of Computer Science, Stanford University

We study the problem of off-policy policy optimization in Markov decision processes, and develop a novel off-policy policy gradient method. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data and the distribution of states that would be visited under the learned policy. Here we build on recent progress for estimating the ratio of the Markov chain stationary distributions of states in policy evaluation, and present an off-policy policy gradient optimization technique that can account for this mismatch in distributions. We present an illustrative example of why this is important, a theoretical convergence guarantee for our approach, and empirical simulations that highlight the benefits of correcting this distribution mismatch.

1 Introduction

The ability to use data about prior decisions and their outcomes to make counterfactual inferences about how alternative decision policies might perform, is a cornerstone of intelligent behavior. It also has immense practical potential – it can enable the use of electronic medical record data to infer better treatment decisions for patients, the use of prior product recommendations to inform more effective strategies for presenting recommendations, and previously collected data from students using educational software to better teach those and future students. Such counterfactual reasoning, particularly when one is deriving decision policies that will be used to make not one but a sequence of decisions, is important since online sampling during a learning procedure is both costly and dangerous, and not practical in many of the applications above. While amply motivated, doing such counterfactual reasoning is also challenging because the data is censored – we can only observe the result of providing a particular chemotherapy treatment policy to a particular patient, not the counterfactual of if we were then to start with a radiation sequence.

We focus on the problem of performing such counterfactual inferences in the context of sequential decision making in a Markov decision process (MDP). We assume that data has been previously collected using some fixed and known behavior policy, and our goal is to learn a new decision policy with good performance for future use. This problem is often known as batch off-policy policy optimization. We assume that the behavior policy used to gather the data is stochastic: if it is deterministic, without any additional assumptions, we will not be able to estimate the performance of any other policy.

In this paper we consider how to perform batch off-policy policy optimization (OPPO) using a policy gradient method. While there has been increasing interest in batch off-policy reinforcement learning (RL) over the last few years  (Thomas et al., 2015; Jiang and Li, 2016; Thomas and Brunskill, 2016), much of this has focused on off-policy policy evaluation, where the goal is to estimate the performance of a particular given target decision policy. Ultimately we will very frequently be interested in the optimization question, which requires us to determine a good new policy for future potential deployment, given a fixed batch of prior data.

To do batch off-policy policy optimization, model free methods (like deep Q-learning (Mnih et al., 2015) or fitted Q iteration (Ernst et al., 2005)) can be used alone, but there are many cases where we might prefer to focus on policy gradient or actor-critic methods. Policy gradient methods have seen substantial success in the last few years (Schulman et al., 2015) in the on-policy setting, and they can be particularly appealing for cases where it is easier to encode inductive bias in the policy space, or when the actions are continuous (see e.g. Abbeel and Schulman (2016) for more discussion). However, existing approaches to incorporating offline information into online policy gradients have shown limited benefit (Gu et al., 2017b, a), in part due to the variance in gradients incurred due to incorporating off-policy data. One approach is to correct exactly for the difference between the sampling data distribution and the target policy data distribution, by using importance sampling to re-weight every sample according to the likelihood ratio of behavior policy and evaluation policy up to that step. Unfortunately the variance of this importance sampling ratio will grow exponentially with the problem horizon.
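The exponential growth of the trajectory-level importance weight can be illustrated with a small calculation. The sketch below assumes a simplified setting that is not taken from the paper: a single state with two actions, so that the per-step ratios are independent and identically distributed. The policies `mu` and `pi` are hypothetical values chosen only for illustration.

```python
def step_second_moment(pi, mu):
    """Return E_{a ~ mu}[(pi(a) / mu(a))^2] for one discrete action distribution."""
    return sum(p * p / m for p, m in zip(pi, mu))


def trajectory_weight_variance(pi, mu, horizon):
    """Variance of prod_t pi(a_t) / mu(a_t) over `horizon` independent steps.

    The product has mean 1 under mu, and independence across steps gives
    E[product^2] = (per-step second moment) ** horizon.
    """
    return step_second_moment(pi, mu) ** horizon - 1.0


mu = [0.5, 0.5]  # hypothetical uniform behavior policy
pi = [0.8, 0.2]  # hypothetical target policy

for horizon in (1, 10, 50, 200):
    print(horizon, trajectory_weight_variance(pi, mu, horizon))
```

In this toy setting the per-step second moment is 1.36, so the variance of the full-trajectory weight scales as 1.36 raised to the horizon, which is the blow-up that motivates reweighting states rather than trajectories.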

To avoid introducing variance in gradients, off-policy actor critic (Off-PAC) (Degris et al., 2012) ignores the stationary state distribution difference between the behavior policy and the target policy, and instead uses only one step of importance sampling to reweight the action distributions. Many practical off-policy policy gradient algorithms, including DDPG (Silver et al., 2014), ACER (Wang et al., 2016), and Off-PAC with emphatic weightings (Imani et al., 2018), are based on the gradient expression in the Off-PAC algorithm (Degris et al., 2012). However, as we will demonstrate, not correcting for this mismatch in state distributions can result in poor performance in general, both in theory and empirically.

Instead, here we introduce an off-policy policy gradient algorithm that can be used with batch data and that accounts for the difference in the state distributions between the current target and behavior policies during each gradient step. Our approach builds on recent approaches for policy evaluation that avoid the exponential blow up in importance sampling weights by instead computing a direct ratio over the stationary distribution of state visitations under the target and behavior policy (Hallak and Mannor, 2017; Liu et al., 2018a; Gelada and Bellemare, 2019). We incorporate these ideas within an off-policy actor critic method to do batch policy optimization. We first provide an illustrative example to demonstrate the benefit of this approach over Off-PAC (Degris et al., 2012), and show that correcting for the mismatch in state distributions of the behavior policy and the target policy can be critical for getting good estimates of the policy gradient. We also provide convergence guarantees for our algorithm under certain assumptions. We then compare our approach and Off-PAC experimentally on two simulated domains, cart pole and an HIV patient simulator (Ernst et al., 2005). Our results show that our approach is able to learn a substantially higher performing policy than both Off-PAC and the behavior policy that is used to gather the batch data. We further demonstrate that we can use the recently proposed off-policy evaluation technique of Liu et al. (2018a) to reliably identify good policies found during the policy gradient optimization run. Our results suggest that directly accounting for the state distribution mismatch can be done without prohibitively increasing the variance during policy gradient evaluations, and that doing so can yield significantly better policies. These results are promising for enabling us to learn better policies given batch data, or for improving the sample efficiency of online policy gradient methods by better incorporating past data.

Related Work

Many prior works focus on the policy evaluation problem, as it is a foundation for downstream policy learning problems. These approaches often build on importance sampling techniques to correct for distribution mismatch in the trajectory space, pioneered by the early work on eligibility traces (Precup et al., 2000), and further enhanced with a variety of variance reduction techniques (Thomas et al., 2015; Jiang and Li, 2016; Thomas and Brunskill, 2016). Some authors consider model-based approaches to off-policy policy evaluation (OPPE) (Farajtabar et al., 2018; Liu et al., 2018b), which usually perform better than importance sampling approaches empirically in policy evaluation settings. But those methods do not extend easily to our OPPO setting: they introduce additional challenges due to bias in the models and typically require fitting a separate model for each target policy. The recent work of Liu et al. (2018a) partially alleviates the variance problem for model-free OPPE by reweighting the state visitation distributions; the resulting variance can be just as high in the worst case, but is often much smaller. Our work incorporates this recent estimator in policy optimization methods to enable learning from off-policy collected data.

In the off-policy policy optimization setting, many works study value-function based approaches (like fitted Q iteration (Ernst et al., 2005) and DQN (Mnih et al., 2015)), as they are known to be more robust to distribution mismatch. Some recent works aim to further incorporate reweighting techniques within off-policy value function learning (Hallak and Mannor, 2017; Gelada and Bellemare, 2019). These methods hint at the intriguing potential of value-function based techniques for off-policy learning, and we are interested in similarly understanding the viability of using direct policy optimization techniques in the off-policy setting.

The off-policy actor critic method (Degris et al., 2012; Imani et al., 2018) proposed an answer to this question by learning the critic in an off-policy way and reweighting actor gradients by correcting the conditional action probabilities, but it ignores the mismatch between the state visitation distributions of the data collection policy and the learned policies. A different research thread on trust region policy optimization (Schulman et al., 2015), while requiring the on-policy setting, incorporates robustness to the mismatch between the data collection and gradient evaluation policies. However, this is not a fully off-policy scenario, and learning from an offline dataset is still strongly motivated by many applications. Many recent methods (Silver et al., 2014; Wang et al., 2016; Gu et al., 2017a, b; Lillicrap et al., 2015) are derived from the policy gradient form in Degris et al. (2012), and some are also combined with the trust region idea to improve empirical sample efficiency by using more off-policy samples from previous iterations. In this work, we demonstrate a basic weakness of the policy gradient definition in Degris et al. (2012), and show how to correct it.

2 Preliminaries

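We consider finite horizon MDPs $M = \langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, with a continuous state space $\mathcal{S}$, a discrete action space $\mathcal{A}$, a transition probability distribution $P(s'|s,a)$ and an expected reward function $r(s,a)$. We observe tuples of state, action, reward and next state: $(s, a, r, s')$, where $s$ is drawn from an initial state distribution $p_0$, action $a$ is drawn from a stochastic behavior policy $\mu(a|s)$ and the reward and next state are generated by the MDP. Given a discount factor $\gamma \in (0, 1]$, the goal is to maximize the expected return of policy $\pi$:

$$R^{\pi}_{M} = \lim_{T \to \infty} \frac{\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]}{\sum_{t=0}^{T} \gamma^{t}} \qquad (1)$$

When $\gamma = 1$ this becomes the average reward case, and $\gamma \in (0, 1)$ is called the discounted reward case. Given any fixed policy $\pi$ the MDP becomes a Markov chain and we can define the state distribution at time step $t$, $d^{\pi}_{t}(s)$, and the stationary state distribution across time:

$$d^{\pi}(s) = \lim_{T \to \infty} \frac{\sum_{t=0}^{T} \gamma^{t} d^{\pi}_{t}(s)}{\sum_{t=0}^{T} \gamma^{t}}$$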
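As a concrete illustration of the normalized state distribution across time, the sketch below computes the truncated, discount-weighted average of per-step state distributions for a small tabular Markov chain. The two-state chain, the function name `state_distribution`, and the truncation horizon are all assumptions made for this example; the paper itself works with continuous state spaces.

```python
def state_distribution(p0, chain, gamma, horizon):
    """Return sum_t gamma^t d_t / sum_t gamma^t for t = 0..horizon.

    p0 is the initial state distribution and chain[s][s2] is the
    state-to-state transition probability under a fixed policy.
    """
    n = len(p0)
    d_t = list(p0)          # state distribution at the current time step
    acc = [0.0] * n         # running discount-weighted sum of d_t
    norm = 0.0              # running sum of the discount weights
    weight = 1.0
    for _ in range(horizon + 1):
        acc = [acc[s] + weight * d_t[s] for s in range(n)]
        norm += weight
        d_t = [sum(d_t[s] * chain[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= gamma
    return [x / norm for x in acc]


# Hypothetical two-state chain that starts in state 0.
p0 = [1.0, 0.0]
chain = [[0.9, 0.1],
         [0.5, 0.5]]

d_avg = state_distribution(p0, chain, gamma=1.0, horizon=20000)   # average reward case
d_disc = state_distribution(p0, chain, gamma=0.9, horizon=500)    # discounted case
print(d_avg, d_disc)
```

With `gamma=1.0` the result approaches the chain's stationary distribution (5/6, 1/6), while with `gamma=0.9` more mass stays on the initial state because early time steps receive more weight.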

To make sure the optimal policy is learnable from collected data, we assume the following about the support set of behavior policy:

Assumption 1.

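For at least one optimal policy $\pi^*$, $d^{\mu}(s) > 0$ for all $s$ such that $d^{\pi^*}(s) > 0$, and $\mu(a|s) > 0$ for all $a$ such that $\pi^*(a|s) > 0$ when $d^{\pi^*}(s) > 0$.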

3 An Off-Policy Policy Gradient Estimator

Note that Assumption 1 is quite weak when designing a policy evaluation or optimization scheme, since it only guarantees that $\mu$ adequately visits all the states and actions visited by some $\pi^*$. However, a policy optimization algorithm might require off-policy policy gradient estimates at arbitrary intermediate policies it produces along the way, which might visit states not reached by $\mu$. A strong assumption to handle such scenarios is that Assumption 1 holds not just for some $\pi^*$, but for any possible policy $\pi$. Instead of making such a strong assumption, we start by defining an augmented MDP where Assumption 1 suffices for obtaining pessimistic estimates of policy values and gradients.

3.1 Constructing an Augmented MDP

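Given a data collection policy $\mu$, let its support set be $S_\mu = \{s : d^{\mu}(s) > 0\}$ and $SA_\mu = \{(s,a) : d^{\mu}(s)\mu(a|s) > 0\}$. Consider a modified MDP $M_\mu = \langle S_\mu \cup \{s_{abs}\}, \mathcal{A}, P_\mu, r_\mu, \gamma \rangle$. Any state-action pairs not in $SA_\mu$ will essentially transition to $s_{abs}$, which is a new absorbing state where all actions lead to a zero reward self-loop. Concretely, $P_\mu(s_{abs}|s_{abs}, a) = 1$ and $r_\mu(s_{abs}, a) = 0$ for any $a$. For all other states, the transition probabilities and rewards are defined as follows. For $(s,a) \in SA_\mu$, $P_\mu(s'|s,a) = P(s'|s,a)$ for all $s' \in S_\mu$, and $P_\mu(s_{abs}|s,a) = 0$. For all $s \in S_\mu$ but $(s,a) \notin SA_\mu$, $P_\mu(s_{abs}|s,a) = 1$. Finally, $r_\mu(s,a) = r(s,a)$ for $(s,a) \in SA_\mu$, and $r_\mu(s,a) = 0$ otherwise. First we prove that the optimal policy of the original MDP remains optimal in the augmented MDP as a consequence of Assumption 1.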
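The construction of the augmented MDP can be sketched for a small tabular problem. This is only an illustration under stated assumptions: states and actions are indexed by integers, every listed state is assumed to lie in the support of the behavior policy, and the function name `augment_mdp` and the example numbers are hypothetical.

```python
def augment_mdp(P, r, supported):
    """Build (P_mu, r_mu) with one absorbing state appended as the last index.

    P[s][a][s2] are transition probabilities, r[s][a] are expected rewards,
    and `supported` is the set of (s, a) pairs in the behavior policy's support.
    """
    n, k = len(P), len(P[0])
    P_mu, r_mu = [], []
    for s in range(n):
        P_row, r_row = [], []
        for a in range(k):
            if (s, a) in supported:
                # Supported pairs keep their dynamics and never enter s_abs.
                P_row.append(list(P[s][a]) + [0.0])
                r_row.append(r[s][a])
            else:
                # Unsupported pairs jump to s_abs with zero reward.
                P_row.append([0.0] * n + [1.0])
                r_row.append(0.0)
        P_mu.append(P_row)
        r_mu.append(r_row)
    # The absorbing state loops on itself with zero reward for every action.
    P_mu.append([[0.0] * n + [1.0] for _ in range(k)])
    r_mu.append([0.0] * k)
    return P_mu, r_mu


# Hypothetical 2-state, 2-action MDP where the pair (1, 1) is never observed.
P = [[[0.5, 0.5], [1.0, 0.0]],
     [[0.0, 1.0], [0.3, 0.7]]]
r = [[1.0, 2.0],
     [0.5, 3.0]]
supported = {(0, 0), (0, 1), (1, 0)}

P_mu, r_mu = augment_mdp(P, r, supported)
print(P_mu[1][1], r_mu[1][1])
```

Note that the high-reward pair (1, 1) is assigned zero reward in the augmented problem, which reflects the pessimistic treatment of state-action pairs that the data cannot inform.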

Theorem 1.

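The expected return of any policy in the original MDP is at least as large as its expected return in the new MDP: $R^{\pi}_{M} \ge R^{\pi}_{M_\mu}$ for all $\pi$. For any optimal $\pi^*$ that satisfies Assumption 1 we have that

$$R^{\pi^*}_{M} = R^{\pi^*}_{M_\mu}.$$

That is, policy optimization in $M_\mu$ has at least one optimal solution identical to the original MDP $M$ with the same policy value. Since $R^{\pi}_{M_\mu}$ lower bounds the policy value in $M$, sub-optimal policies remain sub-optimal.

Proof. For any trajectory sampled from policy $\pi$, if every $(s_t, a_t) \in SA_\mu$ then the trajectory receives the same rewards in $M_\mu$ as in $M$. If not, let $(s_t, a_t)$ be the first state-action pair that is not in $SA_\mu$. Then the rewards in $M_\mu$ agree with those in $M$ before step $t$ and are zero from step $t$ onward, so the accumulated reward in $M_\mu$ is no larger than in $M$. Dividing the accumulated rewards by $\sum_{t=0}^{T} \gamma^{t}$ and taking the limit $T \to \infty$, then taking the expectation over trajectories induced by $\pi$, we have that $R^{\pi}_{M} \ge R^{\pi}_{M_\mu}$. For $\pi^*$, $SA_\mu$ covers all state-action pairs reachable by $\pi^*$, so the expected return remains the same. ∎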

3.2 Off-Policy Policy Gradient in Augmented MDP

We will now use the expected return in the modified MDP, $R^{\pi}_{M_\mu}$, as a surrogate for deriving policy gradients. According to the policy gradient theorem in Sutton et al. (2000), for a parametric policy $\pi_\theta$ with parameters $\theta$:

$$\nabla_\theta R^{\pi_\theta}_{M_\mu} = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[ Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a|s) \right]$$

From here on, $Q^{\pi}$ is with respect to the new MDP $M_\mu$. The definition of $Q^{\pi}$ in both the average and discounted reward cases follows Sutton et al. (2000).

Footnote 1: For the discounted case, our definition of expected return differs from the definition of Sutton et al. (2000) by a normalization factor $(1-\gamma)$. This is because the definitions of stationary distributions are scaled differently in the two cases.

Now we will show that we can get an unbiased estimator of this gradient using importance sampling from the stationary state distribution $d^{\mu}$ and the action distribution $\mu$. According to the definition of $M_\mu$, for all $(s,a)$ such that $d^{\mu}(s)\mu(a|s) = 0$, $(s,a)$ is not in $SA_\mu$. Hence $Q^{\pi}(s,a) = 0$ for any policy $\pi$, since $(s,a)$ will receive zero reward and lead to a zero reward self-loop. So we have:

$$\nabla_\theta R^{\pi_\theta}_{M_\mu} = \mathbb{E}_{s \sim d^{\mu},\, a \sim \mu}\left[ \frac{d^{\pi_\theta}(s)}{d^{\mu}(s)}\, \frac{\pi_\theta(a|s)}{\mu(a|s)}\, Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a|s) \right] \qquad (2)$$

Note that according to the definition of $M_\mu$, the Markov chain induced by $M_\mu$ and $\mu$ is exactly the same as the one induced by $M$ and $\mu$. Thus the distribution of $(s, a, r, s')$ generated by executing $\mu$ in $M$ is the same as executing $\mu$ in $M_\mu$. So we can estimate this policy gradient using the data we collected from $\mu$ in $M$. We conclude the section by pointing out that working in the augmented MDP allows us to construct a reasonable off-policy policy gradient estimator under the mild Assumption 1, while all prior works in this vein either explicitly or implicitly require the coverage of all possible policies.
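The effect of the state distribution ratio in the gradient expression can be checked exactly on a tiny tabular example. The sketch below is illustrative only: all distributions, the action-value table `Q`, and the table `score` (standing in for one component of the gradient of the log policy) are hypothetical numbers, not quantities from the paper. It compares the on-policy gradient, the gradient reweighted by both the state and action ratios, and a variant that corrects only the action probabilities.

```python
def on_policy_gradient(d_pi, pi, Q, score):
    """E over s ~ d_pi, a ~ pi of Q(s, a) * score(s, a)."""
    return sum(d_pi[s] * pi[s][a] * Q[s][a] * score[s][a]
               for s in range(len(pi)) for a in range(len(pi[s])))


def corrected_gradient(d_mu, mu, d_pi, pi, Q, score):
    """E over s ~ d_mu, a ~ mu with both state and action ratios applied."""
    total = 0.0
    for s in range(len(mu)):
        w = d_pi[s] / d_mu[s]                # state distribution ratio
        for a in range(len(mu[s])):
            rho = pi[s][a] / mu[s][a]        # action probability ratio
            total += d_mu[s] * mu[s][a] * w * rho * Q[s][a] * score[s][a]
    return total


def action_only_gradient(d_mu, mu, pi, Q, score):
    """Same expectation but without the state distribution ratio."""
    total = 0.0
    for s in range(len(mu)):
        for a in range(len(mu[s])):
            rho = pi[s][a] / mu[s][a]
            total += d_mu[s] * mu[s][a] * rho * Q[s][a] * score[s][a]
    return total


# Hypothetical tabular quantities for two states and two actions.
d_pi = [0.2, 0.8]
d_mu = [0.5, 0.5]
pi = [[0.9, 0.1], [0.3, 0.7]]
mu = [[0.5, 0.5], [0.5, 0.5]]
Q = [[1.0, 0.0], [0.0, 2.0]]
score = [[0.5, -0.5], [1.0, -1.0]]

g_on = on_policy_gradient(d_pi, pi, Q, score)
g_corrected = corrected_gradient(d_mu, mu, d_pi, pi, Q, score)
g_action_only = action_only_gradient(d_mu, mu, pi, Q, score)
print(g_on, g_corrected, g_action_only)
```

With these numbers the fully reweighted expectation matches the on-policy value, while the action-only variant differs because the two policies visit the states with different frequencies.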

Note that in the average reward case, such an augmented MDP would not be helpful for policy optimization, since all policies that potentially reach $s_{abs}$ will have a value of zero, and the stationary state distribution will be a single mass in the absorbing state. That would not induce a practical policy optimization algorithm. In the average reward case, either we need a stronger assumption that $\mu$ covers the entire state-action space, or we must approximate the problem by setting a discount factor $\gamma < 1$ for the policy optimization algorithm, which is a common approach for deriving practical algorithms in an average reward (episodic) environment.

4 Algorithm: OPPOSD

Given the off-policy policy gradient derived in (2), how can we efficiently estimate it from samples collected from $\mu$? Notice that most quantities in the gradient estimator (2) are quite intuitive and also present in prior works such as Off-PAC. The main difference is the state distribution reweighting $d^{\pi}(s)/d^{\mu}(s)$, which we would like to estimate using samples collected with $\mu$. For estimating this ratio of state distributions, we build on the recent work of Liu et al. (2018a), which we describe next.

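For a policy $\pi$, let us define the shorthand $\rho_\pi(s,a) = \pi(a|s)/\mu(a|s)$. Further, given a function $w : \mathcal{S} \to \mathbb{R}$, define $\Delta(w; s, a, s') = w(s)\rho_\pi(s,a) - w(s')$. Then we have the following result.

Theorem 2 (Liu et al., 2018a).

Given any $\gamma \in (0, 1]$, assume that $d^{\mu}(s) > 0$ for all $s$ and define

$$L(w, f) = \gamma\, \mathbb{E}_{(s,a,s') \sim d^{\mu}}\left[ \Delta(w; s, a, s')\, f(s') \right] + (1-\gamma)\, \mathbb{E}_{s \sim p_0}\left[ (1 - w(s))\, f(s) \right],$$

where $p_0$ is the initial state distribution. Then $w(s) = d^{\pi}(s)/d^{\mu}(s)$ if and only if $L(w, f) = 0$ for any measurable test function $f$.

Footnote 2: When $\gamma = 1$, $w$ is only determined up to normalization, and hence an additional constraint $\mathbb{E}_{s \sim d^{\mu}}[w(s)] = 1$ is required to obtain the conclusion $w(s) = d^{\pi}(s)/d^{\mu}(s)$.

This result suggests a constructive procedure for estimating the state distribution ratio using samples from $\mu$, by finding a function $w$ over the states which minimizes $\max_f L(w, f)^2$. Since the maximization over all measurable functions as per Theorem 2 is intractable, Liu et al. (2018a) suggest restricting the maximization to a unit ball in an RKHS, which has an analytical solution to the maximization problem, and we use the same procedure to approximate density ratios in our algorithm.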
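The characterization in Theorem 2 can be checked numerically on a small tabular MDP in the discounted case. The sketch below is an illustration under assumptions that are not from the paper: a hypothetical 2-state, 2-action MDP, a uniform behavior policy, and a truncated sum for the discounted state distributions. It evaluates the loss with the exact expectation (no sampling and no RKHS restriction) and confirms that the true ratio zeroes it for several test functions, while the constant function 1 does not.

```python
def policy_chain(P, policy):
    """State-to-state transition matrix induced by `policy` on P[s][a][s2]."""
    n, k = len(P), len(P[0])
    return [[sum(policy[s][a] * P[s][a][s2] for a in range(k)) for s2 in range(n)]
            for s in range(n)]


def discounted_distribution(p0, chain, gamma, horizon=500):
    """Normalized discounted state distribution, truncated at `horizon` steps."""
    n = len(p0)
    d_t, acc, norm, weight = list(p0), [0.0] * n, 0.0, 1.0
    for _ in range(horizon + 1):
        acc = [acc[s] + weight * d_t[s] for s in range(n)]
        norm += weight
        d_t = [sum(d_t[s] * chain[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= gamma
    return [x / norm for x in acc]


def loss(w, f, P, pi, mu, d_mu, p0, gamma):
    """Exact value of the loss L(w, f) for tabular w and f."""
    n, k = len(P), len(P[0])
    transition_term = 0.0
    for s in range(n):
        for a in range(k):
            rho = pi[s][a] / mu[s][a]
            for s2 in range(n):
                prob = d_mu[s] * mu[s][a] * P[s][a][s2]
                transition_term += prob * (w[s] * rho - w[s2]) * f[s2]
    initial_term = sum(p0[s] * (1.0 - w[s]) * f[s] for s in range(n))
    return gamma * transition_term + (1.0 - gamma) * initial_term


# Hypothetical MDP: action 0 tends to lead to state 0, action 1 to state 1.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.9, 0.1], [0.1, 0.9]]]
p0 = [0.5, 0.5]
gamma = 0.9
mu = [[0.5, 0.5], [0.5, 0.5]]
pi = [[0.9, 0.1], [0.9, 0.1]]

d_mu = discounted_distribution(p0, policy_chain(P, mu), gamma)
d_pi = discounted_distribution(p0, policy_chain(P, pi), gamma)
w_true = [d_pi[s] / d_mu[s] for s in range(2)]
ones = [1.0, 1.0]

for f in ([1.0, 0.0], [0.0, 1.0], [0.3, -2.0]):
    print(loss(w_true, f, P, pi, mu, d_mu, p0, gamma),
          loss(ones, f, P, pi, mu, d_mu, p0, gamma))
```

In practice the expectation is replaced by sample averages and the test function class is restricted, so this exact check is only meant to make the fixed-point property concrete.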

Applying Theorem 2 requires overcoming one final obstacle. The theorem presupposes $d^{\mu}(s) > 0$ for all $s$. In the case where $M_\mu$ coincides with $M$, we can directly apply the theorem. Otherwise, in the MDP $M_\mu$, this assumption indeed holds for all states in $S_\mu$, but $\mu$ never visits the absorbing state $s_{abs}$, or any transitions leading into this state. However, since we know this special state, as well as the dynamics leading in and out of it, we can simulate some samples for this state, effectively corresponding to a slight perturbation of $\mu$ to cover $s_{abs}$. Concretely, we first choose a small smoothing factor $\epsilon \in (0, 1)$. For any sample $(s, a, s')$ in our data set, if there exist $k$ actions $a'$ such that $\mu(a'|s) = 0$, then we keep the old sample with probability $1 - \epsilon$, and with probability $\epsilon$ we sample one of the $k$ actions uniformly and change the next state to $s_{abs}$. If we sampled such an action, we would consequently also change all samples after this transition to a self-loop in $s_{abs}$. Thus we create samples drawn according to a new behavior policy $\hat{\mu}$ which covers all the state-action pairs: $\hat{\mu}(a|s) = (1-\epsilon)\mu(a|s) + \epsilon U(a|s)$, where $U(\cdot|s)$ is a uniform distribution over the actions not chosen by $\mu$ in state $s$. Now we can use Theorem 2 and the algorithm from Liu et al. (2018a) to estimate $d^{\pi}(s)/d^{\hat{\mu}}(s)$. Note that the propensity scores and policy gradients computed on this new dataset correspond to the behavior policy $\hat{\mu}$ and not $\mu$. Formally, in place of using (2), we now estimate:

$$\nabla_\theta R^{\pi_\theta}_{M_\mu} = \mathbb{E}_{s \sim d^{\hat{\mu}},\, a \sim \hat{\mu}}\left[ \frac{d^{\pi_\theta}(s)}{d^{\hat{\mu}}(s)}\, \frac{\pi_\theta(a|s)}{\hat{\mu}(a|s)}\, Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a|s) \right] \qquad (3)$$
Note that we can estimate the expectation in (3) from the smoothed dataset by construction, since the ratio in all states are known.
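The smoothing of the behavior policy described above can be sketched at the level of a single state's action distribution. The function below is a minimal illustration, assuming that the smoothed policy keeps the original probabilities with weight one minus the smoothing factor and spreads the remaining mass uniformly over actions the behavior policy never takes; the name `smooth_policy`, the variable `eps`, and the example distribution are hypothetical.

```python
def smooth_policy(mu_s, eps):
    """Return the smoothed action distribution for one state.

    mu_s is the behavior policy's action distribution in that state. Mass
    `eps` is moved uniformly onto actions with zero probability; states
    where every action already has support are left unchanged.
    """
    missing = [a for a, p in enumerate(mu_s) if p == 0.0]
    if not missing:
        return list(mu_s)
    return [(1.0 - eps) * p if p > 0.0 else eps / len(missing) for p in mu_s]


# Hypothetical state with four actions, two of which are never taken.
mu_s = [0.5, 0.5, 0.0, 0.0]
mu_hat_s = smooth_policy(mu_s, eps=0.1)
print(mu_hat_s)
```

In the data set itself, a transition whose action is resampled this way would have its next state replaced by the absorbing state, which this per-state sketch does not simulate.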

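Now that we have an algorithm for estimating policy gradients from (3), we can plug this into any policy gradient optimization method. Following prior work, we incorporate our off-policy gradients into an actor-critic algorithm. For learning the critic $\hat{V}$, we can use any off-policy Temporal Difference (Bhatnagar et al., 2009; Maei, 2011) or Q-learning algorithm (Watkins and Dayan, 1992). In our algorithm, we fit an approximate value function $\hat{V}$ by:

$$\hat{V}(s) \leftarrow \hat{V}(s) + \alpha_c\, \frac{\pi(a|s)}{\mu(a|s)} \left( R^{\lambda}(s,a) - \hat{V}(s) \right) \qquad (4)$$

Footnote 3: For simplicity, Eqn 4 views $\hat{V}$ in the tabular setting. See Line 14 in Alg 1 for the function approximation case.

where $\alpha_c$ is the step-size for critic updates and $R^{\lambda}$ is the off-policy $\lambda$-return:

$$R^{\lambda}(s,a) = r(s,a) + (1-\lambda)\gamma \hat{V}(s') + \lambda\gamma\, \frac{\pi(a'|s')}{\mu(a'|s')}\, R^{\lambda}(s',a')$$

and $(s', a')$ is generated by executing $\mu$. After we learn $\hat{V}$, $R^{\lambda}$ serves the role of $Q^{\pi}$ in our algorithm.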
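An off-policy lambda-return of this kind is typically computed by a backward pass over a logged trajectory. The sketch below assumes the recursion used in Off-PAC-style methods, where each step mixes a one-step bootstrap with the importance-weighted return of the next step; this recursion, the function name, and the example numbers are assumptions for illustration rather than a transcription of the paper's critic.

```python
def off_policy_lambda_returns(rewards, next_values, rhos, gamma, lam):
    """Backward recursion for importance-weighted lambda-returns.

    rewards[t] is r_t, next_values[t] is the critic's estimate at s_{t+1},
    and rhos[t] is pi(a_t | s_t) / mu(a_t | s_t). The last step has no
    successor in the trajectory, so it bootstraps from the critic alone.
    """
    T = len(rewards)
    returns = [0.0] * T
    returns[T - 1] = rewards[T - 1] + gamma * next_values[T - 1]
    for t in range(T - 2, -1, -1):
        returns[t] = (rewards[t]
                      + (1.0 - lam) * gamma * next_values[t]
                      + lam * gamma * rhos[t + 1] * returns[t + 1])
    return returns


# Hypothetical three-step trajectory.
rewards = [1.0, 1.0, 1.0]
next_values = [0.5, 0.5, 0.0]
rhos = [1.0, 1.0, 1.0]

td_targets = off_policy_lambda_returns(rewards, next_values, rhos, gamma=0.5, lam=0.0)
mc_returns = off_policy_lambda_returns(rewards, next_values, rhos, gamma=0.5, lam=1.0)
print(td_targets, mc_returns)
```

Setting `lam=0.0` recovers one-step bootstrapped targets, while `lam=1.0` with unit ratios recovers discounted Monte-Carlo returns, so the parameter trades bias from the critic against variance from the importance ratios.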

Given the estimates of the state distribution ratio from Liu et al. (2018a) and the critic updates from (4), we can now update the policy by plugging these quantities into (3). It remains to specify the initial conditions to start the algorithm. Since we have data collected from a behavior policy, it is natural to also warm-start the actor policy in our algorithm to be the same as the behavior policy, and correspondingly the critic and $w$'s to be the value function and distribution ratios for the behavior policy. This can be particularly useful in situations where the behavior policy, while suboptimal, still gets to states with high rewards with a reasonable probability. Hence we use behavior cloning to warm-start the policy parameters for the actor, use on-policy value function learning for the critic, and also fit the state ratios for the actor obtained by behavior cloning. Note that while the ratio will be identically equal to 1 if our behavior cloning was perfect, we actually estimate the ratio to better handle imperfections in the learned actor initialization.

0:  Hyperparameters , , , , , ,
1:  Warm start , ,
2:  Pad to get if necessary
3:  for each step of policy update do
4:     for state ratio updates  do
5:        Sample a mini-batch according to 444, where is the empirical state distribution at time step in dataset
6:        if $\gamma = 1$ then
7:           Perform one update according to Algorithm 1 in Liu et al. (2018a) with stepsize
8:        else
9:           Perform one update according to Algorithm 2 in Liu et al. (2018a) with stepsize
10:        end if
11:     end for
12:     for critic updates  do
13:        Sample a mini-batch
14:        , where:
15:     end for
16:     Sample a mini-batch according to
20:  end for
Algorithm 1 OPPOSD: Off-Policy Policy Optimization with State Distribution Correction

A full pseudo-code of our algorithm, which we call OPPOSD for Off-Policy Policy Optimization with State Distribution Correction, is presented in Algorithm 1. We mention a couple of implementation details which we found helpful in improving the convergence properties of the algorithm. Typical actor-critic algorithms update the critic once per actor update in the on-policy setting. However, in the off-policy setting, we find that performing multiple critic updates before an actor update is helpful, since the off-policy TD learning procedure can have a high variance. Secondly, the computation of the state distribution ratio is done in an online manner similar to the critic updates, and analogous to the critic, we always retain the state of the optimizer for across the actor updates (rather than learning the from scratch after each actor update). Similar to the critic, we also perform multiple updates after each actor update. These choices are intuitively reasonable as the standard two-time scale asymptotic analysis of actor-critic methods (Borkar, 2009) does require the critic to converge faster than the actor.

5 Convergence Result

In this section, we present two main results to demonstrate the theoretical advantage of our algorithm. First we present a simple scenario where the prior approach of Off-PAC yields an arbitrarily biased gradient estimate, despite having access to a perfect critic. In contrast OPPOSD estimates the gradients correctly whenever the distribution ratios in (2) and the critic are estimated perfectly, by definition. We will further provide a convergence result for OPPOSD to a stationary point of the expected reward function.

A hard example for Off-PAC

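Many prior off-policy policy gradient methods use the policy gradient estimates proposed in Degris et al. (2012):

$$g(\theta) = \mathbb{E}_{s \sim d^{\mu}}\left[ \sum_{a} \nabla_\theta \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a) \right]$$
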
Notice that, in contrast to the exact policy gradient, the expectation over states is taken with respect to the behavior policy distribution instead of . In tabular settings this can lead to correct policy updates, as proved by Degris et al. (2012). We now present an example where the policy gradient computed this way is problematic when using function approximators. Consider the problem instance shown in Figure 1, where the behavior policy is given as: . Thus gives us good coverage over all states and actions. Now we consider policies parameterized by a parameter where has the following structure:

Thus aliases the states and as a manifestation of imperfect representation which is typical with large state spaces. The true state value function of , satisfies that:

Now we define our policy class . Clearly the optimal policy is as it completely eliminates the ill-effects of state aliasing. We now study the Off-PAC gradient estimator in an idealized setting where the critic is perfectly known. As per Equation 5 of Degris et al. (2012), we have

That is, the gradient vanishes for any policy , meaning that the algorithm can be arbitrarily sub-optimal at any point during policy optimization. We note that this does not contradict the previous Off-PAC theorems as the policy class is not fully expressive in our example, a requirement for their convergence results. Our gradient estimator (2) instead evaluates to , which is correctly maximized at .

Figure 1: Hard example for Off-PAC (Degris et al., 2012)

Convergence results for OPPOSD

We next ask whether OPPOSD converges, given reasonable estimates for the density ratio and the critic. To this end, we need to introduce some additional notations and assumptions. Suppose we run OPPOSD over some parametric policy class with . In the sequel, we use subscripts and superscripts by to mean the corresponding quantities with to ease the notation. We begin by describing an abstract set of assumptions and a general result about the convergence of OPPOSD, when we run it over the policies given data collected with an arbitrary policy , before instantiating our assumptions for the specific structure of used in our algorithm.

Definition 1.

A function is -smooth when

Assumption 2.

pairs, and a data collection policy , we assume that the MDP guarantees:

  1. , and the function approximator for satisfies .

  2. The expected return of : is a differentiable, -Lipschitz and -smooth function w.r.t. .

Theorem 3.

Assume an MDP, a data collection policy and function classes and satisfy Assumption 2. Suppose OPPOSD with policy parameters at iteration is provided critic estimates and distribution ratio estimates satisfying and for iterations . Then


That is, when Assumption 2 holds, the scheme converges to an approximate stationary point given estimators and with a small average MSE across the iterations under . An immediate consequence of the theorem above is that as long as we guarantee that , which a reasonable online critic and learning algorithm can guarantee, we have:

which implies the procedure will converge to a stationary point where the true policy gradient is zero.

We now discuss the validity of Assumption 2 in the specific context of the data collection policy used in OPPOSD as well as the augmented MDP . The first assumption, that the gradient of policy distribution is bounded, can be achieved by an appropriate policy parametrization such as a linear or a differentiable neural network-based scoring function composed with a softmax link. The second assumption on bounded value functions is standard in the literature. In particular, both these assumptions are crucial for the convergence of policy gradient methods even in an on-policy setting. The third assumption on lower bounded action probabilities holds by construction for the policy due to the smoothing. The fourth assumption on bounded distribution ratios can be ensured if . Technically, this might not hold for in as some states in might be reached with tiny probabilities, but we can instead define to be the set of all the states with . With this change, and given a suitably large , always satisfies the fourth assumption in the MDP . We note that the assumption also requires the outputs of the function approximator to be bounded, which might require additional clipping or regularization in the algorithm. In Algorithm 1, we instead use a weighted importance sampling version of which normalize the value in by its mean in one batch, which ensures that the largest value of is the mini-batch size . Finally the regularity assumption on the smoothness of the reward function is again standard for policy gradient methods even in an on-policy setting.
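The batch normalization of the estimated ratios mentioned above can be illustrated in a few lines. The sketch assumes the ratios in a mini-batch are non-negative with a positive sum; the function name and example values are hypothetical.

```python
def normalize_ratios(w_batch):
    """Divide each estimated ratio by the mini-batch mean (weighted importance sampling)."""
    mean = sum(w_batch) / len(w_batch)
    return [w / mean for w in w_batch]


# Hypothetical ratio estimates for a mini-batch of three states.
w_batch = [0.5, 1.5, 2.0]
w_normalized = normalize_ratios(w_batch)
print(w_normalized)

# Even when one estimate dominates, no normalized value exceeds the batch size.
print(normalize_ratios([0.0, 0.0, 9.0]))
```

Because the normalized values are non-negative and average to one, each of them is bounded by the mini-batch size, which is the boundedness property the discussion relies on.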

Thus we find that under standard assumptions for policy gradient methods, along with some reasonable additional conditions, we expect OPPOSD to have good convergence properties in theory.

6 Experimental Evaluation

(a) CartPole-v0. Optimal policy gets a reward of 200, while the uniformly random data collection policy obtains 22.
(b) HIV treatment simulator. The data collection policy obtains a reward of 1.5E7.
Figure 2: Episodic scores over length 200 episodes in CartPole-v0 (Barto et al., 1983; Brockman et al., 2016) (left) and the HIV treatment simulator (Ernst et al., 2006) (right). Shaded region represents 1 standard deviation over 10 runs of each method.

In this section we study the empirical properties of OPPOSD, with an eye towards two questions:

  1. Does the state distribution correction help improve the performance of off-policy policy optimization?

  2. Can we identify the best policy from the optimization path using off-policy policy evaluation?

Baseline and implementation details

To answer the first question, we compare OPPOSD with its closest prior work, but without the state distribution correction, that is the Off-PAC algorithm (Degris et al., 2012).

We implement both OPPOSD and Off-PAC using feedforward neural networks for the actor and critic, with ReLU hidden layers. For the state distribution ratio $w$, we also use a neural network with ReLU hidden layers, with a final activation function that guarantees $w(s) > 0$ for any input. To make a fair comparison, we keep the implementation of Off-PAC as close as possible to OPPOSD other than the use of $w$. Concretely, we also equip Off-PAC with the enhancements that we find improve empirical performance, such as warm start of the actor and critic, as well as several critic updates per actor update. We use the same off-policy critic learning algorithm for Off-PAC and OPPOSD. To learn $w$, we use Algorithm 1 (average reward) in Liu et al. (2018a) with an RBF kernel for the CartPole-v0 experiment, and Algorithm 2 (discounted reward) in Liu et al. (2018a) with an RBF kernel for the HIV experiment. We normalize the input to the networks to have 0 mean and 1 standard deviation, and in each mini-batch we normalize the kernel loss of fitting $w$ by the mean of the kernel matrix elements, to minimize the effect of kernel hyper-parameters on the learning rate. Implementation details omitted here are provided in the Appendix.

Simulation domains

We compare the algorithms in two simulation domains. The first domain is the cart pole control problem, where an agent needs to balance a mass attached to a pole in an upright position, by applying one of two sideways movements to a cart on a frictionless track. The state space is continuous and describes the position and velocity of cart and pole. The action space consists of applying a unit force to two directions. The horizon is fixed to 200. If the trajectory ends in less than 200 steps, we pad the episode by continuing to sample actions and repeating the last state. We use a uniformly random policy to collect trajectories as off-policy data, which is a very challenging data set for off-policy policy optimization methods to learn from as this policy does not attain the desired upright configuration for any prolonged period of time. We use neural networks with a 32-unit hidden layer to fit the stationary distribution ratio, actor and critic.

The second domain is an HIV treatment simulation described in Ernst et al. (2006). Here the states are six-dimensional real-valued vectors, which model the response of numbers of cells/virus to a treatment. Each action corresponds to whether or not to apply two types of drug, leading to a total of 4 actions. The transition dynamics are modeled by an ODE system in Ernst et al. (2006). The reward consists of a small negative reward for deploying each type of drug, and a positive reward based on the HIV-specific cytotoxic T-cells which will increase with a proper treatment schedule. To maximize the total reward in this simulator, algorithms need to do structured treatment interruption (STI), which aim to achieve a balance between treatment and the adverse effect of abusing drugs. The horizon of this domain is 200 and discount factor is set by the simulator to . Each trajectory simulates a treatment period for one patient in 1000 days and each step corresponds to a 5-day interval in the ODE system. We represent the state by taking logarithm of state features and divide the reward by to ensure they are in a reasonable range to fit the neural network models. A uniformly random policy does not visit any rewarding states often enough to collect useful data for off-policy learning. To simulate an imperfect but reasonable data collection policy, we first train an on-policy actor critic method to learn a reasonable (but still far from optimal) policy . We then use the data collection policy , where is the uniformly random policy, to collect trajectories. We use neural networks with three 16-unit hidden layers to fit the actor and state distribution ratio, and a neural network with four 32-unit hidden layers for the critic.

Though in both domains our data collection policy is eventually able to cover the whole state-action space, the situation under finite amounts of data is different. In cart pole, since an optimal policy can control the cart to stay in a small region, it is relatively easy for the uniform random policy to cover the states visited by the optimal policy. In the HIV treatment domain, it is unlikely that the logged data will cover the desirable state space.

Figure 3: Off-policy policy evaluation results of saved policies from OPPOSD. The estimated and true values exhibit a high correlation (coefficient = 0.80 and 0.71 in the left and right plots respectively) for most policies. Two panels correspond to repeating the whole procedure using two datasets from the same behavior policy.

Impact of state reweighting on policy optimization

In Figures 1(a) and 1(b), we plot the on-policy evaluation values of the policies produced by OPPOSD and Off-PAC during the actor updates across 10 runs. Each run uses a different data set collected using the behavior policy as well as a different random seed for the policy optimization procedure. In each run we use the same dataset for each method to allow paired comparisons. We evaluate the policy after every 100 actor updates using on-policy Monte-Carlo evaluation over 20 trajectories. The results are averaged over 10 runs and error bars show the standard deviation. Along the X-axis, the plot shows how the policy value changes as we take policy gradient steps.

At a high level, we see that in both domains our algorithm significantly improves upon the behavior policy, and eventually outperforms Off-PAC consistently. Zooming in a bit, we see that for the initial iterates on the left of the plots, the gap between OPPOSD and Off-PAC is small, as the state distribution of the learned policies is likely close enough to that of the behavior policy for the distribution mismatch to not matter significantly. This effect is particularly pronounced in Figure 1(a). However, the gap quickly widens as we move to the right in both figures. In particular, Off-PAC barely improves over the behavior policy in Figure 1(b), while OPPOSD finds a significantly better policy. Overall, we find that these results are an encouraging validation of our intuition about the importance of correcting the state distribution mismatch.

Identifying Best Policy by Off-Policy Evaluation

While OPPOSD consistently outperforms Off-PAC in average performance across 10 runs in both domains, there is still significant variance in both methods across runs. Given this variance, a natural question is whether we can identify the best performing policies, during and across multiple runs of OPPOSD for a single dataset. To answer this question, we checkpoint all the policies produced by OPPOSD after every 1000 actor updates, across 5 runs of our algorithm with the same input dataset generated in the HIV domain. We then evaluate these policies using the off-policy policy evaluation (OPPE) method in Liu et al. (2018a). The evaluation is performed with an additional dataset sampled from the behavior policy. This corresponds to the typical practice of sample splitting between optimization and evaluation.

We show the quality of the OPPE estimates against the true policy values for two different datasets for OPPE sampled from the behavior policy in the two panels of Figure 3. In each plot, the X-axis shows the true values obtained by on-policy Monte-Carlo evaluation and the Y-axis shows the OPPE estimates. We find that the OPPE estimates are generally well correlated with the on-policy values, and picking the policy with the best OPPE estimate results in a true value substantially better than both the best Off-PAC result and the behavior policy. A closer inspection also reveals the importance of this validation step. The red squares correspond to the final iterate of OPPOSD in each of the 5 runs, which has a very high value in some runs but a somewhat worse value in others. Using OPPE to robustly select a good policy adds a layer of additional assurance to our policy optimization procedure.
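The selection step described above can be sketched as follows, assuming each checkpoint already has a scalar OPPE estimate and that the reported coefficient is a Pearson correlation. The function names and all numbers are hypothetical and serve only to illustrate the procedure; they are not results from the experiments.

```python
import numpy as np

def select_by_ope(ope_estimates):
    """Index of the checkpoint with the highest off-policy value estimate."""
    return int(np.argmax(ope_estimates))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two 1-d arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical values for five checkpoints (illustration only).
ope_estimates = np.array([1.2, 2.9, 2.1, 3.4, 0.8])  # estimated on held-out logged data
true_values = np.array([1.0, 3.1, 1.8, 3.0, 1.1])    # on-policy Monte-Carlo values

best = select_by_ope(ope_estimates)
corr = pearson_correlation(ope_estimates, true_values)
```

Selecting on a dataset separate from the one used for optimization mirrors the sample splitting mentioned above and avoids choosing a checkpoint that merely overfits the training data.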

7 Discussion and Conclusion

We presented a new off-policy actor critic algorithm, OPPOSD, based on a recently proposed stationary state distribution ratio estimator. There exist many interesting next steps, including different critic learning methods which may also leverage the state distribution ratio, and exploring alternative methods for policy evaluation or alternative stationary state distribution ratio estimators (Hallak and Mannor, 2017; Gelada and Bellemare, 2019). Another interesting direction is to improve the sample efficiency of online policy gradient algorithms by using our corrected gradient estimates.

There are also many different algorithms which have been built using the previous off-policy policy gradient framework (Off-PAC) and improve Off-PAC in different directions, such as DDPG and ACER. They are orthogonal to our work, and our state distribution correction techniques are composable with these further improvements in the Off-PAC framework. To understand the impact of correcting the stationary distribution, in the experiment section of this work we therefore focus on an ablation comparison with Off-PAC. It would be interesting to combine our work with the additional contributions of DDPG, ACER, etc. to derive improved variants of each of those algorithms.

In parallel with our work, Zhang et al. (2019) have presented a different approach for off-policy policy gradient, motivated by a similar recognition of the bias in the Off-PAC gradient estimator. While similarly motivated, the two works have important differences. On the methodological side, Zhang et al. (2019) start from an off-policy objective function and derive a gradient for it. In contrast, we compute an off-policy estimator for the gradient of the on-policy objective function. The latter leads to a much simpler method, both conceptually and computationally, as we do not need to compute the gradients of the visitation distribution. On the other hand, Zhang et al. (2019) focus on incorporating more general interest functions in the off-policy objective, and use the emphatic weighting machinery for obtaining the gradient of their off-policy objective. In terms of settings, our approach works in the offline setting (though easily extended to online), while they require an online setting in order to compute the gradients of the propensity score function. Finally, we present convergence results quantifying the error in our critic and propensity score computations while Zhang et al. (2019) assume a perfect oracle for both and rely on a truly unbiased gradient estimator for the convergence results.

To conclude, our algorithm fixes the bias in off-policy policy gradient estimates introduced by the behavior policy’s stationary state distribution. We prove that, under certain assumptions, our algorithm is guaranteed to converge. We also show that ignoring the bias due to the mismatch in state distributions can make an off-policy policy gradient algorithm fail even in a simple illustrative example, and that by accounting for this mismatch our approach yields significantly better performance in two simulation domains.

References


  • Abbeel and Schulman (2016) P. Abbeel and J. Schulman. Deep reinforcement learning through policy optimization. NeurIPS Tutorial, 2016. URL
  • Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, pages 834–846, 1983.
  • Bhatnagar et al. (2009) S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.
  • Borkar (2009) V. S. Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
  • Brockman et al. (2016) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
  • Degris et al. (2012) T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
  • Ernst et al. (2005) D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
  • Ernst et al. (2006) D. Ernst, G.-B. Stan, J. Goncalves, and L. Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 667–672. IEEE, 2006.
  • Farajtabar et al. (2018) M. Farajtabar, Y. Chow, and M. Ghavamzadeh. More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, pages 1446–1455, 2018.
  • Gelada and Bellemare (2019) C. Gelada and M. G. Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In AAAI, 2019.
  • Gu et al. (2017a) S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017a.
  • Gu et al. (2017b) S. S. Gu, T. Lillicrap, R. E. Turner, Z. Ghahramani, B. Schölkopf, and S. Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3846–3855, 2017b.
  • Hallak and Mannor (2017) A. Hallak and S. Mannor. Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, pages 1372–1383, 2017.
  • Imani et al. (2018) E. Imani, E. Graves, and M. White. An off-policy policy gradient theorem using emphatic weightings. In Advances in Neural Information Processing Systems, pages 96–106, 2018.
  • Jiang and Li (2016) N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.
  • Lillicrap et al. (2015) T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Liu et al. (2018a) Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5361–5371, 2018a.
  • Liu et al. (2018b) Y. Liu, O. Gottesman, A. Raghu, M. Komorowski, A. A. Faisal, F. Doshi-Velez, and E. Brunskill. Representation balancing mdps for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2649–2658, 2018b.
  • Maei (2011) H. R. Maei. Gradient Temporal-difference Learning Algorithms. PhD thesis, University of Alberta, 2011.
  • Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Precup et al. (2000) D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
  • Schulman et al. (2015) J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015.
  • Silver et al. (2014) D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, 2014.
  • Sutton et al. (2000) R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • Thomas and Brunskill (2016) P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
  • Thomas et al. (2015) P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning, pages 2380–2388, 2015.
  • Wang et al. (2016) Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
  • Watkins and Dayan (1992) C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • Zhang et al. (2019) S. Zhang, W. Boehmer, and S. Whiteson. Generalized off-policy actor-critic. arXiv preprint arXiv:1903.11329, 2019.

Appendix A Proof of Theorem 3

We first state and prove an abstract result. Suppose we have a function which is differentiable, -Lipschitz and -smooth, and attains a finite minimum value . Suppose we have access to a noisy gradient oracle which returns a vector given a query point . We say that the vector is -accurate for parameter if for all , the quantity satisfies


Notice that the expectations above are only with respect to any randomness in the oracle, while holding the query point fixed. Suppose we run the stochastic gradient descent algorithm using the oracle responses, that is we update . While several results for the convergence of stochastic gradient descent to a stationary point of a smooth, non-convex function are well-known, we could not find a result for the biased oracle assumed here and hence we provide a result from first principles. We have the following guarantee on the convergence of the sequence to an approximate stationary point of .

Theorem 4.

Suppose is differentiable and -smooth, and the approximate gradient oracle satisfies the conditions (6) with parameters at iteration . Then stochastic gradient descent with the oracle, with an initial solution and stepsize satisfies after iterations:


Since is -smooth, we have

Here the first equality follows from our update rule while the remaining simply use the definition of along with algebraic manipulations. Now taking expectations of both sides, we obtain

where we have invoked the properties of the oracle to bound the last two terms. Summing over iterations , we obtain

Rearranging terms, and using that , we obtain

Now using the choice and simplifying, we obtain the statement of the theorem. ∎

The theorem tells us that if we pick an iterate uniformly at random from , then it is an approximate stationary point in expectation, up to an accuracy which is determined by the bias and variance of the stochastic gradient oracle.
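As a rough guide to how such a guarantee can be obtained, the following sketch works through one standard argument for stochastic gradient descent with a biased oracle. The notation (f, x_t, g_t, \delta_t, \eta, G, L, \varepsilon_t, \sigma_t), the exact form of the oracle conditions, and the stepsize restriction are assumptions made for this sketch, so the constants need not match those of Theorem 4.

```latex
% Assumed setup (illustrative; not necessarily the exact conditions of the theorem):
%   x_{t+1} = x_t - \eta g_t, \qquad g_t = \nabla f(x_t) + \delta_t,
%   \|\mathbb{E}[\delta_t \mid x_t]\| \le \varepsilon_t, \qquad
%   \mathbb{E}[\|\delta_t\|^2 \mid x_t] \le \sigma_t^2,
%   \|\nabla f(x)\| \le G \text{ for all } x, \qquad 0 < \eta \le 1/(2L).
\begin{align*}
f(x_{t+1})
  &\le f(x_t) + \langle \nabla f(x_t),\, x_{t+1} - x_t \rangle
       + \tfrac{L}{2}\,\|x_{t+1} - x_t\|^2 \\
  &= f(x_t) - \eta\,\|\nabla f(x_t)\|^2 - \eta\,\langle \nabla f(x_t),\, \delta_t \rangle
       + \tfrac{L\eta^2}{2}\,\|\nabla f(x_t) + \delta_t\|^2 \\
  &\le f(x_t) - \eta(1 - L\eta)\,\|\nabla f(x_t)\|^2
       - \eta\,\langle \nabla f(x_t),\, \delta_t \rangle + L\eta^2\,\|\delta_t\|^2 .
\end{align*}
% The last step uses \|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2.
% Taking expectations and applying the oracle conditions:
\begin{align*}
\mathbb{E}[f(x_{t+1})]
  \le \mathbb{E}[f(x_t)] - \eta(1 - L\eta)\,\mathbb{E}\|\nabla f(x_t)\|^2
      + \eta\, G\, \varepsilon_t + L\eta^2 \sigma_t^2 .
\end{align*}
% Summing over t = 1, \dots, T, using f(x_{T+1}) \ge f^* and 1 - L\eta \ge 1/2:
\begin{align*}
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla f(x_t)\|^2
  \le \frac{2\,\bigl(f(x_1) - f^*\bigr)}{\eta\, T}
      + \frac{2G}{T}\sum_{t=1}^{T} \varepsilon_t
      + \frac{2L\eta}{T}\sum_{t=1}^{T} \sigma_t^2 .
\end{align*}
```

With a stepsize proportional to 1/\sqrt{T}, the first and last terms decay at rate 1/\sqrt{T}, while the middle term remains as the average bias of the oracle. This matches the qualitative reading above: the accuracy of the stationary point is limited by the bias and variance of the gradient oracle.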

Given this abstract result, we can now prove Theorem 3 by instantiating the errors in the gradient oracle as a function of our assumptions.

Proof of Theorem 3

We now instantiate the result and assumptions for the case of the off-policy policy gradient method. First, note that the algorithm is stochastic gradient ascent for maximizing the expected return . Thus we can apply Theorem 4 with , so that where is an upper bound on the value of any policy in the MDP. attains a finite minimum value since the expected return has a finite maximum value. We focus on quantifying the bias in terms of errors in the critic and propensity score computations first. We first introduce some additional notation. Suppose and are the true propensity (in terms of state distributions, relative to ) and -value functions for a policy . Let . Suppose we are given estimators and for and respectively. Then our estimated and true off-policy policy gradients can be written as:

Now the bias can be bounded as

How we simplify further depends on the assumptions we make on the errors in and . As a natural assumption, suppose that the relative errors are bounded in MSE, that is and . Then by the Cauchy-Schwarz inequality, we can simplify the above bias term as

where the operations of squaring and square root are applied elementwise to the vector . By Assumption 2 we have , for all , and

Then the bound on the bias further simplifies to

Similarly, for the variance we have

Hence, the RHS of Theorem 4 simplifies to

where and are the error parameters in the propensity scores and critic at the iteration of our algorithm. Since we update these quantities online along with the policy parameters, we expect and to decrease as increases. That is, assuming that satisfies the coverage assumptions with finite upper bounds on the propensities and the policy class is Lipschitz continuous in its parameters, the scheme converges to an approximate stationary point given estimators and with a small average MSE across the iterations under .
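To make the roles of the two estimators concrete, the sketch below computes a sample-average gradient for a tabular softmax policy. It assumes the estimated gradient weights each logged sample's score function by the product of the estimated state distribution ratio, the action importance ratio, and the critic value. The tabular parameterization and all names (`corrected_gradient`, `w_hat`, `q_hat`, `behavior_probs`) are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def corrected_gradient(theta, states, actions, behavior_probs, w_hat, q_hat):
    """Sample average of w(s) * rho(s, a) * Q(s, a) * grad log pi(a | s).

    theta:          (n_states, n_actions) logits of a tabular softmax policy
    states/actions: integer arrays of logged samples
    behavior_probs: probability the behavior policy gave each logged action
    w_hat:          (n_states,) estimated state distribution ratio
    q_hat:          (n_states, n_actions) estimated critic values
    """
    pi = softmax(theta)
    rho = pi[states, actions] / behavior_probs            # action importance ratio
    coeff = w_hat[states] * rho * q_hat[states, actions]  # per-sample weight
    grad = np.zeros_like(theta)
    for s, a, c in zip(states, actions, coeff):
        score = -pi[s]     # gradient of log softmax w.r.t. the logits of state s
        score[a] += 1.0    # equals one_hot(a) - pi(. | s)
        grad[s] += c * score
    return grad / len(states)

# Tiny hypothetical example with 2 states and 2 actions.
theta = np.zeros((2, 2))
states = np.array([0, 0, 1])
actions = np.array([0, 1, 1])
behavior_probs = np.array([0.5, 0.5, 0.5])
w_hat = np.array([1.0, 2.0])
q_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
grad = corrected_gradient(theta, states, actions, behavior_probs, w_hat, q_hat)
```

In this sketch, errors in `w_hat` and `q_hat` enter the gradient multiplicatively through `coeff`, which is why the bias analysis above is stated in terms of the errors of the propensity and critic estimators. Setting `w_hat` to all ones removes the state distribution correction.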

Appendix B Details for Experiments

In this section we give some important details and hyper-parameter settings of our algorithm in the experiments. We use three separate neural networks, one each for the actor, the critic, and the state distribution ratio model w. We use the Adam optimizer for all of them. We also use an entropy regularization for the actor. We warm start the actor by maximizing the log-likelihood of the actor on the collected dataset. For the critic, we use the same critic algorithm as in Algorithm 1, except that there is no importance sampling ratio (as it is on-policy for the warm start). For the warm start of w, we simply fit w for several iterations using the warm start policy found for the actor. Warm start uses the same learning rates as normal training. For the critic and w, we also carry the optimizer state over when we start normal training.

The table below lists the hyper-parameter settings we used in both domains:

Hyper-parameter                            Cart pole   HIV
                                           1.0         0.98
                                           0.          0.
entropy coefficient                        0.01        0.03
learning rate (actor)                      1e-3        5e-6
learning rate (critic)                     1e-3        1e-3
learning rate (w)                          1e-3        3e-4
batch size (actor)                         5000        5000
batch size (critic)                        5000        5000
batch size (w)                             200         200
number of iterations (critic)              10          10
number of iterations (w)                   50          50
weight decay (w)                           1e-5        1e-5
behavior cloning number of iterations      2000        2000
warm start number of iterations (critic)   500         2500
warm start number of iterations (w)        500         2500

We also follow the details in Algorithm 1 and Algorithm 2 of Liu et al. (2018a) to learn w. We scale the inputs to w so that the whole off-policy dataset has zero mean and a standard deviation of 1 along each dimension of the state space. We use the RBF kernel to compute the loss function for w. For the cart pole simulator, the kernel bandwidth is set to the median of the state distances. If computing this median state distance over the whole off-policy dataset is computationally too expensive, it can be approximated using a mini-batch. In the HIV domain the bandwidth is set to 1. When we compute the loss of w, we need to sample two mini-batches independently to get an unbiased estimate of the quadratic loss. The loss for each pair of mini-batches is normalized by the sum of the kernel matrix elements computed from them.
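The preprocessing and kernel computations described above can be sketched as follows. The kernel convention exp(-d^2 / (2 h^2)), the helper names, the synthetic data, and the placeholder residual vectors are assumptions made for illustration; the actual loss for the ratio model follows Liu et al. (2018a) and is not reproduced here.

```python
import numpy as np

def standardize(states):
    """Scale each state dimension to zero mean and unit standard deviation."""
    mean = states.mean(axis=0)
    std = states.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # avoid dividing by zero for constant features
    return (states - mean) / std

def pairwise_distances(x, y):
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def median_bandwidth(batch):
    """Median Euclidean distance over distinct pairs in a (mini-)batch."""
    d = pairwise_distances(batch, batch)
    return float(np.median(d[np.triu_indices(len(batch), k=1)]))

def rbf_kernel(x, y, bandwidth):
    d = pairwise_distances(x, y)
    return np.exp(-d ** 2 / (2.0 * bandwidth ** 2))

def normalized_quadratic_form(res_a, res_b, kernel):
    """Quadratic form over two mini-batches, divided by the kernel matrix sum."""
    return float(res_a @ kernel @ res_b / kernel.sum())

rng = np.random.default_rng(0)
# Synthetic stand-in for logged states (illustration only).
states = standardize(rng.normal(loc=3.0, scale=5.0, size=(256, 4)))
batch_a, batch_b = states[:64], states[64:128]  # two independent mini-batches
h = median_bandwidth(states[:128])              # mini-batch approximation of the median
K = rbf_kernel(batch_a, batch_b, h)
# Placeholder residuals; in practice they would come from the ratio model.
value = normalized_quadratic_form(np.ones(64), np.ones(64), K)
```

Using two independently drawn mini-batches keeps the product of residuals an unbiased estimate of the quadratic term, and dividing by the kernel matrix sum keeps the scale of the loss comparable across bandwidths.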

B.1 Choice of Algorithm with Discounted Reward

In discounted reward settings, the state distribution is also defined with respect to the discount factor , and Liu et al. (2018a) introduce an algorithm to learn the state distribution ratio in this setting. However, we notice that in on-policy policy learning, though the policy gradient theorem (Sutton et al., 2000) requires samples from the stationary state distribution defined using the discount factor, it is common to directly use the collected samples to compute the policy gradient without re-sampling/re-weighting ’s according to the discounted stationary distribution. This might be driven by sample efficiency concerns, as samples at later time-steps in the discounted stationary distribution will receive exponentially small probability, meaning they are not leveraged as effectively by the algorithm. Given this, we compare three different variants of our algorithm in the HIV experiment with discounted reward. The first variant (OPPOSD average) uses the algorithm for the average reward setting, but evaluates its discounted reward. The second learns the state distribution ratio in the discounted case (Algorithm 2 in Liu et al., 2018a), but still samples from the undiscounted distribution to compute the gradient (OPPOSD disc ). The third learns the state distribution ratio in the discounted case and also re-samples the samples according to (OPPOSD). In the main body of the paper, we select the third one as it follows most naturally from the problem definition and the policy gradient theorem. Results of these three methods are shown in Figure 4; they do not differ significantly in this experiment.
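The effect of re-sampling by the discounted distribution can be illustrated with a small sketch. It assumes that the probability of selecting time step t within a trajectory is proportional to the discount factor raised to the power t; the function names are hypothetical and the value 0.98 is used purely as an example.

```python
import numpy as np

def discounted_time_probs(horizon, gamma):
    """Probability of picking time step t, proportional to gamma ** t."""
    weights = gamma ** np.arange(horizon)
    return weights / weights.sum()

def resample_time_steps(horizon, gamma, n_samples, rng):
    """Draw time-step indices according to the discounted weights."""
    probs = discounted_time_probs(horizon, gamma)
    return rng.choice(horizon, size=n_samples, p=probs)

# Illustrative values: horizon 200 and an example discount factor of 0.98.
probs = discounted_time_probs(horizon=200, gamma=0.98)
rng = np.random.default_rng(0)
steps = resample_time_steps(200, 0.98, n_samples=5000, rng=rng)
```

With these example values, the last time step is selected with roughly 2% of the probability of the first one, which illustrates the sample efficiency concern raised above: late samples are rarely used once the discounted weighting is applied.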

Figure 4: Episodic scores over length 200 episodes in HIV treatment simulator.