Representation Balancing MDPs
for Off-Policy Policy Evaluation
Abstract
We study the problem of off-policy policy evaluation (OPPE) in RL. In contrast to prior work, we consider how to estimate both the individual policy value and the average policy value accurately. We draw inspiration from recent work in causal reasoning, and propose a new finite-sample generalization error bound for value estimates from MDP models. Using this upper bound as an objective, we develop a learning algorithm for an MDP model with a balanced representation, and show that our approach can yield substantially lower MSE in a common synthetic domain and on a challenging real-world sepsis management problem.
Yao Liu (Stanford University, yaoliu@stanford.edu) · Omer Gottesman (Harvard University, gottesman@fas.harvard.edu) · Aniruddh Raghu (Cambridge University, aniruddhraghu@gmail.com) · Matthieu Komorowski (Imperial College London, matthieu.komorowski@gmail.com) · Aldo Faisal (Imperial College London, a.faisal@imperial.ac.uk) · Finale Doshi-Velez (Harvard University, finale@seas.harvard.edu) · Emma Brunskill (Stanford University, ebrun@cs.stanford.edu)
Preprint. Work in progress.
1 Introduction
In reinforcement learning, off-policy (batch) policy evaluation is the task of estimating the performance of some evaluation policy given data gathered under a different behavior policy. Off-policy policy evaluation (OPPE) is essential when deploying a new policy might be costly or risky, such as in consumer marketing, healthcare, and education. Technically, off-policy evaluation relates to other fields that study counterfactual reasoning, including causal inference, statistics, and economics.
Off-policy batch policy evaluation is challenging because the distribution of the data under the behavior policy will in general differ from the distribution under the desired evaluation policy. This difference comes from two sources. First, at a given state, the behavior policy may select a different action than the one preferred by the evaluation policy: for example, a clinician may choose to amputate a limb, whereas we may be interested in what might have happened had the clinician not done so. We never observe the counterfactual outcome. Second, the distribution of future states, not just the immediate outcomes, is also determined by the behavior policy. This challenge is unique to sequential decision processes and is not covered by most causal reasoning work: for example, the sequence of a patient's health states observed after amputating the patient's limb is likely to differ significantly from the sequence observed had the limb not been amputated.
Approaches for OPPE must make a choice about whether and how to address this data distribution mismatch. Importance sampling (IS) based approaches [Precup et al., 2000; Thomas et al., 2015; Guo et al., 2017; Dudík et al., 2011; Jiang and Li, 2015; Thomas and Brunskill, 2016] are typically unbiased and strongly consistent but, despite recent progress, tend to have high variance, especially if the evaluation policy is deterministic: evaluating a deterministic policy requires finding sequences in the data whose actions exactly match the evaluation policy. Yet in most real-world applications deterministic evaluation policies are the more common case; a policy typically either amputates or does not, rather than flipping a biased coin to decide whether to amputate. IS approaches also often rely on explicit knowledge of the behavior policy, which may not be available in situations such as medicine where the behavior results from human decisions. In contrast, some model-based approaches ignore the data distribution mismatch, for example by fitting a maximum-likelihood model of the rewards and dynamics to the behavior data and then using that model to evaluate the evaluation policy. These methods may not converge to the true value of the evaluation policy, even in the limit of infinite data [Mandel et al., 2014]. Nevertheless, such model-based approaches often achieve better empirical performance than IS-based estimators [Jiang and Li, 2015].
In this work, we address the question of building model-based estimators for OPPE that both have theoretical guarantees and yield better empirical performance than model-based approaches that ignore the data distribution mismatch. Typically we evaluate the quality of an OPPE estimate $\hat V^{\pi}(s)$, where $s$ is an initial state, by its mean squared error (MSE). Most previous research (e.g., [Jiang and Li, 2015; Thomas and Brunskill, 2016]) evaluates methods using the MSE of the average policy value (APV), $\big(\mathbb{E}_{s_0}[\hat V^{\pi}(s_0)] - \mathbb{E}_{s_0}[V^{\pi}(s_0)]\big)^2$, rather than the MSE of individual policy values (IPV), $\mathbb{E}_{s_0}\big[(\hat V^{\pi}(s_0) - V^{\pi}(s_0))^2\big]$. This difference is crucial for applications such as personalized healthcare, since ultimately we may want to assess the performance of a policy for a specific individual (patient) state.
Instead, in this paper we develop an upper bound on the MSE of individual policy value estimates. By Jensen's inequality, this bound is automatically an upper bound on the MSE of the average policy value as well. Our work is inspired by recent advances [Shalit et al., 2016; Johnson et al., 2017; Johansson et al., 2016; Johansson et al., 2018] in estimating conditional average treatment effects (CATE), also known as heterogeneous treatment effects (HTE), in the contextual bandit setting with a single (typically binary) action choice. CATE research aims for precise estimates of the difference in outcomes between the treatment and control interventions for an individual (state).
Recent work on CATE [Johansson et al., 2016; Shalit et al., 2016] (Shalit et al. use the term individual treatment effect (ITE) for a criterion that is actually defined as CATE in most of the causal inference literature; we discuss the confusion between the two terms in Appendix B) has obtained very promising results by learning a model to predict individual outcomes using a model-fitting loss function that explicitly accounts for the data distribution shift between the treatment and control policies. We build on this work to introduce a new bound on the MSE of individual policy values, and a new loss function for fitting a model-based OPPE estimator. In contrast to most other OPPE theoretical analyses (e.g., [Jiang and Li, 2015; Dudík et al., 2011; Thomas and Brunskill, 2016]), we provide a finite-sample generalization error bound instead of asymptotic consistency. In contrast to previous model-value generalization bounds such as the Simulation Lemma [Kearns and Singh, 2002], our bound accounts for the underlying distribution shift that arises when the data used to estimate the value of an evaluation policy were collected under an alternate policy.
We use this bound to derive a loss function for fitting a model for OPPE with deterministic evaluation policies. Conceptually, this process gives us a model that prioritizes fitting the trajectories in the batch data that match the evaluation policy. Our current estimation procedure works for deterministic evaluation policies, which cover a wide range of real-world scenarios that are particularly hard for previous methods. Like recently proposed IS-based estimators [Thomas and Brunskill, 2016; Jiang and Li, 2015; Farajtabar et al., 2018], and unlike the MLE model-based estimator that ignores the distribution shift [Mandel et al., 2014], we prove that our model-based estimator is asymptotically consistent as long as the true MDP model is realizable within our chosen model class; we use neural models to give our model class high expressivity.
We demonstrate that the resulting models can yield substantially lower mean squared error than prior model-based and IS-based estimators on a classic benchmark RL task (even when the IS-based estimators are given access to the true behavior policy). We also demonstrate that our approach can yield improved results on a challenging sepsis treatment management dataset extracted from the MIMIC-III database [Johnson et al., 2017].
2 Related Work
Most prior work on OPPE in reinforcement learning falls into one of three approaches. The first, importance sampling (IS), reweights the trajectories to account for the data distribution shift. Under mild assumptions, importance sampling estimators are guaranteed to be both unbiased and strongly consistent, and were first introduced to reinforcement learning OPPE by Precup et al. [2000]. Despite recent progress (e.g., [Thomas et al., 2015; Guo et al., 2017]), IS-only estimators still often yield very high variance estimates, particularly when the decision horizon is long and/or the evaluation policy is deterministic. IS estimators also typically produce extremely noisy estimates of the policy values of individual states. A second common approach is to estimate a dynamics and reward model, which can substantially reduce variance but can be biased and inconsistent (as noted by Mandel et al. [2014]). The third approach, doubly robust estimation, originates from the statistics community [Robins et al., 1994]. Recently proposed doubly robust estimators for OPPE from the machine learning and reinforcement learning communities [Dudík et al., 2011; Jiang and Li, 2015; Thomas and Brunskill, 2016] have sometimes yielded orders-of-magnitude tighter estimates. However, most prior work that leverages an approximate model has largely ignored the choice of how to select and fit the model parameters. Recently, Farajtabar et al. [2018] introduced more robust doubly robust (MRDR), which fits a Q-function, used as the model-value part of the doubly robust estimator, to a weighted return chosen to minimize the variance of the doubly robust estimator. In contrast, our work learns a dynamics and reward model using a novel loss function, in order to obtain accurate individual policy value estimates.
While our model can be combined with doubly robust estimators, our experimental results show that directly using the model estimator can yield substantial benefits over estimating a Q-function for use in doubly robust estimation.
OPPE in contextual bandits and RL also has strong similarities to the treatment effect estimation problem common in causal inference and statistics. Recently, machine learning models such as Gaussian processes [Alaa and van der Schaar, 2017], random forests [Wager and Athey, 2017], and GANs [Yoon et al., 2018] have been used to estimate heterogeneous treatment effects (HTE) in non-sequential settings. Schulam and Saria [2017] study Gaussian process models for treatment effect estimation in continuous time; their setting differs from MDPs in having no sequential states. Most theoretical analysis of treatment effects focuses on asymptotic consistency rather than generalization error.
Our work is inspired by recent research that learns complex outcome models (reward models, in RL terms) to estimate HTE using new loss functions that account for covariate shift [Johansson et al., 2016; Shalit et al., 2016; Atan et al., 2018; Johansson et al., 2018]. In contrast to this prior work, we consider the sequential state-action setting. In particular, Shalit et al. [2016] provided an algorithm with a general model class and a corresponding generalization bound; we extend this idea from the binary-treatment setting to sequential and multiple-action settings.
3 Preliminaries: Notation and Setting
We consider undiscounted finite-horizon MDPs with horizon $H$, bounded state space $\mathcal{S}$, and finite action space $\mathcal{A}$. Let $p_0(s)$ be the initial state distribution and $T(s'|s,a)$ the transition probability. Given a state-action pair, the expected reward is $r(s,a)$. Given $n$ trajectories collected from a stochastic behavior policy $\mu$, our goal is to evaluate the value of an evaluation policy $\pi$, which we assume to be deterministic. We will learn a model $\widehat{M} = (\widehat{r}, \widehat{T})$ of both the reward and the transition dynamics, based on a learned representation. The representation function $\phi: \mathcal{S} \to \mathcal{Z}$ is a reversible and twice-differentiable function, where $\mathcal{Z}$ is the representation space, and $\phi^{-1}$ is the inverse representation such that $\phi^{-1}(\phi(s)) = s$. The specific form of our MDP model is $\widehat{r}(s,a) = \widehat{r}_z(\phi(s), a)$ and $\widehat{T}(\cdot|s,a) = \widehat{T}_z(\cdot \,|\, \phi(s), a)$, where $\widehat{r}_z$ and $\widehat{T}_z$ are functions over the representation space $\mathcal{Z}$. For simplicity, we write $\widehat{M}$ for the pair $(\widehat{r}_z, \widehat{T}_z)$ in what follows.
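As a concrete, purely illustrative sketch of this model form, the following uses a tanh-linear map for the representation $\phi$ and per-action linear heads for the reward and next-state predictions. The layer shapes, the nonlinearity, and the linear heads are assumptions for illustration, not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, REP_DIM, N_ACTIONS = 4, 8, 2  # illustrative sizes

W_phi = rng.normal(size=(STATE_DIM, REP_DIM))           # representation weights
W_r = rng.normal(size=(N_ACTIONS, REP_DIM))             # per-action reward heads
W_T = rng.normal(size=(N_ACTIONS, REP_DIM, STATE_DIM))  # per-action transition heads

def phi(s):
    """Representation z = phi(s); injective when W_phi has full column rank."""
    return np.tanh(s @ W_phi)

def model(s, a):
    """Predict (reward, next-state mean) from the shared representation."""
    z = phi(s)
    r_hat = z @ W_r[a]         # scalar reward prediction r_z(phi(s), a)
    s_next_hat = z @ W_T[a]    # next-state mean prediction T_z(phi(s), a)
    return r_hat, s_next_hat

s = rng.normal(size=STATE_DIM)
r_hat, s_next_hat = model(s, a=1)
```

Sharing $\phi$ across all actions is what later allows the IPM penalty to compare representations of states reached under factual and counterfactual action sequences.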
Let $\tau = (s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1})$ be a trajectory of states and actions, sampled from the joint distribution induced by the MDP $M$ and a policy $\pi$: $p^{\pi}(\tau) = p_0(s_0) \prod_{t=0}^{H-1} \pi(a_t|s_t)\, T(s_{t+1}|s_t, a_t)$. Given the joint distribution, we denote the associated marginal and conditional distributions as $p^{\pi}_t(s_t)$, $p^{\pi}_t(s_t, a_t)$, etc. We similarly have the joint, marginal, and conditional distributions over the representation space $\mathcal{Z}$ induced by $\phi$. We focus on the undiscounted finite-horizon case, using $V^{\pi}_t(s)$ to denote the $(H{-}t)$-step value function of policy $\pi$.
4 Generalization Error Bound for the MDP-Based OPPE Estimator
Our goal is to learn an MDP model $\widehat{M}$ that directly minimizes a good upper bound on the MSE of the individual evaluation policy values: $\mathbb{E}_{s_0 \sim p_0}\big[(V^{\pi}_{\widehat M}(s_0) - V^{\pi}_{M}(s_0))^2\big]$. This model can provide value estimates of the policy and can also be used as a component of doubly robust methods.
In the on-policy case, the Simulation Lemma ([Kearns and Singh, 2002], repeated for completeness as Lemma 1) shows that the MSE of a policy value estimate can be upper bounded by a function of the reward and transition prediction losses. Before we state this result, we first define some useful notation.
Definition 1.
The squared error loss functions of the value function, reward, and transition are:

(1) $\bar\ell_V(s) = \big(V^{\pi}_{\widehat M}(s) - V^{\pi}_{M}(s)\big)^2, \qquad \bar\ell_r(s,a) = \big(\widehat r(s,a) - r(s,a)\big)^2, \qquad \bar\ell_T(s,a) = \Big(\int_{\mathcal S} \big|\widehat T(s'|s,a) - T(s'|s,a)\big| \, ds'\Big)^2$
Then the Simulation Lemma ensures that

(2) $\mathbb{E}_{s_0 \sim p_0}\big[\bar\ell_V(s_0)\big] \;\le\; 2H \sum_{t=0}^{H-1} \mathbb{E}_{(s_t, a_t) \sim p^{\pi}_t}\Big[\bar\ell_r(s_t, a_t) + V_{\max}^2\, \bar\ell_T(s_t, a_t)\Big]$

where $V_{\max}$ is an upper bound on the value function.
The right-hand side can be used as an objective for fitting a model for policy evaluation. In the off-policy case, our data come from a different policy $\mu$, and one can obtain an unbiased estimate of the RHS of Equation 2 by importance sampling. However, this yields an objective with high variance, especially for a long-horizon MDP or a deterministic evaluation policy, due to the product of IS weights. An alternative is to learn an MDP model by directly optimizing the prediction loss over the observational data, ignoring the covariate shift. By the Simulation Lemma, this minimizes an upper bound on the MSE of the behavior policy value, but the resulting model may not be a good one for estimating the evaluation policy value. In this paper we propose a new upper bound on the MSE of the individual evaluation policy values, inspired by recent work in treatment effect estimation, and use it as a loss function for fitting models.
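A toy numerical illustration (not from the paper) of why importance sampling the on-policy objective is high-variance: the per-step weights $\pi(a|s)/\mu(a|s)$ multiply across the horizon. Here we assume the behavior policy picks the deterministic evaluation policy's action with probability `p_match` at every step, so a trajectory's weight is $(1/p_{\text{match}})^H$ if all $H$ actions match and $0$ otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
p_match, n_traj = 0.7, 200_000

def empirical_weight_variance(H):
    """Empirical variance of the product of per-step IS weights for horizon H."""
    matches = rng.random((n_traj, H)) < p_match      # did each step's action match?
    w = np.where(matches.all(axis=1), (1.0 / p_match) ** H, 0.0)
    return w.var()

# The true variance is p_match**(-H) - 1, growing exponentially in H.
variances = [empirical_weight_variance(H) for H in (5, 10, 20)]
```

For a 200-step Cart Pole trajectory (as in Section 6.1), almost no sampled trajectory matches the evaluation policy at every step, which is exactly the regime where an IS-weighted objective degenerates.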
Before proceeding we first state our assumptions, which are common in most OPPE algorithms:

1. Overlap: the support of the behavior policy covers that of the evaluation policy: for any state $s$ and action $a$, $\pi(a|s) > 0$ only if $\mu(a|s) > 0$.

2. Strong ignorability: there are no hidden confounders that influence the choice of actions other than the current observed state.
Denote a factual action sequence to be an action sequence that matches the evaluation policy, $a_{0:t} = \pi(s_{0:t})$, and let a counterfactual action sequence be an action sequence with at least one action that does not match $\pi$. Let $p^{\pi}(\tau)$ be the distribution over trajectories under the MDP $M$ and policy $\pi$. We define the $t$-step value error with respect to the state distribution given the factual action sequence.
Definition 2.
The $t$-step value error is:

$\bar\epsilon_V(t) \;=\; \mathbb{E}_{s_t \sim p^{\mu}(s_t \,\mid\, a_{0:t-1}\ \text{factual})}\Big[\big(V^{\pi}_{\widehat M, t}(s_t) - V^{\pi}_{M, t}(s_t)\big)^2\Big]$
We use the idea of bounding the distance between representations given factual and counterfactual action sequences to correct for the distribution mismatch. Here the distance between representation distributions is formalized by the Integral Probability Metric (IPM).
Definition 3.
Let $p$ and $q$ be two distributions over a space $\mathcal{Z}$, and let $\mathcal{F}$ be a family of real-valued functions defined over the same space. The integral probability metric is:

$\mathrm{IPM}_{\mathcal F}(p, q) \;=\; \sup_{f \in \mathcal F} \Big| \int_{\mathcal Z} f \, (dp - dq) \Big|$
Important instances of the IPM include the Wasserstein metric, where $\mathcal F$ is the class of 1-Lipschitz functions, and the Maximum Mean Discrepancy (MMD), where $\mathcal F$ is the unit ball of an RKHS.
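For the MMD instance, a standard biased (V-statistic) empirical estimator can be written directly from the kernel; the RBF kernel and its bandwidth below are illustrative choices, not the kernel specified by the paper.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(X, Y, bandwidth=1.0):
    """Biased empirical MMD^2: squared RKHS distance between mean embeddings."""
    kxx = rbf_kernel(X, X, bandwidth).mean()
    kyy = rbf_kernel(Y, Y, bandwidth).mean()
    kxy = rbf_kernel(X, Y, bandwidth).mean()
    return kxx + kyy - 2 * kxy

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
diff = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
# Samples from shifted distributions yield a much larger MMD^2 than two
# samples from the same distribution.
```

This biased estimator is always nonnegative, since it equals the squared norm of the difference of empirical mean embeddings.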
Let $p^{F}_{\phi,t}$ and $p^{CF}_{\phi,t}$ denote the distributions over representations $\phi(s_t)$ conditioned on factual and counterfactual action sequences, respectively, where $F$ and $CF$ stand for factual and counterfactual. We first give an upper bound on the MSE in terms of an expected loss term, and then develop a finite-sample bound that can be used as a learning objective.
Theorem 1.
For any MDP $M$, approximate MDP model $\widehat M$, behavior policy $\mu$, and deterministic evaluation policy $\pi$, let $B_\phi$ and $\mathcal F$ be a real number and function family that satisfy the condition in Lemma 4. Then:

(3) $\mathbb{E}_{s_0 \sim p_0}\big[\bar\ell_V(s_0)\big] \;\le\; 2H \sum_{t=0}^{H-1} \Big( \mathbb{E}_{(s_t,a_t) \sim p^{\mu}_t}\Big[ w_t \big(\bar\ell_r(s_t,a_t) + V_{\max}^2\, \bar\ell_T(s_t,a_t)\big) \Big] \;+\; B_\phi\, \mathrm{IPM}_{\mathcal F}\big(p^{F}_{\phi,t},\, p^{CF}_{\phi,t}\big) \Big)$

where $w_t$ is the ratio between the marginal probabilities of the factual action sequence $a_{0:t}$ under $\pi$ and under $\mu$.
(Proof Sketch) The key idea is to use Equation 20 in Lemma 1 to view each step as a contextual bandit problem and bound $\bar\epsilon_V(t)$ recursively. We decompose the value function error into a one-step reward loss, a transition loss, and a next-step value loss, with respect to the on-policy distribution. Treating each step as a contextual bandit problem, we build on the method of Shalit et al. [2016] for binary-action bandits to bound the distribution mismatch by a representation distance penalty term; additional care is required in the sequential setting, since the next states are also influenced by the policy. By adjusting the distribution for the next-step value loss, we reduce it to $\bar\epsilon_V(t+1)$, allowing us to repeat this process recursively for $H$ steps. ∎
This theorem bounds the MSE of the individual evaluation policy values by a loss on the distribution of the behavior policy, at the cost of an additional representation distribution metric. The IPM term measures how different the state representations are conditional on factual versus counterfactual action histories. Intuitively, a balanced representation generalizes better from the observational data distribution to the data distribution under the evaluation policy, but we must also account for the predictive ability of the representation on the observational data distribution; the bound quantifies these two effects on the MSE through the IPM term and the loss terms. The reweighted expected loss terms over the observational data distribution are weighted by the marginal action probability ratio rather than the conditional action probability ratio used in importance sampling. The marginal probability ratio has lower variance than the importance sampling weights in practice, especially when the sample size is limited.
One natural approach might be to use the right-hand side of Equation 3 as a loss, and directly optimize a representation and model that minimize this upper bound on the mean squared error of the individual value estimates. Unfortunately, doing so suffers from two important issues. (1) The subset of the data that matches the evaluation policy can be very sparse for a long horizon $H$, and though the bound reweights the data, fitting a model to it can be challenging given the limited data size. (2) This approach ignores all the other data that do not match the evaluation policy. If we are also learning a representation of the domain in order to scale to very large problems, we suspect that we may benefit from framing the problem as one of transfer or multi-task learning.
Viewing off-policy policy evaluation as a transfer learning task, the source task is evaluating the behavior policy, for which we have on-policy data, and the target task is evaluating the evaluation policy, for which we have only high-variance reweighted data from importance sampling. This resembles transfer learning settings in which we have few, potentially noisy, data points for the target task. We therefore co-learn the source and target tasks at the same time, as a form of regularization given limited data. More precisely, we now bound the OPPE error by an upper bound on the sum of two terms:
(4) $\mathbb{E}_{s_0 \sim p_0}\Big[\big(V^{\mu}_{\widehat M}(s_0) - V^{\mu}_{M}(s_0)\big)^2\Big] \;+\; \mathbb{E}_{s_0 \sim p_0}\Big[\big(V^{\pi}_{\widehat M}(s_0) - V^{\pi}_{M}(s_0)\big)^2\Big]$
where we bound the second term using Theorem 1. The resulting upper bound addresses the issues with using either term alone as an objective: compared with an IS estimate of the evaluation-policy loss, the "marginal" action probability ratio has lower variance, and the representation distribution distance term regularizes the representation layer so that the learned representation does not vary significantly between the state distribution under the evaluation policy and the state distribution under the behavior policy. This reduces the concern that including the behavior-policy loss in the objective will drive the model toward evaluating the behavior policy rather than the evaluation policy.
Our work is also inspired by treatment effect estimation in the causal inference literature, where one estimates the difference between the treated and control groups. An analogue in RL is estimating the difference between the evaluation policy value and the behavior policy value by minimizing the MSE of the policy difference estimate. The objective above is (up to a constant factor) an upper bound on the MSE of the policy difference estimator, since

$\mathbb{E}_{s_0}\Big[\big((V^{\pi}_{\widehat M}(s_0) - V^{\mu}_{\widehat M}(s_0)) - (V^{\pi}_{M}(s_0) - V^{\mu}_{M}(s_0))\big)^2\Big] \;\le\; 2\,\mathbb{E}_{s_0}\Big[\big(V^{\pi}_{\widehat M}(s_0) - V^{\pi}_{M}(s_0)\big)^2\Big] + 2\,\mathbb{E}_{s_0}\Big[\big(V^{\mu}_{\widehat M}(s_0) - V^{\mu}_{M}(s_0)\big)^2\Big]$
We now bound Equation 4 further in terms of finite-sample quantities. For the finite-sample generalization bound, we first introduce a minor variant of the loss functions, defined with respect to the sample set.
Definition 4.
Let $r_j$ and $s'_j$ be observations of the reward and next state for the state-action pair $(s_j, a_j)$. Define the empirical loss functions as:

(5) $\ell_r(s_j, a_j) = \big(\widehat r(s_j, a_j) - r_j\big)^2$

(6) $\ell_T(s_j, a_j) = -\log \widehat T(s'_j \mid s_j, a_j)$
Definition 5.
Define the empirical risks over the behavior distribution and the reweighted distribution as:

(7) $\widehat R_{\mu}(\widehat M) = \frac{1}{nH} \sum_{i=1}^{n} \sum_{t=0}^{H-1} \big[\ell_r(s^i_t, a^i_t) + \ell_T(s^i_t, a^i_t)\big]$

(8) $\widehat R_{\pi}(\widehat M) = \frac{1}{nH} \sum_{i=1}^{n} \sum_{t=0}^{H-1} \widehat w^i_t \big[\ell_r(s^i_t, a^i_t) + \ell_T(s^i_t, a^i_t)\big]$

where $n$ is the dataset size, $s^i_t$ is the state at step $t$ of the $i$-th trajectory, and $\widehat w^i_t = \mathbf{1}\{a^i_{0:t} = \pi(s^i_{0:t})\} / \widehat p_t$, with $\widehat p_t$ the empirical fraction of trajectories whose first $t+1$ actions match $\pi$.
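The reweighting can be sketched as follows, under the assumption (consistent with the text) that the weight at step $t$ is the indicator that the first $t{+}1$ logged actions match the deterministic evaluation policy, normalized by the empirical fraction of trajectories matching up to step $t$:

```python
import numpy as np

def marginal_weights(actions, eval_actions):
    """actions: (n, H) logged behavior actions; eval_actions: (n, H) pi_e's choices.

    Returns (n, H) weights: prefix-match indicator divided by the empirical
    marginal probability of that prefix matching pi_e.
    """
    match = np.cumprod(actions == eval_actions, axis=1)  # 1 while prefix matches
    p_hat = match.mean(axis=0)                           # empirical match probability
    return np.where(p_hat > 0, match / np.maximum(p_hat, 1e-12), 0.0)

rng = np.random.default_rng(0)
n, H = 1000, 5
actions = rng.integers(0, 2, (n, H))
eval_actions = np.zeros((n, H), dtype=int)   # toy pi_e: always action 0
w = marginal_weights(actions, eval_actions)
```

By construction each column of weights averages to one (whenever any trajectory matches), so the reweighted risk targets the factual distribution without per-step conditional IS ratios.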
Theorem 2.
Suppose $\mathcal M$ is a model class of MDP models based on a representation $\phi$. For $n$ trajectories sampled from $\mu$, let $d$ be the pseudo-dimension of the model loss function class. Suppose $\mathcal H_k$ is the reproducing kernel Hilbert space induced by a kernel $k$, and $\mathcal F$ is the unit ball in it. Assume there exists a constant $B_\phi$ such that the per-step losses, scaled by $1/B_\phi$, lie in $\mathcal F$. With probability $1-\delta$, for any $\widehat M \in \mathcal M$:

(9) $\mathbb{E}_{s_0 \sim p_0}\big[\bar\ell_V(s_0)\big] \;\le\; 2H \Big( \widehat R_{\mu}(\widehat M) + \widehat R_{\pi}(\widehat M) + B_\phi \sum_{t=0}^{H-1} \widehat{\mathrm{IPM}}_{\mathcal F}\big(\widehat p^{F}_{\phi,t},\, \widehat p^{CF}_{\phi,t}\big) + \mathcal C^{\mathcal F}_{n^F_t,\delta} + \mathcal C^{\mathcal M}_{n_t,\delta} \Big)$

Here $n_t$ and $n^F_t$ are the numbers of samples used to estimate the empirical risks and the empirical IPM at step $t$, respectively; $\mathcal C^{\mathcal F}_{n^F_t,\delta}$ is a complexity term that depends on the kernel $k$, and $\mathcal C^{\mathcal M}_{n_t,\delta}$ is a complexity term that depends on the pseudo-dimension $d$.
The first term is the empirical loss over the observational data distribution. The second term is a reweighted empirical loss, an empirical version of the first term in Theorem 1; as noted previously, its weights have lower variance than importance sampling ratios in practice, especially when the sample size is limited. Our bound is based on the empirical estimate of the marginal probability, so we are not required to know the behavior policy. This independence from the behavior policy is a significant advantage over IS methods, which are very susceptible to errors in its estimation, as we discuss in Appendix A; in practice, the marginal probability is easier to estimate than $\mu(a|s)$ when $\mu$ is unknown. The third term is an empirical estimate of the IPM described in Theorem 1. We use norm-1 RKHS functions and the MMD distance in this theorem and in our algorithm; similar but weaker results hold for the Wasserstein and total variation distances [Sriperumbudur et al., 2009]. The term $\mathcal C^{\mathcal F}_{n^F_t,\delta}$ measures how complex $\mathcal F$ is, and is obtained from concentration results for empirical IPM estimators [Sriperumbudur et al., 2009]. The constant $\mathcal C^{\mathcal M}_{n_t,\delta}$ measures how complex the model class is, and is derived from standard learning theory results [Cortes et al., 2010].
5 Algorithm for Representation Balancing MDPs
Based on our generalization bound above, we propose an algorithm to learn an MDP model for OPPE, minimizing the following objective function:
(10) $\min_{\widehat M,\, \phi}\;\; \widehat R_{\mu}(\widehat M) + \widehat R_{\pi}(\widehat M) + \alpha \sum_{t=0}^{H-1} \widehat{\mathrm{IPM}}_{\mathcal F}\big(\widehat p^{F}_{\phi,t},\, \widehat p^{CF}_{\phi,t}\big) + \lambda\, \mathcal R(\widehat M)$
This objective is based on Equation 9 in Theorem 2: we minimize the terms of the upper bound that depend on the model $\widehat M$. Since $B_\phi$ depends on the loss function, it cannot be known in practice; we therefore replace it with a tunable factor $\alpha$ in our algorithm. $\mathcal R(\widehat M)$ is a bounded regularization term of the model, corresponding to the model class complexity term in Equation 9. This objective function matches our intuition: use lower-variance weights for the reweighting component, and use an IPM over the representation to avoid fitting only the behavior data distribution.
In this work, $\phi$, $\widehat r_z$, and $\widehat T_z$ are parameterized by neural networks, due to their strong representation learning ability. We use the empirical IPM estimator of Sriperumbudur et al. [2012]. All terms in the objective function are differentiable, allowing us to train them jointly by minimizing the objective with a gradient-based optimization algorithm.
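Putting the pieces together, the objective in Equation 10 can be assembled as below. The per-sample losses, weights, and linear-kernel MMD are stand-ins (assumed for illustration) for the quantities produced by the learned networks; in the actual algorithm every term is a differentiable function of the network parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 4
Z = rng.normal(size=(n, k))         # stand-in representations phi(s) of a batch
per_sample_loss = rng.random(n)     # stand-in per-step losses l_r + l_T
weights = rng.random(n) * 2         # stand-in marginal probability ratios w_t
factual = rng.random(n) < 0.5       # does the logged action match pi_e?

def linear_mmd2(Z1, Z2):
    """MMD^2 with a linear kernel: squared distance between mean embeddings."""
    diff = Z1.mean(0) - Z2.mean(0)
    return float(diff @ diff)

def repbm_objective(alpha=1.0, lam=1e-3, model_norm=0.0):
    loss_b = per_sample_loss.mean()                 # empirical risk R_mu
    loss_w = (weights * per_sample_loss).mean()     # reweighted risk R_pi
    ipm = linear_mmd2(Z[factual], Z[~factual])      # representation balance term
    return loss_b + loss_w + alpha * ipm + lam * model_norm

total = repbm_objective(alpha=1.0)
```

Larger `alpha` trades prediction accuracy on the behavior data for representations that are harder to distinguish across factual and counterfactual action histories.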
After learning an MDP model by minimizing the objective above, we use Monte-Carlo rollouts or value iteration in the learned model to obtain the value estimate $V^{\pi}_{\widehat M}(s)$ for any initial state $s$. We show that if there exists an MDP and representation model in our model class that attains zero expected loss and zero representation distance (i.e., the true MDP is realizable), then $V^{\pi}_{\widehat M}(s)$ is a consistent estimator of $V^{\pi}(s)$ for any initial state $s$; see Corollary 2 in the appendix for details.
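The final estimation step, rolling out the deterministic evaluation policy in the learned model, can be sketched as follows. The linear "learned model" and the toy policy are illustrative stand-ins; for a stochastic learned model, one would average the return over multiple sampled rollouts.

```python
import numpy as np

def model_step(s, a):
    """Stand-in learned model: returns (predicted reward, predicted next state)."""
    r = float(s.sum()) - 0.1 * a
    s_next = 0.9 * s + 0.01 * a
    return r, s_next

def pi_e(s):
    """Toy deterministic evaluation policy."""
    return int(s.sum() > 0)

def model_value(s0, horizon=10):
    """Roll out pi_e in the learned model and accumulate the return."""
    s, total = s0.copy(), 0.0
    for _ in range(horizon):
        a = pi_e(s)
        r, s = model_step(s, a)
        total += r
    return total

v_hat = model_value(np.ones(3))
```

Because both the model and the policy here are deterministic, a single rollout gives the exact model value; value iteration over a discretized model would serve the same role.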
Our model can be used in any OPPE estimator that leverages a model-based component, such as doubly robust [Jiang and Li, 2015] and MAGIC [Thomas and Brunskill, 2016], though our generalization MSE bound applies only to the model value.
6 Experiments
6.1 Synthetic domain: Cart Pole
We test our algorithm on Cart Pole, a continuous-state benchmark domain. We use a greedy policy derived from a learned Q-function as the evaluation policy, and an $\epsilon$-greedy version of it as the behavior policy. We collect 1024 trajectories for OPPE, with an average length of around 190. This setting is challenging for IS-based OPPE estimators: the deterministic evaluation policy and long horizon give the IS weights high variance. Yet determinism and long horizons are very common in real-world domains, and they are key limitations of current OPPE algorithms.
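The evaluation/behavior policy pair used here can be sketched as below. The Q-function is a random stand-in, and the value of $\epsilon$ is an assumption for illustration; the actual Q-network and $\epsilon$ are not specified in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_values(s):
    """Stand-in learned Q-function over two actions."""
    return np.array([s.sum(), -s.sum()])

def pi_eval(s):
    """Deterministic greedy evaluation policy."""
    return int(np.argmax(q_values(s)))

def pi_behavior(s, epsilon=0.2):
    """Epsilon-greedy behavior policy around the same Q-function."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values(s))))
    return pi_eval(s)
```

With probability $(1-\epsilon + \epsilon/|\mathcal A|)^H$, a behavior trajectory of length $H$ matches the evaluation policy at every step, which is why trajectories of length ~190 almost never match exactly.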
We compare our method, RepBM, with a baseline approximate model (AM), doubly robust (DR), more robust doubly robust (MRDR), and importance sampling (IS). The baseline approximate model is an MDP model-based estimator trained by minimizing the empirical risk, using the same model class as RepBM. DR(RepBM) is the doubly robust estimator using our model, and DR(AM) is the doubly robust estimator using the baseline model. MRDR [Farajtabar et al., 2018] is a recent method that trains a Q-function, used as the model-based part of DR, to minimize the resulting variance. We include both their Q-function estimator (MRDR Q) and the doubly robust estimator that combines this Q-function with IS (MRDR).
The reported results are the square root of the average MSE over repeated runs, with $\alpha$ held fixed for RepBM. We report mean and individual RMSEs, corresponding to the MSEs of the average policy value and the individual policy values, respectively. IS and DR methods reweight samples, so their estimates for single initial states are not applicable, especially in continuous state spaces. A comparison across more methods is included in the appendix.
Table 1: RMSE of estimated policy values on Cart Pole.

Long Horizon | RepBM | DR(RepBM) | AM | DR(AM) | MRDR Q | MRDR | IS
Mean | 0.4121 | 1.359 | 0.7535 | 1.786 | 151.1 | 202 | 194.5
Individual | 1.033 | – | 1.313 | – | 151.9 | – | –

Short Horizon | RepBM | DR(RepBM) | AM | DR(AM) | MRDR Q | MRDR | IS
Mean | 0.07836 | 0.02081 | 0.1254 | 0.0235 | 3.013 | 0.258 | 2.86
Individual | 0.4811 | – | 0.5506 | – | 3.823 | – | –
Representation Balancing MDPs outperform baselines for long time horizons. We observe that the MRDR variants and IS methods have high MSE in the long-horizon setting: the IS weights for 200-step trajectories have extremely high variance, so MRDR, whose objective depends on the square of the IS weights, also fails. Compared with the baseline model, our method is better than AM both as a pure model-based estimator and when used within doubly robust. We also observe that the IS part of doubly robust actually hurts the estimates, for both RepBM and AM.
Representation Balancing MDPs outperform baselines in deterministic settings. We also include results on Cart Pole with a shorter horizon, obtained by using weaker evaluation and behavior policies; the average trajectory length is about 23 in this setting. Here, RepBM is still better than the other model-based estimators, and doubly robust using RepBM is still better than the other doubly robust methods. Though MRDR produces substantially lower MSE than IS, matching the report of Farajtabar et al. [2018], it still has higher MSE than RepBM and AM, due to the high variance of its learning objective when the evaluation policy is deterministic.
Representation Balancing MDPs produce accurate estimates even when the behavior policy is unknown. In both horizon settings, RepBM learned with no knowledge of the behavior policy outperforms methods such as MRDR and IS that use the true behavior policy.
6.2 Real-world domain: Sepsis
We demonstrate our method on a real-world example: evaluating policies for sepsis treatment in medical intensive care units (ICUs). We extract time series data for 9983 sepsis patients from the MIMIC-III dataset [Johnson et al., 2017]. The data for each patient consist of 47 features recorded at 4-hour intervals. We define the patient state as a concatenation of the last 4 observations, giving 188 features. The actions we consider are the prescription of IV fluids and vasopressors. Each of the two treatments is binned into 5 discrete actions according to the dosage amount, resulting in 25 possible actions. The reward is defined as the change in an acuity measure, the SOFA score, following an action. We treat the data as a fixed-horizon task by only observing the first 10 transitions of every patient.
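The action discretization described above can be sketched as follows; the bin edges are hypothetical placeholders, not the clinically derived cutoffs used for the real dataset.

```python
import numpy as np

# Hypothetical per-4h dosage cutoffs; 4 edges define 5 bins, with bin 0 = no drug.
iv_edges = [0.0, 50.0, 180.0, 530.0]     # IV-fluid cutoffs (illustrative)
vaso_edges = [0.0, 0.08, 0.22, 0.45]     # vasopressor cutoffs (illustrative)

def discretize(dose, edges):
    """Map a continuous dose to a bin in {0, ..., len(edges)}; a zero dose maps to 0."""
    return int(np.searchsorted(edges, dose, side="left"))

def action_id(iv_dose, vaso_dose):
    """Flatten the (IV bin, vasopressor bin) pair into one of 5 * 5 = 25 actions."""
    return 5 * discretize(iv_dose, iv_edges) + discretize(vaso_dose, vaso_edges)
```

The "never use vasopressors" evaluation policy studied below corresponds to restricting the second bin to 0, i.e. action ids that are multiples of 5.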
The main difficulty in a real-data experiment is that we do not have access to the true value of any policy for computing the MSE of the value estimates. We address this issue by artificially splitting the data into two subsets, each representing a different distribution of actions. As a control group, we select one subgroup containing half of the patients who were never treated with vasopressors (chosen randomly from all such patients), and assign the rest of the patients to the other group, which we treat as our training dataset. This is equivalent to observing two separate datasets collected under different behavior policies. We approximate the evaluation and behavior policies by a nearest-neighbors classifier, discussed in detail in Appendix E.2. We compute the value of the policy that never uses vasopressors by taking the empirical average of the returns in the control group, and treat this average as the ground-truth policy value. We then estimate the value of this policy by performing off-policy policy evaluation with the data from the training dataset. We repeated the experiment several times, redoing the splitting process for each repetition. The RMSEs of the average policy value are reported in Table 2. For this real-world dataset, the lack of access to the true value of a policy makes computation of the RMSE for individual patients very noisy, and we therefore focus on population averages.
Table 2: RMSE of the average policy value on the sepsis domain.

 | RepBM | DR(RepBM) | AM | DR(AM) | IS | WDR(RepBM) | WDR(AM)
Mean | 0.09 | 0.23 | 0.8 | 0.73 | | |
On a real-world domain, RepBM outperforms all other OPPE methods, and is the only one that can confidently differentiate between the behavior and evaluation policies. RepBM has the lowest root MSE for the value of the evaluation policy. On-policy estimates of the average expected return show that the policy including the administration of vasopressors does better in terms of SOFA score reduction than the no-vasopressor policy (more negative is better), as domain knowledge suggests, since it employs a wider range of treatment options. RepBM is the only method whose RMSE is smaller than the difference in expected return between the two policies, and it is thus the only method that can confidently differentiate between them.
7 Discussion and Conclusion
One interesting question for our method is the effect of the hyperparameter $\alpha$ on the quality of the estimator. In the appendix, we include results for RepBM across different values of $\alpha$. We find that our method outperforms prior work over a large range of $\alpha$ values, in both domains. In both domains, the effect of the IPM adjustment (nonzero $\alpha$) is smaller than the effect of the "marginal" IS reweighting, matching the findings of Shalit et al. [2016] in the binary-action bandit case.
To conclude, in this work we give an MDP model learning method for the individual OPPE problem in RL, based on a new finite-sample generalization bound on the MSE of the model value estimator. We show that our method yields substantially smaller MSE than state-of-the-art baselines in a common benchmark simulator and on a challenging real-world dataset on sepsis management.
References
 [1] A. M. Alaa and M. van der Schaar. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. arXiv preprint arXiv:1704.02801, 2017.
 [2] O. Atan, W. R. Zame, and M. van der Schaar. Learning optimal policies from observational data. arXiv preprint arXiv:1802.08679, 2018.
 [3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 [4] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In Advances in neural information processing systems, pages 442–450, 2010.
 [5] M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
 [6] M. Farajtabar, Y. Chow, and M. Ghavamzadeh. More robust doubly robust offpolicy evaluation. arXiv preprint arXiv:1802.03493, 2018.
 [7] Z. D. Guo, P. S. Thomas, and E. Brunskill. Using options for long-horizon off-policy evaluation. arXiv preprint arXiv:1703.03453, 2017.
 [8] J. Yoon, J. Jordon, and M. van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In ICLR, 2018.
 [9] N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.
 [10] F. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020–3029, 2016.
 [11] F. D. Johansson, N. Kallus, U. Shalit, and D. Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.
 [12] A. E. Johnson, D. J. Stone, L. A. Celi, and T. J. Pollard. The mimic code repository: enabling reproducibility in critical care research. Journal of the American Medical Informatics Association, page ocx084, 2017.
 [13] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
 [14] S. Künzel, J. Sekhon, P. Bickel, and B. Yu. Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461, 2017.
 [15] T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. Offline policy evaluation across representations with applications to educational games. In Proceedings of the 2014 International Conference on Autonomous Agents and Multiagent Systems, pages 1077–1084, 2014.
 [16] D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In ICML, pages 759–766, 2000.
 [17] J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994.
 [18] P. Schulam and S. Saria. Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems, pages 1697–1708, 2017.
 [19] U. Shalit, F. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. arXiv preprint arXiv:1606.03976, 2016.
 [20] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.
 [21] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, G. R. Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
 [22] P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
 [23] P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High-confidence off-policy evaluation. In AAAI, 2015.
 [24] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2017.
Appendix A IS with approximate behavior policy
In this section, we include some theoretical and empirical results about the effect of using an estimated behavior policy in importance sampling, when the true behavior policy is not accessible.
Proposition 1.
Assume the reward is in [0, 1]. For any estimator π̂_b of the true behavior policy π_b, let V̂_IS be the IS estimator using this estimated π̂_b and V_IS be the IS estimator with the true behavior policy. Both IS estimators are computed using trajectories that are independent from the data used to estimate π̂_b. If the relative error of π̂_b is bounded by ε: max_{s,a} |π̂_b(a|s) − π_b(a|s)| / π_b(a|s) ≤ ε, then for any given dataset:
|V̂_IS − V_IS| ≤ ((1/(1−ε))^H − 1) V_IS
The bias of V̂_IS is bounded by:
|E[V̂_IS] − v^{π_e}| ≤ ((1/(1−ε))^H − 1) v^{π_e}
where v^{π_e} is the true evaluation policy value.
Proof.
Let w_i = ∏_{t=0}^{H−1} π_e(a_t^i | s_t^i) / π_b(a_t^i | s_t^i) be the importance weight of trajectory i, let ŵ_i be the corresponding weight computed with π̂_b, and let R_i ≥ 0 denote the return of trajectory i. Then:
|V̂_IS − V_IS| = |(1/n) Σ_{i=1}^n (ŵ_i − w_i) R_i|  (11)
≤ (1/n) Σ_{i=1}^n |ŵ_i − w_i| R_i  (12)
According to the condition, for any s and a, (1 − ε) π_b(a|s) ≤ π̂_b(a|s) ≤ (1 + ε) π_b(a|s). Then
(1/(1+ε))^H w_i ≤ ŵ_i ≤ (1/(1−ε))^H w_i,
and:
|ŵ_i − w_i| ≤ max{(1/(1−ε))^H − 1, 1 − (1/(1+ε))^H} w_i.
So:
|ŵ_i − w_i| ≤ ((1/(1−ε))^H − 1) w_i.
Plug this into Equation 12:
|V̂_IS − V_IS| ≤ ((1/(1−ε))^H − 1) · (1/n) Σ_{i=1}^n w_i R_i = ((1/(1−ε))^H − 1) V_IS  (13)
Similarly, for the bias, using the unbiasedness of V_IS:
|E[V̂_IS] − v^{π_e}| = |E[V̂_IS − V_IS]|  (14)
≤ E[|V̂_IS − V_IS|]  (15)
≤ ((1/(1−ε))^H − 1) E[V_IS]  (16)
= ((1/(1−ε))^H − 1) v^{π_e}  (17)
∎
We bound the error of IS estimates by the relative error of the behavior policy estimate. Proposition 3 of Farajtabar et al. [6] gives an expression for the bias when using an empirical estimate of the behavior policy in IS. That result is similar to this proposition, but the authors did not explicitly bound the bias by the error of the behavior policy. Note that our bound grows exponentially with the horizon H, reflecting how behavior policy errors accumulate over time.
Using the tree MDP example of Jiang and Li [9], we can show that this order of magnitude is tight: there exists an MDP and a behavior policy estimate with relative error ε such that the bias of the IS estimator is ((1/(1−ε))^H − 1) v^{π_e}. Define a binary discrete tree MDP [9] as follows. At each node in a binary tree, we can take one of two actions a_1, a_2, leading to the two child nodes with observations o_1, o_2. The state of a node is defined by the whole path back to the root, so each node in the tree has a unique state. The depth of the tree, and hence the horizon of the trajectories, is H. Only the leftmost leaf node (reached by always taking a_1) has a non-zero reward of 1. Denote this state the target state. The evaluation policy always takes action a_1 and the behavior policy is a uniformly random policy. Let the estimated policy π̂_b differ from π_b with relative error ε at all state-action pairs on the path to the target state, so that the probability of a_1 under π̂_b is (1 − ε)/2 at all states on that path. The IS estimator with π_b has expectation v^{π_e} since it is unbiased. It is easy to verify that the IS estimator using π̂_b has expectation (1/(1−ε))^H v^{π_e}. Thus the bias is ((1/(1−ε))^H − 1) v^{π_e}.
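The tree-MDP construction above can be checked numerically. The sketch below enumerates all trajectories of a small tree and computes the exact bias of the IS estimator under the perturbed behavior policy; the specific encoding (action 0 as the "left" action, a perturbed probability of 0.5(1 − ε) on the target path) is our own reading of the construction:

```python
import itertools

def tree_mdp_bias(H, eps):
    """Exact bias of the IS estimator on the binary-tree MDP when the
    estimated behavior policy has relative error eps on the target path.

    Evaluation policy: always action 0; true behavior policy: uniform
    (prob 0.5 each). Estimated behavior policy on the target path gives
    action 0 probability 0.5 * (1 - eps). Reward is 1 only at the
    leftmost leaf (all actions 0), else 0.
    """
    v_true = 1.0  # pi_e reaches the rewarded leaf deterministically
    expect = 0.0
    for actions in itertools.product([0, 1], repeat=H):
        p_behavior = 0.5 ** H          # probability under uniform pi_b
        reward = 1.0 if all(a == 0 for a in actions) else 0.0
        weight = 1.0
        for a in actions:
            pi_e = 1.0 if a == 0 else 0.0
            pi_b_hat = 0.5 * (1 - eps) if a == 0 else 0.5 * (1 + eps)
            weight *= pi_e / pi_b_hat  # importance weight with pi_b_hat
        expect += p_behavior * weight * reward
    return expect - v_true

H, eps = 6, 0.05
bias = tree_mdp_bias(H, eps)
bound = (1.0 / (1.0 - eps)) ** H - 1.0  # the exponential-in-H rate above
```

For this construction the computed bias coincides with the exponential-in-H rate, confirming the tightness argument.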
This result is a worst-case upper bound on the bias of IS when using an estimated behavior policy; the fact that it is exponential in the trajectory length illustrates the danger of using IS without knowing the true behavior policy. To support this result with an empirical example on a realistic data distribution, consider Figure 1, which shows the error in OPPE (computed with per-decision WIS) as we vary the accuracy of the behavior policy estimate. Two different behavior policies are considered. The domain used in this example is a continuous 2D map with a discrete action space of five actions, representing a movement of one unit in one of the four coordinate directions or staying in the current position. Gaussian noise of zero mean and specifiable variance is added to the agent's state after each action, to provide environmental stochasticity. The agent starts in the top left corner of the domain, and receives a positive reward within a given radius of the top right corner and a negative reward within a given radius of the bottom left corner. The horizon is set to 15. A k-nearest neighbors (kNN) model is used to estimate the behavior policy distribution from a set of training trajectories. The accuracy of the model is varied by changing the number of trajectories available and the number of neighbors used for behavior policy estimation.
This plot shows how IS suffers from very poor estimates even with slight errors in the estimated behavior policy: average absolute errors as small as 0.06 can produce OPPE errors of over 50%. This provides additional motivation for our approach, which does not require the behavior policy to be known for OPPE, avoiding the significant errors incurred by incorrectly estimated behavior policies.
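For concreteness, a minimal per-decision WIS estimator of the kind used for Figure 1 might look as follows; the data format and the policy interfaces are our own assumptions, not the paper's code:

```python
import numpy as np

def pdwis(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-decision weighted importance sampling (PDWIS).

    trajectories: list of [(s, a, r), ...] lists, all of equal length H.
    pi_e, pi_b: callables (s, a) -> action probability under the
    evaluation / (possibly estimated) behavior policy.
    """
    H = len(trajectories[0])
    n = len(trajectories)
    rho = np.ones((n, H))      # rho[i, t]: cumulative ratio up to step t
    rewards = np.zeros((n, H))
    for i, traj in enumerate(trajectories):
        w = 1.0
        for t, (s, a, r) in enumerate(traj):
            w *= pi_e(s, a) / pi_b(s, a)
            rho[i, t] = w
            rewards[i, t] = r
    value = 0.0
    for t in range(H):
        norm = rho[:, t].sum()  # per-step normalization of the weights
        if norm > 0:
            value += gamma ** t * (rho[:, t] * rewards[:, t]).sum() / norm
    return value
```

Passing a misestimated `pi_b` into this estimator is exactly the failure mode plotted in Figure 1: the per-step weights are normalized, but the cumulative ratios still compound the per-step estimation error.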
Appendix B Clarification of CATE/HTE and ITE
In the causal inference literature [14], for a single unit with covariate (state) x, we observe the outcome Y(1) if we give the unit the treatment and Y(0) if not. The Individual Treatment Effect (ITE) is defined as:
D_i = Y_i(1) − Y_i(0)  (18)
for this particular unit's pair of potential outcomes. However, Y_i(1) and Y_i(0) cannot be observed at the same time, which makes the ITE unidentifiable without strong additional assumptions. Thus the conditional average treatment effect (CATE), also known as the heterogeneous treatment effect (HTE), is defined as:
τ(x) = E[Y(1) − Y(0) | X = x]  (19)
which is a function of x and is identifiable. Shalit et al. [19] define the ITE as τ(x) = E[Y(1) − Y(0) | X = x], which is the quantity named CATE or HTE in most of the causal reasoning literature. We therefore use the name CATE/HTE to refer to this quantity, which is inconsistent with the naming in Shalit et al.'s work. We clarify this here so that it does not confuse the reader.
Appendix C Proofs of Section 4
C.1 Proofs of Theorem 1 and Corollary 1
Before we prove Lemma 4 and Theorem 1, we need some useful lemmas and assumptions. We first restate a well-known variant of the Simulation Lemma [13] for the finite-horizon case:
Lemma 1.
(Simulation Lemma, finite-horizon case) Let V^π_t(s) denote the value of policy π in the true MDP M = (r, T) with steps t, …, H − 1 remaining, and V̂^π_t(s) the corresponding value in an approximate model M̂ = (r̂, T̂), with V^π_H = V̂^π_H = 0. For any approximate MDP model M̂, any policy π, and any t:
V̂^π_t(s) − V^π_t(s) = E_{a∼π(·|s)} [ (r̂(s,a) − r(s,a)) + E_{s'∼T̂(·|s,a)}[V̂^π_{t+1}(s')] − E_{s'∼T(·|s,a)}[V̂^π_{t+1}(s')] + E_{s'∼T(·|s,a)}[V̂^π_{t+1}(s') − V^π_{t+1}(s')] ]  (20)
Then:
V̂^π_0(s) − V^π_0(s) = Σ_{t=0}^{H−1} E_{(s_t,a_t)∼M,π | s_0 = s} [ (r̂(s_t,a_t) − r(s_t,a_t)) + E_{s'∼T̂(·|s_t,a_t)}[V̂^π_{t+1}(s')] − E_{s'∼T(·|s_t,a_t)}[V̂^π_{t+1}(s')] ]  (21)
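The telescoped form of this lemma can be verified numerically. The sketch below builds a random finite-horizon MDP and a deliberately wrong model, computes both value functions by backward induction, and accumulates the per-step reward and transition errors under the true on-policy state distribution; all quantities here are toy constructions of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 3, 5  # states, actions, horizon

def random_mdp(rng):
    r = rng.uniform(0, 1, size=(S, A))
    T = rng.uniform(0, 1, size=(S, A, S))
    T /= T.sum(axis=2, keepdims=True)  # normalize transition kernels
    return r, T

r, T = random_mdp(rng)
r_hat, T_hat = random_mdp(rng)          # a (badly) misspecified model
pi = rng.uniform(0, 1, size=(S, A))
pi /= pi.sum(axis=1, keepdims=True)     # a random stochastic policy

def values(r, T):
    # V[t][s]: value with steps t..H-1 remaining, by backward induction
    V = np.zeros((H + 1, S))
    for t in range(H - 1, -1, -1):
        Q = r + T @ V[t + 1]
        V[t] = (pi * Q).sum(axis=1)
    return V

V, V_hat = values(r, T), values(r_hat, T_hat)

# Accumulate the one-step reward error plus transition error (weighted
# by the model's next-step value) under the true on-policy state
# distribution d_t, as in the unrolled decomposition.
s0 = 0
d = np.zeros(S); d[s0] = 1.0
rhs = 0.0
for t in range(H):
    g = (pi * ((r_hat - r) + (T_hat - T) @ V_hat[t + 1])).sum(axis=1)
    rhs += d @ g
    d = d @ (pi[:, :, None] * T).sum(axis=1)  # on-policy state transition

lhs = V_hat[0][s0] - V[0][s0]  # total value error of the model
```

The two quantities agree exactly (up to floating point), which is the identity the proof of Lemma 4 relies on.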
Lemma 2.
Let φ be the (invertible, differentiable) representation function and J_φ(s) be the absolute value of the determinant of the Jacobian of φ at s. Let p^π_t(s | a_{0:t−1}) denote the state distribution at step t given the action sequence a_{0:t−1}, and p^π_{φ,t} the corresponding distribution over the representation space. Then for any s and any sequence of actions a_{0:t−1}:
p^π_{φ,t}(φ(s) | a_{0:t−1}) = p^π_t(s | a_{0:t−1}) / J_φ(s)
Proof.
By the change of variables formula for probability density functions, with z = φ(s):
p^π_{φ,t}(z | a_{0:t−1}) = p^π_t(φ^{−1}(z) | a_{0:t−1}) · |det J_{φ^{−1}}(z)| = p^π_t(s | a_{0:t−1}) / J_φ(s).
∎
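The change-of-variables identity is easy to check numerically for a linear representation z = As of a Gaussian-distributed state, where the Jacobian determinant is simply |det A|; the specific map below is our own toy example:

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    """Multivariate normal density, pure NumPy."""
    d = x - mean
    k = len(mean)
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ inv @ d) / norm)

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.3], [0.1, 1.5]])     # an invertible linear "phi"
J = abs(np.linalg.det(A))                   # Jacobian determinant of phi

s = rng.normal(size=2)                      # a state, s ~ N(0, I)
z = A @ s                                   # its representation z = phi(s)

p_s = gauss_pdf(s, np.zeros(2), np.eye(2))  # density of s
p_z = gauss_pdf(z, np.zeros(2), A @ A.T)    # z = A s ~ N(0, A A^T)
# Change of variables: density of z at phi(s) equals p_s(s) / J
```

Here `p_z` and `p_s / J` agree, mirroring the lemma's statement with the state density playing the role of p^π_t.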
Lemma 3.
Let p and q be two distributions over the state space, each of the form p^π_t(s | a_{0:t−1}) (the action sequences for p and q may differ), and let p_φ and q_φ be the corresponding distributions over the representation space. For any real-valued function g over the state space, if there exists a constant B_φ and a function class G such that (1/B_φ) g(φ^{−1}(·)) ∈ G, then we have that
|E_{s∼p}[g(s)] − E_{s∼q}[g(s)]| ≤ B_φ IPM_G(p_φ, q_φ).
Proof.
E_{s∼p}[g(s)] − E_{s∼q}[g(s)] = ∫_S g(s) p(s) ds − ∫_S g(s) q(s) ds  (22)
= ∫_Z g(φ^{−1}(z)) p_φ(z) dz − ∫_Z g(φ^{−1}(z)) q_φ(z) dz  (23)
= ∫_Z g(φ^{−1}(z)) (p_φ(z) − q_φ(z)) dz  (24)
= B_φ ∫_Z (1/B_φ) g(φ^{−1}(z)) (p_φ(z) − q_φ(z)) dz  (25)
≤ B_φ | ∫_Z (1/B_φ) g(φ^{−1}(z)) (p_φ(z) − q_φ(z)) dz |  (26)
≤ B_φ sup_{g'∈G} | ∫_Z g'(z) (p_φ(z) − q_φ(z)) dz |  (27)
= B_φ IPM_G(p_φ, q_φ)  (28)
where (23) follows from the change of variables in Lemma 2.
∎
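The final step of this bound can be illustrated with discrete distributions, where for the bounded function class G = {g : |g| ≤ 1} the IPM reduces to the L1 distance between the two representation distributions; all distributions and constants below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 10                            # support size of the representation space
p = rng.random(K); p /= p.sum()   # p_phi: first representation distribution
q = rng.random(K); q /= q.sum()   # q_phi: second representation distribution

B = 5.0                           # scale constant playing the role of B_phi
g = rng.uniform(-B, B, size=K)    # any function with |g| <= B, so g / B in G

# For G = {g' : |g'| <= 1}, the IPM has the closed form sum |p - q|
ipm = np.abs(p - q).sum()
lhs = abs(g @ p - g @ q)          # difference of expectations of g
```

Every such `g` satisfies `lhs <= B * ipm`, and the supremum is attained by g' = sign(p − q), which is the discrete analogue of the bound above.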
The following lemma recursively bounds the value error at step t by the value error at step t + 1; applying it for t = 0, …, H − 1 allows us to bound the overall value error of the model.
The main idea of the proof is to use Equation 20 from the Simulation Lemma to decompose the value function loss into a one-step reward loss, a transition loss, and a next-step value loss, all with respect to the on-policy distribution. We can treat this as a contextual bandit problem, with the right-hand side of Equation 20 as the loss function. For the distribution-mismatch term, we follow the method of Shalit et al. [19] for binary-action bandits to bound the distribution mismatch by a representation distance penalty. By converting the next-step value error on the right-hand side of Equation 20 back into the same form of loss, we can repeat this process recursively to bound the value error over all H steps.
Lemma 4.
For any MDP , approximate MDP model <