Off-Policy Evaluation via Off-Policy Classification
Abstract
In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL, with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics rely either on a model of the environment or on importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real world of policies trained in simulation for an image-based robotic manipulation task.
Alex Irpan^{1}, Kanishka Rao^{1}, Konstantinos Bousmalis^{2}, Chris Harris^{1}, Julian Ibarz^{1}, Sergey Levine^{1,3} ^{1}Google Brain, Mountain View, USA ^{2}DeepMind, London, UK ^{3}University of California Berkeley, Berkeley, USA {alexirpan,kanishkarao,konstantinos,ckharris,julianibarz,slevine}@google.com
Preprint. Under review.
1 Introduction
Supervised learning has seen significant advances in recent years, in part due to the use of large, standardized datasets (Deng et al., 2009). When researchers can evaluate real performance of their methods on the same data via a standardized offline metric, the progress of the field can be rapid. Unfortunately, such metrics have been lacking in reinforcement learning (RL). Model selection and performance evaluation in RL are typically done by estimating the average on-policy return of a method in the target environment. Although this is possible in most simulated environments (Todorov et al., 2012; Bellemare et al., 2013; Brockman et al., 2016), real-world environments, like in robotics, make this difficult and expensive (Thomas et al., 2015). Off-policy evaluation (OPE) has the potential to change that: a robust off-policy metric could be used together with realistic and complex data to evaluate the expected performance of off-policy RL methods, which would enable rapid progress on important real-world RL problems. Furthermore, it would greatly simplify transfer learning in RL, where OPE would enable model selection and algorithm design in simple domains (e.g., simulation) while evaluating the performance of these models and algorithms on complex domains (e.g., using previously collected real-world data).
Previous approaches to off-policy evaluation (Precup et al., 2000; Dudik et al., 2011; Jiang & Li, 2015; Thomas & Brunskill, 2016) generally use importance sampling (IS) or learned dynamics models. However, this makes them difficult to use with many modern deep RL algorithms. First, OPE is most useful in the off-policy RL setting, where we expect to use real-world data as the “validation set”, but many of the most commonly used off-policy RL methods are based on value function estimation, produce deterministic policies (Lillicrap et al., 2015; van Hasselt et al., 2016), and do not require any knowledge of the policy that generated the real-world training data. This makes them difficult to use with IS. Furthermore, many of these methods might be used with high-dimensional observations, such as images. Although there has been considerable progress in predicting future images (Babaeizadeh et al., 2018; Lee et al., 2018), learning sufficiently accurate models in image space for effective evaluation is still an open research problem. We therefore aim to develop an OPE method that requires neither IS nor models.
We observe that for model selection, it is sufficient to predict some statistic correlated with policy return, rather than directly predict policy return. We address the specific case of binary-reward MDPs: tasks where the agent receives a non-zero reward only once during an episode, at the final timestep (Sect. 2). These can be interpreted as tasks where the agent can either “succeed” or “fail” in each trial, and although they form a subset of all possible MDPs, this subset is quite representative of many real-world tasks, and is actively used, e.g., in robotic manipulation (Kalashnikov et al., 2018; Riedmiller et al., 2018). The novel contribution of our method (Sect. 3) is to frame OPE as a positive-unlabeled (PU) classification problem (Kiryo et al., 2017), which provides a way to derive OPE metrics that both (a) differ fundamentally from prior methods based on IS and model learning, and (b) perform well in practice on both simulated and real-world tasks. Additionally, we identify and present (Sect. LABEL:sec:generalization) a list of generalization scenarios in RL that we would want our metrics to be robust against. We experimentally show (Sect. 4) that our suggested OPE metrics outperform a variety of baseline methods across all of the evaluation scenarios, including a simulation-to-reality transfer scenario for a vision-based robotic grasping task (see Fig. 0(b)).

2 Preliminaries
We focus on finite-horizon, deterministic Markov decision processes (MDP). We define an MDP as $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{S}_0, r, \gamma)$. $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, and both can be continuous. $\mathcal{P}$ defines transitions to next states given the current state and action, $\mathcal{S}_0$ defines the initial state distribution, $r$ is the reward function, and $\gamma$ is the discount factor. Episodes are of finite length $T$: at a given timestep $t$ the agent is at state $s_t \in \mathcal{S}$, samples an action $a_t \in \mathcal{A}$ from a policy $\pi$, receives a reward $r_t = r(s_t, a_t)$, and observes the next state $s_{t+1}$ as determined by $\mathcal{P}$.
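The goal in RL is to learn a policy $\pi(a_t \mid s_t)$ that maximizes the expected episode return $R(\pi) = \mathbb{E}_\pi\left[\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t)\right]$. The value of a policy for a given state $s_t$ is defined as $V^\pi(s_t) = \mathbb{E}_\pi\left[\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{t'}, a^\pi_{t'})\right]$, where $a^\pi_{t'}$ is the action $\pi$ takes at state $s_{t'}$ and $\mathbb{E}_\pi$ implies an expectation over trajectories $\tau = (s_1, a_1, \dots, s_T, a_T)$ sampled from $\pi$. Given a policy $\pi$, the expected value of its action $a_t$ at a state $s_t$ is called the Q-value and is defined as $Q^\pi(s_t, a_t) = \mathbb{E}_\pi\left[r(s_t, a_t) + \gamma V^\pi(s_{t+1})\right]$.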
We assume the MDP is a binary reward MDP, which satisfies the following properties: transitions are deterministic, $\gamma = 1$, the reward is $r_t = 0$ at all intermediate steps, and the final reward $r_T$ is in $\{0, 1\}$, indicating whether the final state is a failure or a success. We learn Q-functions $Q(s, a)$ and aim to evaluate policies $\pi(s) = \arg\max_a Q(s, a)$.
2.1 Positive-unlabeled learning
Positive-unlabeled (PU) learning is a set of techniques for learning binary classification from partially labeled data, where we have many unlabeled points and some positively labeled points (Kiryo et al., 2017). We will make use of these ideas in developing our OPE metric. Positive-unlabeled data is sufficient to learn a binary classifier if the positive class prior $p(y=1)$ is known.
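Let $(X, Y)$ be a labeled binary classification problem, where $Y \in \{0, 1\}$. Let $g(x)$ be some decision function, and let $\ell(g(x), y)$ be our loss function. Suppose we want to evaluate the loss $\ell(g(x), y)$ over negative examples $(x, y=0)$, but we only have unlabeled points $x$ and positively labeled points $(x, y=1)$. The key insight of PU learning is that the loss over negatives can be indirectly estimated from $p(y=1)$. For any $x \in X$,
$$p(x) = p(x \mid y=1)\,p(y=1) + p(x \mid y=0)\,p(y=0) \tag{1}$$
It follows that for any $f(x)$, $\mathbb{E}_{X,Y}[f(x)] = p(y=1)\,\mathbb{E}_{X \mid Y=1}[f(x)] + p(y=0)\,\mathbb{E}_{X \mid Y=0}[f(x)]$, since by definition $\mathbb{E}_X[f(x)] = \int_x p(x) f(x)\,dx$. Letting $f(x) = \ell(g(x), 0)$ and rearranging gives
$$p(y=0)\,\mathbb{E}_{X \mid Y=0}[\ell(g(x), 0)] = \mathbb{E}_{X,Y}[\ell(g(x), 0)] - p(y=1)\,\mathbb{E}_{X \mid Y=1}[\ell(g(x), 0)] \tag{2}$$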
In Sect. 3, we reduce off-policy evaluation of a policy $\pi$ to a classification problem, provide reasoning for how to estimate $p(y=1)$, use PU learning to estimate classification error with Eqn. 2, then use the error to estimate a lower bound on return $R(\pi)$ with Theorem 1.
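The PU identity of Sect. 2.1 can be checked numerically. The following is a minimal sketch on synthetic one-dimensional data, not part of the original experiments: the Gaussian data, the class prior of 0.3, the decision rule "x > 0", and the helper name `negative_risk_pu` are all illustrative assumptions. It estimates the loss mass on negatives from unlabeled and positive points only, then compares it with the directly computed value, which is available here only because the synthetic data carries true labels.

```python
import random

def negative_risk_pu(unlabeled, positives, prior, loss):
    """Estimate p(y=0) * E[loss(x) | y=0] without any negative labels."""
    mean_all = sum(loss(x) for x in unlabeled) / len(unlabeled)
    mean_pos = sum(loss(x) for x in positives) / len(positives)
    return mean_all - prior * mean_pos

random.seed(0)
prior = 0.3  # assumed known class prior p(y=1)

# Synthetic data: positives are centred at +1, negatives at -1.
data = []
for _ in range(20000):
    y = 1 if random.random() < prior else 0
    data.append((random.gauss(1.0 if y else -1.0, 1.0), y))

# 0-1 loss of predicting "positive" (x > 0) measured against label 0.
def loss_vs_negative(x):
    return 1.0 if x > 0 else 0.0

unlabeled = [x for x, _ in data]
positives = [x for x, y in data if y == 1]
pu_estimate = negative_risk_pu(unlabeled, positives, prior, loss_vs_negative)

# Direct computation, possible only because the synthetic labels are known.
direct = sum(loss_vs_negative(x) for x, y in data if y == 0) / len(data)
print(pu_estimate, direct)
```

The two printed numbers should agree up to sampling noise, since the only approximation is that the empirical fraction of positives differs slightly from the assumed prior.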
3 Off-policy evaluation via state-action pair classification
A Q-function $Q(s, a)$ predicts the expected return of each action $a$ given state $s$. The policy $\pi(s) = \arg\max_a Q(s, a)$ can be viewed as a classifier that predicts the best action. We propose an off-policy evaluation method connecting off-policy evaluation to estimating validation error for a positive-unlabeled (PU) classification problem (Kiryo et al., 2017). Our metric can be used with Q-function estimation methods without requiring importance sampling, and can be readily applied in a scalable way to image-based deep RL tasks.
We present an analysis for binary reward MDPs, defined in Sect. 2. In a binary reward MDP, the deterministic dynamics mean that each $(s, a)$ is either potentially effective, or guaranteed to lead to failure.
Definition 1.
In a binary reward MDP, $(s, a)$ is effective if an optimal policy $\pi^*$ can achieve success, i.e., an episode return of 1, after taking $a$ in $s$. Equivalently, there exists a sequence of future actions that reaches a success state. $(s, a)$ is catastrophic if no such sequence exists.
Under this definition, the return of a trajectory $\tau$ is 1 if and only if all $(s_t, a_t)$ in $\tau$ are effective (see Appendix A.1). Moreover, the label for $(s, a)$ does not depend on the policy $\pi$ we are evaluating, and the classification error of $\pi$ at each time $t$ can be used to bound the return $R(\pi)$.
Theorem 1.
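Given a binary reward MDP and a policy $\pi$, let $\rho^+_{t,\pi}$ denote the state distribution at time $t$, given that $\pi$ was followed and all its previous actions $a_1, \dots, a_{t-1}$ were effective. Let $\mathcal{A}_-(s)$ denote the set of catastrophic actions at state $s$, and let $\epsilon_t = \mathbb{E}_{s_t \sim \rho^+_{t,\pi}}\left[\sum_{a \in \mathcal{A}_-(s_t)} \pi(a \mid s_t)\right]$ be the per-step expectation of $\pi$ making its first mistake at time $t$, with $\epsilon = \frac{1}{T}\sum_{t=1}^{T} \epsilon_t$ being the average error over all $t$. Then $R(\pi) \ge 1 - T\epsilon$, and this lower bound is tight.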
Proof.
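Failure rate $1 - R(\pi)$ is the total probability that $\pi$ makes its first mistake at time $t$, summed over all $t$. For each $t$, the probability that $\pi$ follows $t-1$ effective actions then a catastrophic action is $\epsilon_t \prod_{i<t}(1 - \epsilon_i)$, giving $1 - R(\pi) = \sum_{t=1}^{T} \epsilon_t \prod_{i<t}(1 - \epsilon_i)$. This gives $1 - R(\pi) \le \sum_{t=1}^{T} \epsilon_t = T\epsilon$, which is tight when $\epsilon_t$ is nonzero for at most one timestep $t$. ∎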
An alternative proof in Appendix A.3 is based on imitation learning behavioral cloning bounds from Ross & Bagnell (2010).
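As a numerical illustration of Theorem 1, the sketch below compares the exact return with the lower bound for a few hand-picked error profiles. It assumes, following the theorem statement, that each value in `eps` is the probability of a first mistake at that timestep conditioned on all earlier actions being effective, so that the policy succeeds with probability equal to the product of the (1 - eps[t]) terms. The function names and the example values are hypothetical.

```python
def exact_return(eps):
    """Probability that every action in the episode is effective."""
    survive = 1.0  # probability that all actions so far were effective
    for e in eps:
        survive *= 1.0 - e
    return survive

def lower_bound(eps):
    """Theorem 1 bound: R(pi) >= 1 - T * mean(eps) = 1 - sum(eps)."""
    return 1.0 - sum(eps)

profiles = [
    [0.05, 0.10, 0.02],  # errors spread over the episode: bound is loose
    [0.00, 0.00, 0.20],  # errors at a single timestep: bound is attained
    [0.30, 0.30, 0.30],  # large errors: bound is still valid, but weak
]
for eps in profiles:
    print(eps, exact_return(eps), lower_bound(eps))
```

The bound holds in every case, and the gap grows as errors spread across more timesteps, which is why the metric is used here for ranking policies rather than for predicting return exactly.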
A smaller $\epsilon$ gives a higher lower bound on return, which implies a better $\pi$. This leaves estimating $\epsilon$. The primary challenge with this approach is that we do not have negative labels: for trials that receive a return of 0 in the validation set, we do not know which $(s, a)$ were in fact catastrophic, and which were recoverable. We discuss how we address this problem next.
3.1 Missing negative labels
Recall that $(s_t, a_t)$ is effective if $\pi^*$ can succeed from $(s_t, a_t)$. Since $\pi^*$ is at least as good as $\pi_b$, whenever $\pi_b$ succeeds, all tuples $(s_t, a_t)$ in the trajectory must be effective. However, the converse is not true, since $\pi^*$ could succeed from $(s_t, a_t)$ where $\pi_b$ failed. This is an instance of the positive-unlabeled (PU) learning problem from Sect. 2.1, where $\pi_b$ positively labels some $(s, a)$ and the remaining are unlabeled. In the RL setting, the inputs are state-action pairs $(s, a)$, the labels $\{0, 1\}$ correspond to {catastrophic, effective}, and a natural choice for the decision function $g$ is $g(s, a) = Q(s, a)$, since $Q(s, a)$ should be high for effective $(s, a)$ and low for catastrophic $(s, a)$.
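We aim to estimate $\epsilon$, the probability that $\pi$ takes a catastrophic action, i.e., that $(s, \pi(s))$ is a false positive. Note that if $(s, a)$ is predicted to be catastrophic, but is actually effective, this false negative does not impact future reward: since the action is effective, there is still a path to reach success. We want just the false-positive risk, $\epsilon = p(y=0)\,\mathbb{E}_{X \mid Y=0}[\ell(g(x), 0)]$. This is the same as Eqn. 2, and using $g(s, a) = Q(s, a)$ gives
$$\epsilon = \mathbb{E}_{(s,a)}[\ell(Q(s,a), 0)] - p(y=1)\,\mathbb{E}_{(s,a),\,y=1}[\ell(Q(s,a), 0)] \tag{3}$$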
Eqn. 3 is the core of all our proposed metrics. While it might at first seem that the class prior $p(y=1)$ should be task-dependent, recall that the error $\epsilon_t$ is the expectation over the state distribution $\rho^+_{t,\pi}$, where the actions $a_1, \dots, a_{t-1}$ were all effective. This is equivalent to following an optimal “expert” policy $\pi^*$, and although we are estimating $\epsilon_t$ from data generated by behavior policy $\pi_b$, we should match the positive class prior $p(y=1)$ we would observe from expert $\pi^*$. Assuming the task is feasible, meaning that the policy has effective actions available from the start, we have $R(\pi^*) = 1$. Therefore, although the validation dataset will likely have both successes and failures, a prior of $p(y=1) = 1$ is the ideal prior, and this holds independently of the environment. We illustrate this further with a didactic example in Sect. 4.1.
Theorem 1 relies on estimating $\epsilon$ over the distribution $\rho^+_{t,\pi}$, but our dataset $\mathcal{D}$ is generated by an unknown behavior policy $\pi_b$. A natural approach here would be importance sampling (IS) (Dudik et al., 2011), but: (a) we assume no knowledge of $\pi_b$, and (b) IS is not well-defined for deterministic policies $\pi(s) = \arg\max_a Q(s, a)$. Another approach is to subsample $\mathcal{D}$ to transitions $(s, a)$ where $a = \pi(s)$ (Liu et al., 2018). This ensures an on-policy evaluation, but can encounter finite-sample issues if $\pi_b$ does not sample $\pi(s)$ frequently enough. Therefore, we assume classification error over $\mathcal{D}$ is a good enough proxy that correlates well with classification error over $\rho^+_{t,\pi}$. This is admittedly a strong assumption, but empirical results in Sect. 4 show surprising robustness to distributional mismatch. This assumption is reasonable if $\mathcal{D}$ is broad (e.g., generated by a sufficiently random policy), but may produce pessimistic estimates when potential effective actions in $\mathcal{D}$ are unlabeled.
3.2 Off-policy classification for OPE
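Based on the derivation from Sect. 3.1, our proposed off-policy classification (OPC) score is defined by the negative loss when $\ell$ in Eqn. 3 is the 0-1 loss. Let $b$ be a threshold, with $\ell(Q(s,a), Y) = \frac{1}{2} + \left(\frac{1}{2} - Y\right)\operatorname{sign}(Q(s,a) - b)$. This gives
$$\text{OPC}(Q) = p(y=1)\,\mathbb{E}_{(s,a),\,y=1}\left[\mathbb{1}_{Q(s,a) > b}\right] - \mathbb{E}_{(s,a)}\left[\mathbb{1}_{Q(s,a) > b}\right] \tag{4}$$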
To be fair to each $Q$, threshold $b$ is set separately for each Q-function to maximize $\text{OPC}(Q)$. Given $N$ transitions and $Q(s, a)$ for all $(s, a) \in \mathcal{D}$, the best $b$ for each $Q$ can be computed by sorting all Q-values, then scanning every threshold $b$, which takes $O(N \log N)$ time per Q-function (see Appendix B). This avoids favoring Q-functions that systematically overestimate or underestimate the true value.
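The sort-and-scan procedure described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation from Appendix B: the function name `opc_score`, its argument layout, and the tie handling are assumptions, and the per-episode weighting for variable-length episodes is omitted for brevity. A transition counts as positive when it comes from a successful episode.

```python
def opc_score(q_values, is_positive, prior=1.0):
    """Best-threshold OPC: max over b of prior * P(Q > b | y=1) - P(Q > b)."""
    n = len(q_values)
    n_pos = sum(1 for p in is_positive if p)
    if n == 0 or n_pos == 0:
        raise ValueError("need at least one transition and one positive label")

    # Sort indices by Q-value, largest first: O(N log N).
    order = sorted(range(n), key=lambda i: q_values[i], reverse=True)

    best = 0.0      # threshold above every Q-value: no transition exceeds b
    above_all = 0   # transitions with Q > b
    above_pos = 0   # positive transitions with Q > b
    for rank, i in enumerate(order):
        above_all += 1
        if is_positive[i]:
            above_pos += 1
        # A threshold is only meaningful between two distinct Q-values.
        is_last = rank == n - 1
        if is_last or q_values[order[rank + 1]] < q_values[i]:
            score = prior * above_pos / n_pos - above_all / n
            best = max(best, score)
    return best

# A Q-function that ranks the positive transitions on top scores higher
# than one that ranks them at the bottom.
good = opc_score([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
bad = opc_score([0.9, 0.8, 0.3, 0.1], [False, False, True, True])
print(good, bad)
```

Because the scan only compares counts above each candidate threshold, the score is unchanged by any monotonic rescaling of the Q-values, which matches the stated goal of not favoring Q-functions that systematically over- or underestimate.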
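Alternatively, $\ell$ can be a soft loss function. We experimented with $\ell(Q(s,a), Y) = (1 - 2Y)\,Q(s,a)$, which is minimized when $Q(s,a)$ is large for $Y=1$ and small for $Y=0$. The negative of this loss is called the SoftOPC.
$$\text{SoftOPC}(Q) = p(y=1)\,\mathbb{E}_{(s,a),\,y=1}[Q(s,a)] - \mathbb{E}_{(s,a)}[Q(s,a)] \tag{5}$$
If episodes have different lengths, to avoid focusing on long episodes, transitions $(s, a)$ from an episode of length $T$ are weighted by $\frac{1}{T}$ when estimating OPC and its variants.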
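A minimal sketch of the SoftOPC with the per-episode weighting just described is given below. The input layout (a list of `(q_values, success)` pairs, one per validation episode) and the function name `soft_opc` are assumptions made for illustration. Weighting each transition by one over its episode length means every episode contributes its mean Q-value with total weight one.

```python
def soft_opc(episodes, prior=1.0):
    """SoftOPC = prior * E[Q | y=1] - E[Q], each transition weighted by 1/T."""
    sum_all = 0.0     # weighted sum of Q over all transitions
    weight_all = 0.0  # total weight of all transitions
    sum_pos = 0.0     # weighted sum of Q over positively labeled transitions
    weight_pos = 0.0  # total weight of positively labeled transitions
    for q_values, success in episodes:
        if not q_values:
            continue  # skip empty episodes
        episode_mean = sum(q_values) / len(q_values)  # 1/T weight per transition
        sum_all += episode_mean
        weight_all += 1.0
        if success:  # transitions of successful episodes are labeled positive
            sum_pos += episode_mean
            weight_pos += 1.0
    if weight_pos == 0.0:
        raise ValueError("need at least one successful episode")
    return prior * sum_pos / weight_pos - sum_all / weight_all

validation = [
    ([0.8, 0.9], True),        # successful episode of length 2
    ([0.2, 0.1, 0.3], False),  # failed episode of length 3
]
score = soft_opc(validation)
print(score)
```

With a prior of 1 the score is the gap between the mean Q-value on successful episodes and the mean Q-value on all episodes, so it is sensitive to the scale of the Q-values, unlike the thresholded OPC.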
3.3 Evaluating OPE metrics
The standard evaluation method for OPE is to report MSE to the true episode return (Thomas & Brunskill, 2016; Liu et al., 2018). However, our metrics do not estimate episode return directly. The OPC's estimate of $\epsilon$ will differ from the true value, since it is estimated over our dataset $\mathcal{D}$ instead of over the distribution $\rho^+_{t,\pi}$. Meanwhile, SoftOPC does not estimate $\epsilon$ directly due to using a soft loss function. Despite this, the OPC and SoftOPC are still useful OPE metrics if they correlate well with $\epsilon$ or episode return $R(\pi)$.
We propose an alternative evaluation method. Instead of reporting MSE, we train a large suite of Q-functions $Q(s, a)$ with different learning algorithms, evaluating the true return of the equivalent argmax policy for each $Q(s, a)$, then compare the correlation of the metric to true return. We report two correlations, the coefficient of determination $R^2$ of the line of best fit, and the Spearman rank correlation $\xi$ (S. Spearman, 1904). $R^2$ measures confidence in how well our linear best fit will predict returns of new models, whereas $\xi$ measures confidence that the metric ranks different policies correctly, without assuming a linear best fit. (Footnote 1: We slightly abuse notation here: $R^2$ denotes the coefficient of determination and should not be confused with $R(\pi)$, the average return of a policy $\pi$.)
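The two correlations can be computed with a few lines of standard-library Python. This sketch assumes a simple least-squares line, for which the coefficient of determination equals the squared Pearson correlation, and computes the Spearman correlation as the Pearson correlation of average ranks. The function names and the four example scores are hypothetical and are not taken from the experiments.

```python
def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y ~ a*x + b."""
    return pearson(xs, ys) ** 2

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical metric scores and true returns for four Q-functions.
metric = [0.10, 0.40, 0.35, 0.80]
returns = [0.20, 0.50, 0.60, 0.90]
print(r_squared(metric, returns), spearman(metric, returns))
```

In this toy example the metric swaps the order of the two middle policies, so the rank correlation drops below 1 even though the linear fit remains fairly good.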
4 Experiments
In this section, we investigate the correlation of OPC and SoftOPC with true average return, and how they may be used for model selection with off-policy data. We compare the correlation of these metrics with the correlation of the baselines, namely the TD Error, Sum of Advantages, and the MCC Error (see Sect. LABEL:sec:baselines) in a number of environments and generalization failure scenarios. For each experiment, a validation dataset $\mathcal{D}$ is collected with a behavior policy $\pi_b$, and state-action pairs are labeled as effective whenever they appear in a successful trajectory. In line with Sect. 3.3, several Q-functions $Q(s, a)$ are trained for each task. For each $Q(s, a)$, we evaluate each metric over $\mathcal{D}$ and the true return of the equivalent argmax policy. We report both the coefficient of determination $R^2$ of the line of best fit and the Spearman rank correlation coefficient $\xi$ (S. Spearman, 1904). Our results are summarized in Table 1. Our OPC/SoftOPC metrics are implemented using $p(y=1) = 1$, as explained in Sect. 3 and Appendix D.
4.1 Simple Environments
Binary tree.
As a didactic toy example, we used a binary tree MDP with depth of episode length $T$. In this environment (Footnote 2: code for the binary tree environment is available at https://bit.ly/2Qx6TJ7), each node is a state $s_t$ with $r_t = 0$, unless it is a leaf/terminal state with reward $r_T \in \{0, 1\}$. Actions are $\{\text{left}, \text{right}\}$, and transitions are deterministic. We experimented with two extreme versions of this environment: (a) 1-Failure, where the agent is successful unless it reaches the single failure leaf with $r_T = 0$; and (b) 1-Success, where the agent fails unless it reaches the single success leaf with $r_T = 1$. In our experiments we used a full binary tree of depth $T$. The initial state distribution was uniform over all non-leaf nodes, which means that the initial state could sometimes be initialized to one where failure is inevitable. The validation dataset was collected by generating 1,000 episodes from a uniformly random policy. For the policies we wanted to evaluate, we generated 1,000 random Q-functions by sampling $Q(s, a)$ at random for every $(s, a)$, defining the policy as $\pi(s) = \arg\max_a Q(s, a)$. We compared the correlation of the actual on-policy performance of the policies with the scores given by the OPC, SoftOPC and the baseline metrics using $\mathcal{D}$, as shown in Table 1. SoftOPC correlates best and OPC correlates second best.
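To make the toy environment concrete, here is a small self-contained sketch of a 1-Success binary tree. It is not the authors' released code: the state encoding `(level, index)`, the helper names, the depth of 3, the choice of success leaf, and the uniform sampling of Q-values are all illustrative assumptions. It shows how effective and catastrophic pairs arise from the tree structure and how the on-policy return of an argmax policy can be computed exactly.

```python
import random

def make_tree(depth, success_leaves):
    """Full binary tree; a state is (level, index) and leaves have level == depth."""
    def step(state, action):  # action: 0 = left, 1 = right
        level, index = state
        return (level + 1, 2 * index + action)

    def reward(state):  # nonzero only at success leaves
        level, index = state
        return 1 if level == depth and index in success_leaves else 0

    return step, reward

def rollout(policy, start, depth, step, reward):
    """Follow a deterministic policy to a leaf and return the binary return."""
    state = start
    while state[0] < depth:
        state = step(state, policy(state))
    return reward(state)

def is_effective(state, action, depth, step, reward):
    """(s, a) is effective if some success leaf is reachable after taking a in s."""
    child = step(state, action)
    if child[0] == depth:
        return reward(child) == 1
    return any(is_effective(child, a, depth, step, reward) for a in (0, 1))

random.seed(0)
depth = 3
step, reward = make_tree(depth, success_leaves={5})  # 1-Success variant

# Random Q-function over all non-leaf states (uniform sampling is an assumption).
states = [(level, i) for level in range(depth) for i in range(2 ** level)]
q = {(s, a): random.random() for s in states for a in (0, 1)}

def policy(s):
    return max((0, 1), key=lambda a: q[(s, a)])

# Exact on-policy return under a uniform start distribution over non-leaf nodes.
avg_return = sum(rollout(policy, s, depth, step, reward) for s in states) / len(states)
print(avg_return)
```

In the 1-Success variant most start states lie outside the path to the success leaf, so even an optimal policy cannot succeed from them, which mirrors the remark that the initial state can be one where failure is inevitable.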
Pong.
As we are specifically motivated by image-based tasks with binary rewards, the Atari Pong game (Bellemare et al., 2013) was a good choice for a simple environment that can have these characteristics. The visual input is of low complexity, and the game can be easily converted into a binary reward task by truncating the episode after the first point is scored. We learned Q-functions using DQN (Mnih et al., 2015) and DDQN (van Hasselt et al., 2016), varying hyperparameters such as the learning rate, the discount factor $\gamma$, and the batch size, as discussed in detail in Appendix E.2. A total of 175 model checkpoints are chosen from the various models for evaluation, and true average performance is evaluated over 3,000 episodes for each model checkpoint. For the validation dataset we used 38 Q-functions that were partially trained with DDQN and generated 30 episodes from each, for a total of 1,140 episodes. As with the binary tree environments, we compare the correlations of our metrics and the baselines to the true average performance over a number of on-policy episodes. As we show in Table 1, both our metrics outperform the baselines; OPC performs better than SoftOPC in terms of $R^2$ correlation but is similar in terms of Spearman correlation $\xi$.
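Table 1: $R^2$ and $\xi$ for each metric in each environment.

| Metric | Tree (1 Fail) $R^2$ | Tree (1 Fail) $\xi$ | Tree (1 Succ) $R^2$ | Tree (1 Succ) $\xi$ | Pong $R^2$ | Pong $\xi$ | Sim Train $R^2$ | Sim Train $\xi$ | Sim Test $R^2$ | Sim Test $\xi$ | Real-World $R^2$ | Real-World $\xi$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TD Err | 0.01 | 0.13 | 0.02 | 0.15 | 0.05 | 0.18 | 0.02 | 0.37 | 0.10 | 0.51 | 0.17 | 0.48 |
| Sum of Advantages | 0.00 | 0.07 | 0.00 | 0.00 | 0.09 | 0.32 | 0.74 | 0.81 | 0.74 | 0.78 | 0.12 | 0.50 |
| MCC Err | 0.02 | 0.17 | 0.06 | 0.26 | 0.04 | 0.36 | 0.00 | 0.33 | 0.06 | 0.44 | 0.01 | 0.15 |
| OPC | 0.21 | 0.48 | 0.21 | 0.50 | 0.50 | 0.72 | 0.49 | 0.86 | 0.35 | 0.66 | 0.81 | 0.87 |
| SoftOPC | 0.23 | 0.53 | 0.19 | 0.51 | 0.36 | 0.75 | 0.55 | 0.76 | 0.48 | 0.77 | 0.91 | 0.94 |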
4.2 Vision-based Robotic Grasping
Our main experimental results were on simulated and real versions of a robotic environment and a vision-based grasping task, following the setup from Kalashnikov et al. (2018), the details of which we briefly summarize. The observation at each timestep is an RGB image of the robot and a bin of objects, taken from a camera placed over the shoulder of the robotic arm, as shown in Fig. 0(b). At the start of an episode, objects are randomly dropped in a bin in front of the robot. The goal is to grasp any of the objects in that bin. Actions include continuous Cartesian displacements of the gripper, and the rotation of the gripper around the z-axis. The action space also includes three discrete commands: “open gripper”, “close gripper”, and “terminate episode”. Rewards are sparse, with a reward of 1 if any object is grasped and 0 otherwise. All models are trained with the fully off-policy QT-Opt algorithm as described in Kalashnikov et al. (2018).
In simulation we define a training and a test environment by generating two distinct sets of 5 objects that are used for each, shown in Fig. LABEL:fig:train_test_objects. In order to capture the different possible generalization failure scenarios discussed in Sect. LABEL:sec:generalization, we trained Q-functions in a fully off-policy fashion with data collected by a hand-crafted policy with a 60% grasp success rate and $\epsilon$-greedy exploration (with $\epsilon = 0.1$), using two different datasets, both from the training environment. With the first dataset, we can show that we have insufficient off-policy training data to perform well even in the training environment. With the second, we can show that we have sufficient data to perform well in the training environment, but that, due to mismatched off-policy training data, the policies do not generalize to the test environment (see Fig. LABEL:fig:train_test_objects for objects and Appendix E.3 for the analysis). We saved policies at different stages of training, which resulted in 452 policies for the former case and 391 for the latter. We evaluated the true return of these policies on 700 episodes in each environment and calculated the correlation with the scores assigned by the OPE metrics based on held-out validation sets of episodes from the training environment and from the test environment, which we show in Table 1.
The real-world version of the environment has objects that were never seen during training (see Fig. 0(b) and 7). We evaluated 15 different models, trained to have varying degrees of robustness to the training and test domain gap, based on domain randomization and randomized-to-canonical adaptation networks (James et al., 2019). (Footnote 3: For full details of each of the models please see Appendix E.4.) Out of these, 7 were trained on-policy purely in simulation. True average return was evaluated over 714 episodes with 7 different sets of objects, and true policy real-world performance ranged from 17% to 91%. The validation dataset consisted of real-world episodes, 40% of which were successful grasps. The objects used for it were separate from the ones used for the final evaluation reported in Table 1.

4.3 Discussion
Table 1 shows $R^2$ and $\xi$ for each metric for the different environments we considered. Our proposed SoftOPC and OPC consistently outperformed the baselines, with the exception of the simulated robotic test environment, on which the SoftOPC performed almost as well as the discounted sum of advantages on the Spearman correlation (but worse on $R^2$). However, we show that SoftOPC more reliably ranks policies than the baselines for real-world performance without any real-world interaction, as one can also see in Fig. 1(b). The same figure shows that the sum of advantages metric, which works well in simulation, performs poorly in the real-world setting we care about. Appendix F includes additional experiments showing correlation mostly unchanged on different validation datasets.
Furthermore, we demonstrate in Fig. 1(a) that SoftOPC can track the performance of a policy acting in the simulated grasping environment as it is training, which could potentially be useful for early stopping. Finally, SoftOPC seems to be performing slightly better than OPC in most of the experiments. We believe this occurs because the Q-functions compared in each experiment tend to have similar magnitudes. Preliminary results in Appendix H suggest that when Q-functions have different magnitudes, OPC might outperform SoftOPC.
5 Conclusion and future work
We proposed OPC and SoftOPC, classification-based off-policy evaluation metrics that can be used together with Q-learning algorithms. Our metrics can be used with binary reward tasks: tasks where each episode results in either a failure (zero return) or success (a return of one). While this class of tasks is a substantial restriction, many practical tasks actually fall into this category, including the real-world robotics tasks in our experiments. The analysis of these metrics shows that they can approximate the expected return in deterministic binary reward MDPs. Empirically, we find that OPC and the SoftOPC variant correlate well with performance across several environments, and predict generalization performance across several scenarios, including the simulation-to-reality scenario, a critical setting for robotics. Effective off-policy evaluation is critical for real-world reinforcement learning, where it provides an alternative to expensive real-world evaluations during algorithm development. Promising directions for future work include developing a variant of our method that is not restricted to binary reward tasks, and extending the analysis to stochastic tasks. However, even in the binary setting, we believe that methods such as ours can provide a substantially more practical pipeline for evaluating transfer learning and off-policy reinforcement learning algorithms.
Acknowledgements
We would like to thank Razvan Pascanu, Dale Schuurmans, George Tucker, and Paul Wohlhart for valuable discussions.
References
 Babaeizadeh et al. (2018) Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., and Levine, S. Stochastic variational video prediction. In International Conference on Learning Representations, 2018.
 Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Cobbe et al. (2018) Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.
 Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
 Dudik et al. (2011) Dudik, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In ICML, March 2011.
 Dudík et al. (2014) Dudík, M., Erhan, D., Langford, J., Li, L., et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
 Farahmand & Szepesvári (2011) Farahmand, A.-M. and Szepesvári, C. Model selection in reinforcement learning. Mach. Learn., 85(3):299–332, December 2011.
 Hanna et al. (2017) Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for Off-Policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’17, pp. 538–546, Richland, SC, 2017. International Foundation for Autonomous Agents and Multiagent Systems.
 Horvitz & Thompson (1952) Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
 James et al. (2019) James, S., Wohlhart, P., Kalakrishnan, M., Kalashnikov, D., Irpan, A., Ibarz, J., Levine, S., Hadsell, R., and Bousmalis, K. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In IEEE Conference on Computer Vision and Pattern Recognition, March 2019.
 Jiang & Li (2015) Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. November 2015.
 Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, 2002.
 Kalashnikov et al. (2018) Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
 Kiryo et al. (2017) Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pp. 1675–1685, 2017.
 Koos et al. (2010) Koos, S., Mouret, J.-B., and Doncieux, S. Crossing the reality gap in evolutionary robotics by promoting transferable controllers. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 119–126. ACM, 2010.
 Koos et al. (2012) Koos, S., Mouret, J.-B., and Doncieux, S. The transferability approach: Crossing the reality gap in evolutionary robotics. IEEE Transactions on Evolutionary Computation, 17(1):122–145, 2012.
 Lee et al. (2018) Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., and Levine, S. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Liu et al. (2018) Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A., Doshi-Velez, F., and Brunskill, E. Representation balancing MDPs for off-policy policy evaluation. In NeurIPS, 2018.
 Mahmood et al. (2014) Mahmood, A. R., van Hasselt, H. P., and Sutton, R. S. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pp. 3014–3022, 2014.
 Mannor et al. (2007) Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Murphy (2005) Murphy, S. A. A generalization error for Q-Learning. J. Mach. Learn. Res., 6:1073–1097, July 2005.
 Nichol et al. (2018) Nichol, A., Pfau, V., Hesse, C., Klimov, O., and Schulman, J. Gotta learn fast: A new benchmark for generalization in rl. arXiv preprint arXiv:1804.03720, 2018.
 Precup et al. (2000) Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 759–766. Morgan Kaufmann, 2000.
 Quillen et al. (2018) Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., and Levine, S. Deep reinforcement learning for Vision-Based robotic grasping: A simulated comparative evaluation of Off-Policy methods. February 2018.
 Raghu et al. (2018) Raghu, M., Irpan, A., Andreas, J., Kleinberg, R., Le, Q., and Kleinberg, J. Can deep reinforcement learning solve Erdos-Selfridge-Spencer games? In International Conference on Machine Learning, pp. 4235–4243, 2018.
 Riedmiller et al. (2018) Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. In International Conference on Machine Learning, 2018.
 Ross & Bagnell (2010) Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In AISTATS, pp. 661–668, 2010.
 S. Spearman (1904) S. Spearman, C. The proof and measurement of association between two things. The American Journal of Psychology, 15:72–101, 01 1904. doi: 10.2307/1412159.
 Theocharous et al. (2015) Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. Personalized ad recommendation systems for lifetime value optimization with guarantees. In IJCAI, pp. 1806–1812, 2015.
 Thomas & Brunskill (2016) Thomas, P. and Brunskill, E. Data-Efficient Off-Policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, June 2016.
 Thomas et al. (2015) Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-Confidence Off-Policy evaluation. AAAI, 2015.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 van Hasselt et al. (2016) van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 Zhang et al. (2018a) Zhang, A., Ballas, N., and Pineau, J. A dissection of overfitting and generalization in continuous reinforcement learning. June 2018a.
 Zhang et al. (2018b) Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. April 2018b.
Appendix for Off-Policy Evaluation of Generalization for Deep Q-Learning in Binary Reward Tasks
Appendix A Classification error bound
A.1 Trajectory $\tau$ has return 1 $\Leftrightarrow$ all $a_t$ effective
For the forward direction, because $\tau$ ends in a success state, from any $s_t$ the optimal policy $\pi^*$ could reach a success state, so all $a_t$ must be effective.
For the reverse direction, if all $a_t$ are effective, then the final action $a_T$ must be effective. Since the state reached after $a_T$ is a terminal state with no further actions, for $a_T$ to be effective, the trajectory must end with return 1.
A.2 Proof of Theorem 1
By definition, succeeds if and only if at every , it selects an effective . Since is defined as the state distribution conditioned on being effective, the failure rate can be written as
(6)  
(7)  
(8) 
This gives as desired. The bound is tight when .
A.3 Alternate proof connecting to behavioral cloning
Since every policy that picks only effective actions achieves the optimal reward of 1, and $\epsilon$ is defined as the 0-1 loss over states conditioned on not selecting a catastrophic action, we can view $\epsilon$ as the 0-1 behavior cloning loss to an expert policy $\pi^*$. In this section, we present an alternate proof based on behavioral cloning bounds from Ross & Bagnell (2010).
Theorem 2.1 of Ross & Bagnell (2010) proves a $O(T^2 \epsilon)$ cost bound for general MDPs. This differs from the $O(T \epsilon)$ cost derived above. The difference arises because Ross & Bagnell (2010) derive their proof in a general MDP, whose cost is upper bounded by $1$ at every timestep. If $\pi$ deviates from the expert, it receives cost $1$ several times, once for every future timestep. In binary reward MDPs, we only receive this cost once, at the final timestep. Transforming the proof to incorporate our binary reward MDP assumptions lets us recover the upper bound from Appendix A.2. We briefly explain the updated proof, using notation from Ross & Bagnell (2010) to make the connection more explicit.
Define as the expected 0-1 loss at time for under the state distribution of . Since corresponds to state visits conditioned on never picking a catastrophic action, this is the same as our definition of . The MDP is defined by cost instead of reward: cost of state is for all timesteps except the final one, where . Let be the probability hasn’t made a mistake (w.r.t. ) in the first steps, be the state distribution conditioned on no mistakes in the first steps, and be the state distribution conditioned on making at least 1 mistake. In a general MDP with , total cost is bounded by , where the 1st term is cost while following the expert and the 2nd term is an upper bound of cost if outside of the expert distribution. In a binary reward MDP, since for all except , we can ignore every term in the summation except the final one, giving
(9) 
Note , and as shown in the original proof, . Since is a probability, we have , which recovers the bound, and again this is tight when .
(10) 
Appendix B Efficiently computing the OPC
(11) 
Suppose we have $N$ transitions, of which $N^+$ have positive labels. Imagine placing each $Q(s,a)$ on the number line. Each $Q(s,a)$ is annotated with a score: $-\frac{1}{N}$ for unlabeled transitions and $\frac{p(y=1)}{N^+} - \frac{1}{N}$ for positive labeled transitions. Imagine sliding a line from $b = -\infty$ to $b = \infty$. At $b = -\infty$, every transition lies above the line, so the OPC score is the sum of all annotations, $p(y=1) - 1$. The OPC score only updates when this line passes some $Q(s,a)$ in our dataset, and is updated based on the score annotated at $Q(s,a)$. After sorting by Q-value, we can find all possible OPC scores in $O(N)$ time by moving $b$ from $-\infty$ to $\infty$, noting the updated score after each $Q(s,a)$ we pass. Given the validation dataset, we sort the transitions in $O(N \log N)$ time, annotate them appropriately, then compute the maximum over all OPC scores.
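The sweep described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the OPC score at threshold $b$ is $p(y=1)$ times the fraction of positive transitions with $Q(s,a) > b$, minus the fraction of all transitions with $Q(s,a) > b$, and that higher is better. The names `opc_score`, `q_values`, `is_positive`, and `prior` are hypothetical. The sketch sweeps the threshold downward from $+\infty$ (starting at score 0) rather than upward from $-\infty$; both directions visit the same set of candidate scores.

```python
def opc_score(q_values, is_positive, prior=1.0):
    """Maximum over thresholds b of the assumed OPC score.

    q_values:    Q(s, a) for each validation transition.
    is_positive: True for transitions with a positive label, False if unlabeled.
    prior:       assumed positive class prior p(y=1).
    """
    n = len(q_values)
    n_pos = sum(1 for y in is_positive if y)
    if n == 0 or n_pos == 0:
        raise ValueError("need at least one transition and one positive label")

    # Every transition contributes -1/N; positives additionally contribute prior/N+.
    annotated = sorted(
        ((q, (prior / n_pos if y else 0.0) - 1.0 / n)
         for q, y in zip(q_values, is_positive)),
        key=lambda pair: pair[0],
        reverse=True,
    )

    best = 0.0     # b = +infinity selects no transitions, so the score is 0
    running = 0.0  # sum of annotations for all transitions with Q > b
    for i, (q, score) in enumerate(annotated):
        running += score
        # A threshold can only sit between distinct Q-values, so tied
        # Q-values must enter the sum together before we record a score.
        at_boundary = i + 1 == n or annotated[i + 1][0] != q
        if at_boundary and running > best:
            best = running
    return best


if __name__ == "__main__":
    # Two successful transitions with high Q-values, two unlabeled with low ones.
    print(opc_score([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))
```

The sort costs $O(N \log N)$ and the sweep itself is a single $O(N)$ pass, matching the cost described in the text.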
Appendix C Baseline metrics
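We elaborate on the exact expressions used for the baseline metrics. In all baselines, $a'$ is the on-policy action $\arg\max_{a} Q(s', a)$.
Temporal-difference error
The TD error is the squared error between $Q(s,a)$ and the 1-step return estimate of the action’s value.
$$\mathbb{E}_{(s,a,r,s')}\left[\left(Q(s,a) - \left(r + \gamma\, Q(s',a')\right)\right)^2\right] \tag{12}$$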
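Discounted sum of advantages
The difference of the value functions of two policies $\pi$ and $\pi_b$ at state $s_t$ is given by the discounted sum of advantages (Kakade & Langford, 2002; Murphy, 2005) of $\pi$ on episodes induced by $\pi_b$:
$$V^{\pi_b}(s_t) - V^{\pi}(s_t) = \mathbb{E}_{\pi_b}\left[\sum_{t'=t}^{T} \gamma^{t'-t} A^{\pi}(s_{t'}, a_{t'})\right] \tag{13}$$
where $\gamma$ is the discount factor and $A^{\pi}$ the advantage function for policy $\pi$, defined as $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$. Since $\pi_b$ is fixed, estimating (13) is sufficient to compare $\pi_1$ and $\pi_2$. The $\pi$ with smaller score is better.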
(14) 
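Monte-Carlo estimate corrected with the discounted sum of advantages
Estimating $V^{\pi_b}(s_t)$ with the Monte-Carlo return, substituting into Eqn. (13), and rearranging gives
$$V^{\pi}(s_t) = \mathbb{E}_{\pi_b}\left[\sum_{t'=t}^{T} \gamma^{t'-t}\left(r_{t'} - A^{\pi}(s_{t'}, a_{t'})\right)\right] \tag{15}$$
With $V^{\pi}(s_t) = Q^{\pi}(s_t,a_t) - A^{\pi}(s_t,a_t)$, we can obtain an approximate $Q$ estimate depending on the whole episode:
$$Q^{\pi}(s_t,a_t) = \mathbb{E}_{\pi_b}\left[r_t + \sum_{t'=t+1}^{T} \gamma^{t'-t}\left(r_{t'} - A^{\pi}(s_{t'}, a_{t'})\right)\right] \tag{16}$$
The MCC Error is the squared error to this estimate.
$$\mathbb{E}_{\pi_b}\left[\left(Q(s_t,a_t) - \left(r_t + \sum_{t'=t+1}^{T} \gamma^{t'-t}\left(r_{t'} - A^{\pi}(s_{t'}, a_{t'})\right)\right)\right)^2\right] \tag{17}$$
Note that (17) was proposed before by Quillen et al. (2018) as a training loss for a Q-learning variant, but not for the purpose of off-policy evaluation.
Eqn. (12) and Eqn. (16) share the same optimal Q-function, so assuming a perfect learning algorithm, there is no difference in information between these metrics. In practice, the Q-function will not be perfect due to errors in function approximation and optimization. Eqn. (16) is designed to rely on all future rewards from time $t$, rather than just $r_t$. We theorized that using more of the “ground truth” from the validation data could improve the metric’s performance in imperfect learning scenarios.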
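To make the difference between the two regression targets concrete, the sketch below computes both errors for a single episode. It is an illustration under assumptions, not the authors' code: the TD target is taken to be $r_t + \gamma \max_a Q(s_{t+1}, a)$ (with no bootstrap at the final step), the MCC target is taken to be $r_t + \sum_{t' > t} \gamma^{t'-t}\,(r_{t'} - A(s_{t'}, a_{t'}))$ with $A(s,a) = Q(s,a) - \max_{a'} Q(s,a')$, and the names `episode_errors`, `q_taken`, and `q_greedy` are hypothetical.

```python
def episode_errors(rewards, q_taken, q_greedy, gamma=0.9):
    """Return (mean squared TD error, mean squared MCC error) for one episode.

    rewards[t]:  reward received after taking a_t in s_t.
    q_taken[t]:  Q(s_t, a_t) for the action actually in the dataset.
    q_greedy[t]: max_a Q(s_t, a), the value of the evaluated greedy policy.
    """
    horizon = len(rewards)
    if horizon == 0:
        raise ValueError("episode must contain at least one transition")

    # Advantage of the logged action under the evaluated policy (always <= 0).
    adv = [q_taken[t] - q_greedy[t] for t in range(horizon)]

    td_sq, mcc_sq = 0.0, 0.0
    tail = 0.0  # sum over t' > t of gamma^(t'-t) * (r_t' - A_t'), built backwards
    for t in reversed(range(horizon)):
        bootstrap = gamma * q_greedy[t + 1] if t + 1 < horizon else 0.0
        td_target = rewards[t] + bootstrap   # uses only the next step
        mcc_target = rewards[t] + tail       # uses every later reward
        td_sq += (q_taken[t] - td_target) ** 2
        mcc_sq += (q_taken[t] - mcc_target) ** 2
        tail = gamma * (rewards[t] - adv[t] + tail)
    return td_sq / horizon, mcc_sq / horizon


if __name__ == "__main__":
    # A two-step successful episode scored by a self-consistent Q-function.
    print(episode_errors([0.0, 1.0], [0.5, 1.0], [0.5, 1.0], gamma=0.5))
```

A Q-function that is exactly consistent with the logged episode gets zero error under both targets; the two only disagree when the Q-function is imperfect, which is the case the text is concerned with.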
Appendix D Argument for choosing
The positive class prior $p(y=1)$ should intuitively depend on the environment, since some environments will have many more effective $(s,a)$ than others. However, recall how the error $\epsilon$ is defined. Each $\epsilon_t$ is defined as:
(18) 
where the state distribution is defined such that all earlier actions $a_1, \dots, a_{t-1}$ were effective. This is equivalent to following an optimal “expert” policy $\pi^*$, and although we are estimating $\epsilon_t$ from data generated by behavior policy $\pi_b$, we should match the positive class prior $p(y=1)$ we would observe from expert $\pi^*$. Assuming the task is feasible, meaning the policy has effective actions available from the start, we have $R(\pi^*) = 1$. Therefore, although the validation dataset will likely have both successes and failures, a prior of $p(y=1) = 1$ is the ideal prior, and this holds independently of the environment. As a didactic toy example, we show this holds for a binary tree domain. In this domain, each node is a state, each action moves to one of the node's two children, and leaf nodes are terminal with reward $0$ or $1$. We try several values of $p(y=1)$ in two extremes: only 1 leaf fails, or only 1 leaf succeeds. Validation data is from the uniformly random policy. The frequency of effective $(s,a)$ varies a lot between the two extremes, but in both, Spearman correlation monotonically increases with $p(y=1)$ and was best with $p(y=1) = 1$. Fig. 3 shows Spearman correlation of OPC and SoftOPC with respect to $p(y=1)$, when the tree is mostly successes or mostly failures. In both settings $p(y=1) = 1$ has the best correlation.
From an implementation perspective, $p(y=1) = 1$ is also the only choice that can be applied across arbitrary validation datasets. Suppose $\pi_b$, the policy collecting our validation set, succeeds with probability $p$. In the practical computation of OPC presented in Appendix B, we have $N$ transitions, and $N^+ = pN$ of them have positive labels. Each is annotated with a score: $-\frac{1}{N}$ for unlabeled transitions and $\frac{p(y=1)}{pN} - \frac{1}{N}$ for positive labeled transitions. The maximal OPC score will be the sum of all annotations within the interval $(b, \infty)$, and we maximize over $b$.
For unlabeled transitions, the annotation is $-\frac{1}{N}$, which is negative. Suppose the annotation for positive transitions was negative as well. This occurs when $\frac{p(y=1)}{pN} - \frac{1}{N} < 0$. If every annotation is negative, then the optimal choice for $b$ is $b = \infty$, since the empty set has total 0 and every nonempty subset has a negative total. This gives $\text{OPC}(Q) = 0$, no matter what $Q$ we are evaluating, which makes the OPC score entirely independent of episode return.
This degenerate case is undesirable, and happens when $\frac{p(y=1)}{pN} < \frac{1}{N}$, or equivalently $p(y=1) < p$. To avoid this, we should have $p(y=1) \geq p$. If we wish to pick a single $p(y=1)$ that can be applied to data from arbitrary behavior policies $\pi_b$, then we should pick $p(y=1) \geq R(\pi^*)$. In binary reward MDPs where $\pi^*$ can always succeed, this gives $p(y=1) \geq 1$, and since the prior is a probability, it should satisfy $p(y=1) \leq 1$, leaving $p(y=1) = 1$ as the only option.
To complete the argument, we must handle the case where we have a binary reward MDP where $R(\pi^*) < 1$. In a binary reward MDP, the only way to have $R(\pi^*) < 1$ is if the sampled initial state $s_1$ is one where $(s_1, a)$ is catastrophic for all $a$. From these $s_1$, and all future states reachable from $s_1$, the actions $\pi$ chooses do not matter: the final return will always be $0$. Since $\epsilon$ is defined conditioned on only executing effective actions so far, it is reasonable to assume we only wish to compute the expectation over states where our actions can impact reward. If we computed optimal policy return over just the initial states where effective actions exist, we have $R(\pi^*) = 1$, giving $p(y=1) = 1$ once again.
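The degenerate case discussed above can be checked numerically. The sketch below assumes the per-transition annotations are $-1/N$ for unlabeled transitions and $p(y=1)/N^+ - 1/N$ for positive ones, where $N^+$ is the number of positive transitions; the function names and the example dataset sizes are hypothetical.

```python
def transition_annotations(prior, n_total, n_positive):
    """Return (positive_annotation, unlabeled_annotation) for the assumed OPC form."""
    unlabeled = -1.0 / n_total
    positive = prior / n_positive + unlabeled
    return positive, unlabeled


def opc_is_degenerate(prior, n_total, n_positive):
    """True when every annotation is negative, so the best threshold selects nothing."""
    positive, _ = transition_annotations(prior, n_total, n_positive)
    return positive < 0.0


if __name__ == "__main__":
    # Hypothetical validation set: 1,000 transitions, 300 with positive labels.
    for prior in (0.1, 0.2, 0.5, 1.0):
        print(prior, opc_is_degenerate(prior, n_total=1000, n_positive=300))
```

Because $N^+ \leq N$, a prior of 1 never produces a negative annotation for positive transitions, whatever the success rate of the policy that collected the data.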
Appendix E Experiment details
E.1 Binary tree environment details
The binary tree is a full binary tree with levels. The initial state distribution is uniform over all non-leaf nodes. The initial state may sometimes be one where failure is inevitable. The validation dataset is collected by generating 1,000 episodes from the uniformly random policy. For Q-functions, we generate 1,000 random Q-functions by sampling $Q(s,a) \sim U[0,1]$ for every $(s,a)$, defining the policy as $\pi(s) = \arg\max_a Q(s,a)$. We try priors . Code for this environment is available at https://gist.github.com/alexirpan/54ac855db7e0d017656645ef1475ac08.
E.2 Pong details
Fig. 4 is a scatterplot of our Pong results. Each color represents a different hyperparameter setting, as explained in the legend. From top to bottom, the abbreviations in the legend mean:

DQN: trained with DQN

DDQN: trained with Double DQN

DQN_gamma9: trained with DQN, (default is ).

DQN2: trained with DQN, using a different random seed

DDQN2: trained with Double DQN, using a different random seed

DQN_lr1e4: trained with DQN, learning rate (default is ).

DQN_b64: trained with DQN, batch size 64 (default is 32).

DQN_fixranddata: The replay buffer is filled entirely by a random policy, then a DQN model is trained against that buffer, without pushing any new experience into the buffer.

DDQN_fixranddata: The replay buffer is filled entirely by a random policy, then a Double DQN model is trained against that buffer, without pushing new experience into the buffer.
In Fig. 4, models trained with $\gamma = 0.9$ (DQN_gamma9) are highlighted. We noticed that SoftOPC was worse at separating these models than OPC, suggesting the 0-1 loss is preferable in some cases. This is discussed further in Appendix H.
In our implementation, all models were trained in the full version of the Pong game, where the maximum return possible is 21 points. However, to test our approach we create a binary version for evaluation. Episodes in the validation set were truncated after the first point was scored. Return of the policy was estimated similarly: the policy was executed until the first point was scored, and the average return was computed over these episodes. Although the train-time environment is slightly different from the evaluation environment, this procedure is fine for our method, since our method can handle environment shift and we treat $Q(s,a)$ as a black-box scoring function. OPC can be applied as long as the validation dataset matches the semantics of the test environment where we estimate the final return.
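The truncation used to build the binary version of Pong can be sketched as follows. This is an illustrative helper, not the authors' code: it assumes per-step rewards of +1 when the agent scores, -1 when the opponent scores, and 0 otherwise, and the name `truncate_at_first_point` is hypothetical.

```python
def truncate_at_first_point(rewards):
    """Cut an episode at the first scored point.

    Returns (length, success): the number of steps kept, and whether the
    first point was scored by the agent (binary return 1) or not (return 0).
    """
    for t, reward in enumerate(rewards):
        if reward != 0:
            return t + 1, reward > 0
    # No point was scored in the recorded steps: keep everything, count as failure.
    return len(rewards), False


if __name__ == "__main__":
    episode = [0, 0, -1, 0, 0, 1]   # opponent scores first, agent scores later
    length, success = truncate_at_first_point(episode)
    print(length, success)
```

Transitions up to and including the returned length would then form one binary-reward episode in the validation set, with all of its transitions labeled positive only if `success` is true.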
E.3 Simulated grasping details
The objects we use were generated randomly through procedural generation. The resulting objects are highly irregular and are often non-convex. Some example objects are shown in Fig. 4(a).
Fig. 6 demonstrates two generalization problems discussed in the main text: insufficient off-policy training data and mismatched off-policy training data. We trained two models with a limited 100k grasps dataset or a large 900k grasps dataset, then evaluated grasp success. The model with limited data fails to achieve stable grasp success due to overfitting to its limited dataset. Meanwhile, the model with abundant data learns to model the train objects, but fails to model the test objects, since they are unobserved at training time.
E.4 Real-world grasping
Several visual differences between simulation and reality limit the real performance of models trained in simulation (see Fig. 7) and motivate simulation-to-reality methods such as Randomized-to-Canonical Adaptation Networks (RCANs), as proposed by James et al. (2019). The 15 real-world grasping models evaluated were trained using variants of RCAN. These networks train a generator to transform randomized simulation images to a canonical simulated image. A policy is learned over this canonical simulated image. At test time, the generator transforms real images to the same canonical simulated image, facilitating zero-shot transfer. Optionally, the policy can be fine-tuned with real-world data; in this case the real-world training objects are distinct from the evaluation objects. The SoftOPC and real-world grasp success of each model is listed in Table 2. From top to bottom, the abbreviations mean:

Sim: A model trained only in simulation.

Randomized Sim: A model trained only in simulation with the mild randomization scheme from James et al. (2019): random tray texture, object texture and color, robot arm color, lighting direction and brightness, and one of 6 background images consisting of 6 different images from the view of the real-world camera.

Heavy Randomized Sim: A model trained only in simulation with the heavy randomization scheme from James et al. (2019): every randomization from Randomized Sim, as well as slight randomization of the position of the robot arm and tray, randomized position of the divider within the tray (see Figure 1b in main text for a visual of the divider), and a more diverse set of background images.

Randomized Sim + Real (2k): The Randomized Sim Model, after training on an additional 2k grasps collected on-policy on the real robot.

Randomized Sim + Real (3k): The Randomized Sim Model, after training on an additional 3k grasps collected on-policy on the real robot.

Randomized Sim + Real (4k): The Randomized Sim Model, after training on an additional 4k grasps collected on-policy on the real robot.

Randomized Sim + Real (5k): The Randomized Sim Model, after training on an additional 5k grasps collected on-policy on the real robot.

RCAN: The RCAN model, as described in James et al. (2019), trained in simulation with a pixel level adaptation model.

RCAN + Real (2k): The RCAN model, after training on an additional 2k grasps collected on-policy on the real robot.

RCAN + Real (3k): The RCAN model, after training on an additional 3k grasps collected on-policy on the real robot.

RCAN + Real (4k): The RCAN model, after training on an additional 4k grasps collected on-policy on the real robot.

RCAN + Real (5k): The RCAN model, after training on an additional 5k grasps collected on-policy on the real robot.

RCAN + Dropout: The RCAN model with dropout applied in the policy.

RCAN + InputDropout: The RCAN model with dropout applied in the policy and RCAN generator.

RCAN + GradsToGenerator: The RCAN model where the policy and RCAN generator are trained simultaneously, rather than training RCAN first and the policy second.
Model  SoftOPC  Real Grasp Success (%) 

Sim  0.056  16.67 
Randomized Sim  0.072  36.92 
Heavy Randomized Sim  0.040  34.90 
Randomized Sim + Real (2k)  0.129  72.14 
Randomized Sim + Real (3k)  0.141  73.65 
Randomized Sim + Real (4k)  0.149  82.92 
Randomized Sim + Real (5k)  0.152  84.38 
RCAN  0.113  65.69 
RCAN + Real (2k)  0.156  86.46 
RCAN + Real (3k)  0.166  88.34 
RCAN + Real (4k)  0.152  87.08 
RCAN + Real (5k)  0.159  90.71 
RCAN + Dropout  0.112  51.04 
RCAN + InputDropout  0.089  57.71 
RCAN + GradsToGenerator  0.094  58.75 
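As a quick check of how well SoftOPC ranks the models in the table above, the sketch below computes the Spearman rank correlation between the two columns. The data are copied from the table; the helper names `average_ranks` and `spearman` are hypothetical, and the implementation (Pearson correlation of average ranks, to handle the tied SoftOPC value 0.152) is one standard way to compute the statistic rather than necessarily the procedure used by the authors.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# (SoftOPC, real grasp success %) for the 15 models, in table order.
TABLE_2 = [
    (0.056, 16.67), (0.072, 36.92), (0.040, 34.90), (0.129, 72.14),
    (0.141, 73.65), (0.149, 82.92), (0.152, 84.38), (0.113, 65.69),
    (0.156, 86.46), (0.166, 88.34), (0.152, 87.08), (0.159, 90.71),
    (0.112, 51.04), (0.089, 57.71), (0.094, 58.75),
]

if __name__ == "__main__":
    soft_opc = [row[0] for row in TABLE_2]
    success = [row[1] for row in TABLE_2]
    print(spearman(soft_opc, success))
```

A value close to 1 means that ordering the models by SoftOPC reproduces their ordering by real-world grasp success almost exactly.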
Appendix F SoftOPC performance on different validation datasets
For real grasp success we use 7 KUKA LBR IIWA robot arms to each make 102 grasp attempts from 7 different bins with test objects (see Fig. 4(b)). Each grasp attempt is allowed up to 20 steps, and any grasped object is dropped back in the bin. A grasp counts as successful if any of the test objects is held in the gripper at the end of the episode.
For estimating SoftOPC, we use a validation dataset collected from two policies, a poor policy with a success of 28%, and a better policy with a success of 51%. We divided the validation dataset based on the policy used, then evaluated SoftOPC on data from only the poor or good policy. Fig. 8 shows the correlation on these subsets of the validation dataset. The correlation is slightly worse on the poor dataset, but the relationship between SoftOPC and episode reward is still clear.
As an extreme test of robustness, we go back to the simulated grasping environment. We collect a new validation dataset, using the same human-designed policy with $\epsilon$-greedy exploration instead. The resulting dataset is almost all failures, with only 1% of grasps succeeding. However, this dataset also covers a broad range of states, due to being very random. Fig. 9 shows that OPC and SoftOPC still perform reasonably well, despite having very few positive labels. From a practical standpoint, this suggests that OPC and SoftOPC have some robustness to the choice of generation process for the validation dataset.
Appendix G Plots of Q-value distributions
In Fig. 10, we plot the Q-values of two real-world grasping models. The first is trained only in simulation and has poor real-world grasp success. The second is trained with a mix of simulated and real-world data. We plot a histogram of the average Q-value over each episode of the validation set. The better model has a wider separation between successful Q-values and failed Q-values.
Appendix H Comparison of OPC and SoftOPC
We elaborate on the argument presented in the main paper, that OPC performs better when the Q-values $Q(s,a)$ of the compared models have different magnitudes, and otherwise SoftOPC does better. To do so, it is important to consider how the Q-functions were trained. In the tree environments, $Q(s,a)$ was sampled uniformly from $U[0,1]$, so $Q(s,a) \in [0,1]$ by construction. In the grasping environments, the network architecture ends in a sigmoid, so $Q(s,a) \in [0,1]$. In these experiments, SoftOPC did better. In Pong, $Q(s,a)$ was not constrained in any way, and these were the only experiments where the discount factor $\gamma$ was varied between models. Here, OPC did better.
The hypothesis that Q-functions of varying magnitudes favor OPC can be validated in the tree environment. Again, we evaluate 1,000 Q-functions, but instead of sampling every Q-function from the same range, each Q-function is sampled from a range whose size depends on its index. This produces 1,000 different magnitudes between the compared Q-functions. Fig. 10(a) demonstrates that when magnitudes are deliberately changed for each Q-function, the SoftOPC performs worse, whereas the non-parametric OPC is unchanged. To demonstrate this is caused by a difference in magnitude, rather than large absolute magnitude, OPC and SoftOPC are also evaluated over Q-functions that all share the same large range. Every Q-function has high magnitude, but their magnitudes are consistently high. As seen in Fig. 10(b), in this setting the SoftOPC goes back to outperforming OPC.
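The sensitivity to magnitude can be seen directly from the form of the two metrics. The sketch below is an illustration under assumptions, not the authors' code: it takes SoftOPC to be $p(y=1)$ times the mean Q-value over positive transitions minus the mean Q-value over all transitions, and OPC to be the same expression with $Q(s,a)$ replaced by the indicator of $Q(s,a) > b$, maximized over $b$. The function names are hypothetical, and OPC is computed by brute force for brevity.

```python
def soft_opc(q_values, is_positive, prior=1.0):
    """Assumed SoftOPC: prior * mean Q over positives - mean Q over all transitions."""
    positives = [q for q, y in zip(q_values, is_positive) if y]
    return prior * sum(positives) / len(positives) - sum(q_values) / len(q_values)


def opc(q_values, is_positive, prior=1.0):
    """Assumed OPC, maximized over thresholds by brute force (O(N^2))."""
    n = len(q_values)
    n_pos = sum(1 for y in is_positive if y)

    def score(b):
        above_pos = sum(1 for q, y in zip(q_values, is_positive) if y and q > b)
        above_all = sum(1 for q in q_values if q > b)
        return prior * above_pos / n_pos - above_all / n

    # Candidate thresholds: below everything, or exactly at each observed Q-value.
    return max(score(b) for b in [float("-inf")] + list(q_values))


if __name__ == "__main__":
    labels = [True, True, False, False]
    q_small = [0.9, 0.8, 0.2, 0.1]
    q_large = [10.0 * q for q in q_small]  # same ordering, ten times the magnitude

    print("OPC:    ", opc(q_small, labels), opc(q_large, labels))
    print("SoftOPC:", soft_opc(q_small, labels), soft_opc(q_large, labels))
```

Under these assumed definitions, rescaling a Q-function by a positive constant leaves its OPC unchanged, because only the ordering of Q-values matters, while its SoftOPC is multiplied by the same constant. Comparing SoftOPC across Q-functions of different scales therefore partly compares their scales rather than their ranking quality.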
Appendix I Scatterplots of Each Metric
We present scatterplots of each of the metrics in the simulated grasping environment from Sect. 4.2. We trained two Q-functions in a fully off-policy fashion, one with a dataset of 100k episodes, and the other with a dataset of 900k episodes. For every metric, we generate a scatterplot of all the model checkpoints. Each model checkpoint is color coded by whether it was trained with 100k episodes or 900k episodes.