Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
Abstract
The disparate experimental conditions in the recent off-policy policy evaluation (OPE) literature make it difficult both for practitioners to choose a reliable estimator for their application domain, and for researchers to identify fruitful research directions. In this work, we present the first detailed empirical study of a broad suite of OPE methods. Based on thousands of experiments and empirical analysis, we offer a summarized set of guidelines to advance the understanding of OPE performance in practice, and suggest directions for future research. Along the way, our empirical findings challenge several commonly held beliefs about which class of approaches tends to perform well. Our accompanying software implementation serves as a first comprehensive benchmark for OPE.
1 Introduction
We focus on understanding the relative performance of existing methods for off-policy policy evaluation (OPE), which is the problem of estimating the value of a target policy using only pre-collected historical data generated by another policy. The earliest OPE methods rely on classical importance sampling to handle the distribution mismatch between the target and behavior policies Precup et al. (2000). Many advanced OPE methods have since been proposed for both contextual bandits Dudík et al. (2011); Bottou et al. (2013); Swaminathan et al. (2017); Wang et al. (2017); Li et al. (2015); Ma et al. () and reinforcement learning settings Jiang and Li (2016); Dudík et al. (2011); Farajtabar et al. (2018); Liu et al. (2018); Xie et al. (2019). These new developments reflect practical interests in deploying reinforcement learning in safety-critical situations Li et al. (2011); Wiering (2000); Bottou et al. (2013); Bang and Robins (2005), and the increasing importance of off-policy learning and counterfactual reasoning more broadly Degris et al. (2012); Thomas et al. (2017); Munos et al. (2016); Le et al. (2019); Liu et al. (2019); Nie et al. (2019). OPE is also closely related to the problem of dynamic treatment regimes in the causal inference literature Murphy et al. (2001).
Empirical validations have long contributed to the scientific understanding and advancement of machine learning techniques Chapelle and Li (2011); Caruana et al. (2008); Caruana and Niculescu-Mizil (2006). Recently, many have called for careful examination of empirical findings of contemporary deep learning and deep reinforcement learning efforts Henderson et al. (2018); Locatello et al. (2019). As OPE is central to real-world applications of reinforcement learning, an in-depth empirical understanding is critical to ensure usefulness and accelerate progress. While many recent methods are built on sound mathematical principles, a practitioner is often faced with the non-trivial task of selecting the most appropriate estimator for their application. A notable gap in the current literature is a comprehensive empirical understanding of contemporary methods, due in part to the disparate testing environments and varying experimental conditions among prior work. Consequently, there is little holistic insight into where different methods particularly shine, nor a systematic summary of the challenges one may encounter in different scenarios. Researchers and practitioners may reasonably deduce the following commonly held impressions from surveying the literature:

- Doubly robust methods are often assumed to outperform direct and importance sampling methods.
- Horizon length is the primary driver of poor performance for OPE estimators.
- Model-based is the go-to direct method, either standalone or as part of a doubly-robust estimator.
The reality, as we will discuss, is much more nuanced. In this work, we take a closer look at recently proposed methods and offer a thorough empirical study of a wide range of estimators. We design various experimental conditions to explore the success and failure modes of different methods. We synthesize general insights to guide practitioners, and suggest directions for future research. Finally, we provide a highly extensible software package that can interface with new experimental environments and methods to run new OPE experiments at scale.
2 Preliminaries
As per the RL standard, we represent the environment by an MDP $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$. $\mathcal{S}$ is the state space (or observation space in the non-Markov case), $\mathcal{A}$ is the (finite) action space, $P$ is the transition function, $R$ is the reward function, and $\gamma$ is the discount factor. A policy $\pi$ maps states to a distribution over actions, and $\pi(a|s)$ denotes the probability of choosing action $a$ in state $s$.
OPE is typically considered in the episodic RL setting. A behavior policy $\pi_b$ generates a historical data set $D = \{\tau^{(i)}\}_{i=1}^{N}$ of $N$ trajectories (or episodes), where $i$ indexes over trajectories, and $\tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, r_0^{(i)}, \ldots, s_{T-1}^{(i)}, a_{T-1}^{(i)}, r_{T-1}^{(i)})$. The episode length $T$ is frequently assumed to be fixed for notational convenience. In practice, one can pad additional absorbing states to handle variable lengths. Given a desired evaluation policy $\pi_e$, the OPE problem is to estimate the value $V(\pi_e)$, defined as:
$$V(\pi_e) = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$
with $s_0 \sim d_0$, $a_t \sim \pi_e(\cdot | s_t)$, $r_t \sim R(s_t, a_t)$, $s_{t+1} \sim P(\cdot | s_t, a_t)$, and $d_0$ is the initial state distribution.
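To make the definition concrete, the following sketch computes the discounted return of a trajectory and the on-policy Monte-Carlo value estimate used as ground truth later in the paper. The function names (`discounted_return`, `monte_carlo_value`) are illustrative, not from the paper's released code; trajectories are assumed to be plain lists of per-step rewards.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards: sum_t gamma^t * r_t, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def monte_carlo_value(trajectories, gamma):
    """On-policy Monte-Carlo estimate of V(pi): average discounted return
    over rollouts collected by executing pi itself."""
    returns = [discounted_return(traj, gamma) for traj in trajectories]
    return sum(returns) / len(returns)
```

OPE asks for the same quantity $V(\pi_e)$, but computed from trajectories collected under $\pi_b$ rather than $\pi_e$, which is what the estimators below address.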
3 Overview of OPE Methods
OPE methods were historically categorized into importance sampling, direct, and doubly robust methods. This demarcation was first introduced for contextual bandits Dudík et al. (2011), and later extended to reinforcement learning Jiang and Li (2016). Some recent methods have blurred the boundary of these categories, such as Retrace($\lambda$) Munos et al. (2016), which uses a product of importance weights over multiple time steps for off-policy correction, and MAGIC Thomas and Brunskill (2016), which switches between importance weighting and direct methods.
In this paper, we propose to regroup OPE into three similar classes of methods, but with an expanded definition of each category. Figure 1 provides an overview of the OPE methods that we consider. The relative positioning of different methods reflects how close they are to being a pure regression-based estimator versus a pure importance sampling-based estimator. Appendix D contains a full description of all methods under consideration.
3.1 Inverse Propensity Scoring (IPS)
Inverse Propensity Scoring (IPS), also called importance sampling, is widely used in statistics Powell and Swann (1966); Hammersley and Handscomb (1964); Horvitz and Thompson (1952) and RL Precup et al. (2000). The key idea is to reweight the rewards in the historical data by the importance sampling ratio between $\pi_e$ and $\pi_b$, i.e., how likely a reward is under $\pi_e$ versus $\pi_b$. IPS methods yield consistent and (typically) unbiased estimates; however, the product of importance weights can be unstable for long time horizons. The cumulative importance weight between $\pi_e$ and $\pi_b$ up to time $t$ is written as $\rho_{0:t} = \prod_{t'=0}^{t} \frac{\pi_e(a_{t'}|s_{t'})}{\pi_b(a_{t'}|s_{t'})}$ (where $\rho_{0:t} = 1$ for $t < 0$). Weighted IPS replaces the normalization factor $N$ by the sum of cumulative importance weights $\sum_{i=1}^{N} \rho_{0:T-1}^{(i)}$. The weighted versions are biased but strongly consistent.
Importance Sampling (IS) takes the form:
$$\hat{V}_{IS} = \frac{1}{N} \sum_{i=1}^{N} \rho_{0:T-1}^{(i)} \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}$$
There are three other main IPS variants that we consider: Per-Decision Importance Sampling (PDIS), Weighted Importance Sampling (WIS), and Per-Decision WIS (PDWIS) (see Appendix Table 3 for full definitions). Other variants of IPS exist but are neither consistent nor unbiased Thomas (2015). IPS often assumes a known $\pi_b$, which may not be possible; one approach is to estimate $\pi_b$ from data Hanna et al. (2019), resulting in a potentially biased estimator that can sometimes outperform traditional IPS methods.
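The four IPS variants differ only in whether weights are applied per trajectory or per decision, and whether they are self-normalized. A minimal sketch, assuming each logged step is a tuple `(p_e, p_b, r)` of action probabilities under the two policies plus the observed reward (the function name `ips_estimates` is our own, not from the paper's codebase):

```python
def ips_estimates(trajs, gamma):
    """Compute IS, WIS, PDIS, and PDWIS value estimates from logged data.

    Each trajectory is a list of (p_e, p_b, r) tuples: the probability of the
    logged action under the evaluation and behavior policies, and the reward.
    """
    n = len(trajs)
    per_traj = []  # per trajectory: [(rho_{0:t}, gamma^t * r_t), ...]
    for traj in trajs:
        rho, steps = 1.0, []
        for t, (p_e, p_b, r) in enumerate(traj):
            rho *= p_e / p_b  # cumulative importance weight rho_{0:t}
            steps.append((rho, gamma ** t * r))
        per_traj.append(steps)
    # Full-trajectory weights and discounted returns for (W)IS.
    full_w = [steps[-1][0] for steps in per_traj]
    returns = [sum(gr for _, gr in steps) for steps in per_traj]
    is_terms = [w * g for w, g in zip(full_w, returns)]
    est = {
        "IS": sum(is_terms) / n,
        "WIS": sum(is_terms) / sum(full_w),  # self-normalized (biased)
        "PDIS": sum(rho * gr for steps in per_traj for rho, gr in steps) / n,
    }
    # PDWIS: normalize the weights separately at each time step.
    pdwis = 0.0
    for t in range(max(len(s) for s in per_traj)):
        at_t = [s[t] for s in per_traj if t < len(s)]
        w_sum = sum(w for w, _ in at_t)
        if w_sum > 0:
            pdwis += sum(w * gr for w, gr in at_t) / w_sum
    est["PDWIS"] = pdwis
    return est
```

When $\pi_e = \pi_b$, all four estimates coincide with the on-policy average; when they differ, the weighted variants trade a small bias for substantially lower variance.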
Table 1: Summary of environment characteristics
Environment  Graph  GraphMC  MC  PixMC  Enduro  GraphPOMDP  GW  PixGW
Markov?  yes  yes  yes  yes  yes  no  yes  yes
State/Obs  position  position  [pos, vel]  pixels  pixels  position  position  pixels
Horizon  4 or 16  250  250  250  1000  2 or 8  25  25
Stoch Env?  variable  no  no  no  no  no  no  variable
Stoch Rew?  variable  no  no  no  no  no  no  no
Sparse Rew?  variable  terminal  terminal  terminal  dense  terminal  dense  dense
Func. Class  tabular  tabular  linear/NN  NN  NN  tabular  tabular  NN
3.2 Direct Methods (DM)
The main distinction of direct methods from IPS is their focus on regression-based techniques to (more) directly estimate the value functions of the evaluation policy ($V$ or $Q$). We consider eight different direct approaches, described completely in Appendix D. Similar to the policy learning literature, we can view OPE through the lens of model-based vs. model-free approaches.
Model-based. Perhaps the most commonly used DM is model-based (also called approximate model, denoted AM), where the transition dynamics, reward function, and termination condition are directly estimated from historical data Jiang and Li (2016); Paduraru (2013). The resulting learned MDP is then used to compute the value of $\pi_e$, e.g., by Monte-Carlo policy evaluation.
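For the tabular case, AM amounts to count-based estimation of the dynamics followed by ordinary policy evaluation on the learned model. A minimal sketch under those assumptions (the helper names `fit_tabular_model` and `evaluate_policy` are ours, not the paper's API):

```python
import numpy as np

def fit_tabular_model(transitions, n_states, n_actions):
    """Estimate P(s'|s,a) and R(s,a) by empirical counts from logged data.

    `transitions` is a list of (s, a, r, s_next) tuples.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        rew_sum[s, a] += r
    sa_counts = counts.sum(axis=2)
    # Normalize counts into probabilities; unvisited (s,a) pairs stay zero.
    P = np.divide(counts, sa_counts[:, :, None], out=np.zeros_like(counts),
                  where=sa_counts[:, :, None] > 0)
    R = np.divide(rew_sum, sa_counts, out=np.zeros_like(rew_sum),
                  where=sa_counts > 0)
    return P, R

def evaluate_policy(P, R, pi, gamma, iters=500):
    """Iterative policy evaluation on the learned MDP: V = R_pi + gamma * P_pi V."""
    R_pi = (pi * R).sum(axis=1)                 # expected reward under pi
    P_pi = np.einsum("sa,sat->st", pi, P)       # state-to-state kernel under pi
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = R_pi + gamma * P_pi @ V
    return V
```

The value estimate for $\pi_e$ is then the expectation of `V` under the initial state distribution; the text's caveat applies: small per-step model errors compound over long horizons.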
Model-free. Estimating the action-value function $Q(s, a)$, parameterized by $\theta$, is the focus of several model-free approaches. The value estimate is then $\hat{V}(\pi_e) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{a \sim \pi_e(\cdot | s_0^{(i)})} [\hat{Q}(s_0^{(i)}, a)]$. A simple example is Fitted Q Evaluation (FQE) Le et al. (2019), which is a model-free counterpart to AM, and is functionally a policy evaluation counterpart to batch Q learning. FQE learns a sequence of estimators $\hat{Q}_k$, where:
$$\hat{Q}_k = \arg\min_{f} \sum_{(s, a, r, s') \in D} \left( f(s, a) - \left( r + \gamma \, \mathbb{E}_{a' \sim \pi_e(\cdot | s')} \left[ \hat{Q}_{k-1}(s', a') \right] \right) \right)^2
$$
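In the tabular case the regression in the FQE update reduces to averaging the backup targets within each $(s, a)$ cell. A minimal sketch under that assumption (the function name `fqe_tabular` is ours, not from the paper's code release):

```python
import numpy as np

def fqe_tabular(transitions, pi_e, gamma, n_states, n_actions, iters=200):
    """Fitted Q Evaluation, tabular case: repeatedly regress Q(s,a) onto
    r + gamma * E_{a'~pi_e}[Q(s', a')] over the logged data.

    `transitions`: list of (s, a, r, s_next, done) tuples.
    `pi_e`: array of shape (n_states, n_actions) with evaluation-policy probs.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        targets = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for s, a, r, s2, done in transitions:
            backup = r if done else r + gamma * pi_e[s2] @ Q[s2]
            targets[s, a] += backup
            counts[s, a] += 1
        # Tabular "regression": average the targets per visited (s,a) cell;
        # unvisited cells keep their previous estimate.
        Q = np.divide(targets, counts, out=Q, where=counts > 0)
    return Q
```

The value estimate follows by averaging $\hat{Q}(s_0, \cdot)$ under $\pi_e$ at the logged initial states, exactly as in the display above.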
Indeed, several model-free methods originated from off-policy learning settings, but are also natural for OPE. Q($\lambda$) Harutyunyan et al. (2016) can be viewed as a generalization of FQE that looks to the horizon limit to incorporate the long-term value into the backup step. Retrace($\lambda$) Munos et al. (2016) and Tree-Backup($\lambda$) Precup et al. (2000) also use full trajectories, but additionally incorporate varying levels of clipped importance weight adjustment. The $\lambda$-dependent term mitigates instability in the backup step, and $\lambda$ is chosen based on the experimental findings of Munos et al. (2016).
Q Regression (QReg) and More Robust Doubly-Robust (MRDR) Farajtabar et al. (2018) are two recently proposed direct methods that make use of cumulative importance weights in deriving the regression estimate for $Q$, solved through a quadratic program. MRDR changes the objective of the regression to directly minimizing the variance of the Doubly-Robust estimator (see Section 3.3).
Liu et al. (2018) recently proposed a method for the infinite horizon setting (IH). While IH can be viewed as a Rao-Blackwellization of the IS estimator, we include it in the DM category because it essentially solves the Bellman equation for state distributions and requires function approximation, which are more characteristic of DM. IH shifts the focus from importance sampling over action sequences to estimating the importance ratio between the state density distributions induced by $\pi_b$ and $\pi_e$. This ratio replaces all but the final importance weight in the IH estimate, which otherwise resembles IS. More recently, several estimators inspired by the density ratio estimation idea have been proposed Nachum et al. (2019); Uehara and Jiang (2019); Xie et al. (2019); we leave evaluation of these new extensions for future work.
3.3 Hybrid Methods (HM)
Hybrid methods subsume doubly robust-like approaches, which combine aspects of both IPS and DM. Standard doubly robust OPE (denoted DR) Jiang and Li (2016) is an unbiased estimator that leverages a DM to decrease the variance of the unbiased estimates produced by importance sampling techniques:
$$\hat{V}_{DR} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \gamma^t \left( \rho_{0:t}^{(i)} \left( r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right) + \rho_{0:t-1}^{(i)} \hat{V}(s_t^{(i)}) \right)$$
where $\hat{V}(s) = \mathbb{E}_{a \sim \pi_e(\cdot|s)}[\hat{Q}(s, a)]$.
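A minimal sketch of the standard DR estimator, assuming the direct method's $\hat{Q}(s_t, a_t)$ and $\hat{V}(s_t)$ values have been precomputed for each logged step (the function name `doubly_robust` and the tuple layout are illustrative, not the paper's API):

```python
def doubly_robust(trajs, gamma):
    """Standard doubly robust OPE estimate.

    Each trajectory is a list of (p_e, p_b, r, q, v): the evaluation/behavior
    action probabilities, the observed reward, and the direct method's
    estimates Q(s_t, a_t) and V(s_t) = E_{a~pi_e}[Q(s_t, a)] at that step.
    """
    total = 0.0
    for traj in trajs:
        rho = 1.0  # rho_{0:t-1} starts at 1 for t = 0
        for t, (p_e, p_b, r, q, v) in enumerate(traj):
            rho_prev, rho = rho, rho * p_e / p_b
            # IS-weighted correction of the reward around the DM baseline,
            # plus the DM's value of the current state.
            total += gamma ** t * (rho * (r - q) + rho_prev * v)
    return total / len(trajs)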
Other HMs include Weighted Doubly-Robust (WDR) and MAGIC (see Appendix D). WDR self-normalizes the importance weights (similar to WIS). MAGIC introduces adaptive switching between DR and DM; in particular, one can imagine using DR to estimate the value for part of a trajectory and then using DM for the remainder. Using this idea, MAGIC Thomas and Brunskill (2016) finds an optimal linear combination among a set that varies the switch point between WDR and DM. Note that any DM that returns $\hat{Q}$ yields a set of corresponding DR, WDR, and MAGIC estimators. As a result, we consider twenty-one hybrid approaches in our experiments.
4 Experiments
Experiment Design Principles. We consider several domain characteristics (simple vs. complex, deterministic vs. stochastic, sparse vs. dense rewards, short vs. long horizon), ($\pi_b$, $\pi_e$) pairs (close vs. far), and data sizes (small vs. large), to study OPE performance under varying conditions.
We use two standard RL benchmarks from OpenAI Brockman et al. (2016): Mountain Car (MC) and the Atari game Enduro. As many RL benchmarks are fixed and deterministic, we design 6 additional environments that allow control over various conditions: (i) Graph domain (tabular, varying stochasticity and horizon), (ii) GraphPOMDP (tabular, control for representation), (iii) GraphMC (simplifying MC to the tabular case), (iv) PixelMC (studying MC in a high-dimensional setting), (v) Gridworld (tabular, long-horizon version), and (vi) PixelGridworld (controlled Gridworld experiments with function approximation).
Altogether, our benchmark consists of eight environments with characteristics summarized in Table 1. Complete descriptions can be found in Appendix E.
Protocol & Metrics. Each experiment depends on specifying the environment and its properties, behavior policy $\pi_b$, evaluation policy $\pi_e$, and the number $N$ of trajectories from rolling out $\pi_b$ for historical data. The true on-policy value is the Monte-Carlo estimate via rollouts of $\pi_e$. We repeat each experiment multiple times with different random seeds. We judge the quality of a method via two metrics:

- Relative mean squared error (Relative MSE): the squared error $(\hat{V}(\pi_e) - V(\pi_e))^2$ normalized by the true value, which allows a fair comparison across different conditions.
- Near-top Frequency: for each experimental condition, we count the number of times each OPE estimator is within a fixed threshold of the best performing estimator, to facilitate aggregate comparison across domains.
Implementation & Hyperparameters. With thirty-three different OPE methods considered, we run thousands of experiments across the above eight domains. Hyperparameters are selected based on the original publication, code release, or author consultation. We maintain a consistent set of hyperparameters for each estimator and each environment across experimental conditions (see hyperparameter choices in appendix Table 13). We create a software package that allows running experiments at scale and easy integration with new domains and techniques for future research. Due to limited space, we show results from selected experimental conditions. The complete results, with the best method in each class highlighted, are available in the appendix.
5 Results
5.1 What is the best method?
The first important takeaway is that there is no clear-cut winner: no single method or method class is consistently the best performer, as multiple environmental factors can influence the accuracy of each estimator. With that caveat in mind, based on the aggregate top performance metrics, we can recommend the following estimators for each method class (see Table 2 and appendix Table B).
Inverse propensity scoring (IPS). In practice, weighted importance sampling, which is biased, tends to be more accurate and data-efficient than unbiased basic importance sampling methods. Among the four IPS-based estimators, PDWIS tends to perform best (Figure 4 left).
Direct methods (DM). Generally, FQE, Q($\lambda$), and IH tend to perform the best among DM (appendix Table B). FQE tends to be more data efficient and is the best method when data is limited (Figure 5). Q($\lambda$) generalizes FQE to multi-step backups, and works particularly well with more data, but is computationally expensive in complex domains. IH is highly competitive in long horizons and with high policy mismatch in a tabular setting (appendix Tables B.1, B.1). In pixel-based domains, however, choosing a good kernel function for IH is not straightforward, and IH can underperform other DM (appendix Table B.1). We provide a numerical comparison among direct methods for tabular (appendix Figure 16) and complex settings (Figure 4 center).
Table 2: Recommendations for each method class
Class  Recommendation  When to use  Prototypical env.  Near-top Freq.
Direct  FQE  Stochastic env, severe policy mismatch  Graph, MC, PixMC  23.7%
Direct  Q($\lambda$)  Compute non-issue, moderate policy mismatch  GW/PixGW  15.0%
Direct  IH  Long horizon, mild policy mismatch, good kernel  GraphMC  19.0%
IPS  PDWIS  Short horizon, mild policy mismatch  Graph  4.7%
Hybrid  MAGIC FQE  Severe model misspecification  GraphPOMDP, Enduro  30.0%
Hybrid  MAGIC Q($\lambda$)  Compute non-issue, severe model misspecification  GraphPOMDP  17.3%
Hybrid methods (HM). With the exception of IH, each DM corresponds to three HM: standard doubly robust (DR), weighted doubly robust (WDR), and MAGIC. For each DM, its WDR version often outperforms its DR version. MAGIC can often outperform WDR and DR. However, MAGIC comes with additional hyperparameters, as one needs to specify the set of partial trajectory lengths to be considered. Unsurprisingly, the performance of HM highly depends on the underlying DM. In our experiments, FQE and Q($\lambda$) are typically the most reliable: MAGIC with FQE or MAGIC with Q($\lambda$) tend to be among the best hybrid methods (see appendix Figures 22-26).
5.2 Key drivers of method accuracy
The main reason for the inconsistent performance of estimators is a set of environmental factors that were inadequately studied in prior work. These coupled factors often impact accuracy interdependently:

- Representation mismatch: Function approximators with insufficient representation power weaken DM, and so do overly rich ones, as they cause overfitting (e.g., tabular classes). These issues do not impact IPS. Severe misspecification favors HM and weakens DM.
- Horizon length: Long horizons hurt all methods, but especially those dependent on importance weights (including IPS, HM, and some DM).
- Policy mismatch: Large divergence between $\pi_b$ and $\pi_e$ hurts all methods, but tends to favor DM in the small data regime relative to HM and IPS. HM will catch up with DM as data size increases.
- Bad estimation of an unknown behavior policy: The quality of the $\pi_b$ estimate depends on the state and action dimensionality and the historical data size. Poor estimates cause HM and IPS to underperform simple DM.
- Environment/Reward stochasticity: Stochastic environments hurt the data efficiency of all methods, but favor DM over HM and IPS.
We perform a series of controlled experiments to isolate the impact of these factors. Figure 3 shows a typical comparison of the best performing method in each class, under a tabular setting with both short and long horizons, and a large mismatch between $\pi_b$ and $\pi_e$. The particular best method in each class may change depending on the specific conditions. Within each class, a general guideline for method selection is summarized in Table 2. The appendix contains the full empirical results of all experiments.
5.3 A recipe for method selection
Figure 2 summarizes our general guideline for navigating key factors that affect the accuracy of different estimators. To guide the readers through the process, we now dive further into our experimental design to test various factors, and discuss the resulting insights.
Do we potentially have representation mismatch? Representation mismatch comes from two sources: model misspecification and poor generalization. Model misspecification refers to insufficient representation power of the function class used to approximate the transition dynamics (in AM), the value function (in other DM), or the state distribution density ratio (in IH).
A tabular representation for MDPs controls for representation mismatch by ensuring adequate function class capacity, as well as zero inherent Bellman error (left branch, Fig 2). In that case, we may still suffer from poor generalization without sufficient data coverage, which depends on other factors in the domain settings.
The effect of representation mismatch (right branch, Fig 2) can be understood via two controlled scenarios:

- Misspecified and poor generalization: We expose the impact of this severe mismatch scenario via the GraphPOMDP construction, where selected information is omitted from an otherwise equivalent Graph MDP. HM substantially outperform DM in this setting (Figure 3 right versus left).
- Misspecified but good generalization: Function classes such as neural networks have powerful generalization ability, but may introduce bias and inherent Bellman error Munos and Szepesvári (2008); Chen and Jiang (2019) (see the linear vs. neural network comparison for Mountain Car in appendix Fig 13). Still, powerful function approximation makes (biased) DM very competitive with HM, especially under limited data and in complex domains (see pixel-Gridworld in appendix Figs 27-29). However, function approximation bias may cause serious problems in high dimensional and long horizon settings. In the extreme case of Enduro (very long horizon and sparse rewards), all DM fail to convincingly outperform a naïve average of the behavior data (appendix Fig 12).
Short horizon vs. long horizon? It is well-known that IPS methods are sensitive to trajectory length Li et al. (2015). Long horizons lead to an exponential blowup of the importance sampling term, which is exacerbated by significant mismatch between $\pi_e$ and $\pi_b$. This issue is inevitable for any unbiased estimator Jiang and Li (2016) (a.k.a. the curse of horizon Liu et al. (2018)). Similar to IPS, DM relying on importance weights also suffer from long horizons (appendix Fig 16), though to a lesser degree. IH aims to bypass the effect of cumulative weighting in long horizons, and indeed performs substantially better than IPS methods in very long horizon domains (Fig 4 left).
A frequently ignored aspect in previous OPE work is a proper distinction between fixed, finite horizon tasks (IPS focus), infinite horizon tasks (IH focus), and indefinite horizon tasks, where the trajectory length is finite but varies depending on the policy. Many applications should properly belong to the indefinite horizon category.
How different are behavior and target policies? Similar to IPS, the performance of DM is negatively correlated with the degree of policy mismatch. Figure 5 shows the interplay of increasing policy mismatch and historical data size on the top DM in the deterministic gridworld. We use the divergence between $\pi_b$ and $\pi_e$ as an environment-independent metric of mismatch between the two policies. The performance of the top DM (FQE, Q($\lambda$), IH) tends to hold up better than IPS methods when the policy gap increases (appendix Figure 18). FQE and IH are best in the small data regime, and Q($\lambda$) performs better as data size increases (Figure 5). Increased policy mismatch weakens the DM that use importance weights (QReg, MRDR, Retrace($\lambda$), and Tree-Backup($\lambda$)).
Do we have a good estimate of the behavior policy? Often the behavior policy may not be known exactly and requires estimation, which can introduce bias and cause HM to underperform DM, especially in the low data regime (e.g., pixel gridworld, appendix Figures 27-29). A similar phenomenon was observed in the statistics literature Kang and Schafer (2007). As the data size increases, HM regain the advantage as the quality of the estimate improves.
Is the environment stochastic or deterministic? While stochasticity affects all methods by straining the data requirement, HM are more negatively impacted than DM (Figure 3 center, Figure 17). This can be justified by, e.g., the variance analysis of DR, which shows that the variance of the value function with respect to stochastic transitions is amplified by cumulative importance weights and then contributes to the overall variance of the estimator; see Jiang and Li (2016, Theorem 1) for further details. We empirically observe that DM frequently outperform their DR versions in the small data case (Figure 17). In a stochastic environment and tabular setting, HM do not provide a significant edge over DM, even in the short horizon case. The gap closes as the data size increases (Figure 17).
5.4 Challenging common wisdom
We close this section by briefly revisiting commonly held beliefs about highlevel performance of OPE methods.
Are HM always better than DM? No. Overall, DM are surprisingly competitive with HM. Under high dimensionality, long horizons, estimated behavior policies, or reward/environment stochasticity, HM can underperform simple DM, sometimes significantly (e.g., see appendix Figure 17).
Concretely, HM can perform worse than DM in the following scenarios that we tested: high-dimensional domains, long horizons, estimated behavior policies, and stochastic environments or rewards. When data is sufficient, or model misspecification is severe, HM do provide consistent improvement over DM.
Is horizon length the most important factor? No. Despite conventional wisdom suggesting IPS methods are most sensitive to horizon length, we find that this is not always the case. Policy divergence can be just as, if not more, meaningful. For comparison, we designed two scenarios with identical policy mismatch as defined in Section 5.3 (see appendix Tables C, C). Starting from a baseline scenario of short horizon and small policy divergence (appendix Table C), extending the horizon length leads to a degradation in accuracy, while a comparable increase in policy divergence causes a similar or larger degradation.
How good is the model-based direct method (AM)? AM can be among the worst performing direct methods (appendix Table B). While AM performs well in the tabular setting in the large data case (appendix Figure 16), it tends to perform poorly in high dimensional settings with function approximation (e.g., Figure 4 center). Fitting the transition model is often more prone to small errors than directly approximating $Q$. Model fitting errors also compound over long horizons.
5.5 Other Considerations
Hyperparameter selection. As with many machine learning techniques, hyperparameter choice affects the performance of most estimators (except IPS estimators). The situation is more acute for OPE than for the online off-policy learning setting, due to the lack of a proper validation signal (such as an online game score). When using function approximation, direct methods may not have satisfactory convergence, and require setting a reasonable termination threshold hyperparameter. QReg and MRDR require extra care to avoid ill-conditioning, such as tuning with L1 and L2 regularization.
Computational considerations. DM are generally significantly more computationally demanding than IPS. In complex domains, model-free iterative methods can be expensive in training time. Iterative DM that incorporate rollouts until the end of trajectories during training (Retrace($\lambda$), Q($\lambda$), Tree-Backup($\lambda$)) are the most computationally demanding.
Sparsity (non-smoothness) of the rewards. Methods that depend on cumulative importance weights are also sensitive to reward sparsity (Figure 19). We recommend normalizing the rewards. As a rough guideline, zero-centering rewards often improves the performance of methods that depend on importance weights. This seemingly naïve practice can actually be viewed as a special case of DR using a constant DM component (baseline), and can yield improvements over vanilla IPS Jiang and Li (2016).
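To illustrate the constant-baseline view, the sketch below runs per-decision IS on centered rewards and adds back the exact discounted value of the constant, assuming a fixed horizon. The function name `centered_pdis` and the tuple layout are illustrative, not from the paper's codebase.

```python
def centered_pdis(trajs, gamma, baseline):
    """Per-decision IS on centered rewards (r - baseline), plus the exact
    discounted value of the constant baseline; a special case of DR with a
    constant direct-method component. Assumes a fixed horizon.

    Each trajectory is a list of (p_e, p_b, r) tuples.
    """
    total = 0.0
    for traj in trajs:
        rho = 1.0
        for t, (p_e, p_b, r) in enumerate(traj):
            rho *= p_e / p_b
            # Importance-weight only the centered part of the reward.
            total += gamma ** t * rho * (r - baseline)
    horizon = max(len(traj) for traj in trajs)
    # The constant baseline's contribution is known exactly, no weighting needed.
    const_value = sum(gamma ** t * baseline for t in range(horizon))
    return total / len(trajs) + const_value
```

With `baseline=0.0` this reduces to plain PDIS; a well-chosen baseline shrinks the magnitude of the importance-weighted terms and hence the variance.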
6 Discussion and Future Directions
The most difficult environments break all estimators. Atari games pose significant challenges for contemporary techniques due to long horizons and high state dimensionality. It is possible that substantially more historical data is required for current OPE methods to succeed. However, to overcome the computational challenges in complex RL domains, it is important to identify principled ways to stabilize iterative methods such as FQE, Retrace($\lambda$), and Q($\lambda$) when using function approximation, as convergence is typically not attainable. Some recent progress has been made in stabilizing batch Q-learning in the off-policy learning setting Fujimoto et al. (2019). It remains to be seen whether a similar approach can also benefit DM for OPE.
Lack of short-horizon benchmarks in high-dimensional settings. Evaluation of other complex RL tasks with short horizons is currently beyond the scope of our study, due to the lack of a natural benchmark. We refer to prior work on OPE for contextual bandits, which are RL problems with horizon 1 Dudík et al. (2011). For contextual bandits, it has been shown that while DR is highly competitive, it is sometimes substantially outperformed by DM Wang et al. (2017). New benchmark tasks should have longer horizons than contextual bandits, but shorter than typical Atari games. We also currently lack natural stochastic environments among high-dimensional RL benchmarks. An example candidate for a medium-horizon, complex OPE domain is an NLP task such as dialogue.
Other OPE settings. Below we outline several practically relevant settings that current literature has overlooked:

- Continuous actions. Recent literature on OPE has focused exclusively on finite actions. OPE for continuous action domains will benefit continuous control applications. Currently, continuous action domains do not work with all IPS and HM (see IPS for continuous contextual bandits by Kallus and Zhou (2018)). Among DM, perhaps only FQE may reasonably work with continuous action tasks with some adaptation.
- Missing data coverage. A common assumption in the analysis of OPE is a full support assumption: $\pi_e(a|s) > 0$ implies $\pi_b(a|s) > 0$, which often ensures unbiasedness of estimators Precup et al. (2000); Liu et al. (2018); Dudík et al. (2011). This assumption may not hold, and is often not verifiable in practice. Practically, violation of this assumption requires regularization of unbiased estimators to avoid ill-conditioning Liu et al. (2018); Farajtabar et al. (2018). One avenue to investigate is to optimize the bias-variance tradeoff when full support does not hold.
- Confounding variables. Existing OPE research often assumes that the behavior policy chooses actions solely based on the state. This assumption is often violated when the decisions in the historical data are made by humans instead of algorithms, who may base their decisions on variables not recorded in the data, causing confounding effects. Tackling this challenge, possibly using techniques from causal inference Tennenholtz et al. (2019); Oberst and Sontag (2019), is an important future direction.
Evaluating new OPE estimators. More recently, several new OPE estimators have been proposed: Nachum et al. (2019); Zhang et al. (2020) further build on the density ratio estimation perspective from IH; Uehara and Jiang (2019) provide a closely related approach that learns value functions from importance ratios; Xie et al. (2019) propose an improvement over standard IPS by estimating the marginalized state distribution in a fashion analogous to IH; Kallus and Uehara (2019a, b) analyze a double reinforcement learning estimator that makes use of estimates of both the $Q$ function and the state density ratio. While we have not included these new additions in our analysis, our software implementation is highly modular and can easily accommodate new estimators and environments.
Algorithmic approach to method selection. While we have identified a general guideline for selecting OPE method, often it is not easy to judge whether some decision criteria are satisfied (e.g., quantifying model misspecification, degree of stochasticity, or appropriate data size). As more OPE methods continue to be developed, an important missing piece is a systematic technique for model selection, given a high degree of variability among existing techniques.
Contents
 1 Introduction
 2 Preliminaries
 3 Overview of OPE Methods
 4 Experiments
 5 Results
 6 Discussion and Future Directions
 A Glossary of Terms
 B Ranking of Methods
 C Supplementary Folklore Backup
 D Methods
 E Environments
 F Experimental Setup
 G Additional Supporting Figures

H Tables of Results, per Environment
 H.1 Detailed Results for Graph
 H.2 Detailed Results for GraphPOMDP
 H.3 Detailed Results for Graph Mountain Car (GraphMC)
 H.4 Detailed Results for Mountain Car (MC)
 H.5 Detailed Results for PixelBased Mountain Car (PixMC)
 H.6 Detailed Results for Gridworld
 H.7 Detailed Results for Pixel Gridworld
 H.8 Detailed Results for Enduro
Appendix A Glossary of Terms
See Table A for a description of the terms used in this paper.
Table A: Glossary of terms
Acronym  Term 

OPE  OffPolicy Policy Evaluation 
State Space  
Action Space  
Transition Function  
Reward Function  
Discount Factor  
Initial State Distribution  
Dataset  
Trajectory/Episode  
Horizon/Episode Length  
Number of episodes in  
Behavior Policy  
Evaluation Policy  
Value, ex:  
ActionValue, ex:  
Cumulative Importance Weight, . If then default is  
IPS  Inverse Propensity Scoring 
DM  Direct Method 
HM  Hybrid Method 
IS  Importance Sampling 
PDIS  PerDecision Importance Sampling 
WIS  Weighted Importance Sampling 
PDWIS  PerDecision Weighted Importance Sampling 
FQE  Fitted Q Evaluation Le et al. (2019) 
IH  Infinite Horizon Liu et al. (2018) 
QReg  Q Regression Farajtabar et al. (2018) 
MRDR  More Robust Doubly Robust Farajtabar et al. (2018) 
AM  Approximate Model (Model Based) 
Q(λ)  Q(λ) Harutyunyan et al. (2016) 
R(λ)  Retrace(λ) Munos et al. (2016) 
Tree  TreeBackup Precup et al. (2000) 
DR  DoublyRobust Jiang and Li (2016); Dudík et al. (2011) 
WDR  Weighted DoublyRobust Dudík et al. (2011) 
MAGIC  Model And Guided Importance Sampling Combining (Estimator) Thomas and Brunskill (2016) 
Graph  Graph Environment 
GraphMC  Graph Mountain Car Environment 
MC  Mountain Car Environment 
PixMC  PixelBased Mountain Car Environment 
Enduro  Enduro Environment 
GraphPOMDP  GraphPOMDP Environment 
GW  Gridworld Environment 
PixGW  PixelBased Gridworld Environment 
Appendix B Ranking of Methods
A method whose Relative MSE is within a fixed tolerance of the lowest Relative MSE is counted as a top method; the fraction of experiments in which this occurs is a method's Near-top Frequency, aggregated across all experiments. See Table B for a sorted list of how often each method appears within this tolerance of the best method.
Table B: Fraction of time among the top estimators across all experiments
Method  Near-top Frequency 

MAGIC FQE  0.300211 
DM FQE  0.236786 
IH  0.190275 
WDR FQE  0.177590 
MAGIC  0.173362 
WDR  0.173362 
DM  0.150106 
DR  0.135307 
WDR R(λ)  0.133192 
DR FQE  0.128964 
MAGIC R(λ)  0.107822 
WDR Tree  0.105708 
DR R(λ)  0.105708 
DM R(λ)  0.097252 
DM Tree  0.084567 
MAGIC Tree  0.076110 
DR Tree  0.073996 
DR MRDR  0.073996 
WDR QReg  0.071882 
DM AM  0.065539 
IS  0.063425 
WDR MRDR  0.054968 
PDWIS  0.046512 
DR QReg  0.044397 
MAGIC AM  0.038055 
MAGIC MRDR  0.033827 
DM MRDR  0.033827 
PDIS  0.033827 
MAGIC QReg  0.027484 
WIS  0.025370 
NAIVE  0.025370 
DM QReg  0.019027 
DR AM  0.012685 
WDR AM  0.006342 
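The Near-top Frequency statistic can be computed as in the following sketch. The 10% tolerance is a hypothetical placeholder (the exact threshold is not shown here), and the function name and input layout are ours, not from the benchmark code.

```python
import numpy as np

def near_top_frequency(rel_mse, tol=0.10):
    """Fraction of experiments in which each method's Relative MSE falls
    within `tol` of the best (lowest) Relative MSE for that experiment.

    rel_mse: dict mapping method name -> array of Relative MSEs,
             one entry per experiment (same length for all methods).
    """
    methods = list(rel_mse)
    scores = np.array([rel_mse[m] for m in methods])  # (n_methods, n_experiments)
    best = scores.min(axis=0)                         # best Relative MSE per experiment
    near_top = scores <= best * (1.0 + tol)           # within tol of the best
    return {m: near_top[i].mean() for i, m in enumerate(methods)}

# Toy usage with two methods over three experiments
freqs = near_top_frequency({
    "FQE": np.array([1.0, 2.0, 1.0]),
    "IS":  np.array([1.05, 1.0, 5.0]),
})
```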
B.1 Decision Tree Support
The tables below provide numerical support for the decision tree in the main paper (Figure 2). Each table corresponds to a child node in the decision tree, ordered from left to right. For example, the first table corresponds to the leftmost child node (properly specified, short horizon, small policy mismatch) while the last corresponds to the rightmost child node (misspecified, good representation, long horizon, good estimate).
Table: Near-top Frequency among the properly specified, short horizon, small policy mismatch experiments
DM  Hybrid  
Direct  DR  WDR  MAGIC  
AM  4.7%  4.7%  3.1%  4.7% 
QReg  0.0%  4.7%  6.2%  4.7% 
MRDR  7.8%  14.1%  7.8%  7.8% 
FQE  40.6%  23.4%  21.9%  34.4% 
R  17.2%  20.3%  20.3%  14.1% 
Q  21.9%  18.8%  18.8%  17.2% 
Tree  15.6%  12.5%  12.5%  14.1% 
IH  17.2%       
IPS  
Standard  PerDecision  
IS  4.7%  4.7% 
WIS  3.1%  3.1% 
NAIVE  1.6%   
Table: Near-top Frequency among the properly specified, short horizon, large policy mismatch experiments
DM  Hybrid  
Direct  DR  WDR  MAGIC  
AM  20.3%  1.6%  0.0%  7.8% 
QReg  1.6%  1.6%  3.1%  1.6% 
MRDR  3.1%  1.6%  6.2%  1.6% 
FQE  35.9%  14.1%  17.2%  37.5% 
R  23.4%  14.1%  20.3%  23.4% 
Q  15.6%  15.6%  14.1%  20.3% 
Tree  21.9%  12.5%  18.8%  21.9% 
IH  29.7%       
IPS  

Standard  PerDecision  
IS  0.0%  0.0% 
WIS  0.0%  1.6% 
NAIVE  3.1%   
Table: Near-top Frequency among the properly specified, long horizon, small policy mismatch experiments
DM  Hybrid  
Direct  DR  WDR  MAGIC  
AM  6.9%  0.0%  0.0%  5.6% 
QReg  0.0%  1.4%  1.4%  1.4% 
MRDR  1.4%  0.0%  1.4%  2.8% 
FQE  50.0%  22.2%  23.6%  50.0% 
R  13.9%  12.5%  11.1%  9.7% 
Q  20.8%  18.1%  18.1%  18.1% 
Tree  2.8%  1.4%  0.0%  2.8% 
IH  29.2%       
IPS  
Standard  PerDecision  
IS  0.0%  0.0% 
WIS  0.0%  0.0% 
NAIVE  5.6%   
Table: Near-top Frequency among the properly specified, long horizon, large policy mismatch, deterministic env/rew experiments
DM  Hybrid  
Direct  DR  WDR  MAGIC  
AM  3.5%  3.5%  1.8%  1.8% 
QReg  3.5%  1.8%  0.0%  0.0% 
MRDR  3.5%  1.8%  0.0%  0.0% 
FQE  15.8%  17.5%  29.8%  28.1% 
R  1.8%  3.5%  0.0%  0.0% 
Q  22.8%  15.8%  38.6%  24.6% 
Tree  3.5%  3.5%  1.8%  1.8% 
IH  21.1%       
IPS  

Standard  PerDecision  
IS  5.3%  3.5% 
WIS  0.0%  8.8% 
NAIVE  0.0%   
Table: Near-top Frequency among the properly specified, long horizon, large policy mismatch, stochastic env/rew experiments
DM  Hybrid  

Direct  DR  WDR  MAGIC  
AM  14.6%  0.0%  0.0%  8.3% 
QReg  4.2%  2.1%  0.0%  2.1% 
MRDR  4.2%  2.1%  0.0%  0.0% 
FQE  31.2%  2.1%  0.0%  25.0% 
R  4.2%  6.2%  0.0%  0.0% 
Q  2.1%  0.0%  0.0%  2.1% 
Tree  4.2%  6.2%  0.0%  0.0% 
IH  41.7%       
IPS  
Standard  PerDecision  
IS  25.0%  4.2% 
WIS  0.0%  0.0% 
NAIVE  2.1%   
Table: Near-top Frequency among the potentially misspecified, insufficient representation experiments
DM  Hybrid  

Direct  DR  WDR  MAGIC  
AM         
QReg  3.9%  13.7%  25.5%  6.9% 
MRDR  0.0%  18.6%  15.7%  5.9% 
FQE  0.0%  5.9%  13.7%  24.5% 
R         
Q         
Tree         
IH  6.9%       
IPS  

Standard  PerDecision  
IS  10.8%  8.8% 
WIS  9.8%  13.7% 
NAIVE  3.9%   
Table: Near-top Frequency among the potentially misspecified, sufficient representation, poor estimate experiments
DM  Hybrid  

Direct  DR  WDR  MAGIC  
AM  0.0%  0.0%  0.0%  0.0% 
QReg  0.0%  0.0%  3.3%  0.0% 
MRDR  13.3%  6.7%  0.0%  0.0% 
FQE  0.0%  3.3%  6.7%  10.0% 
R  16.7%  0.0%  6.7%  20.0% 
Q  6.7%  0.0%  0.0%  3.3% 
Tree  20.0%  0.0%  6.7%  6.7% 
IH  0.0%       
IPS  
Standard  PerDecision  
IS  3.3%  0.0% 
WIS  0.0%  0.0% 
NAIVE  0.0%   
Table: Near-top Frequency among the potentially misspecified, sufficient representation, good estimate experiments
DM  Hybrid  
Direct  DR  WDR  MAGIC  
AM  0.0%  0.0%  0.0%  2.8% 
QReg  0.0%  0.0%  0.0%  0.0% 
MRDR  0.0%  5.6%  0.0%  5.6% 
FQE  8.3%  8.3%  25.0%  11.1% 
R  2.8%  8.3%  8.3%  19.4% 
Q  5.6%  5.6%  8.3%  0.0% 
Tree  5.6%  8.3%  16.7%  5.6% 
IH  0.0%       
IPS  
Standard  PerDecision  
IS  0.0%  0.0% 
WIS  0.0%  0.0% 
NAIVE  0.0%   
Appendix C Supplementary Folklore Backup
The following tables provide numerical support for how horizon and policy difference affect the performance of the OPE estimators when policy mismatch is held constant. Note that the policy mismatch in the second and third tables is identical. Despite the identical policy mismatch, the longer horizon does not impact the error as much (compared to the baseline in the first table) as moving the evaluation policy far from the behavior policy while keeping the horizon the same.
Table: Graph, relative MSE. Dense rewards. Baseline.
DM  Hybrid  

Direct  DR  WDR  MAGIC  
AM  1.9E3  4.9E3  5.0E3  3.4E3 
QReg  2.4E3  4.3E3  4.2E3  4.5E3 
MRDR  5.8E3  8.9E3  9.4E3  9.2E3 
FQE  1.8E3  1.8E3  1.8E3  1.8E3 
R  1.8E3  1.8E3  1.8E3  1.8E3 
Q  1.8E3  1.8E3  1.8E3  1.8E3 
Tree  1.8E3  1.8E3  1.8E3  1.8E3 
IH  1.6E3       
IPS  

Standard  PerDecision  
IS  5.6E4  8.4E4 
WIS  1.4E3  1.4E3 
NAIVE  6.1E3   
Table: Graph, relative MSE. Dense rewards. Increasing horizon compared to baseline; policies fixed as in the baseline.
DM  Hybrid  

Direct  DR  WDR  MAGIC  
AM  5.6E2  5.9E2  5.9E2  5.3E2 
QReg  3.4E3  1.1E1  1.2E1  9.2E2 
MRDR  1.1E2  2.5E1  2.9E1  3.1E1 
FQE  6.0E2  6.0E2  6.0E2  6.0E2 
R  6.0E2  6.0E2  6.0E2  6.0E2 
Q  6.0E2  6.0E2  6.0E2  6.0E2 
Tree  3.4E1  7.0E3  1.6E3  2.3E3 
IH  4.7E4       
IPS  

Standard  PerDecision  
IS  1.7E2  2.5E3 
WIS  9.5E4  4.9E4 
NAIVE  5.4E3   
Table: Graph, relative MSE. Dense rewards. Increasing policy mismatch compared to baseline; fixed horizon.
DM  Hybrid  

Direct  DR  WDR  MAGIC  
AM  6.6E1  6.7E1  6.6E1  6.6E1 
QReg  5.4E1  6.3E1  1.3E0  9.3E1 
MRDR  5.4E1  7.3E1  2.0E0  2.0E0 
FQE  6.6E1  6.6E1  6.6E1  6.6E1 
R  6.7E1  6.6E1  9.3E1  1.0E0 
Q  6.6E1  6.6E1  6.6E1  6.6E1 
Tree  6.7E1  6.6E1  9.4E1  1.0E0 
IH  1.4E2       
IPS  

Standard  PerDecision  
IS  1.0E0  5.4E1 
WIS  2.0E0  9.7E1 
NAIVE  4.0E0   
Appendix D Methods
Below we include a description of each of the methods we tested.
D.1 Inverse Propensity Scoring (IPS) Methods
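The estimator equations in this subsection were lost in formatting. As a hedged reference, the standard forms (following Precup et al. (2000), with the cumulative importance weight as defined in the glossary and N episodes of horizon T) are:

```latex
% Cumulative importance weight of trajectory i up to time t
\rho^{(i)}_{0:t} = \prod_{j=0}^{t} \frac{\pi_e(a^{(i)}_j \mid s^{(i)}_j)}{\pi_b(a^{(i)}_j \mid s^{(i)}_j)}

% Importance Sampling (IS) and Per-Decision Importance Sampling (PDIS)
V_{\mathrm{IS}} = \frac{1}{N} \sum_{i=1}^{N} \rho^{(i)}_{0:T-1} \sum_{t=0}^{T-1} \gamma^{t} r^{(i)}_{t},
\qquad
V_{\mathrm{PDIS}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \gamma^{t} \rho^{(i)}_{0:t} r^{(i)}_{t}
```

The weighted variants (WIS, PDWIS) replace the 1/N normalization with self-normalized weights, dividing each cumulative weight by the sum of the corresponding weights across episodes.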
D.2 Hybrid Methods
Hybrid methods rely on being supplied an action-value function, an estimate of the evaluation policy's true action-value function, from which one can also derive a state-value estimate. Doubly-Robust (DR): Thomas and Brunskill (2016); Jiang and Li (2016)
Weighted DoublyRobust (WDR): Thomas and Brunskill (2016)
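The DR and WDR equations were lost in formatting. As a hedged reference, the standard step-wise form (Thomas and Brunskill (2016), using the convention that the cumulative importance weight at a negative index equals 1) is:

```latex
V_{\mathrm{DR}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \gamma^{t}
  \left[ \rho^{(i)}_{0:t} \left( r^{(i)}_{t} - \hat{Q}(s^{(i)}_{t}, a^{(i)}_{t}) \right)
       + \rho^{(i)}_{0:t-1} \, \hat{V}(s^{(i)}_{t}) \right]
```

WDR replaces the per-step importance weights divided by N with their self-normalized counterparts.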
MAGIC: Thomas and Brunskill (2016). MAGIC considers a family of partial estimators, each blending a Hybrid estimate for an initial segment of the trajectory with the direct method thereafter. It estimates the bias vector and covariance matrix of these partial estimators, then finds the weights in the probability simplex that minimize the resulting estimate of the mean squared error; the final MAGIC estimate is the corresponding weighted combination of the partial estimators.
MAGIC can thus be thought of as a weighted average of different blends of the DM and Hybrid estimates. In particular, each partial estimator estimates the first several steps of the return according to DR (or WDR) and the remaining steps via the direct method. Hence, MAGIC finds the most appropriate set of weights, trading off between using a direct method and a Hybrid method.
D.3 Direct Methods (DM)
ModelBased
Approximate Model (AM): Jiang and Li (2016) An approach to model-based value estimation is to directly fit the transition dynamics, reward, and terminal condition of the MDP using some form of maximum likelihood or function approximation. This yields a simulation environment from which one can extract the value of a policy using an average over rollouts, where the expectation is taken over initial conditions and the transition dynamics of the simulator.
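The rollout-averaging step above can be sketched as follows. The `model` interface (`reset`/`step`) is hypothetical, standing in for whatever learned simulator AM produces; it is not the paper's actual API.

```python
def am_value_estimate(model, policy, gamma, horizon, n_rollouts=1000):
    """Model-based (AM) value estimate: the average discounted return of
    `policy` under rollouts of a learned simulator."""
    total = 0.0
    for _ in range(n_rollouts):
        s = model.reset()                  # sample from learned initial-state distribution
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                  # action from the evaluation policy
            s, r, done = model.step(s, a)  # learned dynamics, reward, terminal condition
            ret += discount * r
            discount *= gamma
            if done:
                break
        total += ret
    return total / n_rollouts
```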
ModelFree
Every estimator in this section approximates the action-value function of the evaluation policy with a parametrized function. The OPE estimate we seek is then the expectation of this approximate action-value function over the initial state distribution and the evaluation policy's initial action choice.
Direct Model Regression (QReg): Farajtabar et al. (2018)
Fitted Q Evaluation (FQE): Le et al. (2019)
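As an illustration, here is a minimal tabular sketch of the FQE iteration, assuming transitions of the form (s, a, r, s', done) and a known evaluation policy; the paper's implementation uses function approximation, and the function name and data layout here are ours.

```python
import numpy as np

def fqe_tabular(transitions, pi_e, n_states, n_actions, gamma=0.98, n_iters=100):
    """Tabular Fitted Q Evaluation: repeatedly regress Q onto the
    one-step bootstrapped target r + gamma * E_{a' ~ pi_e}[Q(s', a')].

    pi_e: (n_states, n_actions) array of evaluation-policy probabilities.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        targets = {}  # accumulate regression targets per (s, a) pair
        for s, a, r, s2, done in transitions:
            boot = 0.0 if done else gamma * np.dot(pi_e[s2], Q[s2])
            targets.setdefault((s, a), []).append(r + boot)
        for (s, a), ys in targets.items():
            Q[s, a] = np.mean(ys)  # least-squares fit reduces to a mean in the tabular case
    return Q
```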
Retrace(λ) (R(λ)), Tree-Backup (Tree), Q(λ): Munos et al. (2016); Precup et al. (2000); Harutyunyan et al. (2016). These methods iterate an expected-backup operator and differ only in their per-step traces: Retrace(λ) uses clipped importance ratios, Tree-Backup uses the evaluation policy's action probabilities, and Q(λ) uses a constant λ.
More Robust Doubly-Robust (MRDR): Farajtabar et al. (2018). MRDR fits the action-value function via a weighted regression whose objective is chosen to directly minimize the variance of the resulting doubly-robust estimator.
Appendix E Environments
For every environment, we initialize the environment with a fixed horizon length T. If the agent reaches a goal before step T, or if the episode is not over by step T, it transitions to an environment-dependent absorbing state where it stays until time T. For a high-level description of the environment features, see Table 1.
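The fixed-horizon convention above can be sketched as a small padding helper (a sketch of the convention, not the benchmark's code; each environment chooses its own absorbing state):

```python
def pad_to_horizon(trajectory, horizon, absorbing_state):
    """Pad an episode that ended early so every trajectory has exactly
    `horizon` steps: the agent stays in the absorbing state with zero
    reward for the remaining steps."""
    padded = list(trajectory)
    while len(padded) < horizon:
        padded.append((absorbing_state, 0, 0.0))  # (state, action, reward)
    return padded
```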
e.1 Environment Descriptions
Graph
Figure 11 shows a visualization of the ToyGraph environment. The graph is initialized with a fixed horizon and an absorbing state. In each episode, the agent starts at a single starting state and has two actions. At each time step, each action leads to one of two successor states. If the environment is stochastic, we simulate noisy transitions by allowing the agent to slip into the other successor state with some fixed probability. At the final time step, the agent always enters the terminal state. The reward depends on whether the agent transitions to an odd or an even state. If the environment provides sparse rewards, the intermediate rewards are zero and only the parity of the final state reached determines the reward. If the environment's rewards are stochastic, the reward is sampled rather than deterministic, again as a function of the parity of the state reached; the sparse and stochastic case combines both modifications.
GraphPOMDP
Figure 11 shows a visualization of the GraphPOMDP environment. The underlying state structure of GraphPOMDP is exactly the Graph environment. However, states are grouped together based on a choice of GraphPOMDP horizon length, which determines the observable states. The agent is only able to observe these grouped states, not the underlying MDP structure. ModelFail Thomas and Brunskill (2016) is a special case of this environment.
Graph Mountain Car (GraphMC)
Figure 11 shows a visualization of the ToyMC environment. This environment is a 1D graph-based simplification of Mountain Car. The agent starts at the center of the valley and can go left or right. There are states to the left of the starting position and to the right of it, plus a terminal absorbing state. The agent receives a constant negative reward at every timestep, which becomes zero once the agent reaches the goal state at the right end. If the agent reaches the leftmost state and continues left, it remains there. If the agent does not reach the goal state by the final step, the episode terminates and the agent transitions to the absorbing state.
Mountain Car (MC)
We use the OpenAI version of Mountain Car Brockman et al. (2016); Sutton and Barto (2018) with a few simplifying modifications. The car starts in a valley and has to go back and forth to gain enough momentum to scale the mountain and reach the end goal. The state space is given by the position and velocity of the car. At each time step, the car has the following options: accelerate backwards, accelerate forwards, or do nothing. The reward is negative for every time step until the car reaches the goal. While the original trajectory length is capped at 200, we decrease the effective length by applying every action five times before observing the next state. Furthermore, we modify the random initial position from being drawn uniformly to being one of a small set of fixed positions, with no velocity. The environment is initialized with a horizon and an absorbing state at a fixed position with no velocity.
Pixelbased Mountain Car (PixMC)
This environment is identical to Mountain Car except that the state space has been modified from position and velocity to a pixel-based representation of a ball, representing a car, rolling on a hill; see Figure 11. Each frame is an image of the ball on the mountain. One cannot deduce velocity from a single frame, so we represent the state as a stack of two consecutive frames, with the initial frame duplicated at the first step. Everything else is identical between the pixel-based version and the position-velocity version described earlier.
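One way to implement the frame stacking described above is sketched below (the function name and layout are ours, not from the benchmark code):

```python
import numpy as np

def stack_frames(frames, depth=2):
    """Build velocity-revealing pixel states by stacking `depth`
    consecutive frames; early indices are clipped at 0 so the first
    state repeats the initial frame.

    frames: list of (H, W) arrays -> list of (depth, H, W) states.
    """
    states = []
    for t in range(len(frames)):
        # indices t-depth+1 .. t, clipped at 0
        idx = [max(0, t - k) for k in range(depth - 1, -1, -1)]
        states.append(np.stack([frames[i] for i in idx]))
    return states
```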
Enduro
We use OpenAI's implementation of Enduro-v0, an Atari 2600 racing game. We downsample the image to grayscale of size (84, 84). We apply every action one time, and we represent the state as a stack of four consecutive frames, with the initial frame repeated at the start of an episode. See Figure 11 for a visualization.
Gridworld (GW)
Figure 11 shows a visualization of the Gridworld environment. The agent starts at a state in the first row or column (denoted S in the figure) and proceeds through the grid by taking actions given by the four cardinal directions. An agent remains in the same state if it chooses an action that would take it out of the environment. If the agent reaches the goal state in the bottom right corner of the environment, it transitions to a terminal state for the remainder of the trajectory and receives a reward. In the grid, there is a field (denoted F) and there are holes (denoted H), each giving their own rewards; the remaining states give a constant reward.
PixelGridworld (PixelGW)
This environment is identical to Gridworld except that the state space has been modified from position to a pixel-based representation of the position: 1 at the agent's location, 0 elsewhere. We use the same policies as in the Gridworld case.
Environment  Graph  GraphMC  MC  PixMC  Enduro  GraphPOMDP  GW  PixGW 
Is MDP?  yes  yes  yes  yes  yes  no  yes  yes 
State desc.  position  position  [pos, vel]  pixels  pixels  position  position  pixels 
Horizon  4 or 16  250  250  250  1000  2 or 8  25  25  
Stoch Env?  variable  no  no  no  no  no  no  variable 
Stoch Rew?  variable  no  no  no  no  no  no  no 
Sparse Rew?  variable  terminal  terminal  terminal  dense  terminal  dense  dense 
Func. Class  tabular  tabular  linear/NN  NN  NN  tabular  tabular  NN 
Initial state  0  0  variable  variable  gray img  0  variable  variable 
Absorb. state  2T  22  [.5,0]  [.5,0]  zero img  2T  64  zero img 
Frame height  1  1  2  2  4  1  1  1 
Frame skip  1  1  5  5  1  1  1  1 
Appendix F Experimental Setup
F.1 Description of the policies
Graph, GraphPOMDP and GraphMC use static policies with some probability of going left and the complementary probability of going right, independent of state. We vary this probability in our experiments.
GW, PixGW, MC, PixMC, and Enduro all use an ε-greedy policy. In other words, we train a policy (using value iteration or DDQN) and then vary the deviation away from that policy: the mixed policy follows the trained policy with some probability and acts uniformly at random otherwise. We vary this deviation probability in our experiments.
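The mixed policy above can be written down as an explicit action distribution (a sketch; the function name and the ε value in the usage example are illustrative):

```python
import numpy as np

def mixed_policy_probs(greedy_action, n_actions, eps):
    """Action distribution of the mixed policy: the trained (greedy)
    action with probability 1 - eps, uniform otherwise."""
    p = np.full(n_actions, eps / n_actions)  # uniform mass eps, split across actions
    p[greedy_action] += 1.0 - eps            # remaining mass on the greedy action
    return p

p = mixed_policy_probs(greedy_action=2, n_actions=4, eps=0.2)
```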
F.2 Enumeration of Experiments
Graph
See Table 5 for a description of the parameters of the experiments we ran in the Graph Environment. The experiments are the Cartesian product of the table.
Parameters  
Discount factor  .98  
N  
T  
Stochastic Env  {True, False} 
Stochastic Rew  {True, False} 
Sparse Rew  {True, False} 
Seed  {10 random seeds} 
ModelType  Tabular 
Regress  False 
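The Cartesian-product enumeration can be sketched with `itertools.product`; the parameter values below are illustrative placeholders, not the paper's full grid.

```python
from itertools import product

# Each experiment is one combination of every parameter in the table.
grid = {
    "gamma": [0.98],
    "stochastic_env": [True, False],
    "stochastic_rew": [True, False],
    "sparse_rew": [True, False],
    "seed": list(range(10)),
}
experiments = [dict(zip(grid, values)) for values in product(*grid.values())]
```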
GraphPOMDP
See Table 6 for a description of the parameters of the experiments we ran in the GraphPOMDP Environment. The experiments are the Cartesian product of the table.
Parameters  
Discount factor  .98  
N  
(T,H)  