Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning


The disparate experimental conditions in the recent off-policy policy evaluation (OPE) literature make it difficult both for practitioners to choose a reliable estimator for their application domain and for researchers to identify fruitful research directions. In this work, we present the first detailed empirical study of a broad suite of OPE methods. Based on thousands of experiments and empirical analysis, we offer a summarized set of guidelines to advance the understanding of OPE performance in practice, and we suggest directions for future research. Along the way, our empirical findings challenge several commonly held beliefs about which classes of approaches tend to perform well. Our accompanying software implementation serves as a first comprehensive benchmark for OPE.




1 Introduction

We focus on understanding the relative performance of existing methods for off-policy policy evaluation (OPE), which is the problem of estimating the value of a target policy using only pre-collected historical data generated by another policy. The earliest OPE methods rely on classical importance sampling to handle the distribution mismatch between the target and behavior policies Precup et al. (2000). Many advanced OPE methods have since been proposed for both contextual bandits Dudík et al. (2011); Bottou et al. (2013); Swaminathan et al. (2017); Wang et al. (2017); Li et al. (2015); Ma et al. () and reinforcement learning settings Jiang and Li (2016); Dudík et al. (2011); Farajtabar et al. (2018); Liu et al. (2018); Xie et al. (2019). These new developments reflect practical interests in deploying reinforcement learning to safety-critical situations Li et al. (2011); Wiering (2000); Bottou et al. (2013); Bang and Robins (2005), and the increasing importance of off-policy learning and counterfactual reasoning more broadly Degris et al. (2012); Thomas et al. (2017); Munos et al. (2016); Le et al. (2019); Liu et al. (2019); Nie et al. (2019). OPE is also closely related to the problem of dynamic treatment regimes in the causal inference literature Murphy et al. (2001).

Empirical validations have long contributed to the scientific understanding and advancement of machine learning techniques Chapelle and Li (2011); Caruana et al. (2008); Caruana and Niculescu-Mizil (2006). Recently, many have called for careful examination of empirical findings of contemporary deep learning and deep reinforcement learning efforts Henderson et al. (2018); Locatello et al. (2019). As OPE is central to real-world applications of reinforcement learning, an in-depth empirical understanding is critical to ensure usefulness and accelerate progress. While many recent methods are built on sound mathematical principles, a practitioner is often faced with the non-trivial task of selecting the most appropriate estimator for their application. A notable gap in the current literature is a comprehensive empirical understanding of contemporary methods, due in part to the disparate testing environments and varying experimental conditions among prior work. Consequently, there is little holistic insight into where different methods particularly shine, and no systematic summary of the challenges one may encounter in different scenarios. Researchers and practitioners may reasonably deduce the following commonly held impressions from surveying the literature:

  1. Doubly robust methods are often assumed to outperform direct and importance sampling methods.

  2. Horizon length is the primary driver of poor performance for OPE estimators.

  3. Model-based is the go-to direct method, either standalone or as part of a doubly-robust estimator.

The reality, as we will discuss, is much more nuanced. In this work, we take a closer look at recently proposed methods and offer a thorough empirical study of a wide range of estimators. We design various experimental conditions to explore the success and failure modes of different methods. We synthesize general insights to guide practitioners, and suggest directions for future research. Finally, we provide an extensible software package that can interface with new experimental environments and methods to run new OPE experiments at scale.

2 Preliminaries

As is standard in RL, we represent the environment by $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space (or observation space in the non-Markov case), $\mathcal{A}$ is the (finite) action space, $P$ is the transition function, $R$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. A policy $\pi$ maps states to a distribution over actions, and $\pi(a \mid s)$ denotes the probability of choosing $a$ in $s$.

OPE is typically considered in the episodic RL setting. A behavior policy $\pi_b$ generates a historical data set $D = \{\tau^{(i)}\}_{i=1}^{N}$ of trajectories (or episodes), where $i$ indexes over trajectories and $\tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, r_0^{(i)}, \ldots, s_{T-1}^{(i)}, a_{T-1}^{(i)}, r_{T-1}^{(i)})$. The episode length $T$ is frequently assumed to be fixed for notational convenience. In practice, one can pad additional absorbing states to handle variable lengths. Given a desired evaluation policy $\pi_e$, the OPE problem is to estimate the value $V(\pi_e)$, defined as:

$$V(\pi_e) = \mathbb{E}\Big[ \sum_{t=0}^{T-1} \gamma^t r_t \Big],$$

with $s_0 \sim d_0$, $a_t \sim \pi_e(\cdot \mid s_t)$, $r_t \sim R(s_t, a_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, where $d_0$ is the initial state distribution.
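Concretely, $V(\pi_e)$ can be approximated by averaging discounted returns from Monte-Carlo rollouts of $\pi_e$. The following minimal sketch illustrates this baseline computation; the `env_step`, `policy`, and `d0_sample` interfaces are illustrative assumptions, not taken from any particular codebase.

```python
def rollout_value(env_step, policy, s0, gamma=0.95, horizon=100):
    """One Monte-Carlo return: follow `policy` from s0 for up to `horizon` steps.
    `env_step(s, a)` is assumed to return (next_state, reward, done)."""
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r, done = env_step(s, a)
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret

def mc_value(env_step, policy, d0_sample, n=1000, gamma=0.95, horizon=100):
    """Average of n independent rollouts approximates V(pi)."""
    return sum(rollout_value(env_step, policy, d0_sample(), gamma, horizon)
               for _ in range(n)) / n
```

In our experimental protocol (Section 4), an estimate of this form, computed with many rollouts of $\pi_e$, serves as the ground-truth value against which OPE estimators are scored.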

Figure 1: Categorization of OPE methods. Some methods are direct but have IPS influence and thus fit slightly away from the direct methods axis.

3 Overview of OPE Methods

OPE methods have historically been categorized into importance sampling, direct, and doubly robust methods. This demarcation was first introduced for contextual bandits Dudík et al. (2011), and later extended to reinforcement learning Jiang and Li (2016). Some recent methods have blurred the boundary of these categories, such as Retrace($\lambda$) Munos et al. (2016), which uses a product of importance weights over multiple time steps for off-policy correction, and MAGIC Thomas and Brunskill (2016), which switches between importance weighting and direct methods.

In this paper, we propose to regroup OPE into three similar classes of methods, but with expanded definition for each category. Figure 1 provides an overview of OPE methods that we consider. The relative positioning of different methods reflects how close they are to being a pure regression-based estimator versus a pure importance sampling-based estimator. Appendix D contains a full description of all methods under consideration.

3.1 Inverse Propensity Scoring (IPS)

Inverse Propensity Scoring (IPS), also called importance sampling, is widely used in statistics Powell and Swann (1966); Hammersley and Handscomb (1964); Horvitz and Thompson (1952) and RL Precup et al. (2000). The key idea is to reweight the rewards in the historical data by the importance sampling ratio between $\pi_e$ and $\pi_b$, i.e., how likely a reward is under $\pi_e$ versus $\pi_b$. IPS methods yield consistent and (typically) unbiased estimates; however, the product of importance weights can be unstable for long time horizons. The cumulative importance weight between $\pi_e$ and $\pi_b$ is written as $\rho_{0:t} = \prod_{t'=0}^{t} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})}$ (with the convention $\rho_{0:t} = 1$ for $t < 0$). Weighted IPS replaces the normalization factor $N$ by the sum of cumulative importance weights $\sum_{i=1}^{N} \rho_{0:T-1}^{(i)}$. The weighted versions are biased but strongly consistent.

Importance Sampling (IS) takes the form $\hat{V}_{IS} = \frac{1}{N} \sum_{i=1}^{N} \rho_{0:T-1}^{(i)} \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}$. There are three other main IPS variants that we consider: Per-Decision Importance Sampling (PDIS), Weighted Importance Sampling (WIS) and Per-Decision WIS (PDWIS) (see Appendix Table 3 for full definitions). Other variants of IPS exist but are neither consistent nor unbiased Thomas (2015). IPS often assumes known $\pi_b$, which may not be possible – one approach is to estimate $\pi_b$ from data Hanna et al. (2019), resulting in a potentially biased estimator that can sometimes outperform traditional IPS methods.
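These IPS variants differ only in whether the cumulative weight is applied per trajectory or per decision, and whether the estimate is self-normalized. A minimal sketch, assuming a hypothetical trajectory format of per-step tuples $(\pi_e(a_t \mid s_t), \pi_b(a_t \mid s_t), r_t)$:

```python
def ips_estimates(trajs, gamma=1.0):
    """
    trajs: list of trajectories; each is a list of (p_e, p_b, r) tuples,
    where p_e = pi_e(a_t|s_t), p_b = pi_b(a_t|s_t), r = reward.
    Returns (IS, WIS, PDIS) value estimates.
    """
    n = len(trajs)
    full_w, is_terms, pdis_terms = [], [], []
    for traj in trajs:
        rho, ret, pdis, disc = 1.0, 0.0, 0.0, 1.0
        for (p_e, p_b, r) in traj:
            rho *= p_e / p_b        # cumulative importance weight rho_{0:t}
            pdis += disc * rho * r  # per-decision: weight only up to time t
            ret += disc * r
            disc *= gamma
        full_w.append(rho)
        is_terms.append(rho * ret)  # full-trajectory weight times return
        pdis_terms.append(pdis)
    is_est = sum(is_terms) / n
    wis_est = sum(is_terms) / sum(full_w)  # self-normalized: biased, consistent
    pdis_est = sum(pdis_terms) / n
    return is_est, wis_est, pdis_est
```

When $\pi_e = \pi_b$, all three reduce to the on-policy Monte-Carlo average; their differences only appear off-policy, where the weights $\rho_{0:t}$ deviate from 1.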

| Environment  | Graph    | Graph-MC | MC         | Pix-MC | Enduro | Graph-POMDP | GW       | Pix-GW   |
| Markov?      | yes      | yes      | yes        | yes    | yes    | no          | yes      | yes      |
| State/Obs    | position | position | [pos, vel] | pixels | pixels | position    | position | pixels   |
| Horizon (T)  | 4 or 16  | 250      | 250        | 250    | 1000   | 2 or 8      | 25       | 25       |
| Stoch Env?   | variable | no       | no         | no     | no     | no          | no       | variable |
| Stoch Rew?   | variable | no       | no         | no     | no     | no          | no       | no       |
| Sparse Rew?  | variable | terminal | terminal   | terminal | dense | terminal   | dense    | dense    |
| Func. Class  | tabular  | tabular  | linear/NN  | NN     | NN     | tabular     | tabular  | NN       |
Table 1: Environment parameters

3.2 Direct Methods (DM)

The main distinction of direct methods from IPS is the focus on regression-based techniques to (more) directly estimate the value functions of the evaluation policy ($V^{\pi_e}$ or $Q^{\pi_e}$). We consider eight different direct approaches, described completely in appendix D. Similar to the policy learning literature, we can view OPE through the lens of model-based vs. model-free approaches.

Model-based. Perhaps the most commonly used DM is the model-based approach (also called approximate model, denoted AM), where the transition dynamics, reward function and termination condition are directly estimated from historical data Jiang and Li (2016); Paduraru (2013). The resulting learned MDP is then used to compute the value of $\pi_e$, e.g., by Monte-Carlo policy evaluation.
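In the tabular case, AM amounts to count-based maximum-likelihood estimation of the transition, reward, and termination model, followed by iterative policy evaluation of $\pi_e$ in the learned MDP. A minimal sketch under those assumptions (the function and data-format names are ours, for illustration only):

```python
from collections import defaultdict

def am_estimate(data, pi_e, n_states, n_actions, gamma=0.95, iters=200):
    """
    Tabular approximate-model (AM) sketch:
    (1) MLE model from (s, a, r, s2, done) transitions;
    (2) iterative policy evaluation of pi_e in the learned model.
    pi_e[s][a] = probability pi_e assigns to action a in state s.
    """
    cnt = defaultdict(int)
    r_sum = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(int))
    for (s, a, r, s2, done) in data:
        cnt[(s, a)] += 1
        r_sum[(s, a)] += r
        if not done:  # terminal transitions carry no future value
            trans[(s, a)][s2] += 1
    V = [0.0] * n_states
    for _ in range(iters):
        V_new = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                if cnt[(s, a)] == 0:
                    continue  # unvisited pair: the model says nothing
                r_bar = r_sum[(s, a)] / cnt[(s, a)]
                ev = sum(k / cnt[(s, a)] * V[s2]
                         for s2, k in trans[(s, a)].items())
                V_new[s] += pi_e[s][a] * (r_bar + gamma * ev)
        V = V_new
    return V
```

Averaging the returned $V$ over the empirical initial-state distribution then gives the AM value estimate of $\pi_e$.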

Model-free. Estimating the action-value function $Q^{\pi_e}$, parameterized by $\theta$, is the focus of several model-free approaches. The value estimate is then $\hat{V}(\pi_e) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{a \sim \pi_e(\cdot \mid s_0^{(i)})}\big[ \hat{Q}(s_0^{(i)}, a) \big]$. A simple example is Fitted Q Evaluation (FQE) Le et al. (2019), which is a model-free counterpart to AM, and is functionally a policy evaluation counterpart to batch Q-learning. FQE learns a sequence of estimators $\hat{Q}_k$, where each $\hat{Q}_k$ solves the regression problem $\min_{\theta} \sum_{(s, a, r, s') \in D} \big( Q_\theta(s, a) - r - \gamma \, \mathbb{E}_{a' \sim \pi_e(\cdot \mid s')}[\hat{Q}_{k-1}(s', a')] \big)^2$.
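In the tabular case, FQE's iterated regression reduces to averaging Bellman targets over matching $(s, a)$ pairs. A minimal illustrative sketch (names are ours, not from the paper's implementation):

```python
def fqe_tabular(data, pi_e, n_states, n_actions, gamma=0.95, iters=100):
    """
    data: list of (s, a, r, s2, done) transitions collected by pi_b.
    pi_e[s][a] = probability pi_e assigns to action a in state s.
    Each iteration regresses Q_k(s, a) onto r + gamma * E_{a'~pi_e}[Q_{k-1}(s2, a')];
    tabular 'regression' is just averaging targets over matching (s, a) pairs.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        targets, counts = {}, {}
        for (s, a, r, s2, done) in data:
            v_next = 0.0 if done else sum(pi_e[s2][b] * Q[s2][b]
                                          for b in range(n_actions))
            y = r + gamma * v_next  # Bellman target under pi_e
            targets[(s, a)] = targets.get((s, a), 0.0) + y
            counts[(s, a)] = counts.get((s, a), 0) + 1
        for (s, a), tot in targets.items():
            Q[s][a] = tot / counts[(s, a)]
    return Q
```

Note that the backup expectation is taken under $\pi_e$, not $\pi_b$, which is what makes this a policy evaluation (rather than policy learning) procedure on off-policy data.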

Indeed, several model-free methods originated in off-policy learning settings, but are also natural for OPE. Q($\lambda$) Harutyunyan et al. (2016) can be viewed as a generalization of FQE that looks to the horizon limit to incorporate the long-term value into the backup step. Retrace($\lambda$) Munos et al. (2016) and Tree-Backup($\lambda$) Precup et al. (2000) also use full trajectories, but additionally incorporate varying levels of clipped importance weight adjustment. The $\lambda$-dependent term mitigates instability in the backup step, and $\lambda$ is chosen based on the experimental findings of Munos et al. (2016).

Q Regression (Q-Reg) and More Robust Doubly-Robust (MRDR) Farajtabar et al. (2018) are two recently proposed direct methods that make use of cumulative importance weights in deriving the regression estimate for $Q^{\pi_e}$, solved through a quadratic program. MRDR changes the objective of the regression to directly minimizing the variance of the Doubly-Robust estimator (see Section 3.3).

Liu et al. (2018) recently proposed a method for the infinite horizon setting (IH). While IH can be viewed as a Rao-Blackwellization of the IS estimator, we include it in the DM category because it essentially solves the Bellman equation for state distributions and requires function approximation, which are more characteristic of DM. IH shifts the focus from importance sampling over action sequences to estimating the importance ratio between the state density distributions induced by $\pi_b$ and $\pi_e$. This ratio replaces all but the final importance weights in the IH estimate, which otherwise resembles IS. More recently, several estimators inspired by the density ratio estimation idea have been proposed Nachum et al. (2019); Uehara and Jiang (2019); Xie et al. (2019); we leave evaluation of these new extensions for future work.

3.3 Hybrid Methods (HM)

Hybrid methods subsume doubly robust-like approaches, which combine aspects of both IPS and DM. Standard doubly robust OPE (denoted DR) Jiang and Li (2016) is an unbiased estimator that leverages a DM to decrease the variance of the unbiased estimates produced by importance sampling techniques, via the backward recursion:

$$V_{DR}^{T-t} = \hat{V}(s_t) + \rho_t \big( r_t + \gamma V_{DR}^{T-t-1} - \hat{Q}(s_t, a_t) \big),$$

where $\rho_t = \pi_e(a_t \mid s_t) / \pi_b(a_t \mid s_t)$ is the single-step importance ratio, $V_{DR}^{0} = 0$, and $\hat{V}$, $\hat{Q}$ are supplied by a DM. The final estimate averages $V_{DR}^{T}$ over trajectories.
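Standard DR can be computed in a single backward pass per trajectory: the DM baseline is corrected at each step by the importance-weighted residual of the observed reward. A minimal sketch, assuming a hypothetical trajectory format of (state, action, behavior-probability, reward) tuples and user-supplied $\hat{Q}$ and $\hat{V}$ functions:

```python
def dr_estimate(trajs, q_hat, v_hat, pi_e_prob, gamma=0.95):
    """
    trajs: list of trajectories, each a list of (s, a, p_b, r), p_b = pi_b(a|s).
    q_hat(s, a): DM estimate of Q^{pi_e};  v_hat(s) = E_{a~pi_e}[q_hat(s, a)].
    Backward recursion: V_DR <- v_hat(s) + rho * (r + gamma * V_DR - q_hat(s, a)).
    """
    total = 0.0
    for traj in trajs:
        v_dr = 0.0  # value of the empty tail beyond the last step
        for (s, a, p_b, r) in reversed(traj):
            rho = pi_e_prob(s, a) / p_b  # single-step importance ratio
            v_dr = v_hat(s) + rho * (r + gamma * v_dr - q_hat(s, a))
        total += v_dr
    return total / len(trajs)
```

With $\hat{Q} = \hat{V} = 0$ the recursion collapses to per-decision importance sampling, which makes the role of the DM component as a control variate explicit.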

Other HMs include Weighted Doubly-Robust (WDR) and MAGIC (see Appendix D). WDR self-normalizes the importance weights (similar to WIS). MAGIC introduces adaptive switching between DR and DM; in particular, one can imagine using DR to estimate the value for part of a trajectory and then using DM for the remainder. Using this idea, MAGIC Thomas and Brunskill (2016) finds an optimal linear combination among a set of estimators that vary the switch point between WDR and DM. Note that any DM that returns a $\hat{Q}$ estimate yields a set of corresponding DR, WDR, and MAGIC estimators. As a result, we consider twenty-one hybrid approaches in our experiments.

4 Experiments

Experiment Design Principles. We consider several domain characteristics (simple-complex, deterministic-stochastic, sparse-dense rewards, short-long horizon), ($\pi_b$, $\pi_e$) pairs (close-far), and data sizes (small-large), to study OPE performance under varying conditions.

We use two standard RL benchmarks from OpenAI Brockman et al. (2016): Mountain Car (MC) and Enduro Atari game. As many RL benchmarks are fixed and deterministic, we design 6 additional environments that allow control over various conditions: (i) Graph domain (tabular, varying stochasticity and horizon), (ii) Graph-POMDP (tabular, control for representation), (iii) Graph-MC (simplifying MC to tabular case), (iv) Pixel-MC (study MC in high-dimensional setting), (v) Gridworld (tabular, long horizon version) and (vi) Pixel-Gridworld (controlled Gridworld experiments with function approximation).

All together, our benchmark consists of eight environments with characteristics summarized in Table 1. Complete descriptions can be found in Appendix E.

Protocol & Metrics. Each experiment depends on specifying the environment and its properties, behavior policy $\pi_b$, evaluation policy $\pi_e$, and the number of trajectories $N$ collected by rolling out $\pi_b$ for historical data. The true on-policy value is the Monte-Carlo estimate via rollouts of $\pi_e$. We repeat each experiment with multiple random seeds. We judge the quality of a method via two metrics:

  • Relative mean squared error (Relative MSE): $\mathbb{E}\big[ (\hat{V}(\pi_e) - V(\pi_e))^2 \big] / V(\pi_e)^2$, which allows a fair comparison across different conditions.

  • Near-top Frequency: For each experimental condition, we count the number of times each OPE estimator is within a small threshold of the best performing estimator, to facilitate aggregate comparison across domains.
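Both metrics are straightforward to compute from per-condition results. A minimal sketch (the near-top threshold `tol` below is an illustrative placeholder, not the paper's exact value):

```python
def relative_mse(estimates, v_true):
    """Mean squared error of the estimates, normalized by the squared true value."""
    return sum((v - v_true) ** 2 for v in estimates) / (len(estimates) * v_true ** 2)

def near_top_frequency(errors_by_method, tol=0.1):
    """
    errors_by_method: {method_name: [error per experimental condition]}.
    For each condition, a method counts as 'near-top' if its error is within
    a relative tolerance `tol` of the best error in that condition.
    Returns the fraction of conditions in which each method is near-top.
    """
    methods = list(errors_by_method)
    n_cond = len(next(iter(errors_by_method.values())))
    freq = {m: 0 for m in methods}
    for c in range(n_cond):
        best = min(errors_by_method[m][c] for m in methods)
        for m in methods:
            if errors_by_method[m][c] <= best * (1 + tol):
                freq[m] += 1
    return {m: freq[m] / n_cond for m in methods}
```

The normalization in relative MSE is what makes error magnitudes comparable across environments whose true values differ by orders of magnitude.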

Implementation & Hyperparameters. With thirty-three different OPE methods considered, we run thousands of experiments across the above eight domains. Hyperparameters are selected based on the original publications, code releases, or consultation with the authors. We maintain a consistent set of hyperparameters for each estimator and each environment across experimental conditions (see hyperparameter choices in appendix Table 13). We create a software package that allows running experiments at scale and easy integration with new domains and techniques for future research. Due to limited space, we show results from selected experimental conditions. The complete results, with the best method in each class highlighted, are available in the appendix.

5 Results

5.1 What is the best method?

The first important takeaway is that there is no clear-cut winner: no single method or method class is consistently the best performer, as multiple environmental factors can influence the accuracy of each estimator. With that caveat in mind, based on the aggregate top performance metrics, we can recommend the following estimators for each method class (See Table 2 and appendix Table B).

Inverse propensity scoring (IPS). In practice, weighted importance sampling, which is biased, tends to be more accurate and data-efficient than unbiased basic importance sampling methods. Among the four IPS-based estimators, PDWIS tends to perform best (Figure 4 left).

Direct methods (DM). Generally, FQE, Q($\lambda$), and IH tend to perform the best among DM (appendix Table B). FQE tends to be more data-efficient and is the best method when data is limited (Figure 5). Q($\lambda$) generalizes FQE to multi-step backups, and works particularly well with more data, but is computationally expensive in complex domains. IH is highly competitive in long horizons and with high policy mismatch in a tabular setting (appendix Tables B.1, B.1). In pixel-based domains, however, choosing a good kernel function for IH is not straightforward, and IH can underperform other DM (appendix Table B.1). We provide a numerical comparison among direct methods for tabular (appendix Figure 16) and complex settings (Figure 4 center).

| Class  | Recommendation  | When to use                                      | Prototypical env.  | Near-top Freq. |
| Direct | FQE             | Stochastic env, severe policy mismatch           | Graph, MC, Pix-MC  | 23.7% |
| Direct | Q($\lambda$)    | Compute non-issue, moderate policy mismatch      | GW/Pix-GW          | 15.0% |
| Direct | IH              | Long horizon, mild policy mismatch, good kernel  | Graph-MC           | 19.0% |
| IPS    | PDWIS           | Short horizon, mild policy mismatch              | Graph              | 4.7%  |
| Hybrid | MAGIC FQE       | Severe model misspecification                    | Graph-POMDP, Enduro | 30.0% |
| Hybrid | MAGIC Q($\lambda$) | Compute non-issue, severe model misspecification | Graph-POMDP     | 17.3% |
Table 2: Model Selection Guidelines. (For Near-top Frequency, see definition in Section 4 and support in Table B)

Hybrid methods (HM). With the exception of IH, each DM corresponds to three HM: standard doubly robust (DR), weighted doubly robust (WDR), and MAGIC. For each DM, its WDR version often outperforms its DR version, and MAGIC can often outperform both. However, MAGIC comes with additional hyperparameters, as one needs to specify the set of partial trajectory lengths to be considered. Unsurprisingly, the performance of HM depends heavily on the underlying DM. In our experiments, FQE and Q($\lambda$) are typically the most reliable: MAGIC with FQE or MAGIC with Q($\lambda$) tend to be among the best hybrid methods (see appendix Figures 22 - 26).

5.2 Key drivers of method accuracy

The main reason for the inconsistent performance of estimators is a set of environmental factors that are inadequately studied in prior work. These coupled factors often impact accuracy interdependently:

  • Representation mismatch: Function approximators with insufficient representation power weaken DM, and so do overly rich ones as they cause overfitting (e.g., tabular classes). These issues do not impact IPS. Severe misspecification favors HM and weakens DM.

  • Horizon length: Long horizons hurt all methods, but especially those dependent on importance weights (including IPS, HM and some DM).

  • Policy mismatch: Large divergence between and hurts all methods, but tends to favor DM in the small data regime relative to HM and IPS. HM will catch up with DM as data size increases.

  • Bad estimation of unknown behavior policy: $\pi_b$ estimation quality depends on the state and action dimensionality and the historical data size. Poor estimates cause HM and IPS to underperform simple DM.

  • Environment / Reward stochasticity: Stochastic environments hurt the data efficiency of all methods, but favor DM over HM and IPS.

We perform a series of controlled experiments to isolate the impact of these factors. Figure 3 shows a typical comparison of the best performing method in each class, under a tabular setting with both short and long horizons, and a large mismatch between $\pi_b$ and $\pi_e$. The particular best method in each class may change depending on the specific conditions. Within each class, a general guideline for method selection is summarized in Table 2. The appendix contains the full empirical results of all experiments.

5.3 A recipe for method selection

Figure 2 summarizes our general guideline for navigating key factors that affect the accuracy of different estimators. To guide the readers through the process, we now dive further into our experimental design to test various factors, and discuss the resulting insights.

Figure 2: Method Class Selection Decision Tree. Numerical support can be found in Appendix B.1.

Do we potentially have representation mismatch? Representation mismatch comes from two sources: model misspecification and poor generalization. Model misspecification refers to the insufficient representation power of the function class used to approximate either the transition dynamics (AM), value function (other DM), or state distribution density ratio (in IH).

Tabular representation for MDP controls for representation mismatch by ensuring adequate function class capacity, as well as zero inherent Bellman error (left branch, Fig 2). In such case, we may still suffer from poor generalization without sufficient data coverage, which depends on other factors in the domain settings.

The effect of representation mismatch (right branch, Fig 2) can be understood via two controlled scenarios:

  • Misspecified and poor generalization: We expose the impact of this severe mismatch scenario via the Graph-POMDP construction, where selected information is omitted from an otherwise equivalent Graph MDP. HM substantially outperform DM in this setting (Figure 3 right versus left).

  • Misspecified but good generalization: Function classes such as neural networks have powerful generalization ability, but may introduce bias and inherent Bellman error Munos and Szepesvári (2008); Chen and Jiang (2019) (see the linear vs. neural network comparison for Mountain Car in appendix Fig 13). Still, powerful function approximation makes (biased) DM very competitive with HM, especially under limited data and in complex domains (see pixel-Gridworld in appendix Fig 27-29). However, function approximation bias may cause serious problems in high-dimensional and long-horizon settings. In the extreme case of Enduro (very long horizon and sparse rewards), all DM fail to convincingly outperform a naïve average of the behavior data (appendix Fig 12).

Figure 3: Comparing IPS versus DM versus HM under short and long horizon, large policy mismatch and large data. Left: (Graph domain) Deterministic environment. Center: (Graph domain) Stochastic environment and rewards. Right: (Graph-POMDP) Model misspecification (POMDP). Minimum error per class is shown.

Short horizon vs. long horizon? It is well-known that IPS methods are sensitive to trajectory length Li et al. (2015). A long horizon leads to an exponential blow-up of the importance sampling term, which is exacerbated by significant mismatch between $\pi_b$ and $\pi_e$. This issue is inevitable for any unbiased estimator Jiang and Li (2016) (a.k.a. the curse of horizon Liu et al. (2018)). Similar to IPS, DM relying on importance weights also suffer from long horizons (appendix Fig 16), though to a lesser degree. IH aims to bypass the effect of cumulative weighting in long horizons, and indeed performs substantially better than IPS methods in very long horizon domains (Fig 4 left).

A frequently ignored aspect in previous OPE work is a proper distinction between fixed, finite horizon tasks (IPS focus), infinite horizon tasks (IH focus), and indefinite horizon tasks, where the trajectory length is finite but varies depending on the policy. Many applications properly belong to the indefinite horizon category. Applying HM in this setting requires proper padding of the rewards (without altering the value function in the infinite horizon limit), as DR correction typically assumes fixed-length trajectories.

How different are behavior and target policies? Similar to IPS, the performance of DM is negatively correlated with the degree of policy mismatch. Figure 5 shows the interplay of increasing policy mismatch and historical data size on the top DM in the deterministic gridworld. We use a divergence between $\pi_b$ and $\pi_e$ as an environment-independent metric of mismatch between the two policies. The performance of the top DM (FQE, Q($\lambda$), IH) tends to hold up better than IPS methods as the policy gap increases (appendix Figure 18). FQE and IH are best in the small data regime, and Q($\lambda$) performs better as data size increases (Figure 5). Increased policy mismatch weakens the DM that use importance weights (Q-Reg, MRDR, Retrace($\lambda$) and Tree-Backup($\lambda$)).

Do we have a good estimate of the behavior policy? Often the behavior policy is not known exactly and requires estimation, which can introduce bias and cause HM to underperform DM, especially in the low-data regime (e.g., pixel gridworld, appendix Figures 27-29). A similar phenomenon was observed in the statistics literature Kang and Schafer (2007). As the data size increases, HM regain the advantage as the quality of the $\pi_b$ estimate improves.

Is the environment stochastic or deterministic? While stochasticity affects all methods by straining the data requirements, HM are more negatively impacted than DM (Figure 3 center, Figure 17). This can be justified by, e.g., the variance analysis of DR, which shows that the variance of the value function with respect to stochastic transitions is amplified by cumulative importance weights and thus contributes to the overall variance of the estimator; see Jiang and Li (2016, Theorem 1) for details. We empirically observe that DM frequently outperform their DR versions in the small data case (Figure 17). In a stochastic environment and tabular setting, HM do not provide a significant edge over DM, even in the short horizon case; the gap closes as the data size increases (Figure 17).

Figure 4: Left: (Graph domain) Comparing IPS (and IH) under short and long horizon, in a mild policy mismatch setting. PDWIS is often best among IPS, but IH outperforms in long horizon. Center: (Pixel-MC) Comparing direct methods in a high-dimensional, long horizon setting with relatively large policy mismatch. FQE and IH tend to outperform; AM is significantly worse in complex domains. Retrace($\lambda$), Q($\lambda$) and Tree-Backup($\lambda$) are very computationally expensive and thus excluded. Right: (Pixel Gridworld) Comparing MAGIC with different base DM and different data sizes. Large policy mismatch, deterministic environment, known $\pi_b$.

5.4 Challenging common wisdom

We close this section by briefly revisiting commonly held beliefs about high-level performance of OPE methods.

Are HM always better than DM? No. Overall, DM are surprisingly competitive with HM. Under high-dimensionality, long horizons, estimated behavior policies, or reward/environment stochasticity, HM can underperform simple DM, sometimes significantly (e.g., see appendix Figure 17).

Concretely, HM can perform worse than DM in the following scenarios that we tested:

  • Tabular with large policy mismatch, or stochastic environments (appendix Figure 17, Table B.1, B.1).

  • Complex domains with long horizon and unknown behavior policy (appendix Figure 27-29, Table B.1).

When data is sufficient, or model misspecification is severe, HM do provide consistent improvement over DM.

Is horizon length the most important factor? No. Despite conventional wisdom suggesting IPS methods are most sensitive to horizon length, we find that this is not always the case. Policy divergence can be just as, if not more, meaningful. For comparison, we designed two scenarios with identical policy mismatch as defined in Section 5.3 (see appendix Tables C, C). Starting from a baseline scenario of short horizon and small policy divergence (appendix Table C), extending the horizon length leads to a degradation in accuracy, while a comparable increase in policy divergence causes a degradation that is at least as large.

How good is the model-based direct method (AM)? AM can be among the worst performing direct methods (appendix Table B). While AM performs well in the tabular setting with large data (appendix Figure 16), it tends to perform poorly in high-dimensional settings with function approximation (e.g., Figure 4 center). Fitting the transition model is often more prone to small errors than directly approximating $Q^{\pi_e}$, and model fitting errors compound over long horizons.

Figure 5: (Gridworld domain) Errors are directly correlated with policy mismatch but inversely correlated with data size. We pick the best direct methods for illustration. The two plots represent the same figure from two different vantage points. See full figures in appendix.

5.5 Other Considerations

Hyperparameter selection. As with many machine learning techniques, hyperparameter choice affects the performance of most estimators (except IPS estimators). The situation is more acute for OPE than for the online off-policy learning setting, due to the lack of a proper validation signal (such as an online game score). When using function approximation, direct methods may not converge satisfactorily, and require setting a reasonable termination threshold hyperparameter. Q-Reg and MRDR require extra care to avoid ill-conditioning, such as tuning with L1 and L2 regularization. Similarly, the choice of kernel function for IH and of the index set for hybrid methods such as MAGIC has a large impact on performance. In general, given the choice among different hybrid (or direct) methods, we recommend opting for simplicity as a guiding principle.

Computational considerations. DM are generally significantly more computationally demanding than IPS. In complex domains, model-free iterative methods can be expensive to train. Iterative DM that incorporate rollouts until the end of trajectories during training (Retrace($\lambda$), Q($\lambda$), Tree-Backup($\lambda$)) are the most computationally demanding, requiring on the order of $T$ times the number of lookups per gradient step compared to FQE. The model-based method (AM) is expensive at test time when coupled with HM, since rolling out the learned model is required at every state along the trajectory. HM versions of direct methods require $T$ times more inference steps, which is often fast after training. In difficult tasks such as Atari games, running AM, Retrace($\lambda$), Q($\lambda$), or Tree-Backup($\lambda$) can be prohibitively expensive. Q-Reg and MRDR are non-iterative methods and thus are the fastest to execute among DM. The run-time of IH depends on the batch size used to build a kernel matrix for computing state similarity; the batch size should be as large as possible, but a large batch can significantly slow training.

Sparsity (non-smoothness) of the rewards: Methods that depend on cumulative importance weights are also sensitive to reward sparsity (Figure 19). We recommend normalizing the rewards. As a rough guideline, zero-centering rewards often improves the performance of methods that depend on importance weights. This seemingly naïve practice can actually be viewed as a special case of DR using a constant DM component (baseline), and can yield improvements over vanilla IPS Jiang and Li (2016).
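To make the constant-baseline interpretation concrete, the following sketch subtracts a constant $b$ from each reward inside a per-decision IS estimate and adds its known on-policy contribution back; by linearity of expectation, the estimator remains unbiased for any $b$ (the trajectory format here, per-step tuples of $\pi_e$-probability, $\pi_b$-probability, and reward, is an illustrative assumption):

```python
def pdis_with_baseline(trajs, b=0.0, gamma=1.0):
    """
    Per-decision IS with a constant reward baseline b: the per-step term
    b + rho_{0:t} * (r_t - b) has the same expectation as rho_{0:t} * r_t
    (since E[rho_{0:t}] = 1), so the estimate is unbiased for any b, but a
    well-chosen b can reduce variance.  This is DR with a constant DM baseline.
    """
    n = len(trajs)
    est = 0.0
    for traj in trajs:
        rho, disc = 1.0, 1.0
        for (p_e, p_b, r) in traj:
            rho *= p_e / p_b  # cumulative importance weight rho_{0:t}
            est += disc * (b + rho * (r - b)) / n
            disc *= gamma
    return est
```

On-policy (where every $\rho_{0:t} = 1$) the estimate is exactly independent of $b$; off-policy the value of $b$ changes only the variance, not the expectation.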

6 Discussion and Future Directions

The most difficult environments break all estimators. Atari games pose significant challenges for contemporary techniques due to long horizons and high state dimensionality. It is possible that substantially more historical data is required for current OPE methods to succeed. However, to overcome the computational challenges in complex RL domains, it is important to identify principled ways to stabilize iterative methods such as FQE, Retrace($\lambda$), and Q($\lambda$) when using function approximation, as convergence is typically not attainable. Some recent progress has been made in stabilizing batch Q-learning in the off-policy learning setting Fujimoto et al. (2019). It remains to be seen whether similar approaches can also benefit DM for OPE.

Lack of short-horizon benchmarks in high-dimensional settings. Evaluation on other complex RL tasks with short horizons is currently beyond the scope of our study, due to the lack of a natural benchmark. We refer to prior work on OPE for contextual bandits, which are RL problems with horizon 1 Dudík et al. (2011). For contextual bandits, it has been shown that while DR is highly competitive, it is sometimes substantially outperformed by DM Wang et al. (2017). New benchmark tasks should have a longer horizon than contextual bandits, but shorter than typical Atari games. We also currently lack natural stochastic environments among high-dimensional RL benchmarks. Example candidates for medium-horizon, complex OPE domains are NLP tasks such as dialogue.

Other OPE settings. Below we outline several practically relevant settings that current literature has overlooked:

  • Continuous actions. Recent literature on OPE has focused exclusively on finite actions. OPE for continuous action domains would benefit continuous control applications. Currently, standard IPS and HM estimators do not directly apply to continuous action domains (see IPS for continuous contextual bandits by Kallus and Zhou (2018)). Among DM, perhaps only FQE may reasonably work with continuous action tasks with some adaptation.

  • Missing data coverage. A common assumption in the analysis of OPE is full support: $\pi_e(a \mid s) > 0$ implies $\pi_b(a \mid s) > 0$, which often ensures unbiasedness of estimators Precup et al. (2000); Liu et al. (2018); Dudík et al. (2011). This assumption may not hold, and is often not verifiable in practice. Practically, violation of this assumption requires regularization of unbiased estimators to avoid ill-conditioning Liu et al. (2018); Farajtabar et al. (2018). One avenue to investigate is to optimize the bias-variance trade-off when full support does not hold.

  • Confounding variables. Existing OPE research often assumes that the behavior policy chooses actions solely based on the state. This assumption is often violated when the decisions in the historical data are made by humans rather than algorithms; humans may base their decisions on variables not recorded in the data, causing confounding effects. Tackling this challenge, possibly using techniques from causal inference Tennenholtz et al. (2019); Oberst and Sontag (2019), is an important future direction.

Evaluating new OPE estimators. More recently, several new OPE estimators have been proposed: Nachum et al. (2019); Zhang et al. (2020) further build on the perspective of density ratio estimation from IH; Uehara and Jiang (2019) provide a closely related approach that learns value functions from importance ratios; Xie et al. (2019) propose an improvement over standard IPS by estimating the marginalized state distribution in a fashion analogous to IH; Kallus and Uehara (2019a, b) analyze a double reinforcement learning estimator that uses estimates of both the value function and the state density ratio. While we have not included these new additions in our analysis, our software implementation is highly modular and can easily accommodate new estimators and environments.

Algorithmic approach to method selection. While we have identified general guidelines for selecting an OPE method, it is often not easy to judge whether some decision criteria are satisfied (e.g., quantifying model misspecification, degree of stochasticity, or appropriate data size). As more OPE methods continue to be developed, an important missing piece is a systematic technique for method selection, given the high degree of variability among existing techniques.


Appendix A Glossary of Terms

See Table A for a description of the terms used in this paper.


table Glossary of terms

Acronym Term
OPE Off-Policy Policy Evaluation
State Space
Action Space
Transition Function
Reward Function
Discount Factor
Initial State Distribution
Horizon/Episode Length
Number of episodes in
Behavior Policy
Evaluation Policy
Value, ex:
Action-Value, ex:
Cumulative Importance Weight, . If then default is
IPS Inverse Propensity Scoring
DM Direct Method
HM Hybrid Method
IS Importance Sampling
PDIS Per-Decision Importance Sampling
WIS Weighted Importance Sampling
PDWIS Per-Decision Weighted Importance Sampling
FQE Fitted Q Evaluation Le et al. (2019)
IH Infinite Horizon Liu et al. (2018)
Q-Reg Q Regression Farajtabar et al. (2018)
MRDR More Robust Doubly Robust Farajtabar et al. (2018)
AM Approximate Model (Model Based)
Q(λ) Harutyunyan et al. (2016)
R(λ) Retrace(λ) Munos et al. (2016)
Tree Tree-Backup Precup et al. (2000)
DR Doubly-Robust Jiang and Li (2016); Dudík et al. (2011)
WDR Weighted Doubly-Robust Dudík et al. (2011)
MAGIC Model And Guided Importance Sampling Combining (Estimator) Thomas and Brunskill (2016)
Graph Graph Environment
Graph-MC Graph Mountain Car Environment
MC Mountain Car Environment
Pix-MC Pixel-Based Mountain Car Environment
Enduro Enduro Environment
Graph-POMDP Graph-POMDP Environment
GW Gridworld Environment
Pix-GW Pixel-Based Gridworld Environment

Appendix B Ranking of Methods

A method whose Relative MSE is within a fixed tolerance of the method with the lowest Relative MSE is counted as a top method; we call the resulting statistic the Near-top Frequency and aggregate it across all experiments. See Table B for a sorted list of how often each method appears within the tolerance of the best method.
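As an illustration, this aggregation can be sketched as follows. The tolerance value below is a placeholder for illustration, not the threshold used in the paper:

```python
import numpy as np

def near_top_frequency(rel_mse, tol=0.1):
    """Fraction of experiments in which each method's relative MSE falls
    within a multiplicative tolerance of the best method's relative MSE.

    rel_mse: array of shape (n_experiments, n_methods).
    tol: placeholder tolerance; the paper's exact threshold is not shown here.
    """
    rel_mse = np.asarray(rel_mse, dtype=float)
    best = rel_mse.min(axis=1, keepdims=True)      # best method per experiment
    near_top = rel_mse <= best * (1.0 + tol)       # within tolerance of the best
    return near_top.mean(axis=0)                   # aggregate across experiments
```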


table Fraction of time among the top estimators across all experiments

Method Near-top Frequency
MAGIC FQE 0.300211
DM FQE 0.236786
IH 0.190275
WDR FQE 0.177590
MAGIC 0.173362
WDR 0.173362
DM 0.150106
DR 0.135307
WDR R() 0.133192
DR FQE 0.128964
MAGIC R() 0.107822
WDR Tree 0.105708
DR R() 0.105708
DM R() 0.097252
DM Tree 0.084567
MAGIC Tree 0.076110
DR Tree 0.073996
DR MRDR 0.073996
WDR Q-Reg 0.071882
DM AM 0.065539
IS 0.063425
WDR MRDR 0.054968
PDWIS 0.046512
DR Q-Reg 0.044397
MAGIC AM 0.038055
MAGIC MRDR 0.033827
DM MRDR 0.033827
PDIS 0.033827
MAGIC Q-Reg 0.027484
WIS 0.025370
NAIVE 0.025370
DM Q-Reg 0.019027
DR AM 0.012685
WDR AM 0.006342

b.1 Decision Tree Support

Tables B.1-B.1 provide numerical support for the decision tree in the main paper (Figure 2). Each table refers to a child node in the decision tree, ordered from left to right, respectively. For example, Table B.1 refers to the left-most child node (properly specified, short horizon, small policy mismatch) while Table B.1 refers to the right-most child node (misspecified, good representation, long horizon, good estimate).


table Near-top Frequency among the properly specified, short horizon, small policy mismatch experiments

DM Hybrid
AM 4.7% 4.7% 3.1% 4.7%
Q-Reg 0.0% 4.7% 6.2% 4.7%
MRDR 7.8% 14.1% 7.8% 7.8%
FQE 40.6% 23.4% 21.9% 34.4%
R 17.2% 20.3% 20.3% 14.1%
Q 21.9% 18.8% 18.8% 17.2%
Tree 15.6% 12.5% 12.5% 14.1%
IH 17.2% - - -
Standard Per-Decision
IS 4.7% 4.7%
WIS 3.1% 3.1%
NAIVE 1.6% -

table Near-top Frequency among the properly specified, short horizon, large policy mismatch experiments

DM Hybrid
AM 20.3% 1.6% 0.0% 7.8%
Q-Reg 1.6% 1.6% 3.1% 1.6%
MRDR 3.1% 1.6% 6.2% 1.6%
FQE 35.9% 14.1% 17.2% 37.5%
R 23.4% 14.1% 20.3% 23.4%
Q 15.6% 15.6% 14.1% 20.3%
Tree 21.9% 12.5% 18.8% 21.9%
IH 29.7% - - -
Standard Per-Decision
IS 0.0% 0.0%
WIS 0.0% 1.6%
NAIVE 3.1% -

table Near-top Frequency among the properly specified, long horizon, small policy mismatch experiments

DM Hybrid
AM 6.9% 0.0% 0.0% 5.6%
Q-Reg 0.0% 1.4% 1.4% 1.4%
MRDR 1.4% 0.0% 1.4% 2.8%
FQE 50.0% 22.2% 23.6% 50.0%
R 13.9% 12.5% 11.1% 9.7%
Q 20.8% 18.1% 18.1% 18.1%
Tree 2.8% 1.4% 0.0% 2.8%
IH 29.2% - - -
Standard Per-Decision
IS 0.0% 0.0%
WIS 0.0% 0.0%
NAIVE 5.6% -

table Near-top Frequency among the properly specified, long horizon, large policy mismatch, deterministic env/rew experiments

DM Hybrid
AM 3.5% 3.5% 1.8% 1.8%
Q-Reg 3.5% 1.8% 0.0% 0.0%
MRDR 3.5% 1.8% 0.0% 0.0%
FQE 15.8% 17.5% 29.8% 28.1%
R 1.8% 3.5% 0.0% 0.0%
Q 22.8% 15.8% 38.6% 24.6%
Tree 3.5% 3.5% 1.8% 1.8%
IH 21.1% - - -
Standard Per-Decision
IS 5.3% 3.5%
WIS 0.0% 8.8%
NAIVE 0.0% -

table Near-top Frequency among the properly specified, long horizon, large policy mismatch, stochastic env/rew experiments

DM Hybrid
AM 14.6% 0.0% 0.0% 8.3%
Q-Reg 4.2% 2.1% 0.0% 2.1%
MRDR 4.2% 2.1% 0.0% 0.0%
FQE 31.2% 2.1% 0.0% 25.0%
R 4.2% 6.2% 0.0% 0.0%
Q 2.1% 0.0% 0.0% 2.1%
Tree 4.2% 6.2% 0.0% 0.0%
IH 41.7% - - -
Standard Per-Decision
IS 25.0% 4.2%
WIS 0.0% 0.0%
NAIVE 2.1% -

table Near-top Frequency among the potentially misspecified, insufficient representation experiments

DM Hybrid
AM - - - -
Q-Reg 3.9% 13.7% 25.5% 6.9%
MRDR 0.0% 18.6% 15.7% 5.9%
FQE 0.0% 5.9% 13.7% 24.5%
R - - - -
Q - - - -
Tree - - - -
IH 6.9% - - -
Standard Per-Decision
IS 10.8% 8.8%
WIS 9.8% 13.7%
NAIVE 3.9% -

table Near-top Frequency among the potentially misspecified, sufficient representation, poor estimate experiments

DM Hybrid
AM 0.0% 0.0% 0.0% 0.0%
Q-Reg 0.0% 0.0% 3.3% 0.0%
MRDR 13.3% 6.7% 0.0% 0.0%
FQE 0.0% 3.3% 6.7% 10.0%
R 16.7% 0.0% 6.7% 20.0%
Q 6.7% 0.0% 0.0% 3.3%
Tree 20.0% 0.0% 6.7% 6.7%
IH 0.0% - - -
Standard Per-Decision
IS 3.3% 0.0%
WIS 0.0% 0.0%
NAIVE 0.0% -

table Near-top Frequency among the potentially misspecified, sufficient representation, good estimate experiments

DM Hybrid
AM 0.0% 0.0% 0.0% 2.8%
Q-Reg 0.0% 0.0% 0.0% 0.0%
MRDR 0.0% 5.6% 0.0% 5.6%
FQE 8.3% 8.3% 25.0% 11.1%
R 2.8% 8.3% 8.3% 19.4%
Q 5.6% 5.6% 8.3% 0.0%
Tree 5.6% 8.3% 16.7% 5.6%
IH 0.0% - - -
Standard Per-Decision
IS 0.0% 0.0%
WIS 0.0% 0.0%
NAIVE 0.0% -

Appendix C Supplementary Folklore Backup

The following tables provide numerical support for how horizon and policy difference affect the performance of the OPE estimators when policy mismatch is held constant. Notice that the policy mismatch for Tables C and C is identical. What we see here is that, despite identical policy mismatch, lengthening the horizon does not impact the error as much (compared to the baseline, Table C) as moving the target policy far from the behavior policy while keeping the horizon the same.


table Graph, relative MSE. Dense rewards. Baseline.

DM Hybrid
AM 1.9E-3 4.9E-3 5.0E-3 3.4E-3
Q-Reg 2.4E-3 4.3E-3 4.2E-3 4.5E-3
MRDR 5.8E-3 8.9E-3 9.4E-3 9.2E-3
FQE 1.8E-3 1.8E-3 1.8E-3 1.8E-3
R 1.8E-3 1.8E-3 1.8E-3 1.8E-3
Q 1.8E-3 1.8E-3 1.8E-3 1.8E-3
Tree 1.8E-3 1.8E-3 1.8E-3 1.8E-3
IH 1.6E-3 - - -
Standard Per-Decision
IS 5.6E-4 8.4E-4
WIS 1.4E-3 1.4E-3
NAIVE 6.1E-3 -

table Graph, relative MSE. Dense rewards. Increasing horizon compared to baseline, fixed policy mismatch.

DM Hybrid
AM 5.6E-2 5.9E-2 5.9E-2 5.3E-2
Q-Reg 3.4E-3 1.1E-1 1.2E-1 9.2E-2
MRDR 1.1E-2 2.5E-1 2.9E-1 3.1E-1
FQE 6.0E-2 6.0E-2 6.0E-2 6.0E-2
R 6.0E-2 6.0E-2 6.0E-2 6.0E-2
Q 6.0E-2 6.0E-2 6.0E-2 6.0E-2
Tree 3.4E-1 7.0E-3 1.6E-3 2.3E-3
IH 4.7E-4 - - -
Standard Per-Decision
IS 1.7E-2 2.5E-3
WIS 9.5E-4 4.9E-4
NAIVE 5.4E-3 -

table Graph, relative MSE. Dense rewards. Increasing policy mismatch compared to baseline, fixed horizon.

DM Hybrid
AM 6.6E-1 6.7E-1 6.6E-1 6.6E-1
Q-Reg 5.4E-1 6.3E-1 1.3E0 9.3E-1
MRDR 5.4E-1 7.3E-1 2.0E0 2.0E0
FQE 6.6E-1 6.6E-1 6.6E-1 6.6E-1
R 6.7E-1 6.6E-1 9.3E-1 1.0E0
Q 6.6E-1 6.6E-1 6.6E-1 6.6E-1
Tree 6.7E-1 6.6E-1 9.4E-1 1.0E0
IH 1.4E-2 - - -
Standard Per-Decision
IS 1.0E0 5.4E-1
WIS 2.0E0 9.7E-1
NAIVE 4.0E0 -

Appendix D Methods

Below we include a description of each of the methods we tested.

d.1 Inverse Propensity Scoring (IPS) Methods

Standard Per-Decision
Table 3: IPS methods. Dudík et al. (2011); Jiang and Li (2016)

Table 3 shows the calculation for the four traditional IPS estimators: IS, WIS, PDIS, and PDWIS. In addition, we include the following method as well, since it is a Rao-Blackwellization Liu et al. (2018) of the IPS estimators.
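Since the table's formulas may not survive extraction, the four estimators can be sketched in code. This is a simplified version that assumes equal-length trajectories and precomputed per-step action-probability ratios:

```python
import numpy as np

def ips_estimates(trajs, gamma):
    """Sketch of the four classical IPS estimators (IS, WIS, PDIS, PDWIS),
    assuming all trajectories have equal length T.

    trajs: list of trajectories; each is a list of (ratio_t, r_t) pairs,
    where ratio_t = pi_e(a_t | s_t) / pi_b(a_t | s_t).
    """
    ratios = np.array([[step[0] for step in tau] for tau in trajs])
    rewards = np.array([[step[1] for step in tau] for tau in trajs])
    rho = np.cumprod(ratios, axis=1)             # cumulative weights rho_{0:t}
    disc = gamma ** np.arange(ratios.shape[1])   # discount factors gamma^t
    returns = (disc * rewards).sum(axis=1)       # discounted return per trajectory
    is_est = np.mean(rho[:, -1] * returns)                        # IS
    wis_est = np.sum(rho[:, -1] / rho[:, -1].sum() * returns)     # WIS
    pdis_est = np.mean((rho * disc * rewards).sum(axis=1))        # PDIS
    w = rho / rho.sum(axis=0, keepdims=True)     # per-step self-normalized weights
    pdwis_est = (w * disc * rewards).sum()                        # PDWIS
    return is_est, wis_est, pdis_est, pdwis_est
```

When the behavior and evaluation policies coincide (all ratios equal to 1), all four estimators reduce to the on-policy Monte Carlo average.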

d.2 Hybrid Methods

Hybrid methods rely on being supplied an action-value function estimate, from which one can also derive a state-value estimate. Doubly-Robust (DR): Thomas and Brunskill (2016); Jiang and Li (2016)
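The displayed DR formula did not survive extraction; for reference, a standard statement of the step-wise DR estimator (following Thomas and Brunskill (2016); Jiang and Li (2016), with notation assumed) is:

```latex
\hat{V}_{\mathrm{DR}}
  = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\gamma^{t}
    \Big( \rho^{(i)}_{0:t}\, r^{(i)}_{t}
        - \rho^{(i)}_{0:t}\, \hat{Q}\big(s^{(i)}_{t}, a^{(i)}_{t}\big)
        + \rho^{(i)}_{0:t-1}\, \hat{V}\big(s^{(i)}_{t}\big) \Big),
```

where \(\rho_{0:t}\) is the cumulative importance weight up to step \(t\) (with \(\rho_{0:-1} = 1\)) and \(\hat{V}(s) = \sum_a \pi_e(a \mid s)\,\hat{Q}(s,a)\).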

Weighted Doubly-Robust (WDR): Thomas and Brunskill (2016)

MAGIC: Thomas and Brunskill (2016) MAGIC considers a family of partial estimators, estimates their bias and covariance, and then finds weights on the simplex that minimize the estimated MSE of the weighted combination; the final MAGIC estimate is the corresponding weighted sum of partial estimators.

MAGIC can be thought of as a weighted average of different blends of the DM and Hybrid estimates. In particular, each blend estimates the first several steps of the return according to DR (or WDR) and the remaining steps via the direct method. Hence, MAGIC finds the set of weights that best trades off between using a direct method and a Hybrid method.

d.3 Direct Methods (DM)


Approximate Model (AM): Jiang and Li (2016) An approach to model-based value estimation is to directly fit the transition dynamics, reward, and terminal condition of the MDP using some form of maximum likelihood or function approximation. This yields a simulation environment from which one can extract the value of a policy using an average over rollouts, where the expectation is over initial conditions and the transition dynamics of the simulator.
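A minimal sketch of the rollout-based value estimate, assuming the model, target policy, and initial-state sampler have already been fitted (all names here are illustrative):

```python
import numpy as np

def am_estimate(model, pi_e, d0_sample, gamma, T, n_rollouts=1000):
    """Approximate Model (AM) value estimate: roll out the target policy in
    a learned simulator and average the discounted returns.

    model(s, a) -> (s_next, r, done): learned dynamics/reward model
        (assumed to have been fitted to the historical data beforehand).
    pi_e(s) -> a: target policy.
    d0_sample() -> s0: sampler for the (learned) initial-state distribution.
    """
    returns = []
    for _ in range(n_rollouts):
        s, g = d0_sample(), 0.0
        for t in range(T):
            a = pi_e(s)
            s, r, done = model(s, a)
            g += (gamma ** t) * r        # accumulate discounted reward
            if done:
                break
        returns.append(g)
    return float(np.mean(returns))
```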


Every estimator in this section approximates the action-value function with a parametrized function class. From this approximation, the OPE estimate we seek is the average of the estimated values of the evaluation policy over initial states.

Direct Model Regression (Q-Reg): Farajtabar et al. (2018)

Fitted Q Evaluation (FQE): Le et al. (2019)
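A tabular sketch of the FQE iteration follows. The paper's version uses general function approximation; in the tabular case below, the "regression" step simply averages the bootstrapped targets per state-action pair:

```python
import numpy as np

def fqe_tabular(data, pi_e, n_states, n_actions, gamma, n_iters=100):
    """Fitted Q Evaluation (tabular sketch): repeatedly regress
    Q_{k+1}(s, a) onto r + gamma * E_{a' ~ pi_e}[Q_k(s', a')] over the batch.

    data: list of (s, a, r, s_next, done) transitions from the behavior policy.
    pi_e: array of shape (n_states, n_actions) with target-policy probabilities.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        targets = {}                     # regression targets per (s, a)
        for s, a, r, s2, done in data:
            y = r if done else r + gamma * (pi_e[s2] @ Q[s2])
            targets.setdefault((s, a), []).append(y)
        for (s, a), ys in targets.items():
            Q[s, a] = np.mean(ys)        # tabular "regression" = average target
    return Q
```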

Retrace(λ) (R(λ)), Tree-Backup (Tree), Q(λ): Munos et al. (2016); Precup et al. (2000); Harutyunyan et al. (2016)


More Robust Doubly-Robust (MRDR): Farajtabar et al. (2018) MRDR fits the action-value function by solving a weighted regression whose objective is chosen to directly minimize the variance of the resulting doubly-robust estimator.

State Density Ratio Estimation (IH): Liu et al. (2018)

where the behavior policy is assumed to be a fixed data-generating policy, and the correction weights are based on the ratio between the state distributions induced by the evaluation and behavior policies. The details for how to find these weights can be found in the algorithms of Liu et al. (2018).
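The displayed IH formula did not survive extraction; one common self-normalized form, restored here following Liu et al. (2018) with notation assumed, is:

```latex
\hat{V}_{\mathrm{IH}}
  = \frac{\sum_{i=1}^{n}\sum_{t=0}^{T-1} \gamma^{t}\, \hat{w}\big(s^{(i)}_{t}\big)\,
          \frac{\pi_e\left(a^{(i)}_{t} \mid s^{(i)}_{t}\right)}
               {\pi_b\left(a^{(i)}_{t} \mid s^{(i)}_{t}\right)}\; r^{(i)}_{t}}
         {\sum_{i=1}^{n}\sum_{t=0}^{T-1} \gamma^{t}\, \hat{w}\big(s^{(i)}_{t}\big)\,
          \frac{\pi_e\left(a^{(i)}_{t} \mid s^{(i)}_{t}\right)}
               {\pi_b\left(a^{(i)}_{t} \mid s^{(i)}_{t}\right)}},
```

where \(\hat{w}(s)\) estimates the ratio between the state distributions induced by \(\pi_e\) and \(\pi_b\).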

Appendix E Environments

For every environment, we initialize the environment with a fixed horizon length . If the agent reaches a goal before or if the episode is not over by step , it will transition to an environment-dependent absorbing state where it will stay until time . For a high level description of the environment features, see Table 1.
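This fixed-horizon convention can be sketched as a wrapper. The zero reward inside the absorbing state is an assumption for illustration, and the interface names are hypothetical:

```python
class FixedHorizonAbsorbing:
    """Pads every episode to exactly T steps: once the wrapped episode ends
    (goal reached or time limit), the agent stays in an absorbing state with
    zero reward (an assumption here) until step T."""

    def __init__(self, env, T, absorbing_state):
        self.env, self.T, self.absorbing = env, T, absorbing_state
        self.t, self.done = 0, False

    def reset(self):
        self.t, self.done = 0, False
        return self.env.reset()

    def step(self, a):
        self.t += 1
        if self.done:                       # already absorbed: idle until T
            return self.absorbing, 0.0, self.t >= self.T
        s, r, done = self.env.step(a)
        self.done = done or self.t >= self.T
        if done:                            # map terminal state to absorbing state
            s = self.absorbing
        return s, r, self.t >= self.T       # episode length is always exactly T
```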

Figure 6: Graph Environment
Figure 7: Graph-MC Environment
Figure 8: MC Environment, pixel-version. The non-pixel version involves representing the state of the car as the position and velocity.
Figure 9: Enduro Environment
Figure 10: Graph-POMDP Environment. Model-Fail Thomas and Brunskill (2016) is a special case of this environment where T=2. We also extend the environment to arbitrary horizon, which makes it a semi-MDP.
Figure 11: Gridworld environment. Blank spaces indicate areas of a small negative reward, S indicates the starting states, F indicates a field of slightly less negative reward, H indicates a hole of severe penalty, G indicates the goal of positive reward.

e.1 Environment Descriptions


Graph

Figure 6 shows a visualization of the Toy-Graph environment. The graph is initialized with a fixed horizon and an absorbing state. In each episode, the agent starts at a single starting state and has two actions; at each time step, the action taken determines which of two successor states the agent enters. If the environment is stochastic, we simulate noisy transitions by allowing the agent to slip into the opposite successor state with some probability. At the final time step, the agent always enters the terminal state. The reward depends on whether the agent transitions to an odd or an even state. If the environment provides sparse rewards, the reward is given only at the final transition, again depending on the parity of the final state. If the environment's rewards are stochastic, the rewards are sampled with means given by their deterministic counterparts; sparse stochastic rewards combine both modifications.


Graph-POMDP

Figure 10 shows a visualization of the Graph-POMDP environment. The underlying state structure of Graph-POMDP is exactly the Graph environment. However, the states are grouped together based on a choice of Graph-POMDP horizon length. This parameter groups the underlying states into observable states. The agent is only able to observe these grouped states, and not the underlying MDP structure. Model-Fail Thomas and Brunskill (2016) is a special case of this environment when the horizon is 2.

Graph Mountain Car (Graph-MC)

Figure 7 shows a visualization of the Toy-MC environment. This environment is a 1-D graph-based simplification of Mountain Car. The agent starts at the center of the valley and can go left or right. There are states to the left of the starting position and states to the right of the starting position, plus a terminal absorbing state. The agent receives a negative reward at every timestep. The reward becomes zero once the agent reaches the goal, the right-most state. If the agent reaches the left-most state and continues left, it remains there. If the agent does not reach the goal state by the final step, the episode terminates and the agent transitions to the absorbing state.

Mountain Car (MC)

We use the OpenAI version of Mountain Car Brockman et al. (2016); Sutton and Barto (2018) with a few simplifying modifications. The car starts in a valley and has to go back and forth to gain enough momentum to scale the mountain and reach the end goal. The state space is given by the position and velocity of the car. At each time step, the car has the following options: accelerate backwards, accelerate forwards, or do nothing. The reward is a constant negative value for every time step until the car reaches the goal. While the original trajectory length is capped, we decrease the effective length by applying every action five times before observing the next state. Furthermore, we modify the random initial position from being sampled uniformly within a range to being one of a fixed set of positions, with no velocity. The environment is initialized with a fixed horizon, and the absorbing state is a fixed position with no velocity.

Pixel-based Mountain Car (Pix-MC)

This environment is identical to Mountain Car except the state space has been modified from position and velocity to a pixel-based representation of a ball, representing a car, rolling on a hill; see Figure 8. Each frame is an image of the ball on the mountain. One cannot deduce velocity from a single frame, so we represent the state as a stack of the two most recent frames (see Table 4). Everything else is identical between the pixel-based version and the position-velocity version described earlier.


Enduro

We use OpenAI’s implementation of Enduro-v0, an Atari 2600 racing game. We downsample the image to a grayscale of size (84,84). We apply every action once and represent the state as a stack of the four most recent frames (see Table 4). See Figure 9 for a visualization.

Gridworld (GW)

Figure 11 shows a visualization of the Gridworld environment. The agent starts at a state in the first row or column (denoted S in the figure), and proceeds through the grid by taking actions, given by the four cardinal directions, until the horizon is reached. An agent remains in the same state if it chooses an action which would take it out of the environment. If the agent reaches the goal state (denoted G), in the bottom right corner of the environment, it transitions to a terminal state for the remainder of the trajectory and receives a positive reward. In the grid, there is a field (denoted F) which gives the agent a slightly less negative reward and holes (denoted H) which give a severe penalty. The remaining states give a small negative reward.

Pixel-Gridworld (Pixel-GW)

This environment is identical to Gridworld except the state space has been modified from position to a pixel-based representation of the position: 1 for the agent’s location, 0 otherwise. We use the same policies as in the Gridworld case.

Environment Graph Graph-MC MC Pix-MC Enduro Graph-POMDP GW Pix-GW
Is MDP? yes yes yes yes yes no yes yes
State desc. position position [pos, vel] pixels pixels position position pixels
Horizon 4 or 16 250 250 250 1000 2 or 8 25 25
Stoch Env? variable no no no no no no variable
Stoch Rew? variable no no no no no no no
Sparse Rew? variable terminal terminal terminal dense terminal dense dense
Func. Class tabular tabular linear/NN NN NN tabular tabular NN
Initial state 0 0 variable variable gray img 0 variable variable
Absorb. state 2T 22 [.5,0] [.5,0] zero img 2T 64 zero img
Frame height 1 1 2 2 4 1 1 1
Frame skip 1 1 5 5 1 1 1 1
Table 4: Environment parameters - Full description

Appendix F Experimental Setup

f.1 Description of the policies

Graph, Graph-POMDP and Graph-MC use static policies with some probability of going left and the complementary probability of going right, independent of state. We vary these probabilities in our experiments.

GW, Pix-GW, MC, Pix-MC, and Enduro all use an ε-greedy policy. In other words, we train a policy (using value iteration or DDQN) and then vary the deviation away from that policy: with probability 1 − ε we follow the trained policy, and with probability ε we act uniformly at random. We vary ε in our experiments.
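Concretely, such a mixture can be sketched as follows (names illustrative):

```python
import numpy as np

def eps_mix_action(base_action, n_actions, eps, rng):
    """epsilon-greedy mixture: follow the trained policy's action with
    probability 1 - eps, otherwise pick a uniformly random action."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))  # uniform exploration branch
    return base_action                       # trained-policy branch
```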

f.2 Enumeration of Experiments


See Table 5 for a description of the parameters of the experiment we ran in the Graph Environment. The experiments are the Cartesian product of the table.

Stochastic Env {True, False}
Stochastic Rew {True, False}
Sparse Rew {True, False}
Seed {10 of random()}
ModelType Tabular
Regress False
Table 5: Graph parameters


See Table 6 for a description of the parameters of the experiment we ran in the Graph-POMDP Environment. The experiments are the Cartesian product of the table.