
We propose a robust regression approach to off-policy evaluation (OPE) for contextual bandits. We frame OPE as a covariate-shift problem and leverage modern robust regression tools. Ours is a general approach that can be used to augment any existing OPE method that utilizes the direct method. When augmenting doubly robust methods, we call the resulting method triply robust. We prove upper bounds on the resulting bias and variance, as well as derive novel minimax bounds based on robust minimax analysis for covariate shift. Our robust regression method is compatible with deep learning, and is thus applicable to complex OPE settings that require powerful function approximators. Finally, we demonstrate superior empirical performance across the standard OPE benchmarks, especially in the case where the logging policy is unknown and must be estimated from data.


Triply Robust Off-Policy Evaluation


Anqi Liu, Hao Liu, Anima Anandkumar, Yisong Yue


Caltech

1 Introduction

Contextual bandits is the online learning setting where a policy repeatedly observes a context, takes an action, and then observes a reward only for the chosen action [langford2007epoch]. Typical real-world applications include recommender systems [li2010contextual, yue2012hierarchical], online advertising [tang2013automatic, bottou2013counterfactual], experiment design [krause2011contextual], and medical interventions [lei2014actor]. For settings where online deployments can be costly, an important task is to first assess a target policy’s performance offline, which motivates off-policy evaluation.

Off-policy evaluation (OPE) is the problem of estimating the reward of a target policy from pre-collected historical data generated by some (possibly unknown) logging policy. The core challenge of OPE is a form of counterfactual reasoning: only the rewards of the actions taken by the logging policy are recorded, so we must reason about the rewards the target policy would have received had it taken different actions. To date, many OPE approaches have been proposed, broadly grouped into three categories: (i) direct methods (DM) that directly regress a target policy’s value; (ii) inverse propensity scoring (IPS) methods that adjust the logged rewards with importance weights [horvitz1952generalization, swaminathan2015self]; and (iii) doubly robust (DR) methods that blend DM and IPS [robins1995semiparametric, bang2005doubly, dudik2011doubly, wang2017optimal, su2019doubly, dudik2014doubly].

In this paper, we take the perspective of off-policy evaluation as a form of covariate shift [shimodaira2000improving, chen2016robust]. Roughly speaking, covariate shift is the problem of modeling a dependent variable when, at test time, the generating distribution of the covariates differs from the one used for training. We will show how to frame OPE as a form of covariate shift, where the dependent variable is the reward, the covariates are the contexts and actions, and the generating distributions for the covariates are determined (in part) by the target policy (test time) and the logging policy (training time). Perhaps surprisingly, thus far there has been little intersection between the covariate shift literature and the OPE literature.

Building upon recent work in deep robust regression under covariate shift [chen2016robust, liu2019robust], we develop a general framework for augmenting existing OPE methods that utilize a direct method component. Our contributions are:

  • We show how to frame OPE as a covariate shift problem, and how to leverage modern robust regression tools for tackling covariate shift.

  • We present a general framework for augmenting many OPE methods by using robust regression for the direct method. The resulting DMs can enjoy substantially improved bias and variance compared to their non-augmented counterparts. When augmenting the DM within a DR method, we call the resulting method triply robust.

  • We prove bias and variance bounds for our triply robust estimators. We also derive novel minimax bounds based on worst-case covariate shift.

  • Our approach is compatible with deep learning, and is thus applicable to complex OPE settings that require powerful function approximators.

  • We demonstrate superior empirical performance across the standard OPE benchmarks, via augmenting several state-of-the-art OPE approaches. Our approach is particularly beneficial when the logging policy is not known, in which case it can enjoy over 50% relative error reduction compared to existing state-of-the-art methods.

2 Preliminaries

2.1 Off-Policy Evaluation for Contextual Bandits

In contextual bandit problems, the policy iteratively observes a context $x$, takes an action $a$, and observes a scalar reward $r(x, a)$. Assuming the contexts are generated i.i.d., the value of a policy $\pi$ can be written as:

\[
V(\pi) = \mathbb{E}_{x \sim P(x),\, a \sim \pi(a \mid x)}\left[ r(x, a) \right], \tag{1}
\]

where $P(x)$ denotes some exogenous context distribution (e.g., profiles of users), and $\pi(a \mid x)$ denotes the stochasticity of the policy.

In off-policy evaluation (OPE), the goal is to estimate $V(\pi)$ offline using pre-collected historical data from some other (possibly unknown) logging policy $p$. In other words, we assume access to a pre-collected set $S$ of tuples of the form $(x, a, r_a)$, where $x \sim P(x)$, $a \sim p(a \mid x)$, and $r_a$ is the reward observed for taking action $a$ (we often drop the explicit dependence on $x$ in $r_a$ when it is clear from context). We generally do not know $P(x)$ or the reward function, and often do not know the logging policy $p$ either; they must be estimated using $S$. Given $S$, the concrete goal of OPE is to compute a reliable estimate $\hat{V}$ of $V(\pi)$ in (1) (we typically drop the explicit dependence on $\pi$ and $S$ for brevity).

When designing an effective OPE approach, the typical considerations center on managing the bias-variance tradeoff. Relevant factors include the size of $S$, the degree of overlap between the target policy $\pi$ and the logging policy $p$, and the inherent complexity of estimating the various components $P(x)$, $r$, and $p$. We next overview several OPE approaches, most of which we can augment using our robust regression framework described in Section 3.

2.2 Direct Methods (DM)

The simplest class of methods are direct methods (DMs). DMs aim to directly learn a mapping $\hat{r}$ from $(x, a)$ to the reward $r$, which is essentially a supervised regression problem on $S$ subject to a choice of function class and possibly regularization, e.g., $\hat{r}$ is a neural net trained to minimize:

\[
\sum_{(x, a, r_a) \in S} \left( \hat{r}(x, a) - r_a \right)^2. \tag{2}
\]

Given $\hat{r}$, we can estimate $V(\pi)$ as:

\[
\hat{V}_{DM} = \frac{1}{|S|} \sum_{x \in S} \sum_{a'} \pi(a' \mid x)\, \hat{r}(x, a'). \tag{3}
\]

DMs are notorious for suffering from a large bias [dudik2011doubly]: conventional DM training (2) is performed over the data collected by the logging policy $p$, which often does not choose the actions that the target policy $\pi$ would choose. Our key enabling technical insight is to view this issue as a form of covariate shift, and to train DMs using robust regression, as described in Section 3.
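To make the direct method concrete, the following minimal Python sketch averages, over the logged contexts, the model-predicted reward under the target policy. The names `dm_estimate`, `pi`, and `r_hat` are illustrative assumptions rather than part of any library: `pi(x)` is assumed to return a list of action probabilities, and `r_hat(x, a)` stands for any fitted reward model.

```python
def dm_estimate(contexts, pi, r_hat):
    """Direct-method value estimate (hypothetical helper).

    contexts: logged contexts x
    pi:       callable, pi(x) -> list of target-policy action probabilities
    r_hat:    callable, r_hat(x, a) -> predicted reward
    """
    total = 0.0
    for x in contexts:
        # Expected predicted reward under the target policy for this context.
        total += sum(prob * r_hat(x, a) for a, prob in enumerate(pi(x)))
    return total / len(contexts)


# Toy example (assumed data): two actions, target policy prefers action 0.
contexts = [0, 1, 1]
pi = lambda x: [0.8, 0.2]
r_hat = lambda x, a: 1.0 if a == x else 0.0
dm_value = dm_estimate(contexts, pi, r_hat)
```

Note that the logged actions and rewards never enter the estimate directly; they only matter through how `r_hat` was trained, which is where the bias discussed above originates.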

2.3 Inverse Propensity Scoring (IPS)

Inverse Propensity Scoring (IPS) has a rich history in the statistics literature [powell1966weighted, horvitz1952generalization, kang2007demystifying], and is used in many popular OPE methods. Although our framework does not directly augment IPS methods, we provide a brief overview for completeness.

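Vanilla IPS. The basic idea is to use importance weighting on the entries in $S$ to reflect the relative probabilities of choosing some action under the target policy $\pi$ versus the logging policy $p$:

\[
\hat{V}_{IPS} = \frac{1}{|S|} \sum_{(x, a, r_a) \in S} r_a\, \frac{\pi(a \mid x)}{\hat{p}(a \mid x)}, \tag{4}
\]

where $\pi(a \mid x)$ is the probability of choosing $a$ under the target policy, and $\hat{p}$ is the estimated logging policy (assuming $p$ is not known). It is known that IPS methods are unbiased when the logging propensities are accurate, but they can suffer from high variance if $\pi$ and $p$ diverge strongly in their behavior, due to unstable estimates of the ratio $\pi(a \mid x) / \hat{p}(a \mid x)$ [dudik2011doubly].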

Self-normalized IPS (SnIPS). A more recently proposed approach is the Self-normalized IPS estimator [swaminathan2015batch]:

\[
\hat{V}_{SnIPS} = \frac{\sum_{(x, a, r_a) \in S} r_a\, \frac{\pi(a \mid x)}{\hat{p}(a \mid x)}}{\sum_{(x, a, r_a) \in S} \frac{\pi(a \mid x)}{\hat{p}(a \mid x)}}. \tag{5}
\]

Rather than normalizing by $|S|$ as in vanilla IPS, SnIPS normalizes by the sum of the importance weights. Even though SnIPS is biased, it tends to be more accurate than vanilla IPS when fluctuations in the importance weights dominate fluctuations in the rewards [swaminathan2015self]. It is straightforward for doubly robust methods to use SnIPS as an alternative to vanilla IPS.
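The difference between the two normalizations can be seen in a small sketch. It assumes logged tuples `(x, a, r)` and callables `pi(x)` and `p_hat(x)` that return action-probability lists; these names and the toy data are hypothetical.

```python
def importance_weights(logs, pi, p_hat):
    """w = pi(a|x) / p_hat(a|x) for each logged (x, a, r) tuple."""
    return [pi(x)[a] / p_hat(x)[a] for x, a, _ in logs]


def ips_estimate(logs, pi, p_hat):
    """Vanilla IPS: weighted rewards normalized by the number of samples."""
    w = importance_weights(logs, pi, p_hat)
    return sum(wi * r for wi, (_, _, r) in zip(w, logs)) / len(logs)


def snips_estimate(logs, pi, p_hat):
    """Self-normalized IPS: weighted rewards normalized by the weight sum."""
    w = importance_weights(logs, pi, p_hat)
    return sum(wi * r for wi, (_, _, r) in zip(w, logs)) / sum(w)


# Toy logs (assumed): one context, two actions, rewards in [0, 1].
logs = [(0, 0, 1.0), (0, 0, 1.0), (0, 1, 0.0)]
pi = lambda x: [0.8, 0.2]      # target policy
p_hat = lambda x: [0.5, 0.5]   # (estimated) logging policy
ips_value = ips_estimate(logs, pi, p_hat)
snips_value = snips_estimate(logs, pi, p_hat)
```

On this toy sample the vanilla IPS estimate exceeds the largest possible reward of 1, while the self-normalized estimate is a weighted average of observed rewards and therefore stays inside the reward range.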

2.4 Doubly Robust Methods (DR)

The bulk of recent OPE research for contextual bandits has focused on developing doubly robust estimators, which utilize both DM and IPS as components [dudik2011doubly, dudik2014doubly, wang2017optimal, farajtabar2018more]. The basic idea is to balance between the biased but low variance DM and the unbiased but high variance IPS.

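Vanilla DR. The basic formulation is:

\[
\hat{V}_{DR} = \frac{1}{|S|} \sum_{(x, a, r_a) \in S} \left[ \left( r_a - \hat{r}(x, a) \right) \frac{\pi(a \mid x)}{\hat{p}(a \mid x)} \right] + \hat{V}_{DM}. \tag{6}
\]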
One can also interpret DR estimators as using control variates within an IPS method, albeit traditional control variates tend to be much simpler [magic, veness2011variance].

Not surprisingly, DR methods depend on having a good DM or a good IPS. For instance, when either IPS or DM is unbiased, DR is guaranteed to be unbiased [dudik2011doubly]. It has also been shown that the variance of DR mainly comes from the IPS term [dudik2011doubly]. Moreover, when IPS is inaccurate or has high variance, the error of an inaccurate DM can be compounded within DR beyond what it would be when using the DM alone. As such, a large body of follow-up work has focused on how to develop advanced DR methods that mitigate the damaging effects of variance or extreme probabilities from the IPS component.
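The interplay between the two components can be illustrated with a short sketch of a vanilla DR estimate, written as a DM term plus an importance-weighted residual correction. The function and argument names are assumptions made for illustration, as is the toy data.

```python
def dr_estimate(logs, pi, p_hat, r_hat):
    """Doubly robust estimate: DM term plus importance-weighted residuals."""
    n = len(logs)
    # DM term: model-predicted reward under the target policy.
    dm_term = sum(
        sum(prob * r_hat(x, b) for b, prob in enumerate(pi(x)))
        for x, _, _ in logs
    ) / n
    # IPS-style correction applied to the model's residuals.
    correction = sum(
        (r - r_hat(x, a)) * pi(x)[a] / p_hat(x)[a] for x, a, r in logs
    ) / n
    return dm_term + correction


logs = [(0, 0, 1.0), (0, 0, 1.0), (0, 1, 0.0)]
pi = lambda x: [0.8, 0.2]
p_hat = lambda x: [0.5, 0.5]
zero_model = lambda x, a: 0.0                       # uninformative reward model
exact_model = lambda x, a: 1.0 if a == 0 else 0.0   # matches every logged reward
dr_with_zero_model = dr_estimate(logs, pi, p_hat, zero_model)
dr_with_exact_model = dr_estimate(logs, pi, p_hat, exact_model)
```

With the uninformative model the estimate collapses to vanilla IPS, and with a model that reproduces every logged reward the correction vanishes and only the DM term remains. This is the sense in which the quality of the DM component directly shapes the DR estimate.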

SWITCH. The SWITCH estimator extends vanilla DR by introducing weight clipping [wang2017optimal]. SWITCH uses vanilla DR unless the importance weight is too large, in which case it only uses the DM. (There is a version of SWITCH that switches from IPS to DM, rather than from DR to DM. We omit that version since it typically performs worse.) The intuition is to avoid using the IPS term (and thus reduce to only using DM) if the extreme importance weights are harming the effectiveness of DR. The estimator can be written as:


where , is the threshold parameter for switching, and . This estimator’s performance depends strongly on how the weight-clipping threshold is tuned.
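The switching behavior described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of the estimator's formula: the residual correction is kept when the importance weight is at most a threshold `tau` and dropped otherwise, so the sample falls back to the DM. All names and data are assumed.

```python
def switch_dr_estimate(logs, pi, p_hat, r_hat, tau):
    """DR with switching: fall back to DM when the weight exceeds tau."""
    total = 0.0
    for x, a, r in logs:
        dm_part = sum(prob * r_hat(x, b) for b, prob in enumerate(pi(x)))
        w = pi(x)[a] / p_hat(x)[a]
        # Keep the IPS-style correction only for moderate importance weights.
        correction = (r - r_hat(x, a)) * w if w <= tau else 0.0
        total += dm_part + correction
    return total / len(logs)


# Toy logs (assumed): action 1 is rare under logging but favored by the target.
logs = [(0, 0, 1.0), (0, 1, 1.0), (0, 1, 1.0)]
pi = lambda x: [0.2, 0.8]
p_hat = lambda x: [0.9, 0.1]
r_hat = lambda x, a: 0.5
dm_only = switch_dr_estimate(logs, pi, p_hat, r_hat, tau=0.0)
clipped = switch_dr_estimate(logs, pi, p_hat, r_hat, tau=1.0)
full_dr = switch_dr_estimate(logs, pi, p_hat, r_hat, tau=float("inf"))
```

Here `tau=0.0` recovers the pure DM estimate and an infinite threshold recovers vanilla DR, whose value is inflated far beyond the reward range by the two samples with weight 8. The intermediate threshold discards exactly those samples' corrections.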

DR-Shrinkage. DR with Shrinkage extends vanilla DR by shrinking the IPS term to obtain a better bias-variance trade-off in finite samples [su2019doubly]:


where is a weight mapping found by hard thresholding or by optimizing a sharp bound on the MSE. In a situation analogous to SWITCH, the performance of DR-Shrinkage is highly dependent on being able to find a good weight mapping.

Towards Triply Robust. Perhaps surprisingly, not much work has been done on directly minimizing the bias of DMs. Instead, recent research has largely focused on designing DR methods that more carefully balance between the IPS and DM components, in order to control for the variance of IPS. (This methodological focus is also present in research on OPE methods for the RL setting, e.g., [jiang2015doubly, magic, farajtabar2018more]. Extending our framework to the RL setting is a natural direction for future work.) In Section 3, we propose a complementary line of research in leveraging robust regression methods to train DMs, which can then be seamlessly integrated in most DR approaches to arrive at their triply robust counterparts.

3 Robust Regression for OPE

3.1 Off-Policy Evaluation as Covariate Shift

Covariate shift refers to the distribution shift caused only by the input distribution, while the conditional output distribution remains unchanged [shimodaira2000improving, chen2016robust]. Assuming the logging data is sampled from a joint distribution , our goal is to accurately estimate the conditional reward distribution . The estimator described in Section 2.2 would then be re-defined as the expected value of this reward distribution, rather than using vanilla supervised learning as in (2). Given such a , it is straightforward to incorporate it into a direct method such as (3).

Covariate shift arises because the covariates of the reward model, in particular the action $a$, experience distribution shift between training and testing. The joint distribution over the covariates can be written as $P(x, a) = P(x)\, P(a \mid x)$. The generating distribution for contexts, $P(x)$, is exogenous and fixed (and thus does not contribute to covariate shift). On the other hand, the conditional action distribution $P(a \mid x)$ varies depending on whether it corresponds to the target evaluation policy or the logging policy. We explicitly deal with this shift when estimating a reward model from logging data using robust regression.

Existing methods for dealing with covariate shift typically employ density ratio estimators, which can be very challenging in high-dimensional settings. In our setting, the contexts can be very high dimensional, but the actions are typically low dimensional. Since the context distribution does not experience distribution drift, we only need to employ density ratio estimators for the conditional action distribution, which is much easier to do. As a consequence, OPE, once properly framed, actually reduces to a relatively simple covariate shift problem.
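This reduction can be made explicit with a short worked step. The notation is assumed for illustration: $P_{\mathrm{log}}$ and $P_{\mathrm{tgt}}$ denote the training-time and test-time covariate distributions, $p$ the logging policy, and $\pi$ the target policy. Because both distributions share the same context distribution $P(x)$, it cancels in the density ratio:

\[
\frac{P_{\mathrm{log}}(x, a)}{P_{\mathrm{tgt}}(x, a)}
= \frac{P(x)\, p(a \mid x)}{P(x)\, \pi(a \mid x)}
= \frac{p(a \mid x)}{\pi(a \mid x)} .
\]

The ratio therefore requires only the action probabilities of the two policies, and never a density estimate over the high-dimensional contexts.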

3.2 Deep Robust Regression

We now present a deep robust regression approach for off-policy evaluation. The naming of “robust” originates from a line of research in statistics on robust estimation under distribution drift [shimodaira2000improving]. The high level goal is to estimate a reward model that is robust to the “most surprising” distribution shift that can occur, which can be formulated using a minimax objective. We build upon modern tools for deep robust regression under covariate shift [liu2019robust].

Relative Loss. For technical reasons, it is convenient to design a relative loss function defined as the difference in conditional log-loss between an estimator and a baseline conditional distribution on the target data distribution . This loss essentially measures the amount of expected “surprise” in modeling true data distribution that comes from instead of :


The choice of is straightforward in most applications, and we typically use a Gaussian distribution.

Quantifying Allowable . The next step is to quantify the allowable conditional distribution that we aim to be robust against. We do so by creating a constrained set of allowable that are consistent with data statistics from covariate distributions :


where is a vector of statistics measured from the logging data, and is a hyperparameter. Note that is defined on the logging data distribution, while is defined on evaluation data distribution. The crux of this definition lies in the specific instantiation of , which we discuss next.

Interpreting . Originally developed for linear function classes [chen2016robust], is typically instantiated as linear or higher-moment statistics, which in (10) correspond exactly to quantifying the allowable distribution drift via the drift in the sufficient statistics. This interpretation is somewhat less clear when extending to deep neural networks (although the bias/variance and minimax bounds described in Sections 3.4 & 3.5 are still valid). In the deep neural net case, we define as the top hidden layer, which can be estimated end-to-end during training [liu2019robust]. In other words, we directly learn the sufficient statistics to characterize distribution shift.

Minimax Objective. Our learning goal is to find a regression model that is robust to the “most surprising” conditional distribution that can arise from logging data distribution but still consistent with evaluation data distribution under covariate shifts:


By using relative loss (9) with , along with the constraint formulation in (10), the solution to (11) takes the form of a conditional Gaussian distribution :


where is a matrix: , is the base distribution , and is the top hidden layer of a neural net. A detailed derivation is available in Appendix A.

Learning Algorithm. Another technical convenience of this formulation is that, during learning, we do not explicitly consider , since it is included in the KKT conditions at optimality (see the appendix). We can thus employ standard gradient-based learning, as summarized in Algorithm 1.

The role of the density ratio: The density ratio corresponds to the logging policy over the evaluation policy. The intuition is easiest to see when both the logging and evaluation policies are stochastic. For a given action, if its probability under the logging policy differs from its probability under the evaluation policy, we should adjust our prediction uncertainty. In particular, when an action is very probable under the evaluation policy but improbable under the logging policy, the estimator tends to be less certain and depends more on the base distribution.

The role of the base distribution: The base distribution provides prior knowledge about the “default” conditional reward distribution when the logging policy and the evaluation policy are totally distinct on certain actions, which is when the density ratio is close to 0. For OPE, it is reasonable to choose a Gaussian distribution with mean equal to , where and define the range the reward values can take, assuming we do not have more informative knowledge about the reward.
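The qualitative roles of the density ratio and the base distribution can be illustrated numerically. The sketch below does not reproduce Eq. (12). It assumes, purely for illustration, that the predictor is proportional to a Gaussian base density multiplied by a quadratic potential scaled by the density ratio; the function name and the scalar parameters `m_yy` and `m_yx_phi` are hypothetical.

```python
def robust_gaussian(w, m_yy, m_yx_phi, mu0, sigma0_sq):
    """Mean and variance of an assumed robust Gaussian predictor.

    Assumed form (an illustration, not taken from Eq. 12):
        density(r) ~ base(r) * exp(-w * (m_yy * r**2 + 2 * m_yx_phi * r)),
    with base = N(mu0, sigma0_sq), density ratio w >= 0, and m_yy > 0.
    Completing the square in r gives the expressions below.
    """
    precision = 2.0 * w * m_yy + 1.0 / sigma0_sq
    variance = 1.0 / precision
    mean = variance * (-2.0 * w * m_yx_phi + mu0 / sigma0_sq)
    return mean, variance


# Assumed toy values: the data-driven term alone would predict 0.8,
# while the base distribution is N(0.5, 1.0).
params = dict(m_yy=0.5, m_yx_phi=-0.4, mu0=0.5, sigma0_sq=1.0)
unseen = robust_gaussian(w=0.0, **params)          # action never logged
moderate = robust_gaussian(w=1.0, **params)
well_covered = robust_gaussian(w=100.0, **params)
```

Under this assumed form, a density ratio of 0 returns exactly the base distribution, and the prediction moves toward the data-driven value with shrinking variance as the ratio grows. This matches the behavior described in the two paragraphs above.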

Algorithm 1: Stochastic Gradient Descent for Deep Robust Regression under Covariate Shift
  Input: Training data points , logging policy , evaluation policy = , DNN with initialization, DNN SGD optimizer , learning rate , regularization , epoch number .
   random initialization, epoch
  While epoch
    For each mini-batch
      Obtain top hidden layer
      Compute and (Eq. 12)
      Compute gradients for (details in Appendix A, Eq. 21 and Eq. 22)
      Gradient descent on and with regularization
      Back-propagate through the networks
      SGD using
  Output: Trained NN and

3.3 Triply Robust Off-Policy Evaluation

We now overview how our robust regression approach can be used to augment many existing OPE methods that utilize a direct method component.

DM-R. We can augment vanilla DMs by plugging in mean estimates from robust regression to obtain:


TR. The triply robust method augments DR by augmenting the DM component. For simplicity, we use and to represent the mean prediction from robust regression on logging policy and evaluation policy.


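\[
\hat{V}_{TR} = \frac{1}{|S|} \sum_{(x, a, r_a) \in S} \left[ \left( r_a - \hat{\mu}_a \right) \frac{\pi(a \mid x)}{\hat{p}(a \mid x)} \right] + \hat{V}_{DM\text{-}R}.
\]

Similar to DR, TR benefits from controlling the variance of IPS by using SnIPS, or by using SWITCH and Shrinkage based on (optimized) thresholds. We list these methods below.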
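As a minimal sketch of the augmentation, the function below computes a TR-style estimate from any robust mean predictor, with an optional self-normalized correction. The names are hypothetical, `mu_hat(x, a)` stands in for the mean prediction of the robust regression model, and the self-normalized form is an assumption modeled on how SnIPS modifies vanilla IPS.

```python
def tr_estimate(logs, pi, p_hat, mu_hat, self_normalize=False):
    """TR-style estimate: robust DM term plus weighted residual correction."""
    weights = [pi(x)[a] / p_hat(x)[a] for x, a, _ in logs]
    # DM-R term: robust mean prediction averaged under the target policy.
    dm_r = sum(
        sum(prob * mu_hat(x, b) for b, prob in enumerate(pi(x)))
        for x, _, _ in logs
    ) / len(logs)
    weighted_residuals = sum(
        w * (r - mu_hat(x, a)) for w, (x, a, r) in zip(weights, logs)
    )
    # Vanilla correction divides by |S|; the self-normalized one by sum(w).
    denom = sum(weights) if self_normalize else len(logs)
    return dm_r + weighted_residuals / denom


logs = [(0, 0, 1.0), (0, 0, 1.0), (0, 1, 0.0)]
pi = lambda x: [0.8, 0.2]
p_hat = lambda x: [0.5, 0.5]
mu_hat = lambda x, a: 0.5      # placeholder for a trained robust model
tr_value = tr_estimate(logs, pi, p_hat, mu_hat)
sntr_value = tr_estimate(logs, pi, p_hat, mu_hat, self_normalize=True)
```

The only change relative to a DR implementation is which reward model is passed in, which is why the same substitution carries over to the SWITCH and Shrinkage variants.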

SnTR. Using Self-normalized IPS in the first term of TR, we obtain:


TR-SWITCH. As in SWITCH, we switch from TR to DM-R at a certain threshold , and :


TR-Shrinkage. As in DR-Shrinkage, we use a customized importance weight for the first term of TR, which needs to be tuned or optimized carefully:


3.4 Bias and Variance Analysis

Our analysis connects the generalization bound of the direct method with the bias and variance analysis of doubly robust estimators to obtain upper bounds on both the bias and the variance of the triply robust estimator. We first denote as the generalization error upper bound that is given in Theorem 1 in [liu2019robust]. We refer to Appendix B for a detailed restatement of the bound in the off-policy evaluation setting.

Theorem 1.

The bias of triply robust is bounded by the following with probability at least :


where is the upper bound of , and is the number of data samples.

Theorem 2.

The variance of triply robust method is bounded by the following with probability at least :


where is the upper bound of , is the upper bound of model parameter , and is the number of data samples.

To interpret these two bounds: both the bias and the variance are upper bounded by a combination of (1) moments of the generalization error of robust regression on the evaluation data and (2) the constraint violations in the logging data, weighted by the IPS. This shows that a good direct method helps reduce the bias and variance of TR.

3.5 Minimax Analysis

Minimax analysis provides insight into the best possible performance among all statistical procedures under the worst-case behavior of a method. Such analyses have been carried out for the multi-armed bandit setting [li2015toward] and the contextual bandit setting [wang2017optimal]. In our case, instead of focusing on general maximum mean and variance constraints, we utilize the data-dependent constraints as in (2) (in Appendix C) and obtain a data-dependent minimax analysis of DM-R.

Recall that under the robust regression framework, slack terms like and correspond with the regularization in parameter optimization [chen2016robust]. So we assume they are bounded. Recall is the representation of covariates (x, a) in robust regression.

Theorem 3.

Assuming , is independent with , define as the set of distributions that satisfy (2) (in Appendix C), the minimax risk of off-policy evaluation over the class , which is defined as satisfies the lower bound:

where we abuse the notation a little and use to represent in the expectation in the second term, is the lower bound of , n is the number of data samples.

The minimax lower bound of DM-R is the minimum of two terms that are related to and respectively. Unlike other minimax risk analyses, our bound is not related to the upper bound of the variance, but is closely related to the expectation of the weighted reward under the logging data distribution. This is because the constraints in (2) define the relation between the mean and variance of the resulting conditional reward distribution, given fixed and .

4 Related Work

Advances in Off-Policy Evaluation and Learning: Modern off-policy evaluation methods use powerful tools like deep learning to deal with data of large dimensionality and volume, and can also be used within off-policy learning approaches. BanditNet [joachims2018deep] provides a counterfactual risk minimization approach for training deep networks using an equivalent empirical risk estimator with variance regularization. Our approach uses deep robust regression for off-policy evaluation and is also compatible with a number of off-policy optimization methods.

Off-policy evaluation has been studied in scenarios other than the traditional contextual bandit setup, such as slate recommendation [swaminathan2017off], where the key challenge is that the number of possible lists (called slates) is combinatorially large. Off-policy evaluation has also been a key challenge in reinforcement learning [magic, fqe, xie2019optimal, jiang2015doubly], and has been considered in the setup where there are multiple logging policies [he2019off].

Causal Inference: Off-policy evaluation is closely connected with causal inference [athey2015machine]. A key problem in evaluating the individual treatment effect (ITE) and average treatment effect (ATE) is the evaluation of a counterfactual policy. Methods from domain adaptation and deep representation learning [johansson2016learning] have been applied in this area, but they still fall in the sample re-weighting category. There has also been work on using causal models to achieve better off-policy evaluation results [oberst2019counterfactual].

Robust Regression and Covariate Shift: Importance weighting is the common choice for regression under distribution shift [shimodaira2000improving, sugiyama2007covariate]. However, although it is asymptotically unbiased, it suffers from high variance. Recently developed robust covariate shift methods take a worst-case approach, constructing a predictor that (approximately) matches training data statistics, but is otherwise the most uncertain on the testing distribution. These methods are built by minimizing the worst-case expected target log loss and obtain a parametric form of the predicted output labels’ probability distributions [chen2016robust, liu2017robust]. We are the first to use these types of robust regression methods for off-policy evaluation, which allows us to construct a better direct method and further improve doubly robust estimators.

5 Experiments

5.1 Setup

We validate our framework across the standard OPE benchmarks considered in prior work. In particular, we use several UCI datasets as well as CIFAR10, where we convert the multi-class classification problem to contextual bandits with binary reward, following [dudik2011doubly]. Table 1 includes detailed information on the datasets used in the experiments. For each experiment, we first separate the data into training and testing sets in a 60% to 40% ratio. We then use the fully observed training data to train a classifier that serves as the evaluation policy in testing. We use a logging policy to sample an action for each context, which is one of the class labels in our case, to produce our training data. We use the same logging policy to generate the testing data. In testing, we first evaluate the ground truth of the evaluation policy, which is the classification error of the pretrained classifier, and then compare it with the off-policy evaluation methods in terms of RMSE and standard deviation. More experimental details are in Appendix E.
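The classification-to-bandit conversion can be sketched in a few lines. This is a simplified illustration with assumed names and synthetic labels: the reward is taken to be 1 when the sampled action equals the true class label, so the ground-truth value of a deterministic classifier policy is its accuracy (under a loss convention it would instead be its error).

```python
import random


def simulate_bandit_logs(examples, logging_policy, rng):
    """Turn fully labeled (x, label) pairs into partial-feedback bandit logs."""
    logs = []
    for x, label in examples:
        probs = logging_policy(x)
        # Sample one action (a class label) from the logging policy.
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        # Only the reward of the sampled action is revealed.
        reward = 1.0 if action == label else 0.0
        logs.append((x, action, reward))
    return logs


def true_value(examples, classifier):
    """Ground-truth value of a deterministic target policy: its accuracy."""
    hits = sum(1.0 for x, label in examples if classifier(x) == label)
    return hits / len(examples)


rng = random.Random(0)                          # fixed seed for repeatability
examples = [(i, i % 3) for i in range(30)]      # synthetic 3-class data
uniform_logging = lambda x: [1 / 3, 1 / 3, 1 / 3]
logs = simulate_bandit_logs(examples, uniform_logging, rng)
always_zero = lambda x: 0                       # toy target classifier
value = true_value(examples, always_zero)
```

Because the full labels are available in simulation, the ground truth is computed exactly from `examples`, while every OPE method only ever sees `logs`.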

Dataset     #Dimensions   #Samples   #Classes
vehicle     18            946        4
optdigits   64            5620       10
letter      16            20000      26
CIFAR10     3072          60000      10
Table 1: Dataset description for bandit simulation.

Logging Policies. A nice property of the multiclass classification to contextual bandits conversion [wang2017optimal, dudik2011doubly] is that we can control the logging policy used to sample training data. Therefore, to cover various logging policies, our logging policy is trained using a subsampled dataset that is potentially biased. The greater the bias, the more likely the logging policy is to contain extreme densities, and the larger the variance of IPS.

We also investigate the case of an unknown logging policy, which is more challenging. We use a classification method that optimizes the logloss objective and produces probabilities for each class as the logging policy estimation.

(top) Known and uniform logging policy
Dataset     Best in DM/IPS/DR family       Best in DM-R/TR family
vehicle     DR-Shrinkage: 0.028 (0.024)    TR-Shrinkage: 0.026 (0.023)
optdigits   DR: 0.046 (0.030)              TR: 0.045 (0.028)
letter      DR: 0.021 (0.021)              TR: 0.019 (0.019)
CIFAR10     DR: 0.012 (0.0092)             TR: 0.011 (0.0088)

(middle) Known and high-variance logging policy
Dataset     Best in DM/IPS/DR family       Best in DM-R/TR family
vehicle     DM: 0.070 (10e-6)              DM-R: 0.0076 (10e-6)
optdigits   DR: 0.21 (0.20)                TR: 0.13 (0.13)
letter      DR: 0.061 (0.061)              TR: 0.040 (0.028)
CIFAR10     DR: 0.015 (0.0060)             TR: 0.012 (0.0050)

(bottom) Unknown logging policy estimated from data
Dataset     Best in DM/IPS/DR family       Best in DM-R/TR family
vehicle     IPS: 0.21 (0.089)              TR-Shrinkage: 0.18 (0.013)
optdigits   DR: 0.53 (0.025)               TR: 0.47 (0.022)
letter      DR: 0.033 (0.016)              TR: 0.022 (0.016)
CIFAR10     DR: 0.070 (0.012)              TR: 0.033 (0.012)
Table 2: Main experimental comparison results, using (top) known and uniform logging policy; (middle) known and high-variance logging policy; and (bottom) unknown logging policy estimated from data. We show the best-performing methods in the DM/IPS/DR family and the DM-R/TR family, with their RMSE mean and standard deviation (in parentheses) over 20 repeated experiments. We see that the best TR method generally outperforms the best baseline method, especially when the logging policy is unknown.
Figure 1: (a) Performance comparison in RMSE on Vehicle when the logging policy is known but has high variance. DR/TR fail due to the variance in logging, but DM-R outperforms DM and further improves SnTR and TR-SWITCH over their DR counterparts. (b) Performance comparison in RMSE on CIFAR10 when the logging policy is estimated from data. Augmenting existing methods improves performance across the board.

Methods Compared. We provide performance comparison with state-of-the-art methods. We classify the off-policy evaluation methods into two categories:

  • DM/IPS/DR family includes DM (3), IPS (4), SnIPS (5), DR (6), DR-SWITCH (7), DR-Shrinkage (8), and SnDR that uses SnIPS in DR.

  • DM-R/TR family includes DM-R (13), TR (Section 3.3), SnTR (14), TR-SWITCH (15), and TR-Shrinkage (16).

We evaluate all the above methods in our experiments. When a reward model is needed, we adopt deep neural networks as representation . In SWITCH and Shrinkage, we set a hard threshold as or respectively when it is greater than 0.5. We do this for a fair comparison with other methods that do not require careful hyperparameter search. We report three sets of results, in which the logging policy is obtained differently. Table 2 (top) uses a known and uniform logging policy, which means every action is equally likely in every context. Table 2 (middle) uses a known logging policy that is estimated from biased subsampled training data. Table 2 (bottom) uses a logging policy estimated with a classification model. We show the best-performing method in each family. To demonstrate how robust regression affects the direct method and doubly robust methods respectively, we also give a more detailed comparison in Figure 1 for Vehicle in the higher-variance logging policy case and for CIFAR10 when using the estimated logging policy. We only show pairwise comparisons for counterparts in the DM/DR family and the DM-R/TR family.

5.2 Performance Analysis

In all cases, the best-performing methods in the DM-R/TR family outperform those in the DM/IPS/DR family. In particular, in the challenging case where the logging policy must be estimated from data, we achieve an even larger gap over the best-performing baseline, as shown in Table 2 (bottom). We can also observe the following from the experimental results.

With known and uniform logging policy: IPS is accurate and has small variance in this case, so both DR and TR achieve good results, and TR can outperform DR with smaller variance. This is also true for the variant methods DR-Shrinkage and TR-Shrinkage.

With known but high-variance logging policy: TR outperforms the baselines most of the time. The only exception is shown in Figure 1 (a), where the logging policy has high variance and DM/DM-R achieve the best error while DR/TR suffer from the variance. In this case, DM-R outperforms DM and the benefit transfers directly to the variant methods.

With estimated logging policy: Even though the RMSE is generally larger for all methods than in the known-policy cases, the best-performing methods in the DM-R/TR family still improve over the DM/DR family. Moreover, Figure 1 (b) shows that TR roughly halves the error of DR on CIFAR10, thanks to a better direct method.

Does DM-R always outperform DM, and does TR always outperform DR? The answer to the former is yes almost all the time, because robust regression considers the shift explicitly. However, the variance of IPS can make both DR and TR suffer, in which case the TR variants or DM-R win. Therefore, when IPS has low variance, DM-R helps TR and its variants beat their DR counterparts. When IPS has higher variance, DM-R can still outperform DM and transfer its benefit to TR and its variants.

6 Conclusion and Future Work

We propose to use deep robust regression for the off-policy evaluation problem under the contextual bandit setting. We demonstrate how it serves as a better direct method (DM-R) and also improves all the doubly robust variants when used as the DM component of DR, which we denote as the Triply Robust (TR) method. We prove novel bias and variance bounds for TR and a minimax bound for DM-R. Experiments demonstrate that DM-R/TR family methods achieve better empirical performance than their counterparts.

We plan to advance our studies in the following directions. There are several DR methods that we can further improve using robust regression [agarwal2017effective, swaminathan2017off]. We also plan to investigate more advanced logging policy estimation methods and study how they interact with TR. Finally, how TR can further benefit off-policy reinforcement learning is also a natural next step. Given recent advances in batch reinforcement learning [highConfidence, sdr, magic, farajtabar2018more, fqe, kallus2019intrinsically], it would be interesting to see how TR methods can interact and compare with them.


Prof. Anandkumar is supported by Bren endowed Chair, faculty awards from Microsoft, Google, and Adobe, DARPA PAI and LwLL grants. Anqi Liu is a PIMCO postdoctoral fellow at Caltech.


Appendix A Derivation of Robust Regression Model

According to [chen2016robust], the solution of the minimax formulation has the parametric form:


The parameters are obtained by maximum conditional log-likelihood estimation with respect to the target distribution:

If we choose the potential function to have a quadratic form and assume the base distribution , we obtain a Gaussian distribution. We provide the details needed to derive the mean and variance from this special form and refer the reader to [chen2016robust] for the rest. If the potential function has this form:


The optimization of the parameters involves maximizing the target log-likelihood: . The gradients of and are as follows:


Appendix B Proof of Theorem 1 and Theorem 2


We first restate a generalization bound and prove a lemma for robust regression.

Theorem 4.

(of [liu2019robust]) The generalization error of robust regression is upper bounded by the following with probability at least :


where is the upper bound of , is the upper bound of model parameter , is the function class of with whose Rademacher complexity is , is the base distribution variance, and is the number of data samples.

Moreover, a property from robust regression is that when the model is fully optimized, we have the following lemma:

Lemma 1.

The distance between the first and second order of the mean estimators from robust regression satisfies the following with probability at least , if both and are bounded by 1:


where is the variance prediction from the robust regression model, is the lower bound of all the features in the context and action pair representation , and is the number of data samples.

Therefore, we can use these tools to further analyze bias and variance of the triply robust method.

By the theoretical properties of robust regression, (24) and (23) hold. Therefore, we have


According to Theorem 2 in [dudik2011doubly], if the logging policy is accurate, the magnitude of the variance of DR depends on . Therefore, in the triply robust case, we have:


We can plug this bound into the original DR variance analysis to obtain a new bound. Similarly, according to the decomposition of the variance of DR, we have:


According to [liu2019robust], we have , where is the upper bound of and is the upper bound of the model parameter . Therefore,


Appendix C Proof of Theorem 3


Robust regression assumes the worst-case data generating distribution that satisfies constraints from feature means of training data. This translates to the mean and variance constraints for the resulting conditional Gaussian distribution. We have the following two lemmas.

Lemma 2.

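The max player in the minimax framework of robust regression, when the feature function takes the form of (20), satisfies:

$$
\left| \mathbb{E}_{a \sim p(a|x),\, r \sim \check{P}(r|x,a)}\left[ r(x,a)^2 \right] - \frac{1}{n}\sum_i r_i^2 + \mathrm{VAR}_{a \sim p(a|x),\, r \sim \check{P}(r|x,a)}\left[ r(x,a) \right] \right| \le \eta_1;
$$

$$
\left| \mathbb{E}_{a \sim p(a|x),\, r \sim \check{P}(r|x,a)}\left[ r(x,a) \right] - \frac{1}{n}\sum_i r_i \right| f(x,a) \le \eta_2;
$$

where $\eta_1$ and $\eta_2$ are the slack values we can set for the constraints.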

Lemma 3.

The estimator solved from the minimax framework when is (2) also satisfies (2).

Under these constraints, we ask what the lowest MSE achievable by any estimator is. Denote as the set of distributions that satisfy (2). We are interested in the minimax risk of off-policy evaluation over the class , which is defined as

We analyze the minimax risk in terms of the mean squared error, even though we optimize the relative loss (9) in practice, because it is more natural and convenient to obtain uncertainties from the relative loss, which provides significant benefit in practice. The key idea of the proof follows classic minimax theory [lafferty9minimax]. We first reduce the problem to hypothesis testing and then pick parameters for the test to obtain the final bound.

Recall that our setup assumes that the logging policy and the evaluation policy , the context distribution , and the constraint slacks and are fixed. According to Lemma 3, the resulting conditional Gaussian distribution satisfies the constraints in (2). Then, instead of setting a general upper bound for the mean and variance of the “sup” player in , we use the constraints in (2) to define a more specific set of distributions from which the adversarial player is drawn.

We now construct a set of distributions that satisfy the constraints. For a given and , the reward distribution is a Gaussian distribution with mean and variance such that they satisfy (2).

For any two distributions and in , we have the lower bound:


Here the subscript denotes the data distribution in which contexts and actions are drawn according to and and the reward is drawn from . For a to be chosen later, we have the following:


To turn the problem into a testing problem, the idea is to identify a pair of distributions and that are far enough from each other that any estimator achieving a small estimation loss can essentially identify whether the data-generating distribution is or . To do this, we take any estimator and identify a corresponding test statistic that maps into one of and . The construction is given in (34).

For any estimator , we can associate a statistic . Therefore, we are interested in its error rate . We can prove that if , it yields (34).

Now we place a lower bound on the error of this test statistic. We use the result of Le Cam [lafferty9minimax], which bounds the error attainable in any testing problem. In our problem, this translates to the following:


Since we would like the probability of error in the test to be a constant, it suffices to choose and such that


We next make concrete choices for and . The constraints we need to satisfy are (36) and (2), which ensure that and are neither so close that an estimator need not identify the true parameter, nor so far apart that the testing problem becomes trivial. To find a reasonable choice of and , we assume and . Then we have . According to (2), we construct the following variances for the Gaussian distribution:


This construction ensures that both and satisfy (2). From now on, we use to represent the variance.

Both reward distributions are now Gaussian with the same variance, so the KL-divergence is given by the squared distance between the means, scaled by the variance, which is:
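As a reference point for this step, the standard single-sample identity for two univariate Gaussians with a common variance is given below. The symbols $\mu_0$, $\mu_1$, and $\sigma^2$ are generic placeholders introduced here, not the paper's notation, and the paper's own expression is for the full data distribution over contexts, actions, and rewards.

```latex
\mathrm{KL}\!\left( \mathcal{N}(\mu_0, \sigma^2) \,\big\|\, \mathcal{N}(\mu_1, \sigma^2) \right)
  = \frac{(\mu_0 - \mu_1)^2}{2\sigma^2},
\qquad
\mathrm{KL}\!\left( \mathcal{N}(\mu_0, \sigma^2)^{\otimes n} \,\big\|\, \mathcal{N}(\mu_1, \sigma^2)^{\otimes n} \right)
  = \frac{n\,(\mu_0 - \mu_1)^2}{2\sigma^2}.
```

The second equality uses the additivity of the KL-divergence over independent samples, which is why the divergence between the two hypotheses grows linearly in the number of samples.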


Thus we have:


The minimax lower bound is then obtained by the largest in such that the other constraints are satisfied. This gives the following optimization problem:


Solving for , we have


where we use to represent . If we have , putting the bounds together, we obtain:


Therefore, the lower bound of is,


Appendix D Additional Method Implemented

We also implemented a direct method that uses an ablated version of DM-R. The robust regression framework is not limited to the covariate shift case. If for all the actions, meaning that there is no covariate shift or that we ignore the shift, we obtain a version of robust regression that can be applied to i.i.d. data. In this case, we robustly minimize the relative loss under the data distribution generated by the logging policy: , which has as its solution a Gaussian distribution with the following mean and variance:


We then plug the mean estimates from (53) into the direct method and obtain:


Appendix E Experimental Details

The network used in the experiments is a 4-layer fully connected network with spectral normalization and 64 hidden nodes for the UCI datasets. ResNet18 is used for CIFAR10. We train for 5 epochs to generate the ground truth and for 20 epochs to train the reward models. The learning rate for stochastic gradient descent is 0.0001. At prediction time, we round the regression result to be within . The base distribution for robust regression is a Gaussian distribution with mean 0.5 and variance 1, since we know the reward is between .
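For readers unfamiliar with spectral normalization, the following is a minimal dependency-free sketch of the idea: each weight matrix is divided by an estimate of its largest singular value. The text above does not say how the normalization is implemented, so the power-iteration estimate and all names here are assumptions made for illustration. In practice one would rely on the layer wrapper provided by a deep learning framework rather than this code.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    """Return the transpose of W as a list of rows."""
    return [list(col) for col in zip(*W)]


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spectral_norm(W, n_iters=100, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in W[0]])
    Wt = transpose(W)
    for _ in range(n_iters):
        u = normalize(matvec(W, v))   # left singular vector estimate
        v = normalize(matvec(Wt, u))  # right singular vector estimate
    return math.sqrt(sum(x * x for x in matvec(W, v)))


def spectrally_normalize(W, n_iters=100, seed=0):
    """Rescale W so that its largest singular value is approximately 1."""
    sigma = spectral_norm(W, n_iters=n_iters, seed=seed)
    return [[w / sigma for w in row] for row in W]


# Illustrative weight matrix whose singular values are 3 and 1.
W_example = [[3.0, 0.0], [0.0, 1.0]]
sigma_example = spectral_norm(W_example)
W_normalized = spectrally_normalize(W_example)
```

Bounding the largest singular value of each layer bounds the Lipschitz constant of the layer's linear map, which is the usual motivation for this kind of normalization.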

In the known logging policy case, except when the logging policy is uniform, we need to train a probabilistic model from a subsampled dataset. We use the same NN architecture as the reward model and train on fully observed data to obtain a “sample model”. We can control the variance by varying the regularization strength and the number of training epochs. Similarly, when we need to estimate the logging policy, we use the data generated by the “sample model” as training data to train a “policy model”.
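The data-generation step described above can be sketched as follows. This assumes the common supervised-to-bandit conversion, in which the logging policy samples one action per example and the reward is 1 when the action matches the true label and 0 otherwise. That reward definition and all names are assumptions for illustration, not details taken from this appendix. Here `logging_probs` stands in for the per-example action probabilities that a “sample model” would output.

```python
import random


def uniform_policy(n_examples, n_actions):
    """Action probabilities of a uniform logging policy."""
    return [[1.0 / n_actions] * n_actions for _ in range(n_examples)]


def log_bandit_feedback(labels, logging_probs, seed=0):
    """Simulate logged bandit data from fully labeled examples.

    Returns one (action, reward, propensity) tuple per example, where the
    propensity is the logging probability of the sampled action.
    """
    rng = random.Random(seed)
    logged = []
    for y, probs in zip(labels, logging_probs):
        # Sample a single action from the logging policy for this example.
        a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        # Assumed reward: 1 for the correct class, 0 otherwise.
        r = 1.0 if a == y else 0.0
        logged.append((a, r, probs[a]))
    return logged


# Illustrative usage with four examples and three actions.
labels = [0, 1, 2, 1]
logged = log_bandit_feedback(labels, uniform_policy(len(labels), 3))
```

When the logging policy must be estimated, the recorded propensities would be discarded and a separate “policy model” fit to the logged context and action pairs, as described above.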

Appendix F More Experimental Results

We put the full experimental results in Table 3 and Table 4.

| Data | Policy | DM | IPS | SnIPS | DR | SnDR | DR-SWITCH | DR-Shrinkage |
|---|---|---|---|---|---|---|---|---|
| optdigits | uniform | 0.24 (10e-6) | 0.064 (0.035) | 0.62 (10e-6) | 0.046 (0.03) | 0.24 (10e-6) | 0.24 (10e-6) | 0.18 (0.006) |
| optdigits | biased | 0.40 (10e-6) | 0.24 (0.23) | 0.55 (10e-6) | 0.21 (0.20) | 0.40 (10e-6) | 0.41 (0.0015) | 0.42 (0.0048) |
| optdigits | estimated | 0.59 (10e-6) | 0.58 (0.027) | 0.67 (10e-6) | 0.53 (0.025) | 0.60 (0.00054) | 0.61 (0.027) | 0.61 (10-6) |