Semi-Parametric Efficient Policy Learning with Continuous Actions
Abstract
We consider off-policy evaluation and optimization with continuous action spaces. We focus on observational data where the data collection policy is unknown and needs to be estimated. We take a semiparametric approach where the value function takes a known parametric form in the treatment, but we are agnostic on how it depends on the observed contexts. We propose a doubly robust off-policy estimate for this setting and show that off-policy optimization based on this estimate is robust to estimation errors of the policy function or the regression model. Our results also apply if the model does not satisfy our semiparametric form; in that case, we measure regret relative to the best projection of the true value function onto this functional space. Our work extends prior approaches to policy optimization from observational data, which only considered discrete actions. We provide an experimental evaluation of our method in a synthetic data example motivated by optimal personalized pricing and costly resource allocation.
Mert Demirer MIT mdemirer@mit.edu Vasilis Syrgkanis Microsoft Research vasy@microsoft.com Greg Lewis Microsoft Research glewis@microsoft.com Victor Chernozhukov MIT vchern@mit.edu
1 Introduction
We consider off-policy evaluation and optimization with continuous action spaces from observational data, where the data collection (logging) policy is unknown. We take a semiparametric approach where we assume that the value function takes a known parametric form in the treatment, but we are agnostic on how it depends on the observed contexts/features. In particular, we assume that:
(1) 
for some known feature functions but unknown functions . We assume that we are given a set of observational data points that consist of i.i.d. copies of the random vector , such that . (Footnote 1: In most of the paper, we can allow for the case where is endogenous, in the sense that . In other words, the noise in the random variable can be potentially correlated with . However, we assume that, conditional on , there is no remaining endogeneity in the choice of the action in our data. The latter is typically referred to as conditional ignorability/exogeneity [11].)
Our goal is to estimate a policy from a space of policies that achieves good regret:
(2) 
for some regret rate that depends on the policy space and the sample size .
The semiparametric value assumption allows us to formulate a doubly robust estimate of the value function, from the observational data, which depends on first stage regression estimates of the coefficients and the conditional covariance of the features . The latter is the analogue of the propensity function when actions are discrete. Our estimate is doubly robust in that it is unbiased if either or is correct. Then we optimize this estimate:
(3) 
Main contributions.
We show that the double robustness property implies that our objective function satisfies a Neyman orthogonality criterion, which in turn implies that our regret rates depend only to second order on the estimation errors of the first stage regression estimates of the functions . Moreover, we prove a regret rate whose leading term depends on the variance of the difference of our estimated value between any two policies within a “small regret-slice” and on the entropy integral of the policy space. We achieve this with a computationally efficient variant of the empirical risk minimization (ERM) algorithm (of independent interest) that uses a validation set to construct a preliminary policy and then uses that policy to regularize the policy computed on the training set. Hence, we achieve variance-based regret bounds without the variance or moment penalization [15, 20, 9] used in prior work, which can render a computationally tractable policy learning problem non-convex. We also show that the asymptotic variance of our off-policy estimate (which governs the scale of the leading regret term) is asymptotically minimax optimal, in the sense that it achieves the semiparametric efficiency lower bound.
Robustness to misspecification.
Notably, our approach provides meaningful guarantees even when our semiparametric value function assumption is violated. Suppose that the true value function does not take the form of Equation (1), but rather takes some other form . Then one can consider the projection of this value function onto the forms of Equation (1), as:
(4) 
where the expectation is taken over the distribution of observed data. Our approach can then be interpreted as achieving good regret bounds with respect to this best linear semiparametric approximation. This regret target is an alternative to the kernel smoothing approximation proposed by [20] in the contextual bandit setting, and is related to [12]. If there is some rough domain knowledge about how the action affects the reward, then our semiparametric approximate target should achieve better performance when the dimension of the action space is large, since kernel methods typically incur a bias that is exponential in the dimension.
Double robustness.
In cases where the collection policy is known, our doubly robust approach can be used for variance reduction via fitting first stage regression estimates to the policy value, whilst maintaining unbiasedness. Thus we can apply our approach to improve regret in the counterfactual risk minimization framework [20], [12] and as a variance reduction method in contextual bandit algorithms with continuous actions [20].
Related Literature.
Our work builds on the recent work at the intersection of semiparametric inference and policy learning from observational data. The important work of [1] analyzes binary treatments and infinitesimal nudges to continuous treatments. They also take a doubly robust approach so as to obtain regret bounds whose leading term depends on the semiparametric efficient variance and the entropy integral and which are robust to first stage estimation errors. The problem we study in this paper is different in that we consider optimizing over continuous action spaces, rather than infinitesimal nudges, under a semiparametric functional form. This assumption is without loss of generality if the treatment is binary or multi-valued. Hence, our results are a generalization of binary treatments to arbitrary continuous action spaces, subject to our semiparametric value assumption. In fact, we show formally in the Appendix how one can recover the result of [1] for the binary setting from our main regret bound. In turn, our work builds on a long line of work on policy learning and counterfactual risk minimization [17, 26, 27, 1, 13, 28, 2, 7, 20, 12, 14]. Notably, the work of [28] extends the work of [1] to many discrete actions, but only proves a second moment based regret bound, which can be much larger than the variance. Our setting also subsumes the setting of many discrete actions and hence our regularized ERM offers an improvement over the rates in [28]. [9] formulates a general framework of statistical learning with a nuisance component. Our method falls into this framework and we build upon some of the results in [9]. However, for the case of policy learning, the implications of [9] provide a variance-based regret only when invoking second moment penalization, which can be intractable. We sidestep this need and provide a computationally efficient alternative. Finally, most of the work on policy learning in machine learning assumes that the current policy (equiv. ) is known. Hence, double robustness is used mostly as a variance reduction technique. Even for this literature, as we discuss above, our method can be seen as an alternative to recent work on policy learning with continuous actions [12, 14] that makes use of nonparametric kernel methods.
Our work also connects to the semiparametric estimation literature in econometrics and statistics. Our model is an extension of the partially linear model, which has been extensively studied in econometrics [8, 19]. By considering context-specific coefficients (random coefficients) and modeling a value function that is nonlinear in treatment, we substantially extend the partially linear model. [24, 10] studied a special case of our model where the output is linearly dependent on treatment given context, with the aim of estimating the average treatment effect. [10] constructed the doubly robust estimator and showed its semiparametric efficiency under the linear-in-treatment assumption. We extend their results to a more general functional form and use the double-robustness property and semiparametric efficiency for policy evaluation and optimization rather than treatment effect estimation. Our work is also connected to the recent and rapidly growing literature on orthogonal/locally robust/debiased estimation [5, 6, 21].
2 Orthogonal Off-Policy Evaluation and Optimization
Let be a first stage estimate of , which can be obtained by minimizing the square loss:
(5) 
where is an appropriate parameter space for the parameters . Let denote the conditional covariance matrix:
This is the analogue of the propensity model in discrete treatment settings. An estimate can be obtained by running a multitask regression problem for each entry to the matrix, i.e.:
(6) 
where is some appropriate hypothesis space for these regressions. Finally, the doubly robust estimate of the offpolicy value takes the form:
(7) 
where:
(8)  
(9) 
The quantity can be viewed as an estimate of , based on a single observation. In fact, if the matrix was equal to , then one can see that is an unbiased estimate of . Our estimate also satisfies a doubly robust property, i.e. it is correct if either is unbiased or is unbiased (see Appendix E for a formal statement). Finally, we will denote with the version of , where the nuisance quantities and are replaced by their true values, and correspondingly define . We perform policy optimization based on this doubly robust estimate:
(10) 
Moreover, we let be the optimal policy:
(11) 
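The doubly robust estimate and the covariance regression described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_sigma`, `theta_dr`, and `value_dr` are hypothetical, the value is assumed to be the inner product of context-dependent coefficients with known features, the correction term is taken to be the inverse conditional covariance applied to the feature-weighted regression residual, and each entry of the covariance is fit by a plain linear least-squares regression on user-supplied context features.

```python
import numpy as np

def fit_sigma(phi, zfeat, ridge=1e-8):
    """Regress every entry of phi phi^T on context features (multitask least squares).

    phi: (n, p) features of the logged actions; zfeat: (n, d) context features.
    Returns a function mapping context features to (m, p, p) covariance estimates.
    """
    n, p = phi.shape
    outer = np.einsum("ni,nj->nij", phi, phi).reshape(n, p * p)
    gram = zfeat.T @ zfeat + ridge * np.eye(zfeat.shape[1])
    coef = np.linalg.solve(gram, zfeat.T @ outer)
    return lambda zf: (zf @ coef).reshape(-1, p, p)

def theta_dr(y, phi, theta_hat, sigma_hat):
    """Plug-in coefficients plus Sigma^{-1} phi times the regression residual."""
    resid = y - np.einsum("ni,ni->n", theta_hat, phi)
    weighted = (phi * resid[:, None])[:, :, None]  # shape (n, p, 1)
    return theta_hat + np.linalg.solve(sigma_hat, weighted)[:, :, 0]

def value_dr(theta_scores, phi_policy):
    """Sample average of <theta_DR, phi(pi(z), z)> for a candidate policy."""
    return float(np.mean(np.einsum("ni,ni->n", theta_scores, phi_policy)))
```

In this sketch, a deliberately wrong coefficient estimate (for example all zeros) still yields a value estimate close to the truth when the covariance estimate is accurate, which is one side of the double robustness property.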
Remark 1 (Multi-Action Policy Learning).
A special case of our setup is the setting with finitely many actions. This can be encoded as and . In that case, observe that the covariance matrix becomes a diagonal matrix: , with . In this case, we simply recover the standard doubly robust estimate that combines the direct regression part with the inverse propensity weights part, i.e.:
Thus our estimator is an extension of the doubly robust estimate from discrete to continuous actions.
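The reduction in Remark 1 can be checked numerically. The sketch below assumes actions are encoded as one-hot feature vectors, so that the conditional second-moment matrix of the features is the diagonal matrix of action propensities. The function names are hypothetical and the general score uses the same assumed form (plug-in coefficients plus inverse covariance times feature-weighted residual) for a single observation.

```python
import numpy as np

def dr_score_general(y, phi, theta_hat, sigma_hat):
    """Single-observation score: theta_hat + Sigma^{-1} phi * (y - <theta_hat, phi>)."""
    return theta_hat + np.linalg.solve(sigma_hat, phi) * (y - theta_hat @ phi)

def dr_score_discrete(y, action, mu_hat, propensity):
    """Standard discrete-action score: direct regression plus an inverse
    propensity weighted residual on the action that was actually taken."""
    score = np.array(mu_hat, dtype=float)
    score[action] += (y - mu_hat[action]) / propensity[action]
    return score

# Example with three actions; the logged action is action 1.
propensity = np.array([0.2, 0.3, 0.5])
mu_hat = np.array([1.0, 0.5, -1.0])
general = dr_score_general(2.0, np.eye(3)[1], mu_hat, np.diag(propensity))
discrete = dr_score_discrete(2.0, 1, mu_hat, propensity)
```

With one-hot features the two scores coincide entry by entry, which is the sense in which the continuous-action estimate extends the discrete one.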
Remark 2 (Finitely Many Possible Actions: Linear Contextual Bandits).
Another interesting special case of our approach is a generalization of the linear contextual bandit setting. In particular, suppose that there is only a finite (albeit potentially large) set of possible actions and . However, unlike the multi-action setting, where these actions are the orthonormal basis vectors, in this setting each action maps to a feature vector . Then the reward that we observe satisfies . This is a generalization of the linear contextual bandit setting, in which the coefficient vector is a constant parameter as opposed to varying with . In this case observe that: , i.e. it is the sum of rank one matrices where , and The doubly robust estimate of the parameter takes the form:
This approach leverages the functional form assumption to get an estimate that avoids a large variance that depends on the number of actions but rather mostly depends on the number of parameters . This is achieved by sharing reward information across actions.
Remark 3 (Linear-in-Treatment Value).
Consider the case where the value is linear in the action . In this case observe that: . For instance, suppose that we assume that experimentation is independent across actions in the observed data. Then , where . Then the doubly robust estimate of the parameter takes the form:
(12) 
3 Theoretical Analysis
Our main regret bounds are derived for a slight variation of the ERM algorithm that we presented in the preliminary section. In particular, we crucially need to augment the ERM algorithm with a “validation” step, where we split our data into a training set and a validation set, and we restrict attention to policies that achieve small regret on the training data, while still maintaining small regret on the validation set. This extra modification enables us to prove variance-based regret bounds and is reminiscent of standard approaches in machine learning, like k-fold cross-validation and early stopping, hence could be of independent interest.
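The validation step can be illustrated for a finite set of candidate policies. The sketch below is one possible reading of the description above and not the exact algorithm analyzed in the theorem: `regularized_erm` is a hypothetical name, the inputs are assumed to be doubly robust value estimates of each candidate computed separately on the validation and training splits, and `slack` is an assumed tolerance parameter.

```python
import numpy as np

def regularized_erm(val_scores, train_scores, slack):
    """Pick the best training-set policy among near-optimal validation policies.

    val_scores, train_scores: estimated values of each candidate policy on the
    validation and training splits. slack: how far below the best validation
    value a policy may fall and still be considered.
    """
    val_scores = np.asarray(val_scores, dtype=float)
    train_scores = np.asarray(train_scores, dtype=float)
    prelim = int(np.argmax(val_scores))                  # preliminary policy
    feasible = np.flatnonzero(val_scores >= val_scores[prelim] - slack)
    return int(feasible[np.argmax(train_scores[feasible])])
```

The design choice is that the preliminary policy only restricts the search; the final selection is still driven by the training split, so no variance penalty enters the objective.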
We note that we present our theoretical results for the simpler case where the nuisance estimates are trained on a separate split of the data. However, our results qualitatively extend to the case where we use the cross-fitting idea of [5] (i.e. train a model on one half and predict on the other and vice versa).
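Cross-fitting as described here (train on one part, predict on the other, and vice versa) can be sketched as below. The learner is an ordinary least-squares fit chosen only for illustration; the function name `cross_fit_predictions` and its parameters are hypothetical, and any regression method could be substituted for the nuisance model.

```python
import numpy as np

def cross_fit_predictions(features, targets, n_folds=2, seed=0):
    """Out-of-fold predictions: each point is predicted by a model that was
    fit only on the other folds."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    preds = np.empty_like(targets, dtype=float)
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        coef, *_ = np.linalg.lstsq(features[train], targets[train], rcond=None)
        preds[held_out] = features[held_out] @ coef
    return preds
```

With `n_folds=2` this is the two-way split mentioned in the text; the resulting predictions can be used as nuisance estimates for every observation without any observation being scored by a model that saw it during fitting.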
Regret bound.
To show the properties of this algorithm, we first show that the regret of the doubly robust algorithm is impacted in a second order manner by the errors in the first stage estimates. We will also make the following preliminary definitions. For any function we denote with , the standard norm and with its empirical analogue. Furthermore, we define the empirical entropy of a function class as the largest value, over the choice of samples, of the logarithm of the size of the smallest empirical cover of on the samples with respect to the norm. Finally, we consider the empirical entropy integral:
(13) 
Our statistical learning problem corresponds to learning over the function space:
(14) 
where the data is . We will also make a very benign assumption on the entropy integral:
Assumption 1.
The function class satisfies that for any constant , as .
Theorem 1 (Variance-Based Oracle Policy Regret).
Suppose that the nuisance estimates satisfy that their mean squared error is upper bounded w.p. by , i.e. w.p. over the randomness of the nuisance sample:
(15) 
Let and . Moreover, let
(16) 
denote an regret slice of the policy space. Let and
(17) 
denote the variance of the difference between any two policies in an regret slice, evaluated at the true nuisance quantities. Then the policy returned by the out-of-sample regularized ERM satisfies w.p. over the randomness of :
(18) 
The expected regret is , where is the expected MSE of the nuisance functions.
We provide a proof of this Theorem in Appendix B. The regret result contains two main contributions: (1) the impact of the nuisance estimation error is of second order (i.e. instead of ), and (2) the leading regret term depends on the variance of small-regret policy differences and the entropy integral of the policy space. The first property stems from the Neyman orthogonality property of the doubly robust estimate of the policy. The second property stems from the out-of-sample regularization step that we added to the ERM algorithm. Typically, we will have and thereby this term is of lower order than the leading term. Moreover, for many policy spaces , in which case we see that if the setting satisfies a “margin” condition (i.e. the best policy is better by a constant margin), then eventually the variance of small-regret policies is , since the slice only contains the best policy. In that case, our bound leads to fast rates of as opposed to , since the leading term vanishes (similar to the achieved in bandit problems with such a margin condition).
Dependence on the quantity is quite intuitive: if two policies have almost equivalent regret up to a rate, then it will be very easy to be misled among them if one has much higher variance than the other. For some classes of problems, the above also implies a regret rate that only depends on the variance of the optimal policy (e.g. when all policies with low regret have a variance that is not much larger than the variance of the optimal policy). In Appendix F we show that the latter is always the case for the setting of binary treatment studied in [1] and therefore, applying our main result, we recover exactly their bound for binary treatments.
Semiparametric efficient variance.
Our regret bound depends on the variance of our doubly robust estimate of the value function. One then wonders if there are other estimates of the value function that could achieve better variance than . However, we show that at least asymptotically and without further assumptions on the functions and , this cannot be the case. In particular, we show that our estimator achieves what is known as the semiparametric efficient variance limit for our setting. More importantly for our regret result, this is also true for the semiparametric efficient variance of the policy differences. This holds in our main setup, where the model is misspecified and is only a projection of the true value, and also when we assume that our model is correct but make the extra assumption of homoskedasticity, i.e., that the conditional variance of the residuals of the outcomes does not depend on .
Theorem 2 (Semiparametric Efficiency).
If the model is misspecified, i.e, the asymptotic variance of is equal to the semiparametric efficiency bound for the policy value defined in Equation (4). If the model is correctly specified, is semiparametrically efficient under homoskedasticity, i.e. .
We provide a proof for the value function, but this result also extends to the difference of values. We conclude the section by providing concrete examples of rates for policy classes of interest.
Example 1 (VC Policies).
As a concrete example, consider the case when the class is a VC-subgraph class of VC dimension (e.g. the policy space has small VC-dimension or pseudo-dimension), and let . Then Theorem 2.6.7 of [22] shows that: (see also discussion in Appendix F). This implies that
Hence, we can conclude that regret is . For the case of binary action policies (as we discuss in Appendix F) this result recovers the result of [1] for binary treatments up to constants and extends it to arbitrary action spaces and VC-subgraph policies.
Example 2 (High-Dimensional Index Policies).
As an example, we consider the class of policies, characterized by a constant number of or bounded linear indices:
(19) 
where is a fixed Lipschitz function of the indices, with constants, while (and similarly for , where we use ). Assuming is a Lipschitz function of and since is a Lipschitz function of , we have by a standard multivariate Lipschitz contraction argument (and since , are constants) that the entropy of is of the same order as the maximum entropy of each of the linear index spaces: . Moreover, by known covering arguments (see e.g. [25], Theorem 3), if , then: . Thus we get , which leads to regret . In this setting, we observe that the policy space is too large for the variance to drive the asymptotic regret. There is a leading term that remains even if the worst-case variance of policies in a small-regret slice is . Intuitively this stems from the high dimensionality of the linear indices, which introduces an extra dimension of error, namely bias due to regularization. On the contrary, for exactly sparse policies , we have that since for any possible support the entropy at scale is at most , we can take a union over all possible sparse supports, which implies . Thus , leading to policy regret similar to the VC classes: .
Remark 4 (Instrumental Variable Estimation).
Our main regret results extend to the instrumental variables settings where treatments are endogenous but we have a vector of instrumental variables satisfying
and is invertible. Then we can use the following doubly robust moment
Remark 5 (Estimating the First Stages).
Bounds on first stage errors as a function of sample complexity measures for the first stage hypothesis spaces can be obtained by standard results on the MSE achievable by regression problems (see e.g. [18, 23]). Essentially these are bounds for the regression estimates and , as a function of the complexity of their assumed hypothesis spaces. Since the latter is a standard statistical learning problem that is orthogonal to our main contribution, we omit technical details. Since the square loss is a strongly convex objective, the rates achievable for these problems are typically fast rates on the MSE (e.g. is of the order for the case of parametric hypothesis spaces, and typically for reproducing kernel Hilbert spaces with fast eigendecay (see e.g. [23])). Thus the term is of lower order. For instance, the required rates for the term to be of second order in our regret bounds are achievable if these nuisance regressions are penalized linear regressions and several regularity assumptions are satisfied by the data distribution, even when the dimension of is growing with .
Extension: Semi-Bandit Feedback
Suppose that our value function takes the form: , where is a matrix and we observe semi-bandit feedback, i.e. we observe a vector s.t.: . Then we can apply our DR approach to each coordinate of separately.
All the theorems in this section extend to this case, which will prove useful in our pricing application where is the price of a set of products and is the vector of observed demands for each product.
4 Application: Personalized Pricing
Consider the personalized pricing of a single product. The objective is the revenue:
where and gives the unknown, context-specific demand function. We assume that we observe an unbiased estimate of demand:
We want to optimize over a space of personalized pricing policies . If, for instance, the observational policy was homoskedastic (i.e. the exploration component was independent of the context ), we show in Appendix G that doubly robust estimators for and are
where and the variance . Thus, in this example, we only need to estimate the mean treatment policy and the variance .
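As a hedged illustration of why only the mean and variance of the logging price are needed, the sketch below assumes demand is linear in price with a context-specific intercept and slope, so the features are (1, p), and that the logged price equals a context-dependent mean plus independent noise with constant variance. Under those assumptions the second-moment matrix of the features has a closed-form inverse. The function names and the sign convention for the slope are illustrative assumptions, not taken from Appendix G.

```python
import numpy as np

def pricing_weights_closed_form(price, mean_price, var_price):
    """Closed form of Sigma(z)^{-1} (1, p)^T under the homoskedastic assumption."""
    dev = price - mean_price
    return np.array([1.0 - mean_price * dev / var_price, dev / var_price])

def pricing_weights_general(price, mean_price, var_price):
    """Same quantity computed by explicitly inverting the second-moment matrix
    E[(1, p)(1, p)^T | z] = [[1, g], [g, g^2 + s^2]]."""
    sigma = np.array([[1.0, mean_price],
                      [mean_price, mean_price**2 + var_price]])
    return np.linalg.solve(sigma, np.array([1.0, price]))

# Example: logged price 3.0, mean price 2.0, price variance 0.5.
closed = pricing_weights_closed_form(3.0, 2.0, 0.5)
general = pricing_weights_general(3.0, 2.0, 0.5)
```

Multiplying these weights by the demand residual and adding the plug-in intercept and slope gives the two doubly robust scores in this parameterization.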
Experimental evaluation.
We empirically evaluate our framework on the personalized pricing application with synthetic data. In particular, we use simulations to assess our estimator’s ability to evaluate and optimize personalized pricing functions. To do this, we compare the performance of our doubly robust estimator with (1) Direct estimator, , (2) Inverse propensity score estimator, , (3) Oracle orthogonal estimator, .
Data Generating Process.
Our simulation design considers a sparse model. We assume that there are continuous context variables distributed uniformly for but only of them affects demand. Let . Price and demand are generated as . We consider four functional forms for the demand model: (i) (Quadratic) , (ii) (Step) , (iii) (Sigmoid) , (iv) (Linear)
These functions and the data generating process ensure that the conditional expectation function of demand given is nonnegative for all , the observed prices are positive with high probability, and the optimal prices are in the support of the observed prices. In each experiment, we generate 1000, 2000, 5000, and 10000 data points, and report results over 100 simulations. We estimate the nuisance functions using a 5-fold cross-validated lasso model with polynomials of degrees up to 3 and all the two-way interactions of context variables. We present the results for two regimes: (i) Low dimensional with , (ii) High dimensional with .
Policy Evaluation.
For policy evaluation we consider four pricing functions: (i) Constant, , (ii) Linear, , (iii) Threshold, , (iv) Sin, . The results for the low dimensional regime are summarized in Figure 1(a), where each row and column corresponds to a different demand function and a policy function, respectively. (Footnote 3: The results are very similar for the high dimensional model and are reported in Figure 4(a) in the appendix.) The results show that, as expected, the performance of our method is very similar to that of the oracle estimator and significantly better than that of the direct and inverse propensity score methods, which suffer from large biases. These results also support our claim that the asymptotic variance of the doubly robust estimate is the same as the variance of the oracle method. It is also important to point out that we obtain similar performance across the two regimes.
Regret.
To investigate the regret performance of our method, we consider a constant pricing function, and a linear policy . We compute the optimal pricing functions in these two function spaces and report the distribution of regret in Figure 4(b) under the low dimensional regime and in Appendix H under the high dimensional regime. Across the four demand functions and two pricing functions we consider, our method achieves small regrets, comparable to the oracle method. The direct and inverse propensity methods, depending on the demand function, yield large regrets.
4.1 Quadratic Model
Finally, we consider the same simulation exercise under the assumption that an unbiased estimate of revenue rather than demand is observed. Since revenue depends on the the model is now quadratic
For the data generating process we use the same functions and as in the personalized pricing example. (Footnote 4: We provide the calculation of the doubly robust estimators for this example in Appendix G.) Figures 2 and 5 in Appendix H summarize results for policy evaluation and optimization. The overall performance of our doubly robust estimator is similar to that in the demand model, and it performs better than the direct method. One important difference to note is that when the sample size is small, we observe some finite sample biases for some function classes.
5 Application: Costly Resource Allocation
Motivated by a resource allocation scenario, we also analyze experimentally the special case where . Consider the case where we have possible tasks to invest in, and we have investment costs. Each task yields a return on investment that is a linear function of the investment, but an unknown function of the context . Moreover, to maintain an investment portfolio of we need to pay a known cost . Given a policy space , our goal is to optimize:
(20) 
This falls into our framework, if we treat the offset part as of the form but with a known . So in that case we simply consider . Then applying our framework we optimize:
(21) 
In the case of quadratic costs , then this boils down to exactly optimizing a square loss objective, since:
(22) 
Thus policy optimization reduces to a multi-task regression problem where we are trying to predict from . (Footnote 5: The above reasoning extends to heterogeneous costs across tasks, e.g. . In this case the label of the th task of the multi-task regression problem is and we need to perform a weighted multi-task regression where the weight on the square loss for task is equal to .)
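The reduction to regression can be seen from completing the square. The sketch below assumes, for illustration only, a cost of one half the squared norm of the allocation, and treats the doubly robust coefficient scores as given labels; the function names are hypothetical. Under that assumption, maximizing the estimated objective over linear policies is the same as an ordinary least-squares fit of the scores on the context.

```python
import numpy as np

def quadratic_cost_objective(theta_scores, alloc):
    """Sample average of <theta_DR, a> - 0.5 * ||a||^2 (assumed cost form)."""
    gain = np.sum(theta_scores * alloc, axis=1)
    cost = 0.5 * np.sum(alloc**2, axis=1)
    return float(np.mean(gain - cost))

def fit_linear_policy(context, theta_scores):
    """Least-squares regression of the scores on the context; the fitted map
    context @ coef is the objective-maximizing linear allocation policy."""
    coef, *_ = np.linalg.lstsq(context, theta_scores, rcond=None)
    return coef
```

The identity used is that the objective equals one half the mean squared norm of the scores minus one half the mean squared error between scores and allocations, so the first term does not depend on the policy.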
We can consider sparse linear policies:
(23) 
where corresponds to the th row of matrix . In this case our problem reduces to the Multi-Task Lasso problem where the label is .
Experimental Evaluation.
For experimental evaluation we consider a model with two tasks, and :
We use the same distributions and functions, and , given above for the pricing application. To estimate the optimal allocation and its regret, we run a 5-fold cross-validated Multi-Task Lasso algorithm and set . We report the distribution of return on investment obtained from different models in Figure 3. The results suggest that the doubly robust method achieves a significantly lower regret than the direct method in both regimes and its performance is similar to the oracle method. (Footnote 6: For comparison, the value achieved by the best-in-class policy is 22.2 in the low dimensional regime and ? in the high dimensional regime.) We omit the inverse propensity score regrets since they are too large to report together with other estimates.
References
 [1] Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2018.
 [2] Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 129–138. ACM, 2009.
 [3] Peter J Bickel, Chris AJ Klaassen, Ya’acov Ritov, and Jon A Wellner. Efficient and adaptive estimation for semiparametric models, volume 4. Johns Hopkins University Press Baltimore, 1993.
 [4] Gary Chamberlain. Efficiency bounds for semiparametric regression. Econometrica: Journal of the Econometric Society, pages 567–596, 1992.
 [5] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
 [6] Victor Chernozhukov, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, and James M Robins. Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033, 2016.
 [7] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104. Omnipress, 2011.
 [8] Robert F Engle, Clive WJ Granger, John Rice, and Andrew Weiss. Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81(394):310–320, 1986.
 [9] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
 [10] Bryan S Graham and Cristine Campos de Xavier Pinto. Semiparametrically efficient estimation of the average linear regression function. Working paper, National Bureau of Economic Research, 2018.
 [11] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
 [12] Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. arXiv preprint arXiv:1802.06037, 2018.
 [13] Toru Kitagawa and Aleksey Tetenov. Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591–616, 2018.
 [14] Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. arXiv preprint arXiv:1902.01520, 2019.
 [15] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In The 22nd Conference on Learning Theory (COLT), 2009.
 [16] Whitney K Newey. Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2):99–135, 1990.
 [17] Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.
 [18] Alexander Rakhlin, Karthik Sridharan, Alexandre B Tsybakov, et al. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.
 [19] Peter M Robinson. Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pages 931–954, 1988.
 [20] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
 [21] Mark J Van der Laan and Sherri Rose. Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media, 2011.
 [22] A. W. Van Der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series, March 1996.
 [23] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
 [24] Jeffrey M Wooldridge. Estimating average partial effects under conditional moment independence assumptions. Working paper, cemmap working paper, 2004.
 [25] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550, 2002.
 [26] Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
 [27] Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169–187, 2017.
 [28] Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778, 2018.
Appendix A Proof of Universal Orthogonality Lemma
We start by defining a sufficient condition for the notion of universal orthogonality of a loss function, as defined by [9]. A loss function is universally orthogonal with respect to if for any :
(24) 
where is the true value of the nuisance parameter .
Lemma 3.
The loss function is universally orthogonal with respect to .
Proof.
We show that the population loss function that corresponds to the doubly robust estimate satisfies the universal orthogonality condition. For simplicity of notation let . Then the population loss is:
Let:
Observe that:
To show universal orthogonality it suffices to show that:
This follows easily by simple algebraic manipulations:
and
Now observe that since is the minimizer of the conditional squared loss, taking the first order condition implies:
Moreover:
Combining the two yields:
which implies orthogonality with respect to . ∎
Appendix B Proof of Main Regret Theorem 1
We first consider an arbitrary empirical loss minimization problem of the form:
(25) 
where are drawn i.i.d. from an unknown distribution and is an arbitrary data space. Throughout the section we will assume that: . All the results can be generalized to the case of , for some arbitrary , by first rescaling the losses and then invoking the theorems of this section.
We also make the following preliminary definitions. For any function we denote with , the standard norm and with its empirical analogue. The localized Rademacher complexity is defined as:
(26) 
where are independent Rademacher variables that take the values -1 and +1 with equal probability.
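As a rough illustration of the quantity just defined, the following sketch estimates a localized Rademacher complexity by Monte Carlo for a finite function class. It rests on several assumptions that are not taken from the text: the class is finite and represented by its values on the sample, localization uses the empirical norm rather than the population norm, and the names `localized_rademacher`, `values`, `delta`, and `n_draws` are hypothetical.

```python
import numpy as np


def localized_rademacher(values, delta, n_draws=2000, seed=0):
    """Monte Carlo estimate of a localized Rademacher complexity.

    values: array of shape (num_functions, n); row j holds f_j(z_1), ..., f_j(z_n).
    delta:  localization radius. ASSUMPTION: we localize with the empirical
            L2 norm, as a stand-in for the population norm.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    # Empirical L2 norm of each function: sqrt(mean_i f(z_i)^2).
    norms = np.sqrt(np.mean(values ** 2, axis=1))
    local = values[norms <= delta]
    if local.shape[0] == 0:
        # Convention for this sketch: an empty localized class contributes 0.
        return 0.0
    # Rademacher signs: -1 or +1, each with probability 1/2.
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # For each draw, take the sup over the localized class of (1/n) sum_i eps_i f(z_i).
    sups = np.max(eps @ local.T / n, axis=1)
    # Average over draws approximates the expectation over the signs.
    return float(np.mean(sups))
```

With a fixed seed, enlarging `delta` can only enlarge the localized class, so the estimate is non-decreasing in `delta` whenever the smaller class is non-empty.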
Furthermore, we define the empirical entropy of a function class as the largest value, over the choice of samples, of the logarithm of the size of the smallest empirical cover of on the samples with respect to the norm. Finally, we consider the empirical entropy integral defined as:
(27) 
Throughout this section we will make the following benign assumption that essentially makes the problem learnable:
ASSUMPTION 1. The function class satisfies that for any constant , as
We will use the following theorems from the prior work of [9] as a starting point, as they are formalized in a manner convenient for our problem.
Theorem 4 (Foster, Syrgkanis [9], Theorem 4).
Consider any function class and let be the outcome of the constrained ERM. Pick any and let . Then for some constants and for any , w.p. :
Lemma 5 (Foster, Syrgkanis [9], Lemma 4).
Consider a function class and pick any (not necessarily in ). Moreover, let:
(28) 
Then for some constant and for any , w.p. :
Our goal is to replace in the latter Theorem with the worst-case variance of the functions in a small “regret” ball around the optimal. We achieve this with a slight modification of the ERM algorithm: we split the data in half and use one half as a regularization sample and the other half as the training sample. We then find the optimal function on the training sample, within the class of functions that also have relatively small regret on the regularization sample.
Out-of-Sample Regularized ERM
Consider the following algorithm:

We split the samples in two parts and let and denote the corresponding empirical expectations.

We run ERM over on the first half and let be the outcome.

We then define the class of functions constrained not to achieve a much worse value than on the first half, i.e., we regularize policies based on their regret on the first half. More formally, for some constant to be defined later:
(29) 
Then we run constrained ERM on the second sample over the function space :
(30)
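The three steps above can be sketched for a finite candidate class as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class is finite and given as a matrix of per-sample losses, and the names `out_of_sample_regularized_erm`, `losses`, and `mu` (the regret tolerance on the first half) are hypothetical.

```python
import numpy as np


def out_of_sample_regularized_erm(losses, mu):
    """Select a candidate by out-of-sample regularized ERM.

    losses: array of shape (num_candidates, n); entry [j, i] is the loss of
            candidate j on sample i. ASSUMPTION: a finite candidate class.
    mu:     tolerance on first-half empirical regret (hypothetical constant).
    Returns the index of the selected candidate.
    """
    losses = np.asarray(losses, dtype=float)
    n = losses.shape[1]
    # Split the samples into two halves.
    first, second = losses[:, : n // 2], losses[:, n // 2:]
    # Step 1: plain ERM on the first half.
    mean_first = first.mean(axis=1)
    f1 = int(np.argmin(mean_first))
    # Step 2: keep candidates whose first-half regret against f1 is at most mu.
    feasible = np.flatnonzero(mean_first - mean_first[f1] <= mu)
    # Step 3: constrained ERM on the second half over the feasible set.
    mean_second = second.mean(axis=1)
    return int(feasible[np.argmin(mean_second[feasible])])
```

The first-half minimizer always satisfies the constraint, so the feasible set is never empty. Setting `mu` to 0 recovers (up to ties) the first-half ERM, while a very large `mu` reduces to plain ERM on the second half.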
Theorem 6 (Variance-Based Regret).
Let , and choose , with . Then, w.p. over the sample , the outcome of the Out-of-Sample Regularized ERM satisfies:
(31) 
with: and . Moreover, the expected regret, in expectation over the samples, is also of order .
Proof.
First we argue that w.p. , . By the choice of and Theorem 4, we know that w.p. over the randomness of sample :