Semi-Parametric Efficient Policy Learning with Continuous Actions

Abstract

We consider off-policy evaluation and optimization with continuous action spaces. We focus on observational data where the data collection policy is unknown and needs to be estimated. We take a semi-parametric approach where the value function takes a known parametric form in the treatment, but we are agnostic on how it depends on the observed contexts. We propose a doubly robust off-policy estimate for this setting and show that off-policy optimization based on this estimate is robust to estimation errors of the policy function or the regression model. Our results also apply if the model does not satisfy our semi-parametric form, but rather we measure regret in terms of the best projection of the true value function to this functional space. Our work extends prior approaches of policy optimization from observational data that only considered discrete actions. We provide an experimental evaluation of our method in a synthetic data example motivated by optimal personalized pricing and costly resource allocation.

1 Introduction

We consider off-policy evaluation and optimization with continuous action spaces from observational data, where the data collection (logging) policy is unknown. We take a semi-parametric approach where we assume that the value function takes a known parametric form in the treatment, but we are agnostic on how it depends on the observed contexts/features. In particular, we assume that:

 V(a,z)=⟨θ0(z),ϕ(a,z)⟩ (1)

for some known feature functions ϕ(a,z) but unknown functions θ0(z). We assume that we are given a set of n observational data points that consist of i.i.d. copies of the random vector (y,a,z), such that E[y∣a,z]=V(a,z).1

Our goal is to estimate a policy ^π from a space of policies Π that achieves good regret:

 supπ∈ΠE[V(π(z),z)]−E[V(^π(z),z)]≤R(Π,n) (2)

for some regret rate R(Π,n) that depends on the policy space Π and the sample size n.

The semi-parametric value assumption allows us to formulate a doubly robust estimate VDR(π) of the value function from the observational data, which depends on first stage regression estimates of the coefficients θ0(z) and of the conditional covariance of the features Σ0(z)=E[ϕ(a,z)ϕ(a,z)T∣z]. The latter is the analogue of the propensity function when actions are discrete. Our estimate is doubly robust in that it is unbiased if either ^θ or ^Σ is correct. Then we optimize this estimate:

 ^π=argmaxπ∈ΠVDR(π) (3)

Main contributions.

We show that the double robustness property implies that our objective function satisfies a Neyman orthogonality criterion, which in turn implies that our regret rates depend only in a second order manner on the estimation errors of the first stage regression estimates of the functions θ0 and Σ0. Moreover, we prove a regret rate whose leading term depends on the variance of the difference of our estimated value between any two policies within a “small regret-slice” and on the entropy integral of the policy space. We achieve this with a computationally efficient variant of the empirical risk minimization (ERM) algorithm (of independent interest) that uses a validation set to construct a preliminary policy and uses it to regularize the policy computed on the training set. Hence, we manage to achieve variance-based regret bounds without the need for the variance or moment penalization maurer2009empirical (); swaminathan2015counterfactual (); foster2019orthogonal () used in prior work, which can render a computationally tractable policy learning problem non-convex. We also show that the asymptotic variance of our off-policy estimate (which governs the scale of the leading regret term) is asymptotically minimax optimal, in the sense that it achieves the semi-parametric efficiency lower bound.

Robustness to mis-specification.

Notably, our approach provides meaningful guarantees even when our semi-parametric value function assumption is violated. Suppose that the true value function does not take the form of Equation (1), but rather takes some other form V0(a,z). Then one can consider the projection of this value function onto the forms of Equation (1), as:

 θp(z)=arginfθE[(V0(a,z)−⟨θ(z),ϕ(a,z)⟩)2∣z] (4)

where the expectation is taken over the distribution of observed data. Then our approach takes the interpretation of achieving good regret bounds with respect to this best linear semi-parametric approximation. This is an alternative to the kernel smoothing approximation proposed by swaminathan2015counterfactual () in the contextual bandit setting, as a regret target, and related to kallus2018policy (). If there is some rough domain knowledge on the form of how the action affects the reward, then our semi-parametric approximate target should achieve better performance when the dimension of the action space is large, as kernel methods typically incur a bias that is exponential in the dimension.

Double robustness.

In cases where the collection policy is known, our doubly robust approach can be used for variance reduction via fitting first stage regression estimates to the policy value, whilst maintaining unbiasedness. Thus we can apply our approach to improve regret in the counterfactual risk minimization framework swaminathan2015counterfactual (), kallus2018policy () and as a variance reduction method in contextual bandit algorithms with continuous actions swaminathan2015counterfactual ().

Related Literature.

Our work builds on recent work at the intersection of semi-parametric inference and policy learning from observational data. The important work of athey2017efficient () analyzes the binary treatment setting and also takes a doubly robust approach so as to obtain regret bounds whose leading term depends on the semi-parametric efficient variance and the entropy integral and which are robust to first stage estimation errors. Hence, our results are a direct generalization of their work to arbitrary continuous action spaces, subject to our semi-parametric value assumption. In fact we show formally in the Appendix how one can recover the main result of athey2017efficient () from our main regret bound. In turn our work builds on a long line of work on policy learning and counterfactual risk minimization qian2011performance (); zhao2012estimating (); zhou2017residual (); athey2017efficient (); kitagawa2018should (); zhou2018offline (); beygelzimer2009offset (); dudik2011doubly (); swaminathan2015counterfactual (); kallus2018policy (); krishnamurthy (). Notably, the work of zhou2018offline () extends the work of athey2017efficient () to many discrete actions, but only proves a second moment based regret bound, which can be much larger than the variance. Our setting also subsumes the setting of many discrete actions and hence our regularized ERM offers an improvement over the rates in zhou2018offline (). foster2019orthogonal () formulates a general framework of statistical learning with a nuisance component. Our method falls into this framework and we build upon some of the results in foster2019orthogonal (). However, for the case of policy learning the implications of foster2019orthogonal () provide a variance based regret only when invoking second moment penalization, which can be intractable. We side-step this need and provide a computationally efficient alternative.

Finally, most of the work on policy learning in machine learning assumes that the current policy (equiv. Σ0) is known. Hence, double robustness is used mostly as a variance reduction technique. Even for this literature, as we discuss above, our method can be seen as an alternative to recent work on policy learning with continuous actions kallus2018policy (); krishnamurthy () that makes use of non-parametric kernel methods.

Our work also connects to the semi-parametric estimation literature in econometrics and statistics. Our model is an extension of the partially linear model, which has been extensively studied in econometrics engle1986semiparametric (); robinson1988root (). By considering context-specific coefficients (random coefficients) and modeling a value function that is non-linear in treatment, we substantially extend the partially linear model. wooldridge2004estimating (); graham2018semiparametrically () studied a special case of our model where output is linearly dependent on treatment given context, with the aim of estimating the average treatment effect. graham2018semiparametrically () constructed the doubly robust estimator and showed its semi-parametric efficiency under the linear-in-treatment assumption. We extend their results to a more general functional form and use the double-robustness property and semi-parametric efficiency for policy evaluation and optimization rather than treatment effect estimation. Our work is also connected to the recent and rapidly growing literature on orthogonal/locally robust/debiased estimation chernozhukov2018double (); chernozhukov2016locally (); van2011targeted ().

2 Orthogonal Off-Policy Evaluation and Optimization

Let ^θ be a first stage estimate of θ0, which can be obtained by minimizing the square loss:

 ^θ=arginfθ∈ΘEn[(y−⟨θ(z),ϕ(a,z)⟩)2] (5)

where Θ is an appropriate parameter space for the parameters θ. Let Σ0(z) denote the conditional covariance matrix:

 Σ0(z)=E[ϕ(a,z)ϕ(a,z)T∣z]

This is the analogue of the propensity model in discrete treatment settings. An estimate ^Σ can be obtained by running a multi-task regression problem for each entry of the matrix, i.e.:

 ^Σij=arginfΣij∈SijE[(ϕi(a,z)ϕj(a,z)−Σij(z))2] (6)

where Sij is some appropriate hypothesis space for these regressions. Finally, the doubly robust estimate of the off-policy value takes the form:

 VDR(π)=En[vDR(y,a,z;π)] (7)

where:

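 v_{DR}(y,a,z;\pi) = \langle \theta_{DR}(y,a,z), \phi(\pi(z),z) \rangle (8)

 \theta_{DR}(y,a,z) = \hat\theta(z) + \hat\Sigma(z)^{-1} \phi(a,z) \left(y - \langle \hat\theta(z), \phi(a,z) \rangle\right) (9)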

The quantity θDR(y,a,z) can be viewed as an estimate of θ0(z) based on a single observation. In fact, if the matrix ^Σ(z) were equal to Σ0(z), then one can see that θDR(y,a,z) is an unbiased estimate of θ0(z). Our estimate also satisfies a doubly robust property, i.e. it is correct if either ^θ is unbiased or ^Σ is unbiased (see Appendix E for a formal statement). Finally, we will denote with θ0DR(y,a,z) the version of θDR(y,a,z) where the nuisance quantities ^θ and ^Σ are replaced by their true values, and correspondingly define v0DR(y,a,z;π). We perform policy optimization based on this doubly robust estimate:

 ^π=argmaxπ∈ΠVDR(π) (10)
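The pseudo-outcome of Equation (9) and its double robustness can be illustrated with a minimal sketch. Everything below is a hypothetical toy setup rather than the paper's experiment: a two-dimensional feature map ϕ(a)=(1,a) with no context, three actions logged uniformly, noiseless outcomes, and the helper names `inv2`, `theta_dr` and `mean_theta_dr` are introduced only for illustration.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def theta_dr(y, phi, theta_hat, sigma_hat):
    """theta_hat + sigma_hat^{-1} phi (y - <theta_hat, phi>) for 2-dim features."""
    resid = y - sum(t * f for t, f in zip(theta_hat, phi))
    s_inv = inv2(sigma_hat)
    return [theta_hat[i] + sum(s_inv[i][j] * phi[j] for j in range(2)) * resid
            for i in range(2)]


# Hypothetical toy problem: actions 0, 1, 2 logged uniformly, phi(a) = (1, a).
theta0 = [2.0, -0.5]
features = [[1.0, a] for a in (0.0, 1.0, 2.0)]
# True covariance Sigma0 = E[phi phi^T] under the uniform logging policy.
sigma0 = [[sum(f[i] * f[j] for f in features) / 3 for j in range(2)]
          for i in range(2)]


def mean_theta_dr(theta_hat, sigma_hat):
    """Exact average of theta_DR over the logging distribution (noiseless y)."""
    vals = [theta_dr(sum(t * f for t, f in zip(theta0, phi)), phi,
                     theta_hat, sigma_hat) for phi in features]
    return [sum(v[i] for v in vals) / len(vals) for i in range(2)]


wrong_theta = [0.0, 0.0]
wrong_sigma = [[1.0, 0.0], [0.0, 1.0]]
avg_a = mean_theta_dr(wrong_theta, sigma0)  # wrong regression, correct covariance
avg_b = mean_theta_dr(theta0, wrong_sigma)  # correct regression, wrong covariance
```

In this toy example both averages coincide with `theta0`, which is the double robustness property in its simplest form: one correct nuisance is enough for the pseudo-outcome to be unbiased.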

Moreover, we let π0∗ be the optimal policy:

 π0∗=argmaxπ∈ΠV(π) (11)

Remark 1 (Multi-Action Policy Learning).

A special case of our setup is the setting with finitely many actions. This can be encoded by letting a range over the standard basis vectors ei and setting ϕ(a,z)=a. In that case, observe that the covariance matrix becomes a diagonal matrix: Σ0(z)=diag(p1(z),p2(z),…), with pi(z)=Pr[a=ei∣z]. In this case, we simply recover the standard doubly robust estimate that combines the direct regression part with the inverse propensity weights part, i.e.:

 \theta_{DR,i}(y,a,z) = \hat\theta_i(z) + \frac{1\{a=e_i\}}{\hat p_i(z)} \left(y - \hat\theta_i(z)\right)

Thus our estimator is an extension of the doubly robust estimate from discrete to continuous actions.
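For the finite-action case of Remark 1, the estimate reduces to a few lines. The sketch below assumes the first stage estimates have already been evaluated at the observed context; the function name `theta_dr_discrete` and the numbers are hypothetical.

```python
def theta_dr_discrete(y, action, theta_hat, p_hat):
    """Doubly robust pseudo-outcome for each of finitely many actions.

    action    -- index i of the logged action (a = e_i)
    theta_hat -- estimated mean reward of each action at the observed context
    p_hat     -- estimated propensity of each action at the observed context
    """
    return [theta_hat[i]
            + (1.0 if action == i else 0.0) / p_hat[i] * (y - theta_hat[i])
            for i in range(len(theta_hat))]


# Hypothetical observation: action 1 was logged and produced reward 1.0.
example = theta_dr_discrete(y=1.0, action=1,
                            theta_hat=[0.2, 0.5, 0.1],
                            p_hat=[0.5, 0.25, 0.25])
```

Only the coordinate of the logged action receives an inverse-propensity correction; the other coordinates fall back on the direct regression estimate.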

Remark 2 (Finitely Many Possible Actions: Linear Contextual Bandits).

Another interesting special case of our approach is a generalization of the linear contextual bandit setting. In particular, suppose that there is only a finite (albeit potentially large) set of possible actions and . However, unlike the multi-action setting, where these actions are the orthonormal basis vectors, in this setting each action a maps to a feature vector ϕ(a,z). Then the reward y that we observe satisfies E[y∣a,z]=⟨θ0(z),ϕ(a,z)⟩. This is a generalization of the linear contextual bandit setting, in which the coefficient vector is a constant parameter as opposed to varying with z. In this case observe that Σ0(z)=UDUT, i.e. it is a sum of rank one matrices, where the columns of U are the feature vectors ϕ(a,z) of the possible actions and D is the diagonal matrix of the probabilities with which the logging policy chooses each action given z. The doubly robust estimate of the parameter takes the form:

 \theta_{DR}(y,a,z) = \hat\theta(z) + (U D U^T)^{-1} \phi(a,z) \left(y - \langle \hat\theta(z), \phi(a,z) \rangle\right)

This approach leverages the functional form assumption to get an estimate that avoids a large variance that depends on the number of actions but rather mostly depends on the number of parameters . This is achieved by sharing reward information across actions.

Remark 3 (Linear-in-Treatment Value).

Consider the case where the value is linear in the action, i.e. ϕ(a,z)=a. In this case observe that Σ0(z)=E[aaT∣z]. For instance, suppose that we assume that experimentation is independent across actions in the observed data. Then Σ0(z) is diagonal, with diagonal entries σ_i^2(z)=E[a_i^2∣z]. Then the doubly robust estimate of the parameter takes the form:

 \theta_{DR,i}(y,a,z) = \hat\theta_i(z) + \frac{a_i}{\hat\sigma_i^2(z)} \left(y - \langle \hat\theta(z), a \rangle\right) (12)

3 Theoretical Analysis

Our main regret bounds are derived for a slight variation of the ERM algorithm that we presented in the preliminary section. In particular, we crucially need to augment the ERM algorithm with a “validation” step, where we split our data into a training and a validation set, and we restrict attention to policies that achieve small regret on the training data, while still maintaining small regret on the validation set. This extra modification enables us to prove variance based regret bounds and is reminiscent of standard approaches in machine learning, like k-fold cross-validation and early stopping, hence could be of independent interest.

We note that we present our theoretical results for the simpler case where the nuisance estimates are trained on a separate split of the data. However, our results qualitatively extend to the case where we use the cross-fitting idea of chernozhukov2018double () (i.e. train a model on one half and predict on the other and vice versa).

Regret bound.

To show the properties of this algorithm, we first show that the regret of the doubly robust algorithm is impacted in a second order manner by the errors in the first stage estimates. We will also make the following preliminary definitions. For any function f we denote with ∥f∥2 the standard L2 norm and with ∥f∥2,n its empirical analogue. Furthermore, we define the empirical entropy H2(ϵ,F,n) of a function class F as the largest value, over the choice of n samples, of the logarithm of the size of the smallest empirical ϵ-cover of F on the samples with respect to the ∥⋅∥2,n norm. Finally, we consider the empirical entropy integral:

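 \kappa(r,\mathcal{F}) = \inf_{\alpha \ge 0} \left\{ 4\alpha + 10 \int_{\alpha}^{r} \sqrt{\frac{H_2(\epsilon,\mathcal{F},n)}{n}} \, d\epsilon \right\}, (13)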

Our statistical learning problem corresponds to learning over the function space:

 FΠ={vDR(⋅;π):π∈Π} (14)

where the data is x=(y,a,z). We will also make a very benign assumption on the entropy integral:

Assumption 1.

The function class FΠ satisfies that for any constant r, κ(r,FΠ)→0 as n→∞.

Theorem 1 (Variance-Based Oracle Policy Regret).

Suppose that the nuisance estimates satisfy that their mean squared error is upper bounded w.p. 1−δ by h_{n,δ}^2, i.e. w.p. 1−δ over the randomness of the nuisance sample:

 \max\left\{ E\left[(\hat\theta(z) - \theta_0(z))^2\right], E\left[\|\hat\Sigma(z) - \Sigma_0(z)\|_{Fro}^2\right] \right\} \le h_{n,\delta}^2 (15)

Let and . Moreover, let

 Π∗(ϵ)={π∈Π:V(π0∗)−V(π)≤ϵ}, (16)

denote an ϵ-regret slice of the policy space. Let and

 V_2^0 = \sup_{\pi,\pi' \in \Pi_*(\epsilon_n)} \mathrm{Var}\left(v^0_{DR}(x;\pi) - v^0_{DR}(x;\pi')\right) (17)

denote the variance of the difference between any two policies in an ϵn-regret slice, evaluated at the true nuisance quantities. Then the policy π2 returned by the out-of-sample regularized ERM satisfies w.p. 1−δ over the randomness of :

 V(\pi^0_*) - V(\pi_2) = O\left( \kappa\left(\sqrt{V_2^0}, \mathcal{F}_\Pi\right) + \sqrt{\frac{V_2^0 \log(1/\delta)}{n}} + h_{n,\delta}^2 \right) (18)

The expected regret is , where is the expected MSE of the nuisance functions.

We provide a proof of this Theorem in Appendix B. The regret result contains two main contributions: 1) the impact of the nuisance estimation error is of second order (i.e. h_{n,δ}^2 instead of h_{n,δ}), 2) the leading regret term depends on the variance of small-regret policy differences and the entropy integral of the policy space. The first property stems from the Neyman orthogonality property of the doubly robust estimate of the policy value. The second property stems from the out-of-sample regularization step that we added to the ERM algorithm. Typically, we will have h_{n,δ}^2=o(1/√n) and thereby this term is of lower order than the leading term. Moreover, for many policy spaces , in which case we see that if the setting satisfies a “margin” condition (i.e. the best policy is better by a constant margin), then eventually the variance of small regret policies is 0, since the regret slice only contains the best policy. In that case, our bound leads to fast rates of as opposed to , since the leading term vanishes (similar to the achieved in bandit problems with such a margin condition).

Dependence on the quantity V_2^0 is quite intuitive: if two policies have almost equivalent regret up to a rate, then it will be very easy to be misled among them if one has much higher variance than the other. For some classes of problems, the above also implies a regret rate that only depends on the variance of the optimal policy (e.g. when all policies with low regret have a variance that is not much larger than the variance of the optimal policy). In Appendix F we show that the latter is always the case for the setting of binary treatment studied in athey2017efficient () and therefore, applying our main result, we recover exactly their bound.

Semi-parametric efficient variance.

Our regret bound depends on the variance of our doubly robust estimate of the value function. One then wonders if there are other estimates of the value function that could achieve better variance than VDR(π). However, we show that at least asymptotically and without further assumptions on the functions θ0 and Σ0, this cannot be the case. In particular, we show that our estimator achieves what is known as the semi-parametric efficient variance limit for our setting. More importantly for our regret result, this is also true for the semi-parametric efficient variance of the policy differences. This is the case in our main setup, where the model is mis-specified and is only a projection of the true value, and even if we assume that our model is correct, but make the extra assumption of homoskedasticity, i.e., the conditional variance of residuals of outcomes does not depend on .

Theorem 2 (Semi-parametric Efficiency).

If the model is mis-specified, i.e, the asymptotic variance of is equal to the semi-parametric efficiency bound for the policy value defined in Equation (4). If the model is correctly specified, is semi-parametrically efficient under homoskedasticity, i.e. .

We provide a proof for the value function, but this result also extends to the difference of values. We conclude the section by providing concrete examples of rates for policy classes of interest.

Example 1 (VC Policies).

As a concrete example, consider the case when the class FΠ is a VC-subgraph class of VC dimension d (e.g. the policy space has small VC-dimension or pseudo-dimension), and let . Then Theorem 2.6.7 of VanDerVaartWe96 () shows that: (see also discussion in Appendix F). This implies that

 \kappa(r,\mathcal{F}_\Pi) = O\left( \int_0^r \sqrt{d\,(1+\log(S/\epsilon))} \, d\epsilon \right) = O\left( r \sqrt{d} \sqrt{1+\log(S/r)} \right).

Hence, we can conclude that regret is . For the case of binary action policies (as we discuss in Appendix F) this result recovers the result of athey2017efficient () up to constants and extends it to arbitrary action spaces and VC-subgraph policies.

Example 2 (High-Dimensional Index Policies).

As an example, we consider the class of policies characterized by a constant number of ℓ1- or ℓ0-bounded linear indices:

 Π1={z→Γ(⟨β1,z⟩,…,⟨βd,z⟩):βi∈Rp,∥βi∥1≤s} (19)

where Γ is a fixed -Lipschitz function of the indices, with constants, while (and similarly for , where we use ). Assuming is a Lipschitz function of and since is a Lipschitz function of , we have by a standard multi-variate Lipschitz contraction argument (and since , are constants), that the entropy of is of the same order as the maximum entropy of each of the linear index spaces: . Moreover, by known covering arguments (see e.g. zhang2002covering (), Theorem 3), we have that if , then: . Thus we get , which leads to regret . In this setting, we observe that the policy space is too large for the variance to drive the asymptotic regret. There is a leading term that remains even if the worst-case variance of policies in a small-regret slice is . Intuitively this stems from the high-dimensionality of the linear indices, which introduces an extra dimension of error, namely bias due to regularization. On the contrary, for exactly sparse policies , we have that since for any possible support the entropy at scale is at most , we can take a union over all possible sparse supports, which implies . Thus , leading to policy regret similar to the VC classes: .

Remark 4 (Estimating the First Stages).

Bounds on first stage errors as a function of sample complexity measures for the first stage hypothesis spaces can be obtained by standard results on the MSE achievable by regression problems (see e.g. rakhlin2017empirical (); wainwright2019 ()). Essentially these are bounds for the regression estimates ^θ and ^Σ, as a function of the complexity of their assumed hypothesis spaces. Since the latter is a standard statistical learning problem that is orthogonal to our main contribution, we omit technical details. Since the square loss is a strongly convex objective, the rates achievable for these problems are typically fast rates on the MSE (e.g. is of the order for the case of parametric hypothesis spaces, and typically for reproducing kernel Hilbert spaces with fast eigendecay (see e.g. wainwright2019 ())). Thus the term h_{n,δ}^2 is of lower order. For instance, the required rates for the term to be of second order in our regret bounds are achievable if these nuisance regressions are ℓ1-penalized linear regressions and several regularity assumptions are satisfied by the data distribution, even when the dimension of z is growing with n.

Extension: Semi-Bandit Feedback

Suppose that our value function takes the form: V(a,z)=ϕ(a,z)TΘ0(z)ϕ(a,z), where Θ0(z) is a matrix and we observe semi-bandit feedback, i.e. we observe a vector Y s.t.: E[Y∣a,z]=Θ0(z)Tϕ(a,z). Then we can apply our DR approach to each coordinate of Y separately.

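 V_{DR}(\pi) = E_n\left[ \phi(\pi(z),z)^T \left( \hat\Theta(z) + \hat\Sigma(z)^{-1} \phi(a,z) \left( Y^T - \phi(a,z)^T \hat\Theta(z) \right) \right) \phi(\pi(z),z) \right]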

All the theorems in this section extend to this case, which will prove useful in our pricing application where a is the vector of prices of a set of products and Y is the vector of observed demands for each product.

4 Application: Personalized Pricing

Consider the personalized pricing of a single product. The objective is the revenue:

 V(p,z)=p(a(z)−b(z)p)

where p is the price and a(z)−b(z)p gives the unknown, context-specific demand function. We assume that we observe an unbiased estimate d of demand:

 E[d∣z,p]=a(z)−b(z)p
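As a reference point for the policy optimization target: if a(z) and b(z) were known and b(z)>0, the revenue p(a(z)−b(z)p) is concave in p and is maximized at p=a(z)/(2b(z)). The following sketch checks this on hypothetical numbers; the function names are introduced only for illustration.

```python
def revenue(p, a, b):
    """Revenue V(p, z) = p * (a(z) - b(z) * p) at a fixed context."""
    return p * (a - b * p)


def optimal_price(a, b):
    """Maximizer of the concave revenue curve; requires b > 0."""
    if b <= 0:
        raise ValueError("the demand slope b must be positive")
    return a / (2 * b)


# Hypothetical demand parameters at a single context.
p_star = optimal_price(a=4.0, b=0.5)
best_revenue = revenue(p_star, a=4.0, b=0.5)
```

In the setting of the paper a(z) and b(z) are unknown, so this oracle price is only a benchmark against which learned pricing policies are compared.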

We want to optimize over a space of personalized pricing policies Π. If, for instance, the observational policy was homoskedastic (i.e. the exploration component was independent of the context z), we show in Appendix G that doubly robust estimators for a(z) and b(z) are

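 a_{DR}(z) = \hat a(z) + \left( 1 + \hat g(z) \frac{\hat g(z) - p}{\hat\sigma^2} \right) \left( d - \hat a(z) - \hat b(z) p \right)

 b_{DR}(z) = \hat b(z) + \frac{p - \hat g(z)}{\hat\sigma^2} \left( d - \hat a(z) - \hat b(z) p \right)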

where g(z)=E[p∣z] and the variance σ^2=Var(p∣z). Thus, in this example, we only need to estimate the mean treatment policy g(z) and the variance σ^2.

Experimental evaluation.

We empirically evaluate our framework on the personalized pricing application with synthetic data. In particular, we use simulations to assess our estimator’s ability to evaluate and optimize personalized pricing functions. To do this, we compare the performance of our doubly robust estimator with (1) Direct estimator, , (2) Inverse propensity score estimator 2, (3) Oracle orthogonal estimator, .

Data Generating Process.

Our simulation design considers a sparse model. We assume that there are continuous context variables distributed uniformly for but only of them affects demand. Let . Price and demand are generated as . We consider four functional forms for the demand model: (i) (Quadratic) , (ii) (Step) , (iii) (Sigmoid) , (iv) (Linear)

These functions and the data generating process ensure that the conditional expectation function of demand given is non-negative for all , the observed prices are positive with high probability, and the optimal prices are in the support of the observed prices. In each experiment, we generate 1000, 2000, 5000, and 10000 data points, and report results over 100 simulations. We estimate the nuisance functions using 5-fold cross-validated lasso model with polynomials of degrees up to 3 and all the two-way interactions of context variables. We present the results for two regimes: (i) Low dimensional with , (ii) High dimensional with .

Policy Evaluation.

For policy evaluation we consider four pricing functions: (i) Constant, , (ii) Linear, , (iii) Threshold, , (iv) Sin, . The results for the low dimensional regime are summarized in Figure 1(a), where each row and column corresponds to a different demand function and a policy function, respectively3. The results show that, as expected, the performance of our method is very similar to that of the oracle estimator and is significantly better than that of the direct and inverse propensity score methods, which suffer from large biases. These results also support our claim that the asymptotic variance of the doubly robust estimate is the same as the variance of the oracle method. It is also important to point out that we obtain similar performance across the two regimes.

Regret.

To investigate the regret performance of our method, we consider a constant pricing function, and a linear policy . We compute the optimal pricing functions in these two function spaces and report the distribution of regret in Figure 4(b) under the low dimensional regime and in Appendix H under the high dimensional regime. Across the four demand functions and two pricing functions we consider, our method achieves small regrets, comparable to the oracle method. The direct and inverse propensity methods, depending on the demand function, yield large regrets.

4.1 Quadratic Model

Finally, we consider the same simulation exercise under the assumption that an unbiased estimate of revenue rather than demand is observed. Since revenue depends on p^2, the model is now quadratic:

 r =a(z)p−b(z)p2+ϵ

For the data generating process we use the same functions a(z) and b(z) as in the personalized pricing example 4. Figures 2 and 5 in Appendix H summarize results for policy evaluation and optimization. The overall performance of our doubly robust estimator is similar to the demand model, and it performs better than the direct model. One important difference to note is that when the sample size is small, we observe some finite sample biases for some function classes.

5 Application: Costly Resource Allocation

Motivated by a resource allocation scenario, we also analyze experimentally the special case where ϕ(a,z)=a. Consider the case where we have possible tasks to invest in, and we have investment costs. Each task yields a return on investment that is a linear function of the investment, but an unknown function of the context z. Moreover, to maintain an investment portfolio a we need to pay a known cost C(a). Given a policy space Π, our goal is to optimize:

 supπ∈ΠE[⟨θ(z),π(z)⟩−C(π(z))] (20)

This falls into our framework, if we treat the offset part as of the form but with a known . So in that case we simply consider . Then applying our framework we optimize:

 supπ∈ΠEn[⟨θDR(z),π(z)⟩−C(π(z))] (21)

In the case of quadratic costs C(a)=(λ/2)∥a∥_2^2, this boils down to exactly optimizing a square loss objective, since:

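 \inf_A E_n\left[ \|\theta_{DR}(z)/\lambda - \pi(z)\|^2 \right] \iff \sup_A E_n\left[ \langle \theta_{DR}(z), \pi(z) \rangle \right] - \frac{\lambda}{2} E_n\left[ \|\pi(z)\|_2^2 \right] (22)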

Thus policy optimization reduces to a multi-task regression problem where we are trying to predict θDR(z)/λ from z.5
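The equivalence in Equation (22) can be checked numerically in the simplest possible case. The sketch below assumes a hypothetical policy class of constant (context-free) allocations, for which the least-squares fit of θDR(z)/λ is just the coordinate-wise sample mean divided by λ; the pseudo-outcome values are made up for illustration.

```python
def cost_adjusted_value(pi, theta_dr, lam):
    """E_n[<theta_DR, pi>] - (lam / 2) * ||pi||^2 for a constant policy pi."""
    n = len(theta_dr)
    linear = sum(sum(t * c for t, c in zip(row, pi)) for row in theta_dr) / n
    return linear - 0.5 * lam * sum(c * c for c in pi)


def best_constant_policy(theta_dr, lam):
    """Least-squares fit of theta_DR / lam by a constant vector."""
    n = len(theta_dr)
    dim = len(theta_dr[0])
    return [sum(row[i] for row in theta_dr) / (n * lam) for i in range(dim)]


# Hypothetical doubly robust pseudo-outcomes for three observations, two tasks.
theta_samples = [[1.0, 2.0], [3.0, 0.0], [2.0, 1.0]]
lam = 2.0
pi_star = best_constant_policy(theta_samples, lam)
value_star = cost_adjusted_value(pi_star, theta_samples, lam)
```

Perturbing `pi_star` in any coordinate lowers the cost-adjusted value, so the regression solution is also the maximizer of the policy objective. Richer policy classes replace the sample mean with a regression of θDR(z)/λ on z.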

We can consider sparse linear policies:

 \Pi = \left\{ z \to Az : \|A\|_{1,1} := \sum_i \|\alpha_i\|_1 \le s \right\} (23)

where αi corresponds to the i-th row of matrix A. In this case our problem reduces to the MultiTask Lasso problem where the label is θDR(z)/λ.

Experimental Evaluation.

For experimental evaluation we consider a model with two tasks, and :

 y=a(z)a1+b(z)a2+ϵ

We use the same distributions and functions, a(z) and b(z), given above for the pricing application. To estimate the optimal allocation and its regret, we run a 5-fold cross validated MultiTask Lasso algorithm and set . We report the distribution of return on investment obtained from different models in Figure 3. The results suggest that the doubly robust method achieves a significantly lower regret than the direct method in both regimes and that its performance is similar to the oracle method 6.

Appendix A Proof of Universal Orthogonality Lemma

We start by defining a sufficient condition for the notion of universal orthogonality of a loss function, as defined by [9]. A loss function ℓ(x,π(z);h(z)) is universally orthogonal with respect to h if for any π∈Π:

 E[∇h(z),π(z)ℓ(x,π(z);h0(z))∣z]=0 (24)

where h0 is the true value of the nuisance parameter h.

Lemma 3.

The loss function is universally orthogonal with respect to .

Proof.

We show that the population loss function that corresponds to the doubly robust estimate, satisfies the universal orthogonality condition. For simplicity of notation let . Then the population loss is:

Let:

 β(a,z,ξ,K)=ξ+Kϕ(a,z)(y−⟨ξ,ϕ(a,z)⟩)

Observe that:

To show universal orthogonality it suffices to show that:

 E[∇ξ,Kβ(a,z,θ0(z),Σ−10(z))∣z]= 0

This follows easily by simple algebraic manipulations:

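 E\left[ \nabla_\xi \beta(a,z,\theta_0(z),\Sigma_0^{-1}(z)) \mid z \right] = E\left[ I - \Sigma_0^{-1}(z) \phi(a,z) \phi(a,z)^T \mid z \right]
 = I - \Sigma_0^{-1}(z) E\left[ \phi(a,z) \phi(a,z)^T \mid z \right] = I - \Sigma_0^{-1}(z) \Sigma_0(z) = 0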

and

 E[∇Kijβ(a,θ0(z),Σ−10(z))∣z]= E[ϕj(a,z)(y−⟨θ0(z),ϕ(a,z)⟩)∣z]

Now observe that since θ0(z) is the minimizer of the conditional squared loss, taking the first order condition implies:

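 E\left[ \left( V_0(a,z) - \langle \theta_0(z), \phi(a,z) \rangle \right) \phi(a,z) \mid z \right] = 0 \iff E\left[ V_0(a,z) \phi(a,z) \mid z \right] = E\left[ \langle \theta_0(z), \phi(a,z) \rangle \phi(a,z) \mid z \right]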

Moreover:

 E\left[ y \phi(a,z) \mid z \right] = E\left[ E[y \mid a,z] \phi(a,z) \mid z \right] = E\left[ V_0(a,z) \phi(a,z) \mid z \right]

Combining the two yields:

 E[ϕ(a,z)(y−⟨θ0(z),ϕ(a,z)⟩)∣z]=0

which implies orthogonality with respect to K. ∎

Appendix B Proof of Main Regret Theorem 1

We first consider an arbitrary empirical loss minimization problem of the form:

 f_n = \arg\min_{f \in \mathcal{F}} E_n[f(x)] := \frac{1}{n} \sum_{i=1}^{n} f(x_i) (25)

where are i.i.d. drawn from an unknown distribution and is an arbitrary data space. Throughout the section we will assume that: . All the results can be generalized to the case of , for some arbitrary , by simply first re-scaling the losses, and then invoking the theorems of this section.

We will also make the following preliminary definitions. For any function f we denote with ∥f∥2 the standard L2 norm and with ∥f∥2,n its empirical analogue. The localized Rademacher complexity is then defined as:

 R(r,\mathcal{F}) = E_{\epsilon, x_{1:n}}\left[ \sup_{f \in \mathcal{F} : \|f\|_2 \le r} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i) \right] (26)

where ϵ1,…,ϵn are independent Rademacher variables that take values ±1 with equal probability.

Furthermore, we define the empirical entropy H2(ϵ,F,n) of a function class F as the largest value, over the choice of n samples, of the logarithm of the size of the smallest empirical ϵ-cover of F on the samples with respect to the ∥⋅∥2,n norm. Finally, we consider the empirical entropy integral defined as:

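 \kappa(r,\mathcal{F}) = \inf_{\alpha \ge 0} \left\{ 4\alpha + 10 \int_{\alpha}^{r} \sqrt{\frac{H_2(\epsilon,\mathcal{F},n)}{n}} \, d\epsilon \right\}, (27)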

Throughout this section we will make the following benign assumption that essentially makes the problem learnable:

ASSUMPTION 1. The function class F satisfies that for any constant r, κ(r,F)→0 as n→∞.

We will use the following theorems from the prior work of [9] as a starting point, as they are formalized in a manner convenient for our problem.

Theorem 4 (Foster, Syrgkanis [9], Theorem 4).

Consider any function class F and let fn be the outcome of the constrained ERM. Pick any and let . Then for some constants C1, C2 and for any δ>0, w.p. 1−δ:

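 E[f_n(x) - f_*(x)] \le C_1 \left( R(r, \mathcal{F} - f_*) + r \sqrt{\frac{\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{n} \right)
 \le C_1 C_2 \left( \kappa(r, \mathcal{F}) + r \sqrt{\frac{\log(1/\delta)}{n}} + \frac{H_2(r, \mathcal{F}, n)}{n} + \frac{\log(1/\delta)}{n} \right).

Lemma 5 (Foster, Syrgkanis [9], Lemma 4).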

Consider a function class F and pick any f∗ (not necessarily in F). Moreover, let:

 Z_n(r) = \sup_{f \in \mathcal{F} : \|f - f_*\|_2 \le r} \left| E_n[f(x) - f_*(x)] - E[f(x) - f_*(x)] \right| (28)

Then for some constant C3 and for any δ>0, w.p. 1−δ:

 Z_n(r) \le C_3 \left( R(r, \mathcal{F} - f_*) + r \sqrt{\frac{\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{n} \right)

Our goal is to replace r in the latter Theorem with the worst-case variance of the functions in a small “regret”-ball around the optimal. We will achieve this by considering a slight modification of the ERM algorithm. In particular, we will split the data in half, and we will use one half as a regularization sample and the other half as the training sample. We will then find the optimal function on the training sample, within the class of functions that also have relatively small regret on the regularization sample.

Out-of-Sample Regularized ERM

Consider the following algorithm:

• We split the samples in two parts and let En1 and En2 denote the corresponding empirical expectations.

• We run ERM over F on the first half and let f1 be the outcome.

• Then we define the class of functions F2 that are constrained not to achieve much worse value than f1 on the first half, i.e. we regularize policies based on their regret on the first half. More formally, for some constant μn to be defined later:

 F2={f∈F:En1[f(x)−f1(x)]≤μn} (29)

• Then we run constrained ERM on the second sample over the function space F2:

 f2=argminf∈F2En2[f(x)] (30)
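
The steps above can be sketched for a finite function class as follows. This is only an illustration under simplifying assumptions: the class is a short list of hypothetical squared-loss functions, the two halves of the sample are passed in already split, and the slack `mu` is chosen by hand rather than by the theory.

```python
def empirical_mean(f, sample):
    """Empirical expectation E_n[f(x)] over a list of data points."""
    return sum(f(x) for x in sample) / len(sample)


def out_of_sample_regularized_erm(functions, sample1, sample2, mu):
    """Two-stage ERM over a finite class of loss functions."""
    # Stage 1: plain ERM on the first half.
    f1 = min(functions, key=lambda f: empirical_mean(f, sample1))
    base = empirical_mean(f1, sample1)
    # Keep only functions whose first-half loss is within mu of that of f1.
    restricted = [f for f in functions
                  if empirical_mean(f, sample1) - base <= mu]
    # Stage 2: ERM on the second half, restricted to the retained class.
    return min(restricted, key=lambda f: empirical_mean(f, sample2))


def make_loss(center):
    """Hypothetical loss f_c(x) = (x - c)^2 indexed by a scalar c."""
    return lambda x: (x - center) ** 2


centers = [0.0, 0.5, 1.0, 2.5]
candidates = [make_loss(c) for c in centers]
chosen = out_of_sample_regularized_erm(
    candidates, sample1=[0.0, 1.0], sample2=[2.0, 3.0], mu=0.3)
chosen_center = centers[candidates.index(chosen)]
```

In this toy run, unrestricted ERM on the second half would pick the center 2.5, but that function has large loss on the first half and is filtered out, so the procedure returns the center 1.0.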

Theorem 6 (Variance-Based Regret).

Let , and choose , with . Then, w.p. 1−δ over the sample , the outcome f2 of the Out-of-Sample Regularized ERM satisfies:

 E[f_2(x) - f_*(x)] = O\left( \kappa\left(\sqrt{V_2}, \mathcal{F}_*(\mu_n)\right) + \sqrt{\frac{V_2 \log(3/\delta)}{n}} \right) (31)

with: and . Moreover, the expected regret, in expectation over the samples is also of order .

Proof.

First we argue that w.p. , f∗∈F2. By the choice of μn and Theorem 4, we know that w.p. over the randomness of sample :

 E[f1(x)−f∗(x)]≤μn/2 (32)

Moreover, by Lemma 5, w.p. over the randomness of sample :

 supf∈F|En1[f(x)−f∗(x)]−E[f(x)−f∗(x)]|≤μn/2

Combining the latter two properties we have, w.p. :

 |En1[f∗(x)−f1(x)]|≤|E[f∗(x)−f1(x)]|+μn/2≤μn

Thus in this event, f∗∈F2.

Applying Theorem 4 for the last stage of the algorithm with function space and conditioning on the event that the first stage sample is such that