Minimax Off-Policy Evaluation for Multi-Armed Bandits

Cong Ma^{†,‡}, Banghua Zhu^{†}, Jiantao Jiao^{†,‡}, Martin J. Wainwright^{†,‡}
^{†} Department of Electrical Engineering and Computer Sciences, UC Berkeley
^{‡} Department of Statistics, UC Berkeley

Abstract

We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must have mean-squared error larger—relative to the oracle estimator equipped with the knowledge of the behavior policy—by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial knowledge setting in which it is assumed that the minimum probability taken by the behavior policy is known. We show that the plug-in estimator is optimal for relatively large values of the minimum probability, but is sub-optimal when the minimum probability is low. In order to remedy this gap, we propose a new estimator based on approximation by Chebyshev polynomials that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
1 Introduction
Various forms of sequential decision-making, including multi-armed bandits [LS20], contextual bandits [ACF+02, TM17], and Markov decision processes [SB18, SZE10], are characterized in terms of policies that prescribe actions to be taken. A central problem in all of these settings is that of policy evaluation—that is, estimating the performance of a given target policy. As a concrete example, given a new policy for deciding between treatments for cancer patients, one would be interested in assessing the effect on mortality when it is applied to a certain population of patients.
Perhaps the most natural idea is to deploy the target policy in an actual system, thereby collecting a dataset of samples, and use them to construct an estimate of the performance. Such an approach is known as on-policy evaluation, since the policy is evaluated using data that were collected under the same target policy. However, on-policy evaluation may not be feasible; in certain applications, it can be costly, dangerous and/or unethical, such as in clinical trials and autonomous driving. In light of these concerns, a plausible work-around is to evaluate the target policy using historical data collected under a different behavior policy; doing so obviates the need for any further interactions with the real environment. This alternative approach is known as off-policy evaluation, or OPE for short. Methods for off-policy evaluation have various applications, among them news recommendation [LCL+11], online advertising [TTG15], and robotics [IRB+19], to name just a few. Although OPE is appealing in not requiring collection of additional data, it also presents statistical challenges, in that the target policy to be evaluated is usually different from the behavior policy that generates the data.
1.1 Gaps in current statistical understanding of OPE
Recent years have witnessed considerable progress in the development and analysis of methods for OPE. Nonetheless, there remain a number of salient gaps in our current statistical understanding of off-policy evaluation, and these gaps motivate our work.
Non-asymptotic analysis of OPE. The classical analysis of OPE relies upon asymptotics in which the size of the historical dataset, call it n, increases to infinity with all other aspects of the problem set-up held fixed. Such analysis shows that a simple plug-in estimator, to be described in the sequel, is asymptotically efficient for the OPE problem in certain settings [HIR03]. However, such classical analysis fails to capture the modern practice of OPE, in which the sample size n may be of the same order as other problem parameters, such as the number of actions K. Thus, it is of considerable interest to obtain non-asymptotic guarantees on the performance of different methods, along with explicit dependence on different problem parameters. Li et al. [LMS15] and Wang et al. [WAD17] went beyond the asymptotic setting and studied the OPE problem for multi-armed bandits and contextual bandits, respectively, from a non-asymptotic perspective. However, as we discuss in the sequel, their analyses and results are applicable only when the sample size is sufficiently large. In this large sample regime, a number of estimators, including the plug-in, importance sampling and Switch estimators to be discussed in this paper, are all minimax rate-optimal. Thus, analysis of this type falls short of differentiating between different estimators. In particular, are they all rate-optimal for the full range of sample sizes, or is one estimator better than others?
Known vs. unknown behavior policies. In practice, the behavior policy generating the historical data might be known or unknown to the statistician, depending on the application at hand. This difference in available knowledge raises a natural question: is there any fundamental difference between OPE problems with known or unknown behavior policies? This question, though natural, appears to have been less explored in the literature. As we noted above from an asymptotic point of view, the plug-in estimator—which requires no information about the behavior policy—is optimal. In other words, asymptotically speaking, knowing the behavior policy brings no extra benefits to solving the OPE problem. Does this remarkable property continue to hold in the finite sample setting?
OPE with partial knowledge of the behavior policy. The known and unknown cases form two extremes of a continuum: in practice, one often has partial knowledge about the behavior policy. For instance, one might have a rough idea on how well the behavior policy covers/approximates the target policy, as measured in terms of likelihood ratios defined by the two policies. Alternatively, there might be a guarantee on the overall exploration level of the behavior policy, as measured by the minimum probability of observing each state/action under the behavior policy. How does such extra knowledge alter the statistical nature of the OPE problem? Can one develop estimators that fully exploit this information and yield improvements over the case of a fully unknown behavior policy?
1.2 Contributions and organization
In this paper, we focus on the off-policy evaluation problem under the multi-armed bandit model with bounded rewards. This setting, while seemingly simple, is rich enough to reveal some non-trivial issues in developing optimal methods for OPE.
More concretely, consider a bandit model with a total of K possible actions to take, also known as arms. Any (possibly randomized) policy can be thought of as a probability distribution over the action space [K] := {1, 2, ..., K}. Given a target policy π and a collection of action-reward pairs {(A_i, R_i)}_{i=1}^n generated i.i.d. from the behavior policy π_b and the reward distributions {ν_a}_{a∈[K]}, the goal of OPE is to estimate the value function of the target policy π, given by

V(π) := Σ_{a∈[K]} π(a) r(a).

Here the quantity r(a) denotes the mean reward of the arm a. Our goal is to provide a sharp non-asymptotic characterization of the statistical limits of the OPE problem in three different settings: (i) when the behavior policy π_b is known; (ii) when π_b is unknown; and (iii) when we have partial knowledge about π_b. Along the way, we also develop computationally efficient procedures that achieve the minimax rates, up to a universal constant, for all sample sizes. The detailed statements of our main results are deferred to Section 3, but let us highlight here the contributions that we make in each of the three settings.
Known behavior policy. First, when the behavior policy π_b is known to the statistician, we sharply characterize the minimax risk of estimating the target value function V(π) in Theorem 1. Notably, this bound holds for all sample sizes, in contrast to previous statistical analyses of OPE, which are either asymptotic or valid only when the sample size is sufficiently large. In addition, we show in Proposition 1 that the so-called Switch estimator achieves this optimal risk. The family of Switch estimators interpolates between two base estimators: a direct method based on the plug-in principle applied to actions in some set S, and an importance sampling estimate applied to its complement S^c. Our theory identifies a simple convex program that specifies the optimal choice of subset: solving this program specifies a threshold level of the likelihood ratio at which to switch between the two base estimators. We prove that this choice yields a minimax-optimal estimator, one that reduces the variance of the importance sampling estimator alone.
Unknown behavior policy. Moving onto the case when the behavior policy is completely unknown, we first argue that the global minimax risk is no longer a sensible criterion for measuring the performance of different estimators. Instead, we propose a different metric, namely the minimax competitive ratio, which measures the performance of an estimator against the best achievable via an oracle—in this setting, an oracle with knowledge of the behavior policy. With this new metric in place, we uncover a fundamental statistical gap between the known and unknown behavior policy cases in Theorem 3. More specifically, when evaluating a target policy that can take at most s actions (for some s ∈ [K]), any estimator without knowledge of the behavior policy must pay a multiplicative factor of s (modulo a log factor) compared to the oracle Switch estimator given knowledge of the behavior policy. We further demonstrate that the plug-in estimator alone achieves this optimal worst-case competitive ratio (up to a log factor), illustrating its near-optimality in the unknown case (cf. Theorem 2).
Partially known behavior policy. In the third part of the paper, we initiate the study of the middle ground between the previous two extreme cases: what if we have some partial knowledge regarding the behavior policy? More concretely, in Section 3.3 we assume knowledge of the minimum probability π_min := min_{a∈[K]} π_b(a) taken by the behavior policy. Under this assumption, we first show that the plug-in estimator is sub-optimal when the behavior policy is less exploratory—that is, when π_min is small relative to the sample size. We then propose a new estimator based on approximation by Chebyshev polynomials and show that it is optimal in estimating a large family of target policies. It is worth pointing out that this optimality is established under a different but closely related Poisson sampling model—instead of the usual multinomial sampling model—with the benefit of simplifying the analysis.
1.3 Related work
Off-policy evaluation has been extensively studied over the past decades, and by now there is an immense body of literature on this topic. Here we limit ourselves to a discussion of work directly related to the current paper.
Various estimators for OPE. There exist two classical approaches to the OPE problem. The first is a direct method based on the plug-in principle: it estimates the value of the target policy using the reward and/or the transition dynamics estimated from the data. In the multi-armed bandit setting, the direct method uses the data to estimate the mean rewards, and plugs these estimates into the expression for the target value function. The other approach is based on importance sampling [HT52], also known as inverse propensity scoring (IPS) in the causal inference literature. It reweights the observed rewards according to the likelihood ratios between the target and the behavior policies. Both methods are widely used in practice; we refer interested readers to the recent empirical study [PPM+20] for various forms of these estimators. A number of authors [TB16, WAD17] have proposed hybrid estimators that involve a combination of these two approaches, a line of work that inspired our analysis of the Switch estimator. In this context, our novel contribution is to specify a particular set for switching between the two estimators, and to show that the resulting Switch estimator is minimax-optimal for any sample size.
Statistical analysis of OPE. Statistical analysis of OPE can be separated into two categories: asymptotic and non-asymptotic. On one hand, the asymptotic properties of the OPE estimators are quite well-understood, with plug-in methods known to be asymptotically efficient [HIR03], and asymptotically minimax optimal in multi-armed bandits [LMS15]. Moving beyond bandits, a Cramér-Rao lower bound was recently provided for tabular Markov decision processes [JL16], and approaches based on the plug-in principle were shown to approach this limit asymptotically [YW20, DJW20].
Relative to such asymptotic analysis, there are fewer non-asymptotic guarantees for OPE; of particular relevance are the two papers [LMS15, WAD17]. Li et al. [LMS15] also studied the OPE problem under the multi-armed bandit model, but under different assumptions on the reward distributions than this paper. They proved a minimax lower bound that holds when the sample size is large enough, but did not give matching upper bounds in this regime. Wang et al. [WAD17] extended this line of analysis to the contextual bandit setting with uncountably many contexts. They provided matching upper and lower bounds, but again ones that only hold when the sample size is sufficiently large. Notably, in this large sample regime and under the bounded reward condition of this paper, all three estimators (plug-in, importance sampling and Switch) are minimax optimal up to constant factors. Thus, restricting attention to this particular regime fails to uncover the benefits of the Switch estimator. This paper provides a complete picture of the non-asymptotic behavior of these estimators for the OPE problem, showing that only the Switch estimator is minimax-optimal for all sample sizes.
Estimation of nonsmooth functionals via function approximation. The OPE problem with an unknown behavior policy is intimately connected to the problem of estimating nonsmooth functionals. Portions of our analysis and the development of the Chebyshev estimator exploit this connection. The use of function approximation in functional estimation was pioneered by Ibragimov et al. [INK87], and was later generalized to nonsmooth functionals by Lepski et al. [LNS99] and Cai and Low [CL11]. The underlying techniques have been used to devise optimal estimators for a variety of nonsmooth functionals, including Shannon entropy [VV10, WY16, JVH+15], KL divergence [HJW20], support size [VV11, VV17, WY19], among others. Our development of the Chebyshev estimator is largely inspired by the paper [WY19] on estimating the support size, which can be viewed as a special case of OPE.
Notation: For the reader’s convenience, let us summarize the notation used throughout the remainder of the paper. We reserve boldfaced symbols for vectors. For instance, the symbol 0 denotes the all-zeros vector, whose dimension can be inferred from the context. For a positive integer m, we use [m] to denote the set {1, 2, ..., m}. For a finite set S, we use |S| to denote its cardinality. We denote by 1{E} the indicator of the event E. For any distribution π on [K], we denote by supp(π) its support. For any distribution π on [K] and any subset S ⊆ [K], we define π(S) := Σ_{a∈S} π(a). We follow the convention that 0/0 = 0.
2 Background and problem formulation
In this section, we introduce the multi-armed bandit model with stochastic rewards, and then formally define the off-policy evaluation (OPE) problem in this bandit setting. We also introduce two existing estimators—the plug-in and the importance sampling estimators—for the OPE problem.
2.1 Multi-armed bandits and value functions
A multi-armed bandit (MAB) model is specified by an action space and a collection of reward distributions {ν_a}, where ν_a is the reward distribution associated with the action or arm a. Throughout the paper, we focus on the MAB model with K possible actions, and we index the action space by [K] := {1, 2, ..., K}. In addition, we assume that the collection of reward distributions belongs to the family of distributions with bounded support—that is,

ν_a ∈ P(R_max) := {distributions ν with supp(ν) ⊆ [0, R_max]}  for each a ∈ [K].  (1)

When the maximum reward R_max is understood from the context, we adopt the shorthand P for this class of distributions.
A (randomized) policy π is simply a distribution over the action space [K], where π(a) specifies the probability of selecting the action a ∈ [K]. Correspondingly, we can define the value function of the policy π to be

V(π) := Σ_{a∈[K]} π(a) r(a),  (2)

where r(a) := E[R_a] denotes the mean reward given that action a is taken. Here R_a denotes a reward random variable distributed according to ν_a.
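As a quick illustration of definition (2), the value function is just an inner product between the policy vector and the mean-reward vector. A minimal sketch in Python (the function name and the example numbers are illustrative, not from the paper):

```python
def value_function(pi, r):
    """Compute V(pi) = sum_a pi(a) * r(a) for a K-armed bandit.

    pi: list of action probabilities (sums to 1).
    r:  list of mean rewards, one per arm.
    """
    assert len(pi) == len(r), "policy and reward vectors must share a length"
    return sum(p * m for p, m in zip(pi, r))
```

For instance, `value_function([0.5, 0.5], [1.0, 0.0])` returns `0.5`, the average reward of a fair coin flip between an arm paying 1 and an arm paying 0.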
2.2 Observation model and off-policy evaluation
Suppose that we have collected a collection of n pairs {(A_i, R_i)}_{i=1}^n, where the action A_i is randomly drawn from the behavior policy π_b, whereas the reward R_i is distributed according to the reward distribution ν_{A_i}. Given a target policy π, the goal of off-policy evaluation (OPE) is to evaluate the value function of the target policy, given by

V(π) = Σ_{a∈[K]} π(a) r(a).  (3)

Note that this problem is non-trivial because the data is collected under the behavior policy π_b, which is typically distinct from the target policy π.
2.3 Plug-in and importance sampling estimators
A variety of estimators have been designed to estimate the value function V(π). Here we introduce the two estimators most relevant to our development, namely the plug-in estimator and the importance sampling estimator. We note that in some of the literature, the plug-in estimator is also known as the regression estimator.
Plug-in estimator. Perhaps the simplest method is based on applying the usual plug-in principle. Observe that the only unknown quantities in the definition (3) of the value function are the mean rewards {r(a)}_{a∈[K]}. These unknown quantities can be estimated by their empirical counterparts

r̂(a) := (1/N_a) Σ_{i=1}^n R_i 1{A_i = a},  (4)

where N_a := Σ_{i=1}^n 1{A_i = a} denotes the number of times that action a is observed in the data set; by the convention 0/0 = 0, we set r̂(a) = 0 whenever N_a = 0. Substituting these empirical estimates into the definition of the value function yields the plug-in estimator

V̂_plug-in := Σ_{a∈[K]} π(a) r̂(a).  (5)
Observe that this estimator is fully agnostic to the behavior policy. Thus, it can also be used when the behavior policy is unknown, a setting that we also study in the sequel.
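The two displays above translate directly into code. The following is a minimal sketch, assuming actions are encoded as integers in {0, ..., K-1} (an illustrative convention, not fixed by the paper):

```python
import numpy as np

def plugin_estimate(actions, rewards, pi_target):
    """Plug-in (direct) estimate of V(pi): form the empirical mean rewards
    r_hat(a) of equation (4) and plug them into V(pi) = sum_a pi(a) r_hat(a).
    Arms that were never observed contribute r_hat(a) = 0, following the
    0/0 = 0 convention in the text."""
    K = len(pi_target)
    sums = np.zeros(K)
    counts = np.zeros(K)
    for a, r in zip(actions, rewards):
        sums[a] += r
        counts[a] += 1
    # empirical means, with 0 for unobserved arms
    r_hat = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
    return float(np.dot(pi_target, r_hat))
```

For example, with data `actions=[0, 0, 1]`, `rewards=[1.0, 0.0, 1.0]` and a uniform target over two arms, the empirical means are (0.5, 1.0) and the estimate is 0.75. Note that the behavior policy never appears: the estimator is fully agnostic to it.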
Importance sampling estimator. An alternative estimator, one which does require knowledge of the behavior policy, is based on the idea of importance sampling. More precisely, let ρ(a) := π(a)/π_b(a) denote the likelihood ratio associated with the action a. The importance sampling (IS) estimator is given by

V̂_IS := (1/n) Σ_{i=1}^n ρ(A_i) R_i.  (6)

In words, it weights the observed reward R_i based on the corresponding likelihood ratio ρ(A_i). As long as π_b(a) > 0 for all a ∈ supp(π), the importance sampling estimator is an unbiased estimate of V(π). Note that the IS estimate relies on knowledge of the behavior policy via its use of the likelihood ratio.
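The IS estimate of equation (6) is a one-liner once the likelihood ratios are formed; a minimal sketch under the same integer-action encoding as above:

```python
import numpy as np

def importance_sampling_estimate(actions, rewards, pi_target, pi_behavior):
    """IS estimate (1/n) * sum_i rho(A_i) R_i, where rho(a) = pi(a)/pi_b(a)
    is the likelihood ratio.  Requires pi_b(a) > 0 whenever pi(a) > 0."""
    rho = np.asarray(pi_target, dtype=float) / np.asarray(pi_behavior, dtype=float)
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    # reweight each observed reward by the likelihood ratio of its action
    return float(np.mean(rho[actions] * rewards))
```

A quick sanity check of the on-policy special case: when `pi_behavior == pi_target`, all ratios equal one and the estimator reduces to the plain average of the observed rewards.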
3 Main results
We now move onto the main results of this paper. We begin in Section 3.1 with results in the case when the behavior policy is known a priori. In Section 3.2, we provide guarantees when the behavior policy is completely unknown, whereas Section 3.3 is devoted to the setting where certain partial knowledge about the behavior policy, say the minimum value π_min := min_{a∈[K]} π_b(a), is known.
3.1 Switch estimator with known π_b
When the behavior policy π_b is known, both the plug-in estimator and the importance sampling estimator are applicable. In fact, they belong to the family of Switch estimators introduced in past work [WAD17]: for any subset S ⊆ [K], define

V̂_Switch(S) := Σ_{a∈S} π(a) r̂(a) + (1/n) Σ_{i=1}^n ρ(A_i) R_i 1{A_i ∉ S},  (7)

where r̂(a) is the empirical mean reward defined in equation (4). By making the choices S = [K] or S = ∅, respectively, the Switch estimator reduces either to the plug-in estimator (5) or to the IS estimator (6). Choices of S intermediate between these two extremes allow us to interpolate (or switch) between the plug-in estimator and the IS estimator.
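The two-component structure of equation (7) can be sketched directly in code. The block below is illustrative (integer-encoded actions, a Python `set` for S), not a production implementation:

```python
import numpy as np

def switch_estimate(actions, rewards, pi_target, pi_behavior, S):
    """Switch estimator V_Switch(S) of equation (7): plug-in on the arm set S,
    importance sampling on its complement."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    pi = np.asarray(pi_target, dtype=float)
    pi_b = np.asarray(pi_behavior, dtype=float)
    K = len(pi)
    in_S = np.zeros(K, dtype=bool)
    in_S[np.fromiter(S, dtype=int)] = True
    # Plug-in component on S: empirical mean rewards (zero if unobserved).
    val = 0.0
    for a in range(K):
        if in_S[a]:
            obs = rewards[actions == a]
            val += pi[a] * (obs.mean() if obs.size > 0 else 0.0)
    # Importance sampling component on the complement of S.
    rho = pi / pi_b
    keep = ~in_S[actions]
    val += float(np.sum(rho[actions] * rewards * keep)) / len(actions)
    return val
```

With `S` equal to the full action set the function returns the plug-in estimate; with `S` empty it returns the IS estimate, matching the two extreme cases described above.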
The following proposition, whose proof is relatively elementary, provides a unified performance guarantee for the family of Switch estimators.
Proposition 1.
For any subset S ⊆ [K], we have

E[(V̂_Switch(S) − V(π))²] ≤ c R_max² { (Σ_{a∈S} π(a))² + (1/n) Σ_{a∉S} π²(a)/π_b(a) },  (8)

where c is a universal constant.
See Section 4.1 for the proof of this claim.
Given the family of Switch estimators {V̂_Switch(S)}_{S⊆[K]}, it is natural to ask: how should one choose the subset S among all possible subsets of the action space? The unified upper bound established in Proposition 1 offers a reasonable guideline: one should select a subset to minimize the error bound (8), i.e.,

S ∈ arg min_{S⊆[K]} { (Σ_{a∈S} π(a))² + (1/n) Σ_{a∉S} π²(a)/π_b(a) }.  (9)
At first glance, the minimization problem (9) is combinatorial in nature, which indicates the possible computational hardness in solving it. Fortunately, it turns out that such an “ambitious” goal can instead be achieved via solving a tractable convex program. To make this claim precise, let us consider the following convex program
(10)
where x ∈ ℝ^K is a vector of decision variables. Let x* be a minimizer of this optimization problem (10), whose existence is guaranteed by the coerciveness of the objective function. Correspondingly, we define

S* := supp(x*) = {a ∈ [K] : x*_a ≠ 0}  (11)

to be the support of x*. It turns out that the choice S = S* solves the best subset selection problem (9) up to a constant factor. We summarize in the following:
Proposition 2.
There exists a universal constant c ≥ 1 such that

(Σ_{a∈S*} π(a))² + (1/n) Σ_{a∉S*} π²(a)/π_b(a) ≤ c · min_{S⊆[K]} { (Σ_{a∈S} π(a))² + (1/n) Σ_{a∉S} π²(a)/π_b(a) }.  (12)
See Section 4.2 for the proof of the optimality of S*.
Thus, we conclude that among the family of Switch estimators, the optimal estimator is given by

V̂_Switch := V̂_Switch(S*).  (13a)

In view of Proposition 1, it enjoys the following performance guarantee:

E[(V̂_Switch − V(π))²] ≤ c R_max² { (Σ_{a∈S*} π(a))² + (1/n) Σ_{a∉S*} π²(a)/π_b(a) }.  (13b)
From now on, we shall refer to V̂_Switch(S*) simply as the Switch estimator.
Is the Switch estimator optimal?
The above discussion establishes the optimality of the Switch estimator among the family of estimators (7) parameterized by a choice of subset S. However, does the Switch estimator continue to be optimal in a larger context? This question can be assessed by determining whether it achieves, say up to a constant factor, the minimax risk given by

ρ_n(π, π_b; R_max) := inf_{V̂} sup_{ν ∈ P^K(R_max)} E[(V̂ − V(π))²].  (14)

Here the infimum ranges over all measurable functions V̂ of the data {(A_i, R_i)}_{i=1}^n, whereas the supremum is taken over all reward distributions ν = (ν_1, ..., ν_K) belonging to our family of bounded reward distributions. The following theorem provides a lower bound on this minimax risk:
Theorem 1.
There exists a universal positive constant c such that for all pairs (π, π_b), we have

ρ_n(π, π_b; R_max) ≥ c R_max² min_{S⊆[K]} { (Σ_{a∈S} π(a))² + (1/n) Σ_{a∉S} π²(a)/π_b(a) }.
See Section 4.3 for the proof of this lower bound.
By combining Theorem 1 and the upper bound (13b) on the mean-squared error of the Switch estimator V̂_Switch, we obtain a finite-sample characterization of the minimax risk up to universal constants—namely

ρ_n(π, π_b; R_max) ≍ R_max² min_{S⊆[K]} { (Σ_{a∈S} π(a))² + (1/n) Σ_{a∉S} π²(a)/π_b(a) }.  (15)
Consequently, we see that the Switch estimator is optimal among all estimators in a minimax sense.
In order to gain intuition for this optimality result, it is helpful to consider some special cases.
Degenerate case of on-policy evaluation: First, consider the degenerate setting π_b = π, so that our OPE problem actually reduces to a standard on-policy evaluation problem. In this case, the IS estimator reduces to the standard Monte Carlo estimate

V̂_MC := (1/n) Σ_{i=1}^n R_i.

A straightforward calculation shows that it has mean-squared error of order R_max²/n, which we claim is order-optimal. To reach this conclusion from our expression (15) for the minimax risk, it suffices to check that x = 0 is a minimizer of the optimization problem (10). This fact can be certified by showing that the all-zeros vector obeys the first-order optimality condition associated with the convex program (10). More precisely, for all actions a ∈ [K], we have

(16)
Large-sample regime: Returning to the general off-policy case (π_b ≠ π), suppose that the sample size n satisfies a lower bound of the form

(17)

for a sufficiently large constant c. In this case, one can again verify that the all-zeros vector is optimal for the convex program (10) by verifying the first-order optimality condition (16). As a consequence, we conclude that the Switch estimator reduces to the IS estimator in the large-sample regime defined by the lower bound (17). In this regime, the IS estimator achieves mean-squared error of order (R_max²/n) Σ_{a∈[K]} π²(a)/π_b(a). Under the bounded reward condition, this result recovers the rate provided by Li et al. [LMS15] in the large sample regime (17) up to a constant factor; see Theorem 1 in their paper.
It is worthwhile elaborating further on the connections with the paper of Li et al. [LMS15]: they studied classes of reward distributions that are parameterized by bounds on their means (as implied by our bounded rewards) and variances. In this sense, their analysis is finer-grained than our study of bounded rewards only. However, when their results are specialized to bounded reward distributions (1), their minimax risk result (cf. equation (2) in their paper) applies only in the large sample regime defined by the lower bound (17). As we have discussed, when this lower bound holds, the IS estimator itself is order-optimal, so analysis restricted to this regime fails to reveal the tradeoff between the two terms in the minimax rate (15), and in particular, the potential sub-optimality of the IS estimator (as reflected by the presence of the additional bias term in the minimax bound).
A closer look at the Switch estimator
In this subsection, we take a closer look at some properties of the Switch estimator, and in particular its connection to truncation of the likelihood ratio.
Link to likelihood truncation. We begin by investigating the nature of the best subset S*, as defined in equation (11). Let us assume without loss of generality that the actions are ordered according to the likelihood ratios—viz.

π(1)/π_b(1) ≥ π(2)/π_b(2) ≥ ··· ≥ π(K)/π_b(K).  (18)

Under this condition, unraveling the proof of Proposition 2 shows that the optimal subset S* takes the form

S* = {1, 2, ..., k*} for some integer k* ∈ {0, 1, ..., K}.  (19)

Here it should be understood that the choice k* = 0 corresponds to S* = ∅. Thus, the significance of the optimization problem (10) is that it specifies the optimal threshold at which to truncate the likelihood ratio.
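The prefix structure of equation (19) suggests a direct computational route: assuming the subset-selection criterion takes the two-term form B(S) = (Σ_{a∈S} π(a))² + (1/n) Σ_{a∉S} π(a)²/π_b(a) appearing in Proposition 1, it suffices to scan the K+1 prefixes of the arms sorted by decreasing likelihood ratio. A sketch under that assumption:

```python
import numpy as np

def best_switch_subset(pi_target, pi_behavior, n):
    """Scan the K+1 likelihood-ratio prefixes and return the one minimizing
        B(S) = (sum_{a in S} pi(a))^2 + (1/n) sum_{a not in S} pi(a)^2 / pi_b(a),
    together with the achieved value.  Runs in O(K log K) time."""
    pi = np.asarray(pi_target, dtype=float)
    pi_b = np.asarray(pi_behavior, dtype=float)
    order = np.argsort(-pi / pi_b)            # decreasing likelihood ratio
    pi_sorted, pi_b_sorted = pi[order], pi_b[order]
    best_S, best_val = set(), None
    head = 0.0                                # running sum of pi over the prefix
    tail = float(np.sum(pi_sorted ** 2 / pi_b_sorted)) / n
    for k in range(len(pi) + 1):
        val = head ** 2 + tail
        if best_val is None or val < best_val:
            best_val, best_S = val, set(order[:k].tolist())
        if k < len(pi):                       # move arm k into the prefix
            head += pi_sorted[k]
            tail -= pi_sorted[k] ** 2 / pi_b_sorted[k] / n
    return best_S, best_val
```

For an on-policy uniform problem with many samples the scan returns the empty set (pure IS), whereas an arm with a tiny behavior probability gets truncated into S, mirroring the threshold interpretation above.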
As noted previously, Wang et al. [WAD17] studied the sub-family of Switch estimators obtained by varying the truncation thresholds of the likelihood ratios. Similar to Li et al. [LMS15], they studied the large sample regime in which the IS estimator without any truncation is already minimax optimal up to constant factors. This fails to explain the benefits of truncating large likelihood ratios and the associated Switch estimator. In contrast, the key optimization problem (10) informs us of the optimal subset and hence an optimal truncation threshold, which allows the Switch estimator to optimally estimate the target value function for all sample sizes. This result is especially relevant for smaller sample sizes in which the problem is challenging, and the IS estimator can exhibit rather poor behavior.
Role of the plug-in component. The Switch estimator (13a) is based on applying the plug-in principle to the actions in S* with large likelihood ratios. However, doing so is not actually necessary to achieve the optimal rate of convergence (15). In fact, if we simply estimate the mean reward by zero for any action in S*, then we obtain the estimate

V̂ := (1/n) Σ_{i=1}^n ρ(A_i) R_i 1{A_i ∉ S*},  (20)

which is also minimax-optimal up to a constant factor. The intuition is that for actions on the support S*, the likelihood ratios are so large that the off-policy data is essentially useless, and can be ignored. It suffices to use the zero estimate, yielding a squared bias of the order (Σ_{a∈S*} π(a))² R_max². On the other hand, for actions in the complement (S*)^c, the likelihood ratios are comparatively small, so that the off-policy data should be exploited.
We note that truncated IS estimators of the type (20) have been explored in empirical work on counterfactual reasoning [BPQ+13] and reinforcement learning [WP07]; our work appears to be the first to establish their optimality for general likelihood ratios. Also noteworthy is the paper by Ionides [ION08], who analyzed the rate at which the truncation level should decay, assuming that the likelihood ratios decay at a polynomial rate. Our theory, while focused on finite action spaces, instead works for any configuration of the likelihood ratios, and in addition provides a precise truncation level instead of only a rate.
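The zero-substitution estimator of equation (20) is even simpler than the full Switch estimator: samples from arms in S are simply discarded. A minimal sketch under the same illustrative conventions as the earlier snippets:

```python
import numpy as np

def truncated_is_estimate(actions, rewards, pi_target, pi_behavior, S):
    """Zero-substitution variant (cf. equation (20)): arms in S contribute a
    zero mean-reward estimate; importance sampling is applied off S."""
    pi = np.asarray(pi_target, dtype=float)
    pi_b = np.asarray(pi_behavior, dtype=float)
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    rho = pi / pi_b                           # likelihood ratios
    keep = ~np.isin(actions, list(S))         # True for actions outside S
    return float(np.sum(rho[actions] * rewards * keep)) / len(actions)
```

With `S` empty this coincides with the plain IS estimate; adding an arm to `S` zeroes out its (possibly huge) likelihood-ratio terms at the cost of a bias of order π(S) R_max, which is exactly the bias-variance trade described above.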
Numerical experiments
In this section, we report the results of some simple numerical experiments on simulated data that serve to illustrate the possible differences between the three methods: the Switch, plug-in and importance sampling estimators. We performed experiments with the uniform target policy (i.e., π(a) = 1/K for all actions a ∈ [K]), and for each action a, we defined the reward distribution ν_a to be an equi-probable Bernoulli distribution over {0, 1}, so that r(a) = 1/2.
For each choice of K, we constructed a behavior policy π_b of the following form:

In words, the first group of actions is assigned a low behavior probability, whereas the remaining actions are assigned relatively large behavior probabilities. As we will see momentarily, this choice allows us to demonstrate interesting differences between the three estimators.

As is standard in high-dimensional statistics [WAI19], we study a sequence of such problems indexed by the pair (n, K); in order to obtain an interesting slice of this two-dimensional space, we set K as a function of n. For such a sequence of problems, we can explicitly compute that the mean-squared errors of the three estimators scale as follows:
(21a)
(21b)
(21c)
The purpose of our numerical experiments is to illustrate this theoretically predicted scaling.
Figure 1 shows the mean-squared errors of these three estimators versus the sample size n, plotted on a log-log scale, with results averaged over random trials. As can be seen from Figure 1, the Switch estimator performs better than the two competitors uniformly across different sample sizes. Note that our theory (21) predicts that the mean-squared errors should scale as n^{−β}, with a different exponent β for the plug-in, IS, and Switch estimators. In order to assess these theoretical predictions, we performed a linear regression of the log MSE on log n, thereby obtaining an estimated exponent β̂ for each estimator. These estimates and their standard errors are shown in the legend of Figure 1. Clearly, the estimated slopes are quite close to the theoretical predictions (21).
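A self-contained simulation in the same spirit can be sketched as follows. The behavior-policy construction, the hand-picked truncation threshold, and all numerical parameters below are assumed placeholders, not the exact configuration behind Figure 1 and equation (21):

```python
import numpy as np

def run_ope_experiment(K=20, n=100, n_trials=500, seed=0):
    """Monte Carlo comparison of the plug-in, IS, and Switch estimators on a
    synthetic bandit: uniform target policy, Bernoulli(1/2) rewards, and a
    behavior policy that rarely explores half of the arms.  Returns the
    empirical mean-squared error of each estimator."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                              # uniform target
    pi_b = np.where(np.arange(K) < K // 2, 0.02, 1.98) / K
    pi_b = pi_b / pi_b.sum()                              # behavior policy
    r = np.full(K, 0.5)                                   # mean rewards
    V = float(pi @ r)                                     # true value
    rho = pi / pi_b                                       # likelihood ratios
    S = rho > 2.0        # plug-in set: arms with large ratio (hand-picked)
    sq_err = {"plugin": [], "is": [], "switch": []}
    for _ in range(n_trials):
        A = rng.choice(K, size=n, p=pi_b)
        R = rng.binomial(1, r[A]).astype(float)
        counts = np.bincount(A, minlength=K)
        sums = np.bincount(A, weights=R, minlength=K)
        r_hat = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        est_plug = float(pi @ r_hat)
        est_is = float(np.mean(rho[A] * R))
        # Switch: plug-in on S, importance sampling off S
        est_switch = float(pi[S] @ r_hat[S]) + float(np.sum(rho[A] * R * ~S[A])) / n
        for name, est in (("plugin", est_plug), ("is", est_is), ("switch", est_switch)):
            sq_err[name].append((est - V) ** 2)
    return {name: float(np.mean(errs)) for name, errs in sq_err.items()}
```

In this configuration the rarely-explored arms inflate the IS variance, so truncating them should help; sweeping `n` (and coupling `K` to it) and regressing log MSE on log n reproduces the kind of slope comparison reported in Figure 1.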
3.2 OPE when π_b is unknown: competitive ratio
Our analysis thus far has taken the behavior policy π_b to be known. This condition, while realistic in some settings, is unrealistic in others. Accordingly, we now turn to the version of the OPE problem in which the only knowledge provided is the collection of action-reward pairs {(A_i, R_i)}_{i=1}^n. Note that the importance sampling and Switch estimators are no longer applicable, since they require knowledge of the behavior policy. Consequently, we are led to the natural question: what is an optimal estimator when π_b is unknown? Before answering this question, one needs to first settle upon a suitable notion of optimality.
Optimality via the minimax competitive ratio
The first important observation is that when the behavior policy is unknown, the global minimax risk is no longer a suitable metric for assessing optimality. Indeed, for any target policy π, one can construct a “nasty” behavior policy π_b such that for any estimator V̂, we have a lower bound of the form

sup_{ν ∈ P^K(R_max)} E[(V̂ − V(π))²] ≥ c R_max²

for some universal constant c > 0. For this reason, if we measure optimality according to the global minimax risk, then the trivial “always return zero” estimator—whose worst-case mean-squared error is at most R_max²—is optimal, and hence the global minimax risk is not a sensible criterion in this setting.
This pathology arises from the fact that the adversary has too much power: it is allowed to choose an arbitrarily bad behavior policy while suffering no consequences for doing so. In order to mitigate this deficiency, it is natural to consider the notion of a competitive ratio, as is standard in the literature on online learning [FW98]. An analysis in terms of the competitive ratio measures the performance of an estimator against the best achievable by some oracle—in this case, an oracle equipped with the knowledge of .
For a given pair of target and behavior policies, recall the definition (14) of the minimax risk; it corresponds to the smallest mean-squared error that can be guaranteed, uniformly over a class of reward distributions, by any method equipped with oracle knowledge of the behavior policy. Given an estimator and a reward distribution, we can measure its performance relative to this oracle lower bound via the competitive ratio
(22)
An estimator with a small competitive ratio—that is, close to one—is guaranteed to perform almost as well as the oracle that knows the behavior policy. On the other hand, a large competitive ratio indicates poor performance relative to the oracle.
As one concrete example, the “always return zero” estimator is far from ideal when considered in terms of the competitive ratio (22). Indeed, suppose that the value of the target policy is bounded away from zero and that the sample size is sufficiently large; we then have
(23)
Here step (i) follows from the definition of the competitive ratio, along with the scaling established in Section 3.1.1. Step (ii) follows from the assumption on the value of the target policy, which bounds the squared value from below. Thus, we see that the “always return zero” estimator performs extremely badly relative to the oracle, and its competitive ratio further degrades as the sample size increases.
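The degradation of the zero estimator's competitive ratio with the sample size can be seen in a toy simulation; here the oracle is represented by the IS estimator with a known uniform behavior policy, and all instance details are our own assumptions.

```python
# Sketch: the "always return zero" estimator has constant MSE equal to the
# squared value, while an oracle-style IS estimator (which knows the behavior
# policy) has MSE decaying like 1/n, so the ratio grows linearly in n.
import numpy as np

rng = np.random.default_rng(1)
K = 10
pi_t = np.full(K, 1.0 / K)               # uniform target policy (assumed)
pi_b = np.full(K, 1.0 / K)               # behavior policy, known to the oracle
r = np.full(K, 0.75)                     # mean rewards; value is 0.75 > 0
v_true = float(pi_t @ r)

ratios = []
for n in (100, 1000, 10000):
    errs = []
    for _ in range(300):
        a = rng.choice(K, size=n, p=pi_b)
        y = (rng.random(n) < r[a]).astype(float)
        is_est = np.mean((pi_t[a] / pi_b[a]) * y)  # oracle-style IS estimate
        errs.append((is_est - v_true) ** 2)
    mse_oracle = np.mean(errs)           # decays like 1/n
    mse_zero = v_true ** 2               # "always return zero": constant MSE
    ratios.append(mse_zero / mse_oracle)

print([round(x) for x in ratios])        # grows roughly linearly in n
```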
Competitive ratio of the plug-in estimator
As we have emphasized earlier, the plug-in approach is applicable even when the behavior policy is unknown. The following theorem provides a guarantee on its performance in terms of the competitive ratio:
Theorem 2.
There exists a universal constant such that for any target policy, the plug-in estimator satisfies the bound
(24)
See Section 4.4 for the proof of this theorem.
Several remarks are in order. Note that the upper bound on the competitive ratio is largest for a target distribution that places mass on all actions, in which case it scales with the total number of actions. Comparing this worst-case guarantee with that of the “always return zero” estimator (23) shows that the plug-in estimator is strictly better as soon as the sample size is sufficiently large relative to the number of actions, a relatively mild condition on the sample size. In addition, Theorem 2 guarantees that the worst-case competitive ratio of the plug-in estimator scales linearly with the support size of the target policy. This showcases the automatic adaptivity of the plug-in estimator to the target policy under consideration. See Figure 2 for a numerical illustration of this phenomenon.
We note that Li et al. [LMS15] established a similar guarantee (see Theorem 3 in their paper [LMS15]). One important difference is that their guarantee only holds in the large-sample regime (cf. the restriction (17)), whereas ours covers the full range of sample sizes. Moreover, their upper bound is proportional to the total number of actions for any target policy, and so does not reveal the adaptivity of the plug-in estimator to the support size.
Is the plug-in estimator optimal?
A natural follow-up question is to investigate the optimality of the plug-in approach—in the sense of the worst-case competitive ratio—in the unknown case. It turns out that the plug-in estimator is close to optimal, as demonstrated by the following theorem.
Theorem 3.
Suppose that the sample size satisfies a suitable lower bound involving a positive constant. Then for each choice of support size, there exists a target policy supported on that many actions such that
(25)
where the prefactor is a universal constant.
As shown in the proof of Theorem 3, for each given support size, the lower bound (25) is met by taking a target policy that chooses actions uniformly from a subset of that size. This lower bound shows that when evaluating a policy with a given support size, the gap—between performance when knowing the behavior policy relative to not knowing it—grows linearly in the support size up to a logarithmic factor; thus, these two settings are very different in terms of their statistical difficulty. In addition, comparing the lower bound in Theorem 3 with the upper bound provided in Theorem 2, one can see that the plug-in estimator is optimal up to a logarithmic factor, as measured by the worst-case competitive ratio.
3.3 OPE with lower bounds on the minimum exploration probability
The preceding subsections consider two extreme cases in which the behavior policy is either known or completely unknown. This leaves us with an interesting middle ground: what if we have some partial knowledge regarding the behavior policy? How can such information be properly exploited by estimators?
In this section, we initiate the investigation of these questions by focusing on a particular type of partial knowledge—namely, the minimum exploration probability. More precisely, for a given scalar in the unit interval, consider the collection of behavior policies whose probability of selecting each action is at least this scalar.
Given that the probabilities assigned by any randomized policy must sum to one, this family is non-empty only when the minimum exploration probability is at most the reciprocal of the number of actions. Our goal in this section is to characterize the difficulty of the OPE problem when it is known that the behavior policy belongs to this family for some choice of the minimum exploration probability. We first analyze the plug-in estimator, which does not require knowledge of this quantity. We then derive a minimax lower bound, which shows that the plug-in estimator is sub-optimal for certain choices of the minimum exploration probability. In the end, we design an alternative estimator, based on approximation by Chebyshev polynomials, that has optimality guarantees for a large family of target policies, albeit under a different but closely related Poisson sampling model.
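One convenient way to generate members of this constrained family, used here purely for illustration (it is our own construction, not the paper's), is to mix a uniform floor with an arbitrary distribution:

```python
# Sketch: sample behavior policies whose minimum action probability is at
# least p_min, and check the feasibility condition K * p_min <= 1.
import numpy as np

rng = np.random.default_rng(3)
K = 20

def sample_behavior_policy(p_min):
    # Feasibility: probabilities sum to one, so we need K * p_min <= 1.
    assert K * p_min <= 1.0, "family is empty when p_min > 1/K"
    free = 1.0 - K * p_min               # mass left after the uniform floor
    return np.full(K, p_min) + free * rng.dirichlet(np.ones(K))

pi_b = sample_behavior_policy(0.01)
print(pi_b.min() >= 0.01, np.isclose(pi_b.sum(), 1.0))  # → True True
```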
Performance of the plug-in estimator
We begin by establishing a performance guarantee for the plug-in estimator.
Theorem 4.
There exist universal constants such that
(26a)
In addition, if the minimum exploration probability is sufficiently large, then we have
(26b)
See Section 4.6 for the proof of these two claims.
Two interesting observations are worth making. First, if the behavior policy is sufficiently exploratory, in the sense that its minimum exploration probability is large enough, then the plug-in estimator achieves the optimal estimation error up to a constant factor. In other words, this side condition on the exploration probability is sufficient for the plug-in approach to perform optimally.
On the other hand, when the behavior policy is less exploratory, the mean-squared error involves an additional term. As shown in the proof of the upper bound (26a), this extra price stems from the bias of the plug-in estimator: if we fail to observe any reward for some action, then the plug-in estimator has no avenue for estimating the mean reward of that action; any estimate that it makes incurs a bias proportional to the mass that the target policy places on the action. When the minimum exploration probability is small, such an event takes place with non-negligible probability.
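The bias mechanism described above can be checked numerically. In the sketch below (the instance parameters are our own illustrative choices), one action is explored with probability roughly the reciprocal of the sample size, rewards are identically one, and unobserved actions are imputed as zero, so the entire error of the plug-in estimate is bias.

```python
# Sketch: under the multinomial model, an action a is unobserved with
# probability (1 - pi_b(a))^n, and an unobserved action contributes a bias
# of order pi_t(a) * r(a) to the plug-in estimate.
import numpy as np

rng = np.random.default_rng(4)
n, K = 300, 20
p_min = 1.0 / n                          # rare action: expected count about 1
pi_b = np.full(K, (1.0 - p_min) / (K - 1))
pi_b[0] = p_min                          # action 0 is rarely explored
pi_t = np.full(K, 1.0 / K)               # uniform target policy
v_true = 1.0                             # all mean rewards equal one, so the
                                         # only error source is the bias

# Exact bias when unseen actions are imputed as zero:
bias_exact = -np.sum(pi_t * (1.0 - pi_b) ** n)

ests = []
for _ in range(5000):
    counts = rng.multinomial(n, pi_b)
    r_hat = (counts > 0).astype(float)   # sample mean of constant-1 rewards
    ests.append(pi_t @ r_hat)
bias_mc = np.mean(ests) - v_true

print(bias_exact, bias_mc)               # the two should nearly agree
```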
Is the plug-in estimator optimal under partial knowledge?
Is the extra price necessary for all estimators? In order to answer this question, we need to characterize the constrained minimax risk
(27)
where the supremum is taken over all possible behavior policies in the constrained family, and over all reward distributions in the given class. In view of our guarantee (26b) for the plug-in approach, it can be seen that when the minimum exploration probability is sufficiently large, this constrained minimax risk is matched, up to a constant factor, by the plug-in estimator.
Consequently, the plug-in estimator is optimal when the behavior policy is sufficiently exploratory.
As a result, in the remainder of this section, we concentrate on the regime in which the minimum exploration probability is small. We begin by stating a minimax lower bound in this regime:
Theorem 5.
Consider the case of a small minimum exploration probability. If the problem parameters further satisfy two conditions involving a sufficiently large constant, then there exists another universal positive constant such that
(28)
See Section 4.7 for the proof of this theorem.
Note that the lower bound (28) implies a nontrivial floor on the worst-case risk in this regime. Combining the lower bound in Theorem 5 with the upper bound shown in Theorem 4, we conclude that the plug-in approach is minimax optimal up to constants once the minimum exploration probability is not too small. However, observe that there remains a gap between the upper and lower bounds when the behavior policy is known to be even less exploratory.
Optimal estimators via Chebyshev polynomials in the Poisson model
In this section, we devote ourselves to the design of optimal estimators when the behavior policy is less exploratory, meaning that its minimum exploration probability is small.
The Poisson model. In order to bring the key issues to the fore, we analyze the estimator under the Poissonized sampling model that is standard in the functional estimation literature. Recall that in the multinomial observation model, the action counts follow a multinomial distribution with parameters given by the sample size and the behavior policy. In the alternative Poisson model, the total number of samples is random, distributed according to a Poisson distribution with mean equal to the sample size. As a result, the action counts are mutually independent, with each count following a Poisson distribution whose mean is the product of the sample size and the corresponding probability under the behavior policy. Correspondingly, we can define the minimax risk under the Poisson model as
(29)
where the expectation is taken under the Poisson model.
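As a sanity check on the Poissonized model, the following sketch (with an arbitrary assumed behavior policy) draws a Poisson-distributed total sample size and verifies that the resulting action counts behave like independent Poisson variables: their empirical variances match their means, rather than the smaller multinomial variances.

```python
# Sketch: drawing N ~ Poisson(n) and then N actions from pi_b makes the
# per-action counts independent Poisson with means n * pi_b(a).
import numpy as np

rng = np.random.default_rng(5)
n, K = 100, 5
pi_b = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # assumed behavior policy

reps = []
for _ in range(20000):
    N = rng.poisson(n)                   # random total sample size
    a = rng.choice(K, size=N, p=pi_b)
    reps.append(np.bincount(a, minlength=K))
counts = np.array(reps)

print(np.round(counts.mean(axis=0), 1))  # ≈ n * pi_b
print(np.round(counts.var(axis=0), 1))   # ≈ n * pi_b as well (Poisson),
                                         # not n * pi_b * (1 - pi_b) (multinomial)
```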
Although the two sampling models differ, the difference is not essential for characterizing minimax risks. In particular, the minimax risks under the multinomial and Poisson models are closely related, as demonstrated by the following lemma.
Lemma 1.
For any choice of the rate parameter, we have
See Appendix A for the proof of this bound.
Setting the rate parameter appropriately in the above lemma reveals that the risk under multinomial sampling is controlled by the risk under the Poisson model. Consequently, in order to obtain an upper bound on the former, it suffices to control the latter.
The Chebyshev estimator. Now we turn to the construction of the optimal estimator under the Poisson model. It turns out that Chebyshev polynomials play a central role in this construction. Recall that the Chebyshev polynomial of degree m is given by
T_m(x) = cos(m arccos(x)) (30)
for x in [−1, 1]. Correspondingly, for any pair of scalars specifying a valid approximation interval, we can define a shifted and scaled version of this polynomial, expanded in the monomial basis so that its coefficients can be read off. Using these coefficients as building blocks, we then define a modulation function whose domain is the set of nonnegative integers.
In terms of these quantities, the Chebyshev estimator takes the form
(31)
where the empirical mean reward is as defined in equation (4). In words, when the action count is larger than the polynomial degree, one uses the usual sample mean reward. On the other hand, when the action count falls below this threshold, the Chebyshev estimator rescales the empirical mean reward by a count-dependent factor. The goal of this rescaling is to reduce the bias of the plug-in estimate.
A little calculation helps to provide intuition regarding this bias-reduction effect. Under the Poisson sampling model, the biases of the Chebyshev estimator and the plug-in estimator can be computed in closed form,
respectively. For the plug-in estimator, if we allow the behavior policy to range over the constrained family, then the bias can be as large as an exponential in the negative product of the sample size and the minimum exploration probability.
By construction, the shifted and scaled Chebyshev polynomial is the unique polynomial of the given degree that equals one at zero and is closest in sup norm to the all-zeros function on the chosen interval; see Exercise 2.13.14 in the book [TIM14]. By suitable choices of the degree and the interval endpoints, we can shape the additional modulation factor so as to reduce the bias of the plug-in estimator. The following theorem makes this intuition precise:
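This extremal property can be verified numerically. In the sketch below, the degree and interval are illustrative assumptions, not the choices prescribed by the theory; the check confirms that the shifted and scaled Chebyshev polynomial, normalized to equal one at zero, has smaller sup norm on the interval than randomly drawn competitors with the same normalization.

```python
# Sketch: among degree-m polynomials with P(0) = 1, the shifted/scaled
# Chebyshev polynomial P(x) = T_m(affine(x)) / T_m(affine(0)) minimizes the
# sup norm on [l, u], where affine maps [l, u] onto [-1, 1].
import numpy as np
from numpy.polynomial import chebyshev as C

m, l, u = 6, 0.1, 1.0                    # degree and interval: assumptions

def affine(x):
    """Map the interval [l, u] onto [-1, 1]."""
    return (2 * x - l - u) / (u - l)

Tm = C.Chebyshev.basis(m)                # the degree-m Chebyshev polynomial
scale = Tm(affine(0.0))                  # normalizer so that P(0) = 1
P = lambda x: Tm(affine(x)) / scale

xs = np.linspace(l, u, 2001)
sup_cheb = np.max(np.abs(P(xs)))         # small sup norm on [l, u]
assert np.isclose(P(0.0), 1.0)

# Any other degree-m polynomial normalized to equal 1 at zero should have a
# sup norm on [l, u] at least as large (the classical extremal property).
rng = np.random.default_rng(6)
for _ in range(200):
    coefs = rng.normal(size=m + 1)
    Q = np.polynomial.Polynomial(coefs)
    if abs(Q(0.0)) > 1e-6:
        vals = Q(xs) / Q(0.0)
        assert np.max(np.abs(vals)) >= sup_cheb - 1e-9

print("sup norm of normalized Chebyshev on [l, u]:", round(sup_cheb, 4))
```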
Theorem 6.
Suppose that the target policy satisfies the bound
(32)
and that we implement the Chebyshev estimator with degree and interval parameters chosen in terms of the sample size and the minimum exploration probability,
for some sufficiently large constant and some other constant. Then under the Poisson sampling model, there exists a pair of positive constants such that
(33)
See Section 4.8 for the proof of this upper bound.
Several comments are in order. First, in the regime of small minimum exploration probability, the worst-case risk (33) of the Chebyshev estimator (31) matches the lower bound (28) derived in Theorem 5, which showcases the optimality of the Chebyshev estimator when this partial knowledge is available.
Second, the restriction (32) on the target policy is worth emphasizing. In words, the constraint (32) requires the target policy to be somewhat “de-localized”—that is, no single action carries an extremely large probability mass. As an example, the uniform policy over all actions satisfies such a constraint.
Third, it should be noted that the Chebyshev estimator requires knowledge of the minimum exploration probability. This requirement makes it less practically applicable a priori, and how to design an estimator that adapts to the nested family of behavior policies is an interesting question for future work.
Numerical experiments
We conclude this section with experiments on both simulated data and real data to assess the performance of the Chebyshev estimator relative to other choices.
Simulated data. We begin with some experiments on simulated data. As in our previous simulations, we fix the target policy to be uniform over the action set, and for each action, we choose the reward distribution to be an equi-probable Bernoulli distribution, so that each mean reward equals one half. We then define a family of behavior policies indexed by the minimum exploration probability.
Again, we consider a particular joint scaling of the sample size and the minimum exploration probability that highlights interesting differences. Under this scaling, our theory predicts markedly different rates for the mean-squared errors of the plug-in and Chebyshev estimators. Figure 3 plots the mean-squared errors of the two estimators versus the sample size on a log-log scale. The results are averaged over random trials. It is clear from Figure 3 that the Chebyshev estimator performs better than the one based on the plug-in principle. Based on the estimated slopes (shown in the legend), the mean-squared error of the Chebyshev estimator decays polynomially in the sample size while, consistent with our theory, that of the plug-in estimator nearly plateaus.
Real data. We now turn to some experiments with the MovieLens 25M data set [HK15]. In order to form a bandit problem, we extracted a random subset of movies, each having at least a minimum number of ratings. This subset of movies defines the action space. For each movie, we average its rating over all samples in order to define the mean reward associated with that movie. This is the ground truth that defines our problem instance. Setting the target policy to be uniform, our goal is to estimate the mean rating over these movies.
In order to evaluate our methods, we need to generate an off-policy dataset. To do so, we uniformly subsample ratings from the set of all ratings on our subset of movies. This procedure implicitly defines a behavior policy that is very different from the uniform target policy, because the number of ratings per movie varies drastically. Given such an off-policy dataset, we evaluate the mean-squared errors of four different estimators: the plug-in estimator, the IS estimator, the Switch estimator, and the Chebyshev estimator. We repeat this procedure over many trials for a range of sample sizes.
Figure 4 plots the mean-squared error (averaged over the trials) versus the sample size for the four estimators. To be clear, the Switch and IS estimators have the luxury of knowing the behavior policy, whereas the Chebyshev estimator is given only the minimum exploration probability; the plug-in estimator requires no side information. Given the oracle knowledge of the behavior policy, the Switch estimator always outperforms the other estimators, including the IS estimator equipped with the same knowledge. In addition, the Chebyshev estimator outperforms the plug-in estimator, especially in the small-sample regime. These qualitative trends are consistent with our theoretical predictions.
4 Proofs
We now turn to the proofs of the main results presented in Section 3. We begin in Section 4.1 with the proof of Proposition 1, followed by the proof of Proposition 2 in Section 4.2. Sections 4.3 through 4.8 are devoted to the proofs of Theorems 1 through 6.
4.1 Proof of Proposition 1
We begin with the standard bias-variance decomposition
(34)
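The bias-variance decomposition is an exact identity, as the following toy check illustrates; the deliberately biased estimator below is our own example, not one of the estimators studied in the paper.

```python
# Sketch of (34): for any estimator, MSE = (bias)^2 + variance exactly,
# when the variance is the population variance of the estimates.
import numpy as np

rng = np.random.default_rng(7)
theta = 0.6                              # true quantity to estimate
samples = rng.binomial(1, theta, size=(100000, 50)).mean(axis=1)
est = 0.9 * samples                      # a deliberately biased estimator

mse = np.mean((est - theta) ** 2)
bias_sq = (np.mean(est) - theta) ** 2
var = np.var(est)
print(np.isclose(mse, bias_sq + var))    # → True
```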
Our proof involves establishing the following two bounds