Assumption 1

[10pt]

Enhanced Balancing of Bias-Variance Tradeoff in Stochastic Estimation: A Minimax Perspective

Henry Lam

Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, khl2114@columbia.edu

Xinyu Zhang

Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, zhang.xinyu@columbia.edu

Xuhui Zhang

School for the Gifted Young, University of Science and Technology of China, zxh1998@mail.ustc.edu.cn

Biased stochastic estimators, such as finite-differences for noisy gradient estimation, often contain parameters that need to be properly chosen to balance impacts from the bias and the variance. While the optimal order of these parameters in terms of the simulation budget can be readily established, the precise best values depend on model characteristics that are typically unknown in advance. We introduce a framework to construct new classes of estimators, based on judicious combinations of simulation runs on sequences of tuning parameter values, such that the estimators consistently outperform a given tuning parameter choice in the conventional approach, regardless of the unknown model characteristics. We argue the outperformance via what we call the asymptotic minimax risk ratio, obtained by minimizing the worst-case asymptotic ratio between the mean square errors of our estimators and the conventional one, where the worst case is over any possible values of the model unknowns. In particular, when the minimax ratio is less than 1, the calibrated estimator is guaranteed to perform better asymptotically. We identify this minimax ratio for general classes of weighted estimators, and the regimes where this ratio is less than 1. Moreover, we show that the best weighting scheme is characterized by a sum of two components with distinct decay rates. We explain how this arises from bias-variance balancing that combats the adversarial selection of the model constants, which can be analyzed via a tractable reformulation of a non-convex optimization problem.

Key words: bias-variance tradeoff, minimax analysis, stochastic estimation, finite difference, robust optimization

 

This paper studies biased stochastic estimators which, in the simplest form, are expressed as follows. To estimate a target quantity of interest , we use Monte Carlo simulation where each simulation run outputs

(1)

Here represents the noise of the simulation and satisfies , and is the bias given by . We obtain the final estimate by averaging independent runs produced by (id1):

(2)

where denotes an independent run.

The simulation runs in (id1) are specified by a parameter that typically impacts the bias and the variance in an antagonistic fashion. A common example is finite-difference schemes for black-box or zeroth-order noisy gradient estimation, in which is the perturbation size for the function input of interest. As increases, bias increases while variance decreases (and vice versa). To minimize the mean square error (MSE), the best choice of , in terms of the simulation budget, balances the magnitudes of the two error sources. In central finite-difference for instance, this optimal turns out to be of order , whereas in forward or backward finite-difference it is of order (e.g., Glasserman (2013) Chapter 7; Asmussen and Glynn (2007) Chapter 7; Fu (2006); L’Ecuyer (1991)).

While the above tradeoff and the optimal order of in is well understood in the literature, the precise best choice of depends on other, typically unknown, model characteristics (i.e., the “constants” inside and ). For example, choosing in a central finite-difference, and considering only the first-order error term, the best choice of depends on third-order derivative information and the variance of the noise that are typically unavailable in advance.

Our goal in this paper is to develop a framework that enhances the standard estimator (id1) regarding the choice of subject to the ambiguity of the model characteristics. A key idea we will undertake is to consider estimators beyond the form of naive sample average, in a way that reduces the impact of this uncertainty. Under this framework, we derive new estimators that consistently improve (id1) at a given choice of , regardless of these unknowns. This improvement is in terms of the asymptotic MSE as the simulation budget increases. More specifically, we consider the asymptotic ratio between the MSEs of any proposed estimator and (id1):

(3)

The proposed estimator can be parametrized by possibly many tuning parameters. The asymptotic ratio thus contains these parameters, the unknown model characteristics, and the in (id1). Regarding (id1) and its as a “baseline”, we calibrate the tuning parameters in the proposed estimator to minimize the worst-case asymptotic MSE ratio, where the worst case is over all possible model characteristics and choices of . On a high level, this can be expressed as

(4)

This minimized worst-case ratio provides a performance guarantee on our calibrated proposed estimator relative to (id1) – The MSE of our estimator is asymptotically at most of (id1) at the chosen , independent of any possible model specifications. In particular, if , our estimator is guaranteed to strictly improve over (id1). For convenience, we call the asymptotic minimax risk ratio (AMRR).

As our main contributions, we systematically identify the AMRR , achieve , and construct a scheme that consistently outperforms the conventional choice (id1), over the class of weighted estimator in the form

(5)

where is a suitable sequence of tuning parameters, and is any weighting sequence. When ’s are the uniform weights, (id1) is precisely the “recursive estimator” introduced in Glynn and Whitt (1992). This latter estimator selects as if, roughly speaking, the current simulation run is the last one in the budget. In other words, given that achieves the optimal MSE order for (id1), it selects , and it can be shown to exhibit the same optimal MSE order (the term “recursive” refers to the fact that it can be obtained by iteratively reweighting existing estimates and new runs that depend only on the current run index). This construction can be generalized to stochastic approximation (SA) type recursions (e.g., Duplay et al. (2018)), and their averaging version (as in Polyak and Juditsky (1992)). Our main results show that, in general, the optimal weighting scheme to obtain is in the form

(6)

where are two distinct decay rates. The two coefficients depends on the budget , in a way that none of the two terms in (id1) is asymptotically negligible when used in the weighted estimator. This weighting scheme, and an associated transformation from to , give rise to an explicitly identifiable that decreases with an “inflation” factor imposed on the -transformation. This reveals that, for instance, in the central finite-difference scheme, is when the multiplicative constants in and are the same. Since , the weighted estimator using (id1) always outperforms (id1) in terms of asymptotic MSE, independent of the unknown constants in and . In contrast, the corresponding is when the weights are obtained via standard SA recursion, either with or without averaging, indicating that such a restriction on the weighting sequence could lead to subpar performance in the MSE.

Our main analyses build on the insight that, to maintain a low worst-case risk ratio, one typically must calibrate the proposed estimator such that it maintains the relative magnitudes of bias and variance in a similar manner as the conventional scheme (id1). We will show that any distortion away from such a balancing allows an “adversary” to enlarge the risk ratio, thus leading to suboptimal outcomes. This balancing requirement generally leads to a non-convex constrained optimization problem which, upon a reformulation, reveals a tractable structure and solution to the minimax problem in (id1).

The remainder of the paper is as follows. Section id1 first reviews some related works. Section id1 describes the problem settings and reviews some established results on biased estimation. Section id1 presents our minimax framework and investigation on a special class of estimators. Section id1 presents our main results and explains their implications on general weighted estimators and AMRR. Section id1 discusses how our results carry to multivariate settings. Section id1 reports our numerical experiments. Section id1 concludes the paper. All proofs are in the Appendix.

Our study is related to several lines of work. The minimax formulation that we use to analyze and construct estimators resembles robust optimization (e.g., Ben-Tal et al. (2009), Bertsimas et al. (2011), Ben-Tal and Nemirovski (2002)) and robust control (e.g., Zhou and Doyle (1998)) that advocates decision-making against the worst-case scenario. Such ideas also have roots in game theory (Cesa-Bianchi and Lugosi (2006)). Related notions have also been used in online optimization, in which decision is made at each step under a noisily observed dynamical process (e.g., Flaxman et al. (2005), Shalev-Shwartz et al. (2012), Hazan et al. (2016)). The performance in this literature is often measured by the regret that indicates the suboptimality of a decision relative to the best decision assuming complete information (see, e.g., Besbes and Zeevi (2009, 2011) for applications in revenue management). Instead of using an “oracle” best as the benchmark in our minimax formulation, we use the sample average as our benchmark, and focus on improving this conventional estimator by analyzing the risk ratio. In this regard, we note that a ratio formulation and a non-oracle-best benchmark has been used in Agrawal et al. (2012), but in a different context in quantifying the impact of correlation in mean estimation, and their benchmark is an independent distribution with the worst-case being evaluated over a class of dependent models. Ratios between MSEs also appear in Pasupathy (2010) in studying the tradeoff between error tolerance and sample size in so-called retrospective approximation, which is a technique for solving stochastic root-finding or optimization problems via imposing a sequence of sample average approximation problems.

A main application of our work is finite-difference stochastic gradient estimation (e.g., Glasserman (2013) Chapter 7; Asmussen and Glynn (2007) Chapter 7; Fu (2006); L’Ecuyer (1991)), typically used when there is only a noisy simulation oracle to evaluate the function value or model output. Variants of the finite-difference method include the central, forward and backward finite-differences, with different perturbation directions and orders of bias (Zazanis and Suri (1993), Fox and Glynn (1989)). In contrast to finite-differences are unbiased derivative estimators, which include the infinitesimal perturbation analysis or pathwise differentiation (Ho et al. (1983), Heidelberger et al. (1988)), the likelihood ratio or the score function method (Glynn (1990), Rubinstein (1986), Reiman and Weiss (1989)), measure-valued or weak differentiation (Heidergott and Vázquez-Abad (2008), Heidergott et al. (2010)), and other variants such as the push-out method (Rubinstein (1992), L’Ecuyer (1990)), conditional and smoothed perturbation analysis (Gong and Ho (1987), Hong (2009), Fu and Hu (1992), Glasserman and Gong (1990), Fu et al. (2009)) and the generalized likelihood ratio method (Peng et al. (2018)). In multivariate settings, Spall (1992, 1997) study simultaneous perturbation to estimate gradients used in SA, by randomly generating a perturbation direction vector and properly weighting with the perturbation sizes to control estimation bias. Nesterov and Spokoiny (2017) proposes Gaussian smoothing with a different adjustment and investigates finite-sample behaviors in related optimization. Flaxman et al. (2005) suggests uniform sampling. Our framework can be applied to these procedures, as will be discussed in Section id1.

The main skeleton of our proposed estimators uses a sequentialized choice of the tuning parameter, which appears in Glynn and Whitt (1992) in their discussion of subcanonical estimators. A special case of our scheme, discussed in Section id1, resembles the idea of SA in stochastic optimization and root-finding that iteratively updates noisy estimates (Kushner and Yin (2003), Borkar (2009), Pasupathy and Kim (2011), Nemirovski et al. (2009), Polyak and Juditsky (1992), Ruppert (1988)). Our analyses there utilize the classical asymptotic techniques in Fabian (1968) and Chung (1954), and also Polyak and Juditsky (1992) in the averaging case.

We close this section by briefly comparing our work to multi-level Monte Carlo (Giles (2008)). This approach aims to reduce variance in simulation in the presence of a parameter selection like , by stratifying the simulation budget into different values. Of particular relevance is the randomized level selection (Rhee and Glynn (2015), Blanchet and Glynn (2015), Rychlik (1990), McLeish (2010)) that can turn biased estimators in the form of (id1) into unbiased estimators with canonical square-root convergence. This approach has been applied in the simulation of stochastic differential equations and nonlinear functions of expectations, and requires a proper coupling between simulation runs at consecutive levels to control the simulation effort. In investigating (id1), we do not assume any problem structure that allows such coupling, and our performance is benchmarked against the conventional (biased) sample-average scheme.

We elaborate our problem and notations in the introduction. We are interested in estimating . Given a tuning parameter , we run Monte Carlo simulation where each run outputs

(7)

with as , , and . We assume that:

Assumption 1

We have

  1. is a non-zero constant.

  2. is a random variable such that and as .

The above assumptions dictate that the order of the bias is , while the order of the variance is . The former is ensured by the first assumption and the latter by the second one.

As an example, in estimating the derivative of a function with unbiased noisy function evaluation, the central finite-difference (CFD) scheme elicits the output

where is an unbiased evaluation of , and is the perturbation size. Given that is thrice continuously differentiable with non-zero , the bias term has order . Typically, the order of the variance is . Suppose we do not apply common random numbers (CRN) in generating and , and that as , then . Suppose we are able to apply CRN so that as (i.e., we cannot fully eliminate the first-order variance as shrinks), then we have .

Similarly, the forward finite-difference (FFD) scheme elicits the output

Given that is twice continuously differentiable with non-zero , the bias term has order . Analogous conditions on the noise as above guarantees that . The same discussion holds for the backward finite-difference (BFD) scheme.

Given the capability to output independent runs of (id1), say , the conventional approach to obtain an estimate of is to take their sample average. Denote this as . The MSE of , denoted , can be expressed as

(8)

Considering the first order term, the bias increases with and the variance decreases with . Minimizing the MSE requires balancing these two errors to the same order, namely by choosing where , which solves the equation . This leads to an optimal MSE order . For example, in CFD and under the conditions we discussed above, we have , leading to an optimal MSE order ; in FFD or BFD we have , leading to an optimal MSE order .

In order to fully optimize the first-order MSE, including the coefficient, one needs to choose

(e.g., by applying the first-order optimality condition on the leading terms in (8)). This gives an optimal first-order MSE

(9)

The above choice of depends on the “constants” in the bias and variance terms, namely and . While and are often obtainable, constants like and are unknown a priori and can affect the performance of the simulation estimator, despite choosing an optimal order on in using the knowledge of and . Suppose we choose for some , where is optimally chosen. Then the first-order MSE is

(10)

which can be arbitrarily suboptimal relative to the best coefficient in (9). Our goal in this paper is to improve on this suboptimality, by considering estimators beyond the conventional sample average that consistently outperforms the constant showing up in (10).

The following theorem, which follow straightforwardly from Fox and Glynn (1989), summarizes the above discussion on the optimal order of the MSE:

Theorem 1

Under Assumption 1, suppose that , the sample-average-based estimator exhibits the asymptotic MSE

Choosing achieves the optimal MSE order, and the asymptotic MSE is

Lastly, we mention that, in practice, there are other considerations in obtaining good estimators, such as issues regarding the finiteness of the sample that can affect the accuracy of the asymptotic results. These considerations are beyond the scope of this work, which focuses mainly on a theoretical framework on improving the asymptotic constant.

We introduce a framework to assess, and calibrate, estimators beyond the sample-average-based estimator . This framework compares the asymptotic MSEs using as a baseline based on a minimax argument. Section id1 presents this framework, and Section id1 provides an initial study on a special type of estimators.

Consider an estimator for using simulation runs in the form (id1), where the tuning parameter in each run can be arbitrarily chosen. Our goal is to calibrate that performs well, or outperforms, in the first-order coefficient of the MSE, presuming that both and have the optimal order of errors. Let denote the MSE of for convenience.

The estimator can depend on other tuning parameters in addition to the in each run. We denote the collection of all the parameters that involves as , so that . Correspondingly, also depends on .

We suppose knowledge on the order of the bias and noise, namely and in (id1). However, we do not know the constants and . To make the discussion more precise, for fixed , we denote the class of simulation outputs

(11)

In other words, is the set of outputs with bias of order and noise of order , with arbitrary constants , , and the higher-order error terms.

The MSE of , , depends on evaluated at chosen . To highlight this dependence, we write . Similarly, depends on and , so that . We consider the asymptotic risk ratio

(12)

that measures the performance of relative to as a baseline. Since we only know is in but not its exact forms (i.e., the constants), we consider the worst-case scenario of , and search for the best parameters in that minimize this worst-case risk. Namely, we aim to solve

(13)

Note that (13), and the best choice of , depend on the used in . We now take a further viewpoint that an arbitrary user may select any , and we look for a strategy to calibrate that is guaranteed to perform well no matter how is chosen. To write this more explicitly, we let be dependent on , and we search for the best collection of parameters that is potentially a function on :

(14)

where denotes the set of admissible functions . This set depends on the class of estimators we use, which will be described in detail. Moreover, as we will see, (13) and (14) are closely related; in fact, under the settings we consider, solving either of them simultaneously solves another. In the following, we will focus on (14) and discuss the immediate implications on (13) where appropriate. We shall call the asymptotic minimax risk ratio (AMRR).

For convenience, let us from now on set as the tuning parameter in the sample-average-based estimator , where so that it achieves the optimal MSE order. The number can be any fixed integer to prevent from being too big at the early stage, and does not affect our asymptotic analyses.

To construct our proposed estimator , we will first use the idea of the recursive estimator studied in Section 5 of Glynn and Whitt (1992). At run , we simulate , where for some constant , and is the same as in , i.e., the parameter is chosen as if the current simulation run is the last one in the budget if a conventional sample-average-based estimator is used. The estimator in Glynn and Whitt (1992) uses the average of , namely . As shown in Glynn and Whitt (1992), this estimator exhibits the optimal MSE order like . Moreover, as they have also noted, this estimator admits a recursive representation , where each update depends only on the parameter indexed by the current run number, rather than the budget. Thus, the optimal MSE order is achieved in an “online” fashion as increases, independent of the final budget.

The initial class of estimators that we will consider is a generalization of Glynn and Whitt (1992). Specifically, we consider estimators defined via the recursion

(15)

where is defined as before and , and is in the form for some and . can be arbitrary. Moreover, we also consider averaging in the form

(16)

which resembles the standard Polyak-Ruppert averaging in SA (Polyak and Juditsky (1992)).

Our first result is that, in terms of the AMRR, the class of estimators and are quite restrictive and cannot bring in much improvement over . To elicit this result, we begin with some consistency properties of :

Proposition 1

Under Assumption 1, we have:

  1. If and , the estimator is -consistent for , i.e.,

  2. If and , or if , the error of in estimating is bounded away from zero in -norm as , i.e.,

Proposition 1 shows that estimates sensibly only when and . We thus focus on this case subsequently. The following describes the convergence rate:

Theorem 2

Under Assumption 1, the MSE of in estimating behaves as follows:

  1. For and ,

  2. For , and ,

  3. For , and , or for and ,

The proofs of the above results, which are detailed in Appendix id1, utilize the classical asymptotic techniques for recursive sequences in Fabian (1968) and a slight modification of Chung’s lemma (i.e., Lemma 1 in Appendix id1).

We now look at the AMRR for . First, Theorem 2 shows that the choice is the unique choice that gives rise to the optimal MSE order . Moreover, given this choice of , we need , in addition to . We will focus on these configurations for that achieve the same MSE order as the conventional estimator with the same .

Suppose we set , but allow the free selection of within the range that gives rise to the optimal MSE order. We thus can write , defined via (15) with where . The integer does not affect any asymptotic and can be taken as any given value. The following characterizes the AMRR and the configuration that attains it:

Theorem 3

Under Assumption 1, let be the MSE of , and

We have

which is attained by choosing .

Next, we provide more flexibility in the choice of in . In particular, rather than setting , we allow to depend on in any arbitrary fashion, i.e., where is any function. Let be the space of any functions from to . We have the following results on the AMRR of this enhanced scheme:

Theorem 4

Under Assumption 1, let be the MSE of , and

We have

which is attained by choosing and .

We note that Theorem 4 indicates is optimal in this enhanced scheme, while the optimal is chosen as a constant factor of .

Next we look at . It turns out that the AMRR depicted for in Theorem 4 applies also to . To this end, we first state the MSE of :

Theorem 5

Under Assumption 1, the MSE of in estimating behaves as follows:

  1. For and ,

  2. For and ,

Comparing Theorem 5 with Theorem 2, we see that the first-order MSE of in the considered regime exactly equals that of when and . Like before, is the unique choice that optimizes the MSE order for . Thus, we will focus on this choice of in . Note that then depends on . This leads us to the following AMRR:

Theorem 6

Under Assumption 1, let be the MSE of . Let

We have

which is attained by choosing , and any and .

The minimax ratios stated in Theorems 3, 4 and 6 remain the same, in a uniform fashion, when the parameter in is fixed instead of being chosen by an adversarial user. In other words, the minimax risk ratio of or compared to would not improve with a finer calibration on the tuning parameters catered to each specific . This is described in the following result:

Theorem 7

We have the following:

  1. Under the conditions and notations in Theorem 3, we have, for any fixed ,

    which is attained by choosing .

  2. Under the conditions and notations in Theorem 4, we have, for any fixed ,

    which is attained by choosing and .

  3. Under the conditions and notations in Theorem 6, we have, for any fixed ,

    which is attained by choosing , and any and .

Theorem 7 is consistent with Theorems 3, 4 and 6 in that the optimal strategies in calibrating the in and remain as a constant scaling on , regardless of what the specific value of is.

To get a numerical sense of the above results, Tables 1 and 2 show the AMRR and optimal configurations of and . Table 1 illustrates the scenario and (the CFD case). Restricting in (i.e., Theorem 3), the AMRR is , attained by setting in . In contrary, if we allow to arbitrarily depend on (i.e., Theorem 4), the AMRR is reduced to , attained by setting , and in . Similarly, the AMRR for (i.e., Theorem 6) is also , attained again by setting but now with any and .

Analogously, Table 2 illustrates the scenario and (the FFD and BFD cases). If we restrict in (i.e., Theorem 3), the AMRR becomes , attained by setting in . In contrary, if we allow to arbitrarily depend on (i.e., Theorems 4 and 6), the AMRR is , attained by setting , and in or in .

Note that, in all cases considered above, the AMRR is greater than 1, implying that without knowledge on the model characteristics, the estimators and can have a higher MSE than the baseline asymptotically.

( unadjusted) ( optimized)
AMRR 1.38 1.08 1.08
Optimal Configuration ,
Table 1: AMRR and optimal configurations for the case
( unadjusted) ( optimized)
AMRR 1.27 1.09 1.09
Optimal Configuration
Table 2: AMRR and optimal configurations for the case

We provide an intuitive explanation on the minimax results in Section id1. More specifically, we demonstrate that a key argument to obtain the minimax calibration strategy of a proposed class of estimators is to balance bias and variance in a similar manner as the baseline estimator, in terms of the factors multiplying the unknown first-order constants and . This insight is general and will be helpful in optimally calibrating wider classes of estimators, such as the general weighted estimators presented in the next section.

To explain, let us recall the notation in (12) that in general, the asymptotic risk ratio between a proposed estimator with parameter and a baseline estimator (where we hide its parameter for now) can be expressed as

Suppose that both estimators have the same MSE order, which is obtained optimally by balancing the orders of the bias and variance. Then the limit in the above expression becomes

(17)

where and refer to the first-order coefficient in the bias and variance terms of the proposed estimator, and similarly and refer to the corresponding quantities of the baseline estimator. Furthermore, with the model constants and , we can further write (17) as