# Bias-Variance Trade-offs: Novel Applications

## Synonyms

Bias-variance trade-offs, bias plus variance.

## Definition

Consider a given random variable $F$ and a random variable that we can modify, $\hat{F}$. We wish to use a sample of $\hat{F}$ as an estimate of a sample of $F$. The mean squared error between such a pair of samples is a sum of four terms. The first term reflects the statistical coupling between $F$ and $\hat{F}$ and is conventionally ignored in bias-variance analysis. The second term reflects the inherent noise in $F$ and is independent of the estimator $\hat{F}$. Accordingly, we cannot affect this term. In contrast, the third and fourth terms depend on $\hat{F}$. The third term, called the bias, is independent of the precise samples of both $F$ and $\hat{F}$, and reflects the difference between the means of $F$ and $\hat{F}$. The fourth term, called the variance, is independent of the precise sample of $F$, and reflects the inherent noise in the estimator as one samples it. These last two terms can be modified by changing the choice of the estimator. In particular, on small sample sets, we can often decrease our mean squared error by, for instance, introducing a small bias that causes a large reduction in the variance. While such bias-variance trade-offs are most commonly used in machine learning, this article shows that they are applicable in a much broader context and in a variety of situations. We also show, using experiments, how existing bias-variance trade-offs can be applied in novel circumstances to improve the performance of a class of optimization algorithms.

## Motivation and Background

In its simplest form, the bias-variance decomposition is based on the following idea. Say we have a Euclidean random variable $F$ taking on values distributed according to a density function $p(F)$. We want to estimate what value we would get if we were to sample $p(F)$. However, we do not (or cannot) do this simply by sampling $F$ directly. Instead, to form our estimate, we sample a different Euclidean random variable $\hat{F}$ taking on values distributed according to $p(\hat{F})$. Assuming a quadratic loss function, the quality of our estimate is measured by its Mean Squared Error (MSE):

$$\mathrm{MSE}(\hat{F}) \equiv \int d\hat{F}\, dF\; p(\hat{F}, F)\,(\hat{F} - F)^2 \tag{1}$$

Example 1: To illustrate Eq. 1, consider the simplest type of supervised machine learning problem, where there is a finite input space $X$, the output space $Y$ is the real numbers, and there is no noise. In such learning there is some deterministic ‘target function’ $f$ that maps each element of $X$ to a single element of $Y$. There is a ‘prior’ probability density function $\pi(f)$ over target functions, and it gets sampled to produce some particular target function, $f$. Next, $f$ is IID sampled at a set of inputs to produce a ‘training set’ $D$ of input-output pairs.

For simplicity, say there is some single fixed “prediction point” $x \in X$. Our goal in supervised learning is to estimate $f(x)$. However, $f$ is not known to us. Accordingly, to perform the estimation the training set is presented to a ‘learning algorithm’, which in response to the training set produces a guess $\hat{f}(x)$ for the value $f(x)$.

This entire stochastic procedure defines a joint distribution $\pi(f, D, f(x), \hat{f}(x))$. We can marginalize it to get a distribution $\pi(f(x), \hat{f}(x))$. Since $\hat{f}(x)$ is supposed to be an estimate of $f(x)$, we can identify $\hat{f}(x)$ as the value of the random variable $\hat{F}$ and $f(x)$ as the value of $F$. In other words, we can define $p(F, \hat{F}) = \pi(f(x), \hat{f}(x))$. If we now ask what the mean squared error is of the guess made by our learning algorithm for the value $f(x)$, we get Eq. 1.

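Note that one would expect that this $F$ and $\hat{F}$ are statistically dependent. (Indeed, if they weren’t dependent, then the dependence of the learning algorithm on the training set $D$ would be pointless.) Formally, the dependence can be established by writing the joint distribution of the target value $f(x)$ and the guess $\hat{f}(x)$ as

$$\pi(f(x), \hat{f}(x)) = \int dD\; \pi(\hat{f}(x) \mid D)\, \pi(D \mid f(x))\, \pi(f(x))$$

(since the guess of the learning algorithm is determined in full by the training set), and then noting that in general this integral differs from the product

$$\pi(f(x))\, \pi(\hat{f}(x)) = \pi(f(x)) \int dD\; \pi(\hat{f}(x) \mid D)\, \pi(D).$$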

In Ex. 1, $F$ and $\hat{F}$ are statistically coupled. Such coupling is extremely common. In practice, though, such coupling is simply ignored in analyses of bias plus variance, without any justification. In particular, Bayesian supervised learning avoids any explicit consideration of bias plus variance. For its part, non-Bayesian supervised learning avoids consideration of the coupling by replacing the distribution $p(F, \hat{F})$ with the associated product of marginals, $p(F)\,p(\hat{F})$. For now we follow that latter practice. So our equation for MSE reduces to

$$\mathrm{MSE}(\hat{F}) = \int d\hat{F}\, dF\; p(\hat{F})\, p(F)\,(\hat{F} - F)^2 \tag{2}$$

(If we were to account for the coupling of $\hat{F}$ and $F$, an additive correction term would need to be added to the right-hand side. For instance, see Wolpert [1997].)

Using simple algebra, the right-hand side of Eq. 2 can be written as the sum of three terms. The first is the variance of $F$. Since this is beyond our control in designing the estimator $\hat{F}$, we ignore it for the rest of this article. The second term involves a mean that describes the deterministic component of the error. This term depends on both the distribution of $F$ and that of $\hat{F}$, and quantifies how close the means of those distributions are. The third term is a variance that describes stochastic variations from one sample to the next. This term is independent of the random variable being estimated. Formally, up to an overall additive constant, we can write

$$\begin{aligned}
\mathrm{MSE}(\hat{F}) &= \int d\hat{F}\; p(\hat{F})\,\bigl(\hat{F}^2 - 2\,\mathbb{E}(F)\,\hat{F} + [\mathbb{E}(F)]^2\bigr) \\
&= [\mathbb{E}(F) - \mathbb{E}(\hat{F})]^2 + \mathbb{V}(\hat{F}) \\
&= \text{bias}^2 + \text{variance}
\end{aligned} \tag{3}$$
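As a numerical illustration of this decomposition, the following sketch estimates the squared bias, variance and MSE of a deliberately biased estimator by simulation. The setup is our own assumption and is not taken from the experiments in this article: the target is a fixed value `mu` (so the quantity being estimated has zero variance), and the estimator is a hypothetical `shrink` factor times the mean of `n` Gaussian draws.

```python
import random

def mse_decomposition(shrink, n=5, mu=1.0, sigma=2.0, trials=20000, seed=0):
    """Simulate the estimator shrink * (mean of n draws from N(mu, sigma^2))
    and return (bias^2, variance, mse) with respect to the fixed target mu."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        estimates.append(shrink * sample_mean)
    mean_est = sum(estimates) / trials
    bias_sq = (mean_est - mu) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / trials
    mse = sum((e - mu) ** 2 for e in estimates) / trials
    return bias_sq, variance, mse

unbiased = mse_decomposition(shrink=1.0)  # plain sample mean
shrunk = mse_decomposition(shrink=0.7)    # biased toward zero, lower variance
```

With `shrink = 1.0` the estimator is unbiased. With `shrink = 0.7` it acquires a bias, but its variance falls by about half, and on samples this small the total MSE is lower.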

In light of Eq. 3, one way to try to reduce expected quadratic error is to modify an estimator to trade off bias and variance. Some of the most famous applications of such bias-variance trade-offs occur in parametric machine learning, where many techniques have been developed to exploit the trade-off. However, there are some extensions of that trade-off that could be applied in parametric machine learning but have been ignored by the community. We illustrate one of them here.

Moreover, the bias-variance trade-off arises in many other fields besides parametric machine learning. In particular, as we illustrate here, it arises in integral estimation and optimization. In the rest of this article we present some novel applications of the bias-variance trade-off, and describe some interesting features in each case. A recurring theme is that whenever a bias-variance trade-off arises in a particular field, we can use many techniques from parametric machine learning that have been developed for exploiting this trade-off. The novel applications of the trade-off discussed here are instances of Probability Collectives (PC) [Wolpert and Rajnarayan, 2007; Wolpert et al., 2006; Wolpert and Bieniawski, 2004a, b; Macready and Wolpert, 2005], a general approach to using probability distributions to do blackbox optimization.

## Applications

In this section, we describe some applications of the bias-variance trade-off. First, we describe Monte Carlo (MC) techniques for the estimation of integrals, and provide a brief analysis of bias-variance trade-offs in this context. Next, we introduce the field of Monte Carlo Optimization (MCO), and illustrate that there are more subtleties involved than in simple MC. Then, we describe the field of Parametric Machine Learning, which, as we will show, is formally identical to MCO. Finally, we present an application of Parametric Learning (PL) techniques to improve the performance of MCO algorithms. We do this in the context of an MCO problem that is central to how PC addresses blackbox optimization.

### Monte Carlo Estimation of Integrals Using Importance Sampling

Monte Carlo methods are often the method of choice for estimating difficult high-dimensional integrals. Consider a function $f: X \to \mathbb{R}$, which we want to integrate over some region $\mathcal{X} \subseteq X$, yielding the value $L$, as given by

$$L = \int_{\mathcal{X}} dx\; f(x).$$

We can view this as a random variable $F$, with density function given by a Dirac delta function centered on $L$. Therefore, the variance of $F$ is 0, and Eq. 3 is exact.

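A popular MC method to estimate this integral, $L = \int_{\mathcal{X}} dx\, f(x)$, is importance sampling [see Robert and Casella, 2004]. This exploits the law of large numbers as follows: i.i.d. samples $x^{(i)}$, $i = 1, \ldots, N$, are generated from a so-called importance distribution $h(x)$ that we control, and the associated values of the integrand, $f(x^{(i)})$, are computed. Denote these ‘data’ by

$$D = \{(x^{(i)}, f(x^{(i)})),\; i = 1, \ldots, N\} \tag{4}$$

Now,

$$L = \int_{\mathcal{X}} dx\; h(x)\, \frac{f(x)}{h(x)} = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \frac{f(x^{(i)})}{h(x^{(i)})} \quad \text{with probability 1.}$$

Denote by $\hat{L}$ the random variable with value given by the sample average for $D$:

$$\hat{L} = \frac{1}{N} \sum_{i=1}^{N} \frac{f(x^{(i)})}{h(x^{(i)})}.$$

We use $\hat{L}$ as our statistical estimator for $L$, as we broadly described in the introductory section. Assuming a quadratic loss function, $(\hat{L} - L)^2$, the bias-variance decomposition described in Eq. 3 applies exactly. It can be shown that the estimator is unbiased, that is, $\mathbb{E}(\hat{L}) = L$, where the mean is over samples of $h$. Consequently, the MSE of this estimator is just its variance. The choice of sampling distribution $h$ that minimizes this variance is given by [see Robert and Casella, 2004]

$$h^\star(x) = \frac{|f(x)|}{\int_{\mathcal{X}} dx'\, |f(x')|}.$$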

By itself, this result is not very helpful, since the equation for the optimal importance distribution contains a similar integral to the one we are trying to estimate. For non-negative integrands , the VEGAS algorithm [Lepage, 1978] describes an adaptive method to find successively better importance distributions, by iteratively estimating , and then using that estimate to generate the next importance distribution . In the case of these unbiased estimators, there is no trade-off between bias and variance, and minimizing MSE is achieved by minimizing variance.
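The effect of the importance distribution on the variance of an unbiased estimator can be checked with a small simulation. The sketch below is illustrative only: the integrand $x^2$ on $[0, 1]$ (exact integral $1/3$) and the three candidate densities are our own choices, and the names `importance_estimate`, `proposals` and `repeated_estimates` are hypothetical.

```python
import random
import statistics

def importance_estimate(f, sample_h, density_h, n, rng):
    """Sample average of f(x) / h(x) over n i.i.d. draws x ~ h."""
    total = 0.0
    for _ in range(n):
        x = sample_h(rng)
        total += f(x) / density_h(x)
    return total / n

f = lambda x: x * x  # integrand on [0, 1]; the exact integral is 1/3

# (sampler, density) pairs; each sampler uses the inverse-CDF method.
# 1 - rng.random() lies in (0, 1], so no density is ever evaluated at 0.
proposals = {
    "uniform": (lambda rng: 1.0 - rng.random(), lambda x: 1.0),
    "linear": (lambda rng: (1.0 - rng.random()) ** 0.5, lambda x: 2.0 * x),
    "optimal": (lambda rng: (1.0 - rng.random()) ** (1.0 / 3.0),
                lambda x: 3.0 * x * x),  # proportional to |f|
}

def repeated_estimates(name, n=50, repeats=2000, seed=1):
    """Repeat the n-sample estimate many times to expose its spread."""
    rng = random.Random(seed)
    sampler, density = proposals[name]
    return [importance_estimate(f, sampler, density, n, rng)
            for _ in range(repeats)]

results = {name: repeated_estimates(name) for name in proposals}
spread = {name: statistics.pstdev(vals) for name, vals in results.items()}
```

All three estimators average to roughly $1/3$, but their spreads differ: the density proportional to $|f|$ makes every term $f(x)/h(x)$ equal to the integral itself, so its spread is zero up to rounding.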

### Monte Carlo Optimization

Instead of a fixed integral to evaluate, consider a parametrized integral

$$L(\phi) = \int_{\mathcal{X}} dx\; U(x, \phi).$$

Further, suppose we are interested in finding the value of the parameter $\phi \in \Phi$ that minimizes $L(\phi)$:

$$\phi^\star = \arg\min_{\phi \in \Phi} L(\phi).$$

In the case where the functional form of $L(\phi)$ is not explicitly known, one approach to solve this problem is a technique called Monte Carlo Optimization (MCO) [see Ermoliev and Norkin, 1998], involving repeated MC estimation of the integral in question with adaptive modification of the parameter $\phi$.

We proceed by analogy to the case with MC. First, we introduce the $\phi$-indexed random variable $L(\cdot)$, all of whose components have delta-function distributions about the associated values $L(\phi)$. Next, we introduce a $\phi$-indexed vector random variable $\hat{L}(\cdot)$ with values

$$\ell(\cdot) \equiv \{\ell(\phi)\;\; \forall\, \phi \in \Phi\} \tag{5}$$

Each real-valued component $\hat{L}(\phi)$ can be sampled and viewed as an estimate of $L(\phi)$.

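For example, let $D$ be a data set as described in Eq. 4, with samples $x^{(i)}$ drawn from an importance distribution $h$. Then for every $\phi$, any sample of $D$ provides an associated estimate

$$\ell(\phi) = \frac{1}{N} \sum_{i=1}^{N} \frac{U(x^{(i)}, \phi)}{h(x^{(i)})}.$$

That average serves as an estimate of $L(\phi)$. Formally, $\hat{L}(\cdot)$ is a function of the random variable $D$, and is given by such averaging over the elements of $D$. So, a sample of $D$ provides a sample of $\hat{L}(\cdot)$. A priori, we make no restrictions on $\hat{L}(\cdot)$, and so, in general, its components may be statistically coupled with one another. Note that this coupling arises even though we are, for simplicity, treating each $L(\phi)$ as having a delta-function distribution, rather than as having a non-zero variance that would reflect our lack of knowledge of the functions.

However $\hat{L}(\cdot)$ is defined, given a sample $\ell(\cdot)$ of $\hat{L}(\cdot)$, one way to estimate $\phi^\star$ is

$$\hat{\phi}^\star = \arg\min_{\phi} \ell(\phi).$$

We call this approach ‘natural’ MCO. As an example, say that $D$ is a set of $N$ samples of $h$, and let

$$\hat{L}(\phi) = \frac{1}{N} \sum_{i=1}^{N} \frac{U(x^{(i)}, \phi)}{h(x^{(i)})},$$

as above. Under this choice for $\hat{L}(\cdot)$,

$$\hat{\phi}^\star = \arg\min_{\phi} \sum_{i=1}^{N} \frac{U(x^{(i)}, \phi)}{h(x^{(i)})} \tag{6}$$

We call this approach ‘naive’ MCO.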
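A minimal sketch of naive MCO over a finite grid of parameter values follows. It reuses a single data set for every candidate parameter and returns the minimizer of the importance-weighted sample average. The integrand $(x - \phi)^2$ on $[0, 1]$, the uniform importance distribution and the grid are assumptions made purely for illustration; for this choice the true parametrized integral is minimized at $\phi = 0.5$.

```python
import random

def naive_mco(U, phis, sample_h, density_h, n, rng):
    """Return the phi in `phis` minimizing (1/n) * sum_i U(x_i, phi) / h(x_i),
    where the same n samples x_i ~ h are reused for every phi."""
    xs = [sample_h(rng) for _ in range(n)]
    weights = [1.0 / density_h(x) for x in xs]

    def l_hat(phi):
        return sum(w * U(x, phi) for x, w in zip(xs, weights)) / n

    return min(phis, key=l_hat)

U = lambda x, phi: (x - phi) ** 2     # hypothetical integrand on [0, 1]
phis = [i / 100 for i in range(101)]  # candidate parameter values
rng = random.Random(2)
phi_hat = naive_mco(U, phis, lambda r: r.random(), lambda x: 1.0, n=200, rng=rng)
```

Because all components of the estimate share the same samples, they are statistically coupled, which is the situation analyzed in the remainder of this section.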

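Consider any algorithm that estimates $\phi^\star$ as a single-valued function of $\hat{L}(\cdot)$. The estimate of $\phi^\star$ produced by that algorithm is itself a random variable, since it is a function of the random variable $\hat{L}(\cdot)$. Call this random variable $\hat{\Phi}^\star$, taking on values $\hat{\phi}^\star$. Any MCO algorithm is defined by $\hat{\Phi}^\star$; that random variable encapsulates the output estimate made by the algorithm.

To analyze the error of such an algorithm, consider the associated random variable given by the true parametrized integral $L(\hat{\Phi}^\star)$. The difference between a sample of $L(\hat{\Phi}^\star)$ and the true minimal value of the integral, $L(\phi^\star) = \min_\phi L(\phi)$, is the error introduced by our estimating that optimal $\phi$ as a sample of $\hat{\Phi}^\star$. Since our aim in MCO is to minimize $L(\phi)$, we adopt the loss function $\mathrm{Loss}(\hat{\phi}^\star) \equiv L(\hat{\phi}^\star) - L(\phi^\star)$. This is in contrast to our discussion on MC integration, which involved quadratic loss. The current loss function just equals $L(\hat{\phi}^\star)$ up to an additive constant that is fixed by the MCO problem at hand and is beyond our control. Up to that additive constant, the associated expected loss is

$$\mathbb{E}(\mathrm{Loss}) = \int d\hat{\phi}^\star\; p(\hat{\phi}^\star)\, L(\hat{\phi}^\star) \tag{7}$$

Now change coordinates in this integral from the values of the scalar random variable $\hat{\Phi}^\star$ to the values of the underlying vector random variable $\hat{L}(\cdot)$. The expected loss now becomes

$$\mathbb{E}(\mathrm{Loss}) = \int d\ell(\cdot)\; p(\ell(\cdot))\, L(\hat{\phi}^\star(\ell(\cdot))).$$

The natural MCO algorithm provides some insight into these results. For that algorithm,

$$\begin{aligned}
\mathbb{E}(\mathrm{Loss}) &= \int d\ell(\cdot)\; p(\ell(\cdot))\, L\bigl(\arg\min_{\phi} \ell(\phi)\bigr) \\
&= \int d\ell(\phi_1)\, d\ell(\phi_2) \cdots\; p(\ell(\phi_1), \ell(\phi_2), \ldots)\, L\bigl(\arg\min_{\phi} \ell(\phi)\bigr)
\end{aligned} \tag{8}$$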

For any fixed $\phi$, there is an error between samples of the estimate $\hat{L}(\phi)$ and the true value $L(\phi)$. Bias-variance considerations apply to this error, exactly as in the discussion of MC above. We are not, however, concerned with $\hat{L}$ for a single component $\phi$, but rather for a set $\Phi$ of $\phi$’s.

The simplest such case is where the components of $\hat{L}(\cdot)$ are independent. Even so, $\arg\min_\phi \hat{L}(\phi)$ is distributed according to the laws for extrema of multiple independent random variables, and this distribution depends on higher-order moments of each random variable $\hat{L}(\phi)$. This means that the expected loss $\mathbb{E}(\mathrm{Loss})$ also depends on such higher-order moments. Only the first two moments, however, arise in the bias and variance for any single $\phi$. Thus, even in the simplest possible case, the bias-variance considerations for the individual $\phi$ do not provide a complete analysis.

In most cases, the components of the estimator $\hat{L}(\cdot)$ are not independent. Therefore, in order to analyze the expected loss $\mathbb{E}(\mathrm{Loss})$, in addition to higher moments of the distribution for each $\phi$, we must now also consider higher-order moments coupling the estimates $\hat{L}(\phi)$ for different $\phi$.

Due to these effects, it may be quite acceptable for all the components $\hat{L}(\phi)$ to have both a large bias and a large variance, as long as they still order the $\phi$’s correctly with respect to the true $L(\phi)$. In such a situation, large covariances could ensure that if some $\hat{L}(\phi)$ were incorrectly large, then $\hat{L}(\phi')$, $\phi' \neq \phi$, would also be incorrectly large. This coupling between the components of $\hat{L}(\cdot)$ would preserve the ordering of $\phi$’s under $L$. So, even with large bias and variance for each $\phi$, the estimator as a whole would still work well.

Nevertheless, it is sufficient to design estimators $\hat{L}(\phi)$ with sufficiently small bias plus variance for each single $\phi$. More precisely, suppose that those terms are very small on the scale of differences $L(\phi) - L(\phi')$ for any $\phi$ and $\phi'$. Then by Chebyshev’s inequality, we know that the density functions of the random variables $\hat{L}(\phi)$ and $\hat{L}(\phi')$ have almost no overlap. Accordingly, the probability that a sample of $\hat{L}(\phi) - \hat{L}(\phi')$ has the opposite sign of $L(\phi) - L(\phi')$ is almost zero.

Evidently, the expected loss $\mathbb{E}(\mathrm{Loss})$ is generally determined by a complicated relationship involving bias, variance, covariance, and higher moments. Natural MCO in general, and naive MCO in particular, ignore all of these effects, and consequently, often perform quite poorly in practice. In the next section we discuss some ways of addressing this problem.
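The role of covariance in preserving the ordering of parameter values can be seen in a small simulation. The sketch below compares two ways of estimating the parametrized integral at two candidate values: with independent samples for each value, and with the same samples shared between them. Each individual estimate has the same bias and variance under both schemes; only the coupling differs. The Gaussian sampling distribution, the integrand $(x - \phi)^2$, and the values 0.0 and 0.3 are assumptions chosen for illustration.

```python
import random

def l_hat(phi, xs):
    """Sample-average estimate of L(phi) = E[(x - phi)^2] for x ~ N(0, 1)."""
    return sum((x - phi) ** 2 for x in xs) / len(xs)

def correct_ordering_rate(shared, phi_a=0.0, phi_b=0.3, n=20, trials=4000, seed=3):
    """Fraction of trials in which the estimates rank phi_a below phi_b,
    matching the true ordering L(phi_a) = 1 < L(phi_b) = 1.09."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Shared samples couple the two estimates; fresh samples do not.
        xs_b = xs_a if shared else [rng.gauss(0.0, 1.0) for _ in range(n)]
        hits += l_hat(phi_a, xs_a) < l_hat(phi_b, xs_b)
    return hits / trials

rate_independent = correct_ordering_rate(shared=False)
rate_shared = correct_ordering_rate(shared=True)
```

In this setup the shared-sample scheme orders the two candidates correctly noticeably more often, even though the per-candidate bias and variance are unchanged.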

### Parametric Machine Learning

There are many versions of the basic MCO problem described in the previous section. Some of the best-explored arise in parametric density estimation and parametric supervised learning, which together comprise the field of Parametric Machine Learning (PL).

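In particular, parametric supervised learning attempts to solve

$$\arg\min_{\phi \in \Phi} \int dx\; p(x) \int dy\; p(y \mid x)\, f_\phi(x, y).$$

Here, the values $x$ represent inputs, and the values $y$ represent corresponding outputs, generated according to some stochastic process defined by a set of conditional distributions $\{p(y \mid x),\; x \in X\}$. Typically, one tries to solve this problem by casting it as an MCO problem. For instance, say we adopt a quadratic loss between a predictor $z_\phi(x)$ and the true value of $y$. Using MCO notation, we can express the associated supervised learning problem as finding $\arg\min_\phi L(\phi)$, where

$$\begin{aligned}
l_\phi(x) &\equiv \int dy\; p(y \mid x)\,(z_\phi(x) - y)^2, \\
U(x, \phi) &\equiv p(x)\, l_\phi(x), \\
L(\phi) &= \int dx\; U(x, \phi).
\end{aligned} \tag{9}$$

Next, the argmin is estimated by minimizing a sample-based estimate of the $L(\phi)$’s. More precisely, we are given a ‘training set’ of samples of $p(y \mid x)\, p(x)$, $\{(x^{(i)}, y^{(i)}),\; i = 1, \ldots, N\}$. This training set provides a set of associated estimates of $L(\phi)$:

$$\hat{L}(\phi) = \frac{1}{N} \sum_{i=1}^{N} (z_\phi(x^{(i)}) - y^{(i)})^2.$$

These are used to estimate $\phi^\star$, exactly as in MCO. In particular, one could estimate the minimizer of $L(\phi)$ by finding the minimum of $\hat{L}(\phi)$, just as in natural MCO. As mentioned above, this MCO algorithm can perform very poorly in practice. In PL, this poor performance is called ‘overfitting the data’.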

There are several formal approaches that have been explored in PL to try to address this ‘overfitting the data’. Interestingly, none are based on direct consideration of the random variable and the ramifications of its distribution for expected loss (cf. Eq. 8). In particular, no work has applied the mathematics of extrema of multiple random variables to analyze the bias-variance-covariance trade-offs encapsulated in Eq. 8.

The PL approach that perhaps comes closest to such direct consideration of the distribution of is uniform convergence theory, which is a central part of Computational Learning Theory [see Angluin, 1992]. Uniform convergence theory starts by crudely encapsulating the quadratic loss formula for expected loss under natural MCO, Eq. 8. It does this by considering the worst-case bound, over possible and , of the probability that exceeds by more than . It then examines how that bound varies with . In particular, it relates such variation to characteristics of the set of functions , e.g., the ‘VC dimension’ of that set [see Vapnik, 1982, 1995].

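Another, historically earlier approach, is to apply bias-plus-variance considerations to the entire PL algorithm $\hat{\Phi}^\star$, rather than to each $\hat{L}(\phi)$ separately. This approach is applicable for algorithms that do not use natural MCO, and even for non-parametric supervised learning. As formulated for parametric supervised learning, with $z_\phi(x)$ the predictor and $y$ the true output, this approach combines the formulas in Eq. 9 to write

$$L(\phi) = \int dx\; p(x) \int dy\; p(y \mid x)\,(z_\phi(x) - y)^2.$$

This is then substituted into Eq. 7, giving

$$\begin{aligned}
\mathbb{E}(\mathrm{Loss}) &= \int d\hat{\phi}^\star\; p(\hat{\phi}^\star) \int dx\; p(x) \int dy\; p(y \mid x)\,(z_{\hat{\phi}^\star}(x) - y)^2 \\
&= \int dx\; p(x) \left[ \int d\hat{\phi}^\star\, dy\; p(y \mid x)\, p(\hat{\phi}^\star)\,(z_{\hat{\phi}^\star}(x) - y)^2 \right]
\end{aligned} \tag{10}$$

The term in square brackets is an $x$-parameterized expected quadratic loss, which can be decomposed into a bias, variance, etc., in the usual way. This formulation eliminates any direct concern for issues like the distribution of extrema of multiple random variables, covariances between $\hat{L}(\phi)$ and $\hat{L}(\phi')$ for different values of $\phi$, and so on.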

There are numerous other approaches for addressing the problems of natural MCO that have been explored in PL. Particularly important among these are Bayesian approaches, e.g., Buntine and Weigend [1991], Berger [1985], Mackay [2003]. Based on these approaches, as well as on intuition, many powerful techniques for addressing data-overfitting have been explored in PL, including regularization, cross-validation, stacking, bagging, etc. Essentially all of these techniques can be applied to any MCO problem, not just PL problems. Since many of these techniques can be justified using Eq. 10, they provide a way to exploit the bias-variance trade-off in other domains besides PL.

### PLMCO

In this section, we illustrate how PL techniques that exploit the bias-variance decomposition of Eq. 10 can be used to improve an MCO algorithm used in a domain outside of PL. This MCO algorithm is a version of adaptive importance sampling, somewhat similar to the CE method [Rubinstein and Kroese, 2004], and is related to function smoothing on continuous spaces. The PL techniques described are applicable to any other MCO problem, and this particular one is chosen just as an example.

#### MCO Problem Description

Consider the problem of finding the $\theta$-parameterized distribution $q_\theta$ that minimizes the associated expected value of a function $G: X \to \mathbb{R}$, i.e., find

$$\arg\min_{\theta} \mathbb{E}_{q_\theta}(G).$$

We are interested in versions of this problem where we do not know the functional form of $G$, but can obtain its value at any $x \in X$. Similarly, we cannot assume that $G$ is smooth, nor can we evaluate its derivatives directly. This scenario arises in many fields, including blackbox optimization [see Wolpert et al., 2006], and risk minimization [see Ermoliev and Norkin, 1998].

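We begin by expressing this minimization problem as an MCO problem. Write

$$\mathbb{E}_{q_\theta}(G) = \int_X dx\; q_\theta(x)\, G(x).$$

Using MCO terminology, $U(x, \theta) = q_\theta(x)\, G(x)$ and $\phi = \theta$. To apply MCO, we must define a vector-valued random variable $\hat{L}(\cdot)$ with components indexed by $\theta$, and then use a sample of $\hat{L}(\cdot)$ to estimate $\arg\min_\theta \mathbb{E}_{q_\theta}(G)$. In particular, to apply naive MCO to estimate $\arg\min_\theta \mathbb{E}_{q_\theta}(G)$, we first i.i.d. sample a density function $h(x)$. By evaluating the associated values of $G(x)$ we get a data set

$$D \equiv \{(x^{(i)}, G(x^{(i)})),\; i = 1, \ldots, N\}.$$

The associated estimates of $L(\theta) = \mathbb{E}_{q_\theta}(G)$ for each $\theta$ are

$$\hat{L}_D(\theta) \equiv \frac{1}{N} \sum_{i=1}^{N} \frac{q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})} \tag{11}$$

The associated naive MCO estimate of $\arg\min_\theta \mathbb{E}_{q_\theta}(G)$ is

$$\hat{\theta}^\star \equiv \arg\min_\theta \hat{L}_D(\theta).$$

Suppose the set of $q_\theta$ includes all possible density functions over $x$’s. Then the $q_\theta$ minimizing our estimate is a delta function about the $x^{(i)}$ with the lowest associated value of $G(x^{(i)})/h(x^{(i)})$. This is clearly a poor estimate in general; it suffers from ‘data-overfitting’. Proceeding as in PL, one way to address this data-overfitting is to use regularization. In particular, we can use the entropic regularizer, given by the negative of the Shannon entropy $S(q_\theta)$. So we now want to find the minimizer of $\mathbb{E}_{q_\theta}(G) - T\, S(q_\theta)$, where $T$ is the regularization parameter. Equivalently, we can minimize $\beta\, \mathbb{E}_{q_\theta}(G) - S(q_\theta)$, where $\beta = 1/T$. This changes the definition of $\hat{L}_D(\theta)$ from the function given in Eq. 11 to

$$\hat{L}_D(\theta) \equiv \frac{1}{N} \sum_{i=1}^{N} \frac{\beta\, q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})} - S(q_\theta).$$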

Finding the solution to this minimization problem is the focus of the PC approach to blackbox optimization.

#### Solution Methodology

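Unfortunately, it can be difficult to find the $\theta$ globally minimizing this new, regularized estimate $\hat{L}_D(\theta)$ for an arbitrary data set $D$. An alternative is to find a close approximation to that optimal $\theta$. One way to do this is as follows. First, we find the minimizer of

$$\frac{1}{N} \sum_{i=1}^{N} \frac{\beta\, p(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})} - S(p) \tag{12}$$

over the set of all possible distributions $p(x)$ with domain $X$. We then find the $q_\theta$ that has minimal Kullback-Leibler (KL) divergence from this $p$, evaluated over $D$. That serves as our approximation to $\arg\min_\theta \hat{L}_D(\theta)$, and therefore as our estimate of the $\theta$ that minimizes $\mathbb{E}_{q_\theta}(G)$.

The minimizer of Eq. 12 can be found in closed form; over $D$ it is the Boltzmann distribution $p^\beta(x^{(i)}) \propto \exp(-\beta\, G(x^{(i)}))$. The KL divergence in $D$ from this Boltzmann distribution to $q_\theta$ is

$$\mathrm{KL}(p^\beta \,\|\, q_\theta) = \int_X dx\; p^\beta(x) \log \frac{p^\beta(x)}{q_\theta(x)}.$$

The minimizer of this KL divergence is given by

$$\theta^\dagger = \arg\min_\theta \; - \sum_{i=1}^{N} \frac{\exp(-\beta\, G(x^{(i)}))}{h(x^{(i)})} \log q_\theta(x^{(i)}) \tag{13}$$

$\theta^\dagger$ is an approximation to the estimate of the $\theta$ that minimizes $\mathbb{E}_{q_\theta}(G)$ given by the regularized version of naive MCO. Our incorporation of regularization here has the same motivation as it does in PL: to reduce bias plus variance.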
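When the parameterized distribution is a single Gaussian, the KL minimization above reduces to a weighted maximum-likelihood fit, whose solution is the weighted sample mean and covariance. The sketch below assumes weights of the form $\exp(-\beta G(x))/h(x)$ for samples $x$ drawn from a sampling density $h$. The function name `fit_gaussian_to_boltzmann`, the quadratic objective and the uniform sampling box are hypothetical choices made for illustration.

```python
import math
import random

def fit_gaussian_to_boltzmann(xs, g_values, h_values, beta):
    """Weighted Gaussian fit: closed-form minimizer of
    -sum_i s_i * log q(x_i) with s_i = exp(-beta * G(x_i)) / h(x_i)."""
    g_min = min(g_values)  # common shift rescales every weight equally
    s = [math.exp(-beta * (g - g_min)) / h for g, h in zip(g_values, h_values)]
    total = sum(s)
    dim = len(xs[0])
    mean = [sum(w * x[d] for w, x in zip(s, xs)) / total for d in range(dim)]
    cov = [[sum(w * (x[a] - mean[a]) * (x[b] - mean[b]) for w, x in zip(s, xs)) / total
            for b in range(dim)]
           for a in range(dim)]
    return mean, cov

# Hypothetical example: G(x) = |x - (1, 1)|^2 with uniform h on [-3, 3]^2.
rng = random.Random(4)
xs = [(rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)) for _ in range(2000)]
g_values = [(x1 - 1.0) ** 2 + (x2 - 1.0) ** 2 for x1, x2 in xs]
h_values = [1.0 / 36.0] * len(xs)
mean, cov = fit_gaussian_to_boltzmann(xs, g_values, h_values, beta=2.0)
```

For this quadratic objective the Boltzmann distribution is itself Gaussian, centered at $(1, 1)$ with variance $1/(2\beta)$ per coordinate, so the fitted mean and covariance should land close to those values. Larger $\beta$ concentrates the fit on the best samples; smaller $\beta$ keeps it broad.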

#### Log-concave Densities

#### Mixture Models

The single Gaussian is a fairly restrictive class of models. Mixture models can significantly improve flexibility, but at the cost of convexity of the KL distance minimization problem. However, a plethora of techniques from supervised learning, in particular the Expectation Maximization (EM) algorithm, can be applied with minor modifications.

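Suppose $q_\theta$ is a mixture of $M$ Gaussians, that is, $\theta = (\mu, \Sigma, \alpha)$, where $\alpha$ is the mixing p.m.f. We can view the problem as one where a hidden variable $z$ decides which mixture component each sample is drawn from. We then have the optimization problem

$$\text{minimize} \quad -\sum_{i=1}^{N} \frac{\exp(-\beta\, G(x^{(i)}))}{h(x^{(i)})} \log \sum_{m=1}^{M} \alpha_m\, \mathcal{N}(x^{(i)}; \mu_m, \Sigma_m).$$

Following the standard EM procedure, we get the algorithm described in Eq. 14. Since this is a nonconvex problem, one typically runs the algorithm multiple times with random initializations of the parameters.

$$\begin{aligned}
\text{E-step:} \quad & Q_m^{(i)} = \frac{\alpha_m\, \mathcal{N}(x^{(i)}; \mu_m, \Sigma_m)}{\sum_{m'} \alpha_{m'}\, \mathcal{N}(x^{(i)}; \mu_{m'}, \Sigma_{m'})}, \qquad s^{(i)} = \frac{\exp(-\beta\, G(x^{(i)}))}{h(x^{(i)})}, \\
\text{M-step:} \quad & \mu_m = \frac{\sum_i s^{(i)} Q_m^{(i)} x^{(i)}}{\sum_i s^{(i)} Q_m^{(i)}}, \quad
\Sigma_m = \frac{\sum_i s^{(i)} Q_m^{(i)} (x^{(i)} - \mu_m)(x^{(i)} - \mu_m)^{\top}}{\sum_i s^{(i)} Q_m^{(i)}}, \quad
\alpha_m = \frac{\sum_i s^{(i)} Q_m^{(i)}}{\sum_i s^{(i)}}
\end{aligned} \tag{14}$$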

#### Test Problems

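To compare the performance of this algorithm with and without the use of PL techniques, we use a couple of very simple academic problems in two and four dimensions: the Rosenbrock function in two dimensions, given by

$$G_R(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2,$$

and the Woods function in four dimensions, given by

$$G_W(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2 + 90\,(x_4 - x_3^2)^2 + (1 - x_3)^2 + 10.1\,[(1 - x_2)^2 + (1 - x_4)^2] + 19.8\,(1 - x_2)(1 - x_4).$$

For the Rosenbrock, the optimum value of 0 is achieved at $x = (1, 1)$, and for the Woods problem, the optimum value of 0 is achieved at $x = (1, 1, 1, 1)$.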
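For reference, the two test problems can be written as plain functions. The sketch below assumes the standard textbook forms of the two-dimensional Rosenbrock function and the four-dimensional Woods function; the function names are our own.

```python
def rosenbrock(x):
    """Two-dimensional Rosenbrock function; minimum value 0 at (1, 1)."""
    x1, x2 = x
    return 100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2

def woods(x):
    """Four-dimensional Woods function; minimum value 0 at (1, 1, 1, 1)."""
    x1, x2, x3, x4 = x
    return (100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2
            + 90.0 * (x4 - x3 ** 2) ** 2 + (1.0 - x3) ** 2
            + 10.1 * ((1.0 - x2) ** 2 + (1.0 - x4) ** 2)
            + 19.8 * (1.0 - x2) * (1.0 - x4))
```

Both functions have a narrow curved valley leading to the optimum, which is what makes them awkward for a single Gaussian sampling distribution.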

#### Application of PL Techniques

As mentioned above, there are many PL techniques beyond regularization that are designed to optimize the trade-off between bias and variance. So having cast the solution of $\arg\min_\theta \mathbb{E}_{q_\theta}(G)$ as an MCO problem, we can apply those other PL techniques instead of (or in addition to) entropic regularization. This should improve the performance of our MCO algorithm, for the exact same reason that using those techniques to trade off bias and variance improves performance in PL. We briefly mention some of those alternative techniques here.

The overall MCO algorithm is broadly described in Alg. 1. For the Woods problem, 20 samples of $x$ are drawn from the updated $q_\theta$ at each iteration, and for the Rosenbrock, 10 samples. For comparing various methods and plotting purposes, 1000 samples of $x$ are drawn to evaluate $\mathbb{E}_{q_\theta}(G)$. Note that in an actual optimization, we would not draw these test samples. All the performance results in Fig. 1 are based on 50 runs of the PC algorithm, randomly initialized each time. The sample mean performance across these runs is plotted along with 95% confidence intervals for this sample mean (shaded regions).

Cross-validation for Regularization: We note that we are using regularization to reduce variance, but that regularization introduces bias. As is done in PL, we use standard $k$-fold cross-validation to trade off this bias and variance. We do this by partitioning the data into $k$ disjoint sets. The held-out data for the $i$-th fold is just the $i$-th partition, and the held-in data is the union of all other partitions. First, we ‘train’ the regularized algorithm on the held-in data to get an optimal set of parameters $\theta^\star$, then ‘test’ this $\theta^\star$ by considering unregularized performance on the held-out data. In our context, ‘training’ refers to finding optimal parameters by KL distance minimization using the held-in data, and ‘testing’ refers to estimating $\mathbb{E}_{q_{\theta^\star}}(G)$ on the held-out data using the following formula [Robert and Casella, 2004].

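[Figure 1, panels a–h: results of applying PL techniques to the PC algorithm on the test problems. The individual panels are described in the text.]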

We do this for several values of the regularization parameter $\beta$ in a given interval, and choose the one that yields the best held-out performance, averaged over all folds. For our experiments, we use 5 equally-spaced values in this interval. Having found the best regularization parameter in this range, we then use all the data to minimize KL distance using this optimal value of $\beta$. Note that all cross-validation is done without any additional evaluations of $G$.

Cross-validation for $\beta$ in PC is similar to optimizing the annealing schedule in simulated annealing. This ‘auto-annealing’ is seen in Fig. 1.a, which shows the variation of $\beta$ with iterations of the Rosenbrock problem. It can be seen that the value of $\beta$ sometimes decreases from one iteration to the next. This can never happen in any kind of ‘geometric annealing schedule’, $\beta \leftarrow k_\beta\, \beta$ with $k_\beta > 1$, of the sort that is often used in algorithms in the literature. In fact, we ran 50 trials of this algorithm on the Rosenbrock and then computed a best-fit geometric variation for $\beta$, that is, a nonlinear least squares fit to the variation of $\beta$, and a linear least squares fit to the variation of $\log \beta$. These are shown in Figs. 1.c and 1.d. As can be seen, neither is a very good fit. We then ran 50 trials of the algorithm with the fixed update rule obtained from the best fit, and found that the adaptive setting of $\beta$ using cross-validation performed an order of magnitude better, as shown in Fig. 1.e.
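The cross-validation loop for the regularization parameter can be sketched as follows. This is a simplified one-dimensional illustration under several assumptions of our own: the parameterized distribution is a single Gaussian fitted with weights $\exp(-\beta G(x))/h(x)$, the held-out estimate of the expected objective uses a self-normalized importance-sampling average (one common choice, assumed here), and the data, candidate values and function names are hypothetical.

```python
import math
import random

def fit_gaussian(rows, beta):
    """Weighted 1-D Gaussian fit with weights exp(-beta * G) / h.
    Each row is a tuple (x, G(x), h(x))."""
    g_min = min(g for _, g, _ in rows)  # common shift; leaves the fit unchanged
    s = [math.exp(-beta * (g - g_min)) / h for _, g, h in rows]
    total = sum(s)
    mu = sum(w * x for w, (x, _, _) in zip(s, rows)) / total
    var = sum(w * (x - mu) ** 2 for w, (x, _, _) in zip(s, rows)) / total
    return mu, max(var, 1e-6)  # floor keeps the density well defined

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def held_out_score(rows, mu, var):
    """Self-normalized importance-sampling estimate of E_q[G] (assumed form)."""
    ratios = [gaussian_pdf(x, mu, var) / h for x, _, h in rows]
    norm = sum(ratios)
    if norm == 0.0:
        return math.inf  # q puts no mass on the held-out points
    return sum(r * g for r, (_, g, _) in zip(ratios, rows)) / norm

def choose_beta(rows, betas, k=5):
    """Return the beta with the lowest held-out score averaged over k folds."""
    folds = [rows[i::k] for i in range(k)]

    def cv_score(beta):
        total = 0.0
        for i, held_out in enumerate(folds):
            held_in = [r for j, fold in enumerate(folds) if j != i for r in fold]
            mu, var = fit_gaussian(held_in, beta)
            total += held_out_score(held_out, mu, var)
        return total / k

    return min(betas, key=cv_score)

# Hypothetical data: G(x) = (x - 1)^2 sampled under a uniform h on [-3, 3].
rng = random.Random(5)
rows = [(x, (x - 1.0) ** 2, 1.0 / 6.0)
        for x in (rng.uniform(-3.0, 3.0) for _ in range(100))]
betas = [0.5, 1.0, 2.0, 4.0, 8.0]
best_beta = choose_beta(rows, betas)
```

No new evaluations of the objective are needed: every fold reuses the values already stored in `rows`. After selecting `best_beta`, one would refit on all of the data with that value.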

Cross-validation for Model Selection: Given a set $\Theta$ (sometimes called a model class) to choose $\theta$ from, we can find an optimal $\theta \in \Theta$. But how do we choose the set $\Theta$? In PL, this is done using cross-validation. We choose that set $\Theta$ such that the optimal $\theta \in \Theta$ has the best held-out performance. As before, we use that model class $\Theta$ that yields the lowest estimate of $\mathbb{E}_{q_\theta}(G)$ on the held-out data. We demonstrate the use of this PL technique for minimizing the Rosenbrock problem, which has a long curved valley that is poorly approximated by a single Gaussian. We use cross-validation to choose among Gaussian mixtures with up to 4 components. The improvement in performance is shown in Fig. 1.d.

Bagging: In bagging [Breiman, 1996a], we generate multiple data sets by resampling the given data set with replacement. These new data sets will, in general, contain replicates. We ‘train’ the learning algorithm on each of these resampled data sets, and average the results. In our case, we average the $q_\theta$ obtained by our KL divergence minimization on each data set. PC works even on stochastic objective functions. On the noisy Rosenbrock, we implemented PC with bagging by resampling 10 times, and obtained significant performance gains, as seen in Fig. 1.g.
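A sketch of bagging in this setting follows: the data set is resampled with replacement, a distribution is fitted to each resample, and the fitted densities are averaged with equal weight. As an assumption for illustration, each fit is a one-dimensional Gaussian with weights $\exp(-\beta G(x))/h(x)$, and the data and function names are hypothetical.

```python
import math
import random

def fit_gaussian(rows, beta):
    """Weighted 1-D Gaussian fit with weights exp(-beta * G) / h.
    Each row is a tuple (x, G(x), h(x))."""
    g_min = min(g for _, g, _ in rows)  # common shift; leaves the fit unchanged
    s = [math.exp(-beta * (g - g_min)) / h for _, g, h in rows]
    total = sum(s)
    mu = sum(w * x for w, (x, _, _) in zip(s, rows)) / total
    var = sum(w * (x - mu) ** 2 for w, (x, _, _) in zip(s, rows)) / total
    return mu, max(var, 1e-6)  # floor keeps the density well defined

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bagged_density(rows, beta, n_resamples=10, seed=0):
    """Fit one Gaussian per bootstrap resample and return the averaged
    density together with the individual (mean, variance) fits."""
    rng = random.Random(seed)
    fits = [fit_gaussian(rng.choices(rows, k=len(rows)), beta)
            for _ in range(n_resamples)]

    def density(x):
        return sum(gaussian_pdf(x, mu, var) for mu, var in fits) / len(fits)

    return density, fits

# Hypothetical data: G(x) = (x - 1)^2 sampled under a uniform h on [-3, 3].
rng = random.Random(6)
rows = [(x, (x - 1.0) ** 2, 1.0 / 6.0)
        for x in (rng.uniform(-3.0, 3.0) for _ in range(100))]
q_bagged, fits = bagged_density(rows, beta=2.0)
```

The averaged density is an equal-weight mixture of the individual fits, so it remains a valid density and can be sampled by first picking one fit uniformly at random and then sampling from it.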

Stacking: In bagging, we combine estimates of the same learning algorithm on different data sets generated by resampling, whereas in stacking [Breiman, 1996b; Smyth and Wolpert, 1999], we combine estimates of different learning algorithms on the same data set. These combined estimates are often better than any of the single estimates. In our case, we combine the $q_\theta$ obtained from our KL divergence minimization algorithm using multiple models $\Theta$. Again, Fig. 1.h shows that cross-validation for model selection performs better than a single model, and stacking performs slightly better than cross-validation.

## Conclusions

The conventional goal of reducing bias plus variance has interesting applications in a variety of fields. In straightforward applications, the bias-variance trade-offs can decrease the MSE of estimators, reduce the generalization error of learning algorithms, and so on. In this article, we described a novel application of bias-variance trade-offs: we placed bias-variance trade-offs in the context of Monte Carlo Optimization, and discussed the need for higher moments in the trade-off, such as a bias-variance-covariance trade-off. We also showed a way of applying just a bias-variance trade-off, as used in Parametric Learning, to improve the performance of Monte Carlo Optimization algorithms.

## References

- Wolpert [1997] D. H. Wolpert. On bias plus variance. Neural Computation, 9:1211–1244, 1997.
- Wolpert and Rajnarayan [2007] D. H. Wolpert and D. Rajnarayan. Parametric learning and Monte Carlo optimization. Available at http://arxiv.org/abs/0704.1274, 2007.
- Wolpert et al. [2006] D. H. Wolpert, C. E. M. Strauss, and D. Rajnarayan. Advances in distributed optimization using probability collectives. Advances in Complex Systems, 9(4):383–436, 2006.
- Wolpert and Bieniawski [2004a] D. H. Wolpert and S. Bieniawski. Distributed control by lagrangian steepest descent. In Proc. of the IEEE Control and Decision Conf., pages 1562–1567, 2004a.
- Wolpert and Bieniawski [2004b] D. H. Wolpert and S. Bieniawski. Adaptive distributed control: beyond single-instant categorical variables. In A. Skowron et al, editor, Proceedings of MSRAS04. Springer Verlag, 2004b.
- Macready and Wolpert [2005] William Macready and David H. Wolpert. Distributed constrained optimization with semicoordinate transformations. submitted to Journal of Operations Research, 2005.
- Robert and Casella [2004] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 2004.
- Lepage [1978] G. P. Lepage. A new algorithm for adaptive multidimensional integration. Journal of Computational Physics, 27:192–203, 1978.
- Ermoliev and Norkin [1998] Y. M. Ermoliev and V. I. Norkin. Monte Carlo optimization and path dependent nonstationary laws of large numbers. Technical Report IR-98-009, International Institute for Applied Systems Analysis, March 1998.
- Angluin [1992] D. Angluin. Computational learning theory: Survey and selected bibliography. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Theory of Computing, May 1992.
- Vapnik [1982] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982.
- Vapnik [1995] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
- Buntine and Weigend [1991] W. Buntine and A. Weigend. Bayesian back-propagation. Complex Systems, 5:603–643, 1991.
- Berger [1985] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1985.
- Mackay [2003] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
- Rubinstein and Kroese [2004] R. Rubinstein and D. Kroese. The Cross-Entropy Method. Springer, 2004.
- Breiman [1996a] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996a.
- Breiman [1996b] L. Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996b.
- Smyth and Wolpert [1999] P. Smyth and D. Wolpert. Linearly combining density estimators via stacking. Machine Learning, 36(1-2):59–83, 1999.