# On reducing sampling variance in covariate shift

using control variates.

###### Abstract

Covariate shift classification problems can in principle be tackled by importance-weighting training samples. However, the sampling variance of the risk estimator is often scaled up dramatically by the weights. This means that during cross-validation - when the importance-weighted risk is repeatedly evaluated - suboptimal hyperparameter estimates are produced. We study the sampling variances of the importance-weighted versus the oracle estimator as a function of the relative scale of the training data. We show that introducing a control variate can reduce the variance of the importance-weighted risk estimator, which leads to superior regularization parameter estimates when the training data is much smaller in scale than the test data.

###### keywords:

Covariate Shift, Importance-Weighting, Sampling Variance, Control Variate, Cross-validation.^{†}

^{†}journal: Pattern Recognition Letters

## 1 Introduction

In many real-world classification problems, training data is not gathered in a completely unbiased manner. An *unbiased* sample refers to events that were observed according to their true probabilities, whereas a *biased* sample refers to events that were observed more/less frequently (heckman1977sample, ; wooldridge2010econometric, ; moreno2012unifying, ). For example, clinical data collected from a single hospital will be biased because the local patient population deviates from the global patient population. Consequently, a computer-aided diagnosis system trained on data from that hospital will not generalize well to hospitals in other countries. Unfortunately, collecting completely unbiased data can be extremely difficult. Instead, we are interested in statistical models that correct for biased samplings and generalize to target populations (zadrozny2004learning, ; bickel2007discriminative, ; cortes2008sample, ; quionero2009dataset, ; ben2010theory, ).
In particular, we propose an adjusted correction procedure that will aid hyperparameter optimization.

In classification settings, bias corrections are often performed based on individual sample probabilities: each sample is weighed by a factor that matches its current probability to the probability of encountering it in the target population. For example, if a particular event occurs very frequently in the training set but rarely in the target population, then it is not deemed important. Vice versa, if it occurs very rarely in the training set but often in the target population, then it is deemed important. As such, this correction is known as importance weighting (sugiyama2007covariate, ). Sample importance originates from Monte Carlo (MC) simulation, where it is used to draw samples from rare yet interesting regions of a distribution (helton2006survey, ; cochran2007sampling, ). The main difference between importance sampling in Monte Carlo simulation and importance weighing in a classification setting is that in the former case the importance sampling distribution is *designed*, whereas, in the latter case, it is *fixed*; it consists of the already collected biased training data. Although importance weighting can be very useful in controlling for biases in data, there are also a number of practical problems. The predominant one is weight bimodality: a small number of samples are assigned a very large weight while the remainder is assigned a near-zero weight. Essentially, only a handful of samples are deemed important, which greatly reduces effective sample size (mcbook, ).

We focus on cross-validation in the face of biased data. More specifically, we consider the example of selecting a regularization parameter for a least-squares classifier (scholkopf2002learning, ). If the collected training data were unbiased, a classifier can be evaluated by holding out a portion of the training data, training on the remainder and validating on the held-out set. Splitting the dataset into parts where each is hold out once, is called -fold cross-validation (efron1994introduction, ; kohavi1995study, ). By repeating this procedure for different values of hyperparameters, such as regularization parameters, the parameter can be selected that generalizes best to unseen data. However, since the training data is biased, the hyperparameter estimate that is obtained through cross-validation will not be optimal with respect to the whole population (sugiyama2005model, ; kouw2016regularization, ). It is essentially over-fitted to the biased training data (cawley2010over, ). One could correct for the discrepancy caused by the biased data by assigning importance-weights to the validation set (sugiyama2007covariate, ). However, the weight variance scales the sampling variance of the cross-validation estimator, which affects its ability to select the optimal hyperparameter (markatou2005analysis, ). This paper proposes an adjustment to the importance-weighted cross-validation estimator that counteracts the increase in sampling variance due to the importance weights.

The sampling variance of an estimator describes the variation in its estimates for different datasets. This sampling variance depends on the size of the sample: as the estimator gets more samples, it will return more accurate estimates. However, the size of a given dataset is often fixed. Instead of increasing the number of samples in order to obtain a more accurate estimate, it is possible to directly reduce the sampling variance of the estimator (kahn1953methods, ). In fact, there are many variance reduction techniques that were designed to make MC simulation more efficient and practical (cochran2007sampling, ; mcbook, ). These techniques incorporate additional information on the data distribution. For example, with antithetic variates one has the knowledge that the data-generating distribution is symmetric around some point. This knowledge can be exploited by mirroring the existing samples and augmenting the dataset (hammersley1956general, ). Alternatively, a control variate consists of a function that is known to correlate strongly with the estimand. By subtracting a value from the estimand when the control variate rises and adding a value when the control variate shrinks, one reduces the estimator’s deviation from its mean. It essentially returns more accurate estimates using the same dataset (nelson1987control, ).

We show how we can use control variates to reduce the sampling variance of importance-weighted cross-validation (see Section 4). For the correlating function, we chose the importance weights themselves. Instead of scaling up the sampling variance of the estimator whenever the weight variance increases, it now helps us to perform more accurate estimations. Furthermore, we show that this improved risk estimator can be used to evaluate classifiers and leads to better hyperparameters when employed in cross-validation (see Section 5). In the next section we first introduce the problem setting, known as covariate shift (moreno2012unifying, ), in more detail (see Section 2).

## 2 Covariate shift

In this section, we introduce some concepts and notation, followed by an explanation of covariate shift along with an example that will be used throughout the paper.

### 2.1 Notation and Initial Problem Setting

Biased training data that stems from local sampling and unbiased test data that stems from global sampling can be described as different *domains*. A domain in this context is defined as the combination of an input space , an output space and an associated probability distribution . Given two domains, we call them different if they are different in at least one of their constituent components, i.e., the input space, the output space, or the probability density function. We focus on the case where only the probability distributions differ. Inputs remain the same, namely the -dimensional real space and outputs stay the same as well, namely the classes . We denote the source domain as and will refer to it as . The target domain is denoted with the shorthand . The challenge is to use information from the source domain to generalize to the target domain.

Domain-specific functions will be marked with the subscript or as well, for example . With some abuse of notation for the sake of clarity, we will mark marginal and conditional distributions with and as well: for the target joint distribution, for the target data marginal distribution and for the target class-conditional distribution. The source data is denoted as the set . Note that refers to an element of the input space , while refers to a specific observation drawn from the source distribution, . Likewise, the target domain data consists of the set .

### 2.2 Specifics of Covariate Shift and Example Setting

Covariate shift refers to the case where the class-posterior distributions remain equal, . Furthermore, it is assumed that the class-priors are equal in both domains as well, . It is therefore called covariate shift because only the covariates - the marginal data distributions - have shifted; .

Throughout the paper, we will use a running example of a basic covariate shift setting to illustrate several concepts: the target data distribution is set to be a normal distribution with mean and standard deviation , , its priors are set equal , and its class-posterior distribution is set to a cumulative normal distribution with mean and standard deviation , . As the goal is to create a covariate shift setting, the target’s class-posterior distribution is set to be equal to the source’s: . The source’s priors are set to be equal as well, , but its data marginal distribution is set to be a normal distribution with mean and standard deviation , . controls the scale of the source domain. deviates further away from (the target domain’s scale in this example setting), the domain dissimilarity increases.

Figure 1 plots the distributions of the example; the left column corresponds to the source domain and the right column to the target domain. The top row corresponds to the data distributions , the middle row to the class-posteriors and the bottom row to the class-conditional distributions . Red and magenta represent the negative class, while blue and cyan represent the positive class. For this figure, we visualized the case of .

## 3 Importance-weighting

The empirical risk minimization framework describes a classifiers performance by its expected loss. The risk function integrates the loss of the classifiers parameters over the joint distribution and is hence domain-specific. We are interested in generalizing to the target domain, which is another way of saying that we are interested in the classifier that minimizes the target risk :

This integral is an expected value, , which can be estimated through the sample average of data drawn from the target domain:

(1) |

We will refer to estimators with a symbol. By the law of large numbers, the estimated value will converge to the true target risk: for all (wasserman2013all, ).

Note that target labels are required for this risk estimator. Unfortunately, these are not available in a covariate shift problem setting. Consequently, we are interested in estimators of the target risk that do *not* depend on the target labels. One of the most popular ones is the importance-weighted risk estimator. It starts by multiplying and dividing the target distribution with the source distribution as follows:

Under the assumption that the class-posterior distributions are equivalent, , the importance-weighted risk simplifies to (shimodaira2000improving, ):

For this risk we can again formulate an estimator based on the sample average. Except this time, data from the source domain is used:

(2) |

Note that this estimator does not depend on target labels .

We will abbreviate the ratio of probability distributions through . Equation 2 already shows why importance-weighting can be problematic: over a small probability equals a very large weight. In the example setting laid out in Section 2.2, the weight function can be derived: . In this case, the importance weights are an exponential function of the domain dissimilarity and can become very large, very quickly. In particular, if we take the variance of the weights with respect to the source distribution, , then we can identify two scenario’s: for the variance rises slowly, while for the variance diverges to infinity as approaches (see Figure 2). The former scenario corresponds to the case where the source domain is larger in scale and the goals is to generalize to a particular subset. The latter scenario corresponds to the case where the source domain is smaller in scale and the goal is to generalize to a larger population. Based on the weight variance, it seems that the latter case is far less feasible than the former.

### 3.1 Sampling variances

Although the expected values of and are the same, they behave differently for finite sample sizes. It is much more difficult to estimate the target risk using samples from another domain; estimates tend to vary much more than ’s ones for a fixed sample size. The variance of an estimator with respect to its samples is known as the *sampling variance* (not to be confused with sample variance, which is the variance between samples in a set). In the following, we will compare the sampling variance of versus that of . The sampling variance with respect to a set of samples consists of the average squared deviation of the estimator from its true risk:

(3) |

Using the fact that samples are drawn independently and are identically distributed (iid), (3) can be simplified by pulling the sum over samples outside of the expectation:

(4) |

The sampling variance with respect to a single sample, , will be referred to as .

The source data is drawn iid as well. That means that the same simplification holds for the importance-weighted risk estimator:

(5) |

Similar to before, the sampling variance with respect to a single sample will be referred to as .

Expanding the squares in (4) leads to and expanding (5) leads to . Note that and that the only difference between and is the addition of the importance weights. Thus, the weights directly scale the sampling variance of the estimator. So, even though and are estimators of the same risk, the fact that is based on data from another distribution makes it a less accurate estimator.

Figure 2(a) computes the estimators for the running example using a least-squares classifier: with (friedman2001elements, ). Since the probability distributions are known, the Bayes optimal classifier for the source domain can be derived: . This was used to compute the risk estimates. The figure shows a learning curve of repetitions of the estimated risk as a function of the size of the validation set ( and for the target and the importance-weighted risk estimators respectively). A source standard deviation of was chosen for this visualization. Note that the importance-weighted risk varies much more than the target risk.

Figure 2(b) displays the sampling variance of and as a function of the domain dissimilarity. For , the domains are the same and the sampling variances are equal. For , the sampling variance of drops off, while the sampling variance of slowly increases. For , the ’s variance remains relatively steady, while the ’s variance diverges to infinity at . The shape of this curve reflects the influence of the variance of the importance weights, as shown in Figure 2.

## 4 Reducing sampling variance with a control variate

The increased sampling variance of the importance-weighted risk estimator is problematic for procedures that rely on accurate estimates of the target risk. One such procedure is cross-validation, which we discuss in Section 5. Our goal is to reduce the sampling variance of . To that end, we will introduce a control variate (nelson1987control, ; mcbook, ). A control variate is a function that correlates with the estimator and whose expected value is known: . These two properties mean that it essentially contains additional information on the function of interest, which can be used to reduce sampling variance. Whenever the correlating function’s value rises above its expected value, , so does the risk estimator’s value rise above its expected value (the true risk), . By subtracting the control variate from the risk estimate, , the estimator’s deviation from the true risk is reduced. Hence, its variance is reduced. It is however important that the control variate is appropriately scaled, as subtracting a too large value can increase the sampling variance as well. A parameter is used to control the scaling: .

We chose the importance-weights themselves as the control variate, since their expected value is known: . Adding the weight-based control variate to the importance-weighted risk, forms the following estimator:

Note that the added term does not bias the overall estimator: the expected value of the control variate for all values of . This means that the expected value of the controlled estimator is the same as that of the importance-weighted estimator: .

### 4.1 Sampling variance of the controlled estimator

The effect of the control variate on the sampling variance of the importance-weighted risk estimator can be described exactly (nelson1987control, ):

(6) |

The stands for the covariance, in this case between the weighted loss and the weights themselves. The scale parameter of the control variate can be optimized to minimize the overall sampling variance of the estimator:

where is the minimizer. Plugging back in to (6) simplifies the sampling variance to:

(7) |

Considering that both the squared covariance term and the variance term are non-negative, the sampling variance of a controlled estimator is never larger than that of the standard estimator (owen2000safe, ). In particular, multiplying with , allows (7) to be written as:

where denotes the correlation between the weighted loss (the estimand), and the weights (the control variate). Essentially, the more the weights correlate - positively or negatively - with the weighted loss, the larger the reduction in variance.

Computing is not possible without knowledge of the probability distributions, but it can be estimated from data:

Figure 3(a) provides an illustration similar to the one of Figure 2(a)), but adds the estimated risk of the controlled importance-weighted estimator. This is still the case of , for which ’s sampling variance is much smaller than that of . Similarly, Figure 3(b) is the equivalent of Figure 2(b), which plots the sampling variance for the three estimators , , and . For , the sampling variance of the controlled estimator reduces to roughly the same level as the original target risk estimator . For , also diverges at , but rises much more slowly.

## 5 Cross-validation

Accurate estimation of the target risk is important for cross-validation, which is, in turn, important for hyperparameter optimization. In this case, it is used to find an optimal regularization parameter. In order to account for the covariate shift, the validation data is importance-weighted (sugiyama2007covariate, ). However, as Section 3.1 has shown, weighting can increase the sampling variance, making the cross-validation estimator less accurate. Fortunately, the control variate can counteract this negative influence. The following subsections describe an experiment that compares the importance-weighted versus the controlled importance-weighted risk estimators in a cross-validation context.

### 5.1 Experimental setup

In the following experiments, the source set is split up into folds, each of which is held out once for validation. Training samples are marked as and validation samples are marked as , where the sets and together make up the total index set of the source samples . For the classifier, we employ a kernelized version of a regularized least-squares classifier (scholkopf2002learning, ). So for risk evaluation, we evaluate the mean squared error (MSE): . In particular, a quadratic polynomial kernel is taken: , with . Note that the classifier’s parameters are dependent on the regularization parameter . Predictions are made by applying the kernel to new samples and taking the inner product with the classifier parameters: .

The true data marginal distributions are not known in practice. In most cases it is also not known to which family of distributions the data marginals belong to. As such, we opt for a nonparametric approach. Both the source and target distributions are estimated with a kernel density estimator (parzen1962estimation, ). A normal kernel was used, with its bandwidth set through Silverman’s rule of thumb (silverman1986density, ). After estimation, the ratio of distributions is taken to compute the importance weights: .

We compare the following risk estimators:

corresponds to validating on source data without compensating for the covariate shift, compensates with the importance weights, uses the control variate, and corresponds to the oracle case, i.e., validating on labeled target samples. refers to the cardinality of the validation set. We start with a set of regularization parameter values, ranging from to . The risk estimators are used to select the for which the risk is minimal. This selected parameter is then used to train a classifier on all the source data and is evaluated using the target risk estimator.

### 5.2 Data

The ionosphere dataset from the UCI machine learning repository was used. To allow for visualization, the dimensionality of the data was reduced to using principal components analysis. To simulate a covariate shift setting, we perform a biased sampling: a normal distribution is placed at the center with a standard deviation of times the covariance matrix of the whole set. Each sample from the ionosphere dataset is evaluated under this distribution and the resulting probabilities are used to draw - without replacement - a subset of samples. is chosen from a logarithmic range between and , which represents a very local, biased sampling to a nearly uniform, unbiased sampling. Figure 5 shows scatter plots of samples selected as part of the source domain (top) and the remainder as part of the target domain (bottom), for .

### 5.3 Results

Figure 5(a) plots the estimated regularization parameters as a function of the scale of the source domain’s sampling distribution. The value of the optimal regularization parameter tends to be quite large, but decreases from onwards. and differ much more in the regime . From onwards, the source domain covers the dataset so well, that all samples evaluate to nearly the same probability under the source domain’s sampling distribution. Hence, the selected data is an unbiased sample and there is no covariate shift. Figure 5(b) shows the risk of the estimated regularization parameter and indicates that the large differences between estimated regularization parameters cause large differences in the resulting risks. Conversely, no difference in causes no difference in risks, from onwards. The improvement from over is largest where the domains are the most different, as is the improvement of over . Overall, always leads to superior or equal estimates compared to .

## 6 Conclusion

We presented a study of the sampling variance of the importance-weighted risk estimator as compared to the target risk estimator in the context of covariate shift. We showed that the sampling variance can increase substantially as a function of the scale of the source domain, leading to a far less accurate estimator for a given sample size. Furthermore, we introduced a control variate to reduce the sampling variance of the importance-weighted risk estimator. This reduction is beneficial for hyperparameter optimization in cases where the sampling variance becomes problematic. As it is never detrimental, the controlled importance-weighted risk estimator is the preferred choice.

In this work, only the additive control variate has been studied. Multiplicative control variates or more complex functions applied to the additive control variate have the potential to increase its correlation with the estimand, thus decreasing the sampling variance of the estimator even further. However, it is hard to predict whether a more complex control variate will be useful.

## Acknowledgments

This work was supported by the Netherlands Organization for Scientific Research (NWO; grant 612.001.301).

## References

- (1) J. J. Heckman, Sample selection bias as a specification error (with an application to the estimation of labor supply functions) (1977).
- (2) J. M. Wooldridge, Econometric analysis of cross section and panel data, MIT press, 2010.
- (3) J. G. Moreno-Torres, T. Raeder, R. Alaiz-RodríGuez, N. V. Chawla, F. Herrera, A unifying view on dataset shift in classification, Pattern Recognition 45 (1) (2012) 521–530.
- (4) B. Zadrozny, Learning and evaluating classifiers under sample selection bias, in: Proceedings of the twenty-first international conference on Machine learning, ACM, 2004, p. 114.
- (5) S. Bickel, M. Brückner, T. Scheffer, Discriminative learning for differing training and test distributions, in: Proceedings of the 24th international conference on Machine learning, ACM, 2007, pp. 81–88.
- (6) C. Cortes, M. Mohri, M. Riley, A. Rostamizadeh, Sample selection bias correction theory, in: Algorithmic learning theory, 2008, pp. 38–53.
- (7) J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, N. D. Lawrence, Dataset shift in machine learning, The MIT Press, 2009.
- (8) S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. W. Vaughan, A theory of learning from different domains, Machine learning 79 (1-2) (2010) 151–175.
- (9) M. Sugiyama, M. Krauledat, K.-R. Müller, Covariate shift adaptation by importance weighted cross validation, The Journal of Machine Learning Research 8 (2007) 985–1005.
- (10) J. C. Helton, J. D. Johnson, C. J. Sallaberry, C. B. Storlie, Survey of sampling-based methods for uncertainty and sensitivity analysis, Reliability Engineering & System Safety 91 (10) (2006) 1175–1209.
- (11) W. G. Cochran, Sampling techniques, John Wiley & Sons, 2007.
- (12) A. B. Owen, Monte Carlo theory, methods and examples, 2013.
- (13) B. Schölkopf, A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT press, 2002.
- (14) B. Efron, R. J. Tibshirani, An introduction to the bootstrap, CRC press, 1994.
- (15) R. Kohavi, et al., A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Ijcai, Vol. 14, Stanford, CA, 1995, pp. 1137–1145.
- (16) M. Sugiyama, K.-R. Müller, Model selection under covariate shift, in: Artificial Neural Networks: Formal Models and Their Applications–ICANN 2005, Springer, 2005, pp. 235–240.
- (17) W. M. Kouw, M. Loog, On regularization parameter estimation under covariate shift, in: International Conference on Pattern Recognition, 2016.
- (18) G. C. Cawley, N. L. Talbot, On over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research 11 (Jul) (2010) 2079–2107.
- (19) M. Markatou, H. Tian, S. Biswas, G. M. Hripcsak, Analysis of variance of cross-validation estimators of the generalization error, Journal of Machine Learning Research 6 (2005) 1127–1168.
- (20) H. Kahn, A. W. Marshall, Methods of reducing sample size in monte carlo computations, Journal of the Operations Research Society of America 1 (5) (1953) 263–278.
- (21) J. M. Hammersley, J. Mauldon, General principles of antithetic variates, in: Mathematical proceedings of the Cambridge philosophical society, Vol. 52, Cambridge University Press, 1956, pp. 476–481.
- (22) B. L. Nelson, On control variate estimators, Computers & Operations Research 14 (3) (1987) 219–225.
- (23) L. Wasserman, All of statistics: a concise course in statistical inference, Springer Science & Business Media, 2013.
- (24) H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, Journal of statistical planning and inference 90 (2) (2000) 227–244.
- (25) J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning, Vol. 1, Springer series in statistics Springer, Berlin, 2001.
- (26) A. Owen, Y. Zhou, Safe and effective importance sampling, Journal of the American Statistical Association 95 (449) (2000) 135–143.
- (27) E. Parzen, On estimation of a probability density function and mode, The annals of mathematical statistics 33 (3) (1962) 1065–1076.
- (28) B. W. Silverman, Density estimation for statistics and data analysis, Vol. 26, CRC press, 1986.