Local Uncertainty Sampling for Large-Scale Multi-Class
Logistic Regression
Abstract
A major challenge for building statistical models in the big data era is that the available data volume may exceed the computational capability. A common approach to solve this problem is to employ a subsampled dataset that can be handled by the available computational resources. In this paper, we propose a general subsampling scheme for large-scale multi-class logistic regression, and examine the variance of the resulting estimator. We show that asymptotically, the proposed method always achieves a smaller variance than that of uniform random sampling. Moreover, when the classes are conditionally imbalanced, significant improvement over uniform sampling can be achieved. Empirical performance of the proposed method is compared to other methods on both simulated and real-world datasets, and these results match and confirm our theoretical analysis.
1 Introduction
In recent years, data volume has grown exponentially, which has created demand for building statistical models with huge datasets. A major challenge is that the size of these datasets may exceed the available computational capability. For example, when the dataset is large, it may become infeasible to perform standard statistical procedures on a single machine. Although one remedy is to develop sophisticated distributed computing systems that can directly handle big data, the increased system complexity makes this approach unsuitable for some scenarios. Another remedy is to employ a subsampled dataset that can be handled by the existing computational resources. This approach is widely applicable; however, since fewer data are used due to subsampling, statistical accuracy is lost. Therefore, a natural question is how to trade off computational efficiency and statistical accuracy by designing an effective sampling scheme that minimizes the reduction of statistical accuracy given a certain computational capacity.
In this paper, we examine the subsampling approach for solving big data multi-class logistic regression problems that are common in practical applications. The general idea of subsampling is to assign an accept-probability to each data point and select observations according to the assigned probabilities. After subsampling, only a small portion of the data is extracted from the full dataset, which means that the model built on the subsampled data will not be as accurate as that built on the full data. The required computational resource can be measured by the number of subsampled data points, and our key challenge is to design a good sampling scheme together with the corresponding estimation procedure so that the loss of statistical accuracy is minimized given a fixed number of sampled data points.
There has been substantial work on subsampling methods for large-scale statistical estimation problems [7, 6, 8, 16, 24, 25, 26]. The simplest method is to subsample the data uniformly. However, uniform subsampling assigns the same acceptance probability to every data point, which fails to differentiate the importance among the samples. For example, a particular scenario, often encountered in practical applications of logistic regression, is when the class labels are imbalanced. This problem has attracted significant interest in the machine learning literature (see the survey papers [5, 10]). Generally, there are two types of commonly encountered class imbalance situations: marginal imbalance and conditional imbalance. In the case of marginal imbalance, some classes are much rarer than other classes. This situation often occurs in applications such as fraud and intrusion detection [1, 12], disease diagnosis [23], and protein fold classification [20]. On the other hand, conditional imbalance is the case where the labels of most observations are very easy to predict. This happens in applications with highly accurate classifiers such as handwritten digit recognition [14] and web or email spam filtering [22, 9]. Note that marginal imbalance implies conditional imbalance, while the reverse is not necessarily true.
For marginally imbalanced binary classification problems, case-control subsampling (CC), which uniformly selects an equal number of samples from each class, has been widely used in practice, including in epidemiology and social science studies [15]. As a result, an equal number of examples from each class is subsampled, and hence the sampled data are marginally balanced. It is known that case-control subsampling is more efficient than uniform subsampling when we deal with marginally imbalanced datasets. However, since the accept-probability relies on the response variable in CC subsampling, the distribution of the subsampled data is skewed by the sample selection process [4]. Correction methods must therefore be employed to adjust for the selection bias [2, 13]. Another method to remove bias in CC subsampling is to weight each sampled data point by the inverse of its acceptance probability. This is known as the weighted case-control method, which has been shown to be consistent and unbiased [11], but may increase the variance of the resulting estimator [19, 17, 18].
One drawback of standard case-control subsampling is that it does not consider the situation where the data are conditionally imbalanced. This issue was addressed in [9], which proposed an improved subsampling scheme called local case-control (LCC) sampling for binary logistic regression. The LCC method assigns each data point an acceptance probability determined not only by its label but also by the observed covariates. It puts more importance on data points that are easily misclassified according to a consistent pilot estimator, which is an approximate conditional probability estimator possibly obtained using a small number of uniformly sampled data points. The method proposed in [9] fits a logistic model with the LCC sampled data, and then applies a post-estimation correction to the resulting estimator using the pilot estimator. Therefore, the LCC sampling approach belongs to the correction-based methods like [2, 13]. It was shown in [9] that, given a consistent pilot, the LCC estimator is consistent with an asymptotic variance that may significantly outperform that of the uniform sampling and CC-based sampling methods when the data are strongly conditionally imbalanced.
In this paper, we propose an effective sampling strategy for large-scale multi-class logistic regression problems that generalizes LCC sampling. The general sampled estimation procedure can be summarized in the following two steps:

Assign an accept-probability to each data point and select observations according to the assigned probabilities.

Fit a logistic model with the sampled observations to estimate the unknown model parameter.
In the above framework, the acceptance probability for each data point can be obtained using an arbitrary probability function. Unlike correction-based methods [13, 9] that are specialized for certain models such as linear models, we propose a maximum likelihood estimate (MLE) that integrates the correction into the MLE formulation. This approach allows us to deal with arbitrary sampling probabilities and produces a consistent estimator within the original model family as long as the underlying logistic model is correctly specified. This new integrated estimation method avoids the post-estimation correction step used in the existing literature.
Based on this estimation framework, we propose a new sampling scheme that generalizes LCC sampling as follows. Given a rough but consistent prediction of the conditional class probabilities, this scheme preferentially chooses data points with labels that are conditionally uncertain given their local observations based on that prediction. The proposed sampling strategy is therefore referred to as Local Uncertainty Sampling (LUS). We show that the LUS estimator can achieve an asymptotic variance that is never worse than that of uniform random sampling. That is, for any $\gamma \geq 1$, we can achieve a variance of no more than $\gamma$ times the variance of the full-sample based MLE by using no more than $1/\gamma$ of the full data in expectation. Moreover, the required sample size can be significantly smaller than $1/\gamma$ of the full data when the classification accuracy of the pilot estimator is relatively high. This generalizes a result for LCC in [9], which reaches a similar conclusion for binary logistic regression when $\gamma = 2$.
We conduct extensive empirical evaluations on both simulated and real-world datasets, showing that the experimental results match the theoretical conclusions, and that the LUS method significantly outperforms the previous approaches in terms of both variance and accuracy.
Our main contributions can be summarized as follows.

We propose a general estimation framework for large-scale multi-class logistic regression, which can be used with arbitrary sampling probabilities. The procedure always generates a consistent estimator within the original model family when the model is correctly specified. This method can be applied to general logistic models without the need for post-estimation corrections.

Under this framework, we propose an efficient sampling scheme called local uncertainty sampling. For any $\gamma \geq 1$, the method can achieve an asymptotic variance no more than that of random subsampling with probability $1/\gamma$, using an expected sample size no more than that of the random subsampling. Moreover, the required sample size can be significantly smaller than that of the random subsampling when the classification accuracy of the underlying problem is relatively high.
2 Preliminaries of Multi-Class Logistic Regression
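For a $K$-class classification problem, we observe random data points $(x, y)$ from an unknown underlying distribution $\mathcal{D}$, where $x$ is the feature vector and $y \in \{1, \dots, K\}$ is the corresponding label. The label can be alternatively represented by a $K$-dimensional vector with only one nonzero element at the corresponding class label $y$. Given a set of $n$ independently drawn observations $\{(x_i, y_i)\}_{i=1}^{n}$ from $\mathcal{D}$, we want to estimate the conditional probabilities $\mathbb{P}(Y = k \mid X = x)$, $k = 1, \dots, K$. This paper considers the multi-class logistic model with the following parametric form:
$$\log \frac{\mathbb{P}(Y = k \mid X = x)}{\mathbb{P}(Y = K \mid X = x)} = f(x, \theta_k), \quad k = 1, \dots, K - 1,$$
where each $\theta_k$ is the model parameter for the $k$-th class. It implies that
$$\mathbb{P}(Y = k \mid X = x) = \begin{cases} \dfrac{\exp(f(x, \theta_k))}{1 + \sum_{k'=1}^{K-1} \exp(f(x, \theta_{k'}))}, & k = 1, \dots, K - 1, \\[2ex] \dfrac{1}{1 + \sum_{k'=1}^{K-1} \exp(f(x, \theta_{k'}))}, & k = K. \end{cases} \tag{1}$$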
Let $\theta$ be the entire model parameter vector. The model in Eq. (1) is specified in terms of log-odds or logit transformations, with the constraint that the probabilities of the classes sum to one. Note that the logistic model uses one reference class as the denominator in the odds-ratios, and the choice of the denominator is arbitrary since the estimates are equivalent under this choice. This paper uses the last class as the reference class in the definition of odds-ratios.
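As a minimal sketch of the log-odds parameterization described above, the following Python function turns the $K-1$ per-class scores into $K$ class probabilities, with the last class as the reference. The function name `class_probabilities` and the use of NumPy are illustrative assumptions and do not come from the paper; the scores stand for whatever model function produces the log-odds.

```python
import numpy as np

def class_probabilities(scores):
    """Map (n, K-1) log-odds scores to (n, K) class probabilities.

    Column k of `scores` is the log-odds of class k against the
    reference (last) class, whose own score is fixed at zero.
    """
    scores = np.asarray(scores, dtype=float)
    # Append the reference class with a score of exactly zero.
    full = np.hstack([scores, np.zeros((scores.shape[0], 1))])
    # Subtract the row maximum for numerical stability; ratios are unchanged.
    full -= full.max(axis=1, keepdims=True)
    expo = np.exp(full)
    return expo / expo.sum(axis=1, keepdims=True)
```

Fixing the reference score at zero is what makes the probabilities sum to one while leaving only $K-1$ free score functions; taking the log of the ratio of any class probability to the last one recovers the corresponding score.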
When the underlying model is correctly specified, there exists a true parameter vector such that
(2) 
and is the maximizer of the expected population likelihood:
where is the vector representation of introduced at the beginning of Section 2. In the maximum likelihood formulation of multiclass logistic regression, the unknown parameter is estimated from the data by maximizing the empirical likelihood:
(3) 
where is the vector representation of .
For large-scale multi-class logistic regression problems, $n$ can be extremely large. In such cases, solving the multi-class logistic regression problem (3) may be computationally infeasible due to limited computational resources. To overcome this computational challenge, we consider a subsampling framework next.
3 Model Parameter Estimation with Subsampling
In this section, we introduce the estimation framework with subsampling for multiclass logistic regression. The proposed approach contains the following steps:

Given an arbitrary sampling probability function $a(x, y) \in [0, 1]$ defined for all data points $(x, y)$, for each $(x_i, y_i)$ ($i = 1, \dots, n$), generate a random binary variable $z_i$, drawn from the $\{0, 1\}$-valued Bernoulli distribution with accept-probability $a(x_i, y_i)$.

Keep the samples with $z_i = 1$ for $i = 1, \dots, n$. Fit a multi-class logistic regression model based on the selected examples by solving the following optimization problem
(4) where is the indicator function.
In the following, we shall derive Eq. (4), under the assumption that the logistic model is correctly specified as in Eq. (2). As we will show later, the acceptance probability used in the first step can be an arbitrary function, and the above method always produces a consistent estimator for the original population. The computational complexity in the second step is reduced to samples after the subsampling step.
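The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the acceptance probabilities are assumed to be strictly positive, and the objective is assumed to take the offset form that follows from Bayes' rule, in which the log acceptance probability of each candidate class is added to that class's score before normalizing.

```python
import numpy as np

def subsample(accept_prob, rng):
    """Step 1: draw z_i ~ Bernoulli(a_i) and return the boolean keep-mask."""
    accept_prob = np.asarray(accept_prob, dtype=float)
    return rng.random(accept_prob.shape[0]) < accept_prob

def corrected_neg_log_likelihood(scores, labels, accept_all):
    """Step 2 objective: average negative log-likelihood on the kept samples.

    scores     : (m, K-1) model scores, last class is the reference
    labels     : (m,) observed labels as integers in 0..K-1
    accept_all : (m, K) acceptance probability of x_i paired with each class
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    accept_all = np.asarray(accept_all, dtype=float)
    m = scores.shape[0]
    # Reference class has score zero; the log-acceptance term is the offset.
    logits = np.hstack([scores, np.zeros((m, 1))]) + np.log(accept_all)
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(m), labels].mean()
```

When the acceptance probability does not depend on the label, the offset cancels in the normalization and the objective coincides with the ordinary logistic likelihood, which is the expected behaviour for uniform sampling.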
Given , we may draw according to the Bernoulli distribution . This gives the following augmented distribution for the joint random variable with probability function
Note that each sampled data pair follows , and the random variable is independently drawn from . It follows that each joint data point is drawn i.i.d. from the distribution . For the sampled data with , the distribution of random variable follows from
Therefore, we have
If is correctly specified for , then the following function family
(5) 
is correctly specified for , i.e., the true parameter in Eq. (2) also satisfies
Therefore, we have the following logistic model under :
It follows that can be obtained by using MLE with respect to the new population :
where
Practically, the model parameter can be estimated by empirical conditional MLE with respect to the sampled data as
where
(6) 
which is equivalent to Eq. (4). As we will see later, the resulting subsampling based estimator is a consistent estimator of .
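The key step of this derivation can be sketched with Bayes' rule. The notation here is assumed for illustration: $p_k(x)$ denotes the true $\mathbb{P}(Y = k \mid X = x)$, $a(x, k)$ the acceptance probability of the pair $(x, k)$, $Z$ the accept indicator, and $f(x, \theta_k)$ the log-odds of class $k$ against the reference class $K$ at the true parameter $\theta^*$.

```latex
\mathbb{P}(Y = k \mid X = x, Z = 1)
  = \frac{a(x, k)\, p_k(x)}{\sum_{j=1}^{K} a(x, j)\, p_j(x)},
\qquad
\log \frac{\mathbb{P}(Y = k \mid X = x, Z = 1)}{\mathbb{P}(Y = K \mid X = x, Z = 1)}
  = \log \frac{p_k(x)}{p_K(x)} + \log \frac{a(x, k)}{a(x, K)}
  = f(x, \theta^*_k) + \log \frac{a(x, k)}{a(x, K)}.
```

Under this reading, the sampled population again follows a logistic model with the same parameter and a known additive offset, which is why maximizing the likelihood on the accepted samples needs no separate correction step afterwards.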
4 Asymptotic Analysis
In this section, we examine the asymptotic behavior of the method in Section 3. First, based on the empirical likelihood in Eq. (6), we have the following result for .
Theorem 4.1 (Consistency and Asymptotic Normality).
Suppose that the parameter space is compact and such that we have for . Moreover, assume the quantities , and for are bounded under any norm defined on the parameter space of . Let and . If Eq. (2) is satisfied, then given an arbitrary sampling probability , as , the following claims hold:

converges to ;

follows the asymptotic normal distribution:
(7) where
Theorem 4.1 shows that, given an arbitrary sampling probability, the method in Section 3 can generate a consistent estimator without post-estimation correction as long as the logistic model is correctly specified. This is different from earlier methods such as the LCC method of [9], which employs post-estimation corrections. One benefit of the proposed method is that without post-estimation correction we can still produce a consistent estimator in the original model family, and our framework allows different sampling functions for different data points. For example, in time series analysis, we may want to sample the older data more aggressively than the more recent data. This can be naturally handled in our framework but cannot be addressed by the earlier post-estimation correction approach. Another benefit is that the framework can be naturally applied with regularization, because regularization can be regarded as a restriction on the parameter space. However, post-estimation correction based methods cannot be directly applied to regularized estimators.
From Theorem 4.1, the resulting estimator follows the asymptotic normal distribution in Eq. (7) with zero mean and the variance stated there. Given a data point, although the sampling probability can be arbitrary, it is natural to select a sampling probability such that the variance is as small as possible. In the following, we study a specific choice that achieves this purpose.
Denote by the corresponding matrix when we set , i.e., we accept all data points in the dataset, then
(8) 
Moreover, if we set for some , i.e., we sample a fraction of the full dataset uniformly at random, denote the corresponding matrix as , then
In the following, we denote the asymptotic variance of our subsampling based estimator in Eq. (4), the fullsample based estimator and the estimator obtained from uniformly sampled data by
respectively.
Our purpose is to find a better sampling strategy with lower variance than that of uniform sampling. That is, we want to choose an acceptance probability function such that there exists some scalar making
under the constraint that
for all . The constraint means that the expected subsample size is no more than , i.e., we sample no more than fraction of the full data.
Theorem 4.2 (Sampling Strategy).
For any data point , let
Given any , consider the following choice of acceptance probability function:

for , set as
(9) 
for , set as
(10)
then, we always have
(11) 
and the expected number of subsampled examples is
(12) 
It is easy to check that the assigned acceptance probability in Theorem 4.2 is always valid, i.e., it is a value in . With the sampling strategy in Theorem 4.2, we always use less than a fraction of the full data to achieve less than times the variance of the fullsample based MLE. It implies that the method is never worse than the uniform sampling method. Moreover, the required sample size can be significantly smaller than under favorable conditions. For example, when , then it is easy to verify that , which is close to zero when for most , which happens when the classification accuracy is high.
More precisely, we have the following explicit formula for the expected conditional sampling probability:
Therefore in the favorable case where most for , i.e., the data are conditionally imbalanced, Theorem 4.2 implies that the method will subsample very few examples to achieve the desired variance comparable to that of random sampling.
The choice of Theorem 4.2 reduces to the sampling strategy of LCC in [9] when and . Although a method was proposed in [9] for , it is different from our sampling strategy, and there is no theoretical guarantee for that strategy. In fact, the choice in [9] for may lead to a variance larger than that of the random sampling. The empirical performance can also be inferior to our method as we will show in the experimental section.
In the multi-class case, our method is not a natural extension of local case-control (which would imply a method to set all class probabilities to $1/K$ after sampling). Instead, we only assign a smaller sampling probability in one case: the method is less likely to select a sample when its label coincides with the prediction of the underlying true model, while the sample will likely be selected if its label contradicts the underlying true model. Since the sampling strategy prefers data points with uncertain labels, we call it local uncertainty sampling (LUS).
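The behaviour described above can be illustrated with a small Python sketch. The displayed formulas of Theorem 4.2 are not reproduced in this text, so the rule below is an assumed form, chosen only because it is consistent with the properties stated in this section: it always lies in $[0, 1]$, it reduces to the LCC rule (one minus the probability of the observed label) when $K = 2$ and $\gamma = 2$, its expected value is at most $1/\gamma$, and it only down-weights labels that the model predicts with probability at least one half. The function name and the symbol `q` are hypothetical.

```python
import numpy as np

def lus_accept_prob(probs, labels, gamma):
    """Assumed LUS-style acceptance probability for each observed (x_i, y_i).

    probs  : (n, K) rough class probabilities for each point
    labels : (n,) observed labels as integers in 0..K-1
    gamma  : target variance inflation factor, gamma >= 1
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    rows = np.arange(probs.shape[0])
    # q is the largest class probability, floored at one half (assumption).
    q = np.maximum(0.5, probs.max(axis=1))
    # The observed label is "confidently predicted" when its probability is q.
    predicted = probs[rows, labels] >= q
    large = gamma >= 2.0 * q
    # Regime gamma >= 2q: every point is down-weighted.
    major_large = 2.0 * (1.0 - q) / gamma
    other_large = 2.0 * q / gamma
    # Regime gamma < 2q: only confidently predicted labels are down-weighted.
    gap = gamma - q
    safe = gap > 0  # gap is zero only when gamma == q == 1
    major_small = np.where(safe, (1.0 - q) / np.where(safe, gap, 1.0), 1.0)
    accept_major = np.where(large, major_large, major_small)
    accept_other = np.where(large, other_large, 1.0)
    return np.where(predicted, accept_major, accept_other)
```

For a binary point with class probabilities 0.9 and 0.1 and `gamma=2.0`, this sketch accepts the likely label with probability 0.1 and the unlikely one with probability 0.9, so the expected acceptance rate at that point is 0.18, well below the budget of 0.5.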
5 Local Uncertainty Sampling
(13) 
In order to apply Theorem 4.2 empirically, the main idea is to employ a rough but consistent estimate of the class probabilities given the features, and then assign the acceptance probability according to this rough estimate. In fact, according to Theorem 4.2, we only need an approximate estimate of the quantity that determines the acceptance probability. Note that such a rough estimate can be obtained via a consistent pilot estimator (which may be obtained using a small amount of uniformly sampled data). This is similar to what is employed in [9]. However, in practice, one may also use a simpler model family to obtain the pilot estimate, as shown in our MNIST experiment below, or one may use other techniques such as neighborhood-based methods [3] for this purpose. In real-world applications, this rough estimate is often easy to obtain. For example, when data arrive in time sequence, a pilot estimator trained on previous observations can be used for fitting a new model when new observations come in. Moreover, a rough estimate obtained on a small subset of the full population can be used for training on the entire dataset. In our experiments, we adopt the latter: a small uniformly subsampled subset of the original population is used to obtain the rough estimate. As we will see later, this is sufficient for our method to obtain good practical performance. The LUS algorithm is described in Algorithm 1.
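Since Algorithm 1 is not reproduced in this text, the overall procedure described in this section can be sketched as the following Python skeleton. It is a hedged outline rather than the authors' algorithm: `fit_pilot`, `accept_fn`, and `fit_corrected` are hypothetical hooks supplied by the caller, the default `pilot_fraction=0.1` mirrors the 10% pilot subsets used in the experiments, and acceptance probabilities are assumed to be strictly positive so that their logarithms are finite.

```python
import numpy as np

def lus_pipeline(X, y, gamma, fit_pilot, accept_fn, fit_corrected,
                 pilot_fraction=0.1, seed=0):
    """Pilot estimate -> acceptance probabilities -> subsample -> final fit.

    fit_pilot(X_p, y_p)            -> function mapping X to (n, K) probabilities
    accept_fn(probs, gamma)        -> (n, K) acceptance probability per class
    fit_corrected(X_s, y_s, offs)  -> fitted model using log-acceptance offsets
    """
    X = np.asarray(X)
    y = np.asarray(y, dtype=int)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: rough estimate from a small uniform subsample.
    pilot_size = max(1, int(round(pilot_fraction * n)))
    pilot_idx = rng.choice(n, size=pilot_size, replace=False)
    predict_proba = fit_pilot(X[pilot_idx], y[pilot_idx])
    # Step 2: rough class probabilities for every data point.
    probs = np.asarray(predict_proba(X), dtype=float)
    # Step 3: acceptance probability for each point paired with each class.
    accept_all = np.asarray(accept_fn(probs, gamma), dtype=float)
    # Step 4: Bernoulli draw using the observed label of each point.
    keep = rng.random(n) < accept_all[np.arange(n), y]
    # Step 5: fit on the kept points, passing the offsets for the correction.
    model = fit_corrected(X[keep], y[keep], np.log(accept_all[keep]))
    return model, keep
```

Passing the acceptance probabilities for every candidate class, not only the observed one, is a design choice of this sketch: the likelihood correction needs the offset of each class at every kept point.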
6 Experiments
In this section, we evaluate the performance of the LUS method and compare it with the uniform sampling (US) and case-control (CC) sampling methods on both simulated and real-world datasets. For the CC sampling method, we extend the standard CC considered in the binary classification problem to the multi-class case by sampling an equal number of data points from each class. Under marginal imbalance, if some minority classes do not have enough samples, we keep all data for those classes and subsample an equal number of the remaining data points from the other classes. In addition, we also compare the LUS and LCC methods on the Web Spam dataset, which is a binary classification problem studied in [9]. The experiments are run on a single machine with a 2.2 GHz quad-core Intel Core i7 and 16 GB of memory.
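One way to realize the multi-class CC extension described above is sketched below in Python. The text does not spell out the allocation procedure, so the iterative split used here is an assumption: classes too small to fill an equal share are kept whole, and the leftover budget is divided equally among the remaining classes. The function names are hypothetical.

```python
import numpy as np

def case_control_quotas(class_counts, budget):
    """Per-class sample sizes: equal shares, small classes kept in full."""
    counts = np.asarray(class_counts, dtype=int)
    quotas = np.zeros_like(counts)
    active = np.ones(counts.shape[0], dtype=bool)
    remaining = int(budget)
    while active.any():
        # Equal share of what is left (any integer remainder is dropped).
        share = remaining // int(active.sum())
        small = active & (counts <= share)
        if not small.any():
            quotas[active] = share
            break
        # Keep every point of classes that cannot fill their share.
        quotas[small] = counts[small]
        remaining -= int(counts[small].sum())
        active &= ~small
    return quotas

def case_control_sample(labels, budget, rng):
    """Indices of a class-balanced subsample drawn without replacement."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    quotas = case_control_quotas(counts, budget)
    picked = [rng.choice(np.flatnonzero(labels == c), size=int(q), replace=False)
              for c, q in zip(classes, quotas)]
    return np.sort(np.concatenate(picked))
```

With class counts of 100, 800 and 100 and a budget of 600, the two minority classes are kept whole and the dominant class contributes the remaining 400 points.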
6.1 Simulation: Marginal Imbalance
We first simulate the case where the data is marginally imbalanced. We generate a 3class Gaussian model according to , which is the true data distribution . We set the number of features as , and , and . The covariance matrices for classes are assigned to be the same as , where is a identity matrix. So the true logodds function is linear and we use a linear model to fit the simulation data. Moreover, we set , , , i.e., the data is marginally imbalanced and the second class dominates the population.
Since the true data distribution is known in this case, we directly generate the full dataset from the distribution . For the full dataset, we generate data points. The entire procedure is repeated for 200 times to obtain the variance of different estimators. For the LUS method, we randomly generate data points, i.e., an amount of 10% of the full data, from to obtain a rough estimate by a linear model and keep it before the 200 repetitions. Moreover, we generate another data points to test the prediction accuracy of different methods.
Recall that controls the desired variance of the LUS estimator according to Theorem 4.2. In the following experiments, we will test different values of , respectively. Given the value of , suppose the LUS method will subsample a number of data points. Then, we let the US and CC sampling methods select the same amount of examples to achieve fair comparison.
Since and there is an additional intercept parameter, the estimator contains a total number of coordinates. Now, denoting by the coordinatewise ratio between the variance of the coordinate in the candidate estimator and the variance of the coordinate in the fullsample based MLE, we show the value for each coordinate under different values of . The results under and are shown in Fig. 1. In this simulation, there are coordinates. From the figures, we observe that the value for each coordinate of the LUS method is approximately , which matches our theoretical analysis in Theorem 4.2. On the other side, the variances of the US and CC sampling methods are much higher than that of the LUS method.
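The coordinate-wise variance ratio used in this comparison can be computed from repeated estimates as in the following sketch. The function name and array layout are assumptions for illustration: each row holds the estimate from one repetition (the experiments use 200 repetitions) and each column is one coordinate of the parameter vector.

```python
import numpy as np

def variance_ratio(candidate_estimates, full_estimates):
    """Per-coordinate variance of a candidate estimator over the full-sample MLE.

    Both inputs are (R, p) arrays: R repetitions of a p-dimensional estimate.
    """
    cand_var = np.var(np.asarray(candidate_estimates, dtype=float), axis=0, ddof=1)
    full_var = np.var(np.asarray(full_estimates, dtype=float), axis=0, ddof=1)
    return cand_var / full_var
```

A ratio close to the chosen variance inflation target in every coordinate is what the theory predicts for the LUS estimator, while larger ratios at the same subsample size indicate a less efficient sampling scheme.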
In Fig. 2(a), we plot the relationship between the average for all coordinates against . From the figure, we observe that the relationship is close to (the dashed green line), which shows that approximately equals . These experimental results support our theoretical analysis. Fig. 2(b) reports the relationship between and . Fig. 2(c) shows the relationship between the prediction accuracy on the test data and the subsampling proportion . From the figure, when decreases, the prediction accuracy of all the methods decreases, while the LUS method shows much slower degradation compared to the US and CC methods. Moreover, according to the results of LUS in Fig. 2(c), we only need about of the full data to achieve the same prediction accuracy as the full MLE, implying that the LUS method is very effective for reducing the computational cost while preserving high accuracy.
6.2 Simulation: Marginal Balance
In the second simulation, we generate marginally balanced data with conditional imbalance. In this situation, the CC sampling method is identical to US, and hence we omit it from our comparison. The settings are exactly the same as those in the previous simulation, except that the three classes have equal prior probabilities here, i.e., the data are marginally balanced. However, this simulated data is conditionally imbalanced, as we will see later.
The value for each coordinate when and is shown in Fig. 3. The relationship between the average for all the coordinates and is plotted in Fig. 4(a). Fig. 4(b) reports the relationship between and . In Fig. 4(c), we show the relationship between the prediction accuracy on the test data and the subsampling proportion . The conclusions are similar to those of the previous simulation, which demonstrate the effectiveness of the LUS method under the marginally balanced (but conditionally imbalanced) case. Fig. 4(c) suggests again that we only need about of the full data to achieve the same prediction accuracy as that of the MLE estimator based on full data.
6.3 MNIST Data
In this section, we evaluate different methods on the MNIST data (http://yann.lecun.com/exdb/mnist/), which is a benchmark dataset for image classification on which state-of-the-art methods have achieved a test error rate below 1%. Therefore, the classification accuracy on this problem is relatively high. Note that, unlike LCC sampling, our LUS method can handle general logistic models. In this experiment, we let the model function for the LUS estimator be one of the state-of-the-art deep neural networks. Moreover, in order to save computational cost, we use a simpler neural net structure to obtain the pilot estimator, which is different from the one used for the final LUS estimator. The detailed net structures and parameter settings are provided in the appendix. For the US method, we apply the same net structure used by the final LUS estimator to achieve a fair comparison. Since the MNIST data is marginally balanced, the CC method performs the same as the US method and we omit it from the comparison here.
The training set consists of 60,000 images and the test set has 10,000 images. We uniformly select 6,000 data points, i.e., 10% of the training data, to compute the rough estimate and then use it to perform 10 repetitions of the experiment to obtain the average performance of different methods.
We test a number of values of $\gamma$, and Fig. 5(a) plots the subsample proportion against $\gamma$. Fig. 5(c) shows the relationship between the test error (%) and the subsampling proportion. Note that the rough estimate has a relatively large error rate of about 3.5%; this is because it employs a simpler network structure to save computational cost. Nevertheless, the LUS method can achieve an error rate below 1% using only about 15% of the training data; with 30% of the training data, it achieves the same error rate as that obtained by using the full training data. The LUS method consistently outperforms the US method. Table 1 shows the speedup of the LUS method on the MNIST dataset.
| | The rough estimate | LUS | Full training data | Speedup |
| Seconds | 51.0 | 321.0 | 1115.1 | 3.0 |
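The reported speedup appears consistent with counting the time for the rough estimate as part of the LUS cost. This reading is an inference from the table entries, not a statement made in the text:

```latex
\text{speedup} \approx \frac{1115.1}{51.0 + 321.0} = \frac{1115.1}{372.0} \approx 3.0
```

Dividing by the LUS time alone would instead give $1115.1 / 321.0 \approx 3.5$, which does not match the tabulated value.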
6.4 Web Spam Data: Binary Classification
In this section, we compare the LUS method with the LCC method on the Web Spam data (http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html), which is a binary classification problem used in [9] to evaluate the LCC method. Since the comparison among LCC, US and CC on this dataset has been reported in [9], we do not repeat it here and focus on the comparison between the LUS and LCC methods. The Web Spam data contains 350,000 web pages, about 60% of which are web spam. This dataset is approximately marginally balanced, but it has been shown to have strong conditional imbalance in [9]. Here, we adopt the same settings as described in [9] to compare the LUS and the LCC methods. That is, we use linear logistic models and select the 99 features that appear in at least 200 documents, and the features are log-transformed. A uniformly selected subset of the observations is used to obtain a pilot estimator, as in [9]. Since we only have a single dataset, we follow [9] and uniformly subsample 100 datasets, each of which contains 100,000 data points, as 100 independent ‘full’ datasets, and then repeat the experiments 100 times for comparison.
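The feature preparation described above can be sketched as follows. The exact transform is not given in the text, so `np.log1p` (the logarithm of one plus the count) is an assumption made here to keep zero counts finite, and the function name and count-matrix layout (documents in rows, features in columns) are hypothetical.

```python
import numpy as np

def select_and_log_transform(counts, min_docs=200):
    """Keep features present in at least `min_docs` documents, then log-transform."""
    counts = np.asarray(counts, dtype=float)
    # Document frequency: number of rows in which each feature is nonzero.
    doc_freq = (counts > 0).sum(axis=0)
    keep = doc_freq >= min_docs
    return np.log1p(counts[:, keep]), keep
```

The returned mask makes it possible to apply the same feature selection to each of the 100 resampled datasets.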
Observe that when , the LUS and LCC methods are equivalent to each other by setting the parameter of LCC in [9]. Therefore, we only focus on the case of . Similar to previous experiments, we test different values of and accordingly set , so that the two methods have the same asymptotic variance. Then, we will compare the number of subsampled data points to see which method is more effective in terms of subsampled data size .
Fig. 6 plots the values for different choices of . As expected from the theoretical results, both LUS and LCC methods have the same variance that is approximately (or ) times variance of the fullsample based MLE. Next, we compare the subsampling proportion of different methods when changes in Fig. 7. From the figure, the LUS method consistently subsamples smaller number of data points compared with LCC when they achieve the same variance as shown in Fig. 6 parameterized by . This demonstrates that the new formulation in LUS is not only theoretically better justified, but also more effective than LCC in practice (for the case of ).
7 Conclusion
This paper introduced a general subsampling method for solving large-scale logistic regression problems. We investigated the asymptotic variance of the proposed estimator. Based on the theoretical analysis, we proposed an effective sampling strategy called Local Uncertainty Sampling to achieve any given level of desired variance. We proved that the method always achieves lower variance than random subsampling for a given expected sample size, and the improvement may be significant under the favorable condition of strong conditional imbalance. Therefore, the method can effectively accelerate the computation of large-scale logistic regression in practice. Experiments on synthetic and real-world datasets are provided to illustrate the effectiveness of the proposed method. The empirical studies confirm the theory, and demonstrate that the local uncertainty sampling method outperforms the uniform sampling, case-control sampling and local case-control sampling methods under various settings. By using the proposed method, we are able to select a very small subset of the original data to achieve the same performance as that of the full dataset, which provides an effective means for big data computation under limited resources.
Acknowledgment
This research is partially supported by NSF IIS-1250985, NSF IIS-1407939, and NIH R01AI116744.
References
 [1] Naoki Abe, Bianca Zadrozny, and John Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3–11. ACM, 2004.
 [2] James A Anderson. Separate sample logistic discrimination. Biometrika, 59(1):19–35, 1972.
 [3] Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. Locally weighted learning for control. In Lazy Learning, pages 75–113. Springer, 1997.
 [4] Norman Breslow. Design and analysis of case-control studies. Annual Review of Public Health, 3(1):29–54, 1982.
 [5] Nitesh V Chawla, Nathalie Japkowicz, and Aleksander Kotcz. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.
 [6] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442–450, 2010.
 [7] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Algorithmic Learning Theory, pages 38–53. Springer, 2008.
 [8] Paramveer Dhillon, Yichao Lu, Dean P Foster, and Lyle Ungar. New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems, pages 360–368, 2013.
 [9] William Fithian and Trevor Hastie. Local case-control sampling: efficient subsampling in imbalanced data sets. Annals of Statistics, 42(5):1693, 2014.
 [10] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
 [11] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
 [12] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, and Sung Yang Bang. Pattern classification using support vector machine ensemble. In Proceedings of the International Conference on Pattern Recognition, volume 2, pages 160–163. IEEE, 2002.
 [13] Gary King and Langche Zeng. Logistic regression in rare events data. Political Analysis, 9(2):137–163, 2001.
 [14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [15] Nathan Mantel and William Haenszel. Statistical aspects of the analysis of data from retrospective studies. Journal of the National Cancer Institute, 22(4):719–748, 1959.
 [16] Paul Mineiro and Nikos Karampatziakis. Loss-proportional subsampling for subsequent ERM. arXiv preprint arXiv:1306.1840, 2013.
 [17] AJ Scott and CJ Wild. Fitting logistic regression models in stratified case-control studies. Biometrics, pages 497–510, 1991.
 [18] Alastair Scott and Chris Wild. On the robustness of weighted methods for fitting models to case–control data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(2):207–219, 2002.
 [19] Alastair J Scott and CJ Wild. Fitting logistic models under case-control or choice based sampling. Journal of the Royal Statistical Society. Series B (Methodological), pages 170–182, 1986.
 [20] Aik Choon Tan, David Gilbert, and Yves Deville. Multiclass protein fold classification using a new ensemble machine learning approach. Genome Informatics, 14:206–217, 2003.
 [21] Andrea Vedaldi and Karel Lenc. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the Annual ACM Conference on Multimedia Conference, pages 689–692. ACM, 2015.
 [22] Steve Webb, James Caverlee, and Calton Pu. Introducing the webb spam corpus: using email spam to identify web spam automatically. In Proceedings of the Third Conference on Email and Anti-Spam, 2006.
 [23] Achmad Widodo and Bo-Suk Yang. Support vector machine in machine condition monitoring and fault diagnosis. Mechanical Systems and Signal Processing, 21(6):2560–2574, 2007.
 [24] Yu Xie and Charles F Manski. The logit model and response-based samples. Sociological Methods and Research, 17(3):283–302, 1989.
 [25] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the International Conference on Machine Learning, page 114. ACM, 2004.
 [26] Tong Zhang and F Oles. The value of unlabeled data for classification problems. In Proceedings of the International Conference on Machine Learning, pages 1191–1198. Citeseer, 2000.
Appendix
Appendix A Proof of Theorem 4.1
The following lemma is useful in our analysis.
Lemma A.1.
For any norm defined on the parameter space of , assume the quantities , and for are bounded. Then, for any compact set , we have
Proof.
For fixed , we define
then we have and . By the law of large numbers, converges pointwise to in probability.
According to the assumption, there exists a constant such that
Given any , we may find a finite cover so that for any , there exists such that . Since is finite, as , converges to in probability. Therefore as , with probability , we have
Let , we obtain the first bound. The second and the third bounds can be similarly obtained. ∎
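The covering step in the proof above follows a standard pattern, which can be sketched as follows. All symbols here are illustrative assumptions rather than the paper's own notation: $f_n$ stands for the empirical average, $f$ for its expectation, $\Theta$ for the compact set, $L$ for a common Lipschitz constant of $f_n$ and $f$ on $\Theta$ (available when the relevant derivatives are bounded), and $\{\theta_1,\dots,\theta_N\}$ for a finite $\epsilon$-cover of $\Theta$.

```latex
% Assumed notation (not from the paper): f_n, f, \Theta, L, \epsilon, \theta_j.
% For any \theta \in \Theta, choose \theta_j in the cover with
% \|\theta - \theta_j\| \le \epsilon. The triangle inequality gives
\begin{align*}
|f_n(\theta) - f(\theta)|
  &\le |f_n(\theta) - f_n(\theta_j)|
     + |f_n(\theta_j) - f(\theta_j)|
     + |f(\theta_j) - f(\theta)| \\
  &\le 2L\epsilon + \max_{1 \le j \le N} |f_n(\theta_j) - f(\theta_j)| .
\end{align*}
% Taking the supremum over \theta \in \Theta:
\[
\sup_{\theta \in \Theta} |f_n(\theta) - f(\theta)|
  \le 2L\epsilon + \max_{1 \le j \le N} |f_n(\theta_j) - f(\theta_j)| .
\]
% The maximum runs over finitely many points, so it converges to 0 in
% probability by pointwise convergence; letting \epsilon \to 0 then yields
% uniform convergence in probability over \Theta.
```

Under these assumptions, the finite cover reduces uniform convergence over the compact set to pointwise convergence at finitely many points, plus a discretization error controlled by the Lipschitz constant.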
We are now ready to prove Theorem 4.1.
Proof.
For notational simplicity, we abbreviate the pointwise functions , , , and at as , , , and respectively. Moreover, denote .

Define and the function
then is the global optimizer of , and we have . Moreover,
Note that for a convex function , its KL divergence for and is
and with only when . Moreover, since is a pointwise function at , then indicates that for any , and thus we have . This is due to the assumption of the theorem, which states that the parameter space is compact and we have . Hence,
and we have for any , . It follows that given any , there exists so that implies that . Now according to Lemma A.1, given any , when and with probability larger than , we have