Less is More: An Exploration of Data Redundancy with Active Dataset Subsampling
Abstract
Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size of such datasets, it is conceivable that they contain certain samples that either do not contribute to, or negatively impact, the DNN's performance. If there is a large number of such samples, subsampling the training dataset in a way that removes them could provide an effective solution to both improve performance and reduce training time. In this paper, we propose an approach called Active Dataset Subsampling (ADS) to identify favorable subsets within a dataset for training, using ensemble-based uncertainty estimation. When applied to three image classification benchmarks (CIFAR10, CIFAR100 and ImageNet), we find that there are low-uncertainty subsets, which can be as large as 50% of the full dataset, that negatively impact performance. These subsets are identified and removed with ADS. We demonstrate that datasets obtained using ADS with a lightweight ResNet18 ensemble remain effective when used to train deeper models like ResNet101. Our results provide strong empirical evidence that using all the available data for training can hurt performance on large-scale vision tasks.
Kashyap Chitta Jose M. Alvarez Elmar Haussmann Clement Farabet NVIDIA {kchitta,josea,ehaussmann,cfarabet}@nvidia.com
Preprint. Under review.
1 Introduction
Deep Neural Networks (DNNs) have become the dominant approach for addressing supervised learning problems. They are trained using stochastic gradient methods, where (i) data is subsampled into minibatches, and (ii) the network parameters are iteratively updated using the gradient of a loss function with respect to the parameters, computed on the given minibatch. Through this procedure, DNNs have achieved remarkable success when there is an abundance of labeled training data and computational resources.
Given a fixed amount of training data for a DNN, it is well known that scaling up the compute (number of parameters and number of training updates) eventually leads to overfitting, where the generalization of the network suffers. This is due to the inherent noise present in the training dataset: properties specific to the training samples rather than the underlying data distribution. Certain redundant samples in a dataset may not be essential for training a DNN, but contribute to the dataset's inherent noise. In theory, limiting this inherent noise in a training dataset would reduce the chance of overfitting, in turn allowing for both improved performance and reduced compute for a specific task.
Subsampling the training dataset is a potential approach to reduce the impact of redundant samples on training. Dataset subsampling is a key step in several classical computer vision and machine learning algorithms [16, 35, 6]. The key issue that arises with subsampling is that redundant samples are removed at the cost of a large reduction in the total number of samples in the training dataset. For example, bagging [5], a popular subsampling strategy, reduces the number of unique samples seen by a learner by 37% through random subsampling [29]. On large-scale vision tasks, empirical studies show that even in the presence of noise, having a larger number of training samples can significantly improve a DNN's performance [41, 44]. Consequently, since DNNs typically have millions of learnable parameters, randomly removing samples may adversely impact performance.
In this paper, we present Active Dataset Subsampling (ADS), a simple yet effective approach to isolate and remove the redundant samples in a dataset that degrade or do not contribute to the DNN's performance. Rather than subsampling randomly, we propose the use of ensemble-based uncertainty estimation methods to specifically detect these redundant samples. A key issue of ensemble-based methods is the difficulty in scaling up to large datasets due to computational constraints. ADS solves this problem by ensembling multiple training checkpoints from different experimental runs. Our experiments show that on three popular image classification benchmarks, the performance of a DNN can be improved by training on a subset of all the available data. For example, for ResNet18 training on ImageNet, we remove 20% of the samples and improve top-1 accuracy by 0.5%. Further, given the rapid pace of model development, we show that our subsampled datasets remain valuable even after a particular model is surpassed by newer architectures, by demonstrating the transferability of ADS between DNNs with very different network capacities [31].
To summarize, our contributions in this paper are as follows: (i) we propose a simple approach to scale up ensemble-based uncertainty estimation methods with a negligible computational overhead at train time; (ii) we propose Active Dataset Subsampling, which uses ensembles to remove the redundant samples from a training dataset to improve DNN performance; and (iii) we conduct a detailed empirical study of ADS for image classification, and of the robustness of ADS to changes in model architecture and dataset size. Our technique is simple to implement at scale, and provides benefits across a variety of settings with respect to task complexity and dataset size.
2 Related Work
Dataset subsampling for DNNs. Despite the widespread use of image classification datasets, there has only recently been an increased interest in understanding the properties of different subsets of these datasets through subsampling. Work along these lines can be split into two categories based on their final goal: (i) reducing dataset size [36, 43, 3, 11] or (ii) countering catastrophic forgetting [42, 25]. Coreset selection [36] is an example of the first category of techniques. This approach attempts to find a representative subset of points based on relative distances in the DNN feature space. Vodrahalli et al. [43] also aim to find representative subsets, using the magnitude of the gradient generated by each sample for the DNN as an importance measure. While these techniques give subsets that perform favorably when compared to randomly sampled subsets with the same amount of data, they are unable to match or improve the performance of a model trained with all the data. More recent techniques are able to successfully reduce dataset sizes by identifying redundant examples. Birodkar et al. [3] use clustering in the DNN feature space to identify redundant samples, leading to a discovery of 10% redundancy in the CIFAR10 and ImageNet datasets. Select via proxy [11] uses simple uncertainty metrics (confidence, margin and entropy) to select a subset of data for training, and removes 40% of the samples in CIFAR10 without reducing the network performance. In comparison to these methods, our technique aims to not only maintain, but also improve the performance of a DNN. Further, we do so with much less data than these methods; for example, on CIFAR10, we only require 50% of the training dataset.
Among the second category of works, related to catastrophic forgetting, Toneva et al. [42] use the number of instances in which a previously correctly classified sample is 'forgotten' and misclassified during training as an importance measure for subsampling. By doing so, this method is able to remove the 30% of samples that are 'unforgettable' from CIFAR10 without significantly reducing performance. Chang et al. [7] propose a similar idea of emphasizing data points whose predictions have changed most over the previous epochs during training. Rather than directly subsampling, this variance in predictions is used to increase or decrease the sampling weight during training, and therefore the approach has no significant impact on training time. In comparison, by subsampling our datasets, ADS not only improves performance but also cuts down on training time by up to 50%.
Active learning. Active learning aims to select, from a large unlabeled dataset, the smallest possible training set to label in order to solve a specific task [10]. The main goal of active learning is to minimize labeling costs for unlabeled data. In comparison, ADS does not focus directly on labeling costs, but rather the redundancy of samples in labeled datasets. Implicitly, if there is a large amount of redundancy in a labeled dataset for a given task, it indicates that the labeling budget was not optimally utilized while building that dataset. This in turn indicates the potential savings in annotation costs that can be achieved for that task through improved data selection strategies such as active learning.
A comprehensive review of classical approaches to active learning is presented in [37]. In these approaches, data samples for which the current model is uncertain are queried for labeling. Current state-of-the-art active learning techniques for computer vision DNNs are based on uncertainty estimates from ensembles [2, 8]. While these methods typically assume that the performance obtained by training on the entire data pool is an upper bound, we show that this is not always the case, and demonstrate distinct advantages of training on a subset of data using ADS regardless of the annotation costs. Active learning techniques also tend to focus on a single model, which leads to datasets that have been shown to transfer poorly to new model architectures [31]. On the other hand, we empirically show that ADS selects datasets that are robust to changes in model architecture.
Uncertainty for DNNs. Uncertainty estimation is a key component of ADS. Due to several important applications, such as active learning and adversarial sample detection, techniques to improve the uncertainty estimates of a DNN have recently gained significant momentum [39, 21]. These techniques may be broadly categorized into (i) Bayesian [22, 4, 18] and (ii) non-Bayesian techniques [30, 14]. Bayesian techniques approximate a class of neural networks called Bayesian Neural Networks (BNNs) [33]. When BNNs are trained, each weight in the network takes the form of a probability distribution in the parameter space. However, training a BNN involves marginalization over all possible assignments of weights, which is intractable for deep BNNs without approximations [22, 4, 18]. Due to this, these approaches have traditionally been computationally more demanding and conceptually more complicated than non-Bayesian ones [30, 14].
Recent methods based on ensembles have made notable progress on simplifying BNN approximations, but remain computationally demanding [29]. In our work, we present a technique to efficiently scale up ensemble-based methods to larger datasets with millions of samples. We focus on reducing the train-time computational cost of generating ensembles.
3 Active Dataset Subsampling
In this section, we show how ensembles can be used to efficiently approximate the Bayesian approach to uncertainty estimation for classification DNNs. We then describe the ADS algorithm.
Consider a distribution over inputs x and labels y. In a Bayesian framework, the predictive uncertainty for a particular input x* after training on a dataset D is denoted as p(y | x*, D). The predictive uncertainty will result from data (aleatoric) uncertainty and model (epistemic) uncertainty [27]. A model's estimates of data uncertainty are described by the posterior distribution over class labels given a set of model parameters θ, i.e., p(y | x*, θ). This is typically the softmax output in a classification DNN. Additionally, the model uncertainty is described by the posterior distribution over the parameters given the training data, p(θ | D) [32]:
p(y | x*, D) = E_{p(θ|D)}[ p(y | x*, θ) ] = ∫ p(y | x*, θ) p(θ | D) dθ    (1)
We see that uncertainty in the model parameters induces a distribution over the softmax distributions p(y | x*, θ). The expectation is obtained by marginalizing out the parameters θ. Unfortunately, obtaining the full posterior p(θ | D) using Bayes' rule is intractable. If we train a single DNN, we only obtain a single sample from the distribution p(θ | D). Ensemble-based uncertainty estimation techniques approximate the integral over θ from Eq. 1 by Monte Carlo estimation, generating multiple samples using different members of an ensemble [29]:
p(y | x*, D) ≈ (1/M) Σ_{i=1}^{M} p(y | x*, θ_i),   θ_i ∼ q(θ)    (2)
where q(θ) represents the approach used for building the ensemble and M is the number of models. The strength of the ensemble approximation depends on the ensemble configuration, which refers to (i) the number of samples drawn (M), and (ii) how the parameters for each model in the ensemble are sampled (i.e., how closely the distribution q(θ) matches the true posterior p(θ | D)).
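As a concrete illustration of Eq. 2, the Monte Carlo estimate is simply the mean of the per-member softmax outputs. A minimal sketch (the function name and member probabilities below are our own illustrative values, not results from the paper):

```python
import numpy as np

def predictive_distribution(member_probs):
    """Monte Carlo estimate of the predictive distribution (Eq. 2):
    average the softmax outputs of the M ensemble members.

    member_probs: array of shape (M, C), one softmax vector per member.
    """
    return np.asarray(member_probs).mean(axis=0)

# Example with M = 3 members and C = 3 classes.
p = predictive_distribution([[0.7, 0.2, 0.1],
                             [0.6, 0.3, 0.1],
                             [0.5, 0.4, 0.1]])
# p ≈ [0.6, 0.3, 0.1]
```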
State-of-the-art ensemble-based active learning approaches use different random seeds to construct ensembles [2], and recommend drawing only a small number of models [29]. In theory, the error of a Monte Carlo estimator should decrease with more samples; this is evident in other BNN-based uncertainty estimation techniques, which require the number of stochastic samples drawn to be increased substantially [19].
The major limiting factor preventing the training of many models with different random seeds for ensemble-based uncertainty estimation is the computational burden at train time. 'Implicit' ensembling approaches that are computationally inexpensive, such as Dropout [40], suffer from mode collapse, where the different members in the ensemble lack sufficient diversity for reliable uncertainty estimation [34]. An alternate approach, called snapshot ensembles, that is less computationally expensive at train time, uses a cyclical learning rate to converge to multiple local optima in a single training run [26]. However, this technique is also limited to relatively small ensembles. In our work, we present an approach that allows users to draw a large number of samples using the catastrophic forgetting property of DNNs [42]. Specifically, we exploit the disagreement between different checkpoints stored during successive training epochs to efficiently construct large and diverse ensembles. We collect several training checkpoints over multiple training runs with different random seeds. This allows us to maximize the number of samples drawn, efficiently generating ensembles with up to hundreds of members.
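Since the checkpoints already exist on disk after training, assembling such an ensemble reduces to enumerating the stored files. A minimal sketch, assuming a hypothetical `run{r}/epoch{e}.pt` checkpoint layout (the directory naming and file extension are illustrative, not from the paper):

```python
from pathlib import Path

def checkpoint_ensemble_paths(root, num_runs, first_epoch, last_epoch):
    """Enumerate checkpoint files for an ensemble built from the final
    training epochs of several runs with different random seeds."""
    return [Path(root) / f"run{r}" / f"epoch{e}.pt"
            for r in range(num_runs)
            for e in range(first_epoch, last_epoch + 1)]

# e.g. 5 runs x 20 stored epochs yields a 100-member ensemble
paths = checkpoint_ensemble_paths("ckpts", num_runs=5,
                                  first_epoch=130, last_epoch=149)
# len(paths) → 100
```

Loading each path into a fresh copy of the network then gives the ensemble members, at no extra train-time cost.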
Though we study the impact of the number of samples, our work does not specifically attempt to better match q(θ) and p(θ | D), which is an open and active research area [32]. The overall ADS algorithm is generic and can potentially benefit from more advanced techniques for uncertainty estimation.
3.1 Acquisition Functions
We now go into more detail on estimating predictive uncertainty with ensembles. Once the ensemble configuration is fixed, we estimate the uncertainty of a sample using an acquisition function. In our experiments, we empirically evaluate three acquisition functions: entropy, mutual information, and variation ratios.
Bayesian active learning approaches argue that the acquisition function must target samples with high model uncertainty rather than data uncertainty. Intuitively, the reasoning behind this is that model uncertainty is a result of not knowing the correct parameters, which can be reduced by adding the right data and retraining the model. However, data uncertainty exists even for a model with the most optimal parameters, and cannot be reduced by adding more data [27].
In the case of classification, the predictive uncertainty for a sample from Eq. 1 is a multinomial distribution, which can be represented as a vector of probabilities p over the C classes. We can obtain the predictive uncertainty for a sample as the entropy of this vector [38]:
H[p] = − Σ_{c=1}^{C} p_c log p_c    (3)
However, once we marginalize out θ in Eq. 1, it is impossible to tell whether this predictive uncertainty is a result of model uncertainty or data uncertainty. One acquisition function that explicitly looks for large disagreement between the models (i.e., model uncertainty) is the mutual information I [39]:
I = H[ p̄ ] − (1/M) Σ_{i=1}^{M} H[ p_i ],   where p̄ = (1/M) Σ_{i=1}^{M} p_i and p_i = p(y | x*, θ_i)    (4)
where H[p_i] denotes the entropy of an individual member of the ensemble before marginalization. Since entropy is always positive, the maximum possible value for I is H[p̄]. However, when the models make similar predictions, H[p_i] ≈ H[p̄] for every member, and I ≈ 0, which is its minimum value. This shows that I encourages samples with high disagreement to be selected during the data acquisition process. An alternate way to look at the metric is that from the predictive uncertainty, we subtract away the expected data uncertainty, leaving an approximation of the model uncertainty [13].
Variation ratios is another acquisition function that looks for disagreement. It is defined as the fraction of members in the ensemble that do not agree with the majority vote [17]:
v = 1 − (1/M) max_{c ∈ {1, …, C}} Σ_{i=1}^{M} 1[ argmax_{c'} p_{i,c'} = c ]    (5)
where C is the number of classes. This is the simplest quantitative measure of variation, and prior applications in the literature show that it works well in practice [20, 2, 9].
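The three acquisition functions above can be computed directly from the stacked member softmax outputs. A minimal NumPy sketch (function names are ours; the tiny two-class examples are illustrative):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """H[p] of a probability vector, or a batch of vectors (Eq. 3)."""
    p = np.asarray(p)
    return -(p * np.log(p + eps)).sum(axis=-1)

def mutual_information(member_probs):
    """I: entropy of the mean prediction minus mean member entropy (Eq. 4)."""
    member_probs = np.asarray(member_probs)           # shape (M, C)
    return entropy(member_probs.mean(axis=0)) - entropy(member_probs).mean()

def variation_ratio(member_probs):
    """v: fraction of members that disagree with the majority vote (Eq. 5)."""
    votes = np.asarray(member_probs).argmax(axis=-1)  # per-member class vote
    return 1.0 - np.bincount(votes).max() / len(votes)

# When all members agree, both disagreement-based metrics vanish,
# even though the entropy (data uncertainty) can stay high.
agree = [[0.6, 0.4]] * 4
# mutual_information(agree) ≈ 0.0, variation_ratio(agree) == 0.0
split = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
# variation_ratio(split) == 0.5 (two of four members dissent)
```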
3.2 Active Dataset Subsampling
The pseudocode for ADS is summarized in Algorithm 1. The algorithm involves the following:

- A labeled dataset D = {(x_n, y_n)}, n = 1, …, N, where each x_n is a data point and each y_n is its corresponding label.
- An acquisition model: for our ensemble-based uncertainty estimation technique, the acquisition model takes the form of a set of M different DNNs with parameters θ_1, …, θ_M.
- A subsampled dataset D' ⊂ D, selected from D using an acquisition function.
- A subset model, with its own parameters, trained on D'.
We consider three different initialization schemes for the parameters of the acquisition and subset models: pretrain, build up and compress. The pretrain scheme uses the entire dataset D for pretraining both the acquisition and subset models. During optimization, the subset model is then fine-tuned on the subsampled dataset D'. In the compress scheme, the acquisition model is pretrained on D but the subset model is randomly initialized and trained from scratch on D'. The acquisition model therefore accesses all the data and then 'compresses' the dataset for the subset model. Finally, in the build up scheme, we aim to emulate an iteration in a typical active learning loop, as described in [8]. A set of existing subset models is used as an acquisition model, in an approach with multiple iterations of ADS. For the pretrain and compress schemes, the subsampled dataset is initialized as an empty set, whereas for the build up scheme, we initialize the subsampled dataset with the existing samples in D' from a previous iteration. The very first iteration of the build up scheme starts by initializing D' with a randomly selected subset of the data.
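A minimal sketch of one build-up iteration: keep the current subset and grow it with the most uncertain samples from the remaining pool. Function and argument names are ours; training the acquisition model and scoring the pool are assumed to happen outside this snippet.

```python
import numpy as np

def build_up_step(subset_idx, pool_idx, pool_scores, grow_to):
    """Grow the subsampled dataset D' to `grow_to` samples by adding the
    pool samples with the highest acquisition scores (uncertainty)."""
    order = np.argsort(pool_scores)[::-1]        # most uncertain first
    n_new = grow_to - len(subset_idx)
    chosen = [pool_idx[i] for i in order[:n_new]]
    return list(subset_idx) + chosen

# Current subset {0, 1}; pool {2, 3, 4, 5} with per-sample uncertainties.
subset = build_up_step([0, 1], [2, 3, 4, 5],
                       pool_scores=[0.1, 0.9, 0.4, 0.8], grow_to=4)
# subset → [0, 1, 3, 5]
```

Repeating this step, with the subset models from the previous iteration acting as the acquisition ensemble, reproduces the active-learning-style loop described above.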
4 Experiments
In this section, we demonstrate the effectiveness of ADS on three image classification benchmarks. We initially investigate the impact of the initialization schemes discussed in Section 3.2 and acquisition functions from Section 3.1 on the performance obtained through ADS. We then focus on scaling up the approach and evaluating its robustness to architecture shifts.
Datasets. We experiment with three datasets: CIFAR10 and CIFAR100 [28], as well as ImageNet [12]. The CIFAR datasets involve object classification tasks over natural images: CIFAR10 is coarse-grained over 10 classes, and CIFAR100 is fine-grained over 100 classes. For both tasks, there are 50k training images and 10k validation images of resolution 32×32, which are balanced in terms of the number of training samples per class. ImageNet consists of 1000 object classes, with annotation available for 1.28 million training images and 50k validation images. This dataset has a slight class imbalance, with 732 to 1300 training images per class.
Implementation Details. Unless otherwise specified, we use 8 models with the ResNet18 architecture to build the acquisition and subset models [23]. For ImageNet, each ResNet18 uses the standard kernel sizes and counts. For CIFAR10 and CIFAR100, we use a variant of ResNet18 as proposed in [24]. For all three tasks, we do mean-std preprocessing, and augment the labeled dataset online with random crops and horizontal flips. Optimization is done using Stochastic Gradient Descent with a learning rate of 0.1, momentum of 0.9, and weight decay. On CIFAR, we use a patience parameter (set to 25) to count the number of epochs with no improvement in validation accuracy, after which the learning rate is dropped by a factor of 0.1. We end training when dropping the learning rate also gives no improvement in the validation accuracy after a number of epochs equal to twice the patience parameter. If the early stopping criterion is not met, we train for a maximum of 400 epochs. On ImageNet, we train for a total of 150 epochs, scaling the learning rate by a factor of 0.1 after 70 and 130 epochs. Experiments are run on Tesla V100 GPUs.
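The ImageNet learning rate schedule described above (start at 0.1, scale by 0.1 after 70 and 130 epochs) can be sketched as a simple step function:

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(70, 130), gamma=0.1):
    """Step learning rate schedule: multiply the base rate by `gamma`
    once for every milestone epoch that has been passed."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Epochs 0-69 train at 0.1, epochs 70-129 at 0.01, epochs 130-150 at 0.001.
```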
4.1 Results
Initialization schemes. In our first experiment, we compare the three initialization schemes introduced in Section 3.2 to a random subsampling baseline. For this experiment, we fix the number of ensemble members to 8 for CIFAR and 4 for ImageNet, each with a different random seed. We fix the acquisition function to Mutual Information (I), as defined in Eq. 4. For the pretrain scheme, we pretrain 8 (or 4) models with different random seeds on the entire dataset D, and fine-tune them on the chosen subset D' starting with a reduced learning rate. For the other schemes, we train the subset model from scratch on D'. We report the top-1 validation accuracy of three independent ensembles, each from a different experimental trial, plotting the mean with one standard deviation as an error bar. These results are summarized in Fig. 1.
We observe certain common trends for all three datasets: random subsampling (blue in Fig. 1) leads to a steady drop-off in performance; and the pretrain scheme (orange in Fig. 1) does not significantly impact the performance in comparison to training with the full dataset. Interestingly, the compress scheme (green in Fig. 1) performs extremely poorly when the chosen subset is very small, but performs well when the acquisition and subset models have a similar overall dataset size (e.g., 1M samples on ImageNet). The poor performance of the compress scheme implies that the uncertainty estimates for this experiment are not robust to large changes in the dataset size between the acquisition and subset models. It is possible that a very small subset with high uncertainty is informative to the acquisition model, but too difficult for the subset model, which is trained from scratch with just these samples. To check for this, we repeat the experiment for the compress scheme on CIFAR while subsampling the dataset to 25k samples (50%), but set aside a percentage of the highest uncertainty samples as outliers instead of adding them to the subset during acquisition. For example, for 12.5% outliers, after sorting by the acquisition function, we select the samples in the range 37.5% to 87.5% as the subset instead of 50% to 100%. As shown in Table 1, leaving out outliers does not improve the performance of the compress scheme on CIFAR10. Even in the case of CIFAR100, where there is an improvement in performance after removing outliers, the obtained accuracy of 76.76% is well short of the build up scheme (red in Fig. 1), which reaches 79.37%. This indicates that though the highest uncertainty samples are suboptimal for training when taken alone, they are important when used as part of a larger set of samples.
              CIFAR10                                       CIFAR100
No Outliers   12.5% Outliers   25% Outliers   No Outliers   12.5% Outliers   25% Outliers
95.77         94.85            93.81          75.89         76.76            76.41
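The outlier-window selection used in this experiment amounts to slicing a band of the uncertainty ranking rather than its top. A minimal sketch (the function name is ours):

```python
import numpy as np

def band_select(scores, frac=0.5, outlier_frac=0.125):
    """Select `frac` of the samples by uncertainty while discarding the top
    `outlier_frac` as outliers; e.g. frac=0.5, outlier_frac=0.125 keeps the
    37.5%-87.5% band of the ranking instead of the top 50%."""
    n = len(scores)
    order = np.argsort(scores)                   # ascending uncertainty
    hi = int(round(n * (1.0 - outlier_frac)))
    lo = hi - int(round(n * frac))
    return order[lo:hi]

idx = band_select(np.arange(8) / 8.0)
# idx → [3, 4, 5, 6]: half of the data, skipping the most uncertain sample
```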
From Fig. 1, the build up scheme consistently outperforms random subsampling by a large margin. The performance is robust across all three tasks. Based on these observations, we fix the initialization scheme to build up for the subsequent experiments. With a sufficiently large subset of the data (e.g., 25k samples on CIFAR10, 1M on ImageNet), ADS slightly outperforms a model trained on the full dataset.
Acquisition functions. In our next experiment, we compare random subsampling against the three acquisition functions from Section 3.1. For the acquisition model, we use ensembles of 8 members for CIFAR and 4 members for ImageNet. We run three experimental trials, and report the mean validation accuracy of the ensemble of subset models for each trial, in Table 2.
For all three acquisition functions, the subset models significantly outperform the baseline (Random), which uses a randomly selected subset of the same size. Further, when using 50% of the data on CIFAR10, and 80% of the data on CIFAR100 and ImageNet, we obtain subsets of data that improve performance compared to training on the full dataset (100%). Among the three functions, mutual information (I) and variation ratios (v) outperform entropy (H). On the CIFAR10 dataset, there are likely very few absolute disagreements due to the small number of classes and high performance. In this setting, I outperforms v. However, on the CIFAR100 and ImageNet datasets, with more classes and lower overall performance, there is likely a greater number of disagreements, and v outperforms I by a significant amount when using 80% of the dataset. Compared to training on the full dataset, we obtain a 0.5% absolute improvement in validation accuracy on both CIFAR100 and ImageNet when using the dataset acquired by v. Additionally, this performance improvement is accompanied by a 20% reduction in training time.
Dataset                     CIFAR10                   CIFAR100                  ImageNet
Acquisition Function        12.5%    25%     50%      20%     40%     80%       20%     40%     80%
Random                      87.65    91.88   94.30    63.96   74.18   80.65     63.51   68.54   71.66
Entropy (H)                 89.67    94.29   96.08    65.83   76.35   81.94     64.11   69.64   72.00
Mutual Information (I)      89.46    94.34   96.30    66.11   76.29   82.02     64.65   70.18   72.36
Variation Ratios (v)        89.86    94.42   95.76    65.28   76.27   82.37     64.39   69.20   72.78
Full Dataset (100%)                          96.18                    81.86                     72.33
Scaling up the ensemble. We now explore the possibility of further gains in performance by scaling up the ensemble to increase the number of samples drawn in the Monte Carlo estimator as per Eq. 2. For the remaining experiments, we focus on the ImageNet dataset. We start by setting up 5 different training runs on 40% of the ImageNet dataset (512k samples) as selected by the build up scheme with the mutual information (I) acquisition function in Table 2. For each of these 5 training runs, we store the 20 checkpoints obtained in the final stage of training (epochs 130–150, with a learning rate of 0.001). We pick 3 ensemble configurations from these ResNet18 training runs to utilize as the acquisition model: (i) random seeds, which uses a total of 5 models from the best performing epoch of each run; (ii) training checkpoints, which uses the 20 models from epochs 130–150 of a single run; and (iii) combined, which uses all 100 models.
We initially evaluate the performance of these three ensemble configurations. To this end, we report the top-1 accuracy of the three ensemble configurations, along with a baseline of a single model, when evaluated on the 40% selected data (i.e., the training set) and the 60% unselected data of ImageNet. Our results are shown in Table 3. For all ensemble configurations, the performance is better on the larger unselected subset of data than on the selected training set. For a single model, the gap in top-1 accuracy between these two subsets is nearly 13%. This indicates the huge amount of redundancy in the unselected part of the dataset. As the number of members in the ensemble is scaled up, we observe large and clear improvements in performance on both selected and unselected data. This shows that though the checkpoints are obtained with no additional computational cost at train time, they can be used to generate diverse ensembles.
Eval Set  Single (1)  Random Seeds (5)  Training Checkpoints (20)  Combined (100) 

Selected  58.10  74.98  79.60  83.78 
Unselected  70.85  82.55  84.02  85.57 
Further, we are interested in how the ensemble configurations affect the acquisition function. To this end, we query for an additional 40% of the samples from the unselected data using the variation ratios (v) acquisition function, with each of the three ensemble configurations. We obtain three new datasets containing 80% of the samples in ImageNet. The top-1 validation accuracy of a subset model trained on each of these new datasets is shown in Table 4. We observe a steady increase in performance as the number of models used during acquisition is increased, showing the benefits of scaling up the ensemble-based uncertainty estimation approach.
Acquisition  Random Seeds (5)  Training Checkpoints (20)  Combined (100) 

Variation Ratios (v)  69.97  70.18  70.34 
Robustness to architecture shift. Finally, we evaluate the robustness of the subsampled datasets to changes in model capacity. Specifically, we evaluate the robustness of the best performing dataset selected with a ResNet18 acquisition model in Table 4 ('combined'). We use this dataset (referred to as ADS-R18-80) to train subset models with the ResNet18, ResNet34, ResNet50 and ResNet101 architectures. We compare the ADS-R18-80 dataset to a randomly subsampled 80% of ImageNet (Random-80) and the full dataset (Full-100). Our results are summarized in Table 5. As shown, on all 4 architectures, the dataset obtained by ADS achieves similar performance to training on the full dataset, with a 20% reduction in overall training time. This ability to transfer selected datasets to larger architectures has significant implications in domains where training time is crucial, such as MLPerf [1] and Neural Architecture Search [15].
Training Dataset  ResNet18  ResNet34  ResNet50  ResNet101 

Random-80  69.24  73.00  75.15  76.72 
ADS-R18-80  70.34  73.61  76.18  77.75 
Full-100  70.31  73.68  76.30  77.99 
5 Conclusion
In this paper, we presented Active Dataset Subsampling (ADS) for deep neural networks. Our approach uses an ensemble of DNNs to estimate the uncertainty of each sample in a dataset, and discards the lowest uncertainty samples during training. Our results demonstrate that ADS improves the performance of a DNN compared to training with the entire dataset on three different image classification benchmarks. Moreover, we propose a simple technique to scale up the ensemble, leading to additional accuracy gains with minimal computational overhead. Importantly, our results demonstrate that datasets obtained using ADS can be effectively reused for training new models with a different capacity from the DNN used for subsampling.
References
 [1] MLPerf benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms. https://mlperf.org/. Accessed: 2019-05-23.
 [2] William H. Beluch, Tim Genewein, Andreas Nürnberger, and Jan M. Köhler. The power of ensembles for active learning in image classification. In CVPR, 2018.
 [3] Vighnesh Birodkar, Hossein Mobahi, and Samy Bengio. Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need. arXiv e-prints, page arXiv:1901.11409, Jan 2019.
 [4] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In ICML, 2015.
 [5] Leo Breiman. Bagging predictors. Mach. Learn., 24(2):123–140, Aug 1996.
 [6] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001.
 [7] Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples. arXiv e-prints, page arXiv:1704.07433, Apr 2017.
 [8] Kashyap Chitta, Jose M. Alvarez, and Adam Lesnikowski. Large-Scale Visual Active Learning with Deep Probabilistic Ensembles. arXiv e-prints, page arXiv:1811.03575, Nov 2018.
 [9] Kashyap Chitta, Jianwei Feng, and Martial Hebert. Adaptive Semantic Segmentation with a Strategic Curriculum of Proxy Labels. arXiv e-prints, page arXiv:1811.03542, Nov 2018.
 [10] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Mach. Learn., 15(2):201–221, May 1994.
 [11] Cody Coleman, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Select via proxy: Efficient data selection for training deep networks, 2019.
 [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [13] Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. arXiv e-prints, page arXiv:1710.07283, Oct 2017.
 [14] Terrance DeVries and Graham W. Taylor. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv e-prints, page arXiv:1802.04865, Feb 2018.
 [15] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural Architecture Search: A Survey. arXiv e-prints, page arXiv:1808.05377, Aug 2018.
 [16] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
 [17] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
 [18] Yarin Gal and Zoubin Ghahramani. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. arXiv e-prints, page arXiv:1506.02158, Jun 2015.
 [19] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv e-prints, page arXiv:1506.02142, Jun 2015.
 [20] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian Active Learning with Image Data. arXiv e-prints, page arXiv:1703.02910, Mar 2017.
 [21] Yarin Gal and Lewis Smith. Sufficient Conditions for Idealised Models to Have No Adversarial Examples: a Theoretical and Empirical Study with Bayesian Neural Networks. arXiv e-prints, page arXiv:1806.00667, Jun 2018.
 [22] Alex Graves. Practical variational inference for neural networks. In NIPS, 2011.
 [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity Mappings in Deep Residual Networks. arXiv e-prints, page arXiv:1603.05027, Mar 2016.
 [25] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Lifelong learning via progressive distillation and retrospection. In ECCV, 2018.
 [26] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot Ensembles: Train 1, get M for free. arXiv eprints, page arXiv:1704.00109, Mar 2017.
 [27] Alex Kendall and Yarin Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? arXiv eprints, page arXiv:1703.04977, Mar 2017.
 [28] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [29] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.
 [30] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training Confidencecalibrated Classifiers for Detecting OutofDistribution Samples. arXiv eprints, page arXiv:1711.09325, Nov 2017.
 [31] David Lowell, Zachary C. Lipton, and Byron C. Wallace. How transferable are the datasets collected by active learners? arXiv eprints, page arXiv:1807.04801, Jul 2018.
 [32] Andrey Malinin and Mark Gales. Predictive Uncertainty Estimation via Prior Networks. arXiv eprints, page arXiv:1802.10501, Feb 2018.
 [33] Radford M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
 [34] Remus Pop and Patric Fulop. Deep Ensemble Bayesian Active Learning : Addressing the Mode Collapse issue in Monte Carlo dropout via Ensembles. arXiv eprints, page arXiv:1811.03897, Nov 2018.
 [35] Robert E. Schapire. A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence  Volume 2, IJCAI’99, pages 1401–1406, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
 [36] Ozan Sener and Silvio Savarese. Active Learning for Convolutional Neural Networks: A CoreSet Approach. arXiv eprints, page arXiv:1708.00489, Aug 2017.
 [37] Burr Settles. Active learning literature survey. Technical report, 2010.
 [38] Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 7 1948.
 [39] Lewis Smith and Yarin Gal. Understanding Measures of Uncertainty for Adversarial Example Detection. arXiv eprints, page arXiv:1803.08533, Mar 2018.
 [40] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In ICML, 2014.
 [41] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. arXiv eprints, page arXiv:1707.02968, Jul 2017.
 [42] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An Empirical Study of Example Forgetting during Deep Neural Network Learning. arXiv eprints, page arXiv:1812.05159, Dec 2018.
 [43] Kailas Vodrahalli, Ke Li, and Jitendra Malik. Are All Training Examples Created Equal? An Empirical Study. arXiv eprints, page arXiv:1811.12569, Nov 2018.
 [44] I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billionscale semisupervised learning for image classification. arXiv eprints, page arXiv:1905.00546, May 2019.