# Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples

###### Abstract

Self-paced learning and hard example mining re-weight training instances to improve learning accuracy. This paper presents two improved alternatives based on lightweight estimates of sample uncertainty in stochastic gradient descent (SGD): the variance in predicted probability of the correct class across iterations of mini-batch SGD, and the proximity of the correct class probability to the decision threshold. Extensive experimental results on six datasets show that our methods reliably improve accuracy in various network architectures, including additional gains on top of other popular training techniques, such as residual learning, momentum, ADAM, batch normalization, dropout, and distillation.


Haw-Shiuan Chang, Erik Learned-Miller, Andrew McCallum University of Massachusetts, Amherst 140 Governors Dr., Amherst, MA 01003 {hschang,elm,mccallum}@cs.umass.edu


## 1 Introduction

Learning easier material before harder material is often beneficial to human learning. Inspired by this observation, curriculum learning Bengio et al. (2009) has shown that learning first from easier instances can also improve neural network training. When it is not known a priori which samples are easy, examples with lower loss on the current model can be inferred to be easier and can be used in early training. This strategy has been referred to as self-paced learning Kumar et al. (2010). By decreasing the weight on the loss of difficult examples, the model may become more robust to outliers Meng et al. (2015), and this method has proven to be useful in several applications, especially with noisy training data Jiang et al. (2015).

Nevertheless, selecting easier examples for training often slows down the training process because easier samples usually contribute smaller gradients and the current model has already learned how to make correct predictions on these samples. On the other hand, and somewhat ironically, the opposite strategy (i.e., sampling harder instances more often) has been shown to accelerate (mini-batch) stochastic gradient descent (SGD) in some cases, where the difficulty of an example can be defined by its loss Hinton (2007); Loshchilov and Hutter (2015); Shrivastava et al. (2016) or be proportional to the magnitude of its gradient Zhao and Zhang (2014); Alain et al. (2015); Gao et al. (2015); Gopal (2016). This strategy is sometimes referred to as hard example mining Shrivastava et al. (2016).

In the literature, we can see that these two opposing strategies work well in different situations. Preferring easier examples may be effective when either machines or humans try to solve a challenging task containing more label noise or outliers. Otherwise, focusing on harder samples may accelerate and stabilize the SGD in cleaner data by minimizing the variance of the gradients Alain et al. (2015); Gao et al. (2015). However, we often do not know how noisy our training dataset is. Motivated by this practical need, this paper explores new methods of re-weighting training examples that are robust to both scenarios.

Intuitively, if our model has already predicted some examples correctly with high confidence, the samples may be too easy to contain useful information for the current model. Similarly, if some examples are always predicted incorrectly over many iterations of training, these examples may just be too difficult/noisy and would confuse the model. This suggests that we should somehow prefer uncertain samples that are predicted incorrectly sometimes during training and correctly at other times, as illustrated in Figure 1. This preference is consistent with common variance reduction strategies in active learning Settles (2010).

Previous studies suggest that finding informative samples for efficiently collecting new labels is related to selecting already-labeled samples to optimize the model parameters Guillory et al. (2009). As previously reported in Schohn and Cohn (2000); Bordes et al. (2005), models can sometimes achieve lower generalization error after being trained with only a subset of actively selected training data. In other words, focusing more on informative samples could be beneficial even when all labels are available.

We propose two lightweight methods that actively emphasize the uncertain samples to improve mini-batch SGD for classifiers. One method measures the variance of prediction probabilities, while the other one estimates the closeness between the prediction probabilities and the decision threshold. For logistic regression, both methods can be proven to reduce the uncertainty in the model parameters under reasonable approximations.

We present extensive experiments on CIFAR 10, CIFAR 100, MNIST (image classification), Question Type (sentence classification), CoNLL 2003, and OntoNote 5.0 (named entity recognition), as well as on different architectures, including multi-class logistic regression, fully-connected networks, convolutional neural networks (CNN) LeCun et al. (1998), and residual networks He et al. (2016). The results show that active bias makes neural networks more robust without prior knowledge of noise, and reduces the generalization error by 1% to 18% even on training sets having few (if any) annotation errors.

## 2 Related work

As (deep) neural networks become more widespread, many methods have recently been proposed to improve SGD training. When using (mini-batch) SGD, the randomness of the gradient sometimes slows down the optimization, so one common approach is to use the gradient computed in previous iterations to stabilize the process. Examples include momentum Qian (1999), stochastic variance reduced gradient (SVRG) Johnson and Zhang (2013), and proximal stochastic variance reduced gradient (Prox-SVRG) Xiao and Zhang (2014). Other work proposes variants of semi-stochastic algorithms to approximate the exact gradient direction and reduce the gradient variance Wang et al. (2013); Mu et al. (2016). More recently, supervised optimization methods like learning to learn Andrychowicz et al. (2016) also show great potential in this problem.

In addition to the high variance of the gradient, another issue with SGD is the difficulty of tuning the learning rate. Like Quasi-Newton methods, several methods adaptively adjust learning rates based on local curvature Amari et al. (2000); Schaul et al. (2013), while ADAGRAD Duchi et al. (2011) applies different learning rates to different dimensions. ADAM Kingma and Ba (2014) combines several of these techniques and is widely used in practice.

More recently, some studies accelerate SGD by learning each class differently Gopal (2016) or each sample differently, as we do Hinton (2007); Zhao and Zhang (2014); Loshchilov and Hutter (2015); Gao et al. (2015); Alain et al. (2015); Shrivastava et al. (2016), and their experiments suggest that the methods are often compatible with other techniques such as Prox-SVRG, ADAGRAD, or ADAM Loshchilov and Hutter (2015); Gopal (2016). Notice that Gao et al. (2015) discuss the idea of selecting uncertain examples for SGD based on active learning, but their proposed methods choose each sample according to the magnitude of its gradient as in ISSGD Alain et al. (2015), which actually prefers more difficult examples.

The aforementioned methods focus on accelerating the optimization of a fixed loss function given a fixed model. Many of these methods adopt importance sampling. That is, if the method prefers to select harder examples, the learning rate corresponding to those examples will be lower. This makes the gradient estimation unbiased Hinton (2007); Zhao and Zhang (2014); Alain et al. (2015); Gao et al. (2015); Gopal (2016), which guarantees convergence Zhao and Zhang (2014); Gopal (2016).
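The unbiasedness argument can be checked on a toy example. The sketch below assumes scalar per-sample gradients and a hypothetical sampling distribution `probs`; the function names are illustrative and do not come from any of the cited methods.

```python
def importance_weights(probs):
    """Per-sample scale 1 / (N * P_s(i)) that offsets a non-uniform sampler."""
    n = len(probs)
    return [1.0 / (n * p) for p in probs]


def expected_weighted_gradient(grads, probs):
    """Expectation of the re-scaled gradient when sample i is drawn with probs[i]."""
    weights = importance_weights(probs)
    return sum(p * w * g for p, w, g in zip(probs, weights, grads))


# Toy scalar gradients; the last (hardest) sample is drawn most often.
grads = [0.1, 0.2, 0.3, 2.0]
probs = [0.1, 0.1, 0.2, 0.6]

full_batch_mean = sum(grads) / len(grads)
estimate = expected_weighted_gradient(grads, probs)
```

Here `estimate` matches `full_batch_mean`, and the most frequently sampled example receives the smallest scale, which is the sense in which its effective learning rate is lower.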

On the other hand, to make models more robust to outliers, some approaches inject bias into the loss function in order to emphasize easier examples Pregibon (1982); Wang et al. (2016); Lee et al. (2016); Northcutt et al. (2017). Some variants of the strategy gradually increase the loss of hard examples Mandt et al. (2016b), as in self-paced learning Kumar et al. (2010). To alleviate the local minimum problem during training, other techniques that smooth the loss function have been proposed recently Chaudhari et al. (2017); Gulcehre et al. (2017). Nevertheless, to our knowledge, it remains an unsolved challenge to balance the easy and difficult training examples so as to facilitate training while remaining robust to outliers.

## 3 Methods

### 3.1 Baselines

Due to its simplicity and generally good performance, the most widely used version of SGD samples each training instance uniformly. This basic strategy has two variants. The first samples with replacement. Let $D = \{(x_i, y_i)\}_i$ indicate the training dataset. The probability of selecting each sample is equal (i.e., $P_s(i \mid D) = \frac{1}{|D|}$), so we call it SGD Uniform (SGD-Uni). The second samples without replacement. Let $S_e$ be the set of samples we have already selected in the current epoch. Then, the sampling probability $P_s(i \mid S_e, D)$ would become $\frac{1}{|D| - |S_e|} \mathbb{1}_{i \notin S_e}$, where $\mathbb{1}$ is an indicator function. This version scans through all of the examples in each epoch, so we call it SGD-Scan.

We propose a simple baseline which selects harder examples with higher probability, as Loshchilov and Hutter (2015) did. Specifically, we let $P_s(i \mid H, S_e, D) \propto 1 - \bar{p}_{H_i^{t-1}}(y_i \mid x_i) + \epsilon_D$, where $H_i^{t-1}$ is the history of prediction probability which stores all $p(y_i \mid x_i)$ when $x_i$ is selected to train the network before the current iteration $t$, $H = \bigcup_i H_i^{t-1}$, $\bar{p}_{H_i^{t-1}}(y_i \mid x_i)$ is the average probability of classifying sample $i$ into its correct class $y_i$ over all the $p(y_i \mid x_i)$ stored in $H_i^{t-1}$, and $\epsilon_D$ is a smoothness constant. Notice that by only considering the $p(y_i \mid x_i)$ in $H_i^{t-1}$, we do not need to perform extra forward passes. We refer to this simple baseline as SGD Sampled by Difficulty (SGD-SD).

In practice, SGD-Scan often works better than SGD-Uni because it ensures that the model sees all of the training examples in each epoch. To emphasize difficult examples while applying SGD-Scan, we weight each sample differently in the loss function. That is, the loss function is modified as $L = \sum_i v_i \, \mathrm{loss}_i(W) + \lambda_R R(W)$, where $W$ are the parameters in the model, $\mathrm{loss}_i(W)$ is the prediction loss, and $\lambda_R R(W)$ is the regularization term of the model. The weight of the $i$th sample $v_i$ can be set as $\frac{1}{N_D}\left(1 - \bar{p}_{H_i^{t-1}}(y_i \mid x_i) + \epsilon_D\right)$, where $N_D$ is a normalization constant making the average of $v_i$ equal 1. We want to keep the average of $v_i$ the same so that we do not change the global learning rate. We denote this method as SGD Weighted by Difficulty (SGD-WD).

The model usually cannot fit outliers well, so SGD-SD and SGD-WD would not be robust to noise. To make the gradient estimate unbiased, importance sampling can be used. That is, we let $P_s(i \mid H, S_e, D) \propto 1 - \bar{p}_{H_i^{t-1}}(y_i \mid x_i) + \epsilon_D$ and $v_i \propto \left(1 - \bar{p}_{H_i^{t-1}}(y_i \mid x_i) + \epsilon_D\right)^{-1}$, which is similar to the approach from Hinton (2007). We refer to this as SGD Importance-Sampled by Difficulty (SGD-ISD).

On the other hand, we propose two simple baselines which emphasize the easy examples, as self-paced learning does. Based on the same naming convention, SGD Sampled by Easiness (SGD-SE) means $P_s(i \mid H, S_e, D) \propto \bar{p}_{H_i^{t-1}}(y_i \mid x_i) + \epsilon_E$, while SGD Weighted by Easiness (SGD-WE) sets $v_i = \frac{1}{N_E}\left(\bar{p}_{H_i^{t-1}}(y_i \mid x_i) + \epsilon_E\right)$, where $N_E$ normalizes $v_i$'s average to be 1.
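The two weighting baselines can be sketched in a few lines. This is a minimal illustration, assuming each sample's history is a plain list of previously recorded correct-class probabilities; the function names and the default smoothing value `eps=0.1` are placeholders rather than the settings used in the experiments.

```python
def mean_history(history):
    """Average stored correct-class probability for one sample."""
    return sum(history) / len(history)


def normalize_to_mean_one(raw):
    """Rescale raw scores so their average is 1 (keeps the global learning rate)."""
    total = sum(raw)
    return [r * len(raw) / total for r in raw]


def difficulty_weights(histories, eps=0.1):
    """SGD-WD style weights: larger when the average correct-class probability is low."""
    return normalize_to_mean_one([1.0 - mean_history(h) + eps for h in histories])


def easiness_weights(histories, eps=0.1):
    """SGD-WE style weights: larger when the average correct-class probability is high."""
    return normalize_to_mean_one([mean_history(h) + eps for h in histories])


# Three samples: easy, uncertain, and hard (hypothetical recorded probabilities).
histories = [[0.9, 0.95, 0.99], [0.3, 0.7, 0.5], [0.05, 0.1, 0.02]]
wd = difficulty_weights(histories)
we = easiness_weights(histories)
```

Both weight vectors average to 1; `wd` ranks the hard sample highest, while `we` ranks the easy sample highest. Neither singles out the uncertain middle sample, which is what the methods in the following sections aim to do.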

### 3.2 Prediction Variance

In the active learning setting, the prediction variance can be used to measure the uncertainty of each sample for either a regression or classification problem Schein and Ungar (2007). In order to gain more information at each SGD iteration, we prefer to choose the samples with high prediction variances.

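Since the prediction variances are estimated on the fly, we would like to balance exploration and exploitation. Adopting the optimism in the face of uncertainty heuristic from the bandit problem Bubeck et al. (2012), we draw the next sample based on the estimated prediction variance plus its confidence interval. Specifically, for SGD Sampled by Prediction Variance (SGD-SPV), we let

$$P_s(i \mid H, S_e, D) \propto \widehat{\mathrm{std}}^{\mathrm{conf}}_i(H) + \epsilon_V, \quad \text{where} \quad \widehat{\mathrm{std}}^{\mathrm{conf}}_i(H) = \sqrt{\widehat{\mathrm{var}}\left(p_{H_i^{t-1}}(y_i \mid x_i)\right) + \frac{\widehat{\mathrm{var}}\left(p_{H_i^{t-1}}(y_i \mid x_i)\right)^2}{|H_i^{t-1}| - 1}}, \tag{1}$$

$\widehat{\mathrm{var}}\left(p_{H_i^{t-1}}(y_i \mid x_i)\right)$ is the prediction variance estimated from the history $H_i^{t-1}$ of prediction probabilities stored for sample $i$, and $|H_i^{t-1}|$ is the number of stored prediction probabilities. Assuming $p_{H_i^{t-1}}(y_i \mid x_i)$ is normally distributed, the variance of the prediction variance estimate can be measured by $2\,\widehat{\mathrm{var}}\left(p_{H_i^{t-1}}(y_i \mid x_i)\right)^2 / \left(|H_i^{t-1}| - 1\right)$.

As in the baselines, adding the smoothness constant $\epsilon_V$ prevents the low variance instances from never being selected again. Similarly, another variant of the method sets $v_i = \frac{1}{N_V}\left(\widehat{\mathrm{std}}^{\mathrm{conf}}_i(H) + \epsilon_V\right)$, where $N_V$ normalizes $v_i$ like other weighted methods; we call this SGD Weighted by Prediction Variance (SGD-WPV).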
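A minimal sketch of a variance-based weight follows. It assumes the optimistic estimate adds `var**2 / (n - 1)` to the sample variance before taking the square root, which is one way to realize "variance plus its confidence interval" under the normality assumption; the function names and the value `eps=0.2` are illustrative only.

```python
import math
import statistics


def std_conf(history):
    """Optimistic standard-deviation estimate from one sample's probability history.

    Assumption: the bound adds var^2 / (n - 1) to the sample variance before
    taking the square root, so short histories receive a larger bonus.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two stored probabilities")
    var = statistics.variance(history)  # unbiased sample variance
    return math.sqrt(var + var * var / (n - 1))


def prediction_variance_weights(histories, eps=0.2):
    """SGD-WPV style weights, normalized so that their average is 1."""
    raw = [std_conf(h) + eps for h in histories]
    total = sum(raw)
    return [r * len(raw) / total for r in raw]


# Hypothetical histories: consistently right, fluctuating, consistently wrong.
histories = [[0.97, 0.98, 0.99], [0.2, 0.8, 0.4], [0.02, 0.03, 0.01]]
weights = prediction_variance_weights(histories)
```

The fluctuating sample gets the largest weight, while the consistently right and consistently wrong samples are treated alike, since only the spread of the history matters here and not its mean.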

Example: logistic regression

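Given a Gaussian prior $\Pr(W) = \mathcal{N}(W \mid 0, s_0 I)$ on the parameters, consider the probabilistic interpretation of logistic regression:

$$-\log \Pr(Y, W \mid X) = -\sum_i \log p(y_i \mid x_i, W) + \frac{1}{2 s_0} \|W\|^2 + \text{const}, \tag{2}$$

where $p(y_i \mid x_i, W) = \frac{1}{1 + \exp(-y_i W^\top x_i)}$, and $y_i \in \{1, -1\}$.

Since the posterior distribution of $W$ is log-concave Rennie (2005), we can use $\Pr(W \mid Y, X) \approx \mathcal{N}(W \mid W_N, S_N)$, where $W_N$ is the maximum a posteriori (MAP) estimate, and

$$S_N^{-1} = -\nabla\nabla \log \Pr(W \mid Y, X) = \sum_i p(y_i \mid x_i, W_N)\left(1 - p(y_i \mid x_i, W_N)\right) x_i x_i^\top + \frac{1}{s_0} I. \tag{3}$$

Then, we further approximate $p(y_i \mid x_i, W)$ using the Taylor expansion $p(y_i \mid x_i, W) \approx p(y_i \mid x_i, W_N) + g_i(W_N)^\top (W - W_N)$, where $g_i(W) = p(y_i \mid x_i, W)\left(1 - p(y_i \mid x_i, W)\right) y_i x_i$. We can compute the prediction variance Schein and Ungar (2007) with respect to the uncertainty of $W$:

$$\mathrm{Var}\left(p(y_i \mid x_i, W)\right) \approx g_i(W_N)^\top S_N \, g_i(W_N). \tag{4}$$

These approximations tell us several things. First, $\mathrm{Var}\left(p(y_i \mid x_i, W)\right)$ is proportional to $p(y_i \mid x_i, W_N)^2 \left(1 - p(y_i \mid x_i, W_N)\right)^2$, so the prediction variance is larger when the sample is closer to the boundary. Second, when we have more sample points close to the boundary, the variance of the parameters $S_N$ would be smaller. That is, when we emphasize the samples with high prediction variance, the uncertainty of the parameters tends to be reduced, like the variance reduction strategy in active learning MacKay (1992). Notice that there are other methods that can measure the prediction uncertainty, such as the mutual information between labels and parameters Houlsby et al. (2011), but we found that prediction variance works better in our experiments.

After $W$ is close to a local minimum using SGD, the parameters estimated in each iteration can be viewed as approximately drawn samples from the posterior distribution of the parameters Mandt et al. (2016a). Therefore, after running SGD long enough, the variance $\widehat{\mathrm{var}}\left(p_{H_i^{t-1}}(y_i \mid x_i)\right)$ estimated from the stored prediction history can be used to approximate $\mathrm{Var}\left(p(y_i \mid x_i, W)\right)$. After applying the approximations in logistic regression, the distribution of $p(y_i \mid x_i, W)$ would become Gaussian, which justifies our previous assumption of a normally distributed $p_{H_i^{t-1}}(y_i \mid x_i)$ for confidence estimation of the prediction variance. In this simple example, we can also see that the gradient magnitude is proportional to the difficulty because $\left\|\nabla_W \log p(y_i \mid x_i, W)\right\| = \left(1 - p(y_i \mid x_i, W)\right) \|x_i\|$. This is why we believe the SGD acceleration methods based on gradient magnitude Alain et al. (2015); Gopal (2016) can be categorized as variants of preferring difficult examples.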
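The claim that the prediction variance peaks near the boundary can be made explicit with one substitution. The step below is a sketch that assumes labels $y_i \in \{1, -1\}$, writes $p_i$ as shorthand for the correct-class probability at the MAP estimate $W_N$, and uses $S_N$ for the covariance of the Gaussian posterior approximation:

$$
\begin{aligned}
g_i(W_N) &= \nabla_W \, p(y_i \mid x_i, W)\big|_{W = W_N} = p_i (1 - p_i)\, y_i x_i, \\
g_i(W_N)^\top S_N \, g_i(W_N) &= p_i^2 (1 - p_i)^2 \, y_i^2 \, x_i^\top S_N x_i = p_i^2 (1 - p_i)^2 \, x_i^\top S_N x_i .
\end{aligned}
$$

The factor $p_i^2 (1 - p_i)^2$ is maximized at $p_i = 0.5$ and vanishes as $p_i$ approaches 0 or 1, so both very easy and very hard samples contribute little prediction variance for a fixed $x_i^\top S_N x_i$.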

Figure 2 illustrates a toy example. Given the same learning rate, we can see that the normal SGD in Figure 2(c) and 2(d) will have higher uncertainty when there are many outliers, and emphasizing difficult examples in Figure 2(e) and 2(f) makes it worse. On the other hand, the samples near the boundaries would have higher prediction variance (i.e., larger circles or crosses in Figure 2(h)) and thus higher impact on the loss function in SGD-WPV.

### 3.3 Threshold Closeness

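Motivated by the previous analysis, we propose a simpler and more direct approach to select samples whose correct class probability is close to the decision threshold. SGD Sampled by Threshold Closeness (SGD-STC) makes $P_s(i \mid H, S_e, D) \propto \bar{p}_{H_i^{t-1}}(y_i \mid x_i)\left(1 - \bar{p}_{H_i^{t-1}}(y_i \mid x_i)\right) + \epsilon_T$, where $\bar{p}_{H_i^{t-1}}(y_i \mid x_i)$ is the average stored probability of the correct class, as in Section 3.1, and $\epsilon_T$ is a smoothness constant. When there are multiple classes, this measures the closeness to the threshold distinguishing the correct class from the combination of the rest of the classes. The method is similar to an approximation of the optimal allocation in stratified sampling proposed by Druck and McCallum (2011).

Similarly, SGD Weighted by Threshold Closeness (SGD-WTC) chooses the weight of the $i$th sample $v_i = \frac{1}{N_T}\left(\bar{p}_{H_i^{t-1}}(y_i \mid x_i)\left(1 - \bar{p}_{H_i^{t-1}}(y_i \mid x_i)\right) + \epsilon_T\right)$, where $N_T$ is a normalization constant making the average of $v_i$ equal 1. Although other uncertainty estimates such as entropy are widely used in active learning and can also be viewed as a measurement of the boundary closeness, we found the proposed formula works better in our experiments.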
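The threshold-closeness weight is simple enough to sketch directly. The snippet assumes the score is `p * (1 - p)` of the averaged correct-class probability plus a smoothing constant; the function name and `eps=0.05` are illustrative choices.

```python
def threshold_closeness_weights(avg_probs, eps=0.05):
    """SGD-WTC style weights: p * (1 - p) + eps, rescaled to have mean 1."""
    raw = [p * (1.0 - p) + eps for p in avg_probs]
    total = sum(raw)
    return [r * len(raw) / total for r in raw]


# Hypothetical average correct-class probabilities for five samples.
avg_probs = [0.01, 0.25, 0.5, 0.75, 0.99]
weights = threshold_closeness_weights(avg_probs)
```

The weights are symmetric around 0.5 and largest there, so a sample that is almost always wrong is down-weighted just as much as one that is almost always right. Unlike the variance-based score, this needs only the running mean of each sample's history.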

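When using logistic regression, after injecting the bias $v_i$ into the loss function, approximating the prediction probability based on previous history, and removing the regularization and smoothness constant (i.e., $p(y_i \mid x_i, W) \approx \bar{p}_{H_i^{t-1}}(y_i \mid x_i)$, $1/s_0 = 0$, and $\epsilon_T = 0$), we can show that

$$\sum_i \mathrm{Var}\left(p(y_i \mid x_i, W)\right) \approx \sum_i g_i(W)^\top S_N \, g_i(W) = \dim(W), \tag{5}$$

where $g_i$ and $S_N$ are the gradient and posterior covariance from the logistic regression example in Section 3.2, and $\dim(W)$ is the dimension of the parameters $W$. Because the sum stays fixed at $\dim(W)$, the average prediction variance decreases in inverse proportion to the number of training instances.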
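The identity can be verified with a short trace argument. The derivation below is a sketch under the stated simplifications, additionally assuming that the normalization constant of the weights is dropped (so $v_i = p_i(1 - p_i)$, with $p_i$ the correct-class probability), that $y_i \in \{1, -1\}$, and that $\sum_i g_i g_i^\top$ is invertible:

$$
\begin{aligned}
S_N^{-1} &= \sum_i v_i \, p_i (1 - p_i)\, x_i x_i^\top = \sum_i p_i^2 (1 - p_i)^2 \, x_i x_i^\top = \sum_i g_i g_i^\top, \\
\sum_i g_i^\top S_N \, g_i &= \operatorname{tr}\!\Bigl( S_N \sum_i g_i g_i^\top \Bigr) = \operatorname{tr}(I) = \dim(W).
\end{aligned}
$$

The weighted Hessian coincides with the sum of outer products of the prediction gradients, so the total prediction variance no longer depends on the number of samples.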

## 4 Experiments

Table 1: Model architectures and hyper-parameters.

| Dataset | # Conv layers | Filter size | Filter number | # Pooling layers | # BN layers | # FC layers | Dropout keep probs | L2 reg |
|---|---|---|---|---|---|---|---|---|
| MNIST | 2 | 5x5 | 32, 64 | 2 | 0 | 2 | 0.5 | 0.0005 |
| CIFAR 10 | 0 | N/A | N/A | 0 | 0 | 1 | 1 | 0.01 |
| CIFAR 100 | 26 or 62 | 3x3 | 16, 32, 64 | 0 | 13 or 31 | 1 | 1 | 0 |
| Question Type | 1 | (2,3,4)x1 | 64 | 1 | 0 | 1 | 0.5 | 0.01 |
| CoNLL 2003, OntoNote 5.0 | 3 | 3x1 | 100 | 0 | 0 | 1 | 0.5, 0.75 | 0.001 |
| MNIST | 0 | N/A | N/A | 0 | 0 | 2 | 1 | 0 |

Table 2: Optimization and experiment setups.

| Dataset | Optimizer | Batch size | Learning rate | Learning rate decay | # Epochs | # Burn-in epochs | # Trials |
|---|---|---|---|---|---|---|---|
| MNIST | Momentum | 64 | 0.01 | 0.95 | 80 | 2 | 20 |
| CIFAR 10 | SGD | 100 | 1e-6 | 0.5 (per 5 epochs) | 30 | 10 | 30 |
| CIFAR 100 | Momentum | 128 | 0.1 | 0.1 (at 80, 100, 120 epochs) | 150 | 90 or 50 | 20 |
| Question Type | ADAM | 64 | 0.001 | 1 | 150 | 20 | 100 |
| CoNLL 2003, OntoNote 5.0 | ADAM | 128 | 0.0005 | 1 | 200 | 30 | 10 |
| MNIST | SGD | 128 | 0.1 | 1 | 60 | 20 | 10 |

We test our methods on six different datasets. The results show that the active bias techniques consistently outperform the standard uniform sampling (i.e., SGD-Uni and SGD-Scan) in the deep models as well as the shallow models. For each dataset, we use an existing, publicly available implementation for the problem and emphasize samples using different methods. The architectures and hyper-parameters are summarized in Table 1. All neural networks use softmax and cross-entropy loss at the last layer. The optimization and experiment setups are listed in Table 2. As shown in the second column of the table, SGD in CNNs and residual networks actually refers to momentum or ADAM instead of vanilla SGD. All experiments use mini-batches.

Like most of the widely used neural network training techniques, the proposed techniques are not applicable to every scenario. For all the datasets we tried, we found that the proposed methods are not sensitive to the hyper-parameter setup except when applying a very complicated model to a relatively small dataset. If a complicated model achieves 100% training accuracy within a few epochs, the most uncertain examples would often be outliers, biasing the model towards overfitting.

To avoid this scenario, we modify the default hyper-parameters setup in the implementation of the text classifiers in Section 4.3 and Section 4.4 to achieve similar performance using simplified models. For all other models and datasets, we use the default hyper-parameters of the existing implementations, which should favor the SGD-Uni or SGD-Scan methods, since the default hyper-parameters are optimized for these cases. To show the reliability of the proposed methods, we do not optimize the hyper-parameters for the proposed methods or baselines.

Due to the randomness in all the SGD variants, we repeat experiments and list the number of trials in Table 2. At the beginning of each trial, network weights are trained with uniform sampling SGD until validation performance starts to saturate. After these burn-in epochs, we apply different sampling/weighting methods and compare performance. The number of burn-in epochs is determined by cross-validation, and the number of epochs in each trial is set large enough to let the testing error of most methods converge. In Tables 3 and 4, we evaluate the testing performance of each method after each epoch and report the best testing performance among epochs within each trial.
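The reporting protocol described above can be sketched as follows. The snippet assumes the metric is a test error rate (lower is better) and that the spread across trials is summarized by the standard error of the mean; both the function names and the numbers are hypothetical.

```python
import math
import statistics


def best_error_per_trial(test_errors_by_trial):
    """For each trial, keep the lowest test error observed over its epochs."""
    return [min(epoch_errors) for epoch_errors in test_errors_by_trial]


def mean_and_standard_error(values):
    """Mean across trials and the standard error of that mean."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical test error rates (%), one inner list per trial, one entry per epoch.
test_errors_by_trial = [
    [0.80, 0.61, 0.55, 0.58],
    [0.79, 0.60, 0.57, 0.56],
    [0.82, 0.59, 0.54, 0.60],
]
best = best_error_per_trial(test_errors_by_trial)
mean, sem = mean_and_standard_error(best)
```

Taking the best epoch within each trial before averaging means that every method is compared at its own most favorable stopping point.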

As previously discussed, there are various versions preferring easy or difficult examples. Some of them require extra time to collect necessary statistics such as the gradient magnitude of each sample Gao et al. (2015); Alain et al. (2015), change the network architecture Gulcehre et al. (2017); Shrivastava et al. (2016), or involve an annealing schedule like self-paced learning Kumar et al. (2010); Mandt et al. (2016b). We tried self-paced learning on CIFAR 10 but found that performance usually remains the same and is sometimes sensitive to the hyper-parameters of the annealing schedule. This finding is consistent with Avramova (2015). To simplify the comparison, we focus on testing the effects of steady bias based on sample difficulty (e.g., compare with SGD-SE and SGD-SD) and do not gradually change the preference during the training like self-paced learning.

It is not always easy to change the sampling procedure because of the model or implementation constraints. For example, in sequence labeling tasks (CoNLL 2003 and OntoNote 5.0), the words in the same sentence need to be trained together. Thus, we only compare methods which modify the loss function (SGD-W*) with SGD-Scan for some models. For the other experiments, re-weighting examples (SGD-W*) generally gives us better performance than changing the sampling distribution (SGD-S*). It might be because we can better estimate the statistics of each sample.

### 4.1 MNIST

We apply our method to a CNN LeCun et al. (1998) for MNIST (http://yann.lecun.com/exdb/mnist/) using one of the TensorFlow tutorials (https://github.com/tensorflow/models/blob/master/tutorials/image/mnist). The dataset has high testing accuracy, so most of the examples are too easy for the model after a few epochs. Selecting more difficult instances can accelerate learning or improve testing accuracy Hinton (2007); Loshchilov and Hutter (2015); Gopal (2016). The results from SGD-SD and SGD-WD confirm this finding, while selecting uncertain examples can give us a similar or larger boost. Furthermore, we test the robustness of our methods by randomly reassigning the labels of 10% of the images, and the results indicate that SGD-WPV improves the performance of SGD-Scan even more, while SGD-SD overfits the data seriously.

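Table 3: Best testing performance within each trial for the sampling methods (SGD-S*). LR: logistic regression; QT: Question Type.

| Datasets | Model | SGD-Uni | SGD-SD | SGD-ISD | SGD-SE | SGD-SPV | SGD-STC |
|---|---|---|---|---|---|---|---|
| MNIST | CNN | 0.55 ± 0.01 | 0.52 ± 0.01 | 0.57 ± 0.01 | 0.54 ± 0.01 | 0.51 ± 0.01 | 0.51 ± 0.01 |
| Noisy MNIST | CNN | 0.83 ± 0.01 | 1.00 ± 0.01 | 0.84 ± 0.01 | 0.69 ± 0.01 | 0.64 ± 0.01 | 0.63 ± 0.01 |
| CIFAR 10 | LR | 62.49 ± 0.06 | 63.14 ± 0.06 | 62.48 ± 0.07 | 60.87 ± 0.06 | 60.66 ± 0.06 | 61.00 ± 0.06 |
| QT | CNN | 2.19 ± 0.02 | 2.03 ± 0.02 | 2.20 ± 0.02 | 2.28 ± 0.02 | 2.08 ± 0.02 | 2.08 ± 0.02 |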

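Table 4: Best testing performance within each trial for the weighting methods (SGD-W*). LR: logistic regression; RN 27 and RN 63: residual networks with 27 and 63 layers; FC: fully-connected network; QT: Question Type.

| Datasets | Model | SGD-Scan | SGD-WD | SGD-WE | SGD-WPV | SGD-WTC |
|---|---|---|---|---|---|---|
| MNIST | CNN | 0.54 ± 0.01 | 0.48 ± 0.01 | 0.56 ± 0.01 | 0.48 ± 0.01 | 0.48 ± 0.01 |
| Noisy MNIST | CNN | 0.81 ± 0.01 | 0.92 ± 0.01 | 0.72 ± 0.01 | 0.61 ± 0.02 | 0.63 ± 0.01 |
| CIFAR 10 | LR | 62.48 ± 0.06 | 63.10 ± 0.06 | 60.88 ± 0.06 | 60.61 ± 0.06 | 61.02 ± 0.06 |
| CIFAR 100 | RN 27 | 34.04 ± 0.06 | 34.55 ± 0.06 | 33.65 ± 0.07 | 33.69 ± 0.07 | 33.64 ± 0.07 |
| CIFAR 100 | RN 63 | 30.70 ± 0.06 | 31.57 ± 0.09 | 29.92 ± 0.09 | 30.02 ± 0.08 | 30.16 ± 0.09 |
| QT | CNN | 2.24 ± 0.02 | 1.93 ± 0.02 | 2.30 ± 0.02 | 1.99 ± 0.02 | 2.02 ± 0.02 |
| CoNLL 2003 | CNN | 11.62 ± 0.04 | 11.50 ± 0.05 | 11.73 ± 0.04 | 11.24 ± 0.06 | 11.18 ± 0.03 |
| OntoNote 5.0 | CNN | 17.80 ± 0.05 | 17.65 ± 0.06 | 18.40 ± 0.05 | 17.82 ± 0.03 | 17.51 ± 0.05 |
| MNIST | FC | 2.85 ± 0.03 | 2.17 ± 0.01 | 3.08 ± 0.03 | 2.68 ± 0.02 | 2.34 ± 0.03 |
| MNIST (distill) | FC | 2.27 ± 0.01 | 2.13 ± 0.02 | 2.35 ± 0.01 | 2.18 ± 0.02 | 2.07 ± 0.02 |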

### 4.2 CIFAR 10 and CIFAR 100

We test a simple multi-class logistic regression (https://cs231n.github.io/assignments2016/assignment2/) on CIFAR 10 Krizhevsky and Hinton (2009) (https://www.cs.toronto.edu/~kriz/cifar.html). Images are down-sampled significantly to $32 \times 32 \times 3$, so many examples are difficult, even for humans. SGD-SPV and SGD-SE perform significantly better than SGD-Uni here, consistent with the idea that avoiding difficult examples increases robustness to outliers.

For CIFAR 100 Krizhevsky and Hinton (2009), we demonstrate that the proposed approaches can also work in very deep residual networks He et al. (2016) (https://github.com/tensorflow/models/tree/master/resnet). To show the method is not sensitive to the network depth and the number of burn-in epochs, we present results from the network with 27 layers and 90 burn-in epochs as well as the network with 63 layers and 50 burn-in epochs. Without changing architectures, emphasizing uncertain or easy examples gains around 0.5% in both settings, which is significant considering the fact that the much deeper network shows only 3% improvement here.

When training a neural network, gradually reducing the learning rate (i.e., the magnitude of gradients) usually improves performance. When difficult examples are sampled less, the magnitude of gradients would be reduced. Thus, some of the improvement of SGD-SPV and SGD-SE might come from using a lower effective learning rate. Nevertheless, since we apply an aggressive learning rate decay in the experiments on CIFAR 10 and CIFAR 100, we know that the improvements from SGD-SPV and SGD-SE cannot be entirely explained by their lower effective learning rate.

### 4.3 Question Type

To investigate whether our methods are effective for smaller text datasets, we apply them to a sentence classification task, which we refer to as the Question Type (QT) dataset Li and Roth (2002) (https://cogcomp.cs.illinois.edu/Data/QA/QC/). We use the CNN architecture proposed by Kim (2014) (https://github.com/dennybritz/cnn-text-classification-tf). Like many other NLP tasks, the dataset is relatively small, and this CNN classifier does not inject noise into its inputs, unlike the implementation of residual networks used for CIFAR 100, so this complicated model reaches 100% training accuracy within a few epochs.

To address this, we reduced the model complexity by (i) decreasing the number of filters from 128 to 64, (ii) decreasing convolutional filter widths from 3,4,5 to 2,3,4, (iii) adding L2 regularization with scale 0.01, and (iv) performing PCA to reduce the dimension of the pre-trained word embedding from 300 to 50 and fixing the word embedding during training. The smaller model achieves better performance compared with the results from the original paper Kim (2014). As with MNIST, most examples are too easy for the model, so preferring hard examples is effective, while the proposed active bias methods can achieve comparable performance and are better than SGD-Uni and SGD-Scan.

### 4.4 Sequence Tagging Tasks

We also test our methods on Named Entity Recognition (NER) in the CoNLL 2003 Tjong Kim Sang and De Meulder (2003) and OntoNote 5.0 Hovy et al. (2006) datasets using the CNN from Strubell et al. (2017) (https://github.com/iesl/dilated-cnn-ner). Similar to Question Type, the model is too complex for our approaches, so we (i) only use 3 layers instead of 4 layers, (ii) reduce the number of filters from 300 to 100, (iii) add 0.001 L2 regularization,

### 4.5 Distillation

Although state-of-the-art neural networks in many applications memorize examples easily Zhang et al. (2017), much simpler models can usually achieve similar performance, as in the previous two experiments. In practice, such models are often preferable due to their low computation and memory requirements. We have shown that the proposed method can improve these smaller models, as distillation does Hinton et al. (2014), so it is natural to check whether our methods can work well with distillation. We use an implementation (https://github.com/akamaus/mnist-distill) that distills a shallow CNN with 3 convolution layers to a 2-layer fully-connected network on MNIST. The teacher network can achieve 0.8% testing error, and the temperature of softmax is set as 1.

Our approaches and baselines simply apply the sample dependent weights to the final loss function (i.e., cross-entropy of the true labels plus cross-entropy of the prediction probability from the teacher network). In MNIST, SGD-WTC and SGD-WD can achieve similar or better improvements compared with adding distillation into SGD-Scan. Furthermore, the best performance comes from the distillation plus SGD-WTC, which shows that active bias is compatible with distillation in this dataset.
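The weighted distillation objective for a single sample can be sketched as below. The snippet assumes a softmax temperature of 1, equal mixing of the two cross-entropy terms, and hypothetical class probabilities; the function names are illustrative and are not taken from the implementation used in the experiments.

```python
import math


def cross_entropy(target_dist, predicted_dist):
    """Cross-entropy H(target, predicted) between two discrete distributions."""
    return -sum(t * math.log(q) for t, q in zip(target_dist, predicted_dist) if t > 0)


def weighted_distillation_loss(weight, true_label, teacher_probs, student_probs):
    """Sample weight times (CE against the true label + CE against the teacher)."""
    one_hot = [1.0 if k == true_label else 0.0 for k in range(len(student_probs))]
    hard_term = cross_entropy(one_hot, student_probs)
    soft_term = cross_entropy(teacher_probs, student_probs)
    return weight * (hard_term + soft_term)


# Hypothetical 3-class example with softmax temperature 1.
student = [0.6, 0.3, 0.1]
teacher = [0.8, 0.15, 0.05]
loss_unweighted = weighted_distillation_loss(1.0, 0, teacher, student)
loss_emphasized = weighted_distillation_loss(1.5, 0, teacher, student)
```

Because the weight multiplies the whole per-sample loss, it scales the gradients from the true label and from the teacher equally; only the relative emphasis between samples changes.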

## 5 Conclusion

Deep learning researchers often gain accuracy by employing training techniques such as momentum, dropout, batch normalization, and distillation. This paper presents a new compatible sibling to these methods, which we recommend for wide use. Our relatively simple and computationally lightweight techniques emphasize the uncertain examples (i.e., SGD-*PV and SGD-*TC). The experiments confirm that the proper bias can be beneficial to generalization performance, and the active bias techniques consistently lead to a more accurate and robust neural network in various tasks as long as the classifier does not memorize all the training samples easily.

## Acknowledgements

This material is based on research sponsored by National Science Foundation under Grant No. 1514053 and by DARPA under agreement number FA8750-13-2-0020 and HR0011-15-2-0036. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

## References

- Alain et al. [2015] G. Alain, A. Lamb, C. Sankar, A. Courville, and Y. Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.
- Amari et al. [2000] S.-I. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.
- Andrychowicz et al. [2016] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.
- Avramova [2015] V. Avramova. Curriculum learning with deep convolutional neural networks, 2015.
- Bengio et al. [2009] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
- Bordes et al. [2005] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6(Sep):1579–1619, 2005.
- Bubeck et al. [2012] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Chaudhari et al. [2017] P. Chaudhari, A. Choromanska, S. Soatto, and Y. LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
- Collobert et al. [2011] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
- Druck and McCallum [2011] G. Druck and A. McCallum. Toward interactive training and evaluation. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 947–956. ACM, 2011.
- Duchi et al. [2011] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Gao et al. [2015] J. Gao, H. Jagadish, and B. C. Ooi. Active sampler: Light-weight accelerator for complex data analytics at scale. arXiv preprint arXiv:1512.03880, 2015.
- Gopal [2016] S. Gopal. Adaptive sampling for SGD by exploiting side information. In ICML, 2016.
- Guillory et al. [2009] A. Guillory, E. Chastain, and J. A. Bilmes. Active learning as non-convex optimization. In AISTATS, 2009.
- Gulcehre et al. [2017] C. Gulcehre, M. Moczulski, F. Visin, and Y. Bengio. Mollifying networks. In ICLR, 2017.
- He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Hinton et al. [2014] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.
- Hinton [2007] G. E. Hinton. To recognize shapes, first learn to generate images. Progress in brain research, 165:535–547, 2007.
- Houlsby et al. [2011] N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
- Hovy et al. [2006] E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel. OntoNotes: the 90% solution. In HLT-NAACL, 2006.
- Jiang et al. [2015] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann. Self-paced curriculum learning. In AAAI, 2015.
- Johnson and Zhang [2013] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
- Kim [2014] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
- Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky and Hinton [2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
- Kumar et al. [2010] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.
- LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lee et al. [2016] G.-H. Lee, S.-W. Yang, and S.-D. Lin. Toward implicit sample noise modeling: Deviation-driven matrix factorization. arXiv preprint arXiv:1610.09274, 2016.
- Li and Roth [2002] X. Li and D. Roth. Learning question classifiers. In COLING, 2002.
- Loshchilov and Hutter [2015] I. Loshchilov and F. Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
- MacKay [1992] D. J. MacKay. Information-based objective functions for active data selection. Neural computation, 4(4):590–604, 1992.
- Mandt et al. [2016a] S. Mandt, M. D. Hoffman, and D. M. Blei. A variational analysis of stochastic gradient algorithms. In ICML, 2016a.
- Mandt et al. [2016b] S. Mandt, J. McInerney, F. Abrol, R. Ranganath, and D. Blei. Variational tempering. In AISTATS, 2016b.
- Meng et al. [2015] D. Meng, Q. Zhao, and L. Jiang. What objective does self-paced learning indeed optimize? arXiv preprint arXiv:1511.06049, 2015.
- Mu et al. [2016] Y. Mu, W. Liu, X. Liu, and W. Fan. Stochastic gradient made stable: A manifold propagation approach for large-scale optimization. IEEE Transactions on Knowledge and Data Engineering, 2016.
- Northcutt et al. [2017] C. G. Northcutt, T. Wu, and I. L. Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv preprint arXiv:1705.01936, 2017.
- Pregibon [1982] D. Pregibon. Resistant fits for some commonly used logistic models with medical applications. Biometrics, pages 485–498, 1982.
- Qian [1999] N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
- Rennie [2005] J. D. Rennie. Regularized logistic regression is strictly convex. Unpublished manuscript. URL: people.csail.mit.edu/jrennie/writing/convexLR.pdf, 2005.
- Schaul et al. [2013] T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In ICML, 2013.
- Schein and Ungar [2007] A. I. Schein and L. H. Ungar. Active learning for logistic regression: an evaluation. Machine Learning, 68(3):235–265, 2007.
- Schohn and Cohn [2000] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In ICML, 2000.
- Settles [2010] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
- Shrivastava et al. [2016] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
- Strubell et al. [2017] E. Strubell, P. Verga, D. Belanger, and A. McCallum. Fast and accurate sequence labeling with iterated dilated convolutions. arXiv preprint arXiv:1702.02098, 2017.
- Tjong Kim Sang and De Meulder [2003] E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In HLT-NAACL, 2003.
- Wang et al. [2013] C. Wang, X. Chen, A. J. Smola, and E. P. Xing. Variance reduction for stochastic gradient optimization. In NIPS, 2013.
- Wang et al. [2016] Y. Wang, A. Kucukelbir, and D. M. Blei. Reweighted data for robust probabilistic models. arXiv preprint arXiv:1606.03860, 2016.
- Xiao and Zhang [2014] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
- Zhang et al. [2017] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
- Zhao and Zhang [2014] P. Zhao and T. Zhang. Stochastic optimization with importance sampling. arXiv preprint arXiv:1412.2753, 2014.

## Appendix A Implementation details

As in self-paced learning Avramova [2015], we need to train an unbiased model for several burn-in epochs at the beginning, before it is capable of judging sample uncertainty reasonably and stably. A sufficient number of burn-in epochs helps alleviate the local minimum problem in active learning Guillory et al. [2009], especially when the methods are applied to non-convex models such as neural networks.

The general framework of the methods can be seen in Algorithm 1. In each aforementioned method, if is not specified, we use SGD-Scan (uniform sampling without replacement). If is not specified, it means for all samples . The in each method is set as the average of the current estimates. For example, for SGD Sampled by Difficulty is set as .
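To make the role of the smoothing constant concrete, the following is a minimal sketch rather than the implementation used in the experiments. It assumes that the difficulty of a sample is one minus the average predicted probability of its correct class, and that the constant is the mean of those difficulty estimates; the function name `difficulty_sampling_probs` and its argument are hypothetical.

```python
from typing import List, Sequence


def difficulty_sampling_probs(avg_correct_prob: Sequence[float]) -> List[float]:
    """Map per-sample average correct-class probabilities to sampling probabilities.

    Assumptions (not taken from the paper's equations): difficulty is
    1 - average correct-class probability, and the smoothing constant is
    the mean of the current difficulty estimates.
    """
    if not avg_correct_prob:
        raise ValueError("need at least one sample")
    difficulty = [1.0 - p for p in avg_correct_prob]
    # Smoothing constant: average of the current estimates, so that
    # well-learned samples keep a nonzero chance of being drawn.
    epsilon = sum(difficulty) / len(difficulty)
    scores = [d + epsilon for d in difficulty]
    total = sum(scores)
    if total == 0.0:
        # Every sample is predicted perfectly: fall back to uniform sampling.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]
```

Tying the constant to the average estimate keeps it on the same scale as the quantity being smoothed, so no separate hyperparameter has to be tuned in this sketch.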

When estimating sample-related statistics such as prediction variance, we found that excluding the prediction history near the initial transient state improves performance. In our implementation, we use a simple outlier removal: we compute the deviation between the prediction probability and its average at iteration (i.e., ), and exclude all prediction probabilities at the current iteration when . We apply the same method when estimating difficulty, easiness, prediction variance, or threshold closeness.
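The outlier removal described above can be sketched as follows. This is an illustration under stated assumptions, not the exact rule used in the experiments: the threshold of `k` population standard deviations around the history mean, the default `k = 2.0`, and the function names `filter_history` and `prediction_variance` are all hypothetical.

```python
from statistics import mean, pstdev, pvariance
from typing import List, Sequence


def filter_history(history: Sequence[float], k: float = 2.0) -> List[float]:
    """Drop recorded probabilities that deviate strongly from the history mean.

    Assumption: an entry is an outlier when its absolute deviation from the
    mean exceeds k population standard deviations.
    """
    if len(history) < 3:
        return list(history)  # too short to judge outliers
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0.0:
        return list(history)  # constant history, nothing to remove
    return [p for p in history if abs(p - mu) <= k * sigma]


def prediction_variance(history: Sequence[float], k: float = 2.0) -> float:
    """Variance of the correct-class probability after outlier removal."""
    kept = filter_history(history, k)
    return pvariance(kept) if len(kept) > 1 else 0.0
```

For a history such as `[0.05, 0.80, 0.82, 0.78, 0.81, 0.79, 0.80, 0.83]`, the early value 0.05 is removed, so the variance estimate reflects the later, stable predictions. Note that this sketch rescans the whole history on every call, so its cost grows with the number of recorded epochs.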

Because the methods use only the prediction results from previous iterations, they are easy to implement and add very little overhead: no extra forward or backward passes through the neural network are needed. Due to the outlier removal process, the average overhead for each sample at each epoch is , where is the total number of epochs.

When we have a very large number of samples and epochs, we can modify outlier removal by only considering the prediction probability in the latest few epochs. Then, the overhead is constant. In Section 4.4, performing outlier removal in the prediction history of each word is time-consuming, so we determine the uncertainty only based on the latest 5 epochs.
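One way to keep the per-sample cost constant is a fixed-size window over the most recent predictions. The sketch below assumes one recorded probability per sample per epoch and uses a window of 5 to mirror the setting mentioned above; the class `RecentPredictionHistory` and its methods are hypothetical names, not part of the original implementation.

```python
from collections import deque
from statistics import pvariance
from typing import Deque, Dict, Hashable


class RecentPredictionHistory:
    """Keep only the latest `window` correct-class probabilities per sample."""

    def __init__(self, window: int = 5) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self._history: Dict[Hashable, Deque[float]] = {}

    def record(self, sample_id: Hashable, correct_class_prob: float) -> None:
        # deque(maxlen=...) silently discards the oldest entry when full,
        # so memory and update cost per sample stay constant.
        buf = self._history.setdefault(sample_id, deque(maxlen=self.window))
        buf.append(correct_class_prob)

    def variance(self, sample_id: Hashable) -> float:
        # Population variance over the retained window; 0.0 if too few entries.
        values = list(self._history.get(sample_id, ()))
        return pvariance(values) if len(values) > 1 else 0.0
```

Because old entries fall out of the window automatically, predictions from the initial transient state stop influencing the estimate after `window` epochs without any explicit outlier test.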

## Appendix B Experiment details

Summaries of dataset properties can be seen in Table 5.

In Figure 4 and Figure 4, we present the convergence curves of MNIST without noise for the experiment in Section 4.1. By comparing the error rates, we can see that changing the sampling distribution accelerates the training more, but changing the loss function can give us better results at the end.

In the paper, we only provide the best testing performance within each trial. To further understand the characteristics of each method, we report the average testing performance of the last 10 epochs in Table 6 and Table 7. The results in these tables roughly follow the same trends as in Table 3 and Table 4. In addition, the training errors are presented in Table 8 and Table 9. We can see that emphasizing difficult examples usually does increase the training accuracy, but this does not necessarily imply an improvement in the testing error.

Table 5: Summary of dataset properties.

| Dataset | # Class | Instance | Input dimensions | # Training | # Testing |
|---|---|---|---|---|---|
| MNIST | 10 | Image | 28x28 | 60,000 | 10,000 |
| CIFAR 10 | 10 | Image | 32x32x3 | 50,000 | 10,000 |
| CIFAR 100 | 100 | Image | 32x32x3 | 50,000 | 10,000 |
| Question Type | 6 | Sentence | 50 | 5,492 | 500 |
| CoNLL 2003 | 17 | Word | 50 | 204,567 | 46,666 |
| OntoNotes 5.0 | 74 | Word | 50 | 1,088,503 | 152,728 |

Table 6: Average testing performance of the last 10 epochs for the sampling methods.

| Datasets | Model | SGD-Uni | SGD-SD | SGD-ISD | SGD-SE | SGD-SPV | SGD-STC |
|---|---|---|---|---|---|---|---|
| MNIST | CNN | 0.59 | 0.56 | 0.60 | 0.58 | 0.55 | 0.55 |
| Noisy MNIST | CNN | 1.18 ± 0.00 | 1.52 ± 0.01 | 1.26 ± 0.00 | 0.76 ± 0.00 | 0.92 ± 0.00 | 0.85 ± 0.00 |
| CIFAR 10 | LR | 62.66 ± 0.01 | 63.35 ± 0.01 | 62.64 ± 0.01 | 61.01 ± 0.01 | 60.80 ± 0.01 | 61.16 ± 0.01 |
| QT | CNN | 2.62 | 2.45 | 2.53 | 2.72 | 2.46 | 2.45 |

Table 7: Average testing performance of the last 10 epochs for the weighting methods.

| Datasets | Model | SGD-Scan | SGD-WD | SGD-WE | SGD-WPV | SGD-WTC |
|---|---|---|---|---|---|---|
| MNIST | CNN | 0.58 | 0.51 | 0.59 | 0.53 | 0.52 |
| Noisy MNIST | CNN | 1.15 ± 0.00 | 1.59 ± 0.01 | 0.80 ± 0.00 | 0.84 ± 0.00 | 0.85 ± 0.00 |
| CIFAR 10 | LR | 62.61 ± 0.01 | 63.29 ± 0.01 | 60.99 ± 0.01 | 60.73 ± 0.01 | 61.13 ± 0.01 |
| CIFAR 100 | RN 27 | 34.21 ± 0.01 | 34.75 ± 0.01 | 33.82 ± 0.02 | 33.90 ± 0.02 | 33.81 ± 0.02 |
| CIFAR 100 | RN 64 | 31.06 ± 0.01 | 32.11 ± 0.02 | 30.17 ± 0.02 | 30.33 ± 0.02 | 30.51 ± 0.02 |
| QT | CNN | 2.64 | 2.47 | 2.70 | 2.47 | 2.48 |
| CoNLL 2003 | CNN | 11.96 ± 0.02 | 11.85 ± 0.02 | 12.04 ± 0.02 | 11.65 ± 0.02 | 11.60 ± 0.02 |
| OntoNotes 5.0 | CNN | 18.11 ± 0.02 | 18.03 ± 0.03 | 18.70 ± 0.02 | 18.08 ± 0.02 | 17.84 ± 0.02 |
| MNIST | FC | 2.91 | 2.26 | 3.15 | 2.78 | 2.41 |
| MNIST (distill) | FC | 2.33 | 2.21 | 2.41 | 2.24 | 2.14 |

Table 8: Training errors for the sampling methods.

| Datasets | Model | SGD-Uni | SGD-SD | SGD-ISD | SGD-SE | SGD-SPV | SGD-STC |
|---|---|---|---|---|---|---|---|
| MNIST | CNN | 0.01 | 0.00 | 0.01 | 0.05 | 0.00 | 0.00 |
| Noisy MNIST | CNN | 5.54 ± 0.09 | 0.01 ± 0.00 | 2.88 ± 0.09 | 9.08 ± 0.01 | 7.60 ± 0.06 | 7.83 ± 0.04 |
| CIFAR 10 | LR | 59.88 ± 0.02 | 60.50 ± 0.02 | 59.88 ± 0.02 | 58.49 ± 0.02 | 58.26 ± 0.02 | 58.42 ± 0.02 |
| QT | CNN | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 0.00 |

Table 9: Training errors for the weighting methods.

| Datasets | Model | SGD-Scan | SGD-WD | SGD-WE | SGD-WPV | SGD-WTC |
|---|---|---|---|---|---|---|
| MNIST | CNN | 0.01 | 0.00 | 0.04 | 0.01 | 0.01 |
| Noisy MNIST | CNN | 6.21 ± 0.15 | 0.29 ± 0.02 | 9.01 ± 0.01 | 7.93 ± 0.05 | 8.02 ± 0.04 |
| CIFAR 10 | LR | 59.87 ± 0.02 | 60.48 ± 0.02 | 58.45 ± 0.02 | 58.23 ± 0.02 | 58.40 ± 0.02 |
| CIFAR 100 | RN 27 | 18.72 ± 0.04 | 18.44 ± 0.04 | 19.43 ± 0.04 | 18.86 ± 0.04 | 18.76 ± 0.04 |
| CIFAR 100 | RN 64 | 6.06 ± 0.03 | 5.42 ± 0.04 | 8.15 ± 0.03 | 8.41 ± 0.03 | 7.85 ± 0.02 |
| QT | CNN | 0.01 | 0.00 | 0.05 | 0.00 | 0.00 |
| CoNLL 2003 | CNN | 2.55 ± 0.03 | 1.64 ± 0.02 | 4.00 ± 0.03 | 2.14 ± 0.01 | 1.86 ± 0.02 |
| OntoNotes 5.0 | CNN | 13.90 ± 0.03 | 13.16 ± 0.05 | 15.21 ± 0.03 | 13.29 ± 0.03 | 12.61 ± 0.03 |
| MNIST | FC | 1.84 ± 0.01 | 0.07 ± 0.00 | 2.21 ± 0.02 | 1.60 ± 0.01 | 0.79 ± 0.01 |
| MNIST (distill) | FC | 0.73 ± 0.01 | 0.01 ± 0.00 | 0.96 ± 0.01 | 0.58 ± 0.01 | 0.13 ± 0.00 |

## Appendix C Proof sketch of Equation (5)

We apply the

(9) |

to the loss function, so

(10) |

Then, .

When , , and ,

(11) |

Finally,

(12) |