
Deep Learning with Importance Sampling

Abstract

Deep neural network training spends most of the computation on examples that are properly handled, and could be ignored.

We propose to mitigate this phenomenon with a principled importance sampling scheme that focuses computation on “informative” examples, and reduces the variance of the stochastic gradients during training. Our contribution is twofold: first, we derive a tractable upper bound to the per-sample gradient norm, and second we derive an estimator of the variance reduction achieved with importance sampling, which enables us to switch it on when it will result in an actual speedup.

The resulting scheme can be used by changing a few lines of code in a standard SGD procedure, and we demonstrate experimentally, on image classification, CNN fine-tuning, and RNN training, that for a fixed wall-clock time budget, it provides a reduction of the train losses of up to an order of magnitude and a relative improvement of test errors between 5% and 17%.


1 Introduction

The dramatic increase in available training data has made the use of deep neural networks feasible, which in turn has significantly improved the state-of-the-art in many fields, in particular computer vision and natural language processing. However, due to the complexity of the resulting optimization problem, computational cost is now the core issue in training these large architectures.

When training such models, it appears to any practitioner that not all samples are equally important; many of them are properly handled after a few epochs of training, and most could be ignored at that point without impacting the final model. To this end, we propose a novel importance sampling scheme that accelerates the training of any neural network architecture by focusing the computation on the samples that will introduce the biggest change in the parameters, which reduces the variance of the gradient estimates.

For convex optimization problems, many works [Bordes et al.(2005)Bordes, Ertekin, Weston, and Bottou, Zhao & Zhang(2015)Zhao and Zhang, Needell et al.(2014)Needell, Ward, and Srebro, Canévet et al.(2016)Canévet, Jose, and Fleuret, Richtárik & Takáč(2013)Richtárik and Takáč] have taken advantage of the difference in importance among the samples to improve the convergence speed of stochastic optimization methods. On the other hand, for deep neural networks, sample selection methods were mainly employed to generate hard negative samples for embedding learning problems or to tackle the class imbalance problem [Schroff et al.(2015)Schroff, Kalenichenko, and Philbin, Wu et al.(2017)Wu, Manmatha, Smola, and Krahenbuhl, Simo-Serra et al.(2015)Simo-Serra, Trulls, Ferraz, Kokkinos, Fua, and Moreno-Noguer].

Recently, researchers have shifted their focus to using importance sampling to improve and accelerate the training of neural networks [Alain et al.(2015)Alain, Lamb, Sankar, Courville, and Bengio, Loshchilov & Hutter(2015)Loshchilov and Hutter, Schaul et al.(2015)Schaul, Quan, Antonoglou, and Silver]. Those works employ either the gradient norm or the loss to compute each sample’s importance. However, the former is prohibitively expensive to compute and the latter is not a particularly good approximation of the gradient norm.

Compared to the aforementioned works, we derive an upper bound to the per-sample gradient norm that can be computed in a single forward pass. This results in reduced computational requirements of more than an order of magnitude compared to Alain et al. (2015). Furthermore, we quantify the variance reduction achieved with the proposed importance sampling scheme and associate it with the batch size increment required to achieve an equivalent variance reduction. The benefits of this are twofold: first, we provide an intuitive metric to predict how useful importance sampling is going to be, which allows us to decide when to switch it on during training; second, we provide theoretical guarantees for a speedup when the variance reduction is above a threshold. Based on our analysis, we propose a simple-to-use algorithm that can accelerate the training of any neural network architecture.

Our implementation is generic and can be employed by adding a single line of code in a standard Keras training script. We validate it on three independent tasks: image classification, fine-tuning and sequence classification with recurrent neural networks. Compared to existing batch selection schemes, we show that our method consistently achieves lower training loss and test error for an equal wall-clock time budget.

2 Related Work

Existing importance sampling methods can be roughly categorized into methods applied to convex problems and methods designed for deep neural networks.

2.1 Importance Sampling for Convex Problems

Importance sampling for convex optimization problems has been extensively studied over the last years. Bordes et al. (2005) developed LASVM, which is an online algorithm that uses importance sampling to train kernelized support vector machines. Later, Richtárik & Takáč (2013) proposed a generalized coordinate descent algorithm that samples coordinate sets in a way that optimizes the algorithm’s convergence rate.

More recent works [Zhao & Zhang(2015)Zhao and Zhang, Needell et al.(2014)Needell, Ward, and Srebro] make a clear connection with the variance of the gradient estimates of stochastic gradient descent and show that the optimal sampling distribution is proportional to the per sample gradient norm. Due to the relatively simple optimization problems that they deal with, the authors resort to sampling proportionally to the norm of the inputs, which in simple linear classification is proportional to the Lipschitz constant of the per sample loss function.

Such simple importance measures do not exist for Deep Learning, and the direct application of the aforementioned theory [Alain et al.(2015)Alain, Lamb, Sankar, Courville, and Bengio] requires clusters of GPU workers just to compute the sampling distribution.

2.2 Importance Sampling for Deep Learning

Importance sampling has been used in Deep Learning mainly in the form of manually tuned sampling schemes. Bengio et al. (2009) manually design a sampling scheme inspired by the perceived way that human children learn; in practice they provide the network with examples of increasing difficulty in an arbitrary manner. Diametrically opposite, it is common for deep embedding learning to sample hard examples because of the plethora of easy, non-informative ones [Simo-Serra et al.(2015)Simo-Serra, Trulls, Ferraz, Kokkinos, Fua, and Moreno-Noguer, Schroff et al.(2015)Schroff, Kalenichenko, and Philbin].

More closely related to our work, Schaul et al. (2015) and Loshchilov & Hutter (2015) use the loss to create the sampling distribution. Both approaches keep a history of losses for previously seen samples, and sample either proportionally to the loss or based on the loss ranking. One of the main limitations of history-based sampling is the need to tune a large number of hyperparameters that control the effects of “stale” importance scores; since the model is constantly updated, the importance of each sample fluctuates and previous observations may poorly reflect the current situation. In particular, Schaul et al. (2015) use various forms of smoothing for the losses and the importance sampling weights, while Loshchilov & Hutter (2015) introduce a large number of hyperparameters that control when the losses are computed, when they are sorted, as well as how the sampling distribution is computed based on the rank.

In comparison to all the above methods, our importance sampling scheme based on an upper bound to the gradient norm has a solid theoretical basis with clear objectives, requires only easy-to-choose hyperparameters, offers a theoretically guaranteed speedup, and can be applied to any type of network and loss function.

2.3 Other Sample Selection Methods

Finally, we would like to mention two other related methods for speeding up the training of neural networks. Wu et al. (2017) design a distribution (suitable only for distance-based losses) that maximizes the diversity of the losses in a single batch. Fan et al. (2017) use reinforcement learning to train a neural network that selects samples for another neural network in order to optimize the convergence speed. Although their preliminary results are promising, the overhead of training two networks makes a wall-clock speedup unlikely and their proposal less appealing.

3 Variance Reduction for Deep Neural Networks

Importance sampling aims at increasing the convergence speed of Stochastic Gradient Descent (SGD) by focusing computation on samples that actually induce a change in the model parameters. This formally translates into a reduced variance of the gradient estimates for a fixed computational cost. In the following sections, we analyze how this works and present an efficient algorithm that can be used to train any Deep Learning model.

3.1 Introduction to Importance Sampling

Let (x_i, y_i) be the i-th input-output pair from the training set, Ψ(·; θ) be a Deep Learning model parameterized by the vector θ, and L(·, ·) be the loss function to be minimized during training. The goal of training is to find

(1)

where N corresponds to the number of examples in the training set.

We use an SGD procedure with learning rate η, where the update at iteration t depends on the sampling distribution p_1^t, …, p_N^t and re-scaling coefficients w_1, …, w_N. Let I_t be the index of the data point sampled at that step; we have P(I_t = i) = p_i^t and

(2)

Plain SGD with uniform sampling is recovered with w_i = 1 and p_i^t = 1/N for all i and t.
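As a concrete illustration of this update, the following minimal numpy sketch performs one importance-sampled step; the function and variable names are ours, and the re-scaling w = 1/(N p_i) is the standard choice that keeps the gradient estimate unbiased.

```python
import numpy as np

def importance_sgd_step(theta, per_sample_grad, probs, lr, rng):
    """One importance-sampled SGD step (illustrative sketch; notation assumed)."""
    N = len(probs)
    i = rng.choice(N, p=probs)      # draw the index I_t from the sampling distribution P_t
    w = 1.0 / (N * probs[i])        # re-scaling coefficient that keeps the gradient estimate unbiased
    return theta - lr * w * per_sample_grad(i)

# Plain SGD with uniform sampling corresponds to probs = np.full(N, 1.0 / N), in which case w = 1.
```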

If we define the convergence speed of SGD as the reduction of the distance of the parameter vector θ_t from the optimal parameter vector θ* between two consecutive iterations t and t+1,

(3)

and if we have the re-scaled per-sample gradient G_i such that

(4)

and set w_i = 1/(N p_i^t), then we get (this is a different derivation of the result by Wang et al., 2016)

(5)

Since the first two terms in the last expression are the speed of batch gradient descent, we observe that it is possible to gain a speedup by sampling from the distribution that minimizes the last term, namely the variance of the gradient estimates. Several works [Needell et al.(2014)Needell, Ward, and Srebro, Zhao & Zhang(2015)Zhao and Zhang, Alain et al.(2015)Alain, Lamb, Sankar, Courville, and Bengio] have shown the optimal distribution to be proportional to the per-sample gradient norm. However, computing this distribution is computationally prohibitive.

3.2 Beyond the Full Gradient Norm

Given an upper bound on the per-sample gradient norm, and due to

(6)

we propose to relax the optimization problem in the following way

(7)

The minimizer of the second term of equation 7, similarly to the first term, is to sample proportionally to the upper bound. All that remains is to find a proper expression for this upper bound that is significantly easier to compute than the norm of the gradient of each sample.

In order to continue with the derivation of our upper bound, let us introduce some notation specific to a multi-layer perceptron. Let W_l be the weight matrix for layer l and σ_l be a Lipschitz continuous activation function. Then, let

(8)
(9)
(10)
(11)

Although our notation describes simple fully connected neural networks without bias, our analysis holds for any affine operation followed by a slope-bounded non-linearity. With

(12)
(13)
(14)

we get

(15)
(16)
(17)
(18)

Various weight initialization [Glorot & Bengio(2010)Glorot and Bengio] and activation normalization techniques [Ioffe & Szegedy(2015)Ioffe and Szegedy, Ba et al.(2016)Ba, Kiros, and Hinton] uniformise the activations across samples in most high-performance architectures. As a result, the variation of the gradient norm is mostly captured by the gradient of the loss function with respect to the pre-activation outputs of the last layer of our neural network. Consequently we can derive the following upper bound to the gradient norm of all the parameters

(19)

which can be computed in closed form from the output of the last layer and is only marginally more difficult to compute than the value of the loss. However, our upper bound depends on the time step; thus we cannot generate a distribution once and sample from it throughout training. This is intuitive because the importance of each sample changes as the model changes.
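As an illustration of why the bound is cheap, for a softmax classifier trained with cross-entropy the gradient of the loss with respect to the last-layer pre-activations is softmax(z) - y, so a score of this kind is available from the forward pass alone. The following numpy sketch (names are ours and the loss is assumed to be categorical cross-entropy) computes such per-sample scores:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def importance_scores(logits, labels, num_classes):
    """Per-sample importance proportional to the norm of dL/dz at the last layer.

    For softmax cross-entropy, dL/dz = softmax(z) - onehot(y), which is available
    in closed form from the forward pass alone.
    """
    probs = softmax(logits)                        # shape (batch, classes)
    onehot = np.eye(num_classes)[labels]           # shape (batch, classes)
    return np.linalg.norm(probs - onehot, axis=1)  # one score per sample
```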

3.3 When is Variance Reduction Possible?

Computing the importance score from equation 19 is more than an order of magnitude faster compared to computing the gradient norm for each sample. Nevertheless, it still costs one forward pass through the network and can be wasteful. For instance, during the first iterations of training, the gradients with respect to every sample have approximately equal norm; thus we would waste computational resources trying to sample from the uniform distribution. In addition, computing the importance score for the whole dataset is still prohibitive and would render the method unsuitable for online learning.

In order to solve the problem of computing the importance for the whole dataset, we pre-sample a large batch of data points, compute the sampling distribution for that batch and re-sample a smaller batch with replacement. The above procedure upper bounds both the speedup and variance reduction. Given a large batch consisting of samples and a small one consisting of , we can achieve a maximum variance reduction of and a maximum speedup of assuming that the backward pass requires twice the amount of time as the forward pass.
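A minimal sketch of this pre-sample/re-sample step, assuming a pre-sampling size B, a batch size b and a user-provided score_fn that returns the per-sample importance from a forward pass (all names are ours):

```python
import numpy as np

def resample_batch(dataset_size, B, b, score_fn, rng):
    """Pre-sample B points uniformly, then re-sample b of them with replacement,
    with probabilities proportional to their importance scores (sketch)."""
    large = rng.choice(dataset_size, size=B, replace=False)   # uniform pre-sample
    scores = score_fn(large)                                  # importance from one forward pass
    p = scores / scores.sum()                                 # sampling distribution over the pre-sample
    pos = rng.choice(B, size=b, replace=True, p=p)            # positions inside the pre-sample
    weights = 1.0 / (B * p[pos])                              # re-scaling coefficients for unbiasedness
    return large[pos], weights
```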

Due to the large cost of computing the importance per sample, we only perform importance sampling when we know that the variance of the gradients can be reduced. In the following equations, we show that the variance reduction is proportional to the squared distance of the sampling distribution from the uniform distribution. Let g_i denote the importance score of the i-th sample in the pre-sampled batch and u the uniform probability.

(20)
(21)
(22)
(23)
(24)
(25)
(26)

Equation 26 already provides us with a useful metric to decide if the variance reduction is significant enough to justify using importance sampling. However, choosing a suitable threshold for the squared distance would be tedious and unintuitive. We can do much better by dividing the variance reduction by the original variance to derive the increase in the batch size that would achieve an equivalent variance reduction. Assuming that we increase the batch size by a factor of 1+τ, we achieve a variance reduction of 1/(1+τ); thus we have

(27)
(28)
(29)

Using equation 29, we have a hyperparameter that is very easy to select, and we can now design our training procedure, which is described in pseudocode in algorithm 3.3. Computing τ from equation 29 allows us to guarantee a speedup when the threshold τ_th is chosen as designated by the analysis. However, as is shown in the experiments, we can use a smaller τ_th and still get a significant speedup.
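To make the decision rule concrete, one possible form of this estimate is sketched below. It assumes that the relative variance reduction r is estimated from the importance scores of a uniformly sampled batch and that enlarging the batch by a factor of 1+τ divides the variance by 1+τ, so τ = 1/(1 - r) - 1; the exact estimator of equations 20 to 29 may differ.

```python
import numpy as np

def equivalent_batch_increment(scores):
    """Estimate the batch-size increment tau that uniform sampling would need to
    match the variance reduction of importance sampling (stand-in for eqs. 20-29).

    We take the relative reduction r as the variance of the per-sample importance
    scores divided by their second moment (an assumption on our part), and use the
    fact that enlarging the batch by a factor (1 + tau) divides the variance by (1 + tau).
    """
    g = np.asarray(scores, dtype=np.float64)
    r = g.var() / (np.mean(g ** 2) + 1e-16)   # fraction of the variance removed (assumed estimator)
    return 1.0 / (1.0 - r) - 1.0              # tau such that 1 / (1 + tau) = 1 - r

# Importance sampling is switched on only when a smoothed estimate of tau exceeds tau_th.
```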

Algorithm 3.3 Deep Learning with Importance Sampling

1: Inputs: B, b, τ_th, a_τ, θ_0
2: θ ← θ_0
3: τ ← 0
4: repeat
5:   if τ > τ_th then
6:     U ← B uniformly sampled datapoints
7:     g_i ← importance of each i ∈ U according to eq 19
8:     G ← b datapoints sampled with p_i ∝ g_i from U
9:     w_i ← 1/(B p_i) for each i ∈ G
10:    θ ← SGD step on G with weights w_i
11:  else
12:    G ← b uniformly sampled datapoints
13:    w_i ← 1 for each i ∈ G
14:    θ ← SGD step on G with weights w_i
15:    g_i ← importance of each i ∈ G according to eq 19 (free from the forward pass)
16:  end if
17:  τ ← a_τ τ + (1 - a_τ) · estimate from eq 29 using the g_i
18: until convergence

The inputs to the algorithm are the pre-sampling size B, the batch size b, the equivalent batch-size increment τ_th after which we start importance sampling, and the exponential moving average parameter a_τ used to compute a smooth estimate of τ; θ_0 denotes the initial parameters of our deep network. We would like to point out that in line 15 of the algorithm we compute the importance scores for free, since we have done the forward pass in the previous step.

The only parameter that has to be explicitly defined for our algorithm is the pre-sampling size B, because τ_th can be set using equation 29. We provide a small ablation study for B in the supplementary material.
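Putting the pieces together, a compact Python rendering of the procedure could look as follows. It reuses the equivalent_batch_increment helper sketched above; score_fn and sgd_step are placeholders for the model-specific forward pass and weighted update, and the bookkeeping of the published algorithm may differ in its details.

```python
import numpy as np

def train(dataset_size, B, b, tau_th, a_tau, theta, score_fn, sgd_step, rng, num_steps):
    """Sketch of the training loop of algorithm 3.3 (placeholder callables, see lead-in)."""
    tau = 0.0
    for _ in range(num_steps):
        if tau > tau_th:
            # Importance sampling: score a large uniform batch, re-sample a small one.
            large = rng.choice(dataset_size, size=B, replace=False)
            scores = score_fn(large, theta)                      # forward pass only (eq. 19)
            p = scores / scores.sum()
            pos = rng.choice(B, size=b, replace=True, p=p)
            batch, weights = large[pos], 1.0 / (B * p[pos])
        else:
            # Uniform sampling; in a real implementation the scores below come for
            # free from the forward pass of the parameter update.
            batch = rng.choice(dataset_size, size=b, replace=False)
            weights = np.ones(b)
            scores = score_fn(batch, theta)
        theta = sgd_step(theta, batch, weights)                  # weighted gradient step
        # Smoothed estimate of the equivalent batch-size increment (see the sketch above).
        tau = a_tau * tau + (1.0 - a_tau) * equivalent_batch_increment(scores)
    return theta
```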

4 Experiments

In this section, we analyse experimentally the performance of the proposed importance sampling scheme based on our upper-bound of the gradient norm. In the first subsection, we compare the variance reduction achieved with our upper bound to the theoretical maximum achieved with the true gradient norm. We also compare against sampling based on the loss, which is commonly used in practice. Subsequently, we conduct experiments which demonstrate that we are able to achieve a non-negligible wall-clock speedup on a variety of tasks using our importance sampling scheme.

In all the subsequent sections, we use uniform to refer to the usual training algorithm that samples points from a uniform distribution, loss to refer to algorithm 3.3 with sampling proportional to the loss value instead of our upper-bound to the gradient norm (equations 7 and 19), and finally upper-bound to refer to our proposed method. All the other baselines from published methods are referred to by the names of their authors.

Experiments were conducted using Keras [Chollet et al.(2015)] with TensorFlow [Abadi et al.(2016)Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, et al.], and the code to reproduce the experiments will be provided under an open source license when the paper is published. For all the experiments, we use Nvidia K80 GPUs, and the reported time is computed by subtracting the timestamps before starting one epoch and after finishing one; it thus includes the time needed to transfer data between CPU and GPU memory.

Our implementation provides a wrapper around models that substitutes the standard uniform sampling with our importance-sampling method. This means that adding a single line of code to call this wrapper before actually fitting the model is sufficient to switch from standard uniform sampling to our importance-sampling scheme. Moreover, as specified in § 3.3 and algorithm 3.3, our procedure reliably estimates at every iteration whether importance sampling will provide a speedup, and sticks to uniform sampling otherwise.
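For illustration, the intended usage pattern is sketched below with toy data; the wrapper name and import path are assumptions based on the released code and should be checked against it.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# Import path assumed from the released code; verify against the actual package.
from importance_sampling.training import ImportanceTraining

# Toy data standing in for a real dataset.
x_train = np.random.rand(1024, 784).astype("float32")
y_train = np.eye(10)[np.random.randint(0, 10, size=1024)].astype("float32")

model = Sequential([Dense(512, activation="relu", input_shape=(784,)),
                    Dense(10, activation="softmax")])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])

# The single added line: wrap the model; training then uses importance sampling.
ImportanceTraining(model).fit(x_train, y_train, batch_size=128, epochs=2)
```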

4.1 Ablation study

Figure 1: The y-axis denotes the distance between the average gradient of the large batch and the average gradient of the small batch, normalized by the distance achieved with uniform sampling. The sampling of the small batch is repeated several times and the reported results are the average. The details of the experimental setup are given in § 4.1.
Figure 2: The probabilities generated with the loss and our upper-bound are plotted against the ideal probabilities produced by the gradient-norm. The black line denotes perfect correlation. The details of the experimental setup are given in § 4.1.

As already mentioned, several works [Loshchilov & Hutter(2015)Loshchilov and Hutter, Schaul et al.(2015)Schaul, Quan, Antonoglou, and Silver] use the loss value, directly or indirectly, to generate sampling distributions. In this section, we present experiments that validate the superiority of our method with respect to the loss in terms of variance reduction. For completeness, in the supplementary material we include a theoretical analysis that explains why sampling based on the loss also achieves variance reduction during the late stages of training.

Our experimental setup is as follows: we train a wide residual network [Zagoruyko & Komodakis(2016)Zagoruyko and Komodakis] on the CIFAR100 dataset [Krizhevsky(2009)], following closely the training procedure of Zagoruyko & Komodakis (2016) (the details are presented in § 4.2). Subsequently, we sample images uniformly at random from the dataset. Using the weights of the trained network, at regular intervals of parameter updates, we resample a smaller batch from the large batch of images using uniform sampling or importance sampling with probabilities proportional to the loss, our upper-bound or the gradient-norm. The gradient-norm is computed by running the backpropagation algorithm with a batch size of 1.

Figure 1 depicts the variance reduction achieved with every sampling scheme in comparison to uniform. We measure this directly as the distance between the mini-batch gradient and the batch gradient of the pre-sampled images. For robustness, we repeat the sampling several times and report the average. We observe that our upper bound and the gradient norm result in very similar variance reduction, meaning that the bound is relatively tight and that the produced probability distributions are highly correlated. This can also be deduced by observing figure 2, where the probabilities proportional to the loss and the upper-bound are plotted against the optimal ones (proportional to the gradient-norm). We observe that our upper bound is almost perfectly correlated with the gradient norm, in stark contrast to the loss, which is only correlated in the regime of very small gradients. Quantitatively, the sum of squared errors of the points in figure 2 is considerably larger for the loss than for our proposed upper bound.
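For reference, the metric of figure 1 can be written in a few lines of numpy; per_sample_grads is a hypothetical array holding one flattened gradient per pre-sampled point:

```python
import numpy as np

def gradient_alignment(per_sample_grads, batch_idx, weights=None):
    """Distance between the (weighted) mini-batch gradient and the gradient of the
    whole pre-sampled batch; smaller is better (sketch of the metric in figure 1)."""
    full = per_sample_grads.mean(axis=0)                             # average gradient of the large batch
    w = np.ones(len(batch_idx)) if weights is None else weights
    mini = (w[:, None] * per_sample_grads[batch_idx]).mean(axis=0)   # weighted mini-batch gradient
    return np.linalg.norm(mini - full)

# In figure 1 this distance is averaged over repeated re-samplings and normalized
# by the value obtained with uniform sampling.
```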

Furthermore, we observe that sampling hard examples (with high loss) increases the variance, especially in the beginning of training. Similar behaviour has been observed in problems such as embedding learning, where semi-hard sample mining is preferred over sampling using the loss [Wu et al.(2017)Wu, Manmatha, Smola, and Krahenbuhl, Schroff et al.(2015)Schroff, Kalenichenko, and Philbin].

4.2 Image classification


Figure 3: CIFAR10 Training Loss
Figure 4: CIFAR10 Test Error
Figure 5: CIFAR100 Training Loss
Figure 6: CIFAR100 Test Error
Figure 7: Comparison of importance sampling using the upper-bound with uniform and loss based importance sampling. The details of the training procedure are given in § 4.2. Our proposed scheme is the only one achieving a speedup on CIFAR100 and results in 5% smaller test error. All presented results are averaged across independent runs.

In this section, we use importance sampling to improve the training of a wide residual network on CIFAR10 and CIFAR100. We follow the experimental setup of Zagoruyko & Komodakis (2016); specifically, we train a wide resnet 28-2 with SGD with momentum, with the learning rate divided at fixed numbers of parameter updates, and train for a fixed total number of iterations. In order for our history-based baselines to be compatible with the data augmentation of the CIFAR images, we pre-augment both datasets to generate a fixed set of images for each one. It is important to mention that our method does not have this limitation, since it can work on infinite datasets in a truly online fashion. To compare between methods, we use a learning rate schedule based on wall-clock time and we also fix the total seconds available for training. A faster method should have smaller training loss and test error at any given time during training.
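A wall-clock-based schedule can be implemented with a small Keras callback such as the sketch below; the boundaries and decay factor are illustrative placeholders, not the values used in the experiments.

```python
import time
from keras import backend as K
from keras.callbacks import Callback

class WallClockSchedule(Callback):
    """Divide the learning rate at fixed wall-clock times (boundaries are illustrative)."""
    def __init__(self, boundaries_sec=(1800, 3600), factor=0.1):
        super().__init__()
        self.boundaries = list(boundaries_sec)
        self.factor = factor

    def on_train_begin(self, logs=None):
        self.start = time.time()

    def on_batch_begin(self, batch, logs=None):
        elapsed = time.time() - self.start
        # Each time an elapsed-time boundary is crossed, scale the learning rate once.
        while self.boundaries and elapsed > self.boundaries[0]:
            self.boundaries.pop(0)
            K.set_value(self.model.optimizer.lr,
                        K.get_value(self.model.optimizer.lr) * self.factor)
```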

For this experiment, we compare the proposed method to uniform, loss, online batch selection by Loshchilov & Hutter (2015) and the history-based sampling of Schaul et al. (2015). For the method of Schaul et al. (2015), we use their proportional sampling, since the rank-based variant is very similar to Loshchilov & Hutter (2015), and we select the best parameters from a small grid search. Similarly, for online batch selection, we select the best parameters from a grid search and recompute all the losses at regular intervals.

For our method, we use a presampling size of . One of the goals of this experiment is to show that even a smaller reduction in variance can effectively stabilize training and provide wall-clock time speedup; thus we set . For robustness, we perform independent runs and report the average.

The results are depicted in figure 7. We observe that on the relatively easy CIFAR10 dataset, all methods can provide some speedup over uniform sampling. However, the case is very different for the more complicated CIFAR100, where only sampling with our proposed upper-bound to the gradient norm reduces the variance of the gradients and provides faster convergence. Examining the training evolution in detail, we observe that on CIFAR10 our method is the only one that achieves a significant improvement in the test error even in the first stages of training. Quantitatively, on CIFAR10 we achieve more than an order of magnitude lower training loss and a lower test error, while on CIFAR100 we achieve several times lower training loss and a lower test error compared to uniform sampling.

At this point, we would also like to discuss the performance of the loss baseline compared to other methods that also select batches based on this metric. Our experiments show that using “fresh” values of the loss, combined with a warm-up stage so that importance sampling is not started too early, outperforms all the other baselines on the CIFAR10 dataset.

4.3 Fine-tuning


Figure 8: Comparison of importance sampling for fine-tuning on the MIT67 dataset. The details of the training procedure are given in § 4.3. Our proposed algorithm converges very quickly to a low test error in approximately half an hour, a relative reduction compared to uniform sampling. For robustness, the results are averaged across independent runs.

Our second experiment shows the application of importance sampling to the significant task of fine tuning a pre-trained large neural network on a new dataset. This task is of particular importance because there exists an abundance of powerful models pre-trained on large datasets such as ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei].

The details of our experimental setup are the following: we fine-tune a ResNet-50 [He et al.(2015)He, Zhang, Ren, and Sun] previously trained on the ImageNet ILSVRC 2012 dataset. We replace the last classification layer and then train the whole network end-to-end to classify indoor images among 67 possible categories [Quattoni & Torralba(2009)Quattoni and Torralba]. We use SGD with momentum and a fixed learning rate. We fix the batch size and, for our importance sampling algorithm, the pre-sampling size. The variance reduction threshold τ_th is set as designated by equation 29.
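A minimal Keras sketch of this fine-tuning setup is given below; the learning rate and momentum are placeholders of our choosing, since the exact values are not reproduced here.

```python
from keras.applications import ResNet50
from keras.layers import Dense
from keras.models import Model
from keras.optimizers import SGD

# Load the ImageNet-pretrained backbone without its classification head.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg")
# Replace the last layer with a 67-way classifier for the MIT67 indoor scenes.
outputs = Dense(67, activation="softmax")(base.output)
model = Model(inputs=base.input, outputs=outputs)

# Learning rate and momentum are placeholders, not the paper's exact values.
model.compile(optimizer=SGD(lr=1e-3, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(...) would then be driven by the importance-sampling wrapper, as in § 4.
```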

To assess the performance of both our algorithm and our gradient norm approximation, we compare the convergence speed of our importance sampling algorithm using our upper-bound and using the loss. Once again, for robustness, we perform several independent runs and report the average.

The results of the experiment are depicted in figure 8. As expected, importance sampling is very useful for the task of fine-tuning, since a lot of samples are handled correctly very early in the training process. Our upper-bound, once again, greatly outperforms sampling proportionally to the loss when the network is large and the problem is non-trivial. Compared to uniform sampling, in just half an hour importance sampling has converged close to the best test error that can be expected on this dataset without any data augmentation or multiple crops [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson], whereas uniform sampling is still far from it.

4.4 Pixel by Pixel MNIST


Figure 9: Comparison of importance sampling on pixel-by-pixel MNIST with an LSTM. The details of the training procedure are given in § 4.4. Our proposed algorithm speeds up training and achieves a lower test error after one hour of training. We observe that sampling proportionally to the loss actually hurts convergence in this case.

The goal of our final experiment is to show that our method is not limited to convolutional networks with ReLU activations. To that end, we use our importance sampling algorithm to accelerate the training of an LSTM on a sequence classification problem. We use the pixel-by-pixel classification of randomly permuted MNIST digits [LeCun et al.(2010)LeCun, Cortes, and Burges], as defined by Le et al. (2015). The problem may seem trivial at first; however, as shown by Le et al. (2015), it is particularly suited to benchmarking the training of recurrent neural networks, due to the long-range dependency problems inherent in the dataset (one time step per pixel).

The details of our experimental setup are the following. We fix a permutation matrix for all the pixels to generate a training set of sequences with one time step per pixel. Subsequently, we train an LSTM [Hochreiter & Schmidhuber(1997)Hochreiter and Schmidhuber] with a fixed number of dimensions in the hidden space, tanh as the activation function and the sigmoid as the recurrent activation function. Finally, we use a linear classifier on top of the LSTM to choose a digit based on the hidden representation.

To train the aforementioned architecture, we use the Adam optimizer [Kingma & Ba(2014)Kingma and Ba] with a fixed learning rate and batch size. We have also found gradient clipping to be necessary for the training not to diverge; thus we clip the norm of all gradients.
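A sketch of this architecture in Keras is shown below; the hidden size, learning rate and clipping norm are placeholders of our choosing.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.optimizers import Adam

model = Sequential([
    # One permuted pixel per time step (784 steps for 28x28 MNIST images);
    # the hidden size is a placeholder.
    LSTM(128, activation="tanh", recurrent_activation="sigmoid",
         input_shape=(784, 1)),
    Dense(10, activation="softmax"),   # classifier on the last hidden state
])

# Learning rate and clipping norm are placeholders; the text's exact values are elided.
model.compile(optimizer=Adam(lr=1e-3, clipnorm=1.0),
              loss="categorical_crossentropy", metrics=["accuracy"])
```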

The results of the experiment are depicted in figure 9. For both the loss and our proposed upper-bound, importance sampling starts after a warm-up period determined by the chosen threshold and pre-sampling size. We could set τ_th as designated by equation 29, which would only result in our algorithm being more conservative and starting importance sampling later. We clearly observe that sampling proportionally to the loss hurts convergence in this case. On the other hand, our algorithm achieves both lower training loss and lower test error in the given time budget.

5 Conclusions

We have presented an efficient algorithm for accelerating the training of deep neural networks using importance sampling. Our algorithm takes advantage of a novel upper bound to the gradient norm of any neural network that can be computed in a single forward pass. In addition, we show an equivalence between the variance reduction achieved with importance sampling and an increase of the batch size; thus we are able to quantify both the variance reduction and the speedup, and to intelligently decide when to stop sampling uniformly.

Our experiments show that our algorithm is effective in reducing the training time for several tasks, both on image and sequence data. More importantly, we show that not all data points matter equally over the course of training, which can be exploited to gain a speedup, better quality gradients, or both.

Our analysis opens several avenues of future research. The two most important ones, not investigated in this work, are automatically tuning the learning rate based on the variance of the gradients, and decreasing the batch size. It has been theorized that gradient variance is advantageous for the generalization performance of a neural network. We can keep the variance stable and increase the convergence speed by automatically increasing the learning rate proportionally to the batch increment. Secondly, instead of pre-sampling more data and re-sampling afterwards, we could start decreasing the batch size, thus again keeping the variance stable while reducing the time per update.


Appendix

Appendix A Ablation study on B

The only hyperparameter that is somewhat hard to define in our algorithm is the pre-sampling size B. As mentioned in the main paper, it controls the maximum possible variance reduction and also how much wall-clock time one iteration with importance sampling will require.

In figure 10 we depict the results of training with importance sampling and different pre-sampling sizes on CIFAR10. We follow the same experimental setup as in the paper.

We observe that a larger pre-sampling size results in lower training loss, which follows from our theory since the maximum variance reduction is smaller with a small B. In this experiment we use the same τ_th for all runs, and we observe that the largest B is not the one that reaches a low training loss first. This is justified because computing the importance for a very large number of samples in the beginning of training is wasteful according to our analysis.

According to this preliminary ablation study for B, we conclude that choosing B to be a small multiple of the batch size is a good strategy for achieving a speedup. However, regardless of the choice of B, pairing it with a threshold τ_th designated by the analysis in the paper guarantees that the algorithm will spend time on importance sampling only when the variance can be greatly reduced.


Figure 10: Results of training with different pre-sampling sizes B on CIFAR10. See the paper for the experimental setup.

Appendix B Importance Sampling with the Loss

In this section, we present a small analysis that provides intuition on using the loss as an approximation of, or an upper bound to, the per-sample gradient norm.

Let the loss be either the negative log-likelihood through a sigmoid or the squared error, defined respectively as

(30)

Given our upper bound to the gradient norm, we can write

(31)

Moreover, for the losses that we are considering, when the loss of a sample approaches zero, so does the gradient of the loss with respect to the model’s output. Using this fact in combination with equation 31, we conclude that the per-sample gradient norm also becomes small; thus small loss values imply small gradients. However, large loss values are not well correlated with the gradient norm, which can also be observed in § 4.1 of the paper.
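To make the argument explicit, the gradients of these two losses with respect to the model output z have simple closed forms (a reconstruction under standard definitions, not necessarily the paper's exact notation):

```latex
% Gradients of the two losses w.r.t. the model output z (standard forms).
\begin{align}
  \mathcal{L}_{\text{BCE}}(z, y) &= -y \log \sigma(z) - (1-y) \log\big(1 - \sigma(z)\big),
  & \frac{\partial \mathcal{L}_{\text{BCE}}}{\partial z} &= \sigma(z) - y, \\
  \mathcal{L}_{\text{MSE}}(z, y) &= \tfrac{1}{2} (z - y)^2,
  & \frac{\partial \mathcal{L}_{\text{MSE}}}{\partial z} &= z - y .
\end{align}
% In both cases the gradient vanishes exactly when the loss vanishes, which is the
% fact used in the argument above.
```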

To summarize, we conjecture that, due to the above facts, sampling proportionally to the loss reduces the variance only when the majority of the samples have losses close to 0. Our assumption is validated by our experiments, where the loss struggles to achieve a speedup in the early stages of training, where most samples still have relatively large loss values.

References

  1. Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  2. Alain, Guillaume, Lamb, Alex, Sankar, Chinnadhurai, Courville, Aaron, and Bengio, Yoshua. Variance reduction in sgd by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.
  3. Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. CoRR, abs/1607.06450, 2016.
  4. Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.
  5. Bordes, Antoine, Ertekin, Seyda, Weston, Jason, and Bottou, Léon. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6(Sep):1579–1619, 2005.
  6. Canévet, O., Jose, C., and Fleuret, F. Importance sampling tree for large-scale empirical expectation. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1454–1462, 2016.
  7. Chollet, François et al. keras. https://github.com/fchollet/keras, 2015.
  8. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  9. Fan, Yang, Tian, Fei, Qin, Tao, Bian, Jiang, and Liu, Tie-Yan. Learning what data to learn. arXiv preprint arXiv:1702.08635, 2017.
  10. Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. 2010.
  11. He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  12. Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  13. Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.
  14. Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  15. Krizhevsky, Alex. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
  16. Le, Quoc V, Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
  17. LeCun, Yann, Cortes, Corinna, and Burges, CJ. Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
  18. Loshchilov, Ilya and Hutter, Frank. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
  19. Needell, Deanna, Ward, Rachel, and Srebro, Nati. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems, pp. 1017–1025, 2014.
  20. Quattoni, Ariadna and Torralba, Antonio. Recognizing indoor scenes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 413–420. IEEE, 2009.
  21. Razavian, Ali Sharif, Azizpour, Hossein, Sullivan, Josephine, and Carlsson, Stefan. Cnn features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pp. 512–519. IEEE, 2014.
  22. Richtárik, Peter and Takáč, Martin. On optimal probabilities in stochastic coordinate descent methods. arXiv preprint arXiv:1310.3438, 2013.
  23. Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
  24. Schroff, Florian, Kalenichenko, Dmitry, and Philbin, James. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015.
  25. Simo-Serra, Edgar, Trulls, Eduard, Ferraz, Luis, Kokkinos, Iasonas, Fua, Pascal, and Moreno-Noguer, Francesc. Discriminative learning of deep convolutional feature point descriptors. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 118–126. IEEE, 2015.
  26. Wang, Linnan, Yang, Yi, Min, Martin Renqiang, and Chakradhar, Srimat. Accelerating deep neural network training with inconsistent stochastic gradient descent. arXiv preprint arXiv:1603.05544, 2016.
  27. Wu, Chao-Yuan, Manmatha, R., Smola, Alexander J., and Krahenbuhl, Philipp. Sampling matters in deep embedding learning. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  28. Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock and Smith, William A. P. (eds.), Proceedings of the British Machine Vision Conference (BMVC), pp. 87.1–87.12. BMVA Press, September 2016. ISBN 1-901725-59-6. doi: 10.5244/C.30.87. URL https://dx.doi.org/10.5244/C.30.87.
  29. Zhao, Peilin and Zhang, Tong. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1–9, 2015.