The Two Regimes of Deep Network Training


1 Introduction

Finding the right learning rate schedule is critical to obtaining the best testing accuracy for a given neural network architecture. As deep learning started gaining popularity, starting with a large learning rate and gradually decreasing it became the standard practice (Bengio, 2012b). Indeed, today, such a “step schedule” still remains one of the most popular learning rate schedules, and when properly tuned it yields competitive models.

More recently, as architectures grew deeper and wider, and as training on massive datasets became the norm, more elaborate schedules emerged (Loshchilov and Hutter, 2017; Smith, 2017; Smith and Nicholay, 2017). While these schedules have shown great practical success, it is still unclear why that is the case. Li and Arora (2019) even showed that, counter-intuitively, an exponentially increasing schedule could also be effective. In light of this, it is time to revisit learning rate schedules and shed some light on why some perform well and others don't.

Our contributions

In this paper, we identify two training regimes: (1) the large-step regime and (2) the small-step one, which usually correspond, respectively, to the start and the end of a “step schedule”. In particular, we examine these regimes through the lens of optimization and generalization. We find that:

  • In the large-step regime, the loss does not decrease consistently at each epoch and the final loss value obtained after convergence is much higher than when training in the small-step regime. In the latter regime, the loss decreases faster and more smoothly, and to a large degree matches the intuition drawn from the convex optimization literature.

  • In the large-step regime, momentum does not seem to have a discernible benefit. More precisely, we show that we can recover similar loss decrease curves for a wide range of different momentum values as long as we make a corresponding change in the learning rate. In the small-step regime, however, momentum becomes crucial to reaching a good solution quickly.

  • Finally, we leverage this understanding to propose a simple two-stage learning rate schedule that achieves state of the art performance on the CIFAR-10 and ImageNet datasets. Importantly, in this schedule, each stage uses a different algorithm and hyper-parameters.

Our findings suggest that it might be beneficial to depart from viewing deep network training as a single optimization problem and instead to explore using different algorithms for different stages of that process. In particular, second order methods (such as K-FAC (Martens and Grosse, 2015) and L-BFGS (Liu and Nocedal, 1989)), which are currently viewed as successful at reducing the number of training iterations but as leading to suboptimal generalization performance, might be good candidates for use (solely) in the small-step regime.

2 Background

Given a (differentiable) function $f : \mathbb{R}^d \to \mathbb{R}$, one of the most popular techniques for minimizing it is the gradient descent method (GD). This method, starting from an initial solution $x_0$, iteratively updates the solution as:

$$x_{t+1} = x_t - \eta \nabla f(x_t) \qquad (1)$$

where $\eta > 0$ is the learning rate. GD is the most natural and simple continuous optimization scheme. However, there is a host of more advanced variants. One of the most prominent is momentum gradient descent, often referred to as classic momentum, or the heavy ball method (Polyak, 1964). It corresponds to the update rule:

$$v_{t+1} = \mu\, v_t + \nabla f(x_t), \qquad x_{t+1} = x_t - \eta\, v_{t+1} \qquad (2)$$

where $\mu \in [0, 1)$ is a scalar that controls the momentum accumulation. There are also other variants of momentum dynamics. Most prominently, Nesterov's accelerated gradient (Nesterov, 1983) offers a theoretically optimal convergence rate. However, it tends to behave poorly in practice due to its brittleness. For that reason, and also because of its immense popularity, we will focus on the classic momentum dynamics defined above.
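
To make these update rules concrete, here is a minimal NumPy sketch of the heavy ball update (2); the objective, learning rate, and momentum below are illustrative placeholders, not settings used in our experiments.

```python
import numpy as np

def heavy_ball_sgd(grad_f, x0, lr=0.1, mu=0.9, steps=100):
    """Minimize f via the classic (heavy ball) momentum update:
        v_{t+1} = mu * v_t + grad_f(x_t)
        x_{t+1} = x_t - lr * v_{t+1}
    Setting mu = 0 recovers plain gradient descent, Eq. (1)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad_f(x)      # gradient at the current iterate
        v = mu * v + g     # accumulate the momentum (velocity) vector
        x = x - lr * v     # take the step
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is simply x.
x_final = heavy_ball_sgd(lambda x: x, x0=np.ones(10), lr=0.1, mu=0.9)
```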

3 The Two Learning Regimes

In this work, we will be interested in isolating two learning regimes:

  • “large-step” regime: corresponds to the highest learning rate that can be used without causing divergence, as per Bengio (2012b).

  • “small-step” regime: corresponds to the largest learning rate at which loss is consistently decreasing. (In  Smith and Nicholay (2017), the authors propose an experimental procedure to estimate appropriate learning rates).

In carefully tuned step-wise learning rate schedules, the first and last learning rates usually correspond to the large-step and small-step regimes¹. Our goal is to characterize and understand how these regimes differ, first from an optimization and then from a generalization perspective.
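
As a rough illustration of how one might locate the two regimes in practice (loosely in the spirit of the procedure of Smith and Nicholay (2017)), the sketch below sweeps a grid of learning rates and records the largest one that does not diverge and the largest one for which the loss decreases at every epoch. The helper `train_one_epoch` and the grid of candidates are hypothetical placeholders; this is not the exact protocol we used.

```python
import numpy as np

def classify_regimes(train_one_epoch, lrs, epochs=5, tol=0.0):
    """Return (large_step_lr, small_step_lr) estimated from a grid of learning rates.

    train_one_epoch(lr, epoch) is assumed to return the training loss after that
    epoch (a hypothetical helper; any training harness could play this role)."""
    largest_stable, largest_monotone = None, None
    for lr in sorted(lrs):
        losses = [train_one_epoch(lr, e) for e in range(epochs)]
        diverged = any(not np.isfinite(l) for l in losses)
        monotone = all(l2 <= l1 + tol for l1, l2 in zip(losses, losses[1:]))
        if not diverged:
            largest_stable = lr        # candidate for the large-step regime
            if monotone:
                largest_monotone = lr  # candidate for the small-step regime
    return largest_stable, largest_monotone
```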

3.1 Optimization perspective

By examining the evolution of the loss from initialization in Figure 1, we can note three major differences between the two regimes:

Figure 1: Evolution of the training loss over 50 epochs with different momentum values on CIFAR-10 and VGG-13-BN. (Top) regime (A), using the large-step learning rate; (Bottom) regime (B), using the small-step learning rate.
  1. The best solution is found in the low learning rate regime, even though we performed the same number of steps that are 100 times smaller, which corresponds to a much shorter distance traveled from the initialization.

  2. In regime (A), the evolution of the loss is very noisy, while in (B), it decreases almost at each epoch.

  3. Momentum seems to behave completely differently in the two experiments. At the top of Figure 1, the largest momentum value yields the worst solution, whereas on the other, the final loss decreases as we increase the momentum.

These pieces of evidence suggest that regime (A) corresponds to a highly non-convex optimization problem, while the low learning rate regime matches the intuitions from the convex optimization world. To highlight this distinction, we will use momentum (as defined in Section 2).

Momentum.

Momentum can provably accelerate gradient descent over functions that are convex, but does not provide any theoretical guarantees when that property does not hold. In order to highlight the different nature of the problems we are solving in each regime, we compare the behavior of momentum when used on a convex function, and on a deep neural network under both regimes.

Ideally, we would like the momentum vector to be a signal that: (1) points reliably towards the optimum of our problem, and (2) is strong enough to actually have an impact on the trajectory. To focus on these two key properties, we track the two respective quantities:

  1. Alignment: the cosine similarity between the momentum step and the direction towards the optimum²,

    $$\mathrm{align}_t = \frac{\langle -v_t,\; x^{*} - x_t \rangle}{\|v_t\|\,\|x^{*} - x_t\|} \qquad (3)$$
  2. Scale: the ratio between the magnitudes of the momentum vector and the gradient,

    $$\mathrm{scale}_t = \frac{\|v_t\|}{\|\nabla f(x_t)\|} \qquad (4)$$
Figure 2: Evolution of (top) the value of the function, (middle) the alignment, and (bottom) the scale while optimizing a quadratic function defined by a random positive semi-definite matrix with a fixed condition number.

Note that, in order to be helpful in the optimization process, one would expect the direction of the momentum vector to be correlated with the direction towards the optimum (alignment close to 1), and its scale to be large enough to be significant. Indeed, that is the behavior that provably emerges in the context of convex optimization.
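
For concreteness, the two diagnostics (3) and (4) can be computed from the momentum vector, the current gradient, and the point the run eventually converges to. The sketch below assumes flat NumPy vectors; the variable names are ours and purely illustrative.

```python
import numpy as np

def alignment(v, x, x_star):
    """Cosine between the momentum step direction (-v) and the direction from
    the current iterate x to the optimum x_star, as in Eq. (3)."""
    d = x_star - x
    return float(np.dot(-v, d) / (np.linalg.norm(v) * np.linalg.norm(d) + 1e-12))

def scale(v, g):
    """Ratio between the magnitudes of the momentum vector and the gradient, Eq. (4)."""
    return float(np.linalg.norm(v) / (np.linalg.norm(g) + 1e-12))
```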

Convex function baseline.

Figure 2 shows that, in the case of a convex quadratic function, a higher momentum value results in faster convergence. According to the middle plot, the momentum vector is a strong indicator of the direction towards the optimum (the alignment quickly goes to 1). Also, the scale increases with the momentum and eventually converges, which is what one would expect when the momentum is indeed accumulating.

Figure 3: Evolution of the alignment and scale metrics corresponding to the experiment shown in Figure 1.

Deep learning setting.

Now that we have seen that the metrics behave as expected on convex functions, we can measure them (Figure 3) on the experiment presented earlier.

In regime (A), the scale oscillates around a small value and the alignment is very low. This means that the momentum vector is nearly orthogonal to the direction of the final solution and never constitutes a strong signal. In regime (B), the momentum vector is able to accumulate more and gives non-negligible information about the direction towards the point we are converging to.

According to Kidambi et al. (2018), momentum might not be able to cope with the noise coming from the stochasticity of SGD. While this is plausible, experiments in Appendix A using full gradients instead of mini-batches show that this noise has only a minimal impact and that the step size is the most important factor determining the success of momentum.

This leads to the following informal argument: with small step sizes, the trajectory is unable to escape the current basin of attraction. The region is “locally convex” and, as a consequence, allows the momentum vector to point towards the same critical point throughout the optimization process, thus helping to speed up optimization. On the other hand, high learning rates penalize the effect of momentum. The steps taken at each iteration are large enough to escape the current basin of attraction and enter a different one (therefore optimizing towards a different local minimum). As this happens, the direction approximated by the momentum vector points to different critical points over the course of optimization. Thus, at some iteration, momentum steers the trajectory towards a point that is no longer reachable. This hypothesis also explains why, in the top plot of Figure 1, momentum struggles more than vanilla SGD: the momentum vector, being nearly orthogonal to the gradient, often disagrees with it and slows down convergence.

3.2 Generalization perspective

If our objective is to minimize the loss, training in the small-step regime (B) is simpler and faster. Indeed, as we saw in Figure 1, it was roughly twice as fast to reach a given loss value. It is therefore natural to ask: why do we even spend time in the high learning rate regime? In deep learning, the loss is only a surrogate for our real objective: testing accuracy. It turns out that training only in the second regime, while fast, leads to very sharp minimizers. This phenomenon is similar to what was described in Keskar et al. (2017) in the context of the batch size.

The relationship between learning rate and generalization has already been studied in the past (Li et al., 2019; Hoffer et al., 2017; Keskar et al., 2017; Jiang et al., 2020). However, it seems that what truly defines the regime we are in is not the learning rate itself, but the actual step size³.

Momentum, as defined in Section 2, increases the size of the step we actually take at each iteration of SGD. As we saw in Section 3.1, it does not seem to be able to speed up the optimization process. However, it is easy to find parameters for which increasing momentum improves generalization. In this paper, we demonstrate that in the large-step regime (A), momentum's only effect is to increase the step size.

Indeed, assuming that the scale of the momentum vector does not fluctuate much during training and can be approximated by a constant, we can simulate the increase in step size implied by momentum simply by using a higher learning rate (Figure 4).
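
Under that assumption, the size of the accumulated step has a simple closed form; the derivation below is a standard geometric-series argument based on Eq. (2), not a result specific to our experiments.

```latex
% Assuming a roughly constant gradient g_t \approx g, the recursion
% v_{t+1} = \mu v_t + g_t unrolls into a geometric series:
\[
  v_t \;=\; \sum_{k=0}^{t-1} \mu^{k}\, g_{t-1-k}
      \;\approx\; g \sum_{k=0}^{t-1} \mu^{k}
      \;\xrightarrow[\;t \to \infty\;]{}\; \frac{g}{1-\mu},
\]
% so the update -\eta v_t approaches -\frac{\eta}{1-\mu}\, g: SGD with learning
% rate \eta and momentum \mu takes steps of roughly the same size as plain SGD
% with learning rate \eta / (1-\mu).
```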

Figure 4: Testing accuracies obtained for various learning rates and three different momentum values. VGG-13-BN was trained using SGD.
Figure 5: Loss curves of the best learning rate for each momentum value in Figure 4.

Figure 4 indicates that the generalization ability is dictated by the size of the steps taken rather than the learning rate itself. For three different momentum intensities, we can observe the same pattern repeating. Reductions in momentum appear to be compensated by increases in the learning rate. The three curves, albeit shifted, are surprisingly similar. They even exhibit the same drop just after their respective optimal learning rate. Additional experiments on other architectures and datasets were performed to rule out the hypothesis that these results are problem specific; results are presented in Figure 11. In Appendix B, we also explore in more detail the equivalence between pairs of learning rate and momentum values.

Finally, Figure 5 shows the evolution of the loss during training for the best performing learning rate of each momentum value considered in Figure 4. It is clear that momentum had no impact here, as the trajectories are remarkably similar. There is no evidence that the convergence was improved at all. The only difference that we can observe is that each model reached a given loss value at a different time, but this does not seem to be linked to the intensity of momentum. Moreover, they all reach very similar losses at the end of training.

4 Towards new learning rate schedules

Having characterized these two very distinct training regimes, it is tempting to experiment with a “stripped down” schedule that consists of two completely different phases; for each one, we use an algorithm individually tuned to excel at its particular task. The first one has to be SGD, as it provides good generalization to the model. The second can be any algorithm able to minimize the loss quickly. To stay consistent with the previous experiments, we pick SGD with momentum here, but we believe that many algorithms would perform similarly or better. In particular, fast algorithms that have been criticized for their poor generalization ability, like K-FAC (Martens and Grosse, 2015) and L-BFGS (Liu and Nocedal, 1989), could be perfect candidates. First, we will appraise the benefits of having radically different momentum values for the two phases. Second, we will evaluate the performance of this two-step approach against the more elaborate three-phase training schedule.
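
As an illustration, such a two-phase schedule can be realized with standard PyTorch optimizers by swapping the optimizer at the transition epoch. The learning rates, momentum value, and transition epoch below are placeholders to be tuned per task; they are not the exact values used in our experiments.

```python
import torch

def train_two_phase(model, loss_fn, loader, epochs=150, transition=75,
                    lr1=0.3, lr2=0.01, mu2=0.95, device="cpu"):
    """Phase 1: plain SGD (no momentum, large steps) for generalization.
    Phase 2: SGD with heavy momentum (small steps) to minimize the loss quickly.
    All hyper-parameters here are illustrative placeholders."""
    model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr1, momentum=0.0)
    for epoch in range(epochs):
        if epoch == transition:
            # Switch algorithm and hyper-parameters for the small-step phase.
            opt = torch.optim.SGD(model.parameters(), lr=lr2, momentum=mu2)
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```

Creating a fresh optimizer object at the transition (rather than only adjusting the learning rate) makes it explicit that the two phases are treated as two separate optimization problems.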

4.1 Decoupling momentum

We believe that, even if researchers do search for the best momentum value, unlike the learning rate, they assume that it stays constant. For example, in Goyal et al. (2017) and Shallue et al. (2018), a large number of schedules are compared, yet momentum never changes over the course of training. However, as we saw, the two regimes are wildly different. This is why we suggest isolating the two regimes into two tasks and optimizing them individually.

It turns out that with the appropriate learning rate, using momentum in the first phase has no observable impact on the performance of the models. However, having a larger momentum (again with an appropriate change in learning rate) is beneficial in the second phase as it increases the final testing accuracy under the same budget.

In order to control for the parameters, we trained multiple models and randomly picked the transition epoch, i.e., the epoch at which we switch from one algorithm to the other. We display the distribution of testing accuracies obtained in Figure 6⁴. On the top plot we see that the two distributions are the same (for a fixed second phase). On the bottom one, however, a more aggressive momentum associated with a smaller learning rate, on average, outperforms the “classic” parameters.

We previously observed that disabling momentum has to be accompanied by a corresponding increase in learning rate. To find such a learning rate, we used random search and took the one whose training loss curve matched the baseline as closely as possible over the first 50 epochs (more details about this procedure are in Appendix B).

Figure 6: Impact on the distribution of testing accuracies when using different momentum values in the training phases; (top) is for the first phase and (bottom) for the second.

4.2 Performance of the two phases schedule

Comparing against the popular three-step schedule, we find that two truly independent phases can perform similarly or better. This suggests that complex schedules are not necessary to train deep neural networks.

We evaluate this schedule on two datasets: CIFAR-10 and ImageNet (Russakovsky et al., 2015). For the former, we sampled many transition points and took the median over equally sized bins. For the latter, because it is particularly expensive, only a few transition points were hand picked. For CIFAR-10, we used the same parameters as in Section 4.1. For ImageNet, learning rates and momentum values were hand picked, as optimizing them would have been prohibitively costly.

Performance as a function of the transition epoch is shown in Figure 7 and Figure 8. In both cases, our schedule outperforms or matches the three-stage schedule for at least one value of the transition epoch. For CIFAR-10, we also considered enabling momentum in the first phase⁵. As our previous experiment would suggest, the two configurations appear equivalent.

Figure 7: Evolution of the testing accuracy as a function of the transition epoch for our proposed simplified two-step schedule on CIFAR-10. We present our schedule with and without momentum in the first phase to emphasize its lack of influence on the results.
Figure 8: Evolution of the testing accuracy as a function of the transition epoch for our proposed simplified two-step schedule on ImageNet.

5 Related work

The older and more popular multiple-step learning rate schedules probably originate from the practical recommendations found in Bengio (2012a). Bottou et al. (2018) provide a theoretical argument that supports schedules with decreasing learning rates.

More recently, Smith (2017) introduced the cyclical learning rate, which consists of a sequence of linear increases and decreases of the learning rate where the high and low values correspond to what we named the large- and small-step regimes. Soon after, Smith and Nicholay (2017) concluded that a single period of that pattern is sufficient to obtain good performance; the resulting schedule is named 1-cycle. Similarly to the cyclical learning rate schedule, Loshchilov and Hutter (2017) present SGDR, a schedule with sudden jumps of the learning rate similar to the restarts found in many gradient-free optimization techniques.

The learning rate is not the only parameter that has been considered for change over time. For example, Smith and Topin (2017) and Goyal et al. (2017) both had success varying the batch size. This is aligned with our recommendation that every hyper-parameter should be optimized for each phase of training.

The impact of large learning rates on generalization has received a lot of attention in the past. The predominant hypothesis is that a large learning rate acts as a regularizer (Li et al., 2019; Hoffer et al., 2017). It is believed that it either promotes flatter minima (Keskar et al., 2017; Jiang et al., 2020) or increases the amount of noise during training (Mandt et al., 2017; Smith and Topin, 2017).

6 Conclusion

In this paper, we studied the properties of the two regimes of deep network training. The large-step one is typically found in the early stages of training⁶, while the small-step one tends to end training.

Our investigations show that optimization in the large step-size regime does not follow the training patterns typically expected in the convex setting: the evolution of the loss is very noisy and we reach a solution far from the optimal one. In this regime, the benefits of momentum are nuanced: it seems that any gain it offers can be matched by a corresponding increase in learning rate.

The small step-size regime seems fundamentally different: we obtain a lower loss, faster and smoothly, but solutions generalize poorly. In this case, momentum can greatly speed up the convergence—as it does in the convex case.

The intensity of momentum and, more generally, the optimization algorithm used are typically considered during hyper-parameter search. However, they are always kept constant over the whole training. This restricts the search space drastically, because we are unable to tailor them to the different training regimes we encounter. By separating the two regimes into two distinct problems, we might be able to obtain better models and/or train them faster.

Indeed, we demonstrate that a simple schedule consisting of only two stages, the first one being SGD with no momentum and the second SGD with a larger-than-usual momentum, can be competitive with state-of-the-art learning rate schedules. This opens up the possibility of developing new training algorithms that are specialized in only one regime. This might also let us leverage second order methods (usually criticized for their poor generalization performance) in the second phase of training.

Appendix A Full gradient experiments

In this appendix we reproduce some of the experiments made in Section 3.1 and Section 3.2 using full gradients instead of SGD to rule out the possibility that the stochasticity is the cause of the inability of momentum to build up.

Figure 9 and Figure 10 present results similar to Figure 1 and Figure 4, respectively.
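
For reference, full (deterministic) gradients can be obtained by accumulating mini-batch gradients over the whole training set before each update; below is a minimal PyTorch sketch, assuming the loss function averages over the batch. It is an illustration of the idea, not our exact training code.

```python
import torch

def full_gradient_step(model, loss_fn, loader, lr=0.1, device="cpu"):
    """One full-batch gradient descent step: gradients are averaged over the
    entire dataset before updating the parameters (illustrative sketch)."""
    model.to(device)
    model.zero_grad()
    total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        batch = x.size(0)
        # Multiply the batch-mean loss by the batch size so that the accumulated
        # .grad holds the sum of per-sample gradients over the dataset.
        (loss_fn(model(x), y) * batch).backward()
        total += batch
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                # Divide by the dataset size to take a step along the mean gradient.
                p -= lr * p.grad / total
    return model
```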

Figure 9: Evolution of the training loss using full gradients with different momentum values on CIFAR-10 and VGG-13-BN, for the two learning rate regimes, (a) and (b).
Figure 10: Testing accuracy after 50 epochs as a function of the learning rate for different momentum values. A VGG-13-BN model was trained using GD and no weight decay.

Appendix B Momentum learning-rate equivalence in the large step regime

To evaluate in more detail the relationship that ties the learning rate and the momentum together, we designed the following experiment:

  1. Train models on CIFAR-10 for 50 epochs with a wide range of learning rates and three different momentum intensities.

  2. For each configuration at one reference momentum value, find the two corresponding configurations at the other momentum values that best match its training loss curve in norm.

  3. We report the corresponding matching learning rates and the norm of the difference between the curves in Figure 12.

We see that, for any learning rate in the range we considered, it is possible to find an equivalent learning rate with almost identical behavior. Moreover, the relation between equivalent learning rates seems to be linear.
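
The matching step can be phrased as a nearest-neighbour search over recorded loss curves. The sketch below assumes each candidate run is stored as a (learning rate, loss curve) pair with curves of equal length; it illustrates the idea rather than describing our exact tooling.

```python
import numpy as np

def best_matching_lr(reference_curve, candidates):
    """Among candidate runs [(lr, curve), ...], return the learning rate whose
    training-loss curve is closest to the reference in L2 norm, together with
    that distance (illustrative sketch)."""
    ref = np.asarray(reference_curve, dtype=float)
    best_lr, best_dist = None, np.inf
    for lr, curve in candidates:
        dist = float(np.linalg.norm(ref - np.asarray(curve, dtype=float)))
        if dist < best_dist:
            best_lr, best_dist = lr, dist
    return best_lr, best_dist
```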

Figure 11: Testing accuracies obtained for various learning rates and three different momentum values, trained using SGD. Models are the same for each row and are, from top to bottom: ResNet18, ResNet50, VGG13, VGG19. The (left) column shows CIFAR-10 and the (right) column CINIC-10.
Figure 12: Best equivalent learning rate (top) and corresponding distance between the loss curves (bottom) for different momentum values.
Figure 13: Distribution of test accuracies with and without momentum in the first phase for different second-phase algorithms: (top) classic, (middle) reduced learning rate, (bottom) AdamW.

Appendix C Experiment details

C.1 Shared between all experiments

The details provided in this section are valid for every experiment unless specified otherwise:

  • Programming language: Python 3

  • Framework: PyTorch 1.0

  • Dataset: CIFAR-10 (Krizhevsky, 2009)

  • Batch size:

  • Weight decay:

  • Per-channel normalization: Yes

  • Data augmentation:

    1. Random Crop

    2. Random horizontal flip

C.2 Experiment visible on Figure 1 and Figure 3

  • Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015).

  • Learning rates: and for the large and small steps regime respectively.

  • Momentum type: Heavy ball (non Nesterov)

  • Momentum intensities: , and

C.3 Experiment visible on Figure 2

  • Function optimized:

  • Iterations:

  • Properties of : Fixed positive semi-definite random matrix with eigenvalues ranging from to .

  • Momentum type: Heavy ball (non Nesterov)

  • Momentum intensities: , , and

  • Learning rates: They were picked to yield the best performance for each momentum value, using a grid search procedure. Results of the grid search are visible in Figure 14.

  • Grid search range: 50 values equally spaced in -scale, 50 values equally spaced in -scale.

Figure 14: Result of the grid search to find the best learning rate for different momentum intensities. The value reached after 10000 iterations is displayed.
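
For reference, the grid search on the quadratic can be reproduced with a few lines of NumPy; the matrix, momentum value, and learning rate grid passed to the function are placeholders rather than the exact settings above.

```python
import numpy as np

def quadratic_best_lr(A, mu, lrs, steps=10000):
    """Run heavy-ball gradient descent on f(x) = 0.5 * x^T A x for each learning
    rate and return the one with the lowest final value (illustrative sketch)."""
    best_lr, best_val = None, np.inf
    for lr in lrs:
        x = np.ones(A.shape[0])
        v = np.zeros_like(x)
        for _ in range(steps):
            g = A @ x                       # gradient of the quadratic (A symmetric)
            v = mu * v + g
            x = x - lr * v
            if not np.all(np.isfinite(x)):  # this learning rate diverged
                break
        val = 0.5 * x @ A @ x if np.all(np.isfinite(x)) else np.inf
        if val < best_val:
            best_lr, best_val = lr, val
    return best_lr, best_val
```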

C.4 Experiment visible on Figure 4 and Figure 5

  • Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015).

  • Momentum type: Heavy ball (non Nesterov)

  • Momentum intensities: , and

  • Learning rates: We performed a random search to find the best learning rate for each momentum value. We took 20 samples uniformly in log scale in the following ranges:

C.5 Experiment visible at the top of Figure 6

  • Framework: PyTorch 0.4.1

  • Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015).

  • Batch size:

  • Optimizers

    1. Phase 1 (with momentum): SGD

      1. Learning rate:

      2. Momentum:

      3. Weight decay:

    2. Phase 1 (without momentum): SGD

      1. Learning rate:

      2. Momentum:

      3. Weight decay:

    3. Phase 2 (same for the two distributions): SGD with momentum

      1. Learning rate:

      2. Momentum:

      3. Weight decay:

C.6 Experiment visible at the bottom of Figure 6

  • Framework: PyTorch 0.4.1

  • Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015).

  • Batch size:

  • Optimizers

    1. Phase 1 (same for the two distributions): SGD

      1. Learning rate:

      2. Momentum:

      3. Weight decay:

    2. Phase 2: SGD

      1. Learning rate: displayed on the legend

      2. Momentum: displayed on the legend

      3. Weight decay:

C.7 Experiment visible on Figure 7

  • Framework: PyTorch 0.4.1

  • Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015).

  • Batch size:

  • Data augmentation: None

  • Optimizers

    1. Phase 1: SGD

      1. Learning rate:

      2. Momentum:

      3. Weight decay:

    2. Phase 2: SGD with momentum

      1. Learning rate:

      2. Momentum:

      3. Weight decay:

  • Momentum type: Classic

  • Reference testing accuracy: Median over multiple training runs that we ran ourselves with the same parameters except for the learning rate schedule. The default three-stage schedule was used with a constant momentum.

C.8 Experiment visible on Figure 8

  • Framework: PyTorch 0.4.1 + Robustness 1.1

  • Architecture: ResNet-18 (He et al., 2016)

  • Batch size:

  • Data augmentation:

    1. Random crop to size 224

    2. Random horizontal flip

    3. Color Jitter

    4. Lighting noise

  • Optimizers

    1. Phase 1: SGD

      1. Learning rate:

      2. Momentum:

      3. Weight decay:

    2. Phase 2: SGD with momentum

      1. Learning rate:

      2. Momentum:

      3. Weight decay:

  • Momentum type: Classic

  • Reference testing accuracy: We used the value available at https://pytorch.org/docs/stable/torchvision/models.html on the day of submission.

Footnotes

  1. We could not identify a sharp boundary between these two regimes. Learning rates in between the two extremes seem to essentially produce a mixture of the two behaviors.
  2. When the optimum is not known, as is the case for neural networks, we instead use the solution our algorithm eventually converged to.
  3. This is the reason why we prefer the terms large- and small-step regimes, as it is possible to take large steps with a small learning rate if momentum is large enough.
  4. Results for different second phase algorithms are available in the appendix on Figure 13.
  5. with the appropriate change in learning rate
  6. Cyclic learning rates and the schedule used in Goyal et al. (2017) are examples of exceptions.

References

  1. Y. Bengio (2012a). Practical recommendations for gradient-based training of deep architectures. Lecture Notes in Computer Science, vol. 7700, pp. 437–478.
  2. Y. Bengio (2012b). Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade: Second Edition.
  3. L. Bottou, F. E. Curtis, and J. Nocedal (2018). Optimization methods for large-scale machine learning. SIAM Review.
  4. P. Goyal et al. (2017). Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
  5. K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  6. E. Hoffer, I. Hubara, and D. Soudry (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
  7. S. Ioffe and C. Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML).
  8. Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio (2020). Fantastic generalization measures and where to find them. In International Conference on Learning Representations (ICLR).
  9. N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017). On large-batch training for deep learning: generalization gap and sharp minima. In International Conference on Learning Representations (ICLR).
  10. R. Kidambi, P. Netrapalli, P. Jain, and S. Kakade (2018). On the insufficiency of existing momentum schemes for stochastic optimization. In International Conference on Learning Representations (ICLR).
  11. A. Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report.
  12. Y. Li, C. Wei, and T. Ma (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
  13. Z. Li and S. Arora (2019). An exponential learning rate schedule for deep learning.
  14. D. C. Liu and J. Nocedal (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming.
  15. I. Loshchilov and F. Hutter (2017). SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR).
  16. S. Mandt, M. D. Hoffman, and D. M. Blei (2017). Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research 18(1).
  17. J. Martens and R. Grosse (2015). Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML).
  18. Y. Nesterov (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady.
  19. B. T. Polyak (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics.
  20. O. Russakovsky et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV).
  21. C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl (2018). Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600.
  22. K. Simonyan and A. Zisserman (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
  23. L. N. Smith (2017). Cyclical learning rates for training neural networks. In IEEE Winter Conference on Applications of Computer Vision (WACV).
  24. L. N. Smith and N. Topin (2017). Super-convergence: very fast training of neural networks using large learning rates. arXiv preprint arXiv:1708.07120.
  25. L. N. Smith and N. Topin (2017). Super-convergence: very fast training of residual networks using large learning rates. CoRR abs/1708.07120.