The Two Regimes of Deep Network Training
1 Introduction
Finding the right learning rate schedule is critical to obtaining the best test accuracy for a given neural network architecture. As deep learning gained popularity, starting with the largest stable learning rate and gradually decreasing it became standard practice (Bengio, 2012b). Indeed, such "step schedules" remain among the most popular learning rate schedules today, and when properly tuned they yield competitive models.
More recently, as architectures grew deeper and wider, and as training on massive datasets became the norm, more elaborate schedules emerged (Loshchilov and Hutter, 2017; Smith, 2017; Smith and Topin, 2017). While these schedules have shown great practical success, it is still unclear why that is the case. Li and Arora (2019) even showed that, counterintuitively, an exponentially increasing schedule can also be effective. In light of this, it is time to revisit learning rate schedules and shed some light on why some perform well and others don't.
Our contributions
In this paper, we identify two training regimes: (1) the large-step regime and (2) the small-step regime, which typically correspond to the start and the end of a step schedule, respectively. In particular, we examine these regimes through the lens of optimization and generalization. We find that:

In the large-step regime, the loss does not decrease consistently at each epoch, and the final loss value obtained after convergence is much higher than when training in the small-step regime. In the latter regime, the loss decreases faster and more smoothly, and to a large degree matches the intuition drawn from the convex optimization literature.

In the large-step regime, momentum does not seem to have a discernible benefit. More precisely, we show that we can recover similar loss-decrease curves for a wide range of momentum values as long as we make a corresponding change in the learning rate. In the small-step regime, however, momentum becomes crucial to reaching a good solution quickly.

Finally, we leverage this understanding to propose a simple two-stage learning rate schedule that achieves state-of-the-art performance on the CIFAR-10 and ImageNet datasets. Importantly, each stage of this schedule uses a different algorithm and different hyperparameters.
Our findings suggest that it might be beneficial to depart from viewing deep network training as a single optimization problem, and instead to explore using different algorithms for different stages of that process. In particular, second-order methods (such as K-FAC (Martens and Grosse, 2015) and L-BFGS (Liu and Nocedal, 1989)), which are currently viewed as successful at reducing the number of training iterations but as leading to suboptimal generalization performance, might be good candidates for use solely in the small-step regime.
2 Background
Given a (differentiable) function $f : \mathbb{R}^n \to \mathbb{R}$, one of the most popular techniques for minimizing it is the gradient descent (GD) method. Starting from an initial solution $x_0$, GD iteratively updates the solution as:

$$x_{t+1} = x_t - \eta \, \nabla f(x_t), \qquad (1)$$

where $\eta$ is the learning rate. GD is the most natural and simple continuous optimization scheme. However, there is a host of more advanced variants. One of the most prominent is momentum gradient descent, often referred to as classic momentum, or the heavy-ball method (Polyak, 1964). It corresponds to the update rule:

$$v_{t+1} = \mu \, v_t + \nabla f(x_t), \qquad x_{t+1} = x_t - \eta \, v_{t+1}, \qquad (2)$$

where $\mu$ is a scalar that controls the momentum accumulation. There are also other variants of momentum dynamics. Most prominently, Nesterov's accelerated gradient (Nesterov, 1983) offers a theoretically optimal convergence rate, but it tends to behave poorly in practice due to its brittleness. For that reason, and also because of its immense popularity, we focus on the classic momentum dynamics above instead.
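The two update rules above can be sketched in a few lines of NumPy; the quadratic test function and the hyperparameter values are illustrative choices, not settings from our experiments:

```python
import numpy as np

def gd(grad, x0, lr, steps):
    """Plain gradient descent (Eq. 1): x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def heavy_ball(grad, x0, lr, mu, steps):
    """Classic (heavy-ball) momentum (Eq. 2):
    v <- mu * v + grad(x);  x <- x - lr * v."""
    x, v = x0, np.zeros_like(x0)
    for _ in range(steps):
        v = mu * v + grad(x)
        x = x - lr * v
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x and optimum is 0.
grad = lambda x: x
x0 = np.ones(3)
print(np.linalg.norm(gd(grad, x0, 0.1, 200)))          # both converge
print(np.linalg.norm(heavy_ball(grad, x0, 0.1, 0.9, 200)))
```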
3 The Two Learning Regimes
In this work, we will be interested in isolating two learning regimes:

“large-step” regime: corresponds to the highest learning rate that can be used without causing divergence, as per Bengio (2012b).

“small-step” regime: corresponds to the largest learning rate at which the loss decreases consistently. (Smith and Topin (2017) propose an experimental procedure to estimate appropriate learning rates.)
In carefully tuned stepwise learning rate schedules, the first and last learning rates usually correspond to the large-step and small-step regimes, respectively.
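As a concrete illustration of how one might locate the upper end of the small-step regime, here is a minimal sketch of a learning-rate range test in the spirit of Smith (2017): train briefly at each candidate rate and keep the largest one that still decreases the loss. The toy objective and the candidate grid below are our own illustrative assumptions, not the procedure from the cited papers:

```python
import numpy as np

def lr_range_test(grad, loss, x0, lrs, steps=20):
    """Train briefly at each candidate learning rate and record the
    final loss; rates past the divergence threshold blow up."""
    results = []
    for lr in lrs:
        x = x0.copy()
        for _ in range(steps):
            x = x - lr * grad(x)
        results.append((lr, loss(x)))
    return results

# Toy objective f(x) = 0.5 * ||x||^2 (gradient x); GD diverges for lr > 2.
grad = lambda x: x
loss = lambda x: 0.5 * float(x @ x)
lrs = [0.01, 0.1, 1.0, 1.9, 2.1]
for lr, final_loss in lr_range_test(grad, loss, np.ones(2), lrs):
    print(lr, final_loss)
```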
3.1 Optimization perspective
By examining the evolution of the loss from initialization in Figure 1, we can note three major differences between the two regimes:

The best solution is found in the low learning rate regime, even though we performed the same number of steps, each 100 times smaller, which corresponds to a much shorter distance traveled from initialization.

In regime (A), the evolution of the loss is very noisy, while in (B), it decreases at almost every epoch.

Momentum seems to behave completely differently in the two experiments. At the top of Figure 1, the largest momentum value yields the worst solution, whereas in the other plot, the final loss decreases as we increase momentum.
These pieces of evidence suggest that regime (A) poses a highly non-convex optimization problem, while the low learning rate regime matches the intuitions from the convex optimization world. To highlight this distinction we will use momentum (as defined in Section 2).
Momentum.
Momentum can provably accelerate gradient descent over functions that are convex, but does not provide any theoretical guarantees when that property does not hold. In order to highlight the different nature of the problems we are solving in each regime, we compare the behavior of momentum when used on a convex function, and on a deep neural network under both regimes.
Ideally, we would like the momentum vector to be a signal that: (1) points reliably towards the optimum of our problem, and (2) is strong enough to actually have an impact on the trajectory. To focus on these two key properties, we track the two respective quantities:

Alignment: the cosine of the angle between the momentum step and the direction to the optimum,

$$a_t = \frac{\langle -v_t,\; x^\star - x_t \rangle}{\lVert v_t \rVert \, \lVert x^\star - x_t \rVert}, \qquad (3)$$
Scale: the ratio between the magnitudes of the momentum vector and the gradient,

$$s_t = \frac{\lVert v_t \rVert}{\lVert \nabla f(x_t) \rVert}. \qquad (4)$$
Note that, in order to be helpful in the optimization process, one would expect the momentum step to be correlated with the direction towards the optimum (an alignment close to 1), and its scale to be large enough to be significant. Indeed, that is the behavior that provably emerges in the context of convex optimization.
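The following sketch computes both quantities on a toy convex quadratic under heavy-ball dynamics, where one expects the alignment to approach 1 and the scale to approach $1/(1-\mu)$. The matrix, step size, and iteration count are illustrative choices, and the sign convention for the alignment (using the step $-v_t$) is our assumption:

```python
import numpy as np

mu, lr = 0.9, 1e-3
A = np.diag([1.0, 0.1])       # convex quadratic f(x) = 0.5 x^T A x, optimum x* = 0
x = np.array([1.0, 1.0])
v = np.zeros(2)
x_star = np.zeros(2)

for _ in range(500):
    g = A @ x                 # gradient of f at the current iterate
    v = mu * v + g            # heavy-ball accumulation (Eq. 2)
    x = x - lr * v

step = -v                     # direction the next update will move in
to_opt = x_star - x
align = step @ to_opt / (np.linalg.norm(step) * np.linalg.norm(to_opt))
sc = np.linalg.norm(v) / np.linalg.norm(A @ x)
print(align, sc)              # alignment near 1, scale near 1/(1 - mu) = 10
```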
Convex function baseline.
Figure 2 shows that, for a quadratic convex function, a higher momentum value results in faster convergence. According to the middle plot, the momentum vector is a strong indicator of the direction towards the optimum (the alignment quickly goes to 1). The scale also increases with momentum and eventually converges towards $1/(1-\mu)$, which is what one would expect when the momentum is indeed accumulating.
Deep learning setting.
Now that we have seen that the metrics behave as we expect on convex functions, we can measure them (Figure 3) in the experiment presented earlier.
In regime (A), the scale merely oscillates and the alignment stays very low. This means that the momentum vector is nearly orthogonal to the direction of the final solution and never constitutes a strong signal. In regime (B), the momentum vector is able to accumulate more and gives non-negligible information about the direction towards the point we are converging to.
According to Kidambi et al. (2018), momentum might not be able to cope with the noise coming from the stochasticity of SGD. While this is plausible, experiments in Appendix A using full gradients instead of minibatches show that this noise has only minimal impact, and that the step size is the most important factor determining the success of momentum.
This leads to the following informal argument: with small step sizes, the trajectory is unable to escape the current basin of attraction. The region is "locally convex" and, as a consequence, allows the momentum vector to point towards the same critical point throughout the optimization process, thus helping to speed up optimization. On the other hand, high learning rates undermine the effect of momentum. The steps taken at each iteration are large enough to escape the current basin of attraction and enter a different one (thus optimizing towards a different local minimum). As this happens, the direction approximated by the momentum vector points to different critical points over the course of optimization. Thus, at some iterations, momentum steers the trajectory towards a point that is no longer reachable. This hypothesis also explains why, in the top plot of Figure 1, momentum struggles more than vanilla SGD: the momentum vector, being nearly orthogonal to the gradient, often disagrees with it completely and slows down convergence.
3.2 Generalization perspective
If our objective is only to minimize the loss, training in the small-step regime (B) is simpler and faster. Indeed, as we saw in Figure 1, it was twice as fast to reach the same loss value. It is therefore natural to ask: why do we even spend time in the high learning rate regime? In deep learning, the loss is only a surrogate for our real objective: test accuracy. It turns out that training only in the second regime, while fast, leads to very sharp minimizers, a phenomenon similar to what Keskar et al. (2017) described in the context of batch size.
The relationship between learning rate and generalization has been studied in the past (Li et al., 2019; Hoffer et al., 2017; Keskar et al., 2017; Jiang et al., 2020). However, it seems that what truly defines the regime we are in is not the learning rate itself, but the actual step size.
Momentum, as defined in Section 2, increases the size of the step we actually take at each iteration of SGD. As we saw in Section 3.1, it does not seem to speed up the optimization process in this regime. However, it is easy to find settings where increasing momentum improves generalization. In this paper, we demonstrate that in the large-step regime (A), momentum merely boosts the step size.
Indeed, assuming that the scale does not fluctuate much during training and can be approximated by a constant, we can simulate the increase in step size implied by momentum simply by using a higher learning rate (Figure 4).
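This approximation can be checked on a toy quadratic: heavy-ball dynamics with momentum $\mu$ and learning rate $\eta$ end up close to plain gradient descent with the "effective" rate $\eta/(1-\mu)$. The function and hyperparameter values below are illustrative:

```python
import numpy as np

A = np.diag([1.0, 0.1])               # quadratic f(x) = 0.5 x^T A x
loss = lambda x: 0.5 * x @ A @ x

def run(lr, mu, steps=500):
    """Heavy-ball GD (Eq. 2); mu = 0 recovers plain GD (Eq. 1)."""
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = mu * v + A @ x
        x = x - lr * v
    return loss(x)

lr, mu = 1e-3, 0.9
with_momentum = run(lr, mu)
# Plain GD with the "effective" step size lr / (1 - mu)
equivalent = run(lr / (1 - mu), 0.0)
print(with_momentum, equivalent)      # final losses agree to within a few percent
```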
Figure 4 indicates that the generalization ability is dictated by the size of the steps taken rather than by the learning rate itself. For three different momentum intensities, we observe the same pattern repeating: reductions in momentum can be compensated for by increasing the learning rate. The three curves, albeit shifted, are strikingly similar; they even exhibit the same drop just after their respective optimal learning rates. Additional experiments on other architectures and datasets were performed to rule out the hypothesis that these results are problem-specific; results are presented in Figure 11. In Appendix B, we explore in more detail the equivalence of pairs of learning rate and momentum values.
Finally, Figure 5 shows the evolution of the loss during training for the best-performing learning rate of each momentum value considered in Figure 4. It is clear that momentum had no impact here, as the trajectories are strikingly similar; there is no evidence that convergence was improved at all. The only difference we can observe is that each model reaches a given loss at a slightly different time, but this does not seem to be linked to the intensity of momentum. Moreover, they all reach very similar losses at the end of training.
4 Towards new learning rate schedules
As we have characterized these two very distinct training regimes, it is tempting to experiment with a "stripped down" schedule that consists of two completely different phases. For each one, we use an algorithm individually tuned to excel at its particular task. The first has to be SGD, as it provides good generalization to the model. The second can be any algorithm able to minimize the loss quickly. To stay consistent with the previous experiments, we pick SGD with momentum here, but we believe that many algorithms would perform similarly or better. In particular, fast algorithms that have been criticized for their poor generalization ability, like K-FAC (Martens and Grosse, 2015) and L-BFGS (Liu and Nocedal, 1989), could be perfect candidates. First, we appraise the benefits of having radically different momentum values for the two phases. Second, we evaluate the performance of this two-phase approach in comparison to the more elaborate three-phase training schedule.
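A minimal sketch of the resulting two-phase schedule logic; the hyperparameter values are purely illustrative, not the tuned settings used in our experiments:

```python
def two_phase_hyperparams(epoch, transition_epoch,
                          lr1=0.1, mu1=0.0, lr2=0.001, mu2=0.95):
    """Hypothetical two-phase schedule: plain large-step SGD in phase 1
    (for generalization), small-step SGD with high momentum in phase 2
    (for fast loss minimization). Returns (learning rate, momentum)."""
    if epoch < transition_epoch:
        return lr1, mu1       # large-step regime: no momentum
    return lr2, mu2           # small-step regime: high momentum

print(two_phase_hyperparams(10, 50))   # phase 1 settings
print(two_phase_hyperparams(60, 50))   # phase 2 settings
```

In a framework such as PyTorch, these values would simply be written into the optimizer's parameter groups at the start of each epoch.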
4.1 Decoupling momentum
We believe that, even when researchers do search for the best momentum value, they assume, unlike for the learning rate, that it stays constant. For example, Goyal et al. (2017) and Shallue et al. (2018) compare a large number of schedules, yet momentum never changes over the course of training. However, as we saw, the two regimes are wildly different. This is why we suggest isolating the two regimes into two tasks, and optimizing each individually.
It turns out that with the appropriate learning rate, using momentum in the first phase has no observable impact on the performance of the models. However, a larger momentum (again with an appropriate change in learning rate) is beneficial in the second phase, as it increases the final testing accuracy under the same budget.
In order to control for the parameters, we trained multiple models and randomly picked the transition epoch, the epoch at which we switch from one algorithm to the other. We display the distribution of testing accuracies obtained in Figure 6.
We previously observed that disabling momentum has to be accompanied by a corresponding increase in learning rate. To find such a learning rate, we used random search and took the one whose training loss curve matched the baseline as closely as possible over the first 50 epochs (more details about this procedure in Appendix B).
4.2 Performance of the two phases schedule
Comparing against the popular three-phase schedule, we find that two truly independent phases can perform similarly or better. This suggests that complex schedules are not necessary to train deep neural networks.
We evaluate this schedule on two datasets: CIFAR-10 and ImageNet (Russakovsky et al., 2015). For the former, we sampled many transition points and took the median over equally sized bins. For the latter, because it is particularly expensive to train on, only a few transition points were hand-picked. For CIFAR-10, we used the same parameters as in Section 4.1. For ImageNet, learning rates and momentum values were hand-picked, as optimizing them would have been prohibitively costly.
Performance as a function of the transition epoch is shown in Figure 7 and Figure 8. In both cases, our schedule outperforms or matches the three-phase schedule for at least one value of the transition epoch. For CIFAR-10, we also considered enabling momentum in the first phase.
5 Related work
The older and more popular multiple-step learning rate schedules probably originate from the practical recommendations found in Bengio (2012a). Bottou et al. (2018) provide a theoretical argument that supports schedules with decreasing learning rates.
More recently, Smith (2017) introduced the cyclical learning rate, which consists of a sequence of linear increases and decreases of the learning rate, where the high and low values correspond to what we call the large- and small-step regimes. Soon after, Smith and Topin (2017) concluded that a single period of that pattern is sufficient to obtain good performance; the resulting schedule is named 1cycle. Similarly to the cyclical learning rate schedule, Loshchilov and Hutter (2017) present SGDR, a schedule with sudden jumps of the learning rate similar to the restarts found in many gradient-free optimization techniques.
The learning rate is not the only parameter that has been considered to change over time. For example, Smith and Topin (2017) and Goyal et al. (2017) both had success varying the batch size. This aligns with our recommendation that every hyperparameter should be optimized for each phase of training.
The impact of large learning rates on generalization has received a lot of attention in the past. The predominant hypothesis is that a large learning rate acts as a regularizer (Li et al., 2019; Hoffer et al., 2017). It is believed to either promote flatter minima (Keskar et al., 2017; Jiang et al., 2020) or increase the amount of noise during training (Mandt et al., 2017; Smith and Topin, 2017).
6 Conclusion
In this paper, we studied the properties of the two regimes of deep network training. The large-step regime is typically found in the early stages of training.
Our investigations show that optimization in the large-step regime does not follow the training patterns typically expected in the convex setting: the evolution of the loss is very noisy and we reach a solution far from the optimal one. In this regime, the benefits of momentum are nuanced: it seems that any gain it offers can be compensated for by a corresponding increase in learning rate.
The small stepsize regime seems fundamentally different: we obtain a lower loss, faster and smoothly, but solutions generalize poorly. In this case, momentum can greatly speed up the convergence—as it does in the convex case.
The intensity of momentum and, more generally, the optimization algorithm used are typically considered during hyperparameter search. However, they are always kept constant over the whole training run. This restricts the search space drastically, because we are unable to tailor them to the different training regimes we encounter. By separating the two regimes into two distinct problems, we might be able to obtain better models and/or train them faster.
Indeed, we demonstrate that a simple schedule consisting of only two stages, the first being SGD with no momentum and the second SGD with a larger-than-usual momentum value, can be competitive with state-of-the-art learning rate schedules. This opens up the possibility of developing new training algorithms specialized in only one regime. It might also let us leverage second-order methods, usually criticized for their poor generalization performance, in the second phase of training.
Appendix A Full gradient experiments
In this appendix we reproduce some of the experiments from Section 3.1 and Section 3.2 using full gradients instead of SGD, to rule out the possibility that stochasticity is the cause of momentum's inability to build up.
Appendix B Momentum/learning-rate equivalence in the large-step regime
To evaluate in more detail the relationship that ties the learning rate and the momentum together, we designed the following experiment:

Train models on CIFAR-10 for 50 epochs with a wide range of learning rates and three different momentum intensities: , and .

For each configuration with we find the corresponding two configurations with and that best match the training loss in norm.

We report the corresponding matching learning rates and the norm between the curves in Figure 12.
We see that for any learning rate between and , it is possible to find an equivalent learning rate with almost identical behavior. Moreover, the relationship between equivalent learning rates seems to be linear.
Appendix C Experiment details
C.1 Shared between all experiments
The details provided in this section are valid for every experiment unless specified otherwise:

Programming language: Python 3

Framework: PyTorch 1.0

Dataset: CIFAR-10 (Krizhevsky, 2009)

Batch size:

Weight decay:

Per-channel normalization: Yes

Data augmentation:

Random Crop

Random horizontal flip

C.2 Experiment visible in Figure 1 and Figure 3
C.3 Experiment visible in Figure 2

Function optimized:

Iterations:

Properties of : Fixed positive semi-definite random matrix with eigenvalues ranging from to .

Momentum type: Heavy ball (non Nesterov)

Momentum intensities: , , and

Learning rates: They were picked to yield the best performance for each momentum value, using a grid search procedure. Results of the grid search are visible in Figure 14.

Grid search range: 50 values equally spaced in scale, 50 values equally spaced in scale.
C.4 Experiment visible in Figure 4 and Figure 5

Momentum type: Heavy ball (non Nesterov)

Momentum intensities: , and

Learning rates: We performed a random search to find the best one for each momentum value. We took 20 samples uniformly in log scale in the following ranges:

C.5 Experiment visible at the top of Figure 6

Framework: PyTorch 0.4.1

Batch size:

Optimizers

Phase 1 (with momentum): SGD

Learning rate:

Momentum:

Weight decay:


Phase 1 (without momentum): SGD

Learning rate:

Momentum:

Weight decay:


Phase 2 (same for the two distributions): SGD with momentum

Learning rate:

Momentum:

Weight decay:


C.6 Experiment visible at the bottom of Figure 6

Framework: PyTorch 0.4.1

Batch size:

Optimizers

Phase 1 (same for the two distributions): SGD

Learning rate:

Momentum:

Weight decay:


Phase 2 SGD

Learning rate: displayed on the legend

Momentum: displayed on the legend

Weight decay:


C.7 Experiment visible in Figure 7

Framework: PyTorch 0.4.1

Batch size:

Data augmentation: None

Optimizers

Phase 1: SGD

Learning rate:

Momentum:

Weight decay:


Phase 2: SGD with momentum

Learning rate:

Momentum:

Weight decay:



Momentum type: Classic

Reference testing accuracy: Median over multiple training runs that we ran ourselves with the same parameters, except for the learning rate schedule. The default three-stage schedule was used with a constant .
C.8 Experiment visible in Figure 8

Framework: PyTorch 0.4.1 + Robustness 1.1

Architecture: ResNet-18 (He et al., 2016)

Batch size:

Data augmentation:

Random crop to size 224

Random horizontal flip

Color Jitter

Lighting noise


Optimizers

Phase 1: SGD

Learning rate:

Momentum:

Weight decay:


Phase 2: SGD with momentum

Learning rate:

Momentum:

Weight decay:



Momentum type: Classic

Reference testing accuracy: We used the value available at https://pytorch.org/docs/stable/torchvision/models.html on the day of submission.
Footnotes
We could not identify a sharp boundary between these two regimes. Learning rates in between the two extremes seem to yield essentially a mixture of the two behaviors.
When the optimum is not known, as is the case for neural networks, we instead use the solution our algorithm eventually converged to.
This is the reason why we prefer the terms large- and small-step regimes: it is possible to take large steps with a small learning rate if momentum is large enough.
 Results for different second phase algorithms are available in the appendix on Figure 13.
 with the appropriate change in learning rate
 Cyclic learning rates and the schedule used in Goyal et al. (2017) are example of exceptions.
References
 Practical recommendations for gradient-based training of deep architectures. Lecture Notes in Computer Science 7700, pp. 437–478.
 Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, Second Edition.
 Optimization methods for large-scale machine learning. SIAM Review.
 Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
 Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
 Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
 Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML).
 Fantastic generalization measures and where to find them. In International Conference on Learning Representations (ICLR).
 On large-batch training for deep learning: generalization gap and sharp minima. In International Conference on Learning Representations (ICLR).
 On the insufficiency of existing momentum schemes for stochastic optimization. In International Conference on Learning Representations (ICLR).
 Learning multiple layers of features from tiny images. Technical report.
 Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
 An exponential learning rate schedule for deep learning.
 On the limited memory BFGS method for large scale optimization. Mathematical Programming.
 SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR).
 Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research 18 (1).
 Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML).
 A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady.
 Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics.
 ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV).
 Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600.
 Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
 Cyclical learning rates for training neural networks. In Winter Conference on Applications of Computer Vision (WACV).
 Super-convergence: very fast training of neural networks using large learning rates. arXiv preprint arXiv:1708.07120.