Large Batch Size Training of Neural Networks with Adversarial Training and Second-Order Information
Abstract
Stochastic Gradient Descent (SGD) methods using randomly selected batches are widely used to train neural network (NN) models. Performing design exploration to find the best NN for a particular task often requires extensive training with different models on a large dataset, which is very computationally expensive. The most straightforward method to accelerate this computation is to distribute the batch of SGD over multiple processors. Keeping the distributed processors fully utilized requires growing the batch size commensurately; however, large batch training often leads to degradation in accuracy, poor generalization, and even poor robustness to adversarial attacks. Existing solutions for large batch training either significantly degrade accuracy or require massive hyperparameter tuning. To address this issue, we propose a novel large batch training method which combines recent results in adversarial training (to regularize against “sharp minima”) and second order optimization (to use curvature information to change batch size adaptively during training). We extensively evaluate our method on Cifar10/100, SVHN, TinyImageNet, and ImageNet datasets, using multiple NNs, including residual networks as well as smaller networks for mobile applications such as SqueezeNext. Our new approach exceeds the performance of the existing solutions in terms of both accuracy and the number of SGD iterations (up to 1% and , respectively). We emphasize that this is achieved without any additional hyperparameter tuning to tailor our proposed method in any of these experiments.
1 Introduction
Finding the right NN architecture for a particular application requires extensive hyperparameter tuning and architecture search, often on a very large dataset. The delays associated with training NNs are often the main bottleneck in the design process. One of the ways to address this issue is to use large distributed processor clusters; however, to efficiently utilize each processor, the portion of the batch associated with each processor (sometimes called the minibatch) must grow correspondingly. In the ideal case, the hope is to decrease the computational time proportional to the increase in batch size, without any drop in generalization quality. However, large batch training has a number of well known drawbacks. These include degradation of accuracy, poor generalization, and poor robustness to adversarial perturbations (keskar2016large; yao2018hessian).
In order to address these drawbacks, many solutions have been proposed (goyal2017accurate; you2017scaling; devarakonda2017adabatch; smith2017don; jia2018highly). However, these methods either work only for particular models on a particular dataset, or they require massive hyperparameter tuning, which is often not discussed in the presentation of results. Note that while extensive hyperparameter tuning may result in good result tables, it is antithetical to the original motivation of using large batch sizes to reduce training time.
One solution to reduce the brittleness of SGD to hyperparameter tuning is to use second-order methods. The full Newton method with line search is parameter-free, and it does not require a learning rate. This is achieved by using a second-order Taylor series approximation to the loss function, instead of a first-order one as in SGD, to obtain curvature information. schaul2013no; xu2017second; xu2017newton show that Newton/quasi-Newton methods outperform SGD for training NNs. However, their results only consider simple fully connected NNs and autoencoders. A problem with second-order methods is that they can exacerbate the large batch problem, as by construction they have a higher tendency to get attracted to local minima as compared to SGD. For these reasons, early attempts at using second-order methods for training convolutional NNs have so far not been successful.
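To illustrate why a full Newton step needs no learning rate, the following minimal sketch minimizes a two-dimensional convex quadratic, for which the second-order Taylor model is exact. The matrix `A`, the vector `b`, the starting point, and all helper names are illustrative assumptions and are not taken from our experiments.

```python
# Illustrative quadratic f(x) = 0.5 * x^T A x - b^T x (A and b are assumed values).
A = [[3.0, 1.0], [1.0, 2.0]]   # symmetric positive definite Hessian
b = [1.0, 1.0]


def gradient(x):
    """Gradient of f at x: A x - b."""
    return [sum(A[i][j] * x[j] for j in range(2)) - b[i] for i in range(2)]


def solve_2x2(M, v):
    """Solve M s = v for a 2x2 matrix M with Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(v[0] * M[1][1] - M[0][1] * v[1]) / det,
            (M[0][0] * v[1] - v[0] * M[1][0]) / det]


def newton_step(x):
    """One Newton update x - H^{-1} g; no step size is required."""
    step = solve_2x2(A, gradient(x))
    return [x[i] - step[i] for i in range(2)]


def gradient_step(x, lr):
    """One gradient-descent update, whose progress depends on the learning rate lr."""
    g = gradient(x)
    return [x[i] - lr * g[i] for i in range(2)]


x0 = [5.0, -4.0]
x_newton = newton_step(x0)         # lands on the minimizer of the quadratic
x_gd = gradient_step(x0, lr=0.1)   # still far from the minimizer
```

On a general nonconvex loss the quadratic model is only local, which is one reason a line search is paired with the Newton direction.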
Ideally, if we could find a regularization scheme to avoid local/bad minima during training, this could resolve many of these issues. In the seminal works of el1997robust; xu2009robust, a very interesting connection was made between robust optimization and regularization. It was shown that the solution to a robust optimization problem for least squares is the same as the solution of a Tikhonov regularized problem (el1997robust). This was also extended to the Lasso problem in xu2009robust. Adversarial learning/training methods, which are a special case of robust optimization methods, are usually described as a min-max optimization procedure to make the model more robust. Recent studies with NNs have empirically found that robust optimization usually converges to points in the optimization landscape that are flatter and are more robust to adversarial perturbation (yao2018hessian).
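The link between robustness and regularization can be checked numerically in a simplified setting. The sketch below assumes that only the data matrix `A` is perturbed, with Frobenius norm at most `rho`, and that the weight vector `x` is fixed; in that case the worst-case residual norm equals the nominal residual norm plus `rho` times the norm of `x`, so the inner maximization acts as a norm penalty. This is a simplified variant rather than the exact setting of el1997robust, and the function name and data values are illustrative assumptions.

```python
import math


def worst_case_residual(A, b, x, rho):
    """Return (closed_form, attained) for max over ||dA||_F <= rho of ||(A + dA) x - b||_2.

    Assumes the nominal residual A x - b and the vector x are both nonzero.
    """
    m, n = len(A), len(x)
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
    r_norm = math.sqrt(sum(v * v for v in r))
    x_norm = math.sqrt(sum(v * v for v in x))
    # Closed form: nominal residual norm plus a norm penalty on x.
    closed_form = r_norm + rho * x_norm
    # Worst-case perturbation: rank-one matrix rho * u x^T / ||x|| with u = r / ||r||.
    dA = [[rho * (r[i] / r_norm) * x[j] / x_norm for j in range(n)] for i in range(m)]
    r_adv = [sum((A[i][j] + dA[i][j]) * x[j] for j in range(n)) - b[i] for i in range(m)]
    attained = math.sqrt(sum(v * v for v in r_adv))
    return closed_form, attained


A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
b = [1.0, 0.0, 2.0]
x = [0.5, -0.25]
closed_form, attained = worst_case_residual(A, b, x, rho=0.3)
```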
Inspired by these results, we explore whether second order information regularized by robust optimization can be used to do large batch size training of NNs. We show that both classes of methods have properties that can be exploited in the context of large batch training to help reduce the brittleness of SGD with large batch size training, thereby leading to significantly improved results.
Main Contributions
In more detail, we propose an adaptive batch size method based on curvature information extracted from the Hessian, combined with a robust optimization method. The latter helps regularize against sharp minima, especially during early stages of training. We show that this combination leads to superior testing performance, as compared to previously proposed methods for large batch size training. Furthermore, in addition to achieving better testing performance, we show that the total number of SGD updates of our method is significantly lower than that of state-of-the-art methods for large batch size training. We achieve these results without any additional hyperparameter tuning of our algorithm (which would, of course, have helped us to tailor our solution to these experiments). Here is a more detailed itemization of the main contributions of this work:

We propose an Adaptive Batch Size method for SGD training that is based on second order information, computed by backpropagating the Hessian operator. Our method automatically changes the batch size and learning rate based on Hessian information. We state and prove a result that this method is convergent for a convex problem. More importantly, we empirically test the algorithm for important nonconvex problems in deep learning and show that it achieves equal or better test performance, as compared to small batch SGD (We refer to this method as ABS).

We propose a regularization method using robust training by solving a min-max optimization problem. We combine the second order adaptive batch size method with recent results of yao2018hessian, which show that robust training can be used to regularize against sharp minima. We show that this combination of Hessian-based adaptive batch size and robust optimization achieves significantly better test performance with little computational overhead (we refer to this Adaptive Batch Size Adversarial method as ABSA).

We test the proposed strategies extensively on a wide range of datasets (Cifar10/100, SVHN, TinyImageNet, and ImageNet), using different NNs, including residual networks. Importantly, we use the same hyperparameters for all of the experiments, and we do not perform any kind of tuning of our hyperparameters to tailor our results. The empirical results show the clear benefit of our proposed method, as compared to the state-of-the-art. The proposed algorithm achieves equal or better test accuracy (up to 1%) and requires significantly fewer SGD updates (up to ).

We empirically show that we can use a block approximation of the Hessian operator (i.e., the Hessian of the last few layers) to reduce the computational overhead of backpropagating the second order information. This approximation is especially effective for deep NNs.
While a number of recent works have discussed adaptive batch size or increasing batch size during training (devarakonda2017adabatch; smith2017don; friedlander2012hybrid; balles2016coupling), to the best of our knowledge this is the first paper to introduce Hessian information and adversarial training in adaptive batch size training, with extensive testing on many datasets.
Limitations
We believe that it is important for every work to state its limitations (in general, but in particular in this area). We were particularly careful to perform extensive experiments and repeated all the reported tests multiple times. We test the algorithm on models ranging from a few layers to hundreds of layers, including residual networks as well as smaller networks such as SqueezeNext.
An important limitation is that second order methods have additional overhead for backpropagating the Hessian. Currently, most of the existing frameworks do not support (memory) efficient backpropagation of the Hessian (thus providing a structural bias against these powerful methods). However, the complexity of each Hessian matvec is the same as that of a gradient computation (martens2010deep). Our method requires the Hessian spectrum, which typically needs ten Hessian matvecs (for power method iterations to reach a tolerance of 1e-2). Thus, the benefits that we show in terms of testing accuracy and reduced number of updates do come at a cost (see Table 3 for details). We measure this additional overhead and report it in terms of wall-clock time. Furthermore, we (empirically) show that this power iteration needs to be done only at the end of every epoch, thus significantly reducing the additional overhead.
Another limitation is that our theory only holds for convex problems (under certain smoothness assumptions). Proving convergence for the nonconvex setting requires more involved analysis. Recently, ward2018adagrad has provided interesting theoretical guarantees for AdaGrad (duchi2011adaptive) in the nonconvex setting; and xu2017second; yao2018inexact have developed subsampled second order methods for nonconvex objectives. Exploring a similar direction for our method is of interest for future work. Another point is that adaptive batch size prevents one from utilizing all of the processes, as compared to using large batch throughout the training. However, a large data center can handle and accommodate a growing number of requests for processor resources, which could alleviate this.
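The power method mentioned above can be sketched using nothing but Hessian matvecs. In the sketch below the Hessian is a small explicit symmetric matrix standing in for the operator that would be applied by backpropagation in a real NN. The stopping rule (relative change of the Rayleigh quotient below `tol`), the all-ones starting vector, and the function names are assumptions made for illustration.

```python
import math


def power_iteration(hessian_matvec, dim, tol=1e-2, max_iter=100):
    """Estimate the largest-magnitude eigenvalue using only matrix-vector products.

    Returns (eigenvalue_estimate, number_of_matvecs_used).
    """
    v = [1.0 / math.sqrt(dim)] * dim   # assumed deterministic unit starting vector
    eigenvalue = 0.0
    for iteration in range(1, max_iter + 1):
        hv = hessian_matvec(v)                               # one Hessian matvec
        new_eigenvalue = sum(a * c for a, c in zip(v, hv))   # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(c * c for c in hv))
        v = [c / norm for c in hv]
        # Stop once the relative change of the estimate falls below tol.
        if abs(new_eigenvalue - eigenvalue) <= tol * abs(new_eigenvalue):
            return new_eigenvalue, iteration
        eigenvalue = new_eigenvalue
    return eigenvalue, max_iter


# Toy symmetric "Hessian"; its largest eigenvalue is (7 + sqrt(5)) / 2.
H = [[4.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 1.0]]


def toy_matvec(v):
    return [sum(H[i][j] * v[j] for j in range(3)) for i in range(3)]


top_eigenvalue, num_matvecs = power_iteration(toy_matvec, dim=3)
```

Because each iteration costs one matvec, the total overhead is roughly the iteration count times the cost of one gradient computation.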
2 Related Work
Optimization methods based on SGD are currently the most effective techniques for training NNs, and this is commonly attributed to SGD’s ability to escape saddle points and “bad” local minima (dauphin2014identifying). The sequential nature of weight updates in synchronous SGD limits possibilities for parallel computing. In recent years, there has been considerable effort on breaking this sequential nature, through asynchronous methods (zhang2015deep) or symbolic execution techniques (maleki2017parallel). A main problem with asynchronous methods is reproducibility, which, in this case, depends on the number of processes used (zheng2016asynchronous; agarwal2011distributed). Due to this issue, recently there have been attempts to increase parallelization opportunities in synchronous SGD by using large batch size training. With large batches, it is possible to distribute the computations more efficiently to parallel compute nodes (gholami2017integrated), thus reducing the total training time. However, large batch training often leads to suboptimal test performance (keskar2016large; yao2018hessian). This has been attributed to the observation that large batch size training tends to get attracted to local minima or sharp curvature directions, which are not robust to (possible) mismatch between training and testing curves (keskar2016large). A full understanding of this, however, remains elusive.
There have been several solutions proposed for alleviating the problem with large batch size training. The first notable work here is goyal2017accurate, where it was shown that by scaling the learning rate, it is possible to achieve the same testing accuracy for large batches. In particular, the ResNet50 model was tested on the ImageNet dataset, and it was shown that the baseline accuracy could be recovered up to a batch size of 8192. However, this approach does not generalize to other networks such as AlexNet (you2017scaling), or other tasks such as NLP. In you2017scaling, an adaptive learning rate method (called LARS) was proposed which allowed scaling training to a much larger batch size of 32K with more hyperparameter tuning. Another notable work is smith2017don (and also devarakonda2017adabatch), which proposed a hybrid increase of batch size and learning rate to accelerate training. In this approach, one would select a strategy to “anneal” the batch size during the training. This is based on the idea that large batches contain less “noise,” and that could be used much the same way as reducing learning rate during training. More recent works (jia2018highly; puri2018large) proposed mixed-precision methods to further explore the limit of large batch training.
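The learning rate scaling of goyal2017accurate can be sketched as a small schedule function: the learning rate is multiplied by the ratio of the large batch size to the reference batch size, and it is ramped up linearly over a few warmup epochs. The function name, the default of five warmup epochs, and the numerical values below are illustrative assumptions, not the settings used in our experiments.

```python
def linear_scaled_lr(base_lr, base_batch_size, batch_size, epoch, warmup_epochs=5):
    """Learning rate under linear scaling with a linear warmup (illustrative sketch)."""
    target_lr = base_lr * batch_size / base_batch_size
    if epoch < warmup_epochs:
        # Ramp linearly from the small-batch rate to the scaled rate.
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr


# Hypothetical values: reference rate 0.1 at batch size 256, scaled to batch size 8192.
schedule = [linear_scaled_lr(0.1, 256, 8192, epoch) for epoch in range(8)]
```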
A recent study has shown that anisotropic noise injection could also help in escaping sharp minima (zhu2018anisotropic). The authors showed that the noise from SGD could be viewed as anisotropic, with the Hessian as its covariance matrix. Injecting random noise using the Hessian as covariance was proposed as a method to avoid sharp minima.
Another recent work by yao2018hessian has shown that adversarial training (or robust optimization) could be used to “regularize” against these sharp minima, with preliminary results showing superior testing performance as compared to other methods. The link between robust optimization and regularization is a very interesting observation that has been theoretically proved in the case of Ridge regression (el1997robust), and Lasso (bertsimas2011theory). shaham2015understanding; shrivastava2017learning used adversarial training and showed that the model trained using robust optimization is often more robust to perturbations, as compared to normal SGD training. Similar observations have been made by others (szegedy2013intriguing; goodfellow6572explaining).
3 Our Main Method
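We consider a supervised learning framework where the goal is to minimize a loss function $L(\theta)$:
(1)  $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} l(z_i, \theta)$
where $\theta \in \mathbb{R}^d$ are the model weight parameters, $Z = \{(x_i, y_i)\}_{i=1}^{N}$ is the training dataset, and $l(z, \theta)$ is the loss for a datum $z = (x, y) \in Z$. Here, $x$ is the input, $y$ is the corresponding label, and $N = |Z|$ is the cardinality of the training set. SGD is typically used to optimize Eqn. (1) by taking steps of the form:
(2)  $\theta_{t+1} = \theta_t - \eta_t \frac{1}{|B|} \sum_{z \in B} \nabla_\theta l(z, \theta_t)$
where $B$ is a minibatch of examples drawn randomly from $Z$, and $\eta_t$ is the step size (learning rate) at iteration $t$. In the case of large batch size training, the batch size $|B|$ is increased to large values.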
smith2018bayesian views the learning rate and batch size as noise injected during optimization. Both a large learning rate and a small batch size can be considered equivalent to high noise injection. This is explained by modeling the behavior of NNs as a stochastic differential equation (SDE) of the following form:
(3)  $\frac{d\theta}{dt} = -\frac{dL}{d\theta} + \nu(t)$
where $\nu(t)$ is the noise injected by SGD (see smith2018bayesian for details). The authors then argue that the noise magnitude is proportional to $g = \eta_t \left(\frac{N}{|B|} - 1\right)$. For minibatch size $|B| \ll N$, the noise magnitude can be estimated as $g \approx \eta_t \frac{N}{|B|}$. Hence, in order to achieve the benefits from small batch size training, i.e., the noise generated by small batch training, the learning rate should increase proportionally to the batch size, and vice versa. That is, the same annealing behavior could be achieved by increasing the batch size, which is the method used by smith2017don.
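A short numerical check makes the trade-off concrete. The sketch assumes the noise scale of smith2018bayesian in the form g = lr * (N / |B| - 1), approximated by lr * N / |B| when the batch is much smaller than the dataset; the function names and the values for learning rate, dataset size, and batch size are illustrative.

```python
def noise_scale(lr, dataset_size, batch_size):
    """Assumed SGD noise scale g = lr * (N / |B| - 1)."""
    return lr * (dataset_size / batch_size - 1.0)


def approx_noise_scale(lr, dataset_size, batch_size):
    """Approximation g ~ lr * N / |B|, valid when |B| is much smaller than N."""
    return lr * dataset_size / batch_size


# Growing the batch size by 8x while growing the learning rate by 8x
# leaves the approximate noise scale unchanged.
small_batch = approx_noise_scale(lr=0.1, dataset_size=50000, batch_size=128)
large_batch = approx_noise_scale(lr=0.8, dataset_size=50000, batch_size=1024)

# Growing only the batch size lowers the noise, much like decaying the learning rate.
annealed = approx_noise_scale(lr=0.1, dataset_size=50000, batch_size=1024)
```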
The need for annealing can be understood by considering a convex problem. When we get closer to a local minimum, a more accurate descent direction with less noise is preferable to a more noisy direction, since less noise helps converge to rather than oscillate around the local minimum. This explains the manual batch size and learning rate changes proposed in (smith2017don; devarakonda2017adabatch). Ideally, we would like to have an automatic method that could provide us with such information and regularize against local minima with poor generalization. As we show next, this is possible through the use of second order information combined with robust optimization.
3.1 Adaptive Batch Size (ABS) based on Hessian Information
In this section, we propose a method for utilizing second order information to adaptively change the batch size. We refer to this as the Adaptive Batch Size (ABS) method; see Alg. 1. Intuitively, using a larger batch size in regions where the loss has a “flatter” landscape, and using a smaller batch size in regions with a “sharper” loss landscape, could help to avoid attraction to local minima with poor generalization. This information can be obtained through the lens of the Hessian operator. We adaptively increase the batch size as the Hessian eigenvalue decreases or stays stable for several epochs (fixed to be ten in all of the experiments).
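The following sketch shows one way such an end-of-epoch rule could look. It is a hypothetical simplification and not a transcription of Alg. 1: the thresholds `decrease_ratio` and `growth_factor`, the interpretation of "stays stable" as "has not increased", and all names are assumptions; only the patience of ten epochs comes from the text.

```python
from dataclasses import dataclass


@dataclass
class ABSState:
    batch_size: int
    lr: float
    ref_eigenvalue: float    # top eigenvalue recorded at the last batch-size change
    stable_epochs: int = 0   # epochs elapsed since the last batch-size change


def abs_epoch_update(state, eigenvalue, max_batch_size,
                     decrease_ratio=2.0, growth_factor=2, patience=10):
    """Hypothetical end-of-epoch rule: grow batch size and learning rate together."""
    state.stable_epochs += 1
    dropped = eigenvalue <= state.ref_eigenvalue / decrease_ratio
    stable = state.stable_epochs >= patience and eigenvalue <= state.ref_eigenvalue
    if (dropped or stable) and state.batch_size < max_batch_size:
        new_batch_size = min(state.batch_size * growth_factor, max_batch_size)
        # Scale the learning rate by the same ratio so that lr / batch_size is preserved.
        state.lr *= new_batch_size / state.batch_size
        state.batch_size = new_batch_size
        state.ref_eigenvalue = eigenvalue
        state.stable_epochs = 0
    return state


state = ABSState(batch_size=128, lr=0.1, ref_eigenvalue=10.0)
abs_epoch_update(state, eigenvalue=9.0, max_batch_size=1024)   # sharp region: no change
after_first_epoch = (state.batch_size, state.lr)
abs_epoch_update(state, eigenvalue=4.0, max_batch_size=1024)   # flatter region: grow
after_drop = (state.batch_size, state.lr)
for _ in range(10):                                            # ten stable epochs: grow again
    abs_epoch_update(state, eigenvalue=4.0, max_batch_size=1024)
after_stable = (state.batch_size, state.lr)
```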
The second component of our framework is robust optimization. In the seminal works of (el1997robust; xu2009robust), a connection between robust optimization and regularization was proved in the context of ridge and lasso regression. In yao2018hessian, the authors empirically showed that adversarial training leads to more robust models with respect to adversarial perturbation. An interesting corollary was that, after adversarial training, the model converges to regions that are considerably flatter, as compared to the baseline.
Thus, we can combine our ABS algorithm with adversarial training as a form of regularization against “sharp” minima. We refer to this as the Adaptive Batch Size Adversarial (ABSA) method; see Alg. 1. In practice, ABSA is often more stable than ABS. Adversarial training corresponds to solving a min-max problem instead of a normal minimization problem (keskar2016large; yao2018hessian). Solving this min-max problem exactly is intractable for NNs, and thus we approximately solve the maximization problem through the Fast Gradient Sign Method (FGSM) proposed by goodfellow6572explaining. This basically corresponds to generating adversarial inputs using one gradient ascent step (i.e., the perturbation is computed by $\Delta x = \epsilon \, \mathrm{sign}(\nabla_x l(z, \theta))$ with $z = (x, y)$). Other possible choices are proposed by (thakur2005optimization; carlini2017towards; moosavi2016deepfool). (In yao2018hessian, similar behavior was observed with other methods for solving the robust optimization problem.)
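The FGSM step can be illustrated on a model whose input gradient is available in closed form. The sketch below uses a logistic regression loss instead of a NN, so the gradient with respect to the input is analytic; the weights, the input, and the perturbation size `epsilon` are illustrative assumptions.

```python
import math


def logistic_loss(w, x, y):
    """Loss log(1 + exp(-y * w.x)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))


def input_gradient(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(w, x, y, epsilon):
    """One signed gradient-ascent step on the input: x + epsilon * sign(grad_x loss)."""
    grad = input_gradient(w, x, y)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]


w = [1.5, -2.0, 0.5]
x = [0.2, -0.1, 0.4]
y = 1
x_adv = fgsm(w, x, y, epsilon=0.1)
clean_loss = logistic_loss(w, x, y)
adversarial_loss = logistic_loss(w, x_adv, y)   # larger than clean_loss
```

In adversarial training, the model parameters would then be updated on `x_adv` (possibly mixed with clean inputs) rather than on `x` alone.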
Figure 1 illustrates our ABS schedule as compared to a normal training strategy and the increasing batch size strategy of smith2017don; devarakonda2017adabatch. Note that our learning rate adaptively changes based on the Hessian eigenvalue in order to keep the same noise level as in the baseline SGD training. As we show in section 4, our combined approach (second order and robust optimization) not only achieves better accuracy, but it also requires significantly fewer SGD updates, as compared to smith2017don; devarakonda2017adabatch.
3.2 Convergence Rate of ABS
Before discussing the empirical results, an important question is whether ABS converges even for a convex problem. Here, we show that our ABS algorithm does converge for strongly convex problems. Based on an assumption about the loss (Assumption 2 in Appendix A), it is not hard to prove the following theorem.
Theorem 1.
Under Assumption 2, assume that at step , the batch size used for parameter update is , the step size is , where is fixed and satisfies,
(4) 
where is the maximum batch size during training. Then, with as the initialization, the expected optimality gap satisfies the following inequality,
(5) 
From Theorem 1, if , the convergence rate for steps, based on equation 5, is . However, the convergence rate of Alg. 1 becomes , where . With an adaptive , Alg. 1 can converge faster than basic SGD. We show empirical results for a logistic regression problem, which is a simple convex problem, in Appendix A.
4 Our Main Results
We evaluate the performance of our ABS and ABSA methods on different datasets (ranging from O(1E4) to O(1E7) training examples) and multiple NN models. We compare the baseline performance (i.e., small batch size), along with other state-of-the-art methods proposed for large batch training (smith2017don; goyal2017accurate). The two main metrics for comparison are (1) the final accuracy and (2) the total number of updates. Preferably we would want a higher testing accuracy along with fewer SGD updates. We emphasize that, for all of the datasets and models we tested, we do not change any of the hyperparameters in our algorithm. We use the exact same parameters used in the baseline model, and we do not tailor any parameters to suit our algorithm. A detailed explanation of the different NN models and the datasets is given in Appendix B.
Section 4.1 shows the result of ABS (ABSA) compared to BaseLine (BL), FB (goyal2017accurate) and GG (smith2017don). Section 4.2 presents the results on more challenging datasets of TinyImageNet and ImageNet. The superior performance of our method does come at the cost of backpropagating the Hessian. Thus, in section 4.3, we discuss how approximate Hessian information could be used to alleviate the costs.
4.1 ABS and ABSA for SVHN and Cifar
We start by discussing the results of ABS and ABSA on the SVHN and Cifar10/100 datasets. Notice that GG and our ABS and ABSA have different batch sizes during training. Hence the batch size reported in our results represents the maximum batch size during training. To allow for a direct comparison we also report the number of weight updates in our results (lower is better). It should be mentioned that the number of SGD updates is not necessarily proportional to the wall-clock time. Therefore, we also report a simulated training time of the I3 model in Appendix C.
Tables 1 and 4-7 (see Appendix D for Tables 4-7) report the test accuracy and the number of parameter updates for different datasets and models. First, note the drop in BL accuracy for large batch sizes, confirming the accuracy degradation problem. Moreover, note that the FB strategy only works well for moderate batch sizes (it diverges for large batch sizes). In contrast, the GG method has very consistent performance, but its number of parameter updates is usually greater than that of our method.
Looking at the last two major columns of Tables 1 and 4-7, ABS achieves test accuracy similar to BL. Overall, the number of updates of ABS is 3-10 times smaller than that of BL with batch size 128. However, for most cases, ABSA achieves superior results. This confirms the effectiveness of adversarial training combined with the second order information.
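| BS | BL Acc. (%) | BL # Iters | FB Acc. (%) | FB # Iters | GG Acc. (%) | GG # Iters | ABS Acc. (%) | ABS # Iters | ABSA Acc. (%) | ABSA # Iters |
|---|---|---|---|---|---|---|---|---|---|---|
| 128 | 92.02 | 78125 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. |
| 256 | 91.88 | 39062 | 91.75 | 39062 | 91.84 | 50700 | 91.7 | 40792 | 92.11 | 43352 |
| 512 | 91.68 | 19531 | 91.67 | 19531 | 91.19 | 37050 | 92.15 | 32428 | 91.61 | 25388 |
| 1024 | 89.44 | 9766 | 91.23 | 9766 | 91.12 | 31980 | 91.61 | 17046 | 91.66 | 23446 |
| 2048 | 83.17 | 4882 | 90.44 | 4882 | 89.19 | 30030 | 91.57 | 21579 | 91.61 | 14027 |
| 4096 | 73.74 | 2441 | 86.12 | 2441 | 91.83 | 29191 | 91.91 | 18293 | 92.07 | 21909 |
| 8192 | 63.71 | 1220 | 64.91 | 1220 | 91.51 | 28947 | 91.77 | 22802 | 91.81 | 16778 |
| 16384 | 47.84 | 610 | 32.57 | 610 | 90.19 | 28828 | 92.12 | 17485 | 91.97 | 24361 |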
4.2 ABSA for TinyImageNet and ImageNet
SVHN is a very simple dataset, and Cifar10/100 are relatively small datasets, and one might wonder whether the improvements we reported in section 4.1 hold for more complex problems. Here, we report the ABSA method on more challenging datasets, i.e., TinyImageNet and ImageNet. We use the exact same hyperparameters in our algorithm, even though tuning them could potentially be preferable for us.
TinyImageNet is an image classification problem with 200 classes and only 500 images per class; thus, it is easy to overfit the training data. The results for the I1 model are reported in Table 2. Note that with fewer SGD iterations, ABSA can achieve better test accuracy than other methods. The performance of ABSA is actually about higher (the training loss and test performance of I1 on TinyImageNet are shown in Figure 4 in the appendix).
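| BS | BL Acc. (%) | BL # Iters | FB Acc. (%) | FB # Iters | GG Acc. (%) | GG # Iters | ABSA Acc. (%) | ABSA # Iters |
|---|---|---|---|---|---|---|---|---|
| 128 | 60.41 | 93750 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. |
| 256 | 58.24 | 46875 | 59.82 | 46875 | 60.31 | 70290 | 61.28 | 60684 |
| 512 | 57.48 | 23437 | 59.28 | 23437 | 59.94 | 58575 | 60.55 | 51078 |
| 1024 | 54.14 | 11718 | 59.62 | 11718 | 59.72 | 52717 | 60.72 | 19011 |
| 2048 | 50.89 | 5859 | 59.18 | 5859 | 59.82 | 50667 | 60.43 | 17313 |
| 4096 | 40.97 | 2929 | 58.26 | 2929 | 60.09 | 49935 | 61.14 | 22704 |
| 8192 | 25.01 | 1464 | 16.48 | 1464 | 60.00 | 49569 | 60.71 | 22334 |
| 16384 | 10.21 | 732 | 0.5 | 732 | 60.37 | 48995 | 60.71 | 20348 |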
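The BL and FB iteration counts in the TinyImageNet table above appear consistent with a fixed-batch schedule: the number of updates is the total number of examples processed (training set size times epochs) divided by the batch size, rounded down. The sketch below reproduces those counts from the 200 classes with 500 training images each and the 120 training epochs of the I1 model described in Appendix B; the function name is an illustrative assumption.

```python
def num_sgd_updates(num_examples, epochs, batch_size):
    """Number of parameter updates for fixed-batch training, rounded down."""
    return (num_examples * epochs) // batch_size


tiny_imagenet_examples = 200 * 500   # 200 classes, 500 training images per class
i1_epochs = 120                      # training length of the I1 model

batch_sizes = (128, 256, 512, 1024, 2048, 4096, 8192, 16384)
updates = {bs: num_sgd_updates(tiny_imagenet_examples, i1_epochs, bs) for bs in batch_sizes}
```

Methods with a changing batch size (GG, ABS, ABSA) fall between the counts for their smallest and largest batch sizes, which is why their iteration counts must be measured rather than computed from a single batch size.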
The ImageNet classification task is perhaps among the most challenging classification problems. Due to limited computational resources, we only test ABSA and BL, and report results in Figure 5 (see Appendix D). The BL uses parameter updates, reaching validation accuracy. For ABSA, the final validation accuracy is , with only parameter updates. The maximum batch size reached by ABSA is , with initial batch size .
Figure 2 shows the result of the I3 model on ImageNet. The BL uses parameter updates, reaching validation accuracy. The final validation accuracies of ABS and ABSA are and , respectively, both with parameter updates. The maximum batch size reached by ABS and ABSA is with initial batch size . If the GG schedule were implemented, the total number of parameter updates would have been . (Due to limited resources, we do not run GG for I3 on ImageNet.)
Note that we do not tune the hyperparameters, e.g., , and perhaps one could close the gap between and with fine tuning of our hyperparameters. However, from a practical point of view such tuning is antithetical to the goal of large batch size training as it would increase the total training time, and we specifically did not want to tailor any new parameters for a particular model/dataset.
4.3 Approximate Hessian
One of the limitations of our ABS (ABSA) method is the additional computational cost for computing the top Hessian eigenvalue. If we use the full Hessian operator, the second backpropagation needs to be done all the way to the first layer of the NN. For deep networks this could lead to high cost. Here, we empirically explore whether we could use approximate second order information, and in particular we test a block Hessian approximation (Figure 6). The block approximation corresponds to only analyzing the Hessian of the last few layers.
In Figure 6 (see Appendix D), we plot the trace of top eigenvalues of full Hessian and block Hessian for C1 model. Although the top eigenvalue of block Hessian has more variance than that of full Hessian, the overall trends are similar for C1. The test performance of C1 on Cifar10 with block Hessian is with 4600 parameter updates (as compared to for full Hessian ABSA). The test performance of C4 on Cifar100 with block Hessian is with 12500 parameter updates (as compared to for full Hessian ABSA). These results suggest that using a block Hessian to estimate the trend of the full Hessian might be a good choice to overcome computation cost, but a more detailed analysis is needed.
5 Conclusion
We introduce an adaptive batch size algorithm based on Hessian information to speed up the training process of NNs, and we combine this approach with adversarial training (which is a form of robust optimization, and which could be viewed as a regularization term for large batch training). We extensively test our method on multiple datasets (SVHN, Cifar10/100, TinyImageNet and ImageNet) with multiple NN models (AlexNet, ResNet, Wide ResNet and SqueezeNext). As the goal of large batch training is to reduce training time, we did not perform any hyperparameter tuning to tailor our method for any of these tests. Our method allows one to increase batch size and learning rate automatically, based on Hessian information. This helps significantly reduce the number of parameter updates, and it achieves superior generalization performance, without the need to tune any of the additional hyperparameters. Finally, we show that a block Hessian can be used to approximate the trend of the full Hessian to reduce the overhead of using second-order information. These improvements are useful to reduce NN training time in practice.
Appendix A Proof of Theorem
For a finite sum objective function , i.e., equation 1, we assume that:
Assumption 2.
The objective function satisfies:

is continuously differentiable and the gradient function of is Lipschitz continuous with Lipschitz constant , i.e.
(6) 
is strongly convex, i.e., there exists a constant s.t.
(7) Also, the global minimum of is achieved at and .

Each gradient of each individual is an unbiased estimation of the true gradient, i.e.
(8) 
There exist scalars and s.t.
(9) where is the variance operator, i.e.
With Assumption 2, the following two lemmas could be found in any optimization reference, e.g. bottou2018optimization. We give the proofs here for completeness.
Lemma 3.
Under Assumption 2, after one iteration of stochastic gradient update with step size at , we have
(11) 
where for some .
Proof.
With the smoothness of , we have
From above, the result follows. ∎
Lemma 4.
Under Assumption 2, for any , we have
(12) 
Proof.
Let
Then has a unique global minimum at with . Using the strong convexity of , it follows
∎
The following lemma is trivial, so we omit the proof here.
Lemma 5.
Let . Then the variance of is bounded by
(13) 
Proof of Theorem 1
Given these lemmas, we now proceed with the proof of Theorem 1.
Proof.
We show a toy example of binary logistic regression on the mushroom classification dataset (https://www.kaggle.com/uciml/mushroomclassification). We split the whole dataset into 6905 examples for training and 1819 for validation. for SGD with batch size 100 and full gradient descent. We set for our algorithm, i.e., ABS. Here we mainly focus on the training losses of different optimization algorithms. The results are shown in Figure 3. In order to see if is not an optimal step size of full gradient descent, we vary for full gradient descent; see results in Figure 3.
Appendix B Outline of training
In this section, we give a detailed outline of our training datasets, models, and strategies, as well as the hyperparameters used in Alg. 1.
Dataset. We consider the following datasets.

SVHN. The original SVHN (netzer2011reading) dataset is small. However, in this paper, we choose the additional dataset, which contains more than 500k samples, as our training dataset.

Cifar. The two Cifar (i.e., Cifar10 and Cifar100) datasets (krizhevsky2009learning) have the same number of images but different numbers of classes.

TinyImageNet. TinyImageNet consists of a subset of ImageNet images (deng2009imagenet), which contains 200 classes. Each class has 500 training and 50 validation images (in some papers, this validation set is referred to as a test set). The size of each image is .

ImageNet. The ILSVRC 2012 classification dataset (deng2009imagenet) consists of 1000 image classes, with a total of 1.2 million training images and 50,000 validation images. During training, we crop the image to .
Model Architecture. We implement the following convolutional NNs. When we use data augmentation, it is exactly the same standard data augmentation scheme as in the corresponding model.

S1. AlexNet like model on SVHN as same as yao2018hessian[C1]. We train it for 20 epochs with initial learning rate , and decay a factor of 5 at epoch 5, 10 and 15. There is no data augmentation.

C1. ResNet18 on Cifar10 dataset (he2016deep). We train it for 90 epochs with initial learning rate , and decay a factor of 5 at epoch 30, 60, 80. There is no data augmentation.

C2. WResNet 164 on the Cifar10 dataset (zagoruyko2016wide). We train it for 90 epochs with initial learning rate , and decay it by a factor of 5 at epochs 30, 60 and 80. There is no data augmentation.

C3. SqueezeNext on the Cifar10 dataset (gholami2018squeezenext). We train it for 200 epochs with initial learning rate , and decay it by a factor of 5 at epochs 60, 120 and 160. Data augmentation is used.

C4. ResNet18 on the Cifar100 dataset (he2016deep). We train it for 160 epochs with initial learning rate , and decay it by a factor of 10 at epochs 80 and 120. Data augmentation is used.

I1. ResNet50 on the TinyImageNet dataset (he2016deep). We train it for 120 epochs with initial learning rate , and decay it by a factor of 10 at epochs 60 and 90. Data augmentation is used.

I2. AlexNet on the ImageNet dataset (krizhevsky2012imagenet). We train it for 90 epochs with initial learning rate , and decay it to quadratically at epoch 60, then keep it at for the remaining 30 epochs. Data augmentation is used.

I3. ResNet18 on the ImageNet dataset (he2016deep). We train it for 90 epochs with initial learning rate , and decay it by a factor of 10 at epochs 30, 60 and 80. Data augmentation is used.
Training Strategy. We use the following training strategies.
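Most of the schedules listed above are step decays: the learning rate is divided by a fixed factor at given epochs. A minimal sketch follows, assuming the decay takes effect from the listed epoch onward; the function name `step_decay_lr` and the base learning rate 0.1 in the usage lines are hypothetical, since the initial learning rates are not recoverable from this text.

```python
def step_decay_lr(epoch, base_lr, milestones, factor):
    """Learning rate at `epoch` when dividing by `factor` at every milestone epoch.

    Assumption: a decay listed "at epoch m" applies to all epochs >= m.
    """
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** n_decays

if __name__ == "__main__":
    # A C1-style schedule: factor 5 at epochs 30, 60, 80 (base_lr is a placeholder).
    for epoch in (0, 29, 30, 60, 85):
        print(epoch, step_decay_lr(epoch, base_lr=0.1, milestones=(30, 60, 80), factor=5))
```

The I2 schedule is not a step decay and is not covered by this sketch.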

BL. Use the standard training procedure.

FB. Use the linear scaling rule (goyal2017accurate) with a warmup stage.

GG. Increase the batch size instead of decaying the learning rate (smith2017don).

ABS. Use our adaptive batch size strategy without adversarial training.

ABSA. Use our adaptive batch size strategy with adversarial training.
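The two baseline large batch strategies, FB and GG, can be sketched as follows. This is an illustration under stated assumptions rather than the exact setup of the experiments: the linear warmup shape, the 5-epoch warmup length, and all numeric values in the usage lines are placeholders, and the function names are introduced here. For GG, the batch size growth factor is assumed to equal the learning rate decay factor of the corresponding baseline schedule.

```python
def linear_scaled_lr(base_lr, base_bs, bs):
    """FB: scale the learning rate linearly with the batch size."""
    return base_lr * bs / base_bs

def warmup_lr(epoch, base_lr, target_lr, warmup_epochs):
    """FB: ramp linearly from base_lr to target_lr over the warmup stage.

    `epoch` may be fractional so the ramp can be applied per iteration.
    """
    if epoch >= warmup_epochs:
        return target_lr
    return base_lr + (target_lr - base_lr) * epoch / warmup_epochs

def growing_batch_size(epoch, base_bs, milestones, factor):
    """GG: multiply the batch size by `factor` at each milestone epoch
    instead of dividing the learning rate by it."""
    n_growths = sum(1 for m in milestones if epoch >= m)
    return base_bs * factor ** n_growths

if __name__ == "__main__":
    target = linear_scaled_lr(base_lr=0.1, base_bs=128, bs=1024)
    print("FB target lr:", target)
    print("FB lr mid-warmup:", warmup_lr(2.5, 0.1, target, warmup_epochs=5))
    print("GG batch size at epoch 60:", growing_batch_size(60, 128, (30, 60, 80), 5))
```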
For adversarial training, the adversarial data are generated using the Fast Gradient Sign Method (FGSM) (goodfellow6572explaining). The hyperparameters in Alg. 1 ( and ) are chosen to be , , , , and for all the experiments. The only change is that for SVHN, the frequency to compute Hessian information is training examples, as compared to one epoch, due to the small number of total training epochs (only 20).
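FGSM perturbs each input by a fixed step in the direction of the sign of the loss gradient with respect to that input. The sketch below shows this for a logistic regression model, where the input gradient has a closed form. It illustrates the attack only and is not the training code of the paper; the perturbation size `eps` and the function names are hypothetical.

```python
import numpy as np

def per_sample_loss(x, y, w):
    """Binary cross-entropy of each row of x for labels y in {0, 1}."""
    z = x @ w
    return np.logaddexp(0.0, z) - y * z

def fgsm_logistic(x, y, w, eps):
    """Return x + eps * sign(d loss / d x) for a logistic model with weights w."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    grad_x = (p - y)[:, None] * w[None, :]  # gradient of each sample's loss w.r.t. its input
    return x + eps * np.sign(grad_x)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 5))
    w = rng.normal(size=5)
    y = (rng.random(8) > 0.5).astype(float)
    x_adv = fgsm_logistic(x, y, w, eps=0.1)
    print("mean clean loss      :", per_sample_loss(x, y, w).mean())
    print("mean adversarial loss:", per_sample_loss(x_adv, y, w).mean())
```

For a deep network, the input gradient would come from backpropagation instead of the closed form used here.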
Appendix C Simulated Training Time
As discussed above, the number of SGD updates does not necessarily correlate with wall-clock time, and this is particularly the case because our method requires Hessian backpropagation. Here, we use the method suggested in gholami2017integrated to approximate the wall-clock time of our algorithm when utilizing parallel processes. For the ring algorithm (thakur2005optimization), the communication time per SGD iteration for processes is:
(14) 
where is the network latency, is the inverse bandwidth, and is the size of the model parameters measured in bits. Moreover, we manually measure the wall-clock time of computing the Hessian information using our in-house code, as well as the cost of forward/backward calculations on a V100 GPU. The total time consists of this computation time and the communication time, along with the Hessian computation overhead (if any). Therefore we have:
(15) 
where is the time to compute forward and backward propagation, is the time to communicate between different machines, and is the time to compute the top eigenvalues.
We use the latency and bandwidth values of , and based on NERSC’s Cori2 supercomputing platform. Based on the above formulas, we give an example of the simulated computation time of I3 on ImageNet. Note that for a large number of processes and small latency terms, the communication time formula simplifies to
(16) 
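The cost model described above can be sketched in code. The sketch assumes the common textbook cost of a ring all-reduce (a reduce-scatter followed by an all-gather, ignoring the arithmetic of the reduction), which is not necessarily the exact form of Eq. (14). It also assumes that the computation and communication costs are paid once per SGD update and the Hessian cost once per eigenvalue computation. All names and the numbers in the usage lines are placeholders.

```python
def ring_allreduce_time(n_procs, n_bits, latency, inv_bandwidth):
    """Textbook ring all-reduce cost: 2(P-1) steps, each sending n_bits / P."""
    if n_procs == 1:
        return 0.0
    steps = 2 * (n_procs - 1)
    return steps * latency + steps * (n_bits / n_procs) * inv_bandwidth

def ring_allreduce_time_large_p(n_bits, inv_bandwidth):
    """Limit for many processes and negligible latency: 2 * n_bits * inv_bandwidth."""
    return 2 * n_bits * inv_bandwidth

def total_time(n_sgd_updates, t_comp, t_comm, n_hessian, t_hessian):
    """Computation plus communication per SGD update, plus Hessian overhead."""
    return n_sgd_updates * (t_comp + t_comm) + n_hessian * t_hessian

if __name__ == "__main__":
    exact = ring_allreduce_time(512, n_bits=1e8, latency=0.0, inv_bandwidth=1e-9)
    approx = ring_allreduce_time_large_p(n_bits=1e8, inv_bandwidth=1e-9)
    print("exact:", exact, "approx:", approx)
```

In this model the bandwidth term approaches a constant as the number of processes grows, which is what makes the simplified formula independent of the process count.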
In Table 3 we report the simulated time of I3 on ImageNet with 512 processes. For GG, we assume it increases the batch size by a factor of 10 at epochs 30, 60 and 80. The batch size per GPU core is set to 16 for SGD (and 8 for the Hessian computation, due to the memory limit), and the total batch size used for the Hessian computation is set to images. The and is for one SGD update and is for one complete Hessian eigenvalue computation (including communication for the Hessian computation). Note that the total Hessian computation time for ABS/ABSA is only even though the Hessian computation is not efficiently implemented in the existing frameworks. Even with the additional Hessian overhead, ABS/ABSA is still much faster than BL (and these numbers are obtained with in-house code that is not highly optimized for Hessian computations). We furthermore note that we have included the additional computational overhead of the adversarial computations in the ABSA method.
Method  Computation (per SGD update)  Communication (per SGD update)  Hessian (per eigenvalue computation)  Total Time  
BL  2.2E2  1.5E2  0.  16666 
GG  2.2E2  1.5E2  0.  6150 ( faster) 
ABS  2.2E2  1.5E2  1.15  2666 ( faster) 
ABSA  3.6E2  1.5E2  1.15  3467 ( faster) 
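The relative speed of each method follows from the Total Time column of the table above as the ratio of the BL time to the method's time. The short sketch below computes these ratios from the printed totals; they are derived here and are not quoted from the original table.

```python
# Total Time column of Table 3 (values as printed above).
total_time = {"BL": 16666, "GG": 6150, "ABS": 2666, "ABSA": 3467}

# Speedup of each method relative to the BL baseline.
speedup = {method: total_time["BL"] / t for method, t in total_time.items()}

if __name__ == "__main__":
    for method, s in speedup.items():
        print(f"{method}: {s:.2f}x")
```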
Appendix D Additional empirical results
In this section, we present additional empirical results that were discussed in Section 4 (i.e., Table 12, and Figure 2).
BS  BL Acc.  BL # Iters  FB Acc.  FB # Iters  GG Acc.  GG # Iters  ABS Acc.  ABS # Iters  ABSA Acc.  ABSA # Iters 
128  94.90  81986  N.A.  N.A.  N.A.  N.A.  N.A.  N.A.  N.A.  N.A. 
512  94.76  20747  95.24  20747  95.49  51862  95.65  25353  95.72  24329 
2048  95.17  5186  95.00  5186  95.59  45935  95.51  10562  95.82  16578 
8192  93.73  1296  19.58  1296  95.70  44407  95.56  14400  95.61  7776 
32768  91.03  324  10.0  324  95.60  42867  95.60  7996  95.90  12616 
131072  84.75  81  10.0  81  95.58  42158  95.61  11927  95.92  11267 
BS  BL Acc.  BL # Iters  FB Acc.  FB # Iters  GG Acc.  GG # Iters  ABS Acc.  ABS # Iters  ABSA Acc.  ABSA # Iters 
128  83.05  35156  N.A.  N.A.  N.A.  N.A.  N.A.  N.A.  N.A.  N.A. 
640  81.01  7031  84.59  7031  83.99  16380  83.30  10578  84.52  9631 
3200  74.54  1406  78.70  1406  84.27  14508  83.33  6375  84.42  5168 
5120  70.64  878  74.65  878  83.47  14449  83.83  6575  85.01  6265 
10240  68.75  439  30.99  439  83.68  14400  83.56  5709  84.29  7491 
16000  67.88  281  10.00  281  84.00  14383  83.50  5739  84.24  5357 
BS  BL Acc.  BL # Iters  FB Acc.  FB # Iters  GG Acc.  GG # Iters  ABS Acc.  ABS # Iters  ABSA Acc.  ABSA # Iters 
128  87.64  35156  N.A.  N.A.  N.A.  N.A.  N.A.  N.A.  N.A.  N.A. 
640  86.20  7031  87.9  7031  87.84  16380  87.86  10399  89.05  10245 
3200  82.59  1406  73.2  1406  87.59  14508  88.02  5869  89.04  4525 
5120  81.40  878  63.27  878  87.85  14449  87.92  7479  88.64  5863 
10240  79.85  439  10.00  439  87.52  14400  87.84  5619  89.03  3929 
16000  81.06  281  10.00  281  88.28  14383  87.58  9321  89.19  4610 
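The BL "# Iters" column of the two tables above is consistent with a fixed number of epochs over a fixed training set, i.e. the number of SGD updates is the number of processed samples divided by the batch size, rounded down. The sketch below checks this under two assumptions that are not stated next to the tables in this text: a 90-epoch schedule (as for C1 and C2) and the 50,000 training images of Cifar10.

```python
def baseline_iterations(epochs, n_train, batch_size):
    """Number of SGD updates for fixed-batch training, rounded down."""
    return (epochs * n_train) // batch_size

if __name__ == "__main__":
    # Assumed setting: 90 epochs on the 50,000 Cifar10 training images.
    for bs in (128, 640, 3200, 5120, 10240, 16000):
        print(bs, baseline_iterations(90, 50_000, bs))
```

The adaptive methods (GG, ABS, ABSA) change the batch size during training, so their iteration counts do not follow this formula.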
BS  BL Acc.  BL # Iters  FB Acc.  FB # Iters  GG Acc.  GG # Iters  ABS Acc.  ABS # Iters  ABSA Acc.  ABSA # Iters 
128  67.67  62500  N.A.  N.A.  N.A.  N.A.  N.A.  N.A.  N.A.  N.A. 
256  67.12  31250  67.89  31250  66.79  46800  67.71  33504  67.32  33760 
512  66.47  15625  67.83  15625  67.74  39000  67.68  32240  67.87  24688 
1024  64.7  7812  67.72  7812  67.17  35100  65.31  22712  68.03  13688 
2048  62.91  3906  67.93  3906  67.76  33735  64.69  25180  68.43  12103 