Second-order Optimization Method for Large Mini-batch:
Training ResNet-50 on ImageNet in 35 Epochs
Abstract
Large-scale distributed training of deep neural networks suffers from a generalization gap caused by the increase in the effective mini-batch size. Previous approaches attempt to solve this problem by varying the learning rate and batch size over epochs and layers, or through ad hoc modifications of batch normalization. We propose an alternative approach using a second-order optimization method that shows similar generalization capability to first-order methods, but converges faster and can handle larger mini-batches. To test our method on a benchmark where highly optimized first-order methods are available as references, we train ResNet-50 on ImageNet. We converged to 75% top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and achieved 75% even with a mini-batch size of 131,072, which took 100 epochs.
1 Introduction
As the size of deep neural network models and of the data they are trained on continues to increase rapidly, the demand for distributed parallel computing is increasing. A common approach for achieving distributed parallelism in deep learning is the data-parallel approach, where the data is distributed across different processes while the model is replicated across them. When the mini-batch size per process is kept constant to increase the ratio of computation over communication, the effective mini-batch size over the entire system grows proportionally to the number of processes.
When the mini-batch size is increased beyond a certain point, the validation accuracy starts to degrade. This generalization gap caused by large mini-batch sizes has been studied extensively for various models and datasets [23]. Hoffer et al. attribute this generalization gap to the limited number of updates, and suggest training longer [13]. This has led to strategies such as scaling the learning rate proportionally to the mini-batch size, while using the first few epochs to gradually warm up the learning rate [24]. Such methods have enabled training with mini-batch sizes of 8K, where ImageNet [7] with ResNet-50 [12] could be trained for 90 epochs to achieve 76.3% top-1 validation accuracy in 60 minutes [9]. Combining this learning rate scaling with other techniques such as RMSprop warm-up, batch normalization without moving averages, and a slow-start learning rate schedule, Akiba et al. were able to train the same dataset and model with a mini-batch size of 32K to achieve 74.9% accuracy in 15 minutes [3].
More complex approaches for manipulating the learning rate have been proposed, such as LARS [29], where a different learning rate is used for each layer by normalizing it with the ratio between the layer-wise norms of the weights and gradients. This enabled training with a mini-batch size of 32K without ad hoc modifications, achieving 74.9% accuracy in 14 minutes (64 epochs) [29]. It has been reported that combining LARS with counterintuitive modifications of batch normalization can yield 75.8% accuracy even for a mini-batch size of 65K [15].
The use of small batch sizes to encourage rapid convergence in early epochs, followed by a progressive increase of the batch size, is yet another successful approach [24, 8]. Using such an adaptive batch size method, Mikami et al. were able to train in 224 seconds with an accuracy of 75.03% [22]. The hierarchical synchronization of mini-batches has also been proposed [18], but such methods have not been tested at scale to the best of the authors' knowledge.
In the present work, we take a more mathematically rigorous approach to tackle the large mini-batch problem, by using second-order optimization methods. We focus on the fact that for large mini-batch training, each mini-batch becomes more statistically stable and falls into the realm where second-order optimization methods may show some advantage. Another unique aspect of our approach is the accuracy at which we can approximate the Hessian when compared to other second-order methods. Unlike methods that use very crude approximations of the Hessian, such as TONGA [17] or Hessian-free methods [19], we adopt the Kronecker-Factored Approximate Curvature (K-FAC) method [20]. The two main characteristics of K-FAC are that it converges faster than first-order stochastic gradient descent (SGD) methods, and that it can tolerate relatively large mini-batch sizes without any ad hoc modifications. K-FAC has been successfully applied to convolutional neural networks [10], distributed-memory training of ImageNet [5], recurrent neural networks [21], Bayesian deep learning [30], and reinforcement learning [27].
Our contributions are:

We implement a distributed K-FAC optimizer using a synchronous all-worker scheme. We use half-precision floating-point numbers for both computation and communication, and exploit the symmetry of the Kronecker factors to reduce the overhead.

We were able to show for the first time that second-order optimization methods can achieve generalization capability similar to highly optimized SGD, by training ResNet-50 on ImageNet as a benchmark. We converged to 75% top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and achieved 75% even with a mini-batch size of 131,072, which took 100 epochs (Table 1).

We show that we can reduce the frequency of updating the Fisher matrices for K-FAC after a few hundred iterations, thereby reducing the overhead of K-FAC. We were able to train ResNet-50 on ImageNet in 10 minutes to a top-1 accuracy of 74.9% using 1,024 Tesla V100 GPUs (Table 2).

We show that the Fisher matrices for Batch Normalization layers [14] can be approximated as diagonal matrices, which further reduces the computation and memory consumption.
Table 1: Epochs, iterations, and top-1 validation accuracy of our training for each mini-batch size.

Mini-batch size | Epoch | Iteration | Accuracy
4,096           |   35  |   10,948  | 75.1 ± 0.09%
8,192           |   35  |    5,434  | 75.2 ± 0.05%
16,384          |   35  |    2,737  | 75.2 ± 0.03%
32,768          |   45  |    1,760  | 75.3 ± 0.13%
65,536          |   60  |    1,173  | 75.0 ± 0.09%
131,072         |  100  |      978  | 75.0 ± 0.06%
2 Related work
Table 2: Comparison with related work on large mini-batch training of ResNet-50 on ImageNet.

                   | Hardware         | Software    | Mini-batch size  | Optimizer   | Epoch | Time    | Accuracy
Goyal et al. [9]   | Tesla P100 x256  | Caffe2      | 8,192            | SGD         | 90    | 1 hr    | 76.3%
You et al. [29]    | KNL x2048        | Intel Caffe | 32,768           | SGD         | 90    | 20 min  | 75.4%
Akiba et al. [3]   | Tesla P100 x1024 | Chainer     | 32,768           | RMSprop+SGD | 90    | 15 min  | 74.9%
You et al. [29]    | KNL x2048        | Intel Caffe | 32,768           | SGD         | 64    | 14 min  | 74.9%
Jia et al. [15]    | Tesla P40 x2048  | TensorFlow  | 65,536           | SGD         | 90    | 6.6 min | 75.8%
Mikami et al. [22] | Tesla V100 x2176 | NNL         | 34,816 -> 69,632 | SGD         | 90    | 3.7 min | 75.0%
Ying et al. [28]   | TPU v3 x1024     | TensorFlow  | 32,768           | SGD         | 90    | 2.2 min | 76.3%
This work          | Tesla V100 x1024 | Chainer     | 32,768           | K-FAC       | 45    | 10 min  | 74.9%
With respect to large-scale distributed training of deep neural networks, there have been very few studies that use second-order methods. At a smaller scale, there have been previous studies that used K-FAC to train ResNet-50 on ImageNet [5]. However, the SGD baseline they used as a reference did not show state-of-the-art top-1 validation accuracy (only around 70%), so the advantage of K-FAC over SGD that they claim was not obvious from the results. In the present work, we compare the top-1 validation accuracy with state-of-the-art SGD methods for large mini-batches mentioned in the introduction (Table 2).
The previous studies that used K-FAC to train ResNet-50 on ImageNet [5] also did not consider large mini-batches, and trained only with a mini-batch size of 512 on 8 GPUs. In contrast, the present work uses mini-batch sizes up to 131,072, which is equivalent to 32 per GPU on 4,096 GPUs, and we are able to achieve a much higher top-1 validation accuracy of 75%. Note that such large mini-batch sizes can also be achieved by accumulating the gradient over multiple iterations before updating the parameters, which can mimic the behavior of the execution on many GPUs without actually running on them.
The previous studies using K-FAC also suffered from the large overhead of communication, since they implemented K-FAC in TensorFlow and used a parameter-server approach [1]. Since the parameter server requires all workers to send their gradients to, and receive the latest model parameters from, the parameter server, it becomes a huge communication bottleneck, especially at large scale. Our implementation uses a decentralized approach with MPI/NCCL collective communications among the processes. Although software like Horovod can alleviate the problems with parameter servers, the decentralized approach has been used in high performance computing for a long time, and is known to scale to thousands of GPUs without modification.
3 Distributed KFAC
3.1 Notation and background
Throughout this paper, we use $\mathbb{E}[\cdot]$ as the mean among the samples in the mini-batch, and compute the cross-entropy loss as

$$\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}\left[-\log p(\mathbf{t}\,|\,\mathbf{x};\boldsymbol{\theta})\right] \quad (1)$$

where $\mathbf{x}, \mathbf{t}$ are the training input and label (one-hot vector), and $p(\mathbf{t}\,|\,\mathbf{x};\boldsymbol{\theta})$ is the likelihood calculated by the probabilistic model using a deep neural network with the parameters $\boldsymbol{\theta} \in \mathbb{R}^{N}$.
Update rule of SGD. For standard first-order stochastic gradient descent (SGD), the parameters $\boldsymbol{\theta}_{\ell}$ in the $\ell$-th layer are updated based on the gradient of the loss function:

$$\boldsymbol{\theta}_{\ell}^{(k+1)} \leftarrow \boldsymbol{\theta}_{\ell}^{(k)} - \eta\,\nabla\mathcal{L}_{\ell} \quad (2)$$

where $\eta$ is the learning rate and $\nabla\mathcal{L}_{\ell}$ represents the gradient of the loss function for $\boldsymbol{\theta}_{\ell}$.
Fisher information matrix. The Fisher information matrix (FIM) of the probabilistic model is estimated by

$$\mathbf{F} = \mathbb{E}\left[\nabla\log p(\mathbf{t}\,|\,\mathbf{x};\boldsymbol{\theta})\,\nabla\log p(\mathbf{t}\,|\,\mathbf{x};\boldsymbol{\theta})^{\top}\right] \quad (3)$$

Strictly speaking, $\mathbf{F}$ is the empirical (stochastic version of the) FIM [20], but we refer to this matrix as the FIM throughout this paper for the sake of brevity. In the training of deep neural networks, the FIM can be regarded as the curvature matrix in the parameter space [4, 20, 6].
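As a concrete illustration, the estimator in Equation (3) can be sketched in NumPy as follows. The toy setup (a linear softmax classifier with flattened parameters, the sizes, and the function name) is hypothetical and is not the paper's ResNet-50 configuration; it only shows the mean of outer products of per-sample log-likelihood gradients, with labels sampled from the model itself.

```python
import numpy as np

def empirical_fim(X, W):
    """Empirical FIM of a toy softmax model p(t|x; W) (cf. Eq. 3).

    Hypothetical sketch: theta = vec(W); for each sample we draw
    t ~ p(t|x; W) and accumulate the outer product of the per-sample
    gradient of log p(t|x; theta), then average over the mini-batch.
    """
    rng = np.random.default_rng(0)
    n, d = X.shape
    c = W.shape[1]
    F = np.zeros((d * c, d * c))
    for x in X:
        logits = x @ W
        p = np.exp(logits - logits.max())
        p /= p.sum()
        t = rng.choice(c, p=p)               # label sampled from the model
        onehot = np.eye(c)[t]
        g = np.outer(x, onehot - p).ravel()  # d/dW log p(t|x;W), flattened
        F += np.outer(g, g)
    return F / n                             # mean over the mini-batch

X = np.random.default_rng(1).normal(size=(64, 5))
F = empirical_fim(X, np.zeros((5, 3)))
# F is symmetric positive semi-definite by construction.
```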
3.2 K-FAC
Kronecker-factored approximate curvature (K-FAC) [20] is a second-order optimization method for deep neural networks, based on an accurate and mathematically rigorous approximation of the FIM. Here, K-FAC is applied to the training of convolutional neural networks that minimize the negative log likelihood (e.g. a classification task with the loss function (1)).
For the training of a deep neural network with $L$ layers, K-FAC approximates $\mathbf{F}$ as a block-diagonal matrix:

$$\mathbf{F} \approx \mathrm{diag}\left(\mathbf{F}_{1}, \dots, \mathbf{F}_{\ell}, \dots, \mathbf{F}_{L}\right) \quad (4)$$

The diagonal block $\mathbf{F}_{\ell}$ represents the FIM for the $\ell$-th layer of the deep neural network with weights $\mathbf{W}_{\ell}$ ($\ell = 1, \dots, L$). Each diagonal block is approximated as a Kronecker product:

$$\mathbf{F}_{\ell} \approx \mathbf{G}_{\ell} \otimes \mathbf{A}_{\ell-1} \quad (5)$$

This is called Kronecker factorization, and $\mathbf{G}_{\ell}, \mathbf{A}_{\ell-1}$ are called Kronecker factors. $\mathbf{G}_{\ell}$ is computed from the gradient of the output of the $\ell$-th layer, and $\mathbf{A}_{\ell-1}$ is computed from the activation of the $(\ell-1)$-th layer (the input of the $\ell$-th layer) [20, 10].
The inverse of a Kronecker product is given by the Kronecker product of the inverses, so the inverse of each diagonal block is approximated as:

$$\mathbf{F}_{\ell}^{-1} \approx \mathbf{G}_{\ell}^{-1} \otimes \mathbf{A}_{\ell-1}^{-1} \quad (6)$$
Update rule of K-FAC. The parameters $\boldsymbol{\theta}_{\ell}$ in the $\ell$-th layer are updated as follows:

$$\boldsymbol{\theta}_{\ell}^{(k+1)} \leftarrow \boldsymbol{\theta}_{\ell}^{(k)} - \eta\,\mathcal{G}_{\ell}^{(k)} \quad (7)$$

$$\mathcal{G}_{\ell}^{(k)} = \mathbf{F}_{\ell}^{-1}\,\nabla\mathcal{L}_{\ell} \quad (8)$$

where $\mathcal{G}_{\ell}^{(k)}$ is the preconditioned gradient.
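To make the update concrete, the preconditioning for a single fully-connected layer can be sketched in NumPy as below. The sizes, the damping value, and the variable names are placeholders, not the paper's settings; the sketch verifies that applying the factored inverse (Equation 6) as two small matrix solves matches the explicit inverse of the Kronecker product.

```python
import numpy as np

rng = np.random.default_rng(0)
din, dout, batch = 4, 3, 256

# Per-sample activations a (layer input) and output gradients g.
a = rng.normal(size=(batch, din))
g = rng.normal(size=(batch, dout))
dW = (a.T @ g) / batch                     # gradient of the loss w.r.t. W

# Kronecker factors (Eq. 5), lightly damped for invertibility.
A = a.T @ a / batch + 1e-3 * np.eye(din)   # activation factor
G = g.T @ g / batch + 1e-3 * np.eye(dout)  # output-gradient factor

# Preconditioned gradient via the factored inverse (Eq. 6):
# (G kron A)^{-1} vec(dW) = vec(A^{-1} dW G^{-1}) with column-major vec.
precond = np.linalg.solve(A, dW) @ np.linalg.inv(G)

# Check against the explicit Kronecker-product inverse.
full = np.linalg.solve(np.kron(G, A),
                       dW.flatten('F')).reshape((din, dout), order='F')
assert np.allclose(precond, full)
```

The factored form only ever inverts a din x din and a dout x dout matrix, which is what makes K-FAC tractable for large layers.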
3.3 Our design
Due to the extra computation of the inverse FIM, K-FAC has considerable overhead compared to SGD. We designed our distributed parallelization scheme so that this overhead decreases as the number of processes is increased. Furthermore, we introduce a relaxation technique to reduce the computation of the FIM, which is explained in Section 5.4. In doing so, we were able to reduce the overhead of K-FAC to an almost negligible amount. We implement all computation on top of an existing framework, Chainer [2, 25]. Using this framework, the user is only required to modify one line of code to apply K-FAC to an existing CNN model definition.
In this section we focus on the design of our distributed parallelization scheme for KFAC.
Figure 1 shows the overview of our design for a single iteration of the training. We use the term stage to refer to each phase of the calculation, which is indicated at the top of the figure. The variables in each box illustrate what the process computes during that stage; e.g. at stage 1, each process computes the Kronecker factor $\mathbf{A}_{\ell-1}$ from the activation.
Stages 1 and 2 are the forward pass and backward pass, in which the Kronecker factors $\mathbf{A}_{\ell-1}$ and $\mathbf{G}_{\ell}$ are computed, respectively. Since the first two stages are computed in a data-parallel fashion, each process computes the Kronecker factors for all layers, but using different mini-batches. In order to get the Kronecker factors for the global mini-batch, we need to average these matrices over all processes. This is performed by a ReduceScatterV collective communication, which essentially transitions our approach from data-parallelism to model-parallelism. This collective is much more efficient than an AllReduce, and distributes the Kronecker factors to different processes while summing their values. Stage 3 shows the result after the communication, where the model is now distributed across the GPUs.
Stage 4 is the matrix inverse computation, and stage 5 is the matrix-matrix multiplication for computing the preconditioned gradient. These computations are performed in a model-parallel fashion, where each process inverts the Kronecker factors and multiplies them by the gradient of different layers. When the number of layers is larger than the number of GPUs, multiple layers are handled by each GPU, as shown in Figure 1. If the number of layers is smaller than the number of GPUs, some layers are calculated redundantly on multiple GPUs. This simplifies the implementation, reduces the communication, and prevents load imbalance.
Once we obtain the preconditioned gradient, we switch back to data-parallelism by calling an AllGatherV collective. After stage 6 is finished, all processes can update their parameters using the preconditioned gradients for all layers. As we will mention in Section 5.4, we are able to reduce the amount of communication required for the Kronecker factors $\mathbf{A}_{\ell-1}$ and $\mathbf{G}_{\ell}$. Therefore, the amount of communication is similar to SGD, where the AllReduce is implemented as a ReduceScatter+AllGather.
Algorithm 1 shows the pseudocode of our distributed K-FAC design.
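The stages above can be simulated on a single machine without MPI; the following hedged NumPy sketch (4 simulated "processes", 8 layers, and tiny factor sizes, all hypothetical) models the ReduceScatter/AllGather flow and checks that the scattered average equals a plain AllReduce.

```python
import numpy as np

# Single-process simulation of the hybrid data/model-parallel stages.
# Hypothetical sizes: P "processes", L layers, n x n Kronecker factors.
P, L, n = 4, 8, 3
rng = np.random.default_rng(0)

# Stages 1-2 (data-parallel): every process computes the Kronecker factors
# for ALL layers from its own local mini-batch.
local = [[rng.normal(size=(n, n)) for _ in range(L)] for _ in range(P)]

# Stage 3: ReduceScatterV -- the factors are averaged over processes, and
# each process keeps only the layers it owns (layer l goes to process l % P).
owned = {p: [l for l in range(L) if l % P == p] for p in range(P)}
reduced = {l: sum(local[p][l] for p in range(P)) / P for l in range(L)}

# Stages 4-5 (model-parallel): each process inverts the factors of its own
# layers (and would precondition the corresponding gradients here).
inv = {l: np.linalg.inv(reduced[l]) for p in range(P) for l in owned[p]}

# Stage 6: AllGatherV -- every process receives the results for all layers.
# The scattered average must equal what a plain AllReduce would produce.
allreduce = [sum(local[p][l] for p in range(P)) / P for l in range(L)]
assert all(np.allclose(reduced[l], allreduce[l]) for l in range(L))
```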
3.4 Further acceleration
Our hybrid data-parallel and model-parallel approach allows us to minimize the overhead of K-FAC in a distributed setting. However, K-FAC still has a large overhead compared to SGD. There are two hotspots in our distributed K-FAC design. The first is the construction of the Kronecker factors, which cannot be done in a model-parallel fashion. The second is the extra communication for distributing these Kronecker factors. In this section, we discuss how we accelerated these two hotspots to achieve faster training time.
Mixed-precision computation. K-FAC requires the construction of the Kronecker factors $\mathbf{A}_{\ell-1}$ and $\mathbf{G}_{\ell}$ for all layers in the model. Since this operation must be done before taking the global average, it belongs to the data-parallel stages of our hybrid approach. Therefore, its computation time does not decrease even when more processes are used, and it becomes relatively heavy compared to the other stages. To accelerate this computation, we use the Tensor Cores of the NVIDIA Volta architecture. This more than doubles the speed of the calculation for this part.
Symmetry-aware communication. The Kronecker factors are all symmetric matrices, so we exploit this property to reduce the volume of communication. To communicate a symmetric matrix of size $n \times n$, we only need to send the upper triangular part, i.e. $n(n+1)/2$ elements.
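The packing and unpacking around such a send can be sketched in NumPy as follows (the function names are illustrative, not part of our implementation):

```python
import numpy as np

def pack_upper(S):
    """Pack a symmetric n x n matrix into its n(n+1)/2 upper-triangular
    elements before communication (nearly halving the message volume)."""
    return S[np.triu_indices(S.shape[0])]

def unpack_upper(buf, n):
    """Rebuild the full symmetric matrix on the receiving side."""
    S = np.zeros((n, n))
    S[np.triu_indices(n)] = buf
    return S + np.triu(S, 1).T   # mirror the strict upper triangle

n = 6
A = np.random.default_rng(0).normal(size=(n, n))
S = A + A.T                                  # a symmetric Kronecker factor
buf = pack_upper(S)
assert buf.size == n * (n + 1) // 2          # 21 elements instead of 36
assert np.allclose(unpack_upper(buf, n), S)  # lossless round trip
```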
4 Training schemes
The behavior of K-FAC on large models and datasets has not been studied at length. Also, there are very few studies that use K-FAC for large mini-batches (over 4K) using distributed parallelism at scale. Contrary to SGD, whose hyperparameters have been tuned by many practitioners even for large mini-batches, there is very little insight on how to tune hyperparameters for K-FAC. In this section, we show the methods, which we call training schemes, that we explored to achieve higher accuracy in our large mini-batch training with K-FAC.
4.1 Data augmentation
We resize all images in ImageNet to 256x256 pixels, ignoring the aspect ratio of the original images, and compute the mean value of the upper-left 224x224 region of the resized images. When reading an image, we randomly crop a 224x224 image from it, randomly flip it horizontally, subtract the mean value, and scale every pixel to [0, 1].
Running mixup. We extend mixup [31, 11] to increase its regularization effect. We synthesize virtual training samples from raw samples and the virtual samples from the previous step (while the original mixup method synthesizes only from the raw samples):

$$\tilde{\mathbf{x}}^{(k)} = \lambda\,\mathbf{x}^{(k)} + (1-\lambda)\,\tilde{\mathbf{x}}^{(k-1)} \quad (9)$$

$$\tilde{\mathbf{t}}^{(k)} = \lambda\,\mathbf{t}^{(k)} + (1-\lambda)\,\tilde{\mathbf{t}}^{(k-1)} \quad (10)$$

$\mathbf{x}^{(k)}, \mathbf{t}^{(k)}$ are a raw input and label (one-hot vector), and $\tilde{\mathbf{x}}^{(k)}, \tilde{\mathbf{t}}^{(k)}$ are a virtual input and label for the $k$-th step. $\lambda$ is sampled from the Beta distribution

$$\lambda \sim \mathrm{Beta}(\alpha, \alpha) \quad (11)$$

where we set the value of $\alpha$ as a hyperparameter.
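A minimal sketch of one running-mixup step (Equations 9-10) in NumPy; the batch shapes and the value of alpha used below are placeholders, not the paper's settings:

```python
import numpy as np

def running_mixup_step(x_raw, t_raw, x_prev, t_prev, alpha, rng):
    """One step of running mixup (Eqs. 9-10): blend the raw batch with the
    *virtual* batch from the previous step, instead of with another raw
    batch as in the original mixup. alpha is a hyperparameter; the value
    used in the paper's experiments is not fixed in this sketch."""
    lam = rng.beta(alpha, alpha)
    x_virt = lam * x_raw + (1.0 - lam) * x_prev
    t_virt = lam * t_raw + (1.0 - lam) * t_prev
    return x_virt, t_virt

rng = np.random.default_rng(0)
x_prev = rng.normal(size=(8, 32))             # virtual inputs from step k-1
t_prev = np.eye(10)[rng.integers(10, size=8)]
x_raw = rng.normal(size=(8, 32))
t_raw = np.eye(10)[rng.integers(10, size=8)]
x_virt, t_virt = running_mixup_step(x_raw, t_raw, x_prev, t_prev, 0.4, rng)
# Virtual labels remain valid probability vectors (rows sum to 1).
assert np.allclose(t_virt.sum(axis=1), 1.0)
```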
Random erasing with zero value. We also adopt Random Erasing. We put a zero value in the erased region of each input instead of a random value as used in the original method. We set the erasing probability, the erasing area ratio, and the erasing aspect ratio as fixed hyperparameters, and randomly choose the size of the erasing area within the specified range.
4.2 Damping for Fisher information matrix
The eigenvalue distribution of the Fisher information matrix (FIM) of deep neural networks is known to have an extremely long tail [16], where most of the eigenvalues are close to zero. This in turn causes the eigenvalues of the inverse FIM to become extremely large, so the norm of the preconditioned gradient becomes huge compared to that of the parameters, and the training becomes unstable. To prevent this problem, we add a regularization term to the objective:

$$\tilde{\mathcal{L}}(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta}) + \frac{\gamma}{2}\left\|\boldsymbol{\theta} - \bar{\boldsymbol{\theta}}\right\|^{2} \quad (12)$$
Here, $\bar{\boldsymbol{\theta}}$ is the expected mean of the weights $\boldsymbol{\theta}$. We use $\bar{\boldsymbol{\theta}} = \mathbf{1}$ for the weights of the Batch Normalization layers and $\bar{\boldsymbol{\theta}} = \mathbf{0}$ for those of the other layers. Since we view the FIM as the curvature of the parameter space, we add the second derivative of the regularization term to the diagonal elements of the FIM. Finally, we get the preconditioned gradient for $\boldsymbol{\theta}_{\ell}$ by the following procedure ($d > 0$):

$$\bar{\mathbf{F}}_{\ell} = \mathbf{F}_{\ell} + (\gamma + d)\,\mathbf{I} \quad (13)$$

$$\mathcal{G}_{\ell}^{(k)} = \bar{\mathbf{F}}_{\ell}^{-1}\,\nabla\tilde{\mathcal{L}}_{\ell} \quad (14)$$
Adding a positive value to the diagonal of the FIM to stabilize the training can be thought of as a form of damping [20]. As the damping limits the maximum eigenvalues of the inverse, we can restrict the norm of the preconditioned gradient. This prevents K-FAC from moving too far in flat directions.
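The stabilizing effect of the damping can be seen numerically. The sketch below (synthetic spectrum and toy damping values, not the paper's settings) builds a FIM with a long-tailed eigenvalue distribution and shows that adding a positive value to the diagonal bounds the preconditioned gradient norm by the inverse of that value.

```python
import numpy as np

# Damping sketch (cf. Eqs. 13-14): adding (gamma + d) to the diagonal of
# the FIM caps the eigenvalues of its inverse at 1/(gamma + d), bounding
# the norm of the preconditioned gradient. Toy numbers throughout.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
eigs = np.array([1e-8, 1e-6, 1e-4, 1e-2, 1.0, 10.0])  # long-tailed spectrum
F = Q @ np.diag(eigs) @ Q.T                            # synthetic FIM
grad = rng.normal(size=6)

raw = np.linalg.solve(F, grad)                # near-singular: a huge step
gamma, d = 1e-4, 1e-3
damped = np.linalg.solve(F + (gamma + d) * np.eye(6), grad)

assert np.linalg.norm(damped) < np.linalg.norm(raw)
assert np.linalg.norm(damped) <= np.linalg.norm(grad) / (gamma + d) + 1e-9
```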
4.3 Warmup damping
We control the norm of the preconditioned gradient by changing the coefficient of the regularization term during the training. At early stages of the training, the FIM changes rapidly (Figure 5). Therefore, we start with a large damping rate and gradually decrease it using following rule:
(15)  
(16) 
is the value for the damping in the th step. is the initial value, and controls the steps to reach the target value . At each iteration, we use () for the Batch Normalization layers to stabilize the training.
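This warmup-damping schedule can be written as a small function. Note that the geometric form below is one consistent reading of Equations (15)-(16); the exact schedule used in the experiments may differ, and the numeric values in the usage line are placeholders.

```python
def damping_schedule(step, d_init, d_target, decay_steps):
    """Warmup damping sketch: start from a large damping d_init and decay
    geometrically so that d_target is reached after decay_steps iterations,
    then hold at d_target (an assumed reading of Eqs. 15-16)."""
    r = (d_target / d_init) ** (1.0 / decay_steps)
    return max(d_init * r ** step, d_target)

# Hypothetical values: start at 2.5, reach 2.5e-4 after 500 iterations.
ds = [damping_schedule(k, 2.5, 2.5e-4, 500) for k in range(1000)]
assert ds[0] == 2.5
assert all(a >= b for a, b in zip(ds, ds[1:]))   # monotonically decreasing
```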
4.4 Learning rate and momentum
The learning rate used for all of our experiments is scheduled by polynomial decay. The learning rate $\eta^{(e)}$ for the $e$-th epoch is determined as follows:

$$\eta^{(e)} = \eta^{(0)}\left(1 - \frac{e - e_{\mathrm{start}}}{e_{\mathrm{end}} - e_{\mathrm{start}}}\right)^{p} \quad (17)$$

$\eta^{(0)}$ is the initial learning rate, and $e_{\mathrm{start}}, e_{\mathrm{end}}$ are the epochs when the decay starts and ends, respectively. The decay rate $p$ guides the speed of the learning rate decay. The learning rate schedules used in our experiments are plotted in Figure 3.
We use the momentum method for the K-FAC updates. Because the learning rate decays rapidly in the final stage of the training with the polynomial decay, the current update can become smaller than the previous update. We adjust the momentum rate $m^{(e)}$ for the $e$-th epoch so that the ratio between $\eta^{(e)}$ and $1 - m^{(e)}$ is fixed throughout the training:

$$m^{(e)} = 1 - \frac{\eta^{(e)}}{\eta^{(0)}}\left(1 - m^{(0)}\right) \quad (18)$$

where $m^{(0)}$ is the initial momentum rate. The weights are updated as follows:

$$\boldsymbol{\theta}_{\ell}^{(k+1)} = \boldsymbol{\theta}_{\ell}^{(k)} - \eta^{(e)}\,\mathbf{F}_{\ell}^{-1}\nabla\mathcal{L}_{\ell} + m^{(e)}\left(\boldsymbol{\theta}_{\ell}^{(k)} - \boldsymbol{\theta}_{\ell}^{(k-1)}\right) \quad (19)$$
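The coupled schedules can be sketched as follows. The formulas are a reconstruction of Equations (17)-(18) and the numeric values are placeholders, so treat this as an assumption rather than the exact schedule of the experiments; the check confirms the invariant that (1 - m) / eta stays constant across epochs.

```python
def lr_momentum(e, eta0, m0, e_start, e_end, p):
    """Polynomial learning-rate decay (cf. Eq. 17) with the momentum
    adjusted so that (1 - m) / eta is constant (cf. Eq. 18). The exact
    forms are an assumed reconstruction."""
    frac = min(max((e - e_start) / (e_end - e_start), 0.0), 1.0)
    eta = eta0 * (1.0 - frac) ** p
    m = 1.0 - (eta / eta0) * (1.0 - m0)
    return eta, m

eta0, m0 = 8.0, 0.9   # hypothetical initial values
for e in range(30):
    eta, m = lr_momentum(e, eta0, m0, 1, 30, 4.0)
    if eta > 0:
        # The ratio (1 - m) / eta is fixed throughout the training.
        assert abs((1.0 - m) / eta - (1.0 - m0) / eta0) < 1e-12
```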
4.5 Weights rescaling
5 Results
We train ResNet-50 [12] on ImageNet [7] in all of our experiments. We use the same hyperparameters for the same mini-batch size when comparing the different schemes in Section 4. The training curves for the top-1 validation accuracy shown in Figures 3 and 6 are averaged over 2 or 3 executions using the same hyperparameters. The hyperparameters for our results are shown in Table 3.
Table 3: Warmup damping (Section 4.3) and learning rate and momentum (Section 4.4) hyperparameters for each mini-batch size: 4,096; 8,192; 16,384; 32,768; 65,536; 131,072.
5.1 Experiment environment
We conducted all experiments on the ABCI (AI Bridging Cloud Infrastructure) supercomputer operated by the National Institute of Advanced Industrial Science and Technology (AIST) in Japan. ABCI has 1,088 nodes with four NVIDIA Tesla V100 GPUs per node. Due to the additional memory required by K-FAC, all of our experiments use a mini-batch size of 32 images per GPU. For large mini-batch experiments that cannot be executed directly, we used an accumulation method to mimic the behavior by accumulating over multiple steps. We were only given a 24-hour window to use the full machine, so we had to tune the hyperparameters on a smaller number of nodes while mimicking the global mini-batch size of the full-node run.
5.2 Scalability
We measured the scalability of our distributed K-FAC implementation on ResNet-50 with the ImageNet dataset. Figure 2 shows the time for one iteration using different numbers of GPUs. Ideally, this plot should show a flat line parallel to the x-axis, since we expect the time per iteration to be independent of the number of GPUs. From 1 GPU to 64 GPUs, we observed a superlinear scaling, where the 64-GPU case is 131.1% faster compared to 1 GPU; this is a consequence of our hybrid data/model-parallel design. ResNet-50 has 107 layers in total when all the convolution, fully-connected, and batch normalization layers are accounted for. Despite this superlinear scaling, after 256 GPUs we observe performance degradation due to the communication overhead.
5.3 Large minibatch training with KFAC
We trained ResNet-50 for the classification task on ImageNet with extremely large mini-batch sizes BS = {4,096 (4K), 8,192 (8K), 16,384 (16K), 32,768 (32K), 65,536 (65K), 131,072 (131K)} and achieved a competitive top-1 validation accuracy. The summary of the training is shown in Table 1. The training curves and the learning rate schedules are plotted in Figure 3. When we use BS = {4K, 8K, 16K, 32K, 65K}, the training converges in much less than 90 epochs, which is the usual number of epochs required by SGD-based training of ImageNet [3, 9, 15, 22, 29]. For BS = {4K, 8K, 16K}, the number of epochs required to reach higher than 75% top-1 validation accuracy does not change much. Even for a relatively large mini-batch size of 32K, K-FAC still converges in half the number of epochs compared to SGD. When increasing the mini-batch size to 65K, we see a 33% increase in the number of epochs it takes to converge. Note that the calculation time still decreases as long as the number of epochs less than doubles when we double the mini-batch size, assuming that doubling the mini-batch corresponds to doubling the number of GPUs (and halving the execution time). At BS = 131K, there are only 10 iterations per epoch, since the dataset size of ImageNet is 1,281,167. No SGD-based training of ImageNet has sustained the top-1 validation accuracy at this mini-batch size. Furthermore, this is the first work that uses K-FAC for training with extremely large mini-batch sizes BS = {16K, 32K, 65K, 131K} and achieves a competitive top-1 validation accuracy.
5.4 Fisher information and large minibatch training
We analyzed the relationship between large mini-batch training with K-FAC and the Fisher information matrix of ResNet-50.
Staleness of Fisher information. To achieve faster training with distributed K-FAC, it is necessary to reduce the computation and communication of the FIM (or the Kronecker factors). In ResNet-50 for ImageNet classification, the data of the Kronecker factors for the convolutional layers and of the FIM for the Batch Normalization layers are dominant (Figure 4). Note that we do not factorize the FIM for the Batch Normalization layers into $\mathbf{A}_{\ell-1}$ and $\mathbf{G}_{\ell}$. Previous work on K-FAC used stale Kronecker factors by recalculating them only every few steps [20]. Even though our efficient distributed scheme minimizes the overhead of the Kronecker factor calculation, we thought it was worth investigating how much staleness we can tolerate to further speed up our method. We examine the change rate of the Kronecker factors for the convolutional layers and of the FIM for the Batch Normalization layers:
$$\frac{\left\|\mathbf{X}^{(k+1)} - \mathbf{X}^{(k)}\right\|_{F}}{\left\|\mathbf{X}^{(k)}\right\|_{F}} \quad (21)$$

where $\|\cdot\|_{F}$ is the Frobenius norm and $\mathbf{X}$ is a Kronecker factor or the FIM of a Batch Normalization layer. The results from our large mini-batch training (Section 5.3) are plotted in Figure 5. We can see that the FIM fluctuates less for larger mini-batches, because each mini-batch becomes more statistically stable. This implies that we can reduce the frequency of updating the FIM more aggressively for larger mini-batches. For the convolutional layers, the Kronecker factor $\mathbf{A}_{\ell-1}$, which represents the correlation among the dimensions of the input of the $\ell$-th layer, fluctuates less than $\mathbf{G}_{\ell}$, which represents the correlation among the dimensions of the gradient for the output of the $\ell$-th layer. Hence, we can also consider refreshing $\mathbf{A}_{\ell-1}$ less frequently than $\mathbf{G}_{\ell}$.
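The change-rate metric of Equation (21) is straightforward to compute; a short NumPy sketch (the function name is illustrative):

```python
import numpy as np

def change_rate(X_new, X_old):
    """Relative change of a Kronecker factor or FIM block between two
    updates (cf. Eq. 21), measured in the Frobenius norm."""
    return np.linalg.norm(X_new - X_old) / np.linalg.norm(X_old)

# Scaling a matrix by 1.1 gives a relative change of 0.1.
X = np.eye(3)
assert abs(change_rate(1.1 * X, X) - 0.1) < 1e-12
```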
Training with stale Fisher information. We found that, regardless of the mini-batch size, the FIM changes rapidly during the first 500 or so iterations. Based on this, we reduce the frequency of updating the Kronecker factors and the FIM after 500 iterations, applying a heuristic schedule (Equation 22) in which the refreshing interval (in iterations) for the $e$-th epoch increases as the training proceeds.
Using 1,024 NVIDIA Tesla V100 GPUs, we achieve 74.9% top-1 accuracy with ResNet-50 on ImageNet in 10 minutes (45 epochs, including a validation after each epoch). We used the same hyperparameters as shown in Table 3. The training time and the validation accuracy are competitive with the results reported by related work that uses SGD for training (the comparison is shown in Table 2).
Diagonal Fisher information matrix. As shown in Figure 4, the FIM for the Batch Normalization (BN) layers contributes a large portion of the memory overhead of K-FAC. To alleviate this overhead, we approximate it with a diagonal matrix. By using the diagonal approximation, we can reduce the memory consumption of the FIM for BN from 1,017 MiB to 587 MiB (Figure 4). We measure the effect of the diagonal approximation on the accuracy of ResNet-50 for ImageNet with mini-batch size BS = 32,768, with and without using stale Fisher information for all layers. In this experiment, we adopt another heuristic for the refreshing interval (Equation 23).
The training curves are plotted in Figure 6. Using the diagonal FIM does not affect the training curve, even with a stale FIM. This result suggests that only the diagonal values of the FIM are essential for the training of the BN layers.
6 Conclusion
In this work, we proposed a large-scale distributed computational design for second-order optimization using Kronecker-Factored Approximate Curvature (K-FAC), and showed the advantages of K-FAC over first-order stochastic gradient descent (SGD) for the training of ResNet-50 on ImageNet classification using extremely large mini-batches. We introduced several schemes for training using K-FAC with mini-batch sizes up to 131,072, and achieved over 75% top-1 accuracy in far fewer epochs compared to existing work using SGD with large mini-batches. Contrary to prior claims that second-order methods do not generalize as well as SGD, we were able to show that this is not the case, even for extremely large mini-batches. The data and model hybrid parallelism introduced in our design allowed us to train on 1,024 GPUs and achieve 74.9% in 10 minutes by using K-FAC with a stale Fisher information matrix (FIM). This is the first work that observes the relationship between the FIM of ResNet-50 and its training on large mini-batches ranging from 4K to 131K. There is still room for improvement in our distributed design to overcome the bottleneck of computation/communication for K-FAC; for example, the Kronecker factors can be approximated more aggressively without loss of accuracy. One interesting observation is that, whenever we coupled our method with a well-known technique that improves the convergence of SGD, it allowed us to approximate the FIM more aggressively without any loss of accuracy. This suggests that all these seemingly ad hoc techniques to improve the convergence of SGD are actually performing a role equivalent to that of the FIM in some way. The advantage that we have in designing better optimizers by taking this approach is that we are starting from the most mathematically rigorous form, and every improvement that we make is a systematic design decision based on observation of the FIM.
Even if we end up with performance similar to the best known first-order methods, at least we will have a better understanding of why they work by starting from second-order methods. Further analysis of the eigenvalues of the FIM and their effect on preconditioning the gradient will allow us to further understand the advantage of second-order methods for the training of deep neural networks with extremely large mini-batches.
Acknowledgements
The computational resource of the AI Bridging Cloud Infrastructure (ABCI) was awarded by the "ABCI Grand Challenge" Program, National Institute of Advanced Industrial Science and Technology (AIST). This work is supported by JST CREST Grant Number JPMJCR1687, Japan. This work was supported by JSPS KAKENHI Grant Number JP18H03248. (Part of) This work is conducted as research activities of the AIST - Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL). This work is supported by the "Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures" in Japan (Project ID: jh180012-NAHI).
References
 [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [2] T. Akiba, K. Fukuda, and S. Suzuki. ChainerMN: Scalable Distributed Deep Learning Framework. In Proceedings of Workshop on ML Systems in The Thirtyfirst Annual Conference on Neural Information Processing Systems (NIPS), 2017.
 [3] T. Akiba, S. Suzuki, and K. Fukuda. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. http://arxiv.org/abs/1711.04325, 2017.
 [4] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
 [5] J. Ba, R. Grosse, and J. Martens. Distributed secondorder optimization using kroneckerfactored approximations. In International Conference on Learning Representations, 2017.
 [6] A. Botev, H. Ritter, and D. Barber. Practical Gauss-Newton Optimisation for Deep Learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 557-565. PMLR, June 2017.
 [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 [8] A. Devarakonda, M. Naumov, and M. Garland. Adabatch: Adaptive batch sizes for training deep neural networks. In arXiv:1712.02029v2, 2017.
 [9] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. http://arxiv.org/abs/1706.02677, 2017.
 [10] R. Grosse and J. Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573-582, 2016.
 [11] H. Guo, Y. Mao, and R. Zhang. MixUp as Locally Linear OutOfManifold Regularization. arXiv:1809.02499 [cs, stat], Sept. 2018.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], Dec. 2015.
 [13] E. Hoffer, I. Hubara, and D. Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In 31st Conference on Neural Information Processing Systems, 2017.
 [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. http://arxiv.org/abs/1502.03167, 2015.
 [15] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu. Highly scalable deep learning training system with mixedprecision: Training imagenet in four minutes. http://arxiv.org/abs/1807.11205, 2018.
 [16] R. Karakida, S. Akaho, and S.i. Amari. Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach. arXiv:1806.01316 [condmat, stat], June 2018.
 [17] N. Le Roux, P.-A. Manzagol, and Y. Bengio. Topmoumoute Online Natural Gradient Algorithm, volume 20, chapter Advances in Neural Information Processing Systems, pages 849-856. MIT Press, 2008.
 [18] T. Lin, S. U. Stich, and M. Jaggi. Don’t use large minibatches, use local SGD. In arXiv:1808.07217v3, 2018.
 [19] J. Martens. Deep learning via hessianfree optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
 [20] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408-2417, 2015.
 [21] J. Martens, J. Ba, and M. Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018.
 [22] H. Mikami, H. Suganuma, P. Uchupala, Y. Tanaka, and Y. Kageyama. ImageNet/ResNet-50 Training in 224 Seconds. 2018.
 [23] C. J. Shallue, J. Lee, J. Antognini, J. SohlDickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. In arXiv:1811.03600v1, 2018.
 [24] S. L. Smith, P.J. Kindermans, and Q. V. Le. Don’t decay the learning rate, increase the batch size. http://arxiv.org/abs/1711.00489, 2017.
 [25] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a nextgeneration open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twentyninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
 [26] T. van Laarhoven. L2 Regularization versus Batch and Weight Normalization. arXiv:1706.05350 [cs, stat], June 2017.
 [27] Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. Scalable trustregion method for deep reinforcement learning using kroneckerfactored approximation. In 31st Conference on Neural Information Processing Systems, 2017.
 [28] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng. Image Classification at Supercomputer Scale. arXiv:1811.06992 [cs, stat], Nov. 2018.
 [29] Y. You, Z. Zhang, C.J. Hsieh, J. Demmel, and K. Keutzer. Imagenet training in minutes. http://arxiv.org/abs/1709.05011, 2017.
 [30] G. Zhang, S. Sun, D. Duvenaud, and R. Grosse. Noisy natural gradient as variational inference. http://arxiv.org/abs/1712.02390, 2017.
 [31] H. Zhang, M. Cisse, Y. N. Dauphin, and D. LopezPaz. Mixup: Beyond Empirical Risk Minimization. arXiv:1710.09412 [cs, stat], Oct. 2017.