Cross-Iteration Batch Normalization
Abstract
A well-known issue of Batch Normalization is its significantly reduced effectiveness in the case of small minibatch sizes. When a minibatch contains few examples, the statistics upon which the normalization is defined cannot be reliably estimated from it during a training iteration. To address this problem, we present Cross-Iteration Batch Normalization (CBN), in which examples from multiple recent iterations are jointly utilized to enhance estimation quality. A challenge of computing statistics over multiple iterations is that the network activations from different iterations are not comparable to each other due to changes in network weights. We thus compensate for the network weight changes via a proposed technique based on Taylor polynomials, so that the statistics can be accurately estimated and batch normalization can be effectively applied. On object detection and image classification with small minibatch sizes, CBN is found to outperform the original batch normalization and a direct calculation of statistics over previous iterations without the proposed compensation technique. Code is available at https://github.com/Howal/CrossiterationBatchNorm.
1 Introduction
Batch Normalization (BN) (Ioffe and Szegedy, 2015) has played a significant role in the success of deep neural networks. It was introduced to address the issue of internal covariate shift, where the distribution of network activations changes during training iterations due to the updates of network parameters. This shift is commonly believed to be disruptive to network training, and BN alleviates this problem through normalization of the network activations by their mean and variance, computed over the examples within the minibatch at each iteration. With this normalization, network training can be performed at much higher learning rates and with less sensitivity to weight initialization.
In BN, it is assumed that the distribution statistics for the examples within each minibatch reflect the statistics over the full training set. While this assumption is generally valid for large batch sizes, it breaks down in the small batch size regime (Peng et al., 2018; Wu and He, 2018; Ioffe, 2017), where noisy statistics computed from small sets of examples can lead to a dramatic drop in performance. This problem hinders the application of BN to memory-consuming tasks such as object detection (Ren et al., 2015; Dai et al., 2017), semantic segmentation (Long et al., 2015; Chen et al., 2017) and action recognition (Wang et al., 2018b), where batch sizes are limited due to memory constraints.
Towards improving estimation of statistics in the small batch size regime, alternative normalizers have been proposed. Several of them, including Layer Normalization (LN) (Ba et al., 2016), Instance Normalization (IN) (Ulyanov et al., 2016), and Group Normalization (GN) (Wu and He, 2018), compute the mean and variance over the channel dimension, independent of batch size. Different channel-wise normalization techniques, however, tend to be suitable for different tasks, depending on the set of channels involved. Although GN is designed for detection tasks, its slow inference speed limits its practical use. On the other hand, synchronized BN (SyncBN) (Peng et al., 2018) yields consistent improvements by processing larger batch sizes across multiple GPUs. These gains in performance come at the cost of additional overhead needed for synchronization across the devices.
A seldom explored direction for estimating better statistics is to compute them over the examples from multiple recent training iterations, instead of from only the current iteration as done in previous techniques. This can substantially enlarge the pool of data from which the mean and variance are obtained. However, there exists an obvious drawback to this approach, in that the activation values from different iterations are not comparable to each other due to the changes in network weights. As shown in Figure 1, directly calculating the statistics over multiple iterations, which we refer to as Naive CBN, results in lower accuracy.
In this paper, we present a method that compensates for the network weight changes among iterations, so that examples from preceding iterations can be effectively used to improve batch normalization. Our method, called Cross-Iteration Batch Normalization (CBN), is motivated by the observation that network weights change gradually, instead of abruptly, between consecutive training iterations, thanks to the iterative nature of Stochastic Gradient Descent (SGD). As a result, the mean and variance of examples from recent iterations can be well approximated for the current network weights via a low-order Taylor polynomial, defined on gradients of the statistics with respect to the network weights. The compensated means and variances from multiple recent iterations are averaged with those of the current iteration to produce better estimates of the statistics.
In the small batch size regime, CBN leads to appreciable performance improvements over the original BN, as exhibited in Figure 1. The superiority of our proposed approach is further demonstrated through more extensive experiments on ImageNet classification and object detection on COCO. These gains are obtained with negligible overhead, as the statistics from previous iterations have already been computed and Taylor polynomials are simple to evaluate. With this work, it is shown that cues for batch normalization can successfully be extracted along the time dimension, opening a new direction for investigation.
2 Related Work
The importance of normalization in training neural networks has been recognized for decades (LeCun et al., 1998). In general, normalization can be performed on three components: input data, hidden activations, and network parameters. Among them, input data normalization is used most commonly because of its simplicity and effectiveness (Sola and Sevilla, 1997; LeCun et al., 1998).
After the introduction of Batch Normalization (Ioffe and Szegedy, 2015), the normalization of activations has become nearly as prevalent. By normalizing hidden activations by their statistics within each minibatch, BN effectively alleviates the vanishing gradient problem and significantly speeds up the training of deep networks. To mitigate the minibatch size dependency of BN, a number of variants have been proposed, including Layer Normalization (LN) (Ba et al., 2016), Instance Normalization (IN) (Ulyanov et al., 2016), Group Normalization (GN) (Wu and He, 2018), and Batch Instance Normalization (BIN) (Nam and Kim, 2018). The motivation of LN is to explore more suitable statistics for sequential models, while IN performs normalization in a manner similar to BN but with statistics only for each instance. GN achieves a balance between IN and LN, by dividing features into multiple groups along the channel dimension and computing the mean and variance within each group for normalization. BIN introduces a learnable method for automatically switching between normalizing and maintaining style information, enjoying the advantages of both BN and IN on style transfer tasks. Cross-GPU Batch Normalization (CGBN or SyncBN) (Peng et al., 2018) extends BN across multiple GPUs for the purpose of increasing the effective batch size. Though providing higher accuracy, it introduces synchronization overhead to the training process. Kalman Normalization (KN) (Wang et al., 2018a) presents a Kalman filtering procedure for estimating the statistics for a network layer from the layer’s observed statistics and the computed statistics of previous layers.
Batch Renormalization (BRN) (Ioffe, 2017) is the first attempt to utilize the statistics of recent iterations for normalization. It does not compensate for the statistics from recent iterations, but rather it downweights the importance of statistics from distant iterations. This downweighting heuristic, however, does not make the resulting statistics “correct”, as the statistics from recent iterations are not of the current network weights. BRN can be deemed as a special version of our Naive CBN baseline (without Taylor polynomial approximation), where distant iterations are downweighted.
Recent works have also investigated the normalization of network parameters. In Weight Normalization (WN) (Salimans and Kingma, 2016), the optimization of network weights is improved through a reparameterization of weight vectors into their length and direction. Weight Standardization (WS) (Qiao et al., 2019) instead reparameterizes weights based on their first and second moments for the purpose of smoothing the loss landscape of the optimization problem. To combine the advantages of multiple normalization techniques, Switchable Normalization (SN) (Luo et al., 2018) and Sparse Switchable Normalization (SSN) (Shao et al., 2019) make use of differentiable learning to switch among different normalization methods.
The proposed CBN takes an activation normalization approach that aims to mitigate the minibatch dependency of BN. Different from existing techniques, it provides a way to effectively aggregate statistics across multiple training iterations.
3 Method
3.1 Revisiting Batch Normalization
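The original batch normalization (BN) (Ioffe and Szegedy, 2015) whitens the activations of each layer by the statistics computed within a minibatch. Denote $\theta_t$ and $x_{t,i}(\theta_t)$ as the network weights and the feature response of a certain layer for the $i$-th example in the $t$-th minibatch. With these values, BN conducts the following normalization:
$$\hat{x}_{t,i}(\theta_t) = \frac{x_{t,i}(\theta_t) - \mu_t(\theta_t)}{\sqrt{\sigma_t(\theta_t)^2 + \epsilon}}, \tag{1}$$
where $\hat{x}_{t,i}(\theta_t)$ is the whitened activation with zero mean and unit variance, $\epsilon$ is a small constant added for numerical stability, and $\mu_t(\theta_t)$ and $\sigma_t(\theta_t)$ are the mean and variance computed for all the examples from the current minibatch, i.e.,
$$\mu_t(\theta_t) = \frac{1}{m} \sum_{i=1}^{m} x_{t,i}(\theta_t), \tag{2}$$
$$\sigma_t(\theta_t) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \big(x_{t,i}(\theta_t) - \mu_t(\theta_t)\big)^2} = \sqrt{\nu_t(\theta_t) - \mu_t(\theta_t)^2}, \tag{3}$$
where $\nu_t(\theta_t) = \frac{1}{m} \sum_{i=1}^{m} x_{t,i}(\theta_t)^2$, and $m$ denotes the number of examples in the current minibatch. The whitened activation $\hat{x}_{t,i}(\theta_t)$ further undergoes a linear transform with learnable weights, to increase its expressive power:
$$y_{t,i}(\theta_t) = \gamma\, \hat{x}_{t,i}(\theta_t) + \beta, \tag{4}$$
where $\gamma$ and $\beta$ are the learnable parameters (initialized to $\gamma = 1$ and $\beta = 0$ in this work).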
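As a minimal sketch of Eq. (1) to Eq. (4), the following NumPy function normalizes a minibatch of per-channel activations. The function name `batch_norm_forward`, the `(m, C)` input layout, and the value of `eps` are illustrative assumptions rather than details of the original implementation.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (m, C): m examples, C channels (assumed layout)."""
    mu = x.mean(axis=0)                      # Eq. (2): per-channel mean
    nu = (x ** 2).mean(axis=0)               # second moment used in Eq. (3)
    var = np.maximum(nu - mu ** 2, 0.0)      # sigma^2; clipped against rounding error
    x_hat = (x - mu) / np.sqrt(var + eps)    # Eq. (1): whitening
    return gamma * x_hat + beta              # Eq. (4): learnable affine transform

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0`, each output channel has approximately zero mean and unit variance over the minibatch.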
When the batch size is small, the statistics $\mu_t(\theta_t)$ and $\sigma_t(\theta_t)$ become noisy estimates of the training set statistics, thus degrading the effects of batch normalization. In the ImageNet classification task for which the BN module was originally designed, a batch size of 32 is typical. However, for other tasks requiring larger models and/or higher image resolution, such as object detection, semantic segmentation and video recognition, the typical batch size may be as small as 1 or 2 due to GPU memory limitations. The original BN becomes considerably less effective in such cases.
3.2 Leveraging Statistics from Previous Iterations
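To address the issue of BN with small minibatches, a naive approach is to compute the mean and variance over the current and previous iterations. However, the statistics $\mu_{t-\tau}(\theta_{t-\tau})$ and $\nu_{t-\tau}(\theta_{t-\tau})$ of the $(t-\tau)$-th iteration are computed under the network weights $\theta_{t-\tau}$, making them obsolete for the current iteration. As a consequence, directly aggregating statistics from multiple iterations produces inaccurate estimates of the mean and variance, leading to significantly worse performance.
We observe that the network weights change smoothly between consecutive iterations, due to the nature of gradient-based training. This allows us to approximate $\mu_{t-\tau}(\theta_t)$ and $\nu_{t-\tau}(\theta_t)$ from the readily available $\mu_{t-\tau}(\theta_{t-\tau})$ and $\nu_{t-\tau}(\theta_{t-\tau})$ via a Taylor polynomial, i.e.,
$$\mu_{t-\tau}(\theta_t) = \mu_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \mu_{t-\tau}(\theta_{t-\tau})}{\partial \theta_{t-\tau}} (\theta_t - \theta_{t-\tau}) + O(\|\theta_t - \theta_{t-\tau}\|^2), \tag{5}$$
$$\nu_{t-\tau}(\theta_t) = \nu_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \nu_{t-\tau}(\theta_{t-\tau})}{\partial \theta_{t-\tau}} (\theta_t - \theta_{t-\tau}) + O(\|\theta_t - \theta_{t-\tau}\|^2), \tag{6}$$
where $\partial \mu_{t-\tau}(\theta_{t-\tau}) / \partial \theta_{t-\tau}$ and $\partial \nu_{t-\tau}(\theta_{t-\tau}) / \partial \theta_{t-\tau}$ are gradients of the statistics with respect to the network weights, and $O(\|\theta_t - \theta_{t-\tau}\|^2)$ denotes higher-order terms of the Taylor polynomial, which can be omitted since the first-order term dominates when $(\theta_t - \theta_{t-\tau})$ is small.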
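In Eq. (5) and Eq. (6), the gradients $\partial \mu_{t-\tau}(\theta_{t-\tau}) / \partial \theta_{t-\tau}$ and $\partial \nu_{t-\tau}(\theta_{t-\tau}) / \partial \theta_{t-\tau}$ cannot be precisely determined at a negligible cost because the statistics $\mu^l_{t-\tau}(\theta_{t-\tau})$ and $\nu^l_{t-\tau}(\theta_{t-\tau})$ for a node at the $l$-th network layer depend on all the network weights prior to the $l$-th layer, i.e., $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^r_{t-\tau} \neq 0$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^r_{t-\tau} \neq 0$ for $r \le l$, where $\theta^r_{t-\tau}$ denotes the network weights at the $r$-th layer. Only when $r = l$ can these gradients be derived in closed form efficiently.
Empirically, we find that as the layer index $r$ decreases ($r \le l$), the partial gradients $\partial \mu^l_t(\theta_t) / \partial \theta^r_t$ and $\partial \nu^l_t(\theta_t) / \partial \theta^r_t$ rapidly diminish. These reduced effects of network weight changes at earlier layers on the activation distributions in later layers may perhaps be explained by the reduced internal covariate shift of BN. Motivated by this phenomenon, which is studied in Section 4.4, we propose to truncate these partial gradients at layer $l$.
Thus, we further approximate Eq. (5) and Eq. (6) by
$$\mu^l_{t-\tau}(\theta_t) \approx \mu^l_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \mu^l_{t-\tau}(\theta_{t-\tau})}{\partial \theta^l_{t-\tau}} (\theta^l_t - \theta^l_{t-\tau}), \tag{7}$$
$$\nu^l_{t-\tau}(\theta_t) \approx \nu^l_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \nu^l_{t-\tau}(\theta_{t-\tau})}{\partial \theta^l_{t-\tau}} (\theta^l_t - \theta^l_{t-\tau}). \tag{8}$$
A naive implementation of $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ involves computational overhead of $O(C_{out} \times C_{out} \times C_{in} \times K)$, where $C_{out}$ and $C_{in}$ denote the channel dimension of the $l$-th layer and the $(l-1)$-th layer, respectively, and $K$ denotes the kernel size of $\theta^l_{t-\tau}$. Here, we find that the operation can be implemented efficiently in $O(C_{out} \times C_{in} \times K)$, thanks to the averaging over feature responses of $\mu$ and $\nu$. See Appendix for the details.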
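To make the one-layer compensation concrete, the sketch below applies a first-order Taylor correction to the mean and second moment of a stored batch, for the simplified case of a single linear layer `x = y @ theta.T` (an assumption made for brevity; the paper targets convolutional layers). All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, c_in, c_out = 4, 8, 3
y_prev = rng.normal(size=(m, c_in))             # layer input seen at a previous iteration
theta_prev = rng.normal(size=(c_out, c_in))     # layer weights at that iteration
theta_now = theta_prev + 0.01 * rng.normal(size=(c_out, c_in))  # weights after a small update

def moments(y, theta):
    """Per-channel mean and second moment of the layer output."""
    x = y @ theta.T
    return x.mean(axis=0), (x ** 2).mean(axis=0)

x_prev = y_prev @ theta_prev.T
mu_prev, nu_prev = moments(y_prev, theta_prev)

# Gradients of channel j's statistics w.r.t. its own weight row theta[j].
dmu = y_prev.mean(axis=0)                                            # (c_in,), shared by all channels
dnu = 2.0 * (x_prev[:, :, None] * y_prev[:, None, :]).mean(axis=0)   # (c_out, c_in)

# First-order compensation toward the current weights.
delta = theta_now - theta_prev
mu_comp = mu_prev + delta @ dmu
nu_comp = nu_prev + (dnu * delta).sum(axis=1)

# Reference: statistics of the old batch recomputed under the current weights.
mu_true, nu_true = moments(y_prev, theta_now)
```

In this linear setting the mean is linear in the weights, so `mu_comp` matches `mu_true` up to rounding, while `nu_comp` differs from `nu_true` only by a term that is second order in `delta`.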
3.3 CrossIteration Batch Normalization
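After compensating for network weight changes, we aggregate the statistics of the $k-1$ most recent iterations with those of the current iteration $t$ to obtain the statistics used in CBN:
$$\bar{\mu}^l_{t,k}(\theta_t) = \frac{1}{k} \sum_{\tau=0}^{k-1} \mu^l_{t-\tau}(\theta_t), \tag{9}$$
$$\bar{\nu}^l_{t,k}(\theta_t) = \frac{1}{k} \sum_{\tau=0}^{k-1} \max\!\big[\nu^l_{t-\tau}(\theta_t),\ \mu^l_{t-\tau}(\theta_t)^2\big], \tag{10}$$
$$\bar{\sigma}^l_{t,k}(\theta_t) = \sqrt{\bar{\nu}^l_{t,k}(\theta_t) - \bar{\mu}^l_{t,k}(\theta_t)^2}, \tag{11}$$
where $\mu^l_{t-\tau}(\theta_t)$ and $\nu^l_{t-\tau}(\theta_t)$ are computed from Eq. (7) and Eq. (8). In Eq. (10), $\bar{\nu}^l_{t,k}(\theta_t)$ is determined from the maximum of $\nu^l_{t-\tau}(\theta_t)$ and $\mu^l_{t-\tau}(\theta_t)^2$ in each iteration because $\nu^l_{t-\tau}(\theta_t) \ge \mu^l_{t-\tau}(\theta_t)^2$ should hold for valid statistics but may be violated by Taylor polynomial approximations in Eq. (7) and Eq. (8). Finally, $\bar{\mu}^l_{t,k}(\theta_t)$ and $\bar{\sigma}^l_{t,k}(\theta_t)$ are applied to normalize the corresponding feature responses $\{x^l_{t,i}(\theta_t)\}_{i=1}^{m}$ at the current iteration:
$$\hat{x}^l_{t,i}(\theta_t) = \frac{x^l_{t,i}(\theta_t) - \bar{\mu}^l_{t,k}(\theta_t)}{\sqrt{\bar{\sigma}^l_{t,k}(\theta_t)^2 + \epsilon}}. \tag{12}$$
With CBN, the effective number of examples used to compute the statistics for the current iteration is $k$ times as large as that for the original BN. In training, the loss gradients are backpropagated to the network weights and activations at the current iteration, i.e., $\theta^l_t$ and $x^l_{t,i}(\theta_t)$. Those of the previous iterations are fixed and do not receive gradients. Hence, the computation cost of CBN in backpropagation is the same as that of BN.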
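The aggregation in Eq. (9) to Eq. (12) can be sketched as follows, assuming the compensated per-iteration statistics are already stacked into arrays of shape `(k, C)`. The function names and the extra clipping of the variance at zero (a guard against floating-point rounding) are assumptions of this sketch.

```python
import numpy as np

def cbn_aggregate(mus, nus):
    """Combine k compensated means and second moments, each of shape (k, C)."""
    mus = np.asarray(mus, dtype=float)
    nus = np.asarray(nus, dtype=float)
    mu_bar = mus.mean(axis=0)                              # Eq. (9)
    nu_bar = np.maximum(nus, mus ** 2).mean(axis=0)        # Eq. (10): enforce nu >= mu^2 per iteration
    var_bar = np.maximum(nu_bar - mu_bar ** 2, 0.0)        # non-negative in exact arithmetic; clip rounding
    return mu_bar, np.sqrt(var_bar)                        # Eq. (11)

def cbn_normalize(x, mu_bar, sigma_bar, eps=1e-5):
    """Whiten current-iteration activations x of shape (m, C), as in Eq. (12)."""
    return (x - mu_bar) / np.sqrt(sigma_bar ** 2 + eps)

# Two iterations, two channels; the second channel of the first iteration
# has nu < mu^2, which the max in Eq. (10) repairs.
mus = [[0.0, 1.0], [0.2, 1.0]]
nus = [[1.0, 0.5], [1.04, 2.0]]
mu_bar, sigma_bar = cbn_aggregate(mus, nus)
x_hat = cbn_normalize(np.array([[0.1, 1.0]]), mu_bar, sigma_bar)
```

Because the per-iteration maximum guarantees each term is at least $\mu^2$, the averaged second moment is never smaller than the squared averaged mean, so the square root in Eq. (11) is well defined.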


Table 1: Comparison of feature normalization methods.

| Method | Batch size per iter | #examples for statistics | Norm axis |
|---|---|---|---|
| IN | #bs/GPU * #GPU | 1 | (spatial) |
| LN | #bs/GPU * #GPU | 1 | (channel, spatial) |
| GN | #bs/GPU * #GPU | 1 | (channel group, spatial) |
| BN | #bs/GPU * #GPU | #bs/GPU | (batch, spatial) |
| syncBN | #bs/GPU * #GPU | #bs/GPU * #GPU | (batch, spatial, GPU) |
| CBN | #bs/GPU * #GPU | #bs/GPU * temporal window | (batch, spatial, iteration) |

Replacing the BN modules in a network by CBN leads to only minor increases in computational overhead and memory footprint. For computation, the additional overhead mainly comes from computing the partial derivatives $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$, which is insignificant in relation to the overhead of the whole network. For memory, the module requires access to the statistics ($\mu^l_{t-\tau}(\theta_{t-\tau})$ and $\nu^l_{t-\tau}(\theta_{t-\tau})$) and the gradients ($\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$) computed for the $k-1$ most recent iterations, which is also minor compared to the rest of the memory consumed in processing the input examples. The additional computation and memory of CBN is reported for our experiments in Table 6.
A key hyperparameter in the proposed CBN is the temporal window size, $k$, of recent iterations used for statistics estimation. A broader window enlarges the set of examples, but the example quality becomes increasingly lower for more distant iterations, since the differences in network parameters $\theta_t$ and $\theta_{t-\tau}$ become more significant and are compensated less well using a low-order Taylor polynomial. Empirically, we found that CBN is effective with a window size up to $k = 8$ in a variety of settings and tasks. The only trick is that the window size should be kept small at the beginning of training, when the network weights change quickly. Thus, we introduce a burn-in period of length $T_{\text{burn-in}}$ for the window size, where $k = 1$ and CBN degenerates to the original BN. In our experiments, the burn-in period is set to 25 epochs on ImageNet image classification and 3 epochs on COCO object detection by default. Ablations on this parameter are presented in the Appendix.
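The burn-in behaviour described above amounts to a simple schedule for the window size. The helper below is a hypothetical sketch: the name `window_size` and the choice to count the burn-in in iterations (epochs times iterations per epoch) are assumptions, not part of the released code.

```python
def window_size(iteration, burn_in_iters, k):
    """Temporal window size to use at a given training iteration.

    During the burn-in period only the current batch is used (window of 1,
    i.e. plain BN); afterwards the full window k is used.
    """
    if k < 1:
        raise ValueError("window size k must be at least 1")
    return 1 if iteration < burn_in_iters else k

# Example: a 3-epoch burn-in with a hypothetical 1000 iterations per epoch.
burn_in_iters = 3 * 1000
early = window_size(10, burn_in_iters, k=4)
late = window_size(5000, burn_in_iters, k=4)
```

Here `early` is 1 and `late` is 4, so the module behaves as the original BN until the weights have settled.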
Table 1 compares CBN with other feature normalization methods. The key difference among these approaches is the axis along which the statistics are counted and the features are normalized. The previous techniques are all designed to exploit examples from the same iteration. By contrast, CBN explores the aggregation of examples along the temporal dimension. As the data utilized by CBN lies in a direction orthogonal to that of previous methods, the proposed CBN could potentially be combined with other feature normalization approaches to further enhance statistics estimation in certain challenging applications.
4 Experiments
4.1 Image Classification on ImageNet
Experimental settings. ImageNet (Russakovsky et al., 2015) is a benchmark dataset for image classification, containing 1.28M training images and 50K validation images from 1000 classes. We follow the standard setting in (He et al., 2015) to train deep networks on the training set and report the single-crop top-1 accuracy on the validation set. Our preprocessing and augmentation strategy strictly follows the GN baseline (Wu and He, 2018). We use a weight decay of 0.0001 for all weight layers, including $\gamma$ and $\beta$. We train standard ResNet-18 for 100 epochs on 4 GPUs, and decrease the learning rate by the cosine decay strategy (He et al., 2019). We perform the experiments for five trials, and report their mean and standard deviation (error bar). ResNet-18 with BN is our base model. To compare with other normalization methods, we directly replace BN with IN, LN, GN, BRN, and our proposed CBN.


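Table 2: Top-1 accuracy of feature normalization methods with ResNet-18 on ImageNet.

|  | IN | LN | GN | CBN | BN |
|---|---|---|---|---|---|
| Top-1 acc | 64.4±0.2 | 67.9±0.2 | 68.9±0.1 | 70.2±0.1 | 70.2±0.1 |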

Comparison of feature normalization methods. In Table 2, we compare the performance of each normalization method with a batch size, 32, sufficient for computing reliable statistics. Under this setting, BN clearly yields the highest top-1 accuracy. Similar to results found in previous works (Wu and He, 2018), the performance of IN and LN is significantly worse than that of BN. GN works well on image classification but falls short of BN by 1.2%. Among all the methods, our CBN is the only one that is able to achieve accuracy comparable to BN, as it converges to the procedure of BN at larger batch sizes.


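Table 3: Top-1 accuracy on ImageNet for different batch sizes per GPU.

| batch size per GPU | 32 | 16 | 8 | 4 | 2 | 1 |
|---|---|---|---|---|---|---|
| BN | 70.2 | 70.2 | 68.4 | 65.1 | 55.9 | - |
| GN | 68.9 | 69.0 | 68.9 | 69.0 | 69.1 | 68.9 |
| BRN | 70.1 | 68.5 | 68.2 | 67.9 | 60.3 | - |
| CBN | 70.2 | 70.2 | 70.1 | 70.0 | 69.6 | 69.3 |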

Sensitivity to batch size. We compare the behavior of CBN, original BN (Ioffe and Szegedy, 2015), GN (Wu and He, 2018), and BRN (Ioffe, 2017) at the same number of images per GPU on ImageNet classification. For CBN, the recent iterations are utilized so as to ensure that the number of effective examples is no fewer than 16. For BRN, the settings strictly follow the original paper. We adopt a learning rate of 0.1 for the batch size of 32, and linearly scale the learning rate by $N/32$ for a batch size of $N$.
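The linear learning-rate scaling described above can be written as a one-line rule. This is a sketch under the stated setting (a base rate of 0.1 at a batch size of 32); the function name `scaled_lr` and its parameter names are hypothetical.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=32):
    """Linearly scale the learning rate with the batch size."""
    return base_lr * batch_size / base_batch_size

# Learning rates for the batch sizes compared in this experiment.
lrs = {n: scaled_lr(n) for n in (32, 16, 8, 4, 2, 1)}
```

For example, a batch size of 4 gives one eighth of the base rate.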
The results are shown in Table 3. For the original BN, its accuracy drops noticeably as the number of images per GPU is reduced from 32 to 2. BRN suffers a significant performance drop as well. GN maintains its accuracy by utilizing the channel dimension but not the batch dimension. For CBN, its accuracy holds by exploiting the examples of recent iterations. Also, CBN outperforms GN by 0.9% on average top-1 accuracy across the different batch sizes. This is reasonable, because the statistics computation of CBN introduces uncertainty caused by the stochastic batch sampling, as in BN, but this uncertainty is missing in GN, which results in some loss of regularization ability. For the extreme case in which the number of images per GPU is 1, BN and BRN fail to produce results, while CBN outperforms GN by 0.4% on top-1 accuracy.
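The average gap quoted above can be checked directly from the GN and CBN rows of Table 3. The snippet below only repeats that arithmetic; the list names are illustrative.

```python
# Top-1 accuracy per batch size per GPU (32, 16, 8, 4, 2, 1), copied from Table 3.
gn = [68.9, 69.0, 68.9, 69.0, 69.1, 68.9]
cbn = [70.2, 70.2, 70.1, 70.0, 69.6, 69.3]

gaps = [c - g for c, g in zip(cbn, gn)]
mean_gap = sum(gaps) / len(gaps)        # average advantage of CBN over GN
gap_at_bs1 = gaps[-1]                   # advantage at 1 image per GPU
```

Rounded to one decimal place, `mean_gap` is 0.9 and `gap_at_bs1` is 0.4, matching the figures in the text.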
4.2 Object Detection and Instance Segmentation on COCO
Experimental settings. COCO (Lin et al., 2014) is chosen as the benchmark for object detection and instance segmentation. Models are trained on the COCO 2017 train split with 118k images, and evaluated on the COCO 2017 validation split with 5k images. Following the standard protocol in (Lin et al., 2014), the object detection and instance segmentation accuracies are measured by the mean average precision (mAP) scores at different intersection-over-union (IoU) overlaps at the box and the mask levels, respectively.
Following (Wu and He, 2018), Faster R-CNN (Ren et al., 2015) and Mask R-CNN (He et al., 2017) with FPN (Lin et al., 2017) are chosen as the baselines for object detection and instance segmentation, respectively. For both, the 2fc box head is replaced by a 4conv1fc head for better use of the normalization mechanism (Wu and He, 2018). The backbone networks are ImageNet pretrained ResNet-50 (default) or ResNet-101, with specific normalization. Fine-tuning is performed on the COCO train set for 12 epochs on 4 GPUs by SGD, where each GPU processes 4 images (default). Note that the mean and variance statistics in CBN are computed within each GPU. The initial learning rate is set according to the batch size per GPU, and is decayed by a factor of 10 at the 9th and the 11th epochs. The weight decay and momentum parameters are set to 0.0001 and 0.9, respectively. We use the average over 5 trials for all results. As the standard deviations of all methods are less than 0.1 on COCO, they are omitted here.
As done in (Wu and He, 2018), we experiment with two settings: the normalizers are activated only at the task-specific heads with frozen BN at the backbone (default), or the normalizers are activated at all the layers except for the early conv1 and conv2 stages in ResNet.


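Table 4: Results of feature normalization methods on Faster R-CNN with FPN and ResNet-50 on COCO.

| backbone | box head | $AP^{bbox}$ | $AP^{bbox}_{50}$ | $AP^{bbox}_{75}$ | $AP^{bbox}_{S}$ | $AP^{bbox}_{M}$ | $AP^{bbox}_{L}$ |
|---|---|---|---|---|---|---|---|
| fixed BN | - | 36.9 | 58.2 | 39.9 | 21.2 | 40.8 | 46.9 |
| fixed BN | BN | 36.3 | 57.3 | 39.2 | 20.8 | 39.7 | 47.3 |
| fixed BN | syncBN | 37.7 | 58.5 | 41.1 | 22.0 | 40.9 | 49.0 |
| fixed BN | GN | 37.8 | 59.0 | 40.8 | 22.3 | 41.2 | 48.4 |
| fixed BN | CBN | 37.7 | 59.0 | 40.7 | 22.1 | 40.9 | 48.8 |
| BN | BN | 35.5 | 56.4 | 38.7 | 19.7 | 38.8 | 47.3 |
| syncBN | syncBN | 37.9 | 58.5 | 41.1 | 21.7 | 41.5 | 49.7 |
| GN | GN | 37.8 | 59.1 | 40.9 | 22.4 | 41.2 | 49.0 |
| CBN | CBN | 37.3 | 57.7 | 39.3 | 21.9 | 40.8 | 48.2 |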



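Table 5: Results with stronger backbones on COCO object detection and instance segmentation.

| backbone | method | norm | $AP^{bbox}$ | $AP^{bbox}_{50}$ | $AP^{bbox}_{75}$ | $AP^{bbox}_{S}$ | $AP^{bbox}_{M}$ | $AP^{bbox}_{L}$ |
|---|---|---|---|---|---|---|---|---|
| R50+FPN | Faster R-CNN | GN | 37.8 | 59.0 | 40.8 | 22.3 | 41.2 | 48.4 |
| R50+FPN | Faster R-CNN | syncBN | 37.7 | 58.5 | 41.1 | 22.0 | 40.9 | 49.0 |
| R50+FPN | Faster R-CNN | CBN | 37.7 | 59.0 | 40.7 | 22.1 | 40.9 | 48.8 |
| R101+FPN | Faster R-CNN | GN | 39.3 | 60.6 | 42.7 | 22.5 | 42.5 | 51.3 |
| R101+FPN | Faster R-CNN | syncBN | 39.2 | 59.8 | 43.0 | 22.2 | 42.9 | 51.6 |
| R101+FPN | Faster R-CNN | CBN | 39.2 | 60.0 | 42.6 | 22.3 | 42.6 | 51.1 |

| backbone | method | norm | $AP^{bbox}$ | $AP^{bbox}_{50}$ | $AP^{bbox}_{75}$ | $AP^{mask}$ | $AP^{mask}_{50}$ | $AP^{mask}_{75}$ |
|---|---|---|---|---|---|---|---|---|
| R50+FPN | Mask R-CNN | GN | 38.6 | 59.8 | 41.9 | 35.0 | 56.7 | 37.3 |
| R50+FPN | Mask R-CNN | syncBN | 38.5 | 58.9 | 42.3 | 34.7 | 56.3 | 36.8 |
| R50+FPN | Mask R-CNN | CBN | 38.5 | 59.2 | 42.1 | 34.6 | 56.4 | 36.6 |
| R101+FPN | Mask R-CNN | GN | 40.3 | 61.2 | 44.2 | 36.6 | 58.5 | 39.2 |
| R101+FPN | Mask R-CNN | syncBN | 40.3 | 60.8 | 44.2 | 36.0 | 57.7 | 38.6 |
| R101+FPN | Mask R-CNN | CBN | 40.1 | 60.5 | 44.1 | 35.8 | 57.3 | 38.5 |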

Normalizers at backbone and task-specific heads. We further study the effect of different normalizers on the backbone network and task-specific heads for object detection on COCO. CBN, original BN, syncBN, and GN are included in the comparison.
Table 4 presents the results. When BN is frozen in the backbone and no normalizer is applied at the head, the AP score is 36.9%. When the original BN is applied at the head only and at both the backbone and the head, the accuracy drops to 36.3% and 35.5%, respectively. For CBN, the accuracy is 37.7% and 37.3% at these two settings, respectively. Without any synchronization across GPUs, CBN can achieve performance on par with syncBN and GN, showing the superiority of the proposed approach. Unfortunately, due to the accumulation of approximation error, there is a 0.4% decrease in AP when replacing frozen BN with CBN in the backbone. Even so, CBN still outperforms the variant with unfrozen BN in the backbone by 1.8%.
Instance segmentation and stronger backbones. Results of object detection (Faster R-CNN (Ren et al., 2015)) and instance segmentation (Mask R-CNN (He et al., 2017)) with ResNet-50 and ResNet-101 are presented in Table 5. We can observe that our proposed CBN achieves performance comparable to syncBN and GN with R50 and R101 as the backbone on both Faster R-CNN and Mask R-CNN, which demonstrates that CBN is robust and applicable to various deep models and tasks.
4.3 Ablation Study
Effect of temporal window size $k$. We conduct this ablation on ImageNet image classification and COCO object detection, with each GPU processing 4 images. Figure 3 presents the results. When $k = 1$, only the batch from the current iteration is utilized; therefore, CBN degenerates to the original BN. The accuracy suffers due to the noisy statistics on small batch sizes. As the window size $k$ increases, more examples from recent iterations are utilized for statistics estimation, leading to greater accuracy. Accuracy saturates at $k = 8$ and even drops slightly. For more distant iterations, the network weights differ more substantially and the Taylor polynomial approximation becomes less accurate.
On the other hand, it is empirically observed that the original BN saturates at a batch size of 16 or 32 for numerous applications (Peng et al., 2018; Wu and He, 2018), indicating that the computed statistics become accurate. Thus, a temporal window size of $k = \min(\lceil 16 / \text{bs per GPU} \rceil, 8)$ is suggested.
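A window-size rule of this kind, which targets roughly 16 effective examples, can be sketched as below. The function name `suggested_window` and its default values (a target of 16 examples and a cap of 8 iterations) are illustrative assumptions rather than settings taken from the released code.

```python
import math

def suggested_window(batch_size_per_gpu, target_examples=16, max_window=8):
    """Smallest window reaching the target example count, capped at max_window."""
    if batch_size_per_gpu < 1:
        raise ValueError("batch size per GPU must be at least 1")
    return min(math.ceil(target_examples / batch_size_per_gpu), max_window)

# Window sizes for a few batch sizes per GPU.
windows = {bs: suggested_window(bs) for bs in (32, 16, 8, 4, 2, 1)}
```

With these defaults, 4 images per GPU gives a window of 4, and batch sizes of 16 or more fall back to a window of 1, which is plain BN.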
Effect of compensation. To study this, we compare CBN to 1) Naive CBN, where statistics from recent iterations are directly aggregated without compensation via Taylor polynomial; and 2) the original BN applied with the same effective example number as CBN (i.e., its batch size per GPU is set to the product of the batch size per GPU and the temporal window size of CBN), which does not require any compensation and serves as an upper performance bound.
The experimental results are also presented in Figure 3. CBN clearly surpasses Naive CBN when the previous iterations are included. Actually, Naive CBN fails when the temporal window size grows to $k = 8$ as shown in Figure 3, demonstrating the necessity of compensating for changing network weights over iterations. Compared with the original BN upper bound, CBN achieves similar accuracy at the same effective example number. This result indicates that the compensation using a low-order Taylor polynomial by CBN is effective.
Figure 4 presents the train and test curves of CBN, Naive CBN, BN-bs4, and BN-bs16 on ImageNet, with 4 images per GPU and a temporal window size of 4 for CBN, Naive CBN, and BN-bs4, and 16 images per GPU for BN-bs16. The train curve of CBN is close to BN-bs4 at the beginning, and approaches BN-bs16 at the end. The reason is that we adopt a burn-in period to avoid the disadvantage of rapid statistics change at the beginning of training. The gap between the train curve of Naive CBN and CBN shows that Naive CBN cannot even converge well on the training set. The test curve of CBN is close to BN-bs16 at the end, while Naive CBN exhibits considerable jitter. All these phenomena indicate the effectiveness of our proposed Taylor polynomial compensation.
Effect of burn-in period length $T_{\text{burn-in}}$. We study the effect of varying the burn-in period length $T_{\text{burn-in}}$, at 4 images per GPU on both ImageNet image classification (ResNet-18) and COCO object detection (Faster R-CNN with FPN and ResNet-50). Figure 5(a) and 5(b) present the results. When the burn-in period is too short, the accuracy suffers. This is because at the beginning of training, the network weights change rapidly, causing the compensation across iterations to be less effective. Nevertheless, the accuracy is stable for a wide range of burn-in periods $T_{\text{burn-in}}$.







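Table 6: Memory footprint, computational overhead (FLOPs), and training and inference speed of BN, GN and CBN.

| Method | Memory | FLOPs | Training speed | Inference speed |
|---|---|---|---|---|
| BN | 14.1 | 5155.1 | 1.3 | 6.2 |
| GN | 14.1 | 5274.2 | 1.2 | 3.7 |
| CBN | 15.1 | 5809.7 | 1.0 | 6.2 |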

4.4 Analysis
Computational overhead, memory footprint, and training and inference speed. We examine the computational overhead, memory footprint, and the training and inference speed of BN, GN and CBN in a practical COCO object detection task using R50-Mask R-CNN, shown in Table 6. The batch size per GPU is 4 and the window size of CBN is set to 4.
Compared to BN and GN, CBN consumes about 7% extra memory and 11% more computational overhead. The extra memory mainly contains the statistics ($\mu$ and $\nu$), their respective gradients, and the network parameters ($\theta_{t-1} \cdots \theta_{t-(k-1)}$) of previous iterations, while the computational overhead comes from calculations of the statistics’ respective gradients, Taylor compensations, and averaging operations.
The overall training speed of CBN is close to both BN and GN. It is worth noting that the inference speed of CBN is equal to BN, which is much faster than GN. The inference stage of CBN is the same as that of BN, where the calculation of statistics includes prerecorded statistics.
From these results, the additional overhead of CBN is seen to be minor.
Using more than one layer for compensation. We also study the influence of applying compensation over more than one layer. CBN using two layers for compensation achieves 70.1 on ImageNet (batch size per GPU=4, k=4), which is comparable to CBN using only one layer. However, the efficient implementation can no longer be used when more than one layer of compensation is employed. As using more layers does not further improve performance but consumes more FLOPs, we adopt one-layer compensation for CBN in practice.
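On the gradients from different layers. The key assumption in Eq. (7) and Eq. (8) is that for a node at the $l$-th layer, the gradient of its statistics with respect to the network weights at the $l$-th layer is larger than that of weights from the prior layers, i.e.,
$$\|g_\mu(l \mid l, t, \tau)\|_F \gg \|g_\mu(r \mid l, t, \tau)\|_F, \qquad \|g_\nu(l \mid l, t, \tau)\|_F \gg \|g_\nu(r \mid l, t, \tau)\|_F, \qquad r < l,$$
where $g_\mu(r \mid l, t, \tau)$ denotes $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^r_{t-\tau}$, $g_\nu(r \mid l, t, \tau)$ denotes $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^r_{t-\tau}$, and $\|\cdot\|_F$ denotes the Frobenius norm.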
Here, we examine this assumption empirically for networks trained on ImageNet image recognition. Both $\|g_\mu(r)\|_F / \|g_\mu(l)\|_F$ and $\|g_\nu(r)\|_F / \|g_\nu(l)\|_F$ for $r \in \{l-1, l-2\}$ are averaged over all CBN layers of the network at different training epochs (Figure 6). The results suggest that the key assumption holds well, thus validating the approximation in Eq. (7) and Eq. (8).
We also study the gradients of non-ResNet models. The ratios of $\|g_\mu(l-1)\|_F / \|g_\mu(l)\|_F$ and $\|g_\nu(l-1)\|_F / \|g_\nu(l)\|_F$ are (0.20 and 0.41) for VGG-16 and (0.15 and 0.37) for Inception-V3, which is similar to ResNet (0.12 and 0.39), indicating that the assumption should also hold for the VGG and Inception series.
5 Conclusion
In the small batch size regime, batch normalization is widely known to drop dramatically in performance. To address this issue, we propose to enhance the quality of statistics via utilizing examples from multiple recent iterations. As the network activations from different iterations are not comparable to each other due to changes in network weights, we compensate for the network weight changes based on Taylor polynomials, so that the statistics can be accurately estimated. In the experiments, the proposed approach is found to outperform original batch normalization and a direct calculation of statistics over previous iterations without compensation. Moreover, it achieves performance on par with SyncBN, which can be regarded as the upper bound, on both ImageNet classification and COCO object detection.
Acknowledgement We sincerely thank Jifeng Dai for his efforts, discussion and comments about this work. Jifeng Dai was involved in the early discussions of the work. He gave up the authorship after he joined another company.
Appendix A Algorithm Outline
Algorithm 1 presents an outline of our proposed Cross-Iteration Batch Normalization (CBN).
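As a rough, self-contained illustration of the steps described in Section 3, the sketch below runs the CBN forward computation for the output of a single linear layer `x = y @ theta.T`. The class name `CBNLinearSketch` and its layout are assumptions, and it deliberately omits the learnable $\gamma$ and $\beta$, backpropagation, the burn-in schedule, and inference-time statistics; it is not the released implementation.

```python
from collections import deque
import numpy as np

class CBNLinearSketch:
    """Cross-iteration statistics for a linear layer (illustrative only)."""

    def __init__(self, window, eps=1e-5):
        # Keep statistics, gradients and weights of the window - 1 latest iterations.
        self.history = deque(maxlen=window - 1)
        self.eps = eps

    def forward(self, y, theta):
        x = y @ theta.T                                           # (m, C_out)
        mu, nu = x.mean(axis=0), (x ** 2).mean(axis=0)            # current statistics
        dmu = y.mean(axis=0)                                      # d mu_j / d theta_j, (C_in,)
        dnu = 2.0 * (x[:, :, None] * y[:, None, :]).mean(axis=0)  # d nu_j / d theta_j, (C_out, C_in)

        mus, nus = [mu], [nu]
        for mu_p, nu_p, dmu_p, dnu_p, theta_p in self.history:
            delta = theta - theta_p                               # weight change since that iteration
            mus.append(mu_p + delta @ dmu_p)                      # Eq. (7)
            nus.append(nu_p + (dnu_p * delta).sum(axis=1))        # Eq. (8)
        mus, nus = np.stack(mus), np.stack(nus)

        mu_bar = mus.mean(axis=0)                                 # Eq. (9)
        nu_bar = np.maximum(nus, mus ** 2).mean(axis=0)           # Eq. (10)
        var_bar = np.maximum(nu_bar - mu_bar ** 2, 0.0)           # Eq. (11), clipped for rounding

        self.history.appendleft((mu, nu, dmu, dnu, theta.copy()))
        return (x - mu_bar) / np.sqrt(var_bar + self.eps)         # Eq. (12)

rng = np.random.default_rng(0)
layer = CBNLinearSketch(window=4)
theta = rng.normal(size=(3, 8))
for step in range(6):
    y = rng.normal(size=(2, 8))                                   # 2 examples per iteration
    out = layer.forward(y, theta)
    theta = theta - 0.01 * rng.normal(size=theta.shape)           # stand-in for an SGD update
```

With `window=1` the history stays empty and the forward pass reduces to ordinary batch normalization of the current batch.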
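Appendix B Efficient Implementation of $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$
Let $C_{out}$ and $C_{in}$ denote the channel dimension of the $l$-th layer and the $(l-1)$-th layer, respectively, and let $K$ denote the kernel size of $\theta^l_{t-\tau}$. $\mu^l_{t-\tau}$ and $\nu^l_{t-\tau}$ are thus of $C_{out}$ dimensions in channels, and $\theta^l_{t-\tau}$ is a $C_{out} \times C_{in} \times K$ dimensional tensor. A naive implementation of $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ involves computational overhead of $O(C_{out} \times C_{out} \times C_{in} \times K)$. Here we find that the operations of $\mu$ and $\nu$ can be implemented efficiently in $O(C_{in} \times K)$ and $O(C_{out} \times C_{in} \times K)$, respectively, thanks to the averaging of feature responses in $\mu$ and $\nu$.
Here we derive the efficient implementation of $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$. That of $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ is about the same. Let us first simplify the notation. Let $\mu^l$ and $\theta^l$ denote $\mu^l_{t-\tau}(\theta_{t-\tau})$ and $\theta^l_{t-\tau}$ respectively, by removing the irrelevant notations for iterations. The element-wise computation in the forward pass can be computed as
$$\mu^l_j = \frac{1}{m} \sum_{i=1}^{m} x^l_{i,j}, \tag{12}$$
where $\mu^l_j$ denotes the $j$-th channel in $\mu^l$, and $x^l_{i,j}$ denotes the $j$-th channel in the $i$-th example. $x^l_{i,j}$ is computed as
$$x^l_{i,j} = \sum_{n=1}^{C_{in}} \sum_{k=1}^{K} \theta^l_{j,n,k} \cdot y^{l-1}_{i+\mathrm{offset}(k),\,n}, \tag{13}$$
where $n$ and $k$ enumerate the input feature dimension and the convolution kernel index, respectively, $\mathrm{offset}(k)$ denotes the spatial offset in applying the $k$-th kernel, and $y^{l-1}$ is the output of the $(l-1)$-th layer.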
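To see where the saving comes from, one can differentiate the channel mean with respect to a single weight. The short derivation below is a sketch that follows from the two forward-pass expressions above, writing $\mu^l_j$ for the mean of output channel $j$, $\theta^l_{q,p,\eta}$ for the weight linking input channel $p$ to output channel $q$ at kernel index $\eta$, and $y^{l-1}$ for the output of the previous layer; the index names $q$, $p$, $\eta$ are chosen here for illustration.

```latex
\frac{\partial \mu^l_j}{\partial \theta^l_{q,p,\eta}}
  = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial x^l_{i,j}}{\partial \theta^l_{q,p,\eta}}
  = \begin{cases}
      \dfrac{1}{m} \displaystyle\sum_{i=1}^{m} y^{l-1}_{i+\mathrm{offset}(\eta),\,p}, & j = q, \\[2ex]
      0, & \text{otherwise.}
    \end{cases}
```

The non-zero entries do not depend on $j$, so a single $C_{in} \times K$ array of averaged inputs serves every output channel, which is consistent with the $O(C_{in} \times K)$ cost stated for the mean.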
References
Ba et al. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Chen et al. (2017). DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
Dai et al. (2017). Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.
He et al. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
He et al. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
He et al. (2019). Bag of tricks for image classification with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
Ioffe (2017). Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pp. 1945–1953.
LeCun et al. (1998). Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–50.
Lin et al. (2017). Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lin et al. (2014). Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
Long et al. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
Luo et al. (2018). Differentiable learning-to-normalize via switchable normalization. arXiv preprint arXiv:1806.10779.
Nam and Kim (2018). Batch-instance normalization for adaptively style-invariant neural networks. In Advances in Neural Information Processing Systems, pp. 2563–2572.
Peng et al. (2018). MegDet: a large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6181–6189.
Qiao et al. (2019). Weight standardization. arXiv preprint arXiv:1903.10520.
Ren et al. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
Russakovsky et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
Salimans and Kingma (2016). Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909.
Shao et al. (2019). SSN: learning sparse switchable normalization via SparsestMax. arXiv preprint arXiv:1903.03793.
Sola and Sevilla (1997). Importance of input data normalization for the application of neural networks to complex industrial problems. IEEE Transactions on Nuclear Science 44 (3), pp. 1464–1468.
Ulyanov et al. (2016). Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
Wang et al. (2018a). Kalman normalization: normalizing internal representations across network layers. In Advances in Neural Information Processing Systems, pp. 21–31.
Wang et al. (2018b). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.
Wu and He (2018). Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.