Cross-Iteration Batch Normalization

Abstract

A well-known issue of Batch Normalization is its significantly reduced effectiveness in the case of small mini-batch sizes. When a mini-batch contains few examples, the statistics upon which the normalization is defined cannot be reliably estimated from it during a training iteration. To address this problem, we present Cross-Iteration Batch Normalization (CBN), in which examples from multiple recent iterations are jointly utilized to enhance estimation quality. A challenge of computing statistics over multiple iterations is that the network activations from different iterations are not comparable to each other due to changes in network weights. We thus compensate for the network weight changes via a proposed technique based on Taylor polynomials, so that the statistics can be accurately estimated and batch normalization can be effectively applied. On object detection and image classification with small mini-batch sizes, CBN is found to outperform the original batch normalization and a direct calculation of statistics over previous iterations without the proposed compensation technique. Code is available at https://github.com/Howal/Cross-iterationBatchNorm.


1 Introduction

Batch Normalization (BN) (Ioffe and Szegedy, 2015) has played a significant role in the success of deep neural networks. It was introduced to address the issue of internal covariate shift, where the distribution of network activations changes during training iterations due to the updates of network parameters. This shift is commonly believed to be disruptive to network training, and BN alleviates this problem through normalization of the network activations by their mean and variance, computed over the examples within the mini-batch at each iteration. With this normalization, network training can be performed at much higher learning rates and with less sensitivity to weight initialization.

In BN, it is assumed that the distribution statistics for the examples within each mini-batch reflect the statistics over the full training set. While this assumption is generally valid for large batch sizes, it breaks down in the small batch size regime (Peng et al., 2018; Wu and He, 2018; Ioffe, 2017), where noisy statistics computed from small sets of examples can lead to a dramatic drop in performance. This problem hinders the application of BN to memory-consuming tasks such as object detection (Ren et al., 2015; Dai et al., 2017), semantic segmentation (Long et al., 2015; Chen et al., 2017) and action recognition (Wang et al., 2018b), where batch sizes are limited due to memory constraints.

Towards improving estimation of statistics in the small batch size regime, alternative normalizers have been proposed. Several of them, including Layer Normalization (LN) (Ba et al., 2016), Instance Normalization (IN) (Ulyanov et al., 2016), and Group Normalization (GN) (Wu and He, 2018), compute the mean and variance over the channel dimension, independent of batch size. Different channel-wise normalization techniques, however, tend to be suitable for different tasks, depending on the set of channels involved. Although GN is designed for detection tasks, its slow inference speed limits its practical usage. On the other hand, synchronized BN (SyncBN) (Peng et al., 2018) yields consistent improvements by processing larger batch sizes across multiple GPUs. These gains in performance come at the cost of additional overhead needed for synchronization across the devices.

A seldom explored direction for estimating better statistics is to compute them over the examples from multiple recent training iterations, instead of from only the current iteration as done in previous techniques. This can substantially enlarge the pool of data from which the mean and variance are obtained. However, there exists an obvious drawback to this approach, in that the activation values from different iterations are not comparable to each other due to the changes in network weights. As shown in Figure 1, directly calculating the statistics over multiple iterations, which we refer to as Naive CBN, results in lower accuracy.

In this paper, we present a method that compensates for the network weight changes among iterations, so that examples from preceding iterations can be effectively used to improve batch normalization. Our method, called Cross-Iteration Batch Normalization (CBN), is motivated by the observation that network weights change gradually, instead of abruptly, between consecutive training iterations, thanks to the iterative nature of Stochastic Gradient Descent (SGD). As a result, the mean and variance of examples from recent iterations can be well approximated for the current network weights via a low-order Taylor polynomial, defined on gradients of the statistics with respect to the network weights. The compensated means and variances from multiple recent iterations are averaged with those of the current iteration to produce better estimates of the statistics.

In the small batch size regime, CBN leads to appreciable performance improvements over the original BN, as exhibited in Figure 1. The superiority of our proposed approach is further demonstrated through more extensive experiments on ImageNet classification and object detection on COCO. These gains are obtained with negligible overhead, as the statistics from previous iterations have already been computed and Taylor polynomials are simple to evaluate. With this work, it is shown that cues for batch normalization can successfully be extracted along the time dimension, opening a new direction for investigation.

Figure 1: Top-1 classification accuracy vs. batch sizes per iteration. The base model is a ResNet-18 (He et al., 2016) trained on ImageNet (Russakovsky et al., 2015). The accuracy of BN (Ioffe and Szegedy, 2015) drops rapidly when the batch size is reduced. GN (Wu and He, 2018) exhibits stable performance but underperforms BN on adequate batch sizes. CBN compensates for the reduced batch size per GPU by exploiting approximated statistics from recent iterations (temporal window size denotes how many recent iterations are utilized for statistics computation). CBN shows relatively stable performance over different batch sizes. Naive CBN, which directly calculates statistics from recent iterations without compensation, is shown not to work well.

2 Related Work

The importance of normalization in training neural networks has been recognized for decades (LeCun et al., 1998). In general, normalization can be performed on three components: input data, hidden activations, and network parameters. Among them, input data normalization is used most commonly because of its simplicity and effectiveness (Sola and Sevilla, 1997; LeCun et al., 1998).

After the introduction of Batch Normalization (Ioffe and Szegedy, 2015), the normalization of activations has become nearly as prevalent. By normalizing hidden activations by their statistics within each mini-batch, BN effectively alleviates the vanishing gradient problem and significantly speeds up the training of deep networks. To mitigate the mini-batch size dependency of BN, a number of variants have been proposed, including Layer Normalization (LN) (Ba et al., 2016), Instance Normalization (IN) (Ulyanov et al., 2016), Group Normalization (GN) (Wu and He, 2018), and Batch Instance Normalization (BIN) (Nam and Kim, 2018). The motivation of LN is to explore more suitable statistics for sequential models, while IN performs normalization in a manner similar to BN but with statistics only for each instance. GN achieves a balance between IN and LN, by dividing features into multiple groups along the channel dimension and computing the mean and variance within each group for normalization. BIN introduces a learnable method for automatically switching between normalizing and maintaining style information, enjoying the advantages of both BN and IN on style transfer tasks. Cross-GPU Batch Normalization (CGBN or SyncBN) (Peng et al., 2018) extends BN across multiple GPUs for the purpose of increasing the effective batch size. Though providing higher accuracy, it introduces synchronization overhead to the training process. Kalman Normalization (KN) (Wang et al., 2018a) presents a Kalman filtering procedure for estimating the statistics for a network layer from the layer’s observed statistics and the computed statistics of previous layers.

Batch Renormalization (BRN) (Ioffe, 2017) is the first attempt to utilize the statistics of recent iterations for normalization. It does not compensate for the statistics from recent iterations, but rather it down-weights the importance of statistics from distant iterations. This down-weighting heuristic, however, does not make the resulting statistics “correct”, as the statistics from recent iterations are not of the current network weights. BRN can be deemed as a special version of our Naive CBN baseline (without Taylor polynomial approximation), where distant iterations are down-weighted.

Recent work has also investigated the normalization of network parameters. In Weight Normalization (WN) (Salimans and Kingma, 2016), the optimization of network weights is improved through a reparameterization of weight vectors into their length and direction. Weight Standardization (WS) (Qiao et al., 2019) instead reparameterizes weights based on their first and second moments for the purpose of smoothing the loss landscape of the optimization problem. To combine the advantages of multiple normalization techniques, Switchable Normalization (SN) (Luo et al., 2018) and Sparse Switchable Normalization (SSN) (Shao et al., 2019) make use of differentiable learning to switch among different normalization methods.

The proposed CBN takes an activation normalization approach that aims to mitigate the mini-batch dependency of BN. Different from existing techniques, it provides a way to effectively aggregate statistics across multiple training iterations.

3 Method

Figure 2: Illustration of BN and the proposed Cross-Iteration Batch Normalization (CBN).

3.1 Revisiting Batch Normalization

The original batch normalization (BN) (Ioffe and Szegedy, 2015) whitens the activations of each layer by the statistics computed within a mini-batch. Denote $\theta_t$ and $x_{t,i}(\theta_t)$ as the network weights and the feature response of a certain layer for the $i$-th example in the $t$-th mini-batch. With these values, BN conducts the following normalization:

$$\hat{x}_{t,i}(\theta_t) = \frac{x_{t,i}(\theta_t) - \mu_t(\theta_t)}{\sqrt{\sigma_t(\theta_t)^2 + \epsilon}}, \qquad (1)$$

where $\hat{x}_{t,i}(\theta_t)$ is the whitened activation with zero mean and unit variance, $\epsilon$ is a small constant added for numerical stability, and $\mu_t(\theta_t)$ and $\sigma_t(\theta_t)^2$ are the mean and variance computed for all the examples from the current mini-batch, i.e.,

$$\mu_t(\theta_t) = \frac{1}{m}\sum_{i=1}^{m} x_{t,i}(\theta_t), \qquad (2)$$
$$\sigma_t(\theta_t)^2 = \frac{1}{m}\sum_{i=1}^{m} \big(x_{t,i}(\theta_t) - \mu_t(\theta_t)\big)^2 = \nu_t(\theta_t) - \mu_t(\theta_t)^2, \qquad (3)$$

where $\nu_t(\theta_t) = \frac{1}{m}\sum_{i=1}^{m} x_{t,i}(\theta_t)^2$, and $m$ denotes the number of examples in the current mini-batch. The whitened activation $\hat{x}_{t,i}(\theta_t)$ further undergoes a linear transform with learnable weights, to increase its expressive power:

$$y_{t,i}(\theta_t) = \gamma\, \hat{x}_{t,i}(\theta_t) + \beta, \qquad (4)$$

where $\gamma$ and $\beta$ are the learnable parameters (initialized to $\gamma = 1$ and $\beta = 0$ in this work).
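For concreteness, the following is a minimal PyTorch-style sketch of Eq. (1)-(4) for a 4D convolutional feature map; the tensor layout and function name are illustrative, not taken from the released code.

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Whiten a mini-batch with its own statistics (Eq. (1)-(3)),
    then apply the learnable affine transform (Eq. (4)).
    x: (N, C, H, W) feature responses; gamma, beta: (C,) parameters."""
    mu = x.mean(dim=(0, 2, 3), keepdim=True)         # Eq. (2): per-channel mean
    nu = (x * x).mean(dim=(0, 2, 3), keepdim=True)   # per-channel second moment
    var = nu - mu * mu                               # Eq. (3): variance
    x_hat = (x - mu) / torch.sqrt(var + eps)         # Eq. (1): whitening
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)  # Eq. (4)
```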

When the batch size is small, the statistics and become noisy estimates of the training set statistics, thus degrading the effects of batch normalization. In the ImageNet classification task for which the BN module was originally designed, a batch size of 32 is typical. However, for other tasks requiring larger models and/or higher image resolution, such as object detection, semantic segmentation and video recognition, the typical batch size may be as small as 1 or 2 due to GPU memory limitations. The original BN becomes considerably less effective in such cases.

3.2 Leveraging Statistics from Previous Iterations

To address the issue of BN with small mini-batches, a naive approach is to compute the mean and variance over the current and previous iterations. However, the statistics $\mu_{t-\tau}(\theta_{t-\tau})$ and $\nu_{t-\tau}(\theta_{t-\tau})$ of the $(t-\tau)$-th iteration are computed under the network weights $\theta_{t-\tau}$, making them obsolete for the current iteration. As a consequence, directly aggregating statistics from multiple iterations produces inaccurate estimates of the mean and variance, leading to significantly worse performance.

We observe that the network weights change smoothly between consecutive iterations, due to the nature of gradient-based training. This allows us to approximate $\mu_{t-\tau}(\theta_t)$ and $\nu_{t-\tau}(\theta_t)$ from the readily available $\mu_{t-\tau}(\theta_{t-\tau})$ and $\nu_{t-\tau}(\theta_{t-\tau})$ via a Taylor polynomial, i.e.,

$$\mu_{t-\tau}(\theta_t) = \mu_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \mu_{t-\tau}(\theta_{t-\tau})}{\partial \theta_{t-\tau}}\big(\theta_t - \theta_{t-\tau}\big) + \mathcal{O}\big(\|\theta_t - \theta_{t-\tau}\|^2\big), \qquad (5)$$
$$\nu_{t-\tau}(\theta_t) = \nu_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \nu_{t-\tau}(\theta_{t-\tau})}{\partial \theta_{t-\tau}}\big(\theta_t - \theta_{t-\tau}\big) + \mathcal{O}\big(\|\theta_t - \theta_{t-\tau}\|^2\big), \qquad (6)$$

where $\partial \mu_{t-\tau}(\theta_{t-\tau}) / \partial \theta_{t-\tau}$ and $\partial \nu_{t-\tau}(\theta_{t-\tau}) / \partial \theta_{t-\tau}$ are gradients of the statistics with respect to the network weights, and $\mathcal{O}(\|\theta_t - \theta_{t-\tau}\|^2)$ denotes the higher-order terms of the Taylor polynomial, which can be omitted since the first-order term dominates when $\|\theta_t - \theta_{t-\tau}\|$ is small.

In Eq. (5) and Eq. (6), the gradients $\partial \mu_{t-\tau}(\theta_{t-\tau}) / \partial \theta_{t-\tau}$ and $\partial \nu_{t-\tau}(\theta_{t-\tau}) / \partial \theta_{t-\tau}$ cannot be precisely determined at a negligible cost, because the statistics $\mu^l_{t-\tau}$ and $\nu^l_{t-\tau}$ for a node at the $l$-th network layer depend on all the network weights prior to the $l$-th layer, i.e., $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^r_{t-\tau} \neq 0$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^r_{t-\tau} \neq 0$ for $r \leq l$, where $\theta^r_{t-\tau}$ denotes the network weights at the $r$-th layer. Only when $r = l$ can these gradients be derived in closed form efficiently.

Empirically, we find that as the layer index $r$ decreases ($r < l$), the partial gradients $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^r_{t-\tau}$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^r_{t-\tau}$ rapidly diminish. These reduced effects of network weight changes at earlier layers on the activation distributions in later layers may perhaps be explained by the reduced internal covariate shift of BN. Motivated by this phenomenon, which is studied in Section 4.4, we propose to truncate these partial gradients at layer $l$.

Thus, we further approximate Eq. (5) and Eq. (6) by

$$\mu^l_{t-\tau}(\theta_t) \approx \mu^l_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \mu^l_{t-\tau}(\theta_{t-\tau})}{\partial \theta^l_{t-\tau}}\big(\theta^l_t - \theta^l_{t-\tau}\big), \qquad (7)$$
$$\nu^l_{t-\tau}(\theta_t) \approx \nu^l_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \nu^l_{t-\tau}(\theta_{t-\tau})}{\partial \theta^l_{t-\tau}}\big(\theta^l_t - \theta^l_{t-\tau}\big). \qquad (8)$$

A naive implementation of $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ involves computational overhead of $O(C_l \times C_l \times C_{l-1} \times K)$, where $C_l$ and $C_{l-1}$ denote the channel dimension of the $l$-th layer and the $(l-1)$-th layer, respectively, and $K$ denotes the kernel size of $\theta^l$. Here, we find that the operation can be implemented efficiently in $O(C_l \times C_{l-1} \times K)$, thanks to the averaging over feature responses in $\mu^l$ and $\nu^l$. See the Appendix for details.
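As a concrete illustration of Eq. (7) and Eq. (8), the following sketch applies the first-order compensation to statistics saved at iteration $t-\tau$. The tensor shapes and names are assumptions made for illustration, not the reference implementation.

```python
import torch

def compensate_stats(mu_prev, nu_prev, dmu_dtheta, dnu_dtheta,
                     theta_cur, theta_prev):
    """First-order Taylor compensation (Eq. (7)-(8)) of statistics recorded at
    iteration t-tau, so that they approximate their values under the current
    weights. Only the gradient w.r.t. the same layer's weights is used.

    mu_prev, nu_prev:       (C,) statistics recorded at iteration t-tau
    dmu_dtheta, dnu_dtheta: (C, *theta.shape) gradients recorded at t-tau
    theta_cur, theta_prev:  current and recorded weights of the same layer
    """
    delta = (theta_cur - theta_prev).reshape(-1)                          # theta_t^l - theta_{t-tau}^l
    mu_comp = mu_prev + dmu_dtheta.reshape(mu_prev.numel(), -1) @ delta   # Eq. (7)
    nu_comp = nu_prev + dnu_dtheta.reshape(nu_prev.numel(), -1) @ delta   # Eq. (8)
    return mu_comp, nu_comp
```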

3.3 Cross-Iteration Batch Normalization

After compensating for network weight changes, we aggregate the statistics of the $k-1$ most recent iterations with those of the current iteration $t$ to obtain the statistics used in CBN:

$$\bar{\mu}^l_{t,k}(\theta_t) = \frac{1}{k}\sum_{\tau=0}^{k-1} \mu^l_{t-\tau}(\theta_t), \qquad (9)$$
$$\bar{\nu}^l_{t,k}(\theta_t) = \frac{1}{k}\sum_{\tau=0}^{k-1} \max\!\big[\nu^l_{t-\tau}(\theta_t),\ \mu^l_{t-\tau}(\theta_t)^2\big], \qquad (10)$$
$$\bar{\sigma}^l_{t,k}(\theta_t)^2 = \bar{\nu}^l_{t,k}(\theta_t) - \bar{\mu}^l_{t,k}(\theta_t)^2, \qquad (11)$$

where $\mu^l_{t-\tau}(\theta_t)$ and $\nu^l_{t-\tau}(\theta_t)$ are computed from Eq. (7) and Eq. (8). In Eq. (10), $\bar{\nu}^l_{t,k}(\theta_t)$ is determined from the maximum of $\nu^l_{t-\tau}(\theta_t)$ and $\mu^l_{t-\tau}(\theta_t)^2$ in each iteration, because $\nu^l_{t-\tau}(\theta_t) \geq \mu^l_{t-\tau}(\theta_t)^2$ should hold for valid statistics but may be violated by the Taylor polynomial approximations in Eq. (7) and Eq. (8). Finally, $\bar{\mu}^l_{t,k}(\theta_t)$ and $\bar{\sigma}^l_{t,k}(\theta_t)$ are applied to normalize the corresponding feature responses at the current iteration:

$$\hat{x}_{t,i}(\theta_t) = \frac{x_{t,i}(\theta_t) - \bar{\mu}^l_{t,k}(\theta_t)}{\sqrt{\bar{\sigma}^l_{t,k}(\theta_t)^2 + \epsilon}}. \qquad (12)$$

With CBN, the effective number of examples used to compute the statistics for the current iteration is $k$ times as large as that for the original BN. In training, the loss gradients are backpropagated to the network weights and activations of the current iteration only, i.e., $\theta^l_t$ and $x_{t,i}(\theta_t)$. Those of the previous iterations are fixed and do not receive gradients. Hence, the computational cost of CBN in back-propagation is the same as that of BN.
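A minimal sketch of the aggregation and normalization in Eq. (9)-(12) is given below, assuming the per-iteration statistics have already been compensated as in Eq. (7)-(8) (with previous-iteration tensors detached from the graph); names and shapes are illustrative.

```python
import torch

def cbn_normalize(x, compensated_mus, compensated_nus, gamma, beta, eps=1e-5):
    """Average the compensated statistics of the k most recent iterations
    (index 0 = current iteration) and normalize the current responses.

    x:                (N, C, H, W) current feature responses
    compensated_mus:  list of k tensors of shape (C,), mu_{t-tau}(theta_t)
    compensated_nus:  list of k tensors of shape (C,), nu_{t-tau}(theta_t)
    """
    mus = torch.stack(compensated_mus)                    # (k, C)
    nus = torch.stack(compensated_nus)                    # (k, C)
    mu_bar = mus.mean(dim=0)                              # Eq. (9)
    nu_bar = torch.maximum(nus, mus * mus).mean(dim=0)    # Eq. (10): enforce nu >= mu^2
    var_bar = nu_bar - mu_bar * mu_bar                    # Eq. (11)
    x_hat = (x - mu_bar.view(1, -1, 1, 1)) / torch.sqrt(
        var_bar.view(1, -1, 1, 1) + eps)                  # Eq. (12)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)
```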

 

| Method | batch size per iter | #examples for statistics | Norm axis                   |
| IN     | #bs/GPU * #GPU      | 1                        | (spatial)                   |
| LN     | #bs/GPU * #GPU      | 1                        | (channel, spatial)          |
| GN     | #bs/GPU * #GPU      | 1                        | (channel group, spatial)    |
| BN     | #bs/GPU * #GPU      | #bs/GPU                  | (batch, spatial)            |
| syncBN | #bs/GPU * #GPU      | #bs/GPU * #GPU           | (batch, spatial, GPU)       |
| CBN    | #bs/GPU * #GPU      | #bs/GPU * temporal window | (batch, spatial, iteration) |

 

Table 1: Comparison of different feature normalization methods. #bs/GPU denotes batch size per GPU.

Replacing the BN modules in a network by CBN leads to only minor increases in computational overhead and memory footprint. For computation, the additional overhead mainly comes from computing the partial derivatives $\partial \mu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$ and $\partial \nu^l_{t-\tau}(\theta_{t-\tau}) / \partial \theta^l_{t-\tau}$, which is insignificant in relation to the overhead of the whole network. For memory, the module requires access to the statistics ($\mu^l_{t-\tau}$ and $\nu^l_{t-\tau}$) and the gradients ($\partial \mu^l_{t-\tau} / \partial \theta^l_{t-\tau}$ and $\partial \nu^l_{t-\tau} / \partial \theta^l_{t-\tau}$) computed for the most recent iterations, which is also minor compared to the rest of the memory consumed in processing the input examples. The additional computation and memory of CBN is reported for our experiments in Table 6.

A key hyper-parameter in the proposed CBN is the temporal window size, $k$, of recent iterations used for statistics estimation. A broader window enlarges the set of examples, but the example quality becomes increasingly lower for more distant iterations, since the differences between the network parameters $\theta_{t-\tau}$ and $\theta_t$ become more significant and are compensated less well by a low-order Taylor polynomial. Empirically, we found that CBN remains effective over a range of window sizes in a variety of settings and tasks. The only trick is that the window size should be kept small at the beginning of training, when the network weights change quickly. Thus, we introduce a burn-in period of length $T_{\text{burn-in}}$ for the window size, during which $k = 1$ and CBN degenerates to the original BN. In our experiments, the burn-in period is set to 25 epochs on ImageNet image classification and 3 epochs on COCO object detection by default. Ablations on this parameter are presented in the Appendix.
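A minimal sketch of this burn-in schedule, under the assumption that the window size is switched per epoch, is:

```python
def effective_window_size(epoch, burn_in_epochs, window_size):
    """Burn-in heuristic described above: during the first burn_in_epochs the
    window is kept at 1 (CBN degenerates to BN); afterwards the full temporal
    window is used. Names are illustrative."""
    return 1 if epoch < burn_in_epochs else window_size

# e.g., with the paper's ImageNet default of a 25-epoch burn-in:
# k = effective_window_size(epoch, burn_in_epochs=25, window_size=k_max)
```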

Table 1 compares CBN with other feature normalization methods. The key difference among these approaches is the axis along which the statistics are counted and the features are normalized. The previous techniques are all designed to exploit examples from the same iteration. By contrast, CBN explores the aggregation of examples along the temporal dimension. As the data utilized by CBN lies in a direction orthogonal to that of previous methods, the proposed CBN could potentially be combined with other feature normalization approaches to further enhance statistics estimation in certain challenging applications.
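To make the "Norm axis" column of Table 1 concrete, the snippet below computes the mean along the corresponding axes of a toy (N, C, H, W) feature map on a single GPU; the group count is illustrative. CBN additionally averages compensated statistics along the iteration axis, as in the earlier sketch.

```python
import torch

x = torch.randn(8, 64, 56, 56)   # (N, C, H, W) toy feature map
G = 32                            # number of groups for GN (illustrative)

mu_in = x.mean(dim=(2, 3), keepdim=True)      # IN: per (N, C), over spatial axes
mu_ln = x.mean(dim=(1, 2, 3), keepdim=True)   # LN: per N, over (channel, spatial)
mu_gn = x.view(8, G, -1).mean(dim=2)          # GN: per (N, channel group, spatial)
mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)   # BN: per C, over (batch, spatial)
```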

4 Experiments

4.1 Image Classification on ImageNet

Experimental settings. ImageNet (Russakovsky et al., 2015) is a benchmark dataset for image classification, containing 1.28M training images and 50K validation images from 1000 classes. We follow the standard setting in (He et al., 2015) to train deep networks on the training set and report the single-crop top-1 accuracy on the validation set. Our preprocessing and augmentation strategy strictly follows the GN baseline (Wu and He, 2018). We use a weight decay of 0.0001 for all weight layers, including $\gamma$ and $\beta$. We train a standard ResNet-18 for 100 epochs on 4 GPUs, and decrease the learning rate with the cosine decay strategy (He et al., 2019). We perform each experiment for five trials, and report the mean and standard deviation (error bar). ResNet-18 with BN is our base model. To compare with other normalization methods, we directly replace BN with IN, LN, GN, BRN, and our proposed CBN.

 

| Method    | IN       | LN       | GN       | CBN      | BN       |
| Top-1 acc | 64.4±0.2 | 67.9±0.2 | 68.9±0.1 | 70.2±0.1 | 70.2±0.1 |

 

Table 2: Top-1 accuracy of feature normalization methods using ResNet-18 on ImageNet.

Comparison of feature normalization methods. In Table 2, we compare the performance of each normalization method with a batch size of 32, which is sufficient for computing reliable statistics. Under this setting, BN clearly yields the highest top-1 accuracy. Similar to the results found in previous works (Wu and He, 2018), the performance of IN and LN is significantly worse than that of BN. GN works well on image classification but falls short of BN by 1.2%. Among all the methods, our CBN is the only one able to achieve accuracy comparable to BN, as it converges to the procedure of BN at larger batch sizes.

 

| batch size per GPU | 32   | 16   | 8    | 4    | 2    | 1    |
| BN                 | 70.2 | 70.2 | 68.4 | 65.1 | 55.9 | -    |
| GN                 | 68.9 | 69.0 | 68.9 | 69.0 | 69.1 | 68.9 |
| BRN                | 70.1 | 68.5 | 68.2 | 67.9 | 60.3 | -    |
| CBN                | 70.2 | 70.2 | 70.1 | 70.0 | 69.6 | 69.3 |

 

Table 3: Top-1 accuracy of normalization methods with different batch sizes using ResNet-18 as the base model on ImageNet.

Sensitivity to batch size. We compare the behavior of CBN, the original BN (Ioffe and Szegedy, 2015), GN (Wu and He, 2018), and BRN (Ioffe, 2017) at the same number of images per GPU on ImageNet classification. For CBN, enough recent iterations are utilized to ensure that the number of effective examples is no fewer than 16. For BRN, the settings strictly follow the original paper. We adopt a learning rate of 0.1 for a batch size of 32, and linearly scale the learning rate in proportion to the batch size.

The results are shown in Table 3. For the original BN, accuracy drops noticeably as the number of images per GPU is reduced from 32 to 2. BRN suffers a significant performance drop as well. GN maintains its accuracy by utilizing the channel dimension rather than the batch dimension. CBN holds its accuracy by exploiting the examples of recent iterations, and outperforms GN by 0.9% top-1 accuracy on average over the different batch sizes. This is reasonable, because the statistics computation of CBN, like that of BN, carries uncertainty from stochastic batch sampling; this uncertainty is absent in GN, resulting in some loss of regularization ability. In the extreme case of 1 image per GPU, BN and BRN fail to produce results, while CBN outperforms GN by 0.4% top-1 accuracy.

4.2 Object Detection and Instance Segmentation on COCO

Experimental settings. COCO (Lin et al., 2014) is chosen as the benchmark for object detection and instance segmentation. Models are trained on the COCO 2017 train split with 118k images, and evaluated on the COCO 2017 validation split with 5k images. Following the standard protocol in (Lin et al., 2014), the object detection and instance segmentation accuracies are measured by the mean average precision (mAP) scores at different intersection-over-union (IoU) overlaps at the box and the mask levels, respectively.

Following (Wu and He, 2018), Faster R-CNN (Ren et al., 2015) and Mask R-CNN (He et al., 2017) with FPN (Lin et al., 2017) are chosen as the baselines for object detection and instance segmentation, respectively. For both, the 2fc box head is replaced by a 4conv1fc head for better use of the normalization mechanism (Wu and He, 2018). The backbone networks are ImageNet-pretrained ResNet-50 (default) or ResNet-101, with the specified normalization. Finetuning is performed on the COCO train set for 12 epochs on 4 GPUs by SGD, where each GPU processes 4 images (default). Note that the mean and variance statistics in CBN are computed within each GPU. The learning rate is initialized according to the linear scaling rule for the given batch size per GPU, and is decayed by a factor of 10 at the 9-th and the 11-th epochs. The weight decay and momentum parameters are set to 0.0001 and 0.9, respectively. We report the average over 5 trials for all results. As the standard deviations of all methods are less than 0.1 on COCO, they are omitted here.

As done in (Wu and He, 2018), we experiment with two settings: the normalizers are activated only at the task-specific heads with frozen BN in the backbone (default), or the normalizers are activated at all layers except for the early conv1 and conv2 stages of ResNet.

 

| backbone | box head | AP   | AP50 | AP75 | APs  | APm  | APl  |
| fixed BN | -        | 36.9 | 58.2 | 39.9 | 21.2 | 40.8 | 46.9 |
| fixed BN | BN       | 36.3 | 57.3 | 39.2 | 20.8 | 39.7 | 47.3 |
| fixed BN | syncBN   | 37.7 | 58.5 | 41.1 | 22.0 | 40.9 | 49.0 |
| fixed BN | GN       | 37.8 | 59.0 | 40.8 | 22.3 | 41.2 | 48.4 |
| fixed BN | CBN      | 37.7 | 59.0 | 40.7 | 22.1 | 40.9 | 48.8 |
| BN       | BN       | 35.5 | 56.4 | 38.7 | 19.7 | 38.8 | 47.3 |
| syncBN   | syncBN   | 37.9 | 58.5 | 41.1 | 21.7 | 41.5 | 49.7 |
| GN       | GN       | 37.8 | 59.1 | 40.9 | 22.4 | 41.2 | 49.0 |
| CBN      | CBN      | 37.3 | 57.7 | 39.3 | 21.9 | 40.8 | 48.2 |

 

Table 4: Results of feature normalization methods on Faster R-CNN with FPN and ResNet-50 on COCO. As the standard deviations of all methods are less than 0.1 on COCO, they are omitted here.

 

| Backbone | method       | norm   | AP   | AP50 | AP75 | APs  | APm  | APl  |
| R50+FPN  | Faster R-CNN | GN     | 37.8 | 59.0 | 40.8 | 22.3 | 41.2 | 48.4 |
|          |              | syncBN | 37.7 | 58.5 | 41.1 | 22.0 | 40.9 | 49.0 |
|          |              | CBN    | 37.7 | 59.0 | 40.7 | 22.1 | 40.9 | 48.8 |
| R101+FPN | Faster R-CNN | GN     | 39.3 | 60.6 | 42.7 | 22.5 | 42.5 | 51.3 |
|          |              | syncBN | 39.2 | 59.8 | 43.0 | 22.2 | 42.9 | 51.6 |
|          |              | CBN    | 39.2 | 60.0 | 42.6 | 22.3 | 42.6 | 51.1 |

| Backbone | method     | norm   | AP(bbox) | AP50(bbox) | AP75(bbox) | AP(mask) | AP50(mask) | AP75(mask) |
| R50+FPN  | Mask R-CNN | GN     | 38.6     | 59.8       | 41.9       | 35.0     | 56.7       | 37.3       |
|          |            | syncBN | 38.5     | 58.9       | 42.3       | 34.7     | 56.3       | 36.8       |
|          |            | CBN    | 38.5     | 59.2       | 42.1       | 34.6     | 56.4       | 36.6       |
| R101+FPN | Mask R-CNN | GN     | 40.3     | 61.2       | 44.2       | 36.6     | 58.5       | 39.2       |
|          |            | syncBN | 40.3     | 60.8       | 44.2       | 36.0     | 57.7       | 38.6       |
|          |            | CBN    | 40.1     | 60.5       | 44.1       | 35.8     | 57.3       | 38.5       |

 

Table 5: Results with stronger backbones on COCO object detection and instance segmentation.

Normalizers at backbone and task-specific heads. We further study the effect of different normalizers on the backbone network and task-specific heads for object detection on COCO. CBN, original BN, syncBN, and GN are included in the comparison.

Table 4 presents the results. When BN is frozen in the backbone and no normalizer is applied at the head, the AP score is 36.9%. When the original BN is applied at the head only and at both the backbone and the head, the accuracy drops to 36.3% and 35.5%, respectively. For CBN, the accuracy is 37.7% and 37.3% at these two settings, respectively. Without any synchronization across GPUs, CBN can achieve performance on par with syncBN and GN, showing the superiority of the proposed approach. Unfortunately, due to the accumulation of approximation error, there is a 0.4% decrease in AP when replacing frozen BN with CBN in the backbone. Even so, CBN still outperforms the variant with unfrozen BN in the backbone by 1.8%.

Instance segmentation and stronger backbones. Results of object detection (Faster R-CNN (Ren et al., 2015)) and instance segmentation (Mask R-CNN (He et al., 2017)) with ResNet-50 and ResNet-101 are presented in Table 5. We observe that our proposed CBN achieves performance comparable to syncBN and GN with both R50 and R101 backbones on both Faster R-CNN and Mask R-CNN, which demonstrates that CBN is robust and versatile across various deep models and tasks.

4.3 Ablation Study

Effect of temporal window size $k$. We conduct this ablation on ImageNet image classification and COCO object detection, with each GPU processing 4 images. Figure 3 presents the results. When $k = 1$, only the batch from the current iteration is utilized; therefore, CBN degenerates to the original BN, and accuracy suffers due to the noisy statistics at small batch sizes. As the window size increases, more examples from recent iterations are utilized for statistics estimation, leading to greater accuracy. Accuracy saturates at larger window sizes and even drops slightly, because for more distant iterations the network weights differ more substantially and the Taylor polynomial approximation becomes less accurate.

Figure 3: The effect of temporal window size (k) on ImageNet (ResNet-18) and COCO (Faster R-CNN with ResNet-50 and FPN) with #bs/GPU = 4 for CBN and Naive CBN. Naive CBN directly utilizes statistics from recent iterations, while BN uses the equivalent #examples as CBN for statistics computation.

On the other hand, it is empirically observed that the original BN saturates at a batch size of 16 or 32 for numerous applications (Peng et al., 2018; Wu and He, 2018), indicating that the computed statistics become accurate at this scale. Thus, a temporal window size that brings the number of effective examples to roughly this level is suggested.

Effect of compensation. To study this, we compare CBN to 1) Naive CBN, in which statistics from recent iterations are directly aggregated without Taylor polynomial compensation; and 2) the original BN applied with the same effective example number as CBN (i.e., its batch size per GPU is set to the product of the batch size per GPU and the temporal window size of CBN), which does not require any compensation and serves as an upper performance bound.

The experimental results are also presented in Figure 3. CBN clearly surpasses Naive CBN when previous iterations are included. In fact, Naive CBN fails as the temporal window size grows, as shown in Figure 3, demonstrating the necessity of compensating for changing network weights across iterations. Compared with the original BN upper bound, CBN achieves similar accuracy at the same effective example number. This result indicates that the low-order Taylor polynomial compensation used by CBN is effective.

Figure 4: Training and test curves for CBN, Naive CBN, and BN on ImageNet, with a batch size per GPU of 4 and a temporal window size of 4 for CBN, Naive CBN, and BN-bs4, and a batch size per GPU of 16 for BN-bs16. The curve of BN-bs16 serves as the ideal bound.

Figure 4 presents the training and test curves of CBN, Naive CBN, BN-bs4, and BN-bs16 on ImageNet, with 4 images per GPU and a temporal window size of 4 for CBN, Naive CBN, and BN-bs4, and 16 images per GPU for BN-bs16. The training curve of CBN is close to that of BN-bs4 at the beginning and approaches that of BN-bs16 at the end. The reason is that we adopt a burn-in period to avoid the disadvantage of rapid statistics change at the beginning of training. The gap between the training curves of Naive CBN and CBN shows that Naive CBN cannot even converge well on the training set. The test curve of CBN is close to that of BN-bs16 at the end, while Naive CBN exhibits considerable jitter. All these phenomena indicate the effectiveness of our proposed Taylor polynomial compensation.

(a) ImageNet
(b) COCO
Figure 5: Results of different burn-in periods (in epochs) on CBN, with batch size per iteration of 4, on ImageNet and COCO.

Effect of burn-in period length $T_{\text{burn-in}}$. We study the effect of varying the burn-in period length, at 4 images per GPU, on both ImageNet image classification (ResNet-18) and COCO object detection (Faster R-CNN with FPN and ResNet-50). Figures 5(a) and 5(b) present the results. When the burn-in period is too short, accuracy suffers. This is because at the beginning of training, the network weights change rapidly, causing the compensation across iterations to be less effective. Nevertheless, accuracy is stable over a wide range of burn-in periods.

 

| Method | Memory (GB) | FLOPs (M) | Training Speed (iter/s) | Inference Speed (iter/s) |
| BN     | 14.1        | 5155.1    | 1.3                     | 6.2                      |
| GN     | 14.1        | 5274.2    | 1.2                     | 3.7                      |
| CBN    | 15.1        | 5809.7    | 1.0                     | 6.2                      |

 

Table 6: Comparison of theoretical memory and FLOPs, and practical training and inference speed, of the original BN, GN, and CBN on COCO.

4.4 Analysis

Computational overhead, memory footprint, and training and inference speed. We examine the computational overhead, memory footprint, and the training and inference speed of BN, GN and CBN in a practical COCO object detection task using R50-Mask R-CNN, shown in Table 6. The batch size per GPU is 4 and the window size of CBN is set to 4.

Compared to BN and GN, CBN consumes about 7% extra memory and 11% more computation. The extra memory mainly holds the statistics ($\mu^l_{t-\tau}$ and $\nu^l_{t-\tau}$), their respective gradients, and the network parameters ($\theta^l_{t-\tau}$) of previous iterations, while the computational overhead comes from calculating the gradients of the statistics, the Taylor compensations, and the averaging operations.

The overall training speed of CBN is close to that of both BN and GN. It is worth noting that the inference speed of CBN is equal to that of BN, which is much faster than GN, since the inference stage of CBN is the same as that of BN: normalization uses statistics pre-recorded during training.
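As a small illustration of why CBN matches BN at inference, the test-time path reduces to normalization with statistics recorded during training; names are illustrative.

```python
import torch

def cbn_inference(x, recorded_mu, recorded_var, gamma, beta, eps=1e-5):
    """Test-time normalization with pre-recorded statistics; no
    cross-iteration computation is needed, just as in BN inference."""
    x_hat = (x - recorded_mu.view(1, -1, 1, 1)) / torch.sqrt(
        recorded_var.view(1, -1, 1, 1) + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)
```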

From these results, the additional overhead of CBN is seen to be minor.

Using more than one layer for compensation. We also study the influence of applying the compensation over more than one layer. CBN using two layers for compensation achieves 70.1 top-1 accuracy on ImageNet (batch size per GPU = 4, k = 4), which is comparable to CBN using only one layer. However, the efficient implementation can no longer be used when more than one layer of compensation is employed. As using more layers does not further improve performance but consumes more FLOPs, we adopt one-layer compensation for CBN in practice.

(a) The gradients of $\mu$
(b) The gradients of $\nu$
Figure 6: Comparison of the gradients of the statistics w.r.t. the current layer and those w.r.t. previous layers on ImageNet.

On the gradients from different layers. The key assumption in Eq. (7) and Eq. (8) is that, for a node at the $l$-th layer, the gradient of its statistics with respect to the network weights at the $l$-th layer is larger than that with respect to the weights of prior layers, i.e.,

$$\left\|\frac{\partial \mu^l_t}{\partial \theta^l_t}\right\|_F \gg \left\|\frac{\partial \mu^l_t}{\partial \theta^r_t}\right\|_F \quad \text{and} \quad \left\|\frac{\partial \nu^l_t}{\partial \theta^l_t}\right\|_F \gg \left\|\frac{\partial \nu^l_t}{\partial \theta^r_t}\right\|_F \quad \text{for } r < l,$$

where $\mu^l_t$ denotes $\mu^l_t(\theta_t)$, $\nu^l_t$ denotes $\nu^l_t(\theta_t)$, and $\|\cdot\|_F$ denotes the Frobenius norm.

Here, we examine this assumption empirically for networks trained on ImageNet image recognition. Both $\|\partial \mu^l_t / \partial \theta^{l-1}_t\|_F \,/\, \|\partial \mu^l_t / \partial \theta^l_t\|_F$ and $\|\partial \nu^l_t / \partial \theta^{l-1}_t\|_F \,/\, \|\partial \nu^l_t / \partial \theta^l_t\|_F$ are averaged over all CBN layers of the network at different training epochs (Figure 6). The results suggest that the key assumption holds well, thus validating the approximation in Eq. (7) and Eq. (8).

We also study the gradients of non-ResNet models. The ratios $\|\partial \mu^l / \partial \theta^{l-1}\|_F \,/\, \|\partial \mu^l / \partial \theta^l\|_F$ and $\|\partial \nu^l / \partial \theta^{l-1}\|_F \,/\, \|\partial \nu^l / \partial \theta^l\|_F$ are (0.20 and 0.41) for VGG-16 and (0.15 and 0.37) for Inception-V3, similar to ResNet (0.12 and 0.39), indicating that the assumption should also hold for the VGG and Inception series.

5 Conclusion

In the small batch size regime, batch normalization is widely known to drop dramatically in performance. To address this issue, we propose to enhance the quality of the statistics by utilizing examples from multiple recent iterations. As the network activations from different iterations are not comparable to each other due to changes in network weights, we compensate for the network weight changes based on Taylor polynomials, so that the statistics can be accurately estimated. In the experiments, the proposed approach is found to outperform the original batch normalization and a direct calculation of statistics over previous iterations without compensation. Moreover, it achieves performance on par with SyncBN, which can be regarded as an upper bound, on both ImageNet classification and COCO object detection.

Acknowledgement We sincerely thank Jifeng Dai for his efforts, discussion and comments about this work. Jifeng Dai was involved in the early discussions of the work. He gave up the authorship after he joined another company.

Appendix A Algorithm Outline

Algorithm 1 presents an outline of our proposed Cross-Iteration Batch Normalization (CBN).

Input: Feature responses of a network node at the $l$-th layer of the $t$-th iteration $\{x_{t,i}(\theta_t)\}_{i=1}^{m}$, network weights $\{\theta^l_{t-\tau}\}_{\tau=0}^{k-1}$, and the statistics $\{\mu^l_{t-\tau}(\theta_{t-\tau}), \nu^l_{t-\tau}(\theta_{t-\tau})\}$ and gradients $\{\partial \mu^l_{t-\tau} / \partial \theta^l_{t-\tau}, \partial \nu^l_{t-\tau} / \partial \theta^l_{t-\tau}\}$ from the $k-1$ most recent iterations

Output: Normalized responses $\{y_{t,i}(\theta_t)\}_{i=1}^{m}$

$\mu^l_t(\theta_t) = \frac{1}{m}\sum_{i=1}^{m} x_{t,i}(\theta_t)$, $\nu^l_t(\theta_t) = \frac{1}{m}\sum_{i=1}^{m} x_{t,i}(\theta_t)^2$  // statistics on the current iteration
for $\tau = 1, \dots, k-1$ do
      $\mu^l_{t-\tau}(\theta_t) \approx \mu^l_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \mu^l_{t-\tau}(\theta_{t-\tau})}{\partial \theta^l_{t-\tau}}(\theta^l_t - \theta^l_{t-\tau})$  // approximation from recent iterations, Eq. (7)
      $\nu^l_{t-\tau}(\theta_t) \approx \nu^l_{t-\tau}(\theta_{t-\tau}) + \frac{\partial \nu^l_{t-\tau}(\theta_{t-\tau})}{\partial \theta^l_{t-\tau}}(\theta^l_t - \theta^l_{t-\tau})$  // approximation from recent iterations, Eq. (8)
end for
$\bar{\mu}^l_{t,k}(\theta_t) = \frac{1}{k}\sum_{\tau=0}^{k-1} \mu^l_{t-\tau}(\theta_t)$  // averaging over recent iterations, Eq. (9)
$\bar{\nu}^l_{t,k}(\theta_t) = \frac{1}{k}\sum_{\tau=0}^{k-1} \max[\nu^l_{t-\tau}(\theta_t), \mu^l_{t-\tau}(\theta_t)^2]$  // validation and averaging over recent iterations, Eq. (10)

$\hat{x}_{t,i}(\theta_t) = \big(x_{t,i}(\theta_t) - \bar{\mu}^l_{t,k}(\theta_t)\big) / \sqrt{\bar{\nu}^l_{t,k}(\theta_t) - \bar{\mu}^l_{t,k}(\theta_t)^2 + \epsilon}$  // normalize
$y_{t,i}(\theta_t) = \gamma\, \hat{x}_{t,i}(\theta_t) + \beta$  // scale and shift
Algorithm 1: Cross-Iteration Batch Normalization (CBN)

Appendix B Efficient Implementation of $\frac{\partial \mu^l}{\partial \theta^l}(\theta^l_t - \theta^l_{t-\tau})$ and $\frac{\partial \nu^l}{\partial \theta^l}(\theta^l_t - \theta^l_{t-\tau})$

Let $C_l$ and $C_{l-1}$ denote the channel dimension of the $l$-th layer and the $(l-1)$-th layer, respectively, and let $K$ denote the kernel size of $\theta^l$. $\mu^l$ and $\nu^l$ are thus of $C_l$ dimensions in channels, and $\theta^l$ is a $C_l \times C_{l-1} \times K$ dimensional tensor. A naive implementation of $\partial \mu^l / \partial \theta^l$ and $\partial \nu^l / \partial \theta^l$ involves computational overhead of $O(C_l \times C_l \times C_{l-1} \times K)$. Here we find that the operations of $\frac{\partial \mu^l}{\partial \theta^l}(\theta^l_t - \theta^l_{t-\tau})$ and $\frac{\partial \nu^l}{\partial \theta^l}(\theta^l_t - \theta^l_{t-\tau})$ can be implemented efficiently in $O(C_l \times C_{l-1} \times K)$, thanks to the averaging of feature responses in $\mu^l$ and $\nu^l$.

Here we derive the efficient implementation of $\frac{\partial \mu^l}{\partial \theta^l}(\theta^l_t - \theta^l_{t-\tau})$; that of $\frac{\partial \nu^l}{\partial \theta^l}(\theta^l_t - \theta^l_{t-\tau})$ is about the same. Let us first simplify the notation. Let $\mu$ and $\theta$ denote $\mu^l_{t-\tau}(\theta_{t-\tau})$ and $\theta^l_{t-\tau}$, respectively, dropping the notation for iterations. The element-wise computation in the forward pass is

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{i,j}, \qquad (12)$$

where $\mu_j$ denotes the $j$-th channel of $\mu$, and $x_{i,j}$ denotes the $j$-th channel of the $i$-th example. $x_{i,j}$ is computed as

$$x_{i,j} = \sum_{n=1}^{C_{l-1}} \sum_{k=1}^{K} \theta_{j,n,k}\, y_{i,n,\Delta_k}, \qquad (13)$$

where $n$ and $k$ enumerate the input feature dimension and the convolution kernel index, respectively, $\Delta_k$ denotes the spatial offset in applying the $k$-th kernel element, and $y$ is the output of the $(l-1)$-th layer.

The element-wise calculation of $\frac{\partial \mu_j}{\partial \theta_{h,n,k}}$ follows from Eq. (12) and Eq. (13):

$$\frac{\partial \mu_j}{\partial \theta_{h,n,k}} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial x_{i,j}}{\partial \theta_{h,n,k}} = \begin{cases} \frac{1}{m}\sum_{i=1}^{m} y_{i,n,\Delta_k} & \text{if } j = h, \\ 0 & \text{otherwise.} \end{cases} \qquad (14)$$

Thus, $\frac{\partial \mu_j}{\partial \theta_{h,n,k}}$ takes non-zero values only when $j = h$. The operation $\frac{\partial \mu^l}{\partial \theta^l}(\theta^l_t - \theta^l_{t-\tau})$ can therefore be implemented efficiently in $O(C_l \times C_{l-1} \times K)$. Similarly, the calculation of $\frac{\partial \nu^l}{\partial \theta^l}(\theta^l_t - \theta^l_{t-\tau})$ can be obtained in $O(C_l \times C_{l-1} \times K)$.
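The sketch below follows this derivation for the $\mu$ term of a standard 2D convolution (stride 1, zero padding assumed), averaging the input patches once and contracting them with the weight difference in $O(C_l \times C_{l-1} \times K)$. Names are illustrative, and padded zeros are included in the patch average, a minor approximation.

```python
import torch
import torch.nn.functional as F

def dmu_dtheta_times_delta(y_prev, theta_cur, theta_prev, kernel_size, padding=1):
    """Efficient form of (d mu / d theta)(theta_t - theta_{t-tau}) for the mean.
    Because d mu_j / d theta_{j,n,k} equals the batch/spatial average of the
    input patches y (and vanishes for mismatched output channels, Eq. (14)),
    the product reduces to a single contraction of size C_l x C_{l-1} x K.

    y_prev:              (N, C_in, H, W) input to the conv layer at iteration t-tau
    theta_cur, theta_prev: (C_out, C_in, kh, kw) conv weights
    """
    # Average of input patches over batch and spatial positions: (C_in * kh * kw,)
    patches = F.unfold(y_prev, kernel_size, padding=padding)   # (N, C_in*kh*kw, L)
    g = patches.mean(dim=(0, 2))
    # Contract the weight difference with the averaged patches per output channel.
    delta = (theta_cur - theta_prev).reshape(theta_cur.shape[0], -1)  # (C_out, C_in*kh*kw)
    return delta @ g                                            # (C_out,) compensation term for mu
```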

References

  1. Ba et al. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
  2. Chen et al. (2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848.
  3. Dai et al. (2017). Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.
  4. He et al. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  5. He et al. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
  6. He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  7. He et al. (2019). Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  8. Ioffe and Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
  9. Ioffe (2017). Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pp. 1945–1953.
  10. LeCun et al. (1998). Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–50.
  11. Lin et al. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  12. Lin et al. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755.
  13. Long et al. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
  14. Luo et al. (2018). Differentiable learning-to-normalize via switchable normalization. arXiv preprint arXiv:1806.10779.
  15. Nam and Kim (2018). Batch-instance normalization for adaptively style-invariant neural networks. In Advances in Neural Information Processing Systems, pp. 2563–2572.
  16. Peng et al. (2018). MegDet: A large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6181–6189.
  17. Qiao et al. (2019). Weight standardization. arXiv preprint arXiv:1903.10520.
  18. Ren et al. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  19. Russakovsky et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.
  20. Salimans and Kingma (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909.
  21. Shao et al. (2019). SSN: Learning sparse switchable normalization via SparsestMax. arXiv preprint arXiv:1903.03793.
  22. Sola and Sevilla (1997). Importance of input data normalization for the application of neural networks to complex industrial problems. IEEE Transactions on Nuclear Science, 44(3):1464–1468.
  23. Ulyanov et al. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
  24. Wang et al. (2018a). Kalman normalization: Normalizing internal representations across network layers. In Advances in Neural Information Processing Systems, pp. 21–31.
  25. Wang et al. (2018b). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.
  26. Wu and He (2018). Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.