Understanding the Disharmony between Weight Normalization Family and Weight Decay: $\epsilon$-shifted $L_2$ Regularizer
Abstract
The merits of fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods use standardization or normalization that maps the weight $w$ to $\hat{w}$ (e.g., $\hat{w} = \frac{w}{\|w\|_2}$), which makes $\hat{w}$ independent of the magnitude $\|w\|_2$. Surprisingly, $w$ must still be decayed during gradient descent, otherwise we observe a severe underfitting problem, which is very counterintuitive since weight decay is widely known to prevent deep networks from overfitting. Moreover, if we substitute $\hat{w} = \frac{w}{\|w\|_2}$ (e.g., weight normalization) into the original loss function, it is observed that the $L_2$ regularization term is canceled as a constant in the optimization objective. Therefore, to decay $w$, we need to explicitly append this term: $\frac{\lambda}{2}\|w\|_2^2$. In this paper, we theoretically prove that $\frac{\lambda}{2}\|w\|_2^2$ merely modulates the effective learning rate to improve objective optimization, and has no influence on generalization when the weight normalization family is compositely employed. Furthermore, we also expose several critical problems that arise when the weight decay term is introduced into the weight normalization family, including the missing of a global minimum and training instability. To address these problems, we propose an $\epsilon$-shifted $L_2$ regularizer, which shifts the $L_2$ objective by a positive constant $\epsilon$. Such a simple operation can theoretically guarantee the existence of a global minimum, while preventing the network weights from being too small and thus avoiding gradient float overflow. It significantly improves training stability and can achieve slightly better performance in our practice. The effectiveness of the $\epsilon$-shifted $L_2$ regularizer is comprehensively validated on the ImageNet, CIFAR100, and COCO datasets. Our code and pretrained models will be released at https://github.com/implus/PytorchInsight.
1 Introduction
The normalization methodologies on features have made great progress in recent years, with the introduction of BN [ioffe2015batch], IN [ulyanov2016instance], LN [ba2016layer], GN [wu2018group] and SN [luo2018differentiable]. These methods mainly focus on a zero mean and unit variance normalization operation on a specific dimension (or multiple dimensions) of features, which makes deep neural architectures [he2016deep, he2016identity, huang2017densely, wang2018mixed] much easier to optimize, leading to robust solutions with favorable generalization performance.
Beyond feature normalization, there is an increasing interest in the normalization of network weights. Weight Normalization (WN) [salimans2016weight] first separates the learning of the length and direction of weights, and it performs satisfactorily on several relatively small datasets. In some contexts of generative adversarial networks (GAN) [goodfellow2014generative], Weight Normalization with Translated ReLU [xiang2017effects] is shown to achieve superior results. Later, Centered Weight Normalization (CWN) [huang2017centered] further powers WN by additionally centering the input weights, further improving the conditioning and convergence speed. Recently, very similar to CWN, Weight Standardization (WS) [weightstandardization] aims to standardize the weights with zero mean and unit variance. On large-scale tasks (ImageNet [deng2009imagenet] classification / COCO [lin2014microsoft] detection), WS further enhances optimization convergence and generalization performance, in cooperation with feature normalizations such as GN and BN.
In terms of the weight normalization family, despite its appealing success, there is still one confusing mystery – the disharmony between the weight normalization family and weight decay [krogh1992simple]. Note that weight decay is widely interpreted as a form of $L_2$ regularization [loshchilov2017fixing] because it can be derived from the gradient of the $L_2$ norm of the weights [loshchilov2018decoupled]. Specifically, we consider training a single-layer single-output neural network with weights $w$, using the following loss function to be minimized:
$$L(w) = \frac{1}{N}\sum_{i=1}^{N} \ell(w; x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2 \qquad (1)$$
In Eq. (1), $N$ denotes the number of training samples, $\ell(w; x_i, y_i)$ is the task-related loss w.r.t. the input/label pair $(x_i, y_i)$, and $\frac{\lambda}{2}\|w\|_2^2$ is the $L_2$ regularization term with a constant $\lambda$ to balance it against the task-related loss. For simplicity, we use weight normalization $\hat{w} = \frac{w}{\|w\|_2}$ to reparameterize $w$, disregarding the learning of its length. By substituting $\hat{w}$ into Eq. (1), we get $\|\hat{w}\|_2^2 = 1$, which is equivalent to minimizing the following function
$$L(\hat{w}) = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}; x_i, y_i) + \frac{\lambda}{2} \qquad (2)$$
Interestingly, the weight decay term has indeed disappeared, reducing to a constant. In the case of WS, we can reach a similar conclusion by replacing $w$ with $\hat{w} = \frac{w - \mu_w}{\sigma_w}$:
$$L(\hat{w}) = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}; x_i, y_i) + \frac{\lambda n}{2} \qquad (3)$$
where $\mu_w$ and $\sigma_w$ denote the mean and standard deviation of $w$, and the regularization term again reduces to a constant. It probably makes sense, since weight decay cannot take effect on a fixed distribution of normalized weights. However, when we apply Eq. (3) to WS-equipped ResNet50 (WS-ResNet50) on the ImageNet dataset, i.e., setting the weight decay ratio $\lambda$ to 0 for all WS-equipped convolutions, we observe a severe degradation with a significant performance drop on the training set (Fig. 1). It is incredibly strange: weight decay is known to prevent the training from overfitting the data, yet removing weight decay instead puts the network into serious underfitting, which is very counterintuitive.
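The cancellation of the decay term under normalization can be made concrete with a minimal numpy sketch (our own illustration, not the paper's code), assuming the WN reparameterization with the learnable length ignored:

```python
import numpy as np

def weight_norm(w):
    # WN reparameterization with the learnable length ignored: w_hat = w/||w||_2
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
w = rng.normal(size=8)

# Scaling w by any positive factor leaves w_hat unchanged ...
for scale in (0.1, 1.0, 10.0):
    np.testing.assert_allclose(weight_norm(scale * w), weight_norm(w))

# ... and the L2 penalty of the normalized weight is a constant (= 1), so a
# decay term on w_hat contributes nothing to the optimization objective.
assert abs(np.linalg.norm(weight_norm(w)) ** 2 - 1.0) < 1e-12
```

Since the task loss sees only the normalized weight, any regularizer expressed through $\hat{w}$ alone is a constant of the optimization.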
To answer the above question, in this paper we first prove that in Eq. (2) (and Eq. (3)), the addition of the weight decay term $\frac{\lambda}{2}\|w\|_2^2$ does not change the optimization goal. Therefore, weight decay loses its original role of finding a better generalized solution by introducing a loss part that competes against the task-related one. At the same time, based on the derivation of the gradient formula of $\hat{w}$, we further prove that weight decay only takes effect in modulating the effective learning rate to help the gradient descent process when the weight normalization family is employed simultaneously, and we empirically demonstrate how it adjusts the effective learning rate. Interestingly, we also make an additional empirical discovery: training a network with weight normalization and weight decay implicitly includes an approximate warmup [goyal2017accurate] process, which probably explains the slightly improved performance over the original baselines.
The current common and default practice [salimans2016weight, weightstandardization] for optimizing networks with the weight normalization family is to preserve the traditional decay term $\frac{\lambda}{2}\|w\|_2^2$ for the better convergence that comes from ensuring a stable effective learning rate, i.e., to explicitly add it to Eq. (2):
$$L(w) = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}; x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2 \qquad (4)$$
However, there are many potential problems in taking Eq. (4) as the final optimization objective. First, we prove that Eq. (4) theoretically has no global minimum. In addition, an improper selection of $\lambda$ will push the magnitude $\|w\|_2$ of the weights towards 0 and easily lead to training failures due to gradient float overflow (as the corresponding gradient is proportional to $\frac{1}{\|w\|_2}$), especially for certain adaptive gradient methods (e.g., Adam [kingma2014adam]) which accumulate the square of the gradients.
To address these problems, we propose a very simple yet effective $\epsilon$-shifted $L_2$ regularizer, which shifts the $L_2$ objective by a positive constant $\epsilon$. The shift $\epsilon$ prevents the network weights from being too small, thus directly avoiding gradient float overflow risks. Such a simple operation can theoretically guarantee the existence of a global minimum, whilst greatly improving training stability. Beyond training stability, it further brings performance gains over a wide range of architectures, probably due to its dynamic decay mechanism, which we will discuss later. The effectiveness of our method is comprehensively demonstrated by experiments on the ImageNet [deng2009imagenet], CIFAR100 [krizhevsky2009learning] and COCO [lin2014microsoft] datasets.
To summarize our contributions:

We thoroughly analyze the disharmony between the weight normalization family and weight decay, and expose critical problems, including the lack of a global minimum and training instability, which are caused by the optimization of the weight decay term in the final loss objective when weight normalization is simultaneously applied.

We theoretically prove that weight decay loses the ability to enhance generalization in the weight normalization family, and only plays a role in regulating the effective learning rate to help training. We demonstrate that when optimizing with SGD, the weight decay term can be cancelled by simply scaling the learning rate with a constant at each gradient descent step, where the constant is determined only by the hyperparameters $\eta$ and $\lambda$ and is independent of the training process.

We propose a simple yet effective $\epsilon$-shifted $L_2$ regularizer to overcome the problems caused by introducing weight decay into the weight normalization family, which significantly improves training stability whilst achieving better performance over a large range of network architectures on both classification and detection tasks.
2 Related Works
Weight Normalization Family: Weight Normalization (WN) [salimans2016weight] makes the first attempt to reparameterize weights through the separation of direction $\frac{v}{\|v\|_2}$ and length $g$:
$$w = g\frac{v}{\|v\|_2} \qquad (5)$$
The normalization operation participates in the gradient flow, resulting in accelerated convergence of stochastic gradient descent optimization. WN shows certain advantages in some tasks of supervised image recognition, generative modelling, and deep reinforcement learning. However, [gitman2017comparison] points out that on the large-scale ImageNet dataset, the final test accuracy of WN is significantly lower ($\sim$6%) than that of BN [ioffe2015batch]. Later, Centered Weight Normalization (CWN) is proposed to further improve the conditioning and accelerate the convergence of training deep networks. The central idea of CWN is an additional centering operation based on WN:
$$w = g\frac{v - \mu_v}{\|v - \mu_v\|_2} \qquad (6)$$
Recently, in order to alleviate the problem of degraded performance of GN [wu2018group], Weight Standardization (WS) [weightstandardization] is proposed, which is very close to CWN but with the learnable length $g$ removed:
$$\hat{w} = \frac{w - \mu_w}{\sigma_w} \qquad (7)$$
WS is recommended to cooperate with feature normalization methods (such as GN and BN), which leads to further enhanced performance in largescale tasks and can significantly accelerate the convergence. Introducing WS on the basis of GN or BN can consistently bring gains to multiple downstream visual tasks, including image classification, object detection, instance segmentation, video recognition, semantic segmentation, and point cloud recognition. In this paper, we mainly focus on the weight normalization family and conduct a series of analyses on their properties, especially on their relations to weight decay.
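The three reparameterizations above can be sketched in a few lines of numpy (a simplified, single-vector illustration of Eqs. (5)-(7); in practice they are applied per output filter, and the function names are ours):

```python
import numpy as np

def wn(v, g):
    # Weight Normalization (Eq. (5)): w = g * v / ||v||_2
    return g * v / np.linalg.norm(v)

def cwn(v, g):
    # Centered Weight Normalization (Eq. (6)): center v first, then as WN
    vc = v - v.mean()
    return g * vc / np.linalg.norm(vc)

def ws(w):
    # Weight Standardization (Eq. (7)): zero mean, unit variance, no length g
    return (w - w.mean()) / w.std()

rng = np.random.default_rng(1)
v = rng.normal(size=16)

assert abs(np.linalg.norm(wn(v, 2.0)) - 2.0) < 1e-12     # length equals g
assert abs(cwn(v, 1.0).mean()) < 1e-12                   # zero mean
w_hat = ws(v)
assert abs(w_hat.mean()) < 1e-12 and abs(w_hat.std() - 1.0) < 1e-12
```

Each variant fixes progressively more statistics of the weights: WN fixes the length, CWN additionally fixes the mean, and WS fixes both the mean and the variance while dropping the learnable length entirely.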
Weight Decay: weight decay can be traced back to [krogh1992simple], and is defined as multiplying each weight by a factor slightly less than 1 at each gradient descent update. In the Stochastic Gradient Descent (SGD) setting, weight decay is widely interpreted as a form of $L_2$ regularization [ng2004feature] because it can be derived from the gradient of the $L_2$ norm of the weights [loshchilov2018decoupled]. It is known to be beneficial for the generalization of neural networks. Recently, [zhang2018three] identify three distinct mechanisms by which weight decay improves generalization: increasing the effective learning rate for BN, reducing the Jacobian norm, and reducing the effective damping parameter. Similarly, a series of recent works [van2017l2, hoffer2018norm] also demonstrates that when using BN, weight decay improves optimization only by fixing the norm to a small range of values, leading to a more stable step size for the weight direction. Although related, these works differ from ours in at least four aspects: 1) they mainly focus on the interaction between feature normalization (especially BN) and weight decay, whilst we are the first to give a thorough analysis of the disharmony between the weight normalization family and weight decay; 2) they solely demonstrate empirical results that the accuracy gained by using weight decay can be achieved without it, only by adjusting the learning rate. In contrast, we give a theoretical proof and derive how to linearly scale the learning rate at each step, which is purely determined by the training hyperparameters; 3) they fail to discover the problems caused by introducing weight decay into the loss objective with normalized weights, which are thoroughly revealed and discussed in this article; 4) although weight decay has several potential problems with normalization methods, they have not proposed a solution to replace weight decay.
In contrast, our proposed $\epsilon$-shifted $L_2$ regularizer can successfully guarantee the global minimum and training stability to overcome the existing drawbacks, whilst achieving superior performance over a range of tasks.
3 Roles of Weight Decay in Weight Normalization Family
In this section, we explain the roles of weight decay in the weight normalization family in detail. The theoretical analyses of these roles help to understand why weight decay loses the ability to enhance generalization, yet controls the effective learning rate to help the training of deep networks.
3.1 Weight Decay Does Not Change the Optimization Goal
We first prove that in networks equipped with the weight normalization family, the introduction of weight decay does not change the goal of optimization, indicating that weight decay faithfully brings no additional generalization benefits. For the analyses, we simply use the concept of variable decomposition. Specifically, we choose two representative methods from the weight normalization family, namely WN and WS, and discuss each in turn. Note that for simplicity, we ignore the learning of the length $g$ in WN in the following analyses and experiments.
In WN, we aim to prove that minimizing
$$L(w) = \frac{1}{N}\sum_{i=1}^{N} \ell\left(\frac{w}{\|w\|_2}; x_i, y_i\right) + \frac{\lambda}{2}\|w\|_2^2 \qquad (8)$$
is equivalent to minimizing
$$\hat{L}(\hat{w}) = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}; x_i, y_i) \qquad (9)$$
Specifically, we let $\hat{w} = \frac{w}{\|w\|_2}$ and $g = \|w\|_2$, and decompose the direction and length of $w$ into two independent variables. Then the objectives of Eq. (8) and (9) can be rewritten as
$$L(\hat{w}, g) = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}; x_i, y_i) + \frac{\lambda}{2}g^2 \qquad (10)$$
and
$$\hat{L}(\hat{w}) = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}; x_i, y_i) \qquad (11)$$
respectively. Since and are two independent variables, we have
$$\min_{\hat{w}, g} L(\hat{w}, g) = \min_{\hat{w}} \hat{L}(\hat{w}) + \min_{g} \frac{\lambda}{2}g^2 \qquad (12)$$
which shows that minimizing $L$ actually contains the task of minimizing $\hat{L}$, and this completes the proof. Similarly, in WS we can further decompose the mean and variance of $w$ by letting $\hat{w} = \frac{w - \mu_w}{\sigma_w}$, $\mu = \mu_w$ and $\sigma = \sigma_w$, where $\hat{w}$, $\mu$ and $\sigma$ are also mutually independent. Again we have
$$\min_{\hat{w}, \mu, \sigma} L(\hat{w}, \mu, \sigma) = \min_{\hat{w}} \hat{L}(\hat{w}) + \min_{\mu, \sigma} \frac{\lambda n}{2}\left(\sigma^2 + \mu^2\right) \qquad (13)$$
Therefore, according to the above analyses, for networks with the weight normalization family employed, the introduction of weight decay does not essentially change the learning objective, which implies that it has no effect on the network's generalization capability.
3.2 Weight Decay Ensures Effective Learning Rate
Since weight decay does not bring a regularization effect to a network with the weight normalization family, why is it indispensable in the training process? The central reason is that weight decay helps to control the effective learning rate within a stable and reasonable range. Taking WN as an example, we can derive the gradient of $\hat{L}$ w.r.t. $w$ (the derivative of Eq. (9)):
$$\frac{\partial \hat{L}}{\partial w} = \frac{1}{\|w\|_2}\left(\frac{\partial \hat{L}}{\partial \hat{w}} - \left\langle\frac{\partial \hat{L}}{\partial \hat{w}}, \hat{w}\right\rangle \hat{w}\right) \qquad (14)$$
where $\langle\cdot,\cdot\rangle$ denotes the inner product. If we consider one gradient descent update at step $t$, i.e., $w^{(t)} \rightarrow w^{(t+1)}$, with learning rate $\eta$, we have:
$$w^{(t+1)} = w^{(t)} - \frac{\eta}{\|w^{(t)}\|_2}\left(\frac{\partial \hat{L}}{\partial \hat{w}^{(t)}} - \left\langle\frac{\partial \hat{L}}{\partial \hat{w}^{(t)}}, \hat{w}^{(t)}\right\rangle \hat{w}^{(t)}\right) \qquad (15)$$
Eq. (15) demonstrates that even if $\frac{\partial \hat{L}}{\partial \hat{w}}$ is fixed, the gradient update in terms of $\hat{w}$ can vary according to the magnitude $\|w\|_2$. The reason is that a fixed $\frac{\partial \hat{L}}{\partial \hat{w}}$ only fixes the update of $w$ up to the $\frac{1}{\|w\|_2}$ factor in Eq. (15), and normalizing $w^{(t+1)}$ introduces another $\frac{1}{\|w\|_2}$ factor. Consequently, the entire update step size of $\hat{w}$ is determined by $\frac{\eta}{\|w\|_2^2}$, which exactly controls the effective learning rate, similarly defined as in [van2017l2, hoffer2018norm]. Here we have two conclusions:

If we do not limit $\|w\|_2$ during the update process, the weights can grow unbounded ($\|w\|_2 \rightarrow \infty$), and the effective learning rate goes to 0 ($\frac{\eta}{\|w\|_2^2} \rightarrow 0$).

On the contrary, if we decay $\|w\|_2$ too much during the optimization ($\|w\|_2 \rightarrow 0$), the effective learning rate will grow unbounded ($\frac{\eta}{\|w\|_2^2} \rightarrow \infty$), which leads to gradient float overflow and training failures. This is part of the motivation of our proposed method and we will discuss it in detail later.
A similar analysis can be conducted in the case of WS, where one update step is:
$$w^{(t+1)} = w^{(t)} - \frac{\eta}{\sigma_{w^{(t)}}}\left(\frac{\partial \hat{L}}{\partial \hat{w}^{(t)}} - \mu_{\frac{\partial \hat{L}}{\partial \hat{w}^{(t)}}} - \frac{1}{n}\left\langle\frac{\partial \hat{L}}{\partial \hat{w}^{(t)}}, \hat{w}^{(t)}\right\rangle \hat{w}^{(t)}\right) \qquad (16)$$
which has the term $\frac{\eta}{\sigma_w^2}$ as its effective learning rate.
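Both conclusions follow from the $\frac{\eta}{\|w\|_2^2}$ scaling, which can be checked numerically; below is a sketch for WN (a construction of ours, ignoring the learnable length, with `wn_backward` as our own helper for the gradient in Eq. (14)):

```python
import numpy as np

def wn_backward(w, grad_w_hat):
    # dL/dw for w_hat = w/||w||_2 (Eq. (14)): the 1/||w|| factor is one
    # source of the effective learning rate.
    n = np.linalg.norm(w)
    w_hat = w / n
    return (grad_w_hat - grad_w_hat.dot(w_hat) * w_hat) / n

rng = np.random.default_rng(2)
w = rng.normal(size=32)
g_hat = rng.normal(size=32)            # fixed upstream gradient w.r.t. w_hat
eta = 0.1

def step_size_of_w_hat(w):
    # How far w_hat moves after one SGD step on w.
    w_new = w - eta * wn_backward(w, g_hat)
    return np.linalg.norm(w_new / np.linalg.norm(w_new) - w / np.linalg.norm(w))

# Scaling the weights by c shrinks the movement of w_hat by ~c^2:
# the effective learning rate behaves like eta / ||w||_2^2.
ratio = step_size_of_w_hat(w) / step_size_of_w_hat(10.0 * w)
assert 95 < ratio < 105
```

The same experiment with ever-larger scales drives the movement of $\hat{w}$ towards 0, and with ever-smaller scales it blows up, matching the two conclusions above.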
To prove that weight decay only ensures the effective learning rate, we conduct the following theoretical analysis:
Theorem 1. For a network equipped with the weight normalization family and optimized by SGD, training with the weight decay term (Eq. (8)) and learning rate $\eta$ produces the same trajectory of normalized weights $\hat{w}$ as training without weight decay (Eq. (9)), provided the learning rate of the latter is scaled at each step by a multiplier determined only by the hyperparameters $\eta$ and $\lambda$.
Proof.
Here we focus on the normalized variables to represent the training trajectory, e.g., $\hat{w} = \frac{w}{\|w\|_2}$ (or $\hat{w} = \frac{w-\mu_w}{\sigma_w}$), as they are the ultimate weights with which networks operate. In the case of WN, we suppose optimizing $L$ (Eq. (8)) and $\hat{L}$ (Eq. (9)) takes $T$ steps in total, and at each step we feed the same data batch to both of them. For ease of reference, the corresponding variables at step $t$ for optimizing $L$ are marked with superscript $(t)$, e.g., $w^{(t)}$. Such notations are kept similarly for optimizing $\hat{L}$, e.g., $w'^{(t)}$. We further assume that the two optimization processes start from the same initial weights $w^{(0)} = w'^{(0)}$, which also means $r_0 = 1$, where $r_t$ is a scale between $w'^{(t)}$ and $w^{(t)}$ (if it exists), i.e., $w'^{(t)} = r_t w^{(t)}$. Specifically, we aim to prove that there exists a sequence of $r_t$ as multipliers (note that $r_t$ must be independent of the training process) for the learning rate $\eta$ during optimizing $\hat{L}$, which ensures $\hat{w}^{(t)} = \hat{w}'^{(t)}$ for every step $t$ from $0$ to $T$. We take the standard SGD [bottou2010large, ruder2016overview] for analysis via mathematical induction:
1) As stated in the assumption, $w^{(0)} = w'^{(0)}$ holds for $t = 0$, indicating $\hat{w}^{(0)} = \hat{w}'^{(0)}$. Here we do not need a learning rate multiplier since the gradient descent step does not start in the initialization phase.
2) Suppose $\hat{w}^{(t)} = \hat{w}'^{(t)}$ holds for step $t$ with $w'^{(t)} = r_t w^{(t)}$; we need to prove $\hat{w}^{(t+1)} = \hat{w}'^{(t+1)}$ under a certain expression of the scaled learning rate $\eta_t$. Let us expand $w^{(t+1)}$ and $w'^{(t+1)}$ by performing one gradient descent step:
$$w^{(t+1)} = (1 - \eta\lambda)w^{(t)} - \frac{\eta}{\|w^{(t)}\|_2}\left(\frac{\partial \hat{L}}{\partial \hat{w}^{(t)}} - \left\langle\frac{\partial \hat{L}}{\partial \hat{w}^{(t)}}, \hat{w}^{(t)}\right\rangle \hat{w}^{(t)}\right) \qquad (17)$$
and
$$w'^{(t+1)} = w'^{(t)} - \frac{\eta_t}{\|w'^{(t)}\|_2}\left(\frac{\partial \hat{L}}{\partial \hat{w}'^{(t)}} - \left\langle\frac{\partial \hat{L}}{\partial \hat{w}'^{(t)}}, \hat{w}'^{(t)}\right\rangle \hat{w}'^{(t)}\right) = r_t w^{(t)} - \frac{\eta_t}{r_t\|w^{(t)}\|_2}\left(\frac{\partial \hat{L}}{\partial \hat{w}^{(t)}} - \left\langle\frac{\partial \hat{L}}{\partial \hat{w}^{(t)}}, \hat{w}^{(t)}\right\rangle \hat{w}^{(t)}\right) \qquad (18)$$
Therefore, it is obvious to deduce from Eq. (17) and (18) that when $\eta_t = \frac{r_t^2}{1-\eta\lambda}\eta$, we have
$$w'^{(t+1)} = \frac{r_t}{1-\eta\lambda}\, w^{(t+1)} \qquad (19)$$
which consequently leads to $\hat{w}^{(t+1)} = \hat{w}'^{(t+1)}$, and thereby completes the proof. At the same time, we can also derive the recursive formula for $r_t$:
$$r_{t+1} = \frac{r_t}{1-\eta\lambda}, \quad r_0 = 1 \qquad (20)$$
Given the sequence of $r_t$ generated from Eq. (20), the resulting sequence of scaled learning rates finally becomes $\eta_t = \frac{r_t^2}{1-\eta\lambda}\eta$. Similar deductions can be carried out for the case of WS. To give a better illustration of the proof, we demonstrate the first gradient descent update of $w$ and $w'$ in Fig. 3. It is easy to see that $w^{(1)}$ and $w'^{(1)}$ form a similar-triangle relationship and their scale factor is only determined by the hyperparameters $\eta$ and $\lambda$. ∎
According to the above analyses, we conclude that for networks with the weight normalization family, weight decay only takes effect in modulating the effective learning rate, and theoretically, we can replace it simply by adjusting the learning rate at each iteration with a calculated ratio that is only related to the hyperparameters $\eta$ and $\lambda$.
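This replacement can be simulated end to end on a toy problem. The sketch below (our own construction, using a synthetic quadratic loss that depends only on the normalized weights) runs SGD with weight decay against SGD without decay but with the per-step scaled learning rate $\eta_t = \frac{r_t^2}{1-\eta\lambda}\eta$, and checks that the normalized trajectories coincide:

```python
import numpy as np

def wn_backward(w, grad_w_hat):
    # dL/dw for w_hat = w/||w||_2: project out the radial component, then
    # divide by the norm (Eq. (14)).
    n = np.linalg.norm(w)
    w_hat = w / n
    return (grad_w_hat - grad_w_hat.dot(w_hat) * w_hat) / n

rng = np.random.default_rng(3)
target = rng.normal(size=8)            # toy loss: 0.5 * ||w_hat - target||^2
w_a = rng.normal(size=8)               # run A: SGD with weight decay
w_b = w_a.copy()                       # run B: SGD without decay, scaled lr
eta, lam, r = 0.05, 1e-2, 1.0

for _ in range(200):
    g_a = wn_backward(w_a, w_a / np.linalg.norm(w_a) - target)
    g_b = wn_backward(w_b, w_b / np.linalg.norm(w_b) - target)
    w_a = (1 - eta * lam) * w_a - eta * g_a             # decayed update
    w_b = w_b - (r ** 2 * eta / (1 - eta * lam)) * g_b  # scaled lr, no decay
    r = r / (1 - eta * lam)     # multiplier depends on eta and lam only

# The normalized trajectories coincide step for step.
np.testing.assert_allclose(w_a / np.linalg.norm(w_a),
                           w_b / np.linalg.norm(w_b), rtol=0, atol=1e-6)
```

The agreement is exact up to floating point error because, whenever the two weight vectors are parallel, the projection in `wn_backward` scales their gradients by exactly the inverse of the scale between them.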
To show how weight decay regulates the effective learning rate, we plot the mean effective learning rate of all the filters from the first layer of WN-ResNet50 and WS-ResNet50 throughout the training process in Fig. 2, where weight decay ratios of 0 and 1e-4 are applied to the corresponding convolutional layers respectively. It is observed that the effective learning rate is appropriately kept in a relatively large range when weight decay is applied, which leads to a better optimized solution.
One more interesting empirical observation is that the control of the effective learning rate by weight decay in the early stage is quite similar to a warmup process, and the effectiveness of warmup has been generally confirmed in [he2019bag, you2017large, goyal2017accurate, liu2019variance]. As shown in Fig. 4, we sample and investigate a set of convolutional layers, and observe that almost all of the layers show an increase in the effective learning rate over several epochs at the beginning of training. This implicit warmup effect may explain why networks equipped with WS can have slightly improved performance [weightstandardization]. To investigate further, we conduct additional experiments by explicitly adding a warmup process (i.e., linearly increasing the learning rate from 0 to 0.1 during the first 5 epochs) to the training of ResNet50, and thus partially confirm this conclusion in Table I. Further, the effective learning rates of the first convolutional layer for training WS-ResNet50 and ResNet50 with warmup are depicted in Fig. 5. Note that the definition of effective learning rate for ResNet50 follows [hoffer2018norm], in order to keep them at a similar magnitude. We surprisingly find that the two curves are very closely matched, which implies that training networks with the weight normalization family and weight decay can implicitly enjoy certain benefits of the warmup technique.
4 Problems via Introducing Weight Decay in Weight Normalization Family
Despite the certain practical success and benefits of applying traditional weight decay to control the effective learning rate, there are still several serious problems in essence, which were rarely revealed or noticed before our work. In this section, we discuss these problems of introducing the weight decay term into the loss objective for the weight normalization family in detail.
4.1 No Guarantee of Global Minimum
We first consider WN. If we introduce the weight decay term $\frac{\lambda}{2}\|w\|_2^2$ into the final loss objective, i.e., Eq. (4), we can prove that for $\lambda > 0$ the entire loss function theoretically has no global minimum. Here we use proof by contradiction:
If there exists a global minimum $w^*$ such that the objective (Eq. (4)) is minimized, then we have the smallest loss
$$L(w^*) = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}^*; x_i, y_i) + \frac{\lambda}{2}\|w^*\|_2^2 \qquad (21)$$
Let us take a real number $k \in (0, 1)$ and form a new solution $w' = kw^*$. Then we have:
$$L(w') = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}^*; x_i, y_i) + \frac{\lambda}{2}k^2\|w^*\|_2^2 < L(w^*) \qquad (22)$$
which contradicts the assumption that $L(w^*)$ is the smallest. A similar conclusion holds for WS.
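The contradiction can also be observed numerically; here is a small sketch of ours with an arbitrary task loss that sees only the normalized weights:

```python
import numpy as np

def objective(w, lam=1e-2):
    # Eq. (4)-style objective: a task loss that sees only w_hat, plus decay.
    w_hat = w / np.linalg.norm(w)
    task = 0.5 * np.sum((w_hat - 1.0) ** 2)
    return task + 0.5 * lam * np.sum(w ** 2)

rng = np.random.default_rng(4)
w = rng.normal(size=8)

# Shrinking w keeps the task loss fixed (w_hat is unchanged) but strictly
# lowers the decay term, so any candidate minimum can always be improved.
losses = [objective(k * w) for k in (1.0, 0.5, 0.25, 0.125)]
assert all(a > b for a, b in zip(losses, losses[1:]))
```

Every shrinking step strictly decreases the objective while the infimum (the bare task loss) is never attained at any finite nonzero weight magnitude.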
4.2 Training Instability
Given enough training iterations, the objective (Eq. (10)) will continuously push the length of the weights (i.e., $\|w\|_2$) towards 0. The effective learning rate is inversely proportional to the squared weight length, which makes floating point overflow far more likely and leads to failed training.
Type               Top-1/5 Acc (%)
ResNet50           76.54/93.07
ResNet50 + warmup  76.81/93.20
WS-ResNet50        76.74/93.28
Top-1 Acc (%) w.r.t. $\lambda$    1e-2   1e-3   1e-4   1e-5   0
ResNet50        SGD               47.67  74.12  76.54  74.80  72.65
                Adam              19.42  35.68  52.97  63.46  72.50
WN-ResNet50     SGD               –      –      76.44  74.65  72.86
                Adam              –      –      –      –      72.34
WS-ResNet50     SGD               –      –      76.74  74.70  72.92
                Adam              –      –      –      –      72.85
("–" denotes a failed training run.)
Specifically, we find that an improper selection of $\lambda$ can actually cause training instability. When we choose a slightly larger $\lambda$ for an optimizer, some of the weights in the network quickly converge towards 0, pushing the effective learning rate close to infinity. Thereby the numerical gradient updates exceed the float representation of the computational resource, resulting in a training failure. Table II shows the impact of $\lambda$ on network training with two widely used optimizers, SGD [sutskever2013importance] with momentum and Adam [kingma2014adam], where it is much easier to have a failed training for networks with normalized weights. Moreover, the adaptive gradient method (e.g., Adam) even fails to train successfully unless we discard weight decay by setting $\lambda = 0$. Since Adam accumulates the square of the gradients during the optimization process, it is more likely to encounter the risk of floating point overflow.
To better illustrate the gradient float overflow risks, we demonstrate the maximal $\frac{1}{\|w\|_2}$ for WN-ResNet50 and the maximal $\frac{1}{\sigma_w}$ for WS-ResNet50 in the case of $\lambda$ = 1e-3 (for all corresponding convolutional layers) during SGD optimization in Fig. 6, where the maximal $\frac{1}{\|w\|_2}$ (or $\frac{1}{\sigma_w}$) grows exponentially large and eventually leads to gradient float overflow after about 50k iterations.
One may argue that the practical implementation of WS [weightstandardization] already considers the risk of float overflow in the original paper by adding a positive constant $\epsilon$ to the denominator of the standardization:
$$\hat{w} = \frac{w - \mu_w}{\sigma_w + \epsilon} \qquad (23)$$
Here we clarify that adding $\epsilon$ only in the standardization part is definitely not enough to prevent the gradient float overflow problem. To explain, we can derive the gradient of $\hat{w}$ w.r.t. $w$ according to Eq. (23):
$$\frac{\partial \hat{w}_j}{\partial w_k} = \frac{\delta_{jk} - \frac{1}{n}}{\sigma_w + \epsilon} - \frac{(w_j - \mu_w)(w_k - \mu_w)}{n\,\sigma_w(\sigma_w + \epsilon)^2} \qquad (24)$$
where the individual standard deviation term $\sigma_w$ still appears in the denominator, and gradient float overflow can consequently still take place. Therefore, it is necessary to propose a different approach to address the problem.
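The Jacobian in Eq. (24) can be verified by finite differences; the sketch below (a single-vector simplification of ours for the per-filter standardization) makes the bare $\sigma_w$ in the second term explicit:

```python
import numpy as np

def ws(w, eps=1e-5):
    # WS with the overflow guard of Eq. (23): eps added to the denominator.
    return (w - w.mean()) / (w.std() + eps)

def ws_jacobian(w, eps=1e-5):
    # Analytic Jacobian of ws(). Note the bare sigma (without eps) in the
    # second term: it comes from d(sigma)/dw_k = (w_k - mu) / (n * sigma).
    n = w.size
    sigma = w.std()
    c = w - w.mean()
    first = (np.eye(n) - 1.0 / n) / (sigma + eps)
    second = np.outer(c, c) / (n * sigma * (sigma + eps) ** 2)
    return first - second

rng = np.random.default_rng(5)
w = rng.normal(size=6)

h = 1e-6
num = np.empty((6, 6))
for k in range(6):
    d = np.zeros(6)
    d[k] = h
    num[:, k] = (ws(w + d) - ws(w - d)) / (2 * h)   # central differences

np.testing.assert_allclose(ws_jacobian(w), num, rtol=1e-5, atol=1e-6)
```

The guard $\epsilon$ only bounds the forward standardization; the backward pass still divides by the raw $\sigma_w$ when differentiating the standard deviation itself.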
5 Methods
This section describes our proposed $\epsilon$-shifted $L_2$ regularizer, designed to address the above problems.
5.1 $\epsilon$-shifted $L_2$ Regularizer
As stated in Sec. 3.2, when training the weight normalization family with a weight decay term, it is desirable to design a mechanism that can prevent the network weights from shrinking towards 0 for any choice of hyperparameter $\lambda$ or optimizer. Given such an insight, we start by investigating the lack-of-global-minimum problem, where Eq. (10) is reviewed again:
We notice that the central reason for the missing global minimum is the regularization term $\frac{\lambda}{2}\|w\|_2^2$ (equivalently $\frac{\lambda}{2}g^2$). During optimization, this term has a large chance to continuously drive the magnitude $\|w\|_2$ infinitely close to 0, pushing the gradient towards infinity and thus leading to training failures. To avoid such risks, we propose the $\epsilon$-shifted $L_2$ regularizer, which constrains $\|w\|_2$ from being too small via a positive constant $\epsilon$:
$$L(w) = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}; x_i, y_i) + \frac{\lambda}{2}\left(\|w\|_2 - \epsilon\right)^2 \qquad (25)$$
For the case of WS, given the standard deviation $\sigma_w$, by modifying Eq. (13) we have:
$$L(w) = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{w}; x_i, y_i) + \frac{\lambda}{2}\left(\sigma_w - \epsilon\right)^2 \qquad (26)$$
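In code, the proposed regularizer is essentially a one-line change. Below is a sketch for the WN form (assuming the penalty $\frac{\lambda}{2}(\|w\|_2-\epsilon)^2$ as in Eq. (25)), checking that the penalty is minimized exactly at $\|w\|_2 = \epsilon$ rather than at 0:

```python
import numpy as np

def shifted_l2(w, lam=1e-4, eps=1e-3):
    # eps-shifted L2: (lam/2) * (||w||_2 - eps)^2 instead of (lam/2) * ||w||_2^2
    return 0.5 * lam * (np.linalg.norm(w) - eps) ** 2

rng = np.random.default_rng(6)
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)

# The penalty vanishes exactly at ||w||_2 = eps, so the optimizer is no
# longer pushed to drive the weight magnitude all the way to 0.
norms = [1e-4, 1e-3, 1e-2, 1e-1]
vals = [shifted_l2(n * direction) for n in norms]
assert vals[1] == min(vals) and vals[1] < 1e-20
```

The hyperparameter values `lam` and `eps` above are illustrative defaults of ours, not prescriptions from the experiments.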
5.2 Guarantee of Global Minimum
Thanks to the introduction of the $\epsilon$-shifted $L_2$ regularizer, the modified loss objectives (i.e., Eq. (25) and Eq. (26)) can now guarantee the existence of a global minimum. For WN, suppose $\hat{w}^*$ is the optimal solution to $\min_{\hat{w}} \hat{L}(\hat{w})$; then the global minimal solution of Eq. (25) is $w^* = \epsilon\hat{w}^*$. Therefore, the minimized objective is equal to $\hat{L}(\hat{w}^*)$, where the additional $\epsilon$-shifted $L_2$ regularizer is utilized during optimization for two main purposes: 1) controlling the effective learning rate to help networks converge, and 2) preventing the magnitude of the weights from being too small, thus avoiding gradient float overflow and training failures.
5.3 Dynamic Decay Mechanism
In addition, we find that the $\epsilon$-shifted $L_2$ regularizer dynamically adjusts the decay coefficient according to the current magnitude of the training weights during optimization. In the case of WN, the $\epsilon$-shifted $L_2$ regularizer term is $\frac{\lambda}{2}(\|w\|_2 - \epsilon)^2$, whose gradient w.r.t. $w$ is:
$$\frac{\partial}{\partial w}\left[\frac{\lambda}{2}\left(\|w\|_2 - \epsilon\right)^2\right] = \lambda\left(1 - \frac{\epsilon}{\|w\|_2}\right)w \qquad (27)$$
It can also be regarded as an adaptive version of traditional weight decay (i.e., $\lambda w$), which uses the dynamic magnitude of the training weights to slightly adjust the decay ratio $\lambda\left(1 - \frac{\epsilon}{\|w\|_2}\right)$. When $\|w\|_2$ is relatively large, $\lambda\left(1 - \frac{\epsilon}{\|w\|_2}\right)$ is also relatively large, meaning that a larger factor is used to shrink the larger weights, which is reasonably intuitive. This probably explains why applying the $\epsilon$-shifted $L_2$ regularizer yields slight improvements in our experiments.
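The dynamic decay ratio can be checked against finite differences (again assuming the $\frac{\lambda}{2}(\|w\|_2-\epsilon)^2$ form; the helper names are ours):

```python
import numpy as np

def shifted_l2(w, lam=1e-4, eps=1e-3):
    return 0.5 * lam * (np.linalg.norm(w) - eps) ** 2

def shifted_l2_grad(w, lam=1e-4, eps=1e-3):
    # Eq. (27): an adaptive decay ratio lam * (1 - eps/||w||_2) applied to w.
    return lam * (1.0 - eps / np.linalg.norm(w)) * w

rng = np.random.default_rng(7)
w = rng.normal(size=8)

# Finite-difference check of the dynamic-decay gradient.
h = 1e-6
num = np.array([(shifted_l2(w + h * e) - shifted_l2(w - h * e)) / (2 * h)
                for e in np.eye(8)])
np.testing.assert_allclose(shifted_l2_grad(w), num, rtol=1e-4, atol=1e-12)

# Larger weights receive a larger decay ratio.
assert (1 - 1e-3 / np.linalg.norm(10 * w)) > (1 - 1e-3 / np.linalg.norm(w))
```

Note that the ratio even changes sign when $\|w\|_2 < \epsilon$, which pushes the weight magnitude back up towards $\epsilon$ instead of shrinking it further.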
For the case of WS, the gradient of the regularizer w.r.t. $w$ is:
$$\frac{\partial}{\partial w}\left[\frac{\lambda}{2}\left(\sigma_w - \epsilon\right)^2\right] = \frac{\lambda}{n}\left(1 - \frac{\epsilon}{\sigma_w}\right)(w - \mu_w) \qquad (28)$$
where a similar analysis can be conducted.
6 Experiments
In this section, we conduct extensive experiments on both classification and detection tasks to validate the effectiveness of the proposed $\epsilon$-shifted $L_2$ regularizer.
6.1 Experimental Settings
To validate the effectiveness of the proposed $\epsilon$-shifted $L_2$ regularizer, we conduct comprehensive experiments on the ImageNet [deng2009imagenet] and CIFAR100 [krizhevsky2009learning] classification datasets and the COCO [lin2014microsoft] detection dataset. For fair comparisons, all experiments are run under a unified PyTorch [paszke2017automatic] framework, including the results of every baseline model. More details can be found in our public code base: https://github.com/implus/PytorchInsight. We mainly conduct experiments based on the state-of-the-art Weight Normalization (WN) [salimans2016weight] and Weight Standardization (WS) [weightstandardization] from the weight normalization family.
ImageNet classification: The ILSVRC 2012 classification dataset [deng2009imagenet] contains 1.2 million images for training and 50K for validation, from 1K classes. The training settings for large models are kept similar to [li2019spatial], except that we set the weight decay ratio to 0 for all the bias parts in the networks [he2019bag], which generally improves about 0.2% over all the baselines in this paper. We train networks on the training set and report the Top-1 (and Top-5) accuracies on the validation set with a single 224×224 central crop. For data augmentation, we follow the standard practice [szegedy2015going] and perform random-size cropping and random horizontal flipping. All networks are trained with naive softmax cross entropy without label-smoothing regularization [szegedy2016rethinking]. We train all the architectures from scratch by SGD [sutskever2013importance] or Adam [kingma2014adam, loshchilov2018decoupled]. SGD is with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. Adam keeps the default settings, with learning rate 0.001, $\beta_1$ = 0.9 and $\beta_2$ = 0.999. The total batch size is set to 256 and 8 GPUs (32 images per GPU) are utilized for training. The default weight initialization strategy is from [he2015delving], where we specifically use the 'fan_out' mode. The training settings for small models (i.e., ShuffleNet [zhang2018shufflenet, ma2018shufflenet] and MobileNet [howard2017mobilenets, sandler2018mobilenetv2]) are slightly different, following their original papers [howard2017mobilenets, zhang2018shufflenet]: the default weight decay is 4e-5 with 300 total epochs. Warmup [goyal2017accurate], cosine learning rate decay [loshchilov2016sgdr], label smoothing [szegedy2016rethinking] and no weight decay on all depthwise convolutional/BN layers [jia2018highly] are applied as well.
One should notice that small models are more difficult to train to higher accuracy, so many papers on small models [howard2017mobilenets, zhang2018shufflenet] adopt the training tricks described above. Also importantly, the weight normalization family should not be applied to depthwise convolutions (common in those small architectures) in practice, since the number of parameters in each normalized group is too small; otherwise we observe severe performance degradations. In order to make a fair comparison, especially to make the performance of our re-implemented baselines reach the accuracy reported in the referenced papers, we use these tricks in the training of all small models. Note that in our experiments, only the normalized weights are trained with the $\epsilon$-shifted $L_2$ regularizer; the others (BN and fc layer weights) keep the traditional weight decay term (if it exists) since they do not suffer from these problems.
Top-1 Acc (%) w.r.t. $\lambda$                          1e-2   1e-3   1e-4   1e-5
WN-ResNet50 + $\epsilon$-shifted $L_2$   SGD            71.86  75.31  76.52  74.63
                                         Adam           64.31  64.47  65.92  68.23
WS-ResNet50 + $\epsilon$-shifted $L_2$   SGD            72.15  75.68  76.86  74.99
                                         Adam           64.56  64.71  66.17  68.45
CIFAR100 classification: The CIFAR100 dataset [krizhevsky2009learning] consists of colored natural images with 32×32 pixels, drawn from 100 classes. The training and test sets contain 50K and 10K images, respectively. Apart from the standard data augmentation scheme widely used on this dataset [he2016identity, huang2017densely, wang2018mixed], we also add two recent popular methodologies, namely cutout [devries2017improved] and mixup [zhang2017mixup], to further reduce overfitting risks, keeping their respective default hyperparameters in training. For preprocessing, we normalize the data using the channel means and standard deviations. The networks are trained with batch size 64 on one GPU. The training is with weight decay 0.0005 and momentum 0.9 for 300 epochs, starting from learning rate 0.05, which is decreased at the 150th and 225th epochs by a factor of 10.
COCO detection: The COCO 2017 dataset [lin2014microsoft] comprises 118K images in the train set and 5K images in the validation set. We follow the standard setting [he2017mask] of evaluating object detection via the standard mean Average Precision (AP) and mean Average Recall (AR) [ren2015faster] scores at different box IoUs or object scales, respectively. We use the standard configuration of Cascade R-CNN [cai2018cascade] with FPN [lin2017feature] and ResNet as the backbone architecture. The input images are resized such that their shorter side is 800 pixels. We train on 8 GPUs with 2 images per GPU. The backbones of all models are pretrained on ImageNet classification, then all layers except for c1 and c2 are jointly finetuned with the detection heads. The end-to-end training introduced in [ren2015faster] is adopted for our implementation, which yields better results. We utilize the conventional finetuning setting [ren2015faster] by freezing the learnable parameters of the BN layers. All models are trained for 20 epochs using synchronized SGD with a weight decay of 1e-4 and momentum of 0.9. The learning rate is initialized to 0.02 and decays by a factor of 10 at the 16th and 19th epochs. The choice of hyperparameters also follows the latest release of the Cascade R-CNN benchmark [mmdetection]. For more experiments with other detector frameworks, e.g., Faster R-CNN [ren2015faster] and Mask R-CNN [he2017mask], we exactly follow the official settings of the baselines described in [mmdetection].
Top-1 Acc (%) w.r.t. ε | 0 | 1e-2 | 1e-3 | 1e-4 | 1e-5
WS-ResNet50 | 76.74 | 76.60 | 76.86 | 76.84 | 76.71
Top-1 Acc (%), λ = 1e-4 | baseline | WS | WS + ε
ResNet50 [he2016deep] | 76.54 | 76.74 | 76.86
ResNet101 [he2016deep] | 78.17 | 78.07 | 78.29
ResNeXt50 [xie2017aggregated] | 77.64 | 77.76 | 77.88
ResNeXt101 [xie2017aggregated] | 78.71 | 78.68 | 78.80
SE-ResNet101 [hu2018squeeze] | 78.43 | 78.65 | 78.75
DenseNet201 [huang2017densely] | 77.54 | 77.56 | 77.59
Top-1 Acc (%), λ = 4e-5 | baseline | WS | WS + ε
ShuffleNetV1 1x (g=8) [zhang2018shufflenet] | 67.62 | 67.84 | 68.09
ShuffleNetV2 1x [ma2018shufflenet] | 69.64 | 69.66 | 69.70
MobileNetV1 1x [howard2017mobilenets] | 73.55 | 73.56 | 73.60
MobileNetV2 1x [sandler2018mobilenetv2] | 73.14 | 73.17 | 73.22
6.2 Training Stability
The proposed ε-shifted regularizer guarantees a global minimum for the magnitude of the weights, which also prevents the weights from becoming too small and thus avoids the risk of gradient float overflow. To verify this, we traverse the hyperparameter λ over a large range and employ the ε-shifted regularizer to train networks with the weight normalization family (namely, WN and WS), yielding the results in Table III. Compared with Table II, we find that the ε-shifted regularizer not only greatly improves training stability, i.e., no matter how λ changes, the optimization finally converges to a good solution with no training failures for any type of optimizer, but also slightly boosts generalization performance in the comparable cases of λ = 1e-4 and 1e-5.
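A minimal sketch of the shifted penalty follows. We write it as λ(‖W‖₂ − ε)², our reading of "shifting the objective by a positive constant ε" (the global minimum then sits at ‖W‖₂ = ε > 0 rather than at the all-zero weights); the exact form and the function name are our assumptions, not necessarily the paper's precise formula:

```python
import math

def shifted_l2(weights, lam, eps):
    """Sketch of the epsilon-shifted regularizer: lam * (||W||_2 - eps)^2.
    With eps > 0 the global minimum sits at ||W||_2 = eps, so the optimal
    weight magnitude is bounded away from zero; eps = 0 recovers plain
    weight decay lam * ||W||_2^2."""
    norm = math.sqrt(sum(w * w for w in weights))
    return lam * (norm - eps) ** 2

# Plain weight decay (eps = 0) is minimized only by all-zero weights,
assert shifted_l2([0.0, 0.0], lam=1e-4, eps=0.0) == 0.0
# while the shifted version is minimized at magnitude eps ...
assert shifted_l2([3.0, 4.0], lam=1e-4, eps=5.0) == 0.0  # ||W|| = 5 = eps
# ... and penalizes weights that collapse toward zero:
assert shifted_l2([0.0, 0.0], lam=1e-4, eps=5.0) > 0.0
```
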
In our experiments, we empirically find that, with the ε-shifted regularizer applied, the magnitude of the training weights is kept no smaller than ε during the entire optimization. To give a better illustration, we depict the maximal gradient magnitude for WS-ResNet50 trained with the ε-shifted regularizer and λ = 1e-3 over the training iterations, varying ε over {1e-2, 1e-3, 1e-4}. For a better comparison, we also plot the curve without the shift (i.e., ε = 0), which is exactly the red curve in Fig. 6. As can be seen in Fig. 7, the shift ε in fact limits the range of gradient floats and thus successfully prevents the training failures.
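The float-overflow mechanism can be seen numerically. Because the layer consumes W/‖W‖, the gradient flowing back to W is scaled by 1/‖W‖, so once ‖W‖ collapses toward zero this factor exceeds the float range. A toy illustration (the 1/‖W‖ scaling follows from differentiating W/‖W‖; the helper name is ours):

```python
def effective_grad_scale(norm, upstream=1.0):
    """With normalized weights W/||W||, the gradient flowing back to W
    carries a 1/||W|| factor, so a shrinking magnitude inflates it."""
    return upstream / norm

# A moderate magnitude keeps gradients tame ...
assert effective_grad_scale(0.5) == 2.0
# ... but magnitudes below ~5.6e-309 push the factor past float64's maximum:
assert effective_grad_scale(1e-320) == float('inf')
# Keeping ||W|| >= eps bounds the factor by 1/eps (e.g. 1e3 for eps = 1e-3).
```
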
6.3 Parameter Sensitivity
In this subsection, we are interested in the selection of ε, as it is the only hyperparameter of the proposed regularizer. To investigate its sensitivity, we traverse ε over {1e-2, 1e-3, 1e-4, 1e-5} when training WS-ResNet50 under λ = 1e-4, as shown in Table IV. The results demonstrate that the final generalization performance is robust to the choice of ε, where the more appropriate selections (i.e., ε = 1e-3 and 1e-4) consistently bring slight improvements in accuracy.
6.4 Extension to More Architectures
Further, we apply the ε-shifted regularizer to more state-of-the-art network structures [he2016deep, xie2017aggregated, hu2018squeeze, huang2017densely, zhang2018shufflenet, ma2018shufflenet, howard2017mobilenets, sandler2018mobilenetv2] and compare it to the original baselines and the traditional WS version using SGD. For the WS-equipped networks, we set λ = 1e-4 and search ε over {1e-3, 1e-5, 1e-8} to report our results. As can be seen from Table V, while keeping excellent training stability, the ε-shifted regularizer also achieves very competitive results for both large and small models. We also empirically demonstrate in Fig. 8 that the ε-shifted regularizer still speeds up convergence over the original baseline.
Type | Backbone | Top-1 Acc (%)
baseline | ResNeXt-29 16x64d [xie2017aggregated] | 83.68
WS | WS-ResNeXt-29 16x64d [xie2017aggregated] | 83.48
WS + ε | WS-ResNeXt-29 16x64d [xie2017aggregated] | 84.39 (+0.71)
baseline | SE-ResNeXt-29 16x64d [hu2018squeeze] | 84.66
WS | WS-SE-ResNeXt-29 16x64d [hu2018squeeze] | 84.55
WS + ε | WS-SE-ResNeXt-29 16x64d [hu2018squeeze] | 84.87 (+0.21)
6.5 Extension to Other Datasets
We further verify whether the effectiveness of the ε-shifted regularizer generalizes to datasets beyond ImageNet. Here we choose the widely used CIFAR100, and mainly validate on two strong backbones: ResNeXt-29 16x64d [xie2017aggregated] and SE-ResNeXt-29 16x64d [hu2018squeeze]. In the experiments, we typically find that ε = 1e-2 brings consistent gains for the ε-shifted regularizer. From the results in Table VI, we are surprised to discover that on CIFAR100 the performance of the WS-equipped (SE-)ResNeXt-29 16x64d networks without the ε-shifted regularizer slightly declines compared to the baseline. In contrast, the ε-shifted regularizer still improves recognition accuracy over the baseline models, which demonstrates its high potential in practice. Furthermore, the training and validation curves of the ε-shifted regularizer (i.e., WS + ε) converge significantly faster than those of the baseline and WS, as depicted in Fig. 9.
Cascade R-CNN [cai2018cascade] | Backbone | AP | AP50 | AP75 | APS | APM | APL | ARS | ARM | ARL
baseline | ResNet50 | 41.1 | 59.3 | 44.8 | 22.6 | 44.5 | 54.9 | 33.2 | 58.8 | 70.7
WS | WS-ResNet50 | 41.6 | 60.1 | 45.2 | 23.4 | 44.7 | 55.6 | 34.2 | 58.2 | 71.0
WS + ε | WS-ResNet50 | 41.8 (+0.7) | 60.2 | 45.5 | 23.4 | 45.0 | 55.4 | 33.9 | 58.9 | 71.8
baseline | ResNet101 | 42.6 | 60.9 | 46.4 | 23.7 | 46.1 | 56.9 | 34.5 | 59.8 | 72.0
WS | WS-ResNet101 | 43.2 | 61.6 | 47.2 | 24.8 | 46.7 | 57.8 | 34.8 | 59.7 | 72.2
WS + ε | WS-ResNet101 | 43.5 (+0.9) | 61.7 | 47.5 | 23.9 | 47.1 | 58.4 | 33.4 | 60.2 | 72.4
6.6 Extension to Detection Tasks
We are also interested in whether the ε-shifted regularizer still works in downstream tasks beyond image classification, e.g., object detection. Here we choose one of the most advanced object detectors, Cascade R-CNN [cai2018cascade], for evaluation and conduct comprehensive experiments on the COCO dataset [lin2014microsoft]. The pretrained models with the best performance are utilized to initialize the detector backbones. From Table VII, it is observed that the ε-shifted regularizer has the potential to significantly boost the overall performance of the detectors, especially for the large ResNet101 backbone. Specifically, it improves nearly 1% absolute AP over the original baseline, and outperforms the baseline on all the other metrics, i.e., AP/AR at different object scales. We also conduct experiments on other state-of-the-art detectors and observe consistent improvements in Table VIII, which demonstrates its wide applicability.
Faster R-CNN [ren2015faster] | Backbone | AP | AP50 | AP75
baseline | ResNet50 | 37.7 | 59.3 | 41.1
WS + ε | WS-ResNet50 | 37.9 | 59.7 | 40.9
baseline | ResNet101 | 39.4 | 60.7 | 43.0
WS + ε | WS-ResNet101 | 39.8 | 60.8 | 43.5
Mask R-CNN [he2017mask] | Backbone | AP | AP50 | AP75
baseline | ResNet50 | 38.6 | 60.0 | 41.9
WS + ε | WS-ResNet50 | 38.9 | 60.1 | 42.2
baseline | ResNet101 | 40.4 | 61.6 | 44.2
WS + ε | WS-ResNet101 | 41.1 | 62.2 | 45.0
6.7 Important Practices for Weight Normalization Family
In the original papers of the weight normalization family [salimans2016weight, weightstandardization], the authors rarely discuss where WN or WS should be used in deep neural networks. The default is to place WN or WS on all conventional convolutional layers, while the BN and fc layers do not participate in the WN/WS operations. However, in our practice, this is not always the best option. For depthwise convolutions or group convolutions with very few parameters per group, using WN or WS can cause severe performance degradation on both the train and test sets. We speculate that when normalizing only a few parameters, since the parameter group itself has very few degrees of freedom, normalization or standardization further reduces them, leading to extremely limited representational ability. One example is the learnable scale parameter γ of BN. It is essentially equivalent to a 1×1 depthwise convolution, where each parameter group contains only one variable. If we normalize it, it becomes a fixed constant (in the case of WN), which obviously cannot learn the effect of scaling features.
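The degeneracy argued above is easy to verify with toy per-group implementations of the two operations (simplified sketches, without WN's learnable gain; the helper names are ours):

```python
import math

def weight_normalize(ws):
    """WN-style normalization without the learnable gain: w -> w / ||w||_2."""
    norm = math.sqrt(sum(w * w for w in ws))
    return [w / norm for w in ws]

def weight_standardize(ws, eps=1e-5):
    """WS-style standardization of one parameter group: zero mean, unit std."""
    mu = sum(ws) / len(ws)
    var = sum((w - mu) ** 2 for w in ws) / len(ws)
    return [(w - mu) / math.sqrt(var + eps) for w in ws]

# A group holding a single scalar (e.g. BN's per-channel gamma, viewed as a
# 1x1 depthwise convolution) degenerates: WN collapses it to a fixed +/-1 ...
assert weight_normalize([0.5]) == [1.0]
assert weight_normalize([-4.0]) == [-1.0]
# ... and WS maps any value to exactly zero, erasing the parameter entirely.
assert weight_standardize([0.7]) == [0.0]
```
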
The experiments conducted above mainly avoid these risks. For example, in small models like ShuffleNetV2, MobileNetV1 and MobileNetV2, we do not apply WS on the depthwise convolutions. And for ShuffleNetV1, we suggest not equipping the group convolutions with WS. To be specific, we list the results of using or not using WS on the depthwise convolutions in Table IX. It can be observed that whether WS is applied on the depthwise convolutions results in a very large performance gap.
Backbone (Top-1/Top-5 Acc, %) | w/ WS on DW conv | w/o WS on DW conv
WS-ShuffleNetV2 1x [ma2018shufflenet] | 63.79/84.63 | 69.66/88.76
WS-MobileNetV2 1x [sandler2018mobilenetv2] | 69.74/89.18 | 73.17/91.05
7 Conclusions
In this paper, we first review the disharmony between the weight normalization family and weight decay, i.e., the counterintuitive underfitting risk caused by decaying the normalized weights. Then, we theoretically answer this question with two pieces of evidence: 1) weight decay does not change the optimization goal, and 2) it maintains an appropriate effective learning rate for better convergence. After that, we expose the concrete problems of introducing a fixed weight decay term into the loss objective, including the missing global minimum and training instability. Finally, to solve these potential problems, we propose the ε-shifted regularizer, which shifts the objective by a positive constant ε. The shift ε prevents network weights from becoming too small, so that gradient float overflow risks are avoided directly. Comprehensive analyses demonstrate that the proposed ε-shifted regularizer successfully guarantees the global minimum and significantly improves training stability, whilst maintaining superior performance.