Understanding the Disharmony between Weight Normalization Family and Weight Decay: $\epsilon$-shifted $L_{2}$ Regularizer


Xiang Li, Shuo Chen, Yan Xia and Jian Yang* Xiang Li, Shuo Chen, and Jian Yang are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (E-mail: (xiang.li.implus, shuochen, csjyang)@njust.edu.cn). Xiang Li is also a visiting scholar at Momenta. Yan Xia (E-mail: xiayan@momenta.ai) is the research and development director of Momenta. (Corresponding author: Jian Yang)
Abstract

The merits of fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods use standardization or normalization that changes the weight $\mathbf{w}$ to $\hat{\mathbf{w}}$ (e.g., $\hat{\mathbf{w}} = \frac{\mathbf{w}}{\|\mathbf{w}\|}$), which makes $\hat{\mathbf{w}}$ independent of the magnitude of $\mathbf{w}$. Surprisingly, $\mathbf{w}$ must be decayed during gradient descent, otherwise we will observe a severe under-fitting problem, which is very counter-intuitive since weight decay is widely known to prevent deep networks from over-fitting. Moreover, if we substitute $\hat{\mathbf{w}} = \frac{\mathbf{w}}{\|\mathbf{w}\|}$ (e.g., weight normalization) in the original loss function $\frac{1}{N}\sum_{i}\ell(\mathbf{w}^{\top}\mathbf{x}_{i}, y_{i}) + \frac{\lambda}{2}\|\mathbf{w}\|^{2}$, it is observed that the regularization term $\frac{\lambda}{2}\|\hat{\mathbf{w}}\|^{2}$ will be canceled as a constant in the optimization objective. Therefore, to decay $\mathbf{w}$, we need to explicitly append this term: $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$. In this paper, we theoretically prove that $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ merely modulates the effective learning rate for improving objective optimization, and has no influence on generalization when the weight normalization family is compositely employed. Furthermore, we also expose several critical problems when introducing the weight decay term to the weight normalization family, including the missing of a global minimum and training instability. To address these problems, we propose an $\epsilon$-shifted $L_{2}$ regularizer, which shifts the $L_{2}$ objective by a positive constant $\epsilon$. Such a simple operation can theoretically guarantee the existence of a global minimum, while preventing the network weights from being too small and thus avoiding gradient float overflow. It significantly improves the training stability and can achieve slightly better performance in our practice. The effectiveness of the $\epsilon$-shifted $L_{2}$ regularizer is comprehensively validated on the ImageNet, CIFAR-100, and COCO datasets. Our code and pretrained models will be released at https://github.com/implus/PytorchInsight.

Weight normalization, weight standardization, weight decay, deep neural networks, disharmony, gradient float overflow.

1 Introduction

The normalization methodologies on features have made great progress in recent years, with the introduction of BN [ioffe2015batch], IN [ulyanov2016instance], LN [ba2016layer], GN [wu2018group] and SN [luo2018differentiable]. These methods mainly focus on a zero mean and unit variance normalization operation on a specific dimension (or multiple dimensions) of features, which makes deep neural architectures [he2016deep, he2016identity, huang2017densely, wang2018mixed] much easier to optimize, leading to robust solutions with favorable generalization performance.

Beyond feature normalization, there is increasing interest in the normalization of network weights. Weight Normalization (WN) [salimans2016weight] first separates the learning of the length and direction of the weights, and it performs satisfactorily on several relatively small datasets. In some contexts of generative adversarial networks (GAN) [goodfellow2014generative], Weight Normalization with Translated ReLU [xiang2017effects] is shown to achieve superior results. Later, Centered Weight Normalization (CWN) [huang2017centered] further strengthens WN by additionally centering the input weights, which further improves the conditioning and convergence speed. Recently, very similar to CWN, Weight Standardization (WS) [weightstandardization] standardizes the weights to zero mean and unit variance. On large-scale tasks (ImageNet [deng2009imagenet] classification/COCO [lin2014microsoft] detection), WS further enhances optimization convergence and generalization performance in cooperation with feature normalizations such as GN and BN.

In terms of the weight normalization family, despite its appealing success, there is still one confusing mystery – the disharmony between the weight normalization family and weight decay [krogh1992simple]. Note that weight decay is widely interpreted as a form of $L_{2}$ regularization [loshchilov2017fixing] because it can be derived from the gradient of the $L_{2}$ norm of the weights [loshchilov2018decoupled]. Specifically, we consider training a single-layer, single-output neural network whose prediction is $\mathbf{w}^{\top}\mathbf{x}$, with the following loss function to be minimized:

$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\ell\big(\mathbf{w}^{\top}\mathbf{x}_{i}, y_{i}\big) + \frac{\lambda}{2}\|\mathbf{w}\|^{2}$   (1)

In Eq. (1), $N$ denotes the number of training samples, $\ell(\cdot, \cdot)$ is the task-related loss w.r.t. the input/label pair $(\mathbf{x}_{i}, y_{i})$, and $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ is the regularization term with a constant $\lambda$ to balance it against the task-related part. For simplicity, we use weight normalization to re-parameterize $\mathbf{w}$ as $\hat{\mathbf{w}} = \frac{\mathbf{w}}{\|\mathbf{w}\|}$, regardless of the learning of its length. By substituting $\hat{\mathbf{w}}$ for $\mathbf{w}$ in Eq. (1), we get $\frac{1}{N}\sum_{i=1}^{N}\ell(\hat{\mathbf{w}}^{\top}\mathbf{x}_{i}, y_{i}) + \frac{\lambda}{2}$ (since $\|\hat{\mathbf{w}}\|^{2} = 1$), which is equivalent to minimizing the following function

$L(\hat{\mathbf{w}}) = \frac{1}{N}\sum_{i=1}^{N}\ell\Big(\frac{\mathbf{w}^{\top}\mathbf{x}_{i}}{\|\mathbf{w}\|}, y_{i}\Big)$   (2)

Interestingly, the weight decay term has indeed disappeared. In the case of WS, we can get similar conclusions by replacing $\mathbf{w}$ with $\hat{\mathbf{w}} = \frac{\mathbf{w} - \mu_{\mathbf{w}}}{\sigma_{\mathbf{w}}}$:

$L(\hat{\mathbf{w}}) = \frac{1}{N}\sum_{i=1}^{N}\ell\Big(\frac{(\mathbf{w}-\mu_{\mathbf{w}})^{\top}\mathbf{x}_{i}}{\sigma_{\mathbf{w}}}, y_{i}\Big)$   (3)

where $\mu_{\mathbf{w}}$ and $\sigma_{\mathbf{w}}$ denote the mean and standard deviation of the elements of $\mathbf{w}$. It probably makes sense, since weight decay will not take effect on a fixed distribution of normalized weights. However, when we apply Eq. (3) to WS-equipped ResNet-50 (WS-ResNet-50) on the ImageNet dataset, i.e., setting the weight decay ratio to 0 for all WS-equipped convolutions, we observe severe degradation with a significant performance drop on the training set (Fig. 1). It is incredibly strange: weight decay is known to prevent the training from over-fitting the data, but it appears that removing the weight decay instead puts the network into serious under-fitting, which is very counter-intuitive.

Fig. 1: Counter-intuitive performance degradation (under-fitting) by setting weight decay parameter to 0 for WS-equipped convolutions. Top-1 Accuracy via single 224 crop on the ImageNet training set is plotted.

To answer the above questions, in this paper we first prove that in Eq. (2) (and Eq. (3)), the addition of the weight decay term $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ does not change the optimization goal. Therefore, weight decay loses its original role of finding a better generalized solution by introducing a loss term that competes with the task-related one. At the same time, based on the derivation of the gradient formula of the normalized weights, we further prove that weight decay only takes effect in modulating the effective learning rate to help the gradient descent process when the weight normalization family is employed simultaneously, and we empirically demonstrate how it adjusts the effective learning rate. Interestingly, we also make an additional empirical discovery: training a network with weight normalization and weight decay implicitly includes an approximate warmup [goyal2017accurate] process, which probably explains the slightly improved performance over the original baselines.

The current common and default practice [salimans2016weight, weightstandardization] for optimizing networks with the weight normalization family is to keep the traditional decay term $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ for the better convergence that comes from ensuring a stable effective learning rate, i.e., to explicitly add it to Eq. (2):

$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\ell\Big(\frac{\mathbf{w}^{\top}\mathbf{x}_{i}}{\|\mathbf{w}\|}, y_{i}\Big) + \frac{\lambda}{2}\|\mathbf{w}\|^{2}$   (4)

However, there are several potential problems in taking Eq. (4) as the final optimization objective. First, we prove that Eq. (4) theoretically has no global minimum. In addition, an improper selection of $\lambda$ will push the magnitude $\|\mathbf{w}\|$ of the weights towards 0 and easily lead to training failures due to gradient float overflow (as the corresponding gradient is proportional to $\frac{1}{\|\mathbf{w}\|}$), especially for certain adaptive gradient methods (e.g., Adam [kingma2014adam]) which accumulate the square of gradients.

To address these problems, we propose a very simple yet effective $\epsilon$-shifted $L_{2}$ regularizer, which shifts the $L_{2}$ objective by a positive constant $\epsilon$. The shift $\epsilon$ prevents the network weights from being too small, thus directly avoiding the risk of gradient float overflow. Such a simple operation can theoretically guarantee the existence of a global minimum, whilst greatly improving training stability. Beyond training stability, it further brings gains in performance over a wide range of architectures, probably due to its dynamic decay mechanism, which we will discuss later. The effectiveness of our method is comprehensively demonstrated by experiments on the ImageNet [deng2009imagenet], CIFAR-100 [krizhevsky2009learning] and COCO [lin2014microsoft] datasets.

To summarize our contributions:

  • We thoroughly analyze the disharmony between the weight normalization family and weight decay, and expose the critical problems, including the lack of a global minimum and training instability, which are caused by optimizing the weight decay term in the final loss objective when weight normalization is simultaneously applied.

  • We theoretically prove that weight decay loses the ability to enhance generalization in the weight normalization family, and only plays a role in regulating the effective learning rate to help training. We demonstrate that when optimizing with SGD, the weight decay term can be cancelled by simply scaling the learning rate with a constant at each gradient descent step, where the constant is only determined by the hyper-parameters and is independent of the training process.

  • We propose a simple yet effective $\epsilon$-shifted $L_{2}$ regularizer to overcome the problems caused by introducing weight decay into the weight normalization family, which significantly improves training stability whilst achieving better performance over a large range of network architectures on both classification and detection tasks.

2 Related Works

Weight Normalization Family: Weight Normalization (WN) [salimans2016weight] takes the first attempt to re-parameterize the weights $\mathbf{w}$ by separating their direction $\frac{\mathbf{w}}{\|\mathbf{w}\|}$ and length $g$:

$\hat{\mathbf{w}} = g\,\frac{\mathbf{w}}{\|\mathbf{w}\|}$   (5)

The normalization operation participates in the gradient flow, resulting in accelerated convergence of stochastic gradient descent optimization. WN shows certain advantages in some tasks of supervised image recognition, generative modelling, and deep reinforcement learning. However, [gitman2017comparison] points out that on the large-scale ImageNet dataset, the final test accuracy of WN is significantly lower (by about 6%) than that of BN [ioffe2015batch]. Later, Centered Weight Normalization (CWN) [huang2017centered] is proposed to further improve the conditioning and accelerate the convergence of training deep networks. The central idea of CWN is an additional centering operation based on WN:

$\hat{\mathbf{w}} = g\,\frac{\mathbf{w} - \mu_{\mathbf{w}}}{\|\mathbf{w} - \mu_{\mathbf{w}}\|}$   (6)

Recently, in order to alleviate the degraded performance of GN [wu2018group], Weight Standardization (WS) [weightstandardization] is proposed, which is very close to CWN but with the learnable length $g$ removed:

$\hat{\mathbf{w}} = \frac{\mathbf{w} - \mu_{\mathbf{w}}}{\sigma_{\mathbf{w}}}$   (7)
where $\mu_{\mathbf{w}}$ and $\sigma_{\mathbf{w}}$ denote the mean and standard deviation of the elements of $\mathbf{w}$ (computed per output filter in practice).

WS is recommended to cooperate with feature normalization methods (such as GN and BN), which leads to further enhanced performance in large-scale tasks and can significantly accelerate the convergence. Introducing WS on the basis of GN or BN can consistently bring gains to multiple downstream visual tasks, including image classification, object detection, instance segmentation, video recognition, semantic segmentation, and point cloud recognition. In this paper, we mainly focus on the weight normalization family and conduct a series of analyses on their properties, especially on their relations to weight decay.
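For concreteness, the three re-parameterizations in Eq. (5)-(7) can be written in a few lines of PyTorch. The following is a minimal per-filter sketch under our own naming (`NormalizedConv2d` is not from the released code base), and it omits the small constant that practical WS implementations add to the denominator (discussed in Sec. 4.2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedConv2d(nn.Conv2d):
    """Conv2d whose weight is re-parameterized on the fly, per output filter:
       'wn' : g * w / ||w||              (Eq. (5))
       'cwn': g * (w - mu) / ||w - mu||  (Eq. (6))
       'ws' : (w - mu) / sigma_w         (Eq. (7))"""
    def __init__(self, *args, mode='ws', **kwargs):
        super().__init__(*args, **kwargs)
        self.mode = mode
        if mode in ('wn', 'cwn'):  # WN / CWN keep a learnable length g per filter
            self.gain = nn.Parameter(torch.ones(self.out_channels, 1, 1, 1))

    def forward(self, x):
        flat = self.weight.view(self.out_channels, -1)        # one row per output filter
        if self.mode != 'wn':                                  # center for CWN / WS
            flat = flat - flat.mean(dim=1, keepdim=True)
        if self.mode == 'ws':                                  # unit standard deviation for WS
            flat = flat / flat.std(dim=1, unbiased=False, keepdim=True)
        else:                                                  # unit length for WN / CWN
            flat = flat / flat.norm(dim=1, keepdim=True)
        w_hat = flat.view_as(self.weight)
        if self.mode in ('wn', 'cwn'):
            w_hat = self.gain * w_hat
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

For example, `NormalizedConv2d(64, 128, 3, padding=1, mode='ws')` could replace a plain 3×3 convolution inside a residual block.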

Weight Decay: Weight decay can be traced back to [krogh1992simple], and is defined as multiplying each weight by a factor slightly less than 1 at each step of gradient descent. In the Stochastic Gradient Descent (SGD) setting, weight decay is widely interpreted as a form of $L_{2}$ regularization [ng2004feature] because it can be derived from the gradient of the $L_{2}$ norm of the weights [loshchilov2018decoupled]. It is known to be beneficial for the generalization of neural networks. Recently, [zhang2018three] identify three distinct mechanisms by which weight decay improves generalization: increasing the effective learning rate for BN, reducing the Jacobian norm, and reducing the effective damping parameter. Similarly, a series of recent works [van2017l2, hoffer2018norm] also demonstrates that when using BN, weight decay improves optimization only by fixing the norm to a small range of values, leading to a more stable step size for the weight direction. Although related, these works differ from ours in at least four aspects: 1) they mainly focus on the interplay between feature normalization (especially BN) and weight decay, whilst we are the first to give a thorough analysis of the disharmony between the weight normalization family and weight decay; 2) they only demonstrate empirically that the accuracy gained by using weight decay can also be achieved without it by adjusting the learning rate, whereas we give a theoretical proof and derive how to linearly scale the learning rate at each step, with a factor purely determined by the training hyper-parameters; 3) they fail to discover the problems of introducing weight decay into the loss objective with normalized weights, which are heavily revealed and discussed in this article; 4) although weight decay has several potential problems with normalization methods, they do not propose a solution to replace it. In contrast, our proposed $\epsilon$-shifted $L_{2}$ regularizer guarantees the global minimum and training stability, overcoming the existing drawbacks whilst achieving superior performance over a range of tasks.

Fig. 2: Comparison of the effective learning rate for different weight decay settings ($\lambda =$ 0 and $\lambda =$ 1e-4) at each epoch during the optimization of WN-ResNet (left) and WS-ResNet (right) on the ImageNet dataset, where the effective learning rates of their first convolutional layers are depicted. The blue curve ($\lambda =$ 1e-4) ensures a larger effective learning rate, which helps the networks converge.

3 Roles of Weight Decay in Weight Normalization Family

In this section, we explain the roles of weight decay in the weight normalization family in detail. The theoretical analyses help us understand why weight decay loses the ability to enhance generalization, yet still controls the effective learning rate to help the training of deep networks.

3.1 Weight Decay Does Not Change the Optimization Goal

We first prove that in networks equipped with the weight normalization family, the introduction of weight decay does not change the goal of optimization, indicating that weight decay essentially brings no additional generalization benefit. For the analyses, we simply use the concept of variable decomposition. Specifically, we choose two representative methods from the weight normalization family, namely WN and WS, and discuss each in turn. Note that for simplicity, we ignore the learning of the length $g$ in WN in the following analyses and experiments.

In WN, we aim to prove that minimizing

$L_{\lambda}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\ell\Big(\frac{\mathbf{w}^{\top}\mathbf{x}_{i}}{\|\mathbf{w}\|}, y_{i}\Big) + \frac{\lambda}{2}\|\mathbf{w}\|^{2}$   (8)

is equal to minimizing

$L_{0}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\ell\Big(\frac{\mathbf{w}^{\top}\mathbf{x}_{i}}{\|\mathbf{w}\|}, y_{i}\Big)$   (9)

Specifically, we let $\mathbf{v} = \frac{\mathbf{w}}{\|\mathbf{w}\|}$ and $r = \|\mathbf{w}\|$, and decompose the direction and length of $\mathbf{w}$ as two independent variables. Then the objectives of Eq. (8) and (9) can be rewritten as

$L_{\lambda}(\mathbf{v}, r) = \frac{1}{N}\sum_{i=1}^{N}\ell\big(\mathbf{v}^{\top}\mathbf{x}_{i}, y_{i}\big) + \frac{\lambda}{2}r^{2}$   (10)

and

$L_{0}(\mathbf{v}) = \frac{1}{N}\sum_{i=1}^{N}\ell\big(\mathbf{v}^{\top}\mathbf{x}_{i}, y_{i}\big)$   (11)

respectively. Since and are two independent variables, we have

$\min_{\mathbf{v}, r} L_{\lambda}(\mathbf{v}, r) = \min_{\mathbf{v}} L_{0}(\mathbf{v}) + \min_{r} \frac{\lambda}{2}r^{2} = \min_{\mathbf{v}} L_{0}(\mathbf{v})$   (12)

which shows that minimizing $L_{\lambda}$ actually contains the task of minimizing $L_{0}$, and it completes the proof. Similarly, in WS we can further decompose the mean and variance of $\mathbf{w}$ by letting $\mathbf{w} = \sigma\mathbf{v} + \mu\mathbf{1}$, with $\mathbf{v} = \frac{\mathbf{w}-\mu_{\mathbf{w}}}{\sigma_{\mathbf{w}}}$, $\mu = \mu_{\mathbf{w}}$, and $\sigma = \sigma_{\mathbf{w}}$, where $\mathbf{v}$, $\mu$, and $\sigma$ are also mutually independent variables. Again we can have

$\min_{\mathbf{v}, \mu, \sigma} L_{\lambda}(\mathbf{v}, \mu, \sigma) = \min_{\mathbf{v}} L_{0}(\mathbf{v}) + \min_{\mu, \sigma} \frac{\lambda n}{2}\big(\mu^{2} + \sigma^{2}\big) = \min_{\mathbf{v}} L_{0}(\mathbf{v})$   (13)
where $n$ is the number of elements in $\mathbf{w}$ and we use $\|\mathbf{w}\|^{2} = \|\sigma\mathbf{v}+\mu\mathbf{1}\|^{2} = n(\mu^{2}+\sigma^{2})$ since $\mathbf{v}$ has zero mean and unit variance.

Therefore, according to the above analyses, for networks with the weight normalization family employed, the introduction of weight decay does not essentially change the learning objective, which implies that it has no effect on the network's generalization capability.

3.2 Weight Decay Ensures Effective Learning Rate

Since weight decay does not bring a regularization effect to a network with the weight normalization family, why is it indispensable in the training process? The central reason is that weight decay helps to keep the effective learning rate in a stable and reasonable range. Taking WN as an example, we can derive the gradient of $L_{0}$ w.r.t. $\mathbf{w}$ (the derivative of Eq. (9)) as:

$\nabla_{\mathbf{w}} L_{0} = \frac{1}{\|\mathbf{w}\|}\Big(\nabla_{\hat{\mathbf{w}}} L_{0} - \big(\mathbf{1}^{\top}(\nabla_{\hat{\mathbf{w}}} L_{0}\odot\hat{\mathbf{w}})\big)\,\hat{\mathbf{w}}\Big), \qquad \hat{\mathbf{w}} = \frac{\mathbf{w}}{\|\mathbf{w}\|}$   (14)

where $\odot$ denotes the element-wise product and $\mathbf{1}$ is the all-ones vector (so that $\mathbf{1}^{\top}(\nabla_{\hat{\mathbf{w}}} L_{0}\odot\hat{\mathbf{w}})$ equals the inner product $\nabla_{\hat{\mathbf{w}}} L_{0}^{\top}\hat{\mathbf{w}}$). If we consider one gradient descent update at step $t$ for an element $w_{j}$ in $\mathbf{w}$, i.e., $w_{j}^{t+1} = w_{j}^{t} - \eta\frac{\partial L_{0}}{\partial w_{j}^{t}}$, with the use of learning rate $\eta$, we have:

$w_{j}^{t+1} = w_{j}^{t} - \frac{\eta}{\|\mathbf{w}^{t}\|}\Big(\frac{\partial L_{0}}{\partial \hat{w}_{j}^{t}} - \big(\mathbf{1}^{\top}(\nabla_{\hat{\mathbf{w}}^{t}} L_{0}\odot\hat{\mathbf{w}}^{t})\big)\,\hat{w}_{j}^{t}\Big)$   (15)

Eq. (15) demonstrates that even if $\eta$ is fixed, the gradient update of $\mathbf{w}$ can vary according to its magnitude $\|\mathbf{w}\|$. The reason is that a fixed $\hat{\mathbf{w}}^{t}$ can only lead to fixed $\nabla_{\hat{\mathbf{w}}^{t}} L_{0}$ and $\hat{w}_{j}^{t}$ terms. Consequently, the entire update step size is determined by $\frac{\eta}{\|\mathbf{w}^{t}\|}$, which exactly controls the effective learning rate, similarly defined as in [van2017l2, hoffer2018norm]. Here we have two conclusions:

  • If we do not limit $\|\mathbf{w}\|$ during the update process, the weights can grow unbounded ($\|\mathbf{w}\| \to \infty$), and the effective learning rate goes to 0 ($\frac{\eta}{\|\mathbf{w}\|} \to 0$).

  • On the contrary, if we decay $\mathbf{w}$ too much during the optimization ($\|\mathbf{w}\| \to 0$), the effective learning rate will grow unbounded ($\frac{\eta}{\|\mathbf{w}\|} \to \infty$), which leads to gradient float overflow and training failures. This is part of the motivation of our proposed method and we will discuss it in detail later.

A similar analysis can be conducted in the case of WS, where one update step is:

$w_{j}^{t+1} = w_{j}^{t} - \frac{\eta}{\sigma_{\mathbf{w}^{t}}}\Big(\frac{\partial L_{0}}{\partial \hat{w}_{j}^{t}} - \frac{1}{n}\,\mathbf{1}^{\top}\nabla_{\hat{\mathbf{w}}^{t}} L_{0} - \frac{\hat{w}_{j}^{t}}{n}\,\mathbf{1}^{\top}\big(\nabla_{\hat{\mathbf{w}}^{t}} L_{0}\odot\hat{\mathbf{w}}^{t}\big)\Big)$   (16)

which has the term $\frac{\eta}{\sigma_{\mathbf{w}^{t}}}$ as its effective learning rate ($n$ denotes the number of elements in $\mathbf{w}$).
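For reference, the per-filter effective learning rate plotted in Fig. 2 and Fig. 4 can be monitored with a small helper; this is our own sketch (the paper's logging code is not shown here), using $\frac{\eta}{\|\mathbf{w}\|}$ for WN-style layers and $\frac{\eta}{\sigma_{\mathbf{w}}}$ for WS-style layers:

```python
import torch

def effective_lr(weight: torch.Tensor, lr: float, mode: str = 'ws') -> torch.Tensor:
    """Per-filter effective learning rate:
       WN: lr / ||w||     (Eq. (15));   WS: lr / sigma_w   (Eq. (16))."""
    flat = weight.detach().view(weight.size(0), -1)        # one row per output filter
    if mode == 'wn':
        scale = flat.norm(dim=1)
    else:  # 'ws'
        scale = flat.std(dim=1, unbiased=False)
    return lr / scale

# e.g., logged once per epoch for a chosen convolution of a hypothetical ResNet-style model:
# elr = effective_lr(model.layer1[0].conv1.weight, lr=0.1, mode='ws').mean()
```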

To prove that weight decay only ensures the effective learning rate, we conduct the following theoretical analysis:

Theorem 1.

Using SGD, the training trajectory of the normalized weights obtained by optimizing $L_{\lambda}$ (Eq. (8), with weight decay) can be completely reproduced by simply scaling the learning rate at each step when optimizing $L_{0}$ (Eq. (9), without weight decay).

Proof.

Here we focus on the normalized variables to represent the training trajectory, e.g., $\frac{\mathbf{w}}{\|\mathbf{w}\|}$ (or $\frac{\mathbf{w}-\mu_{\mathbf{w}}}{\sigma_{\mathbf{w}}}$), as they are the ultimate weights with which the networks operate. In the case of WN, we suppose optimizing $L_{\lambda}$ and $L_{0}$ takes $T$ steps in total, and at each step we feed the same data batch to both of them. For ease of reference, the corresponding variables at step $t$ for optimizing $L_{\lambda}$ are marked with superscript $t$, e.g., $\mathbf{w}^{t}$ (with $\hat{\mathbf{w}}^{t} = \frac{\mathbf{w}^{t}}{\|\mathbf{w}^{t}\|}$). Such notations are kept similarly for optimizing $L_{0}$, e.g., $\mathbf{u}^{t}$ and $\hat{\mathbf{u}}^{t}$. We further assume that the two optimization processes start from the same initial weights $\mathbf{w}^{0} = \mathbf{u}^{0}$, which also means $k^{0} = 1$, where $k^{t}$ is the scale between $\mathbf{u}^{t}$ and $\mathbf{w}^{t}$ (if it exists), i.e., $\mathbf{u}^{t} = k^{t}\mathbf{w}^{t}$. Specifically, we aim to prove that there exists a sequence $\{c^{t}\}$ of multipliers for the learning rate $\eta$ during the optimization of $L_{0}$ (note that $c^{t}$ must be independent of the training process), which ensures $\hat{\mathbf{u}}^{t} = \hat{\mathbf{w}}^{t}$ for every step $t$ from $1$ to $T$. We take the standard SGD [bottou2010large, ruder2016overview] for analysis via mathematical induction:

1) As stated in the assumption, we have $\mathbf{u}^{0} = \mathbf{w}^{0}$ hold for $t = 0$, indicating $\hat{\mathbf{u}}^{0} = \hat{\mathbf{w}}^{0}$ and $k^{0} = 1$. Here we do not have $c^{0}$ since the gradient descent step does not start in the initialization phase.

2) Suppose $\hat{\mathbf{u}}^{t-1} = \hat{\mathbf{w}}^{t-1}$ (i.e., $\mathbf{u}^{t-1} = k^{t-1}\mathbf{w}^{t-1}$) holds for step $t-1$ with $1 \leq t \leq T$; we need to prove $\hat{\mathbf{u}}^{t} = \hat{\mathbf{w}}^{t}$ under a certain expression of $c^{t}$. Let us expand $\mathbf{w}^{t}$ and $\mathbf{u}^{t}$ by performing one gradient descent step:

$\mathbf{w}^{t} = (1-\eta\lambda)\,\mathbf{w}^{t-1} - \eta\,\nabla_{\mathbf{w}^{t-1}} L_{0}$   (17)

and

$\mathbf{u}^{t} = \mathbf{u}^{t-1} - \eta c^{t}\,\nabla_{\mathbf{u}^{t-1}} L_{0} = k^{t-1}\mathbf{w}^{t-1} - \frac{\eta c^{t}}{k^{t-1}}\,\nabla_{\mathbf{w}^{t-1}} L_{0}$   (18)
where the second equality uses the induction hypothesis $\mathbf{u}^{t-1} = k^{t-1}\mathbf{w}^{t-1}$ and the fact that, by Eq. (14), the gradient of $L_{0}$ scales inversely with the weight length.

Therefore, it is straightforward to deduce from Eq. (17) and (18) that when $c^{t} = \frac{(k^{t-1})^{2}}{1-\eta\lambda}$, we have

$\mathbf{u}^{t} = \frac{k^{t-1}}{1-\eta\lambda}\,\mathbf{w}^{t}$   (19)

which consequently leads to $\hat{\mathbf{u}}^{t} = \hat{\mathbf{w}}^{t}$ (the scale $\frac{k^{t-1}}{1-\eta\lambda}$ is positive, assuming $\eta\lambda < 1$) and thereby completes the proof. At the same time, we can also derive the recursive formula for $k^{t}$:

$k^{t} = \frac{k^{t-1}}{1-\eta\lambda}, \qquad k^{0} = 1$   (20)

Given the sequence $\{k^{t}\}$ generated from Eq. (20), the resulting sequence of multipliers finally becomes $c^{t} = \frac{(k^{t-1})^{2}}{1-\eta\lambda} = (1-\eta\lambda)^{-(2t-1)}$, which only depends on the hyper-parameters $\eta$ and $\lambda$. A similar deduction can be carried out for the case of WS. To give a better illustration of the proof, we let $t = 1$ and demonstrate the first gradient descent update of $\mathbf{w}$ and $\mathbf{u}$ in Fig. 3. It is easy to see that $\mathbf{w}^{1}$ and $\mathbf{u}^{1}$ form a similar-triangle relationship and their scale factor $k^{1}$ is only determined by the hyper-parameters $\eta$ and $\lambda$. ∎
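To make the induction above concrete, here is a small numerical sketch (ours, not from the paper) that checks Theorem 1 on a toy WN-style layer in PyTorch: one run uses SGD with weight decay $\lambda$, the other uses no decay but multiplies the learning rate by $c^{t} = (1-\eta\lambda)^{-(2t-1)}$ at step $t$; the normalized weights of the two runs should coincide at every step.

```python
import torch

torch.manual_seed(0)
d, eta, lam, steps = 16, 0.1, 1e-2, 50
X = torch.randn(steps, 32, d, dtype=torch.double)   # the same batch is fed to both runs at each step
Y = torch.randn(steps, 32, dtype=torch.double)

def task_loss(w, x, y):                              # WN layer: the loss depends on w only through w/||w||
    w_hat = w / w.norm()
    return ((x @ w_hat - y) ** 2).mean()

w = torch.randn(d, dtype=torch.double)               # trajectory of Eq. (8): SGD with weight decay
u = w.clone()                                        # trajectory of Eq. (9): SGD without weight decay
for t in range(1, steps + 1):
    gw = torch.autograd.grad(task_loss(w.requires_grad_(), X[t - 1], Y[t - 1]), w)[0]
    gu = torch.autograd.grad(task_loss(u.requires_grad_(), X[t - 1], Y[t - 1]), u)[0]
    c_t = (1.0 - eta * lam) ** (-(2 * t - 1))        # learning-rate multiplier derived from Eq. (20)
    w = (w - eta * (gw + lam * w)).detach()          # one SGD step with weight decay
    u = (u - eta * c_t * gu).detach()                # one SGD step, scaled learning rate, no decay
    assert torch.allclose(w / w.norm(), u / u.norm(), atol=1e-8)
print("normalized trajectories coincide at every step")
```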

Fig. 3: Illustration of proving that weight decay can be entirely replaced by modulating the learning rate (shown for the first update step, $t = 1$, in this example). (a) denotes one update with weight decay. (b) denotes one update without weight decay. (c) shows the similar-triangle relationship between these two updates, from which the scale $k^{1}$ and the multiplier $c^{1}$ can be easily derived.

Fig. 4: The curves of the mean effective learning rate of multiple WS-equipped convolutional layers. We observe an interesting warmup phenomenon in the initial training stage. In the form of “layer$x$.$y$.conv$z$”, “$x$” denotes the stage number, “$y$” denotes the bottleneck number, and “$z$” represents the order of the convolutional layer inside this bottleneck. For reference, the first “conv1” means the first convolutional layer, which is exactly the same as the blue curve in WS-ResNet-50 of Fig. 2.

According to the above analyses, we conclude that for networks with the weight normalization family, weight decay only takes effect in modulating the effective learning rate, and theoretically we can replace it simply by adjusting the learning rate at each iteration with a calculated ratio that is only related to the hyper-parameters $\eta$ and $\lambda$.

To show how weight decay regulates the effective learning rate, we plot the mean effective learning rate of all the filters from the first layer of WN-ResNet-50 and WS-ResNet-50 throughout the training process in Fig. 2, where weight decay ratios of 0 and 1e-4 are applied to the corresponding convolutional layers respectively. It is observed that with weight decay applied, the effective learning rate is appropriately kept in a relatively large range, which leads to a better optimized solution.

One more interesting empirical observation is that the control of the effective learning rate by weight decay in the early stage is quite similar to a warmup process, and the effectiveness of warmup has been generally confirmed in [he2019bag, you2017large, goyal2017accurate, liu2019variance]. As shown in Fig. 4, we sample and investigate a set of convolutional layers, and observe that almost all of the layers show an increase in the effective learning rate during the first several epochs of training. The effect of this implicit warmup may explain why networks equipped with WS can have slightly improved performance [weightstandardization]. To investigate this further, we conduct additional experiments by explicitly adding a warmup process (i.e., linearly increasing the learning rate from 0 to 0.1 during the first 5 epochs) to the training of ResNet-50, and thus partially confirm this conclusion in Table I. Further, the effective learning rates of the first convolutional layer for training WS-ResNet-50 and ResNet-50 with warmup are depicted in Fig. 5. Note that the definition of the effective learning rate for ResNet-50 follows [hoffer2018norm], in order to keep the two at a similar magnitude. We surprisingly find that the two curves are very closely matched, which implies that training networks with the weight normalization family and weight decay can implicitly have certain benefits of the warmup technique.

Fig. 5: The comparisons of effective learning rate of the first convolutional layer for training WS-ResNet-50 and ResNet-50 with warmup at each epoch during the optimization. The training of WS-ResNet-50 can implicitly simulate the process of warmup to some extent.

4 Problems of Introducing Weight Decay into the Weight Normalization Family

Despite the practical success and benefits of applying traditional weight decay to control the effective learning rate, there are still several serious problems in essence, which were rarely revealed or noticed before our work. In this section, we discuss in detail these problems of introducing the weight decay term into the loss objective for the weight normalization family.

4.1 No Guarantee of Global Minimum

We first consider WN. If we introduce the weight decay term $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ into the final loss objective, i.e., Eq. (4), we can prove that for $\lambda > 0$ the entire loss function theoretically has no global minimum. Here we use proof by contradiction:

If there exists a global minimum $\mathbf{w}^{*}$ ($\neq \mathbf{0}$) such that the objective (Eq. (4)) is minimized, then we have the smallest loss as

$L(\mathbf{w}^{*}) = \frac{1}{N}\sum_{i=1}^{N}\ell\Big(\frac{{\mathbf{w}^{*}}^{\top}\mathbf{x}_{i}}{\|\mathbf{w}^{*}\|}, y_{i}\Big) + \frac{\lambda}{2}\|\mathbf{w}^{*}\|^{2}$   (21)

Let us take a real number $\alpha \in (0, 1)$ and form a new solution $\alpha\mathbf{w}^{*}$. Then we have:

$L(\alpha\mathbf{w}^{*}) = \frac{1}{N}\sum_{i=1}^{N}\ell\Big(\frac{{\mathbf{w}^{*}}^{\top}\mathbf{x}_{i}}{\|\mathbf{w}^{*}\|}, y_{i}\Big) + \frac{\lambda}{2}\alpha^{2}\|\mathbf{w}^{*}\|^{2} < L(\mathbf{w}^{*})$   (22)

which contradicts the assumption that $L(\mathbf{w}^{*})$ is the smallest. A similar conclusion can be found for WS.

4.2 Training Instability

When given enough training iterations, the objective (Eq. (10)) will continuously push the length of the weight (i.e., $r = \|\mathbf{w}\|$) towards 0. Since the effective learning rate is inversely proportional to the weight length, this makes floating point overflow much more likely and can lead to a failed training.

Type Top-1/5 Acc (%)
ResNet-50 76.54/93.07
ResNet-50 + warmup 76.81/93.20
WS-ResNet-50 76.74/93.28
TABLE I: Top-1/5 Accuracy (%) via single 224 crop on ImageNet validation set of ResNet-50 with warmup and WS-ResNet-50.
Top-1 Acc (%) w.r.t. $\lambda$ | 1e-2 | 1e-3 | 1e-4 | 1e-5 | 0
ResNet-50 (SGD) | 47.67 | 74.12 | 76.54 | 74.80 | 72.65
ResNet-50 (Adam) | 19.42 | 35.68 | 52.97 | 63.46 | 72.50
WN-ResNet-50 (SGD) | – | – | 76.44 | 74.65 | 72.86
WN-ResNet-50 (Adam) | – | – | – | – | 72.34
WS-ResNet-50 (SGD) | – | – | 76.74 | 74.70 | 72.92
WS-ResNet-50 (Adam) | – | – | – | – | 72.85
TABLE II: Top-1 Accuracy (%) via single 224 crop on the ImageNet validation set under different weight decay settings $\lambda$ for the convolutions in ResNet-50, the WN-convolutions in WN-ResNet-50, and the WS-convolutions in WS-ResNet-50. $\lambda$ of the other parts (BN and fc layers) is kept as 1e-4. We demonstrate the results of two widely used optimizers: SGD (with momentum) and Adam. “–” denotes a failed training due to gradient float overflow.

Specifically, we find that an improper selection of $\lambda$ can actually cause instability in training. When we choose a slightly larger $\lambda$ for an optimizer, some of the weights in the network quickly converge to 0, making the effective learning rate close to infinity. Thereby the numerical gradient updates go beyond the representation range of floating point numbers, resulting in a training failure. Table II shows the impact of $\lambda$ on network training with two widely used optimizers, SGD [sutskever2013importance] with momentum and Adam [kingma2014adam], where it is much easier to obtain a failed training for networks with normalized weights. Moreover, the adaptive gradient method (e.g., Adam) even fails to train successfully unless we discard the weight decay by setting $\lambda = 0$. Since Adam accumulates the square of the gradient during the optimization process, it is more likely to encounter the risk of floating point overflow.

To better illustrate the gradient float overflow risks, we plot the maximal $\frac{1}{\|\mathbf{w}\|}$ for WN-ResNet-50 and the maximal $\frac{1}{\sigma_{\mathbf{w}}}$ for WS-ResNet-50 in the case of $\lambda =$ 1e-3 (for all corresponding convolutional layers) during SGD optimization in Fig. 6, where the maximal $\frac{1}{\|\mathbf{w}\|}$ (or $\frac{1}{\sigma_{\mathbf{w}}}$) grows exponentially and eventually leads to gradient float overflow after about 50k iterations.

One may argue that the practical implementation of WS [weightstandardization] already considers the risk of float overflow in the original paper by adding a small positive constant $\epsilon_{0}$ to the denominator of the standardization:

$\hat{\mathbf{w}} = \frac{\mathbf{w} - \mu_{\mathbf{w}}}{\sigma_{\mathbf{w}} + \epsilon_{0}}$   (23)

Here we clarify that adding $\epsilon_{0}$ in the standardization part alone is definitely not enough to prevent the gradient float overflow problem. To explain, we can derive the gradient of $\hat{w}_{i}$ w.r.t. $\mathbf{w}$ according to Eq. (23):

$\nabla_{\mathbf{w}}\hat{w}_{i} = \frac{1}{\sigma_{\mathbf{w}}+\epsilon_{0}}\Big(\mathbf{e}_{i} - \frac{1}{n}\mathbf{1}\Big) - \frac{w_{i}-\mu_{\mathbf{w}}}{\big(\sigma_{\mathbf{w}}+\epsilon_{0}\big)^{2}}\cdot\frac{\mathbf{w}-\mu_{\mathbf{w}}}{n\,\sigma_{\mathbf{w}}}$   (24)

where $\mathbf{e}_{i}$ denotes the $i$-th standard basis vector and the individual standard deviation term $\sigma_{\mathbf{w}}$ still appears in a denominator (through $\nabla_{\mathbf{w}}\sigma_{\mathbf{w}} = \frac{\mathbf{w}-\mu_{\mathbf{w}}}{n\sigma_{\mathbf{w}}}$), so the gradient float overflow can still take place. Therefore, it is necessary to propose a different approach to address the problem.

Fig. 6: Illustration of the gradient float overflow problem over training iterations of WN-ResNet-50 and WS-ResNet-50. We use the maximal reciprocal of the weight length (or standard deviation), i.e., $\max\frac{1}{\|\mathbf{w}\|}$ (or $\max\frac{1}{\sigma_{\mathbf{w}}}$), as the statistic.

5 Methods

This section describes our proposed $\epsilon$-shifted $L_{2}$ regularizer, which addresses the above problems.

5.1 $\epsilon$-shifted $L_{2}$ Regularizer

As stated in Sec. 3.2, when training the weight normalization family with a weight decay term, it is desirable to design a mechanism which prevents the network weights from shrinking towards 0, regardless of the choice of hyper-parameters or optimizer. Given this insight, we start by revisiting the lack-of-global-minimum problem in the decomposed objective of Eq. (10).

We notice that the central reason for the missing global minimum is the regularization term $\frac{\lambda}{2}r^{2}$ (i.e., $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$). During optimization, this term will continuously drive $r$ ($= \|\mathbf{w}\|$) infinitely close to 0, pushing the gradient towards infinity and thus leading to training failures. To avoid such risks, we propose the $\epsilon$-shifted $L_{2}$ regularizer, which constrains $\|\mathbf{w}\|$ from being too small by a positive constant $\epsilon$:

$L_{\epsilon}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\ell\Big(\frac{\mathbf{w}^{\top}\mathbf{x}_{i}}{\|\mathbf{w}\|}, y_{i}\Big) + \frac{\lambda}{2}\big(\|\mathbf{w}\| - \epsilon\big)^{2}$   (25)

For the case of WS, given the standard deviation $\sigma_{\mathbf{w}}$, by modifying Eq. (13) we have:

$L_{\epsilon}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\ell\Big(\frac{(\mathbf{w}-\mu_{\mathbf{w}})^{\top}\mathbf{x}_{i}}{\sigma_{\mathbf{w}}}, y_{i}\Big) + \frac{\lambda}{2}\big(\sigma_{\mathbf{w}} - \epsilon\big)^{2}$   (26)
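As a reference implementation, Eq. (25)/(26) can be added as an extra loss term over the weight-normalized (or weight-standardized) convolution weights. The following is a minimal sketch with our own helper name (`eps_shifted_l2`, not the released implementation), applying the regularizer per output filter and assuming these weights are excluded from the optimizer's ordinary `weight_decay`:

```python
import torch

def eps_shifted_l2(weights, lam: float, eps: float, mode: str = 'ws') -> torch.Tensor:
    """eps-shifted L2 regularizer, per output filter:
       WN: lam/2 * (||w|| - eps)^2     (Eq. (25))
       WS: lam/2 * (sigma_w - eps)^2   (Eq. (26))"""
    reg = 0.0
    for w in weights:                                   # weights of the WN/WS convolutions
        flat = w.view(w.size(0), -1)                    # one row per output filter
        if mode == 'wn':
            scale = flat.norm(dim=1)
        else:  # 'ws'
            scale = flat.std(dim=1, unbiased=False)
        reg = reg + 0.5 * lam * ((scale - eps) ** 2).sum()
    return reg

# usage inside a training step (wn_ws_weights collects the normalized conv weights):
# loss = criterion(model(images), targets) + eps_shifted_l2(wn_ws_weights, lam=1e-4, eps=1e-3)
```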

5.2 Guarantee of Global Minimum

Thanks to the introduction of the $\epsilon$-shifted $L_{2}$ regularizer, the modified loss objectives (i.e., Eq. (25) and Eq. (26)) now guarantee the existence of a global minimum. For WN, suppose $\mathbf{v}^{*}$ (with $\|\mathbf{v}^{*}\| = 1$) is the optimal direction minimizing $L_{0}$ in Eq. (11); then the global minimal solution of Eq. (25) is $\mathbf{w}^{*} = \epsilon\mathbf{v}^{*}$. Therefore, the minimized objective equals $L_{0}(\mathbf{v}^{*})$, where the additional $\epsilon$-shifted $L_{2}$ regularizer is utilized during optimization for mainly two important purposes: 1) controlling the effective learning rate to help the networks converge, and 2) preventing the magnitude of the weights from becoming too small, thus avoiding gradient float overflow and training failures.

5.3 Dynamic Decay Mechanism

In addition, we find that the $\epsilon$-shifted $L_{2}$ regularizer has the function of dynamically adjusting the decay coefficient according to the current magnitude of the training weights during optimization. In the case of WN, the $\epsilon$-shifted $L_{2}$ regularizer term is $\frac{\lambda}{2}\big(\|\mathbf{w}\|-\epsilon\big)^{2}$, whose gradient formula w.r.t. $\mathbf{w}$ is:

$\nabla_{\mathbf{w}}\Big[\frac{\lambda}{2}\big(\|\mathbf{w}\|-\epsilon\big)^{2}\Big] = \lambda\Big(1 - \frac{\epsilon}{\|\mathbf{w}\|}\Big)\mathbf{w}$   (27)

It can also be regarded as an adaptive version of traditional weight decay (i.e., $\lambda\mathbf{w}$), which uses the dynamic magnitude of the training weights to slightly adjust the decay ratio: $\lambda\big(1 - \frac{\epsilon}{\|\mathbf{w}\|}\big)$. When $\|\mathbf{w}\|$ is relatively large, this ratio will also be relatively large, meaning that we use a larger factor to shrink larger weights, which is reasonably intuitive. This probably explains why applying the $\epsilon$-shifted $L_{2}$ regularizer brings slight improvements in our experiments.

For the case of WS, the gradient formula of the regularizer term $\frac{\lambda}{2}(\sigma_{\mathbf{w}}-\epsilon)^{2}$ w.r.t. $\mathbf{w}$ is:

$\nabla_{\mathbf{w}}\Big[\frac{\lambda}{2}\big(\sigma_{\mathbf{w}}-\epsilon\big)^{2}\Big] = \frac{\lambda}{n}\Big(1 - \frac{\epsilon}{\sigma_{\mathbf{w}}}\Big)\big(\mathbf{w}-\mu_{\mathbf{w}}\big)$   (28)

where a similar analysis can be conducted ($n$ again denotes the number of elements in $\mathbf{w}$).
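Equivalently, following Eq. (27) and (28), the regularizer can also be applied as a decoupled update right after the task-gradient step, which makes the dynamic decay ratio explicit; this is a sketch under our own naming and the same per-filter assumption, not the paper's released code:

```python
import torch

def dynamic_decay_step(weights, lr: float, lam: float, eps: float, mode: str = 'ws'):
    """Apply w <- w - lr * (gradient of the eps-shifted regularizer), i.e. Eq. (27)/(28)."""
    for w in weights:
        flat = w.data.view(w.size(0), -1)                      # one row per output filter
        if mode == 'wn':
            ratio = lam * (1.0 - eps / flat.norm(dim=1, keepdim=True))
            flat -= lr * ratio * flat                          # Eq. (27): lam * (1 - eps/||w||) * w
        else:  # 'ws'
            mu = flat.mean(dim=1, keepdim=True)
            sigma = flat.std(dim=1, unbiased=False, keepdim=True)
            n = flat.size(1)
            ratio = (lam / n) * (1.0 - eps / sigma)
            flat -= lr * ratio * (flat - mu)                   # Eq. (28): (lam/n) * (1 - eps/sigma) * (w - mu)
```

Under plain SGD this is equivalent to adding the loss term of the previous sketch; which form to use is an implementation choice.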

6 Experiments

In this section, we conduct extensive experiments on both classification and detection tasks to validate the effectiveness of the proposed $\epsilon$-shifted $L_{2}$ regularizer.

6.1 Experimental Settings

To validate the effectiveness of the proposed $\epsilon$-shifted $L_{2}$ regularizer, we conduct comprehensive experiments on the ImageNet [deng2009imagenet] and CIFAR-100 [krizhevsky2009learning] classification datasets and the COCO [lin2014microsoft] detection dataset. For fair comparisons, all the experiments are run under a unified PyTorch [paszke2017automatic] framework, including the results of every baseline model. More details can be found in our public code base: https://github.com/implus/PytorchInsight. We mainly conduct experiments based on the state-of-the-art Weight Normalization (WN) [salimans2016weight] and Weight Standardization (WS) [weightstandardization] from the weight normalization family.

ImageNet classification: The ILSVRC 2012 classification dataset [deng2009imagenet] contains 1.2 million images for training and 50K for validation, from 1K classes. The training settings for large models are kept similar to [li2019spatial], except that we set the weight decay ratio to 0 for all the bias parts in the networks [he2019bag], which generally improves all the baselines in this paper by about 0.2%. We train the networks on the training set and report the Top-1 (and Top-5) accuracy on the validation set with a single 224×224 central crop. For data augmentation, we follow the standard practice [szegedy2015going] and perform random-size cropping and random horizontal flipping. All networks are trained with the naive softmax cross entropy without label-smoothing regularization [szegedy2016rethinking]. We train all the architectures from scratch by SGD [sutskever2013importance] or Adam [kingma2014adam, loshchilov2018decoupled]. SGD is used with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. Adam keeps the default settings with learning rate 0.001, $\beta_{1} =$ 0.9 and $\beta_{2} =$ 0.999. The total batch size is set to 256 and 8 GPUs (32 images per GPU) are utilized for training. The default weight initialization strategy is that of [he2015delving], where we specifically use the 'fan_out' mode. The training settings for small models (i.e., ShuffleNet [zhang2018shufflenet, ma2018shufflenet] and MobileNet [howard2017mobilenets, sandler2018mobilenetv2]) are slightly different, following their original papers [howard2017mobilenets, zhang2018shufflenet]: the default weight decay is 4e-5 with 300 total epochs. Warmup [goyal2017accurate], cosine learning rate decay [loshchilov2016sgdr], label smoothing [szegedy2016rethinking] and no weight decay on all depthwise convolutional/BN layers [jia2018highly] are applied as well. One should notice that small models are more difficult to train to high accuracy, so the authors of many small-model papers [howard2017mobilenets, zhang2018shufflenet] usually adopt the training tricks described above. Also importantly, the weight normalization family should not be applied to depthwise convolutions (common in those small architectures) in practice, since the number of parameters in each normalized group is too small; otherwise we observe severe performance degradation. In order to make a fair comparison, especially to make our re-implemented baselines reach the accuracy reported in the referenced papers, we use these tips in the training of all small models. Note that in our experiments, only the normalized weights are trained with the $\epsilon$-shifted $L_{2}$ regularizer; the others (BN and fc layer weights) keep the traditional weight decay term (if it exists) since they do not suffer from these problems.
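Since the settings above assign different decay treatment to different parameter types, a parameter-group sketch of how such a split can be wired in PyTorch may be useful. The selection rules below are our own illustration of the small-model recipe (no decay on biases, BN, and depthwise convolutions; the normalized conv weights handled by the $\epsilon$-shifted term in the loss), not the exact code of the released repository; `NormalizedConv2d` refers to the earlier sketch class.

```python
import torch.nn as nn
from torch.optim import SGD

def build_param_groups(model: nn.Module, weight_decay: float = 4e-5):
    """Split parameters: normalized conv weights get no optimizer decay (they are
    regularized by the eps-shifted term in the loss); biases, BN/GN and depthwise
    convolutions get no decay either; the remaining parameters keep it."""
    decayed, no_decay, normalized = [], [], []
    for m in model.modules():
        if isinstance(m, nn.Conv2d) and m.groups == m.in_channels and m.groups > 1:
            no_decay += list(m.parameters(recurse=False))        # depthwise convolution
        elif isinstance(m, NormalizedConv2d):                    # WN/WS conv (see the Sec. 2 sketch)
            normalized.append(m.weight)
            if m.bias is not None:
                no_decay.append(m.bias)
        elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
            no_decay += list(m.parameters(recurse=False))
        else:
            for name, p in m.named_parameters(recurse=False):
                (no_decay if name.endswith('bias') else decayed).append(p)
    optimizer = SGD([{'params': decayed, 'weight_decay': weight_decay},
                     {'params': no_decay, 'weight_decay': 0.0},
                     {'params': normalized, 'weight_decay': 0.0}],
                    lr=0.1, momentum=0.9)
    return optimizer, normalized
```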

Top-1 Acc (%) w.r.t. $\lambda$ | 1e-2 | 1e-3 | 1e-4 | 1e-5
WN-ResNet-50 + $\epsilon$ (SGD) | 71.86 | 75.31 | 76.52 | 74.63
WN-ResNet-50 + $\epsilon$ (Adam) | 64.31 | 64.47 | 65.92 | 68.23
WS-ResNet-50 + $\epsilon$ (SGD) | 72.15 | 75.68 | 76.86 | 74.99
WS-ResNet-50 + $\epsilon$ (Adam) | 64.56 | 64.71 | 66.17 | 68.45
TABLE III: Top-1 Accuracy (%) via single 224 crop on the ImageNet validation set under different weight decay settings $\lambda$ for the WN-convolutions in WN-ResNet-50 and the WS-convolutions in WS-ResNet-50. $\lambda$ of the other parts (BN and fc layers) is kept as 1e-4. We demonstrate the results of two widely used optimizers: SGD (with momentum) and Adam. “+ $\epsilon$” denotes the use of the $\epsilon$-shifted regularizer.

Fig. 7: Illustration of training stability over training iterations of WS-ResNet-50 under $\lambda =$ 1e-3. The shift $\epsilon$ in fact limits the range of the gradient magnitude and thus greatly ensures training stability. “$\epsilon =$ 0” denotes the training of WS-ResNet-50 with traditional weight decay.

CIFAR-100 classification: The CIFAR-100 dataset [krizhevsky2009learning] consists of colored natural images with 32×32 pixels, where the images are drawn from 100 classes. The training and test sets contain 50K and 10K images, respectively. Apart from the standard data augmentation scheme that is widely used on this dataset [he2016identity, huang2017densely, wang2018mixed], we also add two recent popular methodologies, namely cutout [devries2017improved] and mixup [zhang2017mixup], to further reduce the over-fitting risks, where we keep their respective default hyper-parameters in training. For preprocessing, we normalize the data using the channel means and standard deviations. The networks are trained with batch size 64 on one GPU. The training is with weight decay 0.0005 and momentum 0.9 for 300 epochs, starting from learning rate 0.05, which is decreased at the 150th and 225th epochs by a factor of 10.

COCO detection: The COCO 2017 [lin2014microsoft] dataset is comprised of 118k images in train set, and 5k images in validation set. We follow the standard setting [he2017mask] of evaluating object detection via the standard mean Average-Precision (AP) and mean Average-Recall (AR) [ren2015faster] scores at different box IoUs or object scales, respectively. We use the standard configuration of Cascade R-CNN [cai2018cascade] with FPN [lin2017feature] and ResNet as the backbone architecture. The input images are resized such that their shorter side is of 800 pixels. We trained on 8 GPUs with 2 images per GPU. The backbones of all models are pretrained on ImageNet classification, then all layers except for c1 and c2 are jointly finetuned with detection heads. The end-to-end training introduced in [ren2015faster] is adopted for our implementation, which yields better results. We utilize the conventional finetuning setting [ren2015faster] by fixing the learning parameters of BN layers. All models are trained for 20 epochs using Synchronized SGD with a weight decay of 1e-4 and momentum of 0.9. The learning rate is initialized to 0.02, and decays by a factor of 10 at the 16th and 19th epochs. The choice of hyper-parameters also follows the latest release of the Cascade R-CNN benchmark [mmdetection]. For more experiments with other detector frameworks, e.g., Faster R-CNN [ren2015faster] and Mask R-CNN [he2017mask], we exactly follow the official settings of the baselines described in [mmdetection].

Top-1 Acc (%) w.r.t. $\epsilon$ | 0 | 1e-2 | 1e-3 | 1e-4 | 1e-5
WS-ResNet-50 | 76.74 | 76.60 | 76.86 | 76.84 | 76.71
TABLE IV: Top-1 Accuracy (%) via single 224 crop on the ImageNet validation set under different $\epsilon$ settings for the WS-convolutions in WS-ResNet-50 with $\lambda =$ 1e-4. $\epsilon =$ 0 denotes the baseline without the use of the $\epsilon$-shifted regularizer.
Top-1 Acc (%) ($\lambda =$ 1e-4) | baseline | WS | WS + $\epsilon$
ResNet-50 [he2016deep] | 76.54 | 76.74 | 76.86
ResNet-101 [he2016deep] | 78.17 | 78.07 | 78.29
ResNeXt-50 [xie2017aggregated] | 77.64 | 77.76 | 77.88
ResNeXt-101 [xie2017aggregated] | 78.71 | 78.68 | 78.80
SE-ResNet-101 [hu2018squeeze] | 78.43 | 78.65 | 78.75
DenseNet-201 [huang2017densely] | 77.54 | 77.56 | 77.59
Top-1 Acc (%) ($\lambda =$ 4e-5) | baseline | WS | WS + $\epsilon$
ShuffleNetV1 1x (g=8) [zhang2018shufflenet] | 67.62 | 67.84 | 68.09
ShuffleNetV2 1x [ma2018shufflenet] | 69.64 | 69.66 | 69.70
MobileNetV1 1x [howard2017mobilenets] | 73.55 | 73.56 | 73.60
MobileNetV2 1x [sandler2018mobilenetv2] | 73.14 | 73.17 | 73.22
TABLE V: Top-1 Accuracy (%) via single 224 crop on the ImageNet validation set for different backbones using SGD. $\lambda =$ 1e-4 is used for the large models and $\lambda =$ 4e-5 for the small models. “WS” denotes the conventional weight decay training of the WS-equipped networks. “WS + $\epsilon$” denotes the introduction of the $\epsilon$-shifted regularizer as the objective for training the WS-equipped networks.

6.2 Training Stability

The proposed $\epsilon$-shifted $L_{2}$ regularizer ensures a global minimal solution for the magnitude of the weights, which also prevents the weights from being too small and thus avoids the gradient float overflow risks. To verify this, we traverse the hyper-parameter $\lambda$ over a large range and employ the $\epsilon$-shifted $L_{2}$ regularizer to train networks with the weight normalization family (namely, WN and WS), yielding the results in Table III. In comparison with Table II, we find that the $\epsilon$-shifted $L_{2}$ regularizer not only greatly improves the stability of training, i.e., no matter how the hyper-parameter $\lambda$ changes, the optimization finally converges to a good solution with no training failures for any type of optimizer, but also slightly boosts the generalization performance in the comparable cases of $\lambda =$ 1e-4 and 1e-5.

In our experiments, we empirically find that, with the $\epsilon$-shifted $L_{2}$ regularizer applied, the magnitude of the training weights is actually kept greater than or approximately equal to $\epsilon$ during the entire optimization. To give a better illustration, we depict the maximal $\frac{1}{\sigma_{\mathbf{w}}}$ for WS-ResNet-50 trained with the $\epsilon$-shifted $L_{2}$ regularizer and $\lambda =$ 1e-3 over its training iterations, and vary $\epsilon \in$ {1e-2, 1e-3, 1e-4}. For a better comparison, we also plot the curve without the $\epsilon$-shifted $L_{2}$ regularizer (i.e., $\epsilon =$ 0), which is exactly the red curve in Fig. 6. As can be seen in Fig. 7, the shift $\epsilon$ in fact limits the range of the gradient magnitude and thus successfully prevents training failures.

6.3 Parameter Sensitivity

In this subsection, we are interested in the selection of $\epsilon$ as it is the only parameter of the proposed regularizer. To investigate the sensitivity of the hyper-parameter $\epsilon$, we traverse its range in {1e-2, 1e-3, 1e-4, 1e-5} for training WS-ResNet-50 under $\lambda =$ 1e-4, as shown in Table IV. It is demonstrated that the final generalization performance is robust to the choice of $\epsilon$, where more appropriate selections (i.e., $\epsilon =$ 1e-3 and 1e-4) consistently bring slight improvements in accuracy.

Fig. 8: The training and validation curves of (WS-)ResNet-50 on the ImageNet dataset. It is observed that the $\epsilon$-shifted regularizer maintains the property of faster convergence.

6.4 Extension to More Architectures

Further, we apply the $\epsilon$-shifted regularizer to more state-of-the-art network structures [he2016deep, xie2017aggregated, hu2018squeeze, huang2017densely, zhang2018shufflenet, ma2018shufflenet, howard2017mobilenets, sandler2018mobilenetv2] and compare it to the original baselines and the traditional WS version using SGD. For the WS-equipped networks, we set $\lambda =$ 1e-4 and search $\epsilon \in$ {1e-3, 1e-5, 1e-8} to report our results. As can be seen from Table V, while keeping excellent training stability, the $\epsilon$-shifted regularizer also achieves very competitive results, both for large and small models. We also empirically demonstrate in Fig. 8 that convergence is still accelerated over the original baseline by the $\epsilon$-shifted regularizer.

Type | Backbone | Top-1 Acc (%)
baseline | ResNeXt29-16x64d [xie2017aggregated] | 83.68
WS | WS-ResNeXt29-16x64d [xie2017aggregated] | 83.48
WS + $\epsilon$ | WS-ResNeXt29-16x64d [xie2017aggregated] | 84.39 (+0.71)
baseline | SE-ResNeXt29-16x64d [hu2018squeeze] | 84.66
WS | WS-SE-ResNeXt29-16x64d [hu2018squeeze] | 84.55
WS + $\epsilon$ | WS-SE-ResNeXt29-16x64d [hu2018squeeze] | 84.87 (+0.21)
TABLE VI: Top-1 Accuracy (%) on the CIFAR-100 validation set. For the type, “WS” denotes the conventional weight decay training of the WS-equipped networks, and “WS + $\epsilon$” denotes the introduction of the $\epsilon$-shifted regularizer as the objective for training the WS-equipped networks. The numbers in parentheses represent the absolute improvement over the baseline.

Fig. 9: The training and validation curves of different training types of (WS-)ResNeXt29-16x64d on the CIFAR-100 dataset. The $\epsilon$-shifted regularizer significantly speeds up the convergence of WS-ResNeXt29-16x64d.

6.5 Extension to Other Datasets

We further verify whether the effectiveness of the $\epsilon$-shifted regularizer can generalize to datasets beyond ImageNet. Here we choose the widely used CIFAR-100, and mainly validate it based on two strong backbones: ResNeXt29-16x64d [xie2017aggregated] and SE-ResNeXt29-16x64d [hu2018squeeze]. In the experiments, we typically find that $\epsilon =$ 1e-2 brings consistent gains for the $\epsilon$-shifted regularizer. From the results in Table VI, we are surprised to discover that on CIFAR-100 the performance of the WS-equipped (SE-)ResNeXt29-16x64d without the $\epsilon$-shifted regularizer slightly declines compared to the baseline. Instead, the $\epsilon$-shifted regularizer can still improve the recognition accuracy over the baseline models, which demonstrates its high potential in practice. Furthermore, the training and validation curves of the $\epsilon$-shifted regularizer (i.e., WS + $\epsilon$) converge significantly faster than both the baseline and WS, as depicted in Fig. 9.

Cascade R-CNN [cai2018cascade] | Backbone | AP | AP$_{50}$ | AP$_{75}$ | AP$_{S}$ | AP$_{M}$ | AP$_{L}$ | AR$_{S}$ | AR$_{M}$ | AR$_{L}$
baseline | ResNet-50 | 41.1 | 59.3 | 44.8 | 22.6 | 44.5 | 54.9 | 33.2 | 58.8 | 70.7
WS | WS-ResNet-50 | 41.6 | 60.1 | 45.2 | 23.4 | 44.7 | 55.6 | 34.2 | 58.2 | 71.0
WS + $\epsilon$ | WS-ResNet-50 | 41.8 (+0.7) | 60.2 | 45.5 | 23.4 | 45.0 | 55.4 | 33.9 | 58.9 | 71.8
baseline | ResNet-101 | 42.6 | 60.9 | 46.4 | 23.7 | 46.1 | 56.9 | 34.5 | 59.8 | 72.0
WS | WS-ResNet-101 | 43.2 | 61.6 | 47.2 | 24.8 | 46.7 | 57.8 | 34.8 | 59.7 | 72.2
WS + $\epsilon$ | WS-ResNet-101 | 43.5 (+0.9) | 61.7 | 47.5 | 23.9 | 47.1 | 58.4 | 33.4 | 60.2 | 72.4
TABLE VII: Detection results on the COCO 2017 [lin2014microsoft] validation set using Cascade R-CNN [cai2018cascade] and FPN [lin2017feature] with (WS-)ResNet-50 and (WS-)ResNet-101 as backbones. “WS” denotes the conventional weight decay training of the WS-equipped networks, and “WS + $\epsilon$” denotes the introduction of the $\epsilon$-shifted regularizer as the objective for training the WS-equipped networks. The numbers in parentheses represent the absolute improvement over the baseline.

6.6 Extension to Detection Tasks

We are also interested in whether the $\epsilon$-shifted regularizer can still work in downstream tasks beyond image classification, e.g., object detection. Here we choose one of the most advanced object detectors, Cascade R-CNN [cai2018cascade], for evaluation and conduct comprehensive experiments on the COCO dataset [lin2014microsoft]. The pretrained models with the best performance are utilized to initialize the detector backbones. From Table VII, it is observed that the $\epsilon$-shifted regularizer has the potential to significantly boost the overall performance of the detectors, especially for the larger ResNet-101 backbone. Specifically, it improves AP by nearly 1% absolute over the original baseline, and outperforms the baseline in all the other metrics, i.e., AP/AR at different object scales. We also conduct experiments on other state-of-the-art detectors and observe consistent improvements in Table VIII, which demonstrates its wide applicability.

Faster R-CNN [ren2015faster] | Backbone | AP | AP$_{50}$ | AP$_{75}$
baseline | ResNet-50 | 37.7 | 59.3 | 41.1
WS + $\epsilon$ | WS-ResNet-50 | 37.9 | 59.7 | 40.9
baseline | ResNet-101 | 39.4 | 60.7 | 43.0
WS + $\epsilon$ | WS-ResNet-101 | 39.8 | 60.8 | 43.5
Mask R-CNN [he2017mask] | Backbone | AP | AP$_{50}$ | AP$_{75}$
baseline | ResNet-50 | 38.6 | 60.0 | 41.9
WS + $\epsilon$ | WS-ResNet-50 | 38.9 | 60.1 | 42.2
baseline | ResNet-101 | 40.4 | 61.6 | 44.2
WS + $\epsilon$ | WS-ResNet-101 | 41.1 | 62.2 | 45.0
TABLE VIII: Detection results for Faster R-CNN [ren2015faster] and Mask R-CNN [he2017mask] with FPN [lin2017feature] on the COCO 2017 [lin2014microsoft] validation set. “WS + $\epsilon$” denotes the introduction of the $\epsilon$-shifted regularizer as the objective for training the WS-equipped networks.

6.7 Important Practices for Weight Normalization Family

In the original papers of the weight normalization family [salimans2016weight, weightstandardization], the authors rarely discuss where to use WN or WS in deep neural networks. Their default mode is to place WN or WS on all conventional convolutional layers, while the BN and fc layers do not participate in WN/WS operations. However, in our practice, this is not always the best option. For depthwise convolutions or group convolutions with very few parameters in each group, using WN or WS can result in severe performance degradation on both the training and test sets. We speculate that when normalizing only a few parameters, since the set of parameters itself has very few degrees of freedom, normalization or standardization will further reduce the degrees of freedom, leading to extremely limited representation ability. One example is the learnable scale parameter $\gamma$ of BN. It is essentially equivalent to a 1×1 depthwise convolution, where each parameter group only contains one variable. If we normalize it, it becomes a fixed constant (in the case of WN), which definitely cannot learn the effect of scaling features.

The experiments which we have conducted above mainly avoid these risks. For example, in the small models like ShuffleNetV2, MobileNetV1 and MobileNetV2, we do not apply WS on the depthwise convolutions. And for ShuffleNetV1, it is suggested not to equip the group convolutions with WS. To be specific, we list the results of using or not using WS on the depthwise convolutions in Table IX. It can be observed that using WS on the depthwise convolutions results in a very large performance gap.
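As a rule of thumb from the observation above, whether a convolution is a safe candidate for WN/WS can be decided by the number of parameters in each normalized group. The small helper below is our own sketch, and the threshold is our own heuristic, not a value from the paper:

```python
import torch.nn as nn

def safe_for_weight_standardization(conv: nn.Conv2d, min_params_per_filter: int = 16) -> bool:
    """Return False for depthwise/group convolutions whose per-filter parameter
    count is too small to be normalized without hurting representation ability."""
    params_per_filter = (conv.in_channels // conv.groups) * conv.kernel_size[0] * conv.kernel_size[1]
    return params_per_filter >= min_params_per_filter

# e.g., a 3x3 depthwise convolution has only 9 parameters per filter -> False,
# while a standard 3x3 convolution over 64 input channels has 576 -> True.
```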

backbone w/ wo/
WS-ShuffleNetV2 1x [ma2018shufflenet] 63.79/84.63 69.66/88.76
WS-MobileNetV2 1x [sandler2018mobilenetv2] 69.74/89.18 73.17/91.05
TABLE IX: Top-1/5 accuracy (%) via single 224 crop on ImageNet validation set of using or not using WS on the depthwise convolutions. “w/” denotes using WS and “wo/” denotes not using WS. All other parts are kept the same.

7 Conclusions

In this paper, we first review the disharmony between the weight normalization family and weight decay, i.e., the counter-intuitive under-fitting risk caused by removing weight decay on the normalized weights. Then, we theoretically answer this question with two findings: 1) weight decay does not change the optimization goal, and 2) it ensures an appropriate effective learning rate for better convergence. After that, we expose the detailed problems of introducing the fixed weight decay term in the loss objective, including the missing global minimum and training instability. Finally, to solve these potential problems, we propose the $\epsilon$-shifted $L_{2}$ regularizer, which shifts the $L_{2}$ objective by a positive constant $\epsilon$. The shift $\epsilon$ prevents the network weights from being too small, so the gradient float overflow risks can be avoided directly. Comprehensive analyses demonstrate that the proposed $\epsilon$-shifted $L_{2}$ regularizer successfully guarantees the global minimum and significantly improves training stability, whilst maintaining superior performance.

References
