Convolutional Neural Networks with Dynamic Regularization
Abstract
Regularization is commonly used in machine learning to alleviate overfitting. In convolutional neural networks, regularization methods such as Dropout and Shake-Shake have been proposed to improve generalization performance. However, these methods lack self-adaption throughout training, i.e., the regularization strength is fixed to a predefined schedule, and manual adjustment is needed to adapt to different network architectures. In this paper, we propose a dynamic regularization method that adjusts the regularization strength during the training procedure. Specifically, we update the regularization strength according to the backward difference of the training loss, which can be directly extracted in each training iteration. With dynamic regularization, a large model is regularized by a strong perturbation and a light model by a weak one. Experimental results show that the proposed method improves the generalization capability of off-the-shelf network architectures and outperforms state-of-the-art regularization methods.
I Introduction
Convolutional neural networks (CNNs), which use a stack of convolution operations followed by nonlinear activations (e.g., Rectified Linear Unit, ReLU) to extract high-level discriminative features, have achieved considerable improvements on visual tasks [12, 6, 24]. Via layer-by-layer connectivity, the extracted features can reach outstanding representational power. Recent advances in CNN architectures, such as ResNet [6], DenseNet [8], ResNeXt [21], and PyramidNet [5], ease the problem of vanishing gradients and boost performance. However, overfitting, which reduces the generalization capability of CNNs, remains a major problem.
A wide variety of regularization strategies have been exploited to alleviate overfitting and decrease the generalization error. Data augmentation [12] is a simple yet effective way to make models adapt to the diversity of data. Batch Normalization [10] standardizes the mean and variance of features for each mini-batch, which makes the optimization landscape smoother [15]. Drop-based methods [7, 18] aim to train an ensemble of sub-networks, which weakens the effect of "co-adaptions" on training data. Recently, Shake-Shake regularization [3] was proposed to randomly interpolate two complementary features in the two branches of ResNeXt, achieving state-of-the-art classification performance. ShakeDrop [22] combined the idea of Stochastic Depth [9] with Shake-Shake regularization to stabilize the training process in the residual branch of ResNet or PyramidNet. Despite the impressive improvements achieved by Shake-based regularization methods, this type of method has two main drawbacks.

1) ShakeDrop regularization was designed for deep networks and is not suitable for shallow network architectures. It may fail to improve generalization performance and can even make the performance worse for shallow networks (see Table I).

2) The regularization strength (or amplitude) is unchangeable over the whole training process. A fixed strong regularization helps reduce overfitting, but it makes the data difficult to fit at the beginning of training. From the perspective of curriculum learning [1], the learner needs to begin with easy examples.
In view of these issues, we propose a dynamic regularization method for CNNs, in which the regularization strength adapts to the dynamics of the training loss. During training, the dynamic regularization strength is gradually increased with respect to the training status. Analogous to human education, the regularizer is regarded as an instructor who gradually increases the difficulty of training examples by way of feature perturbation. Moreover, dynamic regularization can adapt to models of different sizes. It provides a strong regularization for large models and a weak one for light models (see Fig. 4 (b)). That is, the regularization strength grows faster and reaches a higher value for the large model than for the light model.
Fig. 1 shows the proposed dynamic regularization in the ResNet structure. The training loss is not only used to perform backpropagation but also exploited to update the amplitude of the regularization. The features are multiplied by the regularizer in the residual branch. The regularizer works as a perturbation that introduces an augmentation in feature space, so CNNs are trained on a diversity of augmented features. Additionally, the regularization amplitude changes with the dynamics of the training loss. We conduct experiments on the image classification task to evaluate our regularization strategy. Experimental results show that the proposed dynamic regularization outperforms state-of-the-art regularization methods: PyramidNet and ResNeXt equipped with our dynamic regularization achieve higher classification accuracy across various model settings than the same networks with ShakeDrop [22] and Shake-Shake [3] regularization.
The rest of this paper is organized as follows. We first briefly introduce the related work on deep CNNs and regularization methods in Section II. Then, the proposed dynamic regularization is presented in Section III. Experimental results and discussion are given in Section IV. Finally, Section V concludes this paper.
II Related Work
II-A Deep CNNs
CNNs have become deeper and wider with a more powerful capacity [6, 8, 5, 17, 20]. As our proposed regularization is based on ResNet and its variants, we briefly review the basic structure of ResNet, i.e., residual block.
Residual block. The residual block (ResBlock, shown in Fig. 1) is formulated as
$$x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l) \tag{1}$$
where the identity branch $x_l$ is the input feature of the $l$-th ResBlock, which is added to a residual branch $\mathcal{F}$ that applies nonlinear transformations to the input $x_l$ with a set of parameters $\mathcal{W}_l$ ($\mathcal{W}_l$ will be omitted for simplicity). $\mathcal{F}$ consists of two Conv-BN-ReLU or Bottleneck architectures in the original ResNet structure [6]. In recent improvements, $\mathcal{F}$ can also take other forms, e.g., Wide-ResNet [23], Inception module [19], PyramidNet [5], and ResNeXt [21]. PyramidNet gradually increases the number of channels in the ResBlocks as the layers go deeper. ResNeXt has multiple aggregated residual branches expressed as
$$x_{l+1} = x_l + \mathcal{F}_1(x_l) + \mathcal{F}_2(x_l) \tag{2}$$
where $\mathcal{F}_1$ and $\mathcal{F}_2$ are two residual branches. The number of branches (namely cardinality) is not limited.
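The additive structure described for Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `res_block`, `relu`, and `double` are hypothetical names, features are plain lists of floats, and the toy branches stand in for the Conv-BN-ReLU stacks.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Branch = Callable[[Sequence[float]], Vector]


def res_block(x: Sequence[float], branches: Sequence[Branch]) -> Vector:
    """Identity branch plus the sum of all residual branches."""
    out = list(x)  # identity branch
    for branch in branches:
        residual = branch(x)
        out = [o + r for o, r in zip(out, residual)]
    return out


# Toy residual branches (hypothetical stand-ins for learned transformations).
def relu(x: Sequence[float]) -> Vector:
    return [max(0.0, v) for v in x]


def double(x: Sequence[float]) -> Vector:
    return [2.0 * v for v in x]


x = [1.0, -2.0]
single = res_block(x, [relu])          # identity + one residual branch
multi = res_block(x, [relu, double])   # identity + two residual branches
```

With one residual branch the block has the 2-branch form, and with two residual branches it has the 3-branch (ResNeXt-like) form; more branches simply extend the list.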
II-B Regularization
In addition to the advances in network architectures, many regularization techniques, e.g., data augmentation [12, 2], stochastic dropping [18, 9, 14], and Shake-based regularization methods [3, 22], have been successfully applied to avoid overfitting of CNNs.
Data augmentation (e.g., random cropping, flipping, and color adjusting [12]) is a simple yet effective strategy to increase the diversity of data. DeVries and Taylor [2] introduced an image augmentation technique in which augmented images are generated by randomly cutting out square regions from input images (called Cutout). Dropout [18] is a widely used technique that stochastically drops hidden nodes from the network during training. Following this idea, Maxout [4], Continuous Dropout [16], DropPath [14], and Stochastic Depth [9] were proposed. Based on ResNet, Stochastic Depth randomly drops a certain number of residual branches so that the network is shrunk in training. It performs inference using the whole network without dropping. Shake-based regularization approaches [3, 22] were recently proposed to augment features inside CNNs and achieve outstanding classification performance.
Shake-based regularization approaches. Gastaldi [3] proposed the Shake-Shake regularization method, as shown in Fig. 2 (a). A random variable $\alpha$ is used to control the interpolation of the two residual branches (i.e., $\mathcal{F}_1$ and $\mathcal{F}_2$ in the 3-branch ResNeXt). It is given by:
$$x_{l+1} = x_l + \alpha \mathcal{F}_1(x_l) + (1 - \alpha) \mathcal{F}_2(x_l) \tag{3}$$
where $\alpha \in [0, 1]$ follows the uniform distribution in the forward pass. For the backward pass, $\alpha$ is replaced by another uniform random variable $\beta$ to disturb the learning process. The regularization amplitude of each branch is fixed.
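As a rough illustration of the Shake-Shake forward pass, the sketch below interpolates two residual branches with a random coefficient. It assumes the interpolation form x + alpha*F1(x) + (1 - alpha)*F2(x) with alpha drawn from U(0, 1); the function name, the toy branches, and the use of alpha = 0.5 at inference are assumptions for illustration. The separate backward-pass coefficient requires an autograd framework and is not modeled here.

```python
import random


def shake_shake_forward(x, f1, f2, rng, training=True):
    """Return x + alpha*F1(x) + (1 - alpha)*F2(x).

    alpha ~ U(0, 1) during training; alpha = 0.5 at inference
    (assumed here, following the usual Shake-Shake setup).
    """
    alpha = rng.random() if training else 0.5
    r1, r2 = f1(x), f2(x)
    return [xi + alpha * a + (1.0 - alpha) * b for xi, a, b in zip(x, r1, r2)]


rng = random.Random(0)
x = [1.0, 2.0]
branch_a = lambda v: [0.0 for _ in v]   # toy branch (hypothetical)
branch_b = lambda v: [1.0 for _ in v]   # toy branch (hypothetical)
y_train = shake_shake_forward(x, branch_a, branch_b, rng)
y_eval = shake_shake_forward(x, branch_a, branch_b, rng, training=False)
```

Because the two coefficients always sum to one, the output stays between the two single-branch results, which is why the amplitude of this perturbation cannot grow or shrink during training.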
To extend the use of Shake-Shake regularization, Yamada et al. [22] introduced a single Shake in 2-branch architectures (e.g., ResNet or PyramidNet), as shown in Fig. 2 (b), in which they adopted Stochastic Depth [9] to stabilize the learning:
$$x_{l+1} = x_l + (b_l + \alpha - b_l \alpha)\, \mathcal{F}(x_l) \tag{4}$$
where $\alpha$ is a uniform random variable and $b_l$ is a Bernoulli random variable that decides whether to perform the original network (i.e., $x_{l+1} = x_l + \mathcal{F}(x_l)$, if $b_l = 1$) or the perturbed one (i.e., $x_{l+1} = x_l + \alpha \mathcal{F}(x_l)$, if $b_l = 0$). In the backward pass, $\alpha$ is replaced by another random variable $\beta$. The regularization amplitude of the branch is also fixed. In [22], Yamada et al. also presented a Single-branch Shake structure without the original network, $x_{l+1} = x_l + \alpha \mathcal{F}(x_l)$, in which the perturbation is applied in the feature space. They showed that this structure gives poor results in some cases. For instance, the error rate of the 110-layer PyramidNet with Single-branch Shake rises to 77.99% on CIFAR-100. Such a fixed, large regularization overemphasizes the suppression of overfitting. We argue that a fixed regularization amplitude cannot fit the dynamics of the training process and different model sizes well.
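The switching behaviour of the ShakeDrop coefficient can be checked with a short sketch. It assumes the coefficient form b + alpha - b*alpha, which reduces to 1 (original network) when b = 1 and to alpha (perturbed network) when b = 0. The names `shakedrop_coefficient`, `sample_coefficient`, and `keep_prob`, as well as the range [-1, 1] for alpha, are illustrative assumptions.

```python
import random


def shakedrop_coefficient(b, alpha):
    """b + alpha - b*alpha: equals 1 when b == 1 and alpha when b == 0."""
    return b + alpha - b * alpha


def sample_coefficient(keep_prob, rng):
    """Draw the Bernoulli gate b and the uniform noise alpha, then combine."""
    b = 1 if rng.random() < keep_prob else 0   # keep_prob is hypothetical
    alpha = rng.uniform(-1.0, 1.0)             # assumed range of alpha
    return shakedrop_coefficient(b, alpha)


rng = random.Random(0)
coefficients = [sample_coefficient(0.5, rng) for _ in range(1000)]
```

Every sampled coefficient lies in the same fixed interval regardless of the training progress, which is the fixed-amplitude behaviour discussed above.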
III The Proposed Method
As mentioned above, the fixed regularization strength in existing regularization methods, such as DropPath [14], Stochastic Depth [9], Shake-Shake [3], and ShakeDrop [22], departs from the human learning paradigm (e.g., curriculum learning [1] or self-paced learning [13]). A naive remedy is to predefine a schedule for updating the regularization strength, such as the linear-increment scheme in [25], which linearly increases the learning difficulty from low to high. We argue that a predefined schedule is not flexible enough to reflect the learning process. Inspired by the fact that the feedback of the learning itself can provide useful information, we propose a dynamic regularization, which is capable of adjusting the regularization strength adaptively.
Our dynamic regularization for CNNs is based on the dynamics of the training loss. Specifically, at the beginning of the training process, both the training and testing losses keep decreasing, which means the network is learning to recognize the images. However, after a certain number of iterations, the network may overfit the training data, so that the training loss decreases more rapidly than the testing loss. The design of the regularization method needs to follow these dynamics. If the training loss drops in an iteration, the regularization strength should increase in the next iteration to counter overfitting; otherwise, the regularization strength should decrease to counter underfitting. In what follows, we first introduce the network architectures with dynamic regularization and then elaborate on the update of the regularization strength in each iteration of the training process.
III-A Network Architectures with Dynamic Regularization
We apply the dynamic regularization method to two residual network architectures: the 2-branch architecture (e.g., PyramidNet [5]) and the 3-branch architecture (e.g., ResNeXt [21]).
III-A1 The 2-branch architecture with dynamic regularization
Training phase. During training, dynamic regularization is adopted in the ResBlock, as shown in Figs. 3 (a) and (b). Specifically, a dynamic regularization unit (called random perturbation) is introduced into the residual branch of the ResBlock. The random perturbation is given by
$$\theta = A + s_i \cdot r \tag{5}$$
where $A$ is the basic constant amplitude, $s_i$ is the dynamic factor at the $i$-th iteration, and $r$ is uniform random noise with expected value $E(r) = 0$. The value of $s_i$ is updated via the backward difference of the training loss (see Section III-B). The regularization amplitude is proportional to $s_i$. In the forward pass, the output of the ResBlock can be expressed as:
$$x_{l+1} = x_l + (A + s_i \cdot r)\, \mathcal{F}(x_l) \tag{6}$$
In the backward pass, $\theta$ takes a different value (shown separately in Fig. 3 (b)) because the random noise $r$ is drawn again.
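The forward pass of a ResBlock with random perturbation can be sketched as follows. The sketch assumes the perturbation has the form theta = A + s * r, with r drawn uniformly from a symmetric range [-R, R] so that its expected value is zero; this form, all function and parameter names, and the toy residual branch are assumptions for illustration rather than the authors' code. The independently drawn backward-pass value is not modeled.

```python
import random


def random_perturbation(amplitude, dynamic_factor, noise_range, rng):
    """theta = A + s * r with r ~ U(-R, R), so E[theta] = A (assumed form)."""
    r = rng.uniform(-noise_range, noise_range)
    return amplitude + dynamic_factor * r


def dynamic_res_block(x, residual_fn, amplitude, dynamic_factor, noise_range, rng):
    """Forward pass: identity branch plus the perturbed residual branch."""
    theta = random_perturbation(amplitude, dynamic_factor, noise_range, rng)
    return [xi + theta * fi for xi, fi in zip(x, residual_fn(x))]


rng = random.Random(0)
identity_residual = lambda v: list(v)   # toy residual branch (hypothetical)
no_noise = dynamic_res_block([1.0, 2.0], identity_residual, 1.0, 0.0, 1.0, rng)
noisy = dynamic_res_block([1.0, 2.0], identity_residual, 1.0, 0.5, 1.0, rng)
```

With a dynamic factor of zero the block reduces to a plain residual block scaled by A; as the factor grows, the spread of theta around A widens, which is what makes the regularization amplitude proportional to the dynamic factor.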
Random noise. The range of $r$, i.e., $[-R, R]$, is a hyperparameter in the training phase. A straightforward choice is to set $R$ to be uniform across all ResBlocks. According to [9], the features of the earlier ResBlocks should be preserved more than those of the later ResBlocks. Hence, we propose a linear enhancement rule to configure this range across ResBlocks. For the $k$-th ResBlock, the range denoted as $R_k$ is given by
$$R_k = \frac{k}{N} R \tag{7}$$
where $N$ is the total number of ResBlocks. With the increasing trend of the range $R_k$, the regularization strength is gradually raised from the bottom layer to the top layer. We compare different settings of $R_k$ across ResBlocks in Section IV.
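A linearly growing noise range can be generated as below. The exact rule is an assumption here: the sketch scales a maximum range proportionally to the block index, R_k = (k / N) * R for k = 1..N, which is one natural reading of a "linear enhancement rule"; `linear_noise_ranges` and `uniform_noise_ranges` are hypothetical helper names.

```python
def linear_noise_ranges(max_range, num_blocks):
    """Range for block k (1-based) grows linearly up to max_range."""
    return [max_range * k / num_blocks for k in range(1, num_blocks + 1)]


def uniform_noise_ranges(max_range, num_blocks):
    """Baseline alternative: every block uses the same range."""
    return [max_range] * num_blocks


linear = linear_noise_ranges(1.0, 4)
uniform = uniform_noise_ranges(1.0, 4)
```

Under this rule the earliest block receives the weakest perturbation and the last block the full range, so early features are disturbed least.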
III-A2 The 3-branch architecture with dynamic regularization
We apply the dynamic regularization to a 3-branch architecture (see Fig. 2 (a)). Shake-Shake regularization is given by Eq. (3), in which $\alpha$ is a uniform random variable. We introduce the random perturbation $\theta$ in Eq. (5) to replace $\alpha$ in Eq. (3). The ResBlock with dynamic regularization can be defined as
$$x_{l+1} = x_l + \theta\, \mathcal{F}_1(x_l) + (1 - \theta)\, \mathcal{F}_2(x_l) \tag{10}$$
If the amplitude, the dynamic factor, and the range of the random noise are chosen so that $\theta$ ranges over $[0, 1]$, $\theta$ is consistent with $\alpha$ in Eq. (3). Shake-Shake regularization can therefore be thought of as a special case of our dynamic regularization with a fixed dynamic factor.
III-B Update of the Regularization Strength
The proposed update of the dynamic regularization strength is driven by the dynamics of the training loss. Specifically, the dynamic characteristic of the training loss can be modeled as the difference of the training loss between successive iterations. We define the backward difference between the training losses at two successive iterations as
$$\Delta \mathcal{L}_i = \mathcal{L}_i - \mathcal{L}_{i-1} \tag{11}$$
where $\mathcal{L}_i$ denotes the training loss at the $i$-th iteration. Although the training loss shows an overall downtrend, it fluctuates heavily as sequential mini-batches are fed. To eliminate the noise and find the trend of the loss, we apply a Gaussian filter to smooth it. Hence, the filtered backward difference can be rewritten as
(12) 
where is the filtering operation defined as
(13) 
where the filter length is . We use the normalized Gaussian window defined by
(14) 
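where , and . The standard deviation is determined by . We will discuss the Gaussian filter in the experiments. The dynamic factor in Eqs. (6) and (10) is updated with respect to the filtered backward difference of Eq. (12), denoted $\Delta \tilde{\mathcal{L}}_i$, i.e.,
$$s_{i+1} = \begin{cases} s_i + \Delta s, & \text{if } \Delta \tilde{\mathcal{L}}_i \le 0 \\ s_i - \Delta s, & \text{otherwise} \end{cases} \tag{15}$$
where $\Delta s$ is a small constant step for changing the regularization amplitude. From Eq. (15), it can be observed that if the training loss decreases ($\Delta \tilde{\mathcal{L}}_i \le 0$), the regularization amplitude increases to avoid overfitting; otherwise, it decreases to prevent underfitting. The dynamic factor keeps being updated to follow the dynamics of the training loss in each iteration of the training procedure.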
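The update procedure of this subsection can be sketched end to end: smooth the loss history with a normalized Gaussian window, take the backward difference of the smoothed values, and move the dynamic factor up or down by a constant step. Several details are assumptions made only for this sketch: the window is centred on the most recent `length` losses, short histories are padded with the oldest loss, a zero difference counts as a decrease, and the window length, sigma, and step values are arbitrary. All function names are hypothetical.

```python
import math


def gaussian_window(length, sigma):
    """Gaussian weights centred on the window, normalized to sum to 1."""
    center = (length - 1) / 2.0
    weights = [math.exp(-0.5 * ((k - center) / sigma) ** 2) for k in range(length)]
    total = sum(weights)
    return [w / total for w in weights]


def smoothed_loss(losses, window):
    """Weighted average of the most recent len(window) losses."""
    n = len(window)
    recent = list(losses[-n:])
    # Pad short histories by repeating the oldest loss (an assumption).
    recent = [recent[0]] * (n - len(recent)) + recent
    return sum(w * v for w, v in zip(window, recent))


def filtered_difference(losses, window):
    """Backward difference of the smoothed loss; needs at least two losses."""
    return smoothed_loss(losses, window) - smoothed_loss(losses[:-1], window)


def update_dynamic_factor(s, diff, step):
    """Raise s when the smoothed loss falls (diff <= 0), lower it otherwise."""
    return s + step if diff <= 0 else s - step


window = gaussian_window(length=5, sigma=1.0)
falling = [2.0 - 0.1 * i for i in range(10)]   # steadily decreasing loss
rising = [1.0 + 0.1 * i for i in range(10)]    # steadily increasing loss
s_up = update_dynamic_factor(0.0, filtered_difference(falling, window), 0.01)
s_down = update_dynamic_factor(0.5, filtered_difference(rising, window), 0.01)
```

Because the window weights sum to one, a loss that changes by a constant amount per iteration yields a filtered difference equal to that amount, while isolated mini-batch spikes are damped before they can flip the sign of the update.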
Remark. Some existing methods also change the regularization strength. For instance, Zoph et al. [25] introduced ScheduledDropPath to regularize NASNets, which is a linear-increment scheme for the regularization strength: the probability of dropping a path is increased linearly throughout training. However, a constant or linear scheme is a predefined rule, which cannot adapt to the training procedure or to different model sizes. In contrast, our proposed dynamic scheduling exploits the dynamics of the training loss and is applicable to the training procedure of different network architectures. In Section IV, we compare these schemes.
IV Experimental Results
In this section, we evaluate the proposed dynamic regularization on the classification benchmark CIFAR-100 [11], in comparison with two state-of-the-art regularization approaches: Shake-Shake [3] and ShakeDrop [22]. We then conduct ablation studies to compare with the fixed and linear-increment schemes of the regularization strength, and discuss the effectiveness of the Gaussian filter and the random noise.
IV-A Implementation Details
The following settings are used throughout the experiments. We set the training epoch to and the batch size to . The learning rate was initialized to for the 2-branch architecture as in [22] and for the 3-branch architecture as in [3], and we used the cosine learning schedule to gradually reduce the learning rate to at the end of training. For the dynamic regularization, we set the initial dynamic factor , , and for the 2-branch architecture and for the 3-branch architecture. The length of the Gaussian filter was . PyramidNet [5] and ResNeXt [21] were used as baselines. We employed the standard translation, flipping [12], and Cutout [2] as the data augmentation scheme. Therefore, the Shake-based regularizer is the only variable that affects the experiments. All experimental results are reported as the average of 3 runs at the 300th epoch.
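For readers unfamiliar with the cosine learning schedule mentioned above, a common formulation is sketched below. The function name and the numerical values (initial rate 0.1, final rate 0, 300 steps) are placeholders chosen only for illustration and are not the settings of these experiments.

```python
import math


def cosine_lr(step, total_steps, lr_init, lr_final=0.0):
    """Cosine annealing from lr_init at step 0 to lr_final at total_steps."""
    progress = step / total_steps
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
schedule = [cosine_lr(t, 300, 0.1) for t in range(301)]
```

The rate decays slowly at the start and end of training and fastest in the middle, and it never increases along the way.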
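IV-B Comparison with State-of-the-Art Regularization Methods
TABLE I: Top-1 error (%) on CIFAR-100 for the 2-branch architecture (PyramidNet).

| Network Architecture | Params | Regularization | Top-1 Error (%) |
|---|---|---|---|
| PyramidNet-110-a48 | 1.8M | Baseline [5] | 23.40 |
| PyramidNet-110-a48 | 1.8M | ShakeDrop [22] | 21.60 |
| PyramidNet-110-a48 | 1.8M | Dynamic (ours) | 21.32 |
| PyramidNet-26-a84 | 0.9M | Baseline [5] | 26.30 |
| PyramidNet-26-a84 | 0.9M | ShakeDrop [22] | 31.83 |
| PyramidNet-26-a84 | 0.9M | Dynamic (ours) | 23.83 |
| PyramidNet-26-a200 | 3.8M | Baseline [5] | 22.53 |
| PyramidNet-26-a200 | 3.8M | ShakeDrop [22] | 26.11 |
| PyramidNet-26-a200 | 3.8M | Dynamic (ours) | 20.34 |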
We first compare the proposed dynamic regularization with ShakeDrop in the 2-branch architecture on CIFAR-100. Following ShakeDrop, we used PyramidNet [5] as our baseline (namely Baseline) and chose different architectures: 1) PyramidNet-110-a48 (i.e., the network has a depth of 110 and a widening factor of 48), which is a deep and narrow network, 2) PyramidNet-26-a84, which is a light network, and 3) PyramidNet-26-a200, which is a shallow and wide network.
Table I shows the experimental results. From Table I, it can be observed that our dynamic regularization outperforms ShakeDrop in all of these architectures. The error rates of ShakeDrop are even worse than those of Baseline in the shallow architectures, i.e., PyramidNet-26-a84 and PyramidNet-26-a200, which means that ShakeDrop with a fixed regularization strength fails in this case. This issue stems from the Stochastic Depth [9] component of ShakeDrop, which works well mainly for deep networks. Regardless of the depth of the network, PyramidNet with dynamic regularization obtains a consistent improvement. Networks with dynamic regularization are comparable with baseline networks that have about twice as many parameters (e.g., 23.83% for 26-a84 with Dynamic vs. 23.40% for the 110-a48 Baseline, and 21.32% for 110-a48 with Dynamic vs. 22.53% for the 26-a200 Baseline).
For the 3-branch architecture, we compare the dynamic regularization with Shake-Shake [3] in ResNeXt-26-2x32d (i.e., the network has a depth of 26 and 2 residual branches, and the first residual block has a width of 32) and ResNeXt-26-2x64d, as shown in Table II. We can see that the error rates of dynamic regularization are lower than those of Shake-Shake. The results in Tables I and II show that our dynamic regularization can adapt to various network architectures.
Fig. 4 shows the training loss, dynamic factor, and Top-1 error with respect to the epoch in two networks, i.e., PyramidNet-26-a84 and PyramidNet-110-a48. For networks with dynamic regularization, the downward trend of the training loss is slowed down, unlike Baseline, in which the loss goes down towards zero (see Fig. 4 (a)). Dynamic regularization can thus prevent networks from rote learning the training data. As shown in Fig. 4 (b), the dynamic factor of the two network architectures gradually increases throughout the training process. Instead of using a predefined scheduling function as in [25], our dynamic scheduling is self-adaptive according to the backward difference of the training loss. Another important property of the dynamic scheduling is that a small regularization strength is generated for a light model (i.e., 26-a84) and a large strength for a large model (i.e., 110-a48). Fig. 4 (c) illustrates that networks with dynamic regularization narrow the gap between the training and testing errors (from Gap1 to Gap2, and from Gap3 to Gap4) and achieve a lower testing error than Baseline.
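TABLE II: Top-1 error (%) on CIFAR-100 for the 3-branch architecture (ResNeXt).

| Network Architecture | Params | Regularization | Top-1 Error (%) |
|---|---|---|---|
| ResNeXt-26-2x32d | 2.9M | Baseline [5] | 22.95 |
| ResNeXt-26-2x32d | 2.9M | Shake-Shake [3] | 21.45 |
| ResNeXt-26-2x32d | 2.9M | Dynamic (ours) | 20.91 |
| ResNeXt-26-2x64d | 11.7M | Baseline [5] | 20.59 |
| ResNeXt-26-2x64d | 11.7M | Shake-Shake [3] | 19.19 |
| ResNeXt-26-2x64d | 11.7M | Dynamic (ours) | 18.76 |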
IV-C Ablation Study and Discussion
IV-C1 Schedules of the regularization strength
Apart from the proposed dynamic schedule, the regularization strength can be adjusted by a linear-increment schedule as in [25], where ScheduledDropPath linearly increases the probability of dropping a path (which can also be considered as the regularization strength) during training. Besides, a fixed regularization schedule is commonly used in many previous methods [14, 9, 3, 22]. We compared our dynamic method with such fixed and linear-increment schedules, using PyramidNet-26-a84 as the backbone.
Table III lists the different configurations of the regularization strength. 'Fix' means the dynamic factor is fixed to a constant value, and 'Linear' means the dynamic factor is increased linearly over the course of the training steps. 'Fix2' and 'Linear3' achieve the best results among the fixed and linear schedules, respectively. Compared with them, the dynamic setting achieves the best performance with a 23.83% error rate, which shows the effectiveness of our dynamic regularization schedule.
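The two predefined baselines compared here can be written as simple functions of the training step. This is a generic sketch: the function names and the endpoint values used in the example are hypothetical and do not correspond to the 'Fix1' to 'Fix4' or 'Linear1' to 'Linear4' settings, whose values are not reproduced here.

```python
def fixed_factor(step, total_steps, value):
    """'Fix' schedule: the dynamic factor never changes."""
    return value


def linear_factor(step, total_steps, start, end):
    """'Linear' schedule: interpolate from start to end over training."""
    return start + (end - start) * step / total_steps


# Hypothetical endpoints, for illustration only.
fixed = [fixed_factor(t, 100, 0.5) for t in (0, 50, 100)]
linear = [linear_factor(t, 100, 0.0, 1.0) for t in (0, 50, 100)]
```

Both schedules depend only on the step counter, whereas the dynamic schedule depends on the observed training loss, so only the latter can react to how a particular model is actually fitting the data.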
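TABLE III: Top-1 error (%) of PyramidNet-26-a84 with different schedules of the regularization strength.

| Schedule | Top-1 Error (%) | Schedule | Top-1 Error (%) |
|---|---|---|---|
| Fix1 | 25.45 | Linear1 | 25.76 |
| Fix2 | 24.75 | Linear2 | 25.09 |
| Fix3 | 25.52 | Linear3 | 24.28 |
| Fix4 | 30.52 | Linear4 | 25.80 |
| Dynamic | 23.83 | | |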
IV-C2 Random noise
As mentioned in Section III, the range of the random noise involved in our dynamic regularization is designed to grow linearly from the bottom ResBlocks to the top ResBlocks. To evaluate this setting, we performed the dynamic regularization with a uniform range and with a linearly growing range in PyramidNet-26-a84. From the third and fourth rows of Table IV, we can see that the model with a uniform range is inferior to the model with a range that grows linearly across ResBlocks (25.83% vs. 23.83%).
IV-C3 Gaussian filtering
In the process of updating the dynamic factor, we employed a Gaussian filter to remove the instantaneous changes of the training loss in mini-batch mode. That is, we use Eq. (12) instead of Eq. (11) to update the dynamic factor. To study the effectiveness of the Gaussian filter, we conducted comparative experiments between Eq. (11) and Eq. (12). The last two rows of Table IV show that if we remove the Gaussian filter, the error rate increases by 1.38%. This shows that the Gaussian filter also plays an important role in dynamic regularization.
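TABLE IV: Top-1 error (%) of PyramidNet-26-a84 with different settings of the random noise and the Gaussian filter.

| Configuration | Top-1 Error (%) |
|---|---|
| Baseline | 26.30 |
| Dynamic-Uniform | 25.28 |
| Dynamic-Linear growth | 23.83 |
| Dynamic-No filter | 25.21 |
| Dynamic-Gaussian filter | 23.83 |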
V Conclusion
In this paper, we have presented a dynamic schedule that adjusts the regularization strength to fit various network architectures and the training process. Our dynamic regularization is self-adaptive in accordance with the change of the training loss. It produces a low regularization strength for light network architectures and a high regularization strength for large ones. Furthermore, the strength grows in a self-paced manner to avoid overfitting. Experimental results demonstrate that the proposed dynamic regularization outperforms the state-of-the-art ShakeDrop and Shake-Shake regularization in the feature augmentation field. We believe that dynamic regularization is also worth exploring for data augmentation and Dropout-based methods in the future.
References
[1] (2009) Curriculum learning. In Proceedings of the Annual International Conference on Machine Learning, pp. 41–48.
[2] (2017) Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552.
[3] (2017) Shake-shake regularization. CoRR abs/1705.07485.
[4] (2013) Maxout networks. In International Conference on Machine Learning.
[5] (2017) Deep pyramidal residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5927–5935.
[6] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[7] (2012) Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580.
[8] (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
[9] (2016) Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646–661.
[10] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
[11] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
[12] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[13] (2010) Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pp. 1189–1197.
[14] (2017) FractalNet: ultra-deep neural networks without residuals. In International Conference on Learning Representations.
[15] (2018) How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493.
[16] (2017) Continuous dropout. IEEE Transactions on Neural Networks and Learning Systems 29 (9), pp. 3926–3937.
[17] (2014) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
[18] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[19] (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence.
[20] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
[21] (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
[22] (2018) ShakeDrop regularization for deep residual learning. CoRR abs/1802.02375.
[23] (2016) Wide residual networks. In British Machine Vision Conference.
[24] (2019) Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems.
[25] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.