Global Sparse Momentum SGD for Pruning Very Deep Neural Networks
Abstract
Deep Neural Networks (DNNs) are powerful but computationally expensive and memory intensive, which impedes their practical usage on resource-constrained front-end devices. DNN pruning is an approach for deep model compression, which aims at eliminating some parameters with tolerable performance degradation. In this paper, we propose a novel momentum-SGD-based optimization method to reduce the network complexity by on-the-fly pruning. Concretely, given a global compression ratio, we categorize all the parameters into two parts at each training iteration, which are updated using different rules. In this way, we gradually zero out the redundant parameters, as we update them using only the ordinary weight decay but no gradients derived from the objective function. As a departure from prior methods that require heavy human effort to tune the layer-wise sparsity ratios, prune by solving complicated non-differentiable problems or fine-tune the model after pruning, our method is characterized by 1) global compression that automatically finds the appropriate per-layer sparsity ratios; 2) end-to-end training; 3) no need for a time-consuming re-training process after pruning; and 4) superior capability to find better winning tickets which have won the initialization lottery.
1 Introduction
Recent years have witnessed the great success of Deep Neural Networks (DNNs) in many real-world applications. However, today’s very deep models come with millions of parameters, which makes them difficult to deploy on computationally limited devices. In this context, DNN pruning approaches have attracted much attention, where we eliminate some connections (i.e., individual parameters) han2015learning ; hassibi1993second ; lecun1990optimal , or channels li2016pruning , so that the required storage space and computations can be reduced. This paper is focused on connection pruning, but the proposed method can be easily generalized to structured pruning (e.g., neuron-, kernel- or filter-level). In order to reach a good trade-off between accuracy and model size, many pruning methods have been proposed, which can be categorized into two typical paradigms. 1) Some researchers dong2017learning ; guo2016dynamic ; han2015learning ; hassibi1993second ; hu2016network ; lecun1990optimal ; li2016pruning ; luo2017thinet ; molchanov2016pruning propose to prune the model by some means to reach a certain level of compression ratio, then fine-tune it using ordinary SGD to restore the accuracy. 2) The other methods seek to produce sparsity in the model through a customized learning procedure alvarez2016learning ; ding2018auto ; lin2019towards ; liu2015sparse ; wang2018structured ; wen2016learning ; zhang2018systematic .
Though the existing methods have achieved great success in pruning, there are some typical drawbacks. Specifically, when we seek to prune a model in advance and fine-tune it, we confront two problems:


The layer-wise sparsity ratios are inherently tricky to set as hyperparameters. Many previous works han2015learning ; he2017channel ; hu2016network ; li2016pruning have shown that some layers in a DNN are sensitive to pruning, but some can be pruned significantly without degrading the model in accuracy. As a consequence, it requires prior knowledge to tune the layer-wise hyperparameters in order to maximize the global compression ratio without unacceptable accuracy drop.

The pruned models are difficult to train, and we cannot predict the final accuracy after fine-tuning. E.g., the filter-level-pruned models can be easily trapped in bad local minima, and sometimes cannot even reach a level of accuracy similar to that of a counterpart trained from scratch ding2019centripetal ; liu2019rethinking . And in the context of connection pruning, the sparser the network, the slower the learning and the lower the eventual test accuracy DBLP:conf/iclr/FrankleC19 .
On the other hand, pruning by learning is not easier due to:


In some cases we introduce a hyperparameter to control the trade-off, which does not directly reflect the resulting compression ratio. For instance, MorphNet gordon2018morphnet uses group Lasso roth2008group to zero out some filters for structured pruning, where a key hyperparameter is the Lasso coefficient. However, given a specific value of the coefficient, we cannot predict the final compression ratio before the training ends. Therefore, when we target a specific eventual compression ratio, we have to try multiple coefficient values in advance and choose the one that yields the result closest to our expectation.

Some methods prune by solving an optimization problem which directly concerns the sparsity. As the problem is non-differentiable, it cannot be solved using SGD-based methods in an end-to-end manner. A more detailed discussion will be provided in Sect. 3.2.
In this paper, we seek to overcome the drawbacks discussed above by directly altering the gradient flow based on momentum SGD, which explicitly concerns the eventual compression ratio and can be implemented via end-to-end training. Concretely, we use the first-order Taylor series to measure the importance of a parameter by estimating how much the objective function value will be changed by removing it molchanov2016pruning ; theis2018faster . Based on that, given a global compression ratio, we categorize all the parameters into two parts that will be updated using different rules, which is referred to as activation selection. For the unimportant parameters, we perform passive update with no gradients derived from the objective function but only the ordinary weight decay (i.e., $\ell_2$ regularization) to penalize their values. On the other hand, via active update, the critical parameters are updated using both the objective-function-related gradients and the weight decay to maintain the model accuracy. Such a selection is conducted at each training iteration, so that a deactivated connection gets a chance to be reactivated in the next iteration. Through continuous momentum-accelerated passive updates we can make most of the parameters infinitely close to zero, such that pruning them causes no damage to the model’s accuracy. Owing to this, there is no need for a fine-tuning process. In contrast, some previously proposed regularization terms can only reduce the parameters to some extent, thus pruning still degrades the model. Our contributions are summarized as follows.


For lossless pruning and end-to-end training, we propose to directly alter the gradient flow, which is clearly distinguished from existing methods that either add a regularization term or turn model pruning into solving some non-differentiable optimization problem.

We propose Global Sparse Momentum SGD (GSM), a novel SGD optimization method, which splits the update rule of momentum SGD into two parts. GSM-based DNN pruning requires a sole global eventual compression ratio as hyperparameter and can automatically discover the appropriate per-layer sparsity ratios to achieve it.

Through experiments, we validate the capability of GSM to achieve high compression ratios on MNIST, CIFAR-10 krizhevsky2009learning and ImageNet deng2009imagenet as well as to find better winning tickets DBLP:conf/iclr/FrankleC19 . The codes are available at https://github.com/DingXiaoH/GSMSGD.
2 Related work
2.1 Momentum SGD
Stochastic gradient descent only takes the first-order derivatives of the objective function into account and not the higher ones kathuria2018 . Momentum is a popular technique used along with SGD, which accumulates the gradients of the past steps to determine the direction to go, instead of using only the gradient of the current step. I.e., momentum gives SGD a short-term memory goh2017why . Formally, let $L$ be the objective function, $w$ be a single parameter, $\alpha$ be the learning rate, $\beta$ be the momentum coefficient which controls the percentage of the gradient retained every iteration, and $\eta$ be the ordinary weight decay coefficient (e.g., $1 \times 10^{-4}$ for ResNets he2016deep ); the update rule is
$$ z^{(k+1)} \leftarrow \beta z^{(k)} + \eta w^{(k)} + \frac{\partial L}{\partial w^{(k)}}, \qquad w^{(k+1)} \leftarrow w^{(k)} - \alpha z^{(k+1)}. \tag{1} $$
There is a popular story about momentum goh2017why ; polyak1964some ; rutishauser1959theory ; sutskever2013importance : gradient descent is a man walking down a hill. He follows the steepest path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill. The added inertia acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima. In this paper, we use momentum as an accelerator to boost the passive updates.
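The update rule of Formula 1 can be sketched for a single scalar parameter as follows. This is a minimal illustration rather than the authors' implementation: the function name `momentum_sgd_step`, the toy quadratic objective and all numeric settings are assumptions chosen only to make the example runnable.

```python
def momentum_sgd_step(w, z, grad, lr, beta, weight_decay):
    """One momentum SGD step for a scalar parameter.

    z accumulates past (gradient + weight decay) terms, decayed by beta;
    w then moves against the accumulated direction.
    """
    z_next = beta * z + weight_decay * w + grad
    w_next = w - lr * z_next
    return w_next, z_next


# Hypothetical toy objective L(w) = 0.5 * (w - 3)^2, so dL/dw = w - 3.
w, z = 0.0, 0.0
for _ in range(500):
    grad = w - 3.0
    w, z = momentum_sgd_step(w, z, grad, lr=0.01, beta=0.9, weight_decay=1e-4)

# With weight decay, the fixed point satisfies (w - 3) + 1e-4 * w = 0.
print(w, 3.0 / (1.0 + 1e-4))
```

Note that the weight decay term enters the momentum buffer together with the gradient, which is what later allows momentum to accelerate decay-only updates.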
2.2 DNN pruning and other techniques for compression and acceleration
DNN pruning seeks to remove some parameters without significant accuracy drop, and can be categorized into unstructured and structured techniques based on the pruning granularity. Unstructured pruning (a.k.a. connection pruning) castellano1997iterative ; han2015learning ; hassibi1993second ; lecun1990optimal aims to significantly reduce the number of nonzero parameters, resulting in a sparse model, which can be stored using much less space, but cannot effectively reduce the computational burdens on off-the-shelf hardware and software platforms. On the other hand, structured pruning removes structures (e.g., neurons, kernels or whole filters) from a DNN to obtain practical speedup. E.g., channel pruning ding2019centripetal ; ding2019approximated ; li2016pruning ; liu2019metapruning ; liu2017learning ; liu2019rethinking cannot achieve an extremely high compression ratio of the model size, but can convert a wide CNN into a narrower (but still dense) one to reduce the memory and computational costs. In real-world applications, unstructured and structured pruning are often used together to achieve the desired trade-off.
This paper is focused on connection pruning (but the proposed method can be easily generalized to structured pruning), which has attracted much attention since Han et al. han2015learning pruned DNN connections based on the magnitude of parameters and restored the accuracy via ordinary SGD. Some inspiring works have improved the paradigm of pruning-and-fine-tuning by splicing connections as they become important again guo2016dynamic , directly targeting the energy consumption yang2017designing , utilizing per-layer second derivatives dong2017learning , etc. The other learning-based pruning methods will be discussed in Sect. 3.2.
Apart from pruning, we can also compress and accelerate DNN in other ways. Some works alvarez2017compression ; sainath2013low ; zhang2016accelerating decompose or approximate parameter tensors; quantization and binarization techniques courbariaux2016binarized ; gupta2015deep ; han2015deep ; liu2018bi approximate a model using fewer bits per parameter; knowledge distillation ba2014deep ; hinton2015distilling ; romero2014fitnets transfers knowledge from a big network to a smaller one; some researchers seek to speed up convolution with the help of perforation figurnov2016perforatedcnns , FFT mathieu2013fast ; vasilache2014fast or DCT wang2016cnnpack ; Wang et al. wang2017beyond compact feature maps by extracting information via a Circulant matrix.
3 GSM: Global Sparse Momentum SGD
3.1 Formulation
We first clarify the notations in this paper. For a fully-connected layer with $p$-dimensional input and $q$-dimensional output, we use $W \in \mathbb{R}^{p \times q}$ to denote the kernel matrix. For a convolutional layer with kernel tensor $K \in \mathbb{R}^{h \times w \times r \times s}$, where $h$ and $w$ are the height and width of the convolution kernel, and $r$ and $s$ are the numbers of input and output channels, respectively, we unfold the tensor $K$ into $W \in \mathbb{R}^{hwr \times s}$. Let $N$ be the number of all such layers; we use $\Theta = [W_i]$ ($1 \le i \le N$) to denote the collection of all such kernel matrices, and the global compression ratio $C$ is given by
$$ C = \frac{|\Theta|}{\|\Theta\|_0}, \tag{2} $$
where $|\Theta|$ is the size of $\Theta$ and $\|\Theta\|_0$ is the $\ell_0$ norm, i.e., the number of nonzero entries. Let $L$, $X$, $Y$ be the accuracy-related loss function (e.g., cross entropy for classification tasks), test examples and labels, respectively; we seek to obtain a good trade-off between accuracy and model size by achieving a high compression ratio $C$ without unacceptable increase in the loss $L(X, Y, \Theta)$.
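As a small worked example of the global compression ratio (total number of parameters divided by the number of nonzero ones), the sketch below unfolds a convolutional kernel into a matrix and counts entries over all layers. The helper names `unfold_conv_kernel` and `global_compression_ratio`, the nested-list representation and the toy values are assumptions made for illustration only.

```python
def unfold_conv_kernel(kernel):
    """Unfold a kernel indexed as [height][width][in_channel][out_channel]
    into a matrix with height*width*in_channels rows and out_channels columns."""
    return [
        list(out_weights)
        for plane in kernel
        for row in plane
        for out_weights in row
    ]


def global_compression_ratio(kernels):
    """Total parameter count divided by the number of nonzero parameters."""
    total = sum(len(row) for W in kernels for row in W)
    nonzero = sum(1 for W in kernels for row in W for v in row if v != 0.0)
    if nonzero == 0:
        raise ValueError("all parameters are zero; the ratio is undefined")
    return total / nonzero


# Toy model: one 2x4 fully-connected kernel and one 1x2x1x2 conv kernel.
fc = [[0.0, 1.5, 0.0, 0.0],
      [0.0, 0.0, -0.2, 0.0]]
conv = [[[[0.0, 0.3]], [[0.0, 0.0]]]]
theta = [fc, unfold_conv_kernel(conv)]

ratio = global_compression_ratio(theta)  # 12 parameters, 3 of them nonzero
print(ratio)
```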
3.2 Rethinking learning-based pruning
The optimization target or direction of ordinary DNN training is to minimize the objective function only, but when we seek to produce a sparse model via a customized learning procedure, the key is to deviate from the original training direction by taking into account the sparsity of the parameters. Through training, the sparsity emerges progressively, and we eventually reach the expected trade-off between accuracy and model size, which is usually controlled by one or a series of hyperparameters.
3.2.1 Explicit trade-off as constrained optimization
The trade-off can be explicitly modeled as a constrained optimization problem zhang2018systematic , e.g.,
$$ \min_{\Theta} \; L(X, Y, \Theta) + \sum_{i=1}^{N} g_i(W_i), \tag{3} $$
where $g_i$ is an indicator function,
$$ g_i(W) = \begin{cases} 0 & \text{if } \|W\|_0 \le l_i, \\ +\infty & \text{otherwise}, \end{cases} \tag{4} $$
and $l_i$ is the required number of nonzero parameters at layer $i$. Since the second term of the objective function is non-differentiable, the problem cannot be settled analytically or by stochastic gradient descent, but can be tackled by alternately applying SGD and solving the non-differentiable problem, e.g., using ADMM boyd2011distributed . In this way, the training direction is deviated, and the trade-off is obtained.
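To make the non-differentiability concrete, the sketch below evaluates the sparsity indicator for one layer and shows the Euclidean projection onto the constraint set (keep the largest-magnitude entries, zero the rest). The projection is shown only as the kind of sub-step an alternating scheme may interleave with SGD; it is an assumption for illustration and is not taken from the cited implementation. The names `sparsity_indicator` and `project_to_sparsity` are hypothetical.

```python
import math


def sparsity_indicator(weights, max_nonzero):
    """Return 0 if the layer has at most max_nonzero nonzero entries, else +inf.

    The value jumps between 0 and infinity, so it provides no usable gradient.
    """
    nonzero = sum(1 for v in weights if v != 0.0)
    return 0.0 if nonzero <= max_nonzero else math.inf


def project_to_sparsity(weights, max_nonzero):
    """Euclidean projection onto {w : ||w||_0 <= max_nonzero}:
    keep the max_nonzero entries of largest magnitude and zero the others."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:max_nonzero])
    return [v if i in keep else 0.0 for i, v in enumerate(weights)]


layer = [0.5, -2.0, 0.1, 1.5]          # flattened toy kernel
print(sparsity_indicator(layer, 2))     # constraint violated
projected = project_to_sparsity(layer, 2)
print(projected, sparsity_indicator(projected, 2))
```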
3.2.2 Implicit trade-off using regularizations
It is a common practice to apply some extra differentiable regularizations during training to reduce the magnitude of some parameters, such that removing them causes less damage alvarez2016learning ; han2015learning ; wen2016learning . Let $R(\Theta)$ be the magnitude-related regularization term and $\lambda$ be a trade-off hyperparameter; the problem is
$$ \min_{\Theta} \; L(X, Y, \Theta) + \lambda R(\Theta). \tag{5} $$
However, the weaknesses are two-fold. 1) Some common regularizations, e.g., $\ell_1$, $\ell_2$ and Lasso roth2008group , cannot literally zero out the entries in $\Theta$, but can only reduce the magnitude to some extent, such that removing them still degrades the performance. We refer to this phenomenon as the magnitude plateau. The cause is simple: for a specific trainable parameter $w$, when its magnitude $|w|$ is large at the beginning, the gradient derived from $R$, i.e., $\lambda \frac{\partial R}{\partial w}$, overwhelms $\frac{\partial L}{\partial w}$, thus $|w|$ is gradually reduced. However, as $|w|$ shrinks, $\frac{\partial R}{\partial w}$ diminishes too, such that the reducing tendency of $|w|$ plateaus when $\lambda \frac{\partial R}{\partial w}$ approaches $\frac{\partial L}{\partial w}$ in magnitude, and $w$ maintains a relatively small but nonzero magnitude. 2) The hyperparameter $\lambda$ does not directly reflect the resulting compression ratio, thus we may need to make several attempts to gain some empirical knowledge before we obtain the model with our expected compression ratio.
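The magnitude plateau can be reproduced on a one-parameter toy problem. The sketch below is an illustration under stated assumptions, not an experiment from the paper: the loss $0.5(w-1)^2$, the $\ell_2$-style regularizer $0.5 w^2$, the coefficient 9 and the helper name `train_scalar` are all hypothetical. With the loss gradient present, the parameter settles where the two gradients cancel (here at $1/(1+\lambda) = 0.1$); with the loss gradient removed, as in a decay-only update, it keeps shrinking towards zero.

```python
def train_scalar(w, steps, lr, lam, use_loss_gradient):
    """Plain gradient descent on L(w) + lam * R(w),
    with L(w) = 0.5 * (w - 1)^2 and R(w) = 0.5 * w^2."""
    for _ in range(steps):
        loss_grad = (w - 1.0) if use_loss_gradient else 0.0
        reg_grad = lam * w
        w -= lr * (loss_grad + reg_grad)
    return w


# Regularized training: plateaus at a small but nonzero value.
w_regularized = train_scalar(1.0, steps=200, lr=0.05, lam=9.0, use_loss_gradient=True)

# Decay-only updates (no loss gradient): keeps moving towards zero.
w_decay_only = train_scalar(1.0, steps=200, lr=0.05, lam=9.0, use_loss_gradient=False)

print(w_regularized, w_decay_only)
```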
3.3 Global sparse gradient flow via momentum SGD
To overcome the drawbacks of the two paradigms discussed above, we intend to explicitly control the eventual compression ratio via end-to-end training by directly altering the gradient flow of momentum SGD, which deviates the training direction in order to achieve a high compression ratio as well as maintain the accuracy. Intuitively, we seek to use the gradients to guide the few active parameters in order to minimize the objective function, and penalize most of the parameters to push them infinitely close to zero. Therefore, the first thing is to find a proper metric to distinguish the active part. Given a global compression ratio $C$, we use $Q = |\Theta| / C$ to denote the number of nonzero entries in $\Theta$. At each training iteration, we feed a mini-batch of data into the model, compute the gradients using the ordinary chain rule, calculate the metric values for every parameter, and perform active update on the $Q$ parameters with the largest metric values and passive update on the others. In order to make GSM feasible on very deep models, the metrics should be calculated using only the original intermediate computational results, i.e., the parameters and gradients, but no second-order derivatives. Inspired by two preceding methods which utilized the first-order Taylor series for greedy channel pruning molchanov2016pruning ; theis2018faster , we define the metric in a similar manner. Formally, at each training iteration with a mini-batch of examples $x$ and labels $y$, let $T(x, y, w)$ be the metric value of a specific parameter $w$; we have
$$ T(x, y, w) = \left| \frac{\partial L(x, y, \Theta)}{\partial w} \, w \right| . \tag{6} $$
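The rationale is that for the current mini-batch, we expect to reduce those parameters which can be removed with less impact on $L(x, y, \Theta)$. Using the Taylor series, if we set a specific parameter $w$ to 0, the loss value becomes
$$ L(x, y, \Theta_{w \leftarrow 0}) = L(x, y, \Theta) + \frac{\partial L(x, y, \Theta)}{\partial w} (0 - w) + O(w^2). \tag{7} $$
Ignoring the higher-order term, we have
$$ \left| L(x, y, \Theta_{w \leftarrow 0}) - L(x, y, \Theta) \right| = \left| \frac{\partial L(x, y, \Theta)}{\partial w} \, w \right| = T(x, y, w), \tag{8} $$
which is an approximation of the change in the loss value if $w$ is zeroed out.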
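The quality of this first-order estimate can be checked on a toy separable quadratic loss, where the exact change caused by zeroing a parameter is available in closed form. Everything in the sketch below (the loss, the values, and the helper names `quadratic_loss`, `quadratic_grads`, `taylor_metric`) is a hypothetical illustration; for this particular loss the estimate differs from the true change by exactly the second-order term $0.5\,a\,w^2$.

```python
def quadratic_loss(weights, targets, curvatures):
    """Toy loss: sum_i 0.5 * a_i * (w_i - t_i)^2."""
    return sum(0.5 * a * (w - t) ** 2 for w, t, a in zip(weights, targets, curvatures))


def quadratic_grads(weights, targets, curvatures):
    """Gradient of the toy loss with respect to each weight."""
    return [a * (w - t) for w, t, a in zip(weights, targets, curvatures)]


def taylor_metric(grad, w):
    """First-order estimate of the loss change when w is set to zero."""
    return abs(grad * w)


weights = [0.01, 0.02, -0.05]
targets = [1.0, 0.9, 0.5]
curvatures = [2.0, 1.0, 4.0]

base = quadratic_loss(weights, targets, curvatures)
grads = quadratic_grads(weights, targets, curvatures)

results = []
for i, (w, g) in enumerate(zip(weights, grads)):
    zeroed = list(weights)
    zeroed[i] = 0.0
    actual = abs(quadratic_loss(zeroed, targets, curvatures) - base)
    results.append((taylor_metric(g, w), actual))
    print(i, results[-1])
```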
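We rewrite the update rule of momentum SGD (Formula 1). At the $k$-th training iteration with a mini-batch of examples $x$ and labels $y$ on a specific layer with kernel $W$, the update rule is
$$ Z^{(k+1)} \leftarrow \beta Z^{(k)} + \eta W^{(k)} + B^{(k)} \circ \frac{\partial L(x, y, \Theta)}{\partial W^{(k)}}, \qquad W^{(k+1)} \leftarrow W^{(k)} - \alpha Z^{(k+1)}, \tag{9} $$
where $\circ$ is the element-wise multiplication (a.k.a. Hadamard product), and $B^{(k)}$ is the mask matrix,
$$ B^{(k)}_{m,n} = \begin{cases} 1 & \text{if } T(x, y, W^{(k)}_{m,n}) \ge \text{the } Q\text{-th greatest value in } T(x, y, \Theta), \\ 0 & \text{otherwise}. \end{cases} \tag{10} $$
We refer to the computation of $B$ for each kernel as activation selection. Obviously, there are exactly $Q$ ones in total across all the mask matrices, and GSM degrades to ordinary momentum SGD when $Q = |\Theta|$.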
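One GSM iteration as described above (global top-Q activation selection followed by the masked momentum update) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the released code: each layer is a flat list of floats, gradients are supplied by the caller, the name `gsm_step` is hypothetical, and ties in the metric are broken by position, which the text does not specify.

```python
def gsm_step(weights, momenta, grads, q, lr, beta, weight_decay):
    """One GSM update over all layers.

    Active parameters (global top-q by |grad * w|) receive gradient plus
    weight decay; passive ones receive weight decay only.
    """
    # Metric for every parameter, tagged with (layer index, position).
    scored = [
        (abs(g * w), li, pi)
        for li, (W, G) in enumerate(zip(weights, grads))
        for pi, (w, g) in enumerate(zip(W, G))
    ]
    # Global selection across layers; stable sort breaks ties by position (assumption).
    scored.sort(key=lambda item: item[0], reverse=True)
    active = {(li, pi) for _, li, pi in scored[:q]}

    new_weights, new_momenta, masks = [], [], []
    for li, (W, Z, G) in enumerate(zip(weights, momenta, grads)):
        mask = [1.0 if (li, pi) in active else 0.0 for pi in range(len(W))]
        z_next = [beta * z + weight_decay * w + b * g
                  for z, w, b, g in zip(Z, W, mask, G)]
        w_next = [w - lr * z for w, z in zip(W, z_next)]
        new_weights.append(w_next)
        new_momenta.append(z_next)
        masks.append(mask)
    return new_weights, new_momenta, masks


weights = [[1.0, 0.1], [0.5, -2.0]]
momenta = [[0.0, 0.0], [0.0, 0.0]]
grads = [[0.2, 0.3], [-1.0, 0.01]]   # metrics: 0.2, 0.03, 0.5, 0.02

new_w, new_z, masks = gsm_step(weights, momenta, grads, q=2,
                               lr=0.1, beta=0.9, weight_decay=0.01)
print(masks)
print(new_w)
```

Because selection is global, the number of active parameters per layer is not fixed in advance; it falls out of how the metric values are distributed across layers.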
Of note is that GSM is model-agnostic because it makes no assumptions on the model structure or the form of the loss function. I.e., the calculation of gradients via back-propagation is model-related, of course, but using them for GSM pruning is model-agnostic.
3.4 GSM enables implicit reactivation and fast continuous reduction
As GSM conducts activation selection at each training iteration, it allows the penalized connections to be reactivated, if they are found to be critical to the model again. Compared to two previous works which explicitly insert a splicing guo2016dynamic or restoring yang2017designing stage into the entire pipeline to rewire the mistakenly pruned connections, GSM features simpler implementation and end-to-end training.
However, as will be shown in Sect. 4.4, reactivation only happens on a minority of the parameters, while most of them undergo a series of passive updates and thus keep moving towards zero. As we would like to know how many training iterations are needed to make the parameters small enough to realize lossless pruning, we need to predict the eventual value of a parameter $w$ after $k$ passive updates, given $\alpha$, $\eta$ and $\beta$. We can use Formula 1 to predict $w^{(k)}$, which is practical but cumbersome. In our common use cases where $z^{(0)} = 0$ (from the very beginning of training), $k$ is large (at least tens of thousands), and $\alpha\eta$ is small, we have observed an empirical formula precise enough to approximate the resulting value (Fig. 1),
$$ \frac{w^{(k)}}{w^{(0)}} \approx \left( 1 - \frac{\alpha \eta}{1 - \beta} \right)^{k}. \tag{11} $$
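The approximation for the shrinkage under passive updates can be motivated and checked numerically. Under decay-only updates the momentum buffer settles near $\eta w / (1 - \beta)$, so each step multiplies $w$ by roughly $1 - \alpha\eta/(1-\beta)$. The sketch below compares an exact simulation with that closed form; the learning rate, weight decay and momentum values are illustrative assumptions, not settings taken from the experiments, and the function names are hypothetical.

```python
def simulate_passive_updates(w0, steps, lr, beta, weight_decay):
    """Apply decay-only momentum updates (no loss gradient), starting from z = 0."""
    w, z = w0, 0.0
    for _ in range(steps):
        z = beta * z + weight_decay * w
        w = w - lr * z
    return w


def approx_passive_ratio(steps, lr, beta, weight_decay):
    """Closed-form estimate of w after `steps` passive updates, relative to w0."""
    return (1.0 - lr * weight_decay / (1.0 - beta)) ** steps


lr, beta, weight_decay, steps = 5e-3, 0.98, 5e-4, 50_000
simulated = simulate_passive_updates(1.0, steps, lr, beta, weight_decay)
approx = approx_passive_ratio(steps, lr, beta, weight_decay)

# The two values agree to within a few percent for these settings.
print(simulated, approx, simulated / approx)
```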
In practice, we fix $\eta$ (e.g., $1 \times 10^{-4}$ for ResNets he2016deep and DenseNets huang2017densely ) and adjust $\alpha$ just as we do for ordinary DNN training, and use a large $\beta$ for faster zeroing-out. When the training is completed, we prune the model by only preserving the $Q$ parameters with the largest magnitude. We decide the number of training iterations $k$ using Eq. 11, based on an empirical observation that once the ratio $w^{(k)} / w^{(0)}$ predicted by Eq. 11 is small enough, such a pruning step causes no accuracy drop on very deep models like ResNet-56 and DenseNet-40.
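Two practical steps follow from this: choosing the number of iterations from the shrinkage formula, and the final pruning that keeps only the largest-magnitude parameters globally. The sketch below illustrates both; the target ratio of 1e-4 and all other numbers are placeholders rather than values from the paper, and `iterations_for_ratio` and `prune_to_q` are hypothetical helper names.

```python
import math


def iterations_for_ratio(target_ratio, lr, beta, weight_decay):
    """Smallest k with (1 - lr * weight_decay / (1 - beta)) ** k <= target_ratio."""
    per_step = 1.0 - lr * weight_decay / (1.0 - beta)
    return math.ceil(math.log(target_ratio) / math.log(per_step))


def prune_to_q(layers, q):
    """Keep the q largest-magnitude parameters across all layers; zero the rest."""
    ranked = sorted(
        ((abs(w), li, pi) for li, W in enumerate(layers) for pi, w in enumerate(W)),
        key=lambda item: item[0],
        reverse=True,
    )
    keep = {(li, pi) for _, li, pi in ranked[:q]}
    return [
        [w if (li, pi) in keep else 0.0 for pi, w in enumerate(W)]
        for li, W in enumerate(layers)
    ]


# Placeholder settings: how many passive updates until a weight shrinks to 1e-4 of its start?
k = iterations_for_ratio(1e-4, lr=5e-3, beta=0.98, weight_decay=1e-4)
print(k)

# After training, passive parameters are tiny, so dropping them changes little.
trained = [[0.5, 1e-6, -2.0], [1e-7, 0.3]]
print(prune_to_q(trained, q=3))
```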
Momentum is critical for GSM-based pruning to be completed with acceptable time cost. As most of the parameters continuously move in the same direction determined by the weight decay (i.e., towards zero), such a tendency accumulates in the momentum, thus the zeroing-out process is significantly accelerated. On the other hand, if a parameter does not always vary in the same direction, raising $\beta$ affects its training dynamics less. In contrast, if we increase the learning rate $\alpha$ for faster zeroing-out, the critical parameters which are hovering around the minima will significantly deviate from their current values reached with a much lower learning rate before.
4 Experiments
4.1 Pruning results and comparisons
We evaluate GSM by pruning several common benchmark models on MNIST, CIFAR-10 krizhevsky2009learning and ImageNet deng2009imagenet , and comparing with the reported results from several recent competitors. For each trial, we start from a well-trained base model and apply GSM training on all the layers simultaneously.
MNIST. We first experiment on MNIST with LeNet-300-100 and LeNet-5 lecun1998gradient . LeNet-300-100 is a three-layer fully-connected network with 267K parameters, which achieves 98.19% Top-1 accuracy. LeNet-5 is a convolutional network which comprises two convolutional layers and two fully-connected layers, contains 431K parameters and delivers 99.21% Top-1 accuracy. To achieve 60× and 125× compression, we set $Q = 267\text{K}/60 \approx 4.4$K for LeNet-300-100 and $Q = 431\text{K}/125 \approx 3.4$K for LeNet-5, respectively. We use a fixed momentum coefficient $\beta$ and a batch size of 256. The learning rate $\alpha$ follows a three-stage schedule lasting 160, 40 and 40 epochs, respectively. After GSM training, we conduct lossless pruning and test on the validation dataset. As shown in Table 1, GSM can produce highly sparse models which still maintain the accuracy. By further raising the compression ratio on LeNet-5 to 300×, we only observe a minor accuracy drop (0.15%), which suggests that GSM can yield reasonable performance with extremely high compression ratios.
CIFAR-10. We present the results of another set of experiments on CIFAR-10 in Table 2 using ResNet-56 he2016deep and DenseNet-40 huang2017densely . We use a fixed $\beta$, a batch size of 64 and a three-stage learning rate schedule lasting 400, 100 and 100 epochs, respectively. We adopt the standard data augmentation including padding to $40 \times 40$, random cropping and left-right flipping. Though ResNet-56 and DenseNet-40 are significantly deeper and more complicated, GSM can also reduce the parameters by 10× and still maintain the accuracy.
ImageNet. We prune ResNet-50 to verify GSM on large-scale image recognition applications. We use a batch size of 64 and train the model with a three-stage learning rate schedule lasting 40, 10 and 10 epochs, respectively. We compare the results with LOBS dong2017learning , which is the only previous method that reported experimental results on ResNet-50, to the best of our knowledge. GSM outperforms LOBS by a clear margin (Table 3). We assume that the effectiveness of GSM on such a very deep network is due to its capability to discover the appropriate layer-wise sparsity ratios, given a desired global compression ratio. In contrast, LOBS performs pruning layer by layer using the same compression ratio. This assumption is further verified in Sect. 4.2.
Table 1: Pruning results on MNIST.

| Model | Result | Base Top-1 (%) | Pruned Top-1 (%) | Origin / Remain Params | Compress Ratio | Non-zero Ratio |
|---|---|---|---|---|---|---|
| LeNet-300 | Han et al. han2015learning | 98.36 | 98.41 | 267K / 22K | 12.1× | 8.23% |
| LeNet-300 | LOBS dong2017learning | 98.24 | 98.18 | 267K / 18.6K | 14.2× | 7% |
| LeNet-300 | Zhang et al. zhang2018systematic | 98.4 | 98.4 | 267K / 11.6K | 23.0× | 4.34% |
| LeNet-300 | DNS guo2016dynamic | 97.72 | 98.01 | 267K / 4.8K | 55.6× | 1.79% |
| LeNet-300 | GSM | 98.19 | 98.18 | 267K / 4.4K | 60.0× | 1.66% |
| LeNet-5 | Han et al. han2015learning | 99.20 | 99.23 | 431K / 36K | 11.9× | 8.35% |
| LeNet-5 | LOBS dong2017learning | 98.73 | 98.73 | 431K / 30K | 14.1× | 7% |
| LeNet-5 | Srinivas et al. srinivas2017training | 99.20 | 99.19 | 431K / 22K | 19.5× | 5.10% |
| LeNet-5 | Zhang et al. zhang2018systematic | 99.2 | 99.2 | 431K / 6.05K | 71.2× | 1.40% |
| LeNet-5 | DNS guo2016dynamic | 99.09 | 99.09 | 431K / 4.0K | 107.7× | 0.92% |
| LeNet-5 | GSM | 99.21 | 99.22 | 431K / 3.4K | 125.0× | 0.80% |
| LeNet-5 | GSM | 99.21 | 99.06 | 431K / 1.4K | 300.0× | 0.33% |
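
Table 2: Pruning results on CIFAR-10.

| Model | Result | Base Top-1 (%) | Pruned Top-1 (%) | Origin / Remain Params | Compress Ratio | Non-zero Ratio |
|---|---|---|---|---|---|---|
| ResNet-56 | GSM | 94.05 | 94.10 | 852K / 127K | 6.6× | 15.0% |
| ResNet-56 | GSM | 94.05 | 93.80 | 852K / 85K | 10.0× | 10.0% |
| DenseNet-40 | GSM | 93.86 | 94.07 | 1002K / 150K | 6.6× | 15.0% |
| DenseNet-40 | GSM | 93.86 | 94.02 | 1002K / 125K | 8.0× | 12.5% |
| DenseNet-40 | GSM | 93.86 | 93.90 | 1002K / 100K | 10.0× | 10.0% |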
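
Table 3: Pruning results on ImageNet.

| Model | Result | Base Top-1 / Top-5 (%) | Pruned Top-1 / Top-5 (%) | Origin / Remain Params | Compress Ratio | Non-zero Ratio |
|---|---|---|---|---|---|---|
| ResNet-50 | LOBS dong2017learning | - / 92 | - / 92 | 25.5M / 16.5M | 1.5× | 65% |
| ResNet-50 | LOBS dong2017learning | - / 92 | - / 85 | 25.5M / 11.4M | 2.2× | 45% |
| ResNet-50 | GSM | 75.72 / 92.75 | 75.33 / 92.47 | 25.5M / 6.3M | 4.0× | 25% |
| ResNet-50 | GSM | 75.72 / 92.75 | 74.30 / 91.98 | 25.5M / 5.1M | 5.0× | 20% |
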
4.2 GSM for automatic layer-wise sparsity ratio decision
Modern DNNs usually contain tens or even hundreds of layers. As the architectures deepen, it becomes increasingly impractical to set the layer-wise sparsity ratios manually to reach a desired global compression ratio. Therefore, the research community is soliciting techniques which can automatically discover the appropriate sparsity ratios on very deep models. In practice, we noticed that if directly pruning a single layer of the original model by a fixed ratio results in a significant accuracy reduction, GSM automatically chooses to prune it less, and vice versa.
In this subsection, we present a quantitative analysis of the sensitivity to pruning, which is an underlying property of a layer defined via a natural proxy: the accuracy reduction caused by pruning a certain ratio of parameters from it. We first evaluate such sensitivity via single-layer pruning attempts with different pruning ratios (Fig. 2). E.g., for the curve labeled as “prune 90%” of LeNet-5, we first experiment on the first layer by setting the 90% of the parameters with smaller magnitude to zero, and testing the model to obtain the validation accuracy. Then we restore the first layer, prune the second layer and test. The same procedure is applied to the third and fourth layers. After that, we use different pruning ratios of 99%, 99.5% and 99.7%, and obtain three curves in the same way. From such experiments we learn that the first layer is far more sensitive than the third, as pruning 99% of the parameters from the first layer reduces the Top-1 accuracy by around 85% (i.e., reduced to hardly above 10%), but doing so on the third layer only slightly degrades the accuracy by 3%.
Then we show the resulting layer-wise non-zero ratio of the GSM-pruned models (the pruned LeNet-5 and DenseNet-40 presented in Tables 1 and 2) as another proxy for sensitivity, of which the curves are labeled as “GSM discovered” in Fig. 2. As these curves follow the same tendency across layers as the others, we find that the sensitivities measured by the two proxies are closely related, which suggests that GSM automatically decides to prune the sensitive layers less (e.g., the 14th, 27th and 40th layers in DenseNet-40, which perform the inter-stage transitions huang2017densely ) and the insensitive layers more in order to reach the desired global compression ratio, eliminating the need for heavy human effort to tune the sparsity ratios as hyperparameters.
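The single-layer sensitivity probe described above (prune one layer by magnitude, evaluate, restore, move to the next layer) can be sketched as follows. The `evaluate` callable stands in for measuring validation accuracy and is an assumption, as are the toy scoring function and the helper names `prune_smallest` and `layer_sensitivity`.

```python
def prune_smallest(weights, ratio):
    """Zero out the given fraction of entries with the smallest magnitude."""
    n_prune = round(ratio * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def layer_sensitivity(layers, ratio, evaluate):
    """For each layer, prune it alone and report the drop in the evaluation score.

    The original layers are left untouched, which plays the role of the
    restore step between attempts.
    """
    base = evaluate(layers)
    drops = []
    for i in range(len(layers)):
        trial = list(layers)
        trial[i] = prune_smallest(layers[i], ratio)
        drops.append(base - evaluate(trial))
    return drops


def toy_score(layers):
    """Hypothetical stand-in for validation accuracy: the first layer matters 10x more."""
    return 10.0 * sum(abs(w) for w in layers[0]) + sum(abs(w) for w in layers[1])


layers = [[1.0, 0.5, 0.1, 0.05], [2.0, 0.2, 0.1, 0.1]]
drops = layer_sensitivity(layers, ratio=0.5, evaluate=toy_score)
print(drops)
```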
4.3 Momentum for accelerating parameter zeroing-out
We investigate the role momentum plays in GSM by only varying the momentum coefficient $\beta$ and keeping all the other training configurations the same as the 8× pruned DenseNet-40 in Sect. 4.1. During training, we evaluate the model both before and after pruning every 8000 iterations (i.e., 10.24 epochs). We also present in Fig. 3 the global ratio of parameters with magnitude under two small thresholds, respectively. As can be observed, a large momentum coefficient can drastically increase the ratio of small-magnitude parameters. E.g., with a target compression ratio of 8× and a large $\beta$, GSM can make 87.5% of the parameters (i.e., $1 - 1/8$) close to zero in around 150 epochs, thus pruning the model causes no damage. And with a small $\beta$, 400 epochs are not enough to effectively zero the parameters out, thus pruning degrades the accuracy to around 65%. On the other hand, as a larger $\beta$ value brings more rapid structural change in the model, the original accuracy decreases at the beginning but increases when such change becomes stable and the training converges.
4.4 GSM for implicit connection reactivation
GSM implicitly implements connection rewiring by performing activation selection at each iteration to restore the parameters which have been wrongly penalized (i.e., gone through at least one passive update). We investigate the significance of doing so by pruning DenseNet-40 by 8× again using the same training configurations as before but without re-selection. Concretely, we use the mask matrices computed at the first iteration to guide the updates until the end of training. The training loss and accuracy evaluated both before and after 8× pruning every 8000 iterations are shown in Fig. 4. It is observed that if re-selection is canceled, the training loss becomes higher, and the accuracy is degraded. This is because the first selection decides to eliminate some connections which are not critical for the first iteration but may be important for the subsequent input examples. Without re-selection, GSM insists on zeroing out such parameters, leading to lower accuracy. And by depicting the reactivation ratio of the GSM training process (i.e., the ratio of the number of connections which switch from passive to active to the total number of connections) at the re-selection of each training iteration, we learn that reactivation happens on a minority of the connections, and the ratio decreases gradually, such that the training converges and the desired sparsity ratio is obtained.
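The reactivation ratio defined above (connections switching from passive to active, divided by the total number of connections) is straightforward to compute from the masks of two consecutive iterations. The sketch assumes masks are stored as per-layer lists of 0/1 values; the function name `reactivation_ratio` and the toy masks are hypothetical.

```python
def reactivation_ratio(prev_masks, curr_masks):
    """Fraction of all connections that were passive (0) and became active (1)."""
    total = 0
    reactivated = 0
    for prev, curr in zip(prev_masks, curr_masks):
        for p, c in zip(prev, curr):
            total += 1
            if p == 0 and c == 1:
                reactivated += 1
    return reactivated / total


prev_masks = [[1, 0, 0, 1], [0, 0]]
curr_masks = [[1, 1, 0, 0], [0, 1]]   # two connections switch from passive to active
ratio = reactivation_ratio(prev_masks, curr_masks)
print(ratio)
```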
4.5 GSM for better winning lottery tickets
A recent work DBLP:conf/iclr/FrankleC19 reported that the parameters which are found to be important after training are actually important at the very beginning (after random initialization but before training), which are referred to as the winning tickets, because they have won the initialization lottery. We found out that GSM can be used as a more powerful method to find the winning tickets.
Frankle and Carbin DBLP:conf/iclr/FrankleC19 discovered that if we 1) randomly initialize a network parameterized by $\Theta_0$, 2) train and obtain $\Theta$, 3) prune some parameters based on the properties of $\Theta$, resulting in a subnetwork parameterized by $\Theta'$, 4) find the winning ticket parameters $\hat{\Theta}$ in the initialized model $\Theta_0$ which reside in the positions corresponding to $\Theta'$, 5) train $\hat{\Theta}$ only, and remove the other parameters, we may attain a level of accuracy comparable to that of the trained-then-pruned model $\Theta'$. In that work, the third step is accomplished by simply preserving the parameters with the largest magnitude. In our experiments, we found that GSM can find a better set of winning tickets than the original simple magnitude-based method. Concretely, we only replace step 3 by a pruning process via GSM, and use the resulting non-zero parameters as $\Theta'$, and all the other experimental settings are kept the same for comparability. On LeNet-300-100 with a compression ratio of 60×, finding winning tickets by the original magnitude criterion and by GSM delivers a final accuracy of 96.85% vs. 97.36% after step 5. On LeNet-5 with a compression ratio of 300×, the accuracies of the two methods are 97.94% vs. 99.04%. More experimental details can be found in the codes.
Two possible explanations for this phenomenon are that 1) GSM distinguishes the unimportant parameters by activation selection much earlier (at each iteration) than the magnitude-based criterion (after training), and 2) GSM decides the final winning tickets in a way that is robust to mistakes (i.e., via activation re-selection). The intuition is that since we expect to find the parameters that have “won the initialization lottery”, the timing when we make the decision should be closer to when the initialization takes place, and we wish to correct the mistakes immediately when we are aware of the wrong decisions. Frankle and Carbin also noted that it might bring benefits to prune as early as possible DBLP:conf/iclr/FrankleC19 , which is precisely what GSM does, as GSM keeps pushing the unimportant parameters continuously to zero from the very beginning.
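Steps 4 and 5 of the procedure above can be sketched as follows: the positions that survive pruning define a mask, the initial values at those positions form the ticket, and subsequent training updates only the masked-in positions. This is a schematic under stated assumptions (flat per-layer lists, plain SGD for the retraining step, hypothetical names `winning_ticket` and `masked_sgd_step`), not the experimental code.

```python
def winning_ticket(theta_init, theta_pruned):
    """Return the initial values at positions that are nonzero after pruning,
    together with the 0/1 mask marking those positions."""
    mask = [[1.0 if w != 0.0 else 0.0 for w in W] for W in theta_pruned]
    ticket = [
        [w0 if m else 0.0 for w0, m in zip(W0, M)]
        for W0, M in zip(theta_init, mask)
    ]
    return ticket, mask


def masked_sgd_step(weights, grads, mask, lr):
    """Plain SGD step that leaves the removed positions at exactly zero."""
    return [
        [w - lr * g * m for w, g, m in zip(W, G, M)]
        for W, G, M in zip(weights, grads, mask)
    ]


theta_init = [[0.3, -0.7, 0.2], [0.9, -0.1]]     # values at initialization
theta_pruned = [[0.0, -1.2, 0.0], [0.8, 0.0]]    # trained, then pruned

ticket, mask = winning_ticket(theta_init, theta_pruned)
print(ticket)

grads = [[1.0, 1.0, 1.0], [1.0, 1.0]]            # placeholder gradients
print(masked_sgd_step(ticket, grads, mask, lr=0.1))
```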
5 Conclusion
Aiming at a desired global compression ratio, we seek to prune a DNN via a learning process. We proposed Global Sparse Momentum SGD (GSM) to directly alter the gradient flow for DNN pruning, where the key is to split the ordinary momentum-SGD-based update into two parts: active update uses the gradients derived from the objective function to maintain the model’s accuracy, and passive update only performs momentum-accelerated weight decay to push the redundant parameters infinitely close to zero. GSM is characterized by end-to-end training, easy implementation, lossless pruning, implicit connection rewiring, the ability to automatically discover the appropriate per-layer sparsity ratios in modern very deep neural networks and the capability to find powerful winning tickets.
Acknowledgement
This work was supported by the National Key R&D Program of China (No. 2018YFC0807500), National Natural Science Foundation of China (No. 61571269, No. 61971260), National Postdoctoral Program for Innovative Talents (No. BX20180172), and the China Postdoctoral Science Foundation (No. 2018M640131). We sincerely thank all the reviewers for their comments. Corresponding authors: Guiguang Ding, Jungong Han.
References
 [1] Jose M Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2270–2278, 2016.
 [2] Jose M Alvarez and Mathieu Salzmann. Compression-aware training of deep networks. In Advances in Neural Information Processing Systems, pages 856–867, 2017.
 [3] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
 [4] Yoshua Bengio and Yann LeCun, editors. 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
 [5] Yoshua Bengio and Yann LeCun, editors. 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
 [6] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
 [7] Giovanna Castellano, Anna Maria Fanelli, and Marcello Pelillo. An iterative pruning algorithm for feedforward neural networks. IEEE transactions on Neural networks, 8(3):519–531, 1997.
 [8] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [10] Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han. Centripetal sgd for pruning very deep convolutional networks with complicated structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4943–4953, 2019.
 [11] Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, and Chenggang Yan. Approximated oracle filter pruning for destructive cnn width optimization. In International Conference on Machine Learning, pages 1607–1616, 2019.
 [12] Xiaohan Ding, Guiguang Ding, Jungong Han, and Sheng Tang. Auto-balanced filter pruning for efficient convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [13] Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 4857–4867, 2017.
 [14] Mikhail Figurnov, Aizhan Ibraimova, Dmitry P Vetrov, and Pushmeet Kohli. Perforatedcnns: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pages 947–955, 2016.
 [15] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Bengio and LeCun [5].
 [16] Gabriel Goh. Why momentum really works. Distill, 2017.
 [17] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. Morphnet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1586–1595, 2018.
 [18] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
 [19] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
 [20] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
 [21] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 [22] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems, pages 164–171, 1993.
 [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [24] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.
 [25] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [26] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
 [27] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261–2269. IEEE Computer Society, 2017.
 [28] Ayoosh Kathuria. Intro to optimization in deep learning: Momentum, rmsprop and adam. https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/, 2018.
 [29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 [30] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [31] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
 [32] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
 [33] Shaohui Lin, Rongrong Ji, Yuchao Li, Cheng Deng, and Xuelong Li. Towards compact convnets via structure-sparsity regularized filter pruning. arXiv preprint arXiv:1901.07827, 2019.
 [34] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
 [35] Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Tim Kwang-Ting Cheng, and Jian Sun. Metapruning: Meta learning for automatic neural network channel pruning. arXiv preprint arXiv:1903.10258, 2019.
 [36] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), pages 722–737, 2018.
 [37] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2755–2763. IEEE, 2017.
 [38] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In Bengio and LeCun [5].
 [39] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017.
 [40] Michaël Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through ffts. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
 [41] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
 [42] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 [43] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In Bengio and LeCun [4].
 [44] Volker Roth and Bernd Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th international conference on Machine learning, pages 848–855. ACM, 2008.
 [45] Heinz Rutishauser. Theory of gradient methods. In Refined iterative methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems, pages 24–49. Springer, 1959.
 [46] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659. IEEE, 2013.
 [47] Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 138–145, 2017.
 [48] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
 [49] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with dense networks and fisher pruning. arXiv preprint arXiv:1801.05787, 2018.
 [50] Nicolas Vasilache, Jeff Johnson, Michaël Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. In Bengio and LeCun [4].
 [51] Huan Wang, Qiming Zhang, Yuehai Wang, and Haoji Hu. Structured pruning for efficient convnets via incremental regularization. arXiv preprint arXiv:1811.08390, 2018.
 [52] Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Beyond filters: Compact feature map for portable deep model. In International Conference on Machine Learning, pages 3703–3711, 2017.
 [53] Yunhe Wang, Chang Xu, Shan You, Dacheng Tao, and Chao Xu. Cnnpack: Packing convolutional neural networks in the frequency domain. In Advances in neural information processing systems, pages 253–261, 2016.
 [54] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [55] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5687–5695, 2017.
 [56] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 184–199, 2018.
 [57] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016.