Abstract
Deep Neural Networks are an important class of machine learning algorithms that have demonstrated state-of-the-art accuracy for different cognitive tasks like image and speech recognition. Modern deep networks have millions to billions of parameters, which leads to high memory and energy requirements during training as well as during inference on resource-constrained edge devices. Consequently, pruning techniques have been proposed that remove less significant weights in deep networks, thereby reducing their memory and computational requirements. Pruning is usually performed after training the original network, and is followed by further retraining to compensate for the accuracy loss incurred during pruning. The prune-and-retrain procedure is repeated iteratively until an optimum tradeoff between accuracy and efficiency is reached. However, such iterative retraining adds to the overall training complexity of the network. In this work, we propose a dynamic pruning-while-training procedure, wherein we prune filters of the convolutional layers of a deep network during training itself, thereby precluding the need for separate retraining. We evaluate our dynamic pruning-while-training approach with three different pre-existing pruning strategies, viz. mean activation based pruning, random pruning, and L1 normalization based pruning. Our results for VGG16 trained on CIFAR10 show that L1 normalization provides the best performance among all the techniques explored in this work, with less than 1% drop in accuracy after pruning 80% of the filters compared to the original network. We further evaluated the L1 normalization based pruning mechanism on CIFAR100. Results indicate that pruning while training yields a compressed network with almost no accuracy loss after pruning 50% of the filters compared to the original network, and 5% loss for high pruning rates (~80%).
The proposed pruning methodology yields a 41% reduction in the number of computations and memory accesses during training for CIFAR10, CIFAR100, and ImageNet compared to training with retraining for 10 epochs.
Pruning Filters while Training for Efficiently Optimizing
Deep Learning Networks
Sourjya Roy*, Priyadarshini Panda**, Gopalakrishnan Srinivasan*, and Anand Raghunathan*
I Introduction
Deep Neural Networks (DNNs) are a prominent class of machine learning algorithms that have found widespread utility in various Artificial Intelligence (AI) tasks such as image recognition [10], speech recognition, spam detection, and personal assistants, among others. However, state-of-the-art DNNs like VGG16 and ResNet152 are memory-intensive, with millions of trainable parameters, and compute-intensive, requiring 15.3 and 11.3 billion FLOPS, respectively, per inference. The high memory, computational energy, and latency requirements pose significant challenges to the deployment of large DNNs for inference on edge devices with limited power budgets and compute resources. Model compression is a popular technique for alleviating the memory and computational energy requirements of large DNNs. Model compression can be achieved by pruning the redundant weights of a DNN and/or by quantizing the weights and activations to lower bit precision.
In this work, we propose an efficient pruning strategy for DNNs that minimizes the accuracy loss compared to the original network with minimal training overhead. Most previously proposed pruning strategies train a DNN until the best accuracy is achieved, and then prune the filters (individual weights) in the convolutional (fully connected) layers of the DNN. The pruning phase is typically followed by a retraining phase for a certain number of epochs to regain the accuracy loss incurred due to pruning. Retraining imposes significant computational overhead, especially for large real-world datasets like ImageNet consisting of millions of training images. In an effort to eliminate the retraining phase, we propose pruning the DNN in a gradual manner as the network is being trained. We prune a small fraction of the convolutional filters every epoch over the course of training until the target pruning rate is achieved. Our analysis indicates that gradual pruning of filters during training enables successive epochs to compensate for any accuracy loss, leading to accuracy at the end of training comparable to the original network on the CIFAR10 dataset for high pruning rates (~80%), and on the CIFAR100 and ImageNet datasets for moderate pruning rates (~50%). In addition, the proposed gradual pruning methodology also enhances the computational efficiency during training since the feedforward and gradient computations for the pruned filters can be skipped. As a result, the proposed pruning methodology offers competitive accuracy without the need for a separate retraining phase. Unlike previous approaches, the proposed approach enables sparsity to be exploited for computational efficiency during both training and inference.
We investigate three different pruning strategies popularly used in the literature for identifying the filters to be pruned, namely, L1 normalization based pruning [4], random pruning [7], and mean activation based pruning [9]. Our results indicate that L1 normalization based pruning provides the best accuracy after removing the redundant filters during training based on the proposed gradual pruning methodology.
Overall, the key contributions of our work are:
- We propose an efficient pruning methodology without the need for a separate retraining phase, wherein the convolutional filters are pruned gradually every epoch to achieve the target pruning rate.
- We demonstrate the effectiveness of our gradual pruning methodology on the CIFAR10, CIFAR100, and ImageNet datasets.
II Related Work
Many previous works have focused on compressing DNNs using architectural, quantization, and pruning techniques. For instance, MobileNet [2] used depthwise separable convolutions to reduce the number of parameters and make inference more energy efficient, while works such as DoReFa-Net [11] efficiently compressed DNNs by quantizing the different data structures. Deep Compression [6] proposes compressing DNNs using pruning, trained quantization, and Huffman coding. Most works focused on pruning a network after training, followed by further retraining to compensate for the accuracy loss [6, 8]. Runtime neural pruning [12] focuses on pruning dynamically during run time using reinforcement learning. We employ simpler pruning techniques such as L1 normalization based pruning to minimize the energy and latency overhead incurred by the pruning mechanisms. Our work differs from the above efforts by incorporating pruning into the training process itself, obviating the need for a distinct retraining phase.

Pruning filters for efficient ConvNets [4] adopts two strategies for pruning the network and regaining accuracy: one-shot pruning/retraining and iterative pruning/retraining based on the significance of filters. In both cases, the retraining overhead in terms of additional MAC operations, gradient computations, and memory accesses is quite large. Training networks for datasets such as ImageNet (1,281,167 images) requires at least 20 retraining cycles. Further, prune-retrain (PRT) approaches require 3× longer time (or latency) to retrain pruned networks [6, 4]. Our pruning-while-training approach yields both MAC/memory energy and latency benefits (while completely getting rid of the retraining overhead) with very minimal or no loss in accuracy.

The key findings of this paper, which support [5], are:
- Training a large, over-parameterized model is often not necessary to obtain an efficient final model.
- Learned 'important' weights of the large model are typically not useful for the small pruned model.
- The pruned architecture itself, rather than a set of inherited important weights, is more crucial to the efficiency of the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm.
III Proposed Pruning Methodology
This section describes the proposed pruning methodology for efficiently compressing deep networks with minimal training overhead. A plethora of prior approaches adopted a 'training followed by pruning' strategy, which necessitates an additional retraining phase to recover the accuracy degradation caused by pruning. The retraining cost in terms of latency and computational energy can be substantial, especially for large real-world datasets like ImageNet. In order to eliminate the retraining overhead, we propose pruning the network during the training phase itself. We prune the network gradually every epoch, i.e., uniformly over the course of the training period, until the target pruning rate is achieved. The presented pruning methodology has the following twofold advantages. First, it reduces the trainable parameters gradually, resulting in almost no drop in accuracy on the CIFAR10 dataset, even for high pruning rates (~80%), and on the CIFAR100/ImageNet datasets for relatively lower pruning rates, as will be shown in the results section (Section IV). Second, the gradual reduction in the trainable parameters can be exploited to improve the computational efficiency of sparsity-aware neural accelerators by eliminating redundant operations during both forward and backpropagation for the pruned weights. We employ three widely-used pruning schemes for removing the redundant weights, namely, L1 normalization based pruning, mean activation based pruning, and random pruning, which are described in Section III-A. For each of the pruning techniques, the weights to be pruned are forced to zero and the corresponding gradient calculations are eliminated. We rigorously investigate the effectiveness of the proposed methodology, and present the pruning rates that can be achieved for a given target accuracy on the CIFAR10, CIFAR100, and ImageNet datasets.
Note that we have focused on pruning only the filters of the convolutional layers to analyze the impact of compressed input representations on the network accuracy, and because the coarser-grained pruning of filters is easier to exploit for time and energy improvements. The presented methodology can be extended to the fully connected classification layers for improved compression efficiency.
III-A Pruning Techniques
L1 Normalization Based Pruning
L1 normalization based pruning [4] removes filters based on their magnitude, or L1 norm, which for a filter with weights $w_1, \ldots, w_n$ is computed as

$$s = \sum_{i=1}^{n} |w_i| \qquad (1)$$

where $|w_i|$ is the absolute value of the $i$-th filter weight and $n$ is the total number of filter weights. The filter magnitude is used to determine the significance of the filter. Filters with low magnitude do not contribute substantially to the network output, and hence are pruned away.
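As a concrete illustration, the per-filter L1 score of Eq. (1) and the selection of the lowest-scoring filters can be sketched as follows. This is a minimal pure-Python sketch using a toy flat-list representation of the filter weights; the function names are ours, not from the paper's code.

```python
def filter_l1_norms(filters):
    """Eq. (1): L1 norm sum_i |w_i| of each filter's weights.

    `filters` is a list of flat weight lists, one entry per conv filter
    (a toy stand-in for a 4-D weight tensor).
    """
    return [sum(abs(w) for w in f) for f in filters]


def filters_to_prune(filters, num_prune):
    """Indices of the `num_prune` filters with the smallest L1 norm."""
    norms = filter_l1_norms(filters)
    return sorted(range(len(filters)), key=lambda j: norms[j])[:num_prune]
```

The indices returned here correspond to the filters whose weights are forced to zero and whose gradient computations are skipped during training.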
Mean Activation Based Pruning
Mean activation based pruning [9] is another form of magnitude-driven pruning. The mean activation is calculated for each feature map in the network over the entire training dataset as described by

$$\bar{a}_j = \frac{1}{N} \sum_{i=1}^{N} a_j^{(i)} \qquad (2)$$

where $a_j^{(i)}$ is the output activation of feature map $j$ for the $i$-th image and $N$ is the size of the training dataset. The filter corresponding to the feature map with the lowest mean activation is considered to contribute insignificantly to the network performance, and hence is pruned away. Mean activation based pruning tends to identify sparse feature maps with the maximum number of zeros, inserted as a result of using the ReLU nonlinearity that zeroes out negative activations, and removes the corresponding filters.
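The averaging in Eq. (2) can be sketched as follows, assuming (hypothetically) that each feature map's spatial output has already been summarized to a single scalar per image:

```python
def mean_activations(acts_per_image):
    """Eq. (2): mean activation of each feature map over the dataset.

    acts_per_image[i][j] holds a scalar summary (e.g. average over spatial
    positions) of feature map j's ReLU output for training image i --
    a toy representation, not the paper's actual data layout.
    """
    n = len(acts_per_image)
    num_maps = len(acts_per_image[0])
    return [sum(img[j] for img in acts_per_image) / n for j in range(num_maps)]
```

The filter whose feature map has the smallest mean is the pruning candidate.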
Random Pruning
Random pruning, as the name suggests, prunes filters in the network randomly. A filter is chosen at random in every layer based on an unbiased random number generator and removed from the network.
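A minimal sketch of random filter selection, using Python's stdlib generator (the seeded generator is our addition for reproducibility, not part of the paper's method):

```python
import random


def random_filters_to_prune(num_filters, num_prune, seed=0):
    """Pick `num_prune` distinct filter indices uniformly at random."""
    return random.Random(seed).sample(range(num_filters), num_prune)
```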
III-B Training Algorithms for Gradual Pruning of Filters
In this section, we describe the training algorithm for gradual pruning of the convolutional filters using the three different pruning techniques described in Section III-A. Algorithm 1 details network training with L1 normalization based gradual pruning, Algorithm 2 describes network training with mean activation based gradual pruning, and Algorithm 3 outlines network training with random pruning of the filters. In Algorithms 1, 2, and 3, the basic steps for each training epoch are as follows: 1) Perform forward and backward passes through the network and perform the weight update. 2) While the current percentage of pruned filters is less than the required pruning rate, prune filters in all the convolutional layers of the network using one of the pruning techniques. 3) Once the current percentage of pruned filters reaches the required pruning rate, proceed to the next epoch of training and increment the required pruning rate by a fixed value (denoted as rate_per_epoch). Intuitively, such gradual pruning while training allows the network to adjust its weights and compensate for the pruning-induced accuracy loss dynamically and thus reach an optimized pruned configuration towards the end of training.
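The three steps above can be sketched as a generic training loop. This is our own minimal reconstruction, not the paper's Algorithms 1-3 verbatim: the `score_filters` and `train_one_epoch` callbacks are hypothetical stand-ins for the chosen pruning criterion (e.g. L1 norm) and the actual forward/backward/update pass.

```python
def train_with_gradual_pruning(num_epochs, num_filters, rate_per_epoch,
                               score_filters, train_one_epoch):
    """Pruning-while-training loop: prune a fixed % of filters every epoch.

    score_filters(active) -> {filter index: importance score}
    train_one_epoch(active) -> runs forward/backward passes and weight
    updates over the currently unpruned (active) filters.
    """
    active = set(range(num_filters))
    target_rate = rate_per_epoch              # required pruning rate, in %
    for _ in range(num_epochs):
        train_one_epoch(active)               # step 1: usual training epoch
        # step 2: prune until the pruned percentage reaches the target
        while 100.0 * (num_filters - len(active)) / num_filters < target_rate:
            scores = score_filters(active)
            victim = min(scores, key=scores.get)  # least important filter
            active.discard(victim)
        target_rate += rate_per_epoch         # step 3: raise target for next epoch
    return active
```

With 80 epochs and rate_per_epoch = 1%, this reaches the 80% target pruning rate used for CIFAR10 by the end of training.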
IV Results and Discussion
We evaluated the efficacy of the proposed pruning methodology using the VGG16 DNN consisting of 13 convolutional layers and 3 fully connected layers. Batch Normalization is used after every layer to normalize the output activations before feeding them to the following layer. We used the Adam optimizer [3] and cross entropy loss function for all the experiments reported in this work. We trained VGG16 on the CIFAR10 (for 80 epochs), CIFAR100 (for 100 epochs) and ImageNet datasets (for 100 epochs) to comprehensively demonstrate the utility of the proposed ‘pruning while training’ methodology.
We first trained VGG16 on CIFAR10 while pruning using the three different techniques described in Section III-A to identify the technique best suited for our 'pruning while training' strategy, henceforth abbreviated as PWT[Pruning Technique]. For example, PWTL1Norm refers to L1 normalization based 'pruning while training'. For the CIFAR10 dataset, we first used the PWTL1Norm pruning methodology and pruned 1% of the filters every epoch to achieve the target pruning rate of 80% at the end of 80 epochs. For the baseline, we trained VGG16 for a certain number of epochs (#epochs) before pruning 80% of the filters abruptly and retraining for the remaining epochs, which is designated as PRT[#epochs]. Consider, for instance, PRT55, where VGG16 is trained for 55 epochs followed by pruning and retraining for the rest of the epochs. Fig. 1 indicates that the PWTL1Norm strategy offers accuracy comparable to the original VGG16 network (without any pruning) with 80% of the filters pruned. The PWTL1Norm strategy also provides higher accuracy than PRT-based pruning, thereby yielding a superior pruned network.
In addition to offering higher performance, the PWTL1Norm strategy also gradually improves the computational efficiency as training progresses. Fig. 2 illustrates that the number of VGG16 parameters, pruned using the PWTL1Norm strategy, gradually decreases during the course of training. The gradual parameter reduction can be exploited by sparsity-aware neural accelerators to improve the computational efficiency of training by eliminating the redundant memory fetches and computations corresponding to the pruned filters, as shown in Section V. Note that the PRT approach can also provide higher computational efficiency after the abrupt pruning phase, which is carried out after initial training for a substantial number of epochs. However, the proposed PWT methodology yields a higher-accuracy pruned network compared to that obtained using the PRT approaches, as illustrated in Fig. 1, thereby providing the best tradeoff between network performance and computational efficiency. We also quantified the latency overhead for computing the L1 norm every epoch and found that it is 10× lower than the time taken for a training epoch on the CIFAR10 dataset containing 50K images. The L1 norm latency overhead further drops and becomes negligible for larger datasets like ImageNet with over a million images.
Next, we analyzed the impact of the overall target pruning rate on the efficacy of the PWTL1Norm pruning methodology. We varied the target pruning rate for the filters from 72% to 92% over the entire 80 training epochs, which translates to a per-epoch pruning rate of 0.9% to 1.15%, respectively. Fig. 3 shows that a target pruning rate of 72% yields the best accuracy and that the accuracy degrades significantly as the pruning rate is increased beyond 80%. For the CIFAR10 dataset, VGG16 pruned using the proposed PWTL1Norm methodology with a target pruning rate of 80% provides the best tradeoff between accuracy and parameter savings.
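The per-epoch rate is simply the target rate spread uniformly over the training epochs, which reproduces the 0.9%-1.15% range quoted above:

```python
def per_epoch_rate(target_rate_percent, num_epochs):
    """Per-epoch pruning rate (%) needed to reach the target by the last epoch."""
    return target_rate_percent / num_epochs
```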
The PWTL1Norm pruning strategy has thus far been applied every epoch. We next investigated the tradeoff between accuracy and training efficiency when the PWT strategy is carried out with a fixed delay (≥1 delay-epochs) between successive pruning epochs, which is referred to as PWTL1NormMod[#delay-epochs+1]. For instance, if the PWT strategy is applied every second epoch, it is abbreviated as PWTL1NormMod2, which translates to 1 delay-epoch between successive pruning epochs. Fig. 4 indicates that superior accuracy is obtained using PWTL1Norm, where pruning is performed every epoch, or PWTL1NormMod2. The accuracy degrades beyond a delay of 1 epoch between successive pruning epochs, as depicted in Fig. 4. The original PWTL1Norm strategy offers the best tradeoff between accuracy and training efficiency since it reduces the network size every epoch as opposed to every few epochs.
Finally, we compared the PWTL1Norm methodology with random and mean activation based 'pruning while training' schemes, referred to as PWTRandom and PWTMeanAct, respectively. The network is pruned gradually every epoch using the respective schemes. Fig. 5 shows that the PWTRandom strategy with a high target pruning rate of 80% failed to converge during training. As the target pruning rate is reduced to 40%, the PWTRandom strategy achieved training convergence, albeit with lower accuracy compared to the PWTL1Norm methodology, which could prune 80% of the filters. This is because random pruning does not account for the significance of filters while pruning them, and hence can remove filters critical to network performance, as depicted in Fig. 5. We obtained similar results for the PWTMeanAct strategy, which yielded lower accuracy than the PWTL1Norm strategy and accuracy comparable to the inferior PWTRandom strategy. This indicates that the L1 norm is a better indicator of the significance of a filter than the mean activation of the corresponding output map. Our comprehensive analysis on CIFAR10 shows that PWTL1Norm based gradual pruning of filters every epoch provides a higher-quality compressed network compared to those obtained with the random and mean activation based pruning strategies.
We finally evaluated the efficacy of the PWTL1Norm methodology on the CIFAR100 and ImageNet datasets. Fig. 6 shows that VGG16 pruned using PWTL1Norm with a target pruning rate of 80% incurs 5% accuracy loss compared to the original network. However, as the target pruning rate is reduced to 50%, both the pruned and original networks offer comparable accuracy. For the ImageNet dataset, we similarly found that VGG16 pruned using PWTL1Norm with a higher target pruning rate suffers from 10% accuracy loss, as shown in Fig. 7, which can be minimized by lowering the pruning rate. It is noteworthy that our PWTL1Norm strategy yields accuracy comparable to that of PRT70 (pruning with retraining, where pruning is applied at the 70th epoch). Since we effectively train progressively smaller networks every epoch, we observe larger memory and compute efficiency benefits with PWT as compared to PRT.
V Computation and Latency Benefits
The number of Multiply-and-Accumulate (MAC) operations and read/write memory accesses are used as a proxy for roughly estimating the energy benefits of the proposed methodology. Table I lists the number of MAC operations and memory accesses for the activations and weights during both the forward and backpropagation phases for a given DNN layer with $n_i$ input channels and $n_o$ output channels, following the estimation methodology described in [1]. $P_p$ and $P_c$ stand for the pruning percentage of the previous and current layer, respectively. The input and output feature map dimensions are taken to be $N \times N$ and $R \times R$, respectively, where $R = (N - M)/S + 1$ for an $M \times M$ kernel. FP, BP, WU, and S stand for forward propagation, backpropagation, weight update, and stride, respectively. The savings in the number of MACs or memory accesses using our PWT methodology over the baseline PRT approaches can be computed as
$$\text{Savings} = 1 - \frac{t \cdot C \cdot (1 - p/2)}{t \cdot C + m \cdot C \cdot (1 - p)} \qquad (3)$$

where $t$ is the number of nominal training epochs, $m$ is the number of retraining epochs, $C$ is the actual number of MACs or memory accesses in VGG16, and $p$ is the target pruning rate. The numerator reflects the fact that, with a linear pruning schedule, the PWT network on average retains a fraction $(1 - p/2)$ of the original parameters over the course of training, while the PRT baseline trains the full network for $t$ epochs and retrains the pruned network for $m$ epochs. We obtained 41% savings in the number of MACs and memory accesses by pruning VGG16 on CIFAR10 (iso-accuracy), CIFAR100, and ImageNet using our PWTL1Norm strategy with $p = 0.8$ over the PRT approach with $m = 10$ retraining epochs.
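Plugging in the reported settings reproduces the 41% figure. The sketch below is our own reconstruction of Eq. (3), with costs expressed in units of the full-network per-epoch cost $C$:

```python
def pwt_savings(t, m, p):
    """Fraction of MACs/memory accesses saved by PWT over PRT (Eq. 3).

    With a linear pruning schedule, PWT costs on average (1 - p/2) of the
    full per-epoch cost over t epochs; PRT trains the full network for t
    epochs plus m retraining epochs of the pruned (1 - p) network.
    """
    pwt_cost = t * (1.0 - p / 2.0)   # in units of C
    prt_cost = t + m * (1.0 - p)     # in units of C
    return 1.0 - pwt_cost / prt_cost
```

For CIFAR10 ($t = 80$) and CIFAR100/ImageNet ($t = 100$), with $p = 0.8$ and $m = 10$, this gives roughly 41% in each case.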
The per-epoch latency for the proposed PWTL1Norm and baseline PRT approaches can be computed as

$$T_{epoch} = n_{mb} \cdot T_{mb} + T_{L1} \qquad (4)$$

where $n_{mb}$ is the number of minibatches per epoch, $T_{mb}$ is the minibatch latency, and $T_{L1}$ is the latency for the L1 norm computation per epoch ($T_{L1} = 0$ for PRT, which instead incurs $m$ additional retraining epochs). We find that the $T_{L1}$ computation time is lower than the forward pass computation time through the DNN for a batch of inputs. Note that the forward pass computation involves the cost of Matrix Vector Multiplication (MVM) and nonlinear operations. In our experiments, we found $T_{L1}$ and the per-epoch training latency to be 3.3 and 37.5 seconds, respectively, for VGG16 on CIFAR100. We found the per-epoch training latency to be 7680 seconds for VGG16 on ImageNet. The prune-retrain approach takes approximately 21 hours more than PWT on ImageNet for 10 retraining epochs on an NVIDIA GeForce GTX 1080 Ti GPU. This clearly shows that the latency for the proposed PWT strategy is lower than that incurred by the prune-retrain approaches even when $m$ is small.
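The reported 21-hour gap follows directly from the ImageNet per-epoch latency and $m = 10$ retraining epochs. A back-of-the-envelope check, assuming (as a simplification) that each retraining epoch costs roughly one full-network epoch latency:

```python
def prt_extra_latency_hours(epoch_latency_s, num_retrain_epochs):
    """Additional wall-clock hours PRT spends on its retraining epochs."""
    return num_retrain_epochs * epoch_latency_s / 3600.0
```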
[Table I: Number of MAC operations and memory accesses per layer. Forward pass: MAC operations. Backward pass: MAC operations for the error and for the weight gradients (dw). Memory accesses across FP, BP, and WU: input read, weight read, memory write (activation), and memory write (weight). The output feature map dimension is given by $R = (N - M)/S + 1$.]
VI Conclusion
In this paper, we propose a dynamic pruning-while-training procedure to overcome the retraining complexity generally incurred by conventional prune-and-retrain techniques. We find that L1 normalization is the best technique to use with our pruning-while-training approach. Our analysis on the CIFAR10, CIFAR100, and ImageNet datasets shows that our approach yields an optimal network configuration with respect to efficiency and accuracy, while yielding higher memory and training latency improvements in comparison to prior works.
VII Acknowledgment
This work was supported in part by Semiconductor Research Corporation (SRC).
References
[1] PCA driven hybrid network design for enabling intelligence at the edge, 2019. arXiv:1906.01493.
[2] MobileNets: efficient convolutional neural networks for mobile vision applications, 2017. arXiv:1704.04861.
[3] Adam: a method for stochastic optimization, 2014. arXiv:1412.6980.
[4] Pruning filters for efficient ConvNets, 2017. arXiv:1608.08710.
[5] Rethinking the value of network pruning, 2019. arXiv:1810.05270.
[6] Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding, 2016. arXiv:1510.00149.
[7] Recovering from random pruning: on the plasticity of deep convolutional neural networks, 2018. arXiv:1801.10447.
[8] Pruning convolutional neural networks for resource efficient inference, 2017. arXiv:1611.06440.
[9] Pruning convolutional neural networks for resource efficient inference, 2017. arXiv:1611.06440.
[10] Deep residual learning for image recognition, 2015. arXiv:1512.03385.
[11] DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients, 2018. arXiv:1606.06160.
[12] Runtime neural pruning. In Proc. NIPS, 2017.