Training Efficient Network Architecture and Weights via Direct Sparsity Control
Artificial neural networks (ANNs), especially deep convolutional networks, are very popular and have been shown to offer reliable solutions to many vision problems. However, the use of deep neural networks is widely impeded by their intensive computational and memory cost. In this paper, we propose a novel, efficient network pruning method that is suitable for both non-structured pruning and structured channel-level pruning. Our proposed method tightens a sparsity constraint by gradually removing network parameters or filter channels based on a criterion and a schedule. Because the network size keeps dropping throughout the iterations, the method is suitable for pruning any untrained or pre-trained network. Because our method uses a constraint instead of a penalty, it does not introduce any bias in the training parameters or filter channels. Furthermore, the constraint makes it easy to directly specify the desired sparsity level during the network pruning process. Finally, experimental validation on synthetic and real datasets shows that the proposed method obtains better or competitive performance compared to other state-of-the-art network pruning methods.
1 Introduction
In recent years, artificial neural networks (ANNs), especially deep convolutional neural networks (DCNNs), have been widely applied and have become the dominant approach in many computer vision tasks. These tasks include image classification [12, 20, 7, 10], object detection [2, 19], semantic segmentation, 3D reconstruction, etc. Rapid development in the deep learning field has led to network architectures that can be as deep as 100 layers and contain millions or even billions of parameters. Along with that, more and more computational resources are needed to successfully train such deep modern neural networks.
The deployment of DCNNs in real applications is largely impeded by their intensive computational and memory cost. Motivated by this observation, the study of network pruning methods that learn a smaller sub-network from a large original network without losing much accuracy has attracted a lot of attention. Network pruning algorithms can be divided into two groups: non-structured pruning and structured pruning. The earliest work on non-structured pruning was conducted by LeCun et al., and the most recent work is that of [4, 5]. Non-structured pruning directly prunes individual parameters without regard to the structure of each network layer. As a result, modern GPU acceleration techniques cannot obtain computational benefits from the irregular sparse distribution of parameters in the network; only specialized software or hardware accelerators can gain memory and time savings. The advantage of non-structured pruning is that it can obtain high network sparsity while preserving the network performance as much as possible. On the other hand, structured pruning directly removes entire convolutional filters or filter channels. Li et al. determine the importance of a convolutional filter by measuring the sum of its absolute weights. Liu et al. introduce an L1-norm constraint in the batch normalization layers to remove the filter channels associated with the smaller batch normalization coefficients. Although structured pruning cannot reach the same level of sparsity as non-structured pruning, it is friendlier to modern GPU acceleration techniques and independent of any specialized software or hardware accelerators.
Unfortunately, many of the existing non-structured and structured pruning techniques operate layer by layer, which requires a sophisticated procedure for determining the hyperparameters of each layer in order to obtain the desired number of weights or filters/channels in the end. This manner of pruning is neither effective nor efficient.
In this paper, we combine regularization techniques and sequential algorithm design to bring forward a novel global network pruning scheme that is suitable for either non-structured pruning or structured pruning (particularly for filter channel-wise pruning of DCNNs with batch normalization). We investigate an estimation problem with an L0-norm constraint over the whole parameter space, together with the use of annealing to lessen the greediness of the pruning process. An attractive property is that parameters or filter channels are removed while the model is updated at each iteration, which makes the problem size decrease during the iteration process. Extensive experiments on synthetic and real data, including the parity dataset, MNIST, CIFAR10, and CIFAR100, provide empirical evidence that the proposed global network pruning scheme has performance comparable to or better than other state-of-the-art pruning methods.
2 Related Work
Network pruning is a very active research area. It provides a powerful tool for accelerating network inference by producing a much smaller sub-network without too much accuracy loss. The earliest work on network pruning dates back to the 1990s, when LeCun et al. and Hassibi et al. proposed weight pruning methods that use the Hessian matrix of the loss function to determine the unimportant weights. More recently, Han et al. used a quality parameter multiplied by the standard deviation of a layer's weights to determine the pruning threshold: a weight in a layer is pruned if its absolute value is below that threshold. Guo et al. proposed a pruning method that incorporates connection splicing into the pruning process to avoid incorrect pruning. The pruning schemes mentioned above are all non-structured, and therefore need specialized hardware or software to gain computation and time savings.
For structured pruning, there are also quite a few works in the literature. Li et al. determine the importance of a convolutional filter by measuring the sum of its absolute weights. Hu et al. compute the average percentage of zero activations after the ReLU function and prune the corresponding filter if this percentage is high. He et al. propose an iterative two-step channel pruning method consisting of LASSO regression based channel selection and least squares reconstruction. Liu et al. introduce an L1-norm constraint in the batch normalization layers to remove the filter channels associated with the smaller batch normalization coefficients. Zhou et al. impose an extra cluster loss term in the loss function that forces the filters in each cluster to be similar, and keep only one filter per cluster after training. Yu et al. utilize a greedy algorithm to perform layer-wise channel selection by constructing a specific optimization problem.
3 Network Pruning via Direct Sparsity Control
Given a set of training examples, each consisting of an input and a corresponding target output y, and a differentiable loss function, we can formulate the pruning problem for a neural network as the constrained problem (1): minimize the training loss over the network parameters subject to an L0-norm constraint, where the L0 norm bounds the number of non-zero parameters to be less than or equal to a specified positive integer.
For non-structured pruning, we directly address the pruning problem in the whole parameter space. The final parameters will have an irregular distribution pattern of zero-valued parameters across all layers.
For structured pruning, suppose the DCNN has a set of convolutional filters or channels. We can then replace the constrained problem (1) by a problem (2) in which the sparsity constraint bounds the number of remaining filters or channels.
By solving problem (2), we obtain parameters whose zeros on the convolutional layers are distributed more uniformly, concentrated in entire filters or filter channels.
The constrained optimization problems (1) and (2) facilitate parameter tuning because our sparsity parameter is much more intuitive and easier to specify than the penalty parameters used in penalty-based formulations.
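As a sketch of how problems (1) and (2) can be written, assume the following notation, none of which is fixed by the text above: $W$ for the network parameters, $L(W)$ for the training loss, $C$ for the collection of filters or channels, $\|C\|_0$ for the number of filters or channels that are not entirely zero, and $K$ for the sparsity parameter. The last line shows a generic penalized formulation for contrast, with an assumed penalty parameter $\lambda$ and penalty $\rho$.

```latex
\begin{align}
\min_{W}\; L(W) \quad &\text{s.t.}\quad \|W\|_0 \le K, \tag{1}\\
\min_{W}\; L(W) \quad &\text{s.t.}\quad \|C\|_0 \le K, \tag{2}\\
\min_{W}\; L(W) &+ \lambda\,\rho(W). \tag{penalized form}
\end{align}
```

In the constrained forms, $K$ is the number of parameters or channels that survive, which is why it can be specified directly; in the penalized form, the sparsity that results from a given $\lambda$ is only known after training.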
In this paper, we will focus on the study of the weight-level pruning (non-structured pruning) for all neural networks and channel-level pruning (structured pruning) particularly for neural networks with batch normalization layers preceding the convolutional layers.
3.1 Basic Algorithm Description
Some key ideas in our algorithm design are: a) we concatenate the target parameters into a single vector for pruning purposes; b) we use an annealing plan to directly control the global sparsity level; c) we gradually remove the least important parameters or channels to facilitate computation. The prototype algorithm, summarized in Algorithms 1 and 2, is quite simple. It starts with either an untrained or a pre-trained model and alternates two basic steps: one step of parameter updates that minimize the loss by gradient descent, and one step that removes some parameters or channels according to the magnitudes of the target coefficients.
The intuition behind our DSC algorithms is that, during the pruning process, each time we only remove a certain number of the "smallest" target parameters/channels in the parameter/channel space, based on an annealing schedule. This ensures that we do not inject too much noise in the parameter masking step, so that the pruning procedure can be conducted smoothly. Our method directly controls the global sparsity level obtained. This is unlike many layer-wise pruning methods, which resort to a sophisticated procedure to control how many parameters are kept because they regard pruning the weights or channels in all layers simultaneously as too time-consuming.
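The alternation of gradient updates and magnitude-based removal described above can be illustrated on a toy problem. The sketch below is not the authors' Algorithm 1: it assumes a least-squares loss on a single weight vector, and the function names, the schedule values, and the step counts are all hypothetical choices made for illustration.

```python
import numpy as np

def keep_largest(w, mask, n_keep):
    """Keep the n_keep (>= 1) largest-magnitude entries among those not yet pruned."""
    scores = np.where(mask, np.abs(w), -np.inf)    # pruned entries can never return
    keep_idx = np.argsort(scores)[-n_keep:]        # indices of the n_keep largest scores
    new_mask = np.zeros_like(mask)
    new_mask[keep_idx] = True
    return new_mask

def prune_while_training(X, y, schedule, lr=0.1, steps_per_stage=100):
    """Alternate gradient steps on a least-squares loss with magnitude pruning.

    `schedule` is a non-increasing list of how many parameters to keep
    after each stage (an assumed stand-in for the annealing schedule).
    """
    n, d = X.shape
    w = np.zeros(d)
    mask = np.ones(d, dtype=bool)
    for n_keep in schedule:
        for _ in range(steps_per_stage):
            grad = X.T @ (X @ w - y) / n           # gradient of the mean squared error / 2
            w = (w - lr * grad) * mask             # update only the surviving parameters
        mask = keep_largest(w, mask, n_keep)       # tighten the sparsity constraint
        w = w * mask
    return w, mask

# Toy data: only the first three of twenty features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.01 * rng.normal(size=200)

w, mask = prune_while_training(X, y, schedule=[15, 10, 5, 3])
```

Because the removal step only looks at magnitudes, the loss function is never modified, and the problem size shrinks from stage to stage.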
Note that for algorithm DSC-2, the channel pruning procedure for each convolutional layer is based on the magnitude of the corresponding coefficient in the preceding batch normalization layer. This pruning idea was first introduced by Liu et al., who impose an L1-norm penalty on the coefficients of each batch normalization layer and add it to the training loss function. Here we embrace this idea, with two main differences: we do not make any modification to the loss function but use an L0-norm constraint instead, and we use an annealing schedule to gradually eliminate channels with small coefficients and lessen the greediness.
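The channel selection rule can be sketched as follows. This is an illustrative sketch only, assuming that each batch normalization layer exposes its per-channel coefficients as a 1-D array; the function name `global_channel_masks` and the example values are hypothetical.

```python
import numpy as np

def global_channel_masks(bn_coefficients, n_keep):
    """Keep the n_keep channels with the largest |coefficient| over ALL layers.

    bn_coefficients: list of 1-D arrays, one per batch normalization layer.
    Returns one boolean mask per layer. Ties at the threshold are all kept.
    """
    magnitudes = np.concatenate([np.abs(g) for g in bn_coefficients])
    threshold = np.sort(magnitudes)[-n_keep]       # n_keep-th largest magnitude
    return [np.abs(g) >= threshold for g in bn_coefficients]

# Two layers with 3 and 4 channels; keep 4 of the 7 channels in total.
layer_a = np.array([0.9, -0.05, 0.4])
layer_b = np.array([0.01, 0.02, -0.7, 0.3])
masks = global_channel_masks([layer_a, layer_b], n_keep=4)
```

The ranking is global, so the number of channels kept in each layer is an outcome of the selection rather than a per-layer hyperparameter.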
Through the annealing schedule, the support set of the network parameters or channels is gradually shrunk until the target sparsity level is reached. The keep-or-kill rule is simply based on the magnitude of the coefficients and does not involve any information about the objective function. This is in contrast to many ad hoc network pruning approaches that have to modify the loss function and cannot easily be scaled up to many existing pre-trained models.
3.2 Implementation Details
In this part, we provide some implementation details of our proposed DSC algorithms.
First, the annealing schedule is determined empirically. Our experimental experience shows that the following annealing plans can perform well to balance the efficiency and accuracy:
The schedule is defined in terms of the total number of parameters or channels in the neural network and consists of two parts. The first part quickly prunes the unimportant parameters, at a reasonable pruning rate, down to a given percentage of the parameters. The second part further refines the pruned sub-network into a more compact model. The schedule has the following parameters: the pruning rate, which is fixed for all experiments; the percentage of parameters or channels to be pruned in the first part; the final pruning percentage goal at the end of the pruning procedure, which determines the number of remaining parameters; the number of epochs to train before performing another pruning step; and the incremental pruning percentage applied as the annealing continues. An example of an annealing schedule is shown in Figure 1.
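A two-part schedule of this kind can be sketched in code. The functional form below is an assumption made for illustration and is not the authors' formula: the first part is taken to be a linear ramp to the first-part percentage, and the second part adds a fixed increment every few epochs until the final goal is reached. All parameter names and values are hypothetical.

```python
def pruning_fraction(epoch, p0, p_final, step, every, warmup_epochs):
    """Fraction of parameters (or channels) pruned after `epoch` (0-based)."""
    if epoch < warmup_epochs:
        # Part 1 (assumed linear ramp): move quickly toward the coarse target p0.
        return p0 * (epoch + 1) / warmup_epochs
    # Part 2: train `every` epochs, then prune `step` more, until p_final is reached.
    n_increments = (epoch - warmup_epochs + 1) // every
    return min(p_final, p0 + step * n_increments)

def n_remaining(total, fraction_pruned):
    """Number of parameters (or channels) kept for a given pruned fraction."""
    return int(round(total * (1.0 - fraction_pruned)))

schedule = [
    pruning_fraction(e, p0=0.5, p_final=0.7, step=0.05, every=2, warmup_epochs=5)
    for e in range(20)
]
kept_at_end = n_remaining(1000, schedule[-1])
```

Whatever form the first part takes, the key property is that the pruned fraction never decreases, so the support set only shrinks.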
Second, as the convolutional layers and fully connected layers have very different behavior in a DCNN, we will prune them separately during the non-structured pruning process, i.e. we will fix the convolutional layer parameters while pruning the fully connected layer, and vice versa.
Third, after the pruning process, we conduct a fine-tuning procedure to regain the performance lost during pruning. Before starting the fine-tuning, we can form a more compact network for later inference: for non-structured pruning, we remove the neurons with zero incoming or outgoing degree, and for structured pruning, we remove the masked filter channels by adding a channel selection layer after the batch normalization layer.
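For the non-structured case, removing neurons with zero incoming or outgoing degree amounts to dropping rows and columns of adjacent weight matrices. The following is a minimal sketch for one hidden layer, assuming weights are stored as `(n_out, n_in)` arrays; the function name is hypothetical.

```python
import numpy as np

def compact_hidden_layer(w_in, b, w_out):
    """Drop hidden neurons whose incoming or outgoing weights are all zero.

    w_in:  (n_hidden, n_inputs) weights into the hidden layer
    b:     (n_hidden,) hidden biases
    w_out: (n_outputs, n_hidden) weights out of the hidden layer
    Note: a neuron with no inputs but a non-zero bias emits a constant; this
    sketch drops it anyway and leaves the correction to fine-tuning.
    """
    has_input = np.any(w_in != 0, axis=1)
    has_output = np.any(w_out != 0, axis=0)
    keep = has_input & has_output
    return w_in[keep], b[keep], w_out[:, keep], keep

w_in = np.array([[1.0, 0.0], [0.0, 0.0], [2.0, 3.0]])
b = np.array([0.1, 0.2, 0.3])
w_out = np.array([[0.5, 0.7, 0.0]])
w_in_c, b_c, w_out_c, keep = compact_hidden_layer(w_in, b, w_out)
```

Here the second neuron has no incoming connections and the third has no outgoing ones, so only the first survives.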
4 Experiments
In this section, we first present a simulation on a synthetic dataset, the parity dataset, to demonstrate the effectiveness of our DSC algorithm with the selected annealing plan. We then conduct non-structured pruning with LeNet-300-100 and LeNet-5 on the MNIST dataset. Finally, we conduct structured channel pruning experiments with VGG-16 and DenseNet-40 on the CIFAR10 and CIFAR100 datasets.
4.1 Synthetic Parity Dataset
The parity data with noise is a classical problem in computational learning theory. Each feature vector is drawn uniformly at random. The label is generated following XOR logic: for some unknown subset of feature indices, the label is set to the parity of the features in that subset, subject to label noise.
That is, this dataset cannot be perfectly separated, and the best classifier would have a prediction error of 0.1.
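A generator for data of this kind can be sketched as follows. The details are assumptions chosen to match the description: features are encoded as ±1 so that the XOR of the selected features is their product, and each label is flipped independently with probability 0.1, which gives the stated best achievable error of 0.1. The function name and the example sizes are hypothetical.

```python
import numpy as np

def make_noisy_parity(n, dim, parity_idx, flip_prob=0.1, seed=0):
    """Noisy parity data with features in {-1, +1} (assumed encoding)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(n, dim))
    clean = np.prod(X[:, parity_idx], axis=1)      # product of +/-1 values encodes XOR
    flips = rng.random(n) < flip_prob              # independent label noise
    y = np.where(flips, -clean, clean)
    return X, y, clean

X, y, clean = make_noisy_parity(n=20000, dim=30, parity_idx=[0, 3, 7, 12, 21])
noise_rate = float(np.mean(y != clean))            # error of the ideal classifier
```

Even a classifier that knows the parity subset exactly misclassifies the flipped labels, so its error equals the flip rate.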
This kind of dataset is frequently used to test different optimizers and regularization techniques on NN models. The training, validation, and testing sets contain 15,000, 5,000, and 5,000 data points, respectively. We train a neural network with a single hidden layer using DSC-1 and perform pruning only on the weights connecting the hidden layer and the output neuron. By doing this, we can determine exactly how many hidden neurons are preserved in the NN model. Starting with 256 hidden nodes, we prune down to a range of smaller hidden node numbers using the annealing schedule mentioned above, and we report the best result out of 10 independent random initializations.
The comparison of the test errors is shown in Figure 2. We train a neural network with one hidden layer using default SGD, Adam, and Adam + DSC-1. We can see that the NN with the SGD optimizer cannot learn any good model with fewer than 100 hidden nodes on this data, while the NN with the Adam optimizer can learn some pattern when the number of hidden nodes is greater than 25, but in most cases is still trapped in shallow local optima. The best performance is achieved by the NN with Adam + DSC-1, starting from 256 hidden nodes. After applying DSC-1 during NN training, we only need to keep as few as 6 hidden nodes to obtain the best possible prediction error. This observation implies that (a) the DSC-1 algorithm has a good capability to find a global optimum, or a deep enough local optimum, by gradually removing unimportant connections, and (b) the direct sparsity control design can help the final NN model come very close to the most compact model achievable.
4.2 Non-structured Pruning on MNIST
The MNIST dataset provided by LeCun et al. is a handwritten digits dataset that is widely used in evaluating machine learning algorithms. It contains 50K training, 10K validation, and 10K testing examples. In this section, we test our non-structured pruning method DSC-1 on two network models: LeNet-300-100 and LeNet-5.
LeNet-300-100 is a classical fully connected neural network with two hidden layers. The first hidden layer has 300 neurons and the second has 100. LeNet-5 is a conventional convolutional neural network that has two convolutional layers and two fully connected layers. LeNet-300-100 consists of 267K learnable parameters and LeNet-5 consists of 431K. To have a fair comparison with Han et al., we follow the same experimental setting, using the default SGD method, training batch size, and initial learning rate to train the two models from scratch. Once a model with similar performance is obtained, we stop the training and directly apply our DSC-1 pruning algorithm to compress the model. During the pruning and retraining procedure, a learning rate equal to 1/10 of the original network's learning rate is adopted. Momentum with a coefficient of 0.9 is used to speed up the retraining of the model.
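The parameter counts quoted above can be checked with a short calculation. The layer sizes are assumptions not stated in the text: 28×28 inputs and 10 output classes for both models, and for LeNet-5 the commonly used variant with 20 and 50 filters of size 5×5 followed by a 500-unit fully connected layer.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a square-kernel convolutional layer."""
    return c_in * c_out * kernel * kernel + c_out

# LeNet-300-100: 784 -> 300 -> 100 -> 10
lenet_300_100 = dense_params(784, 300) + dense_params(300, 100) + dense_params(100, 10)

# LeNet-5 (assumed variant): conv 1->20, conv 20->50, fc 800->500, fc 500->10
lenet_5 = (conv_params(1, 20, 5) + conv_params(20, 50, 5)
           + dense_params(800, 500) + dense_params(500, 10))

print(lenet_300_100, lenet_5)
```

Under these assumptions the totals are 266,610 and 431,080, which round to the 267K and 431K figures given above.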
| Model | Error | Params. | Pruned |
| LeNet-300-100 (Han et al.) | 1.59% | 22K | 91.8% |
| LeNet-5 (Han et al.) | 0.77% | 36K | 91.6% |

| Model | Layer | Params. | Han % | Ours % |
For LeNet-300-100, a total of 20 epochs were used for both pruning and fine-tuning. The first-part pruning percentage is directly set to 0.85 without using any annealing. We then follow the fine-grained part of the annealing schedule to reach the final percentage goal.
The remaining epochs are used for fine-tuning. For LeNet-5, the fully connected layers and the convolutional layers are pruned separately. For the fully connected layers, we directly set the first-part pruning percentage and then anneal to the final goal. For the convolutional layers, we likewise start from a first-part pruning percentage and anneal to the final goal. The total number of pruning and retraining epochs for LeNet-5 is 40. After several experimental trials, we report our best result in Table 1.
From the results in Table 1, one can observe that our proposed non-structured pruning algorithm can learn a more compact sub-network for both LeNet-300-100 and LeNet-5, with performance comparable to that of Han et al.
By using a global hyperparameter, we can directly control the sparsity level to get close to the most compact model achievable. Using a quality factor times the standard deviation as a pruning threshold in each layer, as proposed by Han et al., cannot determine exactly how many parameters will be kept. Our method directly controls the sparsity level globally and therefore has a higher chance of reaching the most compact sub-network.
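The difference between the two rules can be made concrete with a small sketch. Both functions are simplified illustrations under assumptions: random Gaussian weights stand in for trained layers, and the names and the quality value are hypothetical rather than taken from either method's implementation.

```python
import numpy as np

def layerwise_std_masks(layers, quality):
    """Per-layer rule: keep weights with |w| >= quality * std of that layer."""
    return [np.abs(w) >= quality * np.std(w) for w in layers]

def global_topk_masks(layers, n_keep):
    """Global rule: keep the n_keep largest-magnitude weights over all layers."""
    flat = np.concatenate([np.abs(w).ravel() for w in layers])
    threshold = np.partition(flat, -n_keep)[-n_keep]   # n_keep-th largest magnitude
    return [np.abs(w) >= threshold for w in layers]

rng = np.random.default_rng(0)
layers = [rng.normal(size=(30, 20)), rng.normal(scale=0.1, size=(20, 5))]

layer_masks = layerwise_std_masks(layers, quality=1.0)
global_masks = global_topk_masks(layers, n_keep=70)

n_layerwise = sum(int(m.sum()) for m in layer_masks)   # known only after thresholding
n_global = sum(int(m.sum()) for m in global_masks)     # equals n_keep by construction
```

With the layer-wise rule, the number of surviving weights depends on the weight distribution in each layer; with the global rule, it is the quantity being specified.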
Table 2 shows the layer-by-layer compression comparison between our method and that of Han et al. It is interesting to see that although the two pruning algorithms yield similar performance, the resulting network architectures are quite different. Our DSC-1 algorithm controls the global sparsity level in the whole parameter space with an annealing schedule, which lets the target sub-network learn its pattern from a global view. For LeNet-300-100, most of the parameter removal occurs in the first layer, which is quite reasonable because the images in the MNIST dataset are grayscale and contain a large portion of pure black pixels. These black pixels contribute almost nothing useful for the neural network to learn from. The smallest percentage drop in parameters occurs in the output layer, where a high percentage of the parameters is preserved. We conjecture that, once the most irrelevant features have been removed in the first fully connected layer, the output layer must retain a considerable number of parameters to carry the weight of the kept, useful features. For LeNet-5, the highest parameter preservation occurs in the first convolutional layer. This is again reasonable, as the first layer should be the most important layer, directly extracting relevant features from the raw input image. Our global direct sparsity control strategy lets the network itself decide which parts are more important than others, and which parts contain mostly irrelevant or junk connections that can be removed safely. The parameter percentage distribution of the two fully connected layers in LeNet-5 behaves similarly to that in LeNet-300-100.
4.3 Structured Channel Pruning on the CIFAR Datasets
The CIFAR datasets (CIFAR10 and CIFAR100) provided by Krizhevsky et al. are established computer vision datasets used for image classification and object recognition. Each CIFAR dataset consists of a total of 60K natural color images, divided into a training set with 50K images and a testing set with 10K images. The CIFAR-10 dataset is drawn from 10 classes with 6000 images per class. The CIFAR-100 dataset is drawn from 100 classes with 600 images per class. The color images in the CIFAR datasets have a resolution of 32×32 pixels. In this section, we test our structured channel pruning method DSC-2 on two network models: VGG-16 and DenseNet-40.
VGG-16 is a deep convolutional neural network containing 16 layers, which was mainly designed for the ImageNet dataset. Initially, we planned to use the same variation of the original VGG-16 designed for CIFAR datasets as in prior work, so as to have an identical comparison of our channel pruning method DSC-2 with theirs. However, we had a hard time training it from scratch to obtain a similar baseline performance. We therefore adopt another variation of VGG-16 also designed for CIFAR datasets, which was used in prior work and has a smaller number of total parameters, to conduct our experiments and compare with other state-of-the-art pruning algorithms. For DenseNet, we adopt DenseNet-40, with a total of 40 layers and a growth rate of 12.
We first train all the networks from scratch to obtain baseline results similar to those reported in prior work, including Liu et al. The total number of training epochs was set to 250 for all networks. The batch size was 128. A stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.1, weight decay, and momentum was adopted. A division of the learning rate by 5 occurs at 50 training epochs. For these datasets, standard data augmentation techniques including normalization, random flipping, and cropping were applied.
During the pruning and fine-tuning procedure, the same total number of training epochs is adopted. We use an SGD optimizer, again with a specified initial learning rate, and with no weight decay or only a very small weight decay for pruning and fine-tuning. Similarly, a division of the learning rate by 2 occurs at 50 training epochs. For the annealing schedule, a grid search is used to determine the best schedule parameters for each final pruning goal. Good values of the first-part pruning target were found empirically for the CIFAR 10 dataset and for CIFAR 100. After the first part of the pruning schedule reaches this target, we conduct the fine-grained pruning for each final pruning rate. We report our best results in Table 3 for CIFAR 10 and Table 4 for CIFAR 100.
The experimental results displayed in Tables 3 and 4 demonstrate the effectiveness of our proposed channel pruning algorithm DSC-2. It can be observed that our DSC-2 method obtains results competitive with or even better than those of Liu et al. Moreover, our DSC-2 pruning method does not introduce any extra term in the training loss function. By using the annealing schedule to gradually remove the "unimportant" channels based on the magnitude of the corresponding batch normalization coefficient, we can successfully find a compact sub-network without losing any model performance. Our DSC-2 is easy to use and can easily be scaled up to any untrained or existing pre-trained model.
Figure 3 displays two 70% channel-pruned network models for the CIFAR 10 dataset. Due to the significant differences in network architecture between VGG-16 and DenseNet-40, the resulting distributions of the percentage of remaining channels are quite different. For VGG-16, only a very small number of channels are kept in the last five CONV layers. This is reasonable, as the last five CONV layers are the layers that initially have 512 input channels. Evidently, we do not need so many channels in each of the last five layers. The high pruning percentage may suggest that the VGG-16 network is over-parameterized in a layer-wise way for the CIFAR 10 dataset. For DenseNet-40 with a growth rate of 12, the percentage of kept channels is relatively evenly distributed across the CONV layers, except for the two transitional layers. This is again reasonable given the special architecture of DenseNet. With a growth rate of 12, every 12 consecutive layers are correlated with each other, and the outputs of the previous CONV layers are concatenated to form the inputs of the following CONV layer inside the growth rate period. Only the transitional layers do not have this property. Overall, our channel-level pruning algorithm DSC-2 with global sparsity control can automatically detect a reasonable sub-network without performance loss for VGG-16 and DenseNet-40 on the CIFAR datasets.
5 Conclusion
This paper presented a global neural network pruning method that is suitable for both structured and non-structured pruning. The method directly imposes a sparsity constraint on the network parameters, which is gradually tightened to the desired sparsity level. This direct control allows us to obtain the precise sparsity level desired, as opposed to other recent methods that obtain the sparsity level indirectly through the use of penalty parameters. Experiments on synthetic and real data, including the parity dataset, MNIST, CIFAR10, and CIFAR100, provide empirical evidence that the proposed global network pruning scheme obtains performance comparable to or better than other state-of-the-art pruning methods.
- (2017) End-to-end 3d face reconstruction with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5908–5917. Cited by: §1.
- (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
- (2016) Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pp. 1379–1387. Cited by: §2.
- (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §4.2, §4.2, §4.2.
- (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1, §2.
- (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in neural information processing systems, pp. 164–171. Cited by: §2.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
- (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §2.
- (2016) Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250. Cited by: §2.
- (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1, §4.3, §4.3, §4.
- (2009) Learning multiple layers of features from tiny images. Cited by: §4.3, §4.
- (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
- (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.2, §4.
- (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/. Cited by: §4.2, §4.
- (1990) Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §1, §2.
- (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §1, §2, §4.3.
- (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744. Cited by: §1, §2, §3.1, §3.2, §4.3, §4.3, Table 3, Table 4.
- (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
- (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.3, §4.3, §4.
- (2018) Nisp: pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203. Cited by: §2.
- (2017) On the learnability of fully-connected neural networks. In Artificial Intelligence and Statistics, pp. 83–91. Cited by: §4.1, §4.
- (2018) Online filter weakening and pruning for efficient convnets. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §2.