Parameterized Structured Pruning for Deep Neural Networks
As a result of the growing size of Deep Neural Networks (DNNs), the gap to hardware capabilities in terms of memory and compute increases. To effectively compress DNNs, quantization and connection pruning are usually considered. However, unconstrained pruning usually leads to unstructured parallelism, which maps poorly to massively parallel processors, and substantially reduces the efficiency of general-purpose processors. The same applies to quantization, which often requires dedicated hardware.
We propose Parameterized Structured Pruning (PSP), a novel method to dynamically learn the shape of DNNs through structured sparsity. PSP parameterizes structures (e.g. channel- or layer-wise) in a weight tensor and leverages weight decay to learn a clear distinction between important and unimportant structures. As a result, PSP maintains prediction performance, creates a substantial amount of sparsity that is structured and, thus, easy and efficient to map to a variety of massively parallel processors, which are mandatory for utmost compute power and energy efficiency. PSP is experimentally validated on the popular CIFAR10/100 and ILSVRC2012 datasets using ResNet and DenseNet architectures, respectively.
Deep Neural Networks (DNNs) are widely used for many applications including object recognition Krizhevsky:2012:ICD:2999134.2999257, speech recognition hinton12 and robotics lenz2016deep. An empirical observation is that DNNs, trained by Stochastic Gradient Descent (SGD) from random initialization, are remarkably successful in fitting training data DBLP:journals/corr/ZhangBHRV16. The ability of modern DNNs to fit training data so well is suspected to be due to heavy over-parameterization, i.e., using more parameters than the total number of training samples, since there always exist parameter choices that achieve a training error of zero DBLP:journals/corr/abs-1811-03962. In particular, Li et al. DBLP:journals/corr/abs-1808-01204 showed that SGD finds nearly globally optimal solutions on the training data as long as the network is over-parameterized, a result that can be extended to test data as well.
While over-parameterization is essential for the learning ability of neural networks, it results in extreme memory and compute requirements for training (development) as well as inference (deployment). Recent research showed, e.g. DBLP:journals/corr/abs-1811-06992, that training can be scaled to up to 1024 accelerators operating in parallel, resulting in a development phase not exceeding a couple of minutes, even for large-scale image classification. However, deployment usually has much tighter constraints than development, as energy, space and monetary budgets are scarce in mobile devices.
Model compression methods target this issue by training an over-parameterized model and compressing it for deployment. Popular compression methods are pruning DBLP:journals/corr/HanPTD15; DBLP:journals/corr/GuoYC16, quantization DBLP:journals/corr/abs-1807-10029, knowledge distillation HinVin15Distilling, and low-rank factorization DBLP:journals/corr/JaderbergVZ14; DBLP:journals/corr/DentonZBLF14, with the first two being most popular due to their extreme efficiency. Pruning connections DBLP:journals/corr/HanPTD15; DBLP:journals/corr/GuoYC16 achieves impressive theoretical compression rates through fine-grained sparsity (Fig. 0(a)) without sacrificing prediction performance, but has several practical drawbacks such as indexing overhead, load imbalance and random memory accesses: (i) Compression rates are typically reported without considering the space requirement of additional data structures to represent non-zero weights. For instance, using indices, a model with 8-bit weights, 8-bit indices and 75% sparsity saves only 50% of the space, while a model with 50% sparsity does not save memory at all. (ii) It is a well-known problem that massively parallel processors show notoriously poor performance when the load is not well balanced. Unfortunately, since the end of Dennard CMOS scaling 1050511, massive parallelization is mandatory for continued performance scaling. (iii) Sparse models increase the amount of randomness in memory access patterns, preventing caching methods which rely on predictable strides from being effective. As a result, the number of cache misses grows, which increases the average memory access latency and energy consumption, as off-chip accesses are 10-100 times higher in terms of latency and 100-1000 times higher in terms of energy consumption 6757323.
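The indexing overhead in (i) can be checked with a few lines of arithmetic. The sketch below assumes one index is stored per non-zero weight; the function name `compressed_size_ratio` is hypothetical and only serves this illustration.

```python
def compressed_size_ratio(sparsity, weight_bits=8, index_bits=8):
    """Stored size of an index-based sparse model relative to the dense model.

    Assumption: every non-zero weight is stored together with one index.
    """
    nonzero_fraction = 1.0 - sparsity
    sparse_bits_per_weight = nonzero_fraction * (weight_bits + index_bits)
    return sparse_bits_per_weight / weight_bits


for sparsity in (0.75, 0.50):
    ratio = compressed_size_ratio(sparsity)
    print(f"sparsity={sparsity:.0%}: relative size={ratio:.2f}, saving={1 - ratio:.0%}")
```

With 8-bit weights and 8-bit indices, 75% sparsity gives a relative size of 0.5 (a 50% saving), and 50% sparsity gives a relative size of 1.0 (no saving), matching the numbers in the text.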
Quantization has recently received plenty of attention and reduces computational complexity as additions scale linearly and multiplications scale quadratically with the number of bits SzeCYE17. However, in comparison, pruning avoids a computation completely.
Structured pruning methods can prevent these drawbacks by inducing sparsity in a hardware-friendly way: Fig. 0(b)-0(e) illustrate this for a 4-dimensional convolution tensor (see DBLP:journals/corr/ChetlurWVCTCS14 for details on convolution lowering), where hardware-friendly sparsity structures are shown as channels, layers, etc. However, pruning whole structures in a neural network is not as trivial as pruning individual connections and usually causes high accuracy degradation even under moderate compression constraints. Structured pruning methods can be roughly clustered into two categories: re-training-based and regularization-based methods (see Sec. 2 for details). Re-training-based methods aim to remove structures by minimizing the pruning error in terms of changes in weight, activation, or loss between the pruned and the pre-trained model. Regularization-based methods train a randomly initialized model and apply regularization, usually an $\ell_1$ penalty, in order to force structures to zero. This work introduces a new regularization-based method based on learned parameters for structured sparsity without a substantial increase in training time. Our approach differs from previous methods, as we explicitly parameterize certain structures of weight tensors and regularize them with weight decay, enabling a clear distinction between important and unimportant structures. Combined with threshold-based magnitude pruning and a straight-through gradient estimator (STE) DBLP:journals/corr/BengioLC13, we can remove a substantial amount of structure while maintaining the classification accuracy. We evaluate the proposed method using state-of-the-art Convolutional Neural Networks (CNNs) like ResNet DBLP:journals/corr/HeZRS15 and DenseNet DBLP:journals/corr/HuangLW16a, and popular datasets like CIFAR-10/100 cifar and ILSVRC2012 (DBLP:journals/corr/RussakovskyDSKSMHKKBBF14).
2 Related Work
Re-training-based methods: Hu et al. DBLP:journals/corr/HuPTT16 propose to prune neurons based on their average percentage of zero activations. Li et al. DBLP:journals/corr/LiKDSG16 evaluate the importance of filters by calculating their absolute weight sum. Mao et al. DBLP:journals/corr/MaoHPLLWD17 prune structures with the lowest norm. Channel Pruning (CP) DBLP:journals/corr/HeZS17 uses an iterative two-step algorithm to prune each layer by a LASSO-regression-based channel selection and least-squares reconstruction. Structured Probabilistic Pruning (SPP) DBLP:journals/corr/abs-1709-06994 introduces a pruning probability for each weight, where pruning is guided by sampling from the pruning probabilities. Soft Filter Pruning (SFP) DBLP:journals/corr/abs-1808-06866 enables pruned filters to be updated when training the model after pruning, which results in larger model capacity and less dependency on the pre-trained model. Layer-Compensated Pruning (LCP) DBLP:journals/corr/abs-1810-00518 leverages meta-learning to learn a set of latent variables that compensate for approximation errors. ThiNet DBLP:journals/corr/LuoWL17 shows that pruning filters based on statistical information calculated from the following layer is more accurate than using statistics of the current layer. Discrimination-aware Channel Pruning (DCP) DBLP:journals/corr/abs-1810-11809 selects channels based on their discriminative power.
Regularization-based methods: Group LASSO Yuan06modelselection allows predefined groups in a model to be selected together. Adding an $\ell_1$ penalty to each group is a heavily used approach for inducing structured sparsity in CNNs DBLP:journals/corr/LebedevL15; DBLP:journals/corr/WenWWCL16; Jenatton:2011:SVS:1953048.2078194; NIPS2016_6372. Network Slimming DBLP:journals/corr/abs-1708-06519, Huang et al. DBLP:journals/corr/HuangW17aa and MorphNet DBLP:journals/corr/abs-1711-06798 apply regularization on coefficients of batch-normalization layers in order to create sparsity in a structured way.
3 Parameterized Pruning
DNNs are constructed by layers of stacked processing units, where each unit computes an activation function of the form

$$z = g(W \oplus x), \tag{1}$$

where $W$ is a weight tensor, $x$ is an input tensor, $\oplus$ denotes a linear operation, e.g., a convolution, and $g(\cdot)$ is a non-linear function. Modern neural networks have very large numbers of these stacked compute units, resulting in huge memory requirements for the weight tensors $W$ and compute requirements for the linear operations $W \oplus x$. In this work, we aim to learn a structured sparse substitute $Q$ for the weight tensor $W$, so that there is only minimal overhead for representing the sparsity pattern in $Q$ while retaining computational efficiency using dense tensor operations. For instance, by setting all weights at certain indices of the tensor to zero, it suffices to store the indices of non-zero elements only once for the entire tensor $Q$ and not for each individual dimension separately. By setting all weights connected to an input feature map to zero, the corresponding feature map can effectively be removed without the need to store any indices at all.
3.1 Hardware-friendly structures in CNNs
We consider CNNs with filter kernels, input and output feature maps. Different granularities of structured sparsity yield different flexibilities when mapped to hardware. In this work, we consider only coarse-grained structures such as column, channel and layer pruning, that can be implemented using off-the-shelf libraries on general-purpose hardware or shape pruning for direct convolutions on re-configurable hardware.
Convolutions are usually lowered onto matrix multiplication in order to explore data locality and the massive amounts of parallelism in general-purpose GPUs or specialized processors like TPUs DBLP:journals/corr/JouppiYPPABBBBB17. The reader may refer to the work of Chetlur et al. DBLP:journals/corr/ChetlurWVCTCS14 for a detailed explanation. Consequently, the computation of all structured sparsities that are explored in this work can be lowered to dense matrix multiplications.
Column pruning refers to pruning weight tensors such that a whole column of the flattened weight tensor and the respective row of the input data can be removed (Fig. 0(b)). Channel pruning refers to removing a whole channel in the weight tensor and the respective input feature map (Fig. 0(c)). Shape pruning aims to sparsify the filter kernels of a layer equally (Fig. 0(d)), which can be mapped onto re-configurable hardware. Layer pruning simply removes whole layers of a DNN (Fig. 0(e)).
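To illustrate why column pruning keeps the computation dense, the following minimal sketch removes zeroed columns of a flattened weight matrix together with the matching rows of the lowered input and checks that the smaller dense product is unchanged. The helper names (`matmul`, `prune_columns`) and all numbers are hypothetical and chosen for illustration only.

```python
def matmul(a, b):
    """Dense matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def prune_columns(weights, inputs, pruned):
    """Drop the pruned columns of `weights` and the matching rows of `inputs`."""
    keep = [j for j in range(len(weights[0])) if j not in pruned]
    small_w = [[row[j] for j in keep] for row in weights]
    small_x = [inputs[j] for j in keep]
    return small_w, small_x


# Flattened weight matrix (2 filters x 4 columns); columns 1 and 3 are all zero.
W = [[1.0, 0.0, 2.0, 0.0],
     [3.0, 0.0, 4.0, 0.0]]
# Lowered input data (4 rows x 3 output positions).
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0],
     [1.0, 1.0, 1.0]]

W_small, X_small = prune_columns(W, X, pruned={1, 3})
# The 2x2 by 2x3 product equals the original 2x4 by 4x3 product.
assert matmul(W_small, X_small) == matmul(W, X)
```

No index structure is needed at run time: the pruned operands are simply smaller dense matrices.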
The proposed approach is not restricted to these forms of sparsity; arbitrary structures and combinations of different structures are possible. Other, more fine-grained structured sparsities are explored by Mao et al. DBLP:journals/corr/MaoHPLLWD17.
Identifying the importance of certain structures in neural networks is vital for the prediction performance of structured-pruning methods. Our approach is to train the importance of structures by parameterizing and optimizing them together with the weights using backpropagation. Therefore, we divide the tensor $W$ into subtensors $\{w_i\}$ so that each $w_i$ constitutes the weights of structure $i$. During forward propagation, we substitute $w_i$ by the structured sparse tensor $q_i$ as

$$q_i = w_i \cdot \nu_i(\alpha_i), \tag{2}$$

$$\nu_i(\alpha_i) = \begin{cases} 0 & \text{if } |\alpha_i| < \epsilon, \\ \alpha_i & \text{otherwise,} \end{cases} \tag{3}$$

where $\alpha_i$ is the dense structure parameter associated with structure $i$ and $\epsilon$ is a tuneable pruning threshold. As the threshold function $\nu_i$ is not differentiable at $\pm\epsilon$ and its gradient is zero in $(-\epsilon, \epsilon)$, we approximate the gradient of $\nu_i$ by defining a STE as

$$\frac{\partial \nu_i}{\partial \alpha_i} = 1. \tag{4}$$

We use the sparse parameters $\nu_i(\alpha_i)$ for forward and backward propagation and update the respective dense parameters $\alpha_i$ based on the gradients of $\nu_i(\alpha_i)$. Updating the dense structure parameters $\alpha_i$ instead of the sparse parameters $\nu_i(\alpha_i)$ is beneficial because improperly pruned structures can reappear if $\alpha_i$ moves out of the pruning interval $(-\epsilon, \epsilon)$, resulting in faster convergence to a better performance. Following the chain rule, the gradient of the dense structure parameters $\alpha_i$ is:

$$\frac{\partial E}{\partial \alpha_i} = \frac{\partial E}{\partial \nu_i} \frac{\partial \nu_i}{\partial \alpha_i} = \sum_j \frac{\partial E}{\partial q_{i,j}} \, w_{i,j}, \tag{5}$$

where $E$ represents the objective function and $j$ indexes the weights within structure $i$. As a result, the dense structure parameters $\alpha_i$ descend towards the predominant direction of the structure weights. Training the structures introduces additional parameters; during inference, however, they are folded into the weight tensors, resulting in no extra memory or compute costs. The dense structure parameters for individual structures and their corresponding gradients are shown in Table 1. Note that layer pruning is only applicable to multi-branch neural network architectures.
|Pruning method||Dense structure parameter||Gradient|
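The forward substitution with a threshold and the straight-through gradient for the dense structure parameters can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`threshold`, `forward`, `alpha_gradients`), the two toy structures and the upstream gradient values are all hypothetical.

```python
def threshold(alpha, eps):
    """Threshold function: zero inside the pruning interval, identity outside."""
    return 0.0 if abs(alpha) < eps else alpha


def forward(weights, alpha, eps):
    """Structured sparse weights: every weight of structure i is scaled by nu(alpha_i)."""
    return [[w * threshold(a, eps) for w in w_i] for w_i, a in zip(weights, alpha)]


def alpha_gradients(weights, grad_q):
    """Gradient of the loss w.r.t. each dense alpha_i with the STE (d nu / d alpha = 1).

    `grad_q` holds the upstream gradient dE/dq and has the same shape as `weights`.
    """
    return [sum(g * w for g, w in zip(g_i, w_i)) for g_i, w_i in zip(grad_q, weights)]


# Two structures (e.g. two columns) with three weights each; values are made up.
weights = [[0.5, -1.0, 2.0], [1.0, 1.0, 1.0]]
alpha = [0.05, 0.8]   # dense structure parameters
eps = 0.1             # pruning threshold

q = forward(weights, alpha, eps)               # structure 0 is pruned, structure 1 is kept
grad_q = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]    # pretend upstream gradient dE/dq
grad_alpha = alpha_gradients(weights, grad_q)  # non-zero even for the pruned structure
```

Because the gradient of the pruned structure is non-zero, its dense parameter keeps being updated and the structure can reappear once the parameter leaves the pruning interval.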
We use SGD for training and apply momentum and weight decay when updating the dense structure parameters $\alpha_i$:

$$\Delta\alpha_i(t+1) = \mu \, \Delta\alpha_i(t) - \eta \, \frac{\partial E}{\partial \alpha_i} - \lambda \, \eta \, \alpha_i, \tag{6}$$

where $E$ is the objective function, $\mu$ is the momentum, $\eta$ is the learning rate and $\lambda$ is the regularization strength. We use momentum in order to diminish fluctuations in parameter changes over iterations, which is highly important since we update large structures of a layer.
Regularization not only prevents overfitting, but also decays the dense structure parameters towards zero (but not exactly to zero) and, hence, reduces the pruning error. Using weight decay for sparsity instead of $\ell_1$ regularization may seem counterintuitive, since $\ell_1$ regularization implicitly decays parameters exactly to zero. However, the update rules of $\ell_1$ regularization and weight decay differ significantly: the objective function for $\ell_1$ regularization changes to $E_{\ell_1} = E + \lambda \sum_i |\alpha_i|$, while for weight decay it changes to $E_{wd} = E + \frac{\lambda}{2} \sum_i \alpha_i^2$. Adding the $\ell_1$ penalty results in the SGD update rule

$$\Delta\alpha_i(t+1) = \mu \, \Delta\alpha_i(t) - \eta \, \frac{\partial E}{\partial \alpha_i} - \lambda \, \eta \, \operatorname{sign}(\alpha_i), \tag{7}$$

while weight decay results in the update rule of Eq. 6. $\ell_1$ regularization only considers the direction the parameters are decayed towards, whereas weight decay also takes the magnitude of the parameters into account. This makes a severe difference in the learning capabilities of SGD-based neural networks, which can be best visualized using the distributions of the dense structure parameters (corresponding to different layers) in Fig. 2.
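The difference between the two update rules can be made concrete with a small sketch. It assumes the parameter is updated by adding the computed step to its previous value; the function `sgd_step` and the hyperparameter values are hypothetical and not taken from the experiments.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def sgd_step(alpha, velocity, grad, lr, momentum, lam, penalty):
    """One momentum-SGD step on a scalar structure parameter.

    penalty="weight_decay" decays proportionally to alpha,
    penalty="l1" decays by a constant amount in the direction of zero.
    """
    if penalty == "weight_decay":
        decay = lam * alpha
    elif penalty == "l1":
        decay = lam * sign(alpha)
    else:
        raise ValueError(f"unknown penalty: {penalty}")
    velocity = momentum * velocity - lr * grad - lr * decay
    return alpha + velocity, velocity


# Zero task gradient isolates the effect of the regularizer (illustrative values).
for start in (1.0, 0.01):
    for penalty in ("weight_decay", "l1"):
        new_alpha, _ = sgd_step(start, 0.0, 0.0, lr=0.1, momentum=0.0,
                                lam=0.1, penalty=penalty)
        print(f"{penalty:>12}: {start} -> {new_alpha:.6f} (shrunk by {start - new_alpha:.6f})")
```

In this setting, weight decay shrinks a large parameter by a large amount and a small parameter by a small amount, whereas the $\ell_1$ step is the same size for both.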
Parameterizing structures and regularization ultimately shrink the complexity (variance of the layers) of a neural network. We observe that weight decay without pruning (threshold $\epsilon = 0$) produces unimodal, bimodal and trimodal distributions (Fig. 1(a)-1(d)), indicating different complexities, with a clear distinction between important and unimportant dense structure parameters. In contrast, $\ell_1$ regularization without pruning ($\epsilon = 0$) (Fig. 1(e)-Fig. 1(h)) lacks the ability to form this clear distinction. In addition, $\ell_1$-regularized dense structure parameters are roughly one order of magnitude larger than parameters trained with weight decay, making them more sensitive to small noise in the input data.
The CIFAR-10 and CIFAR-100 datasets cifar consist of $32 \times 32$ colored images, with 50,000 training and 10,000 validation images. They differ in the number of classes, which is 10 and 100, respectively. For data augmentation, we subtract the per-pixel mean from the input images, following He et al. DBLP:journals/corr/HeZRS15. The ILSVRC 2012 dataset (ImageNet) DBLP:journals/corr/RussakovskyDSKSMHKKBBF14 consists of 1.28 million training and 50,000 validation images. We adopt the data preprocessing from DBLP:journals/corr/HeZRS15; DBLP:journals/corr/HuangLW16a and we report top-1 and top-5 classification errors on the validation set.
We only use already optimized state-of-the-art networks for our experiments: ResNet DBLP:journals/corr/HeZRS15 and DenseNet DBLP:journals/corr/HuangLW16a. We use the same networks for CIFAR and ImageNet as described in the original publications. Both networks apply $1 \times 1$ convolutions as bottleneck layers before the $3 \times 3$ convolutions to improve compute and memory efficiency. DenseNet further improves model compactness by reducing the number of feature maps at transition layers. If bottleneck and transition compression are used, the models are labeled as ResNet-B and DenseNet-BC, respectively. Removing the bottleneck layers in combination with our compression approach has the advantage of reducing both the memory/compute requirements and the depth of the networks. We apply PSP to all convolutional layers except the sensitive input, output, transition and shortcut layers, which have negligible impact on overall memory and compute costs.
We train all models using SGD and a batch size of 64 (1 GPU) and 256 (8 GPUs) for CIFAR and ImageNet, respectively. For the CIFAR experiments, we train for 300 epochs and start with a learning rate of 0.1, which is divided by 10 at 50% and 75% of the training epochs DBLP:journals/corr/HuangLW16a. For the ImageNet experiments, we train for 110 epochs and start with a learning rate of 0.1, which is divided by 10 at 30, 60, 90 and 100 epochs DBLP:journals/corr/HeZRS15. We use a weight decay of and a momentum of 0.9 for weights and structure parameters throughout this work. We use the initialization introduced by He et al. DBLP:journals/corr/HeZR015 for the weights and initialize the structure parameters randomly using a zero-mean Gaussian with standard deviation 0.1. For the DenseNet experiments, we set the threshold parameter for inducing sparsity (Eq. 3). For the ResNet experiments, we set the threshold parameter , except for the following ablation experiments (Sec. 4.1), where we evaluate the sensitivity of the hyperparameter and different sparsity constraints.
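The step-wise learning rate schedules described above can be written as a short helper. The sketch assumes 0-based epoch counting and that each drop takes effect at the start of the milestone epoch; the names `step_lr`, `CIFAR_MILESTONES` and `IMAGENET_MILESTONES` are hypothetical.

```python
def step_lr(epoch, milestones, base_lr=0.1, factor=0.1):
    """Learning rate at `epoch`: multiplied by `factor` once per milestone passed."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


CIFAR_MILESTONES = (150, 225)            # 50% and 75% of 300 epochs
IMAGENET_MILESTONES = (30, 60, 90, 100)  # out of 110 epochs

for epoch in (0, 149, 150, 225, 299):
    print("CIFAR", epoch, step_lr(epoch, CIFAR_MILESTONES))
for epoch in (0, 30, 60, 90, 100, 109):
    print("ImageNet", epoch, step_lr(epoch, IMAGENET_MILESTONES))
```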
4.1 Ablation experiments
We start the experiments with an ablation experiment to validate methods and statements made in this work. This experiment uses a ResNet with 56 layers and column pruning on the CIFAR10 dataset (Fig. 3). We report the validation error for varying sparsity constraints, with the baseline error set to that of the original unpruned network, with some latitude to filter out fluctuations: . The dashed vertical lines indicate the maximum amount of sparsity while maintaining the baseline error. A common way DBLP:journals/corr/MaoHPLLWD17 to estimate the importance of structures is the $\ell_1$ norm $\lVert w_i \rVert_1$ of the targeted structure in a weight tensor, which is followed by pruning the structures with the smallest norm. We use this rather simple approach as a baseline, denoted as $\ell_1$ norm, to show the differences to the proposed parameterized structure pruning. The parameterization in its most basic form is denoted as PSP (fixed sparsity), where we do not apply regularization ($\lambda = 0$ in the SGD update in Eq. 6) and simply prune the parameters with the lowest magnitude. As can be seen, this parameterization achieves about 10% more sparsity compared to the baseline ($\ell_1$ norm) approach, or 1.8% better accuracy under a sparsity constraint of 80%.
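The two fixed-sparsity selection rules compared here differ only in the score used to rank structures. The following sketch illustrates this under simple assumptions: `lowest_scoring` is a hypothetical helper, the toy weights and structure parameters are made up, and ties and per-layer budgets are ignored.

```python
def lowest_scoring(scores, sparsity):
    """Indices of the `sparsity` fraction of structures with the smallest score."""
    n_prune = int(sparsity * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return set(order[:n_prune])


# Four structures with two weights each, plus one learned parameter per structure.
structures = [[0.1, -0.2], [1.0, -2.0], [0.05, 0.05], [-0.5, 1.0]]
alpha = [0.9, -0.02, 0.4, 0.01]

# Baseline: rank structures by the norm of their weights (sum of absolute values).
norm_scores = [sum(abs(w) for w in s) for s in structures]
# Parameterized variant with fixed sparsity: rank by the magnitude of alpha.
alpha_scores = [abs(a) for a in alpha]

pruned_by_norm = lowest_scoring(norm_scores, sparsity=0.5)
pruned_by_alpha = lowest_scoring(alpha_scores, sparsity=0.5)
print(sorted(pruned_by_norm), sorted(pruned_by_alpha))
```

In this toy example the two rules select different structures, since a structure with small weights can still carry a large learned parameter and vice versa.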
Furthermore, we observe that regularized dense structure parameters are able to learn a clear distinction between important and unimportant structures (Sec. 3.3). Thus, it seems appropriate to use a simple threshold heuristic (Eq. 3) rather than pruning all layers equally (as in PSP (fixed sparsity)). We also show the impact of the threshold heuristic in combination with $\ell_1$ regularization (Eq. 7) and weight decay (Eq. 6) in Fig. 3. These methods are denoted as PSP ($\ell_1$ regularization) and PSP (weight decay), respectively. We vary the regularization strength for $\ell_1$ regularization, since it induces sparsity implicitly, while we vary the threshold parameter for weight decay: for PSP ($\ell_1$ regularization), we set the threshold and the initial regularization strength , which is changed by an order of magnitude () to show various sparsity levels. For PSP, we set the regularization strength and the initial threshold and increase by for each sparsity level. Both methods show higher accuracy for high sparsity constraints (sparsity ), but only weight decay achieves baseline accuracy.
4.2 Pruning different structures
Next, we compare the performance of the different structure granularities using DenseNet on CIFAR10 (Table 2, with 40 layers, a growth rate of and a pruning threshold of ). We report the required layers, parameters and Multiply-Accumulate (MAC) operations.
While all structure granularities show a good prediction performance, with slight deviations compared to the baseline error, column- and channel-pruning achieve the highest compression ratios. Shape pruning results in the best accuracy but only at a small compression rate, indicating that a higher pruning threshold is more appropriate. It is worth noticing that PSP is able to automatically remove structures, which can be seen best when comparing layer pruning and a combination of layer and channel pruning: layer pruning removes 12 layers from the network but still requires 0.55M parameters and 0.28G MACs, while the combination of layer and channel pruning removes only 7 layers but requires only 0.48M parameters and 0.24G MACs.
4.3 CIFAR10/100 and ImageNet
To validate the effectiveness of PSP, we now discuss results from ResNet and DenseNet on CIFAR10/100 and ImageNet. We use column pruning throughout this section, as it offers the highest compression rates while preserving classification performance.
Table 3 reports results for CIFAR10/100. As can be seen, PSP maintains classification performance for a variety of networks and datasets. This is due to its ability to self-adapt the pruned structures during training, which can be best seen when changing the network topology or dataset: for instance, when we use the same models on CIFAR10 and the more complex CIFAR100 task, we can see that PSP is able to automatically adapt, as it removes less structure from the network trained on CIFAR100. Furthermore, if we increase the number of layers from 40 to 100, we also increase the over-parameterization of the network and PSP automatically removes more structure.
|NASNet-B (4 @ 1152) DBLP:journals/corr/ZophVSL17||–||2.60M||–||3.73||–||–||–|
The same tendencies can be observed on the large-scale ImageNet task as shown in Table 4; when applying PSP, classification accuracy can be maintained (with some negligible degradation) and a considerable amount of structure can be removed from the networks (e.g. from ResNet18 or from DenseNet121). Furthermore, PSP obviates the need for bottleneck layers, effectively reducing network depth and MACs. For instance, removing the bottleneck layers from the DenseNet121 network in combination with PSP removes parameters, MACs and layers, while only sacrificing top-5 accuracy.
|Model||Layer||Parameters||MACs||Top-1 [%]||Top-5 [%]|
|MobileNetV2 (1.4) DBLP:journals/corr/abs-1801-04381||–||6.9M||0.59G||25.30||–|
|NASNet-A (4 @ 1056) DBLP:journals/corr/ZophVSL17||–||5.3M||0.56G||26.00||8.40|
We also report some results of recently proposed network reduction methods that achieved notable performance on the used datasets (in terms of accuracy, memory and compute requirements): MobileNetV2 DBLP:journals/corr/abs-1801-04381 is a CNN optimized for mobile platforms, which uses, among other optimizations, the popular lightweight depthwise convolutions. NASNet DBLP:journals/corr/ZophVSL17 is the result of a Neural Architecture Search (NAS) algorithm that searches for the best neural network architecture. PSP outperforms these methods in efficiency, using standard networks and requiring only a fraction of the training time of NAS.
4.4 Comparison to related methods
We report a thorough comparison to related structured pruning methods (see Sec. 2 for details) in Table 5. As reported metrics and baseline accuracy vary significantly in the corresponding publications, to allow a fair comparison, we only report the improvement factor and the accuracy gap over the baseline network, where a positive gap (+) represents accuracy degradation and a negative gap (−) accuracy improvement.
|ResNet-56 on CIFAR10: error=6.35%|
|DenseNet-40 on CIFAR10: error=5.80%|
|ResNet-18 on ImageNet: top1 error=29.60%, top5 error=10.52%|
|Top1 error gap||–||–||+2.29||–||–||–||+3.18||+0.77|
|Top5 error gap||–||–||+1.38||–||–||–||+1.85||+0.58|
|ResNet-B-50 on ImageNet: top1 error=23.68%, top5 error=6.85%|
|Top1 error gap||+1.87||–||+1.06||–||–||+0.96||+1.54||+0.39|
|Top5 error gap||+1.12||+1.40||+0.61||–||+0.8||+0.42||+0.81||+0.16|
As can be seen, PSP outperforms other pruning methods substantially in memory, compute requirements, and accuracy. Due to the self-adapting pruning method, PSP achieves less compression on ResNet-B-50 on ImageNet; however, it achieves the best accuracy and is in line with the overarching goals.
We have presented PSP, a novel approach for compressing DNNs through structured pruning, which reduces memory and compute requirements while creating a form of sparsity that is in line with the needs of massively parallel processors. Our approach parameterizes arbitrary structures (e.g. channels or layers) in a weight tensor and uses weight decay to force certain structures towards zero, while clearly discriminating between important and unimportant structures. Combined with threshold-based magnitude pruning and backward approximation, we can remove a large amount of structure while maintaining prediction performance. Experiments using state-of-the-art DNN architectures on real-world tasks show the effectiveness of our approach in comparison to a variety of related methods. As a result, the gap between DNN-based application demand and the capabilities of resource-constrained devices is reduced, while the method remains applicable to a wide range of processors.