Neuron Merging: Compensating for Pruned Neurons
Network pruning is widely used to lighten and accelerate neural network models. Structured network pruning discards the whole neuron or filter, leading to accuracy loss. In this work, we propose a novel concept of neuron merging applicable to both fully connected layers and convolution layers, which compensates for the information loss due to the pruned neurons/filters. Neuron merging starts with decomposing the original weights into two matrices/tensors. One of them becomes the new weights for the current layer, and the other is what we name a scaling matrix, guiding the combination of neurons. If the activation function is ReLU, the scaling matrix can be absorbed into the next layer under certain conditions, compensating for the removed neurons. We also propose a data-free and inexpensive method to decompose the weights by utilizing the cosine similarity between neurons. Compared to the pruned model with the same topology, our merged model better preserves the output feature map of the original model; thus, it maintains the accuracy after pruning without fine-tuning. We demonstrate the effectiveness of our approach over network pruning for various model architectures and datasets. As an example, for VGG-16 on CIFAR-10, we achieve an accuracy of 93.16% while reducing 64% of total parameters, without any fine-tuning. The code can be found here: https://github.com/friendshipkim/neuron-merging
Modern Convolutional Neural Network (CNN) models have shown outstanding performance in many computer vision tasks. However, due to their numerous parameters and computation, it remains challenging to deploy them to mobile phones or edge devices. One of the widely used methods to lighten and accelerate the network is pruning. Network pruning exploits the findings that the network is highly over-parameterized. For example, denil2013predicting demonstrate that a network can be efficiently reconstructed with only a small subset of its original parameters.
Generally, there are two main branches of network pruning. One of them is unstructured pruning, also called weight pruning, which removes individual network connections. han2015learning achieved a compression rate of 90% by pruning weights with small magnitudes and retraining the model. However, unstructured pruning produces sparse weight matrices, which cannot lead to actual speedup and compression without specialized hardware or libraries han2016eie. On the other hand, structured pruning methods eliminate the whole neuron or even the layer of the model, not individual connections. Since structured pruning maintains the original weight structure, no specialized hardware or libraries are necessary for acceleration. The most prevalent structured pruning method for CNN models is to prune filters of each convolution layer and the corresponding output feature map channels. The filter or channel to be removed is determined by various saliency criteria li2016pruning; you2019gate; yu2018nisp.
Regardless of what saliency criterion is used, the corresponding dimension of the pruned neuron is removed from the next layer. Consequently, the output of the next layer will not be fully reconstructed with the remaining neurons. In particular, when the neurons of the front layer are removed, the reconstruction error continues to accumulate, which leads to performance degradation yu2018nisp.
In this paper, we propose neuron merging that compensates for the effect of the removed neuron by merging its corresponding dimension of the next layer. Neuron merging is applicable to both the fully connected and convolution layers, and the overall concept applied to the convolution layer is depicted in Fig. 1. Neuron merging starts with decomposing the original weights into two matrices/tensors. One of them becomes the new weights for the current layer, and the other is what we name a scaling matrix, guiding the process of merging the dimensions of the next layer. If the activation function is ReLU and the scaling matrix satisfies certain conditions, it can be absorbed into the next layer; thus, merging has the same network topology as pruning.
In this formulation, we also propose a simple and data-free method of neuron merging. To form the remaining weights, we utilize well-known pruning criteria (e.g., $\ell_1$-norm li2016pruning). To generate the scaling matrix, we employ the cosine similarity and $\ell_2$-norm ratio between neurons. This method is applicable even when only the pretrained model is given without any training data. Our extensive experiments demonstrate the effectiveness of our approach. For VGG-16 SimonyanZ14a and WideResNet 40-4 BMVC2016_87 on CIFAR-10, we achieve accuracies of 93.16% and 93.3% without any fine-tuning, while reducing 64% and 40% of the total parameters, respectively. Our contributions are as follows:
(1) We propose and formulate a novel concept of neuron merging that compensates for the information loss due to the pruned neurons/filters in both fully connected layers and convolution layers.
(2) We propose a one-shot and data-free method of neuron merging which employs the cosine similarity and ratio between neurons.
(3) We show that our merged model better preserves the original model than the pruned model with various measures, such as the accuracy immediately after pruning, feature map visualization, and Weighted Average Reconstruction Error yu2018nisp.
2 Related Works
A variety of criteria he2018soft; he2019filter; li2016pruning; molchanov2016pruning; you2019gate; yu2018nisp have been proposed to evaluate the importance of a neuron (in the case of a CNN, a filter). However, all of them suffer from a significant accuracy drop immediately after pruning. Therefore, fine-tuning the pruned model often requires as many epochs as training the original model to restore the accuracy to near the original level. Several works liu2017learning; ye2018rethinking add trainable parameters to each feature map channel to obtain data-driven channel sparsity, enabling the model to automatically identify redundant filters. In this case, training the model from scratch is inevitable to obtain the channel sparsity, which is a time- and resource-consuming process.
Among filter pruning works, luo2017thinet and he2017channel have similar motivation to ours, aiming to similarly reconstruct the output feature map of the next layer. luo2017thinet search the subset of filters that have the smallest effect on the output feature map of the next layer. he2017channel propose LASSO regression based channel selection and least square reconstruction of output feature maps. In both papers, data samples are required to obtain feature maps. However, our method is novel in that it compensates for the loss of removed filters in a one-shot and data-free way.
BMVC2015_31 introduce data-free neuron pruning for the fully connected layers by iteratively summing up the coefficients of two similar neurons. Different from BMVC2015_31, neuron merging introduces a different formulation including the scaling matrix to systematically incorporate the ratio of neurons and is applicable to various model structures such as the convolution layer with batch normalization. More recently, myssay2020coreset approximate the output of the next layer by finding the coresets of neurons and discarding the rest.
“Pruning-at-initialization” methods lee2018snip; wang2020picking prune individual connections in advance to save resources at training time. SNIP lee2018snip and GraSP wang2020picking use gradients to measure the importance of connections. In contrast, our approach is applied to structured pruning, so no specialized hardware or libraries are necessary to handle sparse connections. Also, our approach can be adopted even when the model is trained without any consideration of pruning.
Canonical Polyadic (CP) decomposition lebedev2014speeding and Tucker decomposition kim2015compression are widely used to lighten convolution kernel tensors. At first glance, our method is similar to low-rank approximation in that it starts with decomposing the weight matrix/tensor into two parts. Different from low-rank approximation works, we do not retain all decomposed matrices/tensors during inference time. Instead, we combine one of the decomposed matrices with the next layer and achieve the same acceleration as structured network pruning.
First, we mathematically formulate the new concept of neuron merging in the fully connected layer. Then, we show how merging is applied to the convolution layer. In Section 3.3, we introduce one possible data-free method of merging.
3.1 Fully Connected Layer
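For simplicity, we start with the fully connected layer without bias. Let $N_i$ denote the length of the input column vector for the $i$-th fully connected layer. The $i$-th fully connected layer transforms the input vector $x_i \in \mathbb{R}^{N_i}$ into the output vector $x_{i+1} \in \mathbb{R}^{N_{i+1}}$. The network weights of the $i$-th layer are denoted as $W_i \in \mathbb{R}^{N_i \times N_{i+1}}$.
Our goal is to maintain the activation vector of the $(i+1)$-th layer, which is
$$a_{i+1} = W_{i+1}^\top f(W_i^\top x_i), \tag{1}$$
where $f$ is an activation function.
Now, we decompose the weight matrix $W_i$ into two matrices, $Y_i \in \mathbb{R}^{N_i \times P_{i+1}}$ and $Z_i \in \mathbb{R}^{P_{i+1} \times N_{i+1}}$, where $0 < P_{i+1} \le N_{i+1}$. Therefore, $W_i \approx Y_i Z_i$. Then Eq. 1 is approximated as,
$$a_{i+1} \approx W_{i+1}^\top f(Z_i^\top Y_i^\top x_i). \tag{2}$$
The key idea of neuron merging is to combine $Z_i$ and $W_{i+1}$, the weight of the next layer. In order for $Z_i$ to be moved out of the activation function, $f$ should be ReLU and a certain constraint on $Z_i$ is necessary.
Theorem 1. Let $Z \in \mathbb{R}^{P \times N}$, $v \in \mathbb{R}^{P}$. Then, for ReLU $f$,
$$f(Z^\top v) = Z^\top f(v) \quad \text{for all } v \in \mathbb{R}^{P}$$
if and only if $Z$ has only non-negative entries with at most one strictly positive entry per column.
If $f$ is ReLU and $Z_i$ satisfies the condition of Theorem 1, Eq. 2 becomes
$$a_{i+1} \approx W_{i+1}^\top Z_i^\top f(Y_i^\top x_i) = (Z_i W_{i+1})^\top f(Y_i^\top x_i) = (W'_{i+1})^\top f(Y_i^\top x_i), \tag{3}$$
where $W'_{i+1} = Z_i W_{i+1} \in \mathbb{R}^{P_{i+1} \times N_{i+2}}$. As shown in Fig. 2, the number of neurons in the $(i+1)$-th layer is reduced from $N_{i+1}$ to $P_{i+1}$ after merging, so the network topology is identical to that of structured pruning. Therefore, $Y_i$ represents the new weights remaining in the $i$-th layer, and $Z_i$ is the scaling matrix, indicating how to compensate for the removed neurons. We provide the same derivation for the fully connected layer with bias in the Appendix.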
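The merging step for a fully connected layer can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the names `Y`, `Z`, `W_next`, and the layer sizes are illustrative assumptions, and the original weights are built as an exact product of the retained weights and the scaling matrix so that the merged network reproduces the original output exactly.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
n_in, n_mid, n_out, n_keep = 5, 6, 4, 3   # illustrative layer sizes

# Retained neurons of the current layer, one neuron per column.
Y = rng.standard_normal((n_in, n_keep))

# Scaling matrix: non-negative, exactly one positive entry per column.
Z = np.zeros((n_keep, n_mid))
Z[[0, 1, 2], [0, 1, 2]] = 1.0               # kept neurons map to themselves
Z[[0, 1, 1], [3, 4, 5]] = [0.5, 2.0, 1.5]   # pruned neurons are scaled copies

W = Y @ Z                                   # original weights (exact by construction)
W_next = rng.standard_normal((n_mid, n_out))
X = rng.standard_normal((n_in, 100))        # 100 random inputs, one per column

a_original = W_next.T @ relu(W.T @ X)           # original two-layer output
a_merged = (Z @ W_next).T @ relu(Y.T @ X)       # scaling matrix absorbed into next layer
a_pruned = W_next[:n_keep].T @ relu(Y.T @ X)    # plain pruning just drops rows

print("merged error:", np.abs(a_original - a_merged).max())
print("pruned error:", np.abs(a_original - a_pruned).max())
```

In this toy setting the merged output matches the original up to floating-point error, while plain pruning does not. With real weights the decomposition is only approximate, so merging reduces the error rather than removing it.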
3.2 Convolution Layer
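For merging in the convolution layer, we first define two operators for $N$-way tensors.
According to kolda2009tensor, the $n$-mode (matrix) product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with a matrix $U \in \mathbb{R}^{J \times I_n}$ is denoted by $\mathcal{X} \times_n U$ and is of size $I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N$.
Elementwise, we have
$$(\mathcal{X} \times_n U)_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \cdots i_N}\, u_{j i_n}.$$
We define the tensor-wise convolution operator $\circledast$ between a 4-way tensor $\mathcal{W} \in \mathbb{R}^{N \times C \times K \times K}$ and a 3-way tensor $\mathcal{X} \in \mathbb{R}^{C \times H \times W}$. For simple notation, we assume that the stride of convolution is 1. However, this notation can be generalized to other convolution settings.
Elementwise, we have
$$(\mathcal{W} \circledast \mathcal{X})_{\alpha\beta\gamma} = \sum_{c=1}^{C}\sum_{h=1}^{K}\sum_{w=1}^{K} \mathcal{W}_{\alpha c h w}\, \mathcal{X}_{c,\,h+\beta-1,\,w+\gamma-1}.$$
Intuitively, $\mathcal{W} \circledast \mathcal{X}$ denotes the channel-wise concatenation of the output feature map matrices that result from the 3D convolution operation between $\mathcal{X}$ and each filter of $\mathcal{W}$.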
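The two operators can be written in a few lines of NumPy. This is a sketch under stated assumptions: the function names `mode_product` and `tensor_conv` are hypothetical, modes are 0-based in the code (so the 1-mode product in the text corresponds to `mode=0`), and the convolution uses stride 1 with no padding.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """n-mode product: contract axis `mode` (0-based) of `tensor` with the
    second axis of `matrix` (shape J x I_mode); the new axis of size J
    takes the place of the contracted one."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # J ends up as axis 0
    return np.moveaxis(out, 0, mode)

def tensor_conv(filters, fmap):
    """Tensor-wise convolution with stride 1 and no padding.
    filters: (N, C, K, K), fmap: (C, H, W) -> (N, H-K+1, W-K+1)."""
    n, _, k, _ = filters.shape
    _, h, w = fmap.shape
    out = np.empty((n, h - k + 1, w - k + 1))
    for b in range(h - k + 1):
        for g in range(w - k + 1):
            patch = fmap[:, b:b + k, g:g + k]            # (C, K, K) window
            out[:, b, g] = np.tensordot(filters, patch, axes=3)
    return out

T = np.arange(24, dtype=float).reshape(2, 3, 4)
U = np.ones((5, 3))
print(mode_product(T, U, 1).shape)        # size along axis 1 changes from 3 to 5

ones_filter = np.ones((1, 1, 2, 2))
fmap = np.arange(9, dtype=float).reshape(1, 3, 3)
print(tensor_conv(ones_filter, fmap)[0])  # sums over each 2x2 window
```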
Merging the convolution layer
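Now we extend neuron merging to the convolution layer. Similar to the fully connected layer, let $N_i$ and $N_{i+1}$ denote the number of input and output channels of the $i$-th convolution layer. The $i$-th convolution layer transforms the input feature map $\mathcal{X}_i \in \mathbb{R}^{N_i \times H_i \times W_i}$ into the output feature map $\mathcal{X}_{i+1} \in \mathbb{R}^{N_{i+1} \times H_{i+1} \times W_{i+1}}$. The filter weights of the $i$-th layer are denoted as $\mathcal{W}_i \in \mathbb{R}^{N_{i+1} \times N_i \times K \times K}$, which consists of $N_{i+1}$ filters.
Our goal is to maintain the activation feature map of the $(i+1)$-th layer, which is
$$\mathcal{A}_{i+1} = \mathcal{W}_{i+1} \circledast f(\mathcal{W}_i \circledast \mathcal{X}_i). \tag{4}$$
We decompose the 4-way tensor $\mathcal{W}_i$ into a matrix $Z_i \in \mathbb{R}^{P_{i+1} \times N_{i+1}}$ and a 4-way tensor $\mathcal{Y}_i \in \mathbb{R}^{P_{i+1} \times N_i \times K \times K}$. Therefore,
$$\mathcal{W}_i \approx \mathcal{Y}_i \times_1 Z_i^\top. \tag{5}$$
Then Eq. 4 is approximated as,
$$\mathcal{A}_{i+1} \approx \mathcal{W}_{i+1} \circledast f\big((\mathcal{Y}_i \times_1 Z_i^\top) \circledast \mathcal{X}_i\big) \tag{6}$$
$$\phantom{\mathcal{A}_{i+1}} = \mathcal{W}_{i+1} \circledast f\big((\mathcal{Y}_i \circledast \mathcal{X}_i) \times_1 Z_i^\top\big). \tag{6a}$$
The key idea of neuron merging is to combine $Z_i$ and $\mathcal{W}_{i+1}$, the weight of the next layer. If $f$ is ReLU, we can extend Theorem 1 to a 1-mode product of a tensor.
Corollary 1.1. Let $Z \in \mathbb{R}^{P \times N}$, $\mathcal{X} \in \mathbb{R}^{P \times H \times W}$. Then, for ReLU $f$,
$$f(\mathcal{X} \times_1 Z^\top) = f(\mathcal{X}) \times_1 Z^\top \quad \text{for all } \mathcal{X} \in \mathbb{R}^{P \times H \times W}$$
if and only if $Z$ has only non-negative entries with at most one strictly positive entry per column.
If $f$ is ReLU and $Z_i$ satisfies the condition of Corollary 1.1,
$$\mathcal{A}_{i+1} \approx \mathcal{W}_{i+1} \circledast \big(f(\mathcal{Y}_i \circledast \mathcal{X}_i) \times_1 Z_i^\top\big) \tag{7}$$
$$\phantom{\mathcal{A}_{i+1}} = (\mathcal{W}_{i+1} \times_2 Z_i) \circledast f(\mathcal{Y}_i \circledast \mathcal{X}_i) \tag{7a}$$
$$\phantom{\mathcal{A}_{i+1}} = \mathcal{W}'_{i+1} \circledast f(\mathcal{Y}_i \circledast \mathcal{X}_i), \tag{7b}$$
where $\mathcal{W}'_{i+1} = \mathcal{W}_{i+1} \times_2 Z_i \in \mathbb{R}^{N_{i+2} \times P_{i+1} \times K \times K}$. See the Appendix for proofs of Corollary 1.1, Eq. 6a, and 7a. After merging, the number of filters in the $i$-th convolution layer is reduced from $N_{i+1}$ to $P_{i+1}$, so the network topology is identical to that of structured pruning. As $Z_i$ is merged with the weights of the $(i+1)$-th layer, the pruned dimensions are absorbed into the remaining ones, as shown in Fig. 1.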
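The convolutional case can be verified the same way as the fully connected one. The sketch below is illustrative only: `conv`, the tensor shapes, and the hand-built scaling matrix `Z` are assumptions, the convolution is stride 1 without padding, and the original filters are constructed as an exact product so the merged output matches exactly.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def relu(x):
    return np.maximum(x, 0.0)

def conv(filters, fmap):
    """Stride-1, unpadded convolution: (N, C, K, K) with (C, H, W)."""
    k = filters.shape[-1]
    windows = sliding_window_view(fmap, (k, k), axis=(1, 2))  # (C, H', W', K, K)
    return np.einsum('nchw,cbghw->nbg', filters, windows)

rng = np.random.default_rng(0)
c_in, c_mid, c_out, c_keep, k = 3, 6, 4, 3, 3   # illustrative channel counts

Y = rng.standard_normal((c_keep, c_in, k, k))   # retained filters
Z = np.zeros((c_keep, c_mid))                   # one positive entry per column
Z[[0, 1, 2], [0, 1, 2]] = 1.0
Z[[0, 1, 1], [3, 4, 5]] = [0.5, 2.0, 1.5]

W = np.einsum('pchw,pn->nchw', Y, Z)            # 1-mode product of Y with Z^T
W_next = rng.standard_normal((c_out, c_mid, k, k))
X = rng.standard_normal((c_in, 8, 8))

A_original = conv(W_next, relu(conv(W, X)))
W_next_merged = np.einsum('achw,pc->aphw', W_next, Z)   # 2-mode product with Z
A_merged = conv(W_next_merged, relu(conv(Y, X)))

print(W_next_merged.shape, np.abs(A_original - A_merged).max())
```

The merged next-layer tensor has only `c_keep` input channels, which is the same shape that structured pruning would produce.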
3.3 Proposed Algorithm
The overall process of neuron merging is as follows. First, we decompose the weights into two parts. $Y_i$ represents the new weights remaining in the $i$-th layer, and $Z_i$ is the scaling matrix. After the decomposition, $Z_i$ is combined with the weights of the next layer, as described in Sections 3.1 and 3.2. Therefore, the actual compensation takes place by merging the dimensions of the next layer. The corresponding dimension of a pruned neuron is multiplied by a positive number and then added to that of the retained neuron.
Now we propose a simple one-shot method to decompose the weight matrix/tensor into two parts. First, we select the most useful neurons to form $Y_i$. We can utilize any pruning criterion. Then, we generate $Z_i$ by selecting the most similar remaining neuron for each pruned neuron and measuring the ratio between them. Algorithm 1 describes the overall procedure of decomposition for the case of one-dimensional neurons in a fully connected layer. The same algorithm is applied to the convolution filters after reshaping each three-dimensional filter tensor to a one-dimensional vector.
According to Theorem 1, if a pruned neuron can be expressed as a positive multiple of a remaining one, we can remove and compensate for it without causing any loss in the output vector. This gives us an important insight into the criterion for determining similar neurons: direction, not absolute distance. Therefore, we employ the cosine similarity to select similar neurons. Algorithm 2 demonstrates selecting the neuron most similar to the given one and obtaining the scale between them. We set the scale value as the $\ell_2$-norm ratio of the two neurons. The scale value indicates how much to compensate for the removed neuron in the following layer.
Here we introduce a hyperparameter $t$; we compensate only when the similarity between the two neurons is above $t$. If $t$ is -1, all pruned neurons are compensated for, and the number of compensated neurons decreases as $t$ approaches 1. If none of the removed neurons is compensated for, the result is exactly the same as vanilla pruning. In other words, pruning can be considered as a special case of neuron merging.
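The decomposition described above can be sketched as follows. This is an illustrative reading of the prose, not the authors' Algorithm 1 or 2: the function name `decompose`, the choice of the largest $\ell_1$-norm as the pruning criterion, and the small constant guarding against zero norms are all assumptions made for the example.

```python
import numpy as np

def decompose(W, num_keep, threshold):
    """Split W (one neuron per column) into retained weights Y and a
    scaling matrix Z such that W is approximately Y @ Z."""
    col_l2 = np.linalg.norm(W, axis=0)
    # Assumed pruning criterion: keep the columns with the largest l1-norm.
    keep = np.sort(np.argsort(-np.abs(W).sum(axis=0), kind="stable")[:num_keep])
    row_of = {int(n): p for p, n in enumerate(keep)}

    Y = W[:, keep]
    unit = W / np.maximum(col_l2, 1e-12)
    cos = unit[:, keep].T @ unit            # cos[p, n]: retained p vs. neuron n

    Z = np.zeros((num_keep, W.shape[1]))
    for n in range(W.shape[1]):
        if n in row_of:                     # retained neuron maps to itself
            Z[row_of[n], n] = 1.0
            continue
        p = int(np.argmax(cos[:, n]))       # most similar retained neuron
        if cos[p, n] >= threshold:          # compensate only above the threshold
            Z[p, n] = col_l2[n] / max(col_l2[keep[p]], 1e-12)
    return Y, Z

W = np.array([[1.0, 0.0, 3.0],
              [0.0, 2.0, 0.0]])
Y, Z = decompose(W, num_keep=2, threshold=0.5)
print(Z)   # column 0 is a scaled copy of retained column 2
```

Because each column of `Z` receives at most one positive entry, the result satisfies the condition of Theorem 1 by construction. Raising `threshold` above the largest similarity leaves the pruned columns of `Z` at zero, which reduces to plain pruning.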
Batch normalization layer
For modern CNN architectures, batch normalization ioffe2015batch is widely used to prevent an internal covariate shift. If batch normalization is applied after a convolution layer, the output feature map channels of two identical filters could be different. Therefore, we introduce an additional term to consider when selecting the most similar filter.
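Let $\mathcal{X} \in \mathbb{R}^{N \times H \times W}$ denote the output feature map of a convolution layer, and $\mathcal{X}^{BN}$ denote $\mathcal{X}$ after a batch normalization layer. The batch normalization layer contains four types of parameters, $\gamma, \beta, \mu, \sigma \in \mathbb{R}^{N}$.
For simplicity, we consider the element-wise scale of two feature maps. Let $x_1 = \mathcal{X}_{1,1,1}$, $x_2 = \mathcal{X}_{2,1,1}$, $x_1^{BN} = \mathcal{X}^{BN}_{1,1,1}$, $x_2^{BN} = \mathcal{X}^{BN}_{2,1,1}$. Let $s$ denote the $\ell_2$-norm ratio of $x_1$ and $x_2$, so that $x_2 = s\,x_1$. Assuming that they have the same direction, the relationship between $x_1^{BN}$ and $x_2^{BN}$ is as follows:
$$x_1^{BN} = \gamma_1 \frac{x_1 - \mu_1}{\sigma_1} + \beta_1, \qquad x_2^{BN} = \gamma_2 \frac{x_2 - \mu_2}{\sigma_2} + \beta_2,$$
$$x_2^{BN} = S\, x_1^{BN} + B, \qquad S := s\,\frac{\gamma_2 \sigma_1}{\gamma_1 \sigma_2}, \qquad B := \gamma_2 \frac{s\,\mu_1 - \mu_2}{\sigma_2} + \beta_2 - S\,\beta_1. \tag{8}$$
According to Eq. 8, if $B$ is 0, the ratio of $x_2^{BN}$ to $x_1^{BN}$ is exactly $S$. Therefore, we select the filter that simultaneously minimizes the cosine distance ($1 - \text{cosine similarity}$) and the bias distance (a measure of the magnitude of $B$) and then use $S$ as the scale. We normalize the bias distance between 0 and 1. The overall selection procedure for a convolution layer with batch normalization is described in Algorithm 3. The input includes the $n$-th filter of the convolution layer, denoted as $\mathcal{F}_n$. A hyperparameter $\lambda$ is employed to control the ratio between the cosine distance and the bias distance.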
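The effect of batch normalization on the scale between two channels can be worked out directly from the batch normalization formula. The sketch below is an assumption-laden illustration: the helper names `batch_norm` and `bn_scale_and_bias` are hypothetical, `sigma` is taken to be the per-channel standard deviation (in practice the square root of the running variance plus a small epsilon), and channel 0 plays the role of the retained filter while channel 1 is the pruned one.

```python
import numpy as np

def batch_norm(x, k, gamma, beta, mu, sigma):
    """Apply the batch normalization of channel k to a scalar response x."""
    return gamma[k] * (x - mu[k]) / sigma[k] + beta[k]

def bn_scale_and_bias(s, kept, pruned, gamma, beta, mu, sigma):
    """Given x_pruned = s * x_kept before batch normalization, return (S, B)
    with bn(x_pruned) = S * bn(x_kept) + B after it."""
    S = s * gamma[pruned] * sigma[kept] / (gamma[kept] * sigma[pruned])
    B = (gamma[pruned] * (s * mu[kept] - mu[pruned]) / sigma[pruned]
         + beta[pruned] - S * beta[kept])
    return S, B

gamma = np.array([1.2, 0.8])
beta = np.array([0.1, -0.3])
mu = np.array([0.5, 0.9])
sigma = np.array([2.0, 1.5])

s, x_kept = 2.0, 0.7
x_pruned = s * x_kept
S, B = bn_scale_and_bias(s, 0, 1, gamma, beta, mu, sigma)

lhs = batch_norm(x_pruned, 1, gamma, beta, mu, sigma)
rhs = S * batch_norm(x_kept, 0, gamma, beta, mu, sigma) + B
print(S, B, np.isclose(lhs, rhs))
```

The offset `B` is the part that a pure scaling of the next layer's weights cannot absorb, which is why the selection step prefers filters for which it is small.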
Neuron merging aims to preserve the original model by maintaining the scale of the output feature map better than network pruning. To validate this, we compare the initial accuracy, feature map visualization, and Weighted Average Reconstruction Error yu2018nisp of image classification, without fine-tuning.
We evaluate the proposed approach with several popular models, which are LeNet lecun1998gradient, VGG SimonyanZ14a, ResNet he2016deep, and WideResNet BMVC2016_87, on FashionMNIST xiao2017fashion, CIFAR hinton2007learning, and ImageNet.
To train the baseline models, we employ SGD with a momentum of 0.9. The learning rate starts at 0.1, with different annealing strategies per model. For LeNet, the learning rate is reduced to one-tenth every 15 epochs over a total of 60 epochs. Weight decay is set to 1e-4, and batch size to 128. For VGG and ResNet, the learning rate is reduced to one-tenth at epochs 100 and 150 of the total 200 epochs. Weight decay is set to 5e-4, and batch size to 128. Weights are randomly initialized before the training. To preprocess FashionMNIST images, each one is normalized with a mean and standard deviation of 0.5; for CIFAR, we follow the setting in he2019filter.
In Section 3.3, we introduced two hyperparameters for neuron merging: $t$ and $\lambda$. For $t$, we use 0.45 for LeNet, and 0.1 for other convolution models. For $\lambda$, values between 0.7 and 0.9 generally give stable performance. Specifically, we use 0.85 for VGG and ResNet on CIFAR-10, 0.8 for WideResNet on CIFAR-10, and 0.7 for VGG-16 on CIFAR-100.
We test neuron merging with three structured pruning criteria: 1) ‘$\ell_1$-norm’ proposed in li2016pruning; 2) ‘$\ell_2$-norm’ proposed in he2018soft; and 3) ‘$\ell_2$-GM’ proposed in he2019filter, referring to pruning filters with a small distance from the geometric median. These methods were originally proposed for convolution filters but can be applied to the neurons in fully connected layers. Among various pruning criteria, these methods have the top-level initial accuracy. In accordance with the data-free characteristic of our method, we exclude pruning methods that require feature maps or data loss in filter scoring.
4.1 Initial Accuracy of Image Classification
The results of LeNet-300-100 with bias on FashionMNIST are presented in Table 1. The number of neurons in each layer is reduced in proportion to the pruning ratio. As shown in Table 1, the pruned model’s performance deteriorates as more neurons are pruned. However, if the removed neurons are compensated for with merging, the performance improves in all cases. Accuracy gain is more prominent as the pruning ratio increases. For example, when the pruning ratio is 80%, the merging recovers more than 13% of accuracy compared to the pruning.
|Pruning Ratio||Baseline Acc.||$\ell_1$-norm||$\ell_2$-norm||$\ell_2$-GM|
We test neuron merging for VGG-16 on CIFAR datasets. As described in Table 2, the merging shows an impressive accuracy recovery on both datasets. For CIFAR-10, we adopt the pruning strategy from PFEC li2016pruning, pruning half of the filters in the first convolution layer and the last six convolution layers. Compared to the baseline model, the accuracy after pruning drops by 5% on average with a parameter reduction of 63%. On the other hand, merging improves the accuracy to a near-baseline level for all three pruning criteria, showing a mere 0.6% drop at most.
For CIFAR-100, we slightly modified the pruning strategy of PFEC. In addition to the first convolution layer, we prune only the last three, not six, convolution layers. With this strategy, we can still reduce 44.1% of total parameters. Similar to CIFAR-10, the merging recovers about 4% of the performance deterioration caused by the pruning. On CIFAR-100, the accuracy drop compared to the baseline was about 1% greater than on CIFAR-10. This seems to be because the filter redundancy decreases as the target label diversifies. Interestingly, the accuracy gain of merging is most prominent in the ‘$\ell_2$-GM’ he2019filter criterion, and the final accuracy is also the highest.
|Dataset||Criterion||Baseline Acc. (B)||Initial Acc.||B-M||Param. (#)|
We also test our neuron merging for ResNet-56 and WideResNet-40-4, on CIFAR-10. We additionally adopt WideResNet-40-4 to examine the effect of merging with extra channel redundancy. To avoid the misalignment of feature map in the shortcut connection, we only prune the internal layers of the residual blocks as in li2016pruning; luo2017thinet. We carry out experiments on four different pruning ratios: 20%, 30%, 40%, and 50%. The pruning ratio refers to how many filters are pruned in each internal convolution layer.
As shown in Fig. 3, ResNet-56 noticeably suffers from performance deterioration in all pruning cases because of its narrow structure. However, the merging increases the accuracy in all cases. As the pruning ratio increases, merging exhibits a more prominent recovery. When the pruning ratio is 50%, merging restores accuracy by more than 30%. Since, structurally, ResNet has insufficient channels to reuse, the merging alone has limits in recovery. After fine-tuning, both the pruned and merged models reach comparable accuracy. Interestingly, the ‘$\ell_2$-GM’ criterion shows a more significant accuracy drop than other norm-based criteria after pruning and merging. On the other hand, for WideResNet, the three pruning criteria show a similar trend in accuracy drop. As the pruning ratio increases, the accuracy of merging falls more gradually than that of pruning. Since the number of compensable channels increases in WideResNet, the accuracy after merging is closer to the baseline accuracy than for ResNet. Even after removing 50% of the filters, the merging only shows an accuracy loss of less than 5%, which is 20% better than pruning.
4.2 Feature Map Reconstruction of Neuron Merging
To further validate that merging better preserves the original feature maps than pruning, we make use of two types of measures, namely feature map visualization and Weighted Average Reconstruction Error. We visualize the output feature map of the last residual block in WideResNet-40-4 on CIFAR-10. Fifty percent of the total filters are pruned with the $\ell_1$-norm criterion. Feature maps are resized in the same way as zhou2016learning. As shown in Fig. 4, while the original model captures the coarse-grained area of the object, the pruned model produces noisy and divergent feature maps. However, the feature maps of our merged model are very similar to those of the original model. Although the heated regions are slightly blurrier than in the original model, the merged model accurately detects the object area.
Weighted Average Reconstruction Error (WARE) is proposed in yu2018nisp to measure the change of the important neurons’ responses on the final response layer after pruning (without fine-tuning). The final response layer refers to the second-to-last layer before classification. WARE is defined as
$$\text{WARE} = \frac{\sum_{m=1}^{M}\sum_{i=1}^{N} s_i \cdot \frac{|\hat{y}_{i,m} - y_{i,m}|}{|y_{i,m}|}}{M \cdot N}, \tag{9}$$
where $M$ and $N$ represent the number of samples and number of retained neurons in the final response layer, respectively; $s_i$ is the importance score of the $i$-th neuron; and $y_{i,m}$ and $\hat{y}_{i,m}$ are the responses on the $m$-th sample of the $i$-th neuron before/after pruning.
Neuron importance scores ($s_i$) are set to 1 to reflect the effect of all neurons equally. Therefore, the lower the WARE is, the more the network output (i.e., logit values) is similar to that of the original. We measure the WARE of all three kinds of models presented in Section 4.1 on CIFAR-10. Our merged model has lower WARE than the pruned model in all cases. Similar to the initial accuracy, the WARE drops considerably as the pruning ratio increases. We provide a detailed result in Table 3. Through these experiments, we can validate that neuron merging compensates well for the removed neurons and approximates the output feature map of the original model.
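The WARE measure is straightforward to compute from two response matrices. The following is a minimal sketch, assuming responses are stored as arrays of shape (samples, neurons); the function name `ware` and the small `eps` guard against division by zero are additions for this example and are not part of the original definition.

```python
import numpy as np

def ware(y_original, y_compressed, scores=None, eps=1e-12):
    """Weighted Average Reconstruction Error.
    y_original, y_compressed: (M samples, N neurons) responses on the final
    response layer before and after pruning or merging."""
    y_original = np.asarray(y_original, dtype=float)
    y_compressed = np.asarray(y_compressed, dtype=float)
    if scores is None:
        scores = np.ones(y_original.shape[1])   # all neurons weighted equally
    relative = np.abs(y_compressed - y_original) / np.maximum(np.abs(y_original), eps)
    return float((relative * scores).sum() / relative.size)

y_before = np.array([[1.0, 2.0], [4.0, 5.0]])
y_after = np.array([[1.0, 1.0], [2.0, 5.0]])
print(ware(y_before, y_after))   # two of four responses are off by 50%
```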
In this paper, we propose and formulate a novel concept of neuron merging that compensates for the accuracy loss of the pruned neurons. Our one-shot and data-free method better reconstructs the output feature maps of the original model than vanilla pruning. To demonstrate the effectiveness of merging over network pruning, we compare the initial accuracy, WARE, and feature map visualization on image-classification tasks. It is worth noting that decomposing the weights can be varied in the neuron merging formulation. We will explore the possibility of improving the decomposition algorithm. Furthermore, we plan to generalize the neuron merging formulation to more diverse activation functions and model architectures.
This research was the result of a study on the “HPC Support” Project, supported by the ‘Ministry of Science and ICT’ and NIPA. This work was also supported by Korea Institute of Science and Technology (KIST) under the project “HERO Part 1: Development of core technology of ambient intelligence for proactive service in digital in-home care.”
6.1 Fully Connected Layer with Bias
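The overall derivation is the same as Section 3.1. The difference is that we decompose the weights after concatenating the bias vector at the end of the weight matrix. Let $W_i \in \mathbb{R}^{N_i \times N_{i+1}}$, and $b_i \in \mathbb{R}^{N_{i+1}}$.
Our goal is to maintain the activation vector of the $(i+1)$-th layer, which is
$$a_{i+1} = W_{i+1}^\top f(W_i^\top x_i + b_i) + b_{i+1}, \tag{10}$$
where $f$ is an activation function. Then, we decompose $\begin{bmatrix} W_i \\ b_i^\top \end{bmatrix}$ into two matrices, $Y_i \in \mathbb{R}^{(N_i+1) \times P_{i+1}}$, and $Z_i \in \mathbb{R}^{P_{i+1} \times N_{i+1}}$, where $0 < P_{i+1} \le N_{i+1}$. Therefore, $\begin{bmatrix} W_i \\ b_i^\top \end{bmatrix} \approx Y_i Z_i$. Then Eq. 10 is approximated as,
$$a_{i+1} \approx W_{i+1}^\top f\!\left(Z_i^\top Y_i^\top \begin{bmatrix} x_i \\ 1 \end{bmatrix}\right) + b_{i+1}. \tag{11}$$
If $f$ is ReLU and $Z_i$ satisfies the condition of Theorem 1,
$$a_{i+1} \approx (Z_i W_{i+1})^\top f\!\left(Y_i^\top \begin{bmatrix} x_i \\ 1 \end{bmatrix}\right) + b_{i+1} = (W'_{i+1})^\top f({W'_i}^\top x_i + b'_i) + b_{i+1}, \tag{12}$$
where $Y_i = \begin{bmatrix} W'_i \\ {b'_i}^\top \end{bmatrix}$ and $W'_{i+1} = Z_i W_{i+1}$. $W'_i$ and $b'_i$ denote the new weight and bias for the merged model, respectively. After merging, the bias vector is detached from the weight matrix as in the original model. Therefore, the number of neurons in the $(i+1)$-th layer is reduced from $N_{i+1}$ to $P_{i+1}$, and the corresponding entries of the bias vector are removed as well.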
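The bias handling amounts to treating the bias as one extra row of the weight matrix. A small numerical sketch, with hypothetical names and sizes and an exact decomposition built by hand, illustrates that the merged layer with its detached bias reproduces the original output:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
n_in, n_mid, n_out, n_keep = 4, 5, 3, 2   # illustrative layer sizes

# Retained weights with the bias stored in the last row.
Y = rng.standard_normal((n_in + 1, n_keep))
# Hand-built scaling matrix: one positive entry per column.
Z = np.array([[1.0, 0.0, 2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.5, 3.0]])

WB = Y @ Z                        # original weights stacked on the original bias
W, b = WB[:-1], WB[-1]
W_next = rng.standard_normal((n_mid, n_out))
b_next = rng.standard_normal(n_out)
x = rng.standard_normal(n_in)

a_original = W_next.T @ relu(W.T @ x + b) + b_next

W_new, b_new = Y[:-1], Y[-1]      # detach the bias row again after merging
a_merged = (Z @ W_next).T @ relu(W_new.T @ x + b_new) + b_next

print(np.abs(a_original - a_merged).max())
```

Note that a pruned neuron is an exact positive multiple of a retained one here only because its bias is scaled by the same factor as its weights; this is what decomposing the concatenated matrix enforces.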
6.2 Proof of Theorem 1
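Let $z_{pn}$ denote the $(p, n)$ element of $Z$, and $v_p$ denote the $p$-th element of $v$. We also denote the $n$-th column vector of $Z$ as $z_n$. Since $f$ is ReLU, $f(Z^\top v) = Z^\top f(v)$ holds for all $v$ if and only if, for every column $n$ and every $v$,
$$\max(0,\, z_n^\top v) = \sum_{p=1}^{P} z_{pn} \max(0, v_p). \tag{13a}$$
Because the left-hand side of Eq. 13a is non-negative, Eq. 13a requires
$$\sum_{p=1}^{P} z_{pn} \max(0, v_p) \ge 0. \tag{13b}$$
Eq. 13b is satisfied for all $v$ if and only if all the entries of $Z$ are non-negative.
Claim 1.1. If $Z$ has only non-negative entries with at most one strictly positive entry per column, then Eq. 13a also holds.
Proof of Claim 1.1.
Let us define $k_n$ as the row-index of the strictly positive entry in $z_n$, or 1 if $z_n = 0$. Then
$$\max(0,\, z_n^\top v) = \max(0,\, z_{k_n n} v_{k_n}) = z_{k_n n} \max(0, v_{k_n}) = \sum_{p=1}^{P} z_{pn} \max(0, v_p).$$
We used the fact that $z_{k_n n}$ is non-negative in the above equation.
Claim 1.2. If there exists a column with more than one strictly positive entry, then Eq. 13a does not hold in general.
Proof of Claim 1.2.
Without loss of generality, say $z_1$ has positive entries, $z_{p1} > 0$, where $p \le m$ with $m \ge 2$, and 0 otherwise. Also, we can assume that $z_{11} \ge z_{21}$. Suppose one $v$ which is,
$$v = (-1,\, 1,\, 0,\, \ldots,\, 0)^\top.$$
Then the first entry of $f(Z^\top v)$ is equal to $\max(0,\, -z_{11} + z_{21}) = 0$. However, the first entry of $Z^\top f(v)$ is equal to $z_{21}$, which is not zero. Therefore, $f(Z^\top v) \ne Z^\top f(v)$.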
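Both directions of the theorem are easy to probe numerically. The sketch below uses hypothetical example matrices and a random-sampling helper `commutes`; passing the random check is only evidence for the identity, whereas a single failing vector is a genuine counterexample.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def commutes(Z, trials=2000, seed=0):
    """Check relu(Z^T v) == Z^T relu(v) on random vectors v (columns of V)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((Z.shape[0], trials))
    return bool(np.allclose(relu(Z.T @ V), Z.T @ relu(V)))

Z_valid = np.array([[1.0, 0.5, 0.0],
                    [0.0, 0.0, 2.0]])      # one positive entry per column
Z_two_positive = np.array([[2.0], [1.0]])  # two positive entries in one column
Z_negative = np.array([[-1.0], [0.0]])     # a negative entry

print(commutes(Z_valid), commutes(Z_two_positive), commutes(Z_negative))

# Explicit counterexample for the two-positive-entry column.
v = np.array([-1.0, 1.0])
print(relu(Z_two_positive.T @ v), Z_two_positive.T @ relu(v))
```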
6.3 Proof of Corollary 1.1
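According to kolda2009tensor, the definition of the $n$-mode product is multiplying each mode-$n$ fiber of tensor $\mathcal{X}$ by the matrix $U$. The idea can also be expressed in terms of unfolded tensors:
$$\mathcal{T} = \mathcal{X} \times_n U \iff T_{(n)} = U X_{(n)},$$
where $X_{(n)}$ denotes the mode-$n$ matricization of tensor $\mathcal{X}$. Taking $n = 1$ and $U = Z^\top$, the statement of Corollary 1.1 becomes $f(Z^\top X_{(1)}) = Z^\top f(X_{(1)})$, which is Theorem 1 applied to each column of $X_{(1)}$.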
6.4 Proof of Equation 6a
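For simple notation, the subscript $i$ is omitted. Expanding both operators elementwise,
$$\big((\mathcal{Y} \times_1 Z^\top) \circledast \mathcal{X}\big)_{\alpha\beta\gamma} = \sum_{c}\sum_{h=1}^{K}\sum_{w=1}^{K} \Big(\sum_{p=1}^{P} \mathcal{Y}_{p c h w}\, z_{p\alpha}\Big) \mathcal{X}_{c,\,h+\beta-1,\,w+\gamma-1} = \sum_{p=1}^{P} z_{p\alpha}\, (\mathcal{Y} \circledast \mathcal{X})_{p\beta\gamma} = \big((\mathcal{Y} \circledast \mathcal{X}) \times_1 Z^\top\big)_{\alpha\beta\gamma}.$$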
6.5 Proof of Equation 7a
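For simple notation, let $\mathcal{W} = \mathcal{W}_{i+1}$, and the subscript $i$ is omitted. Also, let $\mathcal{A} = f(\mathcal{Y} \circledast \mathcal{X})$. Expanding both operators elementwise,
$$\big(\mathcal{W} \circledast (\mathcal{A} \times_1 Z^\top)\big)_{\alpha\beta\gamma} = \sum_{c}\sum_{h=1}^{K}\sum_{w=1}^{K} \mathcal{W}_{\alpha c h w} \sum_{p=1}^{P} \mathcal{A}_{p,\,h+\beta-1,\,w+\gamma-1}\, z_{pc} = \sum_{p=1}^{P}\sum_{h=1}^{K}\sum_{w=1}^{K} (\mathcal{W} \times_2 Z)_{\alpha p h w}\, \mathcal{A}_{p,\,h+\beta-1,\,w+\gamma-1} = \big((\mathcal{W} \times_2 Z) \circledast \mathcal{A}\big)_{\alpha\beta\gamma}.$$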
6.6 Image Classification Results on ImageNet
In Table 4, we present the test results of VGG-16 and ResNet-34 on ImageNet. We prune only the last convolution layer of VGG-16 as most of the parameters come from fully connected layers. For ResNet-34, we prune all convolution layers in equal proportion. Due to the large scale of the dataset, the initial accuracy right after the pruning drops rapidly as the pruning ratio increases. However, our merging recovers the accuracy in all cases, showing our idea is also effective even for large-scale datasets like ImageNet.
|Pruning Ratio||Criterion||Top 1 Acc.||Top 5 Acc.||Param. #|
6.7 Effect of Hyperparameter
In this section, we analyze the effect of the hyperparameter $t$ in the case of ResNet-56. The average cosine similarity between filters of ResNet is lower than that of over-parameterized models. Thus, it is more sensitive to the hyperparameter $t$, which is used as the minimum cosine similarity threshold of compensated filters.
Fig. 5(a) shows the distribution of the maximum cosine similarity, which is the value between each filter and the nearest one. The variance and the median value of the maximum cosine similarity tend to decrease toward the back layers of ResNet-56. In the back layers, the cosine similarity values are mostly distributed between 0.1 and 0.3. This level of cosine similarity might seem too low to be meaningful. Nevertheless, the highest accuracy is obtained when all the filters with the cosine similarity over 0.15 are compensated for, as shown in Fig. 5(b). This trend appears in all three pruning ratios. As the pruning ratio increases, both the accuracy gain and fluctuation are more prominent.
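The maximum cosine similarity discussed above can be computed per layer from the filter weights alone. A minimal sketch follows; the function name `max_cosine_similarity` and the toy filter bank are assumptions for illustration, and each filter is flattened to a vector before comparison.

```python
import numpy as np

def max_cosine_similarity(filters):
    """For each filter in a (N, C, K, K) bank, return the cosine similarity
    to its nearest other filter in the same layer."""
    flat = filters.reshape(filters.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)        # ignore self-similarity
    return sim.max(axis=1)

toy_filters = np.array([[1.0, 0.0],
                        [2.0, 0.0],
                        [0.0, 1.0]]).reshape(3, 1, 1, 2)
sims = max_cosine_similarity(toy_filters)
print(sims)                   # nearest-neighbor similarity for each filter
print((sims >= 0.15).mean())  # fraction of filters above a threshold of 0.15
```

Inspecting this distribution for each layer gives a rough idea of how many filters would pass a given similarity threshold before any pruning is run.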
- Test results on ImageNet are provided in the Appendix.