# RANP: Resource Aware Neuron Pruning at Initialization for 3D CNNs

## Abstract

Although 3D Convolutional Neural Networks (CNNs) are essential for most learning based applications involving dense 3D data, their applicability is limited due to excessive memory and computational requirements. Compressing such networks by pruning therefore becomes highly desirable. However, pruning 3D CNNs is largely unexplored, possibly because of the complex nature of typical pruning algorithms that embed pruning into an iterative optimization paradigm. In this work, we introduce a Resource Aware Neuron Pruning (RANP) algorithm that prunes 3D CNNs at initialization to high sparsity levels. Specifically, the core idea is to obtain an importance score for each neuron based on its sensitivity to the loss function. This neuron importance is then reweighted according to the neuron resource consumption related to FLOPs or memory. We demonstrate the effectiveness of our pruning method on 3D semantic segmentation with widely used 3D-UNets on ShapeNet and BraTS’18 as well as on video classification with MobileNetV2 and I3D on the UCF101 dataset. In these experiments, our RANP leads to roughly 50%-95% reduction in FLOPs and 35%-80% reduction in memory with negligible loss in accuracy compared to the unpruned networks. This significantly reduces the computational resources required to train 3D CNNs. The pruned network obtained by our algorithm can also be easily scaled up and transferred to another dataset for training.

## 1 Introduction

3D image analysis is important in various real-world applications including scene understanding [1, 2], object recognition [3, 4], medical image analysis [5, 6, 7], and video action recognition [8, 9]. Typically, sparse 3D data can be represented using point clouds [10] whereas volumetric representation is required for dense 3D data which arises in domains such as medical imaging [11] and video segmentation and classification [1, 8, 9]. While efficient neural network architectures can be designed for sparse point cloud data [12, 13], conventional dense 3D Convolutional Neural Network (CNN) is required for volumetric data. Such 3D CNNs are computationally expensive with excessive memory requirements for large-scale 3D tasks. Therefore, it is highly desirable to reduce the memory and FLOPs required to train 3D CNNs while maintaining the accuracy. This will not only enable large-scale applications but also 3D CNN training on resource-limited devices.

Network pruning is a prominent approach to compress a neural network by reducing the number of parameters or the number of neurons in each layer [14, 15, 16, 17]. However, most network pruning methods target 2D CNNs, while pruning 3D CNNs is largely unexplored. This is mainly because pruning typically aims at reducing the test-time resource requirements, while the training-time computational requirements remain as large as (if not larger than) those of the unpruned network. Such pruning schemes are not suitable for 3D CNNs with dense volumetric data, where the training-time resource requirement is prohibitively large.

In this work, we introduce Resource Aware Neuron Pruning (RANP), which prunes 3D CNNs at initialization. Our method is inspired by, but superior to, SNIP [18], which prunes redundant parameters of a network at initialization and was only tested with small scale 2D CNNs for image classification. With the same characteristics of effectively pruning at initialization without requiring large computational resources, RANP yields better-pruned networks compared to SNIP by removing neurons that largely contribute to the high resource requirement. In our experiments on video classification and more challenging 3D semantic segmentation, with minimal accuracy loss, RANP yields 50%-95% reduction in FLOPs and 35%-80% reduction in memory while only 5%-51% reduction in FLOPs and 1%-17% reduction in memory are achieved by SNIP NP.

The main idea of RANP is to prune based on a neuron importance criterion analogous to the connection sensitivity in SNIP. Note that, pruning based on such a simple criterion as SNIP has the risk of pruning the whole layer(s) at extreme sparsity levels especially on large networks [19]. Even though an orthogonal initialization that ensures layer-wise dynamical isometry is sufficient to mitigate this issue for parameter pruning on 2D CNNs [19], it is unclear if this could be directly applied to neuron pruning on 3D CNNs. To tackle this and improve pruning, we introduce a resource aware reweighting scheme that first balances the mean value of neuron importance in each layer and then reweights the neuron importance based on the resource consumption of each neuron. As evidenced by our experiments, such a reweighting scheme is crucial to obtain large reductions in memory and FLOPs while maintaining high accuracy.

We first evaluate our RANP on 3D semantic segmentation on a sparse point-cloud dataset, ShapeNet [10], and a dense medical image dataset, BraTS’18 [11, 20], with widely used 3D-UNets [5]. We also evaluate RANP on video classification using UCF101 with MobileNetV2 [21] and I3D [22]. Our RANP-f significantly outperforms other neuron pruning methods in resource efficiency by yielding large reductions in computational resources (50%-95% FLOPs reduction and 35%-80% memory reduction) with comparable accuracy to the unpruned network (Fig. 1).

Furthermore, we perform extensive experiments to demonstrate 1) scalability of RANP by pruning with a small input spatial size and training with a large one, 2) transferability by pruning using ShapeNet and training on BraTS’18 and vice versa, 3) lightweight training on a single GPU, and 4) fast training with increased batch size.

## 2 Related Work

Previous works on network pruning mainly focus on 2D CNNs by parameter pruning [18, 17, 14, 15, 23] and neuron pruning [24, 16, 25, 26, 27, 28]. While a majority of the pruning methods use the traditional prune-retrain scheme with a combined loss function of pruning criteria [17, 16, 26], some pruning-at-initialization methods are able to reduce computational complexity in training [18, 29, 30, 31, 32]. While very few are for 3D CNNs [33, 34, 35], none of them prune networks at initialization, and thus, none of them effectively reduce the training-time computational and memory requirements of 3D CNNs.

2D CNN pruning. Parameter pruning merely sparsifies filters for a high learning capability with small models. Han et al. [17] adopted an iterative method of removing parameters with values below a threshold. Lee et al. [18] recently proposed a single-shot method with connection sensitivity by magnitudes of parameter mask gradients to retain top-$\kappa$ parameters. These filter-sparse methods, however, do not directly yield large speedup and memory reductions.

By contrast, neuron pruning, also known as filter pruning or channel pruning, can effectively reduce computational resources. For instance, Li et al.[25] used normalization to remove unimportant filters with connecting features. He et al.[16] adopted a LASSO regression to prune network layers with reconstruction in the least square manner. Yu et al.[24] proposed a group-wise 2D-filter pruning from each 3D-filter by a learning-based method and a knowledge distillation. Structure learning based MorphNet [36] and SSL [37] aim at pruning activations with structure constraints or regularization. These approaches only reduce the test-time resource requirement while we focus on reducing those of large 3D CNNs at training time.

3D CNN pruning. To improve the efficiency on 3D CNNs, some works like SSC [12] and OctNet [38] use efficient data structures to reduce the memory requirement for sparse point-cloud data. However, these approaches are not useful for dense data, e.g., MRI images and videos, and the resource requirement remains prohibitively large.

Hence, it is desirable to develop an efficient pruning method for 3D CNNs that can handle dense 3D data, which is common in real applications. Only very few works are relevant to 3D CNN pruning. Molchanov et al.[33] proposed a greedy criteria-based method to reduce resources via backpropagation with a small 3D CNN for hand gesture recognition. Zhang et al.[34] used a regularization-based pruning method by assigning regularization to weight groups with 4× speedup in theory. Recently, Chen et al.[35] converted 3D filters into the frequency domain to eliminate redundancy in an iterative optimization for convergence. Being a parameter pruning method, this does not lead to large FLOPs and memory reductions, e.g., merely a 2× speedup compared to our 28× (ref. Sec. 5.3). In summary, these methods embed pruning in the iterative network optimization and require extensive resources, which is inefficient for 3D CNNs.

Pruning at Initialization. While few works adopted pruning at initialization, some achieved impressive success. SNIP [18] is the first single-shot pruning method that demonstrated the possibility of pruning networks at initialization with minimal accuracy loss in training, followed by many recent works on single-shot pruning [29, 30, 31, 32]. None of them, however, targets 3D CNN pruning.

In addition to SNIP being a parameter pruning approach, its benefits were demonstrated only on small-scale datasets [18], such as MNIST and CIFAR-10. Therefore, it is unclear whether these benefits could be transposed to 3D CNNs applied to large-scale datasets. Our experiments indicate that, while SNIP itself is not capable of yielding large resource reduction on 3D CNNs, our RANP can greatly reduce the computational resources without causing network infeasibility. Furthermore, we show that RANP enjoys strong transferability among datasets and enables fast and lightweight training of large 3D volumetric data segmentation on a single GPU.

## 3 Preliminaries

We first briefly describe the main idea of SNIP [18], which removes redundant parameters prior to training. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{S}$ with input $x_i$ and ground truth $y_i$ and the sparsity level $\kappa$, the optimization problem associated with SNIP can be written as

$$\min_{c,w} L(c \odot w; \mathcal{D}) = \min_{c,w} \frac{1}{S} \sum_{i=1}^{S} \ell\big(c \odot w, (x_i, y_i)\big), \quad \text{s.t. } w \in \mathbb{R}^m,\ c \in \{0,1\}^m,\ \|c\|_0 \le \kappa, \tag{1}$$

where $w$ denotes an $m$-dimensional vector of parameters, $c$ is the corresponding binary mask on the parameters, $\ell(\cdot)$ is the standard loss function (e.g., cross-entropy loss), and $\|\cdot\|_0$ denotes the $L_0$ norm. The mask $c_j$ for parameter $w_j$ denotes that the parameter is retained in the compressed model if $c_j = 1$ and otherwise it is removed. In order to optimize the above problem, they first relax the binary constraint on the masks such that $c \in [0,1]^m$. Then an importance function for parameter $w_j$ is calculated by the normalized magnitude of the loss gradient over mask $c_j$ as

$$s_j = \frac{|g_j(w; \mathcal{D})|}{\sum_{k=1}^{m} |g_k(w; \mathcal{D})|}, \quad \text{where } g_j(w; \mathcal{D}) = \left. \frac{\partial L(c \odot w; \mathcal{D})}{\partial c_j} \right|_{c=1}. \tag{2}$$

Then, only the top-$\kappa$ parameters are retained based on the parameter importance defined above, called connection sensitivity in [18]. Upon pruning, the retained parameters are trained in the standard way. It is interesting to note that, even though the mask makes the intuition easier to explain, SNIP can be implemented without these additional variables by noting that $g_j(w; \mathcal{D}) = w_j \, \partial L(w; \mathcal{D}) / \partial w_j$ [19]. This method has shown remarkable results in achieving sparsity on 2D image classification tasks with minimal loss of accuracy. Such a parameter pruning method is important; however, it cannot lead to sufficient computation and memory reductions to train a deep 3D CNN on current off-the-shelf graphics hardware. In particular, the sparse weight matrices cannot efficiently reduce memory or FLOPs, and they require specialized sparse matrix implementations for speedup. In contrast, neuron pruning directly translates into practical gains of reducing both memory and FLOPs. This is crucial in 3D CNNs due to their substantially higher resource requirement compared to 2D CNNs.
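The mask-free form of the connection sensitivity can be sketched in a few lines. This is a minimal illustration, assuming the per-parameter loss gradients are already available from one backward pass; the function names `connection_sensitivity` and `top_k_mask` and the toy numbers are hypothetical and not taken from the paper or its code.

```python
def connection_sensitivity(weights, grads):
    """Normalized |w_j * dL/dw_j|, i.e. Eq. 2 without explicit mask variables."""
    raw = [abs(w * g) for w, g in zip(weights, grads)]
    total = sum(raw)
    return [r / total for r in raw]


def top_k_mask(scores, k):
    """Binary mask that keeps the k highest-scoring entries."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    mask = [0] * len(scores)
    for j in order[:k]:
        mask[j] = 1
    return mask


# Toy values made up for illustration (not results from the paper).
weights = [1.0, -2.0, 0.5, 0.1]
grads = [0.3, 0.2, -0.2, 1.0]  # assumed dL/dw_j from one backward pass
scores = connection_sensitivity(weights, grads)
mask = top_k_mask(scores, k=2)
```

Here the first two parameters have the largest products of weight and gradient, so they are the ones retained when `k=2`.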

## 4 Resource Aware NP at Initialization

To explain the proposed RANP, we first extend the SNIP idea to neuron pruning at initialization. Then we discuss a resource aware reweighting strategy to further reduce the computational requirements of the pruned network. The flowchart of our RANP algorithm is shown in Fig. 2.

Before introducing our neuron importance, we first consider a fully-connected feed-forward neural network for simplicity of notation. Consider weight matrices $W^l \in \mathbb{R}^{N_l \times N_{l-1}}$, biases $b^l \in \mathbb{R}^{N_l}$, pre-activations $h^l \in \mathbb{R}^{N_l}$, and post-activations $x^l \in \mathbb{R}^{N_l}$, for layer $l \in \mathcal{K} = \{1, \dots, K\}$. Now the feed-forward dynamics is

$$x^l = \phi(h^l), \quad \text{where } h^l = W^l x^{l-1} + b^l, \tag{3}$$

where the activation function $\phi: \mathbb{R} \to \mathbb{R}$ has elementwise nonlinearity and the network input is denoted by $x^0$. Now we introduce binary masks on neurons (i.e., post-activations). The feed-forward dynamics is then modified to include this masking operation as

$$x^l = c^l \odot \phi(h^l), \quad \text{where } c^l \in \{0,1\}^{N_l},\ \forall l \in \mathcal{K}, \tag{4}$$

where neuron mask $c^l_u = 1$ indicates that neuron $u$ of layer $l$ is retained and $c^l_u = 0$ that it is pruned. Here, neuron pruning can be written as the following optimization problem

$$\min_{c,w} L(c, w; \mathcal{D}) = \min_{c,w} \frac{1}{S} \sum_{i=1}^{S} \ell\big(c, w; (x_i, y_i)\big), \quad \text{s.t. } w \in \mathbb{R}^m,\ c \in \{0,1\}^n,\ \|c\|_0 \le \kappa, \tag{5}$$

where $n$ is the total number of neurons and $\ell(\cdot)$ denotes a standard loss function of the feed-forward mapping with neuron masks defined in Eq. 4. This can be easily extended to convolutional and skip-concatenation operations.
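The masked feed-forward step of Eq. 4 can be illustrated on a single fully-connected layer. This is only a sketch under simple assumptions (ReLU activation, plain Python lists, a hypothetical helper named `masked_layer`); it is not the authors' implementation.

```python
def relu(values):
    """Elementwise ReLU on a list of pre-activations."""
    return [max(0.0, v) for v in values]


def masked_layer(x_prev, weight, bias, mask):
    """One layer of Eq. 4: x^l = c^l * phi(W^l x^{l-1} + b^l)."""
    pre = [sum(w * x for w, x in zip(row, x_prev)) + b
           for row, b in zip(weight, bias)]
    return [c * a for c, a in zip(mask, relu(pre))]


# Toy layer with two neurons; values are illustrative only.
x0 = [1.0, 1.0]
W1 = [[1.0, 2.0], [3.0, 4.0]]
b1 = [0.0, 0.0]
x1_full = masked_layer(x0, W1, b1, mask=[1, 1])
x1_pruned = masked_layer(x0, W1, b1, mask=[1, 0])  # second neuron pruned
```

Setting a mask entry to zero silences the whole neuron, which is what later allows the corresponding row of the weight matrix (and the matching input column of the next layer) to be removed outright.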

Since removing neurons can reduce memory and FLOPs far more than merely sparsifying parameters, the core of our approach is to remove redundant neurons from the model. To locate redundant neurons, we use the influence function concept developed for parameters to establish neuron importance through the loss function.

### 4.1 Neuron Importance

Note that, neuron importance can be derived from the SNIP-based parameter importance discussed in Sec. 3. Another approach is to directly define neuron importance as the normalized magnitude of the neuron mask gradients analogous to parameter importance.

Neuron Importance with Parameter Mask Gradients. The first approach to calculate neuron importance depends on the magnitude of parameter mask gradients, denoted as Magnitude of Parameter Mask Gradients (MPMG). Thus, the importance of neuron $u$ in layer $l$ is

$$s^l_u = f\big(|g^l_{u1}|, \dots, |g^l_{uN_{l-1}}|\big), \tag{6}$$

where $g^l_{uv} = \left. \partial L(c \odot w; \mathcal{D}) / \partial c^l_{uv} \right|_{c=1}$ with $c^l_{uv}$ as the mask of parameter $w^l_{uv}$ (refer to Eq. 2). Here, $f: \mathbb{R}^{N_{l-1}} \to \mathbb{R}$ is a function mapping a set of values to a scalar. We choose $f$ as the sum function, with alternatives, i.e., mean and max functions, in Appendix D. Now, we set neuron masks as 1 for neurons with the top-$\kappa$ largest neuron importance.

Neuron Importance with Neuron Mask Gradients. Another approach is to directly compute mask gradients on neurons and treat their magnitudes as neuron importance, denoted as Magnitude of Neuron Mask Gradients (MNMG). The neuron importance of neuron $u$ in layer $l$ is calculated by

$$s^l_u = \left| \left. \frac{\partial L(c, w; \mathcal{D})}{\partial c^l_u} \right|_{c=1} \right|. \tag{7}$$

Note that a non-linear activation function $\phi: \mathbb{R} \to \mathbb{R}$ in CNNs, including but not limited to ReLU, can be homogeneous of degree 1, i.e., satisfy $\phi(ah) = a\,\phi(h)$ for any scalar $a \ge 0$. Given such a homogeneous function, the calculation of neuron importance with neuron masks can be derived from parameter mask gradients in the form of

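$$\left. \frac{\partial L(c, w; \mathcal{D})}{\partial c^l_u} \right|_{c=1} = \sum_{v=1}^{N_{l-1}} \left. \frac{\partial L(c \odot w; \mathcal{D})}{\partial c^l_{uv}} \right|_{c=1}. \tag{8}$$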

Details of the influence of such an activation function on neuron importance are provided in Appendix B. The two approaches have a similar form: MPMG takes the sum of the magnitudes of the parameter mask gradients, whereas MNMG takes the magnitude of their sum. Hence, MNMG can also be implemented directly from parameter gradients.
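The difference between the two criteria is easiest to see on a small matrix of parameter mask gradients, with one row per neuron. The sketch below assumes the gradients $g^l_{uv}$ are given; the function names `mpmg_sum` and `mnmg` and the numbers are illustrative only.

```python
def mpmg_sum(param_mask_grads):
    """MPMG with f = sum: sum of gradient magnitudes per neuron (Eq. 6)."""
    return [sum(abs(g) for g in row) for row in param_mask_grads]


def mnmg(param_mask_grads):
    """MNMG via Eq. 8: magnitude of the summed gradients per neuron (Eq. 7)."""
    return [abs(sum(row)) for row in param_mask_grads]


# Rows are neurons u, columns are incoming parameters v (toy values).
layer_grads = [[1.0, -1.0],
               [2.0, 3.0]]
importance_mpmg = mpmg_sum(layer_grads)
importance_mnmg = mnmg(layer_grads)
```

For the first neuron the two incoming gradients cancel, so MNMG assigns it zero importance while MPMG-sum does not. When all gradients of a neuron share the same sign, as for the second neuron, the two scores coincide.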

The neuron importance based on the MPMG or MNMG approach can be used to remove redundant neurons. However, it could lead to an imbalance of sparsity levels across layers in 3D network architectures. As shown in Table 2, the computational resources required by vanilla neuron pruning are much higher than those by other sparsity enforcing methods, e.g., random neuron pruning and layer-wise neuron pruning. We hypothesize that this is caused by the layer-wise imbalance of neuron importance, which unilaterally emphasizes some specific layer(s) and may lead to network infeasibility by pruning the whole layer(s). This behavior is also observed in [19], where orthogonal initialization is recommended to solve the problem for 2D CNN pruning; in our case, however, it cannot produce balanced neuron importance, see results in Appendix D.

In order to resolve this issue, we propose resource aware neuron pruning (RANP) with reweighted neuron importance, and the details are provided below.

### 4.2 Resource Aware Reweighting

To tackle the imbalanced neuron importance issue above, we first weight the neuron importance across layers. Weighting the importance of neuron $u$ in layer $l$ can be expressed as

$$\tilde{s}^l_u = \frac{\max_{k=1}^{K} \bar{s}_k}{\bar{s}_l}\, s^l_u, \quad \text{where } \bar{s}_k = \frac{1}{N_k} \sum_{u=1}^{N_k} s^k_u,\ \forall k \in \mathcal{K}. \tag{9}$$

Here, $\bar{s}_l$ is the mean neuron importance of layer $l$ and $\tilde{s}^l_u$ is the updated neuron importance. This yields the same mean neuron importance in each layer, which largely avoids underestimating the neuron importance of specific layer(s) and thus prevents pruning the whole layer(s).
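The layer-wise balancing of Eq. 9 amounts to one rescaling factor per layer. A minimal sketch, assuming every layer has a non-zero mean importance; the name `balance_layers` and the toy values are hypothetical.

```python
def balance_layers(importance):
    """Rescale each layer so all layers share the largest layer mean (Eq. 9)."""
    means = [sum(layer) / len(layer) for layer in importance]
    top = max(means)
    # Assumes no layer mean is zero; a real implementation would guard this.
    return [[top / m * s for s in layer]
            for layer, m in zip(importance, means)]


# Two layers whose raw importance differs by an order of magnitude.
importance = [[1.0, 3.0], [10.0, 30.0]]
balanced = balance_layers(importance)
balanced_means = [sum(layer) / len(layer) for layer in balanced]
```

After balancing, both layers have mean importance 20, while the relative ordering of neurons inside each layer is unchanged.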

To further reduce the memory and FLOPs with minimal accuracy loss, we then reweight the neuron importance by available resource, i.e., memory or FLOPs. This reweighting adds the effect of the computational resource to the weighted neuron importance and is denoted as RANP-[mf], where “m” is for memory and “f” is for FLOPs. We represent the importance of this available resource in layer $l$ as $\tau_l$; refer to Appendix C for details.

The reweighted importance of neuron $u$ in layer $l$, following the weighted addition variant RANP-[mf], is

$$\hat{s}^l_u = \big(1 + \lambda\, \mathrm{softmax}(-\tau_l)\big)\, \tilde{s}^l_u = \left(1 + \lambda \frac{e^{-\tau_l}}{\sum_{k=1}^{K} e^{-\tau_k}}\right) \tilde{s}^l_u, \tag{10}$$

where the coefficient $\lambda$ controls the effect of resource on neuron importance. This effect, represented by softmax, constrains the values into a controllable range [0,1], making it easy to determine $\lambda$ and to assign a high resource influence to layers with a small resource occupation.
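The reweighting of Eq. 10 can be sketched as follows. The per-layer resource terms $\tau_l$ are treated here as given inputs, since their definition is in Appendix C; the function names, the min-shift used for numerical stability, and all numbers are assumptions for illustration.

```python
import math


def softmax_neg(tau):
    """softmax(-tau) over layers; shifting by min(tau) keeps exp() arguments non-positive."""
    shift = min(tau)
    exps = [math.exp(-(t - shift)) for t in tau]
    total = sum(exps)
    return [e / total for e in exps]


def resource_reweight(balanced, tau, lam):
    """Eq. 10: scale layer l by (1 + lam * softmax(-tau)_l)."""
    layer_weights = softmax_neg(tau)
    return [[(1.0 + lam * w) * s for s in layer]
            for layer, w in zip(balanced, layer_weights)]


balanced = [[10.0, 30.0], [10.0, 30.0]]
# Equal resource terms give equal factors (1 + lam / K).
same = resource_reweight(balanced, tau=[1.0, 1.0], lam=2.0)
# A layer with a larger resource term (here the second) receives a smaller boost.
skewed = resource_reweight(balanced, tau=[1.0, 3.0], lam=2.0)
```

Because the softmax output sums to one over layers, the multiplicative factor always lies between 1 and 1 + λ, which is why a single coefficient is enough to tune the strength of the resource term.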

We demonstrate the effect of this reweighting strategy over vanilla pruning in Fig. 3. In more detail, vanilla neuron importance tends to have high values in the last few layers, making it highly possible to remove all neurons of some layers, such as the 7th and 8th layers. Weighting the importance in Fig. 2(b) makes the distribution of importance balanced with the same mean value in each layer. Furthermore, since some neurons have different numbers of input channels, each layer requires different FLOPs and memory. Considering the effect of computational resources on training, we embed them into neuron importance as weights.

In Fig. 2(c), the last few layers require larger computational resources than the others, and thus their neurons share lower weights, see the tendency of mean values. The neuron ratio in Fig. 2(d) clearly indicates a more balanced distribution by RANP-f than by vanilla NP. For instance, very few neurons are retained in the 8th layer by vanilla NP, resulting in low accuracy and low maximum neuron sparsity. With reweighting by RANP-f, however, more neurons can be retained in this layer. Moreover, in Table 2, while weighted NP achieves high accuracy, its computational resource reductions are small. In contrast, RANP-f largely decreases the computational resources with a small accuracy loss.

Then, with reweighted neuron importance by Eq. 10 and $\ddot{s}_\kappa$ as the $\kappa$-th reweighted neuron importance in descending order, the binary mask of neuron $u$ in layer $l$ can be obtained by

$$c^l_u = \mathbb{1}\big[\hat{s}^l_u - \ddot{s}_\kappa \ge 0\big]. \tag{11}$$
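The thresholding in Eq. 11 can be sketched as a global sort over all layers. In this illustration `kappa` is the number of neurons to retain, and `neuron_masks` is a hypothetical helper name; with tied scores at the threshold, slightly more than `kappa` neurons would be kept.

```python
def neuron_masks(reweighted, kappa):
    """Eq. 11: keep neurons whose score is at least the kappa-th largest score."""
    ranked = sorted((s for layer in reweighted for s in layer), reverse=True)
    threshold = ranked[kappa - 1]
    return [[1 if s >= threshold else 0 for s in layer]
            for layer in reweighted]


# Toy reweighted importance for a two-layer network.
reweighted = [[0.9, 0.1, 0.4], [0.7, 0.2]]
masks = neuron_masks(reweighted, kappa=3)
```

The ranking is global rather than per layer, so the number of retained neurons can differ between layers; this is exactly why the balancing and reweighting steps matter before thresholding.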

As mentioned in Sec. 2, our RANP is more effective in reducing memory and FLOPs than SNIP-based pruning which merely sparsifies parameters but needs high memory required by dense operations in training. RANP can easily remove neurons and all involved input channels at once, leading to huge reductions of input and output channels of the filter. Pseudocode is provided in Appendix A.
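To see why removing neurons shrinks both the output channels of one filter bank and the input channels of the next, consider a rough count for a single 3D convolution. The sketch below ignores biases and uses a simple multiply-accumulate count; the channel numbers and function names are hypothetical and not taken from the paper's networks.

```python
def conv3d_params(c_in, c_out, kernel):
    """Weights of a 3D convolution with a cubic kernel (bias ignored)."""
    return c_in * c_out * kernel ** 3


def conv3d_macs(c_in, c_out, kernel, out_voxels):
    """Multiply-accumulates: one kernel application per output voxel and channel."""
    return conv3d_params(c_in, c_out, kernel) * out_voxels


# Halving the neurons of two consecutive layers halves both c_in and c_out here.
dense_params = conv3d_params(64, 128, 3)
pruned_params = conv3d_params(32, 64, 3)
dense_macs = conv3d_macs(64, 128, 3, out_voxels=1000)
pruned_macs = conv3d_macs(32, 64, 3, out_voxels=1000)
```

In this toy setting, 50% neuron sparsity in both layers leaves a quarter of the parameters and multiply-accumulates, and the saving is realized with ordinary dense kernels rather than sparse matrix routines.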

## 5 Experiments

We evaluated RANP on 3D-UNets for 3D semantic segmentation and MobileNetV2 and I3D for video classification. Experiments are on Nvidia Tesla P100-SXM2-16GB GPUs in PyTorch. More results are in Appendix D. Our code is available at https://github.com/zwxu064/RANP.git.

### 5.1 Experimental Setup

3D Datasets. For 3D semantic segmentation, we adopted the large-scale 3D sparse point-cloud dataset, ShapeNet [10], and dense biomedical MRI sequences, BraTS’18 [11, 20]. ShapeNet consists of 50 object part classes, 14007 training samples, and 2874 testing samples. We split it into 6955 training samples and 7052 validation samples as [12] to assign each point/voxel with a part class.

BraTS’18 includes 210 High Grade Glioma (HGG) and 75 Low Grade Glioma (LGG) cases. Each case has 4 MRI sequences, i.e., T1, T1_CE, T2, and FLAIR. The task is to detect and segment brain scan images into 3 categories: Enhancing Tumor (ET), Tumor Core (TC), and Whole Tumor (WT). The spatial size of each volume is 240×240×155. We adopted the splitting strategy of cross-validation in [39] with 228 cases for training and 57 cases for validation.

For video classification, we used video dataset, UCF101 [40] with 101 action categories and 13320 videos. 2D spatial dimension from images and temporal dimension from frames are cast as dense 3D inputs. Among the 3 official train/test splits, we used split-1 which has 9537 videos for training and 3783 videos for validation.

3D CNNs. For 3D semantic segmentation on ShapeNet (sparse data) and BraTS’18 (dense data), we used the standard 15-layer 3D-UNet [5] including 4 encoders, each consisting of two “3D convolution + 3D batch normalization + ReLU” blocks and a “3D max pooling”, four decoders, and a confidence module by softmax. It has 14 convolution layers with 3×3×3 kernels and 1 layer with a 1×1×1 kernel.

For video classification, we used the popular MobileNetV2 [21, 41] and I3D (with inception as backbone) [22] on UCF101. MobileNetV2 has a linear layer and 52 convolution layers, 18 of which have 3×3×3 kernels while the rest have 1×1×1 kernels. I3D has a linear layer and 57 convolution layers, 19 of which have 3×3×3 kernels, 1 has a 7×7×7 kernel, and the rest have 1×1×1 kernels.

Hyper-parameters in learning. For ShapeNet, we set learning rate as 0.1 with an exponential decay rate by 100 epochs; batch size is 12 on 2 GPUs; spatial size for pruning and training is while the spatial size for training is in Sec. 5.7; optimizer is SGD-Nesterov [42] with weight decay 0.0001 and momentum 0.9.

For BraTS’18, learning rate is 0.001, decayed by 0.1 at 150th epoch with 200 epochs; optimizer is Adam[43] with weight decay 0.0001 and AMSGrad[44]; batch size is 2 on 2 GPUs; spatial size for pruning is and for training.

For UCF101, we adopted similar setup from [41] with learning rate 0.1, decayed by 0.1 at {40, 55, 60, 70}th epoch; optimizer by SGD with weight decay 0.001; batch size 8 on one GPU. Spatial size for pruning and training is for MobileNetV2 and for I3D; 16 frames are used for the temporal size. Note that in [41] networks for UCF101 had higher performance since they were pretrained on Kinetics600, while we directly trained on UCF101. A feasible train-from-scratch reference could be [40].

For Eq. 10, we empirically set the coefficient $\lambda$ as 11 for ShapeNet, 15 for BraTS’18, and 80 for UCF101. Glorot initialization [45] was used for weight initialization. Note that we also tried orthogonal initialization [46] to handle the imbalanced layer-wise neuron importance distribution [19] but obtained lower maximum neuron sparsity.

In addition, loss function and metrics are in Appendix A.

### 5.2 Maximum Neuron Sparsity by Vanilla NP

We selected MPMG-sum and MNMG-sum for vanilla neuron importance for comparison. All neurons of the last convolutional layer are retained for the given classes.

In Table 1, MPMG-sum for ShapeNet achieves the largest neuron sparsity, 78.24%, reducing FLOPs by 76.59%, parameters by 95.92%, and memory by 44.10% with 0.53% accuracy loss. Meanwhile, for BraTS’18, MNMG-sum achieves the largest neuron sparsity, 81.32%, but has up to 8.48% accuracy loss. MPMG-sum, in contrast, reaches a maximum neuron sparsity of 78.17% with a smaller accuracy loss, reducing FLOPs by 78.14%, parameters by 96.46%, and memory by 46.63%.

Hence, we selected MPMG-sum as vanilla NP considering the trade-off between the maximum neuron sparsity and the accuracy loss. This is applied to all methods related to weighted neuron pruning and RANP in our experiments. Results of mean and max are in Appendix D.

### 5.3 Evaluation of RANP on Pruning Capability

Random NP retains neurons with neuron indices randomly shuffled. Layer-wise NP retains neurons using the same retain rate as in each layer. For SNIP-based parameter pruning, the parameter masks are post-processed by removing redundant parameters and then making sparse filters dense, which is denoted as SNIP NP. For a fair comparison with SNIP NP, we used the maximum parameter sparsity 98.98% for ShapeNet , 98.88% for BraTS’18, 86.26% for MobileNetV2, and 81.09% for I3D.

ShapeNet. Compared with random NP and layer-wise NP in Table 2, the maximum reduced resources by vanilla NP are much less due to the imbalanced layer-wise distribution of neuron importance. Weighted neuron importance by Eq. 9, however, further reduces 18.3% FLOPs and 29.6% memory with 0.14% accuracy loss.

Reweighting by RANP-f and RANP-m further reduces FLOPs and memory on the basis of weighted NP. Here, RANP-f can reduce 96.8% FLOPs, 95.3% parameters, and 73.7% memory over the unpruned networks. Furthermore, with a similar resource in Table 3, RANP achieves 0.5% increase in accuracy. Note that a too-large $\lambda$ can additionally reduce the resources but at the cost of accuracy.

BraTS’18. In Table 2, RANP-f achieves 96.5% FLOPs, 95.1% parameters, and 80% memory reductions. It further reduces 18.3% FLOPs and 33.3% memory over vanilla NP while changing ET by -1.21%, TC by +5.11%, and WT by +0.77%. With a similar resource in Table 3, RANP achieves higher accuracy than random NP and layer-wise NP.

Additionally, Chen et al.[35] achieved 2× speedup on BraTS’18 with 3D-UNet. In comparison, our RANP-f has roughly 28× speedup, which is theoretically evidenced by the reduced FLOPs from 478.13G to 16.97G in Table 2.
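As a quick sanity check of these figures, assuming speedup is taken to be proportional to the FLOPs ratio, the reported FLOPs give

$$\frac{478.13\,\text{G}}{16.97\,\text{G}} \approx 28.2, \qquad 1 - \frac{16.97}{478.13} \approx 0.965,$$

which agrees with the roughly 28× theoretical speedup and with the 96.5% FLOPs reduction stated for BraTS’18.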

UCF101. In Table 2, for MobileNetV2, RANP-f reduces 55.2% FLOPs, 49% parameters, and 44.1% memory with around 1% accuracy loss. Meanwhile, for I3D, it reduces 49.9% FLOPs, 43.5% parameters, and 35.3% memory with around 2% accuracy increase. The RANP-based methods can reduce much more resources than other methods.

### 5.4 Resources and Accuracy with Neuron Sparsity

Here, we further studied the tendencies of resources and accuracy with an increasing neuron sparsity level from 0 to the maximum one with network feasibility.

Resource Reductions. In Figs. 3(a)-3(d), RANP, marked with (w), achieves much larger FLOPs and memory reductions than vanilla NP, marked with (w/o), due to the balanced distribution of neuron importance by reweighting.

Specifically, for ShapeNet, RANP prunes up to 98.57% neurons while only up to 78.24% by vanilla NP in Fig. 3(a). For BraTS’18, RANP can prune up to 96.24% neurons while only up to 78.17% neurons can be pruned by vanilla NP in Fig. 3(b). For UCF101, RANP can prune up to 80.83% neurons compared to 33.15% on MobileNetV2 in Fig. 3(c), and 85.3% neurons compared to 25.32% on I3D in Fig. 3(d).

Accuracy with Pruning Sparsity. For ShapeNet in Fig. 3(e), the 23-layer 3D-UNet achieves a higher mIoU than the 15-layer one. In the extreme case, when pruned with the maximum neuron sparsity 97.99%, it can achieve 78.10% mIoU. With the maximum neuron sparsity 98.57%, however, the 15-layer 3D-UNet achieves only 61.42%.

For BraTS’18 in Fig. 3(f), the 23-layer 3D-UNet does not always outperform the 15-layer one and has a larger fluctuation which could be caused by the limited training samples. Nevertheless, even in the extreme case, the 23-layer 3D-UNet has small accuracy loss. Clearly, RANP makes it feasible to use deeper 3D-UNets without the memory issue.

For UCF101 in Figs. 3(g)-3(h), RANP-f achieves 3% accuracy loss at 70% neuron sparsity, indicating its effectiveness of greatly reducing resources with small accuracy loss.

### 5.5 Transferability with Interactive Model

In this experiment, we trained on ShapeNet with a transferred 3D-UNet by RANP on BraTS’18 with 80% neuron sparsity. Interactively, with the same neuron sparsity, a transferred 3D-UNet by RANP on ShapeNet was applied to train on BraTS’18. Results in Table 4 demonstrate that training with transferred models crossing different datasets can largely maintain high or higher accuracy.

### 5.6 Lightweight Training on a Single GPU

RANP with high neuron sparsity makes it feasible to train with large data size on a single GPU due to the largely reduced resources. We trained on ShapeNet with the same batch size 12 and spatial size in Sec. 5.1 using a 23-layer 3D-UNet with 80% neuron sparsity on a single GPU. With this setup, RANP-f reduces GFLOPs (from 259.59 to 7.39) and memory (from 1005.96MB to 255.57MB), making it feasible to train on a single GPU instead of 2 GPUs. It achieves a higher mIoU, 84.34±0.21%, than the 15-layer and 23-layer full 3D-UNets in Table 5.

The accuracy increase is due to the enlarged batch size on each GPU. With limited memory, however, training a 23-layer full 3D-UNet on a single GPU is infeasible.

### 5.7 Fast Training with Increased Batch Size

Here, we used the largest spatial size of one sample on a single GPU and then extended it to RANP with increased batch size from 1 to 4 to fully fill GPU capacity. The initial learning rate was reduced from 0.1 to 0.01 because the batch size decreased from 12 in Table 5. This avoids an immediate increase in training loss right after the 1st epoch caused by an unsuitably large learning rate.

In Fig. 4(a), RANP-f enables increased batch size 4 and achieves a faster loss convergence than the full network. In Fig. 4(c), the full network executed 6 epochs while RANP-f reached 26 epochs. As shown by the training time in Figs. 4(b) and 4(d), RANP-f has much lower loss and higher accuracy than the full one. This indicates the practical advantage of RANP in accelerating training convergence.

## 6 Conclusion

In this paper, we propose an effective resource aware neuron pruning method, RANP, for 3D CNNs. RANP prunes a network at initialization by greatly reducing resources with negligible loss of accuracy. Its resource aware reweighting scheme balances the neuron importance distribution in each layer and enhances the pruning capability of removing a high ratio, say 80% on 3D-UNet, of neurons with minimal accuracy loss. This advantage enables training deep 3D CNNs with a large batch size to improve accuracy and achieving lightweight training on one GPU.

Our experiments on 3D semantic segmentation using ShapeNet and BraTS’18 and video classification using UCF101 demonstrate the effectiveness of RANP by pruning 70%-80% neurons with minimal loss of accuracy. Moreover, the transferred models pruned on one dataset and trained on another succeed in maintaining high accuracy, indicating the high transferability of RANP. Meanwhile, the largely reduced computational resources enable lightweight and fast training on one GPU with increased batch size.

### Acknowledgement

We would like to thank Ondrej Miksik for valuable discussions. This work is supported by the Australian Centre for Robotic Vision (CE140100016) and Data61, CSIRO.

### Appendix

We first provide the pseudocode of our RANP algorithm, then discuss our selection of MPMG-sum as vanilla NP, and justify our reweighting scheme against orthogonal initialization with more ablation experiments.

## Appendix A Pseudocode of RANP Procedures

In this appendix, we provide the pseudocode of the pruning procedures of RANP. In Alg. 1, we used a simple half-space method to automatically search for the max neuron sparsity with network feasibility. Note that this search cannot guarantee a small accuracy loss but merely determines the maximum pruning capability. The relation between pruning capability and accuracy was studied in the experimental section in the main paper and Table 6.
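The half-space search can be read as a bisection over the sparsity level. The sketch below is one possible reading, not the authors' Alg. 1: it assumes a user-supplied predicate `is_feasible(sparsity)` (hypothetical) that is true when no layer is pruned entirely, that feasibility is monotone in sparsity, and that sparsity 0 is feasible while sparsity 1 is not.

```python
def max_feasible_sparsity(is_feasible, low=0.0, high=1.0, tol=1e-4):
    """Bisection: `low` stays feasible and `high` stays infeasible throughout."""
    while high - low > tol:
        mid = (low + high) / 2.0
        if is_feasible(mid):
            low = mid   # mid still leaves every layer with some neurons
        else:
            high = mid  # mid removes at least one whole layer
    return low


# Stand-in predicate: pretend the network becomes infeasible above 0.7824.
limit = 0.7824
found = max_feasible_sparsity(lambda s: limit >= s)
```

In practice the predicate would prune the network at the given sparsity and check that every layer retains at least one neuron; as noted above, the result bounds pruning capability only and says nothing about accuracy.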

Loss Function and Metrics. Due to the page limitation in the main paper, we provide here the loss functions and metrics used in our experiments. Standard cross-entropy function was used as the loss function for ShapeNet and UCF101. For BraTS’18, the weighted function in [39] is

$$L = L_{ce} + \alpha L_{dice} = L_{ce} + \alpha \frac{1}{C} \sum_{i=1}^{C} \frac{2 |P_i \cap G_i|}{|P| + |G|}, \tag{12}$$

where $\alpha$ is an empirical weight for dice loss, $P$ is prediction, $G$ is ground truth, and $C$ is the number of classes. Meanwhile, ShapeNet accuracy was measured by mean IoU over each part of object category [47] while IoU by was adopted for BraTS’18. For UCF101 classification, top-1 and top-5 recall rates were used.


## Appendix B Impacts of the Activation Function

In the following, we first establish the relation between MPMG and MNMG for calculating neuron importance given a homogeneous activation function, which includes but is not limited to the ReLU used in the 3D CNNs. Then we analyze the impact of such an activation function on the calculation of neuron importance by deriving the mask gradients on post-activations and pre-activations, illustrated in Figs. 5(b) and 5(a) respectively.

###### Proposition 1

For a network activation function $\phi$ that is a homogeneous function of degree 1, the neuron mask gradient equals the sum of parameter mask gradients of this neuron.

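Proof: Given a neuron mask $c^l_1$ before the activation function in Fig. 5(a) and the output of the 1st neuron as $y^l_1$, we have

$$\begin{aligned} y^l_1 &= \phi(c^l_1 \odot h^l_1) \\ &= \phi\big(c^l_1 \odot (x^{l-1}_1 w^l_{11} + x^{l-1}_2 w^l_{12} + x^{l-1}_3 w^l_{13})\big) \\ &= \phi(c^l_1 x^{l-1}_1 w^l_{11} + c^l_1 x^{l-1}_2 w^l_{12} + c^l_1 x^{l-1}_3 w^l_{13}) . \end{aligned} \tag{13}$$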

The gradient of loss over the neuron mask is

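$$\begin{aligned} \frac{\partial L}{\partial c^l_1} &= \frac{\partial L}{\partial y^l_1} \frac{\partial y^l_1}{\partial c^l_1} \\ &= \frac{\partial L}{\partial y^l_1} (x^{l-1}_1 w^l_{11} + x^{l-1}_2 w^l_{12} + x^{l-1}_3 w^l_{13}) . \end{aligned} \tag{14}$$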

Meanwhile, if setting masks on weights of this neuron directly, we can obtain

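$$y^l_1 = \phi(c^l_{11} x^{l-1}_1 w^l_{11} + c^l_{12} x^{l-1}_2 w^l_{12} + c^l_{13} x^{l-1}_3 w^l_{13}) , \tag{15}$$

then the gradient of the loss with respect to a weight mask, e.g., $c^l_{11}$, is

$$\frac{\partial L}{\partial c^l_{11}} = \frac{\partial L}{\partial y^l_1} \frac{\partial y^l_1}{\partial c^l_{11}} = \frac{\partial L}{\partial y^l_1} x^{l-1}_1 w^l_{11} . \tag{16}$$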

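Similarly, summing the gradients over all weight masks of this neuron gives

$$\frac{\partial L}{\partial c^l_{11}} + \frac{\partial L}{\partial c^l_{12}} + \frac{\partial L}{\partial c^l_{13}} = \frac{\partial L}{\partial y^l_1} (x^{l-1}_1 w^l_{11} + x^{l-1}_2 w^l_{12} + x^{l-1}_3 w^l_{13}) . \tag{17}$$

Clearly, Eq. 14 equals Eq. 17. Hence, the neuron mask gradients can be calculated from the parameter mask gradients. This completes the proof.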

Furthermore, given such a homogeneous activation function in Prop. 1, the importance of a post-activation equals the importance of its pre-activation. In more detail, for post-activations in Fig. 5(b), the output is

$$\begin{aligned} y^l_1 &= c^l_1 \odot \phi(h^l_1) \\ &= c^l_1 \odot \phi(x^{l-1}_1 w^l_{11} + x^{l-1}_2 w^l_{12} + x^{l-1}_3 w^l_{13}) . \end{aligned} \tag{18}$$

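Since the activation function is homogeneous of degree 1, the mask can be moved inside the activation:

$$y^l_1 = \phi(c^l_1 x^{l-1}_1 w^l_{11} + c^l_1 x^{l-1}_2 w^l_{12} + c^l_1 x^{l-1}_3 w^l_{13}) . \tag{19}$$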

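The neuron importance determined by the neuron mask is

$$\begin{aligned} \frac{\partial L}{\partial c^l_1} &= \frac{\partial L}{\partial y^l_1} \frac{\partial y^l_1}{\partial c^l_1} \\ &= \frac{\partial L}{\partial y^l_1} (x^{l-1}_1 w^l_{11} + x^{l-1}_2 w^l_{12} + x^{l-1}_3 w^l_{13}) . \end{aligned} \tag{20}$$

Clearly, Eq. 20 equals Eq. 14. Thus, given such a homogeneous activation function, the importance of pre-activations and post-activations is the same.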
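As an informal numerical check of Prop. 1 and of the pre-/post-activation equivalence, the sketch below compares three mask gradients for a single 3-input neuron with ReLU. It is only an illustration under stated assumptions: the squared-error loss, the input and weight values, and all function names are hypothetical, gradients are approximated by central finite differences at mask value 1, and the pre-activation is chosen positive so that ReLU is in its linear region.

```python
def relu(v: float) -> float:
    return max(0.0, v)


def loss(y: float, target: float = 0.2) -> float:
    # Toy squared-error loss (an assumption for illustration only).
    return 0.5 * (y - target) ** 2


def central_diff(f, at: float = 1.0, eps: float = 1e-4) -> float:
    """Finite-difference derivative of f at `at` (all masks are evaluated at 1)."""
    return (f(at + eps) - f(at - eps)) / (2.0 * eps)


x = [0.5, -1.0, 2.0]   # inputs from layer l-1
w = [0.3, 0.2, 0.4]    # weights of the 1st neuron in layer l
terms = [xi * wi for xi, wi in zip(x, w)]
h = sum(terms)         # pre-activation, positive here (0.75)

# Neuron mask before the activation, as in Eq. 13.
grad_pre = central_diff(lambda c: loss(relu(c * h)))

# Neuron mask after the activation, as in Eq. 18.
grad_post = central_diff(lambda c: loss(c * relu(h)))


def weight_mask_grad(j: int) -> float:
    """Gradient w.r.t. the mask on the j-th weight only, as in Eqs. 15-16."""
    def masked_loss(c: float) -> float:
        pre = sum(c * t if k == j else t for k, t in enumerate(terms))
        return loss(relu(pre))
    return central_diff(masked_loss)


grad_weights = [weight_mask_grad(j) for j in range(len(terms))]

print(grad_pre, grad_post, sum(grad_weights))
```

All three printed values should agree up to finite-difference error, which mirrors Eq. 14 = Eq. 17 = Eq. 20 for this toy neuron.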

## Appendix C Resource Aware Reweighting Scheme

As described in Sec. 4.2 in the main paper, the reweighting of RANP is conducted by first balancing the layer-wise distribution of neuron importance and then adopting the resource importance $\tau_l$ for layer $l$ to further reduce resources. Since FLOPs and memory are the main resources of 3D CNNs, $\tau_l$ is defined by FLOPs or memory as follows.

Generally, given the input dimension of the $l$th layer (see footnote 4), the neuron dimension, and the output dimension, the resource importance in terms of FLOPs or memory is defined by

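$$\begin{aligned} \text{FLOPs:}\quad \tau_l &= \big[(f_h f_w f_d + f_h f_w f_d - 1)\, f_{in} + f_{in} - 1 + 1|_{\text{bias}}\big]\, y_{in} y_h y_w y_d \\ &= (2 f_h f_w f_d f_{in} - 1 + 1|_{\text{bias}})\, y_{in} y_h y_w y_d , \end{aligned} \tag{21a}$$

$$\text{Memory:}\quad \tau_l = y_{in} y_h y_w y_d , \tag{21b}$$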

where $f_h f_w f_d$ is the number of multiplications between the filter (see footnote 5) and the layer input, $f_h f_w f_d - 1$ is for additions of the values from these multiplications, $f_{in}$ is for multiplications over all filters, $f_{in} - 1$ is for additions of the values from all these multiplications, $1|_{\text{bias}}$ is for an addition when the neuron has a bias, and $y_{in} y_h y_w y_d$ is for all elements of the layer output.
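The two definitions in Eq. 21 can be evaluated directly for a given layer. The sketch below is a plain transcription of Eqs. 21a and 21b under assumed names: the function names, argument layout, and the example layer sizes are hypothetical and not taken from the authors' code.

```python
from typing import Tuple


def flops_importance(
    filter_size: Tuple[int, int, int],
    f_in: int,
    out_shape: Tuple[int, int, int, int],
    bias: bool = True,
) -> int:
    """Eq. 21a, term by term (first line of the equation)."""
    f_h, f_w, f_d = filter_size
    mults = f_h * f_w * f_d          # multiplications of one filter with its input window
    adds = mults - 1                 # additions of those products
    per_output = (mults + adds) * f_in + (f_in - 1) + (1 if bias else 0)
    y_in, y_h, y_w, y_d = out_shape
    return per_output * y_in * y_h * y_w * y_d


def flops_importance_closed_form(filter_size, f_in, out_shape, bias=True) -> int:
    """Eq. 21a, simplified second line: (2 f_h f_w f_d f_in - 1 + bias) * outputs."""
    f_h, f_w, f_d = filter_size
    y_in, y_h, y_w, y_d = out_shape
    return (2 * f_h * f_w * f_d * f_in - 1 + (1 if bias else 0)) * y_in * y_h * y_w * y_d


def memory_importance(out_shape: Tuple[int, int, int, int]) -> int:
    """Eq. 21b: number of elements in the layer output."""
    y_in, y_h, y_w, y_d = out_shape
    return y_in * y_h * y_w * y_d


# Hypothetical layer: 3x3x3 filters over 4 input channels, output of 8 x 16 x 16 x 16.
layer_flops = flops_importance((3, 3, 3), f_in=4, out_shape=(8, 16, 16, 16))
layer_memory = memory_importance((8, 16, 16, 16))
print(layer_flops, layer_memory)
```

For this example layer the per-output cost is (27 + 26) x 4 + 3 + 1 = 216 operations, which matches the simplified form 2 x 27 x 4 - 1 + 1 = 216.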

## Appendix D More Ablation Study

In this section, we add more experimental results for the analysis of selecting MPMG-sum as vanilla NP, Glorot initialization for network initialization compared with orthogonal initialization [19] to handle the imbalanced layer-wise distribution of neuron importance, and visualization of neuron distribution by RANP for BraTS’18 in addition to that for ShapeNet in the main paper.

Figures in this section are for 3D-UNets on ShapeNet and BraTS’18 because the 3D-UNets used in our experiments clearly exhibit the neuron imbalance and memory issues and are easy to illustrate with a limited number of layers, i.e., 15 layers, while MobileNetV2 and I3D have more than 55 layers, many of which are not typical 3D convolutional layers with 3 kernel size filters.

### D.1 MPMG-sum as Vanilla Neuron Pruning

In Sec. 5.2 in the main paper, we select MPMG-sum as vanilla neuron pruning for the trade-off between computational resources and accuracy. To give a comprehensive study of this selection, we demonstrate detailed results of mean, max, and sum operations of MPMG and MNMG in Table 6. Note that we relax the sum operation in Eq. 8 in the main paper to mean, max, and sum.

In Table 6, we aim to obtain the maximum neuron sparsity, since the target is to reduce the computational resources at an extreme sparsity level with minimal accuracy loss. Clearly, for ShapeNet, MPMG-sum achieves the largest maximum neuron sparsity, 78.24%, among all variants with only 0.53% accuracy loss. For BraTS’18, by contrast, MNMG-sum has the largest maximum neuron sparsity, 81.32%; however, the accuracy loss can reach up to 8.48%. While MPMG-sum has the second-largest maximum neuron sparsity, 78.17%, its accuracy loss is much smaller than that of MNMG-sum. For UCF101, surprisingly, many variants have low accuracy. As analysed in the footnote of Table 6, at the extreme neuron sparsity some layers of the pruned networks retain only 1 neuron, which loses features needed for learning and thus leads to low accuracy.

Hence, considering the comprehensive performance of reducing resources and maintaining the accuracy, MPMG-sum is selected as vanilla NP. Note that any neuron sparsity greater than the maximum neuron sparsity will make the pruned network infeasible by pruning the whole layer(s).

### D.2 Initialization for Neuron Imbalance

The imbalanced layer-wise distribution of neuron importance hinders pruning at a high sparsity level due to the pruning of the whole layer(s). For 2D classification tasks in [19], orthogonal initialization is used to effectively solve this problem for balancing the importance of parameters; but it does not improve our neuron pruning results in 3D tasks and even leads to a poor pruning capability with a lower maximum neuron sparsity than Glorot initialization [45]. This is briefly mentioned in Sec. 4.1 in the main paper. Here, we compare the resource reducing capability using Glorot initialization and orthogonal initialization.

Resource reductions. In Table 7, vanilla neuron pruning (i.e., MPMG-sum) with Glorot initialization, i.e., vanilla-xn, achieves smaller FLOPs and memory consumption than with orthogonal initialization, i.e., vanilla-ort, except for FLOPs with 3D-UNet on ShapeNet and I3D on UCF101. The exception of I3D on UCF101 is possibly caused by the high ratio of 1 kernel size filters in I3D, i.e., 37 out of 57 convolutional layers, because those 1 kernel size filters can be regarded as 2D filters, which orthogonal initialization can handle effectively [19]. While this ratio is also high in MobileNetV2, i.e., 34 out of 52 convolutional layers, MobileNetV2 does not necessarily exhibit the same behavior as I3D, since the result is also affected by the number of neurons in each layer. Note that since the 3D-UNets used all have 3 kernel size filters, orthogonal initialization for 3D-UNet is in most cases inferior to Glorot initialization according to our experiments. Meanwhile, in Table 7, the gap between vanilla-ort and vanilla-xn is very small on MobileNetV2 and I3D.

Nevertheless, with RANP-f and Glorot initialization, i.e., RANP-f-xn, more FLOPs and memory can be reduced than using orthogonal initialization, i.e., RANP-f-ort.

Balance of Neuron Importance Distribution. More importantly, with reweighting by RANP in Fig. 7, the values of neuron importance are more balanced and stable than those of vanilla neuron importance. This largely avoids network infeasibility, since whole layers are no longer pruned.

Now, we analyse the neuron distribution by observing neuron importance values and network structures. Fig. 8 illustrates a detailed comparison between orthogonal and Glorot initialization for each pair of subfigures in a column of Fig. 7. In Figs. 7(a)-7(c), vanilla neuron importance with Glorot initialization is more stable and compact than that with orthogonal initialization. After applying the reweighting scheme of RANP-f, the importance values follow a similar trend, as shown in Figs. 7(b)-7(d). Consequently, in Figs. 7(e)-7(f), neuron ratios are more balanced with reweighting than without it, especially in the 8th layer. Thus, we choose Glorot initialization as the network initialization. Note that we adopt the same neuron sparsity for these two initialization experiments in Table 7 and Fig. 8.

### D.3 Visualization of Balanced Neuron Distribution by RANP

In Fig. 9, neuron importance by MPMG-sum is more balanced than by MNMG-sum, which prevents networks pruned by MPMG-sum from becoming infeasible; that is, at least 1 neuron is retained in each layer.

In addition to the distribution of retained neuron ratios in Fig. 2 in the main paper for ShapeNet, which is also shown in the first row of Fig. 10, the last row of Fig. 10 is for BraTS’18. Moreover, Fig. 11 illustrates the distribution of neurons retained in each layer by vanilla neuron pruning (i.e., vanilla NP) and RANP-f compared to the full network.

Clearly, upon pruning, neurons in each layer are largely reduced except the last layer where all neurons are retained for the number of segmentation classes. In Fig. 11, vanilla NP has very few neurons in, e.g., the 8th layer, resulting in low accuracy or network infeasibility. By contrast, the neuron distribution by RANP-f is more balanced to improve the pruning capability.

### Footnotes

1. We concretely define “resource” as FLoating Point Operations (FLOPs) and memory required by one forward pass.
2. footnotemark:
3. This is because 2 layers of the pruned MobileNetV2 by MNMG-sum have only 1 neuron, owing to the imbalanced layer-wise neuron importance distribution.
4. The dimension order follows that of PyTorch.
5. Here, we refer to a 3D filter with dimension $f_h \times f_w \times f_d$.
6. footnotemark:
7. For MobileNetV2 pruned by MPMG-mean, MPMG-max, MNMG-max, and MNMG-sum, the accuracy is very low because 1) the neuron sparsity here is the extreme (largest) value, and a larger one would make the network infeasible by removing whole layer(s), and 2) the distribution of neuron importance is rather imbalanced, possibly caused by the high mixture of 1 kernels and 3 in MobileNetV2. In the pruned networks, we observe that, for MPMG-mean, MPMG-max, and MNMG-max, the last convolutional layer has only 1 neuron retained; for MNMG-sum, 2 convolutional layers have only 1 neuron retained. Note that this imbalance issue can be greatly alleviated by the reweighting of our RANP, while we select MPMG-sum as vanilla NP merely according to the results in Table 6.

### References

1. H. Zhang, K. Jiang, Y. Zhang, Q. Li, C. Xia, and X. Chen, “Discriminative feature learning for video semantic segmentation,” International Conference on Virtual Reality and Visualization, 2014.
2. C. Zhang, W. Luo, and R. Urtasun, “Efficient convolutions for real-time semantic segmentation of 3D point cloud,” 3DV, 2018.
3. S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” TPAMI, 2013.
4. R. Hou, C. Chen, R. Sukthankar, and M. Shah, “An efficient 3D CNN for action/object segmentation in video,” BMVC, 2019.
5. O. Cicek, A. Abdulkadir, S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning dense volumetric segmentation from sparse annotation,” MICCAI, 2016.
6. F. Zanjani, D. Moin, B. Verheij, F. Claessen, T. Cherici, T. Tan, and P. With, “Deep learning approach to semantic segmentation in 3D point cloud intra-oral scans of teeth,” Proceedings of Machine Learning Research, 2019.
7. J. Kleesiek, G. Urban, A. Hubert, D. Schwarz, K. Hein, M. Bendszus, and A. Biller, “Deep MRI brain extraction: A 3D convolutional neural network for skull stripping,” NeuroImage, 2016.
8. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” CVPR, 2014.
9. K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” NeurIPS, 2014.
10. L. Yi, V. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas, “A scalable active framework for region annotation in 3D shape collections,” SIGGRAPH Asia, 2016.
11. B. Menze, A. Jakab, and S. B. et al, “The multimodal brain tumor image segmentation benchmark (brats),” IEEE Transactions on Medical Imaging, 2015.
12. B. Graham, M. Engelcke, and L. Maaten, “3D semantic segmentation with submanifold sparse convolutional networks,” CVPR, 2018.
13. C. Qi, H. Su, K. Mo, and L. Guibas, “Pointnet: Deep learning on point sets for 3D classification and segmentation,” CVPR, 2017.
14. Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” NeurIPS, 2016.
15. X. Dong, S. Chen, and S. Pan, “Learning to prune deep neural networks via layer-wise optimal brain surgeon,” NeurIPS, 2017.
16. Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” ICCV, 2017.
17. S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” NeurIPS, 2015.
18. N. Lee, T. Ajanthan, and P. Torr, “SNIP: Single-shot network pruning based on connection sensitivity,” ICLR, 2019.
19. N. Lee, T. Ajanthan, S. Gould, and P. Torr, “A signal propagation perspective for pruning neural networks at initialization,” ICLR, 2020.
20. S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos, “Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features,” Nature Scientific Data, 2017.
21. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” CVPR, 2018.
22. J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the Kinetics dataset,” CVPR, 2017.
23. C. Chen, F. Tung, N. Vedula, and G. Mori, “Constraint-aware deep neural network compression,” ECCV, 2018.
24. N. Yu, S. Qiu, X. Hu, and J. Li, “Accelerating convolutional neural networks by group-wise 2D-filter pruning,” IJCNN, 2017.
25. H. Li, A. Kadav, I. D. H. Samet, and H. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
26. R. Yu, A. Li, C. Chen, J. Lai, V. Morariu, X. Han, M. Gao, C. Lin, and L. Davis, “NISP: Pruning networks using neuron importance score propagation,” CVPR, 2018.
27. Z. Huang and N. Wang, “Data-driven sparse structure selection for deep neural networks,” ECCV, 2018.
28. Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” ECCV, 2018.
29. M. Zhang and B. Stadie, “One-shot pruning of recurrent neural networks by jacobian spectrum evaluation,” arXiv:1912.00120, 2019.
30. C. Li, Z. Wang, X. Wang, and H. Qi, “Single-shot channel pruning based on alternating direction method of multipliers,” arXiv:1902.06382, 2019.
31. J. Yu and T. Huang, “Autoslim: Towards one-shot architecture search for channel numbers,” arXiv:1903.11728, 2019.
32. C. Wang, G. Zhang, and R. Grosse, “Picking winning tickets before training by preserving gradient flow,” ICLR, 2020.
33. P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” ICLR, 2017.
34. Y. Zhang, H. Wang, Y. Luo, L. Yu, H. Hu, H. Shan, and T. Quek, “Three dimensional convolutional neural network pruning with regularization-based method,” ICIP, 2019.
35. H. Chen, Y. Wang, H. Shu, Y. Tang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu, “Frequency domain compact 3D convolutional neural networks,” CVPR, 2020.
36. A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, and E. Choi, “Morphnet: Fast & simple resource-constrained structure learning of deep networks,” CVPR, 2018.
37. W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” NeurIPS, 2016.
38. G. Riegler, A. Ulusoy, and A. Geiger, “OctNet: Learning deep 3D representations at high resolutions,” CVPR, 2017.
39. P. Kao, T. Ngo, A. Zhang, J. Chen, and B. Manjunath, “Brain tumor segmentation and tractographic feature extraction from structural MR images for overall survival prediction,” Workshop on MICCAI, 2018.
40. K. Soomro, A. Zamir, and M. Shah, “UCF101: A dataset of 101 human action classes from videos in the wild,” CRCV Technical Report, 2012.
41. O. Kopuklu, N. Kose, A. Gunduz, and G. Rigoll, “Resource efficient 3D convolutional neural networks,” ICCVW, 2019.
42. I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” ICML, 2013.
43. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.
44. S. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” ICLR, 2018.
45. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
46. A. Saxe, J. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” ICLR, 2014.
47. L. Yi, L. Shao, and M. Savva, “Large-scale 3D shape reconstruction and segmentation from shapenet core55,” arXiv preprint arXiv:1710.06104, 2017.