RANP: Resource Aware Neuron Pruning at Initialization for 3D CNNs
Abstract
Although 3D Convolutional Neural Networks (CNNs) are essential for most learning-based applications involving dense 3D data, their applicability is limited due to excessive memory and computational requirements. Compressing such networks by pruning therefore becomes highly desirable. However, pruning 3D CNNs is largely unexplored, possibly because of the complex nature of typical pruning algorithms that embed pruning into an iterative optimization paradigm. In this work, we introduce a Resource Aware Neuron Pruning (RANP) algorithm that prunes 3D CNNs at initialization to high sparsity levels. Specifically, the core idea is to obtain an importance score for each neuron based on its sensitivity to the loss function. This neuron importance is then reweighted according to the neuron's resource consumption in terms of FLOPs or memory. We demonstrate the effectiveness of our pruning method on 3D semantic segmentation with widely used 3D-UNets on ShapeNet and BraTS’18, as well as on video classification with MobileNetV2 and I3D on the UCF101 dataset. In these experiments, our RANP leads to roughly 50%-95% reduction in FLOPs and 35%-80% reduction in memory with negligible loss in accuracy compared to the unpruned networks. This significantly reduces the computational resources required to train 3D CNNs. The pruned network obtained by our algorithm can also be easily scaled up and transferred to another dataset for training.
1 Introduction
3D image analysis is important in various real-world applications including scene understanding [1, 2], object recognition [3, 4], medical image analysis [5, 6, 7], and video action recognition [8, 9]. Typically, sparse 3D data can be represented using point clouds [10], whereas a volumetric representation is required for dense 3D data arising in domains such as medical imaging [11] and video segmentation and classification [1, 8, 9]. While efficient neural network architectures can be designed for sparse point-cloud data [12, 13], conventional dense 3D Convolutional Neural Networks (CNNs) are required for volumetric data. Such 3D CNNs are computationally expensive with excessive memory requirements for large-scale 3D tasks. Therefore, it is highly desirable to reduce the memory and FLOPs required to train 3D CNNs while maintaining accuracy. This would not only enable large-scale applications but also 3D CNN training on resource-limited devices.
Network pruning is a prominent approach to compress a neural network by reducing the number of parameters or the number of neurons in each layer [14, 15, 16, 17]. However, most network pruning methods target 2D CNNs, while pruning 3D CNNs is largely unexplored. This is mainly because pruning is typically aimed at reducing test-time resource requirements, while the training-time computational requirements remain as large as (if not larger than) those of the unpruned network. Such pruning schemes are not suitable for 3D CNNs with dense volumetric data, where the training-time resource requirement is prohibitively large.
In this work, we introduce a Resource Aware Neuron Pruning (RANP) algorithm that prunes 3D CNNs at initialization to high sparsity levels, thereby reducing both training-time and test-time resource requirements.
The main idea of RANP is to prune based on a neuron importance criterion analogous to the connection sensitivity in SNIP [18]. Note that pruning based on such a simple criterion as SNIP risks pruning whole layer(s) at extreme sparsity levels, especially on large networks [19]. Even though an orthogonal initialization that ensures layer-wise dynamical isometry is sufficient to mitigate this issue for parameter pruning on 2D CNNs [19], it is unclear whether this directly applies to neuron pruning on 3D CNNs. To tackle this and improve pruning, we introduce a resource aware reweighting scheme that first balances the mean value of neuron importance in each layer and then reweights the neuron importance based on the resource consumption of each neuron. As evidenced by our experiments, such a reweighting scheme is crucial to obtain large reductions in memory and FLOPs while maintaining high accuracy.
We first evaluate our RANP on 3D semantic segmentation on a sparse point-cloud dataset, ShapeNet [10], and a dense medical image dataset, BraTS’18 [11, 20], with widely used 3D-UNets [5]. We also evaluate RANP on video classification using UCF101 with MobileNetV2 [21] and I3D [22]. Our RANP-f significantly outperforms other neuron pruning methods in resource efficiency, yielding large reductions in computational resources (50%-95% FLOPs reduction and 35%-80% memory reduction) with accuracy comparable to the unpruned network (Fig. 1).
Furthermore, we perform extensive experiments to demonstrate 1) scalability of RANP by pruning with a small input spatial size and training with a large one, 2) transferability by pruning using ShapeNet and training on BraTS’18 and vice versa, 3) lightweight training on a single GPU, and 4) fast training with increased batch size.
2 Related Work
Previous works on network pruning mainly focus on 2D CNNs, via parameter pruning [18, 17, 14, 15, 23] and neuron pruning [24, 16, 25, 26, 27, 28]. While the majority of pruning methods use the traditional prune-retrain scheme with a combined loss function of pruning criteria [17, 16, 26], some pruning-at-initialization methods are able to reduce computational complexity in training [18, 29, 30, 31, 32]. Very few works address 3D CNNs [33, 34, 35]; none of them prune networks at initialization, and thus none effectively reduce the training-time computational and memory requirements of 3D CNNs.
2D CNN pruning. Parameter pruning merely sparsifies filters, seeking high learning capability with small models. Han et al. [17] adopted an iterative method of removing parameters with values below a threshold. Lee et al. [18] recently proposed a single-shot method with connection sensitivity, using the magnitudes of parameter mask gradients to retain the top parameters. These filter-sparsifying methods, however, do not directly yield large speedups or memory reductions.
By contrast, neuron pruning, also known as filter pruning or channel pruning, can effectively reduce computational resources. For instance, Li et al. [25] used normalization to remove unimportant filters together with their connecting features. He et al. [16] adopted a LASSO regression to prune network layers with least-squares reconstruction. Yu et al. [24] proposed group-wise pruning of 2D filters from each 3D filter via a learning-based method and knowledge distillation. The structure-learning-based MorphNet [36] and SSL [37] prune activations with structure constraints or regularization. These approaches only reduce the test-time resource requirement, while we focus on reducing those of large 3D CNNs at training time.
3D CNN pruning. To improve the efficiency of 3D CNNs, some works like SSC [12] and OctNet [38] use efficient data structures to reduce the memory requirement for sparse point-cloud data. However, these approaches are not useful for dense data, e.g., MRI images and videos, where the resource requirement remains prohibitively large.
Hence, it is desirable to develop an efficient pruning method for 3D CNNs that can handle dense 3D data, which is common in real applications. Only very few works are relevant to 3D CNN pruning. Molchanov et al. [33] proposed a greedy criteria-based method to reduce resources via backpropagation with a small 3D CNN for hand gesture recognition. Zhang et al. [34] used a regularization-based pruning method that assigns regularization to weight groups, with a 4× theoretical speedup. Recently, Chen et al. [35] converted 3D filters into the frequency domain to eliminate redundancy in an iterative optimization. Being a parameter pruning method, this does not lead to large FLOPs and memory reductions, e.g., merely a 2× speedup compared to our roughly 28× (cf. Sec. 5.3). In summary, these methods embed pruning in the iterative network optimization and require extensive resources, which is inefficient for 3D CNNs.
Pruning at Initialization. While few works have adopted pruning at initialization, some achieved impressive success. SNIP [18] is the first single-shot pruning method, demonstrating that networks can be pruned at initialization with minimal accuracy loss in training; it was followed by many recent works on single-shot pruning [29, 30, 31, 32]. However, none of them target 3D CNNs.
In addition to being a parameter pruning approach, the benefits of SNIP were demonstrated only on small-scale datasets [18], such as MNIST and CIFAR-10. Therefore, it is unclear whether these benefits transfer to 3D CNNs applied to large-scale datasets. Our experiments indicate that, while SNIP itself is not capable of yielding large resource reductions on 3D CNNs, our RANP can greatly reduce the computational resources without causing network infeasibility. Furthermore, we show that RANP enjoys strong transferability among datasets and enables fast and lightweight training for segmentation of large 3D volumetric data on a single GPU.
3 Preliminaries
We first briefly describe the main idea of SNIP [18], which removes redundant parameters prior to training. Given a dataset $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^n$ of inputs and ground truths and a sparsity level $\kappa$, the optimization problem associated with SNIP can be written as
(1)  $\min_{\mathbf{c}, \mathbf{w}} \; L(\mathbf{c} \odot \mathbf{w}; \mathcal{D}) \quad \text{s.t.} \quad \mathbf{w} \in \mathbb{R}^m,\; \mathbf{c} \in \{0, 1\}^m,\; \|\mathbf{c}\|_0 \le \kappa,$
where $\mathbf{w}$ denotes the $m$-dimensional vector of parameters, $\mathbf{c}$ is the corresponding binary mask on the parameters, $L$ is the standard loss function (e.g., cross-entropy loss), $\odot$ is element-wise multiplication, and $\|\cdot\|_0$ denotes the $\ell_0$ norm. The mask $c_j$ for parameter $w_j$ denotes that the parameter is retained in the compressed model if $c_j = 1$ and otherwise it is removed. In order to optimize the above problem, they first relax the binary constraint on the masks such that $\mathbf{c} \in [0, 1]^m$. Then an importance function for parameter $w_j$ is calculated by the normalized magnitude of the loss gradient over its mask as
(2)  $s_j = \dfrac{\left|\partial L / \partial c_j\right|}{\sum_{k=1}^{m} \left|\partial L / \partial c_k\right|}\bigg|_{\mathbf{c} = \mathbf{1}}.$
Then, only the top-$\kappa$ parameters are retained based on the parameter importance, called connection sensitivity in [18], defined above. Upon pruning, the retained parameters are trained in the standard way. It is interesting to note that, even though the masks make the intuition easier to explain, SNIP can be implemented without these additional variables by noting that $\partial L / \partial c_j = w_j \, \partial L / \partial w_j$ [19]. This method has shown remarkable results in achieving sparsity on 2D image classification tasks with minimal loss of accuracy. Such a parameter pruning method is important; however, it cannot yield sufficient computation and memory reductions to train a deep 3D CNN on current off-the-shelf graphics hardware. In particular, sparse weight matrices do not efficiently reduce memory or FLOPs, and they require specialized sparse matrix implementations for speedup. In contrast, neuron pruning directly translates into practical gains in both memory and FLOPs. This is crucial for 3D CNNs due to their substantially higher resource requirements compared to 2D CNNs.
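Concretely, the identity $\partial L / \partial c_j = w_j \, \partial L / \partial w_j$ means connection sensitivity can be computed from an ordinary backward pass, without explicit mask variables. A minimal PyTorch sketch (the helper name and interface are ours, for illustration only):

```python
# Sketch of SNIP-style connection sensitivity (Eq. 2) without explicit
# mask variables, using dL/dc_j = w_j * dL/dw_j evaluated at c = 1.
import torch

def connection_sensitivity(model, loss_fn, x, y):
    model.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    scores = []
    for p in model.parameters():
        if p.grad is not None:
            # |w_j * dL/dw_j| for every parameter in this tensor
            scores.append((p * p.grad).abs().flatten())
    s = torch.cat(scores)
    return s / s.sum()  # normalized importance, as in Eq. 2
```

The top-$\kappa$ entries of the returned vector would then define the retained parameters.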
4 Resource Aware NP at Initialization
To explain the proposed RANP, we first extend the SNIP idea to neuron pruning at initialization. Then we discuss a resource aware reweighting strategy to further reduce the computational requirements of the pruned network. The flowchart of our RANP algorithm is shown in Fig. 2.
Before introducing our neuron importance, we first consider a fully-connected feedforward neural network for simplicity of notation. Consider weight matrices $W^l$, biases $\mathbf{b}^l$, pre-activations $\mathbf{z}^l$, and post-activations $\mathbf{a}^l$, for layers $l \in \{1, \dots, L\}$. Now the feedforward dynamics is
(3)  $\mathbf{z}^l = W^l \mathbf{a}^{l-1} + \mathbf{b}^l, \qquad \mathbf{a}^l = \sigma(\mathbf{z}^l),$
where the activation function $\sigma(\cdot)$ is an element-wise nonlinearity and the network input is denoted by $\mathbf{a}^0 = \mathbf{x}$. Now we introduce binary masks $\mathbf{c}^l$ on neurons (i.e., post-activations). The feedforward dynamics is then modified to include this masking operation as
(4)  $\mathbf{a}^l = \mathbf{c}^l \odot \sigma(\mathbf{z}^l),$
where neuron mask $c_i^l = 1$ indicates that neuron $i$ of layer $l$ is retained, and otherwise it is pruned. Here, neuron pruning can be written as the following optimization problem
(5)  $\min_{\mathbf{c}, \mathbf{w}} \; L(\mathbf{c}; \mathbf{w}, \mathcal{D}) \quad \text{s.t.} \quad \mathbf{c} \in \{0, 1\}^N,\; \|\mathbf{c}\|_0 \le \kappa,$
where $N$ is the total number of neurons and $L$ denotes a standard loss function of the feedforward mapping with neuron masks defined in Eq. 4. This can be easily extended to convolutional and skip-concatenation operations.
Since removing neurons can largely reduce memory and FLOPs compared to merely sparsifying parameters, the core of our approach is to remove redundant neurons from the model. We use the influence-function concept developed for parameters to establish neuron importance through the loss function and thereby locate redundant neurons.
4.1 Neuron Importance
Note that, neuron importance can be derived from the SNIPbased parameter importance discussed in Sec. 3. Another approach is to directly define neuron importance as the normalized magnitude of the neuron mask gradients analogous to parameter importance.
Neuron Importance with Parameter Mask Gradients. The first approach calculates neuron importance from the magnitudes of parameter mask gradients, denoted as Magnitude of Parameter Mask Gradients (MPMG). Thus, the importance of neuron $i$ in layer $l$ is
(6)  $s(a_i^l) = g\big(\big\{\, |\partial L / \partial c_{ij}^l| \,\big\}_j\big),$
where $\partial L / \partial c_{ij}^l = w_{ij}^l \, \partial L / \partial w_{ij}^l$ with $c_{ij}^l$ as the mask of parameter $w_{ij}^l$; refer to Eq. 2. Here, $g(\cdot)$ is a function mapping a set of values to a scalar. We choose $g$ as the sum function, with the alternatives, i.e., the mean and max functions, evaluated in Appendix D. Now, we set neuron masks to 1 for the neurons with the top-$\kappa$ largest neuron importance.
Neuron Importance with Neuron Mask Gradients. Another approach is to directly compute mask gradients on neurons and treat their magnitudes as neuron importance, denoted as Magnitude of Neuron Mask Gradients (MNMG). The neuron importance of neuron $i$ in layer $l$ is calculated by
(7)  $s(a_i^l) = \left|\partial L / \partial c_i^l\right|.$
Note that typical nonlinear activation functions in CNNs, including but not limited to ReLU, satisfy $\sigma(x) = \sigma'(x)\,x$. Given such a homogeneous function, the calculation of neuron importance with neuron masks can be derived from parameter mask gradients in the form of
(8)  $s(a_i^l) = \left|\partial L / \partial c_i^l\right| = \Big|\sum_j \partial L / \partial c_{ij}^l\Big| = \Big|\sum_j w_{ij}^l \, \partial L / \partial w_{ij}^l\Big|.$
Details of the influence of such an activation function on neuron importance are provided in Appendix B. The two approaches take a similar form: while MPMG is the sum of the magnitudes of parameter mask gradients, MNMG is the magnitude of their sum. Both can be implemented directly from parameter gradients.
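The contrast between the two criteria can be sketched for a single convolutional layer, where a neuron corresponds to one output channel (the helper name and per-channel flattening are ours, for illustration):

```python
# Sketch contrasting MPMG (sum of magnitudes, Eq. 6 with g = sum) and
# MNMG (magnitude of the sum, Eq. 8), both built from the parameter
# mask gradients g_ij = w_ij * dL/dw_ij of one layer.
import torch

def neuron_importance(weight, weight_grad, mode="mpmg"):
    # weight: (out_channels, in_channels, *kernel); one neuron == one
    # output channel, so flatten all parameters belonging to a neuron.
    g = (weight * weight_grad).flatten(start_dim=1)
    if mode == "mpmg":
        return g.abs().sum(dim=1)  # sum of |mask gradients|
    return g.sum(dim=1).abs()      # |sum of mask gradients|
```

Note that cancellation between positive and negative mask gradients can drive an MNMG score to zero even when MPMG is large, which is one way the two criteria diverge.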
The neuron importance from the MPMG or MNMG approach can be used to remove redundant neurons. However, it can lead to imbalanced sparsity levels across layers in 3D network architectures. As shown in Table 2, the computational resources required by vanilla neuron pruning are much higher than those of other sparsity-enforcing methods, e.g., random neuron pruning and layer-wise neuron pruning. We hypothesize that this is caused by a layer-wise imbalance of neuron importance, which unilaterally emphasizes some specific layer(s) and may lead to network infeasibility by pruning whole layer(s). This behavior was also observed in [19], where orthogonal initialization is recommended to solve the problem for 2D CNN pruning; this, however, does not yield balanced neuron importance in our case, see results in Appendix D.
In order to resolve this issue, we propose resource aware neuron pruning (RANP) with reweighted neuron importance, and the details are provided below.
4.2 Resource Aware Reweighting
To tackle the imbalanced neuron importance issue above, we first balance the neuron importance across layers. The weighted neuron importance of neuron $i$ in layer $l$ can be expressed as
(9)  $\hat{s}(a_i^l) = s(a_i^l) / \mu^l.$
Here, $\mu^l = \frac{1}{n^l} \sum_i s(a_i^l)$ is the mean neuron importance of layer $l$ (with $n^l$ neurons) and $\hat{s}(a_i^l)$ is the updated neuron importance. This yields the same mean neuron importance in each layer, which largely avoids underestimating the neuron importance of specific layers and thus prevents whole layers from being pruned.
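The layer-wise balancing in Eq. 9 amounts to dividing each neuron's importance by its layer mean, so every layer ends up with unit mean importance. A one-line sketch (the function name is ours):

```python
# Sketch of Eq. 9: normalize each layer's neuron importance by its
# layer mean, so all layers share the same mean importance (1.0).
import torch

def balance_layerwise(importances):
    # importances: list of 1-D tensors, one tensor per layer
    return [s / s.mean() for s in importances]
```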
To further reduce the memory and FLOPs with minimal accuracy loss, we then reweight the neuron importance by the available resource, i.e., memory or FLOPs. This reweighting adds the effect of the computational resource to the weighted neuron importance, denoted as RANP-m/f, where “m” is for memory and “f” is for FLOPs. We represent the resource cost of layer $l$ as $R^l$; refer to Appendix C for details.
The reweighted neuron importance of neuron $i$ in layer $l$ following the weighted-addition variant RANP-m/f is
(10)  $\tilde{s}(a_i^l) = \hat{s}(a_i^l) + \lambda \, \dfrac{\exp(-R^l)}{\sum_k \exp(-R^k)},$
where the coefficient $\lambda$ controls the effect of the resource on neuron importance. This effect, represented by a softmax, constrains the values to a controllable range [0, 1], making $\lambda$ easy to determine and assigning a high resource weight to layers with a small resource occupation.
We demonstrate the effect of this reweighting strategy over vanilla pruning in Fig. 3. In more detail, vanilla neuron importance tends to have high values in the last few layers, making it highly likely that all neurons of some layers, such as the 7th and 8th, are removed. Weighting the importance in Fig. 2(b) balances the distribution of importance, with the same mean value in each layer. Furthermore, since neurons have different numbers of input channels, each layer requires different FLOPs and memory. Considering the effect of computational resources on training, we embed them into the neuron importance as weights.
In Fig. 2(c), the last few layers require larger computational resources than the others, and thus their neurons receive lower weights, see the tendency of the mean values. The neuron ratio in Fig. 2(d) indicates a more balanced distribution by RANP-f than by vanilla NP. For instance, very few neurons are retained in the 8th layer by vanilla NP, resulting in low accuracy and low maximum neuron sparsity. With reweighting by RANP-f, however, more neurons can be retained in this layer. Moreover, in Table 2, while weighted NP achieves high accuracy, its computational resource reductions are small. In contrast, RANP-f largely decreases the computational resources with a small accuracy loss.
Then, with the neuron importance reweighted by Eq. 10 and $\tilde{s}_\kappa$ as the $\kappa$-th largest reweighted neuron importance (in descending order), the binary mask of neuron $i$ in layer $l$ can be obtained by
(11)  $c_i^l = \mathbb{1}\big[\tilde{s}(a_i^l) \ge \tilde{s}_\kappa\big].$
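A minimal sketch of the reweighting and mask selection described above, assuming the resource term enters as a softmax over negated per-layer costs so that layers with small resource occupation receive a high weight in [0, 1] (that exact form, and all names here, are our illustrative assumptions):

```python
# Sketch of resource-aware reweighting plus top-kappa mask selection.
# Assumption: the per-layer resource weight is softmax(-R), giving a
# high weight to cheap layers and a low weight to expensive ones.
import torch

def ranp_masks(balanced, resources, lam, kappa):
    # balanced: list of per-layer importance tensors (after balancing)
    # resources: per-layer resource cost (FLOPs or memory)
    r = torch.softmax(-torch.as_tensor(resources, dtype=torch.float), dim=0)
    scores = torch.cat([s + lam * r[l] for l, s in enumerate(balanced)])
    thresh = scores.topk(kappa).values[-1]  # kappa-th largest score
    return (scores >= thresh).float()       # binary neuron masks
```

Because the resource term is shared by all neurons of a layer, it shifts whole layers up or down in the global top-kappa ranking without reordering neurons within a layer.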
As mentioned in Sec. 2, our RANP is more effective at reducing memory and FLOPs than SNIP-based pruning, which merely sparsifies parameters and still requires the high memory of dense operations in training. RANP removes neurons, together with all involved input channels, at once, leading to huge reductions in the input and output channels of the filters. Pseudocode is provided in Appendix A.
5 Experiments
We evaluated RANP on 3D-UNets for 3D semantic segmentation, and on MobileNetV2 and I3D for video classification. Experiments were run on Nvidia Tesla P100-SXM2-16GB GPUs in PyTorch. More results are in Appendix D. Our code is available at https://github.com/zwxu064/RANP.git.
5.1 Experimental Setup
3D Datasets. For 3D semantic segmentation, we adopted the large-scale 3D sparse point-cloud dataset ShapeNet [10] and the dense biomedical MRI dataset BraTS’18 [11, 20]. ShapeNet consists of 50 object part classes, 14007 training samples, and 2874 testing samples. We split the training set into 6955 training samples and 7052 validation samples as in [12]; the task is to assign each point/voxel a part class.
BraTS’18 includes 210 High Grade Glioma (HGG) and 75 Low Grade Glioma (LGG) cases. Each case has 4 MRI sequences, i.e., T1, T1_CE, T2, and FLAIR. The task is to detect and segment brain scan images into 3 categories: Enhancing Tumor (ET), Tumor Core (TC), and Whole Tumor (WT). The spatial size of each case is 240×240×155. We adopted the cross-validation splitting strategy of [39] with 228 cases for training and 57 cases for validation.
For video classification, we used the video dataset UCF101 [40] with 101 action categories and 13320 videos. The 2D spatial dimensions from images and the temporal dimension from frames are cast as dense 3D inputs. Among the 3 official train/test splits, we used split-1, which has 9537 videos for training and 3783 videos for validation.
3D CNNs. For 3D semantic segmentation on ShapeNet (sparse data) and BraTS’18 (dense data), we used the standard 15-layer 3D-UNet [5] consisting of 4 encoders, each with two “3D convolution + 3D batch normalization + ReLU” units and a “3D max pooling”, 4 decoders, and a confidence module with softmax. It has 14 convolution layers with 3×3×3 kernels and 1 layer with a 1×1×1 kernel.
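One encoder stage of the kind described can be sketched as follows (channel sizes are illustrative, not the paper's configuration):

```python
# Minimal sketch of one 3D-UNet encoder stage as described in the text:
# two "3D convolution + 3D batch normalization + ReLU" units followed
# by a 3D max pooling. Channel widths are placeholders.
import torch.nn as nn

def encoder_stage(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=2),  # halves each spatial dimension
    )
```

Neuron pruning on such a stage removes whole output channels of the Conv3d layers, which shrinks both these filters and the input channels of the next stage.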
For video classification, we used the popular MobileNetV2 [21, 41] and I3D (with Inception as the backbone) [22] on UCF101. MobileNetV2 has a linear layer and 52 convolution layers, 18 of which have 3×3×3 kernels and the rest 1×1×1. I3D has a linear layer and 57 convolution layers, 19 of which have 3×3×3 kernels, 1 has a 7×7×7 kernel, and the rest are 1×1×1.
Hyperparameters in learning. For ShapeNet, we set learning rate as 0.1 with an exponential decay rate by 100 epochs; batch size is 12 on 2 GPUs; spatial size for pruning and training is while the spatial size for training is in Sec. 5.7; optimizer is SGDNesterov [42] with weight decay 0.0001 and momentum 0.9.
For BraTS’18, learning rate is 0.001, decayed by 0.1 at 150th epoch with 200 epochs; optimizer is Adam[43] with weight decay 0.0001 and AMSGrad[44]; batch size is 2 on 2 GPUs; spatial size for pruning is and for training.
For UCF101, we adopted similar setup from [41] with learning rate 0.1, decayed by 0.1 at {40, 55, 60, 70}th epoch; optimizer by SGD with weight decay 0.001; batch size 8 on one GPU. Spatial size for pruning and training is for MobileNetV2 and for I3D; 16 frames are used for the temporal size. Note that in [41] networks for UCF101 had higher performance since they were pretrained on Kinetics600, while we directly trained on UCF101. A feasible trainfromscratch reference could be [40].
For Eq. 10, we empirically set the coefficient $\lambda$ to 11 for ShapeNet, 15 for BraTS’18, and 80 for UCF101. Glorot initialization [45] was used for weight initialization. Note that we also tried orthogonal initialization [46] to handle the imbalanced layer-wise neuron importance distribution [19] but obtained a lower maximum neuron sparsity.
In addition, loss function and metrics are in Appendix A.
5.2 Maximum Neuron Sparsity by Vanilla NP
Dataset (Model)  Manner  Sparsity(%)  Param(MB)  GFLOPs  Mem(MB)  Metric(%)
ShapeNet (3D-UNet)  Full [5]  0  62.26  237.85  997.00  83.79±0.21
ShapeNet (3D-UNet)  MNMG-sum  66.93  4.29  100.34  783.14  83.65±0.02
ShapeNet (3D-UNet)  MPMG-sum  78.24  2.54  55.69  557.32  83.26±0.14
BraTS’18 (3D-UNet)  Full [5]  0  15.57  478.13  3628.00  72.96±0.60
BraTS’18 (3D-UNet)  MNMG-sum  81.32  0.35  73.50  1933.20  64.48±1.10
BraTS’18 (3D-UNet)  MPMG-sum  78.17  0.55  104.50  1936.44  71.94±1.68
UCF101 (MobileNetV2)  Full [21]  0  9.47  0.58  157.47  47.08±0.72
UCF101 (MobileNetV2)  MNMG-sum  39.89  4.66  0.43  120.01  1.03±0.00
UCF101 (MobileNetV2)  MPMG-sum  33.15  6.35  0.55  155.17  46.32±0.79
UCF101 (I3D)  Full [22]  0  47.27  27.88  201.28  51.58±1.86
UCF101 (I3D)  MNMG-sum  32.87  20.00  16.03  125.17  49.02±3.33
UCF101 (I3D)  MPMG-sum  25.32  29.93  25.76  192.42  51.57±1.46
We selected MPMG-sum and MNMG-sum as the vanilla neuron importance variants for comparison. All neurons of the last convolutional layer are retained to match the given number of classes.
In Table 1, MPMG-sum on ShapeNet achieves the largest neuron sparsity, 78.24%, reducing 76.59% of FLOPs, 95.92% of parameters, and 44.10% of memory with a 0.53% accuracy loss. Meanwhile, on BraTS’18, MNMG-sum achieves the largest neuron sparsity, 81.32%, but incurs up to 8.48% accuracy loss. MPMG-sum reaches a slightly lower neuron sparsity of 78.17% but with a much smaller accuracy loss, reducing 78.14% of FLOPs, 96.46% of parameters, and 46.63% of memory.
Hence, we selected MPMG-sum as vanilla NP considering the trade-off between the maximum neuron sparsity and the accuracy loss. It is used in all methods related to weighted neuron pruning and RANP in our experiments. Results with the mean and max functions are in Appendix D.
5.3 Evaluation of RANP on Pruning Capability
Dataset (Model)  Manner  Sparsity(%)  Param(MB)  GFLOPs  Memory(MB)  Metrics(%)

ShapeNet (3D-UNet), metric: mIoU
  Full [5]  0  62.26  237.85  997.00  83.79±0.21
  SNIP [18] NP  98.98  5.31 (91.5)  126.22 (46.9)  833.20 (16.4)  83.70±0.20
  Random NP  78.24  3.05 (95.1)  10.36 (95.6)  267.95 (73.1)  82.90±0.19
  Layerwise NP  78.24  2.99 (95.2)  11.63 (95.1)  296.22 (70.3)  83.25±0.14
  Vanilla NP (ours)  78.24  2.54 (95.9)  55.69 (76.6)  557.32 (44.1)  83.26±0.14
  Weighted NP (ours)  78.24  2.97 (95.2)  12.06 (94.9)  301.56 (69.8)  83.12±0.09
  RANP-m (ours)  78.24  3.39 (94.6)  6.68 (97.2)  214.95 (78.4)  82.35±0.24
  RANP-f (ours)  78.24  2.94 (95.3)  7.54 (96.8)  262.66 (73.7)  83.07±0.22

BraTS’18 (3D-UNet), metrics: ET / TC / WT
  Full [5]  0  15.57  478.13  3628.00  72.96±0.60 / 73.51±1.54 / 86.79±0.35
  SNIP [18] NP  98.88  1.09 (93.0)  233.11 (51.2)  2999.64 (17.3)  73.33±1.89 / 71.98±2.15 / 86.44±0.39
  Random NP  78.17  0.75 (95.2)  22.59 (95.3)  817.59 (77.5)  67.27±0.99 / 71.62±1.20 / 74.16±1.33
  Layerwise NP  78.17  0.75 (95.2)  24.09 (95.0)  836.88 (77.0)  69.74±1.33 / 71.49±1.62 / 86.38±0.39
  Vanilla NP (ours)  78.17  0.55 (96.5)  104.50 (78.1)  1936.44 (46.6)  71.94±1.68 / 69.39±2.29 / 84.68±0.78
  Weighted NP (ours)  78.17  0.79 (95.0)  22.40 (95.3)  860.64 (76.3)  71.50±0.63 / 75.05±1.19 / 84.05±0.65
  RANP-m (ours)  78.17  0.87 (94.4)  13.47 (97.2)  506.97 (86.0)  66.70±2.94 / 62.99±2.38 / 82.90±0.41
  RANP-f (ours)  78.17  0.76 (95.1)  16.97 (96.5)  729.11 (80.0)  70.73±0.66 / 74.50±1.05 / 85.45±1.06

UCF101 (MobileNetV2), metrics: Top-1 / Top-5
  Full [21]  0  9.47  0.58  157.47  47.08±0.72 / 76.68±0.50
  SNIP [18] NP  86.26  3.67 (61.3)  0.54 (6.9)  155.35 (1.3)  45.78±0.04 / 75.08±0.17
  Random NP  33.15  4.58 (51.6)  0.34 (41.4)  106.68 (32.3)  44.74±0.36 / 74.69±0.58
  Layerwise NP  33.15  4.56 (51.8)  0.33 (43.1)  106.92 (32.1)  44.90±0.36 / 75.54±0.34
  Vanilla NP (ours)  33.15  6.35 (32.9)  0.55 (5.2)  155.17 (1.5)  46.32±0.79 / 75.42±0.60
  Weighted NP (ours)  33.15  4.82 (49.1)  0.30 (48.3)  100.33 (36.3)  46.19±0.51 / 75.72±0.30
  RANP-m (ours)  33.15  4.87 (48.6)  0.27 (53.4)  84.51 (46.3)  45.11±0.41 / 75.53±0.37
  RANP-f (ours)  33.15  4.83 (49.0)  0.26 (55.2)  88.01 (44.1)  45.87±0.41 / 75.75±0.30

UCF101 (I3D), metrics: Top-1 / Top-5
  Full [22]  0  47.27  27.88  201.28  51.58±1.86 / 77.35±0.63
  SNIP [18] NP  81.09  30.06 (36.4)  26.31 (5.6)  195.62 (2.8)  52.38±3.55 / 78.32±3.24
  Random NP  25.32  26.36 (44.2)  16.45 (41.0)  145.07 (27.9)  52.42±2.52 / 79.05±2.06
  Layerwise NP  25.32  26.67 (43.6)  16.93 (39.3)  150.95 (25.0)  52.77±1.99 / 78.41±1.07
  Vanilla NP (ours)  25.32  29.93 (36.7)  25.76 (7.6)  192.42 (4.4)  51.57±1.46 / 78.07±1.34
  Weighted NP (ours)  25.32  26.57 (43.8)  15.56 (44.2)  142.57 (29.2)  54.09±0.82 / 79.26±0.61
  RANP-m (ours)  25.32  26.75 (43.4)  14.08 (49.5)  130.44 (35.2)  52.11±3.05 / 77.54±2.64
  RANP-f (ours)  25.32  26.69 (43.5)  13.98 (49.9)  130.22 (35.3)  54.27±2.88 / 79.27±2.13

Values in parentheses denote percentage reductions relative to the full network.
Random NP retains neurons with the neuron indices randomly shuffled. Layerwise NP retains neurons in each layer at the same retain rate. For SNIP-based parameter pruning, the parameter masks are post-processed by removing redundant parameters and then making the sparse filters dense; this is denoted as SNIP NP. For a fair comparison with SNIP NP, we used the maximum parameter sparsity: 98.98% for ShapeNet, 98.88% for BraTS’18, 86.26% for MobileNetV2, and 81.09% for I3D.
ShapeNet: Manner  Sparsity(%)  Param(MB)  GFLOPs  Mem(MB)  mIoU(%)
  Random NP (matched GFLOPs)  81.01  2.27  7.54  253.12  82.66±0.23
  Layerwise NP (matched GFLOPs)  82.82  1.84  7.54  255.67  82.82±0.26
  Random NP (matched memory)  78.83  2.87  9.57  262.66  82.86±0.45
  Layerwise NP (matched memory)  82.81  1.94  8.14  262.66  82.52±0.13
  RANP-f (ours)  78.24  2.94  7.54  262.66  83.07±0.22

BraTS’18: Manner  Sparsity(%)  Param(MB)  GFLOPs  Mem(MB)  ET(%)  TC(%)  WT(%)
  Random NP (matched GFLOPs)  81.08  0.56  16.97  685.77  61.09±1.87  68.94±2.44  78.89±2.47
  Layerwise NP (matched GFLOPs)  83.50  0.46  16.97  700.64  70.50±0.63  74.27±0.95  83.63±0.92
  Random NP (matched memory)  80.90  0.57  17.95  729.11  68.45±1.11  70.67±1.21  75.02±0.79
  Layerwise NP (matched memory)  82.45  0.51  17.31  729.11  70.45±1.03  69.27±1.95  82.42±0.68
  RANP-f (ours)  78.17  0.76  16.97  729.11  70.73±0.66  74.50±1.05  85.45±1.06
ShapeNet. Compared with random NP and layerwise NP in Table 2, vanilla NP reduces far fewer resources due to the imbalanced layer-wise distribution of neuron importance. Weighting the neuron importance by Eq. 9, however, further reduces FLOPs by 18.3% and memory by 29.6% of the full network with only a 0.14% accuracy loss.
Reweighting by RANP-f and RANP-m further reduces FLOPs and memory on top of weighted NP. Here, RANP-f reduces 96.8% of FLOPs, 95.3% of parameters, and 73.7% of memory relative to the unpruned network. Furthermore, at a similar resource budget in Table 3, RANP-f achieves a 0.5% increase in accuracy. Note that a too-large $\lambda$ can reduce the resources further, but at the cost of accuracy.
BraTS’18. In Table 2, RANP-f achieves reductions of 96.5% in FLOPs, 95.1% in parameters, and 80% in memory. It further reduces FLOPs by 18.3% and memory by 33.3% over vanilla NP, at the cost of 1.21% in ET while increasing TC by 5.11% and WT by 0.77%. At a similar resource budget in Table 3, RANP-f achieves higher accuracy than random NP and layerwise NP.
Additionally, Chen et al. [35] achieved a 2× speedup on BraTS’18 with 3D-UNet. In comparison, our RANP-f yields roughly a 28× speedup, as theoretically evidenced by the FLOPs reduction from 478.13G to 16.97G in Table 2.
UCF101. In Table 2, for MobileNetV2, RANP-f reduces 55.2% of FLOPs, 49% of parameters, and 44.1% of memory with around a 1% accuracy loss. Meanwhile, for I3D, it reduces 49.9% of FLOPs, 43.5% of parameters, and 35.3% of memory with around a 2% accuracy increase. The RANP-based methods reduce far more resources than the other methods.
5.4 Resources and Accuracy with Neuron Sparsity
Here, we further studied how resource use and accuracy evolve as the neuron sparsity level increases from 0 to the maximum that preserves network feasibility.
Resource Reductions. In Figs. 3(a)-3(d), RANP, marked with (w), achieves much larger FLOPs and memory reductions than vanilla NP, marked with (w/o), due to the balanced distribution of neuron importance after reweighting.
Specifically, for ShapeNet, RANP prunes up to 98.57% of neurons, versus only up to 78.24% for vanilla NP, in Fig. 3(a). For BraTS’18, RANP prunes up to 96.24% of neurons, versus only 78.17% for vanilla NP, in Fig. 3(b). For UCF101, RANP prunes up to 80.83% of neurons compared to 33.15% on MobileNetV2 in Fig. 3(c), and 85.3% compared to 25.32% on I3D in Fig. 3(d).
Accuracy with Pruning Sparsity. For ShapeNet in Fig. 3(e), the 23-layer 3D-UNet achieves a higher mIoU than the 15-layer one. Even when pruned at its maximum neuron sparsity of 97.99%, it achieves 78.10% mIoU. At its maximum neuron sparsity of 98.57%, however, the 15-layer 3D-UNet achieves only 61.42%.
For BraTS’18 in Fig. 3(f), the 23-layer 3D-UNet does not always outperform the 15-layer one and fluctuates more, which could be caused by the limited number of training samples. Nevertheless, even in the extreme case, the 23-layer 3D-UNet has only a small accuracy loss. Clearly, RANP makes it feasible to use deeper 3D-UNets without memory issues.
5.5 Transferability with Interactive Model
Manner  ShapeNet mIoU(%)  BraTS’18 ET(%)  BraTS’18 TC(%)  BraTS’18 WT(%)
Full [5]  84.27±0.21  74.04±1.45  75.11±2.43  84.49±0.74
RANP-f (ours)  83.86±0.15  71.13±1.43  72.40±1.48  83.32±0.62
TRANP-f (ours)  83.25±0.17  72.74±0.69  73.25±1.69  85.22±0.57
In this experiment, we trained on ShapeNet a 3D-UNet pruned by RANP on BraTS’18 at 80% neuron sparsity. Conversely, with the same neuron sparsity, a 3D-UNet pruned by RANP on ShapeNet was trained on BraTS’18. The results in Table 4 demonstrate that training transferred models across different datasets largely maintains high, or even higher, accuracy.
5.6 Lightweight Training on a Single GPU
Manner  Layers  Batch  GPU(s)  Sparsity(%)  mIoU(%)
Full  15  12  2  0  83.79±0.21
Full  23  12  2  0  84.27±0.21
RANP-f (ours)  23  12  1  80  84.34±0.21
RANP with high neuron sparsity makes it feasible to train with a large data size on a single GPU due to the largely reduced resources. We trained on ShapeNet with the same batch size of 12 and spatial size as in Sec. 5.1, using a 23-layer 3D-UNet with 80% neuron sparsity on a single GPU. With this setup, RANP-f reduces GFLOPs from 259.59 to 7.39 and memory from 1005.96MB to 255.57MB, making it feasible to train on a single GPU instead of 2 GPUs. It achieves a higher mIoU, 84.34±0.21%, than the 15-layer and 23-layer full 3D-UNets in Table 5.
The accuracy increase is due to the enlarged batch size on each GPU. With limited memory, however, training a 23-layer full 3D-UNet on a single GPU is infeasible.
5.7 Fast Training with Increased Batch Size
Here, we used the largest spatial size of one sample that fits on a single GPU for the full network and then, for RANP, increased the batch size from 1 to 4 to fully fill the GPU capacity. The initial learning rate was reduced from 0.1 to 0.01 because the batch size decreased from 12 in Table 5; this avoids an immediate increase in training loss right after the first epoch due to an unsuitably large learning rate.
In Fig. 4(a), RANP-f enables an increased batch size of 4 and achieves faster loss convergence than the full network. In Fig. 4(c), within the same time, the full network ran 6 epochs while RANP-f reached 26 epochs. As shown against training time in Figs. 4(b) and 4(d), RANP-f attains a much lower loss and higher accuracy than the full network. This demonstrates the practical advantage of RANP in accelerating training convergence.
6 Conclusion
In this paper, we propose an effective resource aware neuron pruning method, RANP, for 3D CNNs. RANP prunes a network at initialization, greatly reducing resources with negligible loss of accuracy. Its resource aware reweighting scheme balances the layer-wise distribution of neuron importance and enhances the pruning capability, removing a high ratio of neurons, e.g., 80% on 3DUNet, with minimal accuracy loss. This advantage enables training deep 3D CNNs with a large batch size to improve accuracy and achieves lightweight training on one GPU.
Our experiments on 3D semantic segmentation using ShapeNet and BraTS’18 and video classification using UCF101 demonstrate the effectiveness of RANP by pruning 70%-80% of neurons with minimal loss of accuracy. Moreover, models pruned on one dataset and trained on another succeed in maintaining high accuracy, indicating the high transferability of RANP. Meanwhile, the greatly reduced computational resources enable lightweight and fast training on one GPU with an increased batch size.
Acknowledgement
We would like to thank Ondrej Miksik for valuable discussions. This work is supported by the Australian Centre for Robotic Vision (CE140100016) and Data61, CSIRO.
Appendix
We first provide the pseudocode of our RANP algorithm, then discuss our selection of MPMG-sum as vanilla NP, and justify our reweighting scheme against orthogonal initialization with more ablation experiments.
Appendix A Pseudocode of RANP Procedures
In Alg. 1, we provide the pseudocode of the pruning procedure of RANP, which uses a simple halving (bisection) method to automatically search for the maximum neuron sparsity that keeps the network feasible. Note that this search does not guarantee a small accuracy loss but merely determines the maximum pruning capability. The relation between pruning capability and accuracy was studied in the experimental section of the main paper and in Table 6.
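The halving search described above can be sketched as follows. Here `is_feasible` is a hypothetical stand-in for the feasibility check in Alg. 1 (True iff pruning at the given neuron sparsity leaves at least one neuron in every layer):

```python
# Sketch of the halving (bisection) search in Alg. 1 for the maximum feasible
# neuron sparsity. `is_feasible` is a hypothetical stand-in for the actual
# check that no layer is pruned entirely at a given sparsity.
def max_feasible_sparsity(is_feasible, lo=0.0, hi=1.0, tol=1e-3):
    """Return the largest sparsity in [lo, hi] for which is_feasible holds."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if is_feasible(mid):
            lo = mid  # mid is feasible; try pruning more aggressively
        else:
            hi = mid  # mid removes a whole layer; back off
    return lo

# Toy example: suppose feasibility breaks beyond 78.24% sparsity.
print(round(max_feasible_sparsity(lambda s: s <= 0.7824), 3))  # → 0.782
```

As noted above, the returned sparsity only bounds pruning capability; the accuracy at that sparsity must still be evaluated by training.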
Loss Function and Metrics. Due to the page limit of the main paper, we provide here the loss functions and metrics used in our experiments. The standard cross-entropy function was used as the loss function for ShapeNet and UCF101. For BraTS’18, the weighted function in [39] combines cross-entropy with a soft Dice term,

L = L_CE(p, g) + w (1 - (1/C) Σ_{c=1}^{C} (2 Σ_i p_{c,i} g_{c,i}) / (Σ_i p_{c,i} + Σ_i g_{c,i})),   (12)

where w is an empiric weight for the dice loss, p is the prediction, g is the ground truth, and C is the number of classes. Meanwhile, ShapeNet accuracy was measured by mean IoU over each part of an object category [47] while IoU was adopted for BraTS’18. For UCF101 classification, top-1 and top-5 recall rates were used.
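As a minimal sketch of such a loss (assuming the common cross-entropy-plus-soft-Dice form; the exact weighting in [39] may differ), with `p` softmax predictions and `g` one-hot ground truth:

```python
import numpy as np

# Minimal sketch of a weighted cross-entropy + soft-Dice loss, assuming the
# common form; the exact weighting in [39] may differ.
# p: softmax predictions of shape (C, N); g: one-hot ground truth (C, N);
# w: empiric weight on the Dice term; C: number of classes.
def weighted_ce_dice(p, g, w=1.0, eps=1e-6):
    ce = -np.mean(np.sum(g * np.log(p + eps), axis=0))           # cross-entropy
    dice = np.mean(2 * np.sum(p * g, axis=1)
                   / (np.sum(p, axis=1) + np.sum(g, axis=1) + eps))
    return ce + w * (1.0 - dice)

# A perfect prediction drives both terms to (nearly) zero.
g = np.array([[1.0, 0.0], [0.0, 1.0]])
print(weighted_ce_dice(g, g) < 1e-3)  # True
```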
Appendix B Impacts of the Activation Function
In the following, we first establish the relation between MPMG and MNMG for calculating neuron importance given a homogeneous activation function, which includes but is not limited to the ReLU used in the 3D CNNs. Then we analyze the impact of such an activation function on the calculation of neuron importance by deriving the mask gradients on post-activations and pre-activations, illustrated in Figs. 5(b) and 5(a) respectively.
Proposition 1
For a network activation function σ: ℝ → ℝ that is a homogeneous function of degree 1, satisfying σ(cx) = cσ(x) for c ≥ 0, the neuron mask gradient equals the sum of the parameter mask gradients of this neuron.
Proof: Given a neuron mask c₁ placed before the activation function in Fig. 5(a), inputs x_j, weights w_{1j}, and the output of the 1st neuron a₁, we have

a₁ = σ(c₁ Σ_j w_{1j} x_j).   (13)

The gradient of the loss L over the neuron mask, evaluated at c₁ = 1, is

∂L/∂c₁ = (∂L/∂a₁) σ′(Σ_j w_{1j} x_j) Σ_j w_{1j} x_j.   (14)

Meanwhile, if setting masks c_{1j} on the weights of this neuron directly, we obtain

a₁ = σ(Σ_j c_{1j} w_{1j} x_j),   (15)

and the gradient of one weight mask, e.g., c_{11}, from the loss is

∂L/∂c_{11} = (∂L/∂a₁) σ′(Σ_j w_{1j} x_j) w_{11} x_1.   (16)

Similarly, summing over all weight masks of this neuron gives

Σ_j ∂L/∂c_{1j} = (∂L/∂a₁) σ′(Σ_j w_{1j} x_j) Σ_j w_{1j} x_j.   (17)

Clearly, Eq. 14 equals Eq. 17. Hence, the neuron mask gradient can be calculated from the parameter mask gradients. This completes the proof.
Furthermore, given such a homogeneous activation function as in Prop. 1, the importance of a post-activation equals the importance of its pre-activation. In more detail, for post-activation masks in Fig. 5(b), the output is

a₁ = c₁ σ(Σ_j w_{1j} x_j).   (18)

Since the activation function satisfies σ(cx) = cσ(x) for c ≥ 0,

c₁ σ(Σ_j w_{1j} x_j) = σ(c₁ Σ_j w_{1j} x_j).   (19)

The neuron importance determined by the neuron mask,

s₁ = |∂L/∂c₁|,   (20)

is therefore identical whether the mask is placed on the pre-activation or the post-activation.
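Proposition 1 can also be verified numerically. The sketch below uses finite differences on a single ReLU neuron with toy weights and inputs, taking the loss to be the neuron output itself:

```python
# Numerical check of Prop. 1 on one ReLU neuron with toy weights/inputs,
# using finite differences; the loss is taken to be the neuron output a1.
def relu(x):
    return max(x, 0.0)

w = [0.5, -1.2, 0.8]   # weights w_1j (toy values)
x = [1.0, 0.3, 2.0]    # inputs x_j (toy values)
eps = 1e-6

def out_neuron_mask(c):      # Eq. 13: one mask on the pre-activation
    return relu(c * sum(wi * xi for wi, xi in zip(w, x)))

def out_weight_masks(cs):    # Eq. 15: masks on individual weights
    return relu(sum(ci * wi * xi for ci, wi, xi in zip(cs, w, x)))

# Finite-difference gradients at all masks = 1.
g_neuron = (out_neuron_mask(1 + eps) - out_neuron_mask(1 - eps)) / (2 * eps)
g_weight_sum = 0.0
for j in range(len(w)):
    cp = [1.0, 1.0, 1.0]; cp[j] = 1 + eps
    cm = [1.0, 1.0, 1.0]; cm[j] = 1 - eps
    g_weight_sum += (out_weight_masks(cp) - out_weight_masks(cm)) / (2 * eps)

print(abs(g_neuron - g_weight_sum) < 1e-4)  # True: Eq. 14 equals Eq. 17
```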
Appendix C Resource Aware Reweighting Scheme
As described in Sec. 4.2 in the main paper, the reweighting of RANP is conducted by first balancing the layer-wise distribution of neuron importance and then adopting a resource importance ρ_l for layer l to further reduce resources. Since FLOPs and memory are the main resources of 3D CNNs, ρ_l is defined by FLOPs or memory as follows.

Generally, given input dimension c_{l-1} × D_l × H_l × W_l of the l-th layer with c_l output neurons, k × k × k filters, and output spatial dimensions D̂_l × Ĥ_l × Ŵ_l,

FLOPs_l = k³ c_{l-1} c_l D̂_l Ĥ_l Ŵ_l,   (21a)
Mem_l = c_l D̂_l Ĥ_l Ŵ_l + k³ c_{l-1} c_l,   (21b)

where k³ c_{l-1} is the number of multiplication operations of one filter at a single output location, and Eq. 21b sums the output activation elements and the filter parameters.
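Under the reconstructed Eq. (21) (an assumption: a stride-1 convolution where k³c_{l-1} multiplications are performed per output element, counted in elements rather than bytes), the per-layer resources can be computed as:

```python
# Per-layer FLOPs (Eq. 21a) and memory elements (Eq. 21b) of a 3D
# convolution, assuming stride 1; multiply element counts by bytes-per-
# element to obtain MB figures.
def conv3d_flops(c_in, c_out, k, D, H, W):
    # k^3 * c_in multiplications per output voxel, for c_out output channels
    return (k ** 3) * c_in * c_out * D * H * W

def conv3d_memory_elems(c_in, c_out, k, D, H, W):
    activations = c_out * D * H * W   # output feature-map elements
    params = (k ** 3) * c_in * c_out  # filter weights
    return activations + params

# Pruning neurons reduces c_out of this layer and c_in of the next layer,
# so FLOPs shrink roughly quadratically with the retained ratio.
print(conv3d_flops(2, 4, 3, 8, 8, 8))         # 110592
print(conv3d_memory_elems(2, 4, 3, 8, 8, 8))  # 2264
```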
Appendix D More Ablation Study
In this section, we add more experimental results for the analysis of selecting MPMG-sum as vanilla NP, of Glorot initialization compared with orthogonal initialization [19] for handling the imbalanced layer-wise distribution of neuron importance, and of the visualization of the neuron distribution by RANP for BraTS’18 in addition to that for ShapeNet in the main paper.
Figures in this section are for 3DUNets on ShapeNet and BraTS’18 because the 3DUNets used in our experiments clearly exhibit the neuron imbalance and memory issues and are easy to illustrate with a limited number of layers, i.e., 15 layers, while MobileNetV2 and I3D have more than 55 layers, many of which are not typical 3D convolutional layers with 3×3×3 filters.
D.1 MPMG-sum as Vanilla Neuron Pruning
ShapeNet, 3DUNet (metric: mIoU)
| Manner | Sparsity(%) | Param(MB) | GFLOPs | Memory(MB) | mIoU(%) |
| Full [5] | 0 | 62.26 | 237.85 | 997.00 | 83.79±0.21 |
| MPMG-mean | 68.10 | 5.08 | 110.14 | 819.97 | 83.33±0.18 |
| MPMG-max | 70.24 | 4.54 | 107.38 | 809.88 | 83.79±0.10 |
| MPMG-sum | 78.24 | 2.54 | 55.69 | 557.32 | 83.26±0.14 |
| MNMG-mean | 63.03 | 4.23 | 112.95 | 834.98 | 83.46±0.13 |
| MNMG-max | 73.93 | 3.67 | 103.57 | 796.44 | 83.51±0.08 |
| MNMG-sum | 66.93 | 4.29 | 100.34 | 783.14 | 83.65±0.02 |

BraTS’18, 3DUNet (metrics: ET / TC / WT)
| Manner | Sparsity(%) | Param(MB) | GFLOPs | Memory(MB) | ET(%) | TC(%) | WT(%) |
| Full [5] | 0 | 15.57 | 478.13 | 3628.00 | 72.96±0.60 | 73.51±1.54 | 86.79±0.35 |
| MPMG-mean | 65.64 | 1.48 | 226.86 | 3038.27 | 73.51±0.82 | 73.28±1.14 | 87.15±0.43 |
| MPMG-max | 75.78 | 0.83 | 189.43 | 2812.53 | 73.67±0.98 | 72.73±1.70 | 86.44±0.71 |
| MPMG-sum | 78.17 | 0.55 | 104.50 | 1936.44 | 71.94±1.68 | 69.39±2.29 | 84.68±0.78 |
| MNMG-mean | 63.85 | 1.08 | 176.76 | 2790.64 | 73.35±0.70 | 73.38±0.94 | 87.21±0.38 |
| MNMG-max | 80.05 | 0.59 | 169.99 | 2676.05 | 72.52±1.91 | 72.40±1.74 | 84.63±0.60 |
| MNMG-sum | 81.32 | 0.35 | 73.50 | 1933.20 | 64.48±1.10 | 68.47±1.59 | 80.71±1.07 |

UCF101, MobileNetV2 (metrics: Top1 / Top5)
| Manner | Sparsity(%) | Param(MB) | GFLOPs | Memory(MB) | Top1(%) | Top5(%) |
| Full [21] | 0 | 9.47 | 0.58 | 157.47 | 47.08±0.72 | 76.68±0.50 |
| MPMG-mean | 26.31 | 4.39 | 0.55 | 156.00 | 2.98±0.14 | 14.04±0.14 |
| MPMG-max | 29.48 | 3.96 | 0.54 | 155.38 | 3.49±0.12 | 13.64±0.10 |
| MPMG-sum | 33.15 | 6.35 | 0.55 | 155.17 | 46.32±0.79 | 75.42±0.60 |
| MNMG-mean | 38.91 | 2.79 | 0.50 | 147.69 | 29.13±0.92 | 62.93±1.37 |
| MNMG-max | 50.33 | 2.59 | 0.53 | 153.45 | 2.84±0.06 | 13.40±0.23 |
| MNMG-sum | 39.89 | 4.66 | 0.43 | 120.01 | 1.03±0.00 | 5.76±0.00 |

UCF101, I3D (metrics: Top1 / Top5)
| Manner | Sparsity(%) | Param(MB) | GFLOPs | Memory(MB) | Top1(%) | Top5(%) |
| Full [22] | 0 | 47.27 | 27.88 | 201.28 | 51.58±1.86 | 77.35±0.63 |
| MPMG-mean | 16.47 | 31.57 | 26.50 | 196.51 | 51.88±2.00 | 77.98±1.46 |
| MPMG-max | 19.83 | 30.06 | 26.31 | 195.62 | 52.44±1.25 | 78.08±1.27 |
| MPMG-sum | 25.32 | 29.93 | 25.76 | 192.42 | 51.57±1.46 | 78.07±1.34 |
| MNMG-mean | 35.36 | 16.69 | 15.37 | 124.85 | 49.26±0.96 | 75.70±1.49 |
| MNMG-max | 40.27 | 17.86 | 23.73 | 184.77 | 44.90±1.19 | 74.43±1.26 |
| MNMG-sum | 32.87 | 20.00 | 16.03 | 125.17 | 46.90±1.26 | 74.02±1.25 |
In Sec. 5.2 in the main paper, we select MPMG-sum as vanilla neuron pruning for the trade-off between computational resources and accuracy. To give a comprehensive study of this selection, we report detailed results of the mean, max, and sum operations of MPMG and MNMG in Table 6. Note that we relax the sum operation in Eq. 8 in the main paper to mean, max, and sum.

In Table 6, we aim at obtaining the maximum neuron sparsity, targeting reduced computational resources at an extreme sparsity level with minimal accuracy loss. For ShapeNet, MPMG-sum achieves the largest maximum neuron sparsity, 78.24%, among all variants with only 0.53% accuracy loss. Differently, for BraTS’18, MNMG-sum has the largest maximum neuron sparsity, 81.32%; however, its accuracy loss can reach up to 8.48%. In contrast, while MPMG-sum has the second-largest maximum neuron sparsity, 78.17%, its accuracy loss is much smaller than that of MNMG-sum. For UCF101, it is surprising that many variants have low accuracy. As analysed in the footnote of Table 6, at the extreme neuron sparsity some layers of the pruned networks retain only 1 neuron, losing sufficient features for learning and thus leading to low accuracy.
Hence, considering the overall performance of reducing resources while maintaining accuracy, MPMG-sum is selected as vanilla NP. Note that any neuron sparsity greater than the maximum neuron sparsity makes the pruned network infeasible by pruning whole layer(s).
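The contrast between the MPMG-sum and MNMG-sum scores can be sketched as follows. This is a simplification (the paper's reductions may additionally run over batch samples): per Prop. 1, the neuron mask gradient equals the signed sum of the parameter mask gradients, so an MNMG-sum-style score takes the magnitude after summing, while MPMG-style scores take magnitudes first:

```python
import numpy as np

# Sketch: one neuron's parameter mask gradients dL/dc_1j (toy values).
g = np.array([0.2, -0.5, 0.1])

# MPMG: magnitudes of Parameter Mask Gradients, then reduce (mean/max/sum).
mpmg = {"mean": np.abs(g).mean(), "max": np.abs(g).max(), "sum": np.abs(g).sum()}

# MNMG: Magnitude of the Neuron Mask Gradient; by Prop. 1 the neuron mask
# gradient is the signed sum of the parameter mask gradients.
mnmg_sum = abs(g.sum())

print(mpmg["sum"])  # ≈ 0.8 -- cancellation-free score
print(mnmg_sum)     # ≈ 0.2 -- signed cancellation shrinks the score
```

Sign cancellation is one plausible reason the two families rank neurons differently at extreme sparsity.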
D.2 Initialization for Neuron Imbalance
The imbalanced layer-wise distribution of neuron importance hinders pruning at a high sparsity level because whole layer(s) may be pruned. For 2D classification tasks in [19], orthogonal initialization effectively solves this problem by balancing the importance of parameters; however, it does not improve our neuron pruning results in 3D tasks and even leads to poorer pruning capability, with a lower maximum neuron sparsity than Glorot initialization [45]. This is briefly mentioned in Sec. 4.1 in the main paper. Here, we compare the resource reduction capability of Glorot initialization and orthogonal initialization.
| Dataset (Model) | Manner | Sparsity(%) | Param(MB) | GFLOPs | Mem(MB) |
| ShapeNet (3DUNet) | Full [5] | 0 | 62.26 | 237.85 | 997.00 |
| | Vanilla-ort | 70.53 | 4.40 | 72.65 | 630.00 |
| | Vanilla-xn | 70.53 | 4.56 | 73.22 | 618.35 |
| | RANP-f-ort | 70.53 | 5.40 | 21.73 | 366.29 |
| | RANP-f-xn | 70.53 | 5.52 | 15.06 | 328.66 |
| BraTS’18 (3DUNet) | Full [5] | 0 | 15.57 | 478.13 | 3628.00 |
| | Vanilla-ort [19] | 72.20 | 0.95 | 159.91 | 2240.33 |
| | Vanilla-xn | 72.20 | 0.92 | 130.28 | 2109.19 |
| | RANP-f-ort | 72.20 | 1.24 | 33.28 | 967.56 |
| | RANP-f-xn | 72.20 | 1.29 | 23.31 | 850.56 |
| UCF101 (MobileNetV2) | Full [21] | 0 | 9.47 | 0.58 | 157.47 |
| | Vanilla-ort [19] | 30.21 | 6.80 | 0.56 | 155.71 |
| | Vanilla-xn | 30.21 | 6.77 | 0.55 | 155.48 |
| | RANP-f-ort | 30.21 | 5.12 | 0.32 | 105.88 |
| | RANP-f-xn | 30.21 | 5.19 | 0.28 | 94.50 |
| UCF101 (I3D) | Full [22] | 0 | 47.27 | 27.88 | 201.28 |
| | Vanilla-ort [19] | 24.24 | 30.56 | 25.83 | 192.70 |
| | Vanilla-xn | 24.24 | 30.64 | 25.85 | 192.88 |
| | RANP-f-ort | 24.24 | 27.39 | 15.94 | 144.10 |
| | RANP-f-xn | 24.24 | 27.38 | 14.63 | 133.80 |
Resource reductions. In Table 7, vanilla neuron pruning (i.e., MPMG-sum) with Glorot initialization, i.e., vanilla-xn, achieves smaller FLOPs and memory consumption than with orthogonal initialization, i.e., vanilla-ort, except for FLOPs with 3DUNet on ShapeNet and I3D on UCF101. The exception of I3D on UCF101 is possibly caused by the high ratio of 1×1×1 filters in I3D, i.e., 37 out of 57 convolutional layers, because such filters can be regarded as 2D filters, which orthogonal initialization handles effectively [19]. While this ratio is also high in MobileNetV2, i.e., 34 out of 52 convolutional layers, it does not necessarily suffer the same problem as I3D since the result is also affected by the number of neurons in each layer. Note that since the 3DUNets used here contain only 3×3×3 filters, orthogonal initialization for 3DUNet is in most cases inferior to Glorot initialization according to our experiments. Meanwhile, in Table 7, the gap between vanilla-ort and vanilla-xn is very small on MobileNetV2 and I3D.
Moreover, with RANP-f and Glorot initialization, i.e., RANP-f-xn, more FLOPs and memory can be reduced than with orthogonal initialization, i.e., RANP-f-ort.
Balance of Neuron Importance Distribution. More importantly, with the reweighting by RANP in Fig. 7, the values of neuron importance are more balanced and stable than those of the vanilla neuron importance. This largely avoids the network infeasibility caused by pruning whole layer(s).
Now, we analyse the neuron distribution in terms of neuron importance values and network structures. Fig. 8 provides a detailed comparison between orthogonal and Glorot initialization via the paired subfigures in each column of Fig. 7. In Figs. 7(a) and 7(c), the vanilla neuron importance under Glorot initialization is more stable and compact than under orthogonal initialization. After applying the reweighting scheme of RANP-f, the importance values follow a similar trend, as shown in Figs. 7(b) and 7(d). Consequently, in Figs. 7(e) and 7(f), the neuron ratios are more balanced after the reweighting than without it, especially at the 8th layer. Thus, we choose Glorot initialization as the network initialization. Note that we adopt the same neuron sparsity for the two initialization experiments in Table 7 and Fig. 8.
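The balance-then-reweight idea discussed above can be sketched roughly as follows (an assumption-laden simplification, not the paper's exact formula, which is given in Sec. 4.2 of the main paper): normalize each layer's importance by its mean, then down-weight resource-hungry layers so pruning concentrates there:

```python
import numpy as np

# Rough sketch of resource-aware reweighting (a simplification; the exact
# scheme is in Sec. 4.2 of the main paper):
# 1) balance each layer's importance distribution by its mean,
# 2) divide by the layer's resource cost (e.g., relative FLOPs).
def reweight(importance_per_layer, resource_per_layer):
    reweighted = []
    for imp, res in zip(importance_per_layer, resource_per_layer):
        balanced = imp / imp.mean()        # layer-wise balancing
        reweighted.append(balanced / res)  # resource-aware scaling
    return reweighted

imp = [np.array([1.0, 3.0]), np.array([100.0, 300.0])]  # imbalanced layers
res = [1.0, 10.0]                                       # relative resource cost
out = reweight(imp, res)
print(out[0])  # → [0.5 1.5]
print(out[1])  # → [0.05 0.15]
```

After balancing, the two layers' scores are comparable despite the 100x gap in raw magnitudes; the resource term then biases pruning toward the expensive layer.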
D.3 Visualization of Balanced Neuron Distribution by RANP
In Fig. 9, the neuron importance by MPMG-sum is more balanced than that by MNMG-sum, which prevents networks pruned by MPMG-sum from becoming infeasible, that is, at least 1 neuron is retained in each layer.
In addition to the distribution of retained neuron ratios in Fig. 2 in the main paper for ShapeNet, which is also shown in the first row of Fig. 10, the last row of Fig. 10 shows the distribution for BraTS’18. Moreover, Fig. 11 illustrates the distribution of neurons retained in each layer by vanilla neuron pruning (i.e., vanilla NP) and RANP-f compared to the full network.
Clearly, upon pruning, the neurons in each layer are greatly reduced except in the last layer, where all neurons are retained to match the number of segmentation classes. In Fig. 11, vanilla NP retains very few neurons in, e.g., the 8th layer, resulting in low accuracy or network infeasibility. By contrast, the neuron distribution by RANP-f is more balanced, which improves the pruning capability.
Footnotes
 We concretely define “resource” as floating point operations (FLOPs) and memory required by one forward pass.
 Two layers of the MobileNetV2 pruned by MNMG-sum retain only 1 neuron due to the imbalanced layer-wise neuron importance distribution.
 The dimension order follows that of PyTorch.
 Here, we refer to a 3D filter with dimension .
 For MobileNetV2 pruned by MPMG-mean, MPMG-max, MNMG-max, and MNMG-sum, the accuracy is very low because 1) the neuron sparsity here is the extreme (largest) value, and a larger one would make the network infeasible by removing whole layer(s), and 2) the distribution of neuron importance is rather imbalanced, possibly caused by the high mixture of 1×1×1 and 3×3×3 kernels in MobileNetV2. In the pruned networks, we observe that for MPMG-mean, MPMG-max, and MNMG-max, the last convolutional layer has only 1 neuron retained; for MNMG-sum, 2 convolutional layers have only 1 neuron retained. Note that this imbalance issue can be greatly alleviated by the reweighting of our RANP, while we select MPMG-sum as vanilla NP merely according to the results in Table 6.
References
 H. Zhang, K. Jiang, Y. Zhang, Q. Li, C. Xia, and X. Chen, “Discriminative feature learning for video semantic segmentation,” International Conference on Virtual Reality and Visualization, 2014.
 C. Zhang, W. Luo, and R. Urtasun, “Efficient convolutions for realtime semantic segmentation of 3D point cloud,” 3DV, 2018.
 S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” TPAMI, 2013.
 R. Hou, C. Chen, R. Sukthankar, and M. Shah, “An efficient 3D CNN for action/object segmentation in video,” BMVC, 2019.
 O. Cicek, A. Abdulkadir, S. Lienkamp, T. Brox, and O. Ronneberger, “3D UNet: Learning dense volumetric segmentation from sparse annotation,” MICCAI, 2016.
 F. Zanjani, D. Moin, B. Verheij, F. Claessen, T. Cherici, T. Tan, and P. With, “Deep learning approach to semantic segmentation in 3D point cloud intraoral scans of teeth,” Proceedings of Machine Learning Research, 2019.
 J. Kleesiek, G. Urban, A. Hubert, D. Schwarz, K. Hein, M. Bendszus, and A. Biller, “Deep MRI brain extraction: A 3D convolutional neural network for skull stripping,” NeuroImage, 2016.
 A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. FeiFei, “Largescale video classification with convolutional neural networks,” CVPR, 2014.
 K. Simonyan and A. Zisserman, “Twostream convolutional networks for action recognition in videos,” NeurIPS, 2014.
 L. Yi, V. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas, “A scalable active framework for region annotation in 3D shape collections,” SIGGRAPH Asia, 2016.
 B. Menze, A. Jakab, et al., “The multimodal brain tumor image segmentation benchmark (BraTS),” IEEE Transactions on Medical Imaging, 2015.
 B. Graham, M. Engelcke, and L. Maaten, “3D semantic segmentation with submanifold sparse convolutional networks,” CVPR, 2018.
 C. Qi, H. Su, K. Mo, and L. Guibas, “Pointnet: Deep learning on point sets for 3D classification and segmentation,” CVPR, 2017.
 Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” NeurIPS, 2016.
 X. Dong, S. Chen, and S. Pan, “Learning to prune deep neural networks via layerwise optimal brain surgeon,” NeurIPS, 2017.
 Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” ICCV, 2017.
 S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” NeurIPS, 2015.
 N. Lee, T. Ajanthan, and P. Torr, “SNIP: Singleshot network pruning based on connection sensitivity,” ICLR, 2019.
 N. Lee, T. Ajanthan, S. Gould, and P. Torr, “A signal propagation perspective for pruning neural networks at initialization,” ICLR, 2020.
 S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos, “Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features,” Nature Scientific Data, 2017.
 M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” CVPR, 2018.
 J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the Kinetics dataset,” CVPR, 2017.
 C. Chen, F. Tung, N. Vedula, and G. Mori, “Constraintaware deep neural network compression,” ECCV, 2018.
 N. Yu, S. Qiu, X. Hu, and J. Li, “Accelerating convolutional neural networks by groupwise 2Dfilter pruning,” IJCNN, 2017.
 H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
 R. Yu, A. Li, C. Chen, J. Lai, V. Morariu, X. Han, M. Gao, C. Lin, and L. Davis, “NISP: Pruning networks using neuron importance score propagation,” CVPR, 2018.
 Z. Huang and N. Wang, “Datadriven sparse structure selection for deep neural networks,” ECCV, 2018.
 Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” ECCV, 2018.
 M. Zhang and B. Stadie, “Oneshot pruning of recurrent neural networks by jacobian spectrum evaluation,” arXiv:1912.00120, 2019.
 C. Li, Z. Wang, X. Wang, and H. Qi, “Singleshot channel pruning based on alternating direction method of multipliers,” arXiv:1902.06382, 2019.
 J. Yu and T. Huang, “Autoslim: Towards oneshot architecture search for channel numbers,” arXiv:1903.11728, 2019.
 C. Wang, G. Zhang, and R. Grosse, “Picking winning tickets before training by preserving gradient flow,” ICLR, 2020.
 P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” ICLR, 2017.
 Y. Zhang, H. Wang, Y. Luo, L. Yu, H. Hu, H. Shan, and T. Quek, “Three dimensional convolutional neural network pruning with regularizationbased method,” ICIP, 2019.
 H. Chen, Y. Wang, H. Shu, Y. Tang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu, “Frequency domain compact 3D convolutional neural networks,” CVPR, 2020.
 A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, and E. Choi, “Morphnet: Fast & simple resourceconstrained structure learning of deep networks,” CVPR, 2018.
 W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” NeurIPS, 2016.
 G. Riegler, A. Ulusoy, and A. Geiger, “OctNet: Learning deep 3D representations at high resolutions,” CVPR, 2017.
 P. Kao, T. Ngo, A. Zhang, J. Chen, and B. Manjunath, “Brain tumor segmentation and tractographic feature extraction from structural MR images for overall survival prediction,” Workshop on MICCAI, 2018.
 K. Soomro, A. Zamir, and M. Shah, “UCF101: A dataset of 101 human action classes from videos in the wild,” CRCV Technical Report, 2012.
 O. Kopuklu, N. Kose, A. Gunduz, and G. Rigoll, “Resource efficient 3D convolutional neural networks,” ICCVW, 2019.
 I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” ICML, 2013.
 D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.
 S. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” ICLR, 2018.
 X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
 A. Saxe, J. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” ICLR, 2014.
 L. Yi, L. Shao, and M. Savva, “Largescale 3D shape reconstruction and segmentation from shapenet core55,” arXiv preprint arXiv:1710.06104, 2017.