Localization-aware Channel Pruning for Object Detection

Abstract

Channel pruning is one of the important methods for deep model compression. Most existing pruning methods mainly focus on classification, and few of them conduct systematic research on object detection. However, object detection is different from classification: it requires not only semantic information but also localization information. In this paper, based on discrimination-aware channel pruning (DCP), a state-of-the-art pruning method for classification, we propose a localization-aware auxiliary network to find the channels with key information for classification and regression, so that we can conduct channel pruning directly for object detection, which saves a lot of time and computing resources. In order to capture the localization information, we first design the auxiliary network with a contextual RoIAlign layer, which obtains precise localization information of the default boxes by pixel alignment and enlarges the receptive fields of the default boxes when pruning shallow layers. Then, we construct a loss function for the object detection task which tends to keep the channels that contain the key information for classification and regression. Extensive experiments demonstrate the effectiveness of our method. On MS COCO, we prune 70% of the parameters of an SSD based on ResNet-50 with a modest accuracy drop, outperforming the state-of-the-art method.

Keywords: channel pruning, object detection, localization-aware

1 Introduction

Since AlexNet Krizhevsky et al. (2012) won the ImageNet Challenge (ILSVRC 2012) Russakovsky et al. (2015), deep convolutional neural networks (CNNs) have been widely applied to various computer vision tasks, from basic image classification He et al. (2016) to more advanced applications, e.g., object detection Liu et al. (2016); Ren et al. (2015), semantic segmentation Noh et al. (2015), video analysis Wang et al. (2016) and many others. In these fields, CNNs have achieved state-of-the-art performance compared with traditional methods based on manually designed visual features.

However, deep models often have a huge number of parameters and very large model sizes, which incur not only a huge memory requirement but also an unbearable computation burden. As a result, a typical deep model is hard to deploy on resource-constrained devices, e.g., mobile phones or embedded gadgets. To make CNNs available on resource-constrained devices, there are lots of studies on model compression, which aims to reduce the model redundancy without significant degeneration in performance. Channel pruning He et al. (2017b); Luo et al. (2017); Jiang et al. (2018) is one of the important methods. Different from simply making sparse connections Han et al. (2015a, b), channel pruning reduces the model size by directly removing redundant channels and can achieve fast inference without special software or hardware implementation.

In order to determine which channels to reserve, existing reconstruction-based methods He et al. (2017b); Luo et al. (2017); Jiang et al. (2018) usually minimize the reconstruction error of feature maps between the original model and the pruned one. However, a well-reconstructed feature map may not be optimal, because there is a gap between an intermediate feature map and the performance of the final output. Channels carrying redundant information could be mistakenly kept in order to minimize the reconstruction error of feature maps. To find the channels with true discriminative power for the network, DCP Zhuang et al. (2018) conducts channel selection by introducing additional discrimination-aware losses that are actually correlated with the final performance. It constructs the discrimination-aware losses with a fully connected layer which works on the entire feature map. However, the discrimination-aware loss of DCP is designed for the classification task. Since an object detection network uses a classification network as its backbone, a simple way to apply DCP to object detection is to fine-tune the pruned model, which was pruned on a classification dataset, for the object detection task. But the information that the two tasks need is not exactly the same: the classification task needs strong semantic information, while the object detection task needs not only semantic information but also localization information. Hence, the existing training scheme may not be optimal due to the mismatched goals of feature learning for the classification and object detection tasks.

In this paper, we propose a method called localization-aware channel pruning (LCP), which conducts channel pruning directly for object detection. We propose a localization-aware auxiliary network for the object detection task. First, we design the auxiliary network with a contextual RoIAlign layer, which can obtain precise localization information of the default boxes by pixel alignment and enlarges the receptive fields of the default boxes when pruning shallow layers. Then, we construct a loss function for the object detection task which tends to keep the channels that contain the key information for classification and regression. Our main contributions are summarized as follows. (1) We propose a localization-aware auxiliary network which can find the channels with key information, so that we can conduct channel pruning directly on an object detection dataset, which saves a lot of time and computing resources. (2) We propose a contextual RoIAlign layer which enlarges the receptive fields of the default boxes in shallow layers. (3) Extensive experiments on benchmark datasets show that the proposed method is theoretically reasonable and practically effective. For example, our method can prune 70% of the parameters of an SSD Liu et al. (2016) based on ResNet-50 He et al. (2016) with a modest accuracy drop on VOC2007, which outperforms the state-of-the-art method.

2 Related Works

2.1 Network Quantization

Network quantization compresses the original network by reducing the number of bits required to represent each weight. Han et al. Han et al. (2015a) propose a complete deep network compression pipeline: first prune the unimportant connections and retrain the sparsely connected network; weight sharing is then used to quantize the weights of the connections, and finally the quantized weights and the codebook are Huffman encoded to further reduce the model size. Courbariaux et al. Courbariaux et al. (2016) propose to accelerate the model by reducing the precision of the weights and activations, since this greatly reduces the memory size and the number of memory accesses of the network, and replaces most arithmetic operations with bit-wise operations. Li et al. Li et al. (2016a) consider that ternary weights have better generalization capability than binary ones and that the distribution of weights is close to a combination of a normal distribution and a uniform distribution. Zhou et al. Zhou et al. (2017) propose a method which can convert a full-precision CNN into a low-precision network whose weights are zero or powers of two, without loss of accuracy or even with higher accuracy (so that multiplications can be replaced by shifts on embedded devices such as FPGAs). For more recent works, to reduce the communication complexity, Yu et al. Yu et al. (2019) propose a general scheme for quantizing both model parameters and gradients. Zhao et al. Zhao et al. (2019) attend to the statistical properties of sparse CNNs and present focused quantization, a novel quantization strategy based on power-of-two values, which exploits the weight distributions after fine-grained pruning.

2.2 Sparse or Low-rank Connections

Wen et al. Wen et al. (2016) propose a learning method called Structured Sparsity Learning, which can learn a sparse structure to reduce the computational cost, and the learned structured sparsity can be effectively accelerated in hardware. Guo et al. Guo et al. (2016) propose a new network compression method, called dynamic network surgery, which reduces network complexity through dynamic connection pruning. Unlike previous greedy pruning methods, this approach incorporates connection splicing throughout the process to avoid incorrect pruning and to maintain the network. Jin et al. Jin et al. (2016) propose to reduce the computational complexity of the model by training a sparse network: by adding a norm penalty on the weights to the loss function of the network, the weights can be made sparse. For more recent works, Kim et al. Kim et al. (2019) propose novel accuracy metrics to represent the accuracy and complexity relationship for a given neural network and use these metrics in a non-iterative fashion to obtain the right rank configuration which satisfies the constraints on FLOPs and memory while maintaining sufficient accuracy. Liu et al. Liu et al. () propose a layerwise sparse coding (LSC) method to maximize the compression ratio by extremely reducing the amount of meta-data.

2.3 Channel Pruning

Finding unimportant weights in networks has a long history. LeCun et al. LeCun et al. (1990) and Hassibi and Stork Hassibi and Stork (1993) consider that using the Hessian, which contains second-order derivatives, performs better than using the magnitude of the weights; however, computing the Hessian is expensive, so these methods are not widely used. Han et al. Han et al. (2015a) proposed an iterative pruning method to remove the redundancy in deep models. Their main insight is that small-weight connections below a threshold should be discarded. In practice, this can be aided by applying $\ell_1$ or $\ell_2$ regularization to push the connection weights to become smaller. The major weakness of this strategy is the loss of universality and flexibility, and thus it seems to be less practical in real applications. Li et al. Li et al. (2016b) measure the importance of channels by calculating the sum of the absolute values of the weights. Hu et al. Hu et al. (2016) define the average percentage of zeros (APoZ) to measure the activation of neurons; neurons with higher APoZ values are considered more redundant in the network. With a sparsity regularizer in the objective function Alvarez and Salzmann (2016); Liu et al. (2017), training-based methods are proposed to learn compact models in the training phase. With the consideration of efficiency, reconstruction-based methods He et al. (2017b); Luo et al. (2017) transform the channel selection problem into the optimization of a reconstruction error and solve it by a greedy algorithm or LASSO regression. DCP Zhuang et al. (2018) aims at selecting the most discriminative channels for each layer by considering both the reconstruction error and a discrimination-aware loss. PP Singh et al. (2019) jointly prunes and fine-tunes CNN model parameters with an adaptive pruning rate. Liu et al. Liu et al. (2019) propose a novel meta-learning approach for automatic channel pruning of very deep neural networks. AutoPrune Xiao et al. (2019) prunes the network by optimizing a set of trainable auxiliary parameters instead of the original weights.

2.4 Object Detection

Current state-of-the-art object detectors with deep learning can be mainly divided into two major categories: two-stage detectors and one-stage detectors. Two-stage detectors first generate region proposals which may potentially be objects and then make predictions for these proposals. Faster R-CNN Ren et al. (2015) is a representative two-stage detector, which was able to make predictions at 5 FPS on a GPU and achieved state-of-the-art results on many public benchmark datasets, such as PASCAL VOC 2007, 2012 and MS COCO. Currently, there are a huge number of detector variants based on Faster R-CNN for different purposes Lin et al. (2017a); Cai and Vasconcelos (2018). Mask R-CNN He et al. (2017a) extends Faster R-CNN to the field of instance segmentation. Based on Mask R-CNN, Huang et al. Huang et al. (2019) proposed a mask-quality-aware framework, named Mask Scoring R-CNN, which learns the quality of the predicted masks and calibrates the misalignment between mask quality and mask confidence score.

Different from two-stage detectors, one-stage detectors do not generate proposals and directly make predictions on the whole feature map. SSD Liu et al. (2016) is a representative one-stage detector. However, the class imbalance between foreground and background is a severe problem in one-stage detectors. RetinaNet Lin et al. (2017b) mitigates the class imbalance problem by introducing the focal loss, which reduces the loss of easy samples. Previous approaches required manually designed anchor boxes to train a detector. Recently, a series of anchor-free object detectors were developed, whose goal is to predict keypoints of the bounding box instead of trying to fit an object to an anchor. Law and Deng proposed a novel anchor-free framework, CornerNet Law and Deng (2018), which detects objects as a pair of corners. Later, several other variants of anchor-free detectors followed Zhou et al. (2019); Duan et al. (2019).

3 Proposed Method

Fig. 1 shows the overall framework. The blue network in Fig. 1 is the original model, which could be any detector, such as SSD Liu et al. (2016), Faster R-CNN Ren et al. (2015) and so on. The orange network in Fig. 1 is the model to be pruned, which is initialized exactly the same as the original model. In the middle is the auxiliary network. We use the auxiliary network to prune the model by constructing the localization-aware loss. The localization-aware loss consists of two parts: one part is the reconstruction error and the other is the loss of the auxiliary network; their details will be discussed later. After the loss is constructed, we fine-tune the network and use the gradient of the localization-aware loss to decide which channels to preserve. After repeating this operation layer by layer, the pruning is finished.

The auxiliary network we propose mainly consists of two parts. First, a contextual RoIAlign layer is designed to extract the features of the boxes. Then, a loss is designed for object detection task which can reserve the important channels. The details of the proposed approach are elaborated below.

Figure 1: Illustration of localization-aware channel pruning. The auxiliary network is used to supervise layer-wise channel selection. The blue network is the original model, the orange network is the model to be pruned, and the green network is the auxiliary network. The reconstruction error and the auxiliary network are used to construct the localization-aware loss.

3.1 Contextual RoIAlign Layer

For the object detection task, if we predict the bounding boxes directly on the entire feature maps, there will be a huge number of parameters and unnecessary noise. So, it is important to extract the features of regions of interest (RoIs), which can be better used for classification and regression. To obtain precise localization information and find the channels which are important for classification and regression, a RoIAlign layer is a good choice, since it properly aligns the extracted features with the input. RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin and aggregates the results (using max or average); see Fig. 2 for details. However, the default boxes generated by the detector do not always completely cover the object area. From Fig. 3, we can see that the default box is sometimes bigger than the ground truth and sometimes smaller than it. So the receptive field may be insufficient if we only extract the features of the default box, especially when we prune the shallow layers. To solve this problem, we propose a contextual RoIAlign layer, which introduces larger context information. The orange part in Fig. 4 is the feature extraction network. The network first extracts the feature map of the whole image, then obtains the features of the boxes by the RoIAlign operation. To introduce larger context information, we further gather the information of the default box and its context by adding the features of the default box and of its smallest enclosing convex object.

Figure 2: RoIAlign: The dashed grid represents a feature map, the solid lines an RoI (with 2×2 bins in this example), and the dots the 4 sampling points in each bin. RoIAlign computes the value of each sampling point by bilinear interpolation from the nearby grid points on the feature map.
Figure 3: The features of the default boxes do not always contain enough context information, especially when we prune shallow layers. The blue box is the default box and the red box is the ground truth.
Figure 4: Contextual RoIAlign: The red, blue and orange boxes represent the ground truth, the default box and its smallest enclosing box, respectively. The network first extracts the feature map of the whole image, then obtains the features of the boxes by the RoIAlign operation. To introduce larger context information, we further gather the information of the default box and its context by adding the features of the default box and of its smallest enclosing convex object.

For a better description of the algorithm, some notations are given first. For a training sample, let $A = (x^{A}_{1}, y^{A}_{1}, x^{A}_{2}, y^{A}_{2})$ represent the coordinates of the ground-truth box $A$ and $B = (x^{B}_{1}, y^{B}_{1}, x^{B}_{2}, y^{B}_{2})$ denote the coordinates of the matched default box $B$. We further use $F$ to denote the feature map, $F(\cdot)$ to represent the features of an area, and $R(\cdot)$ to represent the RoIAlign operation. First, we calculate the IoU of boxes $A$ and $B$:

$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$   (1)

$B$ is a positive sample only if its IoU is larger than a preset threshold. We do not conduct contextual RoIAlign for $B$ when $B$ is a negative sample. If $B$ is a positive sample, we then calculate the smallest enclosing convex object $C$ of $A$ and $B$:

$x^{C}_{1} = \min(x^{A}_{1}, x^{B}_{1})$   (2)

$y^{C}_{1} = \min(y^{A}_{1}, y^{B}_{1})$   (3)

$x^{C}_{2} = \max(x^{A}_{2}, x^{B}_{2})$   (4)

$y^{C}_{2} = \max(y^{A}_{2}, y^{B}_{2})$   (5)

where $(x^{C}_{1}, y^{C}_{1}, x^{C}_{2}, y^{C}_{2})$ are the coordinates of $C$. Finally, the output of the contextual RoIAlign layer is defined as:

$O = R(F(B)) + R(F(C))$   (6)

Now we can get the precise features of default box $B$; the process is illustrated in Fig. 4.
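As an illustration, the contextual RoIAlign of Eq. (1)-(6) could be sketched with torchvision as below; the feature-map stride, pooled size, IoU threshold and all function names are assumptions made for this example rather than the authors' implementation.

```python
# Minimal sketch of Eq. (1)-(6): pool the default box and, for positive samples,
# also its smallest enclosing box, then add the two pooled features.
import torch
from torchvision.ops import roi_align, box_iou


def contextual_roi_align(feat, gt_box, default_box, stride=8, out_size=(7, 7), iou_thresh=0.5):
    """feat: (1, C, H, W) feature map; boxes: (x1, y1, x2, y2) in input-image coordinates."""
    iou = box_iou(gt_box[None], default_box[None])[0, 0]           # Eq. (1)
    rois = default_box[None]                                       # negative sample: default box only
    if iou > iou_thresh:                                           # positive sample
        enclosing = torch.stack([                                  # Eq. (2)-(5): smallest enclosing box C
            torch.min(gt_box[0], default_box[0]),
            torch.min(gt_box[1], default_box[1]),
            torch.max(gt_box[2], default_box[2]),
            torch.max(gt_box[3], default_box[3]),
        ])
        rois = torch.stack([default_box, enclosing])
    pooled = roi_align(feat, [rois], out_size,
                       spatial_scale=1.0 / stride, sampling_ratio=2)
    return pooled.sum(dim=0, keepdim=True)                         # Eq. (6): R(F(B)) + R(F(C))


# Toy usage with random features and hand-made boxes.
feat = torch.randn(1, 256, 38, 38)
gt = torch.tensor([40., 40., 120., 160.])
default = torch.tensor([48., 52., 110., 150.])
box_feat = contextual_roi_align(feat, gt, default)                 # shape (1, 256, 7, 7)
```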

3.2 Construction of the Loss of Auxiliary Network

After constructing the contextual RoIAlign layer, we need to construct a loss for the object detection task so that we can use the gradient of the auxiliary network to conduct model pruning. The details are discussed below.

Considering that the object detection task needs both classification and localization information, the loss of the auxiliary network consists of two parts: one is the loss for classification and the other is the loss for regression. For the classification part, we still use the discrimination-aware loss, which has been proved to be useful for the classification task. For the regression part, we choose the GIoU Rezatofighi et al. (2019) loss function. It is reasonable to use GIoU as the loss function for box regression: it considers not only the overlapping area but also the non-overlapping area, which better reflects the overlap of the boxes. The GIoU of two arbitrary shapes $A$ and $B$ is defined as:

Figure 5: $A$ and $B$ are two arbitrary shapes, $C$ is the smallest enclosing convex object of $A$ and $B$, and $D$ is the union of $A$ and $B$.
$\mathrm{GIoU} = \mathrm{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}$   (7)

where $\mathrm{IoU}$ and the enclosing box $C$ are calculated by Eq. 1 - Eq. 5. Fig. 5 is a schematic diagram of GIoU. We use $\mathrm{GIoU}_{i}$ to denote the GIoU of the $i$-th predicted box and the ground truth, and $CE_{i}$ to represent the cross-entropy loss of the $i$-th predicted box. In the pruning stage, $L_{cls}$ represents the classification loss, $L_{reg}$ represents the regression loss, and $L_{aux}$ represents the localization-aware loss of the auxiliary network. Finally, the loss of the positive samples in the pruning stage is defined as:

$L_{cls} = \sum_{i} CE_{i}$   (8)

$L_{reg} = \sum_{i} (1 - \mathrm{GIoU}_{i})$   (9)

$L_{aux} = L_{cls} + \lambda L_{reg}$   (10)

where $\lambda$ is a constant coefficient.
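As a concrete illustration of Eq. (8)-(10), the auxiliary loss can be sketched in PyTorch as below; the helper names, the tensor shapes and the default weighting are assumptions for this example, not the authors' code.

```python
# Sketch of the auxiliary loss: cross-entropy for classification plus a
# GIoU-based term for regression over the P positive samples.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou


def auxiliary_loss(cls_logits, labels, pred_boxes, gt_boxes, lam=1.0):
    """cls_logits: (P, num_classes); labels: (P,); pred/gt_boxes: (P, 4), matched pairwise."""
    l_cls = F.cross_entropy(cls_logits, labels, reduction="sum")       # Eq. (8)
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))       # Eq. (7), one value per matched pair
    l_reg = (1.0 - giou).sum()                                         # Eq. (9)
    return l_cls + lam * l_reg                                         # Eq. (10), lam plays the role of the coefficient
```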

3.3 Localization-aware Channel Pruning

After we construct the auxiliary network and the localization-aware loss, we can conduct channel pruning with them layer by layer. The pruning process of the whole model is described in Algorithm 1. For a better description of the channel selection algorithm, some notations are given first. Consider an $L$-layer CNN model and suppose we are pruning the $l$-th layer. Let $X^{l-1}$ represent the output feature map of the $(l-1)$-th layer (i.e., the input of the $l$-th layer), $W^{l}$ denote the convolution filter of the $l$-th layer of the pruned model, and $*$ represent the convolution operation. We further use $O^{l} \in \mathbb{R}^{C \times H \times W}$ to denote the output feature maps of the $l$-th layer of the original model. Here, $C$, $H$ and $W$ represent the number of output channels, the height and the width of the feature maps, respectively. Finally, we use $L^{p}_{cls}$ and $L^{p}_{reg}$ to denote the classification loss and the regression loss of the pruned network.

To find out the channels which really contribute to the network, we should first fine-tune the auxiliary network and the pruned network; the fine-tuning loss is defined as the sum of their losses:

$L_{f} = L_{aux} + L^{p}_{cls} + L^{p}_{reg}$   (11)

In order to minimize the reconstruction error of a layer, we introduce a reconstruction loss, as DCP does, which can be defined as the Euclidean distance of the feature maps between the original model and the pruned one:

$L_{rec} = \frac{1}{2N} \left\| O^{l} - W^{l}_{S} * X^{l-1} \right\|_{F}^{2}$   (12)

where $N$ is the number of entries in $O^{l}$, $S$ represents the selected channels, and $W^{l}_{S}$ represents the submatrix of $W^{l}$ indexed by $S$.
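A minimal sketch of the reconstruction term in Eq. (12) is given below; the layer shapes, the stride/padding, and the way the selected channels enter the convolution (masking the unselected input channels of the filter) are one plausible reading of the equation, stated here as assumptions rather than the authors' implementation.

```python
# Sketch of Eq. (12): squared distance between the baseline feature map and the
# feature map produced by the layer restricted to the selected channels S.
import torch
import torch.nn.functional as F


def reconstruction_loss(x, weight, baseline_out, selected, stride=1, padding=1):
    """x: input of layer l; weight: (C_out, C_in, k, k); selected: kept input-channel indices."""
    mask = torch.zeros(weight.shape[1], dtype=weight.dtype, device=weight.device)
    mask[selected] = 1.0
    w_s = weight * mask.view(1, -1, 1, 1)            # keep only the selected channels of W^l
    out = F.conv2d(x, w_s, stride=stride, padding=padding)
    n = baseline_out.numel()                          # N: number of entries in the baseline output
    return (out - baseline_out).pow(2).sum() / (2.0 * n)
```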

Input: number of layers $L$, weights of the original model $\{W^{l}\}_{l=1}^{L}$, the training set $\mathcal{D}$, the pruning rate $r$.
Output: weights $\{\widetilde{W}^{l}\}_{l=1}^{L}$ of the pruned model.

1:  Initialize $\widetilde{W}^{l}$ with $W^{l}$ for $l = 1, \dots, L$
2:  for $l = 1, \dots, L$ do
3:     Construct the fine-tuning loss as shown in Eq. 11
4:     Fine-tune the auxiliary network and the pruned model with $L_{f}$
5:     Construct the joint loss as shown in Eq. 13
6:     Conduct channel selection for layer $l$ by Eq. 14
7:     Update $\widetilde{W}^{l}$ w.r.t. the selected channels by Eq. 15
8:  end for
9:  return the pruned model
Algorithm 1 The proposed method

Taking into account both the reconstruction error and the localization-aware loss of the auxiliary network, the problem of channel pruning can be formulated as minimizing the following joint loss function:

$\min_{S}\ L = L_{rec} + \gamma L_{aux}, \quad \text{s.t.}\ |S| = k$   (13)

where $\gamma$ is a constant and $k$ is the number of channels to be selected. Directly optimizing Eq. 13 is NP-hard. Following the general greedy method in DCP, we conduct channel pruning by considering the gradient of Eq. 13. Specifically, the importance of the $c$-th channel is defined as:

$s_{c} = \left\| \frac{\partial L}{\partial W^{l}_{c}} \right\|_{F}^{2}$   (14)

Figure 6: The percentage of the gradients generated by the three loss functions.

where $W^{l}_{c}$ denotes the parameters of $W^{l}$ associated with the $c$-th channel, so that $s_{c}$ is the square sum of the gradients of the $c$-th channel. We then reserve the $k$ channels with the largest importance and remove the others. After this, the selected channels are further optimized by stochastic gradient descent (SGD). $W^{l}_{S}$ is updated by:

$W^{l}_{S} \leftarrow W^{l}_{S} - \eta \frac{\partial L}{\partial W^{l}_{S}}$   (15)

where $\eta$ represents the learning rate. After updating $W^{l}_{S}$, the channel pruning of a single layer is finished.
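For illustration, the greedy selection and update of Eq. (13)-(15) could look like the sketch below; the joint-loss callable, the keep ratio and the learning rate are assumptions, and the actual method evaluates the loss through the pruned network and the auxiliary branch rather than from the weight tensor alone.

```python
# Sketch of Eq. (13)-(15): rank channels by the squared gradient of the joint loss
# w.r.t. their filter weights, keep the top-k, then take one SGD step on them.
import torch


def select_channels(conv_weight, joint_loss_fn, keep_ratio=0.3, lr=0.001):
    """conv_weight: (C_out, C_in, k, k) filter of the layer being pruned.
    joint_loss_fn: callable returning the joint loss L as a function of the weights."""
    w = conv_weight.detach().clone().requires_grad_(True)
    loss = joint_loss_fn(w)                        # L = L_rec + gamma * L_aux, Eq. (13)
    grad, = torch.autograd.grad(loss, w)
    importance = grad.pow(2).sum(dim=(0, 2, 3))    # Eq. (14): square sum per input channel
    k = max(1, int(keep_ratio * importance.numel()))
    selected = torch.topk(importance, k).indices   # reserve the k most important channels
    with torch.no_grad():                          # Eq. (15): SGD step on the kept sub-matrix
        w[:, selected] -= lr * grad[:, selected]
        w_pruned = torch.zeros_like(w)
        w_pruned[:, selected] = w[:, selected]
    return w_pruned.detach(), selected
```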

4 Experiments

We evaluate LCP on the popular 2D object detector SSD Liu et al. (2016). Several state-of-the-art methods are adopted as baselines, including ThiNet Luo et al. (2017) and DCP Zhuang et al. (2018). In order to verify the effectiveness of our method, we use VGG and ResNet as the feature extraction networks, respectively.

4.1 Dataset and Evaluation

The results of all baselines are reported on standard object detection benchmarks, i.e., PASCAL VOC Everingham et al. (2010). PASCAL VOC2007 and 2012: the Pascal Visual Object Classes (VOC) benchmark is one of the most widely used datasets for classification, object detection and semantic segmentation. We use the union of the VOC2007 and VOC2012 trainval sets as the training set, which contains 16551 images with objects from 20 pre-defined categories annotated with bounding boxes, and we use the VOC2007 test set, which contains 4952 images, as the test set. In order to verify the effectiveness of our method on PASCAL VOC, we first compare our method only with ThiNet based on VGG-16, because the authors of DCP do not release a VGG model. Then, we compare our method with DCP and ThiNet based on ResNet-50, and we conduct the ablation experiments of our method on PASCAL VOC. To more fully verify the effectiveness of our method, we also perform experiments on the MS COCO2017 dataset.

In this paper, we use the mean average precision (mAP) at an IoU threshold of 0.5 for all experiments on PASCAL VOC. For experiments on MS COCO, the main performance measure used in this benchmark is AP, which averages mAP across different IoU thresholds, i.e., from 0.5 to 0.95 with a step of 0.05.

4.2 Implementation details

Our experiments are based on SSD. We use VGGNet and ResNet as the feature extraction networks for the experiments. For ThiNet, we implement it for object detection. All three methods prune the same number of channels in each layer. Other common parameters are described in detail below.

For VGGNet Simonyan and Zisserman (2014), we use VGG-16 without Batch Normalization layers and prune the SSD from conv1-1 to conv5-3. The network is fine-tuned for 10 epochs every time a layer is pruned; the learning rate starts at 0.001 and is divided by 10 at epoch 5. After the model is pruned, we fine-tune it for 60k iterations; the learning rate starts at 0.0005 and is divided by 10 at iterations 30k and 45k, respectively.

For ResNet He et al. (2016), we use the layers of ResNet-50 from conv1-x to conv4-x for feature extraction. The network is fine-tuned for 15 epochs every time a layer is pruned; the learning rate starts at 0.001 and is divided by 10 at epochs 5 and 10, respectively. After the model is pruned, we fine-tune it for 120k iterations; the learning rate starts at 0.001 and is divided by 10 at iterations 80k and 100k, respectively.

For the loss of the auxiliary network, we set $\lambda$ to 50.
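The fine-tuning schedules above map naturally onto SGD with a step-wise learning-rate decay. The following PyTorch sketch shows the post-pruning ResNet-50 setting (120k iterations, learning rate 0.001 decayed by 10x at 80k and 100k); the momentum and weight-decay values and the placeholder module are assumptions, not the authors' configuration.

```python
# Illustrative optimizer/scheduler setup for the post-pruning fine-tuning stage.
import torch

pruned_ssd = torch.nn.Conv2d(3, 16, 3)  # placeholder standing in for the pruned SSD
optimizer = torch.optim.SGD(pruned_ssd.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=5e-4)
# scheduler.step() would be called once per training iteration
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80_000, 100_000], gamma=0.1)
```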

4.3 Experiments on PASCAL VOC

On PASCAL VOC, we prune VGG-16 from conv1-1 to conv5-3 with a compression ratio of 0.75, which gives about 4x acceleration. We report the results in Tab. 1. From the results, we can see that our method achieves the best performance under the same acceleration rate. The accuracy of a reconstruction-based method like ThiNet drops a lot, but for our LCP there is not much degradation in detection performance, which shows that our method retains the channels that really contribute to the final performance. Then we conduct the experiment based on ResNet-50 and report the results in Tab. 2. From the results, LCP achieves the best performance whether pruning by 50% or by 70%, which proves that our method can reserve the channels which contain key information for classification and regression. In addition, ThiNet outperforms DCP when the pruning ratio is 0.7, which indicates that pruning the model on a classification dataset for object detection is not optimal.

Method | backbone | ratio | flops ↓ | params ↓ | mAP
Original | VGG-16 | 0 | 0 | 0 | 77.4
ThiNet | VGG-16 | 0.5 | 50% | 50% | 74.6
LCP (ours) | VGG-16 | 0.5 | 50% | 50% | 77.2
ThiNet | VGG-16 | 0.75 | 75% | 75% | 72.7
LCP (ours) | VGG-16 | 0.75 | 75% | 75% | 75.2
Table 1: The pruning results on PASCAL VOC2007. We conduct channel pruning from conv1-1 to conv5-3.
Method | backbone | ratio | flops ↓ | params ↓ | mAP
Original | ResNet-50 | 0 | 0 | 0 | 73.7
DCP | ResNet-50 | 0.5 | 50% | 50% | 72.4
ThiNet | ResNet-50 | 0.5 | 50% | 50% | 72.2
LCP (ours) | ResNet-50 | 0.5 | 50% | 50% | 73.3
DCP | ResNet-50 | 0.7 | 70% | 70% | 70.2
ThiNet | ResNet-50 | 0.7 | 70% | 70% | 70.8
LCP (ours) | ResNet-50 | 0.7 | 70% | 70% | 71.7
Table 2: The pruning results on PASCAL VOC2007. We conduct channel pruning from conv2-x to conv4-x.
Method | AP (small) | AP (medium) | AP (large) | mAP@0.5 | mAP@0.75 | mAP
Original | 4.2 | 22.5 | 39.0 | 37.3 | 22.7 | 21.9
DCP | 2.8 | 17.2 | 33.0 | 31.8 | 17.8 | 17.8
LCP | 4.1 | 20.4 | 38.2 | 35.5 | 21.6 | 20.9
Relative improv. (%) | 46.4 | 18.6 | 15.8 | 11.6 | 21.3 | 17.4
Table 3: The pruning results on MS COCO2017. The backbone is ResNet-50, and we conduct channel pruning from conv2-x to conv4-x with compression ratio 0.7. Small, medium and large refer to object sizes; the relative improvement is that of LCP over DCP.
Method | backbone | ratio | flops ↓ | params ↓ | mAP
Original | ResNet-50 | 0 | 0 | 0 | 21.9
DCP | ResNet-50 | 0.5 | 50% | 50% | 21.2
ThiNet | ResNet-50 | 0.5 | 50% | 50% | 22.6
LCP (ours) | ResNet-50 | 0.5 | 50% | 50% | 23.1
DCP | ResNet-50 | 0.7 | 70% | 70% | 17.8
ThiNet | ResNet-50 | 0.7 | 70% | 70% | 20.2
LCP (ours) | ResNet-50 | 0.7 | 70% | 70% | 20.9
Table 4: The pruning results on MS COCO. We conduct channel pruning from conv2-x to conv4-x.
Method | backbone | flops ↓ | params ↓ | mAP
DCP | ResNet-50 | 70% | 70% | 70.2
LCP + RoIAlign | ResNet-50 | 70% | 70% | 71.1
LCP + CR | ResNet-50 | 70% | 70% | 71.7
Table 5: The pruning results on PASCAL VOC2007. We conduct channel pruning from conv2-x to conv4-x. CR means contextual RoIAlign.
Method | backbone | ratio | flops ↓ | params ↓ | mAP
Original | VGG-16 | 0 | 0 | 0 | 77.4
$L_{rec}$ + $L_{cls}$ | VGG-16 | 0.75 | 75% | 75% | 74.7
$L_{rec}$ + $L_{cls}$ + $L_{reg}$ | VGG-16 | 0.75 | 75% | 75% | 75.2
Table 6: The pruning results on PASCAL VOC2007. We conduct channel pruning from conv1-1 to conv5-3.
Figure 7: Visualization of the performance of ThiNet (top row) and our method (bottom row) on animals, furniture, and vehicles classes in the VOC 2007 test set. The figures show the cumulative fraction of detections that are correct (Cor) or false positives due to poor localization (Loc), confusion with similar categories (Sim), with others (Oth), or with background (BG). The solid red line reflects the change of recall with the ‘strong’ criteria (0.5 jaccard overlap) as the number of detections increases. The dashed red line uses the ‘weak’ criteria (0.1 jaccard overlap).
Figure 8: The predictions of the original SSD and of the models pruned by ThiNet and by our LCP. We prune VGG-16 by 75% on PASCAL VOC.

4.4 Experiments on MS COCO

In this section, we prune ResNet-50 by 70% on COCO2017. We report the results in Tab. 3 and Tab. 4. From the results, our method achieves a better performance than DCP and ThiNet, which further illustrates the effectiveness of our approach. It is noted that, compared with DCP, LCP has a larger gain on small objects. In addition, the higher the IoU threshold, the greater the improvement of our method, which indicates that our method retains more localization information and can obtain more accurate predictions.

4.5 Ablation Analysis

Gradient Analysis. In this section, we prune VGG-16 from conv1-1 to conv5-3 with a compression ratio of 0.75 on PASCAL VOC. Then we count the percentage of the gradients generated by the three losses during the pruning process. From Fig. 6, we see that the gradient of the regression loss plays an important role during the pruning process, which proves that the localization information is necessary. The gradient generated by the reconstruction error only works in the shallow layers, while the localization-aware loss contributes to the channel pruning process in every layer.
Component Analysis. In this section, in order to verify the effectiveness of the two components we propose, we prune the SSD based on ResNet-50 by 70% with different combinations of them. We report the results in Tab. 5. From the results, we can see that each part of the proposed method contributes to the performance.
Loss Analysis. In order to explore the importance of the gradient of the regression loss, we prune the SSD based on VGG-16 by 75% with different losses. We report the results in Tab. 6. From the results, we can see that the performance of our method drops a lot without the gradient of the regression loss during the pruning stage, which shows that the regression branch contains important localization information.

4.6 Qualitative Results

To demonstrate the effectiveness of our proposed method in detail, we use the detection analysis tool from Hoiem et al. (2012). Figure 7 shows that our model can detect various object categories with high quality (large blue area). The recall is higher than 90%, and is much higher with the 'weak' (0.1 jaccard overlap) criterion. We can also observe that, compared with ThiNet, our method has a larger correct area, which indicates its superior performance.

4.7 Visualization of predictions

In this section, we prune the SSD based on VGG-16 by 75% and compare the original model with the pruned models. From Fig. 8, we can find that the predictions of our method are close to the predictions of the original model, while the predictions of ThiNet are far away. This shows that our method reserves more localization information for bounding box regression.

5 Conclusions

In this paper, we propose a localization-aware auxiliary network which allows us to conduct channel pruning directly for object detection. First, we design the auxiliary network with a contextual RoIAlign layer, which can obtain precise localization information of the default boxes by pixel alignment and enlarges the receptive fields of the default boxes when pruning shallow layers. Then, we construct a loss function for the object detection task which tends to keep the channels that contain the key information for classification and regression. Visualization shows that our method reserves channels with more localization information. Moreover, extensive experiments demonstrate the effectiveness of our method.

6 Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 91748204, 61976227 and 61772213 and in part by the Wuhan Science and Technology Plan under Grant 2017010201010121 and Shenzhen Science and Technology Plan under Grant JCYJ20170818165917438.

References

  1. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pp. 2270–2278.
  2. Cascade R-CNN: delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162.
  3. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
  4. CenterNet: keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6569–6578.
  5. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
  6. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pp. 1379–1387.
  7. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
  8. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
  9. Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164–171.
  10. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  11. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  12. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
  13. Diagnosing error in object detectors. In European Conference on Computer Vision, pp. 340–353.
  14. Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250.
  15. Mask Scoring R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6409–6418.
  16. Efficient DNN neuron pruning by minimizing layer-wise nonlinear reconstruction error. In IJCAI, Vol. 2018, pp. 2–2.
  17. Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423.
  18. Efficient neural network compression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  19. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  20. CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750.
  21. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
  22. Ternary weight networks. arXiv preprint arXiv:1605.04711.
  23. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
  24. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
  25. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
  26. SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  27. X. Liu, W. Li, J. Huo, L. Yao and Y. Gao. Layerwise sparse coding for pruned deep neural networks with extreme compression ratio.
  28. MetaPruning: meta learning for automatic neural network channel pruning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3296–3305.
  29. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
  30. ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
  31. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528.
  32. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  33. Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 658–666.
  34. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  35. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  36. Play and prune: adaptive filter pruning for deep model compression. arXiv preprint arXiv:1905.04446.
  37. Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision, pp. 20–36.
  38. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
  39. AutoPrune: automatic network pruning by regularizing auxiliary parameters. In Advances in Neural Information Processing Systems, pp. 13681–13691.
  40. Double quantization for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 4440–4451.
  41. Focused quantization for sparse CNNs. In Advances in Neural Information Processing Systems, pp. 5585–5594.
  42. Incremental network quantization: towards lossless CNNs with low-precision weights. arXiv preprint arXiv:1702.03044.
  43. Objects as points. arXiv preprint arXiv:1904.07850.
  44. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886.