Precise Box Score: Extract More Information from Datasets to Improve the Performance of Face Detection

Ce Qi  Xiaoping Chen  Pingyu Wang  Fei Su
School of Information and Communication Engineering
Beijing Key Laboratory of Network System and Network Culture
Beijing University of Posts and Telecommunications, Beijing, China

Abstract

When training a face detection network based on the R-CNN framework, anchors are assigned as positive samples if their intersection-over-unions (IoUs) with the ground truth are higher than a first threshold (such as 0.7), and as negative samples if their IoUs are lower than a second threshold (such as 0.3). The face detection model is then trained with these labels. However, anchors with IoUs between the first and second thresholds are not used. We propose a novel training strategy, Precise Box Score (PBS), to train object detection models. The proposed strategy uses the anchors with IoUs between the first and second thresholds, which consistently improves the performance of face detection. It extracts more information from datasets, making better use of existing datasets. In addition, we introduce a simple but effective model compression method (SEMCM), which further boosts the performance of face detectors. Experimental results show that the performance of face detection networks is consistently improved by the proposed scheme.

1 Introduction

Face detection, which is the basis of face alignment and face recognition, plays an important role in face related tasks. More accurate face detection and face bounding boxes will also benefit the performance of face alignment and face recognition.

Much research [36, 16, 17, 21, 4, 29, 18, 31, 30, 35, 12, 27, 28, 10, 22] has been done to improve the performance of face detectors. However, there is still a big gap between humans and current face detectors, especially for small or occluded faces. The gap becomes bigger in resource-constrained environments, where accuracy must be traded against speed and memory. Face detectors with good performance are usually slow and have high memory footprints (e.g., [10] takes more than 1 second per image, and HyperNet [14] is slow and has a large model size).

One way to improve face detection models is to use more powerful networks or to design a specific network architecture. But this kind of strategy is not elegant and may not transfer to other detection tasks. Recently, Hu et al. [10] obtained state-of-the-art results on the WIDER FACE detection benchmark [32] by using an approach similar to the Region Proposal Network (RPN) [24] to directly detect faces. To boost the performance, it introduces an image pyramid as an integral part of the method, which is not time efficient.

Another way is to use more training data and data augmentation. AFLW [13], a relatively small face detection dataset, was commonly used as the training set in earlier work. A face detection model's performance is boosted when it is trained on WIDER FACE [32], a relatively large face detection dataset. In addition, data augmentation such as flipping and blurring also helps the final accuracy.

When training an R-CNN style face detector, anchors are assigned as positive samples if their intersection-over-unions (IoUs) with the ground truth are higher than a first threshold (such as 0.7), and as negative samples if their IoUs are lower than a second threshold (such as 0.3). The object detection model is trained with these labels, meaning the positive and negative labels are set roughly by hand. Anchors with IoUs between the first and second thresholds are not used, which discards much information from the detection dataset.

In this paper, we show that when training a face detection model based on the R-CNN framework, the original anchor assignment strategy is not appropriate and loses much information from the original face detection dataset. As shown in Fig. 1(a), the original training strategy has three weaknesses: (a) the thresholds are chosen roughly; (b) positive and negative labels are set roughly to 1 or 0; (c) the information of anchors with IoUs between 0.3 and 0.7 is not used. We therefore propose a novel training strategy, called Precise Box Score (PBS), to train face detection models. The proposed strategy uses the anchors more effectively, meaning more information from the face detection dataset is used for training. In addition, we introduce a simple but effective model compression method (SEMCM), which further boosts the performance of face detectors. The experimental results show that the proposed training strategy and model compression method consistently improve the performance of face detection models.

The rest of the paper is organized as follows. Section 2 provides an overview of the related works. Section 3 introduces the proposed training strategy: Precise Box Score(PBS) and the new architecture designed for PBS. Section 4 describes a simple but effective model compression method(SEMCM), which can improve the performance of face detector and reduce the model size. Section 5 presents the experiments and Section 6 gives the conclusion.

(a) original R-CNN [24] style training strategy
(b) our new training strategy
Figure 1: Comparison of training strategies. The original R-CNN [24] style training strategy uses incomplete information from the face detection dataset (it only uses anchors with IoU > 0.7 or IoU < 0.3, and roughly sets the labels of anchors with IoU > 0.7 to 1 and of anchors with IoU < 0.3 to 0). In contrast, our new training strategy uses the full information from the face detection dataset and sets the anchors' labels precisely by a function.

2 Related Work

2.1 Face Detection

Face detection is the basis of face related tasks, and much work has been done to improve its performance. Before the re-emergence of convolutional neural networks (CNNs), many traditional methods [26, 36, 16, 17, 21, 4, 29] were proposed for face detection. The most successful traditional method is the Viola-Jones [26] detector. However, most traditional methods use hand-crafted features, which limit the performance of the face detector. Following the success of CNNs [15], the performance of face detection has improved significantly, owing to the discriminative features of CNNs. Recently, many CNN-based methods have been proposed for face detection, such as [18, 31, 30, 35, 10].

2.2 R-CNN Style Face Detector

The idea of detecting and localizing objects in two stages is widely used in object detection, for example in Faster R-CNN [24] and R-fcn [5]. R-CNN style face detectors [12, 27, 28] can obtain good accuracy. However, unlike general object detection with many classes, face detection detects only one class. Single stage face detectors [35, 10, 22], which detect faces directly from the early convolutional layers with bounding box classification and regression, also work well. Most single stage face detection methods are similar to the object proposal algorithm used as the first stage in the detection pipeline. These algorithms generally regress a set of anchors toward faces and assign scores to the anchors according to the intersection-over-unions (IoUs) between anchors and ground-truth bounding boxes.

2.3 Main Focus of Face Detection Research

For single stage face detectors, much work has been done to boost the performance of face detection. Most methods focus on scale invariance and context modeling. Scale invariance makes it easier to detect faces of different scales, and context information helps with faces that are hard to classify.

For general object detection, ION [1] uses skip pooling and recurrent neural networks (RNNs) for context modeling and scale invariance. FPN [6] employs skip connections and multiple shared RPNs on different convolutional layers. The same methods are also used for face detection. CMS-RCNN [35] also employs skip connections. Hu et al. [10] use image pyramids and context modeling to improve the performance.

Some other methods focus on the loss functions of detection, such as Unitbox [33] and Grid loss [23]. Other researchers work on non-maximum suppression (NMS), a post-processing step. Soft-NMS [2] improves NMS in a very simple but effective way. The authors of [9] use a convolutional network to guide the NMS after detection.

Our training strategy(PBS) focuses on how to extract more information from current face detection dataset to improve the performance of face detection.

3 Novel Training Strategy: Precise Box Score (PBS)

Most work on detection concerns the network architecture, the loss function or the post-processing step. The proposed training strategy instead focuses on the input data step of the model, that is, how to extract more information from current face detection datasets.

Our novel training strategy(PBS) is designed for detection network using anchors. Details are shown in Fig. 2.

Figure 2: Training and testing stages with PBS

3.1 General Architecture

Fig. 3 shows the general architecture we use, called Face Detection Network (FDN). It is a fully convolutional network which performs bounding box classification and regression simultaneously. For localization, just like the RPN in [24], FDN regresses a set of predefined bounding boxes (anchors) to approximate the ground-truth bounding boxes. The bounding box classification and regression operations are added on top of the feature map with stride 16. FDN uses anchor scales of 4, 8, 16 and 32. We only consider anchors with an aspect ratio of 1:1, to reduce the total number of anchor boxes and to fit the fact that most face boxes have an aspect ratio of 1:1. Specifically, if the feature map with stride 16 has size $W \times H$, there are $W \times H \times k$ anchors, where $k$ equals 4 in our setting.
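The anchor layout described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `generate_anchors` is hypothetical, and the side length of each anchor is assumed to be scale times stride, following the common Faster R-CNN convention; the text itself only gives the scales, the stride and the 1:1 aspect ratio.

```python
def generate_anchors(feat_w, feat_h, stride=16, scales=(4, 8, 16, 32)):
    """Return square anchors (x1, y1, x2, y2) for every feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this cell in input-image coordinates.
            cx = col * stride + stride / 2.0
            cy = row * stride + stride / 2.0
            for scale in scales:
                # Assumption: anchor side length = scale * stride.
                half = scale * stride / 2.0
                anchors.append((cx - half, cy - half, cx + half, cy + half))
    return anchors


anchors = generate_anchors(feat_w=3, feat_h=2)
print(len(anchors))  # 3 * 2 * 4 = 24 anchors
```

With one aspect ratio and four scales, the number of anchors is simply four per feature-map location, which is why restricting the aspect ratio keeps the anchor set small.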

Because FDN is fully convolutional, the input images can be of any size. The total model size is also small. Note that there are two different network architectures in Fig. 3; details are described below.

(a) FDN with softmax
(b) FDN with precise-sigmoid
Figure 3: General architecture of our face detection network (FDN). (a) is the network architecture of a general one stage face detector using the original training strategy shown in Fig. 1(a). (b) is the new network architecture we propose, designed for the proposed novel training strategy, Precise Box Score (PBS), shown in Fig. 1(b).

3.2 Precise Box Score(PBS)

In this section, we discuss the training strategy of R-CNN style detectors, such as Faster R-CNN [24], R-fcn [5], Hu et al. [10] and SSH [22]. We then give the details of our novel training strategy (PBS).

General training strategy of R-CNN style detector: When training an R-CNN style detector, anchor boxes are introduced to serve as references at multiple scales and aspect ratios. Classification and regression are performed simultaneously on the anchors to produce region proposals. During training, each anchor is assigned a binary class label (being an object or not). A positive label is assigned to (i) the anchors with the highest intersection-over-union (IoU) overlap with a ground-truth box, or (ii) anchors whose IoU with any ground-truth box is higher than the first threshold. An anchor is assigned a negative label if its IoU is lower than the second threshold for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training. The general training strategy of R-CNN style detectors is shown in Fig. 1(a) and formulated in Eq. (1). Note that the figure and the formula only approximately describe the general training strategy, because of rule (i) above.

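$$\text{label}(\text{IoU}) = \begin{cases} 1, & \text{IoU} > T_1 \\ 0, & \text{IoU} < T_2 \end{cases} \qquad (1)$$

where $T_1$ and $T_2$ denote the first (positive) and second (negative) thresholds.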
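The binary assignment described above can be sketched in a few lines. This is an illustration under stated assumptions: the helper names `iou` and `binary_label` are hypothetical, the default thresholds 0.7 and 0.3 are the example values quoted in the text, and rule (i) (the anchor with the highest IoU for each ground-truth box is always positive) is left out for brevity.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def binary_label(max_iou, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored during training)."""
    if max_iou > pos_thresh:
        return 1
    if max_iou < neg_thresh:
        return 0
    return -1  # anchors between the two thresholds are not used


anchor = (0.0, 0.0, 10.0, 10.0)
face = (5.0, 0.0, 15.0, 10.0)
print(binary_label(iou(anchor, face)))  # IoU is 1/3, so the anchor is ignored
```

The `-1` branch is exactly the information that the general strategy discards.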

Weakness of general training strategy: As shown in Fig. 1(a), the general training strategy of R-CNN style detectors described above has three weaknesses: (1) The positive anchor threshold and the negative anchor threshold are set roughly to fixed numbers. The two thresholds in Faster R-CNN are 0.7 and 0.3, respectively, and the two thresholds in R-fcn are the same number, 0.5. (2) The labels are binary, meaning a label is either 1 or 0. Such rough binary labels lose much information from the detection dataset. For example, two anchors with different IoUs can receive the same label, although the one with the higher IoU is more like a positive sample. (3) The anchors with IoUs between the first and second thresholds are not used.

Precise Box Score (PBS): To overcome the three weaknesses described above, we propose a novel training strategy, Precise Box Score (PBS). PBS chooses the best thresholds and uses precise floating-point numbers as labels, where the labels are the outputs of a designed function that takes IoUs as inputs, as illustrated in Fig. 1(b). It first chooses the best thresholds through experiments and then chooses the best function to translate IoUs to labels. The detailed steps are as follows:

(1) Using Eq. (2) to choose the best thresholds through experiments(Note that when , and when ). The best thresholds are represented by and .

(2) Choosing the best function to translate IoUs to labels, based on best thresholds and obtained in step (1). Three classes of functions are used, as illustrated in Eq. (3), Eq. (4) and Eq. (5). Eq. (3) adds a shift variable A() to the IoU for the positive label, and when the positive label is bigger than , it will be set to . This function roughly translates the IoUs to labels. Eq. (4) limits some IoUs() to . And Eq. (5) uses more variables and do more precise limitation to the function of IoU. Details will be shown in experiments in Section 5.4.

(2)
(3)
(4)
(5)
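As a rough illustration of how an IoU can be translated into a floating-point label, the sketch below follows the verbal description of Eq. (3): a shift is added to the IoU of a positive anchor and the result is capped at 1. It is one possible reading, not the authors' definition. The function name `shifted_label`, the example shift of 0.3, and the handling of anchors between the negative and positive thresholds (returned as `None`, i.e. unlabeled) are assumptions made here; the functions of Eq. (4) and Eq. (5) are not reproduced.

```python
def shifted_label(max_iou, pos_thresh, shift, neg_thresh=0.3):
    """Map an anchor's best IoU to a soft label between 0 and 1."""
    if max_iou >= pos_thresh:
        # Positive anchor: shift the IoU upward and cap the label at 1.
        return min(max_iou + shift, 1.0)
    if max_iou < neg_thresh:
        return 0.0
    return None  # assumption: anchors in between are left unlabeled here


for value in (0.1, 0.45, 0.6, 0.9):
    print(value, shifted_label(value, pos_thresh=0.4, shift=0.3))
```

Unlike a binary label, two positive anchors with different IoUs now receive different targets until the cap at 1 is reached.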

3.3 New Architecture Designed for Precise Box Score(PBS)

As introduced above, the proposed training strategy, Precise Box Score (PBS), uses precise floating-point numbers as labels rather than binary labels. Softmax with cross entropy loss, shown in Eq. (6), is designed for binary labels and is therefore not appropriate for precise floating-point labels.

(6)

where and denote the input data and its corresponding label, respectively. denotes the element of the softmax input vector , and . is the number of training images. is the number of class.

We design a new architecture for Precise Box Score (PBS), as illustrated in Fig. 3(b). The new architecture is needed for two reasons:

(a) Our proposed training strategy, Precise Box Score (PBS), uses precise floating-point numbers as labels rather than binary labels.

(b) The precise floating-point labels used by PBS lie between 0 and 1.

So, the new architecture for PBS replaces softmax (with cross entropy loss) with sigmoid (with Euclidean loss), which we call Precise Sigmoid. The new architecture outputs numbers between 0 and 1, and its loss is suited to precise floating-point labels.

The formula of sigmoid with Euclidean loss (Precise Sigmoid) is shown in Eq. (7).

(7)

where denotes the Precise Sigmoid Loss, new loss designed for PBS in the new architecture. and denote the input data and its corresponding label. is the number of training images. is the number of class.
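A sigmoid followed by a Euclidean loss can be sketched as below. This is a minimal sketch, not the authors' layer: the function names are hypothetical, and the 1/(2N) normalization is an assumption borrowed from the usual Euclidean loss convention, since the exact form of Eq. (7) is not recoverable here.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a real score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def precise_sigmoid_loss(logits, labels):
    """Euclidean loss between sigmoid outputs and floating-point labels."""
    n = len(logits)
    squared = sum((sigmoid(z) - y) ** 2 for z, y in zip(logits, labels))
    return squared / (2.0 * n)  # assumed normalization: 1 / (2N)


# A logit of 0 gives a score of 0.5, so a soft label of 0.5 costs nothing.
print(precise_sigmoid_loss([0.0], [0.5]))  # 0.0
print(precise_sigmoid_loss([0.0], [1.0]))  # 0.125
```

Because the target can be any number between 0 and 1, the same loss serves both binary labels and PBS labels.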

The loss of our face detector using new architecture is:

(8)

where . denotes the Precise Sigmoid Loss. denotes the SmoothL1 Loss [24] used for bounding box regression.
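The combination of the two loss terms can be sketched as a weighted sum. The sketch makes two assumptions that the text does not confirm: the loss weight multiplies the Precise Sigmoid term, and the example weight of 300 is used because it gives the best accuracy in most columns of Tables 1 to 3. The SmoothL1 definition is the standard one; the function names are hypothetical.

```python
def smooth_l1(diff):
    """Standard SmoothL1: quadratic near zero, linear elsewhere."""
    a = abs(diff)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def detection_loss(cls_loss, box_pred, box_target, cls_weight=300.0):
    """Weighted classification loss plus SmoothL1 box regression loss."""
    reg_loss = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    # Assumption: the loss weight scales the classification term.
    return cls_weight * cls_loss + reg_loss


total = detection_loss(0.01, box_pred=[0.5, 2.0], box_target=[0.0, 0.0])
print(total)  # 300 * 0.01 + (0.125 + 1.5)
```

A large weight is plausible because a squared error on values between 0 and 1 is small compared with the regression term, but this remains an interpretation.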

Note: mathematically, sigmoid with Euclidean loss may suffer from vanishing gradients. To solve this problem, we initialize the new architecture with the parameters of a model pretrained by softmax with cross entropy loss. Experiments in Section 5 demonstrate the effectiveness of this method in avoiding the vanishing gradients of sigmoid with Euclidean loss; the results are shown in Table 4.
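The vanishing gradient can be seen from the derivative of the squared error through a sigmoid, which carries an extra factor s(1 - s) that approaches zero when the output saturates. The sketch below compares it with the gradient of cross entropy on the same sigmoid output, used here only as a stand-in for the softmax cross entropy baseline. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def euclidean_grad(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)^2."""
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)


def cross_entropy_grad(z, y):
    """d/dz of the cross entropy between y and sigmoid(z)."""
    return sigmoid(z) - y


# A badly wrong, saturated prediction: logit -10 for a positive anchor.
print(euclidean_grad(-10.0, 1.0))      # tiny magnitude, learning stalls
print(cross_entropy_grad(-10.0, 1.0))  # magnitude close to 1
```

Starting from a pretrained model keeps most outputs away from the badly wrong, saturated regime, which is consistent with the remedy described above.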

The benefits of new architecture are as follows:

(a) Training phase: using precise floating-point numbers between 0 and 1 as labels satisfies the training requirement of PBS.

(b) Testing phase: the outputs of sigmoid lie between 0 and 1, and can be used directly as scores of face boxes.

(c) Using sigmoid reduces the number of model parameters, which makes training and testing faster.

3.4 Superiority of Precise Sigmoid+Precise Box Score(PBS)

(a) Using the labels of PBS with Precise Sigmoid is the full implementation of the proposed training strategy.

(b) Precise floating-point labels extract more information from the detection dataset, which helps the training.

(c) Under PBS, models output precise and appropriate scores for bounding boxes, which benefits the NMS post-processing (bounding boxes with lower IoUs with the ground truth get lower scores).

4 A Simple but Effective Model Compression Method(SEMCM)

Figure 4: Overall architecture of SEMCM.

After using Precise Box Score (PBS) to improve the performance of the face detection model, we propose a simple but effective model compression method (SEMCM) for one stage face detectors. There are other works on model compression [7, 8, 20, 19]. [7] is a general but relatively complex model compression method. [8, 20, 19] are relatively simple model compression methods, of which [8, 20] are designed for image classification or face recognition. [19] is designed for object detection, but it targets two stage detectors and uses region of interest (ROI) information, which does not exist in one stage detectors.

SEMCM is designed for one stage detectors, as illustrated in Fig. 4. A one stage detector uses the classification and regression results for anchors simultaneously to get the final results. SEMCM directly uses the output of a pretrained big model as the supervision signal for training a small model. Note that the output feature maps of the big model and the small model must have the same width, height and number of channels. So, the most convenient way to get a trainable small model is to downsample the channels of all layers in the big model, except the output layers used for classification and regression. The training steps of SEMCM are as follows:

(1) A big model is trained through our proposed training strategy(PBS).

(2) A small model is obtained by downsampling the channels of all layers in big model, except the output layers used for classification and regression.

(3) The small model is briefly trained with the general training strategy for some iterations. SEMCM also works for a half-trained small model.

(4) The parameters of the pretrained big model from (1) are frozen, and the small model is trained under the supervision of the output feature maps of the big model.

(5) Note: to make the training of SEMCM stable, two new layers are designed to guide how the output feature maps of the big model supervise the training of the small model, as illustrated in Fig. 4. The original training signals are also added to guarantee the good performance of SEMCM.
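The supervision described in steps (4) and (5) can be sketched as a loss with two parts: a term that pulls the student's output map toward the frozen teacher's output map, and the original training signal. This is only a sketch under assumptions. The two new layers of Fig. 4 are not specified in the text and are not modeled; the Euclidean form of both terms, the `mimic_weight` parameter and the function names are hypothetical.

```python
def euclidean(a, b):
    """Mean squared difference (with a 1/2 factor) between two flat maps."""
    if len(a) != len(b):
        raise ValueError("teacher and student outputs must have the same size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / (2.0 * len(a))


def semcm_loss(student_out, teacher_out, labels, mimic_weight=1.0):
    """Teacher supervision plus the original training signal."""
    mimic = euclidean(student_out, teacher_out)  # frozen teacher as target
    original = euclidean(student_out, labels)    # original label supervision
    return mimic_weight * mimic + original


# Flattened score maps of identical shape for teacher and student.
print(semcm_loss([1.0, 0.0], teacher_out=[0.0, 0.0], labels=[1.0, 0.0]))  # 0.25
```

The size check mirrors the requirement that the output feature maps of the big and small models have the same width, height and number of channels.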

5 Experiments

5.1 Experimental Setup

Models with the new architecture (shown in Fig. 3(b)) trained with PBS start training from a network pretrained through softmax with cross entropy loss. Using a pretrained model solves the vanishing gradient problem of sigmoid with Euclidean loss for PBS. All anchors have an aspect ratio of 1:1. For FDDB [11], anchors with scales {4, 8, 16, 32} are used on the feature map with total stride 16. Training and testing are both single scale: we rescale the shorter side of the image up to 600 pixels while keeping the longer side below 1000 pixels, without changing the aspect ratio. For the WIDER FACE dataset [32], all settings follow SSH [22]. During inference, the model outputs the 300 top scoring boxes, and NMS is performed on these boxes to get the final detection results.
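The inference step (keep the 300 top scoring boxes, then apply NMS) can be sketched with standard greedy NMS. The function names are hypothetical, and the IoU threshold of 0.3 in the example is an illustrative value chosen here, not the value used in the experiments.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thresh, top_k=300):
    """Greedy NMS over (score, box) pairs, after keeping the top_k scores."""
    candidates = sorted(detections, key=lambda d: d[0], reverse=True)[:top_k]
    kept = []
    for score, box in candidates:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, kept_box) <= iou_thresh for _, kept_box in kept):
            kept.append((score, box))
    return kept


dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.7, (20, 20, 30, 30))]
print([score for score, _ in nms(dets, iou_thresh=0.3)])  # [0.9, 0.7]
```

Since NMS ranks boxes by score, scores that track localization quality more closely directly affect which boxes survive.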

The goal of our experiments is to demonstrate the effectiveness of PBS and SEMCM, so we use relatively simple networks to verify our methods. We use one stage face detectors with backbones ZF-net [34], VGG_CNN_M_1024 [3], VGG16 [25] and ZF-24-net. Note that ZF-24-net is a network designed for SEMCM. It has the same architecture as ZF-net, except that the channels of all layers are reduced (not including the layers for classification and regression), so ZF-24-net is a smaller network than ZF-net. The model sizes of the four one stage face detectors are 17.3M, 30.9M, 68.3M and 1.1M, respectively.

5.2 Datasets

FDDB [11] and WIDER FACE [32] are used in our experiments.

FDDB [11]: FDDB contains 2845 images with 5171 annotated faces. We use this dataset only for testing.

WIDER FACE [32]: WIDER FACE contains 32,203 images with 393,703 annotated faces, 158,989 of which are in the train set, 39,496 in the validation set and the rest in the test set. The validation and test sets are divided into “easy”, “medium” and “hard” subsets cumulatively (i.e. the “hard” set contains all images). This is one of the most challenging public face detection datasets, with a wide variety of face scales and occlusion. By default, we train our models on the train set of WIDER FACE [32] and evaluate on the validation set of WIDER FACE or on FDDB [11].

5.3 Ablation study of loss weight for Precise Sigmoid

We first run experiments on FDDB [11] for the loss weight in Eq. (8) of the proposed new architecture (shown in Fig. 3(b)) for PBS. In these experiments, we use the original R-CNN style training strategy with its two fixed thresholds, as described in Section 3.2. All anchors have an aspect ratio of 1:1 with scales {4, 8, 16, 32}, and are used on the feature map with total stride 16.

We use the train set of WIDER FACE for training and FDDB for testing. Different networks are used as the backbone of the one stage detector, to find an optimal loss weight for each network. Tables 1, 2 and 3 give the results of one stage detectors with backbones ZF-net [34], VGG_CNN_M_1024 [3] and ZF-24-net. The accuracy is measured when the number of false positives on FDDB is 1000, 500 and 100, respectively. The “-” in the three tables means the model cannot converge well.
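Accuracy at a fixed number of false positives can be computed by walking down the score-ranked detections until the false positive budget is used up. The sketch below assumes that each detection has already been matched against the ground truth, so it is a simplified stand-in for the official FDDB evaluation rather than a reimplementation of it. The function name `accuracy_at_fp` is hypothetical.

```python
def accuracy_at_fp(detections, num_faces, max_fp):
    """True positive rate (%) at the point where max_fp false positives occur.

    detections: list of (score, is_face) pairs, is_face marking a correct match.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    false_pos = 0
    for _, is_face in ranked:
        if is_face:
            true_pos += 1
        else:
            if false_pos == max_fp:
                break  # the next false positive would exceed the budget
            false_pos += 1
    return 100.0 * true_pos / num_faces


dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
print(accuracy_at_fp(dets, num_faces=4, max_fp=1))  # 50.0
print(accuracy_at_fp(dets, num_faces=4, max_fp=2))  # 75.0
```

A smaller false positive budget only counts detections that rank above most mistakes, so the 100 FP column is the most sensitive to how well the scores are ordered.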

| Loss weight | Accuracy (%) at 1000 FP | 500 FP | 100 FP |
|---|---|---|---|
| 1 | - | - | - |
| 10 | 85.7 | 79.4 | 51.4 |
| 20 | 86.6 | 81.1 | 61.3 |
| 100 | 91.1 | 87.9 | 74.2 |
| 200 | 92.4 | 90.5 | 81.4 |
| 300 | 92.7 | 90.7 | 83.2 |
| 400 | 92.6 | 91.0 | 82.5 |

Table 1: Accuracy (%) on FDDB with different loss weights. The backbone of the detector is ZF-net.

| Loss weight | Accuracy (%) at 1000 FP | 500 FP | 100 FP |
|---|---|---|---|
| 1 | - | - | - |
| 10 | 89.5 | 87.6 | 72.8 |
| 20 | 89.6 | 86.8 | 72.0 |
| 100 | 91.6 | 89.9 | 79.8 |
| 200 | 91.8 | 90.0 | 80.2 |
| 300 | 92.2 | 90.5 | 82.3 |
| 400 | 92.1 | 90.4 | 81.1 |

Table 2: Accuracy (%) on FDDB with different loss weights. The backbone of the detector is VGG_CNN_M_1024.

| Loss weight | Accuracy (%) at 1000 FP | 500 FP | 100 FP |
|---|---|---|---|
| 1 | - | - | - |
| 100 | 88.6 | 86.9 | 78.6 |
| 200 | 89.3 | 87.4 | 80.9 |
| 300 | 89.8 | 88.2 | 82.3 |
| 400 | 89.8 | 88.1 | 82.3 |
| 500 | 89.8 | 88.1 | 82.2 |

Table 3: Accuracy (%) on FDDB with different loss weights. The backbone of the detector is ZF-24-net.

From Tables 1, 2 and 3, the optimal loss weight is 300. So, for the new architecture with Precise Sigmoid, we use a loss weight of 300 by default.

5.4 Precise Sigmoid+PBS on FDDB

Comparable performance of Precise Sigmoid: we first compare Precise Sigmoid with softmax under the original R-CNN style training strategy, as shown in Table 4. The results show that Precise Sigmoid and softmax perform comparably. Furthermore, the better results of Precise Sigmoid on the VGG16-based network at 1000 and 500 false positives suggest that the vanishing gradient problem can be solved.

Effectiveness of Precise Sigmoid+PBS: Precise Sigmoid is designed for PBS, and the experimental results show the effectiveness of Precise Sigmoid+PBS. The experiments are done with ZF-net as the backbone of the face detector.

Fig. 5 shows the detailed experimental results of Precise Sigmoid+PBS; the PBS strategy “pos0.4+split_0.4_0.8_0.5_0.9” gets the best accuracy. The results in Table 5 further support the effectiveness of PBS: the networks using Precise Sigmoid+PBS gain accuracy over conventional softmax with the original training strategy in all settings except VGG_CNN_M_1024 at 100 false positives.

| Architecture | Accuracy (%) at 1000 FP | 500 FP | 100 FP |
|---|---|---|---|
| ZF-net (softmax) | 92.5 | 91.2 | 83.3 |
| ZF-net (Precise Sigmoid, no PBS) | 92.7 | 90.7 | 83.2 |
| ZF-24-net (softmax) | 90.6 | 88.9 | 82.2 |
| ZF-24-net (Precise Sigmoid, no PBS) | 89.8 | 88.2 | 82.3 |
| VGG_CNN_M_1024 (softmax) | 91.9 | 90.5 | 83.6 |
| VGG_CNN_M_1024 (Precise Sigmoid, no PBS) | 92.2 | 90.5 | 82.3 |
| VGG16 (softmax) | 94.1 | 92.8 | 86.2 |
| VGG16 (Precise Sigmoid, no PBS) | 95.0 | 93.6 | 82.2 |

Table 4: Accuracy (%) comparison between Precise Sigmoid and softmax on FDDB. Both networks are trained with the original R-CNN style training strategy. And the loss weight for

Figure 5: Detailed experimental results of different Precise Box Score (PBS) settings on FDDB. “pos0.7+all=1” denotes the use of Eq. (2), “pos0.4+add0.6” the use of Eq. (3), “pos0.4+split_0.7_0.8” the use of Eq. (4), and “pos0.4+split_0.4_0.8_0.5_0.9” the use of Eq. (5), each with the parameter values given in its name.

| Architecture | Accuracy (%) at 1000 FP | 500 FP | 100 FP |
|---|---|---|---|
| ZF-net (softmax) | 92.5 | 91.2 | 83.3 |
| ZF-net (Precise Sigmoid) | 92.7 | 90.7 | 83.2 |
| ZF-net (Precise Sigmoid+PBS) | 94.6 | 93.3 | 83.4 |
| ZF-24-net (softmax) | 90.6 | 88.9 | 82.2 |
| ZF-24-net (Precise Sigmoid) | 89.8 | 88.2 | 82.3 |
| ZF-24-net (Precise Sigmoid+PBS) | 91.7 | 90.2 | 83.1 |
| VGG_CNN_M_1024 (softmax) | 91.9 | 90.5 | 83.6 |
| VGG_CNN_M_1024 (Precise Sigmoid) | 92.2 | 90.5 | 82.3 |
| VGG_CNN_M_1024 (Precise Sigmoid+PBS) | 93.7 | 92.3 | 81.8 |
| VGG16 (softmax) | 94.1 | 92.8 | 86.2 |
| VGG16 (Precise Sigmoid) | 95.0 | 93.6 | 82.2 |
| VGG16 (Precise Sigmoid+PBS) | 95.4 | 94.5 | 87.4 |

Table 5: Accuracy (%) comparison of Precise Sigmoid+PBS and softmax.

5.5 Precise Sigmoid+PBS on WIDER FACE

We also evaluate the performance of Precise Sigmoid+PBS on WIDER FACE. SSH [22] is used as the baseline and is trained on the train set of WIDER FACE. Training and testing are both single scale: we rescale the shorter side of the image up to 1200 pixels while keeping the longer side below 1600 pixels, without changing the aspect ratio. The settings of scale and aspect ratio follow those of SSH.
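The single scale resizing rule can be written as a small helper. This is a sketch of the usual convention for this kind of rule, with a hypothetical function name; the defaults are the WIDER FACE values from the text, and the FDDB setting corresponds to 600 and 1000.

```python
def rescale_factor(width, height, target_short=1200, max_long=1600):
    """Scale so the shorter side reaches target_short, capped by max_long."""
    short_side = min(width, height)
    long_side = max(width, height)
    scale = target_short / short_side
    if scale * long_side > max_long:
        # The longer side would exceed the limit, so scale by it instead.
        scale = max_long / long_side
    return scale


print(rescale_factor(800, 600))             # 2.0, giving 1600 x 1200
print(rescale_factor(1000, 500, 600, 1000)) # 1.0, the longer side limits it
```

Both sides are multiplied by the same factor, so the aspect ratio is unchanged.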

Table 6 compares the original SSH with SSH trained by our Precise Sigmoid+PBS. SSH uses two rough IoU thresholds, 0.5 and 0.3. In the experiments, we adjust the threshold for positive samples on the Precise Sigmoid architecture. We also use the empirically best parameters of PBS (“pos0.4+split_0.4_0.8_0.5_0.9”, from Fig. 5) to demonstrate the effectiveness of Precise Sigmoid+PBS. Table 6 shows the following:

(a) SSH trained by Precise Sigmoid+PBS outperforms the original SSH [22] by 0.3%, 0.6% and 0.8% on the “easy”, “medium” and “hard” subsets of WIDER FACE, respectively.

(b) The current result of SSH trained by Precise Sigmoid+PBS simply uses the setting “pos0.4+split_0.4_0.8_0.5_0.9”, the empirical parameters of PBS shown in Fig. 5. Tuning the PBS parameters specifically may give even better results.

| Method | Accuracy (%) easy | medium | hard |
|---|---|---|---|
| SSH (softmax) [22] (pos0.5) | 91.9 | 90.7 | 81.4 |
| SSH (Precise Sigmoid) (pos0.4) | 92.3 | 90.5 | 79.0 |
| SSH (Precise Sigmoid) (pos0.5) | 91.8 | 90.5 | 81.3 |
| SSH (Precise Sigmoid) (pos0.7) | 88.3 | 85.8 | 74.0 |
| SSH (Precise Sigmoid+PBS) (pos0.4+split_0.4_0.8_0.5_0.9) | 92.2 | 91.3 | 82.2 |

Table 6: Comparison of the original SSH with SSH trained by our Precise Sigmoid+PBS on WIDER FACE. “pos0.5” denotes the use of Eq. (2) and “pos0.4+split_0.4_0.8_0.5_0.9” the use of Eq. (5), each with the parameter values given in its name. By default, we set the anchors with IoUs lower than 0.3 as negative samples.

5.6 SEMCM on FDDB

Here, we use the one stage face detector with the ZF-net backbone as the teacher model and the one stage face detector with the ZF-24-net backbone as the student model. As illustrated in Fig. 4, we conduct SEMCM as described in Section 4. The teacher model (ZF-net) is trained by Precise Sigmoid+PBS and has accuracy of 94.6, 93.3 and 83.4 when the number of false positives on FDDB is 1000, 500 and 100. The student model (ZF-24-net) is obtained from the teacher model (ZF-net) by reducing the channels of all layers, except the layers for classification and regression. The student model (ZF-24-net) is half-trained, with accuracy of 88.4, 86.5 and 77.4.

Table 7 and Fig. 6 show that, after using SEMCM, the accuracy of the small student model is raised to 92.2, 91.0 and 84.5, even better than that of the small student model trained with Precise Sigmoid+PBS. The detailed accuracy gains are as follows (accuracies are given for 1000, 500 and 100 false positives on FDDB, respectively):

(a) The small model has a model size of 1.1M. It is trained starting from a half-trained model with accuracy of 88.4, 86.5 and 77.4.

(b) The accuracy of the small model is raised from 88.4, 86.5 and 77.4 to 92.2, 91.0 and 84.5. The gain is 3.8, 4.5 and 7.1 relative to the half-trained model used for initialization; 1.6, 2.1 and 2.3 relative to the model trained with softmax; and 0.5, 0.8 and 1.4 relative to the model trained with Precise Sigmoid+PBS.

(c) The resulting accuracy of 92.2, 91.0 and 84.5 is even better than that of the small model trained with Precise Sigmoid+PBS, showing that Precise Sigmoid+PBS and SEMCM are complementary.
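The gains quoted in (b) follow directly from the rows of Table 7, as the short check below shows. The dictionary keys and the helper name `gains` are illustrative labels chosen here; the numbers are copied from the table.

```python
# Accuracy (%) at 1000, 500 and 100 false positives, copied from Table 7.
after_semcm = (92.2, 91.0, 84.5)
baselines = {
    "half-trained": (88.4, 86.5, 77.4),
    "softmax": (90.6, 88.9, 82.2),
    "precise sigmoid + pbs": (91.7, 90.2, 83.1),
}


def gains(after, before):
    """Per-column accuracy gain, rounded to one decimal like the table."""
    return tuple(round(a - b, 1) for a, b in zip(after, before))


for name, scores in baselines.items():
    print(name, gains(after_semcm, scores))
```

The largest gain in every comparison is at 100 false positives, the strictest operating point.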

| Architecture | Accuracy (%) at 1000 FP | 500 FP | 100 FP |
|---|---|---|---|
| ZF-24-net (Precise Sigmoid, half trained) (for student model initialization) | 88.4 | 86.5 | 77.4 |
| ZF-24-net (softmax) | 90.6 | 88.9 | 82.2 |
| ZF-24-net (Precise Sigmoid+PBS) | 91.7 | 90.2 | 83.1 |
| ZF-24-net (after SEMCM) | 92.2 | 91.0 | 84.5 |

Table 7: Results of SEMCM….

Figure 6: Results of SEMCM.

5.7 Qualitative Results

Some qualitative results of three face detectors are shown in Fig. 7. The three models have model sizes of 1.1M, 17.3M and 98.1M, respectively. The first, a one stage face detector with the ZF-24-net backbone, is trained by Precise Sigmoid+PBS+SEMCM. The second, a one stage face detector with the ZF-net backbone, is trained by Precise Sigmoid+PBS. The third model, SSH, is trained by Precise Sigmoid+PBS.

(a) Qualitative results of the one stage face detector with the ZF-24-net backbone (Precise Sigmoid+PBS+SEMCM)
(b) Qualitative results of the one stage face detector with the ZF-net backbone (Precise Sigmoid+PBS)
(c) Qualitative results of SSH trained by Precise Sigmoid+PBS
Figure 7: Qualitative results

6 Conclusion

We propose a novel training strategy, Precise Box Score (PBS), which extracts more information from the detection dataset and, through its precise bounding box scores, benefits the NMS post-processing. A new architecture, Precise Sigmoid, is introduced to implement PBS. We run experiments with one stage face detectors on FDDB to explore how to design the function of PBS. Furthermore, a simple but effective model compression method (SEMCM) is proposed for one stage face detectors, which further boosts the performance of face detection. Experiments demonstrate that (a) Precise Sigmoid+PBS improves the performance of face detection, and (b) Precise Sigmoid+PBS and SEMCM are complementary.

References

  • [1] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883, 2016.
  • [2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Improving object detection with one line of code. arXiv preprint arXiv:1704.04503, 2017.
  • [3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
  • [4] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In European Conference on Computer Vision, pages 109–122. Springer, 2014.
  • [5] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
  • [6] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.
  • [7] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • [8] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [9] J. Hosang, R. Benenson, and B. Schiele. A convnet for non-maximum suppression. In German Conference on Pattern Recognition, pages 192–204. Springer, 2016.
  • [10] P. Hu and D. Ramanan. Finding tiny faces. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [11] V. Jain and E. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
  • [12] H. Jiang and E. Learned-Miller. Face detection with the faster r-cnn. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 650–657. IEEE, 2017.
  • [13] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 2144–2151. IEEE, 2011.
  • [14] T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 845–853, 2016.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [16] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang. Probabilistic elastic part model for unsupervised face detector adaptation. In Proceedings of the IEEE international conference on computer vision, pages 793–800, 2013.
  • [17] H. Li, Z. Lin, J. Brandt, X. Shen, and G. Hua. Efficient boosted exemplar-based face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1843–1850, 2014.
  • [18] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015.
  • [19] Q. Li, S. Jin, and J. Yan. Mimicking very efficient network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6356–6364, 2017.
  • [20] P. Luo, Z. Zhu, Z. Liu, X. Wang, X. Tang, et al. Face model compression by distilling knowledge from neurons. In AAAI, pages 3560–3566, 2016.
  • [21] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In European Conference on Computer Vision, pages 720–735. Springer, 2014.
  • [22] M. Najibi, P. Samangouei, R. Chellappa, and L. Davis. SSH: Single stage headless face detector. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • [23] M. Opitz, G. Waltner, G. Poier, H. Possegger, and H. Bischof. Grid loss: Detecting occluded faces. In European Conference on Computer Vision, pages 386–402. Springer, 2016.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [26] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.
  • [27] H. Wang, Z. Li, X. Ji, and Y. Wang. Face r-cnn. arXiv preprint arXiv:1706.01061, 2017.
  • [28] Y. Wang, X. Ji, Z. Zhou, H. Wang, and Z. Li. Detecting faces using region-based fully convolutional networks. arXiv preprint arXiv:1709.05256, 2017.
  • [29] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–8. IEEE, 2014.
  • [30] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Convolutional channel features. In Proceedings of the IEEE international conference on computer vision, pages 82–90, 2015.
  • [31] S. Yang, P. Luo, C.-C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, pages 3676–3684, 2015.
  • [32] S. Yang, P. Luo, C. C. Loy, and X. Tang. Wider face: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [33] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang. Unitbox: An advanced object detection network. In Proceedings of the 2016 ACM on Multimedia Conference, pages 516–520. ACM, 2016.
  • [34] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
  • [35] C. Zhu, Y. Zheng, K. Luu, and M. Savvides. Cms-rcnn: contextual multi-scale region-based cnn for unconstrained face detection. In Deep Learning for Biometrics, pages 57–79. Springer, 2017.
  • [36] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.