HAMBox: Delving into Online High-quality Anchors Mining for Detecting Outer Faces



Current face detectors utilize anchors to frame a multi-task learning problem that combines classification and bounding box regression. Effective anchor design and anchor matching strategies enable face detectors to localize faces under large pose and scale variations. However, we observe that more than 80% of correctly predicted bounding boxes are regressed from unmatched anchors (anchors whose IoUs with the target faces are lower than a threshold) in the inference phase. This indicates that these unmatched anchors have excellent regression ability, but existing methods neglect to learn from them. In this paper, we propose an Online High-quality Anchor Mining Strategy (HAMBox), which explicitly helps outer faces compensate with high-quality anchors. The proposed HAMBox method could serve as a general strategy for anchor-based single-stage face detection. Experiments on various datasets, including WIDER FACE, FDDB, AFW and PASCAL Face, demonstrate the superiority of the proposed method. Furthermore, our team won the championship on the Face Detection test track of the WIDER Face and Pedestrian Challenge 2019. We will release the code with PaddlePaddle.


1 Introduction

Face detection is a fundamental task for many high-level face-based applications, such as face alignment [30], face recognition [1] and face aging [27]. Starting from early face detectors with hand-crafted features, modern detectors have improved significantly owing to the robust features learnt with deep Convolutional Neural Networks (CNNs) [11]. Current state-of-the-art face detectors are usually anchor-based deep CNNs, inspired by the success of such models in general object detection.

Different from general object detectors, face detectors often face smaller variations of aspect ratios (from 1:1 to 1:1.5) but much larger scale variations (face area, from several pixels to thousands of pixels). Considering the large variations of scales, Zhang et al. [31] tile anchors on a wide range of layers and design anchor scales according to the effective receptive field. Current state-of-the-art detectors [23, 12] capture the locations of faces at various scales by utilizing a Feature Pyramid Network (FPN) [14]. FPN is an effective way to exploit the inherent multi-scale features for constructing feature pyramids in a top-down manner. It adopts lateral connections from the high-level deeper features to the low-level ones. From the perspective of anchor design, anchor-based detectors with FPN further address scale variation by raising the number of anchors in different ways (e.g., anchor stride and anchor ratio [32, 25]). However, we find empirically that increasing the number of anchors remarkably reduces the performance of a face detector, especially when adopting the feature map conv2 or P2 (in ResNet-50) for recalling small faces.

(a) Average Number of Anchors Matched to Each Face
(b) Proportion of Faces that can Match with Anchors
Figure 1: Two crucial factors in designing anchor scales on the WIDER FACE dataset. (a) As the anchor scale increases, the average number of anchors matched to each face also increases. (b) The proportion of faces that can match the anchors decreases significantly outside a specific interval ([0.43, 0.7]).
(a) Cumulative Density Curve of IoU
(b) Performance of Compensated Anchors
(c) Proportion of Unmatched High-quality Anchors
(d) Performance of Matched High-quality Anchors
Figure 2: The problem of the standard anchor matching strategy during training and inference (on the WIDER FACE dataset). (a) During inference, only 11% of all correctly predicted bounding boxes are regressed by matched anchors. (b) PBB represents ‘Predicted Bounding Boxes’. When using our HAMBox strategy, the IoUs between ground-truths and predicted bounding boxes regressed by compensated anchors are much higher than with the standard anchor matching strategy during training. (c) During training, unmatched high-quality anchors account for a surprising 65% of all high-quality anchors on average. (d) CPBB represents ‘Correctly Predicted Bounding Boxes’. During inference, the number of matched high-quality anchors decreases dramatically after NMS, indicating that some unmatched anchors have higher regression ability. All these results demonstrate that the standard anchor matching strategy cannot effectively utilize high-quality negative anchors, which play essential roles during both training and inference.
Figure 3: Visualization of the quality of compensated anchors for the two methods. In the early stage of training, our method does not compensate anchors for outer faces. As training iterations increase, our method gradually mines unmatched high-quality anchors for outer faces, and these anchors have higher IoU than those generated by the standard anchor matching strategy.

As far as we know, effective anchor design strategies are necessary for an anchor-based detector to achieve high performance. SFD [31] adopts a single anchor scale and aspect ratio for each detection stage. Nonetheless, choosing the proper anchor scale remains a big challenge, which generally results from the following misalignment phenomenon. Figure 1 shows ‘the average number of anchors matched to each face’ and ‘the proportion of all faces that can match the anchors’ across different anchor scales, two indicative factors to be considered in designing a proper anchor scale. As the anchor scale increases, the number of anchors matched with each face steadily grows, but the proportion of faces that are capable of matching anchors gradually declines. Moreover, this misalignment usually leads to heuristic anchor scale design.

To alleviate the imbalance between ‘the average number of anchors matched to each face’ and ‘the proportion of all faces that can match the anchors’ discussed above, two representative solutions have been proposed. Firstly, SFD [31] introduces an anchor compensation strategy by offsetting anchors for outer faces (footnote 2). Secondly, Zhu et al. [32] formulate a metric named Expected Maximum Overlap (EMO) to obtain a more reasonable anchor stride and receptive field. All these solutions focus on helping outer faces match more anchors during the training phase. However, they also bring a large number of redundant or low-quality anchors (see the olive line of Figure 2(b)).

In this paper, we conduct an anchor matching statistical experiment on a well-trained face detector [23, 13] and find an intriguing phenomenon. The red line in Figure 2(a) represents the cumulative distribution curve of IoU between the ground-truth and the anchors that are regressed to correctly predicted bounding boxes. Surprisingly, we observe that only 11% of all correctly predicted bounding boxes are regressed by matched anchors. So, not only the matched anchors but also some unmatched ones play a critical role in face detection. However, in the training phase, those unmatched anchors are assigned background labels, which consequently provide unreasonable supervision signals for the classification branch. Effectively leveraging these unmatched anchors is expected to improve the detection performance.

Motivated by this observation, we identify two key issues in current anchor matching strategies as follows:

  • The majority of compensated anchors are of low quality. Figure 2(b) shows the regression ability of compensated anchors during training when adopting the standard anchor matching strategy [20]. Evidently, compensated anchors perform poorly on location regression (the average IoU between the bounding boxes regressed by compensated anchors and the ground-truth is 0.42). In other words, this method helps outer faces match more low-quality anchors instead of high-quality ones.

  • Many unmatched anchors in the training phase actually have strong localization ability. As shown in Figure 2(c), around 65% of all high-quality anchors (footnote 3) are unmatched anchors during training. Based on the above observations, we argue that the current anchor matching strategy is neither flexible nor sufficient to utilize the anchors in face detection. As illustrated in Figure 2(d), the red, blue and green bars denote the number of faces matched to anchors (IoU > 0.35, footnote 4), matched to correctly predicted bounding boxes (IoU > 0.5, footnote 5) and matched to correctly predicted bounding boxes after Non-Maximum Suppression (NMS). It is clear that the correctly predicted bounding boxes regressed by unmatched anchors suppress the ones regressed by matched anchors during NMS. Many unmatched anchors therefore also have strong regression ability.

To address this issue, we propose an Online High-quality Anchor Mining Strategy (HAMBox). The idea is to consistently mine high-quality anchors so that outer faces are compensated with more anchors capable of precise regression. Figure 2(b) and Figure 3 show that the quality of our compensated anchors is significantly higher than that of the standard anchor matching strategy. In Figure 3, when using the standard anchor matching strategy, the unmatched anchors are assigned background labels. As training iterations increase, our Online High-quality Anchor Compensation Strategy gradually mines unmatched high-quality anchors for outer faces. Moreover, these unmatched anchors can regress high-quality bounding boxes with higher IoU than anchors generated by the standard anchor matching strategy. After mining high-quality anchors, we further propose a regression-aware focal loss to effectively weight the newly compensated high-quality anchors. Dynamic weights based on IoU are added for the newly compensated anchors, mainly in consideration of the weak connection between location and classification. Benefiting from the online high-quality anchor compensation strategy and the regression-aware focal loss, we achieve 91.6% AP on the WIDER FACE [29] validation hard set with a RetinaNet [15] baseline. Furthermore, we add some popular modules, including SSH head [19], deep head [15], and pyramid anchors [23], and achieve 93.3% AP, which outperforms the current state-of-the-art model [12] by a large margin of 2.9% AP.

In summary, our main contributions are:

  • We observe an inspiring phenomenon that some unmatched anchors have strong regression ability, and that the current box regression branch neglects to learn from unmatched anchors;

  • Based on this observation, we propose an Online High-quality Anchor Mining Strategy (HAMBox) to sample high-quality anchors for training. Benefiting from HAMBox, we can provide sufficient and effective anchors for outer faces in the training phase;

  • Building on these high-quality anchors, a regression-aware focal loss assists face detector training in a flexible way;

  • Our approach outperforms the state-of-the-art methods by 2.9% and 2.3% AP on the WIDER FACE validation and test hard sets, respectively. Moreover, we achieve 57.45% (validation) / 57.13% (test) mAP on the Face Detection track of the WIDER Face and Pedestrian Challenge 2019.

2 Related Work

Face detection is a fundamental yet challenging computer vision task. Viola and Jones [24] first utilize Haar features and AdaBoost to train a face detector. After that, many works pay attention to combining multiple models to obtain discriminative features. For example, DPM [4] proposes an extra model to capture human lateral features and merges them with front and back body features. All the face detectors based on hand-crafted features optimize each sub-model separately. Due to both weak features and weak classifiers, the performance of these face detectors is limited in practical scenarios.

Recently, owing to the rapid development of deep convolutional networks [11, 6, 21, 22] for image classification and object detection, face detection has made significant progress in practice on large variations, including poses, scales, blur and occlusions. By introducing the core ideas of hand-crafted face detectors, Cascade CNN and Multi-task CNN (MTCNN) propose a coarse-to-fine framework to capture faces via deep CNNs. With the flourishing of general object detectors [20, 16, 2], [10] and SSH [19] introduce anchor-based detectors to face detection. Yang et al. [29] collect the WIDER FACE dataset, which contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. The WIDER FACE dataset pushes forward research on face detection, focusing on extreme variations in scale, pose and occlusion. Recently, most state-of-the-art face detectors address these extreme variations from the following three aspects: image pyramid, feature pyramid and context module. HR [8] designs image pyramids of low, medium and high resolutions for training and testing, which significantly boosts the performance on extreme scale variations (from several pixels to thousands of pixels). FAN [25] introduces attention modules and a feature pyramid network [14] to capture occluded faces. SSH [19] builds a detection module with a rich receptive field. PyramidBox [23] formulates a data-anchor-sampling strategy to increase the proportion of small faces in the training data. Moreover, by designing a scale proposal network, SAFD [5] generates a scale histogram and automatically normalizes face scales before optimizing face detectors. DSFD [12] introduces small-face supervision signals on the backbone, which implicitly boosts the performance of pyramid features.

Regarding works on anchor design and sampling strategies, SFD [31] proposes a new anchor matching strategy which helps outer faces match more anchors. SRN [3] introduces a Selective Two-step Classification to ignore easy sample anchors during training in the second stage. ZCC [32] introduces the Expected Maximum Overlap (EMO) score to evaluate the quality of anchor matching, which helps to design the anchor stride. Group sampling [18] conducts extensive experiments that emphasize the importance of the ratio between matched and unmatched anchors. In this paper, inspired by the anchor matching strategy in SFD [31] and the statistical curves discussed in Figure 1 and Figure 2, we propose an Online High-quality Anchor Mining Strategy (HAMBox), as well as a regression-aware focal loss. Benefiting from these methods, we obtain a strong face detector compared with other state-of-the-art face detection methods.

(a) Face Matched to Anchor in the First Step of the Standard Anchor Matching Strategy
(b) Face Matched to Anchor in the Second Step of the Standard Anchor Matching Strategy
(c) Cumulative Density Curve of IoU
Figure 4: (a) (b) Two different steps of the standard anchor matching strategy; the blue rectangle represents the ground-truth and the red one is an anchor matched with it. (c) Cumulative Density Curve of IoU between the ground-truth and its matched anchor at the different steps.

3 Online High-quality Anchor Mining

This section presents the proposed Online High-quality Anchor Mining Strategy (HAMBox), which compensates outer faces with the most suitable anchors. We first build our high-recall face detector based on RetinaNet [15]. Then we describe the online high-quality anchor compensation strategy in detail. Finally, we formulate a regression-aware focal loss for the compensated anchors.

3.1 High-recall Anchor-based Face Detector

Current anchor-based face detectors utilize predefined anchors to frame a multi-task learning problem, which combines classification and bounding box regression branches. We start with RetinaNet [15] as the baseline. The backbone is ResNet-50. Following the settings in [31], we employ the feature map of the conv2 layer to improve the performance of the face detector, because around 40% of faces are matched to conv2 anchors on the WIDER FACE benchmark. Furthermore, anchor design is important for training a well-performing detector. Therefore, unlike general object detection with multiple anchor scales and aspect ratios, we set only one anchor scale and one aspect ratio at each prediction layer as our default anchor setting.

Inspired by the statistical results in Figure 1, we change the anchor scale (footnote 6) to match more extreme face scales. This anchor setting has a clear advantage and a clear disadvantage. On the one hand, our strategy can match over 95% of all the faces on the WIDER FACE benchmark, only slightly fewer than the 98.46% matched by multi-scale, multi-ratio anchors. At the same time, our method uses three to nine times fewer anchors than the multi-scale, multi-ratio setting, which leads the model to focus more on the regression of useful anchors and to achieve higher detection performance. On the other hand, it harms the robustness of the model because it decreases the number of faces matched to anchors. This obstacle is resolved in the following two subsections.

3.2 Online High-quality Anchor Compensation Strategy

After designing the anchor scale and ratio, we need to assign each anchor to its nearest ground-truth or to the background. As shown in Figure 4, the current anchor matching strategy consists of two steps. First, a face matches all anchors whose IoU with it is higher than a threshold. Then, faces that do not match any anchor are compensated with the anchors that have the maximum IoU with them. Compensated anchors in the second step may reduce the regression and classification performance of the network, since these anchors initially have lower IoU with faces, as shown in Figure 4(c).
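The two-step matching described above can be sketched as follows. This is a minimal illustration rather than the detector's actual implementation: the helper names `iou` and `standard_match` are hypothetical, boxes are assumed to be `(x0, y0, x1, y1)` tuples, the 0.35 threshold is the one quoted in the ablation study, and the second step is simplified to pick a single not-yet-matched anchor per uncovered face.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def standard_match(anchors: List[Box], faces: List[Box],
                   thr: float = 0.35) -> Dict[int, int]:
    """Return {anchor index: face index} for anchors labelled as positives."""
    matched: Dict[int, int] = {}
    # Step 1: an anchor becomes positive if its best IoU with any face exceeds thr.
    for ai, anchor in enumerate(anchors):
        overlaps = [iou(anchor, face) for face in faces]
        if not overlaps:
            continue
        best_face = max(range(len(faces)), key=overlaps.__getitem__)
        if overlaps[best_face] > thr:
            matched[ai] = best_face
    # Step 2: a face left without anchors takes its max-IoU unmatched anchor,
    # even though that IoU is below thr (the low-quality compensation case).
    covered = set(matched.values())
    for fi, face in enumerate(faces):
        if fi in covered:
            continue
        free = [ai for ai in range(len(anchors)) if ai not in matched]
        if not free:
            continue
        best_anchor = max(free, key=lambda ai: iou(anchors[ai], face))
        if iou(anchors[best_anchor], face) > 0.0:
            matched[best_anchor] = fi
    return matched
```

For example, an 8 × 8 face centered inside a 16 × 16 anchor has IoU 0.25, so it is missed by the first step and only picked up by the second step with a low-IoU anchor.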

In Figure 2(b), we surprisingly find that, as training iterations increase, some unmatched anchors become able to make correct predictions, yet they are ignored by the regression branch and even assigned as background by the classification branch. Inspired by this observation, we propose an Online High-quality Anchor Compensation strategy to resolve this misaligned supervision signal. Firstly, each face matches the anchors with IoU higher than a threshold, but we do not compensate the remaining outer faces with any anchors at this step. Secondly, at the end of forward propagation during training, each anchor computes its regression bounding box from its predicted regression coordinates. We denote the set of these regression bounding boxes as $B_{reg}$ and the set of outer faces as $F_{outer}$. Finally, for each face in $F_{outer}$, we compute its IoU with $B_{reg}$ and compensate this face with $N$ extra unmatched anchors. We denote all these IoUs as $IoU_{reg}$. The $N$ compensated anchors are selected according to two rules. 1) The IoUs between their corresponding regression bounding boxes and the target face should be greater than $T$ ($T$ represents an online positive anchor threshold). 2) These IoUs (calculated in rule 1) should be among the top-$K$ highest IoUs in $IoU_{reg}$. $K$ is a hyperparameter that represents the maximum number of anchors that a face can be matched with. If $N$ is greater than $K - M$ after filtering by the above two rules, we select the top-$(K - M)$ highest-IoU anchors among these unmatched anchors to compensate this face and set $N = K - M$. $M$ denotes the number of anchors that the face already matched in the first step. We conduct many experiments in the ablation study by varying $K$ and $T$. Details can be seen in Algorithm 1.

3.3 Regression-aware Focal Loss

With the high-quality anchors mined as described in the two subsections above, the remaining problem is how to make full use of them. We therefore propose a regression-aware focal loss that gives more reasonable weights to the newly compensated high-quality anchors, i.e., the anchors mined for outer faces by the Online High-quality Anchor Compensation Strategy.

We make two improvements to the focal loss [15]. (1) Considering the weak connection between location and classification on newly compensated anchors, we add dynamic weights based on IoU for these compensated anchors. (2) We define anchors that satisfy the following three conditions simultaneously as ignored anchors (which are not optimized during training): a) they belong to the high-quality anchors; b) they are assigned as background in the first step of the anchor matching strategy; c) they are not included in the newly compensated anchors. We define the loss as:


where $i$ is the anchor index in a training batch and $p_i$ is the predicted probability of anchor $i$. $l_i^*$ is the class label of anchor $i$, assigned in the first step of the standard anchor matching strategy. $l_i$ is the label of our newly compensated anchors, which are revised from background to foreground. $F_i$ is the IoU between the corresponding regression bounding box and its target ground-truth. $\psi$ represents the set of all matched anchors and unmatched low-quality anchors (footnote 7), and $\Omega$ represents the set of newly compensated anchors. $N_{norm}$ is the number of normally matched anchors in $\psi$ and $N_{com}$ is the total number of compensated anchors in $\Omega$. $L_{fl}$ is the standard sigmoid focal loss over two classes (face foreground and background). In addition, supervision for the newly compensated anchors is added to the location loss, as shown below:


where $t_i^*$ denotes the ground-truth location coordinates of anchor $i$, and $L_{reg}$ is a standard location loss inspired by Faster R-CNN [20]. All other symbols have the same meaning as in the classification loss.
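For readers who want a concrete form to work with, one pair of losses consistent with the verbal definitions in this subsection is sketched below. It is a reconstruction under stated assumptions, not necessarily the exact published equations. The symbols are assumed names: $i$ indexes anchors, $p_i$ is the predicted probability, $l_i^*$ is the first-step label, $l_i$ is the revised (foreground) label, $F_i$ is the IoU between the regressed box and its target face, $\psi$ is the set of matched and unmatched low-quality anchors, $\Omega$ is the set of newly compensated anchors, $t_i$ and $t_i^*$ are the predicted and ground-truth box encodings, and $\mathbb{1}[\cdot]$ is an indicator. The pairing of each sum with its normalizer and the use of $F_i$ as a weight in the location term are assumptions.

```latex
% Classification: ordinary focal loss on \psi, IoU-weighted focal loss on \Omega.
L_{cls} = \frac{1}{N_{norm}} \sum_{i \in \psi} L_{fl}(p_i, l_i^*)
        + \frac{1}{N_{com}} \sum_{i \in \Omega} F_i \, L_{fl}(p_i, l_i)

% Location: first-step positives plus the newly compensated anchors.
L_{loc} = \frac{1}{N_{norm}} \sum_{i \in \psi} \mathbb{1}[l_i^* = 1] \, L_{reg}(t_i, t_i^*)
        + \frac{1}{N_{com}} \sum_{i \in \Omega} F_i \, L_{reg}(t_i, t_i^*)
```

Under this form, ignored anchors (high-quality, background in the first step, not compensated) belong to neither $\psi$ nor $\Omega$ and therefore contribute no gradient.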

Input: B, G, T, K, D, L, R, A
  B is a set of regression bounding boxes, in the form of (x0, y0, x1, y1).
  G is a set of ground-truths, in the form of (x0, y0, x1, y1).
  T is an online anchor mining threshold (see details in Subsection 3.2).
  K is a hyperparameter that represents the maximum number of anchors that a face can be matched with.
  D is a Dict; the key is a ground-truth and the item is the number of anchors that this ground-truth matches in the first step of our HAMBox method.
  L is a Dict; the key is an anchor index and the item is the label assigned to that anchor in the final process of our HAMBox method.
  R is a Dict; the key is an anchor index and the item is the encoded coordinates of that anchor under the standard anchor matching strategy.
  A is a Dict; the key is an anchor index and the item is the coordinates of that anchor, in the form of (x0, y0, x1, y1).
Output: L and R after using our HAMBox method.
for g in G do
    if D(g) ≥ K then
        continue
    end if
    n ← K − D(g)
    IoU_g ← IoU(g, B)
    S ← anchor indices sorted by IoU_g in descending order
    for a in S do
        if L(a) = 1 or IoU_g(a) ≤ T then
            continue
        end if
        n -= 1
        L(a) ← 1
        R(a) ← encode(g, A(a))
        if n = 0 then
            break
        end if
    end for
end for
Return L, R
Algorithm 1 Online high-quality anchor mining
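The mining step of Subsection 3.2 can be illustrated with a short, self-contained sketch. It is a plain-Python approximation under stated assumptions, not the released implementation: `mine_anchors`, `iou` and the argument names are hypothetical, boxes are `(x0, y0, x1, y1)` tuples, `reg_boxes[i]` is the decoded regression box of anchor `i`, `labels[i]` is its first-step label (1 for face, 0 for background), and `first_step_count[j]` is the number of anchors face `j` matched in the first step. The defaults `T=0.8` and `K=3` are the best setting reported in the ablation study.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def mine_anchors(reg_boxes: List[Box], faces: List[Box], labels: List[int],
                 first_step_count: List[int], T: float = 0.8, K: int = 3
                 ) -> Tuple[List[int], Dict[int, Tuple[int, float]]]:
    """Promote unmatched anchors whose regressed boxes fit an outer face well.

    Returns the updated labels and {anchor index: (face index, IoU)} for the
    newly compensated anchors; the IoU can later serve as a loss weight.
    """
    new_labels = list(labels)
    compensated: Dict[int, Tuple[int, float]] = {}
    for fi, face in enumerate(faces):
        budget = K - first_step_count[fi]  # K - M anchors may still be added
        if budget <= 0:
            continue  # this face already has enough anchors
        # Rule 1: only still-unmatched anchors whose regressed box has IoU > T.
        candidates = [(iou(box, face), ai)
                      for ai, box in enumerate(reg_boxes)
                      if new_labels[ai] == 0]
        candidates = [c for c in candidates if c[0] > T]
        # Rule 2: keep the highest-IoU candidates, at most K - M of them.
        candidates.sort(key=lambda c: c[0], reverse=True)
        for value, ai in candidates[:budget]:
            new_labels[ai] = 1
            compensated[ai] = (fi, value)
    return new_labels, compensated
```

Because the selection depends on the current regression outputs, it has to be rerun at every training iteration, which is what makes the strategy "online".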

4 Experiments

In this section, we first show the effectiveness of our proposed strategies with comprehensive ablative experiments. Then with the final optimal model, our approach achieves state-of-the-art results on face detection benchmarks.

4.1 Ablation Study

The WIDER FACE dataset is used in this ablation study. This dataset has 32,203 images with 393,703 labeled faces with huge variability in scales, occlusions and poses. Our networks are trained only on the training set and evaluated on both the validation and test sets. Average Precision (AP) score is used as the evaluation metric.

Data Augmentation. Our models are trained with the following data augmentation strategies:

  • Color distort: Apply some photometric distortions similar to [7].

  • Data anchor sampling: This method [23] resizes each training image by reshaping a random face in the image to a smaller size.

  • Horizontal flip: After data-anchor-sampling, the cropped image patch is resized to 640 × 640 and horizontally flipped with a probability of 0.5.
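As a small illustration of the last augmentation step, the sketch below flips an image together with its face boxes. It assumes the image is a list of pixel rows and the boxes are `(x0, y0, x1, y1)` tuples in pixel coordinates; the function names `hflip_boxes` and `maybe_flip` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed layout


def hflip_boxes(boxes: Sequence[Box], width: float) -> List[Box]:
    """Mirror boxes around the vertical centre line of an image of given width."""
    return [(width - x1, y0, width - x0, y1) for x0, y0, x1, y1 in boxes]


def maybe_flip(image_rows: List[list], boxes: Sequence[Box],
               p: float = 0.5, rng=random):
    """Flip image and boxes together with probability p; otherwise return as-is."""
    if rng.random() < p:
        width = len(image_rows[0])
        flipped = [row[::-1] for row in image_rows]  # reverse every pixel row
        return flipped, hflip_boxes(boxes, width)
    return image_rows, list(boxes)
```

Note that the left and right edges swap roles after mirroring, which is why the new `x0` is computed from the old `x1`.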

Baseline. We build an anchor-based detector with ResNet-50, following RetinaNet, as our baseline face detector. It differs from the original RetinaNet [15] in the following four aspects. Firstly, we set 6 anchors whose scales are from the set {16, 32, 64, 128, 256, 512}, and all anchors’ aspect ratios are set to 1:1. Secondly, we use LFPN [23] instead of FPN [14] for feature fusion, since the top two high-level features are extracted from regions with little context and may introduce noise for detecting small faces. Thirdly, we do not use the deep head, for two main reasons: the time cost of the training process is too high, and our baseline already performs significantly better than any other SOTA work on the WIDER FACE hard subset. Finally, the IoU threshold for matched anchors is changed to 0.35, and the ignore-zone is not implemented.

Optimization Details. All models are initialized with the pre-trained weights of ResNet-50 and fine-tuned on the WIDER FACE training set. Each training iteration uses seven images per GPU on a server with 4 NVIDIA Tesla V100 GPUs. The initial learning rate is set to 0.01 and decreases to 0.001 after 110k iterations. All the models are trained for 140k iterations by synchronized SGD. The momentum and weight decay are set to 0.9 and 5, respectively.
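The step schedule above is simple enough to state as code. The sketch below is only an illustration of the quoted numbers; the function names are hypothetical, and it assumes the drop happens exactly at iteration 110,000 with zero-based iteration counting.

```python
def learning_rate(iteration: int, base_lr: float = 0.01,
                  decay_at: int = 110_000, total: int = 140_000,
                  factor: float = 0.1) -> float:
    """Step schedule: base_lr until decay_at, then base_lr * factor."""
    if not 0 <= iteration < total:
        raise ValueError("iteration outside the training run")
    return base_lr * factor if iteration >= decay_at else base_lr


def global_batch_size(images_per_gpu: int = 7, num_gpus: int = 4) -> int:
    """Images consumed per synchronized SGD step."""
    return images_per_gpu * num_gpus
```

With seven images on each of four GPUs, every synchronized step processes 28 images.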

(a) AFW
(b) PASCAL Face
(c) FDDB
Figure 5: Evaluation on the common face detection datasets.

The Effect of High-recall Anchor-based Detector. As discussed above, the difference between our high-recall detector and the baseline detector is the pre-defined anchor scale. Inspired by Figure 1, we design our method with anchor scales {16, 32, 64, 128, 256, 512} * 0.68 tiled on pyramid feature maps from P2 to P6.
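To make the anchor setting concrete, the sketch below tiles one square anchor per feature-map cell with the scaled sizes listed above. The strides (4 to 128, i.e. one quarter of each base scale) and the cell-centre placement are assumptions that follow common single-stage face detectors; they are not stated in the text, and `tile_anchors` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed layout


def tile_anchors(image_size: int = 640,
                 base_scales: Sequence[int] = (16, 32, 64, 128, 256, 512),
                 strides: Sequence[int] = (4, 8, 16, 32, 64, 128),  # assumed
                 ratio: float = 0.68) -> List[Box]:
    """Tile one 1:1 anchor of side base_scale * ratio per feature-map cell."""
    anchors: List[Box] = []
    for scale, stride in zip(base_scales, strides):
        half = scale * ratio / 2.0
        cells = image_size // stride  # feature-map width and height
        for row in range(cells):
            for col in range(cells):
                cx = (col + 0.5) * stride  # anchor centred on the cell
                cy = (row + 0.5) * stride
                anchors.append((cx - half, cy - half, cx + half, cy + half))
    return anchors
```

Under these assumptions a 640 × 640 input yields 34,125 anchors, three quarters of which (25,600) sit on the finest level, which is consistent with the emphasis on small faces.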

To better understand the advantage of our method, we conduct four experiments, shown in Table 1: (1) the first step of the standard anchor matching strategy with anchor scale ratio 0.68 (denoted as SMS(ratio=0.68)); (2) both steps of the standard anchor matching strategy with anchor scale ratio 0.68 (denoted as DMS(ratio=0.68)); (3) both steps of the standard anchor matching strategy with anchor scale ratio 0.5, a scale that helps more outer faces match anchors while significantly decreasing the number of anchors matched with each face (denoted as DMS(ratio=0.50)); (4) the new anchor matching strategy introduced by SFD [31] with anchor scale ratio 0.68 (denoted as NAMS(ratio=0.68)). Compared to the baseline, SMS(ratio=0.68), DMS(ratio=0.68) and NAMS(ratio=0.68) provide a significant improvement on the hard subset (rising by 1.2%, 0.8% and 0.8% AP, respectively), while DMS(ratio=0.50) brings no improvement on the hard subset (decreasing by 0.7% AP). From these results, we draw two conclusions. On the one hand, increasing the proportion of faces that can be matched with anchors improves the model performance. However, as the anchor scale is continually decreased to increase this proportion, the remaining faces become more difficult to match and the number of anchors that each face can match decreases dramatically, which is the main reason why DMS(ratio=0.50) is 1.5% AP lower than DMS(ratio=0.68) on the hard subset. On the other hand, NAMS(ratio=0.68) and DMS(ratio=0.68) achieve almost the same performance as SMS(ratio=0.68), suggesting that these two anchor compensation methods have little influence on the performance of the detector. Thus we use SMS(ratio=0.68) in the following experiments, and treat DMS(ratio=0.68) and NAMS(ratio=0.68) as comparison methods. The anchor scale ratio of SMS in Tables 3 and 4 is set to 0.68.

Method     Ratio   Easy    Medium   Hard
Baseline   1.00    0.943   0.931    0.894
+SMS       0.68    0.949   0.945    0.906
+DMS       0.68    0.954   0.949    0.902
+DMS       0.50    0.938   0.922    0.887
+NAMS      0.68    0.951   0.948    0.902
Table 1: AP performance of various anchor settings and anchor matching strategies on the WIDER FACE validation subsets.

The Effect of Online High-quality Anchor Mining. Next, we look into the effect of our proposed online high-quality anchor mining strategy. In this paragraph, we mainly discuss the effect of the two hyper-parameters in our method. The performance under different K and T (defined in Subsection 3.2) is shown in Table 2. It shows that: (1) the performance generally gets better when T increases, from which we conclude that the higher the quality of the compensated anchors, the better the performance of the model. (2) The performance gets better when K is smaller than 5 and gets worse when K is larger than 5, suggesting that it is not beneficial to compensate each outer face with too many anchors, since the anchors beyond this limit are redundant for their corresponding faces. After multiple ablative experiments, we find the optimal setting K = 3, T = 0.8, which further increases the performance by 0.7% AP.

K   T     Easy    Medium   Hard
3   0.5   0.945   0.939    0.902
7   0.5   0.941   0.937    0.898
3   0.7   0.947   0.943    0.911
5   0.7   0.952   0.941    0.909
3   0.8   0.957   0.951    0.913
3   0.9   0.962   0.943    0.911
Table 2: Varying K and T for regression-aware focal loss on the WIDER FACE validation subsets.

The Effect of Regression-aware Focal Loss. The regression-aware focal loss completes our final HAMBox model. As discussed above, this loss gives the newly matched anchors a reasonable weight, which helps the model train on these anchors more steadily and precisely. Results using our regression-aware focal loss (denoted as RAL) are shown in Table 4; the performance of our detector increases by a further 0.3% AP.
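The weighting idea behind RAL can be sketched in a few lines of plain Python. This is an illustrative reading of Subsection 3.3 rather than the authors' implementation: the function names and argument layout are hypothetical, the focal-loss defaults (alpha = 0.25, gamma = 2) are the ones from the RetinaNet paper, and normalizing each term by its own anchor count is an assumption.

```python
import math
from typing import Dict, Iterable, List


def focal_loss(p: float, label: int, alpha: float = 0.25,
               gamma: float = 2.0) -> float:
    """Binary sigmoid focal loss for one anchor, given its face probability p."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def regression_aware_cls_loss(probs: List[float], first_step_labels: List[int],
                              compensated: Dict[int, float],
                              ignored: Iterable[int] = ()) -> float:
    """Classification loss with IoU-weighted terms for compensated anchors.

    compensated maps anchor index -> IoU of its regressed box with the face;
    ignored holds high-quality anchors that were not selected for compensation.
    """
    ignored = set(ignored)
    normal_sum, n_norm = 0.0, 0
    for i, (p, label) in enumerate(zip(probs, first_step_labels)):
        if i in compensated or i in ignored:
            continue  # handled below, or excluded from optimization
        normal_sum += focal_loss(p, label)
        n_norm += label  # count the normally matched (positive) anchors
    # Compensated anchors are treated as faces, weighted by their IoU.
    comp_sum = sum(w * focal_loss(probs[i], 1) for i, w in compensated.items())
    return normal_sum / max(n_norm, 1) + comp_sum / max(len(compensated), 1)
```

A compensated anchor whose regressed box barely clears the threshold thus contributes less to the classification loss than one that overlaps the face almost perfectly.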

(a) Val: Easy
(b) Val: Medium
(c) Val: Hard
(d) Test: Easy
(e) Test: Medium
(f) Test: Hard
Figure 6: Precision-Recall (PR) curves on WIDER FACE validation and testing subsets.

Our Method vs ZCC and NAMS. To further verify the effectiveness of our method, we compare it with NAMS [31] and ZCC [32]. As shown in Table 3, our method outperforms the results reported in their papers by 6.4% and 5.5% AP on the hard subset, respectively. Moreover, when their strategies are reproduced on our baseline, our method still outperforms them by 2.0% and 3.4% AP, respectively. Note that our method offers more high-quality anchors, which help the bounding box regression and classification branches optimize well.

The Effect of Other Tricks. As shown in Table 4, we introduce the SSH [19], deep head (DH) [15] and pyramid anchor (PA) [23] modules to further improve the performance of the detector and achieve the best AP among all state-of-the-art face detectors [12, 3, 17, 23, 26, 32, 31, 19, 8]. We outperform the others on the validation and test hard subsets by 2.9% and 2.3% AP, respectively. Besides, we achieve 57.45% and 57.13% on the validation and test sets when using the more rigorous mAP metric.

Method                       Easy    Medium   Hard
NAMS                         0.937   0.924    0.852
ZCC                          0.949   0.933    0.861
Baseline + NAMS*             0.941   0.937    0.896
Baseline + ZCC*              0.943   0.942    0.882
Baseline + SMS + OAM + RAL   0.962   0.953    0.916
Table 3: AP performance of our model with various anchor matching strategies on the WIDER FACE validation subsets. * denotes the performance reproduced by us, and the APs for NAMS and ZCC are those reported in their papers.
Baseline   SMS   OAM   RAL   DH   SSH   PA   Easy    Medium   Hard
✓          -     -     -     -    -     -    0.943   0.931    0.894
✓          ✓     -     -     -    -     -    0.949   0.945    0.906
✓          ✓     ✓     -     -    -     -    0.957   0.951    0.913
✓          ✓     ✓     ✓     -    -     -    0.962   0.953    0.916
✓          ✓     ✓     ✓     ✓    -     -    0.964   0.955    0.922
✓          ✓     ✓     ✓     ✓    ✓     -    0.968   0.959    0.927
✓          ✓     ✓     ✓     ✓    ✓     ✓    0.970   0.964    0.933
Table 4: AP performance of our proposed modules and additional tricks on the WIDER FACE validation subsets.

4.2 Evaluation on Common Benchmarks

We evaluate our proposed method on the common face detection benchmarks, including WIDER FACE [29], Annotated Faces in the Wild (AFW) [33], PASCAL Faces [28] and FDDB [9]. Our face detector is trained using only the WIDER FACE training set and is tested on those benchmarks. We demonstrate state-of-the-art performance across all the datasets.

WIDER FACE Dataset. We report the performance of our face detection system on the WIDER FACE [29] testing set with 16,097 images. Detection results are sent to the database server to obtain the precision-recall curves. Figure 6 illustrates the precision-recall curves along with AP scores. Our proposed method achieves 97.0% (Easy), 96.4% (Medium) and 93.3% (Hard) on the validation set and 95.9% (Easy), 95.5% (Medium) and 92.3% (Hard) on the test set. On the hard subset in particular, we outperform the current state-of-the-art model by 2.3% AP (Test) and 2.9% AP (Validation). This large improvement demonstrates the superiority of our method.

AFW Dataset This dataset [33] consists of 205 images with 473 annotated faces. Figure 5(a) shows that our detector outperforms others by a considerable margin.

PASCAL Face Dataset This dataset [28] has 851 images with 1,335 annotated faces. Figure 5(b) demonstrates the superiority of our method.

FDDB Dataset This dataset [9] has 2,845 images with 5,171 annotated faces. Most of the images have low resolution and complicated scenes, with occlusions and large pose variations. Figure 5(c) shows that our proposed method outperforms all state-of-the-art models.

5 Conclusion

In this paper, we first observe an interesting phenomenon: only 11% of correctly predicted bounding boxes are regressed from the matched anchors in the inference phase, while the large majority come from unmatched anchors. We then propose an online high-quality anchor mining strategy that helps outer faces match high-quality anchors. Our method first increases the proportion of faces matched with anchors, and then applies an online high-quality anchor compensation strategy for outer faces. Finally, we design a regression-aware focal loss for the newly compensated anchors. We conduct extensive experiments on the AFW, PASCAL Face, FDDB and WIDER FACE datasets and achieve state-of-the-art detection performance.
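
The statistic behind this observation can be illustrated with a minimal sketch. It assumes boxes in (x1, y1, x2, y2) form and a single target face. The helper names `iou` and `matched_share` and the matching threshold of 0.35 are assumptions made for illustration and are not taken from this section; the 0.5 correctness threshold follows footnote 3.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matched_share(anchors, regressed, face, match_thr=0.35, correct_thr=0.5):
    """Fraction of correct predictions that were regressed from matched anchors.

    anchors[i] is an anchor box and regressed[i] is the box predicted from it.
    match_thr is an assumed anchor-matching threshold (hypothetical value);
    correct_thr is the IoU above which a prediction counts as correct.
    """
    correct = [i for i, box in enumerate(regressed) if iou(box, face) > correct_thr]
    if not correct:
        return 0.0
    from_matched = sum(1 for i in correct if iou(anchors[i], face) >= match_thr)
    return from_matched / len(correct)


# Toy example: one face, three anchors and the boxes regressed from them.
face = (0, 0, 10, 10)
anchors = [(0, 0, 10, 10), (6, 6, 16, 16), (20, 20, 30, 30)]
regressed = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
share = matched_share(anchors, regressed, face)
```

In this toy case two predictions are correct, and only one of them starts from an anchor that overlaps the face enough to be matched; the other is a well-regressed box from an unmatched anchor, which is the kind of anchor the mining strategy aims to exploit.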


  1. Equal contribution. Work done during an internship at Baidu VIS.
  2. Faces that cannot match enough positive anchors. In our paper, this number is a hyper-parameter, detailed in Subsection 3.2.
  3. The intersection-over-union (IoU) between its regression bounding box and corresponding ground-truth is higher than 0.5.
  4. This denotes the IoU between anchor and target face in the training phase.
  5. This denotes the IoU between the predicted bounding box of matched anchor and the target face.
  6. In our method, the anchor scale is set to 0.68 × {16, 32, 64, 128, 256, 512} and the aspect ratio is 1:1 at the different prediction layers.
  7. This denotes that the IoU between the predicted bounding box regressed from an unmatched anchor and the target face is below 0.5.
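
The anchor setting in footnote 6 can be made concrete with a short sketch. It reads the footnote as a multiplier of 0.68 applied to each base size, which is an interpretation of the notation; the function name `anchor_boxes` and the choice of a single centre point are hypothetical and serve only as illustration.

```python
def anchor_boxes(cx, cy, base_sizes=(16, 32, 64, 128, 256, 512), scale=0.68):
    """Return one square (1:1) anchor per prediction layer, centred at (cx, cy).

    Each anchor is an (x1, y1, x2, y2) box whose side is scale * base size.
    """
    boxes = []
    for base in base_sizes:
        half = scale * base / 2.0
        boxes.append((cx - half, cy - half, cx + half, cy + half))
    return boxes


# Six anchors at one location; sides range from about 10.9 to about 348 pixels.
boxes = anchor_boxes(100.0, 100.0)
sides = [b[2] - b[0] for b in boxes]
```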


  1. G. Antipov, M. Baccouche and J. Dugelay (2017) Face aging with conditional generative adversarial networks. In ICIP, pp. 2089–2093. Cited by: §1.
  2. S. Chen, J. Li, C. Yao, W. Hou, S. Qin, W. Jin and X. Tang (2019) DuBox: no-prior box objection detection via residual dual scale detectors. arXiv preprint arXiv:1904.06883. Cited by: §2.
  3. C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li and X. Zou (2019) Selective refinement network for high performance face detection. In AAAI, Vol. 33, pp. 8231–8238. Cited by: §2, §4.1.
  4. P. F. Felzenszwalb, R. B. Girshick, D. McAllester and D. Ramanan (2009) Object detection with discriminatively trained part-based models. TPAMI 32 (9), pp. 1627–1645. Cited by: §2.
  5. Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li and X. Hu (2017) Scale-aware face detection. In CVPR, pp. 6186–6195. Cited by: §2.
  6. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §2.
  7. A. G. Howard (2013) Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402. Cited by: 1st item.
  8. P. Hu and D. Ramanan (2017) Finding tiny faces. In CVPR, pp. 951–959. Cited by: §2, §4.1.
  9. V. Jain and E. Learned-Miller (2010) Fddb: a benchmark for face detection in unconstrained settings. Cited by: §4.2, §4.2.
  10. H. Jiang and E. Learned-Miller (2017) Face detection with the faster r-cnn. In FG, pp. 650–657. Cited by: §2.
  11. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §1, §2.
  12. J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li and F. Huang (2019) Dsfd: dual shot face detector. In CVPR, pp. 5060–5069. Cited by: §1, §1, §2, §4.1.
  13. Z. Li, X. Tang, X. Wu, J. Liu and R. He (2019) Progressively refined face detection through semantics-enriched representation learning. IEEE Transactions on Information Forensics and Security. Cited by: §1.
  14. T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: §1, §2, §4.1.
  15. T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988. Cited by: §1, §3.1, §3.3, §3, §4.1, §4.1.
  16. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §2.
  17. W. Liu, S. Liao, W. Ren, W. Hu and Y. Yu (2019) High-level semantic feature detection: a new perspective for pedestrian detection. In CVPR, pp. 5187–5196. Cited by: §4.1.
  18. X. Ming, F. Wei, T. Zhang, D. Chen and F. Wen (2019) Group sampling for scale invariant face detection. In CVPR, pp. 3446–3456. Cited by: §2.
  19. M. Najibi, P. Samangouei, R. Chellappa and L. S. Davis (2017) Ssh: single stage headless face detector. In ICCV, pp. 4875–4884. Cited by: §1, §2, §4.1.
  20. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: 1st item, §2, §3.3.
  21. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.
  22. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, pp. 1–9. Cited by: §2.
  23. X. Tang, D. K. Du, Z. He and J. Liu (2018) Pyramidbox: a context-assisted single shot face detector. In ECCV, pp. 797–813. Cited by: §1, §1, §1, §2, 2nd item, §4.1, §4.1.
  24. P. Viola and M. J. Jones (2004) Robust real-time face detection. IJCV 57 (2), pp. 137–154. Cited by: §2.
  25. J. Wang, Y. Yuan and G. Yu (2017) Face attention network: an effective face detector for the occluded faces. arXiv preprint arXiv:1711.07246. Cited by: §1, §2.
  26. Y. Wang, X. Ji, Z. Zhou, H. Wang and Z. Li (2017) Detecting faces using region-based fully convolutional networks. arXiv preprint arXiv:1709.05256. Cited by: §4.1.
  27. Z. Wang, X. Tang, W. Luo and S. Gao (2018) Face aging with identity-preserved conditional generative adversarial networks. In CVPR, pp. 7939–7947. Cited by: §1.
  28. J. Yan, X. Zhang, Z. Lei and S. Z. Li (2014) Face detection by structural models. Image and Vision Computing 32 (10), pp. 790–799. Cited by: §4.2, §4.2.
  29. S. Yang, P. Luo, C. Loy and X. Tang (2016) Wider face: a face detection benchmark. In CVPR, pp. 5525–5533. Cited by: §1, §2, §4.2, §4.2.
  30. J. Zhang, S. Shan, M. Kan and X. Chen (2014) Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In ECCV, pp. 1–16. Cited by: §1.
  31. S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang and S. Z. Li (2017) S3fd: single shot scale-invariant face detector. In ICCV, pp. 192–201. Cited by: §1, §1, §1, §2, §3.1, §4.1, §4.1, §4.1.
  32. C. Zhu, R. Tao, K. Luu and M. Savvides (2018) Seeing small faces from robust anchor’s perspective. In CVPR, pp. 5127–5136. Cited by: §1, §1, §2, §4.1, §4.1.
  33. X. Zhu and D. Ramanan (2012) Face detection, pose estimation, and landmark localization in the wild. In CVPR, pp. 2879–2886. Cited by: §4.2, §4.2.