Face Detection with Feature Pyramids and Landmarks
Accurate face detection and facial landmark localization are crucial to any face recognition system. We present a series of three single-stage RCNNs with different-sized backbones (MobileNetV2-25, MobileNetV2-100, and ResNet101) and a six-layer feature pyramid, trained exclusively on the WIDER FACE dataset. We compare face detection and landmark accuracy across eight context module architectures: four proposed in previous research and four modified versions. We find no evidence that any of the proposed architectures significantly outperforms the others, and postulate that the random initialization of the additional layers is of at least equal importance. To show this, we present a model that achieves near state-of-the-art performance on WIDER FACE while also providing high-accuracy landmarks with a simple context module. We also present results using MobileNetV2 backbones, which achieve over  average precision on the WIDER FACE hard validation set while being able to run in real time. By comparing to other authors, we show that our models exceed the state-of-the-art for similar-sized RCNNs and match the performance of much heavier networks.
Over recent years, the ever-improving performance of Convolutional Neural Networks (CNNs) has resulted in highly accurate computer vision applications (e.g. Redmon and Farhadi, 2018; Wang et al., 2018; Tan and Le, 2019; Wang et al., 2019; Wang and Deng, 2018). One predominant application is object detection, the task of localizing objects and classifying them. Accurately solving this task opens the door to a wide range of applications, from autonomous driving (see Badue et al., 2019, for a review) to person re-identification (e.g. Hermans et al., 2017). The problem was traditionally approached in three main stages: region selection (Vedaldi et al., 2009), feature extraction (Dalal and Triggs, 2005), and object classification (Forsyth, 2014). However, the low efficiency of region selection and the semantic limitations of manually engineered descriptors hinder the performance of these algorithms. Recently, many deep learning approaches have been developed to tackle this inefficiency. Region CNNs (RCNNs) (Girshick et al., 2014) are commonly used for various detection tasks. They perform a greedy selective search (Uijlings et al., 2013) to significantly lower the number of region proposals; however, this is computationally expensive. Fast RCNN (Girshick, 2015) feeds pixel-level region proposals from the feature maps into the detection network, reducing the overhead somewhat. Faster RCNN (Ren et al., 2017) goes further, utilizing CNN-based Region Proposal Networks (RPNs) that remove the greedy selective search used in previous RCNNs and enable detection in real time. RPNs use a pyramid of anchors to propose regions more efficiently than pyramids of images (e.g. Viola and Jones, 2004) or filters. This anchor-based approach has successfully been applied to many detection tasks (Zhou et al., 2019; Yang and Geng, 2018; Fan et al., 2016; Sa et al., 2017). Recently, single-stage detectors based on Faster RCNN have also been widely adopted.
For example, Lin et al. (2017) proposed a single-stage architecture called RetinaNet combined with focal loss—designed to combat the inherent class imbalance—which achieves state-of-the-art accuracy on the COCO dataset.
Face recognition is now commonplace in our daily lives, with businesses looking to take advantage of its convenience and robustness. Face detection serves as the foundation for recognition and various other face-related research and products, including alignment and attribute classification (e.g. gender, age, facial expression). Like any object detection task, the goal of face detection is to provide bounding boxes for all the faces in an image. However, variations in pose, illumination, resolution, occlusion, and human variance in real-world data make face detection challenging. Viola and Jones (2004) proposed an approach that performs feature searching using Haar-like features (Wilson and Fernandez, 2006), which, combined with the integral image, generate a set of features that assist in face detection. While those authors perform face detection on an image pyramid, the multi-scale feature pyramid approach has recently been shown to perform feature extraction more efficiently (Lin et al., 2017).
Landmark localization refers to estimating predefined landmark locations in images. Common tasks include facial (pupils, nose tip, mouth corners, etc.) and body (elbows, knees, wrists, shoulders, and face landmarks) landmark localization. One of the most benchmarked public face landmark datasets is AFLW (Köstinger et al., 2011), which has 20,000 training images, 4,386 test images, and 19 manually annotated face landmarks per image. Facial landmarks are employed in various tasks, particularly in face alignment as a preprocessing step before face recognition (Taigman et al., 2014; Sun et al., 2014). One of the first breakthrough deep learning models for face landmark localization was the Multitask Cascaded Convolutional Network (MTCNN) (Zhang et al., 2016). Utilizing several cascaded CNNs, the authors achieved robust and accurate results, and MTCNN has remained a strong baseline for several years. Subsequently, more computationally efficient methods have been developed that match the accuracy of MTCNN, such as Bulat and Tzimiropoulos (2017). Recently, methods superior to MTCNN, both in terms of accuracy and economy, have been developed, including RetinaFace (Deng et al., 2019). Many dense 3D face alignment techniques (e.g. Liu et al., 2018), using U-Nets (Guo et al., 2018) and Hourglass networks (Newell et al., 2016), have further improved landmark localization. However, these approaches often incur large computational overheads.
This paper is organized as follows. Section 2 presents some of the most recent work that we will draw upon. Section 3 details our loss function and context modules. Section 4 outlines the experiment procedure we will employ. Section 5 compares the results for different architectures. Section 6 presents our conclusions.
2 Related Work
2.1 Pyramid of features
Recently, the adoption of multi-scale feature pyramids for detection tasks has become widespread (e.g. Najibi et al., 2017; Lin et al., 2016; Zhang et al., 2017; Wang et al., 2017; Deng et al., 2019; Zhang et al., 2019a,b). The work draws on results using spatial pyramid pooling (He et al., 2014), which can efficiently extract features at different levels from a single image, moving away from the less efficient pyramid-of-images approach (Viola and Jones, 2004). Multi-scale feature pyramids take only a single-scale image as input and output proportionally sized feature maps at various levels through top-down and lateral connections. This approach displays significant performance improvements on the COCO (Lin et al., 2014, 2016) and WIDER FACE (Wang et al., 2017; Deng et al., 2019; Zhang et al., 2019a,b) challenges. Due to the performance of this approach, we will be adopting it in our experiments.
2.2 Single versus two-stage
Generally, there are two types of modern face detectors: single-stage and two-stage. Single-stage models make independent object classifications from multiple feature maps deep in the network (Liu et al., 2016), typically giving them a latency advantage. However, these feature maps have a lower spatial resolution and may have already lost some semantic information relating to small objects, generally leading to reduced accuracy. Two-stage detectors (e.g. Faster RCNN) construct semantically rich feature maps from different layers in the network (Lin et al., 2017) and classify regions of interest. As a result, two-stage architectures can detect small objects with higher precision but at reduced speed (Yoo et al., 2019). Finding a balance between accuracy and inference time has been a predominant focus of recent research.
In the past few years, there have been numerous two-stage detectors that perform well on WIDER FACE. For example, Li et al. (2017) proposed ‘Light-Head’ RCNNs, an efficient and accurate two-stage face detector that generates ‘thin’ feature maps and applies a large-kernel deformable convolution before the RoI warping, inspired by Light RCNN. The authors add additional small anchors to support tiny faces, which helps in evaluation, achieving ,  and  on the easy, medium, and hard WIDER FACE sets. Similarly, Li et al. (2018) proposed Dual Shot Face Detectors (DSFDs), using progressive anchor loss, a feature enhancement module, and an improved anchor matching strategy to achieve state-of-the-art face detection. The authors use a Feature Enhance Module, a combination of a typical FPN and a Receptive Field Block (RFB) (Liu et al., 2017), before the second shot. They also propose a Progressive Anchor Loss strategy, using smaller anchors in the first shot and larger ones in the second, arguing that the original feature maps have less semantic information but more location information.
More recently, however, single-stage solutions have shown their dominance. For example, Najibi et al. (2017) introduced the Single Stage Headless (SSH) architecture, which detects faces in a single forward pass by directly extracting features from different scales within the network. Their detector achieved state-of-the-art performance on the WIDER FACE dataset while being eight times faster than previous methods. More recently, RetinaFace (Deng et al., 2019), another RetinaNet-style (Lin et al., 2017) single-stage feature pyramid detection network, achieved state-of-the-art performance on WIDER FACE. The authors stress the importance of incorporating face key-points alongside the bounding boxes for improved performance on WIDER FACE. These results have shown that single-stage detectors can outperform two-stage detectors in both accuracy and latency. This work was subsequently followed by Zhang et al. (2019a) and Zhang et al. (2019b) using similar approaches. Because of this recent success, we also adopt a single-stage RetinaNet approach.
2.3 Multi-task learning
Chen et al. (2014) were the first to propose combining face detection and alignment into a joint cascade framework. Subsequently, other authors have used this approach to improve the accuracy of face detection networks (e.g. Chen et al., 2016; Zhang et al., 2016; Deng et al., 2019). Having a face detector that can also provide basic alignment information is extremely beneficial for any face recognition system. Multi-task learning is common practice for training face detection networks. For example, Tian et al. (2018) presented a feature fusion pyramid architecture with a weakly supervised segmentation branch that achieves state-of-the-art performance on WIDER FACE. The authors used a combination of three loss functions to train their network: a classification loss, a regression loss, and a segmentation loss. They argue that the segmentation branch helps the network learn more discriminative features. Similarly, Deng et al. (2019) trained RetinaFace using both a face landmark loss and a dense regression loss—generated from the difference between the original face and its reconstruction from a mesh decoder. Like Tian et al. (2018), they achieved state-of-the-art performance on WIDER FACE.
2.4 Landmark localization
Face recognition models rely on having well-aligned faces at training and inference (Taigman et al., 2014; Sun et al., 2014; Deng et al., 2019). To align the face, a transformation of the original image is needed such that the landmarks of each face reside in specific locations. This transformation depends on the quality of the landmark locations. MTCNN (Zhang et al., 2016) has been used prolifically in face recognition tasks because the network provides both face bounding boxes and landmarks. Deng et al. (2019) improved on this with RetinaFace, a single-stage face detector that returns the same face landmarks as MTCNN, so it can easily replace MTCNN in most use cases. The authors found vast improvements in verification accuracy on LFW (Huang et al., 2008), CFP-FP (Sengupta et al., 2016), AgeDB-30 (Moschoglou et al., 2017) and IJB-C (Maze et al., 2018) just by changing from MTCNN to RetinaFace. Therefore, we will also include a landmark loss term to help train our networks.
2.5 Context modules
Najibi et al. (2017) were the first to propose using context modules in single-stage detectors. Since this work, several papers have used different context modules in their detectors and have reported improved results on WIDER FACE (e.g. Deng et al., 2019; Li et al., 2019). In the original paper, SSH, the authors took the network output and performed a series of three convolutions, then concatenated the outputs of the final two. RetinaFace (Deng et al., 2019) uses a similar three-layer approach, reducing the number of filters in the second and third layers by a factor of two, then concatenating all three outputs.
However, in their GitHub repository
3 Method
3.1 Multi-task loss function
As previously mentioned, using a multi-task loss function is commonplace in detection tasks. To train our models we use a loss function comprising three components: class, bounding box, and landmark loss. The class loss is the log loss over the two classes (face vs. background), calculated for both positive and negative anchors. The bounding box loss is the smooth-L1 regression loss of the box location, calculated only for positive anchors. Similarly, the landmark loss is the regression loss of the landmark locations, also calculated only for positive anchors. The combination of these three loss functions yields our multi-task loss function,
where λ1 and λ2 are scale factors which are set to  and , the normalizing counts correspond to all and to positive anchors, and to the batch size and number of anchors, respectively. Lin et al. (2017) proposed using focal loss to address the inherent class imbalance; however, we find no significant benefit in replacing cross entropy.
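The equation itself did not survive extraction. As a sketch only, a standard formulation consistent with the description above (notation assumed here, in the style of RetinaFace-like multi-task losses) would be:

```latex
L = \frac{1}{N_{\mathrm{all}}} \sum_{i} L_{\mathrm{cls}}\!\left(p_i, p_i^{*}\right)
  + \frac{\lambda_1}{N_{\mathrm{pos}}} \sum_{i} p_i^{*}\, L_{\mathrm{box}}\!\left(t_i, t_i^{*}\right)
  + \frac{\lambda_2}{N_{\mathrm{pos}}} \sum_{i} p_i^{*}\, L_{\mathrm{pts}}\!\left(\ell_i, \ell_i^{*}\right)
```

where $p_i$ is the predicted face probability for anchor $i$, $p_i^{*} \in \{0, 1\}$ is its label (so the box and landmark terms apply only to positive anchors), $t_i$ and $\ell_i$ are the box and landmark regressions, and $N_{\mathrm{all}}$ and $N_{\mathrm{pos}}$ count all and positive anchors.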
3.2 Context modules
Figure 1 shows an illustration of all the context modules we will be testing in our experiments. The left column shows the context modules from the previous section; the right column shows slightly modified versions. The number of filters in the context module differs for each network backbone. For SSH, we modify the context module by dividing the number of channels by four, then performing four convolutions and concatenating all the outputs. For RSSH, we simply halve the number of channels in the third convolution, then perform a fourth convolution and concatenate all four outputs. For Retina, we swap the addition and concatenation, so we concatenate the last layer with the sum of the first and second layers. For Dense, we add a fourth densely connected convolution. We also test two ‘basic’ context modules, the first with just a single convolution and the second with two concatenated convolutions.
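As a concrete illustration of the channel bookkeeping in the modified SSH module above (input channels divided across four parallel convolutions, then concatenated back to the original width), here is a minimal sketch; the function name and the even-split policy for non-divisible counts are our own illustrative choices, not taken from the paper:

```python
def split_channels(total, branches):
    """Divide `total` filters as evenly as possible across `branches`
    parallel convolutions, so that concatenating the branch outputs
    restores exactly `total` channels."""
    base = total // branches
    rem = total - base * branches
    # Give any remainder, one channel at a time, to the first branches.
    return [base + (1 if i < rem else 0) for i in range(branches)]

# Modified SSH module: channels divided by four, four convolutions,
# outputs concatenated back to the original width.
print(split_channels(64, 4))  # -> [16, 16, 16, 16]
```
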
4 Experimental Procedure
4.1 Training dataset
To train our model we use the WIDER FACE dataset (Yang et al., 2016). This dataset consists of 32,203 images and 393,703 labeled face bounding boxes with variable scale, pose and occlusion. The dataset is organized based on 61 event classes (e.g. parade, riot, and festival). Each event class is randomly sampled, with 40%, 10% and 50% of the images assigned to the training, validation and testing sets. EdgeBox (Zitnick and Dollar, 2014) is used to separate the proposals into three difficulty levels; Easy, Medium and Hard with recall rates of 92%, 76%, and 34%, respectively.
To incorporate landmarks into our training procedure we also use the five-landmark annotations from Deng et al. (2019). The authors labeled faces in the training set and made them publicly available. These landmarks follow the format used by Zhang et al. (2016): eye centers, nose tip, and mouth corners. Faces with indistinguishable landmarks were given a dummy value and are not used in the loss function for that proposal. Deng et al. (2019) showed that incorporating the landmarks into their multi-task loss improved the mAP on WIDER FACE by . We also label a further  faces in the validation set, using the same labeling scheme.
4.2 Baseline settings
In this report, we train with three different backbone network sizes. We train a very lightweight network based on MobileNetV2 (Howard et al., 2017; Sandler et al., 2018) with α = 0.25, a heavier MobileNetV2 with α = 1.0, and a much heavier ResNet v2 (He et al., 2016) with 101 layers. We will refer to these networks as MNet-25, MNet-100, and ResNet101, respectively.
We use an input image size of , in line with previous work (e.g. Li et al., 2018; Deng et al., 2019), and anchor scales ranging from  to , with  total anchors. Due to the nature of the task, we set all anchors to an aspect ratio of 1:1. We match positive anchors to ground truths with IoUs greater than  and negative anchors with IoUs less than . Furthermore, we incorporate online hard example mining (OHEM) (Shrivastava et al., 2016), which has been successful in training other recent RPN-based face detectors (Zhang et al., 2017; Deng et al., 2019; Zhang et al., 2019a,b). The hard examples are selected by sorting the anchors by their loss and taking the hardest positive and negative anchors at a ratio of 1:3, following Girshick (2015). During training we randomly crop regions of the original images (following Zhang et al., 2017; Tang et al., 2018; Deng et al., 2019). For our feature pyramid we found that a six-level feature pyramid (see Table 4.2) gave the best results, and we use this setup for all our models. We found that increasing the number of levels in the feature pyramid hampered the landmark accuracy; however, as our primary goal is face detection accuracy, we choose to forego some landmark accuracy for improved face detection. All of the various context modules are implemented at the same point in the network with the same tensor input, all yielding outputs of the same shape. After each context module, we apply a modulated deformable convolution (Zhu et al., 2018) to enhance the context information.
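The OHEM selection step described above can be sketched as follows. This is an illustrative reading of the procedure (keep all positive anchors, then the highest-loss negatives at a 1:3 positive:negative ratio), not the paper's actual training code; the function name and signature are our own:

```python
import numpy as np

def ohem_select(losses, labels, neg_pos_ratio=3):
    """Online hard example mining: keep every positive anchor, plus the
    hardest negatives, so positives:negatives is at most 1:neg_pos_ratio."""
    losses = np.asarray(losses, dtype=float)
    labels = np.asarray(labels)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    # Budget of negatives, with a floor of one positive to avoid zero.
    n_neg = min(len(neg_idx), neg_pos_ratio * max(len(pos_idx), 1))
    # Sort negatives by loss, descending, and keep the hardest ones.
    hardest_neg = neg_idx[np.argsort(-losses[neg_idx])[:n_neg]]
    return np.concatenate([pos_idx, hardest_neg])
```
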
Transfer learning is a widely used technique to improve the accuracy of networks and improve generalization (Tan et al., 2018). As such, all our models are pretrained on ImageNet (Russakovsky et al., 2014) and finetuned on WIDER FACE. Contrary to Zhang et al. (2019), we find a significant improvement in performance using transfer learning. Our MobileNetV2 models come pretrained from GluonCV (Guo et al., 2019), so our results should be reproducible. We employ a warmup learning rate schedule (Goyal et al., 2017), with five epochs where the learning rate increases linearly from  by an order of magnitude; it then falls by an order of magnitude at epochs 50 and 70, and training terminates at epoch 90. All models are trained using stochastic gradient descent with momentum , weight decay , and a batch size of eight per GPU. The majority of our models are trained on a single NVIDIA Tesla GPU; however, our three final models are trained across six.
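The warmup schedule above can be sketched as a small function. The starting learning rate did not survive extraction, so `base_lr` below is an assumed placeholder; the shape of the schedule (linear ramp over five epochs to ten times the starting value, then tenfold drops at epochs 50 and 70) follows the description in the text:

```python
def learning_rate(epoch, base_lr=1e-3, warmup_epochs=5, steps=(50, 70)):
    """Warmup schedule sketch: linear ramp from base_lr to 10*base_lr over
    the first `warmup_epochs` epochs, then a tenfold decay at each epoch
    listed in `steps`. `base_lr` is an assumed placeholder value."""
    if epoch < warmup_epochs:
        # Linear ramp from base_lr up to 10 * base_lr (one order of magnitude).
        return base_lr * (1 + 9 * epoch / warmup_epochs)
    lr = 10 * base_lr
    for s in steps:
        if epoch >= s:
            lr /= 10
    return lr
```
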
WIDER FACE employs the PASCAL VOC procedure (Everingham et al., 2012; Yang et al., 2016) for evaluation. Detections are considered true or false positives based on their area of overlap with the ground truth bounding boxes. If the intersection-over-union (IoU) between a positive anchor and the ground truth is greater than , the detection is a true positive; an IoU below this is considered a false positive. For multiple true positive detections of one ground truth, only the detection with the highest IoU is counted as correct; the rest are counted as false positives. The evaluation metric is average precision (AP): for each set (easy, medium, hard) the precisions are drawn from all unique recall values and averaged.
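The IoU criterion underlying this matching can be sketched as a short function over axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```
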
As the WIDER FACE dataset is limited to faces that are at least ten pixels high, we remove any bounding boxes with a height of fewer than five pixels. Following Najibi et al. (2017); Zhang et al. (2017); Li et al. (2017); Wang et al. (2017); Deng et al. (2019), we employ flipping and multi-scale detection strategies, disregarding any bounding box with a class probability less than . We apply the greedy non-maximum suppression of Girshick et al. (2013) to remove regions with an IoU overlap greater than  with a higher-scoring region. We further refine the bounding boxes using box voting (Gidaris and Komodakis, 2015), where bounding boxes with an IoU overlap greater than  ‘vote’ on the location, weighted by their respective IoUs.
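The greedy NMS step can be sketched as follows. The paper's suppression threshold did not survive extraction, so the `iou_thresh=0.3` default here is an assumed placeholder, and the function is an illustrative sketch rather than the authors' implementation:

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop every remaining box whose IoU with it exceeds iou_thresh."""
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-np.asarray(scores))  # indices, best score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Vectorized IoU of the kept box against all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_thresh]
    return keep
```
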
To evaluate the accuracy of the landmarks we use two datasets: AFW (337 faces with 68 landmarks; Zhu and Ramanan, 2012) and AFLW2000 (2,000 faces with 68 landmarks; Zhu et al., 2015). The de facto evaluation metric is the mean L2 error of all the estimated landmarks normalized by the square root of the face bounding-box area (NME), as in Deng et al. (2019). For both datasets we employ the same evaluation protocol, except that we use the absolute (L1) error. We calculate the absolute error (AE) using the highest-confidence (center-most) face and the distance from each predicted landmark to its respective ground truth,
where the terms denote the x and y coordinates of the predicted and ground truth landmarks, and the height and width of the bounding box, respectively. This AE is then averaged over all faces in the dataset to yield the mean absolute error (mAE). For AFLW-2000 all five landmarks are provided; however, for AFW the centers of the eyes are not given. Therefore, for AFW we use the mean of the left and right eye corners as the center of each eye.
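The AE computation can be sketched as below. The exact equation did not survive extraction, so this is one plausible reading: the per-landmark L1 distance, normalized by the square root of the bounding-box area, averaged over the landmarks of one face. The function name is our own:

```python
import math

def absolute_error(pred, gt, box_w, box_h):
    """Absolute landmark error for one face: L1 distance per landmark,
    normalized by sqrt(box area), averaged over the landmarks.
    `pred` and `gt` are lists of (x, y) landmark coordinates."""
    norm = math.sqrt(box_w * box_h)
    dists = [abs(px - gx) + abs(py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / (len(dists) * norm)
```

Averaging this quantity over every face in a dataset then gives the mAE reported in the tables.
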
5 Face Detection Results
5.1 Backbone Baseline
To ensure that we start with the optimal backbone we first compared the performance of all three versions of MobileNet. Table 5.1 reports the face detection and landmark accuracies of each version of MobileNet. We find that there is very little difference between MobileNet (Howard et al., 2017) and MobileNetV2 (Sandler et al., 2018), however, MobileNetV3 (Howard et al., 2019) significantly underperforms. As the performance of MobileNet and MobileNetV2 is so similar, we select MobileNetV2 as our backbone only because it is a lighter network.
For the rest of this paper, we refer to three network backbones. Our smallest network (MNet-25) uses a MobileNetV2 backbone with an α value of 0.25 and just  filters in the context modules. The medium-sized network (MNet-100) also uses a MobileNetV2 backbone, with α = 1.0 and  filters in the context modules. For our large network, we choose ResNet101 v2 with  filters in the context modules, to be comparable with other literature.
5.2 Context Module Comparison: Face Detection
| Head | MNet-25 Hard (AP %) | MNet-25 Overall (AP %) | MNet-100 Hard (AP %) | MNet-100 Overall (AP %) | ResNet101 Hard (AP %) | ResNet101 Overall (AP %) |
|---|---|---|---|---|---|---|
| SSH | 86.85 (+0.15) | 90.79 (-0.01) | 89.04 (+0.11) | 92.76 (+0.14) | 90.55 (-0.08) | 94.14 (+0.05) |
| SSH (mod.) | 86.76 (+0.06) | 90.86 (+0.05) | 88.76 (-0.17) | 92.58 (-0.05) | 90.64 (+0.01) | 94.22 (+0.13) |
| Retina | 87.11 (+0.42) | 91.02 (+0.21) | 88.87 (-0.06) | 92.59 (-0.04) | 90.74 (+0.11) | 94.26 (+0.24) |
| Retina (mod.) | 87.27 (+0.57) | 91.03 (+0.22) | 88.70 (-0.23) | 92.43 (-0.19) | 90.49 (-0.14) | 94.21 (-0.01) |
| RSSH | 86.41 (-0.29) | 90.67 (-0.14) | 89.19 (+0.26) | 92.69 (+0.06) | 90.68 (+0.05) | 94.18 (+0.18) |
| RSSH (mod.) | 86.67 (-0.03) | 90.83 (+0.02) | 88.99 (+0.06) | 92.61 (-0.01) | 90.68 (+0.05) | 94.25 (+0.17) |
| Dense | 86.47 (-0.23) | 90.56 (-0.25) | 88.93 (+0.00) | 92.57 (-0.05) | 90.52 (+0.00) | 94.21 (0.02) |
| Dense (mod.) | 86.05 (-0.65) | 90.70 (-0.11) | 88.97 (+0.04) | 92.77 (+0.14) | 89.72 (+0.00) | 93.30 (-0.77) |
| Average | 86.70 ± 0.39 | 90.81 ± 0.16 | 88.93 ± 0.16 | 92.63 ± 0.11 | 90.50 ± 0.33 | 94.07 ± 0.32 |
Table 3 presents the results for each context module on the WIDER FACE validation dataset. For reference, we also trained MobileNetV2 with two ‘basic’ context modules: our two-layer module achieves only  and , and our one-layer module only  and , on the ‘hard’ set and overall, respectively. Therefore, having at least three layers improves performance by more than  percent.
For MNet-25, we find that Retina, Retina, and SSH are the three top performers on the ‘hard’ set, and Retina, Retina, and SSH the top overall. For MNet-100, we find that Retina, Retina, and RSSH are the three top performers on the ‘hard’ set, and SSH, Dense, and RSSH the top overall. For ResNet101, we find that Retina, RSSH, and RSSH are the three top performers on the ‘hard’ set, and Retina, RSSH, and SSH the top overall. For the ‘hard’ set, the top three context performers are Retina, Retina and SSH, with average mean divergences of ,  and , respectively. Over all three sets, the top three context performers are Retina, SSH and RSSH, with average mean divergences of ,  and , respectively. From these results, the context modules seem quite similar in performance; the only real outliers are the two Dense variants, which underperform.
To test the statistical significance of these results, we ran the same experiment eight times (matching the number of context modules) to determine how much variance is due to randomness alone. We find almost no difference in variance between repeating the same experiment and using different context modules. For the ‘hard’ set we get standard deviations of ,  and  for MNet-25, MNet-100 and ResNet101, respectively. Over all three sets we get standard deviations of ,  and  for MNet-25, MNet-100 and ResNet101, respectively. Therefore, we cannot find any significant difference between any of the architectures. We can also see that smaller networks have higher variance. It is therefore not that the context module architecture matters more for smaller networks; rather, randomness is simply more influential. We also compared the number of filters used in the context module, finding on average a  percent performance increase for smaller networks when doubling the number of filters. However, this comes with significant efficiency problems: for example, doubling the number of filters in the context modules for MNet-25 almost doubles the total number of parameters in the network.
5.3 Context Module Comparison: Landmark Accuracy
| Head | MNet-25 AFW (mAE) | MNet-25 AFLW-2000 (mAE) | MNet-100 AFW (mAE) | MNet-100 AFLW-2000 (mAE) | ResNet101 AFW (mAE) | ResNet101 AFLW-2000 (mAE) |
|---|---|---|---|---|---|---|
| SSH | 1.14 ± 0.62 | 1.86 ± 2.80 | 1.03 ± 0.56 | 1.60 ± 1.92 | 0.91 ± 0.53 | 1.56 ± 1.83 |
| SSH (mod.) | 1.55 ± 2.76 | 2.05 ± 2.42 | 0.98 ± 0.98 | 1.51 ± 1.95 | 0.92 ± 0.54 | 1.51 ± 2.50 |
| Retina | 1.12 ± 0.63 | 1.84 ± 2.68 | 0.95 ± 0.59 | 1.63 ± 2.20 | 0.90 ± 0.51 | 1.41 ± 1.47 |
| Retina (mod.) | 1.09 ± 0.68 | 2.00 ± 3.15 | 1.02 ± 0.57 | 1.61 ± 1.95 | 1.00 ± 0.87 | 1.63 ± 2.24 |
| RSSH | 1.11 ± 0.90 | 1.99 ± 3.09 | 1.01 ± 0.52 | 1.42 ± 1.85 | 0.90 ± 0.55 | 1.50 ± 1.99 |
| RSSH (mod.) | 1.12 ± 1.31 | 2.14 ± 3.33 | 1.01 ± 0.99 | 1.54 ± 2.74 | 0.97 ± 0.73 | 1.53 ± 2.06 |
| Dense | 1.15 ± 2.78 | 2.11 ± 3.16 | 1.21 ± 0.66 | 1.58 ± 2.00 | 0.93 ± 0.62 | 1.53 ± 2.02 |
| Dense (mod.) | 1.27 ± 2.63 | 2.00 ± 2.92 | 1.05 ± 1.39 | 1.48 ± 2.07 | 0.93 ± 0.62 | 1.53 ± 2.02 |
| Average | 1.19 ± 1.80 | 2.00 ± 2.96 | 1.03 ± 0.78 | 1.55 ± 2.10 | 0.93 ± 0.62 | 1.53 ± 2.02 |
Table 4 presents the mean absolute error and standard deviation for each backbone and context module. As in the previous section, we also present the accuracy of two MobileNetV2 networks trained with ‘basic’ context modules. Our single-layer context module network achieves  and  on AFW and AFLW-2000, respectively, and our two-layer context module network achieves  and . These results are not far from the accuracy of the other context modules presented in Table 4; therefore, the choice of context module seems to have little influence on landmark accuracy. Following the previous section, to investigate the effect of randomness we compare these results to a single experiment run eight times. When switching context modules we get ,  and  for MNet-25, MNet-100 and ResNet101, respectively, but when running the same experiment eight times we get ,  and . As in the previous section, we find that the variance due to randomness is similar to the variance from the choice of context module architecture.
For MNet-25, we find average mean absolute errors of  and  for AFW and AFLW-2000, respectively. MNet-100 is significantly better on both datasets, with average mean absolute errors of  and . Furthermore, the largest backbone, ResNet101, is better still, with average mean absolute errors of  and . Unsurprisingly, the larger backbones perform significantly better on landmark localization, especially on side faces.
We also consider the impact of increasing the number of filters in the context module on landmark accuracy. We find that doubling the number of filters has almost no effect on the quality of the landmarks. These results suggest that the choice of context module architecture, and the number of filters it has, is largely irrelevant to landmark quality.
5.4 Final Results: Face Detection
| Backbone | Hard (AP%) | Overall (AP%) |
|---|---|---|
For our final models we change only the number of GPUs (from one to six) and the number of layers in the ResNet v2 (from 101 to 152). Table 5.4 shows our final results on the WIDER FACE ‘hard’ set and overall. We find that increasing the size of the network is the most reliable way to increase performance on the WIDER FACE dataset; however, networks like ResNet152 may not be practical in most applications. Both of our MobileNetV2s perform extremely well. For comparison, Zhang et al. (2019) present results for their ResNet18 model, which achieves very similar accuracy to our much smaller MNet-100, both achieving  on the ‘hard’ set. Moreover, Deng et al. (2019) report results on the ‘hard’ set using MobileNet with , with an AP of just , whereas our model with the same backbone achieves  using a six-layer pyramid and  using only three layers. By increasing the number of filters in the context module by a factor of two, we can achieve over  on the ‘hard’ set. However, this makes the network significantly bigger, so it is not a fair comparison with Deng et al. (2019).
Figure 2 shows our final results on all three WIDER FACE validation sets. Our ResNet152 ranks a respectable fourth on both the ‘medium’ and ‘hard’ sets, without adding a large number of layers. Moreover, our MNet-25 ranks four to five places higher than EXTD (a similarly lightweight detector) on all three sets. Our MNet-100 is also able to outperform much heavier networks, for example FAN (Zhang et al., 2019), which uses a ResNet50 backbone.
5.5 Final Results: Landmark Accuracy
| Backbone | AFW (mAE) | AFLW-2000 (mAE) |
|---|---|---|
| MNet-25 | 1.06 ± 1.11 | 1.80 ± 2.39 |
| MNet-100 | 0.79 ± 0.43 | 1.17 ± 1.42 |
| ResNet152 | 0.87 ± 1.52 | 0.87 ± 0.67 |
Table 5.5 shows the final landmark accuracy for each of our backbones, and Table 5.4 shows the total number of parameters in each network. For comparison, MTCNN (Zhang et al., 2016) achieves  on AFLW-2000, far higher than even our smallest model, and it struggles to detect many of the faces in AFW. We can see that larger backbones generally provide better quality landmarks. As previously mentioned, landmark accuracy can be improved significantly by reducing the number of layers in the feature pyramid; however, this causes a substantial loss in face detection accuracy. We also investigated including more filters in the context modules, which does not affect the quality of the landmarks. Moreover, we found that using deformable convolutions also impeded landmark accuracy, but again this is a trade-off we make to ensure higher face detection accuracy.
5.6 Final Results: Network Performance
Table 5.6 shows the inference speed of both MobileNetV2 models on different devices. These devices are very heterogeneous, so we use specialized techniques to get the best performance from each. All of the optimization techniques are open source, so these benchmarks should be reproducible. For the desktop CPU (Intel i5-7500) and embedded GPU (NVIDIA Jetson TX2) we take advantage of TVM (Chen et al., 2018), while MXNet (Chen et al., 2015) accelerated with CUDA (Nickolls et al., 2008) is used for the desktop GPU (NVIDIA 1050 Ti).
For comparison, EXTD (Yoo et al., 2019) uses far fewer parameters in its face detector: just . However, our smallest model is not only ms faster on a VGA input, it also performs better on the WIDER FACE hard set by . RefineFace (Zhang et al., 2019) uses ResNet-18 as its smallest model, which runs in 26.8 ms on an NVIDIA 1080ti. By comparison, we achieve similar results on the WIDER FACE hard set with our MNetV2, which runs ms faster on a much slower GPU (NVIDIA 1050ti). To compare with Deng et al. (2019), we use a three-layer pyramid and achieve very similar or better inference speeds, while also achieving higher accuracy on the WIDER FACE hard set with the same model.
We have shown that the choice of context module architecture is likely irrelevant to the model’s performance. One possible reason for this is that the layers added by the feature pyramid and the context modules are always randomly initialized and, for smaller networks, they can constitute a large percentage of the total parameters. Therefore, a ‘lucky’ initialization can yield a greater performance gain than crafting an optimal context module. One possible way around this would be to pretrain the full network on a similar detection task, e.g. person detection, to alleviate the effect of the random initialization.
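The argument above hinges on how much of a small detector starts from random weights. A back-of-the-envelope calculation makes the point; the parameter counts below are illustrative placeholders, not the paper's actual figures.

```python
def random_init_fraction(backbone_params, extra_params):
    """Share of weights that are randomly initialized when only the backbone
    is pretrained (the feature pyramid and context modules are new layers)."""
    return extra_params / (backbone_params + extra_params)

# Illustrative only: a lightweight backbone with 1.5M pretrained weights plus
# 0.5M new pyramid/context weights leaves a quarter of the network random,
# so initialization noise can plausibly rival architectural choices.
print(random_init_fraction(1_500_000, 500_000))  # → 0.25
```

For a heavy backbone such as ResNet152 the same 0.5M added weights would be a negligible fraction, which is consistent with initialization luck mattering most for the smaller models.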
Our largest model achieves a near state-of-the-art score of percent on the WIDER FACE hard set without making use of any excessive additional layers. It also provides very accurate landmarks that can be used for face alignment. Our two smaller networks exceed state-of-the-art performance on the WIDER FACE hard set compared to networks of similar size. These networks also provide accurate landmarks while being able to run in real-time on modest desktop and mobile hardware.
We would like to thank Aubin Samacoits, Jeff Hnybida, Riccardo Gallina and Sanjana Jain for their constructive input and feedback during the writing of this paper. We would also like to thank CAT Telecom for granting us access to their GPU cluster for training.
- Self-Driving Cars: A Survey. arXiv e-prints, arXiv:1901.04407. Cited by: §1.
- Binarized Convolutional Landmark Localizers for Human Pose Estimation and Face Alignment with Limited Resources. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2017-October, pp. 3726–3734. Cited by: §1.
- Supervised Transformer Network for Efficient Face Detection. arXiv e-prints, arXiv:1607.05477. Cited by: §2.3.
- Joint Cascade Face Detection and Alignment. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele and T. Tuytelaars (Eds.), Cham, pp. 109–122. Cited by: §2.3.
- MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. Cited by: §5.6.
- TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, pp. 578–594. Cited by: §5.6.
- Histograms of oriented gradients for human detection. In Proceedings - 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005. Cited by: §1.
- RetinaFace: single-stage dense face localisation in the wild. arXiv e-prints. Cited by: §1, §1, §2.1, §2.2, §2.3, §2.4, §2.5, Figure 1, §4.1, §4.2, §4.3, §4.3, §5.4, §5.6.
- The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Cited by: §4.3.
- A closer look at faster r-cnn for vehicle detection. In 2016 IEEE Intelligent Vehicles Symposium (IV), pp. 124–129. Cited by: §1.
- Object detection with discriminatively trained part-based models. Computer. Cited by: §1.
- Unsupervised Training for 3D Morphable Model Regression. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 8377–8386. Cited by: §1.
- Object detection via a multi-region & semantic segmentation-aware CNN model. arXiv e-prints, arXiv:1505.01749. Cited by: §4.3.
- Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv e-prints, arXiv:1311.2524. Cited by: §4.3.
- Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587. Cited by: §1.
- Fast R-CNN. arXiv e-prints, arXiv:1504.08083. Cited by: §1, §4.2.
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv e-prints, arXiv:1706.02677. Cited by: §4.2.
- Stacked Dense U-Nets with Dual Transformers for Robust Face Alignment. Cited by: §1.
- GluonCV and GluonNLP: deep learning in computer vision and natural language processing. arXiv preprint arXiv:1907.04433. Cited by: §4.2.
- Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv e-prints, arXiv:1406.4729. Cited by: §2.1.
- Identity Mappings in Deep Residual Networks. arXiv e-prints, arXiv:1603.05027. Cited by: §4.2.
- In Defense of the Triplet Loss for Person Re-Identification. arXiv e-prints, arXiv:1703.07737. Cited by: §1.
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. Cited by: §4.2, §5.1.
- Searching for MobileNetV3. Cited by: §5.1.
- Densely Connected Convolutional Networks. arXiv e-prints, arXiv:1608.06993. Cited by: §2.5.
- Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. In Workshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France. Cited by: §2.4.
- Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2144–2151. Cited by: §1.
- DSFD: Dual Shot Face Detector. Cited by: §1, §2.2, §4.2.
- Light-Head R-CNN: In Defense of Two-Stage Object Detector. arXiv e-prints, arXiv:1711.07264. Cited by: §2.2, §4.3.
- PyramidBox++: High Performance Detector for Finding Tiny Face. Cited by: §1, §2.5, Figure 1.
- Feature Pyramid Networks for Object Detection. Cited by: §2.1.
- Focal Loss for Dense Object Detection. Cited by: §1, §1, §1, §2.2, §2.2, §3.1.
- Microsoft COCO: Common Objects in Context. arXiv e-prints, arXiv:1405.0312. Cited by: §2.1.
- Receptive Field Block Net for Accurate and Fast Object Detection. arXiv e-prints, arXiv:1711.07767. Cited by: §2.2.
- SSD: Single shot multibox detector. In Lecture Notes in Computer Science, Vol. 9905 LNCS, pp. 21–37. Cited by: §2.2.
- Dense Face Alignment. In Proceedings - 2017 IEEE International Conference on Computer Vision Workshops (ICCVW 2017), pp. 1619–1628. Cited by: §1.
- IARPA Janus Benchmark - C: face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pp. 158–165. Cited by: §2.4.
- AgeDB: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, Vol. 2, pp. 5. Cited by: §2.4.
- SSH: Single Stage Headless Face Detector. arXiv e-prints, arXiv:1708.03979. Cited by: §2.1, §2.2, §2.5, Figure 1, §4.3.
- Stacked hourglass networks for human pose estimation. Lecture Notes in Computer Science 9912 LNCS, pp. 483–499. Cited by: §1.
- Scalable parallel programming with CUDA. Queue 6 (2), pp. 40–53. Cited by: §5.6.
- YOLOv3: An Incremental Improvement. Cited by: §1.
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. Cited by: §1.
- ImageNet Large Scale Visual Recognition Challenge. arXiv e-prints, arXiv:1409.0575. Cited by: §4.2, §5.4.
- Intervertebral disc detection in x-ray images using faster r-cnn. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 564–567. Cited by: §1.
- MobileNetV2: Inverted Residuals and Linear Bottlenecks. Cited by: §4.2, §5.1.
- Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Cited by: §2.4.
- Training Region-based Object Detectors with Online Hard Example Mining. arXiv e-prints, arXiv:1604.03540. Cited by: §4.2.
- Deep Learning Face Representation by Joint Identification-Verification. arXiv e-prints, arXiv:1406.4773. Cited by: §1, §2.4.
- DeepFace: closing the gap to human-level performance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708. Cited by: §1, §2.4.
- A Survey on Deep Transfer Learning. arXiv e-prints, arXiv:1808.01974. Cited by: §4.2.
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Cited by: §1.
- PyramidBox: A Context-assisted Single Shot Face Detector. Cited by: §1, §4.2.
- Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision. Cited by: §2.3.
- Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171. Cited by: §1.
- Multiple Kernels for object detection. In Proceedings of the IEEE International Conference on Computer Vision. Cited by: §1.
- Robust real-time face detection. International Journal of Computer Vision 57, pp. 137–154. Cited by: §1, §1, §2.1.
- Spatial-Temporal Person Re-identification. Cited by: §1.
- Face attention network: an effective face detector for the occluded faces. arXiv preprint arXiv:1711.07246. Cited by: §1, §2.1, §4.3.
- Deep High-Resolution Representation Learning for Visual Recognition. Cited by: §1.
- Deep Face Recognition: A Survey. pp. 1–17. Cited by: §1.
- Facial feature detection using Haar classifiers. J. Comput. Sci. Coll. 21 (4), pp. 127–133. Cited by: §1.
- Application of faster r-cnn model on human running pattern recognition. arXiv preprint arXiv:1811.05147. Cited by: §1.
- WIDER FACE: a face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §4.1, §4.3.
- EXTD: Extremely Tiny Face Detector via Iterative Filter Reuse. arXiv e-prints, arXiv:1906.06579. Cited by: §2.2, §5.6.
- Accurate Face Detection for High Performance. arXiv e-prints, arXiv:1905.01585. Cited by: §1, §2.1, §2.2, §4.2, §5.4.
- Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §1, §2.3, §2.4, §4.1, §5.5.
- RefineFace: Refinement Neural Network for High Performance Face Detection. arXiv e-prints, arXiv:1909.04376. Cited by: §1, §2.1, §2.2, §4.2, §5.4, §5.6.
- Improved Selective Refinement Network for Face Detection. Cited by: §4.2.
- S3FD: Single Shot Scale-Invariant Face Detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 192–201. Cited by: §1, §2.1, §4.2, §4.3.
- Robust and High Performance Face Detector. arXiv e-prints, arXiv:1901.02350. Cited by: §1.
- Objects as Points. Cited by: §1.
- Dense 3D face decoding over 2500fps: joint texture & shape convolutional mesh decoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1097–1106. Cited by: §1.
- Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2879–2886. Cited by: §4.3.
- Face Alignment Across Large Poses: A 3D Solution. arXiv e-prints, arXiv:1511.07212. Cited by: §4.3.
- Deformable ConvNets v2: More Deformable, Better Results. arXiv e-prints, arXiv:1811.11168. Cited by: §4.2.
- Edge Boxes: Locating Object Proposals from Edges. In European Conference on Computer Vision (ECCV). Cited by: §4.1.