Face Detection with Feature Pyramids and Landmarks


Accurate face detection and facial landmark localization are crucial to any face recognition system. We present a series of three single-stage RCNNs with different sized backbones (MobileNetV2-25, MobileNetV2-100, and ResNet101) and a six-layer feature pyramid trained exclusively on the WIDER FACE dataset. We compare the face detection and landmark accuracies using eight context module architectures, four proposed by previous research and four modified versions. We find no evidence that any of the proposed architectures significantly outperforms the others and postulate that the random initialization of the additional layers is at least of equal importance. To show this we present a model that achieves near state-of-the-art performance on WIDER FACE and also provides high accuracy landmarks with a simple context module. We also present results using MobileNetV2 backbones, which achieve over average precision on the WIDER FACE hard validation set while being able to run in real-time. By comparing to other authors, we show that our models exceed the state-of-the-art for similar-sized RCNNs and match the performance of much heavier networks.

1 Introduction

Over recent years, the ever-improving performance of Convolutional Neural Networks (CNNs) has resulted in highly accurate computer vision applications (e.g. Redmon and Farhadi, 2018; Wang et al., 2018; Tan and Le, 2019; Wang et al., 2019; Wang and Deng, 2018). One predominant application is object detection. The task consists of localizing objects and classifying them. Accurately solving this task opens the door to a wide range of applications, from autonomous driving (see Badue et al., 2019, for a review) to person reidentification (e.g. Hermans et al., 2017). The problem was traditionally approached in three main tracks: region selection (Vedaldi et al., 2009), feature extraction (Dalal and Triggs, 2005) and object classification (Forsyth, 2014). However, the low efficiency of region selection and the semantic limitation of manually engineered descriptors hinder the performance of these algorithms. Recently, many deep learning approaches have been developed to tackle this inefficiency issue. Region CNNs (RCNNs) (Girshick et al., 2014) are commonly used for various detection tasks. They perform a greedy selective search algorithm (Uijlings et al., 2013) to significantly lower the number of region proposals; however, this is computationally expensive. Fast RCNN (Girshick, 2015) feeds pixel-level region proposals into the detection network from the feature maps, reducing the overhead somewhat. Faster RCNNs (Ren et al., 2017) go further and utilize CNN-based Region Proposal Networks (RPNs), removing the greedy selective search used in previous RCNNs and enabling detection in real-time. RPNs use a pyramid of anchors to propose regions more efficiently than pyramids of images (e.g. Viola and Jones, 2004) or filters. This anchor-based approach has successfully been applied to many detection tasks (Zhou et al., 2019; Yang and Geng, 2018; Fan et al., 2016; Sa et al., 2017). Recently, single-stage detectors based on Faster RCNN have also been widely adopted. For example, Lin et al. (2017) proposed a single-stage architecture called RetinaNet combined with focal loss, which is designed to combat the inherent class imbalance, and which achieves state-of-the-art accuracy on the COCO dataset.

Face recognition is now commonplace in our daily lives, with businesses looking to take advantage of its convenience and robustness. Face detection serves as the foundation for recognition and various other face related research and products, including alignment and attribute classification (e.g. gender, age, facial expression). Like any object detection task, the goal of face detection is to provide bounding boxes of all the faces in an image. However, variations in pose, illumination, resolution, occlusion, and human variance in real-world data make face detection challenging. Viola and Jones (2004) proposed an approach that performs feature searching using Haar-like (Wilson and Fernandez, 2006) features, which together with the integral image, generates a set of features that assist in face detection. While the authors perform face detection on an image pyramid, the multi-scale feature pyramid approach has recently been shown to perform feature extraction more efficiently (Lin et al., 2017).

Landmark localization refers to estimating predefined landmark locations in images. Common tasks include facial (pupils, nose peak, mouth corners, etc.) and body (elbows, knees, wrists, shoulders, and face landmarks) landmark localization. One of the most benchmarked public face landmark datasets is AFLW (Köstinger et al., 2011), which has 20,000 training images, 4386 test images and 19 manually annotated face landmarks per image. Facial landmarks are employed in various tasks, particularly in face alignment as a preprocessing step before face recognition (Taigman et al., 2014; Sun et al., 2014). One of the first breakthrough deep learning models for face landmark localization was Multitask Cascaded Convolutional Networks (MTCNN) (Zhang et al., 2016). Utilizing several CNNs, the authors achieved robust and accurate results and it has remained a strong baseline for several years. Subsequently, more computationally efficient methods have been developed which match the accuracy of MTCNN, such as that of Bulat and Tzimiropoulos (2017). Recently, methods superior to MTCNN, both in terms of accuracy and economy, have been developed, including RetinaFace (Deng et al., 2019). Many dense 3D face alignment techniques (e.g. Liu et al., 2018), using U-Nets (Guo et al., 2018) and Hourglass networks (Newell et al., 2016), have further improved landmark localization. However, these approaches often incur large computational overheads.

WIDER FACE (Yang et al., 2016) is a publicly available face detection benchmark dataset, which is widely used to train and benchmark face detection models. The dataset contains 32,203 images and 393,703 faces selected from the publicly available WIDER dataset. Recent work has produced very good results on this benchmark. For example, Face Attention Networks (FAN) (Wang et al., 2017) follows a similar approach to RetinaNet (Lin et al., 2017) using a single-stage and anchor-level attention networks trained in a supervised manner. They report an improvement over traditional methods and argue this is due to its capability to capture more contextual information. Similarly, PyramidBox (Tang et al., 2018) introduces a context-assisted single-stage detector that classifies and regresses faces, heads, and bodies to allow the detector to overcome small, blurred and partially occluded faces. The authors propose a data-anchor-sampling strategy which has subsequently been widely adopted (Li et al., 2019; Zhang et al., 2019, 2019). PyramidBox++ (Li et al., 2019) further enhances the original PyramidBox detector by introducing progressive anchor loss (Li et al., 2018), adding dense connections to the context module, and employing a balanced-data-anchor-sampling strategy that prevents oversampling on small faces. RetinaFace (Deng et al., 2019) also uses a RetinaNet approach and includes landmarks, as well as a 3D graph CNN mesh decoder alongside a joint shape and texture decoder (Zhou et al., 2019) and a differentiable renderer (Genova et al., 2018) to construct localized 3D face meshes. This approach resulted in state-of-the-art performance on WIDER FACE. AInnoFace (Zhang et al., 2019) employs a slightly different strategy, using a modified RetinaNet to perform a two-stage classification and regression task. The authors apply the Intersection over Union (IoU) regression loss to minimize the difference between predictions and ground-truths (Zhang et al., 2019), anchor-based sampling similar to the data-anchor-sampling in PyramidBox (Tang et al., 2018), and max-out (Zhang et al., 2017; Tang et al., 2018). The authors report similar results to RetinaFace on WIDER FACE. RefineFace (Zhang et al., 2019) achieves slightly higher performance on WIDER FACE by combining five different modules: selective two-step regression, selective two-step classification, scale-aware margin loss, feature supervision module, and receptive field enhancement. The authors argue that these modules address the class imbalance, reduce the classifier search space and produce more discriminative features, and were able to get even better results on the WIDER FACE challenge.

This paper is organized as follows. Section 2 presents some of the most recent work that we will draw upon. Section 3 details our loss function and context modules. Section 4 outlines the experiment procedure we will employ. Section 5 compares the results for different architectures. Section 6 presents our conclusions.

2 Related Work

2.1 Pyramid of features

Recently the adoption of multi-scale feature pyramids for detection tasks has been widespread (e.g. Najibi et al., 2017; Lin et al., 2016; Zhang et al., 2017; Wang et al., 2017; Deng et al., 2019; Zhang et al., 2019, 2019). The work draws on results using spatial pyramid pooling (He et al., 2014), which can efficiently extract features at different levels from a single image, moving away from the less efficient pyramid of images approach (Viola and Jones, 2004). Multi-scale feature pyramids rely on only a single-scale image and output proportionally sized feature maps at various levels through top-down and lateral connections. This approach displays significant performance improvements on the COCO (Lin et al., 2014, 2016) and WIDER FACE (Wang et al., 2017; Deng et al., 2019; Zhang et al., 2019, 2019) challenges. Due to the performance of this approach, we will be adopting it in our experiments.

2.2 Single versus two-stage

Generally, there are two types of modern face detectors: single-stage and two-stage. Single-stage models make independent object classifications from multiple feature maps deep in the network (Liu et al., 2016), typically giving them a latency advantage. However, these feature maps have a lower spatial resolution and hence may have already lost some semantic information relating to small objects, generally leading to reduced accuracy. Two-stage detectors (e.g. Faster-RCNN) construct semantically rich feature maps from different layers in the network (Lin et al., 2017) and classify regions of interest. As a result, two-stage architectures can detect small objects with higher precision but at reduced speed (Yoo et al., 2019). Finding a balance between accuracy and inference time has been a predominant focus of recent research.


In the past few years, there have been numerous two-stage detectors that perform well on WIDER FACE. For example, Li et al. (2017) proposed ‘Light-Head’ RCNNs, an efficient and accurate two-stage face detector that generates ‘thin’ feature maps and applies a large-kernel deformable convolution before the RoI warping, inspired by Light RCNN. The authors add additional small anchors to support tiny faces which help in evaluation achieving , and on the easy, medium and hard WIDER FACE sets. Similarly, Li et al. (2018) proposed Dual Shot Face Detectors (DSFDs) using progressive anchor loss, a feature enhancement module and an improved anchor matching strategy to achieve state-of-the-art face detection. The authors use a Feature Enhance Module, a combination of a typical FPN and a Receptive Field Block (RFB) (Liu et al., 2017), before the second shot. They also propose a Progressive Anchor Loss strategy, using smaller anchors in the first shot and larger in the second, arguing that original feature maps have less semantic information but more location information.


More recently, however, single-stage solutions have shown their dominance. For example, Najibi et al. (2017) introduced a Single Stage Headless (SSH) architecture that detects faces in a single forward pass by directly extracting features from different scales within the network. Their detector achieved state-of-the-art performance on the WIDER FACE dataset and is eight times faster than previous methods. More recently, RetinaFace (Deng et al., 2019), another example of a RetinaNet (Lin et al., 2017) style single-stage feature pyramid detection network achieved state-of-the-art on WIDER FACE. The authors stress the importance of incorporating face key-points with the bounding boxes for improved performance on WIDER FACE. Recent results have shown that single-stage detectors can outperform two-stage both in terms of accuracy and latency. This work was subsequently followed by both Zhang et al. (2019) and Zhang et al. (2019) using similar approaches. Because of this recent success, we will also be utilizing a single-stage RetinaNet approach.

2.3 Multi-task learning

Chen et al. (2014) were the first to propose combining face detection and alignment into a joint cascade framework. Subsequently, other authors have used this approach to improve the accuracy of face detection networks (e.g. Chen et al., 2016; Zhang et al., 2016; Deng et al., 2019). Having a face detector that can also provide basic alignment information is extremely beneficial for any face recognition system. Multi-task learning is a common practice for training face detection networks. For example, Tian et al. (2018) presented a feature fusion pyramid architecture with a weakly supervised segmentation branch able to achieve state-of-the-art performance on WIDER FACE. The authors used the combination of three loss functions to train their network: a classification loss, a regression loss, and a segmentation loss. The authors argue that the segmentation branch helps the network learn more discriminative features. Similarly, Deng et al. (2019) trained RetinaFace using both a face landmark loss and a dense regression loss, the latter generated from the difference between the original face and the reconstruction from a mesh decoder. Like Tian et al. (2018), the authors were able to achieve state-of-the-art performance on WIDER FACE.

2.4 Landmark localization

Face recognition models rely on having well aligned faces at training and inference (Taigman et al., 2014; Sun et al., 2014; Deng et al., 2019). To align the face, a transformation on the original image is needed, such that the landmarks of each face should reside in a specific location. This transformation depends on the quality of the landmark locations. MTCNN (Zhang et al., 2016) has been used prolifically in face recognition tasks because the network provides both face bounding boxes and landmarks. Deng et al. (2019) improved on this with RetinaFace, a Faster-RCNN face detector that also returns the same face landmarks as MTCNN, so it can easily replace MTCNN in most use cases. The authors found vast improvements in verification accuracy on LFW (Huang et al., 2008), CFP-FP (Sengupta et al., 2016), AgeDB-30 (Moschoglou et al., 2017) and IJB-C (Maze et al., 2018) just by changing from MTCNN to RetinaFace. Therefore, we will also include a landmark loss term to help train our networks.

2.5 Context

Najibi et al. (2017) were the first to propose using context modules in single-stage detectors. Since this work, several papers have used different context modules in their detectors and have reported improved results on WIDER FACE (e.g. Deng et al., 2019; Li et al., 2019). In the original paper, SSH, the authors took the network output and performed a series of three convolutions, then concatenated the outputs of the final two. RetinaFace (Deng et al., 2019) uses a similar three-layer approach, reducing the number of filters in the second and third layers by a factor of two, then concatenates all three outputs. However, in their GitHub repository they also use a second context module which sums the output of the first two layers and concatenates it with the first. Li et al. (2019) use densely connected convolutions (Huang et al., 2016) in their context module, where the input of each convolutional layer is the concatenation of all previous layers. In this paper, we will be comparing all of these context modules, alongside a few of our own, to quantify the influence of the context module’s architecture on the overall performance of the network.

3 PyramidKey

3.1 Multi-task loss function

As previously mentioned, using a multi-task loss function is commonplace in detection tasks. To train our models we use a loss function composed of three components: class, bounding box and landmark loss. The class loss is given by the log loss over the two classes (face vs background) and is calculated for both positive and negative anchors. The bounding box loss is the smooth-L1 regression loss of the box location, only calculated for positive anchors. Similarly, the landmark loss is the regression loss of the landmark locations, also only calculated for the positive anchors. The combination of these three loss functions yields our multi-task loss function,


where are scale factors which are set to and , and correspond to all and positive anchors, and are the batch size and number of anchors, respectively. Lin et al. (2017) proposed using focal loss to address the inherent class imbalance, however, we find no significant benefit in replacing cross entropy.
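
The structure of the loss described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights `lambda_box` and `lambda_lmk`, the normalization by the number of all and positive anchors, the smooth-L1 form of the landmark term, and every function name are choices made for this example, not the values or implementation used in our experiments.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 loss summed over coordinates (beta is an assumed transition point)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total

def log_loss(p_face, is_face):
    """Binary log loss for one anchor (face vs background)."""
    eps = 1e-12
    p = min(max(p_face, eps), 1.0 - eps)
    return -math.log(p) if is_face else -math.log(1.0 - p)

def multitask_loss(anchors, lambda_box=1.0, lambda_lmk=1.0):
    """Weighted sum of class, box and landmark terms.

    anchors: list of dicts with keys p_face, is_face, box, box_gt, lmk, lmk_gt.
    lmk_gt may be None for faces without usable landmark annotations.
    lambda_box and lambda_lmk are hypothetical weights.
    """
    positives = [a for a in anchors if a["is_face"]]
    n_all = max(len(anchors), 1)
    n_pos = max(len(positives), 1)
    # Class term: every anchor, positive or negative, contributes.
    cls = sum(log_loss(a["p_face"], a["is_face"]) for a in anchors) / n_all
    # Box and landmark terms: positive anchors only.
    box = sum(smooth_l1(a["box"], a["box_gt"]) for a in positives) / n_pos
    lmk = sum(smooth_l1(a["lmk"], a["lmk_gt"])
              for a in positives if a["lmk_gt"] is not None) / n_pos
    return cls + lambda_box * box + lambda_lmk * lmk
```

Skipping anchors whose `lmk_gt` is `None` mirrors the treatment of faces with indistinguishable landmarks, which do not contribute to the landmark term.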

3.2 Context modules

Figure 1 shows an illustration of all the context modules we will be testing in our experiments. The left column shows the context modules from the previous section, the right column shows some slightly modified versions. The number of filters () in the context module is different for each network backbone. For SSH, we modify the context module by dividing the number of channels by four then performing four convolutions and concatenating all the outputs. For RSSH, we simply halve the number of channels in the third convolution then perform a fourth convolution and concatenate all four outputs. For Retina, we swap the addition and concatenation, so we concatenate the last layer with the sum of the first and second layers. For Dense, we add a fourth densely connected convolution. We also test with two ‘basic’ context modules, the first with just a single convolution and the second with two concatenated convolutions.

Figure 1: Illustration of the context modules used in this work, where denotes the number of filters in the first convolution layer, ‘C’ denotes concatenation, and ‘+’ denotes vector addition. The left column are common context modules, SSH context module is from Najibi et al. (2017), both the RSSH and Retina modules are from Deng et al. (2019), and the Dense context module is from Li et al. (2019). The right column are slight permutations of these modules.

4 Experiments

4.1 Training dataset

To train our model we use the WIDER FACE dataset (Yang et al., 2016). This dataset consists of 32,203 images and 393,703 labeled face bounding boxes with variable scale, pose and occlusion. The dataset is organized based on 61 event classes (e.g. parade, riot, and festival). Each event class is randomly sampled, with 40%, 10% and 50% of the images assigned to the training, validation and testing sets. EdgeBox (Zitnick and Dollar, 2014) is used to separate the proposals into three difficulty levels; Easy, Medium and Hard with recall rates of 92%, 76%, and 34%, respectively.

To incorporate landmarks into our training procedure we also use the five landmark annotations from Deng et al. (2019). The authors labeled faces in the training set and made them publicly available. These landmarks follow the format used by Zhang et al. (2016): eye centers, nose tip, and mouth corners. Faces with indistinguishable landmarks were given a dummy value and are not used in the loss function for that proposal. Deng et al. (2019) showed that by incorporating the landmarks into their multi-task loss the mAP on WIDER FACE improved by . We also label a further faces in the validation set, using the same labeling scheme.

4.2 Baseline settings

In this report, we train with three different backbone network sizes. We train a very lightweight network based on MobileNetV2 (Howard et al., 2017; Sandler et al., 2018) with , a heavier MobileNetV2 with , and a much heavier ResNet v2 (He et al., 2016) with 101 layers. We will refer to these networks as MNet, MNet and ResNet101, respectively.

Pyramid stride    Anchor scale
4                 16
8                 32
16                64
32                128
64                256
128               516

Table 1: The pyramid setup we use for all of our experiments. The stride denotes the factor by which the original image has been scaled, and the scale is the factor that is applied to the anchor size.
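
To make the pyramid layout concrete, the following sketch enumerates square anchors for each level of Table 1. It rests on several assumptions that are not taken from our setup: a hypothetical square input of `image_size` pixels, exactly one anchor per feature-map cell per level, and the anchor scale read directly as the anchor side length in pixels (that is, a base anchor size of 1).

```python
import math

# (stride, anchor scale) pairs as listed in Table 1.
PYRAMID = [(4, 16), (8, 32), (16, 64), (32, 128), (64, 256), (128, 516)]

def level_anchors(image_size, stride, scale):
    """Square (1:1) anchors centered on each cell of one level, as (x1, y1, x2, y2)."""
    cells = math.ceil(image_size / stride)
    half = scale / 2.0
    anchors = []
    for row in range(cells):
        for col in range(cells):
            cx = (col + 0.5) * stride  # cell center in input-image pixels
            cy = (row + 0.5) * stride
            anchors.append((cx - half, cy - half, cx + half, cy + half))
    return anchors

def all_anchors(image_size):
    """Concatenate the anchors of every pyramid level."""
    return [a for stride, scale in PYRAMID
            for a in level_anchors(image_size, stride, scale)]
```

For example, `len(all_anchors(640))` counts the anchors that this simplified layout would produce for a 640-pixel input; the finest level dominates the total because the number of cells shrinks by a factor of four with each doubling of the stride.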

We use an input image size of , in line with previous work (e.g. Li et al., 2018; Deng et al., 2019) and anchor scales ranging from to , with total anchors. Due to the nature of the task, we set all anchors to have an aspect ratio of 1:1. We match positive anchors with ground truth IoUs greater than and negative anchors with IoUs less than . Furthermore, we incorporate online hard example mining (OHEM) (Shrivastava et al., 2016) which has been successful in training other recent RPN based face detectors (Zhang et al., 2017; Deng et al., 2019; Zhang et al., 2019, 2019). The hard examples are selected by sorting the anchors by their loss and taking the hardest positive and negative anchors at a ratio of 1:3, following Girshick (2015). During training we randomly crop regions of the original images (following Zhang et al., 2017; Tang et al., 2018; Deng et al., 2019). For our feature pyramid we found that using a six-level feature pyramid (see Table 1) gave us the best results, and we will use this setup for all our models. We found that increasing the number of levels in the feature pyramid hampered the landmark accuracy. However, as our primary goal is face detection accuracy, we choose to forego some landmark accuracy for improved face detection. All of the various context modules are implemented at the same point in the network with the same tensor input, all yielding the same shape output. After each context module, we apply a modulated deformable convolution (Zhu et al., 2018) to enhance the context information.
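
The anchor matching and hard example mining steps can be illustrated with a short sketch. The thresholds `pos_iou` and `neg_iou`, the cap `max_pos`, and the function names are hypothetical placeholders for this example; only the 1:3 positive-to-negative ratio and the sort-by-loss selection are taken from the description above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_iou=0.5, neg_iou=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for each anchor."""
    labels = []
    for a in anchors:
        best = max((iou(a, g) for g in gt_boxes), default=0.0)
        labels.append(1 if best > pos_iou else 0 if best < neg_iou else -1)
    return labels

def ohem_select(labels, losses, max_pos, neg_ratio=3):
    """Keep the highest-loss positives and neg_ratio times as many negatives."""
    by_loss = lambda i: losses[i]
    pos = sorted((i for i, l in enumerate(labels) if l == 1),
                 key=by_loss, reverse=True)[:max_pos]
    neg = sorted((i for i, l in enumerate(labels) if l == 0),
                 key=by_loss, reverse=True)[:neg_ratio * len(pos)]
    return pos, neg
```

Anchors whose best IoU falls between the two thresholds are labeled -1 and take part in neither the positive nor the negative pool.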

Transfer learning is a widely used technique to improve the accuracy of networks and improve generalization (Tan et al., 2018). As such, all our models are pretrained on ImageNet (Russakovsky et al., 2014) and are finetuned on WIDER FACE. Contrary to Zhang et al. (2019), we find a significant improvement in performance using transfer learning. Our MobileNetV2 models come pretrained from GluonCV (Guo et al., 2019); as such our results should be reproducible. We employ a warmup learning rate schedule (Goyal et al., 2017), with five epochs where the learning rate increases linearly from by an order of magnitude, then falls an order of magnitude at epochs 50 and 70, and training terminates at epoch 90. All models are trained using stochastic gradient descent with momentum , weight decay of and with a batch size of eight per GPU. The majority of our models are trained on a single NVIDIA Tesla GPU; however, our three final models are trained across six.
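
The shape of the learning rate schedule can be written down compactly. In this sketch the starting rate `start_lr` is a hypothetical placeholder, and the rate is updated once per epoch, which is an assumption; only the five warmup epochs, the order-of-magnitude changes, and the drop epochs 50 and 70 come from the description above.

```python
def learning_rate(epoch, start_lr=1e-3, warmup_epochs=5, drops=(50, 70), factor=10.0):
    """Linear warmup by one order of magnitude, then step decay at the drop epochs."""
    peak = start_lr * factor
    if epoch < warmup_epochs:
        # Linear ramp from start_lr (epoch 0) towards peak (reached at warmup_epochs).
        return start_lr + (peak - start_lr) * epoch / warmup_epochs
    lr = peak
    for drop_epoch in drops:
        if epoch >= drop_epoch:
            lr /= factor
    return lr

# One value per training epoch, ending after epoch 89 (90 epochs in total).
schedule = [learning_rate(e) for e in range(90)]
```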

4.3 Evaluation

WIDER FACE employs the PASCAL VOC procedure (Everingham et al., 2012; Yang et al., 2016) for evaluation. Detections are considered true or false based on the area of overlap with the ground truth bounding boxes. If the intersection-over-union (IoU) between a positive anchor and ground truth is greater than the detection is a true positive, whereas an IoU value below this is considered a false positive. For multiple true positive detections of one ground truth, only the detection with the highest IoU is counted as correct and the rest are counted as false positives. The evaluation metric is average precision (AP), for each set (easy, medium, hard) the precisions are drawn from all unique recall values and averaged.
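
A simplified version of this evaluation, for a single image, is sketched below. It computes a non-interpolated average precision, resolves duplicate detections in order of confidence, and uses a hypothetical threshold `iou_thresh`; the official WIDER FACE toolkit may differ in each of these respects, so this should be read as an illustration of the idea only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, gt_boxes, iou_thresh=0.5):
    """detections: list of (score, box). Returns the non-interpolated AP."""
    matched = set()      # indices of ground truths that already have a true positive
    true_positives = 0
    precisions = []      # precision at each new recall level
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    for rank, (_, box) in enumerate(ranked, start=1):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou > iou_thresh and best_j not in matched:
            matched.add(best_j)
            true_positives += 1
            precisions.append(true_positives / rank)
        # Otherwise the detection is a false positive (low IoU or duplicate).
    return sum(precisions) / len(gt_boxes) if gt_boxes else 0.0
```

Dividing by the number of ground truths means that a missed face contributes a precision of zero at its recall level.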

As the WIDER FACE dataset is limited to faces that are at least ten pixels high, we remove any bounding boxes with a height of fewer than five pixels. Following Najibi et al. (2017); Zhang et al. (2017); Li et al. (2017); Wang et al. (2017); Deng et al. (2019) we employ flipping and multi-scale detection strategies, disregarding any bounding box with a class probability less than . We apply the greedy non-maximum suppression from Girshick et al. (2013) to remove regions that have an IoU overlap greater than with another region that has a larger IoU with the ground truth. We also further refine the bounding boxes by using box voting (Gidaris and Komodakis, 2015), where all bounding boxes with an IoU overlap greater than ‘vote’ on the location, weighted by their respective IoU.
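
The post-processing can be sketched as greedy suppression followed by IoU-weighted box voting. The thresholds `nms_thresh` and `vote_thresh` are hypothetical, and ranking the candidates by class probability during suppression is an assumption of this example rather than a statement about our pipeline.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms_with_voting(dets, nms_thresh=0.4, vote_thresh=0.5):
    """dets: list of (score, box). Returns a list of (score, voted_box)."""
    kept = []
    for score, box in sorted(dets, key=lambda d: d[0], reverse=True):
        # Greedy NMS: keep a box only if it does not overlap a kept box too much.
        if all(iou(box, kept_box) <= nms_thresh for _, kept_box in kept):
            kept.append((score, box))
    results = []
    for score, box in kept:
        # Every original detection that overlaps enough votes, weighted by its IoU.
        voters = [(iou(box, other), other) for _, other in dets]
        voters = [(w, other) for w, other in voters if w > vote_thresh]
        total = sum(w for w, _ in voters)
        if total > 0:
            box = tuple(sum(w * other[i] for w, other in voters) / total
                        for i in range(4))
        results.append((score, box))
    return results
```

Because a kept box always overlaps itself with an IoU of one, it takes part in its own vote, so an isolated detection is returned unchanged.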

To evaluate the accuracy of the landmarks we use two datasets: AFW (consisting of 337 faces with 68 landmarks; Zhu and Ramanan, 2012) and AFLW2000 (consisting of 2000 faces with 68 landmarks; Zhu et al., 2015). The de facto metric of evaluation is the mean L2 error of all the estimated landmarks normalized by the square root of the face bounding-box area (NME), as in Deng et al. (2019). For both datasets, we employ the same evaluation protocol, except we use the absolute (L1) error. We calculate the absolute error (AE) using the highest confidence, center-most face and the distance from each predicted landmark to its respective ground truth,


where and represent the and coordinates of the predicted and ground truth landmark, and and are the height and width of the bounding box, respectively. This AE is then averaged over all faces in the dataset to yield the mean absolute error (mAE). For AFLW all five landmarks are provided, however, for AFW the center of the eyes is not given. Therefore, for AFW we use the mean of the left and right eye corner for the center of each eye.
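
A sketch of this metric is given below. The exact normalization is an assumption here: the example divides the per-landmark L1 distance by the square root of the bounding-box area, following the NME convention described above, and averages over the landmarks of a face. The helper `eye_center` illustrates how an eye center can be derived from two eye corners, as done for AFW; all names are hypothetical.

```python
import math

def absolute_error(pred, truth, box_w, box_h):
    """Mean L1 landmark error of one face, normalized by sqrt(box area)."""
    norm = math.sqrt(box_w * box_h)
    per_point = [(abs(px - tx) + abs(py - ty)) / norm
                 for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(per_point) / len(per_point)

def mean_absolute_error(faces):
    """faces: list of (pred, truth, box_w, box_h) tuples, one per face."""
    return sum(absolute_error(*face) for face in faces) / len(faces)

def eye_center(corner_a, corner_b):
    """Midpoint of the two eye corners, used when the eye center is not annotated."""
    return ((corner_a[0] + corner_b[0]) / 2.0, (corner_a[1] + corner_b[1]) / 2.0)
```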

5 Face Detection Results

5.1 Backbone Baseline

To ensure that we start with the optimal backbone we first compared the performance of all three versions of MobileNet. Table 2 reports the face detection accuracies of each version of MobileNet. We find that there is very little difference between MobileNet (Howard et al., 2017) and MobileNetV2 (Sandler et al., 2018); however, MobileNetV3 (Howard et al., 2019) significantly underperforms. As the performance of MobileNet and MobileNetV2 is so similar, we select MobileNetV2 as our backbone only because it is a lighter network.

Backbone        Hard     Overall
MNet            87.11    90.90
MNetV2          87.27    91.03
MNetV3-Small    86.52    89.37
MNet            89.10    93.10
MNetV2          89.87    93.28
MNetV3-Large    88.23    92.12

Table 2: Generic MobileNet backbones and their respective performance on WIDER FACE. The left column is the performance on the ‘hard’ set, the middle is the performance averaged across all three sets, and the right column is the number of parameters.

For the rest of this paper, we will be referring to three network backbones. For our smallest network (MNet), we use a MobileNetV2 backbone with an value of and just filters in the context modules. The medium sized network (MNet) also uses a MobileNetV2 backbone with and filters in the context modules. For our large network, we choose ResNet101 v2 with filters in the context modules to be comparable with other literature.

5.2 Context Module Comparison: Face Detection

           MNet (AP %)                       MNet (AP %)                       ResNet101 (AP %)
Head       Hard             Overall          Hard             Overall          Hard             Overall
SSH        86.85 (+0.15)    90.79 (-0.01)    89.04 (+0.11)    92.76 (+0.14)    90.55 (-0.08)    94.14 (+0.05)
SSH        86.76 (+0.06)    90.86 (+0.05)    88.76 (-0.17)    92.58 (-0.05)    90.64 (+0.01)    94.22 (+0.13)
Retina     87.11 (+0.42)    91.02 (+0.21)    88.87 (-0.06)    92.59 (-0.04)    90.74 (+0.11)    94.26 (+0.24)
Retina     87.27 (+0.57)    91.03 (+0.22)    88.70 (-0.23)    92.43 (-0.19)    90.49 (-0.14)    94.21 (-0.01)
RSSH       86.41 (-0.29)    90.67 (-0.14)    89.19 (+0.26)    92.69 (+0.06)    90.68 (+0.05)    94.18 (+0.18)
RSSH       86.67 (-0.03)    90.83 (+0.02)    88.99 (+0.06)    92.61 (-0.01)    90.68 (+0.05)    94.25 (+0.17)
Dense      86.47 (-0.23)    90.56 (-0.25)    88.93 (+0.00)    92.57 (-0.05)    90.52 (+0.00)    94.21 (0.02)
Dense      86.05 (-0.65)    90.70 (-0.11)    88.97 (+0.04)    92.77 (+0.14)    89.72 (+0.00)    93.30 (-0.77)
Average    86.70 ± 0.39     90.81 ± 0.16     88.93 ± 0.16     92.63 ± 0.11     90.50 ± 0.33     94.07 ± 0.32

Table 3: The WIDER FACE validation average precision (AP) for each backbone and each context module. The value in parentheses denotes the AP’s divergence () from the mean. The ‘Hard’ columns show the APs on the ‘hard’ set, and the ‘Overall’ columns show the mean APs across all three sets. The bottom row shows the mean AP and standard deviation across all context modules.
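
As a worked example of how the bottom row and the parenthesized divergences relate to the individual entries, the snippet below recomputes them for the first ‘Hard’ column of Table 3 using Python's standard `statistics` module. The variable names are illustrative. The sample standard deviation (with an n - 1 denominator) agrees with the tabulated 0.39 to two decimals, while the recomputed divergences can differ from the tabulated ones in the last digit because the inputs here are already rounded.

```python
import statistics

# 'Hard' APs of the eight context modules, first column of Table 3.
hard_ap = [86.85, 86.76, 87.11, 87.27, 86.41, 86.67, 86.47, 86.05]

mean_ap = statistics.mean(hard_ap)   # average over the context modules
std_ap = statistics.stdev(hard_ap)   # sample standard deviation (n - 1)

# Divergence of each module from the column mean.
divergence = [ap - mean_ap for ap in hard_ap]

print(f"{mean_ap:.2f} +/- {std_ap:.2f}")
```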

Table 3 presents the results for each context module on the WIDER FACE validation dataset. For reference we also trained MNetV2 with two ‘basic’ context modules, our two layer module only achieves and , and our one layer module only and on the ‘hard’ set and overall, respectively. Therefore, we can see that having at least three layers improved the performance by more than percent.

For MNetV2, we find that Retina, Retina, and SSH are the three top performers on the ‘hard’ set, and Retina, Retina, and SSH the top overall. For MNetV2, we find that Retina, Retina, and RSSH are the three top performers on the ‘hard’ set, and SSH, Dense, and RSSH the top overall. For ResNet101, we find that Retina, RSSH, and RSSH are the three top performers on the ‘hard’ set, and Retina, RSSH, and SSH are the top overall. For the ‘hard’ set, we find that the top three context performers are Retina, Retina and SSH with average mean divergences of , and , respectively. Over all three sets, we find that the top three context performers are Retina, SSH and RSSH with average mean divergences of , and , respectively. From these results the context modules seem quite similar in performance, the only real outliers are Dense and Dense which seem to underperform.

To test the statistical significance of these results we ran the same experiment eight times (matching the number of context modules) to determine the amount of variance due to randomness alone. We find almost no difference in variance between running the same experiment and using different context modules. For the ‘hard’ set we get standard deviations of , and for MNetV2, MNetV2 and ResNet101, respectively. Over all three sets we get standard deviations of , and for MNetV2, MNetV2 and ResNet101, respectively. Therefore, we cannot find any significant difference between any of the architectures. We can also see that smaller networks have higher variance. It is therefore unlikely that the architecture of the context module has more importance for smaller networks; rather, the randomness is simply more influential. We also compared the number of filters used in the context module, finding on average a percent performance increase for smaller networks when doubling the number of filters. However, this comes with significant efficiency problems. For example, doubling the number of filters in the context modules for MNet almost doubles the total number of parameters in the network.

5.3 Context Module Comparison: Landmark Accuracy

           MNetV2 (mAE)                  MNetV2 (mAE)                  ResNet101 (mAE)
Head       AFW            AFLW2000       AFW            AFLW2000       AFW            AFLW2000
SSH        1.14 ± 0.62    1.86 ± 2.80    1.03 ± 0.56    1.60 ± 1.92    0.91 ± 0.53    1.56 ± 1.83
SSH        1.55 ± 2.76    2.05 ± 2.42    0.98 ± 0.98    1.51 ± 1.95    0.92 ± 0.54    1.51 ± 2.50
Retina     1.12 ± 0.63    1.84 ± 2.68    0.95 ± 0.59    1.63 ± 2.20    0.90 ± 0.51    1.41 ± 1.47
Retina     1.09 ± 0.68    2.00 ± 3.15    1.02 ± 0.57    1.61 ± 1.95    1.00 ± 0.87    1.63 ± 2.24
RSSH       1.11 ± 0.90    1.99 ± 3.09    1.01 ± 0.52    1.42 ± 1.85    0.90 ± 0.55    1.50 ± 1.99
RSSH       1.12 ± 1.31    2.14 ± 3.33    1.01 ± 0.99    1.54 ± 2.74    0.97 ± 0.73    1.53 ± 2.06
Dense      1.15 ± 2.78    2.11 ± 3.16    1.21 ± 0.66    1.58 ± 2.00    0.93 ± 0.62    1.53 ± 2.02
Dense      1.27 ± 2.63    2.00 ± 2.92    1.05 ± 1.39    1.48 ± 2.07    0.93 ± 0.62    1.53 ± 2.02
Average    1.19 ± 1.80    2.00 ± 2.96    1.03 ± 0.78    1.55 ± 2.10    0.93 ± 0.62    1.53 ± 2.02

Table 4: The mean absolute error (mAE) of the predicted landmarks, normalized by the bounding-box size, for each backbone and context module. The bottom row shows the average mAE and the pooled standard deviation across all context modules. We compare our five landmarks to the five closest in both datasets; however, for AFW we use the average coordinates of the eye corners for each eye. We can see a significant drop in error by increasing the model size, but no significant difference in context module choice.

Table 4 presents the mean absolute error and standard deviation for each backbone and context module. As in the previous section, we also present the accuracy of two MNetV2 networks trained with ‘basic’ context modules. Our single-layer context module network achieves and on AFW and AFLW-2000, respectively. Moreover, our two-layer context module network achieves and on AFW and AFLW-2000, respectively. These results are not far from the accuracy of the other context modules presented in Table 4, so the choice of context module seems to have little influence on landmark accuracy. Following the previous section, to investigate the effect of randomness we compare these results to a single experiment run eight times. When switching context modules we get , and for MNet, MNet and ResNet101, respectively. When running the same experiment eight times, however, we get , and for MNet, MNet and ResNet101, respectively. As in the previous section, we find that the variance due to randomness is similar to the variance due to the choice of context module architecture.
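As a rough check on how the bottom row of Table 4 can be derived, the sketch below recomputes the average mAE and the pooled standard deviation from the eight AFW entries in the first MNetV2 column. It assumes every context module is evaluated on the same number of faces, so that pooling reduces to the root of the mean variance. The helper `between_module_std` is a hypothetical illustration of the spread across context modules; the figures quoted in the text may have been computed differently.

```python
from math import sqrt
from statistics import mean, stdev

# (mAE, std) pairs copied from Table 4: first MNetV2 column, AFW,
# one pair per context module (eight rows above the "Average" row).
AFW_FIRST_MNETV2 = [
    (1.14, 0.62), (1.55, 2.76), (1.12, 0.63), (1.09, 0.68),
    (1.11, 0.90), (1.12, 1.31), (1.15, 2.78), (1.27, 2.63),
]


def average_mae(rows):
    """Plain average of the per-module mean errors."""
    return mean([m for m, _ in rows])


def pooled_std(rows):
    """Pooled standard deviation, ASSUMING equal sample sizes per module."""
    return sqrt(mean([s * s for _, s in rows]))


def between_module_std(rows):
    """Sample standard deviation of the per-module means (hypothetical helper)."""
    return stdev([m for m, _ in rows])


print(round(average_mae(AFW_FIRST_MNETV2), 2))  # 1.19, as in the bottom row
print(round(pooled_std(AFW_FIRST_MNETV2), 2))   # 1.8, i.e. the 1.80 in the bottom row
```

The same three functions can be applied to any other column of Table 4. Comparing `between_module_std` with the spread over repeated runs of a single configuration is the kind of comparison the paragraph above describes.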

For the smaller MNetV2, we find average mean absolute errors of and for AFW and AFLW-2000, respectively. The larger MNetV2 is significantly better on both datasets, with average mean absolute errors of and for AFW and AFLW-2000, respectively. The largest backbone, ResNet101, is better still, with average mean absolute errors of and for AFW and AFLW-2000, respectively. Unsurprisingly, the larger backbones perform significantly better on landmark localization, especially on side faces.

We also consider the impact of increasing the number of filters in the context module on the landmark accuracy. We find that doubling the number of filters in the context module has almost no effect on the quality of the landmarks. These results suggest that the choice of context module architecture and the number of filters it uses are irrelevant to landmark quality.

5.4 Final Results: Face Detection

MNetV2 MNetV2 ResNet152

Table 5: Total number of parameters for each backbone.

Backbone Hard (AP%) Overall (AP%)
MNetV2 87.37 91.49
MNetV2 90.16 93.60
ResNet152 91.71 94.78

Table 6: Average precision of each backbone on the WIDER FACE ‘hard’ set and averaged over all sets. All models are trained with a batch size of eight across six GPUs and are pretrained on ImageNet (Russakovsky et al., 2014).

For our final models we only change the number of GPUs (from one to six) and the number of layers in the ResNet v2 (from 101 to 152). Table 6 shows our final results on the WIDER FACE ‘hard’ set and averaged over all sets. We find that increasing the size of the network is the most reliable way to increase the performance on the WIDER FACE dataset. However, using networks like ResNet152 may not be practical in most applications. Both of our MobileNetV2s perform extremely well. For comparison, Zhang et al. (2019) present results for their ResNet18 model, which achieves very similar accuracy to our much smaller MNetV2, both achieving on the ‘hard’ set. Moreover, Deng et al. (2019) report their results on the ‘hard’ set using MobileNet with , with an AP of just , whereas our model with the same backbone achieves using a six-layer pyramid and using only three layers. By doubling the number of filters in the context module we can achieve over on the ‘hard’ set. However, this makes the network significantly bigger, so it is not a fair comparison with Deng et al. (2019).

Figure 2 shows our final results on all three of the WIDER FACE validation sets. Our ResNet152 ranks a respectable fourth on both the ‘medium’ and ‘hard’ sets without adding a large number of layers. Moreover, our MNetV2 ranks four to five places higher than EXTD (a similar lightweight detector) on all three sets. Our MNetV2 is also able to outperform much heavier networks, for example FAN (Zhang et al., 2019), which uses a ResNet50 backbone.

5.5 Final Results: Landmark Accuracy

Backbone (mAE) (mAE)
MNetV2 1.06 1.11 1.80 2.39
MNetV2 0.79 0.43 1.17 1.42
ResNet152 0.87 1.52 0.87 0.67

Table 7: Mean absolute error () of the landmark predictions for each backbone. We can see that heavier models perform much better on landmark localization.

Table 7 shows the final landmark accuracy for each of our backbones, and Table 5 shows the total number of parameters in the network. For comparison, MTCNN (Zhang et al., 2016) achieves on AFLW-2000, far higher than even our smallest model, and struggles to even detect many of the faces in AFW. We can see that larger backbones generally provide better-quality landmarks. As previously mentioned, landmark accuracy can be improved significantly by reducing the number of layers in the feature pyramid. However, this causes a substantial loss in face detection accuracy. We also investigated including more filters in the context modules, which does not affect the quality of the landmarks. Moreover, we found that using deformable convolutions also impeded the landmark accuracy, but again this is a trade-off we make to ensure higher face detection accuracy.

Figure 2: Final results for face detection on the WIDER FACE validation set. In the figure we show the curves for all submissions but only label the top 20.

5.6 Final Results: Network Performance

Table 8 shows the inference speed of both MobileNetV2 models on different devices. These devices are very heterogeneous, so we use specialized techniques to get the best performance from each of them. All of the optimization techniques are open source, so these benchmarks should be reproducible. For the desktop CPU (Intel i5-7500) and the embedded GPU (NVIDIA Jetson TX2) we take advantage of tvm (Chen et al., 2018), while mxnet (Chen et al., 2015) accelerated using cuda (Nickolls et al., 2008) is used for the desktop GPU (NVIDIA 1050ti).
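A latency measurement of this kind can be sketched with a small, framework-agnostic timing loop. This is not the authors' benchmarking code: the function name `mean_latency_ms`, the warm-up count, and the repeat count are all assumptions, and `run_once` stands in for a single forward pass of a compiled model. On GPU backends that execute asynchronously, `run_once` would need to block until the output is ready, otherwise the loop only measures how long it takes to queue the work.

```python
import time


def mean_latency_ms(run_once, warmup=10, repeats=100):
    """Average wall-clock time of run_once() in milliseconds.

    run_once: zero-argument callable performing one inference
              (hypothetical stand-in for a model forward pass).
    warmup:   untimed calls that absorb one-off costs such as
              lazy initialization or cache warm-up.
    repeats:  number of timed calls to average over.
    """
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / repeats


if __name__ == "__main__":
    # Dummy workload in place of a real detector.
    print(mean_latency_ms(lambda: sum(range(10000)), warmup=3, repeats=20))
```

Averaging over many calls after a warm-up phase keeps one-off start-up costs out of the reported per-frame latency.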

Device MNetV2 MNetV2
ms fps ms fps
Desktop CPU 60.3 16.6 142.2 7.0
Mobile CPU 38.1 26.2 89.3 11.2
Embedded GPU 34.3 29.2 77.3 12.9
Mobile GPU 17.6 56.8 39.9 25.1
Desktop GPU 9.1 109.9 20.9 47.8

Table 8: For the desktop we use one core of an Intel i5-7500 (CPU) and an NVIDIA 1050ti (GPU). For the mobile device we use a Xiaomi Mi9 with a SnapDragon 855 chipset (Kryo 485 CPU and Adreno 640 GPU). The embedded GPU is an NVIDIA Jetson TX2.
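The two columns per model in Table 8 are related by fps = 1000 / ms. The short sketch below recomputes the fps values from the latencies listed in the table; the dictionary layout and the function name `fps_from_ms` are illustrative choices, not part of the original work.

```python
# Latencies in milliseconds copied from Table 8:
# device -> (first MNetV2 column, second MNetV2 column).
TABLE_8_MS = {
    "Desktop CPU": (60.3, 142.2),
    "Mobile CPU": (38.1, 89.3),
    "Embedded GPU": (34.3, 77.3),
    "Mobile GPU": (17.6, 39.9),
    "Desktop GPU": (9.1, 20.9),
}


def fps_from_ms(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


for device, (first_ms, second_ms) in TABLE_8_MS.items():
    print(device, round(fps_from_ms(first_ms), 1), round(fps_from_ms(second_ms), 1))
```

Rounded to one decimal place, these values agree with the fps columns of Table 8.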

For comparison, EXTD (Yoo et al., 2019) uses far fewer parameters in its face detector; just . However, our smallest model is not only ms faster on a VGA input, it also performs better on the WIDER FACE hard set by . RefineFace (Zhang et al., 2019) uses ResNet-18 as its smallest model, which runs in 26.8ms on an NVIDIA 1080ti. By comparison, we achieve similar results on the WIDER FACE hard set with our MNetV2, which runs ms faster on a much slower GPU (NVIDIA 1050ti). To compare with Deng et al. (2019), we use a three-layer pyramid and achieve very similar or better inference speeds. We also achieve a higher accuracy on the WIDER FACE hard set with the same model.

6 Conclusions

We have shown that the choice of context module architecture is likely irrelevant to the models’ performance. One possible reason for this is that the layers added by the feature pyramid and the context modules are always randomly initialized and, for smaller networks, they can constitute a large number of the total parameters percent. Therefore, a ‘lucky’ initialization can yield more performance gain than crafting an optimal context module. One possible way around this would be to pretrain the full network on a similar detection task, e.g. person detection, to alleviate the effect of the random initialization.

Our largest model can achieve a near state-of-the-art score on the WIDER FACE hard set of percent without making use of any excessive additional layers. It also provides very accurate landmarks that can be used for face alignment. Our two smaller networks can exceed state-of-the-art performance on the WIDER FACE hard set compared to similar network sizes. These networks also provide accurate landmarks while being able to run in real-time on modest desktop and mobile hardware.


We would like to thank Aubin Samacoits, Jeff Hnybida, Riccardo Gallina and Sanjana Jain for their constructive input and feedback during the writing of this paper. We would also like to thank CAT Telecom for granting us access to their GPU cluster for training.


  1. http://shuoyang1213.me/WIDERFACE
  2. https://github.com/deepinsight/insightface/tree/master/RetinaFace


  1. Self-Driving Cars: A Survey. arXiv e-prints, pp. arXiv:1901.04407. External Links: 1901.04407 Cited by: §1.
  2. Binarized Convolutional Landmark Localizers for Human Pose Estimation and Face Alignment with Limited Resources. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2017-Octob, pp. 3726–3734. External Links: Document, ISBN 9781538610329, ISSN 15505499 Cited by: §1.
  3. Supervised Transformer Network for Efficient Face Detection. arXiv e-prints, pp. arXiv:1607.05477. External Links: 1607.05477 Cited by: §2.3.
  4. Joint Cascade Face Detection and Alignment. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele and T. Tuytelaars (Eds.), Cham, pp. 109–122. External Links: ISBN 978-3-319-10599-4 Cited by: §2.3.
  5. Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. Cited by: §5.6.
  6. TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, pp. 578–594. External Links: ISBN 978-1-939133-08-3, Link Cited by: §5.6.
  7. Histograms of oriented gradients for human detection. In Proceedings - 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, External Links: Document, ISBN 0769523722 Cited by: §1.
  8. RetinaFace: single-stage dense face localisation in the wild. In arxiv, Cited by: §1, §1, §2.1, §2.2, §2.3, §2.4, §2.5, Figure 1, §4.1, §4.2, §4.3, §4.3, §5.4, §5.6.
  9. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Cited by: §4.3.
  10. A closer look at faster r-cnn for vehicle detection. In 2016 IEEE intelligent vehicles symposium (IV), pp. 124–129. Cited by: §1.
  11. Object detection with discriminatively trained part-based models. Computer. External Links: Document, ISSN 00189162 Cited by: §1.
  12. Unsupervised Training for 3D Morphable Model Regression. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 8377–8386. External Links: Document, ISBN 9781538664209, ISSN 10636919 Cited by: §1.
  13. Object detection via a multi-region & semantic segmentation-aware CNN model. arXiv e-prints, pp. arXiv:1505.01749. External Links: 1505.01749 Cited by: §4.3.
  14. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv e-prints, pp. arXiv:1311.2524. External Links: 1311.2524 Cited by: §4.3.
  15. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
  16. Fast R-CNN. arXiv e-prints, pp. arXiv:1504.08083. External Links: 1504.08083 Cited by: §1, §4.2.
  17. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv e-prints, pp. arXiv:1706.02677. External Links: 1706.02677 Cited by: §4.2.
  18. Stacked Dense U-Nets with Dual Transformers for Robust Face Alignment. External Links: 1812.01936, Link Cited by: §1.
  19. GluonCV and gluonnlp: deep learning in computer vision and natural language processing. arXiv preprint arXiv:1907.04433. Cited by: §4.2.
  20. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv e-prints, pp. arXiv:1406.4729. External Links: 1406.4729 Cited by: §2.1.
  21. Identity Mappings in Deep Residual Networks. arXiv e-prints, pp. arXiv:1603.05027. External Links: 1603.05027 Cited by: §4.2.
  22. In Defense of the Triplet Loss for Person Re-Identification. arXiv e-prints, pp. arXiv:1703.07737. External Links: 1703.07737 Cited by: §1.
  23. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. External Links: 1704.04861, Link Cited by: §4.2, §5.1.
  24. Searching for MobileNetV3. External Links: 1905.02244, Link Cited by: §5.1.
  25. Densely Connected Convolutional Networks. arXiv e-prints, pp. arXiv:1608.06993. External Links: 1608.06993 Cited by: §2.5.
  26. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. In Workshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France. External Links: Link Cited by: §2.4.
  27. Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Vol. , pp. 2144–2151. External Links: Document, ISSN Cited by: §1.
  28. DSFD: Dual Shot Face Detector. External Links: 1810.10220, Link Cited by: §1, §2.2, §4.2.
  29. Light-Head R-CNN: In Defense of Two-Stage Object Detector. arXiv e-prints, pp. arXiv:1711.07264. External Links: 1711.07264 Cited by: §2.2, §4.3.
  30. PyramidBox++: High Performance Detector for Finding Tiny Face. External Links: 1904.00386, Link Cited by: §1, §2.5, Figure 1.
  31. Feature Pyramid Networks for Object Detection. External Links: 1612.03144, Link Cited by: §2.1.
  32. Focal Loss for Dense Object Detection. External Links: 1708.02002, Link Cited by: §1, §1, §1, §2.2, §2.2, §3.1.
  33. Microsoft COCO: Common Objects in Context. arXiv e-prints, pp. arXiv:1405.0312. External Links: 1405.0312 Cited by: §2.1.
  34. Receptive Field Block Net for Accurate and Fast Object Detection. arXiv e-prints, pp. arXiv:1711.07767. External Links: 1711.07767 Cited by: §2.2.
  35. SSD: Single shot multibox detector. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 9905 LNCS, pp. 21–37. External Links: Document, ISBN 9783319464473, ISSN 16113349 Cited by: §2.2.
  36. Dense Face Alignment. Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017 2018-January, pp. 1619–1628. External Links: Document, 1709.01442, ISBN 9781538610343, Link Cited by: §1.
  37. IARPA janus benchmark - c: face dataset and protocol. In 2018 International Conference on Biometrics (ICB), Vol. , pp. 158–165. External Links: Document, ISSN Cited by: §2.4.
  38. Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, Vol. 2, pp. 5. Cited by: §2.4.
  39. SSH: Single Stage Headless Face Detector. arXiv e-prints, pp. arXiv:1708.03979. External Links: 1708.03979 Cited by: §2.1, §2.2, §2.5, Figure 1, §4.3.
  40. Stacked hourglass networks for human pose estimation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9912 LNCS, pp. 483–499. External Links: Document, 1603.06937, ISBN 9783319464831, ISSN 16113349, Link Cited by: §1.
  41. Scalable parallel programming with cuda. Queue 6 (2), pp. 40–53. External Links: ISSN 1542-7730, Link, Document Cited by: §5.6.
  42. YOLOv3: An Incremental Improvement. External Links: 1804.02767, Link Cited by: §1.
  43. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. External Links: Document, 1506.01497, ISSN 01628828 Cited by: §1.
  44. ImageNet Large Scale Visual Recognition Challenge. arXiv e-prints, pp. arXiv:1409.0575. External Links: 1409.0575 Cited by: §4.2, §5.4.
  45. Intervertebral disc detection in x-ray images using faster r-cnn. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 564–567. Cited by: §1.
  46. MobileNetV2: Inverted Residuals and Linear Bottlenecks. External Links: 1801.04381, Link Cited by: §4.2, §5.1.
  47. Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Vol. , pp. 1–9. External Links: Document, ISSN Cited by: §2.4.
  48. Training Region-based Object Detectors with Online Hard Example Mining. arXiv e-prints, pp. arXiv:1604.03540. External Links: 1604.03540 Cited by: §4.2.
  49. Deep Learning Face Representation by Joint Identification-Verification. arXiv e-prints, pp. arXiv:1406.4773. External Links: 1406.4773 Cited by: §1, §2.4.
  50. DeepFace: closing the gap to human-level performance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 1701–1708. External Links: Document, ISSN Cited by: §1, §2.4.
  51. A Survey on Deep Transfer Learning. arXiv e-prints, pp. arXiv:1808.01974. External Links: 1808.01974 Cited by: §4.2.
  52. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. External Links: 1905.11946, Link Cited by: §1.
  53. PyramidBox: A Context-assisted Single Shot Face Detector. External Links: 1803.07737, Link Cited by: §1, §4.2.
  54. Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision. External Links: 1811.08557, Link Cited by: §2.3.
  55. Selective search for object recognition. International journal of computer vision 104 (2), pp. 154–171. Cited by: §1.
  56. Multiple Kernels for object detection. In Proceedings of the IEEE International Conference on Computer Vision, External Links: Document, ISBN 9781424444205 Cited by: §1.
  57. Robust real-time face detection. International Journal of Computer Vision 57, pp. 137–154. Cited by: §1, §1, §2.1.
  58. Spatial-Temporal Person Re-identification. External Links: 1812.03282, Link Cited by: §1.
  59. Face attention network: an effective face detector for the occluded faces. arXiv preprint arXiv:1711.07246. Cited by: §1, §2.1, §4.3.
  60. Deep High-Resolution Representation Learning for Visual Recognition. External Links: 1908.07919, Link Cited by: §1.
  61. Deep Face Recognition : A Survey [2018/06/04]. pp. 1–17. External Links: 1804.06655, Link Cited by: §1.
  62. Facial feature detection using haar classifiers. J. Comput. Sci. Coll. 21 (4), pp. 127–133. External Links: ISSN 1937-4771, Link Cited by: §1.
  63. Application of faster r-cnn model on human running pattern recognition. arXiv preprint arXiv:1811.05147. Cited by: §1.
  64. WIDER face: a face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §4.1, §4.3.
  65. EXTD: Extremely Tiny Face Detector via Iterative Filter Reuse. arXiv e-prints, pp. arXiv:1906.06579. External Links: 1906.06579 Cited by: §2.2, §5.6.
  66. Accurate Face Detection for High Performance. arXiv e-prints, pp. arXiv:1905.01585. External Links: 1905.01585 Cited by: §1, §2.1, §2.2, §4.2, §5.4.
  67. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. External Links: Document, 1604.02878 Cited by: §1, §2.3, §2.4, §4.1, §5.5.
  68. RefineFace: Refinement Neural Network for High Performance Face Detection. arXiv e-prints, pp. arXiv:1909.04376. External Links: 1909.04376 Cited by: §1, §2.1, §2.2, §4.2, §5.4, §5.6.
  69. Improved Selective Refinement Network for Face Detection. External Links: 1901.06651, Link Cited by: §4.2.
  70. S3FD: Single Shot Scale-Invariant Face Detector. Proceedings of the IEEE International Conference on Computer Vision 2017-October, pp. 192–201. External Links: Document, 1708.05237, ISBN 9781538610329, ISSN 15505499, Link Cited by: §1, §2.1, §4.2, §4.3.
  71. Robust and High Performance Face Detector. arXiv e-prints, pp. arXiv:1901.02350. External Links: 1901.02350 Cited by: §1.
  72. Objects as Points. External Links: 1904.07850, Link Cited by: §1.
  73. Dense 3d face decoding over 2500fps: joint texture & shape convolutional mesh decoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1097–1106. Cited by: §1.
  74. Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 2879–2886. External Links: Document, ISSN Cited by: §4.3.
  75. Face Alignment Across Large Poses: A 3D Solution. arXiv e-prints, pp. arXiv:1511.07212. External Links: 1511.07212 Cited by: §4.3.
  76. Deformable ConvNets v2: More Deformable, Better Results. arXiv e-prints, pp. arXiv:1811.11168. External Links: 1811.11168 Cited by: §4.2.
  77. Edge Boxes: Locating Object Proposals from Edges. In European Conference on Computer Vision, ECCV edition. External Links: Link Cited by: §4.1.