FHEDN: A Context-Modeling-Based Feature Hierarchy Encoder-Decoder Network for Face Detection


Objects in images gathered from outdoor surveillance cameras or access control systems are usually small, blurry, occluded and diverse in pose because of weather conditions, camera pose and range, etc. Detecting faces precisely is therefore both challenging and important for face recognition systems in the field of public security. In this paper, we design a context-modeling-based structure for face detection named Feature Hierarchy Encoder-Decoder Network (FHEDN), which detects small, blurry and occluded faces hierarchy by hierarchy, from the last feature hierarchy back to the first, in the manner of an encoder-decoder within a single network. The proposed network consists of multiple context modeling and prediction modules designed to detect small, blurry, occluded and diversely posed faces. In addition, we analyze the influence of the training set distribution, the default box scale and the receptive field size on detection performance in the implementation stage. Experiments show that our network achieves promising performance on the WIDER FACE and FDDB benchmarks.

1 Introduction

Face detection has been one of the most studied problems in computer vision in recent years. It is a core module of face recognition systems, which have been successfully applied in many areas such as public security surveillance and smart payment. Many state-of-the-art algorithms for face detection have been presented over the past two decades.

Early methods use hand-crafted features and specific classifiers to detect faces in natural images. Viola and Jones[42] put forward the cascade face detector based on Haar-like features and an AdaBoost classifier, whose excellent real-time detection performance made it a milestone in face detection. Following Viola and Jones, many improved methods based on extended Haar-like features and cascaded boosting have been proposed. Chen et al. designed a joint cascaded framework named JDA for face detection and alignment[3], which adopts shape-indexed features and a cascaded boosting-based classifier to solve the two tasks jointly. Liao et al. proposed a fast and accurate unconstrained face detector that takes advantage of a feature named Normalized Pixel Difference (NPD) and a deep quadratic tree structure[18]. Besides, following the state-of-the-art results of Deformable Parts Models (DPMs) in object detection[4], DPMs have also been successfully applied to face detection and achieved competitive performance compared with other methods[24, 31].

With the state-of-the-art performance achieved by DCNNs in computer vision, more and more DCNN-based models have been proposed for the generic object detection task. They fall into two categories. One is scale-invariant methods, such as the seminal work of RCNN[8], Fast RCNN[7] and Faster RCNN[33]. The other is scale-variant methods, which include YOLO[32] and SSD[20]. Recent face detection methods typically follow one of these two paradigms and use DCNNs as the backbone to learn highly discriminative representations. Among them, [17, 30, 51] combine the traditional cascade style with region proposals in DCNNs and achieve a good trade-off between accuracy and speed. In contrast, [11, 48, 52, 26] follow the scale-variant methodology, so that they can detect faces on multiple feature hierarchies without constructing image pyramids.

In this paper, we focus on leveraging Deep Convolutional Neural Networks (DCNNs) to detect small and blurry faces in unconstrained scenes. Small object detection is known to be an open and challenging problem. Images gathered from outdoor surveillance cameras are usually affected by weather conditions, camera pose and range, etc. Consequently, objects in these images are small, blurry, occluded and diverse in pose. Detecting such faces well in surveillance footage is a key problem for subsequent face recognition. Therefore, we are interested in constructing a DCNN-based framework to solve this problem.

This paper makes the following main contributions:
(1) Inspired by scale-variant methods such as the Single Shot Detector (SSD)[20], we first fine-tune SSD for the face detection task with Annotated Facial Landmarks in the Wild (AFLW)[16] as the training dataset. We then explain the trained model's weak capability for detecting small faces in terms of the distribution characteristics of the training dataset.
(2) We design a context-modeling-based Feature Hierarchy Encoder-Decoder Network (FHEDN) to detect small and blurry faces in unconstrained scenes, which learns and fuses the context information around the face from the adjacent, previous feature hierarchy. A face in an image is not an independent individual: it sits above the neck and below the hair. This context information around the face helps FHEDN improve detection performance.
(3) To model context in FHEDN, we employ a stacked hourglass network structure[1] to fuse context information hierarchy by hierarchy. The overall network therefore adopts a scale-variant design paradigm in an "encoder-decoder" style. Furthermore, we analyze the influence of the default box scale, the receptive field size, etc. on detection performance (discussed in Section 4.2).

2 Related Work

As the survey on face detection[49] elaborates thoroughly, there are numerous works in the field of face detection. In this section, we focus only on work that exploits deep learning.

As early as the 1990s, there were works using neural networks for the face detection task. Vaillant et al.[41] proposed a two-stage CNN to detect faces in images in a coarse-to-fine manner. Rowley et al.[34] presented a neural-network-based approach for upright frontal face detection and achieved improved performance. Garcia et al.[6] developed a convolutional neural architecture to detect faces at different poses and rotation angles. In 2005, Osadchy et al.[29] designed a multi-task neural network to jointly train a face detector and a pose estimator. In the last five years, deep learning has attracted more and more attention because of its state-of-the-art performance in computer vision, natural language processing, speech recognition, etc. Among the various deep neural network architectures, DCNNs have been studied the most because they have achieved breakthrough results in computer vision tasks such as image classification[15, 50, 37, 39, 10], generic object detection[8, 7, 33, 32, 20, 35, 40, 9, 25] and semantic segmentation[27, 22]. Inspired by deep learning-based methods in generic object detection, DCNN-based face detection algorithms can be divided into two categories: scale-invariant methods and scale-variant methods.

Scale-invariant methods: The seminal work of Faster RCNN employs region of interest (ROI) pooling to extract scale-invariant features. In addition, several new DCNN-based face detection algorithms built on the traditional cascaded classification scheme have been proposed in recent years. Li et al.[17] put forward a convolutional neural network for face detection that consists of a two-stage cascaded network: one stage quickly eliminates non-face regions, and the other carefully evaluates the candidates with a fast multi-resolution technique. Szarvas et al.[38] used DCNNs for multi-view face detection in a detector named Deep Dense Face Detector (DDFD), a single model whose final heat map is used for face classification and bounding box regression. Qin et al.[30] addressed the fact that the previous cascaded DCNN structure[17] trains its stages in isolation, and jointly trained the different stages to achieve better performance. Similar to [30], Zhang et al.[51] used a DCNN consisting of three subnetworks for the multiple tasks of face detection and alignment. These region-based, cascade-integrated methods either form a multi-scale image pyramid from the input or resize inputs of various sizes to a fixed size. They perform several forward passes during inference, and the computational cost of the model increases correspondingly.

Scale-variant methods: These methods extract features and detect faces on various hierarchies of a single network, and then merge the predictions from these hierarchies to generate the overall detection results. Following this design style, Hu et al.[11] showed that context is helpful for detecting tiny faces in complicated scenes, and defined a foveal descriptor to extract features of tiny faces over a large receptive field. Inspired by [11], [48] proposed a scale-friendly DCNN approach that trains specialized networks with the most suitable depth and spatial pooling stride to detect faces in each specific sub-range of scales. S. Zhang et al.[52] put forward a single shot scale-invariant face detector, which uses VGG16 as the backbone network and multiple feature hierarchies in a single network. M. Najibi et al.[26] designed a single stage headless network structure that is scale-invariant and simultaneously detects faces of different scales from different layers in a single forward pass.

3 Our work

3.1 Fine-tuning SSD for face detection

As stated in [17, 30, 51], network components such as 12-net[17], branch x12[30] and pnet[51] employ a region proposal mechanism to obtain the ROIs of the input image. To process input images at multiple scales, these methods first construct an image pyramid and then feed each level of the pyramid into the DCNN to extract features and detect faces with the region proposal mechanism. Such frameworks, whose feature pyramids are built upon image pyramids, require dense scale sampling to achieve good results, which costs more inference time and storage. In addition, multi-stage joint training or testing increases the complexity of the processing flow. Scale-variant networks such as SSD instead detect objects on various feature hierarchies of a single network, since these feature hierarchies have an inherent multi-scale, pyramidal structure. Therefore, to reduce the complexity of training and testing a face detector, we can directly use the pyramidal feature hierarchy of DCNNs to detect faces without constructing image pyramids. We directly fine-tune SSD for the face detection task. Some results are shown in Figure 1.

(a) Detection result in WIDERFACE
(b) Detection result in FDDB
Figure 1: Some results detected by fine-tuning SSD

Although the fine-tuned SSD is effective for face detection, as shown above, it has some shortcomings; for instance, it is limited in detecting small faces. DCNNs obtain a series of feature maps by forward propagation, layer by layer, with pooling operations. The feature hierarchy formed by these feature maps has an inherent multi-scale, pyramidal shape. SSD reuses the multi-scale feature maps computed in the forward pass and adds extra layers to build a bottom-up feature pyramid for detecting objects of different scales. As the depth of the network increases, SSD detects objects on the lower-resolution maps of the higher-level layers in the feature hierarchy. However, it does not make the strong semantics, which can improve detection performance, available on the higher-resolution feature maps of the low-level layers. Therefore, SSD-style use of the pyramidal feature hierarchy of DCNNs not only produces feature maps of different spatial resolutions, but also introduces large semantic gaps caused by different depths[19]. Consequently, SSD is limited in detecting small objects.

(a) Distribution of WIDER FACE
(b) Distribution of AFLW
Figure 2: Distribution of face size in WIDER FACE and AFLW

We analyze the distributions of annotated faces in WIDER FACE[47] and AFLW. As illustrated in Figure 2, most annotated faces in the WIDER FACE training set are very small and blurry: about 76% of the faces are smaller than 40px, and 15% are smaller than 10px. In contrast, approximately 80% of the faces in AFLW are larger than 92px, and 19% are in the range of 40px-92px. Hence, for a scale-variant, strongly supervised method without image pyramids such as SSD, a training set that lacks small ground-truth faces is not suitable for learning to detect small objects. Besides, there are approximate 15.1624% faces in images, 17.0668% ones in the range of . Therefore, if we resize the original image to the input size of SSD, the annotated small faces (<10px, or ≥10px but <40px) become much smaller, so that the extracted features lack effective representational ability. At the same time, the detected bounding box of the receptive field located on that hierarchy is invariant. When this bounding box slides over that region, it cannot discriminate the weak features of a small face from the background. As a result, the trained model has a weak capacity to detect small faces.
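
The size statistics above can be reproduced with a simple histogram over annotated boxes. The following is a minimal sketch, not the authors' script: the bin edges follow the size ranges used in the text, while the function names and the choice of the longer box side as "face size" are assumptions.

```python
from bisect import bisect_right

# Upper edges of the face-size ranges used in the text (pixels); the last bin is open-ended.
EDGES = [10, 40, 92, 192]
LABELS = ["<10", "10-40", "40-92", "92-192", ">=192"]


def face_size(box):
    """Size of a face box (x, y, w, h); using the longer side is an assumption."""
    _, _, w, h = box
    return max(w, h)


def size_distribution(boxes):
    """Return the fraction of boxes that fall in each size range."""
    counts = [0] * len(LABELS)
    for box in boxes:
        # bisect_right places a size equal to an edge into the upper bin (e.g. 10 -> "10-40").
        counts[bisect_right(EDGES, face_size(box))] += 1
    total = len(boxes) or 1
    return {label: count / total for label, count in zip(LABELS, counts)}
```

Applied to the ground-truth boxes of a training set, the returned fractions correspond to the per-range percentages discussed above.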

3.2 The designed network

Our goal is to leverage the scale-variant feature pyramid hierarchy for face detection in unconstrained scenes. Besides, as shown in Figure 1, humans can quickly detect small faces even under extreme and challenging conditions (e.g., blur and noise) because we use the semantic information around faces, such as hair, neck or hat. Therefore, we focus on how to take full advantage of the strong semantic information in the feature maps of different hierarchies extracted by the backbone network. We design an encoder-decoder-like network for face detection named Feature Hierarchy Encoder-Decoder Network (FHEDN), which uses the feature maps of different hierarchies extracted by a pre-trained backbone network, e.g., VGG-16. The designed network consists of two key components: the context modeling module and the prediction module. We describe them in detail in the following sections.

Context modeling module

[11] demonstrated the effect of contextual semantic information on detecting extremely small, blurry and noisy faces. How to model context so as to obtain strong semantic information that improves performance is the key technique of the whole work. Previous works, e.g., [19, 5, 14, 2, 22], use subsequent deeper features and an upsampling procedure to refine contextual semantic information. Inspired by these works, we follow an hourglass-like architecture called "encoder-decoder"[27, 1], which encodes the features embedding contextual semantic information in the middle layers and then decodes them for the specific task in the higher layers. Through a top-down pathway, this style of network combines low-resolution, semantically strong features with high-resolution, semantically weak features, and finally produces a feature pyramid with rich semantic information. Hence we adopt the encoder-decoder paradigm to model context. The whole context modeling procedure consists of deconvolution and element-wise sum operations: the deconvolutional layer upsamples the current feature hierarchy, while the element-wise sum layer fuses features from different layers. The context modeling module is shown in Figure 3.

(a) Modeling mode A
(b) Modeling mode B
Figure 3: The context modeling module

We try the two modeling modes shown in Figure 3. We found that mode A, which contains three convolutional layers, is not very helpful for improving performance. To reduce redundant computation, we employ mode B, which contains only one convolutional layer that adjusts the number of channels for fusion. It works well and does not sacrifice accuracy.
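
The fusion step can be illustrated with a small, framework-free sketch. It is only an approximation under stated assumptions: nearest-neighbor upsampling stands in for the learned deconvolution, the single 1x1 convolution is applied to the deeper map (the text does not say which branch it sits on), and all function names are hypothetical. Feature maps are nested lists indexed as `feature[channel][y][x]`.

```python
def conv1x1(feature, weights):
    """1x1 convolution: mix the input channels into len(weights) output channels."""
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [[sum(w * feature[c][y][x] for c, w in enumerate(row)) for x in range(width)]
         for y in range(height)]
        for row in weights
    ]


def upsample2x(feature):
    """Nearest-neighbor 2x upsampling, standing in for the learned deconvolution."""
    out = []
    for channel in feature:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(2)]
            rows.append(wide)
            rows.append(list(wide))
        out.append(rows)
    return out


def elementwise_sum(a, b):
    """Element-wise sum of two feature maps with identical shapes."""
    return [[[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def fuse_context(shallow, deep, weights):
    """Adjust the channels of the deeper map, upsample it, and add it to the shallower map."""
    return elementwise_sum(shallow, upsample2x(conv1x1(deep, weights)))
```

For example, fusing a 1-channel 2x2 map with a 2-channel 1x1 map requires `weights` with one row of two coefficients, so that both operands of the sum have the same channel count and spatial size.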

Prediction module

Following the scale-variant feature hierarchy design of SSD, a set of convolutional blocks is added as a prediction module after each scale hierarchy. As shown in Figure 4, each prediction module consists of two convolutional layers and a default box generation layer. One convolutional layer feeds the subsequent softmax layer that classifies faces. The other is used for location regression. The default box generation layer computes the default boxes for the corresponding feature hierarchy. The prediction module produces a group of vectors, each made up of five values: the first represents the face/background confidence, and the other four represent the coordinates of the top-left and bottom-right corners of the detection box (illustrated as the white and gray cells in Figure 4, respectively).

Figure 4: Prediction module

Suppose we want to compute the information of default box in some feature hierarchy. The calculation formula is


where and mean the center coordinate of the default box, represents index of the x and y axis on feature map in this hierarchy respectively. means the width and height of feature map respectively. is offset of the center coordinate of current default box to next one and is set 0.5. indicates stride of the default box on original detected image following x and y axis respectively and are width and height of the image. note the width and height of default box. denotes the size of the default box on the th(total l hierarchies) hierarchy, which is computed as


where means the min dimension of input image. and are computed as and , where are receptive field size of 0-th and l-th scale hierarchy respectively. is the aspect ratio for the default box. Similar to SSD, we add a default box whose hierarchy is for aspect ratio of 1.
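
As an illustration of how default boxes are laid out on one feature hierarchy, the sketch below follows the SSD[20] convention (cell index plus an offset of 0.5, multiplied by the stride and normalized by the image size; width and height scaled by the square root of the aspect ratio; one extra box for aspect ratio 1). It does not reproduce this paper's receptive-field-based computation of the box size, which is passed in as a plain argument, and all names are hypothetical.

```python
import math


def default_boxes(fmap_w, fmap_h, img_w, img_h, step_x, step_y, size, next_size,
                  aspect_ratios=(1.0,), offset=0.5):
    """Generate normalized (cx, cy, w, h) default boxes for one feature hierarchy.

    size and next_size are the box sizes (in pixels) of this hierarchy and the next one.
    """
    boxes = []
    for j in range(fmap_h):
        for i in range(fmap_w):
            # Center of cell (i, j) mapped back to the image and normalized to [0, 1].
            cx = (i + offset) * step_x / img_w
            cy = (j + offset) * step_y / img_h
            for ratio in aspect_ratios:
                w = size * math.sqrt(ratio) / img_w
                h = size / math.sqrt(ratio) / img_h
                boxes.append((cx, cy, w, h))
            # Extra box for aspect ratio 1, as in SSD: size sqrt(s_k * s_{k+1}).
            extra = math.sqrt(size * next_size)
            boxes.append((cx, cy, extra / img_w, extra / img_h))
    return boxes
```

With a 2x2 feature map, a 16x16 image and a stride of 8, the first cell is centered at (0.25, 0.25) and every cell carries one box per aspect ratio plus the extra box.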

Network architecture and Training objective

Network architecture   Taking VGG-16 as the backbone network for example, it has 13 convolutional layers and 3 fully-connected layers. We convert both FC6 and FC7 to convolutional layers and retain the first 13 convolutional layers. Similar to SSD, we add extra convolutional layers to extend VGG-16 and improve its representational ability. The overall network is therefore fully convolutional. As shown in Figure 5, it contains an encoder part and a decoder part. The encoder subnetwork extracts deep features from the input image hierarchy by hierarchy. The decoder subnetwork integrates the context modeling and prediction modules described above.

Figure 5: The whole network architecture

Training objective   We employ the multi-task loss defined in [20] to jointly train our network:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where $N$ is the number of matched default boxes, $\alpha$ is used to balance the two loss terms and is set to 1, $x_{ij} \in \{0, 1\}$ indicates whether the i-th default box is matched to the j-th annotated face box, $c$ is the confidence of face or background, and $l$ and $g$ denote the predicted box and the annotated face box, respectively. We first match each face to the default box with the maximum jaccard overlap. Then, a default box whose jaccard overlap with an annotated face is greater than a threshold is also assigned to that face.
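
The two-step matching can be sketched as follows. This is an illustrative reading of the strategy, not the Caffe implementation used in the experiments; the threshold of 0.5 is the value used by SSD[20] and is an assumption here, as are the function names and the (xmin, ymin, xmax, ymax) box format.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def match(defaults, faces, threshold=0.5):
    """Return {default_index: face_index} for all matched default boxes."""
    matches = {}
    if not defaults or not faces:
        return matches
    # Step 1: every face claims the default box it overlaps most (if any overlap exists).
    for j, face in enumerate(faces):
        best = max(range(len(defaults)), key=lambda i: jaccard(defaults[i], face))
        if jaccard(defaults[best], face) > 0:
            matches[best] = j
    # Step 2: each remaining default box joins its best face if the overlap exceeds the threshold.
    for i, box in enumerate(defaults):
        if i in matches:
            continue
        overlaps = [jaccard(box, face) for face in faces]
        j = max(range(len(faces)), key=overlaps.__getitem__)
        if overlaps[j] > threshold:
            matches[i] = j
    return matches
```

The number of entries in the returned dictionary plays the role of N in the training objective.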

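$L_{conf}$ represents the category confidence loss, for which we adopt the 2-class softmax loss for face detection:

$$L_{conf}(x, c) = -\sum_{i \in Pos} x_{ij} \log\left(c_i^{f}\right) - \sum_{i \in Neg} \log\left(c_i^{b}\right)$$

where $c_i^{f}$ denotes the predicted confidence of face for the i-th default box, while $c_i^{b}$ denotes the predicted confidence of background for the i-th default box.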

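$L_{loc}$ is a smooth L1 loss[7], whose value measures the loss between the predicted box ($l$) and the ground truth box ($g$). Following [20], we use it to regress the offsets between the default box and the annotated face:

$$L_{loc}(x, l, g) = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x_{ij}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)$$

where $\hat{g}_j$ is the annotated face box encoded as offsets relative to the matched default box, and the smooth L1 loss function defined in [7] is

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$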
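
The smooth L1 function of [7] and the multi-task objective of [20] can be sketched in a few lines. The data layout (tuples of face probability, predicted offsets and target offsets) and the function names are assumptions made for this illustration; returning 0 when no box is matched follows the convention stated in [20].

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def multitask_loss(positives, negative_bg_probs, alpha=1.0):
    """Confidence loss plus alpha times localization loss, averaged over matched boxes.

    positives: list of (face_prob, predicted_offsets, target_offsets) for matched boxes.
    negative_bg_probs: predicted background probabilities of the selected negative boxes.
    """
    n = len(positives)
    if n == 0:
        return 0.0
    conf = -sum(math.log(p) for p, _, _ in positives)
    conf -= sum(math.log(b) for b in negative_bg_probs)
    loc = sum(smooth_l1(l - g)
              for _, pred, target in positives
              for l, g in zip(pred, target))
    return (conf + alpha * loc) / n
```

Setting `alpha=1.0` matches the balance weight reported in the text.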


4 Experiments

4.1 Experiment settings

Training datasets   First, we use AFLW to fine-tune the original SSD for the face detection task. AFLW contains 25,993 annotated faces in real-world images. We choose 18,303 correctly annotated faces from AFLW, randomly select 80% of them as the training set, and reserve the remainder as the validation set. As stated in Section 3.1, because of the distribution of AFLW and the weak capability of SSD for detecting small objects, the fine-tuned SSD does not have sufficient capability to detect faces in extreme conditions. Second, we consider another dataset, WIDER FACE, which contains 32,203 images and 393,703 annotated faces, more than 50% of which are extremely small. The dataset is split into training (40%), validation (10%) and testing (50%) sets. It is very suitable for training a scale-variant face detector without image pyramids on extreme scenes.

Testing datasets   To evaluate the effectiveness of our network, we verify FHEDN on two public face detection benchmarks. One is the Face Detection Data Set and Benchmark (FDDB)[12], which contains 2,845 images with a total of 5,171 annotated faces, including occlusions, difficult poses and low resolutions. The other is the WIDER FACE validation and test sets. On FDDB, we follow the standard evaluation protocol, using receiver operating characteristic (ROC) curves with two metrics: discontinuous score and continuous score. On the WIDER FACE validation and test sets, we use average precision (AP) as the evaluation metric.
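
For readers unfamiliar with the metric, average precision can be computed as the area under the precision-recall curve of the ranked detections. The sketch below uses all-point interpolation in the style of PASCAL VOC; whether the official WIDER FACE toolkit integrates the curve in exactly this way is not stated here, so treat the interpolation scheme and the function name as assumptions.

```python
def average_precision(scores, is_true_positive, num_faces):
    """Area under the precision-recall curve of detections ranked by score."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    tp = fp = 0
    precisions, recalls = [], []
    for k in order:
        if is_true_positive[k]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_faces)
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

Here `is_true_positive[k]` records whether detection `k` was matched to a previously unmatched annotated face, and `num_faces` is the number of annotated faces in the evaluated subset.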

Experiment platform   We implement our experiments in the Caffe framework[13], applying and modifying some of the source code provided by [20] to suit our task. We try various hyperparameters for training with the Stochastic Gradient Descent (SGD) algorithm. The size of each input image in a batch is set . The network is trained on an NVIDIA Tesla P40 leased from a cloud computing server, with a total of 14 images per mini-batch. Weight decay is 0.00001 and momentum is 0.9. The initial learning rate is set to 0.01 and is divided by 10 at 40480 iterations and again at 70000 iterations, for a total of 80000 iterations.
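
The learning rate schedule described above is a multi-step decay. A minimal sketch of it, with a hypothetical function name and the reported values as defaults, is:

```python
def learning_rate(iteration, base_lr=0.01, steps=(40480, 70000), gamma=0.1):
    """Multi-step schedule: multiply the base rate by gamma at every step already reached."""
    drops = sum(1 for step in steps if iteration >= step)
    return base_lr * gamma ** drops
```

Under this schedule, training runs at 0.01 up to iteration 40479, at 0.001 up to iteration 69999, and at 0.0001 for the remaining iterations.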

4.2 Implement detail

Influence of scale

In this section, we analyze the influence of the scale of the feature maps in different hierarchies on detection performance. Taking the VGG16 backbone as an example, Figure 6 shows the output feature maps that are fused with the previous hierarchy to model context. As illustrated in Figure 6(a), the output feature maps of conv3_3, conv4_3 and fc7 are computed without a normalization operation, and their value ranges differ from those of conv6_2 and conv7_2. In the feature fusion stage, a feature map with a large value range will therefore overwhelm one with a small value range. Unlike [21], which uses an L2 normalization technique, we attach a batch normalization layer to each of these layers before fusing to solve this problem. The parameters of the batch normalization layers are learned by the network without manual setting. Figure 6(c) and (d) show the training loss curves without and with normalization. They demonstrate that the normalization operation keeps the training process more stable.
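
The effect of normalizing before fusion can be seen on a toy example. The sketch below is a simplification: it normalizes a flat list of activations from one channel, whereas a real batch normalization layer computes statistics over the batch and spatial dimensions and learns the scale and shift. The activation values are hypothetical.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize activations to zero mean and unit variance, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


large = [120.0, 80.0, 100.0, 60.0]  # hypothetical activations with a large value range
small = [0.3, 0.1, 0.2, 0.0]        # hypothetical activations with a small value range

# Without normalization the sum is dominated by the large-range map.
fused_raw = [a + b for a, b in zip(large, small)]
# After normalization both maps contribute on a comparable scale.
fused_bn = [a + b for a, b in zip(batch_norm(large), batch_norm(small))]
```

In `fused_raw` the small-range map changes each value by well under one percent, while in `fused_bn` the two normalized maps have nearly identical magnitudes.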

(a) Before normalized histogram of feature maps
(b) After normalized histogram of feature maps
(c) Without normalized training loss curve
(d) With normalized training loss curve
Figure 6: The influence of normalization

Influence of receptive field

The receptive field (RF) size is a crucial issue in computer vision because the receptive field needs to be large enough to capture information about the covered region. The theoretical receptive field (TRF) can be computed from the designed network architecture. Zhou et al.[53] introduce the concept of the empirical receptive field via a data-driven approach and demonstrate that the actual size of the RF is much smaller than the TRF. Luo et al.[23] adopt a mathematical model to analyze how the empirical receptive field relates to the TRF and put forward the effective receptive field, pointing out that not all pixels in the TRF contribute equally to an output unit's response, so that the effective receptive field occupies only a fraction of the TRF. As stated in Section 3.2.2, the prediction module contains several components, one of which is the key part that generates default boxes. If the size of a generated default box does not suit the receptive field, the features extracted over the corresponding receptive field do not represent the default box, and the trained model has a weak representational capacity. Wei et al.[44] analyze the effective receptive field within the framework of object detection and provide an algorithm to calculate the effective receptive field sizes of a standard VGG16 network. We use the algorithm provided in [44] to design default boxes and solve the size problem described above. Taking the eltw13 layer, which detects small faces, as an example, we use formula (1) to compute the default box size , whose range can cover the effective receptive field size of conv3_3.
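
The theoretical receptive field mentioned above follows directly from the kernel sizes and strides of the stacked layers. The sketch below computes it for the standard VGG16 layers up to conv4_3; it covers only the TRF, not the effective receptive field algorithm of [44], and the function name is hypothetical.

```python
# (name, kernel size, stride) for the VGG16 layers up to conv4_3.
# Padding shifts the receptive field but does not change its size.
VGG16_LAYERS = [
    ("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("conv3_2", 3, 1), ("conv3_3", 3, 1), ("pool3", 2, 2),
    ("conv4_1", 3, 1), ("conv4_2", 3, 1), ("conv4_3", 3, 1),
]


def theoretical_rf(layers):
    """Return the theoretical receptive field size (in input pixels) after each layer."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent output units
    sizes = {}
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes[name] = rf
    return sizes
```

For these layers the result is 40 pixels at conv3_3 and 92 pixels at conv4_3, which coincide with the 40px and 92px face-size boundaries used in Section 3.1.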

In addition, all images in WIDER FACE have the same width of 1024px but various heights. When we resize the original image to the network input size for training, the original aspect ratio of the annotated faces changes. Therefore, the default boxes should be adjusted to reduce the effect of the distortion caused by the resizing operation. We count the top 3 percentages of face size with respect to original image size in the WIDER FACE training dataset, as listed in Table 1. For example, images which contain face size is in are approximately 1932 (, as illustrated in Figure 1(a)), 22.1640% of them are in . We find face size mainly concentrated in , from Table 1 listed. We compute that the aspect ratio in the corresponding resized image is approximately 1.5. Analogously, for images in and , we approximately estimate their new aspect ratio as 0.5. In the end, the overall new aspect ratio is {0.5,1,1.5,,}.
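
The distortion introduced by resizing can be made concrete with a short calculation. In the sketch below, the function name, the image heights and the square input size are hypothetical values chosen for illustration; only the 1024px image width comes from the text.

```python
def resized_aspect_ratio(face_w, face_h, img_w, img_h, input_w, input_h):
    """Width/height ratio of a face after the whole image is resized to the input size."""
    new_w = face_w * input_w / img_w
    new_h = face_h * input_h / img_h
    return new_w / new_h


# A square face in a tall image becomes wider than tall after resizing to a square input ...
tall = resized_aspect_ratio(20, 20, 1024, 1536, 512, 512)
# ... and a square face in a wide image becomes taller than wide.
wide = resized_aspect_ratio(20, 20, 1024, 512, 512, 512)
```

The new ratio equals the original face ratio multiplied by the image's height-to-width ratio, which is why default box aspect ratios other than 1 are needed once images are resized to a fixed input size.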

face size(px) top1 top2 top3
0<size<10 (22.1640) (8.5002) (6.9691)
10≤size<40 (14.7147) (10.1768) (4.1434)
40≤size<92 (13.4199) (11.4456) (4.2862)
92≤size<192 (10.4288) (8.1915) (3.8452)
192≤size (5.7435) (4.9042) (4.7731)
Table 1: The top 3 percent of size in the corresponding range

Other implementation details

Data augmentation   We apply data augmentation to make the proposed network more robust to various input sizes and shapes. From each original input image, we randomly sample a patch whose crop ratio is selected from 0.3, 0.5, 0.7, 0.9 or 1.0. We then require the minimum jaccard overlap with the faces to be 0.5, 0.7 or 0.9, which helps to extract features of some occluded faces. After random cropping, the sampled patch is horizontally flipped with a probability of 0.5 and some photometric distortions are applied.
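
The cropping and flipping steps can be sketched as below. Several details are assumptions rather than statements from the text: the crop ratio is applied to each side of the image, a patch is accepted if at least one face reaches the chosen jaccard overlap with the patch (the rule used by the SSD[20] sampler), the sampler falls back to the whole image after 50 trials, and all names are hypothetical. Photometric distortions are omitted.

```python
import random

CROP_RATIOS = (0.3, 0.5, 0.7, 0.9, 1.0)
MIN_OVERLAPS = (0.5, 0.7, 0.9)


def jaccard(a, b):
    """Jaccard overlap of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def sample_patch(img_w, img_h, rng):
    """Sample a patch whose sides are a random crop ratio of the image sides."""
    ratio = rng.choice(CROP_RATIOS)
    w, h = img_w * ratio, img_h * ratio
    x0 = rng.uniform(0, img_w - w)
    y0 = rng.uniform(0, img_h - h)
    return (x0, y0, x0 + w, y0 + h)


def sample_valid_patch(img_w, img_h, faces, rng, max_trials=50):
    """Retry until some face overlaps the patch enough; otherwise keep the whole image."""
    min_overlap = rng.choice(MIN_OVERLAPS)
    for _ in range(max_trials):
        patch = sample_patch(img_w, img_h, rng)
        if any(jaccard(patch, face) >= min_overlap for face in faces):
            return patch
    return (0.0, 0.0, float(img_w), float(img_h))


def flip_boxes(boxes, width):
    """Horizontally flip (xmin, ymin, xmax, ymax) boxes inside an image of the given width."""
    return [(width - xmax, ymin, width - xmin, ymax) for xmin, ymin, xmax, ymax in boxes]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible; the flip would be applied when `rng.random() < 0.5`.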
Online hard example mining   Since there are many more negative default boxes than positive ones, there is a significant imbalance between the positive and negative training examples. For faster and more stable optimization, we apply the online hard example mining (OHEM)[36] technique to resample hard examples during training. After OHEM, the positive default boxes with the lowest scores and the negative default boxes with the highest scores are randomly selected so that the ratio between negatives and positives is at most 3:1.
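
The 3:1 constraint can be illustrated with a simplified selection rule in the style of SSD[20] hard negative mining: keep all positives and only the negatives that the current model most confidently mistakes for faces. This is a sketch with hypothetical names, not the OHEM procedure of [36], and it leaves out the random selection and the treatment of low-scoring positives mentioned above.

```python
def mine_hard_negatives(face_scores, is_positive, max_ratio=3):
    """Keep all positives and at most max_ratio times as many of the hardest negatives."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # A negative box with a high face score is a hard example.
    negatives.sort(key=lambda i: face_scores[i], reverse=True)
    return positives, negatives[:max_ratio * len(positives)]
```

Only the returned indices would then contribute to the confidence loss for the current mini-batch.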

4.3 Results

We evaluate our network against state-of-the-art face detection methods on two benchmark datasets: FDDB and WIDER FACE.

FDDB results

We divide FDDB into 10 folds for performance evaluation and accumulate the detection results to generate the Receiver Operating Characteristic (ROC) curves. Besides, because FDDB adopts elliptical region annotations, and for a fairer comparison under the continuous score evaluation, we transform the predicted bounding boxes to the FDDB annotation style with the toolbox provided by [12]. The results compared with other state-of-the-art methods[17, 18, 38, 30, 51, 11, 52, 48, 43] are shown in Figure 7.

Figure 7: Evaluation on FDDB benchmark

WIDER FACE results

We evaluate and compare our method with top-performing methods[45, 46, 51, 28, 54, 11, 26, 52, 48] on the validation and test sets of WIDER FACE. Both sets are divided into three levels (Easy, Medium and Hard subsets) according to difficulty settings that correlate with face scale. The precision-recall curves are shown in Figure 8 and the AP values in Table 2. These results show that the proposed method can detect small and hard faces in unconstrained scenes, although a gap to the best-performing methods remains.

Algorithms Easy Medium Hard
ACF-WIDER[45] 65.9% 54.1% 27.3%
Faceness[46] 71.6% 60.4% 31.5%
MTCNN[51] 85.1% 82.0% 60.7%
LDCF+[28] 79.0% 76.9% 52.3%
CMS-RCNN[54] 89.9% 87.4% 77.2%
HR[11] 92.3% 91.0% 81.9%
SSH[26] 93.1% 92.1% 84.5%
S3FD[52] 93.7% 92.5% 85.9%
ScaleFace[48] 86.8% 86.7% 77.2%
FHEDN(ours) 87.1% 83.1% 63.4%
Table 2: Evaluation on WIDER FACE compared with the state of the art
(a) Val: Easy
(b) Val: Medium
(c) Val: Hard
Figure 8: Precision recall curves on WIDER FACE validation and test set

5 Conclusion

In this work, we have designed an effective end-to-end network with a multi-scale feature hierarchy to detect faces in unconstrained scenes. First, we fine-tune SSD as a face detector on the AFLW dataset and analyze its shortcomings in detecting small, blurry and occluded faces. Second, we design a feature hierarchy network named FHEDN to improve detection performance, which fuses contextual semantic information from deeper feature hierarchies. Last but not least, we statistically analyze several implementation details and find solutions that further improve the performance of the proposed method. Although there is still a gap between our method and state-of-the-art ones, our network has considerable room for improvement.

Figure 9: Some results detected by our FHEDN in FDDB

Figure 10: Some results detected by our FHEDN in WIDERFACE


  1. K.Yang A.Newell and J.Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 438–499, 2016.
  2. Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pages 354–370, 2016.
  3. Dong Chen, Shaoqing Ren, Yichen Wei, Xudong Cao, and Jian Sun. Joint cascade face detection and alignment. In European Conference on Computer Vision, pages 109–122, 2014.
  4. Pedro F. Felzenszwalb, Ross B. Girshick, David Mcallester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627, 2010.
  5. Cheng Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
  6. Christophe Garcia and Manolis Delakis. A neural architecture for fast and robust face detection. 2(11):44–47 vol.2, 2002.
  7. Ross Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
  8. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, 2015.
  9. K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904, 2015.
  10. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2015.
  11. Peiyun Hu and Deva Ramanan. Finding tiny faces. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  12. Vidit Jain and Erik Learned-Miller. FDDB: A Benchmark for Face Detection in Unconstrained Settings. 2010.
  13. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Acm International Conference on Multimedia, pages 675–678, 2014.
  14. Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, and Yurong Chen. Ron: Reverse connection with objectness prior networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  15. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the Acm, 60(2):2012, 2012.
  16. Martin Köstinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In IEEE International Conference on Computer Vision Workshops, pages 2144–2151, 2012.
  17. Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015.
  18. Shengcai Liao, Anil K. Jain, and Stan Z. Li. A fast and accurate unconstrained face detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 2016.
  19. Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  20. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, 2016.
  21. Wei Liu, Andrew Rabinovich, and Alexander C. Berg. ParseNet: Looking wider to see better. In International Conference on Learning Representations, 2016.
  22. Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017.
  23. Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2016.
  24. Markus Mathias, Rodrigo Benenson, Marco Pedersoli, and Luc Van Gool. Face detection without bells and whistles. 8692(19):720–735, 2014.
  25. Mahyar Najibi, Mohammad Rastegari, and Larry S. Davis. G-CNN: An iterative grid based object detector. In Computer Vision and Pattern Recognition, pages 2369–2377, 2016.
  26. Mahyar Najibi, Pouya Samangouei, Rama Chellappa, and Larry Davis. SSH: Single stage headless face detector. In IEEE International Conference on Computer Vision, 2017.
  27. Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
  28. Eshed Ohn-Bar and Mohan M. Trivedi. To boost or not to boost? On the limits of boosted trees for object detection. In International Conference on Pattern Recognition, 2017.
  29. Margarita Osadchy, Yann Le Cun, and Matthew L Miller. Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8(1):1197–1215, 2004.
  30. Hongwei Qin, Junjie Yan, Xiu Li, and Xiaolin Hu. Joint training of cascaded CNN for face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3456–3465, 2016.
  31. Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886, 2012.
  32. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  33. S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
  34. H. A. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, page 963, 1998.
  35. Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations, 2013.
  36. Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  37. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  38. Máté Szarvas, Akira Yoshizawa, Munetaka Yamamoto, and Jun Ogata. Multi-view face detection using deep convolutional neural networks. In International Conference on Multimedia Retrieval, pages 643–650, 2015.
  39. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2014.
  40. J. R. Uijlings, K. E. Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  41. R. Vaillant, C. Monrocq, and Y. Le Cun. Original approach for the localisation of objects in images. IEE Proceedings - Vision, Image and Signal Processing, 141(4):245–250, 1993.
  42. Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, pages I-511–I-518, 2003.
  43. Yitong Wang, Xing Ji, Zheng Zhou, Hao Wang, and Zhifeng Li. Detecting faces using region-based fully convolutional networks. arXiv:1709.05256, 2017.
  44. Wei Xiang, Dong Qing Zhang, Vassilis Athitsos, and Heather Yu. Context-aware single-shot detector. arXiv preprint arXiv:1707.08682, 2017.
  45. Bin Yang, Junjie Yan, Zhen Lei, and Stan Z Li. Aggregate channel features for multi-view face detection. In International Joint Conference on Biometrics, 2014.
  46. Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. From facial parts responses to face detection: A deep learning approach. In IEEE International Conference on Computer Vision, pages 3676–3684, 2016.
  47. Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. WIDER FACE: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  48. Shuo Yang, Yuanjun Xiong, Chen Change Loy, and Xiaoou Tang. Face detection through scale-friendly deep convolutional networks. arXiv preprint arXiv:1706.02863, 2017.
  49. Stefanos Zafeiriou, Cha Zhang, and Zhengyou Zhang. A survey on face detection in the wild. Elsevier Science Inc., 2015.
  50. Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833, 2014.
  51. Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
  52. Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li. SFD: Single shot scale-invariant face detector. In IEEE International Conference on Computer Vision, 2017.
  53. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. In International Conference on Learning Representations, 2015.
  54. Chenchen Zhu, Yutong Zheng, Khoa Luu, and Marios Savvides. CMS-RCNN: Contextual multi-scale region-based CNN for unconstrained face detection. Deep Learning for Biometrics, 2017.