End-to-End Spatial Transform Face Detection and Recognition
Many face detection and recognition methods have been proposed over the past decades, with encouraging results. The common face recognition pipeline consists of four stages: 1) face detection, 2) face alignment, 3) feature extraction, and 4) similarity calculation, which are separate and independent from each other. These separated analysis stages introduce redundant computation and make end-to-end training difficult. In this paper, we propose a novel end-to-end trainable convolutional network framework for face detection and recognition, in which a geometric transformation matrix is learned directly to align the faces, instead of predicting facial landmarks. During training, our single CNN model is supervised only by face bounding boxes and personal identities, which are publicly available from the WIDER FACE dataset and the CASIA-WebFace dataset. Tested on the Face Detection Dataset and Benchmark (FDDB) and the Labeled Faces in the Wild (LFW) dataset, we achieve 89.24% recall for the face detection task and 98.63% verification accuracy for the face recognition task simultaneously, which is comparable to state-of-the-art results.
As the fundamental stage of facial analysis tasks such as face recognition, age and gender recognition, emotion recognition and face transformation, face detection is an important and classical problem in computer vision and pattern recognition. Since the real-time object detection framework proposed by Viola and Jones, many face detection methods have been proposed. Face detection suffers from many challenges such as illumination, pose, rotation and occlusion. In the past, these challenges were addressed by combining different models or using different hand-crafted features. In recent years, the Convolutional Neural Network (CNN) has demonstrated its power in computer vision tasks and achieved higher performance [22, 23].
Common face recognition methods use faces with known identities to train a classification network and take an intermediate layer as the feature representation. In the wild, human faces are not always frontal, so it is important to extract spatially invariant features from face patches with large pose variation. Almost all existing methods use a facial landmark predictor [25, 24, 38, 40, 5, 31] to locate the facial landmarks and then perform face alignment by fitting the geometric transformation between the predicted landmarks and a set of pre-defined landmarks.
The common pipeline for face recognition consists of: 1) face detection, 2) face alignment, 3) feature extraction, and 4) similarity calculation, which are separate and independent from each other. Many methods [16, 18, 29] focus on how to efficiently extract features from face patches that make intra-class distances smaller and inter-class distances larger in the feature space. Different loss functions [39, 35] have been proposed for this task.
These separated analysis stages introduce redundant computation and make end-to-end training difficult. Since it has been shown that joint learning can boost the performance of the individual tasks, for example jointly learning face detection and alignment, many multi-task methods for facial analysis have been proposed [38, 20, 21].
In this paper, a novel end-to-end trainable convolutional network framework for face detection and recognition is proposed. Based on the Faster R-CNN framework, the proposed method benefits from its strong object detection ability. In the proposed framework, the facial landmark prediction and alignment stages are replaced by a Spatial Transformer Network (STN), in which a geometric transformation matrix is learned directly to align the faces, instead of predicting facial landmarks. Compared with a facial landmark prediction network, the STN is smaller and more flexible for almost any feature, which makes the network end-to-end trainable; moreover, the face detection and recognition tasks can share the common lower-level features to reduce unnecessary extra feature computation. This end-to-end network improves performance and is easy to extend to multi-task problems.
This paper makes the following contributions:
A novel convolutional network framework that is end-to-end trainable for simultaneous face detection and recognition is proposed. In the framework, the STN is used for face alignment; it is trainable and requires no supervision from labeled facial landmarks.
In the proposed framework, the detection part, the recognition part and the STN share common lower-level features, which makes the model smaller and reduces unnecessary computation.
The single CNN model is supervised only by face bounding boxes and personal identities, which are publicly available from the WIDER FACE dataset and the CASIA-WebFace dataset. Tested on the Face Detection Dataset and Benchmark (FDDB) and the Labeled Faces in the Wild (LFW) dataset, the model achieves 89.24% recall on FDDB and 98.63% verification accuracy on LFW, which is comparable to state-of-the-art results.
2 Related Work
Face detection has developed rapidly over the past decades. In 2001, Viola and Jones first proposed a cascade AdaBoost framework using Haar-like features, making face detection real-time. In recent years, the Convolutional Neural Network has shown its power in computer vision and pattern recognition, and many CNN-based object detection methods have been proposed [17, 22, 23, 3, 7, 26, 6]. Ren et al. improved the region-proposal-based CNN approach and proposed the Faster R-CNN framework, which introduced the anchor mechanism and turned region proposal into a CNN classification problem that could be trained within the whole network. The end-to-end trainable Faster R-CNN network is faster and more powerful, achieving 73% mAP on the VOC2007 dataset with the VGG net. [41, 13, 30] applied the Faster R-CNN framework to the face detection problem and achieved promising results.
Most face recognition methods use aligned faces as input; it has been shown that adopting alignment in the test stage can yield about a 1% improvement in recognition accuracy on the LFW dataset. The usual approach to face alignment is to predict facial landmarks, such as the eyes, nose and mouth, from the detected face patches, and then apply to the patches the geometric transformation fitted between the predicted landmark positions and a set of pre-defined landmarks. The aligned faces with known identities are then fed into deep networks and classified by the last classification layer to train discriminative feature extractors, with an intermediate bottleneck layer taken as the representation. Wen et al. proposed a new supervision signal for face recognition, called Center Loss, which simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. Differently, FaceNet uses a deep convolutional network to directly optimize the embedding itself, rather than an intermediate bottleneck layer. It uses triplets of matching and non-matching face patches generated by a novel online triplet mining method and achieved state-of-the-art face recognition performance using only 128 bytes per face; in their experiments, face alignment further boosted the accuracy.
These methods achieve high performance on detection and recognition benchmarks, but each focuses on a single task. Joint learning of face detection and alignment first appeared in the work of Chen et al., which showed that these two related tasks can benefit from each other. Zhang et al. adopted a cascaded structure with three stages of carefully designed deep convolutional networks to predict face and landmark locations simultaneously. Ranjan et al. proposed the HyperFace method for simultaneous face detection, facial landmark localization, head pose estimation and gender recognition from a given image, but it did not include a face recognition task. A follow-up all-in-one network was proposed for learning different facial tasks including face recognition, but the input faces were aligned using HyperFace, so the pipeline was not end-to-end trainable from detection to recognition due to the separate face alignment stage.
Jaderberg et al. introduced a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. It can directly learn the best geometric transformation of the input feature maps and transform the features to make them robust to rotation, scale and translation. The STN regresses the transformation parameters directly and is supervised only by the final recognition objectives.
3 Proposed Method

A novel end-to-end trainable convolutional network framework for face detection and recognition is proposed, in which a geometric transformation matrix is learned directly by an STN to align the faces, instead of predicting facial landmarks. Compared with a facial landmark prediction network, the STN is smaller and more flexible for almost any feature, which makes the network end-to-end trainable; moreover, the face detection and recognition tasks can share the low-level features and reduce unnecessary extra feature computation. The detection part of the proposed architecture is based on the Faster R-CNN framework, a state-of-the-art object detector, and the recognition part is based on a simplified ResNet. The rest of this section is organized as follows: Sec. 3.1 illustrates the whole architecture, Sec. 3.2 describes the detection part, Sec. 3.3 introduces the STN, and Sec. 3.4 describes the recognition part.
3.1 Overall Architecture

Typically, a face recognition system requires cropped face patches as its input, produced by a pre-trained face detector; a deep CNN then processes the cropped faces to obtain discriminative features for the recognition task. The common pipeline of a face recognition system includes the following stages: 1) face detection, 2) face alignment, 3) feature extraction, and 4) similarity calculation. These separated stages introduce redundant computation and make end-to-end training difficult. In this work, the face detection task and the face recognition task share the same lower-level features. By using the STN for face alignment, the model becomes end-to-end trainable, because the STN's gradients can be calculated and backward computation is possible.
3.2 Detection Part
Similar to the Faster R-CNN framework, we use the VGG-16 network as the pre-trained model for face detection. It is pre-trained on ImageNet for image classification, so we benefit from its discriminative image representations for different object categories. The region proposal network (RPN), a fully convolutional network, uses the convolutional feature maps of the VGG network to predict bounding boxes and objectness scores for the anchors. The anchors are defined with different scales and aspect ratios to obtain translation invariance. The RPN outputs a set of proposal regions that have a high probability of containing the targets.
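The anchor mechanism described above can be sketched in a few lines of NumPy. The base size, scales and aspect ratios below are illustrative assumptions in the style of Faster R-CNN; the paper does not list the exact values it uses.

```python
import numpy as np

def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate anchor boxes (x1, y1, x2, y2) centered on one base cell.

    Each anchor keeps the base area multiplied by scale**2, while its
    aspect ratio (height / width) is set by `ratio`.
    """
    cx = cy = (base_size - 1) / 2.0        # center of the base cell
    anchors = []
    for ratio in ratios:
        for scale in scales:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)       # ratio = h / w
            h = w * ratio
            anchors.append([cx - (w - 1) / 2, cy - (h - 1) / 2,
                            cx + (w - 1) / 2, cy + (h - 1) / 2])
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # (9, 4)
```

At every position of the shared feature map, the RPN scores and refines one box per anchor, so this 3 x 3 grid of scales and ratios yields nine candidate boxes per location.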
In the original Faster R-CNN framework, the ROI-Pooling operator pools the features of the proposal regions, extracted from the convolutional feature maps, into a fixed size. The following fully connected layers then predict the bounding box offsets and the scores of the proposal regions. In the proposed method, a spatial transformer network (STN) is inserted between the ROI-Pooling layer and the fully connected layers to transform the region features. The STN learns a transformation matrix that makes the input features spatially invariant. The STN is flexible in that it can be inserted after any convolutional layer without noticeable cost in either training or testing. In the detection network, the STN uses the shared features of the proposal regions to regress the transformation matrix parameters, and a transform operation applies the predicted parameters to the input features. The transformed features are then passed to the classification and regression layers.
3.3 Spatial Transformer Network
The Convolutional Neural Network shows surprising performance for feature extraction but is still not inherently spatially invariant to the input data. For face analysis problems, the input faces can be collected under a variety of conditions, such as different scales and rotations. To address this, some methods use large amounts of training data to cover the different conditions; however, almost all methods use a facial landmark predictor to locate the facial landmarks and perform face alignment by fitting the geometric transformation between the predicted landmark positions and the pre-defined landmarks. The Spatial Transformer Network, proposed by DeepMind, allows the spatial manipulation of data within the network. It can learn invariance to translation, scale, rotation and more generic warping from the feature map itself, without any extra training supervision. Because the STN's gradients can be calculated and backward computation is possible, the whole framework becomes end-to-end trainable. Our experiments show that the spatial transformer network can learn transformation invariance and reduce unnecessary computation.
For the input feature map $U$, each point can be regarded as the transformed result of a point in the aligned feature map under a transformation $A_\theta$. The parameters $\theta$ take different forms for different transformation types. For the affine transformation used in our method, $A_\theta$ is a $2 \times 3$ matrix:
\[
A_\theta = \begin{bmatrix} \cos\alpha & -\sin\alpha & t_x \\ \sin\alpha & \cos\alpha & t_y \end{bmatrix} \tag{1}
\]
where $\alpha$ indicates the rotation angle, and $t_x$, $t_y$ indicate the translation. The pixels in the input feature map are the source points, denoted as $(x_i^s, y_i^s)$, and the pixels in the output feature map are the target points, denoted as $(x_i^t, y_i^t)$. Eq. (2) shows the pointwise 2D affine transformation:
\[
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \tag{2}
\]
The sampling kernel is applied on each channel of the input feature map $U$ to obtain the corresponding pixel value in the output feature map $V$. Here we use the bilinear sampling kernel, so the process can be written as Eq. (3):
\[
V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|) \tag{3}
\]
To allow the loss to backpropagate through the sampling mechanism, the gradient with respect to the sampling coordinates is given by Eq. (4):
\[
\frac{\partial V_i^c}{\partial x_i^s} = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max(0, 1 - |y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases} \tag{4}
\]
and similarly for $\partial V_i^c / \partial y_i^s$.
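The affine mapping of Eq. (2) and the bilinear sampling of Eq. (3) can be sketched together in NumPy. The normalization of grid coordinates to [-1, 1] follows the Spatial Transformer paper; the feature-map size is illustrative.

```python
import numpy as np

def affine_grid(theta, H, W):
    """Map each target pixel (x_t, y_t) to source coordinates via the
    2x3 matrix theta, with coordinates normalized to [-1, 1]."""
    xs = np.linspace(-1, 1, W)
    ys = np.linspace(-1, 1, H)
    xt, yt = np.meshgrid(xs, ys)
    target = np.stack([xt, yt, np.ones_like(xt)], axis=-1)  # (H, W, 3)
    return target @ theta.T                                  # (H, W, 2) source coords

def bilinear_sample(U, grid):
    """Bilinear kernel: V_i = sum_nm U_nm * max(0,1-|x_s-m|) * max(0,1-|y_s-n|)."""
    H, W = U.shape
    # convert normalized coordinates back to pixel indices
    x = (grid[..., 0] + 1) * (W - 1) / 2
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    return (U[y0, x0] * (1 - wx) * (1 - wy) + U[y0, x1] * wx * (1 - wy)
            + U[y1, x0] * (1 - wx) * wy + U[y1, x1] * wx * wy)

# the identity transform should leave the feature map unchanged
theta = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
U = np.arange(16.0).reshape(4, 4)
V = bilinear_sample(U, affine_grid(theta, 4, 4))
```

Since every operation here is differentiable almost everywhere (Eq. (4)), deep-learning frameworks can backpropagate through the sampling step, which is what makes the STN trainable without landmark supervision.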
3.4 Recognition Part
After the detection boxes are obtained, the filtered boxes are fed into the recognition part. Another spatial transformer network is added before the recognition part to align the detected faces. As before, the ROI-Pooling operation extracts the features within the detection boxes from the shared feature maps. The STN predicts the transformation parameters and applies the transformation to the region features. The whole network is end-to-end trainable and is supervised only by face bounding boxes and person identities from publicly available datasets.
The architecture of the proposed method is shown in Fig. 1. The VGG-16 feature extractor includes 13 convolutional layers and outputs 512 feature maps. The RPN is a fully convolutional network that outputs the candidate proposal regions. The ROI-Pooling operator then extracts the features of the proposal regions from the feature maps and resizes them to 7x7, the same size as the corresponding convolutional output of the pre-trained VGG model. The spatial transformer network for detection consists of a convolution layer, a pooling layer, and a fully connected layer that regresses the transformation parameters. Four subsequent fully connected layers classify the proposal regions and regress the bounding boxes; a Softmax loss layer and an L2 loss layer supervise the detection training. The spatial transformer network for recognition likewise consists of a convolution layer, a pooling layer, and a fully connected layer initialized in the same way.
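The exact initialization of the STN's fully connected layer is not given above. A common choice, following the Spatial Transformer paper, is to initialize the weights to zero and the bias to the flattened identity transform, so that training starts from "no transformation"; the sketch below assumes that convention (the input dimension is hypothetical).

```python
import numpy as np

def init_localization_fc(in_dim):
    """Initialize the STN localization FC layer so it outputs the identity
    affine transform at the start of training: zero weights, identity bias."""
    W = np.zeros((6, in_dim))                     # features have no influence yet
    b = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])  # flattened 2x3 identity matrix
    return W, b

features = np.random.randn(128)
W, b = init_localization_fc(128)
theta = (W @ features + b).reshape(2, 3)          # identity transform initially
```

Starting from the identity keeps the early training signal identical to a network without an STN, and lets the transformation parameters drift away from the identity only as the recognition loss demands it.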
As the Residual Network (ResNet) has shown its generalization ability on many machine learning problems, we use a ResNet to extract discriminative features of the faces; thanks to feature sharing, the ResNet can be simplified. It produces a 512-dimensional feature vector that captures stable individual characteristics. Inspired by the Center Loss, we combine the Center Loss function with the Softmax loss to jointly learn discriminative features for recognition.
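The Center Loss term can be illustrated with a small NumPy sketch. This follows the formulation of Wen et al. in simplified form; the center-update rule here averages over the batch and is a sketch, not the exact implementation used in the paper.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Center loss: half the mean squared distance between each deep
    feature and the center of its class."""
    diffs = features - centers[labels]                 # (N, D)
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

def update_centers(features, labels, centers, alpha=0.5):
    """Move each class center toward the mean of its batch features."""
    new_centers = centers.copy()
    for c in np.unique(labels):
        mask = labels == c
        delta = np.mean(centers[c] - features[mask], axis=0)
        new_centers[c] = centers[c] - alpha * delta
    return new_centers
```

In training, this term is added to the Softmax loss with a small weight: the Softmax loss keeps classes separable while the center term pulls features of the same identity together, which is exactly the intra-class compactness the recognition part needs.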
4 Experiments

We designed several experiments to demonstrate the effectiveness of the proposed method. For feature sharing, we share different convolutional features and compare the loss curves, face recognition accuracy and testing time. Finally, we compare with other methods on face detection recall and face recognition accuracy.
4.1 Implementation Details
The whole pipeline was written in Python, and the CNN was trained on a TITAN X (Pascal) GPU using the Caffe framework. Due to the different training targets, the two datasets are assigned different loss_weight values for the backward pass, as shown in Table 1. The detection part was trained alone in the first iterations to produce accurate detection results, the detection and recognition parts were then trained jointly, and in the last iterations we fixed the detection network parameters and trained only the recognition network. The training process is shown in Fig. 2.
Table 1: The loss_weight values assigned to each dataset during backward propagation.

|WIDER FACE|1.0|1.0|0.0|
|CASIA WebFace|0.0|0.5|1.0|
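Table 1's per-dataset weighting can be sketched as a simple lookup. The meaning of the three weight columns is not spelled out in the text, so the branch names below (RPN, detection, recognition) are an assumption for illustration: WIDER FACE batches would then supervise only the detection branches, and CASIA-WebFace batches mainly the recognition branch.

```python
# Loss weights from Table 1; branch names are assumed, not from the paper.
LOSS_WEIGHTS = {
    "WIDER_FACE":    {"rpn": 1.0, "det": 1.0, "recog": 0.0},
    "CASIA_WebFace": {"rpn": 0.0, "det": 0.5, "recog": 1.0},
}

def weighted_total_loss(losses, dataset):
    """Combine the branch losses of one batch using the weights of Table 1."""
    w = LOSS_WEIGHTS[dataset]
    return sum(w[k] * losses[k] for k in w)
```

A zero weight makes the corresponding branch contribute no gradient for that batch, which is how a single network can be trained on two datasets with different annotations.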
Training the network on the WIDER FACE and CASIA WebFace datasets took a few days in total: the initial detection training, the co-training stage, and the final recognition training each took on the order of a day. The average detection time per image and the STN forward time were also measured, and the feature extraction time per face is shown in Table 3.
4.2.1 Face Detection
In this work, the WIDER FACE training set is used for training and the FDDB dataset for testing. WIDER FACE is a publicly available face detection benchmark containing 393,703 labeled faces in 32,203 images. FDDB contains annotations for 5,171 faces in a set of 2,845 images taken from the Faces in the Wild dataset.
4.2.2 Face Recognition
In this work, the model is trained on the CASIA-WebFace dataset, which contains 494,414 images of 10,575 subjects collected from the Internet, and tested on the LFW dataset, which contains more than 13,000 face images collected from the web. Increasing the number of images in a batch can boost the generalization ability of the model. In the Faster R-CNN framework, the network can take an image of any size as input, but in doing so it loses the ability to process several images simultaneously. Some methods use cropped images for training efficiency; however, to keep the ability to handle arbitrary input, images randomly selected from the CASIA WebFace dataset are stitched together. By stitching 12 images in 3 rows of 4 images each, a single training sample is formed. Examples of the stitched images are shown in Fig. 4.
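The 3 x 4 stitching described above is straightforward to sketch. The 250x250 per-image size below is an assumption for illustration; the text does not state the original image size.

```python
import numpy as np

def stitch_batch(images, rows=3, cols=4):
    """Stitch rows*cols equally-sized images into one large training sample,
    as described for the CASIA-WebFace batches."""
    assert len(images) == rows * cols
    return np.vstack([np.hstack(images[r * cols:(r + 1) * cols])
                      for r in range(rows)])

# 12 dummy 250x250 RGB images, each filled with its own index
imgs = [np.full((250, 250, 3), i, dtype=np.uint8) for i in range(12)]
sample = stitch_batch(imgs)
print(sample.shape)  # (750, 1000, 3)
```

Each stitched sample feeds 12 faces through the network at once, so a detector that only accepts one arbitrary-sized image per forward pass still sees a dozen identities per step.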
4.3 Share features
In the common face recognition pipeline, face alignment is applied to the original input face patches. Benefiting from the flexibility of the STN, which can be inserted after any convolutional layer to align the features, we can share features between the face detection task and the face recognition task. In the original detection framework, the first 5 convolution blocks extract the features for the face detection task. We designed several training experiments that share different features and tested the results on the FDDB and LFW datasets.
Table 3: LFW results and test time (ms) for the different feature-sharing settings.
Fig. 4 illustrates the feature sharing: the shared features are taken from different convolution blocks of the VGG-16 structure. For each experiment, the recognition network is simplified by cutting the corresponding convolutional layers; the deeper the shared layers are, the smaller the recognition network becomes. Fig. 5 shows the loss curves during training for different feature-sharing settings, demonstrating that sharing deeper features helps the loss converge faster. Table 3 shows the LFW results of the different models. The model sharing features up to convolution block 3 achieves 98.63% accuracy on the LFW dataset, better than using the original image patches. The accuracy decreases when sharing deeper features, because fewer convolutional layers remain in the recognition network to extract distinguishable facial features.
5 Conclusion

In this work, a novel end-to-end trainable framework for the face detection and face recognition tasks is proposed, in which a spatial transformer network aligns the feature maps without a separate face alignment stage. The face detection and face recognition networks share the low-level features, so they benefit from each other. The experiments demonstrate that the STN can replace the facial landmark prediction stage, and that sharing common features makes the loss converge faster and more accurately. Tested on the FDDB and LFW datasets, the single model achieves results comparable to the state of the art.
In the future, we intend to extend this method to other facial prediction targets and to reduce the network size to speed up both training and testing.
-  D. Chen, G. Hua, F. Wen, and J. Sun. Supervised transformer network for efficient face detection. arXiv, 2016.
-  D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In European Conference on Computer Vision, pages 109–122. Springer, 2014.
-  J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. arXiv, 2016.
-  J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 248–255, June 2009.
-  Z. Deng, K. Li, Q. Zhao, and H. Chen. Face landmark localization using a single deep network. Biometric Recognition, Jan. 2016.
-  R. Girshick. Fast R-CNN. In Proc. IEEE Int. Conf. Computer Vision (ICCV), pages 1440–1448, Dec. 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, Jan. 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
-  G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. arXiv, 2015.
-  V. Jain and E. G. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. UMass Amherst Technical Report, 2010.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv, 2014.
-  H. Jiang and E. Learned-Miller. Face detection with the faster r-cnn. arXiv, 2016.
-  H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 5325–5334, 2015.
-  S. Liao, A. K. Jain, and S. Z. Li. A fast and accurate unconstrained face detector. IEEE transactions on pattern analysis and machine intelligence, 38(2):211–223, 2016.
-  V. E. Liong, J. Lu, and G. Wang. Face recognition using deep PCA. In Proc. Communications Signal Processing 2013 9th Int. Conf. Information, pages 1–5, Dec. 2013.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. arXiv, 2015.
-  J. Lu, V. E. Liong, G. Wang, and P. Moulin. Joint feature learning for face recognition. IEEE Transactions on Information Forensics and Security, 10(7):1371–1383, July 2015.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, volume 1, page 6, 2015.
-  R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. arXiv, 2016.
-  R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. arXiv, 2016.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 779–788, June 2016.
-  J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. arXiv, 2016.
-  S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 FPS via regressing local binary features. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1685–1692, June 2014.
-  S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment via regressing local binary features. IEEE Transactions on Image Processing, 25(3):1233–1245, Mar. 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1, 2016.
-  F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 815–823, June 2015.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv, 2014.
-  A. Stuhlsatz, J. Lippel, and T. Zielke. Feature extraction with deep neural networks by a generalized discriminant analysis. IEEE Transactions on Neural Networks and Learning Systems, 23(4):596–608, Apr. 2012.
-  X. Sun, P. Wu, and S. C. H. Hoi. Face detection using deep learning: An improved faster rcnn approach. arXiv, 2017.
-  Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 3476–3483, June 2013.
-  Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1891–1898, 2014.
-  D. Triantafyllidou and A. Tefas. A fast deep convolutional neural network for face detection in big visual data. In INNS Conference on Big Data, pages 61–70. Springer, 2016.
-  P. Viola and M. Jones. Robust real-time face detection. In Proc. Eighth IEEE Int. Conf. Computer Vision. ICCV 2001, volume 2, page 747, 2001.
-  Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A Discriminative Feature Learning Approach for Deep Face Recognition, pages 499–515. Springer International Publishing, Cham, 2016.
-  S. Yang, P. Luo, C. C. Loy, and X. Tang. Wider face: A face detection benchmark. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 5525–5533, June 2016.
-  D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv, 2014.
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct. 2016.
-  X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tail. arXiv, 2016.
-  Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. Computer Vision – ECCV 2014, Jan. 2014.
-  Y. Zheng, C. Zhu, K. Luu, C. Bhagavatula, T. H. N. Le, and M. Savvides. Towards a deep learning framework for unconstrained face detection. arXiv, 2016.