Robust Facial Landmark Localization Based on Texture and Pose Correlated Initialization
Abstract
Robust facial landmark localization remains a challenging task when faces are partially occluded. Recently, cascaded pose regression has attracted increasing attention due to its superior performance in facial landmark localization and occlusion detection. However, such an approach is sensitive to initialization: an improper initialization can severely degrade the performance. In this paper, we propose a Robust Initialization for Cascaded Pose Regression (RICPR) that provides texture and pose correlated initial shapes for the testing face. By examining the correlation of local binary pattern histograms between the testing face and the training faces, the shapes of the training faces that are most correlated with the testing face are selected as the texture correlated initialization. To make the initialization more robust to various poses, we estimate the rough pose of the testing face from five fiducial landmarks located by multi-task cascaded convolutional networks. The pose correlated initial shapes are then constructed from the mean face shape and the rough pose of the testing face. Finally, the texture correlated and the pose correlated initial shapes are combined as the robust initialization. We evaluate RICPR on the challenging COFW dataset. The experimental results demonstrate that the proposed scheme achieves better performance than the state-of-the-art methods in facial landmark localization and occlusion detection.
Facial landmark localization, Cascaded pose regression, Robust initialization, Occlusion, Texture and pose correlated.
I Introduction
Facial landmark localization, i.e., localizing the facial key points (e.g., eyebrows, eyes, nose, mouth and jaw), plays an important role in many computer vision tasks, such as face detection [1], face recognition [2, 3, 4] and facial expression analysis [5, 6, 7]. In recent years, facial landmark localization has been extensively studied and has achieved remarkable performance on standard datasets and even on datasets collected in the wild [8, 9, 10, 11, 12, 13]. However, it remains difficult for faces with large variations in appearance, including pose, expression and, especially, occlusion.
Since Cascaded Pose Regression (CPR) was used to estimate facial shapes [14], shape regression in a cascaded manner has emerged as one of the most popular approaches for facial landmark localization [15, 11, 9, 16, 17, 18, 19, 20, 21, 22]. CPR and its variations typically begin with an initial shape, such as an average shape or a random shape of the training samples, and then update the shape from coarse to fine through the trained regressors. Based on CPR, Burgos-Artizzu et al. proposed Robust CPR (RCPR) [11], which is the first scheme to explicitly detect the occlusion state while estimating the locations of landmarks. They also created a popular and challenging dataset named Caltech Occluded Faces in the Wild (COFW) [11], in which most faces have occlusions. Researchers have used this dataset to study facial landmark localization under occlusions [11, 17, 20, 23, 24, 25]. Although these methods make some progress on facial landmark localization under partial occlusion, the occlusion problem is not essentially solved, and the accuracy of occlusion prediction is still unsatisfactory. Detecting the occlusion state of landmarks is important for two reasons: occluded landmarks can hardly provide information for further analysis, and they may reduce the accuracy of localizing the unoccluded landmarks.
Regression is initialization dependent: an improper initialization will degrade the performance sharply. When pose variation and occlusion appear simultaneously on a face, localization will fail if a bad initial shape is selected. As shown in Fig. 1, a bad initial shape usually leads to a failure of landmark localization and occlusion prediction. In this paper, we propose a Robust Initialization for CPR (RICPR) that avoids bad initialization by examining the texture and pose of the testing face to obtain texture correlated and pose correlated initial shapes. (The source code of the proposed scheme can be found at https://github.com/pervadepyy/robustinitializationrcpr.) Since texture is always related to occlusion, we select the initial shapes according to the texture correlation between the testing face and the training faces. We first compute Local Binary Patterns (LBP) histograms of all training faces and the testing face. Then we compute the correlation distance between the histograms of the testing face and those of each training face. We choose the shapes of the most correlated training faces as the texture correlated initialization. On the other hand, the rough pose of the testing face, represented by a rotation vector, is used to obtain the pose correlated initial shapes. More specifically, we first estimate five fiducial landmarks, namely the pupils, the tip of the nose, and the corners of the mouth, using Multi-Task Cascaded Convolutional Networks (MTCNN) [26]. From the five landmarks and a mean 3D face shape with 5 facial key points, the face pose can be obtained. Then another 3D mean face shape, represented by 29 facial key points, is projected to a set of corresponding 2D locations according to the face pose, which yields the pose correlated initial shapes.
Finally, the texture correlated initial shapes and the pose correlated initial shapes are taken together as the robust initialization for regression, as shown in Fig. 2. This initialization is more relevant to the true shape of the testing face in both location and occlusion. We evaluate RICPR on the challenging COFW dataset. The experimental results show that the Normalized Mean Error (NME) is 6.64 and the accuracy of occlusion detection is 80/54.6% precision/recall, both of which are better than those of the state-of-the-art schemes.
Some of the ideas presented in this paper were initially reported in [27]. In this paper, we report the full and new formulation and an extensive experimental evaluation of our method. The initialization now depends not only on texture correlation but also on pose correlation, and the accuracies of landmark localization and occlusion detection are further improved.
II Related Work
The works addressing facial landmark localization and face alignment can be roughly divided into two groups: holistic methods and local methods. Holistic methods regard the shape as a whole and usually align the face in an iterative or cascaded way. A typical holistic method is the Active Appearance Model (AAM) [28, 29, 30, 31]. CPR [14] is a similar method with a random fern regressor, which offers a fast and accurate way of computing the 2D shape of an object. Explicit Shape Regression (ESR) [15] and RCPR [11] extended the idea of CPR and also use pixel difference features and fern regressors. A similar method called the Supervised Descent Method (SDM) was proposed in [32]. It uses cascaded regression with fast SIFT features and solves localization using Newton-type optimization on a nonlinear least squares problem.
Since occlusions are very common in real applications of computer vision and occluded landmarks usually cannot provide information, some works address facial landmark localization and occlusion detection jointly [11, 17, 33, 20, 24]. Burgos-Artizzu et al. first proposed to detect the occlusion state while estimating landmarks in RCPR [11], where the occlusion states are applied at each iteration to obtain visually different regressors. The outputs of the regressors are merged with weights that depend on the occlusion prediction results. Considering that occlusions often cover a region, Yang et al. [17] did not rely on visibility annotation; instead, they used the consistency of the votes of a local regression forest in several over-segmented regions to obtain a confidence value for each pixel, which is called Regional Predictive Power (RPP). Compared with RCPR, RPP obtained a higher accuracy in landmark localization. Yu et al. [24] proposed a Consensus of occlusion-robust Regression method (CoR), which forms a consensus from the estimates of a set of occlusion-specific regressors. Each regressor is trained to estimate facial landmark locations under the precondition that a particular predefined region of the face is occluded. CoR improved the accuracy of occlusion detection. Liu et al. [20] proposed Cascade Regression with Adaptive Shape Model (CRASM) for robust facial landmark localization. In each iteration, the shape-indexed appearance is used to estimate the occlusion level of each landmark, and each landmark is then weighted according to its occlusion level. Moreover, the occlusion levels of the landmarks act as adaptive weights on the shape-indexed features to decrease the noise in these features, which improved the performance of facial landmark localization and occlusion detection.
Compared with the other methods, CRASM improved the performance of landmark localization and occlusion detection, obtaining an 80/48.45% precision/recall for occlusion detection and an NME of 6.68 for localization on the COFW dataset.
III The Proposed Scheme
In this section, we briefly review CPR and RCPR, and then describe in detail the proposed RICPR scheme for facial landmark localization under occlusions.
III-A Cascaded Shape Regression
The main steps of CPR [14] are summarized in Algorithm 1. It starts from a raw initial shape $S^{0}$, which is progressively refined by applying a cascade of regressors $R^{1},\dots,R^{T}$ until the final shape $S^{T}$ is estimated. At each iteration $t$, image features are calculated as $x^{t}=h^{t}(I,S^{t-1})$, where $I$ is the face image and $S^{t-1}$ is the previous iteration's shape. Based on the shape-indexed features $x^{t}$ and the regressor $R^{t}$, an update $\delta S^{t}$ is calculated. The update is combined with $S^{t-1}$ to form a new shape $S^{t}$. ESR [15] proposed some improvements over CPR, using two-level cascaded regression to strengthen the regressors. There are $K$ primitive fern regressors at each iteration, and the shape update is obtained by:
$$\delta S^{t}=\sum_{k=1}^{K}\delta S^{t}_{k} \tag{1}$$
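The coarse-to-fine refinement loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: shapes are flat lists of coordinates, the update is assumed to be additive, and `toy_features` and `toy_regressor` are hypothetical stand-ins for shape-indexed features and trained fern regressors.

```python
from typing import Callable, List, Sequence

Shape = List[float]  # flattened landmark coordinates [x1, y1, x2, y2, ...]

def cascaded_regression(image, init_shape: Sequence[float],
                        feature_fns: Sequence[Callable],
                        regressors: Sequence[Callable]) -> Shape:
    """Refine init_shape through a cascade: one (feature_fn, regressor) per stage."""
    shape = list(init_shape)
    for extract, regress in zip(feature_fns, regressors):
        features = extract(image, shape)   # shape-indexed features
        update = regress(features)         # predicted shape increment
        shape = [s + d for s, d in zip(shape, update)]  # assumed additive update
    return shape

# Hypothetical toy stage: the "image" carries the target shape, the features are
# the residual to that target, and each regressor moves halfway towards it.
def toy_features(image, shape):
    return [t - s for t, s in zip(image["target"], shape)]

def toy_regressor(features):
    return [0.5 * f for f in features]

target = [10.0, 20.0, 30.0, 40.0]
start = [0.0, 0.0, 0.0, 0.0]
result = cascaded_regression({"target": target}, start,
                             [toy_features] * 4, [toy_regressor] * 4)
print(result)
```

In the toy setting each stage halves the remaining error, so the sketch also shows why a starting shape far from the target needs more stages to reach the same accuracy.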
Burgos-Artizzu et al. [11] proposed a novel regression approach, RCPR, to handle localization under occlusions. It divides the face image into 9 zones. At each iteration $t$, the occlusion present in each of the 9 zones can be estimated by projecting the current shape estimate $S^{t-1}$ onto the image. Then, instead of training a single regressor, RCPR trains $S_{tot}$ regressors in each primitive fern regressor $k$. Moreover, each regressor is allowed to draw features only from 1 of the 9 predefined zones. Finally, the updates of the regressors are combined through weighted mean voting, where the weight is inversely proportional to the occlusion present in the zone. At the $t$-th iteration, the $k$-th update can be described as:
$$\delta S^{t}_{k}=\frac{\sum_{i=1}^{S_{tot}} w_{i}\,\delta S^{t}_{k,i}}{\sum_{i=1}^{S_{tot}} w_{i}} \tag{2}$$
where $\delta S^{t}_{k,i}$ is the update proposed by the $i$-th regressor and $w_{i}$ is its weight.
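The weighted mean voting can be illustrated with a short sketch. The exact weighting function is not given in the text; the choice `1 - occlusion` below is an assumption made only for illustration, and `weighted_mean_update` is a hypothetical helper name.

```python
def weighted_mean_update(updates, zone_occlusion, eps=1e-6):
    """Combine per-regressor updates; a regressor that draws features from a
    heavily occluded zone receives a small weight.

    updates        -- list of shape updates (lists of floats), one per regressor
    zone_occlusion -- occlusion fraction in [0, 1] of the zone each regressor uses
    """
    # Assumed weighting: w = 1 - occlusion (any decreasing function would do).
    weights = [max(1.0 - occ, eps) for occ in zone_occlusion]
    total = sum(weights)
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) / total
            for i in range(dim)]

updates = [[2.0, 0.0], [4.0, 4.0], [10.0, -8.0]]
occlusion = [0.0, 0.5, 1.0]  # the third regressor looks at a fully occluded zone
combined = weighted_mean_update(updates, occlusion)
print(combined)
```

The update from the fully occluded zone contributes almost nothing, so the combined update is close to the weighted mean of the first two.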
III-B Robust Initialization for Cascaded Pose Regression
The procedure of the proposed RICPR is illustrated in Fig. 2. First, we obtain the texture correlated initial shapes by calculating the texture correlation between the testing face and the training faces. At the same time, the pose correlated initial shapes are obtained by examining the rough pose of the testing face. Then, these initial shapes are taken together as the robust initialization for cascaded regression. We describe the two initialization methods in Subsections III-C and III-D, respectively.
III-C The Texture Correlated Initial Shapes
Since occlusion and pose variation change the appearance of a face, and a texture descriptor captures local appearance details, we can select a texture correlated initial shape that reflects the occlusion information of the testing face, rather than a random initial shape.
We propose an initialization method based on texture correlation analysis between the testing face and the training faces. The shapes of the training faces which are most correlated with the testing face are chosen as the initialization, instead of random ones. The LBP operator was originally proposed for texture analysis and is widely used in computer vision [34]. It labels an image by thresholding the neighborhood of each pixel with the centre value, as shown in Fig. 3 (a). The histogram of the labels can be used as a texture descriptor. To deal with the limitations of the basic LBP operator, the rotation-invariant LBP and the uniform LBP were proposed [35], as shown in Fig. 3 (b). In the proposed scheme, we choose the uniform LBP since it balances performance and speed. We use the notation $LBP_{P,R}^{u2}$ to denote the LBP operator, where the subscript represents using the operator in a $(P,R)$ neighborhood, i.e., sampling $P$ points on a circle of radius $R$. The superscript $u2$ means using only uniform patterns.
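Given an image, we divide the face into $m$ non-overlapping sub-blocks, as shown in Fig. 4. For each block $j$ ($j=1,\dots,m$), we use $LBP_{P,R}^{u2}$ to calculate LBP features, and a histogram of the labeled block $f_{j}(x,y)$ is then computed as:
$$h_{j}(l)=\sum_{x,y} I\{f_{j}(x,y)=l\},\quad l=0,\dots,n-1 \tag{3}$$
where $n$ is the number of labels of each block produced by the LBP operator and $I\{\cdot\}$ is defined as:
$$I\{A\}=\begin{cases}1, & A \text{ is true}\\ 0, & A \text{ is false}\end{cases}$$
Finally, the $m$ histograms are combined, yielding the histogram-matrix $H$ with a size of $m\times n$.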
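As an illustration of the histogram-matrix construction, the following pure-Python sketch computes uniform LBP labels and one histogram per block. It rests on simplifying assumptions: the 8 neighbours of the 3x3 window stand in for 8 points sampled on a circle (no interpolation), border pixels are skipped, and the function names are hypothetical.

```python
def lbp_code(img, y, x):
    """8-bit LBP code of pixel (y, x): threshold the 8 neighbours at the centre value."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= centre:
            code |= 1 << bit
    return code

def transitions(code):
    """Number of 0/1 changes when the 8 bits are read circularly."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))

# Uniform patterns (at most 2 transitions) get their own label; the rest share one.
UNIFORM = [c for c in range(256) if transitions(c) <= 2]
LABEL = {c: i for i, c in enumerate(UNIFORM)}
N_LABELS = len(UNIFORM) + 1  # 58 uniform labels + 1 shared label = 59

def block_histogram(img, y0, y1, x0, x1):
    """Histogram of uniform-LBP labels over the non-border pixels of a block."""
    hist = [0] * N_LABELS
    for y in range(max(y0, 1), min(y1, len(img) - 1)):
        for x in range(max(x0, 1), min(x1, len(img[0]) - 1)):
            hist[LABEL.get(lbp_code(img, y, x), N_LABELS - 1)] += 1
    return hist

def histogram_matrix(img, rows, cols):
    """Split img into rows x cols non-overlapping blocks; one histogram per block."""
    h, w = len(img), len(img[0])
    return [block_histogram(img, r * h // rows, (r + 1) * h // rows,
                            c * w // cols, (c + 1) * w // cols)
            for r in range(rows) for c in range(cols)]

flat = [[7] * 4 for _ in range(4)]   # toy image with constant intensity
H = histogram_matrix(flat, 1, 1)
print(len(H), len(H[0]))
```

With 8 sampling points there are 58 uniform patterns, so each block histogram has 59 bins, which matches the number of labels reported in the experimental setup.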
During testing, the histogram-matrices of the testing image and the training samples are computed using the above scheme. Note that, to save testing time, the histogram-matrices of the training samples can be computed offline in advance. Histogram-matrices are commonly compared with a histogram similarity measure, such as histogram intersection, log-likelihood or Chi-square statistics [36]. Since our work aims to select proper initial shapes for regression, and we want to pick from the training faces a few shapes that are most relevant to the testing face, we need a method to assess the correlation between the testing face and the training faces. In this paper, we choose the Pearson correlation coefficient [37] for this purpose. The Pearson correlation coefficient between the testing face histogram-matrix $H^{te}$ and each training face histogram-matrix $H^{tr}_{i}$ is calculated by:
$$\rho_{i}=\frac{\mathrm{cov}(H^{te},H^{tr}_{i})}{\sigma_{H^{te}}\,\sigma_{H^{tr}_{i}}},\quad i=1,\dots,N \tag{4}$$
where $\mathrm{cov}(\cdot,\cdot)$ is the covariance, $\sigma$ is the standard deviation and $N$ is the total number of training faces. Since the size of each histogram-matrix is $m\times n$ ($m$ blocks by $n$ labels), $\rho_{i}$ can be calculated as:
$$\rho_{i}=\frac{\sum_{a=1}^{m}\sum_{b=1}^{n}\left(H^{te}(a,b)-\bar{H}^{te}\right)\left(H^{tr}_{i}(a,b)-\bar{H}^{tr}_{i}\right)}{\sqrt{\left(\sum_{a=1}^{m}\sum_{b=1}^{n}\left(H^{te}(a,b)-\bar{H}^{te}\right)^{2}\right)\left(\sum_{a=1}^{m}\sum_{b=1}^{n}\left(H^{tr}_{i}(a,b)-\bar{H}^{tr}_{i}\right)^{2}\right)}}$$
where $\bar{H}^{te}$ and $\bar{H}^{tr}_{i}$ are the mean values of the matrices $H^{te}$ and $H^{tr}_{i}$, respectively. The correlation coefficient is then used to calculate the correlation distance $d_{i}$:
$$d_{i}=1-\rho_{i} \tag{5}$$
A smaller $d_{i}$ means that the training face is more correlated with the testing face. We choose the most correlated faces from the training faces and select their shapes as the initial shapes for the testing face. The main procedure of the initialization based on texture correlation is presented in Algorithm 2. Fig. 5 compares random initialization [11] with texture correlation based initialization. The proposed texture correlation based initialization usually obtains more accurate initial shapes, which improves the accuracy of landmark localization.
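The selection step can be sketched as follows. This is a minimal illustration, assuming histogram-matrices are given as nested lists; `select_initial_shapes` and the parameter `k` (the number of shapes to keep) are hypothetical names, and treating a constant matrix as uncorrelated is an assumed convention.

```python
import math

def flatten(matrix):
    return [v for row in matrix for v in row]

def correlation_distance(a, b):
    """1 - Pearson correlation between two equally sized histogram-matrices."""
    x, y = flatten(a), flatten(b)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    if sx == 0 or sy == 0:
        return 1.0  # assumed convention: a constant matrix counts as uncorrelated
    return 1.0 - cov / (sx * sy)

def select_initial_shapes(test_hist, train_hists, train_shapes, k):
    """Return the shapes of the k training faces with the smallest distance."""
    order = sorted(range(len(train_hists)),
                   key=lambda i: correlation_distance(test_hist, train_hists[i]))
    return [train_shapes[i] for i in order[:k]]

test_hist = [[1, 2], [3, 4]]
train_hists = [[[4, 3], [2, 1]],   # anti-correlated with the test face
               [[2, 4], [6, 8]],   # perfectly correlated (scaled copy)
               [[1, 2], [4, 3]]]   # partially correlated
train_shapes = ["A", "B", "C"]     # placeholders for landmark shapes
chosen = select_initial_shapes(test_hist, train_hists, train_shapes, k=2)
print(chosen)
```

Because the training histogram-matrices do not depend on the testing image, they can be computed once offline and only the distances need to be evaluated at test time.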
III-D The Pose Correlated Initial Shapes
In the above section, we describe how to select the texture correlated initial shapes considering the occlusion information but ignoring pose information of the testing face. Empirically, landmark distribution is highly correlated to head pose. To further make the initial shapes more robust to various poses, we choose some pose correlated initial shapes for regression.
To obtain the pose correlated initial shapes, we estimate the rough pose of the testing face from five fiducial landmarks, i.e., the pupils, the tip of the nose, and the corners of the mouth. In this paper, we use MTCNN [26] to detect the five fiducial landmarks, as shown in Fig. 6. Our approach is inspired by the Perspective-n-Point (PnP) problem, which is the problem of estimating the pose of a calibrated camera given a set of 3D points and their corresponding 2D projections in the image [38]. Given a 3D mean shape $\bar{S}^{3D}_{5}$ with 5 facial key points and the five detected fiducial landmarks, a rough face pose can be estimated by:
$$r=f_{EPnP}\left(\bar{S}^{3D}_{5},P_{5}\right) \tag{6}$$
where $r$ is a rotation vector, which represents the face pose, $P_{5}$ represents the five fiducial landmarks detected by MTCNN, and $f_{EPnP}$ is the Efficient PnP [38].
Feature | LBP | LDP | Gabor | GMRF | GLDS | GLCM | Eigenface
NME (%) | 7.35 | 7.75 | 7.87 | 8.28 | 8.19 | 8.06 | 8.18
Precision/recall | 80/51.4% | 80/48.7% | 80/46.1% | 80/45.6% | 80/47.2% | 80/46.5% | 80/47.6%
Table I: Accuracy of facial landmark localization and occlusion detection based on texture correlated initialization using different features. The results indicate that LBP performs better than the others.
Then, a 3D mean face shape, represented by 29 facial landmark locations, is projected to a set of corresponding 2D locations according to the testing face pose $r$, as shown in Fig. 6. In this way, a shape with a pose similar to that of the testing face is obtained. To get a reasonable initial shape for each image, we rescale the corresponding 2D locations based on the face bounding box and the five detected fiducial landmarks $P_{5}$. The initial occlusion information of the pose correlated initial shape is distributed randomly. The pose correlated initial shape is obtained as:
$$S^{p}=\mathrm{Proj}\left(\bar{S}^{3D}_{29};\,r,\,B,\,P_{5}\right) \tag{7}$$
where $B$ is the face bounding box, $\bar{S}^{3D}_{29}$ is the 3D mean face shape with 29 points, $S^{p}$ is the pose correlated initial shape, and $\mathrm{Proj}(\cdot)$ denotes the projection and rescaling described above. To achieve better performance, we select several frontal faces from the training set to augment the pose correlated initial shape. Referring to their true 2D shapes and the 3D mean face shape $\bar{S}^{3D}_{29}$, we construct different 3D frontal face shapes that have little variation compared with the 3D mean face shape. Then, based on Eq. 7, different initial shapes can be generated by replacing $\bar{S}^{3D}_{29}$ with the constructed 3D frontal face shapes.
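The projection of a 3D mean shape under a given rotation vector can be sketched in pure Python. This is an illustration under simplifying assumptions rather than the scheme's actual procedure: the projection is orthographic, the 2D points are stretched to fill the bounding box (the five fiducial landmarks are not used for rescaling here), and `pose_correlated_shape` is a hypothetical name. The rotation vector itself would come from a PnP solver, which is not shown.

```python
import math

def rodrigues(rvec):
    """Rotation matrix (3x3 nested lists) from a rotation vector (axis * angle)."""
    theta = math.sqrt(sum(v * v for v in rvec))
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (v / theta for v in rvec)
    c, s, t = math.cos(theta), math.sin(theta), 1.0 - math.cos(theta)
    return [[c + kx * kx * t,      kx * ky * t - kz * s, kx * kz * t + ky * s],
            [ky * kx * t + kz * s, c + ky * ky * t,      ky * kz * t - kx * s],
            [kz * kx * t - ky * s, kz * ky * t + kx * s, c + kz * kz * t]]

def pose_correlated_shape(points3d, rvec, bbox):
    """Rotate 3D points, drop depth (orthographic projection, an assumption),
    then rescale the 2D points to fit bbox = (x, y, width, height)."""
    rot = rodrigues(rvec)
    pts = [(sum(rot[0][j] * p[j] for j in range(3)),
            sum(rot[1][j] * p[j] for j in range(3))) for p in points3d]
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0   # avoid division by zero for flat shapes
    span_y = (max(ys) - min_y) or 1.0
    bx, by, bw, bh = bbox
    return [(bx + (x - min_x) / span_x * bw, by + (y - min_y) / span_y * bh)
            for x, y in pts]

mean_shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 2.0, 5.0)]
frontal = pose_correlated_shape(mean_shape, (0.0, 0.0, 0.0), (10.0, 20.0, 100.0, 200.0))
print(frontal)
```

Replacing `mean_shape` with a slightly different 3D frontal shape yields a different initial shape for the same pose, which mirrors the augmentation described above.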
III-E Variance Evaluation
As stated in [11], due to the coarse-to-fine nature of CPR, even if a face image is initialized with several different shapes, the predictions should become similar after the iterations. Based on this principle, instead of simply taking the median of all predicted results as the final output, the variance is used to determine the reliability of the predictions of the two initialization methods.
First, after the regression is finished, the variance $V$ of all predictions is calculated. If $V$ is below a certain threshold $V_{th}$, the predictions are a good solution. In this case, all predictions are considered reliable, and we take the median of all predictions as the final output. Otherwise, some of the predictions belong to the "bad" class, and the variances of the predictions based on the two initialization methods are computed separately and denoted by $V_{t}$ (texture correlated) and $V_{p}$ (pose correlated). If $V_{t}$ is less than $V_{p}$, the predictions based on the texture correlated initialization are more reliable than those based on the pose correlated initialization. Therefore, considering only the predictions from the texture correlated initial shapes, we discard the predictions that cause an obvious change in variance and take the median of the remaining predictions as the final output. If $V_{t}$ is greater than $V_{p}$, the predictions based on the pose correlated initialization are more reliable than those based on the texture correlated initialization. Then, the median of the predictions based on the pose correlated initial shapes is taken as the final output.
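The decision rule above can be sketched as follows. Several details are assumptions made for illustration: the variance of a set of predictions is taken as the mean per-coordinate variance, the step that discards outlying texture-based predictions is omitted, ties fall through to the pose-based predictions, and `fuse_predictions` is a hypothetical name.

```python
from statistics import median, pvariance

def spread(preds):
    """Mean per-coordinate variance across predictions (assumed variance measure)."""
    return sum(pvariance(coord) for coord in zip(*preds)) / len(preds[0])

def median_shape(preds):
    """Coordinate-wise median of several predicted shapes."""
    return [median(coord) for coord in zip(*preds)]

def fuse_predictions(texture_preds, pose_preds, threshold):
    all_preds = texture_preds + pose_preds
    if spread(all_preds) < threshold:
        return median_shape(all_preds)       # all predictions agree: use them all
    if spread(texture_preds) < spread(pose_preds):
        return median_shape(texture_preds)   # texture-based predictions more reliable
    return median_shape(pose_preds)          # pose-based predictions more reliable

texture_preds = [[10.0, 10.0], [10.2, 9.8], [9.8, 10.2]]
pose_preds = [[30.0, 5.0], [12.0, 11.0], [8.0, 20.0]]
final = fuse_predictions(texture_preds, pose_preds, threshold=1.0)
print(final)
```

Here the pose-based predictions disagree strongly, so the output is the median of the tightly clustered texture-based predictions.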
IV Experimental Results
IV-A Dataset and Implementation
Methods | Landmark localization error: NME (%) | Occlusion prediction: Precision/Recall
RCPR [11] | 8.01 | 80/42%
HPM [33] | 7.46* | 80/37%*
RPP [17] | 7.52* | 78/40%*
SDM [32] | 10.88 | -
TCDCN [39] | 8.05* | -
CRASM [20] | 6.68* | 80/48.45%*
HOSRD [13] | 6.8* | -
LBPIRCPR | 7.35 | 80/51.4%
RICPR | 6.64 | 80/54.6%
Human [11] | 5.6 | -
Table II: Comparison of facial landmark localization and occlusion prediction on the COFW dataset. The table lists the results of NME and occlusion detection. * indicates that the result is taken from the published paper.
We evaluate the performance of the proposed scheme on the challenging COFW dataset [11], which is widely used to evaluate the robustness of facial landmark localization and occlusion detection. The face images in COFW have large variations in shape and occlusion due to differences in pose, expression, hairstyle, use of accessories such as sunglasses and hats, and interactions with objects (e.g., food, hands, microphones, etc.). Each image is annotated with the location and occluded/unoccluded state of 29 facial landmarks. The dataset has 1852 face images in total, of which 1345 are used for training and 507 for testing. The average occlusion rate of faces in COFW is over 23%.
To evaluate the performance of the proposed scheme, we implement it with two configurations. One is RCPR with LBP histogram correlation initialization (LBPIRCPR), in which the initialization is based only on texture correlation and the median of the predictions is taken as the final output. The other is the full version of the proposed scheme, RICPR, in which the initialization is based jointly on texture correlation and pose correlation. In texture correlation analysis, the face, whose location is provided by a face detector, is divided into non-overlapping sub-blocks. The uniform LBP operator is employed to obtain the texture information of the face. The number of labels produced by the LBP operator is 59, which corresponds to 8 sampling points (58 uniform patterns plus one label shared by all non-uniform patterns). In pose correlation analysis, we utilize MTCNN [26] to predict the five fiducial landmarks. A threshold is used to determine whether the predictions lead to a good result. Since the proposed scheme always has a good initialization, smart restarts are not needed. The number of initial shapes is set to 10 and is set to 4. We run RICPR and RCPR using the same configuration.
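We compare LBPIRCPR and RICPR with several state-of-the-art methods on the COFW dataset using the NME (Normalized Mean Error) defined by Eq. 8.
$$\mathrm{NME}=\frac{1}{M}\sum_{i=1}^{M}\frac{1}{L}\sum_{j=1}^{L}\frac{\left\|p_{i,j}-g_{i,j}\right\|_{2}}{\left\|c^{l}_{i}-c^{r}_{i}\right\|_{2}} \tag{8}$$
where $M$ is the number of images in the test set, $L$ is the number of landmarks in one image, $p_{i,j}$ is the predicted position of the $j$-th landmark of the $i$-th image, $g_{i,j}$ is the ground truth position of the $j$-th landmark of the $i$-th image, and $c^{l}_{i}$ and $c^{r}_{i}$ are the ground truth positions of the left and right eye centres of the $i$-th image, respectively.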
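A minimal sketch of this metric is given below, assuming landmarks and eye centres are (x, y) tuples and that the error is normalized by the distance between the eye centres; the function name `nme` is hypothetical.

```python
import math

def nme(predictions, ground_truths, left_eyes, right_eyes):
    """Normalized mean error over a test set.

    predictions, ground_truths -- per image, a list of (x, y) landmarks
    left_eyes, right_eyes      -- per image, ground-truth eye centres (x, y)
    """
    total = 0.0
    for pred, gt, left, right in zip(predictions, ground_truths, left_eyes, right_eyes):
        eye_dist = math.dist(left, right)   # normalization term
        mean_err = sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
        total += mean_err / eye_dist
    return total / len(predictions)

# One toy image with two landmarks: errors 0 and 5 pixels, eye distance 10 pixels.
preds = [[(0.0, 0.0), (3.0, 4.0)]]
truth = [[(0.0, 0.0), (0.0, 0.0)]]
score = nme(preds, truth, [(0.0, 0.0)], [(10.0, 0.0)])
print(score)
```

Multiplying the returned value by 100 gives the error as a percentage of the eye distance.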
Based on the NME of each image, we plot the Cumulative Error Distribution (CED) curves to further analyse the performance of the proposed scheme. We also evaluate the speed of the proposed scheme on the COFW dataset. Speed is measured in Frames Per Second (FPS). All methods are implemented using Matlab R2015b and run on a PC with a 3.60 GHz CPU and a 64-bit Windows 7 operating system.
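A CED curve gives, for each error threshold, the fraction of test images whose error does not exceed that threshold. The sketch below illustrates the computation on made-up per-image errors; the function name and the sample values are hypothetical.

```python
def ced_curve(per_image_errors, thresholds):
    """Fraction of images with error <= each threshold."""
    n = len(per_image_errors)
    return [sum(e <= t for e in per_image_errors) / n for t in thresholds]

errors = [0.04, 0.06, 0.08, 0.12]        # illustrative per-image errors
curve = ced_curve(errors, [0.05, 0.10, 0.15])
print(curve)
```

A method whose curve lies above another one localizes more images within any given error tolerance.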
IV-B Results
1) Analysis of initialization based on texture correlation: Instead of randomly selecting shapes from the training set as the initialization in RCPR, we employ a texture correlated initialization (LBPIRCPR) by computing LBP histograms. To prove the effectiveness of the texture correlated initialization method, we compare the performance of LBPIRCPR with RCPR on the COFW dataset as shown in Fig. 7.
The NMEs on various initial shapes with different correlation distances are shown in Fig. 7. The results show that the NME decreases with decreasing correlation distance and that LBPIRCPR can significantly reduce the NME by at least 45%. This indicates that the initial shapes selected from the training faces based on texture correlation are closer to the real shape of the testing face.
Moreover, given different initial shapes for each image, the variance between their predictions is used to determine whether the face belongs to the "good" class, as stated in [11]. As shown in Fig. 7, the number of "good" instances increases as the correlation distance decreases, and more of the 507 testing images belong to the "good" class when using LBPIRCPR. The number of "good" instances increases dramatically, by at least 45%, and thus fewer bad initial shapes are selected. Furthermore, the number of "good" instances increases from 395 to 504 among the 507 images with the RICPR scheme, which means that fewer than 1% of the instances are "bad". The initialization thus becomes more robust.
We also initialize the shapes using other features, including Local Derivative Pattern (LDP) [40], Gabor, Gaussian Markov Random Field (GMRF), Gray-Level Difference Statistics (GLDS), Gray-Level Co-occurrence Matrix (GLCM), and Eigenface. We report the NME and occlusion detection results of each feature in Table I. The results indicate that the initialization based on LBP histogram correlation performs best.
2) Facial landmark localization evaluation on COFW: Many facial landmark localization methods do not perform well on the COFW dataset due to the large variation in occlusion. To evaluate the proposed scheme, we compare it with several state-of-the-art methods, including RCPR [11], RPP [17], SDM [32], Tasks-Constrained Deep Convolutional Network (TCDCN) [39], Hierarchical Deformable Part Model (HPM) [33], CRASM [20] and Hierarchical Occlusion Stagewise Relational Dictionary (HOSRD) [13]. The comparisons of NME on the COFW dataset are given in Table II.
RICPR obtains the smallest NME. Compared to RCPR, LBPIRCPR reduces the NME from 8.01 to 7.35 and RICPR further reduces the NME to 6.64. The NME is reduced by 17.1% in total. RICPR performs even better than the most recent CRASM method proposed in 2017. To get the pose correlated initial shapes, we use MTCNN to detect five fiducial landmarks. The accuracy of the five fiducial landmarks plays a significant role in the performance. If the ground truth of the five fiducial landmarks is employed, the NME can reach 5.52, which demonstrates that the proposed scheme can obtain an admirable performance if the five fiducial landmarks are detected accurately.
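The relative reductions follow directly from the NME values in Table II; the arithmetic is spelled out here as a check, with the intermediate LBPIRCPR figure computed from the same table values.

```latex
\frac{8.01 - 7.35}{8.01} = \frac{0.66}{8.01} \approx 8.2\%
\qquad\text{(RCPR to LBPIRCPR)}

\frac{8.01 - 6.64}{8.01} = \frac{1.37}{8.01} \approx 17.1\%
\qquad\text{(RCPR to RICPR)}
```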
We also show the CED curves on the COFW dataset in Fig. 8. As can be seen, more images are localized accurately with the proposed scheme, which also demonstrates its superiority for facial landmark localization in face images with occlusions.
3) Occlusion detection on COFW: Since the COFW dataset provides the ground truth of occlusion, we evaluate occlusion detection on COFW and compare the proposed scheme with RCPR [11], HPM [33], CoR [24], RPP [17] and CRASM [20]. The occlusion prediction results are shown in Table II and Fig. 9. As can be seen, the proposed scheme also outperforms the state-of-the-art methods in occlusion detection.
When the precision is fixed at 80%, the proposed scheme achieves a recall of 54.6%, which is higher than the 42% obtained by RCPR, 37% obtained by HPM, 41.44% obtained by CoR, 48.45% obtained by CRASM and the 78/40% precision/recall obtained by RPP. Even when only the LBPIRCPR scheme is used, the recall of occlusion detection reaches 51.4%. This demonstrates that the proposed scheme achieves a much higher accuracy of occlusion detection, which can provide significant benefits in real-world applications such as image texture analysis, facial expression understanding and face recognition. Fig. 11 shows example images with the results obtained by the proposed RICPR.
4) Run time: We record the speeds of RCPR, LBPIRCPR and RICPR on the COFW dataset. The speeds of these methods are 5.3 FPS, 4.1 FPS and 4.0 FPS, respectively. The proposed scheme spends some additional time on calculating the correlation. The speed can be improved by implementing the scheme in C++ or by using a more powerful machine. We will try to improve the efficiency of the proposed scheme in the future, for example, by reducing the number of face images used for texture correlation based initialization.
IV-C Generalization of the Proposed Initialization Scheme
The experimental results demonstrate that the proposed initialization scheme significantly improves the performance of RCPR in both localization and occlusion prediction. Since the initialization is usually independent of the facial landmark localization algorithm, the proposed initialization scheme can be applied to other algorithms such as SDM. The results are shown in Fig. 10, where baseline is the original SDM or RCPR, LBPI+baseline is the texture correlated initialization scheme applied to SDM or RCPR, and RI+baseline denotes the joint texture correlation and pose correlation initialization scheme applied to SDM or RCPR. Compared with the original SDM, which is based on random initialization, LBPISDM and RISDM reduce the NME by 14.6% and 19%, respectively. The results indicate that the proposed initialization scheme can also improve the performance of SDM.
V Conclusions
In this paper, we propose a robust initialization scheme that addresses the sensitivity of cascaded pose regression to initialization by jointly analyzing the texture and pose of a testing face. By examining the correlation of local binary pattern histograms between the testing face and the training faces, texture correlated shapes are selected instead of random shapes. At the same time, a pose correlated initialization is proposed to further improve the robustness of the initialization by estimating the face pose. Experimental results show that the proposed scheme obtains higher accuracies in both facial landmark localization and occlusion detection on facial images than the state-of-the-art benchmarks. Moreover, since the initialization is usually independent of the facial landmark localization algorithm, the proposed initialization scheme has the potential to be extended and applied to other algorithms.
Acknowledgment
This work is supported by the National Natural Science Foundation of China (Grant No. 61601337).
References
 [1] S. Yang, P. Luo, C. C. Loy, and X. Tang, “From facial parts responses to face detection: A deep learning approach,” in IEEE International Conference on Computer Vision, 2015, pp. 3676–3684.
 [2] R. Weng, J. Lu, and Y. P. Tan, “Robust point set matching for partial face recognition,” IEEE Transactions on Image Processing, vol. 25, no. 3, pp. 1163–1176, 2016.
 [3] H. Li, D. Huang, J. M. Morvan, Y. Wang, and L. Chen, “Towards 3D face recognition in the real: A registration-free approach using fine-grained matching of 3D keypoint descriptors,” International Journal of Computer Vision, vol. 113, no. 2, pp. 128–142, 2015.
 [4] Y. Tai, J. Yang, Y. Zhang, L. Luo, J. Qian, and Y. Chen, “Face recognition with pose variations and misalignment via orthogonal procrustes regression,” IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2673–2683, 2016.
 [5] Y. Li, S. Wang, Y. Zhao, and Q. Ji, “Simultaneous facial feature tracking and facial expression recognition,” IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2559–2573, 2013.
 [6] S. K. A. Kamarol, M. H. Jaward, H. Kälviäinen, J. Parkkinen, and R. Parthiban, “Joint facial expression recognition and intensity estimation based on weighted votes of image sequences,” Pattern Recognition, vol. 92, pp. 25–32, 2017.
 [7] W. Zhang, Y. Zhang, L. Ma, J. Guan, and S. Gong, “Multimodal learning for facial expression recognition,” Pattern Recognition, vol. 48, no. 10, pp. 3191–3202, 2015.
 [8] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2879–2886.
 [9] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 FPS via regressing local binary features,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1685–1692.
 [10] S. Zhu, C. Li, C. Change Loy, and X. Tang, “Face alignment by coarse-to-fine shape searching,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4998–5006.
 [11] X. P. Burgos-Artizzu, P. Perona, and P. Dollar, “Robust face landmark estimation under occlusion,” in IEEE International Conference on Computer Vision, 2013, pp. 1513–1520.
 [12] A. Jourabloo and X. Liu, “Large-pose face alignment via CNN-based dense 3D model fitting,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4188–4196.
 [13] J. Xing, Z. Niu, J. Huang, W. Hu, X. Zhou, and S. Yan, “Towards robust and accurate multi-view and partially-occluded face alignment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2017.
 [14] P. Dollar, P. Welinder, and P. Perona, “Cascaded pose regression,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1078–1085.
 [15] X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2887–2894.
 [16] V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.
 [17] H. Yang, X. He, X. Jia, and I. Patras, “Robust face alignment under occlusion via regional predictive power estimation,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2393–2403, 2015.
 [18] G. Tzimiropoulos, “Project-out cascaded regression with an application to face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3659–3667.
 [19] Q. Liu, J. Deng, and D. Tao, “Dual sparse constrained cascade regression for robust face alignment,” IEEE Transactions on Image Processing, vol. 25, no. 2, pp. 700–712, 2016.
 [20] Q. Liu, J. Deng, J. Yang, G. Liu, and D. Tao, “Adaptive cascade regression model for robust face alignment,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 797–807, 2017.
 [21] D. Lee, H. Park, and C. D. Yoo, “Face alignment using cascade gaussian process regression trees,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4204–4212.
 [22] B. M. Smith and C. R. Dyer, “Efficient branching cascaded regression for face alignment under significant head rotation,” CoRR, vol. abs/1611.01584, 2016. [Online]. Available: http://arxiv.org/abs/1611.01584
 [23] J. Zhang, M. Kan, S. Shan, and X. Chen, “Occlusion-free face alignment: Deep regression networks coupled with de-corrupt autoencoders,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3428–3437.
 [24] X. Yu, Z. Lin, J. Brandt, and D. N. Metaxas, “Consensus of regression for occlusion-robust facial feature localization,” in European Conference on Computer Vision, 2014, pp. 105–118.
 [25] K. Seshadri and M. Savvides, “Towards a unified framework for pose, expression, and occlusion tolerant automatic facial alignment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2110–2122, 2016.
 [26] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
 [27] Y. Pan, J. Zhou, Y. Gao, J. Xiang, S. Xiong, and Y. Yang, “Robust facial landmark localization using LBP histogram correlation based initialization,” in IEEE International Conference on Automatic Face and Gesture Recognition, 2017, pp. 619–625.
 [28] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” in European Conference on Computer Vision, 1998, pp. 484–498.
 [29] I. Matthews and S. Baker, “Active appearance models revisited,” International Journal of Computer Vision, vol. 60, no. 2, pp. 135–164, 2004.
 [30] J. Alabort-i-Medina and S. Zafeiriou, “Bayesian active appearance models,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3438–3445.
 [31] E. Antonakos, J. Alabort-i-Medina, G. Tzimiropoulos, and S. P. Zafeiriou, “Feature-based Lucas–Kanade and active appearance models,” IEEE Transactions on Image Processing, vol. 24, no. 9, p. 2617, 2015.
 [32] X. Xiong and F. De la Torre, “Supervised descent method and its applications to face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
 [33] G. Ghiasi and C. C. Fowlkes, “Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1899–1906.
 [34] T. Ojala, M. Pietikäinen, and D. Harwood, “A comparative study of texture measures with classification based on featured distributions,” Pattern Recognition, vol. 29, no. 1, pp. 51–59, 1996.
 [35] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
 [36] T. Ahonen, A. Hadid, and M. Pietikäinen, “Face recognition with local binary patterns,” in European Conference on Computer Vision, 2004, pp. 469–481.
 [37] K. Pearson, “Note on regression and inheritance in the case of two parents,” Proceedings of the Royal Society of London, vol. 58, pp. 240–242, 1895.
 [38] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, no. 2, pp. 155–166, 2009.
 [39] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Learning deep representation for face alignment with auxiliary attributes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 5, pp. 918–930, 2016.
 [40] B. Zhang, Y. Gao, S. Zhao, and J. Liu, “Local derivative pattern versus local binary pattern: Face recognition with high-order local pattern descriptor,” IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 533–544, 2010.