Robust Facial Landmark Localization Based on Texture and Pose Correlated Initialization


Abstract

Robust facial landmark localization remains a challenging task when faces are partially occluded. Recently, cascaded pose regression has attracted increasing attention due to its superior performance in facial landmark localization and occlusion detection. However, such an approach is sensitive to initialization, and an improper initialization can severely degrade the performance. In this paper, we propose a Robust Initialization for Cascaded Pose Regression (RICPR) that provides texture and pose correlated initial shapes for the testing face. By examining the correlation of local binary pattern histograms between the testing face and the training faces, the shapes of the training faces that are most correlated with the testing face are selected as the texture correlated initialization. To make the initialization more robust to various poses, we estimate the rough pose of the testing face according to five fiducial landmarks located by multi-task cascaded convolutional networks. The pose correlated initial shapes are then constructed from the mean face shape and the rough pose of the testing face. Finally, the texture correlated and the pose correlated initial shapes are joined together as the robust initialization. We evaluate RICPR on the challenging COFW dataset. The experimental results demonstrate that the proposed scheme achieves better performance than state-of-the-art methods in both facial landmark localization and occlusion detection.

Index Terms: Facial landmark localization, Cascaded pose regression, Robust initialization, Occlusion, Texture and pose correlated.


1 Introduction

Facial landmark localization, i.e., localizing the facial key points (e.g., eyebrows, eyes, nose, mouth and jaw), plays an important role in many computer vision tasks, such as face detection [1], face recognition [2, 3, 4] and facial expression analysis [5, 6, 7]. In recent years, facial landmark localization has been extensively studied and has achieved remarkable performance on standard datasets and even on datasets collected in the wild [8, 9, 10, 11, 12, 13]. However, it still struggles with faces that exhibit large variations in appearance, including pose, expression and, especially, occlusion.

Figure 1: Visual results of RCPR on COFW dataset (red: occluded, green: un-occluded). The initial shapes (the first row) and their localization results (the second row) of RCPR [11]. Facial landmark localization usually fails when it begins with a bad initial shape.

Since Cascaded Pose Regression (CPR) was first used to estimate facial shapes [14], shape regression in a cascaded manner has emerged as one of the most popular approaches for facial landmark localization [15, 11, 9, 16, 17, 18, 19, 20, 21, 22]. CPR and its variants typically begin with an initial shape, such as an average shape or a random shape from the training samples, and then update the shape from coarse to fine through the trained regressors. Based on CPR, Burgos-Artizzu et al. proposed Robust CPR (RCPR) [11], which is the first scheme to explicitly detect the occlusion state of landmarks while estimating their locations. They also created a popular and challenging dataset named Caltech Occluded Faces in the Wild (COFW) [11], in which most faces are occluded. Researchers have used this dataset to study facial landmark localization under occlusion [11, 17, 20, 23, 24, 25]. Although these methods make some progress on facial landmark localization under partial occlusion, the occlusion problem is not essentially solved and the accuracy of occlusion prediction is still unsatisfactory. Since occluded landmarks can hardly provide information for further analysis, it is important to detect the occlusion state of landmarks; moreover, occluded landmarks may reduce the accuracy of localizing the un-occluded landmarks.

Regression is initialization dependent: an improper initialization can sharply degrade the performance. When pose variation and occlusion appear simultaneously on a face, localization will fail if a bad initial shape is selected. As shown in Fig. 1, a bad initial shape usually leads to a failure of landmark localization and occlusion prediction. In this paper, we propose a Robust Initialization for CPR (RICPR)¹ that avoids bad initialization by examining the texture and pose of the testing face to obtain texture correlated and pose correlated initial shapes. Since texture is closely related to occlusion, we select the initial shapes according to the texture correlation between the testing face and the training faces. We first compute the Local Binary Patterns (LBP) histograms of all training faces and of the testing face. We then compute the correlation distance between the histograms of the testing face and each training face, and choose the shapes of the most correlated training faces as the texture correlated initialization. On the other hand, the rough pose of the testing face, represented by a rotation vector, is used to obtain the pose correlated initial shapes. More specifically, we first estimate five fiducial landmarks, including the pupils, the tip of the nose, and the corners of the mouth, using Multi-Task Cascaded Convolutional Networks (MTCNN) [26]. From the five landmarks and a 3D mean face shape with 5 facial key points, the face pose can be obtained. Then another 3D mean face shape, represented by 29 facial key points, is projected to a set of corresponding 2D locations according to the estimated pose, which yields the pose correlated initial shapes. Finally, the texture correlated initial shapes and the pose correlated initial shapes are taken together as the robust initialization for regression, as shown in Fig. 2, which is more relevant to the true shape of the testing face in both location and occlusion. We evaluate RICPR on the challenging COFW dataset. The experimental results show that the Normalized Mean Error (NME) is 6.64 and the accuracy of occlusion detection is 80/54.6% precision/recall, which is better than that of the state-of-the-art schemes.

Some of the ideas presented in this paper were initially reported in [27]. In this paper, we present the full formulation and an extensive experimental evaluation of our method. The initialization depends not only on texture correlation but also on pose correlation, and the accuracies of landmark localization and occlusion detection are further improved.

The remainder of this paper is organized as follows. In Section 2, related work is briefly introduced. We review CPR and describe the proposed scheme in Section 3. The experimental results on the COFW dataset are given in Section 4. Finally, conclusions are drawn in Section 5.

2 Related Work

Works that address facial landmark localization and face alignment can be roughly divided into two groups: holistic based methods and local based methods. Holistic methods regard the shape as a whole and usually align the face in an iterative or cascaded way. A typical holistic based method is the Active Appearance Model (AAM) [28, 29, 30, 31]. CPR [14] is a cascaded method with random fern regressors, which provides a fast and accurate solution for computing the 2D shape of an object. Explicit Shape Regression (ESR) [15] and RCPR [11] extended the idea of CPR and also use pixel-difference features and fern regressors. A similar method, the Supervised Descent Method (SDM), was proposed in [32]. It uses cascaded regression with fast SIFT features and solves localization as a nonlinear least-squares problem with Newton-type optimization.

Since occlusions are very common in real applications of computer vision and occluded landmarks usually cannot provide information, some works focus on facial landmark localization and occlusion detection jointly [11, 17, 33, 20, 24]. Burgos-Artizzu et al. first proposed to detect the occlusion state while estimating landmarks in RCPR [11], where the occlusion estimates at each iteration lead to visually different regressors whose outputs are merged with weights that depend on the occlusion prediction results. Considering that occlusions often cover a region, instead of using visibility annotations, Yang et al. [17] used the consistency of votes of a local regression forest over several over-segmented regions to obtain a confidence value for each pixel, which is called Regional Predictive Power (RPP). Compared with RCPR, RPP obtained a higher accuracy in landmark localization. Yu et al. [24] proposed a Consensus of occlusion-robust Regression method (CoR) that forms a consensus from estimates produced by a set of occlusion-specific regressors, where each regressor is trained under the precondition that a particular predefined region of the face is occluded. CoR improved the accuracy of occlusion detection. Liu et al. [20] proposed Cascade Regression with Adaptive Shape Model (CRASM) for robust facial landmark localization. In each iteration, the shape-indexed appearance is used to estimate the occlusion level of each landmark, and each landmark is then weighted according to its occlusion level. Moreover, the occlusion levels act as adaptive weights on the shape-indexed features to suppress the noise introduced by occlusion, which improves both facial landmark localization and occlusion detection. Compared with other methods, CRASM obtained an 80/48.45% precision/recall for occlusion detection and an NME of 6.68 for localization on the COFW dataset.

3 The Proposed Scheme

Figure 2: The procedure of RICPR. The texture correlated initial shapes and the pose correlated initial shapes are calculated in parallel. The texture correlated initialization is based on the correlation of LBP histograms between the testing face and the training faces, while the pose correlated initialization is based on the evaluated rough face pose. These initial shapes are combined together as robust initializations for regression to get predictions. Finally, the reliability of each prediction is evaluated by variance to get the final output.

In this section, we briefly review CPR and RCPR, and then describe in detail the proposed RICPR scheme for facial landmark localization under occlusions.

3.1 Cascaded Shape Regression

1: Input: image $I$, initial shape $S^0$, regressors $R^1, \ldots, R^T$
2: for $t = 1$ to $T$ do
3:        $f^t = \Phi(I, S^{t-1})$         //Shape-indexed features
4:        $\delta S^t = R^t(f^t)$          //Apply regressor
5:        $S^t = S^{t-1} + \delta S^t$     //Update estimated shape
6: end for
7: Output: estimated shape $S^T$
Algorithm 1 Cascaded Pose Regression

The main steps of CPR [14] are described in Algorithm 1. It starts from a raw initial shape $S^0$, which is progressively refined at each iteration by applying a cascade of regressors until the final shape is estimated. At iteration $t$, shape-indexed features are calculated as $f^t = \Phi(I, S^{t-1})$, where $I$ is the face image and $S^{t-1}$ is the shape estimated at the previous iteration. Based on the shape-indexed features $f^t$ and the regressor $R^t$, an update $\delta S^t$ is calculated and combined with $S^{t-1}$ to form the new shape $S^t$. ESR [15] proposed some improvements over CPR, using a two-level cascade to strengthen the regressors. There are $K$ primitive fern regressors at each iteration, and the shape update is obtained by:

$S^t = S^{t-1} + \sum_{k=1}^{K} R_k^t\big(\Phi(I, S^{t-1})\big)$        (1)
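
For illustration, a minimal Python sketch of the cascade in Algorithm 1 could look as follows; the feature extractor, the regressor interface and all names are hypothetical (the experiments in this paper were implemented in Matlab):

import numpy as np

def cascaded_pose_regression(image, init_shape, regressors, extract_features):
    # regressors: list of trained stage regressors R^1..R^T, each mapping
    # shape-indexed features to a shape update (assumed interface).
    # extract_features: callable computing f^t = Phi(image, shape).
    shape = init_shape.copy()                        # S^0
    for regressor in regressors:                     # t = 1, ..., T
        features = extract_features(image, shape)    # shape-indexed features f^t
        delta = regressor(features)                  # update predicted by R^t
        shape = shape + delta                        # S^t = S^{t-1} + delta
    return shape                                     # estimated shape S^T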

Burgos-Artizzu et al. [11] proposed a novel regression approach, RCPR, to handle localization under occlusion, which divides the face image into a 3-by-3 grid of 9 zones. At each iteration $t$, the occlusion present in each of the 9 zones can be estimated by projecting the current shape estimate onto the image. Then, instead of training a single regressor, RCPR trains several regressors within each primitive fern regressor $R_k^t$, and each of them is allowed to draw features from only 1 of the 9 predefined zones. Finally, the updates of these regressors are combined through weighted mean voting, where the weight of each regressor is inversely proportional to the occlusion present in the zone it draws features from. At the $t$-th iteration, the $k$-th update can be described as:

$\delta S_k^t = \sum_{s} w_s \, R_{k,s}^t\big(\Phi(I, S^{t-1})\big), \quad w_s \propto 1 - o_s$        (2)

where $R_{k,s}^t$ denotes the regressor restricted to zone $s$ and $o_s$ is the estimated occlusion level of that zone.
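
A minimal sketch of the occlusion-weighted voting described above, assuming each zone-restricted regressor returns a shape update and the per-zone occlusion estimates are given (all names are illustrative, not the original implementation):

import numpy as np

def occlusion_weighted_update(features, zone_regressors, zone_occlusion):
    # zone_regressors: regressors, each restricted to one of the 9 predefined zones.
    # zone_occlusion : estimated occlusion level in [0, 1] of the zone each regressor uses.
    updates = np.stack([reg(features) for reg in zone_regressors])
    weights = 1.0 - np.asarray(zone_occlusion, dtype=float)    # less occluded -> larger weight
    weights = weights + 1e-8                                   # guard against all-occluded zones
    return np.average(updates, axis=0, weights=weights)        # weighted mean voting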

3.2 Robust Initialization for Cascaded Pose Regression

The procedure of the proposed RICPR is illustrated in Fig. 2. First, we obtain the texture correlated initial shapes by calculating the texture correlation between the testing face and the training faces; at the same time, the pose correlated initial shapes are obtained by examining the rough pose of the testing face. Then, these initial shapes are taken together as the robust initialization for cascaded regression. We describe the two initialization methods in Sections 3.3 and 3.4, respectively.

3.3 The Texture Correlated Initial Shapes

Since occlusion and pose variation change the appearance of a face, and a texture descriptor captures local appearance details, we can select texture correlated initial shapes that take the occlusion information of the testing face into account, rather than random initial shapes.

We propose an initialization method based on texture correlation analysis between the testing face and the training faces. The shapes of the training faces that are most correlated with the testing face are chosen as the initialization, instead of random ones. The LBP operator was originally proposed for texture analysis and is widely used in computer vision [34]. It labels an image by thresholding the 3-by-3 neighborhood of each pixel with the centre value, as shown in Fig. 3 (a). The histogram of the labels can be used as a texture descriptor. To overcome the limitations of the basic LBP operator, the rotation-invariant LBP and the uniform LBP were proposed [35], as shown in Fig. 3 (b). In the proposed scheme, we choose the uniform LBP since it balances performance and speed. We use the notation $LBP_{P,R}^{u2}$ to denote the LBP operator, where the subscript $(P, R)$ means sampling $P$ points on a circle of radius $R$, and the superscript $u2$ means using only uniform patterns.

Figure 3: (a) Basic LBP operator. (b) Examples of extended LBP operators: The circular (8,1), (16,2), and (8,2) neighborhoods.
Figure 4: The procedure of constructing histograms matrix.

Given a face image, we divide the face into $m$ non-overlapping sub-blocks $B_1, \ldots, B_m$, as shown in Fig. 4. For each block $B_j$, we apply the uniform LBP operator to obtain a labeled image $f_l(x, y)$, and a histogram of the labeled block is computed as:

$H_{j,i} = \sum_{(x,y) \in B_j} I\{f_l(x, y) = i\}, \quad i = 0, \ldots, n-1$        (3)

where $n$ is the number of labels of each block produced by the LBP operator and $I\{\cdot\}$ is defined as:

$I\{A\} = 1$ if $A$ is true, and $I\{A\} = 0$ otherwise.

Finally, the $m$ block histograms are combined, yielding the histogram-matrix with a size of $m \times n$.
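
The block-wise histogram-matrix can be sketched with scikit-image's uniform LBP implementation; the grid size and radius below are assumed values for illustration, and the function names are not those of the original Matlab code:

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram_matrix(gray_face, grid=(7, 7), points=8, radius=1):
    # 'nri_uniform' is the non-rotation-invariant uniform mapping; with 8 sampling
    # points it produces 59 labels, matching the configuration in Section 4.1.
    lbp = local_binary_pattern(gray_face, points, radius, method="nri_uniform")
    n_labels = points * (points - 1) + 3                  # 59 labels for points = 8
    hists = []
    for row in np.array_split(lbp, grid[0], axis=0):      # non-overlapping sub-blocks
        for block in np.array_split(row, grid[1], axis=1):
            hist, _ = np.histogram(block, bins=n_labels, range=(0, n_labels))
            hists.append(hist / max(block.size, 1))       # normalized histogram H_j
    return np.array(hists)                                # histogram-matrix of size m x n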

Figure 5: We run RCPR based on 5 initial shapes selected randomly from the training set and 5 most correlated shapes from the training set by the proposed texture correlated initialization. The median of all predictions is taken as the final output. (a) The five initial shapes, where the images in the first row are the initial shapes using random initialization and the images in the second row are initial shapes using the texture correlated initialization. (b) The corresponding outputs of the two facial landmark localization methods.

During testing, the histogram-matrices of the testing image and of the training samples are computed using the above scheme. Note that, to save testing time, the histogram-matrices of the training samples can be computed offline before testing. A common way to compare histogram-matrices is to use one of the histogram similarity measures, such as histogram intersection, log-likelihood or the Chi-square statistic [36]. Since our goal is to select proper initial shapes for regression and we want to pick a few of the training shapes most relevant to the testing face, we need a method to assess the correlation between the testing face and the training faces. In this paper, we choose the Pearson correlation coefficient [37] to measure this correlation. The Pearson correlation coefficient between the testing face histogram-matrix $H_t$ and each training face histogram-matrix $H_i$ is calculated by:

1: Input: testing face $I_t$, training faces $\{I_i\}_{i=1}^{N}$, shapes of training faces $\{S_i\}_{i=1}^{N}$
2: Output: texture correlated initial shapes $S_{init}$
3: Compute LBP histograms for all faces
4: Obtain histogram-matrices $H_t$ and $H_i$ for $I_t$ and $I_i$
5: for $i = 1$ to $N$ do
6:      Calculate the correlation coefficient $\rho_i$ between $H_t$ and $H_i$
7:      Get the correlation distance $d_i$
8: end for
9: Search for the $K$ smallest correlation distances
10: Select the corresponding shapes in $\{S_i\}$ as $S_{init}$
Algorithm 2 Initialization based on texture correlation
$\rho_i = \dfrac{\mathrm{cov}(H_t, H_i)}{\sigma_{H_t} \, \sigma_{H_i}}, \quad i = 1, \ldots, N$        (4)

where $\mathrm{cov}(\cdot,\cdot)$ is the covariance, $\sigma$ is the standard deviation and $N$ is the total number of training faces. Since the size of each histogram-matrix is $m \times n$, the covariance can be calculated as:

$\mathrm{cov}(H_t, H_i) = \dfrac{1}{mn} \sum_{u=1}^{m} \sum_{v=1}^{n} \big(H_t(u,v) - \bar{H}_t\big)\big(H_i(u,v) - \bar{H}_i\big)$

where $\bar{H}_t$ and $\bar{H}_i$ are the mean values of matrices $H_t$ and $H_i$, respectively. Then the correlation coefficient $\rho_i$ can be used to calculate the correlation distance $d_i$:

$d_i = 1 - \rho_i$        (5)

A smaller $d_i$ indicates that the training face is more correlated with the testing face. We choose the $K$ most correlated training faces and select their shapes as initial shapes for the testing face. The main procedure of the initialization based on texture correlation is presented in Algorithm 2. A comparison between the random initialization [11] and the texture correlation based initialization is illustrated in Fig. 5. It can be seen that the proposed texture correlation based initialization usually obtains more accurate initial shapes, which improves the accuracy of landmark localization.
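
Algorithm 2 can be sketched in a few lines of Python on top of the histogram-matrices (variable names and the value of K are illustrative):

import numpy as np

def texture_correlated_shapes(test_hist, train_hists, train_shapes, k=4):
    # test_hist   : histogram-matrix H_t of the testing face.
    # train_hists : histogram-matrices H_i of the training faces.
    # train_shapes: ground-truth shapes S_i of the training faces.
    x = test_hist.ravel()
    distances = []
    for h in train_hists:
        rho = np.corrcoef(x, h.ravel())[0, 1]   # Pearson correlation coefficient, Eq. (4)
        distances.append(1.0 - rho)             # correlation distance d_i, Eq. (5)
    nearest = np.argsort(distances)[:k]         # K smallest correlation distances
    return [train_shapes[i] for i in nearest]   # texture correlated initial shapes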

3.4 The Pose Correlated Initial Shapes

In the previous section, we described how to select the texture correlated initial shapes, which consider the occlusion information but ignore the pose information of the testing face. Empirically, the landmark distribution is highly correlated with the head pose. To make the initial shapes more robust to various poses, we also choose some pose correlated initial shapes for regression.

To obtain the pose correlated initial shapes, we estimate the rough pose of the testing face from five fiducial landmarks, i.e., the pupils, the tip of the nose, and the corners of the mouth. In this paper, we use MTCNN [26] to detect the five fiducial landmarks, as shown in Fig. 6. We are inspired by the Perspective-n-Point (PnP) problem, i.e., the problem of estimating the pose of a calibrated camera given a set of 3D points and their corresponding 2D projections in the image [38]. Given a 3D mean face shape $S_{3D}^{5}$ with 5 facial key points and the five detected fiducial landmarks, a rough face pose can be estimated by:

$P = f_{EPnP}\big(S_{3D}^{5}, L_5\big)$        (6)

where $P$ is a rotation vector that represents the face pose, $L_5$ represents the five fiducial landmarks detected by MTCNN, and $f_{EPnP}(\cdot)$ is the Efficient PnP solver [38].
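
The pose step can be sketched with OpenCV's EPnP solver; the pinhole camera matrix below is a rough assumption, since the paper does not specify a camera model, and all names are illustrative:

import numpy as np
import cv2

def estimate_rough_pose(landmarks_2d, mean_shape_3d, image_size):
    # landmarks_2d : (5, 2) pupils, nose tip and mouth corners detected by MTCNN.
    # mean_shape_3d: (5, 3) corresponding points of a 3D mean face shape.
    # image_size   : (height, width), used for a crude pinhole camera approximation.
    h, w = image_size
    camera = np.array([[w, 0, w / 2.0],
                       [0, w, h / 2.0],
                       [0, 0, 1.0]], dtype=np.float64)       # assumed intrinsics
    ok, rvec, tvec = cv2.solvePnP(mean_shape_3d.astype(np.float64),
                                  landmarks_2d.astype(np.float64),
                                  camera, None,
                                  flags=cv2.SOLVEPNP_EPNP)   # Efficient PnP [38]
    return rvec, tvec, camera                                # rvec plays the role of P in Eq. (6)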

Figure 6: Illustration of generating the pose correlated shape. Given an image, we first detect five fiducial landmarks and estimate the face pose. Then, according to the face pose, a 3D mean face shape with 29 facial key points is projected to a set of corresponding 2D locations, which has a similar pose to the testing image.
Figure 7: Comparisons between the texture correlated initialization based RCPR and the traditional random initialization based RCPR. (a) The NME for initial shapes with different correlation distances in the training and testing processes. (b) The number of "good" instances, determined by the variance after 10% of the cascade iterations, for various correlation distances. Correlation distances in (a) and (b) are ranked in ascending order.
Feature            LBP        LDP        Gabor      GMRF       GLDS       GLCM       Eigenface
NME (%)            7.35       7.75       7.87       8.28       8.19       8.06       8.18
Precision/Recall   80/51.4%   80/48.7%   80/46.1%   80/45.6%   80/47.2%   80/46.5%   80/47.6%
  • Accuracy of facial landmark localization and occlusion detection based on texture correlated initialization using different features. The results indicate that the LBP performs better than the others.

Table 1: Texture Correlated Initialization Using Different Features

Then, a 3D mean face shape $S_{3D}^{29}$, represented by 29 facial landmark locations, is projected to a set of corresponding 2D locations according to the testing face pose $P$, as shown in Fig. 6. In this way, a shape that has a similar pose to the testing face is obtained. To get a reasonable initial shape for each image, we re-scale the corresponding 2D locations based on the face bounding box $B$ and the five detected fiducial landmarks $L_5$. The initial occlusion information of the pose correlated initial shape is distributed randomly. The pose correlated initial shape is constructed as:

$S_{pose} = f_{rescale}\big(f_{proj}(S_{3D}^{29}, P), B, L_5\big)$        (7)

where $B$ is the face bounding box, $S_{3D}^{29}$ is the 3D mean face shape with 29 points, $f_{proj}(\cdot)$ denotes the projection by the pose $P$, $f_{rescale}(\cdot)$ denotes the re-scaling, and $S_{pose}$ is the pose correlated initial shape. To achieve a better performance, we select several frontal faces from the training set to augment the pose correlated initial shapes. Referring to their true 2D shapes and the 3D mean face shape $S_{3D}^{29}$, we construct different 3D frontal face shapes that vary slightly from the 3D mean face shape. Then, based on Eq. 7, different initial shapes can be generated by replacing $S_{3D}^{29}$ with the constructed 3D frontal face shapes.
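
A simplified sketch of generating one pose correlated initial shape from the estimated pose, continuing the previous sketch; the rescaling to the face bounding box and the random occlusion flags are illustrative simplifications of Eq. (7), not the paper's exact procedure:

import numpy as np
import cv2

def pose_correlated_shape(mean_shape29_3d, rvec, tvec, camera, bbox):
    # Project the 29-point 3D mean face shape with the estimated pose.
    pts2d, _ = cv2.projectPoints(mean_shape29_3d.astype(np.float64),
                                 rvec, tvec, camera, None)
    pts2d = pts2d.reshape(-1, 2)
    # Rescale the projected shape so that it spans the detected face bounding box
    # (a simplified stand-in for the rescaling in Eq. (7)).
    x, y, w, h = bbox
    mins, maxs = pts2d.min(axis=0), pts2d.max(axis=0)
    scale = np.array([w, h]) / np.maximum(maxs - mins, 1e-6)
    shape = (pts2d - mins) * scale + np.array([x, y])
    occlusion = np.random.rand(len(shape)) < 0.5   # random initial occlusion states (assumed rate)
    return shape, occlusion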

3.5 Variance Evaluation

As stated in [11], due to the coarse-to-fine nature of CPR, even if a face image is initialized with several different shapes, the predictions should become similar after a number of iterations. Based on this principle, instead of simply taking the median of all predicted results as the final output, the variance is used to determine the reliability of the predictions produced by the two initialization methods.

Firstly, after the regression is finished, the variance of all predictions is calculated. If this variance is below a certain threshold, all predictions are considered a good solution and thus reliable, so we take the median of all predictions as the final output. Otherwise, part of the predictions belong to the "bad" class; in that case the variances of the predictions based on the two initialization methods are computed separately. If the variance of the texture based predictions is smaller, the predictions based on the texture correlated initialization are more reliable than those based on the pose correlated initialization. Therefore, considering only the predictions from the texture correlated initial shapes, we discard the predictions that cause an obvious increase in variance and take the median of the remaining predictions as the final output. If the variance of the texture based predictions is larger, the predictions based on the pose correlated initialization are more reliable, and the median of the predictions based on the pose correlated initial shapes is taken as the final output.
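
The selection rule of this section can be sketched as follows, assuming the predictions of each initialization family are stacked into arrays; the threshold value and the way deviating predictions are discarded are simplifying assumptions:

import numpy as np

def select_final_prediction(tex_preds, pose_preds, threshold):
    # tex_preds, pose_preds: arrays of shape (n_init, n_landmarks, 2).
    all_preds = np.concatenate([tex_preds, pose_preds], axis=0)
    if all_preds.var(axis=0).mean() < threshold:        # predictions agree: all are reliable
        return np.median(all_preds, axis=0)
    var_tex = tex_preds.var(axis=0).mean()
    var_pose = pose_preds.var(axis=0).mean()
    if var_tex < var_pose:                              # texture-based predictions more consistent
        center = np.median(tex_preds, axis=0)
        errors = np.linalg.norm(tex_preds - center, axis=(1, 2))
        keep = tex_preds[errors <= np.median(errors)]   # drop the most deviating predictions (assumption)
        return np.median(keep, axis=0)
    return np.median(pose_preds, axis=0)                # otherwise rely on the pose-based predictions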

4 Experimental Results

4.1 Dataset and Implementation

Methods        NME (%)    Occlusion prediction (Precision/Recall)
RCPR [11]      8.01       80/42%
HPM [33]       7.46*      80/37%*
RPP [17]       7.52*      78/40%*
SDM [32]       10.88      -
TCDCN [39]     8.05*      -
CRASM [20]     6.68*      80/48.45%*
HOSRD [13]     6.8*       -
LBP-I-RCPR     7.35       80/51.4%
RICPR          6.64       80/54.6%
Human [11]     5.6        -
  • Comparison of facial landmark localization and occlusion prediction on COFW dataset. The table lists the results of NME and occlusion detection. * indicates that the result is from the published paper.

Table 2: Results on COFW Dataset.

We evaluate the performance of the proposed scheme on the challenging dataset COFW [11], which is widely used to evaluate the robustness of facial landmark localization and occlusion detection. The face images in COFW have large variations in shape and occlusion due to differences in pose, expression and hairstyle, the use of accessories such as sunglasses and hats, and interactions with objects (e.g. food, hands, microphones, etc.). Each image is annotated with the location and occluded/un-occluded state of 29 facial landmarks. This dataset has 1852 face images in total, of which 1345 and 507 images are used for training and testing, respectively. The average occlusion rate of faces in COFW is over 23%.

To evaluate the performance of the proposed scheme, we implement it in two configurations. One is RCPR based on LBP histogram correlation initialization (LBP-I-RCPR), in which the initialization is based only on texture correlation and the median of the predictions is taken as the final output. The other is the full version of the proposed scheme, RICPR, in which the initialization is based jointly on texture correlation and pose correlation. In the texture correlation analysis, the face, whose location is provided by a face detector, is divided into non-overlapping sub-blocks. The uniform LBP operator with 8 sampling points is employed to obtain the texture information of the face; thus, the number of labels produced by the LBP operator is 59. In the pose correlation analysis, we utilize MTCNN [26] to predict the five fiducial landmarks. A threshold is used to determine whether the predictions lead to a good result. Since the proposed scheme always provides good initializations, the smart restarts used in RCPR are not needed. The number of initial shapes is set to 10 and $K$ in Algorithm 2 is set to 4. We run RICPR and RCPR with the same configuration.

We compare LBP-I-RCPR and RICPR with several state-of-the-art methods on the COFW dataset using the NME (Normalized Mean Error) defined by Eq. 8:

$\mathrm{NME} = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{1}{M} \sum_{j=1}^{M} \dfrac{\| p_{i,j} - g_{i,j} \|_2}{\| l_i - r_i \|_2} \times 100\%$        (8)

where $N$ is the number of images in the test set, $M$ is the number of landmarks in one image, $p_{i,j}$ is the predicted position of the $j$-th landmark of the $i$-th image, $g_{i,j}$ is its ground truth position, and $l_i$ and $r_i$ are the ground truth positions of the left and right eye centres of the $i$-th image, respectively.
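
The NME of Eq. 8 can be computed directly from the predicted and ground-truth landmarks; a short sketch (the array shapes are assumptions about how the annotations are stored):

import numpy as np

def normalized_mean_error(pred, gt, left_eye, right_eye):
    # pred, gt            : (n_images, n_landmarks, 2) predicted / ground-truth landmarks.
    # left_eye, right_eye : (n_images, 2) ground-truth eye centres used for normalization.
    per_landmark = np.linalg.norm(pred - gt, axis=2)             # ||p_ij - g_ij||
    interocular = np.linalg.norm(left_eye - right_eye, axis=1)   # ||l_i - r_i||
    per_image = per_landmark.mean(axis=1) / interocular
    return 100.0 * per_image.mean()                              # NME in percent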

Based on the NME, we also plot Cumulative Error Distribution (CED) curves, which are calculated from the per-image NME, to further analyse the performance of the proposed scheme. We also evaluate the speed of the proposed scheme on the COFW dataset, measured in Frames Per Second (FPS). All methods are implemented in Matlab R2015b and run on a PC with a 3.60 GHz CPU and the 64-bit Windows 7 operating system.

Figure 8: CED curves on the COFW dataset.

4.2 Results

1) Analysis of initialization based on texture correlation: Instead of randomly selecting shapes from the training set as the initialization in RCPR, we employ a texture correlated initialization (LBP-I-RCPR) by computing LBP histograms. To prove the effectiveness of the texture correlated initialization method, we compare the performance of LBP-I-RCPR with RCPR on the COFW dataset as shown in Fig. 7.

The NMEs for initial shapes with different correlation distances are shown in Fig. 7. The results show that the NME decreases as the correlation distance decreases, and LBP-I-RCPR can significantly reduce the NME by at least 45%. This indicates that initial shapes selected from the training faces based on texture correlation are closer to the real shape of the testing face.

Moreover, given different initial shapes for each image, the variance between their predictions is used to determine whether the face belongs to the "good" class, as stated in [11]. As shown in Fig. 7, the number of "good" instances increases as the correlation distance decreases, and more of the 507 testing images belong to the "good" class when using LBP-I-RCPR. The number of "good" instances increases dramatically, by at least 45%, so fewer bad initial shapes are selected. Furthermore, with the RICPR scheme the number of "good" instances increases from 395 to 504 among the 507 images, which means that fewer than 1% of the instances are "bad"; thus the initialization becomes more robust.

We also initialize the shapes using other features, including Local Derivative Pattern (LDP) [40], Gabor, Gaussian Markov Random Field (GMRF), Gray-Level Difference Statistics (GLDS), Gray-Level Co-occurrence Matrix (GLCM), and Eigenface. The NME and occlusion detection performance of each feature are reported in Table 1. The results indicate that the initialization based on LBP histogram correlation performs best.

Figure 9: Occlusion detection result on the COFW dataset.

2) Facial landmark localization evaluation on COFW: Many facial landmark localization methods do not perform well on the COFW dataset due to its large variation in occlusion. To evaluate the proposed scheme, we compare it with several state-of-the-art methods, including RCPR [11], RPP [17], SDM [32], the Tasks-Constrained Deep Convolutional Network (TCDCN) [39], the Hierarchical Deformable Part Model (HPM) [33], CRASM [20] and the Hierarchical Occlusion Stage-wise Relational Dictionary (HOSRD) [13]. The comparisons of NME on the COFW dataset are given in Table 2.

We can see that RICPR obtains the smallest NME. Compared to RCPR, LBP-I-RCPR reduces the NME from 8.01 to 7.35, and RICPR further reduces it to 6.64, a reduction of 17.1% in total. RICPR even performs better than the recent CRASM method proposed in 2017. To get the pose correlated initial shapes, we use MTCNN to detect the five fiducial landmarks, and their accuracy plays a significant role in the overall performance. If the ground truth of the five fiducial landmarks is used, the NME reaches 5.52, which demonstrates that the proposed scheme can achieve an admirable performance if the five fiducial landmarks are detected accurately.

We also show the CED curves on the COFW dataset in Fig. 8. As can be seen, more images are localized accurately with the proposed scheme, which further demonstrates its superiority for facial landmark localization on face images with occlusions.

Figure 10: Results of SDM and RCPR using the proposed initialization methods.

3) Occlusion detection on COFW: Since the COFW dataset provides the ground truth of occlusion, we evaluate the occlusion detection on COFW and compare the proposed scheme with RCPR [11], HPM [33], CoR [24], RPP [17] and CRASM [20]. The occlusion prediction results are shown in Table 2 and Fig. 9. As can be seen, the proposed scheme also outperforms the state-of-the-art methods in occlusion detection.

When the precision is fixed at 80%, the proposed scheme achieves a recall of 54.6%, which is higher than the 42% obtained by RCPR, 37% by HPM, 41.44% by CoR and 48.45% by CRASM, and also better than the 78/40% precision/recall obtained by RPP. Even using only the LBP-I-RCPR scheme, the recall of occlusion detection reaches 51.4%. This demonstrates that the proposed scheme achieves a much higher accuracy of occlusion detection, which can provide significant benefits in real-world applications such as image texture analysis, facial expression understanding and face recognition. Fig. 11 shows example images with the results obtained by the proposed RICPR.

4) Run time: We record the speeds of RCPR, LBP-I-RCPR and RICPR on the COFW dataset: 5.3 FPS, 4.1 FPS and 4.0 FPS, respectively. The proposed scheme spends some extra time on calculating the correlation. The speed could be improved by implementing the scheme in C++ or running it on a more powerful server. In the future we will try to further improve the efficiency of the proposed scheme, for example by reducing the number of face images used for the texture correlation based initialization.

4.3 Generalization of the Proposed Initialization Scheme

The experimental results demonstrate that the proposed initialization scheme significantly improves the performance of RCPR in both localization and occlusion prediction. Since the initialization is usually independent of the facial landmark localization algorithm, the proposed initialization scheme can be applied to other algorithms such as SDM. The results are shown in Fig. 10, where the baseline is the original SDM or RCPR, LBP-I+baseline is the texture correlated initialization scheme applied to SDM or RCPR, and RI+baseline denotes the joint texture and pose correlated initialization scheme applied to SDM or RCPR. Compared with the original SDM, which is based on random initialization, LBP-I-SDM and RI-SDM reduce the NME by 14.6% and 19%, respectively. The results indicate that the proposed initialization scheme can also improve the performance of SDM.

Figure 11: Example result of facial landmark localization and occlusion detection obtained by the proposed RICPR on the COFW dataset.

5 Conclusions

In this paper, we propose a robust initialization scheme to address the initialization sensitivity of the cascaded pose regression approach by jointly analyzing the texture and pose of a testing face. By examining the correlation of local binary pattern histograms between the testing face and the training faces, texture correlated shapes are selected instead of random shapes. At the same time, a pose correlated initialization is proposed to further improve the robustness of the initialization by estimating the face pose. Experimental results show that the proposed scheme obtains remarkably higher accuracy in both facial landmark localization and occlusion detection than the state-of-the-art benchmarks. Moreover, since the initialization is usually independent of the facial landmark localization algorithm, the proposed initialization scheme has the potential to be extended and applied to other algorithms.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (Grant No. 61601337).

Footnotes

  1. The source code of the proposed scheme can be found at https://github.com/pervadepyy/robust-initialization-rcpr

References

  1. S. Yang, P. Luo, C. C. Loy, and X. Tang, “From facial parts responses to face detection: A deep learning approach,” in IEEE International Conference on Computer Vision, 2015, pp. 3676–3684.
  2. R. Weng, J. Lu, and Y. P. Tan, “Robust point set matching for partial face recognition,” IEEE Transactions on Image Processing, vol. 25, no. 3, pp. 1163–1176, 2016.
  3. H. Li, D. Huang, J. M. Morvan, Y. Wang, and L. Chen, “Towards 3D face recognition in the real: A registration-free approach using fine-grained matching of 3d keypoint descriptors,” International Journal of Computer Vision, vol. 113, no. 2, pp. 128–142, 2015.
  4. Y. Tai, J. Yang, Y. Zhang, L. Luo, J. Qian, and Y. Chen, “Face recognition with pose variations and misalignment via orthogonal procrustes regression,” IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2673–2683, 2016.
  5. Y. Li, S. Wang, Y. Zhao, and Q. Ji, “Simultaneous facial feature tracking and facial expression recognition,” IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2559–2573, 2013.
  6. S. K. A. Kamarol, M. H. Jaward, H. Kälviäinen, J. Parkkinen, and R. Parthiban, “Joint facial expression recognition and intensity estimation based on weighted votes of image sequences,” Pattern Recognition, vol. 92, pp. 25–32, 2017.
  7. W. Zhang, Y. Zhang, L. Ma, J. Guan, and S. Gong, “Multimodal learning for facial expression recognition,” Pattern Recognition, vol. 48, no. 10, pp. 3191–3202, 2015.
  8. X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2879–2886.
  9. S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 FPS via regressing local binary features,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1685–1692.
  10. S. Zhu, C. Li, C. Change Loy, and X. Tang, “Face alignment by coarse-to-fine shape searching,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4998–5006.
  11. X. P. Burgos-Artizzu, P. Perona, and P. Dollar, “Robust face landmark estimation under occlusion,” in IEEE International Conference on Computer Vision, 2013, pp. 1513–1520.
  12. A. Jourabloo and X. Liu, “Large-pose face alignment via CNN-based dense 3D model fitting,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4188–4196.
  13. J. Xing, Z. Niu, J. Huang, W. Hu, X. Zhou, and S. Yan, “Towards robust and accurate multi-view and partially-occluded face alignment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2017.
  14. P. Dollar, P. Welinder, and P. Perona, “Cascaded pose regression,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1078–1085.
  15. X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2887–2894.
  16. V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.
  17. H. Yang, X. He, X. Jia, and I. Patras, “Robust face alignment under occlusion via regional predictive power estimation,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2393–2403, 2015.
  18. G. Tzimiropoulos, “Project-out cascaded regression with an application to face alignment,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2015, pp. 3659–3667.
  19. Q. Liu, J. Deng, and D. Tao, “Dual sparse constrained cascade regression for robust face alignment,” IEEE Transactions on Image Processing, vol. 25, no. 2, pp. 700–712, 2016.
  20. Q. Liu, J. Deng, J. Yang, G. Liu, and D. Tao, “Adaptive cascade regression model for robust face alignment,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 797–807, 2017.
  21. D. Lee, H. Park, and C. D. Yoo, “Face alignment using cascade gaussian process regression trees,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4204–4212.
  22. B. M. Smith and C. R. Dyer, “Efficient branching cascaded regression for face alignment under significant head rotation,” CoRR, vol. abs/1611.01584, 2016. [Online]. Available: http://arxiv.org/abs/1611.01584
  23. J. Zhang, M. Kan, S. Shan, and X. Chen, “Occlusion-free face alignment: Deep regression networks coupled with de-corrupt autoencoders,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3428–3437.
  24. X. Yu, Z. Lin, J. Brandt, and D. N. Metaxas, “Consensus of regression for occlusion-robust facial feature localization,” in European Conference Computer Vision, 2014, pp. 105–118.
  25. K. Seshadri and M. Savvides, “Towards a unified framework for pose, expression, and occlusion tolerant automatic facial alignment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2110–2122, 2016.
  26. K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
  27. Y. Pan, J. Zhou, Y. Gao, J. Xiang, S. Xiong, and Y. Yang, “Robust facial landmark localization using LBP histogram correlation based initialization,” in IEEE International Conference on Automatic Face Gesture Recognition, 2017, pp. 619–625.
  28. T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” in European Conference on Computer Vision, 1998, pp. 484–498.
  29. I. Matthews and S. Baker, “Active appearance models revisited,” International Journal of Computer Vision, vol. 60, no. 2, pp. 135–164, 2004.
  30. J. Alabort-i Medina and S. Zafeiriou, “Bayesian active appearance models,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3438–3445.
  31. E. Antonakos, J. Alabort-I-Medina, G. Tzimiropoulos, and S. P. Zafeiriou, “Feature-based lucas–kanade and active appearance models,” IEEE Transactions on Image Processing, vol. 24, no. 9, p. 2617, 2015.
  32. X. Xiong and F. De la Torre, “Supervised descent method and its applications to face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
  33. G. Ghiasi and C. C. Fowlkes, “Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1899–1906.
  34. T. Ojala, M. Pietikäinen, and D. Harwood, “A comparative study of texture measures with classification based on featured distributions,” Pattern Recognition, vol. 29, no. 1, pp. 51–59, 1996.
  35. T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, July 2002.
  36. T. Ahonen, A. Hadid, and M. Pietikäinen, “Face recognition with local binary patterns,” in European Conference on Computer Vision, 2004, pp. 469–481.
  37. K. Pearson, “Note on regression and inheritance in the case of two parents,” Proceedings of the Royal Society of London, vol. 58, pp. 240–242, 1895.
  38. V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, no. 2, pp. 155–166, 2009.
  39. Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Learning deep representation for face alignment with auxiliary attributes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 5, pp. 918–930, 2016.
  40. B. Zhang, Y. Gao, S. Zhao, and J. Liu, “Local derivative pattern versus local binary pattern: Face recognition with high-order local pattern descriptor,” IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 533–544, 2010.