MobileFAN: Transferring Deep Hidden Representation for Face Alignment


Yang Zhao      Yifan Liu      Chunhua Shen      Yongsheng Gao      Shengwu Xiong
Griffith University     The University of Adelaide     Wuhan University of Technology
Abstract

Facial landmark detection is a crucial prerequisite for many face analysis applications. Deep learning-based methods currently dominate facial landmark detection. However, such works generally introduce a large number of parameters, resulting in a high memory cost. In this paper, we aim for a lightweight as well as effective solution to facial landmark detection. To this end, we propose an effective lightweight model, namely the Mobile Face Alignment Network (MobileFAN), using the simple backbone MobileNetV2 as the encoder and three deconvolutional layers as the decoder. The proposed MobileFAN, with only a fraction of the model size and a lower computational cost, achieves superior or equivalent performance compared with state-of-the-art models. Moreover, by transferring the geometric structural information of a face graph from a large complex model to our proposed MobileFAN through feature-aligned distillation and feature-similarity distillation, the performance of MobileFAN is further improved in effectiveness and efficiency for face alignment. Extensive experimental results on three challenging facial landmark estimation benchmarks, including COFW, 300W and WFLW, show the superiority of our proposed MobileFAN over state-of-the-art methods.

1 Introduction

Facial landmark detection, a.k.a. face alignment, is a crucial step for various downstream face applications, including face recognition [27], facial attribute estimation [45], face pose estimation [14] and so forth. Face alignment aims to find the coordinates of several predefined landmarks or parts, such as the eye center, eyebrow, nose tip, mouth and chin, on a face graph. Although great progress has been made on accuracy in the past decades [43, 15], approaches focusing on simple, small and lightweight networks for face alignment have received relatively little attention.

Significant improvements via deep Convolutional Neural Networks (CNNs) have been achieved on facial landmark detection recently [30, 4, 29], even though it remains a very challenging task when dealing with faces in real-world conditions (e.g., faces with unconstrained large pose variations and heavy occlusions). To guarantee promising performance on face alignment benchmarks, the majority of those works adopt large backbones (e.g., Hourglass [21] and ResNet-50 [10]), carefully designed schemes (e.g., a coarse-to-fine cascade regression framework [28]), or extra face structure information (e.g., face boundary information [30]). Recently, neural networks with small model size, light computation cost and high accuracy have attracted much attention because of the need for applications on mobile devices [38]. In this paper, we investigate the possibility of optimizing facial landmark detection with a simpler and smaller model. We propose a plain model without bells and whistles, namely the Mobile Face Alignment Network (MobileFAN), which employs an encoder-decoder architecture in the form of a convolution-deconvolution network. In the proposed MobileFAN, MobileNetV2 [26] is adopted as the encoder, while the decoder is constructed from three deconvolutional layers. Model details are given in Section 3.

More recently, knowledge distillation (KD) has attracted much attention for its simplicity and efficiency [11]. Motivated by KD and TCNN [33], which has shown that intermediate features from deep networks are good at predicting different head poses in facial landmark detection, we further introduce knowledge transfer techniques to help the training of our proposed lightweight face alignment network. Because our proposed MobileFAN uses several deconvolutional layers sequentially to map from the input space to the output space, we transfer the useful information in the intermediate feature maps from a teacher to a student.

Inspired by [18, 24], we propose to align the deconvolutional feature maps between student and teacher models. Specifically, the feature map generated by the student network is transformed into a new feature map that matches the size of the corresponding feature map generated by the teacher network. The mean squared error (MSE) is used as the loss function to measure the distance between the teacher's feature map and the student's new feature map. We term this scheme feature-aligned distillation; it transfers the distribution of the intermediate feature maps produced by the teacher network to the student network.

To distill more structured knowledge from the teacher network, inspired by [19], we apply feature-similarity distillation to our framework. The similarity matrix is generated by computing the cosine similarity between feature vectors. We find that the similarity matrix can be used to represent the structure information of a face image. It contains the directional knowledge between features, which can be thought of as a kind of structure information. With the help of feature-similarity distillation, the student network is trained to make its similarity matrix similar to that of the teacher network. Our knowledge transfer method is depicted in Fig. 1 (c).

The interest of this work lies in exploring a simple, small and lightweight network that can achieve comparable or even better results than much larger models on common facial landmark detection benchmarks.

To summarize, our main contributions are as follows.

  • We propose a simple and lightweight network, namely the Mobile Face Alignment Network (MobileFAN), for face alignment, which achieves results comparable to those of state-of-the-art large and complicated models. We show that a simple and compact network can still handle the face alignment problem with high accuracy.

  • We introduce two kinds of knowledge transfer methods, feature-aligned distillation and feature-similarity distillation, to help the training process of our proposed MobileFAN. Specifically, the hidden representation of the large model is transferred to that of the proposed MobileFAN to achieve higher performance.

  • Extensive experiments demonstrate the effectiveness of our method on three challenging benchmark datasets, COFW, 300W and WFLW, as well as its superior performance over several state-of-the-art methods.

2 Related Work

In this section, we present an overview of related work on facial landmark detection and knowledge distillation.

2.1 Facial Landmark Detection

Traditional Methods. Facial landmark detection has been an active topic for more than twenty years. Cascade regression has attracted a lot of attention; it focuses on learning a cascade of regressors that iteratively update the shape estimate. Burgos-Artizzu et al. [1] proposed Robust Cascaded Pose Regression (RCPR), which reduces exposure to outliers by detecting occlusions explicitly and using robust shape-indexed features. Explicit Shape Regression (ESR) [2] introduced two-level boosted regression and correlation-based feature selection. Ren et al. [22] proposed Local Binary Features (LBF), which are computationally cheap and thus enable very fast regression for face alignment tasks.

CNN-based Methods. Beyond these early works, deep learning-based face alignment approaches have achieved state-of-the-art performance. They can be divided into two categories: coordinate regression-based methods and heatmap regression-based methods. A coordinate regression-based method estimates the landmark coordinate vector from the input image directly. The earliest such work dates back to [28], where Sun et al. trained a three-level cascaded CNN to locate the facial landmarks in a coarse-to-fine manner and obtained promising landmark detection results. A multi-task learning framework was proposed by Zhang et al. [42] to optimize face alignment and correlated facial attributes, such as pose, expression and gender, simultaneously. More recently, Feng et al. [7] proposed a new loss function, namely the Wing loss, tailored to regressing facial coordinates. They showed that the Wing loss, combined with a strong data augmentation method, pose-based data balancing (PDB), obtains better performance than the widely used L2 loss. Different from the above methods, our approach regards face alignment as a dense prediction problem.

A heatmap regression-based method generates a probability heatmap for each landmark. Thanks to the development of the Hourglass network [21], heatmap regression has been successfully applied to landmark localization problems. Yang et al. [37] adopted a supervised face transformation based on the Hourglass to reduce the variance of the target. LAB [30] utilized boundary lines to characterize the geometric structure of a face image and thus improved the detection of facial landmarks. However, both methods rely on the Hourglass and therefore introduce a large number of parameters. Valle et al. [29] used a simple CNN to generate heatmaps of landmark locations as a better initialization for an Ensemble of Regression Trees (ERT) regressor.

By contrast, our model requires neither cascaded networks nor large backbones, leading to great reduction in model parameters and computation complexity, whilst still achieving comparable or even better accuracy.

2.2 Knowledge Distillation

Deep CNN models currently dominate many computer vision tasks [13, 39]. However, these models commonly introduce millions of parameters, leading to large model sizes and expensive computation. As a result, it is difficult to deploy such models in real-time applications. This motivates research on smaller networks that can fit large training data while maintaining performance. Knowledge distillation (KD) [11] has attracted much attention due to its capability of transferring rich information from a large and complex teacher network to a small and compact student network, and it is widely used in model compression. Originally, KD was used for image classification [11], where a compact model learns from the output of a large model, namely the soft target, so that the student is supervised by both softened labels and hard labels simultaneously.

Following [11], subsequent works have transferred intermediate representations of the teacher network to the student network, achieving great progress in image classification [24, 40], object detection [18] and semantic segmentation [19]. Romero et al. proposed FitNets [24], which directly align full feature maps of the teacher and student models. Attention transfer (AT) [40] regularizes the learning of the student network by imitating the attention maps of a powerful teacher network. Liu et al. [19] proposed to distill pair-wise information from the teacher model to a student model via the last convolutional features. Unlike previous approaches, we perform distillation through multiple features.

Given the effectiveness of knowledge distillation in the above applications, we are motivated to simultaneously use both feature-aligned distillation and feature-similarity distillation on intermediate features in this work.

Figure 1: The structure of the proposed MobileFAN and an overview of our knowledge transfer framework. (a) The process indicated by the blue dotted arrow is the teacher network, which is made up of ResNet-50 and three deconvolutional layers. (b) The process indicated by the blue arrow is MobileFAN (the student network), which consists of MobileNetV2 and three deconvolutional layers. (c) The knowledge transfer module, which is designed to help the training of MobileFAN by introducing feature-aligned distillation and feature-similarity distillation.

3 Method

In this section, we start with an introduction of network architectures. Then we take a look at standard MSE loss and two knowledge distillation schemes: feature-aligned distillation and feature-similarity distillation.

3.1 Network Architectures

Teacher Network. ResNet is a common backbone network used in many computer vision tasks. In this work, we design a teacher network following [34], in which a simple structure is proposed that combines a ResNet backbone with three deconvolutional layers. Our teacher network uses ResNet-50 as the encoder for feature extraction; in particular, the last average pooling layer and the classification layer are removed. The decoder is added on top of the last bottleneck of ResNet-50. Specifically, the decoder is made up of three deconvolutional layers, each with 256 filters, a 4×4 kernel and a stride of 2. A 1×1 convolutional layer is then added after the last deconvolutional layer to generate likelihood heatmaps. Our proposed teacher network architecture is shown in Fig. 1 (a).

Student Network. To make our framework easy to reproduce, we propose a student network, termed the Mobile Face Alignment Network (MobileFAN), with a structure similar to the teacher network. MobileNetV2 [26], built on an inverted residual structure with linear bottlenecks, is a common backbone for image feature extraction on mobile devices. In this work, MobileNetV2 is used as the encoder of the student network, and three deconvolutional layers added on top of its last bottleneck serve as the decoder. Each deconvolutional layer has 128 filters with a 4×4 kernel and a stride of 2. As in most heatmap-based methods, a 1×1 convolutional layer is added after the last deconvolutional layer to generate one likelihood heatmap per facial landmark. An illustration of the detailed network structure is shown in Fig. 1 (b).
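As a sanity check on the spatial dimensions implied by this design, the following sketch applies the standard transposed-convolution size formula. It assumes a 256×256 input crop, a stride-32 encoder, and padding 1 for each 4×4, stride-2 deconvolution; the padding value is our assumption, not stated above.

```python
def deconv_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * pad + kernel

# A 256x256 face crop passed through a stride-32 encoder (e.g. MobileNetV2
# or ResNet-50) yields an 8x8 feature map; the three stride-2 deconvolutional
# layers of the decoder then upsample it to 64x64 likelihood heatmaps.
size = 256 // 32
for _ in range(3):
    size = deconv_out(size)
print(size)  # 64
```

Each deconvolution thus doubles the spatial resolution, giving heatmaps at one quarter of the input resolution.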

3.2 Loss Function

Mean Squared Error. Following [34], we employ the Mean Squared Error (MSE) loss to compare the predicted heatmaps H with the ground-truth heatmaps Ĥ generated from the annotated 2D facial landmarks.

Specifically, Ĥ = {Ĥ_1, …, Ĥ_{N_l}} is a set of response maps, one per facial landmark, where j = 1, …, N_l indexes the landmarks. Here the heatmap Ĥ_j for the j-th landmark is a 2D Gaussian centered on that landmark's location. Let (x_j, y_j) be the ground-truth position of the j-th facial landmark; the value at location (x, y) in Ĥ_j is defined as:

Ĥ_j(x, y) = exp( −((x − x_j)² + (y − y_j)²) / (2σ²) ),   (1)

where σ controls the spread of the Gaussian.

Therefore, the loss between the predicted heatmaps H and the ground-truth heatmaps Ĥ is defined as:

L_MSE = (1/N_l) Σ_{j=1}^{N_l} ‖H_j − Ĥ_j‖₂².   (2)

After training, each landmark location is obtained from the corresponding predicted heatmap by mapping the highest-valued heatmap location back to the original image space.
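The heatmap construction of Equation (1) and this decoding step can be sketched as follows. This is a NumPy illustration rather than the authors' code; the value of σ and the 4× heatmap-to-image scale are assumed for the example.

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Ground-truth heatmap: a 2D Gaussian centred on landmark (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_landmark(heatmap, scale=4.0):
    """Map the highest-valued heatmap location back to image coordinates."""
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return cx * scale, cy * scale

gt = gaussian_heatmap(64, 64, cx=20, cy=30)   # peak value 1.0 at (20, 30)
print(decode_landmark(gt))                    # (80.0, 120.0)
```

In practice the decoded coordinate is often refined by a sub-pixel shift toward the second-highest neighbour, but the argmax form above captures the idea.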

Knowledge Transfer. In the student-teacher framework, apart from the standard MSE loss L_MSE, we further introduce knowledge transfer losses to help the training of our student network (MobileFAN). In other words, we want the student network to learn not only the information provided by the ground-truth labels, but also the finer structure knowledge encoded by the teacher network. Let T and S denote the teacher network and the student network, with corresponding weights W_T and W_S. Details of the knowledge distillation are described below.

Feature-Aligned Distillation. To transfer richer facial details (e.g., exaggerated expressions and head poses) learned by the teacher network to the student network, we perform feature-aligned distillation so that the distribution of a student feature map becomes similar to that of the teacher. Given a feature map A_s of the student network, consisting of C_s feature planes with spatial dimensions H × W, we denote the corresponding feature map of the teacher network as A_t, with C_t feature planes of the same spatial dimensions. We adopt a 1×1 convolutional layer to align the feature maps between the student network and the teacher network. In this way, an aligned mapping function r(·) (w.r.t. that layer) takes the feature map A_s as input and outputs an aligned new feature map r(A_s):

r(A_s) = W_r ∗ A_s,   (3)

where W_r denotes the weights of the 1×1 convolutional layer.

Therefore, the feature-aligned transfer loss between the teacher network and the student network is defined as:

L_FA = ‖r(A_s) − A_t‖₂².   (4)
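In array terms, the 1×1 convolution r(·) is a per-pixel linear map over channels, and Equation (4) reduces to a mean squared difference. The sketch below illustrates this with random tensors; in practice W_r is learned jointly with the student, and the channel sizes here are illustrative.

```python
import numpy as np

def align(a_s, w_r):
    """1x1 convolution: map student channels (Cs, H, W) to teacher channels (Ct, H, W)."""
    cs, h, w = a_s.shape
    return (w_r @ a_s.reshape(cs, -1)).reshape(w_r.shape[0], h, w)

def feature_aligned_loss(a_s, a_t, w_r):
    """Equation (4): MSE between the aligned student map and the teacher map."""
    return np.mean((align(a_s, w_r) - a_t) ** 2)

rng = np.random.default_rng(0)
a_s = rng.standard_normal((128, 16, 16))      # student feature map, Cs = 128
a_t = rng.standard_normal((256, 16, 16))      # teacher feature map, Ct = 256
w_r = rng.standard_normal((256, 128)) * 0.01  # 1x1 conv weights (learnable)
loss = feature_aligned_loss(a_s, a_t, w_r)
```

Because the 1×1 convolution mixes only channels, the spatial layout of the student features is preserved while their dimensionality is matched to the teacher's.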

Feature-Similarity Distillation. Facial images are geometrically constrained. As described above, we adopt feature-similarity distillation to transfer more structural information from the teacher network to the student network (MobileFAN) by comparing their similarity matrices. The similarity matrix represents basic facial structures and textures, which can provide richer directional information for facial landmark detection. We compute cosine similarities over the whole feature map, making the relative spatial positions between facial landmarks more precise. Given a feature map of dimensions C × H × W, where C is the number of channels and H × W is the spatial size, let f_i denote the feature vector extracted at the i-th spatial location of this feature map. The cosine similarity between the feature vectors f_i and f_j is calculated as:

s_ij = f_iᵀ f_j / (‖f_i‖₂ ‖f_j‖₂).   (5)

Suppose the feature maps of the student network and the teacher network are A_s and A_t, respectively. Let s^s_ij denote the similarity between the feature vectors at the i-th and j-th locations of the student feature map A_s, and let s^t_ij denote the corresponding similarity computed from the teacher feature map A_t. Then the feature-similarity transfer loss can be formulated as:

L_FS = (1/|I|²) Σ_{i∈I} Σ_{j∈I} (s^s_ij − s^t_ij)²,   (6)

where I denotes the set of all spatial locations.
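Equations (5) and (6) can be sketched as follows (a NumPy illustration). The feature map is flattened into H·W location vectors before computing the pairwise cosine similarities, so student and teacher may have different channel counts as long as the spatial sizes match.

```python
import numpy as np

def similarity_matrix(feat):
    """Eq. (5): pairwise cosine similarity between all spatial feature vectors.

    feat: (C, H, W) feature map -> (H*W, H*W) similarity matrix.
    """
    f = feat.reshape(feat.shape[0], -1).T            # rows are location vectors
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    return f @ f.T

def feature_similarity_loss(a_s, a_t):
    """Eq. (6): mean squared difference between the two similarity matrices."""
    return np.mean((similarity_matrix(a_s) - similarity_matrix(a_t)) ** 2)

rng = np.random.default_rng(0)
a_s = rng.standard_normal((128, 8, 8))  # student deconvolutional feature map
a_t = rng.standard_normal((256, 8, 8))  # teacher deconvolutional feature map
loss = feature_similarity_loss(a_s, a_t)
```

Note that no channel alignment is needed here: the similarity matrices are H·W × H·W regardless of the channel count, which is why this loss captures purely spatial (structural) relations.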

3.3 Distillation over Scales

In order to transfer low-, mid- and high-level geometric structure information from the teacher network to the student network (MobileFAN), we extend the combination of feature-aligned distillation and feature-similarity distillation to all three deconvolutional layers. As shown in Fig. 1 (c), the three deconvolutional layers of the student network are guided by those of the teacher network, which provide significantly richer facial details. This differs from most previous methods [19], which only add supervision on the last convolutional layer. We perform both feature-aligned distillation and feature-similarity distillation on the three deconvolutional feature maps during training, so the student network is trained to optimize the following loss function:

L = L_MSE + λ Σ_{l=1}^{3} (L^l_FA + L^l_FS),   (7)

where λ is a tunable parameter balancing the MSE loss and the distillation losses, and L^l_FA and L^l_FS are the feature-aligned loss and the feature-similarity loss of the l-th deconvolutional layer. Extensive experiments show that, with the help of knowledge distilled from feature maps at different scales, the accuracy of facial landmark detection can be significantly increased.
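Assuming the per-layer distillation losses have already been computed, the overall objective of Equation (7) is just a weighted sum; a minimal sketch (the function and argument names are illustrative):

```python
def total_loss(mse, fa_losses, fs_losses, lam):
    """Equation (7): MSE loss plus feature-aligned and feature-similarity
    distillation losses from the three deconvolutional layers, weighted by lam."""
    assert len(fa_losses) == len(fs_losses) == 3
    return mse + lam * sum(fa + fs for fa, fs in zip(fa_losses, fs_losses))

# Example with already-computed per-layer distillation losses:
loss = total_loss(mse=0.5, fa_losses=[0.1, 0.2, 0.3],
                  fs_losses=[0.05, 0.05, 0.1], lam=0.5)
```

A single λ is shared across the three stages, so the only extra hyper-parameter introduced by distillation is this balancing weight.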

3.4 Learning Procedure

Training the proposed MobileFAN. To evaluate the performance of the proposed vanilla MobileFAN, we optimize MobileFAN only with the standard MSE loss (Equation (2)), without any extra losses. The experimental results indicate that our proposed MobileFAN, a simple and small network, can still handle facial landmark detection with satisfactory performance.

Training MobileFAN with distilled knowledge. To transfer distilled knowledge from a large, complicated network to the proposed MobileFAN, we treat MobileFAN as the student in a student-teacher framework. Fig. 1 summarizes the training of the knowledge transfer framework. Specifically, the teacher network (Fig. 1 (a)) is pre-trained and its parameters are kept frozen during training. The training of the proposed MobileFAN is supervised by the standard MSE loss, the feature-aligned loss and the feature-similarity loss. In other words, guided by the pre-trained parameters of the teacher network, we train the parameters of MobileFAN to minimize Equation (7).

4 Experiments

4.1 Datasets

We perform experiments on three challenging public datasets: the Caltech Occluded Faces in the Wild (COFW) dataset [1], the 300 Faces in the Wild (300W) dataset [25] and the Wider Facial Landmarks in the Wild (WFLW) dataset [30].

COFW. The face images in COFW exhibit heavy occlusions and large shape variations, which are common issues in realistic conditions [1]. Its training set has 1,345 faces and its testing set has 507 faces. Each image in the COFW dataset has 29 manually annotated landmarks, as shown in Fig. 2(a).

300W. The 300W [25] dataset is a widely used facial landmark detection benchmark consisting of the HELEN, LFPW, AFW and IBUG datasets. Images in the HELEN, LFPW and AFW datasets are collected in in-the-wild environments, where large pose variations, expression variations and partial occlusions may exist. Each face in the 300W dataset has 68 annotated facial landmarks, as shown in Fig. 2(b). We follow the same protocol as [23] and adopt 3,148 images for training (2,000 images from the training subset of HELEN, 811 images from the training subset of LFPW and 337 images from the full set of AFW). For testing, the Full test set has 689 images, comprising the Common subset (554 images) and the Challenging subset (135 images). The Common subset is composed of the HELEN test subset (330 images) and the LFPW test subset (224 images), while the Challenging subset is the IBUG dataset.

WFLW. WFLW [30] is a recently proposed facial landmark dataset based on WIDER FACE. It comprises 7,500 face images for training and 2,500 face images for testing, each with 98 manually annotated landmarks (shown in Fig. 2(c)). Faces in WFLW are collected under unconstrained conditions, such as large pose variations, exaggerated expressions and heavy occlusions. To validate robustness under different conditions, the WFLW test set is further divided into several subsets: large pose (326 images), expression (314 images), illumination (698 images), make-up (206 images), occlusion (736 images) and blur (773 images). We report the results of all competing methods on the whole test set and on each testing subset of WFLW.

(a) COFW
(b) 300W
(c) WFLW
Figure 2: An illustration of the landmark annotations of the (a) COFW dataset, (b) 300W dataset and (c) WFLW dataset.
Protocol Training Set Size Test Set Size #Landmarks Normalisation Term
COFW 1,345 507 29 inter-ocular distance
300W Full set 3,148 689 68 inter-ocular distance
WFLW 7,500 2,500 98 inter-ocular distance
Table 1: A summary of the evaluation protocols used in our experiments.

4.2 Evaluation Metrics and Implementation Details

Evaluation Metrics. We adopt the normalized mean error and the area-under-the-curve (AUC) as evaluation metrics. For all datasets (COFW, 300W and WFLW), we use the distance between the outer eye corners (the inter-ocular distance) as the normalization term [1, 30].

The mean error is defined as the average Euclidean distance between the predicted facial landmark locations p̂_{k,j} and their corresponding ground-truth locations p_{k,j}:

NME = (1/N) Σ_{k=1}^{N} [ (1/N_l) Σ_{j=1}^{N_l} ‖p_{k,j} − p̂_{k,j}‖₂ ] / d,   (8)

where N is the number of images in the test set and N_l is the number of landmarks (as listed in TABLE 1). d is the normalization factor, here the inter-ocular distance.
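Equation (8) in code (a NumPy sketch; pred and gt are (N, N_l, 2) landmark arrays and d holds each image's inter-ocular distance):

```python
import numpy as np

def normalized_mean_error(pred, gt, d):
    """Eq. (8): mean point-to-point error per image, normalized by the
    inter-ocular distance d of that image, averaged over the N test images."""
    per_point = np.linalg.norm(pred - gt, axis=2)  # (N, N_l) Euclidean distances
    per_image = per_point.mean(axis=1) / d         # normalize each image's error
    return per_image.mean()

# Toy check: every point is 3 px off on faces with a 60 px inter-ocular distance.
gt = np.zeros((2, 68, 2))
pred = gt + np.array([3.0, 0.0])
print(normalized_mean_error(pred, gt, d=np.array([60.0, 60.0])))  # 0.05
```

Reported mean errors are this quantity expressed as a percentage, so the toy example above corresponds to a 5% NME.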

We also measure the Cumulative Errors Distribution (CED) curve and the failure rate (defined as the proportion of faces whose detection fails) on these benchmarks. Specifically, any normalized error above 0.1 is considered a failure [30]. A summary of the detailed evaluation protocols used in our experiments is listed in TABLE 1.

Implementation Details. All face images, for both training and testing, are cropped and scaled to 256×256 according to the center location and the provided bounding box [30, 3]. Standard data augmentation is performed to make the networks robust to data variations. Specifically, we follow [7, 30] and augment samples with random in-plane rotation, random scaling and random horizontal flipping. The Adam optimizer is used for training, with the base learning rate decayed in two steps during training. The balancing weight λ is set separately for the COFW dataset and for the 300W and WFLW datasets. All our face alignment models are implemented with the PyTorch toolbox and trained on one GPU.

Method Mean Error Failure Rate
Human [1] -
RCPR [1]
HPM [9]
CCR [6]
DRDA [41]
RAR [35]
SFPD [32] -
DAC-CSR [8]
CNN (Wing + PDB) [7]
ResNet (Wing + PDB) [7]
LAB [30]
Teacher
MobileFAN (0.5)
MobileFAN
MobileFAN (0.5) + KD
MobileFAN + KD
Table 2: Mean error (%) and failure rate (%) on COFW test set.

4.3 Comparison with state-of-the-art methods

We compare the proposed method against the state-of-the-art methods on each dataset. To further explore the effectiveness of a smaller decoder for face alignment, we adopt a channel-halved version of MobileFAN, in which the dimension of each deconvolutional layer is reduced from 128 to 64; this architecture is denoted MobileFAN (0.5). We apply our distillation method to both lightweight networks, MobileFAN and MobileFAN (0.5). For simplicity, we name the full models trained with the combination of feature-aligned distillation and feature-similarity distillation over all deconvolutional layers “MobileFAN + KD” and “MobileFAN (0.5) + KD”. Similarly, the baseline models trained without distillation are named “MobileFAN” and “MobileFAN (0.5)”. We use “Teacher” to denote our proposed teacher network.

Figure 3: CED curves of the baselines, teacher network and the proposed MobileFAN with KD on COFW test set.
Figure 4: Example alignment results of MobileFAN on COFW [1] test set.

Evaluation on COFW. TABLE 2 reports the results of the state-of-the-art methods on the COFW test set. We can see that the proposed simple and small “MobileFAN” achieves a low mean error and failure rate without any extra information. Although the mean error of “MobileFAN (0.5)” is slightly higher than that of LAB [30], it is still comparable. With the knowledge transferred from “Teacher”, our method outperforms existing methods by a clear margin in mean error. More specifically, our proposed models, “MobileFAN + KD” and “MobileFAN (0.5) + KD”, achieve the best performance on the COFW dataset, improving on LAB [30] even though the latter uses extra boundary information. “MobileFAN + KD” achieves results comparable to those of the teacher network. This is not surprising, since our proposed distillation method provides rich structural information about a face image, which contributes to the performance of facial landmark detection. The CED curves in Fig. 3 show that the distilled small networks perform better than the baselines and achieve performance comparable to the “Teacher” network. Fig. 4 shows some example alignment results, demonstrating the robustness of “MobileFAN” to various occlusions.

Evaluation on 300W. The 300W dataset is a challenging face alignment benchmark because of its variations in pose and expression. TABLE 3 compares our performance with previous methods on the 300W dataset. We observe that the simple “MobileFAN” performs better than the state-of-the-art SAN [4], while the number of model parameters of “MobileFAN” is much smaller than that of SAN (see TABLE 5). Although “MobileFAN + KD” does not outperform DCFE [29], it achieves results comparable to LAB [30], which uses extra boundary information, on the 300W Full set and Common subset. Using knowledge distillation, our two full models are better than their corresponding baselines: “MobileFAN + KD” improves over “MobileFAN” on the 300W Full set, Challenging subset and Common subset. Although “MobileFAN (0.5) + KD” fails to compete with other state-of-the-art methods, possibly because the number of output score maps (68 for the 300W dataset) is larger than the dimension of its final deconvolutional layer (64), it still reduces the mean error on the 300W Full set over its baseline “MobileFAN (0.5)”. Fig. 5 visualizes some of our results. It can be observed that, driven by the knowledge transfer technique, our model captures various facial expressions accurately.

Method Common Challenging Full
RCN [12]
DAN [16]
PCD-CNN [17]
CPM [5]
DSRN [20]
SAN [4]
LAB [30] 2.98
DCFE [29]
Teacher
MobileFAN (0.5)
MobileFAN
MobileFAN (0.5) + KD
MobileFAN + KD
Table 3: Mean error (%) on the 300W Common subset, Challenging subset and Full set.
Figure 5: Example alignment results of MobileFAN + KD on the 300W [25] Full set.

Evaluation on WFLW. A summary of the performance obtained by state-of-the-art methods and the proposed approach on the WFLW test set and its six subsets is shown in TABLE 4. As indicated in TABLE 4, our proposed “MobileFAN” outperforms LAB [30], which uses boundary information, on the test set and all six subsets, and outperforms ResNet (Wing + PDB) [7], which uses strong data augmentation, on the test set and the Make-up and Occlusion subsets. Although “MobileFAN” performs slightly worse than ResNet (Wing + PDB) [7] on the remaining subsets, it achieves comparable results with merely a fraction of the parameters of ResNet (Wing + PDB) (see TABLE 5). We can see that the “MobileFAN + KD” model outperforms the state-of-the-art methods in both mean error and failure rate on the WFLW test set. In particular, compared with the former best models, ResNet (Wing + PDB) [7] and LAB [30], our “MobileFAN + KD” achieves a significant mean error reduction with respect to both methods on the WFLW test set, and likewise reduces the failure rate relative to both.

Method test pose expr. illu. mu. occu. blur
Mean Error
ESR [2]
SDM [36]
CFSS [44]
DVLN [31]
LAB [30]
ResNet(Wing+PDB) [7]
Teacher
MobileFAN (0.5)
MobileFAN (0.5) + KD
MobileFAN
MobileFAN + KD
Failure Rate
ESR [2]
SDM [36]
CFSS [44]
DVLN [31]
LAB [30]
ResNet(Wing+PDB) [7]
Teacher
MobileFAN (0.5)
MobileFAN (0.5) + KD
MobileFAN
MobileFAN + KD
AUC
ESR [2]
SDM [36]
CFSS [44]
DVLN [31]
LAB [30]
ResNet(Wing+PDB) [7]
Teacher
MobileFAN (0.5)
MobileFAN (0.5) + KD
MobileFAN
MobileFAN + KD
Table 4: Mean error (%), failure rate (%) and AUC on the WFLW test set and six subsets: pose, expression (expr.), illumination (illu.), make-up (mu.), occlusion (occu.) and blur.
Figure 6: Normalized mean error (%) on the WFLW [30] test set and typical subsets for LAB [30], ResNet (Wing + PDB) [7] and the proposed method.
(a) 0.08
(b) 0.04
(c) 0.08
(d) 0.04
Figure 7: Results on example samples of WFLW [30] under different mean error thresholds: 0.08 and 0.04. Top row: results of LAB [30]. Middle row: results of MobileFAN. Bottom row: results of MobileFAN + KD. Red points indicate that the normalized mean error is larger than the threshold.
Figure 8: Example face alignment results on the WFLW [30] test set. Top row: face alignment results of LAB [30]. Second row: face alignment results of our proposed MobileFAN. Third row: face alignment results of our MobileFAN + KD. Bottom row: ground-truth annotations. Red points indicate that the normalized error is larger than the threshold.

To provide a more straightforward comparison, Fig. 6 compares our “MobileFAN” and “MobileFAN + KD” against ResNet (Wing + PDB) and LAB on the WFLW test set and four typical subsets. We can see that “MobileFAN” outperforms ResNet (Wing + PDB) and LAB by a large margin on the WFLW test set and the Make-up and Occlusion subsets, let alone “MobileFAN + KD” with the help of the “Teacher”. In particular, “MobileFAN” achieves a relative improvement in mean error over ResNet (Wing + PDB) on the Make-up subset, as well as a relative improvement over LAB on the Occlusion subset. Although “MobileFAN” only matches ResNet (Wing + PDB) on the Blur and Expression subsets, it outperforms LAB by a large margin while using far fewer parameters. With the help of knowledge distillation, “MobileFAN + KD” achieves state-of-the-art performance on the WFLW test set and all six subsets. The results indicate that our proposed lightweight model is robust to extreme conditions. The advantages of our “MobileFAN” can be seen visually in Fig. 7, where we compare LAB, “MobileFAN” and “MobileFAN + KD” under different mean error thresholds. Our method yields more landmarks with low error than LAB. The third row of Fig. 7 depicts the further improvements brought by feature-aligned distillation and feature-similarity distillation, where the knowledge transfer techniques provide richer facial details that make the relative spatial positions between facial landmarks more precise.

Some example results of LAB (model available from https://github.com/wywu/LAB), “MobileFAN”, “MobileFAN + KD” and the ground truth on the WFLW test set are shown in Fig. 8. We can observe that “MobileFAN + KD” improves the accuracy of landmarks on the face contour (chin), eyebrows, eye corners and so on.

Model Size and Computational Cost Analysis. To further evaluate model size and computational complexity, we report the number of network parameters (#Params), the number of floating point operations (FLOPs) and the speed of our approach and other competing methods. The FLOPs of our model are calculated at the input resolution used in our experiments. Frames per second (fps) is adopted to measure computation speed; the fps calculation is performed on an NVIDIA GeForce GTX 1070 card. The model size of our compact network is the smallest: we can see from TABLE 5 that the proposed models have the fewest parameters and the lowest computational complexity compared with LAB [30] and ResNet (Wing + PDB) [7], while remaining effective for facial landmark localization. Specifically, MobileFAN (0.5) and MobileFAN have only M and M parameters, respectively. Although our MobileFAN has far fewer parameters, e.g., only a fraction of the model size of LAB, ResNet (Wing + PDB) and SAN, it achieves comparable or even better results than the state-of-the-art methods. The speed of the proposed MobileFAN is 238 fps on a GPU card, which outperforms the state-of-the-art approaches by a significant margin.

Method Backbone #Params (M) FLOPs (B) Speed (fps)
DVLN [31] VGG- -
SAN [4] ResNet-
LAB [30] Hourglass
ResNet (Wing + PDB) [7] ResNet-
MobileFAN MobileNetV2
MobileFAN (0.5) MobileNetV2
Table 5: A comparison of different networks in backbone, model size (the number of model parameters), computational cost (FLOPs) and speed (fps).
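As a rough sanity check on such parameter and FLOPs budgets, the cost of a single convolutional layer can be computed in closed form. The sketch below is illustrative only; the layer shapes are assumptions, not the actual MobileFAN configuration:

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution (or deconvolution)."""
    return c_out * (c_in * k * k + 1)

def conv_flops(c_in, c_out, k, h_out, w_out):
    """Approximate forward-pass FLOPs, counting each
    multiply-accumulate as two operations."""
    return 2 * h_out * w_out * conv_params(c_in, c_out, k)

# Illustrative 3x3 layer with 256 input/output channels
# producing a 64x64 feature map.
params = conv_params(256, 256, 3)           # 590,080 parameters
flops = conv_flops(256, 256, 3, 64, 64)     # ~4.8 GFLOPs
```

Summing such terms over every layer reproduces the #Params and FLOPs columns reported by profiling tools.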

4.4 Ablation study

Our framework consists of several components, such as feature-aligned distillation and feature-similarity distillation applied to different deconvolutional layers. In this section, we examine the effectiveness of the different distillation methods on the 300W Challenging subset. Starting from the baseline network, MobileFAN without distillation (“MobileFAN”), we evaluate the mean error under various combinations of each component, as summarized in TABLE 6. In addition, we analyze the influence of the hyperparameter (described in Section 3.2) on the COFW test set.

Proposed distillation component Abbreviation
feature-aligned distillation of the first deconvolutional layer FA1
feature-aligned distillation of the second deconvolutional layer FA2
feature-aligned distillation of the third deconvolutional layer FA3
feature-similarity distillation of the first deconvolutional layer FS1
feature-similarity distillation of the second deconvolutional layer FS2
feature-similarity distillation of the third deconvolutional layer FS3
Table 6: The proposed distillation components in our method.

Feature-aligned distillation over layers. To investigate the effectiveness of feature-aligned distillation in facial landmark detection, we implement it by adopting a convolutional layer to align the per-pixel features of the student network with those of the teacher network, such that the channel dimensions of the features match.
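As an illustration of this idea, the numpy sketch below aligns the student's channels with a learned projection (commonly a 1×1 convolution, implemented here as a matrix product over the channel axis) and then penalizes the per-pixel discrepancy with the teacher. The shapes, the random weights and the plain L2 objective are assumptions for illustration, not the exact training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def align_channels(student_feat, proj):
    """1x1 convolution expressed as a matrix product over the
    channel axis: (C_s, H, W) -> (C_t, H, W)."""
    c_s, h, w = student_feat.shape
    return (proj @ student_feat.reshape(c_s, -1)).reshape(-1, h, w)

def feature_aligned_loss(student_feat, teacher_feat, proj):
    """Mean squared error between the aligned student features
    and the teacher features at every pixel."""
    aligned = align_channels(student_feat, proj)
    return np.mean((aligned - teacher_feat) ** 2)

# Toy feature maps: student has 32 channels, teacher 64, on an 8x8 grid.
s = rng.standard_normal((32, 8, 8))
t = rng.standard_normal((64, 8, 8))
proj = rng.standard_normal((64, 32)) * 0.1   # learned jointly in practice
loss = feature_aligned_loss(s, t, proj)
```

In training, the projection is optimized together with the student, so the distillation loss only constrains the student's representation up to a linear channel mapping.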

We can see from TABLE 7 that feature-aligned distillation improves the performance of our proposed “MobileFAN”. By utilizing the distillation loss generated from “FA1”, “MobileFAN” achieves a lower mean error on the 300W Challenging subset. Moreover, the performance can be further improved by adding more layers of distillation: TABLE 7 shows that “MobileFAN + FA1 + FA2 + FA3” achieves a relative improvement over both “MobileFAN + FA1” and “MobileFAN + FA1 + FA2”. Similarly, the mean error of “MobileFAN + FA1 + FA2” is reduced compared with “MobileFAN + FA1”. Fig. 9 provides a straightforward comparison of MobileFAN without distillation and MobileFAN with feature-aligned distillation.

Method MobileFAN
w/o distillation
+ FA1
+ FA1 + FA2
+ FA1 + FA2 + FA3
Table 7: Mean error (%) of feature-aligned distillation of different deconvolutional layers on the 300W Challenging subset.
Method MobileFAN
w/o distillation
+ FS1
+ FS1 + FS2
+ FS1 + FS2 + FS3
Table 8: Mean error (%) of feature-similarity distillation of different deconvolutional layers on the 300W Challenging subset.
Figure 9: Comparison of MobileFAN with different distillation methods. Top row: results of MobileFAN without distillation. Second row: results of MobileFAN with feature-aligned distillation. Third row: results of MobileFAN with feature-similarity distillation. Bottom row: results of MobileFAN with both feature-aligned distillation and feature-similarity distillation. Red points indicate that the normalized error is larger than the threshold.
Figure 10: Normalized mean error (%) and failure rate (%) on the 300W Challenging subset for various combinations of feature-aligned distillation and feature-similarity distillation.

Feature-similarity distillation over layers. To explore the effectiveness of feature-similarity distillation, we evaluate the mean error under various combinations of layers on the 300W Challenging subset. As can be observed from TABLE 8, applying distillation to more layers leads to better performance.

In particular, “MobileFAN + FS1” outperforms “MobileFAN” without distillation by a large margin in mean error reduction. When one more layer of distillation is added, the mean error is further reduced, although the improvement is marginal. It is not surprising that “MobileFAN (0.5) + FS1 + FS2 + FS3” achieves the best mean error among all versions of “MobileFAN” on the 300W Challenging subset. The straightforward comparison in Fig. 9 shows that MobileFAN with feature-similarity distillation performs better than MobileFAN without distillation.
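Feature-similarity distillation can be sketched in the spirit of pairwise structured knowledge distillation [19]: instead of matching raw features, the student mimics the teacher's pairwise similarities between pixel feature vectors, so no channel alignment is needed. The cosine similarity measure and the shapes below are assumptions for illustration:

```python
import numpy as np

def pairwise_similarity(feat):
    """Cosine similarity between every pair of pixel feature
    vectors: (C, H, W) -> (H*W, H*W)."""
    c = feat.shape[0]
    v = feat.reshape(c, -1).T                               # (H*W, C)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
    return v @ v.T

def feature_similarity_loss(student_feat, teacher_feat):
    """Mean squared difference between the student's and the
    teacher's pairwise similarity matrices. The channel counts
    may differ, since only the similarity structure is compared."""
    s_sim = pairwise_similarity(student_feat)
    t_sim = pairwise_similarity(teacher_feat)
    return np.mean((s_sim - t_sim) ** 2)

# Toy maps: a 32-channel student and a 64-channel teacher on an 8x8 grid.
rng = np.random.default_rng(1)
loss = feature_similarity_loss(rng.standard_normal((32, 8, 8)),
                               rng.standard_normal((64, 8, 8)))
```

Because only relative similarities are transferred, this term encodes the geometric structure of the face graph rather than absolute feature values.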

Combination of feature-aligned distillation and feature-similarity distillation over layers. To evaluate the effectiveness of using feature-aligned distillation and feature-similarity distillation together, we report in Fig. 10 the normalized mean error and failure rate for various combinations of the two. Our “MobileFAN” with knowledge distillation performs better when both feature-aligned distillation and feature-similarity distillation are applied to more layers. In particular, “MobileFAN + FA1 + FS1 + FA2 + FS2 + FA3 + FS3” performs better than “MobileFAN + FA1 + FS1” and “MobileFAN + FA1 + FS1 + FA2 + FS2”, with a relative improvement in mean error reduction over each. We can also find that the failure rate of “MobileFAN + FA1 + FS1” is lower than that of “MobileFAN” without distillation. Similarly, the failure rate of “MobileFAN + FA1 + FS1 + FA2 + FS2 + FA3 + FS3” drops compared with that of “MobileFAN + FA1 + FS1 + FA2 + FS2”.

To summarize, the combination of feature-aligned distillation and feature-similarity distillation improves the performance of the compact network in facial landmark detection.

The impact of the hyperparameter. To investigate the impact of the hyperparameter on the training process of the proposed MobileFAN, we perform an ablation study with respect to the mean error on the COFW dataset. Experimental results in TABLE 9 list the mean errors obtained with different values of the hyperparameter, where a value of zero means that the proposed MobileFAN is trained without distillation. It can be observed that the proposed MobileFAN achieves consistently lower mean errors for a range of nonzero values compared with the mean error obtained without distillation. The best performance on the COFW dataset is a mean error of 3.66, achieved at an intermediate value. Nevertheless, if the hyperparameter is set too large, the contribution from the supervision of the ground-truth heatmap becomes limited, and the proposed MobileFAN may fail to converge. Overall, the influence of the hyperparameter on the training process of the proposed MobileFAN is positive, indicating the effectiveness of the proposed distillation schemes.

Mean Error 3.82 3.78 3.79 3.74 3.66 4.06
Table 9: The influence of the hyperparameter on the COFW dataset.
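The role of the hyperparameter can be summarized as the weight on the distillation terms in the overall objective. A minimal sketch, where the additive weighting of the two distillation losses is an assumption for illustration:

```python
def total_loss(heatmap_loss, fa_loss, fs_loss, lam):
    """Overall objective: ground-truth heatmap regression loss plus
    lam-weighted distillation terms. lam = 0 recovers plain MobileFAN
    training without distillation; an overly large lam drowns out the
    ground-truth supervision, matching the degradation in TABLE 9."""
    return heatmap_loss + lam * (fa_loss + fs_loss)

assert total_loss(1.0, 0.5, 0.5, 0.0) == 1.0   # no distillation
assert total_loss(1.0, 0.5, 0.5, 1.0) == 2.0   # balanced distillation
```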

5 Conclusion

In this paper, we focus on building a small facial landmark detection model, which remains an under-explored research problem. We propose a simple and lightweight Mobile Face Alignment Network (MobileFAN), using MobileNetV2 as the encoder and three simple deconvolutional layers as the decoder. MobileFAN avoids the use of a large backbone to minimize the number of parameters while maintaining high performance; this simple design significantly reduces the computational burden. With many times fewer parameters than state-of-the-art models, our MobileFAN still achieves comparable or even better performance on three challenging facial landmark detection datasets. We further propose a knowledge transfer technique to enhance the performance of MobileFAN: by transferring the finer structural information encoded by the teacher network, the performance of the proposed MobileFAN is further improved for facial landmark detection.

References

  • [1] X. P. Burgos-Artizzu, P. Perona, and P. Dollár (2013) Robust face landmark estimation under occlusion. In ICCV, pp. 1513–1520. Cited by: §2.1, Figure 4, §4.1, §4.1, §4.2, Table 2.
  • [2] X. Cao, Y. Wei, F. Wen, and J. Sun (2012) Face alignment by explicit shape regression. In CVPR, pp. 2887–2894. Cited by: §2.1, Table 4.
  • [3] Y. Chen, C. Shen, H. Chen, X. Wei, L. Liu, and J. Yang (2019) Adversarial learning of structure-aware fully convolutional networks for landmark localization. IEEE Trans. Pattern Anal. Mach. Intell.. External Links: Document Cited by: §4.2.
  • [4] X. Dong, Y. Yan, W. Ouyang, and Y. Yang (2018) Style aggregated network for facial landmark detection. In CVPR, pp. 379–388. Cited by: §1, §4.3, Table 3, Table 5.
  • [5] X. Dong, S. Yu, X. Weng, S. Wei, Y. Yang, and Y. Sheikh (2018) Supervision-by-registration: an unsupervised approach to improve the precision of facial landmark detectors. In CVPR, pp. 360–368. Cited by: Table 3.
  • [6] Z. Feng, G. Hu, J. Kittler, W. J. Christmas, and X. Wu (2015) Cascaded collaborative regression for robust facial landmark detection trained using a mixture of synthetic and real images with dynamic weighting. IEEE Trans. Image Process. 24 (11), pp. 3425–3440. Cited by: Table 2.
  • [7] Z. Feng, J. Kittler, M. Awais, P. Huber, and X. Wu (2018) Wing loss for robust facial landmark localisation with convolutional neural networks. In CVPR, pp. 2235–2245. Cited by: §2.1, Figure 6, §4.2, §4.3, §4.3, Table 2, Table 4, Table 5.
  • [8] Z. Feng, J. Kittler, W. J. Christmas, P. Huber, and X. Wu (2017) Dynamic attention-controlled cascaded shape regression exploiting training data augmentation and fuzzy-set sample weighting. In CVPR, pp. 3681–3690. Cited by: Table 2.
  • [9] G. Ghiasi and C. C. Fowlkes (2014) Occlusion coherence: localizing occluded faces with a hierarchical deformable part model. In CVPR, pp. 2385–2392. Cited by: Table 2.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1.
  • [11] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §2.2, §2.2.
  • [12] S. Honari, J. Yosinski, P. Vincent, and C. J. Pal (2016) Recombinator networks: learning coarse-to-fine feature aggregation. In CVPR, pp. 5743–5752. Cited by: Table 3.
  • [13] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang (2015) Multimodal deep autoencoder for human pose recovery. IEEE Trans. Image Process. 24 (12), pp. 5659–5670. Cited by: §2.2.
  • [14] C. Hong, J. Yu, J. Zhang, X. Jin, and K. Lee (2019) Multi-modal face pose estimation with multi-task manifold deep learning. IEEE Trans. Ind. Inf. 15 (7), pp. 3952–3961. Cited by: §1.
  • [15] X. Jin and X. Tan (2016) Face alignment by robust discriminative hough voting. Pattern Recognit. 60, pp. 318–333. Cited by: §1.
  • [16] M. Kowalski, J. Naruniec, and T. Trzcinski (2017) Deep alignment network: a convolutional neural network for robust face alignment. In CVPRW, pp. 88–97. Cited by: Table 3.
  • [17] A. Kumar and R. Chellappa (2018) Disentangling 3d pose in a dendritic CNN for unconstrained 2d face alignment. In CVPR, pp. 430–439. Cited by: Table 3.
  • [18] Q. Li, S. Jin, and J. Yan (2017) Mimicking very efficient network for object detection. In CVPR, pp. 6356–6364. Cited by: §1, §2.2.
  • [19] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang (2019) Structured knowledge distillation for semantic segmentation. arXiv preprint arXiv:1903.04197. Cited by: §1, §2.2, §3.3.
  • [20] X. Miao, X. Zhen, X. Liu, C. Deng, V. Athitsos, and H. Huang (2018) Direct shape regression networks for end-to-end face alignment. In CVPR, pp. 5040–5049. Cited by: Table 3.
  • [21] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, pp. 483–499. Cited by: §1, §2.1.
  • [22] S. Ren, X. Cao, Y. Wei, and J. Sun (2014) Face alignment at 3000 fps via regressing local binary features. In CVPR, pp. 1685–1692. Cited by: §2.1.
  • [23] S. Ren, X. Cao, Y. Wei, and J. Sun (2016) Face alignment via regressing local binary features. IEEE Trans. Image Process. 25 (3), pp. 1233–1245. Cited by: §4.1.
  • [24] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §1, §2.2.
  • [25] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic (2013) 300 faces in-the-wild challenge: the first facial landmark localization challenge. In ICCVW, pp. 397–403. Cited by: Figure 5, §4.1, §4.1.
  • [26] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §1, §3.1.
  • [27] S. Soltanpour, B. Boufama, and Q. J. Wu (2017) A survey of local feature methods for 3d face recognition. Pattern Recognit. 72, pp. 391–406. Cited by: §1.
  • [28] Y. Sun, X. Wang, and X. Tang (2013) Deep convolutional network cascade for facial point detection. In CVPR, pp. 3476–3483. Cited by: §1, §2.1.
  • [29] R. Valle, J. M. Buenaposada, A. Valdes, and L. Baumela (2018) A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment. In ECCV, pp. 585–601. Cited by: §1, §2.1, §4.3, Table 3.
  • [30] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou (2018) Look at boundary: a boundary-aware face alignment algorithm. In CVPR, pp. 2129–2138. Cited by: §1, §2.1, Figure 6, Figure 7, Figure 8, §4.1, §4.1, §4.2, §4.2, §4.2, §4.3, §4.3, §4.3, §4.3, Table 2, Table 3, Table 4, Table 5.
  • [31] W. Wu and S. Yang (2017) Leveraging intra and inter-dataset variations for robust face alignment. In CVPRW, pp. 2096–2105. Cited by: Table 4, Table 5.
  • [32] Y. Wu, C. Gou, and Q. Ji (2017) Simultaneous facial landmark detection, pose and deformation estimation under facial occlusion. In CVPR, pp. 5719–5728. Cited by: Table 2.
  • [33] Y. Wu, T. Hassner, K. Kim, G. Medioni, and P. Natarajan (2018) Facial landmark detection with tweaked convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 40 (12), pp. 3067–3074. Cited by: §1.
  • [34] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In ECCV, pp. 466–481. Cited by: §3.1, §3.2.
  • [35] S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim (2016) Robust facial landmark detection via recurrent attentive-refinement networks. In ECCV, pp. 57–72. Cited by: Table 2.
  • [36] X. Xiong and F. D. la Torre (2013) Supervised descent method and its applications to face alignment. In CVPR, pp. 532–539. Cited by: Table 4.
  • [37] J. Yang, Q. Liu, and K. Zhang (2017) Stacked hourglass network for robust facial landmark localisation. In CVPRW, pp. 2025–2033. Cited by: §2.1.
  • [38] J. Yu, B. Zhang, Z. Kuang, D. Lin, and J. Fan (2016) IPrivacy: image privacy protection by identifying sensitive objects via deep multi-task learning. IEEE Trans. Inf. Forensics Secur. 12 (5), pp. 1005–1016. Cited by: §1.
  • [39] J. Yu, C. Zhu, J. Zhang, Q. Huang, and D. Tao (2019) Spatial pyramid-enhanced netvlad with weighted triplet loss for place recognition. IEEE Trans. Neural Netw. Learn. Syst.,, pp. 1–14. External Links: Document Cited by: §2.2.
  • [40] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §2.2.
  • [41] J. Zhang, M. Kan, S. Shan, and X. Chen (2016) Occlusion-free face alignment: deep regression networks coupled with de-corrupt autoencoders. In CVPR, pp. 3428–3437. Cited by: Table 2.
  • [42] Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2016) Learning deep representation for face alignment with auxiliary attributes. IEEE Trans. Pattern Anal. Mach. Intell. 38 (5), pp. 918–930. Cited by: §2.1.
  • [43] Z. Zhang, W. Zhang, H. Ding, J. Liu, and X. Tang (2015) Hierarchical facial landmark localization via cascaded random binary patterns. Pattern Recognit. 48, pp. 1277–1288. Cited by: §1.
  • [44] S. Zhu, C. Li, C. C. Loy, and X. Tang (2015) Face alignment by coarse-to-fine shape searching. In CVPR, pp. 4998–5006. Cited by: Table 4.
  • [45] N. Zhuang, Y. Yan, S. Chen, H. Wang, and C. Shen (2018) Multi-label learning based deep transfer neural network for facial attribute classification. Pattern Recognit. 80, pp. 225–240. Cited by: §1.