AirFace: Lightweight and Efficient Model for Face Recognition
With the development of convolutional neural networks, significant progress has been made in computer vision tasks. However, the commonly used softmax loss and the highly efficient network architectures designed for common visual tasks are not as effective for face recognition. In this paper, we propose a novel loss function named Li-ArcFace, based on ArcFace. Li-ArcFace passes the angle through a linear function rather than a cosine function to obtain the target logit, which gives better convergence and performance when learning low-dimensional embedding features for face recognition. In terms of network architecture, we improve the performance of MobileFaceNet by increasing the network depth and width and by adding an attention module. In addition, we found some useful training tricks for face recognition. With all of the above, we won second place in the deepglint-light challenge of LFR2019.
The development of deep convolutional neural networks (DCNNs) has brought remarkable progress in a series of computer vision tasks. However, the common methods are not so effective when applied to face recognition. Softmax loss, which is commonly used in classification, can't maximize inter-class variance and minimize intra-class variance of the embedding feature vectors. In order to obtain highly discriminative embedding features (see Figure 1), a series of novel loss functions have been proposed in recent years, such as A-Softmax [12], CosFace/AM-Softmax [21, 19] and ArcFace [5]. Among them, ArcFace achieved state-of-the-art performance by adding an additive margin between classes in the angle space. However, when ArcFace is used to train efficient networks with a small embedding size (128), it is hard to train from scratch; for example, MobileFaceNet [4], whose embedding feature size is 128, can't converge when trained from scratch with ArcFace. As a result, pre-training with softmax loss is always required to guarantee convergence. To overcome this problem, we carefully designed a novel loss function named Li-ArcFace, based on ArcFace, which converges better.
Face recognition technology is now widely used on mobile devices, which requires that the computational cost of the model not be too large. In recent years, some highly efficient neural network architectures have been proposed, such as MobileNetV1 [10], ShuffleNet [23] and MobileNetV2 [16], but they are all designed for common visual recognition tasks rather than face recognition, and their performance is mediocre when they are used for face recognition directly. MobileFaceNet, which is designed for face recognition based on MobileNetV2, achieved remarkable accuracy on LFW [11] and AgeDB [15]. It is even comparable to state-of-the-art big DCNN models on MegaFace [14] Challenge 1 with much smaller computational resources. In this paper, we carefully designed a higher-performance network architecture based on MobileFaceNet within a limited amount of computation.
2 Related Work
Loss function. Softmax loss is the most commonly used loss function in classification, which is presented as:
$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T}x_i + b_j}}$$
where $x_i \in \mathbb{R}^d$ denotes the embedding feature of the $i$-th sample belonging to the $y_i$-th class, and the dimension of the embedding feature (hereinafter referred to as embedding size) is set as $d$. $W_j \in \mathbb{R}^d$ denotes the $j$-th column of the weight $W \in \mathbb{R}^{d \times n}$ and $b_j$ is the bias term. The batch size is $N$ and the class number of training data is $n$. However, the traditional softmax loss lacks the power to supervise the embedding feature to minimize inter-class similarity and maximize intra-class similarity. In SphereFace [12] and NormFace [20], the bias term is first removed and then $\|W_j\| = 1$ and $\|x_i\| = s$ are fixed by normalisation, such that the logit is:
$$W_j^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j = s\cos\theta_j$$
where $\theta_j$ denotes the angle between $W_j$ and $x_i$. Thus the logit depends only on the cosine of the angle. The modified loss function can be formulated as:
$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}}$$
Although $L_2$ guarantees a high similarity of features of the same person, it can't separate different classes very well. In this paper, we use N-Softmax to denote $L_2$. In ArcFace [5], the authors added an additive angular margin $m$ within $\cos\theta_{y_i}$, which can enhance the intra-class compactness and inter-class discrepancy. ArcFace is formulated as:
$$L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}}$$
When ArcFace is used to train models with 512-dimensional embedding features, it converges well and achieves state-of-the-art performance. However, it is difficult to converge when training some highly efficient networks with 128-dimensional embedding features from scratch.
Network architectures. Face recognition is increasingly used on mobile devices, so it is important to optimize the trade-off between accuracy and computational cost when designing deep neural network architectures. In recent years, some highly efficient neural network architectures have been proposed for common visual tasks. MobileNetV1 [10] used depthwise separable convolution instead of traditional convolution to reduce computational cost and improve network efficiency. MobileNetV2 [16] introduced inverted residuals and linear bottlenecks to further improve network efficiency. However, these lightweight architectures are not so accurate when used unchanged for face recognition. The authors of MobileFaceNet [4] identified the weakness of common mobile networks for face recognition and solved it by replacing global average pooling with global depthwise convolution (GDC). In addition, the network architecture of MobileFaceNet is specifically designed for face recognition, with smaller expansion factors in bottlenecks and more output channels at the beginning of the network.
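The saving from depthwise separable convolution can be checked with a little arithmetic. The sketch below counts multiply-adds for a standard convolution and for its depthwise separable counterpart, assuming stride 1 and an output feature map of size h × w. The function names and the layer sizes are illustrative and are not taken from any of the cited networks.

```python
def standard_conv_madds(k, c_in, c_out, h, w):
    """Multiply-adds of a k x k standard convolution on an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_madds(k, c_in, c_out, h, w):
    """Multiply-adds of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 conv that mixes the channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 output map.
standard = standard_conv_madds(3, 64, 128, 56, 56)
separable = depthwise_separable_madds(3, 64, 128, 56, 56)
ratio = separable / standard
print(standard, separable, round(ratio, 4))
```

For these sizes the ratio equals 1/c_out + 1/k², so a 3 × 3 depthwise separable layer needs roughly one eighth to one ninth of the computation of the standard layer.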
3 Proposed Approach
3.1 Li-ArcFace
In ArcFace, the authors added an angular margin $m$ within $\cos\theta_{y_i}$, which takes $\cos(\theta_{y_i} + m)$ as the target logit. The loss function we propose takes the angle passed through a linear function as the logit, rather than through the cosine function. In the same way, we remove the bias term and fix the norms of the class weight $W_j$ and the embedding feature $x_i$ to $\|W_j\| = 1$ and $\|x_i\| = 1$ by normalisation; $\theta_j$ denotes the angle between $W_j$ and $x_i$, thus $\theta_j = \arccos(W_j^{T}x_i)$. First, we construct a linear function $f(x) = (\pi - 2x)/\pi$, so that $f(\theta_j) = (\pi - 2\theta_j)/\pi$. Then we add an additive angular margin $m$ in the target logit. In the end we have $(\pi - 2(\theta_{y_i} + m))/\pi$ as the target logit. We call this novel loss function Li-ArcFace, where the prefix Li refers to the linear function. With the feature scale $s$, batch size $N$ and class number $n$, the whole Li-ArcFace loss can be formulated as follows:
$$L_4 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\pi - 2(\theta_{y_i}+m))/\pi}}{e^{s(\pi - 2(\theta_{y_i}+m))/\pi} + \sum_{j=1, j\neq y_i}^{n} e^{s(\pi - 2\theta_j)/\pi}}$$
There are two main advantages of using this linear function in place of the cosine function. Firstly, it is monotonically decreasing when the angle is between $0$ and $\pi + m$, which leads to better convergence, especially for models with a small embedding size. For example, training MobileFaceNet with ArcFace from scratch leads to divergence (NaN), so softmax loss must be used for pre-training before ArcFace can converge. The proposed loss does not require this two-stage training. Secondly, the penalty of the proposed loss function increases linearly as the angle between the embedding feature and the class center increases, so the target logit decreases linearly, which is more intuitive (see Figure 2). It does not decline rapidly at some angles but slowly at others (corresponding to the gradient of the target logit curve), which gives the proposed loss function better final performance. In terms of geometric decision margins, ArcFace has an overlap area when the embedding feature deviates too much from the class center, because its target logit curve is non-monotonic. Features in this area can be classified as either class 1 or class 2 with ArcFace. The proposed loss function does not have this overlap area (see Figure 3).
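To make the formulation concrete, here is a minimal pure-Python sketch of the Li-ArcFace loss for a single sample, assuming the target logit is s(π − 2(θ + m))/π and the other logits are s(π − 2θ)/π. It is not the authors' training code: the function names and the toy cosine values are illustrative, and the defaults m = 0.45 and s = 64 are only borrowed from the experimental settings reported later in the paper. It also assumes that the cosine similarities between the normalised feature and the normalised class weights have already been computed.

```python
import math


def li_arcface_logits(cosines, target, m=0.45, s=64.0):
    """Scaled Li-ArcFace logits for one sample.

    cosines: cos(theta_j) between the L2-normalised feature and each
             L2-normalised class weight (illustrative input).
    target:  index of the ground-truth class.
    """
    logits = []
    for j, c in enumerate(cosines):
        # Clamp to guard against rounding slightly outside [-1, 1].
        theta = math.acos(max(-1.0, min(1.0, c)))
        if j == target:
            theta += m  # additive angular margin on the target class only
        logits.append(s * (math.pi - 2.0 * theta) / math.pi)
    return logits


def cross_entropy(logits, target):
    """Softmax cross-entropy computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Toy example: three classes, the sample is closest to class 0.
cosines = [0.8, 0.1, -0.3]
loss_with_margin = cross_entropy(li_arcface_logits(cosines, 0, m=0.45), 0)
loss_no_margin = cross_entropy(li_arcface_logits(cosines, 0, m=0.0), 0)
print(loss_with_margin, loss_no_margin)
```

The margin lowers only the target logit, so the same sample incurs a larger loss with the margin than without it, which is what pushes features closer to their class center.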
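The monotonicity argument can be checked numerically. The sketch below evaluates both target logit curves (before scaling) on a one-degree grid over [0, π] and tests whether each is strictly decreasing. It assumes the ArcFace target logit cos(θ + m) and the Li-ArcFace target logit (π − 2(θ + m))/π; the helper names and the choice m = 0.45 for both curves are illustrative.

```python
import math


def arcface_target(theta, m):
    """ArcFace target logit before scaling: cos(theta + m)."""
    return math.cos(theta + m)


def li_arcface_target(theta, m):
    """Li-ArcFace target logit before scaling: (pi - 2(theta + m)) / pi."""
    return (math.pi - 2.0 * (theta + m)) / math.pi


def is_strictly_decreasing(values):
    return all(a > b for a, b in zip(values, values[1:]))


m = 0.45
thetas = [k * math.pi / 180 for k in range(181)]  # 0 to 180 degrees
arc = [arcface_target(t, m) for t in thetas]
li = [li_arcface_target(t, m) for t in thetas]

# cos(theta + m) starts rising again once theta + m exceeds pi.
print(is_strictly_decreasing(arc))
print(is_strictly_decreasing(li))
```

The turning point of the ArcFace curve sits at θ = π − m, so a larger margin widens the range of angles in which the target logit grows as the feature moves further from its class center.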
3.2 Network Architectures
In this section, we introduce our proposed network architecture. It is based on a deeper MobileFaceNet (y2), so the residual bottlenecks proposed in MobileNetV2 are used as our main building blocks. Table 1 shows the details of our network architecture. Following MobileFaceNet, the expansion factors for bottlenecks in our architecture are much smaller than those in MobileNetV2, and we use PReLU as the non-linearity rather than ReLU. In addition, we noticed the importance of network width in face recognition. MobileFaceNet is significantly wider than MobileNetV2 at the beginning of the network, but it does not double the output channels when downsampling at the last stage. We did so within a limited amount of computation. At the same time, we carefully adjusted the depth of the network and introduced the attention module CBAM [22] into every bottleneck. However, we changed the last activation function in the attention module from sigmoid to 1+tanh, whose range is 0 to 2. When a channel or spatial position needs to be enhanced, it is multiplied by a weight between one and two; when it needs to be weakened, it is multiplied by a weight between zero and one. Using 1+tanh instead of sigmoid is more intuitive and makes training converge faster.
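The effect of replacing sigmoid with 1+tanh can be illustrated with a simplified gate. The sketch below is not a CBAM implementation: it omits the pooling, the shared MLP and the spatial branch, and simply assumes that a pre-activation attention score per channel is already available. All names and values are hypothetical.

```python
import math


def sigmoid_gate(z):
    """Standard attention gate with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh_gate(z):
    """Modified gate 1 + tanh(z) with range (0, 2)."""
    return 1.0 + math.tanh(z)


def apply_channel_gate(channels, scores, gate):
    """Scale each channel (a list of activations) by gate(score)."""
    return [[gate(s) * v for v in ch] for ch, s in zip(channels, scores)]


channels = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = [-2.0, 0.0, 2.0]  # hypothetical pre-activation attention scores
gated = apply_channel_gate(channels, scores, tanh_gate)
print(gated)
```

With a score of zero the 1+tanh gate leaves the channel unchanged, whereas the sigmoid gate would halve it; negative scores weaken a channel and positive scores amplify it by up to a factor of two.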
3.3 Training Tricks for Face Recognition
During the LFR competition, we found some useful training tricks for face recognition. Firstly, fine-tuning the model with a variety of loss functions makes the features more robust and improves the accuracy to some extent. In the competition, we used Li-ArcFace, ArcFace and the combined loss to fine-tune our model. Secondly, in the 512-dimensional embedding feature space, it is difficult for a lightweight model to learn the distribution of the features. An effective remedy is to use large models to guide the feature distribution of lightweight models [9, 13].
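The paper does not state which guidance objective was used, so the following is only one plausible sketch of feature-level guidance: a mean squared distance between the L2-normalised embeddings of the lightweight (student) model and the large (teacher) model. The loss form, the function names and the toy vectors are assumptions made for illustration, and both embeddings are assumed to have the same dimension.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def feature_guidance_loss(student, teacher):
    """Mean squared distance between L2-normalised embeddings (assumed objective)."""
    s, t = l2_normalize(student), l2_normalize(teacher)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


# Toy 2-dimensional embeddings; real embeddings would be 512-dimensional.
same_direction = feature_guidance_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = feature_guidance_loss([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because both embeddings are normalised first, the loss depends only on the angle between them, which matches the angular nature of the margin losses used for training.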
4.1 Ablation Experiment of Li-ArcFace
In this section, we mainly introduce some comparative experiments of different loss functions.
Implementation Details. We use MobileFaceNet as the network architecture, with the embedding size set to 128. The batch size is set to 256 × 4, and models are trained on four NVIDIA TITAN Xp GPUs. We use SGD with momentum 0.9 to optimize the models. The feature scale $s$ is set to 64. The training data set is CASIA-WebFace, which contains 10K identities and 0.5M images. The learning rate starts at 0.1, is divided by 10 at 18K, 26K and 29K iterations, and training stops at 30K iterations. Finally, we compare the performance on Labeled Faces in the Wild (LFW) [11], Celebrities in Frontal-Profile (CFP) [18] and Age Database (AgeDB) [15]. In this paper, ArcFace (m=0.5, NS) means that the margin of ArcFace is set to 0.5 and that the model is pre-trained with N-Softmax before ArcFace is used.
Weight decay. Before comparing different loss functions, we first ran experiments on weight decay and settled on a weight decay of 5e-4 for the network, except for the last layer leading to the embedding layer, whose weight decay is 5e-3. According to the experimental results (see Table 2), the last layer requires a larger weight decay.
Effect of Hyper-parameter m. In Table 3, we first explored the optimal setting of $m$ for Li-ArcFace and found it to be between 0.4 and 0.45. We also tried training the model from scratch with ArcFace, but training diverged after about 1600 iterations. Therefore, all the experiments with ArcFace were pre-trained with N-Softmax. Since the embedding size is smaller than 512, we also explored the optimal setting of $m$ for ArcFace and found it to be between 0.45 and 0.5.
Comparison with state-of-the-art loss functions. In Table 3, the difference between Li-ArcFace, ArcFace and CosFace is tiny on LFW, but all of them are clearly better than N-Softmax. On CFP-FP and AgeDB, Li-ArcFace is slightly better than ArcFace and CosFace. We plot the accuracy on CFP-FP and AgeDB during training, which makes the contrast more obvious (see Figure 4). We also compared Li-ArcFace and ArcFace under the same setting, with both pre-trained with N-Softmax. On CFP-FP, Li-ArcFace achieves the highest accuracy, and it is still slightly better than ArcFace on AgeDB. In general, there is little difference in accuracy on the verification sets, but Li-ArcFace converges better and does not need a pre-training stage when training a model with a small embedding size.
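The step learning-rate schedule described above can be written as a small function. This is a sketch under one assumption that the text does not settle: the rate drops at the exact iteration of each milestone (`>=`). The function name and parameter names are illustrative.

```python
def learning_rate(iteration, base_lr=0.1,
                  milestones=(18_000, 26_000, 29_000), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone passed."""
    passed = sum(1 for step in milestones if iteration >= step)
    return base_lr * factor ** passed


# Rates at a few points of the 30K-iteration run.
for it in (0, 18_000, 26_000, 29_000):
    print(it, learning_rate(it))
```

The rate therefore stays at 0.1 for the first 18K iterations and is reduced by a factor of ten at each milestone until training stops at 30K iterations.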
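To show how a per-layer weight decay enters the optimizer, the sketch below performs one SGD-with-momentum step on a single scalar weight. It follows one common convention (weight decay added to the gradient, then accumulated in the momentum buffer); frameworks differ in the details, and the group names `backbone` and `last_layer` are hypothetical labels rather than names from the authors' code.

```python
# Weight decay values from the text; the group names are hypothetical.
WEIGHT_DECAY = {"backbone": 5e-4, "last_layer": 5e-3}


def sgd_momentum_step(w, grad, buf, lr, weight_decay, momentum=0.9):
    """One SGD step with momentum and L2 weight decay on a scalar weight."""
    g = grad + weight_decay * w   # weight decay acts as an extra gradient term
    buf = momentum * buf + g      # momentum buffer accumulates gradients
    return w - lr * buf, buf


# With a zero data gradient, only weight decay shrinks the weight.
w_backbone, _ = sgd_momentum_step(1.0, 0.0, 0.0, 0.1, WEIGHT_DECAY["backbone"])
w_last, _ = sgd_momentum_step(1.0, 0.0, 0.0, 0.1, WEIGHT_DECAY["last_layer"])
print(w_backbone, w_last)
```

With the same learning rate, the last layer is pulled towards zero ten times more strongly than the rest of the network, which is the practical meaning of the two settings above.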
4.2 Evaluation Results of Network Architecture and Training Tricks
We name the model that combines our network architecture and training tricks AirFace. Under the same training data set, MS1M-retina, and the same model constraints (MS1M-retina is cleaned from MS-Celeb-1M [8], and RetinaFace [7] is the face detector and alignment tool used to pre-process the data), the accuracy of AirFace reached 88.415%@FPR=1e-8 in the deepglint-light challenge of LFR19 [6, 3]. Meanwhile, we verified the performance of AirFace on MegaFace Challenge 1 against previous state-of-the-art models. As shown in Table 4, AirFace achieves high efficiency and strong performance.
In this paper, we first propose a novel additive margin loss function for deep face recognition based on ArcFace. The proposed loss function solves the problem that ArcFace does not converge when training models with a small embedding feature size, and it achieves state-of-the-art results on several face verification datasets. Second, we carefully designed an efficient network architecture and explored some useful training tricks for face recognition, which make our model AirFace extremely efficient on both the deepglint-light challenge and MegaFace Challenge 1.
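A figure such as 88.415%@FPR=1e-8 is an accuracy measured at a fixed false positive rate. The sketch below shows one simple way to compute a true positive rate at a target FPR from similarity scores of genuine and impostor pairs. It is not the challenge's official evaluation protocol: the function name, the thresholding rule and the toy scores are assumptions for illustration.

```python
def tpr_at_fpr(genuine, impostor, target_fpr):
    """True positive rate at the loosest threshold whose FPR <= target_fpr.

    A pair is accepted when its similarity score is strictly above the threshold.
    """
    ranked = sorted(impostor, reverse=True)
    allowed = int(target_fpr * len(ranked))  # impostor pairs allowed to pass
    # Using the (allowed + 1)-th highest impostor score as the threshold
    # lets at most `allowed` impostor pairs through.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine if score > threshold)
    return accepted / len(genuine)


# Toy similarity scores, not real benchmark data.
genuine = [0.95, 0.5, 0.25, 0.8]
impostor = [0.1, 0.2, 0.3, 0.9]
print(tpr_at_fpr(genuine, impostor, 0.25))
print(tpr_at_fpr(genuine, impostor, 0.0))
```

At a target as strict as 1e-8, a meaningful estimate needs on the order of hundreds of millions of impostor pairs, which is why such figures are reported only on very large test sets.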
- [1] https://github.com/deepinsight/insightface.
- [2] https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop.
- [3] http://www.insightface-challenge.com/overview.
- [4] S. Chen, Y. Liu, X. Gao, and Z. Han. MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices. In Chinese Conference on Biometric Recognition, pages 428–438. Springer, 2018.
- [5] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
- [6] J. Deng, J. Guo, D. Zhang, Y. Deng, X. Lu, S. Shi, and S. Zafeiriou. Lightweight face recognition challenge. 2019.
- [7] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou. RetinaFace: Single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641, 2019.
- [8] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
- [9] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [11] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua. Labeled faces in the wild: A survey. In Advances in Face Detection and Facial Image Analysis, pages 189–248. Springer, 2016.
- [12] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.
- [13] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang. Face model compression by distilling knowledge from neurons. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- [14] D. Miller, E. Brossard, S. Seitz, and I. Kemelmacher-Shlizerman. MegaFace: A million faces for recognition at scale. arXiv preprint arXiv:1505.02108, 2015.
- [15] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. AgeDB: The first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 51–59, 2017.
- [16] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
- [17] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
- [18] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
- [19] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
- [20] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1041–1049. ACM, 2017.
- [21] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
- [22] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
- [23] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.