AirFace: Lightweight and Efficient Model for Face Recognition
Abstract
With the development of convolutional neural networks, significant progress has been made in computer vision tasks. However, the commonly used softmax loss and the highly efficient network architectures designed for common visual tasks are not as effective for face recognition. In this paper, we propose a novel loss function named LiArcFace, based on ArcFace. LiArcFace takes the value of the angle passed through a linear function, rather than through the cosine function, as the target logit, which gives better convergence and performance when learning low-dimensional embedding features for face recognition. In terms of network architecture, we improve the performance of MobileFaceNet by increasing the network depth and width and by adding an attention module. In addition, we found some useful training tricks for face recognition. With all of the above, we won second place in the deepglint-light challenge of LFR2019 [2].
1 Introduction
The development of deep convolutional neural networks (DCNNs) has brought remarkable progress in a series of computer vision tasks. However, the common methods are not as effective when applied to face recognition. Softmax loss, which is commonly used in classification, cannot maximize the inter-class variance and minimize the intra-class variance of the embedding feature vectors. In order to obtain highly discriminative embedding features (see Figure 1), a series of novel loss functions have been proposed in recent years, such as A-Softmax [8], CosFace/AM-Softmax [17, 15] and ArcFace [4]. Among them, ArcFace achieved state-of-the-art performance by adding an additive margin between classes in the angle space. However, when ArcFace is used to train efficient networks with a small embedding size (128), it is hard to train from scratch. For example, MobileFaceNet [3] has an embedding size of 128 and does not converge when trained from scratch with ArcFace. As a result, pretraining with softmax loss is always required to guarantee convergence. To overcome this problem, we carefully designed a novel loss function named LiArcFace, based on ArcFace, which converges better.
Face recognition technology is now widely used on mobile devices, which requires that the computational cost of the model not be too large. In recent years, some highly efficient neural network architectures have been proposed, such as MobileNetV1 [6], ShuffleNet [19] and MobileNetV2 [12], but they are all designed for common visual recognition tasks rather than face recognition, and their performance is mediocre when they are used for face recognition directly. MobileFaceNet, which is designed for face recognition on the basis of MobileNetV2, achieved remarkable accuracy on LFW [7] and AgeDB [11]. It is even comparable to state-of-the-art large DCNN models on MegaFace [10] Challenge 1 while using far fewer computational resources. In this paper, with a limited amount of computation, we carefully designed a higher-performance network architecture based on MobileFaceNet.
2 Related Work
Loss function. Softmax loss is the most commonly used loss function in classification, which is presented as follows:
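$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T}x_i + b_j}} \tag{1}$$
where $x_i \in \mathbb{R}^{d}$ denotes the embedding feature of the $i$-th sample belonging to the $y_i$-th class, and the dimension of the embedding feature (hereinafter referred to as the embedding size) is set as $d$. $W_j \in \mathbb{R}^{d}$ denotes the $j$-th column of the weight $W \in \mathbb{R}^{d \times n}$ and $b_j$ is the bias term. The batch size is $N$ and the class number of the training data is $n$. However, the traditional softmax loss lacks the power to supervise the embedding feature so as to minimize inter-class similarity and maximize intra-class similarity. SphereFace [8] and NormFace [16] first remove the bias term and then fix $\|W_j\| = 1$ and $\|x_i\| = 1$ by $l_2$ normalisation, such that the logit is
$$W_j^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j = \cos\theta_j \tag{2}$$
where $\theta_j$ denotes the angle between $W_j$ and $x_i$. Thus the logit depends only on the cosine of the angle. The modified loss function, with the logit re-scaled by a factor $s$, can be formulated as follows
$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} \tag{3}$$
Although $L_2$ guarantees a high similarity between features of the same person, it cannot separate different classes. In this paper, we use N-Softmax to denote $L_2$. In ArcFace, the authors added an additive angular margin $m$ within $\cos\theta_{y_i}$, which enhances the intra-class compactness and the inter-class discrepancy. ArcFace is formulated as follows
$$L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} \tag{4}$$
When ArcFace is used to train models with a 512-dimensional embedding feature, it converges well and achieves state-of-the-art performance. However, it has difficulty converging when highly efficient networks with a 128-dimensional embedding feature are trained from scratch.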
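As a concrete illustration of N-Softmax and ArcFace for a single sample, the following minimal sketch computes both losses in plain Python. The function names, the toy 2-D class centres and the scale `s=64` are assumptions made for illustration; they are not taken from this paper's implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def angles(x, W):
    """Angle theta_j between the embedding x and every class centre W_j."""
    x = l2_normalize(x)
    thetas = []
    for w in W:
        w = l2_normalize(w)
        cos_t = sum(a * b for a, b in zip(x, w))
        thetas.append(math.acos(max(-1.0, min(1.0, cos_t))))  # clamp for safety
    return thetas


def margin_loss(thetas, y, s=64.0, m=0.0):
    """Cross-entropy over scaled cosine logits with margin m on the target.

    m = 0 gives N-Softmax; m > 0 gives ArcFace. s = 64 is an assumed scale.
    """
    logits = [s * math.cos(t + m) if j == y else s * math.cos(t)
              for j, t in enumerate(thetas)]
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[y]


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # three toy class centres (hypothetical)
x = [2.0, 1.0]                              # embedding of a class-0 sample
thetas = angles(x, W)
print(margin_loss(thetas, y=0, m=0.0))      # N-Softmax
print(margin_loss(thetas, y=0, m=0.5))      # ArcFace: the margin raises the loss
```

For the same sample, the margin lowers the target logit, so the ArcFace loss is larger than the N-Softmax loss and the sample must move closer to its centre before the loss vanishes.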
Network architectures. Face recognition is increasingly used on mobile devices, so it is important to optimize the trade-off between accuracy and computational cost when designing a deep neural network architecture. In recent years, some highly efficient neural network architectures have been proposed for common visual tasks. MobileNetV1 used depthwise separable convolution instead of traditional convolution to reduce the computational cost and improve network efficiency. MobileNetV2 introduced inverted residuals and linear bottlenecks to further improve network efficiency. However, these lightweight architectures are not as accurate when used unchanged for face recognition. The authors of MobileFaceNet identified the weakness of common mobile networks for face recognition and addressed it by replacing global average pooling with global depthwise convolution (GDC). In addition, the network architecture of MobileFaceNet is specifically designed for face recognition, with smaller expansion factors in the bottlenecks and more output channels at the beginning of the network.
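The saving from depthwise separable convolution can be made concrete with a rough multiply-accumulate count. The sketch below uses the standard cost formulas; the function names and the 56x56x64 example layer are chosen only for illustration, and bias terms and activations are ignored.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a k x k standard convolution on an h x w output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


def gdc_params(h, w, channels):
    """Weights in a global depthwise convolution: one h x w kernel per channel."""
    return h * w * channels


std = standard_conv_macs(56, 56, 64, 64, 3)
sep = depthwise_separable_macs(56, 56, 64, 64, 3)
print(std, sep, sep / std)        # the ratio equals 1/c_out + 1/k^2
print(gdc_params(7, 7, 1024))     # learnable weights of a 7x7 GDC on 1024 channels
```

Unlike global average pooling, which has no weights and treats every spatial position equally, the GDC layer has one learnable weight per position and channel, so it can weight different positions of the final feature map differently.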
3 Proposed Approach
3.1 LiArcFace
In ArcFace, the authors added an angular margin $m$ within $\cos\theta_{y_i}$, which takes $\cos(\theta_{y_i}+m)$ as the target logit. The loss function we propose takes the angle passed through a linear function as the logit, rather than the cosine function $\cos\theta_j$. In the same way, we remove the bias term and then fix $\|W_j\| = 1$ and $\|x_i\| = 1$ by $l_2$ normalisation; $\theta_j$ denotes the angle between $W_j$ and $x_i$, thus $\theta_j = \arccos(W_j^{T}x_i)$ with $\theta_j \in [0, \pi]$. First, we construct a linear function $f(x) = (\pi - 2x)/\pi$, so that $f(\theta_j) \in [-1, 1]$. Then we add an additive angular margin $m$ to the target logit. In the end we have $(\pi - 2(\theta_{y_i}+m))/\pi$ as the target logit. We call this novel loss function LiArcFace, where the prefix Li refers to the linear function. The whole LiArcFace loss can be formulated as follows
$$L_4 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\pi - 2(\theta_{y_i}+m))/\pi}}{e^{s(\pi - 2(\theta_{y_i}+m))/\pi} + \sum_{j=1, j\neq y_i}^{n} e^{s(\pi - 2\theta_j)/\pi}} \tag{5}$$
There are two main advantages of using this linear function in place of the cosine function. Firstly, it is monotonically decreasing over the whole range of the angle, $\theta_{y_i} \in [0, \pi]$, whereas $\cos(\theta_{y_i}+m)$ starts to increase again once $\theta_{y_i}+m > \pi$. This gives better convergence, especially for models with a small embedding size. For example, training MobileFaceNet with ArcFace from scratch diverges (NaN), so the model must first be pretrained with softmax loss. The proposed loss does not require this two-stage training. Secondly, the penalty of the proposed loss function increases linearly as the angle between the embedding feature and the center increases, so the target logit decreases linearly, which is more intuitive (see Figure 2). It does not decline rapidly at some angles and slowly at others (corresponding to the gradient of the target logit curve), which gives the proposed loss function better final performance. In terms of geometric decision margins, ArcFace has an overlap area when the embedding feature $x_i$ deviates too much from the center $W_{y_i}$, because its target logit curve is non-monotonic. A feature in this area can be assigned to either class 1 or class 2 by ArcFace. The proposed loss function does not have this overlap area (see Figure 3).
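The monotonicity argument can be checked numerically. The following sketch evaluates the unscaled target logit of ArcFace, cos(theta + m), and of LiArcFace, (pi - 2(theta + m))/pi, on a grid of angles from 0 to pi; the helper names and the grid resolution are illustrative choices, and m = 0.45 is one of the margins explored in the experiments.

```python
import math


def arcface_target(theta, m):
    """ArcFace target logit before scaling: cos(theta + m)."""
    return math.cos(theta + m)


def liarcface_target(theta, m):
    """LiArcFace target logit before scaling: (pi - 2 (theta + m)) / pi."""
    return (math.pi - 2.0 * (theta + m)) / math.pi


def is_strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


m = 0.45
grid = [k * math.pi / 100 for k in range(101)]    # theta from 0 to pi
arc = [arcface_target(t, m) for t in grid]
li = [liarcface_target(t, m) for t in grid]

print(is_strictly_decreasing(arc))   # False: cos(theta + m) turns upward past pi - m
print(is_strictly_decreasing(li))    # True: constant slope of -2/pi
```

The ArcFace curve bottoms out at theta = pi - m and rises afterwards, so two different angles can yield the same target logit; the linear curve has the same slope everywhere.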
3.2 Network Architecture
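Table 1: The proposed network architecture. Following the MobileNetV2 notation, t is the expansion factor, c the number of output channels, n the number of repeats and s the stride.
Input  Operator  t  c  n  s

112x112x3  conv3x3  -  64  1  2
56x56x64  depthwise conv3x3  -  64  1  1
56x56x64  bottleneck  2  64  1  2
28x28x64  bottleneck  2  64  9  1
28x28x64  bottleneck  4  128  1  2
14x14x128  bottleneck  2  128  16  1
14x14x128  bottleneck  8  256  1  2
7x7x256  bottleneck  2  256  6  1
7x7x256  conv1x1  -  1024  1  1
7x7x1024  linear GDConv7x7  -  1024  1  1
1x1x1024  linear conv1x1  -  512  1  1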
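The layer schedule in the table above can be sanity-checked by tracing the feature-map size row by row. This is a small bookkeeping sketch rather than a model definition; it assumes padded convolutions, so that only the stride changes the spatial size, and the names `LAYERS` and `trace_shapes` are made up for illustration.

```python
# (operator, output channels c, repeats n, stride s), copied from the table
LAYERS = [
    ("conv3x3", 64, 1, 2),
    ("depthwise conv3x3", 64, 1, 1),
    ("bottleneck", 64, 1, 2),
    ("bottleneck", 64, 9, 1),
    ("bottleneck", 128, 1, 2),
    ("bottleneck", 128, 16, 1),
    ("bottleneck", 256, 1, 2),
    ("bottleneck", 256, 6, 1),
    ("conv1x1", 1024, 1, 1),
    ("linear GDConv7x7", 1024, 1, 1),
    ("linear conv1x1", 512, 1, 1),
]


def trace_shapes(size=112, layers=LAYERS):
    """Return the (height, width, channels) produced by each table row."""
    shapes = []
    for name, c, n, s in layers:
        if "GDConv" in name:
            size = 1        # a global kernel collapses the whole feature map
        else:
            size //= s      # assumes padding keeps the size, so only the stride matters
        shapes.append((size, size, c))
    return shapes


for (name, c, n, s), shape in zip(LAYERS, trace_shapes()):
    print(f"{name:20s} x{n:<3d} -> {shape}")
print("bottleneck blocks:", sum(n for name, c, n, s in LAYERS if name == "bottleneck"))
```

Each output shape matches the Input column of the following row, and the last row yields the 512-dimensional embedding.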

In this section, we introduce our proposed network architecture. It is based on a deeper MobileFaceNet (y2) [4], so the residual bottlenecks proposed in MobileNetV2 are used as our main building blocks. Table 1 shows the details of the network architecture. Following MobileFaceNet, the expansion factors for the bottlenecks in our architecture are much smaller than those in MobileNetV2, and we use PReLU as the nonlinearity rather than ReLU. In addition, we noticed the importance of network width in face recognition. MobileFaceNet is significantly wider than MobileNetV2 at the beginning of the network, but it does not double the output channels when downsampling at the last stage. We do so, within a limited amount of computation. At the same time, we carefully adjusted the depth of the network and introduced the attention module CBAM [18] into every bottleneck. We also changed the last activation function in the attention module from sigmoid to 1+tanh, which makes training converge faster.
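To see how the two attention gates differ, the sketch below compares the sigmoid gate with the 1+tanh gate at a few pre-activation values and estimates their slopes at zero. It only illustrates the two scalar functions, not the CBAM module itself, and the function names are illustrative.

```python
import math


def sigmoid_gate(z):
    """Standard CBAM gate: scales a feature by a factor in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def shifted_tanh_gate(z):
    """Modified gate 1 + tanh(z): scales a feature by a factor in (0, 2)."""
    return 1.0 + math.tanh(z)


def gate_slope(gate, z, eps=1e-6):
    """Central-difference estimate of the derivative of the gate at z."""
    return (gate(z + eps) - gate(z - eps)) / (2.0 * eps)


for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid_gate(z), 3), round(shifted_tanh_gate(z), 3))
print(gate_slope(sigmoid_gate, 0.0), gate_slope(shifted_tanh_gate, 0.0))
```

At a pre-activation of zero the sigmoid gate halves the feature and has slope 0.25, while 1+tanh leaves the feature unchanged and has slope 1. This may be one reason for the faster convergence, although the text does not state the cause.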
3.3 Training Tricks for Face Recognition
During the competition, we found some useful training tricks for face recognition. Firstly, fine-tuning the model with a variety of loss functions makes the features more robust and improves the accuracy to some extent. In the competition, we used LiArcFace, ArcFace and a combined loss to fine-tune our model. Secondly, in the 512-dimensional embedding feature space, it is difficult for a lightweight model to learn the distribution of the features. Using large models to guide the feature distribution of the lightweight model [5, 9] is an effective remedy.
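The text does not specify how the large model guides the lightweight one. One common formulation, shown here purely as an assumed example, penalises the squared distance between the L2-normalised student and teacher embeddings and adds it to the classification loss. The function names and the weight `alpha` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def feature_guidance_loss(student, teacher):
    """Squared Euclidean distance between unit-length student and teacher embeddings."""
    s = l2_normalize(student)
    t = l2_normalize(teacher)
    return sum((a - b) ** 2 for a, b in zip(s, t))


def total_loss(margin_loss_value, student, teacher, alpha=1.0):
    """Margin-based classification loss plus a weighted guidance term (alpha is hypothetical)."""
    return margin_loss_value + alpha * feature_guidance_loss(student, teacher)


print(feature_guidance_loss([1.0, 0.0], [2.0, 0.0]))   # same direction
print(feature_guidance_loss([1.0, 0.0], [0.0, 1.0]))   # orthogonal directions
```

For unit vectors this distance equals 2 - 2 cos(angle), so minimising it pulls the student embedding towards the direction of the teacher embedding.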
4 Experiments
4.1 Evaluation Results of LiArcFace
In this section, we mainly introduce some comparative experiments of different loss functions.
Implementation Details. We use MobileFaceNet as the network architecture, with the embedding size set to 128. The batch size is 256 x 4, and the models are trained on four NVIDIA TITAN Xp GPUs. We use SGD with momentum 0.9 to optimize the models. The training data set is CASIA-WebFace, which contains about 10K identities and 0.5M images. The learning rate starts at 0.1 and is divided by 10 at 18K, 26K and 29K iterations, and training stops at 30K iterations. Finally, we compare the performance on Labelled Faces in the Wild (LFW) [7], Celebrities in Frontal Profile (CFP) [14] and Age Database (AgeDB) [11]. In this paper, ArcFace (m=0.5, NS) means that the margin of ArcFace is set to 0.5 and that the model is pretrained with N-Softmax before ArcFace is applied.
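The learning-rate schedule described above can be written as a small step function. This is a sketch of the schedule only; the function name is illustrative, and it assumes that each drop takes effect exactly at the listed iteration.

```python
def learning_rate(iteration, base_lr=0.1, milestones=(18000, 26000, 29000), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone reached."""
    passed = sum(1 for step in milestones if iteration >= step)
    return base_lr * factor ** passed


for it in (0, 17999, 18000, 26000, 29000, 29999):
    print(it, learning_rate(it))
```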
Weight decay  LFW  CFP-FP  AgeDB

5e-4  99.167  94.086  92.95
5e-4, wdml10  99.183  93.857  93.183
4e-5, wdml10  99.05  92.07  92
Weight decay. Before comparing the different loss functions, we first ran experiments on weight decay. We finally set the weight decay to 5e-4, except for the last layer to the embedding layer, whose weight decay is 5e-3 (a multiplier of 10, marked wdml10 in Table 2). According to the experimental results (see Table 2), the last layer calls for a larger weight decay than the rest of the network.
Effect of Hyperparameter m. In Table 3, we first explored the optimal setting of $m$ for LiArcFace and found it to be between 0.4 and 0.45. We tried training the model from scratch with ArcFace, but training diverged after about 1600 iterations. Therefore, all the experiments with ArcFace were pretrained with N-Softmax. Since the embedding size is smaller than 512, we also explored the optimal setting of $m$ for ArcFace and found it to be between 0.45 and 0.5.
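A per-layer weight decay of this kind can be expressed as a base coefficient with a multiplier on one layer. The sketch below is framework-independent; the layer name `last_layer` and the helper names are hypothetical and do not refer to any particular library.

```python
BASE_WD = 5e-4           # weight decay for most layers (from the text)
LAST_LAYER_MULT = 10.0   # multiplier that gives 5e-3 for the last layer


def weight_decay_for(name, last_layer_name="last_layer"):
    """Pick the decay coefficient for a parameter by its (hypothetical) layer name."""
    if name.startswith(last_layer_name):
        return BASE_WD * LAST_LAYER_MULT
    return BASE_WD


def decayed_gradient(grad, weight, wd):
    """L2 weight decay adds wd * w to the gradient before the optimizer step."""
    return [g + wd * w for g, w in zip(grad, weight)]


print(weight_decay_for("conv1.weight"))
print(weight_decay_for("last_layer.weight"))
print(decayed_gradient([0.1, -0.2], [1.0, 2.0], weight_decay_for("last_layer.weight")))
```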
Loss  LFW  CFP-FP  AgeDB

N-Softmax  97.867  90.457  86.55
LiArcFace(m=0.35)  99.2  94.2  92.867
LiArcFace(m=0.4)  99.267  94.114  93.25
LiArcFace(m=0.45)  99.233  94.2  93.15
LiArcFace(m=0.5)  99.2  94.1  92.933
LiArcFace(m=0.4, NS)  99.233  94.514  92.9
LiArcFace(m=0.45, NS)  99.267  94.386  93.217
ArcFace(m=0.4, NS)  99.267  93.929  92.717
ArcFace(m=0.45, NS)  99.267  94  92.983
ArcFace(m=0.5, NS)  99.183  93.857  93.183
CosFace/AM-Softmax  99.2  93.386  92.55

Comparison with state-of-the-art loss functions. In Table 3, the difference between LiArcFace, ArcFace and CosFace is tiny on LFW, but all of them are clearly better than N-Softmax. On CFP-FP and AgeDB, LiArcFace is slightly better than ArcFace and CosFace. We also plot the accuracy on CFP-FP and AgeDB during training, which makes the contrast more obvious (see Figure 4). In addition, we compared LiArcFace and ArcFace under the same condition, with both pretrained by N-Softmax. On CFP-FP, LiArcFace achieves the highest accuracy, and it is still slightly better than ArcFace on AgeDB. In general, there is little difference in accuracy on the verification sets, but LiArcFace converges better and does not need a pretraining stage when training a model with a small embedding size.
4.2 Evaluation Results of Network Architecture and Training Tricks
We name the model that combines our network architecture and training tricks AirFace. Under the same training data set, MS1M-retina [1] (a cleaned version of MS1M), and the same model constraints, the accuracy of our model reached 88.415% at FPR=1e-8 in the deepglint-light challenge of LFR19 [2]. We also verified the performance of our model on MegaFace Challenge 1 against previous state-of-the-art models. As Table 4 shows, AirFace comes close to the accuracy of the ResNet-100 models (R100) while using about 1G FLOPs instead of 27G.
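A figure such as 88.415% at FPR=1e-8 is usually read as the true-positive rate at a threshold chosen so that the false-positive rate stays within the given bound. The following sketch computes that quantity from similarity scores; the function name and the toy scores are illustrative, and the challenge's exact evaluation protocol is not reproduced here.

```python
def tpr_at_fpr(genuine_scores, impostor_scores, target_fpr):
    """True-positive rate at the loosest threshold whose FPR does not exceed target_fpr."""
    impostors = sorted(impostor_scores, reverse=True)
    allowed = int(target_fpr * len(impostors))   # impostor pairs that may be accepted
    # Accept only scores strictly above the (allowed + 1)-th highest impostor score.
    threshold = impostors[allowed] if allowed < len(impostors) else float("-inf")
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores)


genuine = [0.9, 0.6, 0.4]          # scores of same-identity pairs (toy values)
impostor = [0.7, 0.5, 0.3, 0.1]    # scores of different-identity pairs (toy values)
print(tpr_at_fpr(genuine, impostor, 0.25))   # one impostor pair may pass
print(tpr_at_fpr(genuine, impostor, 0.0))    # no impostor pair may pass
```

At a bound as strict as 1e-8, at least 1e8 impostor pairs are needed before even one false accept is allowed, which is why such figures are only meaningful on very large test sets.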
Methods  Id(%)  Ver(%)  FLOPs

FaceNet [13]  70.49  86.47  -
CosFace [17]  82.72  96.65  -
R100,MS1MV2,ArcFace [4]  81.03  96.98  27G
R100,MS1MV2,CosFace [4]  80.56  96.56  27G
R100,MS1MV2,ArcFace,R [4]  98.35  98.48  27G
R100,MS1MV2,CosFace,R [4]  97.91  97.91  27G
MobileFaceNet [3]  -  90.16  440M
MobileFaceNet,R [3]  -  92.59  440M
AirFace,MS1M-retina  80.80  96.20  1G
AirFace,MS1M-retina,R  98.04  97.66  1G

5 Conclusions
In this paper, we first propose a novel additive margin loss function for deep face recognition, based on ArcFace. The proposed loss function solves the problem that ArcFace does not converge when training a model with a small embedding feature size, and it achieves state-of-the-art results on several face verification datasets. Second, we have carefully designed an efficient network architecture and explored some useful training tricks for face recognition, which make our model both efficient and accurate in the deepglint-light challenge and MegaFace Challenge 1.
References
 [1] https://ibug.doc.ic.ac.uk/resources/lightweightfacerecognitionchallengeworkshop/.
 [2] http://www.insightfacechallenge.com/overview.
 [3] S. Chen, Y. Liu, X. Gao, and Z. Han. MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices. 2018.
 [4] J. Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. 2018.
 [5] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
 [6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. 2017.
 [7] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua. Labeled Faces in the Wild: A Survey. 2016.
 [8] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. 2017.
 [9] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang. Face model compression by distilling knowledge from neurons. 2016.
 [10] D. Miller, E. Brossard, S. Seitz, and I. Kemelmacher-Shlizerman. Megaface: A million faces for recognition at scale. Computer Science, 2015.
 [11] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, and S. Zafeiriou. Agedb: the first manually collected, in-the-wild age database. In Computer Vision & Pattern Recognition Workshops, 2017.
 [12] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. 2018.
 [13] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. 2015.
 [14] S. Sengupta, J. C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In Applications of Computer Vision, 2016.
 [15] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, PP(99):1–1, 2018.
 [16] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: L2 hypersphere embedding for face verification. 2017.
 [17] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. 2018.
 [18] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon. Cbam: Convolutional block attention module. 2018.
 [19] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. 2017.