AirFace:Lightweight and Efficient Model for Face Recognition

Xianyang Li, Feng Wang, Qinghao Hu, Cong Leng

With the development of convolutional neural networks, significant progress has been made in computer vision tasks. However, the commonly used softmax loss and the highly efficient network architectures designed for common visual tasks are not as effective for face recognition. In this paper, we propose a novel loss function named Li-ArcFace based on ArcFace. Li-ArcFace takes the value of the angle through a linear function as the target logit rather than through the cosine function, which gives better convergence and performance when learning low-dimensional embedding features for face recognition. In terms of network architecture, we improve the performance of MobileFaceNet by increasing the network depth and width and adding an attention module. Besides, we found some useful training tricks for face recognition. With all of the above, we won second place in the deepglint-light challenge of LFR2019 [2].

1 Introduction

The development of deep convolutional neural networks (DCNNs) has brought remarkable progress on a series of computer vision tasks. However, the common methods are not as effective for face recognition. Softmax loss, which is commonly used in classification, can't maximize the inter-class variance and minimize the intra-class variance of the embedding feature vectors. In order to obtain highly discriminative embedding features (see Figure 1), a series of novel loss functions have been proposed in recent years, such as A-Softmax [8], CosFace/AM-Softmax [17, 15], and ArcFace [4]. Among them, ArcFace achieved state-of-the-art performance by adding an additive margin between classes in the angle space. But when ArcFace is used to train some efficient networks with a small (e.g., 128) embedding size, it is hard to train from scratch (for example, the embedding feature size of MobileFaceNet [3] is 128, and it can't converge when trained from scratch with ArcFace). As a result, to guarantee the convergence of training, pre-training with softmax loss is always required. To overcome this problem, we carefully designed a novel loss function named Li-ArcFace based on ArcFace, which converges better.

Face recognition technology is now widely used on mobile devices, which requires that the computational cost of the model not be too large. In recent years, some highly efficient neural network architectures have been proposed, such as MobileNetV1 [6], ShuffleNet [19], and MobileNetV2 [12], but they are all designed for common visual recognition tasks instead of face recognition, and their performance is mediocre when used for face recognition directly. MobileFaceNet, designed for face recognition based on MobileNetV2, achieves remarkable accuracy on LFW [7] and AgeDB [11], and is even comparable to state-of-the-art big DCNN models on MegaFace [10] Challenge 1 at a much smaller computational cost. In this paper, within a limited amount of computation, we carefully design a higher-performance network architecture based on MobileFaceNet.

Figure 1: Schematic diagram of discriminative embedding features. W1 refers to the center of class 1, f1 refers to an embedding feature of class 1, and f2 refers to an embedding feature of class 2. During training, the variance of the same person becomes smaller, and the variance between different people becomes larger.

2 Related Work

Loss function. Softmax loss is the most commonly used loss function in classification, which is presented as follows:

L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i+b_{j}}}

where x_i denotes the embedding feature of the i-th sample, belonging to the y_i-th class, and the dimension of the embedding feature (hereinafter referred to as the embedding size) is set as d. W_j denotes the j-th column of the weight W and b_j is the bias term. The batch size is N and the class number of the training data is n. However, the traditional softmax loss lacks the power to supervise the embedding feature to minimize inter-class similarity and maximize intra-class similarity. SphereFace [8] and NormFace [16] remove the bias term at first and then fix \|W_j\|=1 and \|x_i\|=s by normalisation, such that the logit is


Where denotes the angle between and . Thus the logit is only depend on the cosine of the angle. The modified loss function can be formulated as follows


Although guarantees a high similarity of features of the same person, it can’t separate different classes. In this paper, we use N-Softmax denotes . In ArcFace, the authors added an additive angular margin within , which can enhance the intra-class compactness and inter-class discrepancy, the ArcFace is formulated as follows

L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,j\neq y_i}^{n}e^{s\cos\theta_{j}}}
When ArcFace is used to train models with a 512-dimensional embedding feature, it converges well and gives state-of-the-art performance. However, it is difficult to converge when training some highly efficient networks with a 128-dimensional embedding feature from scratch.
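The target logits of these losses can be compared directly as functions of the angle. The plain numpy sketch below uses illustrative values s = 64 and m = 0.5 (assumptions for the illustration, not any particular experiment's settings) and shows why ArcFace's target logit is not monotonic over the full angle range:

```python
import numpy as np

def target_logit(theta, loss="n-softmax", s=64.0, m=0.5):
    """Target logit (logit of the ground-truth class) as a function of the
    angle theta between the feature x_i and its class center W_{y_i}.
    s and m are the usual scale/margin hyper-parameters."""
    if loss == "n-softmax":      # s * cos(theta)
        return s * np.cos(theta)
    if loss == "cosface":        # s * (cos(theta) - m), additive cosine margin
        return s * (np.cos(theta) - m)
    if loss == "arcface":        # s * cos(theta + m), additive angular margin
        return s * np.cos(theta + m)
    raise ValueError(loss)

# ArcFace's target logit turns back upward once theta + m exceeds pi:
thetas = np.linspace(0.0, np.pi, 181)
arc = target_logit(thetas, "arcface")
assert np.any(np.diff(arc) > 0)  # non-monotonic near theta = pi
```

This non-monotonic tail is exactly the region where large-angle (hard) samples receive a confusing gradient signal during early training.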

Network architectures. Face recognition is used more and more on mobile devices, so it is important to optimize the trade-off between accuracy and computational cost when designing deep neural network architectures. In recent years, some highly efficient neural network architectures have been proposed for common visual tasks. MobileNetV1 used depthwise separable convolutions instead of traditional convolutions to reduce computational cost and improve network efficiency. MobileNetV2 introduced inverted residuals and linear bottlenecks to further improve network efficiency. However, these lightweight network architectures are not as accurate when used unchanged for face recognition. The authors of MobileFaceNet identified the weakness of common mobile networks for face recognition, and solved it by replacing global average pooling with a global depthwise convolution (GDC). The network architecture of MobileFaceNet is specifically designed for face recognition, with smaller expansion factors in the bottlenecks and more output channels at the beginning of the network.

Figure 2: Target logit curves.
Figure 3: Decision margins of different loss functions under the binary classification case. The yellow areas are the decision margins, and the red area is the overlap of class 1 and class 2.

3 Proposed Approach

3.1 Li-ArcFace

In ArcFace, the authors added an angular margin m within \cos(\theta_{y_i}+m), which takes s\cos(\theta_{y_i}+m) as the target logit. The loss function proposed by us takes the angle after a linear function as the target logit rather than passing it through the cosine function. In the same way, we remove the bias term and then fix \|W_j\|=1 and \|x_i\|=s by normalisation; \theta_j denotes the angle between W_j and x_i, thus \theta_j \in [0,\pi]. At first, we construct a linear function f(\theta)=(\pi-2\theta)/\pi, and we have f(\theta)\in[-1,1]. Then we add an additive angular margin m to the angle in the target logit. In the end we have s(\pi-2(\theta_{y_i}+m))/\pi as the target logit. We call this novel loss function Li-ArcFace, where the prefix Li refers to the linear function. The whole Li-ArcFace can be formulated as follows:

L_4 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\pi-2(\theta_{y_i}+m))/\pi}}{e^{s(\pi-2(\theta_{y_i}+m))/\pi}+\sum_{j=1,j\neq y_i}^{n}e^{s(\pi-2\theta_{j})/\pi}}
There are two main advantages of using this linear function to replace the cosine function. Firstly, it is monotonically decreasing for angles between 0 and \pi, which gives better convergence, especially for models with a small embedding size. For example, training MobileFaceNet with ArcFace from scratch leads to divergence (NaN), so softmax loss must be used for pre-training before ArcFace can converge; the proposed loss does not require this two-stage training. Secondly, the penalty of the proposed loss function increases linearly as the angle between the embedding feature and its center increases, so the target logit decreases linearly, which is more intuitive (see Figure 2). It does not decline rapidly at some angles and slowly at others (corresponding to the gradient of the target logit curve), which gives the proposed loss function better performance in the end. In terms of geometric decision margins, ArcFace has a region of overlap when the feature x_i deviates too much from the center W_{y_i}, because of the non-monotonicity of its target logit curve; samples in this region can be assigned to either class 1 or class 2 under ArcFace. The proposed loss function does not have this overlap area (see Figure 3).
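The monotonicity argument above can be checked numerically. The sketch below assumes a scale s = 64 and uses m = 0.45, a value within the range found best in Section 4; it is an illustration of the two target-logit curves, not training code:

```python
import numpy as np

s, m = 64.0, 0.45  # assumed scale; margin from the range explored in Sec. 4

def arcface_logit(theta):
    # ArcFace target logit: s * cos(theta + m)
    return s * np.cos(theta + m)

def li_arcface_logit(theta):
    # Li-ArcFace target logit: the linear map s * (pi - 2*(theta + m)) / pi
    return s * (np.pi - 2.0 * (theta + m)) / np.pi

thetas = np.linspace(0.0, np.pi, 1000)
li = li_arcface_logit(thetas)
arc = arcface_logit(thetas)

assert np.all(np.diff(li) < 0)   # strictly decreasing over the whole range
assert np.any(np.diff(arc) > 0)  # ArcFace turns upward once theta + m > pi
```

The strictly negative slope of the Li-ArcFace curve is constant (-2s/\pi per radian), which matches the claim that the penalty grows linearly with the angle.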

3.2 Network Architecture

Input        Operator             t    c     n    s
112x112x3    conv 3x3             -    64    1    2
56x56x64     depthwise conv 3x3   -    64    1    1
56x56x64     bottleneck           2    64    1    2
28x28x64     bottleneck           2    64    9    1
28x28x64     bottleneck           4    128   1    2
14x14x128    bottleneck           2    128   16   1
14x14x128    bottleneck           8    256   1    2
7x7x256      bottleneck           2    256   6    1
7x7x256      conv 1x1             -    1024  1    1
7x7x1024     linear GDConv 7x7    -    1024  1    1
1x1x1024     linear conv 1x1      -    512   1    1

Table 1: The proposed network architecture. n refers to the number of repetitions, c refers to the output channels, and t refers to the expansion factor.

In this section, we introduce our proposed network architecture. It is based on a deeper MobileFaceNet (y2) [4], so the residual bottlenecks proposed in MobileNetV2 are used as our main building blocks. Table 1 shows the details of the network architecture. Following MobileFaceNet, the expansion factors for the bottlenecks in our architecture are much smaller than those in MobileNetV2, and we use PReLU as the non-linearity rather than ReLU. Nevertheless, we noticed the importance of network width in face recognition. MobileFaceNet is significantly wider than MobileNetV2 at the beginning of the network, but it does not double the output channels when downsampling at the last stage; we do, within a limited amount of computation. At the same time, we carefully adjusted the depth of the network, and we introduced the attention module CBAM [18] into every bottleneck. However, we changed the last activation function in the attention module from sigmoid to 1 + tanh, which makes it converge faster.
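The modified gate can be sketched as follows. This is a plain numpy illustration of the channel-attention branch of CBAM with the final sigmoid replaced by 1 + tanh; the MLP weights w1/w2, the reduction ratio, and the input shape are placeholders, not the paper's actual parameters:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Channel attention on an NCHW feature map, CBAM-style, with the final
    sigmoid replaced by 1 + tanh. w1/w2 are the weights of the shared
    two-layer MLP (shapes C x C/r and C/r x C)."""
    avg = x.mean(axis=(2, 3))                   # (N, C) global average pooling
    mx = x.max(axis=(2, 3))                     # (N, C) global max pooling
    mlp = lambda v: np.maximum(v @ w1, 0) @ w2  # shared MLP with ReLU
    gate = 1.0 + np.tanh(mlp(avg) + mlp(mx))    # gate in (0, 2), not (0, 1)
    return x * gate[:, :, None, None]           # rescale each channel

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 7, 7))
w1 = rng.normal(size=(8, 2)) * 0.1  # reduction ratio r = 4 here
w2 = rng.normal(size=(2, 8)) * 0.1
y = channel_attention(x, w1, w2)
assert y.shape == x.shape
```

Since tanh(0) = 0, the 1 + tanh gate is centered at 1 when the MLP output is near zero, so features initially pass through almost unscaled, which is one plausible reason for the faster convergence observed.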

3.3 Training Tricks for Face Recognition

During the competition, we found some useful training tricks for face recognition. Firstly, fine-tuning the model with a variety of loss functions makes the features more robust and improves the accuracy to some extent; in the competition, we used Li-ArcFace, ArcFace, and a combined loss to fine-tune our model. Secondly, in a 512-dimensional embedding feature space, it is difficult for a lightweight model to learn the distribution of the features. An effective method is to use large models to guide the feature distribution of lightweight models [5, 9].
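One plausible form of this feature-guidance idea is an embedding-matching loss between a frozen teacher and the lightweight student. The cosine-based variant below is a sketch of the concept in the spirit of [5, 9], not necessarily the exact loss used in the competition:

```python
import numpy as np

def embedding_distill_loss(student_emb, teacher_emb):
    """Pull the student's 512-D embeddings toward the teacher's by maximizing
    cosine similarity of the L2-normalized embeddings. Returns 0 when the
    two sets of embedding directions match exactly."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return np.mean(1.0 - np.sum(s * t, axis=1))

emb = np.random.default_rng(1).normal(size=(4, 512))
assert abs(embedding_distill_loss(emb, emb)) < 1e-6  # identical -> ~0
```

In practice such a term would be added, with some weight, to the recognition loss while the teacher network's weights stay fixed.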

4 Experiments

4.1 Evaluation Results of Li-ArcFace

In this section, we present comparative experiments on different loss functions.

Implementation Details. We use MobileFaceNet as the network architecture, with the embedding size set to 128. The batch size is set to 256 × 4, and models are trained on four NVIDIA TITAN Xp GPUs. We use SGD with momentum 0.9 to optimize the models. The training data set is CASIA-WebFace, which contains 0.5M pictures of 10K identities. The learning rate starts at 0.1, is divided by 10 at 18K, 26K and 29K iterations, and training stops at 30K iterations. Finally, we compare performance on Labelled Faces in the Wild (LFW) [7], Celebrities in Frontal Profile (CFP) [14] and the Age Database (AgeDB) [11]. In this paper, ArcFace (m=0.5, NS) means that the margin of ArcFace is set to 0.5 and the model is pre-trained with N-Softmax before ArcFace is applied.
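The learning-rate schedule described above can be written as a simple step function (iteration counts as stated; the boundary behavior at exactly 18K/26K/29K is our reading of "divides by 10 at"):

```python
def learning_rate(iteration):
    """Step schedule: start at 0.1 and divide by 10 at 18K, 26K and 29K
    iterations; training stops at 30K."""
    lr = 0.1
    for boundary in (18000, 26000, 29000):
        if iteration >= boundary:
            lr /= 10.0
    return lr

assert learning_rate(0) == 0.1
assert abs(learning_rate(20000) - 0.01) < 1e-12
assert abs(learning_rate(29500) - 0.0001) < 1e-12
```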

Weight decay      LFW      CFP-FP   AgeDB
5e-4              99.167   94.086   92.95
5e-4, wdml10      99.183   93.857   93.183
4e-5, wdml10      99.05    92.07    92
Table 2: Verification performance (%) for different weight decay values. wdml10 indicates that the weight decay multiplier (wd_mult) of the last convolution layers is set to 10.

Weight decay. Before comparing the different loss functions, we first ran numerical experiments on weight decay, and finally set it to 5e-4, except for the last layer before the embedding layer, whose weight decay is 5e-3. According to the experimental results (see Table 2), the weight decay of the last layer is the more sensitive setting.

Effect of Hyper-parameter m. In Table 3, we first explored the optimal setting of m for Li-ArcFace, and found it to be between 0.4 and 0.45. We tried training the model from scratch with ArcFace, but it diverged after about 1600 iterations; therefore, all the experiments with ArcFace were pre-trained with N-Softmax. Since the embedding size is smaller than 512, we also explored the optimal setting of m for ArcFace, and found it to be between 0.45 and 0.5.

Loss function             LFW      CFP-FP   AgeDB
N-Softmax                 97.867   90.457   86.55
Li-ArcFace(m=0.35)        99.2     94.2     92.867
Li-ArcFace(m=0.4)         99.267   94.114   93.25
Li-ArcFace(m=0.45)        99.233   94.2     93.15
Li-ArcFace(m=0.5)         99.2     94.1     92.933
Li-ArcFace(m=0.4, NS)     99.233   94.514   92.9
Li-ArcFace(m=0.45, NS)    99.267   94.386   93.217
ArcFace(m=0.4, NS)        99.267   93.929   92.717
ArcFace(m=0.45, NS)       99.267   94       92.983
ArcFace(m=0.5, NS)        99.183   93.857   93.183
CosFace/AM-Softmax        99.2     93.386   92.55

Table 3: Verification performance (%) of different loss functions. ArcFace (m=0.5, NS) means that the margin of ArcFace is set to 0.5 and the model is pre-trained with N-Softmax before ArcFace is applied.
Figure 4: The accuracy of the verification sets during training. CFP-FP is on the left, AgeDB is on the right.

Comparison with state-of-the-art loss functions. In Table 3, the differences between Li-ArcFace, ArcFace and CosFace are tiny on LFW, but all of them are clearly better than N-Softmax. On CFP-FP and AgeDB, Li-ArcFace is slightly better than ArcFace and CosFace. We plot the accuracy on CFP-FP and AgeDB during training, which makes the contrast more obvious (see Figure 4). We also compared Li-ArcFace and ArcFace under the same setting, both pre-trained with N-Softmax: on CFP-FP, Li-ArcFace achieves the highest accuracy, and it is still slightly better than ArcFace on AgeDB. In general, there is little difference in accuracy on the verification sets, but Li-ArcFace converges better and does not need a pre-training stage when training models with a small embedding size.

4.2 Evaluation Results of Network Architecture and Training Tricks

We name the model that combines our network architecture and training tricks AirFace. Under the same training data set, MS1M-retina [1] (cleaned from MS1M), and the same model constraints, our model reached 88.415%@FPR=1e-8 in the deepglint-light challenge of LFR2019 [2]. Meanwhile, we verified the performance of our model on MegaFace Challenge 1 against previous state-of-the-art models. As Table 4 shows, AirFace reaches remarkable efficiency and performance.

Methods                          Id(%)   Ver(%)   FLOPs
FaceNet [13]                     70.49   86.47    -
CosFace [17]                     82.72   96.65    -
R100, MS1MV2, ArcFace [4]        81.03   96.98    27G
R100, MS1MV2, CosFace [4]        80.56   96.56    27G
R100, MS1MV2, ArcFace, R [4]     98.35   98.48    27G
R100, MS1MV2, CosFace, R [4]     97.91   97.91    27G
MobileFaceNet [3]                -       90.16    440M
MobileFaceNet, R [3]             -       92.59    440M
AirFace, MS1M-retina             80.80   96.20    1G
AirFace, MS1M-retina, R          98.04   97.66    1G

Table 4: Identification and verification evaluation of different methods on MegaFace Challenge 1. Id refers to the rank-1 face identification accuracy with 1M distractors, and Ver refers to the face verification TAR at a fixed FAR. R refers to data refinement of both the probe set and the 1M distractors.

5 Conclusions

In this paper, we first propose a novel additive margin loss function for deep face recognition based on ArcFace. The proposed loss function solves the problem that ArcFace does not converge when training models with a small embedding feature size, and it achieves state-of-the-art results on several face verification datasets. Second, we carefully designed an efficient network architecture and explored some useful training tricks for face recognition, which make our model highly competitive in both the deepglint-light challenge and MegaFace Challenge 1.


  • [1]
  • [2]
  • [3] S. Chen, Y. Liu, X. Gao, and Z. Han. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. 2018.
  • [4] J. Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. 2018.
  • [5] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
  • [6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. 2017.
  • [7] E. Learned-Miller, G. B. Huang, A. Roychowdhury, H. Li, and H. Gang. Labeled Faces in the Wild: A Survey. 2016.
  • [8] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. 2017.
  • [9] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang. Face model compression by distilling knowledge from neurons. 2016.
  • [10] D. Miller, E. Brossard, S. Seitz, and I. Kemelmacher-Shlizerman. Megaface: A million faces for recognition at scale. Computer Science, 2015.
  • [11] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, and S. Zafeiriou. Agedb: the first manually collected, in-the-wild age database. In Computer Vision & Pattern Recognition Workshops, 2017.
  • [12] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. 2018.
  • [13] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. 2015.
  • [14] S. Sengupta, J. C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In Applications of Computer Vision, 2016.
  • [15] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, PP(99):1–1, 2018.
  • [16] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: L2 hypersphere embedding for face verification. 2017.
  • [17] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. 2018.
  • [18] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon. Cbam: Convolutional block attention module. 2018.
  • [19] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. 2017.