AirFace: Lightweight and Efficient Model for Face Recognition

Xianyang Li
   Feng Wang
   Qinghao Hu
   Cong Leng
Nanjing Artificial Intelligence Chip Research, Institute of Automation, Chinese Academy of Sciences(AIRIA)
China  {wangfeng, huqinghao, lengcong}

With the development of convolutional neural networks, significant progress has been made in computer vision tasks. However, the commonly used softmax loss and the highly efficient network architectures designed for common visual tasks are not as effective for face recognition. In this paper, we propose a novel loss function named Li-ArcFace, based on ArcFace. Li-ArcFace obtains the target logit by passing the angle through a linear function rather than through the cosine function, which gives better convergence and performance when learning low-dimensional embedding features for face recognition. In terms of network architecture, we improve the performance of MobileFaceNet by increasing the network depth and width and by adding an attention module. In addition, we found some useful training tricks for face recognition. With all of the above, we won second place in the deepglint-light challenge of LFR2019 [6].

1 Introduction

Deep convolutional neural networks (DCNNs) have brought remarkable progress in a series of computer vision tasks. However, the common methods are not as effective for face recognition. Softmax loss, which is commonly used in classification, cannot maximize the inter-class variance and minimize the intra-class variance of embedding feature vectors. In order to obtain highly discriminative embedding features (see Figure 1), a series of novel loss functions have been proposed in recent years, such as A-Softmax [12], CosFace/AM-Softmax [21, 19], and ArcFace [5]. Among them, ArcFace achieved state-of-the-art performance by adding an additive margin between classes in the angle space. However, when ArcFace is used to train efficient networks with a small (128) embedding size, it is hard to train from scratch. For example, MobileFaceNet [4], whose embedding size is 128, does not converge when trained from scratch with ArcFace. As a result, pre-training with softmax loss is always required to guarantee convergence. To overcome this problem, we carefully designed a novel loss function named Li-ArcFace, based on ArcFace, which converges better.

Face recognition technology is now widely used on mobile devices, which requires that the computational cost of the model not be too large. In recent years, some highly efficient neural network architectures have been proposed, such as MobileNetV1 [10], ShuffleNet [23], and MobileNetV2 [16], but they are all designed for common visual recognition tasks rather than face recognition. Their performance is only mediocre when they are used for face recognition directly. MobileFaceNet, which is designed for face recognition based on MobileNetV2, achieved remarkable accuracy on LFW [11] and AgeDB [15]. It is even comparable to state-of-the-art large DCNN models on MegaFace [14] Challenge 1 while using much smaller computational resources. In this paper, under a limited computational budget, we carefully designed a higher-performance network architecture based on MobileFaceNet.

Figure 1: Schematic diagram of discriminative embedding features. W1 refers to the center of class 1, f1 refers to the embedding feature of class 1, and f2 refers to the embedding feature of class 2. During training, the variance among features of the same person becomes smaller, and the variance between features of different people becomes larger.

2 Related Work

Loss function. Softmax loss is the most commonly used loss function in classification. It is given by

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T}x_i + b_j}},$$

where $x_i \in \mathbb{R}^d$ denotes the embedding feature of the $i$-th sample, belonging to the $y_i$-th class, and the dimension of the embedding feature (hereinafter referred to as the embedding size) is set as $d$. $W_j \in \mathbb{R}^d$ denotes the $j$-th column of the weight $W \in \mathbb{R}^{d \times n}$ and $b_j$ is the bias term. The batch size is $N$ and the number of classes in the training data is $n$. However, the traditional softmax loss lacks the power to supervise the embedding feature to minimize inter-class similarity and maximize intra-class similarity. In SphereFace [12] and NormFace [20], the bias term is first removed and then $\|W_j\| = 1$ and $\|x_i\| = s$ are fixed by normalisation, such that the logit is:

$$W_j^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j = s\cos\theta_j,$$


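where $\theta_j$ denotes the angle between $W_j$ and $x_i$. Thus the logit depends only on the cosine of the angle. The modified loss function can be formulated as:

$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1,\, j\neq y_i}^{n} e^{s\cos\theta_j}}.$$

Although $L_2$ guarantees a high similarity between features of the same person, it cannot separate different classes very well. In this paper, we use N-Softmax to denote $L_2$. In ArcFace [5], the authors added an additive angular margin $m$ within $\cos\theta_{y_i}$, which enhances intra-class compactness and inter-class discrepancy. ArcFace is formulated as:

$$L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\, j\neq y_i}^{n} e^{s\cos\theta_j}}.$$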


When ArcFace is used to train models with 512-dimensional embedding features, it converges well and achieves state-of-the-art performance. However, it is difficult to converge when training highly efficient networks with 128-dimensional embedding features from scratch.
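As a rough illustration of the two normalised losses discussed above, the sketch below computes the per-sample loss from a list of angles between one feature and every class center. It assumes logits of the form s·cos(θ), with the margin m added only to the target angle. The function name `margin_softmax_loss` and the example angles are hypothetical and are not taken from the paper.

```python
import math

def margin_softmax_loss(thetas, label, s=64.0, m=0.0):
    """Cross-entropy over logits s*cos(theta_j), with margin m added to the target angle.

    m = 0 gives the N-Softmax form; m > 0 gives the ArcFace form.
    """
    logits = [s * math.cos(t + m) if j == label else s * math.cos(t)
              for j, t in enumerate(thetas)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

# Hypothetical angles (radians) between one feature and three class centers.
angles = [1.0, 1.3, 1.5]
loss_n_softmax = margin_softmax_loss(angles, label=0)
loss_arcface = margin_softmax_loss(angles, label=0, m=0.5)
```

For the same angles, the margin makes the target logit smaller, so the ArcFace form yields a much larger loss than N-Softmax. This is how the margin keeps pushing features towards their class center even after the sample is already classified correctly.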

Network architectures. Face recognition is increasingly used on mobile devices, so it is important to optimize the trade-off between accuracy and computational cost when designing deep neural network architectures. In recent years, some highly efficient neural network architectures have been proposed for common visual tasks. MobileNetV1 [10] used depthwise separable convolution instead of traditional convolution to reduce computational cost and improve network efficiency. MobileNetV2 [16] introduced inverted residuals and linear bottlenecks to further improve network efficiency. However, these lightweight architectures are not as accurate when used unchanged for face recognition. The authors of MobileFaceNet [4] identified the weakness of common mobile networks for face recognition and solved it by replacing global average pooling with global depthwise convolution (GDC). In addition, the MobileFaceNet architecture is specifically designed for face recognition, with smaller expansion factors in the bottlenecks and more output channels at the beginning of the network.

Figure 2: Target logit curves.
Figure 3: Decision margins of different loss functions in the binary classification case. The yellow areas are the decision margins, and the red areas are the overlap between class 1 and class 2.

3 Proposed Approach

3.1 Li-ArcFace

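In ArcFace, the authors added an angular margin $m$ within $\cos\theta_{y_i}$, which takes $\cos(\theta_{y_i}+m)$ as the target logit. Our proposed loss function instead takes the angle passed through a linear function as the logit, rather than through the cosine function. In the same way, we remove the bias term and fix $\|W_j\| = 1$ and $\|x_i\| = s$ by normalisation; $\theta_j$ denotes the angle between $W_j$ and $x_i$, thus $\theta_j = \arccos(W_j^{T}x_i / s)$. First, we construct a linear function $f(x) = (\pi - 2x)/\pi$, so that $f(\theta_j) \in [-1, 1]$. Then we add an additive angular margin $m$ to the target logit. In the end we have $s\cdot(\pi - 2(\theta_{y_i}+m))/\pi$ as the target logit. We call this novel loss function Li-ArcFace; the prefix Li refers to the linear function. The whole Li-ArcFace loss can be formulated as follows:

$$L_4 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\pi - 2(\theta_{y_i}+m))/\pi}}{e^{s(\pi - 2(\theta_{y_i}+m))/\pi} + \sum_{j=1,\, j\neq y_i}^{n} e^{s(\pi - 2\theta_j)/\pi}}.$$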


There are two main advantages of using this linear function in place of the cosine function. Firstly, it is monotonically decreasing over the whole range of the angle, from $0$ to $\pi$, which gives better convergence, especially for models with a small embedding size. For example, training MobileFaceNet from scratch with ArcFace diverges (NaN), so pre-training with softmax loss is needed before ArcFace can converge. The proposed loss does not require this two-stage training. Secondly, the penalty of the proposed loss function increases linearly as the angle between the embedding feature and its class center increases, so the target logit decreases linearly, which is more intuitive (see Figure 2). The target logit does not decline rapidly at some angles and slowly at others (as it does along the ArcFace target logit curve, whose gradient varies with the angle), which gives the proposed loss function better final performance. In terms of geometric decision margins, ArcFace has an overlap area when the feature deviates too much from its class center, because its target logit curve is non-monotonic. A feature in this area can be classified as either class 1 or class 2 with ArcFace. The proposed loss function does not have this overlap area (see Figure 3).
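The monotonicity argument can be checked numerically. The sketch below assumes the linear function f(θ) = (π − 2θ)/π, which maps [0, π] onto [1, −1], a scale of s = 64, and a margin of 0.5 for the curve comparison. The helper names and the example angles are hypothetical, so treat this as an illustration of the idea rather than the authors' implementation.

```python
import math

def linear_logit(theta, s=64.0):
    """s * f(theta) with f(theta) = (pi - 2*theta) / pi (assumed form)."""
    return s * (math.pi - 2.0 * theta) / math.pi

def li_arcface_loss(thetas, label, s=64.0, m=0.4):
    """Per-sample Li-ArcFace loss: the margin m is added to the target angle only."""
    logits = [linear_logit(t + m, s) if j == label else linear_logit(t, s)
              for j, t in enumerate(thetas)]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def is_strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))

grid = [i * math.pi / 200 for i in range(201)]        # theta sampled over [0, pi]
li_curve = [linear_logit(t + 0.5) for t in grid]      # Li-ArcFace target logit, m = 0.5
arc_curve = [64.0 * math.cos(t + 0.5) for t in grid]  # ArcFace target logit, m = 0.5
```

On this grid `is_strictly_decreasing(li_curve)` holds, while `arc_curve` turns upward once θ + m passes π. That upturn is the non-monotonic region associated with the overlap area in the decision margins.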

3.2 Network Architectures

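| Input | Operator | t | c | n | s |
|---|---|---|---|---|---|
| 112x112x3 | conv3x3 | - | 64 | 1 | 2 |
| 56x56x64 | depthwise conv3x3 | - | 64 | 1 | 1 |
| 56x56x64 | bottleneck | 2 | 64 | 1 | 2 |
| 28x28x64 | bottleneck | 2 | 64 | 9 | 1 |
| 28x28x64 | bottleneck | 4 | 128 | 1 | 2 |
| 14x14x128 | bottleneck | 2 | 128 | 16 | 1 |
| 14x14x128 | bottleneck | 8 | 256 | 1 | 2 |
| 7x7x256 | bottleneck | 2 | 256 | 6 | 1 |
| 7x7x256 | conv1x1 | - | 1024 | 1 | 1 |
| 7x7x1024 | linear GDConv7x7 | - | 1024 | 1 | 1 |
| 1x1x1024 | linear conv1x1 | - | 512 | 1 | 1 |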

Table 1: The proposed network architecture. t refers to the expansion factor, c to the number of output channels, n to the number of repetitions, and s to the stride.
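To make the table easier to check, the sketch below transcribes its rows into a Python list and traces the spatial size and channel count through the network. It assumes that padding keeps the spatial size unchanged apart from the stride, and that the 7x7 global depthwise convolution collapses the 7x7 map to 1x1. The names `LAYERS` and `trace_shapes` are hypothetical.

```python
# (operator, t, c, n, s) rows transcribed from Table 1; t is None where the table shows "-".
LAYERS = [
    ("conv3x3",           None,   64,  1, 2),
    ("depthwise conv3x3", None,   64,  1, 1),
    ("bottleneck",        2,      64,  1, 2),
    ("bottleneck",        2,      64,  9, 1),
    ("bottleneck",        4,     128,  1, 2),
    ("bottleneck",        2,     128, 16, 1),
    ("bottleneck",        8,     256,  1, 2),
    ("bottleneck",        2,     256,  6, 1),
    ("conv1x1",           None, 1024,  1, 1),
    ("linear GDConv7x7",  None, 1024,  1, 1),
    ("linear conv1x1",    None,  512,  1, 1),
]

def trace_shapes(size=112, channels=3):
    """Return the input shape of every row, followed by the final output shape."""
    shapes = [(size, size, channels)]
    for operator, _t, out_channels, _n, stride in LAYERS:
        if "GDConv" in operator:
            size = 1  # assumption: 7x7 depthwise kernel, no padding, so 7x7 becomes 1x1
        else:
            size = size // stride  # assumption: padding preserves size up to the stride
        shapes.append((size, size, out_channels))
    return shapes

shapes = trace_shapes()
bottleneck_blocks = sum(n for op, _t, _c, n, _s in LAYERS if op == "bottleneck")
```

Under these assumptions the traced shapes reproduce the Input column of the table, the output is a 1x1x512 embedding, and the network contains 34 bottleneck blocks in total.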

In this section, we introduce our proposed network architecture. It is based on a deeper MobileFaceNet (y2) [5], so the residual bottlenecks proposed in MobileNetV2 are used as our main building blocks. Table 1 shows the details of our network architecture. Following MobileFaceNet, the expansion factors for the bottlenecks in our architecture are much smaller than those in MobileNetV2, and we use PReLU as the non-linearity rather than ReLU. In addition, we noticed the importance of network width in face recognition. MobileFaceNet is significantly wider than MobileNetV2 at the beginning of the network, but it does not double the output channels when downsampling at the last stage. We do so, within a limited computational budget. At the same time, we carefully adjusted the depth of the network and introduced the attention module CBAM [22] into every bottleneck in the network. However, we changed the last activation function in the attention module from sigmoid to 1+tanh. The range of 1+tanh is 0 to 2. When a channel or spatial position needs to be enhanced, it is multiplied by a weight between one and two; when it needs to be weakened, it is multiplied by a weight between zero and one. Using 1+tanh instead of sigmoid is more intuitive and makes training converge faster.
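The difference between the two gating functions can be seen in a few lines. This is a minimal scalar sketch, not the CBAM module itself; the function names and the example activation value are hypothetical.

```python
import math

def sigmoid_gate(z):
    """Sigmoid gate: weight in (0, 1), so it can only attenuate."""
    return 1.0 / (1.0 + math.exp(-z))

def one_plus_tanh_gate(z):
    """1 + tanh gate: weight in (0, 2), equal to 1 (identity) at z = 0."""
    return 1.0 + math.tanh(z)

feature = 3.0  # hypothetical activation value
enhanced = feature * one_plus_tanh_gate(1.0)    # weight between one and two
weakened = feature * one_plus_tanh_gate(-1.0)   # weight between zero and one
unchanged = feature * one_plus_tanh_gate(0.0)   # weight exactly one
```

With 1+tanh, a zero pre-activation leaves the feature untouched, whereas a sigmoid gate at zero already halves it.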

3.3 Training Tricks for Face Recognition

During the LFR competition, we found some useful training tricks for face recognition. Firstly, fine-tuning the model with a variety of loss functions makes the features more robust and improves the accuracy to some extent. In the competition, we used Li-ArcFace, ArcFace, and the combined loss to fine-tune our model. Secondly, in the 512-dimensional embedding feature space, it is difficult for a lightweight model to learn the distribution of the features. Using large models to guide the feature distribution of lightweight models is an effective remedy [9, 13].
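The text does not say how the large model guides the lightweight one. One common choice, assumed here purely for illustration, is to penalise the squared distance between the L2-normalised student and teacher embeddings. The function names and the toy vectors below are hypothetical.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def feature_guidance_loss(student_embedding, teacher_embedding):
    """Squared Euclidean distance between unit-length embeddings (assumed form).

    For unit vectors this equals 2 - 2*cos(angle): it is 0 when the student
    points in the teacher's direction and 4 when it points the opposite way.
    """
    s = l2_normalize(student_embedding)
    t = l2_normalize(teacher_embedding)
    return sum((a - b) ** 2 for a, b in zip(s, t))

# Hypothetical 4-dimensional embeddings; real ones would be 512-dimensional.
teacher = [0.5, 0.5, 0.5, 0.5]
aligned_student = [1.0, 1.0, 1.0, 1.0]
opposite_student = [-2.0, -2.0, -2.0, -2.0]
```

In practice such a term would be added to the classification loss, with a weighting that the paper does not specify.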

4 Experiments

4.1 Ablation Experiment of Li-ArcFace

In this section, we mainly introduce some comparative experiments of different loss functions.

Implementation Details. We use MobileFaceNet as the network architecture, with the embedding size set to 128. The batch size is set to 256 x 4, and models are trained on four NVIDIA TITAN Xp GPUs. We use SGD with momentum 0.9 to optimize the models. The feature scale is set to 64. The training data set is CASIA-WebFace, which contains about 0.5M images of 10K identities. The learning rate starts at 0.1 and is divided by 10 at 18K, 26K and 29K iterations; training stops at 30K iterations. Finally, we compare the performance on Labelled Faces in the Wild (LFW) [11], Celebrities in Frontal Profile (CFP) [18] and Age Database (AgeDB) [15]. In this paper, ArcFace (m=0.5, NS) means that the margin of ArcFace is set to 0.5 and that the model is pre-trained with N-Softmax before ArcFace is used.
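The step schedule described above can be written as a small function. This sketch assumes the drop takes effect at the milestone iteration itself; the function and constant names are hypothetical.

```python
def learning_rate(iteration, base_lr=0.1, milestones=(18_000, 26_000, 29_000), factor=0.1):
    """Step schedule: multiply the rate by `factor` once for each milestone already reached."""
    drops = sum(1 for step in milestones if iteration >= step)
    return base_lr * factor ** drops

TOTAL_ITERATIONS = 30_000  # training stops here

# Learning rate in effect at the start of each phase.
phase_starts = (0, 18_000, 26_000, 29_000)
phase_rates = [learning_rate(start) for start in phase_starts]
```

The four phases use rates of roughly 0.1, 0.01, 0.001 and 0.0001, with the last phase covering only the final 1K iterations.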

| Weight decay | LFW | CFP-FP | AgeDB |
|---|---|---|---|
| 5e-4 | 99.17 | 94.09 | 92.95 |
| 5e-4 wdml10 | 99.18 | 93.86 | 93.18 |
| 4e-5 wdml10 | 99.05 | 92.07 | 92.00 |

Table 2: Verification performance (%) for different weight decay values. wdml10 means that the weight decay multiplier (wd_mult) of the last convolution layer is 10.

Weight decay. Before comparing different loss functions, we first ran experiments on weight decay. We settled on a weight decay of 5e-4, except for the last layer before the embedding layer, whose weight decay is 5e-3. The experimental results (see Table 2) suggest that the last layer calls for a larger weight decay.

Effect of Hyper-parameter m. In Table 3, we first explored the optimal setting of $m$ for Li-ArcFace and found it to be between 0.4 and 0.45. We tried training the model from scratch with ArcFace, but training diverged after about 1600 iterations. Therefore, all the experiments with ArcFace were pre-trained with N-Softmax. Since the embedding size is smaller than 512, we also explored the optimal setting of $m$ for ArcFace and found it to be between 0.45 and 0.5.

| Loss function | LFW | CFP-FP | AgeDB |
|---|---|---|---|
| N-Softmax | 97.87 | 90.46 | 86.55 |
| Li-ArcFace (m=0.35) | 99.20 | 94.20 | 92.87 |
| Li-ArcFace (m=0.4) | 99.27 | 94.11 | 93.25 |
| Li-ArcFace (m=0.45) | 99.23 | 94.20 | 93.15 |
| Li-ArcFace (m=0.5) | 99.20 | 94.10 | 92.93 |
| Li-ArcFace (m=0.4, NS) | 99.23 | 94.51 | 92.90 |
| Li-ArcFace (m=0.45, NS) | 99.27 | 94.39 | 93.22 |
| ArcFace (m=0.4, NS) | 99.27 | 93.93 | 92.72 |
| ArcFace (m=0.45, NS) | 99.27 | 94.00 | 92.98 |
| ArcFace (m=0.5, NS) | 99.18 | 93.86 | 93.18 |
| CosFace/AM-Softmax | 99.20 | 93.39 | 92.55 |

Table 3: Verification performance (%) of different loss functions. ArcFace (m=0.5, NS) means that the margin of ArcFace is set to 0.5 and that the model is pre-trained with N-Softmax before ArcFace is used.
Figure 4: The accuracy of the verification sets during training. CFP-FP is on the left, AgeDB is on the right.

Comparison with state-of-the-art loss functions. In Table 3, the difference between Li-ArcFace, ArcFace and CosFace is tiny on LFW, but all of them are clearly better than N-Softmax. On CFP-FP and AgeDB, Li-ArcFace is slightly better than ArcFace and CosFace. We plot the accuracy on CFP-FP and AgeDB during training, which makes the contrast more obvious (see Figure 4). We also compared Li-ArcFace and ArcFace under the same conditions, with both pre-trained by N-Softmax. On CFP-FP, Li-ArcFace achieves the highest accuracy, and Li-ArcFace is still slightly better than ArcFace on AgeDB. In general, there is little difference in accuracy on the verification sets, but Li-ArcFace converges better and does not need a pre-training stage when training a model with a small embedding size.

4.2 Evaluation Results of Network Architecture and Training Tricks

We name the model that combines our network architecture and training tricks AirFace. Under the same training data set, MS1M-retina [2], and the same model constraints (MS1M-retina is cleaned from [8] by [5], and [7] is the face detector and alignment tool used to pre-process the data), the accuracy of AirFace reached 88.415% at FPR=1e-8 in the deepglint-light challenge of LFR19 [6, 3]. We also evaluated AirFace on MegaFace Challenge 1 against previous state-of-the-art models. As Table 4 shows, AirFace reaches identification and verification accuracy close to that of the R100-based models at about 1G FLOPs, compared with 27G for R100.

| Methods | Id (%) | Ver (%) | FLOPs |
|---|---|---|---|
| FaceNet [17] | 70.49 | 86.47 | - |
| CosFace [21] | 82.72 | 96.65 | - |
| R100, MS1MV2, ArcFace [5] | 81.03 | 96.98 | 27G |
| R100, MS1MV2, CosFace [5] | 80.56 | 96.56 | 27G |
| R100, MS1MV2, ArcFace, R [5] | 98.35 | 98.48 | 27G |
| R100, MS1MV2, CosFace, R [5] | 97.91 | 97.91 | 27G |
| MobileFaceNet [4] | - | 90.16 | 440M |
| MobileFaceNet, R [4] | - | 92.59 | 440M |
| AirFace, MS1M-retina | 80.80 | 96.52 | 1G |
| AirFace, MS1M-retina, R | 98.04 | 97.93 | 1G |

Table 4: Identification and verification evaluation on MegaFace Challenge 1. Id refers to the rank-1 face identification accuracy with 1M distractors, and Ver refers to the face verification TAR at $10^{-6}$ FAR. R refers to data refinement on both the probe set and the 1M distractors. The data-cleaning list and the code for calculating FLOPs are from InsightFace [1].

5 Conclusions

In this paper, we first propose a novel additive margin loss function for deep face recognition, based on ArcFace. The proposed loss function solves the problem that ArcFace does not converge when training models with a small embedding size, and it achieves state-of-the-art results on several face verification datasets. Second, we carefully designed an efficient network architecture and explored some useful training tricks for face recognition, which make our model AirFace both accurate and efficient in the deepglint-light challenge and on MegaFace Challenge 1.

