Robust RGB-D Face Recognition Using Attribute-Aware Loss
Existing convolutional neural network (CNN) based face recognition algorithms typically learn a discriminative feature mapping, using a loss function that enforces separation of features from different classes and/or aggregation of features within the same class. However, they may suffer from bias in the training data, such as uneven sampling density, because they optimize the adjacency relationship of the learned features without considering the proximity of the underlying faces. Moreover, since they use only facial images for training, the learned feature mapping may not correctly reflect the relationship of other attributes such as gender and ethnicity, which can be important for some face recognition applications. In this paper, we propose a new CNN-based face recognition approach that incorporates such attributes into the training process. Using an attribute-aware loss function that regularizes the feature mapping with attribute proximity, our approach learns more discriminative features that are correlated with the attributes. We train our face recognition model on a large-scale RGB-D data set with over 100K identities captured under real application conditions. By comparing our approach with other methods in a variety of experiments, we demonstrate that the depth channel and the attribute-aware loss greatly improve the accuracy and robustness of face recognition.
Convolutional neural networks (CNNs) play a significant role in face analysis tasks such as landmark detection [sun2013deep, zhang2016joint], face recognition [huang2014labeled, wolf2011face, kemelmacher2016megaface] and 3D face reconstruction [richardson2016learning, Guo20183DFace]. With the emergence of large public face data sets [yi2014learning] and sophisticated networks [szegedy2015going, he2016deep], the problem of face recognition has attracted considerable attention and developed rapidly. At present, some mainstream methods already outperform humans on certain benchmark datasets such as [huang2014labeled]. These methods usually map faces to discriminative feature vectors in a high-dimensional Euclidean space, and determine whether a pair of faces belongs to the same identity by comparing these vectors. For example, deep metric learning (such as the contrastive loss [hadsell2006dimensionality] or triplet loss [schroff2015facenet]) trains a CNN by comparing pairs or triplets of facial images to learn discriminative features. Later, different variants of the softmax loss [taigman2014deepface, wen2016discriminative, ranjan2017l2, wang2017normface, liu2017sphereface] were used as supervision signals in CNNs to extract discriminative features, achieving excellent performance under the protocol of small training sets. These methods [schroff2015facenet, wen2016discriminative, liu2017sphereface] utilize CNNs to learn strongly discriminative deep features, using loss functions that enforce intra-class compactness and/or inter-class dispersion.
Although the above two categories of methods have achieved remarkable performance, they still have their own limitations. First, the contrastive loss and triplet loss suffer from slow convergence due to the construction of a large number of pairs or triplets. To accelerate convergence, [sohn2016improved] proposed an $(N+1)$-tuplet loss that increases the number of negative examples. However, this loss still requires complex recombination of training samples. In comparison, the softmax loss and its variants have no such requirement on the training data, and converge more quickly. The center loss [wen2016discriminative] was the first to add soft constraints on deep features to the softmax loss to minimize the intra-class variations, significantly improving the performance of the softmax loss. Afterwards, the angular softmax loss [liu2017sphereface] imposed discriminative constraints on a hypersphere manifold, which further improved the performance of the softmax loss. However, by enforcing intra-class aggregation and inter-class separation among the training data, existing variants of the softmax loss encourage a uniform distribution of feature vectors for the training data, even though the training data may not be sampled uniformly. As a result, the proximity between the learned feature vectors for two test data may not correctly indicate the proximity between their underlying faces, which can affect the accuracy of face recognition algorithms based on feature proximity. To address this issue, we propose an attribute-aware loss function that regularizes the learned feature mapping using other attributes such as gender, ethnicity and age. The proposed loss function imposes a global linear relation between the feature difference and the attribute difference between nearby training data, such that feature vectors for facial data with similar attributes are driven towards each other.
In addition, as these attributes are correlated with facial geometry and appearance, the attribute-aware loss also implicitly regularizes the feature proximity with respect to the facial proximity, which helps to account for potential sampling bias in the training set.
In addition, although existing RGB image based face recognition methods have achieved great success, they rely solely on appearance information and may suffer from poor lighting conditions such as dark environments. On the other hand, depth images captured by RGB-D sensors such as PrimeSense provide additional geometric information that is independent of illumination, which can help to improve the robustness of recognition. To this end, we develop a CNN-based RGB-D face recognition approach, by first aligning the depth map with the RGB image grid and normalizing the depth values to the same range as the RGB values, and then feeding the resulting RGB-D values into CNNs for training and testing. Unlike existing RGB-D based deep learning approaches [LeeCTL16, HCSC18] that only use small training data sets with fewer than 1K identities, we train our model on a large RGB-D data set with over 100K identities, and the resulting model achieves more robust performance than RGB based approaches.
Combining the RGB-D approach with the attribute-aware loss function, our new method greatly improves the robustness and accuracy of face recognition. We tested our method on several datasets with different identities in diverse facial expressions and lighting conditions. Our method performs consistently better than state-of-the-art approaches that rely only on RGB information and do not consider additional attributes.
To summarize, this paper makes the following major contributions:
We propose an attribute-aware loss function for CNN-based face recognition, which regularizes the distribution of learned feature vectors with respect to additional attributes and improves the accuracy of recognition results. To the best of our knowledge, this is the first method that utilizes non-facial attributes to improve CNN-based face recognition feature training.
We develop a CNN-based RGB-D face recognition approach, and construct a large RGB-D data set with over 100K identities for neural network training and testing. This is the first result that verifies the effectiveness of CNN-based RGB-D face recognition with large training data sets.
2 Related Work
Face recognition is a classical research topic in pattern recognition and computer vision, with applications in many areas such as biometrics, surveillance systems and information security. For a comprehensive review of 2D face recognition and 3D face recognition methods, one may refer to [ParkhiVZ15, ABATE20071885]. This section briefly reviews those techniques that are closely related to our work.
2.1 Deep Learning based Face Recognition
In the past few years, deep learning based face recognition has been one of the most active research areas. In this part, we mainly discuss the loss functions used in these methods.
Metric Learning. Metric learning [xing2002distance, weinberger2009distance, Wang2011kernel] attempts to optimize a parametric notion of distance in a fully/weakly/semi-supervised way, such that similar objects are nearby and dissimilar objects are far apart in a target space. In [xing2002distance], the learning is done by finding a Mahalanobis distance parameterized by a matrix, given some similar pairs of samples. In order to handle more challenging problems, kernel tricks [Wang2011kernel, jain2012metric] have been introduced in metric learning to extract nonlinear embeddings. In recent years, more discriminative features can be learned with advanced network architectures that minimize loss functions based on Euclidean distance, such as the contrastive loss [hadsell2006dimensionality] and triplet loss [schroff2015facenet]. Moreover, these loss functions can be improved by allowing joint comparison among more than one negative example [sohn2016improved] or by minimizing the overall classification error [kumar2016learning].
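For concreteness, the triplet loss mentioned above can be sketched in a few lines of NumPy; this is a minimal single-triplet version in the style of FaceNet [schroff2015facenet], with an illustrative function name and a default margin chosen only for the example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on one (anchor, positive, negative) triplet.

    Pushes the anchor closer to the positive (same identity) than to the
    negative (different identity) by at least `margin`, in squared
    Euclidean distance.
    """
    d_ap = np.sum((anchor - positive) ** 2)   # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)   # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)     # hinge: zero once satisfied
```

The hinge explains the slow convergence noted below: once a triplet satisfies the margin its gradient vanishes, so training progress depends on mining informative pairs and triplets.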
Classification Losses. The most commonly used classification loss is the softmax loss, which maps images to deep features and then to predicted labels. Krizhevsky et al. [krizhevsky2012imagenet] first observed that CNNs trained with the softmax loss can produce discriminative feature vectors, which has also been confirmed by other works [sharif2014cnn]. However, the softmax loss mainly encourages inter-class dispersion, and thus cannot induce strongly discriminative features. To enhance the discrimination power of deep features, Wen et al. [wen2016discriminative] proposed the center loss to enforce intra-class aggregation as well as inter-class dispersion. Meanwhile, Ranjan et al. [ranjan2017l2] observed that the softmax loss is biased to the sample distribution, i.e., it fits well to high-quality faces but ignores low-quality ones. Adding $L_2$-constraints on the features to the softmax loss can make the resulting features as discriminative as those trained with the center loss. Afterwards, Liu and colleagues [liu2016large, liu2017sphereface] further improved the features by incorporating an angular margin instead of the Euclidean margin into the softmax loss, enhancing the inter-class margin and compressing the intra-class angular distribution simultaneously.
2.2 Face Recognition with Attributes
Besides the feature vectors extracted from CNNs, other attributes can also be utilized in face recognition tasks. An early study [KumarBBN09] trained 65 “attribute” SVM classifiers to recognize traits of input facial images such as gender, age, race, and hair color, which are then fused with other features for face recognition. In the context of deep learning, attribute-enhanced face recognition has not gained much attention. One related work [SamangoueiC16] exploits CNN-based attribute features for authentication on mobile devices, where the facial attributes are trained with a multi-task, part-based deep convolutional neural network architecture. Hu et al. [hu2017attribute] systematically studied the problem of how to fuse face recognition features and facial attribute features to enhance face recognition performance. They reformulate feature fusion as a gated two-stream neural network, which can be efficiently optimized by neural network learning.
Based on the assumption that attributes like gender, age and pose could share low-level features from the representation learning perspective, some studies investigate multi-task learning [rudd2016moon, RanjanSCC17] and show that such attributes could help the face recognition task. In our method, different from the above attribute fusion and multi-task learning methods, the attributes are directly used to guide the face recognition feature learning in the training stage and they are not needed during the testing stage.
2.3 RGB-D Face Recognition
In recent years, RGB-D based face recognition has attracted increasing attention because of its robustness in unconstrained environments. Hsu et al. [HsuLPW14] considered a scenario in which the gallery is a pair of RGB-D images while the probe is a single RGB image captured by a regular camera without the depth channel. They proposed an approach that reconstructs a 3D face from an RGB-D image for each subject in the gallery, aligns the reconstructed 3D model to a probe using facial landmarks, and recognizes the probe using sparse representation based classification. Zhang et al. [HCSC18] further considered the problems of multi-modality matching (e.g., RGB-D probe vs. RGB-D gallery) and cross-modality matching (e.g., RGB probe vs. RGB-D gallery) in the same framework. They proposed an approach for RGB-D face recognition that is able to learn complementary features from multiple modalities and common features between different modalities. For the RGB-D vs. RGB-D problem, Goswami et al. [GoswamiVS14] proposed to compute an RGB-D image descriptor based on entropy and saliency, as well as geometric facial attributes from the depth map; the descriptor and the attributes are then fused to perform recognition. Li et al. [LiXMLK16] proposed a multi-channel weighted sparse coding method on hand-crafted features for RGB-D face recognition.
Although it is straightforward to extend deep learning based face recognition methods from RGB images to RGB-D images, currently there are no large-scale public RGB-D data sets that can be used for training, which limits the practical applications of these approaches. For example, the model proposed in [HCSC18] is trained on a dataset with fewer than 1K identities. To handle this problem, Lee et al. [LeeCTL16] proposed to first train the deep network on a color face dataset, and then fine-tune it on depth face images for transfer learning.
3.1 Revisiting the Variants of Softmax Loss
Given a training data set $\{\mathbf{x}_i\}_{i=1}^{N}$ and their corresponding labels $\{y_i\}_{i=1}^{N}$ with $y_i \in \{1, \ldots, n\}$, the following classical softmax loss function is widely used in face recognition tasks:
$$\mathcal{L}_s = -\sum_{i=1}^{N} \log \frac{e^{\mathbf{W}_{y_i}^{T} \mathbf{f}_i + b_{y_i}}}{\sum_{j=1}^{n} e^{\mathbf{W}_j^{T} \mathbf{f}_i + b_j}}, \tag{1}$$
where $F(\cdot)$ is the feature mapping learned by training CNNs, $\mathbf{W}_j$ and $b_j$ are the weights and biases in the last fully connected layer, and the inputs $\mathbf{x}_i$ can be color or depth images of faces. We denote $\mathbf{f}_i = F(\mathbf{x}_i)$ for simplicity. Typically, during the test phase the mapping $F$ is applied on an image pair to extract two deep features $(\mathbf{f}_1, \mathbf{f}_2)$, and the Euclidean distance or cosine distance between the features is computed to determine the similarity of the image pair. Separable features can be learned using the softmax loss, but they are not discriminative enough for face recognition.
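For reference, the classical softmax loss of Eq. (1) over a mini-batch of deep features can be sketched in NumPy as follows; the function names are illustrative only, and the log-sum-exp shift is a standard numerical-stability device not spelled out in the equation.

```python
import numpy as np

def softmax_loss(features, labels, W, b):
    """Classical softmax loss over a batch of deep features.

    features: (N, d) deep features f_i; labels: (N,) integer class ids;
    W: (d, n) last fully connected layer weights; b: (n,) biases.
    """
    logits = features @ W + b                        # (N, n) class scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With all-zero features and weights the loss equals $\log n$ (a uniform prediction over $n$ classes), and it approaches zero as the correct class logit dominates.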
To learn more discriminative features, several variants of the softmax loss have been developed by enlarging the inter-class margin and reducing the intra-class variation. Among them, the center loss [wen2016discriminative] requires the deep features of each class to gather towards their respective centers $\{\mathbf{c}_j\}_{j=1}^{n}$:
$$\mathcal{L}_c = \frac{1}{2} \sum_{i=1}^{N} \|\mathbf{f}_i - \mathbf{c}_{y_i}\|_2^2. \tag{2}$$
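A minimal NumPy sketch of the center-loss term in [wen2016discriminative] (the auxiliary term only, with an illustrative function name; in training it is added to the softmax loss and the centers are updated alongside the network):

```python
import numpy as np

def center_loss(features, labels, centers):
    """Center-loss term: pulls each deep feature toward its class center.

    features: (N, d) deep features; labels: (N,) class ids;
    centers: (n, d) one learned center per class.
    """
    diff = features - centers[labels]   # per-sample offset from own center
    return 0.5 * np.sum(diff ** 2)      # (1/2) * sum of squared distances
```

The loss is zero exactly when every feature coincides with its class center, which is why it enforces intra-class compactness.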
With the angular softmax loss [liu2017sphereface], deep features of each class are compressed using an angular margin $m$ instead of the Euclidean margin:
$$\mathcal{L}_{ang} = -\sum_{i=1}^{N} \log \frac{e^{\|\mathbf{f}_i\| \cos(m\,\theta_{y_i,i})}}{e^{\|\mathbf{f}_i\| \cos(m\,\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\|\mathbf{f}_i\| \cos(\theta_{j,i})}}, \tag{3}$$
where $\theta_{j,i}$ is the angle between the vectors $\mathbf{W}_j$ and $\mathbf{f}_i$. There are other variants of the softmax loss [wang2018additive, deng2018arcface] with a similar form as (3), where the margin $m$ and the angle $\theta$ are added instead of being multiplied.
3.2 The Attribute-Aware Loss
To achieve high accuracy for face recognition, it is desirable that the proximity between feature clusters of different classes is consistent with the proximity between the classes (i.e., the underlying faces). Ideally, the more dissimilar two faces are, the further apart their corresponding feature clusters should be. This, however, is not guaranteed by the above variants of softmax loss. Since they minimize the intra-class variations and maximize the inter-class margins on the training data, the learned feature mappings tend to produce evenly distributed feature vectors for the training faces. On the other hand, there is no guarantee that the facial images in the training set are evenly distributed in the full face space. As a result, when there exist large variations of sampling density in the training data set, the learned feature mapping may not correctly indicate the proximity of the underlying faces. To address this issue, we can try to introduce a loss function term that regularizes feature proximity with respect to face proximity. However, this is a challenging task, as a facial image only reveals the underlying face shape from a certain view direction and can be affected by various factors such as lighting conditions and sensor noise. As a result, it is difficult to reliably compute the proximity between two faces by only comparing their scanned images.
Besides the proximity of face shapes, it is also desirable that the learned feature mapping is related to the proximity between other attributes such as gender, ethnicity and age. For example, if we compare a probe image against a database of facial images to identify the most likely matches via feature proximity, then it is preferable that all returned images are from persons with the same or similar attributes. The above variants of softmax loss cannot ensure this property either, as they only consider the facial images during the training process.
Motivated by these observations, we propose an attribute-aware loss term that regularizes feature proximity with respect to attribute proximity. Besides the label information, other attributes of the facial images such as gender, ethnicity and age are also given in the training data set. These attributes can be collected during training data construction, and they are independent of the imaging process. We represent the augmented attributes for a facial image $\mathbf{x}_i$ using a vector $\mathbf{a}_i$. Then our attribute-aware loss is formulated as
$$\mathcal{L}_a = \sum_{\|\mathbf{a}_i - \mathbf{a}_j\|_2 < \epsilon} \left\| (\mathbf{f}_i - \mathbf{f}_j) - \mathbf{M}\,(\mathbf{a}_i - \mathbf{a}_j) \right\|_2^2, \tag{4}$$
where $\mathbf{M}$ is a parameter matrix that needs to be trained, $\|\mathbf{a}_i - \mathbf{a}_j\|_2$ is the Euclidean distance between the two attribute vectors, and $\epsilon$ is a user-specified threshold. Intuitively, this loss term can drive feature clusters with similar attributes towards each other, via a global linear mapping that relates the feature difference to the attribute difference. To the best of our knowledge, this is the first work in face recognition that optimizes the adjacency of learned features using attribute proximity. As shown in Fig. 1, the learned feature clusters with similar attributes become closer after our adjacency optimization.
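The pair-wise structure of this loss can be illustrated with a small NumPy sketch. This is one plausible form consistent with the description above (a residual between the feature difference and a linear image of the attribute difference, summed over pairs whose attribute distance is below the threshold); the function name, the averaging, and the exact residual form are our assumptions, not a definitive statement of the paper's implementation.

```python
import numpy as np

def attribute_aware_loss(features, attrs, M, eps):
    """Illustrative attribute-aware loss over all close-attribute pairs.

    features: (N, d) deep features; attrs: (N, k) attribute vectors;
    M: (d, k) trainable matrix relating attribute differences to
    feature differences; eps: attribute-distance threshold.
    """
    loss, count = 0.0, 0
    N = len(features)
    for i in range(N):
        for j in range(i + 1, N):
            if np.linalg.norm(attrs[i] - attrs[j]) < eps:   # close attributes
                r = (features[i] - features[j]) - M @ (attrs[i] - attrs[j])
                loss += np.sum(r ** 2)
                count += 1
    return loss / max(count, 1)   # average over the contributing pairs
```

When the features are exactly a linear image of the attributes ($\mathbf{f}_i = \mathbf{M}\mathbf{a}_i$), every residual vanishes and the loss is zero, which matches the intuition that the term ties feature geometry to attribute geometry.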
From another perspective, regularization using additional attributes can help the network pick up other useful cues for face recognition, because attributes such as gender, ethnicity and age are highly correlated with facial shape and appearance. For example, there can be notable differences between the facial appearances of two persons of different genders. Therefore, the attribute-aware loss can improve the learned feature mapping by implicitly utilizing the appearance variation related to these attributes.
3.3 Training with the Attribute-Aware Loss
Similar to [wen2016discriminative], our attribute-aware loss in Eq. (4) is an auxiliary supervision signal, which can be combined with any variant of softmax loss. For example, it can be combined with the classical softmax loss $\mathcal{L}_s$ to derive a loss function
$$\mathcal{L} = \mathcal{L}_s + \lambda \mathcal{L}_a, \tag{5}$$
where $\lambda$ is a user-specified weight that balances the two loss terms. In the following, we provide the details of mini-batch training with the loss function $\mathcal{L}$. Each input mini-batch consists of facial data $\{\mathbf{x}_i\}$, as well as their identity labels $\{y_i\}$ and attributes $\{\mathbf{a}_i\}$. These data are fed to the CNN, the softmax loss layer, and the attribute-aware loss layer respectively, as illustrated in Fig. 2. The gradients of $\mathcal{L}$ with respect to the deep features and the parameter matrix are then computed and back-propagated through the network.
Finally, all of the parameters in the CNN and the two loss layers can be learned by standard stochastic gradient descent. In Algorithm 1, we summarize the learning details of the CNNs with joint supervision.
3.4 Training with RGB-D facial data
In this paper, we use RGB-D facial images as the training data, to improve the robustness to illumination conditions compared with RGB facial images. The RGB-D data are collected using low-cost sensors such as PrimeSense. Using the RGB part of a facial image, we first detect the face region and five landmarks (the eyes, the nose, and the two corners of the mouth) using MTCNN [zhang2016joint]. The face is then cropped and aligned by a similarity transformation, and each RGB color component is normalized from the range $[0, 255]$ into $[0, 1]$. Afterwards, we extract a face region from the corresponding depth image by transferring the RGB face region. Similar to [kim2017deep, zulqarnain2018learning], we find the nose tip and crop the point cloud in the face region within an empirically chosen radius. Then we move the center of the cropped facial scan to a fixed position in front of the camera and reproject it onto a 2D image plane to generate a new depth map; the position is chosen to enlarge the projection of the facial scan onto the image plane as much as possible. Following [hernandez2015near], we compute the depth of each pixel with bilinear interpolation. Using this depth map, we generate a new point cloud under the camera coordinate system. Each point $(x, y, z)$ is further normalized as:
$$\tilde{v} = \frac{v - v_{\min}}{v_{\max} - v_{\min}}, \quad v \in \{x, y, z\},$$
where $v_{\min}$ and $v_{\max}$ are the minimum and maximum $x$-, $y$- and $z$-coordinate values among all points, respectively. Augmenting the RGB face region with its normalized point cloud, we obtain a six-channel image with values in $[0, 1]$, which is fed into the deep neural network. Some RGB facial images and their normalized point clouds are shown in Fig. 2.
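The final two preprocessing steps above (per-axis min-max normalization of the point cloud and stacking it with the RGB crop into a six-channel input) can be sketched as follows; the helper names are illustrative, and we assume a nonzero coordinate range per axis and a point map already resampled onto the image grid.

```python
import numpy as np

def normalize_points(points):
    """Min-max normalize coordinates per axis to [0, 1].

    points: (..., 3) array of (x, y, z) coordinates; the min/max are
    taken over all points, independently for each axis.
    """
    flat = points.reshape(-1, 3)
    pmin, pmax = flat.min(axis=0), flat.max(axis=0)
    return (points - pmin) / (pmax - pmin)   # assumes pmax > pmin per axis

def make_six_channel(rgb, xyz):
    """Stack an RGB crop (H, W, 3) in [0, 1] with a normalized point map
    (H, W, 3) into the six-channel network input (H, W, 6)."""
    return np.concatenate([rgb, xyz], axis=-1)
```

Because both halves lie in $[0, 1]$, the color and geometry channels enter the network on a comparable scale, which is the point of the depth-value normalization described above.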
4 Experimental Results
We conduct extensive experiments to evaluate the effectiveness of our approach. We first test our RGB-D face recognition approach on a large-scale private dataset (Secs. 4.2, LABEL:subsec:_private and LABEL:subsec:_fusion_scheme) as well as some public datasets (Sec LABEL:subsec:_public). Then we compare our attribute-aware loss with other methods that utilize attributes for face recognition, using some public RGB datasets (Sec LABEL:subsec:_fused).
Our RGB-D dataset. We construct an RGB-D facial dataset that is captured by a PrimeSense camera and contains more than 1.3M RGB-D images of 110K identities, where each identity has at least seven RGB images and their corresponding depth images. Most subjects are captured in front of the camera with a neutral expression, and the multiple images of each subject are captured at different times and under different lighting conditions. Some samples from this RGB-D facial dataset are shown in Fig. 3. We also record their attributes including age, gender and ethnicity. Compared with the datasets used for RGB-D face recognition in previous work [LeeCTL16, HCSC18], our dataset contains a much larger number of identities, enabling us to evaluate the effectiveness of our approach in a real-world setting.
Implementation details. All CNN models are implemented using the Caffe library [jia2014caffe] with our modifications. Our CNN models are based on the same architecture as [wen2016discriminative], using a 28-layer ResNet [he2016deep]. We train the models using stochastic gradient descent with different loss functions on RGB data, depth data, and their combination, respectively. All CNN models are trained with a batch size of 200 on two GPUs (TITAN Xp). The learning rate begins at 0.1, and is divided by 10 after 40K and 60K iterations, respectively. The training ends at 70K iterations. The facial data are horizontally flipped for data augmentation. During testing, we extract 512-dimensional deep features from the output of the first fully connected layer. For each test image, we concatenate its 512-dimensional feature vector with that of its horizontally flipped version as the final 1024-dimensional representation. In face verification and identification, the similarity between two features is computed using their cosine distance.
4.2 Experiments on the Parameters $\lambda$ and $\epsilon$
The parameter $\lambda$ controls the importance of the attribute-aware loss $\mathcal{L}_a$, while the parameter $\epsilon$ decides whether a pair of attribute vectors is close enough to be considered in the attribute-aware loss. Since both of them are important for our loss function, we conduct two experiments to illustrate how $\lambda$ and $\epsilon$ influence the face recognition performance. We first construct a training set and a test set by sampling the whole dataset. This training set (Training Set I) includes about 0.88M RGB-D images of 60K identities, with 91% Caucasians and 9% Asians. Within the training set there are balanced distributions of age and gender, as shown in Tab. 4.2. The test set includes about 0.22M RGB-D images of 20K identities. The first available neutral image of each identity in the test set is placed in the gallery, and the remaining images are used as probes. We select gender, ethnicity and age as the attributes for training the model. For the gender attribute, we use 1 to indicate male and -1 for female. For ethnicity, since our dataset only contains Asians and Caucasians, we use 1 for Asians and -1 for Caucasians. For age, we first truncate the age value at 100, and then linearly map it into $[-1, 1]$. In this way, we represent the attributes as a 3-dimensional vector $\mathbf{a} = (a^{g}, a^{e}, a^{a}) \in [-1, 1]^{3}$, where the superscripts $g$, $e$ and $a$ indicate gender, ethnicity and age, respectively.
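The attribute encoding described above can be written as a short helper. The gender and ethnicity codes follow the text; the linear mapping of the truncated age from $[0, 100]$ onto $[-1, 1]$ is our assumption (the source leaves the range implicit), and the string arguments are illustrative.

```python
import numpy as np

def encode_attributes(gender, ethnicity, age):
    """Encode (gender, ethnicity, age) as a 3-d vector in [-1, 1]^3.

    gender: 'male' -> 1, 'female' -> -1; ethnicity: 'asian' -> 1,
    'caucasian' -> -1; age in years, truncated at 100 and linearly
    mapped to [-1, 1] (assumed mapping).
    """
    g = 1.0 if gender == 'male' else -1.0
    e = 1.0 if ethnicity == 'asian' else -1.0
    a = min(age, 100) / 50.0 - 1.0   # [0, 100] -> [-1, 1]
    return np.array([g, e, a])
```

Keeping all three components on the same $[-1, 1]$ scale means the Euclidean attribute distance used in the loss threshold $\epsilon$ is not dominated by any single attribute.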
To demonstrate the effectiveness and sensitivity of the two parameters, we train our models jointly supervised by the softmax loss and the attribute-aware loss on the RGB part of the constructed dataset. In the first experiment, we fix $\epsilon$ and vary $\lambda$ from 0 to 0.003 to learn different models. Performance on the closed-set identification task is a classical evaluation criterion for face recognition. We show the rank-1 identification rates of these models on our test set in Fig. LABEL:fig:param (left). We can see that our attribute-aware loss can greatly improve the face recognition performance, especially when $\lambda$ lies in an appropriate intermediate range. In the second experiment, we fix $\lambda$ and vary $\epsilon$ from 0 to 0.04. The corresponding rank-1 identification rates on our test set are shown in Fig. LABEL:fig:param (right). It can be observed that the identification rates remain stable over a wide range of $\epsilon$. Within this range, there are between 150 and 630 pairs of similar attribute vectors with different identities in one batch. In practice, we prefer to select a small value for $\epsilon$ due to its lower computational cost.