Robust RGB-D Face Recognition Using Attribute-Aware Loss
Existing convolutional neural network (CNN) based face recognition algorithms typically learn a discriminative feature mapping, using a loss function that enforces separation of features from different classes and/or aggregation of features within the same class. However, they may suffer from bias in the training data, such as uneven sampling density, because they optimize the adjacency relationship of the learned features without considering the proximity of the underlying faces. Moreover, since they only use facial images for training, the learned feature mapping may not correctly indicate the relationship of other attributes such as gender and ethnicity, which can be important for some face recognition applications. In this paper, we propose a new CNN-based face recognition approach that incorporates such attributes into the training process. Using an attribute-aware loss function that regularizes the feature mapping using attribute proximity, our approach learns more discriminative features that are correlated with the attributes. We train our face recognition model on a large-scale RGB-D data set with over 100K identities captured under real application conditions. By comparing our approach with other methods on a variety of experiments, we demonstrate that the depth channel and the attribute-aware loss greatly improve the accuracy and robustness of face recognition.
Convolutional neural networks (CNNs) play a significant role in face analysis tasks such as landmark detection [sun2013deep, zhang2016joint], face recognition [huang2014labeled, wolf2011face, kemelmacher2016megaface] and 3D face reconstruction [richardson2016learning, Guo20183DFace]. With the emergence of large public face data sets [yi2014learning] and sophisticated networks [szegedy2015going, he2016deep], the problem of face recognition has attracted much attention and developed rapidly. At present, some mainstream methods already outperform humans on certain benchmark datasets such as [huang2014labeled]. These methods usually map faces to discriminative feature vectors in a high-dimensional Euclidean space, and determine whether a pair of faces belongs to the same identity based on these vectors. For example, deep metric learning methods (such as the contrastive loss [hadsell2006dimensionality] or the triplet loss [schroff2015facenet]) usually train a CNN by comparing pairs or triplets of facial images to learn discriminative features. Later, different variants of the softmax loss [taigman2014deepface, wen2016discriminative, ranjan2017l2, wang2017normface, liu2017sphereface] were used as supervision signals in CNNs to extract discriminative features, achieving excellent performance under the small-training-set protocol. These methods [schroff2015facenet, wen2016discriminative, liu2017sphereface] utilize CNNs to learn strongly discriminative deep features, using loss functions that enforce either intra-class compactness or inter-class dispersion.
Although the above two categories of methods have achieved remarkable performance, they still have their own limitations. First, the contrastive loss and triplet loss suffer from slow convergence due to the construction of a large number of pairs or triplets. To accelerate convergence, [sohn2016improved] proposed an (N+1)-tuple loss that increases the number of negative examples. However, this loss still requires complex recombination of training samples. In comparison, the softmax loss and its variants place no such requirement on the training data and converge more quickly. The center loss [wen2016discriminative] was the first to add soft constraints on deep features to the softmax loss to minimize intra-class variations, significantly improving the performance of the softmax loss. Afterward, the angular softmax loss [liu2017sphereface] imposed discriminative constraints on a hypersphere manifold, further improving performance. However, by enforcing intra-class aggregation and inter-class separation among the training data, existing variants of the softmax loss encourage a uniform distribution of feature vectors for the training data, even though the training data may not be sampled uniformly. As a result, the proximity between the learned feature vectors of two test samples may not correctly indicate the proximity between their underlying faces, which can affect the accuracy of face recognition algorithms based on feature proximity. To address this issue, we propose an attribute-aware loss function that regularizes the learned feature mapping using other attributes such as gender, ethnicity, and age. The proposed loss function imposes a global linear relation between the feature difference and the attribute difference of nearby training samples, such that feature vectors for facial data with similar attributes are driven towards each other.
In addition, as these attributes are correlated with facial geometry and appearance, the attribute-aware loss also implicitly regularizes the feature proximity with respect to the facial proximity, which helps to account for potential sampling bias in the training set.
In addition, although existing RGB image-based face recognition methods have achieved great success, they rely solely on the appearance information and may suffer from poor lighting conditions such as dark environments. On the other hand, the depth image captured by RGB-D sensors such as PrimeSense sensors provides additional geometric information that is independent of illumination, which can help to improve the robustness of recognition. To this end, we develop a CNN-based RGB-D face recognition approach, by first aligning the depth map with the RGB image grid and normalizing the depth values to the same range as the RGB values, and then feeding the resulting RGB-D values into CNNs for training and testing. Unlike existing RGB-D based deep learning approaches [LeeCTL16, HCSC18] that only use small training data sets with less than 1K identities, we train our model on a large RGB-D data set with over 100K identities, where the resulting model achieves more robust performance than RGB based approaches.
Combining the RGB-D approach with the attribute-aware loss function, our new method greatly improves the robustness and accuracy of face recognition. We test our method on several datasets with different identities, diverse facial expressions, and varying lighting conditions. Our method performs consistently better than state-of-the-art approaches that rely only on RGB information and do not consider additional attributes.
To summarize, this paper makes the following major contributions:
We propose an attribute-aware loss function for CNN-based face recognition, which regularizes the distribution of learned feature vectors with respect to additional attributes and improves the accuracy of recognition results. To the best of our knowledge, this is the first method that utilizes non-facial attributes to improve CNN-based face recognition feature training.
For neural network training and testing, we construct a large-scale RGB-D face dataset including more than 100K identities, mostly in frontal poses, and a relatively small RGB-D dataset with 952 identities in various poses. This is the first result that verifies the effectiveness of CNN-based RGB-D face recognition with large training data sets.
2 Related Work
Face recognition is a classical research topic in pattern recognition and computer vision, with applications in many areas such as biometrics, surveillance systems, and information security. For a comprehensive review of 2D and 3D face recognition methods, one may refer to [ParkhiVZ15, ABATE20071885]. This section briefly reviews the techniques that are closely related to our work.
2.1 Deep Learning based Face Recognition
In the past few years, deep learning based face recognition has been one of the most active research areas. In this part, we mainly discuss the loss functions used in these methods.
Metric Learning. Metric learning [xing2002distance, weinberger2009distance, Wang2011kernel] attempts to optimize a parametric notion of distance in a fully/weakly/semi-supervised way such that similar objects are nearby and dissimilar objects are far apart in a target space. In [xing2002distance], the learning is done by finding a Mahalanobis distance parameterized by a matrix, given some similar pairs of samples. To handle more challenging problems, kernel tricks [Wang2011kernel, jain2012metric] have been introduced into metric learning to extract nonlinear embeddings. In recent years, more discriminative features have been learned with advanced network architectures that minimize loss functions based on Euclidean distance, such as the contrastive loss [hadsell2006dimensionality] and the triplet loss [schroff2015facenet]. Moreover, these loss functions can be improved by allowing joint comparison among multiple negative examples [sohn2016improved] or by minimizing the overall classification error [kumar2016learning].
Classification Losses. The most commonly used classification loss is the softmax loss, which maps images to deep features and then to predicted labels. Krizhevsky et al. [krizhevsky2012imagenet] first observed that CNNs trained with the softmax loss can produce discriminative feature vectors, which has also been confirmed by other works [sharif2014cnn]. However, the softmax loss mainly encourages inter-class dispersion, and thus cannot induce strongly discriminative features. To enhance the discriminative power of deep features, Wen et al. [wen2016discriminative] proposed the center loss to enforce intra-class aggregation as well as inter-class dispersion. Meanwhile, Ranjan et al. [ranjan2017l2] observed that the softmax loss is biased to the sample distribution, i.e., it fits well to high-quality faces but ignores low-quality faces; adding L2-constraints on the features to the softmax loss can make the resulting features as discriminative as those trained with the center loss. Afterward, Liu and colleagues [liu2016large, liu2017sphereface] further improved the features by incorporating an angular margin instead of a Euclidean margin into the softmax loss, enhancing the inter-class margin while simultaneously compressing the intra-class angular distribution.
2.2 Face Recognition with Attributes
Besides the feature vectors extracted from CNNs, other attributes can also be utilized in face recognition tasks. An early study [KumarBBN09] trained 65 “attribute” SVM classifiers to recognize traits of input facial images such as gender, age, race, and hair color, which are then fused with other features for face recognition. In the context of deep learning, attribute-enhanced face recognition has not received much attention. One related work [SamangoueiC16] exploits CNN-based attribute features for authentication on mobile devices, where the facial attributes are trained by a multi-task, part-based deep convolutional neural network architecture. Hu et al. [hu2017attribute] systematically studied how to fuse face recognition features and facial attribute features to enhance face recognition performance; they reformulate feature fusion as a gated two-stream neural network, which can be efficiently optimized by neural network learning.
Based on the assumption that attributes like gender, age and pose could share low-level features from the representation learning perspective, some studies investigate multi-task learning [rudd2016moon, RanjanSCC17] and show that such attributes could help the face recognition task. In our method, different from the above attribute fusion and multi-task learning methods, the attributes are directly used to guide the face recognition feature learning in the training stage, and they are not needed during the testing stage.
2.3 RGB-D Face Recognition
In recent years, RGB-D based face recognition has attracted increasing attention because of its robustness in unconstrained environments. Hsu et al. [HsuLPW14] considered a scenario in which the gallery is a pair of RGB-D images while the probe is a single RGB image captured by a regular camera without the depth channel. They proposed an approach that reconstructs a 3D face from an RGB-D image for each subject in the gallery, aligns the reconstructed 3D model to a probe using facial landmarks, and recognizes the probe using sparse-representation-based classification. Zhang et al. [HCSC18] further considered the problems of multi-modality matching (e.g., RGB-D probe vs. RGB-D gallery) and cross-modality matching (e.g., RGB probe vs. RGB-D gallery) in the same framework. They proposed an approach for RGB-D face recognition that is able to learn complementary features from multiple modalities and common features between different modalities. For the RGB-D vs. RGB-D problem, Goswami et al. [GoswamiVS14] proposed to compute an RGB-D image descriptor based on entropy and saliency, as well as geometric facial attributes from the depth map; the descriptor and the attributes are then fused to perform recognition. Li et al. [LiXMLK16] proposed a multi-channel weighted sparse coding method on hand-crafted features for RGB-D face recognition.
Although it is straightforward to extend deep learning based face recognition methods from RGB images to RGB-D images, currently there are no large-scale public RGB-D data sets that can be used for training, which limits the practical applications of these approaches. For example, the model proposed in [HCSC18] is trained on a dataset with less than 1K identities. To handle this problem, Lee et al. [LeeCTL16] proposed to first train the deep network with a color face dataset, and then fine-tune it on depth face images for transfer learning.
3.1 Revisiting the Variants of Softmax Loss
Given a training data set $\{x_i\}_{i=1}^{m}$ and corresponding labels $\{y_i\}_{i=1}^{m}$ with $y_i \in \{1, \dots, n\}$, the following classical softmax loss function is widely used in face recognition tasks:
$$L_s = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} f_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} f_i + b_j}}, \qquad (1)$$
where $F(\cdot)$ is the feature mapping learned by training CNNs, $d$ is the dimension of the deep feature $F(x_i) \in \mathbb{R}^d$, $W_j$ and $b_j$ are the weights and biases in the last fully connected layer, and $x_i$ can be color or depth images of faces. We denote $f_i = F(x_i)$ for simplicity. Typically, during the test phase the mapping $F$ is applied to an image pair $(x_1, x_2)$ to extract two deep features $(f_1, f_2)$, and the Euclidean distance or cosine distance between the features is computed to determine the similarity of the image pair. Separable features can be learned using the softmax loss, but they are not discriminative enough for face recognition.
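The test-phase verification step described above can be sketched in a few lines; the feature values and the decision threshold below are illustrative, not values from the paper:

```python
import numpy as np

def verify_pair(f1, f2, threshold=0.5):
    """Decide whether two deep features belong to the same identity
    by thresholding their cosine distance (illustrative threshold)."""
    f1 = f1 / np.linalg.norm(f1)
    f2 = f2 / np.linalg.norm(f2)
    cos_dist = 1.0 - float(np.dot(f1, f2))
    return cos_dist < threshold

# Identical features have cosine distance 0 and are accepted;
# orthogonal features have cosine distance 1 and are rejected.
f = np.array([0.3, 1.2, -0.7])
assert verify_pair(f, f)
assert not verify_pair(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

The Euclidean distance can be substituted for the cosine distance without changing the overall procedure.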
To learn more discriminative features, several variants of the softmax loss have been developed by enlarging the inter-class margin and reducing the intra-class variation. Among them, the center loss [wen2016discriminative] requires the deep features of each class to gather towards their respective center $c_{y_i}$:
$$L_c = \frac{1}{2} \sum_{i=1}^{m} \left\| f_i - c_{y_i} \right\|_2^2. \qquad (2)$$
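As a sanity check, the center loss can be sketched directly in numpy (the class-center update rule used during training is omitted; the names here are illustrative):

```python
import numpy as np

def center_loss(F, y, centers):
    """Half the summed squared distance between each deep feature
    F[i] and the center of its class, centers[y[i]]."""
    diffs = F - centers[y]
    return 0.5 * float(np.sum(diffs * diffs))

# Features that already sit at their class centers incur zero loss.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
F = np.array([[0.0, 0.0], [1.0, 1.0]])
assert center_loss(F, np.array([0, 1]), centers) == 0.0
```

A feature displaced by a unit step from its center contributes 0.5 to the loss, matching the 1/2 factor in the formula.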
With the angular softmax loss [liu2017sphereface], deep features of each class are compressed using an angular margin instead of a Euclidean margin:
$$L_{ang} = -\sum_{i=1}^{m} \log \frac{e^{\|f_i\| \cos(m\theta_{y_i,i})}}{e^{\|f_i\| \cos(m\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\|f_i\| \cos\theta_{j,i}}}, \qquad (3)$$
where $\theta_{j,i}$ is the angle between the vectors $W_j$ and $f_i$, and the integer $m$ controls the angular margin. There are other variants of the softmax loss [wang2018additive, deng2018arcface] with a similar form as (3), where the margin and the angle are added instead of being multiplied.
3.2 The Attribute-Aware Loss
To achieve high accuracy in face recognition, it is desirable that the proximity between feature clusters of different classes is consistent with the proximity between the classes (i.e., the underlying faces). Ideally, the more dissimilar two faces are, the further apart their corresponding feature clusters should be. However, according to our experimental observations, this is not guaranteed by the above variants of the softmax loss (see Fig. 2 and Fig. LABEL:fig:gender). Since they minimize the intra-class variations and maximize the inter-class margins on the training data, the learned feature mappings tend to produce evenly distributed feature vectors for the training faces. On the other hand, there is no guarantee that the facial images in the training set are evenly distributed in the full face space. As a result, when there are large variations of sampling density in the training data set, the learned feature mapping may not correctly indicate the proximity of the underlying faces. To address this issue, we could try to introduce a loss term that regularizes feature proximity with respect to face proximity. However, this is challenging, as a facial image only reveals the underlying face shape from a certain view direction and can be affected by various factors such as lighting conditions and sensor noise. As a result, it is difficult to reliably compute the proximity between two faces by only comparing their scanned images.
Besides the proximity of face shapes, it is also desirable that the learned feature mappings are related to the proximity between other attributes such as gender, ethnicity, and age. For example, if we compare a probe image against a database of facial images to identify most likely matches via feature proximity, then it is preferable that all returned images are from persons with the same or similar attributes. The above variants of softmax loss cannot ensure this property either, as they only consider the facial images during the training process.
Motivated by these observations, we propose an attribute-aware loss term that regularizes feature proximity with respect to attribute proximity. Besides the label information, other attributes of the facial images like gender, ethnicity, and age are also given in the training data set. These attributes can be collected during training data construction, and they are independent of the imaging process. We represent the augmented attributes for a facial image $x_i$ using a vector $a_i \in \mathbb{R}^k$. Then our attribute-aware loss is formulated as
$$L_a = \sum_{\substack{y_i \neq y_j \\ d(a_i, a_j) < \epsilon}} \left\| (f_i - f_j) - M (a_i - a_j) \right\|_2^2, \qquad (4)$$
where $M \in \mathbb{R}^{d \times k}$ is a parameter matrix to be trained, $d(a_i, a_j) = \|a_i - a_j\|_2$ is the Euclidean distance between the two attribute vectors, and $\epsilon$ is a user-specified threshold.
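The loss can be sketched in numpy under one plausible reading of this pairwise form: cross-identity pairs whose attribute distance falls below the threshold are compared through the trainable matrix relating attribute differences to feature differences. All names, dimensions, and values below are illustrative:

```python
import numpy as np

def attribute_aware_loss(F, A, y, M, eps):
    """Sum over pairs (i, j) from different identities whose attribute
    vectors are closer than eps; each term penalizes the deviation of
    the feature difference from M @ (attribute difference)."""
    loss = 0.0
    n = len(F)
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] != y[j] and np.linalg.norm(A[i] - A[j]) < eps:
                r = (F[i] - F[j]) - M @ (A[i] - A[j])
                loss += float(r @ r)
    return loss

# Two identities whose feature difference exactly equals the linear
# image of their attribute difference incur zero loss.
M = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])  # d=4, k=2
A = np.array([[0.0, 0.0], [0.1, 0.0]])
F = np.array([[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0]])
y = np.array([0, 1])
assert abs(attribute_aware_loss(F, A, y, M, eps=1.0)) < 1e-12
```

In practice the double loop would run only over the pairs inside a mini-batch, so its cost stays negligible.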
Intuitively, this loss term can drive feature clusters with similar attributes towards each other, via a global linear mapping that relates the feature difference to the attribute difference. To the best of our knowledge, this is the first work in face recognition that optimizes adjacency of learned features using attribute proximity. As shown in Fig. 1, the learned feature clusters with similar attributes become closer after our adjacency optimization. From another perspective, regularization using additional attributes can help the network pick up other useful cues for face recognition, because attributes such as gender, ethnicity, and age are highly correlated with facial shape and appearance. For example, there can be a notable difference between the facial appearance of two persons with different genders. Therefore, the attribute-aware loss can improve the learned feature mapping by implicitly utilizing the appearance variation related to these attributes.
To better understand the attribute-aware loss, the following proposition clarifies how the difference between the attribute vectors $a_i$ and $a_j$ influences the difference between the features $f_i$ and $f_j$ in the feature space.
Proposition 1. Let $\{f_i\}_{i=1}^{m}$ be facial features extracted from training samples $\{x_i\}_{i=1}^{m}$ with corresponding labels $\{y_i\}_{i=1}^{m}$ and attribute vectors $\{a_i\}_{i=1}^{m}$. Assume that:
(i) The number of training epochs is large enough to allow all pairs that satisfy the constraints $d(a_i, a_j) < \epsilon$ and $y_i \neq y_j$ to appear sufficiently many times in the training phase;
(ii) For every identity, there exists at least one other identity such that they meet the above constraints;
(iii) The training is convergent, i.e., there is a constant $\delta$ such that every pair $(i, j)$ satisfying the above constraints fulfills
$$\left\| (f_i - f_j) - M (a_i - a_j) \right\|_2 \le \delta; \qquad (5)$$
(iv) The parameter matrix $M$ is nonsingular, and $\|M\|_2 \le C$ where $C$ is a constant.
Let $F_k$ denote the set of all facial features of the $k$-th identity. Then:
1. For each $k$, the Euclidean distance between any pair in $F_k$ has an upper bound that is linear in the threshold $\epsilon$;
2. If the Euclidean distance between the attribute vectors of the $k$-th identity and the $l$-th identity is smaller than $\epsilon$, then the Euclidean distance between their average features $\bar{f}_k$ and $\bar{f}_l$ also has an upper bound that is linear in the threshold $\epsilon$.
Proof. From Eq. (5), we have
$$\|f_i - f_j\|_2 \le \|M (a_i - a_j)\|_2 + \delta \le C\epsilon + \delta \qquad (6)$$
for every pair $(i, j)$ with $y_i \neq y_j$ and $d(a_i, a_j) < \epsilon$. Let $f_a$ and $f_b$ be an arbitrary pair of features in $F_k$. Then from assumption (ii), we can find another feature $f_c \in F_l$ with $l \neq k$, such that the corresponding attributes satisfy $d(a_a, a_c) < \epsilon$ and $d(a_b, a_c) < \epsilon$. Applying Eq. (6) to the pairs $(f_a, f_c)$ and $(f_b, f_c)$, we obtain
$$\|f_a - f_b\|_2 \le \|f_a - f_c\|_2 + \|f_c - f_b\|_2 \le 2(C\epsilon + \delta). \qquad (7)$$
According to assumptions (iii) and (iv), $C$ is a constant and $\delta$ is bounded. Therefore, Eq. (7) provides an upper bound for the Euclidean distance between any pair in $F_k$ which is linear in $\epsilon$.
For the average features $\bar{f}_k$ and $\bar{f}_l$ of $F_k$ and $F_l$, their Euclidean distance satisfies
$$\|\bar{f}_k - \bar{f}_l\|_2 \le \frac{1}{|F_k|\,|F_l|} \sum_{f_a \in F_k} \sum_{f_b \in F_l} \|f_a - f_b\|_2. \qquad (8)$$
If the Euclidean distance between the attribute vectors of the $k$-th identity and the $l$-th identity is smaller than $\epsilon$, then by definition $d(a_a, a_b) < \epsilon$ for any $f_a \in F_k$ and $f_b \in F_l$. This implies $\|f_a - f_b\|_2 \le C\epsilon + \delta$ according to Eq. (6). Applying this relation to Eq. (8), we obtain
$$\|\bar{f}_k - \bar{f}_l\|_2 \le C\epsilon + \delta, \qquad (9)$$
which provides an upper bound that is linear in $\epsilon$. ∎
Proposition 1 implies two properties. First, the attribute-aware loss layer can make intra-class features more compact than using the softmax loss only, similar to the center loss [wen2016discriminative]; the smaller the threshold $\epsilon$ is, the more compact the intra-class features may become. Sec. LABEL:subsec:_param evaluates the effects of different values of $\epsilon$. Second, for two identities with similar attributes, their corresponding feature clusters will not be far away from each other. This is demonstrated by our experiment in Sec. LABEL:subsec:private.
To showcase the effectiveness of the attribute-aware loss, we present a toy example on a very small RGB face dataset. This dataset is selected from our large-scale RGB-D face dataset presented in Section 4.1, and contains only nine identities with the same gender and ethnicity but different ages: three of the identities are aged 28, another three are aged 50, and the remaining three are aged 70. We use ResNet-10 [he2016deep] to train two models, one with the softmax loss only, and the other with both the softmax loss and the attribute-aware loss. Details of training with the combined softmax and attribute-aware loss are presented in Sec. 3.3. We reduce the output dimension of the penultimate fully connected layer to two, allowing us to directly plot the learned features in Fig. 2(a) and Fig. 2(b). We can see that the coordinates of features in Fig. 2(a) span a much larger range than those in Fig. 2(b), indicating that the two-dimensional features of each identity become more compact through the regularization with the attribute-aware loss. We can also observe that features for identities of the same age are closer to each other in Fig. 2(b) than in Fig. 2(a), verifying the second property of the attribute-aware loss.
3.3 Training with the Attribute-Aware Loss
Similar to [wen2016discriminative], our attribute-aware loss in Eq. (4) is an auxiliary supervision signal, which can be combined with any variant of the softmax loss. For example, it can be combined with the classical softmax loss to derive the loss function
$$L = L_s + \lambda L_a,$$
where $\lambda$ is a user-specified weight to balance the two loss terms. In the following, we provide the details of mini-batch training with the loss function $L$. Each input mini-batch consists of facial data $\{x_i\}$, as well as their identity labels $\{y_i\}$ and attributes $\{a_i\}$. These data are fed to the CNN, the softmax loss layer, and the attribute-aware loss layer respectively, as illustrated in Fig. 3. The addition of the attribute-aware loss layer introduces only a slight overhead in model size during the training phase, as it contains a single parameter matrix $M$ with $d \times k$ parameters, where $d$ and $k$ are the dimensions of the deep facial feature and the attribute vector, respectively. In our implementation, $d = 512$ and $k = 3$, meaning that we only need 1536 additional parameters. In comparison, the backbone network that extracts deep features, a 28-layer ResNet [he2016deep], requires about 0.3M parameters. Thus, the overhead from the attribute-aware loss layer is almost negligible. Moreover, during the testing phase, the attribute-aware loss layer is not needed and induces no overhead.
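The joint supervision and the stated parameter overhead can be checked directly; the weight value used below is illustrative, not the paper's setting:

```python
import numpy as np

d, k = 512, 3          # deep feature and attribute dimensions
M = np.zeros((d, k))   # the single trainable matrix of the attribute-aware layer
assert M.size == 1536  # matches the stated 1536 additional parameters

def joint_loss(L_s, L_a, lam):
    """Joint supervision L = L_s + lam * L_a; lam is user-specified."""
    return L_s + lam * L_a
```

For instance, `joint_loss(2.0, 1.0, 0.5)` evaluates to 2.5; at test time only the backbone producing the 512-dimensional feature is kept.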
All parameters in the CNN and the two loss layers can be learned using standard stochastic gradient descent. The gradients of $L_a$ with respect to $f_i$ and $M$ are computed as:
$$\frac{\partial L_a}{\partial f_i} = 2 \sum_{j \in \mathcal{N}(i)} \big( (f_i - f_j) - M (a_i - a_j) \big), \qquad \frac{\partial L_a}{\partial M} = -2 \sum_{\substack{y_i \neq y_j \\ d(a_i, a_j) < \epsilon}} \big( (f_i - f_j) - M (a_i - a_j) \big) (a_i - a_j)^{T},$$
where $\mathcal{N}(i)$ denotes the indices $j$ such that the pair $(i, j)$ satisfies the constraints $y_i \neq y_j$ and $d(a_i, a_j) < \epsilon$.
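Assuming the pairwise residual form of the attribute-aware loss, the gradient with respect to a feature vector can be verified numerically with central differences; all values below are random and illustrative:

```python
import numpy as np

def pair_loss(fi, fj, ai, aj, M):
    # Squared residual of one constrained pair.
    r = (fi - fj) - M @ (ai - aj)
    return float(r @ r)

rng = np.random.default_rng(0)
d, k = 5, 3
fi, fj = rng.normal(size=d), rng.normal(size=d)
ai, aj = rng.normal(size=k), rng.normal(size=k)
M = rng.normal(size=(d, k))

# Analytic gradient w.r.t. fi is 2 * residual; compare against
# central finite differences coordinate by coordinate.
grad = 2 * ((fi - fj) - M @ (ai - aj))
h = 1e-6
for t in range(d):
    e = np.zeros(d)
    e[t] = h
    num = (pair_loss(fi + e, fj, ai, aj, M)
           - pair_loss(fi - e, fj, ai, aj, M)) / (2 * h)
    assert abs(num - grad[t]) < 1e-4
```

Since the loss is quadratic in the features, the central difference agrees with the analytic gradient up to floating-point error.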
Algorithm 1 summarizes the learning details in the CNNs with joint supervision.
3.4 Training with RGB-D Facial Data
In this paper, we use RGB-D facial images as the training data, to improve robustness to illumination conditions compared with RGB facial images. The RGB-D data are collected using low-cost sensors such as PrimeSense. Using the RGB part of a facial image, we first detect the face region and five landmarks (the eyes, the nose, and the two corners of the mouth) using MTCNN [zhang2016joint]. The face is then cropped and aligned by a similarity transformation, and each RGB color component is normalized to a fixed range. Afterward, we extract a face region from the corresponding depth image by transferring the RGB face region. Similar to [kim2017deep, zulqarnain2018learning], we find the nose tip and crop the point cloud in the face region within an empirically set radius. Then we move the center of the cropped facial scan to a fixed position and reproject it onto a 2D image plane to generate a new depth map, where the projection is chosen to enlarge the facial scan on the image plane as much as possible. Following [hernandez2015near], we compute the depth of each pixel with bilinear interpolation. Using this depth map, we generate a new point cloud under the camera coordinate system. Each point $p = (x, y, z)$ is further normalized as:
$$p' = \frac{p - p_{\min}}{p_{\max} - p_{\min}},$$
where $p_{\min}$ and $p_{\max}$ contain the minimum and maximum $x$-, $y$- and $z$-coordinate values among all points, respectively, and the division is performed per coordinate. Augmenting the RGB face region with its normalized point cloud, we obtain a six-channel image with all channels normalized to a common range, which is fed into the deep neural network. Some RGB facial images and their normalized point clouds are shown in Fig. 3.
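This per-coordinate min-max normalization can be sketched in numpy (the helper name is ours, and a plausible reading of the description above):

```python
import numpy as np

def normalize_points(P):
    """Map each coordinate of an N x 3 point cloud into [0, 1] by
    subtracting the per-axis minimum and dividing by the per-axis range."""
    p_min = P.min(axis=0)
    p_max = P.max(axis=0)
    return (P - p_min) / (p_max - p_min)
```

For a cloud spanning [0, 2] x [0, 4] x [0, 8], the extreme corners map to 0 and 1 on every axis, and a midpoint maps to 0.5.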
4 Experimental Results
We conduct extensive experiments to evaluate the effectiveness of our approach. We first test our RGB-D face recognition approach on a large-scale private dataset (Secs. LABEL:subsec:_param, LABEL:subsec:private and LABEL:subsec:_fusion_scheme) as well as some public datasets (Sec. LABEL:subsec:_public). Then we compare our attribute-aware loss with other methods that utilize attributes for face recognition, using some public RGB datasets (Sec. LABEL:subsec:_fused).
Our RGB-D dataset. We construct an RGB-D facial dataset captured by a PrimeSense camera, containing more than 1.3M RGB-D images of 110K identities, where each identity has at least seven RGB images and their corresponding depth images. Most subjects are captured in front of the camera with a neutral expression, and the multiple images of each subject are captured at different times and under different lighting conditions. Some samples from this RGB-D facial dataset are shown in Fig. 4. We also record attributes including age, gender, and ethnicity. Compared with the datasets used for RGB-D face recognition in previous work [LeeCTL16, HCSC18], our dataset contains a much larger number of identities, enabling us to evaluate the effectiveness of our approach in a real-world setting.