A Quadruplet Loss for Enforcing Semantically Coherent Embeddings in Multi-output Classification Problems


Abstract

This paper describes an objective function for learning semantically coherent feature embeddings in multi-output classification problems, i.e., when the response variables have dimension higher than one. In particular, we consider the problems of identity retrieval and soft biometrics labelling in visual surveillance environments, which have been attracting growing interest. Inspired by the triplet loss [33] function, we propose a generalization of that concept: a quadruplet loss that 1) defines a metric analyzing the number of agreeing labels between pairs of elements; and 2) disregards the notion of anchor, replacing it with distance constraints derived from the perceived semantic similarity between the elements of each pair. As in the triplet loss formulation, our proposal privileges small distances between positive pairs, but it additionally enforces that the distances between negative pairs directly correspond to their similarity in terms of the number of agreeing labels. This typically yields feature embeddings with a strong correspondence between the class centroids and their semantic descriptions, i.e., where elements that share some of the labels are closer to each other in the target space than elements with fully disjoint class memberships. Also, in opposition to its triplet counterpart, the proposed loss is not particularly sensitive to the way learning pairs are mined, being agnostic with regard to demanding mining criteria (such as the semi-hard pairs of the triplet loss). Our experiments were carried out on four different datasets (BIODI, LFW, Megaface and PETA) and validate our assumptions, showing highly promising results.

Feature embedding, Soft biometrics, Identity retrieval, Convolutional neural networks, Triplet loss.

I Introduction

Characterizing pedestrians in crowds has been attracting growing attention, and it is currently accepted that soft labels such as gender, ethnicity or age help to determine the identity of a subject. These kinds of labels are closely related to human perception and also provide cues to describe the visual appearance of subjects in a crowd, with obvious application to identity retrieval [39][35] and person re-identification [15][26] problems.

Deep learning frameworks have been repeatedly improving the state-of-the-art in many computer vision tasks, such as object detection and classification [24][40], action recognition [19][6], semantic segmentation [23][43], and soft biometrics labelling [31]. Here, the concept of triplet loss [33] is extremely popular: three learning elements are used at a time, with two of them belonging to the same class and a third one to a different class. By imposing larger distances between the elements of the negative pairs than of the positive pairs, the idea is to enforce intra-class compactness and inter-class separability in the target space. This strategy was successfully applied to various problems, upon the appropriate mining of the so-called semi-hard negative input pairs, i.e., cases where the negative element is not closer to the anchor than the positive, but still yields a positive loss due to the margin value that is typically imposed.
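For reference, the standard triplet objective over precomputed embedding vectors can be sketched as follows (a minimal illustration, not the authors' implementation; the 0.2 margin is just a placeholder):

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared L2 distances between the anchor and each pair element.
    d_ap = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_an = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Hinge: the negative must be farther than the positive by `margin`;
    # semi-hard mining keeps triplets where d_ap < d_an < d_ap + margin.
    return tf.maximum(d_ap - d_an + margin, 0.0)
```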

[Figure 1: four annotated images feed the learning process: two images of identity A ("male, adult, bald"), one of identity B ("male, adult, bald") and one of identity C ("female, young, blond").]

Fig. 1: Like the triplet loss [33] function, the proposed quadruplet loss also seeks to minimize the distances between the elements of positive pairs, but it simultaneously considers the relative similarity between the different classes (A, B and C), yielding embeddings that are particularly suitable for identity retrieval. In this example, the proposed objective function will privilege projections into the target space such that the distance between the two images of A is smaller than the distance between A and B, which in turn is smaller than the distance between A and C.

Based on the concept of triplet loss, this paper describes an objective function that can be regarded as a generalization of its predecessor. Instead of binarily dividing the learning pairs into positive/negative based on the concept of anchor, we define a metric that analyzes the semantic similarity between any two classes (identities). At learning time, four elements of arbitrary classes are considered at a time, and soft margins between the pairwise distances are defined according to the number of common labels in each image pair (Fig. 1). Under this formulation, elements of different identities that are semantically close to each other (e.g., two "young, black, bald, male" subjects) should be projected into adjacent regions of the target space, which is particularly interesting for identity retrieval purposes. Also, as a result of its formulation, this objective function alleviates the difficulties in mining appropriate learning instances, which is known to constitute one of the main difficulties in the original triplet loss formulation.

The remainder of this paper is organized as follows: Section II summarizes the most relevant research in the scope of our work. Section III provides a detailed description of the proposed objective function. In Section IV we discuss the obtained results and the conclusions are given in Section V.

II Related Work

Deep learning methods for biometrics can be roughly divided into two major groups: 1) directly learning multi-class classifiers that are used in identity retrieval and soft biometrics inference; and 2) learning low-dimensional feature embeddings, where identification results from nearest-neighbour search. Upon the availability of enough learning data, both families of methods have been reported to achieve good performance.

II-A Soft Biometrics and Identity Retrieval

Bekele et al. [2] proposed a residual network for multi-output inference that handles class imbalance directly in the cost function, without depending on data augmentation techniques. Almudhahka et al. [1] explored the concept of comparative soft biometrics and assessed the impact of automatic estimations on face retrieval performance. Guo et al. [12] studied the influence of distance on the effectiveness of body and facial soft biometrics, introducing a joint density distribution based rank-score fusion strategy [13]. Vera-Rodriguez et al. [30] used hand-crafted features extracted from the distances between key points in body silhouettes. Martinho-Corbishley et al. [28] introduced the idea of super-fine soft attributes, describing multiple concepts of one trait as multi-dimensional perceptual coordinates. Also, using joint attribute regression and a deep residual CNN, they observed substantially better ranked retrieval performance in comparison to conventional labels. Schumann and Specker [34] used an ensemble of classifiers for robust attribute inference, extended to full-body search by combining it with a human silhouette detector. He et al. [17] proposed a weighted multi-task CNN with a loss term that dynamically updates the weight of each task during the learning phase.

Several works regarded semantic segmentation as a tool to support label inference: Galiyawala et al. [10] described a deep learning framework for person retrieval using the height, clothes' color and gender labels, with a semantic segmentation module used to remove clutter. Similarly, Cipcigan and Nixon [3] obtained semantically segmented regions of the body, which subsequently fed two CNN-based feature extraction and inference modules.

Finally, specifically targeting handheld devices, Samangouei and Chellappa [31] extracted facial soft biometric information from mobile phones, while Neal and Woodard [25] developed a human retrieval scheme based on thirteen demographic and behavioural attributes from mobile phone data, such as calling, SMS and application data, with the authors concluding positively about the feasibility of this kind of recognition.

A comprehensive summary of the most relevant research in soft biometrics is given in [37].

II-B Feature Embeddings and Loss Functions

Triplet loss functions were motivated by the concept of contrastive loss [14], whose rationale is to penalize distances between positive pairs, while favouring distances between negative pairs. Kang et al. [21] used a deep ensemble of multi-scale CNNs, each one based on triplet loss functions. Song et al. [36] learned semantic feature embeddings that lift the vector of pairwise distances within the batch to the matrix of pairwise distances, and described a structured loss on the lifted problem. Liu and Huang [27] proposed a triplet loss learning architecture composed of four CNNs, each one learning features from different body parts that are fused at the score level.

A posterior concept was the center loss [41], which finds a center for the elements of each class and penalizes the distances between the projections and their corresponding class centers. Jiang et al. [20] combined additive margin softmax with center loss to increase inter-class distances and avoid overconfident classifications. Ranjan et al. [29] described the crystal loss, which restricts the features to lie on a hypersphere of a fixed radius, by adding a constraint on the feature projections such that their $\ell_2$-norm remains constant. Chen et al. [4] used deep representations to feed a joint Bayesian metric learning module that maximizes the log-likelihood ratio between intra- and inter-class distances. Based on the concept of SphereFace, Deng et al. [8] proposed an additive angular margin loss, with a clear geometric interpretation due to its correspondence to the geodesic distance on the hypersphere.

Observing that CNN-based methods tend to overfit in person re-identification tasks, Shi et al. [35] used siamese architectures to provide a joint description to a metric learning module, regularizing the learning process and improving the generalization ability. Also, to cope with large intra-class variations, they suggested the idea of moderate positive mining, again to prevent overfitting. Motivated by the difficulties in generating learning instances for triplet loss frameworks, Su et al. [38] performed adaptive CNN fine-tuning, along with an adaptive loss function that relates the maximum distance among positive pairs to the margin demanded to separate the positive from the negative pairs. Hu et al. [18] proposed an objective function that generalizes the Maximum Mean Discrepancy [32] metric, with a weighting scheme that favours good quality data. Duan et al. [9] proposed the uniform loss, to learn deep equi-distributed representations for face recognition. Finally, observing the typical imbalance between positive and negative pairs, Wang et al. [40] described an adaptive margin list-wise loss, in which learning data are provided with a set of negative pairs divided into three classes (easy, moderate and hard), depending on their distance rank with respect to the query.

Finally, we note the differences between the work proposed in this paper and the (also quadruplet) loss described by Chen et al. [5]. These authors attempt to augment the inter-class margins and the intra-class compactness without explicitly using any semantic constraint, in the sense that - as in the original triplet loss formulation - nothing explicitly enforces projecting similar classes (i.e., different identities that share most of the remaining labels) into neighbouring regions of the latent space. In opposition, our method is essentially concerned with this semantic coherence, i.e., with assuring the projection of similar classes into adjacent regions, rather than with obtaining larger margins between the different classes. Also, the idea behind the loss formulation is radically different in both methods, in the sense that [5] still considers the concept of anchor (as in the original triplet loss formulation), which is - again - in opposition to our proposal.

III Proposed Method

[Figure 2 considers three elements of distinct identities: x (female, black), y (female, white) and z (male, white). Under the triplet loss, three embeddings are possible, as the only constraint is that positive pairs lie closer than negative pairs. Under the proposed quadruplet loss, a single embedding is enforced, by constraining: 1) positive pairs to lie closer than negative pairs; 2) positive pairs to lie closer than "more negative" pairs; and 3) negative pairs to lie closer than "more negative" pairs.]

Fig. 2: Illustration of the key difference between the triplet loss [33] and the solution proposed in this paper. Using a loss function that analyzes the semantic similarity (in terms of soft biometrics) between the different identities, we enforce embeddings that are semantically coherent, i.e., where: 1) elements of the same class appear near each other; and also 2) elements of relatively similar classes appear closer to each other than elements with no labels in common. This is in opposition to the original formulation of the triplet loss, which relies exclusively on the elements' appearance to define the geometry of the target space, and which might result - in case of noisy image features - in semantically incoherent embeddings (e.g., embeddings where the classes are compact and discriminative, but the centroids of fully disjoint classes lie too close to each other).

III-A Quadruplet Loss: Definition

Consider a supervised classification problem, where $t$ is the dimensionality of the response variable $\mathbf{y}_i$ associated to the input element $\mathbf{x}_i$. Let $f(\cdot)$ be one embedding function that maps $\mathbf{x}$ into a $d$-dimensional space, with $f(\mathbf{x})$ being the projected vector. Let $\mathcal{B} = \{\mathbf{x}_1, \ldots, \mathbf{x}_b\}$ be a batch of $b$ images from the learning set. We define $\delta(\cdot, \cdot)$ as the function that measures the semantic similarity between $\mathbf{y}_i$ and $\mathbf{y}_j$:

$$\delta(\mathbf{y}_i, \mathbf{y}_j) = \lVert \mathbf{y}_i - \mathbf{y}_j \rVert_0, \qquad (1)$$

with $\lVert \cdot \rVert_0$ being the $\ell_0$-norm operator.

In practice, $\delta(\cdot, \cdot)$ counts the number of disagreeing labels between the pair: $\delta = t$ when the $i^{th}$ and $j^{th}$ elements have fully disjoint class memberships (e.g., one "black, adult, male" and one "white, young, female" subject), while $\delta = 0$ when they have the exact same label (class) across all dimensions, i.e., when they constitute a positive pair.
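For concreteness, a minimal sketch of eq. (1), assuming the labels are encoded as integer vectors (the function name is ours):

```python
import numpy as np

def delta(y_i, y_j):
    # l0-norm of the label difference: the number of disagreeing labels.
    # Returns 0 for a positive pair and t for fully disjoint memberships.
    return int(np.count_nonzero(np.asarray(y_i) != np.asarray(y_j)))

# e.g., with t = 3 labels (ID, gender, ethnicity):
assert delta([12, 0, 1], [47, 1, 2]) == 3   # fully disjoint pair
assert delta([12, 0, 1], [12, 0, 1]) == 0   # positive pair
```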

Let $(i, j, k, l)$ be the indices of four images in the batch. The corresponding quadruplet loss value is given by:

$$\ell_{i,j,k,l} = \max\Big(0, \ \text{sign}\big(\delta(\mathbf{y}_k, \mathbf{y}_l) - \delta(\mathbf{y}_i, \mathbf{y}_j)\big)\big(\lVert f(\mathbf{x}_i) - f(\mathbf{x}_j) \rVert_2^2 - \lVert f(\mathbf{x}_k) - f(\mathbf{x}_l) \rVert_2^2 + \mu\big)\Big), \qquad (2)$$

where $\text{sign}(\cdot)$ is the sign function and $\mu$ is the desired margin (the same fixed value was used in all our experiments). Evidently, $\ell_{i,j,k,l}$ will be zero when both image pairs have the same number of agreeing labels (as $\text{sign}(0) = 0$ in these cases). In all other cases, the sign function determines the pair for which the distances in the embedding should be minimized, i.e., if the elements of the $(i, j)$ pair are semantically closer to each other than the elements of the $(k, l)$ pair (i.e., $\delta(\mathbf{y}_i, \mathbf{y}_j) < \delta(\mathbf{y}_k, \mathbf{y}_l)$), we want to ensure that $\lVert f(\mathbf{x}_i) - f(\mathbf{x}_j) \rVert_2^2 + \mu \leq \lVert f(\mathbf{x}_k) - f(\mathbf{x}_l) \rVert_2^2$.
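Following the reconstruction of eq. (2) above, a single-quadruplet sketch in TensorFlow could read (illustrative names; `mu` is the margin):

```python
import tensorflow as tf

def quadruplet_term(f_i, f_j, f_k, f_l, delta_ij, delta_kl, mu):
    # sign(delta_kl - delta_ij): 0 when both pairs disagree on the same
    # number of labels, +/-1 otherwise.
    s = tf.sign(tf.cast(delta_kl - delta_ij, tf.float32))
    d_ij = tf.reduce_sum(tf.square(f_i - f_j), axis=-1)  # squared L2, pair (i, j)
    d_kl = tf.reduce_sum(tf.square(f_k - f_l), axis=-1)  # squared L2, pair (k, l)
    # Hinge that orders the embedding distances by semantic similarity.
    return tf.maximum(0.0, s * (d_ij - d_kl + mu))
```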

The accumulated loss in the batch is given by the truncated mean of a sample (of size $s$) randomly taken from the subset of the individual loss values with $\delta(\mathbf{y}_i, \mathbf{y}_j) \neq \delta(\mathbf{y}_k, \mathbf{y}_l)$:

$$\mathcal{L} = \frac{1}{s} \sum_{z=1}^{s} \ell_{q_z}, \qquad (3)$$

where $q_z$ denotes the $z^{th}$ valid combination of four elements in the batch and $\ell_{q_z}$ is its loss value, given by (2). Even considering that a large fraction of the combinations in the batch will be invalid (i.e., with $\delta(\mathbf{y}_i, \mathbf{y}_j) = \delta(\mathbf{y}_k, \mathbf{y}_l)$), large values of $b$ will result in an intractable number of combinations at each iteration. In practical terms, after filtering out those invalid combinations, we randomly sample a subset of the remaining instances, which is designated as the mini-batch.
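The mining step can then be sketched as follows (our own illustration of the batch filtering described above; `delta` is the label-disagreement function of eq. (1), and one pairing per 4-element combination is checked for brevity):

```python
import itertools
import random

def mine_mini_batch(batch_labels, s):
    # Keep only the valid quadruplets, i.e., those whose two pairs have a
    # different number of disagreeing labels.
    valid = [(i, j, k, l)
             for i, j, k, l in itertools.combinations(range(len(batch_labels)), 4)
             if delta(batch_labels[i], batch_labels[j])
             != delta(batch_labels[k], batch_labels[l])]
    # The mini-batch is a random sample of s valid quadruplets.
    return random.sample(valid, min(s, len(valid)))
```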

III-B Quadruplet Loss: Inference

Consider four indices $(i, j, k, l)$ of elements in the mini-batch, with $\delta(\mathbf{y}_i, \mathbf{y}_j) \leq \delta(\mathbf{y}_k, \mathbf{y}_l)$. Let $\Delta_\delta$ denote the difference between the number of disagreeing labels of the $(k, l)$ and $(i, j)$ pairs:

$$\Delta_\delta = \delta(\mathbf{y}_k, \mathbf{y}_l) - \delta(\mathbf{y}_i, \mathbf{y}_j). \qquad (4)$$

Also, let $\Delta_f$ be the distance between the elements of the most alike pair minus the distance between the elements of the least alike pair in the target space (plus the margin):

$$\Delta_f = \lVert f(\mathbf{x}_i) - f(\mathbf{x}_j) \rVert_2^2 - \lVert f(\mathbf{x}_k) - f(\mathbf{x}_l) \rVert_2^2 + \mu. \qquad (5)$$

Upon basic algebraic manipulation, the gradients of $\ell$ with respect to the quadruplet terms are given by:

$$\partial \ell / \partial f(\mathbf{x}_i) = 2 \, \text{sign}(\Delta_\delta) \big(f(\mathbf{x}_i) - f(\mathbf{x}_j)\big) \, \mathbb{1}\{\Delta_f \geq 0\}, \qquad (6)$$
$$\partial \ell / \partial f(\mathbf{x}_j) = -2 \, \text{sign}(\Delta_\delta) \big(f(\mathbf{x}_i) - f(\mathbf{x}_j)\big) \, \mathbb{1}\{\Delta_f \geq 0\}, \qquad (7)$$
$$\partial \ell / \partial f(\mathbf{x}_k) = -2 \, \text{sign}(\Delta_\delta) \big(f(\mathbf{x}_k) - f(\mathbf{x}_l)\big) \, \mathbb{1}\{\Delta_f \geq 0\}, \qquad (8)$$
$$\partial \ell / \partial f(\mathbf{x}_l) = 2 \, \text{sign}(\Delta_\delta) \big(f(\mathbf{x}_k) - f(\mathbf{x}_l)\big) \, \mathbb{1}\{\Delta_f \geq 0\}, \qquad (9)$$

with $\mathbb{1}\{\cdot\}$ denoting the indicator function.
In practical terms, the model weights will be adjusted only for learning instances where the pairs have a different number of agreeing labels (i.e., $\Delta_\delta \neq 0$) and where the distance in the target space between the elements of the most similar pair is higher than or equal to the distance between the elements of the least similar pair ($\Delta_f \geq 0$). According to this idea, using (6)-(9), the deep learning frameworks supervised by the proposed quadruplet loss are trainable in a way similar to their triplet loss counterparts, and can be optimized by the standard Stochastic Gradient Descent (SGD) algorithm, which was done in all our experiments.

For clarity purposes, Algorithm 1 gives a pseudocode description of the batch/mini-batch creation and weight update processes for the proposed loss function.

Input: $m$: CNN, $T$: tot. epochs, $s$: mini-batch size, $b$: batch size, $\mathcal{D}$: learning set, $n$ images
for $t = 1, \ldots, T$ do
     for $i = 1, \ldots, \lfloor n/b \rfloor$ do
          $\mathcal{B} \leftarrow$ randomly sample $b$ out of $n$ images from $\mathcal{D}$
          $\mathcal{Q} \leftarrow$ create quadruplet combinations from $\mathcal{B}$
          $\mathcal{Q} \leftarrow$ filter out invalid elements ($\Delta_\delta = 0$) from $\mathcal{Q}$
          $\mathcal{M} \leftarrow$ randomly sample $s$ elements from $\mathcal{Q}$
          update the weights of $m$ using $\mathcal{M}$ (eqs. (6)-(9))
     end for
end for
return $m$
Algorithm 1: Pseudocode description of the inference and batch/mini-batch creation processes.
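A possible TensorFlow sketch of the weight update step of Algorithm 1 is given below (our own illustration, not the authors' released code: `model` stands for any of the CNNs of Fig. 4, `quads` is the (s, 4) tensor of mined indices, `signs` holds the precomputed sign(Δδ) values, and the ℓ2-normalization reflects the hypersphere target space assumed in the text):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

def train_step(model, images, quads, signs, mu):
    with tf.GradientTape() as tape:
        # Project the batch onto the unit hypersphere of the target space.
        f = tf.math.l2_normalize(model(images, training=True), axis=1)
        f_i, f_j, f_k, f_l = (tf.gather(f, quads[:, c]) for c in range(4))
        d_ij = tf.reduce_sum(tf.square(f_i - f_j), axis=1)
        d_kl = tf.reduce_sum(tf.square(f_k - f_l), axis=1)
        # Mean of the per-quadruplet hinges (eqs. (2)-(3)); automatic
        # differentiation then reproduces the gradients (6)-(9).
        loss = tf.reduce_mean(tf.maximum(0.0, signs * (d_ij - d_kl + mu)))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```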

III-C Quadruplet Loss: Insight and Example

Fig. 2 illustrates our rationale in proposing the quadruplet loss. By defining a metric that analyzes the similarity between different classes, we abandon the binary division of pairs into positive/negative families, and attempt to perceive which classes are more/less distinct with respect to the others. This information enables us to explicitly enforce that the most negative pairs (e.g., those with no common labels) should be at the farthest possible distance from each other in the embedding. During the learning phase, by sampling image pairs in a stochastic and iterative way, we enforce that the input elements are projected onto the target space hypersphere in a way that faithfully resembles the human perception of class similarity.

As an example, Fig. 3 compares the bidimensional embeddings resulting from the triplet and from the quadruplet losses, on the subset of the LFW identities with at least 15 images in the dataset, using $t = 2$ labels. This visualisable plot resulted from the projection of one 128-dimensional embedding down to two dimensions, according to the Neighbourhood Component Analysis (NCA) [11] algorithm.
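This projection step can be sketched with scikit-learn's NCA implementation (a sketch under our assumptions; the variable names are illustrative):

```python
from sklearn.neighbors import NeighborhoodComponentsAnalysis

# Project 128-d embeddings down to 2 dimensions, supervised by the
# identity labels (as done for Fig. 3; parameters are illustrative).
nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
embedding_2d = nca.fit_transform(embeddings_128d, identity_labels)
```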


Fig. 3: Comparison between the bidimensional embeddings resulting from the triplet loss function, as proposed in [33] (top plot), and from the quadruplet loss proposed in this paper (bottom plot). Results are given for $t = 2$ features ({"ID", "Gender"}), for a subset of LFW composed of the identities with at least 15 images in the dataset (total of 89 identities).

It can be seen that the triplet loss provided an embedding where the positions of the elements were exclusively determined by their appearance, making it possible that instances with completely different labels appear close to each other (e.g., as a result of noisy image features). This is particularly evident in the upper left corner of the triplet embedding, where "female" elements were projected into regions adjacent to "male tennis players". In opposition, using the quadruplet loss we obtain a much more evident separation between the elements of different genders, while keeping the compactness of each identity. Evidently, this type of embedding will be interesting in - at least - two cases: 1) in image retrieval problems, to guarantee that all retrieved elements have the same soft labels as the query; and 2) upon a semantic description of the query (e.g., "find adult white males similar to this image"), to guarantee that all retrieved elements meet the semantic description. A third possibility comprises the direct inference of all the query soft labels, simply by analyzing its projection onto the target space hypersphere.

IV Results and Discussion

IV-A Experimental Setting and Preprocessing

The empirical validation of our method was conducted in one proprietary (BIODI) plus three freely available datasets (LFW, PETA and Megaface), well known in the biometrics and re-identification literature.

The BIODI1 dataset is proprietary to Tomiworld2, being composed of 849,932 images from 13,876 subjects, taken from 216 indoor/outdoor video surveillance sequences. All images were manually annotated for 14 labels: gender, age, height, body volume, ethnicity, hair color and style, beard, moustache, glasses and clothing (x4). The Labeled Faces in the Wild (LFW) [16] dataset contains 13,233 images from 5,749 identities, collected from the web, with large variations in pose, expression and lighting conditions. PETA [7] is a combination of 10 pedestrian re-identification datasets, composed of 19,000 images from 8,705 subjects, each one annotated with 61 binary and 4 multi-output attributes. Finally, Megaface [22] was released to evaluate the performance of face recognition algorithms at the million scale, and consists of a gallery set and a probe set. The gallery set is a subset of Flickr photos from Yahoo (more than 1,000,000 images from 690,000 subjects). The probe set includes the FaceScrub and FGNet sets: FaceScrub has 100,000 images from 530 individuals and FGNet contains 1,002 images of 82 identities.

IV-B Convolutional Neural Networks

VGG-like:
[3×3, 64] × 2
max-pool, 2×2
dropout, 0.75
[3×3, 128] × 2
max-pool, 2×2
[3×3, 256] × 4
max-pool, 2×2
dropout, 0.75
[3×3, 256] × 4
max-pool, 2×2
dropout, 0.75
FC, 4,096
FC, 4,096

ResNet-like:
7×7, 64, /2
max-pool, 2×2
[3×3, 64; 3×3, 64] × 3
3×3, 128, /2; 3×3, 128
[3×3, 128; 3×3, 128] × 3
3×3, 256, /2; 3×3, 256
[3×3, 256; 3×3, 256] × 5
3×3, 512, /2; 3×3, 512
[3×3, 512; 3×3, 512] × 2
avg-pool, 2×2
dropout, 0.75
FC, 4,096

Fig. 4: Architectures of the two kinds of CNNs used in the empirical validation of our method. The yellow boxes represent the convolutional layers, and the blue and green boxes represent the pooling and dropout (keeping probability 0.75) layers. Finally, the red boxes denote the fully connected layers. In the ResNet architecture, the dashed skip connections represent convolutions with stride 2×2, yielding outputs with half of the input size in spatial terms. The "/2" symbol in the convolution layers also denotes stride 2×2 (the remaining layers use stride 1×1).

In our experiments, we considered two kinds of CNN architectures, based on the well-known VGG and ResNet models (Fig. 4). It should be noted that our main goals were: 1) to compare the performance of the quadruplet loss with respect to the baselines; and 2) to observe the variations in performance with respect to different deep learning models. A TensorFlow implementation of the learning strategy described in this paper is available online3.

All the models tested were initialized with random weights, drawn from zero-mean Gaussian distributions with standard deviation equal to 0.01, and biases set to 0.5. We multiplied the learning rate of the randomly initialized fully connected layers by 10 for faster convergence. All the images were resized to 256×256, adding lateral white bands when needed to keep a constant aspect ratio and avoid distortions. A batch size of 64 was defined, which yields an intractably large number of possible combinations for the triplet/quadruplet losses. Hence, at each iteration we filtered out the invalid triplet/quadruplet instances and randomly selected the mini-batch learning elements (composed of 64 instances). For the baseline techniques, 64 learning instances were also used in each batch adjustment of the weights. The learning rate started at 0.01, with momentum set to 0.9 and a fixed weight decay. We used a learning-from-scratch paradigm and stopped the learning process when the validation loss did not decrease for 10 iterations (i.e., patience = 10), in order to assure, as much as possible, a fair comparison between the proposed loss and the baselines.


Fig. 5: Datasets used in the empirical validation of the method proposed in this paper. From top to bottom rows, images of the BIODI, LFW, PETA and Megaface sets are shown.

As a preliminary step, we varied the dimensionality of the embedding, in order to evaluate the sensitivity of the proposed loss with respect to this parameter. Here, we considered the LFW set, with the observed average AUC values provided in Fig. 6 (the shadowed regions denote the standard deviation of the performance, after 10 trials). As expected, higher dimensions of the embedding appear to be directly correlated with the observed performance, even though the results roughly stabilised for dimensions larger than 128. In this regard, we assumed that embeddings of even larger dimensions would require much more training data to provide better performance, and hence resorted to a target space dimension of $d = 128$ in all the subsequent experiments.

Interestingly, the absolute performance observed for very low dimensions of the embedding ($d = 2$) was not too far from the values observed for much larger dimensions ($d = 128$), which raises the possibility of simply using the position of the elements in the target space directly for classification and visualization purposes, without the need for any dimensionality reduction algorithm (the MDS, LLE or PCA algorithms are frequently seen in the literature for this purpose).


Fig. 6: Variations in the mean AUC values (± the standard deviations after 10 trials, given as shadowed regions) with respect to the dimensionality of the embedding. Results are given for the LFW validation set, when using the VGG-like (solid line) and ResNet-like (dashed line) CNN architectures.

IV-C Identity Retrieval

The identity retrieval problem was considered the ideal task to evaluate the semantic coherence of the embeddings generated by our proposal. In this section we report the performance of the quadruplet loss on this task, compared to four baselines: the triplet loss, the center loss, the softmax loss and the work due to Chen et al. [5]. We considered the LFW and Megaface sets, with three labels: {"ID", "Gender", "Ethnicity"} ($t = 3$). In this setting, note that all baselines classify pairs into positive/negative depending exclusively on the ID, while the proposed loss uses all the labels to determine how similar two classes are.

Considering the two classical verification and identification scenarios, we provide the Receiver Operating Characteristic curves (ROC, Fig. 7), the Cumulative Match curves (CMC, Fig. 8) and the Detection and Identification rates at rank-1 (DIR, Fig. 9). Also, the results are summarized in Table I, providing the rank-1, top-10% values and the mean average precision (mAP) scores:

$$\text{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \sum_{k} P_q(k) \, \Delta r_q(k), \qquad (10)$$

where $Q$ is the number of queries, $P_q(k)$ is the precision at cut-off $k$ and $\Delta r_q(k)$ is the change in recall from cut-off $k-1$ to $k$.
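For reference, the average precision of one query under eq. (10) can be computed as follows (a sketch; `relevance` is the binary relevance vector of the ranked gallery):

```python
import numpy as np

def average_precision(relevance):
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    # Precision at each cut-off k; the recall increment is 1/R at every
    # relevant position (R = total number of relevant items), 0 elsewhere.
    precision_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float((precision_at_k * relevance).sum() / relevance.sum())

# mAP: mean of the per-query average precisions.
# map_score = np.mean([average_precision(r) for r in all_queries_relevance])
```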


Fig. 7: Comparison between the Receiver Operating Characteristic (ROC) curves observed for the LFW and Megaface sets, in linear and logarithmic (inner plot) scales. Results are shown for the quadruplet loss function (purple lines) and the four baselines considered: softmax (red lines), center loss (green lines), triplet loss (blue lines) and the method due to Chen et al. [5] (black lines).


Fig. 8: Closed-set identification (CMC) curves observed for the LFW and Megaface sets. A zoomed-in region with the top-10 results is shown in the inner plot. Results are shown for the quadruplet loss function (purple lines) and the four baselines considered: softmax (red lines), center loss (green lines), triplet loss (blue lines) and the work due to Chen et al. [5] (black lines).

For the LFW set experiment, the BLUFR4 evaluation protocol was chosen. In the verification (1:1) experiments, the test set contained 9,708 face images of 4,249 subjects, which yielded over 47 million matching scores. For the open-set identification task, the genuine probe set contained 4,350 face images of 1,000 subjects, the impostor probe set had 4,357 images of 3,249 subjects, and the gallery set had 1,000 images.

Generally, we observed that the proposed quadruplet loss outperforms the other loss functions, in most cases by a consistent margin, both for the verification and identification tasks and when considering the VGG and ResNet architectures. Overall, the ResNet-like error rates were roughly 90% of those observed for the VGG-like architecture, for all loss functions/methods tested (larger margins were observed for the softmax loss).

The improvements in performance of our method with respect to the baselines are particularly evident in the top-10% values, which accords with its rationale: even in cases where the true identity is not returned at the first position (due to poor data quality), the soft biometrics information assures that the element is not projected too far from the corresponding identity centroid, and the true identity is retrieved among the top-10% identities with high probability. Not surprisingly, the method of Chen et al. [5] outperformed the remaining competitors, followed by the triplet loss function, which is consistent with most of the results reported in the literature. The softmax loss repeatedly yielded the worst performance among the five functions considered.

Regarding the levels of performance per dataset, the values observed for the Megaface set were far worse, for all objective functions, than the corresponding LFW values. In this dataset, we followed the protocol of the small training set, using 490,000 images from 17,189 subjects for learning (the images overlapping with the FaceScrub dataset were discarded). Noteworthily, the relative performance between the loss functions compared was exactly the same as in LFW. Also, the softmax loss produced the most evident degradation in performance when compared to the LFW set.


Fig. 9: Comparison between the Detection and Identification Rate (DIR) at rank-1 curves, observed for the LFW and Megaface sets, using the VGG-like (top row) and ResNet-like (bottom row) architectures. Results are shown in linear and logarithmic (inner plot) scales, for the proposed loss function (purple lines) and the four baselines considered: softmax (red lines), center loss (green lines), triplet loss (blue lines) and the work of Chen et al. [5] (black lines).
Method | mAP | rank-1 | top-10%

LFW
Quadruplet loss | 0.958 ± 0.003 | 0.951 ± 0.020 | 0.979 ± 0.006
                | 0.968 ± 0.002 | 0.966 ± 0.012 | 0.981 ± 0.004
Softmax loss | 0.897 ± 0.004 | 0.842 ± 0.034 | 0.953 ± 0.011
             | 0.912 ± 0.004 | 0.861 ± 0.029 | 0.960 ± 0.008
Triplet loss [33] | 0.934 ± 0.004 | 0.929 ± 0.033 | 0.964 ± 0.008
                  | 0.947 ± 0.004 | 0.948 ± 0.026 | 0.968 ± 0.009
Center loss [42] | 0.918 ± 0.003 | 0.863 ± 0.020 | 0.962 ± 0.006
                 | 0.939 ± 0.003 | 0.898 ± 0.016 | 0.967 ± 0.006
Chen et al. [5] | 0.961 ± 0.002 | 0.940 ± 0.022 | 0.976 ± 0.006
                | 0.966 ± 0.002 | 0.959 ± 0.015 | 0.983 ± 0.004

Megaface
Quadruplet loss | 0.877 ± 0.011 | 0.812 ± 0.053 | 0.960 ± 0.009
                | 0.902 ± 0.009 | 0.906 ± 0.048 | 0.972 ± 0.008
Softmax loss | 0.727 ± 0.014 | 0.615 ± 0.060 | 0.863 ± 0.017
             | 0.730 ± 0.010 | 0.745 ± 0.051 | 0.899 ± 0.011
Triplet loss [33] | 0.854 ± 0.009 | 0.758 ± 0.059 | 0.946 ± 0.017
                  | 0.872 ± 0.008 | 0.839 ± 0.052 | 0.957 ± 0.009
Center loss [42] | 0.850 ± 0.013 | 0.772 ± 0.052 | 0.939 ± 0.012
                 | 0.847 ± 0.009 | 0.845 ± 0.048 | 0.940 ± 0.009
Chen et al. [5] | 0.864 ± 0.012 | 0.772 ± 0.061 | 0.947 ± 0.009
                | 0.916 ± 0.008 | 0.880 ± 0.050 | 0.970 ± 0.008

TABLE I: Identity retrieval performance of the proposed method with respect to the baselines: softmax, center and triplet losses, and the method due to Chen et al. [5]. The average performance ± standard deviation values are given. Inside each cell, the top values regard the VGG-like model, and the bottom values regard the ResNet-like architecture.

IV-D Soft Biometrics Inference

As stated above, the proposed quadruplet loss can also be used for learning a soft biometrics estimator. At test time, the position to which one element is projected can be used to infer its soft labels, in a nearest-neighbour fashion: "if the query was projected into a 'male' region, then it should also be a 'male'". In these experiments, we considered only the 1-NN rule, i.e., the label inferred for each query was simply given by the closest gallery element in the embedding. Clearly, better results would be attained if a larger number of neighbours were considered, even though the computational cost of classification would also increase. All experiments were conducted according to a bootstrapping-like strategy: having the test images available, the bootstrap randomly selects (with replacement) images, creating samples composed of 90% of the whole data. Ten test samples were created and the experiments were conducted independently on each trial, which enabled reporting the mean and the standard deviation of each performance value.
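A minimal sketch of this 1-NN inference rule (our own illustration over precomputed projections):

```python
import numpy as np

def soft_labels_1nn(query_f, gallery_f, gallery_labels):
    # The query inherits the soft labels (e.g., gender, age, ethnicity)
    # of the closest gallery projection in the embedding.
    dists = np.linalg.norm(gallery_f - query_f, axis=1)
    return gallery_labels[np.argmin(dists)]
```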

As baselines, we used two commercial off-the-shelf (COTS) techniques considered to represent the state-of-the-art: the Matlab SDK for Face++5 and the Microsoft Cognitive Toolkit6. Face++ is a commercial face recognition system, with good performance reported in the LFW face recognition competition (second best rate). The Microsoft Cognitive Toolkit is a deep learning framework that provides useful information based on vision, speech and language.

We considered exclusively the "Gender", "Ethnicity" and "Age" labels ($t = 3$), quantised respectively into two classes for Gender ("male", "female"), three classes for Age ("young", "adult", "senior"), and three classes for Ethnicity ("white", "black", "asian"). The average and standard deviation performance values are reported in Table II for the BIODI, PETA and LFW sets.

The results achieved by the quadruplet loss function compare favourably to the COTS techniques for most labels, in particular on the BIODI and LFW datasets. Regarding the PETA set, Face++ invariably outperformed the other techniques, even if by a reduced margin in most cases. We justify this by the extreme heterogeneity of the image features in this set, which results from the concatenation of different databases. This probably reduced the representativity of the learning data with respect to the test set, with the Face++ model apparently being the least sensitive to this covariate. Note that the "Ethnicity" label is only provided by the Face++ framework.

Globally, these experiments supported the possibility of using the proposed method to estimate groups of soft labels in a single-shot paradigm, which can be particularly important to reduce the computational cost of using specialized soft labelling tools.

Method | Gender | Age | Ethnicity

BIODI
Quadruplet loss | 0.816 ± 0.006 | 0.603 ± 0.014 | 0.777 ± 0.011
                | 0.834 ± 0.005 | 0.649 ± 0.011 | 0.786 ± 0.009
Face++ | 0.760 ± 0.008 | 0.588 ± 0.019 | 0.788 ± 0.017
Microsoft Cognitive | 0.738 ± 0.007 | 0.552 ± 0.026 | -

PETA
Quadruplet loss | 0.862 ± 0.024 | 0.649 ± 0.061 | 0.797 ± 0.053
                | 0.882 ± 0.018 | 0.658 ± 0.057 | 0.810 ± 0.036
Face++ | 0.870 ± 0.028 | 0.653 ± 0.062 | 0.812 ± 0.054
Microsoft Cognitive | 0.885 ± 0.020 | 0.660 ± 0.057 | -

LFW
Quadruplet loss | 0.939 ± 0.021 | 0.702 ± 0.059 | 0.801 ± 0.044
                | 0.944 ± 0.017 | 0.709 ± 0.049 | 0.817 ± 0.041
Face++ | 0.928 ± 0.041 | 0.527 ± 0.063 | 0.842 ± 0.061
Microsoft Cognitive | 0.931 ± 0.037 | 0.710 ± 0.051 | -

TABLE II: Soft biometrics labelling performance (mAP) attained by the proposed method, with respect to two commercial off-the-shelf systems (Face++ and Microsoft Cognitive). The average performance ± standard deviation values are given, after 10 trials. Inside each cell, the top values regard the VGG-like model, and the bottom values regard the ResNet-like architecture.

Finally, it is interesting to analyse the variations in performance with respect to the number of labels considered, i.e., the value of the $t$ parameter. For this purpose, we used the BIODI dataset (the one with the largest number of annotated labels: 14), and analysed the average amount of labelling errors for varying values of $t$, with $d = 128$ in all cases. The results are shown in Fig. 10, providing the relationship between the independent variable $t$ and the average labelling error $e(t)$ in the test set:

$$e(t) = \frac{1}{n} \sum_{i=1}^{n} \lVert \hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)} \rVert_0, \qquad (11)$$

with $\hat{\mathbf{y}}^{(i)}$ denoting the predicted labels for the $i^{th}$ of the $n$ test images and $\mathbf{y}^{(i)}$ being the corresponding ground-truth. $\lVert \cdot \rVert_0$ denotes the $\ell_0$-norm.
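Under the reconstruction of eq. (11) above, the error can be computed as follows (a sketch over integer-coded label matrices):

```python
import numpy as np

def mean_labelling_error(y_pred, y_true):
    # Average, over the n test images, of the number of disagreeing labels
    # (the l0-norm of the prediction/ground-truth difference).
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(np.count_nonzero(y_pred != y_true, axis=1)))
```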


Fig. 10: Variations in the soft biometrics labelling performance with respect to the number of labels considered in the learning phase, i.e., the value of $t$. Results are shown for the BIODI test dataset, with $d = 128$, when using the VGG-like (solid line) and ResNet-like (dashed line) CNN architectures.

Overall, a positive correlation between the average error values and the values of $t$ was observed, which we justify by the tougher compromise that the models must find when inferring a large number of labels simultaneously, and by the hardness of inferring some of the labels (e.g., the type of shoes). Still, considering such degradations relatively small, we concluded that the proposed loss can be used both when a reduced number of labels is available and when a relatively large number of labels must be simultaneously inferred. In this regard, we presume that even larger values of $t$ would require substantially larger amounts of learning data and also larger values of the $d$ parameter (the dimension of the embedding).

V Conclusions and Further Work

In this paper we proposed a loss function designed for multi-output classification problems, where the response variables have dimension larger than one. Our function can be seen as a generalization of the popular triplet loss, replacing the binary division of image pairs into positive/negative and the notion of anchor by: i) a metric that measures the relative similarity between any two classes; and ii) a quadruplet term that, based on this semantic similarity, explicitly imposes distances between pairs of projections in the target space.

In particular, having focused on the identity retrieval and soft biometrics problems, the proposed loss uses all the available labels (e.g., "ID", "Gender", "Age" and "Ethnicity") to produce feature embeddings that are semantically coherent, i.e., in which not only the intra-class compactness is guaranteed, but the broad families of classes also occupy adjacent regions of the embedding. This is particularly useful for identity retrieval purposes, where this type of embedding enables a direct correspondence between the class (ID) centroids and their semantic descriptions, i.e., where a "young, black, male" is assuredly closer to an "adult, black, male" than to a "young, white, female" subject. This is in opposition to previous loss functions, which project the elements into the embedding based uniquely on their appearance, and assume that semantic coherence will naturally result from the similarities in appearance between the different classes (i.e., from the similarity of image features).

Acknowledgements

This work is funded by FCT/MCTES through national funds and when applicable co-funded by EU funds under the project UIDB/EEA/50008/2020. Also, this research was funded by the FEDER, Fundo de Coesão e Fundo Social Europeu, under the PT2020 program, in the scope of the POCI-01-0247-FEDER-033395 project. We acknowledge the support of NVIDIA Corporation, with the donation of one Titan X GPU board.

Footnotes

  1. http://di.ubi.pt/~hugomcp/BIODI/
  2. https://tomiworld.com/
  3. https://github.com/hugomcp/quadruplets
  4. http://www.cbsr.ia.ac.cn/users/scliao/projects/blufr/
  5. http://www.faceplusplus.com/
  6. https://www.microsoft.com/cognitive-services/

References

  1. N. Almudhahka, M. Nixon and J. Hare. Automatic Semantic Face Recognition. Proceedings of the IEEE 12th International Conference on Automatic Face & Gesture Recognition, doi: 10.1109/FG.2017.31, 2017.
  2. E. Bekele, C. Narber and W. Lawson. Multi-attribute Residual Network (MAResNet) for Soft-biometrics Recognition in Surveillance Scenarios. Proceedings of the IEEE 12th International Conference on Automatic Face & Gesture Recognition, doi: 10.1109/FG.2017.55, 2017.
  3. E. Cipcigan and M. Nixon. Feature Selection for Subject Ranking using Soft Biometric Queries. Proceedings of the 15th IEEE International Conference on Advanced Video and Signal-based Surveillance, doi: 10.1109/AVSS.2018.8639319, 2018.
  4. J-C. Chen, R. Ranjan, A. Kumar, C-H. Chen, V. Patel and R. Chellappa. An End-to-End System for Unconstrained Face Verification with Deep Convolutional Neural Networks. Proceedings of the IEEE International Conference on Computer Vision Workshops, doi: 10.1109/ICCVW.2015.55, 2015.
  5. W. Chen, X. Chen, J. Zhang and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. Proceedings of the IEEE International Conference on Computer Vision, doi: 10.1109/CVPR.2017.145, 2017.
  6. V. Choutas, P. Weinzaepfel, J. Revaud and C. Schmid. PoTion: Pose MoTion Representation for Action Recognition. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pag. 7024–7033, doi: 10.1109/CVPR.2018.00734, 2018.
  7. Y. Deng, P. Luo, C. Loy and X. Tang. Pedestrian attribute recognition at far distance. Proceedings of the ACM International Conference on Multimedia, pag. 789–792, doi: 10.1145/2647868.2654966, 2014.
  8. J. Deng, J. Guo, N. Xue and S. Zafeiriou. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. https://arxiv.org/abs/1801.07698, 2019.
  9. Y. Duan, J. Lu and J. Zhou. UniformFace: Learning Deep Equi-distributed Representation for Face Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  10. H. Galiyawala, K. Shah, V. Gajjar and M. Raval. Person Retrieval in Surveillance Video using Height, Color and Gender. https://arxiv.org/abs/1810.05080, 2018.
  11. J. Goldberger, G. Hinton, S. Roweis and R. Salakhutdinov. Neighbourhood Components Analysis. Proceedings to the Advances in Neural Information Processing Systems Conference, vol. 17, pag. 513–520, doi: 10.5555/2976040.2976105, 2005.
  12. B. Guo, M. Nixon and J. Carter. Fusion Analysis of Soft Biometrics for Recognition at a Distance. Proceedings of the IEEE 4th International Conference on Identity, Security, and Behavior Analysis, doi: 10.1109/ISBA.2018.8311457, 2018.
  13. B. Guo, M. Nixon and J. Carter. A Joint Density Based Rank-Score Fusion for Soft Biometric Recognition at a Distance. Proceedings of the International Conference on Pattern Recognition, pag. 3457–3460, doi: 10.1109/ICPR.2018.8546071, 2018.
  14. R. Hadsell, S. Chopra and Y. LeCun. Dimensionality reduction by learning an invariant mapping. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, doi: 10.1109/CVPR.2006.100, 2006.
  15. M. Halstead, S. Denman, C. Fookes, Y. Tan and M. Nixon. Semantic Person Retrieval in Surveillance Using Soft Biometrics: AVSS 2018 Challenge II. Proceedings of the IEEE International Conference on Advanced Video Signal-based Surveillance, doi: 10.1109/AVSS.2018.8639379, 2018.
  16. E. Learned-Miller, G. Huang, A. RoyChowdhury, H. Li and G. Hua. Labeled Faces in the Wild: A Survey. In Advances in Face Detection and Facial Image Analysis, Michal Kawulok, M. Emre Celebi and Bogdan Smolka (eds.), Springer, pag. 189–248, doi: 10.1007/978-3-319-25958-1_8, 2016.
  17. K. He, Z. Wang, Y. Fu, R. Feng, Y-G. Jiang and X. Xue. Adaptively Weighted Multi-task Deep Network for Person Attribute Classification. Proceedings of the ACM Multimedia Conference, doi: 10.1145/3123266.3123424, 2017.
  18. Y. Hu, X. Wu and R. He. Attention-Set Based Metric Learning for Video Face Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pag. 2827–2840, 2018.
  19. S. Ji, W. Xu, M. Yang and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pag. 221–231, 2013.
  20. M. Jiang, Z. Yang, W. Liu and X. Liu. Additive Margin Softmax with Center Loss for Face Recognition. Proceedings of the 2nd International Conference on Video and Image Processing, pag. 1–8, doi: 10.1145/3301506.3301511, 2018.
  21. B-N. Kang, Y. Kim and D. Kim. Deep Convolution Neural Network with Stacks of Multi-scale Convolutional Layer Block using Triplet of Faces for Face Recognition in the Wild. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, doi: 10.1109/CVPRW.2017.89, 2017.
  22. I. Kemelmacher-Shlizerman, S. Seitz, D. Miller and E.  Brossard. The megaface benchmark: 1 million faces for recognition at scale. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pag. 4873–4882, doi: 10.1109/CVPR.2016.527, 2016.
  23. F. Lateef and Y. Ruichek. Survey on semantic segmentation using deep learning techniques. Neurocomputing, vol. 338, pag. 321–348, 2019.
  24. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C-Y. Fu and A. Berg. SSD: Single Shot MultiBox Detector. Proceedings of the European Conference on Computer Vision, pag. 21–37, doi: 10.1007/978-3-319-46448-0_2, 2016.
  25. T. Neal and D. Woodard. You Are Not Acting Like Yourself: A Study on Soft Biometric Classification, Person Identification, and Mobile Device Use. IEEE Transactions on Biometrics, Behaviour and Identity Science, vol. 1, no. 2, pag. 109–122, 2019.
  26. D. Li, Z. Zhang, X. Chen and K. Huang. A Richly Annotated Pedestrian Dataset for Person Retrieval in Real Surveillance Scenarios. IEEE Transactions on Image Processing, vol. 28, no. 4, pag. 1575–1590, 2019.
  27. H. Liu and W. Huang. Body Structure Based Triplet Convolutional Neural Network for Person Re-Identification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, doi: 10.1109/ICASSP.2017.7952461, 2017.
  28. D. Martinho-Corbishley, M. Nixon and J. Carter. Super-Fine Attributes with Crowd Prototyping. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 6, pag. 1486–1500, 2019.
  29. R. Ranjan, A. Bansal, S. Sankaranarayanan, J-C. Chen, C. Castillo and R. Chellappa. Crystal Loss and Quality Pooling for Unconstrained Face Verification and Recognition. https://arxiv.org/abs/1804.01159, 2018.
  30. R. Vera-Rodriguez, P. Marin-Belinchon, E. Gonzalez-Sosa, P. Tome and J. Ortega-Garcia. Exploring Automatic Extraction of Body-based Soft Biometrics. Proceedings of the IEEE International Carnahan Conference on Security Technology, doi: 10.1109/CCST.2017.8167841, 2017.
  31. P. Samangouei and R. Chellappa. Convolutional neural networks for attribute-based active authentication on mobile devices. Proceedings of the IEEE 8th International Conference on Biometrics Theory, Applications and Systems, pag. 1–8, doi: 10.1109/BTAS.2016.7791163, 2016.
  32. A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf and A. Smola. A kernel method for the two-sample-problem. Proceedings of the Advances in Neural Information Processing Systems Conference, pag. 513–520, doi: 10.5555/2976456.2976521, 2006.
  33. F. Schroff, D. Kalenichenko, J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, doi: 10.1109/CVPR.2015.7298682, 2015.
  34. A. Schumann and A. Specker. Attribute-based person retrieval and search in video sequences. Proceedings of the IEEE International Conference on Advanced Video Signal-based Surveillance, doi: 10.1109/AVSS.2018.8639114, 2018.
  35. H. Shi, X. Zhu, S. Liao, Z. Lei, Y. Yang and S. Li. Constrained Deep Metric Learning for Person Re-identification. https://arxiv.org/abs/1511.07545, 2015.
  36. H. Song, Y. Xiang, S. Jegelka and S. Savarese. Deep Metric Learning via Lifted Structured Feature Embedding. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, doi: 10.1109/CVPR.2016.434, 2016.
  37. E. Gonzalez-Sosa, J. Fierrez, R. Vera-Rodriguez and F. Alonso-Fernandez. Facial Soft Biometrics for Recognition in the Wild: Recent Works, Annotation, and COTS Evaluation. IEEE Transactions on Information Forensics and Security, vol. 13, no. 8, pag. 2001-2014, 2018.
  38. C. Su, Y. Yan, S. Chen and H. Wang. An efficient deep neural networks framework for robust face recognition. Proceedings of the IEEE International Conference on Image Processing, doi: 10.1109/ICIP.2017.8296993, 2017.
  39. F. Wang, W. Zuo, L. Lin, D. Zhang and L. Zhang. Joint learning of single- image and cross-image representations for person re-identification. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, doi: 10.1109/CVPR.2016.144, 2016.
  40. J. Wang, Z. Wang, C. Gao, N. Sang and R. Huang. DeepList: Learning Deep Features With Adaptive List-wise Constraint for Person Re-identification. IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pag. 513–524, 2017.
  41. Y. Wen, K. Zhang, Z. Li and Y. Qiao. A Discriminative Feature Learning Approach for Deep Face Recognition. Proceedings of the 14th European Conference on Computer Vision, doi: 10.1007/978-3-319-46478-7_31, 2016.
  42. Y. Wen, K. Zhang, Z. Li and Y. Qiao. A Comprehensive Study on Center Loss for Deep Face Recognition. International Journal of Computer Vision, vol. 127, pag. 668–683, 2019.
  43. B. Zhou, A. Lapedriza, J. Xiao, A. Torralba and A. Oliva. Learning deep features for scene recognition using places database. Proceedings of the Advances in Neural Information Processing Systems Conference, pag. 487–495, doi: 10.5555/2968826.2968881, 2014.