ID-aware Quality for Set-based Person Re-identification
Set-based person re-identification (SReID) is a matching problem that aims to verify whether two sets are of the same identity (ID). Existing SReID models typically generate a feature representation per image and aggregate these representations into a single set embedding. However, they can easily be perturbed by noise (perceptually or semantically low quality images), which is inevitable due to imperfect tracking and detection systems, or they can overfit to trivial images. In this work, we present a novel and simple solution to this problem based on ID-aware quality, which measures the perceptual and semantic quality of images guided by their ID information. Specifically, we propose an ID-aware Embedding that consists of two key components: (1) Feature learning attention, which aims to learn robust image embeddings by focusing on ‘medium’ hard images. This prevents overfitting to trivial images and alleviates the influence of outliers. (2) Feature fusion attention, which fuses the image embeddings in the set to obtain the set-level embedding. It ignores noisy information and pays more attention to discriminative images in order to aggregate more discriminative information. Experimental results on four datasets show that our method outperforms state-of-the-art approaches despite its simplicity.
Set-based person re-identification (SReID) [zheng2016mars, liu2017qan, song2017region, li2018diversity] is a matching problem that aims to identify the same person across multiple non-overlapping cameras. Each person is represented by a set consisting of multiple images. SReID has attracted increasing attention recently because of its critical applications in video surveillance, e.g., in airports and shopping malls.
Although many SReID approaches exist [liu2015spatio, yan2016person, you2016top, zhu2016video, mclaughlin2017video, liu2017video, chung2017two], they mainly follow two steps: feature representation and feature fusion. At a high level, the feature representation is learned with a deep convolutional neural network (CNN), and the image feature representations in the set are then aggregated by simple average fusion. However, since average fusion treats all images in the set equally, it ignores the fact that some images are more informative than others. To this end, some SReID methods [zhou2017see, xu2017jointly, liu2017qan, song2017region, li2018diversity] apply attentive aggregation, in which the CNN is modified to generate an attention score for each image as an estimate of its quality, e.g., the image quality estimation network [liu2017qan]. Their main purpose is to identify low quality or non-discriminative image embeddings (features) and ignore them at the fusion stage. Nevertheless, the attention is learned implicitly without any extra supervision and is guided only by the standard loss function. As a result, we argue in this paper that these quality-based methods address perceptual quality problems, e.g., blurry images (Fig. 1(a)), but cannot solve semantic quality problems such as ‘incomplete body’, ‘total occlusion’, and ‘multiple people in one image’, as shown in Fig. 1(b), Fig. 1(c), and Fig. 1(d), respectively. Semantic quality problems are inevitable in real life since we do not yet have perfect detection and tracking systems.
Estimating the perceptual and semantic quality of images in the set is non-trivial. Based on our observations from Fig. 1, we find that both kinds of quality can be estimated by how much the image is related to its set ID. To this end, we define ID-aware quality, which measures the perceptual quality and semantic quality of images guided by their ID information. To be more specific, consider two examples in Fig. 1. The leftmost image in Fig. 1(a) is blurry and therefore of lower perceptual quality, so it is less related to its set ID: it has low ID-aware quality. The middle image in Fig. 1(c) is of high perceptual quality because it is clear (no blur). However, its semantic quality is low due to the occlusion by a different person, so its ID-aware quality is again low. Thus we argue that ID-aware quality covers more diverse quality issues. Moreover, the ID label for the middle image in Fig. 1(c) is wrong with respect to the set ID. The existing methods [song2017region, li2018diversity] that rely on ID classification loss suffer from this kind of corrupted label.
To realise ID-aware quality, we propose to use the classification confidence score. Our intuition is that, in general, high quality images (i.e., trivial images) are easily classified and get high confidence scores (e.g., 0.99), while low quality images (e.g., outliers) are less related to their set ID and get low confidence scores (e.g., 0.01). Similarly, ‘medium’ quality images get medium confidence scores. Furthermore, using classification confidence to estimate ID-aware quality has two attractive properties. First, ID-aware quality can be estimated from readily available ID information, so no extra annotations are needed. Second, it is easy to obtain, as ID classification loss is already used in most existing SReID methods.
Based on ID-aware quality, we formulate an ID-aware Embedding (IDE) to learn a robust embedding function. IDE consists of two main components: (1) Feature learning attention (FLA), which targets learning robust image embeddings by focusing on ‘medium’ quality images (i.e., medium hard images). This is because high quality images (i.e., trivial images) contribute little to the loss and produce very small gradients, while low quality images (very hard images or outliers) contribute large but misleading gradients. Focusing on medium hard images therefore prevents overfitting to trivial images and alleviates the influence of outliers. (2) Feature fusion attention (FFA), which fuses the image embeddings in the set to obtain the set-level embedding. It ignores noisy information and pays more attention to discriminative images in order to aggregate more discriminative information. IDE is an end-to-end framework built on a CNN and optimised jointly by an ID classification loss and a set-based verification loss. It is worth mentioning that it is simple: we learn global representations of images and aggregate them into a single embedding without applying sophisticated attentive spatiotemporal features [li2018diversity].
Contributions: (1) A novel concept named ID-aware quality, based on the classification confidence, is proposed for set-based person re-identification. It can estimate not only the perceptual quality but also the semantic quality of images with respect to the set ID. (2) To show the applicability of ID-aware quality, we formulate ID-aware Embedding (IDE), which learns robust set-level embeddings that can be used for set-based person re-identification. Our model can be trained in an end-to-end manner. (3) Extensive experiments are carried out on four benchmarks for set-based person re-identification: MARS [zheng2016mars], iLIDS-VID [wang2014person], PRID-2011 [hirzer2011person], and LPW [song2017region]. Our method achieves new state-of-the-art performance on all datasets. In addition, cross-dataset experiments are conducted, in which our method also achieves state-of-the-art performance.
2 Related Work
Learning Image Representations. A common approach to learning image representations is based on a CNN trained with an objective function such as identification loss or verification loss. This is mainly motivated by image-based person re-identification methods [xiao2016learning, zhong2017re, cheng2016person, zheng2016person], where each ID has a single image instead of multiple images. Some methods learn a global feature representation at the image level [liu2017qan, chen2018video], while others learn local image features to obtain a more fine-grained representation [song2017region, li2018diversity]. In this work, we follow the same strategy. However, the difference is that our method focuses on ‘medium’ quality images while discarding trivial (easy) images and outliers (very hard images).
Learning Set Representations. We review several approaches to learning set representations for SReID. We divide them into two groups according to whether they consider the quality of the images.
Non-quality-based methods. The methods belonging to this group do not consider quality; rather, they focus on temporal information [liu2018spatial, mclaughlin2017video, liu2015spatio, you2016top, zhu2016video, yan2016person, mclaughlin2016recurrent, liu2017video, chung2017two] or on designing different network architectures based on 3D convolutional nets [li2018multi]. For example, long short-term memory (LSTM) [hochreiter1997long] is applied in [yan2016person] for aggregating image representations, and the output at the last time stamp is taken as the final representation of each video/set. A recurrent neural network (RNN) is applied in [mclaughlin2017video, liu2017video, chung2017two] for accumulating the temporal information in each sequence, and the spatial features of all images are fused by average pooling.
Quality-based methods. The methods belonging to this group consider the quality of individual images in the set when aggregating image-level feature embeddings. The quality is formulated in the form of attention paradigm. That is, if the attention score is high for a particular image, then it is assumed that it is of high quality. The methods in [wang2014person, huang2018video] select the most discriminative set (video) fragments to learn set representations, which can be regarded as fragment-level attention. Recently, image-level attention attracts a great deal of interest and has been widely studied [zhou2017see, xu2017jointly, liu2017qan, song2017region, chen2018video, li2018diversity]. In particular, in attentive spatial-temporal pooling networks [xu2017jointly] and co-attentive embedding [chen2018video], an attention mechanism is designed such that the computation of set representations in the gallery depends on the probe data. A similar approach is taken in [zhou2017see] which contains an attention unit for weighted fusion of image embeddings. Score generation branch is designed for generating attention score in quality aware network [liu2017qan], region-based quality estimation network [song2017region], and diversity regularized spatiotemporal attention [li2018diversity]. More recently, in order to take the parts into account, [fu2018sta] proposed an attention mechanism by dividing the feature map into fixed set of horizontal parts.
Our work has the following key differences: (1) We only consider spatial appearance information in the image set, without using the temporal information and optical flow information used in [fu2018sta, zhu2016video, song2017region, li2018diversity]. (2) All quality-based approaches mentioned above learn the quality scores implicitly. In contrast, we learn quality with respect to the set ID explicitly, supervised by the ID information. (3) Unlike the methods in [liu2017qan, song2017region], which only deal with the perceptual quality problem, our method can cope with a diverse set of quality problems. (4) Unlike the methods in [xu2017jointly, chen2018video], ID-aware embedding does not depend on the probe data, and is thus more scalable and applicable. (5) Furthermore, although region-based feature extraction [song2017region, li2018diversity, fu2018sta] has achieved great success, to demonstrate the robustness and effectiveness of IDE we simply learn a global representation for every image in the set.
Attention. Numerous works on attention exist, and they can be divided into two directions [jetley2018learn, Andrea2019]. One direction is post hoc network analysis, in which the attention unit relies on a fully trained classification network model [karen, Zhou_2016_CVPR]. The other is trainable attention, in which the weights of the attention unit and the original network weights are learned jointly [jetley2018learn, seo2016]. Our work aligns with the first direction, i.e., post hoc network analysis. However, our problem setting is verification, in which the label spaces of the training and test data are disjoint. Thus the attention mechanism used during training cannot be used for testing.
The overall pipeline of our framework, IDE, is shown in Fig. 2. First, the given image set, which consists of several person images, goes through a CNN that outputs a representation for each image in the set. It is followed by a fully connected layer and softmax normalisation to generate ID-aware qualities. Then, these qualities are used in two ways: (1) The feature learning attention (FLA) component uses the qualities such that medium hard images get more weight. This ends with the weighted cross-entropy loss for image-based classification. (2) The feature fusion attention (FFA) component transforms the qualities so that noisy images are given smaller weights, in order to aggregate more discriminative information. This is supervised by the contrastive loss for set-based verification. IDE is trained end-to-end and optimised by the two losses jointly. More formally, we aim to learn an embedding function that takes an image set as input, i.e., a collection of images each with a given height, width and number of channels, and outputs a single fixed-dimensional discriminative feature vector for the set.
In what follows, we present the key components in detail, i.e., ID-aware quality generation, FLA, FFA, and loss functions.
3.1 ID-aware Quality Generation
ID-aware quality indicates how much an image is semantically related to its set ID. We propose to generate it as follows. Firstly, we obtain image representations: for each image set in the training batch, consisting of images paired with the identity of the set (for brevity, hereafter we omit the set index where appropriate), we employ a deep CNN to embed each image into a fixed-dimensional feature representation. Then, in order to obtain quality scores, we apply a fully connected layer followed by a softmax normalisation. This way, the learned parameters of the fully connected layer are ID context vectors, one for each identity in the training set. In short, the fully connected layer plays the role of an ID classifier. More formally, the semantic relation of an image to an ID can be measured by the compatibility between the image’s feature vector and the ID’s context vector. We calculate the dot product between the two vectors to measure their compatibility [jetley2018learn], followed by a softmax operator that normalises the semantic relations over all identities (Eq. (1)).
Inherently, Eq. (1) computes the classification confidence (likelihood) of an image with respect to its set ID. Accordingly, ID-aware quality is estimated directly by the classification confidence [goldberger2016training].
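As an illustration of the quality generation step described above, the following is a minimal sketch in plain Python. It assumes the fully connected layer has no bias and that features and context vectors are given as lists of floats; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def id_aware_quality(feature, context_vectors, set_id):
    """Classification confidence of one image with respect to its set ID.

    feature: image embedding (assumed to come from the CNN).
    context_vectors: one weight vector per training identity, i.e. the
        rows of the fully connected layer (bias omitted by assumption).
    set_id: index of the identity that labels the image set.
    """
    # Dot-product compatibility between the image and every ID context vector.
    logits = [sum(f * c for f, c in zip(feature, ctx)) for ctx in context_vectors]
    # Softmax normalises the compatibilities over all identities.
    return softmax(logits)[set_id]

# Toy example: 3 identities, 2-dimensional features (made-up numbers).
contexts = [[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]]
clean_image = [1.0, 0.0]      # well aligned with identity 0
occluded_image = [0.1, 0.9]   # looks more like identity 1

q_clean = id_aware_quality(clean_image, contexts, set_id=0)
q_occluded = id_aware_quality(occluded_image, contexts, set_id=0)
```

In this toy setting the clean image receives a quality close to 1 and the occluded image a quality close to 0, even though both carry the same set label.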
3.2 Feature Learning Attention
During learning image embeddings, FLA focuses on medium hard images whose classification confidences are around the centre of distribution, i.e., 0.5, as the distribution is between 0 and 1. This is a trade-off between gradient magnitude and gradient correctness. Intuitively, high quality images are easily classified and obtain high classification confidences (e.g., 0.99), but they are trivial because they contribute little to the loss and their gradients are relatively small. Low quality images (e.g. outliers) have low classification confidences (e.g., 0.01) and large gradients, but their gradient directions are misleading. To achieve the trade-off between gradient magnitude and gradient correctness, medium hard images are given higher weights/attention in FLA. The medium hard images are those whose ID-aware qualities are around the centre of distribution – 0.5.
To realise this intuition, we propose the following Gaussian function to compute FLA scores from ID-aware qualities (other functions could also be designed to reach the same goal; exploring them is left for future work):
where 0.5 is the mean and the temperature parameter controls the distribution of FLA scores. The FLA score of an image indicates its weight in the ID classification task. Note that the choice of 0.5 follows from the intuition stated above. One attractive property of this function is that we can change the distribution of scores merely by changing the temperature parameter. Fig. 3 illustrates FLA with Eq. (2).
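The behaviour of FLA can be sketched as follows. The exact parameterisation of the Gaussian is not recoverable from the text here, so this sketch assumes the common form exp(-(q - 0.5)^2 / (2 * t^2)), where t plays the role of the temperature; the function name and the value t = 0.2 are illustrative only.

```python
import math

def fla_score(quality, temperature, mean=0.5):
    """Gaussian weight that peaks for medium hard images (quality near 0.5).

    The 2 * temperature**2 denominator is an assumed parameterisation.
    """
    return math.exp(-((quality - mean) ** 2) / (2.0 * temperature ** 2))

# A trivial image, a medium hard image and an outlier (illustrative qualities).
qualities = [0.99, 0.5, 0.01]
scores = [fla_score(q, temperature=0.2) for q in qualities]
```

Under this assumed form the medium hard image gets the maximum score of 1, while the trivial image and the outlier are down-weighted symmetrically; a smaller temperature makes the weighting more selective.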
3.3 Feature Fusion Attention
During feature fusion, which is used for verification at a later stage, FFA aggregates the most informative feature embeddings. In general, high quality images (in terms of ID-aware quality) are very discriminative and are thus more informative for embedding the set. This is shown in Fig. 4. To realise this intuition, we compute FFA scores from ID-aware qualities using the following Gaussian function, which favours high quality images:
where the temperature parameter controls the distribution of FFA scores. The FFA score of an image indicates its importance when being aggregated into the set embedding. As a result, we obtain the set representation by weighted fusion of the image representations in the set:
The term in denominator is the sum of FFA scores for normalisation. At each iteration, FFA scores of images are computed in the forward process and used as constant values for just scaling the gradient vectors during gradient back-propagation.
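The weighted fusion can be sketched as below. Two assumptions are made that the text here does not confirm: the FFA Gaussian is centred at the maximum quality of 1, and it uses the form exp(-(q - 1)^2 / (2 * t^2)). Names and numbers are hypothetical.

```python
import math

def ffa_score(quality, temperature, mean=1.0):
    """Gaussian weight that is largest for high quality images.

    Centring at 1.0 and the 2 * temperature**2 denominator are assumptions.
    """
    return math.exp(-((quality - mean) ** 2) / (2.0 * temperature ** 2))

def fuse_set(features, qualities, temperature):
    """Set embedding as the FFA-weighted average of image embeddings."""
    scores = [ffa_score(q, temperature) for q in qualities]
    total = sum(scores)  # normalisation term in the denominator
    dim = len(features[0])
    return [sum(s * f[d] for s, f in zip(scores, features)) / total
            for d in range(dim)]

# Two clean images and one outlier (made-up embeddings and qualities).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
qualities = [0.95, 0.90, 0.05]

set_embedding = fuse_set(features, qualities, temperature=0.2)
average = [sum(f[d] for f in features) / len(features) for d in range(2)]
```

Compared with plain averaging, the outlier's contribution to the second coordinate is almost entirely suppressed. In a deep learning framework the scores would additionally be detached from the computation graph so that they act as constants during back-propagation, as described above.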
3.4 Loss Functions
Suppose each mini-batch contains a fixed number of image sets and each set contains the same number of images, so that the mini-batch size is the product of the two.
Weighted Cross-Entropy Loss. To learn robust image representations based on FLA scores, we propose a weighted cross-entropy loss for the image-level ID classification task:
where the term in the denominator is for normalisation. At each iteration, after being computed in the forward process, the FLA scores of images are treated as constant values that scale the gradients during back-propagation. Accordingly, compared with the standard cross-entropy loss, the partial derivative of the weighted loss for each image is the standard one scaled by that image’s normalised FLA score.
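The effect of the weighting on the loss and on the gradients can be sketched as follows. The sketch assumes the loss is the FLA-weighted sum of per-image negative log-confidences divided by the sum of FLA scores, and it uses the standard softmax result that the derivative of -log p with respect to the true-class logit is p - 1. The function names and the FLA scores are illustrative.

```python
import math

def weighted_cross_entropy(confidences, fla_scores):
    """FLA-weighted cross-entropy over a mini-batch.

    confidences: probability assigned to the correct ID for each image.
    fla_scores: per-image weights, treated as constants.
    """
    total = sum(fla_scores)  # normalisation term in the denominator
    return -sum((w / total) * math.log(p)
                for w, p in zip(fla_scores, confidences))

def grad_wrt_true_logit(confidence, fla_score, total_score):
    """Gradient of the weighted loss w.r.t. the true-class logit of one image.

    The standard softmax cross-entropy gradient (p - 1) is scaled by the
    normalised FLA score because the score is held constant.
    """
    return (fla_score / total_score) * (confidence - 1.0)

confidences = [0.99, 0.5, 0.01]   # trivial, medium hard, outlier
fla_scores = [0.05, 1.0, 0.05]    # illustrative weights

loss = weighted_cross_entropy(confidences, fla_scores)
uniform_loss = weighted_cross_entropy(confidences, [1.0, 1.0, 1.0])
grads = [grad_wrt_true_logit(p, w, sum(fla_scores))
         for p, w in zip(confidences, fla_scores)]
```

With uniform weights the function reduces to the ordinary mean cross-entropy. With the illustrative FLA scores, the medium hard image dominates the gradient, whereas under uniform weights the outlier would dominate.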
Contrastive Loss. For set-based verification, on top of the set-level representations, we employ the contrastive loss [hadsell2006dimensionality] using the multi-batch setting [tadmor2016learning]. The contrastive loss is a well-known loss function used for verification tasks. It pulls sets of the same identity as close together as possible and pushes sets of different identities farther apart than a pre-defined margin. Specifically, we construct a pair between every two set embeddings in the mini-batch and compute the contrastive loss per pair:
where the pair label indicates whether the two sets share the same identity, and the distance is measured between the two set embeddings of the pair. The contrastive loss of the mini-batch is the average loss over all the pairs:
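A minimal sketch of the pairwise construction and the batch loss is given below. It assumes the classical contrastive form (squared distance for positive pairs, squared hinge on the margin for negative pairs) with Euclidean distance; the paper's exact scaling and distance function are not recoverable from the text here, and all names are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two embeddings (assumed distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_batch_loss(set_embeddings, set_ids, margin):
    """Average contrastive loss over all pairs of set embeddings."""
    losses = []
    for i, j in combinations(range(len(set_embeddings)), 2):
        d = euclidean(set_embeddings[i], set_embeddings[j])
        if set_ids[i] == set_ids[j]:
            losses.append(d ** 2)                       # pull positives together
        else:
            losses.append(max(0.0, margin - d) ** 2)    # push negatives past the margin
    return sum(losses) / len(losses)

# Mini-batch layout from the implementation details: 3 persons, 2 sets each.
set_ids = [0, 0, 1, 1, 2, 2]
pairs = list(combinations(range(len(set_ids)), 2))
num_positive = sum(set_ids[i] == set_ids[j] for i, j in pairs)
num_negative = len(pairs) - num_positive

# Perfectly separated toy embeddings give zero loss with margin 1.2.
embeddings = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]]
separated_loss = contrastive_batch_loss(embeddings, set_ids, margin=1.2)
```

For six sets this yields 15 pairs, of which 3 are positive and 12 are negative, matching the 1:4 ratio reported in the implementation details.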
IDE is trained end-to-end by optimising the weighted cross-entropy loss and set-based contrastive loss jointly:
After the training is finished, the trained CNN model can be applied to extract image features. Generally, the label spaces of the training and testing sets are disjoint in the verification problems. Therefore, we cannot estimate the ID-aware quality of testing images as their classification confidences are not available during testing.
To obtain the set embedding in the test phase, we aggregate image representations in the set simply by average fusion. After that, we verify whether two image sets show the same person simply by computing their cosine distance.
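The test-time procedure described above can be sketched in a few lines. Cosine distance is taken here as one minus cosine similarity, which is a common convention but an assumption with respect to the paper; the function names and toy sets are illustrative.

```python
import math

def average_fusion(features):
    """Set embedding as the plain mean of image embeddings."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]

def cosine_distance(a, b):
    """One minus cosine similarity (assumed definition of cosine distance)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy image sets: set_a and set_b point in the same direction, set_c does not.
set_a = [[1.0, 0.0], [0.8, 0.2]]
set_b = [[2.0, 0.2], [1.6, 0.2]]
set_c = [[0.0, 1.0], [0.0, 3.0]]

d_ab = cosine_distance(average_fusion(set_a), average_fusion(set_b))
d_ac = cosine_distance(average_fusion(set_a), average_fusion(set_c))
```

A small distance (as between set_a and set_b) suggests the same person, and a large one (as between set_a and set_c) suggests different persons; the decision threshold or ranking protocol is left to the evaluation setting.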
Remarks: 1) We would like to note that although we use the term ‘attention’ to mean the weight for images, our method is different from classical attention-based methods. In our formulation, the scores of FLA and FFA are taken as constant values after forward propagation. In contrast, standard attention-based methods would allow the gradients to flow through FFA and FLA. 2) During training we can estimate the quality; however, we cannot estimate the quality at the testing stage. Our main goal during training is to learn a robust embedding function by ignoring perceptually and semantically low quality images. We assume that our model therefore generalises better because it does not overfit to the training dataset. Our extensive experiments validate that this assumption is reasonable. Please see Sec. 2 for more.
4.1 Datasets and Settings
Datasets. We use two large-scale and two small-scale datasets in our experiments: (1) LPW [song2017region] is a recently released large-scale dataset. The persons in the dataset are collected across three different scenes separately. Three cameras are placed in the first scene, while four cameras are placed in the other two scenes. Each person is captured by more than one camera so that cross-camera search is possible in each scene. In total there are 7694 image sets with about 77 images per set. Following the evaluation setting in [song2017region], 1975 persons captured in the second and third scenes are used for training, while 756 persons from the first scene are used for testing. The dataset is challenging as the evaluation protocol is close to the real-world situation, that is, the training data and the testing data differ in terms of not only identities but also scenes. (2) MARS [zheng2016mars] is another large-scale dataset. There are 20478 tracklets (image sets) of 1261 persons in total and each person is captured by at least two cameras. This dataset is challenging due to automatic detection and tracking errors. (3) iLIDS-VID [wang2014person] is a small-scale dataset. Since it is collected in an airport environment, image sequences contain significant viewpoint variations, occlusions and background clutter. There are 300 identities in total and each identity has two tracklets from two different cameras. (4) PRID2011 [hirzer2011person] consists of 400 tracklets for 200 persons from two cameras. The tracklet length varies from 5 to 675. Compared to the previous datasets, PRID2011 is less challenging because of few variations and rare occlusion. For LPW and MARS, we follow the evaluation settings in [song2017region] and [zheng2016mars] respectively. iLIDS-VID and PRID2011 are each split into two subsets of equal size following [wang2014person], one for training and the other for testing.
In addition, the 10 random trials are fixed and the same as in [wang2014person] for fair comparison. A summary of the datasets is shown in Table 1.
| evaluation | CMC | CMC & mAP | CMC | CMC |
Evaluation Metrics. We report the Cumulated Matching Characteristics (CMC) results for all the datasets. We also report the mean average precision (mAP) for MARS following the common practice.
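For readers unfamiliar with CMC, the following is a minimal sketch of how a CMC curve can be computed from a probe-to-gallery distance matrix. It assumes every probe has at least one correct match in the gallery and ignores dataset-specific rules such as same-camera filtering; the function name and the toy distances are hypothetical.

```python
def cmc_curve(distance_matrix, probe_ids, gallery_ids, max_rank):
    """Fraction of probes whose first correct match is within the top k.

    distance_matrix[p][g] is the distance between probe p and gallery item g.
    Returns a list whose entry k - 1 is the CMC value at rank k.
    """
    hits = [0] * max_rank
    for p, row in enumerate(distance_matrix):
        # Gallery indices sorted from the closest to the farthest.
        order = sorted(range(len(row)), key=lambda g: row[g])
        # Zero-based rank position of the first correct match.
        first_match = next(r for r, g in enumerate(order)
                           if gallery_ids[g] == probe_ids[p])
        for k in range(first_match, max_rank):
            hits[k] += 1
    return [h / len(distance_matrix) for h in hits]

# Toy example: 2 probes, 3 gallery sets (made-up distances).
distances = [[0.1, 0.5, 0.9],
             [0.4, 0.6, 0.2]]
cmc = cmc_curve(distances, probe_ids=[0, 1], gallery_ids=[0, 1, 2], max_rank=3)
```

In the toy example the first probe is matched at rank 1 and the second only at rank 3, so the curve is 0.5 at ranks 1 and 2 and 1.0 at rank 3.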
Implementation Details. We use GoogLeNet with batch normalisation [ioffe2015batch] as our backbone architecture. Every input image is resized to . We do not apply any data augmentation for training and testing. Each mini-batch contains 3 persons, 2 image sets per person, and 9 images per set, so the batch size is 54 (3 × 2 × 9). Each image set in the mini-batch is randomly sampled from the complete image set during training. According to Eq. (8), we have 3 positive set pairs and 12 negative set pairs in total, thus the positive-to-negative ratio is 1:4. The margin of the contrastive loss is set to 1.2. The stochastic gradient descent (SGD) optimiser is applied with an initial learning rate of . When training the model on each dataset, we initialise it with the GoogLeNet model pre-trained on ImageNet. We use Caffe [jia2014caffe] for implementation.
The two temperature parameters control the variances of the FLA scores and FFA scores respectively. Based on our experimental results, the performance is insensitive to them, so they are fixed in all the experiments, although better results could be obtained by exploring the optimal parameters for each dataset. An analysis of these parameters is reported in the supplementary material.
4.2 Comparison with State-of-the-art Methods
MARS Dataset. We compare our approach with several state-of-the-art methods. Overall, we divide them into two categories: non-quality-based and quality-based (attention-based) methods. The non-quality-based methods: the softmax loss is used during training with custom/common networks, e.g., IDE(CaffeNet) [zheng2016mars], IDE(ResNet50) [zhong2017re], and metric learning/re-ranking strategies are applied for boosting the performance, e.g., IDE(CaffeNet)+XQDA [zheng2016mars], IDE(CaffeNet)+XQDA+RR [zhong2017re], IDE(ResNet50)+XQDA [zhong2017re], IDE(ResNet50)+XQDA+RR [zhong2017re]; the temporal information of image sequences is used by combining recurrent neural networks with convolutional networks (CNN+RNN [mclaughlin2017video]) and again metric learning is used on top of this (CNN+RNN+XQDA); optical flow is used to capture motion information (AMOC+EpicFlow [liu2017video]). The quality-based methods: combining image-level attention and temporal pooling to select informative images (ASTPN [xu2017jointly], SRM+TAM [zhou2017see], CAE [chen2018video]; for CAE, we present the results of the complete sequence instead of multiple subsets so that it can be compared with other methods, since multiple subsets can be regarded as data augmentation); designing attentive spatiotemporal models to extract complementary region-based information between images in the set (RQEN [song2017region], DRSA [li2018diversity]).
Results of our method and the compared methods are shown in Table 2. The following key findings are observed: (1) Our method considerably outperforms all the compared methods in both metrics (CMC, mAP). Noticeably, our mAP is around 6% better than DRSA and 4% better than CAE. DRSA applies spatiotemporal attention to extract features of latent regions and is pre-trained on several large image-based person ReID datasets. CAE is a co-attentive embedding model and is computationally expensive, as the embeddings of sets in the gallery depend on the probe data. In addition, CAE combines optical flow information and RGB information. (2) Generally, the quality-based methods outperform the non-quality-based methods, except for ASTPN, which has a relatively shallow architecture. (3) When these methods are coupled with a re-ranking technique, there is always a performance boost (ours achieves 82.2% in terms of mAP).
iLIDS-VID, PRID2011, and LPW datasets. Grouping the methods for these datasets is the same, and most compared methods are borrowed from Table 2. Additional methods include: a spatial-temporal body-action model (STA [liu2015spatio]), novel metric learning approaches (SIDL [zhu2016video], TDL [you2016top]).
The results are shown in Table 3. We find that our observations are consistent with those on the MARS dataset. In particular: (1) On all datasets, our method achieves the best performance; (2) The margin on iLIDS-VID and LPW is larger than on PRID2011. This is because PRID2011 is a much cleaner dataset and its accuracy is already high; (3) Noticeably, the LPW dataset is the most challenging since the scenes are different for training and testing, and yet our method is around 14% higher than RQEN [song2017region], showing the generalisation capability of our proposed method.
4.3 Effectiveness of FLA and FFA
IDE has two key components: (1) FLA aims to learn robust image representations. Based on FLA, weighted cross-entropy loss is proposed to replace standard cross-entropy loss in the baseline; (2) During training, FFA is proposed for weighted fusion of image embeddings to replace average fusion in the baseline. We use LPW for this analysis. Table 4 shows that the performance improves significantly using either FLA or FFA compared to the baseline, demonstrating the effectiveness of FLA and FFA. We obtain the best performance by FFA+FLA at ranks ranging from 1 to 10.
4.4 A Variant of FFA
To see the impact of the weighting scheme, we apply to FFA the same idea as FLA, i.e., paying more attention to medium hard samples. We name this variant FFAMH. Table 5 shows that in both settings (without FLA or with FLA) FFAMH performs similarly to average fusion but much worse than our proposed FFA. This validates our assumption that assigning more weight to high quality images is desirable at the verification stage.
4.5 Cross-Dataset Evaluation
Cross-dataset testing is a better way to evaluate a system’s real-world performance than only evaluating its performance on the same dataset used for training. Generally any public dataset only represents a small proportion of all real-world data. The model trained on A dataset could perform much worse when applied to B dataset, which indicates the model overfits to the particular scenario.
We conduct cross-dataset testing on PRID2011 to evaluate the generalisation of our method. The diverse and large MARS and iLIDS-VID are used for training.
We follow the evaluation setting in CNN-RNN [mclaughlin2017video] and ASTPN [xu2017jointly], i.e., the model is tested on 50% of PRID2011. We use the same testing data as Table 3 so the cross-dataset testing results can be compared with the results in Table 3.
The results are shown in Table 6 (cross-dataset experimental results of QAN and RQEN are not reported as they were tested on 100% of PRID2011). As expected, the results are worse than within-dataset testing because of dataset bias.
However, our method generalises well and achieves state-of-the-art cross-dataset testing performance.
For CMC-1 accuracy, our method achieves 41.4% when trained on MARS and 61.7% when trained on iLIDS-VID. Noticeably, it is comparable with the performance (64.1%) of spatial-temporal body-action model STA [liu2015spatio] in Table 3.
This indicates that IDE enhances the generalisation ability significantly.
In this paper we propose a new concept, ID-aware quality, which measures the semantic quality of images guided by their ID information. Based on ID-aware quality, we propose ID-aware embedding which contains FLA and FFA. To learn robust image embeddings, FLA gives more weight to medium hard images. To accumulate more discriminative information when fusing image features, FFA assigns more weight to high quality images. Compared with previous state-of-the-art attentive spatiotemporal methods, our method works much better in both within-dataset testing and cross-dataset testing. Furthermore, IDE is much simpler compared with previous attentive spatiotemporal methods.
1 The temperature parameter of FLA:
To study the impact of the FLA temperature, we fix the temperature of FFA in all experiments. We conduct experiments on MARS and LPW and report the results in Table 7 and Figure 5. Firstly, we can see that the performance is not sensitive to the FLA temperature. For example, the performance difference is smaller than 1.5% on MARS when it ranges from 0.12 to 0.24. Secondly, better results can be obtained by exploring the optimal value on each dataset. In the main paper, we fix the FLA temperature to a single value on all datasets. We observe that the value that works best on LPW is not the best on MARS.
2 The temperature parameter of FFA:
To study the influence of the FFA temperature, we fix the FLA temperature in all experiments. The experiments are conducted on MARS and LPW and their results are presented in Table 8 and Figure 6. First, we observe that the performance is also insensitive to the FFA temperature. The performance gap is less than 1.0% on MARS and around 2.0% on LPW. Second, we can also obtain better results by searching for the optimal parameter on each dataset. We fix the FFA temperature to a single value on all datasets in the main paper. We notice that the value that works best on LPW differs from the one that is best on MARS.