Learning a Semantically Discriminative Joint Space for Attribute Based Person Re-identification
Attribute-based person re-identification (Re-ID) aims to search for persons in large-scale image databases using attribute queries. Despite its great practical significance, the huge gap between semantic attributes and visual images makes it a very challenging task. Existing research usually focuses on matching query attributes against attribute scores predicted from person images, which suffers from imperfect attribute prediction and low discriminability. In this work, we propose to formulate attribute-based person Re-ID as a joint space learning problem. To alleviate the negative impact of the huge heterogeneous gap between the modalities, we apply a novel adversarial training strategy to generate homogeneous features for both modalities, which aligns their distributions at the feature level while keeping semantic consistency across modalities. Our experiments validate the effectiveness of our model and show large improvements over state-of-the-art models.
Pedestrian attributes, e.g., age, gender and dressing, are searchable semantic elements that describe a person. Attribute-based person re-identification (Re-ID), depicted in Figure 1, aims at re-identifying a person in a surveillance environment according to specific attribute descriptions provided by users. This task is significant for finding missing people among the tens of thousands of surveillance cameras installed in modern metropolises. Compared with conventional image-based Re-ID [63, 4, 56, 58], attribute-based Re-ID has the advantage that its query is much easier to obtain. For instance, it is more practical when searching for criminal suspects for whom only verbal testimony about their appearance is available.
Despite its great significance, attribute-based Re-ID is very challenging due to the inconsistent distribution, representation and discriminability of the two modalities, i.e., the high-level attribute, which is abstract and structural, and the low-level person image, which is informative but noisy. These inconsistencies result in a heterogeneous gap that hinders the attribute-based Re-ID task. The most common solution is to predict attributes for each person image and to search within the predicted attributes [39, 49, 38]. If the attributes of each pedestrian image could be recognized reliably, this would be the best way to find the person corresponding to the query attributes. However, recognizing attributes from a person image is still an open issue, as pedestrian images from surveillance environments often suffer from low resolution, pose variations and illumination changes. Imperfect recognition therefore prevents this intuitive approach from being a good solution to the heterogeneous gap. Moreover, in a large-scale scenario the true attributes of two pedestrians are often different but very similar, which results in a very small inter-class distance in the predicted attribute space. This small inter-class distance makes the predicted attributes fragile under prediction errors and thus inherently less discriminative. Together, imperfect prediction and low semantic discriminability undermine the reliability of these existing models.
In this paper, we formulate attribute-based Re-ID as a joint space learning problem to address the intrinsic challenge of the heterogeneous gap. Instead of learning a direct mapping from images to attributes as in the intuitive method, we treat our joint space learning process as a generative model, where we generate homogeneous and discriminative features for each modality in the learned joint space. The intuition of our method is that, when we hold some attribute description in mind, e.g., “dressed in red”, we form an obscure and vague imagination of how a person dressed in red may look, which we refer to as a concept. Once a concrete image is given, our vision system automatically processes the low-level features (e.g., color and edge) to obtain some perceptions, and then tries to judge whether the perceptions and the concept agree with each other. With our methodology, directly estimating the attributes of a person image is avoided, and the problem caused by imperfect prediction is naturally sidestepped, because we learn a semantically discriminative joint space rather than predict and match attributes.
Specifically, we propose a novel joint space learning model based on adversarial learning (Figure 2). Our model has two branches for attributes and person images, respectively. The person image branch (lower branch in Figure 2) is a standard convolutional network that learns to extract semantically discriminative concepts from the low-level person images with a semantic ID classifier (see Sec. 3.1 for “semantic ID”). The attribute branch (upper branch in Figure 2) is a generator with a generative adversarial architecture that evolves the high-level attributes into mid-level concepts in order to align with raw person images in the joint space. To keep the features homogeneous, we use an adversarial strategy to constrain the generated attribute concepts to follow a distribution structure similar to that of the image concepts, while at the same time keeping the semantic discriminability of the high-level attributes. This homogeneity objective generates attribute concepts that align with image concepts at both the semantic level and the distribution level, and effectively reduces the heterogeneous gap so that more discriminative features are learned for both modalities.
Experiments conducted on Duke Attribute, Market Attribute and PETA, which are the three biggest pedestrian attribute datasets to our knowledge, show that our proposed method not only performs much better than the attribute prediction methods, but also outperforms other cross-modality retrieval methods that are applicable to our problem.
In summary, our contributions are as follows:
1. For the first time in Re-ID, we propose to formulate the attribute-based Re-ID as a joint space learning problem, so that we can better address the heterogeneous gap problem. We propose to learn a joint space for both the attribute and the person image, where the semantically discriminative information of the low-level person image is learned, and the corresponding homogeneous feature of the high-level attribute is generated for searching.
2. We propose a novel adversarial learning-based joint space learning model. Our model adopts a generative adversarial architecture to strengthen the homogeneity of cross-modality features, and learns to generate attribute concepts whose distribution structure is similar to that of the image concepts. To the best of our knowledge, this is the first work in Re-ID to learn a semantically discriminative joint space, as well as the first to adopt the adversarial learning pattern.
3. We conduct experiments on three large-scale benchmark datasets to validate our model. The experimental results show that our model outperforms the state of the art by noticeable margins.
2 Related Works
2.1 Attribute-based Re-ID
While most research has used pedestrian attributes as side information or as a mid-level representation to improve traditional person Re-ID tasks [29, 18, 44, 20, 21, 19, 43], some early works also suggested the application of attribute-based person Re-ID [40, 17, 49]. They typically formulated attribute-based Re-ID as an attribute prediction problem. That is, they first predict attribute scores for each image, and then search images according to the predicted scores. The attribute prediction task [38, 18, 3, 22, 62] has thus become the core of attribute-based person search and has been studied further. As many large datasets [3, 29] have been released, deep learning methods have recently shown state-of-the-art performance in pedestrian attribute prediction [22, 62]. Li et al. show that multi-task learning can benefit the multi-attribute prediction task by using a joint convolutional network whose layers are shared across different attribute prediction tasks. Recently, Lu et al. proposed a dynamic branching procedure to make task grouping decisions so that different attributes have different levels of sharing in the deep network.
Despite the performance improvements achieved by attribute prediction methods, directly retrieving persons according to their predicted attributes is still challenging, because attribute prediction is still not robust to cross-view condition changes such as different illumination and viewpoints. Unlike these models, our model does not directly predict the attributes of an image. We cast cross-view attribute-image matching as cross-modality matching in a latent feature space so as to avoid the problem.
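The predict-then-search pipeline described above can be illustrated with a minimal sketch. It assumes a Euclidean distance between the binary query vector and the predicted scores; the function name and the toy scores are hypothetical and are not taken from any of the cited methods.

```python
import math

def rank_by_predicted_attributes(query, predicted_scores):
    """Sort gallery indices by Euclidean distance between the query attribute
    vector and each image's predicted attribute scores (assumed metric)."""
    def dist(scores):
        return math.sqrt(sum((q - s) ** 2 for q, s in zip(query, scores)))
    return sorted(range(len(predicted_scores)),
                  key=lambda i: dist(predicted_scores[i]))

# Hypothetical toy gallery: three images, three predicted attribute scores each.
gallery_scores = [
    [0.9, 0.2, 0.8],
    [0.1, 0.9, 0.7],
    [0.8, 0.4, 0.1],
]
query = [1, 0, 1]  # binary attribute query
ranking = rank_by_predicted_attributes(query, gallery_scores)
print(ranking)
```

Because the ranking depends entirely on the predicted scores, any prediction error shifts the distances directly, which is the fragility this section points out.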
2.2 Conventional Re-ID
Conventional image-based Re-ID aims to match target persons across multiple non-overlapping camera views, using images as queries. A large number of models for the conventional Re-ID problem have been proposed, including learning or designing discriminative features [63, 34, 59, 27], learning invariant and discriminative distance metrics [64, 16, 26, 30, 27, 28, 60, 66, 25, 53, 65, 58] and end-to-end deep learning [24, 1, 56, 55, 4].
These models have contributed a lot to the Re-ID community. However, they hold a strong and often unrealistic assumption that images of the target person are available. Our model is thus different from them in that ours addresses the attribute-based Re-ID which requires cross-modality matching.
2.3 Cross-modality Retrieval
Our work is related to cross-modal content search, which aims to bridge different modalities [12, 35, 2, 57, 23, 61, 62, 50, 13, 15, 42, 51]. The most traditional and practical solution to this task is Canonical Correlation Analysis (CCA), which projects two modalities into a common space that maximizes their correlation. The original CCA method, which uses a linear projection, is further improved by [35, 2, 57]. Ngiam et al. and Feng et al. also applied autoencoder-based methods to model the cross-modal correlation, and Wei et al. proposed a deep semantic matching method to address the cross-modal retrieval problem for samples that are annotated with one or multiple labels. Recently, A. Eisenschtat and L. Wolf designed a novel model of two tied pipelines that maximize the projection correlation using a Euclidean loss, which achieves state-of-the-art results on some datasets.
The most related work in the cross-modality retrieval area is proposed by Li et al., which applies a neural attention mechanism to retrieve pedestrian images using language descriptions. Although the query modality is easily provided by users in both settings, our attribute-based Re-ID has its own strength. As natural language descriptions are largely unconstrained, it is hard for a retrieval system to achieve satisfactory performance with them. Our attribute-based Re-ID can embed more pre-defined attribute descriptions and thereby obtain better performance.
Most cross-modality retrieval methods attempt to embed inputs of the two modalities into a joint space where the correlation of matched samples is maximized. These joint space learning methods can partly reduce the distribution discrepancy across modalities, which we argue is an important factor limiting performance, but to our knowledge few of them model this problem explicitly. Our model instead learns features whose distribution structure is aligned across modalities, which further improves the retrieval performance.
2.4 Distribution Alignment Methods
The idea of generating a homogeneous cross-modality structure comes from distribution alignment, which is widely used in domain adaptation [48, 33, 32, 8, 46, 31] and cross-modality generation tasks [37, 9]. In domain adaptation, several works [48, 33] align the distributions of data from two different domains based on Maximum Mean Discrepancy (MMD), which minimizes the norm of the difference between the means of the two distributions. In contrast to these methods, the deep Correlation Alignment (CORAL) method matches both the mean and the covariance of the two distributions. Recently, generative structures, which solve the distribution generation task as a two-player minimax game, have shown a powerful ability to generate homogeneous distribution structures not only across different data domains [47, 6], but also across the more discrepant data distributions of different modalities [37, 9].
Our work differs from these methods in that we address attribute-based Re-ID, which requires our model to generate a homogeneous structure from the two discrepant modalities while at the same time keeping semantic consistency across modalities.
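To make the two alignment criteria above concrete, the following sketch computes a mean-difference term (the linear-kernel form of MMD) and a covariance-difference term (as in the CORAL loss) on two small feature sets. The function names, the scaling by 1/(4 d^2), and the toy data are assumptions for illustration and not the implementations used by the cited works.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows):
    """Unbiased sample covariance matrix of the feature vectors."""
    n = len(rows)
    mu = mean_vector(rows)
    d = len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def linear_mmd(xs, ys):
    """Squared Euclidean norm of the difference between the two feature means."""
    mx, my = mean_vector(xs), mean_vector(ys)
    return sum((a - b) ** 2 for a, b in zip(mx, my))

def coral_distance(xs, ys):
    """Squared Frobenius norm of the covariance difference, scaled by 1/(4 d^2)."""
    cx, cy = covariance(xs), covariance(ys)
    d = len(cx)
    return sum((cx[i][j] - cy[i][j]) ** 2
               for i in range(d) for j in range(d)) / (4 * d * d)

# Hypothetical features from two domains (or modalities).
source = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[1.0, 1.0], [3.0, 3.0]]    # same spread, different mean
stretched = [[0.0, 0.0], [4.0, 4.0]]  # different mean and spread
print(linear_mmd(source, shifted), coral_distance(source, shifted))
print(linear_mmd(source, stretched), coral_distance(source, stretched))
```

The shifted set differs from the source only in its mean, so only the mean term is non-zero, while the stretched set also changes the covariance term.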
Attribute-based Re-ID aims at finding specific pedestrian images from an image database $\mathcal{I} = \{I_i\}_{i=1}^{N}$, where $N$ denotes the size of $\mathcal{I}$, according to an attribute description $A$. We denote the mid-level concepts generated from the image and from the attribute as $C^I$ and $C^A$, respectively.
The goal of our method is to learn two mappings $\Phi_I$ and $\Phi_A$ that respectively map low-level person images and high-level semantic attributes into a joint discriminative space, which can be regarded as the concept space mentioned above. That is, $C^I = \Phi_I(I)$ and $C^A = \Phi_A(A)$. To achieve this, we train an image-embedding network as a CNN and an attribute-embedding network as a deep fully connected network. We parameterize our model by $\Theta$, and obtain $\Theta$ by optimizing a concept generation objective $L_{concept}$. Given $N$ training pairs of images and attributes $\{(I_i, A_i)\}_{i=1}^{N}$, the optimization problem is formulated as

$$\Theta^{*} = \arg\min_{\Theta} \; L_{concept}\big(\{(I_i, A_i)\}_{i=1}^{N};\, \Theta\big).$$

In this paper, we design $L_{concept}$ as a combination of several loss terms, each of which formulates a specific consideration of our problem. The first consideration is that the concepts we extract from the low-level noisy person images should be semantically discriminative. We formulate it by a classification loss $L_I$. The second consideration is that attribute concepts and image concepts should be homogeneous. Inspired by the powerful ability of generative adversarial networks to close the gap between heterogeneous distributions, we propose to embed an adversarial learning strategy into our model. This is modelled by a concept generating objective $L_{CG}$, which aims to generate concepts that are not only meaningful but also homogeneous with the concepts extracted from images. Therefore, we have

$$L_{concept} = L_I + L_{CG}.$$

In the following, we describe each of them in detail.
3.1 Image Concept Extraction
Our concept extraction loss $L_I$ is based on softmax classification on the image concepts $C^I$. Since our objective is to learn semantically discriminative concepts that distinguish objects with different attributes rather than specific persons, we assign a semantic ID to each person image according to its attributes rather than its person ID. Thus, different people with the same attributes share the same semantic ID. We define the concept extraction loss as a softmax classification objective over semantic IDs. Formally, let $\Psi^I_i = f_{ID}(C^I_i)$ denote the unnormalized log probabilities predicted by the softmax classifier $f_{ID}$ for the $i$-th image, and let $y_i$ be its semantic ID. We have

$$L_I = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\Psi^I_{i,y_i}\big)}{\sum_{k}\exp\big(\Psi^I_{i,k}\big)},$$

where the $k$-th entry $\Psi^I_{i,k}$ is the unnormalized log probability of having the semantic ID $k$.
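As an illustration of the two steps above, the sketch below assigns semantic IDs by grouping identical attribute vectors and evaluates a softmax cross-entropy for one sample. It is a minimal sketch under assumptions: the function names and toy attribute vectors are hypothetical, and the ID order (first occurrence) is an arbitrary choice.

```python
import math

def assign_semantic_ids(attribute_vectors):
    """Give every distinct attribute combination its own integer ID,
    so that different persons with identical attributes share an ID."""
    table, ids = {}, []
    for attrs in attribute_vectors:
        key = tuple(attrs)
        if key not in table:
            table[key] = len(table)  # next unused ID, in order of first occurrence
        ids.append(table[key])
    return ids

def softmax_cross_entropy(logits, label):
    """Negative log softmax probability of `label`, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

# Hypothetical attribute annotations for four images of (possibly) different persons.
attributes = [[1, 0, 1], [0, 1, 1], [1, 0, 1], [0, 0, 0]]
semantic_ids = assign_semantic_ids(attributes)
print(semantic_ids)
print(softmax_cross_entropy([2.0, 0.5, -1.0], semantic_ids[0]))
```

Averaging `softmax_cross_entropy` over all training images gives a loss of the form described in this subsection.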
3.2 Attribute Concept Generation
We regard the attribute mapping $\Phi_A$ as a generative process, much like the way people form an imagination from an attribute description. As semantically discriminative latent concepts can be extracted from images, they can also serve as a guideline for learning the attribute embedding. We apply an adversarial training objective to generate homogeneous attribute concepts that reside on the image concept manifold.
In the adversarial training process, we train our concept generator $\Phi_A$ with the goal of fooling a skillful concept discriminator $D$ that is trained to distinguish the attribute concept from the image concept, so that the generated attribute concept is aligned with the image concept. Let $\theta_G$ and $\theta_D$ denote the parameters of $\Phi_A$ and $D$, respectively. The adversarial min-max problem is formulated as

$$\min_{\theta_G}\max_{\theta_D} \; \mathbb{E}_{I}\big[\log D(C^I)\big] + \mathbb{E}_{A}\big[\log\big(1 - D(C^A)\big)\big], \qquad C^A = \Phi_A(A).$$

The above optimization problem is solved by iteratively optimizing $\theta_G$ and $\theta_D$. Therefore, the objective can be decomposed into two loss terms $L_G$ and $L_D$, which are for training the concept generator $\Phi_A$ and the discriminator $D$, respectively. The whole objective during adversarial training is then formed by the weighted sum of the two terms:

$$L_{adv} = \lambda_G L_G + \lambda_D L_D,$$

where $\lambda_G$ and $\lambda_D$ trade off the two terms.
In addition to generating a homogeneous representation for the attribute $A$, we further encourage our model to generate meaningful concepts. That is, we require our model to maintain the semantic discriminability of the attribute concept, which should be consistent with the image concept. To this end, we add a semantic classification loss $L_A$ on the attribute concept $C^A$ using the same classifier $f_{ID}$ as in the image concept extraction. Similar to Eq. (3), with $\Psi^A_i = f_{ID}(C^A_i)$ we have

$$L_A = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\Psi^A_{i,y_i}\big)}{\sum_{k}\exp\big(\Psi^A_{i,k}\big)}.$$

Thus, the overall concept generating objective for attributes becomes the sum of $L_{adv}$ and $L_A$:

$$L_{CG} = L_{adv} + L_A.$$

In this way, we encourage our generation model to produce a homogeneous attribute concept structure. Meanwhile, our model correlates attribute concepts with semantically matched image concepts by maintaining semantic discriminability in the learned joint space.
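The weighted combination of generator and discriminator terms can be sketched numerically as follows. This is only an illustration under assumptions: it uses the standard minimax GAN losses, the function names are hypothetical, and the discriminator outputs are made-up probabilities in (0, 1). The default weights are the values reported in Section 4.3.

```python
import math

def discriminator_loss(d_on_image, d_on_attribute):
    """Image concepts are treated as real and attribute concepts as generated.
    Inputs are discriminator probabilities strictly between 0 and 1."""
    real_term = sum(math.log(p) for p in d_on_image) / len(d_on_image)
    fake_term = sum(math.log(1.0 - p) for p in d_on_attribute) / len(d_on_attribute)
    return -(real_term + fake_term)

def generator_loss(d_on_attribute):
    """Minimax generator term: it decreases as the discriminator is fooled."""
    return sum(math.log(1.0 - p) for p in d_on_attribute) / len(d_on_attribute)

def adversarial_objective(d_on_image, d_on_attribute, lambda_g=0.001, lambda_d=0.5):
    """Weighted sum of the generator and discriminator terms."""
    return (lambda_g * generator_loss(d_on_attribute)
            + lambda_d * discriminator_loss(d_on_image, d_on_attribute))

# Hypothetical discriminator outputs on a mini-batch of concepts.
d_img = [0.9, 0.8]   # D applied to image concepts
d_attr = [0.3, 0.2]  # D applied to generated attribute concepts
print(discriminator_loss(d_img, d_attr), generator_loss(d_attr))
print(adversarial_objective(d_img, d_attr))
```

In an actual training loop each term would update only its own parameter set (generator or discriminator); the single scalar here just mirrors the weighted-sum form used in the text.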
4.1 Datasets and Settings
Datasets. We evaluated our approach and compared it with related methods on three benchmark datasets: Duke Attribute, Market Attribute, and PETA. The Duke Attribute dataset contains 16522 images for training and 19889 images for testing. Each person has 23 attributes. As introduced above, we labelled the images with semantic IDs according to their attributes. As a result, we obtained 300 semantic IDs for training and 387 semantic IDs for testing. Similar to Duke Attribute, the Market Attribute dataset has 27 attributes to describe a person, with 12141 images and 508 semantic IDs in the training set, and 15631 images and 484 semantic IDs in the test set. In the PETA dataset, each person has 65 attributes (61 binary and 4 multi-valued). In total, we used 13000 images. We used 10000 images with 1500 semantic IDs for training, and 1500 images with 200 semantic IDs for testing.
Evaluation Metric. For comprehensive comparison, we compute both Cumulative Match Characteristic (CMC) and mean average precision (mAP) as metrics to measure performances of the compared models.
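For readers unfamiliar with the two metrics, the sketch below computes CMC and mAP from per-query relevance lists, where each list marks which ranked gallery items match the query. It follows the common definitions of these metrics; the function names and the toy rankings are hypothetical and this is not the evaluation code used in the experiments.

```python
def average_precision(ranked_relevance):
    """Mean of the precision values at the ranks where a match occurs."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_relevance, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def cmc_and_map(all_ranked_relevance, max_rank):
    """CMC[k-1] is the fraction of queries with a match within the top k;
    mAP is the mean of the per-query average precision."""
    n = len(all_ranked_relevance)
    cmc = [0.0] * max_rank
    for relevance in all_ranked_relevance:
        first_hit = next((i for i, r in enumerate(relevance) if r), None)
        if first_hit is not None:
            for k in range(first_hit, max_rank):
                cmc[k] += 1
    cmc = [c / n for c in cmc]
    mean_ap = sum(average_precision(r) for r in all_ranked_relevance) / n
    return cmc, mean_ap

# Two hypothetical queries over a gallery of three images (1 = correct match).
rankings = [[1, 0, 1], [0, 1, 0]]
cmc, mean_ap = cmc_and_map(rankings, max_rank=3)
print(cmc, mean_ap)
```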
4.2 The Network Architecture
The structure of our model is shown in Figure 2. It consists of two branches processing images and attributes, respectively. The image branch of our network is obtained by removing the last softmax classification layer of ResNet-50, which has achieved state-of-the-art results in many tasks, and adding a 128-dimensional fully connected layer whose output is the learned image concept. We design the attribute branch so that it has enough capacity to generate the attribute concept. We show the details of our network structure in Table 1.
4.3 Implementation Details
Our source code is modified from Open-ReID (https://cysu.github.io/open-reid/), using the PyTorch framework. We first pre-train our image network for 100 epochs using the semantic IDs, with an Adam optimizer with learning rate 0.01, momentum 0.9 and weight decay 5e-4. After that, we jointly train the whole network. We set the generator weight $\lambda_G$ in Eq. (2) to 0.001 and the discriminator weight $\lambda_D$ to 0.5; these choices are discussed in Section 4.5. The total number of epochs was set to 300. During training, we set the learning rate of the attribute branch to 0.01, and the learning rate of the image branch to 0.001 because it had been pre-trained. The parameters are fixed across all datasets in the comparisons.
| ours w/o adv | 33.83 | 48.17 | 53.48 | 17.82 | 39.30 | 55.88 | 62.50 | 15.17 | 36.34 | 48.48 | 53.03 | 25.35 |
| ours w/o adv+MMD | 34.15 | 47.96 | 57.20 | 18.90 | 41.77 | 62.32 | 68.61 | 14.23 | 39.31 | 48.28 | 54.88 | 31.54 |
| ours w/o adv+DeepCoral | 36.56 | 47.61 | 55.92 | 20.08 | 46.09 | 61.02 | 68.15 | 17.10 | 35.62 | 48.65 | 53.75 | 27.58 |
In this section, we compare our model with cross-modality retrieval models and attribute prediction models in terms of performance. Furthermore, we compare our model with domain alignment models. The latter are incorporated into our model because they cannot be applied directly to our problem. For a fair comparison, we use features extracted by our pre-trained ResNet-50 as image features, and raw attributes as attribute features, in all the compared models. We show the comparative results in Table 2.
Comparing with Cross-Modality Retrieval Models. We compare our model with the typical and commonly used Deep Canonical Correlation Analysis (DCCA), Deep Canonically Correlated Autoencoders (DCCAE) and a state-of-the-art model, 2WayNet. Basically, these models learn two mappings, one for each modality, to maximize the correlations of related pairs.
From Table 2 we find that our model outperforms all the cross-modality retrieval models on all three datasets by a large margin, in terms of both CMC and mAP. One of the main reasons is that our model directly learns the semantically discriminative information, whereas the other models only focus on correlations. That is, our model is specifically designed for the attribute-based Re-ID problem, where discriminability is a critical factor.
Comparing with Attribute Prediction Method. We also compare our model with the state-of-the-art attribute prediction model DeepMAR, which predicts attributes for each image and then performs matching in the attribute space. DeepMAR formulates attribute prediction as a multi-task learning problem. The output of the prediction model is the predicted score for each attribute.
From Table 2 we find that our model also outperforms the attribute prediction model. One of the main reasons is that the attribute prediction model suffers from the low discriminability of the predicted attributes. In contrast, our model learns a semantically discriminative joint space in which the search is actually performed, and thus achieves better search performance.
Comparing with Different Versions of Our Method. Finally, we compare our model with the domain alignment models, including the typical maximum mean discrepancy (MMD) and the recently proposed DeepCoral. MMD minimizes the difference between the means of two distributions, and DeepCoral matches both the mean and the covariance of the two distributions. Note that these models cannot be applied directly to attribute-based Re-ID, and therefore they are incorporated into our model, i.e., we substitute the domain alignment objective for the adversarial objective in Eq. (7) in our model. Thus, they can be viewed as variants of our model. We also compare our model with a baseline denoted as “ours w/o adv”, where the adversarial objective is removed.
From Table 2, we find that both our model and the variants outperform the baseline model, which does not learn an aligned structure (ours w/o adv). This shows that the aligned structure learned by these models improves on the baseline. On the other hand, our original model generally performs better than the other variants. This validates the effectiveness of the adversarial learning in our model.
4.5 Evaluation on Adversarial Learning
In this section, we present some further evaluations of the adversarial learning in our model. In our model, we currently use the adversarial loss to align the features generated from attributes towards the features generated from images. We refer to this as generation from attributes to images (A2Img). This is natural because our model generates the concepts for the attribute. However, from a model design point of view, it is also plausible to modify our model to an “Img2A” style by simply exchanging the positions of the image concept and the attribute concept in Eq. (5). Similarly, we can obtain a simple combination of both A2Img and Img2A. We provide comparative results in Table 3. In addition, we also show the results when our model directly generates images instead of generating the concept for the attribute.
A2Img vs. Img2A. We find that Img2A is also an effective structure alignment method, which even outperforms A2Img on the PETA dataset. On the larger datasets, Market and Duke, however, A2Img performs better. The reason may be that PETA has fewer but more complicated images than the other datasets. Thus it is harder to learn semantically discriminative image features on PETA, and aligning images with attributes provides more discriminative information. When images are abundant, in contrast, estimating the manifold of images from the training data is more reliable than estimating that of attributes, as semantic IDs are much fewer than images, and thus A2Img performs better.
On the other hand, we find that the bidirectional alignment brings no improvement over a single direction. As both the image feature generator and the attribute feature generator try to fool the discriminator, the situation can be regarded as one with a powerful generator and a discriminator that is the underdog in the adversarial game. As illustrated in previous GAN works, a gap in ability between generator and discriminator harms the generative process; we conjecture that this is why our bidirectional model performs worse.
Alignment in feature space or in image space. We further study the effectiveness of generating the aligned structure in the feature space rather than in the low-level image space. To study this point, we use a conditional GAN to generate fake images, which have an aligned structure with real images, from our semantic attributes and a random noise input. After several training epochs, we add the semantic ID classification objective as in our original model. We use the cosine distance between features from the penultimate layer of our semantic ID classification ResNet as the affinity between fake and real images, which is the same setting as in our basic experiments.
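The cosine-distance affinity mentioned above can be sketched as a simple ranking routine. The function names and the two-dimensional toy features are hypothetical; real features would come from the penultimate layer of the network.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_gallery(query_feature, gallery_features):
    """Gallery indices sorted from the closest to the farthest feature."""
    return sorted(range(len(gallery_features)),
                  key=lambda i: cosine_distance(query_feature, gallery_features[i]))

# Hypothetical query feature and three gallery features.
query_feature = [1.0, 0.0]
gallery = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
order = rank_gallery(query_feature, gallery)
print(order)
```

Since the distance depends only on the angle between vectors, the ranking is unaffected by the overall scale of the features.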
The code we use to generate fake images is from Reed et al., where we have modified some input dimensions and added some convolution and deconvolution layers to fit our setting. We train the generative model for 200 epochs, and then add the classification loss for another 200 epochs of training, where the tradeoff parameter of the generative loss is also set to 0.001 as in our settings.
We find that the retrieval performance is unsatisfactory, as shown in Table 3. For our problem, the process of adding style noise to generate noisy low-level images and then eliminating the noise to extract discriminative features is redundant and error-prone. Thus generating the aligned structure directly in the discriminative feature space is more effective.
Parameter Evaluation. We study the effect of the tradeoff parameters $\lambda_G$ and $\lambda_D$, the weights of the generator and discriminator losses. We conducted the experiments on the Duke Attribute dataset, and the results are shown in Figure 3. We first vary one parameter with the other fixed to 1; the result is shown in the left figure. Then we fix the first parameter to its best value and vary the second, which gives the results in the right figure. The best $\lambda_G$ and $\lambda_D$ are 0.001 and 0.5, respectively. We find that the values of $\lambda_G$ and $\lambda_D$ that achieve satisfactory results are usually less than 1, which is the weight we pre-defined for the two semantic classification losses on the image and attribute concepts. This indicates that the sample-level alignment provided by the two classification losses is still predominant in the learning process. Moreover, in the adversarial training strategy, the loss of the discriminator has a much larger weight than that of the generator (0.5 vs. 0.001). This indicates that the sample-level alignment step may already give the generator some ability to generate structure-aligned features, so that during the competitive process the discriminator is given a larger weight to become more powerful at distinguishing the two modalities.
In this paper, we formulate attribute-based Re-ID as a joint semantically discriminative space learning problem for the first time. To address the heterogeneous gap, our model adopts an adversarial training strategy that models the cross-modality features in a joint distribution structure, while at the same time keeping semantic consistency across modalities. Our method effectively reduces the heterogeneous gap across modalities in terms of both discriminability and distribution structure. Experiments on three pedestrian attribute datasets show that our model is better suited to the attribute-based Re-ID task and learns more discriminative cross-modality features than a wide range of previous methods.
-  E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In CVPR, 2015.
-  G. Andrew, R. Arora, K. Livescu, and J. Bilmes. Deep canonical correlation analysis. In ICML, 2013.
-  Y. Deng, P. Luo, C. Loy, and X. Tang. Pedestrian attribute recognition at far distance. In ACM MM, 2014.
-  S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person reidentification. In PR, 2015.
-  A. Eisenschtat and L. Wolf. Linking image and text with 2-way nets. In CVPR, 2017.
-  T. Eric, H. Judy, S. Kate, and D. Trevor. Adversarial discriminative domain adaptation. In CVPR, 2017.
-  F. Feng, X. Wang, and R. Li. Cross-modal retrieval with correspondence autoencoder. In ACMMM, 2014.
-  Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
-  J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester, 2014.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  H. Hotelling. Relations between two sets of variates. Biometrika, 28(3):321–377, 1936.
-  A. Karpathy and F.-F. Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
-  D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  R. Kiros, R. Salakhutdinov, and R. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
-  M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
-  N. Kumar, P. Belhumeur, and S. N. Facetracer. A search engine for large collections of images with faces. In ECCV, 2008.
-  R. Layne, T. Hospedales, and S. Gong. Person re-identification by attributes. In Proceedings of the British Machine Vision Conference, 2012.
-  R. Layne, T. Hospedales, and S. Gong. Re-id: Hunting attributes in the wild. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
-  R. Layne, T. M. Hospedales, and S. Gong. Towards person identification and re-identification with attributes. In In European Conference on Computer Vision, International Workshop on Reidentification, 2012.
-  R. Layne, T. M. Hospedales, and S. Gong. Attributes-Based Re-identification, pages 93–117. Springer London, London, 2014.
-  D. Li, X. Chen, and K. Huang. Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios. In Pattern Recognition (ACPR), 2015 3rd IAPR Asian Conference on,.
-  D. Li, N. Dimitrova, M. Li, and I. K. Sethi. Multimedia content processing through cross-modal association. In ACMmm, 2003.
-  W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
-  X. Li, W.-S. Zheng, X. Wang, T. Xiang, and S. Gong. Multi-scale learning for low-resolution person re-identification. In ICCV, 2015.
-  Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith. Learning locally-adaptive decision functions for person verification. In CVPR, 2013.
-  S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
-  S. Liao and S. Z. Li. Efficient psd constrained asymmetric metric learning for person re-identification. In ICCV, 2015.
-  Y. Lin, L. Zheng, Z. Zheng, Y. Wu, and Y. Yang. Improving person re-identification by attribute and identity learning. arXiv preprint arXiv:1703.07220, 2017.
-  G. Lisanti, I. Masi, and A. Del Bimbo. Matching people across camera views using kernel canonical correlation analysis. In ICDSC, 2014.
-  M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems 29, 2016.
-  M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
-  M. Long and J. Wang. Learning transferable features with deep adaptation networks. Feb. 2015.
-  B. Ma, Y. Su, and F. Jurie. Covariance descriptor based on bio-inspired features for person re-identification and face verification. Image and Vision Computing, 2014.
-  P. Mineiro and N. Karampatziakis. A randomized algorithm for cca. CoRR, 2014.
-  J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, 2011.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
-  W. Scheirer, N. Kumar, P. Belhumeur, and T. Boult. Multi-attribute spaces: Calibration for attribute fusion and similarity search. In CVPR, 2012.
-  B. Siddiquie, R. S. Feris, and L. Davis. Image ranking and retrieval based on multi-attribute queries. In CVPR, 2011.
-  J. Sivic, M. Everingham, and A. Zisserman. Person spotting: video shot retrieval for face sets. In International Conference on Image and Video Retrieval, 2005.
-  S. Li, T. Xiao, H. Li, et al. Person search with natural language description. In CVPR, 2017.
-  R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. In ACL, 2014.
-  C. Su, F. Yang, S. Zhang, Q. Tian, L. S. Davis, and W. Gao. Multi-task learning with low rank attribute embedding for person re-identification. In ICCV, pages 3739–3747, 2015.
-  C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian. Deep attributes driven multi-camera person re-identification. In ECCV, 2016.
-  B. Sun and K. Saenko. Deep coral: correlation alignment for deep domain adaptation. In ECCV Workshops, 2016.
-  E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, pages 4068–4076, 2015.
-  E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, pages 4068–4076, 2015.
-  E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, 2014.
-  D. A. Vaquero, R. S. Feris, D. Tran, L. Brown, A. Hampapur, and M. Turk. Attribute-based people search in surveillance environments. In WACV, 2009.
-  K. Wang, R. He, L. Wang, W. Wang, and T. Tan. Joint feature selection and subspace learning for cross-modal retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2010–2023, Oct 2016.
-  L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
-  W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation learning. In ICML, pages 1083–1092, 2015.
-  X. Wang, W.-S. Zheng, X. Li, and J. Zhang. Cross-scenario transfer person re-identification. IEEE TCSVT, 2016.
-  Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan. Cross-modal retrieval with cnn visual features: A new baseline. IEEE Transactions on Cybernetics, 47(2):449–460, 2017.
-  S. Wu, Y. C. Chen, X. Li, A. Wu, J. J. You, and W.-S. Zheng. An enhanced deep feature representation for person re-identification. In WACV, 2016.
-  T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
-  F. Yan and K. Mikolajczyk. Deep correlation for matching images and text. In CVPR, 2015.
-  Y. Yang, Z. Lei, S. Zhang, H. Shi, and S. Z. Li. Metric embedded discriminative vocabulary learning for high-level person representation. In AAAI, 2016.
-  Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li. Salient color names for person re-identification. In ECCV, 2014.
-  J. You, A. Wu, X. Li, and W.-S. Zheng. Top-push video-based person re-identification. In CVPR, 2016.
-  X. Zhai, Y. Peng, and J. Xiao. Heterogeneous metric learning with joint graph regularization for cross-media retrieval. In AAAI, 2013.
-  X. Zhai, Y. Peng, and J. Xiao. Learning cross-media joint representation with sparse and semisupervised regularization. IEEE Transactions on Circuits and Systems for Video Technology, 24(6):965–978, June 2014.
-  R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, 2014.
-  W.-S. Zheng, S. Gong, and T. Xiang. Reidentification by relative distance comparison. IEEE TPAMI, 2013.
-  W.-S. Zheng, S. Gong, and T. Xiang. Towards open-world person re-identification by one-shot group-based verification. IEEE TPAMI, 2016.
-  W.-S. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong. Partial person re-identification. In ICCV, 2015.