Multi-task Mid-level Feature Alignment Network for Unsupervised Cross-Dataset Person Re-Identification

Abstract

Most existing person re-identification (Re-ID) approaches follow a supervised learning framework in which a large number of labelled matching pairs are required for training. Such a setting severely limits their scalability in real-world applications where no labelled samples are available during the training phase. To overcome this limitation, we develop a novel unsupervised Multi-task Mid-level Feature Alignment (MMFA) network for the unsupervised cross-dataset person re-identification task. Under the assumption that the source and target datasets share the same set of mid-level semantic attributes, our proposed model can be jointly optimised on the person identity classification and attribute learning tasks with a cross-dataset mid-level feature alignment regularisation term. In this way, the learned feature representation can be better generalised from one dataset to another, which further improves person re-identification accuracy. Experimental results on four benchmark datasets demonstrate that our proposed method outperforms state-of-the-art baselines.

Shan Lin (shan.lin@warwick.ac.uk)1, Haoliang Li (lihaoliang@ntu.edu.sg)2, Chang-Tsun Li (chli@csu.edu.au)3, Alex Chichung Kot (eackot@ntu.edu.sg)2

1 Department of Computer Science, University of Warwick, United Kingdom
2 Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore
3 School of Computing & Mathematics, Charles Sturt University, Australia

1 Introduction

Person Re-identification (Re-ID) is the problem of identifying a re-appearing person across a non-overlapping multi-camera surveillance system. Two primary tasks in person Re-ID are learning the subjects’ features and developing new similarity measurements, both of which should be invariant to viewpoint, pose, illumination and occlusion. Due to its potential applications in security and surveillance, person Re-ID has received substantial attention from both academia and industry. As a result, person Re-ID performance on existing datasets has improved significantly in recent years. For example, the Rank-1 accuracy of a single-query search on the Market1501 dataset [Zheng et al.(2015)Zheng, Shen, Tian, Wang, Wang, and Tian] has been pushed from 44.4% [Zheng et al.(2015)Zheng, Shen, Tian, Wang, Wang, and Tian] to 91.2% [Li et al.(2018b)Li, Zhu, and Gong]. The Rank-1 accuracy on the DukeMTMC-reID dataset [Zheng et al.(2017)Zheng, Zheng, and Yang], released in 2017, has quickly improved from 30.8% [Liao et al.(2015)Liao, Hu, Xiangyu Zhu, Li, Zhu, and Li] to 81.8% [Si et al.(2018)Si, Zhang, Li, Kuen, Kong, Kot, and Wang]. However, most of these approaches follow supervised learning frameworks which require a large number of manually labelled images. In real-world person Re-ID deployment, a typical video surveillance system usually consists of over one hundred cameras. Manually labelling data from all those cameras is prohibitively expensive. This limited scalability severely hinders the applicability of existing supervised Re-ID approaches in real-world scenarios.

One solution to make a person Re-ID model scalable is to design an unsupervised algorithm for unlabelled data. In recent years, some unsupervised methods have been proposed to extract view-invariant features and measure the similarity of images without label information [Wang et al.(2016)Wang, Zhu, Xiang, and Gong, Kodirov et al.(2015)Kodirov, Xiang, and Gong, Wang et al.(2014)Wang, Gong, and Xiang, Yu et al.(2017)Yu, Wu, and Zheng]. These approaches only analyse the unlabelled datasets and generally yield poor person Re-ID performance due to the lack of strong supervised tuning and optimisation. Another approach to the scalability issue of Re-ID is unsupervised transfer learning via a domain adaptation strategy. Unsupervised domain adaptation methods leverage labelled data in one or more related source datasets (also known as source domains) to learn models for unlabelled data in a target domain. However, most domain adaptation frameworks [Long et al.(2015)Long, Cao, Wang, and Jordan, Long et al.(2017)Long, Zhu, Wang, and Jordan] assume that the source domain and target domain contain the same set of class labels. Such an assumption does not hold for person Re-ID because different Re-ID datasets usually contain completely different sets of persons (classes). Therefore, most unsupervised cross-dataset Re-ID methods proposed in recent years [Peng et al.(2016)Peng, Xiang, Wang, Pontil, Gong, Huang, and Tian, Wang et al.(2018)Wang, Zhu, Gong, and Li, Deng et al.(2018)Deng, Zheng, Kang, Yang, Ye, and Jiao] do not use conventional domain adaptation mechanisms. For example, [Deng et al.(2018)Deng, Zheng, Kang, Yang, Ye, and Jiao] uses image-to-image translation to transfer the style of target-domain images onto source-domain images, generating a new training dataset. These newly generated samples, which inherit the identity labels of the source domain and the image style of the target domain, can be used for supervised person Re-ID learning.
[Wang et al.(2018)Wang, Zhu, Gong, and Li] trains two individual models, for identity classification and attribute recognition, and performs domain adaptation between the two models.

In our work, we rethink the assumption made for unsupervised cross-dataset Re-ID. Although the identity labels of the source and target datasets are non-overlapping, many mid-level semantic features of different people, such as gender, age group or the colour and texture of outfits, are commonly shared across different datasets. Hence, these mid-level visual attributes can be considered common labels between different datasets. If we assume these mid-level semantic features are shared between the different domains, we can treat unsupervised cross-dataset person Re-ID as domain adaptation transfer learning based on the mid-level semantic features from the source domain to the target domain. We therefore propose a Multi-task Mid-level Feature Alignment network (MMFA) which can simultaneously learn the feature representation from the source dataset and perform domain adaptation to the target dataset by aligning the distributions of the mid-level features. The contributions of our MMFA model are summarised below:

  • We propose a novel unsupervised cross-dataset domain adaptation framework for person Re-ID which minimises the distribution variation of the source’s and the target’s mid-level features based on the Maximum Mean Discrepancy (MMD) distance [Gretton et al.(2009)Gretton, Fukumizu, Harchaoui, and Sriperumbudur]. Due to the low dimensionality of the attribute annotations, we also include mid-level feature maps in our deep neural network as additional latent attributes to capture a more complete representation of the mid-level features of each domain. In our experiments, the proposed MMFA method surpasses other state-of-the-art unsupervised models on four popular benchmark datasets.

  • Existing unsupervised domain adaptation Re-ID approaches based on deep learning [Wang et al.(2018)Wang, Zhu, Gong, and Li, Deng et al.(2018)Deng, Zheng, Kang, Yang, Ye, and Jiao] require a two-stage learning process: supervised feature learning followed by unsupervised domain adaptation. In contrast, our MMFA model introduces a joint training structure which simultaneously learns the feature representation from the source domain and adapts it to the target domain in a single training process. Because our model does not require a two-step training procedure, its training time is much shorter than that of many other unsupervised deep learning person Re-ID approaches.

2 Related Work

Most existing person Re-ID models are supervised approaches focusing on feature engineering [Gray and Tao(2008), Liao et al.(2015)Liao, Hu, Xiangyu Zhu, Li, Zhu, and Li, Zhao et al.(2014)Zhao, Ouyang, and Wang], distance metric development [Kostinger et al.(2012)Kostinger, Hirzer, Wohlhart, Roth, and Bischof, Paisitkriangkrai et al.(2015)Paisitkriangkrai, Shen, and van den Hengel, Liao et al.(2015)Liao, Hu, Xiangyu Zhu, Li, Zhu, and Li] or new deep learning architectures [Ahmed et al.(2015)Ahmed, Jones, and Marks, Lin and Li(2017), Li et al.(2018b)Li, Zhu, and Gong]. However, in real-world person Re-ID deployment, supervised methods suffer from poor scalability due to the lack of identity labels for each camera pair. Therefore, some unsupervised person Re-ID methods have been developed based on hand-crafted features learned from a single unlabelled dataset [Zhao et al.(2013)Zhao, Ouyang, and Wang, Wang et al.(2014)Wang, Gong, and Xiang, Kodirov et al.(2015)Kodirov, Xiang, and Gong]. However, due to the absence of pairwise identity labels, these unsupervised methods cannot learn robust cross-view discriminative features and usually yield much weaker performance than supervised learning approaches.

Because of the poor person Re-ID performance of single-dataset unsupervised learning, many recent works focus on developing cross-dataset domain adaptation methods for a scalable person Re-ID system [Layne et al.(2013)Layne, Hospedales, and Gong, Ma et al.(2015)Ma, Jiawei Li, Yuen, and Ping Li, Peng et al.(2016)Peng, Xiang, Wang, Pontil, Gong, Huang, and Tian, Wang et al.(2018)Wang, Zhu, Gong, and Li]. These approaches leverage pre-trained supervised Re-ID models and adapt them to the target dataset. Early cross-dataset person Re-ID domain adaptation approaches rely on weak label information in the target dataset [Layne et al.(2013)Layne, Hospedales, and Gong, Ma et al.(2015)Ma, Jiawei Li, Yuen, and Ping Li] and can therefore only be considered semi-supervised or weakly supervised learning. Recent cross-dataset works such as UMDL [Peng et al.(2016)Peng, Xiang, Wang, Pontil, Gong, Huang, and Tian], SPGAN [Deng et al.(2018)Deng, Zheng, Kang, Yang, Ye, and Jiao] and TJ-AIDL [Wang et al.(2018)Wang, Zhu, Gong, and Li] do not require any labelled information from the target dataset and can be considered fully unsupervised cross-dataset domain adaptation learning. The UMDL method transfers a view-invariant feature representation via multi-task dictionary learning on both the source and target datasets. The SPGAN approach uses a generative adversarial network (GAN) to generate a new training dataset by transferring the image style of the target dataset onto the source dataset while preserving the source identity information; supervised training on the translated dataset is thus automatically adapted to the target domain. The TJ-AIDL approach individually trains two models: an identity classification model and an attribute recognition model.
The domain adaptation in TJ-AIDL is achieved by minimising the distance between the attributes inferred from the identity classification model and the attributes predicted by the attribute recognition model. Compared to previous single-dataset unsupervised approaches, the recent cross-dataset unsupervised domain adaptation methods yield much better performance. Our work improves upon these cross-dataset unsupervised methods by introducing a more intuitive domain adaptation mechanism and proposing a novel joint training framework for simultaneous feature learning and domain adaptation.

3 The Proposed Methodology

One basic assumption behind domain adaptation is that there exists a feature space underlying both the source and the target domains. Although high-level information such as a person’s identity is not shared between different Re-ID datasets, mid-level features such as visual attributes can overlap between persons. For example, the people in dataset A and dataset B may be different, but some mid-level semantic information such as gender, age group, clothing colour or accessories may be the same. Hence, in our proposed MMFA method, we assume that the source and target datasets share the same set of mid-level attribute labels. As a result, unsupervised cross-dataset person Re-ID can be transformed into an unsupervised domain adaptation problem by regularising the distribution variance of the attribute feature space between the source domain and the target domain.

Currently, attribute annotations are available for a few Re-ID datasets, but the number of attribute labels is limited: 27 for the Market1501 dataset and 23 for the DukeMTMC-reID dataset [Lin et al.(2017)Lin, Zheng, Zheng, Wu, and Yang]. Features from 27 or 23 user-defined attributes alone cannot give a good representation of the overall mid-level semantic features of both the source and target datasets; many shared mid-level visual cues between domains may not be fully captured by those 27/23 user-defined annotations. To obtain more attributes for representing the shared mid-level features, we consider the feature maps generated by the different convolutional layers. In our experiments, we observed that most feature maps from the last convolutional layer of an attribute-identity multi-task classification model capture distinctive semantic features of a person (see Figure 1 for examples). Hence, we treat these feature maps as attribute-like mid-level deep features in our proposed MMFA model.

(a) Person ID 0585 (b) Person ID 0646 (c) Person ID 1091
Figure 1: In each of the three pairs of images, the one on the left-hand side is randomly selected from the Market1501 dataset, while the other shows the attention regions of the highest-activated feature maps of the last convolutional layer. These feature maps highlight distinctive semantic features such as green shorts, a red backpack or a red T-shirt. Best viewed in colour.

3.1 Architecture

Figure 2: In the proposed MMFA model, the source and target images pass through two networks with shared weights. Global max-pooling extracts the most responsive activations from the feature maps of the last convolutional layer and feeds them into independent softmax classifiers for classifying the identity and attributes of the person. In order to generalise the feature representation to the target dataset, we also regularise the network by aligning the distributions of the pre-defined attributes and the mid-level deep features between the source and target domains.

Our model is optimised using the stochastic gradient descent (SGD) method on mini-batches. Each mini-batch consists of $n_s$ labelled images from a source dataset $S$ and $n_t$ unlabelled images from a target dataset $T$. Each labelled image $x^s_i$ is associated with an identity label $y_i$ and a set of $M$ attribute labels $a_i = (a_{i,1}, \dots, a_{i,M})$. Our model consists of a pre-trained ResNet50-based backbone network [He et al.(2016)He, Zhang, Ren, and Sun] as the feature extractor, with one fully connected layer for identity classification and $M$ individual fully connected layers for single-attribute recognition. An overview of our architecture is shown in Figure 2. We change the last average pooling layer of ResNet50 to a global max-pooling (GMP) layer to emphasise the semantic regions in the feature maps of the last convolutional layer.
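As a small illustration of this pooling choice (a NumPy sketch, not the actual implementation; the toy shapes below stand in for the much larger ResNet50 feature maps), global max-pooling keeps only each map's peak activation, which is what makes it sensitive to localised semantic regions:

```python
import numpy as np

def global_max_pool(feature_maps):
    """Global max pooling over the spatial dimensions.

    feature_maps: array of shape (C, H, W), the C feature maps of the
    last convolutional layer. Returns a C-dimensional vector holding
    each map's peak activation, emphasising the most responsive spatial
    region of each map (unlike average pooling, which dilutes it).
    """
    return feature_maps.max(axis=(1, 2))

# Toy example: 3 feature maps of size 4x4 (illustrative shapes only).
fmaps = np.zeros((3, 4, 4))
fmaps[0, 1, 2] = 0.9   # a strong localised response, e.g. a red backpack
fmaps[1, 0, 0] = 0.4   # a weaker localised response
pooled = global_max_pool(fmaps)   # -> [0.9, 0.4, 0.0]
```

Average pooling over the same maps would return much smaller, spatially diluted values, which is why the GMP layer better preserves attribute-like responses.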

Let $f^s$ and $f^t$ denote the mid-level deep features of the source domain and the target domain obtained after the GMP layer, respectively. The identity features $f^s_{ID}$ and $f^t_{ID}$ are the outputs of the fully connected layer for identity classification (shown as ID-FC in Figure 2) with $f^s$ and $f^t$ as input. For the $m$-th attribute, where $m \in \{1, \dots, M\}$, the attribute features $f^s_m$ and $f^t_m$ are obtained from the corresponding fully connected layer (shown as Attr-FC-m in Figure 2) with $f^s$ and $f^t$ as input. Our model is jointly trained in a multi-task manner: two supervised classification losses for identity classification and attribute recognition, one adaptation loss based on the attribute features, and another adaptation loss based on the mid-level deep features.

3.2 Multi-task Supervised Classification for Feature Learning

The view-invariant feature representations are learned through multi-task identity and attribute classification training. The additional attribute annotations provide further regularisation and additional supervision to the feature learning process.

Identity Loss: Let $\tilde{p}(y_i \mid f^s_{ID,i})$ denote the predicted probability of the ground-truth identity label $y_i$ given the identity feature $f^s_{ID,i}$. The identity loss is the softmax cross-entropy:

$$\mathcal{L}_{ID} = -\frac{1}{n_s}\sum_{i=1}^{n_s} \log \tilde{p}\left(y_i \mid f^s_{ID,i}\right) \qquad (1)$$
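The identity loss above can be sketched in NumPy (an illustrative sketch, not the paper's code; the logits/labels names and toy values are ours):

```python
import numpy as np

def identity_loss(logits, labels):
    """Softmax cross-entropy averaged over the mini-batch.

    logits: (n_s, num_identities) raw scores from the ID-FC layer
    labels: (n_s,) ground-truth identity indices
    """
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # pick out the log-probability of each ground-truth identity
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy batch of 2 images over 3 identities, both confidently correct.
logits = np.array([[5.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0]])
labels = np.array([0, 1])
loss = identity_loss(logits, labels)   # small, since predictions match labels
```

Mispredicting (e.g. swapping the two labels) makes the same function return a much larger value, as expected of a cross-entropy.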

Attribute Loss: Let $\tilde{a}_{i,m}$ denote the predicted probability for the $m$-th attribute feature $f^s_{m,i}$ with ground-truth label $a_{i,m}$. The overall attribute loss is the average sigmoid cross-entropy loss over all $M$ attributes:

$$\mathcal{L}_{Attr} = -\frac{1}{M\, n_s}\sum_{m=1}^{M}\sum_{i=1}^{n_s}\Big[ a_{i,m} \log \tilde{a}_{i,m} + \left(1 - a_{i,m}\right) \log\left(1 - \tilde{a}_{i,m}\right) \Big] \qquad (2)$$
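A minimal NumPy sketch of the averaged sigmoid cross-entropy (illustrative only; the function and toy values are ours, not the paper's code):

```python
import numpy as np

def attribute_loss(attr_logits, attr_labels):
    """Sigmoid cross-entropy averaged over the batch and the M attributes.

    attr_logits: (n_s, M) raw scores, one column per attribute head
    attr_labels: (n_s, M) binary ground-truth attribute labels
    """
    p = 1.0 / (1.0 + np.exp(-attr_logits))               # sigmoid
    eps = 1e-12                                          # avoid log(0)
    bce = -(attr_labels * np.log(p + eps)
            + (1 - attr_labels) * np.log(1 - p + eps))
    return bce.mean()   # mean over both the batch and the M attributes

# Toy batch: 2 images, M = 2 attributes, predictions agree with labels.
logits = np.array([[4.0, -4.0],
                   [-4.0, 4.0]])
labels = np.array([[1, 0],
                   [0, 1]])
loss = attribute_loss(logits, labels)   # small, since predictions match
```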

3.3 MMD-based Regularisation for Mid-level Feature Alignment

As we make a shared mid-level latent space assumption in our MMFA model, domain adaptation can be achieved by reducing the distribution distance of the attribute features between the source domain and the target domain. Based on the attribute features $f^s_m$ and $f^t_m$ obtained from the supervised classification learning, we use the Maximum Mean Discrepancy (MMD) measure [Gretton et al.(2009)Gretton, Fukumizu, Harchaoui, and Sriperumbudur] to calculate the feature distribution distance of each attribute. The overall attribute distribution distance is the mean MMD distance over all $M$ attributes:

$$\mathcal{L}_{AA} = \frac{1}{M}\sum_{m=1}^{M}\left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi\!\left(f^s_{m,i}\right) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi\!\left(f^t_{m,j}\right) \right\|^2_{\mathcal{H}} \qquad (3)$$

Here $\phi(\cdot)$ is a mapping which projects the attribute features into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ [Gretton et al.(2008)Gretton, Borgwardt, Rasch, Scholkopf, and Smola], and $n_s$ and $n_t$ are the batch sizes of the source domain and target domain images. An arbitrary distribution of the attribute features can be represented using the kernel embedding technique [Smola et al.(2007)Smola, Gretton, Song, and Schölkopf]. It has been proven that if the kernel $k(\cdot, \cdot)$ is characteristic, then the mapping to the RKHS is injective [Sriperumbudur et al.(2009)Sriperumbudur, Fukumizu, Gretton, Lanckriet, and Schölkopf]. The injectivity indicates that an arbitrary probability distribution is uniquely represented by an element in the RKHS. Therefore, with the kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$ induced by $\phi$, the average MMD distance between the source domain's and the target domain's attribute distributions can be re-expressed as:

$$\mathcal{L}_{AA} = \frac{1}{M}\sum_{m=1}^{M}\left[ \frac{1}{n_s^2}\sum_{i=1}^{n_s}\sum_{i'=1}^{n_s} k\!\left(f^s_{m,i}, f^s_{m,i'}\right) + \frac{1}{n_t^2}\sum_{j=1}^{n_t}\sum_{j'=1}^{n_t} k\!\left(f^t_{m,j}, f^t_{m,j'}\right) - \frac{2}{n_s n_t}\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} k\!\left(f^s_{m,i}, f^t_{m,j}\right) \right] \qquad (4)$$

In our MMFA model, we use the well-known RBF characteristic kernel with bandwidth $\sigma$:

$$k(x, x') = \exp\!\left( -\frac{\left\| x - x' \right\|^2}{2\sigma^2} \right) \qquad (5)$$
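The kernel form of the MMD distance can be sketched in NumPy as follows (an illustrative sketch with a single RBF bandwidth; the sample sizes, dimensions and seed are toy choices of ours, not the paper's settings):

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel matrix between two sets of feature vectors."""
    sq = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    """Biased squared-MMD estimate: two self-similarity terms minus
    twice the cross-similarity term, as in the kernel expansion."""
    return (rbf_kernel(xs, xs, sigma).mean()
            + rbf_kernel(xt, xt, sigma).mean()
            - 2 * rbf_kernel(xs, xt, sigma).mean())

rng = np.random.default_rng(0)
# Two batches drawn from the SAME distribution -> MMD close to zero.
same = mmd2(rng.normal(0, 1, (64, 8)), rng.normal(0, 1, (64, 8)))
# Target batch shifted away from the source -> much larger MMD.
shifted = mmd2(rng.normal(0, 1, (64, 8)), rng.normal(3, 1, (64, 8)))
```

Minimising such an estimate over mini-batches is what drives the source and target feature distributions together during training.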

Due to the limited number of available attribute annotations, these attributes alone cannot give a good representation of all domain-shared mid-level features. By treating the feature maps after the feature extractor as attribute-like mid-level features, we introduce an additional mid-level deep feature alignment into our model. The mid-level deep feature adaptation loss is the MMD distance between the source and target mid-level deep features $f^s$ and $f^t$, analogous to the attribute feature adaptation loss:

$$\mathcal{L}_{DA} = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi\!\left(f^s_i\right) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi\!\left(f^t_j\right) \right\|^2_{\mathcal{H}} \qquad (6)$$

Finally, we formulate the overall loss function as the weighted sum of the above components $\mathcal{L}_{ID}$, $\mathcal{L}_{Attr}$, $\mathcal{L}_{AA}$ and $\mathcal{L}_{DA}$:

$$\mathcal{L} = \mathcal{L}_{ID} + \lambda_1 \mathcal{L}_{Attr} + \lambda_2 \mathcal{L}_{AA} + \lambda_3 \mathcal{L}_{DA} \qquad (7)$$

4 Experiments

4.1 Datasets and Settings

Person Re-ID Datasets: Four widely used person Re-ID benchmarks are chosen for experimental evaluation: Market1501, DukeMTMC-reID, VIPeR and PRID. The Market1501 dataset [Zheng et al.(2015)Zheng, Shen, Tian, Wang, Wang, and Tian] contains 32,668 images of 1,501 pedestrians; 751 identities are selected for training and the remaining 750 for testing. Each identity was captured by at most 6 non-overlapping cameras. The DukeMTMC-reID dataset [Zheng et al.(2017)Zheng, Zheng, and Yang] is a redesigned version of the pedestrian tracking dataset DukeMTMC [Ristani et al.(2016)Ristani, Solera, Zou, Cucchiara, and Tomasi] for the person Re-ID task. It contains 34,183 images of 1,404 pedestrians; 702 identities are used for training and the remaining 702 for testing. Each identity was captured by 8 non-overlapping cameras. The VIPeR dataset [Gray et al.(2007)Gray, Brennan, and Tao] is one of the oldest person Re-ID datasets. It contains 632 identities but only two images per identity. Due to its low resolution and large variations in illumination and viewpoint, VIPeR remains a very challenging dataset. The PRID dataset [Hirzer et al.(2011)Hirzer, Beleznai, Roth, and Bischof] consists of 934 identities from two camera views: 385 identities in View A and 749 in View B, of which only 200 appear in both views.

Evaluation Protocol: We follow the standard single-query evaluation protocols for Market1501 and DukeMTMC-reID [Zheng et al.(2015)Zheng, Shen, Tian, Wang, Wang, and Tian, Zheng et al.(2017)Zheng, Zheng, and Yang]. For the VIPeR dataset, we randomly split the dataset in half into training and testing sets; the overall performance on VIPeR is averaged over 10 random 50/50 splits. For the PRID dataset, we follow the same single-shot experiment setting as [Zhang et al.(2016)Zhang, Xiang, and Gong]; as with VIPeR, the final performance is averaged over 10 random splits. Since the VIPeR and PRID datasets are too small for training a deep network, our MMFA model is trained only on the Market1501 or DukeMTMC-reID datasets. We adopt the commonly used Cumulative Matching Characteristic (CMC) curve and mean Average Precision (mAP) as performance metrics.
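The two metrics can be sketched as follows (a simplified NumPy sketch, assuming a single-gallery-shot setting and omitting the camera-ID filtering used by the full Market1501 protocol; the toy distance matrix is ours):

```python
import numpy as np

def cmc_and_map(dist, query_ids, gallery_ids, topk=10):
    """Compute the CMC curve and mean Average Precision.

    dist: (num_query, num_gallery) distance matrix
    query_ids / gallery_ids: integer identity labels
    Assumes every query has at least one correct match in the gallery.
    """
    cmc = np.zeros(topk)
    aps = []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                       # nearest first
        matches = (gallery_ids[order] == query_ids[q]).astype(float)
        first_hit = int(np.argmax(matches))               # rank of first correct match
        if first_hit < topk:
            cmc[first_hit:] += 1                          # hit at rank first_hit or better
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    return cmc / dist.shape[0], float(np.mean(aps))

# Toy example: 2 queries, 4 gallery images.
dist = np.array([[0.1, 0.9, 0.5, 0.7],
                 [0.8, 0.6, 0.2, 0.4]])
query_ids = np.array([0, 1])
gallery_ids = np.array([0, 1, 0, 1])
cmc, mAP = cmc_and_map(dist, query_ids, gallery_ids, topk=4)
```

Here the first query is matched at rank 1 while the second is matched at rank 2, so Rank-1 accuracy is 0.5 and Rank-2 accuracy is 1.0.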

Implementation Details: The input images are randomly cropped and resized to (256, 128, 3). All fully-connected layers after the global max-pooling layer are equipped with batch normalisation, a dropout rate of 0.5 and the leaky ReLU activation function. The weights $\lambda_1$, $\lambda_2$ and $\lambda_3$ in the final loss function (Equation 7) are fixed empirically. For all the adaptation losses, we adopt the mixture kernel strategy [Li et al.(2015)Li, Swersky, and Zemel, Li et al.(2018a)Li, Wang, and Kot] by averaging RBF kernels over a range of bandwidths $\sigma$. We use the stochastic gradient descent (SGD) optimizer with Nesterov momentum and weight decay, with a batch size of 32 for both the source domain and target domain images; the learning rate decreases by a factor of 10 after a fixed number of epochs. The person Re-ID evaluation on the target domain is based on the distance between the 2048-D mid-level deep features after the global max-pooling layer.
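The mixture kernel strategy can be sketched as below (a NumPy illustration; the bandwidth values are placeholder assumptions of ours, not the paper's actual list):

```python
import numpy as np

def mix_rbf_kernel(x, y, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Average of RBF kernels over several bandwidths.

    Averaging over a set of sigmas removes the need to tune a single
    bandwidth: some component of the mixture is always at a useful
    scale for the current feature distances.
    """
    sq = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    return np.mean([np.exp(-sq / (2 * s ** 2)) for s in sigmas], axis=0)

# Identical inputs give kernel value 1; similarity decays with distance.
k_same = mix_rbf_kernel(np.zeros((1, 3)), np.zeros((1, 3)))[0, 0]
k_far = mix_rbf_kernel(np.zeros((1, 3)), 5.0 * np.ones((1, 3)))[0, 0]
```

This mixture kernel can be dropped into the MMD estimate in place of the single-bandwidth RBF kernel.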

4.2 Comparisons with State-of-the-art Methods

The performance of our proposed MMFA model is extensively compared with 16 state-of-the-art unsupervised person Re-ID methods, as shown in Table 1. These methods include: view-invariant feature learning methods SDALF [Farenzena et al.(2010)Farenzena, Bazzani, Perina, Murino, and Cristani] and CPS [Cheng et al.(2011)Cheng, Cristani, Stoppa, Bazzani, and Murino], graph learning method GL [Kodirov et al.(2016)Kodirov, Xiang, Fu, and Gong], sparse ranking method ISR [Lisanti et al.(2015)Lisanti, Masi, Bagdanov, and Del Bimbo], salience learning methods GTS [Wang et al.(2014)Wang, Gong, and Xiang] and SDC [Zhao et al.(2017)Zhao, Oyang, and Wang], neighbourhood clustering methods AML [Ye et al.(2007)Ye, Zhao, and Liu], UsNCA [Qin et al.(2015)Qin, Song, Huang, and Zhu], CAMEL [Yu et al.(2017)Yu, Wu, and Zheng] and PUL [Fan et al.(2017)Fan, Zheng, and Yang], ranking SVM method AdaRSVM [Ma et al.(2015)Ma, Jiawei Li, Yuen, and Ping Li], attribute co-training method SSDAL [Su et al.(2016)Su, Zhang, Xing, Gao, and Tian], dictionary learning methods DLLR [Kodirov et al.(2015)Kodirov, Xiang, and Gong] and UMDL [Peng et al.(2016)Peng, Xiang, Wang, Pontil, Gong, Huang, and Tian], id-to-attribute transfer method TJ-AIDL [Wang et al.(2018)Wang, Zhu, Gong, and Li] and image style transfer method SPGAN [Deng et al.(2018)Deng, Zheng, Kang, Yang, Ye, and Jiao]. These methods can be categorised into three groups:

  1. hand-crafted feature approaches: SDALF, CPS, DLLR, GL, ISR, GTS, SDC

  2. clustering approaches: AML, UsNCA, CAMEL, PUL

  3. domain adaptation approaches: AdaRSVM, UMDL, SSDAL, TJ-AIDL, SPGAN

Dataset VIPeR PRID Market1501 DukeMTMC-reID
Metric (%) Rank-1 Rank-1 Rank-1 mAP Rank-1 mAP
SDALF [Farenzena et al.(2010)Farenzena, Bazzani, Perina, Murino, and Cristani] 19.9 16.3 - - - -
CPS [Cheng et al.(2011)Cheng, Cristani, Stoppa, Bazzani, and Murino] 22.0 - - - - -
DLLR [Kodirov et al.(2015)Kodirov, Xiang, and Gong] 29.6 21.1 - - - -
GL [Kodirov et al.(2016)Kodirov, Xiang, Fu, and Gong] 33.5 25.0 - - - -
ISR [Lisanti et al.(2015)Lisanti, Masi, Bagdanov, and Del Bimbo] 27.0 17.0 40.3 14.3 - -
GTS [Wang et al.(2014)Wang, Gong, and Xiang] 25.2 - - - - -
SDC [Zhao et al.(2017)Zhao, Oyang, and Wang] 25.8 - - - - -
AML [Ye et al.(2007)Ye, Zhao, and Liu] 23.1 - 44.7 18.4 - -
UsNCA [Qin et al.(2015)Qin, Song, Huang, and Zhu] 24.3 - 45.2 18.9 - -
CAMEL [Yu et al.(2017)Yu, Wu, and Zheng] 30.9 - 54.5 26.3 - -
PUL [Fan et al.(2017)Fan, Zheng, and Yang] - - 44.7 20.1 30.4 16.4
AdaRSVM [Ma et al.(2015)Ma, Jiawei Li, Yuen, and Ping Li] 10.9 4.9 - - - -
UMDL [Peng et al.(2016)Peng, Xiang, Wang, Pontil, Gong, Huang, and Tian] 31.5 24.2 - - - -
SSDAL [Su et al.(2016)Su, Zhang, Xing, Gao, and Tian] 37.9 20.1 39.4 19.6 - -
TJ-AIDLDuke [Wang et al.(2018)Wang, Zhu, Gong, and Li] 35.1 34.8 58.2 26.5 - -
SPGANDuke [Deng et al.(2018)Deng, Zheng, Kang, Yang, Ye, and Jiao] - - 51.1 22.8 - -
TJ-AIDLMarket [Wang et al.(2018)Wang, Zhu, Gong, and Li] 38.5 26.8 - - 44.3 23.0
SPGANMarket [Deng et al.(2018)Deng, Zheng, Kang, Yang, Ye, and Jiao] - - - - 41.1 22.3
MMFADuke 36.3 34.5 56.7 27.4 - -
MMFAMarket 39.1 35.1 - - 45.3 24.7
Table 1: Performance comparisons with state-of-the-art unsupervised person Re-ID methods. The best and second-best results are highlighted in bold and underline, respectively. The superscripts Duke and Market indicate the source dataset on which the model is trained.

Our MMFA method outperforms most existing state-of-the-art models on the VIPeR, PRID, Market1501 and DukeMTMC-reID datasets. The Rank-1 accuracy increases from 38.5% to 39.1% on VIPeR, from 34.8% to 35.1% on PRID and from 44.3% to 45.3% on DukeMTMC-reID. The mAP of our approach surpasses all existing methods by a good margin, from 23.0% to 24.7% on DukeMTMC-reID and from 26.5% to 27.4% on Market1501, respectively. Although the Rank-1 accuracy of our MMFA model on the Market1501 dataset does not surpass the TJ-AIDL method, our mAP score and overall performance (Rank-5 to Rank-10 accuracy) are better than TJ-AIDL's. The complete comparisons with TJ-AIDL and SPGAN are shown in Table 2. It is worth noting that the performance of our MMFA is achieved in a single end-to-end training session of only 25 epochs. Our performance could be further improved by pre- and post-processing techniques such as part-based local max pooling (LMP), attention mechanisms or re-ranking. For fair comparison, the results shown in Table 1 and Table 2 are all based on the basic models without any pre- or post-processing.

Source→Target Market1501→DukeMTMC-reID DukeMTMC-reID→Market1501
Metric (%) Rank1 Rank5 Rank10 mAP Rank1 Rank5 Rank10 mAP
SPGAN 41.1 56.6 63.0 22.3 51.5 70.1 76.8 22.8
TJ-AIDL 44.3 59.6 65.0 23.0 58.2 74.8 81.1 26.5
MMFA 45.3 59.8 66.3 24.7 56.7 75.0 81.8 27.4
Table 2: Detailed comparison with SPGAN and TJ-AIDL.

4.3 Component Analysis and Evaluation

We also analyse each component of our MMFA model based on its contribution to cross-domain feature learning. The first set of experiments measures the unsupervised performance of feature representations learned from the source domain attributes or identities, without any domain adaptation. As the top section of Table 3 shows, attribute annotations alone cannot give a good representation of a person due to their low dimensionality, achieving only 6.4% and 19.2% Rank-1 accuracy. Features learned from identity labels, on the other hand, yield much better performance. When attribute and identity information are jointly trained as a multi-objective learning task, the feature representations show better generalisation ability. This experiment shows that the attribute annotations provide extra information which serves as additional supervision for learning more generalised cross-dataset features.

Source→Target Market1501→DukeMTMC-reID DukeMTMC-reID→Market1501
Metric (%) Rank1 Rank5 Rank10 mAP Rank1 Rank5 Rank10 mAP
Attribute Only 6.4 14.4 18.6 2.3 19.2 34.8 45.1 6.2
ID Only 37.6 54.9 61.6 22.6 48.2 66.1 73.3 21.6
Attribute+ID Only 41.7 57.5 63.6 23.3 52.2 69.1 75.7 23.5
Attribute with Attribute Feature Adaptation 15.8 26.0 48.2 5.7 35.5 55.3 64.0 12.7
ID with Mid-level Deep Feature Adaptation 42.1 57.7 63.9 24.3 53.4 70.2 76.4 25.2
Mid-level Deep Feature + Attribute Adaptation 45.3 59.8 66.3 24.7 56.7 75.0 81.8 27.4
Table 3: Adaptation performance on each model components

The lower section of Table 3 shows the unsupervised Re-ID performance after aligning the mid-level feature distributions. After aligning the source and target distributions of the attribute features, the mid-level deep features, or both, we see a large performance increase compared with the non-adapted features. This shows that the proposed mid-level feature distribution alignment strategy is a feasible approach for the unsupervised person Re-ID task.

5 Conclusion

In this paper, we presented MMFA, a novel unsupervised cross-dataset feature learning and domain adaptation framework for the person Re-ID task. We utilise multi-task identity and attribute classification to learn discriminative features on the labelled source dataset. Under a shared mid-level feature space assumption, we proposed a mid-level feature alignment domain adaptation strategy which reduces the MMD distance between the source domain's and the target domain's mid-level feature distributions. In contrast to most existing learn-then-adapt unsupervised cross-dataset approaches, our MMFA is a one-step learn-and-adapt method which simultaneously learns the feature representation and adapts it to the target domain in a single end-to-end training procedure, while still outperforming a wide range of state-of-the-art unsupervised Re-ID methods.

References

  1. Ejaz Ahmed, Michael Jones, and Tim K Marks. An Improved Deep Learning Architecture for Person Re-Identification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  2. Dong Seon Cheng, Marco Cristani, Michele Stoppa, Loris Bazzani, and Vittorio Murino. Custom Pictorial Structures for Re-identification. In British Machine Vision Conference (BMVC), 2011.
  3. Weijian Deng, Liang Zheng, Guoliang Kang, Yi Yang, Qixiang Ye, and Jianbin Jiao. Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  4. Hehe Fan, Liang Zheng, and Yi Yang. Unsupervised Person Re-identification: Clustering and Fine-tuning. In arXiv preprint, 2017.
  5. M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person Re-Identification by Symmetry-Driven Accumulation of Local Features. In Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  6. Doug Gray, Shane Brennan, and Hai Tao. Evaluating Appearance Models for Recognition, Reacquisition, and Tracking. In International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), 2007.
  7. Douglas Gray and Hai Tao. Viewpoint Invariant Pedestrian Recognition with an Ensemble of Localized Features. In European Conference on Computer Vision (ECCV), 2008.
  8. A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A Fast, Consistent Kernel Two-Sample Test. In Advances in Neural Information Processing Systems (NIPS), 2009.
  9. Arthur Gretton, Karsten Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A Kernel Method for the Two-Sample Problem. Journal of Machine Learning Research (JMLR), 2008.
  10. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  11. Martin Hirzer, Csaba Beleznai, Peter M Roth, and Horst Bischof. Person Re-identification by Descriptive and Discriminative Classification. In Scandinavian Conference on Image Analysis (SCIA), 2011.
  12. Elyor Kodirov, Tao Xiang, and Shaogang Gong. Dictionary Learning with Iterative Laplacian Regularisation for Unsupervised Person Re-identification. British Machine Vision Conference (BMVC), 2015.
  13. Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. Person Re-identification by Unsupervised L1 Graph Learning. In European Conference on Computer Vision (ECCV), 2016.
  14. Martin Kostinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Large Scale Metric Learning from Equivalence Constraints. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  15. Ryan Layne, Timothy M. Hospedales, and Shaogang Gong. Domain Transfer for Person Re-identification. In International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Stream (ARTEMIS), 2013.
  16. Haoliang Li, Shiqi Wang, and Alex C. Kot. Domain Generalization with Adversarial Feature Learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  17. Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious Attention Network for Person Re-Identification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  18. Yujia Li, Kevin Swersky, and Richard Zemel. Generative Moment Matching Networks. In International Conference on Machine Learning (ICML), 2015.
  19. Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person Re-identification by Local Maximal Occurrence Representation and Metric Learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  20. Shan Lin and Chang-tsun Li. End-to-End Correspondence and Relationship Learning of Mid-Level Deep Features for Person Re-Identification. In International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2017.
  21. Yutian Lin, Liang Zheng, Zhedong Zheng, Yu Wu, and Yi Yang. Improving Person Re-identification by Attribute and Identity Learning. In arXiv preprint, 2017.
  22. Giuseppe Lisanti, Iacopo Masi, Andrew D. Bagdanov, and Alberto Del Bimbo. Person Re-Identification by Iterative Re-Weighted Sparse Ranking. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
  23. Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning Transferable Features with Deep Adaptation Networks. In International Conference on Machine Learning (ICML), 2015.
  24. Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep Transfer Learning with Joint Adaptation Networks. In International Conference on Machine Learning (ICML), 2017.
  25. Andy J. Ma, Jiawei Li, Pong C. Yuen, and Ping Li. Cross-Domain Person Reidentification Using Domain Adaptation Ranking SVMs. Transactions on Image Processing (TIP), 2015.
  26. Sakrapee Paisitkriangkrai, Chunhua Shen, and Anton van den Hengel. Learning to Rank in Person Re-identification with Metric Ensembles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  27. Peixi Peng, Tao Xiang, Yaowei Wang, Massimiliano Pontil, Shaogang Gong, Tiejun Huang, and Yonghong Tian. Unsupervised Cross-Dataset Transfer Learning for Person Re-identification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  28. Chen Qin, Shiji Song, Gao Huang, and Lei Zhu. Unsupervised Neighborhood Component Analysis for Clustering. Neurocomputing, 2015.
  29. Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In ECCV Workshop on Benchmarking Multi-Target Tracking, 2016.
  30. Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang. Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  31. Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert Space Embedding for Distributions. In Advances in Neural Information Processing Systems (NIPS), 2007.
  32. Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Gert R. G. Lanckriet, and Bernhard Schölkopf. Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions. In Advances in Neural Information Processing Systems (NIPS), 2009.
  33. Chi Su, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Deep Attributes Driven Multi-camera Person Re-identification. In European Conference on Computer Vision (ECCV), 2016.
  34. Hanxiao Wang, Shaogang Gong, and Tao Xiang. Unsupervised Learning of Generative Topic Saliency for Person Re-identification. British Machine Vision Conference (BMVC), 2014.
  35. Hanxiao Wang, Xiatian Zhu, Tao Xiang, and Shaogang Gong. Towards Unsupervised Open-Set Person Re-identification. In International Conference on Image Processing (ICIP), 2016.
  36. Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. Transferable Joint Attribute-Identity Deep Learning for Unsupervised Person Re-Identification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  37. Jieping Ye, Zheng Zhao, and Huan Liu. Adaptive Distance Metric Learning for Clustering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
  38. Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. Cross-View Asymmetric Metric Learning for Unsupervised Person Re-Identification. In International Conference on Computer Vision (ICCV), 2017.
  39. Li Zhang, Tao Xiang, and Shaogang Gong. Learning a Discriminative Null Space for Person Re-identification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  40. Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Unsupervised Salience Learning for Person Re-identification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
  41. Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Learning Mid-level Filters for Person Re-identification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  42. Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Person Re-Identification by Saliency Learning. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
  43. Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable Person Re-identification: A Benchmark. In International Conference on Computer Vision (ICCV), 2015.
  44. Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled Samples Generated by GAN Improve the Person Re-identification Baseline in vitro. In International Conference on Computer Vision (ICCV), 2017.