Adaptive Exploration for Unsupervised Person Re-Identification


Yuhang Ding, Hehe Fan, Mingliang Xu and Yi Yang Y. Ding is with SUSTech-UTS Joint Centre of CIS, Southern University of Science and Technology, and the Centre for Artificial Intelligence, University of Technology Sydney (e-mail: dyh.ustc.uts@gmail.com). H. Fan and Y. Yang are with the Centre for Artificial Intelligence, University of Technology Sydney, Ultimo, NSW 2007, Australia (e-mail: hehe.fan@student.uts.edu.au; yee.i.yang@gmail.com). M. Xu is with the School of Information Engineering, Zhengzhou University, Zhengzhou, 450000, China (e-mail: iexumingliang@zzu.edu.cn).
Abstract

Due to domain bias, directly deploying a deep person re-identification (re-ID) model trained on one dataset often yields considerably poor accuracy on another dataset. In this paper, we propose an Adaptive Exploration (AE) method to address the domain-shift problem for re-ID in an unsupervised manner. Specifically, in addition to supervised training on the source dataset, the re-ID model is induced, in the target domain, to 1) maximize distances between all person images and 2) minimize distances between similar person images. In the first case, by treating each person image as an individual class, a non-parametric classifier with a feature memory is exploited to encourage person images to move away from each other. In the second case, according to a similarity threshold, our method adaptively selects neighborhoods in the feature space for each person image. By treating these similar person images as the same class, the non-parametric classifier forces them to stay closer. However, a problem with adaptive selection is that, when an image has too many neighborhoods, it is more likely to attract other images as its neighborhoods. As a result, a minority of images may select a large number of neighborhoods while a majority of images have only a few. To address this issue, we additionally integrate a balance strategy into the adaptive selection. Extensive experiments on large-scale re-ID datasets demonstrate the effectiveness of our method. Our code has been released at https://github.com/dyh127/Adaptive-Exploration-for-Unsupervised-Person-Re-Identification.

Index Terms: person re-identification, unsupervised learning, domain adaptation, deep learning

I Introduction

Person re-identification (re-ID) has attracted increasing attention because it plays an important role in security. Given an image of a person-of-interest captured by one camera, the goal of re-ID is to find that person in images from other cameras. In recent years, benefiting from deep convolutional neural networks (CNNs) and large-scale datasets, person re-ID has made great progress. However, due to differences in clothing styles, camera viewpoints, scenes, etc., deep re-ID models often suffer from the problem of domain shift. Specifically, when we directly apply a pre-trained deep re-ID model to an unseen dataset, it often achieves considerably poor accuracy.

To address the domain-shift problem, one can collect a large amount of labeled data from the target domain for supervised training. However, it is too expensive to collect and annotate a large-scale dataset for training a deep re-ID model. We could also collect a small amount of labeled data and a large amount of unlabeled data from the target domain for semi-supervised training [1, 2, 3]. However, when the number of target domains increases, this kind of method remains impractical. Therefore, a promising solution is to use only unlabeled data in target domains to train a deep re-ID model, which is referred to as unsupervised person re-ID [4, 5, 6, 7, 8, 9, 10, 11]. This paper is dedicated to unsupervised person re-ID.

There are two lines of unsupervised re-ID methods. The first line directly fine-tunes a deep CNN model (usually pre-trained on ImageNet [12]) on the unlabeled target data [13, 14, 15, 16, 6]. In this paper, we call this line target-only re-ID [4]. The second line additionally exploits labeled source data. For example, PUL [7] first fine-tunes the deep CNN model (pre-trained on ImageNet) on the labeled source dataset and then fine-tunes it on the unlabeled target dataset in an unsupervised manner. ECN [11] fine-tunes the deep CNN model on the labeled source dataset and the unlabeled target dataset simultaneously. In the community, the second line is usually referred to as domain adaptive re-ID [5, 6, 7, 8, 9, 17, 10, 11]. We propose an Adaptive Exploration (AE) method to improve the deep re-ID model on the target dataset. We mainly discuss AE under the domain adaptive re-ID protocol; however, our method also achieves competitive accuracy in the target-only re-ID setting.

To learn discriminative features in the target domain, AE is designed to train a re-ID model by maximizing distances between all person images and minimizing distances between similar person images. We leverage a non-parametric classifier, equipped with a feature memory, to optimize this objective. The feature memory stores the features of all target person images. After each iteration, the memory is updated according to the newly learned person image features. The non-parametric classifier classifies person images according to this feature memory. Specifically, the classifier treats each person image in the memory as an individual class. Given the feature of a person image and its label, via a softmax, the classifier minimizes the distance between the given image and its corresponding feature in the memory, while maximizing the distances to the other person image features in the memory. To maximize distances between all person images, we directly ask the classifier to keep each person image as an individual class throughout training. To minimize distances between similar person images, we ask the classifier to treat them as the same class.

However, it is challenging to obtain reliable similar samples in an unsupervised manner. Especially in the early training stage, it is impossible to find a large number of reliable similar person images that are located near each other in the feature space. To alleviate this problem, the AE method adaptively selects a small number of reliable similar samples to train the deep re-ID model, according to a similarity threshold. As the model becomes stronger, more person images are adaptively selected as reliable samples, which in turn benefits the training of the deep re-ID model. In this way, the model is progressively improved. However, the adaptive selection may result in an unbalanced neighborhood distribution. Once a person image has more neighborhoods than others, the sum of its losses will be larger, which forces other images to quickly move toward it. In consequence, images with more neighborhoods will turn even more images into their neighborhoods, which inevitably include unreliable and noisy person images. To alleviate this problem, we integrate a balance strategy into the method. A penalty is added to the loss function, which keeps the number of neighborhoods balanced and reasonable. This component automatically adapts the penalty weight on the loss to keep training balanced.

The contributions of this article are twofold:

  • We propose the AE method for unsupervised person re-ID. With supervised learning in the labeled source domain, the method maximizes distances between all target images and minimizes distances between similar target images. With a similarity threshold and a balance term, the method can adaptively find reliable similar target images.

  • Extensive experiments on three large-scale datasets demonstrate the effectiveness of the AE method. Our method significantly improves the unsupervised person re-ID accuracy, for both target-only re-ID and domain adaptive re-ID.

II Related Work

In this section, we briefly review previous work on person re-ID, including supervised person re-ID and unsupervised person re-ID. In this paper, we divide unsupervised person re-ID into target-only re-ID and domain adaptive re-ID.

II-A Supervised Person Re-identification

In recent years, compared with hand-crafted feature methods [18, 19, 20, 21, 22, 23, 24, 25, 26, 15, 27, 28, 29], deep learning methods have dominated the person re-ID community. Based on deep networks, supervised person re-ID has been extensively studied, from general problems to specific ones.

For the general CNN model, both the siamese model [30, 31, 32, 33, 13, 34, 35] and the classification model [36, 37, 38, 39, 40, 41] have been studied. Among the work on siamese models, Yi et al. [30] and Li et al. [13] first employed siamese networks for person re-ID and utilized part information in model training. For the classification model, Zheng et al. [36, 37] used a conventional fine-tuning approach, the ID-discriminative embedding (IDE). In addition to the network architecture, ranking losses [42, 34, 43, 44, 45] and classification losses [33, 13, 32, 46] have also been explored. Besides, to combat over-fitting, several data augmentation methods have been proposed [47, 48, 49, 16, 50]. Among them, Zhong et al. [50] changed the camera styles of images as data augmentation to enhance the robustness of the model to camera variance.

Besides general problems, some specific problems of person re-ID have been studied recently. To handle pose variance, several methods [19, 51, 52, 53, 54] have been proposed to learn pose-invariant representations. Su et al. [53] employed a pose-driven deep convolutional model to leverage human part cues. Some work [21, 55, 56, 57, 58, 59] focuses on the viewpoint-variance problem. Among them, Sun et al. [59] quantitatively analyzed and revealed the impact of different viewpoints. To handle background variance, several methods [60, 61, 62, 63, 64] enhance the robustness of the re-ID model to background noise. Tian et al. [63] proposed a person-region guided pooling network and random background augmentation to alleviate the background bias.

To address domain bias, in this work, we focus on unsupervised person re-ID.

II-B Target-only Person Re-identification

Target-only person re-ID [64, 4] leverages only unlabeled target data for CNN training. Although they do not involve a deep CNN, we also classify two hand-crafted feature methods [26, 15] as target-only person re-ID methods, because hand-crafted features can be directly used for person re-ID without any training data. In addition to hand-crafted feature methods, Xiao et al. [64] introduced a feature memory with the online instance matching (OIM) loss, which can also be used in target-only person re-ID. Lin et al. [4] introduced bottom-up clustering (BUC) and achieved competitive results with only unlabeled target data.

II-C Domain Adaptive Person Re-identification

Domain adaptive person re-ID [5, 6, 7, 8, 9, 17, 10, 11, 65, 66, 67] exploits the labeled source data along with the target data. Among these methods, Peng et al. [65] developed an unsupervised cross-dataset transfer learning approach based on asymmetric multi-task dictionary learning. Yu et al. [66] proposed a clustering-based asymmetric metric learning method called CAMEL. Fan et al. [7] introduced a progressive algorithm with clustering and fine-tuning for re-ID model training. Deng et al. [5] and Wei et al. [6] transferred source data to the target domain with CycleGAN [68] and employed the generated data to train the re-ID model. Wang et al. [8] and Lin et al. [9] incorporated attribute information to enhance the scalability and usability of the re-ID model. Zhong et al. [11] considered the intra-domain variation of the target domain and kept three underlying invariances during training, namely exemplar-invariance, camera-invariance, and neighborhood-invariance. In this work, the AE method considers distances in the target domain and achieves competitive accuracy on both domain adaptive person re-ID and target-only person re-ID.

Fig. 1: (a) Overview of the framework. The re-ID model is trained by source images and target images simultaneously. For source images, supervised learning is used. Specifically, a classifier with a cross-entropy (CE) loss is employed to classify persons in the source domain. For target images, the AE method is used. Specifically, AE aims to maximize feature distances between all images and minimize feature distances between similar images. (b) Illustration of the AE method. The goal of AE is to learn discriminative features for the target domain. The basic idea is to encourage target images to keep away from each other while forcing similar target images to stay closer in the feature space. When minimizing feature distances between similar images, according to a similarity threshold $\lambda$, only reliable samples are considered. We employ a non-parametric classifier with a feature memory to train the re-ID model. Details about the non-parametric classifier are provided in Section III-B.

III The Proposed Method

In this section, we introduce the AE method in detail. The framework is shown in Figure 1. In the framework, source images and target images are input into a ResNet-50 [69] based CNN simultaneously. The difference is that, with the labeled source domain, supervised learning is used to train the model to gain a basic re-ID capability, whereas with the unlabeled target domain, as shown in Figure 1, the proposed AE method helps the model generalize well in the target domain. Using a non-parametric classifier equipped with a feature memory, our method simultaneously maximizes distances between all target images and minimizes distances between similar target images in the feature space. In this way, our model encourages similar target images to stay close in the feature space while keeping other target images far away from each other. In the end, our model can learn discriminative features for person images in the target domain.

In Section III-A, we introduce supervised learning with a parametric classifier in the source domain. In Section III-B, we introduce the non-parametric classifier, which is equipped with a feature memory and used for the target domain. In Section III-C, we introduce the formulation of learning with adaptive selection, including maximizing distances between all images, minimizing distances between similar images, and adaptively selecting similar images. In Section III-D, we introduce a balance strategy, which aims to keep the number of neighborhoods balanced and reasonable. In Section III-E, we introduce the optimization procedure of our method.

III-A Parametric Classifier for the Source Domain

In the source domain, supervised learning with a parametric classifier is used. Suppose the source domain contains $N_s$ labeled images $\{(x^s_i, y^s_i)\}_{i=1}^{N_s}$, where the superscript $s$ denotes the source domain and $y^s_i$ is the identity label of image $x^s_i$. The probability that image $x^s_i$ belongs to identity $k$ is defined as follows,

$$p(k \mid x^s_i) = \frac{\exp\left(g_k(\phi(x^s_i; \theta); w)\right)}{\sum_{j=1}^{M_s} \exp\left(g_j(\phi(x^s_i; \theta); w)\right)}, \qquad (1)$$

where $\phi(\cdot; \theta)$ denotes the CNN feature extractor with weights $\theta$, $g(\cdot; w)$ denotes the parametric classifier with weights $w$, $g_k(\cdot)$ denotes the $k$-th logit of the classifier output given $x^s_i$, and $M_s$ denotes the number of person identities in the source domain. The training for the source domain is to minimize the following loss,

$$\mathcal{L}_{src} = -\frac{1}{N_s} \sum_{i=1}^{N_s} \log p(y^s_i \mid x^s_i). \qquad (2)$$
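For illustration, a minimal PyTorch-style sketch of this supervised branch is given below; the module names and the simplified backbone construction are our own, and the 751-way classifier corresponds to the Market-1501 training identities.

```python
import torch
import torch.nn as nn
import torchvision

# Sketch of the supervised source branch (Eqs. (1)-(2)); names are illustrative.
resnet = torchvision.models.resnet50(pretrained=True)
resnet.fc = nn.Identity()                        # expose the 2,048-d Pool-5 feature
embed = nn.Sequential(                           # 4,096-d embedding head phi(.; theta)
    nn.Linear(2048, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(0.5))
source_classifier = nn.Linear(4096, 751)         # parametric classifier g(.; w); 751 ids in Market-1501

def source_loss(images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the source identities, Eq. (2)."""
    f = embed(resnet(images))                    # d-dimensional feature of each source image
    return nn.functional.cross_entropy(source_classifier(f), labels)
```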

III-B Non-parametric Classifier with Feature Memory

To learn discriminative features for the target domain, our method tries to maximize feature distances between all person images and minimize feature distances between similar person images. To optimize distances, one can use contrastive loss [32] or triplet loss [45, 70, 71]. However, these losses become less effective when datasets become large. Therefore, we propose to optimize distances under a classification framework. For example, to minimize feature distances between similar person images, we can treat them as the same class. To maximize feature distances between all target person images, we can treat each image as an individual class.

However, treating each image as an individual class may make the parametric classifier difficult to converge. To alleviate this problem, motivated by [64, 11, 72], we exploit a non-parametric classifier, which is equipped with a memory M, to classify target images.

Suppose the target domain contains $N_t$ unlabeled images $\{x^t_i\}_{i=1}^{N_t}$, where the superscript $t$ denotes the target domain. After feature extraction, each image is embedded as a $d$-dimensional vector. The feature memory $\mathbf{M} \in \mathbb{R}^{d \times N_t}$ stores all target image features and is updated after each training iteration.

Based on the memory M, given an image $x^t_i$, the non-parametric classifier produces the probability that the image belongs to the same class as the $j$-th image,

$$p(j \mid x^t_i) = \frac{\exp\left(\mathbf{m}_j^{\top} \phi(x^t_i; \theta) / \tau\right)}{\sum_{k=1}^{N_t} \exp\left(\mathbf{m}_k^{\top} \phi(x^t_i; \theta) / \tau\right)}, \qquad (3)$$

where $\mathbf{m}_j$ denotes the $j$-th column of the feature memory M, representing the feature of the $j$-th image, $\phi(\cdot; \theta)$ denotes our model, which extracts a $d$-dimensional feature for every image, $\theta$ denotes the weights of the deep re-ID model, and the hyper-parameter $\tau$ denotes the temperature factor of the softmax function. A higher temperature leads to a softer probability distribution. After each iteration, M is updated as follows,

$$\mathbf{m}_i \leftarrow \alpha \, \mathbf{m}_i + (1 - \alpha) \, \phi(x^t_i; \theta), \qquad (4)$$

where the hyper-parameter $\alpha$ is the update rate of M.

Instead of fixing $\alpha$ to a constant value, we increase it linearly as training proceeds. Since M is not reliable enough at the beginning of training, a smaller $\alpha$ is needed to accelerate the update of M. Rapidly updated with newly learned representations, M can memorize discriminative features quickly. As M gradually becomes discriminative over the epochs, it needs to be more stable. Therefore, at that point, a larger $\alpha$ is used to slow down the update. Thus, an increasing $\alpha$ is adopted.
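For illustration, the sketch below gives one plausible implementation of the non-parametric classifier (Eq. (3)) and the memory update (Eq. (4)); the class and variable names are ours, and features are assumed L2-normalized so that inner products are cosine similarities.

```python
import torch
import torch.nn.functional as F

class FeatureMemory:
    """Feature memory M with a non-parametric softmax classifier (Eqs. (3)-(4))."""
    def __init__(self, num_images: int, dim: int = 4096, tau: float = 0.05):
        self.M = torch.zeros(num_images, dim)  # one d-dimensional slot per target image
        self.tau = tau                         # temperature factor of the softmax

    def probabilities(self, features: torch.Tensor) -> torch.Tensor:
        """Eq. (3): p(j | x_i) for a batch of L2-normalized features of shape (B, d)."""
        logits = features @ self.M.t() / self.tau   # cosine similarity to every memory slot
        return F.softmax(logits, dim=1)

    def update(self, indices: torch.Tensor, features: torch.Tensor, alpha: float):
        """Eq. (4): m_i <- alpha * m_i + (1 - alpha) * phi(x_i), then re-normalize."""
        self.M[indices] = alpha * self.M[indices] + (1.0 - alpha) * features.detach()
        self.M[indices] = F.normalize(self.M[indices], dim=1)
```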

III-C Learning with Adaptive Selection

In this subsection, we introduce learning with adaptive selection for the target domain, including maximizing distances between all target images, minimizing distances between similar target images, and adaptively selecting similar images.

We first introduce maximizing distances between all target images. To this end, we treat each target image as an individual class, so the index of a target image serves as its pseudo label. We maximize the distances between target images by minimizing the following loss,

$$\mathcal{L}_{all} = -\frac{1}{N_t} \sum_{i=1}^{N_t} \log p(i \mid x^t_i). \qquad (5)$$

When applying this loss to a specific image, the image is encouraged to move far away from other images (Figure 1). When applying this loss to all images, they are encouraged to move far away from each other.
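Concretely, Eq. (5) is an ordinary cross-entropy over the memory-based probabilities, with each image's own index as its pseudo label; a minimal sketch, reusing the FeatureMemory class above:

```python
import torch
import torch.nn.functional as F

def loss_all(memory: "FeatureMemory", features: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    """Eq. (5): treat every target image as its own class (pseudo label = image index)."""
    p = memory.probabilities(features)               # (B, N_t) probabilities from Eq. (3)
    return F.nll_loss(torch.log(p + 1e-8), indices)  # batch average of -log p(i | x_i)
```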

To minimize distances between similar target images, we first use a similarity threshold to adaptively select reliable neighborhoods for each target image. Only neighborhoods whose similarities to the given target image are larger than the threshold are selected as reliable. Then, we assume that the target image and its reliable neighborhoods share the same person identity, i.e., we treat the image and its reliable neighborhoods as the same class. By this operation, each target image is forced to move closer to its neighborhoods (Figure 1), which makes similar target images stay closer. This is achieved by minimizing the following loss,

$$\mathcal{L}_{sim} = -\frac{1}{N_t} \sum_{i=1}^{N_t} \sum_{j=1}^{N_t} v_{i,j} \, \log p(j \mid x^t_i), \qquad (6)$$

where $v_i \in \{0, 1\}^{N_t}$ denotes the selection indication vector of the $i$-th target image. Specifically, $v_{i,j} = 1$ indicates that the $j$-th image is selected as a neighborhood of the $i$-th image. When $v_{i,j} = 0$, the $j$-th image is not used when forcing the $i$-th image to move closer to its neighborhoods.

For neighborhood selection, we are inspired by Self-Paced Learning (SPL) [73], which has been widely used in weakly-supervised learning [74], semi-supervised learning [7, 75] and unsupervised learning [7]. The basic idea of SPL is to train with samples whose losses are small enough; samples with small losses are considered reliable. In this paper, we select reliable neighborhoods according to their similarities to the given image. Specifically, when an image is close enough to the given image in the feature space, it is selected as a reliable neighborhood of the given image. We formulate this selection as minimizing the following loss,

$$\min_{v_{i,j} \in \{0, 1\}} \; \sum_{i=1}^{N_t} \sum_{j=1}^{N_t} v_{i,j} \left(\lambda - s\!\left(\phi(x^t_i; \theta), \mathbf{m}_j\right)\right), \qquad (7)$$

where $\lambda$ is the similarity threshold and $s(\cdot, \cdot)$ denotes the cosine similarity between two features. By minimizing this objective with respect to $v$, two images are treated as neighborhoods exactly when their similarity is larger than $\lambda$.
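A minimal sketch of this selection rule, assuming L2-normalized features so that a dot product equals cosine similarity:

```python
import torch

def select_neighborhoods(features: torch.Tensor, M: torch.Tensor, lam: float = 0.55) -> torch.Tensor:
    """Closed-form minimizer of Eq. (7): v[i, j] = 1 iff the j-th memorized image
    is a reliable neighborhood of image i.
    features: (N_t, d) freshly extracted, L2-normalized; M: (N_t, d) feature memory."""
    sims = features @ M.t()       # cosine similarities s(phi(x_i), m_j)
    return (sims > lam).float()   # select exactly the pairs with similarity above lambda
```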

(a) without balance
(b) with balance
Fig. 2: Illustration of learning with and without balance. (a) Learning without balance. When an image (the point with a yellow border) has a large number of neighborhoods (points in the dashed circle), because some of its neighborhoods (points in the intersection of the dashed circle and the blue circle) can also be neighborhoods of another image (the point with a red border), it is easier for a large group to attract other images than a small group. (b) Learning with balance. When an image has a large number of neighborhoods, the balance strategy decreases the losses between the image and its neighborhoods (points in the intersection of the dashed circle and the blue circle) with a larger penalty; otherwise, the losses are decreased with a smaller penalty. As a result, whether a group is large or small, it attracts images to a relatively similar degree, which keeps the number of neighborhoods balanced and reasonable.

III-D Learning with Balance

A problem caused by Eq. (6) is that the number of neighborhoods of an image can change dramatically. When an image has a large number of neighborhoods, the sum of the losses between it and its neighborhoods can be considerably large. When an image has a small number of neighborhoods, this sum can be very small. As a consequence, as shown in Figure 2(a), it is easier for a large group to attract other images than a small group, which makes most groups have only a few data instances while a minority of groups have a large number of data instances. This unbalanced learning would result in poor accuracy for re-ID.

To address this issue, we integrate a balance term into Eq. (6) to make the adaptive selection balanced,

$$\mathcal{L}_{sim} = -\frac{1}{N_t} \sum_{i=1}^{N_t} \frac{\mathbb{1}(|v_i|)}{|v_i|} \sum_{j=1}^{N_t} v_{i,j} \, \log p(j \mid x^t_i), \qquad (8)$$

where $|v_i| = \sum_{j} v_{i,j}$ denotes the number of selected neighborhoods of the $i$-th image, and $\mathbb{1}(\cdot)$ is a binary indicator function: when $|v_i| = 1$, $\mathbb{1}(|v_i|)$ is equal to 0, and when $|v_i|$ is larger than 1, $\mathbb{1}(|v_i|)$ is equal to 1. This is because when $|v_i| = 1$, the image has no neighborhood except itself; we therefore do not use it as a training sample and set the corresponding loss to 0.

When an image has a large number of neighborhoods, the penalty decreases the losses between the image and its neighborhoods heavily; otherwise, the losses are decreased only slightly (Figure 2(b)). With this balance strategy, whether a group is large or small, it attracts images to a relatively similar degree. Therefore, groups will have a similar number of neighborhoods, and during training the number of neighborhoods becomes balanced and reasonable.
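A sketch of the balanced loss in Eq. (8), continuing the notation of the earlier sketches; each image's summed neighborhood loss is divided by its neighborhood count and dropped when the image has no neighborhood other than itself:

```python
import torch

def loss_sim_balanced(log_p: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Eq. (8). log_p: (B, N_t) log-probabilities from the non-parametric classifier,
    e.g. torch.log(memory.probabilities(features) + 1e-8); v: (B, N_t) binary selections."""
    counts = v.sum(dim=1)                 # |v_i|, number of selected neighborhoods
    per_image = -(v * log_p).sum(dim=1)   # summed loss over each image's neighborhoods
    keep = (counts > 1).float()           # indicator: drop images whose only neighborhood is themselves
    return (keep * per_image / counts.clamp(min=1)).mean()
```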

Input: unlabeled target data {x_i^t} (i = 1, ..., N_t);
       labeled source data {(x_i^s, y_i^s)} (i = 1, ..., N_s);
       similarity threshold λ;
       update rate α of the feature memory;
       number of epochs E;
       original model φ(·; θ).
Output: trained model φ(·; θ).
Initialization: randomly initialize θ; zero-initialize the feature memory M.
 1: for epoch = 1 to E do
 2:     // adaptive selection
 3:     for i = 1 to N_t do
 4:         extract the feature φ(x_i^t; θ);
 5:         calculate the similarities s(φ(x_i^t; θ), m_j) for j = 1, ..., N_t;
 6:         for j = 1 to N_t do
 7:             if s(φ(x_i^t; θ), m_j) > λ then
 8:                 v_{i,j} ← 1;   // selected
 9:             else
10:                 v_{i,j} ← 0;   // removed
11:             end if
12:         end for
13:     end for
14:     // model training
15:     train the model on the source images and the selected target images by minimizing Eq. (10);
16:     // feature memory update
17:     for i = 1 to N_t do
18:         m_i ← α m_i + (1 − α) φ(x_i^t; θ);
19:     end for
20: end for
Algorithm 1: Adaptive Exploration

III-E Optimization Procedure

During training, we alternately optimize the involved parameters, i.e., the selection indicators $v$ and the model weights $\theta$.

1) Optimize $v$ when $\theta$ is fixed. The goal of this step is to adaptively select reliable neighborhoods whose feature distances will then be minimized, which is achieved by solving

$$v^{*} = \arg\min_{v} \sum_{i=1}^{N_t} \sum_{j=1}^{N_t} v_{i,j} \left(\lambda - s\!\left(\phi(x^t_i; \theta), \mathbf{m}_j\right)\right). \qquad (9)$$

Specifically, if the similarity between two images exceeds the threshold $\lambda$, they are chosen as neighborhoods of each other. The details are provided in Algorithm 1 (steps 2-13).

2) Optimize $\theta$ when $v$ is fixed. This step utilizes the source data and the target data to train the model by minimizing

$$\mathcal{L} = (1 - \beta) \, \mathcal{L}_{src} + \beta \left(\mathcal{L}_{all} + w \, \mathcal{L}_{sim}\right), \qquad (10)$$

where the hyper-parameters $w$ and $\beta$ control the importance of these losses.

3) Update the feature memory M. In this step, the memory M of the non-parametric classifier is updated by Eq. (4).

With this optimization procedure, our model manages to recognize people in the target domain. Note that, under the target-only re-ID protocol, the loss from the source domain is not used: $\beta$ in Eq. (10) is set to 1.
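As a sketch, one gradient step over $\theta$ combines the three losses as reconstructed in Eq. (10); setting beta to 1 recovers the target-only protocol:

```python
import torch

def total_loss(l_src: torch.Tensor, l_all: torch.Tensor, l_sim: torch.Tensor,
               w: float = 3.5, beta: float = 0.6) -> torch.Tensor:
    """Eq. (10): weighted combination of the source and target losses.
    With beta = 1 the source term vanishes (target-only re-ID protocol)."""
    return (1.0 - beta) * l_src + beta * (l_all + w * l_sim)
```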

IV Experiment

In this section, we evaluate the proposed method on three large-scale re-ID datasets. Besides Market-1501 and DukeMTMC-reID, which are widely used by most existing methods, we also report accuracy on the MSMT17 dataset.

IV-A Datasets and Settings

Market-1501 [15] contains 32,688 images of 1,501 identities, captured by 6 cameras on a campus. The dataset is split into three parts: 12,936 images of 751 identities for training, 19,732 images of 750 identities for the gallery, and another 3,368 images, with hand-drawn bounding boxes and the same 750 gallery identities, for the query.

DukeMTMC-reID [16] contains 36,411 images of 1,812 identities, collected from 8 cameras. Following Market-1501, the dataset is split into three parts: 16,522 images of 702 identities for training, 17,661 images of 1,110 identities for the gallery, and another 2,228 images of 702 identities for the query.

MSMT17 [6] contains 126,441 images of 4,101 identities, captured by 15 cameras. Similar to Market-1501 and DukeMTMC-reID, the dataset is split into three parts: 32,621 images of 1,041 identities for training, 82,161 images of 3,060 identities for the gallery, and another 11,659 images with the same 3,060 gallery identities for the query.

We report rank-1, rank-5, rank-10 accuracy and mean average precision (mAP) for evaluation on the three datasets. All experiments use the single-query setting. Note that, for a fair comparison, we did not use the re-ranking algorithm [76].
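For reference, the sketch below computes simplified CMC rank-k scores and mAP from cosine similarities; it assumes every query identity appears in the gallery and omits the junk-image filtering of the official evaluation code.

```python
import numpy as np

def evaluate(q_feats, q_ids, g_feats, g_ids, ranks=(1, 5, 10)):
    """Simplified single-query CMC and mAP.
    q_feats: (Q, d) and g_feats: (G, d), both L2-normalized."""
    sims = q_feats @ g_feats.T                  # (Q, G) cosine similarities
    cmc = np.zeros(len(ranks))
    aps = []
    for i in range(len(q_ids)):
        order = np.argsort(-sims[i])            # gallery sorted by decreasing similarity
        matches = g_ids[order] == q_ids[i]      # boolean hit list in ranked order
        cmc += [matches[:r].any() for r in ranks]
        hit_pos = np.where(matches)[0]          # ranked positions of all correct matches
        precision = (np.arange(len(hit_pos)) + 1) / (hit_pos + 1)
        aps.append(precision.mean())            # average precision for this query
    return cmc / len(q_ids), float(np.mean(aps))
```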

IV-B Implementation Details

We used ResNet-50 [69] as the CNN backbone to extract features. The ResNet-50 was pre-trained on ImageNet [12]. After the Pool-5 layer of the ResNet-50, we added a 4,096-dimension fully-connected layer followed by batch normalization [77], ReLU [78] and Dropout [79]. Therefore, the length of the re-ID feature used for training is 4,096. Note that, during testing, following most existing methods, we used the 2,048-dimension feature from the Pool-5 layer. During training, we fixed the first two layers of the ResNet-50 to save GPU memory. After feature extraction, two different classifiers were used to classify person images: for the source domain, we adopted a general parametric classifier for supervised learning; for target images, we adopted a non-parametric classifier with a feature memory M.

We used random crop and random erasing [48] as data augmentation for both target images and source images. Images were resized to 256 × 128. For each iteration, we randomly chose 128 target images and 128 source images to constitute a batch. We also leveraged the CamStyle [10] method to enhance robustness to camera variance in the target domain.

We used Stochastic Gradient Descent (SGD) with a momentum of 0.9 and weight decay to train the model. We set the learning rate to 0.01 for the ResNet-50 base layers and 0.1 for the other layers in the first 40 epochs. The learning rate was then divided by 10 for the next 20 epochs. For the target domain, we started minimizing distances between similar images after 5 epochs. During training, we used cosine similarity as the similarity metric; thus, the 4,096-dimension features are L2-normalized before being stored into M and after being extracted from our model.
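The two learning rates and the step decay can be expressed with SGD parameter groups; a sketch reusing the module names from Section III-A, where the weight-decay value is an assumed typical choice since it is not specified above:

```python
import torch

optimizer = torch.optim.SGD(
    [{"params": resnet.parameters(), "lr": 0.01},   # ResNet-50 base layers
     {"params": embed.parameters(), "lr": 0.1}],    # newly added embedding layers
    momentum=0.9,
    weight_decay=5e-4)  # assumed typical value; not specified in the text
# Divide both learning rates by 10 after epoch 40.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)
```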

For the hyper-parameters, we set the similarity threshold $\lambda$ and the temperature factor $\tau$ to 0.55 and 0.05, respectively. By default, the update rate $\alpha$ of the feature memory M was linearly increased over the epochs, from 0 to 0.4 for Market-1501 and from 0 to 0.5 for DukeMTMC-reID. Unless otherwise specified, $w$ and $\beta$ are set to 3.5 and 0.6, respectively.
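The linear schedule for $\alpha$ amounts to a one-line helper; the function name is ours, and the defaults mirror the values reported above:

```python
def alpha_at(epoch: int, total_epochs: int = 60, alpha_max: float = 0.4) -> float:
    """Linearly increase the memory update rate alpha from 0 to alpha_max
    (0.4 for Market-1501, 0.5 for DukeMTMC-reID): fast memory updates early,
    a stable memory later."""
    return alpha_max * epoch / max(total_epochs - 1, 1)
```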

IV-C Ablation Study

(a) DukeMTMC-reID to Market-1501
(b) Market-1501 to DukeMTMC-reID
(c) Similarity Threshold
(d) Similarity Threshold
(e) DukeMTMC-reID to Market-1501
(f) Market-1501 to DukeMTMC-reID
Fig. 3: (a) and (b): Comparison of varying and constant $\alpha$ on accuracy. We conducted nine experiments for each case. Note that the x-axis has different meanings for the two cases: the constant $\alpha$ was fixed to 0.1, …, 0.9, respectively, during the entire training, whereas the varying $\alpha$ was linearly increased from 0 to 0.1, …, 0.9, respectively. According to the results, the varying $\alpha$ always achieves higher accuracy than the constant $\alpha$. (c) Influence of the similarity threshold $\lambda$ on accuracy. Basically, when $\lambda$ is around 0.55, the method simultaneously achieves high accuracy on both Market-1501 and DukeMTMC-reID. (d) Influence of the similarity threshold $\lambda$ on the number of reliable neighborhoods. When $\lambda$ is too small, a large number of images are selected as reliable data instances; when $\lambda$ is too large, only a few images are selected. (e) and (f): Comparison of adaptive selection (adaptive) and a constant number of neighborhoods (top-$k$) on accuracy. Our adaptive selection outperforms top-$k$ in all cases.

IV-C1 Exploration of the Update Rate

To investigate the effect of $\alpha$, different growth rates of the varying $\alpha$ were employed during training. Specifically, we linearly increased the varying $\alpha$ from 0 to 0.1, …, 0.9, respectively. Meanwhile, a constant $\alpha$ was also compared with the varying $\alpha$; specifically, $\alpha$ was fixed to 0.1, …, 0.9, respectively, during training. The results are shown in Figures 3(a) and 3(b), and two major conclusions can be drawn.

First, the varying $\alpha$ is superior to the constant $\alpha$. Specifically, we observe that the accuracy curves of the varying $\alpha$ are usually above those of the constant $\alpha$. Additionally, the highest points of the accuracy curves with the varying $\alpha$ are always higher than those with the constant $\alpha$. We speculate that, with a varying $\alpha$, the feature memory can become discriminative rapidly at first and remain more stable later. Thus, the varying $\alpha$ yields higher accuracy than the constant $\alpha$.

Second, for the varying $\alpha$, different growth rates lead to different accuracy. Specifically, in Figures 3(a) and 3(b), as the growth rate increases, the accuracy curves rise at first and fall later. Moreover, when testing on Market-1501 in Figure 3(a), the peak value is achieved when the final $\alpha$ is equal to 0.4, while when testing on DukeMTMC-reID in Figure 3(b), the peak value is achieved when the final $\alpha$ is equal to 0.5. Therefore, in the following experiments, we increase $\alpha$ from 0 to 0.4 for Market-1501 and from 0 to 0.5 for DukeMTMC-reID.

                                                   | Market-1501 (%)                   | DukeMTMC-reID (%)
Methods                                            | Src.   rank-1 rank-5 rank-10 mAP  | Src.   rank-1 rank-5 rank-10 mAP
Source Only                                        | Duke   46.4   63.7   70.6    20.0 | Market 30.3   45.4   52.5    15.9
w/o Cam, Target Only: adaptive selection           | N/A    35.4   54.2   63.3    14.4 | N/A    37.2   53.5   60.6    20.5
w/o Cam, Target Only: adaptive selection + balance | N/A    57.1   75.3   81.8    34.5 | N/A    56.1   72.6   78.7    35.5
w/o Cam, Transfer: adaptive selection              | Duke   49.6   66.0   73.0    22.4 | Market 45.8   60.8   68.8    28.9
w/o Cam, Transfer: adaptive selection + balance    | Duke   67.3   79.6   83.7    40.8 | Market 60.1   76.3   82.3    41.8
w/ Cam, Target Only: adaptive selection            | N/A    55.4   74.8   81.6    23.4 | N/A    42.5   57.9   64.2    19.4
w/ Cam, Target Only: adaptive selection + balance  | N/A    77.5   89.8   93.4    54.0 | N/A    63.2   75.4   79.4    39.0
w/ Cam, Transfer: adaptive selection               | Duke   66.6   83.6   89.1    35.1 | Market 59.4   72.5   78.3    36.2
w/ Cam, Transfer: adaptive selection + balance     | Duke   81.6   91.9   94.6    58.0 | Market 67.9   79.2   83.6    46.7
TABLE I: Comparison of our models under different settings on DukeMTMC-reID (Duke) and Market-1501 (Market). Source Only: the model is trained only on the source dataset. Target Only: the model is trained only on the target dataset. Transfer: the model is trained on both the source and target datasets. Cam: CamStyle [10] data augmentation. Src.: source domain.

IV-C2 Learning with Adaptive Exploration

To investigate the effect of learning with adaptive exploration, we first adopted different values of the similarity threshold $\lambda$ during training; the accuracy is reported in Figure 3(c). Then, for different $\lambda$, the average number of selected neighborhoods in the last epoch is shown in Figure 3(d). Finally, the adaptive selection is compared with selecting a constant number of neighborhoods (top-$k$) in Figures 3(e) and 3(f).

In Figure 3(c), different values of $\lambda$ result in different recognition accuracy. Specifically, we observe that the accuracy (rank-1 and mAP) curves increase first and then decrease as $\lambda$ increases. Meanwhile, the best accuracy is achieved when $\lambda$ is equal to 0.55. Moreover, $\lambda$ is robust across datasets: high accuracy is achieved around $\lambda = 0.55$ on both Market-1501 and DukeMTMC-reID.

In Figure 3(d), when $\lambda = 0.55$, the average number of selected neighborhoods is closest to the real neighborhood distribution. For example, the Market-1501 training set contains 12,936 images belonging to 751 identities. If we regard images with the same identity as neighborhoods of each other, the real average number of neighborhoods is 17.2 on Market-1501 (12,936 / 751 ≈ 17.2). Meanwhile, when taking Market-1501 as the target domain, our model selects on average about 18.3 neighborhoods in the last epoch when $\lambda = 0.55$ in Figure 3(d). This number (18.3) is close to the real average number of neighborhoods (17.2). In contrast, other values of $\lambda$ encourage the model to select many more or fewer neighborhoods: for example, 28.5 and 6.1 neighborhoods are selected on average when $\lambda$ is equal to 0.50 and 0.60, respectively. The situation is the same on DukeMTMC-reID. This confirms that 0.55 is a suitable value of $\lambda$ for our model. Thus, we set $\lambda$ to 0.55 in the following experiments.

In Figures 3(e) and 3(f), the adaptive selection (with $\lambda = 0.55$) is compared with selecting a constant number of neighborhoods. In the latter case, given a constant $k$, our model selects the $k$ nearest neighborhoods (top-$k$) for each target image during training.

In Figures 3(e) and 3(f), we observe that the curves of top-$k$ always lie below those of the adaptive selection. Especially when transferring DukeMTMC-reID to Market-1501 in Figure 3(e), there is a large margin between the mAP curves of top-$k$ and the adaptive selection. This demonstrates that the adaptive selection significantly improves the re-ID model.

IV-C3 Learning with Balance

To investigate the effect of learning with balance, we compared the models with and without balance in Table I. For the model without balance (only with adaptive selection), to achieve higher accuracy, we set $w$ to 0.2 and increased $\alpha$ from 0 to 0.4 for both Market-1501 and DukeMTMC-reID.

In Table I, the results suggest that learning with balance significantly improves recognition accuracy. For example, when transferring DukeMTMC-reID to Market-1501 with CamStyle augmentation, we observe +15.0% in rank-1 and +22.9% in mAP with the balance strategy, and when transferring Market-1501 to DukeMTMC-reID with CamStyle augmentation, we observe +8.5% in rank-1 and +10.5% in mAP. We speculate that this is because learning with balance penalizes images with many neighborhoods more heavily. Thus, images with many neighborhoods are prevented from attracting too many other images and accumulating even more neighborhoods. Therefore, the number of neighborhoods becomes balanced and reasonable.

Another notable finding in Table I is that CamStyle augmentation increases re-ID accuracy dramatically. For example, when transferring DukeMTMC-reID to Market-1501, our model with CamStyle augmentation achieves 81.6% in rank-1 and 58.0% in mAP; compared with the model without CamStyle, accuracy is increased by 14.3% in rank-1 and 17.2% in mAP. The situation is the same when transferring Market-1501 to DukeMTMC-reID. We speculate that CamStyle augmentation helps our model become robust to camera variance in the target domain.

                | DukeMTMC-reID to Market-1501 (%)  | Market-1501 to DukeMTMC-reID (%)
Methods         | rank-1 rank-5 rank-10 mAP         | rank-1 rank-5 rank-10 mAP
UMDL [65]       | 34.5   52.6   59.6    12.4        | 18.5   31.4   37.6    7.3
PTGAN [6]       | 38.6   -      66.1    -           | 27.4   -      50.7    -
PUL [7]         | 45.5   60.7   66.7    20.5        | 30.0   43.4   48.5    16.4
CAMEL [66]      | 54.5   -      -       26.3        | -      -      -       -
MMFA [9]        | 56.7   75.0   81.8    27.4        | 45.3   59.8   66.3    24.7
SPGAN+LMP [5]   | 57.7   75.8   82.4    26.7        | 46.4   62.3   68.0    26.2
TJ-AIDL [8]     | 58.2   74.8   81.1    26.5        | 44.3   59.6   65.0    23.0
CamStyle [10]   | 58.8   78.2   84.3    27.4        | 48.4   62.5   68.9    25.1
HHL [17]        | 62.2   78.8   84.0    31.4        | 46.9   61.0   66.7    27.2
ARN [67]        | 70.3   80.4   86.3    39.4        | 60.2   73.9   79.5    33.4
ECN [11]        | 75.1   87.6   91.6    43.0        | 63.3   75.8   80.4    40.4
Ours            | 81.6   91.9   94.6    58.0        | 67.9   79.2   83.6    46.7
TABLE II: Comparison with state-of-the-art domain adaptive re-ID methods on Market-1501 and DukeMTMC-reID.
                   | Market-1501 (%)             | DukeMTMC-reID (%)
Methods            | rank-1 rank-5 rank-10 mAP   | rank-1 rank-5 rank-10 mAP
LOMO [26]          | 27.2   41.6   49.1    8.0   | 12.3   21.3   26.6    4.8
BOW [15]           | 35.8   52.4   60.3    14.8  | 17.1   28.8   34.9    8.3
OIM [64]           | 38.0   58.0   66.3    14.0  | 24.5   38.8   46.0    11.3
BUC [4]            | 66.2   79.6   84.5    38.3  | 47.4   62.6   68.4    27.5
Ours (target-only) | 77.5   89.8   93.4    54.0  | 63.2   75.4   79.4    39.0
TABLE III: Comparison with state-of-the-art target-only re-ID methods on Market-1501 and DukeMTMC-reID.

IV-C4 Comparison of Transfer Learning and Target-Only Learning

In this paper, we consider both the domain adaptive and the target-only re-ID settings of our method. We first evaluated the baseline re-ID model, namely directly deploying the model pre-trained on the source data to the unseen target domain; in Table I, we call this direct transfer Source Only. Then, we evaluated the domain adaptive setting of our method, called Transfer in Table I. Specifically, our method utilizes the labeled source data and the unlabeled target data during training and then deploys the trained model to the target domain. Finally, we evaluated the target-only setting of our method, named Target Only in Table I, which is trained only on unlabeled target data and tested on the target domain.

First, in Table I, the source only model fails to produce good results in the target domain. For example, when directly transferring DukeMTMC-reID to Market-1501, the source only model achieves only 46.4% in rank-1 and 20.0% in mAP, and when directly transferring Market-1501 to DukeMTMC-reID, it achieves only 30.3% in rank-1 and 15.9% in mAP. This demonstrates the notorious domain-shift problem of re-ID: due to domain bias, a pre-trained model often achieves considerably low accuracy on unseen datasets.

Then, in Table I, our model significantly increases re-ID accuracy with transfer learning. For example, when transferring DukeMTMC-reID to Market-1501, our model achieves 81.6% in rank-1 and 58.0% in mAP; compared with the source only baseline, accuracy is increased by 35.2% in rank-1 and 38.0% in mAP. When transferring Market-1501 to DukeMTMC-reID, our model achieves 67.9% in rank-1 and 46.7% in mAP, increases of 37.6% in rank-1 and 30.8% in mAP over the source only baseline. This demonstrates that our model effectively leverages the information in the target data and thus achieves much higher accuracy.

Finally, in Table I, the target-only setting of our method still achieves competitive accuracy. For example, when trained with only unlabeled data on Market-1501, our model achieves 77.5% in rank-1 and 54.0% in mAP on Market-1501; compared with transfer learning, accuracy drops only 4.1% in rank-1 and 4.0% in mAP. When trained with only unlabeled data on DukeMTMC-reID, our model achieves 63.2% in rank-1 and 39.0% in mAP; compared with transfer learning, accuracy drops only 4.7% in rank-1 and 7.7% in mAP. This demonstrates that our method manages to learn discriminative features without any annotation.

IV-D Comparison with the State-of-the-Art Methods

We compare our method with state-of-the-art unsupervised person re-ID methods on Market-1501, DukeMTMC-reID, and MSMT17, as shown in Tables II, III, and IV. Tables II and III report the results of the domain adaptive re-ID methods and the target-only re-ID methods, respectively, on Market-1501 and DukeMTMC-reID. The results on MSMT17 are shown in Table IV.

In Table II, our method is compared with eleven domain adaptive person re-ID methods. Among them, three methods (UMDL [65], PUL [7], and CAMEL [66]) use labeled source data for model initialization and unlabeled target data for training. Eight methods (PTGAN [6], SPGAN+LMP [5], MMFA [9], TJ-AIDL [8], CamStyle [10], HHL [17], ARN [67], and ECN [11]) leverage both source data and target data during training. PUL [7] first introduced adaptive exploration to unsupervised person re-ID and achieves 45.5% in rank-1 on Market-1501. CamStyle [10] introduced changing image camera styles as a type of data augmentation and achieves 58.8% in rank-1 on Market-1501. As reported in Table II, our method clearly outperforms these competing methods. Specifically, it achieves 81.6% in rank-1 and 58.0% in mAP when treating DukeMTMC-reID as the source set and testing on Market-1501, and 67.9% in rank-1 and 46.7% in mAP when taking Market-1501 as the source set and testing on DukeMTMC-reID. Compared with the current best domain adaptive method (ECN [11]), rank-1 is increased by 6.5% and 4.6% when testing on Market-1501 and DukeMTMC-reID, respectively.

In Table III, our method is compared with four target-only re-ID methods (LOMO [26], BOW [15], OIM [64], and BUC [4]). Among them, the two hand-crafted feature methods (LOMO [26] and BOW [15]) directly use designed features to recognize people and thus fail to produce good re-ID accuracy. Specifically, LOMO and BOW achieve 27.2% and 35.8% in rank-1 on Market-1501, respectively. Compared with these methods, the target-only AE method yields much higher recognition accuracy on Market-1501 and DukeMTMC-reID. Specifically, it achieves 77.5% in rank-1 and 54.0% in mAP on Market-1501, and 63.2% in rank-1 and 39.0% in mAP on DukeMTMC-reID. Compared with the current best target-only method (BUC [4]), rank-1 is increased by 11.3% and 15.8% on Market-1501 and DukeMTMC-reID, respectively.

                   |        | MSMT17 (%)
Methods            | Src.   | rank-1 rank-5 rank-10 mAP
PTGAN [6]          | Market | 10.2   -      24.4    2.9
ECN [11]           | Market | 25.3   36.3   42.1    8.5
Ours               | Market | 25.5   37.3   42.6    9.2
PTGAN [6]          | Duke   | 11.8   -      27.4    3.3
ECN [11]           | Duke   | 30.2   41.5   46.8    10.2
Ours               | Duke   | 32.3   44.4   50.1    11.7
BUC* [4]           | N/A    | 11.5   18.6   22.3    3.4
Ours (target-only) | N/A    | 26.6   37.0   41.7    8.5
TABLE IV: Comparison with the state-of-the-art methods on MSMT17. Src. denotes the source domain. * denotes results reproduced by ourselves.

In Table IV, our method is first compared with two domain adaptive re-ID methods (PTGAN [6] and ECN [11]) on MSMT17, and then with one target-only re-ID method (BUC [4]).

Among the two domain adaptive re-ID methods, PTGAN [6], which released the MSMT17 dataset, achieves 10.2% and 11.8% in rank-1 when transferring Market-1501 and DukeMTMC-reID to MSMT17, respectively. Compared with these two methods, the AE method achieves higher accuracy on MSMT17. Specifically, it achieves 25.5% in rank-1 and 9.2% in mAP when treating Market-1501 as the source set and testing on MSMT17, and 32.3% in rank-1 and 11.7% in mAP when treating DukeMTMC-reID as the source set and testing on MSMT17. Compared with the current best domain adaptive method (ECN [11]), mAP is increased by 0.7% and 1.5% when treating Market-1501 and DukeMTMC-reID as the source set, respectively.

(a) Source Only
(b) BUC
(c) ECN
(d) AE (our)
Fig. 4: Visualization of the features extracted by source only, BUC [4], ECN [11], and our AE method. 100 identities with 1,926 images from the gallery of Market-1501 are used. Source only indicates the baseline re-ID model trained only on DukeMTMC-reID. BUC [4] and ECN [11] are the current best methods for target-only re-ID and domain adaptive re-ID, respectively. Each point represents an image, and each color represents a person identity.
(a) without balance
(b) with balance
Fig. 5: Visualization of the neighborhoods selected for two images from DukeMTMC-reID. The two images (with a blue and a red border, respectively) are shown on the upper right. Each point indicates an image, and points of different colors indicate different persons. (a) When learning without balance, one image can have an extremely large number of neighborhoods, which inevitably contain incorrect persons. Even the image in a small group still chooses some noisy person images as its neighborhoods. (b) When learning with balance, the two images select a similar number of neighborhoods. Meanwhile, these neighborhoods share the same person identities as the two images.

The target-only AE method also achieves competitive results on MSMT17, as shown in Table IV. Specifically, it achieves 26.6% in rank-1 and 8.5% in mAP on MSMT17. Compared with the target-only person re-ID method BUC [4], rank-1 and mAP are increased by 15.1% and 5.1%, respectively.

IV-E Visualization of the Feature Space

IV-E1 Effectiveness of Learning with Balance

To further investigate the effectiveness of learning with balance, we use PCA to visualize the neighborhoods selected in the last epoch (epoch 60) for two images from DukeMTMC-reID. The results are shown in Figure 5.

In Figure 5(a), without the balance term, one image selects too many neighborhoods while the other chooses only a few. Meanwhile, the identities of some neighborhoods differ from those of the two images. In Figure 5(b), when learning with balance, the two images are able to select a similar number of neighborhoods. Moreover, these neighborhoods share the same identities as the two images. This indicates that learning with balance helps our model classify people accurately.

IV-E2 Effectiveness of the AE Method

To further investigate the effectiveness of our method, we use t-SNE [80] to visualize the feature distributions shown in Figure 4. Specifically, part of the gallery images of Market-1501 (100 identities, 1,926 images) are embedded into features, and the features are then projected onto a 2-dimensional map by t-SNE. Note that each point in the map represents one image, and points with the same color represent the same person.
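For reference, such a map can be produced with scikit-learn; in the sketch below, the feature files are assumed placeholders for the extracted 1,926 × 2,048 gallery features and their identity labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.load("gallery_features.npy")  # (1926, 2048) extracted gallery features (assumed file)
ids = np.load("gallery_ids.npy")            # (1926,) identity labels (assumed file)

# Project re-ID features onto a 2-D map and color each point by person identity.
emb = TSNE(n_components=2, metric="cosine", init="pca", random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=ids, cmap="tab20", s=4)
plt.axis("off")
plt.show()
```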

In Figure 4(d), points of the same color usually stay together and are far away from points of other colors. This demonstrates that our model can extract discriminative features. Due to the lack of labels, our model inevitably classifies some pairs of similar persons as one identity. Therefore, in Figure 4(d), there are a few places where points of two different colors stay together.

We also visualize the feature distributions of three other methods, namely source only, BUC [4], and ECN [11], shown in Figures 4(a), 4(b), and 4(c), respectively. Source only indicates the baseline re-ID model pre-trained on only the source data. BUC [4] and ECN [11] are the current best methods for target-only re-ID and domain adaptive re-ID, respectively. Compared with them, in Figure 4(d) of our method, points of the same color stay closer and fewer points of different colors are mistakenly grouped together. This demonstrates the superiority of our method.

V Conclusion

In this paper, we propose the adaptive exploration (AE) method for unsupervised person re-ID. The AE method explores the unlabeled target domain by considering the distances between target images. Through a non-parametric classifier with a feature memory, AE maximizes distances between all target images and minimizes distances between similar target images. Meanwhile, we propose to employ a similarity threshold to select reliable similar images. However, with adaptive selection alone, some images select too many neighborhoods while others have only a few. To alleviate this unbalanced problem, we integrate a balance term into the objective loss to prevent images that have too many neighborhoods from attracting even more images. As a result, each image tends to select a balanced and reasonable number of neighborhoods. With the adaptive selection and the balance term, the AE method achieves competitive accuracy on both target-only and domain adaptive re-ID.

References

  • [1] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [2] Z. Liu, D. Wang, and H. Lu, “Stepwise metric promotion for unsupervised video person re-identification,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • [3] M. Ye, A. J. Ma, L. Zheng, J. Li, and P. C. Yuen, “Dynamic label graph matching for unsupervised video re-identification,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • [4] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang, “A bottom-up clustering approach to unsupervised person re-identification,” in AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • [5] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [6] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person transfer GAN to bridge domain gap for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [7] H. Fan, L. Zheng, C. Yan, and Y. Yang, “Unsupervised person re-identification: Clustering and fine-tuning,” TOMCCAP, 2018.
  • [8] J. Wang, X. Zhu, S. Gong, and W. Li, “Transferable joint attribute-identity deep learning for unsupervised person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [9] S. Lin, H. Li, C. Li, and A. C. Kot, “Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification,” in British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, 2018.
  • [10] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camstyle: A novel data augmentation method for person re-identification,” IEEE Trans. Image Processing, 2019.
  • [11] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang, “Invariance matters: Exemplar memory for domain adaptive person re-identification,” CoRR, 2019.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, 2009.
  • [13] W. Li, R. Zhao, T. Xiao, and X. Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, 2014.
  • [14] W. Li, R. Zhao, and X. Wang, “Human reidentification with transferred metric learning,” in Computer Vision - ACCV 2012 - 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I, 2012.
  • [15] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.
  • [16] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by GAN improve the person re-identification baseline in vitro,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • [17] Z. Zhong, L. Zheng, S. Li, and Y. Yang, “Generalizing a person retrieval model hetero- and homogeneously,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, 2018.
  • [18] W. Li and X. Wang, “Locally aligned feature transforms across views,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, 2013.
  • [19] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person re-identification by symmetry-driven accumulation of local features,” in The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, 2010.
  • [20] R. Zhao, W. Ouyang, and X. Wang, “Unsupervised salience learning for person re-identification,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, 2013.
  • [21] D. Gray and H. Tao, “Viewpoint invariant pedestrian recognition with an ensemble of localized features,” in Computer Vision - ECCV 2008, 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part I, 2008.
  • [22] I. Kviatkovsky, A. Adam, and E. Rivlin, “Color invariants for person reidentification,” IEEE Trans. Pattern Anal. Mach. Intell., 2013.
  • [23] B. J. Prosser, W. Zheng, S. Gong, and T. Xiang, “Person re-identification by support vector ranking,” in British Machine Vision Conference, BMVC 2010, Aberystwyth, UK, August 31 - September 3, 2010. Proceedings, 2010.
  • [24] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, 2012.
  • [25] W. Zheng, S. Gong, and T. Xiang, “Reidentification by relative distance comparison,” IEEE Trans. Pattern Anal. Mach. Intell., 2013.
  • [26] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
  • [27] D. Tao, L. Jin, Y. Wang, and X. Li, “Person reidentification by minimum classification error-based KISS metric learning,” IEEE Trans. Cybernetics, 2015.
  • [28] Z. Wang, R. Hu, C. Chen, Y. Yu, J. Jiang, C. Liang, and S. Satoh, “Person reidentification via discrepancy matrix and matrix metric,” IEEE Trans. Cybernetics, 2018.
  • [29] N. Martinel, G. L. Foresti, and C. Micheloni, “Person reidentification in a distributed camera network framework,” IEEE Trans. Cybernetics, 2017.
  • [30] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person re-identification,” in 22nd International Conference on Pattern Recognition, ICPR 2014, Stockholm, Sweden, August 24-28, 2014, 2014.
  • [31] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang, “A siamese long short-term memory architecture for human re-identification,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, 2016.
  • [32] R. R. Varior, M. Haloi, and G. Wang, “Gated siamese convolutional neural network architecture for human re-identification,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, 2016.
  • [33] E. Ahmed, M. J. Jones, and T. K. Marks, “An improved deep learning architecture for person re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
  • [34] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, “Person re-identification by multi-channel parts-based CNN with improved triplet loss function,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • [35] H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Z. Li, “Embedding deep metric for person re-identification: A study against large variations,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, 2016.
  • [36] L. Zheng, Y. Yang, and A. G. Hauptmann, “Person re-identification: Past, present and future,” CoRR, 2016.
  • [37] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian, “Person re-identification in the wild,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.
  • [38] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Deep attributes driven multi-camera person re-identification,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, 2016.
  • [39] T. Xiao, H. Li, W. Ouyang, and X. Wang, “Learning deep feature representations with domain guided dropout for person re-identification,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • [40] Z. Zheng, L. Zheng, and Y. Yang, “A discriminatively learned CNN embedding for person re-identification,” TOMCCAP, 2016.
  • [41] M. Geng, Y. Wang, T. Xiang, and Y. Tian, “Deep transfer learning for person re-identification,” CoRR, 2016.
  • [42] S. Chen, C. Guo, and J. Lai, “Deep ranking for person re-identification via joint representation learning,” IEEE Trans. Image Processing, 2016.
  • [43] S. Ding, L. Lin, G. Wang, and H. Chao, “Deep feature learning with relative distance comparison for person re-identification,” Pattern Recognition, 2015.
  • [44] W. Chen, X. Chen, J. Zhang, and K. Huang, “Beyond triplet loss: A deep quadruplet network for person re-identification,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.
  • [45] A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” CoRR, 2017.
  • [46] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang, “Joint learning of single-image and cross-image representations for person re-identification,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • [47] N. McLaughlin, J. M. del Rincón, and P. C. Miller, “Data-augmentation for reducing dataset bias in person re-identification,” in 12th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2015, Karlsruhe, Germany, August 25-28, 2015, 2015.
  • [48] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” CoRR, 2017.
  • [49] F. Zhu, X. Kong, H. Fu, and Q. Tian, “Pseudo-positive regularization for deep person re-identification,” Multimedia Syst., 2018.
  • [50] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camera style adaptation for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [51] L. Zheng, Y. Huang, H. Lu, and Y. Yang, “Pose invariant embedding for deep person re-identification,” IEEE Trans. Image Processing, 2019.
  • [52] Y. Cho and K. Yoon, “Improving person re-identification via pose-aware multi-shot matching,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • [53] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Pose-driven deep convolutional model for person re-identification,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • [54] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, “A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [55] Z. Wu, Y. Li, and R. J. Radke, “Viewpoint invariant human re-identification in camera networks using pose priors and subject-discriminative features,” IEEE Trans. Pattern Anal. Mach. Intell., 2015.
  • [56] S. Bak, S. Zaidenberg, B. Boulay, and F. Brémond, “Improving person re-identification by viewpoint cues,” in 11th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2014, Seoul, South Korea, August 26-29, 2014, 2014.
  • [57] S. Karanam, Y. Li, and R. J. Radke, “Person re-identification with discriminatively trained viewpoint invariant dictionaries,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.
  • [58] K. Zheng, X. Fan, Y. Lin, H. Guo, H. Yu, D. Guo, and S. Wang, “Learning view-invariant features for person identification in temporally synchronized videos taken by wearable cameras,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • [59] X. Sun and L. Zheng, “Dissecting person re-identification from the viewpoint of viewpoint,” CoRR, 2018.
  • [60] L. Bazzani, M. Cristani, A. Perina, and V. Murino, “Multiple-shot person re-identification by chromatic and epitomic analyses,” Pattern Recognition Letters, 2012.
  • [61] D. Chen, S. Zhang, W. Ouyang, J. Yang, and Y. Tai, “Person search via a mask-guided two-stream CNN model,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, 2018.
  • [62] C. Song, Y. Huang, W. Ouyang, and L. Wang, “Mask-guided contrastive attention model for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [63] M. Tian, S. Yi, H. Li, S. Li, X. Zhang, J. Shi, J. Yan, and X. Wang, “Eliminating background-bias for robust person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [64] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang, “Joint detection and identification feature learning for person search,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.
  • [65] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian, “Unsupervised cross-dataset transfer learning for person re-identification,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • [66] H. Yu, A. Wu, and W. Zheng, “Cross-view asymmetric metric learning for unsupervised person re-identification,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • [67] Y. Li, F. Yang, Y. Liu, Y. Yeh, X. Du, and Y. F. Wang, “Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [68] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • [69] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • [70] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, “Person re-identification by multi-channel parts-based CNN with improved triplet loss function,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • [71] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan, “End-to-end comparative attention networks for person re-identification,” IEEE Trans. Image Processing, 2017.
  • [72] Q. Yang, H.-X. Yu, A. Wu, and W.-S. Zheng, “Patch-based discriminative feature learning for unsupervised person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, 2019.
  • [73] M. P. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada, 2010.
  • [74] H. Fan, X. Chang, D. Cheng, Y. Yang, D. Xu, and A. G. Hauptmann, “Complex event detection by identifying reliable shots from untrimmed videos,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • [75] F. Ma, D. Meng, Q. Xie, Z. Li, and X. Dong, “Self-paced co-training,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017.
  • [76] Z. Zhong, L. Zheng, D. Cao, and S. Li, “Re-ranking person re-identification with k-reciprocal encoding,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.
  • [77] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015.
  • [78] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, 2010.
  • [79] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, 2014.
  • [80] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, 2008.