Semantics-Aligned Representation Learning for Person Re-identification


Xin Jin*   Cuiling Lan   Wenjun Zeng   Guoqiang Wei   Zhibo Chen
University of Science and Technology of China   Microsoft Research Asia
jinxustc@mail.ustc.edu.cn   wgq7441@mail.ustc.edu.cn   chenzhibo@ustc.edu.cn   {culan, wezeng}@microsoft.com
*This work was done when Xin Jin was an intern at Microsoft Research Asia.
Abstract

Person re-identification (reID) aims to match person images to retrieve the ones with the same identity. This is a challenging task, as the images to be matched are generally semantically misaligned due to the diversity of human poses and capture viewpoints, incompleteness of the visible bodies (due to occlusion), etc. In this paper, we propose a framework that drives the reID network to learn semantics-aligned feature representations through delicate supervision designs. Specifically, we build a Semantics Aligning Network (SAN), which consists of a base network as encoder (SA-Enc) for reID and a decoder (SA-Dec) for reconstructing/regressing the densely semantically aligned full texture image. We jointly train the SAN under the supervisions of person re-identification and aligned texture generation. Moreover, at the decoder, besides the reconstruction loss, we add triplet reID constraints/losses over the feature maps as perceptual losses. The decoder is discarded at inference/test time, so our scheme is computationally efficient. Ablation studies demonstrate the effectiveness of our design. We achieve state-of-the-art performance on the benchmark datasets CUHK03, Market1501, and MSMT17, and on the partial person reID dataset Partial REID.

 


1 Introduction

Person re-identification (reID) aims to identify/match persons across different places, times, or camera views. There are large variations in terms of human poses, capture viewpoints, and incompleteness of the visible bodies (due to occlusion). Such spatial semantics misalignment across 2D images is one of the key challenges for reID shen2015person (); varior2016siamese (); subramaniam2016deep (); su2017pose (); zheng2017pose (); zhang2017alignedreid (); yao2017deep (); li2017learning (); zhao2017spindle (); wei2017glad (); zheng2018pedestrian (); ge2018fd (); suh2018part (); qian2018pose (); zhang2019DSA (). For example, the same spatial position in two images may correspond to an eye in one image but a shoulder in another. Moreover, since each image is captured through a 2D projection of the 3D person and scene, only a partial 3D surface of a person is visible/projected. The visible appearance/texture/semantics across images are thus not consistent/aligned, i.e., there is semantics misalignment. Deep learning methods can deal with such diversity and misalignment to some extent, and approaches that explicitly use human pose information for alignment have demonstrated their superiority su2017pose (); zheng2017pose (); yao2017deep (); li2017learning (); zhao2017spindle (); wei2017glad (); suh2018part (). Thus, guiding the network to have alignment characteristics is important for robust reID.

Alignment: Part-based approaches learn and localize the body parts, e.g., head and upper body, and extract body-part representations to alleviate pose variations li2017learning (); yao2017deep (); zhao2017spindle (); kalayeh2018human (); zheng2017pose (); su2017pose (). During inference, these part detection sub-networks are required, which increases the computational complexity. Besides, the body-part alignment is coarse, and there is still spatial misalignment within the parts zhang2019DSA ().

Different from pose, which only identifies several key joints (e.g., 14), dense semantics assign each position on the 3D surface of a person a unique semantic identity (a 2D coordinate (u,v) in the canonical UV space) guler2018densepose (); guler2017densereg (). Person images represented in this semantics space are densely semantically aligned. Recently, Zhang et al. zhang2019DSA () first proposed a densely semantically aligned framework for reID. It takes the warped semantics-aligned images as input for feature learning and regularization. However, there is a lack of more direct constraints to enforce the alignment over the features. Besides, the semantics are not ideally aligned, since the visible semantics are in general not consistent. For example, a frontal person image and a back-facing person image have little overlap in the UV space.

Figure 1: Examples of texture images (in the first row) and the corresponding synthesized person images with different poses, viewpoints, and backgrounds (in the second row). A texture image represents the full texture of the 3D human surface in a surface-based canonical coordinate system (UV space). Each position (u,v) corresponds to a unique semantic identity. For person images of different persons/poses/viewpoints (in the second row), their corresponding texture images are densely semantically aligned across images.

Our work: We intend to thoroughly address the misalignment problem. We achieve this by proposing a simple yet powerful Semantics Aligning Network (SAN), which introduces an aligned texture generation sub-task, with the aligned texture image (see examples in Figure 1) as supervision. Our SAN enjoys the benefit of dense semantics alignment without increasing the complexity of inference. Figure 2 shows the framework of the SAN. It consists of a base network as encoder (SA-Enc), and a decoder sub-network (SA-Dec). The SA-Enc can be any baseline network used in person reID (e.g., ResNet-50 he2016deep ()), which outputs a last-layer feature map F. The reID feature vector f is then obtained by spatially average pooling F, followed by the reID losses. To encourage the encoder features to be semantically aligned, a reconstruction sub-task is introduced which generates the densely semantically aligned full texture image (we also refer to it as the texture image for short) with pseudo groundtruth supervision. We exploit a synthesized dataset for learning pseudo groundtruth texture image generation. Our method outperforms previous works on the benchmark datasets CUHK03 li2014deepreid (), Market1501 zheng2015scalable (), MSMT17 wei2018person (), and Partial REID zheng2015partial (), without introducing additional computational cost in inference, since the decoder is discarded at inference.

Our contributions: 1) We propose a simple yet powerful framework for solving the misalignment challenge in person reID without increasing the computational cost in inference. 2) A semantics alignment constraint is delicately introduced by empowering the encoded feature map with aligned full texture generation capability. 3) At the SA-Dec, besides the reconstruction loss, we propose triplet reID constraints over the feature maps as a perceptual metric. 4) There is no groundtruth aligned texture image for the person reID datasets. We address this by generating pseudo groundtruth texture images: synthesized data with (person image, aligned texture image) pairs (see Figure 1) is used to train the SAN (without reID supervisions), which is then used to generate pseudo groundtruth texture images.

2 Related Work

Person reID based on deep neural networks has made great progress in recent years. Due to the variations in poses, viewpoints, incompleteness of the visible bodies (due to occlusion), etc., across the images, misalignment is still one of the key challenges.

Alignment with Pose/Part Cues for ReID: To address the misalignment, most previous approaches make use of external cues such as pose/part li2017learning (); yao2017deep (); zhao2017spindle (); kalayeh2018human (); zheng2017pose (); su2017pose (); suh2018part (). Human landmark (pose) information can help align body regions across images. Zhao et al. zhao2017spindle () propose the human body region guided Spindle Net, where a body region proposal sub-network (trained on a human pose dataset) is used to extract body regions, e.g., the head-shoulder and arm regions. The semantic features from different body regions are separately captured, so the body-part features can be aligned across images. Kalayeh et al. kalayeh2018human () integrate a human semantic parsing branch into their network to generate probability maps associated with different semantic regions of the human body, e.g., head and upper body. Based on the probability maps, the features from different semantic regions are aggregated separately to obtain part-aligned features. Qian et al. qian2018pose () propose to use a GAN model to synthesize realistic person images of eight canonical poses for matching. However, these approaches usually require pose/part detection or image generation sub-networks and extra computational cost in inference. Moreover, the alignment based on pose is coarse, without considering the finer-grained alignment within a part across images. Last, the semantics misalignment caused by the inconsistency of the visible semantics across images is still under-explored.

Zhang et al. zhang2019DSA () study the exploitation of dense semantics alignment for reID. Rather than at the coarse pose level, they align the image based on dense pixel-level semantics. They extract features from the aligned texture images with holes (due to invisible body regions) to regularize the feature learning from the original image. However, there is a lack of more direct constraints to enforce the alignment. Besides, they do not solve the semantics inconsistency/misalignment problem caused by the inconsistency of the visible body regions across images. The design of efficient frameworks for dense semantics alignment is still under-explored. We propose a clean framework which adds direct constraints to encourage dense semantics alignment in feature learning.

Semantics Aligned Human Texture: A human body can be represented by a 3D mesh (e.g., the Skinned Multi-Person Linear Model, SMPL loper2015smpl ()) and a texture image varol2017learning (); hormann2007mesh (), as illustrated in Figure 3. Each position on the 3D body surface has a semantic identity (identified by a 2D coordinate (u,v) in the canonical UV space) and a texture representation (e.g., an RGB pixel value) guler2018densepose (); guler2017densereg (). The texture image in the UV coordinate system (i.e., the surface-based coordinate system) holds the aligned full texture of the 3D surface of the person. Note that the texture images of different persons are densely semantically aligned (see Figure 1). In guler2018densepose (), a dataset with labeled dense semantics (i.e., DensePose) is established and a CNN-based system is designed to estimate DensePose from person images. Neverova et al. neverova2018dense () and Wang et al. wang2019re () leverage the aligned texture image to synthesize person images of another pose or view. Yao et al. yao2019densebody () propose to regress the 3D human body ((x,y,z) coordinates in 3D space) in the semantics-aligned UV space, with the RGB person image as the input to a CNN.

Different from all these works, we leverage the densely semantically aligned full texture image to address the misalignment problem in person reID. We use them as direct supervisions to drive the reID network to learn semantics aligned features.

Figure 2: Illustration of the proposed Semantics Aligning Network (SAN), which consists of a base network as encoder (SA-Enc) and a decoder sub-network (SA-Dec). The reID feature vector f is obtained by average pooling the feature map F of the SA-Enc, followed by the reID losses L_id and L_tri. To encourage the encoder to learn semantically aligned features, the SA-Dec follows and regresses the densely semantically aligned full texture image under pseudo groundtruth supervision. The pseudo groundtruth generation is described in Sec. 3.1 and is not shown here. At the decoder, triplet reID constraints L_TriDec are added as a high-level perceptual metric. We use ResNet-50 with four residual blocks as our SA-Enc. In inference, the decoder is discarded.

3 Proposed Semantics Aligning Network (SAN)

To address the cross image misalignment challenge caused by human pose, capturing viewpoint variations, and the incompleteness of the body surface (due to the occlusion when projecting 3D person to 2D person image), we propose a Semantics Aligning Network (SAN) for robust person reID, in which densely semantically aligned full texture images are taken as supervision to drive the learning of semantics aligned features.

The proposed framework is shown in Figure 2. It consists of a base network as encoder (SA-Enc) for reID, and a decoder sub-network (SA-Dec) (see Sec. 3.2) for generating the densely semantically aligned full texture image under supervision. This encourages the reID network to learn semantics-aligned feature representations. Since there is no groundtruth texture image of the 3D human surface for the reID datasets, we use our synthesized data based on varol2017learning () to train the SAN (with the reID supervisions removed), which is then used to generate pseudo groundtruth texture images for the reID datasets (see Sec. 3.1).

The reID feature vector f is obtained by average pooling the last-layer feature map F of the SA-Enc, followed by the reID losses. The SA-Dec is added after the last layer of the SA-Enc to regress the densely semantically aligned texture image, with the (pseudo) groundtruth texture as supervision. At the decoder, triplet reID constraints are incorporated at the decoder blocks as a high-level perceptual metric to encourage identity-preserving reconstruction. During inference, the decoder is discarded.
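The overall data flow can be sketched in PyTorch as below. This is a minimal sketch under stated assumptions: the module split (sa_enc, classifier, sa_dec) and the placeholder decoder are illustrative, not the released implementation; only the ResNet-50 encoder, global average pooling, identity classifier, and texture-regression head follow the description above.

```python
import torch.nn as nn
import torchvision


class SAN(nn.Module):
    """Minimal sketch of the SAN data flow: encoder features are pooled into the
    reID vector, while a decoder head regresses the aligned texture image."""

    def __init__(self, num_ids, feat_dim=2048):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        # SA-Enc: ResNet-50 trunk up to the last residual block (drop avgpool/fc).
        self.sa_enc = nn.Sequential(*list(resnet.children())[:-2])
        self.classifier = nn.Linear(feat_dim, num_ids)   # identity logits for the ID loss
        # Placeholder for the SA-Dec; a residual pixel-shuffle block is sketched in the appendix.
        self.sa_dec = nn.Sequential(
            nn.Conv2d(feat_dim, 512, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Upsample(size=(256, 256), mode='bilinear', align_corners=False),
            nn.Conv2d(512, 3, kernel_size=3, padding=1),  # 256x256 RGB texture image
        )

    def forward(self, images):
        fmap = self.sa_enc(images)          # B x 2048 x H x W feature map F
        feat = fmap.mean(dim=[2, 3])        # global average pooling -> reID feature f
        logits = self.classifier(feat)      # fed to the ID loss
        texture = self.sa_dec(fmap)         # regressed aligned texture image
        return feat, logits, texture
```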

Figure 3: Illustration of the generation of synthesized person image to form a (person image, texture image) pair. Given a texture image, a 3D mesh, a background image, and rendering parameters, we can obtain a 2D person image through the rendering.

3.1 Densely Semantically Aligned Texture Image

Background: The person texture image in the surface-based coordinate system (UV space) is widely used in the graphics field hormann2007mesh (). Texture images for different persons/viewpoints/poses are densely semantically aligned, as illustrated in Figure 1. Each position (u,v) corresponds to a unique semantic identity on the texture image, e.g., the pixel on the right bottom of the texture image corresponds to some semantics of a hand. Besides, a texture image contains all the texture of the full 3D surface of a person. In contrast, only a part of the surface texture is visible/projected on a 2D person image.

Motivation: We intend to leverage such aligned texture images to drive the reID network to learn semantics-aligned features. For different input person images, the corresponding texture images are well semantically aligned. First, for the same spatial positions on different texture images, the semantics are the same. Second, for person images with different visible semantics, their texture images are still semantics consistent/aligned, since each one contains the full texture of the 3D person surface.

Pseudo Groundtruth Texture Image Generation: For the images in the reID datasets, however, there are no groundtruth aligned full texture images. We propose to train the SAN using our synthesized data to enable the generation of a pseudo groundtruth texture image for each image in the reID datasets. We leverage a CNN-based network to generate pseudo groundtruth texture images. In this work, we reuse the proposed SAN (with the reID supervisions removed) as this network (see Figure 2), which we refer to as SAN-PG (Semantics Aligning Network for Pseudo Groundtruth Generation) for differentiation. Given an input person image, the SAN-PG outputs a predicted texture image as the pseudo groundtruth.

To train the SAN-PG, we synthesize a Paired-Image-Texture dataset (PIT dataset) based on the SURREAL dataset varol2017learning (), for the purpose of providing image pairs, i.e., a person image and its texture image. The texture image stores the RGB texture of the full 3D person surface. As illustrated in Figure 3, given a texture image, a 3D mesh/shape, and a background image, a 2D projection of a 3D person can be obtained by rendering varol2017learning (). We can control the pose and body form of the person, and the projection viewpoint, by changing the parameters of the 3D mesh/shape model (i.e., SMPL loper2015smpl ()) and the rendering parameters. Note that we do not include identity information in the PIT dataset.

To generate the PIT dataset with paired person images and texture images, in particular, we use 929 (451 female and 478 male) raster-scanned texture maps provided by the SURREAL dataset varol2017learning () to generate the person image and texture image pairs. These texture images are aligned with the SMPL default two-dimensional UV coordinate space (UV space): the same (u,v) coordinate value corresponds to the same semantics. To have a large diversity of 3D human shapes and poses, we create 9290 different 3D human meshes based on the SMPL body model loper2015smpl (), with each texture image assigned to 10 different 3D meshes. Subsequently, we render these 3D meshes with the corresponding texture image by the Neural Renderer kato2018neural (). To simulate real-world scenes, the background images for rendering are randomly sampled from the COCO dataset lin2014microsoft (). Each synthetic person image is centered on a person, with resolution 256×128. The resolution of the texture images is 256×256.
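The pairing bookkeeping can be sketched as below. The SMPL mesh creation and Neural Renderer projection are abstracted behind a hypothetical render_person stub, since the exact rendering API is not specified here; the SMPL parameter shapes (72-D pose, 10-D shape) are standard, while the sampling ranges are purely illustrative.

```python
import random
from pathlib import Path


def render_person(texture_path, pose_params, shape_params, background_path):
    """Hypothetical hook: wraps SMPL mesh creation + Neural Renderer projection.
    Assumed to return a 256x128 RGB person image; plug in the real renderer here."""
    raise NotImplementedError("replace with SMPL + Neural Renderer rendering")


def build_pit_pairs(texture_dir, background_dir, meshes_per_texture=10, seed=0):
    """Enumerate (person image, texture image) pairs for the PIT dataset."""
    rng = random.Random(seed)
    textures = sorted(Path(texture_dir).glob("*.png"))        # 929 SURREAL texture maps
    backgrounds = sorted(Path(background_dir).glob("*.jpg"))  # COCO images as backgrounds
    pairs = []
    for tex in textures:
        for _ in range(meshes_per_texture):                   # 10 meshes per texture -> 9290 meshes
            pose = [rng.uniform(-1.0, 1.0) for _ in range(72)]  # SMPL pose parameters (illustrative)
            shape = [rng.gauss(0.0, 1.0) for _ in range(10)]    # SMPL shape parameters (illustrative)
            bg = rng.choice(backgrounds)
            person = render_person(tex, pose, shape, bg)       # rendered 256x128 person image
            pairs.append((person, tex))                        # paired with the 256x256 aligned texture
    return pairs
```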

Discussion.

The texture images which we use for supervision have three major advantages. 1) They are spatially aligned in terms of the dense semantics of the person surface and thus can guide the reID network to learn semantics-aligned representations. 2) A texture image contains the full texture of the 3D surface of a person, which can guide the reID network to learn a more comprehensive representation of a person. 3) They represent the textures of the human body surface and thus naturally eliminate the interference of diverse background scenes.

There are also some limitations of the current pseudo groundtruth texture image generation process. 1) There is a domain gap between the synthetic 2D images (in the PIT dataset) and real-world captured images, where the synthetic persons are not very realistic. 2) The number of texture images provided by SURREAL varol2017learning () is not large (i.e., 929 in total), which may constrain the diversity of the data in our synthesized dataset. 3) In SURREAL, all faces in the texture images are replaced by an average face of either a man or a woman varol2017learning (). We leave addressing these limitations as future work. Even with such gaps, our scheme achieves significant performance improvements over the baseline on person reID.

3.2 Semantics Aligning Network and Optimization

As illustrated in Figure 2, the SAN consists of an encoder SA-Enc for person reID, and a decoder SA-Dec which enforces constraints over the encoder by requiring the encoded features to be able to predict the semantically aligned full texture images.

SA-Enc: We can use any baseline network used in person reID (e.g., ResNet-50 bai2017deep (); sun2017beyond (); zhang2017alignedreid (); almazan2018re (); zhang2019DSA ()) as the SA-Enc. In this work, we similarly use ResNet-50, which consists of four residual blocks. The output feature map F of the fourth block is spatially average pooled to get the feature vector f (2048 dimensions), which is the reID feature used for matching.

For the purpose of reID, on the feature vector f, we add the widely used identification loss (ID Loss) L_id, i.e., the cross-entropy loss for identity classification, and the ranking loss, i.e., the triplet loss with batch hard mining hermans2017defense () (Triplet Loss) L_tri, as the loss functions in training.
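A compact sketch of the two reID losses on the pooled feature f is given below: cross-entropy over identity logits, plus the batch-hard triplet loss of hermans2017defense (). The margin value (0.3) mirrors the one reported for the decoder constraint and is an assumption here; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F


def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """Triplet loss with batch-hard mining: for each anchor, take the hardest
    positive and hardest negative inside the mini-batch (hermans2017defense)."""
    dist = torch.cdist(feats, feats, p=2)                     # B x B pairwise L2 distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)      # B x B boolean identity mask
    d_pos = dist.masked_fill(~same_id, float('-inf')).max(dim=1).values  # hardest positive
    d_neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values    # hardest negative
    return F.relu(d_pos - d_neg + margin).mean()


def reid_losses(logits, feats, labels):
    l_id = F.cross_entropy(logits, labels)            # identification (ID) loss L_id
    l_tri = batch_hard_triplet_loss(feats, labels)    # ranking (triplet) loss L_tri
    return l_id, l_tri
```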

SA-Dec: To encourage the encoder to learn semantics-aligned features, we add a decoder SA-Dec after the fourth block of the encoder to regress the densely semantically aligned texture image, supervised by the (pseudo) groundtruth texture image. A reconstruction loss L_rec is introduced to minimize the L1 difference between the generated texture image and its corresponding (pseudo) groundtruth texture image.

Triplet ReID Constraints at SA-Dec:

Besides the capability of reconstructing the texture images, optimized/measured by the L1 distance, we also expect the features in the decoder to inherit the capability of distinguishing different identities. Wang et al. wang2019re () use a reID network as the perceptual supervision to generate person textures, which judges whether the generated person image and the real image have the same identity. Different from wang2019re (), considering that the features at each layer of the decoder are spatially semantically aligned across images, we measure the feature distance at each spatial position rather than on the final globally pooled feature. We introduce constraints to minimize the L2 differences between the features of the same identity and maximize those of different identities. Specifically, for an anchor sample a in a batch, we randomly select a positive sample p (with the same identity) and a negative sample n. The triplet ReID constraint/loss over the output feature map of the j-th block of the SA-Dec is defined as

$$
\mathcal{L}_{TriDec}^{j} = \frac{1}{H_j W_j}\sum_{h=1}^{H_j}\sum_{w=1}^{W_j}\max\Big(0,\ \big\|f^{a}_{j,(h,w)} - f^{p}_{j,(h,w)}\big\|_2 - \big\|f^{a}_{j,(h,w)} - f^{n}_{j,(h,w)}\big\|_2 + m\Big), \qquad (1)
$$

where H_j × W_j is the spatial resolution of the feature map F_j with C_j channels output by the j-th block of the SA-Dec; F^a_j, F^p_j, and F^n_j denote the feature maps of the anchor, positive, and negative samples; and f_{j,(h,w)} denotes the feature vector of C_j channels at spatial position (h,w). The margin parameter m is set to 0.3 experimentally.
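The constraint can be written compactly in PyTorch. This is a minimal sketch of Eq. (1), assuming the per-position hinge is averaged over the H×W grid; tensor layouts and names are illustrative.

```python
import torch.nn.functional as F


def decoder_triplet_constraint(f_a, f_p, f_n, margin=0.3):
    """Triplet reID constraint on a decoder feature map (Eq. 1): the L2 distance
    is computed per spatial position and the hinge is averaged over H x W.
    f_a, f_p, f_n: C x H x W feature maps of anchor / positive / negative samples."""
    d_pos = (f_a - f_p).pow(2).sum(dim=0).sqrt()   # H x W map of anchor-positive distances
    d_neg = (f_a - f_n).pow(2).sum(dim=0).sqrt()   # H x W map of anchor-negative distances
    return F.relu(d_pos - d_neg + margin).mean()   # average over all spatial positions
```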

Training Scheme: There are two steps for training our proposed SAN framework for reID.

In the first step (Step-1), we train a network for the purpose of generating pseudo groundtruth texture images for any given input person image. For simplicity, we reuse a simplified SAN (i.e., the SAN-PG), which consists of the SA-Enc and SA-Dec but uses only the reconstruction loss L_rec. We train the SAN-PG with our synthesized PIT dataset. The SAN-PG model is then used to generate pseudo groundtruth texture images for the reID datasets (such as CUHK03 li2014deepreid ()).
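Once Step-1 is done, the pseudo groundtruth can be produced offline, roughly as sketched below. This assumes the data loader also yields image identifiers and that the model returns the regressed texture as its last output (matching the SAN sketch above); the file handling is illustrative.

```python
import torch


@torch.no_grad()
def generate_pseudo_gt(san_pg, reid_loader, out_dir):
    """Run the trained SAN-PG over a reID dataset and cache the predicted texture
    images as pseudo groundtruth for Step-2."""
    san_pg.eval()
    for images, names in reid_loader:               # names identify where to store outputs
        textures = san_pg(images)[-1]               # assumed: texture is the last forward output
        for tex, name in zip(textures, names):
            torch.save(tex.cpu(), f"{out_dir}/{name}.pt")   # one pseudo-GT texture per image
```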

In the second step (Step-2), we train the SAN for both reID and aligned texture generation. The pre-trained weights of the SAN-PG are used to initialize the SAN. One alternative is to use only the reID dataset for training the SAN, where the pseudo groundtruth texture images are used for supervision and all the losses are added. The other strategy is to use the reID dataset and the synthesized PIT dataset alternately during training. We find the latter solution gives superior results, because the groundtruth texture images of the synthesized PIT dataset have higher quality than the pseudo groundtruth texture images of the reID dataset. The overall loss consists of the ID Loss L_id, the Triplet Loss L_tri, the reconstruction loss L_rec, and the constraint L_TriDec, i.e., L = λ1 L_id + λ2 L_tri + λ3 L_rec + λ4 L_TriDec. For a batch of reID data, we experimentally set λ1 to λ4 as 0.5, 1.5, 1, 1. For a batch of synthesized data, λ1 to λ4 are set to 0, 0, 1, 0, i.e., the reID losses and triplet ReID constraints (losses) are not used.
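The loss weighting for the two batch types can be expressed directly from the values above; a short sketch (the function name and interface are illustrative):

```python
def total_loss(l_id, l_tri, l_rec, l_tridec, is_reid_batch):
    """Combine the four losses with the reported weights: (0.5, 1.5, 1, 1) for a
    reID batch and (0, 0, 1, 0) for a synthesized PIT batch (no identity labels)."""
    w = (0.5, 1.5, 1.0, 1.0) if is_reid_batch else (0.0, 0.0, 1.0, 0.0)
    return w[0] * l_id + w[1] * l_tri + w[2] * l_rec + w[3] * l_tridec
```

In the alternating strategy, reID batches and PIT batches are interleaved (the appendix ablates the ratio and finds 1:1 works best).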

4 Experiment

4.1 Datasets and Evaluation Metrics

We conduct experiments on six benchmark person reID datasets, including CUHK03 li2014deepreid (), Market1501 zheng2015scalable (), DukeMTMC-reID zheng2017unlabeled (), the large-scale MSMT17 wei2018person (), and two challenging partial person reID datasets of Partial REID zheng2015partial () and Partial-iLIDS he2018deep () (see the supplementary for more details).

We follow common practice and use the cumulative matching characteristics (CMC) at Rank-k, k = 1, 5, or 10, and the mean average precision (mAP) to evaluate the performance.
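For reference, one common way to compute these metrics from a query-gallery distance matrix is sketched below. It is a simplification (no same-camera/junk filtering, which some benchmark protocols apply), so it is not the exact evaluation code of the datasets.

```python
import numpy as np


def cmc_map(dist, q_ids, g_ids, ks=(1, 5, 10)):
    """Rank-k CMC and mAP from a (num_query x num_gallery) distance matrix,
    treating every gallery entry as a valid candidate."""
    order = np.argsort(dist, axis=1)                 # gallery indices sorted per query
    matches = g_ids[order] == q_ids[:, None]         # boolean match matrix in rank order
    cmc = {k: float(np.mean(matches[:, :k].any(axis=1))) for k in ks}
    aps = []
    for row in matches:
        hits = np.where(row)[0]                      # ranks (0-based) of the true matches
        if hits.size == 0:
            continue
        precision = np.arange(1, hits.size + 1) / (hits + 1)  # precision at each true match
        aps.append(precision.mean())
    return cmc, float(np.mean(aps))
```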

4.2 Implementation Details

We use ResNet-50 he2016deep () (which is widely used in reID systems bai2017deep (); sun2017beyond (); zhang2017alignedreid (); almazan2018re (); zhang2019DSA ()) to build our SA-Enc. We also take it as our baseline (Baseline), trained with both the ID loss and the triplet loss. Similar to sun2017beyond (); zhang2019DSA (), the last spatial down-sampling operation in the last Conv block is removed. We build a lightweight decoder SA-Dec by simply stacking 4 residual up-sampling blocks, with about 1/3 of the parameters of the SA-Enc. This facilitates training our model on a single GPU card. The detailed network structure of the SA-Dec, data augmentation, and optimization details are in the supplementary.
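For recent torchvision versions, removing the last spatial down-sampling of ResNet-50 amounts to setting the stride of the first bottleneck of layer4 (and its shortcut) to 1; a sketch of how this is typically done, which may differ from the paper's exact code:

```python
import torch.nn as nn
import torchvision


def build_sa_enc():
    """ResNet-50 trunk with the last spatial down-sampling removed, as commonly
    done in reID baselines: the feature map stays at 1/16 resolution."""
    resnet = torchvision.models.resnet50(pretrained=True)
    resnet.layer4[0].conv2.stride = (1, 1)             # 3x3 conv of the first bottleneck in layer4
    resnet.layer4[0].downsample[0].stride = (1, 1)     # match the residual shortcut
    return nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc; output B x 2048 x H x W
```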

4.3 Ablation Study

We perform comprehensive ablation studies to demonstrate the effectiveness of the designs in the SAN framework, on the datasets of CUHK03 (labeled bounding box setting) and Market-1501 (single query setting).

Effectiveness of Dense Semantics Alignment. Table 1 shows the comparisons of our schemes. SAN-basic denotes our basic semantics aligning model, which is trained with the supervision of the pseudo groundtruth texture images using the reconstruction loss L_rec and the reID losses L_id and L_tri (see Figure 2). SAN w/ L_TriDec denotes that the triplet reID constraints at the SA-Dec are added on top of the SAN-basic. SAN w/ syn. data denotes that the (person image, texture image) pairs of our PIT dataset are also used in training on top of the SAN-basic network. SAN denotes our final scheme, with both the triplet reID constraints and the groundtruth texture image supervision from the PIT dataset on top of the SAN-basic network.

Model CUHK03(L) Market1501
Rank-1 mAP Rank-1 mAP
Baseline (ResNet-50) 73.7 69.8 94.1 83.2
SAN-basic 77.9 73.7 95.1 85.8
SAN w/ L_TriDec 78.9 74.9 95.4 86.9
SAN w/ syn. data 78.8 75.8 95.7 86.8
SAN 80.1 76.4 96.1 88.0
Table 1: Performance (%) comparisons of our SAN and baseline.

We have the following observations/conclusions. 1) Thanks to the driving of learning semantics-aligned features, our SAN-basic significantly outperforms the baseline scheme by about 4% in both Rank-1 and mAP accuracy on CUHK03. 2) The introduction of the high-level triplet reID constraints (L_TriDec) as the perceptual loss regularizes the feature learning and brings additional 1.0% and 1.2% improvements in Rank-1 and mAP accuracy on CUHK03. Note that we add them after each of the first three blocks in the SA-Dec (see the supplementary for the ablation study). 3) The use of the synthesized PIT dataset (syn. data), with input image and groundtruth texture image pairs, for training the SAN remedies the imperfection of the generated pseudo groundtruth texture images (with errors/noise/blurring). It improves the performance over SAN-basic by 0.9% and 2.1% in Rank-1 and mAP accuracy. Please see the supplementary for the detailed ablation study on how to allocate the ratio of the reID data and the synthesized PIT data. 4) Our final scheme SAN significantly outperforms the baseline, i.e., by 6.4% and 6.6% in Rank-1 and mAP accuracy on CUHK03, with the same inference complexity. On Market1501, even though the baseline performance is already very high, our SAN achieves 2.0% and 4.8% improvements in Rank-1 and mAP.

Analysis of Different Reconstruction Guidance. We study the effect of using different reconstruction guidance and show results in Table 2. We design another two schemes for comparison. Based on the input image, the three schemes use the same encoder-decoder networks (the same network as SAN-basic) but reconstruct (a) the input person image, (b) a pose-aligned person image, and (c) the proposed texture image, respectively (see Figure 4). Note that to have pose-aligned person images as supervision, during the synthesis of the PIT dataset, we also synthesized, for each projected person image, a person image of a given fixed pose (the frontal pose here). Thus, the pose-aligned person images are also semantically aligned. In this case, however, only the partial texture (frontal body regions) of the full 3D surface texture is retained, with information loss.

Figure 4: The same encoder-decoder networks but with different reconstruction objectives of reconstructing the (a) input image, (b) pose aligned person image, and (c) texture image, respectively.
Model CUHK03(L) Market1501
Rank-1 mAP Rank-1 mAP
Baseline (ResNet-50) 73.7 69.8 94.1 83.2
Enc-Dec rec. input 74.4 70.8 94.3 84.0
Enc-Dec rec. pose 75.8 72.0 94.4 84.5
Enc-Dec rec. texture (SAN-basic) 77.9 73.7 95.1 85.8
Table 2: Performance (%) comparisons of the same encoder-decoder networks but with different reconstruction objectives of reconstructing the input image, pose aligned person image, and texture image respectively.

From Table 2, we have the following observations/conclusions. 1) Adding a reconstruction sub-task, which encourages the encoded feature to preserve the original information, helps improve the reID performance. Enc-Dec rec. input improves the performance of the baseline by 0.7% and 1.0% in Rank-1 and mAP accuracy. However, the input images (and their reconstructions) are not semantically aligned across images. 2) Enc-Dec rec. pose uses pose-aligned person images as the supervision. It performs better than Enc-Dec rec. input, demonstrating the effectiveness of alignment. However, it is sub-optimal and may lose information. For example, for an input back-facing person image, such fixed (frontal) pose supervision may mistakenly guide the features to drop the back-facing body information. 3) In contrast, our full aligned texture image as supervision provides comprehensive and densely semantics-aligned information, which results in the best performance.

Why not Directly use Generated Texture Image for ReID?

How about the performance when the generated texture images are used as the input for reID? Results show that our scheme significantly outperforms such variants (see the supplementary for more details). Their inferior performance is caused by the low quality of the generated texture images (with the texture smoothed/blurred).

4.4 Comparison with State-of-the-Art

Table 3 shows the performance comparisons of our proposed SAN with the state-of-the-art methods. Our scheme SAN achieves the best performance on CUHK03, Market1501, and MSMT17. It consistently outperforms the approach DSA-reID zhang2019DSA () which also considers the dense alignment. On the DukeMTMC-reID dataset, MGN wang2018learning () achieves better performance which ensembles the local features of multiple granularities and the global features.

Method   CUHK03(Labeled)   CUHK03(Detected)   Market1501   DukeMTMC-reID   MSMT17
(each dataset: Rank-1 mAP)
IDE(ECCV18) sun2017beyond () 43.8 38.9 - - 85.3 68.5 73.2 52.8 - -
Pose/Part/Mask-related:
MGN(MM18) wang2018learning () 68.0 67.4 66.8 66.0 95.7 86.9 88.7 78.4 - -
AACN(CVPR18) xu2018attention () - - - - 85.9 66.9 76.8 59.3 - -
MGCAM(CVPR18) song2018mask () 50.1 50.2 46.7 46.9 83.8 74.3 - - - -
MaskReID(ArXiv18) qi2018maskreid () - - - - 90.0 70.3 78.9 61.9 - -
SPReID(CVPR18) kalayeh2018human () - - - - 92.5 81.3 84.4 71.0 - -
Pose Transfer(CVPR18) liu2018pose () 33.8 30.5 30.1 28.2 87.7 68.9 68.6 48.1 - -
PSE(CVPR18) sarfraz2017pose () - - 30.2 27.3 87.7 69.0 79.8 62.0 - -
PN-GAN(ECCV18) qian2018pose () - - - - 89.4 72.6 73.6 53.2 - -
Part-Aligned(ECCV18) suh2018part () - - - - 91.7 79.6 84.4 69.3 - -
PCB+RPP(ECCV18) sun2017beyond () 63.7 57.5 - - 93.8 81.6 83.3 69.2 - -
Attention-based:
DuATM(CVPR18) si2018dual () - - - - 91.4 76.6 81.8 64.6 - -
Mancs(ECCV18) wang2018mancs () 69.0 63.9 65.5 60.5 93.1 82.3 84.9 71.8 - -
FD-GAN(NIPS18) ge2018fd () - - - - 90.5 77.7 80.0 64.5 - -
HPM(AAAI19) fu2018horizontal () 63.9 57.5 - - 94.2 82.7 86.6 74.3 - -
Semantics:
DSA-reID(CVPR19) zhang2019DSA () 78.9 75.2 78.2 73.1 95.7 87.6 86.2 74.3 - -
Others:
GoogLeNet(CVPR18) wei2018person () - - - - - - - - 47.6 23.0
PDC(CVPR18) wei2018person () - - - - - - - - 58.0 29.7
GLAD(CVPR18) wei2018person () - - - - - - - - 61.4 34.0
Ours:
Baseline (ResNet-50) 73.7 69.8 69.7 66.1 94.1 83.2 85.9 71.8 73.8 47.2
SAN 80.1 76.4 78.4 74.6 96.1 88.0 87.9 75.5 79.2 55.7
Table 3: Performance (%) comparisons with the state-of-the-art methods. Bold numbers denote the best performance, while the numbers with underlines denote the second best.

4.5 Visualization Analysis of Generated Texture Image

For different input images with varied poses/viewpoints/scales, we find that the texture images generated by our SAN are well semantically aligned (see the supplementary).

4.6 Partial Person ReID

Partial person reID is more challenging, with severe misalignment problems: two partial person images are generally not spatially semantics-aligned and usually have less overlapped semantics. Benefiting from the aligned full texture generation capability, our final SAN outperforms the baseline significantly, by 5.8% and 7.6% in Rank-1 accuracy on Partial REID zheng2015partial () and Partial-iLIDS he2018deep (), respectively. More results and visualizations can be found in the supplementary.

5 Conclusion

In this paper, we proposed a simple yet powerful Semantics Aligning Network (SAN) for learning semantics-aligned feature representations for efficient person reID, under the joint supervisions of person reID and semantics-aligned texture generation. At the decoder, we add triplet reID constraints over the feature maps as a perceptual loss to regularize the learning. We synthesized a Paired-Image-Texture (PIT) dataset to train the SAN-PG model, which generates pseudo groundtruth texture images for the reID datasets, and to train the SAN. Our proposed SAN achieves state-of-the-art performance on CUHK03, Market1501, MSMT17, and Partial REID, without increasing the computational cost in inference.

References

  • [1] Yang Shen, Weiyao Lin, Junchi Yan, Mingliang Xu, Jianxin Wu, and Jingdong Wang. Person re-identification with correspondence structure learning. In ICCV, 2015.
  • [2] Rahul Rama Varior, Bing Shuai, Jiwen Lu, Dong Xu, and Gang Wang. A siamese long short-term memory architecture for human re-identification. In ECCV, 2016.
  • [3] Arulkumar Subramaniam, Moitreya Chatterjee, and Anurag Mittal. Deep neural networks with inexact matching for person re-identification. In NeurIPS, pages 2667–2675, 2016.
  • [4] Chi Su, Jianing Li, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Pose-driven deep convolutional model for person re-identification. In ICCV, 2017.
  • [5] Liang Zheng, Yujia Huang, Huchuan Lu, and Yi Yang. Pose invariant embedding for deep person re-identification. arXiv preprint arXiv:1701.07732, 2017.
  • [6] Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, and Jian Sun. Alignedreid: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
  • [7] Hantao Yao, Shiliang Zhang, Yongdong Zhang, Jintao Li, and Qi Tian. Deep representation learning with part loss for person re-identification. arXiv preprint arXiv:1707.00798, 2017.
  • [8] Dangwei Li, Xiaotang Chen, Zhang Zhang, and Kaiqi Huang. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, 2017.
  • [9] Haiyu Zhao, Maoqing Tian, Shuyang Sun, Jing Shao, Junjie Yan, Shuai Yi, Xiaogang Wang, and Xiaoou Tang. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In CVPR, 2017.
  • [10] Longhui Wei, Shiliang Zhang, Hantao Yao, Wen Gao, and Qi Tian. Glad: global-local-alignment descriptor for pedestrian retrieval. In ACM Multimedia, pages 420–428, 2017.
  • [11] Zhedong Zheng, Liang Zheng, and Yi Yang. Pedestrian alignment network for large-scale person re-identification. TCSVT, 2018.
  • [12] Yixiao Ge, Zhuowan Li, Haiyu Zhao, Guojun Yin, Shuai Yi, Xiaogang Wang, et al. Fd-gan: Pose-guided feature distilling gan for robust person re-identification. In NeurIPS, pages 1222–1233, 2018.
  • [13] Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. Part-aligned bilinear representations for person re-identification. In ECCV, 2018.
  • [14] Xuelin Qian, Yanwei Fu, Wenxuan Wang, Tao Xiang, Yang Wu, Yu-Gang Jiang, and Xiangyang Xue. Pose-normalized image generation for person re-identification. In ECCV, 2018.
  • [15] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. Densely semantically aligned person re-identification. In CVPR, 2019.
  • [16] Mahdi M Kalayeh, Emrah Basaran, Muhittin Gökmen, Mustafa E Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR, 2018.
  • [17] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. CVPR, 2018.
  • [18] Riza Alp Güler, George Trigeorgis, Epameinondas Antonakos, Patrick Snape, Stefanos Zafeiriou, and Iasonas Kokkinos. Densereg: Fully convolutional dense shape regression in-the-wild. In CVPR, 2017.
  • [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [20] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
  • [21] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
  • [22] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In CVPR, pages 79–88, 2018.
  • [23] Wei-Shi Zheng, Xiang Li, Tao Xiang, Shengcai Liao, Jianhuang Lai, and Shaogang Gong. Partial person re-identification. In ICCV, pages 4678–4686, 2015.
  • [24] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. TOG, 34(6):248, 2015.
  • [25] Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, pages 109–117, 2017.
  • [26] Kai Hormann, Bruno Lévy, and Alla Sheffer. Mesh parameterization: Theory and practice. 2007.
  • [27] Natalia Neverova, Riza Alp Guler, and Iasonas Kokkinos. Dense pose transfer. In ECCV, pages 123–138, 2018.
  • [28] Jian Wang, Yunshan Zhong, Yachun Li, Chi Zhang, and Yichen Wei. Re-identification supervised texture generation. In CVPR, 2019.
  • [29] Pengfei Yao, Zheng Fang, Fan Wu, Yao Feng, and Jiwei Li. Densebody: Directly regressing dense 3d human pose and shape from a single color image. arXiv preprint arXiv:1903.10153, 2019.
  • [30] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In CVPR, pages 3907–3916, 2018.
  • [31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
  • [32] Xiang Bai, Mingkun Yang, Tengteng Huang, Zhiyong Dou, Rui Yu, and Yongchao Xu. Deep-person: Learning discriminative deep features for person re-identification. arXiv preprint arXiv:1711.10658, 2017.
  • [33] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling. In ECCV, 2018.
  • [34] Jon Almazan, Bojana Gajic, Naila Murray, and Diane Larlus. Re-id done right: towards good practices for person re-identification. arXiv preprint arXiv:1801.05339, 2018.
  • [35] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [36] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717, 2017.
  • [37] Lingxiao He, Jian Liang, Haiqing Li, and Zhenan Sun. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In CVPR, pages 7073–7082, 2018.
  • [38] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. ACM Multimedia, 2018.
  • [39] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. arXiv preprint arXiv:1805.03344, 2018.
  • [40] Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Mask-guided contrastive attention model for person re-identification. In CVPR, 2018.
  • [41] Lei Qi, Jing Huo, Lei Wang, Yinghuan Shi, and Yang Gao. Maskreid: A mask based deep ranking neural network for person re-identification. arXiv preprint arXiv:1804.03864, 2018.
  • [42] Jinxian Liu, Bingbing Ni, Yichao Yan, Peng Zhou, Shuo Cheng, and Jianguo Hu. Pose transferrable person re-identification. In CVPR, 2018.
  • [43] M Saquib Sarfraz, Arne Schumann, Andreas Eberle, and Rainer Stiefelhagen. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR, 2018.
  • [44] Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, 2018.
  • [45] Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In ECCV, 2018.
  • [46] Yang Fu, Yunchao Wei, Yuqian Zhou, Honghui Shi, Gao Huang, Xinchao Wang, Zhiqiang Yao, and Thomas Huang. Horizontal pyramid matching for person re-identification. arXiv preprint arXiv:1804.05275, 2018.
  • [47] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pages 1874–1883, 2016.
  • [48] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
  • [49] Yan Wang, Lequn Wang, Yurong You, Xu Zou, Vincent Chen, Serena Li, Gao Huang, Bharath Hariharan, and Kilian Q Weinberger. Resource aware person re-identification across multiple resolutions. In CVPR, 2018.
  • [50] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
  • [51] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, 2017.
  • [52] Lingxiao He, Zhenan Sun, Yuhao Zhu, and Yunbo Wang. Recognizing partial biometric patterns. arXiv preprint arXiv:1810.07399, 2018.
  • [53] Yutian Lin, Liang Zheng, Zhedong Zheng, Yu Wu, and Yi Yang. Improving person re-identification by attribute and identity learning. arXiv preprint arXiv:1703.07220, 2017.
  • [54] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Person re-identification by probabilistic relative distance comparison. In CVPR, pages 649–656, 2011.

Appendix

Appendix 1 More Ablation Studies

Which Block/Layer to Add the Triplet ReID Constraint in the SA-Dec?

We study to which layer the triplet ReID constraint should be added. For the SA-Dec, which consists of four residual blocks, we add the constraint L_TriDec^j on the last layer of a single block (i.e., block j = 1, 2, 3, 4, respectively). Table 4 shows the results. The improvement is obvious when adding the constraint to one of the first three blocks and is smaller on the last block. We reason that the features from the first three blocks represent high-level features, whereas the output of the last block is the RGB appearance, for which there is already groundtruth supervision. When L_TriDec is added to all of the first three blocks, i.e., SAN w/ L_TriDec, our scheme achieves the best performance and outperforms the SAN-basic by +1.0% and +1.2% in Rank-1 and mAP, respectively, on CUHK03.

SAN w/ L_TriDec (pooled) shows the results when the distance is calculated on the spatially average pooled feature vectors. Considering that the features in the SA-Dec are spatially semantically aligned, we instead calculate the feature distance position-wise, as in the final design. This achieves a 0.8% improvement in both Rank-1 and mAP on CUHK03.

Model CUHK03(L) Market1501
Rank-1 mAP Rank-1 mAP
SAN-basic 77.9 73.7 95.1 85.8
SAN w/ L_TriDec (block 1) 78.7 74.6 95.4 86.4
SAN w/ L_TriDec (block 2) 78.4 74.1 95.2 86.2
SAN w/ L_TriDec (block 3) 77.9 74.0 95.1 86.0
SAN w/ L_TriDec (block 4) 77.8 73.6 95.2 86.0
SAN w/ L_TriDec (pooled) 78.1 74.1 95.2 86.3
SAN w/ L_TriDec (first three blocks) 78.9 74.9 95.4 86.9
Table 4: Performance (%) comparisons of adding the triplet reID constraints to different blocks of the SA-Dec.

Joint Training SAN with the reID Dataset and the Synthesized PIT Dataset.

Table 5 shows the influence of different allocation ratios of the reID data and the synthesized data of the PIT dataset when training the SAN. For example, reID:PIT = 3:1 denotes that every four training batches contain three batches of reID data and one batch of data from the PIT dataset. We find that as the proportion of synthesized data (PIT dataset) increases, the reID performance improves and reaches its best at reID:PIT = 1:1. The groundtruth texture images of the PIT dataset have higher quality than the pseudo groundtruth texture images of the reID dataset and are thus helpful for guiding the network. However, when the ratio of synthesized data is too high, the optimization opportunities from the reID loss supervisions L_id and L_tri become fewer, since the synthesized data does not have reID labels.

Model CUHK03(L) Market1501
Rank-1 mAP Rank-1 mAP
SAN-basic 77.9 73.7 95.1 85.8
SAN w/ syn. data (reID:PIT = 3:1) 78.2 74.5 95.1 86.0
SAN w/ syn. data (reID:PIT = 2:1) 78.6 75.3 95.4 86.3
SAN w/ syn. data (reID:PIT = 1:1) 78.8 75.8 95.7 86.8
SAN w/ syn. data (reID:PIT = 1:2) 78.5 75.5 95.6 86.4
SAN w/ syn. data (reID:PIT = 1:3) 78.4 75.0 95.3 86.1
Table 5: Performance (%) influence of the ratio of the reID data and the data of the PIT dataset in the iterative training of the SAN.

Why not Directly use Generated Texture Image for ReID?

We show the experimental results in Table 6. ResNet-50 (texture) w/ data aug. and ResNet-50 (texture) w/o data aug. denote the models trained with the pseudo groundtruth texture images generated by our SAN-PG model as the input to the ResNet-50 baseline network, where the former uses data augmentation of random cropping and horizontal flipping. The model without such data augmentation outperforms the one with data augmentation by 5.7% and 8.8% in Rank-1 and mAP on CUHK03. This is because such augmentation destroys the alignment of the input texture images, which also demonstrates that the alignment is very helpful for reID. However, ResNet-50 (texture) w/o data aug. still performs poorly in comparison with our SAN. The inferior performance is caused by the low quality of the generated texture images (with the texture smoothed/blurred).

Model CUHK03(L) Market1501
Rank-1 mAP Rank-1 mAP
Baseline (ResNet-50) 73.7 69.8 94.1 83.2
ResNet-50 (texture) w/ data aug. 63.4 56.6 88.6 76.2
ResNet-50 (texture) w/o data aug. 69.1 65.4 91.7 79.8
SAN 80.1 76.4 96.1 88.0
Table 6: Performance (%) comparisons with the schemes that use the generated pseudo groundtruth texture images as the input of the baseline ResNet-50 for reID.

Appendix 2 Visualization Analysis of Generated Texture Image

We examine the texture images generated by our SAN. As shown in Figure 5, even though the input images vary in pose/viewpoint/scale, the generated texture images are well densely semantically aligned. 1) The same spatial positions correspond to the texture of the same semantics. 2) Different generated texture images (e.g., a frontal person image versus a back-facing one) have consistent/aligned semantics, thanks to the prediction ability of the network. Note that the generated images from SAN and SAN-PG are visually very similar, so we only show the ones from SAN in Figure 5.

Figure 5: Three sets of example pairs from the Market1501 dataset. Each pair consists of the original input image and the generated texture image.

Appendix 3 Partial Person ReID

Partial person reID is more challenging as the misalignment problem is more severe: two partial person images are generally not spatially semantics-aligned and usually have less overlapped semantics. We also demonstrate the effectiveness of our scheme on the challenging partial person reID datasets Partial REID zheng2015partial () and Partial-iLIDS he2018deep ().

Benefiting from the aligned full texture generation capability, our SAN delivers outstanding performance. Figure 6 shows that the texture images regressed by our SA-Dec are semantically aligned across images, even though the input images exhibit severe misalignment.

Table 7 shows the experimental results. Note that we train the SAN on the Market1501 dataset zheng2015scalable () and test on the partial datasets. We directly take the model trained for Market1501 for testing, i.e., Baseline (ResNet-50) and SAN. In this case, the network has seldom seen partial person data and is not strong enough. Similar to he2018deep (), we also fine-tune with holistic and partial person images cropped from Market1501 (marked by *). Our SAN* outperforms Baseline* and AMC+SWM zheng2015partial (), and is comparable with the state-of-the-art partial reID method DSR he2018deep (). Our final SAN* outperforms Baseline (ResNet-50)* by 5.8%, 4.7%, and 7.8% in Rank-1, Rank-5, and Rank-10, respectively, on the Partial REID dataset, and by 7.6%, 7.8%, and 5.8% in Rank-1, Rank-5, and Rank-10, respectively, on the Partial-iLIDS dataset. Even without fine-tuning, our SAN still significantly outperforms the baseline.

Figure 6: Six example pairs of (input image, regressed texture images by our SAN) from the Partial REID dataset.
Model Partial REID Partial-iLIDS
Rank-1 Rank-5 Rank-10 Rank-1 Rank-5 Rank-10
AMC+SWMzheng2015partial () 36.0 - - 49.6 - -
DSR (single-scale)*he2018deep () 39.3 - - 51.1 - -
DSR (multi-scale)*he2018deep () 43.0 - - 54.6 - -
Baseline (ResNet-50) 37.8 65.0 74.5 42.0 65.5 73.2
SAN 39.7 67.5 80.5 46.9 71.2 78.2
Baseline (ResNet-50)* 38.9 67.7 78.2 46.1 69.6 76.1
SAN* 44.7 72.4 86.0 53.7 77.4 81.9
Table 7: Partial person reID performance on the datasets of Partial REID and Partial-iLIDS (partial images are used as the probe set and holistic images are used as the gallery set). “*” means that the network is fine-tuned with holistic and partial person images from Market1501.

Appendix 4 Details about the Architecture of the SA-Dec

We show the details about the architecture of the SA-Dec in Table 8.

Layer name   Parameters   Feature name
input   2048-channel feature map from the SA-Enc   -
shrink_dim_conv   1×1 conv, 512 output channels   (L_TriDec constraint)
upsampling_1   residual block, pixel shuffle ×2   (L_TriDec constraint)
upsampling_2   residual block, pixel shuffle ×2   (L_TriDec constraint)
upsampling_3   residual block, pixel shuffle ×2   -
upsampling_4   residual block, pixel shuffle ×2   output
conv_out   -   output
Table 8: Detailed architecture of our SA-Dec. We construct it using residual convolutional layers and building blocks similar to those in ResNet he2016deep (). For shrink_dim_conv, "1×1, 512" denotes that the convolutional kernel size is 1×1 and the output channel number is 512. Following the representation style in he2016deep (), building blocks are shown in brackets, with the numbers of blocks stacked. Up-sampling is performed by pixel shuffle shi2016real () with an upscale factor of 2.
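A plausible form of one residual pixel-shuffle up-sampling block is sketched below, since the exact layer parameters in Table 8 did not survive extraction; the channel counts, normalization, and shortcut design are assumptions rather than the released architecture.

```python
import torch.nn as nn


class UpBlock(nn.Module):
    """Residual up-sampling block in the spirit of the SA-Dec: a 3x3 conv expands
    channels by r^2, PixelShuffle rearranges them into an r-times larger map, and
    a bilinear-upsampled 1x1 shortcut provides the residual path."""

    def __init__(self, in_ch, out_ch, r=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch * r * r, kernel_size=3, padding=1),
            nn.PixelShuffle(r),                     # (C*r^2, H, W) -> (C, r*H, r*W)
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Sequential(
            nn.Upsample(scale_factor=r, mode='bilinear', align_corners=False),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))
```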

Appendix 5 Implementation Details

Optimization. We use the Adam optimizer kingma2014adam (). In Step-1, we set the initial learning rate to 1e-4 and the batch size to 16; training converges after around 80 epochs. In Step-2, we follow the training strategy in zhang2019DSA () for the SA-Enc. For the SA-Dec, the initial learning rate is set to 1e-5 and shares the decay policy with the SA-Enc. The dimension of the feature vector f is 2048. The SAN is trained end-to-end. We implement the schemes in PyTorch.
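Using the module names from the SAN sketch in Sec. 3.2, the two learning rates can be wired up via Adam parameter groups, roughly as below. The decoder rate (1e-5) is from the text; the encoder rate is an assumed placeholder standing in for the zhang2019DSA () schedule.

```python
import torch


def build_optimizer(san, lr_enc=3.5e-4, lr_dec=1e-5):
    """Adam with separate learning rates for the encoder/classifier and the decoder.
    lr_enc is an illustrative placeholder; lr_dec follows the text."""
    param_groups = [
        {"params": san.sa_enc.parameters(), "lr": lr_enc},
        {"params": san.classifier.parameters(), "lr": lr_enc},
        {"params": san.sa_dec.parameters(), "lr": lr_dec},
    ]
    return torch.optim.Adam(param_groups)
```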

Data augmentation. For training the SAN-PG that is used for pseudo groundtruth texture image generation (Step-1), we use the augmentation strategies of random cropping, rotation, flipping, and color jittering. For training the SAN for reID (Step-2), we use the commonly used data augmentation strategies of random cropping wang2018resource (); zhang2019DSA (), horizontal flipping, and random erasing wang2018mancs (); wang2018resource (); zhong2017random () (with a probability of 0.5), in both the baseline schemes and our schemes.
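A minimal torchvision version of the Step-2 reID augmentation pipeline is sketched below; the input size, crop padding, and normalization statistics are illustrative assumptions, while the operations themselves follow the description above.

```python
import torchvision.transforms as T

# Step-2 (reID) training augmentation sketch.
reid_train_transform = T.Compose([
    T.Resize((256, 128)),
    T.RandomCrop((256, 128), padding=10),     # random cropping
    T.RandomHorizontalFlip(p=0.5),            # horizontal flipping
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.5),                   # random erasing (zhong2017random), applied on tensors
])
```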

Appendix 6 Details of Datasets

CUHK03 consists of 1,467 pedestrians and about 14,097 images, with both manually labeled and DPM-detected bounding boxes. The new training/testing protocol of zhong2017re (); zheng2018pedestrian (); he2018recognizing () is used. Market1501 consists of 1,501 identities, with 12,936 images of 751 identities used for training. DukeMTMC-reID contains 16,522 training images of 702 identities, 2,228 query images of the other 702 identities, and 17,661 gallery images. We follow the setting used in zheng2017unlabeled (); lin2017improving (). MSMT17 is a newly released large-scale dataset which consists of 126,441 images of 4,101 identities.

Partial REID includes 600 images of 60 people, with 5 full-body images and 5 partial images per person. The images are collected under different viewpoints and backgrounds and with different types of severe occlusion. Partial-iLIDS is a simulated partial person dataset based on iLIDS zheng2011person (), which contains 238 images of 119 people.
