Beyond Part Models: Person Retrieval with Refined Part Pooling (and a Strong Convolutional Baseline)
Employing part-level features for pedestrian image description offers fine-grained information and has been verified as beneficial for person retrieval in very recent literature. A prerequisite of part discovery is that each part should be well located. Instead of using external cues, e.g., pose estimation, to directly locate parts, this paper lays emphasis on the content consistency within each part.
Specifically, we aim at learning discriminative part-informed features for person retrieval and make two contributions. (i) A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features. With a uniform partition strategy, PCB achieves competitive results with the state-of-the-art methods, proving itself as a strong convolutional baseline for person retrieval. (ii) A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. Experiments confirm that RPP gives PCB a further performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2)% mAP and (92.3+1.5)% rank-1 accuracy, surpassing the state of the art by a large margin.
Person retrieval, also known as person re-identification (re-ID), aims at retrieving images of a specified pedestrian in a large database, given a query person-of-interest. Presently, deep learning methods dominate this community, with convincing superiority over hand-crafted competitors. Deeply-learned representations provide high discriminative ability, especially when aggregated from deeply-learned part features. The latest state of the art on re-ID benchmarks is achieved with part-informed deep features [35, 27, 37].
An essential prerequisite of learning discriminative part features is that parts should be accurately located. Recent state-of-the-art methods vary in their partition strategies and can be divided into two groups accordingly. The first group [38, 27, 31] leverages external cues, e.g., assistance from the latest progress on human pose estimation [23, 32, 15, 25, 2]. These methods rely on external human pose estimation datasets and sophisticated pose estimators. The underlying dataset bias between pose estimation and person retrieval remains an obstacle to ideal semantic partition on person images. The other group [35, 37, 22] abandons cues from semantic parts. These methods require no part labeling and yet achieve accuracy competitive with the first group. Some partition strategies are compared in Fig. 1.
Against this background of progress on learning part-level deep features, we rethink the problem of what makes well-aligned parts. Semantic partitions may offer stable cues to good alignment but are prone to noisy pose detections. This paper, from another perspective, lays emphasis on the consistency within each part, which we speculate is vital to the spatial alignment. This leads to our motivation: given coarsely partitioned parts, we aim to refine them to reinforce within-part consistency. Specifically, we make the following two contributions:
First, we propose a network named Part-based Convolutional Baseline (PCB) which conducts uniform partition on the conv-layer for learning part-level features. It does not explicitly partition the images. PCB takes a whole image as the input and outputs a convolutional feature. Being a classification net, the architecture of PCB is concise, with slight modifications on the backbone network. The training procedure is standard and requires no bells and whistles. We show that the convolutional descriptor has much higher discriminative ability than the commonly used fully-connected (FC) descriptor. On the Market-1501 dataset, for instance, the performance increases from 85.3% rank-1 accuracy and 68.5% mAP to 92.3% (+7.0%) rank-1 accuracy and 77.4% (+8.9%) mAP, surpassing many state-of-the-art methods by a large margin.
Second, we propose an adaptive pooling method to refine the uniform partition. Our motivation is that the contents within each part should be consistent. We observe that under uniform partition, there exist outliers in each part. These outliers are, in fact, closer to contents in some other part, implying within-part inconsistency. Therefore, we refine the uniform partition by relocating those outliers to the part they are closest to, so that the within-part consistency is reinforced. An example of the refined parts is illustrated in Fig. 1(f). With the proposed refined part pooling (RPP), performance on Market-1501 further increases to 93.8% (+1.5%) rank-1 accuracy and 81.6% (+4.2%) mAP. In Sections 3 and 4, we describe the PCB and the refined part pooling, respectively.
In Section 5, we combine the two methods, which achieves a new state of the art in person retrieval. Importantly, we demonstrate experimentally that the proposed refined parts are superior to attentive parts, i.e., parts learned with attention mechanisms.
2 Related Work
Hand-crafted part features for person retrieval. Before deep learning methods dominated the re-ID research community, hand-crafted algorithms had developed approaches to learn part or local features. Gray and Tao partition pedestrians into horizontal stripes to extract color and texture features. Similar partitions have since been adopted by many works [8, 41, 24, 20]. Some other works employ more sophisticated strategies. Gheissari et al. divide the pedestrian into several triangles for part feature extraction. Cheng et al. employ pictorial structure to parse the pedestrian into semantic parts. Das et al. apply HSV histograms on the head, torso and legs to capture spatial information.
Deeply-learned part features. The state of the art on most person retrieval datasets is presently maintained by deep learning methods. When learning part features for re-ID, the advantages of deep learning over hand-crafted algorithms are two-fold. First, deep features generically obtain stronger discriminative ability. Second, deep learning offers better tools for parsing pedestrians, which further benefits the part features. In particular, human pose estimation and landmark detection have achieved impressive progress [23, 25, 2, 32, 15]. Several recent works in re-ID employ these tools for pedestrian partition and report encouraging improvement [38, 27, 31]. However, the underlying gap between datasets for pose estimation and person retrieval remains a problem when directly utilizing these pose estimation methods in an off-the-shelf manner. Others abandon the semantic cues for partition. Yao et al. cluster the coordinates of max activations on feature maps to locate several regions of interest. Both Liu et al. and Zhao et al. embed the attention mechanism in the network, allowing the model to decide where to focus by itself.
Deeply-learned part with attention mechanism. A major contribution of this paper is the refined part pooling. We compare it in detail with a recent work, PAR by Zhao et al. Both works employ a part classifier to conduct “soft” partition on pedestrian images, as shown in Fig. 1. The two works share the merit of requiring no part labeling for learning discriminative parts. However, the motivation, training methods, mechanism, and final performance of the two methods are quite different, as detailed below.
Motivation: PAR aims at directly learning aligned parts, while RPP aims to refine pre-partitioned parts. Working mechanism: using an attention method, PAR trains the part classifier in an unsupervised manner, while the training of RPP can be viewed as a semi-supervised process. Training process: RPP first trains an identity classification model with uniform partition and then utilizes the learned knowledge to induce the training of the part classifier. Performance: the slightly more complicated training procedure rewards RPP with better interpretability and significantly higher performance. For instance, on Market-1501, the mAP achieved by PAR, by PCB combined with an attention mechanism, and by the proposed RPP is 63.4%, 74.6% and 81.6%, respectively. In addition, RPP has the potential to cooperate with various partition strategies.
3 PCB: A Strong Convolutional Baseline
This section describes the structure of PCB and its comparison with several potential alternative structures.
3.1 Structure of PCB
Backbone network. PCB can take any network without hidden fully-connected layers designed for image classification as the backbone, e.g., Google Inception and ResNet. This paper mainly employs ResNet50 in consideration of its competitive performance as well as its relatively concise architecture.
From backbone to PCB. We reshape the backbone network to PCB with slight modifications, as illustrated in Fig. 2. The structure before the original global average pooling (GAP) layer is maintained exactly the same as the backbone model. The difference is that the GAP layer and what follows are removed. When an image passes through all the layers inherited from the backbone network, it becomes a 3D tensor $T$ of activations. In this paper, we define the vector of activations viewed along the channel axis as a column vector. Then, with a conventional average pooling, PCB partitions $T$ into $p$ horizontal stripes and averages all the column vectors in the same stripe into a single part-level column vector $g_i$ ($i = 1, 2, \dots, p$; the subscripts will be omitted unless necessary). Afterwards, PCB employs a convolutional layer to reduce the dimension of $g$. According to our preliminary experiment, the dimension-reduced column vectors $h$ are set to 256-dim. Finally, each $h$ is input into a classifier, which is implemented with a fully-connected (FC) layer and a following Softmax function, to predict the identity (ID) of the input.
During training, PCB is optimized by minimizing the sum of Cross-Entropy losses over the $p$ pieces of ID predictions. During testing, either the $p$ pieces of $g$ or the $p$ pieces of $h$ are concatenated to form the final descriptor $\mathcal{G}$ or $\mathcal{H}$, i.e., $\mathcal{G} = [g_1, g_2, \dots, g_p]$ or $\mathcal{H} = [h_1, h_2, \dots, h_p]$. As observed in our experiment, employing $\mathcal{G}$ achieves slightly higher accuracy, but at a larger computation cost, which is consistent with observations reported in prior work.
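The pooling and concatenation steps described above can be illustrated with a minimal sketch. It uses plain Python lists rather than a deep learning framework, and the function names `uniform_part_pooling` and `concatenate`, as well as the toy feature map, are hypothetical and not taken from the paper. The dimension-reduction layer and the classifiers are omitted.

```python
def uniform_part_pooling(tensor, num_parts):
    """Average-pool a feature map into `num_parts` horizontal stripes.

    `tensor` is a nested list indexed as tensor[row][col][channel].
    Returns one part-level vector per stripe.
    """
    rows = len(tensor)
    if rows % num_parts != 0:
        raise ValueError("number of rows must be divisible by num_parts")
    rows_per_part = rows // num_parts
    channels = len(tensor[0][0])
    parts = []
    for i in range(num_parts):
        stripe = tensor[i * rows_per_part:(i + 1) * rows_per_part]
        # Collect every column vector that falls inside this stripe.
        columns = [vec for row in stripe for vec in row]
        mean = [sum(v[c] for v in columns) / len(columns) for c in range(channels)]
        parts.append(mean)
    return parts


def concatenate(parts):
    """Concatenate part-level vectors into a single descriptor."""
    return [x for part in parts for x in part]


# Toy 4x2 feature map with 2 channels; every entry in row r equals [r, 10 * r].
toy = [[[float(r), 10.0 * r] for _ in range(2)] for r in range(4)]
parts = uniform_part_pooling(toy, num_parts=2)
descriptor = concatenate(parts)
```

In a real model the same operation would typically be an average pooling layer applied to the backbone output; the list-based version is only meant to make the indexing explicit.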
3.2 Important Parameters
PCB benefits from fine-grained spatial integration. Several key parameters, namely the input image size (i.e., $[H, W]$), the spatial size of the tensor $T$ (i.e., $[M, N]$), and the number of pooled column vectors (i.e., $p$), are important to the performance of PCB. Note that $[M, N]$ is determined by the spatial down-sampling rate of the backbone model, given the fixed-size input. Some deep object detection methods, e.g., SSD and R-FCN, show that decreasing the down-sampling rate of the backbone network efficiently enriches the granularity of the feature. PCB follows their success by removing the last spatial down-sampling operation in the backbone network to increase the size of $T$. This manipulation considerably increases person retrieval accuracy with only very light computation cost added. The details can be found in Section 5.4, which also provides insights to explain the phenomenon that partitioning the tensor $T$ into too many stripes (large $p$) compromises the discriminative ability of the learned feature.
Through our experiments, the optimized parameter settings for PCB are:
The input images are resized to $384 \times 128$, with a height to width ratio of 3:1.
The spatial size of $T$ is set to $24 \times 8$.
$T$ is equally partitioned into $6$ horizontal stripes.
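The relation between input size, down-sampling rate, and stripe height can be checked with a short calculation. The sketch below assumes a 384x128 input, six stripes, and a total stride of 32 for an unmodified ResNet50 that is halved to 16 once the last spatial down-sampling operation is removed; the helper names are hypothetical.

```python
def feature_map_size(height, width, stride):
    """Spatial size of the final feature map for a given total stride."""
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by the stride")
    return height // stride, width // stride


def rows_per_stripe(rows, num_parts):
    """Number of feature-map rows that fall into each horizontal stripe."""
    if rows % num_parts:
        raise ValueError("rows must be divisible by the number of parts")
    return rows // num_parts


# Assumed values: 384x128 input, total stride 32 for an unmodified ResNet50,
# halved to 16 once the last down-sampling step is removed, and 6 stripes.
original = feature_map_size(384, 128, stride=32)   # (12, 4)
enlarged = feature_map_size(384, 128, stride=16)   # (24, 8)
stripe_rows = rows_per_stripe(enlarged[0], 6)      # 4
```

Under these assumptions each stripe covers four rows of the enlarged feature map, whereas the unmodified backbone would leave only two rows per stripe.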
3.3 Potential Alternative Structures
Given the same backbone network, there exist several potential alternative structures to learn part-level features. We enumerate two structures for comparison with PCB.
Variant 1. Instead of making an ID prediction based on each $h_i$ ($i = 1, 2, \dots, p$), it averages all $h_i$ into a single vector $\bar{h}$, which is then fully connected to an ID prediction vector. During testing, it also concatenates $g$ or $h$ to form the final descriptor. Variant 1 is characterized by learning a convolutional descriptor under a single loss.
Variant 2. It adopts exactly the same structure as PCB in Fig. 2. However, all the branches of FC classifiers in Variant 2 share the same set of parameters.
Both variants are experimentally validated as inferior to PCB. The superiority of PCB over Variant 1 shows that not only the convolutional descriptor itself, but also the respective supervision on each part, is vital for learning discriminative part-level features. The superiority of PCB over Variant 2 shows that sharing weights among classifiers, while reducing the risk of over-fitting, compromises the discriminative ability of the learned part-level features. Experimental details are given in Section 5.3.
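The structural difference between PCB and the two variants lies in how the identity loss is formed. The following sketch computes the three losses for toy part vectors using only the standard library. All names and the toy weights are hypothetical, and the numbers say nothing about the accuracy reported in the paper; they only show which classifier sees which vector.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])


def linear(weights, vec):
    """weights[k] is the weight row for identity class k."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def pcb_loss(part_vectors, classifiers, label):
    """Sum of per-part losses; every part has its own classifier."""
    return sum(cross_entropy(linear(w, h), label)
               for w, h in zip(classifiers, part_vectors))


def variant1_loss(part_vectors, classifier, label):
    """Single loss computed on the average of all part vectors."""
    n = len(part_vectors)
    mean = [sum(col) / n for col in zip(*part_vectors)]
    return cross_entropy(linear(classifier, mean), label)


def variant2_loss(part_vectors, shared_classifier, label):
    """Per-part losses, but all parts share one classifier."""
    return pcb_loss(part_vectors, [shared_classifier] * len(part_vectors), label)


# Two toy part vectors and two toy classifiers over two identities.
h = [[1.0, 0.0], [0.0, 1.0]]
w_a = [[1.0, 0.0], [0.0, 1.0]]
w_b = [[0.0, 1.0], [1.0, 0.0]]
loss_pcb = pcb_loss(h, [w_a, w_b], label=0)
loss_v1 = variant1_loss(h, w_a, label=0)
loss_v2 = variant2_loss(h, w_a, label=0)
```

In this toy setup the separate classifiers can each adapt to their own part, while the shared classifier of Variant 2 must serve both parts with one set of weights.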
4 Refined Part Pooling
Uniform partition for PCB is simple, effective, and yet to be improved. This section firstly explains the inconsistency phenomenon accompanying the uniform partition and then proposes the refined part pooling as a remedy to reinforce within-part consistency.
4.1 Within-Part Inconsistency
With focus on the tensor $T$ to be spatially partitioned, our intuition of within-part inconsistency is: column vectors $f$ in the same part of $T$ should be similar to each other and dissimilar to column vectors in other parts; otherwise the phenomenon of within-part inconsistency occurs, implying that the parts are partitioned inappropriately.
After training PCB to convergence, we compare the similarities between each $f$ and $g_i$ ($i = 1, 2, \dots, p$), i.e., the average-pooled column vector of each stripe, by measuring cosine distance. If $f$ is closest to $g_i$, $f$ is inferred as closest to the $i$-th part, correspondingly. By doing this, we find the closest part to each $f$, as exemplified in Fig. 3. Each column vector is denoted by a small rectangle and painted in the color of its closest part.
Two phenomena are observed. First, most column vectors in the same horizontal stripe are clustered together (though there are no explicit constraints for this effect). Second, there exist many outliers that, although assigned to a specific horizontal stripe (part) during training, are more similar to another part. The existence of these outliers suggests that they are inherently more consistent with column vectors in another part.
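The outlier analysis described above can be sketched as follows: average-pool each stripe, then assign every column vector to the stripe whose pooled vector has the highest cosine similarity. The function names and the toy feature map are hypothetical, and the sketch operates on nested lists rather than on a trained network.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def stripe_means(tensor, num_parts):
    """Average-pooled vector of each horizontal stripe."""
    rows_per_part = len(tensor) // num_parts
    channels = len(tensor[0][0])
    means = []
    for i in range(num_parts):
        stripe = tensor[i * rows_per_part:(i + 1) * rows_per_part]
        columns = [v for row in stripe for v in row]
        means.append([sum(v[c] for v in columns) / len(columns)
                      for c in range(channels)])
    return means


def find_outliers(tensor, num_parts):
    """Return (row, col, assigned_part, closest_part) for every outlier."""
    if len(tensor) % num_parts:
        raise ValueError("rows must be divisible by num_parts")
    rows_per_part = len(tensor) // num_parts
    means = stripe_means(tensor, num_parts)
    outliers = []
    for r, row in enumerate(tensor):
        assigned = r // rows_per_part
        for c, vec in enumerate(row):
            sims = [cosine_similarity(vec, g) for g in means]
            nearest = sims.index(max(sims))
            if nearest != assigned:
                outliers.append((r, c, assigned, nearest))
    return outliers


upper, lower = [1.0, 0.0], [0.0, 1.0]
toy = [
    [upper, upper],
    [upper, lower],  # the second entry resembles the lower stripe
    [lower, lower],
    [lower, lower],
]
outliers = find_outliers(toy, num_parts=2)
```

Here the single vector in the upper stripe that looks like the lower stripe is flagged, which mirrors the kind of within-part inconsistency the section describes.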
4.2 Relocating Outliers
We propose the refined part pooling to correct within-part inconsistency. Our goal is to assign all the column vectors according to their similarities to each part, so that the outliers will be relocated.
To this end, we need to classify all the column vectors $f$ in $T$ on the fly. Based on the already-learned $T$, we use a linear layer followed by Softmax activation as a part classifier as follows:
$$P(P_i \mid f) = \mathrm{softmax}(W_i^{\top} f) = \frac{\exp(W_i^{\top} f)}{\sum_{j=1}^{p} \exp(W_j^{\top} f)},$$
where $P(P_i \mid f)$ is the predicted probability of $f$ belonging to part $P_i$, $p$ is the number of pre-defined parts (i.e., $p = 6$ in PCB), and $W$ is the trainable weight matrix of the part classifier, whose training procedure is detailed in Section 4.3.
Given a column vector $f$ in $T$ and the predicted probability of $f$ belonging to part $P_i$, we assign $f$ to part $P_i$ with $P(P_i \mid f)$ as the confidence. Correspondingly, each part $P_i$ ($i = 1, 2, \dots, p$) is sampled from all column vectors $f$ with $P(P_i \mid f)$ as the sampling weight, i.e.,
$$P_i = \{ P(P_i \mid f) \times f, \ \forall f \in F \},$$
where $F$ is the complete set of column vectors in tensor $T$, and $\{\bullet\}$ denotes the sampling operation to form an aggregate.
By doing this, the proposed refined part pooling conducts a “soft” and adaptive partition to refine the original “hard” and uniform partition, and the outliers originating from the uniform partition are relocated. In combination with the refined part pooling described above, PCB is further reshaped as shown in Fig. 4. Refined part pooling, i.e., the part classifier along with the following sampling operation, replaces the original average pooling. The structure of all the other layers remains exactly the same as in Fig. 2.
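A minimal sketch of the soft assignment may help make the mechanism concrete: a linear part classifier with Softmax gives every column vector a probability for each part, and each part is then aggregated from all column vectors weighted by that probability. The function names are hypothetical, and dividing by the total probability mass (a weighted average) is an assumption of this sketch; the text only specifies that the probabilities act as sampling weights.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def part_probabilities(column, weights):
    """weights[i] is the weight vector of part i in the part classifier."""
    logits = [sum(w * x for w, x in zip(w_i, column)) for w_i in weights]
    return softmax(logits)


def refined_part_pooling(columns, weights):
    """Aggregate each part from all column vectors, weighted by probability."""
    num_parts = len(weights)
    channels = len(columns[0])
    sums = [[0.0] * channels for _ in range(num_parts)]
    mass = [0.0] * num_parts
    for f in columns:
        probs = part_probabilities(f, weights)
        for i, prob in enumerate(probs):
            mass[i] += prob
            for c in range(channels):
                sums[i][c] += prob * f[c]
    # Assumption: normalise by the accumulated probability mass of each part.
    return [[s / mass[i] for s in sums[i]] for i in range(num_parts)]


columns = [[1.0, 0.0], [0.0, 1.0]]
# An untrained (all-zero) classifier spreads every vector evenly over parts.
flat = refined_part_pooling(columns, [[0.0, 0.0], [0.0, 0.0]])
# A confident classifier routes each vector almost entirely to one part.
sharp = refined_part_pooling(columns, [[10.0, 0.0], [0.0, 10.0]])
```

With all-zero weights the result reduces to plain global averaging, while confident weights recover nearly hard, content-driven parts.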
4.3 Induced Training for Part Classifier
First, a standard PCB model is trained to convergence with $T$ equally partitioned.
Second, we remove the original average pooling layer after $T$ and append a $p$-category part classifier on $T$. New parts are sampled from $T$ according to the prediction of the part classifier, as detailed in Section 4.2.
Third, we fix all the already-learned layers in PCB, leaving only the part classifier trainable. Then we retrain the model on the training set. In this condition, the model still expects the tensor $T$ to be equally partitioned; otherwise it will predict the identities of training images incorrectly. So Step 3 penalizes the part classifier until it conducts a partition close to the original uniform partition, whereas the part classifier itself is prone to categorize inherently similar column vectors into the same part. A state of balance is reached as a result of Step 3.
Finally, all the layers are allowed to be updated. The whole net, i.e., PCB along with the part classifier, is fine-tuned for overall optimization.
In the above training procedure, the PCB model trained in Step 1 induces the training of the part classifier. Steps 3 and 4 converge very fast, requiring 10 more epochs in total.
Step 1. A standard PCB is trained to convergence with uniform partition.
Step 2. A $p$-category part classifier is appended on the tensor $T$.
Step 3. All the pre-trained layers of PCB are fixed. Only the part classifier is trainable. The model is trained until convergence again.
Step 4. The whole net is fine-tuned to convergence for overall optimization.
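The four steps can be summarised as a small schedule that states, for each step, which pooling is in place and which parameter groups are updated. The group names (`backbone`, `dim_reduction`, `id_classifiers`, `part_classifier`) and the function name are hypothetical labels chosen for this sketch.

```python
PCB_LAYERS = frozenset({"backbone", "dim_reduction", "id_classifiers"})
PART_CLASSIFIER = frozenset({"part_classifier"})


def induced_training_step(step):
    """Return (pooling, trainable_groups) for one step of induced training."""
    if step == 1:
        # Standard PCB with uniform partition; no part classifier yet.
        return "uniform", PCB_LAYERS
    if step == 2:
        # Structural change only: swap the pooling, add the part classifier.
        return "refined", frozenset()
    if step == 3:
        # Pre-trained PCB layers are frozen; only the part classifier learns.
        return "refined", PART_CLASSIFIER
    if step == 4:
        # Everything is fine-tuned jointly.
        return "refined", PCB_LAYERS | PART_CLASSIFIER
    raise ValueError("step must be 1, 2, 3 or 4")


schedule = {step: induced_training_step(step) for step in (1, 2, 3, 4)}
```

In a framework such as PyTorch, freezing a group would usually mean disabling gradients for its parameters, but that detail is outside the scope of this sketch.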
4.4 Discussions on Refined Part Pooling
With Step 1 of the training procedure in Section 4.3 skipped, the training can also converge. In this case, the training will be similar to PAR, which employs an attention mechanism to align parts, as introduced in Section 2. We compare both approaches, i.e., training the part classifier with or without Step 1, in experiments and find that the induction procedure matters. Without the proposed induction, the performance turns out significantly lower. For example, on Market-1501, when induction is applied, PCB in combination with refined part pooling achieves 80.9% mAP. When induction is removed, mAP decreases to 74.6%. This implies that the proposed induced training is superior to the attention mechanism on PCB. The details can be found in Section 5.5.
5.1 Datasets and Settings
Datasets. This paper uses three datasets for evaluation, i.e., Market-1501, DukeMTMC-reID [26, 43], and CUHK03. The Market-1501 dataset contains 1,501 identities observed under 6 camera viewpoints, 19,732 gallery images and 12,936 training images detected by DPM. The DukeMTMC-reID dataset contains 1,404 identities, 16,522 training images, 2,228 queries, and 17,661 gallery images. With so many images captured by 8 cameras, DukeMTMC-reID is one of the most challenging re-ID datasets to date. The CUHK03 dataset contains 13,164 images of 1,467 identities. Each identity is observed by 2 cameras. CUHK03 offers both hand-labeled and DPM-detected bounding boxes, and we use the latter in this paper. CUHK03 originally adopts 20 random train/test splits, which is time-consuming for deep learning, so we adopt the new training/testing protocol proposed in prior work. For Market-1501 and DukeMTMC-reID, we use the evaluation packages provided for the respective datasets. All experiments are evaluated under the single-query setting. Moreover, for simplicity we do not use re-ranking algorithms, which considerably improve mAP. Our results are compared with reported results without re-ranking.
5.2 Implementation Details
Implementation of IDE for comparison. We note that the IDE model is a commonly used baseline in deep re-ID systems [40, 38, 33, 10, 28, 42, 43, 45]. In contrast to the proposed PCB, the IDE model learns a global descriptor. For comparison, we implement the IDE model on the same backbone network, i.e., ResNet50, with several optimizations over the original one, as follows. 1) After the “pool5” layer in ResNet50, we append a fully-connected layer followed by Batch Normalization and ReLU. The output dimension of the appended FC layer is set to 256-dim. 2) We apply dropout on the “pool5” layer. Although there are no trainable parameters in the “pool5” layer, there is evidence that applying dropout on it, which outputs a high-dimensional feature vector of 2048-dim, effectively avoids over-fitting and gains considerable improvement [42, 43]. We empirically set the dropout ratio to 0.5. On Market-1501, our implemented IDE achieves 85.3% rank-1 accuracy and 68.5% mAP, which is a bit higher than a previously reported implementation.
Training. The training images are augmented with horizontal flip and normalization. We set the batch size to 64 and train the model for 60 epochs with the base learning rate initialized at 0.1 and decayed to 0.01 after 40 epochs. The backbone model is pre-trained on ImageNet. The learning rate for all the pre-trained layers is set to $0.1\times$ the base learning rate. When employing refined part pooling for boosting, we append another 10 epochs with the learning rate set to 0.01. With two NVIDIA TITAN XP GPUs and PyTorch as the platform, training an IDE model and a standard PCB on Market-1501 (12,936 training images) takes about 40 and 50 minutes, respectively. The increased training time of PCB is mainly caused by the removal of the last spatial down-sampling operation in the Conv5 layer, which enlarges the tensor $T$ by $4\times$.
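The learning-rate schedule described above can be written as a small function. This is a sketch under stated assumptions: epochs are counted from zero, the factor applied to pre-trained layers is taken to be 0.1 of the current rate, and that factor is assumed to remain in effect during the 10 additional refined-part-pooling epochs. The function and parameter names are hypothetical.

```python
def learning_rate(epoch, pretrained, base_lr=0.1, decayed_lr=0.01,
                  decay_epoch=40, main_epochs=60, rpp_epochs=10,
                  pretrained_scale=0.1):
    """Learning rate for a 0-based epoch index.

    Epochs [0, 40) use the base rate, epochs [40, 60) the decayed rate,
    and the appended refined-part-pooling epochs [60, 70) also use the
    decayed rate. Pre-trained layers are scaled by `pretrained_scale`
    (an assumed factor of 0.1).
    """
    total = main_epochs + rpp_epochs
    if not 0 <= epoch < total:
        raise ValueError("epoch must lie in [0, %d)" % total)
    lr = base_lr if epoch < decay_epoch else decayed_lr
    return lr * pretrained_scale if pretrained else lr


# Rates for newly added layers at a few representative epochs.
new_layer_rates = [learning_rate(e, pretrained=False) for e in (0, 39, 40, 65)]
```

In practice such a schedule would normally be expressed through per-parameter-group options of the optimizer together with a step scheduler.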
5.3 Performance Evaluation
We evaluate our method on three datasets, with results shown in Table 1. Both uniform partition (PCB) and refined part pooling (PCB+RPP) are tested.
PCB is a strong baseline. Comparing PCB and IDE, the prior commonly used baseline in many works [40, 38, 33, 10, 28, 42, 43, 45], we clearly observe the significant advantage of PCB: mAP on the three datasets increases from 68.5%, 52.8% and 38.9% to 77.4% (+8.9%), 66.1% (+13.3%) and 54.2% (+15.3%), respectively. This indicates that integrating part information increases the discriminative ability of the feature. The structure of PCB is as concise as that of IDE, and training PCB requires nothing more than training a canonical classification network. We hope it will serve as a baseline for the person retrieval task.
Refined part pooling (RPP) improves PCB especially in mAP. From Table 1, while PCB already has a high accuracy, RPP brings further improvement to it. On the three datasets, the improvement in rank-1 accuracy is +1.5%, +1.6%, and +3.1%, respectively; the improvement in mAP is +4.2%, +3.1%, and +3.5%, respectively. The improvement is larger in mAP than in rank-1 accuracy. In fact, rank-1 accuracy characterizes the ability to retrieve the easiest match in the camera network, while mAP indicates the ability to find all the matches. So the results indicate that RPP is especially beneficial in finding more challenging matches.
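The distinction between the two metrics can be illustrated with a toy ranking. The sketch below uses the standard definitions of rank-1 accuracy and average precision for a single query; the official evaluation packages may differ in details such as interpolation, and the function names are hypothetical.

```python
def rank1(ranked_matches):
    """1.0 if the top-ranked gallery image is a true match, else 0.0."""
    return 1.0 if ranked_matches and ranked_matches[0] else 0.0


def average_precision(ranked_matches):
    """Mean of the precision values measured at each true match."""
    hits = 0
    precision_sum = 0.0
    for position, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


# Two ranked lists for the same query; True marks a correct gallery match.
easy_and_hard_found = [True, True, False, False]
hard_match_missed = [True, False, False, True]
ap_good = average_precision(easy_and_hard_found)
ap_poor = average_precision(hard_match_missed)
```

Both rankings score the same on rank-1, but the ranking that pushes the second, harder match to the end has a lower average precision, which is why an improvement concentrated in mAP points to better retrieval of the more difficult matches.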
The benefit of using $p$ losses. To validate the usage of $p$ branches of losses in Fig. 2, we compare our method with Variant 1, which learns the convolutional descriptor under a single classification loss. Table 1 suggests that Variant 1 yields much lower accuracy than PCB, implying that employing a respective loss for each part is vital for learning discriminative part features.
The benefit of NOT sharing parameters among identity classifiers. In Fig. 2, PCB inputs each column vector $h$ to an FC layer before the Softmax loss. We compare our proposal (not sharing FC layer parameters) with Variant 2 (sharing FC layer parameters). From Table 1, PCB is higher than Variant 2 by 2.4%, 3.3%, and 7.4% on the three datasets, respectively. This suggests that sharing parameters among the final FC layers is inferior.
[Table 2 fragment: Triplet Loss | 84.9 | 94.2 | - | 69.1]
Comparison with state of the art. We compare PCB and PCB+RPP with the state of the art. Comparisons on Market-1501 are detailed in Table 2. The compared methods are categorized into three groups, i.e., hand-crafted methods, deep learning methods with global features, and deep learning methods with part features. Relying on uniform partition only, PCB surpasses all the prior methods, including [27, 31], which require auxiliary part labeling to deliberately align parts. The performance lead is further enlarged by the proposed refined part pooling.
Comparisons on DukeMTMC-reID and CUHK03 (new training/testing protocol) are summarized in Table 3. In the compared methods, PCB exceeds  by +5.5% and 17.2% in mAP on the two datasets, respectively. PCB+RPP (refined part pooling) further surpasses it by a large margin of +8.6% mAP on DukeMTMC-reID and +20.5% mAP on CUHK03. PCB+RPP yields higher accuracy than “TriNet+Era” and “SVDNet+Era”  which are enhanced by extra data augmentation.
In this paper, we report mAP = 81.6%, 69.2%, 57.5% and Rank-1 = 93.8%, 83.3% and 63.7% for Market-1501, Duke and CUHK03, respectively, setting a new state of the art on the three datasets. All the results are achieved under the single-query mode without re-ranking. Re-ranking methods will further boost the performance, especially mAP. For example, when “PCB+RPP” is combined with a re-ranking method, mAP and Rank-1 accuracy on Market-1501 increase to 91.9% and 95.1%, respectively.
5.4 Parameters Analysis
We analyze some important parameters of PCB (and with RPP) introduced in Section 3.2 on Market-1501. Once optimized, the same parameters are used for all the three datasets.
The size of images and tensor $T$. We vary the image size from $192 \times 64$ to $576 \times 192$, using $96 \times 32$ as the interval. Two down-sampling rates are tested, i.e., the original rate and a halved rate (larger $T$). We exhaustively train all these models on PCB and report their performance in Fig. 5.4. Two phenomena are observed.