Backbone Can Not be Trained at Once: Rolling Back to Pre-trained Network for Person Re-identification


Youngmin Ro1   Jongwon Choi2   Dae Ung Jo1   Byeongho Heo1   Jongin Lim1   Jin Young Choi1
{treeoflife, mardaewoon, bhheo, ljin0429, jychoi},
1Department of ECE, ASRI, Seoul National University, Korea
2Samsung SDS, Korea

In the person re-identification (ReID) task, because of the shortage of training data, it is common to fine-tune a classification network pre-trained on a large dataset. However, it is relatively difficult to sufficiently fine-tune the low-level layers of the network due to the gradient vanishing problem. In this work, we propose a novel fine-tuning strategy that allows the low-level layers to be sufficiently trained by rolling back the weights of the high-level layers to their initial pre-trained values. Our strategy alleviates the vanishing-gradient problem in the low-level layers and robustly trains them to fit the ReID dataset, thereby increasing the performance of ReID tasks. The improved performance of the proposed strategy is validated via several experiments. Furthermore, without any add-ons such as pose estimation or segmentation, our strategy exhibits state-of-the-art performance using only a vanilla deep convolutional neural network architecture. Code is available at


1 Introduction

Person re-identification (ReID) refers to the task of matching the same person, for instance a pedestrian, among multiple people detected in non-overlapping camera views. Different camera views capture pedestrians in various poses and against different backgrounds, which interferes with the ability to correctly estimate the similarity among pedestrian candidates. These obstacles make it difficult to robustly recognize the identities of numerous pedestrians by comparing them with a limited number of person images with known identities. Furthermore, it is infeasible to obtain training datasets large enough to cover the appearance variation of pedestrians, making the ReID problem difficult to solve. When sufficient training data is not available, a common approach is to fine-tune a network pre-trained on another large dataset (e.g., ImageNet) which contains abundant information. The fine-tuning approach results in better performance than approaches in which networks are trained from randomly initialized parameters. This is a practical approach used in many research areas [\citeauthoryearRen et al.2015, \citeauthoryearLong, Shelhamer, and Darrell2015] to avoid the problem of overfitting. Likewise, previous ReID algorithms [\citeauthoryearChang, Hospedales, and Xiang2018, \citeauthoryearSi et al.2018, \citeauthoryearSun et al.2017] have utilized the fine-tuning approach. Most recent works in ReID research have attempted to utilize semantic information such as pose estimation [\citeauthoryearZhao et al.2017, \citeauthoryearXu et al.2018, \citeauthoryearSarfraz et al.2018], segmentation masks [\citeauthoryearSong et al.2018], and semantic parsing [\citeauthoryearKalayeh et al.2018] to improve the accuracy of ReID by considering additional pedestrian context.

In contrast to the previous studies, we are interested in incrementally improving the performance of ReID by enhancing the basic fine-tuning strategy applied to the pre-trained network. A few attempts have been made to improve the learning method itself, e.g., by designing a new loss function or augmenting data in a novel way [\citeauthoryearZhang et al.2017, \citeauthoryearChen et al.2017, \citeauthoryearZhong et al.2017b, \citeauthoryearSun et al.2017]. However, there has been no research on improving the learning method by considering the characteristics of the filters in each layer.

Figure 1: Training loss and mAP curves with and without our learning strategy. ‘base’ denotes the network trained by the basic strategy. With our rolling-back scheme, the training loss escapes from the local minimum and the mAP accuracy increases.

Before suggesting our novel fine-tuning strategy for ReID, we first empirically analyze the importance of fine-tuning the low-level layers for ReID problems. According to related research [\citeauthoryearZeiler and Fergus2014, \citeauthoryearMahendran and Vedaldi2015], the low-level layers concentrate on details of appearance to discriminate between samples, while the high-level layers contain semantic information. Thus, we need to sufficiently fine-tune the low-level layers to improve the discriminative power for the specific class ‘person’ in ReID, because the low-level layers of the pre-trained network include detailed information on numerous classes. However, since the gradients delivered from the high-level layers to the low-level layers are reduced through back-propagation, the low-level layers suffer from a gradient-vanishing problem, which causes early convergence of the entire network before the low-level layers are trained sufficiently.

To solve this problem, we propose a novel fine-tuning strategy in which part of the network is intentionally perturbed when learning slows down. The proposed fine-tuning strategy can recover the vanished gradients by rolling back the weights in the high-level layers to their pre-trained values, which provides an opportunity for further tuning of the weights in the low-level layers. As shown in Figure 1, the proposed fine-tuning strategy allows the network to converge to a minimum in a basin with better generalization performance than the conventional fine-tuning method. We validate the proposed method, which uses no add-on schemes, via a number of experiments, and the method outperforms state-of-the-art ReID methods that append additional context to the basic network architecture. Furthermore, we apply the proposed learning strategy to the fine-grained classification problem, which validates its generality for various computer vision tasks.

2 Related Work

Traditionally, the ReID problem has been solved by using metric learning methods [\citeauthoryearKoestinger et al.2012] to narrow the distance among images of the same person. Clothing provides an important hint in the ReID task, and some approaches [\citeauthoryearPedagadi et al.2013, \citeauthoryearKuo, Khamis, and Shet2013] have used color-based histograms. With the development of deep learning, many ReID methods have appeared that learn discriminative features with deep architectures, dramatically increasing ReID performance [\citeauthoryearSun et al.2017, \citeauthoryearLi, Zhu, and Gong2018, \citeauthoryearHermans, Beyer, and Leibe2017]. Recently, the state-of-the-art approaches [\citeauthoryearSi et al.2018, \citeauthoryearSong et al.2018, \citeauthoryearZhong et al.2018] have also used advanced deep architectures, especially those pre-trained on ImageNet [\citeauthoryearDeng et al.2009], as backbone networks.

2.0.1 Add-on semantic information methods in ReID

To increase performance, many recent works based on deep architectures have tried to incorporate additional semantic information such as pedestrian poses and attention masks. One of the most popular approaches is to use off-the-shelf pose estimation algorithms [\citeauthoryearCao et al.2017, \citeauthoryearInsafutdinov et al.2016] to tackle the misaligned poses of the candidate pedestrians. In [\citeauthoryearSu et al.2017], using the pose information, Su et al. aligned each part of a person, producing a pose-normalized input to deal with the deformable variation of the ReID object. Sarfraz et al. [\citeauthoryearSarfraz et al.2018] proposed a view predictor network that distinguishes the front, back, and sides of a person using pose information. Beyond pose estimation, one method [\citeauthoryearSong et al.2018] embeds a four-channel input by concatenating the three RGB channels of the input image with one channel of a segmentation mask. Likewise, another algorithm [\citeauthoryearKalayeh et al.2018] uses semantic parsing masks rather than a whole-body mask. Qian et al. [\citeauthoryearQian et al.2018] generate a realistic pose-normalized image; the synthesized image can be used as training data because the label is preserved. Xu et al. [\citeauthoryearXu et al.2018] proposed an attention-aware composition network; pointing out that conventional methods use pose information based on rigid body regions such as rectangular RoIs, they obtained non-rigid parts through the connectivity between human joints and matched them individually. In contrast to these previous ReID methods, we focus on improving the training method itself, without any additional semantic information or extra architecture.

2.0.2 Advanced fine-tuning methods

Other studies have improved learning methods on pre-trained networks. Li and Hoiem [\citeauthoryearLi and Hoiem2017] suggested a method that can learn a new task without forgetting existing tasks in transfer learning. In [\citeauthoryearKornblith, Shlens, and Le2018], Kornblith et al. analyzed the conventional fine-tuning method and concluded that state-of-the-art ImageNet architectures yield state-of-the-art results over many tasks. In the ReID task, several methods have improved the learning strategy on pre-trained networks. The quadruplet loss was proposed in [\citeauthoryearChen et al.2017]: Chen et al. developed an improved version of the triplet loss which not only pulls samples of the same identity closer but also adds a further negative sample to push different identities farther apart. In [\citeauthoryearZhang et al.2017], Zhang et al. were inspired by the distillation method [\citeauthoryearHinton, Vinyals, and Dean2015] between teacher and student networks and proposed a learning method based on co-student networks that can be trained without a teacher network. However, there has been no research considering the fine-tuning characteristics for the ReID problem. In this paper, we propose a novel fine-tuning strategy adapted to the ReID task, which takes into account the layer-by-layer characteristics of the network.

3 Methodology

Figure 2: The description of the network: ResNet-34, ResNet-50 and ResNet-101 are utilized as a feature extractor. The classifiers are re-defined for each ReID dataset.

In this section, we first analyze the conventional fine-tuning strategy to determine which layer is insufficiently trained for ReID problems. Based on the analysis, we propose a new fine-tuning strategy that alleviates the vanishing gradient in the poorly trained layers, consequently improving the generalization performance of the fine-tuned network.

3.1 Overall framework

Before describing the empirical analysis and the proposed fine-tuning strategy, we first introduce an overall framework including a network architecture with its training and testing processes. The notations defined in this section are used in the following sections.

3.1.1 Architecture

In this paper, we use a classification-based network [\citeauthoryearZheng, Yang, and Hauptmann2016] that treats each identity label as a class. We assume that the deep convolutional neural network consists of two components: a feature extractor and a classifier. The feature extractor is composed of multiple convolutional layers, and the classifier consists of several fully-connected (FC) layers. As the feature extractor, we utilize the convolutional layers of a pre-trained ResNet [\citeauthoryearHe et al.2016], which is widely used in many ReID algorithms [\citeauthoryearSun et al.2017, \citeauthoryearZhong et al.2018, \citeauthoryearQian et al.2018]. The three structures ResNet-34, ResNet-50, and ResNet-101 are used as feature extractors to show the generality of the proposed fine-tuning strategy. According to the resolution of the convolutional layers, the feature extractor can be partitioned into five blocks, where each block contains several convolutional layers of the same resolution. Following feature extraction, a feature vector is obtained by a global average pooling layer that averages the channel-wise values of the feature map resulting from the last convolutional layer. The resulting feature vector is a 2048-D vector for ResNet-50 and ResNet-101 and a 512-D vector for ResNet-34. The network infers the identity of the input sample by feeding the feature vector obtained from the feature extractor into the classifier. The classifier is newly defined as, in order, a 512-D FC layer, batch normalization, a leaky rectified linear unit, and an FC layer of dimension $C$, where $C$ is the number of identities in the training set and varies between datasets. The last FC layer is followed by a soft-max layer.
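As a minimal illustration of the head described above, the classifier can be written as a layer specification (a sketch only; the actual implementation uses PyTorch modules, and `build_classifier_spec` is a hypothetical helper):

```python
def build_classifier_spec(feat_dim, num_ids):
    # Layer order of the newly defined classifier: 512-D FC, batch norm,
    # leaky ReLU, then an FC layer with one output per training identity,
    # followed by soft-max.
    return [
        ('fc', feat_dim, 512),
        ('batch_norm', 512),
        ('leaky_relu',),
        ('fc', 512, num_ids),
        ('softmax',),
    ]

# ResNet-50/101 yield 2048-D features; ResNet-34 yields 512-D features.
# Market-1501, for example, has 751 training identities.
spec_r50 = build_classifier_spec(2048, 751)
```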

3.1.2 Training process

We train the network to classify the identities of training samples based on the cross-entropy loss. The weight parameters to be trained are denoted by $W = \{W_1, \ldots, W_5, W_{fc}\}$, where $W_b$ and $W_{fc}$ are the weight parameters of the $b$-th block and of the FC layers, respectively. Given $N$ training samples $\{x_i\}_{i=1}^{N}$ with $C$ identities and the corresponding one-hot label vectors $\{y_i\}_{i=1}^{N}$, where $y_i \in \{0,1\}^C$, the probability $\hat{y}_i$ that $x_i$ corresponds to each label is calculated as:

$$\hat{y}_i = g\big(f(x_i; W_{1:5}); W_{fc}\big), \qquad (1)$$

where $W_{1:5} = \{W_1, \ldots, W_5\}$, $f(\cdot\,; W_{1:5})$ denotes the feature extractor with weights $W_{1:5}$, and $g(\cdot\,; W_{fc})$ denotes the classifier (including the soft-max layer) with weights $W_{fc}$. The cross-entropy loss between the estimated $\hat{y}_i$ and $y_i$ is calculated as follows:

$$L(W) = -\frac{1}{N} \sum_{i=1}^{N} y_i^{\top} \log \hat{y}_i. \qquad (2)$$

In the training process, a stochastic gradient descent method is used to train $W$ by minimizing Eq. (2).
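The soft-max and cross-entropy computation of Eqs. (1)–(2) can be sketched in a few lines of plain Python (the logits here are hypothetical stand-ins for classifier outputs; the actual model is a deep network):

```python
import math

def softmax(logits):
    # Numerically stable soft-max over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, one_hot):
    # Cross-entropy between predicted probabilities and a one-hot label,
    # as in Eq. (2) for a single sample.
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

# Toy example with three identities; the true identity is class 0.
probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, [1, 0, 0])
```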

3.1.3 Testing process

The identities in the testing set are completely disjoint from the identities in the training set. Thus, the classifier trained in the training process cannot be used for the testing process. To find correspondences between pedestrian candidates without using the classifier, we estimate the similarity of two pedestrians based on the distance between the feature vectors of each pedestrian extracted from the trained feature extractor. To evaluate the performance, the testing set is divided into a query set and a gallery set with $N_q$ and $N_g$ samples, respectively. The samples of the query and gallery sets are denoted by $\{q_i\}_{i=1}^{N_q}$ and $\{g_j\}_{j=1}^{N_g}$, respectively. Each sample in the query set is a person of interest, which should be matched to the candidate samples in the gallery set.

The distance between $q_i$ and $g_j$ is calculated by the L2 norm of the feature difference:

$$d(q_i, g_j) = \big\| f(q_i; W_{1:5}) - f(g_j; W_{1:5}) \big\|_2. \qquad (3)$$

The identity of the gallery sample with the lowest distance is determined as the identity of the $i$-th query sample.
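The retrieval step above amounts to a nearest-neighbor search under the L2 distance. A minimal sketch with toy two-dimensional features (real feature vectors are 512-D or 2048-D, and the identity labels here are placeholders):

```python
def l2_distance(q, g):
    # Euclidean (L2) distance between two feature vectors.
    return sum((a - b) ** 2 for a, b in zip(q, g)) ** 0.5

def match_query(query_feat, gallery_feats, gallery_ids):
    # Assign the query the identity of the nearest gallery sample.
    dists = [l2_distance(query_feat, g) for g in gallery_feats]
    return gallery_ids[dists.index(min(dists))]

# Toy gallery: the query feature lies closest to identity 'B'.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ids = ['A', 'B', 'C']
pred = match_query([0.1, 0.9], gallery, ids)
```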

Figure 3: The training loss convergence of ordinary fine-tuning (baseline) and of rolling-back schemes in which one block is continuously tuned while the other blocks are rolled back to their pre-trained states.
Remaining block mAP rank-1 rank-5 rank-10
baseline 73.16 89.43 96.35 97.77
Block1 74.08 89.49 96.50 97.62
Block2 74.37 89.96 96.50 97.62
Block3 73.87 89.90 96.20 97.83
Block4 73.82 89.64 95.81 97.62
Block5 71.17 88.45 95.61 97.42
Table 1: The generalization performance of each scheme in Figure 3. Bold numbers show the best performance.
1: $N$: number of blocks, $M$: number of low-level blocks
2: $W^0$: weights of the pre-trained network
3: $X$, $Y$: ReID dataset
4: Initialize the block weights to the pre-trained ones: $W_{1:N} \leftarrow W^0_{1:N}$
5: Randomly initialize the FC weights $W_{fc}$
6: First fine-tune on the ReID dataset $(X, Y)$
7: for $p = 2$ to $M$ do
8:      Keep the blocks below $p$ and roll back the others: $W_{p:N} \leftarrow W^0_{p:N}$
9:      Do not roll back the FC layers
10:      Refine-tune on the ReID dataset $(X, Y)$
11: end for
Algorithm 1 Refine-tuning with rolling back

3.2 Analysis of fine-tuning method

This section determines which layers converge insufficiently under conventional fine-tuning. Figure 3 shows the convergence behavior supporting the key idea of the proposed fine-tuning strategy. ‘baseline’ denotes the conventional fine-tuning, while ‘Block $b$’ indicates the refine-tuning in which every block except Block $b$ is rolled back after the ‘baseline’ fine-tuning. Table 1 shows the generalization performance of each scheme. A meaningful discovery is that the rolling-back schemes that retain a low-level block (Block1, Block2, Block3) show slower convergence than the schemes that retain a high-level block (Block4, Block5). However, as shown in Table 1, the schemes that maintain the low-level blocks give better generalization performance than the schemes preserving the high-level blocks. This indicates that the ‘baseline’ fine-tuning causes the low-level layers to converge prematurely, and suggests that rolling back the high-level layers while keeping the low-level layers might give the low-level layers an opportunity to learn further. As an additional consideration, not all weights can be initialized with pre-trained values: the output layer of a deep network for a new task usually differs from that of the backbone network, so the FC layers must be initialized randomly. Rolling back the FC layers to random states provides no benefit. Thus, in our rolling-back scheme, the FC layers are excluded from rolling back, even though they are high-level layers, to keep the learning of the low-level layers consistent.
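The diagnostic experiment of Figure 3 and Table 1 reduces to simple weight bookkeeping: keep one block's fine-tuned weights and restore every other block to its pre-trained state. A minimal sketch, in which the string tags stand in for actual weight tensors:

```python
def diagnostic_rollback(tuned, pretrained, keep_block):
    # Build the weight state for one 'Block b' experiment: keep the
    # fine-tuned weights of a single block, roll the rest back to the
    # pre-trained state. FC layers keep their fine-tuned weights, since
    # they have no pre-trained counterpart.
    rolled = {b: (tuned[b] if b == keep_block else pretrained[b])
              for b in pretrained}
    rolled['fc'] = tuned['fc']
    return rolled

# Placeholder weight tags for five blocks plus the FC layers.
tuned = {1: 'ft1', 2: 'ft2', 3: 'ft3', 4: 'ft4', 5: 'ft5', 'fc': 'ftfc'}
pre = {b: f'pre{b}' for b in range(1, 6)}
w = diagnostic_rollback(tuned, pre, keep_block=2)
```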

3.3 Refine-tuning with rolling back

The aforementioned analysis shows that premature convergence degrades performance, and that rolling back the high-level layers can be a beneficial strategy to mitigate the premature convergence of the low-level layers. For further tuning of the low-level layers, we design a rolling-back refine-tuning scheme that trains the low-level layers incrementally from the front layers while rolling back the remaining high-level layers. The detailed rolling-back scheme is described in the following.

  1. In the first fine-tuning period ($p = 1$), the block weights $W_{1:5}$ are initialized with the pre-trained weights $W^0_{1:5}$. The weights $W_{fc}$ of the FC layers are initialized randomly [\citeauthoryearHe et al.2015]. Then the first period of fine-tuning is performed on the target dataset by Eq. (1) and Eq. (2). The updated weight of the $b$-th block is denoted by $\hat{W}^1_b$, which is obtained by minimizing the loss of Eq. (2).

  2. In the refine-tuning periods with rolling back ($p \ge 2$), we roll back the high-level layers as follows. First, Block1 ($W_1$) is maintained in the state of the previous period and all the remaining blocks ($W_{2:5}$) are rolled back to their pre-trained states $W^0_{2:5}$. In other words, Block1 continues the learning, while the other blocks restart from the pre-trained initial weights. In this incremental manner, the next low-level block is added one-by-one to the set of blocks continuing the learning, while the remaining blocks are rolled back. The rolling-back refine-tuning is repeated until all blocks are included in the set of continuously learning blocks. In summary, in the $p$-th refine-tuning period, the weights of the network are rolled back as

$$ W_b \leftarrow \begin{cases} \hat{W}^{\,p-1}_b, & b < p, \\ W^0_b, & b \ge p, \end{cases} \qquad (4)$$

    where $\hat{W}^{\,p-1}_b$ are the updated weights from the $(p-1)$-th refine-tuning period. During the refine-tuning process, $W_{fc}$ is not rolled back, as mentioned above.


The detailed procedure of the refine-tuning scheme with rolling-back is summarized in Algorithm 1.
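Algorithm 1 can be sketched as plain weight bookkeeping over training periods. In the sketch below, `fine_tune` is a hypothetical callable standing in for one 40-epoch training period, and string tags stand in for weight tensors:

```python
N_BLOCKS = 5  # feature-extractor blocks of the ResNet backbone

def refine_tune(pretrained, n_periods, fine_tune):
    # pretrained: dict mapping block index (1..N_BLOCKS) to pre-trained weights.
    # fine_tune: callable simulating one training period on the ReID data.
    weights = dict(pretrained)          # initialize blocks from the pre-trained net
    weights['fc'] = 'random_init'       # FC layers start from random initialization
    weights = fine_tune(weights)        # first fine-tuning period (p = 1)
    for p in range(2, n_periods + 1):
        for b in range(p, N_BLOCKS + 1):
            weights[b] = pretrained[b]  # roll back the high-level blocks (b >= p)
        # blocks 1..p-1 and the FC layers keep their current weights
        weights = fine_tune(weights)    # refine-tuning period p
    return weights

# Toy run: each training period appends '+t' to every weight tag.
tick = lambda w: {k: str(v) + '+t' for k, v in w.items()}
final = refine_tune({b: f'pre{b}' for b in range(1, 6)}, 4, tick)
```

After four periods, Block1 has accumulated four periods of training, while Block5 was rolled back before every refine-tuning period and therefore carries only one period of updates.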

4 Experiment

4.1 Dataset

continuously Market-1501 DukeMTMC CUHK03-L CUHK03-D
Period tuned blocks mAP rank-1 mAP rank-1 mAP rank-1 mAP rank-1
1 none 73.16 89.43 63.26 80.83 45.17 50.07 44.05 48.00
2 B1+FC 75.65 90.95 66.09 81.96 47.69 51.21 45.76 50.50
3 B1+B2+FC 76.54 91.12 66.57 82.41 49.98 54.36 46.20 51.36
4 B1+B2+B3+FC 77.01 91.24 66.39 82.32 50.72 55.64 47.43 52.93
Table 2: Results of our rolling-back scheme on different ReID dataset
ResNet-34 ResNet-101
continuously Market-1501 DukeMTMC Market-1501 DukeMTMC
Period tuned blocks mAP rank-1 mAP rank-1 mAP rank-1 mAP rank-1
1 none 70.65 86.93 60.06 78.69 75.91 90.80 66.00 82.27
2 B1+FC 73.63 89.13 63.45 81.10 77.21 90.77 69.27 83.62
3 B1+B2+FC 74.85 90.02 65.16 82.18 78.17 91.27 70.24 85.19
4 B1+B2+B3+FC 74.97 90.05 65.44 83.08 79.95 92.49 69.88 84.43
Table 3: Results of our rolling-back scheme for different network types.

4.1.1 Market-1501

Market-1501 [\citeauthoryearZheng et al.2015] is a widely used dataset for person ReID. Market-1501 contains 32,668 images of 1,501 identities. All the bounding box images are detection results of the DPM detector [\citeauthoryearFelzenszwalb et al.2010]. The dataset is divided into a training set of 751 identities and a test set of 750 identities.

4.1.2 DukeMTMC-ReID(DukeMTMC)

Based on a multi-target, multi-camera tracking dataset, DukeMTMC [\citeauthoryearZheng, Zheng, and Yang2017] has been specially designed for person ReID. DukeMTMC contains 36,411 images of 1,402 identities, divided into training and testing sets of 702 and 702 identities, respectively.

4.1.3 CUHK03-np

CUHK03-np [\citeauthoryearZhong et al.2017a] is a modified version of the original CUHK03 dataset. Both hand-labeled (CUHK03-L) and DPM-detected [\citeauthoryearFelzenszwalb et al.2010] bounding boxes (CUHK03-D) are offered. CUHK03-np contains 14,096 images of 1,467 identities. The new version is split into two balanced sets containing 767 and 700 identities for training and testing, respectively.

4.2 Implementation detail

Our method was implemented using the PyTorch [\citeauthoryearPaszke et al.2017] library. All inputs are resized to 288×144 and the batch size was set to 32. No augmentation other than horizontal flipping is used in our training process. The initial learning rate was set to 0.01 and 0.1 for the feature extractor and the classifier, respectively. The learning rates were multiplied by 0.1 every 20 epochs, and each refine-tuning period was trained for 40 epochs. In our experiments, the proposed refine-tuning strategy rolls back three times, so four refine-tuning periods of 40 epochs each are performed, for a total of 160 epochs. The learning rates of the rolled-back blocks are restored to 0.01 at the beginning of every period. In contrast, the blocks that are not rolled back begin each period with the low learning rate of 0.001, since a high learning rate on sufficiently trained blocks might cause sudden divergence. The optimizer used in this study was stochastic gradient descent (SGD) with Nesterov momentum [\citeauthoryearNesterov1983]; the momentum rate and the weight decay were kept fixed throughout, and at every rolling back the momentum buffer of the gradient was reset. In the test process, the feature vector of the horizontally flipped input was added to that of the original input. We report the rank-1 accuracy of the Cumulative Matching Characteristics (CMC) curve and the mean Average Precision (mAP) for performance evaluation.
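The per-block learning-rate schedule within one refine-tuning period can be sketched as a small helper (a sketch under the settings stated above; `block_lr` and its arguments are hypothetical names):

```python
def block_lr(epoch, period_start_epoch, rolled_back):
    # Learning rate of a feature-extractor block during one 40-epoch period.
    # Rolled-back blocks restart at 0.01; continuously tuned blocks restart
    # at the lower rate of 0.001 to avoid disturbing well-trained weights.
    base = 0.01 if rolled_back else 0.001
    epoch_in_period = epoch - period_start_epoch
    # Multiply by 0.1 at every 20 epochs within the period.
    return base * (0.1 ** (epoch_in_period // 20))
```

For example, at the start of the second period (epoch 40), a rolled-back block trains at 0.01 while a kept block trains at 0.001, and both decay by 0.1 halfway through the period.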

Figure 4: The training loss and mAP graphs comparing our rolling-back scheme with the conventional fine-tuning at once. (a) shows the results on Market-1501 and (b) the results on DukeMTMC.

4.3 Ablation tests

The network trained with the proposed strategy was verified via ablation tests on Market-1501, DukeMTMC, CUHK03-L, and CUHK03-D. The proposed refine-tuning strategy is applied to a network over four periods. As the refine-tuning periods progress, the continuously tuned blocks accumulate (e.g., B1+B2+FC in the third period), while the other blocks are rolled back to their original pre-trained states. As shown in Table 2, the performance increases as the refine-tuning periods progress, with the exception of DukeMTMC in the fourth period; even in this case, the gap is negligible. The improvement is most prominent in the second refine-tuning period, during which the first rolling back is performed. To verify the generality of our refine-tuning scheme, we conducted additional experiments with other networks, including ResNet-34 and ResNet-101 [\citeauthoryearHe et al.2016], under the same settings. Table 3 shows the performance of each network on Market-1501 and DukeMTMC. The proposed refine-tuning scheme also shows a consistent improvement on ResNet-34 and ResNet-101. The ablation test results demonstrate that the proposed refine-tuning scheme is a significantly advantageous general method to enhance the generalization performance in the ReID problem, in which only a limited amount of data is available.

Figure 5: The results of comparison with FC warm-up training method
Figure 6: The attention maps formed by the last feature layer trained by our rolling-back scheme and the baseline

4.4 Effect of rolling back as a perturbation

To evaluate the effect of our rolling-back scheme, we compare it with a ‘basecy’ method that rolls back none of the blocks but merely adjusts the learning rate with the same schedule as ours, providing a perturbation that can drive the optimization to other local basins. The ‘basecy’ method is similar to other studies [\citeauthoryearLoshchilov and Hutter2016, \citeauthoryearSmith2017] that perturb only the learning rate. Figure 4 shows the change in training loss and mAP over the whole process of the proposed refine-tuning and the basecy fine-tuning. After the first rolling back at 40 epochs, the training loss of the rolling-back scheme converges to a better value than that of basecy in the 70-80 epoch range. After the second and third rolling backs, the training loss of basecy converges to a lower value than that of the proposed method, but basecy shows worse generalization performance (mAP) than the proposed method.

Market-1501 DukeMTMC CUHK03-L CUHK03-D
Method Backbone mAP rank-1 mAP rank-1 mAP rank-1 mAP rank-1 Add-on
PT-GAN \shortciteref:PoseTransfer_GAN ResNet-50 58.0 79.8 48.1 68.6 30.5 33.8 28.2 30.1 pose+GAN
SVDNet \shortciteref:SVDnet ResNet-50 62.1 82.3 56.8 76.7 37.8 40.9 37.3 41.5 -
PDC \shortciteref:pose_driven Inception 63.4 84.1 - - - - - - pose
AACN \shortciteref:AACN GoogleNet 66.9 85.9 59.3 76.8 - - - - pose
HAP2S_P \shortciteref:Hard-aware ResNet-50 69.4 84.6 60.6 75.9 - - - - -
PSE \shortciteref:Pose_sensitive ResNet 69.0 87.7 62.0 79.8 - - - - pose
CamStyle \shortciteref:CamAug ResNet-50 71.6 89.2 57.6 78.3 - - - - GAN
PN-GAN \shortciteref:PoseNormal_GAN ResNet-50 72.6 89.4 53.2 73.6 - - - - pose+GAN
MGCAM \shortciteref:MaskreID MSCAN 74.3 83.8 - - 50.2 50.1 46.7 46.9 mask
MLFN \shortciteref:MLFN Original 74.3 90.0 62.8 81.0 49.2 54.7 47.8 52.8 -
HA-CNN \shortciteref:harmonious Inception 75.7 91.2 63.8 80.5 41.0 44.4 38.6 41.7 -
DuATM \shortciteref:DuATM DenseNet-121 76.6 91.4 64.6 81.8 - - - - -
Ours ResNet-34 75.0 90.1 65.4 83.1 48.6 53.0 45.6 51.3 -
Ours ResNet-50 77.0 91.2 66.6 82.4 50.7 55.6 47.4 52.9 -
Ours ResNet-101 79.9 92.5 70.2 85.2 55.7 59.8 50.5 55.6 -
Table 4: Comparison with State-of-the-art methods on Market-1501, DukeMTMC and CUHK03-L/D
Market-1501 DukeMTMC
Method mAP rank-1 mAP rank-1 Add-on
PT-GAN 58.0 79.8 48.1 68.6 GAN
SVDNet 62.1 82.3 56.8 76.7 -
HAP2S_P 69.4 84.6 60.6 75.9 -
CamStyle 71.6 89.2 57.6 78.3 GAN
PN-GAN 72.6 89.4 53.2 73.6 GAN
Ours 77.0 91.2 66.6 82.4 -
Table 5: Comparison with State-of-the-art methods using same backbone network ResNet-50

4.5 Comparison to FC warm-up training

In this section, we discuss the difference between our method and FC warm-up training [\citeauthoryearHe et al.2016]. As mentioned previously, the new FC layers start from random initialization. FC warm-up freezes the pre-trained weights of all hidden layers except the FC layers and trains the FC layers before the main fine-tuning begins. In the comparison experiment, the baseline was warmed up for 20 epochs. In our proposed method, period 1 (see Table 2) is similar to FC warm-up in that the FC layers start from random initialization. However, the proposed method does not freeze the pre-trained weights in period 1. The training loss and mAP for FC warm-up and our method are depicted in Figure 5. Both start fine/refine-tuning after training the FC layers. The FC warm-up converges to a lower training loss than the proposed method, but the proposed method shows better generalization performance.
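The distinction between the two schemes comes down to which parameter groups receive gradient updates in the first period. A minimal sketch (`trainable_flags` and the scheme names are hypothetical labels, analogous to setting `requires_grad` in PyTorch):

```python
def trainable_flags(scheme):
    # Which parameter groups are updated in the first training period.
    # 'fc_warmup': the backbone is frozen; only the new FC layers train.
    # 'period1' (ours): the whole network is fine-tuned from the start.
    flags = {f'block{b}': scheme != 'fc_warmup' for b in range(1, 6)}
    flags['fc'] = True  # the randomly initialized FC layers always train
    return flags
```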

4.6 Attention performance of our refine-tuning method

To learn discriminative features for the ReID task, it is important to distinguish the foreground from the background. Figure 6 shows that our method can generate a more distinguishable feature map in the last convolutional layer than the baseline of the conventional fine-tuning method.

4.7 Comparisons with state-of-the-art methods

We also compared the proposed method with state-of-the-art methods. Table 5 shows the comparison results when using ResNet-50. The proposed rolling-back refine-tuning scheme shows the best performance even though our method uses no add-on scheme. Furthermore, compared to the other methods without add-on schemes (SVDNet, HAP2S_P), our method achieves more than a 7% mAP improvement on Market-1501. Table 4 summarizes the results compared with the state-of-the-art methods on Market-1501, DukeMTMC, and CUHK03-L/D. According to the results, the rolling-back refine-tuning scheme makes a meaningful contribution to enhancing any backbone network, such that it outperforms state-of-the-art algorithms that utilize add-on schemes.

5 Conclusion

In this paper, we proposed a refine-tuning method with a rolling-back scheme that further enhances the backbone network. The key idea of the rolling-back scheme is to restore the weights of part of the backbone network to the pre-trained weights when the fine-tuning converges at a premature state. To escape from the premature state, we adopt an incremental refine-tuning strategy that applies the fine-tuning repeatedly along with the rolling back. According to the experimental results, the rolling-back scheme makes a meaningful contribution to enhancing the backbone network by driving the convergence to a local basin with good generalization performance. As a result, our method without any add-on scheme outperforms state-of-the-art methods that rely on add-on schemes.

6 Acknowledgement

This work was supported by the Next-Generation ICD Program through the NRF funded by the Ministry of S&ICT [2017M3C4A7077582], and the ICT R&D program of MSIP/IITP [2017-0-00306, Outdoor Surveillance Robots].


  • [\citeauthoryearBossard, Guillaumin, and Van Gool2014] Bossard, L.; Guillaumin, M.; and Van Gool, L. 2014. Food-101 – mining discriminative components with random forests. In ECCV. Springer.
  • [\citeauthoryearCao et al.2017] Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR.
  • [\citeauthoryearChang, Hospedales, and Xiang2018] Chang, X.; Hospedales, T. M.; and Xiang, T. 2018. Multi-level factorisation net for person re-identification. In CVPR.
  • [\citeauthoryearChen et al.2017] Chen, W.; Chen, X.; Zhang, J.; and Huang, K. 2017. Beyond triplet loss: a deep quadruplet network for person re-identification. In CVPR.
  • [\citeauthoryearDeng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE.
  • [\citeauthoryearFelzenszwalb et al.2010] Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. IEEE Trans. on PAMI.
  • [\citeauthoryearHe et al.2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV.
  • [\citeauthoryearHe et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
  • [\citeauthoryearHermans, Beyer, and Leibe2017] Hermans, A.; Beyer, L.; and Leibe, B. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
  • [\citeauthoryearHinton, Vinyals, and Dean2015] Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • [\citeauthoryearInsafutdinov et al.2016] Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; and Schiele, B. 2016. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV.
  • [\citeauthoryearKalayeh et al.2018] Kalayeh, M. M.; Basaran, E.; Gokmen, M.; Kamasak, M. E.; and Shah, M. 2018. Human semantic parsing for person re-identification. In CVPR.
  • [\citeauthoryearKoestinger et al.2012] Koestinger, M.; Hirzer, M.; Wohlhart, P.; Roth, P. M.; and Bischof, H. 2012. Large scale metric learning from equivalence constraints. In CVPR.
  • [\citeauthoryearKornblith, Shlens, and Le2018] Kornblith, S.; Shlens, J.; and Le, Q. V. 2018. Do better imagenet models transfer better? arXiv preprint arXiv:1805.08974.
  • [\citeauthoryearKuo, Khamis, and Shet2013] Kuo, C.-H.; Khamis, S.; and Shet, V. 2013. Person re-identification using semantic color names and rankboost. In WACV.
  • [\citeauthoryearLi and Hoiem2017] Li, Z., and Hoiem, D. 2017. Learning without forgetting. IEEE Trans. on PAMI.
  • [\citeauthoryearLi, Zhu, and Gong2018] Li, W.; Zhu, X.; and Gong, S. 2018. Harmonious attention network for person re-identification. In CVPR.
  • [\citeauthoryearLiu et al.2018] Liu, J.; Ni, B.; Yan, Y.; Zhou, P.; Cheng, S.; and Hu, J. 2018. Pose transferrable person re-identification. In CVPR.
  • [\citeauthoryearLong, Shelhamer, and Darrell2015] Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR.
  • [\citeauthoryearLoshchilov and Hutter2016] Loshchilov, I., and Hutter, F. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
  • [\citeauthoryearMahendran and Vedaldi2015] Mahendran, A., and Vedaldi, A. 2015. Understanding deep image representations by inverting them. In CVPR.
  • [\citeauthoryearMaji et al.2013] Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.; and Vedaldi, A. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
  • [\citeauthoryearNesterov1983] Nesterov, Y. 1983. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN USSR.
  • [\citeauthoryearPaszke et al.2017] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in pytorch.
  • [\citeauthoryearPedagadi et al.2013] Pedagadi, S.; Orwell, J.; Velastin, S.; and Boghossian, B. 2013. Local fisher discriminant analysis for pedestrian re-identification. In CVPR.
  • [\citeauthoryearQian et al.2018] Qian, X.; Fu, Y.; Wang, W.; Xiang, T.; Wu, Y.; Jiang, Y.-G.; and Xue, X. 2018. Pose-normalized image generation for person re-identification. In ECCV.
  • [\citeauthoryearRen et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.
  • [\citeauthoryearSarfraz et al.2018] Sarfraz, M. S.; Schumann, A.; Eberle, A.; and Stiefelhagen, R. 2018. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR.
  • [\citeauthoryearSi et al.2018] Si, J.; Zhang, H.; Li, C.-G.; Kuen, J.; Kong, X.; Kot, A. C.; and Wang, G. 2018. Dual attention matching network for context-aware feature sequence based person re-identification. arXiv preprint arXiv:1803.09937.
  • [\citeauthoryearSmith2017] Smith, L. N. 2017. Cyclical learning rates for training neural networks. In WACV.
  • [\citeauthoryearSong et al.2018] Song, C.; Huang, Y.; Ouyang, W.; and Wang, L. 2018. Mask-guided contrastive attention model for person re-identification. In CVPR.
  • [\citeauthoryearSu et al.2017] Su, C.; Li, J.; Zhang, S.; Xing, J.; Gao, W.; and Tian, Q. 2017. Pose-driven deep convolutional model for person re-identification. In ICCV.
  • [\citeauthoryearSun et al.2017] Sun, Y.; Zheng, L.; Deng, W.; and Wang, S. 2017. Svdnet for pedestrian retrieval. In ICCV.
  • [\citeauthoryearSzegedy et al.2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR.
  • [\citeauthoryearWah et al.2011] Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset.
  • [\citeauthoryearXu et al.2018] Xu, J.; Zhao, R.; Zhu, F.; Wang, H.; and Ouyang, W. 2018. Attention-aware compositional network for person re-identification. arXiv preprint arXiv:1805.03344.
  • [\citeauthoryearYu et al.2018] Yu, R.; Dou, Z.; Bai, S.; Zhang, Z.; Xu, Y.; and Bai, X. 2018. Hard-aware point-to-set deep metric for person re-identification. In ECCV.
  • [\citeauthoryearZeiler and Fergus2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In ECCV.
  • [\citeauthoryearZhang et al.2017] Zhang, Y.; Xiang, T.; Hospedales, T. M.; and Lu, H. 2017. Deep mutual learning. arXiv preprint arXiv:1706.00384.
  • [\citeauthoryearZhao et al.2017] Zhao, L.; Li, X.; Zhuang, Y.; and Wang, J. 2017. Deeply-learned part-aligned representations for person re-identification. In ICCV.
  • [\citeauthoryearZheng et al.2015] Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; and Tian, Q. 2015. Scalable person re-identification: A benchmark. In ICCV.
  • [\citeauthoryearZheng, Yang, and Hauptmann2016] Zheng, L.; Yang, Y.; and Hauptmann, A. G. 2016. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984.
  • [\citeauthoryearZheng, Zheng, and Yang2017] Zheng, Z.; Zheng, L.; and Yang, Y. 2017. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In ICCV.
  • [\citeauthoryearZhong et al.2017a] Zhong, Z.; Zheng, L.; Cao, D.; and Li, S. 2017a. Re-ranking person re-identification with k-reciprocal encoding. In CVPR.
  • [\citeauthoryearZhong et al.2017b] Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; and Yang, Y. 2017b. Random erasing data augmentation. arXiv preprint arXiv:1708.04896.
  • [\citeauthoryearZhong et al.2018] Zhong, Z.; Zheng, L.; Zheng, Z.; Li, S.; and Yang, Y. 2018. Camera style adaptation for person re-identification. In CVPR.