Batch DropBlock Network for Person Re-identification and Beyond

Zuozhuo Dai         Mingqiang Chen         Xiaodong Gu         Siyu Zhu         Ping Tan
Alibaba A.I. Labs             Simon Fraser University
Abstract

Since the person re-identification task often suffers from pose changes and occlusions, some attentive local features are often suppressed when training CNNs. In this paper, we propose the Batch DropBlock (BDB) Network, a two-branch network composed of a conventional ResNet-50 as the global branch and a feature dropping branch. The global branch encodes the global salient representations. Meanwhile, the feature dropping branch contains an attentive feature learning module called Batch DropBlock, which randomly drops the same region of all input feature maps in a batch to reinforce the attentive feature learning of local regions. The network then concatenates features from both branches and provides a more comprehensive and spatially distributed feature representation. Albeit simple, our method achieves state-of-the-art performance on person re-identification and is also applicable to general metric learning tasks. For instance, we achieve 76.4% Rank-1 accuracy on the CUHK03-Detect dataset and 83.0% Recall-1 score on the Stanford Online Products dataset, outperforming the existing works by a large margin (more than 6%).

1 Introduction

Person re-identification (re-ID) amounts to identifying the same person from multiple detected pedestrian images, typically seen from different cameras without view overlap. It has important applications in surveillance and presents a significant challenge in computer vision. Most recent works focus on learning feature representations that are robust to pose, illumination, and view angle changes using convolutional neural networks. Because body parts such as faces, hands, and feet are unstable as the view angle changes, the CNN tends to focus on the main body part, and the other discriminative body parts are consequently suppressed. To solve this problem, many pose-based works [23, 48, 49, 74, 71] seek to localize different body parts and align their associated features, and other part-based works [8, 27, 30, 31, 51, 56, 64] use coarse partitions or attention selection networks to improve feature learning. However, such pose-based networks usually require additional body pose or segmentation information. Moreover, these networks are designed around specific partition mechanisms, such as a horizontal partition, which fits person re-ID but is hard to generalize to other metric learning tasks. These problems motivate us to propose a simple and generalizable network for person re-ID and other metric learning tasks.

Figure 1: The class activation map on Baseline and BDB Network. Compared with the Baseline, the two-branch structure in BDB Network learns more comprehensive and spatially distributed features consisting of both global and attentive local representations.

In this paper, we propose the Batch DropBlock Network (BDB Network) for roughly aligned metric learning tasks. The Batch DropBlock Network is a two-branch network consisting of a conventional global branch and a feature dropping branch where the Batch DropBlock, an attentive feature learning module, is applied. The global branch encodes the global feature representations and the feature dropping branch learns local detailed features. Specifically, Batch DropBlock randomly drops the same region of all the feature maps in a batch, namely the same semantic body parts, during training, and thereby reinforces the attentive feature learning of the remaining parts. Concatenating the features of both branches yields a more comprehensive salient representation rather than only a few discriminative features. In Figure 1, we use the class activation map [84] to visualize feature attention. We can see that the attention of the Baseline mainly focuses on the main body part while the BDB Network learns more uniformly distributed representations.

Our Batch DropBlock differs from the general DropBlock [14] in two aspects. First, Batch DropBlock is an attentive feature learning module for metric learning tasks while DropBlock is a regularization method for classification tasks. Second, Batch DropBlock drops the same block for a batch of images during a single iteration, while DropBlock [14] erases randomly across different images. Here, 'batch' means the group of images participating in a single loss calculation during training, for example, a pair for pairwise loss, a triplet for triplet loss, and a quadruplet for quadruplet loss. If we erased features randomly as in [14], for example, one image keeping head features and another keeping feet features, the network could hardly find the semantic correspondence, let alone reinforce the learning of local attentive representations.

In the experimental section, the ResNet-50 [16] based Batch DropBlock Network with the hard triplet loss [17] achieves 72.8% Rank-1 accuracy on the CUHK03-Detect dataset, which is 6.0% higher than the state-of-the-art work [58]. Batch DropBlock can also be adopted in different metric learning schemes, including triplet loss [40, 17], lifted structure loss [35], weighted sampling based margin loss [62], and histogram loss [54]. We test it on image retrieval tasks with the CUB200-2011 [57], CARS196 [22], In-Shop Clothes Retrieval [32], and Stanford Online Products [46] datasets. The BDB Network consistently improves the Rank-1 accuracy of these schemes.

Figure 2: The Batch DropBlock Layer demonstrated on the triplet loss function [40].
Figure 3: The structure of our Batch DropBlock (BDB) Network with the batch hard triplet loss [17] demonstrated on the person re-ID problem. The global branch is appended after ResNet-50 Stage 4 and the feature dropping branch introduces a mask to crop out a large block in the bottleneck feature map. During training, there are two loss functions for both global branch and feature dropping branch. During testing, the features from both branches are concatenated as the final descriptor of a pedestrian image.

2 Related work

Person re-ID is a challenging task in computer vision due to the large variation of poses, backgrounds, illumination, and camera conditions. Historically, hand-crafted features were used for person re-identification [4, 9, 28, 29, 33, 34, 37, 38, 66, 77]. Recently, deep learning based methods dominate the person re-ID benchmarks [5, 43, 50, 71, 73, 79].

The formulation of person re-ID has gradually evolved from a classification problem to a metric learning problem, which aims to find embedding features for input images in order to measure their semantic similarity. The work [76] compares both strategies on the Market-1501 dataset. Current works in metric learning generally focus on the design of loss functions, such as contrastive loss [55], triplet loss [8, 30], lifted structure loss [35], quadruplet loss [6], histogram loss [54], etc. In addition to loss functions, the hard sample mining methods, such as distance weighted sampling [62], hard triplet mining [17] and margin sample mining [63] are also critical to the final retrieval precision. Another work [69] also studies the application of mutual learning in metric learning tasks. In this paper, the proposed two-branch BDB Network is effective in many metric learning formulations with different loss functions.

The human body is highly structured and distinguishing corresponding body parts can effectively determine the identity. Many recent works [30, 51, 53, 56, 58, 61, 67, 69, 70] aggregate salient features from different body parts and global cues for person re-ID. Among them, the part-based methods [8, 51, 58] achieve the state-of-the-art performance, which split an input feature map horizontally into a fixed number of strips and aggregate features from those strips. However, aggregating the feature vectors from multiple branches generally results in a complicated network structure. In comparison, our method involves only a simple network with two branches, one-third the size of the state-of-the-art MGN method [58].

To handle imperfect bounding box detection and body part misalignment, many works [27, 43, 42, 44, 78] exploit attention mechanisms to capture and focus on attentive regions. Saliency weighting [59, 72] is another effective approach to this problem. Inspired by attention models, Zhao et al. [71] propose part-aligned representations for person re-ID. Following a similar ideology, the works [20, 24, 25, 31], which incorporate a regional attention selection sub-network into the person re-ID model, have also demonstrated superior performance. To learn a feature representation robust to pose changes, the pose-guided attention methods [23, 48, 74] fuse features of different body parts with the help of pose estimation and human parsing networks. However, such methods based on pose estimation and semantic parsing algorithms are designed only for person re-ID tasks, while our approach can be applied to other general metric learning tasks.

To further improve the retrieval precision, re-ranking strategies [2, 82] and inference with specific person attributes [41] are also adopted. Recent works introduce synthetic training data [3], adversarially occluded samples [19], and unlabeled samples generated by GANs [80] to remarkably increase the variety of the training data. The work in [13] transfers representations learned from a general classification dataset to address the data sparsity of person re-ID. General data augmentation methods such as Random Erasing [83] and Cutout [11] are also widely used. Notably, all the policies above can be used jointly with our method.

3 Batch DropBlock (BDB) Network

This section describes the structure and components of the proposed Batch DropBlock Network.

Backbone Network.

We use ResNet-50 [16] as the backbone network for feature extraction, as do many person re-ID networks. For a fair comparison with the recent works [51, 58], we also modify the backbone ResNet-50 slightly: the down-sampling operation at the beginning of stage 4 is not employed. In this way, we obtain a larger feature map.
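For readers following along in code, the stride modification can be sketched as below. This is a minimal sketch assuming torchvision's ResNet-50 implementation; the released code may differ.

```python
# A minimal sketch of the backbone modification described above, assuming
# torchvision's ResNet-50: change the stride-2 down-sampling at the start of
# stage 4 (layer4) to stride 1 so the output feature map is twice as large
# in each spatial dimension.
import torchvision

backbone = torchvision.models.resnet50(pretrained=True)
backbone.layer4[0].conv2.stride = (1, 1)          # 3x3 conv of the first bottleneck
backbone.layer4[0].downsample[0].stride = (1, 1)  # 1x1 conv on the shortcut path
```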

ResNet-50 Baseline.

On top of this backbone network, we append a branch denoted as the global branch. Specifically, after stage 4 of ResNet-50, we employ global average pooling to get a 2048-dimensional feature vector, the dimension of which is further reduced to 512 through a convolution layer, a batch normalization layer, and a ReLU layer. We denote the backbone network together with the global branch as the ResNet-50 Baseline in the following sections. The performance of the Baseline with and without triplet loss on person re-ID datasets is shown in Table 1. Our baseline without triplet loss is identical to the baseline used in recent works [51, 58].
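For illustration, the global branch head can be sketched as follows. This is a minimal PyTorch-style sketch assuming a 1x1 convolution for the 2048-to-512 reduction; the class name and interface are ours, not taken from the released code.

```python
# A minimal sketch of the global branch head: global average pooling followed
# by a 1x1 convolution, batch normalization, and ReLU to reduce 2048 -> 512.
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    def __init__(self, in_dim=2048, out_dim=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.reduce = nn.Sequential(                  # dimension reduction head
            nn.Conv2d(in_dim, out_dim, kernel_size=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat):          # feat: (N, 2048, H, W) from ResNet-50 stage 4
        x = self.pool(feat)           # (N, 2048, 1, 1)
        x = self.reduce(x)            # (N, 512, 1, 1)
        return x.flatten(1)           # (N, 512) global descriptor
```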

Batch DropBlock Layer.

Given the feature tensor T computed by the backbone network from a single batch of input images, the Batch DropBlock Layer randomly drops the same region of T for every image in the batch. All the units inside the dropped area are zeroed out. We visualize the application of the Batch DropBlock Layer with the triplet loss function in Figure 2, while it can be adopted with other loss functions [35, 54, 62] as well. The height and width of the erased region vary from task to task, but in general the dropped region should be large enough to cover a semantic part of the input feature map. Unlike DropBlock [14], there is no need to change the keep probability hyper-parameter during training in the Batch DropBlock Layer.
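To make the dropping mechanism concrete, below is a minimal PyTorch-style sketch of the Batch DropBlock layer, assuming the erased height and width ratios are its only hyper-parameters; the module name and interface are illustrative.

```python
# A minimal sketch of a Batch DropBlock layer: the SAME rectangular region is
# zeroed for every feature map in the batch, controlled by the erased ratios.
import random
import torch
import torch.nn as nn

class BatchDrop(nn.Module):
    def __init__(self, h_ratio, w_ratio):
        super().__init__()
        self.h_ratio = h_ratio   # fraction of the feature map height to erase
        self.w_ratio = w_ratio   # fraction of the feature map width to erase

    def forward(self, x):                      # x: (N, C, H, W) feature tensor
        if self.training:
            h, w = x.size(2), x.size(3)
            rh = round(self.h_ratio * h)       # erased block height
            rw = round(self.w_ratio * w)       # erased block width
            sx = random.randint(0, h - rh)     # same top-left corner for the whole batch
            sy = random.randint(0, w - rw)
            mask = x.new_ones(x.size())
            mask[:, :, sx:sx + rh, sy:sy + rw] = 0   # zero the same region everywhere
            x = x * mask
        return x
```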

Figure 4: The class activation maps of the BDB Network, of the feature dropping branch when trained alone, and when DropBlock is used in our network. 'FD Branch' means the feature dropping branch.
Network Architecture.

As illustrated in Figure 3, our BDB Network consists of a global branch and a feature dropping branch.

The global branch is commonly used to provide global feature representations in multi-branch network architectures [8, 51, 58]. It also supervises the training of the feature dropping branch and ensures that the Batch DropBlock layer is applied to a well-learned feature map. To demonstrate this, we visualize in Figure 4 the class activation maps of the dropping branch trained with and without the global branch. We can see that the features learned by the dropping branch alone are more spatially dispersed, with redundant background noise (e.g., at the bottom of Figure 4 (c)). As mentioned in [14], dropping a large area randomly on input feature maps may hurt network learning at the beginning. DropBlock therefore uses a scheduled training method which sets the dropped area small initially and gradually increases it to stabilize the training process. In the BDB Network, we do not need to change the dropped area, thanks to the intermediate supervision of the global branch: at the early stage of training, when the feature dropping branch has not yet learned well, the global branch helps the training.

The feature dropping branch then applies the Batch DropBlock Layer on the feature map T and produces the batch-erased feature map. Afterwards, we apply global max pooling to obtain a 2048-dimensional feature vector. Finally, the dimension of the feature vector is reduced from 2048 to 1024 for both the triplet and softmax losses. The purpose of the feature dropping branch is to learn multiple attentive feature regions instead of only focusing on the major discriminative region. Figure 4 also visualizes the class activation maps of the feature dropping branch with DropBlock or Batch DropBlock. One can see that the features learned with DropBlock miss some attentive part features (e.g., legs in Figure 4 (d)), while the salient representations from Batch DropBlock have more accurate and clearer contours. An intuitive explanation is that, by blocking the same roughly aligned regions, we reinforce the attentive feature learning of the remaining parts with semantic correspondences.

The BDB Network uses global average pooling (GAP) on the global branch, the same as the original ResNet-50 network [16]. Notably, we use global max pooling (GMP) in the feature dropping branch, because GMP encourages the network to identify comparatively weak salient features after the most discriminative part is dropped. A strong feature is easily selected by max pooling, whereas a weak feature is hard to distinguish from other low activation values. When the strong feature is dropped, GMP encourages the network to strengthen the weak features, while with GAP the surrounding low values would still dilute the result.

Also noteworthy is the ResNet bottleneck block [16] in the feature dropping branch, which applies a stack of convolution layers on the feature map T. Without it, the global average pooling layer and the global max pooling layer would be applied simultaneously on the same feature map, making the network hard to converge.

During testing, features from the global branch and the feature dropping branch are concatenated as the embedding vector of a pedestrian image. Here, the following three points are worth noting. 1) The Batch DropBlock Layer is parameter-free and does not increase the network size. 2) The Batch DropBlock Layer can be easily adopted in other metric learning tasks beyond person re-ID. 3) The Batch DropBlock hyper-parameters can be tuned for different tasks without changing the network structure.
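Putting the components together, the two-branch forward pass can be sketched as follows. This is a minimal sketch under the description above: `backbone` and `bottleneck` stand for the modified ResNet-50 and the extra ResNet bottleneck block, `GlobalBranch` and `BatchDrop` refer to the earlier sketches, and the classifier layers for the softmax loss are omitted.

```python
# A minimal sketch of the two-branch BDB forward pass (illustrative names).
import torch
import torch.nn as nn

class BDBNetwork(nn.Module):
    def __init__(self, backbone, bottleneck, h_ratio=0.3, w_ratio=1.0):
        super().__init__()
        self.backbone = backbone                   # modified ResNet-50 (stage-4 stride 1)
        self.global_branch = GlobalBranch(2048, 512)
        self.bottleneck = bottleneck               # extra ResNet bottleneck block on T
        self.batch_drop = BatchDrop(h_ratio, w_ratio)
        self.drop_pool = nn.AdaptiveMaxPool2d(1)   # global max pooling
        self.drop_reduce = nn.Sequential(          # 2048 -> 1024 reduction
            nn.Linear(2048, 1024), nn.BatchNorm1d(1024), nn.ReLU(inplace=True))

    def forward(self, images):
        t = self.backbone(images)                  # shared feature map T
        g = self.global_branch(t)                  # (N, 512) global feature
        d = self.batch_drop(self.bottleneck(t))    # erase the same block across the batch
        d = self.drop_pool(d).flatten(1)           # (N, 2048) after GMP
        d = self.drop_reduce(d)                    # (N, 1024) dropping-branch feature
        if self.training:
            return g, d                            # supervised separately during training
        return torch.cat([g, d], dim=1)            # concatenated descriptor at test time
```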

Loss function.

The loss function is the sum of the soft-margin batch-hard triplet loss [17] and the softmax loss, applied to both the global branch and the feature dropping branch.
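For reference, a minimal sketch of this combined objective is shown below (function names are illustrative); the soft-margin variant replaces the fixed triplet margin with log(1 + exp(.)), following [17].

```python
# A minimal sketch of the soft-margin batch-hard triplet loss combined with a
# softmax (cross-entropy) loss on both branches.
import torch
import torch.nn.functional as F

def batch_hard_soft_margin_triplet(features, labels):
    # Pairwise Euclidean distances between all embeddings in the batch.
    dist = torch.cdist(features, features, p=2)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values   # hardest positive
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values     # hardest negative
    # Soft margin: log(1 + exp(d_pos - d_neg)) instead of a fixed margin.
    return F.softplus(pos - neg).mean()

def bdb_loss(global_feat, drop_feat, global_logits, drop_logits, labels):
    triplet = (batch_hard_soft_margin_triplet(global_feat, labels)
               + batch_hard_soft_margin_triplet(drop_feat, labels))
    softmax = (F.cross_entropy(global_logits, labels)
               + F.cross_entropy(drop_logits, labels))
    return triplet + softmax
```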

CUHK03-Label CUHK03-Detect DukeMTMC-reID Market1501
Method Rank-1 mAP Rank-1 mAP Rank-1 mAP Rank-1 mAP
IDE [76] 22.2 21.0 21.3 19.7 67.7 47.1 72.5 46.0
PAN [81] 36.9 35.0 36.3 34.0 71.6 51.5 82.8 63.4
SVDNet [50] - - 41.5 37.3 76.7 56.8 82.3 62.1
DPFL [7] 43.0 40.5 40.7 37.0 79.2 60.0 88.9 73.1
HA-CNN [27] 44.4 41.0 41.7 38.6 80.5 63.8 91.2 75.7
SVDNet+Era [83] 49.4 45.0 48.7 37.2 79.3 62.4 87.1 71.3
TriNet+Era [83] 58.1 53.8 55.5 50.7 73.0 56.6 83.9 68.7
DaRe [60] 66.1 61.6 63.3 59.0 80.2 64.5 89.0 76.0
GP-reid [1] - - - - 85.2 72.8 92.2 81.2
PCB [51] - - 61.3 54.2 81.9 65.3 92.4 77.3
PCB + RPP [51] - - 62.8 56.7 83.3 69.2 93.8 81.6
MGN [58] 68.0 67.4 66.8 66.0 88.7 78.4 95.7 86.9
Baseline 52.6 49.9 51.1 47.9 81.0 62.8 91.6 77.1
Baseline+Triplet 67.4 61.5 63.6 60.0 83.8 68.5 93.1 80.6
BDB 73.6 71.7 72.8 69.3 86.8 72.1 94.2 84.3
BDB+Cut 79.4 76.7 76.4 73.5 89.0 76.0 95.3 86.7
Table 1: The comparison with the existing person re-ID methods. ‘Era’ means Random Erasing [83]. ‘Cut’ means Cutout [11].

4 Experiments

We verify our BDB Network on the benchmark person re-ID datasets. The BDB Network with different metric learning loss functions is also tested on the standard image retrieval datasets.

4.1 Person re-ID Experiments

4.1.1 Datasets and Settings

We evaluate on three widely used person re-ID datasets: Market-1501 [75], DukeMTMC-reID [39, 80], and CUHK03 [26]. We follow the same strategy as recent works [17, 51, 58] to generate training, query, and gallery data. Notice that the original CUHK03 dataset is divided into 20 random training/testing splits for cross validation, which is commonly used in hand-crafted feature based methods. The new partition method adopted in our experiments further splits the training and gallery images, and selects challenging query images for evaluation. Therefore, the CUHK03 dataset is the most challenging of the three.

During training, the input images are re-sized to a fixed resolution and then augmented by random horizontal flipping and normalization. In the Batch DropBlock layer, we set the erased height ratio to 0.3 and the erased width ratio to 1.0; the same setting is used on all the person re-ID datasets. The testing images are re-sized to the same resolution and only augmented with normalization.

For each query image, we rank all the gallery images in decreasing order of their Euclidean distances to the query image and compute the Cumulative Matching Characteristic (CMC) curve. We use Rank-1 accuracy and mean average precision (mAP) as the evaluation metrics. Results with the same identity and the same camera ID as the query image are not counted. It is worth noting that all the experiments are conducted in a single-query setting without re-ranking [2, 82] for simplicity.
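A minimal NumPy sketch of this single-query evaluation protocol is given below; `dist` is assumed to be the query-by-gallery Euclidean distance matrix and the function name is illustrative.

```python
# A minimal sketch of Rank-1 and mAP computation with same-identity,
# same-camera gallery entries removed for each query.
import numpy as np

def evaluate(dist, q_ids, q_cams, g_ids, g_cams):
    rank1_hits, aps = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                    # gallery sorted by distance
        # Drop gallery images with the same identity AND the same camera as the query.
        keep = ~((g_ids[order] == q_ids[i]) & (g_cams[order] == q_cams[i]))
        matches = (g_ids[order][keep] == q_ids[i]).astype(np.float32)
        if matches.sum() == 0:
            continue                                   # no valid gallery match
        rank1_hits.append(matches[0])                  # CMC Rank-1 hit or miss
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())  # average precision
    return np.mean(rank1_hits), np.mean(aps)
```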

4.1.2 Training

Our network is trained using 4 GTX 1080 GPUs with a batch size of 128. Each identity has 4 instance images in a batch, so there are 32 identities per batch. The backbone ResNet-50 is initialized from the ImageNet [10] pre-trained model. We use the batch-hard soft-margin triplet loss [17] to avoid margin parameters. We use the Adam optimizer [21] with the base learning rate set to 1e-3 after a linear warm-up [15] over the first 50 epochs, then decayed to 1e-4 after 200 epochs, and further decayed to 1e-5 after 300 epochs. The whole training procedure takes 400 epochs and approximately 1.5 hours.
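For reference, the learning rate schedule described above can be written as a small helper. This is a sketch under the stated settings; the exact warm-up starting value is an assumption.

```python
# A minimal sketch of the epoch-wise learning rate schedule: linear warm-up to
# the 1e-3 base rate over the first 50 epochs, then step decays at 200 and 300.
def learning_rate(epoch, base_lr=1e-3, warmup_epochs=50):
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs   # linear warm-up
    if epoch < 200:
        return base_lr                                  # 1e-3
    if epoch < 300:
        return base_lr * 0.1                            # 1e-4 after 200 epochs
    return base_lr * 0.01                               # 1e-5 after 300 epochs
```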

4.1.3 Comparison with State-of-the-Art

The statistical comparison between our BDB Network and the state-of-the-art methods on the CUHK03, DukeMTMC-reID, and Market-1501 datasets is shown in Table 1. It shows that our method achieves state-of-the-art performance on both the CUHK03 and DukeMTMC-reID datasets. Remarkably, our method achieves the largest improvement over previous methods on the CUHK03-Detect dataset, which is the most challenging one. On the Market-1501 dataset, our model achieves comparable performance to MGN [58]. However, it is worth pointing out that MGN relies on a much larger and more complex network which generates 8 feature vectors with 8 branches supervised by 11 loss functions. The model size (i.e., number of parameters) of MGN is three times that of the BDB Network.

Figure 5: The top-4 ranking list for the query images on CUHK03-Label dataset from the proposed BDB Network. The correct results are highlighted by green borders and the incorrect results by red borders.

Some sample query results are illustrated in Figure 5. We can see that, given a back view person image, BDB Network can even retrieve the front view and side view images of the same person.

4.1.4 Ablation Studies

We perform extensive experiments on the Market-1501 and CUHK03 datasets to analyze the effectiveness of each component and the impact of hyper-parameters in our method.

Method Rank-1 mAP
Global Branch (Baseline) 93.1 80.6
Feature Dropping Branch 93.6 83.3
Both Branches (BDB) 94.2 84.3
Feature Dropping Branch + Cut 88.0 75.7
BDB + Cut 95.3 86.7
Table 2: The effect of global branch and feature dropping branch on Market-1501 dataset. ‘Cut’ means Cutout [11] augmentation.
Benefit of Global Branch and Feature Dropping Branch.

Without the global branch, the BDB Network still performs better than the Baseline, as illustrated in Table 2. Adding the global branch further improves the performance. The motivation behind the two-branch structure of the BDB Network is that it learns both the most salient appearance cues and fine-grained discriminative features. This suggests that the two branches reinforce each other and are both important to the final performance.

Figure 6: The comparison with Dropout methods on two feature maps within the same batch.
Method Rank-1 mAP
SpatialDropout [52] 60.5 56.8
Dropout [47] 65.3 62.2
Batch Dropout 65.8 62.9
DropBlock [14] 70.6 67.7
Batch DropBlock 72.8 69.3
Table 3: The Comparison with other Dropout methods on the CUHK03-Detect dataset.
CUHK03-Detect Market1501
Method Rank-1 mAP Rank-1 mAP
Baseline 51.1 47.9 91.6 77.1
Baseline + Triplet 63.6 60.0 93.1 80.6
Baseline + Dropping 60.9 57.2 93.8 80.5
Baseline + Triplet + Dropping (BDB Network) 72.8 69.3 94.2 84.3
Table 4: Ablation studies of the effective components of BDB network on CUHK03-Detect and Market1501 datasets. ‘Dropping’ means the feature dropping branch.
CUHK03-Detect Market1501
Method Rank-1 mAP Rank-1 mAP
Baseline 63.6 60.0 93.1 80.6
Baseline + RE 70.6 65.9 93.3 81.5
Baseline + Cut 67.7 64.2 93.5 82.0
Baseline + RE + Cut 70.7 65.9 93.1 82.0
BDB 72.8 69.3 94.2 84.3
BDB + RE 75.9 72.6 94.4 85.0
BDB + Cut 76.4 73.5 95.3 86.7
Table 5: The comparison with data augmentation methods. ‘RE’ means Random Erasing [83]. ‘Cut’ means Cutout [11].
Comparison with Dropout and DropBlock.

Dropout [47] randomly drops values of the input tensor and is a widely used regularization technique to prevent overfitting. We replace the Batch DropBlock layer with various Dropout methods and compare their performance in Table 3. SpatialDropout [52] randomly zeroes whole channels of the input tensor; the channels to zero out are randomized on every forward call. Here, Batch Dropout means we select random spatial positions and drop all input features at these locations. The difference between Batch DropBlock and Batch Dropout is that Batch DropBlock zeroes a large contiguous area while Batch Dropout zeroes isolated positions. DropBlock [14] means that, for a batch of input tensors, each tensor drops a contiguous region at a random location. The difference between Batch DropBlock and DropBlock is that Batch DropBlock drops the same region for every input tensor within a batch while DropBlock crops out different regions. These Dropout variants are visualized in Figure 6. As shown in Table 3, Batch DropBlock is more effective than these Dropout strategies for the person re-ID task.
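For clarity, a minimal sketch of the 'Batch Dropout' variant is given below (illustrative names): it zeroes the same randomly chosen, isolated spatial positions for every feature map in the batch, in contrast to the single contiguous block zeroed by Batch DropBlock.

```python
# A minimal sketch of Batch Dropout: one random spatial mask of isolated
# positions, shared by every feature map in the batch.
import torch

def batch_dropout(x, drop_prob=0.3):
    # x: (N, C, H, W); the same mask is broadcast over the batch and channels.
    keep = (torch.rand(1, 1, x.size(2), x.size(3), device=x.device) > drop_prob).float()
    return x * keep
```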

Global Average Pooling (GAP) vs Global Max Pooling (GMP) in Feature Dropping Branch.

As shown in Figure 7 (b), the Rank-1 accuracy of the feature dropping branch with GMP is consistently superior to that with GAP. This demonstrates the importance of global max pooling for robust convergence and improved performance in the feature dropping branch.

Benefit of Triplet Loss.

The BDB Network is trained using both triplet loss and softmax loss. The triplet loss is a vital part of the BDB Network, since the Batch DropBlock layer takes effect only when relationships between images are considered. In Table 4, 'Baseline + Dropping' is the BDB Network without triplet loss. We can see that the triplet loss significantly improves the performance.

Figure 7: (a) The effects of erased height ratio on mAP and CMC scores. The erased width ratio is fixed to 1.0. (b) The comparison of global average pooling and global max pooling on the feature dropping branch under different height ratio settings. The statistics are analyzed on the CUHK03-Detect dataset.
Impact of Batch DropBlock Layer Hyper-parameters.

Figure 7 (a) studies the impact of the erased height ratio on the performance of the BDB Network. Here, the erased width ratio is fixed to 1.0 in all the person re-ID experiments. We can see that the best performance is achieved when the erased height ratio is 0.3, which is the setting used for the BDB Network in the person re-ID experiments.

Relationship with Data Augmentation methods.

A natural question is whether the BDB Network can still benefit from image erasing data augmentation methods such as Cutout [11] and Random Erasing [83], since they perform similar operations. The answer is yes: the BDB Network contains a global branch which sees the complete feature map, and this branch can benefit from Cutout or Random Erasing. To verify this, we apply image erasing augmentation to the BDB Network with and without the global branch in Table 2. We can see that Cutout performs poorly without the global branch. Table 5 shows that the BDB Network works well with these data augmentation methods: 'BDB + Cut' and 'BDB + RE' are significantly better than 'Baseline + Cut', 'Baseline + RE', and 'BDB' alone.

Dataset CARS CUB SOP Clothes
# images 16,185 11,788 120,053 52,712
# classes 196 200 22,634 11,735
# training class 98 100 11,318 3,997
# training image 8,054 5,864 59,551 25,882
# testing class 98 100 11,316 3,985
# testing image 8,131 5,924 60,502 26,830
Table 6: The statistics of the image retrieval datasets including CARS196 [22], CUB200-2011 [57], Stanford Online Products (SOP) [35], and the In-Shop Clothes Retrieval dataset [32]. Notice that the test set of the In-Shop Clothes Retrieval dataset is further split into a query set with 14,218 images and a gallery set with 12,612 images.
1 2 4 8
PDDM Triplet [18] 50.9 62.1 73.2 82.5
PDDM Quadruplet [18] 58.3 69.2 79.0 88.4
HDC [68] 60.7 72.4 81.9 89.2
Margin [62] 63.9 75.3 84.4 90.6
ABE-8 [20] 70.6 79.8 86.9 92.2
BDB 74.1 83.6 89.8 93.6
(a) CUB200-2011 (cropped) dataset
1 2 4 8
PDDM Triplet [18] 46.4 58.2 70.3 80.1
PDDM Quadruplet [18] 57.4 68.6 80.1 89.4
HDC [68] 83.8 89.8 93.6 96.2
Margin [62] 86.9 92.7 95.6 97.6
ABE-8 [20] 93.0 95.9 97.5 98.5
BDB 94.3 96.8 98.3 98.9
(b) CARS196 (cropped) dataset
1 10 20 30 40
FashionNet [32] 53.0 73.0 76.0 77.0 79.0
HDC [68] 62.1 84.9 89.0 91.2 92.3
DREML [65] 78.4 93.7 95.8 96.7 -
HTL [12] 80.9 94.3 95.8 97.2 97.4
A-BIER [36] 83.1 95.1 96.9 97.5 97.8
ABE-8 [20] 87.3 96.7 97.9 98.2 98.5
BDB 89.1 96.3 97.6 98.5 99.1
(c) In-Shop Clothes Retrieval dataset
1 10 100 1000
LiftedStruct [35] 62.1 79.8 91.3 97.4
N-Pairs [45] 67.7 83.8 93.0 97.8
Margin [62] 72.7 86.2 93.8 98.0
HDC [68] 69.5 84.4 92.8 97.7
A-BIER [36] 74.2 86.9 94.0 97.8
ABE-8 [20] 76.3 88.4 94.8 98.2
BDB 83.0 93.3 97.3 99.2
(d) Stanford online products dataset
Table 7: Comparison of Recall@K (%) scores with other state-of-the-art metric learning methods on the CUB200-2011 (cropped), CARS196 (cropped), In-Shop Clothes Retrieval, and Stanford Online Products datasets.
1 5 10 20
Baseline + LiftedStruct [35] 66.8 88.5 93.4 96.3
BDB + LiftedStruct [35] 71.4 89.7 93.9 96.3
Baseline + Margin [62] 65.7 88.1 93.1 96.4
BDB + Margin [62] 72.0 90.8 94.4 97.0
Baseline + Histogram [54] 64.6 87.2 93.0 96.4
BDB + Histogram [54] 73.1 90.7 94.2 96.9
Baseline + Hard Triplet [17] 69.5 89.5 94.0 96.8
BDB + Hard Triplet [17] 74.1 91.0 94.7 97.1
Table 8: The BDB Network performance with other standard metric learning loss functions. The statistics are based on the CUB200-2011 (cropped) dataset. "Baseline" refers to the ResNet-50 Baseline defined in Section 3.
Figure 8: The class activation map of Baseline and BDB Network on CARS196, CUB200-2011, In-Shop Clothes retrieval and SOP datasets.
Figure 9: The top-5 ranking list for the query images on CUB200-2011 dataset from BDB Network. The green and red borders respectively denote the correct and incorrect results.

4.2 Image Retrieval Experiments

The BDB Network structure can be applied directly to image retrieval problems.

4.2.1 Datasets and Settings

Our method is evaluated on commonly used image retrieval datasets including CUB200-2011 [57], CARS196 [22], Stanford Online Products (SOP) [35], and In-Shop Clothes Retrieval [32]. For CUB200-2011 and CARS196, the cropped versions are used since our BDB Network requires input images to be roughly aligned. The experimental setup is the same as that in [35]. We show the statistics of the four image retrieval datasets in Table 6.

The training images are padded and resized to 256x256 with the aspect ratio fixed, and then randomly cropped to 224x224. During testing, CUB200-2011, In-Shop Clothes Retrieval, and SOP images are padded on the shorter side and then scaled to 256x256, while CARS196 images are scaled to 256x256 directly. The dropped height ratio and width ratio are both set to 0.5 in the Batch DropBlock Layer. We use the standard Recall@K metric to measure the image retrieval performance.
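As a concrete reference, the training-time preprocessing can be sketched with torchvision-style transforms. This is an illustrative sketch under the description above; the padding helper and the omission of other augmentations are our assumptions.

```python
# A minimal sketch of the training-time preprocessing: pad the shorter side to
# make the image square, resize to 256x256, then take a random 224x224 crop.
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def pad_to_square(img):
    # Pad the shorter side with zeros so the aspect ratio is preserved.
    w, h = img.size
    d = abs(w - h)
    padding = [0, 0, d, 0] if w < h else [0, 0, 0, d]   # (left, top, right, bottom)
    return TF.pad(img, padding)

train_transform = T.Compose([
    T.Lambda(pad_to_square),
    T.Resize((256, 256)),
    T.RandomCrop(224),
    T.ToTensor(),
])
```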

4.2.2 Comparison with State-of-the-Art

Table 7 shows that our BDB Network achieves the best Recall@K scores on all the evaluated image retrieval datasets. In particular, the BDB Network achieves an obvious improvement (+3.5%) on the small-scale CUB200-2011 dataset, which is also the most challenging one. On the large-scale Stanford Online Products dataset, which contains 22,634 classes and 120,053 product images, our BDB Network surpasses the state-of-the-art by 6.7%. We can see that our BDB Network is applicable to both small- and large-scale datasets.

Figure 9 visualizes sample retrieval results on the CUB200-2011 (cropped) dataset. In Figure 1, we also present the class activation maps of the Baseline and our BDB Network on the CARS196 and CUB200-2011 datasets. We can see that our two-branch network encodes more comprehensive features with attentive detail features. This helps to explain why our BDB Network is to some extent robust to variations in illumination, pose, and occlusion.

4.2.3 Adapt to Other Metric Learning Methods

Table 8 shows that our BDB Network can also be used with other standard metric learning loss functions, such as lifted structure loss [35], weighted sampling margin loss [62], and histogram loss [54], to boost their performance. For a fair comparison, we re-implement the above loss functions on our ResNet-50 Baseline and BDB Network to evaluate their performance. Here, the only difference between the ResNet-50 Baseline and the BDB Network is that the BDB Network has an additional feature dropping branch. For the weighted sampling margin loss, although the ResNet-50 Baseline already outperforms the results reported in [62] (+1.8%), the BDB Network still improves the result by a large margin (+7.7%). We can therefore conclude that the proposed BDB Network can be easily generalized to other standard loss functions in metric learning.

5 Conclusion

In this paper, we propose Batch DropBlock to improve the training of neural networks for person re-ID and other general metric learning tasks. The corresponding BDB Network, which adopts this training mechanism, leverages a global branch to embed salient representations and a feature dropping branch to learn detailed features. Extensive experiments on both person re-ID and image retrieval datasets show that the BDB Network brings significant improvements on person re-ID and other general image retrieval benchmarks.

References

  • [1] J. Almazan, B. Gajic, N. Murray, and D. Larlus (2018) Re-id done right: towards good practices for person re-identification. arXiv:1801.05339. Cited by: Table 1.
  • [2] S. Bai, X. Bai, and Q. Tian (2017) Scalable person re-identification on supervised smoothed manifold. In CVPR, Cited by: §2, §4.1.1.
  • [3] I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis (2018) Looking beyond appearances: synthetic training data for deep cnns in re-identification. In CVIU, Cited by: §2.
  • [4] L. Bazzani, M. Cristani, A. Perina, M. Farenzena, and V. Murino (2010) Multiple-shot person re-identification by HPE signature. In ICCV, Cited by: §2.
  • [5] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang (2018) Group consistent similarity learning via deep CRF for person re-identification. In CVPR, Cited by: §2.
  • [6] W. Chen, X. Chen, J. Zhang, and K. Huang (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In CVPR, Cited by: §2.
  • [7] Y. Chen, X. Zhu, S. Gong, et al. (2018) Person re-identification by deep learning multi-scale representations. In ICCV, Cited by: Table 1.
  • [8] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In CVPR, Cited by: §1, §2, §2, §3.
  • [9] A. Das, A. Chakraborty, and A. K. Roy-Chowdhury (2014) Consistent re-identification in a camera network. In ECCV, Cited by: §2.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §4.1.2.
  • [11] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §2, Table 1, §4.1.4, Table 2, Table 5.
  • [12] W. Ge, W. Huang, D. Dong, and M. R. Scott (2018) Deep metric learning with hierarchical triplet loss. In ECCV, Cited by: 6(c).
  • [13] M. Geng, Y. Wang, T. Xiang, and Y. Tian (2018) Deep transfer learning for person re-identification. In BigMM, Cited by: §2.
  • [14] G. Ghiasi, T. Lin, and Q. V. Le (2018) DropBlock: a regularization method for convolutional networks. arXiv:1810.12890. Cited by: §1, §3, §3, §4.1.4, Table 5.
  • [15] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §4.1.2.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §3, §3, §3.
  • [17] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv:1703.07737. Cited by: Figure 3, §1, §2, §3, §4.1.1, §4.1.2, Table 8.
  • [18] C. Huang, C. C. Loy, and X. Tang (2016) Local similarity-aware deep feature embedding. In NIPS, Cited by: 6(a), 6(b).
  • [19] H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang (2018) Adversarially occluded samples for person re-identification. In CVPR, Cited by: §2.
  • [20] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based ensemble for deep metric learning. In ECCV, Cited by: §2, 6(a), 6(b), 6(c), 6(d).
  • [21] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.2.
  • [22] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In ICCV, Cited by: §1, §4.2.1, Table 6.
  • [23] V. Kumar, A. M. Namboodiri, M. Paluri, and C. Jawahar (2017) Pose-aware person recognition.. In CVPR, Cited by: §1, §2.
  • [24] X. Lan, H. Wang, S. Gong, and X. Zhu (2017) Deep reinforcement learning attention selection for person re-identification. In BMVC, Cited by: §2.
  • [25] D. Li, X. Chen, Z. Zhang, and K. Huang (2017) Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, Cited by: §2.
  • [26] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) DeepReID: deep filter pairing neural network for person re-identification. In CVPR, Cited by: §4.1.1.
  • [27] W. Li, X. Zhu, and S. Gong (2018) Harmonious attention network for person re-identification. In CVPR, Cited by: §1, §2, Table 1.
  • [28] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith (2013) Learning locally-adaptive decision functions for person verification. In CVPR, Cited by: §2.
  • [29] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In CVPR, Cited by: §2.
  • [30] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan (2017) End-to-end comparative attention networks for person re-identification. TIP. Cited by: §1, §2, §2.
  • [31] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang (2017) Hydraplus-net: attentive deep features for pedestrian analysis. In ICCV, Cited by: §1, §2.
  • [32] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In CVPR, Cited by: §1, §4.2.1, Table 6, 6(c).
  • [33] A. J. Ma, P. C. Yuen, and J. Li (2013) Domain transfer support vector ranking for person re-identification without target camera label information. In ICCV, Cited by: §2.
  • [34] A. Mignon and F. Jurie (2012) Pcca: a new approach for distance learning from sparse pairwise constraints. In CVPR, Cited by: §2.
  • [35] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In CVPR, Cited by: §1, §2, §3, §4.2.1, §4.2.3, Table 6, 6(d), Table 8.
  • [36] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence. Cited by: 6(c), 6(d).
  • [37] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian (2013) Local fisher discriminant analysis for pedestrian re-identification. In CVPR, Cited by: §2.
  • [38] A. Perina, V. Murino, M. Cristani, M. Farenzena, and L. Bazzani (2010) Person re-identification by symmetry-driven accumulation of local features. In CVPR, Cited by: §2.
  • [39] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, Cited by: §4.1.1.
  • [40] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In CVPR, Cited by: Figure 2, §1.
  • [41] A. Schumann and R. Stiefelhagen (2017) Person re-identification by deep learning attribute-complementary information. In CVPRW, Cited by: §2.
  • [42] Y. Shen, W. Lin, J. Yan, M. Xu, J. Wu, and J. Wang (2015) Person re-identification with correspondence structure learning. In ICCV, Cited by: §2.
  • [43] Y. Shen, H. Li, T. Xiao, S. Yi, D. Chen, and X. Wang (2018) Deep group-shuffling random walk for person re-identification. In CVPR, Cited by: §2, §2.
  • [44] J. Si, H. Zhang, C. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang (2018) Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, Cited by: §2.
  • [45] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In NIPS, Cited by: 6(d).
  • [46] H. O. Song, S. Jegelka, V. Rathod, and K. Murphy (2017) Deep metric learning via facility location. In CVPR, Cited by: §1.
  • [47] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR. Cited by: §4.1.4, Table 5.
  • [48] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian (2017) Pose-driven deep convolutional model for person re-identification. In ICCV, Cited by: §1, §2.
  • [49] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee (2018) Part-aligned bilinear representations for person re-identification. In ECCV, Cited by: §1.
  • [50] Y. Sun, L. Zheng, W. Deng, and S. Wang (2017) Svdnet for pedestrian retrieval. In ICCV, Cited by: §2, Table 1.
  • [51] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling(and a strong convolutional baseline). In ECCV, Cited by: §1, §2, §3, §3, §3, Table 1, §4.1.1.
  • [52] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler (2015) Efficient object localization using convolutional networks. In CVPR, Cited by: §4.1.4, Table 5.
  • [53] E. Ustinova, Y. Ganin, and V. Lempitsky (2017) Multi-region bilinear convolutional neural networks for person re-identification. In AVSS, Cited by: §2.
  • [54] E. Ustinova and V. Lempitsky (2016) Learning deep embeddings with histogram loss. In NIPS, Cited by: §1, §2, §3, §4.2.3, Table 8.
  • [55] R. R. Varior, M. Haloi, and G. Wang (2016) Gated siamese convolutional neural network architecture for human re-identification. In ECCV, Cited by: §2.
  • [56] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang (2016) A siamese long short-term memory architecture for human re-identification. In ECCV, Cited by: §1, §2.
  • [57] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §1, §4.2.1, Table 6.
  • [58] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou (2018) Learning discriminative features with multiple granularities for person re-identification. arXiv:1804.01438. Cited by: §1, §2, §3, §3, §3, Table 1, §4.1.1, §4.1.3.
  • [59] H. Wang, S. Gong, and T. Xiang (2014) Unsupervised learning of generative topic saliency for person re-identification. In BMVC, Cited by: §2.
  • [60] Y. Wang, L. Wang, Y. You, X. Zou, V. Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger (2018) Resource aware person re-identification across multiple resolutions. In CVPR, Cited by: Table 1.
  • [61] L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian (2017) Glad: global-local-alignment descriptor for pedestrian retrieval. In ACM MM, Cited by: §2.
  • [62] C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl (2017) Sampling matters in deep embedding learning. In ICCV, Cited by: §1, §2, §3, §4.2.3, 6(a), 6(b), 6(d), Table 8.
  • [63] Q. Xiao, H. Luo, and C. Zhang (2017) Margin sample mining loss: a deep learning based method for person re-identification. arXiv:1710.00478. Cited by: §2.
  • [64] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2017) Joint detection and identification feature learning for person search. In CVPR, Cited by: §1.
  • [65] H. Xuan, R. Souvenir, and R. Pless (2018) Deep randomized ensembles for metric learning. In ECCV, Cited by: 6(c).
  • [66] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li (2014) Salient color names for person re-identification. In ECCV, Cited by: §2.
  • [67] H. Yao, S. Zhang, Y. Zhang, J. Li, and Q. Tian (2017) Deep representation learning with part loss for person re-identification. arXiv:1707.00798. Cited by: §2.
  • [68] Y. Yuan, K. Yang, and C. Zhang (2017) Hard-aware deeply cascaded embedding. In ICCV, Cited by: 6(a), 6(b), 6(c), 6(d).
  • [69] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun (2017) Alignedreid: surpassing human-level performance in person re-identification. arXiv:1711.08184. Cited by: §2, §2.
  • [70] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang (2017) Spindle net: person re-identification with human body region guided feature decomposition and fusion. In CVPR, Cited by: §2.
  • [71] L. Zhao, X. Li, Y. Zhuang, and J. Wang (2017) Deeply-learned part-aligned representations for person re-identification.. In ICCV, Cited by: §1, §2, §2.
  • [72] R. Zhao, W. Ouyang, and X. Wang (2013) Unsupervised salience learning for person re-identification. In CVPR, Cited by: §2.
  • [73] F. Zheng and L. Shao (2016) Learning cross-view binary identities for fast person re-identification.. In IJCAI, Cited by: §2.
  • [74] L. Zheng, Y. Huang, H. Lu, and Y. Yang (2017) Pose invariant embedding for deep person re-identification. arXiv:1701.07732. Cited by: §1, §2.
  • [75] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In ICCV, Cited by: §4.1.1.
  • [76] L. Zheng, Y. Yang, and A. G. Hauptmann (2016) Person re-identification: past, present and future. arXiv:1610.02984. Cited by: §2, Table 1.
  • [77] W. Zheng, S. Gong, and T. Xiang (2013) Reidentification by relative distance comparison. PAMI. Cited by: §2.
  • [78] W. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong (2015) Partial person re-identification. In ICCV, Cited by: §2.
  • [79] Z. Zheng, L. Zheng, and Y. Yang (2017) A discriminatively learned cnn embedding for person reidentification. In TOMM, Cited by: §2.
  • [80] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In ICCV, Cited by: §2, §4.1.1.
  • [81] Z. Zheng, L. Zheng, and Y. Yang (2018) Pedestrian alignment network for large-scale person re-identification. In TCSVT, Cited by: Table 1.
  • [82] Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In CVPR, Cited by: §2, §4.1.1.
  • [83] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. arXiv:1708.04896. Cited by: Table 1, §4.1.4, Table 5.
  • [84] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR, Cited by: §1.