Parameter-Free Spatial Attention Network for Person Re-Identification

Haoran Wang
Xidian University
wanghaoran@stu.xidian.edu.cn
The first three authors contributed equally to this work.
Yue Fan
Max Planck Institute for Informatics
Saarland Informatics Campus
yfan@mpi-inf.mpg.de
Zexin Wang
Xidian University
zexinwang2016@gmail.com
   Licheng Jiao
Xidian University
lchjiao@mail.xidian.edu.cn
   Bernt Schiele
Max Planck Institute for Informatics
Saarland Informatics Campus
schiele@mpi-inf.mpg.de
Abstract

Global average pooling (GAP) makes it possible to localize discriminative information for recognition [40]. While GAP helps the convolutional neural network to attend to the most discriminative features of an object, it may suffer if that information is missing, e.g. due to camera viewpoint changes. To circumvent this issue, we argue that it is advantageous to attend to the global configuration of the object by modeling spatial relations among high-level features. We propose a novel architecture for Person Re-Identification built around a parameter-free spatial attention layer that introduces spatial relations among the feature map activations back into the model. Our spatial attention layer consistently improves the performance over the model without it. Results on four benchmarks demonstrate the superiority of our model over the state of the art, achieving rank-1 accuracy of 94.7% on Market-1501, 89.0% on DukeMTMC-ReID, 74.9% on CUHK03-labeled and 69.7% on CUHK03-detected.

1 Introduction

The aim of Person Re-Identification (Re-ID) is to match people from different camera viewpoints. The task is challenging because of significant pose-variations, frequent occlusions, and different camera viewpoints.

Figure 1: Class activation maps (CAM) for four training images. In each example, from left to right are the original image, the CAM from GAP with our proposed parameter-free spatial attention and the CAM from plain GAP. The highlighted area is less concentrated with the help of the spatial attention.

Global Average Pooling (GAP) [14] is a well-known technique to reduce the number of parameters in the fully-connected layer. GAP often improves model performance due to its regularizing effect on model capacity. Zhou et al. [40] analyzed the ability of the GAP layer to localize important regions of the input image even when the convolutional neural network is trained merely on image-level labels. This localization can be thought of as an implicit attention mechanism. At the same time, they proposed the class activation map (CAM) as a generic visualization method to show which regions of the image the model attends to when making a prediction. The third image in each example in figure 1 shows class activation maps using plain GAP when the corresponding person is recognized. We can see that the highlighted area from GAP is always spatially concentrated. If the model fails to put the concentration onto the most discriminative region, or if this region is missing in another camera view, then it may fail to identify that person. For example, GAP focuses on the waist of the first person in figure 1, and if the person turns around in another image, the model might fail to match that person. This issue is less pronounced for general image classification, where classes are relatively distinctive and the absence of certain features is not necessarily fatal. Person Re-ID, however, is rather a fine-grained classification task, where the combination of many details and feature patterns of the object matters. Thus, the model should focus on the overall feature patterns and their relations. In section 3.3 we show that the spatially concentrated attention of GAP is due to the lack of spatial relations among the activations on the feature map, and we thus propose a new model to re-introduce these relations.

Attention-based methods are particularly promising for modeling spatial relations [12, 22, 31, 19, 33, 23]. Attention mechanisms [1] equip the model with the ability to focus on the most informative regions rather than the entire image during recognition. Inspired by these mechanisms, we propose a parameter-free spatial attention layer to incorporate the spatial relation into GAP. The second image in each example in figure 1 shows how the class activation maps change when our spatial attention layer is employed. From the images we can observe that the attention is, on the one hand, more distributed over the image than for GAP, but, on the other hand, also covers the regions that are attended to by GAP. Such attention can thus be more robust to partial occlusions or camera viewpoint changes.

Figure 2: The structure of a common classifier with GAP is shown on the right; the modification on the left is that a spatial attention layer (SA) is inserted before GAP.

Figure 2 shows where the spatial attention layer (SA) is added in a common GAP-based classifier. Instead of passing the feature map from the last convolution layer directly through GAP before the fully-connected layer, we first assign different importance to different spatial positions with the help of the spatial attention layer. The importance of a spatial position is computed from the total intensity of the activations along the channel axis at that position. The layer therefore regularizes GAP in a complementary way, in which all learned filters compete with each other to become more important. This leads to a more distributed attention that covers more details of the image. We will show in section 3.3 that this is achieved by introducing the spatial relation on the feature map. From another perspective, the importance of an activation in a GAP-based model is determined only by its own intensity; in our case, it is also enhanced by the other activations along the channel axis at the same spatial position, which in turn stabilizes the gradient flow, as discussed in section 3.4.

Another advantage of the spatial attention layer over previous attention-based methods is that this module is entirely parameter-free. The additional computation is proportional only to the number of times the layer appears in the model; it does not grow with the size of the dataset or of the model. Despite its simple structure, we show experimentally in section 5.3 that it consistently improves the performance over the model without it.

Overall, our contribution is threefold. 1) We introduce a parameter-free spatial attention layer which yields consistent improvements over plain GAP. 2) We propose a novel architecture with the spatial attention layer for Person Re-ID and achieve state-of-the-art performance on four benchmarks. 3) We analytically compare GAP with and without spatial attention for a better understanding of its behavior.

2 Related Work

Weakly Supervised Learning. There are many weakly supervised learning approaches using various pooling operations to localize objects, thus making use of more details on the feature map for better classification [6, 16, 13, 15].

Zhou et al. [40] first showed the ability of GAP to localize the most discriminative image region. Instead of GAP, WELDON [7] provides a more robust and automatic activation selection strategy for pooling by selecting multiple high- and low-score regions from the last feature map, which is a generalization of the min + max prediction function in [6]. However, they only selected the top positive and negative instances with the highest and lowest activations and then aggregated them. In contrast, we use all of the activations on the last feature map while the softmax function imposes a fixed budget on the total importance. Note that both are parameter-free methods.

[5] also adopted a similar idea for spatial pooling by introducing an extra hyper-parameter to trade off relative importance between positive and negative instances. But still, it doesn’t take all activations into consideration. In our case, the importance of an activation is determined by its relative magnitude to others.

Based on the observation that lower convolution layers learn redundant filters to extract both positive and negative information due to the non-negative output of ReLU, Shang et al. [18] and Blot et al. [2] proposed a Concatenated Rectified Linear Unit that preserves both positive and negative information by concatenating a negated copy of the feature map before ReLU, thus preventing the model from learning highly negatively-correlated pairs of filters. This has a similar effect to our spatial attention layer, which forces the filters to compete with each other by setting a budget on the total importance with the help of the softmax function.

Figure 3: The proposed architecture formulates the task as a classification problem. It consists of four components. The yellow region represents the backbone feature extractor. The red region represents the deeply supervised branches (DS). The blue region represents the six part classifiers (P) [25]. The two green regions represent two sets of spatial attention layers (SA); SA1 is not used for the main results and only appears in section 5.3 for the ablation study. The total loss is the summation over all deep supervision losses, the six part losses and the loss from the backbone. Note that the spatial attention is only added before GAP.

Attention in Person Re-ID. The attention mechanism proposed in [1] has shown great success for a broad range of computer vision tasks. There also exist many attention-based approaches for Person Re-ID.

One way to introduce attention is to insert a trainable layer into the main body of the model [12]. However, this allows the model to attend not only to different spatial locations but also to different channels with varying weights, which means the model has to decide for every single activation inside a feature map whether it is worth attending to. On the one hand, this gives the model a lot of flexibility to exploit the most valuable features; on the other hand, it can be harmful if no proper regularization technique is applied to reduce overfitting. Another related work is from Hu et al. [9], where the feature map was first squeezed by GAP and then fed through two fully-connected layers before being rescaled by a sigmoid function. The resulting vector had the same length as the number of channels of the original feature map, and each entry was multiplied back onto the corresponding channel. It differs from our work in three aspects: we use softmax rather than sigmoid for the resulting vector; the attention in their work was spread mainly among the channels rather than the spatial positions; and our parameter-free module is only used before the last GAP layer.

Deep Supervision and Part-Level Feature. The idea of deep supervision was first proposed by Lee et al. [10], where a progressively decaying weight was adopted for the auxiliary losses, which eventually went to zero. The auxiliary losses can lead to more discriminative intermediate features. Wang et al. [28] first introduced this technique to the Person Re-ID task. They took the summation over all the losses as their final loss, which implicitly assigned equal weight to each individual loss. In this work, we use a fixed proportion between the main and auxiliary losses (see section 4 for the exact weights). Empirically, this achieves similar results to the decay strategy in [10].

Part-level features have shown strong performance in Person Re-ID due to their robustness to partial occlusion and camera view changes [25, 20, 35, 21]. Sun et al. presented a strong baseline model in [25], where the last feature map is uniformly partitioned into six horizontal stripes, each of which is average-pooled into a part-level feature vector. Six independent losses are then applied, one per part, and the final loss is simply the summation of all six losses. We leverage this technique along with a seventh global loss which is obtained by taking the six horizontal stripes as a whole.

3 Proposed Method

This section proposes a novel architecture for Person Re-ID that is, on the one hand, a novel combination of powerful components proposed in the literature and, on the other hand, contains our novel parameter-free spatial attention layer, resulting in consistent improvements.

3.1 Network Architecture

Figure 4: The parameter-free spatial attention layer has no trainable parameters. It assigns different importance to different locations and consists of a summation and a softmax: the activations at each spatial position are first summed along the channel axis, a softmax is then computed over all spatial positions, and the resulting values are multiplied with all activations at the corresponding locations.

We formulate the Person Re-ID task as a classification problem. Our overall architecture leverages several established and powerful techniques and is enhanced by our novel spatial attention layer. In particular, we adopt a standard ResNet-50 up to stage 4 as the backbone, as it has shown good performance for our task [25, 28, 32]; we employ deep supervision due to its ability to also influence the lower feature layers of the network [28]; and we partition the last layer according to PCB [25], which uniformly partitions the last feature map from the backbone in order to produce a loss for each part-level feature. PCB has also been shown to give competitive results on the Person Re-ID task. Combining the ideas from [28] and [25], we propose the architecture shown in figure 3. The yellow part is the backbone ResNet-50. The red part represents the deep supervision branches, for which the spatial attention layers (the left green part) are added before GAP. The blue part denotes the part classifiers from [25]. Each part classifier produces a loss based on its part-level feature. Note that the right spatial attention layer in green is not part of the main model; it is only used in the ablation study in section 5.3.

The input for every deeply supervised branch is a feature map from the corresponding stage of the backbone. The spatial attention layer first assigns different importance to different locations. The processed feature map is then fed into a GAP layer followed by a fully-connected layer before a cross-entropy loss is computed. The losses calculated from the three intermediate feature maps are added to the final loss. Therefore, our final loss consists of three auxiliary losses $L_{\mathrm{aux}}^{(i)}$, six partition losses $L_{\mathrm{part}}^{(p)}$ and the main loss $L_{\mathrm{main}}$, that is

$$L = \lambda \sum_{p=1}^{6} L_{\mathrm{part}}^{(p)} + (1-\lambda)\Big(\sum_{i=1}^{3} L_{\mathrm{aux}}^{(i)} + L_{\mathrm{main}}\Big),$$

where $\lambda$ controls the proportion of the losses. $L_{\mathrm{main}}$ uses the same weight as the auxiliary losses because the backbone can also be thought of as another deeply supervised branch. The only difference is that the feature vectors obtained from stage 4 are the ones used at inference time.

During training, the yellow backbone takes a resized person image as input. The feature map extracted at the last layer is vertically partitioned into six segments, over each of which GAP is performed, yielding six part-level feature vectors. Each part-level feature is then convolved with a shared 1x1 filter that reduces its dimension from 2048 to 256 before an independent linear classifier with a cross-entropy loss is applied. Up to this point, six partition losses are computed. Taking all reduced part-level features as a whole, another GAP is applied to obtain a global feature which is then used to compute the main loss. The partition losses should be thought of as a dropout [24] for high-level features. They have a significant regularization effect on the model because each partition loss is independent of the others, thus forcing each part-level feature to be discriminative enough to yield the correct prediction. This is more efficient than random erasing [39] since the latter is applied at the pixel level.
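As a rough illustration of the part branch just described (six-way partitioning along the height, per-part GAP, a shared 1x1 reduction to 256 dimensions, and a global feature obtained from all parts), the following NumPy sketch mirrors the computation; the function name, the random reduction weights and the example shapes are our assumptions, not the released code.

```python
import numpy as np

def part_level_features(F, w_reduce, n_parts=6):
    """Sketch of the part branch: split the backbone feature map into horizontal
    stripes, GAP each stripe, then reduce the dimension with a shared 1x1
    convolution (a plain matrix product here, since each pooled part is 1x1).

    F:        backbone feature map of shape (C, H, W), e.g. (2048, 24, 8).
    w_reduce: shared reduction weights of shape (C, 256).
    Returns (parts, global_feat): per-part 256-d features and their average.
    """
    stripes = np.array_split(F, n_parts, axis=1)        # split along the height axis
    parts = np.stack([s.mean(axis=(1, 2)) @ w_reduce    # per-stripe GAP, then 1x1 reduction
                      for s in stripes])                # shape (n_parts, 256)
    global_feat = parts.mean(axis=0)                    # "the six stripes taken as a whole"
    return parts, global_feat

# Tiny usage example with random activations and random reduction weights.
rng = np.random.default_rng(0)
F = np.maximum(rng.standard_normal((2048, 24, 8)), 0.0)
w_reduce = rng.standard_normal((2048, 256)) * 0.01
parts, global_feat = part_level_features(F, w_reduce)
print(parts.shape, global_feat.shape)   # (6, 256) (256,)
```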

During inference, all the deeply supervised branches and the part classifiers on top are discarded. The backbone is then used as a feature extractor and applied to all query and gallery images to obtain their features. Standard retrieval is then performed based on the Euclidean distance between these features.
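A minimal sketch of this retrieval step, assuming feature matrices `query_feats` and `gallery_feats` have already been extracted by the backbone, could look as follows.

```python
import numpy as np

def rank_gallery(query_feats, gallery_feats):
    """Rank gallery images for every query by Euclidean distance (closest first).

    query_feats:   (n_q, d) array of query features.
    gallery_feats: (n_g, d) array of gallery features.
    Returns an (n_q, n_g) array of gallery indices sorted by distance.
    """
    # Squared distances via ||q - g||^2 = ||q||^2 + ||g||^2 - 2 <q, g>.
    d2 = (np.square(query_feats).sum(axis=1, keepdims=True)
          + np.square(gallery_feats).sum(axis=1)
          - 2.0 * query_feats @ gallery_feats.T)
    return np.argsort(d2, axis=1)
```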

3.2 Parameter-Free Spatial Attention Layer

The spatial relation among the activations is helpful to distribute the model's attention. However, GAP treats the activations on the same feature map equally regardless of their locations, making the model less robust to the absence of certain features. Our spatial attention layer adds this information back into the model. Figure 4 shows the details. The first operation is a summation over the channels for each spatial position, which indicates the importance of that position. To make the learned filters more competitive, softmax rather than sigmoid is then applied to the summed activations. This encourages different filters to learn different features, thus making the model more robust. The output of the softmax is then used to rescale the magnitude of the activations on the original feature map. This implies that the importance of an activation at a given spatial position can be enhanced by the other activations along the channel axis at that position. All activations from the same spatial position are rescaled by the same weight, while different spatial positions have different weights. The spatial relation is taken into account because the weight is proportional to the relative magnitude of the summed activations at the corresponding position.

Formally, given a feature map $F \in \mathbb{R}^{C\times H\times W}$, a summation along the channel axis yields a 2-D matrix $S \in \mathbb{R}^{H\times W}$, where $S_{h,w} = \sum_{c=1}^{C} F_{c,h,w}$. Then a softmax function is applied to the flattened matrix $S$ in order to assign each spatial position a value $a_{h,w}$ indicating the degree of importance of that location. These values are multiplied with all the activations of $F$ along the channel axis at the corresponding spatial positions. Therefore, the output $\tilde{F}$ of a parameter-free spatial attention layer can be written as:

$$\tilde{F}_{c,h,w} = a_{h,w}\, F_{c,h,w}, \qquad (1)$$

where $a_{h,w} = \dfrac{\exp(S_{h,w})}{\sum_{h'=1}^{H}\sum_{w'=1}^{W} \exp(S_{h',w'})}$.
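For concreteness, the following NumPy sketch implements equation (1) as written above; the function name `spatial_attention` and the example shapes are our own choices for illustration, not taken from the paper's implementation.

```python
import numpy as np

def spatial_attention(F):
    """Parameter-free spatial attention, a sketch of equation (1).

    F: feature map of shape (C, H, W), e.g. the (non-negative) output of a ReLU.
    Returns the re-weighted feature map of the same shape.
    """
    S = F.sum(axis=0)                          # channel-wise sum, shape (H, W)
    a = np.exp(S - S.max())                    # softmax over all spatial positions
    a = a / a.sum()                            # a[h, w] sums to 1 over the map
    return a[None, :, :] * F                   # rescale every channel by a[h, w]

# Example: GAP with and without spatial attention on a random feature map.
rng = np.random.default_rng(0)
F = np.maximum(rng.standard_normal((256, 24, 8)), 0.0)
gap_plain = F.mean(axis=(1, 2))                        # plain GAP, shape (256,)
gap_with_sa = spatial_attention(F).mean(axis=(1, 2))   # GAP after SA, shape (256,)
```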

3.3 Impact on the Class Activation Map

The class activation map (CAM) proposed in [40] visualizes the attention distribution over the input image when the model identifies a specific class. As shown in figure 1, the spatial attention layer helps to distribute the focus of the model to other regions besides the most discriminative one. For the first person, the class activation map of plain GAP highlights the waist and the feet. With spatial attention, other parts of the body which are also informative are highlighted as well.

Following the notation from section 3.2, $F \in \mathbb{R}^{C\times H\times W}$ is the feature map from the last convolution layer. If a fully-connected layer is applied directly to $F$ for classification, the logit $y_k$ for a specific class $k$ is given as:

$$y_k = \sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} w^{k}_{c,h,w}\, F_{c,h,w}, \qquad (2)$$

where $w^{k}_{c,h,w}$ is the weight connecting the logit $y_k$ and the activation $F_{c,h,w}$. Therefore, the class activation map can be defined as:

$$M_k(h,w) = \sum_{c=1}^{C} w^{k}_{c,h,w}\, F_{c,h,w}. \qquad (3)$$

If Global Average Pooling is employed, the result after applying GAP can be computed as $f_c = \frac{1}{HW}\sum_{h,w} F_{c,h,w}$. The corresponding logit for class $k$ is then:

$$y_k = \sum_{c=1}^{C} w^{k}_{c}\, f_c = \frac{1}{HW}\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} w^{k}_{c}\, F_{c,h,w}. \qquad (4)$$

This leads to the following form of the class activation map:

$$M_k(h,w) = \sum_{c=1}^{C} w^{k}_{c}\, F_{c,h,w}. \qquad (5)$$

The advantage of GAP lies in its regularization effect. After adding the GAP layer, the fully-connected layer keeps only one weight $w^{k}_{c}$ per channel and class instead of one weight $w^{k}_{c,h,w}$ per activation, which means GAP reduces the model capacity by enforcing the weight to be identical for every spatial position $(h,w)$, thus preventing overfitting. But from another aspect, it can also be harmful if the spatial relation across the high-level features is completely ignored, which resembles the spirit of bag-of-words methods [26] and thus suffers from the same drawback. Ideally, we want a compromise between these two extreme cases, where GAP neglects the spatial relation while a directly applied fully-connected layer lacks effective restrictions. To this end, the parameter-free spatial attention layer combines the advantages of both sides.

The formula for the logit with the parameter-free spatial attention layer can be written as:

$$y_k = \sum_{c=1}^{C} w^{k}_{c}\, \tilde{f}_c = \frac{1}{HW}\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} w^{k}_{c}\, a_{h,w}\, F_{c,h,w}, \qquad (6)$$

where $\tilde{f}_c = \frac{1}{HW}\sum_{h,w} \tilde{F}_{c,h,w}$ and $\tilde{F}$ is the output from the parameter-free spatial attention layer. Therefore, the class activation map will be:

$$M_k(h,w) = \sum_{c=1}^{C} w^{k}_{c}\, a_{h,w}\, F_{c,h,w}. \qquad (7)$$

It can be thought of as a factorization of $w^{k}_{c,h,w}$ into $w^{k}_{c} \cdot a_{h,w}$, where $w^{k}_{c}$ controls the inter-channel attention and $a_{h,w}$ controls the spatial attention. The assumption behind this is that the spatial and channel attention are independent of each other. This factorization decouples the two while maintaining the spatial term. The magnitude of $a_{h,w}$ depends on the number and the intensity of the activations across all channels at that position. Typically, a large $a_{h,w}$ indicates a region of interest in the input image which contains many of the desired features; thus more attention should be paid to it.
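To make the difference between equations (5) and (7) explicit, the short NumPy sketch below computes both maps for a given feature map; the array `w_c` stands in for the class weights $w^{k}_{c}$ and is assumed here rather than learned.

```python
import numpy as np

def cam_plain(F, w_c):
    """CAM for plain GAP, equation (5): M_k(h, w) = sum_c w_c[c] * F[c, h, w]."""
    return np.tensordot(w_c, F, axes=([0], [0]))         # shape (H, W)

def cam_with_sa(F, w_c):
    """CAM with spatial attention, equation (7): the factor a_{h,w} re-enters the map."""
    S = F.sum(axis=0)
    a = np.exp(S - S.max())
    a = a / a.sum()                                      # softmax over spatial positions
    return a * np.tensordot(w_c, F, axes=([0], [0]))     # element-wise a_{h,w} factor
```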

3.4 Backpropagation through Parameter-Free Spatial Attention Layer

This section aims to justify the parameter-free spatial attention layer in terms of the gradient flow.

Given the last feature map $F$, the common strategy is to compute the pooled feature $f_c = \frac{1}{HW}\sum_{h,w} F_{c,h,w}$ and the logit for a specific class $k$ as $y_k = \sum_{c} w^{k}_{c}\, f_c$. Therefore, the gradient for an activation $F_{c,h,w}$ in the model is given as:

$$\frac{\partial y_k}{\partial F_{c,h,w}} = \frac{w^{k}_{c}}{HW}.$$

When the parameter-free spatial attention layer (SA) is used, the forward pass becomes $\tilde{F}_{c,h,w} = a_{h,w}\, F_{c,h,w}$, $f_c = \frac{1}{HW}\sum_{h,w} \tilde{F}_{c,h,w}$, $y_k = \sum_{c} w^{k}_{c}\, f_c$. The corresponding backward pass will be:

$$\frac{\partial y_k}{\partial F_{c,h,w}} = \sum_{c'=1}^{C}\sum_{h'=1}^{H}\sum_{w'=1}^{W} \frac{w^{k}_{c'}}{HW}\, \frac{\partial \tilde{F}_{c',h',w'}}{\partial F_{c,h,w}}.$$

The only difference between the two backward passes is the extra intermediate term $\frac{\partial \tilde{F}_{c',h',w'}}{\partial F_{c,h,w}}$ that scales the gradients before they are passed further.

Following the notation defined in section 3.2, we can compute the gradient of $\tilde{F}_{c',h',w'}$ with respect to $F_{c,h,w}$ using the chain rule:

$$\frac{\partial \tilde{F}_{c',h',w'}}{\partial F_{c,h,w}} = \frac{\partial a_{h',w'}}{\partial S_{h,w}}\,\frac{\partial S_{h,w}}{\partial F_{c,h,w}}\, F_{c',h',w'} + a_{h',w'}\, \frac{\partial F_{c',h',w'}}{\partial F_{c,h,w}},$$

where $\frac{\partial S_{h,w}}{\partial F_{c,h,w}} = 1$ and the latter part does not vanish only if $(c',h',w') = (c,h,w)$.

The derivative of the softmax function is given as:

$$\frac{\partial a_{h',w'}}{\partial S_{h,w}} = a_{h',w'}\left(\mathbb{1}\!\left[(h',w')=(h,w)\right] - a_{h,w}\right).$$

Plugging in the above equations, we can conclude that

$$\frac{\partial \tilde{F}_{c',h',w'}}{\partial F_{c,h,w}} =
\begin{cases}
a_{h,w}\,(1-a_{h,w})\, F_{c,h,w} + a_{h,w}, & c'=c,\ (h',w')=(h,w),\\[2pt]
a_{h,w}\,(1-a_{h,w})\, F_{c',h,w}, & c'\neq c,\ (h',w')=(h,w),\\[2pt]
-\,a_{h',w'}\, a_{h,w}\, F_{c',h',w'}, & (h',w')\neq(h,w).
\end{cases}$$

The derivation shows that the gradient which flows backwards through a certain activation is affected by three different sources: the activations at the same spatial position but in different channels, the activation itself, and all remaining activations. Since $a$ and $F$ (the output of a ReLU) are always non-negative, the first two contributions are non-negative, which means that the activations from other channels at the same spatial position boost the gradient flow according to their magnitude. The remaining activations, on the contrary, always contribute with the opposite sign, indicating a cancel-out effect on the gradient. Therefore, the layer stabilizes the gradient in a competitive way so that the training procedure suffers less from the vanishing or exploding gradient problem, which leads to fast convergence.
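The case analysis above can be verified numerically. The following sketch (our own, mirroring the spatial attention definition from section 3.2) compares the closed-form Jacobian with central finite differences on a tiny feature map.

```python
import numpy as np

def sa_forward(F):
    """Spatial attention forward pass; returns the output and the weights a."""
    S = F.sum(axis=0)
    a = np.exp(S - S.max())
    a = a / a.sum()
    return a[None] * F, a

def sa_jacobian(F):
    """Closed-form d SA(F)[c', h', w'] / d F[c, h, w] from the three cases above."""
    C, H, W = F.shape
    _, a = sa_forward(F)
    J = np.zeros((C, H, W, C, H, W))
    for cp in range(C):
        for hp in range(H):
            for wp in range(W):
                for c in range(C):
                    for h in range(H):
                        for w in range(W):
                            if (hp, wp) == (h, w):
                                J[cp, hp, wp, c, h, w] = a[h, w] * (1 - a[h, w]) * F[cp, h, w]
                                if cp == c:
                                    J[cp, hp, wp, c, h, w] += a[h, w]
                            else:
                                J[cp, hp, wp, c, h, w] = -a[hp, wp] * a[h, w] * F[cp, hp, wp]
    return J

# Central finite-difference check on a tiny feature map.
rng = np.random.default_rng(0)
F = np.maximum(rng.standard_normal((2, 3, 2)), 0.0)
J = sa_jacobian(F)
J_num = np.zeros_like(J)
eps = 1e-6
for c in range(F.shape[0]):
    for h in range(F.shape[1]):
        for w in range(F.shape[2]):
            Fp, Fm = F.copy(), F.copy()
            Fp[c, h, w] += eps
            Fm[c, h, w] -= eps
            J_num[..., c, h, w] = (sa_forward(Fp)[0] - sa_forward(Fm)[0]) / (2 * eps)
print(np.abs(J - J_num).max())   # expected to be tiny (finite-difference error only)
```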

4 Experimental Setup

Datasets and Protocols. The experiments are evaluated on four large-scale benchmarks: Market-1501 [34], DukeMTMC-ReID [36, 17], and CUHK03-NP [11] in both its labeled and detected versions.

Market-1501 contains 32,668 annotated bounding boxes of 1,501 identities taken from six different camera views. The identities are split into 751 training IDs and 750 query IDs. There are in total 3,368 query images, each of which is randomly selected from each camera so that a cross-camera search can be performed. The retrieval for each query image is conducted over a gallery of 19,732 images including 6,796 junk images. DukeMTMC-ReID uses the same format and evaluation protocol as Market-1501. It contains 16,522 training images of 702 identities, 2,228 query images of the other 702 identities and 17,661 gallery images from 8 high-resolution cameras. CUHK03-NP is re-formulated from the old CUHK03 dataset and is designed for the new training/testing protocol proposed by Market-1501 [34]. It contains 767 identities with 7,368 images for training and 700 identities with 5,328 images for testing. Each identity is observed by two non-overlapping cameras. The difficulty is thus increased since there are fewer ground-truth images for a single query.

The evaluation protocol proposed in [34] is used for Market-1501 and DukeMTMC-ReID. The Cumulative Matching Characteristic (CMC) for rank-1, rank-5 and the mean average precision (mAP) are measured. The new protocol from [11] is employed for CUHK03-NP dataset. Note that all the following results are evaluated under the single-query mode.
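For reference, a simplified NumPy sketch of rank-1 accuracy and mAP computed from a query-to-gallery distance matrix is given below; it omits the junk-image and same-camera filtering of the official protocol, so it only approximates the evaluation code.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """Simplified rank-1 accuracy and mAP from an (n_q, n_g) distance matrix.

    A gallery image counts as a match when its identity equals the query identity.
    The official protocol additionally removes junk and same-camera images,
    which is skipped here for brevity.
    """
    order = np.argsort(dist, axis=1)                     # closest gallery entries first
    matches = gallery_ids[order] == query_ids[:, None]   # boolean matrix (n_q, n_g)
    rank1 = matches[:, 0].mean()
    aps = []
    for row in matches:
        hits = np.flatnonzero(row)                       # ranks (0-based) of true matches
        if hits.size == 0:
            continue                                     # query without any ground truth
        precisions = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precisions.mean())                    # average precision for this query
    return rank1, float(np.mean(aps))
```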

Datasets Market-1501 DukeMTMC-ReID CUHK03 labeled CUHK03 detected
Metrics(%) R-1 mAP R-1 mAP R-1 mAP R-1 mAP
SVD-net [39] 89.1 83.9 84.0 78.3 - - 41.5 37.6
OIM loss [30] 82.1 60.9 68.1 47.4 - - - -
DPFL [3] 88.9 73.1 79.2 60.6 43.0 40.5 40.7 37.0
PAN [37] 82.2 63.3 71.6 51.6 36.9 35.0 36.3 34.0
FMN [4] 87.9 80.6 79.5 72.8 46.0 47.5 47.5 48.5
HPM [8] 94.2 82.7 86.6 74.3 - - 63.9 57.5
PCB+RPP [25] 93.8 81.6 83.3 69.2 - - 63.7 57.5
HA-CNN [12] 91.2 75.7 80.5 63.8 44.4 41.0 41.7 38.6
Mancs [27] 93.1 82.3 84.9 71.8 69.0 63.9 65.5 60.5
DaRe(R)+RE+RR [29] 90.8 85.9 84.4 79.6 72.9 73.7 69.8 71.2
Ours 94.7 91.7 89.0 85.9 74.9 76.5 69.7 72.2
Table 1: Comparison between our model and other state-of-the-art methods. The highest value for each category is marked in blue.

Implementation Details. We use a similar setting to [25]. The backbone model is a ResNet-50 pre-trained on ImageNet. All three deeply supervised branches take the feature map from the corresponding convolution stage as input. The spatial attention layer and the GAP layer are applied before the fully-connected layer. Dropout is not used in the deeply supervised branches.

The feature map after the fourth convolution stage is treated differently, following [25]. It is first partitioned into six part-level features, and average pooling is performed within each part. Each resulting feature vector is then convolved with a 1x1 filter that reduces its dimension from 2048 to 256 before being processed further. The six part-level features also produce six independent losses.

The input person images are all resized to the same resolution. Random erasing [39] and random horizontal flipping with probability 0.5 are applied as data augmentation. The re-ranking strategy from [38] is also used to further improve the performance. We adopt the recommended number of partitions (six) from [25]. The dropout rate is 0.47 for Market-1501 and 0.5 for the other datasets. Pre-trained weights from ImageNet are used. The batch size is set to 48. Stochastic Gradient Descent (SGD) with momentum is used as the optimizer. The base learning rate starts from 0.1 and decays to 0.01 after 40 epochs. The learning rate for all pre-trained layers is set to 0.1 times the base learning rate. All models are trained until convergence. The final loss weights the partition losses with 0.8 and the deeply supervised losses with 0.2.
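The learning-rate policy just described can be summarized by a small helper; this is only a sketch of the schedule as stated in the text, not code from the authors.

```python
def learning_rate(epoch, pretrained, base_lr=0.1):
    """Base learning rate 0.1, decayed to 0.01 after 40 epochs;
    pre-trained (ImageNet) layers use 0.1x the base rate."""
    lr = base_lr if epoch < 40 else base_lr * 0.1
    return lr * 0.1 if pretrained else lr
```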

5 Experimental Results and Discussions

5.1 Main Results

Evaluated on the four aforementioned benchmarks, the proposed architecture achieves state-of-the-art performance on Market-1501, DukeMTMC-ReID and CUHK03-NP-labeled in terms of both rank-1 accuracy and mAP, as shown in table 1. On DukeMTMC-ReID in particular, it shows a clear improvement of 3.3% for R-1 and 6.3% for mAP. For CUHK03-NP-detected, the model is slightly worse (0.1pp) in rank-1 than the best model but is still competitive and outperforms the third best method by a significant margin. On all other datasets, our model achieves top performance by a large margin.

5.2 Analysis of the Class Activation Map

Figure 5: We visualize the CAM of all four stages for the model with and without the spatial attention. In both examples, the model with spatial attention shows a properly distributed focus at every stage.
Figure 6: Given a query image and two test images, (a) is taken from the same camera view as the query and (b) is taken from another one. The third column of each example shows the original class activation maps. The second column shows the class activation maps with the spatial attention. Failure cases are marked with a red box. All maps are taken from stage 2.

Zhou et al. [40] used class activation maps (CAM) to illustrate how different image regions contribute when a certain class of object is recognized. Following the same scheme, we visualize the CAM of different stages and different examples in figure 5 and figure 6. As mentioned before, GAP with the spatial attention layer (SA) distributes the focus of the model and thus covers more details of the image. Since the feature map from each stage of the proposed model has a deeply supervised branch, we can visualize the CAM for every stage given an input image. Figure 5 compares the CAM using GAP with and without SA for each stage. GAP with SA shows a pattern of less concentrated attention in general, no matter how deep the feature map is. The activations from stages 1 and 2 represent low-level and intermediate-level features. In stages 3 and 4, however, the receptive field of the activations can cover the entire width of the image; therefore, the highlighted area on the CAM is more elongated. But still, GAP with SA is more scattered along the vertical axis.

Since the CAM of stage 2 shows the clearest comparison, we study some failure cases of GAP under cross-camera matching and visualize the corresponding stage-2 CAM in figure 6. The first test image is from the same camera view as the query image and the second test image is from another camera view. In each example, GAP and GAP with SA are both able to recognize the image from the same camera view. However, GAP fails to identify the person from the other camera view while GAP with SA still recognizes them. The second and third rows show the CAM of the model with and without the spatial attention layer respectively. Because of the lack of spatial relations, the attention of the model without SA is restricted to a certain region of the image. Therefore, the model tends to focus on the most discriminative part of the person, for example, the handbag of the lady in the first example. However, this can be harmful if the expected feature is missing in another camera view. From the class activation map at the lower right corner of the first example, we can see that even though the model tries to find the handbag in the image, the patterns it finds are not convincing enough to claim that it is the same person, since the red color of the handbag visible in the front view is absent. Therefore it fails to recognize her. This is not the case for GAP with SA, which covers many more regions of the image: for example, the edge of the skirt and the haircut are also involved during recognition. These features may not be discriminative enough alone, but their combination can alter the decision. This redundancy equips the model with robustness. Image (b) only shows the back of the lady, but the model with spatial attention can still recognize her correctly based on the edge of the skirt and the haircut. The same holds for the other examples.

Model              SA1  SA2  R-1(%)  mAP(%)
Backbone           no   -    84.8    67.0
Backbone           yes  -    87.4    68.1
Backbone + DS      no   no   86.2    68.3
Backbone + DS      no   yes  86.6    68.8
Backbone + DS      yes  yes  87.1    69.4
Backbone + P + DS  no   no   91.7    77.0
Backbone + P + DS  no   yes  92.4    78.2
Backbone + P + DS  yes  yes  92.6    78.3
Table 2: The results are evaluated on Market-1501. P represents part classifiers; DS represents deep supervision; SA1 represents the spatial attention layer on the backbone; SA2 represents the spatial attention layer on the deeply supervised branches.

5.3 Ablation Study

We show in this section that the spatial attention consistently improves the performance. The ablation study is conducted on the Market-1501 dataset because its large scale gives relatively stable results. Random erasing and re-ranking are not used here. All models are trained until convergence. For the baseline, we use the blue part in figure 3: essentially a standard ResNet-50 with one additional convolution layer at the very end, which shrinks the 2048-channel feature map to a 256-channel feature map. This baseline model gives 84.8% R-1 and 67.0% mAP. Based on this, we evaluate the performance gain of the spatial attention layers. The results are shown in table 2. DS denotes the deep supervision branches in figure 3, P denotes the usage of the part classifiers, and SA1 and SA2 denote the spatial attention layers for the backbone and the deeply supervised branches respectively. The models with the spatial attention layers show a consistent improvement over the ones without it, demonstrating the effectiveness of our proposed spatial attention.

Another experiment, shown in table 3, compares the proposed architecture with and without the spatial attention layer across the four datasets. Random erasing and horizontal flipping are used as data augmentation; the other settings are the same as in the previous experiment. From table 3, a consistent improvement for the models with spatial attention can again be observed. On the CUHK03 labeled dataset, the model equipped with spatial attention outperforms the attention-free model by a large margin of 2.4% in rank-1 accuracy. The mAP gains reach 2.1% on Market-1501 and 1.8% on DukeMTMC-ReID.

Dataset           Metric(%)  +SA   -SA
Market-1501       R-1        93.3  92.5
Market-1501       mAP        81.7  79.6
DukeMTMC-ReID     R-1        84.3  83.4
DukeMTMC-ReID     mAP        72.1  70.3
CUHK03 labeled    R-1        64.3  61.9
CUHK03 labeled    mAP        61.4  60.3
CUHK03 detected   R-1        58.9  57.8
CUHK03 detected   mAP        57.5  55.8
Table 3: The proposed architecture from section 3.1 is used here. The models with the spatial attention show a consistent improvement over the ones without it. All models are trained with random erasing until convergence. Re-ranking is not applied.

6 Conclusion

We propose a parameter-free spatial attention layer to incorporate the spatial relation among the activations on the same feature map into GAP. It is computationally efficient and yields consistent improvements over the model without it. The mechanism behind this module is studied both analytically and empirically in order to understand its behavior. We expect it to find a broader range of applications.

References

  • [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [2] Michael Blot, Matthieu Cord, and Nicolas Thome. Max-min convolutional neural networks for image classification. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 3678–3682. IEEE, 2016.
  • [3] Yanbei Chen, Xiatian Zhu, and Shaogang Gong. Person re-identification by deep learning multi-scale representations. In IEEE International Conference on Computer Vision Workshop, pages 2590–2600, 2017.
  • [4] Guodong Ding, Salman Khan, Zhenmin Tang, and Fatih Porikli. Let features decide for themselves: Feature mask network for person re-identification. 2017.
  • [5] Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), volume 2, 2017.
  • [6] Thibaut Durand, Nicolas Thome, and Matthieu Cord. Mantra: minimum maximum latent structural svm for image classification and ranking. In Proceedings of the IEEE International Conference on Computer Vision, pages 2713–2721, 2015.
  • [7] Thibaut Durand, Nicolas Thome, and Matthieu Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4743–4752, 2016.
  • [8] Yang Fu, Yunchao Wei, Yuqian Zhou, Honghui Shi, Gao Huang, Xinchao Wang, Zhiqiang Yao, and Thomas Huang. Horizontal pyramid matching for person re-identification. 2018.
  • [9] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 7, 2017.
  • [10] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
  • [11] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
  • [12] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, volume 1, page 2, 2018.
  • [13] Weixin Li and Nuno Vasconcelos. Multiple instance learning for soft bags via top instances. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 4277–4285, 2015.
  • [14] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
  • [15] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
  • [16] Sobhan Naderi Parizi, Andrea Vedaldi, Andrew Zisserman, and Pedro Felzenszwalb. Automatic discovery and optimization of parts for image classification. arXiv preprint arXiv:1412.6598, 2014.
  • [17] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, 2016.
  • [18] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In International Conference on Machine Learning, pages 2217–2225, 2016.
  • [19] Chen Shen, Guo-Jun Qi, Rongxin Jiang, Zhongming Jin, Hongwei Yong, Yaowu Chen, and Xian-Sheng Hua. Sharp attention network via adaptive sampling for person re-identification. arXiv preprint arXiv:1805.02336, 2018.
  • [20] Yang Shen, Weiyao Lin, Junchi Yan, Mingliang Xu, Jianxin Wu, and Jingdong Wang. Person re-identification with correspondence structure learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 3200–3208, 2015.
  • [21] Hailin Shi, Xiangyu Zhu, Shengcai Liao, Zhen Lei, Yang Yang, and Stan Z Li. Constrained deep metric learning for person re-identification. arXiv preprint arXiv:1511.07545, 2015.
  • [22] Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. arXiv preprint arXiv:1803.09937, 2018.
  • [23] Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1179–1188, 2018.
  • [24] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [25] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling. arXiv preprint arXiv:1711.09349, 2017.
  • [26] Manik Varma and Andrew Zisserman. A statistical approach to texture classification from single images. International journal of computer vision, 62(1-2):61–81, 2005.
  • [27] Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 365–381, 2018.
  • [28] Yan Wang, Lequn Wang, Yurong You, Xu Zou, Vincent Chen, Serena Li, Gao Huang, Bharath Hariharan, and Kilian Q Weinberger. Resource aware person re-identification across multiple resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8042–8051, 2018.
  • [29] Yan Wang, Lequn Wang, Yurong You, Xu Zou, Vincent Chen, Serena Li, Gao Huang, Bharath Hariharan, and Kilian Q Weinberger. Resource aware person re-identification across multiple resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8042–8051, 2018.
  • [30] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint detection and identification feature learning for person search. 2017.
  • [31] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. arXiv preprint arXiv:1805.03344, 2018.
  • [32] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. arXiv preprint arXiv:1805.03344, 2018.
  • [33] Shuangjie Xu, Yu Cheng, Kang Gu, Yang Yang, Shiyu Chang, and Pan Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. arXiv preprint arXiv:1708.02286, 2017.
  • [34] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Computer Vision, IEEE International Conference on, 2015.
  • [35] Wei-Shi Zheng, Xiang Li, Tao Xiang, Shengcai Liao, Jianhuang Lai, and Shaogang Gong. Partial person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 4678–4686, 2015.
  • [36] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [37] Zhedong Zheng, Liang Zheng, and Yi Yang. Pedestrian alignment network for large-scale person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 2018.
  • [38] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 3652–3661. IEEE, 2017.
  • [39] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
  • [40] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.