Part-Guided Attention Learning for Vehicle Re-Identification
(This work was done when X. Zhang was visiting The University of Adelaide. Correspondence should be addressed to C. Shen.)
Vehicle re-identification (Re-ID) often requires one to recognize the fine-grained visual differences between vehicles. Besides the holistic appearance of vehicles, which is easily affected by viewpoint variation and distortion, vehicle parts also provide crucial cues to differentiate near-identical vehicles. Motivated by these observations, we introduce a Part-Guided Attention Network (PGAN) to pinpoint the prominent part regions and effectively combine the global and part information for discriminative feature learning. PGAN first detects the locations of different part components and salient regions regardless of the vehicle identity, which serve as bottom-up attention to narrow down the possible search regions. To estimate the importance of detected parts, we propose a Part Attention Module (PAM) to adaptively locate the most discriminative regions with high attention weights and suppress the distraction of irrelevant parts with relatively low weights. The PAM is guided by the Re-ID loss and therefore provides top-down attention, enabling attention to be computed at the level of car parts and other salient regions. Finally, we aggregate the global appearance and part features to further improve feature discriminability. PGAN combines part-guided bottom-up and top-down attention, as well as global and part visual features, in an end-to-end framework. Extensive experiments demonstrate that the proposed method achieves new state-of-the-art vehicle Re-ID performance on four large-scale benchmark datasets.
1 Introduction
Vehicle re-identification (Re-ID) aims to verify whether two vehicle images captured by different surveillance cameras belong to the same identity. With the growth of road traffic, it plays an increasingly important role in urban security systems and intelligent transportation [arth07, feris2012large, compcars, veri, wang2017orientation, vehicle1M, cityflow, veriwild].
Different levels of granularity of visual attention are required under various Re-ID scenarios. When comparing vehicles of different car models, we can easily distinguish their identities by examining the overall appearance, such as car type and headlights [compcars]. However, most production vehicles can exhibit near-identical appearances since they may be mass-produced by the same manufacturer. When two vehicles with the same car model are presented, more fine-grained details (e.g., annual service signs, customized paintings, and personal decorations) are required for comparison, as shown in Figure 1 (a). Therefore, the key challenge of vehicle Re-ID lies in how to recognize the subtle visual differences between vehicles and locate the prominent parts that characterize their identities.
Most existing works focus on learning global appearance features with various vehicle attributes, including model type [liu2016deep, vehicle1M, wei2018coarse], license plate [liu2016deep], spatial-temporal information [liu2017provid, shen2017learning], orientation [wang2017orientation, zhou2017cross, zhou2018aware, lou2019embedding], etc. The main disadvantage of global features is their inability to capture more fine-grained visual differences, which is crucial in vehicle Re-ID. Also, they are easily degraded by viewpoint variation, distortion, occlusion, motion blur and illumination changes, especially in unconstrained real-world environments. Therefore, recent works tend to explore car parts [he2019part, zhu2019vehicle, zhao2019structural] to learn local information. However, these methods mainly focus on localizing the spatial regions without considering that these regions deserve different degrees of attention.
To address the problems above, we propose a novel Part-Guided Attention Network (PGAN) to detect the prominent part regions of a vehicle at a sufficiently fine-grained level and combine these part characteristics with the holistic appearance for discriminative feature learning. To better capture the details of different vehicle components, the proposed PGAN first detects various parts and salient regions regardless of the vehicle identity, e.g., car lights, logos, and annual service signs, as shown in Figure 1 (b). These detected parts serve as candidate regions for comparison, which help narrow down the possible search regions for network learning. The part extraction module is instantiated by an object detection network pre-trained on vehicle attribute datasets [compcars, zhao2019structural]. We refer to this kind of attention as bottom-up attention, since it is not driven by the Re-ID task but is instead determined by region saliency and vehicle attributes.
The next important step of PGAN is to select the most prominent part regions and assign appropriate attention scores to them. Here, we introduce a Part Attention Module (PAM) to provide top-down attention guided by the Re-ID loss. PAM adaptively locates the discriminative regions with high attention weights and suppresses the distraction of irrelevant parts with relatively low weights, as shown in Figure 1 (c). Compared to existing grid attention or evenly decomposed part attention [chen2019partition, chen2019multi, zhu2019vehicle, liu2018ram], our PGAN provides more fine-grained attention, which is calculated at the level of car parts and subtle areas. The success of combining bottom-up and top-down attention has also been witnessed in image captioning and visual question answering, where fine-grained analysis and image understanding are required [anderson2018bottom]. Different from these works, our attention mechanism is tailored to the car parts and visual attributes that are non-negligible for vehicle Re-ID. Finally, we aggregate the vehicle's holistic appearance and part characteristics with a feature aggregation module to further improve feature discriminability.
To summarize, our main contributions are as follows:
We design a novel Part-Guided Attention Network (PGAN) to capture the fine-grained part details and learn discriminative features for vehicle Re-ID. The PGAN combines part-guided bottom-up and top-down attention, global and part features in an end-to-end framework.
We propose a Part Attention Module (PAM) to adaptively pay more attention to the prominent parts and reduce the distraction of wrongly detected or irrelevant parts.
Extensive experiments on four challenging benchmark datasets demonstrate that our proposed method achieves new state-of-the-art vehicle Re-ID performance.
2 Related Work
2.1 Global Feature-based Methods
Feature Representation. Vehicle Re-ID aims at learning discriminative feature representations to deal with significant appearance changes between different vehicles. Public large-scale datasets [veri, vehicle1M, liu2016deep, veriwild, pkuvehicleid, vric, compcars] have been widely collected with annotated labels and abundant attributes under unrestricted conditions. These datasets pose huge challenges due to occlusion, illumination, low resolution and diverse views. One line of work uses deep features [wang2017orientation, liu2016deep, tang2017multi, vric, veri] instead of hand-crafted features to describe vehicle images. To learn more robust features, some methods [liu2016deep, vehicle1M, liu2018progressive, liu2017provid, shen2017learning, wei2018coarse] explore details of vehicles using additional attributes, such as model type, color and spatial-temporal information. Moreover, the works of [lou2019embedding, zhou2017cross] propose to use synthetic multi-view vehicle images from a generative adversarial network (GAN) to alleviate cross-view influences among vehicles. The authors of [zhou2018aware, wang2017orientation] also achieve view-invariant inference by learning viewpoint-aware representations. Although great progress has been made by these methods, performance drops substantially when encountering subtle appearance variance between different vehicles as well as large diversity within the same vehicle identity.
Metric Learning. To alleviate the above limitation, deep metric learning methods [yan2017exploiting, sanakoyeu2019divide, kumar2019vehicle, yuan2017hard] use powerful distance metrics to pull vehicle images of the same identity closer while pushing dissimilar vehicle images further away. The core idea of these methods is to exploit the matching relationships between image pairs or triplets as much as possible. However, the sampling strategies in deep metric learning often lead to suboptimal results, and these methods still lack the ability to recognize meaningful yet unobtrusive details.
2.2 Part Feature-based Methods
Beyond learning globally distinguishable features, a series of part-based learning methods explicitly exploit discriminative information from multiple part locations of vehicles. The works of [liu2018ram, chen2019partition, zhu2019vehicle, chen2019multi] devote great effort to separating feature maps into multiple even partitions to extract region-specific feature representations. Another line of part-based methods [khorramshahi2019dual, kanaci2019multi, khorramshahi2019attention] introduces informative key-points to put more attention on effective localized features. Besides, [zhao2019structural, he2019part] design part-fused networks that extract discriminative features using the ROI features of each vehicle part from a pre-trained detection model. Nevertheless, these methods merely explore the part locations while ignoring the relative importance of the different part regions.
3 Proposed Approach
We first define each vehicle image as $x$ and its unique identity label as $y$. Given a training set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{N}$, the main goal of vehicle Re-ID is to learn a feature embedding function $f(\cdot; \theta)$ for measuring vehicle similarity under certain metrics, where $\theta$ denotes the parameters of $f$. It is important to learn an $f$ with good generalization on unseen testing images. During testing, given a query vehicle image $q$, we can find vehicles with the same identity from a gallery set $\mathcal{G} = \{g_j\}_{j=1}^{M}$ by comparing the similarity between $f(q; \theta)$ and each $f(g_j; \theta)$.
In this section, we present the proposed Part-Guided Attention Network (PGAN) in detail. The overall framework is illustrated in Figure 2 and consists of four main components: a Part Extraction Module, a Global Feature Learning Module, a Part Feature Learning Module and a Feature Aggregation Module. We first generate the part masks of vehicles in the part extraction module, which are then applied to the global feature map to obtain the mask-guided part features. After that, we learn the attention scores of different parts to enhance the part feature by increasing the weights of discriminative parts and decreasing those of less informative parts. Subsequently, the three refined features, i.e., the global, part, and fusion features, are all used for model optimization.
3.1 Global Feature Learning Module
For a vehicle image, before obtaining the part features, we first extract a global feature map $F$ with a standard convolutional neural network, as shown in Figure 2. Most previous methods [batchhardtriplet, luo2019bag] directly feed $F$ into a global average pooling (GAP) layer to obtain an embedding feature that mainly captures global information; this is studied as the baseline model in our experiments.
However, maintaining the spatial structure of the feature map helps describe subtle visual differences, which is crucial for distinguishing two near-identical vehicles. Therefore, we directly use the global feature map as one of the inputs to the following part learning process and the final optimization.
3.2 Part Extraction Module
We extract the part regions using a fast one-stage SSD detector pre-trained on vehicle components [zhao2019structural]. In the part extraction, vehicle attributes (e.g., annual signs, car lights, logos, and entry license) are considered; details are left in the supplementary material. Once detected, we only use the confidence scores to select part regions and ignore the label information of each part. This is reasonable since not all attributes are visible in every vehicle image due to viewpoint variation.
Instead of naively selecting relevant part regions by thresholding the confidence score, we select the most confident top-$K$ proposals as the candidate vehicle parts. The main reasons are twofold: 1) some crucial yet less confident bounding boxes, such as annual service signs, play an important role in distinguishing different vehicle images; 2) the number of parts is fixed, which makes it easier to learn the attention model in the following stage. Note that we want to ensure a high recall rate to avoid missing relevant parts; irrelevant parts are filtered out by the subsequent attention learning. Figure 1 illustrates some vehicle samples with the selected candidate parts. More visualizations are in the supplementary material.
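As a hypothetical sketch, the top-$K$ selection above amounts to ranking all detected boxes by confidence, irrespective of their class labels, and keeping the $K$ best. The box values and the value of $K$ below are illustrative only; the paper fixes $K$ but its value is not restated here.

```python
import numpy as np

def select_top_k_parts(boxes, scores, k):
    """Keep the k most confident detections, ignoring class labels.

    boxes:  (N, 4) array of [x1, y1, x2, y2] proposals from the part detector.
    scores: (N,) detection confidences.
    Returns the k boxes with the highest confidence. No score threshold is
    applied, so low-confidence but crucial parts (e.g. annual service signs)
    survive, keeping recall high.
    """
    order = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return boxes[order], scores[order]
```

In practice the detector may return fewer than $K$ boxes for some images; a real implementation would pad or mask the missing slots.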
We use the index $k \in \{1, \dots, K\}$ to indicate each of the selected top-$K$ part regions. The spatial area covered by the $k$-th part is denoted as $R_k$. For each candidate part region $R_k$, we obtain a binary mask matrix $M_k$ by assigning $1$ to the elements inside the part region and $0$ to the rest:

$$M_k(i, j) = \begin{cases} 1, & (i, j) \in R_k \\ 0, & \text{otherwise,} \end{cases}$$
where $(i, j)$ indicates a pixel location of $M_k$. Note that the size of each $M_k$ is the same as a single channel of the global feature map. If the backbone network or the size of the input image changes, the corresponding part locations on $M_k$ change accordingly. During processing, we clip all part regions to the image boundaries to ensure they are located inside the image area.
After obtaining the global feature map $F$ and the part masks $\{M_k\}_{k=1}^{K}$, we project the part masks onto the feature map to generate a set of mask-based part feature representations $\{P_k\}_{k=1}^{K}$, which are taken as the input of the following part feature learning module. For each part region $k$, we obtain $P_k$ as:

$$P_k = M_k \odot F,$$
where $\odot$ denotes the element-wise product applied to each channel of $F$. $P_k$ is the mask-based part feature map of the $k$-th part region and has the same size as $F$. In each $P_k$, only the elements inside the $k$-th part region are activated.
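The mask construction and the channel-wise masking can be sketched as follows. Shapes, stride, and box coordinates are illustrative assumptions; in practice $F$ comes from the CNN backbone and the boxes from the part detector.

```python
import numpy as np

def part_masks(boxes, feat_h, feat_w, img_h, img_w):
    """Scale image-space boxes to the feature-map grid and rasterize binary masks."""
    sy, sx = feat_h / img_h, feat_w / img_w
    masks = np.zeros((len(boxes), feat_h, feat_w))
    for k, (x1, y1, x2, y2) in enumerate(boxes):
        # clip to image bounds, then scale to feature-map coordinates
        c1 = max(int(x1 * sx), 0)
        r1 = max(int(y1 * sy), 0)
        c2 = min(int(np.ceil(x2 * sx)), feat_w)
        r2 = min(int(np.ceil(y2 * sy)), feat_h)
        masks[k, r1:r2, c1:c2] = 1.0
    return masks  # (K, H, W), binary

def mask_part_features(F, masks):
    """P_k = M_k elementwise-times F, broadcast over channels.

    F: (C, H, W) global feature map; masks: (K, H, W) binary part masks.
    Returns (K, C, H, W): in each P_k only the k-th part's area is nonzero.
    """
    return masks[:, None, :, :] * F[None, :, :, :]
```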
We learn an attention module on the part regions in the following section. Unlike traditional grid attention methods that process a set of uniform grids, our attention model can focus on the prominent parts by activating only the selected parts; irrelevant regions can thus be ignored directly. Moreover, the contextual correlation within the same part is considered as a whole, alleviating the loss of essential features. This part extraction process can be regarded as bottom-up attention [anderson2018bottom].
3.3 Part Feature Learning Module
The part feature learning module produces a weight over the mask-based part feature maps, so that the network can attend more to specific part regions. A recent work [he2019part] simply uses the part mask features as the input of a subsequent feature learning process. In [wang2019vehicle], all the elements inside the detected parts are assigned a fixed, larger weight than those in other regions. However, all these methods treat different part regions equally, so the more prominent parts cannot be further highlighted. On the other hand, some detected parts may not be informative in specific cases, such as a wrongly detected background region or a windshield with no distinctive information, which tends to degrade the results. Consequently, we propose a Part Attention Module (PAM) to adaptively learn the importance of each part, attending more to the most discriminative part regions while suppressing the less informative ones. Since this attention signal is supervised by the Re-ID task itself, it can be considered part-based top-down attention.
Part Attention Module (PAM). Our PAM is designed to obtain a part-guided feature representation via a soft attention mechanism on each part. Given a soft attention weight vector $\alpha = [\alpha_1, \dots, \alpha_K]$ indicating the importance of each part region, we obtain the part-guided feature representation as:
where $\alpha_k$ denotes the $k$-th element of the attention weight vector, representing the learned weight of the $k$-th part obtained via Eq. (4). The attention weights are normalized to sum to 1 so that the relative importance between different parts is explicit. The original part features are also added back, in the manner of a residual connection, to augment the capability of the part regions.
We learn a compact model to predict the attention weights that measure the importance of each selected part, as shown in Figure 2. Specifically, we first apply a mask-guided global average pooling operation on each mask-based part feature map and then learn a mapping function with a softmax layer to obtain the attention weights. Each element $\alpha_k$ can be predicted by:
where $g$ denotes a learnable function that highlights the most important part regions with high values (as shown in Figure 2), $\theta_g$ is the parameter of $g$, and MGAP denotes the mask-guided global average pooling discussed in the following.
Before feeding into $g$, we average each channel of $P_k$ into a scalar via the MGAP operator. Note that in each $P_k$, only the elements inside the part region are activated and most elements are zero. Instead of performing standard global average pooling (GAP), MGAP restricts the average pooling to the area indicated by the mask. In detail, for each channel of $P_k$, after summing the nonzero elements, the MGAP operator divides the sum by the number of elements inside the part region (i.e., the mask area), instead of the total number of elements of the feature channel as in GAP.
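A minimal NumPy sketch of PAM follows, assuming a single linear layer for the mapping function $g$ (the paper only specifies a compact learnable model followed by a softmax, so the weight vector `W` below is a stand-in for its learned parameters, not the actual architecture):

```python
import numpy as np

def mgap(part_feats, masks):
    """Mask-guided global average pooling.

    part_feats: (K, C, H, W) masked part features; masks: (K, H, W) binary.
    Averages each channel over the part's own area (nonzero mask cells),
    not over the full H*W grid as standard GAP would.
    """
    area = masks.sum(axis=(1, 2)).clip(min=1)   # (K,) nonzero-element counts
    summed = part_feats.sum(axis=(2, 3))        # (K, C) per-channel sums
    return summed / area[:, None]               # (K, C)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def part_attention(part_feats, masks, W):
    """Compute attention weights over K parts and the weighted part feature.

    W: (C,) hypothetical parameters of the mapping g (assumption).
    Returns (alpha, weighted): alpha sums to 1 across the K parts.
    """
    pooled = mgap(part_feats, masks)            # (K, C)
    alpha = softmax(pooled @ W)                 # (K,) normalized importance
    weighted = (alpha[:, None, None, None] * part_feats).sum(axis=0)  # (C, H, W)
    return alpha, weighted
```

Note how a part with stronger pooled activation receives a larger weight, while a near-empty or background part is suppressed by the softmax.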
3.4 Feature Aggregation Module
Since the global and part-based features provide complementary information, we concatenate the global feature and the part-guided feature into a fusion feature. Furthermore, we adopt a Refine operation on the fusion feature to reduce the feature dimension and speed up training. The Refine operation is composed of an SE Block [hu2018squeeze] and a Residual Block [resnet50]. Finally, after a GAP layer, the refined fusion feature is obtained as the feature representation.
3.5 Model Training
In the training process, we use the Softmax cross-entropy loss and the Triplet loss [batchhardtriplet] for joint optimization of the fusion feature. Note that, following [luo2019bag], an additional batch normalization (BN) operation is applied to the fusion feature before the Softmax cross-entropy loss, and the normalized fusion feature is used as the feature representation for evaluation in our work. To make full use of the global and part information separately, we also optimize the refined global feature and the refined part-guided feature with the Triplet loss [batchhardtriplet]. The total loss function can be formulated as:
where the loss weight trades off the influence of the two types of loss functions. Experiments show that this joint optimization improves the quality of the feature representation.
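One plausible instantiation of this joint objective is sketched below. The exact combination and the value of the trade-off weight are not restated in this text, so `total_loss` and the symbol `lam` are assumptions; the batch-hard triplet term follows the scheme of [batchhardtriplet].

```python
import numpy as np

def batch_hard_triplet(emb, labels, margin=0.3):
    """Batch-hard triplet loss: per anchor, hardest positive minus
    hardest (closest) negative plus margin, hinged at zero."""
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=2)  # pairwise L2 dists
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(emb)):
        pos = d[i][same[i] & (np.arange(len(emb)) != i)]     # same-ID, not self
        neg = d[i][~same[i]]                                 # different-ID
        if len(pos) and len(neg):
            losses.append(max(pos.max() - neg.min() + margin, 0.0))
    return float(np.mean(losses)) if losses else 0.0

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    z = logits - logits.max()
    return float(-z[label] + np.log(np.exp(z).sum()))

def total_loss(ce_terms, tri_terms, lam=1.0):
    """Hypothetical combination: cross-entropy terms plus lam-weighted
    triplet terms (fusion, global, and part-guided features)."""
    return sum(ce_terms) + lam * sum(tri_terms)
```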
4 Experiments
4.1 Datasets and Evaluation Metrics
We evaluate our PGAN method on four public large-scale Vehicle Re-ID benchmark datasets.
VeRi-776 [liu2016deep] is a challenging benchmark for the vehicle Re-ID task that contains about images of vehicle identities across cameras. Each vehicle is captured by - cameras with various viewpoints, illuminations and occlusions. The dataset is split into a training set with images of vehicles and a testing set with images of vehicles.
VehicleID [pkuvehicleid] is a widely used vehicle Re-ID dataset containing vehicle images captured in the daytime by multiple cameras. In total there are images of vehicles, where each vehicle is captured from either the front or the rear view. The training set contains images of vehicles, while the testing set comprises images of vehicles. There are three test subsets of different sizes, i.e., images of IDs in the small test set, images of vehicles in the medium test set and images of vehicles in the large test set.
VRIC [vric] is a realistic vehicle Re-ID benchmark with unconstrained variations in resolution, motion blur, illumination, occlusion, and viewpoint. It contains images of vehicle identities captured by 60 different traffic cameras during both daytime and nighttime. The training set has images of vehicles, while the rest is used for testing, with images of the remaining vehicle IDs.
VERI-Wild [veriwild] is a recently released dataset with vehicle images of IDs captured by cameras. The training set consists of IDs with images. Similar to VehicleID, the small test subset consists of IDs with images, and the medium/large subsets consist of / IDs with / images.
Evaluation Metrics. To measure vehicle Re-ID performance, we utilize the Cumulated Matching Characteristics (CMC) and the mean Average Precision (mAP) as evaluation criteria. The CMC calculates the cumulative percentage of correct matches appearing within the top- candidates; we report Top- and Top- scores for the CMC criterion. Given a query image, Average Precision (AP) is the area under the Precision-Recall curve, and mAP is the mean of AP over all query images. The mAP criterion reflects both precision and recall, providing a more convincing evaluation of the Re-ID task.
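For concreteness, CMC and mAP can be computed from a query-gallery distance matrix as sketched below. This is a simplified illustration, not the benchmarks' exact evaluation code: per-benchmark details such as VeRi-776's cross-camera filtering are omitted.

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids, topk=(1, 5)):
    """dist: (Q, G) query-gallery distances. Returns ({r: CMC@r}, mAP)."""
    cmc_hits = {r: 0 for r in topk}
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                     # gallery, near -> far
        matches = (g_ids[order] == q_ids[i]).astype(float)
        for r in topk:
            # CMC@r: does at least one correct match appear in the top r?
            cmc_hits[r] += float(matches[:r].any())
        # AP: mean of precision evaluated at each correct-match rank
        hit_ranks = np.where(matches)[0]
        precisions = [(j + 1) / (rank + 1) for j, rank in enumerate(hit_ranks)]
        aps.append(np.mean(precisions) if precisions else 0.0)
    cmc = {r: cmc_hits[r] / dist.shape[0] for r in topk}
    return cmc, float(np.mean(aps))
```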
4.2 Implementation Details
Part Extraction. We use the same SSD model as [zhao2019structural] to extract part regions. The detector is kept fixed during training. For each image, we extract the top- part regions according to their confidence scores. In this paper, we set .
Vehicle Re-ID. We adopt ResNet50 [resnet50] without the last classification layer as the backbone of the global feature learning module, pre-trained on ImageNet [imagenet]. The model modification follows [luo2019bag], and a refined model is used as the baseline model in our work for a fair comparison.
All input images are resized to , and only random horizontal flipping and random erasing [randomerasing] with a probability of are applied for data augmentation. We use the Adam optimizer [adam] with a momentum of and a weight decay of . For all experiments without other specification, we set the batch size to , with vehicle IDs randomly selected. The learning rate starts from and is multiplied by every epochs. The total number of epochs is .
4.3 Ablation Study
We conduct extensive experiments on VeRi-776 to thoroughly analyze the effectiveness of our PGAN method.
4.3.1 Effectiveness of Feature Aggregation
To validate the necessity of different features in the proposed PGAN, we first design an ablation experiment analyzing the effectiveness of the global, part, and fusion features. We fix the feature dimension to ; for a fair comparison, we set the same feature dimension in the baseline model. As reported in Table 1, using only the part-guided feature for optimization already improves mAP by over the baseline model, confirming that PAM provides important part information for model optimization. Adding the global and fusion features separately further improves mAP by about , showing that combining the global and part features provides more useful information. Finally, with the joint optimization of all these features, the result improves to and on mAP and Top-1, outperforming the baseline model by and , respectively.
| + + (PGAN) | 79.3 | 96.5 | 98.3 |
| PGAN w/o PAM | 256 | 77.9 | 95.6 | 98.4 |
| PGAN w/o PAM | 512 | 78.0 | 95.5 | 98.2 |
| PGAN w/o PAM | 78.0 | 95.5 | 77.6 | 91.8 | 77.4 | 93.1 | 73.6 | 93.8 |
4.3.2 Analysis of Different Attention Methods
We first implement traditional grid attention by removing the part extraction module, i.e., PAM is applied directly on each grid of the global feature map. As shown in Table 2, grid attention only achieves mAP and Top-1 accuracy when the feature dimension is , showing that part guidance is crucial for filtering out invalid information such as background. Moreover, we also use an identical weight for each part region by removing PAM, which can be seen as bottom-up attention with part guidance from a detection model. From Table 2, we find that mAP decreases by and when the feature dimension is and , respectively, without PAM. This proves that PAM is beneficial for focusing on prominent parts as well as suppressing the impact of wrongly detected or useless regions. We also note that our PGAN w/o PAM is still better than grid attention by mAP, which again shows the important role of part-guided bottom-up attention. In Figure 4, we visualize one vehicle sample with its Top- retrieved vehicles and the corresponding heatmaps of the part features generated by grid attention and our PGAN, respectively. It is clear that our PGAN pays more attention to discriminative part regions, while wrongly detected and useless parts, such as background, are suppressed or ignored. More visualizations can be found in the supplementary material.
4.3.3 Parameter Analysis of PGAN
First, we evaluate the effect of different feature dimensions, using the dimension of the fusion feature on VeRi-776 as the variable. As shown in Figure 3 (a), our PAM brings a clear improvement over the baseline model regardless of the dimension. Note that the improvement is not obtained simply by increasing the feature dimension; for example, our PGAN with dimension surpasses the baseline model with dimension by a large margin.
4.4 Comparison with State-of-the-art Methods
Finally, we compare our PGAN against other state-of-the-art vehicle Re-ID methods, as shown in Table 3. All reported results of our method are based on the -dimensional feature.
For VeRi-776, we strictly follow the cross-camera-search evaluation protocol of [liu2016deep]. From Table 3, it is clear that our PGAN outperforms all existing methods by a large margin. For instance, PGAN is better than the state-of-the-art method, i.e., Part-Regular [he2019part], by on mAP and on Top-, respectively.
For VehicleID, we only report results on the large test subset at Top- and Top-. Our method surpasses all methods except RNN-HA [wei2018coarse] at Top-. Notice that RNN-HA uses additional supervision from the vehicle model and an input image size of ( times larger than ours). However, as reported in [wei2018coarse], the performance of RNN-HA drops by a large margin on VeRi-776 when the image size is set to , which is lower than our PGAN by about at Top-.
VRIC and VERI-Wild are newly released large-scale vehicle datasets with more unconstrained variations in resolution, illumination, occlusion, and viewpoint, and only a few methods have reported results on them. For VERI-Wild, we report the result on the large test subset. Table 3 shows that our proposed PGAN achieves satisfactory performance with at Top- on VRIC and on VERI-Wild. Compared with the baseline model, our PGAN is more robust under various environments.
We also report the results of traditional grid attention and of PGAN without PAM on each dataset. Experiments show that our method achieves consistently better results on all datasets. In particular, PGAN surpasses grid attention by in Top-1 accuracy on VRIC, showing better precision than grid attention. All results prove that our PGAN is able to retrieve matched vehicle images more reliably.
5 Conclusion
In this paper, we have presented a novel Part-Guided Attention Network (PGAN) for vehicle Re-ID. First, we extract part regions of each vehicle image with an object detection model. These part regions provide a range of candidate search areas for network learning, which is regarded as a bottom-up attention process. Then we use the proposed Part Attention Module (PAM) to discover the prominent part regions by learning a soft attention weight for each candidate part, which is a top-down attention process. In this way, the most discriminative parts are highlighted with high attention weights, while the negative effects of invalid or useless parts are suppressed with relatively low weights. Furthermore, with the joint optimization of the holistic feature and the part-guided feature, the Re-ID performance is further improved. Extensive experiments show the effectiveness of our PGAN, which outperforms other state-of-the-art methods by a large margin. In the future, we plan to extend the proposed method to multi-task learning, i.e., joint object detection and Re-ID, to simultaneously improve the performance of both tasks.
Appendix A Appendix
A.1 The Attributes of Vehicle Part Regions
The work [zhao2019structural] carefully labelled attributes of vehicles, of which only are adopted in our work. Since the vehicle-style attributes, i.e., “car”, “trunk”, “tricycle”, “train” and “bus”, represent the whole vehicle image, they can be regarded as global information in the vehicle Re-ID task. The remaining attributes are shown in Table 4.
Note that we do not use the attribute labels once the detection process is finished, since most vehicles contain only a few visible parts due to multi-view variation.
|annual service signs||back mirror||car light||carrier|
|car topwindow||entry license||hanging||lay ornament|
|light cover||logo||newer sign||tissuebox|
|plate||safe belt||wheel||wind-shield glass|
A.2 Analysis of the Number of Part Regions
In addition, we analyse how the number of part regions in the part extraction module affects the Re-ID results. We test the performance of our PGAN with on VeRi-776. The feature dimension is fixed to .
As shown in Table 5, achieves the relatively best results. Compared with the baseline model without part guidance, there is a consistent improvement once detected part regions are used. This shows that the part regions are able to narrow down the possible search area, which helps the model focus on the valid part components. We also observe that the Re-ID performance gradually improves as the number of part regions increases. However, the gain becomes limited for larger part numbers. The reasons are twofold: 1) many detected part regions overlap with each other and provide no further part information; 2) more wrongly detected parts with invalid information are extracted, which may distract model learning. We believe that a better detector would further improve the performance.
A.3 Qualitative Analysis of the Performance
In this section, we visualize more retrieval results of the grid attention and of our part-guided attention (PGAN), respectively. As illustrated in Figure 5, we show four different query vehicle images with their corresponding Top- most similar images in the gallery set, as well as the heatmaps of the part-guided features from PAM. From Figure 5, we observe that our PGAN obtains more reliable retrieval results than the grid attention method. In detail, the main advantages of our PGAN can be summarized as follows:
Insensitivity to Various Situations. The PGAN extracts more robust feature representations and thus significantly improves the Re-ID performance. As shown for ID1 and ID4, given a rear-view vehicle image, we can not only find the easy matches from the rear and side views, but also retrieve front-view vehicle images that are difficult to recognize even for humans. In contrast, the grid attention can only focus on images from nearly the same view. Moreover, our PGAN is also able to deal with various conditions, such as illumination changes and occlusion. This indicates that our method learns discriminative features that are insensitive to environmental variations.
The Effectiveness of the Part Extraction Module. The detected part regions play an important role in the feature representation. As illustrated for ID3, the wrongly retrieved images from the grid attention method clearly have different car lights from the query image; however, grid attention concentrates on many regions covering the body and the bottom of the car, which do not exhibit obvious differences between the two vehicles. With the guidance of the detected part regions, our PGAN concentrates only on the candidate regions, e.g., the car lights for ID3. This helps the model focus on the useful regions and alleviates the adverse effects of the other regions. That is to say, the part extraction module benefits network learning by narrowing down the search areas.
The Effectiveness of the Part Attention Module. Our PGAN selects the most prominent part regions and lessens the influence of invalid and useless regions. As described in the main paper, the proposed Part Attention Module (PAM) is responsible for learning a soft attention weight for each part, so the important part regions are highlighted with high attention values, while the impact of insignificant parts is relatively suppressed. From the heatmaps, we can clearly observe that our PAM focuses on the most significant part regions, such as the car lights for ID3 and the back mirrors for ID1. As shown for ID4, although only a few valid part regions are extracted, our PAM can still find the key information to recognize the vehicle, such as the wheel and the car lights. On the contrary, the grid attention is largely influenced by invalid regions with extremely similar appearance across different vehicles, such as the bottom of the vehicle body.