CaNet: Contextual-Attentional Attribute-Appearance Network
for Person Re-Identification
Person re-identification aims to identify the same pedestrian across non-overlapping camera views. Deep learning techniques have recently been applied to person re-identification to learn representations of pedestrian appearance. This paper presents a novel Contextual-Attentional Attribute-Appearance Network (CaNet) for person re-identification. CaNet simultaneously exploits the complementarity between semantic attributes and visual appearance, the semantic context among attributes, visual attention on attributes, and spatial dependencies among body parts, leading to a discriminative and robust pedestrian representation. Specifically, an attribute network within CaNet is designed with an Attention-LSTM module. It concentrates the network on latent image regions related to each attribute and exploits the semantic context among attributes through an LSTM module. An appearance network is developed to learn appearance features from the full body and from horizontal and vertical body parts of pedestrians, with spatial dependencies among body parts. CaNet jointly learns the attribute and appearance features in a multi-task learning manner, generating a comprehensive representation of pedestrians. Extensive experiments on two challenging benchmarks, Market-1501 and DukeMTMC-reID, demonstrate the effectiveness of the proposed approach.
1. Introduction
Person re-identification aims at identifying a target pedestrian at diverse locations across non-overlapping camera views. It has attracted increasing attention recently because of its importance for many practical applications, such as automated surveillance, activity analysis and content-based visual retrieval (Liu et al., 2016, 2018). Despite recent progress, person re-identification remains challenging due to background clutter, occlusion, dramatic variations in illumination, body pose and viewpoint, as well as similar appearance among different pedestrians. Figure 1 illustrates sample images of pedestrians with some attributes in two person re-identification benchmarks, i.e., Market-1501 (Zheng et al., 2015) and DukeMTMC-reID (Zheng et al., 2017b).
Conventional person re-identification approaches are mainly based on hand-crafted descriptors of pedestrian appearance (Yang et al., 2014; Chun et al., 2008), such as Symmetry-Driven Accumulation of Local Features (SDALF) (Farenzena et al., 2010), Local Maximal Occurrence (LOMO) (Liao et al., 2015) and Weighted Histograms of Overlapping Stripes (WHOS) (Lisanti et al., 2015). Recently, deep learning techniques have been applied to person re-identification (Ahmed et al., 2015; Varior et al., 2016; Liu et al., 2017a; Zhou et al., 2017) to learn, in an end-to-end manner, discriminative appearance representations that identify the same pedestrian and distinguish different ones. These approaches design a variety of deep architectures to abstract global appearance features from the full body of pedestrians, local appearance features from body parts, or both. However, appearance representations, whether hand-crafted descriptors or deep features, are not robust to the aforementioned challenges, which leads to unsatisfactory person re-identification results.
On the other hand, person attributes, such as long hair, short sleeves and carrying a handbag, represent intermediate-level semantic properties of a pedestrian and provide crucial cues for identifying the pedestrian. Compared to appearance features, person attributes are much more robust to variations of illumination, body pose and camera viewpoint (Lin et al., 2017). For pedestrians with similar appearance, or for a pedestrian whose appearance varies greatly across images, appearance features usually result in false matches. As intermediate-level semantic descriptors, person attributes have shown good efficacy in dealing with such large intra-category variance and small inter-category variance within the low-level feature space (Zhao et al., 2018; Li et al., 2016). Moreover, the attribute and appearance representations of pedestrians are complementary to each other. They describe pedestrians in terms of intermediate-level semantic abstraction and low-level visual details, respectively. Exploring them jointly could offer a comprehensive representation of pedestrians and thus enhance the accuracy of person re-identification. Recently, a few preliminary works (Shi et al., 2015; Su et al., 2018; Schumann and Stiefelhagen, 2017) have exploited attributes for person re-identification. They applied attribute classifiers trained on auxiliary datasets to generate attribute responses over pedestrian images, which are in turn used for identifying pedestrians. However, they only used the attribute representation and neglected appearance features, which contain essential visual cues for re-identification. Lin et al. (Lin et al., 2017) annotated person attributes on the pedestrian images within the Market-1501 and DukeMTMC-reID person re-identification datasets. They also made a preliminary effort at learning attribute and appearance features for person re-identification.
In (Lin et al., 2017), attributes are modeled individually without exploring the semantic context among them. However, different attributes are semantically correlated. Some attributes usually co-occur in a pedestrian image, while others are unlikely to co-occur. Hence, the presence or absence of a certain attribute provides valuable cues for inferring the presence or absence of other related attributes. Moreover, an attribute usually arises from one or more regions within an image rather than from the entire image. Hence, it is essential to concentrate on the latent related image regions with a visual attention mechanism when modeling attributes.
In this work, we propose a Contextual-Attentional Attribute-Appearance Network (CaNet) to learn a discriminative and robust representation for person re-identification. CaNet jointly learns attribute and appearance representations through multi-task learning of person identification and attribute recognition. It simultaneously exploits the semantic context among attributes, visual attention on attributes, the spatial context among body parts, and the complementarity between semantic attributes and visual appearance. As illustrated in Figure 2, CaNet consists of an attribute network that learns the attribute representation, an appearance network that learns global and local appearance features, and a base network that generates low-level feature maps. Specifically, the attribute network contains an Attention-LSTM module, a convolution layer and a series of fully connected layers. The LSTM cell (Hochreiter and Schmidhuber, 1997) in the Attention-LSTM module captures the underlying semantic context among attributes, which can effectively boost the learning of attributes. The attention block in the Attention-LSTM module explores latent spatial attention for each attribute with identity-level supervision and concentrates the network's attention on the related image regions when learning attributes. The appearance network contains three convolution layers, three average-pooling layers and ten fully connected layers. It learns appearance features from the full body, horizontal body parts and vertical body parts of pedestrians, exploring the spatial dependencies among body parts along the horizontal and vertical directions. The appearance network is trained with a part loss and a global loss, which compute the identification errors on each body part and on the full body, respectively, to avoid over-fitting to a specific body part. The base network is built upon the ResNet-50 model (He et al., 2016) for extracting low-level visual details.
Based on the above subnetworks, CaNet is able to learn an effective representation of pedestrians, leading to satisfactory person re-identification results. We conduct extensive experiments to evaluate CaNet on two widely used person re-identification datasets, i.e., Market-1501 and DukeMTMC-reID, and report superior performance over state-of-the-art approaches.
The main contributions of this paper are three-fold: (1) We propose a novel Contextual-Attentional Attribute-Appearance Network (CaNet) for person re-identification; (2) We design a new attribute learning network with an Attention-LSTM module, which exploits the latent semantic context among attributes and visual attention on attributes; (3) We conduct extensive evaluations on two benchmarks and obtain significant performance improvements over state-of-the-art solutions.
2. Related Work
Recent years have witnessed many research efforts and encouraging progress on person re-identification. This section briefly reviews existing works belonging to two major categories, i.e., appearance based re-identification and the newly emerging attribute based re-identification methods.
Appearance based re-ID. Appearance based person re-identification approaches mainly focus on developing distinctive appearance representations from pedestrian images. Many sophisticated hand-crafted features have been developed to boost the performance. For example, Farenzena et al. (Farenzena et al., 2010) proposed the Symmetry-Driven Accumulation of Local Features (SDALF), which exploits the symmetry of the human body to handle variations of camera viewpoint. Liao et al. (Liao et al., 2015) analyzed the horizontal occurrence of local features and proposed an effective feature called Local Maximal Occurrence (LOMO). Recently, deep learning techniques have been adopted for person re-identification to learn discriminative representations of pedestrian appearance. For example, Liu et al. (Liu et al., 2016) proposed a multi-scale triplet CNN which captures the visual appearance of a person at various scales by a comparative similarity loss on massive sample triplets. Li et al. (Li et al., 2017b) proposed to jointly learn global and local features with pre-defined grid horizontal stripes on pedestrian images by a multi-loss function. Li et al. (Li et al., 2017a) designed a Multi-Scale Context-Aware Network (MSCAN) to learn appearance features over the full body and body parts with local context knowledge by stacking multi-scale convolutions, as well as a Spatial Transformer Network (STN) to deal with the problem of pedestrian misalignment. Li et al. (Li et al., 2018) formulated a Harmonious Attention CNN (HA-CNN) for the joint learning of soft pixel attention and hard region attention. Si et al. (Si et al., 2018) proposed a Dual ATtention Matching network (DuATM) for learning context-aware feature sequences, in which intra-sequence and inter-sequence attention strategies are used for feature refinement and feature-pair alignment, respectively.
Attribute based re-ID. Person attributes have been exploited for person re-identification in a few recent works and have shown good robustness against challenging variations of illumination, body pose and viewpoint. Some of these works adopt transfer learning to learn attributes for the person re-identification task but neglect appearance features. For example, Shi et al. (Shi et al., 2015) presented a semantic attribute learning model which was trained on a fashion photography dataset and adapted to provide a semantic description for person re-identification. Su et al. (Su et al., 2018) proposed a weakly supervised multi-type attribute learning framework which involved three-stage training to progressively boost the accuracy of attributes with only a limited number of labeled samples. Schumann and Stiefelhagen (Schumann and Stiefelhagen, 2017) developed a person re-identification approach which trained an attribute classifier on a separate attribute dataset and integrated its responses into a CNN-based person re-identification model. Recently, Lin et al. (Lin et al., 2017) annotated person attributes on the pedestrian images within the Market-1501 and DukeMTMC-reID datasets. They also proposed an attribute-person recognition (APR) network to learn attribute and appearance features for person re-identification. However, their work overlooks the semantic context among attributes and visual attention on attributes, which are important for learning person attributes.
3. The Proposed Method
In this section, we first present the overall architecture of the proposed CaNet and then elaborate on its components.
3.1. Overall Architecture
Given a training set of pedestrian images captured by non-overlapping camera networks together with their corresponding person IDs, the objective is to learn a discriminative representation for identifying the same pedestrian and distinguishing different pedestrians. We propose a novel Contextual-Attentional Attribute-Appearance Network (CaNet), which simultaneously exploits the complementarity between semantic attributes and visual appearance, the semantic context among attributes, visual attention on attributes, and the spatial dependency among body parts. As shown in Figure 2, CaNet consists of an attribute network that learns person attributes, an appearance network that characterizes pedestrian appearance, and a base network that generates the low-level visual representation. Specifically, the base network is built on the ResNet-50 model (He et al., 2016) due to its strong ability in learning visual representations. It consists of five residual blocks, each of which contains several convolution layers with Batch Normalization (BN), Rectified Linear Units (ReLU) and optional max-pooling operations. The attribute network is proposed to effectively abstract the attribute representation with an Attention-LSTM module. The attention block concentrates the network on latent image patches that are related to each attribute during attribute learning. The LSTM cell progressively takes each attribute as input and decides whether to retain or discard the latent semantic context from the current attribute and previous ones. It leverages the inter-attribute context for attribute recognition. Furthermore, we develop an appearance network to learn appearance features from the full body, horizontal body parts and vertical body parts of pedestrians simultaneously.
The body-part features are learned with a part loss function leading to detailed visual cues from each body part, while the full-body features are learned with a global loss function extracting the global visual appearance of pedestrians.
The appearance and attribute features are learned with the loss functions for person re-identification and attribute recognition, respectively, in a multi-task learning manner. During testing, the two features are integrated to form the final pedestrian representation. The matching score between pedestrian images and can be computed as:
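The following is only a minimal sketch of how two fused descriptors could be compared at test time. It assumes that "integrated" means concatenation and that the score is a Euclidean distance between L2-normalized vectors; neither assumption is confirmed by the text, and the helper names `fuse`, `l2_normalize` and `matching_distance` as well as the toy vectors are hypothetical.

```python
import math

def fuse(appearance, attribute):
    """Concatenate appearance and attribute features (assumed fusion)."""
    return list(appearance) + list(attribute)

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def matching_distance(feat_a, feat_b):
    """Euclidean distance between L2-normalized descriptors (assumed metric).

    Smaller values mean the two images are more likely the same pedestrian.
    """
    a, b = l2_normalize(feat_a), l2_normalize(feat_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy descriptors: 3-dim appearance part and 2-dim attribute part.
query = fuse([0.2, 0.9, 0.1], [1.0, 0.0])
same = fuse([0.25, 0.85, 0.1], [1.0, 0.0])
other = fuse([0.9, 0.1, 0.4], [0.0, 1.0])
```

With these toy values, `matching_distance(query, same)` is smaller than `matching_distance(query, other)`, so a gallery sorted by increasing distance would rank `same` first.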
3.2. Attribute Network
An attribute learning network is proposed to learn discriminative intermediate-level semantic descriptions of pedestrian images through the task of attribute recognition. An attribute usually arises from one or more regions within an image. The network is expected to concentrate on the corresponding regions when learning an attribute. However, no ground-truth localization of such regions is available. On the other hand, different attributes are semantically correlated. The presence or absence of a certain attribute is usually useful for inferring the presence or absence of other related attributes. For example, the attributes “wearing a dress” and “long hair” are likely to co-occur, while the attributes “carrying a bag” and “carrying a backpack” may be mutually exclusive. Exploring such semantic context among attributes can considerably boost attribute recognition. Motivated by these observations, we propose a novel attribute network with an Attention-LSTM module, which contains an attention block and an LSTM cell. The attention block consists of 3 convolution layers and learns spatial attention for each attribute. The LSTM cell sweeps all attributes sequentially and memorizes the semantic correlations and dependencies from previous inputs through its memory mechanism. As shown in Figure 2, the attribute network consists of a convolution layer, an Attention-LSTM module and fully-connected layers. is the number of attributes.
The attention block, containing 3 consecutive convolution layers, takes the feature tensor as input to generate initial attention maps for all attributes. is the output feature maps of the base network. Each channel in the initial attention maps () corresponds to one attribute. The kernel sizes of the 3 convolution layers are , , and , respectively. The BN and ReLU nonlinearity operations are performed with the first two convolution layers. In addition, another individual convolution layer with kernel size is utilized to transfer to feature map . The attributes are regarded as a temporal sequence. At each time step , the LSTM cell receives an attentional attribute feature map corresponding to attribute as input, which comes from the element-wise multiplication of and the precise attention map , and outputs attribute predictions for the attribute recognition task. The feedback connections and internal gating mechanism of the LSTM cell is able to memorize the latent semantic dependencies among attributes, selectively discover and propagate relevant context to next attribute. The formulation of the LSTM cell is shown as follows
where , , , , , and are the forget gate, input gate, output gate, cell state, weight matrix and hidden state respectively. In addition, represents the set of attentional attribute feature maps for all attributes. The LSTM cell sweeps all person attributes and generates discriminative attribute features.
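For reference, the standard LSTM cell (Hochreiter and Schmidhuber, 1997) can be written as below. This is the textbook formulation, given as an assumption about the form used here rather than a reproduction of the original equation. The symbols are chosen for illustration: $x_t$ stands for the attentional feature of attribute $t$, $f_t$, $i_t$, $o_t$ are the forget, input and output gates, $c_t$ is the cell state, $h_t$ the hidden state, $W_\ast$ and $b_\ast$ are weights and biases, $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate decides how much context from previously processed attributes is kept in $c_t$, which is the mechanism the text relies on for propagating inter-attribute dependencies.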
In order to learn more precise spatial distribution of attention over a pedestrian image for each attribute, we incorporate the initial attention map and previous internal hidden state to obtain a precise attention map of attribute , instead of using for attribute directly. The formulations of the incorporation of and is shown as follows:
where and are parameters to be learned, is the area of the feature Tensor (), is the dimension of the hidden state (256). is referred to the unnormalized attention map. Then the precise attention map can be obtained by spatial normalization with a softmax function
where denotes the normalized attention values at the pixel for attribute . Figure 3 illustrates the attention maps corresponding to certain attributes, such as shoes, hat, upper-body clothing and handbag. It can be observed that the regions corresponding to the attributes receive high attention scores. This indicates that the attention block enables the network to concentrate on the regions corresponding to attributes and thus model attributes more precisely. Finally, the feature of each attribute is fed into an m-dimensional FC layer followed by a softmax layer, where m is the number of categories of that attribute. The attribute network utilizes the Attention-LSTM module to capture visual attention for each attribute and explore the semantic context among attributes, which is effective for improving the performance of attribute recognition.
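The spatial normalization and the element-wise weighting described in this section can be sketched as follows. This is an illustrative pure-Python version on a tiny map, not the authors' implementation; the function names and the 3x2 map size are hypothetical, and the step that combines the initial attention map with the previous hidden state is omitted.

```python
import math

def spatial_softmax(score_map):
    """Normalize a 2-D map of unnormalized scores so all locations sum to 1."""
    flat = [s for row in score_map for s in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in score_map]
    total = sum(e for row in exps for e in row)
    return [[e / total for e in row] for row in exps]

def apply_attention(feature_map, attention):
    """Element-wise product of one feature channel with the attention map."""
    return [[f * a for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(feature_map, attention)]

# Hypothetical 3x2 unnormalized scores with one salient location.
scores = [[0.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
attention = spatial_softmax(scores)
attended = apply_attention([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], attention)
```

Because the softmax runs over spatial locations rather than over channels, each attribute receives its own probability map over the image, and locations with higher scores contribute more to the attended feature.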
3.3. Appearance Network
An appearance network is developed to learn global and local appearance representation by the task of person re-identification. Existing methods usually abstract appearance feature from full body of pedestrians and/or from horizontal body parts. For learning more effective appearance feature for person matching, an appearance network is designed to extract representation from full body of pedestrians and horizontal body parts, as well as the vertical body parts with exploration of spatial dependencies among body parts along both horizontal and vertical directions.
As shown in Figure 2, the appearance network consists of 3 convolution layers, 3 average-pooling layers and 10 fully-connected (FC) layers. Each convolution layer is followed with a BatchNorm (BN) layer and a Rectified Linear Units (ReLU) layer. The appearance network takes the feature tensor as input, and partitions into horizontal stripes and vertical stripes of body, respectively. Then, the feature tensor , the horizontal and vertical stripes go through three corresponding local branches to abstract global appearance feature and part-based features, respectively. We define the vector of activations along the channel axis as a column vector. The three local branches in the network employ mean-pooling layers to average all the column vectors in a same horizontal stripe or a same vertical stripe for producing a horizontal part-level column vector or a vertical part-level column vector , as well as average to produce a global-level column vector . Afterward, three convolution layers are applied to reduce the dimension of the three type column vectors, respectively. Finally, each type column vector is taken into a classifier layer which is implemented with a FC layer and a corresponding softmax function layer for classifying the person ID.
The size of is , which is also equally partitioned into 6 horizontal stripes () and 3 vertical stripes (), respectively. The kernel sizes of the three convolution layers in the network are . The dimension of the FC layers is the total number of person IDs. The appearance network is optimized by minimizing the sum of part loss for horizontal stripes and vertical stripes, and global loss for . During testing stage, all the pieces of 256-dimensional column vectors are concatenated to form the appearance feature , which is used for pedestrian matching.
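The stripe partition and average pooling can be illustrated with a small sketch. It assumes a feature tensor stored as nested lists `tensor[y][x]` of channel vectors and uses a hypothetical 12x6 map with 2 channels; only the stripe counts (6 horizontal, 3 vertical) come from the text, and the subsequent dimension-reducing convolutions and classifiers are left out.

```python
def mean_vector(columns):
    """Average a list of C-dimensional column vectors."""
    n = len(columns)
    return [sum(col[c] for col in columns) / n for c in range(len(columns[0]))]

def stripe_features(tensor, num_h=6, num_v=3):
    """Return (global vector, horizontal stripe vectors, vertical stripe vectors).

    tensor[y][x] is the C-dimensional column vector at spatial location (y, x).
    """
    height, width = len(tensor), len(tensor[0])
    assert height % num_h == 0 and width % num_v == 0
    rows_per, cols_per = height // num_h, width // num_v
    every = [tensor[y][x] for y in range(height) for x in range(width)]
    horizontal = [
        mean_vector([tensor[y][x]
                     for y in range(i * rows_per, (i + 1) * rows_per)
                     for x in range(width)])
        for i in range(num_h)
    ]
    vertical = [
        mean_vector([tensor[y][x]
                     for y in range(height)
                     for x in range(j * cols_per, (j + 1) * cols_per)])
        for j in range(num_v)
    ]
    return mean_vector(every), horizontal, vertical

# Hypothetical 12x6 map with 2 channels: channel 0 stores y, channel 1 stores x.
toy = [[[float(y), float(x)] for x in range(6)] for y in range(12)]
g, h_parts, v_parts = stripe_features(toy)
```

One global vector, six horizontal and three vertical part vectors give ten vectors in total, which would be consistent with the ten FC classifier layers mentioned for the appearance network.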
3.4. Loss Function and Optimization
Identification loss is usually leveraged for classification task and has advantages in terms of simplicity and effectiveness. Hence, we adopt the identification loss to optimize the appearance features and attribute features. Suppose the training set has images of identities, where denotes the - person images, is person IDs of image and is a set of attribute labels of person image . Given training examples, the proposed extracts , (local appearance features) and (global appearance feature) from the appearance network, as well as extract attribute features from the attribute network. The loss function for the task of person re-identification is the sum of part loss and global loss, which is formulated as follows:
where is the corresponding person ID of the -th pedestrian image, represents the j-th column of the weight matrix and refers to a bias term. and are the number of horizontal stripes and vertical stripes, respectively. The batch size of the input is .
The loss function for the attribute classification task is the sum of attribute classification losses, which is formulated as follows:
By jointly training on the person re-identification task and the attribute classification task, the proposed CaNet is optimized to predict person IDs and attributes simultaneously. The total loss function for CaNet is defined as follows:
where denotes the balance weight of the two loss functions.
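The structure of the objective described in this section can be sketched as below for a single image. This is an assumption-laden illustration: it uses the usual softmax cross-entropy as the identification loss, omits averaging over the mini-batch, and assumes that the balance weight multiplies the attribute term (the text does not say which term it scales). All function names and toy logits are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed in a stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def identification_loss(global_logits, part_logits, person_id):
    """Global loss plus the sum of per-stripe part losses for one image."""
    loss = cross_entropy(global_logits, person_id)
    loss += sum(cross_entropy(z, person_id) for z in part_logits)
    return loss

def attribute_loss(attr_logits, attr_labels):
    """Sum of classification losses over all attributes for one image."""
    return sum(cross_entropy(z, y) for z, y in zip(attr_logits, attr_labels))

def total_loss(id_loss, attr_loss, balance=2.0):
    # Assumption: the balance weight scales the attribute term.
    return id_loss + balance * attr_loss

uniform = [0.0, 0.0, 0.0, 0.0]  # logits over 4 hypothetical identities
# 9 part classifiers = 6 horizontal + 3 vertical stripes, plus 1 global.
l_id = identification_loss(uniform, [uniform] * 9, person_id=1)
l_att = attribute_loss([[0.0, 0.0], [5.0, 0.0]], [0, 0])
```

Summing a separate loss per stripe means that every body part must be individually predictive of the identity, which is the stated motivation for the part loss.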
4. Experiments
In this section, we conduct extensive experiments to evaluate the performance of the proposed CaNet on two widely used person re-identification datasets and compare CaNet to state-of-the-art methods. The experimental results show that CaNet achieves superior person re-identification performance over the state-of-the-art methods. Moreover, we investigate the effectiveness of the components of CaNet, including the attribute network and the appearance network.
Datasets - There are several benchmark datasets established for person re-identification. In this work, extensive experiments are conducted on two widely used datasets, i.e., Market-1501 and DukeMTMC-reID, for fair comparison and evaluation. Both datasets are challenging and realistic.
The Market-1501 dataset is one of the largest and most realistic person re-identification benchmarks. It contains 32,668 images of 1,501 identities captured by 6 cameras. All images are automatically detected by the Deformable Part Model (DPM) detector (Felzenszwalb et al., 2008). Following the protocol used in (Zheng et al., 2015), the dataset is split into a fixed training set of 12,936 images of 751 identities and a fixed testing set of 19,732 images of 750 identities. The proposed method is compared to the state-of-the-art methods under the single-query evaluation setting.
The DukeMTMC-reID dataset is a subset of the DukeMTMC dataset (Ristani et al., 2016) and is one of the most challenging re-ID datasets due to the similar clothes of different pedestrians and occlusion by trees and cars. It contains 36,411 hand-drawn bounding boxes of 1,812 identities from 8 high-resolution cameras. Following the evaluation protocol specified in (Zheng et al., 2017b), the dataset is split into a fixed training set of 16,522 images of 702 identities and a fixed testing set of 17,661 gallery images of 702 identities. In addition, there are 2,228 query pedestrian images. As with Market-1501, performance on the DukeMTMC-reID dataset is evaluated under the single-query evaluation setting.
Pedestrian images in the Market-1501 dataset are annotated with 27 attributes at the identity level. Each attribute is labeled with its presence or absence on each pedestrian image, except the attribute “age”, which is labeled with four types, i.e., young, teenager, adult and old. There are 8 and 9 colors (e.g., upblack, upwhite and downred) for upper-body clothing and lower-body clothing, respectively, and only one color is labeled as positive for one identity. We therefore regard the 8 upper-body colors and the 9 lower-body colors as one upper-body clothing color attribute with 9 classes and one lower-body clothing color attribute with 10 classes (the additional class covers the case in which the upper-body or lower-body clothing color of a pedestrian does not belong to the 8 or 9 colors). Hence, there are 12 attributes, consisting of binary and multi-type valued attributes, on the Market-1501 dataset and 10 attributes on the DukeMTMC-reID dataset. As the Attention-LSTM module processes attributes sequentially, we need to determine the order of attributes. However, person attributes have no natural fixed order. A promising solution is to adopt multiple orders of attributes (e.g., rare first, frequent first, top-down and random order) and fuse their results for the subsequent module (Wang et al., 2017). In the experiments, we adopt two types of orders, i.e., top-down, which follows the topological structure of the body, and fine-abstract, which follows the semantic granularity from fine-grained attributes to abstract attributes.
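The merging of binary color flags into a single multi-class attribute, and the resulting attribute count, can be checked with a short sketch. The helper `collapse_colors` and the two-entry list `UPPER` are hypothetical; the dataset itself defines 8 upper-body and 9 lower-body colors.

```python
def collapse_colors(flags, color_names):
    """Map binary color flags (at most one positive) to a single class index.

    Index len(color_names) is the extra "other" class used when none of the
    listed colors is marked positive.
    """
    positives = [i for i, name in enumerate(color_names) if flags.get(name, 0)]
    if len(positives) > 1:
        raise ValueError("expected at most one positive color")
    return positives[0] if positives else len(color_names)

# Hypothetical subset of color names taken from the examples in the text.
UPPER = ["upblack", "upwhite"]

# 27 annotated attributes, of which 8 + 9 are color flags that are replaced
# by 2 multi-class color attributes.
binary_total, upper_colors, lower_colors = 27, 8, 9
merged_total = binary_total - upper_colors - lower_colors + 2
```

Here `merged_total` evaluates to 12, matching the number of Market-1501 attributes stated above.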
Implementation Details - The implementation of the proposed method is based on the PyTorch framework with two NVIDIA Titan XP GPUs. We adopt the model pre-trained on ImageNet to initialize the parameters of CaNet on the two person re-identification datasets. The stochastic gradient descent (SGD) algorithm is started with a learning rate of 0.01, the weight decay of and the Nesterov momentum of 0.9. The parameter in Eq. (7) is set to 2. All the images are resized to the size of and normalised with . Meanwhile, the training set is enlarged by data augmentation strategies (Zhong et al., 2017b), including random horizontal flipping and random erasing with a probability of 0.5, during the training phase. The mini-batch size is set to 64. The proposed network is optimized for 250 iterations in each epoch and 70 epochs in total. Moreover, the whole training process is divided into three stages. In the first stage, the base network followed by the appearance network is trained until convergence on the person re-identification task, which drives the base network to learn feature maps suitable for the attribute classification task. In the second stage, CaNet is trained on the person re-identification and attribute classification tasks to learn discriminative appearance features and robust attribute features, respectively. In the last stage, the two features are merged and CaNet is re-trained on the person re-identification task only until convergence, which drives CaNet to learn a discriminative and robust pedestrian representation for person re-identification.
Protocol - The Cumulative Matching Characteristic (CMC) is extensively adopted for the quantitative evaluation of person re-identification methods. The rank-k recognition rate in the CMC curve indicates the probability that a query identity appears in the top-k positions of the ranked gallery. The other evaluation metric is the mean average precision (mAP), which treats person re-identification as a retrieval task.
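A minimal sketch of the two metrics is given below. It is a simplified illustration that works on already-ranked gallery identity lists; the benchmark protocols additionally filter gallery images (for example, those from the same camera as the query), which is omitted here. The function names and the toy rankings are hypothetical.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """1 if a correct match appears within the top-k gallery results, else 0."""
    return int(query_id in ranked_ids[:k])

def average_precision(ranked_ids, query_id):
    """Mean of the precision values at each position holding a correct match."""
    hits, precisions = 0, []
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(rankings, k=1):
    """rankings: list of (query_id, gallery ids sorted from best to worst)."""
    cmc = sum(rank_k_hit(r, q, k) for q, r in rankings) / len(rankings)
    mean_ap = sum(average_precision(r, q) for q, r in rankings) / len(rankings)
    return cmc, mean_ap

toy_rankings = [
    ("a", ["a", "b", "a", "c"]),  # correct matches at positions 1 and 3
    ("b", ["c", "b", "a", "a"]),  # first correct match at position 2
]
rank1, mean_ap = evaluate(toy_rankings, k=1)
rank2, _ = evaluate(toy_rankings, k=2)
```

In this toy case the rank-1 rate is 0.5 and the rank-2 rate is 1.0, while mAP also rewards the first query for retrieving its second match early. This is why the two metrics are reported together.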
4.1. Comparison to State-of-the-Arts
Table 1: Performance comparison on Market-1501 (%).
| Method | Rank-1 | Rank-5 | Rank-10 | mAP |
|---|---|---|---|---|
| Bow+kissMe (Zheng et al., 2015) | 44.4 | 63.9 | 72.2 | 20.8 |
| WARCA (Jose and Fleuret, 2016) | 45.2 | 68.1 | 76.0 | - |
| KLFDA (Karanam et al., 2018) | 46.5 | 71.1 | 79.9 | - |
| DNS (Zhang et al., 2016) | 55.43 | - | - | 29.9 |
| CRAFT (Chen et al., 2018) | 68.7 | 87.1 | 90.8 | 42.3 |
| SOMAnet (Barbosa et al., 2018) | 73.9 | 88.0 | 92.2 | 47.9 |
| HydraPlus (Liu et al., 2017b) | 76.9 | 91.3 | 94.5 | - |
| SVDNet (Sun et al., 2017) | 82.3 | 92.3 | 95.2 | 62.1 |
| PAN (Zheng et al., 2017a) | 82.8 | - | - | 63.4 |
| Triplet Loss (Hermans et al., 2017) | 84.9 | 94.2 | - | 69.1 |
| MultiScale (Chen et al., 2017) | 88.9 | - | - | 73.1 |
| GLAD (Wei et al., 2017) | 89.9 | - | - | 73.9 |
| HA-CNN (Li et al., 2018) | 91.2 | - | - | 75.7 |
| ACRN (Schumann and Stiefelhagen, 2017) | 83.6 | 92.6 | 95.3 | 62.6 |
| APR (Lin et al., 2017) | 84.3 | 93.2 | 95.2 | 64.7 |
Table 2: Performance comparison on DukeMTMC-reID (%).
| Method | Rank-1 | Rank-5 | Rank-10 | mAP |
|---|---|---|---|---|
| Bow+kissMe (Zheng et al., 2015) | 25.14 | - | - | 12.2 |
| LOMO+XQDA (Liao et al., 2015) | 30.8 | - | - | 17.0 |
| GAN (Zheng et al., 2017b) | 67.7 | - | - | 47.1 |
| PAN (Zheng et al., 2017a) | 71.6 | 83.9 | - | 45.0 |
| SVDNet (Sun et al., 2017) | 76.7 | 86.4 | 89.9 | 56.8 |
| MultiScale (Chen et al., 2017) | 79.2 | - | - | 60.6 |
| EMR (Yu et al., 2017) | 80.4 | - | - | 63.9 |
| HA-CNN (Li et al., 2018) | 80.5 | - | - | 63.8 |
| Deep-Person (Bai et al., 2017) | 80.9 | - | - | 64.8 |
| APR (Lin et al., 2017) | 70.7 | - | - | 51.9 |
| ACRN (Schumann and Stiefelhagen, 2017) | 72.6 | 84.8 | 88.9 | 52.0 |
Market-1501: Table 1 shows the performance comparison of the proposed CaNet against 15 state-of-the-art methods in terms of CMC accuracy and mAP. The compared methods belong to three categories: traditional methods based on hand-crafted features and/or distance metric learning, including Bow+kissMe (Zheng et al., 2015), WARCA (Jose and Fleuret, 2016), KLFDA (Karanam et al., 2018), DNS (Zhang et al., 2016) and CRAFT (Chen et al., 2018); deep learning based methods, including SOMAnet (Barbosa et al., 2018), HydraPlus (Liu et al., 2017b), SVDNet (Sun et al., 2017), PAN (Zheng et al., 2017a), Triplet Loss (Hermans et al., 2017), MultiScale (Chen et al., 2017), GLAD (Wei et al., 2017) and HA-CNN (Li et al., 2018); and attribute based methods, including ACRN (Schumann and Stiefelhagen, 2017) and APR (Lin et al., 2017). The proposed CaNet achieves a 93.2% rank-1 recognition rate and an 80.0% mAP score. Our method surpasses the existing methods: relative to the second best compared method, HA-CNN (91.2% rank-1, 75.7% mAP), it improves the rank-1 recognition rate by 2.2% and the mAP score by 5.7%. Moreover, CaNet achieves relative rank-1 improvements of 11.5% and 10.6% over the two attribute based methods ACRN and APR, respectively. The comparison indicates the effectiveness of the proposed CaNet in jointly exploiting attribute and appearance information. Moreover, the Attention-LSTM module in the attribute network is able to capture the latent semantic context among attributes and learn the spatial attention for each attribute. CaNet (RK) refers to the proposed method with re-ranking based on k-reciprocal encoding (Zhong et al., 2017a), which is an effective strategy for boosting performance. With the help of re-ranking, the rank-1 accuracy and mAP of CaNet are further improved to 94.7% and 91.5%, respectively.
DukeMTMC-reID: We compare the proposed CaNet against 11 state-of-the-art methods, including two traditional methods, Bow+kissMe (Zheng et al., 2015) and LOMO+XQDA (Liao et al., 2015), seven deep learning based methods, GAN (Zheng et al., 2017b), PAN (Zheng et al., 2017a), SVDNet (Sun et al., 2017), MultiScale (Chen et al., 2017), EMR (Yu et al., 2017), HA-CNN (Li et al., 2018) and Deep-Person (Bai et al., 2017), as well as two attribute based methods, APR (Lin et al., 2017) and ACRN (Schumann and Stiefelhagen, 2017). From Table 2, we can observe that the proposed CaNet outperforms all the existing methods at all ranks, obtaining the best rank-1 recognition rate of 84.6% and the best mAP score of 70.2%. Relative to the second best compared method, Deep-Person (80.9% rank-1, 64.8% mAP), CaNet improves the rank-1 recognition rate by 4.6% and the mAP score by 8.3%. In addition, CaNet achieves relative rank-1 improvements of 19.7% and 16.5% over the attribute based methods APR and ACRN, respectively. With re-ranking, the result of CaNet increases to an 89.6% rank-1 recognition rate and an 86.4% mAP score. An illustration of some retrieval results is given in Figure 4.
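The improvement percentages quoted in the two comparisons match relative gains, (ours - baseline) / baseline, rather than absolute differences in percentage points. The short check below recomputes them from the reported scores; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return round((ours - baseline) / baseline * 100, 1)

# Market-1501: proposed method at 93.2 rank-1 and 80.0 mAP.
market = {
    "HA-CNN rank-1": relative_gain(93.2, 91.2),
    "HA-CNN mAP": relative_gain(80.0, 75.7),
    "ACRN rank-1": relative_gain(93.2, 83.6),
    "APR rank-1": relative_gain(93.2, 84.3),
}

# DukeMTMC-reID: proposed method at 84.6 rank-1 and 70.2 mAP.
duke = {
    "Deep-Person rank-1": relative_gain(84.6, 80.9),
    "Deep-Person mAP": relative_gain(70.2, 64.8),
    "APR rank-1": relative_gain(84.6, 70.7),
    "ACRN rank-1": relative_gain(84.6, 72.6),
}
```

For instance, the absolute rank-1 gap to HA-CNN on Market-1501 is 2.0 points, while the relative gain is 2.2%, which is the figure reported in the text.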
4.2. Ablation Studies
To demonstrate the effectiveness and contribution of each component of CaNet, we conduct a series of ablation experiments on the DukeMTMC-reID dataset. We also compare the performance of different types of appearance features for the appearance network and evaluate the effect of the Attention-LSTM module for the attribute network.
Table 3 summarizes the ablation results of the proposed CaNet. CaNet_w/o App refers to CaNet without the appearance network, which uses the base network followed by the attribute network to learn attribute features and matches pedestrians using the attribute features directly. CaNet_w/o Att refers to CaNet without the attribute network, which uses the base network followed by the appearance network to learn appearance features. From Table 3, we can observe that CaNet_w/o App only obtains 57.1% rank-1 accuracy, since this network is not specifically designed for the person re-identification task. CaNet_w/o Att achieves 80.1% rank-1 accuracy. The full CaNet yields the best rank-1 accuracy of 84.6%, outperforming the other two networks, which shows the effectiveness of CaNet in jointly exploiting both appearance and attribute features for person re-identification.
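The three variants in this ablation differ only in which features are used to match a query against the gallery. The sketch below illustrates one plausible way to combine the two feature types at matching time: L2-normalise each feature, concatenate them, and rank gallery images by Euclidean distance. The fusion by concatenation, the normalisation and all function names are assumptions made for illustration; the text above does not specify how the features are combined.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if all-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(attribute_feat, appearance_feat):
    """Concatenate the normalised features (hypothetical fusion rule)."""
    return l2_normalize(attribute_feat) + l2_normalize(appearance_feat)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_gallery(query_feat, gallery_feats):
    """Gallery indices sorted from most to least similar to the query."""
    return sorted(range(len(gallery_feats)),
                  key=lambda j: euclidean(query_feat, gallery_feats[j]))


# Toy example with 2-D attribute and 2-D appearance features.
query = fuse([3.0, 4.0], [0.0, 2.0])
gallery = [fuse([4.0, -3.0], [1.0, 0.0]),   # dissimilar image
           fuse([3.0, 4.1], [0.1, 2.0])]    # similar image
ranking = rank_gallery(query, gallery)
```

Matching with attribute features only, or appearance features only, corresponds to passing just one of the two vectors to `rank_gallery`.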
Table 4 reports the accuracy of different types of appearance features. All the experiments are conducted with the base network followed by the appearance network. AppNet_G extracts only the global appearance feature, achieving 72.1% rank-1 accuracy and 51.9% mAP score. AppNet_V extracts only local appearance features from vertical stripes of the body, obtaining 77.6% rank-1 accuracy and 59.6% mAP score. AppNet_H extracts only local appearance features from horizontal stripes of the body, obtaining 79.2% rank-1 accuracy and 63.9% mAP score. AppNet extracts both the global and local appearance features and obtains the best performance of 80.1% rank-1 accuracy and 64.1% mAP score. Comparing the different types of appearance features, we observe that the local appearance features outperform the global appearance feature, which indicates that local features guide the network to learn more detailed visual cues and avoid over-fitting to a specific body part. Moreover, jointly learning global and local features forces the appearance network to learn more spatial context among different body parts, which is effective in improving the performance of person re-identification.
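The difference between the global, horizontal-stripe and vertical-stripe features can be made concrete with a small pooling sketch. It assumes that each feature is obtained by average-pooling a convolutional feature map over the full map, over bands of rows, or over bands of columns; the pooling operator, the number of stripes and all function names are assumptions for illustration rather than the exact design of the appearance network.

```python
def pool_region(fmap, rows, cols):
    """Average-pool each channel of fmap[c][h][w] over rows x cols."""
    pooled = []
    for channel in fmap:
        values = [channel[h][w] for h in rows for w in cols]
        pooled.append(sum(values) / len(values))
    return pooled


def stripes(length, n):
    """Split range(length) into n contiguous index ranges (n <= length)."""
    bounds = [i * length // n for i in range(n + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(n)]


def appearance_feature(fmap, n_horizontal=2, n_vertical=2):
    """Concatenate global, horizontal-stripe and vertical-stripe features.

    The stripe counts are illustrative defaults (assumption).
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    all_rows, all_cols = range(height), range(width)
    feature = pool_region(fmap, all_rows, all_cols)      # full body
    for rows in stripes(height, n_horizontal):           # bands of rows
        feature += pool_region(fmap, rows, all_cols)
    for cols in stripes(width, n_vertical):              # bands of columns
        feature += pool_region(fmap, all_rows, cols)
    return feature


# Toy feature map: 1 channel, 2 x 2 spatial grid.
toy_map = [[[1.0, 2.0],
            [3.0, 4.0]]]
toy_feature = appearance_feature(toy_map)
```

In the toy example the global feature averages all four cells, while each stripe feature keeps information about where in the image the activation occurred, which is the extra spatial detail the local features contribute.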
Table 5 compares each component of the Attention-LSTM module in the attribute network. All the experiments are conducted with the base network followed by the attribute network. AttNet_Base does not use the Attention-LSTM module and directly connects the output feature tensor to each attribute classifier, obtaining 40.3% rank-1 accuracy. AttNet_LSTM replaces the Attention-LSTM module with a single LSTM cell, obtaining 43.9% rank-1 accuracy. AttNet_Attention uses only the attention block of the full Attention-LSTM module, achieving 53.7% rank-1 accuracy. AttNet employs the full Attention-LSTM module in the attribute network and obtains the best performance of 57.1% rank-1 accuracy. The comparison of AttNet_Base and AttNet_LSTM shows that the LSTM cell is able to learn the contextual semantic correlation among attributes. Meanwhile, the comparison of AttNet_Base and AttNet_Attention shows that the attention block is able to learn a precise spatial attention for each attribute. Therefore, the Attention-LSTM module can improve the accuracy of attribute recognition and learn more discriminative attribute features for person re-identification.
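As an illustration of what a spatial attention block for one attribute computes, the sketch below scores each spatial location of a feature map against an attribute-specific vector, normalises the scores with a softmax, and returns the attention-weighted feature. This is a generic soft-attention sketch under stated assumptions: the dot-product scoring, the `query` vector and the function names are hypothetical, and the LSTM part of the Attention-LSTM module is not modelled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(locations, query):
    """Soft spatial attention for a single attribute.

    locations: list of D-dimensional feature vectors, one per spatial
               position of the feature map.
    query:     D-dimensional attribute-specific vector (hypothetical;
               in a trained network it would be learned).
    Returns (weights, attended_feature).
    """
    scores = [sum(q * x for q, x in zip(query, loc)) for loc in locations]
    weights = softmax(scores)
    dim = len(locations[0])
    attended = [sum(w * loc[d] for w, loc in zip(weights, locations))
                for d in range(dim)]
    return weights, attended


# Toy example: two spatial positions with 2-D features.
toy_locations = [[1.0, 0.0], [0.0, 1.0]]
uniform_weights, _ = attend(toy_locations, [0.0, 0.0])
focused_weights, focused_feature = attend(toy_locations, [10.0, 0.0])
```

With a zero query every location receives the same weight, which corresponds to plain average pooling; a query aligned with the first location concentrates almost all of the weight there, mimicking attention on the image region related to an attribute.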
In this work, we proposed a novel Contextual-Attentional Attribute-Appearance Network (CaNet) to learn discriminative and robust pedestrian representations for person re-identification. The proposed CaNet jointly learns semantic attribute and visual appearance representations of pedestrians while simultaneously exploiting the semantic context among attributes, visual attention on attributes and spatial dependencies between body parts. The exploitation of inter-attribute context and visual attention leads to precise learning of attributes, in turn generating an effective attribute representation. The appearance features learned from both the full body and body parts provide a comprehensive description of pedestrian appearance. We conducted extensive experiments on two widely used real-world person re-identification datasets, i.e., Market-1501 and DukeMTMC-reID. The experimental results show that the proposed CaNet achieves significant performance improvements over a wide range of state-of-the-art methods.
Acknowledgements. This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61622211, 61472392, 61620106009 and 61525206 as well as the Fundamental Research Funds for the Central Universities under Grant WK2100100030.
- Ahmed et al. (2015) Ejaz Ahmed, Michael Jones, and Tim K Marks. 2015. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3908–3916.
- Bai et al. (2017) Xiang Bai, Mingkun Yang, Tengteng Huang, Zhiyong Dou, Rui Yu, and Yongchao Xu. 2017. Deep-Person: Learning Discriminative Deep Features for Person Re-Identification. arXiv preprint arXiv:1711.10658 (2017).
- Barbosa et al. (2018) Igor Barros Barbosa, Marco Cristani, Barbara Caputo, Aleksander Rognhaugen, and Theoharis Theoharis. 2018. Looking beyond appearances: Synthetic training data for deep cnns in re-identification. Computer Vision and Image Understanding 167 (2018), 50–62.
- Chen et al. (2017) Yanbei Chen, Xiatian Zhu, and Shaogang Gong. 2017. Person re-identification by deep learning multi-scale representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2590–2600.
- Chen et al. (2018) Ying-Cong Chen, Xiatian Zhu, Wei-Shi Zheng, and Jian-Huang Lai. 2018. Person re-identification by camera correlation aware feature augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 2 (2018), 392–408.
- Chun et al. (2008) Young Deok Chun, Nam Chul Kim, and Ick Hoon Jang. 2008. Content-based image retrieval using multiresolution color and texture features. IEEE Transactions on Multimedia 10, 6 (2008), 1073–1084.
- Farenzena et al. (2010) Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. 2010. Person re-identification by symmetry-driven accumulation of local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2360–2367.
- Felzenszwalb et al. (2008) Pedro Felzenszwalb, David McAllester, and Deva Ramanan. 2008. A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–8.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
- Hermans et al. (2017) Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017).
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
- Jose and Fleuret (2016) Cijo Jose and François Fleuret. 2016. Scalable metric learning via weighted approximate rank component analysis. In Proceedings of the European Conference on Computer Vision. 875–890.
- Karanam et al. (2018) Srikrishna Karanam, Mengran Gou, Ziyan Wu, Angels Rates-Borras, Octavia Camps, and Richard J Radke. 2018. A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets. IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (2018), 1–1.
- Li et al. (2017a) Dangwei Li, Xiaotang Chen, Zhang Zhang, and Kaiqi Huang. 2017a. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 384–393.
- Li et al. (2017b) Wei Li, Xiatian Zhu, and Shaogang Gong. 2017b. Person re-identification by deep joint learning of multi-loss classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2194–2200.
- Li et al. (2018) Wei Li, Xiatian Zhu, and Shaogang Gong. 2018. Harmonious Attention Network for Person Re-Identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2–11.
- Li et al. (2016) Yining Li, Chen Huang, Chen Change Loy, and Xiaoou Tang. 2016. Human attribute recognition by deep hierarchical contexts. In Proceedings of the European Conference on Computer Vision. Springer, 684–700.
- Liao et al. (2015) Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z Li. 2015. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2197–2206.
- Lin et al. (2017) Yutian Lin, Liang Zheng, Zhedong Zheng, Yu Wu, and Yi Yang. 2017. Improving person re-identification by attribute and identity learning. arXiv preprint arXiv:1703.07220 (2017).
- Lisanti et al. (2015) Giuseppe Lisanti, Iacopo Masi, Andrew D Bagdanov, and Alberto Del Bimbo. 2015. Person re-identification by iterative re-weighted sparse ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 8 (2015), 1629–1642.
- Liu et al. (2017a) Hao Liu, Zequn Jie, Karlekar Jayashree, Meibin Qi, Jianguo Jiang, Shuicheng Yan, and Jiashi Feng. 2017a. Video-based person re-identification with accumulative motion context. IEEE Transactions on Circuits and Systems for Video Technology 99 (2017), 1–1.
- Liu et al. (2018) Jiawei Liu, Zheng-Jun Zha, Xuejin Chen, Zilei Wang, and Yongdong Zhang. 2018. Dense 3d-convolutional neural network for person re-identification in videos. ACM Transactions on Multimedia Computing Communications and Applications pp, 1 (2018), 1.
- Liu et al. (2016) Jiawei Liu, Zheng-Jun Zha, Qi Tian, Dong Liu, Ting Yao, Qiang Ling, and Tao Mei. 2016. Multi-scale triplet cnn for person re-identification. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 192–196.
- Liu et al. (2017b) Xihui Liu, Haiyu Zhao, Maoqing Tian, Lu Sheng, Jing Shao, Shuai Yi, Junjie Yan, and Xiaogang Wang. 2017b. Hydraplus-net: Attentive deep features for pedestrian analysis. arXiv preprint arXiv:1709.09930 (2017).
- Ristani et al. (2016) Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. 2016. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In Proceedings of the European Conference on Computer Vision. 17–35.
- Schumann and Stiefelhagen (2017) Arne Schumann and Rainer Stiefelhagen. 2017. Person re-identification by deep learning attribute-complementary information. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1435–1443.
- Shi et al. (2015) Zhiyuan Shi, Timothy M Hospedales, and Tao Xiang. 2015. Transferring a semantic representation for person re-identification and search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4184–4193.
- Si et al. (2018) Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang. 2018. Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-Identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8–17.
- Su et al. (2018) Chi Su, Fan Yang, Shiliang Zhang, Qi Tian, Larry Steven Davis, and Wen Gao. 2018. Multi-task learning with low rank attribute embedding for multi-camera person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 5 (2018), 1167–1181.
- Sun et al. (2017) Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. 2017. Svdnet for pedestrian retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6–15.
- Varior et al. (2016) Rahul Rama Varior, Mrinal Haloi, and Gang Wang. 2016. Gated siamese convolutional neural network architecture for human re-identification. In Proceedings of the European Conference on Computer Vision. Springer, 791–808.
- Wang et al. (2017) Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. 2017. Attribute recognition by joint recurrent learning of context and correlation. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2-12.
- Wei et al. (2017) Longhui Wei, Shiliang Zhang, Hantao Yao, Wen Gao, and Qi Tian. 2017. Glad: Global-local-alignment descriptor for pedestrian retrieval. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 420–428.
- Yang et al. (2014) Yang Yang, Jimei Yang, Junjie Yan, Shengcai Liao, Dong Yi, and Stan Z Li. 2014. Salient color names for person re-identification. In Proceedings of the European Conference on Computer Vision. Springer, 536–551.
- Yu et al. (2017) Qian Yu, Xiaobin Chang, Yi-Zhe Song, Tao Xiang, and Timothy M Hospedales. 2017. The Devil is in the Middle: Exploiting Mid-level Representations for Cross-Domain Instance Matching. arXiv preprint arXiv:1711.08106 (2017).
- Zhang et al. (2016) Li Zhang, Tao Xiang, and Shaogang Gong. 2016. Learning a discriminative null space for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1239–1248.
- Zhao et al. (2018) Xin Zhao, Liufang Sang, Guiguang Ding, Yuchen Guo, and Xiaoming Jin. 2018. Grouping Attribute Recognition for Pedestrian with Joint Recurrent Learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3177–3183.
- Zheng et al. (2015) Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision. 1116–1124.
- Zheng et al. (2017a) Zhedong Zheng, Liang Zheng, and Yi Yang. 2017a. Pedestrian alignment network for large-scale person re-identification. arXiv preprint arXiv:1707.00408 (2017).
- Zheng et al. (2017b) Zhedong Zheng, Liang Zheng, and Yi Yang. 2017b. Unlabeled Samples Generated by GAN Improve the Person Re-identification Baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision.
- Zhong et al. (2017a) Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. 2017a. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3652–3661.
- Zhong et al. (2017b) Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2017b. Random Erasing Data Augmentation. arXiv preprint arXiv:1708.04896 (2017).
- Zhou et al. (2017) Zhen Zhou, Yan Huang, Wei Wang, Liang Wang, and Tieniu Tan. 2017. See the Forest for the Trees: Joint Spatial and Temporal Recurrent Neural Networks for Video-Based Person Re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6776–6785.