Discriminative Feature Learning with Foreground Attention for Person Re-Identification

Sanping Zhou, Jinjun Wang, Deyu Meng, Yudong Liang, Yihong Gong, Nanning Zheng

This work is supported by the National Key Research and Development Program of China under Grant No. 2017YFA0700805, and the National Science Foundation of China under Grant No. 61473219. Sanping Zhou, Jinjun Wang, Yihong Gong and Nanning Zheng are with the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, Shaanxi, China. Deyu Meng is with the School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi, China. Yudong Liang is with the School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi, China. Corresponding author: Jinjun Wang. Email: jinjun@mail.xjtu.edu.cn.
Abstract

The performance of person re-identification (Re-ID) depends heavily on the camera environment, where large cross-view appearance variations caused by mutual occlusions and background clutter severely degrade the identification accuracy. It is therefore essential to learn a feature representation that can adaptively emphasize the foreground of each input image. In this paper, we propose a simple yet effective deep neural network to solve the person Re-ID problem, which attempts to learn a discriminative feature representation by attending to the foreground of the input images. Specifically, a novel foreground attentive neural network (FANN) is first built to strengthen the positive influence of the foreground while weakening the side effects of the background, in which an encoder-decoder subnetwork is carefully designed to drive the whole network to focus its attention on the foreground of the input images. Then, a novel symmetric triplet loss function is designed to enhance the feature learning capability by jointly minimizing the intra-class distance and maximizing the inter-class distance in each triplet unit. By training the deep neural network in an end-to-end fashion, a discriminative feature representation is finally learned to find the matched reference to each probe among various candidates in the gallery. Comprehensive experiments on several public benchmark datasets show clear improvements of our method over the state-of-the-art approaches.

Person Re-identification, Convolutional Neural Network (CNN), Foreground Attentive Feature Learning.

I Introduction

Fig. 1: Motivation of our method, in which we aim to learn a discriminative feature representation that focuses on the foreground persons. Specifically, row (a) shows the conventional approach, which extracts features from the entire input image, and row (b) shows that our method learns features only from the foreground persons.

Person re-identification (Re-ID) is an important task in many surveillance applications such as person association [1], multi-target tracking [2] and behavior analysis [3]. Given a pedestrian image from one camera view, the task is to re-locate the same person among a set of gallery candidates captured from disjoint camera views. The person Re-ID problem has attracted extensive research attention in recent years, yet it remains challenging due to the large appearance variations caused by mutual occlusions and background clutter across different camera views. Therefore, the key to improving the identification performance is to learn a discriminative feature representation that is robust to the large cross-view appearance variations.

To address this problem, extensive works have been reported in the past few years, which can be roughly divided into two categories: 1) developing discriminative descriptors to handle the variations of a person's appearance, and 2) designing distinctive distance metrics to measure the similarity between images. In the first line of work, various informative feature descriptors have been explored by utilizing different cues, including LBP [4], ELF [5] and LOMO [6]. In the second line of work, labeled images are used to learn effective distance metrics, including LADF [7], LMNN [8] and ITML [9]. An evident drawback of these methods is that they treat feature extraction and metric learning as two disjoint steps, so the two components cannot mutually compensate their capabilities in a joint framework.

With the rapid development of deep neural networks (DNNs), deep learning based methods [10, 11, 12, 13] have significantly advanced the state-of-the-art results on several public benchmark datasets for person Re-ID. The main reason is that feature extraction and metric learning can now be incorporated into an end-to-end framework, in which the representation capacity of DNNs can be fully exploited given a large number of labeled training samples. These methods usually consist of two main components, i.e., a neural network and an objective function. Specifically, the neural network is built to extract features from input images and the objective function is designed to guide the learning process. Representative deep neural networks include AlexNet [14], VGGNet [15] and ResNet [16], and representative objective functions include the softmax loss function [14], triplet loss function [17] and contrastive loss function [18]. In practice, heuristic knowledge can be embedded into the neural network or the loss function so as to improve the final performance. For example, Lin et al. [19] incorporated an affine constraint into the neural network to learn a view transformation between different scenes. In [20], Zhou et al. adopted a self-paced learning regime in the objective function, which enhances robustness by gradually involving samples from easy to hard.

In this paper, we incorporate a foreground attentive neural network (FANN) and a symmetric triplet loss function into an end-to-end learning framework, so as to efficiently learn a discriminative feature representation from the foreground of the input images for person Re-ID. In practical scenarios, the foreground person is often heavily polluted by background clutter and mutual occlusion, as shown in Fig. 1, which greatly degrades feature learning. To alleviate this problem, our solution is to suppress the side effects of the background in feature learning and to supervise the training process with an efficient objective function. Specifically, the FANN is first built to focus its attention on the foreground persons, in which each image is passed through an encoder-decoder subnetwork, and the output of the encoder is used for subsequent feature learning. The encoder network extracts features from the entire image, and the decoder network reconstructs a binary mask representation of each foreground person. As a result, the encoder network gradually focuses its attention on the foreground under the regularization of the decoder network, which is trained with a novel local regression loss function. Besides, the symmetric triplet loss function is designed to supervise the training process, in which the intra-class distance is minimized and the inter-class distance is maximized in each triplet unit, simultaneously. Benefiting from the resulting symmetric gradient back-propagation, a large margin is maintained between the positive pairs and negative pairs in the learned feature space. Extensive experimental results on the 3DPeS, VIPeR, CUHK01, CUHK03 and Market1501 datasets show significant improvements of our method over the state-of-the-art approaches.

The main contributions of this work can be highlighted as follows:

  • We propose a simple yet effective FANN for feature learning, in which the side effects of background can be naturally suppressed and the useful information of foreground can be greatly emphasized.

  • We propose a novel symmetric triplet loss function to supervise the feature learning, in which the intra-class distance is minimized and the inter-class distance is maximized in each triplet unit, simultaneously.

  • We propose an effective local regression loss function to supervise the foreground mask reconstruction, in which the local information in a neighborhood is used to deal with isolated regions in the ground truth mask.

The rest of this paper is organized as follows. Section II briefly reviews the related works. Section III introduces our neural network and objective function, followed by a discussion of the learning algorithm in Section IV. Experimental results and parameter analysis are presented in Section V, and Section VI concludes the paper.

II Related Work

We review three lines of related work, namely metric learning based methods, deep learning based methods and attention learning based methods, which are briefly introduced in the following paragraphs.

Metric Learning Based Method. This category of methods aims to find a mapping function from the feature space to a distance space, in which features of the same person are closer than those of different persons. For example, Zheng et al. [21] proposed a relative distance learning method from a probabilistic perspective. In [22], Mignon et al. learned a distance metric with sparse pairwise similarity constraints. Pedagadi et al. [23] utilized Local Fisher Discriminant Analysis (LFDA) to map high dimensional features into a more discriminative low dimensional space. In [4], Xiong et al. further extended LFDA and several other metrics by using kernel tricks and different regularizers. Nguyen et al. [24] measured the similarity of face pairs through cosine similarity, which is closely related to the inner product similarity. In [25], Loy et al. cast the person Re-ID problem as an image retrieval task by considering the listwise similarity. Chen et al. [26] proposed a kernel based metric learning method to explore the nonlinear relationships of samples in the feature space. In [27], Hirzer et al. learned a discriminative metric by using relaxed pairwise constraints. These methods learn a specific distance metric mainly from features extracted by manually designed descriptors, and therefore cannot fully exploit the potential of metric learning.

Deep Learning Based Method. This category of methods usually incorporates feature extraction and metric learning into a joint framework, in which a deep neural network is used to extract features and a distance metric is used to compute the loss and back-propagate the gradients. For example, Ahmed et al. [11] proposed a novel deep neural network which takes pairwise images as input and outputs a similarity value indicating whether the two input images show the same person or not. In [28], Xiao et al. applied a domain guided dropout algorithm to improve the performance of deep neural networks in extracting general person features. Ding et al. [12] introduced a triplet neural network to learn the relative similarity for person Re-ID. In [29], Wang et al. proposed a unified triplet and siamese deep architecture, which can jointly extract single-image and cross-image feature representations. Zhang et al. [30] incorporated deep hash learning into a triplet formulation for similarity comparison. In [31], Zhou et al. applied a recurrent neural network to jointly learn spatial and temporal features from video sequences. In [32], Lin et al. made use of the consistency information between different cameras and proposed a consistent-aware deep learning method for person Re-ID. One major limitation of these methods is that they take the whole image as input without attending to the foreground during training, so the learned feature representations are easily affected by noise from the background.

Fig. 2: Illustration of our deep neural network, which consists of the foreground attentive subnetwork, the body part subnetwork and the feature fusion subnetwork. Specifically, the foreground attentive subnetwork aims to focus the attention on the foreground by passing each input image through an encoder-decoder network. Then, the encoded feature maps are evenly sliced and discriminatively learned in the following body part subnetwork. Afterwards, the resulting feature maps are fused in the feature fusion subnetwork. Finally, the fused feature vectors are normalized onto the unit sphere and supervised by the symmetric triplet loss layer.

Attention Learning Based Method. This category of methods aims at learning discriminative features from input images by using attention mechanisms, which have been widely used in visual recognition tasks. For example, Wang et al. [33] proposed a residual attention network which incorporated the attention mechanism and a convolutional neural network into an end-to-end learning framework for image classification. In [34], Pei et al. introduced a temporal attention-gated model to remove noisy and irrelevant parts for robust sequence classification. Wang et al. [35] proposed a novel deep architecture to address multi-label image recognition by recurrently discovering attentional regions from the whole input image. In [36], Zheng et al. applied a multi-attention convolutional neural network to learn a discriminative feature representation from the generated parts. Song et al. [37] proposed a deep spatial-semantic attention learning method for fine-grained sketch-based image retrieval. In [38], Gorji and Clark presented a novel visual attention tracking technique based on the shared attention model, which learns the loci of attention from the scene actors. Chen et al. [39] proposed an attention guided multi-modal correlation learning method to analyze the visual and textual inputs for image search. In [40], Stollenga et al. proposed a deep attention selective network architecture, in which the feedback structure can dynamically alter its convolutional filter sensitivities during image classification. Different from these attention learning based methods, we attempt to focus the attention on the foreground of the input images, so as to alleviate the side effects of background noise in feature learning.

III Multi-Task Framework for Foreground Attentive Feature Learning

III-A Foreground Attentive Neural Network

The goal of our FANN is to learn a discriminative feature representation from the foreground of each input image. The proposed network is shown in Fig. 2, and it consists of the foreground attentive subnetwork, the body part subnetwork and the feature fusion subnetwork, as explained in the following paragraphs.

Foreground Attentive Subnetwork. The foreground attentive subnetwork aims to focus its attention on the foreground of the input images, so as to alleviate the side effects of the background. Our adopted paradigm is to pass each input image through an encoder-decoder network, in which the encoder extracts features from the RGB image and the decoder reconstructs a binary mask of the foreground person. The encoder will naturally focus its attention on the foreground, since the decoder gradually learns to reconstruct the binary foreground mask during training. Specifically, each input image is first resized and passed through two convolutional layers with learned filters. The resulting feature maps are then passed through a rectified linear unit (ReLU) followed by a max pooling layer. These layers constitute the encoder network, whose output is fed into the corresponding decoder network and the following body part subnetwork, simultaneously. The decoder network consists of two deconvolutional layers with learned filters, with a rectified linear unit (ReLU) between them. The output of the decoder network is used to reconstruct the binary foreground mask, so as to adaptively focus the attention of the encoder network on the foreground of the input images.
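To make the data flow concrete, the sketch below illustrates the encoder-decoder idea in PyTorch-style code. Since the exact filter counts and kernel sizes were not recoverable from the text, all layer hyper-parameters here are illustrative assumptions rather than the authors' settings.

```python
import torch
import torch.nn as nn

class ForegroundAttentiveSubnet(nn.Module):
    """Minimal sketch of the encoder-decoder idea described above.
    All channel counts and kernel sizes are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        # Encoder: two convolutions, a ReLU and max pooling applied to the RGB input.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Decoder: two deconvolutions that regress a single-channel foreground mask.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        feat = self.encoder(x)     # fed into the body part subnetwork
        mask = self.decoder(feat)  # supervised by the local regression loss
        return feat, mask
```

Because the reconstruction branch is supervised with the foreground mask, the gradients that reach the shared encoder push it toward foreground-sensitive features, which is the intended regularization effect.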

Body Part Subnetwork. The body part subnetwork aims at discriminatively learning feature representations from different body parts, inspired by the idea that different body parts carry different weights in representing one person [11]. The resulting feature maps of the encoder network are first evenly sliced into four equal parts along the height dimension, and the sliced feature maps are then fed into the body part subnetwork for feature learning. The body part subnetwork consists of two sets of residual blocks [16], in which the parameters are not shared between the convolutional layers, so that feature representations can be learned discriminatively for the different body parts. In each residual block, we pass each sliced feature map through two small convolutional layers. The outputs of the first small convolutional layer are summed with the outputs of the second small convolutional layer using an element-wise (eltwise) operation, followed by a rectified linear unit (ReLU). Finally, the resulting feature maps are passed through a max pooling layer. In order to learn a discriminative feature representation on the large scale datasets, we add a second residual block after the first one, which has the same shape as the former.
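As with the encoder-decoder sketch above, the residual block described in this paragraph can be summarized as follows; channel counts, kernel sizes and strides are assumptions, since the exact values were lost from the text.

```python
import torch
import torch.nn as nn

class PartResidualBlock(nn.Module):
    """Sketch of one residual block applied to a single body-part slice.
    All layer hyper-parameters are illustrative assumptions."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        out1 = self.conv1(x)
        out2 = self.conv2(out1)
        out = self.relu(out1 + out2)  # element-wise (eltwise) sum of the two conv outputs
        return self.pool(out)

# The encoder output is sliced into four parts along the height dimension and each
# part is processed by its own, non-shared stack of two such blocks:
# parts = torch.chunk(encoder_features, 4, dim=2)
# part_features = [blocks_i(p) for blocks_i, p in zip(part_blocks, parts)]
```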

Feature Fusion Subnetwork. The feature fusion subnetwork aims to fuse the learned features representing the different body parts and to normalize the final features onto a unit sphere. It consists of four groups of fully connected layers and a normalization layer. Specifically, the local feature maps of each body part are first learned discriminatively by two small fully connected layers in each group, with a rectified linear unit (ReLU) between them. Afterwards, the discriminatively learned features of the first four small fully connected layers are concatenated and fused by a large fully connected layer. Finally, the resulting features are further concatenated with the outputs of the second four fully connected layers to generate the final feature vector for representation. In addition, a normalization layer is used to regularize the magnitude of each feature vector to be unit length. Therefore, similarity comparison measured in the Euclidean distance is equivalent to that using the cosine distance.
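The equivalence between the Euclidean and cosine distances after L2 normalization follows from a standard identity; for unit-length feature vectors $\mathbf{u}$ and $\mathbf{v}$,

$$\|\mathbf{u} - \mathbf{v}\|_2^2 = \|\mathbf{u}\|_2^2 + \|\mathbf{v}\|_2^2 - 2\,\mathbf{u}^{\top}\mathbf{v} = 2 - 2\cos(\mathbf{u}, \mathbf{v}),$$

so ranking gallery candidates by Euclidean distance produces exactly the same order as ranking them by cosine similarity.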

III-B Multi-Task Objective Function

Let be the input training data, in which denotes the RGB images, represents the mask of foreground, and is the number of training samples. Specifically, indicates the triplet unit, in which and are two images with the same identity, and are two mismatched images with different identities. Besides, represents the corresponding foreground mask of . The goal of our foreground attentive deep neural network is to learn filter weights and biases that can jointly minimize the ranking error and the reconstruction error at the output layers, respectively. A recursive function for an -layer deep model can be defined as follows:

(1)
Fig. 3: Illustration of the gradient back-propagations and motion trajectories driven by two different triplet loss functions. Specifically, (a) shows the gradients of the asymmetric triplet loss function; (b) shows the gradients of the symmetric triplet loss function; (c) shows the motion trajectory driven by the asymmetric triplet loss function; and (d) shows the motion trajectory driven by the symmetric triplet loss function. In the optimization process, we adaptively update and , so as to jointly minimize the intra-class distance and maximize the inter-class distance in each triplet unit.

where denotes the filter weights of the layer to be learned, refers to the corresponding biases, denotes the convolution operation, is an element-wise non-linear activation function such as ReLU, and represents the feature maps generated at layer for . For simplicity, we consider the deep parameters as a whole , in which and .
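Because the symbols of Eq. (1) were lost in extraction, the following is only a sketch of the standard recursive form that matches the surrounding description; the notation ($W_l$ for the filter weights, $b_l$ for the biases, $*$ for convolution, $\sigma$ for the element-wise non-linearity, $f_l$ for the layer-$l$ feature maps) is introduced here for illustration:

$$f_l = \sigma\left(W_l * f_{l-1} + b_l\right), \qquad l = 1, \dots, n, \qquad f_0 = x,$$

where $x$ denotes the input image and $n$ the number of layers.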

In order to train the foreground attentive deep neural network in an end-to-end manner, we apply a multi-task objective function to supervise the learning process, which is defined as follows:

(2)

where denotes the symmetric triplet loss term, represents the local regression term, indicates the parameter regularization term, and are two fixed weight parameters. Specifically, and are two adaptive weights which control the symmetric gradient back-propagation, is the parameters of whole network, is the parameters of deep ranking neural network, and is the parameters of deep regression neural network.

Symmetric Triplet Loss Term. The goal of our symmetric triplet loss function is to maximize the relative distance between the positive pair and the negative pair in each triplet input, so as to learn a discriminative feature representation that can identify different individuals. Its advantage over the asymmetric triplet loss function [17] is that the gradient back-propagation deduced by our method is symmetric, as shown in Fig. 3, which is essential to jointly minimize the intra-class distance of the positive pair and maximize the inter-class distance of the negative pair in each triplet unit. (The underlying reason why our symmetric triplet loss function outperforms the asymmetric one is that it accelerates the motion of the positive sample in the vertical direction, as shown in Fig. 3; therefore, the intra-class distance can be minimized throughout the training process.) The hinge loss of our symmetric triplet loss function can be formulated as follows:

(3)

where is the margin between the positive pair and negative pair, and denotes the pairwise distance measured in the spherical space, which is defined as follows:

(4)

In practice, the two feature vectors are L2-normalized, so the distance measured in the Euclidean space is equivalent to that measured in the spherical space. The smaller the distance, the more similar the two input images, and vice versa. As a result, this definition formulates person Re-ID as a nearest neighbor search problem, which can be effectively solved by learning a discriminative feature representation.
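Since the symbols of Eq. (3) and Eq. (4) were lost in extraction, the following is a hedged sketch of a hinge loss of the kind described above, using notation introduced here for illustration ($\phi$ for the L2-normalized embedding, $(x_i, x_j)$ for the positive pair, $x_k$ for the negative sample, $M$ for the margin, and $\lambda_1, \lambda_2$ for the adaptive weights on the two negative distances); the authors' exact weighting scheme may differ:

$$d(x, y) = \|\phi(x) - \phi(y)\|_2^2, \qquad L_{rank} = \max\Big\{0,\; M + d(x_i, x_j) - \big[\lambda_1\, d(x_i, x_k) + \lambda_2\, d(x_j, x_k)\big]\Big\}.$$

Weighting both negative distances, rather than only the anchor-to-negative one, is what makes the back-propagated gradients symmetric with respect to the two samples of the positive pair.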

Local Regression Loss Term. The goal of our local regression loss term is to minimize the error in reconstructing the binary foreground mask at the output layer of the decoder network. As a consequence, the encoder network is regularized by the decoder network during training, and its attention gradually focuses on the foreground target. We measure the reconstruction error of each pixel in a local neighborhood, which is formulated as follows:

(5)

where represents a truncated Gaussian kernel with the standard deviation of , which is formulated as follows:

(6)

where indicates the radius of the local neighborhood centered at the point . By considering the reconstruction problem in a local neighborhood, the performance is more robust to poor mask annotations. As shown in Fig. 4, some pixels in the foreground are wrongly labeled as background, and the reconstruction accuracy would be seriously affected if we reconstructed the foreground mask by only measuring the point-to-point difference. In our method, we measure the point-to-point difference by jointly considering its neighborhood information, and therefore the foreground mask can be properly reconstructed as long as most of the pixels in a local neighborhood are correctly annotated.
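Since the symbols of Eq. (5) and Eq. (6) were also lost, one possible instantiation consistent with the description above is sketched below, with $\hat{M}$ the reconstructed mask, $M$ the ground-truth mask, $\mathcal{N}_r(p)$ the neighborhood of radius $r$ centered at pixel $p$, and $w_\sigma$ a truncated Gaussian weight; the exact form used by the authors may differ:

$$L_{mask} = \sum_{p} \Big( \hat{M}(p) - \sum_{q \in \mathcal{N}_r(p)} w_\sigma(p, q)\, M(q) \Big)^2, \qquad w_\sigma(p, q) = \frac{\exp\!\big(-\|p - q\|_2^2 / 2\sigma^2\big)}{\sum_{q' \in \mathcal{N}_r(p)} \exp\!\big(-\|p - q'\|_2^2 / 2\sigma^2\big)}.$$

In this form, each reconstructed pixel is regressed toward a locally smoothed version of the ground-truth mask, so an isolated mislabeled pixel contributes little to the loss.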

Parameter Regularization Term. The goal of our parameter regularization term is to smooth the parameters of the entire neural network, which is formulated as follows:

(7)

where indicates the Frobenius norm, and denotes the Euclidean norm.
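Given that description, Eq. (7) presumably takes the standard form below (a sketch; the layer-wise notation follows the one introduced for Eq. (1) above):

$$L_{para} = \sum_{l=1}^{n} \Big( \|W_l\|_F^2 + \|b_l\|_2^2 \Big).$$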

Fig. 4: Illustration of the binary mask reconstruction in a local neighborhood. In practice, some wrongly annotated foreground pixels can be properly rectified by considering the reconstruction in a local neighborhood.

IV Optimization

We apply the momentum method to optimize the direction control weights and the stochastic gradient descent algorithm to optimize the deep parameters, which are introduced in the following paragraphs.

The weight parameters and can be adaptively updated in the training process using the momentum method, so as to jointly minimize the intra-class distance and maximize the inter-class distance in each triplet unit. In order to simplify this problem, we define and , and therefore the two parameters can be optimized by only updating in each iteration. The partial derivative of the symmetric triplet loss function with respect to can be formulated as follows:

(8)

where , and is formulated as follows:

(9)

Then, can be optimized as follows:

(10)

where is the weight updating rate. It can be clearly seen that when , namely , then will be decreased while will be increased, and vice versa. As a result, the strength of the back-propagation to each sample in the same triplet unit is adaptively tuned, so that the anchor and the positive sample are clustered together, while the negative sample is pushed far away from the line spanned by the anchor and the positive.
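As a concrete illustration of this update, one common way to couple the two weights through a single scalar and update it with momentum is sketched below; the parameterization and the symbols ($\lambda$, $v$, $\mu$, $\eta_\lambda$) are assumptions for illustration, not necessarily the authors' exact scheme:

$$\lambda_1 = \tfrac{1}{2} + \lambda, \qquad \lambda_2 = \tfrac{1}{2} - \lambda, \qquad v \leftarrow \mu\, v - \eta_\lambda \frac{\partial L_{rank}}{\partial \lambda}, \qquad \lambda \leftarrow \lambda + v,$$

where $\eta_\lambda$ is the weight updating rate and $\mu$ the momentum coefficient. Under this coupling, increasing $\lambda$ raises $\lambda_1$ and lowers $\lambda_2$ by the same amount, so only one scalar needs to be updated per iteration, as stated above.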

In order to apply the stochastic gradient descent algorithm to optimize the deep parameters, we compute the partial derivative of the objective function as follows:

(11)

where the first term represents the gradient of symmetric triplet loss term, the second term denotes the gradient of local regression loss term, and the third term indicates the gradient of parameter regularization term.

By the definition of in Eq. (8), the gradient of our symmetric triplet loss term can be formulated as follows:

(12)

where is formulated as follows:

(13)

According to the definition of our local regression loss term in Eq. (5), the gradient can be formulated as follows:

(14)

It is clear that the gradients of the samples can be easily calculated given the values of , and , in each mini-batch, which are easily obtained by running the forward and backward propagation in the training process. As the algorithm needs to back-propagate the gradients to learn a foreground attentive feature representation, we call it the foreground attentive gradient descent algorithm. Algorithm 1 summarizes the overall training procedure.

  Input:      Training data , learning rate , maximum iterative number , weight parameters , kernel parameters , margin parameter , initial weights of and and updating rate .
  Output:     The network parameters .
  repeat
     1. Extract the features of and in each triplet unit by the forward propagation.
     repeat
        a) Compute the gradient of according to Eq. (9);
        b) Update weights and according to Eq. (10);
        c) Compute the gradients of and according to Eq. (12) and Eq. (14);
        d) Update the gradients of according to Eq. (11);
     until all the triplet inputs in each mini-batch have been traversed;
     2. Update and .
  until 
Algorithm 1 Foreground Attentive Gradient Descent.

V Experiments

Methods Top 1 Top 5 Top10 Top15 Top20
KISSME [41] 22.9 48.7 62.2 72.4 78.1
LF [23] 33.4 45.5 69.9 76.5 81.0
ME [42] 53.3 76.8 86.0 89.4 92.8
kLFDA [4] 54.0 77.7 85.9 90.0 92.4
SCSP [43] 57.3 78.9 85.0 89.5 91.5
JSTL [28] 56.0
Spindle [44] 62.1 83.4 90.5 95.7
P2S [45] 71.2 90.5 95.2 96.9 97.6
SPL [20] 72.2 90.7 95.3 96.8 97.5
WARCA- [46] 51.9 75.6
Our method (FANN) 84.5 95.1 98.4 100 100
TABLE I: The matching rates(%) on the 3DPeS dataset.
Methods Top 1 Top 5 Top10 Top15 Top20
LADF [7] 29.9 64.7 79.0 86.7 91.3
RPLM [27] 27.3 55.3 69.0 77.1 82.7
Polymap [26] 36.8 70.4 83.7 88.7 91.7
Triplet [12] 40.5 60.8 70.4 78.4 84.4
TCP [47] 47.8 74.7 84.8 89.2 91.1
LNDS [48] 51.2 82.1 90.5 95.9
Quadruplet [49] 49.1 73.1 81.9
Spindle [44] 53.8 74.1 83.2 92.1
SSM [50] 53.7 91.5 96.1
TMA [51] 48.2 87.7 95.5
Our FANN 62.5 84.9 92.7 94.9 96.5
TABLE II: The matching rates(%) on the VIPeR dataset.

V-A Datasets and Settings

Benchmark datasets. We evaluate our method on five benchmark datasets, namely the 3DPeS [52], VIPeR [5], CUHK01 [53], CUHK03 [54] and Market1501 [55], which are briefly introduced in the following paragraphs. (The 3DPeS dataset provides the foreground masks; the foreground masks of images in the other datasets are obtained by using the algorithm of [56], available at http://www.robots.ox.ac.uk/~szheng/CRFasRNN.html.) The 3DPeS dataset contains 1011 images of 192 persons captured from 8 outdoor cameras with significantly different viewpoints, and each person has 2 to 26 images. The VIPeR dataset contains 632 person images captured by two cameras in an outdoor environment, and each person has only one image in each camera view. The CUHK01 dataset contains 971 persons captured from two camera views in a campus environment, and there are two images for each person under every camera view. The CUHK03 dataset contains 14097 images of 1467 persons, captured from six cameras in a campus environment, and each person appears in only two camera views. The Market1501 dataset contains 32668 images of 1501 persons in a campus environment, in which each person is captured by at most six and at least two cameras.

Parameter settings. The parameters are taken as follows: the weights are initialized from two zero-mean Gaussian distributions with the standard deviations of to , and the bias terms are set as . The learning rate , the margin parameter , the kernel parameters , the weight parameter , the initial adaptive weights and , and the weight updating rate . If not specified, we use the same parameters in all the experiments.

Methods CUHK01 (p=100) CUHK01 (p=486) CUHK03 (p=100) CUHK03 (p=700)
Top 1 Top 5 Top10 Top 1 Top 5 Top10 Top 1 Top 5 Top10 Top 1 Top 5 mAP
ITML [9] 17.1 42.3 55.1 16.0 35.2 45.6 5.5 18.9 30.0
eSDC [57] 22.8 43.9 57.7 19.7 32.7 40.3 8.8 24.1 38.3
LOMO+XQDA [58] 77.6 94.1 97.5 63.2 83.9 90.0 52.0 82.2 92.1 14.8 13.6
IDLA [11] 65.0 89.5 93.0 47.5 71.5 80.0 54.7 86.5 94.0
kLFDA [4] 42.7 69.0 79.6 32.7 59.0 69.6 48.2 59.3 66.4
SVDNet [59] 81.8 95.2 97.2
PAN [60] 36.9 56.9 35.0
Quadruplet [49] 79.0 96.0 97.0 62.6 83.0 88.8 74.5 96.6 99.0
MLFN [61] 82.8 54.7 49.2
DPFL [62] 86.7 43.0 40.5
Our FANN 97.2 99.6 100 73.5 93.4 98.9 87.1 98.6 100 69.3 85.2 67.2
TABLE III: The matching rates(%) on the CUHK01 and CUHK03 datasets.

Evaluation protocol. Our experiments use the cumulative matching characteristic (CMC) curve to measure the performance, which estimates the probability of finding the correct match within the top ranks. For the 3DPeS and VIPeR datasets, we follow the single-shot protocol in [12], in which 96 persons from the 3DPeS dataset and 316 persons from the VIPeR dataset are randomly chosen to train the deep neural network, and the others are used to evaluate the performance. Considering that these two datasets are too small to train a large deep neural network, we take images from the CUHK01 dataset to pre-train a deep model, so as to initialize the deep neural network on the 3DPeS and VIPeR datasets. For the CUHK01 and CUHK03 datasets, we follow two data partition protocols to split the datasets into training sets and testing sets. Specifically, 100/486 persons of the CUHK01 dataset and 100/700 persons of the CUHK03 dataset are used to evaluate the performance, and the remaining persons are used to train the deep neural network. For the Market1501 dataset, we use the provided data partition to prepare the training and testing samples. Besides, the mAP is also used to evaluate the performance on the CUHK03 and Market1501 datasets. To obtain statistically reliable results, we repeat the testing 10 times and report the average result.
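For reference, a generic single-shot CMC computation of the kind used here can be sketched as follows; this is an illustrative implementation of the protocol, not the authors' evaluation code, and the function and variable names are hypothetical.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids, max_rank=20):
    """Single-shot CMC: `dist` is a (num_probe, num_gallery) distance matrix,
    `probe_ids` and `gallery_ids` are numpy arrays of person identities.
    Returns the matching rate at ranks 1..max_rank."""
    hits = np.zeros(max_rank)
    num_valid = 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                    # gallery sorted by ascending distance
        match_ranks = np.where(gallery_ids[order] == probe_ids[i])[0]
        if match_ranks.size == 0:
            continue                                   # no true match in the gallery
        num_valid += 1
        first_hit = match_ranks[0]
        if first_hit < max_rank:
            hits[first_hit:] += 1                      # a hit at rank r counts for all ranks >= r
    return hits / max(num_valid, 1)

# Example: top-1 and top-5 rates from a random distance matrix.
# cmc = cmc_curve(np.random.rand(10, 50), np.arange(10), np.random.randint(0, 10, 50))
# print(cmc[0], cmc[4])
```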

V-B Results

In the following paragraphs, we compare our method with recent state-of-the-art approaches on the five benchmark datasets, including KISSME [41], LF [23], ME [42], kLFDA [4], SCSP [43], JSTL [28], Spindle [44], P2S [45], SPL [20], WARCA- [46], LADF [7], RPLM [27], Polymap [26], Triplet [12], TCP [47], LNDS [48], Quadruplet [49], SSM [50], TMA [51], ITML [9], eSDC [57], LOMO+XQDA [58], IDLA [11], SVDNet [59], PAN [60], MLFN [61], DPFL [62], JLML [63], PDC [64], Histogram [65], DML [66], TriNet [67] and DLPA [68]. To make the comparison fair, we directly copy the reported results of these methods on the corresponding datasets, and evaluate the performance of our method using the same experimental setup as the compared methods.

Methods Single-Query Multi-Query
Top 1 mAP Top 1 mAP
JLML [63] 83.9 64.4 89.7 74.5
PDC [64] 84.1 63.4
SVDNet [59] 82.3 62.1
LDNS [48] 61.0 35.6 71.5 46.0
SSM [50] 82.2 68.8 88.1 76.1
Histogram [65] 59.4
DPFL [62] 88.6 72.6 92.2 80.4
DML [66] 87.7 68.8 91.7 77.1
TriNet [67] 84.9 69.1 90.5 76.4
DLPA [68] 81.0 63.4
Our Method (FANN) 90.3 76.1 91.6 78.9
TABLE IV: The matching rates(%) on the Market1501 dataset.

For clarity, we highlight the best performance on each dataset in bold. The detailed results are shown in Table I to Table IV, from which we can see that our FANN achieves the best performance on nearly all of the five benchmark datasets. Specifically, our FANN outperforms the previously best performing SPL [20] method by on the 3DPeS dataset in the Top 1 accuracy. Besides, our FANN also outperforms the previously best performing Spindle [44] method by on the VIPeR dataset in the Top 1 accuracy. Turning to the CUHK01 and CUHK03 datasets, our FANN outperforms the previously best performing Quadruplet [49] and DPFL [62] methods by and in the Top 1 accuracy, respectively, when identities are randomly chosen to evaluate the performance. When identities from the CUHK01 dataset are used to evaluate the performance, our FANN outperforms the Quadruplet [49] method by in the Top 1 accuracy. In addition, our FANN outperforms the DPFL [62] method by and in the Top 1 accuracy and mAP, respectively, when identities are used to evaluate the performance on the CUHK03 dataset. A similar conclusion can be drawn on the Market1501 dataset, on which our FANN outperforms the previously best performing DPFL [62] method by and in the Top 1 accuracy and mAP under the single-query evaluation, while performing slightly worse under the multi-query evaluation.

Methods 3DPeS VIPeR CUHK01 (p=100) CUHK01 (p=486) CUHK03 (p=100) CUHK03 (p=700) Market1501 (single) Market1501 (multi)
Top 1 Top 5 Top 1 Top 5 Top 1 Top 5 Top 1 Top 5 Top 1 Top 5 Top 1 mAP Top 1 mAP Top 1 mAP
None 71.9 90.8 47.9 76.9 77.8 88.7 58.3 82.3 69.1 89.5 49.6 47.2 63.2 40.1 75.4 49.1
S 77.6 92.6 54.7 81.2 89.6 94.8 65.9 87.8 76.1 92.9 60.1 58.3 73.9 46.1 86.1 75.9
L 80.3 93.2 56.5 82.9 91.7 96.1 67.2 88.1 79.1 94.2 63.5 61.7 80.3 59.7 87.5 77.1
F 76.5 91.8 55.1 81.8 88.3 93.9 66.1 87.9 77.2 93.0 60.5 58.7 73.4 46.5 86.5 76.1
S+F 82.2 94.8 58.3 84.3 93.4 97.3 69.2 88.1 82.9 96.3 65.4 63.7 82.4 61.1 88.6 76.9
L+F 78.3 93.1 57.3 82.8 90.6 95.2 67.4 86.5 75.8 93.6 60.8 58.9 85.4 63.1 89.7 77.8
S+L+F 84.5 95.1 62.5 84.9 97.2 99.6 73.5 93.4 87.1 98.1 69.3 67.2 90.3 76.1 91.6 78.9
TABLE V: The matching rates(%) improved by each of our contributions on the five benchmark datasets.
Fig. 5: Illustration of the feature maps learned by our neural network. From left to right: the RGB input, the ground truth mask, the mask obtained by the reconstruction task alone, the feature maps learned by the ranking and reconstruction tasks individually, and the mask and feature map obtained in the multi-task setting.

In general, our FANN achieves promising results on all five benchmark datasets, which is mainly due to the three contributions of this paper: 1) the foreground attentive neural network, which helps put the attention on the foreground of the input images; 2) the symmetric triplet loss function, which can jointly minimize the intra-class distance and maximize the inter-class distance in each triplet unit; and 3) the local regression loss function, which can reconstruct the mask well even from an ill-segmented ground truth. In the following paragraphs, we analyze in detail how much each component contributes to the final performance.

V-C Analysis

Firstly, we will evaluate the performance of each contribution in our paper. Secondly, the effectiveness of our FANN in background suppression will be illustrated. Next, we will show the robustness of our method to parameter setting. Finally, some ranking examples will be presented and discussed.

Performance of each contribution. In order to show how much each contribution improves the person Re-ID results, we carefully design seven different experiments on each dataset, as shown in Table V. In particular, None denotes that we remove the decoder network and use the conventional triplet loss function to train the remaining network, S means that we remove the decoder network and use our symmetric triplet loss function to train the remaining network, L represents that we use the conventional triplet loss function and the local regression loss function to train the whole network, F indicates that we use the conventional triplet loss function and the Euclidean loss function to train the whole network, S+F denotes that we use our symmetric triplet loss function and the Euclidean loss function to train the whole network, L+F means that we use the conventional triplet loss function and the local regression loss function to train the whole network, and S+L+F represents our full FANN method.

From the results we can see that S+L+F significantly outperforms the other six settings on all five benchmark datasets, which demonstrates the effectiveness of our symmetric triplet loss function, local regression loss function and FANN in feature learning. Specifically, we take the results on the VIPeR dataset to evaluate the improvement brought by each contribution. Comparing the performances between None and S, between F and S+F, and between L+F and S+L+F, we find that our symmetric triplet loss function improves the Top 1 accuracy by , , in the three cases, respectively. For the improvement of our local regression loss function, we compare the results between None and L, between F and L+F, and between S+F and S+L+F, which show that our local regression loss function improves the Top 1 accuracy by , , in the three cases, respectively. Finally, we evaluate the improvement brought by our FANN by comparing the results between None and F, between L and L+F, and between S and S+F, which show that our FANN improves the Top 1 accuracy by , , in the three cases, respectively. Similar conclusions can be drawn on the other datasets in terms of both the Top 1 accuracy and mAP.

Effectiveness of background suppression. In this paragraph, we explain the underlying reason why our FANN focuses its attention on the foreground of the input images. In our neural network, we apply the auto-encoder mechanism to drive the attention: the mask representation is reconstructed at the output of the decoder network, so the encoder is naturally regularized by the decoder network in the training process. As a result, the encoder network pays more attention to the foreground of the input images, which suppresses the noise in the background. The resulting feature maps of the encoder network are fed into the subsequent subnetworks for discriminative feature learning. By putting mask regression and feature learning into a multi-task framework, a discriminative feature representation can be learned to improve the final person Re-ID performance.

Fig. 6: The comparison of different parameter settings on the final person Re-ID performance on the five benchmark datasets. Specifically, the first row to the fifth row show the detailed results on the 3DPeS, VIPeR, CUHK01, CUHK03 and Market1501 datasets, respectively.

Some representative feature maps of two input images are shown in Fig. 5, in which the two images represent the same person under two disjoint camera views. Specifically, the second and third columns show the ground truth masks and the binary masks obtained by only running the reconstruction task, which have well shown the effectiveness of our local regression loss function. The fourth and fifth columns illustrate the feature maps of the second convolutional layer in the ranking and reconstruction tasks, which can explain that the reconstruction task focus more attention on the foreground than the ranking task. The sixth and seventh columns represent the feature maps of the second convolutional layer and the reconstructed mask in the multi-task, which show that jointly running the two tasks in a joint framework is more beneficial to learn foreground attentive feature representations for the person Re-ID task.

Fig. 7: Some ranking examples of our method on the five benchmark datasets, including both successful cases and failure cases.

Robustness of parameter setting. Although there are several hyper-parameters in our algorithm, we argue that the final person Re-ID performance is not very sensitive to them. In the following paragraphs, we evaluate our method with varying parameters: the margin parameter , the kernel parameters and , the weight parameter , and the initial adaptive weights and . Specifically, we change one parameter and keep the others fixed in each experiment, so as to illustrate the insensitivity of our method to the parameter settings.

Firstly, we evaluate the influence of the margin parameter , the kernel parameters and , and the weight parameter on the final person Re-ID performance. The detailed comparison results are shown in Fig. 6, in which we report the Top 1 accuracy under different parameter settings. From the results, we can draw three conclusions: 1) For the margin parameter, a small value leads to little discriminative power between positive and negative pairs, while a large value makes the model pay too much attention to the hard training samples. Neither setting is beneficial to the generalization ability of the learned model on the testing data. Moreover, there is a relatively large interval around the optimal margin within which the performance on the testing data remains approximately the same. 2) For the kernel parameters, a small makes the reconstruction task sensitive to the isolated regions in the ground truth mask, and a large blurs the edges when reconstructing the foreground mask. A small makes the reconstruction task sensitive to the difference between the reconstructed mask and the ground truth mask, and a large makes the algorithm pay less attention to the foreground of the input images. We argue that values of and that are too small or too large are not beneficial to the generalization ability of the learned model; however, our algorithm allows a large variation of the two parameters around their optimal values. 3) For the weight parameter, a small value also makes our algorithm pay less attention to the foreground of the input images, while a large value makes it pay too much attention to the foreground. Therefore, a suitable weight, lying in the local neighborhood of the optimal value, should be chosen to maintain the person Re-ID performance.

Datasets
Top 1 Top 5 Top 1 Top 5 Top 1 Top 5
3DPeS 77.6 92.6 84.5 95.1 83.7 95.4
VIPeR 54.7 81.2 62.5 84.9 62.1 82.7
CUHK01 89.6 94.8 97.2 99.6 95.9 100
CUHK03 76.1 92.5 87.1 98.6 87.3 97.9
Market1501 85.4 89.6 90.3 93.4 89.6 93.2
TABLE VI: The matching rate (%) on five benchmark datasets in term of and with using the L2 normalization.
Datasets
Top 1 Top 5 Top 1 Top 5 Top 1 Top 5
3DPeS 75.4 90.3 83.8 94.2 80.6 93.1
VIPeR 54.9 88.6 63.1 85.2 60.1 80.3
CUHK01 88.9 97.5 96.9 99.3 92.8 98.7
CUHK03 79.3 95.4 85.4 97.5 82.9 96.1
Market1501 78.5 83.1 87.2 91.4 85.1 89.3
TABLE VII: The matching rate (%) on five benchmark datasets in term of and without using the L2 normalization.

The main difference between our symmetric triplet loss function and the conventional triplet loss function is that it introduces another negative distance, between the two samples from the same camera view, to regularize the gradient back-propagation, as shown in Fig. 3. As a result, the intra-class distance can be minimized and the inter-class distance can be maximized in each triplet unit, simultaneously. In our formulation, two negative distances are weighted to represent the inter-class distance, and the two weight parameters are adaptively updated in the feature learning process, so as to deduce the symmetric gradient back-propagation. Because we apply L2 normalization to the resulting feature vectors, the final person Re-ID performance is very stable under different initializations of the weight parameters and . In Table VI, we give some detailed analysis results on the five benchmark datasets. From the results, we can conclude that the Top 1 accuracies of our method on the five benchmark datasets are robust to different weight initializations. There are two underlying reasons: 1) The weight updating algorithm in Eq. (8) to Eq. (10) measures the difference between and , so as to keep in the optimization. Therefore, the weights can be adaptively updated to keep the gradient back-propagation symmetric. 2) For the triplet input, the L2 normalization is applied to keep at the output layer, therefore the difference between and is bounded in . As a result, the L2 normalization makes our algorithm more numerically stable. For comparison, we also evaluate the performance of our method without L2 normalization on the five benchmark datasets, as shown in Table VII. (For a fair comparison, we set the when conducting experiments without L2 normalization on the five benchmark datasets.) From the results, we can see that 1) the best performances in the two cases are similar on the five benchmark datasets, which indicates that both the cosine space and the Euclidean space are suitable for similarity comparison; and 2) because the distances in the cosine space are bounded, the performances on the five benchmark datasets are more stable under different initializations.
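The boundedness argument can be made explicit: for L2-normalized feature vectors, the squared Euclidean distance used in the loss satisfies

$$0 \le \|\phi(x) - \phi(y)\|_2^2 = 2 - 2\,\phi(x)^{\top}\phi(y) \le 4,$$

so every pairwise distance entering the symmetric triplet loss lies in a fixed interval, which keeps the loss values and gradients numerically well behaved regardless of how the adaptive weights are initialized. (The notation $\phi$ follows the illustrative loss sketch in Section III-B.)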

Some examples of ranking. Finally, we illustrate some real ranking examples of our method on the five benchmark datasets, as shown in Fig. 7, including both successful cases and failure cases. Specifically, the images in green boxes are the probes used to find the matched references in the gallery. The image in a red box is the true match to the corresponding probe, and a smaller rank indicates better performance. In the successful cases, the matched candidates are all found at the first rank among the various candidates in the gallery. These results indicate that our method learns a discriminative feature representation that can overcome the cross-view appearance variations caused by mutual occlusion and background clutter in person Re-ID. However, we also notice a fraction of failure cases, in which the matched references cannot be ranked first among very similar candidates. The reasons for these failures are complex, and we will continue to study this problem in the future.

VI Conclusion

In this paper, we propose a novel foreground attentive deep neural network to learn a discriminative feature representation from the foreground of the input images for person re-identification. Firstly, a foreground attentive neural network is constructed to weaken the side effects of the background, in which an encoder-decoder subnetwork is built to guide the neural network to learn feature representations directly from the foreground persons. Secondly, a symmetric triplet loss function is introduced to supervise the feature learning process, which can jointly minimize the intra-class distance and maximize the inter-class distance in each triplet unit. Thirdly, a novel local regression loss function is designed to deal with isolated regions in the ground truth mask by considering local neighborhood information. Extensive experiments on the 3DPeS, VIPeR, CUHK01, CUHK03 and Market1501 datasets substantiate the superiority of the proposed network, as well as the objective function, as compared with the state-of-the-art approaches.

References

  • [1] B. T. Morris and M. M. Trivedi, “A survey of vision-based trajectory learning and analysis for surveillance,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 18, no. 8, pp. 1114–1127, 2008.
  • [2] S. Zhang, J. Wang, Z. Wang, Y. Gong, and Y. Liu, “Multi-target tracking by learning local-to-global trajectory models,” Pattern Recognition, vol. 48, no. 2, pp. 580–590, 2015.
  • [3] W. Hu, T. Tan, L. Wang, and S. Maybank, “A survey on visual surveillance of object motion and behaviors,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 34, no. 3, pp. 334–352, 2004.
  • [4] F. Xiong, M. Gou, O. Camps, and M. Sznaier, “Person re-identification using kernel-based metric learning methods,” in European Conference on Computer Vision.    Springer, 2014, pp. 1–16.
  • [5] D. Gray and H. Tao, “Viewpoint invariant pedestrian recognition with an ensemble of localized features,” in European conference on computer vision.    Springer, 2008, pp. 262–275.
  • [6] R. Zhao, W. Ouyang, and X. Wang, “Learning mid-level filters for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 144–151.
  • [7] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith, “Learning locally-adaptive decision functions for person verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3610–3617.
  • [8] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Advances in neural information processing systems, 2005, pp. 1473–1480.
  • [9] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proceedings of the 24th international conference on Machine learning.    ACM, 2007, pp. 209–216.
  • [10] S. Zhou, J. Wang, Q. Hou, and Y. Gong, “Deep ranking model for person re-identification with pairwise similarity comparison,” in Pacific Rim Conference on Multimedia.    Springer, 2016, pp. 84–94.
  • [11] E. Ahmed, M. Jones, and T. K. Marks, “An improved deep learning architecture for person re-identification,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [12] S. Ding, L. Lin, G. Wang, and H. Chao, “Deep feature learning with relative distance comparison for person re-identification,” Pattern Recognition, vol. 48, no. 10, pp. 2993–3003, 2015.
  • [13] S. Zhou, J. Wang, R. Shi, Q. Hou, Y. Gong, and N. Zheng, “Large margin learning in set-to-set similarity comparison for person reidentification,” IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 593–604, 2018.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [17] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.
  • [18] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Computer vision and pattern recognition, 2006 IEEE computer society conference on, vol. 2.    IEEE, 2006, pp. 1735–1742.
  • [19] L. Lin, G. Wang, W. Zuo, X. Feng, and L. Zhang, “Cross-domain visual matching via generalized similarity measure and feature learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1089–1102, 2017.
  • [20] S. Zhou, J. Wang, D. Meng, X. Xin, Y. Li, Y. Gong, and N. Zheng, “Deep self-paced learning for person re-identification,” Pattern Recognition, 2017.
  • [21] W.-S. Zheng, S. Gong, and T. Xiang, “Reidentification by relative distance comparison,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 3, pp. 653–668, 2013.
  • [22] A. Mignon and F. Jurie, “Pcca: A new approach for distance learning from sparse pairwise constraints,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.    IEEE, 2012, pp. 2666–2672.
  • [23] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian, “Local fisher discriminant analysis for pedestrian re-identification,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on.    IEEE, 2013, pp. 3318–3325.
  • [24] H. V. Nguyen and L. Bai, “Cosine similarity metric learning for face verification,” in Computer Vision–ACCV 2010.    Springer, 2011, pp. 709–720.

Sanping Zhou received the M.E. degree from Northwestern Polytechnical University, Xi’an, China, in 2015. He is currently pursuing the Ph.D. degree with the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University. His research interests include machine learning, deep learning and computer vision, with a focus on medical image segmentation, person re-identification, image retrieval, image classification and visual tracking.

Jinjun Wang received the B.E. and M.E. degrees from the Huazhong University of Science and Technology, China, in 2000 and 2003, respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2006. From 2006 to 2009, he was a Research Scientist with NEC Laboratories America, Inc., and from 2010 to 2013, he was a Senior Research Scientist with Epson Research and Development, Inc. He is currently a Professor with Xi’an Jiaotong University. His research interests include pattern classification, image/video enhancement and editing, content-based image/video annotation and retrieval, and semantic event detection.

Deyu Meng received the B.S., M.S., and Ph.D. degrees from Xi’an Jiaotong University, Xi’an, China, in 2001, 2004, and 2008, respectively. From 2012 to 2014, he took a two-year sabbatical leave at Carnegie Mellon University. He is currently a Professor with the Institute for Information and System Sciences, School of Mathematics and Statistics, Xi’an Jiaotong University. His current research interests include self-paced learning, noise modeling, and tensor sparsity.

Yudong Liang received the B.S. and Ph.D. degrees from Xi’an Jiaotong University, Xi’an, China, in 2010 and 2017, respectively. He is currently an Assistant Professor with the School of Computer and Information Technology, Shanxi University. His research interests include machine learning and computer vision, with a focus on image super-resolution, image quality assessment and deep learning.

Yihong Gong received the B.S., M.S., and Ph.D. degrees in Electrical Engineering from The University of Tokyo, Japan, in 1987, 1989, and 1992, respectively. In 1992, he joined Nanyang Technological University, Singapore, as an Assistant Professor with the School of Electrical and Electronic Engineering. From 1996 to 1998, he was a Project Scientist with the Robotics Institute, Carnegie Mellon University, USA. From 1999 to 2012, he was with the Silicon Valley branch of NEC Labs America, serving as a Group Leader, Department Head, and Branch Manager. In 2012, he joined Xi’an Jiaotong University, China, as a Distinguished Professor. His research interests include image and video analysis, multimedia database systems, and machine learning.

Nanning Zheng (SM’93-F’06) graduated from the Department of Electrical Engineering, Xi’an Jiaotong University, Xi’an, China, in 1975, and received the M.S. degree in information and control engineering from Xi’an Jiaotong University in 1981 and the Ph.D. degree in electrical engineering from Keio University, Yokohama, Japan, in 1985. He joined Xi’an Jiaotong University in 1975, and he is currently a Professor and the Director of the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University. His research interests include computer vision, pattern recognition and image processing, and hardware implementation of intelligent systems. Dr. Zheng became a member of the Chinese Academy of Engineering in 1999, and he is the Chinese Representative on the Governing Board of the International Association for Pattern Recognition.
