Deep Spatial Feature Reconstruction for Partial Person Re-identification: Alignment-Free Approach
Partial person re-identification (re-id) is a challenging problem, where only partial observations (images) of persons are available for matching. However, few studies have offered a flexible solution for identifying an arbitrary patch of a person image. In this paper, we propose a fast and accurate matching method to address this problem. The proposed method leverages a Fully Convolutional Network (FCN) to generate spatial feature maps whose size corresponds to the input size, such that pixel-level features are consistent. To match a pair of person images of different sizes, a novel method called Deep Spatial feature Reconstruction (DSR) is further developed to avoid explicit alignment. Specifically, DSR exploits the reconstruction error from popular dictionary learning models to calculate the similarity between different spatial feature maps. In that way, we expect that the proposed FCN can decrease the similarity of coupled images from different persons and increase that of coupled images from the same person. Experimental results on two partial person datasets demonstrate the efficiency and effectiveness of the proposed method in comparison with several state-of-the-art partial person re-id approaches. Additionally, it achieves competitive results on the benchmark person dataset Market1501, with a Rank-1 accuracy of 83.58%.
1 Introduction
Although person re-identification (re-id) has witnessed great progress in recent years, existing approaches always assume that each image covers the full body of one person. However, this assumption of full and frontal images is easily violated in real-world applications, where we merely have access to some partial observations of each person (dubbed partial person images) for retrieval. For instance, as shown in Fig. 1, partial person images often occur when a person is occluded by moving obstacles (e.g., cars, other persons) or static obstacles (e.g., trees, barriers). Hence, partial person re-id has attracted significant research attention with the increasing demand for identification from CCTV cameras and video surveillance. However, few studies have focused on how to identify an arbitrary patch of a person image, which leaves partial person re-id a challenging problem that current approaches do not solve. From this perspective, studying the partial person re-id problem is necessary and crucial for both academic research and practical retrieval applications.
A majority of existing person re-id approaches fail to identify a person when only severely partial observations are provided. Concretely, to match an arbitrary patch of a person, some researchers resort to re-scaling the patch to a fixed-size image. However, the performance degrades significantly due to the undesired deformation (see Fig. 2(a)). Sliding Window Matching (SWM) [33] introduces a possible solution for partial re-id by setting up a sliding window of the same size as the probe image and using it to search for the most similar region within each gallery image (see Fig. 2(b)). However, SWM does not work well when the probe person is larger than the gallery person. Some person re-id approaches further consider a part-based model, which offers an alternative solution for partial person re-id (see Fig. 2(c)). Nevertheless, their computational costs are high and they require strict person alignment beforehand. Apart from these limitations, part-based models and SWM repeatedly extract sub-region features without sharing computation, which results in unsatisfactory computational efficiency.
In this paper, we propose a novel and fast partial person re-id framework that matches a pair of person images of different sizes (see Fig. 2(d)). In the proposed framework, a Fully Convolutional Network (FCN) is utilized to generate correspondingly-sized spatial feature maps, which can be considered pixel-level feature matrices. Motivated by the remarkable successes achieved by dictionary learning in face recognition [13, 23, 28], we develop an end-to-end model named Deep Spatial feature Reconstruction (DSR), which expects that each pixel in the probe spatial feature map can be sparsely reconstructed on the basis of the entire gallery spatial feature map. In this manner, the model is independent of the image size, which naturally skips the time-consuming alignment step. Specifically, we design an objective function for the FCN which encourages the reconstruction error of spatial feature maps extracted from the same person to be small, while spatial feature maps from different identities cannot reconstruct each other well (i.e., larger reconstruction error). The major contributions of our work are summarized as follows:
We propose a new approach named Deep Spatial feature Reconstruction (DSR) for partial person re-id, which is alignment-free and flexible to arbitrary-sized person images.
We first integrate sparse reconstruction learning and deep learning in a unified framework, and train an end-to-end deep model through minimizing the reconstruction error for coupled person images from the same identity and maximizing that of different identities.
Besides, we further replace the pixel-level reconstruction with a block-level one, and develop a multi-scale (different block sizes) fusion model to enhance the performance.
The remainder of this paper is organized as follows. In Sec. 2, we review the related work about FCN, Sparse Representation Classification (SRC), and existing partial person re-id algorithms. Sec. 3 introduces the technical details of deep spatial feature reconstruction. Sec. 4 shows the experimental results and analyzes the performance in computational efficiency and accuracy. Finally, we conclude our work in Sec. 5.
2 Related Work
Since the proposed model is a deep feature learning method for partial person re-id based on Fully Convolutional Network and Sparse Representation Classification, we briefly review some related works in this section.
Fully Convolutional Network. An FCN contains only convolutional layers and pooling layers, and has been applied to spatially dense tasks including semantic segmentation [1, 2, 6, 17, 20] and object detection [5, 15, 18, 19]. Shelhamer et al. [20] introduced an FCN that is trained end-to-end, pixels-to-pixels, on semantic segmentation, and outperformed the state-of-the-art models without additional machinery. Liu et al. [15] proposed the single shot multi-box detector (SSD), based on FCN, which can detect objects quickly and accurately. Besides, FCN has also been exploited in visual recognition. He et al. [7] introduced a spatial pyramid pooling (SPP) layer imposed on FCN to produce fixed-length representations from arbitrary-size inputs.
Sparse Representation Classification. Wright et al. [23] introduced a well-known method, SRC, for face recognition, achieving robust performance under occlusions and illumination variations. Further studies [4, 28, 25, 24] on face recognition based on SRC have also been conducted. SRC has also been applied to signal classification [9], visual tracking [16], and visual classification [27].
Partial Person Re-identification. Partial person re-id has become an emerging problem in video surveillance. However, few methods consider how to match an arbitrary patch of a person image. To address this problem, many methods [3, 6] warp an arbitrary patch of an image to a fixed-size image, and then extract a fixed-length feature vector for matching. However, such processing results in unwanted deformation. Part-based models offer another kind of solution to partial person re-id: a patch-to-patch matching strategy is employed to handle occlusion and cases where the target is partially out of the camera’s view. Zheng et al. [33] proposed a local patch-level matching model called Ambiguity-sensitive Matching Classifier (AMC), based on dictionary learning with explicit patch ambiguity modeling, and introduced a global part-based matching model called Sliding Window Matching (SWM) that can provide complementary spatial layout information. However, the computational cost of AMC+SWM is rather high because it runs the feature extractor many times without sharing computation. A similar occlusion problem also occurs in partial face recognition. Liao et al. [13] proposed an alignment-free approach called multiple keypoints descriptor SRC (MKD-SRC), where multiple affine-invariant keypoints are extracted for facial feature representation and sparse representation-based classification (SRC) [23] is then used for face classification. However, the performance of keypoint-based methods with hand-crafted local descriptors is not quite satisfying. To this end, we propose a fast and accurate method, Deep Spatial feature Reconstruction (DSR), to address partial person images.
3 The Proposed Approach
3.1 Fully Convolutional Network
Deep Convolutional Neural Networks (CNNs), as feature extractors in visual recognition tasks, require a fixed-size input image. However, it is impossible to meet this requirement since partial person images have arbitrary sizes/scales. In fact, the requirement comes from the fully-connected layers, which demand fixed-length vectors as inputs. Convolutional layers operate in a sliding-window manner and generate correspondingly-sized spatial outputs. To handle an arbitrary patch of a person image, we discard all fully-connected layers to obtain a Fully Convolutional Network in which only convolution and pooling layers remain. Therefore, the FCN still retains spatial coordinate information and is able to extract spatial feature maps from arbitrary-size inputs. The proposed FCN is shown in Fig. 3; it contains 13 convolution layers and 5 pooling layers, and the last pooling layer produces the identity feature maps.
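To illustrate why the spatial size of the feature maps follows the input size, the sketch below computes the output size of the last pooling layer. It assumes size-preserving convolutions and pooling with stride 2 that floors odd sizes, which is common for networks with 13 convolution and 5 pooling layers but is not stated in the text; the function name `fcn_output_size` and the input sizes are hypothetical.

```python
def fcn_output_size(height, width, num_pools=5, pool_stride=2):
    """Spatial size of the last pooling output for a height x width input.

    Assumption: convolutions keep the spatial size (e.g. 3x3, stride 1,
    padding 1) and each pooling layer divides it by pool_stride, flooring.
    """
    for _ in range(num_pools):
        height //= pool_stride
        width //= pool_stride
    return height, width


# A holistic image and a crop of its upper half (hypothetical sizes)
# yield feature maps of different spatial sizes.
holistic_size = fcn_output_size(320, 128)
partial_size = fcn_output_size(160, 128)
print(holistic_size)  # (10, 4)
print(partial_size)   # (5, 4)
```

Under these assumptions the partial image simply produces fewer spatial positions, so no resizing of the input is needed before feature extraction.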
3.2 Deep Spatial Feature Reconstruction
In this section, we will introduce how to measure the similarity between a pair of person images of different sizes. Assume that we are given a pair of person images: one is an arbitrary patch of a person image $I$ (partial person), and the other is a holistic person image $J$. Correspondingly-sized spatial feature maps $x = \mathrm{conv}(I, \theta)$ and $y = \mathrm{conv}(J, \theta)$ are then extracted by the FCN, where $\theta$ denotes the parameters of the FCN. $x$ denotes a vectorized $h \times w \times d$ tensor, where $h$, $w$ and $d$ denote the height, the width and the number of channels of $x$, respectively. As shown in Fig. 3, we divide $x$ into $N$ blocks $x_n$, $n = 1, \dots, N$, where $N = h \times w$, and the size of each block is $1 \times 1 \times d$. Denote by
$$X = \{x_1, \dots, x_N\} \in \mathbb{R}^{d \times N} \tag{1}$$
the block set, where $x_n \in \mathbb{R}^{d \times 1}$. Likewise, $y$ is divided into $M$ blocks as
$$Y = \{y_1, \dots, y_M\} \in \mathbb{R}^{d \times M}, \tag{2}$$
then $x_n$ can be represented by a linear combination of $Y$. That is to say, we attempt to search for similar blocks to reconstruct $x_n$. Therefore, we wish to solve for the sparse coefficients $w_n$ of $x_n$ with respect to $Y$, where $w_n \in \mathbb{R}^{M \times 1}$. Since few blocks of $Y$ are expected for reconstructing $x_n$, we constrain $w_n$ using the $\ell_1$-norm. Then, the sparse representation formulation is defined as
$$\min_{w_n} \; \|x_n - Y w_n\|_2^2 + \beta \|w_n\|_1, \tag{3}$$
where $\beta$ (fixed in our experiment) controls the sparsity of the coding vector $w_n$. $\|x_n - Y w_n\|_2$ is used to measure the similarity between $x_n$ and $Y$. For the $N$ blocks in $X$, the matching distance can be defined as
$$\mathrm{dist}(X, Y) = \frac{1}{N} \|X - YW\|_F^2, \tag{4}$$
where $W = \{w_1, \dots, w_N\} \in \mathbb{R}^{M \times N}$ is the sparse reconstruction coefficient matrix. The whole matching procedure is exactly our proposed Deep Spatial feature Reconstruction (DSR). As such, DSR can be used to classify a probe partial person without additional person alignment. The flowchart of our DSR approach is shown in Fig. 4 and the overall DSR approach is outlined in Algorithm 1.
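The matching step can be sketched in NumPy as follows. The sketch assumes that the probe blocks are stacked as the columns of a d×N matrix `X`, the gallery blocks as the columns of a d×M matrix `Y`, and that the distance is the average squared reconstruction error under an ℓ1-regularized least-squares coding. ISTA (iterative soft-thresholding) is swapped in as a simple solver; it is not the feature-sign search used in Sec. 3.3. The value of `beta`, the function names and the toy data are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_codes(X, Y, beta=0.1, n_iter=500):
    """Approximately solve min_W ||X - Y W||_F^2 + beta * ||W||_1 with ISTA.

    X: (d, N) probe blocks, Y: (d, M) gallery blocks; returns W of shape (M, N).
    """
    # Lipschitz constant of the gradient of the quadratic term.
    lipschitz = 2.0 * np.linalg.norm(Y, 2) ** 2
    W = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * Y.T @ (Y @ W - X)
        W = soft_threshold(W - grad / lipschitz, beta / lipschitz)
    return W


def dsr_distance(X, Y, beta=0.1):
    """Average squared error of reconstructing the probe blocks from the gallery blocks."""
    W = sparse_codes(X, Y, beta)
    return np.linalg.norm(X - Y @ W, "fro") ** 2 / X.shape[1]


# Toy data: the probe reuses 5 of the 12 blocks of the "same person" gallery.
rng = np.random.default_rng(0)
Y_same = rng.standard_normal((64, 12))
Y_diff = rng.standard_normal((64, 12))
X = Y_same[:, :5]
d_same = dsr_distance(X, Y_same)
d_diff = dsr_distance(X, Y_diff)
print(d_same < d_diff)
```

Note that `X` and `Y` may have different numbers of columns, which is what allows images of different sizes to be compared without alignment.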
3.3 Fine-tuning on Pre-trained FCN with DSR
We train the FCN with a particular identification signal that classifies each person image into different identities. Concretely, the identification is achieved by connecting the last pooling layer to a softmax cross-entropy loss (see Fig. 5(a)). To further increase the discriminative ability of the deep features extracted by the FCN, fine-tuning with DSR is adopted to update the convolutional layers; the framework is shown in Fig. 5(b).
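The DSR signal encourages the feature maps of the same identity to be similar while feature maps of different identities stay away. DSR can be regarded as a verification signal; the loss function is thus defined as
$$L_{veri}(\theta, W) = \alpha \|X - YW\|_F^2 + \beta \|W\|_1, \tag{5}$$
where $\alpha = 1$ means that the two features are from the same identity and $\alpha = -1$ for different identities.
We employ an alternating optimization algorithm to optimize $W$ and $\theta$ in the objective $L_{veri}$.
Step 1: fix $\theta$, optimize $W$. The aim of this step is to solve for the sparse reconstruction coefficient matrix $W$. To obtain the optimal $W$, we solve for $w_1, \dots, w_N$ separately; hence, equation (3) is further rewritten as
$$\min_{w_n} \; w_n^\top Y^\top Y w_n - 2 x_n^\top Y w_n + \beta \|w_n\|_1. \tag{6}$$
We utilize the feature-sign search algorithm adopted in [10] to solve for an optimal $w_n$.
Step 2: fix $W$, optimize $\theta$. To update the parameters in the FCN, we then calculate the gradients of $L_{veri}$ with respect to $X$ and $Y$:
$$\frac{\partial L_{veri}}{\partial X} = 2\alpha (X - YW), \qquad \frac{\partial L_{veri}}{\partial Y} = -2\alpha (X - YW) W^\top. \tag{7}$$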
Clearly, FCN supervised by DSR is trainable and can be optimized by standard Stochastic Gradient Descent (SGD). In Algorithm 2, we summarize the detail of feature learning with DSR.
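As a sanity check of the second step, the sketch below compares analytic gradients with central finite differences. It assumes a verification loss of the form α‖X − YW‖²_F + β‖W‖₁ with the coefficient matrix `W` held fixed, where α = +1 for a same-identity pair and α = −1 otherwise; the function names, the value of `beta` and the random data are hypothetical.

```python
import numpy as np


def dsr_loss(X, Y, W, alpha, beta):
    """alpha * ||X - Y W||_F^2 + beta * ||W||_1 (assumed form of the loss)."""
    return alpha * np.sum((X - Y @ W) ** 2) + beta * np.abs(W).sum()


def dsr_grads(X, Y, W, alpha):
    """Gradients of the loss w.r.t. the feature blocks X and Y, with W fixed."""
    residual = X - Y @ W  # reconstruction residual, shape (d, N)
    return 2.0 * alpha * residual, -2.0 * alpha * residual @ W.T


def numeric_grad(f, A, eps=1e-6):
    """Central finite differences of the scalar f() w.r.t. each entry of A."""
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        up = f()
        A[idx] = old - eps
        down = f()
        A[idx] = old  # restore the entry before moving on
        G[idx] = (up - down) / (2 * eps)
    return G


rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Y = rng.standard_normal((6, 5))
W = rng.standard_normal((5, 4))
gX, gY = dsr_grads(X, Y, W, alpha=-1.0)
f = lambda: dsr_loss(X, Y, W, alpha=-1.0, beta=0.1)
ok_X = np.allclose(gX, numeric_grad(f, X), atol=1e-4)
ok_Y = np.allclose(gY, numeric_grad(f, Y), atol=1e-4)
print(ok_X, ok_Y)
```

In an actual network these two gradients would be back-propagated through the convolutional layers to update the FCN parameters; that part is omitted here.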
We directly embed the proposed DSR into FCN to train an end-to-end deep network, which can improve the overall performance. It is noteworthy that person images in each training pair share the same scale.
3.4 Multi-scale Block Representation
Invariance to varying probe scale is a challenging problem for an arbitrary patch of a person image. A holistic person image can be directly resized to a fixed size. For a partial person image, however, it is difficult to determine its scale explicitly. Therefore, the scales of a partial person and a holistic person are easily mismatched, which degrades performance. In Sec. 3.2, we use single-scale blocks ($1 \times 1$ blocks), which are not very robust to scale variations. To alleviate the influence of scale mismatching, a multi-scale block representation is also proposed in DSR (see Fig. 6). In our experiment, we adopt blocks of 3 different scales, $1 \times 1$, $2 \times 2$ and $3 \times 3$, and extract these blocks in a sliding-window manner (stride is 1 block).
In order to keep the dimensions consistent, $2 \times 2$ and $3 \times 3$ blocks are resized to $1 \times 1$ blocks by average pooling. The resulting blocks are all pooled into the block set. The main purpose of the multi-scale block representation is to improve robustness to scale variation. Experimental results show that this processing effectively improves the performance of partial person re-id.
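The block extraction described above can be sketched as follows, assuming the feature map is stored as an (h, w, d) NumPy array. Each s×s block is average-pooled to a single d-dimensional vector and all blocks are stacked as columns of one matrix. The function name `multi_scale_blocks` and the toy feature map are hypothetical.

```python
import numpy as np


def multi_scale_blocks(fmap, scales=(1, 2, 3)):
    """Collect s x s blocks (stride 1) from an (h, w, d) feature map.

    Every block is average-pooled to one d-dimensional vector, so blocks of
    all scales share the same dimension. Returns a (d, K) matrix.
    """
    h, w, _ = fmap.shape
    blocks = []
    for s in scales:
        for i in range(h - s + 1):
            for j in range(w - s + 1):
                blocks.append(fmap[i:i + s, j:j + s].mean(axis=(0, 1)))
    return np.stack(blocks, axis=1)


# Toy 5 x 4 feature map with 2 channels.
fmap = np.arange(5 * 4 * 2, dtype=float).reshape(5, 4, 2)
B = multi_scale_blocks(fmap)
print(B.shape)  # (2, 38): 20 blocks of 1x1, 12 of 2x2 and 6 of 3x3
```

Because the blocks are cut from a feature map that is computed once, adding scales only adds cheap pooling operations rather than extra forward passes.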
The designed multi-scale block representation operates at the feature level. In contrast, some detection-based models perform multi-scale operations at the image level, so their computational cost of feature extraction is inevitably high. Multi-scale block representation at the feature level is thus more efficient because it shares computation.
4 Experiments
On the Partial REID and Partial-iLIDS datasets, we focus on five aspects: (1) the influence of person image deformation; (2) the benefits of multi-scale block representation; (3) comparison with other partial person re-id approaches; (4) the computational time of various partial person re-id approaches; (5) the effectiveness of fine-tuning with DSR.
4.1 Experiment Settings
Network Architecture. The designed Fully Convolutional Network (FCN) is shown in Fig. 3. The Market1501 dataset [31] is used to train the FCN with a 1,500-way softmax to obtain a pre-trained model. 3,000 positive pairs and 3,000 negative pairs of person images are used to fine-tune the pre-trained FCN with DSR. For each pair, one is a holistic person image and the other is an arbitrary patch of a person image.
Datasets. Partial REID is a specially designed partial person dataset that includes 600 images from 60 people, with 5 full-body images and 5 partial images per person. These images were collected on a university campus with different viewpoints, backgrounds and types of severe occlusion. Examples of partial persons in the Partial REID dataset are shown in Fig. 7(a). The region in the red bounding box is the partial person. The probe set consists of all partial images of each person, and the holistic person images are used as the gallery set. Partial-iLIDS is a simulated partial person dataset based on iLIDS [32]. iLIDS contains a total of 476 images of 119 people captured by multiple non-overlapping cameras. Some images in the dataset contain people occluded by other individuals or luggage. Fig. 7(b) shows some examples of individual images from the iLIDS dataset. For the occluded individuals, the partial observation is generated by cropping the non-occluded region of one image of each person to construct the probe set. The non-occluded images of each person are selected as the gallery set. There are 60 and 119 individuals in the test sets of the Partial REID and Partial-iLIDS datasets, respectively. Five and one partial person images of each person are used as the probe set for the Partial REID and Partial-iLIDS datasets, respectively.
Evaluation Protocol. To show the performance of the proposed approach, we report the average Cumulative Match Characteristic (CMC) curves for the closed-set experiment and Receiver Operating Characteristic (ROC) curves for the verification experiment.
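For reference, a closed-set CMC curve can be computed from a probe-by-gallery distance matrix as in the sketch below. It assumes that every probe identity is present in the gallery and that the gallery holds one entry per identity (or distances already aggregated per identity). The function name `cmc_curve` and the toy distances are hypothetical.

```python
import numpy as np


def cmc_curve(dist, probe_ids, gallery_ids, max_rank=5):
    """Closed-set CMC: fraction of probes whose true identity appears among
    the k closest gallery entries, for k = 1..max_rank. dist has shape (P, G)."""
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for p in range(dist.shape[0]):
        order = np.argsort(dist[p])  # closest gallery entries first
        ranked_ids = gallery_ids[order]
        first_hit = np.flatnonzero(ranked_ids == probe_ids[p])[0]
        if first_hit < max_rank:
            hits[first_hit:] += 1  # a hit at rank r counts for all k >= r
    return hits / dist.shape[0]


dist = np.array([[0.1, 0.5, 0.9],
                 [0.8, 0.7, 0.2],
                 [0.3, 0.6, 0.4]])
cmc = cmc_curve(dist, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2], max_rank=3)
print(cmc)  # rank-1 is 1/3, rank-2 and rank-3 are 1.0
```

The first entry of the returned array is the Rank-1 accuracy quoted throughout the experiments.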
Benchmark Algorithms. Several existing partial person re-identification methods are used for comparison, including the part-based matching method Ambiguity-sensitive Matching Classifier (AMC) [33], the global-to-local matching method Sliding Window Matching (SWM) [33], AMC+SWM [33] and the Resizing model (see Fig. 2(a)). For AMC, features are extracted from support areas, which are densely sampled with an overlap of half of the height/width of the support area in both horizontal and vertical directions. Each region is represented by the fine-tuned FCN, yielding a 2048-dimensional feature vector.
Settings. Both single-shot and multi-shot experiments are conducted. In the single-shot experiment, a single ($N = 1$) person image is used as the gallery image for each individual. In the multi-shot experiment, multiple ($N = 3$) person images are used as gallery images for each individual.
4.2 Influence of Person Image Deformation
Fig. 2(a) shows the details of the Resizing model: person images in the gallery and probe sets are all re-sized to the same fixed size. The FCN is used as the feature extractor and a 15,360-dimensional feature vector is produced for each person image. In the single-shot experiment, we use the Euclidean distance to measure the similarity of a pair of person images in the Resizing model. In the multi-shot experiment, we return the average similarity between a probe person image and the multiple person images of an individual. For DSR, we only adopt single-scale block representation ($1 \times 1$ blocks) in this experiment. Fig. 8 shows the experimental results on the Partial REID and Partial-iLIDS datasets. In both single-shot and multi-shot experiments, the gap between the Resizing model and DSR is very large. These results show that person image deformation has a significant influence on recognition performance. For example, when the upper part of a person image is re-sized to the fixed size, the entire image is stretched vertically.
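The scoring used for the Resizing model can be sketched as below, assuming each image is already represented by one fixed-length feature vector. The Euclidean distance serves as the (dis)similarity, and in the multi-shot case the distances to all gallery images of an individual are averaged. The function name and the toy vectors are hypothetical.

```python
import numpy as np


def resizing_model_distance(probe_feat, gallery_feats):
    """Average Euclidean distance between one probe feature vector and the
    feature vectors of one individual's gallery images (one row per image)."""
    gallery_feats = np.atleast_2d(gallery_feats)
    return float(np.linalg.norm(gallery_feats - probe_feat, axis=1).mean())


probe = np.array([0.0, 0.0])
single_shot = resizing_model_distance(probe, np.array([3.0, 4.0]))
multi_shot = resizing_model_distance(
    probe, np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
)
print(single_shot, multi_shot)  # 5.0 5.0
```

With a single gallery image the average reduces to the plain Euclidean distance, so the same function covers both settings.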
4.3 Multi-scale Block Representation Benefits
To evaluate the performance of the proposed DSR with regard to the multi-scale block representation, we pool blocks of different sizes into the gallery and probe block sets. 3 different fusion schemes are adopted: $1 \times 1$ blocks, $1 \times 1$ blocks combined with $2 \times 2$ and $3 \times 3$ blocks, and $1 \times 1$ blocks combined with $2 \times 2$ blocks. Results are shown in Fig. 9. DSR achieves the best performance when the gallery and probe block sets contain $1 \times 1$, $2 \times 2$ and $3 \times 3$ blocks. Experimental results suggest that multi-scale block representation is effective. The single-scale block contains more local information, while the multi-scale blocks provide complementary information that makes DSR more robust to scale variation.
|Method|Partial REID|Partial-iLIDS|
4.4 Comparison to the State-of-the-Art
We compare the proposed DSR to the state-of-the-art methods, including AMC, SWM, AMC+SWM and the Resizing model, on the Partial REID and Partial-iLIDS datasets. There are 60 and 119 individuals in the test sets of the Partial REID and Partial-iLIDS datasets, respectively. For DSR, we report the results using single-scale block representation and multi-scale block representation. For AMC+SWM, the weights of AMC and SWM are 0.7 and 0.3, respectively. Both single-shot and multi-shot experiments are conducted.
Single-shot experiments. Table 1 shows the single-shot experimental results. The results on Partial REID and Partial-iLIDS are similar. The proposed DSR outperforms AMC, SWM, AMC+SWM and the Resizing model. DSR takes full advantage of the FCN, which operates in a sliding-window manner and outputs feature maps without deformation. AMC, as a local-to-local matching method, also achieves comparable performance because background patches can be automatically excluded due to their low visual similarity; it is therefore robust to occlusion. However, it is difficult to select a satisfactory support area size and stride, and AMC is not robust to scale variation. SWM is a local-to-global matching method, which requires that the probe size be smaller than the gallery size. The search procedure in SWM ignores some detailed information about a person image. AMC+SWM performs comparably to DSR because the local features in AMC combined with the global features in SWM make it robust to occlusion and view/pose variations. Similar results are also observed from the ROC curves shown in Fig. 10 and Fig. 11. DSR shows small intra-distance and large inter-distance. Fig. 12 gives some examples of partial person re-id.
Multi-shot experiments. The DSR approach is also evaluated under the multi-shot setting (N=3) on the Partial REID and Partial-iLIDS datasets. The results are shown in Table 2 and are similar to those of the single-shot experiment. Specifically, the multi-shot setup helps to improve the performance of DSR: its accuracy increases from 39.33% to 49.33% on the Partial REID dataset and from 51.06% to 54.67% on the Partial-iLIDS dataset.
4.5 Computational Efficiency
Our implementation is based on the publicly available code of MatConvNet [21]. All experiments in this paper are run on a PC with 16GB RAM and an i7-4770 CPU @ 3.40GHz. Single-shot and multi-shot experiments on the Partial REID dataset are conducted to measure the computational time of identifying a probe person image. For DSR, we use single-scale block representation ($1 \times 1$ blocks) and multi-scale block representation. Table 3 shows the computational time of various partial person re-id approaches, which suggests that the proposed DSR outperforms the other approaches in computational efficiency. DSR with single-scale and multi-scale block representation takes 0.269s and 0.278s, respectively, to identify a person image. AMC costs more computational time than DSR because it repeatedly runs the FCN for each sub-region without sharing computation. SWM sets up a sliding window of the same size as the probe person image to search for similar sub-regions within each gallery image. Generally, many sub-regions are generated by the sliding window, which greatly increases the computational time of feature extraction. Besides, given a new probe person image, SWM has to regenerate the sub-regions with a sliding window of the same size as the new probe image. DSR is also faster than the Resizing model, because the computational cost of feature extraction increases after resizing.
4.6 Contribution of Fine-tuning with DSR
In Sec. 3.3, DSR is used to fine-tune the pre-trained FCN to learn more discriminative spatial features. To verify the effectiveness of fine-tuning the FCN with DSR, we conduct the single-shot experiment on the Partial REID dataset. We compare the pre-trained model (the FCN trained only with the softmax loss) to the fine-tuning model (the FCN fine-tuned with DSR). Fig. 13 shows the ROC curves and CMC curves of the two models. The fine-tuning model performs better than the pre-trained model, which suggests that fine-tuning with DSR learns more discriminative deep spatial features. The pre-trained model, trained with the softmax loss, only represents the probability that a person image belongs to each class. In the fine-tuning model, DSR effectively reduces the intra-class variation between a pair of person images of the same individual.
4.7 Evaluation on Holistic Person Image
To verify the effectiveness of DSR on holistic person re-identification, we also conduct a holistic person re-id experiment on the Market1501 dataset [31]. Market1501 contains 1,501 individuals captured by six surveillance cameras on a campus. Each individual is captured by two disjoint cameras. In total, it consists of 13,164 person images and each individual has about 4.8 images at each viewpoint. We follow the Market1501 benchmark test protocol: 751 individuals are used for training and 750 individuals are used for testing. ResNet50 [8] pre-trained on ImageNet is the base model. We use 12,936 images from 751 individuals to fine-tune the ResNet50 model with softmax loss and DSR. For fine-tuning ResNet50 with DSR, 3,000 positive pairs and 3,000 negative pairs are selected. ResNet50 fine-tuned only with softmax loss is called the baseline model, and ResNet50 fine-tuned with both softmax loss and DSR is called the fine-tuning model.
We compare the proposed DSR with the baseline model (the 2048-dimensional feature extracted by pool5 is used as the identity feature and the Euclidean distance is used for matching). Besides, several state-of-the-art methods are compared, including Bag of Words (BOW) [31], Multi-scale Context-aware Network (MSCAN) [11], Spindle Net [29], Re-ranking [34], Consistent-Aware Deep Learning (CADL) [14], Cross-View Asymmetric Metric Learning (CAMEL) [26], DNSL+OL-MANS [35] and Deeply-Learned Part-Aligned Representations (DLPAR) [30]. For DSR, feature maps extracted from res5c are used as the identity feature. We adopt single-scale representation ($1 \times 1$) and multi-scale representation ($1 \times 1$, $2 \times 2$ and $3 \times 3$), respectively, for feature representation. The experimental results are shown in Table 4. We make three observations: 1) DSR is very effective compared to the Euclidean distance because DSR can automatically search for similar feature blocks for matching; 2) multi-scale representation achieves better results because it reduces the influence of scale variations; 3) training the model with DSR effectively learns more discriminative deep spatial features, since it encourages the feature maps of the same identity to be similar while keeping the feature maps of different identities apart.
|+Euclidean distance (baseline model)|
|+DSR (baseline model)|
|+DSR (baseline model)|
|+DSR (fine-tuning model)|
|+DSR (fine-tuning model)|
5 Conclusion
We have proposed a novel approach called Deep Spatial feature Reconstruction (DSR) to address partial person re-identification. To get rid of the fixed input size, the proposed spatial feature reconstruction method provides a feasible scheme where each pixel-level block in the probe spatial feature map is linearly reconstructed from the blocks of a gallery spatial feature map, which also avoids explicit alignment. Furthermore, we embed DSR into the FCN to learn more discriminative features, such that the reconstruction error for an image pair from the same person is minimized and that of an image pair from different persons is maximized. Experimental results on the Partial REID and Partial-iLIDS datasets validate the effectiveness and efficiency of DSR, and the advantages over various partial person re-id approaches are significant. Additionally, the proposed method is also competitive on the holistic person dataset Market1501.
References
[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
[3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
[4] S. Gao, I. W.-H. Tsang, and L.-T. Chia. Kernel sparse representation for image classification and face recognition. In European Conference on Computer Vision (ECCV). Springer, 2010.
[5] R. Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision (ICCV), 2015.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(9):1904–1916, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] K. Huang and S. Aviyente. Sparse representation for signal classification. In Advances in Neural Information Processing Systems (NIPS), 2007.
[10] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems (NIPS), 2007.
[11] D. Li, X. Chen, Z. Zhang, and K. Huang. Learning deep context-aware features over body and latent parts for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] G. Li and Y. Yu. Deep contrast learning for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[13] S. Liao, A. K. Jain, and S. Z. Li. Partial face recognition: Alignment-free approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(5):1193–1205, 2013.
[14] J. Lin, L. Ren, J. Lu, J. Feng, and J. Zhou. Consistent-aware deep learning for person re-identification in a camera network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision (ECCV). Springer, 2016.
[16] X. Mei and H. Ling. Robust visual tracking and vehicle classification via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(11):2259–2272, 2011.
[17] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[20] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(4):640–651, 2017.
[21] A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for matlab. In ACM International Conference on Multimedia (ACM MM). ACM, 2015.
[22] J. Wang and S. Li. Query-driven iterated neighborhood graph search for large scale indexing. In ACM International Conference on Multimedia (ACM MM), 2012.
[23] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 31(2):210–227, 2009.
[24] Y. Xu, D. Zhang, J. Yang, and J.-Y. Yang. A two-phase test sample sparse representation method for use with face recognition. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 21(9):1255–1262, 2011.
[25] M. Yang and L. Zhang. Gabor feature based sparse representation for face recognition with gabor occlusion dictionary. In European Conference on Computer Vision (ECCV), 2010.
[26] H.-X. Yu, A. Wu, and W.-S. Zheng. Cross-view asymmetric metric learning for unsupervised person re-identification. In IEEE International Conference on Computer Vision (ICCV), 2017.
[27] X.-T. Yuan, X. Liu, and S. Yan. Visual classification with multitask joint sparse representation. IEEE Transactions on Image Processing (TIP), 21(10):4349–4360, 2012.
[28] L. Zhang, M. Yang, and X. Feng. Sparse representation or collaborative representation: Which helps face recognition? In IEEE International Conference on Computer Vision (ICCV), 2011.
[29] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[30] L. Zhao, X. Li, J. Wang, and Y. Zhuang. Deeply-learned part-aligned representations for person re-identification. 2017.
[31] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[32] W.-S. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 649–656, 2011.
[33] W.-S. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong. Partial person re-identification. In IEEE International Conference on Computer Vision (ICCV), 2015.
[34] Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. 2017.
[35] J. Zhou, P. Yu, W. Tang, and Y. Wu. Efficient online local metric adaptation via negative samples for person re-identification. In IEEE International Conference on Computer Vision (ICCV), 2017.