Sparse Label Smoothing for Semi-supervised Person Re-Identification
In this paper, we propose a semi-supervised framework to address the over-smoothing problem found in current regularization methods, deriving a regularization method from clusters of similar images. We propose Sparse Label Smoothing Regularization (SLSR), which consists of three steps. First, we train a CNN to learn discriminative patterns from labeled data; for each image, we extract the feature map from the last convolution layer and apply the k-means clustering algorithm directly to these features. Second, we train a GAN model for feature representation learning and generate sample images for each cluster; each generated sample is assigned a label using our regularization method. Third, we define a new objective function and fine-tune two baseline models, ResNet and DenseNet. Extensive experiments on four large-scale datasets, Market-1501, CUHK03, DukeMTMC-ReID, and VIPeR, show that our regularization method significantly improves Re-ID accuracy compared to existing semi-supervised methods; on the Market-1501 dataset, for instance, rank-1 accuracy is improved for both the ResNet and DenseNet baselines. The code is available at https://github.com/jpainam/SLS_ReID
Person re-identification is the process of establishing a correspondence between images of a person from multiple cameras: given a person, person re-id determines whether that person has been observed by another camera. The problem has been widely studied in the past and has achieved extraordinary results with deep learning based approaches [3, 11, 56, 58]. Modern deep learning methods require a large volume of labeled data for training to generalize well. Existing labeled datasets in person re-identification are limited in scale by the number of identities and by their size (30 images on average per identity). This lack of large datasets is a big challenge in applying deep learning techniques to person re-identification. One way this can be lessened is by using unsupervised methods to train on data without labels: these methods learn features from the data which can then be used for supervised learning with small datasets. In this work, we propose a semi-supervised framework that uses DCGAN  to generate data from clusters. The generated images are assigned a smooth label distribution based on their original cluster. We use the generated data in conjunction with the labeled data and define two losses, an unsupervised loss and a supervised loss. The model is trained to minimize both losses.
As shown in Fig. 1, our framework has three main steps. The unsupervised step takes unlabeled data and outputs feature vectors representing the feature maps of the training images. The extracted feature maps are then fed into a k-means clustering algorithm to obtain cluster sets. We use each cluster set to train an image generator  to generate sample images. As each generated image belongs to one of the clusters, we can assign a label to generated images through our regularization method. Finally, the semi-supervised step uses existing network architectures and introduces an extra linear layer, i.e. a noise layer which adapts the network outputs to match the noisy GAN label distribution. Our model generalizes well, and experimental results show that our method outperforms previous methods.
In this paper, we make the following contributions:
We propose a GAN-based model tailored for the person re-identification task with a sparse label smoothing regularization (SLSR).
We use an unsupervised learning approach to cluster the data and train a GAN network to generate images for each cluster.
We use partial label smoothing regularization over the generated images.
We show that unsupervised representation learning with SLSR improves the person re-identification accuracy.
The rest of this paper is organized as follows. Section 2 surveys the related works in person re-identification. Section 3 presents the proposed regularization method. Section 4 presents the network architectures and the implementation details. Section 5 shows the experimental results and Section 6 concludes the paper.
2 Related works
In this section, we describe the works relevant to our pipeline. These include clustering algorithms, the person re-identification task, a GAN model for unsupervised learning, and a CNN model for semi-supervised learning.
2.1 Generative Adversarial Network
Generative Adversarial Network (GAN) was first introduced by Goodfellow et al.  and is described as a framework for estimating generative models via an adversarial process. A GAN consists of two components: a generator (G) that generates an image and a discriminator (D) that discriminates real images from generated images. The two networks compete following a minimax two-player game; this kind of learning is called adversarial learning. Radford et al.  proposed Deep Convolutional GAN (DCGAN) together with techniques to improve the stability of GANs. The trained DCGAN showed competitive performance over unsupervised algorithms for image classification tasks. Multiple variants of GANs have been published in the literature [6, 44, 62, 63]. GANs have been applied to various interesting tasks such as realistic image generation , text-to-image generation , video generation , image-to-image generation , image inpainting , super-resolution  and many more. In this work, we use the DCGAN  model to generate unlabeled images from the training set. We chose the DCGAN model after carefully contrasting various image generators: its architecture is simple, yet it generates realistic images, as shown in Fig 2.
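The minimax game above can be sketched in a few lines; the function names are illustrative, and the non-saturating generator loss shown here is the variant commonly used in practice rather than a detail taken from this paper.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy the discriminator minimizes:
    -[log D(x) + log(1 - D(G(z)))], averaged over the batch."""
    return -sum(math.log(dr) + math.log(1.0 - df)
                for dr, df in zip(d_real, d_fake)) / len(d_real)

def generator_loss(d_fake):
    """Non-saturating generator objective: -log D(G(z))."""
    return -sum(math.log(df) for df in d_fake) / len(d_fake)
```

A confident discriminator (D(x) near 1 on real images, near 0 on fakes) drives its own loss down and the generator's loss up, and vice versa, which is the two-player competition described above.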
2.2 Supervised and semi-supervised learning
Supervised learning is a well-studied problem in computer vision for image classification. Given training data $\{(x_i, y_i)\}$, where $y_i$ is the label corresponding to the data point $x_i$, supervised representation learning learns the mapping function $y = f(x)$ or the posterior distribution $p(y \mid x)$. In contrast, unsupervised learning learns the intrinsic structure of unlabeled data. Semi-supervised learning can be regarded as unsupervised learning with some constraints on labels, or as supervised learning with additional information on the data distribution. Researchers also treat unsupervised learning as a subtask of supervised learning . Lee et al.  train a supervised network with labeled and unlabeled data by assigning pseudo-labels to the unlabeled data. Yu et al.  and Wang et al.  apply unsupervised asymmetric metric learning to unsupervised person re-id. Papandreou et al.  propose Expectation-Maximization (EM) combining weak and strong labels under supervised and semi-supervised settings for image segmentation, and Zheng et al.  train a semi-supervised network that assigns a uniform label distribution to generated samples. We depart from [32, 38, 58, 62] and propose to train a network in a semi-supervised fashion using a combination of two losses.
2.3 Person Re-Identification
Person re-id is viewed as an image retrieval problem and started as multi-camera tracking research . Some early works on person re-id focus on learning a metric and emphasize inter-personal or intra-personal distances (KISSME , XQDA , MLAPG , LFDA  and similarity learning ). Other works such as SILTP  and LBP  use color histograms, color names, or a combination of them to address the challenging variations in illumination and pose view-point. Recent works in person re-id are CNN based, and the goal is to jointly learn the best feature representation and distance metric. Zheng et al.  propose a siamese network with a verification loss and an identification loss that predicts the identities of a pair of input images. Many unsupervised methods with GAN-generated data have been developed [50, 51, 61] to address the lack of large labeled datasets in person re-id. Barros et al.  introduce, for the first time in the re-identification field, the strategy of using synthetic data as a proxy for real data and claim to recognize people independently of their clothing. Zhedong et al.  show that a regularization method (LSRO) over GAN-generated data can improve person re-id. Zhong et al.  propose a camera style (CamStyle) adaptation method to regularize CNN training through the adoption of LSR, using CycleGAN  for image generation. We show in section 3.3 how our model differs from  and .
3.1 Unsupervised loss
We intend to partition the training samples into groups of equal variance and find a shared space among similar objects. Our goal is to produce distinct clusters, each containing relatively similar features. To do this, we define an objective function like that of k-means clustering [2, 14]:

$J = \sum_{k=1}^{K} \sum_{x \in C_k} \| f(x) - \mu_k \|^2$ (1)
where $\mu_k$ is a cluster center and $\| f(x) - \mu_k \|$ the Euclidean distance between an embedded data point $f(x)$ and the cluster center $\mu_k$.
Let $X = \{x_1, \dots, x_N\}$ denote the set of feature vectors extracted from the last convolution layer and $I$ the input image of shape $C \times H \times W$ (where $C$ is the channel and $H \times W$ the spatial size); a feature extraction function for images is then defined by $x = f_\theta(I)$. The weights, or parameters, $\theta$ are automatically learned from data. Minimizing Eq. 1 with respect to the network parameters results in:

$\min_\theta \frac{1}{N} \sum_{i=1}^{N} \min_k \| f_\theta(I_i) - \mu_k \|^2$ (2)
where $N$ is the number of cases and $\mu_k$ the centroid for cluster $k$.
The centroids are learned such that, given a threshold $\tau$, distances between similar vectors are smaller than $\tau$, while those between dissimilar vectors are greater than $\tau$. Eq. 2 ensures that the distance between each training sample and its assigned cluster center is small for each feature. Using this objective function results in better clustering quality, as shown in Fig. 3.
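The objective in Eq. 1 and the alternating minimization behind Eq. 2 can be sketched in plain Python; function and variable names are illustrative, not taken from the released code.

```python
def kmeans_objective(features, centers, assign):
    """Sum of squared Euclidean distances between each embedded
    feature and its assigned cluster center (the quantity in Eq. 1)."""
    return sum(sum((f - c) ** 2 for f, c in zip(features[i], centers[k]))
               for i, k in enumerate(assign))

def lloyd_step(features, centers):
    """One k-means iteration: assign each feature to its nearest
    center, then move each center to the mean of its members."""
    def d2(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    assign = [min(range(len(centers)), key=lambda k: d2(x, centers[k]))
              for x in features]
    new_centers = []
    for k in range(len(centers)):
        members = [x for x, a in zip(features, assign) if a == k]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(centers[k])  # keep empty clusters in place
    return new_centers, assign
```

Each Lloyd step never increases the objective, which is why iterating it converges to the clustering used in our pipeline.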
3.2 Semi-supervised loss
Let $p$ be the vector of class probabilities produced by the neural network for an input image and $(w_k, b_k)$ the combination of weight and bias terms to be learned for label $k$. The network computes the probability of each label $k$ with the softmax function:

$p(k \mid x) = \frac{\exp(w_k^{T} x + b_k)}{\sum_{j} \exp(w_j^{T} x + b_j)}$
where $x$ refers to the input vector from the previous layers. Given the class labels $y_i$ of the training samples, we define the cost function for real images as the negative log-likelihood:

$L_R = -\sum_{i=1}^{N} \log p(y_i \mid x_i)$
In general, a neural network represents a function which provides the parameters of a distribution over $y$. So minimizing $L_R$ is equivalent to maximizing the probability of the ground-truth label $y$. For a given person with identity $y$, Eq. 5 can be written as

$L_R(\theta) = -\sum_{i=1}^{N} \log p(y_i \mid x_i; \theta)$
where $\theta$ represents the set of parameters of the whole network to be learned.
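The identification loss above amounts to a softmax followed by a negative log-likelihood; a minimal single-sample sketch with illustrative names:

```python
import math

def softmax(logits):
    """Class probabilities p(k|x) = exp(z_k) / sum_j exp(z_j)."""
    m = max(logits)                      # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll_loss(logits, label):
    """Negative log-likelihood of the ground-truth identity."""
    return -math.log(softmax(logits)[label])
```

Raising the logit of the ground-truth identity lowers the loss, which is exactly the maximization of $p(y \mid x)$ described above.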
Regularization via Label Smoothing (LSR). Szegedy et al.  propose a mechanism to regularize a classifier by estimating a marginalized effect over non-ground-truth labels during training, assigning them a small value $\varepsilon / K$ instead of $0$. The original ground-truth distribution is $q(k) = \delta_{k,y}$, where $\delta_{k,y}$ is the Dirac delta, equal to $1$ for $k = y$ and $0$ otherwise.
For a training image with ground-truth label $y$, Szegedy et al.  replace the label distribution with

$q'(k) = (1 - \varepsilon)\,\delta_{k,y} + \frac{\varepsilon}{K}$
Here, $q_c(k)$ are the unnormalized probabilities of the $i$-th image generated from cluster $c$, over $K$ classes. $\mathbb{1}_c(k)$ represents a one-hot-style vector whose entry is equal to $1$ if the class label $k$ belongs to cluster $c$ and $0$ if not. We consider the ground-truth distribution over the generated image and normalize it so that $\sum_k q_c(k) = 1$. To explicitly take our label regularization for generated images into account, we change the network to produce
$q_{SLSR}(k) = \frac{\mathbb{1}_c(k)}{K_c}$

and we optimize the cross-entropy against this distribution, where $K_c$ is the number of class labels in cluster $c$. Our loss for generated images can then be written as:

$L_G = -\sum_{k=1}^{K} q_{SLSR}(k) \log p(k \mid x)$
or, written more simply,

$L_G = -\frac{1}{K_c} \sum_{k \in c} \log p(k \mid x)$
where $K$ is the total number of classes. For the real training images we set $\varepsilon = 0$, and for the generated images $\varepsilon = 1$.
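A minimal sketch of the two target distributions, assuming each cluster is given as the set of class indices it contains (names are illustrative):

```python
def slsr_target(num_classes, cluster_classes):
    """Label distribution for a GAN image generated from one cluster:
    uniform mass 1/K_c on the identities present in that cluster,
    zero elsewhere (instead of a uniform 1/K over all classes)."""
    q = [0.0] * num_classes
    for c in cluster_classes:
        q[c] = 1.0 / len(cluster_classes)
    return q

def lsr_target(num_classes, label, eps=0.1):
    """Szegedy-style label smoothing for a real, labeled image:
    q'(k) = (1 - eps) * delta(k, y) + eps / K."""
    q = [eps / num_classes] * num_classes
    q[label] += 1.0 - eps
    return q
```

With `eps=0`, `lsr_target` reduces to the one-hot distribution used for real images; `slsr_target` is the sparse distribution assigned to generated images according to their cluster of origin.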
Recently, Zheng et al.  proposed Label Smoothing Regularization for Outliers (LSRO) and Zhong et al.  proposed CamStyle as a data augmentation approach. LSRO expands the training set with unlabeled samples generated by DCGAN  and assigns a uniform LSR  distribution to the generated samples, while CamStyle uses CycleGAN  to generate new training samples according to camera styles and assigns smoothed labels to the style-transferred images. Although LSRO and CamStyle are similar to our work, we argue that our method differs in two aspects:
1) LSRO  and CamStyle  assign an equal distribution to all generated images; this can lead to over-smoothing, especially when the number of classes is excessively large. Our method, however, fuses generated images with an adaptive label distribution over each cluster, i.e. $q(k) = 1/K_c$, where $K_c$ is the class-set size of cluster $c$. In LSRO and CamStyle, dissimilar and similar images may be assigned relatively equal similarity values, while our method deals with such unfairness by considering generated images in the locality of each sample and proposes a strategy to determine the appropriate candidates using the k-means clustering algorithm. The proposed SLSR is assigned to generated images according to their cluster of origin. This enables our model to be highly efficient in dealing with large amounts of data while remaining robust to noise. Our method, SLSR, learns the most discriminative features and can easily avoid over-smoothed similarity.
2) In our model, similarities are maintained and propagated through the network by the concatenation of similar images into one homogeneous feature space. Leveraging a feature space for each cluster can substantially improve the performance of person re-identification compared with using a single label distribution over all classes. Fig. 3 illustrates the effectiveness of our method, and extensive experiments demonstrate its superiority compared to LSRO  and CamStyle . Our model introduces an extra noise layer to match the noisy GAN label distribution. The parameters of this linear layer can be estimated as part of the training process and involve only a simple modification of current deep network architectures.
LSRO, CamStyle and our method SLSR share some common practices: (1) enhancing the training set by generating fake images with GAN  models; (2) adopting the Label Smoothing Regularization (LSR) proposed by Szegedy et al.  to alleviate the impact of noise introduced by the generated images; (3) performing semi-supervised learning for person re-id using labeled and unlabeled data in a CNN-based approach.
4 Network Overview
4.1 Generative Adversarial Network
We follow the implementation details of . The generator G consists of a deconvolutional network (DNN) made of a linear layer, a series of four strided deconvolution operations, and a final activation function. The input of G is a uniform noise vector Z scaled to a fixed range, and the output is a sample image. The discriminator D consists of a convolutional neural network (CNN) formed by four strided convolution operations. We add a linear layer followed by an output activation to discriminate real images from fake images. The input of D includes sample images from G and real images from the training set. Each convolution and deconvolution layer is followed by batch normalization  and ReLU in both the generator and the discriminator.
It is well known that multi-view data objects admit a common clustering structure across views , and that person re-id is a cross-camera, cross-view retrieval task. We aim to exploit this clustering structure to generate images that model the correlation among similar views through the use of k-means and a GAN. We apply the k-means algorithm to cluster the training images into k clusters. K-means clustering is a simple yet very effective unsupervised learning algorithm that clusters data based on the Euclidean distance between data points. We train a CNN for a fixed number of epochs using a base learning rate with momentum. We use the ResNet50  model to learn good intermediate representations and later extract high-dimensional feature representations from the last convolutional layer. The k-means clustering algorithm is applied to this set of feature maps. We found this approach to be faster and better than clustering on raw images.
To judge the goodness of our clustering, we treat the ground truth as unknown and perform an evaluation using the model itself. Table 1 shows the Silhouette Coefficient  cluster quality metric applied to the Market-1501 dataset .
Table 1: Average silhouette score on Market-1501 for cluster sizes K = 2, K = 3, and K = 4.
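The Silhouette Coefficient used in Table 1 can be computed without any library; this sketch uses Euclidean distance and simply skips singleton clusters (scikit-learn instead scores them as 0):

```python
def silhouette_score(points, labels):
    """Mean silhouette over all samples: s = (b - a) / max(a, b),
    where a is the mean intra-cluster distance and b the mean
    distance to the nearest other cluster."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        if not same:
            continue  # singleton clusters contribute no score here
        a = sum(same) / len(same)
        b = min(sum(dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
                / sum(1 for j in range(len(points)) if labels[j] == lab)
                for lab in set(labels) if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Scores near +1 indicate tight, well-separated clusters; scores below 0 indicate samples assigned to the wrong cluster.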
4.3 Convolutional Neural Network
We fine-tune two baseline models, ResNet50  and DenseNet , pre-trained on ImageNet . We introduce an extra linear layer into each network which adapts the network outputs to match the noisy GAN label distribution, i.e. an additional linear layer in the ResNet50 and DenseNet baselines respectively. The network was able to adjust its weights based on the error when we added a linear layer on top of the softmax layer rather than a non-linear one.
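The extra layer can be viewed as multiplying the softmax output by a learned label-confusion matrix; the sketch below uses illustrative names and is not the paper's exact parameterization:

```python
def noise_adapt(p_clean, Q):
    """Linear layer on top of the softmax output: the predicted
    distribution over noisy GAN labels is q = Q^T p, where Q[i][j]
    holds the (learned) probability that true class i is observed
    as noisy label j."""
    n_out = len(Q[0])
    return [sum(p_clean[i] * Q[i][j] for i in range(len(p_clean)))
            for j in range(n_out)]
```

When every row of Q sums to 1, the output remains a valid probability distribution; an identity Q leaves the clean prediction unchanged, and off-diagonal mass models the label noise introduced by the GAN samples.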
5.1 Person Re-ID datasets
We intensively evaluate our proposed model on four widely used datasets including Market-1501, CUHK03, DukeMTMC-ReID and VIPeR.
Market-1501  is a large and realistic dataset collected in front of a campus supermarket. It contains overlapping views among the six cameras, and images were automatically detected by the deformable part model (DPM) . The dataset is split into training and test sets of disjoint identities. We follow the standard data separation strategy as in  and use the whole training set for the unsupervised step and one image per identity as a validation image in the semi-supervised step.
CUHK03  provides two image sets: one set is automatically detected by the deformable-part-model detector (DPM) , and the other contains manually cropped bounding boxes. Misalignment, occlusions and missing body parts are quite common in the detected set. In this work, we use the detected set to make our model more realistic. The dataset is captured by six cameras, and each identity has several images in each view.
DukeMTMC-ReID  is a dataset derived from the DukeMTMC  dataset for multi-target tracking. The original dataset consists of video recorded by synchronized cameras. In this paper, we use the subset of Zhedong et al. . We follow the partition settings of the Market-1501 dataset and use all the training images for the unsupervised learning step, randomly picking one image per identity for the validation set. The remaining images are used for the supervised learning step.
VIPeR  contains 632 pedestrian image pairs captured outdoors from two viewpoints. Each pair contains two images of the same individual, cropped and scaled to 128x48 pixels. The dataset is divided into two equal subsets. For a fair comparison, we follow the testing strategy defined in .
5.2 Implementation details
We use ResNet50  and DenseNet  as baselines and modify the last fully connected layer to match the number of classes of Market-1501, CUHK03 and DukeMTMC-ReID respectively. To train the network, we use stochastic gradient descent  starting from a base learning rate that is gradually decreased as training progresses according to the current mini-batch iteration. We use momentum and weight decay, train with mini-batches, and run the network for a fixed number of epochs. To generate image samples, we train DCGAN using the Adam  optimizer.
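The exact decay constants are not reproduced here; the sketch below shows one common "inv" schedule and a plain SGD update with momentum and weight decay, as an assumption-labeled illustration rather than the paper's precise settings.

```python
def inv_lr(base_lr, it, gamma=1e-4, power=0.75):
    """'inv' schedule: lr = base_lr * (1 + gamma * it)^(-power),
    where `it` is the current mini-batch iteration. gamma and power
    here are illustrative defaults, not the paper's values."""
    return base_lr * (1.0 + gamma * it) ** (-power)

def sgd_step(w, grad, lr, velocity, momentum=0.9, weight_decay=5e-4):
    """SGD with momentum and L2 weight decay for one scalar parameter:
    the decay term is folded into the gradient before the momentum update."""
    g = grad + weight_decay * w
    v = momentum * velocity - lr * g
    return w + v, v
```

The schedule starts at `base_lr` at iteration 0 and decays monotonically, matching the "start high, gradually decrease" recipe described above.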
Data preprocessing: For the DenseNet baseline, all input images are resized before being randomly cropped, with random horizontal flip, and pixel values are scaled to a fixed range. The same resizing, random cropping, horizontal flipping and pixel scaling are applied for the ResNet50 baseline. Zero-centering by mean pixel and random erasing  are finally applied to both baselines to make the network more robust to variations and occlusions.
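Random erasing, the last augmentation listed above, can be sketched on a plain grid of pixel values; the fixed rectangular region and single-channel image here are simplifications of the published augmentation.

```python
import random

def random_erase(img, prob=0.5, area_frac=0.1, rng=random):
    """Random Erasing (Zhong et al.): with probability `prob`, overwrite
    a random rectangle of the image with random values. `img` is an
    H x W grid of floats; region shape and fill values are simplified."""
    if rng.random() > prob:
        return img
    h, w = len(img), len(img[0])
    eh = max(1, int(h * area_frac))
    ew = max(1, int(w * area_frac))
    top = rng.randrange(0, h - eh + 1)
    left = rng.randrange(0, w - ew + 1)
    for r in range(top, top + eh):
        for c in range(left, left + ew):
            img[r][c] = rng.random()     # fill the erased region with noise
    return img
```

Erasing random regions simulates occlusion, which is why it helps the network cope with the partially visible pedestrians common in re-id data.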
| Method | rank-1 | rank-5 | rank-10 | mAP |
|---|---|---|---|---|
| Gated ReID  | 65.88 | - | - | 39.55 |
| MR B-CNN  | 66.36 | 85.01 | 90.17 | 41.17 |
| Gated ReID  | 76.04 | - | - | 48.45 |
We use Cumulated Matching Characteristics (CMC) and mean average precision (mAP) as defined in  to evaluate the performance of our model. We use the L2 Euclidean distance to compute similarity scores for the ranking and retrieval tasks, as in previous works [47, 57, 58].
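Both metrics can be computed per query from the sorted gallery distances; this is a standard single-query sketch with illustrative names, omitting the same-camera filtering used in the full protocol.

```python
def evaluate_rank(dist_row, gallery_ids, query_id, max_rank=10):
    """CMC curve and average precision for one query: the gallery is
    sorted by ascending L2 distance, and every gallery image sharing
    the query identity counts as a match."""
    order = sorted(range(len(dist_row)), key=lambda j: dist_row[j])
    matches = [gallery_ids[j] == query_id for j in order]
    # CMC: 1 from the first correct match onward
    cmc, found = [], False
    for m in matches[:max_rank]:
        found = found or m
        cmc.append(1.0 if found else 0.0)
    # Average precision over all correct matches
    hits, precisions = 0, []
    for rank, m in enumerate(matches, start=1):
        if m:
            hits += 1
            precisions.append(hits / rank)
    ap = sum(precisions) / len(precisions) if precisions else 0.0
    return cmc, ap
```

Averaging `cmc` and `ap` over all queries gives the rank-k accuracies and the mAP reported in our tables.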
Re-ranking: Recent works [4, 43, 59] choose to perform an additional re-ranking to improve ReID accuracy.
In this work, we report re-ranking results based on the method of Zhong et al.  with k-reciprocal encoding, which combines the original L2 distance and the Jaccard distance.
SLS+DenseNet and SLS+ResNet denote our method with the DenseNet and ResNet baseline models respectively. SLS+Rerank is our DenseNet model with re-ranking . We only report accuracy at selected ranks.
| Method | rank-1 | rank-5 | rank-10 | mAP |
|---|---|---|---|---|
| XQDA (LOMO)  | 46.25 | 78.90 | 88.55 | - |
| MR B-CNN  | 63.67 | 89.15 | 94.66 | - |
| Gated ReID  | 68.1 | 88.1 | 94.6 | 58.8 |
| XQDA (LOMO)  | 30.75 | - | - | 17.04 |
| MFA (LOMO)  | 38.67 | 69.18 | 80.47 | 89.02 |
| XQDA (LOMO)  | 40.00 | 68.13 | 80.51 | 91.08 |
Comparison with the state of the art
On the Market-1501 dataset, our method achieves 89.16% rank-1 accuracy and 75.15% mAP, exceeding LSRO  by 5.19% and 9.08% respectively. Our method with both SLSR and re-ranking  with k-reciprocal encoding further improves rank-1 accuracy and mAP to 93.82% and 90.20% respectively. Table 4 shows that our method outperforms previous works overall.
On the CUHK03 dataset, we achieve 91.03% rank-1 accuracy and 94.21% mAP, within 0.77% of the best result reported by HydraPlus-Net . Our method exceeds LSRO  by 6.41% and 6.81% on rank-1 and mAP respectively. Table 5 shows that our method outperforms previous works.
On the DukeMTMC-ReID dataset, few results have been reported, as shown in Table 6. Our method achieves 82.94% rank-1 accuracy and 67.78% mAP, exceeding existing works. Compared to LSRO , our ResNet rank-1 accuracy exceeds theirs by 8.85%. SVDNet  exceeds our ResNet model by 0.17%, but our DenseNet model still exceeds their result by 6.24%.
On the VIPeR dataset, our method achieves 67.41% and 65.98% rank-1 accuracy with DenseNet and ResNet respectively. We improve the baseline by 3.95% in rank-1 accuracy and achieve competitive results at ranks 5, 10 and 20.
Compared to previous works in general, our method (SLSR) boosts rank-1 accuracy by 1.23%-6.41% and mAP by 1.43%-6.81% across all datasets.
In this paper, we propose sparse label smoothing regularization for person re-identification. We use unsupervised learning to cluster unlabeled data, and for each cluster set we train a GAN to generate images similar to that cluster. Our strategy is based on the intuition that each image represents a point in some high-dimensional feature space, and that similar images are close points sharing the same region of that space, sufficiently so to be assigned similar labels according to their cluster. We use the k-means clustering algorithm to separate similar images from dissimilar images and assign SLS labels to generated images. We finally train a CNN baseline using our SLSR loss function. Our model learns to exploit the samples generated by DCGAN to boost person re-id performance by improving generalization. Extensive evaluations were conducted on four large-scale datasets to validate the advantage of the proposed model over existing models. Tables 4, 5, 6 and 7 show the superiority of the model over a wide variety of state-of-the-art methods.
This work is supported by the Ministry of Science and Technology of Sichuan province (Grant No. 2017JY0073) and Fundamental Research Funds for the Central Universities in China (Grant No. ZYGX2016J083). We appreciate Yongsheng Peng, Eldad Antwi-Bekoe for their useful contributions and Yuyang Zhou for the management of the GPUs during experiments.
- E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3908–3916, June 2015.
- E. Aljalbout, V. Golkov, Y. Siddiqui, and D. Cremers. Clustering with Deep Learning: Taxonomy and New Methods. ArXiv e-prints, Jan. 2018.
- J. Almazán, B. Gajic, N. Murray, and D. Larlus. Re-id done right: towards good practices for person re-identification. CoRR, abs/1801.05339, 2018.
- S. Bai, X. Bai, and Q. Tian. Scalable person re-identification on supervised smoothed manifold. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3356–3365, July 2017.
- I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis. Looking beyond appearances: Synthetic training data for deep cnns in re-identification. Computer Vision and Image Understanding, 167:50 – 62, 2018.
- D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary Equilibrium Generative Adversarial Networks. ArXiv e-prints, Mar. 2017.
- L. Bottou. Stochastic Gradient Descent Tricks, pages 421–436. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
- D. Chen, Z. Yuan, B. Chen, and N. Zheng. Similarity learning with spatial constraints for person re-identification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1268–1277, June 2016.
- D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1335–1344, June 2016.
- P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, Sept 2010.
- M. Geng, Y. Wang, T. Xiang, and Y. Tian. Deep Transfer Learning for Person Re-identification. ArXiv e-prints, Nov. 2016.
- I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In 10th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 09/2007 2007.
- J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 1975.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
- A. Hermans, L. Beyer, and B. Leibe. In Defense of the Triplet Loss for Person Re-Identification. ArXiv e-prints, Mar. 2017.
- G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS. The Neural Information Processing Systems, 2014.
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 448–456. JMLR.org, 2015.
- P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, Nov. 2016.
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.
- M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2288–2295, June 2012.
- A. Kumar, P. Rai, and H. Daumé, III. Co-regularized multi-view spectral clustering. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 1413–1421, USA, 2011. Curran Associates Inc.
- C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. ArXiv e-prints, Sept. 2016.
- D.-H. Lee. Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks. 07 2013.
- W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 152–159, June 2014.
- S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2197–2206, June 2015.
- S. Liao and S. Z. Li. Efficient psd constrained asymmetric metric learning for person re-identification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 3685–3693, Dec 2015.
- X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, J. Yan, and X. Wang. Hydraplus-net: Attentive deep features for pedestrian analysis. In Proceedings of the IEEE international conference on computer vision, pages 350–359, 2017.
- G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 1742–1750, 2015.
- D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition (CVPR), 2016.
- A. Radford, L. Metz, and S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ArXiv e-prints, Nov. 2015.
- S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative Adversarial Text to Image Synthesis. ArXiv e-prints, May 2016.
- E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, abs/1609.01775, 2016.
- P. J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53 – 65, 1987.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, Dec 2015.
- Y. Sun, L. Zheng, W. Deng, and S. Wang. Svdnet for pedestrian retrieval. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3820–3828, Oct 2017.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, June 2016.
- E. Ustinova, Y. Ganin, and V. Lempitsky. Multiregion Bilinear Convolutional Neural Networks for Person Re-Identification. ArXiv e-prints, Dec. 2015.
- R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re-identification. In ECCV, 2016.
- C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 613–621, USA, 2016. Curran Associates Inc.
- F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1288–1296, June 2016.
- J. Wang, S. Zhou, J. Wang, and Q. Hou. Deep ranking model by large adaptive margin learning for person re-identification. Pattern Recognition, 74:241 – 252, 2018.
- W. Wang, Q. Huang, S. You, C. Yang, and U. Neumann. Shape Inpainting using 3D Generative Adversarial Network and Recurrent Convolutional Networks. ArXiv e-prints, Nov. 2017.
- X. Wang. Intelligent multi-camera video surveillance: A review. Pattern Recognition Letters, 34:3–19, 2013.
- Y. Wang, W. Zhang, L. Wu, X. Lin, and X. Zhao. Unsupervised metric fusion over multiview data by graph random walk-based cross-view diffusion. IEEE Transactions on Neural Networks and Learning Systems, 28(1):57–70, Jan 2017.
- L. Wu, C. Shen, and A. Hengel. Deep linear discriminant analysis on fisher networks: A hybrid architecture for person re-identification. 65, 06 2016.
- T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint detection and identification feature learning for person search. In CVPR, 2017.
- F. Xiong, M. Gou, O. Camps, and M. Sznaier. Person re-identification using kernel-based metric learning methods. In Computer Vision – ECCV 2014, pages 1–16, Cham, 2014. Springer International Publishing.
- H.-X. Yu, A. Wu, and W.-S. Zheng. Cross-view asymmetric metric learning for unsupervised person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
- C. Zhang, L. Wu, and Y. Wang. Crossing Generative Adversarial Networks for Cross-View Person Re-identification. ArXiv e-prints, Jan. 2018.
- L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1239–1248, June 2016.
- Y. Zhang and S. Li. Gabor-lbp based region covariance descriptor for person re-identification. In 2011 Sixth International Conference on Image and Graphics, pages 368–371, Aug 2011.
- H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. Conference on Computer Vision and Pattern Recognition, pages 907–915, 07 2017.
- L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1116–1124, Dec 2015.
- L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian. Person re-identification in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3346–3355, July 2017.
- Z. Zheng, L. Zheng, and Y. Yang. A discriminatively learned cnn embedding for person re-identification. ACM Transactions on Multimedia Computing Communications and Applications, 2017.
- Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
- Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random Erasing Data Augmentation. ArXiv e-prints, Aug. 2017.
- Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang. Camera style adaptation for person re-identification. In CVPR, 2018.
- J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ArXiv e-prints, Mar. 2017.
- J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems 30, pages 465–476. Curran Associates, Inc., 2017.