Regularizing Deep Hashing Networks Using GAN Generated Fake Images
Recently, deep-networks-based hashing (deep hashing) has become a leading approach for large-scale image retrieval. It aims to learn compact bitwise representations for images via deep networks, so that similar images are mapped to nearby hash codes. Since a deep network model usually has a large number of parameters, it may be too complex for the available training data, which leads to over-fitting. To address this issue, we propose a simple two-stage pipeline that learns deep hashing models by regularizing the deep hashing networks with fake images. The first stage generates fake images from the original training set, without extra data, via a generative adversarial network (GAN). In the second stage, we propose a deep architecture to learn hash functions, in which a maximum-entropy based loss incorporates the fake images newly created by the GAN. We show that this loss acts as a strong regularizer of the deep architecture by penalizing low-entropy output hash codes. This loss can also be interpreted as a model ensemble that simultaneously trains many network models with massive weight sharing but over different training sets. Empirical evaluation results on several benchmark datasets show that the proposed method achieves superior performance over state-of-the-art hashing methods.
Hashing methods are widely used in large-scale image retrieval due to their excellent efficiency in both storage and computation. Recently, much effort has been devoted to deep-networks-based hashing (deep hashing for short) methods for image retrieval (e.g., [\citeauthoryearXia et al.2014, \citeauthoryearLai et al.2015, \citeauthoryearZhao et al.2015, \citeauthoryearLi et al.2015, \citeauthoryearLiu et al.2016]), which use deep networks to learn similarity-preserving hash functions that encode similar images into nearby binary hash codes.
Since a deep hashing model usually has millions (or even billions) of parameters, training such a model may face the so-called “high variance” issue, i.e., the model is too complex for the amount of available training data, which makes it prone to over-fitting. To address this issue, we propose a simple yet effective two-stage pipeline that improves the performance of deep hashing with the help of a generative adversarial network (GAN). The key idea is to regularize deep hashing networks using fake images produced by the GAN. Specifically, as shown in Figure 1(a), in Stage 1 we use the GAN to generate extra fake images from the original training set. In Stage 2, we develop a deep architecture that leverages both the original training images and the newly generated fake images to learn similarity-preserving hash functions. As shown in Figure 1(b), the proposed architecture for hashing consists of two “parallel” networks. These networks share a convolutional sub-network that maps an input image to an approximate hash code; this sub-network consists of stacked convolution layers followed by a fully connected layer with sigmoid activations that generates the approximate hash code. Through the top network in Figure 1(b), a fake image is encoded into an approximate hash code. We define a maximum-entropy based loss that regularizes the deep architecture by penalizing low-entropy distributions defined on the output hash codes of the fake images. In the bottom network in Figure 1(b), which takes triplets of original training images as input, we use a triplet loss [\citeauthoryearLai et al.2015] to preserve relative similarities among the original training images.
We conduct comprehensive evaluations on three benchmark datasets for image retrieval. The results show that the proposed hashing method, with the help of the regularizer defined on the GAN-generated fake images, outperforms state-of-the-art hashing baselines.
2 Related Work
2.1 Hashing Methods based on Deep Networks
This paper focuses on learning-based hashing for image retrieval, which aims to learn compact binary representations for images from training data. Depending on whether side information is used in the learning process, learning-based hashing methods can be roughly divided into unsupervised, supervised and semi-supervised methods.
Unsupervised methods (e.g., SH [\citeauthoryearWeiss et al.2009], ITQ [\citeauthoryearGong et al.2013]) seek to learn similarity-preserving hash functions from unlabeled data only. Supervised and semi-supervised methods leverage supervised information (e.g., pairwise similarities or relative similarities of images) to learn better bitwise representations. Early supervised or semi-supervised hashing methods (e.g., MLH [\citeauthoryearNorouzi and Fleet2011], KSH [\citeauthoryearLiu et al.2012], FastH [\citeauthoryearLin et al.2014], SSH [\citeauthoryearWang et al.2010]) usually use hand-crafted visual features as image representations, followed by projection and/or quantization to generate hash codes.
In the past few years, with the dramatic progress of deep neural networks in various computer vision tasks, deep hashing has become an emerging stream of supervised hashing methods. For example, Xia et al. [\citeauthoryearXia et al.2014] proposed a two-stage method that first learns approximate hash codes from pairwise similarities of images, and then uses deep convolutional networks to learn both hash functions and image representations based on the learned approximate hash codes. Lai et al. [\citeauthoryearLai et al.2015] proposed a one-stage method that simultaneously learns image representations and hash codes via a carefully designed deep convolutional network. Zhao et al. [\citeauthoryearZhao et al.2015] proposed a ranking-based hashing method that uses deep convolutional networks to learn hash functions preserving multilevel semantic similarities between images. Li et al. [\citeauthoryearLi et al.2015] proposed a deep hashing method that performs simultaneous feature learning and hash coding for applications with pairwise labels. Liu et al. [\citeauthoryearLiu et al.2016] proposed a deep supervised hashing method that takes image pairs as training inputs and encourages the output of each image to approximate discrete values (e.g., +1/-1).
2.2 Generative Adversarial Networks
Generative adversarial networks (GANs), first proposed in [\citeauthoryearGoodfellow et al.2014], simultaneously learn a generator and a discriminator contesting with each other in a zero-sum game framework: the discriminator tries to distinguish real samples from the samples produced by the generator, and the generator produces samples to fool the discriminator. In the past few years, many variants of GANs have been proposed (e.g., [\citeauthoryearRadford et al.2015, \citeauthoryearChen et al.2016, \citeauthoryearSalimans et al.2016]).
GANs are widely employed in unsupervised and semi-supervised learning for visual classification tasks, in which the extra samples (e.g., images) generated by GANs are leveraged to improve learning performance. For example, in [\citeauthoryearSalimans et al.2016], the samples produced by GANs are regarded as an extra class and merged with the original training data for training. Zheng et al. [\citeauthoryearZheng et al.2017] leveraged GAN-generated samples through a label smoothing regularizer. Some recent research has applied GANs to cross-modal retrieval. For example, Wang et al. [\citeauthoryearWang et al.2017] proposed an adversarial framework for cross-modal retrieval, in which a feature projector tries to generate a modality-invariant representation to confuse a modality classifier, while the modality classifier tries to discriminate between modalities based on the generated representation. While GANs have been employed in various visual tasks, little effort has been made to apply GANs to hashing.
3 The Proposed Approach
We denote $\mathcal{I}$ as the image space. The task of learning-based hashing for images is to learn a mapping function $\mathcal{H}: \mathcal{I} \rightarrow \{0,1\}^q$ such that an input image $I \in \mathcal{I}$ can be mapped to a $q$-bit binary code $\mathcal{H}(I)$, where the similarities among images are preserved in the Hamming space.
The proposed approach has two stages. Stage 1 generates fake images from the original training set using the deep convolutional generative adversarial network (DCGAN) [\citeauthoryearRadford et al.2015]. Stage 2 trains a deep architecture for hash coding, which leverages the original training images and the newly created fake images.
3.1 Stage 1: Generate Fake Images Using DCGAN
A generative adversarial network (GAN) [\citeauthoryearGoodfellow et al.2014] consists of a generator $G$ and a discriminator $D$ contesting with each other in a zero-sum game framework. The goal of the discriminator is to tell whether a sample is drawn from the true data distribution or produced by the generator, while the generator is optimized to produce samples that the discriminator cannot distinguish from real ones.
In Stage 1 of the proposed approach, we use the deep convolutional generative adversarial network (DCGAN) [\citeauthoryearRadford et al.2015] to generate fake images from the original training set, adopting the network architecture in [\citeauthoryearRadford et al.2015]. In the generator $G$, a low-dimensional input noise vector $z$ (sampled from a uniform distribution) is projected to a convolutional representation with a small spatial size and many feature maps. A series of four fractionally-strided convolution layers then converts this representation into an output image. Each of the four fractionally-strided convolution layers is followed by a rectified linear unit and batch normalization [\citeauthoryearIoffe and Szegedy2015]. The input to the discriminator $D$ is either an image from the original training set or an image generated by $G$. Through four convolution layers and a fully connected layer with sigmoid activation, $D$ outputs a binary prediction (i.e., real or fake) for the input image. During training, $G$ and $D$ play the two-player min-max game with the value function $V(D, G)$ in Eq. (1):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big], \tag{1}$$

where $x$ is sampled from the training set and $z$ is sampled from the uniform distribution $p_z(z)$.
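To make the min-max objective concrete, the Python sketch below estimates $V(D, G)$ from a batch of discriminator outputs by replacing each expectation with a sample mean. It is an illustration only, not the DCGAN training code; the function name `gan_value` and the clamping constant are hypothetical choices.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Monte Carlo estimate of the GAN value function V(D, G).

    d_real: discriminator outputs D(x) on real training images x.
    d_fake: discriminator outputs D(G(z)) on generated images G(z).
    Each entry is D's predicted probability that its input is real.
    """
    tiny = 1e-12  # clamp to avoid log(0) for a saturated discriminator
    real_term = sum(math.log(max(p, tiny)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, tiny)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# D is trained to increase this value and G to decrease it.
# A discriminator that outputs 0.5 everywhere yields 2*log(0.5) = -log(4).
print(round(gan_value([0.5] * 8, [0.5] * 8), 4))  # -1.3863
# A confident, correct discriminator yields a higher value.
print(round(gan_value([0.9], [0.1]), 4))  # -0.2107
```

The value $-\log 4$ is the one reached at the game's equilibrium, where the generator's samples are indistinguishable from real data.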
Once the GAN is trained, we use its generator to create fake images as input to Stage 2. Some generated samples are shown in Figure 2.
3.2 Stage 2: Train a Deep Architecture for Hashing
The proposed deep architecture in Stage 2 consists of three main building blocks: (1) a convolutional sub-network that encodes images into approximate hash codes; (2) a triplet loss that preserves relative similarities among the original training images; and (3) a maximum-entropy based loss defined on the fake images, which works as a regularizer in training the deep architecture. In the following, we present the details of these parts.
As shown in Figure 1(b), the proposed deep architecture consists of two “parallel” networks. These networks have a shared convolutional sub-network whose role is to capture discriminative image representations and convert an input image to an approximate hash code.
This convolutional sub-network is based on the GoogLeNet architecture [\citeauthoryearSzegedy et al.2015], which converts an input image into a 1024-dimensional feature vector. On top of this feature vector, we add a fully connected layer with sigmoid activations to generate a $q$-dimensional approximate hash code, each of whose elements is in the range $[0, 1]$. At prediction time, this approximate hash code can easily be converted into a $q$-bit binary code by quantization. Specifically, let $\mathbf{a} = F(I)$ be the output vector of the sigmoid activations (i.e., the approximate hash code) for an input image $I$; the binary code $\mathbf{b}$ is obtained by

$$b_i = \begin{cases} 1, & a_i \geq 0.5, \\ 0, & \text{otherwise}, \end{cases} \qquad i = 1, \dots, q, \tag{2}$$

where $b_i$ ($a_i$) is the $i$-th element of $\mathbf{b}$ ($\mathbf{a}$).
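The quantization step can be sketched in a few lines of Python. This is a minimal sketch, assuming that each sigmoid output is thresholded at 0.5 (with ties mapped to 1) and that retrieval then compares binary codes by Hamming distance; the function names are hypothetical.

```python
from typing import List, Sequence


def quantize(a: Sequence[float]) -> List[int]:
    """Threshold each sigmoid output a_i at 0.5 to obtain bit b_i (ties map to 1)."""
    return [1 if a_i >= 0.5 else 0 for a_i in a]


def hamming_distance(b1: Sequence[int], b2: Sequence[int]) -> int:
    """Number of positions at which two equal-length binary codes differ."""
    if len(b1) != len(b2):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(b1, b2))


code = quantize([0.91, 0.12, 0.5, 0.49])
print(code)                                  # [1, 0, 1, 0]
print(hamming_distance(code, [1, 1, 0, 0]))  # 2
```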
Triplet Loss for Original Training Images
For ranking-based image retrieval, users mainly focus on the top-$k$ images returned by the retrieval system. In such scenarios, it is common practice to preserve relative similarities of the form “image $I$ is more similar to image $I^+$ than to image $I^-$”. To learn hash codes that preserve such relative similarities, the triplet ranking loss has been proposed in existing hashing methods [\citeauthoryearLai et al.2015]. Specifically, for a triplet of images $(I, I^+, I^-)$ in which $I$ is more similar to $I^+$ than to $I^-$, we denote the real-valued approximate hash codes of $I$, $I^+$ and $I^-$ as $F(I)$, $F(I^+)$ and $F(I^-)$, respectively. The triplet ranking loss function is defined as

$$\ell_{\text{triplet}}\big(F(I), F(I^+), F(I^-)\big) = \max\Big(0,\ \epsilon - \big(\|F(I) - F(I^-)\|_2^2 - \|F(I) - F(I^+)\|_2^2\big)\Big), \tag{3}$$

where $\epsilon$ is the margin parameter, which depends on the hash code length $q$, and $\|\cdot\|_2$ is the $\ell_2$ norm.
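The following minimal sketch evaluates the triplet ranking loss on plain Python lists. It assumes the squared-$\ell_2$ hinge form of [\citeauthoryearLai et al.2015], $\max(0, \epsilon - (\|F(I)-F(I^-)\|_2^2 - \|F(I)-F(I^+)\|_2^2))$, with a user-supplied margin `eps`; in the actual architecture the codes would be the sigmoid outputs of the shared sub-network, and the function names here are hypothetical.

```python
from typing import Sequence


def squared_l2(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two approximate hash codes."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def triplet_loss(f_anchor: Sequence[float], f_pos: Sequence[float],
                 f_neg: Sequence[float], eps: float) -> float:
    """Hinge loss: zero once the negative is at least `eps` farther
    (in squared L2 distance) from the anchor than the positive is."""
    gap = squared_l2(f_anchor, f_neg) - squared_l2(f_anchor, f_pos)
    return max(0.0, eps - gap)


anchor, pos, neg = [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]
print(triplet_loss(anchor, pos, neg, eps=1.0))            # 0.0 (gap 1.26 > 1)
print(round(triplet_loss(anchor, neg, pos, eps=1.0), 2))  # 2.26 (roles swapped)
```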
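Note that the triplet loss in Eq. (3) is designed for single-label data. To support multi-label image retrieval, we use the weighted triplet loss [\citeauthoryearLai et al.2016] for multi-label images. Concretely, for two multi-label images $I$ and $J$, their similarity, denoted by $\text{sim}(I, J)$, is defined as the number of labels shared by $I$ and $J$. Hence, for a triplet $(I, I^+, I^-)$ such that $\text{sim}(I, I^+) > \text{sim}(I, I^-)$, the weighted triplet loss is defined as

$$\ell_{\text{weighted}}\big(F(I), F(I^+), F(I^-)\big) = \big(2^{\text{sim}(I, I^+)} - 2^{\text{sim}(I, I^-)}\big)\, \ell_{\text{triplet}}\big(F(I), F(I^+), F(I^-)\big), \tag{4}$$

where $\ell_{\text{triplet}}$ is defined in Eq. (3). This weighted triplet loss is differentiable almost everywhere with respect to the approximate hash codes, so it can easily be integrated into back-propagation in neural networks.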
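A sketch of the multi-label variant follows. The weight $2^{\text{sim}(I,I^+)} - 2^{\text{sim}(I,I^-)}$ is an assumption about the form of the weighted loss in [\citeauthoryearLai et al.2016] and should be checked against that paper; with it, a single-label triplet (shared-label counts 1 and 0) gets weight 1 and recovers the unweighted loss. All names are hypothetical.

```python
from typing import Iterable, Sequence


def squared_l2(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(u, v))


def triplet_loss(f_a, f_p, f_n, eps: float) -> float:
    # Unweighted hinge on squared-L2 distances (the single-label loss).
    return max(0.0, eps - (squared_l2(f_a, f_n) - squared_l2(f_a, f_p)))


def shared_labels(labels_i: Iterable[int], labels_j: Iterable[int]) -> int:
    """sim(I, J): number of labels that two multi-label images share."""
    return len(set(labels_i) & set(labels_j))


def weighted_triplet_loss(f_a, f_p, f_n, y_a, y_p, y_n, eps: float) -> float:
    """Triplet loss scaled by the ASSUMED weight 2^sim(a,p) - 2^sim(a,n)."""
    s_pos, s_neg = shared_labels(y_a, y_p), shared_labels(y_a, y_n)
    if s_pos <= s_neg:
        raise ValueError("need sim(anchor, positive) > sim(anchor, negative)")
    weight = 2 ** s_pos - 2 ** s_neg
    return weight * triplet_loss(f_a, f_p, f_n, eps)


codes = [0.5, 0.5]  # identical codes, so the unweighted loss equals eps
print(weighted_triplet_loss(codes, codes, codes, {1, 2}, {1, 2}, {1}, eps=1.0))  # 2.0
print(weighted_triplet_loss(codes, codes, codes, {0}, {0}, {1}, eps=1.0))        # 1.0
```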
Maximum-Entropy Based Loss for Fake Images
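For each of the fake images created by the DCGAN in Stage 1, we assign a “virtual” hash code $\mathbf{v}$, a real-valued vector whose elements are all $0.5$. Then, for a fake input image $\hat{I}$, we define a loss function that minimizes the distance between the output approximate hash code $F(\hat{I})$ and the “virtual” hash code $\mathbf{v}$. Specifically, the loss function is defined as

$$\ell_{\text{fake}}\big(F(\hat{I})\big) = \|F(\hat{I}) - \mathbf{v}\|_2^2, \tag{5}$$

where $\mathbf{v} = (0.5, 0.5, \dots, 0.5) \in \mathbb{R}^q$ and $\|\cdot\|_2$ is the $\ell_2$ norm.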
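The fake-image loss and its gradient with respect to the sigmoid outputs can be sketched as below, assuming the squared $\ell_2$ distance to the all-0.5 virtual code. In a full network, back-propagation would further multiply this gradient by the sigmoid derivative; the function names are hypothetical.

```python
from typing import List, Sequence


def fake_loss(a: Sequence[float], target: float = 0.5) -> float:
    """Squared L2 distance between a fake image's approximate code and the
    all-0.5 'virtual' code (the squared form is an assumption)."""
    return sum((a_i - target) ** 2 for a_i in a)


def fake_loss_grad(a: Sequence[float], target: float = 0.5) -> List[float]:
    """Gradient w.r.t. the sigmoid outputs: 2 * (a_i - 0.5), which pushes
    every output toward 0.5."""
    return [2.0 * (a_i - target) for a_i in a]


print(fake_loss([0.5, 0.5, 0.5]))  # 0.0
print(fake_loss([1.0, 0.0]))       # 0.5
print(fake_loss_grad([1.0, 0.0]))  # [1.0, -1.0]
```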
Interpretation as a Maximum-Entropy Regularizer
The loss in Eq. (5) can be regarded as a regularizer of the deep architecture that penalizes low-entropy output (approximate) hash codes of the fake images.
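Specifically, for an image $I$, let $\mathbf{a} = F(I)$ be the approximate hash code output by the deep architecture, and let $\mathbf{b}$ be the corresponding binary code. Let $a_i$ ($b_i$) be the $i$-th element of $\mathbf{a}$ ($\mathbf{b}$). We denote by $P(b_i = 1 \mid I; \Theta)$ the probability of $b_i$ being 1 given $I$, i.e.,

$$P(b_i = 1 \mid I; \Theta) = a_i = F_i(I; \Theta),$$

where $\Theta$ represents the parameters of the deep architecture and $F_i$ is its $i$-th sigmoid output. That is, $a_i$ can be regarded as the probability, predicted by the deep architecture, that $b_i$ is 1. Correspondingly, $P(b_i = 0 \mid I; \Theta) = 1 - a_i$.
The deep architecture therefore produces a conditional distribution $P(\mathbf{b} \mid I; \Theta)$ over the binary code $\mathbf{b}$ given an input image $I$. For any $i \neq j$, we assume that $b_i$ and $b_j$ are independent of each other given $I$. Then we have

$$P(\mathbf{b} \mid I; \Theta) = \prod_{i=1}^{q} P(b_i \mid I; \Theta) = \prod_{i=1}^{q} a_i^{b_i} (1 - a_i)^{1 - b_i}.$$

The entropy of this conditional distribution is defined as

$$H(\mathbf{b} \mid I) = -\sum_{\mathbf{b} \in \{0,1\}^q} P(\mathbf{b} \mid I; \Theta) \log P(\mathbf{b} \mid I; \Theta) = -\sum_{i=1}^{q} \big[a_i \log a_i + (1 - a_i) \log (1 - a_i)\big],$$

where the second equality follows from the independence assumption. Each summand is the entropy of a Bernoulli variable with parameter $a_i$, which is maximized at $a_i = 0.5$; hence the entropy is maximized if and only if $a_i = 0.5$ ($i = 1, \dots, q$).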
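The factorization argument can be checked numerically. The sketch below computes the entropy of $P(\mathbf{b} \mid I)$ once as a sum of per-bit Bernoulli entropies and once by brute force over all $2^q$ codes, assuming independent bits with $P(b_i = 1 \mid I) = a_i$. The two agree, and the all-0.5 code attains the maximum of $q \log 2$ nats; the function names are hypothetical.

```python
import itertools
import math
from typing import Sequence


def bernoulli_entropy(p: float) -> float:
    """Entropy (in nats) of a single bit with P(bit = 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def code_entropy(a: Sequence[float]) -> float:
    """Entropy of P(b | I) under the independence assumption: a sum over bits."""
    return sum(bernoulli_entropy(a_i) for a_i in a)


def code_entropy_bruteforce(a: Sequence[float]) -> float:
    """Same quantity, summing -P(b) log P(b) over all 2^q binary codes b."""
    h = 0.0
    for b in itertools.product((0, 1), repeat=len(a)):
        p = 1.0
        for a_i, b_i in zip(a, b):
            p *= a_i if b_i == 1 else 1.0 - a_i
        if p > 0.0:
            h -= p * math.log(p)
    return h


a = [0.9, 0.2, 0.6]
print(abs(code_entropy(a) - code_entropy_bruteforce(a)) < 1e-12)  # True
print(round(code_entropy([0.5] * 3) / math.log(2), 6))             # 3.0
```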
Based on the above observations, the loss in Eq. (5), which minimizes the distance between the output approximate hash code $F(\hat{I})$ of a fake image $\hat{I}$ and the vector $\mathbf{v} = (0.5, \dots, 0.5)$, can be interpreted as a loss that pushes the output approximate hash code toward maximum entropy. Recent research [\citeauthoryearPereyra et al.2017] has shown that regularizing neural networks by penalizing low-entropy (i.e., confident) output distributions, which is equivalent to encouraging high-entropy output distributions, can improve the performance of neural networks. Hence, the loss in Eq. (5) defined on the fake images can be regarded as a maximum-entropy based regularizer of the deep architecture.
Interpretation as Model Ensemble
We denote the set of fake images for training as $\hat{\mathcal{S}} = \{\hat{I}_1, \dots, \hat{I}_m\}$. Consider a dataset $\mathcal{D} = \{(\hat{I}_j, \mathbf{c}_j)\}_{j=1}^{m}$, where each $\mathbf{c}_j \in \{0,1\}^q$ is a randomly generated binary code for the $j$-th image in $\hat{\mathcal{S}}$. We can use $\mathcal{D}$ to train the top network in Figure 1(b). Previous research [\citeauthoryearXie et al.2016] shows that combining deep network models trained on the same data with noisy labels can usually improve generalization. In our case, since each $\mathbf{c}_j$ can take $2^q$ values, we can construct $2^{qm}$ different datasets $\mathcal{D}_1, \dots, \mathcal{D}_{2^{qm}}$, where $\mathcal{D}_s$ is not identical to $\mathcal{D}_t$ for $s \neq t$. If we could use each of these datasets to train the top network in Figure 1(b) separately and then ensemble the resulting models, we would expect the ensemble to generalize better than a single model. However, training exponentially many network models separately is unaffordable due to the computational cost. Even if these models had been trained, combining their predictions at test time would also be computationally expensive or even infeasible.
According to the loss in Eq. (5), we assign the “virtual” hash code $\mathbf{v} = (0.5, \dots, 0.5)$ to each of the fake images in $\hat{\mathcal{S}}$ and train the network on the training set $\mathcal{D}_{\mathbf{v}} = \{(\hat{I}_j, \mathbf{v})\}_{j=1}^{m}$. Interestingly, training the network on $\mathcal{D}_{\mathbf{v}}$ can be regarded as training exponentially many networks with massive weight sharing but over different training sets (i.e., $\mathcal{D}_1, \dots, \mathcal{D}_{2^{qm}}$). Figure 3 shows an illustrative example. In general, with $m$ fake images and code length $q$, we can simultaneously train $2^{qm}$ network models with shared weights over the training sets $\mathcal{D}_1, \dots, \mathcal{D}_{2^{qm}}$, where $\mathcal{D}_s$ and $\mathcal{D}_t$ ($s \neq t$) differ from each other. For an image $\hat{I}_j$ in $\hat{\mathcal{S}}$, averaging its hash codes over $\mathcal{D}_1, \dots, \mathcal{D}_{2^{qm}}$ yields $\mathbf{v} = (0.5, \dots, 0.5)$, because each bit equals 1 in exactly half of these datasets. In other words, as shown in Figure 3, this is equivalent to training a single network model on $\mathcal{D}_{\mathbf{v}}$, in which every hash code is the constant vector $\mathbf{v}$.
In summary, training the proposed deep architecture using the loss in Eq. (5) can be interpreted as a model ensemble by simultaneously training many network models with weight sharing but over different training sets, which can act as a regularizer in the proposed deep architecture.
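The averaging argument can be checked numerically for tiny $m$ and $q$. This sketch assumes the squared-$\ell_2$ form of the fake-image loss. Under that assumption, for a single bit, $\tfrac{1}{2}\big[(a - 0)^2 + (a - 1)^2\big] = (a - 0.5)^2 + \tfrac{1}{4}$, so the mean of $\|\mathbf{a} - \mathbf{c}\|_2^2$ over all binary codes $\mathbf{c}$ equals $\|\mathbf{a} - \mathbf{v}\|_2^2 + q/4$, and training toward $\mathbf{v}$ has the same gradients as the averaged objective over all randomly labeled datasets. The code illustrates the averaged objective only; it does not train any networks, and the function names are hypothetical.

```python
import itertools
from typing import List, Sequence


def average_codes_over_datasets(m: int, q: int) -> List[List[float]]:
    """Average each fake image's code over all 2^(q*m) possible
    assignments of random binary codes to m images with q bits each."""
    totals = [[0] * q for _ in range(m)]
    n_datasets = 0
    for bits in itertools.product((0, 1), repeat=q * m):
        n_datasets += 1
        for j in range(m):
            for i in range(q):
                totals[j][i] += bits[j * q + i]
    return [[t / n_datasets for t in row] for row in totals]


def mean_loss_over_random_codes(a: Sequence[float]) -> float:
    """Mean of ||a - c||^2 over every binary code c in {0,1}^q."""
    codes = list(itertools.product((0, 1), repeat=len(a)))
    total = sum(sum((x - c) ** 2 for x, c in zip(a, code)) for code in codes)
    return total / len(codes)


print(average_codes_over_datasets(m=2, q=2))  # [[0.5, 0.5], [0.5, 0.5]]
a = [0.9, 0.2, 0.6]
lhs = mean_loss_over_random_codes(a)
rhs = sum((x - 0.5) ** 2 for x in a) + len(a) / 4  # ||a - v||^2 + q/4
print(abs(lhs - rhs) < 1e-12)  # True
```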
Combination of Loss Functions
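We train the proposed deep architecture by stochastic gradient descent on mini-batches. In training, we use a combination of the above loss functions:

$$L = \sum_{i=1}^{N_t} \ell_{\text{tri}}\big(F(I_i), F(I_i^+), F(I_i^-)\big) + \lambda \sum_{j=1}^{N_f} \ell_{\text{fake}}\big(F(\hat{I}_j)\big), \tag{6}$$

where $(I_i, I_i^+, I_i^-)$ represents the $i$-th triplet, $\ell_{\text{tri}}$ is the triplet loss in Eq. (3) (or the weighted triplet loss in Eq. (4) for multi-label data), $\ell_{\text{fake}}$ is the loss in Eq. (5), and $N_t$ and $N_f$ are the numbers of triplets and fake images in a mini-batch, respectively. The hyper-parameter $\lambda$ in Eq. (6) controls the balance between the two loss functions.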
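A minimal sketch of the mini-batch objective in Eq. (6) follows. It assumes single-label triplets, plain sums over the mini-batch (rather than means) and the squared-$\ell_2$ fake-image loss; `lam` and `eps` stand for the trade-off and margin parameters, and all function names are hypothetical. In practice the gradients would come from an autodiff framework; this only evaluates the loss value.

```python
from typing import Sequence, Tuple

Code = Sequence[float]


def squared_l2(u: Code, v: Code) -> float:
    return sum((x - y) ** 2 for x, y in zip(u, v))


def triplet_loss(f_a: Code, f_p: Code, f_n: Code, eps: float) -> float:
    return max(0.0, eps - (squared_l2(f_a, f_n) - squared_l2(f_a, f_p)))


def fake_loss(a: Code) -> float:
    # Distance to the all-0.5 "virtual" code.
    return squared_l2(a, [0.5] * len(a))


def combined_loss(triplets: Sequence[Tuple[Code, Code, Code]],
                  fake_codes: Sequence[Code], lam: float, eps: float) -> float:
    """Sum of triplet losses on real-image triplets plus lam times the
    sum of maximum-entropy losses on fake images."""
    real_part = sum(triplet_loss(a, p, n, eps) for a, p, n in triplets)
    fake_part = sum(fake_loss(c) for c in fake_codes)
    return real_part + lam * fake_part


triplets = [([0.5, 0.5], [0.5, 0.5], [0.5, 0.5])]  # contributes eps = 1.0
fakes = [[1.0, 0.0]]                              # contributes 0.5
print(combined_loss(triplets, fakes, lam=0.1, eps=1.0))  # 1.0 + 0.1 * 0.5
```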
Table 1: Retrieval results of the compared methods on the three benchmark datasets with 16, 32, 48 and 64 bits.
4.1 Datasets and Evaluation Metrics
We conduct extensive evaluations of the proposed method and compare it with state-of-the-art baselines on three benchmark datasets: (1) CIFAR-10, (2) NUS-WIDE and (3) CUB-200-2011.
In CIFAR-10, we randomly select 1,000 images (100 images per class) as the test query set and 5,000 images (500 images per class) as the training set; the remaining images, together with the training set, are used as the retrieval database. In NUS-WIDE, we randomly select 2,100 images (100 images from each of the 21 most frequent classes) as the test query set and 10,500 images (500 images per class) as the training set; again, the remaining images together with the training set form the retrieval database. In CUB-200-2011, we use the official split, in which the 5,794 test images serve as the test query set and the 5,994 training images serve as both the training set and the retrieval database. For a fair comparison, all methods use identical training/test sets and retrieval databases.
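The per-class split used for CIFAR-10 can be sketched as follows. This is a hypothetical helper, shown on toy labels with smaller per-class counts; with real CIFAR-10 labels one would pass `n_query=100` and `n_train=500`. It does not cover the NUS-WIDE class filtering or the official CUB-200-2011 split.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def split_per_class(labels: Sequence[int], n_query: int, n_train: int,
                    seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Return (query, train, database) index lists for single-label data.

    Per class, n_query images go to the query set and n_train other images
    go to the training set; every non-query image (training set included)
    forms the retrieval database.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    query, train = [], []
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        query += idxs[:n_query]
        train += idxs[n_query:n_query + n_train]
    query_set = set(query)
    database = [i for i in range(len(labels)) if i not in query_set]
    return query, train, database


# Toy stand-in for CIFAR-10: 10 classes x 20 images, 2 queries and 5
# training images per class.
toy_labels = [c for c in range(10) for _ in range(20)]
query_idx, train_idx, db_idx = split_per_class(toy_labels, n_query=2, n_train=5)
print(len(query_idx), len(train_idx), len(db_idx))  # 20 50 180
```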
To evaluate the retrieval performance of the hashing methods, we use four evaluation metrics: Mean Average Precision (MAP), precision-recall curves, precision curves within Hamming distance 3, and precision curves w.r.t. different numbers of top returned samples. For NUS-WIDE, which contains multi-label images, two images are regarded as similar if and only if they share at least one label.
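MAP under Hamming ranking can be sketched as follows, using the any-shared-label relevance rule stated above for multi-label data (for single-label data it reduces to label equality). This sketch ranks the whole database and breaks distance ties by database order; evaluation protocols differ on both points (e.g., some compute MAP over the top-$k$ returns only), so it is an assumption rather than the exact evaluation code, and the function names are hypothetical.

```python
from typing import List, Sequence


def hamming_distance(b1: Sequence[int], b2: Sequence[int]) -> int:
    return sum(x != y for x, y in zip(b1, b2))


def average_precision(relevant: List[bool]) -> float:
    """AP of one ranked list: mean of precision@k over ranks k holding a hit."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(query_codes, query_labels, db_codes, db_labels) -> float:
    """MAP of Hamming ranking; two images are relevant if they share a label."""
    aps = []
    for code, labels in zip(query_codes, query_labels):
        # sorted() is stable, so ties in Hamming distance keep database order.
        order = sorted(range(len(db_codes)),
                       key=lambda j: hamming_distance(code, db_codes[j]))
        relevant = [bool(set(labels) & set(db_labels[j])) for j in order]
        aps.append(average_precision(relevant))
    return sum(aps) / len(aps)


db_codes = [[0, 0], [1, 1], [0, 1]]
print(mean_average_precision([[0, 0]], [{1}], db_codes, [{1}, {2}, {1, 3}]))           # 1.0
print(round(mean_average_precision([[0, 0]], [{1}], db_codes, [{2}, {1}, {1}]), 4))  # 0.5833
```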
4.2 Implementation Details
The proposed method has two stages. In Stage 1, we train the DCGAN and use its generator to produce fake images. In Stage 2, we train a deep architecture for hashing, which leverages the original training images and the newly created fake images.
To train the DCGAN, we use Adam with a mini-batch size of 64, a learning rate of 0.0002 and a momentum term ($\beta_1$) of 0.5. All weights are initialized from a Gaussian distribution with a standard deviation of 0.02. For CIFAR-10, the size of a generated image is . For NUS-WIDE and CUB-200-2011, the size of a generated image is . We train the DCGANs for 100/160/160 epochs on CIFAR-10/NUS-WIDE/CUB-200-2011, respectively. For each of the three datasets, we use the generator of the learned DCGAN to generate as many fake images as there are images in the original training set.
For the deep architecture for hashing, we initialize the weights of the convolutional sub-network (shared by the two “parallel” networks) with the pre-trained GoogLeNet model [\citeauthoryearSzegedy et al.2015]. All input images (real and fake) are resized to . We use stochastic gradient descent to train the deep architecture, with an initial learning rate of 0.001, a batch size of 72 for real images and 72 for fake images, a momentum of 0.9 and a weight decay of 0.0005. During training, the triplets are exhaustively generated from the current batch of real images. For the trade-off parameter $\lambda$ in Eq. (6) and the margin parameter $\epsilon$ in Eq. (3), we first choose the best value of $(\lambda, \epsilon)$ from a candidate set (in which the candidate margins depend on the code length $q$) by cross-validation on the training set, and then use the chosen $(\lambda, \epsilon)$ to train the final model on the whole training set.
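Exhaustive triplet generation from a mini-batch can be sketched as below. It assumes that a valid triplet is any ordered index triple with $\text{sim}(I, I^+) > \text{sim}(I, I^-)$, where sim counts shared labels (single-label data is the special case of one label per image); the helper name is hypothetical. The loop is cubic in the batch size, which is still cheap for a batch of 72 real images.

```python
from typing import Iterable, List, Sequence, Tuple


def batch_triplets(batch_labels: Sequence[Iterable[int]]) -> List[Tuple[int, int, int]]:
    """All (anchor, positive, negative) index triplets in a mini-batch such
    that sim(anchor, positive) > sim(anchor, negative), where sim counts
    shared labels."""
    label_sets = [set(y) for y in batch_labels]

    def sim(i: int, j: int) -> int:
        return len(label_sets[i] & label_sets[j])

    n = len(label_sets)
    triplets = []
    for a in range(n):
        for p in range(n):
            if p == a:
                continue
            for neg in range(n):
                if neg != a and neg != p and sim(a, p) > sim(a, neg):
                    triplets.append((a, p, neg))
    return triplets


print(batch_triplets([[0], [0], [1]]))  # [(0, 1, 2), (1, 0, 2)]
```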
4.3 Results of Retrieval Accuracies
We compare the proposed method with seven baselines, including an unsupervised method (ITQ [\citeauthoryearGong et al.2013]), a semi-supervised method (SSH [\citeauthoryearWang et al.2010]), and five supervised methods (ITQ-CCA [\citeauthoryearGong et al.2013], KSH [\citeauthoryearLiu et al.2012], FastH [\citeauthoryearLin et al.2014], DTH [\citeauthoryearLai et al.2015], DSH [\citeauthoryearLiu et al.2016]), where DTH is a representative triplet-based deep hashing method and DSH is a representative pair-based deep hashing method. For SSH, we use the original training images as labeled images and the generated fake images as unlabeled images. All baselines are implemented using the source code provided by their authors. Since the convolutional sub-network in the proposed method is based on GoogLeNet, for a fair comparison we replace the CNN part in DTH and DSH with the same convolutional sub-network. In particular, our implementation of DTH is a variant of [\citeauthoryearLai et al.2015] in which the divide-and-encode module is replaced by a fully connected layer with sigmoid activations. The architecture of this variant is thus exactly the same as that of the bottom network in Figure 1(b), which makes it convenient to isolate the improvement brought by the maximum-entropy based loss on GAN-generated fake images in the proposed method.
For DTH, DSH and the proposed method, raw images are used as input. For the remaining baselines, which are not based on deep learning, we use the pre-trained GoogLeNet model to extract 1024-dimensional features from images and use these features as input.
The comparison results on the three datasets are shown in Table 1 and Figures 4, 5 and 6, from which two observations can be made. (1) On most metrics and on all three datasets, the proposed method shows superior performance over the state-of-the-art baselines. For instance, on CUB-200-2011, the MAP results of the proposed method show relative improvements of 2.89% to 8.07% over the second-best baseline across different hash code lengths. (2) The proposed method consistently outperforms DTH. Since the architecture of our DTH implementation is exactly the same as that of the bottom network in Figure 1(b), DTH is equivalent to the proposed method without the fake input images and the maximum-entropy based loss. The superior performance of the proposed method over DTH verifies that the maximum-entropy based loss, together with the fake images produced by the GAN, can act as a strong regularizer that improves the performance of deep hashing.
4.4 Effects of Different Numbers of Incorporated Fake Images
With the DCGAN model trained in Stage 1, one can produce a large number of fake images with its generator. A natural question is whether the performance of the proposed method is sensitive to the number of fake images incorporated by the maximum-entropy based loss. To answer this question, we conduct experiments with different numbers of incorporated fake images. As can be seen from Figure 7, the performance of the proposed method is relatively insensitive to this number when it is sufficiently large (e.g., not less than 5,000).
In this paper, we developed a two-stage pipeline that improves the performance of deep hashing with GANs: we first train a GAN model to produce a large number of fake images, and then incorporate these fake images into a maximum-entropy based loss in the deep architecture for hashing. We showed that this loss acts as a strong regularizer for the deep hashing network by penalizing low-entropy output hash codes of the fake images. This loss can also be interpreted as a model ensemble that simultaneously trains many network models with massive weight sharing but over different training sets. The maximum-entropy based loss on fake images is a generic regularizer that can easily be applied to other supervised deep hashing methods.
- https://www.cs.toronto.edu/~kriz/cifar.html
- Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
- Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
- Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3270–3278, 2015.
- Hanjiang Lai, Pan Yan, Xiangbo Shu, Yunchao Wei, and Shuicheng Yan. Instance-aware hashing for multi-label image retrieval. IEEE Transactions on Image Processing, 25(6):2469–2479, 2016.
- Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. arXiv preprint arXiv:1511.03855, 2015.
- Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton Van den Hengel, and David Suter. Fast supervised hashing with decision trees for high-dimensional data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1963–1970, 2014.
- Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1–8. Citeseer, 2011.
- Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2074–2081. IEEE, 2012.
- Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2064–2072, 2016.
- Mohammad Norouzi and David J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 353–360. Citeseer, 2011.
- Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
- Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for scalable image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3424–3431. IEEE, 2010.
- Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. Adversarial cross-modal retrieval. In Proceedings of the 2017 ACM on Multimedia Conference, pages 154–162. ACM, 2017.
- Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances in Neural Information Processing Systems, pages 1753–1760, 2009.
- Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, volume 1, pages 2156–2162, 2014.
- Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. DisturbLabel: Regularizing CNN on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.
- Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multi-label image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1556–1564. IEEE, 2015.
- Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717, 2017.