Adversarial Binary Coding for Efficient Person Re-identification
Abstract
Person re-identification (ReID) aims at matching persons across different views/scenes. In addition to accuracy, matching efficiency has received more and more attention because of demanding applications that use large-scale data. Several binary coding based methods have been proposed for efficient ReID, which either learn projections to map high-dimensional features to compact binary codes, or directly adopt deep neural networks by simply inserting an additional fully-connected layer with tanh-like activations. However, the former approach requires time-consuming hand-crafted feature extraction and complicated (discrete) optimizations; the latter loses much of the necessary discriminative information due to its straightforward activation functions. In this paper, we propose a simple yet effective framework for efficient ReID inspired by recent advances in adversarial learning. Specifically, instead of learning explicit projections or adding fully-connected mapping layers, the proposed Adversarial Binary Coding (ABC) framework guides the extraction of binary codes implicitly and effectively. The discriminability of the extracted codes is further enhanced by equipping the ABC with a deep triplet network for the ReID task. More importantly, the ABC and the triplet network are simultaneously optimized in an end-to-end manner. Extensive experiments on three large-scale ReID benchmarks demonstrate the superiority of our approach over state-of-the-art methods.
Keywords:
Person re-identification, binary coding, generative adversarial network, matching efficiency

1 Introduction
Given one or multiple images of a pedestrian, person re-identification (ReID) aims to retrieve the person with the same identity from a large collection of images captured in different scenes and from various viewpoints. ReID enables various potential applications, such as long-term cross-scenario tracking and criminal retrieval. The task, however, still remains challenging due to the significant variations in poses, viewpoints, and illuminations across different cameras.
Numerous ReID methods have been proposed, most of which adopt high-dimensional (usually thousands of dimensions or more) features [1, 2, 3, 4, 5, 6] in order to represent persons comprehensively with various cues (e.g., colors, textures, and spatial-temporal cues). This directly brings much higher computational complexity to the subsequent similarity measurement (e.g., metric learning). Besides, current large-scale ReID benchmarks contain numerous identities and cameras to simulate real-world scenarios, making existing state-of-the-art ReID approaches computationally unaffordable [7]. Therefore, despite the noteworthy improvement in matching accuracy, the computational and memory requirements have at the same time become more challenging.
Binary coding (i.e., hashing), adopted by, e.g., [8, 9], maps high-dimensional features into compact binary codes and efficiently measures similarities in the low-dimensional Hamming space. It is one of the promising solutions for efficient ReID. Hashing based ReID methods can be divided into two main categories. 1) The method shown in Fig. 1(a) learns multiple projection matrices to concurrently map original features to a low-dimensional and discriminative Hamming space. However, its objective is generally a non-convex joint function of several sub-tasks (e.g., similarity-preserving mapping and binary transformation), which requires the explicit design of sophisticated functions and time-consuming non-convex (discrete) optimizations. Memory storage and computational efficiency are serious issues, especially when dealing with large-scale data. 2) Fig. 1(b) shows a deep neural network based method, which is able to process large-scale data much more efficiently than traditional methods by using mini-batch learning algorithms and advanced GPUs. The binary codes here are generated by inserting hashing layers at the end of the networks. However, the hashing layer is simply a fully-connected layer followed by a tanh-like activation that forces the outputs into binary form. This straightforward scheme hardly constrains the outputs under the important principles of hashing (e.g., balancedness and independence [10]) that are required to obtain high-quality binary codes. Moreover, the outputs of the hashing layers tend to lie in the approximately linear part of the tanh-like functions in order to preserve discriminability; therefore, directly binarizing the outputs with such a function loses discriminative information.
To address the above issues, this paper proposes a unified end-to-end deep learning framework for efficient ReID, aiming to jointly learn a discriminative feature representation, an accurate similarity measurement, and an implicit binary transformation. In particular, we propose Adversarial Binary Coding (ABC) by adopting a Generative Adversarial Net (GAN) [11, 12] to regularize features into binary form without loss of discriminability (see Fig. 2). Instead of explicit projections, the adversarial learning makes the target distribution (in binary form) an 'expert' that implicitly guides the network to generate samples under the same distribution. Specifically, we employ the Bernoulli distribution to guide a CNN to generate discrete features. Benefiting from the nature of the Bernoulli distribution, our ABC can generate high-quality discriminative codes complying with the important principles of hashing, e.g., balancedness. As shown in Fig. 1(c), our strategy avoids both time-consuming explicit projection learning and the low-quality codes produced by simple tanh-like activations. More importantly, our ABC can be flexibly embedded into any similarity regression network (e.g., a deep triplet network) and optimized jointly with the network in an end-to-end manner. The main contributions of this paper are summarized as follows:
1) We propose a binary transformation strategy based on deep adversarial learning. The proposed architecture is composed of a CNN for feature extraction and a discriminator network for distinguishing real-valued and binary features, where the CNN is guided to generate features in binary form to confuse the discriminator. Thus, the features are implicitly regularized into binary codes.
2) An end-to-end deep neural network that seamlessly accommodates the above adversarial binary coding module is built for efficient ReID. We jointly optimize the binary transformation and the similarity measurement. Consequently, the discriminative information is largely preserved during feature binarization.
2 Related work
Person re-identification:
Traditional approaches usually propose certain feature learning algorithms for ReID, including low-level color features [1, 16, 17] and local gradients [18, 19, 20, 4], as well as high-level features [2, 21, 3]. Due to the breakthrough performance of deep neural networks, deep learning based ReID methods [13, 22, 23, 24, 25, 26, 27] have increasingly been proposed. For instance, siamese CNNs [28, 29, 30] and triplet CNNs [31, 32, 6] are widely used for similarity measurement. Very recently, several binary coding based approaches [33, 34, 8, 9] have emerged to deal with the high computation and storage costs of ReID problems.
Generative adversarial nets:
GANs [11, 12] provide a methodology to map random variables from a simple distribution to a certain complex one, and have been widely used in image generation [12, 35, 36, 37], style transfer [38, 39], and latent feature learning [40, 41, 42]. To stabilize and quantify the training of GANs, a breakthrough named Wasserstein GAN (WGAN) was proposed in [43] and improved in [44]. More recently, GANs have also been utilized for image retrieval problems. In [45], GANs were adopted to distinguish synthetic and real images, aiming to improve the discriminability of binary codes. GANs were also employed to enhance the intermediate representation of the generator in [46]. However, these studies still simply adopted tanh-like activations for binarization. To the best of our knowledge, our ABC is the first work that directly adopts the spirit of adversarial learning to perform the binary transformation for efficient ReID.
3 Approach
The proposed framework transforms high-dimensional real-valued features into compact binary codes based mainly on adversarial learning. In the following, we first briefly review the principles of GANs in Section 3.1. Then, we introduce adversarial binary coding (ABC) in detail in Section 3.2. In Section 3.3, we present the joint end-to-end framework with triplet networks for efficient ReID.
3.1 A Brief Review of GANs
In the framework of GANs, a generator $G$ competes against an adversary, i.e., a discriminative model $D$ that learns to determine whether a sample comes from the model distribution or the data distribution. To learn the generator's distribution $p_g$ over data $x$, a prior on input noise variables $p_z(z)$ is defined, and a mapping to the data space is denoted as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a deep neural network with parameters $\theta_g$. Meanwhile, $D(x)$ represents the probability that $x$ comes from the data rather than from $p_g$. $D$ is trained to maximize the probability of assigning the correct label to both samples from the real data and samples from $G$. Simultaneously, $G$ is trained to confuse $D$. The formal loss function is defined as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (1)$$
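As a concrete illustration, the value function in Eq. (1) can be evaluated directly from discriminator outputs. The following is a minimal NumPy sketch; the function name is ours, not part of the paper:

```python
import numpy as np

def gan_value(d_real, d_fake):
    # Value function V(D, G) of Eq. (1): the discriminator maximizes
    # E[log D(x)] + E[log(1 - D(G(z)))], while the generator minimizes it.
    # d_real / d_fake are discriminator outputs in (0, 1).
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
```

An undecided discriminator (outputting 0.5 everywhere) yields the well-known value $-2\log 2$, while a discriminator that separates real from fake pushes the value toward 0.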
However, GANs are difficult to train, and the generator may fail to generate either real-looking or diverse samples. Arjovsky et al. [47, 43] addressed this problem by introducing the WGAN, which optimizes the Wasserstein loss instead of the Jensen-Shannon divergence to evaluate the similarity between distributions. The Wasserstein loss is defined as follows:
$$\min_G \max_{D \in \mathcal{D}} \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] \quad (2)$$
where $\mathcal{D}$ is the set of 1-Lipschitz functions.
It provides stronger stability of gradients based on the Wasserstein-1 distance (also called the Earth-Mover distance). Moreover, WGAN provides meaningful learning curves that are useful for debugging and hyperparameter searching. Therefore, in this work, we adopt the training strategy of WGAN for adversarial learning.
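In implementation terms, Eq. (2) splits into a critic objective and a generator objective. A minimal sketch of the two losses, assuming unbounded critic outputs (the 1-Lipschitz constraint, enforced by weight clipping or a gradient penalty, is omitted here):

```python
import numpy as np

def critic_loss(d_real, d_fake):
    # The critic maximizes E[D(x)] - E[D(G(z))]; in practice one
    # minimizes the negation of this difference.
    return float(-(np.mean(d_real) - np.mean(d_fake)))

def generator_loss(d_fake):
    # The generator maximizes E[D(G(z))], i.e. minimizes its negation.
    return float(-np.mean(d_fake))
```

The critic loss shrinks (becomes more negative) as real and fake samples are scored further apart, which is what makes WGAN's learning curves interpretable.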
3.2 Adversarial Binary Coding
Our binary coding scheme is directly inspired by GANs. Instead of formulating explicit hashing functions (i.e., learning explicit projections), we implicitly guide a deep neural network to learn the transformation of data from the original distribution (i.e., images) to a distribution of binary vectors in a GAN framework. In this section, we focus on the binary transformation module of the end-to-end efficient ReID framework. How to preserve the semantic/discriminative information during the transformation is explained in Section 3.3.
The proposed framework of adversarial binary coding is illustrated in Fig. 2. The feature extractor can be any CNN architecture (ResNet-50 [48] is adopted in this work), which represents the images as feature vectors. Meanwhile, a binary code sampler performs random sampling for every bit of the binary vectors. To satisfy the principles of effective binary coding mentioned in [10], we sample from the Bernoulli distribution, under which each bit has a 50% chance of being 0 or 1, and different bits are independent of each other. The discriminator is expected to classify the binary vectors as positive samples and the real-valued feature vectors as negative samples. Thus, the extractor is trained to generate feature vectors that follow the same distribution as the positive samples, using the Wasserstein loss (W loss) in Eq. (2).
Formally, we denote a batch of images as $X = \{x_i\}$ under the distribution $p_{\text{data}}$. The feature extractor is denoted as a mapping function $E$, which plays the role of the generator in the definition of GANs under an encoding distribution $p_E$, where $z = E(x)$ denotes the extracted feature vectors. $E$ aims to transform data from the original distribution $p_{\text{data}}$ to a target distribution $p_B$:
$$\min_E W(p_E, p_B), \quad (3)$$
where $W(\cdot, \cdot)$ denotes the Wasserstein distance and $p_B$ is the distribution of the sampled binary vectors.
Since a binomial distribution is equivalent to multiple independent Bernoulli samplings with the same probability, the extractor is essentially regularized by matching the posterior $p_E$ to a prior binomial distribution using the Wasserstein distance.
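The binary code sampler described above can be sketched in a few lines; the 'real' samples handed to the discriminator are simply independent Bernoulli(0.5) draws, which makes balancedness and bit independence hold by construction (function name and seeding are ours):

```python
import numpy as np

def sample_target_codes(batch_size, code_length, seed=0):
    # Each bit is an independent Bernoulli(0.5) draw, so the sampled
    # codes are balanced (each bit is 0 or 1 with equal probability)
    # and the bits are mutually independent.
    rng = np.random.default_rng(seed)
    return rng.binomial(1, 0.5, size=(batch_size, code_length)).astype(np.float32)
```

Over a large batch, the empirical mean of every bit position concentrates around 0.5, which is exactly the balancedness principle of [10].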
As mentioned above, we use ResNet-50 [48] as the backbone model, where the Rectified Linear Unit (ReLU) is adopted as the activation function. Hence, we represent every bit of the binary codes by $\{0, 1\}$ instead of $\{-1, 1\}$ [34, 9], due to the non-negative outputs of ReLU. We further find that the performance severely deteriorates if the feature vectors and binary codes are directly fed into the discriminator and the similarity regression loss (e.g., the triplet loss in Fig. 3) without normalization, due to the contradiction between the expected 0 or 1 outputs and the learning algorithm. More concretely, the weights of the neural network are generally initialized to very small values (much smaller than 1). Meanwhile, the learning algorithm carefully controls the scale of the weights (e.g., via learning rates and weight decay) to avoid gradient vanishing or exploding under the loss function. As a result, the features extracted by the network will also be very close to 0, since they share the same scale as the weights. On the contrary, our ABC expects every dimension of the output features to be constrained near 0 or 1. As a consequence, we encounter an unstable optimization process if no normalization is adopted.
To address the above issue, we normalize both the output feature vectors and the sampled binary codes to the same scale via $\ell_2$ normalization. For the real-valued features, we adopt the standard $\ell_2$-norm operation. For the binary codes, we perform the normalization as follows. Given a batch of random binary vectors $B = \{b_i\}$, where $b_i \in \{0, 1\}^k$ and $k$ is the code length, the binary vectors can be directly normalized as follows:
$$\hat{b}_i = \frac{b_i}{\|b_i\|_2} \quad (4)$$
However, the $\ell_2$-norm of every vector could be different, since every binary vector may contain a different number of bits assigned to 1. In other words, the values of the non-zero entries in the normalized vectors will differ, which leads to an unstable training process where the losses are unable to guide the optimization clearly. Therefore, in this study, we adopt the expectation of the Bernoulli distribution to calculate the $\ell_2$-norm of the binary vectors. Specifically, we calculate a uniform normalization factor as:
$$\eta = \sqrt{k \cdot \mathbb{E}[b]} = \sqrt{k/2}, \quad (5)$$
where $\mathbb{E}[b] = 0.5$ represents the expectation of the Bernoulli random variables, and thus the binary vectors can be normalized as $\hat{b}_i = b_i / \eta$.
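A minimal sketch of this uniform normalization, assuming Eq. (5) computes $\eta = \sqrt{k \cdot \mathbb{E}[b]}$ with $\mathbb{E}[b] = 0.5$ (the function name is ours):

```python
import numpy as np

def normalize_codes(codes):
    # Uniform factor eta = sqrt(k * E[b]) with E[b] = 0.5 for
    # Bernoulli(0.5) bits, so every non-zero entry of every
    # normalized vector takes the same value 1 / sqrt(k / 2).
    k = codes.shape[1]
    eta = np.sqrt(k * 0.5)
    return codes / eta
```

Unlike the per-vector normalization of Eq. (4), every non-zero entry now has the identical magnitude regardless of how many bits in the vector are 1, which is the property the text argues stabilizes training.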
3.3 Triplet Loss based Efficient ReID Framework
To not only transform features into binary form but also measure similarities between binary codes, the ABC is further embedded into a triplet network to ensure the discriminability of the learned binary codes. The triplet loss [49] is formulated as follows:
$$L_{\text{trip}} = \sum \max\left(0, \; \mathcal{D}(x_a, x_p) - \mathcal{D}(x_a, x_n) + m\right), \quad (6)$$
where $x_a$, $x_p$, and $x_n$ are input features, $m$ is the imposed distance margin between positive and negative pairs, and $\mathcal{D}(\cdot, \cdot)$ measures the similarity distance. $x_a$ and $x_p$ are features from the same class (the same identity in ReID), and $x_n$ is from another class (a different identity). The triplet loss forces the distances between samples in negative pairs to be larger than those in positive pairs. Therefore, it is widely used in tasks that aim to retrieve data with high relevance.
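For a single triplet, the hinge in Eq. (6) with Euclidean distances can be sketched as follows (a minimal illustration, not the batched implementation used in training):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge form of Eq. (6): the anchor-negative distance must exceed
    # the anchor-positive distance by at least `margin`, otherwise a
    # positive loss is incurred.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))
```

When the negative is already farther than the positive by the margin, the loss is zero and the triplet contributes no gradient; otherwise the loss grows linearly with the violation.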
The overall framework for efficient ReID is shown in Fig. 3. ResNet-50 pre-trained on ImageNet [50] is adopted as the backbone model, where the fixed average pooling layer is replaced by an adaptive average pooling layer to fit different input sizes, followed by a feature embedding (fully-connected) layer to reduce the feature dimension to the expected length. At the beginning of training, we fine-tune the model on pedestrian images with the Cross Entropy Error (CEE) loss by solving a conventional classification problem, i.e., each class contains the images of one person. Fine-tuning a model pre-trained on a large image collection has been verified as an effective approach for knowledge transfer to small datasets; it also helps deep networks with limited data to find optimal parameters more easily and to converge faster than training from scratch. Note that the outputs of the last layer are not normalized by the $\ell_2$ norm in this phase, just as in conventional CNNs for image classification. After that, we train the model with normalization as shown in Fig. 3, by jointly optimizing the Wasserstein loss for binary coding and the triplet loss for similarity measurement. For the composition of triplet batches, we randomly select different persons and pick two images from different views of each person to be the anchor and positive samples. Then we randomly select an image of a person different from the anchor as the negative sample in each triplet.
Notably, in the training phase we adopt the Euclidean distance to measure similarities between the real-valued features for the triplet loss, without binarization, because the Euclidean distance provides conspicuously more stable gradients than the Hamming distance while yielding equivalent distance rankings. In this way, the triplet loss focuses on reducing intra-class distances and enlarging inter-class distances in terms of the real-valued features, whilst the Wasserstein loss focuses on the binary transformation of the real-valued features.
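The claimed equivalence is easy to verify for exact $\{0,1\}$ vectors: the squared Euclidean distance equals the Hamming distance, so the two measures rank gallery samples identically once the features are (near-)binary. A small check:

```python
import numpy as np

def squared_euclidean(a, b):
    # Squared L2 distance between two vectors.
    return float(np.sum((a - b) ** 2))

def hamming(a, b):
    # Number of positions where the two vectors disagree.
    return int(np.sum(a != b))
```

For 0/1 entries, $(a_i - b_i)^2$ is 1 exactly when the bits differ and 0 otherwise, so the two quantities coincide term by term.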
In the testing phase, images are fed into the trained CNN to obtain the real-valued features, of which every entry should be very close to a binary value, i.e., $\epsilon$ or $1 - \epsilon$, where $\epsilon$ is a very small value. Finally, we binarize the features as follows:
$$b_i = \begin{cases} 0, & f_i < 0.5 \\ 1, & f_i \geq 0.5 \end{cases} \quad (7)$$
where $f_i$ is the value of the $i$-th entry of a real-valued feature extracted by $E$, and $b_i$ is the corresponding binary bit after binarization. The Hamming distances between queries and the gallery set are then computed using extremely fast XOR operations to measure similarities.
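The test-time pipeline, thresholding per Eq. (7) and then matching with XOR plus a popcount, can be sketched as follows (bit packing into a Python integer is our illustrative choice; a real implementation would pack into machine words):

```python
def binarize(features, threshold=0.5):
    # Threshold each near-binary entry as in Eq. (7) and pack the bits
    # into one integer, most significant bit first.
    code = 0
    for f in features:
        code = (code << 1) | (1 if f >= threshold else 0)
    return code

def hamming_distance(code_a, code_b):
    # A single XOR marks the differing bits; counting the set bits
    # (popcount) gives the Hamming distance.
    return bin(code_a ^ code_b).count("1")
```

Ranking the gallery then reduces to one XOR and one popcount per gallery code, which is what makes the query time in the experiments so small.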
4 Experiments
We evaluate the performance of our method on three large-scale ReID datasets: CUHK03 [13], Market-1501 [14], and DukeMTMC-reID [15]. The goal of our experiments is mainly to answer three research questions: how efficient are our binary codes in computation and storage (Section 4.2); how does our method compare with binary coding based methods (Section 4.3); and how does it compare with state-of-the-art non-hashing methods (Section 4.4)?
4.1 Datasets and Settings
CUHK03 contains 14,096 images of 1,467 identities captured by six surveillance cameras. The dataset provides both manually labeled and automatically detected bounding boxes of varying sizes. In the experiments, we resize the images to 160×60, and the 20 training/testing splits reported in [13] are used. The number of training iterations is set to 6,000, and the margin of the triplet loss is initialized to 0.2 and increased to 0.3 after 1,000 iterations, 0.4 after 2,500 iterations, and 0.5 after 4,000 iterations.
Market-1501 contains 32,668 automatically detected 128×64 bounding boxes of 1,501 pedestrians under six cameras and provides a fixed evaluation protocol. In the experiments, the number of training iterations is set to 8,000, and the margin of the triplet loss is initialized to 0.2 and increased to 0.3 after 1,000 iterations and 0.4 after 4,000 iterations.
DukeMTMC-reID contains 36,411 manually annotated bounding boxes of 1,404 identities under eight cameras and provides a fixed training/testing split. In the experiments, the images are resized to 128×64. The number of training iterations is set to 8,000, and the margin of the triplet loss is initialized to 0.2 and increased to 0.3 after 2,000 iterations and 0.4 after 5,000 iterations.

Table 1: Rank-1 matching rates (%) of real-valued and binary features with different code lengths.

           CUHK03 (labeled)   CUHK03 (detected)   Market-1501        DukeMTMC-reID
           real    binary     real    binary      real    binary     real    binary
64-bit     51.7    42.0       49.5    43.1        52.2    37.1       58.0    48.8
128-bit    55.3    46.2       54.5    48.4        57.7    44.5       67.9    60.3
256-bit    57.6    52.9       56.8    50.3        61.6    49.6       71.4    63.5
512-bit    60.5    58.2       61.4    57.8        67.5    59.3       75.1    65.5
1024-bit   61.7    60.4       65.6    61.2        70.7    66.8       76.9    69.7
2048-bit   69.4    68.8       68.9    68.1        75.8    73.5       80.3    77.6
We implement our framework based on the PyTorch deep learning library. The hardware environment is a PC with Intel Core CPUs (3.4 GHz), 32 GB memory, and an NVIDIA GTX TITAN X GPU. For all the datasets, the images are horizontally flipped to augment the training samples. The batch size is set to 64 in the pre-training phase and changed to 128 in the subsequent training. The learning rate of the extractor is initialized to 0.001 and decreased to 0.0001 over the iterations. The learning rate of the discriminator is consistently set to 0.01. To ensure stability, we update the GAN alone for 10 iterations after every 20 global optimization iterations. Every GAN iteration consists of 5 iterations of discriminator updating and 1 iteration of generator (extractor) updating. In the experiments, we rerun the comparison methods whose code is publicly available to evaluate their efficiency for fair comparison.
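The alternating update schedule described above can be made concrete with a small bookkeeping sketch; it only counts updates (no actual training) and is our reading of the stated schedule:

```python
def run_schedule(n_joint_iters):
    # After every 20 joint (triplet + Wasserstein) optimization
    # iterations, the GAN alone is updated for 10 iterations, each
    # consisting of 5 discriminator steps and 1 generator step.
    joint = critic = gen = 0
    for i in range(1, n_joint_iters + 1):
        joint += 1
        if i % 20 == 0:
            for _ in range(10):
                critic += 5
                gen += 1
    return joint, critic, gen
```

For example, 40 joint iterations trigger two GAN-only blocks, i.e. 100 discriminator updates and 20 generator updates; the 5:1 critic-to-generator ratio follows standard WGAN practice.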
4.2 Evaluation of Computation and Storage Efficiency



We first evaluate the efficiency of our method with different bit lengths, since shorter binary codes are more efficient but may cause a drop in accuracy, while longer codes do the opposite. The time for retrieving a query (Q. Time) and the memory for storing the gallery features (Mem.) are shown in Fig. 4. As can be seen, the query time and memory consumed by binary codes are far less than those of real-valued features, and the matching time and memory of real-valued features grow significantly faster with the bit length than those of binary features. Besides, we compare the rank-1 matching rates of real-valued and binary features in Table 1. As we can see from the last two rows of the table, binarized features with more bits (e.g., 1024 or 2048) perform only slightly worse than the corresponding real-valued features, which demonstrates that, with sufficient capacity, the discriminative information is well preserved by the binary features using our method. It is also noteworthy that even with 2048 bits, our binary features require much less query time and memory than their real-valued counterparts.
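The storage gap is simple arithmetic: a 2048-bit code occupies 256 bytes, whereas a 2048-dimensional float32 feature occupies 8 KB, a fixed 32× factor. A sketch, assuming float32 storage for the real-valued features:

```python
def gallery_memory_bytes(num_images, dim, binary):
    # Binary codes need dim / 8 bytes per image; real-valued float32
    # features need 4 * dim bytes per image.
    per_image = dim // 8 if binary else dim * 4
    return num_images * per_image
```

For instance, the 19,732 gallery images of Market-1501 with 2048-bit codes take about 4.8 MiB, versus roughly 154 MiB for 2048-dimensional float32 features.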
In addition to the matching time, the time consumed by feature extraction (F. Time) should also be taken into consideration. As the data scale of ReID grows, it becomes necessary to process a large number of queries in a short time. Therefore, we compare the feature extraction time of our method with two state-of-the-art descriptors widely adopted by traditional ReID methods, namely the Local Maximal Occurrence Representation (LOMO) [2] and the Hierarchical Gaussian Descriptor (GOG) [3]. As shown in Table 2, our method extracts features much faster than LOMO and GOG.
Furthermore, we provide the training losses of our framework with a 2048-bit code length in Fig. 5. We can observe that the losses corresponding to the GAN descend steadily as the training proceeds. The triplet loss on Market-1501 is well optimized at every margin value. The triplet losses on CUHK03 and DukeMTMC-reID fluctuate at certain margin values; nevertheless, they reach a steady state by the end of training.
Table 2: Feature extraction time (s).

Method                 CUHK03     Market-1501  DukeMTMC-reID
LOMO [2]               7.50e+00   2.96e+02     2.65e+02
GOG [3]                7.10e+02   2.80e+04     2.51e+04
64-bit ABC+triplet     5.70e-01   5.93e+00     5.05e+00
128-bit ABC+triplet    5.70e-01   6.11e+00     5.09e+00
256-bit ABC+triplet    5.71e-01   6.25e+00     5.17e+00
512-bit ABC+triplet    5.72e-01   6.30e+00     5.19e+00
1024-bit ABC+triplet   5.73e-01   6.38e+00     5.32e+00
2048-bit ABC+triplet   5.77e-01   6.49e+00     5.33e+00
Table 3: Comparison with binary coding based methods on CUHK03.

Method                      Rank 1  Rank 5  Rank 20  mAP   F. Time (s)  Q. Time (s)  Mem. (MB)
DRSCH [34]                  22.0    48.4    81.0     -     -            -            -
DSRH [33]                   14.4    43.4    79.2     -     -            -            -
CSBT [9]                    55.5    84.3    98.0     -     7.50e+00     8.07e-04     1.01e-02
512-bit ABC+triplet (Ours)  58.2    85.7    98.2     59.7  5.72e-01     8.07e-04     1.01e-02
Table 4: Comparison with binary coding based methods on Market-1501.

Method                      Rank 1  mAP   F. Time (s)  Q. Time (s)  Mem. (MB)
CSBT [9]                    42.9    20.3  2.96e+02     2.54e-02     1.52e+00
512-bit ABC+triplet (Ours)  59.3    43.8  6.30e+00     2.54e-02     1.52e+00
4.3 Comparison with Binary Coding based Methods
Here we compare our framework with the following state-of-the-art binary coding (hashing) based ReID methods: 1) deep hashing: Deep Regularized Similarity Comparison Hashing (DRSCH) [34] and Deep Semantic Ranking based Hashing (DSRH) [33]; and 2) non-deep hashing: Cross-camera Semantic Binary Transformation (CSBT) [9]. Since CSBT has already significantly outperformed other hashing methods (e.g., SePH [51], COSDISH [52], and SDH [53]) in ReID according to the results reported in [9], we mainly compare our method with CSBT. The results on the different datasets are shown in Tables 3 and 4, respectively, where the best performance is highlighted in red and the second best in blue.
From Table 3, we can observe that DRSCH and DSRH perform poorly on CUHK03, falling behind the other methods. Our method outperforms the state-of-the-art hashing based ReID method CSBT. The superiority of our method over CSBT becomes greater on the larger dataset, namely Market-1501, as can be seen in Table 4: we achieve a 16.3% higher rank-1 accuracy and double the mean average precision (mAP) of CSBT. This is probably because Market-1501 has much more training data than CUHK03, and the projection learning based CSBT can hardly handle such an amount of data at once. In contrast, our method optimizes the network with mini-batch learning, which makes it possible to train the model on large amounts of data. Moreover, CSBT requires extracting LOMO features in advance, which is much slower than extracting binary codes with our method.
4.4 Comparison with the StateoftheArt Methods
Table 5: Comparison with non-hashing state-of-the-art methods on CUHK03.

Method              Rank 1  Rank 5  Rank 20  mAP   F. Time (s)  Q. Time (s)  Mem. (MB)
DeepReID [13]       19.9    49.8    78.2     -     -            -            -
Improved Deep [22]  44.9    76.4    93.6     -     -            -            -
NSL [54]            54.7    84.8    95.2     -     -            -            2.11e+01
Gated CNN [30]      61.8    80.9    -        51.3  -            -            -
EDM [29]            51.9    83.6    -        -     -            2.85e-02     4.90e-01
SIR+CIR [6]         52.2    -       -        -     -            1.26e-02     5.16e-01
Re-ranking [55]     58.5    -       -        64.7  7.50e+00     -            1.03e+02
SSM [56]            72.7    92.4    -        -     7.18e+02     6.00e-01     2.08e+02
Part-aligned [57]   81.6    97.3    99.5     -     -            -            -
MuDeep [58]         75.6    94.4    -        -     -            -            -
PDC [59]            78.3    94.8    98.4     -     -            -            -
ABC+triplet (Ours)  68.1    90.3    98.3     61.6  5.77e-01     1.85e-03     1.22e-01
Table 6: Comparison with non-hashing state-of-the-art methods on Market-1501.

Method              Rank 1  mAP   F. Time (s)  Q. Time (s)  Mem. (MB)
SDALF [1]           20.5    8.2   -            2.53e+02     9.31e+01
KISSME [16]         40.5    19.0  -            -            -
eSDC [60]           33.5    13.5  1.58e+04     7.47e+02     1.45e+03
BoW (best) [14]     44.4    21.9  -            2.08e+00     1.54e+01
LOMO+NSL [54]       55.4    29.9  2.96e+02     -            4.06e+03
Gated CNN [30]      65.9    39.6  -            -            4.32e+01
SCSP [61]           51.9    26.4  -            -            1.21e+02
SpindleNet [62]     76.9    -     -            -            -
Re-ranking [55]     77.1    63.6  2.96e+02     -            4.06e+03
SSM [56]            82.2    68.8  2.83e+04     1.68e+02     8.21e+03
Part-aligned [57]   81.0    63.4  -            -            -
PDC [59]            84.1    63.4  -            -            -
ABC+triplet (Ours)  73.5    52.9  6.49e+00     7.32e-02     4.82e+00
Table 7: Comparison with non-hashing state-of-the-art methods on DukeMTMC-reID.

Method              Rank 1  mAP   F. Time (s)  Q. Time (s)  Mem. (MB)
BoW+KISSME [14]     25.1    12.1  -            -            -
LOMO+XQDA [2]       30.8    17.0  2.65e+02     -            3.62e+03
Basel.+LSRO [15]    67.7    47.1  -            -            -
Basel.+OIM [63]     68.1    -     -            -            -
SVDNet [64]         76.7    56.8  -            -            -
ABC+triplet (Ours)  77.6    47.9  5.33e+00     6.58e-02     4.31e+00
We also compare our method (2048-bit ABC+triplet) with state-of-the-art non-hashing ReID methods, which mainly include: 1) deep learning based methods, such as DeepReID [13], Improved Deep [22], SIR+CIR [6], EDM [29], Gated CNN [30], SpindleNet [62], SVDNet [64], the Deeply-learned Part-Aligned Representation [57], Multi-scale Deep Learning Architectures [58], and the Pose-Driven Deep Convolutional Model [59]; 2) metric learning based methods, such as KISSME [16], XQDA [2], and NSL [54]; 3) local patch matching based methods, such as eSDC [60] and BoW [14]; and 4) other ReID methods.
The comparison results on the three datasets are shown in Tables 5, 6, and 7, respectively. Our framework achieves competitive matching accuracy compared to the state-of-the-art methods, which adopt high-dimensional real-valued features. It is also obvious that our framework not only outperforms many existing non-hashing approaches, but also achieves significant advantages in terms of matching efficiency. The advantages become more pronounced when the gallery set contains more samples. For instance, the query time of ABC is at least dozens of times faster than that of the non-hashing methods on Market-1501, which has 19,732 gallery samples. Several methods adopt LOMO, which represents images as 26,960-dimensional real-valued features; in contrast, our method represents images as 2048-bit binary codes, which requires far less memory.
4.5 Effects of Different Network Settings
In this section, we embed the proposed ABC into different similarity measuring networks and evaluate the performance under different settings. We first evaluate two types of networks widely used to measure similarity, namely the siamese network [29, 28] and the triplet network. A siamese network receives a pair of images and minimizes the distance between them if they are from the same class and maximizes the distance if they have different labels. The siamese network evaluated in our experiments adopts the same ResNet-50 backbone model as the triplet network and employs the contrastive loss to measure similarity.
From Table 8, we can observe that the siamese network performs worse than the triplet network. This is because the loss of the siamese network is too strict, i.e., it enforces the images of one identity to be projected onto a single point in the subspace. In contrast, the triplet loss allows the images of one person to lie on a manifold, while enforcing larger distances between different persons' images. We can also observe that embedding the ABC into a triplet network achieves better results than embedding it into a siamese network.

Table 8: Results with different network settings ('+ℓ2' denotes the proposed ℓ2 normalization).

                           CUHK03 (labeled)  CUHK03 (detected)  Market-1501    DukeMTMC-reID
                           Rank 1   mAP      Rank 1   mAP       Rank 1  mAP    Rank 1  mAP
2048-dim siamese           62.0     59.4     61.5     56.1      67.2    48.8   75.0    43.8
2048-dim triplet           70.8     66.6     69.7     63.1      75.2    54.1   82.0    48.8
2048-bit ABC+siamese       49.2     35.4     45.6     31.2      52.7    21.8   56.9    29.7
2048-bit ABC+triplet       52.3     43.7     50.9     38.1      55.8    27.5   60.3    27.6
2048-bit ABC+siamese+ℓ2    61.7     60.4     61.6     60.2      65.7    48.1   70.9    41.7
2048-bit ABC+triplet+ℓ2    68.8     64.5     68.1     61.6      73.5    52.9   77.6    47.9
As explained in Section 3.2, we normalize both the generated features and the binary codes to the same scale to eliminate the conflict between the two modules. Here we also compare the networks with and without the normalization. As can be seen from Table 8, the performance is significantly improved by the normalization.
In addition, we evaluate the effect of fine-tuning on the three datasets. Fig. 6 shows that the triplet losses with fine-tuning converge faster than those without. Since the employed ResNet-50 network is pre-trained on ImageNet, it already captures a variety of useful image features; fine-tuning it further enables learning features specialized for person representation more efficiently than training the network from scratch.
5 Conclusion
In this work, we proposed the adversarial binary coding (ABC) framework for efficient person re-identification, which generates discriminative and compact binary features from pedestrian images. Specifically, our ABC trained a discriminator network to distinguish real-valued features from binary ones, in order to guide the feature extractor network to generate features in binary form under the Wasserstein loss. The ABC framework was further embedded into a deep triplet network to preserve the semantic information of the binary features for the ReID task. Extensive experiments on three large-scale ReID datasets showed that our method outperforms the state-of-the-art hashing based ReID approaches and is competitive with the state-of-the-art non-hashing approaches, whilst reducing time and memory costs significantly. Considering that the triplet network has been overtaken by more recently proposed architectures, a possible future improvement is to combine the ABC framework with more sophisticated similarity measuring frameworks.
References
 Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person reidentification by symmetrydriven accumulation of local features. In: CVPR. (2010)
 Liao, S., Hu, Y., Zhu, X., Li, S.Z.: Person reidentification by local maximal occurrence representation and metric learning. In: CVPR. (2015)
 Matsukawa, T., Okabe, T., Suzuki, E., Sato, Y.: Hierarchical gaussian descriptor for person reidentification. In: CVPR. (2016)
 Liu, K., Ma, B., Zhang, W., Huang, R.: A spatiotemporal appearance representation for videobased pedestrian reidentification. In: ICCV. (2015)
 Shi, Z., Hospedales, T.M., Xiang, T.: Transferring a semantic representation for person reidentification and search. In: CVPR. (2015)
 Wang, F., Zuo, W., Lin, L., Zhang, D., Zhang, L.: Joint learning of singleimage and crossimage representations for person reidentification. In: CVPR. (2016)
 Zheng, L., Yang, Y., Hauptmann, A.G.: Person reidentification: Past, present and future. arXiv preprint arXiv:1610.02984 (2016)
 Zheng, F., Shao, L.: Learning crossview binary identities for fast person reidentification. In: IJCAI. (2016)
 Chen, J., Wang, Y., Qin, J., Liu, L., Shao, L.: Fast person reidentification via crosscamera semantic binary transformation. In: CVPR. (2017)
 Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS. (2009)
 Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS. (2014)
 Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
 Li, W., Zhao, R., Xiao, T., Wang, X.: DeepReID: Deep filter pairing neural network for person reidentification. In: CVPR. (2014)
 Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person reidentification: A benchmark. In: ICCV. (2015)
 Zheng, Z., Zheng, L., Yang, Y.: Unlabeled samples generated by GAN improve the person reidentification baseline in vitro. In: ICCV. (2017)
 Koestinger, M., Hirzer, M., Wohlhart, P., Roth, P.M., Bischof, H.: Large scale metric learning from equivalence constraints. In: CVPR. (2012)
 Pedagadi, S., Orwell, J., Velastin, S., Boghossian, B.: Local fisher discriminant analysis for pedestrian reidentification. In: CVPR. (2013)
 Prosser, B., Zheng, W.S., Gong, S., Xiang, T., Mary, Q.: Person reidentification by support vector ranking. In: BMVC. (2010)
 Martinel, N., Das, A., Micheloni, C., RoyChowdhury, A.K.: Reidentification in the function space of feature warps. IEEE TPAMI 37(8) (2015) 1656–1669
 Lisanti, G., Masi, I., Bagdanov, A.D., Del Bimbo, A.: Person reidentification by iterative reweighted sparse ranking. IEEE TPAMI 37(8) (2015) 1629–1642
 Lan, R., Zhou, Y., Tang, Y.Y.: Quaternionic local ranking binary pattern: A local descriptor of color images. IEEE TIP 25(2) (2016) 566–579
 Ahmed, E., Jones, M., Marks, T.K.: An improved deep learning architecture for person reidentification. In: CVPR. (2015)
 Xiao, T., Li, H., Ouyang, W., Wang, X.: Learning deep feature representations with domain guided dropout for person reidentification. In: CVPR. (2016)
 Zhou, S., Wang, J., Wang, J., Gong, Y., Zheng, N.: Point to set similarity based deep feature learning for person reidentification. In: CVPR. (2017)
 Lin, J., Ren, L., Lu, J., Feng, J., Zhou, J.: Consistentaware deep learning for person reidentification in a camera network. In: CVPR. (2017)
 Panda, R., Bhuiyan, A., Murino, V., RoyChowdhury, A.K.: Unsupervised adaptive reidentification in open world dynamic camera networks. In: CVPR. (2017)
 Chen, W., Chen, X., Zhang, J., Huang, K.: Beyond triplet loss: A deep quadruplet network for person reidentification. In: CVPR. (2017)
 Yi, D., Lei, Z., Liao, S., Li, S.Z.: Deep metric learning for person reidentification. In: ICPR. (2014)
 Shi, H., Yang, Y., Zhu, X., Liao, S., Lei, Z., Zheng, W., Li, S.Z.: Embedding deep metric for person reidentification: A study against large variations. In: ECCV. (2016)
 Varior, R.R., Haloi, M., Wang, G.: Gated siamese convolutional neural network architecture for human reidentification. In: ECCV. (2016)
 Chen, S.Z., Guo, C.C., Lai, J.H.: Deep ranking for person reidentification via joint representation learning. IEEE TIP 25(5) (2016) 2353–2367
 Cheng, D., Gong, Y., Zhou, S., Wang, J., Zheng, N.: Person reidentification by multichannel partsbased cnn with improved triplet loss function. In: CVPR. (2016)
 Zhao, F., Huang, Y., Wang, L., Tan, T.: Deep semantic ranking based hashing for multilabel image retrieval. In: CVPR. (2015)
 Zhang, R., Lin, L., Zhang, R., Zuo, W., Zhang, L.: Bitscalable deep hashing with regularized similarity learning for image retrieval and person reidentification. IEEE TIP 24(12) (2015) 4766–4779
 Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., Belongie, S.: Stacked generative adversarial networks. In: CVPR. (2017)
 Huang, R., Zhang, S., Li, T., He, R., et al.: Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. arXiv preprint arXiv:1704.04086 (2017)
 Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Smolley, S.P.: Least squares generative adversarial networks. In: ICCV. (2017)
 Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired imagetoimage translation using cycleconsistent adversarial networks. In: ICCV. (2017)
 Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Imagetoimage translation with conditional adversarial networks. In: CVPR. (2017)
 Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015)
 Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning. arXiv preprint arXiv:1605.09782 (2016)
 Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., Courville, A.: Adversarially learned inference. arXiv preprint arXiv:1606.00704 (2016)
 Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
 Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028 (2017)
 Qiu, Z., Pan, Y., Yao, T., Mei, T.: Deep semantic hashing with generative adversarial networks. In: SIGIR, ACM (2017)
 Song, J.: Binary generative adversarial networks for image retrieval. In: AAAI. (2018)
 Arjovsky, M., Bottou, L.: Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017)
 He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
 Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: BMVC. (2016)
 Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., FeiFei, L.: ImageNet: A largescale hierarchical image database. In: CVPR. (2009)
 Lin, Z., Ding, G., Hu, M., Wang, J.: Semanticspreserving hashing for crossview retrieval. In: CVPR. (2015)
 Kang, W.C., Li, W.J., Zhou, Z.H.: Column sampling based discrete supervised hashing. In: AAAI. (2016)
 Shen, F., Shen, C., Liu, W., Shen, H.T.: Supervised discrete hashing. In: CVPR. (2015)
 Zhang, L., Xiang, T., Gong, S.: Learning a discriminative null space for person reidentification. In: CVPR. (2016)
 Zhong, Z., Zheng, L., Cao, D., Li, S.: Reranking person reidentification with kreciprocal encoding. In: CVPR. (2017)
 Bai, S., Bai, X., Tian, Q.: Scalable person reidentification on supervised smoothed manifold. In: CVPR. (2017)
 Zhao, L., Li, X., Zhuang, Y., Wang, J.: Deeplylearned partaligned representations for person reidentification. In: ICCV. (2017)
 Qian, X., Fu, Y., Jiang, Y.G., Xiang, T., Xue, X.: Multiscale deep learning architectures for person reidentification. In: ICCV. (2017)
 Su, C., Li, J., Zhang, S., Xing, J., Gao, W., Tian, Q.: Posedriven deep convolutional model for person reidentification. In: ICCV. (2017)
 Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person reidentification. In: CVPR. (2013)
 Chen, D., Yuan, Z., Chen, B., Zheng, N.: Similarity learning with spatial constraints for person reidentification. In: CVPR. (2016)
 Zhao, H., Tian, M., Sun, S., Shao, J., Yan, J., Yi, S., Wang, X., Tang, X.: Spindle Net: Person reidentification with human body region guided feature decomposition and fusion. In: CVPR. (2017)
 Xiao, T., Li, S., Wang, B., Lin, L., Wang, X.: Joint detection and identification feature learning for person search. In: CVPR. (2017)
 Sun, Y., Zheng, L., Deng, W., Wang, S.: SVDNet for pedestrian retrieval. In: ICCV. (2017)