Fast and Accurate Person Re-Identification with RMNet
I Introduction
\lettrine[nindent=0em,lines=3] CNN-based solutions have demonstrated the ability to solve a wide range of computer vision tasks, reaching or even exceeding human-level performance. Moreover, CNNs succeed not only on canonical benchmarks such as ImageNet [1] classification or Cityscapes [2] segmentation, but also on practical problems. One such problem is person re-identification, which is a key component of tracking pipelines.
Unfortunately, many researchers who propose new approaches aim only to beat the current state of the art and pay no attention to inference performance. For solutions that are useful in industry, however, we must take into account the requirement of real-time inference on hardware that customers can afford.
For CNN-based solutions, a common belief is that the choice of backbone is the only thing that needs to change in order to affect inference behavior. There are many examples of backbone architectures, such as MobileNet ([3], [4]) and ShuffleNet ([5], [6]), designed for fast inference in embedded applications. The important point is that, for many users, swapping in one of these backbones is the only change they make to adapt their approach for fast inference. Instead of choosing practices that satisfy their target requirements, users mix components from different and often incompatible areas and, as a result, get worse results than they could.
In this paper, we address this issue by carefully designing an architecture for a specific, narrow task: person re-identification. Our aim is to show that this problem can be solved at a near state-of-the-art level of accuracy and at a significantly higher speed. Our contributions are as follows:

A new lightweight backbone architecture, named RMNet, for fast and accurate inference in mobile applications.

A rethinking of manifold learning techniques for the person re-identification challenge.

A novel lightweight network head that combines the advantages of low- and high-level losses without growth in the number of parameters.
More broadly, this work demonstrates some ways to design a lightweight CNN-based solution for specific (not general) tasks without having to accept that being fast means being inaccurate. The proposed model (a set of models with different trade-offs between speed and accuracy) is available as part of the Intel OpenVINO™ toolkit\footnote{For more information, follow the link: https://software.intel.com/enus/openvinotoolkit}.
Index terms. Person re-identification, manifold learning, local and global structure losses, mobile network architecture, lightweight backbone, RMNet.
II Related Work
i Mobile architectures
\lettrine[nindent=0em,lines=3]Recently, Deep Learning (DL), as an independent tool of Machine Learning, has made a significant leap in CNN architecture development, starting from vanilla networks like VGG [7] and continuing with the ResNet [8] and Inception [9] families. Recent architectures bring key insights into how to deal with persistent DL problems: vanishing/exploding gradients, overparametrization and the resulting overfitting. As a byproduct useful for fast inference, we get some reduction in the computation budget when using ResNet-18 and similar models, which we can call "relatively small". For some simple tasks, the cheap inference speed-up obtained by reducing the depth of a default architecture is enough, and no further investigation is performed. But for mobile applications a further reduction in inference time is needed. To achieve it, techniques like model weight pruning [10] and quantization [11] are used.
The first one is based on the assumption that a trained CNN-based model has parameter redundancy [12], caused by an imperfection of the Stochastic Gradient Descent (SGD) based training procedure, which tends to produce duplicate filters [13]. The main idea of pruning methods is to remove useless parameters without a significant drop in accuracy. Recent papers show that this works well for model compression and inference speed-up [10]. But the parameter redundancy problem can be viewed from another angle: instead of accepting the need for pruning, we can try to train a model directly without parameter redundancy. In this paper we investigate one possible way to achieve that.
Regarding quantization or the more restrictive binarization [14] techniques, we do not consider them because they relate more to edge-specific implementations than to general ideas applicable to a wide range of tasks.
A completely different approach is to design the network architecture directly, accepting some possible degradation in accuracy in exchange for a gain in computation time. The first significant step was the introduction of depthwise separable convolutions [3]. This idea was simple but powerful. At present, all mobile network architectures reuse it, including the one proposed in this paper. To further speed up computation, the MobileNet-v2 [4] architecture focuses on fixing some internal problems with the ReLU [15] activation function by inverting the well-known bottlenecks. We agree with the authors that some problems arise from an incompatibility between SGD properties and the ReLU function. But we have found that replacing ReLU with the ELU [16] activation function, together with some other changes in the backbone, is a more flexible way.
ii Person re-identification
\lettrine[nindent=0em,lines=3]The person re-identification task is formulated as learning a parametric mapping function that maps semantically similar points from the image space onto close points in the embedding space. During inference, a pair of input images is compared by a norm-based or cosine distance between the embedding vectors.
Currently, the best working practices either use a Siamese network [18] with an appropriate target function like the triplet loss [19] or train the model as a classification task with Softmax and cross-entropy loss [20]. More recent work reuses the AM-Softmax loss [21] from the closely related face recognition challenge.
Further improvements in person re-identification came from jointly training with both metric learning approaches (triplet and AM-Softmax losses), incorporating a form of attention by slicing images into horizontal stripes [22], aggregating embeddings from different levels [23], and mixing the previous attempts in a single network without regard for the computation budget [24].
Another line of work on the person re-identification challenge relies on hard sample mining techniques, both for the triplet loss and for joint training [25].
III Backbone design
i Top-Down architecture design
\lettrine[nindent=0em,lines=3]As mentioned previously, the evolution of network architectures has moved from regular structures, where the representation power comes from simple stacking of convolution layers, to architectures that fuse representations from different levels into a single stage. The latest trend is to focus network design on variations of the building blocks, such as the bottlenecks in the ResNet architecture. This strategy is followed by the recent mobile architectures MobileNet and ShuffleNet.
Regarding the design of a network for mobile applications, we can follow one of two possible approaches. The first is a "bottom-up" approach, which is based on discovering inference bottlenecks and then fixing them. The most prominent example of this approach is the ShuffleNet-v2 [6] architecture. It sets a strong baseline by excluding as many memory-consuming operations as possible. Generally speaking, it is a good attempt to build a fast network first, but without any attention to the target task. Final accuracy in this case is mostly the result of a lucky choice of architecture; otherwise, the only proposed remedy is to increase the model size.
The other approach is "top-down". It consists of defining key requirements that cannot be omitted and then growing the network building blocks from them. Such requirements do not need to be of the same logical level: often the list is composed of high-level architectural decisions (shallow or deep network) and low-level operations. All subsequent steps aim to merge the requirements into a single multi-level solution. These steps are not limited to architecture design; they may also include initialization tricks and a more sophisticated training procedure.
Of course, the key requirements combined with a limited computation budget may turn out to be contradictory. Fortunately, our purpose is not to develop a solution for some general task (e.g. ImageNet [1] classification or COCO [28] detection). This gives hope that relaxing the generality of the model yields a realizable trade-off. In this paper we follow the "top-down" approach, and the next sections show our vision of directly building a lightweight model for the specific person re-identification task.
ii Deep vs Shallow networks
\lettrine[nindent=0em,lines=3]When designing a network under a limited computation budget, we face the well-known dilemma of deep versus shallow architectures. Most often the choice is to cut down a general architecture to satisfy the restriction on the maximal number of FLOPs per network input. At the architecture level this means aggressive use of pooling operations in the early stages (e.g. [6]). On the one hand, the pooling operator should provide a transformation that is equivariant to translations. Unfortunately, in other tasks aggressive pooling prevents the extraction of accurate high-level features.
On the other hand, we cannot give up pooling operators, because they are a lightweight way to control the number of FLOPs at each scale level by changing the spatial resolution of the feature map. In addition, we can vary the number of blocks at each scale and the width of each block. Unfortunately, most users take the easiest way: restricting the number of blocks without changing the blocks themselves.
In this paper we defend the position that the key component of a robust feature extractor is the depth of the network (in terms of the number of convolutions on the longest path from the network input to its output). The backbone design is driven by the need to maintain gradient flow during training and is therefore based on the ResNet family. According to the results of [29], the residual structure of bottlenecks can be interpreted as iterative feature enhancement on a single representation level (the level border is defined by the downscaling operations). It is also important to note that the ResNet-18 and even ResNet-50 architectures are too "shallow" for our purposes. We should talk about at least a hundred layers.
But by choosing a deep architecture, we face the need to make each residual block as light as possible. The simplest way is to follow the mainstream practice of using bottlenecks with two consecutive convolutions instead of the original ones [30]. Alternatively, we can think about network depth in terms of representation power [31]. For us this means that the choice between two and three convolutions in the bottleneck is decided in favor of three convolutions with a nonlinearity after each.
Finally, we can formulate the list of key requirements that form the basis of the presented backbone architecture (Figure 1 shows our line of reasoning on the way to a lightweight network):

A very deep network with about a hundred layers.

A ResNet-like architecture.

Residual blocks with three convolutions and a nonlinearity after each.
iii RMNet backbone
\lettrine[nindent=0em,lines=3]We now have a general vision of the backbone design and the support points for fitting a model to the target computation budget. As mentioned earlier, ResNet-like bottlenecks consist of three convolutions: the first maps the input onto an internal representation while reducing the number of channels, the next (internal) convolution carries out spatial mixing, and the last maps the internal representation back onto the input manifold.
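\begin{table}
\caption{RMNet backbone architecture.}
\begin{tabular}{lccc}
\hline
Name & Times & Stride & Channels \\
\hline
Input & & & 3 \\
conv & 1 & 2 & 32 \\
RMblock & 4 & 1 & 32 \\
RMblock & 1 & 2 & 64 \\
RMblock & 8 & 1 & 64 \\
RMblock & 1 & 2 & 128 \\
RMblock & 10 & 1 & 128 \\
RMblock & 1 & 2 & 256 \\
RMblock & 11 & 1 & 256 \\
\hline
\end{tabular}
\end{table}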
The first step to reduce the number of operations is to replace the internal convolution with its depthwise variant [3]. But, unlike the usual depthwise separable convolution practice [32], we preserve the nonlinearity after the internal convolution to leave the representation power of the network unchanged. Unfortunately, this reduction is not enough, and the last support point should be used too: the channel reduction factor applied in the internal convolution. In this paper we need to use a strong reduction factor. Moreover, the maximal number of channels is limited to 256.
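To make the effect of these two support points concrete, the following minimal sketch counts convolution weights in one bottleneck with and without a depthwise internal convolution. The kernel sizes (1x1, 3x3, 1x1) and the reduction factor of 4 are assumptions chosen for illustration; the function name `bottleneck_params` is hypothetical and the numbers are not taken from the paper.

```python
def bottleneck_params(channels, reduction, depthwise):
    """Count convolution weights (biases ignored) in a 1x1 -> 3x3 -> 1x1 bottleneck.

    Assumed layout: the first 1x1 convolution reduces `channels` by `reduction`,
    the 3x3 convolution mixes spatially, the last 1x1 convolution restores `channels`.
    """
    inner = channels // reduction
    reduce_1x1 = channels * inner
    if depthwise:
        # one 3x3 filter per internal channel
        spatial_3x3 = 3 * 3 * inner
    else:
        # a full 3x3 convolution mixes all internal channels
        spatial_3x3 = 3 * 3 * inner * inner
    expand_1x1 = inner * channels
    return reduce_1x1 + spatial_3x3 + expand_1x1


# Hypothetical setting: 256 channels (the stated maximum) and reduction factor 4.
regular = bottleneck_params(256, 4, depthwise=False)
light = bottleneck_params(256, 4, depthwise=True)
print(regular, light)
```

Under these assumptions the depthwise variant needs roughly half of the weights of the regular bottleneck, and a stronger reduction factor shrinks the two 1x1 convolutions further.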
Another non-obvious question is the choice of activation function. The common practice is to use the ReLU [15] nonlinearity. However, using ReLU in deep networks has been found to have a negative effect ([4], [30]), which is connected with the well-known sparsity of activations. This sparsity in the forward pass affects the backward pass too, producing sparse gradients and, as a consequence, slower convergence. Researchers propose different solutions, but we follow a simpler way and replace ReLU with the ELU [16] activation function. As described later, it dramatically changes the behavior of the network.
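The gradient sparsity argument can be illustrated with a small NumPy sketch. It compares the fraction of exactly zero derivatives of ReLU and ELU on random pre-activations; the standard normal input and the value `alpha=1.0` are assumptions made only for the illustration.

```python
import numpy as np


def relu_grad(x):
    # derivative of ReLU: 1 for positive inputs, 0 otherwise
    return (x > 0).astype(np.float64)


def elu_grad(x, alpha=1.0):
    # derivative of ELU: 1 for positive inputs, alpha * exp(x) otherwise
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))


rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(10_000)

# Fraction of units that pass no gradient at all in the backward pass.
relu_zero_fraction = np.mean(relu_grad(pre_activations) == 0.0)
elu_zero_fraction = np.mean(elu_grad(pre_activations) == 0.0)
print(relu_zero_fraction, elu_zero_fraction)
```

With zero-centered inputs, about half of the ReLU units block the gradient entirely, while every ELU unit passes a (possibly small) non-zero gradient.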
The next important question is the utilization of model parameters. The effectiveness of pruning methods [33] shows that not all learnt model parameters are useful for the target task. For general architectures with millions of parameters this is expected behavior, but our network design with strong channel reduction cannot afford to leave any parameters unused.
IV ReID network
i Manifold learning
\lettrine[nindent=0em,lines=3]As described earlier, the goal of person re-identification based on metric learning is to learn a parametric function whose embedding vectors can be compared with a simple norm. For us this means that the learning process can be interpreted as forming a target manifold with desired properties.
Generally speaking, each loss function affects different aspects of the final manifold. In light of this, we can divide them into two big families: global and local structure losses. Consider a set of appearances of different instances. Our goal is to find a transformation after which the appearances of the same instance are closer to each other than to those of different instances. On the one hand, we can select a single appearance (center) of each instance and learn the mapping by forcing the other appearances to be close to the center of their instance. In other words, we define a global rule for the mapping function. This is the nature of the first family of losses. Examples include different modifications of Softmax with cross-entropy loss (see eq. 1). In this paper we focus on a variant with large margins between classes, the AM-Softmax loss [21] (see eq. 2).
(1) 
(2) 
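For reference, the two global structure losses are commonly written in the literature as shown below. This is the standard textbook form, not necessarily the authors' exact notation: $N$ (batch size), $C$ (number of identities), $f_i$ (embedding of sample $i$), $y_i$ (its label), $W_j$ (weights of class $j$), $s$ (scale) and $m$ (additive margin) are symbols assumed here.

```latex
\[
L_{\mathrm{softmax}} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{e^{W_{y_i}^{\top} f_i}}{\sum_{j=1}^{C} e^{W_{j}^{\top} f_i}}
\]
\[
L_{\mathrm{AMS}} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{e^{s(\cos\theta_{y_i} - m)}}
           {e^{s(\cos\theta_{y_i} - m)} + \sum_{j \neq y_i} e^{s\cos\theta_{j}}},
\qquad
\cos\theta_{j} = \frac{W_{j}^{\top} f_i}{\lVert W_{j}\rVert\,\lVert f_i\rVert}
\]
```

The margin $m$ is subtracted only from the target-class cosine, which forces a gap between classes on the unit hypersphere.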
On the other hand, we can follow the Hebbian learning rule [37], which states that local rules of interaction between elements define the global order of the system. This learning strategy is implicitly represented by the triplet loss [19] family. Unfortunately, the main drawback of triplets is the sampling procedure, which significantly affects the final model accuracy [25].
Recent papers proposed merging both loss families into a single training procedure and achieved state-of-the-art results [24]. In our opinion, better performance can be achieved by eliminating the triplets and splitting them into two constituent forces: push and pull losses [38]. In this paper we follow the same strategy of splitting triplets into components, thereby overcoming the sampling issues, but we supplement the default margins with a "smart" variant as in [27]. As a result, we have three local structure losses: Center (eq. 3), PushPlus (eq. 4) and GlobPushPlus (eq. 5).
(3) 
(4) 
(5) 
The total loss used to train the model is a weighted sum of the global and local losses (the weights are chosen to equalize the impact of each loss on the total sum):
(6) 
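The idea of separate pull and push forces combined by a weighted sum can be sketched as follows. This is only an illustration on L2-normalized embeddings with a cosine distance and a fixed margin; it is not the authors' definition of the Center, PushPlus or GlobPushPlus losses, and the function names, the margin value and the weights are all hypothetical.

```python
import numpy as np


def center_loss(embeddings, labels, centers):
    # pull force: mean cosine distance between each sample and its class center
    cos = np.sum(embeddings * centers[labels], axis=1)
    return float(np.mean(1.0 - cos))


def push_loss(embeddings, labels, margin=0.5):
    # push force: hinge on the cosine distance between samples of different identities
    dist = 1.0 - embeddings @ embeddings.T
    different = labels[:, None] != labels[None, :]
    if not different.any():
        return 0.0
    return float(np.mean(np.maximum(0.0, margin - dist[different])))


def total_loss(losses, weights):
    # weighted sum of named loss values
    return sum(weights[name] * value for name, value in losses.items())


# Toy example: two samples of identity 0 and one sample of identity 1 (unit vectors).
embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 0, 1])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])

losses = {
    "center": center_loss(embeddings, labels, centers),
    "push": push_loss(embeddings, labels),
}
print(total_loss(losses, {"center": 1.0, "push": 1.0}))
```

Because each force is computed over all available pairs rather than over sampled triplets, no triplet sampling strategy is needed in this formulation.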
ii Re-identification head
\lettrine[nindent=0em,lines=3]The last component of our network is the re-identification head, which maps a point from the internal representation (the backbone output) onto the final embedding, which can be compared with others by the cosine (or a norm-based) distance. The common choice is to use a fully connected (FC) layer on top of the backbone output. Unfortunately, FC layers are too wasteful of computation resources and cannot be used in mobile networks.
Another option is to use global pooling operators such as max- or average-pooling. As reported in [39], this approach includes a form of spatial attention due to pooling over all spatial locations of the feature map. We follow the same solution and use a global max-pooling (GMP) operator to collapse the spatial dimensions. The proposed re-identification head is shown in Figure 3.
Our re-identification head has two key components. The first is an inverted bottleneck after the GMP operator: a convolution increases the number of channels from 256 to 512, and another one compresses it back to 256 (an attempt to leap into a high-dimensional space where class separation can be solved by a linear transformation). The second is the separation of the support points of the global and local structure losses: we extract an internal representation that is trained with local structure losses only, and then calibrate it by training with global structure losses. For both representations we use normalization to follow the restrictions that AM-Softmax imposes on the embeddings (to be compatible with the cosine similarity measure). Finally, the network output is the last, calibrated embedding.
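A rough NumPy sketch of such a head is given below. After global max-pooling the spatial size is 1x1, so each convolution reduces to a matrix product. The exact wiring of Figure 3 is not recoverable from the text, so the placement of the ELU, the single linear calibration step and the names `reid_head`, `w_expand`, `w_compress` and `w_calibrate` are assumptions.

```python
import numpy as np


def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)


def reid_head(feature_map, w_expand, w_compress, w_calibrate):
    """feature_map: (256, H, W) backbone output; returns two 256-d unit vectors."""
    pooled = feature_map.max(axis=(1, 2))       # global max-pooling -> (256,)
    hidden = elu(w_expand @ pooled)             # inverted bottleneck: 256 -> 512
    local = l2_normalize(w_compress @ hidden)   # 512 -> 256, target of local losses
    final = l2_normalize(w_calibrate @ local)   # calibrated, target of global losses
    return local, final


# Usage with random weights and a hypothetical 8x4 feature map.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((256, 8, 4))
w_expand = rng.standard_normal((512, 256)) * 0.05
w_compress = rng.standard_normal((256, 512)) * 0.05
w_calibrate = rng.standard_normal((256, 256)) * 0.05

local_embedding, final_embedding = reid_head(feature_map, w_expand, w_compress, w_calibrate)
print(local_embedding.shape, final_embedding.shape)
```

Only `final_embedding` would be used at inference time; the intermediate embedding exists so that the two loss families act on different points of the head.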
V Implementation details
i Network architecture
\lettrine[nindent=0em,lines=3]The proposed network consists of two consecutive components: a lightweight feature extractor (the RMNet-based backbone) and a single re-identification head. To reduce the total inference time, we follow the fully convolutional network (FCN) practice and do not use any FC layers. Moreover, we avoid multi-branch solutions [24] and the concatenation of embeddings from different layers [23].
The network outputs a normalized embedding vector with 256 elements, which can be compared with another one in a pairwise manner using the cosine similarity measure.
ii Optimization
\lettrine[nindent=0em,lines=3]All experiments were carried out in the Caffe framework [40]. We use SGD with momentum and decay the learning rate every 50k iterations, starting with .
To initialize the network parameters we use a mixed strategy: the input convolutions of each bottleneck are initialized orthogonally [34] and the remaining weights are initialized using the MSRA method [41]. Before running the main experiments, we pretrained the backbone on the OpenImages dataset [42] by fitting a classification task on the extracted object crops ( input size).
One more important step in training a lightweight network that utilizes a significant part of its parameters and avoids the need for pruning is dropout regularization [43] in each block (dropout ratio is set to ). But dropout regularization reduces the total network capacity, which is unsuitable for our initially small model. To overcome this issue, we disable dropout on the late iterations (when the learning rate is small enough) and continue training without it. This strategy allows us to form the manifold structure on early iterations without the threat of overfitting and to use the whole network capacity later.
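The schedule described above can be sketched as two small functions. Only the 50k-iteration step comes from the text; the base learning rate, the decay factor `gamma`, the dropout ratio and the learning-rate threshold are not given in this version of the paper, so the values used below are placeholders.

```python
def learning_rate(iteration, base_lr, step=50_000, gamma=0.1):
    # step decay: multiply the learning rate by gamma every `step` iterations
    return base_lr * gamma ** (iteration // step)


def dropout_ratio(iteration, base_lr, base_ratio, lr_threshold):
    # keep dropout while the learning rate is large, disable it afterwards
    if learning_rate(iteration, base_lr) > lr_threshold:
        return base_ratio
    return 0.0


# Placeholder hyperparameters, chosen only to show the switch.
for it in (0, 50_000, 100_000, 150_000):
    print(it, learning_rate(it, 0.01), dropout_ratio(it, 0.01, 0.1, 0.0005))
```

With these placeholder values dropout is active for the first 100k iterations and switched off once the learning rate falls below the threshold.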
To address the unbalanced data problem (a significant difference in the number of appearances of each identity) we follow the common practice of using a hard sample mining procedure [44]. Our implementation consists of the following steps:


Sample augmented images for each identity from the training dataset.

Estimate the loss value for each sample.

Select the top hardest samples (those with the highest loss values).

Train the network in mini-batches as usual on the hardest samples.

Increase the difficulty of the augmentation and return to the first step.
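The selection step of this loop can be sketched with NumPy as shown below. The share of samples that is kept is not stated in this version of the text, so `fraction` is a hypothetical parameter, as is the function name `select_hardest`.

```python
import numpy as np


def select_hardest(losses, fraction):
    """Return indices of the `fraction` of samples with the highest loss values."""
    keep = max(1, int(round(len(losses) * fraction)))
    order = np.argsort(losses)[::-1]  # indices sorted by descending loss
    return order[:keep]


# Toy per-sample losses for four augmented images.
sample_losses = np.array([0.1, 2.0, 0.5, 1.5])
hardest = select_hardest(sample_losses, fraction=0.5)
print(hardest)
```

The returned indices would then be used to build the mini-batches for the next training round.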
The last component needed to train the network successfully is strong data augmentation with progressively increasing difficulty. The best choice is to use random erasing augmentation [45] in addition to the standard horizontal flip and random crop methods.
VI Experimental Results
i Data
\lettrine[nindent=0em,lines=3]To evaluate the proposed solution we use the Market-1501 dataset [46]. It is a person re-identification benchmark with images from 6 cameras of different resolutions. It is annotated with 1501 identities, of which 751 are used for training and 750 for testing. The training set contains 12936 images, and there are 3368 query images. The gallery set is composed of images from the 750 test identities and of distractor images, 19732 images in total. The most common evaluation scenario uses a single query image.
ii Metrics
\lettrine[nindent=0em,lines=3]We follow the standard procedure and report the mean average precision over all queries (mAP) and the cumulative matching curve (CMC) at rank-1, using the evaluation code provided with the benchmark.
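For a single query, both metrics can be computed from the ranked gallery as in the simplified sketch below. The official benchmark code additionally filters some gallery images for each query, which is omitted here, and the function name `rank1_and_ap` is hypothetical.

```python
import numpy as np


def rank1_and_ap(similarities, is_match):
    """similarities: (G,) query-to-gallery scores; is_match: (G,) boolean labels."""
    order = np.argsort(-similarities)   # gallery sorted by descending similarity
    hits = is_match[order]
    rank1 = float(hits[0])              # 1.0 if the top result is a correct match
    if not hits.any():
        return rank1, 0.0
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    ap = float(np.sum(precision_at_k * hits) / hits.sum())
    return rank1, ap


# Toy gallery of three images: the first and the third share the query identity.
scores = np.array([0.9, 0.8, 0.1])
matches = np.array([True, False, True])
query_rank1, query_ap = rank1_and_ap(scores, matches)
print(query_rank1, query_ap)
```

Averaging `query_ap` over all queries gives mAP, and averaging `query_rank1` gives the rank-1 accuracy.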
It is worth noting that some techniques improve the final result in both metrics. The first common method is to compute embeddings for the original and horizontally flipped images and then concatenate them into a single embedding (with an additional normalization step for use with the cosine similarity measure). In our opinion, this is not a fair way to improve accuracy because it doubles the computation time. Unfortunately, some authors report results without marking that flipping is used. However, to allow comparison with that approach, we also report results with horizontal flipping.
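The flipping trick amounts to two forward passes per image, as the minimal sketch below shows. Here `embed_fn` stands for any embedding network, and `toy_embed` is a hypothetical stand-in used only to make the example runnable.

```python
import numpy as np


def flip_augmented_embedding(embed_fn, image):
    """image: (H, W, C) array; embed_fn returns a 1-D embedding vector."""
    original = embed_fn(image)
    flipped = embed_fn(image[:, ::-1])            # horizontal flip
    joint = np.concatenate([original, flipped])   # twice the embedding size
    return joint / np.linalg.norm(joint)          # renormalize for cosine similarity


def toy_embed(image):
    # stand-in for a network: two corner pixels of the top row, L2-normalized
    v = np.array([image[0, 0, 0], image[0, -1, 0]], dtype=np.float64)
    return v / np.linalg.norm(v)


toy_image = np.array([[[1.0], [2.0]], [[3.0], [4.0]]])
joint_embedding = flip_augmented_embedding(toy_embed, toy_image)
print(joint_embedding)
```

The two calls to `embed_fn` are the reason why the frame rate halves when flipping is enabled.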
The second method is based on re-ranking (RK) techniques [47]. In other words, it is a direct optimization over the compared metrics. We report results with RK too.
iii Ablation study
\lettrine[nindent=0em,lines=3]As announced earlier, we first compare backbone implementations with different activation functions. Our main message in this paper is that the widely used ReLU activation is not a proper choice and gives rise to some problems. To demonstrate this, we measure the ratio between absolute values of filter weights for each convolution layer in the network. Figure 4 shows these ratios for both the ReLU and ELU activation functions. A high ratio means that there are invalid filters at the given level. As can be seen, more than half of the filters in the network trained with ReLU are noisy, and such filters are usually pruned for model compression purposes. The ELU activation function gives a different picture: a significant part of the filters is still useful and no capacity reduction is observed. Due to the low final quality of the model with ReLU activation, all subsequent experiments are performed with ELU.
Table 2 shows the ablation study experiments. The starting point of our experiments is training on our dataset with the AM-Softmax loss only. Generally speaking, this approach should beat SOTA results with a general-purpose backbone. But in our case we are very limited in model capacity and default training fails. In other words, training a lightweight but accurate person re-identification network is really challenging.
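\begin{table}
\caption{Ablation study on Market-1501.}
\begin{tabular}{lcc}
\hline
Method & \multicolumn{2}{c}{Market-1501} \\
 & rank@1 & mAP \\
\hline
AM-Softmax & 78.00 & 60.74 \\
+ HSM & 79.07 & 57.88 \\
+ Center loss & 81.53 & 60.42 \\
+ Disabled dropout & 85.24 & 65.94 \\
+ Push loss & 87.11 & 70.95 \\
+ GlobPush loss & 88.69 & 73.40 \\
+ Smart margins & 90.20 & 78.80 \\
+ Weighted HSM & 91.66 & 81.63 \\
+ Increased resolution & 92.37 & 82.53 \\
\hline
\end{tabular}
\end{table}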
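\begin{table}
\caption{Comparison with state-of-the-art approaches on Market-1501.}
\begin{tabular}{lccccc}
\hline
Method & \multicolumn{2}{c}{Market-1501} & GFLOPs & MParams & FPS \\
 & rank@1 & mAP & & & \\
\hline
GPReID [39] & 92.2 & 81.2 & 8 & 24.66 & 64 \\
DeepPerson [48] & 92.3 & 79.5 & 8 & 24.66 & 64 \\
PCB [22] & 92.4 & 77.3 & 8 & 24.66 & 64 \\
PCB+RPP [22] & 93.8 & 81.6 & 8 & 24.66 & 64 \\
HPM (flip) [23] & 94.2 & 82.7 & & 24.66 & 32 \\
MGN (flip) [24] & 95.7 & 86.9 & & 68.75 & 16 \\
Our (light) & 91.7 & 81.6 & 0.12 & 0.81 & 923 \\
Our (strong) & 92.4 & 82.5 & 0.58 & 0.81 & 268 \\
Our (strong, flip) & 92.5 & 83.1 & & 0.81 & 134 \\
\hline
MGN (RK) [24] & 96.6 & 94.2 & & 68.75 & 16 \\
Our (strong, RK) & 93.1 & 91.1 & 0.58 & 0.81 & 268 \\
\hline
\end{tabular}
\end{table}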
The first step to improve the baseline is to tackle the data imbalance problem. As mentioned earlier, we use the hard sample mining (HSM) procedure (see the description above). In the first experiment only the AM-Softmax loss value is used to order the samples (instead of step ). The impact on the metrics is not significant, but it relieves us from worrying about possible overfitting due to training on plain samples.
In the next steps we dive into the manifold learning approach by introducing different local structure losses: Center, Push and GlobPush. Each step gives a further improvement in both metrics. The most significant impact is achieved by using smart margins for the Push and GlobPush losses. Moreover, as expected, smart margins mostly affect the mAP metric, which reflects the orderliness of the learnt manifold.
It is worth noting that our concerns about the limited model capacity under dropout regularization are confirmed: disabling this regularization on the late iterations brings a significant leap in accuracy for both metrics.
The very last attempt to improve the metrics is to make the sample mining procedure more flexible (smarter in considering sample complexity for losses of different levels) by mixing multiple losses into the ranking criterion. It mostly improves the mAP metric.
To align with other state-of-the-art solutions we should also test different input resolutions. As shown in [39], simply increasing the input size can bring a significant leap in accuracy. For our model the main input resolution is . We also tested a higher resolution and, as can be seen, the result is slightly better, with the expected slowdown in inference time (both model variants can be found in the Intel OpenVINO™ toolkit).
iv Comparison with the state of the art
\lettrine[nindent=0em,lines=3]Table 3 compares the proposed solution with state-of-the-art approaches. Without multi-branching [24] or merging embeddings from different levels [23], our approach achieves good accuracy while being more than one order of magnitude faster at inference. This is due to our lightweight RMNet backbone, used instead of the widely used ResNet-50 architecture.
The proposed combination of loss functions and training strategy allows us to achieve comparable results even though our model has significantly fewer parameters (0.81 vs 25 MParams). Moreover, our solution is in the top 3 by the rank@1 metric and in the top 2 by the mAP metric.
To measure model performance we use the publicly available OpenVINO toolkit and run experiments on an Intel Core i7-6700K CPU. We significantly outperform other solutions on the frames per second (FPS) metric. A person re-identification method can be regarded as a real-time solution if it is able to perform several pairwise comparisons on each frame of the input stream in real time. For example, using our faster model (light, 923 FPS) we can process about 30 persons on each frame in real time. No other state-of-the-art solution is able to do that at the same quality.
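The real-time claim can be checked with simple arithmetic on the Table 3 values. The estimate assumes one forward pass per person crop and ignores the cost of detection and of the pairwise comparisons themselves; the second ratio compares the light model with the 64 FPS reported for the ResNet-50 based methods.

```latex
\[
\frac{923\ \text{FPS}}{30\ \text{persons per frame}} \approx 30.8\ \text{frames per second},
\qquad
\frac{923\ \text{FPS}}{64\ \text{FPS}} \approx 14.4 .
\]
```

Under these assumptions a stream with about 30 persons per frame can still be processed at roughly 30 frames per second, and the speed-up over the 64 FPS baselines exceeds one order of magnitude.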
VII Conclusion
\lettrine[nindent=0em,lines=3]In this paper we have proposed a novel lightweight backbone (RMNet) and a set of training practices for the person re-identification problem. We have demonstrated that our solution is close to state-of-the-art approaches in accuracy but significantly outperforms them in inference time. We believe that our work gives new impetus to the development of lightweight solutions for a wide range of applications through the direct design of task-specific networks.
References
 [1] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3) (2015) 211–252
 [2] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
 [3] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017)
 [4] Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.: Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR abs/1801.04381 (2018)
 [5] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR abs/1707.01083 (2017)
 [6] Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: The European Conference on Computer Vision (ECCV). (September 2018)
 [7] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
 [8] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015)
 [9] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR abs/1409.4842 (2014)
 [10] Huang, Q., Zhou, S.K., You, S., Neumann, U.: Learning to prune filters in convolutional neural networks. CoRR abs/1801.07365 (2018)
 [11] Krishnamoorthi, R.: Quantizing deep convolutional networks for efficient inference: A whitepaper. ArXiv e-prints (June 2018)
 [12] Cheng, Y., Yu, F.X., Feris, R.S., Kumar, S., Choudhary, A.N., Chang, S.: Fast neural networks with circulant projections. CoRR abs/1502.03436 (2015)
 [13] RoyChowdhury, A., Sharma, P., Learned-Miller, E.G.: Reducing duplicate filters in deep neural networks. (2018)
 [14] Courbariaux, M., Bengio, Y.: Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR abs/1602.02830 (2016)
 [15] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. (2010) 807–814
 [16] Clevert, D., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). CoRR abs/1511.07289 (2015)
 [17] Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks. CoRR abs/1608.06993 (2016)
 [18] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. (2005) 539–546
 [19] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. CoRR abs/1503.03832 (2015)
 [20] Janocha, K., Czarnecki, W.M.: On loss functions for deep neural networks in classification. CoRR abs/1702.05659 (2017)
 [21] Wang, F., Liu, W., Liu, H., Cheng, J.: Additive margin softmax for face verification. CoRR abs/1801.05599 (2018)
 [22] Sun, Y., Zheng, L., Yang, Y., Tian, Q., Wang, S.: Beyond part models: Person retrieval with refined part pooling. CoRR abs/1711.09349 (2017)
 [23] Fu, Y., Wei, Y., Zhou, Y., Shi, H., Huang, G., Wang, X., Yao, Z., Huang, T.S.: Horizontal pyramid matching for person re-identification. CoRR abs/1804.05275 (2018)
 [24] Wang, G., Yuan, Y., Chen, X., Li, J., Zhou, X.: Learning discriminative features with multiple granularities for person re-identification. CoRR abs/1804.01438 (2018)
 [25] Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. CoRR abs/1703.07737 (2017)
 [26] Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. (2016) 499–515
 [27] Qi, L., Huo, J., Wang, L., Shi, Y., Gao, Y.: Maskreid: A mask based deep ranking neural network for person re-identification. CoRR abs/1804.03864 (2018)
 [28] Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014)
 [29] Veit, A., Wilber, M.J., Belongie, S.J.: Residual networks are exponential ensembles of relatively shallow networks. CoRR abs/1605.06431 (2016)
 [30] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. CoRR abs/1603.05027 (2016)
 [31] Wang, H., Raj, B., Xing, E.P.: On the origin of deep learning. CoRR abs/1702.07800 (2017)
 [32] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. CoRR abs/1610.02357 (2016)
 [33] He, Y., Dong, X., Kang, G., Fu, Y., Yang, Y.: Progressive Deep Neural Networks Acceleration via Soft Filter Pruning. ArXiv e-prints (August 2018)
 [34] Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR abs/1312.6120 (2013)
 [35] Huh, M., Agrawal, P., Efros, A.A.: What makes imagenet good for transfer learning? CoRR abs/1608.08614 (2016)
 [36] Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: Enet: A deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147 (2016)
 [37] Shaw, G.L.: Donald hebb: The organization of behavior. (1986) 231–233
 [38] Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. (2016) 1857–1865
 [39] Almazán, J., Gajic, B., Murray, N., Larlus, D.: Re-id done right: towards good practices for person re-identification. CoRR abs/1801.05339 (2018)
 [40] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
 [41] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR abs/1502.01852 (2015)
 [42] Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova, A., Rom, H., Uijlings, J., Popov, S., Kamali, S., Malloci, M., Pont-Tuset, J., Veit, A., Belongie, S., Gomes, V., Gupta, A., Sun, C., Chechik, G., Cai, D., Feng, Z., Narayanan, D., Murphy, K.: Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html (2017)
 [43] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580 (2012)
 [44] Gordo, A., Almazán, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. CoRR abs/1610.07940 (2016)
 [45] Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. CoRR abs/1708.04896 (2017)
 [46] Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-identification: A benchmark. (2015)
 [47] Zhong, Z., Zheng, L., Cao, D., Li, S.: Re-ranking person re-identification with k-reciprocal encoding. CoRR abs/1701.08398 (2017)
 [48] Bai, X., Yang, M., Huang, T., Dou, Z., Yu, R., Xu, Y.: Deep-person: Learning discriminative deep features for person re-identification. CoRR abs/1711.10658 (2017)