Video Person Re-Identification using Learned Clip Similarity Aggregation
We address the challenging task of video-based person re-identification. Recent works have shown that splitting the video sequences into clips and then aggregating clip-based similarities is appropriate for the task. We show that using a learned clip similarity aggregation function allows filtering out hard clip pairs, e.g. where the person is not clearly visible, is in a challenging pose, or where the poses in the two clips are too different to be informative. This allows the method to focus on clip pairs which are more informative for the task. We also introduce the use of 3D CNNs for video-based re-identification and show their effectiveness by performing on par with previous works, which use optical flow in addition to RGB, while using RGB inputs only. We give quantitative results on three challenging public benchmarks and show better or competitive performance. We also validate our method qualitatively.
Person re-identification is the problem of identifying and matching persons in videos captured from multiple non-overlapping cameras. It plays an important role in many intelligent video surveillance systems and is a challenging problem due to the variations in camera viewpoint, person pose and appearance, and challenging illumination along with various types and degrees of occlusions.
Visual person re-identification involves matching two images or video sequences (containing persons) to answer whether the persons in the two videos are the same or not. The general approach includes (a) extraction of features that are discriminative with respect to the identity of the persons while being invariant to changes in pose, viewpoint, and illumination and (b) estimating a distance metric between the features. The earlier methods for re-identification used handcrafted features in conjunction with metric learning to perform the task [7, 10, 16, 24, 44]. These works mainly leveraged intuitions for the task, while in recent years, the use of deep CNNs has become more common owing to their superior performance [1, 4, 6, 19, 40, 41].
Many of the previous works on person re-identification have focused on image-based benchmarks; however, with the introduction of large-scale video re-identification benchmarks such as MARS, the video-based setting is becoming popular. Most existing methods on video-based re-identification extract CNN features of individual frames and aggregate them using average pooling, max pooling, temporal attention mechanisms, or RNNs [25, 42, 45, 48]. These methods thus represent the video sequence as a single feature vector. However, for long sequences that have a significant amount of variation in pose, illumination, etc., a single vector might not be enough to represent them.
A recent state-of-the-art video-based method by Chen et al. [3] addresses the problem by dividing the sequences into short clips, embedding each clip separately using a CNN, and applying a temporal attention based method. To match two given sequences, they compute similarities between all pairs of clips, and compute the final similarity by aggregating a fixed percentage of the top clip-pair similarities. Thus, the contribution of a clip in a video sequence is dynamically determined, based on its similarities to the clips in the other sequence. Chen et al. [3] assume that the similarity between a pair of clips is indicative of the informativeness of the clip pair. We argue that this assumption is not necessarily true in practice, e.g. a pair of clips with low similarity can be utilized as evidence for the fact that the persons in the two clips are different. Such clip-pairs get discarded while computing the final similarity, which may hurt the re-identification performance. Another shortcoming of the method is that it uses a fixed percentage of the clip-pairs for all pairs of sequences. This limits the performance of the method since the number of informative clip-pairs can vary across pairs of sequences.
We address the above shortcomings of Chen et al. [3] and propose an end-to-end trainable model to estimate the similarity between two video sequences. Our model takes pairs of clips as input in a sequence and predicts an importance score for each clip pair. It computes the final similarity between the two sequences by taking an average of the clip-pair similarities weighted by their corresponding importance scores. Thus, our model allows filtering of non-informative or distracting clip-pairs while focusing only on clip-pairs relevant for estimating the similarities. While Chen et al. [3] also aim to filter non-informative or distracting clip-pairs, their measure of informativeness is different. Their method uses clip-level similarity as a proxy for the informativeness, while our method uses a learnable scoring function optimized for the task at the video level. Consider a clip-pair without any artefact, but with a low clip similarity due to different persons being present. While their method would reject such a pair despite it being informative, our scoring function would give it high importance to maintain a low overall similarity.
As another contribution, we show the effectiveness of 3D CNNs [37, 2] for obtaining clip features. 3D CNNs, which have been used for various video-based tasks such as action recognition in recent years, remain largely unexplored for the task of video-based person re-identification. We show their effectiveness on this task by reporting performance on par with previous works which use optical flow in addition to RGB, while using RGB inputs only.
We give quantitative results on three video-based person re-identification benchmarks: MARS, DukeMTMC-VideoReID [29, 39] and PRID2011 [12]. We show that our trainable similarity estimation model performs better than the top clip-similarity aggregation proposed by Chen et al. [3]. To simulate more challenging situations, we also report experiments with partial frame corruption, which could happen due to motion blur or occlusions, and show that our method degrades gracefully and performs better than the competitive baseline. We also provide qualitative results that verify the intuition of the method.
2 Related Work
Image based Re-Identification.
Initial works on person re-identification focused on designing and extracting discriminative features from the images [10, 7, 24, 16, 44]. These works mainly leveraged intuitions for the task and proposed hand-designed descriptors that capture the shape, appearance, texture, and other visual aspects of the person. Other works proposed better metric learning methods for the task of person re-identification [10, 27, 47, 15, 20, 26]. This line of work mainly worked with standard features and innovated on the type and better applicability of metric learning algorithms for the task.
More recent methods have started leveraging CNN features for the task of person re-identification. These methods explore various CNN architectures and loss functions. Li et al. proposed a CNN architecture specifically for the re-identification task, which was trained using a binary verification loss. Ding et al. [6] proposed a triplet loss to learn CNN features. Ahmed et al. [1] proposed a siamese CNN architecture and used a binary verification loss for training. Cheng et al. [4] used a parts-based CNN model for re-identification, which was learned using a triplet loss. Xiao et al. used domain guided dropout that allowed learning of CNN features from multiple domains. They used a softmax classification loss to train the model. Xiao et al. jointly trained a CNN for pedestrian detection and identification. They proposed the online instance matching (OIM) loss, which they showed to be more efficient than the softmax classification loss.
Another line of work [35, 46, 43, 36] leverages human pose estimators and uses parts-based representations for person re-identification. For example, Suh et al. used a two-stream framework with an appearance and a pose stream, which were combined using bilinear pooling to get a part-aligned representation.
Video-based person re-identification. The methods working with videos commonly rely on CNNs to extract features from the individual frames, while using different ways of aggregating the frame-wise CNN features. For example, Yan et al. used an LSTM to aggregate the frame-wise features. Zheng et al. aggregated the CNN features using max/average pooling, and also used metric learning schemes such as KISSME [15] and XQDA to improve the re-identification performance. McLaughlin et al. used an RNN on top of CNN features followed by temporal max/average pooling.
More recent works have also started exploring temporal and spatial attention based methods for video-based re-identification. Zhou et al. used a temporal attention mechanism for weighted aggregation of frame features. Li et al. employed multiple spatial attention units for discovering latent visual concepts that are discriminative for re-identification. They combined the spatially gated frame-wise features from each spatial attention unit using temporal attention mechanisms and concatenation.
Liu et al. used a two-stream framework for video re-identification, which consists of an appearance and a motion stream, to exploit the motion information in the video sequences. Instead of using pre-computed optical flow, however, they learned the motion context from RGB images in an end-to-end manner.
We assume humans have been detected and tracked, and that we are provided with cropped videos, each containing a single person. We view the videos as ordered sequences of tensors (RGB frames). We formally define the problem we address as that of learning a parameterized similarity between two ordered sequences of tensors. Denote the query and the gallery video sequences as $\mathbf{x}_q = (x_{q,1}, \ldots, x_{q,T_q})$ and $\mathbf{x}_g = (x_{g,1}, \ldots, x_{g,T_g})$, with each $x_{\cdot,t} \in \mathbb{R}^{H \times W \times 3}$ being an RGB frame. We are interested in learning a function $\phi_\Theta$, with parameters $\Theta$, which takes as input two sequences and outputs a real-valued similarity between them, $\phi_\Theta(\mathbf{x}_q, \mathbf{x}_g) \in \mathbb{R}$, where a high (low) similarity indicates that they are (not) of the same person.
3.1 Learning Clip Similarity Aggregation
The similarity function we propose is based on a learned aggregation of clip-pairs sampled from the video sequences. Fig. 2 gives a full block diagram of our method. We uniformly sample $N$ clips of length $L$ from both the query and the gallery sequences, denoted by $\{\mathbf{c}^q_i\}$ and $\{\mathbf{c}^g_j\}$, where $i, j = 1, \ldots, N$. The number of clips could also be different for the two sequences being compared, but for brevity and ease of implementation we keep them the same, allowing potential overlap of the clips if the number of frames in the sequence(s) is less than $NL$.
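The clip sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_clips`, the choice of evenly spaced clip start positions, and the clamping of indices for very short sequences are all assumptions; the defaults of 8 clips of 4 frames follow the setting reported in the implementation details.

```python
import numpy as np

def sample_clips(num_frames, num_clips=8, clip_len=4):
    """Return a (num_clips, clip_len) array of frame indices.

    Assumption: clip start positions are spread evenly over the sequence.
    Clips overlap when the sequence has fewer than num_clips * clip_len
    frames, and indices are clamped when it is shorter than one clip.
    """
    if num_frames < 1:
        raise ValueError("sequence must contain at least one frame")
    last_start = max(num_frames - clip_len, 0)
    starts = np.linspace(0, last_start, num_clips).round().astype(int)
    idx = starts[:, None] + np.arange(clip_len)[None, :]
    return np.minimum(idx, num_frames - 1)

# A 32-frame sequence yields 8 non-overlapping clips; a 10-frame one overlaps.
print(sample_clips(32)[:2])
print(sample_clips(10)[:3])
```

With evenly spaced starts, both sequences always contribute the same number of clips regardless of their length, which keeps the number of clip pairs fixed.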
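We first forward pass the clips through a state-of-the-art 3D CNN with parameters $\theta$ to obtain $D$-dimensional features $\mathbf{q}_i$ and $\mathbf{g}_j$ for the query and the gallery clips respectively, where $\mathbf{q}_i, \mathbf{g}_j \in \mathbb{R}^D$ and $i, j = 1, \ldots, N$. We then learn to estimate which pairs of clips are informative, considering all the combinations. This is in contrast to many sequence modeling approaches, like those based on max/average pooling or attention-based temporal pooling, which encode the clip sequences individually with the intuition that some clips might be bad due to occlusion, difficult pose, high motion blur, etc. In our case we argue that even if some clips have partial artifacts, due to the various nuisance factors, they might still match with a similarly (partially) corrupted clip from another video, and thus should not be discarded. Hence, in the proposed method we consider all the $N^2$ combinations of pairs of clips and learn to weight them according to their importance. We run the importance estimation in a sequential manner and condition on the information that we have already accumulated at any step $t$. We estimate the importance score of the clip pair at step $t$, $\alpha_t$, using a small neural network $h_\omega$, which takes as input the difference of the aggregated representation till that point and the combined representation of the current clip pair. The combined representation used for a pair of clips is the element-wise product (denoted as $\odot$) of the clip features, and the pooling process at step $t$ is given by,
$$\alpha_t = h_\omega\left(\mathbf{z}_{t-1} - \mathbf{q}_{i_t} \odot \mathbf{g}_{j_t}\right), \qquad \mathbf{z}_t = \frac{\sum_{k=1}^{t} \alpha_k \, \mathbf{q}_{i_k} \odot \mathbf{g}_{j_k}}{\sum_{k=1}^{t} \alpha_k},$$
with $\mathbf{z}_0 = \mathbf{0}$, where $(i_t, j_t)$ is the clip pair visited at step $t = 1, \ldots, N^2$. This gives the final combined representation $\mathbf{z}_{N^2}$.
We then predict the similarity score between the query sequence $\mathbf{x}_q$ and the gallery sequence $\mathbf{x}_g$ by taking an average of all clip-pair cosine similarities weighted according to the importance scores,
$$\phi_\Theta(\mathbf{x}_q, \mathbf{x}_g) = \frac{\sum_{t=1}^{N^2} \alpha_t \cos\left(\mathbf{q}_{i_t}, \mathbf{g}_{j_t}\right)}{\sum_{t=1}^{N^2} \alpha_t}. \tag{3}$$
If the clip features $\mathbf{q}_i$ and $\mathbf{g}_j$ are $\ell_2$-normalized, then $\cos(\mathbf{q}_i, \mathbf{g}_j) = \mathbf{1}^\top(\mathbf{q}_i \odot \mathbf{g}_j)$, and the final similarity (3) can be directly computed using the final combined representation as
$$\phi_\Theta(\mathbf{x}_q, \mathbf{x}_g) = \mathbf{1}^\top \mathbf{z}_{N^2}. \tag{4}$$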
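The sequential pooling above can be illustrated with a short NumPy sketch. It is a sketch under stated assumptions rather than the authors' code: `aggregate_similarity` and `score_fn` are hypothetical names, the scoring network is replaced by an arbitrary callable followed by a softplus, and the clip pairs are visited in row-major order. The function returns the similarity computed in both ways described in the text (summing the final combined representation, and averaging the clip-pair cosine similarities with the importance weights) so that their agreement for normalized features can be checked.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps importance scores positive.
    return np.logaddexp(0.0, x)

def aggregate_similarity(q, g, score_fn):
    """q, g: (N, D) l2-normalized clip features of the two sequences.

    score_fn is a stand-in for the learned scoring network: it maps a
    D-vector to a raw (unconstrained) score.
    """
    z = np.zeros(q.shape[1])   # running combined representation
    total = 0.0                # running sum of importance scores
    weighted_cos = 0.0         # numerator of the weighted cosine average
    for i in range(q.shape[0]):
        for j in range(g.shape[0]):
            c = q[i] * g[j]                       # element-wise product
            alpha = softplus(score_fn(z - c))     # importance of this pair
            total += alpha
            z = z + (alpha / total) * (c - z)     # running weighted mean
            weighted_cos += alpha * float(q[i] @ g[j])
    return float(z.sum()), float(weighted_cos / total)

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 5)); q /= np.linalg.norm(q, axis=1, keepdims=True)
g = rng.normal(size=(3, 5)); g /= np.linalg.norm(g, axis=1, keepdims=True)
w = rng.normal(size=5)         # toy linear scorer, for illustration only
print(aggregate_similarity(q, g, lambda v: float(w @ v)))
```

The running-mean update avoids storing all clip-pair representations: only the current combined representation and the accumulated weight are kept between steps.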
Our method allows us to learn all the parameters end-to-end and jointly for the task using the standard backpropagation algorithm for neural networks. However, due to computational constraints, we design the training as a two-step process. First, we learn the parameters of the 3D CNN; then we fix the 3D CNN and learn the parameters of the clip-similarity aggregation module. We now describe each of these steps.
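3D CNN. In each training iteration, following [11], we randomly sample a batch of $PK$ sequences belonging to $P$ person identities with $K$ sequences from each identity. Then, we randomly sample one clip of length $L$ frames from each sampled sequence to form the mini-batch. We use a combination of the hard mining triplet loss [11] and the cross-entropy loss as our objective.
The hard mining triplet loss is given as,
$$\mathcal{L}_{\text{triplet}} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[\, m + \max_{p = 1, \ldots, K} \lVert \mathbf{f}^i_a - \mathbf{f}^i_p \rVert_2 - \min_{\substack{j = 1, \ldots, P,\; j \neq i \\ n = 1, \ldots, K}} \lVert \mathbf{f}^i_a - \mathbf{f}^j_n \rVert_2 \,\Big]_+ \tag{7}$$
where $\mathbf{f}^i_a$ is the 3D CNN feature vector of the $a$-th clip of the $i$-th person in the batch, $m$ is the margin, and $[\cdot]_+ = \max(0, \cdot)$.
We add a classification layer on top of our 3D CNN with $C$ classes, where $C$ is the total number of identities in the training set. Let $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_C]$ be the weights of the classification layer. The softmax cross-entropy loss is given by,
$$\mathcal{L}_{\text{xent}} = -\sum_{i=1}^{P} \sum_{a=1}^{K} \log \frac{\exp\left(\mathbf{w}_{y_i}^\top \mathbf{f}^i_a\right)}{\sum_{c=1}^{C} \exp\left(\mathbf{w}_c^\top \mathbf{f}^i_a\right)}$$
where $y_i$ is the person index of the $i$-th person in the batch. Note that, while learning the 3D CNN parameters, we do not use our clip-similarity aggregation module.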
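As an illustration of the hard mining triplet loss used here, the following NumPy sketch computes the batch-hard variant of Hermans et al. on a small batch. The function name, the margin value of 0.3, and the choice of summing (rather than averaging) over anchors are assumptions made for the example; the source does not state them.

```python
import numpy as np

def batch_hard_triplet_loss(features, labels, margin=0.3):
    """features: (B, D) clip features; labels: (B,) identity ids.

    For every anchor, takes the farthest sample of the same identity and
    the closest sample of a different identity, then applies a hinge.
    The margin value is illustrative only.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # pairwise Euclidean
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return float(np.maximum(margin + hardest_pos - hardest_neg, 0.0).sum())

# Two identities with two clips each (P = 2, K = 2).
feats = [[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
print(batch_hard_triplet_loss(feats, [0, 0, 1, 1]))
```

In the second training stage described next, the same hinge can be applied with distances replaced by negative sequence similarities.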
Clip similarity aggregation module. For learning the parameters of the aggregation module, we use the same batch sampling process as described for the learning of the 3D CNN parameters, except that now we uniformly sample multiple clips, instead of a single clip, from each sampled sequence. We extract features of the clips with the above learned 3D CNN and $\ell_2$-normalize them. Then, we compute the similarity scores between all pairs of sequences in the batch using (4). We use the hard mining triplet loss similar to (7) as the objective, with the Euclidean distances replaced by negative clip similarities as defined above in (3)–(6).
4 Experiments and Results
MARS. The MARS dataset is a large-scale video-based person re-identification benchmark. It contains 20,478 pedestrian sequences belonging to 1,261 identities. The sequences are automatically extracted using the DPM pedestrian detector [8] and the GMMCP tracker [5]. The lengths of the sequences range from 2 to 920 frames. The videos are captured from six cameras and each identity is captured from at least two cameras. The training set consists of 8,298 sequences from 625 identities, while the remaining 12,180 sequences from 636 identities make up the test set, which consists of a query and a gallery set.
DukeMTMC-VideoReID. DukeMTMC-VideoReID [29, 39] is another large benchmark for video-based person re-identification. It consists of 702 identities for training and 702 identities for testing. The gallery set contains an additional 408 identities as distractors. There are a total of 2,196 sequences for training and 2,636 sequences for testing and distraction. Each sequence has 168 frames on average.
PRID2011. The PRID2011 dataset [12] contains 400 sequences of 200 person identities captured from two cameras. Each image sequence has a length of 5 to 675 frames. Following the evaluation protocol from [38, 45], we discard sequences shorter than 21 frames and use 178 of the remaining sequences for training and the other 178 sequences for testing.
4.2 Implementation Details
3D CNN Architecture. We use the PyTorch implementation (https://github.com/piergiaj/pytorch-i3d) of the Inception-V1 I3D network [2] pretrained on the Kinetics action recognition dataset.
We remove the final classification layer from the I3D network and replace the original fixed-kernel average pooling layer with a global average pooling layer.
The resulting I3D network takes a clip as input and outputs a 1024-dimensional feature vector ($D = 1024$).
Clip similarity aggregation module architecture. The clip-pair similarity aggregation module takes as input a pair of tensors representing the I3D features of the clips sampled from the two sequences to be matched. In our experiments, we set the number of clips to 8 and the clip length to 4 frames; this setting was faster than the alternatives we considered while giving similar performance (see the supplementary document for the complete ablation experiment). The importance scoring function consists of two hidden layers with 1024 units each. The output layer has a single unit that represents the estimated importance score. The hidden layers use the ReLU activation function, while the output layer uses the softplus activation function, $f(x) = \log(1 + e^{x})$. The softplus function, a smooth approximation of the ReLU function, constrains the importance score to always be positive. We also use a dropout layer and a batch normalization layer [14] after both hidden layers.
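The forward pass of such an importance scoring function can be sketched in NumPy as below. This is an inference-time illustration only: dropout and batch normalization are omitted, the weights are randomly initialized rather than learned, and the names `init_scorer` and `importance_score` are hypothetical.

```python
import numpy as np

def init_scorer(dim=1024, hidden=1024, seed=0):
    """Random weights for a dim -> hidden -> hidden -> 1 network."""
    rng = np.random.default_rng(seed)
    shapes = [(dim, hidden), (hidden, hidden), (hidden, 1)]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in shapes]

def importance_score(params, x):
    """x: (..., dim) inputs; returns strictly positive scores of shape (...)."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)        # ReLU hidden layers
    w, b = params[-1]
    raw = h @ w + b
    return np.logaddexp(0.0, raw)[..., 0]     # softplus: log(1 + exp(raw))

params = init_scorer(dim=16, hidden=32)       # small sizes for a quick run
x = np.random.default_rng(1).normal(size=(5, 16))
print(importance_score(params, x))
```

Using `np.logaddexp(0, raw)` evaluates the softplus without overflow for large positive inputs.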
Training details. Due to lack of space, we include the complete training details of the 3D CNN and the Clip Similarity Aggregation module in the supplementary document.
Evaluation protocol and evaluation metrics. We follow the experimental setups established in prior work for PRID2011, MARS and DukeMTMC-VideoReID. For MARS and DukeMTMC-VideoReID, we use the train/test splits provided with the respective benchmarks. For PRID2011, we average the re-identification performance over 10 random train/test splits. We report the re-identification performance using CMC (cumulative matching characteristics) at selected ranks and mAP (mean average precision).
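For reference, the two metrics can be computed from a query-gallery similarity matrix as in the following simplified sketch. It is not the official evaluation code of any benchmark: in particular, it ignores camera ids (standard protocols usually exclude gallery items from the same camera as the query), and the function name `evaluate` is hypothetical.

```python
import numpy as np

def evaluate(sim, q_ids, g_ids, ranks=(1, 5, 20)):
    """sim: (Q, G) similarities. Returns ({rank: CMC}, mAP)."""
    sim = np.asarray(sim, dtype=float)
    q_ids, g_ids = np.asarray(q_ids), np.asarray(g_ids)
    first_hit, aps = [], []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i], kind="stable")   # most similar first
        match = g_ids[order] == q_ids[i]
        if not match.any():
            continue                                 # no true match in gallery
        hit_pos = np.flatnonzero(match)              # 0-based ranks of matches
        first_hit.append(hit_pos[0])
        precision_at_hits = np.arange(1, len(hit_pos) + 1) / (hit_pos + 1)
        aps.append(precision_at_hits.mean())         # average precision
    first_hit = np.asarray(first_hit)
    cmc = {r: float((first_hit < r).mean()) for r in ranks}
    return cmc, float(np.mean(aps))

sim = [[0.9, 0.1, 0.8],
       [0.7, 0.2, 0.6]]
print(evaluate(sim, q_ids=[1, 2], g_ids=[1, 2, 1]))
```

CMC at rank r is the fraction of queries whose first correct match appears within the top r gallery items, while mAP additionally rewards ranking all correct matches early.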
4.3 Analysis of I3D Features for Re-Identification
Frame sampling method and clip length. In the scenario, where we use a single clip to represent a sequence, it becomes important how we sample the frames from the sequence to form a clip. In this experiment, we explore multiple frame sampling methods given in Tab. 1 and their effect on the re-identification performance.
Table 1: Frame sampling methods.
|Method||Description|
|consec||Randomly sample a clip of consecutive frames|
|random||Randomly sample frames (arrange in order)|
|evenly||Sample frames uniformly|
|all||Take all frames|
Note that all sampling methods in Tab. 1 result in a clip of fixed length, except the all sampling method. We train three I3D models with different clip lengths. The frames are sampled consecutively (consec) to form a clip during training. During evaluation, each test sequence is represented by the I3D features of a single clip sampled in one of the ways described above. Given a query, the gallery sequences are ranked based on the distances of their I3D features. We evaluate the three models with different frame sampling methods and test clip lengths.
Figure 3 shows the plots of re-identification performance as a function of the test clip length with different frame sampling methods. We observe that the performance improves as we increase the clip length during testing, although with diminishing returns. We also observe that, when tested on longer clips, models trained on different clip lengths show similar performance to each other. On the other hand, when tested on shorter clips, a model trained on shorter clips performs better than a model trained on longer clips.
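The four sampling methods of Tab. 1 can be sketched as index-selection routines. The sketch below reflects one plausible reading of the short table descriptions; the function name `sample_frames` and the handling of sequences shorter than the clip length (clamping or sampling with replacement) are assumptions.

```python
import numpy as np

def sample_frames(num_frames, clip_len, method, rng=None):
    """Return sorted frame indices for one clip of a num_frames-long sequence."""
    if rng is None:
        rng = np.random.default_rng()
    if method == "all":
        return np.arange(num_frames)
    if method == "evenly":
        # Uniformly spaced frames covering the whole sequence.
        return np.linspace(0, num_frames - 1, clip_len).round().astype(int)
    if method == "consec":
        # Random start, then consecutive frames (clamped for short sequences).
        start = rng.integers(0, max(num_frames - clip_len, 0) + 1)
        return np.minimum(start + np.arange(clip_len), num_frames - 1)
    if method == "random":
        # Random distinct frames, arranged in temporal order.
        replace = num_frames < clip_len
        return np.sort(rng.choice(num_frames, size=clip_len, replace=replace))
    raise ValueError(f"unknown sampling method: {method}")

rng = np.random.default_rng(0)
for name in ("consec", "random", "evenly", "all"):
    print(name, sample_frames(10, 4, name, rng))
```

The random and evenly variants cover a wider temporal extent of the sequence than consec for the same clip length.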
The sampling methods random and evenly perform better than the consec sampling method, especially for smaller clip lengths. This can be explained by the fact that random and evenly have a larger temporal extent than consec and do not rely on frames only from a narrow temporal region, which could be non-informative because of difficult pose, occlusion, etc.
Averaging features of multiple clips. Since sequences in the MARS dataset can be up to 920 frames long, using single short clips to represent these sequences is not optimal. In this experiment, we take the average of the I3D features of multiple clips evenly sampled from the original sequence to represent these sequences. We vary the number of clips on the MARS dataset. We use one of the trained models and keep its training clip length during the evaluation. We also evaluate with and without $\ell_2$-normalization of the clip features. Figure 4 shows the test re-identification performance for different numbers of clips with and without $\ell_2$-normalization of the clip features. We observe that averaging features from multiple clips significantly improves the re-identification performance. The performance improves up to around 8 clips, beyond which there is little improvement. We also find that $\ell_2$-normalization of the clip features leads to a consistent improvement in performance.
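A minimal sketch of this baseline sequence representation, with the optional normalization step, is given below; the helper names `l2_normalize` and `sequence_feature` are hypothetical and the toy feature values are for illustration only.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def sequence_feature(clip_features, normalize=True):
    """clip_features: (N, D) features of N clips -> (D,) sequence feature."""
    feats = np.asarray(clip_features, dtype=float)
    if normalize:
        feats = l2_normalize(feats)
    return feats.mean(axis=0)

clips = [[3.0, 4.0], [0.0, 2.0]]
print(sequence_feature(clips, normalize=True))    # each clip contributes equally
print(sequence_feature(clips, normalize=False))   # large-norm clips dominate
```

Without normalization, a clip whose feature has a large norm contributes more to the average, which is one possible reason the normalized variant behaves more consistently.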
4.4 Evaluation of Learned Clip Similarity Aggregation on MARS
In this section, we present the re-identification performance results of our learned clip similarity aggregation method on the MARS test set. We also investigate the robustness of our method by evaluating it with varying degrees of input corruption. We randomly corrupt clips during training and evaluation as follows. For every training or test sequence x, we first randomly pick the number of clips to corrupt, between zero and a preset maximum number of corrupt clips per sequence, which is at most the number of clips sampled from the sequence. Next, we apply a corruption transformation function to that many randomly selected clips among those sampled from the sequence x. The corruption transformation function consists of first scaling down every frame in the clip by a factor of 5, JPEG compression of the resulting scaled-down frames, and finally rescaling the frames up to the original size. Figure 5 shows examples of uncorrupted and corrupted clips.
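The corruption procedure can be approximated with the NumPy sketch below. Several simplifications are assumptions of this example: the JPEG compression step is omitted to keep the code dependency-free, downscaling is done by block averaging and upscaling by nearest-neighbour repetition, and the number of corrupted clips is drawn uniformly. The names `degrade_clip` and `corrupt_sequence` are hypothetical.

```python
import numpy as np

def degrade_clip(clip, factor=5):
    """clip: (T, H, W, C) float array. Downscale by `factor`, then upscale back.

    Note: the JPEG compression step of the described transformation is
    not reproduced here.
    """
    t, h, w, c = clip.shape
    hc, wc = h - h % factor, w - w % factor        # region divisible by factor
    small = clip[:, :hc, :wc].reshape(
        t, hc // factor, factor, wc // factor, factor, c).mean(axis=(2, 4))
    up = small.repeat(factor, axis=1).repeat(factor, axis=2)
    out = clip.copy()
    out[:, :hc, :wc] = up
    return out

def corrupt_sequence(clips, max_corrupt, rng):
    """clips: (N, T, H, W, C). Corrupts n random clips, n in {0..max_corrupt}."""
    n = int(rng.integers(0, max_corrupt + 1))
    chosen = rng.choice(len(clips), size=n, replace=False)
    out = clips.copy()
    for k in chosen:
        out[k] = degrade_clip(clips[k])
    return out, np.sort(chosen)

rng = np.random.default_rng(0)
clips = rng.random((8, 4, 20, 10, 3))
corrupted, which = corrupt_sequence(clips, max_corrupt=3, rng=rng)
print("corrupted clip indices:", which)
```

Raising `max_corrupt` increases the expected fraction of degraded clips per sequence, which is how the corruption rate is varied in this kind of experiment.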
Let $\mathbf{q}_i$ and $\mathbf{g}_j$ be the $\ell_2$-normalized I3D features of the clips sampled from a query sequence and a gallery sequence respectively. As described in Section 3, the similarity between the two sequences, as estimated by our method, is given by (3) or (4). We train and evaluate our clip-similarity aggregation module for different rates of input corruption. The rate of input corruption is changed via the maximum number of corrupt clips per sequence. We use the I3D network trained only on uncorrupted clips and keep it fixed throughout the experiment.
We compare our method with the top-$t\%$ clip-similarity aggregation (top-t%) baseline, which is based on Chen et al. [3]. It takes the $t\%$ of the clip-pairs with the highest similarity and averages their similarities to estimate the overall similarity between the two sequences. By taking only the top $t\%$ and not all clip-pairs into account, the resulting similarity becomes more robust and improves re-identification performance [3]. In our implementation, we learn a linear layer that projects the $D$-dimensional I3D features to a new embedding space. We define the similarity between two given clips as the cosine similarity between their projected I3D features. Let $\hat{\mathbf{q}}_i$ and $\hat{\mathbf{g}}_j$ be the projected clip features and let $\mathcal{T}$ be the set of the $t\%$ clip-pairs with the highest similarity. Then, the top-t% similarity between the two sequences is given by,
$$s_{\text{top-}t\%} = \frac{1}{|\mathcal{T}|} \sum_{(i, j) \in \mathcal{T}} \cos\left(\hat{\mathbf{q}}_i, \hat{\mathbf{g}}_j\right).$$
We implement two variants of this method. In the first variant, top-t%-eval, we perform the top-$t\%$ similarity aggregation only during the evaluation. In the second variant, top-t%-traineval, we perform the top-$t\%$ similarity aggregation during the evaluation as well as during the training. This means that the loss gradients are backpropagated only for the clips that are included in the top $t\%$ of the clip-pairs.
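The evaluation-time aggregation of this baseline can be sketched as follows. The learned linear projection is left out, so the function operates directly on whatever clip features it is given; the name `top_t_similarity` and the rounding rule for the number of kept pairs are assumptions.

```python
import numpy as np

def top_t_similarity(q, g, t=0.25):
    """q: (Nq, D), g: (Ng, D) clip features; t: fraction of clip pairs kept.

    Averages the cosine similarities of the most similar t-fraction of
    all clip pairs (at least one pair is always kept).
    """
    q = np.asarray(q, dtype=float)
    g = np.asarray(g, dtype=float)
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    gn = g / np.linalg.norm(g, axis=1, keepdims=True)
    sims = (qn @ gn.T).ravel()                   # all clip-pair cosines
    k = max(1, int(round(t * sims.size)))
    return float(np.sort(sims)[-k:].mean())      # mean of the k largest

q = [[1.0, 0.0], [0.0, 1.0]]
g = [[1.0, 0.0], [1.0, 1.0]]
for t in (0.25, 0.5, 1.0):
    print(t, top_t_similarity(q, g, t))
```

Setting `t=1.0` reduces the baseline to a plain average over all clip pairs, while small values of `t` rely on only the few most similar pairs.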
Figure 6 shows the plots of test re-identification performance vs. the selection rate $t$ for top-t%-eval and top-t%-traineval respectively, for different maximum numbers of corrupt clips. As expected, the re-identification performance deteriorates as the amount of corruption is increased. We also observe that top-$t\%$ aggregation during training significantly improves the re-identification performance, especially with the smaller selection rates.
Table 2 shows the re-identification performance of our method and the baselines on the MARS test set. Our method has comparable performance to the top-$t\%$ clip-similarity aggregation when the corruption rate is low, i.e. when the maximum number of corrupt clips is small. However, it significantly outperforms the top-$t\%$ clip-similarity aggregation baseline for higher rates of input corruption: at the highest corruption rate, the mAP of our method degrades more gracefully than the maximum mAP of the baseline topt-e. This highlights the advantage of learning the clip similarity aggregation.
4.5 Comparison with the state-of-the-art
Table 3: Comparison with the state-of-the-art on MARS (values in %; * marks methods that use a 3D CNN backbone).
|Method||mAP||R1||R5||R20|
|RQEN+XQDA+Reranking (2018)||71.1||77.8||88.8||94.3|
|TriNet + Reranking (2017)||77.4||81.2||90.8||_|
|DuATM (2018)||67.7||81.1||92.5||_|
|MGCAN-Siamese (2018)||71.2||77.2||_||_|
|PSE (2018)||56.9||72.1||_||_|
|PSE + ECN (2018)||71.8||76.7||_||_|
|RRU + STIM (2018) *||72.7||84.4||93.2||96.3|
|Two-Stream M3D (2018) *||74.1||84.4||93.8||97.7|
|PABR (2018)||75.9||84.7||94.4||97.5|
|PABR + Reranking (2018)||83.9||85.1||94.2||97.4|
|CSSA-CSE + Flow (2018)||76.1||86.3||94.7||98.2|
|STA (2019)||80.8||86.3||95.7||98.1|
|STA + Reranking (2019)||87.7||87.2||96.2||98.6|
|D + GE + (2019)||81.8||87.3||96.0||98.1|
|Ours + Reranking *||83.3||83.4||93.4||97.4|
Table 4: Comparison with the state-of-the-art on DukeMTMC-VideoReID (values in %).
|Method||mAP||R1||R5||R20|
|ETAP-Net [Supervised] (2018)||78.3||83.6||94.6||97.6|
|STA (2019)||94.9||96.2||99.3||99.6|
|R + GE + (2019)||94.9||95.6||99.3||99.9|
Table 5: Comparison with the state-of-the-art on PRID2011 (values in %).
|Method||R1||R5||R20|
|CNN + XQDA (2016)||77.3||93.5||99.3|
|E2E AMOC+EpicFlow (2017)||83.7||98.3||100.0|
|QAN (2017)||90.3||98.2||100.0|
|M3D+RAL (2018)||91.0||_||_|
|CSSA-CSE (2018)||88.6||99.1||_|
|Two-Stream M3D (2018)||94.4||100.0||_|
|CSSA-CSE + Flow (2018)||93.0||99.3||100.0|
In Table 3, we compare our method with the state-of-the-art techniques on the MARS dataset. Our method achieves 75.9% mAP and 82.7% Rank-1 accuracy. In terms of mAP, our method is on par with all the methods, except for the recently published visual distributional representation based method of Hu and Hauptmann [13], who achieve an mAP of 81.8%, which is significantly higher than ours (we discuss this below). In terms of mAP performance, our method is very close to the part-aligned bilinear representations (PABR) and CSSA-CSE + Flow [3]. However, the performance of CSSA-CSE [3] is much lower than ours when optical flow is not used (see CSSA-CSE in Table 3). Among methods that use 3D CNNs as their backbone (marked * in Table 3), our method achieves the best mAP performance.
Table 4 shows the comparison of our method with the state-of-the-art on the DukeMTMC-VideoReID dataset. There are only a few works with results on this dataset. We achieve 88.5% mAP and 89.3% Rank-1 accuracy, which is significantly better than the ETAP-Net baseline (see Table 4). However, the performance of Hu and Hauptmann [13] and STA [9] is better than that of our method.
Comparing our method to very recent works such as that of Hu and Hauptmann [13], we note that their method is significantly more costly than ours in terms of gallery storage requirements, and uses a CNN which is deeper than ours. While we use a 3D CNN with 22 layers, they use an image-based DenseNet CNN with 121 layers. They compare the test video with the gallery videos by estimating the Wasserstein distance between the densities estimated using KDE. This requires them to use (and save) all the frames to make inference, whereas we use a limited number of clip features (8 in our experiments) per video. While such an accurate method achieves higher performance, it comes at a significant cost.
STA [9] is another recent method with state-of-the-art performance. While STA focuses on aggregating features effectively from a small set of input frames (4-8 frames), our method is more focused on predicting the overall similarities between two long sequences while relying on I3D for clip-level features (a clip is typically 4-16 frames long). Since the video benchmarks contain much longer sequences, our method can be used in conjunction with STA [9] to further boost the performance, as it is complementary to it.
In Table 5, we show results on the PRID2011 dataset. Unfortunately, being a video-based end-to-end method, our method seems to overfit severely on this dataset. The PRID2011 dataset has only 178 videos for training, cf. 8,298 in MARS. We see that we are still comparable with the initial CNN-based methods (e.g. CNN+XQDA). The more recent methods seem to utilize optical flow as input, which could be leading to some regularization by removing the appearance from the videos.
4.6 Qualitative Results
Figure 7 shows four examples of pairs of query-gallery sequences and the similarity between them as predicted by our method. For each example, we also show two clip pairs (4 frames each) with the highest importance scores and two with the lowest importance score. One of the clips in the bottom clip-pair has, from left to right, (i) a significant amount of occlusion, (ii) no person in the frame, (iii) different persons in different frames due to a tracking error, and (iv) an improperly cropped person due to poor bounding box estimation. Our method learns to correctly identify the clip pairs that are unreliable for estimating the overall similarity between the two video sequences, and gives them very low importance scores (bottom two rows). Our method gives an overall high similarity (the heading of each column) to all the examples shown in Figure 7 by minimizing the effect of bad clip-pairs. Although MARS dataset considers the gallery sequences in column 2 and 4 as distractors, the high similarity estimated by our method is reasonable since they contain the same person as in the query in many of their frames, and are annotation edge cases.
These qualitative results highlight the ability of the proposed method to identify reliable clip pairs to match, and filter out unreliable ones despite non trivial appearance similarities estimated by the base network.
We addressed the video based person re-identification task and, to the best of our knowledge, showed that 3D CNNs can be used competitively for the task. We demonstrated better performance with 3D CNN on RGB images only, cf., existing methods which use optical flow channel in addition to RGB channel. This is indicative of the fact that 3D CNNs are capable of capturing necessary motion cues relevant for the task of video based person re-identification.
Further, we proposed a novel clip similarity learning method which identifies clip pairs that are informative for matching the two sequences. While previous methods used ad-hoc methods to obtain such pairs, we showed that our method is capable of learning to do so. We showed, with simulated partial corruption of input clips, that the proposed method is robust to nuisances which might occur as a result of motion blur or partial occlusions. We also qualitatively verified the intuition used to develop the method.
The proposed method can be seen as an approximate discriminative mode matching method. There have been recent works using deeper CNN models (121 layers cf. 22 here) and more accurate distribution matching which obtain better results than the proposed method; however, they come at a computational and storage cost. Future work would be to systematically find the balance between the two approaches to obtain the best performance for a given budget.
[1] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.
[2] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[3] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1169–1178, 2018.
[4] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2016.
[5] A. Dehghan, S. Modiri Assari, and M. Shah. Gmmcp tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4091–4099, 2015.
[6] S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 48(10):2993–3003, 2015.
[7] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2360–2367. IEEE, 2010.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[9] Y. Fu, X. Wang, Y. Wei, and T. Huang. Sta: Spatial-temporal attention for large-scale video-based person re-identification. In Proceedings of the Association for the Advancement of Artificial Intelligence, 2019.
[10] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European Conference on Computer Vision, pages 262–275. Springer, 2008.
[11] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[12] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof. Person Re-Identification by Descriptive and Discriminative Classification. In Proc. Scandinavian Conference on Image Analysis (SCIA), 2011.
[13] T.-Y. Hu and A. G. Hauptmann. Multi-shot person re-identification through set distance with visual distributional representation. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pages 262–270. ACM, 2019.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2288–2295. IEEE, 2012.
[16] I. Kviatkovsky, A. Adam, and E. Rivlin. Color invariants for person reidentification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1622–1634, 2013.
[17] J. Li, S. Zhang, and T. Huang. Multi-scale 3d convolution network for video based person re-identification. arXiv preprint arXiv:1811.07468, 2018.
[18] S. Li, S. Bak, P. Carr, and X. Wang. Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 369–378, 2018.
[19] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 152–159, 2014.
[20] S. Liao and S. Z. Li. Efficient psd constrained asymmetric metric learning for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 3685–3693, 2015.
[21] H. Liu, Z. Jie, K. Jayashree, M. Qi, J. Jiang, S. Yan, and J. Feng. Video-based person re-identification with accumulative motion context. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2788–2802, 2017.
[22] Y. Liu, J. Yan, and W. Ouyang. Quality aware network for set to set recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5790–5799, 2017.
[23] Y. Liu, Z. Yuan, W. Zhou, and H. Li. Spatial and temporal mutual promotion for video-based person re-identification. arXiv preprint arXiv:1812.10305, 2018.
[24] B. Ma, Y. Su, and F. Jurie. Local descriptors encoded by fisher vectors for person re-identification. In European Conference on Computer Vision, pages 413–422. Springer, 2012.
[25] N. McLaughlin, J. Martinez del Rincon, and P. Miller. Recurrent convolutional network for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1325–1334, 2016.
[26] S. Paisitkriangkrai, C. Shen, and A. Van Den Hengel. Learning to rank in person re-identification with metric ensembles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1846–1855, 2015.
[27] B. J. Prosser, W.-S. Zheng, S. Gong, T. Xiang, and Q. Mary. Person re-identification by support vector ranking. In BMVC, volume 2, page 6, 2010.
[28] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
[29] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, 2016.
[30] M. Saquib Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 420–429, 2018.
[31] J. Si, H. Zhang, C.-G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5363–5372, 2018.
[32] C. Song, Y. Huang, W. Ouyang, and L. Wang. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1179–1188, 2018.
[33] G. Song, B. Leng, Y. Liu, C. Hetang, and S. Cai. Region-based quality estimation network for large-scale person re-identification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[35] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose-driven deep convolutional model for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 3960–3969, 2017.
[36] Y. Suh, J. Wang, S. Tang, T. Mei, and K. Mu Lee. Part-aligned bilinear representations for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 402–419, 2018.
[37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[38] T. Wang, S. Gong, X. Zhu, and S. Wang. Person re-identification by video ranking. In European Conference on Computer Vision, pages 688–703. Springer, 2014.
[39] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5177–5186, 2018.
[40] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1249–1258, 2016.
[41] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3415–3424, 2017.
[42] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang. Person re-identification via recurrent feature aggregation. In European Conference on Computer Vision, pages 701–716. Springer, 2016.
[43] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1077–1085, 2017.
[44] R. Zhao, W. Ouyang, and X. Wang. Unsupervised salience learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3586–3593, 2013.
[45] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. Mars: A video benchmark for large-scale person re-identification. In European Conference on Computer Vision, pages 868–884. Springer, 2016.
[46] L. Zheng, Y. Huang, H. Lu, and Y. Yang. Pose invariant embedding for deep person re-identification. arXiv preprint arXiv:1701.07732, 2017.
[47] W.-S. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In CVPR 2011, pages 649–656. IEEE, 2011.
[48] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4747–4756, 2017.
Appendix A Further implementation details
A.1 Details of training of 3D CNN
For the training of the I3D network, we use the AMSGrad optimizer with , . We use a weight decay of . In each training iteration, we use a batch of 32 clips belonging to 8 person identities with 4 instances of each identity, i.e. and . The RGB input values are scaled and shifted to be in the range . For data augmentation, each input clip is first resized up to () and then a random crop of size is taken. Input clips are also randomly flipped horizontally with a probability of 0.5. For training on the MARS dataset, we train the network for 1200 epochs with an initial learning rate of . We reduce the learning rate by a factor of 10 after every 400 epochs. The margin in the triplet loss expression is set to 0.3.
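The batch composition and margin described above can be illustrated with a minimal sketch in plain Python. The function names `sample_identity_batch` and `triplet_hinge` are hypothetical, sampling with replacement for identities with too few clips is an assumption, and the hinge shown is the standard triplet form, which may differ in detail from the loss expression used in the paper.

```python
import random
from collections import defaultdict


def sample_identity_batch(labels, num_ids=8, num_instances=4, rng=None):
    """Pick `num_ids` identities and `num_instances` clip indices for each.

    `labels[i]` is the person identity of clip i. Identities with fewer than
    `num_instances` clips are sampled with replacement (an assumption).
    """
    rng = rng or random.Random()
    clips_by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        clips_by_id[pid].append(idx)

    # Choose distinct identities, then clips for each chosen identity.
    chosen_ids = rng.sample(sorted(clips_by_id), num_ids)
    batch = []
    for pid in chosen_ids:
        pool = clips_by_id[pid]
        if len(pool) >= num_instances:
            batch.extend(rng.sample(pool, num_instances))
        else:
            batch.extend(rng.choices(pool, k=num_instances))
    return batch


def triplet_hinge(dist_pos, dist_neg, margin=0.3):
    """Standard triplet hinge: zero once the negative is `margin` farther away."""
    return max(0.0, dist_pos - dist_neg + margin)
```

With the default arguments, each call returns 8 x 4 = 32 clip indices, matching the batch size stated above.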
A.2 Details of training of Clip-Similarity Aggregation Module
For the training of the Clip-Similarity Aggregation module, we again use the AMSGrad optimizer with , and a weight decay of . We use a batch size of 48 with and . We use the same input transformations and data augmentation techniques as described for the training of the I3D network. We train the aggregation module for 12 epochs with an initial learning rate of . We reduce the learning rate by a factor of 10 after 8 epochs. We set the margin in the triplet loss.
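Both training stages use a step decay of the learning rate. The sketch below shows one way to express such a schedule; the function name `step_lr` is hypothetical, epochs are assumed to be 0-indexed, and `base_lr` is a placeholder because the initial learning rate values are not reproduced here.

```python
def step_lr(epoch, base_lr, step_size, gamma=0.1):
    """Learning rate at a 0-indexed `epoch` when the rate is multiplied by
    `gamma` after every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Aggregation module: 12 epochs, rate reduced by a factor of 10 after 8 epochs.
aggregation_schedule = [step_lr(e, base_lr=1.0, step_size=8) for e in range(12)]

# I3D network: 1200 epochs, rate reduced by a factor of 10 after every 400 epochs.
i3d_schedule = [step_lr(e, base_lr=1.0, step_size=400) for e in range(1200)]
```

With `base_lr=1.0` the lists hold the relative multipliers, so the first 8 entries of `aggregation_schedule` are 1.0 and the remaining 4 are approximately 0.1.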
Appendix B Further experiments
B.1 Ablation experiment for choice of and
Tab. 6 shows the re-identification performance (mAP) obtained by averaging I3D features of multiple clips as we vary the number of clips () and the clip length (). We can observe that while has a better performance than when using a single clip , it performs worse when the number of clips averaged is larger. For , and have similar performances, i.e. vs. . Considering the higher computational cost with , we have used , with higher , for the experiments in the paper.
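The averaging baseline evaluated in this ablation can be sketched as follows. The function names are hypothetical, features are represented as plain lists of floats, and the Euclidean distance used for comparing two sequences is an assumption for illustration.

```python
import math


def average_clip_features(clip_features):
    """Element-wise mean of per-clip feature vectors (one vector per clip)."""
    num_clips = len(clip_features)
    dim = len(clip_features[0])
    return [sum(f[d] for f in clip_features) / num_clips for d in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sequence_distance(query_clips, gallery_clips):
    """Distance between two sequences, each summarised by its mean clip feature."""
    return euclidean(average_clip_features(query_clips),
                     average_clip_features(gallery_clips))
```

Under this scheme, increasing the number of clips only changes how many vectors enter the mean, while the clip length affects the cost of extracting each vector.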