Fairest of Them All: Establishing a Strong Baseline for Cross-Domain Person ReID
Person re-identification (ReID) remains a very difficult challenge in computer vision, and is critical for large-scale video surveillance scenarios where an individual could appear in different camera views at different times. There has been recent interest in tackling this challenge using cross-domain approaches, which leverage data from source domains that are different from the target domain. Such approaches are more practical for real-world widespread deployment given that they do not require on-site training (as with unsupervised or domain transfer approaches) or on-site manual annotation and training (as with supervised approaches). In this study, we take a systematic approach to establishing a large baseline source domain and target domain for cross-domain person ReID. We accomplish this by conducting a comprehensive analysis of the similarities between source domains proposed in the literature, and by studying the effects of incrementally increasing the size of the source domain. This allows us to establish a balanced source domain and target domain split that promotes variety in both source and target domains. Furthermore, using lessons learned from state-of-the-art supervised person ReID methods, we establish a strong baseline method for cross-domain person ReID. Experiments show that a source domain composed of two of the largest person ReID domains (SYSU and MSMT) performs well across six commonly used target domains. Furthermore, we show that, surprisingly, two of the recent commonly used domains (PRID and GRID) have too few query images to provide meaningful insights. As such, based on our findings, we propose the following balanced baseline for cross-domain person ReID consisting of: i) a fixed multi-source domain consisting of SYSU, MSMT, Airport and 3DPeS, and ii) a multi-target domain consisting of Market-1501, DukeMTMC-reID, CUHK03, PRID, GRID and VIPeR.
A fundamental computer vision problem in large-scale computerized video surveillance is identifying an individual across a multitude of cameras in a given environment. This requires the ability to match an individual from one camera view to another, and is commonly referred to as the person re-identification (ReID) problem. One of the most promising approaches to person ReID, which has achieved state-of-the-art results in recent years, leverages deep convolutional neural networks (CNNs) to learn an embedding space [zheng2015scalable, Zeng2018hierarchical, liu2017Stepwise, wang2018transferable]. Current research on person ReID using deep convolutional neural networks can be categorized as follows (Fig. 1):
Supervised: Supervised approaches explicitly assume that manually annotated data, in the form of matched individuals across cameras, is available for the camera network where person ReID is needed (the target domain). The manually annotated data, used for training, must be sufficiently large and span all cameras in the target domain. Furthermore, for best performance an implicit assumption is made that the manually annotated data spans all expected environmental conditions in the target domain. To this end, existing datasets consist of data collected from a single environment over a short time period and randomly split into training and testing sets.
Cross-Domain: Cross-domain approaches assume the availability of manually annotated data from one or more source domains that are different from the target domain. An embedding space is trained on the source domains, then applied to the target domain. This approach implicitly assumes either that the source domain is similar to the target domain, or that the source domain has enough environmental variation to capture the differences in the target domain. Techniques for handling differences between source and target domains are studied in [song2019generalizable, marchwica2018evaluation, yu2017cross].
Domain Adaptation: Domain adaptation approaches start with a cross-domain result, that is, an embedding space learned from a different source domain, and then utilize unlabeled data from the target domain to adapt to the target domain. There are two main approaches here: i) use unlabeled data to adjust to the target domain by transforming target domain images to look more like the source domain images (or vice versa) [Wei2018GAN, deng2018image], or ii) automatically annotate unlabeled target domain data and re-train models [yu2017cross]. The latter approach can be considered a form of self-learning. Such approaches can be considered a good compromise between supervised and unsupervised approaches.
Unsupervised: Unsupervised approaches do not use manually annotated data from the target domain or any source domains. They only utilize unlabeled data from the target domain to train the deep model’s embedding space. This is a pure self-learning problem that typically utilizes the inherent spatio-temporal nature of a camera network to initialize the self-learning process [li2018unsupervised].
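To make the cross-domain setting above concrete, the following is a minimal sketch of how a fixed embedding, trained only on source domains, could be scored on a target domain by nearest-neighbour matching. The embeddings are toy vectors standing in for the output of a source-trained model. The function names, the cosine distance, and the rule of skipping same-identity, same-camera gallery images are illustrative assumptions, not the protocol of this study.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_distance(a, b):
    """Cosine distance between two embeddings (0 means same direction)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def rank1_accuracy(query, gallery):
    """Fraction of queries whose nearest gallery image has the same identity.

    query, gallery: lists of (embedding, person_id, camera_id) tuples.
    Assumption: gallery images of the query identity taken by the query's
    own camera are skipped, a convention used by several ReID benchmarks.
    """
    hits, valid = 0, 0
    for q_emb, q_id, q_cam in query:
        candidates = [
            (cosine_distance(q_emb, g_emb), g_id)
            for g_emb, g_id, g_cam in gallery
            if not (g_id == q_id and g_cam == q_cam)
        ]
        if not candidates:
            continue  # nothing to match against for this query
        valid += 1
        _, best_id = min(candidates, key=lambda c: c[0])
        hits += int(best_id == q_id)
    return hits / valid if valid else 0.0


# Toy target-domain data: no target images were used to build the embeddings.
gallery = [([0.9, 0.1], "A", 1), ([0.2, 0.8], "B", 1), ([0.7, 0.3], "C", 1)]
query = [([1.0, 0.0], "A", 0), ([0.0, 1.0], "B", 0)]
print(rank1_accuracy(query, gallery))
```

Because nothing in this procedure touches target-domain labels or training, the same model can be scored on any number of target domains, which is what makes a fixed source/target split comparable across methods.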
Supervised approaches are the most extensively studied approaches in person ReID [zhang2019densely, tay2019aanet, zheng2019pyramidal, zheng2019joint] and, as expected, achieve the best results so far. However, from a practical point of view this approach is infeasible for large-scale deployment of person ReID because of the need for manually annotated data from every target domain. On the opposite end of the manual annotation spectrum are unsupervised approaches, where no manual annotation is needed at all. This is a promising avenue of research with strong results already reported [song2018unsupervised, li2018unsupervised, qin2015unsupervised, fan2018unsupervised]. However, this approach assumes the availability of training hardware at the target sites or the ability to transfer unlabeled data from the target site to a training facility. Again, this complicates large-scale deployment of person ReID. Furthermore, the system requires a learning period during which person ReID functionality is not available at all. Domain adaptation is a compromise between supervised and unsupervised approaches, but it still requires on-site training for the adaptation step.
Cross-domain approaches can be regarded as the ideal approaches for practical deployment because a model pre-trained on annotated source domain(s) can simply be deployed to any target domain without on-site training. Currently, cross-domain research in person ReID is split into two key directions: some works report cross-domain results as part of domain adaptation research [li2018adaptation], while other works look directly at the cross-domain problem [jia2019frustratingly, song2019generalizable]. As a result, there is a lack of consistency in the use of source domains and target domains, which makes comparing different approaches difficult.
Many papers [li2014deepreid, lisanti2015person, li2018unsupervised] consider single source domain and single target domain scenarios. This makes comparison between different approaches easier but does not take advantage of the multiple existing person ReID datasets. To address this shortcoming, recent works [jia2019frustratingly, marchwica2018evaluation, song2019generalizable] have combined multiple datasets to form a large source domain and tested on several small target domains. However, these target domains have nothing in common with those used by the earlier single-source, single-target approaches. Furthermore, the target domains used in these multi-source domain papers [jia2019frustratingly, song2019generalizable] are very small datasets, and as such we cannot draw any strong conclusions about performance on them.
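One way to see why very small target domains limit the conclusions that can be drawn is to look at the sampling uncertainty of a rank-1 accuracy. The sketch below treats each query as an independent success/failure trial and uses the normal approximation to the binomial distribution. The query counts and the 50% accuracy are hypothetical values chosen for illustration, not the sizes or results of any dataset discussed here.

```python
import math


def rank1_halfwidth(accuracy, num_queries, z=1.96):
    """Approximate 95% confidence half-width for a rank-1 accuracy.

    Assumption: queries are independent Bernoulli trials, so the standard
    error is sqrt(p * (1 - p) / n); z = 1.96 gives a ~95% interval.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / num_queries)


# Hypothetical query-set sizes, not taken from any benchmark.
for n in (100, 1000, 10000):
    halfwidth = rank1_halfwidth(0.5, n)
    print(f"{n:>6} queries: rank-1 = 50.0% +/- {100 * halfwidth:.1f} points")
```

Under these assumptions, a difference of a few points between two methods falls inside the interval when only around a hundred queries are available, so such a gap cannot be separated from sampling noise.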
In this study, we take a systematic approach to establishing a large baseline source domain and target domain for cross-domain person ReID. We accomplish this by conducting a comprehensive analysis of the similarities between source domains proposed in the literature, and by studying the effects of incrementally increasing the size of the source domain. This allows us to establish a balanced source domain and target domain split that promotes variety in both source and target domains. Furthermore, using lessons learned from state-of-the-art supervised person ReID methods, we establish a strong baseline method for cross-domain person ReID.
In summary, the key contributions of this study are as follows:
conduct a comprehensive analysis to study the similarities between source domains proposed in literature,
study the effects of incrementally increasing the size of the source domain,
establish a large baseline source domain and target domain for cross-domain person ReID based on the findings of the above analysis, and
establish a strong baseline method for cross-domain person ReID using lessons learned from the state-of-the-art supervised person ReID methods.
|Dataset||# IDs||# Images||# Cams||Common Test-Set|