Multiple Human Association between Top and Horizontal Views by Matching Subjects’ Spatial Distributions
Video surveillance can be significantly enhanced by using both top-view data, e.g., from drone-mounted cameras in the air, and horizontal-view data, e.g., from wearable cameras on the ground. Collaborative analysis of different-view data can facilitate various applications, such as human tracking, person identification, and human activity recognition. However, the first step of such collaborative analysis is to associate people, referred to as subjects in this paper, across the two views. This is a very challenging problem due to the large human-appearance difference between top and horizontal views. In this paper, we present a new approach that addresses this problem by exploring and matching the subjects' spatial distributions between the two views. More specifically, we model the subjects' relative positions to the horizontal-view camera in both views and define a matching cost to decide the actual location and view angle of the horizontal-view camera in the top-view image. We collect a new dataset consisting of top-view and horizontal-view image pairs for performance evaluation, and the experimental results show the effectiveness of the proposed method.
1 Introduction

The advancement of moving-camera technologies provides a new perspective for video surveillance. Unmanned aerial vehicles (UAVs), such as drones in the air, can provide top views of a group of subjects on the ground. Wearable cameras, such as Google Glass and GoPro, mounted over the head of a wearer (one of the subjects on the ground), can provide horizontal views of the same group of subjects. As shown in Fig. 1, the data collected from these two views well complement each other – top-view images contain no mutual occlusions and well exhibit a global picture and the relative positions of the subjects, while horizontal-view images can capture the detailed appearance, action, and behavior of subjects of interest at a much closer distance. Clearly, their collaborative analysis can significantly improve video-surveillance capabilities such as human tracking, human detection, and activity recognition.
The first step for such a collaborative analysis is to accurately associate the subjects across these two views, i.e., we need to identify each person present in both views and locate him in each view, as shown in Fig. 1. In general, this can be treated as a person re-identification (re-id) problem – for each subject in one view, re-identify him in the other view. However, this is a very challenging person re-id problem because the same subject may show totally different appearance in top and horizontal views, not to mention that the top view of a subject contains very limited features by only showing the top of the head and shoulders, making it very difficult to distinguish different subjects from their top views, as shown in Fig. 1.
Prior works [1, 2, 3] tried to alleviate the challenge of this problem by assuming that 1) the view direction of the top-view camera in the air has a certain slope such that the subjects' bodies, and even part of the background, are still partially visible in top views and can be used for feature matching to the horizontal views, and 2) the view angle of the horizontal-view camera on the ground is consistent with the moving direction of the camera wearer and can be easily estimated by computing optical flow in the top-view videos, which can be used to identify the on-the-ground camera wearer in the top-view video. These two assumptions, however, limit their applicability in practice, e.g., the horizontal-view camera wearer may turn his head (and therefore the head-mounted camera) when he walks, leading to inconsistency between his moving direction and the wearable-camera view direction.
In this paper, we develop a new approach to associate subjects across top and horizontal views without the above two assumptions. Our main idea is to explore the spatial distribution of the subjects for cross-view subject association. From the horizontal-view image, we detect all the subjects, and estimate their depths and spatial distribution using the sizes and locations of the detected subjects, respectively. On the corresponding top-view image, we traverse each detected subject and each possible direction to localize the horizontal-view camera (wearer), as well as its view angle. For each traversed location and direction, we estimate the spatial distribution of all the visible subjects. We finally define a matching cost between the subjects' spatial distributions in top and horizontal views to decide the horizontal-view camera location and view angle, with which we can associate the subjects across the two views. In the experiments, we collect a new dataset consisting of image pairs from top and horizontal views for performance evaluation. Experimental results verify that the proposed method can effectively associate multiple subjects across top and horizontal views.
The main contributions of this paper are: 1) We propose to use the spatial distribution of multiple subjects for associating subjects across top and horizontal views, instead of the subject appearance and motion used in prior works. 2) We develop geometry-based algorithms to model and match the subjects' spatial distributions across top and horizontal views. 3) We collect a new dataset of top-view and horizontal-view images for evaluating the proposed cross-view subject association.
2 Related Work
Our work can be regarded as a problem of associating first-person and third-person cameras, which has been studied by many researchers. For example, Fan et al. [4] identify a first-person camera wearer in a third-person video by incorporating spatial and temporal information from the videos of both cameras. In [5], information from first- and third-person cameras, together with laser range data, is fused to improve depth perception and 3D reconstruction. Park et al. [14] predict gaze behavior in social scenes using both first- and third-person cameras. In [22], first- and third-person cameras are synchronized, followed by associating subjects between their videos. In [18], a first-person video is combined with multiple third-person videos for more reliable action recognition. The third-person cameras in these methods usually bear horizontal views or views with a certain slope angle. In contrast, the third-person camera in this paper is mounted on a drone and produces top-view images, making cross-view appearance matching a very difficult problem.
As mentioned above, cross-view subject association can be treated as a person re-id problem, which has been widely studied in recent years. Most existing re-id methods can be grouped into two classes: similarity learning and representation learning. The former focuses on learning the similarity metric, e.g., the invariant feature learning based models [10, 16, 23], classical metric learning models [9, 13, 7], and deep metric learning models [11, 21]. The latter focuses on feature learning, including low-level visual features such as color, shape, and texture [8, 12], and more recent CNN deep features [24, 20]. These methods assume that all the data are taken from horizontal views, with similar or different horizontal view angles, and almost all of these methods are based on appearance matching. In this paper, we attempt to re-identify subjects across top and horizontal views, where appearance matching is not an appropriate choice.
More related to our work is a series of recent works by Ardeshir and Borji [1, 2, 3] on building the association between top-view and horizontal-view cameras. In [1, 2], by jointly handling a set of egocentric (first-person) horizontal-view videos and a top-view video, a graph-matching based algorithm is developed to locate all the horizontal-view camera wearers in the top-view video. In [3], the problem is extended to locate not only the camera wearers, but also other horizontal-view subjects in the top-view video. However, as mentioned above, these methods are based on two assumptions: 1) the top-view camera bears a certain slope angle to enable the partial visibility of human bodies and the use of appearance matching for cross-view association, and 2) the looking-at direction of the horizontal-view camera is the same as the moving direction of the camera wearer. In this paper, we remove these two assumptions and leverage the spatial distribution of subjects for cross-view subject association. The methods developed in [1, 2, 3] require multi-frame video inputs since they need to estimate each subject's moving direction, while our method can associate a single frame in the top view with its corresponding frame in the horizontal view.
3 Proposed Method
In this section, we first give an overview of the proposed method and then elaborate on the main steps.
3.1 Overview

Given a top-view image and a horizontal-view image that are taken by the respective cameras at the same time, we detect all persons (referred to as subjects in this paper) in both images by a person detector [15]. Let $\mathcal{S}^t = \{s^t_i\}_{i=1}^{N_t}$ be the collection of subjects detected in the top-view image, with $s^t_i$ being the $i$-th detected subject. Similarly, let $\mathcal{S}^h = \{s^h_j\}_{j=1}^{N_h}$ be the collection of subjects detected in the horizontal-view image, with $s^h_j$ being the $j$-th detected subject. The goal of cross-view subject association is to identify all the matched subjects between $\mathcal{S}^t$ and $\mathcal{S}^h$ that indicate the same persons.
In this paper, we address this problem by exploring the spatial distributions of the detected subjects in both views. More specifically, from each detected subject in the top view, we infer a vector that reflects its relative position to the horizontal-view camera (wearer) on the ground. Then for each detected subject in the horizontal view, we also infer a vector to reflect its relative position to the horizontal-view camera on the ground. We associate the subjects detected in the two views by seeking matchings between the two vector sets $\mathcal{V}^t(\mathbf{c}, \theta)$ and $\mathcal{V}^h$, where $\mathbf{c}$ and $\theta$ are the location and view angle of the horizontal-view camera (wearer) in the top-view image and they are not known a priori. Finally, we define a matching cost function to measure the dissimilarity between the two vector sets and optimize this function to find the matching subjects between the two views, as well as the camera location $\mathbf{c}$ and camera view angle $\theta$. In the following, we elaborate on each step of the proposed method.
3.2 Vector Representation
In this section, we discuss how to derive $\mathcal{V}^t$ and $\mathcal{V}^h$. On the top-view image, we first assume that the horizontal-view camera location $\mathbf{c}$ and its view angle $\theta$ are given. This way, we can compute its field of view in the top-view image and all the detected subjects' relative positions to the horizontal-view camera on the ground. The horizontal-view image is egocentric, so we can compute the detected subjects' relative positions to the camera based on the subjects' sizes and positions in the horizontal-view image.
3.2.1 Top-View Vector Representation
As shown in Fig. 2(a), in the top-view image we can easily compute the left and right boundaries of the field of view of the horizontal-view camera, denoted by $\mathbf{l}_l$ and $\mathbf{l}_r$, respectively, based on the given camera location $\mathbf{c}$ and its view angle $\theta$. For a subject at $\mathbf{p}$ in the field of view, we estimate its relative position to the horizontal-view camera by using two geometry parameters $x$ and $d$, where $x$ is the (signed) distance to the horizontal-view camera along the (camera) right direction $\mathbf{r}$, as shown in Fig. 2(a), and $d$ is the depth. Based on the pinhole camera model, we can calculate them by

$x = f \tan\angle(\mathbf{p}-\mathbf{c}, \mathbf{o}), \quad d = \|\mathbf{p}-\mathbf{c}\| \cos\angle(\mathbf{p}-\mathbf{c}, \mathbf{o})$,   (1)

where $\angle(\cdot,\cdot)$ indicates the (signed) angle between two directions, $\mathbf{o}$ is the view direction of the horizontal-view camera, and $f$ is the focal length of the horizontal-view camera.
Next we consider the range of $x$. From Fig. 2(a), we can get

$x \in \left[-f\tan(\theta_v/2),\ f\tan(\theta_v/2)\right]$,   (2)

where $\theta_v$ is the field-of-view angle of the horizontal-view camera. To enable the matching to the vector representation from the horizontal view, we further normalize the value range of $x$ to $[-1, 1]$, i.e.,

$\hat{x} = \dfrac{x}{f\tan(\theta_v/2)}$.   (3)

With this normalization, we actually do not need the actual value of $f$ in the proposed method.
Let $\mathcal{S} \subseteq \mathcal{S}^t$ be the subset of detected subjects in the field of view in the top-view image. We can find the vector representation $(\hat{x}, d)$ for all of them and sort them by their $\hat{x}$ values in an ascending order. We then stack them together as

$\mathcal{V}^t = [\hat{\mathbf{x}}^t, \mathbf{d}^t]$,   (4)

where $n$ is the size of $\mathcal{S}$, and $\hat{\mathbf{x}}^t$ and $\mathbf{d}^t$ are the column-wise vectors of all the $\hat{x}$ and $d$ values of the subjects in the field of view, respectively.
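As a concrete illustration of this top-view construction, the following sketch (in Python; the paper's own implementation is in Matlab) computes the sorted, normalized $(\hat{x}, d)$ pairs for subjects visible in the field of view. The coordinate convention, the default field-of-view angle, and all names here are our assumptions, not specifics from the paper.

```python
import numpy as np

def top_view_vectors(subjects, cam, theta, fov=np.deg2rad(90)):
    """Sorted (x_hat, d) pairs for top-view subjects inside the assumed
    field of view of the horizontal-view camera (Section 3.2.1).

    subjects: (N, 2) subject positions in the top-view image plane.
    cam:      (2,) candidate camera (wearer) location, same plane.
    theta:    candidate view angle in radians.
    fov:      assumed field-of-view angle of the horizontal-view camera.
    """
    o = np.array([np.cos(theta), np.sin(theta)])   # view direction
    r = np.array([np.sin(theta), -np.cos(theta)])  # right direction
    vecs = []
    for p in np.asarray(subjects, dtype=float):
        rel = p - cam
        d = rel @ o                      # depth along the view direction
        if d <= 0:                       # behind the camera
            continue
        alpha = np.arctan2(rel @ r, d)   # signed angle to the view direction
        if abs(alpha) > fov / 2:         # outside the field of view
            continue
        x_hat = np.tan(alpha) / np.tan(fov / 2)  # normalized to [-1, 1]
        vecs.append((x_hat, d))
    vecs.sort()                          # ascending x_hat
    return np.array(vecs).reshape(-1, 2)
```

Note that the normalized $\hat{x}$ is a ratio of tangents, so the focal length cancels out and its actual value is never required.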
3.2.2 Horizontal-View Vector Representation
For each subject in the horizontal-view image, we also compute a vector representation to make it consistent with the top-view vector representation, i.e., the $\hat{x}$-value reflects the distance to the horizontal-view camera along the right direction and the $d$-value reflects the depth to the horizontal-view camera. As shown in Fig. 2(b), in the horizontal-view image, let $u$ and $h$ be the horizontal location and height of a detected subject, respectively. If we take the top-left corner of the image as the origin of the coordinates, $u - W/2$, with $W$ being the width of the horizontal-view image, is actually the subject's distance to the horizontal-view camera along the right direction. To facilitate the matching to the top-view vectors, we normalize its value range to $[-1, 1]$ by

$\hat{x} = \dfrac{u - W/2}{W/2}, \quad d = \dfrac{1}{h}$,   (5)

where we simply take the inverse of the subject height as its depth to the horizontal-view camera.
For all the detected subjects in the horizontal-view image, we can find their vector representations and sort them by their $\hat{x}$ values in an ascending order. We then stack them together as

$\mathcal{V}^h = [\hat{\mathbf{x}}^h, \mathbf{d}^h]$,   (6)

where $\hat{\mathbf{x}}^h$ and $\mathbf{d}^h$ are the column-wise vectors of all the $\hat{x}$ and $d$ values of the subjects detected in the horizontal-view image, respectively.
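The horizontal-view counterpart needs only the detected bounding boxes. A minimal sketch, assuming each detection is given as the horizontal center u of its box (pixels, origin at the top-left) and its pixel height h:

```python
import numpy as np

def horizontal_view_vectors(boxes, image_width):
    """Sorted (x_hat, d) pairs for subjects detected in the
    horizontal-view image (Section 3.2.2): x_hat is the horizontal box
    center normalized to [-1, 1], d is the inverse box height."""
    half_w = image_width / 2.0
    vecs = sorted(((u - half_w) / half_w, 1.0 / h) for u, h in boxes)
    return np.array(vecs).reshape(-1, 2)
```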
3.3 Vector Matching
In this section, we associate the subjects across the two views by matching the vectors between the two vector sets $\mathcal{V}^t$ and $\mathcal{V}^h$. Since the $\hat{x}$ values of both vector sets have been normalized to the range of $[-1, 1]$, they can be directly compared. However, the $d$ values in these two vector sets are not comparable, although both of them reflect the depth to the horizontal-view camera: the $d^t$ values are in terms of the number of pixels in the top-view image while the $d^h$ values are in terms of the number of pixels in the horizontal-view image. It is non-trivial to normalize them into the same scale given their errors in reflecting the true depth – taking $d = 1/h$ is a very rough depth estimate since it is very sensitive to subject-detection errors and height differences among subjects.
We first find reliable subset matchings between $\mathcal{V}^t$ and $\mathcal{V}^h$ and use them to estimate the scale difference between their corresponding $d$ values. More specifically, we find a scaling factor $\gamma$ to scale the $d^h$ values to make them comparable to the $d^t$ values. For this purpose, we use a RANSAC-like strategy [6]: for each element $\hat{x}^t_i$ in $\hat{\mathbf{x}}^t$, we find the nearest $\hat{x}^h_j$ in $\hat{\mathbf{x}}^h$. If $|\hat{x}^t_i - \hat{x}^h_j|$ is less than a very small threshold value, we consider $\hat{x}^t_i$ and $\hat{x}^h_j$ a matched pair and take the ratio $d^t_i/d^h_j$ of their corresponding depth values; the average of this ratio over all the matched pairs is taken as the scaling factor $\gamma$.
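This RANSAC-like scale estimation can be sketched as below; the threshold tau is an assumed placeholder, since the paper does not state its exact value here:

```python
import numpy as np

def depth_scale(Vt, Vh, tau=0.02):
    """Estimate the factor that scales horizontal-view depths to
    top-view depths (Section 3.3). Vt and Vh are (x_hat, d) arrays
    sorted by x_hat; tau is an assumed matching threshold."""
    ratios = []
    for x_t, d_t in Vt:
        j = np.argmin(np.abs(Vh[:, 0] - x_t))  # nearest x_hat in Vh
        if abs(Vh[j, 0] - x_t) < tau:          # reliable pair found
            ratios.append(d_t / Vh[j, 1])
    return float(np.mean(ratios)) if ratios else 1.0
```

When no pair passes the threshold, this sketch falls back to a neutral factor of 1; how the paper handles that case is not specified.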
With the scaling factor $\gamma$, we match $\mathcal{V}^t$ and $\mathcal{V}^h$ using dynamic programming (DP) [17]. Specifically, we define a dissimilarity matrix $D$ of dimension $n \times N_h$, where $D(i, j)$ is the dissimilarity between the $i$-th vector of $\mathcal{V}^t$ and the $j$-th vector of $\mathcal{V}^h$ and it is defined by

$D(i, j) = |\hat{x}^t_i - \hat{x}^h_j| + \lambda\,|d^t_i - \gamma d^h_j|$,   (7)

where $\lambda$ is a balance factor. Given that $\hat{\mathbf{x}}^t$ and $\hat{\mathbf{x}}^h$ are both ascending sequences, we use a dynamic programming algorithm to search a monotonic path in $D$ from $D(1, 1)$ to $D(n, N_h)$ to build the matching between $\mathcal{V}^t$ and $\mathcal{V}^h$ with the minimum total dissimilarity. If a vector in $\mathcal{V}^t$ matches multiple vectors in $\mathcal{V}^h$, we only keep the one with the smallest dissimilarity given in Eq. (7). After that, we check if a vector in $\mathcal{V}^h$ matches multiple vectors in $\mathcal{V}^t$ and again keep only the one with the smallest dissimilarity. This two-step operation guarantees that the resulting matching is one-to-one, and we denote by $m$ the number of final matched pairs. Denote the resulting matched vector subsets as $\tilde{\mathcal{V}}^t$ and $\tilde{\mathcal{V}}^h$, both of dimension $m \times 2$. We define a matching cost between $\tilde{\mathcal{V}}^t$ and $\tilde{\mathcal{V}}^h$ as

$C(\mathbf{c}, \theta) = \dfrac{1}{m}\sum_{k=1}^{m} D(i_k, j_k) - \beta m$,   (8)

where $\beta$ is a pre-specified factor and $(i_k, j_k)$ indexes the $k$-th matched pair. In this matching cost, the term $-\beta m$ encourages the inclusion of more vector pairs into the final matching, which is important when we use this matching cost to search for the optimal camera location $\mathbf{c}$ and view angle $\theta$, to be discussed next.
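The matching step can be sketched end to end: the pairwise dissimilarity of Eq. (7), a DTW-style monotonic path through the matrix, the two-step one-to-one pruning, and our reading of the cost of Eq. (8) as the mean pair dissimilarity minus a reward proportional to the number of matched pairs. The default values of lam and beta are illustrative, not the paper's.

```python
import numpy as np

def match_and_cost(Vt, Vh, gamma, lam=1.0, beta=0.015):
    """DP matching of two x_hat-sorted vector sets and the resulting
    matching cost (Section 3.3). lam and beta are illustrative defaults.
    Returns (one-to-one index pairs, cost)."""
    n, m = len(Vt), len(Vh)
    if n == 0 or m == 0:
        return [], float("inf")
    # Eq. (7): dissimilarity between every top/horizontal vector pair,
    # with horizontal depths rescaled by gamma.
    D = (np.abs(Vt[:, None, 0] - Vh[None, :, 0])
         + lam * np.abs(Vt[:, None, 1] - gamma * Vh[None, :, 1]))
    # Monotonic minimum-dissimilarity path from D(0,0) to D(n-1,m-1).
    acc = np.full((n, m), np.inf)
    acc[0, 0] = D[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i else np.inf,
                       acc[i, j - 1] if j else np.inf,
                       acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = D[i, j] + prev
    # Backtrack the path.
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while i or j:
        cands = [(acc[i - 1, j - 1], i - 1, j - 1)] if i and j else []
        if i:
            cands.append((acc[i - 1, j], i - 1, j))
        if j:
            cands.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(cands)
        path.append((i, j))
    # Two-step pruning to a one-to-one matching: per top-view index,
    # then per horizontal-view index, keep the smallest dissimilarity.
    best = {}
    for i, j in path:
        if i not in best or D[i, j] < D[i, best[i]]:
            best[i] = j
    by_j = {}
    for i, j in best.items():
        if j not in by_j or D[i, j] < D[by_j[j], j]:
            by_j[j] = i
    pairs = sorted((i, j) for j, i in by_j.items())
    # Our reading of Eq. (8): mean dissimilarity of the matched pairs
    # minus a reward for matching more pairs.
    cost = float(np.mean([D[i, j] for i, j in pairs]) - beta * len(pairs))
    return pairs, cost
```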
3.4 Detecting Horizontal-View Camera and View Angle
In calculating the matching cost of Eq. (8), we need to know the horizontal-view camera location $\mathbf{c}$ and its view angle $\theta$ to compute the vector set $\mathcal{V}^t$. In practice, we do not know $\mathbf{c}$ and $\theta$ a priori. As mentioned earlier, we exhaustively try all possible values for $\mathbf{c}$ and $\theta$ and then select the ones that lead to the minimum matching cost $C$. The matching with this minimum cost provides us the final cross-view subject association. For the view angle $\theta$, we sample its range $[0^\circ, 360^\circ)$ uniformly with an interval of $\Delta\theta$, and in the experiments, we will report results using different sample intervals. For the horizontal-view camera location $\mathbf{c}$, we simply try every subject detected in the top-view image as the camera (wearer) location.
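This exhaustive search is a small driver loop. In the sketch below, matching_cost is a hypothetical callback that evaluates the matching cost of Eq. (8) for one candidate location/angle, keeping the driver independent of how the cost itself is computed:

```python
import numpy as np

def localize_camera(top_subjects, matching_cost, step_deg=5):
    """Exhaustive search of Section 3.4: try every detected top-view
    subject as the camera (wearer) location and sample the view angle
    every step_deg degrees, keeping the pair with minimum matching cost.

    matching_cost(cam_location, theta) is a hypothetical callback that
    evaluates the cost of Eq. (8) for one candidate; returns the index
    of the best camera subject, the best angle, and the best cost."""
    best_cam, best_theta, best_cost = None, None, float("inf")
    for i, cam in enumerate(np.asarray(top_subjects, dtype=float)):
        for theta in np.deg2rad(np.arange(0.0, 360.0, step_deg)):
            cost = matching_cost(cam, theta)
            if cost < best_cost:
                best_cam, best_theta, best_cost = i, theta, cost
    return best_cam, best_theta, best_cost
```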
An occlusion in the horizontal-view image indicates that two subjects and the horizontal-view camera are collinear, as shown in Fig. 3(a). In this case, the subject with the larger depth is not visible in the horizontal view and we simply ignore this occluded subject in the vector representation $\mathcal{V}^t$. In practice, we set a tolerance threshold $\epsilon$, and if the $\hat{x}$ values of two subjects differ by less than $\epsilon$, we ignore the one with the larger depth. The entire cross-view subject association algorithm is summarized in Algorithm 1.
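The occlusion test reduces to comparing neighboring $\hat{x}$ values in the sorted top-view representation; eps below is an assumed tolerance, not the paper's value:

```python
import numpy as np

def drop_occluded(V, eps=0.01):
    """Remove top-view subjects hidden behind a nearly collinear,
    nearer subject (Fig. 3): if two consecutive x_hat values differ by
    less than eps, keep only the smaller depth. V is sorted by x_hat."""
    keep = []
    for x, d in V:
        if keep and abs(keep[-1][0] - x) < eps:
            if d < keep[-1][1]:          # the nearer subject stays visible
                keep[-1] = (x, d)
            continue
        keep.append((x, d))
    return np.array(keep).reshape(-1, 2)
```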
4 Experiments

In this section, we first describe the dataset used for performance evaluation and then introduce our experimental results.
4.1 Test Dataset
We did not find a publicly available dataset with corresponding top-view and horizontal-view images/videos and ground-truth labeling of the cross-view subject association. Therefore, we collect a new dataset for performance evaluation. Specifically, we use a GoPro HERO7 camera (mounted over the wearer's head) to take horizontal-view videos and a DJI Mavic 2 drone to take top-view videos. Both cameras were set to the same frame rate of 30 fps. We manually synchronize these videos such that corresponding frames are taken at the same time. We then temporally sample the two videos uniformly to construct frame (image) pairs for our dataset. Videos are taken at three different sites with different backgrounds, and the sampling interval is set to 100 frames to ensure the variety of the collected images. Finally, we obtain 220 image pairs from top and horizontal views, with the same image resolution for both views. We label the same persons across the two videos on all 220 image pairs. Note that this manual labeling is quite labor intensive given the difficulty of identifying persons in the top-view images (see Fig. 1 for an example).
To evaluate the proposed method more comprehensively, we examine all 220 image pairs and consider the following five attributes: Occ: horizontal-view images containing partially or fully occluded subjects; Hor_mov: horizontal-view images sampled from videos in which the camera wearer moves and rotates his head; Hor_rot: horizontal-view images sampled from videos in which the camera wearer rotates his head; Hor_sta: horizontal-view images sampled from videos in which the camera wearer stays static; TV_var: top-view images sampled from videos in which the drone moves up, down, and/or changes the camera-view direction. Table 1 shows the number of image pairs with each of these five attributes. Note that some image pairs show multiple attributes listed above.
| Attribute | Occ | Hor_mov | Hor_rot | Hor_sta | TV_var |
| # image pairs | 96 | 62 | 124 | 96 | 30 |
For each pair of images, we analyze two more properties. One is the number of subjects in an image, which reflects the level of crowdedness. The other is the proportion between the number of shared subjects in two views and the total number of subjects in an image. Both of them can be computed against either the top-view image or the horizontal-view image and their histograms on all 220 image pairs are shown in Fig. 4.
In this paper, we use two metrics for performance evaluation. 1) The accuracy in identifying the horizontal-view camera wearer in the top-view image, and 2) the precision and recall of cross-view subject association. We do not include the camera-view angle for evaluation because it is difficult to annotate its ground truth.
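For concreteness, the per-image-pair scores of the second metric can be computed by treating both the predicted and the ground-truth associations as sets of (top-view id, horizontal-view id) pairs; this is a standard precision/recall computation, sketched here under that assumption:

```python
def association_scores(predicted, ground_truth):
    """Precision and recall of cross-view subject association for one
    image pair; both arguments are iterables of (top_id, horiz_id)."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)   # correctly associated pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```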
4.2 Experiment Setup
We implement the proposed method in Matlab and run it on a desktop computer with an Intel Core i7 3.4GHz CPU. We use the general YOLO [15] detector to detect subjects in the form of bounding boxes in both top-view and horizontal-view images. (We use the YOLOv3 version of the detector. For top-view subject detection, we fine-tune the network using 600 top-view human images that have no overlap with our test images.) The pre-specified parameters $\lambda$ and $\beta$ are set to 25 and 0.015, respectively. We will further discuss the influence of these parameters in Section 4.4.
We did not find available methods with code that can directly handle our top- and horizontal-view subject association. One related work is [3] for cross-view matching. However, we could not directly include it in the comparison because 1) its code is not available to the public, and 2) it computes optical flow to estimate subjects' moving directions and therefore cannot handle a pair of static images in our dataset. Actually, the method in [3] assumes a certain slope view angle of the top-view camera and uses appearance matching for cross-view association, similar to the appearance-matching-based person re-id methods.
In this paper, we chose a recent person re-id method [19] for comparison. We take each subject detected in the horizontal-view image as a query and search for it in the set of subjects detected in the top-view image. We tried two versions of this re-id method: one is retrained from scratch using 1,000 sample subjects collected by ourselves (no overlap with the subjects in our test dataset) and the other is fine-tuned from the version provided in [19] using these 1,000 sample subjects.
4.3 Results

We apply the proposed method to all 220 pairs of images in our dataset. We detect the horizontal-view camera wearer in the top-view image as described in Section 3.4 and the detection accuracy is 84.1%. We also use the Cumulative Matching Characteristic (CMC) curve to evaluate the matching accuracy, as shown in Fig. 5(a), where the horizontal and vertical axes are the CMC rank and the matching accuracy, respectively.
For a pair of images, we use the precision and recall scores to evaluate the cross-view subject association. As shown in Table 2, the average precision and recall scores of our method are 79.6% and 77.0%, respectively. In this table, 'Ours (w $\mathbf{c}$)' indicates the use of our method given the ground-truth camera location $\mathbf{c}$. We can find in this table that the re-id method, either retrained or fine-tuned, produces very poor results, which confirms the difficulty of using appearance features for the proposed cross-view subject association.
| Method | Precision | Recall | Prec.@1 | Reca.@1 |
| Ours (w $\mathbf{c}$) | 86.6 | 84.2 | 66.4 | 57.7 |
We also calculate the proportion of all the image pairs with precision or recall score of 1 (Prec.@1 and Reca.@1). They reach 60.0% and 50.9% respectively. The distributions of these two scores on all 220 image pairs are shown in Fig. 5(b). In Table 3, we report the evaluation results on different subsets with respective attributes. We can see that the proposed method is not sensitive to the motion of both top-view and horizontal-view cameras, which is highly desirable for motion-camera applications.
4.4 Ablation Studies
Step length for $\theta$. We study the influence of the value $\Delta\theta$, the step length for searching the optimal camera view angle $\theta$ in the range $[0^\circ, 360^\circ)$. We set the value of $\Delta\theta$ to $1^\circ$, $5^\circ$ and $10^\circ$, respectively, and the association results are shown in Table 4. As expected, the smallest step length $\Delta\theta = 1^\circ$ leads to the highest performance, although a larger step length, such as $\Delta\theta = 10^\circ$, also produces acceptable results.
Vector representation. Next we compare the association results using different vector-representation methods, as shown in Table 5. The first row denotes that we represent the subjects in the two views by the one-dimensional vectors $\hat{\mathbf{x}}^t$ and $\hat{\mathbf{x}}^h$, respectively. The second row denotes that we represent the subjects in the two views by the one-dimensional vectors $\mathbf{d}^t$ and $\mathbf{d}^h$, respectively, which are simply normalized to the same range to make them comparable. The third row denotes that we combine the one-dimensional vectors of the first and second rows to represent each view, which differs from our proposed method (the fourth row of Table 5) only in the normalization of $\mathbf{d}^t$ and $\mathbf{d}^h$ – our proposed method uses a RANSAC strategy. By comparing the results in the third and fourth rows, we can see that using the RANSAC strategy to estimate the scaling factor $\gamma$ does improve the final association performance. The results in the first and second rows show that using only one dimension of the proposed vector representation cannot achieve performance as good as the proposed method that combines both dimensions. We can also see that $\hat{\mathbf{x}}^t$ and $\hat{\mathbf{x}}^h$ provide more accurate information than $\mathbf{d}^t$ and $\mathbf{d}^h$ when used for cross-view subject association.
Parameter selection. There are two free parameters, $\lambda$ and $\beta$, in Eq. (8). We select different values for them to examine their influence on the final association performance. Table 6 reports the results obtained by varying one of these two parameters while fixing the other. We can see that the final association precision and recall scores are not very sensitive to the selected values of these two parameters.
Detection method. In order to analyze the influence of subject-detection accuracy on the proposed cross-view association, we tried different subject-detection inputs. As shown in Table 7, in the first row, we use manually annotated bounding boxes of each subject in both views for the proposed association. In the second and third rows, we use manually annotated subjects on the top-view images and the horizontal-view images, respectively, while using automatically detected subjects [15] on the other-view images. In the fourth row, we automatically detect subjects in both views first, and then only keep those whose bounding boxes show sufficient IoU (Intersection over Union) with a manually annotated subject. We can see that the use of manually annotated subjects produces much better cross-view subject association. This indicates that further efforts on improving subject detection will benefit the association.
| Method | Precision | Recall | Prec.@1 | Reca.@1 |
| Automatic w/ selection | 80.7 | 76.1 | 69.6 | 52.7 |
| Method | Whole dataset | Occ subset |
Number of associated subjects. We investigate the correlation between the association performance and the number of associated subjects. Figure 6(a) shows the average association performance on the image pairs with different numbers of associated subjects. We can see that the association results get worse when the number of associated subjects is too high or too low. When there are too many associated subjects, the crowdedness in the horizontal view may prevent the accurate detection of subjects. When there are too few subjects, the constructed vector representation is not sufficiently discriminative to locate the camera location $\mathbf{c}$ and camera-view angle $\theta$. Figure 6(b) shows the average association performance on the image pairs with different proportions of associated subjects. More specifically, the performance at a value $p$ along the horizontal axis is the average precision/recall score on all the image pairs whose proportion of associated subjects (to the total number of subjects in the top-view image) is less than $p$. This confirms that on the images with a higher such proportion, the association can be more reliable.
Occlusion. Occlusions are very common, as shown in Table 1. Table 8 shows the association results on the entire dataset and on the subset of data with occlusions, using the proposed method with and without the step of identifying and ignoring occluded subjects. We can see that our simple strategy for handling occlusion can significantly improve the association performance on the image pairs with occlusions. Sample results on image pairs with occlusions are shown in the top row of Fig. 7, where associated subjects bear the same number labels. We can see that occlusions occur more often when 1) the subjects are crowded, and 2) one subject is very close to the horizontal-view camera.
Proportion of shared subjects. It is a common situation that many subjects in the two views are not the same persons. In this case, the shared subjects may account for only a small proportion of the subjects in both the top and horizontal views. Two examples are shown in the second row of Fig. 7. On the left, we show a case where many subjects in the top view are not in the field of view of the horizontal-view camera. On the right, we show a case where many subjects in the horizontal view are too far from the horizontal-view camera and not covered by the top-view camera. We can see that the proposed method can handle these two cases very well by exploring the spatial distribution of the shared subjects.
Failure cases. Finally, we give two failure cases, as shown in Fig. 8 – one caused by errors in subject detection (blue boxes) and the other caused by the close distance between multiple subjects, e.g., subjects 3, 4 and 5, in either the top or the horizontal view, which leads to erroneous occlusion detection and incorrect vector representations.
5 Conclusion

In this paper, we developed a new method to associate multiple subjects across top-view and horizontal-view images by modeling and matching the subjects' spatial distributions. We constructed a vector representation for all the detected subjects in the horizontal-view image and another vector representation for all the detected subjects in the top-view image that are located in the field of view of the horizontal-view camera. These two vector representations are then matched for cross-view subject association. We proposed a new matching cost function with which we can further optimize for the location and view angle of the horizontal-view camera in the top-view image. We collected a new dataset, as well as manually labeled ground-truth cross-view subject association, and experimental results on this dataset are very promising.
References

[1] S. Ardeshir and A. Borji. Ego2Top: Matching viewers in egocentric and top-view videos. In ECCV, 2016.
[2] S. Ardeshir and A. Borji. Egocentric meets top-view. IEEE TPAMI, 2018.
[3] S. Ardeshir and A. Borji. Integrating egocentric videos in top-view surveillance videos: Joint identification and temporal alignment. In ECCV, 2018.
[4] C. Fan, J. Lee, M. Xu, K. K. Singh, Y. J. Lee, D. J. Crandall, and M. S. Ryoo. Identifying first-person camera wearers in third-person videos. In CVPR, 2017.
[5] F. Ferland, F. Pomerleau, C. T. L. Dinh, and F. Michaud. Egocentric and exocentric teleoperation interface using real-time, 3D video projection.
[6] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
[7] G. Lisanti, I. Masi, A. D. Bagdanov, and A. Del Bimbo. Person re-identification by iterative re-weighted sparse ranking. IEEE TPAMI, 2015.
[8] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, 2008.
[9] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
[10] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
[11] K. B. Low and U. U. Sheikh. Learning hierarchical representation using siamese convolution neural network for human re-identification. In ICDIM, 2016.
[12] B. Ma, S. Yu, and F. Jurie. Local descriptors encoded by Fisher vectors for person re-identification. In ECCV, 2012.
[13] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Learning to rank in person re-identification with metric ensembles. In CVPR, 2015.
[14] H. S. Park, E. Jain, and Y. Sheikh. Predicting primary gaze behavior using social saliency fields. In ICCV, 2013.
[15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[16] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, 2014.
[17] M. Sniedovich. Dynamic Programming: Foundations and Principles. Monographs and Textbooks in Pure and Applied Mathematics, 2011.
[18] B. Soran, A. Farhadi, and L. G. Shapiro. Action recognition in the presence of one egocentric and multiple static cameras. In ACCV, 2014.
[19] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 2018.
[20] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
[21] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In ECCV, 2016.
[22] M. Xu, C. Fan, Y. Wang, M. S. Ryoo, and D. J. Crandall. Joint person segmentation and identification in synchronized first- and third-person videos. In ECCV, 2018.
[23] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li. Salient color names for person re-identification. In ECCV, 2014.
[24] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian. Person re-identification in the wild. In CVPR, 2017.