Multiple Human Association between Top and Horizontal Views by Matching Subjects’ Spatial Distributions


Ruize Han, Yujun Zhang, Wei Feng, Chenxing Gong, Xiaoyu Zhang, Jiewen Zhao, Liang Wan, Song Wang
School of Computer Science and Technology, Tianjin University, Tianjin, China
Abstract

Video surveillance can be significantly enhanced by using both top-view data, e.g., those from drone-mounted cameras in the air, and horizontal-view data, e.g., those from wearable cameras on the ground. Collaborative analysis of such different-view data can facilitate various applications, such as human tracking, person identification, and human activity recognition. However, the first step of such collaborative analysis is to associate people, referred to as subjects in this paper, across the two views. This is a very challenging problem due to the large difference in human appearance between top and horizontal views. In this paper, we present a new approach that addresses this problem by exploring and matching the subjects’ spatial distributions between the two views. More specifically, we model the subjects’ relative positions to the horizontal-view camera in both views and define a matching cost to decide the actual location and view angle of the horizontal-view camera in the top-view image. We collect a new dataset consisting of top-view and horizontal-view image pairs for performance evaluation, and the experimental results show the effectiveness of the proposed method.

1 Introduction

The advancement of moving-camera technologies provides a new perspective for video surveillance. Unmanned aerial vehicles (UAVs), such as drones in the air, can provide top views of a group of subjects on the ground. Wearable cameras, such as Google Glass and GoPro, mounted over the head of a wearer (one of the subjects on the ground), can provide horizontal views of the same group of subjects. As shown in Fig. 1, the data collected from these two views complement each other well – top-view images contain no mutual occlusions and exhibit a global picture and the relative positions of the subjects, while horizontal-view images capture the detailed appearance, action, and behavior of subjects of interest at a much closer distance. Clearly, their collaborative analysis can significantly improve video-surveillance capabilities such as human tracking, human detection, and activity recognition.

Figure 1: An illustration of the top-view (left) and horizontal-view (right) images. The former is taken by a camera mounted on a drone in the air and the latter is taken by a GoPro worn by a wearer walking on the ground. The proposed method identifies, on the top-view image, the location and view angle of the camera (indicated by the red box) that produces the horizontal-view image, and associates subjects, indicated by boxes of identical color, across the two views.

The first step of such a collaborative analysis is to accurately associate the subjects across the two views, i.e., we need to identify any person present in both views and locate him in each view, as shown in Fig. 1. In general, this can be treated as a person re-identification (re-id) problem – for each subject in one view, re-identify him in the other view. However, this is a very challenging re-id problem because the same subject may show totally different appearance in the top and horizontal views, not to mention that the top view of a subject contains very limited features, showing only the top of the head and shoulders, which makes it very difficult to distinguish different subjects from their top views, as shown in Fig. 1.

Prior works [1, 2, 3] tried to alleviate this challenge by assuming that 1) the view direction of the top-view camera in the air has a certain slope such that the subjects’ bodies, and even part of the background, are still partially visible in the top view and can be used for feature matching to the horizontal views, and 2) the view angle of the horizontal-view camera on the ground is consistent with the moving direction of the camera wearer and can be easily estimated by computing optical flow in the top-view videos, which can then be used to identify the on-the-ground camera wearer in the top-view video. These two assumptions, however, limit their applicability in practice, e.g., the horizontal-view camera wearer may turn his head (and therefore the head-mounted camera) while walking, leading to inconsistency between his moving direction and the wearable-camera view direction.

In this paper, we develop a new approach to associate subjects across top and horizontal views without the above two assumptions. Our main idea is to explore the spatial distribution of the subjects for cross-view subject association. From the horizontal-view image, we detect all the subjects and estimate their depths and spatial distribution using the sizes and locations of the detected subjects, respectively. On the corresponding top-view image, we traverse each detected subject and possible view direction to localize the horizontal-view camera (wearer), as well as its view angle. For each traversed location and direction, we estimate the spatial distribution of all the visible subjects. We finally define a matching cost between the subjects’ spatial distributions in the top and horizontal views to decide the horizontal-view camera location and view angle, with which we can associate the subjects across the two views. For the experiments, we collect a new dataset consisting of image pairs from top and horizontal views for performance evaluation. Experimental results verify that the proposed method can effectively associate multiple subjects across top and horizontal views.

The main contributions of this paper are: 1) We propose to use the spatial distribution of multiple subjects for associating subjects across top and horizontal views, instead of the subject appearance and motion used in prior works. 2) We develop geometry-based algorithms to model and match the subjects’ spatial distributions across top and horizontal views. 3) We collect a new dataset of top-view and horizontal-view images for evaluating the proposed cross-view subject association.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 elaborates on the proposed method and Section 4 reports the experimental results, followed by a brief conclusion in Section 5.

2 Related Work

Our work can be regarded as a problem of associating first-person and third-person cameras, which has been studied by many researchers. For example, Fan et al. [4] identify a first-person camera wearer in a third-person video by incorporating spatial and temporal information from the videos of both cameras. In [5], information from first- and third-person cameras, together with laser range data, is fused to improve depth perception and 3D reconstruction. Park et al. [14] predict gaze behavior in social scenes using both first- and third-person cameras. In [22], first- and third-person cameras are synchronized, followed by associating subjects between their videos. In [18], a first-person video is combined with multiple third-person videos for more reliable action recognition. The third-person cameras in these methods usually bear horizontal views or views with a certain slope angle. In contrast, in this paper the third-person camera is mounted on a drone and produces top-view images, which makes cross-view appearance matching a very difficult problem.

As mentioned above, cross-view subject association can be treated as a person re-id problem, which has been widely studied in recent years. Most existing re-id methods can be grouped into two classes: similarity learning and representation learning. The former focuses on learning the similarity metric, e.g., the invariant feature learning based models [10, 16, 23], classical metric learning models [9, 13, 7], and deep metric learning models [11, 21]. The latter focuses on feature learning, including low-level visual features such as color, shape, and texture  [8, 12], and more recent CNN deep features [24, 20]. These methods assume that all the data are taken from horizontal views, with similar or different horizontal view angles, and almost all of these methods are based on appearance matching. In this paper, we attempt to re-identify subjects across top and horizontal views, where appearance matching is not an appropriate choice.

More related to our work is a series of recent works by Ardeshir and Borji [1, 2, 3] on building associations between top-view and horizontal-view cameras. In [1, 2], by jointly handling a set of egocentric (first-person) horizontal-view videos and a top-view video, a graph-matching based algorithm is developed to locate all the horizontal-view camera wearers in the top-view video. In [3], the problem is extended to locate not only the camera wearers, but also other horizontal-view subjects in the top-view video. However, as mentioned above, these methods are based on two assumptions: 1) the top-view camera bears a certain slope angle to enable the partial visibility of the human body and the use of appearance matching for cross-view association, and 2) the viewing direction of the horizontal-view camera is the same as the moving direction of the camera wearer. In this paper, we remove these two assumptions and leverage the spatial distribution of subjects for cross-view subject association. In addition, the methods developed in [1, 2, 3] require multi-frame video inputs since they need to estimate each subject’s moving direction, while our method can associate a single top-view frame with its corresponding horizontal-view frame.

3 Proposed Method

In this section, we first give an overview of the proposed method and then elaborate on the main steps.

3.1 Overview

Figure 2: An illustration of vector representation in (a) top view and (b) horizontal view.

Given a top-view image and a horizontal-view image that are taken by the respective cameras at the same time, we detect all persons (referred to as subjects in this paper) on both images by a person detector [15]. Let $\mathcal{S}^t = \{s^t_1, s^t_2, \ldots, s^t_{N_t}\}$ be the collection of subjects detected on the top-view image, with $s^t_i$ being the $i$-th detected subject. Similarly, let $\mathcal{S}^h = \{s^h_1, s^h_2, \ldots, s^h_{N_h}\}$ be the collection of subjects detected on the horizontal-view image, with $s^h_j$ being the $j$-th detected subject. The goal of cross-view subject association is to identify all the matched subjects between $\mathcal{S}^t$ and $\mathcal{S}^h$ that correspond to the same persons.

In this paper, we address this problem by exploring the spatial distributions of the detected subjects in both views. More specifically, for each detected subject in the top view, we infer a vector that reflects its relative position to the horizontal-view camera (wearer) on the ground. For each detected subject in the horizontal view, we also infer a vector that reflects its relative position to the horizontal-view camera on the ground. We associate the subjects detected in the two views by seeking matchings between the two vector sets $\mathbf{V}^t(\mathbf{c}, \theta)$ and $\mathbf{V}^h$, where $\mathbf{c}$ and $\theta$ are the location and view angle of the horizontal-view camera (wearer) in the top-view image, both unknown a priori. Finally, we define a matching cost function to measure the dissimilarity between the two vector sets and optimize this function to find the matching subjects between the two views, as well as the camera location $\mathbf{c}$ and camera-view angle $\theta$. In the following, we elaborate on each step of the proposed method.

3.2 Vector Representation

In this section, we discuss how to derive $\mathbf{V}^t(\mathbf{c}, \theta)$ and $\mathbf{V}^h$. On the top-view image, we first assume that the horizontal-view camera location $\mathbf{c}$ and its view angle $\theta$ are given. This way, we can compute its field of view in the top-view image and all the detected subjects’ relative positions to the horizontal-view camera on the ground. The horizontal-view image is egocentric, and we can compute the detected subjects’ relative positions to the camera based on the subjects’ sizes and positions in the horizontal-view image.

3.2.1 Top-View Vector Representation

As shown in Fig. 2(a), in the top-view image we can easily compute the left and right boundaries of the field of view of the horizontal-view camera, denoted by $\mathbf{b}_l$ and $\mathbf{b}_r$, respectively, based on the given camera location $\mathbf{c}$ and its view angle $\theta$, which determines the camera view direction $\mathbf{d}$. For a subject located at $\mathbf{p}_i$ in the field of view, we estimate its relative position to the horizontal-view camera by two geometry parameters $x_i$ and $z_i$, where $x_i$ is the (signed) distance to the horizontal-view camera along the (camera) right direction $\mathbf{r}$, as shown in Fig. 2(a), and $z_i$ is the depth. Based on the pinhole camera model, we can calculate them by

$x_i = f \tan \angle(\mathbf{p}_i - \mathbf{c}, \mathbf{d}), \qquad z_i = \|\mathbf{p}_i - \mathbf{c}\| \cos \angle(\mathbf{p}_i - \mathbf{c}, \mathbf{d}), \qquad (1)$

where $\angle(\cdot, \cdot)$ indicates the (signed) angle between two directions and $f$ is the focal length of the horizontal-view camera.

Next we consider the range of $x_i$. From Fig. 2(a), we can get

$|\angle(\mathbf{p}_i - \mathbf{c}, \mathbf{d})| \leq \frac{\alpha}{2}, \qquad (2)$

where $\alpha$ is the given field-of-view angle of the horizontal-view camera as indicated in Fig. 2(a). From Eq. (2), we have $x_i \in [-f \tan\frac{\alpha}{2}, f \tan\frac{\alpha}{2}]$.

To enable the matching to the vector representation from the horizontal view, we further normalize the value range of $x_i$ to $[-1, 1]$, i.e.,

$\hat{x}_i = \frac{x_i}{f \tan\frac{\alpha}{2}} = \frac{\tan \angle(\mathbf{p}_i - \mathbf{c}, \mathbf{d})}{\tan\frac{\alpha}{2}}. \qquad (3)$

With this normalization, we actually do not need the actual value of the focal length $f$ in the proposed method.

Let $\mathcal{S}^t_v \subseteq \mathcal{S}^t$ be the subset of detected subjects lying in the field of view in the top-view image. We can find the vector representation $(\hat{x}_i, z_i)$ for all of them and sort them in terms of the $\hat{x}$ values in ascending order. We then stack them together as

$\mathbf{V}^t(\mathbf{c}, \theta) = [\hat{\mathbf{x}}^t, \mathbf{z}^t] \in \mathbb{R}^{N_v \times 2}, \qquad (4)$

where $N_v$ is the size of $\mathcal{S}^t_v$, and $\hat{\mathbf{x}}^t$ and $\mathbf{z}^t$ are the column-wise vectors of all the $\hat{x}$ and $z$ values of the subjects in the field of view, respectively.
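To make the construction of $\mathbf{V}^t(\mathbf{c}, \theta)$ concrete, below is a minimal Python/NumPy sketch of Eqs. (1)–(4). It is not the authors’ implementation; the function name, the array-based interface, and the handling of the camera view/right directions are our assumptions.

```python
import numpy as np

def top_view_vectors(subjects, cam, theta, alpha):
    """Sketch of V^t(c, theta): one (x_hat, z) pair per subject inside the
    horizontal-view camera's field of view, sorted by x_hat (Eqs. (1)-(4)).

    subjects : (N, 2) array of subject positions in the top-view image
    cam      : (2,) array, hypothesized camera (wearer) location c
    theta    : camera-view angle in radians
    alpha    : field-of-view angle in radians
    """
    d = np.array([np.cos(theta), np.sin(theta)])         # view direction
    rel = subjects - cam                                  # p_i - c
    depth = rel @ d                                       # depth along the view direction
    lateral = rel @ np.array([-d[1], d[0]])               # component along the right direction
    ang = np.arctan2(lateral, depth)                      # signed angle to the view direction
    visible = (depth > 0) & (np.abs(ang) <= alpha / 2)    # Eq. (2)
    x_hat = np.tan(ang[visible]) / np.tan(alpha / 2)      # Eq. (3); f cancels out
    z = depth[visible]                                    # Eq. (1), in top-view pixels
    order = np.argsort(x_hat)                             # ascending x_hat
    return np.stack([x_hat[order], z[order]], axis=1)     # Eq. (4)
```

Note that, as in Eq. (3), the focal length $f$ never appears: after normalization, only the angle to the view direction matters.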

3.2.2 Horizontal-View Vector Representation

For each subject in the horizontal-view image, we also compute a vector representation that is consistent with the top-view vector representation, i.e., the $\hat{x}$-value reflects the distance to the horizontal-view camera along the right direction and the $z$-value reflects the depth to the horizontal-view camera. As shown in Fig. 2(b), in the horizontal-view image, let $u_j$ and $h_j$ be the (horizontal) location and the height of a detected subject, respectively. If we take the top-left corner of the image as the origin of the coordinates, $u_j - \frac{W}{2}$, with $W$ being the width of the horizontal-view image, is actually the subject’s distance to the horizontal-view camera along the right direction. To facilitate the matching to the top-view vectors, we normalize its value range to $[-1, 1]$ by

$\hat{x}_j = \frac{u_j - \frac{W}{2}}{\frac{W}{2}} = \frac{2u_j - W}{W}, \qquad z_j = \frac{1}{h_j}, \qquad (5)$

where we simply take the inverse of the subject height $h_j$ as its depth to the horizontal-view camera.

For all detected subjects in the horizontal-view image, we can find their vector representations and sort them in terms of the $\hat{x}$ values in ascending order. We then stack them together as

$\mathbf{V}^h = [\hat{\mathbf{x}}^h, \mathbf{z}^h] \in \mathbb{R}^{N_h \times 2}, \qquad (6)$

where $\hat{\mathbf{x}}^h$ and $\mathbf{z}^h$ are the column-wise vectors of all the $\hat{x}$ and $z$ values of the subjects detected in the horizontal-view image, respectively.
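A corresponding sketch of Eqs. (5)–(6) from detected bounding boxes is given below; the assumed box format (left, top, width, height) and the function name are ours.

```python
import numpy as np

def horizontal_view_vectors(boxes, image_width):
    """Sketch of V^h from bounding boxes detected in the horizontal view.

    boxes       : (N, 4) array of [left, top, width, height] boxes (assumed format)
    image_width : width W of the horizontal-view image in pixels
    """
    u = boxes[:, 0] + boxes[:, 2] / 2.0              # horizontal box center u_j
    h = boxes[:, 3]                                  # subject height h_j in pixels
    x_hat = (2.0 * u - image_width) / image_width    # Eq. (5), normalized to [-1, 1]
    z = 1.0 / h                                      # inverse height as a rough depth
    order = np.argsort(x_hat)                        # ascending x_hat
    return np.stack([x_hat[order], z[order]], axis=1)  # Eq. (6)
```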

3.3 Vector Matching

In this section we associate the subjects across the two views by matching the vectors between the two vector sets $\mathbf{V}^t(\mathbf{c}, \theta)$ and $\mathbf{V}^h$. Since the $\hat{x}$ values of both vector sets have been normalized to the range $[-1, 1]$, they can be directly compared. However, the $z$ values in these two vector sets are not comparable, although both of them reflect the depth to the horizontal-view camera: the $\mathbf{z}^t$ values are measured in pixels of the top-view image, while the $\mathbf{z}^h$ values are derived from pixel heights in the horizontal-view image. It is non-trivial to normalize them into the same scale given their errors in reflecting the true depth – $1/h_j$ is only a very rough depth estimate since it is sensitive to subject-detection errors and to height differences among subjects.

We first find reliable subset matchings between $\hat{\mathbf{x}}^t$ and $\hat{\mathbf{x}}^h$ and use them to estimate the scale difference between the corresponding $z$ values. More specifically, we find a scaling factor $s$ to scale the $\mathbf{z}^h$ values and make them comparable to the $\mathbf{z}^t$ values. For this purpose, we use a RANSAC-alike strategy [6]: for each element $\hat{x}^h_j$ in $\hat{\mathbf{x}}^h$, we find the nearest $\hat{x}^t_i$ in $\hat{\mathbf{x}}^t$. If $|\hat{x}^t_i - \hat{x}^h_j|$ is less than a very small threshold value, we consider them a matched pair and take the ratio $z^t_i / z^h_j$ of their corresponding $z$ values; the average of this ratio over all the matched pairs is taken as the scaling factor $s$.
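The following sketch illustrates this RANSAC-alike scale estimation; the threshold value tau and the fallback when no reliable pair is found are our assumptions.

```python
import numpy as np

def estimate_scale(Vt, Vh, tau=0.05):
    """Estimate the scaling factor s that brings the z^h values to the
    scale of the z^t values, using reliable x_hat matches only.

    Vt, Vh : (Nv, 2) and (Nh, 2) arrays of (x_hat, z) rows
    tau    : small threshold on |x_hat^t - x_hat^h| (assumed value)
    """
    ratios = []
    for x_h, z_h in Vh:
        i = int(np.argmin(np.abs(Vt[:, 0] - x_h)))   # nearest x_hat in the top view
        if abs(Vt[i, 0] - x_h) < tau:                # reliable matched pair
            ratios.append(Vt[i, 1] / z_h)            # depth ratio z^t / z^h
    if ratios:
        return float(np.mean(ratios))
    # fallback (our choice): match the mean depths if no reliable pair exists
    return float(np.mean(Vt[:, 1]) / np.mean(Vh[:, 1]))
```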

With the scaling factor $s$, we match $\mathbf{V}^t(\mathbf{c}, \theta)$ and $\mathbf{V}^h$ using dynamic programming (DP) [17]. Specifically, we define a dissimilarity matrix $\mathbf{D}$ of dimension $N_v \times N_h$, where its element $D_{ij}$ is the dissimilarity between $(\hat{x}^t_i, z^t_i)$ and $(\hat{x}^h_j, z^h_j)$ and is defined by

$D_{ij} = |\hat{x}^t_i - \hat{x}^h_j| + \lambda |z^t_i - s z^h_j|, \qquad (7)$

where $\lambda$ is a balance factor. Given that $\hat{\mathbf{x}}^t$ and $\hat{\mathbf{x}}^h$ are both ascending sequences, we use a dynamic-programming algorithm to search a monotonic path in $\mathbf{D}$ from $D_{11}$ to $D_{N_v N_h}$ to build the matching between $\mathbf{V}^t(\mathbf{c}, \theta)$ and $\mathbf{V}^h$ with the minimum total dissimilarity. If a vector in $\mathbf{V}^t(\mathbf{c}, \theta)$ matches multiple vectors in $\mathbf{V}^h$, we only keep the one with the smallest dissimilarity given in Eq. (7). After that, we check whether a vector in $\mathbf{V}^h$ matches multiple vectors in $\mathbf{V}^t(\mathbf{c}, \theta)$ and again keep only the one with the smallest dissimilarity. These two steps guarantee that the resulting matching is one-to-one, and we denote $K$ as the number of final matched pairs. Denote the resulting matched vector subsets as $\hat{\mathbf{V}}^t$ and $\hat{\mathbf{V}}^h$, both of dimension $K \times 2$. We define a matching cost between $\mathbf{V}^t(\mathbf{c}, \theta)$ and $\mathbf{V}^h$ as

$C(\mathbf{c}, \theta) = \frac{1}{K} \sum_{k=1}^{K} \left( |\hat{x}^t_k - \hat{x}^h_k| + \lambda |z^t_k - s z^h_k| \right) + \frac{\mu}{K}, \qquad (8)$

where $\mu$ is a pre-specified factor and $(\hat{x}^t_k, z^t_k)$ and $(\hat{x}^h_k, z^h_k)$ denote the $k$-th matched pair in $\hat{\mathbf{V}}^t$ and $\hat{\mathbf{V}}^h$, respectively. In this matching cost, the term $\frac{\mu}{K}$ encourages the inclusion of more vector pairs into the final matching, which is important when we use this matching cost to search for the optimal camera location and view angle, to be discussed next.
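The sketch below shows one way to realize this matching step under our reading of the paper: a DTW-style monotonic alignment over $\mathbf{D}$, pruned to a one-to-one matching, followed by the cost of Eq. (8). The DP formulation, the pruning details, and the default values of lam and mu (taken from Section 4.2, with the $\lambda$/$\mu$ pairing being our assumption) are not guaranteed to match the authors’ implementation.

```python
import numpy as np

def dp_match(Vt, Vh, s, lam=25.0, mu=0.015):
    """Match V^t and V^h (rows sorted by x_hat) and return the one-to-one
    index pairs together with the matching cost C of Eq. (8)."""
    Nv, Nh = len(Vt), len(Vh)
    # Eq. (7): pairwise dissimilarities
    D = np.abs(Vt[:, 0:1] - Vh[:, 0]) + lam * np.abs(Vt[:, 1:2] - s * Vh[:, 1])

    # accumulate the cost of monotonic paths from (0, 0) to (Nv-1, Nh-1)
    A = np.full((Nv, Nh), np.inf)
    A[0, 0] = D[0, 0]
    for i in range(Nv):
        for j in range(Nh):
            if i == 0 and j == 0:
                continue
            prev = min(A[i - 1, j] if i > 0 else np.inf,
                       A[i, j - 1] if j > 0 else np.inf,
                       A[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            A[i, j] = D[i, j] + prev

    # backtrack the minimum-cost monotonic path
    path, (i, j) = [(Nv - 1, Nh - 1)], (Nv - 1, Nh - 1)
    while (i, j) != (0, 0):
        cands = []
        if i > 0 and j > 0:
            cands.append((A[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            cands.append((A[i - 1, j], i - 1, j))
        if j > 0:
            cands.append((A[i, j - 1], i, j - 1))
        _, i, j = min(cands)
        path.append((i, j))

    # prune to a one-to-one matching: keep the least dissimilar pair per index
    best_i, best_j = {}, {}
    for i, j in path:
        if i not in best_i or D[i, j] < D[best_i[i]]:
            best_i[i] = (i, j)
    for i, j in best_i.values():
        if j not in best_j or D[i, j] < D[best_j[j]]:
            best_j[j] = (i, j)
    pairs = sorted(best_j.values())

    K = len(pairs)
    cost = sum(D[i, j] for i, j in pairs) / K + mu / K   # Eq. (8)
    return pairs, cost
```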

3.4 Detecting Horizontal-View Camera and View Angle

In calculating the matching cost of Eq. (8), we need to know the horizontal-view camera location $\mathbf{c}$ and its view angle $\theta$ to compute the vector $\mathbf{V}^t(\mathbf{c}, \theta)$. In practice, we do not know $\mathbf{c}$ and $\theta$ a priori. As mentioned earlier, we exhaustively try all possible values of $\mathbf{c}$ and $\theta$ and then select the ones that lead to the minimum matching cost $C(\mathbf{c}, \theta)$. The matching with such minimum cost provides the final cross-view subject association. For the view angle $\theta$, we sample its range $[0^\circ, 360^\circ)$ uniformly with an interval of $\Delta\theta$ degrees; in the experiments, we report results obtained with different sampling intervals. For the horizontal-view camera location $\mathbf{c}$, we simply try every subject detected in the top-view image as the camera (wearer) location.

Figure 3: An illustration of mutual occlusion in the horizontal view. (a) Top-view image and (b) horizontal-view image.

An occlusion in the horizontal-view image indicates that two subjects and the horizontal-view camera are collinear, as shown by the two marked subjects in Fig. 3(a). In this case, the subject with the larger depth is not visible in the horizontal view and we simply ignore this occluded subject in the vector representation $\mathbf{V}^t(\mathbf{c}, \theta)$. In practice, we set a tolerance threshold $\epsilon$ and, if the angle between the directions from the camera to two subjects is smaller than $\epsilon$, we ignore the one with the larger depth. The entire cross-view subject association algorithm is summarized in Algorithm 1.
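A minimal sketch of this occlusion filtering is given below; the angular tolerance eps and the pairwise formulation are assumptions on our part.

```python
import numpy as np

def remove_occluded(subjects, cam, eps=np.deg2rad(2.0)):
    """Drop top-view subjects that would be occluded in the horizontal view:
    when the directions from the camera to two subjects are nearly collinear
    (within eps), keep only the closer one.

    subjects : (N, 2) array of top-view subject positions
    cam      : (2,) hypothesized camera (wearer) location
    eps      : angular tolerance in radians (assumed value)
    """
    rel = subjects - cam
    depth = np.linalg.norm(rel, axis=1)
    ang = np.arctan2(rel[:, 1], rel[:, 0])           # direction toward each subject
    keep = np.ones(len(subjects), dtype=bool)
    for i in range(len(subjects)):
        for j in range(i + 1, len(subjects)):
            diff = abs(ang[i] - ang[j])
            diff = min(diff, 2 * np.pi - diff)       # handle wrap-around at +/- pi
            if diff < eps:                           # near-collinear pair
                keep[i if depth[i] > depth[j] else j] = False
    return subjects[keep]
```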

Input: $\mathcal{S}^t$, $\mathcal{S}^h$: subjects detected in the top view and the horizontal view, respectively; parameters $\lambda$, $\mu$.
Output: Matched vector pair $\hat{\mathbf{V}}^t$ and $\hat{\mathbf{V}}^h$; camera location $\mathbf{c}$; camera-view angle $\theta$.
1 Compute the horizontal-view vector $\mathbf{V}^h$ by Eq. (6);
2 for each candidate camera location $\mathbf{c}$ (every subject detected in the top view) do
3        for each sampled view angle $\theta$ do
4               Compute the top-view vector $\mathbf{V}^t(\mathbf{c}, \theta)$ by Eq. (4); estimate the scaling factor $s$ as discussed in Section 3.3; calculate $\mathbf{D}$ by Eq. (7) using $\mathbf{V}^t(\mathbf{c}, \theta)$ and $\mathbf{V}^h$; calculate $\hat{\mathbf{V}}^t$, $\hat{\mathbf{V}}^h$ based on $\mathbf{D}$ by the DP algorithm; calculate $C(\mathbf{c}, \theta)$ by Eq. (8);
5        Find the $\theta$ with the minimum $C(\mathbf{c}, \theta)$;
6 return $\hat{\mathbf{V}}^t$, $\hat{\mathbf{V}}^h$, $\mathbf{c}$, and $\theta$ with the minimum $C(\mathbf{c}, \theta)$.
Algorithm 1: Cross-View Subject Association
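Putting the pieces together, the exhaustive search of Algorithm 1 can be sketched as follows, reusing the helper functions sketched above (top_view_vectors, horizontal_view_vectors, estimate_scale, remove_occluded, dp_match); the default field-of-view angle and parameter values are assumptions.

```python
import numpy as np

def associate(top_subjects, hor_boxes, image_width,
              alpha=np.deg2rad(90.0), d_theta=np.deg2rad(1.0),
              lam=25.0, mu=0.015):
    """Sketch of Algorithm 1: exhaustive search over the candidate camera
    (wearer) locations c and view angles theta for the minimum cost C."""
    Vh = horizontal_view_vectors(hor_boxes, image_width)      # Eq. (6)
    best_cost, best = np.inf, None
    for c_idx, cam in enumerate(top_subjects):                # candidate wearer c
        others = np.delete(top_subjects, c_idx, axis=0)
        visible = remove_occluded(others, cam)                # Section 3.4
        for theta in np.arange(0.0, 2 * np.pi, d_theta):      # sampled view angle
            Vt = top_view_vectors(visible, cam, theta, alpha)  # Eq. (4)
            if len(Vt) == 0:
                continue                                      # nobody in the field of view
            s = estimate_scale(Vt, Vh)                        # Section 3.3
            pairs, cost = dp_match(Vt, Vh, s, lam, mu)        # Eqs. (7)-(8)
            if cost < best_cost:
                best_cost, best = cost, (c_idx, theta, pairs)
    return best                                               # (c index, theta, matched pairs)
```

For clarity, this sketch returns matches as indices into the sorted vector sets; a full implementation would also keep the sort permutations so that the matches can be mapped back to the original detections.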

4 Experiment

In this section, we first describe the dataset used for performance evaluation and then introduce our experimental results.

4.1 Test Dataset

We did not find a publicly available dataset with corresponding top-view and horizontal-view images/videos and ground-truth labeling of the cross-view subject association. Therefore, we collect a new dataset for performance evaluation. Specifically, we use a GoPro HERO7 camera (mounted over the wearer’s head) to take horizontal-view videos and a DJI Mavic 2 drone to take top-view videos. Both cameras were set to the same frame rate of 30 fps. We manually synchronize these videos such that corresponding frames between them are taken at the same time. We then temporally sample the two videos uniformly to construct frame (image) pairs for our dataset. Videos are taken at three different sites with different backgrounds, and the sampling interval is set to 100 frames to ensure the variety of the collected images. Finally, we obtain 220 image pairs from the top and horizontal views, with the same image resolution for both views. We label the same persons across the two views on all 220 image pairs. Note that this manual labeling is quite labor intensive given the difficulty of identifying persons in the top-view images (see Fig. 1 for an example).

To evaluate the proposed method more comprehensively, we examine all 220 image pairs and consider the following five attributes: Occ: horizontal-view images containing partially or fully occluded subjects; Hor_mov: horizontal-view images sampled when the camera wearer moves on the ground; Hor_rot: horizontal-view images sampled when the camera wearer rotates his head; Hor_sta: horizontal-view images sampled when the camera wearer stays static; TV_var: top-view images sampled when the drone moves up or down and/or changes its camera-view direction. Table 1 shows the number of image pairs with each of these five attributes. Note that some image pairs show multiple attributes listed above.

Attribute Occ Hor_mov Hor_rot Hor_sta TV_var
# image pairs 96 62 124 96 30
Table 1: Number of image pairs with the considered attributes.

For each pair of images, we analyze two more properties. One is the number of subjects in an image, which reflects the level of crowdedness. The other is the proportion between the number of shared subjects in two views and the total number of subjects in an image. Both of them can be computed against either the top-view image or the horizontal-view image and their histograms on all 220 image pairs are shown in Fig. 4.

Figure 4: (a) Histogram of the crowdedness in top- and horizontal-view images, respectively. (b) Histogram of the proportion of the shared subjects in top- and horizontal-view images, respectively.

In this paper, we use two metrics for performance evaluation: 1) the accuracy in identifying the horizontal-view camera wearer in the top-view image, and 2) the precision and recall of cross-view subject association. We do not include the camera-view angle in the evaluation because it is difficult to annotate its ground truth.
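For completeness, a small sketch of how the per-image-pair association precision and recall can be computed from predicted and ground-truth (top, horizontal) pairs is shown below; the set-based formulation is our assumption of the metric.

```python
def association_precision_recall(pred_pairs, gt_pairs):
    """Precision and recall of cross-view subject association for one image
    pair. Pairs are (top_id, horizontal_id) tuples of associated subjects."""
    pred, gt = set(pred_pairs), set(gt_pairs)
    tp = len(pred & gt)                              # correctly associated pairs
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall

# hypothetical example: 2 of 3 predictions are correct, 2 of 4 true pairs found
print(association_precision_recall([(1, 1), (2, 3), (4, 2)],
                                   [(1, 1), (2, 3), (3, 4), (5, 5)]))
# -> (0.666..., 0.5)
```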

4.2 Experiment Setup

We implement the proposed method in Matlab and run it on a desktop computer with an Intel Core i7 3.4GHz CPU. We use the general YOLO [15] detector to detect subjects in the form of bounding boxes in both top-view and horizontal-view images. Specifically, we use the YOLOv3 version of the detector; for top-view subject detection, we fine-tune the network using 600 top-view human images that have no overlap with our test images. The pre-specified parameters $\lambda$ and $\mu$ are set to 25 and 0.015, respectively. We further discuss the influence of these parameters in Section 4.4.

We did not find available methods with code that can directly handle our top- and horizontal-view subject association. One related work is [3] for cross-view matching. However, we could not include it directly in the comparison because 1) its code is not publicly available, and 2) it relies on computing optical flow and therefore cannot handle a pair of static images as in our dataset. Moreover, the method in [3] assumes a certain slope view angle of the top-view camera and uses appearance matching for cross-view association, similar to the appearance-matching-based person re-id methods.

In this paper, we chose a recent person re-id method [19] for comparison. We take each subject detected in the horizontal-view image as a query and search for it in the set of subjects detected in the top-view image. We tried two versions of this re-id method: one is retrained from scratch using 1,000 sample subjects collected by ourselves (with no overlap with the subjects in our test dataset), and the other is fine-tuned from the model provided in [19] using these 1,000 sample subjects.

4.3 Results

We apply the proposed method to all 220 pairs of images in our dataset. We detect the horizontal-view camera wearer on the top-view image as described in Section 3.4, and the detection accuracy is 84.1%. We also use the Cumulative Matching Characteristic (CMC) curve to evaluate the matching accuracy, as shown in Fig. 5(a), where the horizontal and vertical axes are the CMC rank and the matching accuracy, respectively.

Figure 5: (a) The CMC curve for horizontal-view camera-wearer detection. (b) Precision and recall scores in association, where the horizontal axis denotes a precision or recall score $r$, and the vertical axis denotes the proportion of image pairs whose precision or recall score is greater than $r$.

For a pair of images, we use the precision and recall scores to evaluate the cross-view subject association. As shown in Table 2, the average precision and recall scores of our method are 79.6% and 77.0%, respectively. In this table, ‘Ours (w $\mathbf{c}$)’ indicates the use of our method given the ground-truth camera location $\mathbf{c}$. We can see from this table that the re-id method, either retrained or fine-tuned, produces very poor results, which confirms the difficulty of using appearance features for the proposed cross-view subject association.

 

Method Prec.Avg Reca.Avg Prec.@1 Reca.@1
Re-id (fine-tune) 14.0 16.0 - -
Re-id (retrain) 22.0 24.0 - -
Ours (w $\mathbf{c}$) 86.6 84.2 66.4 57.7
Ours 79.6 77.0 60.0 50.9
Table 2: Comparative results of different methods. Prec.Avg and Reca.Avg are the average precision and recall scores over all image pairs. Prec.@1 and Reca.@1 are the proportion of image pairs with precision and recall scores of 1, respectively.

We also calculate the proportion of image pairs with a precision or recall score of 1 (Prec.@1 and Reca.@1). They reach 60.0% and 50.9%, respectively. The distributions of these two scores on all 220 image pairs are shown in Fig. 5(b). In Table 3, we report the evaluation results on the subsets with the respective attributes. We can see that the proposed method is not sensitive to the motion of either the top-view or the horizontal-view camera, which is highly desirable for moving-camera applications.

 

Attr Occ Hor_mov Hor_rot Hor_sta TV_var
Prec.Avg 76.6 78.3 80.5 78.4 53.3
Rec.Avg 74.5 74.9 77.9 75.7 53.3
Table 3: Comparative results on the subsets with different attributes.

4.4 Ablation Studies

Step Length for $\theta$. We study the influence of $\Delta\theta$, the step length for searching the optimal camera-view angle $\theta$ over its range $[0^\circ, 360^\circ)$. We set the value of $\Delta\theta$ to 1, 5, and 10 degrees, respectively, and the association results are shown in Table 4. As expected, $\Delta\theta = 1^\circ$ leads to the highest performance, although a larger step length, such as $\Delta\theta = 5^\circ$, also produces acceptable results.

 

Step length Prec.Avg Reca.Avg Prec.@1 Reca.@1
$\Delta\theta = 1^\circ$ 79.6 77.0 60.0 50.9
$\Delta\theta = 5^\circ$ 78.8 76.9 59.6 51.8
$\Delta\theta = 10^\circ$ 72.5 71.1 53.2 46.8

Table 4: Results by using different values of $\Delta\theta$.

Vector representation. Next we compare the association results using different vector representations, as shown in Table 5. The first row denotes that we represent the subjects in the two views by the one-dimensional vectors $\hat{\mathbf{x}}^t$ and $\hat{\mathbf{x}}^h$, respectively. The second row denotes that we represent the subjects in the two views by the one-dimensional vectors $\bar{\mathbf{z}}^t$ and $\bar{\mathbf{z}}^h$, respectively, which are simply normalized to the range $[0, 1]$ to make them comparable. The third row denotes that we combine the one-dimensional vectors of the first and second rows to represent each view, which differs from our proposed method (the fourth row of Table 5) only in the normalization of the depth values – our proposed method uses the RANSAC-alike strategy to estimate the scaling factor $s$. By comparing the results in the third and fourth rows, we can see that the use of the RANSAC-alike strategy for estimating the scaling factor does improve the final association performance. The results in the first and second rows show that using only one dimension of the proposed vector representation cannot achieve performance as good as the proposed method that combines both dimensions. We can also see that $\hat{\mathbf{x}}^t$ and $\hat{\mathbf{x}}^h$ provide more accurate information than $\bar{\mathbf{z}}^t$ and $\bar{\mathbf{z}}^h$ when used for cross-view subject association.

 

Vector Prec.Avg Reca.Avg Prec.@1 Reca.@1
$\hat{\mathbf{x}}^t$, $\hat{\mathbf{x}}^h$ 63.2 61.6 41.8 35.9
$\bar{\mathbf{z}}^t$, $\bar{\mathbf{z}}^h$ 23.4 13.4 6.8 0.9
$\hat{\mathbf{x}}$, $\bar{\mathbf{z}}$ 67.7 66.6 46.4 42.7
Ours 79.6 77.0 60.0 50.9
Table 5: Comparative study of using different vector representations. $\bar{\mathbf{z}}^t$ and $\bar{\mathbf{z}}^h$ are the normalized results of $\mathbf{z}^t$ and $\mathbf{z}^h$, respectively, obtained by simply scaling them to $[0, 1]$.

Parameter selection. There are two free parameters, $\lambda$ and $\mu$, in Eq. (8). We select different values for them to see their influence on the final association performance. Table 6 reports the results obtained by varying one of these two parameters while fixing the other one. We can see that the final association precision and recall scores are not very sensitive to the selected values of these two parameters.

 

Varying $\lambda$: Prec.Avg Reca.Avg | Varying $\mu$: Prec.Avg Reca.Avg
76.4 73.7 | 79.2 76.6
79.6 77.0 | 79.6 77.0
79.1 76.7 | 78.4 75.9

Table 6: Results by varying the values of $\lambda$ and $\mu$.

Detection method. To analyze the influence of subject-detection accuracy on the proposed cross-view association, we tried different subject detections. As shown in Table 7, in the first row, we use manually annotated bounding boxes of each subject in both views for the proposed association. In the second and third rows, we use manually annotated subjects on the top-view images and the horizontal-view images, respectively, while using automatically detected subjects [15] on the other-view images. In the fourth row, we automatically detect subjects in both views first, and then only keep those whose bounding boxes have sufficient IoU (Intersection over Union) with a manually annotated subject. We can see that the use of manually annotated subjects produces much better cross-view subject association. This indicates that further efforts on improving subject detection will benefit the association.

 

Subjects detection Prec.Avg Reca.Avg Prec.@1 Reca.@1
Manual 83.5 80.5 76.8 61.4
Manual-Top 84.8 82.0 70.5 59.1
Manual-Hor 80.6 77.4 69.1 55.5
Automatic w selection 80.7 76.1 69.6 52.7
Automatic (Ours) 79.6 77.0 60.0 50.9
Table 7: Results by using different methods for subject detection.
Figure 6: (a) Association performance for image pairs with different number of associated subjects. (b) Association performance for image pairs with different proportion of associated subjects.

 

Method Whole dataset Occ subset
Prec.Avg Reca.Avg Prec.Avg Reca.Avg
Ours 79.6 77.0 76.6 74.5
Ours(w/o occ) 65.2 65.1 46.1 46.8

 

Table 8: Association results using the proposed method with and without handling occlusions.
Figure 7: Row 1: Two sample results on image pairs with occlusions. Row 2: Two sample results with a large number of unshared subjects between the two views. The vector sets $\mathbf{V}^t$ and $\mathbf{V}^h$ are shown in the top-right corner of every image.
Figure 8: Two failure cases.

4.5 Discussion

Number of associated subjects. We investigate the correlation between the association performance and the number of associated subjects. Figure 6(a) shows the average association performance on the image pairs with different numbers of associated subjects. We can see that the association results get worse when the number of associated subjects is too high or too low. When there are too many associated subjects, the crowding in the horizontal view may prevent accurate subject detection. When there are too few subjects, the constructed vector representation is not sufficiently discriminative to locate the camera location $\mathbf{c}$ and camera-view angle $\theta$. Figure 6(b) shows the average association performance on the image pairs with different proportions of associated subjects. More specifically, the performance at $r$ along the horizontal axis is the average precision/recall score on all the image pairs whose proportion of associated subjects (relative to the total number of subjects in the top-view image) is less than $r$. This confirms that on the images with a higher such proportion, the association can be more reliable.

Occlusion. Occlusions are very common, as shown in Table 1. Table 8 shows the association results on the entire dataset and on the subset of data with occlusions, using the proposed method with and without the step of identifying and ignoring occluded subjects. We can see that our simple strategy for handling occlusion can significantly improve the association performance on the image pairs with occlusions. Sample results on image pairs with occlusions are shown in the top row of Fig. 7, where associated subjects bear the same number labels. We can see that occlusions occur more often when 1) the subjects are crowded, and 2) one subject is very close to the horizontal-view camera.

Proportion of shared subjects. It is a common situation that many subjects in the two views are not the same persons. In this case, the shared subjects may only account for a small proportion of the subjects in both the top and horizontal views. Two examples are shown in the second row of Fig. 7. On the left, we show a case where many subjects in the top view are not in the field of view of the horizontal-view camera. On the right, we show a case where many subjects in the horizontal view are too far from the horizontal-view camera and are not covered by the top-view camera. We can see that the proposed method can handle these two cases very well by exploring the spatial distribution of the shared subjects.

Failure cases. Finally, we show two failure cases in Fig. 8 – one caused by errors in subject detection (blue boxes) and the other caused by the close distance between multiple subjects, e.g., subjects 3, 4, and 5, in either the top or the horizontal view, which leads to erroneous occlusion detection and incorrect vector representations.

5 Conclusion

In this paper, we developed a new method to associate multiple subjects across top-view and horizontal-view images by modeling and matching the subjects’ spatial distributions. We constructed a vector representation for all the detected subjects in the horizontal-view image and another vector representation for all the detected subjects in the top-view image that are located in the field of view of the horizontal-view camera. These two vector representations are then matched for cross-view subject association. We proposed a new matching cost function with which we can further optimize for the location and view angle of the horizontal-view camera in the top-view image. We collected a new dataset, as well as manually labeled ground-truth cross-view subject association, and experimental results on this dataset are very promising.

References

  • [1] S. Ardeshir and A. Borji. Ego2top: Matching viewers in egocentric and top-view videos. In ECCV, 2016.
  • [2] S. Ardeshir and A. Borji. Egocentric meets top-view. IEEE TPAMI, 2018.
  • [3] S. Ardeshir and A. Borji. Integrating egocentric videos in top-view surveillance videos: Joint identification and temporal alignment. In ECCV, 2018.
  • [4] C. Fan, J. Lee, M. Xu, K. K. Singh, Y. J. Lee, D. J. Crandall, and M. S. Ryoo. Identifying first-person camera wearers in third-person videos. In CVPR, 2017.
  • [5] F. Ferland, F. Pomerleau, C. T. L. Dinh, and F. Michaud. Egocentric and exocentric teleoperation interface using real-time, 3D video projection. In HRI, 2009.
  • [6] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
  • [7] G. Lisanti, I. Masi, A. D. Bagdanov, and A. Del Bimbo. Person re-identification by iterative re-weighted sparse ranking. IEEE TPAMI, 2015.
  • [8] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, 2008.
  • [9] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
  • [10] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
  • [11] K. B. Low and U. U. Sheikh. Learning hierarchical representation using siamese convolution neural network for human re-identification. In ICDIM, 2016.
  • [12] B. Ma, S. Yu, and F. Jurie. Local descriptors encoded by fisher vectors for person re-identification. In ECCV, 2012.
  • [13] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Learning to rank in person re-identification with metric ensembles. In CVPR, 2015.
  • [14] H. S. Park, E. Jain, and Y. Sheikh. Predicting primary gaze behavior using social saliency fields. In ICCV, 2013.
  • [15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [16] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, 2014.
  • [17] M. Sniedovich. Dynamic Programming: Foundations and Principles. Monographs and Textbooks in Pure and Applied Mathematics, 2011.
  • [18] B. Soran, A. Farhadi, and L. G. Shapiro. Action recognition in the presence of one egocentric and multiple static cameras. In ACCV, 2014.
  • [19] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 2018.
  • [20] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
  • [21] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In ECCV, 2016.
  • [22] M. Xu, C. Fan, Y. Wang, M. S. Ryoo, and D. J. Crandall. Joint person segmentation and identification in synchronized first- and third-person videos. In ECCV, 2018.
  • [23] Y. Yang, J. Yang, J. Yan, S. Liao, Y. Dong, and S. Z. Li. Salient color names for person re-identification. In ECCV, 2014.
  • [24] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian. Person re-identification in the wild. In CVPR, 2017.