Joint Person Segmentation and Identification in Synchronized First- and Third-person Videos

Mingze Xu, Chenyou Fan, Yuchen Wang, Michael S. Ryoo, David J. Crandall
Indiana University, Bloomington, IN 47408
{mx6, fan6, wang617, mryoo, djcran}

In a world in which cameras are becoming more and more pervasive, scenes in public spaces are often captured from multiple perspectives by diverse types of cameras, including surveillance and wearable cameras. An important problem is how to organize these heterogeneous collections of videos by finding connections between them, such as identifying common correspondences between people both appearing in the videos and wearing the cameras. In this paper, we consider scenarios in which multiple cameras of different types are observing a scene involving multiple people, and we wish to solve two specific, related problems: (1) given two or more synchronized third-person videos of a scene, produce a pixel-level segmentation of each visible person and identify corresponding people across different views (i.e., determine who in camera A corresponds with whom in camera B), and (2) given one or more synchronized third-person videos as well as a first-person video taken by a wearable camera, segment and identify the camera wearer in the third-person videos. Unlike previous work which requires ground truth bounding boxes to estimate the correspondences, we jointly perform the person segmentation and identification. We find that solving these two problems simultaneously is mutually beneficial, because better fine-grained segmentations allow us to better perform matching across views, and using information from multiple views helps us perform more accurate segmentation. We evaluate our approach on a challenging dataset of interacting people captured from multiple wearable cameras, and show that our proposed method performs significantly better than the state-of-the-art on both person segmentation and identification.

Synchronized first- and third-person cameras.

1 Introduction

By some estimates, there will be nearly 45 billion cameras on Earth by 2022 – more than five times the number of people [1]! In a world with so many cameras, it will be commonplace for a scene to be simultaneously recorded by multiple cameras of different types and viewpoints. For example, at any given time, a public space may be recorded by not only multiple fixed surveillance cameras, but also by mobile cameras on smartphones, laptops, tablets, self-driving cars, and even wearable devices like GoPro [2] and Snap Spectacles [3]. As cameras continue to multiply, automated techniques will be needed to help organize and make sense of all these weakly-structured video feeds. For example, a key problem in many applications is to detect, identify, and track the people in a scene. Combining data from multiple cameras could significantly improve performance on these and other scene understanding problems, since evidence from multiple viewpoints could help resolve ambiguities caused by occlusion, perspective distortion, etc. However, integrating evidence across multiple heterogeneous cameras in unconstrained and dynamic environments is a significant challenge, especially for wearable and other mobile devices where the camera is constantly moving in highly unpredictable ways.

Figure 1: Two or more people move around an environment while wearing cameras. We are interested in two specific problems: (a) segmenting and identifying the people appearing in a set of videos, and (b) deciding which person appearing in a third-person video is the camera wearer of a first-person view, and estimating his or her segmentation mask. These sample video frames are reprinted from the public dataset of [4].

For example, consider a law enforcement scenario in which multiple police officers chase a suspect through a crowded public place. Body-worn police cameras (which nearly 95% of police departments in the U.S. use or plan to deploy [5]) record the events from the officers’ perspectives. Investigators later want to reconstruct the incident by combining analysis of the wearable camera videos, as well as videos from third-person sources such as surveillance cameras and civilian smartphone videos uploaded to social media sites. In particular, in any given frame of any given camera, they may want to identify: (1) fine-grained, pixel-level segmentation masks for all people of interest, including both the suspect and the police officers, to permit further analysis (e.g. for activity or action recognition), (2) the instances in which one of the camera wearers (police officers) was visible in another camera’s view, and (3) instances of the same person appearing in different views at the same time. The scene is complex and crowded, requiring fine-grained segmentation masks to separate individual people (since frequent occlusions cause person bounding boxes to overlap). The videos from wearable cameras are particularly challenging because the cameras themselves are moving in rapid, unpredictable ways.

While person tracking and (re-)identification are well-studied in computer vision [6, 7], only very recently have these problems been considered in such challenging scenarios of heterogeneous first-person (e.g., “egocentric” wearable cameras) and traditional cameras. Ardeshir and Borji [8] consider the case of several people moving around an environment while wearing cameras, and try to match each of these first-person views to one of the people appearing in a third-person, overhead view of the scene. This is challenging because the person wearing the camera is never seen in their own wearable video, so he or she must be identified by comparing their motion from a third-person perspective with the first-person visual changes that are induced by their movements. Their approach is applicable in closed environments where overhead cameras are available (e.g., a supermarket or museum), but not in unconstrained environments such as the law enforcement scenario described above. Fan et al. [4] present an alternative technique that relaxes many assumptions, allowing arbitrary third-person camera viewpoints and including evidence based on scene appearance. Zheng et al. [9] consider the distinct (but related) problem of identifying the same person appearing in multiple wearable camera videos (without trying to identify the camera wearers themselves). However, both of these techniques identify individual people using bounding boxes, which are too coarse in crowded scenes in which people frequently occlude one another (causing their bounding boxes to overlap). Moreover, both of these techniques assume that accurate oracle bounding boxes are already available, so their problems simplify to matching boxes across views and frames.

This paper. We consider the more challenging problem of not only finding correspondences between people in first- and third-person cameras, but also producing pixel-level segmentations of the people in each view. In particular, we consider two specific related problems (as shown in Figure 1): (1) given two or more synchronized third-person videos of a scene, segment all the visible people and identify corresponding people across the different videos (i.e., determine who in camera A corresponds with whom in camera B); and (2) given one or more synchronized third-person videos of a scene as well as a video that was taken by a wearable first-person camera, identify and segment the person who was wearing the camera in the third-person videos. We define a “first-person” camera to be a wearable camera for which we care about the identity of the camera wearer, while a “third-person” camera is either a static or wearable camera for which we are not interested in determining who is wearing it.

Intuitively, these problems can be solved by finding correspondences in appearance and motion features across different viewpoints of a scene. Our hypothesis is that performing simultaneous segmentation and matching is mutually beneficial: segmentation helps refine matching by producing finer-granularity appearance features compared to coarse bounding boxes, which is especially important in crowded scenes in which occlusions cause people’s bounding boxes to overlap (such as the law enforcement scenario mentioned above), while matching provides evidence that helps locate a person of interest and produces better segmentation masks, which in turn can be used for tasks like activity and action recognition.

More specifically, we present a Convolutional Neural Network (CNN)-based framework to simultaneously learn a distance metric between videos taken from different viewpoints, and to learn the segmentation mask of people across the views. We use a Siamese network design that allows for learning low-level features invariant across multiple videos, and that integrates evidence from both scene appearance and motion. We show that previous work [4] is a special case of our model, since our approach can naturally handle their first- and third-person cases. We evaluate our technique on a publicly available dataset (which we augment with pixel-level segmentation masks), showing that our models achieve significantly better results compared to numerous baselines.

2 Related Work

We are not aware of existing work on joint person segmentation and identification in synchronized first- and third-person camera scenarios, but we draw inspiration from papers on several related problems.

Object segmentation in images and videos. Deep learning has achieved state-of-the-art performance on semantic image segmentation [10, 11, 12, 13, 14], typically using fully convolutional networks (FCNs) that extract low-resolution feature representations and then up-sample them back to the input image size. Other approaches [15, 16, 17, 18] are based on region proposals, inspired by the effectiveness of R-CNNs [19, 20] for object detection. For example, Mask R-CNNs [18] achieve among the best results through separate prediction of object masks and their class labels, avoiding competition among classes and improving performance for overlapped instances.

Much work has applied deep networks to object segmentation in videos [21, 22, 23, 24, 25, 26]. Most of these methods are semi-supervised, assuming that the object mask in the first frame is known (during both training and testing) and that the task is to propagate it to subsequent frames. Khoreva et al. [27] propose a guided instance segmentation approach which uses the object mask from the previous frame to predict the next one. The network is pre-trained (off-line) on static images and fine-tuned (on-line) on the annotations of the first frame for specific objects of interest. Our approach follows a similar overall formulation, except that we include information from both visual appearance and optical flow in a two-stream network; combining both motion information and visual appearance information from the previous frame helps to better update the object mask across time. Our approach is also inspired by Yoon et al. [28], who design a pixel-level Siamese matching network for identifying and segmenting objects, even those not seen in the training data. We extend their idea to the multi-camera case by learning from object instances across multiple synchronized videos, which allows our network to better learn variations and correspondences in object appearance across perspective changes. Cheng et al. [29] propose a two-stream network which outputs the segmentation and optical flow simultaneously, where the segmentation stream focuses on objectness and the optical flow stream exploits motion information. Inspired by their observation that segmentation and optical flow estimation benefit from each other, we propose a novel architecture that simultaneously performs person segmentation and identification.

Co-segmentation. Our work is also related to object co-segmentation [30], in which an object appearing in multiple images is segmented. Several existing methods address this task as an optimization problem on Markov Random Fields (MRFs) with a regularized difference of feature histograms, for example, by assuming a Gaussian prior on the objectness appearance [30] or by computing the sum of squared differences [31]. Many follow-up approaches focus on object co-segmentation for videos [32, 33, 34, 35, 36]. Chiu et al. [34] use distance-dependent Chinese Restaurant Processes as priors on both appearance and motion cues to perform unsupervised (not semantic) co-segmentation. Fu et al. [35] address video co-segmentation as a CRF inference problem on an object-based co-selection graph. However, segmentation candidates are computed only by the category-independent object proposals method [37] and are not refined using information across multiple videos. Guo et al. [36] perform iterative constrained clustering using seed superpixels and pairwise constraints, and refine the segmentation results by optimizing a multi-class MRF. Most of the above methods assume that either a target object appears in all videos or that videos contain at least one common target object, and none apply deep learning techniques to this problem. To the best of our knowledge, ours is the first paper to propose a deep learning approach to co-segmentation in videos, and our approach is applicable to both single and multiple camera scenarios.

First-person camera applications. Ardeshir and Borji [38] consider matching a set of first-person videos to a set of people appearing in a top-view video using graph-based analysis. They assume there are multiple first-person cameras sharing the same field of view at any given time, and only consider overhead third-person cameras. While these assumptions apply in some applications, they are violated in complex real-world scenarios like our law enforcement example above. Fan et al. [4] identify the wearer of a given first-person camera in a third-person video using a two-stream semi-Siamese network to incorporate spatial and temporal information from both viewpoints, and learn a joint embedding space from positive and negative pairs (i.e., correct and incorrect first- and third-person matches). Zheng et al. [9] identify people appearing in multiple wearable camera videos (without trying to identify the camera wearers themselves).

The above work assumes that the people in a scene have been detected with accurate bounding boxes in both training and testing datasets. Our approach builds on these methods, proposing a novel architecture that simultaneously segments and identifies both camera wearers and others. We find that segmentation and identification mutually benefit one another; in the law enforcement scenario described above, with crowded scenes and frequently occluded people, for example, fine-grained segmentation masks are needed to adequately extract visual features specific to any given person, while identity information from multiple views helps accurately segment the person in any individual view.

3 Our Approach

Given two or more videos taken from a set of cameras (potentially including a mixture of static and wearable cameras), our goal is to segment each person appearing in these videos, to identify matches between segments that correspond to the same person across different views, and to identify the segments that correspond to the wearer of each of the first-person cameras. The main idea is that despite having very different perspectives, synchronized cameras recording the same environment should be capturing some of the same people and background objects. This overlap may permit finding similarities and correspondences among these videos in both visual and motion domains, while hopefully ignoring differences caused by differing viewpoints. Unlike prior work [4], which assumes a ground truth bounding box is available for each person in each video frame, we perform segmentation and matching tasks simultaneously. We hypothesize that these two tasks could be mutually beneficial: using estimated person segmentations could provide more concise and accurate information than coarse localizations (e.g. bounding boxes) for people matching, while the consistency of people’s appearance and motion from different perspectives could help produce better segmentation masks in individual views.

More concretely, we formulate our problem as two separate but related tasks. The third-third problem is to segment each person and find person correspondences across different views captured from a pair of third-person cameras. The third-first problem is to segment and identify the camera wearer of a given first-person video in third-person videos. In this section, we first introduce the basic network architecture used to solve both of these problems: a two-stream fully convolutional network (FCN) to estimate the segmentation mask for each person using the current RGB frame, the stacked optical flow fields, and the segmentation result of the previous video frame (which we call the pre-mask) as inputs (Section 3.1). We then introduce the general structure of the Siamese network, and show two specific architectures (targeting our two problems) that incorporate its structure into the FCN, enabling person segmentation and identification to benefit each other (Section 3.2). Finally, we describe our proposed loss functions for segmentation and for distance metric learning (Section 3.3).

3.1 Two-stream FCN Network

We use FCN8s [10] as the foundation of our network framework, but with several important modifications. We chose FCN8s for its effectiveness and compactness, although our model can be easily extended to more recent segmentation architectures, such as DeepLabv3+ [14] and Mask R-CNN [18]. Figure 2 presents our novel network architecture. First, to take advantage of video and incorporate evidence from both visual and motion domains, we expand FCN8s to a two-stream architecture, where a visual stream receives the RGB frame as input (top stream in Figure 2) and a motion stream receives stacked optical flow fields (bottom stream). This design is inspired by Simonyan and Zisserman [39], although their network was proposed for the completely different problem of action recognition from a single static camera. To jointly consider both spatial and temporal information, we use an “early” fusion method to concatenate the features of the two streams at the levels of pool3, pool4, and pool5 (middle of Figure 2). Following Long et al. [10], we combine the fused features from these different levels to incorporate “coarse, high level information with fine, low level information” for more accurate segmentation results.

However, in contrast to Long et al. [10], our two-stream FCN network targets the instance segmentation problem; i.e., we want to segment specific people, not just all instances of the person class. We address this problem by using an instance-by-instance strategy in both training and testing, in which we only consider a single person of interest at a time. In order to guide the network to segment a specific person among the many that may potentially appear in a frame, we append that person’s binary pre-mask (without any semantic information) to the input of each stream as an additional channel. The key idea is that the pre-mask provides a rough estimate of the person’s location and his or her approximate shape in the current video frame. In training, our network is first pre-trained by taking the ground truth pre-masks as inputs, and is then fine-tuned with the estimated masks from the previous frame. In testing, we assume that a segmentation mask of each person of interest is available in the first video frame (although this segmentation could be quite coarse) and then propagate this mask forward by evaluating each subsequent unlabeled frame in sequence. A pixel-level classification loss function is used to guide the learning process, as discussed in Section 3.3.
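
The test-time propagation described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `segment_fn` is a hypothetical stand-in for the two-stream FCN, taking the current frame, its stacked optical flow, and the pre-mask, and returning the estimated mask for one person of interest.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Flow = TypeVar("Flow")
Mask = TypeVar("Mask")


def propagate_masks(
    frames: Sequence[Frame],
    flows: Sequence[Flow],
    first_mask: Mask,
    segment_fn: Callable[[Frame, Flow, Mask], Mask],
) -> List[Mask]:
    """Propagate one person's mask through a sequence, frame by frame.

    frames[0] is the frame whose mask (first_mask) is given; flows[t] is
    assumed to be the stacked optical flow associated with frames[t].
    segment_fn is a hypothetical placeholder for the segmentation network.
    """
    masks = [first_mask]
    for t in range(1, len(frames)):
        pre_mask = masks[-1]  # estimate from the previous frame
        masks.append(segment_fn(frames[t], flows[t], pre_mask))
    return masks
```

Running the loop once per person of interest mirrors the instance-by-instance strategy: each call tracks a single person, and errors in one estimated mask carry over to the next frame, which is why fine-tuning on estimated (rather than ground truth) pre-masks matters.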

Figure 2: Visualization of our two-stream FCN network. We feed RGB frames with a pre-mask (a segmentation mask for a specific person) to the visual stream (top, in dark grey) and corresponding stacked optical flow fields with pre-mask to the motion stream (bottom, in light grey). The spatial and temporal features at pool3, pool4, and pool5 are fused to predict the segmentation of the “target” person. We downsample the extracted features of the softmax layer by a factor of 16, then tile the background and foreground channels by a factor of 512, separately.

3.2 Siamese Networks

The network in the last section learns to estimate the segmentation mask of a specific person across frames of a video sequence. We now show how to use this network in a Siamese structure with a contrastive loss to match person instances across different third- and first-person views. The main idea behind the use of Siamese networks is to learn an embedding space such that feature representations captured by different cameras from different perspectives are close together only if they actually belong to the same person. This is motivated by the fact that the representation of the same person at the same location performing the same motion should ideally be invariant to camera viewpoint. The Siamese formulation allows us to simultaneously learn the viewpoint-invariant embedding space for matching identities and the pixel-wise segmentation network described above in an end-to-end fashion.

Moreover, our Siamese (or semi-Siamese) FCN architecture improves the invariance of object segmentation across different perspectives and transformations. In contrast to co-segmentation methods that require pairs of images or videos in both training and testing, our approach only needs pairs in the training phase. In testing, our two-stream FCN network can be applied to any single stream input, and uses the embedding space to match with others. To allow the segmentation network to receive arbitrary sizes of inputs, our contrastive loss function is generalized to a 3D XYC representation space, with a Euclidean distance for positive exemplars and a hinge loss for negative ones.

Figure 3: Visualization of our third-third network for segmenting and identifying the people in common across different videos. The network is composed of two FCN branches with a Siamese structure, where all convolution layers (shown in the same color) are shared.

In particular, we explore two different Siamese network structures, customized for our two tasks: the third-third problem of segmenting and matching people in common across a pair of cameras, and the third-first problem of segmenting a person of interest and identifying if he or she is the wearer of a first-person camera. The third-third problem considers a more general case in which the cameras may be static or may be wearable, but they are all viewing a person of interest from a third-person viewpoint; we thus use a full-Siamese network that shares all convolution layers in the FCN branch and the embedding layers for the Siamese loss for this problem. In contrast, the third-first problem must match feature representations from different perspectives (identifying how a camera wearer’s ego-motion visible in a first-person view correlates with the appearance of that same motion from a third-person view). Inspired by [4], our third-first network is formulated in a semi-Siamese structure, where different shallow layers are needed to capture different low-level features and only deeper ones are shared.

Third-third network. Figure 3 shows the architecture of our third-third network, which segments and matches people in common from a pair of third-person camera views. We use a Siamese structure with two branches of the FCN network from Figure 2 (and discussed in Section 3.1), where all corresponding convolution layers are shared. The Siamese branch is thus encouraged to learn relationships between people’s appearance in different views by optimizing a generalized embedding space. The key idea is that despite being captured from very different perspectives, the same person in synchronized videos should have some correspondences in both visual and motion domains.

In more detail, given the RGB frame and the optical flow (each appended with the pre-mask of the person of interest) as inputs, the FCN branch estimates a binary-valued person segmentation mask with the same spatial size as the inputs. The Siamese branch is then appended to the pool5 layer of both visual and motion streams for matching. To obtain more accurate representations for each “target” person, we re-weight the spatial and temporal features by multiplying them with the confidence outputs of the FCN branch. To emphasize the pixel positions belonging to the person while retaining some contextual information, we use the soft attention maps after the softmax layer rather than the estimated segmentation mask. We first downsample the soft attention of the foreground by a factor of 16 and tile it by a factor of 512 to fit the size of the pool5 outputs (see Figure 2). For both visual and motion streams, we multiply this resized confidence map with the features, which gives pixels corresponding to the person a higher score and those of the background a lower one. The intuition is that by “cropping out” the region corresponding to a person from the feature maps, a match across two views should receive a higher correspondence. This correspondence will also back-propagate its confidence to improve segmentation performance. Finally, the re-weighted spatial and temporal features are concatenated together as a more robust representation for matching each person instance.
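
The re-weighting step can be illustrated with a small, framework-free sketch. It assumes nested Python lists for feature maps, and it uses average pooling as the resize operation, which is our assumption since the resizing method is not specified here; the function names are hypothetical.

```python
from typing import List

Map2D = List[List[float]]   # h x w
Features = List[Map2D]      # C x h x w


def downsample(att: Map2D, factor: int) -> Map2D:
    """Average-pool a 2D map by an integer factor (one possible way to resize)."""
    h, w = len(att) // factor, len(att[0]) // factor
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [
                att[i * factor + di][j * factor + dj]
                for di in range(factor)
                for dj in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def reweight(features: Features, att: Map2D) -> Features:
    """Multiply every channel by the same attention map.

    Reusing one h x w map for all C channels is equivalent to tiling the
    map along the channel dimension and multiplying element-wise.
    """
    return [
        [[f * a for f, a in zip(f_row, a_row)] for f_row, a_row in zip(chan, att)]
        for chan in features
    ]
```

Because the attention is soft rather than a hard mask, background positions are scaled down instead of zeroed out, so some contextual information survives in the matching features.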

Figure 4: Visualization of our third-first network for segmenting and identifying a first-person camera wearer in third-person videos. The network is formulated in a semi-Siamese structure where only convolution layers of the embedding space (shown in the same color) are shared.

Third-first network. Figure 4 shows the architecture of our third-first network, the goal of which is to segment a first-person camera wearer in third-person videos and to recognize the correspondence between the first-person view and its representation in third-person videos. To be specific, given a first-person video, our network must decide whether each of the people appearing in a third-person video is the wearer of this first-person camera, and estimate the wearer’s segmentation. In contrast to the third-third network, which has two FCN branches focusing on the same task (person segmentation), the second branch of the third-first network, which receives the first-person videos as inputs, is only designed to extract the wearer’s ego-motion and the visual information of the background, which hopefully also provides constraints for the segmentation task. We thus propose a semi-Siamese network to learn the first- and third-person distance metric, where the first-person branch has a similar structure to the FCN but without the up-sampling layers or the segmentation loss. The structure of the Siamese branch is similar to that of the third-third network, but with a different re-weighting method: we multiply the spatial features with the soft attention of the background but the temporal features with the soft attention of the foreground. The reason for this is that camera wearers do not appear in their own first-person videos (with the occasional exception of arms or hands), but the background views reflect some similarities between different perspectives; meanwhile, motion features of camera wearers in third-person videos should be related to the ego-motion (as reflected by camera movements) in first-person videos. The re-weighted appearance and motion features are then concatenated after several convolution operations to be fed into the embedding space, as discussed above.
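
The asymmetric re-weighting of the third-first network can be sketched as follows. This is an illustrative, framework-free example with hypothetical function names; it assumes the soft attentions come from a per-pixel two-class softmax over background and foreground scores that have already been resized to the feature resolution.

```python
import math
from typing import List, Tuple

Map2D = List[List[float]]   # h x w
Features = List[Map2D]      # C x h x w


def soft_attention(bg_logits: Map2D, fg_logits: Map2D) -> Tuple[Map2D, Map2D]:
    """Per-pixel two-class softmax; returns (background, foreground) maps."""
    bg, fg = [], []
    for b_row, f_row in zip(bg_logits, fg_logits):
        bg_out, fg_out = [], []
        for b, f in zip(b_row, f_row):
            m = max(b, f)  # subtract the max for numerical stability
            eb, ef = math.exp(b - m), math.exp(f - m)
            bg_out.append(eb / (eb + ef))
            fg_out.append(ef / (eb + ef))
        bg.append(bg_out)
        fg.append(fg_out)
    return bg, fg


def scale(features: Features, att: Map2D) -> Features:
    """Multiply every channel of the features by the same attention map."""
    return [
        [[f * a for f, a in zip(f_row, a_row)] for f_row, a_row in zip(chan, att)]
        for chan in features
    ]


def third_first_reweight(
    spatial: Features, temporal: Features, bg_att: Map2D, fg_att: Map2D
) -> Tuple[Features, Features]:
    """Weight appearance by the background and motion by the foreground."""
    return scale(spatial, bg_att), scale(temporal, fg_att)
```

The two maps sum to one at every position, so the appearance features emphasize exactly the regions that the motion features de-emphasize.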

3.3 Loss Functions

We propose two loss functions for jointly learning the segmentation and distance metric.

First, the sigmoid cross entropy loss evaluates a predicted segmentation mask compared to ground truth. In particular, for a batch of $N$ training exemplars, we define

$$L_{seg} = -\frac{1}{N} \sum_{i=1}^{N} \left[ S_i \log \hat{S}_i + (1 - S_i) \log (1 - \hat{S}_i) \right],$$

where $\hat{S}_i$ is the predicted segmentation mask of the $i$-th exemplar and $S_i$ is the corresponding ground truth binary segmentation mask.
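
As a concrete illustration of the segmentation loss, the following minimal sketch computes a sigmoid cross entropy from raw per-pixel scores. Treating the mask as a flat sequence of pixels and averaging over them are assumptions made for brevity, not details confirmed by the text.

```python
import math
from typing import Sequence


def sigmoid_cross_entropy(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross entropy over pixels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(s) + (1 - y)*log(1 - s)] with s = sigmoid(x).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)
```

An uninformative score of zero costs log 2 per pixel regardless of the label, while confident correct scores drive the loss toward zero.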

Second, the generalized contrastive loss encourages low distances between positive exemplars (pairs of corresponding people) and high distances between negative ones (non-corresponding person pairs). This loss enables our model to learn an embedding space for arbitrary input sizes. In particular, for a batch of $N$ exemplars,

$$L_{siam} = \sum_{i=1}^{N} \left[ y_i \, \lVert a_i - b_i \rVert^2 + (1 - y_i) \max(m - \lVert a_i - b_i \rVert, 0)^2 \right],$$

where $m$ is a constant, $a_i$ and $b_i$ are two features corresponding to the $i$-th exemplar, and $y_i$ is $1$ if the $i$-th exemplar is a correct correspondence and $0$ otherwise.
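
A contrastive loss of this kind can be sketched in a few lines. The example below is a hedged illustration: it uses the standard squared-distance and squared-hinge form, sums over the batch, and treats each feature as a flat vector (for instance, a flattened XYC tensor); the default margin of 1.0 is an arbitrary placeholder, not a value taken from the paper.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally sized feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(
    feats_a: Sequence[Sequence[float]],
    feats_b: Sequence[Sequence[float]],
    labels: Sequence[int],
    margin: float = 1.0,  # placeholder value, chosen for illustration only
) -> float:
    """Sum of per-pair losses: pull positives together, push negatives apart."""
    total = 0.0
    for a, b, y in zip(feats_a, feats_b, labels):
        d = euclidean(a, b)
        if y == 1:
            total += d ** 2                       # positive pair: penalize distance
        else:
            total += max(margin - d, 0.0) ** 2    # negative pair: hinge on the margin
    return total
```

Negative pairs that are already farther apart than the margin contribute nothing, so training effort concentrates on confusable non-matching pairs.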

4 Experiments

We evaluate our third-third and third-first networks on joint person segmentation and identification problems in synchronized first- and third-person videos. We also measure the performance of each component of our networks as additional baselines.

4.1 Dataset

We evaluate our proposed approaches on the publicly available dataset of Fan et al. [4], which (to our knowledge) is the only dataset that includes correspondences between multiple first-person cameras. The dataset consists of 9 sets of videos, each including two first-person videos of 5-10 minutes in length. Each set records 3-4 participants performing a variety of everyday, unstructured activities, such as shaking hands, chatting, drinking, and eating, in one of six indoor environments. Each person in each frame is annotated with a ground truth bounding box and a unique person ID. To evaluate our methods on person segmentation, we manually augmented a subset of the dataset with pixel-level person segmentation maps, for a total of 1,277 labeled frames containing 2,654 annotated person instances. For our experiments, we pre-computed optical flow fields for all videos using the public pre-trained model of FlowNet2.0 [40].

Since adjacent frames are typically highly correlated, we split the training and test dataset at the video level, with 6 video sets used for training (875 annotated frames) and 3 sets reserved for testing (402 annotated frames). In each set of videos, there are 3-4 participants, where two of them are first-person camera wearers. Note that a first-person camera never sees its own camera wearer, so the people not wearing cameras are the ones who are in common across the first-person videos. Since our approach uses sequences of contiguous frames and pairs of instances (either a pair of two people or a pair of one person and one camera view), we divide each video set into several short sequences, each with 10-15 consecutive frames. More specifically, in training we create 484 positive and 1,452 negative pairs for the third-third problem, and 865 positive and 1,241 negative pairs for the third-first problem (about a 1:3 ratio in each case). In testing, each problem has 10 sequences of pairs of videos, and each video has 20 consecutive frames (about 4 seconds). This means that we have a total of about 400 annotated test frames for evaluating matching, and about 1,000 person instances for evaluating segmentation (since every frame has 2-3 people).

4.2 Evaluation Protocol

We implemented our networks in PyTorch [41], and performed all experiments on a single Nvidia Titan X Pascal GPU. We optimized our joint models with both offline and online training processes, which we now detail.

Figure 5: IoU and precision-recall curves of our approaches and baselines.

Offline training. Our offline training process consisted of two stages: (a) optimizing only the fully convolutional network (FCN) branch, supervised by the pixel-level classifier, to provide imperfect but reasonable soft attentions, and (b) optimizing the joint model (either the third-third or third-first network) on the person segmentation and identification tasks simultaneously. Our two-stream FCN network is built on VGG16 [42], and to start training on our dataset, we initialized both visual and motion streams using the weights pre-trained on ImageNet [43]. The FCN branch was then optimized with an instance-by-instance strategy, which only considers one particular person of interest at a time, and uses the ground truth person mask as an additional channel to indicate which person the network should focus on. We used stochastic gradient descent (SGD) with a fixed learning rate, momentum, and weight decay, and a batch size of 25. The learning process was terminated after a fixed number of epochs. Our joint model was then initialized with the weights of the pre-trained FCN and fine-tuned by considering pairs of instances as inputs for person segmentation and identification. We again used SGD optimization but with a smaller learning rate. For the initial epochs, we froze the weights of the FCN branch, and optimized the Siamese branch to make the contrastive loss converge to a “reasonable” range (not so large as to destroy the soft attention). We then started the joint learning process, and terminated it after additional epochs.

Online training. As described in the last section, our joint model was first trained offline with the ground truth pre-mask as input, in order to guide the person instance segmentation task. This process assumes that perfect segmentation results from the previous frame are always available, which is not reasonable at test time. To adapt the models to be able to use imperfect pre-masks of people of interest estimated by the model itself, we fine-tuned the joint model on each individual sequence, using the estimated segmentation masks as inputs (assuming perfect segmentation annotations are available only in the first frame of the sequence). We used the same SGD optimization as above, except we changed the learning rate to .
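The use of estimated pre-masks can be illustrated with a short sketch. Here `segment_fn` is a hypothetical stand-in for one forward pass of the joint model (it is not a function from our code), and only the first pre-mask is assumed to be ground truth:

```python
def propagate_masks(frames, first_mask, segment_fn):
    """Segment a sequence by feeding each predicted mask to the next frame.

    segment_fn(frame, pre_mask) -> mask is a hypothetical placeholder for
    the network's forward pass. Only `first_mask` is ground truth; every
    later pre-mask is the model's own (possibly imperfect) estimate.
    """
    masks = [first_mask]
    pre_mask = first_mask
    for frame in frames[1:]:
        pre_mask = segment_fn(frame, pre_mask)
        masks.append(pre_mask)
    return masks
```

Because errors in one frame become inputs to the next, a model that has only ever seen perfect pre-masks may drift, which is the motivation for fine-tuning on estimated masks.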

Testing. In contrast to the training process, which requires pairs of videos as inputs, our joint model can be applied to an individual stream, where each video frame is processed to simultaneously estimate each person’s segmentation and to extract the corresponding features for matching between different streams. In testing, all possible pairs of instances are considered as candidate matches: each pair contains either two people from different videos in the third-third problem, or a first-person camera view and a person appearing in a third-person video in the third-first problem. Unlike other methods that require a pair of instances as input, our approach only needs to process each person and camera view once.
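The practical consequence is that embeddings can be computed once per instance and then compared for every candidate pair. A minimal sketch, assuming Euclidean distance between embeddings (the distance measure and the dictionary layout are assumptions of this example):

```python
import math
from itertools import product

def match_instances(embs_a, embs_b):
    """Rank all cross-view candidate pairs by embedding distance.

    embs_a, embs_b: dicts mapping an instance id to its embedding (a list
    of floats), each computed once per instance.
    Returns (distance, id_a, id_b) tuples, closest pair first.
    """
    pairs = []
    for (id_a, ea), (id_b, eb) in product(embs_a.items(), embs_b.items()):
        dist = math.dist(ea, eb)  # Euclidean distance (Python 3.8+)
        pairs.append((dist, id_a, id_b))
    return sorted(pairs)
```

With N instances in one view and M in the other, this needs N + M forward passes rather than N × M, since only the cheap distance computation is done per pair.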

Problem      Backbone    Image  Optical flow  Re-weighting      IoU    mAP    ACC
Baselines    Copy First  -      -             -                 41.9   -      -
Baselines    FCN         X      -             -                 47.1   -      -
Baselines    FCN         -      X             -                 50.9   -      -
Baselines    FCN         X      X             -                 57.3   -      -
Third-Third  VGG         X      X             bounding box [4]  -      44.2   40.1
Third-Third  FCN         X      -             soft attention    49.3   44.3   44.5
Third-Third  FCN         -      X             soft attention    54.1   48.4   46.2
Third-Third  FCN         X      X             w/o               60.6   45.6   48.9
Third-Third  FCN         X      X             soft attention    62.7   49.0   55.5
Third-First  VGG         X      X             bounding box [4]  -      64.1   50.6
Third-First  FCN         X      -             soft attention    47.4   51.4   52.7
Third-First  FCN         -      X             soft attention    58.9   55.1   53.1
Third-First  FCN         X      X             w/o               59.8   64.0   61.7
Third-First  FCN         X      X             soft attention    61.9   65.2   73.1

Table 1: Evaluation results of each component of our network architecture. Backbone, streams (Image, Optical flow), and Re-weighting describe the network architecture; IoU measures segmentation, and mAP and ACC measure identification.

4.3 Evaluation

For each problem (third-first and third-third), we evaluate our approach with two tasks: (1) person instance segmentation and (2) people identification across multiple cameras.

Person instance segmentation. We evaluate segmentation in terms of intersection over union (IoU) between the estimated segmentation maps and the ground truth. This is measured over each video in the testing dataset by applying our models to each frame. Our model sequentially takes the segmentation result from the previous frame (which we call the pre-mask) as input to guide the segmentation of the next frame. In the evaluation, the ground truth segmentation mask of the first (and only the first) video frame is assumed to be available.
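For concreteness, IoU between two binary masks can be computed as in the following sketch. The nested-list mask representation and the handling of two empty masks are assumptions of this example:

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention (an assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0
```

The function returns a value in [0, 1]; the scores reported in Table 1 are on a 0-100 scale.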

Person identification. We evaluate the person identification task through mean average precision and accuracy, each of which takes a slightly different view of the problem. Mean average precision (mAP) treats the people matching task as a retrieval problem: given all possible pairs of person instances from two different cameras (i.e., two person instances from different third-person videos in the third-third problem, or one person instance from a third-person video and one first-person video in the third-first problem), the objective is to retrieve all of the pairs corresponding to the same person. Meanwhile, accuracy (ACC) evaluates whether the single best match for a given candidate is correct or not; i.e., for every person instance in each view, the classifier is forced to choose a single matching instance in all other views, and we calculate the percentage of matches that are correct. This setting is the same as the one used in Fan et al. [4], except that their task is significantly easier because they assume ground truth person bounding boxes are available during both training and testing, whereas our approach must infer the person’s position (as well as segmentation mask) automatically.
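The two metrics can be sketched as follows. This is an illustrative implementation under stated assumptions: scores are similarities (higher means more likely the same person), and how average precision is averaged over queries or sequences to obtain mAP is left to the caller, since that detail is not specified here:

```python
def average_precision(scored_pairs):
    """AP for a list of (score, is_match) pairs; higher score = more similar."""
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_match) in enumerate(ranked, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

def top1_accuracy(queries):
    """Fraction of queries whose best-scoring candidate is the true match.

    queries: list of candidate lists, each a list of (score, is_match).
    """
    correct = sum(1 for cands in queries if max(cands, key=lambda p: p[0])[1])
    return correct / len(queries)
```

The retrieval view rewards ranking all true pairs highly, while the accuracy view only checks the single forced choice per instance, so the two numbers can differ noticeably for the same model.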

4.4 Experimental Results

Table 1 summarizes the results of our experiments. To characterize the difficulty of segmentation in this dataset, we first tested several baselines. Copy First simply propagates the ground truth segmentation mask from the first frame to all following frames in the sequence. In a completely static scene with no motion, the IoU of this baseline would be 100.0, but this rarely occurs in our dataset because of frequent ego-motion of the wearable cameras, as well as motion of the people themselves. Copy First thus achieved a relatively low IoU of 41.9. A second baseline consisting of a single-stream FCN using only image information achieved a somewhat better IoU of 47.1, while a third baseline consisting of a single-stream FCN using only optical flow achieved 50.9. A two-stream baseline FCN that combined both visual and motion information gave a significant boost over either one-stream network, achieving an IoU of 57.3. We next tested our proposed approach that jointly performs segmentation with person instance matching. On the segmentation task, our full model produced an IoU score of 62.7 for the third-third scenario, and 61.9 for third-first, compared to 57.3 for the two-stream baseline that performed only segmentation. Figure 5 (a) reports a more detailed analysis of the segmentation performance (Y-axis) as a function of the length of the video sequence (X-axis), and shows that our approach is still able to predict reasonable results on long video sequences. To permit a fair comparison across models, both the one- and two-stream FCNs were optimized with the same hyper-parameters (as discussed in Section 4.2).

Table 1 also presents our results on person instance matching. We achieve mAP scores of 49.0 and 65.2 on the third-third and third-first problems, respectively, and ACC scores of 55.5 and 73.1. We compared these results with those of the state-of-the-art method of Fan et al. [4]. Although their task is to match first-person camera views to camera wearers in static third-person video, we extended their method to our third-third and third-first problems to enable a fair comparison. In particular, we re-implemented their best model based on VGG16 [42] (they used AlexNet [44]) and trained it on our new, augmented dataset. The results show that our joint model outperforms theirs on both the third-third (mAP of 49.0 vs. 44.2) and third-first (mAP of 65.2 vs. 64.1) problems. We attribute this to the learning of a more accurate embedding space, with the help of jointly learning to perform segmentation. More importantly, our approach is able to obtain more accurate feature representations from people’s pixel-level locations rather than simply relying on rough bounding boxes. Figure 5 (b) compares the precision-recall curves of the different techniques for person matching.

Figure 6: Sample results of the third-third and third-first problems, where video 1 and video 2 of each sample are from two synchronized wearable cameras. The color of person segmentation masks and camera views indicates the correspondences across different cameras.

We conducted ablation studies to test simpler variants of our technique. To measure the contribution of our re-weighting method that incorporates estimated soft attention maps, we tested a variant that does not re-weight the spatial and temporal features but simply uses the pool5 layer outputs. We also compared with the results of [4], which uses ground truth bounding boxes to “crop out” regions of interest. As shown in Table 1, re-weighting with soft attention maps not only improves performance on the matching task, but also helps generate better person segmentation results. Our ablation study also tested the relative contribution of each of our motion and visual feature streams. As shown in Table 1, our dual-stream approach performs significantly better than either the single-stream optical flow or the single-stream visual model, on both the third-third and third-first problems, evaluated on both segmentation and matching tasks.
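To illustrate what re-weighting by a soft attention map means in practice, the following sketch pools a feature map using attention values as weights. It is an assumption-laden illustration, not our network code: the nested-list layout, the function name, and the normalisation by total attention mass are all choices made for this example.

```python
def attention_pool(features, attention):
    """Attention-weighted average of a feature map.

    features: C x H x W nested lists; attention: H x W values in [0, 1].
    Returns a length-C descriptor. Normalising by the attention mass is an
    assumption of this sketch.
    """
    total = sum(sum(row) for row in attention)
    if total == 0:
        return [0.0] * len(features)
    pooled = []
    for channel in features:
        weighted = sum(
            f * a
            for f_row, a_row in zip(channel, attention)
            for f, a in zip(f_row, a_row)
        )
        pooled.append(weighted / total)
    return pooled
```

Locations with low attention contribute little to the descriptor, so background pixels inside a rough bounding box have less influence than they would under plain pooling.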

To summarize, our evaluation results demonstrated the effectiveness of our approach on jointly learning the person segmentation and identification tasks. The results suggested that jointly inferring pixel-level segmentation maps and correspondences of people helps perform each individual task more accurately, and that incorporating both motion and visual information works better than either individually.

5 Conclusion

We presented a novel fully convolutional network (FCN) with Siamese and semi-Siamese structures for joint person instance segmentation and identification. We showed that jointly optimizing these two tasks is mutually beneficial and requires fewer restrictive inputs, such as per-frame bounding boxes. We also prepared a new, challenging dataset with pixel-level person annotations and matching in multiple first- and third-person cameras.

Although our results are encouraging, our techniques have limitations and raise opportunities for future work. First, the joint models assume people appear in every frame of the video, so our approach will treat someone who disappears from the scene and then re-enters as a new person instance. While this assumption is reasonable for the relatively short video sequences we consider here, future work could add a re-identification module to recognize people who have appeared in previous frames. Second, the joint models perform an FCN forward pass for every individual person in each frame; future work could explore sharing computation to improve the efficiency of our method, especially for real-time applications. Lastly, we plan to further evaluate our approach on larger datasets including more diverse scenarios.


References

  • [1] LDV Capital: 45 billion cameras by 2022 fuel business opportunities. Technical report (2017)
  • [2]
  • [3]
  • [4] Fan, C., Lee, J., Xu, M., Singh, K.K., Lee, Y.J., Crandall, D.J., Ryoo, M.S.: Identifying first-person camera wearers in third-person videos. In: CVPR. (2017)
  • [5] Lafayette Group: Survey of technology needs - body worn cameras. Technical report, Major Cities Chiefs and Major County Sheriffs (2015)
  • [6] Smeulders, A.W.M., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: An experimental survey. PAMI (2014)
  • [7] Zheng, L., Yang, Y., Hauptmann, A.G.: Person re-identification: Past, present and future. arXiv 1610.02984 (2016)
  • [8] Ardeshir, S., Borji, A.: Ego2top: Matching viewers in egocentric and top-view cameras. In: ECCV. (2016)
  • [9] Zheng, K., Fan, X., Lin, Y., Guo, H., Yu, H., Guo, D., Wang, S.: Learning view-invariant features for person identification in temporally synchronized videos taken by wearable cameras. In: ICCV. (2017)
  • [10] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015)
  • [11] Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561 (2015)
  • [12] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. arXiv:1612.01105 (2016)
  • [13] Lin, G., Milan, A., Shen, C., Reid, I.: Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In: CVPR. (2017)
  • [14] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv:1802.02611 (2018)
  • [15] Pinheiro, P.O., Collobert, R., Dollár, P.: Learning to segment object candidates. In: NIPS. (2015)
  • [16] Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P.: Learning to refine object segments. In: ECCV. (2016)
  • [17] Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y.: Fully convolutional instance-aware semantic segmentation. arXiv:1611.07709 (2016)
  • [18] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. arXiv:1703.06870 (2017)
  • [19] Girshick, R.: Fast r-cnn. In: CVPR. (2015)
  • [20] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: NIPS. (2015)
  • [21] Khoreva, A., Perazzi, F., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. arXiv:1612.02646 (2016)
  • [22] Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. arXiv:1706.09364 (2017)
  • [23] Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: CVPR. (2017)
  • [24] Jang, W.D., Kim, C.S.: Online video object segmentation via convolutional trident network. In: CVPR. (2017)
  • [25] Jun Koh, Y., Kim, C.S.: Primary object segmentation in videos based on region augmentation and reduction. In: CVPR. (2017)
  • [26] Tokmakov, P., Alahari, K., Schmid, C.: Learning video object segmentation with visual memory. arXiv:1704.05737 (2017)
  • [27] Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: CVPR. (2017)
  • [28] Yoon, J.S., Rameau, F., Kim, J., Lee, S., Shin, S., Kweon, I.S.: Pixel-level matching for video object segmentation using convolutional neural networks. arXiv:1708.05137 (2017)
  • [29] Cheng, J., Tsai, Y.H., Wang, S., Yang, M.H.: Segflow: Joint learning for video object segmentation and optical flow. In: ICCV. (2017)
  • [30] Rother, C., Minka, T., Blake, A., Kolmogorov, V.: Cosegmentation of image pairs by histogram matching-incorporating a global constraint into mrfs. In: CVPR. (2006)
  • [31] Batra, D., Kowdle, A., Parikh, D., Luo, J., Chen, T.: icoseg: Interactive co-segmentation with intelligent scribble guidance. In: CVPR. (2010)
  • [32] Rubio, J.C., Serrat, J., López, A.: Video co-segmentation. In: ACCV. (2012)
  • [33] Chen, D.J., Chen, H.T., Chang, L.W.: Video object cosegmentation. In: ACMMM. (2012)
  • [34] Chiu, W.C., Fritz, M.: Multi-class video co-segmentation with a generative multi-video model. In: CVPR. (2013)
  • [35] Fu, H., Xu, D., Zhang, B., Lin, S.: Object-based multiple foreground video co-segmentation. In: CVPR. (2014)
  • [36] Guo, J., Cheong, L.F., Tan, R.T., Zhou, S.Z.: Consistent foreground co-segmentation. In: ACCV. (2014)
  • [37] Endres, I., Hoiem, D.: Category independent object proposals. In: ECCV. (2010)
  • [38] Ardeshir, S., Borji, A.: Ego2Top: matching viewers in egocentric and top-view videos. In: ECCV. (2016)
  • [39] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS. (2014)
  • [40] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR. (2017)
  • [41]
  • [42] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
  • [43] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR. (2009)
  • [44] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. (2012)