Instance Embedding Transfer to Unsupervised Video Object Segmentation

Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, and C.-C. Jay Kuo
University of Southern California, Google Inc.

We propose a method for unsupervised video object segmentation by transferring the knowledge encapsulated in image-based instance embedding networks. The instance embedding network produces an embedding vector for each pixel that enables identifying all pixels belonging to the same object. Though trained on static images, the instance embeddings are stable over consecutive video frames, which allows us to link objects together over time. Thus, we adapt the instance embedding networks trained on static images to video object segmentation and incorporate the embeddings with objectness and optical flow features, without model retraining or online fine-tuning. The proposed method outperforms state-of-the-art unsupervised segmentation methods on the DAVIS and FBMS datasets.

1 Introduction

One important task in video understanding is object localization in time and space. Ideally, it should be able to localize familiar or novel objects consistently over time with a sharp object mask, which is known as video object segmentation (VOS). If no indication of which object to segment is given, the task is known as unsupervised video object segmentation or primary object segmentation. Once an object is segmented, visual effects and video understanding tools can leverage that information [2, 24].

Related object segmentation tasks in static images are currently dominated by methods based on the fully convolutional neural network (FCN)  [5, 26]. These neural networks require large datasets of segmented object images such as PASCAL [8] and COCO [25]. Video segmentation datasets are smaller because they are more expensive to annotate [23, 30, 33]. As a result, it is more difficult to train a neural network explicitly for video segmentation. Classic work in video segmentation produced results using optical flow and shallow appearance models [20, 29, 22, 32, 11] while more recent methods typically pretrain the network on image segmentation datasets and later adapt the network to the video domain, sometimes combined with optical flow [4, 36, 6, 40, 37, 16].

Figure 1: An example of the changing segmentation target (foreground) in videos depending on motion. A car is the foreground in the top video while a car is the background in the bottom video. To address this issue, our method obtains embeddings for object instances and identifies representative embeddings for foreground/background and then segments the frame based on the representative embeddings. Left: the ground truth. Middle: A visualization of the embeddings projected into RGB space via PCA, along with representative points for the foreground (magenta) and background (blue). Right: the segmentation masks produced by the proposed method. Best viewed in color.

In this paper, we propose a method to transfer the knowledge encapsulated in instance segmentation embeddings learned from static images and integrate it with objectness and optical flow to segment a moving object in video. Instead of training an FCN that directly classifies each pixel as foreground/background as in [37, 16, 6, 36], we train an FCN that jointly learns object instance embeddings and semantic categories from images [10]. The distance between the learned embeddings encodes the similarity between pixels. We argue that the instance embedding is a more useful feature to transfer from images to videos than a foreground/background prediction. As shown in Fig. 1, cars appear in both videos but belong to different categories (foreground in the first video and background in the second video). If the network is trained to directly classify cars as foreground on the first video, it tends to classify the cars as foreground in the second video as well. As a result, the network needs to be fine-tuned for each sequence [4]. In contrast, the instance embedding network can produce unique embeddings for the car in both sequences without interfering with other predictions or requiring fine-tuning. The task then becomes selecting the correct embeddings to use as an appearance model. Relying on the embeddings to encode object instance information, we propose a method to identify the representative embeddings for the foreground (target object) and the background based on objectness scores and optical flow. Visual examples of the representative embeddings are displayed in the middle column of Fig. 1. Finally, all pixels are classified by finding the nearest neighbor in a set of representative foreground or background embeddings. This is a non-parametric process requiring no video specific supervision for training or testing.

We evaluate the proposed method on the DAVIS dataset [33] and the FBMS dataset [30]. Without fine-tuning the embedding network on the target datasets, we obtain better performance than previous state-of-the-art methods. More specifically, we achieve a mean intersection-over-union (IoU) of 78.5% and 71.9% on the DAVIS dataset [33] and the FBMS dataset [30], respectively.

To summarize, our main contributions include

  • A new strategy for adapting instance segmentation models trained on static images to videos. Notably, this strategy performs well on video datasets without requiring any video object segmentation annotations.

  • This strategy outperforms previously published unsupervised methods on both DAVIS benchmark and FBMS benchmark and approaches the performance of semi-supervised CNNs without requiring retraining any networks at test time.

  • Proposal of novel criteria for selecting a foreground object without supervision, based on semantic score and motion features over a track.

  • Insights into the stability of instance segmentation embeddings over time.

2 Related Work

Unsupervised video object segmentation. Unsupervised video object segmentation discovers the most salient, or primary, objects that move against a video’s background or display different color statistics. One set of methods to solve this task builds hierarchical clusters of pixels that may delineate objects [11]. Another set of methods performs binary segmentation of foreground and background. Early foreground segmentation methods often used Gaussian Mixture Models and Graph Cut [27, 38], but more recent work uses convolutional neural networks (CNN) to identify foreground pixels based on saliency, edges, and/or motion  [36, 37, 16]. For example, LMP [36] trains a network which takes optical flow as an input to separate moving and non-moving regions and then combines the results with objectness cues from SharpMask [34] to generate the moving object segmentation. LVO [37] trains a two-stream network, using RGB appearance features and optical flow motion features that feed into a ConvGRU [42] layer to generate the final prediction. FSEG [16] also proposes a two-stream network trained with mined supplemental data. SfMNet [39] uses differentiable rendering to learn object masks and motion models without mask annotations. Despite the risk of focusing on the wrong object, unsupervised methods can be deployed in more places because they do not require user interaction to specify an object to segment. Since we are interested in methods requiring no user interaction, we choose to focus on unsupervised segmentation.

Semi-supervised video object segmentation. Semi-supervised video object segmentation utilizes human annotations on the first frame of a video (or more) indicating which object the system should track. Importantly, the annotation provides a very good appearance model initialization that unsupervised methods lack. The problem can be formulated as either a binary segmentation task conditioned on the annotated frame or a mask propagation task between frames. Non-CNN methods typically rely on Graph Cut [27, 38], but CNN based methods offer better accuracy  [19, 4, 40, 6]. Mask propagation CNNs take in the previous mask prediction and a new frame to propose a segmentation in the new frame. VPN [17] trains a bilateral network to propagate to new frames. MSK [19] trains a propagation network with synthetic transformations of still images and applies the same technique for online fine-tuning. SegFlow [6] finds that jointly learning moving object masks and optical flow helps to boost the segmentation performance. Binary segmentation CNNs typically utilize the first frame for fine-tuning the network to a specific sequence. The exact method for fine-tuning varies: OSVOS [4] simply fine-tunes on the first frame. OnAVOS [40] fine-tunes on the first frame and a subset of predictions from future frames. Fine-tuning can take seconds to minutes, and longer fine-tuning typically results in better segmentation. Avoiding the time cost of fine-tuning is a further inducement to focus on unsupervised methods.

Image segmentation. Many video object segmentation methods [40, 4, 19] are built upon image semantic segmentation neural networks [26, 5, 14], which predict a category label for each pixel. These fully convolutional networks allow end-to-end training on images of arbitrary sizes. Semantic segmentation networks do not distinguish different instances from the same object category, which limits their suitability to video object segmentation. Instance segmentation networks [10, 28, 12] can label each instance uniquely. Instance embedding methods [10, 28] provide an embedding space where pixels belonging to the same instance have similar embeddings. Spatial variations in the embeddings indicate the edges of object masks. Relevant details are given in Sec. 3.1. It was unknown if instance embeddings are stable over time in videos, but we hypothesized that these embeddings might be useful for video object segmentation.

Figure 2: An overview of the proposed method. Given the video sequences, the dense embeddings are obtained by applying an instance segmentation network trained on static images. Then representative embeddings, called seeds, are obtained. Seeds are linked across the whole sequence (we show 3 consecutive frames as an illustration here). The seed with the highest score based on objectness and motion saliency is selected as the initial seed (in magenta) to grow the initial segmentation. Finally, more foreground seeds as well as background seeds are identified to refine the segmentation.

3 Proposed Method

An overview of the proposed method is depicted in Fig. 2. We first obtain instance embeddings, objectness scores, and optical flow that we will use as inputs (Sec. 3.1). Based on the instance embeddings, we identify “seed” points that mark segmentation proposals (Sec. 3.2). Consistent proposal seeds are linked to build seed tracks, and we rank the seed tracks by objectness scores and motion saliency to select a foreground proposal seed on every frame (Sec. 3.3). We further build a set of foreground/background proposal seeds to produce the final segmentation mask in each frame (Sec. 3.4).

3.1 Extracting features

Our method utilizes three inputs: instance embeddings, objectness scores, and optical flow. None of these features are fine-tuned on video object segmentation datasets or fine-tuned online to specific sequences. The features are extracted for each frame independently as follows.

Instance embedding and objectness. We train a network to output instance embeddings and semantic categories on the image instance segmentation task as in [10]. Briefly, the instance embedding network is a dense-output convolutional neural network with two output heads trained on static images from an instance segmentation dataset.

The first head outputs an embedding $e_p \in \mathbb{R}^d$ for each pixel $p$, where pixels from the same object instance have smaller Euclidean distances between them than pixels belonging to separate objects. Similarity between two pixels $i$ and $j$ is measured as a function of the Euclidean distance in the $d$-dimensional embedding space,

$$R(i, j) = \frac{2}{1 + \exp(\| e_i - e_j \|_2^2)}. \quad (1)$$

This head is trained by minimizing the cross entropy between the similarity $R(i, j)$ and the ground truth matching indicator $g(i, j)$. For locations $i$ and $j$, the ground truth matching indicator $g(i, j) = 1$ if the pixels belong to the same instance and $g(i, j) = 0$ otherwise. The loss is given by

$$\mathcal{L} = -\frac{1}{|\mathcal{S}|} \sum_{(i, j) \in \mathcal{S}} w_{ij} \left[ g(i, j) \log R(i, j) + (1 - g(i, j)) \log (1 - R(i, j)) \right], \quad (2)$$

where $\mathcal{S}$ is a set of pixel pairs, $R(i, j)$ is the similarity between pixels $i$ and $j$ in the embedding space, and the weight $w_{ij}$ is inversely proportional to instance size to balance training.
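For illustration, the similarity of Eq. 1 and the per-pair loss term of Eq. 2 can be sketched as follows (a minimal NumPy sketch; the function names are ours, not from the authors' implementation, and the full training loss averages over many sampled pixel pairs):

```python
import numpy as np

def similarity(e_i, e_j):
    """Eq. 1: R(i, j) = 2 / (1 + exp(||e_i - e_j||^2))."""
    d2 = np.sum((e_i - e_j) ** 2, axis=-1)
    return 2.0 / (1.0 + np.exp(d2))

def pair_loss(e_i, e_j, same_instance, weight=1.0):
    """Weighted cross-entropy term of Eq. 2 for a single pixel pair."""
    r = similarity(e_i, e_j)
    if same_instance:
        return -weight * np.log(r)
    return -weight * np.log(1.0 - r)
```

Note that identical embeddings give similarity 1, and similarity decays toward 0 as the embeddings separate, so the loss pushes same-instance pairs together and different-instance pairs apart.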

The second head outputs an objectness score from semantic segmentation. We minimize a semantic segmentation log loss to train the second head to output a semantic category probability for each pixel. The objectness map is derived from the semantic prediction as

$$O(p) = 1 - \Pr(p \in \text{``background''}), \quad (3)$$

where $\Pr(p \in \text{``background''})$ is the probability that pixel $p$ belongs to the semantic category "background". (Here in semantic segmentation, "background" refers to the region that does not belong to any category of interest, as opposed to video object segmentation, where the "background" is the region other than the target object. We use "background" in the video object segmentation sense for the rest of the paper.) We do not use the scores for any class other than the background in our work.

Embedding graph. We build a 4-neighbor graph from the dense embedding map, where each embedding vector becomes a vertex and edges exist between spatially neighboring embeddings with weights equal to the Euclidean distance between embedding vectors. This embedding graph will be used to generate image regions later. A visualized embedding graph is shown in Fig. 3.
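As a concrete sketch, the 4-neighbor graph construction can be written in a few lines (a NumPy illustration under our own naming; a dictionary keyed by pixel-coordinate pairs stands in for whatever graph structure the actual implementation uses):

```python
import numpy as np

def embedding_graph(emb):
    """Build a 4-neighbor graph over an (H, W, d) embedding map.
    Returns a dict mapping edge ((r, c), (r2, c2)) -> Euclidean distance."""
    H, W, _ = emb.shape
    edges = {}
    for r in range(H):
        for c in range(W):
            # Right and down neighbors cover every undirected edge exactly once.
            for r2, c2 in ((r + 1, c), (r, c + 1)):
                if r2 < H and c2 < W:
                    edges[((r, c), (r2, c2))] = np.linalg.norm(emb[r, c] - emb[r2, c2])
    return edges
```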

Optical flow. The motion saliency cues are built upon optical flow. For fast optical flow estimation with good precision, we utilize a reimplementation of FlowNet 2.0 [15], an iterative neural network.

3.2 Generating proposal seeds

We propose a small number of representative seed points $\mathcal{S}^t$ in frame $t$ for some subset of frames (typically all) in the video. Most computations only compare against seeds within the current frame, so the superscript $t$ is omitted for clarity unless the computation is across multiple frames. The seeds we consider as FG or BG should be diverse in embedding space because the segmentation target can be a moving object from an arbitrary category. In a set of diverse seeds, at least one seed should belong to the FG region. We also need at least one BG seed because the distances in the embedding space are relative. The relative distances in embedding space, or the similarity from Eq. 1, from each point to the FG and BG seed(s) can be used to assign labels to all pixels.

Candidate points. In addition to being diverse, the seeds should be representative of objects. The embeddings on the boundary of two objects are usually not close to the embedding of either object. Because we want embeddings representative of objects, we exclude seeds from object boundaries. To avoid object boundaries, we only select seeds from candidate points where the instance embeddings are locally consistent. (An alternative method to identify the boundaries to avoid would be to use an edge detector such as [7, 41].) We construct a map of embedding edges by mapping discontinuities in the embedding space. The embedding edge map $E$ is defined as the "inverse" similarity in the embedding space within the neighbors around each pixel,

$$E(x) = \max_{y \in \mathcal{N}(x)} \left[ 1 - R(x, y) \right], \quad (4)$$

where $x$ and $y$ are pixel locations, $\mathcal{N}(x)$ contains the four neighbors of $x$, and $R(\cdot, \cdot)$ is the similarity measure given in Eq. 1. Then in the edge map we identify the pixels which are the minimum within an $h \times h$ window centered at each pixel. These pixels form the candidate set $\mathcal{C}$. Mathematically,

$$\mathcal{C} = \{ x \mid E(x) \le E(y), \; \forall y \in \mathcal{W}(x) \}, \quad (5)$$

where $\mathcal{W}(x)$ denotes the local $h \times h$ window centered at $x$.
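A direct, unoptimized sketch of the embedding edge map and candidate selection (our own NumPy illustration; a real implementation would vectorize these loops):

```python
import numpy as np

def edge_map(emb):
    """Per pixel, the largest 'inverse similarity' 1 - R to its 4-neighbors."""
    H, W, _ = emb.shape
    E = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            worst = 0.0
            for r2, c2 in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= r2 < H and 0 <= c2 < W:
                    d2 = np.sum((emb[r, c] - emb[r2, c2]) ** 2)
                    worst = max(worst, 1.0 - 2.0 / (1.0 + np.exp(d2)))
            E[r, c] = worst
    return E

def candidates(E, h):
    """Pixels that are the minimum of the edge map within an h x h window."""
    H, W = E.shape
    k = h // 2
    out = []
    for r in range(H):
        for c in range(W):
            win = E[max(0, r - k):r + k + 1, max(0, c - k):c + k + 1]
            if E[r, c] <= win.min():
                out.append((r, c))
    return out
```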

Diverse seeds. These candidate points, $\mathcal{C}$, are diverse, but still redundant with one another. We take a diverse subset of these candidates as seeds by adopting the sampling procedure from KMeans++ initialization [1]. We only need diverse sampling rather than cluster assignments, so we do not perform the time-consuming KMeans step afterwards. The sampling procedure begins by adding the candidate point with the largest objectness score, $\arg\max_{c \in \mathcal{C}} O(c)$, to the seed set, $\mathcal{S}$. Sampling continues by iteratively adding the candidate, $s_{\text{new}}$, with the smallest maximum similarity to all previously selected seeds, and stops when we reach $N_{\mathcal{S}}$ seeds,

$$s_{\text{new}} = \arg\min_{c \in \mathcal{C} \setminus \mathcal{S}} \max_{s \in \mathcal{S}} R(c, s). \quad (6)$$

We repeat this procedure to produce the seeds for each frame independently, forming a seed set $\mathcal{S}^t$ per frame. Note that the sampling strategy differs from [10], where they consider a weighted sum of the embedding distance and semantic scores. We do not consider the semantic scores because we want to have representative embeddings for all regions of the current frame, including the background, while in [10], the background is disregarded. Fig. 3 shows one example of the visualized embedding edge map, the corresponding candidate set and the selected seeds.
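The sampling loop can be sketched as follows (a NumPy illustration with hypothetical names; `cand_emb` holds the candidate embeddings and `objectness` their scores from Eq. 3):

```python
import numpy as np

def sample_seeds(cand_emb, objectness, n_seeds):
    """KMeans++-style diversity sampling: start from the candidate with the
    highest objectness, then repeatedly add the candidate whose maximum
    similarity to the already-chosen set is smallest."""
    def sim(a, b):
        return 2.0 / (1.0 + np.exp(np.sum((a - b) ** 2, axis=-1)))

    chosen = [int(np.argmax(objectness))]
    while len(chosen) < n_seeds:
        best, best_val = None, np.inf
        for i in range(len(cand_emb)):
            if i in chosen:
                continue
            val = max(sim(cand_emb[i], cand_emb[j]) for j in chosen)
            if val < best_val:
                best, best_val = i, val
        chosen.append(best)
    return chosen
```

Because similarity decreases with embedding distance, minimizing the maximum similarity is equivalent to the farthest-point selection used in KMeans++ initialization.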

Original image Embedding graph
Embedding edge map Candidate set Seed set
Figure 3: Top: An image (left) and the visualization of its embedding graph in the box in blue. The edge colors on the right reflect distances between the embeddings at each pixel (the center subfigure visualizes the embeddings via PCA). High costs appear along object boundaries. Bottom: The embedding edge map (left), the candidate set (center) and the seed set (right). Best viewed in color.

3.3 Ranking proposed seeds

In the unsupervised video object segmentation problem, we do not have an explicitly specified target object. Therefore, we need to identify a moving object as the segmentation target (i.e., FG). We first score the seeds based on objectness and motion saliency. To find the most promising seed for FG, we then build seed tracks by grouping embedding-consistent seeds across frames and aggregate scores along tracks. The objectness score of each seed $s$ is exactly $O(s)$ from Eq. 3. The motion saliency as well as seed track construction and ranking are explained below.

Motion saliency. Differences in optical flow can separate objects moving against a background [3]. Because optical flow estimation is still imperfect [15], we average flow within image regions rather than using the flow from a single pixel. The region corresponding to each seed consists of the pixels in the embedding graph from Sec. 3.1 with the shortest geodesic distance to that seed. For each seed $s$, we use the average optical flow in the corresponding region as its motion vector $f_s$. An example of image regions is shown in Fig. 4.
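Assuming the pixel-to-seed assignment has already been computed (each pixel labeled with its geodesically nearest seed), the per-region flow averaging reduces to the following sketch (our own names; `assign` and `flow` are hypothetical inputs):

```python
import numpy as np

def region_flows(assign, flow, n_seeds):
    """Average the (H, W, 2) optical flow inside each seed's region.
    `assign` is an (H, W) map of nearest-seed indices."""
    flows = np.zeros((n_seeds, 2))
    for s in range(n_seeds):
        mask = assign == s
        flows[s] = flow[mask].mean(axis=0) if mask.any() else 0.0
    return flows
```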

Original image Embedding map Image regions
Pixel optical flow Region optical flow Motion saliency
Figure 4: Top: left: An image. center: A projection of the embedding map into RGB space (via PCA) with the initial background seeds marked in blue and other seeds in red. right: The regions near each seed in the embedding graph. Bottom: left: The optical flow. center: Average flow within each region. right: A map of motion saliency scores. Best viewed in color.

Then we construct a model of the background. First, the $N_B$ seeds with the lowest objectness scores are selected as the initial background seeds, denoted by $\mathcal{B}$. The set of motion vectors associated with these seeds, $\{ f_b \mid b \in \mathcal{B} \}$, forms our background motion model. The motion saliency for each seed, $m_s$, is the normalized distance to the nearest background motion vector,

$$m_s = \frac{1}{Z} \min_{b \in \mathcal{B}} \| f_s - f_b \|_2, \quad (7)$$

where the normalization factor $Z$ is given by

$$Z = \max_{s \in \mathcal{S}} \min_{b \in \mathcal{B}} \| f_s - f_b \|_2. \quad (8)$$
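The motion saliency computation amounts to a nearest-background-flow distance with max-normalization, sketched below (NumPy; the names are ours):

```python
import numpy as np

def motion_saliency(flows, bg_idx):
    """Distance from each seed's mean flow to the nearest background seed's
    flow, normalized by the largest such distance over all seeds."""
    flows = np.asarray(flows, dtype=float)
    bg = flows[bg_idx]  # motion vectors of the background seed set
    d = np.array([np.min(np.linalg.norm(bg - f, axis=1)) for f in flows])
    Z = d.max()
    return d / Z if Z > 0 else d
```

Seeds whose regions move with the background get saliency near 0; the most distinctly moving seed gets saliency 1.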
There are other approaches to derive motion saliency from optical flow. For example, in [20], motion edges are obtained by applying an edge detector to the optical flow, and the motion saliency of a region is then computed as a function of its motion edge intensity. The motion saliency proposed in this work is more efficient and works well in terms of final segmentation performance.

Seed tracks. Another property of the foreground object is that it should be a salient object in multiple frames. We score this by linking similar seeds together across frames into a seed track and taking the average product of objectness and motion saliency scores over each track. The $k$-th seed on frame 0, $s_k^0$, initiates a seed track, $T_k^0$. The track is extended frame by frame by adding the seed with the highest similarity to the track's latest seed. Specifically, supposing that we have a track $T_k^t$ across frames 0 to $t$ ending at seed $\hat{s}_k^t$, it is extended to frame $t+1$ by adding the most similar seed on frame $t+1$, forming $T_k^{t+1}$:

$$T_k^{t+1} = T_k^t \cup \{ \hat{s}_k^{t+1} \}, \qquad \hat{s}_k^{t+1} = \arg\max_{s \in \mathcal{S}^{t+1}} R(s, \hat{s}_k^t), \quad (9)$$

where $R(\cdot, \cdot)$ is the similarity measure given by Eq. 1 and $\hat{s}_k^{t+1}$ is the seed in frame $t+1$ with the highest similarity to $\hat{s}_k^t$. Eventually, we have a track $T_k$ starting from $s_k^0$ and ending at some seed on the last frame. The foreground score for $T_k$ is

$$\text{Score}(T_k) = \frac{1}{|T_k|} \sum_{s \in T_k} O(s) \, m_s, \quad (10)$$

where $|T_k|$ is the size of the seed track, equal to the sequence length.
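The greedy linking and the track scoring of Eq. 10 can be sketched as follows (our own illustration with simplified types; in practice each seed carries its embedding, objectness, and motion saliency together):

```python
import numpy as np

def sim(a, b):
    """Embedding similarity (Eq. 1) between two seed embeddings."""
    return 2.0 / (1.0 + np.exp(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def build_track(seeds_per_frame, k):
    """Start from seed k on frame 0 and, frame by frame, append the seed
    most similar to the track's latest seed."""
    track = [seeds_per_frame[0][k]]
    for frame in seeds_per_frame[1:]:
        track.append(max(frame, key=lambda s: sim(s, track[-1])))
    return track

def track_score(obj, mot):
    """Eq. 10: mean of per-seed objectness * motion saliency."""
    return float(np.mean(np.asarray(obj) * np.asarray(mot)))
```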

3.4 Segmenting a foreground proposal

Initial foreground segmentation. The seed track with the highest foreground score is selected on each frame to provide an initial foreground seed, denoted by $s_F$. We obtain an initial foreground segmentation by identifying pixels close to the foreground seed in the embedding graph explained in Sec. 3.1. Here the distance, denoted by $d(p, q)$, between any two nodes, $p$ and $q$, is defined as the maximum edge weight along the shortest geodesic path connecting them. We again take the $N_B$ seeds with the lowest objectness scores as the initial background seed set, $\mathcal{B}$. Then the initial foreground region $F_0$ is composed of the pixels closer to the foreground seed than to any background seed,

$$F_0 = \{ p \mid d(p, s_F) < \min_{b \in \mathcal{B}} d(p, b) \}. \quad (11)$$
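The minimax path distance above can be computed with a Dijkstra-style search in which a path's cost is its largest edge weight rather than the sum of edge weights (our own sketch over a dense embedding map):

```python
import heapq
import numpy as np

def minimax_dist(emb, src):
    """Bottleneck distance on the 4-neighbor embedding graph: the cost of a
    path is its largest edge weight, minimized over all paths from src."""
    H, W, _ = emb.shape
    dist = np.full((H, W), np.inf)
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r, c]:
            continue  # stale heap entry
        for r2, c2 in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= r2 < H and 0 <= c2 < W:
                w = np.linalg.norm(emb[r, c] - emb[r2, c2])
                nd = max(d, w)  # path cost = largest edge seen so far
                if nd < dist[r2, c2]:
                    dist[r2, c2] = nd
                    heapq.heappush(heap, (nd, (r2, c2)))
    return dist
```

A pixel is then assigned to the foreground if its distance to the foreground seed is smaller than its distance to every background seed.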
Adding foreground seeds. Often, selecting a single foreground seed is overly conservative and the initial segmentation fails to cover the entire object. To expand the foreground segmentation, we create a set of foreground seeds, $\mathcal{S}_F$, from the combination of $s_F$ and the seeds marking image regions mostly covered by the initial foreground segmentation. These image regions are the ones described in Sec. 3.3 and Fig. 4. If more than a proportion $\rho$ of a region intersects with the initial foreground segmentation, the corresponding seed is added to $\mathcal{S}_F$.

Adding background seeds. The background contains two types of regions: non-object regions (such as sky, road, water, etc.) that have low objectness scores, and static objects, which are hard negatives because in the embedding space, they are often closer to the foreground than the non-object background. Static objects are particularly challenging when they belong to the same semantic category as the foreground object (e.g., the "camel" sequence of DAVIS in the supplementary materials). We expand our representation of the BG regions by taking the union of the seeds with objectness scores less than a threshold $\tau_o$ and the seeds with motion saliency scores less than a threshold $\tau_m$:

$$\mathcal{S}_B = \{ s \mid O(s) < \tau_o \} \cup \{ s \mid m_s < \tau_m \}. \quad (12)$$
Final segmentation. Once the foreground and background sets are established, similarity to the nearest foreground and background seeds is computed for each pixel. It is possible to use the foreground and background sets from one frame to segment another frame; for each pixel $p$ on the target frame, the FG/BG similarities are

$$R_F(p) = \max_{s \in \mathcal{S}_F} R(p, s), \quad (13)$$

$$R_B(p) = \max_{s \in \mathcal{S}_B} R(p, s), \quad (14)$$

where $R(\cdot, \cdot)$ is the similarity from Eq. 1. Instead of directly propagating the foreground or background label from the most similar seed to the embedding, we obtain a soft score as the confidence of the embedding being foreground:

$$P_F(p) = \frac{R_F(p)}{R_F(p) + R_B(p)}. \quad (15)$$

Finally, the dense CRF [21] is used to refine the segmentation mask, with the unary term set to the negative log of $P_F(p)$, as in [5].
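For a single pixel, Eqs. 13-15 reduce to the following (our own NumPy sketch; `emb_p` is the pixel's embedding and the seed lists hold seed embeddings):

```python
import numpy as np

def fg_confidence(emb_p, fg_seeds, bg_seeds):
    """Soft foreground score from the most similar FG and BG seed embeddings."""
    def sim(a, b):
        return 2.0 / (1.0 + np.exp(np.sum((a - b) ** 2, axis=-1)))

    r_f = max(sim(emb_p, s) for s in fg_seeds)  # Eq. 13
    r_b = max(sim(emb_p, s) for s in bg_seeds)  # Eq. 14
    return r_f / (r_f + r_b)                    # Eq. 15
```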

Online adaptation. Online adaptation of our method is straightforward: we simply generate new sets of foreground and background seeds. This is much less expensive than fine-tuning an FCN for adaptation as done in [40]. Though updating the foreground and background sets could result in segmenting different objects in different frames, it improves the results in general, as discussed in Sec. 4.4.

4 Experiments

Method                NLC [9]  CUT [18]  FST [32]  SFL [6]  LVO [37]  MP [36]  FSEG [16]  ARP [20]  Ours
Fine-tune on DAVIS?   No       Yes       No        Yes      Yes       No       No         No        No
Mean $\mathcal{J}$    55.1     55.2      55.8      67.4     75.9      70.0     70.7       76.2      78.5
Mean $\mathcal{F}$    52.3     55.2      51.1      66.7     72.1      65.9     65.3       70.6      75.5
Table 1: The results on the val set of the DAVIS 2016 dataset [33]. Our method achieves the highest scores in both evaluation metrics and outperforms the methods fine-tuned on DAVIS. Online adaptation is applied on every frame. Per-video results are listed in the supplementary materials.
Figure 5: Example qualitative results on the DAVIS dataset. Our method performs well on videos with large appearance changes (top row), confusing backgrounds (second row, with people in the background), changing viewing angle (third row), and unseen semantic categories (fourth row, with a goat as the foreground). Best viewed in color.

4.1 Datasets and evaluation

We evaluate the proposed method on the DAVIS dataset [33], Freiburg-Berkeley Motion Segmentation (FBMS) dataset  [30], and the SegTrack-v2 dataset [23]. Note that neither the embedding network nor the optical flow network has been trained on these datasets.

DAVIS. The DAVIS 2016 dataset [33] is a recently constructed dataset, containing 50 video sequences in total, with 30 in the train set and 20 in the val set. It provides binary segmentation ground truth masks for all 3455 frames. This dataset contains challenging videos featuring object deformation, occlusion, and motion blur. The "target object" may consist of multiple objects that move together, e.g., a bike with the rider. To evaluate our method, we adopt the protocols in [33], which include region similarity and boundary accuracy, denoted by $\mathcal{J}$ and $\mathcal{F}$, respectively. $\mathcal{J}$ is computed as the intersection over union (IoU) between the segmentation results and the ground truth. $\mathcal{F}$ is the harmonic mean of boundary precision and recall.

FBMS. The FBMS dataset [30] contains 59 video sequences with 720 frames annotated. In contrast to DAVIS, multiple moving objects are annotated separately in FBMS. We convert the instance-level annotations to binary masks by merging all foreground annotations, as in [37]. The evaluation metrics include the F-score protocol proposed in [31] as well as the $\mathcal{J}$ measure used for DAVIS.

SegTrack-v2. The SegTrack-v2 dataset [23] contains 14 videos with a total of 976 frames. Annotations of individual moving objects are provided for all frames. As with FBMS, the union of the object masks is converted to a binary mask for unsupervised video object segmentation evaluation. For this dataset, we only use $\mathcal{J}$ for evaluation to be consistent with previous work.

4.2 Implementation details

We use the image instance segmentation network trained on the PASCAL VOC 2012 dataset [10] to extract the instance embeddings and objectness. The instance segmentation network is based on DeepLab-v2 [5] with ResNet [13] as the backbone. We use the stabilized optical flow from a reimplementation of FlowNet 2.0 [15]. The dimension of the embedding vector, $d$, is 64. The window size $h$ used to identify the candidate set is set to 9 for DAVIS/FBMS, and 5 for SegTrack-v2. For frames in the DAVIS dataset, the 9x9 window results in approximately 200 candidates in the embedding edge map. We select $N_{\mathcal{S}}$ seeds from the candidates. The initial number of background seeds is $N_B$. To add FG seeds as in Sec. 3.4, $\rho$ is set to 0.5. The thresholds for BG seed selection are $\tau_o$ and $\tau_m$. The CRF parameters are identical to the ones in DeepLab [5] (used for the PASCAL dataset). We first tuned all of these parameters on the DAVIS train set, where our $\mathcal{J}$ was 77.5%. We then updated the window size for SegTrack-v2 empirically, considering the video resolution.

4.3 Comparing to the state-of-the-art

DAVIS. As shown in Tab. 1, we obtain the best performance for unsupervised video object segmentation: 2.3% higher than the second best and 2.6% higher than the third best in mean $\mathcal{J}$. Our unsupervised approach even outperforms some of the semi-supervised methods that have access to the first frame annotations, VPN [17] and SegFlow [6], by more than 2% (numeric results for these two methods are listed in Tab. 5). Some qualitative segmentation results are shown in Fig. 5.

FBMS. We evaluate the proposed method on the test set, with 30 sequences in total. The results are shown in Tab. 2. Our method achieves an F-score of 82.8%: 5.0% higher than the second best method [37]. Our method's mean $\mathcal{J}$ is more than 10% better than ARP [20], which performs the second best on DAVIS.

Method               NLC [9]  CUT [18]  FST [32]  CVOS [35]  LVO [37]  MP [36]  ARP [20]  Ours
F-score              -        76.8      69.2      74.9       77.8      77.5     -         82.8
Mean $\mathcal{J}$   44.5     -         55.5      -          -         -        59.8      71.9
Table 2: The results on the test set of the FBMS dataset [30]. Our method achieves the highest scores in both evaluation metrics.

SegTrack-v2. We achieve a $\mathcal{J}$ of 59.3% on this dataset, which is higher than other methods that do well on DAVIS, LVO [37] (57.3%) and FST [32] (54.3%). Due to the low resolution of SegTrack-v2 and the fact that SegTrack-v2 videos can have multiple moving objects of the same class in the background, we are weaker than NLC [9] (67.2%) on this dataset.

4.4 Ablation studies

The effectiveness of instance embedding. To prove that instance embeddings are more effective than feature embeddings from semantic segmentation networks in our method, we compare against the features from DeepLab-v2 [5]. Replacing the instance embedding in our method with DeepLab fc7 features achieves 65.2% in $\mathcal{J}$, more than 10% lower than with the instance embedding features. The instance embedding feature vectors are therefore much better suited to linking objects over time and space than semantic segmentation feature vectors. The explicit pixelwise similarity loss (Eq. 2) used to train instance embeddings helps to produce more stable feature vectors than semantic segmentation.

Embedding temporal consistency and online adaptation. We analyze whether embeddings for an object are consistent over time. Given the embeddings for each pixel in every frame and the ground truth foreground masks, we determine how many foreground embeddings in later frames are closer to the first frame's background embeddings than to its foreground embeddings. If a foreground embedding from a later frame is closer to any background embedding in the first frame than to all foreground embeddings in the first frame, we call it an incorrectly classified foreground pixel. We plot the proportion of foreground pixels that are incorrectly classified as a function of relative time in the video since the first frame in Fig. 6. As time from the first frame increases, more foreground embeddings are incorrectly classified. This "embedding drift" problem is probably caused by changes in the appearance and location of foreground and background objects over the course of the video.

Figure 6: The proportion of incorrectly classified foreground embeddings versus relative timestep. As time progresses, more foreground embeddings are closer to first frame’s background than the foreground.

To overcome "embedding drift", we do online adaptation to update our sets of foreground and background seeds. Updating the seeds is much faster than fine-tuning a neural network to do the adaptation, as done in OnAVOS [40] with a heuristically selected set of examples. The effects of doing online adaptation every $k$ frames are detailed in Tab. 3. More frequent online adaptation results in better performance: per-frame online adaptation boosts $\mathcal{J}$ by 7.0% over non-adaptive seed sets from the first frame.

Adapt every k frames   k=1   k=5   k=10   k=20   k=40   k=$\infty$
Mean $\mathcal{J}$     77.5  76.4  75.9   75.0   73.6   70.5
Table 3: Segmentation performance versus online adaptation frequency. Experiments are conducted on the DAVIS train set. Note that k=$\infty$ means no online adaptation.

Foreground seed track ranking. In this section, we discuss some variants of foreground seed track ranking. In Eq. 10, the ranking is based on objectness as well as motion saliency. We analyze three variants: motion saliency alone, objectness alone, and objectness combined with motion saliency. The results are shown in Tab. 4. The experiments are conducted on the DAVIS train set. The initial FG seed accuracy (second row in Tab. 4) is evaluated as the proportion of the initial foreground seeds located within the ground truth foreground region. We see that combining motion saliency and objectness yields the best accuracy, outperforming "motion alone" and "objectness alone" by 4.0% and 6.4%, respectively. Final segmentation performance is consistent with the initial foreground seed accuracy, with the combined ranking outperforming "motion alone" and "objectness alone" by 3.2% and 1.8%, respectively. The advantage of combining motion and objectness is reported in several previous methods as well [6, 16, 37]. Interestingly, using objectness alone gives lower initial foreground seed accuracy but a higher mean $\mathcal{J}$ than motion alone. This is probably due to the different errors the two scores make. When errors in selecting foreground seeds occur in "motion only" mode, seeds representing "stuff" (sky, water, road, etc.) are more likely to be selected as the foreground, but when errors occur in "objectness only" mode, incorrect foreground seeds are usually located on static objects in the sequence. In the embedding space, static objects are usually closer to the target object than "stuff", so these errors are less costly.

                     Motion   Obj.   Motion + Obj.
Init. FG seed acc.   90.6     88.2   94.6
Mean                 74.3     75.7   77.5
Table 4: The segmentation performance versus foreground seed ranking strategy. Experiments are conducted on the DAVIS train set.
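As an illustration, combining the two cues can be sketched with a simple product; the exact combination used in Eq. 10 may differ, so this is a hedged sketch rather than the paper's formula. A track scores high only when it is both object-like and moving saliently.

```python
import numpy as np

def rank_fg_tracks(objectness, motion_saliency):
    """Rank candidate foreground seed tracks by a combined score.

    The elementwise product is one simple way to fuse the two cues:
    either cue being near zero suppresses the track.
    Returns track indices sorted from best to worst.
    """
    score = np.asarray(objectness) * np.asarray(motion_saliency)
    return np.argsort(-score)  # descending order of combined score
```

For example, a highly object-like but static track (objectness 0.9, motion 0.1) is ranked below a track that scores moderately on both cues, mirroring the failure modes of the single-cue variants discussed above.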

4.5 Semi-supervised video object segmentation

We extend the method to semi-supervised video object segmentation by selecting the foreground and background seeds based on the first-frame annotation. The seeds covered by the ground truth mask are added to the foreground set, and the rest are added to the background set. Then we apply Eqs. 13-15 to all embeddings of the sequence, and the results are further refined by a dense CRF. Experiment settings are detailed in the supplementary materials. As shown in Tab. 5, we achieve a mean of 77.6%, better than [6] and [17]. Note that there are further options for performance improvement, such as the motion/objectness analysis and online adaptation we experimented with in the unsupervised scenario; we leave these for future exploration.
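A minimal sketch of this seeding-and-classification pipeline is given below. The softmax over nearest-seed distances is an assumed stand-in for Eqs. 13-15, and the helper names are illustrative, not from the paper's code.

```python
import numpy as np

def split_seeds(seed_coords, seed_emb, gt_mask):
    """Assign frame-0 seeds to the FG or BG set using the ground truth mask.

    seed_coords: list of (row, col) seed locations
    seed_emb:    (M, D) embeddings of those seeds
    gt_mask:     boolean ground truth mask for frame 0
    """
    inside = np.array([gt_mask[r, c] for r, c in seed_coords], dtype=bool)
    return seed_emb[inside], seed_emb[~inside]

def fg_probability(emb, fg_set, bg_set):
    """Soft foreground score for each pixel embedding: a softmax over the
    distances to the nearest FG/BG seed (an assumed form, not Eqs. 13-15)."""
    d_fg = np.linalg.norm(emb[:, None] - fg_set[None], axis=-1).min(1)
    d_bg = np.linalg.norm(emb[:, None] - bg_set[None], axis=-1).min(1)
    return np.exp(-d_fg) / (np.exp(-d_fg) + np.exp(-d_bg))
```

The probabilities produced this way can then be fed to a dense CRF for refinement, as in the unsupervised setting.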

Method        DAVIS fine-tune?   Online fine-tune?   Mean
OnAVOS [40]   Yes                Yes                 86.1
OSVOS [4]     Yes                Yes                 79.8
SFL [6]       Yes                Yes                 76.1
MSK [19]      No                 Yes                 79.7
VPN [17]      No                 No                  70.2
Ours          No                 No                  77.6
Table 5: The results of semi-supervised video object segmentation on the DAVIS val set by adopting the proposed method.

5 Conclusion

In this paper, we propose a method to transfer instance embeddings learned from static images to unsupervised object segmentation in videos. To adapt to the changing foreground in video object segmentation, we train a network to produce embeddings that encapsulate instance information rather than a network that directly outputs a foreground/background score. From the instance embeddings, we identify representative foreground/background embeddings based on objectness and motion saliency. Pixels are then classified based on their embedding similarity to the foreground/background. Unlike many previous methods that must fine-tune on the target dataset, our method achieves state-of-the-art performance in the unsupervised video object segmentation setting without any fine-tuning, which saves a tremendous amount of labeling effort.


  • [1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
  • [2] M. Billinghurst, A. Clark, G. Lee, et al. A survey of augmented reality. Foundations and Trends® in Human–Computer Interaction, 8(2-3):73–272, 2015.
  • [3] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. Computer Vision–ECCV 2010, pages 282–295, 2010.
  • [4] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In CVPR 2017. IEEE, 2017.
  • [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
  • [6] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. Segflow: Joint learning for video object segmentation and optical flow. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [7] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1841–1848, 2013.
  • [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
  • [9] A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In BMVC, volume 2, page 8, 2014.
  • [10] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277, 2017.
  • [11] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph based video segmentation. IEEE CVPR, 2010.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • [14] Q. Huang, C. Xia, C. Wu, S. Li, Y. Wang, Y. Song, and C.-C. J. Kuo. Semantic segmentation with reverse attention. In British Machine Vision Conference, 2017.
  • [15] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. arXiv preprint arXiv:1612.01925, 2016.
  • [16] S. D. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [17] V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. arXiv preprint arXiv:1612.05478, 2016.
  • [18] M. Keuper, B. Andres, and T. Brox. Motion trajectory segmentation via minimum cost multicuts. In Proceedings of the IEEE International Conference on Computer Vision, pages 3271–3279, 2015.
  • [19] A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [20] Y. J. Koh and C.-S. Kim. Primary object segmentation in videos based on region augmentation and reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [21] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pages 109–117, 2011.
  • [22] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1995–2002. IEEE, 2011.
  • [23] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013.
  • [24] H.-D. Lin and D. G. Messerschmitt. Video composition methods and their semantics. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on, pages 2833–2836. IEEE, 1991.
  • [25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • [27] N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 743–751, 2016.
  • [28] A. Newell and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. CoRR, abs/1611.05424, 2016.
  • [29] P. Ochs and T. Brox. Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1583–1590. IEEE, 2011.
  • [30] P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE transactions on pattern analysis and machine intelligence, 36(6):1187–1200, 2014.
  • [32] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision, pages 1777–1784, 2013.
  • [33] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [34] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016.
  • [35] B. Taylor, V. Karasev, and S. Soatto. Causal video object segmentation from persistence of occlusions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4268–4276, 2015.
  • [36] P. Tokmakov, K. Alahari, and C. Schmid. Learning motion patterns in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [37] P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [38] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3899–3908, 2016.
  • [39] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
  • [40] P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In British Machine Vision Conference, 2017.
  • [41] S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
  • [42] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.

Appendix A Supplemental materials

a.1 Experiment Settings of Semi-supervised Video Object Segmentation

We extract the seeds for the first frame (frame 0) and form image regions, as described in Sec. 3.2 and Sec. 3.3, respectively. Then we compare the image regions with the ground truth mask $G$. For an image region $R$, if the ground truth mask covers more than a fraction $\lambda$ of $R$, i.e.,

$$\frac{|R \cap G|}{|R|} > \lambda,$$

where $|\cdot|$ denotes the area, the average embedding within the intersection $R \cap G$ is computed and added to the foreground embedding set. We set $\lambda$ to a fixed value.

For the background, if a region $R$ does not intersect with the ground truth mask $G$ at all, i.e.,

$$|R \cap G| = 0,$$

the average embedding in $R$ is added to the background embedding set. A visual illustration is shown in Fig. 7.
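The two assignment rules can be sketched as follows; `lam` is a placeholder for the paper's coverage threshold, whose exact value is not restated here.

```python
import numpy as np

def assign_region_embeddings(regions, emb, gt_mask, lam=0.5):
    """Build FG/BG embedding sets from first-frame regions.

    regions: list of boolean masks, one per image region
    emb:     (H, W, D) per-pixel embeddings for frame 0
    gt_mask: (H, W) boolean ground truth mask
    lam:     coverage threshold (placeholder value)
    """
    fg, bg = [], []
    for region in regions:
        inter = region & gt_mask
        if inter.sum() > lam * region.sum():
            # Mostly covered by ground truth: average over the overlap only.
            fg.append(emb[inter].mean(axis=0))
        elif inter.sum() == 0:
            # No overlap at all: average over the whole region.
            bg.append(emb[region].mean(axis=0))
    return fg, bg
```

Regions that partially overlap the ground truth without passing the threshold contribute to neither set, which keeps ambiguous boundary regions out of both references.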

Figure 7: Left: The image regions and the ground truth mask (in magenta) on frame 0. Center: The image regions (in magenta) whose average embeddings are used as foreground embeddings for the rest of frames. Right: The image regions (in blue) whose average embeddings are used as background embeddings for the rest of frames. Best viewed in color.

Then the foreground probability for a pixel on an arbitrary frame is obtained by Eqs. 13-15, and the results are further refined by a dense CRF with the same parameters as in the unsupervised scenario. We compare our results with multiple previous semi-supervised methods in Tab. 5 in the paper.

a.2 Per-video Results for DAVIS

The per-video results for the DAVIS train set and val set are listed in Tab. 6 and Tab. 7, respectively. The evaluation metric is the region similarity mentioned in the paper. Note that we used the train set for ablation studies (Sec. 4.4), where the masks were not refined by the dense CRF.

Sequence ARP [20] FSEG [16] Ours Ours + CRF
bear 0.92 0.907 0.935 0.952
bmx-bumps 0.459 0.328 0.431 0.494
boat 0.436 0.663 0.652 0.491
breakdance-flare 0.815 0.763 0.809 0.843
bus 0.849 0.825 0.848 0.842
car-turn 0.87 0.903 0.923 0.921
dance-jump 0.718 0.612 0.674 0.716
dog-agility 0.32 0.757 0.7 0.708
drift-turn 0.796 0.864 0.815 0.798
elephant 0.842 0.843 0.828 0.816
flamingo 0.812 0.757 0.633 0.679
hike 0.907 0.769 0.871 0.907
hockey 0.764 0.703 0.817 0.878
horsejump-low 0.769 0.711 0.821 0.832
kite-walk 0.599 0.489 0.598 0.641
lucia 0.868 0.773 0.863 0.91
mallard-fly 0.561 0.695 0.683 0.699
mallard-water 0.583 0.794 0.76 0.811
motocross-bumps 0.852 0.775 0.849 0.884
motorbike 0.736 0.407 0.685 0.708
paragliding 0.881 0.474 0.82 0.873
rhino 0.884 0.875 0.86 0.835
rollerblade 0.839 0.687 0.851 0.896
scooter-gray 0.705 0.733 0.686 0.655
soccerball 0.824 0.797 0.849 0.905
stroller 0.857 0.667 0.722 0.758
surf 0.939 0.881 0.87 0.902
swing 0.796 0.741 0.822 0.868
tennis 0.784 0.707 0.817 0.866
train 0.915 0.761 0.766 0.765
mean 0.763 0.722 0.775 0.795
Table 6: Per-video results on DAVIS train set. The region similarity is reported.
Sequence ARP [20] FSEG [16] Ours Ours + CRF
blackswan 0.881 0.812 0.715 0.786
bmx-trees 0.499 0.433 0.496 0.499
breakdance 0.762 0.512 0.485 0.555
camel 0.903 0.836 0.902 0.929
car-roundabout 0.816 0.907 0.88 0.88
car-shadow 0.736 0.896 0.929 0.93
cows 0.908 0.869 0.91 0.925
dance-twirl 0.798 0.704 0.781 0.797
dog 0.718 0.889 0.906 0.936
drift-chicane 0.797 0.596 0.76 0.775
drift-straight 0.715 0.811 0.884 0.857
goat 0.776 0.83 0.861 0.864
horsejump-high 0.838 0.652 0.794 0.851
kite-surf 0.591 0.392 0.569 0.647
libby 0.654 0.584 0.686 0.743
motocross-jump 0.823 0.775 0.754 0.778
paragliding-launch 0.601 0.571 0.595 0.624
parkour 0.828 0.76 0.852 0.909
scooter-black 0.746 0.688 0.727 0.74
soapbox 0.846 0.624 0.668 0.67
mean 0.762 0.707 0.758 0.785
Table 7: Per-video results on DAVIS val set. The region similarity is reported.

a.3 Instance Embedding Drifting

In Sec. 4.4 of the paper, we mentioned the “embedding drift” problem. Here we conduct another experiment to demonstrate that the embedding changes gradually with time. In this experiment, we extract foreground and background embeddings based on the ground truth masks for every frame. The embeddings from the first frame (frame 0) are used as references. We compute the average distance between the foreground/background embeddings from an arbitrary frame $t$ and the reference embeddings. Mathematically,

$$d^{f}_{t} = \frac{1}{|F_t|\,|F_0|}\sum_{p\in F_t}\sum_{q\in F_0}\lVert \mathbf{e}_p - \mathbf{e}_q\rVert_2, \qquad d^{b}_{t} = \frac{1}{|B_t|\,|B_0|}\sum_{p\in B_t}\sum_{q\in B_0}\lVert \mathbf{e}_p - \mathbf{e}_q\rVert_2,$$

where $F_t$ and $B_t$ denote the ground truth foreground and background regions on frame $t$, respectively, $\mathbf{e}_p$ denotes the embedding corresponding to pixel $p$, and $d^{f}_{t}$/$d^{b}_{t}$ represent the foreground/background embedding distance between frame $t$ and frame 0. Then we average $d^{f}_{t}$ and $d^{b}_{t}$ across sequences and plot them against the relative timestep in Fig. 8. The embedding distance increases over time: both objects and background become less similar to their appearance on frame 0, which supports the necessity of online adaptation.
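Assuming the drift measure is the mean pairwise Euclidean distance between the region embeddings of frame t and the frame-0 references, it can be computed as:

```python
import numpy as np

def embedding_drift(emb_t, emb_0, mask_t, mask_0):
    """Mean pairwise distance between region embeddings on frame t and
    the frame-0 reference embeddings for the same region.

    emb_t, emb_0:   (H, W, D) per-pixel embeddings
    mask_t, mask_0: (H, W) boolean region masks (FG or BG)
    """
    e_t = emb_t[mask_t]  # (N_t, D) embeddings inside the region at frame t
    e_0 = emb_0[mask_0]  # (N_0, D) reference embeddings from frame 0
    # All pairwise distances via broadcasting, then average.
    d = np.linalg.norm(e_t[:, None, :] - e_0[None, :, :], axis=-1)
    return d.mean()
```

Running this once with the foreground masks and once with the background masks for each frame t yields the two curves plotted in Fig. 8.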

Figure 8: The FG/BG distance between later frames and frame 0. Both FG/BG embeddings become farther from their reference embedding on frame 0.

a.4 More visual examples

We provide more visual examples for the DAVIS dataset [33] and the FBMS dataset [30] in Fig. 9 and Fig. 10, respectively. Furthermore, the results for all annotated frames in DAVIS and FBMS are included in the folders named “DAVIS” and “FBMS” submitted together with this document. The frames are resized due to the file size limit.

Figure 9: Visual examples from the DAVIS dataset. The “camel” sequence (first row) is mentioned as an example where the static camel (the one not covered by our predicted mask) acts as a hard negative because it is semantically similar to the foreground while belonging to the background. Our method correctly identifies it as background from motion saliency. The last three rows show some failure cases. In the “stroller” sequence (third from last), our method fails to include the stroller for some frames. In the “bmx-bumps” sequence (second from last), when the foreground, namely the rider and the bike, is totally occluded, our method wrongly identifies the occluder as foreground. The “flamingo” sequence (last row) illustrates a situation similar to the “camel” sequence, where the proposed method does less well due to imperfect optical flow (the foreground mask should include only the flamingo located in the center of each frame). Best viewed in color.
Figure 10: Visual examples from the FBMS dataset. The last two rows show some failure cases. In the “rabbits04” sequence (second from last), the foreground is wrongly identified when the rabbit is wholly occluded. In the “marple6” sequence (last row), the foreground should include two people, but our method fails on some frames because one of them exhibits low motion saliency. Best viewed in color.