Instance Embedding Transfer to Unsupervised Video Object Segmentation
We propose a method for unsupervised video object segmentation by transferring the knowledge encapsulated in image-based instance embedding networks. The instance embedding network produces an embedding vector for each pixel that enables identifying all pixels belonging to the same object. Though trained on static images, the instance embeddings are stable over consecutive video frames, which allows us to link objects together over time. Thus, we adapt the instance networks trained on static images to video object segmentation and incorporate the embeddings with objectness and optical flow features, without model retraining or online fine-tuning. The proposed method outperforms state-of-the-art unsupervised segmentation methods on the DAVIS dataset and the FBMS dataset.
1 Introduction
One important task in video understanding is object localization in time and space. Ideally, a method should localize familiar or novel objects consistently over time with a sharp object mask, a task known as video object segmentation (VOS). If no indication of which object to segment is given, the task is known as unsupervised video object segmentation or primary object segmentation. Once an object is segmented, visual effects and video understanding tools can leverage that information [2, 24].
Related object segmentation tasks in static images are currently dominated by methods based on the fully convolutional neural network (FCN) [5, 26]. These neural networks require large datasets of segmented object images such as PASCAL [8] and COCO [25]. Video segmentation datasets are smaller because they are more expensive to annotate [23, 30, 33]. As a result, it is more difficult to train a neural network explicitly for video segmentation. Classic work in video segmentation produced results using optical flow and shallow appearance models [20, 29, 22, 32, 11], while more recent methods typically pretrain the network on image segmentation datasets and later adapt it to the video domain, sometimes combined with optical flow [4, 36, 6, 40, 37, 16].
In this paper, we propose a method to transfer the knowledge encapsulated in instance segmentation embeddings learned from static images and to integrate it with objectness and optical flow to segment a moving object in video. Instead of training an FCN that directly classifies each pixel as foreground/background as in [37, 16, 6, 36], we train an FCN that jointly learns object instance embeddings and semantic categories from images [10]. The distance between the learned embeddings encodes the similarity between pixels. We argue that the instance embedding is a more useful feature to transfer from images to videos than a foreground/background prediction. As shown in Fig. 1, cars appear in both videos but belong to different categories (foreground in the first video and background in the second video). If the network is trained to directly classify cars as foreground on the first video, it tends to classify the cars as foreground in the second video as well. As a result, the network needs to be fine-tuned for each sequence. In contrast, the instance embedding network can produce unique embeddings for the cars in both sequences without interfering with other predictions or requiring fine-tuning. The task then becomes selecting the correct embeddings to use as an appearance model. Relying on the embeddings to encode object instance information, we propose a method to identify the representative embeddings for the foreground (target object) and the background based on objectness scores and optical flow. Visual examples of the representative embeddings are displayed in the middle column of Fig. 1. Finally, all pixels are classified by finding the nearest neighbor in a set of representative foreground or background embeddings. This is a non-parametric process requiring no video-specific supervision for training or testing.
We evaluate the proposed method on the DAVIS dataset [33] and the FBMS dataset [30]. Without fine-tuning the embedding network on the target datasets, we obtain better performance than previous state-of-the-art methods. More specifically, we achieve a mean intersection-over-union (IoU) of 78.5% and 71.9% on the DAVIS and FBMS datasets, respectively.
To summarize, our main contributions include:
- A new strategy for adapting instance segmentation models trained on static images to videos. Notably, this strategy performs well on video datasets without requiring any video object segmentation annotations.
- This strategy outperforms previously published unsupervised methods on both the DAVIS and FBMS benchmarks and approaches the performance of semi-supervised CNNs, without retraining any networks at test time.
- Novel criteria for selecting a foreground object without supervision, based on semantic score and motion features aggregated over a track.
- Insights into the stability of instance segmentation embeddings over time.
2 Related Work
Unsupervised video object segmentation. Unsupervised video object segmentation discovers the most salient, or primary, objects that move against a video’s background or display different color statistics. One set of methods to solve this task builds hierarchical clusters of pixels that may delineate objects. Another set of methods performs binary segmentation of foreground and background. Early foreground segmentation methods often used Gaussian Mixture Models and Graph Cut [27, 38], but more recent work uses convolutional neural networks (CNNs) to identify foreground pixels based on saliency, edges, and/or motion [36, 37, 16]. For example, LMP [36] trains a network that takes optical flow as input to separate moving and non-moving regions and then combines the results with objectness cues from SharpMask [34] to generate the moving object segmentation. LVO [37] trains a two-stream network, using RGB appearance features and optical flow motion features that feed into a ConvGRU layer to generate the final prediction. FSEG [16] also proposes a two-stream network, trained with mined supplemental data. SfMNet [39] uses differentiable rendering to learn object masks and motion models without mask annotations. Despite the risk of focusing on the wrong object, unsupervised methods can be deployed in more places because they do not require user interaction to specify an object to segment. Since we are interested in methods requiring no user interaction, we choose to focus on unsupervised segmentation.
Semi-supervised video object segmentation. Semi-supervised video object segmentation utilizes human annotations on the first frame of a video (or more) indicating which object the system should track. Importantly, the annotation provides a very good appearance model initialization that unsupervised methods lack. The problem can be formulated as either a binary segmentation task conditioned on the annotated frame or a mask propagation task between frames. Non-CNN methods typically rely on Graph Cut [27, 38], but CNN-based methods offer better accuracy [19, 4, 40, 6]. Mask propagation CNNs take in the previous mask prediction and a new frame to propose a segmentation in the new frame. VPN [17] trains a bilateral network to propagate to new frames. MSK [19] trains a propagation network with synthetic transformations of still images and applies the same technique for online fine-tuning. SegFlow [6] finds that jointly learning moving object masks and optical flow helps to boost segmentation performance. Binary segmentation CNNs typically utilize the first frame for fine-tuning the network to a specific sequence. The exact method for fine-tuning varies: OSVOS [4] simply fine-tunes on the first frame. OnAVOS [40] fine-tunes on the first frame and a subset of predictions from future frames. Fine-tuning can take seconds to minutes, and longer fine-tuning typically results in better segmentation. Avoiding the time cost of fine-tuning is a further inducement to focus on unsupervised methods.
Image segmentation. Many video object segmentation methods [40, 4, 19] are built upon image semantic segmentation neural networks [26, 5, 14], which predict a category label for each pixel. These fully convolutional networks allow end-to-end training on images of arbitrary sizes. Semantic segmentation networks do not distinguish different instances from the same object category, which limits their suitability to video object segmentation. Instance segmentation networks [10, 28, 12] can label each instance uniquely. Instance embedding methods [10, 28] provide an embedding space where pixels belonging to the same instance have similar embeddings. Spatial variations in the embeddings indicate the edges of object masks. Relevant details are given in Sec. 3.1. It was unknown if instance embeddings are stable over time in videos, but we hypothesized that these embeddings might be useful for video object segmentation.
3 Proposed Method
An overview of the proposed method is depicted in Fig. 2. We first obtain instance embeddings, objectness scores, and optical flow that we will use as inputs (Sec. 3.1). Based on the instance embeddings, we identify “seed” points that mark segmentation proposals (Sec. 3.2). Consistent proposal seeds are linked to build seed tracks, and we rank the seed tracks by objectness scores and motion saliency to select a foreground proposal seed on every frame (Sec. 3.3). We further build a set of foreground/background proposal seeds to produce the final segmentation mask in each frame (Sec. 3.4).
3.1 Extracting features
Our method utilizes three inputs: instance embeddings, objectness scores, and optical flow. None of these features are fine-tuned on video object segmentation datasets or fine-tuned online to specific sequences. The features are extracted for each frame independently as follows.
Instance embedding and objectness. We train a network to output instance embeddings and semantic categories on the image instance segmentation task, as in [10]. Briefly, the instance embedding network is a dense-output convolutional neural network with two output heads, trained on static images from an instance segmentation dataset.
The first head outputs an embedding $f_i \in \mathbb{R}^d$ for each pixel $i$, where pixels from the same object instance have smaller Euclidean distances between them than pixels belonging to separate objects. Similarity between two pixels $i$ and $j$ is measured as a function of the Euclidean distance in the $d$-dimensional embedding space:

$$\sigma(i,j) = \frac{2}{1 + \exp\left(\| f_i - f_j \|_2^2\right)}. \tag{1}$$

This head is trained by minimizing the cross entropy between the similarity and the ground truth matching indicator $y_{ij}$. For locations $i$ and $j$, the ground truth matching indicator $y_{ij} = 1$ if the pixels belong to the same instance and $y_{ij} = 0$ otherwise. The loss is given by

$$\mathcal{L}_e = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} w_{ij} \left[ y_{ij} \log \sigma(i,j) + (1 - y_{ij}) \log\left(1 - \sigma(i,j)\right) \right], \tag{2}$$

where $\mathcal{P}$ is a set of pixel pairs, $\sigma(i,j)$ is the similarity between pixels $i$ and $j$ in the embedding space, and $w_{ij}$ is inversely proportional to instance size to balance training.
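For concreteness, a minimal NumPy sketch of Eq. 1 and Eq. 2 (the function and argument names are ours, not from any released code):

```python
import numpy as np

def similarity(f_i, f_j):
    """Eq. 1: similarity as a function of squared Euclidean distance.

    f_i, f_j: (..., d) embedding vectors; returns values in (0, 1].
    """
    d2 = np.sum((f_i - f_j) ** 2, axis=-1)
    return 2.0 / (1.0 + np.exp(d2))

def embedding_loss(f, pairs, y, w):
    """Eq. 2: weighted cross entropy over sampled pixel pairs.

    f: (n, d) pixel embeddings; pairs: (m, 2) integer index pairs (i, j);
    y: (m,) 1 if a pair lies on the same instance, else 0;
    w: (m,) weights, inversely proportional to instance size.
    """
    s = similarity(f[pairs[:, 0]], f[pairs[:, 1]])
    eps = 1e-7  # numerical safety for the logarithms
    ce = -(y * np.log(s + eps) + (1.0 - y) * np.log(1.0 - s + eps))
    return np.mean(w * ce)
```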
The second head outputs an objectness score derived from semantic segmentation. We minimize a semantic segmentation log loss to train this head to output a semantic category probability for each pixel. The objectness map is derived from the semantic prediction as

$$O(i) = 1 - p_{bg}(i), \tag{3}$$

where $p_{bg}(i)$ is the probability that pixel $i$ belongs to the semantic category “background” (here, in semantic segmentation, “background” refers to the region that does not belong to any category of interest, as opposed to video object segmentation, where the “background” is the region other than the target object; we use “background” in the video object segmentation sense for the rest of the paper). We do not use the scores for any class other than the background in our work.
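As a sketch, assuming the softmax output places the semantic “background” class in channel 0 (an assumption of ours, not stated by the paper):

```python
import numpy as np

def objectness(semantic_probs):
    """Eq. 3: objectness is one minus the 'background' probability.

    semantic_probs: (H, W, C) per-pixel softmax output; channel 0 is
    assumed to be the semantic 'background' class.
    """
    return 1.0 - semantic_probs[..., 0]
```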
Embedding graph. We build a 4-neighbor graph from the dense embedding map, where each embedding vector becomes a vertex and edges exist between spatially neighboring embeddings with weights equal to the Euclidean distance between embedding vectors. This embedding graph will be used to generate image regions later. A visualized embedding graph is shown in Fig. 3.
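The edge weights can be computed directly from the dense embedding map; a small sketch (our own formulation of the 4-neighbor weights):

```python
import numpy as np

def embedding_graph_edges(emb):
    """4-neighbor embedding graph: one vertex per pixel; each edge weight
    is the Euclidean distance between neighboring embedding vectors.

    emb: (H, W, d). Returns horizontal (H, W-1) and vertical (H-1, W)
    edge-weight maps.
    """
    w_h = np.linalg.norm(emb[:, 1:] - emb[:, :-1], axis=-1)
    w_v = np.linalg.norm(emb[1:, :] - emb[:-1, :], axis=-1)
    return w_h, w_v
```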
Optical flow. The motion saliency cues are built upon optical flow. For fast optical flow estimation at good precision, we utilize a reimplementation of FlowNet 2.0 [15], an iterative neural network.
3.2 Generating proposal seeds
We propose a small number of representative seed points $\mathcal{S}^t$ in frame $t$, for some subset of frames (typically all) in the video. Most computations only compare against seeds within the current frame, so the superscript $t$ is omitted for clarity unless the computation spans multiple frames. The seeds we consider as FG or BG should be diverse in embedding space because the segmentation target can be a moving object from an arbitrary category. In a set of diverse seeds, at least one seed should belong to the FG region. We also need at least one BG seed because distances in the embedding space are relative. The relative distances in embedding space, or the similarity from Eq. 1, from each point to the FG and BG seed(s) can then be used to assign a label to every pixel.
Candidate points. In addition to being diverse, the seeds should be representative of objects. The embeddings on the boundary of two objects are usually not close to the embedding of either object. Because we want embeddings representative of objects, we exclude seeds from object boundaries. To avoid object boundaries, we only select seeds from candidate points where the instance embeddings are locally consistent. (An alternative method to identify the boundaries to avoid would be to use an edge detector such as [7, 41].) We construct a map of embedding edges by mapping discontinuities in the embedding space. The embedding edge map is defined as the “inverse” similarity in the embedding space within the neighbors around each pixel,

$$E(i) = \max_{j \in \mathcal{N}(i)} \left[ 1 - \sigma(i,j) \right], \tag{4}$$

where $i$ and $j$ are pixel locations, $\mathcal{N}(i)$ contains the four neighbors of $i$, and $\sigma(\cdot,\cdot)$ is the similarity measure given in Eq. 1. Then, in the edge map, we identify the pixels whose values are the minimum within an $h \times h$ window centered on themselves. These pixels form the candidate set $\mathcal{C}$. Mathematically,

$$\mathcal{C} = \left\{ i \;\middle|\; E(i) = \min_{j \in \mathcal{W}(i)} E(j) \right\}, \tag{5}$$

where $\mathcal{W}(i)$ denotes the local $h \times h$ window centered at $i$.
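A compact sketch of Eqs. 4–5, taking the maximum “inverse” similarity over the four neighbors (our reading of the definition) and using a minimum filter for the local-window test; the np.roll border wrap-around is a simplification:

```python
import numpy as np
from scipy.ndimage import minimum_filter

def candidate_points(emb, h=9):
    """Embedding edge map (Eq. 4) and candidate set (Eq. 5).

    emb: (H, W, d) instance embeddings; h: local window size.
    Returns the edge map and a list of (y, x) candidate locations.
    """
    edge = np.zeros(emb.shape[:2])
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        nbr = np.roll(emb, (dy, dx), axis=(0, 1))  # wraps at borders
        d2 = np.sum((emb - nbr) ** 2, axis=-1)
        edge = np.maximum(edge, 1.0 - 2.0 / (1.0 + np.exp(d2)))
    # a pixel is a candidate if it is the minimum of its h x h window
    is_min = edge <= minimum_filter(edge, size=h)
    ys, xs = np.nonzero(is_min)
    return edge, list(zip(ys, xs))
```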
Diverse seeds. These candidate points, $\mathcal{C}$, are diverse but still redundant with one another. We take a diverse subset of these candidates as seeds by adopting the sampling procedure from KMeans++ initialization [1]. We only need diverse sampling rather than cluster assignments, so we do not perform the time-consuming KMeans step afterwards. The sampling procedure begins by adding the candidate point with the largest objectness score to the seed set, $\mathcal{S}$. Sampling continues by iteratively adding the candidate with the smallest maximum similarity to all previously selected seeds, and stops when we reach $n_s$ seeds:

$$\mathcal{S} \leftarrow \mathcal{S} \cup \left\{ \arg\min_{c \in \mathcal{C} \setminus \mathcal{S}} \max_{s \in \mathcal{S}} \sigma(c, s) \right\}. \tag{6}$$

We repeat this procedure to produce the seeds for each frame independently, forming a seed set $\mathcal{S}^t$ per frame. Note that this sampling strategy differs from [10], where a weighted sum of the embedding distance and semantic scores is considered. We do not consider the semantic scores because we want representative embeddings for all regions of the current frame, including the background, while in [10] the background is disregarded. Fig. 3 shows one example of the visualized embedding edge map, the corresponding candidate set, and the selected seeds.
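A sketch of the sampling loop in Eq. 6 (names and the dense pairwise-similarity precomputation are our choices):

```python
import numpy as np

def sample_diverse_seeds(cand_emb, cand_obj, n_s):
    """KMeans++-style diverse sampling (Eq. 6).

    cand_emb: (m, d) candidate embeddings; cand_obj: (m,) objectness.
    Starts from the highest-objectness candidate, then repeatedly adds
    the candidate whose maximum similarity to the chosen set is smallest.
    """
    assert n_s <= len(cand_emb)
    d2 = np.sum((cand_emb[:, None] - cand_emb[None]) ** 2, axis=-1)
    sim = 2.0 / (1.0 + np.exp(d2))          # (m, m) pairwise Eq. 1
    chosen = [int(np.argmax(cand_obj))]
    max_sim = sim[chosen[0]].copy()         # max similarity to chosen set
    while len(chosen) < n_s:
        nxt = int(np.argmin(max_sim))       # least similar to all chosen
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, sim[nxt])
    return chosen
```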
3.3 Ranking proposed seeds
In the unsupervised video object segmentation problem, we do not have an explicitly specified target object. Therefore, we need to identify a moving object as the segmentation target (i.e., FG). We first score the seeds based on objectness and motion saliency. To find the most promising seed for FG, we then build seed tracks by grouping embedding-consistent seeds across frames and aggregate scores along each track. The objectness score for each seed is exactly $O(\cdot)$ from Eq. 3. The motion saliency, as well as seed track construction and ranking, are explained below.
Motion saliency. Differences in optical flow can separate objects moving against a background. Because optical flow estimation is still imperfect, we average flow within image regions rather than using the flow from a single pixel. The region corresponding to each seed consists of the pixels in the embedding graph from Sec. 3.1 with the shortest geodesic distance to that seed. For each seed $s_k$, we use the average optical flow in the corresponding region as its motion vector $v_k$. An example of image regions is shown in Fig. 4.
Then we construct a model of the background. First, the $n_b$ seeds with the lowest objectness scores are selected as the initial background seeds, denoted by $\mathcal{B}_0$. The set of motion vectors associated with these seeds forms our background motion model, $\mathcal{V}_{bg} = \{ v_s \mid s \in \mathcal{B}_0 \}$. The motion saliency for each seed, $m_k$, is the normalized distance to the nearest background motion vector,

$$m_k = \frac{1}{Z} \min_{v \in \mathcal{V}_{bg}} \| v_k - v \|_2, \tag{7}$$

where the normalization factor is given by

$$Z = \max_k \min_{v \in \mathcal{V}_{bg}} \| v_k - v \|_2. \tag{8}$$
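A sketch of Eqs. 7–8, with per-seed region-averaged flow assumed precomputed (the max-distance normalization follows our reading of Eq. 8):

```python
import numpy as np

def motion_saliency(seed_flow, obj_scores, n_b):
    """Eqs. 7-8: normalized distance to a background motion model.

    seed_flow: (n, 2) region-averaged optical flow per seed;
    obj_scores: (n,) objectness; the n_b lowest-objectness seeds define
    the background motion vector set V_bg.
    """
    bg = seed_flow[np.argsort(obj_scores)[:n_b]]          # (n_b, 2)
    d = np.linalg.norm(seed_flow[:, None] - bg[None], axis=-1).min(axis=1)
    return d / max(d.max(), 1e-8)                         # normalize to [0, 1]
```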
There are other approaches to deriving motion saliency from optical flow. For example, motion edges can be obtained by applying an edge detector to the optical flow field, and the motion saliency of a region can then be computed as a function of motion edge intensity. The motion saliency proposed in this work is more efficient and works well in terms of the final segmentation performance.
Seed tracks. Another property of the foreground object is that it should be a salient object in multiple frames. We score this by linking similar seeds together across frames into a seed track and taking the average product of objectness and motion saliency scores over each track. The $k$-th seed on frame 0, $s_k^0$, initiates a seed track, $\mathcal{T}_k$. $\mathcal{T}_k$ is extended frame by frame by adding the seed with the highest similarity to its most recent member. Specifically, supposing that we have a track $\mathcal{T}_k^t$ across frames 0-$t$, it is extended to frame $t+1$ by adding the most similar seed on frame $t+1$ to $s_k^t$, forming $\mathcal{T}_k^{t+1}$:

$$s_k^{t+1} = \arg\max_{s \in \mathcal{S}^{t+1}} \sigma(s, s_k^t), \tag{9}$$

where $\sigma(\cdot,\cdot)$ is the similarity measure given by Eq. 1, and $s_k^{t+1}$ is the seed in frame $t+1$ with the highest similarity to $s_k^t$. Then we have $\mathcal{T}_k^{t+1} = \mathcal{T}_k^t \cup \{ s_k^{t+1} \}$. Eventually, we have $\mathcal{T}_k$ starting from $s_k^0$ and ending at some seed on the last frame. The foreground score for $\mathcal{T}_k$ is

$$F(\mathcal{T}_k) = \frac{1}{|\mathcal{T}_k|} \sum_{s \in \mathcal{T}_k} O(s)\, m(s), \tag{10}$$

where $|\mathcal{T}_k|$ is the size of the seed track, equal to the sequence length.
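A sketch of the greedy linking and scoring in Eqs. 9–10 (the data layout, per-frame lists of seed embeddings and scores, is our choice):

```python
import numpy as np

def score_seed_tracks(seed_embs, obj, mot):
    """Eqs. 9-10: link each frame-0 seed forward by maximum similarity,
    then average objectness * motion saliency along the track.

    seed_embs: list over frames of (n_t, d) arrays;
    obj, mot: matching lists of per-seed score arrays.
    """
    T = len(seed_embs)
    scores = np.zeros(seed_embs[0].shape[0])
    for k in range(len(scores)):
        cur = seed_embs[0][k]
        total = obj[0][k] * mot[0][k]
        for t in range(1, T):
            d2 = np.sum((seed_embs[t] - cur) ** 2, axis=-1)
            j = int(np.argmin(d2))   # max similarity = min embedding distance
            total += obj[t][j] * mot[t][j]
            cur = seed_embs[t][j]
        scores[k] = total / T        # Eq. 10: average over the track
    return scores
```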
3.4 Segmenting a foreground proposal
Initial foreground segmentation. The seed track with the highest foreground score is selected on each frame to provide an initial foreground seed, denoted by $s_{fg}$. We obtain an initial foreground segmentation by identifying pixels close to the foreground seed in the embedding graph explained in Sec. 3.1. Here the distance, denoted by $d_g(i,j)$, between any two nodes $i$ and $j$ is defined as the maximum edge weight along the shortest geodesic path connecting them. We again take the $n_b$ seeds with the lowest objectness scores as the initial background seed set, $\mathcal{B}_0$. Then the initial foreground region is composed of the pixels closer to the foreground seed than to any background seed,

$$R_{fg} = \left\{ i \;\middle|\; d_g(i, s_{fg}) < \min_{s \in \mathcal{B}_0} d_g(i, s) \right\}. \tag{11}$$
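The minimax distance $d_g$ can be computed with a Dijkstra-style sweep in which a path’s cost is the maximum edge weight along it; a sketch on the 4-neighbor graph from Sec. 3.1 (the edge-weight layout matches the earlier `embedding_graph_edges` sketch):

```python
import heapq
import numpy as np

def minimax_distance(w_h, w_v, sources):
    """d_g(i, s): minimum over paths of the maximum edge weight, from any
    source to every pixel, via a Dijkstra-style sweep.

    w_h: (H, W-1) horizontal edge weights; w_v: (H-1, W) vertical weights;
    sources: iterable of (y, x) seed locations.
    """
    H, W = w_v.shape[0] + 1, w_h.shape[1] + 1
    dist = np.full((H, W), np.inf)
    heap = []
    for y, x in sources:
        dist[y, x] = 0.0
        heapq.heappush(heap, (0.0, y, x))
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue  # stale heap entry
        nbrs = []
        if x + 1 < W: nbrs.append((y, x + 1, w_h[y, x]))
        if x > 0:     nbrs.append((y, x - 1, w_h[y, x - 1]))
        if y + 1 < H: nbrs.append((y + 1, x, w_v[y, x]))
        if y > 0:     nbrs.append((y - 1, x, w_v[y - 1, x]))
        for ny, nx, w in nbrs:
            nd = max(d, w)  # path cost = maximum edge along the path
            if nd < dist[ny, nx]:
                dist[ny, nx] = nd
                heapq.heappush(heap, (nd, ny, nx))
    return dist
```

Eq. 11 then reduces to comparing two such distance maps, e.g. `minimax_distance(w_h, w_v, [s_fg]) < minimax_distance(w_h, w_v, bg_seeds)`.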
Adding foreground seeds. Often, selecting a single foreground seed is overly conservative and the initial segmentation fails to cover an entire object. To expand the foreground segmentation, we create a set of foreground seeds, $\mathcal{F}$, from the combination of $s_{fg}$ and the seeds marking image regions mostly covered by the initial foreground segmentation. These image regions are the ones described in Sec. 3.3 and Fig. 4. If more than a proportion $\rho$ of a region intersects with the initial foreground segmentation, the corresponding seed is added to $\mathcal{F}$.
Adding background seeds. The background contains two types of regions: non-object regions (such as sky, road, water, etc.) that have low objectness scores, and static objects, which are hard negatives because, in the embedding space, they are often closer to the foreground than the non-object background is. Static objects are particularly challenging when they belong to the same semantic category as the foreground object (e.g., the “camel” sequence in DAVIS, shown in the supplementary materials). We expand our representation of the BG regions by taking the union of seeds with objectness scores less than a threshold $\tau_o$ and seeds with motion saliency scores less than a threshold $\tau_m$:

$$\mathcal{B} = \{ s \mid O(s) < \tau_o \} \cup \{ s \mid m(s) < \tau_m \}. \tag{12}$$
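In code, Eq. 12 is a one-line filter over the per-seed scores (a sketch; threshold names follow the notation above):

```python
def expand_background_seeds(obj, mot, tau_o, tau_m):
    """Eq. 12: union of low-objectness seeds (non-object regions) and
    low-motion-saliency seeds (static objects, the hard negatives)."""
    return [k for k in range(len(obj)) if obj[k] < tau_o or mot[k] < tau_m]
```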
Final segmentation. Once the foreground and background sets are established, the similarity to the nearest foreground and background seeds is computed for each pixel. It is possible to use the foreground and background sets from one frame, $t_1$, to segment another frame, $t_2$, for FG/BG similarity computation:

$$\sigma_{fg}(i^{t_2}) = \max_{s \in \mathcal{F}^{t_1}} \sigma(i^{t_2}, s), \tag{13}$$

$$\sigma_{bg}(i^{t_2}) = \max_{s \in \mathcal{B}^{t_1}} \sigma(i^{t_2}, s), \tag{14}$$

where pixels on the target frame are denoted by $i^{t_2}$. Instead of directly propagating the foreground or background label from the most similar seed to the embedding, we obtain a soft score as the confidence of the embedding being foreground:

$$P_{fg}(i^{t_2}) = \frac{\sigma_{fg}(i^{t_2})}{\sigma_{fg}(i^{t_2}) + \sigma_{bg}(i^{t_2})}. \tag{15}$$
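A dense sketch of Eqs. 13–15 over a full embedding map (our vectorized formulation; intermediate memory scales with H×W times the number of seeds):

```python
import numpy as np

def soft_foreground_score(emb, fg_embs, bg_embs):
    """Eqs. 13-15: per-pixel foreground confidence from nearest-seed
    similarities. emb: (H, W, d); fg_embs: (nf, d); bg_embs: (nb, d).
    """
    def max_sim(seeds):
        d2 = np.sum((emb[:, :, None] - seeds[None, None]) ** 2, axis=-1)
        return (2.0 / (1.0 + np.exp(d2))).max(axis=-1)  # (H, W)
    s_fg, s_bg = max_sim(fg_embs), max_sim(bg_embs)
    return s_fg / (s_fg + s_bg)
```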
Online adaptation. Online adaptation of our method is straightforward: we simply generate new sets of foreground and background seeds on later frames. This is much less expensive than fine-tuning an FCN for adaptation, as done in [40]. Though updating the foreground and background sets could result in segmenting different objects in different frames, it improves the results in general, as discussed in Sec. 4.4.
4 Experiments

| Method | NLC | CUT | FST | SFL | LVO | MP | FSEG | ARP | Ours |
| Fine-tune on DAVIS? | No | Yes | No | Yes | Yes | No | No | No | No |

Table 1: Comparison with unsupervised video object segmentation methods on DAVIS.
4.1 Datasets and evaluation
We evaluate the proposed method on the DAVIS dataset [33], the Freiburg-Berkeley Motion Segmentation (FBMS) dataset [30], and the SegTrack-v2 dataset [23]. Note that neither the embedding network nor the optical flow network has been trained on these datasets.
DAVIS. The DAVIS 2016 dataset [33] is a recently constructed dataset containing 50 video sequences in total, with 30 in the train set and 20 in the val set. It provides binary segmentation ground truth masks for all 3455 frames. The dataset contains challenging videos featuring object deformation, occlusion, and motion blur. The “target object” may consist of multiple objects that move together, e.g., a bike with its rider. To evaluate our method, we adopt the protocols in [33], which include region similarity and boundary accuracy, denoted by $\mathcal{J}$ and $\mathcal{F}$, respectively. $\mathcal{J}$ is computed as the intersection over union (IoU) between the segmentation results and the ground truth. $\mathcal{F}$ is the harmonic mean of boundary precision and recall.
FBMS. The FBMS dataset [30] contains 59 video sequences with 720 annotated frames. In contrast to DAVIS, multiple moving objects are annotated separately in FBMS. We convert the instance-level annotations to binary masks by merging all foreground annotations, following prior work. The evaluation metrics include the F-score protocol proposed in [30] as well as the $\mathcal{J}$ measure used for DAVIS.
SegTrack-v2. The SegTrack-v2 dataset [23] contains 14 videos with a total of 976 frames. Annotations of individual moving objects are provided for all frames. As with FBMS, the union of the object masks is converted to a binary mask for unsupervised video object segmentation evaluation. For this dataset, we only use $\mathcal{J}$ for evaluation, to be consistent with previous work.
4.2 Implementation details
We use the image instance segmentation network trained on the PASCAL VOC 2012 dataset [8] to extract the instance embeddings and objectness. The instance segmentation network is based on DeepLab-v2 [5] with a ResNet [13] backbone. We use the stabilized optical flow from a reimplementation of FlowNet 2.0 [15]. The dimension of the embedding vector, $d$, is 64. The window size $h$ used to identify the candidate set is set to 9 for DAVIS/FBMS and 5 for SegTrack-v2. For frames in the DAVIS dataset, the 9x9 window results in approximately 200 candidates in the embedding edge map, from which we select the $n_s$ seeds. The number of selected seeds, $n_s$, and the initial number of background seeds, $n_b$, are fixed and tuned together with the other parameters. To add FG seeds as in Sec. 3.4, $\rho$ is set to 0.5. The thresholds for BG seed selection are $\tau_o$ and $\tau_m$. The CRF parameters are identical to the ones in DeepLab [5] (used for the PASCAL dataset). We first tuned all of these parameters on the DAVIS train set, where our mean $\mathcal{J}$ was 77.5%. We updated the window size for SegTrack-v2 empirically, considering the video resolution.
4.3 Comparing to the state-of-the-art
DAVIS. As shown in Tab. 1, we obtain the best performance for unsupervised video object segmentation: 2.3% higher than the second best and 2.6% higher than the third best. Our unsupervised approach even outperforms some of the semi-supervised methods that have access to the first-frame annotations, VPN [17] and SegFlow [6], by more than 2% (numeric results for these two methods are listed in Tab. 5). Some qualitative segmentation results are shown in Fig. 5.
FBMS. We evaluate the proposed method on the test set, with 30 sequences in total. The results are shown in Tab. 2. Our method achieves an F-score of 82.8%, 5.0% higher than the second best method. Our method’s mean $\mathcal{J}$ is more than 10% better than that of ARP [20], which performs second best on DAVIS.
| Method | NLC | CUT | FST | CVOS | LVO | MP | ARP | Ours |

Table 2: Comparison on the FBMS test set.
SegTrack-v2. We achieve a $\mathcal{J}$ of 59.3% on this dataset, higher than other methods that do well on DAVIS, such as LVO [37] (57.3%) and FST [32] (54.3%). Due to the low resolution of SegTrack-v2 and the fact that its videos can contain multiple moving objects of the same class in the background, we are weaker than NLC [9] (67.2%) on this dataset.
4.4 Ablation studies
The effectiveness of instance embeddings. To show that instance embeddings are more effective in our method than feature embeddings from semantic segmentation networks, we compare against the features from DeepLab-v2 [5]. Replacing the instance embeddings in our method with DeepLab fc7 features achieves 65.2% in $\mathcal{J}$, more than 10% lower than with the instance embedding features. The instance embedding feature vectors are therefore much better suited to linking objects over time and space than semantic segmentation feature vectors. The explicit pixel-wise similarity loss (Eq. 2) used to train the instance embeddings helps to produce more stable feature vectors than semantic segmentation training does.
Embedding temporal consistency and online adaptation. We analyze whether the embeddings for an object are consistent over time. Given the embeddings for each pixel in every frame and the ground truth foreground masks, we determine how many foreground embeddings in later frames are closer to background embeddings than to foreground embeddings from the first frame. If a foreground embedding from a later frame is closer to any first-frame background embedding than to the first-frame foreground embeddings, we call it an incorrectly classified foreground pixel. We plot the proportion of foreground pixels that are incorrectly classified as a function of the relative time since the first frame in Fig. 6. As the time from the first frame increases, more foreground embeddings are incorrectly classified. This “embedding drift” problem is probably caused by gradual changes in the appearance and location of the foreground and background objects over the video.
To overcome “embedding drift”, we perform online adaptation by updating our sets of foreground and background seeds. Updating the seeds is much faster than fine-tuning a neural network for adaptation, as done in OnAVOS [40] with a heuristically selected set of examples. The effects of performing online adaptation every $k$ frames are detailed in Tab. 3. More frequent online adaptation results in better performance: per-frame online adaptation boosts $\mathcal{J}$ by 7.0% over non-adaptive seed sets taken from the first frame.
| Adapt every k frames | k=1 | k=5 | k=10 | k=20 | k=40 | k=∞ |

Table 3: Effect of online adaptation frequency (k=∞ corresponds to no adaptation after the first frame).
Foreground seed track ranking. In this section, we discuss some variants of the foreground seed track ranking. In Eq. 10, the ranking is based on objectness as well as motion saliency. We analyze three variants: motion saliency alone, objectness alone, and objectness + motion saliency. The results are shown in Tab. 4. The experiments are conducted on the DAVIS train set. The initial FG seed accuracy (second row in Tab. 4) is evaluated as the proportion of initial foreground seeds located within the ground truth foreground region. We see that combining motion saliency and objectness yields the best seed accuracy, outperforming “motion alone” and “objectness alone” by 4.0% and 6.4%, respectively. Final segmentation performance is consistent with the initial foreground seed accuracy, with the combined ranking outperforming “motion alone” and “objectness alone” by 3.2% and 1.8%, respectively. The advantage of combining motion and objectness is reported in several previous methods as well [6, 16, 37]. It is interesting that using objectness alone gives lower initial foreground seed accuracy but a higher final mean $\mathcal{J}$ than motion alone. This is probably because of the different errors the two scores make. When foreground seed errors occur in “motion only” mode, seeds representing “stuff” (sky, water, road, etc.) are more likely to be selected as the foreground, whereas errors in “objectness only” mode usually place incorrect foreground seeds on static objects in the sequence. Static objects are usually closer to the target object in the embedding space than “stuff” is, so these errors are more forgiving.
| Ranking criterion | Motion | Obj. | Motion + Obj. |
| Init. FG seed acc. (%) | 90.6 | 88.2 | 94.6 |

Table 4: Foreground seed track ranking variants on the DAVIS train set.
4.5 Semi-supervised video object segmentation
We extend the method to semi-supervised video object segmentation by selecting the foreground and background seeds based on the first-frame annotation. The seeds covered by the ground truth mask are added to the foreground set $\mathcal{F}$ and the rest are added to $\mathcal{B}$. Then we apply Eqs. 13-15 to all embeddings of the sequence. Results are further refined by a dense CRF. Experiment settings are detailed in the supplementary materials. As shown in Tab. 5, we achieve 77.6% in $\mathcal{J}$, better than VPN [17] and SegFlow [6]. Note that there are further options for performance improvement, such as the motion/objectness analysis and online adaptation we experimented with in the unsupervised scenario. We leave those options for future exploration.
In this paper, we propose a method to transfer the instance embedding learned from static images to unsupervised object segmentation in videos. To adapt to the changing foreground in the video object segmentation problem, we train a network to produce embeddings encapsulating instance information rather than a network that directly outputs a foreground/background score. Within the instance embeddings, we identify representative foreground/background embeddings based on objectness and motion saliency. Then, pixels are classified based on their embedding similarity to the foreground/background. Unlike many previous methods that need to fine-tune on the target dataset, our method achieves state-of-the-art performance in the unsupervised video object segmentation setting without any fine-tuning, which saves a tremendous amount of labeling effort.
-  D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
-  M. Billinghurst, A. Clark, G. Lee, et al. A survey of augmented reality. Foundations and Trends® in Human–Computer Interaction, 8(2-3):73–272, 2015.
-  T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. Computer Vision–ECCV 2010, pages 282–295, 2010.
-  S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In CVPR 2017. IEEE, 2017.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
-  J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. Segflow: Joint learning for video object segmentation and optical flow. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
-  P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1841–1848, 2013.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
-  A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In BMVC, volume 2, page 8, 2014.
-  A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277, 2017.
-  M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph based video segmentation. IEEE CVPR, 2010.
-  K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
-  Q. Huang, C. Xia, C. Wu, S. Li, Y. Wang, Y. Song, and C.-C. J. Kuo. Semantic segmentation with reverse attention. In British Machine Vision Conference, 2017.
-  E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. arXiv preprint arXiv:1612.01925, 2016.
-  S. D. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. arXiv preprint arXiv:1612.05478, 2016.
-  M. Keuper, B. Andres, and T. Brox. Motion trajectory segmentation via minimum cost multicuts. In Proceedings of the IEEE International Conference on Computer Vision, pages 3271–3279, 2015.
-  A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Y. J. Koh and C.-S. Kim. Primary object segmentation in videos based on region augmentation and reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pages 109–117, 2011.
-  Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1995–2002. IEEE, 2011.
-  F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013.
-  H.-D. Lin and D. G. Messerschmitt. Video composition methods and their semantics. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 2833–2836. IEEE, 1991.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
-  N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 743–751, 2016.
-  A. Newell and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. CoRR, abs/1611.05424, 2016.
-  P. Ochs and T. Brox. Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1583–1590. IEEE, 2011.
-  P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE transactions on pattern analysis and machine intelligence, 36(6):1187–1200, 2014.
-  P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE transactions on pattern analysis and machine intelligence, 36(6):1187–1200, 2014.
-  A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision, pages 1777–1784, 2013.
-  F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016.
-  B. Taylor, V. Karasev, and S. Soatto. Causal video object segmentation from persistence of occlusions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4268–4276, 2015.
-  P. Tokmakov, K. Alahari, and C. Schmid. Learning motion patterns in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
-  Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3899–3908, 2016.
-  S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
-  P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In British Machine Vision Conference, 2017.
-  S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
-  S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
Appendix A Supplemental materials
A.1 Experiment Settings of Semi-supervised Video Object Segmentation
We extract the seeds for the first frame (frame 0) and form image regions, as described in Sec. 3.2 and Sec. 3.3, respectively. Then we compare the image regions with the ground truth mask $M$. For one image region $R$, if the ground truth mask covers more than a proportion $\rho$ of $R$, i.e.,

$$\frac{|R \cap M|}{|R|} > \rho, \tag{16}$$

where $|\cdot|$ denotes the area, the average embedding within the intersection is computed and added to the foreground embedding set $\mathcal{F}$.
For the background, if $R$ does not intersect with the ground truth mask at all, i.e.,

$$|R \cap M| = 0, \tag{17}$$

the average embedding in $R$ is added to the background embedding set $\mathcal{B}$. A visual illustration is shown in Fig. 7.
Then the foreground probability for a pixel on an arbitrary frame is obtained by Eqs. 13-15, and the results are further refined by a dense CRF with parameters identical to those in the unsupervised scenario. We compare our results with multiple previous semi-supervised methods in Tab. 5 of the paper.
A.2 Per-video Results for DAVIS
The per-video results for the DAVIS train set and val set are listed in Tab. 6 and Tab. 7, respectively. The evaluation metric is the region similarity $\mathcal{J}$ mentioned in the paper. Note that we used the train set for the ablation studies (Sec. 4.4), where the masks were not refined by the dense CRF.
| Sequence | ARP | FSEG | Ours | Ours + CRF |

Table 6: Per-video region similarity $\mathcal{J}$ on the DAVIS train set.

| Sequence | ARP | FSEG | Ours | Ours + CRF |

Table 7: Per-video region similarity $\mathcal{J}$ on the DAVIS val set.
A.3 Instance Embedding Drifting
In Sec. 4.4 of the paper, we mentioned the “embedding drift” problem. Here we conduct another experiment to demonstrate that the embedding changes gradually with time. In this experiment, we extract foreground and background embeddings based on the ground truth masks for every frame. The embeddings from the first frame (frame 0) are used as references. We compute the average distance between the foreground/background embeddings from an arbitrary frame and the reference embeddings. Mathematically,

$$D_{fg}^t = \frac{1}{|FG^t|\,|FG^0|} \sum_{i \in FG^t} \sum_{j \in FG^0} \| f_i - f_j \|_2, \qquad D_{bg}^t = \frac{1}{|BG^t|\,|BG^0|} \sum_{i \in BG^t} \sum_{j \in BG^0} \| f_i - f_j \|_2, \tag{18}$$

where $FG^t$ and $BG^t$ denote the ground truth foreground and background regions on frame $t$, respectively, $f_i$ denotes the embedding corresponding to pixel $i$, and $D_{fg}^t$/$D_{bg}^t$ represent the foreground/background embedding distance between frame $t$ and frame 0. Then we average $D_{fg}^t$ and $D_{bg}^t$ across sequences and plot them against the relative timestep in Fig. 8. As we observe, the embedding distance increases with time. Namely, both the objects and the background become less similar to their frame-0 appearance, which supports the necessity of online adaptation.
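A sketch of this measurement (the mean-over-all-pairs form matches Eq. 18 as reconstructed above; subsampling pixels is advisable in practice to bound memory):

```python
import numpy as np

def drift_distance(emb_t, mask_t, emb_0, mask_0):
    """Average pairwise distance between frame-t embeddings and frame-0
    reference embeddings, for foreground and background separately.

    emb_*: (H, W, d) embedding maps; mask_*: (H, W) boolean FG masks.
    """
    def avg_pair_dist(a, b):  # mean pairwise Euclidean distance
        return float(np.mean(np.linalg.norm(a[:, None] - b[None], axis=-1)))
    d_fg = avg_pair_dist(emb_t[mask_t], emb_0[mask_0])
    d_bg = avg_pair_dist(emb_t[~mask_t], emb_0[~mask_0])
    return d_fg, d_bg
```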
A.4 More Visual Examples
We provide more visual examples for the DAVIS dataset [33] and the FBMS dataset [30] in Fig. 9 and Fig. 10, respectively. Furthermore, results for all annotated frames in DAVIS and FBMS are attached in the folders named “DAVIS” and “FBMS” submitted together with this document. The frames are resized due to the size limit.