Super-Trajectory for Video Segmentation
Abstract
We introduce a novel semi-supervised video segmentation approach based on an efficient video representation, called a "super-trajectory". Each super-trajectory corresponds to a group of compact trajectories that exhibit consistent motion patterns, similar appearance, and close spatiotemporal relationships. We generate trajectories using a probabilistic model that handles occlusions and drifts in a robust and natural way. To reliably group trajectories, we adopt a modified version of the density-peaks-based clustering algorithm, which allows rich spatiotemporal relations among trajectories to be captured during clustering. The presented video representation is discriminative enough to accurately propagate the initial annotations in the first frame onto the remaining video frames. Extensive experimental analysis on challenging benchmarks demonstrates that our method is capable of distinguishing target objects from complex backgrounds and even re-identifying them after occlusions.
1 Introduction
We state the problem of semi-supervised video object segmentation as the partitioning of objects in a given video sequence, given annotations in the first frame. For this task, we introduce an efficient video representation, the super-trajectory, to capture the underlying spatiotemporal structure that is intrinsic to real-world scenes. Each super-trajectory corresponds to a group of trajectories that are similar in nature and share common characteristics. A point trajectory, i.e., the tracked positions of an individual point across multiple frames, is a constituent of a super-trajectory. This representation captures several aspects of a video:

Long-term motion information is explicitly modeled, as each super-trajectory consists of trajectories over extended periods;

Spatiotemporal location information is implicitly captured by clustering nearby trajectories; and

Features, such as color and motion pattern, are described in a conveniently compact form.
With these properties, the super-trajectory simplifies and reduces the complexity of propagating human-provided labels during segmentation. We first generate trajectories based on a Markovian process, which handles occlusions and drifts naturally and efficiently. Then a density-peaks-based clustering (DPC) algorithm [31] is modified to obtain a reasonable division of the trajectories, which offers a proper split of the video along the space and time axes. The design of our super-trajectory is motivated by the following two aspects.
Firstly, for the task of video segmentation, it is desirable to have a powerful abstraction of videos that is robust to structure variations and deformations in image space and time. As demonstrated on the recently released DAVIS dataset [28], most existing approaches exhibit severe limitations under occlusions, motion blur, and appearance changes. The proposed super-trajectory, endowed with the properties above, is able to handle such cases (see Fig. 1).
Secondly, from the perspective of feature generation, point trajectories need to be improved to meet the above requirements. Merging and splitting video segments (and corresponding trajectories) into atomic spatiotemporal components is essential for handling occlusions and temporal discontinuities. However, it is well known that classical clustering methods (e.g., k-means and spectral clustering), which are widely adopted by previous trajectory methods, cannot even reach a consensus on the definition of a cluster. Here, we modify the DPC algorithm [31] to group trajectories, favoring its ability to choose cluster centers according to a reasonable criterion.
We conduct video segmentation by operating on trajectories as unified super-trajectory groups. To eliminate the adverse effects of camera motion, we introduce a reverse-tracking strategy that excludes objects originating outside the frame. To re-identify objects after occlusions, we exploit object re-occurrence information, which reflects the spatiotemporal relations of objects across the entire video sequence.
The remainder of the article is organized as follows. Related work is summarized in Sec. 2. Our approach for super-trajectory generation is presented in detail in Sec. 3. In Sec. 4, we describe our super-trajectory based video segmentation algorithm. We then experimentally demonstrate its robustness, effectiveness, and efficiency in Sec. 5. Finally, we draw conclusions in Sec. 6.
2 Related Work
We provide a brief overview of recent work on video object segmentation and point trajectory extraction.
2.1 Video Object Segmentation
According to the level of supervision required, video segmentation techniques can be broadly categorized into unsupervised, semi-supervised, and supervised methods.
Unsupervised algorithms [37, 49, 26, 44, 42] do not require manual annotations but often rely on certain limiting assumptions about the application scenario. Some techniques [5, 14, 27] emphasize the importance of motion information. More specifically, [5, 14] analyze long-term motion information via trajectories and then solve segmentation as a trajectory clustering problem. The works [10, 43, 46] introduce saliency information [45] as prior knowledge to infer the object. Recently, [19, 22, 50, 12, 48] generate object segments by ranking several object candidates.
Semi-supervised video segmentation, also referred to as label propagation, is usually achieved by propagating human annotations specified on one or a few key frames onto the entire video sequence [4, 2, 35, 7, 21, 17, 32, 36]. These methods mainly use flow-based random field propagation models [38], patch-seams based propagation strategies [30], energy optimization over graph models [29], joint segmentation and detection frameworks [47], or pixel segmentation in bilateral space [23].
2.2 Point Trajectory
Point trajectories are generated by tracking points over multiple frames and have the advantage of representing long-term motion information. Kanade-Lucas-Tomasi (KLT) [33] is among the most popular methods that track a small number of feature points. Optical-flow-based dense trajectories [34] improve over sparse interest-point tracking and have inspired several follow-up studies in video segmentation and action recognition. In particular, [39, 40, 41] introduce dense trajectories for action recognition. Other methods [5, 13, 20, 15, 14, 24, 25, 18, 9] address the problem of unsupervised video segmentation, in which case the problem can also be described as motion segmentation. These methods usually track points via dense optical flow and perform segmentation by clustering trajectories.
Existing approaches often handle trajectories in pairs or individually and directly group all trajectories into a few clusters as segments, ignoring the inner coherence of a group of similar trajectories. Instead, we operate on trajectories as united super-trajectory groups rather than individual entities, thus offering a compact and atomic video representation and fully exploiting the spatiotemporal relations among trajectories.
3 Super-Trajectory via Grouping Trajectories
3.1 Trajectory Generation
Given a sequence of video frames within a given time range, each pixel can be tracked to the next frame using optical flow. This tracking process can be executed frame by frame until a termination condition (e.g., occlusion or an incorrect motion estimate) is reached. The tracked points compose a trajectory, and a new tracker is initialized where the prior tracker finished. We build our trajectory generation on a unified probabilistic model that naturally accounts for the various termination conditions.
Let w denote a flow field indexed by pixel position that returns a 2D flow vector at a given point. Using LDOF [6], we compute the forward flow field from frame t to frame t+1, and the backward flow field from frame t+1 to frame t. We track each pixel position to the consecutive frames in both directions. The tracked points of consecutive frames are concatenated to form a trajectory:
(1) 
where the subscript indicates the length of the trajectory. We model the point-tracking process as a first-order Markovian process, and denote the probability that the current point of a trajectory is correctly tracked from the previous frame. The prediction model is defined by:
(2) 
where the energy term is formulated as:
(3) 
The energy functions penalize various potential tracking errors. The former energy is expressed as:
(4) 
which penalizes the appearance variations between corresponding points. The latter energy is included to penalize occlusions. It uses the consistency of the forward and backward flows:
(5) 
When this consistency constraint is violated, occlusions or unreliable optical flow estimates may have occurred (see [34] for further discussion). Note that the proposed tracking model performs accurately, yet it is not limited to the above constraints. We terminate the tracking process when the constraint is violated, and then start a new tracker. In our implementation, we discard trajectories shorter than four frames.
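As a concrete illustration, the forward-backward consistency test above can be sketched as follows. This is a minimal reading of the criterion popularized by [34]; the flow layout (per-pixel `(d_row, d_col)` displacements) and the threshold values `alpha` and `beta` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fb_consistency_ok(x, flow_fwd, flow_bwd, alpha=0.01, beta=0.5):
    """Forward-backward consistency check at pixel x = (row, col).

    Flow fields have shape (H, W, 2), storing (d_row, d_col) displacements.
    alpha and beta are illustrative thresholds, not values from the paper.
    """
    r, c = x
    w = flow_fwd[r, c]                                # forward displacement at x
    r2, c2 = int(round(r + w[0])), int(round(c + w[1]))
    H, W = flow_fwd.shape[:2]
    if not (0 <= r2 < H and 0 <= c2 < W):             # tracked point left the frame
        return False
    w_hat = flow_bwd[r2, c2]                          # backward displacement at x + w(x)
    lhs = np.sum((w + w_hat) ** 2)                    # near zero when flows agree
    rhs = alpha * (np.sum(w ** 2) + np.sum(w_hat ** 2)) + beta
    return bool(lhs < rhs)
```

When the function returns `False`, tracking of that point would terminate and a new tracker would be initialized.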
3.2 SuperTrajectory Generation
Previous studies indicate the value of trajectory-based representations for long-term motion information. Our additional intuition is that neighboring trajectories exhibit compact spatiotemporal relationships and have similar appearance and motion patterns. This motivates us to operate on trajectories as united groups.
We generate super-trajectories by clustering trajectories with the density-peaks-based clustering (DPC) algorithm [31]. Before introducing our super-trajectory generation method, we first describe DPC.
Density Peaks based Clustering (DPC). DPC clusters data by finding density peaks. It provides a fast clustering solution based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points of higher density, which offers a reasonable criterion for finding cluster centers.
Given the distances between data points, DPC calculates two quantities for each data point: its local density ρ and its distance δ from points of higher density. The local density of a data point is defined as follows (here we do not use the cutoff kernel or the Gaussian kernel adopted in [31], due to the small amount of data):
(6) 
Here, δ is measured by computing the minimum distance between the point and any other point of higher density:
(7)  δ_i = min_{j : ρ_j > ρ_i} d_ij
For the point with the highest density, δ is conventionally taken as the maximum of its distances to the other points.
Cluster centers are the points with high local density ρ and large distance δ from other points of higher local density. The data points can be ranked by the product ρδ, and the top-ranking points are selected as centers. Once the cluster centers are declared, each remaining data point is assigned to the cluster of its nearest neighbor of higher density.
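The center-selection and assignment steps above can be sketched with vanilla DPC on a precomputed distance matrix. The Gaussian density kernel below is an illustrative stand-in (the paper modifies the density computation for small data amounts), and `n_centers` is supplied externally:

```python
import numpy as np

def dpc_cluster(D, n_centers):
    """Density-peaks clustering on a precomputed distance matrix D (n x n).

    The Gaussian density kernel is an illustrative choice; the paper
    modifies the density computation for small trajectory sets.
    """
    n = D.shape[0]
    rho = np.exp(-D ** 2).sum(axis=1) - 1.0   # local density, self term removed
    order = np.argsort(-rho)                  # indices in decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    delta[order[0]] = D[order[0]].max()       # convention for the densest point
    for k in range(1, n):
        i = order[k]
        higher = order[:k]                    # all points of higher density
        j = higher[np.argmin(D[i, higher])]
        delta[i] = D[i, j]
        nearest_higher[i] = j
    gamma = rho * delta                       # center score: high rho AND high delta
    centers = np.argsort(-gamma)[:n_centers]
    label = np.full(n, -1)
    label[centers] = np.arange(n_centers)
    # sweep in decreasing density: each point's parent is already labeled
    for i in order:
        if label[i] < 0:
            p = nearest_higher[i]
            label[i] = label[p] if p >= 0 else 0
    return label
```

Ranking by the product ρδ, rather than by ρ alone, is what distinguishes genuine density peaks from dense points sitting inside an existing cluster.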
Grouping Trajectories via DPC. Given a trajectory spanning several frames, we define three features for describing it: spatial location, color, and velocity:
(8)  
where the weighting is set empirically; we tested a range of values and did not observe an obvious effect on the results.
For each pair of trajectories that share some frames, we define their distance by measuring descriptor similarity:
(9) 
We normalize the color distance by the maximum intensity, the location distance by the sampling step (detailed below), and the motion distance by the mean motion magnitude over all trajectories, which gives the above distance measures similar scales. If there is no temporal overlap, we set the distance to a very large value.
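One plausible reading of this pairwise distance is sketched below: descriptors are averaged over the temporally shared frames, each term is normalized as described, and non-overlapping pairs receive a very large distance. The dictionary layout, the `step_R` and `mean_motion` normalizers passed in, and the equal weighting of the three terms are illustrative assumptions.

```python
import numpy as np

# A trajectory is stored as per-frame arrays over its lifetime (names assumed):
#   'start': first frame index, 'pos': (L, 2) positions,
#   'col': (L, 3) RGB colors, 'vel': (L, 2) flow vectors.

def trajectory_distance(ta, tb, step_R, mean_motion, large=1e9):
    """Distance between two trajectories over their shared frames."""
    sa, sb = ta['start'], tb['start']
    lo = max(sa, sb)
    hi = min(sa + len(ta['pos']), sb + len(tb['pos']))
    if lo >= hi:                                  # no temporal overlap
        return large
    A = slice(lo - sa, hi - sa)                   # shared frames in ta's indexing
    B = slice(lo - sb, hi - sb)                   # shared frames in tb's indexing
    d_loc = np.linalg.norm(ta['pos'][A].mean(0) - tb['pos'][B].mean(0)) / step_R
    d_col = np.linalg.norm(ta['col'][A].mean(0) - tb['col'][B].mean(0)) / 255.0
    d_mot = np.linalg.norm(ta['vel'][A].mean(0) - tb['vel'][B].mean(0)) / mean_motion
    return d_loc + d_col + d_mot
```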
We first roughly partition the trajectories into several non-overlapping clusters, and then iteratively update each partition to obtain optimized trajectory clusters.
The only parameter of our super-trajectory algorithm is the number of spatial grids, which controls the degree of spatial subdivision. The spatial sampling step is derived from it and from the product of the height and width of the image frame. The clustering procedure begins with an initialization step in which we divide the input video into several non-overlapping spatiotemporal volumes. As shown in Fig. 2, all trajectories are divided among these volumes; a trajectory falls into the volume where it starts. We then need to find a proper cluster number for each trajectory group, thereby offering a reasonable temporal split of the video.
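The initialization can be sketched as a simple hashing of each trajectory to the space-time volume containing its starting point. The sampling step `R = sqrt(H*W / n_grids)` is one plausible reading of the text, and `temporal_step` is treated as a free parameter here:

```python
import math

def initial_volume(start_frame, start_xy, H, W, n_grids, temporal_step):
    """Map a trajectory to its initial space-time volume by its starting point.

    R = sqrt(H*W / n_grids) is an assumed reading of the spatial sampling
    step; temporal_step is an illustrative free parameter.
    """
    R = math.sqrt(H * W / n_grids)
    gx = int(start_xy[0] // R)          # spatial grid cell (x)
    gy = int(start_xy[1] // R)          # spatial grid cell (y)
    gt = int(start_frame // temporal_step)  # temporal slab
    return (gt, gy, gx)
```

Trajectories mapped to the same volume form one initial group, whose cluster number is then estimated from the average trajectory length.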
For each trajectory group, we initially estimate the cluster number from the average length of all trajectories. We then apply a modified DPC algorithm to generate trajectory clusters, as described in Alg. 1. In Alg. 1 (line 3), if the stated condition holds, the trajectory does not have any temporal overlap with the trajectories of higher local density, which means it is the center of an isolated group. If the condition in Alg. 1 (line 4) holds, there exist more unconnected trajectory groups than requested centers; we then select the highest-density trajectories of those unconnected groups as centers (Alg. 1, line 5). Otherwise, in Alg. 1 (lines 7, 8), the trajectories with the highest scores are selected as the cluster centers. The whole initialization process is described in Alg. 2 (lines 1-3).
Based on the above initialization, we group trajectories into super-trajectories according to their spatiotemporal relationships and similarities (see Fig. 3(b)). Next, we iteratively refine the super-trajectory assignments: each trajectory is assigned to its nearest cluster center. To reduce the search space, we only consider trajectories falling into a space-time volume around the cluster center (Alg. 2, line 7), which yields a significant speed advantage by limiting the number of distance calculations. Once each trajectory has been associated with its nearest cluster center, an update step adjusts the center of each trajectory cluster via Alg. 1 (Alg. 2, lines 14, 15). We drop very small trajectory clusters and merge their trajectories into the nearest remaining clusters. In practice, we find five iterations of this refinement sufficient for satisfactory performance. Visualizations of super-trajectory generation over different iterations are presented in Fig. 3.
Using Alg. 1, we group all trajectories into non-overlapping clusters, represented as super-trajectories. Note that the number of super-trajectories varies across the iterations of Alg. 2, since we merge small clusters into other clusters. Additionally, this number differs across videos even with the same input parameter. This is important: different videos have different temporal characteristics, so we only constrain their spatial granularity.
4 Super-Trajectory based Video Segmentation
In Sec. 3, we clustered sets of compact trajectories into super-trajectories. In this section, we describe our video segmentation approach, which leverages super-trajectories.
Given the mask of the first frame, we seek a binary partitioning of pixels into foreground and background classes. Clearly, the annotation can be propagated to the rest of the video using the trajectories that start in the first frame. However, only a few points can be successfully tracked across the whole sequence, due to occlusion, drift, or unreliable motion estimation. Benefiting from our efficient trajectory clustering, super-trajectories are able to spread more annotation information over longer periods. This inspires us to base our label-propagation process on super-trajectories.
To infer the foreground probabilities of super-trajectories, we first divide all trajectories into three categories: foreground trajectories, background trajectories, and unlabeled trajectories. The foreground and background trajectories start in the first frame and are labeled by the annotation mask, while the unlabeled trajectories start in later frames and thus cannot be labeled by the mask. Accordingly, super-trajectories are classified into two categories: labeled and unlabeled. A labeled super-trajectory contains at least one labeled trajectory, and its foreground probability is computed as the ratio of the foreground trajectories to the labeled trajectories it contains:
(10) 
For the points belonging to a labeled super-trajectory, the foreground probability is set accordingly.
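The labeled-super-trajectory probability of Eq. 10 reduces to a simple ratio; a sketch (with an illustrative label encoding) is:

```python
def supertrajectory_fg_prob(labels):
    """Foreground probability of a super-trajectory from its member labels.

    labels: per-trajectory label, 'fg', 'bg', or None (unlabeled) -- an
    illustrative encoding. Returns the ratio of foreground to labeled
    trajectories (Eq. 10), or None if no member trajectory is labeled.
    """
    labeled = [l for l in labels if l is not None]
    if not labeled:
        return None                       # unlabeled super-trajectory
    return labeled.count('fg') / len(labeled)
```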
We then build an appearance model for estimating the foreground probabilities of unlabeled pixels. The appearance model is built upon the labeled super-trajectories and consists of two weighted Gaussian mixture models over RGB color values, one for the foreground and one for the background. The foreground GMM is estimated from all labeled super-trajectories, weighted by their foreground probabilities. The estimation of the background GMM is analogous, with the weights replaced by the background probabilities. The appearance models leverage foreground and background super-trajectories over many frames, rather than only the first frame or the labeled trajectories, and can therefore estimate appearance information robustly.
Although the above model successfully propagates more annotation information across the whole video sequence, it still faces a difficulty: the model is confused when a new object comes into view (see Fig. 4(b)). To address this, we propose to reverse-track points to exclude newly entering objects. We compute the 'source' of each unlabeled trajectory:
(11) 
where the first term indicates the starting position and the second refers to the velocity from Eq. 8. Clearly, if the virtual source position lies outside the image frame domain, the trajectory is a latecomer. We treat trajectories that start outside the view as background. A labeled super-trajectory is redefined as one containing at least one trajectory from the foreground, background, or outside-view sets, and Eq. 10 is updated as
(12) 
The outside-view trajectories are also used for training the appearance model in the prior step. According to our experiments, this assumption offers a clear performance improvement. Foreground estimation results with our reverse-tracking strategy are presented in Fig. 4(c).
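The reverse-tracking test can be sketched as a linear back-projection of a trajectory's starting point to the first frame; the use of the trajectory's mean velocity and the naming below are assumptions for illustration:

```python
def starts_outside_view(start_frame, start_xy, mean_velocity, H, W):
    """Reverse-track a trajectory's start back to the first frame.

    Back-projects linearly with the trajectory's mean velocity (an assumed
    reading of Eq. 11). If the virtual source lands outside the frame, the
    trajectory entered the view late and is treated as background.
    """
    x = start_xy[0] - mean_velocity[0] * start_frame
    y = start_xy[1] - mean_velocity[1] * start_frame
    return not (0 <= x < W and 0 <= y < H)
```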
To re-identify objects after long-term occlusions and to enforce segmentation consistency, we exploit the re-occurrence of objects. As suggested by [10], objects, or regions, often re-occur both in space and in time. We build correspondences among re-occurring regions across distant frames and transport foreground estimates globally. This process operates at the superpixel level, since super-trajectories cannot cover all pixels.
Let the superpixel set of the input video be given. For each region, we search for its nearest neighbors (NNs) as its re-occurring regions using KD-tree search. For a region of a given frame, we only search for its NNs in previous frames. This backward-search strategy biases the segmentation toward the results of prior frames, as propagation accuracy degrades over time. Following [10], each region is represented as a concatenation of several descriptors: RGB and LAB color histograms (6 channels × 20 bins), a HOG descriptor (9 cells × 6 orientation bins) computed over a patch around the superpixel center, and the spatial coordinates of the superpixel center. The spatial coordinates are taken with respect to the image center and normalized, which implicitly incorporates spatial consistency into the NN search.
After the NN search in the feature space, we construct a weight matrix over all regions:
(13) 
A probability transition matrix is then built via row-wise normalization of the weight matrix. We define a column vector that gathers the foreground probabilities of all regions; the foreground probability of a superpixel is assigned as the average foreground probability of its pixels.
We iteratively update this vector via the probability transition matrix. In each iteration, we update it via:
(14) 
which is equivalent to updating the foreground probability of a region with the weighted average of its NNs. In each iteration, we keep the foreground probabilities of points belonging to labeled trajectories unchanged, then recompute the vector and update it in the next iteration. In this way, the relatively accurate annotation information of the labeled trajectories is preserved. Additionally, the annotation information is progressively propagated forward, and the super-trajectory-based foreground estimates remain consistent even across distant frames (see Fig. 4(d)).
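The diffusion step can be sketched as follows. For brevity this sketch searches NNs among all regions with a brute-force distance matrix rather than the paper's backward KD-tree search over previous frames, and uses an illustrative Gaussian weight; labeled regions are clamped after every iteration, as described above.

```python
import numpy as np

def propagate_foreground(F, v0, labeled_mask, k=4, iters=10):
    """Diffuse foreground probabilities across re-occurring regions.

    F: (n, d) region descriptors; v0: initial foreground probabilities;
    labeled_mask: boolean mask of regions whose probabilities stay clamped.
    """
    n = F.shape[0]
    D = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=2)
    W = np.exp(-D ** 2)                        # illustrative Gaussian weights
    keep = np.argsort(D, axis=1)[:, 1:k + 1]   # each row's k NNs (skip self)
    M = np.zeros_like(W)
    rows = np.arange(n)[:, None]
    M[rows, keep] = W[rows, keep]
    P = M / M.sum(axis=1, keepdims=True)       # row-normalized transition matrix
    v = v0.astype(float).copy()
    for _ in range(iters):
        v = P @ v                              # weighted average of each region's NNs
        v[labeled_mask] = v0[labeled_mask]     # keep labeled probabilities fixed
    return v
```

Because labeled regions are re-clamped after every multiplication, the iteration behaves like a diffusion with boundary conditions: unlabeled regions drift toward the probabilities of the labeled regions they re-occur with.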
After 10 iterations, the pixels (regions) with foreground probabilities larger than 0.5 are classified as foreground, yielding the final binary segments. In Sec. 5.2, we test other iteration counts and observe only minor performance variation; we use the setting that obtains the best performance.
5 Experimental Results
Parameter Settings
In Sec. 3.2, the number of spatial grids is fixed. In Sec. 4, we oversegment each frame into superpixels via SLIC [1] for good boundary adherence, and set the number of NNs for each superpixel. In our experiments, all parameters of our algorithm are kept fixed.
Datasets. We evaluate our method on two public video segmentation benchmarks: DAVIS [28] and SegTrack V2 [21].
The newly released DAVIS [28] contains 50 video sequences (3,455 frames in total) with pixel-level manual ground truth for the foreground object in every frame. These videos span a wide range of object segmentation challenges such as occlusions, fast motion, and appearance changes. Since DAVIS contains diverse scenarios that break classical assumptions, most state-of-the-art methods fail to produce reasonable segments, as demonstrated in [28].
SegTrack V2 [21] consists of 14 videos (947 frames) with 24 object instances. A pixel-level mask is provided for every frame.
5.1 Performance Comparison
Table 1: IoU scores on a representative subset of the DAVIS dataset.

Video             BVS    FCP    JMP    SEA    TSP    HVS    STV
breakdance-flare  0.727  0.723  0.430  0.131  0.040  0.499  0.835
camel             0.669  0.734  0.640  0.649  0.654  0.876  0.798
car-roundabout    0.851  0.717  0.726  0.708  0.614  0.777  0.904
dance-twirl       0.492  0.471  0.444  0.117  0.099  0.318  0.640
drift-chicane     0.033  0.457  0.243  0.119  0.018  0.331  0.466
horsejump-low     0.601  0.607  0.663  0.498  0.291  0.551  0.768
libby             0.776  0.316  0.295  0.226  0.070  0.553  0.723
mallard-fly       0.606  0.541  0.536  0.557  0.200  0.436  0.650
motorbike         0.563  0.713  0.506  0.451  0.340  0.687  0.749
rhino             0.782  0.794  0.716  0.736  0.694  0.812  0.893
soapbox           0.789  0.449  0.759  0.783  0.247  0.684  0.751
stroller          0.767  0.597  0.656  0.464  0.369  0.662  0.826
surf              0.492  0.843  0.941  0.821  0.814  0.759  0.917
swing             0.784  0.648  0.115  0.511  0.098  0.104  0.765
tennis            0.737  0.623  0.765  0.482  0.074  0.576  0.826
Avg. (entire)     0.665  0.631  0.607  0.556  0.358  0.596  0.736
Quantitative Results. The standard intersection-over-union (IoU) metric is employed for quantitative evaluation. Given a segmentation mask S and ground truth G, IoU is computed as |S ∩ G| / |S ∪ G|. We compare the proposed STV against various state-of-the-art alternatives: BVS [23], FCP [29], JMP [11], SEA [30], TSP [8], HVS [16], JOT [47], and OFL [36].
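The IoU metric itself is a one-liner; a sketch for binary masks:

```python
import numpy as np

def iou(mask, gt):
    """Intersection-over-union between a binary mask and the ground truth."""
    mask = mask.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(mask, gt).sum()
    union = np.logical_or(mask, gt).sum()
    return inter / union if union else 1.0   # two empty masks agree perfectly
```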
In Table 1, we report IoU scores on a representative subset of the DAVIS dataset. As shown, the proposed STV performs best on most video sequences, and it achieves the highest average IoU score (0.736) over all 50 video sequences of the DAVIS dataset, a significant improvement over previous methods.
We further report quantitative results on the SegTrack V2 [21] dataset in Table 2; they consistently demonstrate the favorable performance of the proposed method.
Table 2: Average IoU scores on the SegTrack V2 dataset.

Method  BVS    OFL    SEA    FCP    HVS    JOT    STV
IoU     0.584  0.675  0.453  0.574  0.518  0.718  0.781
Qualitative Results. Qualitative segmentation results for video sequences from the DAVIS [28] and SegTrack V2 [21] datasets are presented in Figs. 5 and 6. With only the first frame as initialization, the proposed algorithm is able to segment objects with fast motion (breakdance-flare and cheetah1) or large shape deformation (dog-agility). It also produces accurate segmentation maps when the foreground undergoes occlusions (libby).
5.2 Validation of the Proposed Algorithm
In this section, we explore the proposed approach in more detail on the DAVIS dataset [28]. We test the values of important parameters, verify the basic assumptions of the algorithm, evaluate the contribution of each part of our approach, and perform a runtime comparison.
Parameter Verification. We study the influence of the only required input parameter of our super-trajectory algorithm in Sec. 3.2: the number of spatial grids. We report performance by plotting the IoU of the segmentation results as a function of this parameter over a range of values. As shown in Fig. 7(a), performance increases with finer super-trajectory clustering in the spatial domain; however, further increasing the number of grids does not change the final performance appreciably. We set the value at which the maximum performance is obtained. We then investigate the influence of the number of NNs of a region (Sec. 4). We plot the IoU score for varying values in Fig. 7(b) and choose the best-performing setting.
Ablation Study. To quantify the improvement obtained with the trajectories proposed in Sec. 3.1, we compare against two baseline trajectory methods: LTM [14] and DAD [39]. LTM is widely used for motion segmentation, and DAD shows promising performance for action detection. For a fair comparison, we replace only our trajectory generation step with these two methods, estimate optical flow via LDOF [6], and keep all other parameters fixed. The comparison results in Table 3 show that the proposed trajectory generation approach is preferable to the classical trajectory methods [14, 39].
6 Conclusions
This paper introduced a video segmentation approach that represents video with super-trajectories. Based on the DPC algorithm, compact trajectories are efficiently grouped into super-trajectories. Occlusion and drift are naturally handled by our probabilistic trajectory generation model. We proposed to perform video segmentation at the super-trajectory level. By reverse-tracking points and leveraging the re-occurrence of regions, the algorithm is robust to many segmentation challenges. Experimental results on well-known video segmentation datasets [28, 21] demonstrate that our approach outperforms current state-of-the-art methods.
References
 [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE PAMI, 2012.
 [2] V. Badrinarayanan, F. Galasso, and R. Cipolla. Label propagation in video sequences. In CVPR, 2010.
 [3] X. Bai, J. Wang, D. Simons, and G. Sapiro. Video SnapCut: robust video object cutout using localized classifiers. ACM Trans. on Graphics, 2009.
 [4] W. Brendel and S. Todorovic. Video object segmentation by tracking regions. In ICCV, 2009.
 [5] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010.
 [6] T. Brox and J. Malik. Large displacement optical flow: Descriptor matching in variational motion estimation. IEEE PAMI, 2011.
 [7] I. Budvytis, V. Badrinarayanan, and R. Cipolla. Semi-supervised video segmentation using tree structured graphical models. In CVPR, 2011.
 [8] J. Chang, D. Wei, and J. W. Fisher. A video representation using temporal superpixels. In CVPR, 2013.
 [9] L. Chen, J. Shen, W. Wang, and B. Ni. Video object segmentation via dense trajectories. IEEE TMM, 2015.
 [10] A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In BMVC, 2014.
 [11] Q. Fan, F. Zhong, D. Lischinski, D. Cohen-Or, and B. Chen. JumpCut: Non-successive mask transfer and interpolation for video cutout. ACM Trans. on Graphics, 2015.
 [12] K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik. Learning to segment moving objects in videos. In CVPR, 2015.
 [13] K. Fragkiadaki and J. Shi. Detection free tracking: Exploiting motion and topology for segmenting and tracking under entanglement. In CVPR, 2011.
 [14] K. Fragkiadaki, G. Zhang, and J. Shi. Video segmentation by tracing discontinuities in a trajectory embedding. In CVPR, 2012.
 [15] K. Fragkiadaki, W. Zhang, G. Zhang, and J. Shi. Two-granularity tracking: Mediating trajectory and detection graphs for tracking under occlusions. In ECCV, 2012.
 [16] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, 2010.
 [17] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
 [18] M. Keuper, B. Andres, and T. Brox. Motion trajectory segmentation via minimum cost multicuts. In ICCV, 2015.
 [19] Y. J. Lee, J. Kim, and K. Grauman. Keysegments for video object segmentation. In ICCV, 2011.
 [20] J. Lezama, K. Alahari, J. Sivic, and I. Laptev. Track to the future: Spatio-temporal video segmentation with long-range motion cues. In CVPR, 2011.
 [21] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figureground segments. In ICCV, 2013.
 [22] T. Ma and L. J. Latecki. Maximum weight cliques with mutex constraints for video object segmentation. In CVPR, 2012.
 [23] N. Maerki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In CVPR, 2016.
 [24] P. Ochs and T. Brox. Higher order motion models and spectral clustering. In CVPR, 2012.
 [25] P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE PAMI, 2014.
 [26] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatio-temporal object detection proposals. In ECCV, 2014.
 [27] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In ICCV, 2013.
 [28] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
 [29] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung. Fully connected object proposals for video segmentation. In ICCV, 2015.
 [30] S. A. Ramakanth and R. V. Babu. SeamSeg: Video object segmentation using patch seams. In CVPR, 2014.
 [31] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 2014.
 [32] N. Shankar Nagaraja, F. R. Schmidt, and T. Brox. Video segmentation with just a few strokes. In ICCV, 2015.
 [33] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.
 [34] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by GPU-accelerated large displacement optical flow. In ECCV, 2010.
 [35] D. Tsai, M. Flagg, and J. M. Rehg. Motion coherent tracking using multi-label MRF optimization. In BMVC, 2010.
 [36] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In CVPR, 2016.
 [37] A. VazquezReina, S. Avidan, H. Pfister, and E. Miller. Multiple hypothesis video segmentation from superpixel flows. In ECCV, 2010.
 [38] S. Vijayanarasimhan and K. Grauman. Active frame selection for label propagation in videos. In ECCV, 2012.
 [39] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
 [40] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
 [41] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
 [42] W. Wang, J. Shen, X. Li, and F. Porikli. Robust video object cosegmentation. IEEE TIP, 2015.
 [43] W. Wang, J. Shen, and F. Porikli. Saliencyaware geodesic video object segmentation. In CVPR, 2015.
 [44] W. Wang, J. Shen, and L. Shao. Consistent video saliency using local gradient flow optimization and global refinement. IEEE TIP, 2015.
 [45] W. Wang, J. Shen, L. Shao, and F. Porikli. Correspondence driven saliency transfer. IEEE TIP.
 [46] W. Wang, J. Shen, R. Yang, and F. Porikli. Saliencyaware video object segmentation. IEEE PAMI, 2017.
 [47] L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang. JOTS: Joint online tracking and segmentation. In CVPR, 2015.
 [48] F. Xiao and Y. Jae Lee. Track and segment: An iterative unsupervised approach for video object proposals. In CVPR, 2016.
 [49] C. Xu, C. Xiong, and J. J. Corso. Streaming hierarchical video segmentation. In ECCV, 2012.
 [50] D. Zhang, O. Javed, and M. Shah. Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In CVPR, 2013.
 [51] F. Zhong, X. Qin, Q. Peng, and X. Meng. Discontinuity-aware video object cutout. ACM Trans. on Graphics, 2012.