One-Shot Video Object Segmentation
This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%).
From Pre-Trained Networks…
Convolutional Neural Networks (CNNs) are revolutionizing many fields of computer vision. For instance, they have dramatically boosted performance on problems such as image classification [Krizhevsky2012, SiZi15, He+16] and object detection [Girshick2014, Gir15, Liu+16]. Image segmentation has also recently been taken over by CNNs [Maninis2016, Kokkinos2016, XiTu15, BST15b, BST16], with deep architectures pre-trained on the weakly related task of image classification on ImageNet [Russakovsky2015]. One of the major downsides of deep network approaches is their hunger for training data. Yet, given the various pre-trained network architectures available, one may ask: how much training data do we really need for the specific problem at hand? This paper investigates segmenting an object throughout an entire video when we have only a single labeled training example, e.g. the first frame.
…to One-Shot Video Object Segmentation
This paper presents One-Shot Video Object Segmentation (OSVOS), a CNN architecture to tackle the problem of semi-supervised video object segmentation, that is, the classification of all pixels of a video sequence into background and foreground, given the manual annotation of one (or more) of its frames. Figure 1 shows an example result of OSVOS, where the input is the segmentation of the first frame (in red), and the output is the mask of the object in the 90 frames of the sequence (in green).
The first contribution of the paper is to adapt the CNN to a particular object instance given a single annotated image (hence one-shot). To do so, we adapt a CNN pre-trained on image recognition [Russakovsky2015] to video object segmentation. This is achieved by training it on a set of videos with manually segmented objects. Finally, it is fine-tuned at test time on a specific object that is manually segmented in a single frame. Figure 1 shows the overview of the method. Our proposal builds on the intuition that object segmentation naturally benefits from progressively more specific levels of information: from generic semantic features learned over a large number of categories, through knowledge of the usual shapes of objects, down to the specific appearance of the particular object we are interested in segmenting.
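The three training stages above can be mimicked in miniature with a toy stand-in for the network: a per-pixel logistic regression trained by gradient descent rather than a deep CNN. This is only a sketch of the staged fine-tuning idea; the data, the model, and all names are illustrative and not the actual OSVOS pipeline.

```python
import numpy as np

def train(w, X, y, lr=0.1, steps=200):
    """Gradient descent on the per-pixel logistic loss."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # foreground probability per pixel
        w -= lr * X.T @ (p - y) / len(y)   # logistic-loss gradient
    return w

rng = np.random.default_rng(0)

# Stage 1: start from "pre-trained" weights (standing in for ImageNet features).
w = rng.normal(size=3) * 0.01

# Stage 2: train on generic foreground/background data (the parent network).
X_parent = rng.normal(size=(500, 3))
y_parent = (X_parent[:, 0] + 0.5 * X_parent[:, 1] > 0).astype(float)
w = train(w, X_parent, y_parent)

# Stage 3 (one-shot): fine-tune on the single annotated frame of the test video,
# whose object follows a slightly different rule than the generic training data.
X_first = rng.normal(size=(100, 3))
y_first = (X_first[:, 0] > 0).astype(float)
w = train(w, X_first, y_first, lr=0.05, steps=100)

pred = (1.0 / (1.0 + np.exp(-X_first @ w)) > 0.5).astype(float)
acc = (pred == y_first).mean()
```

The point of the sketch is the schedule, not the model: generic knowledge is acquired first, and the one-shot annotation only specializes it.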
The second contribution of this paper is that OSVOS processes each frame of a video independently, obtaining temporal consistency as a by-product rather than as the result of an explicitly imposed, expensive constraint. In other words, we cast video object segmentation as a per-frame segmentation problem given the model of the object from one (or various) manually-segmented frames. This stands in contrast to the dominant approach where temporal consistency plays the central role, assuming that objects do not change too much between one frame and the next. Such methods adapt their single-frame models smoothly throughout the video, looking for targets whose shape and appearance vary gradually in consecutive frames, but fail when those constraints do not apply, unable to recover from relatively common situations such as occlusions and abrupt motion.
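The per-frame formulation can be sketched as follows, with a trivial intensity-threshold "model" standing in for the fine-tuned network (purely illustrative). Because the model is fixed after one-shot fitting, frames can be processed in any order and errors cannot propagate between frames.

```python
import numpy as np

def fit_threshold(frame, mask):
    """Toy one-shot 'model': midpoint between mean fg and bg intensity."""
    return 0.5 * (frame[mask].mean() + frame[~mask].mean())

def segment(frame, thr):
    """Segment a single frame independently of all others."""
    return frame > thr

rng = np.random.default_rng(1)
frames = [rng.random((4, 4)) for _ in range(5)]

# "Annotation" of the first frame, used once to fit the model.
first_mask = frames[0] > 0.5
thr = fit_threshold(frames[0], first_mask)

# No state is carried between frames, so processing order is irrelevant.
forward = [segment(f, thr) for f in frames]
backward = [segment(f, thr) for f in reversed(frames)]
```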
In this context, motion estimation has emerged as a key ingredient of state-of-the-art video segmentation algorithms [Tsai2016, Ramakanth2014, Grundmann2010]. Exploiting it is not a trivial task, however: one has to compute temporal matches, e.g. in the form of optical flow or dense trajectories [Brox2010], which can itself be an even harder problem.
We argue that temporal consistency was needed in the past to compensate for the inaccuracy of the shape and appearance models then available. In this paper, by contrast, we show that deep learning provides a sufficiently accurate model of the target object to produce temporally stable results even when each frame is processed independently. This has some natural advantages: OSVOS is able to segment objects through occlusions, it is not limited to certain ranges of motion, it does not need to process frames sequentially, and errors do not propagate in time. In practice, this allows OSVOS to handle e.g. interlaced videos of surveillance scenarios, where cameras can go blind for a while before coming back on again.
Our third contribution is that OSVOS can work at various points of the trade-off between speed and accuracy. It can be adapted in two ways. First, given one annotated frame, the user can choose the level of fine-tuning of OSVOS, trading speed for accuracy. Experimentally, we show that OSVOS can run at 181 ms per frame with 71.5% accuracy, and reach up to 79.7% when processing each frame in 7.83 s. Second, the user can annotate more frames, those on which the current segmentation is least satisfying, and OSVOS will refine the result accordingly. We show in the experiments that the results indeed improve gradually with more supervision, reaching an outstanding 84.6% with two annotated frames per sequence and 86.9% with four, up from 79.8% with a single annotation.
Technically, we adopt the architecture of Fully Convolutional Networks (FCN) [farabet2013learning, LSD15], suitable for dense predictions. FCNs have recently become popular due to their performance both in terms of accuracy and computational efficiency [LSD15, Dai2016a, Dai2016]. Arguably, the Achilles’ heel of FCNs when it comes to segmentation is the coarse scale of the deeper layers, which leads to inaccurately localized predictions. To overcome this, a large variety of works from different fields use skip connections of larger feature maps [LSD15, Har+15, XiTu15, Man+16], or learnable filters to improve upscaling [NHH15, Yan+16]. To the best of our knowledge, this work is the first to use FCNs for the task of video segmentation.
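The effect of a skip connection can be illustrated in a few lines: a coarse score map from a deep layer is upsampled and fused with a finer-resolution map, so that small details lost at the coarse scale are recovered. This is a schematic numpy sketch of the general mechanism, not the actual OSVOS architecture.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

coarse = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])   # deep layer: low-resolution scores
fine = np.zeros((4, 4))            # shallow layer: higher-resolution scores
fine[0, 0] = 2.0                   # fine detail invisible at the coarse scale

fused = upsample2x(coarse) + fine  # skip connection: fuse the two scales
mask = fused > 0                   # final foreground prediction
```

Without the skip connection, the prediction at every output pixel would be a block of the coarse map; the addition of the larger feature map restores localized detail.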
We perform experiments on two video object segmentation datasets (DAVIS [Perazzi2016] and Youtube-Objects [Prest2012, Jain2014]) and show that OSVOS significantly improves the state of the art (79.8% vs 68.0%). Our technique is able to process a frame of DAVIS (480×854 pixels) in 102 ms. By increasing the level of supervision, OSVOS can further improve its results to 86.9% with just four annotated frames per sequence, thus providing a vastly accelerated rotoscoping tool.
All resources of this paper, including training and testing code, pre-computed results, and pre-trained models are publicly available at www.vision.ee.ethz.ch/~cvlsegmentation/osvos/.