# One-Shot Video Object Segmentation

## Abstract

This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%).

## 1 Conclusions

Deep learning approaches often require a huge amount of training data in order to solve a specific problem such as segmenting an object in a video. In contrast, human observers can solve similar challenges with only a single training example. In this paper, we demonstrate that one can reproduce this capacity for one-shot learning in a machine: starting from a network architecture pre-trained on generic datasets, we propose One-Shot Video Object Segmentation (OSVOS), a method which fine-tunes that network on merely one training sample and subsequently outperforms the state of the art on DAVIS by 11.8 percentage points (79.8% vs 68.0%). Interestingly, our approach does not require explicit modeling of temporal consistency using optical flow algorithms or temporal smoothing, and thus does not suffer from error propagation over time (drift). Instead, OSVOS processes each frame of the video independently and gives rise to highly accurate and temporally consistent segmentations. All resources of this paper can be found at www.vision.ee.ethz.ch/~cvlsegmentation/osvos/
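The one-shot procedure summarized above can be illustrated with a deliberately tiny sketch. It is not the OSVOS network: as a stand-in for the pre-trained fully-convolutional model, it assumes a hypothetical per-pixel logistic classifier on grayscale intensity, with made-up initial parameters `w0` and `b0` and a made-up 2x3 toy video. Only the overall pattern is meant to carry over: start from generic parameters, fine-tune on the single annotated first frame, then segment every later frame independently, with no temporal model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune_one_shot(frame, mask, weight, bias, steps=200, lr=0.5):
    """Fine-tune a per-pixel logistic model on ONE annotated frame.

    Toy stand-in (assumption) for fine-tuning a pre-trained network:
    plain gradient descent on the mean cross-entropy loss.
    """
    pixels = [v for row in frame for v in row]
    labels = [m for row in mask for m in row]
    n = len(pixels)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(pixels, labels):
            err = sigmoid(weight * x + bias) - y  # d(loss)/d(logit)
            grad_w += err * x
            grad_b += err
        weight -= lr * grad_w / n
        bias -= lr * grad_b / n
    return weight, bias


def segment(frame, weight, bias, threshold=0.5):
    """Segment a single frame; no information from other frames is used."""
    return [
        [1 if sigmoid(weight * v + bias) >= threshold else 0 for v in row]
        for row in frame
    ]


# Hypothetical toy video: a bright object moving over a dark background.
first_frame = [[0.1, 0.9, 0.1],
               [0.1, 0.9, 0.1]]
first_mask = [[0, 1, 0],
              [0, 1, 0]]
later_frames = [
    [[0.9, 0.1, 0.1], [0.9, 0.1, 0.1]],
    [[0.1, 0.1, 0.9], [0.1, 0.1, 0.9]],
]

# Made-up "generic" parameters standing in for pre-trained weights.
w0, b0 = 1.0, 0.0
mask_before = segment(first_frame, w0, b0)

# One-shot fine-tuning on the annotated first frame only.
w, b = fine_tune_one_shot(first_frame, first_mask, w0, b0)

# Each later frame is processed independently of all others.
masks = [segment(f, w, b) for f in later_frames]
```

In this toy setting the generic parameters label every pixel as foreground, while the fine-tuned ones follow the bright object as it moves. Because each frame is segmented on its own, an error in one frame cannot propagate to the next.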

### Acknowledgements

Research funded by the EU Framework Programme for Research and Innovation Horizon 2020 (Grant No. 645331, EurEyeCase), the Swiss Commission for Technology and Innovation (CTI, Grant No. 19015.1 PFES-ES, NeGeVA), and the ERC Consolidator Grant “3D Reloaded”. The authors gratefully acknowledge support by armasuisse and thank NVidia Corporation for donating the GPUs used in this project.

