One-Shot Video Object Segmentation

This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs. 68.0%).









1 Conclusions

Deep learning approaches often require a huge amount of training data in order to solve a specific problem such as segmenting an object in a video. In contrast, human observers can solve similar challenges with only a single training example. In this paper, we demonstrate that one can reproduce this capacity of one-shot learning in a machine: starting from a network architecture pre-trained on generic datasets, we propose One-Shot Video Object Segmentation (OSVOS), a method that fine-tunes this network on merely one training sample and subsequently outperforms the state of the art on DAVIS by 11.8 points. Interestingly, our approach does not require explicit modeling of temporal consistency using optical flow algorithms or temporal smoothing, and thus does not suffer from error propagation over time (drift). Instead, OSVOS processes each frame of the video independently and produces highly accurate and temporally consistent segmentations. All resources of this paper can be found at
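The one-shot recipe described above — adapt a generic model to the single annotated first frame, then segment every subsequent frame independently — can be illustrated with a deliberately minimal stand-in. The sketch below is hypothetical and is not the OSVOS implementation: a per-pixel logistic classifier plays the role of the pre-trained CNN, a synthetic color video replaces DAVIS, and the "fine-tuning" is plain gradient descent on the first frame's ground-truth mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "video" frame: a reddish object square on a bluish background.
def make_frame(h=16, w=16):
    frame = rng.normal([0.2, 0.2, 0.8], 0.05, size=(h, w, 3))  # background
    mask = np.zeros((h, w), dtype=bool)
    mask[4:12, 4:12] = True                                    # object region
    frame[mask] = rng.normal([0.8, 0.2, 0.2], 0.05, size=(mask.sum(), 3))
    return frame, mask

first_frame, first_mask = make_frame()

# "One-shot fine-tuning": fit the classifier on the single annotated frame.
# (In OSVOS this step fine-tunes a pre-trained fully-convolutional network.)
X = first_frame.reshape(-1, 3)
y = first_mask.reshape(-1).astype(float)
w, b = np.zeros(3), 0.0
for _ in range(500):                       # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.5 * X.T @ grad / len(y)
    b -= 0.5 * grad.mean()

# Every later frame is segmented independently, with no temporal model.
def segment(frame):
    p = 1.0 / (1.0 + np.exp(-(frame.reshape(-1, 3) @ w + b)))
    return (p > 0.5).reshape(frame.shape[:2])

test_frame, test_mask = make_frame()       # an unseen frame of the "video"
pred = segment(test_frame)
iou = (pred & test_mask).sum() / (pred | test_mask).sum()
print(f"IoU on an unseen frame: {iou:.2f}")
```

Because the per-frame classifier is specialized to the annotated object's appearance, each frame can be segmented on its own, which mirrors why OSVOS needs no optical flow or temporal smoothing and cannot drift.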


Research funded by the EU Framework Programme for Research and Innovation Horizon 2020 (Grant No. 645331, EurEyeCase), the Swiss Commission for Technology and Innovation (CTI, Grant No. 19015.1 PFES-ES, NeGeVA), and the ERC Consolidator Grant “3D Reloaded”. The authors gratefully acknowledge support by armasuisse and thank NVidia Corporation for donating the GPUs used in this project.

