Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning

Yuhua Chen         Jordi Pont-Tuset         Alberto Montes         Luc Van Gool
Computer Vision Lab, ETH Zurich          VISICS, ESAT/PSI, KU Leuven

This paper tackles the problem of video object segmentation, given some user annotation that indicates the object of interest. The problem is formulated as pixel-wise retrieval in a learned embedding space: we embed pixels of the same object instance into the vicinity of each other, using a fully convolutional network trained by a modified triplet loss as the embedding model. The annotated pixels are then set as reference, and the rest of the pixels are classified using a nearest-neighbor approach. The proposed method supports different kinds of user input, such as a segmentation mask in the first frame (semi-supervised scenario) or a sparse set of clicked points (interactive scenario). In the semi-supervised scenario, we achieve results competitive with the state of the art but at a fraction of the computational cost (275 milliseconds per frame). In the interactive scenario, where the user is able to refine their input iteratively, the proposed method provides instant response to each input and reaches comparable quality to competing methods with much less interaction.

1 Introduction

An immeasurable amount of multimedia data is recorded and shared in the current era of the Internet. Among these data, video is one of the most common and rich modalities, although it is also one of the most expensive to process. Algorithms for fast and accurate video processing are thus crucially important for real-world applications. Video object segmentation, i.e. classifying the set of pixels of a video sequence into the object(s) of interest and background, is among the tasks that, despite having numerous and attractive applications, cannot currently be performed at a satisfactory quality level and at an acceptable speed. The main objective of this paper is to fill this gap: we perform video object segmentation at an accuracy level comparable to the state of the art while keeping the processing time low enough to allow for real-time human interaction.

Towards this goal, we model the problem in a simple and intuitive, yet powerful and unexplored way: we formulate video object segmentation as pixel-wise retrieval in a learned embedding space. Ideally, in the embedding space, pixels belonging to the same object instance are close together and pixels from other objects are further apart. We build such an embedding space by learning a Fully Convolutional Network (FCN) as the embedding model, using a modified triplet loss tailored for video object segmentation, where no clear correspondence between pixels is given. Once the embedding model is learned, inference at test time only needs to compute the embedding vectors with a forward pass for each frame, and then perform a per-pixel nearest-neighbor search in the embedding space to find the most similar annotated pixel. The object, defined by the user annotation, can therefore be segmented throughout the video sequence.
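The retrieval step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `segment_by_nearest_neighbor`, the array shapes, the use of Euclidean distance, and the toy two-dimensional embeddings are all assumptions made for illustration, and the embedding network itself is replaced by a hand-written array.

```python
import numpy as np


def segment_by_nearest_neighbor(frame_emb, ref_emb, ref_labels):
    """Give each pixel the label of its nearest reference embedding.

    frame_emb:  (H, W, D) per-pixel embeddings of one frame
                (stand-in for the output of an embedding network)
    ref_emb:    (N, D) embeddings of the user-annotated pixels
    ref_labels: (N,) integer object ids (0 is used here for background)
    """
    h, w, d = frame_emb.shape
    queries = frame_emb.reshape(-1, d)  # (H*W, D)
    # Squared Euclidean distance, expanded as |q|^2 - 2 q.r + |r|^2.
    # Euclidean distance is an assumption of this sketch.
    dist2 = (
        np.sum(queries ** 2, axis=1, keepdims=True)
        - 2.0 * queries @ ref_emb.T
        + np.sum(ref_emb ** 2, axis=1)
    )
    nearest = np.argmin(dist2, axis=1)  # index of closest reference pixel
    return ref_labels[nearest].reshape(h, w)


# Toy data: a 2x2 frame with 2-D embeddings and two annotated pixels.
frame_emb = np.array([[[0.1, 0.0], [0.9, 1.0]],
                      [[1.1, 0.9], [0.0, 0.2]]])
ref_emb = np.array([[0.0, 0.0],   # annotated background pixel
                    [1.0, 1.0]])  # annotated object pixel
ref_labels = np.array([0, 1])

mask = segment_by_nearest_neighbor(frame_emb, ref_emb, ref_labels)
print(mask)
```

In this toy case the two pixels whose embeddings lie near (1, 1) receive the object label and the other two receive the background label. A brute-force distance matrix is used for clarity; for full-resolution frames and many reference pixels, an approximate or batched search would likely be preferable.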

Figure 1: Interactive segmentation using our method: The white circles represent the clicks where the user has provided an annotation; the colored masks show the resulting segmentation in a subset of the sequence’s frames.

Our formulation has several main advantages. Firstly, the proposed method is highly efficient, as there is no fine-tuning at test time: it only requires a single forward pass through the embedding network and a nearest-neighbor search to process each frame. Secondly, our method provides the flexibility to support different types of user input (e.g. clicked points, scribbles, or segmentation masks) in a unified framework. Moreover, the embedding process is independent of user input, so the embedding vectors do not need to be recomputed when the user input changes, which makes our method ideal for the interactive scenario. We show an example in Figure 1, where the user aims to segment several objects in the video: the user can iteratively refine the segmentation result by gradually adding more clicks on the video, and gets feedback immediately after each click.
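To make the interactive advantage concrete, the following sketch shows one way the embeddings could be cached so that each new click only triggers a new nearest-neighbor search. The class `InteractiveSession`, its methods, the Euclidean distance, and the one-dimensional toy embeddings are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np


class InteractiveSession:
    """Holds precomputed embeddings; clicks only extend the reference set."""

    def __init__(self, video_emb):
        # (T, H, W, D) embeddings, assumed to be computed once per video.
        self.video_emb = video_emb
        self.ref_emb = []      # embedding at each clicked pixel
        self.ref_labels = []   # object id assigned to each click

    def add_click(self, t, y, x, label):
        # A click looks up an existing embedding; nothing is recomputed.
        self.ref_emb.append(self.video_emb[t, y, x])
        self.ref_labels.append(label)

    def segment(self):
        t, h, w, d = self.video_emb.shape
        refs = np.stack(self.ref_emb)                 # (N, D)
        labels = np.asarray(self.ref_labels)          # (N,)
        queries = self.video_emb.reshape(-1, d)       # (T*H*W, D)
        # Euclidean distance from every pixel to every click (an assumption).
        dist = np.linalg.norm(queries[:, None, :] - refs[None, :, :], axis=2)
        return labels[np.argmin(dist, axis=1)].reshape(t, h, w)


# Toy video: 2 frames, 1x3 pixels each, 1-D embeddings.
video_emb = np.array([[[[0.0], [0.4], [1.0]]],
                      [[[0.1], [0.6], [0.9]]]])
session = InteractiveSession(video_emb)
session.add_click(t=0, y=0, x=0, label=0)   # background click
session.add_click(t=0, y=0, x=2, label=1)   # object click
first = session.segment()

# The middle pixel of frame 1 was labeled as object; a corrective click fixes it.
session.add_click(t=1, y=0, x=1, label=0)
refined = session.segment()
print(first.tolist(), refined.tolist())
```

The corrective click changes only the reference set, so the second call to `segment` reuses the same cached `video_emb` array.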

The proposed method is evaluated on the DAVIS 2016 [Perazzi2016] and DAVIS 2017 [Pont-Tuset_arXiv_2017] datasets, in both the semi-supervised and interactive scenarios. In the context of semi-supervised Video Object Segmentation (VOS), where the fully annotated mask of the first frame is provided as input, we show that our algorithm presents the best trade-off between speed and accuracy, with 275 milliseconds per frame and an accuracy of 77.5% on DAVIS 2016. In contrast, better-performing algorithms start at 8 seconds per frame, and similarly fast algorithms reach only 60% accuracy. Our algorithm shines most in interactive segmentation: with only 10 clicks on the whole video, we reach 74.5% accuracy.
