Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning
This paper tackles the problem of video object segmentation, given some user annotation that indicates the object of interest. The problem is formulated as pixel-wise retrieval in a learned embedding space: we embed pixels of the same object instance into the vicinity of each other, using a fully convolutional network trained by a modified triplet loss as the embedding model. The annotated pixels are then set as references, and the rest of the pixels are classified using a nearest-neighbor approach. The proposed method supports different kinds of user input, such as a segmentation mask in the first frame (semi-supervised scenario) or a sparse set of clicked points (interactive scenario). In the semi-supervised scenario, we achieve results competitive with the state of the art but at a fraction of the computational cost (275 milliseconds per frame). In the interactive scenario, where the user is able to refine their input iteratively, the proposed method provides an instant response to each input and reaches quality comparable to competing methods with much less interaction.
An immeasurable amount of multimedia data is recorded and shared in the current era of the Internet. Among these data, video is one of the most common and rich modalities, although it is also one of the most expensive to process. Algorithms for fast and accurate video processing are therefore crucially important for real-world applications. Video object segmentation, i.e. classifying the set of pixels of a video sequence into the object(s) of interest and background, is among the tasks that, despite having numerous and attractive applications, cannot currently be performed at a satisfactory quality level and at an acceptable speed. The main objective of this paper is to fill this gap: we perform video object segmentation at an accuracy level comparable to the state of the art while keeping the processing time low enough to allow for real-time human interaction.
Towards this goal, we model the problem in a simple and intuitive, yet powerful and unexplored way: we formulate video object segmentation as pixel-wise retrieval in a learned embedding space. Ideally, in the embedding space, pixels belonging to the same object instance are close together and pixels from other objects are further apart. We build such an embedding space by training a Fully Convolutional Network (FCN) as the embedding model, using a modified triplet loss tailored for video object segmentation, where no clear correspondence between pixels is given. Once the embedding model is learned, inference at test time only needs to compute the embedding vectors with a forward pass for each frame, and then perform a per-pixel nearest-neighbor search in the embedding space to find the most similar annotated pixel. The object, defined by the user annotation, can therefore be segmented throughout the video sequence.
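The retrieval step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the two-dimensional embeddings are hand-written stand-ins for the output of the embedding network, the squared Euclidean distance is an assumed choice of metric, and the names `squared_distance` and `classify_pixels` are hypothetical.

```python
# Toy sketch of pixel-wise retrieval: each unlabeled pixel takes the label
# of its nearest annotated (reference) pixel in the embedding space.
# The embeddings below are made-up values, not outputs of a real network.

def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def classify_pixels(query_embeddings, reference_embeddings, reference_labels):
    """Assign every query pixel the label of its nearest reference pixel."""
    labels = []
    for query in query_embeddings:
        nearest = min(
            range(len(reference_embeddings)),
            key=lambda i: squared_distance(query, reference_embeddings[i]),
        )
        labels.append(reference_labels[nearest])
    return labels

# Reference pixels: annotated by the user (e.g. taken from a first-frame mask).
reference_embeddings = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0)]
reference_labels = ["object", "object", "background"]

# Query pixels: embeddings of unlabeled pixels in a later frame.
query_embeddings = [(0.2, 0.1), (0.9, 1.1), (0.4, 0.4)]

predicted = classify_pixels(query_embeddings, reference_embeddings, reference_labels)
print(predicted)  # ['object', 'background', 'object']
```

The brute-force search here is linear in the number of reference pixels for each query pixel; a practical system would presumably use a vectorized or approximate nearest-neighbor search instead.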
Our formulation has several main advantages. Firstly, the proposed method is highly efficient, as there is no fine-tuning at test time: processing each frame requires only a single forward pass through the embedding network and a nearest-neighbor search. Secondly, our method provides the flexibility to support different types of user input (e.g. clicked points, scribbles, or segmentation masks) in a unified framework. Moreover, the embedding process is independent of user input, so the embedding vectors do not need to be recomputed when the user input changes, which makes our method ideal for the interactive scenario. We show an example in Figure 1, where the user aims to segment several objects in the video: the user can iteratively refine the segmentation result by gradually adding more clicks on the video, and gets feedback immediately after each click.
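The following sketch illustrates why the independence between embeddings and user input suits the interactive scenario: the embeddings are stored once, and each new click only triggers a new nearest-neighbor search. The `InteractiveSession` class, its method names, and the toy single-frame embeddings are all hypothetical and serve only as an illustration under these assumptions.

```python
# Toy sketch of the interactive scenario: embeddings are computed once,
# and every additional click only re-runs the nearest-neighbor search.
# All values are made up; a single frame with five pixels is assumed.

def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

class InteractiveSession:
    def __init__(self, pixel_embeddings):
        # Stand-in for the embedding network output; never recomputed below.
        self.pixel_embeddings = pixel_embeddings
        self.clicks = {}  # pixel index -> label given by the user

    def add_click(self, pixel_index, label):
        """Record (or overwrite) a click, then return the updated segmentation."""
        self.clicks[pixel_index] = label
        return self.segment()

    def segment(self):
        """Label every pixel with the label of its nearest clicked pixel."""
        result = []
        for embedding in self.pixel_embeddings:
            nearest = min(
                self.clicks,
                key=lambda idx: squared_distance(embedding, self.pixel_embeddings[idx]),
            )
            result.append(self.clicks[nearest])
        return result

embeddings = [(0.0, 0.0), (0.1, 0.1), (0.6, 0.6), (0.9, 0.9), (1.0, 1.0)]
session = InteractiveSession(embeddings)

after_first = session.add_click(0, "object")       # only one reference so far
after_second = session.add_click(4, "background")  # user marks the background
after_third = session.add_click(2, "object")       # user corrects pixel 2

print(after_first)
print(after_second)
print(after_third)
```

With a single click every pixel receives that click's label; the second click separates object from background, and the third flips only the pixel the user corrected, since its neighbors remain closer to earlier clicks.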
The proposed method is evaluated on the DAVIS 2016 [Perazzi2016] and DAVIS 2017 [Pont-Tuset_arXiv_2017] datasets, in both the semi-supervised and interactive scenarios. In the context of semi-supervised Video Object Segmentation (VOS), where the full annotated mask in the first frame is provided as input, we show that our algorithm presents the best trade-off between speed and accuracy, with 275 milliseconds per frame and 77.5% accuracy on DAVIS 2016. In contrast, better-performing algorithms start at 8 seconds per frame, and similarly fast algorithms reach only 60% accuracy. Our algorithm shines most in interactive segmentation: with only 10 clicks on the whole video, we reach 74.5% accuracy.