Video Object Segmentation with Language Referring Expressions


Abstract

Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of a target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical and natural way of pointing out a target object, using language specifications can help to avoid drift as well as make the system more robust to complex dynamics and appearance variations. Leveraging recent advances of language grounding models designed for images, we propose an approach to extend them to video data, ensuring temporally coherent predictions. To evaluate our approach we augment the popular video object segmentation benchmarks DAVIS16 and DAVIS17 with language descriptions of target objects. We show that our approach performs on par with methods which have access to a pixel-level mask of the target object on DAVIS16 and is competitive with methods using scribbles on the challenging DAVIS17 dataset.

Figure 1: Examples of the proposed approach. Classical semi-supervised video object segmentation relies on an expensive pixel-level mask annotation of a target object in the first frame of a video. We explore a more natural and more practical way of pointing out a target object by providing a language referring expression.

1 Introduction

Video object segmentation has recently witnessed growing interest [1, 2, 3, 4]. Segmenting objects at pixel level provides a finer understanding of video and is relevant for many applications, e.g. augmented reality, video editing, rotoscoping, and summarisation.

Ideally, one would like to obtain a pixel-accurate segmentation of objects in video with no human input during test time. However, current state-of-the-art unsupervised video object segmentation methods [5, 6, 7] have trouble segmenting the target objects in videos containing multiple objects and cluttered backgrounds without any guidance from the user. Hence, many recent works [1, 8, 9] employ a semi-supervised approach, where a pixel-level mask of the target object is manually annotated in the first frame and the task is to accurately segment the object in successive frames. Although this setting has proven successful, it can be prohibitive for many applications. It is tedious and time-consuming for the user to provide a pixel-accurate segmentation, and it usually takes more than a minute to annotate a single instance ([10] reports the time required for polygon annotations; precisely delineating an object would take even more). To make video object segmentation more applicable in practice, instead of costly pixel-level masks, [4, 11, 12] propose to employ point clicks or scribbles to specify the target object in the first frame. This is much faster, requiring only a matter of seconds for an annotator to label an object with point clicks [11] or scribbles [13]. However, on small touchscreen devices, such as tablets or phones, providing precise clicks or drawing scribbles with fingers can be cumbersome and inconvenient for the user.

To overcome these limitations, in this work we propose a novel task - segmenting objects in video using language referring expressions - which is a more natural way of human-computer interaction. It is much easier for the user to say: “I want the man in a red sweatshirt performing breakdance to be segmented” (see Figure 1), than to provide a tedious pixel-level segmentation mask or struggle with drawing a scribble which does not straddle the object boundary. Moreover, employing language specifications can make the system more robust to occlusions, help to avoid drift and better adapt to the complex dynamics inherent to videos while not over-fitting to a particular view in the first frame.

We aim to investigate how far one can go while leveraging the advances in image-level language grounding and pixel-level segmentation in videos. We start by analyzing the performance of the state-of-the-art referring expression grounding models [14, 15] for localization of target objects in videos via bounding boxes. We point out their limitations and show a way to enhance their predictions by enforcing temporal coherency. Next we propose a convnet-based framework that utilizes language referring expressions for the video object segmentation task, where the output of the grounding model (a bounding box) is used as guidance for pixel-wise segmentation of the target object in each video frame. To the best of our knowledge, this is the first approach to address pixel-level object segmentation in video via language specifications. In addition, we show that video object segmentation using the mask annotation of the first frame can be further improved by using the supervision of language expressions, highlighting the complementarity of both modalities.

To evaluate the proposed approach we extend the recently released benchmarks for segmenting single and multiple objects in videos, DAVIS16 [16] and DAVIS17 [17], with language descriptions of the target objects. For fairness we collect the annotations using two different settings, asking the annotators to provide a description of the target object based on the first frame only as well as on the full video. On average each video has been annotated with several referring expressions, and providing a referring expression for a target object takes an annotator far less time than drawing a pixel-accurate mask.

We show that our language-supervised approach performs on par with semi-supervised methods which have access to the pixel-accurate object mask, and shows results comparable to techniques that employ scribbles on the challenging DAVIS17 dataset.

In summary, our contributions are the following. We show that high quality video object segmentation results can be obtained by employing language referring expressions, which allows a more natural and practical human-computer interaction. Moreover, we show that language descriptions are complementary to visual forms of supervision, such as masks, and can be exploited as an additional source of guidance for object segmentation. We conduct an extensive analysis of the performance of the state-of-the-art referring expression grounding models on video data and propose a way to improve their temporal coherency. We augment two well-known video object segmentation benchmarks with textual descriptions of target objects, which will be publicly released, and present a new task of segmenting objects in video using natural language referring expressions. We hope our findings will further promote research in the field of video object segmentation via language expressions and help to discover better techniques that can be used in realistic scenarios.

2 Related Work

2.1 Grounding natural language expressions

There has been an increasing interest in the task of grounding natural language expressions over the last few years [18, 19, 20, 21]. We group the existing works by the type of visual domain: images and video.

Image domain. Grounding natural language expressions is the task of localizing a given expression in an image with a bounding box [18, 22] or a segmentation mask [23, 21]. Referring expression comprehension is a closely related task, where the goal is to localize a non-ambiguous referring expression. Most existing approaches rely on external bounding box proposals which are scored to determine the top scoring box as the correct region [24, 20, 25, 26]. A few recent works explore methods of inferring object regions by a proposal generation network [27] or efficient subwindow search [28]. Multiple existing approaches model relationships between objects present in the scene [29, 30, 31]. In this work we choose two state-of-the-art grounding models for experimentation and analysis [14, 15]. DBNet [14] frames grounding as a classification task, where an expression and an image region serve as input and a binary classification decision is the output. A key component of this approach is the utilization of negative expressions and image regions to ensure discriminative training. DBNet currently leads on Visual Genome [32]. MattNet [15] is a modular network which “softly” decomposes referring expressions into three parts: subject, location, and relationship, each of which is processed by a different visual module. This allows MattNet to process referring expressions of general form, as each module can be “enabled” or “disabled” depending on the expression. MattNet achieves top performance on RefCOCO(g/+) [33, 22] both in terms of bounding box localization and pixel-wise segmentation accuracy.

Video domain. The progress made in image-level natural language grounding leads to an increasing interest in application to video. The recent work of [34] studies object tracking in video using language expressions. They introduce a dynamic convolutional layer, where a language query is used to predict visual convolutional filters. We compare to [34] on their proposed Lingual ImageNet Videos dataset in Section 5. [35] addresses object tracking in video with the language descriptions and human gaze as input. [36] constructs a referring expression dataset for video, where the task is to localize the expressions temporally. Our work falls in the same line of research, as we are exploring natural language as input for video object segmentation. To the best of our knowledge, this is the first work to apply natural language to this task.

2.2 Video Object Segmentation

Video object segmentation has witnessed considerable progress [37, 38, 39, 7, 40, 1, 9]. In the following, we group the related work into unsupervised and semi-supervised.

Unsupervised methods. Unsupervised methods assume no human input on the video during test time. They aim to group pixels consistent in both appearance and motion and extract the most salient spatio-temporal object tube. Several techniques exploit object proposals [41, 42, 5, 40], saliency [43, 44] and optical flow [38, 45]. Convnet-based approaches [3, 6, 7] cast video object segmentation as a foreground/background classification problem and feed to the network both appearance and motion cues. Because these methods do not have any knowledge of the target object, they have difficulties in videos with multiple moving and dominant objects and cluttered backgrounds.

Semi-supervised methods. Semi-supervised methods assume human input for the first frame, either by providing a pixel-accurate mask [39, 46, 1], clicks [47, 48, 11] or scribbles [49, 50, 4], and then propagate the information to the successive frames. Existing approaches focus on leveraging superpixels [51, 52], constructing graphical models [46, 39], utilizing object proposals [53] or employing optical flow and long term trajectories [54, 39]. Lately, convnets have been considered for the task [1, 55, 9]. These methods usually build the convnet architecture upon the semantic segmentation networks [56] and process each frame of the video individually. [1] proposes to fine-tune a pre-trained generic object segmentation network on the first frame mask of the test video to make it sensitive to the target object. [55] employs a similar strategy, but also provides a temporal context by feeding the mask from the previous frame to the network. Several methods extend the work of [1] by incorporating the semantic information [57] or by integrating online adaptation [9]. [58] introduces a pixel-level matching network, while [2] employs a recurrent network exploiting the long-term temporal information.

The above methods employ a pixel-level mask on the first frame. However, for many applications, particularly on small touchscreen devices, it can be prohibitive to provide a pixel-accurate segmentation. Hence, there has been a growing interest in integrating cheaper forms of supervision, such as point clicks [12, 11] or scribbles [4], into convnet-based techniques. In the spirit of these approaches, we aim to reduce the annotation effort on the first frame by using language referring expressions to specify the object. Our approach also builds upon convnets and exploits both linguistic and visual modalities.

3 Method

In this section we provide an overview of the proposed approach. Given a video with N frames and a textual query Q describing the target object, our aim is to obtain a pixel-level segmentation mask of the target object in every frame in which it appears.

We leverage recent advances in grounding referring expressions in images [14, 15] and pixel-level segmentation in videos [55, 6]. Our method consists of two main steps (see Figure 2). Using as input the textual query provided by the user, we first generate target object bounding box proposals for every frame of the video by exploiting referring expression grounding models designed for images only. Applying these models off-the-shelf results in temporally inconsistent and jittery box predictions (see Figure 3). Therefore, to mitigate this issue and make them more applicable to video data, we next enforce temporal consistency, which encourages bounding boxes to be coherent across frames. As a second step, using the obtained box predictions of the target object in every frame as guidance, we apply a convnet-based pixel-wise segmentation model to recover detailed object masks in each frame.

Figure 2: System overview. We first localize the target object via grounding model using the given referring expression and enforce temporal consistency of bounding boxes across frames. Next we apply a convnet-based pixel-wise segmentation model to recover detailed object masks.

3.1 Grounding objects in video by referring expressions

As discussed in Section 2, the task of natural language grounding is to automatically localize a region described by a given language expression. It is typically formulated as measuring the compatibility between a set of object proposals {b_i} and a given textual query Q. The grounding model outputs a matching score s(b_i, Q) for each box proposal b_i; the proposal with the highest matching score is selected as the predicted region.
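To make the per-frame selection concrete, a minimal sketch is given below; `select_top_proposal` is a hypothetical helper, not part of DBNet or MattNet, and simply implements the argmax over matching scores described above.

```python
import numpy as np

def select_top_proposal(matching_scores):
    """Per-frame argmax over grounding scores.

    matching_scores: list over frames, each a 1-D array with the matching
    score s(b_i, Q) of every box proposal in that frame.
    Returns the index of the selected proposal for each frame.
    """
    return [int(np.argmax(scores)) for scores in matching_scores]

# Toy example: three frames with 4, 3 and 5 proposals each.
scores = [np.array([0.10, 0.70, 0.20, 0.05]),
          np.array([0.30, 0.60, 0.10]),
          np.array([0.20, 0.20, 0.50, 0.40, 0.10])]
print(select_top_proposal(scores))  # -> [1, 1, 2]
```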

We employ two state-of-the-art referring expression grounding models, DBNet [14] and MattNet [15], to localize the object in each frame. Mask R-CNN [59] bounding box proposals are used as the initial set of proposals for both models, although DBNet was originally designed to utilize EdgeBox proposals [60]. However, using the grounding models designed for images and picking the highest scoring proposal for each video frame leads to temporally incoherent results. Even with simple textual queries, for adjacent frames that from a human perspective look very much alike, the referring model often outputs inconsistent predictions (see Figure 3). This indicates the inherent instability of grounding models trained on the image domain. To resolve this problem we propose to re-rank the object proposals by exploiting temporal structure along with the original matching scores given by the grounding model.

Temporal consistency. The goal of the temporal smoothing step is to improve temporal consistency and to reduce id-switches for target object predictions across frames. Since objects tend to move smoothly through space and time, there should be little change from frame to frame and the box proposals should have high overlap between neighboring frames. By finding temporally coherent tracks of an object that are spread out in time, we can focus on the predictions that consistently appear throughout the video and give less emphasis to objects that appear for only a short period of time.

The grounding model indicates how likely each box proposal b_i is to be the target object via its matching score s(b_i, Q). Each box proposal is then re-ranked based on its overlap with the proposals in other frames, its original objectness score given by [59] and its matching score from the grounding model. Specifically, for each proposal b_i in frame t we compute a new score that combines its matching score s(b_i, Q), its objectness score o_i and its intersection-over-union overlap IoU(b_i, b_j) with proposals b_j in other frames, weighted by the temporal distance between the two frames. In each frame we then select the proposal with the highest new score. The new scoring rewards temporally coherent predictions which likely belong to the target object and form a spatio-temporal tube.
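The sketch below illustrates the idea of the re-ranking step. Since the exact scoring formula is not reproduced here, the exponential decay over the temporal distance and the multiplicative combination of matching score, objectness and temporal support are assumptions of this sketch, not the paper's exact weighting; `rerank` and `iou` are hypothetical helper names.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def rerank(boxes, match, objness, decay=0.9):
    """Re-rank per-frame proposals for temporal coherence.

    boxes[t][i]   : i-th box proposal (x1, y1, x2, y2) in frame t
    match[t][i]   : grounding matching score s(b_i, Q)
    objness[t][i] : detector objectness score o_i
    Returns, per frame, the index of the top proposal after re-ranking.
    """
    T = len(boxes)
    picks = []
    for t in range(T):
        best, best_score = 0, -np.inf
        for i, b in enumerate(boxes[t]):
            # Temporal support: best temporally-weighted agreement with
            # high-scoring proposals in every other frame.
            support = 0.0
            for tp in range(T):
                if tp == t or not boxes[tp]:
                    continue
                support += max(decay ** abs(t - tp) * iou(b, bq) * match[tp][j]
                               for j, bq in enumerate(boxes[tp]))
            # Assumed combination of the three cues named in the text.
            score = match[t][i] * objness[t][i] * support
            if score > best_score:
                best, best_score = i, score
        picks.append(best)
    return picks
```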

3.2 Pixel-level video object segmentation

We next show how to output pixel-level object masks, exploiting the bounding boxes from grounding as guidance for the segmentation network. The boxes are used as input to guide the network towards the target object, providing its rough location and extent. The task of the network is to produce a pixel-level foreground/background segmentation mask using appearance and motion cues.

Approach. We model pixel-level segmentation as a box refinement task. The bounding box is transformed into a binary image (255 for the interior of the box, 0 for the background) and concatenated with the RGB channels of the input image and the optical flow magnitude, forming a 5-channel input for the network. Thus we ask the network to learn to refine the provided boxes into accurate masks. Fusing appearance and motion cues allows the network to better exploit video data and to handle both static and moving objects.
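As an illustration, the 5-channel input can be assembled as follows; the channel ordering and the [0, 255] range of the box channel follow the description above, while the exact preprocessing (e.g. mean subtraction) is omitted and `make_network_input` is a hypothetical helper name.

```python
import numpy as np

def make_network_input(rgb, flow_mag, box):
    """Stack RGB, optical-flow magnitude and a binary box image
    into the 5-channel input described above.

    rgb      : H x W x 3 image
    flow_mag : H x W flow magnitude, already scaled to [0, 255]
    box      : (x1, y1, x2, y2) box from the grounding step
    """
    h, w = rgb.shape[:2]
    box_channel = np.zeros((h, w), dtype=np.float32)
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    box_channel[max(y1, 0):min(y2, h), max(x1, 0):min(x2, w)] = 255.0
    return np.dstack([rgb.astype(np.float32),
                      flow_mag.astype(np.float32),
                      box_channel])  # -> H x W x 5
```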

We make one single pass over the video, applying the model per frame. The network does not keep a notion of the specific appearance of the object, in contrast to [55, 1], where the model is fine-tuned at test time to learn the appearance of the target object. Nor do we perform online adaptation as in [9], where the model is updated on its previous predictions while processing video frames. This makes the system more efficient at inference time and thus more suitable for real-world applications.

Similar to [55], we train the network on static images, employing the saliency segmentation dataset [61] which contains a diverse set of objects. The bounding box is obtained from the ground truth masks. To make the system robust at test time to sloppy boxes from the grounding model, we augment the ground truth box by randomly jittering its coordinates (uniformly, by a fraction of the original box width and height). We synthesize optical flow from static images. Following [62] we apply affine transformations to both the background and the foreground object to simulate camera and object motion between neighboring frames. This simple strategy allows us to train on a diverse set of static images, while exploiting motion information during test time. We train the network on many triplets of RGB images, synthesized flow magnitude images and loose boxes in order for the model to generalize well to the varying localization quality of the boxes given by the grounding model and to different object dynamics.
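A minimal sketch of the box-jittering augmentation is given below; the jitter fraction `rel=0.1` is a placeholder, since the exact fraction of the box width/height is not reproduced in the text above, and `jitter_box` is a hypothetical helper name.

```python
import numpy as np

def jitter_box(box, im_w, im_h, rel=0.1, rng=np.random):
    """Randomly perturb a ground-truth box to simulate sloppy grounding boxes.

    Each corner is shifted uniformly by up to `rel` of the box width/height;
    the resulting box is clipped to the image extent.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx = rng.uniform(-rel, rel, size=2) * w
    dy = rng.uniform(-rel, rel, size=2) * h
    return (max(0.0, x1 + dx[0]), max(0.0, y1 + dy[0]),
            min(float(im_w), x2 + dx[1]), min(float(im_h), y2 + dy[1]))
```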

During inference we use the state-of-the-art optical flow estimation method FlowNet2.0 [63]. We compute the optical flow magnitude by subtracting the median motion for each frame and averaging the magnitudes of the forward and backward flow. The obtained image is further scaled to [0, 255] to match the range of the RGB channels.
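The flow-magnitude image can be computed roughly as follows; interpreting "subtracting the median motion" as subtracting the per-frame median flow vector, and the min-max rescaling to [0, 255], are assumptions of this sketch.

```python
import numpy as np

def flow_magnitude_image(flow_fw, flow_bw):
    """Turn forward/backward optical flow (H x W x 2 each) into one image:
    subtract the per-frame median motion, average the forward and backward
    magnitudes, then rescale to [0, 255]."""
    def centered_magnitude(flow):
        median = np.median(flow.reshape(-1, 2), axis=0)
        return np.linalg.norm(flow - median, axis=2)

    mag = 0.5 * (centered_magnitude(flow_fw) + centered_magnitude(flow_bw))
    mag -= mag.min()
    if mag.max() > 0:
        mag = mag / mag.max() * 255.0
    return mag
```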

Query: "A woman with a stroller."
Query: "A girl riding a horse."
W/o temporal consistency With temporal consistency
Figure 3: Qualitative results of referring expression comprehension with and without temporal consistency. The results are obtained using MattNet [15] trained on RefCOCO [33].

Network. As our network architecture we use ResNet-101 [64]. We adapt the network to the segmentation task following the procedure of [56] and employing atrous convolutions [65, 66] with hybrid rates [67] within the last two blocks of ResNet to enlarge the receptive field as well as to alleviate the “gridding” issue. After the last block, we apply spatial pyramid pooling [66], which aggregates features at multiple scales by applying atrous convolutions with different rates, and augment it with the image-level features [68, 69] to exploit better global context. The network is trained using a standard cross-entropy loss (all pixels are equally weighted). The final logits are upsampled to the ground truth resolution to preserve finer details for back-propagation.
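A PyTorch-style sketch of such a segmentation head on top of the (not shown) ResNet-101 backbone is given below; the atrous rates, channel widths and the plain 1x1 projection are illustrative choices under the assumptions stated in the comments, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class ASPPHead(nn.Module):
    """Atrous spatial pyramid pooling plus image-level features, producing
    2-channel foreground/background logits. Rates/widths are illustrative."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # 1x1 branch plus three dilated 3x3 branches at different rates.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        # Image-level (global) context branch.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), 2, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        pooled = nn.functional.interpolate(self.image_pool(x), size=(h, w),
                                           mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

# Usage: head = ASPPHead(); logits = head(torch.randn(1, 2048, 32, 32))  # (1, 2, 32, 32)
```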

For network initialization we use a model pre-trained on ImageNet [64]. The new layers are initialized using the “Xavier” strategy [70]. The network is trained on MSRA [61] for segmentation. To reduce the domain shift we fine-tune the network on the training sets of DAVIS16 [16] and DAVIS17 [17], respectively. We employ SGD with a polynomial learning rate policy. The network is first trained on MSRA and then fine-tuned on the training set of DAVIS16/DAVIS17.

Other sources of supervision. Additionally we consider variants of the proposed model using different sources of supervision. Our approach is flexible and can take advantage of the first frame mask annotation as well as language. We describe how language can be used on top of the mask supervision, improving the robustness of the system against occlusions and dynamic backgrounds (see Section 6 for results).

Mask.

Here we discuss a variant that uses only the first frame mask supervision during test time. The network is initialized with the bounding box obtained from the object mask in the first frame; for successive frames, the prediction from the preceding frame warped with the optical flow (as in [55]) is used as the input box for the next frame. Following [55, 1] we fine-tune the model on an augmented set obtained from the first frame image and mask, to learn the specific properties of the object.
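A simplified sketch of how the input box for the next frame can be derived from the previous prediction is given below; the nearest-neighbour forward warp is an assumption of this sketch, as the text only states that the preceding prediction is warped with optical flow as in [55].

```python
import numpy as np

def box_from_warped_mask(prev_mask, flow_fw):
    """Warp the previous frame's mask forward with optical flow and return
    the bounding box of the warped pixels.

    prev_mask : H x W binary mask predicted in frame t-1
    flow_fw   : H x W x 2 forward flow from frame t-1 to frame t
    Returns an (x1, y1, x2, y2) box, or None if the mask is empty.
    """
    h, w = prev_mask.shape
    ys, xs = np.nonzero(prev_mask)
    if len(ys) == 0:
        return None
    xw = np.clip((xs + flow_fw[ys, xs, 0]).round().astype(int), 0, w - 1)
    yw = np.clip((ys + flow_fw[ys, xs, 1]).round().astype(int), 0, h - 1)
    return (xw.min(), yw.min(), xw.max() + 1, yw.max() + 1)
```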

Mask + Language.

We show that using language supervision is complementary to the first frame mask. Instead of relying on the preceding frame prediction as in the previous paragraph, we use the bounding boxes obtained from the grounding model after the temporal consistency step. We initialize with the ground truth box in the first frame and fine-tune the network on the 1st frame.

4 Collecting referring expressions for video

To validate our approach we employ two popular video object segmentation datasets, DAVIS16 [16] and DAVIS17 [17]. Our task is to localize and provide a pixel-level mask of an object in all video frames given a language referring expression obtained either by looking at the first frame only or at the full video. These two datasets introduce various challenges, containing videos with single or multiple salient objects, crowded scenes, similar looking instances, occlusions, camera view changes, fast motion, etc.

ID 1: "A woman in a white t-shirt in the back" ID 1: "A girl in the back wearing a white t-shirt"
ID 2: "A guy in the front" ID 2: "A person capturing video while walking"
ID 1: "A man in a grey t-shirt and yellow trousers" ID 1: "A man in a grey shirt walking through the crossing"
ID 2: "A woman in a black shirt" ID 2: "A woman walking through the crossing"
ID 3: "A white truck on the road" ID 3: "A white truck moving from the left to right"
First frame annotation Full video annotation
Figure 4: Example of collected annotations provided for the first frame (left) vs. the full video (right). Full video annotations include descriptions of activities and overall are more complex.

DAVIS16 [16] consists of 30 training and 20 test videos of diverse object categories, with all frames annotated with pixel-level accuracy. Note that in this dataset only a single object is annotated per video. For the multiple object video segmentation task we consider DAVIS17 [17]. Compared to DAVIS16, this is a more challenging dataset, with multiple objects annotated per video and more complex scenes with more distractors, occlusions, smaller objects, and fine structures. Overall, DAVIS17 consists of a training set with 60 videos, and a validation/test-dev/test-challenge set with 30 sequences each.

As our goal is to segment objects in videos using language specifications, we augment all objects annotated with mask labels in DAVIS16 and DAVIS17 with non-ambiguous referring expressions. We follow the work of [22] and ask the annotator to provide a language description of the object, which has a mask annotation, by looking only at the first frame of the video. Then another annotator is given the first frame and the corresponding description, and asked to identify the referred object. If the annotator is unable to correctly identify the object, the description is corrected to remove ambiguity and to specify the object uniquely. We have collected two referring expressions per target object, annotated by non-computer vision experts (Annotator 1, 2). As videos may contain several annotated objects, each video has multiple expressions annotated. In many applications, such as video editing or video-based advertisement, the user has access to the full video. Providing a language query which is valid for all frames might decrease the editing time and result in more coherent predictions. Thus, on DAVIS17 we asked the workers to provide a description of the object by looking at the whole video sequence. We have collected one expression of this type per target object.

For first frame annotations we notice that descriptions given by Annotator 1 are longer than the ones by Annotator 2. We evaluate the effect of description length on grounding performance in Section 5. Besides, the expressions relevant to the full video mention verbs more often than the first frame descriptions. This is intuitive, as referring to an object which changes its appearance and position over time may require mentioning its actions. Adjectives appear frequently in the expressions; most of them refer to colors, followed by shapes and sizes and spatial/ordering words. The full video expressions also have a higher number of adverbs and prepositions, and overall are more complex than the ones provided for the first frame; see Figure 4 for examples.

ID 1: "A girl with blonde hair dressed in blue".
ID 1: "A brown camel in the front".
ID 1: "A green motorbike". ID 2: "A man riding a motorbike".
ID 1: "A man on the left". ID 2: "A man on the right". ID 3: "A cardboard box held by a man".
ID 1: "A black scooter ridden by a man". ID 2: "A man in a suit riding a scooter".
Figure 5: Video object segmentation qualitative results using only referring expressions as supervision on DAVIS16 and DAVIS17, val sets. Frames sampled along the video duration.

5 Evaluation of natural language grounding in video

In this section we discuss the performance of natural language grounding models on video data. We experiment with DBNet [14] and MattNet [15]. DBNet is trained on Visual Genome [32], which contains images from MS COCO [10] and YFCC100M [71] and spans thousands of object categories. MattNet is trained on referring expressions for MS COCO images [10], specifically RefCOCO and RefCOCO+ [33]. Unlike RefCOCO, which has no restrictions on the expressions, RefCOCO+ contains no spatial words and rather focuses on object appearance. Both aforementioned models rely on external bounding box proposals, such as EdgeBox [60] or Mask R-CNN [59].

Datasets.

We carry out most of our evaluation on DAVIS16 and DAVIS17 with the referring expressions introduced in Section 4. For the natural language grounding task we additionally consider Lingual ImageNet Videos [34], which provides referring expression annotations for a subset of the ImageNet Video Object Detection dataset [72]. The dataset is split into a training and a validation set.

Evaluation criteria.

To evaluate the localization quality we employ the intersection-over-union overlap (IoU) of the top scored box proposal with the ground truth bounding box, averaged across all queries. The performance on Lingual ImageNet [34] is measured in terms of the AUC (area under the curve) score metric, following [34].
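For completeness, the localization metric can be written as below; `box_iou` and `grounding_miou` are hypothetical helper names used only for illustration.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def grounding_miou(top_boxes, gt_boxes):
    """IoU of the top-scored box with the ground-truth box, averaged over queries."""
    return float(np.mean([box_iou(p, g) for p, g in zip(top_boxes, gt_boxes)]))
```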

5.1 / referring expression grounding

Table 1 reports the performance of the grounding models on DAVIS16 and DAVIS17 referring expressions. In the following we summarize our key observations.

(1) We see the effect of replacing EdgeBox with Mask R-CNN object proposals for the DBNet model (54.1 to 64.9 mIoU on DAVIS16). Employing better proposals significantly improves the quality of this grounding method, thus we rely on Mask R-CNN proposals in all the following experiments. (2) We note the stability of grounding performance across the two annotations (see Δ(A1,A2)), showing that the grounding methods are quite robust to variations in language expressions. (3) The grounding models trained on images are not stable across frames, even when only small changes in appearance occur (e.g. see Figure 3). Our proposed temporal consistency technique benefits both methods (e.g. DBNet: 64.9 vs. 68.8 on DAVIS16, MattNet: 51.6 vs. 52.8 on DAVIS17). (4) On both datasets MattNet performs better than DBNet. The gap is particularly large on DAVIS16, as DAVIS16 contains videos of a single foreground moving object, while DBNet is trained on the densely labeled Visual Genome dataset with many foreground and background objects. (5) On DAVIS16, MattNet trained on RefCOCO+ outperforms MattNet trained on RefCOCO, while both perform similarly on DAVIS17. As RefCOCO+ contains no spatial words, MattNet trained on this dataset is more accurate in localizing queries mentioning object appearance. (6) Compared to DAVIS16, DAVIS17 is significantly more challenging, as it contains cluttered scenes with multiple moving objects (e.g. 71.4 vs. 52.8 for MattNet with temporal consistency). (7) When comparing results on expressions provided for the first frame versus expressions provided for the full video, we observe diverging trends. While DBNet is able to improve its performance (from 49.6 to 50.2), MattNet performance decreases (from 52.8 to 51.3). We attribute this to the fact that DBNet is trained on the more diverse Visual Genome descriptions.

Method    Obj. proposals   Train. data   Temp. cons.   DAVIS16 1st frame mIoU (Δ A1,A2)   DAVIS17 1st frame mIoU (Δ A1,A2)   DAVIS17 full video mIoU
DBNet     EdgeBox          Vis.Gen.      no            54.1 (1.0)                         -                                  -
DBNet     Mask R-CNN       Vis.Gen.      no            64.9 (2.1)                         48.4 (1.3)                         49.6
MattNet   Mask R-CNN       RefCOCO       no            67.1 (2.2)                         51.6 (1.6)                         50.3
MattNet   Mask R-CNN       RefCOCO+      no            69.1 (3.2)                         50.8 (1.2)                         50.1
DBNet     Mask R-CNN       Vis.Gen.      yes           68.8 (0.6)                         49.6 (1.6)                         50.2
MattNet   Mask R-CNN       RefCOCO       yes           71.4 (0.2)                         52.8 (0.5)                         51.3
MattNet   Mask R-CNN       RefCOCO+      yes           72.5 (0.3)                         52.3 (0.0)                         51.2
Table 1: Comparison of grounding models DBNet [14] and MattNet [15] on the DAVIS16 training set and the DAVIS17 val set. (Δ A1,A2) denotes the difference in mIoU between Annotator 1 and Annotator 2.
Method    Train. data   Obj. prop.    CO.    non-CO.   Sp.    non-Sp.   Ve.    no Ve.   Expr. length S / M / L   Num. obj. 1 / 2-3 / >3
DBNet     Vis.Gen.      Mask R-CNN    55.5   37.3      36.5   55.7      37.4   52.0     61.8 / 49.2 / 33.6       79.5 / 49.3 / 22.6
MattNet   RefCOCO       Mask R-CNN    59.6   36.9      33.8   58.5      55.8   51.7     63.9 / 50.2 / 49.1       86.1 / 51.2 / 16.1
DBNet     Vis.Gen.      Oracle        79.3   59.0      47.7   81.7      70.3   77.6     84.8 / 69.9 / 67.9       100 / 73.8 / 37.2
MattNet   RefCOCO       Oracle        73.2   46.6      42.2   72.5      74.7   62.9     79.0 / 61.1 / 59.0       100 / 64.5 / 23.2
Table 2: Comparison of grounding models (mIoU) on language expressions/videos with different attributes on the DAVIS17 val set. Results are obtained after the temporal consistency step, averaged over the two annotators. Attributes: COCO/non-COCO objects, Spatial/non-Spatial expressions, Verbs/no Verbs, Expression length (Short, Medium, Long) and Number of objects.

Attribute-based analysis. Next we perform a more detailed analysis of the grounding models on DAVIS17. We split the textual queries/videos into subsets where a certain attribute is present and report the averaged results for each subset. Table 2 presents attribute-based grounding performance on first-frame based expressions, averaged across annotators. To estimate the upper bound performance and the impact of imperfect bounding box proposals we add an Oracle comparison, where performance is reported on the ground-truth object boxes. We summarize our findings in the following.

(1) As MattNet is trained on MS COCO images and both models rely on MS COCO-based Mask R-CNN proposals, we compare performance for expressions which include COCO versus non-COCO objects. Both models drop in performance on non-COCO expressions, showing the impact of the domain shift to DAVIS17 (e.g. 59.6 vs. 36.9 for MattNet). Even DBNet, which is trained on a larger training corpus, suffers from the same effect (55.5 vs. 37.3). (2) We label the expressions as “spatial” if they include some of the spatial words (e.g. left, right). Such queries are significantly harder for all models (e.g. 33.8 vs. 58.5 for MattNet). This is due to changes over time in the object’s viewpoint and position, which may lead to the first-frame expressions becoming challenging or even invalid for later frames. This shows an advantage of non-spatial expressions. However, in some cases the objects may be very similar, so appearance alone cannot be used to discriminate between them. (3) Verbs are important as they allow to disambiguate the object in video based on its actions. The presence of verbs in expressions is a challenging factor for DBNet trained on Visual Genome, while MattNet does significantly better (37.4 vs. 55.8 on expressions with verbs). (4) Expression length is also an important factor. We quantize our expressions into Short (<4 words), Medium (4–6 words) and Long (>6 words). All models demonstrate a similar drop in performance as expression length increases (e.g. 63.9/50.2/49.1 for MattNet). (5) Videos with more objects are more difficult, as these objects also tend to be very similar, such as e.g. fish in a tank (e.g. 86.1/51.2/16.1 for MattNet on videos with 1, 2-3 and >3 objects). (6) From the Oracle performance on COCO versus non-COCO expressions, we see that all models are able to significantly improve their performance even for non-COCO objects (e.g. 37.3 to 59.0 for DBNet). DBNet benefits more than MattNet from Oracle boxes, showing its higher potential to generalize to a new domain given better proposals.

Method                      Supervision      AUC score
Tracking by language [34]   Language         26.3
Tracking by language [34]   Box              47.9
Tracking by language [34]   Box + Language   49.4
DBNet                       Language         54.0
MattNet                     Language         60.8
Table 3: Comparison of grounding models on Lingual ImageNet Videos, val set.

5.2 Lingual ImageNet Videos referring expression grounding

We also compare to the related work of [34], who perform tracking of objects using language specifications, on their Lingual ImageNet Videos dataset. Table 3 presents grounding results reported by [34], including tracking by language only, tracking given the ground-truth bounding box on the first frame, and the combined approach. Our method is based on language input only, specifically, we report the results after the temporal consistency step applied to DBNet and MattNet predictions. As we see both models significantly outperform [34], even when [34] has access to the ground-truth bounding box on the first frame.

6 Video object segmentation results

In this section we present single and multiple video object segmentation results using natural language referring expressions on two datasets: DAVIS16 [16] and DAVIS17 [17]. In addition, we experiment with fusing two complementary sources of information, employing both the pixel-level mask and language supervision on the first frame. All our results are obtained using the bounding boxes given by the MattNet model [15] trained on RefCOCO [33] after the temporal consistency step (see Section 3.1).

Evaluation criteria.

For evaluation we use the intersection-over-union measure (mIoU, also called the Jaccard index J) between the ground truth and the predicted segmentation, averaged across all video sequences and all frames. For DAVIS17 we also use the global mean (G) of the mIoU (J) and boundary (F) measures [17].
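The region measure and the global mean can be sketched as below; the boundary measure F itself is not implemented here, and `jaccard` and `global_mean` are hypothetical helper names.

```python
import numpy as np

def jaccard(pred_mask, gt_mask):
    """Region similarity J: IoU between binary prediction and ground truth."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 1.0

def global_mean(j_scores, f_scores):
    """Global mean G: average of the mean region (J) and boundary (F) scores."""
    return 0.5 * (np.mean(j_scores) + np.mean(f_scores))
```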

6.1 DAVIS16 single object segmentation

Table 4 compares our results to previous work on DAVIS16 [16]. As we employ MattNet [15], which exploits Mask R-CNN [59], we also report the oracle results of Mask R-CNN segment proposals, where each segment is picked based on the highest overlap with the ground truth. Even with the oracle assignment of segment proposals, Mask R-CNN under-performs compared to the state of the art for semi-supervised video object segmentation (71.5 vs. 81.7 for OnAVOS [9]). DAVIS16 has very detailed object annotations and requires a more complex segmentation module than the one in [59].

Our method, while only exploiting language supervision, shows competitive performance, on par with techniques which use a pixel-level mask on the first frame (82.2 vs. 81.7 for OnAVOS [9]). This shows that high quality results can be obtained via a more natural way of human-computer interaction, referring to an object via language, making video object segmentation techniques more applicable in practice. Note that [57, 9] show superior results to our approach (86.0 and 86.1 mIoU, respectively). However, they employ additional cues, by incorporating semantic information [57] or integrating online adaptation into the pipeline of [1]. Potentially, these techniques could also be applied to our method, though this is out of the scope of this paper.

Compared to the approaches which use point click supervision [12, 11], our method shows superior performance (82.2 vs. 80.6 and 80.9). This indicates that language can be successfully utilized as an alternative and cheaper form of supervision for video object segmentation, on par with clicks and scribbles.

Supervision       Method            mIoU
Oracle            Mask R-CNN [59]   71.5
Unsupervised      SegFlow [3]       67.4
Unsupervised      FusionSeg [6]     70.7
Unsupervised      LVO [7]           75.9
Unsupervised      ARP [40]          76.2
1st frame mask    CTN [73]          73.5
1st frame mask    SegFlow [3]       76.1
1st frame mask    MaskTrack [55]    79.7
1st frame mask    OSVOS [57]        80.2
1st frame mask    MaskRNN [2]       80.4
1st frame mask    OnAVOS [9]        81.7
1st frame mask    Our               82.3
Clicks            iVOS [12]         80.6
Clicks            DEXTR [11]        80.9
Language          Our               82.2
Mask + Language   Our               83.9
Table 4: Comparison of video object segmentation results on DAVIS16, val set.

Mask and language. In Table 4 we also report the results for variants using only mask supervision on the first frame or combining both mask and language (see Section 3.2 for details). Notice that employing either mask or language results in comparable performance (82.3 vs. 82.2), while fusing both modalities leads to a further improvement (83.9 vs. 82.3). This shows that referring expressions are complementary to visual forms of supervision and can be exploited as an additional source of guidance for segmentation, on top of not only pixel-level masks, but potentially also scribbles and point clicks.

Table 5 presents a more detailed evaluation using video attributes. We report the averaged results on a subset of sequences where a certain challenging attribute is present. Note that using language only leads to more robust performance for videos with low resolution, camera shake, background clutter and occlusions without the need for an expensive pixel-level mask. When utilizing both mask and language we observe that the system becomes consistently more robust to various video challenges (e.g. fast motion, scale variation, motion blur, etc.) and compares favorably to mask only on all attributes, except appearance change. Overall, employing language specifications can help the model to better handle occlusions, avoid drift and better adapt to complex dynamics inherent to video.

Supervision AC LR SV SC CS DB BC FM MB DEF OCC
Language 79.3 77.9 73.6 77.0 85.2 65.1 84.5 76.2 76.8 83.8 79.3
Mask 80.5 77.1 74.6 77.0 85.1 66.2 83.3 78.2 78.2 84.6 78.2
Mask + Lang. 80.2 77.8 75.4 79.6 86.4 69.9 84.8 78.5 79.7 85.4 81.4
Table 5: Attribute-based performance using different forms of supervision on , val set. AC: appearance change, LR: low resolution, SV: scale variation, SC: shape complexity, CS: camera shake, DB: dynamic background, BC: background clutter, FM: fast motion, MB: motion blur, DEF: deformation, OCC: occlusions. See Section 6.1 for more details.

6.2 DAVIS17 multiple object segmentation

Supervision       Method                mIoU (J)   J&F (G)
Oracle            Mask R-CNN [59]       52.8       53.3
Oracle            Grounding             54.9       57.4
Oracle            Box proposals         42.1       45.3
1st frame mask    OSVOS [1]             52.1       57.0
1st frame mask    SPN [74]              54.0       57.6
1st frame mask    OnAVOS [75]           57.0       59.4
1st frame mask    MaskRNN [2]           60.5       -
1st frame mask    Our                   58.0       60.8
Scribbles         CNN lin. class. [4]   -          39.3
Scribbles         Scribble-OSVOS [4]    -          39.9
Language          Our                   37.3       39.3
Language          Our, COCO             45.0       47.5
Language          Our, non-COCO         27.5       29.4
Mask + Language   Our                   59.0       62.2
Table 6: Comparison of semi-supervised video object segmentation methods on DAVIS17, val set. The COCO and non-COCO rows are reported on the subsets of DAVIS17 containing/not containing COCO objects.

Table 6 presents results on DAVIS17 [17]. The lower numbers in comparison with Table 4 indicate that DAVIS17 is significantly more difficult than DAVIS16. Even when employing mask supervision on the first frame, the dataset presents a challenging task and there is much room for improvement. The semi-supervised methods perform well on foreground-background segmentation, but have problems separating multiple foreground objects, handling small objects and preserving the correct object identities [17].

Compared to mask supervision, using language descriptions significantly under-performs. We believe that one of the main problems is the relatively unstable behavior of the underlying grounding model. There are many identity switches, which are heavily penalized by the evaluation metric, as every pixel should be assigned to one object instance. We conducted an oracle experiment assigning Mask R-CNN proposals to the correct object ids and then performing the pixel-level segmentation (denoted “Oracle - Grounding”). We observe a significant increase in performance (from 37.3 to 54.9 mIoU), making the results competitive to mask supervision. If we utilize Mask R-CNN segment proposals for the oracle case, the result is lower (52.8 vs. 54.9) than using our segmentation model on top. The underlying choice of proposals for the grounding model also has an effect. If the object is not detected by Mask R-CNN, the grounding model has no chance to recover the correct instance. To evaluate the influence of proposals we conduct an oracle experiment where the ground truth boxes are exploited in the grounding model (denoted “Oracle - Box proposals”). With oracle boxes we observe an increase in performance (from 37.3 to 42.1); however, recovering the correct identities still poses a problem for the grounding model.

Another factor influencing our performance is the domain shift between the training and test data. Both Mask R-CNN and MattNet are trained on MS COCO [10] and have trouble recovering instances not belonging to COCO categories. We split the validation set into COCO and non-COCO objects/language queries and evaluate separately on the two subsets. As in Section 5, we observe much higher results for queries in the COCO subset (45.0 vs. 27.5), indicating the problem of generalization from training to test data.

The method which exploits scribble supervision [4] performs on par with our approach. Note that even for scribble supervision the task remains difficult.

Mask and language. In Table 6 we also report the results for variants of our approach using only mask supervision or combining mask and language. Employing language on top of the mask leads to an increase in performance over using the mask only (from 58.0 to 59.0), again showing the complementarity of both sources of supervision.

Figure 5 provides qualitative results of our method using only language as supervision. We observe successful handling of similar looking objects, fast motion, deformations and partial occlusions.

Discussion. Our results indicate that language alone can be successfully used as an alternative and a more natural form of supervision. Particularly, high quality results can be achieved for videos with the salient target object. Videos with multiple similar looking objects pose a challenge for grounding models, as they have problems preserving object identities across frames. Experimentally we show that better proposals, grounding and proximity of training and test data can further boost the performance for videos with multiple objects. Language is complementary to mask supervision and can be exploited as an additional source of guidance for segmentation.

7 Conclusion

We present an approach that uses language referring expressions to identify a target object for video object segmentation. To show the potential of the proposed approach, we extended two well-known video object segmentation benchmarks with textual descriptions of target objects. Our experiments indicate that language alone can be successfully exploited to obtain high quality segmentations of objects in videos. While allowing a more natural human-computer interaction, using guidance from language descriptions can also make video segmentation more robust to occlusions, complex dynamics and cluttered backgrounds. We show that classical semi-supervised video object segmentation which uses the mask annotation on the first frame can be further improved by the use of language descriptions. We believe there is a lot of potential in fusing lingual (referring expressions) and visual (clicks, scribbles or masks) forms of supervision for object segmentation in video. We hope that our results encourage the research for video object segmentation with referring expressions and foster discovery of new techniques applicable in realistic settings, which discard tedious pixel-level annotations.

Supplementary material

Appendix A Content

This supplementary material provides additional quantitative and qualitative results and is structured as follows.

Section B provides additional examples of the collected referring expressions for the video object segmentation task (see Figure 6).

Section C provides an ablation study (Table 7), additional evaluation metrics for DAVIS16 (Table 8) and comparisons of different grounding models, the effect of temporal consistency and annotation types on the video object segmentation task (Table 9). We also include more qualitative examples for the Language, Mask and Mask + Language approaches (see Figures 7-9).

Appendix B Referring expressions for video object segmentation

As our goal is to segment objects in videos using language specifications, we augment all objects annotated with mask labels in DAVIS16 [16] and DAVIS17 [17] with non-ambiguous referring expressions. We present additional examples of collected referring expressions in Figure 6.

ID 1: "A man on the left wearing blue" ID 1: "A man in a blue dress on the left getting punched"
ID 2: "A man on the right wearing red" ID 2: "A man in a red dress on the right punching"
ID 3: "A referee in the middle in white" ID 3: "A man in a white shirt and black shorts in the middle"
ID 1: "A brown sheep in the middle" ID 1: "A brown sheep in the front"
ID 2: "A sheep on the left with a black face" ID 2: "A grey sheep with dark face moving behind fence"
ID 3: "A black lamb with white nose" ID 3: "A black baby sheep"
ID 4: "A white lamb next to a brown sheep" ID 4: "A white baby sheep closer to a brown sheep"
ID 5: "A white lamb in the middle next to a white sheep" ID 5: "A white baby sheep farther from a brown sheep"
ID 1: "A black bicycle" ID 1: "A bicycle moving on the road"
ID 2: "A backpack" ID 2: "A backpack worn by a guy"
ID 3: "A black board" ID 3: "A longboard"
ID 4: "A man on a bicycle in a black jacket" ID 4: "A guy riding a bicycle"
ID 5: "A man in a yellow t-shirt" ID 5: "A person rolling over longboard"
First frame annotation Full video annotation
Figure 6: Example of collected annotations provided for the first frame (left) vs. the full video (right). Full video annotations include descriptions of activities and overall are more complex than the ones provided for the first frame.

Appendix C Video object segmentation

Variant mIoU mIoU
Full system -
No box jittering during training
No optical flow magnitude as input channel
Backbone architecture: ResNet-101 → VGG-16 [55]
Table 7: Ablation study of our method on the training set. Given our full system, we remove one ingredient at a time to see each individual contribution.

c.1 Ablation study

We validate the contributions of the components of our method by presenting an ablation study, summarized in Table 7, on the training set. We report the impact of bounding box jittering during training, of using the optical flow magnitude as an input channel, and the effect of the ResNet-101 backbone architecture for video object segmentation.

Augmenting the ground truth boxes by random jittering makes the system more robust at test time to sloppy boxes, and employing motion cues allows the model to better handle moving objects. Exploiting the proposed network architecture instead of the one used in [55] provides a further boost, yielding more detailed object masks.

c.2 Additional metrics for DAVIS16

We report video object segmentation results for the DAVIS16 benchmark in Table 8, using the evaluation metrics proposed in [16]. Three measures are used: region similarity in terms of intersection-over-union (J, higher is better), contour accuracy (F, higher is better), and temporal instability of the masks (T, lower is better). See [16] for more details. Note that using only language supervision results in a smaller decay over time for the J and F measures and a better overall temporal stability compared to employing pixel-level mask supervision on the first frame.

Supervision       Method            J Mean   J Recall   J Decay   F Mean   F Recall   F Decay   T Mean
Oracle            Mask R-CNN [59]   71.5     87.3       5.9       72.4     84.6       6.8       24.8
Unsupervised      NLC [43]          55.1     55.8       12.6      52.3     51.9       11.4      42.5
Unsupervised      FST [76]          55.8     64.9       0.0       51.1     51.6       2.9       36.6
Unsupervised      SegFlow [3]       67.4     81.4       6.2       66.7     77.1       5.1       28.2
Unsupervised      MP-Net [77]       70.0     85.0       1.3       65.9     79.2       2.5       57.2
Unsupervised      FusionSeg [6]     70.7     83.5       1.5       65.3     73.8       1.8       32.8
Unsupervised      LVO [7]           75.9     89.1       0.0       72.1     8.4        1.3       26.5
Unsupervised      ARP [40]          76.2     91.1       7.0       70.6     83.5       7.9       39.3
1st frame mask    FCP [53]          58.4     71.5       -2.0      49.2     49.5       -1.1      30.6
1st frame mask    BVS [46]          60.0     66.9       28.9      58.8     67.9       21.3      34.7
1st frame mask    ObjFlow [39]      68.0     75.6       26.4      63.4     70.4       27.2      22.2
1st frame mask    PLM [58]          70.2     86.3       11.2      62.5     73.2       14.7      31.8
1st frame mask    VPN [8]           70.2     82.3       12.4      65.5     69.0       14.4      32.4
1st frame mask    CTN [73]          73.5     87.4       15.6      69.3     79.6       12.9      22.0
1st frame mask    SegFlow [3]       76.1     90.6       12.1      76.0     85.5       10.4      18.9
1st frame mask    MaskTrack [55]    79.7     93.1       8.9       75.4     87.1       9.0       21.8
1st frame mask    OSVOS [1]         79.8     93.6       14.9      80.6     92.6       15.0      37.8
1st frame mask    MaskRNN [2]       80.4     96.0       4.4       82.3     93.2       8.8       19.0
1st frame mask    OnAVOS [9]        81.7     92.2       11.9      81.1     88.2       11.2      27.3
1st frame mask    Our               82.3     94.1       9.5       84.1     92.6       10.6      26.6
Language          Our               82.2     94.2       3.4       84.2     94.2       2.8       26.3
Mask + Language   Our               83.9     96.0       7.3       85.5     94.4       7.3       27.8
Table 8: Comparison of video object segmentation results on DAVIS16, validation set.
Annotation type   Grounding   Temp. cons.   mIoU (J)   J&F (G)
1st frame         DBNet       no            32.6       34.7
1st frame         DBNet       yes           35.4       37.6
1st frame         MattNet     no            35.4       38.5
1st frame         MattNet     yes           37.3       39.3
Full video        DBNet       yes           35.5       37.7
Full video        MattNet     yes           35.5       37.1
Table 9: Effect of different grounding models, temporal consistency and annotation types on video object segmentation on DAVIS17, validation set.

c.3 Effect of grounding models, temporal consistency and annotation types on video object segmentation

Table 9 reports the effect of different grounding models, temporal consistency step for grounding and employing the first frame versus the full video descriptions on video object segmentation.

We compare DBNet [14] versus MattNet [15] (trained on RefCOCO [33]) as the base grounding model for the video object segmentation task. Exploiting MattNet grounding boxes results in better performance compared to DBNet (37.3 vs. 35.4 mIoU). Overall, the temporal consistency step has a positive impact on video object segmentation performance across the different grounding models (from 35.4 to 37.3 for MattNet and from 32.6 to 35.4 for DBNet).

We also compare the segmentation performance from first frame versus full video descriptions in Table 9. Employing the full video descriptions instead of the first frame descriptions results in a minor improvement for DBNet (35.5 vs. 35.4), but has a negative effect for MattNet (35.5 vs. 37.3).

ID 1: "A red car".
ID 1: "A man jumping across fences".
ID 1: "A dog running in the garden".
ID 1: "A goat walking on rocks".
ID 1: "A red and white car".
ID 1: "A woman riding a horse". ID 2: "A horse doing high-jumps".
ID 1: "A bald man with black belt in the center". ID 2: "A man with blue belt on the right".
ID 1: "A boy wearing a white t-shirt". ID 2: "A red bmx bike".
Figure 7: Video object segmentation qualitative results using only Language as supervision on DAVIS16 and DAVIS17, val sets. Frames sampled along the video duration.

c.4 Qualitative results for video object segmentation

Figure 7 provides more qualitative examples of Language-only supervision for video object segmentation on the DAVIS16 and DAVIS17 validation sets. We observe successful handling of shape deformations, fast motion as well as partial and full occlusions.

Figure 8 shows examples of Mask + Language supervision on the DAVIS17 validation set. We observe high quality instance-level segmentation of multiple similar looking objects.

Figure 9 shows a comparison of Language versus Mask supervision on the DAVIS16 and DAVIS17 validation sets. Note that using only language supervision results in more robust performance for videos with similar looking instances and camera view changes in comparison to employing pixel-level masks.

ID 1: "A man wearing a cap". ID 2: "A black bike".
ID 1: "A brown piglet in the middle". ID 2: "A brown and white colored piglet".
ID 3: "An adult pig on the right".
ID 1: "An orange goldfish in the center next to the largest fish". ID 2: "The biggest goldfish".
ID 3: "The smallest goldfish". ID 4: "A small goldfish in the end".
ID 5: "A goldfish on the bottom".
Figure 8: Video object segmentation qualitative results using Mask + Language as supervision on DAVIS17, val set. Frames sampled along the video duration.
  Language supervision, ID 1: "A brown camel in the front".
Pixel-level mask supervision
  Language supervision, ID 1: "A silver car".
Pixel-level mask supervision
  Language supervision, ID 1: "A black car".
Pixel-level mask supervision
  Language supervision, ID 1: "A green motorbike". ID 2: "A man riding a motorbike".
  Pixel-level mask supervision
Language supervision, ID 1: "A black scooter ridden by a man".
ID 2: "A man in a suit riding a scooter".
Pixel-level mask supervision
Figure 9: Video object segmentation results using Language versus Mask on the 1st frame as supervision on DAVIS16 and DAVIS17, val sets. Using language only results in more robust performance for videos with similar looking instances and camera view changes in comparison to employing pixel-level masks. Frames sampled along the video duration. The videos are chosen with the highest mIoU difference.

Footnotes

  1. OSVOS reports 86.0 mIoU by employing semantic segmentation as additional supervision.
  2. OnAVOS gives 86.1 mIoU by online adaptation on successive frames.
  3. OnAVOS reports 64.5 mIoU by performing online adaptation on successive frames.
  4. OnAVOS gives 86.1 mIoU by online adaptation on successive frames.

References

  1. Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixe, L., Cremers, D., Gool, L.V.: One-shot video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  2. Hu, Y.T., Huang, J., Schwing, A.G.: Maskrnn: Instance level video object segmentation. In: Advances in Neural Information Processing Systems (NIPS). (2017)
  3. Cheng, J., Tsai, Y.H., Wang, S., Yang, M.H.: Segflow: Joint learning for video object segmentation and optical flow. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2017)
  4. Pont-Tuset, J., Caelles, S., Perazzi, F., Montes, A., Maninis, K.K., Chen, Y., Van Gool, L.: The 2018 davis challenge on video object segmentation. arXiv:1803.00557 (2018)
  5. Xiao, F., Lee, Y.J.: Track and segment: An iterative unsupervised approach for video object proposals. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  6. Jain, S.D., Xiong, B., Grauman, K.: Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. arXiv:1701.05384 (2017)
  7. Tokmakov, P., Alahari, K., Schmid, C.: Learning video object segmentation with visual memory. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2017)
  8. Jampani, V., Gadde, R., Gehler, P.V.: Video propagation networks. arXiv:1612.05478 (2016)
  9. Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. In: Proceedings of the British Machine Vision Conference (BMVC). (2017)
  10. Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision (ECCV). (2014)
  11. Maninis, K., Caelles, S., Pont-Tuset, J., Gool, L.V.: Deep extreme cut: From extreme points to object segmentation. arXiv: 1711.09081 (2017)
  12. Benard, A., Gygli, M.: Interactive video object segmentation in the wild. arXiv: 1801.00269 (2017)
  13. Lin, D., Dai, J., Jia, J., He, K., Sun, J.: Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  14. Zhang, Y., Yuan, L., Guo, Y., He, Z., Huang, I.A., Lee, H.: Discriminative bimodal networks for visual localization and detection with natural language queries. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  15. Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: Mattnet: Modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018)
  16. Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool, L.V., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  17. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 davis challenge on video object segmentation. arXiv:1704.00675 (2017)
  18. Plummer, B., Wang, L., Cervantes, C., Caicedo, J., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2015)
  19. Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Proceedings of the European Conference on Computer Vision (ECCV). (2016)
  20. Luo, R., Shakhnarovich, G.: Comprehension-guided referring expressions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  21. Liu, C., Lin, Z., Shen, X., Yang, J., Lu, X., Yuille, A.: Recurrent multimodal interaction for referring image segmentation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2017)
  22. Mao, J., Jonathan, H., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  23. Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: Proceedings of the European Conference on Computer Vision (ECCV). (2016)
  24. Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  25. Plummer, B.A., Mallya, A., Cervantes, C.M., Hockenmaier, J., Lazebnik, S.: Phrase localization and visual relationship detection with comprehensive linguistic cues. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2016)
  26. Yu, L., Tan, H., Bansal, M., Berg, T.L.: A joint speaker-listener-reinforcer model for referring expressions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  27. Chen, K., Kovvuri, R., Nevatia, R.: Query-guided regression network with context policy for phrase grounding. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2017)
  28. Yeh, R., Xiong, J., Hwu, W.M., Do, M., Schwing, A.: Interpretable and globally optimal prediction for textual grounding using image concepts. In: Advances in Neural Information Processing Systems. (2017) 1909–1919
  29. Wang, M., Azab, M., Kojima, N., Mihalcea, R., Deng, J.: Structured matching for phrase localization. In: Proceedings of the European Conference on Computer Vision (ECCV), Springer (2016) 696–711
  30. Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), Springer (2016) 792–807
  31. Hu, R., Rohrbach, M., Andreas, J., Darrell, T., Saenko, K.: Modeling relationships in referential expressions with compositional modular networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  32. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M., Fei-Fei, L.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv:1602.07332 (2016)
  33. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Proceedings of the European Conference on Computer Vision (ECCV). (2016)
  34. Li, Z., Tao, R., Gavves, E., Snoek, C.G.M., Smeulders, A.W.M.: Tracking by natural language specification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  35. Balajee Vasudevan, A., Dai, D., Van Gool, L.: Object referring in videos with language and human gaze. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018)
  36. Hendricks, L.A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2017) 5803–5812
  37. Li, X., Qi, Y., Wang, Z., Chen, K., Liu, Z., Shi, J., Luo, P., Loy, C.C., Tang, X.: Video object segmentation with re-identification. The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2017)
  38. Papazoglou, A., Ferrari, V.: Fast object segmentation in unconstrained video. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2013)
  39. Tsai, Y.H., Yang, M.H., Black, M.J.: Video segmentation via object flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  40. Koh, Y., Kim, C.: Primary object segmentation in videos based on region augmentation and reduction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  41. Li, F., Kim, T., Humayun, A., Tsai, D., Rehg, J.M.: Video segmentation by tracking many figure-ground segments. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2013)
  42. Oneata, D., Revaud, J., Verbeek, J., Schmid, C.: Spatio-Temporal Object Detection Proposals. In: Proceedings of the European Conference on Computer Vision (ECCV). (2014)
  43. Faktor, A., Irani, M.: Video segmentation by non-local consensus voting. In: Proceedings of the British Machine Vision Conference (BMVC). (2014)
  44. Wang, W., Shen, J., Porikli, F.M.: Saliency-aware geodesic video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
  45. Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: Proceedings of the European Conference on Computer Vision (ECCV). (2010)
  46. Maerki, N., Perazzi, F., Wang, O., Sorkine-Hornung, A.: Bilateral space video segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  47. Jain, S.D., Grauman, K.: Click carving: Segmenting objects in video with point clicks. In: Conf. on Human Computation and Crowdsourcing. (2016)
  48. Spina, T.V., Falcão, A.X.: Fomtrace: Interactive video segmentation by image graphs and fuzzy object models. arXiv:1606.03369 (2016)
  49. Nagaraja, N., Schmidt, F., Brox, T.: Video segmentation with just a few strokes. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2015)
  50. Fan, Q., Zhong, F., Lischinski, D., Cohen-Or, D., Chen, B.: Jumpcut: Non-successive mask transfer and interpolation for video cutout. SIGGRAPH Asia (2015)
  51. Wen, L., Du, D., Lei, Z., Li, S.Z., Yang, M.H.: Jots: Joint online tracking and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
  52. Jain, S.D., Grauman, K.: Supervoxel-consistent foreground propagation in video. In: Proceedings of the European Conference on Computer Vision (ECCV). (2014)
  53. Perazzi, F., Wang, O., Gross, M., Sorkine-Hornung, A.: Fully connected object proposals for video segmentation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2015)
  54. Wang, W., Shen, J.: Super-trajectory for video segmentation. arXiv:1702.08634 (2017)
  55. Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  56. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
  57. Maninis, K., Caelles, S., Chen, Y., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Gool, L.V.: Video object segmentation without temporal information. arXiv:1709.06031 (2017)
  58. Shin Yoon, J., Rameau, F., Kim, J., Lee, S., Shin, S., So Kweon, I.: Pixel-level matching for video object segmentation using convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2017)
  59. He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2017)
  60. Dollár, P., Zitnick, C.L.: Fast edge detection using structured forests. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) (2015)
  61. Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H.S., Hu, S.M.: Global contrast based salient region detection. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) (2015)
  62. Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  63. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  64. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  65. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: Proceedings of the International Conference on Learning Representations (ICLR). (2016)
  66. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915 (2016)
  67. Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G.: Understanding convolution for semantic segmentation. arXiv:1702.08502 (2017)
  68. Liu, W., Rabinovich, A., Berg, A.C.: Parsenet: Looking wider to see better. arXiv:1506.04579 (2015)
  69. Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587 (2017)
  70. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proc. of the International Conf. on Artificial Intelligence and Statistics (AISTATS). (2010)
  71. Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: Yfcc100m: the new data in multimedia research. Communications of the ACM 59(2) (2016) 64–73
  72. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) (2015)
  73. Jang, W.D., Kim, C.S.: Online video object segmentation via convolutional trident network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  74. Cheng, J., Liu, S., Tsai, Y.H., Hung, W.C., Gupta, S., Gu, J., Kautz, J., Wang, S., Yang, M.H.: Learning to segment instances in videos with spatial propagation network. The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2017)
  75. Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for the 2017 davis challenge on video object segmentation. The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2017)
  76. Papazoglou, A., Ferrari, V.: Fast object segmentation in unconstrained video. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2013)
  77. Tokmakov, P., Alahari, K., Schmid, C.: Learning motion patterns in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)