Dual Convolutional LSTM Network for Referring Image Segmentation


We consider referring image segmentation, a problem at the intersection of computer vision and natural language understanding. Given an input image and a referring expression in the form of a natural language sentence, the goal is to segment the object of interest in the image referred to by the linguistic query. To this end, we propose a dual convolutional LSTM (ConvLSTM) network. Our model consists of an encoder network and a decoder network, where ConvLSTM is used in both to capture spatial and sequential information. The encoder network extracts visual and linguistic features for each word in the expression sentence, and adopts an attention mechanism to focus on words that are more informative in the multimodal interaction. The decoder network integrates the features generated by the encoder network at multiple levels as its input and produces the final precise segmentation mask. Experimental results on four challenging datasets demonstrate that the proposed network achieves superior segmentation performance compared with other state-of-the-art methods.

Referring Image Segmentation, Encoder-Decoder, Vision and Language, Deep Learning

I Introduction

Segmenting objects of interest in an image is a fundamental problem in computer vision. Researchers have formulated various high-level computer vision tasks related to segmenting objects in images, such as semantic segmentation [27, 2], instance segmentation [8] and salient object segmentation [3, 35]. However, these tasks have some limitations. For example, semantic segmentation and instance segmentation assume a pre-defined set of object categories (e.g., cat, person, bus). Salient object segmentation is not restricted to pre-defined object categories; it relies on the human visual cognition system to distinguish the most salient objects in the scene from the background. But in some complicated cases, the salient objects are ambiguous and may differ from viewer to viewer [16].

Query expressions (left to right): “man with back to camera”, “guy on left”, “man in yellow shirt”
Fig. 1: Illustration of the referring image segmentation task. Given an input image and a referring expression, the goal is to generate the segmentation mask for the referred object in the image. The referring expression may use diverse descriptions to identify the referred object, such as object names (e.g. “man”, “guy”), attributes (e.g. “back to camera”, “yellow”) and spatial relationships (e.g. “on left”). The first row shows three query expressions and the last row indicates the corresponding segmentation mask in the image.

In recent years, referring image segmentation has attracted the attention of many researchers. In referring image segmentation, the object of interest to be segmented is specified by a free-form referring expression in natural language. Fig. 1 illustrates an example. For a given input image, the same object can be referred to using diverse descriptions (the first two examples from the left), for instance by its attribute “back to camera” or by its spatial relationship “on left”, which distinguish it from other objects of the same category (“man” or “guy”). The referred object can also be identified through the attribute of a related object in the image, such as the “yellow” shirt. Referring image segmentation is a challenging problem that requires a joint understanding of both linguistic and visual information, and it enables many real-world applications including interactive image editing, intelligent visual search and human-robot interaction.

There are several existing works in this area. Some of them [10, 15, 28] represent the whole referring expression and the visual features separately. For example, the referring expression is encoded as a hidden vector using a recurrent neural network (RNN) or long short-term memory (LSTM) model [9], while the input image is represented using convolutional neural network (CNN) features. The textual feature vector is then combined with the visual features at each spatial location, followed by deconvolution [10], recurrent refinement [15] or key-word context [28] to produce the final segmentation mask. The limitation of these approaches is that visual and textual features are extracted independently, so they may fail to capture the detailed multimodal information that is often useful for referring image segmentation.

A different line of previous work [18, 23] processes each word in the referring expression in a sequential order. These methods can potentially capture the detailed information of words in the referring expression with visual context by a sequential interaction [18] or synthesis module [23]. However, these methods treat every word equally in their models. This may cause difficulty for long referring expressions that likely contain unimportant words.

To address the limitations of previous work, we propose a dual convolutional LSTM network for referring image segmentation. The convolutional LSTM (ConvLSTM) was originally proposed in [29]; it replaces the fully connected layers of LSTM with convolutional layers in order to capture spatial-temporal information of a sequence of images. We modify the original ConvLSTM to fit the multimodal data in the referring segmentation problem within the proposed encoder-decoder framework. Our network models the input image as the spatial information, and formulates the language expression and the multi-level features as the temporal sequences for the encoder and the decoder, respectively. Specifically, the encoder network (E-ConvLSTM) is used to capture multimodal feature interactions. It first adopts the feature maps of the input image as the spatial information. For each spatial location of the feature map, it captures the interaction between this location and every word in the referring expression, one word per recurrent time step of E-ConvLSTM, to gradually localize the referred object. In addition, the proposed approach embeds word-level attentions into the cell state of E-ConvLSTM to guide the interaction process towards more important words (e.g. words corresponding to the object of interest), instead of treating each word equally in the recurrent step [18, 23]. The decoder network (D-ConvLSTM) formulates the attentive multimodal features encoded by E-ConvLSTM at different levels as a sequence and iteratively refines these features to take full advantage of correlations among multi-level features. We further introduce spatial attentions for the multi-level features to better focus on the semantics carried by the high-level features and the fine details carried by the low-level features, yielding a precise segmentation mask.

In summary, the main contributions of this paper lie in the following four aspects:

1) We propose a dual convolutional LSTM network to exploit an encoder-decoder framework for multimodal feature encoder and multi-level segment decoder in the spatial regions for referring image segmentation.

2) The multimodal feature encoder embeds word attention into the cell (memory) state of E-ConvLSTM to adaptively encode the multimodal interaction towards more important words and localize the referred objects.

3) The multi-level segment decoder (D-ConvLSTM) progressively decodes the features with spatial attention at multiple levels to refine more precise referring image segmentation results.

4) Our proposed dual convolutional LSTM network is thoroughly evaluated on four publicly available datasets and achieves state-of-the-art performance.

The rest of this paper is organized as follows. We first review related work in Sec. II and introduce the basic ConvLSTM unit as background in Sec. III. We then present in detail an encoder-decoder framework consisting of the proposed E-ConvLSTM in Sec. IV and D-ConvLSTM in Sec. V. The experimental setup and extensive experimental results for performance evaluation are given in Sec. VI and Sec. VII, respectively. Finally, we conclude in Sec. VIII.

II Related Work

In this section, we first review some relevant object segmentation tasks. Then a set of vision and language problems is introduced for studies in the intersection of multimodal information processing. Last, the recent studies tightly related to our work about referring image segmentation are presented.

Object Segmentation: Object segmentation or extraction from an image can be done rapidly and freely by humans but is a challenging problem in computer vision. A major difficulty is that computers have to be aware of the object of interest before executing the segmentation operation. One straightforward way is for a human to roughly specify an object interactively, e.g., by drawing a bounding-box [26]. A segmentation approach can then refine the specified object with well-defined boundaries according to visual cues of the candidate region and the background (inside and outside the bounding-box). The desired object can also be identified automatically, inspired by the human visual system, which distinguishes visually salient objects from the background [19, 16, 3, 35]. In the recent era of data explosion, convolutional neural networks [14] have driven a significant performance boost in semantic segmentation [27, 2], where all pixels are labeled with pre-defined object categories, and in instance segmentation [8], where additional instance labels are available. This provides the visual understanding foundation for referring image segmentation in this paper.

Combining Vision with Language: There has been a lot of previous work on combining vision with language for different tasks, such as image captioning [34] and visual question answering (VQA) [1]. These models incorporate attention mechanisms to explore relevant visual features in the corresponding spatial regions, but the localized regions are not precise enough, since the goal of these works is to generate a sentence or a bounding box. Visual grounding [25, 37] requires an exact bounding-box as the output according to a referring expression. A joint multimodal embedding is used in reconstruction of the text phrase [25] and in modular networks of subject, location and relationship [37]. However, all these grounding methods rely on proposals generated by off-the-shelf object detectors.

For tasks like referring image segmentation, we need to effectively represent the multimodal interaction between the linguistic and visual information and keep detailed spatial information in order to generate a segmentation mask. A straightforward way to combine visual and linguistic features is to use simple concatenation as in [10, 15], key-word context [28] or a cross-modal self-attention mechanism [36] for multimodal features. Another line of work [18, 23], which is more related to our work, exploits the sequential nature of language and differs from how humans solve this problem [33]. However, these two methods consider each word equally in the interaction. Instead, we exploit word attentions over the input expression and embed them into the cell states of ConvLSTM to guide an effective multimodal feature interaction in the encoder network.

Referring Image Segmentation: Referring image segmentation is first introduced by [10] to segment the object-of-interest referred by an expression. They concatenate visual, linguistic and spatial features on the spatial feature maps and use a deconvolutional layer to recover a high-resolution segmentation mask. In [18], a recurrent multimodal interaction model is proposed to gradually ground the referred objects onto the image according to the progressive meaning of the language input. Each word is gradually combined with visual features in a sequential order. The work in [23] adopts a similar method as [18] but generates dynamic filters in the synthesis module to incorporate linguistic information. It further uses an incremental module with bilinear upsampling and convolution over feature maps for fine details. However, these works equally encode every word in the referring expression at each step of a recurrent model. It may be difficult to effectively capture the important words in a long referring sentence. In this paper, we propose a multimodal feature encoder to embed word attentions into the cell state of ConvLSTM. The multimodal interaction of visual and linguistic features can be learned adaptively towards more important words in the expression to identify the referred object.

Key-word-aware context is proposed in [28] to combine textual feature with image regions to model their relationships, which aligns key words to different image regions. In order to obtain a more precise mask for segmentation, more elaborated refinement approaches are proposed by [15] and [36]. Specifically, multi-scale features are progressively refined to improve the segmentation mask from a roughly localized mask in [15]. In [36], the cross-modal self-attention network is proposed to capture the long-range dependencies between linguistic and visual contexts and a gated multi-level fusion is then used to extract a precise segmentation mask. Both methods show the effectiveness of the iterative refinement for removing irrelevant regions and producing more precise segmentation masks in the end. Though significant improvements have been achieved by these refinement methods, the importance and relation between different levels of features are not fully exploited. Different from these methods which simply integrate multi-level visual features for refinement, we deploy a decoder to utilize encoded multimodal information and introduce spatial attentions into multi-level features to focus on more specific features in the refinement. It selectively concentrates on the main body of the referred object and boundary details according to the high-level and low-level features, respectively.

Long Short-Term Memory Network: Long short-term memory (LSTM) network [9] has been widely adopted for sequential data (e.g., language [32], audio [22] and video [30]). LSTM can effectively capture long-range dependencies in sequential data. LSTM contains fully connected layers in both the input-to-state and state-to-state transitions with four gates including an input gate, a memory gate, a forget gate and an output gate. It can be flexibly used in both the encoder and the decoder framework for sequential inputs and outputs [32]. LSTM can be used to combine word features with visual context for multimodal interaction [34, 18]. Instead of propagating information in one direction as in standard LSTM, bidirectional LSTM is proposed to learn hidden states from both forward and backward directions at the same time. The forward and backward passes can learn comprehensive information through the sequential input [7, 31]. Convolutional LSTM (ConvLSTM) replaces the fully connected layers of LSTM with convolutional layers to make the sequential learning possible in spatial-temporal domains. It performs better in handling spatio-temporal correlations for a set of images [29]. In addition, several other variants of LSTM are designed to stack multilayer LSTM for skeleton-based action recognition [40], formulate AutoEncoder topology [6] or combine fully convolutional neural networks with LSTM for vehicle counting [39].

III Background: Convolutional LSTM

A general model for referring image segmentation requires spatial relationships of an image and sequential dependency of words. Convolutional networks and LSTM [9] are two powerful feature representation approaches for an image and words, respectively. A convolutional network extracts hierarchical spatial semantic features from an image. LSTM models long-range dependency for sequential words. Convolutional LSTM (ConvLSTM) [29] is an extension of the vanilla LSTM for capturing spatio-temporal relationship of data. It replaces fully connected layers in LSTM with convolutional layers for input-to-state and state-to-state transitions so that spatial correlation can also be built within the sequential inference process. In order to simultaneously capture spatial and sequential information, we introduce a dual convolutional LSTM framework for multimodal feature encoder and multi-level segment decoder between an image and words. The proposed network captures the interaction between linguistic and visual context needed for localizing the referred object. It also maintains the spatial information needed for producing a precise segmentation mask.

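Let $\{\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_T\}$ denote a set of feature maps, where $\mathcal{X}_t$ is the input at time step $t$ in ConvLSTM. The complete ConvLSTM operation can be summarized as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} * \mathcal{X}_t + W_{hi} * \mathcal{H}_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} * \mathcal{X}_t + W_{hf} * \mathcal{H}_{t-1} + b_f), \\
o_t &= \sigma(W_{xo} * \mathcal{X}_t + W_{ho} * \mathcal{H}_{t-1} + b_o), \\
g_t &= \tanh(W_{xg} * \mathcal{X}_t + W_{hg} * \mathcal{H}_{t-1} + b_g), \\
\mathcal{C}_t &= f_t \odot \mathcal{C}_{t-1} + i_t \odot g_t, \\
\mathcal{H}_t &= o_t \odot \tanh(\mathcal{C}_t),
\end{aligned}
$$

where $W$ and $b$ represent the ConvLSTM parameters, $*$ is the convolution operator, $\sigma$ is the sigmoid function and $\odot$ is the element-wise product. Here we use $i_t$, $f_t$, $o_t$, $g_t$ to denote the input gate, forget gate, output gate and memory gate, respectively. $\mathcal{H}_t$ is the hidden state and $\mathcal{C}_t$ is the cell state at each time step $t$. At each time $t$, ConvLSTM takes the input $\mathcal{X}_t$ and the previous hidden state $\mathcal{H}_{t-1}$ to generate $i_t$, $f_t$, $o_t$ and $g_t$ at the current time.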

The essential design of ConvLSTM is to use convolution layers for spatial correlations and recurrence over time for sequential dependency. At each time $t$, the input gate $i_t$ controls how much information from $g_t$ is exposed to the cell state. The forget gate $f_t$ controls how much information from the past should be forgotten. This results in an updated cell state $\mathcal{C}_t$. The output gate $o_t$ is then used to propagate effective information to the hidden state $\mathcal{H}_t$ at time $t$ as the output. Therefore, the cell state is considered as a memory state that controls the update of the information flow. In other words, it determines how much the current and previous states influence the current hidden state $\mathcal{H}_t$.
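The gate-and-memory update described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `conv2d_same` and `convlstm_step`, the channel-last layout, the packing of the four gates into one kernel, and the absence of peephole connections are all choices made for the example.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded, stride-1 2-D cross-correlation.
    x: (H, W, C_in); w: (k, k, C_in, C_out) with odd k. Returns (H, W, C_out)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[3]))
    for dy in range(k):
        for dx in range(k):
            # Each kernel tap contributes a shifted copy of the input.
            out += xp[dy:dy + H, dx:dx + W, :] @ w[dy, dx]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x_t, h_prev, c_prev, w_x, w_h, b):
    """One ConvLSTM step (hypothetical helper).
    w_x: (k, k, C_in, 4*C_hid), w_h: (k, k, C_hid, 4*C_hid), b: (4*C_hid,)."""
    z = conv2d_same(x_t, w_x) + conv2d_same(h_prev, w_h) + b
    z_i, z_f, z_o, z_g = np.split(z, 4, axis=-1)
    i, f, o = sigmoid(z_i), sigmoid(z_f), sigmoid(z_o)  # input, forget, output gates
    g = np.tanh(z_g)                                    # memory gate (candidate)
    c_t = f * c_prev + i * g      # forget part of the past, admit new content
    h_t = o * np.tanh(c_t)        # expose part of the memory as output
    return h_t, c_t

# Usage: run a short random sequence through the cell.
rng = np.random.default_rng(0)
H, W, C_in, C_hid, k = 5, 6, 3, 4, 3
w_x = rng.normal(scale=0.1, size=(k, k, C_in, 4 * C_hid))
w_h = rng.normal(scale=0.1, size=(k, k, C_hid, 4 * C_hid))
b = np.zeros(4 * C_hid)
h = np.zeros((H, W, C_hid))
c = np.zeros((H, W, C_hid))
for _ in range(3):
    h, c = convlstm_step(rng.normal(size=(H, W, C_in)), h, c, w_x, w_h, b)
```

Because every transition is a convolution, the hidden and cell states keep the spatial layout of the input, which is what allows a later layer to read a mask off the final hidden state.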

Fig. 2: The multimodal feature encoder with E-ConvLSTM. We embed word attentions into the ConvLSTM cell to adaptively encode the multimodal interaction towards more important words.
Fig. 3: Examples of word attentions. From left to right at each row: the original image, word attentions, visualization of the final feature map. Darker red color means higher attention weights.

IV Multimodal Feature Encoder

The goal of the encoder network of our model is to generate multimodal features that capture both detailed linguistic and visual information. The overall architecture of the encoder network is shown in Fig. 2. First, we use an attention mechanism to learn to focus on certain important words in the referring expression (Sec. IV-A). Then the word vectors are re-weighted by the word attentions and tiled to the same size as the visual feature map. Each word feature is concatenated with the visual feature and the spatial feature to form a word-specific multimodal feature. These multimodal features capture the complex interactions between the linguistic information from the referring expression and the visual information from the image. Finally, the multimodal interaction (Sec. IV-B) is guided by the word attentions to adaptively focus on more important words in the referring expression.

IV-A Word Attention

Instead of representing the entire referring expression as a single hidden vector using LSTM [10, 15], we propose to keep track of the vector representation of each word in the expression. We also learn the relative importance (i.e. word attention) of each word and re-weight the vector representation of each word with its corresponding attention score. This allows our model to focus on important words in the expression and use them to adaptively encode multimodal information.

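Let us denote a referring expression as $S = \{w_1, w_2, \ldots, w_T\}$, where $w_t$ is the $t$-th word and $T$ is the number of words. Each word is represented as a one-hot vector. We project each word into a vector representation $e_t$ using a word embedding layer, then use a bidirectional LSTM to produce a hidden vector for each word as follows:

$$
\begin{aligned}
\overrightarrow{h}_t &= \overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{h}_{t-1}), \\
\overleftarrow{h}_t &= \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{h}_{t+1}), \\
h_t &= [\overrightarrow{h}_t ; \overleftarrow{h}_t],
\end{aligned}
$$

where $\overrightarrow{\cdot}$ and $\overleftarrow{\cdot}$ denote the forward and backward directions in the bidirectional LSTM, respectively. $h_t$ is used as the vector representation for the $t$-th word; it concatenates the forward and backward hidden states of the bidirectional LSTM at every time step (word), and therefore takes into account both previous and future words.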

We then apply two linear layers, with model parameters $W_1$ and $W_2$, to the word feature $h_t$ and normalize the output to obtain an attention weight $a_t$ that indicates the relative importance of the $t$-th word. We then use the attention weight of each word to re-weight its vector representation as follows:

$$
\hat{h}_t = a_t \, h_t .
$$

The final vector representation $\hat{h}_t$ is considered as the generated attentive word feature that takes into account the relative importance of the $t$-th word. It conveys discriminative information in the given referring expression. Fig. 3 shows examples of word attentions. The word attentions encourage the network to focus on the more important words so as to identify the referred object (“keyboard” in the top example) or to discriminate between similar objects (“black cow” in the bottom example) using the relative location (“front”) and a different attribute (“facing to the left”). These word attentions also contribute differently to the multimodal interaction in the next section.
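The word attention and re-weighting steps can be sketched as below. The text specifies two linear layers followed by a normalization; the `tanh` between the layers, the softmax used as the normalization, the bias terms and the function name `word_attention` are assumptions made for this illustration.

```python
import numpy as np

def word_attention(h, w1, b1, w2, b2):
    """Hypothetical sketch of word attention.
    h: (T, D) per-word features from the bidirectional LSTM.
    w1: (D, D_att), b1: (D_att,), w2: (D_att, 1), b2: (1,).
    Returns the attention weights (T,) and the re-weighted features (T, D)."""
    hidden = np.tanh(h @ w1 + b1)        # first linear layer (tanh is assumed)
    scores = (hidden @ w2 + b2)[:, 0]    # second linear layer: one score per word
    scores = scores - scores.max()       # shift for numerical stability
    a = np.exp(scores) / np.exp(scores).sum()  # softmax over words (assumed)
    h_hat = a[:, None] * h               # scale each word vector by its weight
    return a, h_hat

# Usage with toy three-word features.
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
a, h_hat = word_attention(h, np.eye(2), np.zeros(2),
                          np.array([[1.0], [1.0]]), np.zeros(1))
```

With a softmax the weights are positive and sum to one over the expression, so they can be read directly as relative importance scores.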

IV-B Word Attentive Multimodal Interaction

We propose a word attentive multimodal interaction model that captures the multimodal information, including the linguistic information in the referring expression and the visual/spatial information in the input image.

Multimodal Feature Generation: We use a pretrained CNN to extract a visual feature map for an input image. Let the feature map size be $W \times H \times D$, where $W$ and $H$ are the width and height of the feature map, and $D$ is the channel dimension. Following [18], we then generate an 8-dimensional vector representing the spatial information at each spatial location in the feature map. Specifically, we use the first three dimensions to encode the relative horizontal coordinates and another three dimensions to encode the relative vertical coordinates. The last two dimensions correspond to the relative sizes (width and height) of the image. The spatial feature map has a dimension of $W \times H \times 8$. All values of this spatial feature map are normalized to the range of , i.e., the values of the upper left corner and the lower right corner of the spatial feature are and , respectively. This spatial feature is then appended to the visual feature map at each spatial location. Thus the dimension of the final visual feature map is $W \times H \times (D + 8)$. This feature map captures both visual and spatial information of an image.
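A possible construction of the 8-dimensional spatial feature is sketched below. The grouping (three horizontal, three vertical, two size values) follows the text; the choice of (min, center, max) within each group, the use of `1/width` and `1/height` as the relative sizes, the normalization to `[-1, 1]` and the function names are assumptions for the example.

```python
import numpy as np

def spatial_feature_map(height, width):
    """Return a (height, width, 8) map of relative coordinates.
    Assumed layout: [x_min, x_ctr, x_max, y_min, y_ctr, y_max, 1/width, 1/height],
    with coordinates normalized to [-1, 1] (an assumption)."""
    xs = np.arange(width)
    ys = np.arange(height)
    x_min = xs / width * 2 - 1
    x_max = (xs + 1) / width * 2 - 1
    y_min = ys / height * 2 - 1
    y_max = (ys + 1) / height * 2 - 1
    feat = np.zeros((height, width, 8))
    feat[..., 0] = x_min[None, :]
    feat[..., 1] = ((x_min + x_max) / 2)[None, :]
    feat[..., 2] = x_max[None, :]
    feat[..., 3] = y_min[:, None]
    feat[..., 4] = ((y_min + y_max) / 2)[:, None]
    feat[..., 5] = y_max[:, None]
    feat[..., 6] = 1.0 / width
    feat[..., 7] = 1.0 / height
    return feat

def append_spatial(visual):
    """Concatenate the spatial map to a (height, width, D) visual feature map."""
    height, width, _ = visual.shape
    return np.concatenate([visual, spatial_feature_map(height, width)], axis=-1)

# Usage: a toy 4 x 5 feature map with 16 channels gains 8 spatial channels.
fused = append_spatial(np.zeros((4, 5, 16)))
```

Encoding position explicitly lets expressions such as "on left" be matched against location, which convolutional features alone represent only weakly.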

Then the multimodal feature can be generated as follows. For the $t$-th word, we append its vector representation $\hat{h}_t$ to each spatial cell of the feature map $V$. This results in a word-specific multimodal feature $M_t$. We repeat this process for all words $t = 1, \ldots, T$. In the end, we obtain $T$ word-specific multimodal feature maps $\{M_1, \ldots, M_T\}$.
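The tiling and concatenation step can be sketched as follows, assuming a channel-last layout; the function name `word_specific_features` and the argument names are hypothetical.

```python
import numpy as np

def word_specific_features(visual, h_hat):
    """Tile each word vector over the spatial grid and concatenate it
    with the visual-spatial feature map.
    visual: (H, W, Dv); h_hat: (T, Dw). Returns (T, H, W, Dv + Dw)."""
    T, Dw = h_hat.shape
    H, W, Dv = visual.shape
    tiled_words = np.broadcast_to(h_hat[:, None, None, :], (T, H, W, Dw))
    tiled_visual = np.broadcast_to(visual[None], (T, H, W, Dv))
    return np.concatenate([tiled_visual, tiled_words], axis=-1)

# Usage: 3 words, a 2 x 3 grid, 4 visual channels, 2 word channels.
visual = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
h_hat = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
M = word_specific_features(visual, h_hat)
```

The leading axis of the result plays the role of time: one feature map per word, which is the sequence a recurrent encoder would consume.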

Multimodal Interaction: Given the feature maps obtained for each word separately, we want to combine these feature maps and capture their interactions word by word over time for detailed multimodal comprehension. Different from the previous work [18] that treats each word equally, our multimodal interaction model can pay more attention to important words by taking the word attentions into account.

For the word-specific multimodal feature maps $\{M_1, \ldots, M_T\}$, we modify the standard ConvLSTM (see Sec. III) in order to take the word attentions into account as follows:




The hidden state $\mathcal{H}_t$ output at each time step summarizes the semantic information of all words seen up to time step $t$, so the hidden state $\mathcal{H}_T$ at the final step contains information of all words in the referring expression. The cell state $\mathcal{C}_t$ can take advantage of the attention weight $a_t$ of the corresponding word. As shown in Eq. 11, the attention of a word (obtained in Eq. 8) is used to modulate the cell memory of the multimodal interaction. If a word has a higher attention weight, it encourages the cell state to admit more information from the current state. In contrast, a word with a lower attention weight allows less information to flow into the cell state, so the cell state relies more on the historic memory. In the end, the modified ConvLSTM is able to pay more attention to important words with higher attention weights.
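The behaviour described above can be written compactly. The following is only a sketch of one form consistent with that description, assuming the scalar attention weight $a_t$ scales the new-information branch of the standard cell update; the exact placement of $a_t$ and the symbols used here are assumptions, not a transcription of Eq. 11.

```latex
\begin{aligned}
\mathcal{C}_t &= f_t \odot \mathcal{C}_{t-1} + a_t \,\bigl(i_t \odot g_t\bigr), \\
\mathcal{H}_t &= o_t \odot \tanh(\mathcal{C}_t).
\end{aligned}
```

Under this form, an $a_t$ close to zero leaves the cell state dominated by $f_t \odot \mathcal{C}_{t-1}$, i.e. the historic memory, while a larger $a_t$ admits more of the current word-specific input.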

Fig. 2 illustrates the whole process of E-ConvLSTM. Given the input query “lemon on left”, the multimodal feature encoder produces word attention for each word in the referring expression. The different word attentions are used to generate attentive word feature for multimodal features. In addition, the attention of every word is also embedded into the cell memory of each E-ConvLSTM cell to adjust multimodal interaction towards more informative words (e.g., “lemon” and “left”) adaptively.

V Multi-level Segment Decoder

Referring image segmentation aims to segment the referred object rather than predicting dense labels for the entire image. Directly applying the multimodal features to predict a segmentation mask might lead to an unsatisfactory result because of the distraction of non-referred regions. In addition, multi-level feature representations help to provide fine details for a more precise boundary of the object. To this end, we propose to generate spatial attention to focus on important spatial regions in the multimodal feature maps. The spatial attentions are then applied at the different levels of encoded multimodal features generated by the E-ConvLSTM. We also propose the multi-level segment decoder with another ConvLSTM (D-ConvLSTM) to sequentially refine the different levels of encoded multimodal features for a precise segmentation mask.

The hidden state of the last word from Eq. 10 can be used as the input to the decoder network. Note that is computed using the CNN features at a particular layer. In practice, we can apply the encoder by using the CNN features at several different layers in the network. We use to denote the hidden state in the encoder based on different levels of CNN visual features. In this paper, we set and corresponding to {} of a DeepLab101 backbone network [2]. Specifically, can be generated by applying the encoder network in Sec. IV-B at a particular level for multimodal representation over the referring expression.

Referring expression: “dude in blue shirt tie”
Referring expression: “right meter”
(a) (b) (c) (d)
Fig. 4: Visualization of the spatial attentions corresponding to different levels of encoded features. (a) original image; (b,c,d) the spatial attentions from the high-level to low-level features.

Spatial Attention: The spatial attention is generated to focus on important spatial regions in the feature map. We use a convolutional layer with a large kernel () to capture relatively large regions. The convolution of a large kernel can effectively gather information with a larger receptive field and is robust for localization [24].


where and are the parameters of the convolution filters. The sigmoid function summarizes the importance of different regions in the feature map. Then it is applied to each slice of . Fig. 4 presents the spatial attentions at different levels of features. It can be observed that the spatial attentions help the decoder refine features from the important spatial regions. The spatial attention corresponding to high-level features tends to concentrate on the referred objects, while the spatial attention of the low-level features tends to be more spread out.
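A minimal sketch of such a spatial attention is given below: a single-output convolution followed by a sigmoid, with the resulting map multiplied into every channel. The kernel size is left as a free parameter, and the function name, the bias term and the zero padding are assumptions for the example.

```python
import numpy as np

def spatial_attention(feat, kernel, bias=0.0):
    """Hypothetical spatial attention.
    feat: (H, W, C) feature map; kernel: (k, k, C) with odd k.
    Returns the attention map (H, W) and the attended features (H, W, C)."""
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(feat, ((p, p), (p, p), (0, 0)))
    H, W, _ = feat.shape
    logits = np.full((H, W), float(bias))
    for dy in range(k):
        for dx in range(k):
            # Accumulate the contribution of each kernel tap.
            logits += padded[dy:dy + H, dx:dx + W, :] @ kernel[dy, dx]
    att = 1.0 / (1.0 + np.exp(-logits))   # sigmoid: importance in (0, 1)
    return att, feat * att[..., None]      # same weight for every channel slice

# Usage with a zero kernel, which gives a uniform attention of 0.5.
att, attended = spatial_attention(np.ones((4, 5, 3)), np.zeros((7, 7, 3)))
```

A larger `k` lets each attention value depend on a wider neighbourhood, at a cost that grows with `k * k`.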

Fig. 5: An illustration of the multi-level segment decoder with D-ConvLSTM. The decoder network iteratively refines the segmentation mask by using the features extracted from different CNN levels. It also uses spatial attentions to focus on image regions that are informative for generating the segmentation mask.

Multimodal Feature Refinement: Existing works [15, 23] have shown that refining the segmentation mask over multi-level features can significantly improve the performance with respect to fine boundaries and complete objects. Instead of directly taking visual features from different layers of the network, we propose to use multimodal features that have incorporated word-attentive interactions and spatial attentions.

As shown in Fig. 5, the decoder network progressively integrates these features from high-level to low-level semantics as follows:


where represents the hidden state of each time step over different level in the refinement. The hidden state at the last time step of Eq. 13 is adopted as the output of the decoder network. Finally, is fed to another convolutional layer to produce a 2-D probability score map normalized with sigmoid function. The probability score can be trained with a ground truth label map by a binary cross entropy loss function as:


where is the whole set of pixels in the image and is n-th pixel in it.
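The training objective can be sketched as a per-pixel binary cross entropy. Averaging over the pixels (rather than summing), the clipping constant `eps` and the function name `bce_loss` are assumptions made for this illustration.

```python
import numpy as np

def bce_loss(prob, target, eps=1e-7):
    """Binary cross entropy averaged over all pixels.
    prob: predicted foreground probabilities in [0, 1]; target: 0/1 labels."""
    p = np.clip(prob, eps, 1.0 - eps)   # avoid log(0)
    per_pixel = target * np.log(p) + (1 - target) * np.log(1 - p)
    return float(-per_pixel.mean())

# Usage: an uninformative prediction of 0.5 everywhere costs log(2) per pixel.
loss = bce_loss(np.full((2, 2), 0.5), np.array([[1, 0], [0, 1]]))
```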

| Method | prec@0.5 | prec@0.6 | prec@0.7 | prec@0.8 | prec@0.9 | IoU |
|---|---|---|---|---|---|---|
| E-ConvLSTM (w/o word attention) | 47.15 | 36.97 | 25.63 | 12.84 | 2.14 | 46.70 |
| E-ConvLSTM | 54.62 | 44.20 | 30.77 | 16.02 | 2.56 | 50.50 |
| E-ConvLSTM + D-ConvLSTM (w/o spatial attention) | 65.54 | 57.27 | 46.82 | 30.54 | 8.65 | 56.27 |
| E-ConvLSTM + D-ConvLSTM | 68.67 | 60.95 | 50.48 | 32.96 | 9.37 | 58.62 |
| E-ConvLSTM + D-ConvLSTM + DCRF | 68.97 | 61.49 | 51.98 | 36.15 | 11.42 | 59.04 |

TABLE I: Ablation study of the relative contributions of different components of the proposed network on the UNC val set. The first three rows correspond to variants of our model where some components are removed. The 4th row is our model. The last row is our model with DCRF postprocessing. The results show that each component of our model helps improve the performance.
| Method | Google-Ref val | UNC val | UNC testA | UNC testB | UNC+ val | UNC+ testA | UNC+ testB | Referit test |
|---|---|---|---|---|---|---|---|---|
| RMI [18] | 34.40 | 44.33 | 44.74 | 44.63 | 29.91 | 30.37 | 29.43 | 57.34 |
| DMN [23] | 36.76 | 49.78 | 54.83 | 45.13 | 38.88 | 44.22 | 32.29 | 52.81 |
| KWA [28] | 36.92 | - | - | - | - | - | - | 59.09 |
| RRN [15] | 36.32 | 54.26 | 56.21 | 52.71 | 39.23 | 41.68 | 35.63 | 63.12 |
| CMSA [36] | 39.87 | 58.00 | 60.33 | 54.74 | 43.57 | 47.27 | 37.73 | 63.17 |
| Ours | 41.32 | 58.62 | 59.73 | 56.23 | 44.18 | 47.44 | 39.43 | 63.75 |
| RMI + DCRF [18] | 34.52 | 45.18 | 45.69 | 45.57 | 29.86 | 30.48 | 29.50 | 58.73 |
| RRN + DCRF [15] | 36.45 | 55.33 | 57.26 | 53.93 | 39.75 | 42.15 | 36.11 | 63.63 |
| CMSA + DCRF [36] | 39.98 | 58.32 | 60.61 | 55.09 | 43.76 | 47.60 | 37.89 | 63.80 |
| Ours + DCRF | 41.77 | 59.04 | 60.74 | 56.73 | 44.54 | 47.92 | 39.73 | 63.92 |

TABLE II: Comparison of the segmentation performance with the state-of-the-art methods in terms of IoU. The methods in the upper block are evaluated directly on the network outputs, and the results in the lower block are post-processed with DCRF.

VI Experimental Setup

In this section, we introduce the datasets in Sec. VI-A and evaluation metrics in Sec. VI-B, and also describe the implementation details of the proposed approach in Sec. VI-C.

VI-A Datasets

We evaluate our model on four publicly available datasets including Google-Ref [21], UNC [38], UNC+ [38] and Referit [11]. All experiments are conducted under the same train and test split sets as [18].

The Google-Ref dataset is composed of 104,560 expressions referring to 54,822 objects in 26,711 images. These images and ground-truth masks are collected from the MS COCO dataset [17], and the referring expressions are annotated via Amazon Mechanical Turk. Compared with the other three datasets, the referring expressions of Google-Ref are longer, with an average length of 8.43 words, and multiple objects of the same category can appear in a single image.

The UNC dataset is also based on the MS COCO dataset and contains 19,994 images with 142,209 referring expressions for 50,000 objects. It is gathered by a two-player game [11] in which one player interactively annotates the image region according to the expression given by the other player. Both location and appearance words can be used to describe the referred objects.

The UNC+ dataset is similar to the UNC dataset, with a total of 141,564 expressions for 49,856 objects in 19,992 images. The main difference from the UNC dataset is that location information is not allowed when referring to the object of interest. In other words, referring expressions rely purely on appearance and context descriptions to describe the referred objects.

The Referit dataset is built on the IAPR TC-12 dataset [4], which includes masks for stuff, e.g., “sky” and “ground”, in addition to objects. As with the UNC and UNC+ datasets, the referring expressions of Referit are collected by the two-player game. It has 130,525 expressions referring to 96,654 object masks in 19,894 images in total.

VI-B Evaluation Metrics

Following the previous work [18], we use intersection-over-union (IoU) and Precision@X (Prec@X) as the evaluation metrics. IoU is the area of the intersection divided by the area of the union between the ground-truth and the predicted segmentation masks, averaged over all test data. For a more precise comparison, Prec@X is provided to evaluate the detailed contributions in the ablation study. It measures the percentage of test images whose IoU is higher than a threshold X. We report Prec@X over a range of evenly spaced thresholds X.
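The two metrics can be sketched in a few lines of plain Python. This is a minimal illustration rather than the evaluation code used in the experiments: the function names are hypothetical, masks are assumed to be equally sized binary grids, and the mean is taken over per-image IoU values, which is one reading of "averaged over all test data".

```python
from typing import List, Sequence

Mask = List[List[int]]  # binary mask: 1 = referred object, 0 = background


def iou(gt: Mask, pred: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            inter += 1 if (g and p) else 0
            union += 1 if (g or p) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(gts: Sequence[Mask], preds: Sequence[Mask]) -> float:
    """Average of the per-image IoU values over a test set."""
    scores = [iou(g, p) for g, p in zip(gts, preds)]
    return sum(scores) / len(scores)


def prec_at(gts: Sequence[Mask], preds: Sequence[Mask], threshold: float) -> float:
    """Fraction of test images whose IoU is strictly higher than the threshold."""
    scores = [iou(g, p) for g, p in zip(gts, preds)]
    return sum(1 for s in scores if s > threshold) / len(scores)
```

Calling `prec_at` once per threshold gives the row of Prec@X values that an ablation table would report.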

VI-C Implementation Details

Following previous works [18, 28, 15, 36], we use the DeepLab-101 network [2] with weights pre-trained on the Pascal VOC dataset [5] to extract visual features. The input image is resized and zero-padded to a fixed size, and the referring expression is limited to a fixed maximum length. Thanks to dilated convolution, the visual features extracted at the different levels share the same spatial dimensions. The visual features at the different levels are transformed to a fixed channel size using convolution. Every word is first embedded into a fixed-dimensional vector and then passed through a bidirectional LSTM. The language feature for every word is obtained by combining the forward and backward hidden outputs. The cell sizes of E-ConvLSTM and D-ConvLSTM are fixed hyperparameters. Furthermore, we apply DCRF [13], a widely used post-processing operation, to obtain more precise segmentation masks. To minimize the loss function, the network is trained with the Adam optimization algorithm [12] with weight decay. We employ the “poly” strategy [2] to adaptively tune the learning rate.
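For readers unfamiliar with the “poly” strategy of [2], the schedule scales the initial learning rate by (1 - step / max_steps) raised to a power. The sketch below is illustrative only: the function name and every numeric value in the usage note are placeholders, not the settings used in our experiments.

```python
def poly_learning_rate(base_lr: float, step: int, max_steps: int, power: float) -> float:
    """Polynomial ("poly") decay: base_lr * (1 - step / max_steps) ** power."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power
```

With placeholder values, `poly_learning_rate(1.0, 50, 100, 2.0)` returns `0.25`: halfway through training the rate has decayed to a quarter of its initial value when the power is 2.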

Fig. 6: Comparison of segmentation performance of different lengths of referring expressions on Google-Ref.
Fig. 7: Visual comparison of referring image segmentation results: (a) original image; (b,c,d,e,f) referring image segmentation masks from [18], [15], [36] and our network without or with DCRF; (g) ground-truth segmentation mask. The referring expressions for the rows, from top to bottom, are “the white car”, “brown cow”, “smaller container”, “portion of building right of girl’s head”, “a black bowl of vegetable stir fry” and “sky”.
Fig. 8: Visualization of how E-ConvLSTM works for multimodal interaction: (a) original image; (b,c,d) intermediate activations obtained by mean-pooling the hidden state output after each underlined word has been processed by the encoder network; (e) predicted segmentation mask; (f) ground-truth segmentation mask.
Fig. 9: Visualization of the hidden states of D-ConvLSTM as they are gradually refined with low-level features: (a) original image; (b,c,d) intermediate activations obtained by mean-pooling the hidden state outputs; (e) predicted segmentation mask; (f) ground-truth segmentation mask.
Fig. 10: Visualization of the feature representation. These spatial heatmaps show the responses of our network to diverse referring expressions that are not included in the dataset. First row: “chair”, “a chair to the right of a woman”, “a right chair is empty”. Second row: “girl”, “girl on the grass”, “a girl in a yellow shirt sitting on the grass”.
Fig. 11: Comparison of visualization results of our attention model with the attention method used in image captioning. (a) original image; (b) visualized result of the attention model in [34]; (c) visualized result of our E-ConvLSTM; (d) ground-truth segmentation mask. The referring expressions for the rows, from top to bottom, are “broccoli plate back left”, “far left front guy” and “guys shoulder on right”.
Fig. 12: Comparison of visualization results of our network with or without the spatial feature. (a) original image; (b) visualized result without spatial features; (c) visualized result with spatial features; (d) ground-truth segmentation mask. The referring expressions for the rows, from top to bottom, are “bottom hotdog” and “chocolate donut second from right in front row”.

VII Performance Evaluation

We perform an ablation study in Sec. VII-A to evaluate the relative contributions of the various components of the proposed network. We also present quantitative and qualitative comparisons with other state-of-the-art methods in Sec. VII-B and Sec. VII-C, respectively.

VII-A Ablation Study

To verify the effectiveness of each component of our dual convolutional LSTM network, we first conduct ablation experiments on the UNC validation dataset with the following different variants of the proposed method.

  • E-ConvLSTM(w/o word attention): This model does not use the multi-level segment decoder to iteratively refine the segmentation mask. Instead, the segmentation mask is directly predicted from the output of the multimodal feature encoder. This model also does not use word attentions.

  • E-ConvLSTM: This is similar to the previous model, but uses word attentions to adaptively encode the multimodal interaction.

  • E-ConvLSTM + D-ConvLSTM(w/o spatial attention): This is similar to the proposed model, but without the spatial attentions for the multimodal feature maps.

  • E-ConvLSTM + D-ConvLSTM: This is the complete proposed model.

  • E-ConvLSTM + D-ConvLSTM + DCRF: This is our proposed model with DCRF post-processing.

The results of the ablation study are shown in Table I. Introducing word attentions into the cell state of E-ConvLSTM leads to better multimodal interaction in the multimodal feature encoder. In addition, refining the multimodal features at multiple levels with the multi-level segment decoder considerably improves segmentation performance compared with the encoder-only variants shown in the first two rows of Table I. Spatial attention further improves the performance thanks to a stronger feature representation that focuses on important spatial regions. DCRF yields more precise segmentation masks in the final result. In summary, each component of our model contributes to improving the segmentation performance.

VII-B Quantitative Results

We quantitatively compare our model (with or without DCRF post-processing) with state-of-the-art referring image segmentation methods including RMI [18], DMN [23], KWA [28], RRN [15] and CMSA [36] in Table II.

The proposed method consistently outperforms all other methods on all datasets, with or without DCRF post-processing. The performance improvement on Google-Ref is particularly significant. As mentioned in Sec. VI-A, the Google-Ref dataset is much more challenging since it has longer referring expressions and richer descriptions than the other datasets. This demonstrates the importance of the proposed E-ConvLSTM, which adaptively focuses on the more important words during the multimodal interaction and plays an important role in understanding the referring expression and localizing the referred object simultaneously. The other methods may fail to capture long-range dependencies, either because they treat each word equally, such as RMI [18] and RRN [15], or because they lack sequential multimodal interaction, such as KWA [28] and CMSA [36]. In addition, the approaches with multi-level refinement, e.g., RRN [15], CMSA [36] and our model, achieve better results than the other compared methods. This indicates that multi-level features help segment objects with clear boundaries and that our model benefits from the D-ConvLSTM decoder with spatial attentions.

To show the effect of the referring expression length on different models, we follow [18] and evaluate the performance separately for four groups of expressions whose lengths fall in the ranges [1-5], [6-7], [8-10] and [11-20]. As illustrated in Fig. 6, our model outperforms the other methods in all cases. The performance gain is particularly noticeable for longer referring expressions. This again shows the advantage of the proposed E-ConvLSTM encoder, which embeds word attentions in the ConvLSTM cell to adaptively encode the multimodal interaction when understanding longer and more complicated referring expressions.
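The length-based breakdown can be sketched as follows. This is a minimal illustration under stated assumptions: expression length is counted by whitespace tokenization, each sample is an (expression, IoU) pair, and the helper names are hypothetical rather than taken from our code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# The four length ranges used in the comparison (inclusive bounds).
LENGTH_GROUPS: List[Tuple[int, int]] = [(1, 5), (6, 7), (8, 10), (11, 20)]


def length_group(expression: str) -> Optional[Tuple[int, int]]:
    """Return the length range an expression falls into, or None if outside all ranges."""
    num_words = len(expression.split())  # assumption: whitespace tokenization
    for low, high in LENGTH_GROUPS:
        if low <= num_words <= high:
            return (low, high)
    return None


def mean_iou_by_length(samples: Iterable[Tuple[str, float]]) -> Dict[Tuple[int, int], float]:
    """Average the per-sample IoU separately within each length group."""
    buckets: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for expression, score in samples:
        group = length_group(expression)
        if group is not None:
            buckets[group].append(score)
    return {group: sum(scores) / len(scores) for group, scores in buckets.items()}
```

Running this once per model on the same test samples yields one value per group and model, which is the kind of data a grouped comparison such as Fig. 6 is built from.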

VII-C Qualitative Results

We show some qualitative examples of the segmentation masks generated by our approach and compare them with existing state-of-the-art methods in Fig. 7. Our method successfully handles appearance attributes (the 1st and 2nd examples), both when the complete object has high internal contrast (the white car in the 1st example) and when it has low contrast to a neighboring object (the brown cow and the shaded part of the nearby cow in the 2nd example). The compared methods shown in columns (b), (c) and (d) fail to capture the complete objects or to suppress the background clearly. In addition, in the 3rd and 4th examples, our method accurately identifies the objects of interest from complicated referring expressions that involve relative relationships between homogeneous objects (the container in the 3rd example and the portion of building in the 4th example). A complete object with heterogeneous components is well identified in the 5th example, and large sky regions are obtained with clear boundaries in the bottom example, while the other methods may not be able to accurately understand the complex referring expressions and obtain precise referred objects. The DCRF post-processing further refines the segmentation mask as a whole, removing noisy background regions (the 2nd and 6th examples) and recovering precise boundary details (the 3rd and 5th examples).

We also present some visualization examples in Fig. 8 and Fig. 9 to help understand how E-ConvLSTM localizes the referred objects during multimodal interaction and how D-ConvLSTM refines segmentation results with multi-level features, respectively. The reddish blocks under each word represent the word attentions in the expression. The intermediate visualizations are extracted from the hidden state tensors of E-ConvLSTM and D-ConvLSTM, which are mean-pooled along the channel dimension, normalized and resized to the original image resolution. We show the visualizations after each underlined word has been seen for Fig. 8 and after refining the features at each level for Fig. 9. It can be observed that E-ConvLSTM captures the concepts corresponding to the words seen so far, such as “colored” and “umbrella” in the first row and “glass” in the second row of Fig. 8. Once it interacts with more concrete descriptions in the expression, the network pinpoints the correct referred objects. For the examples in Fig. 9, the high-level features are capable of roughly identifying the object of interest, and the response is gradually refined with finer boundaries from the second column to the fourth column. Fig. 10 presents visualization results when our network responds to diverse referring expressions that are not included in the dataset. The purpose of showing these examples is to verify the generalization ability of our network. The visualization results show that our network can effectively handle different concepts appearing in the expressions and accurately identify the referred objects under different semantics.
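The visualization step described above (mean-pooling a hidden state along the channel dimension, then normalizing) can be sketched in plain Python. The sketch assumes a hidden state stored as a C x H x W nested list and min-max normalization to [0, 1]; the text does not specify the normalization, so that choice and the function names are assumptions. Resizing to the image resolution is omitted.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # channels x height x width
Heatmap = List[List[float]]         # height x width


def mean_pool_channels(hidden: Tensor3D) -> Heatmap:
    """Average a C x H x W hidden state over its channel dimension."""
    channels = len(hidden)
    height = len(hidden[0])
    width = len(hidden[0][0])
    return [
        [sum(hidden[c][y][x] for c in range(channels)) / channels for x in range(width)]
        for y in range(height)
    ]


def min_max_normalize(heatmap: Heatmap) -> Heatmap:
    """Rescale a heatmap to [0, 1]; a constant map becomes all zeros."""
    values = [v for row in heatmap for v in row]
    low, high = min(values), max(values)
    if high == low:
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - low) / (high - low) for v in row] for row in heatmap]
```

The normalized map can then be upsampled and overlaid on the input image to produce activation images like those in Fig. 8 and Fig. 9.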

Attention models have been widely used and proven important for image captioning [34, 20], where they relate an active image region to each generated word in the caption. These attention-based methods follow a similar encoder-decoder framework, which encodes the spatial image features and the hidden state of the word feature to produce a visual context, and then decodes the context to predict the next word rather than a segmentation mask as in the referring image segmentation task. These methods can also visualize the image regions related to the word and context information, illustrating which image region the model is focusing on. To compare with the attention model of image captioning for visual and linguistic features, we adopt the soft attention model [34] in our encoder network and show the visualization results in Fig. 11. It can be observed that the regions highlighted by the image captioning attention model, shown in Fig. 11 (b), are more scattered and do not present a clear boundary of the referred object of interest, since this method loses spatial information when producing the visual context. These attention maps might be sufficient for the image captioning task, but they do not provide the precise spatial information needed in the referring image segmentation task. As shown in Fig. 11 (c), our E-ConvLSTM precisely identifies the referred object and suppresses the background regions, since our encoder network keeps the spatial information of the features and embeds the word attention in the multimodal interaction by ConvLSTM to adaptively localize the referred objects.
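To make the loss of spatial information concrete, the sketch below shows a simplified soft attention step in the spirit of [34]. It is an illustration under assumptions: the attention score is a plain dot product between each location feature and a query vector (the original model learns its scoring function), and all names are hypothetical. The point is that the context is a single vector, so the spatial layout of the feature map is collapsed.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features: List[Vector], query: Vector) -> Tuple[Vector, Vector]:
    """Weight N location features by their similarity to the query.

    features: N vectors of length D, one per spatial location (flattened H*W grid).
    query:    vector of length D, e.g. a recurrent hidden state.
    Returns (attention weights over the N locations, context vector of length D).
    """
    # Assumption: dot-product scoring stands in for a learned scoring function.
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context
```

The attention weights can still be reshaped into a coarse map for visualization, but the decoder only receives the pooled context vector, whereas a ConvLSTM hidden state keeps a full H x W layout at every step.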

We further provide qualitative examples to illustrate the effect of the spatial features. Fig. 12 presents visualization results generated by our network with and without the spatial features. As shown in Fig. 12 (b), the network without the spatial features can be confused by objects with similar appearance, while the network with spatial features clearly discriminates them. For example, the referred hotdog is located below the other one in the first row, and the desired donut is identified by its location relative to the other donuts in the second row.

Fig. 13: Some examples of failure cases of our network. (a) original image; (b) predicted segmentation mask; (c) ground-truth segmentation mask. The referring expressions for the rows, from top to bottom, are “a sparrow is sitting along with two others”, “man player on right”, “the bear hiding behind pole” and “left brocoli”.

Some examples of failure cases are presented in Fig. 13. In the first example, our network may have failed to identify the exact object because of language ambiguity: the predicted segmentation mask points to a sparrow different from the one in the ground truth. The second example shows a case where our method correctly understands the referring expression and recognizes the parts of a person, e.g., head, hand and leg, but incorrectly segments the leg on the right, which actually belongs to another person on the left. In the third example, the heavy occlusion of the bear negatively affects our segmentation result. In the last example, the failure is caused by internal gaps within the broccoli, which mislead our network into treating the pieces as separate and segmenting only the left part of the referred broccoli. We plan to explore fine-grained recognition techniques to detect these subtle details in the future.

VIII Conclusion

We present a novel dual convolutional LSTM network for the task of referring image segmentation. The E-ConvLSTM encodes the multimodal features along the word sequence to localize the referred objects, and the D-ConvLSTM decodes the encoded multimodal information at multiple levels for mask refinement. Extensive experiments on four datasets show consistent improvements achieved by the proposed network.


References

  1. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick and D. Parikh (2015) VQA: visual question answering. In IEEE International Conference on Computer Vision, pp. 2425–2433.
  2. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
  3. M. Cheng, N. J. Mitra, X. Huang, P. H. Torr and S. Hu (2014) Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 569–582.
  4. H. J. Escalante, C. A. Hernández, J. A. Gonzalez, A. López-López, M. Montes, E. F. Morales, L. E. Sucar, L. Villaseñor and M. Grubinger (2010) The segmented and annotated IAPR TC-12 benchmark. Computer Vision and Image Understanding 114 (4), pp. 419–428.
  5. M. Everingham, L. Van Gool, C. K.I. Williams, J. Winn and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
  6. A. Gensler, J. Henze, B. Sick and N. Raabe (2016) Deep learning for solar power forecasting - an approach using autoencoder and LSTM neural networks. In IEEE International Conference on Systems, Man, and Cybernetics, pp. 2858–2865.
  7. A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18 (5-6), pp. 602–610.
  8. K. He, G. Gkioxari, P. Dollár and R. Girshick (2017) Mask R-CNN. In IEEE International Conference on Computer Vision, pp. 2961–2969.
  9. S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  10. R. Hu, M. Rohrbach and T. Darrell (2016) Segmentation from natural language expressions. In European Conference on Computer Vision, pp. 108–124.
  11. S. Kazemzadeh, V. Ordonez, M. Matten and T. Berg (2014) ReferItGame: referring to objects in photographs of natural scenes. In Conference on Empirical Methods in Natural Language Processing, pp. 787–798.
  12. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
  13. P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pp. 109–117.
  14. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
  15. R. Li, K. Li, Y. Kuo, M. Shu, X. Qi, X. Shen and J. Jia (2018) Referring image segmentation via recurrent refinement networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5745–5753.
  16. Y. Li, X. Hou, C. Koch, J. M. Rehg and A. L. Yuille (2014) The secrets of salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 280–287.
  17. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  18. C. Liu, Z. Lin, X. Shen, J. Yang, X. Lu and A. L. Yuille (2017) Recurrent multimodal interaction for referring image segmentation. In IEEE International Conference on Computer Vision, pp. 1271–1280.
  19. Z. Liu, R. Shi, L. Shen, Y. Xue, K. N. Ngan and Z. Zhang (2012) Unsupervised salient object segmentation based on kernel density estimation and two-phase graph cut. IEEE Transactions on Multimedia 14 (4), pp. 1275–1289.
  20. J. Lu, C. Xiong, D. Parikh and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 375–383.
  21. J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille and K. Murphy (2016) Generation and comprehension of unambiguous object descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20.
  22. E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini and B. Schuller (2014) Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2164–2168.
  23. E. A. Margffoy-Tuay, J. C. Pérez, E. Botero and P. Arbeláez (2018) Dynamic multimodal instance segmentation guided by natural language queries. In European Conference on Computer Vision, pp. 630–645.
  24. C. Peng, X. Zhang, G. Yu, G. Luo and J. Sun (2017) Large kernel matters–improve semantic segmentation by global convolutional network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4353–4361.
  25. A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell and B. Schiele (2016) Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pp. 817–834.
  26. C. Rother, V. Kolmogorov and A. Blake (2004) GrabCut: interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics, Vol. 23, pp. 309–314.
  27. E. Shelhamer, J. Long and T. Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 640–651.
  28. H. Shi, H. Li, F. Meng and Q. Wu (2018) Key-word-aware network for referring expression image segmentation. In European Conference on Computer Vision, pp. 38–54.
  29. X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pp. 802–810.
  30. N. Srivastava, E. Mansimov and R. Salakhudinov (2015) Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843–852.
  31. M. Sundermeyer, T. Alkhouli, J. Wuebker and H. Ney (2014) Translation modeling with bidirectional recurrent neural networks. In Conference on Empirical Methods in Natural Language Processing, pp. 14–25.
  32. I. Sutskever, O. Vinyals and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
  33. G. Underwood, L. Jebbett and K. Roberts (2004) Inspecting pictures for information to verify a sentence: eye movements in general encoding and in focused search. The Quarterly Journal of Experimental Psychology Section A 57 (1), pp. 165–182.
  34. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.
  35. L. Ye, Z. Liu, L. Li, L. Shen, C. Bai and Y. Wang (2017) Salient object segmentation via effective integration of saliency and objectness. IEEE Transactions on Multimedia 19 (8), pp. 1742–1756.
  36. L. Ye, M. Rochan, Z. Liu and Y. Wang (2019) Cross-modal self-attention network for referring image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 10502–10511.
  37. L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal and T. L. Berg (2018) MAttNet: modular attention network for referring expression comprehension. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315.
  38. L. Yu, P. Poirson, S. Yang, A. C. Berg and T. L. Berg (2016) Modeling context in referring expressions. In European Conference on Computer Vision, pp. 69–85.
  39. S. Zhang, G. Wu, J. P. Costeira and J. M. Moura (2017) FCN-rLSTM: deep spatio-temporal neural networks for vehicle counting in city cameras. In IEEE International Conference on Computer Vision, pp. 3667–3676.
  40. S. Zhang, X. Liu and J. Xiao (2017) On geometric features for skeleton-based action recognition using multilayer LSTM networks. In IEEE Winter Conference on Applications of Computer Vision, pp. 148–157.