Multimodal Unified Attention Networks for Vision-and-Language Interactions

Abstract

Learning an effective attention mechanism for multimodal data is important in many vision-and-language tasks that require a synergic understanding of both the visual and textual contents. Existing state-of-the-art approaches use co-attention models to associate each visual object (e.g., image region) with each textual object (e.g., query word). Despite the success of these co-attention models, they only model inter-modal interactions while neglecting intra-modal interactions. Here we propose a general ‘unified attention’ model that simultaneously captures the intra- and inter-modal interactions of multimodal features and outputs their corresponding attended representations. By stacking such unified attention blocks in depth, we obtain the deep Multimodal Unified Attention Network (MUAN), which can seamlessly be applied to the visual question answering (VQA) and visual grounding tasks. We evaluate our MUAN models on two VQA datasets and three visual grounding datasets, and the results show that MUAN achieves top level performance on both tasks without bells and whistles.

Multimodal learning, visual question answering (VQA), visual grounding, unified attention, deep learning.

I Introduction

Deep learning in computer vision and natural language processing has facilitated recent advances in artificial intelligence. Such advances drive research interest in multimodal learning tasks lying at the intersection of vision and language, such as multimodal embedding learning [67][60][55], visual captioning [54][50], visual question answering (VQA) [2], and visual grounding [45]. In these tasks, learning a fine-grained semantic understanding of both the visual and textual content is key to performance.

The attention mechanism is a predominant focus of recent deep learning research. It aims to focus on certain data elements, and aggregate essential information to obtain a more discriminative local representation [4, 54]. This mechanism has improved the performance of a wide range of unimodal learning tasks (e.g., vision [38, 14, 7], language [36, 10, 46]) in conjunction with deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

Fig. 1: Schematic of the proposed unified attention, which simultaneously models inter- and intra-modal interactions in a single framework. Given the multimodal inputs $X$ (visual) and $Y$ (textual), $X\rightarrow X$ and $Y\rightarrow Y$ denote the intra-modal interactions within each modality, while $X\rightarrow Y$ and $Y\rightarrow X$ denote the inter-modal interactions across the two modalities. $\hat{X}$ and $\hat{Y}$ are the attended features for $X$ and $Y$, respectively.

For the multimodal learning tasks described above, attention learning considers the inputs from both the visual and textual modalities. Taking the VQA problem in Fig. 1 as an example, to correctly answer a question like ‘How many people are catching the frisbee’ for an image, the attention model should ideally learn to focus on particular image regions (i.e., the person near the frisbee). Such visual attention based models have become an integral component in many multimodal tasks that require fine-grained visual understanding [54][56][11]. Beyond the visual attention models, recent studies have introduced co-attention models, which simultaneously learn the visual attention and textual attention to benefit from fine-grained representations for both modalities. Early approaches learned separate attention distributions for each modality in an iterative manner, neglecting the dense interaction between each question word and image region [35][62]. To address this problem, dense co-attention models have been proposed to capture complete interactions between word-region pairs, which are further extended to form deep co-attention models [26][39].

Despite the success of the co-attention models in multimodal learning tasks, these models only consider inter-modal interactions (i.e., $X\rightarrow Y$ or $Y\rightarrow X$ in Fig. 1) while neglecting intra-modal ones (i.e., $X\rightarrow X$ and $Y\rightarrow Y$). On the other hand, modeling intra-modal interactions has proven beneficial for many unimodal learning tasks [52][9][53][18]. We argue that the intra-modal interactions within each modality provide complementary and important information to the inter-modal interactions.

Inspired by the well-known self-attention model [52] in the NLP community, we extend its idea to multimodal data and propose a unified attention model accordingly. Our unified attention model characterizes the intra- and inter-modal interactions jointly in a unified framework, which we call the unified attention (UA) block (see Fig. 1). The attention map learned by the UA block includes four kinds of relationships: the inter-modal interactions ($X\rightarrow Y$ and $Y\rightarrow X$) build co-attention across different modalities, and the intra-modal interactions ($X\rightarrow X$ and $Y\rightarrow Y$) build self-attention within each modality. The learned unified attention is further used to obtain the attended output features for the multimodal inputs. By stacking such UA blocks in depth, we obtain the Multimodal Unified Attention Network (MUAN), which can be trained in an end-to-end manner to perform deep multimodal reasoning.

To evaluate the effectiveness of our proposed MUAN model, we apply it to VQA and visual grounding. The quantitative and qualitative results on two VQA datasets, VQA-v2 [13] and CLEVR [23], and three visual grounding datasets, RefCOCO [25], RefCOCO+ [25] and RefCOCOg [37], show that MUAN achieves top-level performance on both tasks without any dataset-specific model tuning.

In summary, we have made the following contributions in this study:

  • We extend the self-attention model for a single modality to a unified attention model, which can characterize the intra- and inter-modal interactions of multimodal data. By stacking such unified attention models (i.e., UA blocks) in depth, we obtain a neat Multimodal Unified Attention Network (MUAN), which can perform accurate multimodal reasoning.

  • We modify the original self-attention model into a gated self-attention (GSA) model as the basic component of the UA block, which facilitates more accurate and robust attention learning and leads to more discriminative features for specific tasks.

  • We apply MUAN to two multimodal learning tasks, namely VQA and visual grounding. The results on five benchmark datasets show the superiority of MUAN over existing state-of-the-art approaches.

II Related Work

We briefly review existing studies on VQA and visual grounding, and establish a connection between these two tasks by attention learning.

Visual Question Answering (VQA). VQA aims to answer a question posed in natural language about a given image, and thus requires reasoning over multimodal inputs. Since Antol et al. presented a large-scale VQA benchmark dataset with free-form questions [2], multimodal fusion and attention learning have become two major research focuses for VQA. For multimodal fusion, early methods used simple concatenation or element-wise multiplication of the multimodal features [68][2]. Fukui et al. [11], Kim et al. [27], Yu et al. [62] and Ben-Younes et al. [5] proposed different approximated bilinear pooling methods to effectively integrate the multimodal features with second-order feature interactions. For attention learning, question-guided visual attention on image regions has become the de-facto component of many VQA approaches [56][6]. Chen et al. proposed a question-guided attention map that projects the question embeddings to the visual space and formulates a configurable convolutional kernel to search the image attention region [6]. Yang et al. proposed a stacked attention network to learn the attention iteratively [56]. Some approaches introduce off-the-shelf object detectors [22] or object proposals [48] as the candidate attention regions and then use the question to identify the relevant ones. Taking this further, co-attention models that consider both textual and visual attention have been proposed [35][62]. Lu et al. proposed a co-attention learning framework to alternately learn the image attention and the question attention [35]. Yu et al. reduced co-attention learning to two steps: self-attention for the question embedding and question-conditioned attention for the visual embedding [63]. The co-attentions learned by these approaches are coarse, in that they neglect the interaction between each question word and each image region. To address this issue, Nguyen et al. [39] and Kim et al. [26] introduced dense co-attention models that establish complete interactions between each question word and each image region.

Visual Grounding. Visual grounding (a.k.a. referring expression comprehension) aims to localize the object in an image referred to by a query text. Most previous approaches follow a two-stage pipeline [45][59][11]: 1) use an off-the-shelf object detector, such as EdgeBoxes [70] or Faster R-CNN [16], to generate a set of region proposals along with their proposal features for the input image; and 2) compute a matching score between each proposal feature and the query feature, and adopt the proposal (or its refined bounding box [64]) with the highest score as the referent. From the attention learning point of view, visual grounding is a task of learning query-guided attention on the image region proposals. The aforementioned two-stage approaches are analogous to the visual attention models in VQA. Yu et al. [57], Zhuang et al. [69] and Deng et al. [8] also modeled the attention on query words along with the visual attention, providing a connection to the co-attention models in VQA.

Joint Modeling of Self- and Co-Attention. Although self-attention and co-attention have been extensively studied in existing multimodal learning methods, the two kinds of attention are usually considered separately. To the best of our knowledge, only a few attempts have modeled intra- and inter-modal interactions jointly. Li et al. introduced a video QA approach that uses self-attention to learn the intra-modal interactions of the video and question modalities respectively, and then feeds them through a co-attention block to model the inter-modal interactions [30]. Gao et al. presented a dynamic fusion framework for VQA that models intra- and inter-modal attention in separate blocks [40]. Yu et al. applied a modular co-attention network to VQA, which stacks multiple self-attention and guided-attention blocks in depth to perform deep visual reasoning [61]. In summary, all these methods model self-attention and co-attention in two sequential stages, which is sub-optimal and may result in serious information loss. This inspires us to design a general unified attention framework that models the two kinds of attention simultaneously in one stage.

III Multimodal Unified Attention

In this section, we introduce the multimodal unified attention, which is the basic component of our Multimodal Unified Attention Network (MUAN). Taking the multimodal input features $X$ from the image modality and $Y$ from the text modality, the unified attention outputs their corresponding attended features. In contrast to existing visual attention methods, which model unidirectional inter-modal interactions (i.e., $Y\rightarrow X$ only) [11][27], or the co-attention methods, which model bidirectional inter-modal interactions (i.e., $X\rightarrow Y$ and $Y\rightarrow X$) [26][39], our unified attention models the intra-modal and inter-modal interactions simultaneously (i.e., $X\rightarrow X$, $Y\rightarrow Y$, $X\rightarrow Y$ and $Y\rightarrow X$) in one general framework.

Inspired by the self-attention model, which has achieved remarkable performance in natural language processing [52][43][9], we design a unified attention model for multimodal data. Furthermore, to obtain a more accurate attention map during unified attention learning, we introduce a bilinear pooling based gating model to reweight the importance of the input features, which can, to some extent, suppress irrelevant or noisy features.

III-A Gated Self-Attention

The self-attention model proposed in [52] takes a group of input features $X \in \mathbb{R}^{m\times d_x}$ and outputs a group of attended features $F \in \mathbb{R}^{m\times d}$, where $m$ is the number of samples, and $d_x$ and $d$ are the dimensionalities of the input and output features, respectively. To achieve this goal, $X$ is first fed into three independent fully-connected layers:

$Q = XW_q,\quad K = XW_k,\quad V = XW_v$   (1)

where $Q, K, V \in \mathbb{R}^{m\times d}$ are three feature matrices of the same shape, corresponding to the queries, keys, and values, respectively.

Given a query $q \in \mathbb{R}^{d}$ and all keys $K$, we calculate the dot products of $q$ with each key, divide each by a scaling factor $\sqrt{d}$, and apply the softmax function to obtain the attention weights on the values. In practice, the attention function can be computed on all queries simultaneously, and in doing so we obtain the output features as follows:

$A = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)$   (2)
$F = AV$   (3)

where $A \in \mathbb{R}^{m\times m}$ is the attention map containing the attention weights for all query-key pairs, and the output features $F$ are the weighted summations of the values $V$ determined by $A$.
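
As a concrete reference point, the scaled dot-product self-attention of Eqs. (1)-(3) can be sketched in a few lines of PyTorch. The single-head, batch-free form below is an illustrative assumption; the layer and variable names are ours and not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal scaled dot-product self-attention (Eqs. 1-3), single head, no batching."""
    def __init__(self, d_x: int, d: int):
        super().__init__()
        # Three independent fully-connected layers producing queries, keys and values (Eq. 1).
        self.w_q, self.w_k, self.w_v = nn.Linear(d_x, d), nn.Linear(d_x, d), nn.Linear(d_x, d)
        self.d = d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)                    # x: [m, d_x] -> [m, d]
        att = F.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)   # attention map A (Eq. 2)
        return att @ v                                                      # weighted sum of values (Eq. 3)
```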

Learning an accurate attention map is crucial for self-attention learning. The scaled dot-product attention in Eq.(2) models the relationship between feature pairs. However, the importance of each individual feature is not explicitly considered during attention learning. Consequently, irrelevant or noisy features may have a negative impact on the attention map, resulting in inaccurate output features. To address this problem, we introduce a novel gating model into Eq.(2) to improve the quality of the learned attention. Inspired by the bilinear pooling models that have been used in fine-grained visual recognition [31] and multimodal fusion [27], we design a gating model based on low-rank bilinear pooling to reweight the features of $Q$ and $K$ before their scaled dot-products:

$M = [M_q, M_k] = \sigma\big((QU_q \odot KU_k)\,U_m\big)$   (4)

where $U_q \in \mathbb{R}^{d\times d_k}$, $U_k \in \mathbb{R}^{d\times d_k}$ and $U_m \in \mathbb{R}^{d_k\times 2}$ are three independent fully-connected layers, and $d_k$ is the dimensionality of the projected space. $\odot$ denotes the element-wise product and $\sigma(\cdot)$ the sigmoid function. $M \in \mathbb{R}^{m\times 2}$ corresponds to the two masks $M_q \in \mathbb{R}^{m\times 1}$ and $M_k \in \mathbb{R}^{m\times 1}$ for the features $Q$ and $K$, respectively.

The two learned masks $M_q$ and $M_k$ are tiled to $\mathbb{R}^{m\times d}$ and then used to formulate the gated self-attention (GSA) model as follows:

$A = \mathrm{softmax}\left(\frac{(M_q \odot Q)(M_k \odot K)^{\top}}{\sqrt{d}}\right)$   (5)
$F = AV$   (6)

Fig. 2: Flowcharts of (a) the Gated Self-Attention (GSA) model and (b) the unified attention (UA) block for multimodal data.

Fig. 2(a) illustrates the flowchart of our gated self-attention model. Similar to [52], the multi-head strategy is adopted in our model to attain more diverse attention.
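
The gating in Eqs. (4)-(6) can be sketched as below, extending the self-attention sketch above. The shapes follow the text (three fully-connected layers, a $d_k$-dimensional projected space, a sigmoid producing two masks); the exact layer names are assumptions, and the multi-head strategy and attention masking are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfAttention(nn.Module):
    """Gated self-attention (GSA) sketch: Q and K are reweighted by low-rank bilinear gates."""
    def __init__(self, d_x: int, d: int, d_k: int):
        super().__init__()
        self.w_q, self.w_k, self.w_v = nn.Linear(d_x, d), nn.Linear(d_x, d), nn.Linear(d_x, d)
        # Low-rank bilinear gating model (Eq. 4): project Q and K to d_k dims, fuse, map to 2 gates.
        self.u_q, self.u_k, self.u_m = nn.Linear(d, d_k), nn.Linear(d, d_k), nn.Linear(d_k, 2)
        self.d = d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)              # [m, d] each
        gates = torch.sigmoid(self.u_m(self.u_q(q) * self.u_k(k)))   # [m, 2], Eq. (4)
        m_q, m_k = gates[:, 0:1], gates[:, 1:2]                      # masks; broadcasting plays the role of tiling
        att = F.softmax((m_q * q) @ (m_k * k).transpose(-2, -1) / self.d ** 0.5, dim=-1)  # Eq. (5)
        return att @ v                                               # Eq. (6)
```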

III-B Unified Attention Block

Based on the gated self-attention model above, we introduce the multimodal unified attention block, which simultaneously models intra- and inter-modal interactions.

Given a group of textual features $Y \in \mathbb{R}^{n\times d_y}$ (e.g., question words) and a group of visual features $X \in \mathbb{R}^{m\times d_x}$ (e.g., image regions), we first learn two fully-connected layers $W_y \in \mathbb{R}^{d_y\times d}$ and $W_x \in \mathbb{R}^{d_x\times d}$ to embed $Y$ and $X$ into a $d$-dimensional common space, and then concatenate the two groups of embedded features on rows to form a unified feature matrix $Z$:

$Z = [\,YW_y;\ XW_x\,]$   (7)

where $Z \in \mathbb{R}^{(n+m)\times d}$ (see footnote 1).

Fig. 3: Architectures of the Multimodal Unified Attention Networks (MUAN) for visual question answering (left) and visual grounding (right), respectively. Both architectures contain a MUAN-$L$ model, which consists of $L$ stacked UA blocks that output the attended features via unified attention learning. For VQA, we add a dummy token at the beginning of the question, and use its attended feature to predict the answer. For visual grounding, the attended features of the region proposals are used to predict their ranking scores and refined bounding boxes.

The UA block (see Fig. 2(b)) consists of a gated self-attention (GSA) module and a feed-forward network (FFN) module. Taking the unified feature matrix $Z$ as input, the GSA module learns the pairwise interactions between the sample pairs within $Z$. Since the paired samples may come from different (or the same) modalities, the intra- and inter-modal relationships are represented at the same time. Compared to existing co-attention models, which only model the inter-modal relationships [26][39], the intra-modal relationships (e.g., word-to-word or region-to-region) are also important for understanding the intrinsic structure within each modality, thus facilitating more accurate visual reasoning. The FFN module takes the output features of the GSA module as input, and then transforms them through two consecutive fully-connected layers (FC($4d$)-ReLU-Drop(0.1)-FC($d$)). To simplify optimization, shortcut connections [16] and layer normalization [3] are applied after the GSA and FFN modules. It is worth noting that the final output features of the UA block are of the same shape as the input features $Z$, making it possible to stack multiple UA blocks in depth (see footnote 2).
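
Under the same assumptions as the sketches above (and reusing the GatedSelfAttention class from Section III-A), one UA block can be written roughly as follows; the $4d$ expansion, dropout rate and post-module layer normalization follow the description in the text.

```python
import torch
import torch.nn as nn

class UABlock(nn.Module):
    """One unified attention (UA) block: GSA + FFN, each with a shortcut and layer normalization."""
    def __init__(self, d: int, d_k: int, dropout: float = 0.1):
        super().__init__()
        self.gsa = GatedSelfAttention(d, d, d_k)          # sketch from Section III-A
        self.ffn = nn.Sequential(                         # FC(4d)-ReLU-Drop(0.1)-FC(d)
            nn.Linear(d, 4 * d), nn.ReLU(inplace=True),
            nn.Dropout(dropout), nn.Linear(4 * d, d),
        )
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: [n + m, d] unified feature matrix from Eq. (7); the output keeps the same shape.
        z = self.norm1(z + self.gsa(z))
        z = self.norm2(z + self.ffn(z))
        return z
```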

IV Multimodal Unified Attention Networks

In this section, we describe the MUAN architectures for VQA and visual grounding (see Fig. 3). The core component of both models is the deep MUAN-$L$ model, which consists of $L$ UA blocks stacked in depth to perform deep multimodal reasoning and attentional feature transformation. The proposed VQA model and visual grounding model are very similar to each other, except for the input feature representations and the loss functions used during model training. We therefore highlight these two parts for each model.

IV-A Architecture for VQA

Image and Question Representations. The inputs for VQA consist of an image and a question, and the goal is to predict an answer to the question. Our model first extracts representations for the image and the question, and then feeds the multimodal features into the MUAN model to obtain their corresponding output features with unified attention learning. Finally, one of the attended features is fed to a multi-label classifier to predict the correct answer.

The input question is first tokenized into a sequence of words, and then trimmed (or zero-padded) to a maximum length of $n$. Similar to [9], we add a dummy token at the beginning of the question, and the attended feature of this token is used to predict the answer. The words are first represented as one-hot vectors and then transformed into 300-D word embeddings using the pre-trained GloVe model [41]. Finally, the word embeddings are fed into a one-layer LSTM network [17] with $d_y$ hidden units, resulting in the final question feature $Y \in \mathbb{R}^{n\times d_y}$. The input image is represented as a group of $d_x$-dimensional visual features extracted from a pre-trained CNN model [16] or a pre-trained object detector [1]. This results in the image feature $X \in \mathbb{R}^{m\times d_x}$, where $m$ is the number of extracted features.

Note that we mask the zero-padded features during attention learning to make their attention weights all zero.
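
One common way to realize this masking (an assumption of ours, not a detail given in the paper) is to set the attention logits of the padded key positions to $-\infty$ before the softmax, so that their attention weights become exactly zero:

```python
import torch
import torch.nn.functional as F

def masked_softmax(logits: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
    """logits: [m, m] raw attention scores; pad_mask: [m] boolean, True at zero-padded positions."""
    logits = logits.masked_fill(pad_mask.unsqueeze(0), float('-inf'))  # mask padded key columns
    return F.softmax(logits, dim=-1)                                   # padded positions receive weight 0
```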

MUAN-$L$. The multimodal features $Y$ and $X$ are fed into a deep MUAN-$L$ model consisting of $L$ UA blocks $[\mathrm{UA}^{(1)}, \mathrm{UA}^{(2)}, \ldots, \mathrm{UA}^{(L)}]$. For $\mathrm{UA}^{(1)}$, $Y$ and $X$ are integrated by Eq.(7) to obtain the initialized unified features $Z^{(0)}$, which are then fed through the UA blocks in a recursive manner:

$Z^{(i)} = \mathrm{UA}^{(i)}\big(Z^{(i-1)}\big)$   (8)

where $i \in \{1, 2, \ldots, L\}$. Note that the final output features $Z^{(L)}$ are of the same shape as the input features $Z^{(0)}$, and each paired $\{z^{(0)}_j, z^{(L)}_j\}$ has a one-to-one correspondence.
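
Putting the pieces together, the MUAN-$L$ backbone of Eqs. (7)-(8) amounts to embedding the two feature groups into the common space, concatenating them on rows, and running the stack of UA blocks. The sketch below reuses the UABlock class above; the embedding layers are kept for clarity even though footnote 1 notes they can be omitted when the input dimensionalities already equal $d$.

```python
import torch
import torch.nn as nn

class MUAN(nn.Module):
    """MUAN-L backbone sketch: unified feature matrix (Eq. 7) passed through L UA blocks (Eq. 8)."""
    def __init__(self, d_y: int, d_x: int, d: int, d_k: int, num_blocks: int):
        super().__init__()
        self.embed_y = nn.Linear(d_y, d)   # textual embedding W_y
        self.embed_x = nn.Linear(d_x, d)   # visual embedding W_x
        self.blocks = nn.ModuleList([UABlock(d, d_k) for _ in range(num_blocks)])

    def forward(self, y: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # y: [n, d_y] question features, x: [m, d_x] image-region features.
        z = torch.cat([self.embed_y(y), self.embed_x(x)], dim=0)   # Eq. (7): [n + m, d]
        for block in self.blocks:                                   # Eq. (8), i = 1..L
            z = block(z)
        return z   # Z^(L): attended features, same shape as Z^(0)
```
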
Answer Prediction. Using the attended features $Z^{(L)}$ from MUAN-$L$, we project the first feature $z^{(L)}_1$ (the one corresponding to the dummy token) into a vector $\hat{s} \in \mathbb{R}^{N}$, where $N$ corresponds to the size of the answer vocabulary.

For the datasets that provide multiple answers to each question, we follow the strategy in [51] and use the binary cross-entropy (BCE) loss to train an $N$-way classifier with respect to the ground-truth label $s \in [0, 1]^N$:

$\mathcal{L}_{\mathrm{bce}} = -\sum_{i=1}^{N}\Big[s_i \log \sigma(\hat{s}_i) + (1 - s_i)\log\big(1 - \sigma(\hat{s}_i)\big)\Big]$   (9)

where $\sigma(\cdot)$ is the sigmoid activation function.

For the datasets that have exactly one answer to each question, we use the softmax cross-entropy loss to train the model with respect to the one-hot ground-truth label $s \in \{0, 1\}^N$:

$\mathcal{L}_{\mathrm{ce}} = -\sum_{i=1}^{N} s_i \log \frac{\exp(\hat{s}_i)}{\sum_{j=1}^{N}\exp(\hat{s}_j)}$   (10)
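
A compact sketch of this prediction head is given below. The answer-vocabulary size and the feature dimensionality are illustrative placeholders, and PyTorch's numerically stable loss functions stand in for Eqs. (9) and (10).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, num_answers = 768, 3000                     # illustrative sizes; num_answers plays the role of N
classifier = nn.Linear(d, num_answers)

def vqa_loss(z_final: torch.Tensor, target: torch.Tensor, multi_label: bool = True) -> torch.Tensor:
    """z_final: [n + m, d] output of MUAN-L; target: [N] soft labels, or a class-index tensor."""
    logits = classifier(z_final[0])            # project the dummy-token feature to answer logits
    if multi_label:                            # Eq. (9): BCE against soft labels in [0, 1]
        return F.binary_cross_entropy_with_logits(logits, target)
    return F.cross_entropy(logits.unsqueeze(0), target.view(1))   # Eq. (10): softmax cross-entropy
```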

IV-B Architecture for Visual Grounding

The inputs for visual grounding consist of an image and a query. Similar to the VQA architecture above, we extract the query features using GloVe embeddings followed by an LSTM network, and extract the region-based proposal features for the image using a pre-trained object detector. Note that the dummy token, which is specially designed for VQA, is not used for visual grounding.

The multimodal input features are integrated and transformed by MUAN-$L$ to output their attended representations. On top of the attended feature of each region proposal, we append two fully-connected layers that project the feature into a score $\hat{s}_i$ and a 4-D vector $\hat{b}_i$ that regresses the refined bounding-box coordinates of the proposal, respectively:

$\hat{s}_i = \mathrm{FC}_s\big(z^{(L)}_i\big), \qquad \hat{b}_i = \mathrm{FC}_b\big(z^{(L)}_i\big)$   (11)

Accordingly, a ranking loss and a regression loss are designed to optimize the model in a multitask learning manner. Following the strategy in [64], KL-divergence is used as the ranking loss:

$\mathcal{L}_{\mathrm{rank}} = \mathrm{KL}\big(s \,\|\, \hat{s}\big) = \sum_{i=1}^{m} s_i \log \frac{s_i}{\hat{s}_i}$   (12)

where $\hat{s} = [\hat{s}_1, \ldots, \hat{s}_m]$ are the predicted scores of the $m$ proposals. The ground-truth label $s$ is obtained by calculating the IoU scores of all proposals w.r.t. the unique ground-truth bounding box: the IoU score of the $i$-th proposal is assigned to $s_i$ if it is larger than a threshold, and $s_i$ is set to 0 otherwise. Softmax normalizations are respectively applied to $s$ and $\hat{s}$ to make each of them form a score distribution.

The smoothed $L_1$ loss [12] is used as the regression loss to penalize the differences between the refined bounding boxes and the ground-truth bounding box:

$\mathcal{L}_{\mathrm{reg}} = \sum_{i=1}^{m} \mathrm{smooth}_{L_1}\big(\hat{b}_i - b_i\big)$   (13)

where $\hat{b}_i$ and $b_i$ correspond to the coordinates of the predicted bounding box and the ground-truth bounding box for the $i$-th proposal, respectively.

By combining the two terms, we obtain the overall loss function as follows:

$\mathcal{L} = \mathcal{L}_{\mathrm{rank}} + \lambda \mathcal{L}_{\mathrm{reg}}$   (14)

where $\lambda$ is a hyper-parameter that balances the two terms.
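
The grounding head and its loss can be sketched as follows. The two fully-connected heads, the softmax normalization of both score vectors, the KL-divergence ranking term and the smooth-$L_1$ regression term follow the description above; the layer names, the feature dimensionality, and the exact reduction modes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 768                                        # illustrative attended-feature dimensionality
score_fc = nn.Linear(d, 1)                     # Eq. (11): attended proposal feature -> ranking score
bbox_fc = nn.Linear(d, 4)                      # Eq. (11): attended proposal feature -> box refinement

def grounding_loss(z_props, iou_labels, gt_boxes, lam=0.5):
    """z_props: [m, d] attended proposal features; iou_labels: [m] thresholded IoU scores;
    gt_boxes: [m, 4] ground-truth box targets; lam: the balancing weight in Eq. (14)."""
    scores = score_fc(z_props).squeeze(-1)                          # [m]
    boxes = bbox_fc(z_props)                                        # [m, 4]
    log_p = F.log_softmax(scores, dim=-1)                           # predicted score distribution
    q = F.softmax(iou_labels, dim=-1)                               # ground-truth score distribution
    rank_loss = F.kl_div(log_p, q, reduction='sum')                 # Eq. (12)
    reg_loss = F.smooth_l1_loss(boxes, gt_boxes, reduction='mean')  # Eq. (13)
    return rank_loss + lam * reg_loss                               # Eq. (14)
```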

V Experiments

In this section, we conduct experiments to evaluate the performance of the MUAN models in VQA and visual grounding tasks. We conduct extensive ablation experiments to explore the effect of different hyper-parameters in MUAN. Finally, we compare the best MUAN models to current state-of-the-art methods on five benchmark datasets (two VQA datasets and three visual grounding datasets).

Fig. 4: Typical examples from (a) VQA-v2, (b) CLEVR, (c) RefCOCO, (d) RefCOCO+, and (e) RefCOCOg.

V-A Datasets

VQA-v2 is a commonly used benchmark dataset for open-ended VQA [13]. It contains human-annotated question-answer pairs for MS-COCO images [32]. The dataset is split into three subsets: train (80k images with 444k questions); val (40k images with 214k questions); and test (80k images with 448k questions). The test subset is further split into test-dev and test-std sets that are evaluated online with a limited number of attempts. For each question, multiple answers are provided by different annotators. To evaluate the performance of a model with respect to such multi-label answers, an accuracy-based evaluation metric that is robust to inter-human variability in phrasing the answers is defined as follows [2]:

$\mathrm{Acc}(a) = \min\left(1, \frac{\#\{\text{annotators that provided answer } a\}}{3}\right)$   (15)

where $\#\{\cdot\}$ counts the number of annotators who voted for the answer $a$.
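
In other words, a predicted answer is counted as fully correct when at least three annotators gave it, and as partially correct otherwise. A minimal reference implementation of this formula (ignoring the answer-string normalization and the averaging over annotator subsets used by the official evaluation tool) is:

```python
def vqa_accuracy(predicted_answer: str, human_answers: list) -> float:
    """Eq. (15): accuracy of a single predicted answer against the annotators' answers."""
    votes = sum(1 for a in human_answers if a == predicted_answer)
    return min(1.0, votes / 3.0)

# Example: 10 annotators, 4 of whom answered "2" -> accuracy 1.0 for the prediction "2".
print(vqa_accuracy("2", ["2", "2", "2", "2", "3", "3", "4", "2 people", "two", "3"]))
```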

CLEVR is a synthetic dataset containing 100k images and 853k questions [23]. Each image contains 3D-rendered objects and is associated with a number of questions that test various aspects of visual reasoning, including attribute identification, object counting, and logical operations. The whole dataset is split into three subsets: train (70k images with 700k questions), val (15k images with 150k questions) and test (15k images with 150k questions). Each question is associated with exactly one answer, and the standard accuracy metric is used to evaluate model performance.

RefCOCO, RefCOCO+, and RefCOCOg are three datasets for evaluating visual grounding performance. All three datasets are collected from MS-COCO images [32], but their queries differ in three respects: 1) RefCOCO [25] and RefCOCO+ [25] contain short queries (3.6 words on average), while RefCOCOg [37] contains relatively long queries (8.4 words on average); 2) RefCOCO and RefCOCO+ contain 3.9 same-type objects per image on average, while in RefCOCOg this number is 1.6; and 3) RefCOCO+ does not contain any location words, while the other two datasets do not have this constraint. RefCOCO and RefCOCO+ are split into four subsets: train (120k queries), val (11k queries), testA (6k queries about people), and testB (5k queries about objects). RefCOCOg is split into three subsets: train (81k queries), val (5k queries), and test (10k queries). For all three datasets, accuracy is adopted as the evaluation metric, defined as the percentage of predicted bounding boxes that overlap with the ground-truth bounding box by IoU $\geq$ 0.5.
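
For reference, the IoU criterion used by this metric can be computed as below for axis-aligned boxes in [x1, y1, x2, y2] format; this is a generic helper of ours, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts as correct when iou(pred_box, gt_box) >= 0.5.
```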

Fig. 4 shows some typical examples from these datasets.

V-B Experimental Setup

Universal Setup. We use the following hyper-parameters as the default settings for MUAN unless otherwise noted. In each UA block, the latent dimensionality $d$ is 768 and the number of heads $h$ is 8, so the dimensionality of each head is $d_h = d/h = 96$. The latent dimensionality $d_k$ in the gating model is 96. The number of UA blocks $L$ ranges from 2 to 12.

All the models are optimized using the Adam solver [28]. The models (except those for CLEVR) are trained for up to 13 epochs with a batch size of 64 and a fixed base learning rate. Similar to [26], the learning rate is warmed up for 3 epochs and decays by 1/5 every 2 epochs after 10 epochs. We report the best results evaluated on the validation set. For CLEVR, a smaller base learning rate is used to train for up to 20 epochs, with decays by 1/5 at the 16th and 18th epochs, respectively.
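
Under one plausible reading of this schedule (linear warm-up over the first three epochs, then a step decay by a factor of 1/5 every two epochs once epoch 10 is passed), the per-epoch learning rate for the VQA-v2 models could be sketched as follows; the base rate is left as a parameter since its exact value is not restated here.

```python
def learning_rate(epoch: int, base_lr: float) -> float:
    """Epoch-indexed (1-based) rate: 3-epoch linear warm-up, 1/5 decay every 2 epochs after epoch 10."""
    if epoch <= 3:                                      # warm-up phase
        return base_lr * epoch / 3.0
    if epoch > 10:                                      # step-decay phase
        return base_lr * (0.2 ** ((epoch - 10 + 1) // 2))
    return base_lr
```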

VQA Setup. For VQA-v2, we follow the strategy in [51] and extract the pool5 feature of each detected object from a Faster R-CNN model (with a ResNet-101 backbone) [44] pre-trained on the Visual Genome dataset [29], resulting in the input visual features $X \in \mathbb{R}^{m\times 2048}$, where $m$ is the number of objects kept under a confidence threshold. The input question is trimmed to a maximum of $n$ words, and the answer vocabulary consists of the $N$ answers appearing more than 8 times in the training set. For CLEVR, we follow the strategy in [21] and extract the res4b22 features from a ResNet-101 model pre-trained on ImageNet [16], resulting in the image features $X \in \mathbb{R}^{m\times 1024}$, where $m$ is the number of spatial locations on the feature map. The maximum number of question words $n$ and the answer vocabulary size $N$ are set separately for CLEVR.

Visual Grounding Setup. We use the same settings for the three evaluated datasets. To detect proposals and extract their visual features for each image, we use two pre-trained proposal detectors, as previous works did: 1) a Faster R-CNN model [44] pre-trained on the Visual Genome dataset [64]; and 2) a Mask R-CNN model [15] pre-trained on the MS-COCO dataset [57]. When preparing the training data for the proposal detectors, we exclude the images in the training, validation and testing sets of RefCOCO, RefCOCO+ and RefCOCOg to avoid contaminating the visual grounding datasets. Each obtained proposal visual feature is further concatenated with a spatial feature encoding the bounding-box coordinates of the proposal (see footnote 3). This results in the image features $X \in \mathbb{R}^{m\times 4096}$, where $m$ is the number of proposals. The maximum number of query words is 15 and the loss weight $\lambda$ in Eq.(14) is 0.5.

Fig. 5: Ablations of the MUAN models with the number of UA blocks $L$ ranging from 2 to 12. All results are evaluated on the val split of VQA-v2. (a) Effect of the gating mechanism: results of the MUAN-$L$ variants with or without the gating mechanism. (b) Effect of self- and co-attention: results of the reference MUAN model along with the variants that do not model self-attention ($X\rightarrow X$ and $Y\rightarrow Y$) or co-attention ($X\rightarrow Y$ and $Y\rightarrow X$).

V-C Ablation Studies

We run a number of ablation experiments on VQA-v2 to explore the effectiveness of MUAN.

First, we explore the effectiveness of the gating mechanism in the UA block with respect to different numbers of blocks $L$. In Fig. 5(a), we report the overall accuracies of the MUAN-$L$ models ($L$ ranges from 2 to 12) with the gating mechanism (i.e., Eq.(5)) or without it (i.e., Eq.(2)). From the results, we can see that MUAN with the gating model steadily outperforms the counterpart without it. Furthermore, increasing $L$ consistently improves the accuracies of both models, which finally saturate at $L = 10$. We attribute this saturation to over-fitting; training a deeper model may require more training data [9].

Next, we conduct ablation studies to explore the effects of self-attention and co-attention in MUAN. By masking the values in the self-attention parts (i.e., $X\rightarrow X$ and $Y\rightarrow Y$) or the co-attention parts (i.e., $X\rightarrow Y$ and $Y\rightarrow X$) of the attention map to $-\infty$, we obtain two degraded variants of MUAN. We compare the two variants to the reference model in Fig. 5(b) with $L$ ranging from 2 to 12. The results show that: 1) both the self-attention and the co-attention in MUAN contribute to VQA performance; and 2) co-attention plays a more important role than self-attention in MUAN, especially when the model is relatively shallow.

        d      h    d_h   d_k    Acc. (%)   #Param (M)
ref.    768    8    96    96     67.28      83.0
(A)     -      -    -     32     67.16      82.9
        -      -    -     64     67.27      82.9
        -      -    -     128    67.17      83.1
(B)     -      6    128   -      67.11      83.1
        -      12   64    -      67.23      82.9
        -      16   48    -      67.25      82.9
(C)     256    -    32    -      66.30      14.5
        512    -    64    -      66.92      40.6
        1024   -    128   -      67.30      141.6

TABLE I: Ablation of the MUAN-10 models with different hyper-parameters. All results are reported on the val split of VQA-v2. Unlisted hyper-parameter values (marked '-') are identical to those of the reference model. $d$ is the latent dimensionality, $h$ the number of heads, $d_h = d/h$ the per-head dimensionality, and $d_k$ the dimensionality of the gating model.

Finally, we investigate the performance of the MUAN-10 model with different hyper-parameters for the UA block in Table I. In row (A), we vary the dimensionality $d_k$ of the gating model. The results show that the reference model yields a 0.12-point improvement over the worst counterpart. Further, the model sizes of these variants are almost identical, indicating that the computational cost of the gating model is more or less negligible. In row (B), we vary the number of parallel heads $h$ with a fixed output dimensionality $d$, keeping the computational cost constant. The results suggest that $h = 8$ is the best choice for MUAN; too few or too many heads reduces the quality of the learned attention. In row (C), we fix the number of heads to $h = 8$ and vary the dimensionality $d$, resulting in much smaller and larger models whose complexity is roughly proportional to $d^2$. From the results, we can see that $d$ is a key hyper-parameter for performance. Too small a $d$ may restrict the model capacity, leading to inferior performance. The model with $d = 1024$ slightly surpasses the reference model at the expense of much higher computational complexity and a greater risk of over-fitting.

The hyper-parameters of the reference model thus represent a trade-off between efficiency and efficacy. Therefore, we adopt the reference MUAN-10 model (abbreviated to MUAN for simplicity) in all the following experiments.

Method               Test-dev                          Test-std
                     All      Y/N      Num      Other  All
Bottom-Up [51]       65.32    81.82    44.21    56.05  65.67
Counter [66]         68.09    83.14    51.62    58.97  68.41
MFH+CoAtt [63]       68.76    84.27    49.56    59.89  -
BAN [26]             69.52    85.31    50.93    60.26  -
BAN+Counter [26]     70.04    85.42    54.04    60.52  70.35
DFAF [40]            70.22    86.09    53.32    60.49  70.34
MCAN [61]            70.63    86.82    53.26    60.72  70.90
MUAN (ours)          70.82    86.77    54.40    60.89  71.10

TABLE II: Single-model accuracies (%) on the test-dev and test-std splits of VQA-v2 compared with the state-of-the-art methods. All models use the same bottom-up-attention visual features [1] and are trained on the train+val+vg splits, where vg denotes the augmented training samples from Visual Genome [29].
Method              Accuracy
Human [23]          92.6
Q-type Prior [23]   41.8
LSTM [23]           46.8
CNN+LSTM [23]       52.3
N2NMN* [19]         83.7
RN [47]             95.5
PG+EE* [24]         96.9
FiLM [42]           97.6
MAC [21]            98.9
MUAN (ours)         98.7

TABLE III: Overall accuracies (%) on the test split of CLEVR compared with the state-of-the-art methods. (*) denotes the use of extra program labels. (†) denotes the use of data augmentation.
Method                      Proposal Generator               RefCOCO               RefCOCO+              RefCOCOg
                            Dataset  Detector  Backbone      TestA  TestB  Val     TestA  TestB  Val     Test   Val
Attr [33]                   COCO     FRCN      VGG-16        72.0   57.3   -       58.0   46.2   -       -      -
CMN [20]                    COCO     FRCN      VGG-16        71.0   65.8   -       54.3   47.8   -       -      -
VC [65]                     COCO     FRCN      VGG-16        73.3   67.4   -       58.4   53.2   -       -      -
Spe.+Lis.+Rein.+MMI [59]    COCO     SSD       VGG-16        73.7   65.0   69.5    60.7   48.8   55.7    59.6   60.2
Spe.+Lis.+Rein.+MMI [59]    COCO     SSD       VGG-16        73.1   64.9   69.0    60.0   49.6   54.9    59.2   59.3
DDPN [64]                   Genome   FRCN      VGG-16        76.9   67.5   73.4    67.0   50.2   60.1    -      -
DDPN [64]                   Genome   FRCN      ResNet-101    80.1   72.4   76.8    70.5   54.1   64.8    67.0   66.7
MAttNet [57]                COCO     FRCN      ResNet-101    80.4   69.3   76.4    70.3   56.0   64.9    67.0   66.7
MAttNet [57]                COCO     MRCN      ResNet-101    81.1   70.0   76.7    71.6   56.0   65.3    67.3   66.6
MUAN (ours)                 COCO     MRCN      ResNet-101    82.8   78.6   81.4    70.5   62.9   68.9    71.5   71.0
MUAN (ours)                 Genome   FRCN      ResNet-101    86.5   78.7   82.8    79.5   64.3   73.2    74.3   74.2

TABLE IV: Accuracies (%) on RefCOCO, RefCOCO+ and RefCOCOg compared with the state-of-the-art methods. All methods use detected proposals rather than ground-truth bounding boxes. COCO [32] and Genome [29] denote the two datasets used for training the proposal detectors. SSD [34], FRCN [44] and MRCN [15] denote the detection models, with VGG-16 [49] or ResNet-101 [16] backbones.

V-D Results on VQA-v2

Taking the ablation studies into account, we compare our best MUAN model to the state-of-the-art methods on VQA-v2 in Table II. With the same bottom-up-attention visual features [1], MUAN outperforms the state-of-the-art method BAN [26] by 1.3 points in terms of overall accuracy on the test-dev split. Furthermore, for the Num-type questions, which verify object counting performance, BAN+Counter [26] reports the best prior result by utilizing an elaborate object counting module [66]. In contrast, MUAN achieves slightly higher accuracy than BAN+Counter without using the auxiliary bounding-box coordinates of each object [66]. This suggests that MUAN can perform accurate object counting based on the visual features alone. As far as we know, MUAN is the first single model to achieve 71%+ accuracy on the test-std split with the standard bottom-up-attention visual features provided by [1].

V-E Results on CLEVR

We also conduct experiments to compare MUAN with existing state-of-the-art approaches and with human performance on CLEVR, a synthetic dataset for evaluating compositional visual reasoning. Compared to VQA-v2, CLEVR requires a model not only to focus on query-specific objects, but also to reason about the relations among the related objects, which is much more challenging. At the same time, since the image contents are completely synthesized by an algorithm, it is possible for a model to fully understand the semantics, which explains the relatively higher performance of existing state-of-the-art methods compared to their performance on VQA-v2.

From the results shown in Table III, we can see that MUAN is at least comparable to the state-of-the-art, even though the model is not specifically designed for this dataset. While some prior approaches use extra supervisory program labels [24][19] or augmented datasets [42][47] to guide training, MUAN learns to infer the correct answers directly from the image and question features.

V-F Results on RefCOCO, RefCOCO+, and RefCOCOg

We report the comparative results on RefCOCO, RefCOCO+, and RefCOCOg in Table IV. We use the common evaluation criterion of accuracy, defined as the percentage of predicted bounding boxes that overlap with the ground truth by IoU $\geq$ 0.5. From the results, we can see that: 1) with the standard proposal features extracted from the detector pre-trained on MS-COCO, MUAN reports a remarkable improvement over MAttNet, the state-of-the-art visual grounding model; and 2) with the powerful proposal features extracted from the detector pre-trained on Visual Genome, MUAN reports up to a 9-point improvement over the strong baseline DDPN [64], which uses the same visual features. These results reveal that MUAN steadily outperforms existing state-of-the-art methods regardless of the proposal features used. Compared with existing approaches, MUAN additionally models the intra-modal interactions within each modality, which provide contextual information that facilitates visual grounding.

Fig. 6: Visualizations of the learned unified attention maps (Eq.(5)) for VQA. The attention maps come from the 1st, 3rd, 6th and 9th UA blocks, respectively. The indices within [0-32] on the axes of the attention maps correspond to the objects in the image (33 objects in total). For better visualization, we highlight the objects in the image that are related to the answer. Furthermore, we split the last attention map into four parts (i.e., $Y\rightarrow Y$, $Y\rightarrow X$, $X\rightarrow Y$ and $X\rightarrow X$) for detailed analysis.
Fig. 7: Visualization of the prediction and the learned visual attention for visual grounding. The ground truth (red), the top-ranked proposal (blue) and the refined prediction (yellow) are shown in the first image. The next four images illustrate the learned visual attentions from the 1st, 3rd, 6th and 9th UA blocks, respectively. The visual attention is represented by the three representative objects with the largest attention values. The brightness of the objects and the darkness of the words represent their importance in the attention weights.

V-G Qualitative Analysis

In Fig. 6, we show one VQA example and visualize four attention maps (obtained by Eq.(5)) from the 1st, 3rd, 6th and 9th UA blocks, respectively. Since only the feature of the dummy token is used to predict the answer, we focus on its related attention weights (i.e., the first row of each attention map). In the 1st attention map, the word ‘many’ obtains the largest weight while the other words and the visual objects are almost ignored, suggesting that the 1st block acts as a question-type classifier. In the 3rd attention map, the word ‘street’ is highlighted, which is a contextual word for understanding the question. The key word ‘buses’ is highlighted in the 6th attention map, and the two buses (i.e., the 22nd and 31st objects) are highlighted in the 9th attention map. This visual reasoning process shows that the information of the highlighted words and objects is gradually aggregated into the dummy-token feature. For the 9th UA block, we split its attention map into four parts (i.e., $Y\rightarrow Y$, $Y\rightarrow X$, $X\rightarrow Y$ and $X\rightarrow X$). In the $Y\rightarrow Y$ part, the largest values reflect the relationships between the key word and its context, providing a structured and fine-grained understanding of the question semantics (i.e., the bus is on the street). In the $Y\rightarrow X$ part, some words on the rows attend to the key objects, suggesting that these words aggregate information from the key objects to improve their representations. Similar observations can be made for the $X\rightarrow Y$ and $X\rightarrow X$ parts.

In Fig. 7, we show one visual grounding example and visualize the prediction and the learned unified attention. In the first image, we can see that MUAN accurately localizes the most relevant object proposal and then outputs the refined bounding box as the final prediction. We visualize the learned textual and visual attentions of the 1st, 3rd, 6th and 9th UA blocks, respectively. By performing column-wise max-pooling over the unified attention map, we obtain the attention weights for the words and the objects. For better visualization, we only show the three representative objects with the largest attention weights. From the results, we can see that: 1) the keywords are highlighted only in the 1st block, indicating that this information has been successfully transferred to the attended visual features in the following blocks; and 2) the learned visual attention in the 1st block is meaningless. After receiving the textual information, the visual attention tends to focus on the contextual objects in the 3rd and 6th blocks (i.e., the hat and the baby), and finally focuses on the correct target object (i.e., the woman) in the 9th block.
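
The column-wise max-pooling step used for this visualization can be sketched as follows, assuming the n-word-then-m-object row ordering of the unified feature matrix from Eq. (7); the function name is ours.

```python
import torch

def attention_weights(att_map: torch.Tensor, n_words: int):
    """att_map: [n + m, n + m] unified attention map of one UA block.
    Returns per-word and per-object attention weights via column-wise max-pooling."""
    weights = att_map.max(dim=0).values          # max over rows -> one weight per column
    return weights[:n_words], weights[n_words:]  # word weights [n], object weights [m]
```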

VI Conclusion and Future Work

In this work, we present a novel unified attention model that captures intra- and inter-modal interactions simultaneously for multimodal data. By stacking such unified attention blocks in depth, we obtain the Multimodal Unified Attention Network (MUAN), which is suitable for both the VQA and visual grounding tasks. Our approach is simple and highly effective. We verify the effectiveness of MUAN on five datasets, and the experimental results show that our approach achieves top-level performance on all the benchmarks without any dataset-specific model tuning.

Since MUAN is a general framework that can be applied to many multimodal learning tasks, there remains significant room for improvement, for example by introducing multitask learning that shares the same backbone model, or by introducing weakly supervised model pre-training with large-scale multimodal data in the wild.

Footnotes

  1. In our implementation, we let $d_y = d_x = d$ and omit $W_y$ and $W_x$ for simplicity, rewriting Eq.(7) as $Z = [\,Y;\ X\,]$.
  2. For multiple UA blocks stacked in depth, only the first block needs to handle multimodal inputs. Eq.(7) is omitted in the other blocks.
  3. For each proposal, we first extract a 5-D spatial feature proposed in [58], and then linearly transform it to a 2048-D feature with a fully-connected layer to match the dimensionality of a 2048-D proposal visual feature.

References

  1. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §IV-A, §V-D, TABLE II.
  2. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick and D. Parikh (2015) Vqa: visual question answering. In IEEE International Conference on Computer Vision (ICCV), pp. 2425–2433. Cited by: §I, §II, §V-A.
  3. J. L. Ba, J. R. Kiros and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §III-B.
  4. D. Bahdanau, K. Cho and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §I.
  5. H. Ben-Younes, R. Cadene, M. Cord and N. Thome (2017) Mutan: multimodal tucker fusion for visual question answering. In IEEE International Conference on Computer Vision (ICCV), Cited by: §II.
  6. K. Chen, J. Wang, L. Chen, H. Gao, W. Xu and R. Nevatia (2015) ABC-cnn: an attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960. Cited by: §II.
  7. L. Chen, Y. Yang, J. Wang, W. Xu and A. L. Yuille (2016) Attention to scale: scale-aware semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3640–3649. Cited by: §I.
  8. C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu and M. Tan (2018) Visual grounding via accumulated attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7746–7755. Cited by: §II.
  9. J. Devlin, M. Chang, K. Lee and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I, §III, §IV-A, §V-C.
  10. T. Dozat and C. D. Manning (2017) Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §I.
  11. A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: §I, §II, §II, §III.
  12. R. Girshick (2015) Fast r-cnn. In IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448. Cited by: §IV-B.
  13. Y. Goyal, T. Khot, D. Summers-Stay, D. Batra and D. Parikh (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I, §V-A.
  14. K. Gregor, I. Danihelka, A. Graves, D. J. Rezende and D. Wierstra (2015) Draw: a recurrent neural network for image generation. In International Conference on Machine Learning (ICML), pp. 1462–1471. Cited by: §I.
  15. K. He, G. Gkioxari, P. Dollár and R. Girshick (2017) Mask r-cnn. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2961–2969. Cited by: §V-B, TABLE IV.
  16. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II, §III-B, §IV-A, §V-B, TABLE IV.
  17. S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §IV-A.
  18. H. Hu, J. Gu, Z. Zhang, J. Dai and Y. Wei (2018) Relation networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  19. R. Hu, J. Andreas, M. Rohrbach, T. Darrell and K. Saenko (2017) Learning to reason: end-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §V-E, TABLE III.
  20. R. Hu, M. Rohrbach, J. Andreas, T. Darrell and K. Saenko (2017) Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1115–1124. Cited by: TABLE IV.
  21. D. A. Hudson and C. D. Manning (2018) Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067. Cited by: §V-B, TABLE III.
  22. I. Ilievski, S. Yan and J. Feng (2016) A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485. Cited by: §II.
  23. J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick and R. Girshick (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2901–2910. Cited by: §I, §V-A, TABLE III.
  24. J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. Lawrence Zitnick and R. Girshick (2017) Inferring and executing programs for visual reasoning. In IEEE International Conference on Computer Vision (ICCV), pp. 2989–2998. Cited by: §V-E, TABLE III.
  25. S. Kazemzadeh, V. Ordonez, M. Matten and T. Berg (2014) Referitgame: referring to objects in photographs of natural scenes. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798. Cited by: §I, §V-A.
  26. J. Kim, J. Jun and B. Zhang (2018) Bilinear attention networks. NIPS. Cited by: §I, §II, §III-B, §III, §V-B, §V-D, TABLE II.
  27. J. Kim, K. W. On, W. Lim, J. Kim, J. Ha and B. Zhang (2017) Hadamard Product for Low-rank Bilinear Pooling. In International Conference on Learning Representation (ICLR), Cited by: §II, §III-A, §III.
  28. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-B.
  29. R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li and D. A. Shamma (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332. Cited by: §V-B, TABLE II, TABLE IV.
  30. X. Li, J. Song, L. Gao, X. Liu, W. Huang, X. He and C. Gan (2019) Beyond rnns: positional self-attention with co-attention for video question answering. In AAAI, Cited by: §II.
  31. Y. Li, N. Wang, J. Liu and X. Hou (2017) Factorized bilinear models for image recognition. IEEE International Conference on Computer Vision (ICCV). Cited by: §III-A.
  32. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §V-A, §V-A, TABLE IV.
  33. J. Liu, L. Wang and M. Yang (2017) Referring expression generation and comprehension via attributes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4856–4864. Cited by: TABLE IV.
  34. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu and A. C. Berg (2016) Ssd: single shot multibox detector. In European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: TABLE IV.
  35. J. Lu, J. Yang, D. Batra and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In NIPS, pp. 289–297. Cited by: §I, §II.
  36. M. Luong, H. Pham and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §I.
  37. J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille and K. Murphy (2016) Generation and comprehension of unambiguous object descriptions. In IEEE International Conference on Computer Vision (ICCV), pp. 11–20. Cited by: §I, §V-A.
  38. V. Mnih, N. Heess and A. Graves (2014) Recurrent models of visual attention. In NIPS, pp. 2204–2212. Cited by: §I.
  39. D. Nguyen and T. Okatani (2018) Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I, §II, §III-B, §III.
  40. G. Peng, H. Li, H. You, Z. Jiang, P. Lu, S. Hoi and X. Wang (2018) Dynamic fusion with intra-and inter-modality attention flow for visual question answering. arXiv preprint arXiv:1812.05252. Cited by: §II, TABLE II.
  41. J. Pennington, R. Socher and C. D. Manning (2014) Glove: global vectors for word representation.. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Vol. 14, pp. 1532–1543. Cited by: §IV-A.
  42. E. Perez, F. Strub, H. De Vries, V. Dumoulin and A. Courville (2018) Film: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §V-E, TABLE III.
  43. A. Radford, K. Narasimhan, T. Salimans and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/research-covers/languageunsupervised/language understanding paper. pdf. Cited by: §III.
  44. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §V-B, §V-B, TABLE IV.
  45. A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell and B. Schiele (2016) Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision (ECCV), pp. 817–834. Cited by: §I, §II.
  46. A. M. Rush, S. Chopra and J. Weston (2015) A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685. Cited by: §I.
  47. A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia and T. Lillicrap (2017) A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976. Cited by: §V-E, TABLE III.
  48. K. J. Shih, S. Singh and D. Hoiem (2016) Where to look: focus regions for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4613–4621. Cited by: §II.
  49. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: TABLE IV.
  50. J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic and H. T. Shen (2018) From deterministic to generative: multi-modal stochastic rnns for video captioning. IEEE transactions on neural networks and learning systems. Cited by: §I.
  51. D. Teney, P. Anderson, X. He and A. v. d. Hengel (2017) Tips and tricks for visual question answering: learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711. Cited by: §IV-A, §V-B, TABLE II.
  52. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010. Cited by: §I, §I, §III-A, §III-A, §III.
  53. X. Wang, R. Girshick, A. Gupta and K. He (2018) Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803. Cited by: §I.
  54. K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention.. In International Conference on Machine Learning (ICML), Vol. 14, pp. 77–81. Cited by: §I, §I, §I.
  55. E. Yang, C. Deng, C. Li, W. Liu, J. Li and D. Tao (2018) Shared predictive cross-modal deep quantization. IEEE transactions on neural networks and learning systems 29 (11), pp. 5292–5303. Cited by: §I.
  56. Z. Yang, X. He, J. Gao, L. Deng and A. Smola (2016) Stacked attention networks for image question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21–29. Cited by: §I, §II.
  57. L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal and T. L. Berg (2018) Mattnet: modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315. Cited by: §II, §V-B, TABLE IV.
  58. L. Yu, P. Poirson, S. Yang, A. C. Berg and T. L. Berg (2016) Modeling context in referring expressions. In European Conference on Computer Vision (ECCV), pp. 69–85. Cited by: footnote 3.
  59. L. Yu, H. Tan, M. Bansal and T. L. Berg (2017) A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7282–7290. Cited by: §II, TABLE IV.
  60. Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo and Y. Zhuang (2014) Discriminative coupled dictionary hashing for fast cross-media retrieval. In International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 395–404. Cited by: §I.
  61. Z. Yu, J. Yu, Y. Cui, D. Tao and Q. Tian (2019) Deep modular co-attention networks for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6281–6290. Cited by: TABLE II.
  62. Z. Yu, J. Yu, J. Fan and D. Tao (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. IEEE International Conference on Computer Vision (ICCV), pp. 1839–1848. Cited by: §I, §II.
  63. Z. Yu, J. Yu, C. Xiang, J. Fan and D. Tao (2018) Beyond bilinear: generalized multi-modal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems. External Links: Document Cited by: §II, TABLE II.
  64. Z. Yu, J. Yu, C. Xiang, Z. Zhao, Q. Tian and D. Tao (2018) Rethinking diversified and discriminative proposal generation for visual grounding. International Joint Conference on Artificial Intelligence (IJCAI). Cited by: §II, §IV-B, §V-B, §V-F, TABLE IV.
  65. H. Zhang, Y. Niu and S. Chang (2018-06) Grounding referring expressions in images by variational context. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: TABLE IV.
  66. Y. Zhang, J. Hare and A. Prügel-Bennett (2018) Learning to count objects in natural images for visual question answering. International Conference on Learning Representation (ICLR). Cited by: §V-D, TABLE II.
  67. C. Zheng, L. Pan and P. Wu (2019) Multimodal deep network embedding with integrated structure and attribute information. IEEE transactions on neural networks and learning systems. Cited by: §I.
  68. B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam and R. Fergus (2015) Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167. Cited by: §II.
  69. B. Zhuang, Q. Wu, C. Shen, I. Reid and A. van den Hengel (2018) Parallel attention: a unified framework for visual object discovery through dialogs and queries. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4252–4261. Cited by: §II.
  70. C. L. Zitnick and P. Dollár (2014) Edge boxes: locating object proposals from edges. In European Conference on Computer Vision (ECCV), pp. 391–405. Cited by: §II.