Multimodal Unified Attention Networks for Vision-and-Language Interactions
Learning an effective attention mechanism for multimodal data is important in many vision-and-language tasks that require a synergistic understanding of both the visual and textual contents. Existing state-of-the-art approaches use co-attention models to associate each visual object (e.g., image region) with each textual object (e.g., query word). Despite the success of these co-attention models, they only model inter-modal interactions while neglecting intra-modal interactions. Here we propose a general ‘unified attention’ model that simultaneously captures the intra- and inter-modal interactions of multimodal features and outputs their corresponding attended representations. By stacking such unified attention blocks in depth, we obtain the deep Multimodal Unified Attention Network (MUAN), which can seamlessly be applied to the visual question answering (VQA) and visual grounding tasks. We evaluate our MUAN models on two VQA datasets and three visual grounding datasets, and the results show that MUAN achieves top-level performance on both tasks without bells and whistles.
Deep learning in computer vision and natural language processing has facilitated recent advances in artificial intelligence. Such advances drive research interest in multimodal learning tasks lying at the intersection of vision and language, such as multimodal embedding learning, visual captioning, visual question answering (VQA) and visual grounding. In these tasks, learning a fine-grained semantic understanding of both visual and textual content is key to their performance.
The attention mechanism is a predominant focus of recent deep learning research. It aims to focus on certain data elements, and aggregate essential information to obtain a more discriminative local representation [4, 54]. This mechanism has improved the performance of a wide range of unimodal learning tasks (e.g., vision [38, 14, 7], language [36, 10, 46]) in conjunction with deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
For the multimodal learning tasks described above, attention learning considers the inputs from both the visual and textual modalities. Taking the VQA problem in Fig. 1 as an example, to correctly answer a question like ‘How many people are catching the frisbee’ for an image, the attention model should ideally learn to focus on particular image regions (i.e., the people near the frisbee). Such visual attention based models have become an integral component in many multimodal tasks that require fine-grained visual understanding. Beyond the visual attention models, recent studies have introduced co-attention models, which simultaneously learn the visual attention and textual attention to benefit from fine-grained representations for both modalities. Early approaches learned separate attention distributions for each modality in an iterative manner, neglecting the dense interaction between each question word and image region. To address this problem, dense co-attention models have been proposed to capture complete interactions between word-region pairs, which are further extended to form deep co-attention models.
Despite the success of the co-attention models in multimodal learning tasks, these models only consider inter-modal interactions (i.e., text-to-image or image-to-text in Fig. 1) while neglecting intra-modal ones (i.e., text-to-text and image-to-image). On the other hand, modeling intra-modal interactions has proven beneficial for many unimodal learning tasks. We argue that intra-modal interactions within each modality provide complementary and important information to the inter-modal interactions.
Inspired by the well-known self-attention model in the NLP community, we naturally extend its idea to multimodal data and propose a unified attention accordingly. Our unified attention model characterizes the intra- and inter-modal interactions jointly in a unified framework, which we call the unified attention (UA) block (see Fig. 1). The attention map learned from the UA block includes four kinds of relationships: the inter-modal interactions (text-to-image and image-to-text) that build co-attention across modalities, and the intra-modal interactions (text-to-text and image-to-image) that build self-attention within each modality. The learned unified attention is further used to obtain the attended output features for the multimodal inputs. By stacking such UA blocks in depth, we obtain the Multimodal Unified Attention Network (MUAN), which can be trained in an end-to-end manner to perform deep multimodal reasoning.
To evaluate the effectiveness of our proposed MUAN model, we apply it to VQA and visual grounding. The quantitative and qualitative results on two VQA datasets, VQA-v2 and CLEVR, and three visual grounding datasets, RefCOCO, RefCOCO+ and RefCOCOg, show that MUAN achieves top-level performance on both tasks without any dataset-specific model tuning.
In summary, we have made the following contributions in this study:
We extend the self-attention model for a single modality to a unified attention model, which can characterize the intra- and inter-modal interactions of multimodal data. By stacking such unified attention models (i.e., UA blocks) in depth, we obtain a neat multimodal unified attention network (MUAN), which can perform accurate multimodal reasoning.
We modify the original self-attention model into a gated self-attention (GSA) model as the basic component of the UA block, which facilitates more accurate and robust attention learning and leads to more discriminative features for specific tasks.
We apply MUAN to two multimodal learning tasks, namely VQA and visual grounding. The results on five benchmark datasets show the superiority of MUAN over existing state-of-the-art approaches.
II Related Work
We briefly review existing studies on VQA and visual grounding, and establish a connection between these two tasks by attention learning.
Visual Question Answering (VQA). VQA aims to answer a question posed in natural language with respect to a given image, and thus requires multimodal reasoning over multimodal inputs. Since Antol et al. presented a large-scale VQA benchmark dataset with free-form questions, multimodal fusion and attention learning have become two major research focuses for VQA. For multimodal fusion, early methods used simple concatenation or element-wise multiplication of the multimodal features. Fukui et al., Kim et al., Yu et al. and Ben et al. proposed different approximated bilinear pooling methods to effectively integrate the multimodal features with second-order feature interactions. For attention learning, question-guided visual attention on image regions has become the de-facto component of many VQA approaches. Chen et al. proposed a question-guided attention map that projects the question embeddings into the visual space and formulates a configurable convolutional kernel to search the image attention region. Yang et al. proposed a stacked attention network to learn the attention iteratively. Some approaches introduce off-the-shelf object detectors or object proposals as candidate attention regions and then use the question to identify the relevant ones. Taking this further, co-attention models that consider both textual and visual attention have been proposed. Lu et al. proposed a co-attention learning framework that alternately learns the image attention and the question attention. Yu et al. reduced the co-attention method to two steps: self-attention for a question embedding and question-conditioned attention for a visual embedding. The co-attentions learned by these approaches are coarse, in that they neglect the interaction between individual question words and image regions. To address this issue, Nguyen et al. and Kim et al. introduced dense co-attention models that establish the complete interaction between each question word and each image region.
Visual Grounding. Visual grounding (a.k.a. referring expression comprehension) aims to localize the object in an image referred to by a query text. Most previous approaches follow a two-stage pipeline: 1) use an off-the-shelf object detector, such as EdgeBoxes or Faster R-CNN, to generate a set of region proposals along with their proposal features for the input image; and 2) compute a matching score between each proposal feature and the query feature, adopting the proposal (or its refined bounding box) with the highest score as the referent. From the attention learning point of view, visual grounding is a task of learning query-guided attention over the image region proposals. The aforementioned two-stage approaches are thus analogous to the visual attention models in VQA. Yu et al., Zhang et al. and Deng et al. also modeled the attention on query words along with the visual attention, providing a connection to the co-attention models in VQA.
Joint Modeling of Self- and Co-Attention. Although self-attention and co-attention have been studied extensively in existing multimodal learning methods, the two kinds of attention are usually considered separately. To the best of our knowledge, only a few attempts have modeled intra- and inter-modal interactions jointly. Li et al. introduced a videoQA approach that uses self-attention to learn the intra-modal interactions of the video and question modalities respectively, and then feeds them through a co-attention block to model inter-modal interactions. Gao et al. presented a dynamic fusion framework for VQA that models intra- and inter-modal attention blocks. Yu et al. applied a modular co-attention network for VQA that stacks multiple self-attention and guided-attention blocks in depth to perform deep visual reasoning. In summary, all these methods model self-attention and co-attention in two sequential stages, which is sub-optimal and may result in serious information loss. This inspires us to design a general unified attention framework that models the two kinds of attention simultaneously in one stage.
III Multimodal Unified Attention
In this section, we introduce the multimodal unified attention, which is the basic component of our Multimodal Unified Attention Network (MUAN). Taking multimodal input features from the image modality and from the text modality, the unified attention outputs their corresponding attended features. In contrast to existing visual attention methods, which model unidirectional inter-modal interactions (i.e., text-to-image), or co-attention methods, which model bidirectional inter-modal interactions (i.e., text-to-image and image-to-text), our unified attention models the intra-modal and inter-modal interactions simultaneously (i.e., text-to-text, image-to-image, text-to-image and image-to-text) in a general framework.
Inspired by the self-attention model, which has achieved remarkable performance in natural language processing, we design a unified attention model for multimodal data. Furthermore, to obtain a more accurate attention map during unified attention learning, we introduce a bilinear pooling based gating model to reweight the importance of the input features, which can to some extent suppress irrelevant or noisy features.
III-A Gated Self-Attention
The original self-attention model takes a group of input features $X \in \mathbb{R}^{m \times d_x}$ and outputs a group of attended features $F \in \mathbb{R}^{m \times d}$, where $m$ is the number of samples, and $d_x$ and $d$ are the dimensionalities of the input and output features, respectively. To achieve this goal, $X$ is first fed into three independent fully-connected layers:

$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V \qquad (1)$$

where $Q, K, V \in \mathbb{R}^{m \times d}$ are three feature matrices of the same shape, corresponding to the queries, keys, and values, respectively.

Given a query $q \in Q$ and all keys $K$, we calculate the dot-products of $q$ with each key, divide each by a scaling factor $\sqrt{d}$, and apply the softmax function to obtain the attention weights on the values. In practice, the attention function can be computed on all queries simultaneously, and in doing so we obtain the output features as follows:

$$F = AV = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V \qquad (2)$$

where $A \in \mathbb{R}^{m \times m}$ is the attention map containing the attention weights for all query-key pairs, and the output features $F$ are the weighted summation of the values determined by $A$.
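As a concrete reference, the scaled dot-product self-attention above can be sketched in a few lines of NumPy (bias terms and multi-head splitting omitted; all variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention.

    X: (m, d_x) input features; W_q, W_k, W_v: (d_x, d) projection
    weights standing in for the three fully-connected layers.
    Returns the attended features F (m, d) and the attention map A (m, m).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # queries, keys, values (Eq. 1)
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))     # attention weights (Eq. 2)
    return A @ V, A                       # weighted sum of the values
```

Each row of the attention map sums to one, so every output feature is a convex combination of the value vectors.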
Learning an accurate attention map is crucial for self-attention learning. The scaled dot-product attention in Eq.(2) models the relationship between feature pairs; however, the importance of each individual feature is not explicitly considered during attention learning. Consequently, irrelevant or noisy features may have a negative impact on the attention map, resulting in inaccurate output features. To address this problem, we introduce a novel gating model into Eq.(2) to improve the quality of the learned attention. Inspired by the bilinear pooling models that have been used in fine-grained visual recognition and multimodal fusion, we design a gating model based on low-rank bilinear pooling to reweight the features of $Q$ and $K$ before their scaled dot-products:

$$M = [M_Q, M_K] = \sigma\big((XW_1 \circ XW_2)W_3\big) \qquad (3)$$

where $W_1, W_2 \in \mathbb{R}^{d_x \times d_g}$ and $W_3 \in \mathbb{R}^{d_g \times 2}$ are three independent fully-connected layers, and $d_g$ is the dimensionality of the projected space. $\circ$ denotes the element-wise product and $\sigma$ the sigmoid function. $M \in \mathbb{R}^{m \times 2}$ corresponds to the two masks $M_Q, M_K \in \mathbb{R}^{m \times 1}$ for the features $Q$ and $K$, respectively.

The two learned masks $M_Q$ and $M_K$ are tiled to $\mathbb{R}^{m \times d}$:

$$\tilde{M}_Q = M_Q \mathbf{1}^T,\quad \tilde{M}_K = M_K \mathbf{1}^T \qquad (4)$$

and then used to formulate a gated self-attention (GSA) model as follows:

$$F = \mathrm{GSA}(X) = \mathrm{softmax}\!\left(\frac{(\tilde{M}_Q \circ Q)(\tilde{M}_K \circ K)^T}{\sqrt{d}}\right)V \qquad (5)$$
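A minimal NumPy sketch of the GSA model, under the reading of Eqs. (3)-(5) in which two sigmoid gates reweight the queries and keys before the scaled dot-product (names and shapes are illustrative, not taken from a released implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(X, W_q, W_k, W_v, W_1, W_2, W_3):
    """Gated self-attention (GSA): low-rank bilinear pooling produces
    two per-sample gates that reweight Q and K before the scaled
    dot-product attention.

    X: (m, d_x); W_q/W_k/W_v: (d_x, d); W_1, W_2: (d_x, d_g); W_3: (d_g, 2).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    M = sigmoid(((X @ W_1) * (X @ W_2)) @ W_3)  # (m, 2) masks, Eq. (3)
    Qg = M[:, :1] * Q                           # gated queries (mask tiled)
    Kg = M[:, 1:] * K                           # gated keys (mask tiled)
    d = Q.shape[-1]
    A = softmax(Qg @ Kg.T / np.sqrt(d))         # gated attention map
    return A @ V                                # attended features, Eq. (5)
```

Broadcasting the (m, 1) masks over the d columns of Q and K implements the tiling of Eq. (4) without materializing the tiled matrices.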
III-B Unified Attention Block
Based on the gated self-attention model above, we introduce the multimodal unified attention block, which simultaneously models intra- and inter-modal interactions.
Given a group of textual features $X_T \in \mathbb{R}^{m \times d_t}$ (e.g., question words) and a group of visual features $X_V \in \mathbb{R}^{n \times d_v}$ (e.g., image regions), we first learn two fully-connected layers $W_T \in \mathbb{R}^{d_t \times d}$ and $W_V \in \mathbb{R}^{d_v \times d}$ to embed $X_T$ and $X_V$ into a $d$-dimensional common space, and then concatenate the two groups of embedded features on rows to form a unified feature matrix $Z \in \mathbb{R}^{(m+n) \times d}$:

$$Z = [X_T W_T; X_V W_V] \qquad (7)$$
The UA block (see Fig. 1(b)) consists of a gated self-attention (GSA) module and a feed-forward network (FFN) module. Taking the unified feature matrix $Z$ as input, the GSA module learns the pairwise interactions between the sample pairs within $Z$. Since the paired samples may come from different (or the same) modalities, the intra- and inter-modal relationships are represented at the same time. Existing co-attention models capture only the inter-modal relationships; however, the intra-modal relationships (e.g., word-to-word and region-to-region) are also important for understanding the intrinsic structure within each modality, and modeling them facilitates more accurate visual reasoning. The FFN module takes the output features of the GSA module as input, and then performs a transformation through two consecutive fully-connected layers (FC($4d$)-ReLU-Drop(0.1)-FC($d$)). To simplify optimization, shortcut connections and layer normalization are applied after the GSA and FFN modules. It is worth noting that the final output features of the UA block have the same shape as the input features, making it possible to stack multiple UA blocks in depth.
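The control flow of one UA block can be sketched as follows. The attention module is passed in as a callable so the sketch stays independent of the GSA internals; dropout is omitted and all weight names are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each feature vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def ua_block(Z, attention, W1, b1, W2, b2):
    """One UA block: an attention sub-layer and an FFN sub-layer, each
    followed by a shortcut connection and layer normalization.

    Z: (m+n, d) unified features; attention: callable (m+n, d) -> (m+n, d);
    W1: (d, 4d) and W2: (4d, d) implement FC(4d)-ReLU-FC(d).
    """
    Z = layer_norm(Z + attention(Z))      # GSA sub-layer + shortcut + LN
    h = np.maximum(Z @ W1 + b1, 0.0)      # FC(4d) + ReLU
    return layer_norm(Z + h @ W2 + b2)    # FC(d) + shortcut + LN
```

Because the output shape equals the input shape, the block can be applied repeatedly to stack a deep network.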
IV Multimodal Unified Attention Networks
In this section, we describe the MUAN architectures for VQA and visual grounding (see Fig. 3). The core component of both models is the deep MUAN-$L$ model, which consists of $L$ UA blocks stacked in depth to perform deep multimodal reasoning and attentional feature transformation. The proposed VQA model and the visual grounding model are very similar to each other, except for the input feature representations and the loss functions used during model training. We therefore highlight these two parts for each model.
IV-A Architecture for VQA
Image and Question Representations. The inputs for VQA consist of an image and a question, and the goal is to predict an answer to the question. Our model first extracts representations for the image and the question, then feeds the multimodal features into the MUAN model to obtain their corresponding output features via unified attention learning. Finally, one of the attended features is fed to a multi-label classifier to predict the correct answer.
The input question is first tokenized into a sequence of words, and then trimmed (or zero-padded) to a maximum length of $m$ tokens. Similar to previous work, we add a dummy token at the beginning of the question, and the attended feature of this token is used to predict the answer. The words are first represented as one-hot vectors and then transformed into 300-D word embeddings using the pre-trained GloVe model. Finally, the word embeddings are fed into a one-layer LSTM network with $d_t$ hidden units, resulting in the final question feature $X_T \in \mathbb{R}^{m \times d_t}$. The input image is represented as a group of $d_v$-dimensional visual features extracted from a pre-trained CNN model or a pre-trained object detector. This results in the image feature $X_V \in \mathbb{R}^{n \times d_v}$, where $n$ is the number of extracted features.
Note that we mask the zero-padded features during attention learning to make their attention weights all zero.
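This masking is typically implemented by setting the attention logits of padded positions to $-\infty$ before the softmax, so the padded keys receive exactly zero weight; a small sketch:

```python
import numpy as np

def masked_softmax(logits, valid):
    """Softmax over keys with zero-padded positions masked out.

    logits: (m, m) raw attention scores; valid: (m,) bool flags that are
    False for zero-padded tokens. Padded keys get -inf logits, hence
    zero attention weight after the softmax.
    """
    logits = np.where(valid[None, :], logits, -np.inf)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Each row still sums to one, with all of the probability mass redistributed over the valid tokens.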
MUAN-$L$. The multimodal features $X_T$ and $X_V$ are fed into a deep MUAN-$L$ model consisting of $L$ UA blocks $[\mathrm{UA}^{(1)}, \mathrm{UA}^{(2)}, ..., \mathrm{UA}^{(L)}]$. For the first block, $X_T$ and $X_V$ are integrated by Eq.(7) to obtain the initialized unified features $Z^{(0)}$, which are further fed through the UA blocks in a recursive manner:

$$Z^{(i)} = \mathrm{UA}^{(i)}\big(Z^{(i-1)}\big)$$

where $i \in \{1, 2, ..., L\}$. Note that the final output features $Z^{(L)}$ have the same shape as the input features $Z^{(0)}$, and each paired input and output feature has a one-to-one correspondence.
Answer Prediction. Using the attended features from MUAN-$L$, we project the first feature (i.e., that of the dummy token) into a vector $\hat{y} \in \mathbb{R}^{N}$, where $N$ corresponds to the size of the answer vocabulary.
For the datasets that have multiple answers to each question, we follow the common strategy and use the binary cross-entropy (BCE) loss to train an $N$-way classifier with respect to the ground-truth label $y \in [0,1]^{N}$:

$$\mathcal{L}_{\mathrm{bce}} = -\sum_{i=1}^{N}\big[y_i \log \sigma(\hat{y}_i) + (1 - y_i)\log\big(1 - \sigma(\hat{y}_i)\big)\big]$$

where $\sigma(\cdot)$ is the sigmoid activation function.
For the datasets that have exactly one answer to each question, we use the softmax cross-entropy loss to train the model with respect to the one-hot ground-truth label $y$:

$$\mathcal{L}_{\mathrm{ce}} = -\sum_{i=1}^{N} y_i \log \mathrm{softmax}(\hat{y})_i$$
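Both training objectives can be sketched directly from the formulas above (plain NumPy, with no numerical-stability tricks beyond the usual max-subtraction):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(logits, soft_labels):
    """Binary cross-entropy over the answer vocabulary (multi-label)."""
    p = sigmoid(logits)
    return -np.sum(soft_labels * np.log(p) + (1 - soft_labels) * np.log(1 - p))

def softmax_ce_loss(logits, onehot):
    """Softmax cross-entropy for single-answer datasets (e.g., CLEVR)."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())   # log-softmax
    return -np.sum(onehot * logp)
```

The BCE form lets each answer carry an independent soft score (e.g., the VQA accuracy of that answer), while the softmax form assumes the answers are mutually exclusive.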
IV-B Architecture for Visual Grounding
The inputs for visual grounding consist of an image and a query. Similar to the VQA architecture above, we extract the query features using GloVe embeddings followed by an LSTM network, and extract region-based proposal features for the image using a pre-trained object detector. Note that we do not use the dummy token, which is specially designed for VQA, in visual grounding.
The multimodal input features are integrated and transformed by MUAN-$L$ to obtain their attended representations. On top of the attended feature of each region proposal, we append two fully-connected layers that respectively project the feature into a matching score and a 4-D vector used to regress the refined bounding-box coordinates of the proposal.
Accordingly, a ranking loss and a regression loss are designed to optimize the model in a multitask learning manner. Following the common strategy, KL-divergence is used as the ranking loss:

$$\mathcal{L}_{\mathrm{rank}} = \sum_{i=1}^{n} s_i^{*} \log\frac{s_i^{*}}{s_i}$$

where $s = [s_1, ..., s_n]$ are the predicted scores for the $n$ proposals. The ground-truth label $s^{*}$ is obtained by calculating the IoU scores of all proposals w.r.t. the unique ground-truth bounding box: the IoU score of the $i$-th proposal is assigned to $s_i^{*}$ if it is larger than a threshold, and 0 otherwise. Softmax normalization is applied to $s$ and $s^{*}$ respectively so that each forms a score distribution.
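A sketch of the label construction and ranking loss described above, assuming proposals are given as [x1, y1, x2, y2] boxes and the threshold is the usual 0.5:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter)

def ranking_labels(proposals, gt_box, thresh=0.5):
    """Soft ground-truth scores: a proposal keeps its IoU w.r.t. the
    ground truth if it exceeds the threshold, 0 otherwise, followed by
    softmax normalization into a distribution."""
    s = np.array([iou(p, gt_box) for p in proposals])
    s = np.where(s > thresh, s, 0.0)
    e = np.exp(s - s.max())
    return e / e.sum()

def kl_ranking_loss(pred_scores, gt_scores, eps=1e-9):
    """KL divergence between the label and predicted score distributions."""
    p = np.exp(pred_scores - pred_scores.max())
    p /= p.sum()
    return np.sum(gt_scores * np.log((gt_scores + eps) / (p + eps)))
```

The loss is zero exactly when the softmax of the predicted scores matches the label distribution.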
The smoothed $L_1$ loss is used as the regression loss to penalize the differences between the refined bounding box and the ground-truth bounding box:

$$\mathcal{L}_{\mathrm{reg}} = \sum_{i} \mathrm{smooth}_{L_1}\big(t_i - t_i^{*}\big)$$

where $t_i$ and $t_i^{*}$ correspond to the coordinates of the predicted bounding box and the ground-truth bounding box for the $i$-th proposal, respectively.
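The smoothed $L_1$ penalty has the standard piecewise form (quadratic near zero, linear in the tails); a minimal sketch:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber-style) penalty, applied elementwise."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def regression_loss(pred_box, gt_box):
    """Sum of smooth-L1 differences over the 4 box coordinates."""
    return smooth_l1(np.asarray(pred_box, float) - np.asarray(gt_box, float)).sum()
```

The quadratic region keeps gradients small for near-correct boxes, while the linear region limits the influence of outliers.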
By combining the two terms, we obtain the overall loss function as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{rank}} + \lambda \mathcal{L}_{\mathrm{reg}}$$

where $\lambda$ is a hyper-parameter that balances the two terms.
V Experiments
In this section, we conduct experiments to evaluate the performance of the MUAN models on the VQA and visual grounding tasks. We first run extensive ablation experiments to explore the effect of different hyper-parameters in MUAN, and then compare the best MUAN models to current state-of-the-art methods on five benchmark datasets (two VQA datasets and three visual grounding datasets).
V-A Datasets
VQA-v2 is a commonly-used benchmark dataset for open-ended VQA. It contains human-annotated question-answer pairs for MS-COCO images. The dataset is split into three subsets: train (80k images with 444k questions); val (40k images with 214k questions); and test (80k images with 448k questions). The test subset is further split into test-dev and test-std sets that are evaluated online with limited attempts. For each question, multiple answers are provided by different annotators. To evaluate a model with respect to such multi-label answers, an accuracy-based evaluation metric, robust to inter-human variability in phrasing the answers, is defined as follows:

$$\mathrm{Acc}(a) = \min\left(\frac{\mathrm{count}(a)}{3},\ 1\right)$$

where $\mathrm{count}(a)$ is the number of annotators who voted for the answer $a$.
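The metric can be sketched as a small function (the official evaluation additionally averages this value over all subsets of nine annotators; that refinement is omitted here):

```python
def vqa_accuracy(answer, human_answers):
    """VQA accuracy: an answer counts as 100% correct if at least 3 of
    the annotators gave it, and partially correct otherwise."""
    votes = sum(a == answer for a in human_answers)
    return min(votes / 3.0, 1.0)
```

For example, an answer matching exactly one of the ten annotators scores 1/3 rather than 0.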
CLEVR is a synthesized dataset containing 100k images and 853k questions. Each image contains 3D-rendered objects and is associated with a number of questions that test various aspects of visual reasoning, including attribute identification, object counting, and logical operations. The dataset is split into three subsets: train (70k images with 700k questions), val (15k images with 150k questions) and test (15k images with 15k questions). Each question is associated with exactly one answer, and the standard accuracy metric is used to evaluate model performance.
RefCOCO, RefCOCO+, and RefCOCOg are three datasets used to evaluate visual grounding performance. All three datasets are collected from MS-COCO images, but the queries differ in three respects: 1) RefCOCO and RefCOCO+ contain short queries (3.6 words on average), while RefCOCOg contains relatively long queries (8.4 words on average); 2) images in RefCOCO and RefCOCO+ contain 3.9 same-type objects on average, while in RefCOCOg this number is 1.6; and 3) RefCOCO+ does not contain any location words, while the other two datasets have no such constraint. RefCOCO and RefCOCO+ are split into four subsets: train (120k queries), val (11k queries), testA (6k queries about people), and testB (5k queries about objects). RefCOCOg is split into three subsets: train (81k queries), val (5k queries), and test (10k queries). For all three datasets, accuracy is adopted as the evaluation metric, defined as the percentage of predictions whose bounding box overlaps the ground-truth bounding box with IoU > 0.5.
Fig. 4 shows some typical examples from these datasets.
V-B Experimental Setup
Universal Setup. We use the following hyper-parameters as the default settings for MUAN unless otherwise noted. In each UA block, the latent dimensionality $d$ is 768 and the number of heads is 8, so the dimensionality of each head is 768/8 = 96. The latent dimensionality $d_g$ in the gating model is 96. The number of UA blocks $L$ ranges from 2 to 12.
All models are optimized using the Adam solver. The models (except those for CLEVR) are trained for up to 13 epochs with a batch size of 64. The learning rate is warmed up for the first 3 epochs and decays by 1/5 every 2 epochs after 10 epochs. We report the best results evaluated on the validation set. For CLEVR, a smaller base learning rate is used, and the models are trained for up to 20 epochs with the learning rate decaying by 1/5 at the 16th and 18th epochs, respectively.
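The schedule can be sketched as follows; the base learning rate value used below is a placeholder, not the paper's exact setting:

```python
def learning_rate(epoch, base_lr=1e-4, warmup=3):
    """Linear warm-up for the first 3 epochs, then the base rate,
    then decay by 1/5 every 2 epochs starting at epoch 10.
    Epochs are 0-indexed; base_lr is an illustrative placeholder."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    if epoch >= 10:
        return base_lr * 0.2 ** ((epoch - 10) // 2 + 1)
    return base_lr
```

With 13 training epochs this yields three warm-up steps, a plateau, and two decay steps.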
VQA Setup. For VQA-v2, we follow the bottom-up-attention strategy and extract the pool5 feature of each object from a Faster R-CNN model (with a ResNet-101 backbone) pre-trained on the Visual Genome dataset, resulting in the input visual features $X_V \in \mathbb{R}^{n \times 2048}$, where $n$ is the number of objects extracted under a confidence threshold. The maximum number of question words is fixed, and the answer vocabulary covers the answers appearing more than 8 times in the training set. For CLEVR, we extract the res4b22 features from a ResNet-101 model pre-trained on ImageNet, resulting in the image features $X_V$; the maximum number of question words and the size of the answer vocabulary are fixed accordingly.
Visual Grounding Setup. We use the same settings for the three evaluated datasets. To detect proposals and extract their visual features for each image, we use two pre-trained proposal detectors, following previous work: 1) a Faster R-CNN model pre-trained on the Visual Genome dataset; and 2) a Mask R-CNN model pre-trained on the MS-COCO dataset. During the training-data preparation for the proposal detectors, we exclude the images in the training, validation and test sets of RefCOCO, RefCOCO+ and RefCOCOg to avoid contaminating the visual grounding datasets. Each obtained proposal visual feature is further concatenated with a spatial feature containing the bounding-box coordinates of the proposal.
V-C Ablation Studies
We run a number of ablation experiments on VQA-v2 to explore the effectiveness of MUAN.
First, we explore the effectiveness of the gating mechanism in the UA block with respect to different numbers of blocks $L$. In Fig. 4(a), we report the overall accuracies of the MUAN-$L$ models ($L$ ranges from 2 to 12) with the gating mechanism (i.e., Eq.(5)) and without it (i.e., Eq.(2)). From the results, we can see that MUAN with the gating model steadily outperforms the counterpart without it. Furthermore, increasing $L$ consistently improves the accuracies of both models, which finally saturate at $L=10$. We attribute this saturation to over-fitting; training a deeper model may require more training data.
Next, we conduct ablation studies to explore the effects of self-attention and co-attention in MUAN. By masking the values in the self-attention parts (i.e., text-to-text and image-to-image) or the co-attention parts (i.e., text-to-image and image-to-text) of the attention map to $-\infty$, we obtain two degraded variants of MUAN. We compare the two variants to the reference model in Fig. 4(b). The results show that: 1) both the self-attention and the co-attention in MUAN contribute to VQA performance; and 2) co-attention plays a more important role than self-attention, especially when the model is relatively shallow.
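The quadrant masking used for these ablations can be sketched as follows, assuming the unified attention logits are ordered as [text; image] with the first $m$ rows and columns textual:

```python
import numpy as np

def mask_quadrants(logits, m, mode):
    """Mask attention logits before the softmax for the ablation.

    logits: ((m+n), (m+n)) scores over the unified [text; image] features.
    mode 'self' masks the intra-modal quadrants (text-to-text and
    image-to-image); mode 'co' masks the cross-modal quadrants.
    """
    L = logits.astype(float).copy()
    if mode == 'self':
        L[:m, :m] = -np.inf      # text-to-text
        L[m:, m:] = -np.inf      # image-to-image
    elif mode == 'co':
        L[:m, m:] = -np.inf      # text-to-image
        L[m:, :m] = -np.inf      # image-to-text
    return L
```

After the softmax, the masked quadrants receive zero attention weight, which yields the two degraded variants.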
[Table I: overall accuracy (%) and number of parameters for the MUAN-10 variants with different UA-block hyper-parameters]
Finally, we investigate the performance of the MUAN-10 model with different hyper-parameters for the UA block in Table I. In row (A), we vary the dimensionality $d_g$ in the gating model. The results suggest that the reference model yields a 0.12-point improvement over the worst counterpart. Furthermore, the model sizes of these variants are almost identical, indicating that the computational cost of the gating model is negligible. In row (B), we vary the number of parallel heads with the output dimensionality $d$ fixed, keeping the computational cost constant. The results suggest that 8 heads is the best choice for MUAN; too few or too many heads reduce the quality of the learned attention. In row (C), we fix the number of heads at 8 and vary the dimensionality $d$, resulting in much smaller and larger models whose complexity is proportional to $d$. From the results, we can see that $d$ is a key hyper-parameter for performance. Too small a $d$ may restrict the model capacity, leading to inferior performance, while the model with the largest $d$ only slightly surpasses the reference model at the expense of much higher computational complexity and a greater risk of over-fitting.
The hyper-parameters of the reference model are a trade-off between efficiency and efficacy. We therefore adopt the reference MUAN-10 model (abbreviated to MUAN for simplicity) in all the following experiments.
[Table III: accuracy comparison on CLEVR among Human, Q-type Prior, LSTM, CNN+LSTM, N2NMN*, RN, PG+EE*, FiLM, MAC, and MUAN (ours)]
V-D Results on VQA-v2
Taking the ablation studies into account, we compare our best MUAN model with the state-of-the-art methods on VQA-v2 in Table II. With the same bottom-up-attention visual features, MUAN significantly outperforms the current state-of-the-art method BAN by 1.3 points in overall accuracy on the test-dev split. Furthermore, for the Num-type questions, which verify object counting performance, BAN+Counter reports the best result among prior methods by utilizing an elaborate object counting module. In contrast, MUAN achieves slightly higher accuracy than BAN+Counter, and does so without using the auxiliary bounding-box coordinates of each object. This suggests that MUAN can perform accurate object counting from the visual features alone. As far as we know, MUAN is the first single model to achieve 71%+ accuracy on the test-std split with the standard bottom-up-attention visual features.
V-E Results on CLEVR
We also conduct experiments to compare MUAN with existing state-of-the-art approaches and human performance on CLEVR, a synthesized dataset for evaluating compositional visual reasoning. Compared to VQA-v2, CLEVR requires a model not only to focus on query-specific objects, but also to reason about the relations among the related objects, which is much more challenging. Meanwhile, since the image contents are completely synthesized by an algorithm, it is possible for a model to fully understand the semantics, resulting in relatively higher performance of existing state-of-the-art methods compared with their performance on VQA-v2.
From the results shown in Table III, we can see that MUAN is at least comparable to the state of the art, even though the model is not specifically designed for this dataset. While some prior approaches use extra supervisory program labels or augmented data to guide training, MUAN learns to infer the correct answers directly from the image and question features.
V-F Results on RefCOCO, RefCOCO+, and RefCOCOg
We report the comparative results on RefCOCO, RefCOCO+, and RefCOCOg in Table IV. We use the common evaluation criterion of accuracy, defined as the percentage of predicted bounding boxes overlapping the ground truth with IoU > 0.5. From the results, we can see that: 1) with the standard proposal features extracted from the detector pre-trained on MS-COCO, MUAN reports a remarkable improvement over MAttNet, the state-of-the-art visual grounding model; and 2) with the more powerful proposal features extracted from the detector pre-trained on Visual Genome, MUAN reports a 9% improvement over DDPN, a strong baseline that uses the same visual features. These results show that MUAN steadily outperforms existing state-of-the-art methods regardless of the proposal features used. Compared with existing approaches, MUAN additionally models the intra-modal interactions within each modality, which provide contextual information that facilitates visual grounding.
V-G Qualitative Analysis
In Fig. 6, we show one VQA example and visualize four attention maps (obtained by Eq.(5)) from the 1st, 3rd, 6th and 9th UA blocks, respectively. Since only the feature of the dummy token is used to predict the answer, we focus on its related attention weights (i.e., the first row of each attention map). In the 1st attention map, the word ‘many’ obtains the largest weight, while the other words and visual objects are almost ignored. This suggests that the 1st block acts as a question-type classifier. In the 3rd attention map, the word ‘street’ is highlighted, which is a contextual word for understanding the question. The key word ‘buses’ is highlighted in the 6th attention map, and the two buses (i.e., the 22nd and 31st objects) are highlighted in the 9th attention map. This visual reasoning process shows how the information of the highlighted words and objects is gradually aggregated into the dummy token's feature. For the 9th UA block, we split its attention map into four parts (i.e., text-to-text, text-to-image, image-to-text, and image-to-image). In the text-to-text part, the largest values reflect the relationships between the key word and its context, providing a structured and fine-grained understanding of the question semantics (i.e., the bus is on the street). In the text-to-image part, some words attend to the key objects, suggesting that these words aggregate information from the key objects to improve their representations. Similar observations can be made for the image-to-text and image-to-image parts.
In Fig. 7, we show one visual grounding example and visualize the prediction and the learned unified attention. In the first image, we can see that MUAN accurately localizes the most relevant object proposal and then outputs the refined bounding box as the final prediction. We visualize the learned textual and visual attentions of the 1st, 3rd, 6th and 9th UA blocks, respectively. By performing column-wise max-pooling over the unified attention map, we obtain the attention weights for the words and objects. For clearer visualization, we only show the three representative objects with the largest attention weights. From the results, we can see that: 1) the key words are highlighted only in the 1st block, indicating that their information has been successfully transferred to the attended visual features in the following blocks; and 2) the learned visual attention in the 1st block is meaningless. After receiving the textual information, the visual attention tends to focus on the contextual objects in the 3rd and 6th blocks (i.e., the hat and the baby), and finally focuses on the correct target object (i.e., the woman) in the 9th block.
VI Conclusion and Future Work
In this work, we present a novel unified attention model that simultaneously captures the intra- and inter-modal interactions of multimodal data. By stacking such unified attention blocks in depth, we obtain the Multimodal Unified Attention Network (MUAN), which is suitable for both VQA and visual grounding tasks. Our approach is simple yet highly effective. We verify the effectiveness of MUAN on five datasets, and the experimental results show that our approach achieves top-level performance on all the benchmarks without any dataset-specific model tuning.
Since MUAN is a general framework that can be applied to many multimodal learning tasks, there remains significant room for improvement, for example by introducing multitask learning that shares the same backbone model, or by introducing weakly-supervised model pre-training on large-scale multimodal data in the wild.
- In our implementation, we let and omit for simplicity, and rewrite Eq.(7) as
- When multiple UA blocks are stacked in depth, only the first block needs to handle multimodal inputs; Eq.(7) is therefore omitted in the other blocks.
- For each proposal, we first extract the 5-D spatial feature proposed in , and then linearly transform it with a fully-connected layer to a 2048-D feature, matching the dimensionality of the proposal's 2048-D visual feature.
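The footnote above can be sketched as follows. The exact definition of the 5-D spatial feature follows the cited reference; the encoding below (normalized corners plus normalized area) is one common choice, and the projection weights are a hypothetical stand-in for the learned fully-connected layer:

```python
import numpy as np

def spatial_feature(box, img_w, img_h):
    # One common 5-D spatial encoding of a proposal box (x1, y1, x2, y2):
    # normalized corner coordinates plus the normalized box area.
    x1, y1, x2, y2 = box
    return np.array([
        x1 / img_w, y1 / img_h,
        x2 / img_w, y2 / img_h,
        (x2 - x1) * (y2 - y1) / (img_w * img_h),
    ])

# Hypothetical learned fully-connected layer mapping 5-D -> 2048-D,
# so the spatial feature matches the proposal's 2048-D visual feature.
rng = np.random.default_rng(2)
W = rng.standard_normal((5, 2048)) * 0.02
b = np.zeros(2048)

feat5 = spatial_feature((50, 40, 250, 200), img_w=640, img_h=480)
feat2048 = feat5 @ W + b
```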
- (2018) Bottom-up and top-down attention for image captioning and visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §IV-A, §V-D, TABLE II.
- (2015) VQA: visual question answering. In IEEE International Conference on Computer Vision (ICCV), pp. 2425–2433. Cited by: §I, §II, §V-A.
- (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §III-B.
- (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §I.
- (2017) MUTAN: multimodal Tucker fusion for visual question answering. In IEEE International Conference on Computer Vision (ICCV), Cited by: §II.
- (2015) ABC-CNN: an attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960. Cited by: §II.
- (2016) Attention to scale: scale-aware semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3640–3649. Cited by: §I.
- (2018) Visual grounding via accumulated attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7746–7755. Cited by: §II.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I, §III, §IV-A, §V-C.
- (2017) Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations (ICLR), Cited by: §I.
- (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: §I, §II, §II, §III.
- (2015) Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448. Cited by: §IV-B.
- (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I, §V-A.
- (2015) DRAW: a recurrent neural network for image generation. In International Conference on Machine Learning (ICML), pp. 1462–1471. Cited by: §I.
- (2017) Mask R-CNN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2961–2969. Cited by: §V-B, TABLE IV.
- (2016) Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II, §III-B, §IV-A, §V-B, TABLE IV.
- (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §IV-A.
- (2018) Relation networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
- (2017) Learning to reason: end-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §V-E, TABLE III.
- (2017) Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1115–1124. Cited by: TABLE IV.
- (2018) Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067. Cited by: §V-B, TABLE III.
- (2016) A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485. Cited by: §II.
- (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2901–2910. Cited by: §I, §V-A, TABLE III.
- (2017) Inferring and executing programs for visual reasoning. In IEEE International Conference on Computer Vision (ICCV), pp. 2989–2998. Cited by: §V-E, TABLE III.
- (2014) ReferItGame: referring to objects in photographs of natural scenes. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798. Cited by: §I, §V-A.
- (2018) Bilinear attention networks. NIPS. Cited by: §I, §II, §III-B, §III, §V-B, §V-D, TABLE II.
- (2017) Hadamard product for low-rank bilinear pooling. In International Conference on Learning Representations (ICLR), Cited by: §II, §III-A, §III.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-B.
- (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332. Cited by: §V-B, TABLE II, TABLE IV.
- (2019) Beyond RNNs: positional self-attention with co-attention for video question answering. In AAAI, Cited by: §II.
- (2017) Factorized bilinear models for image recognition. IEEE International Conference on Computer Vision (ICCV). Cited by: §III-A.
- (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §V-A, §V-A, TABLE IV.
- (2017) Referring expression generation and comprehension via attributes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4856–4864. Cited by: TABLE IV.
- (2016) SSD: single shot multibox detector. In European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: TABLE IV.
- (2016) Hierarchical question-image co-attention for visual question answering. In NIPS, pp. 289–297. Cited by: §I, §II.
- (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §I.
- (2016) Generation and comprehension of unambiguous object descriptions. In IEEE International Conference on Computer Vision (ICCV), pp. 11–20. Cited by: §I, §V-A.
- (2014) Recurrent models of visual attention. In NIPS, pp. 2204–2212. Cited by: §I.
- (2018) Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I, §II, §III-B, §III.
- (2018) Dynamic fusion with intra-and inter-modality attention flow for visual question answering. arXiv preprint arXiv:1812.05252. Cited by: §II, TABLE II.
- (2014) GloVe: global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Vol. 14, pp. 1532–1543. Cited by: §IV-A.
- (2018) FiLM: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §V-E, TABLE III.
- (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Cited by: §III.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §V-B, §V-B, TABLE IV.
- (2016) Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision (ECCV), pp. 817–834. Cited by: §I, §II.
- (2015) A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685. Cited by: §I.
- (2017) A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976. Cited by: §V-E, TABLE III.
- (2016) Where to look: focus regions for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4613–4621. Cited by: §II.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: TABLE IV.
- (2018) From deterministic to generative: multi-modal stochastic RNNs for video captioning. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
- (2017) Tips and tricks for visual question answering: learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711. Cited by: §IV-A, §V-B, TABLE II.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010. Cited by: §I, §I, §III-A, §III-A, §III.
- (2018) Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803. Cited by: §I.
- (2015) Show, attend and tell: neural image caption generation with visual attention.. In International Conference on Machine Learning (ICML), Vol. 14, pp. 77–81. Cited by: §I, §I, §I.
- (2018) Shared predictive cross-modal deep quantization. IEEE Transactions on Neural Networks and Learning Systems 29 (11), pp. 5292–5303. Cited by: §I.
- (2016) Stacked attention networks for image question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21–29. Cited by: §I, §II.
- (2018) MAttNet: modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315. Cited by: §II, §V-B, TABLE IV.
- (2016) Modeling context in referring expressions. In European Conference on Computer Vision (ECCV), pp. 69–85. Cited by: footnote 3.
- (2017) A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7282–7290. Cited by: §II, TABLE IV.
- (2014) Discriminative coupled dictionary hashing for fast cross-media retrieval. In International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 395–404. Cited by: §I.
- (2019) Deep modular co-attention networks for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6281–6290. Cited by: TABLE II.
- (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. IEEE International Conference on Computer Vision (ICCV), pp. 1839–1848. Cited by: §I, §II.
- (2018) Beyond bilinear: generalized multi-modal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §II, TABLE II.
- (2018) Rethinking diversified and discriminative proposal generation for visual grounding. International Joint Conference on Artificial Intelligence (IJCAI). Cited by: §II, §IV-B, §V-B, §V-F, TABLE IV.
- (2018) Grounding referring expressions in images by variational context. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: TABLE IV.
- (2018) Learning to count objects in natural images for visual question answering. International Conference on Learning Representation (ICLR). Cited by: §V-D, TABLE II.
- (2019) Multimodal deep network embedding with integrated structure and attribute information. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
- (2015) Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167. Cited by: §II.
- (2018) Parallel attention: a unified framework for visual object discovery through dialogs and queries. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4252–4261. Cited by: §II.
- (2014) Edge boxes: locating object proposals from edges. In European Conference on Computer Vision (ECCV), pp. 391–405. Cited by: §II.