Deep Modular Co-Attention Networks for Visual Question Answering
Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective ‘co-attention’ model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the guided-attention of images jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN’s effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63 overall accuracy on the test-dev set. Code is available at https://github.com/MILVLG/mcan-vqa.
1 Introduction
Multimodal learning to bridge vision and language has gained broad interest from both the computer vision and natural language processing communities. Significant progress has been made in many vision-language tasks, including image-text matching [23, 14], visual captioning [9, 30, 1], visual grounding [10, 34] and visual question answering (VQA) [2, 21, 14, 36]. Compared to other multimodal learning tasks, VQA is a more challenging task that requires fine-grained semantic understanding of both the image and the question, together with visual reasoning to predict an accurate answer.
The attention mechanism is a recent advance in deep neural networks that has been successfully applied to unimodal tasks (e.g., vision, language, and speech), as well as the aforementioned multimodal tasks. The idea of learning visual attention on image regions from the input question in VQA was first proposed by [27, 7], and it has become a de facto component of almost all VQA approaches [10, 16, 1]. Along with visual attention, learning textual attention on the question key words is also very important. Recent works have shown that simultaneously learning co-attention for the visual and textual modalities can benefit the fine-grained representation of the image and question, leading to more accurate prediction [20, 33]. However, these co-attention models learn only coarse interactions between multimodal instances, and the learned co-attention cannot infer the correlation between each image region and each question word. This is a significant limitation of these co-attention models.
To overcome the problem of insufficient multimodal interactions, two dense co-attention models, BAN [14] and DCN [24], have been proposed to model dense interactions between any image region and any question word. The dense co-attention mechanism facilitates the understanding of the image-question relationship needed to answer questions correctly. Interestingly, both of these dense co-attention models can be cascaded in depth, forming deep co-attention models that support more complex visual reasoning, thereby potentially improving VQA performance. However, these deep models show little improvement over their corresponding shallow counterparts or the coarse co-attention model MFH [33] (see Figure 1). We think the bottleneck in these deep co-attention models is that they do not simultaneously model dense self-attention within each modality (i.e., the word-to-word relationship for questions, and the region-to-region relationship for images).
Inspired by the Transformer model in machine translation [29], here we design two general attention units: a self-attention (SA) unit that can model the dense intra-modal interactions (word-to-word or region-to-region); and a guided-attention (GA) unit to model the dense inter-modal interactions (word-to-region). After that, by modular composition of the SA and GA units, we obtain different Modular Co-Attention (MCA) layers that can be cascaded in depth. Finally, we propose a deep Modular Co-Attention Network (MCAN) which consists of cascaded MCA layers. The results in Figure 1 show that a deep MCAN model significantly outperforms existing state-of-the-art co-attention models on the benchmark VQA-v2 dataset [11], which verifies the synergy of self-attention and guided-attention in co-attention learning, and also highlights the potential of deep reasoning. Furthermore, we find that modeling self-attention for image regions can greatly improve the object counting performance, which is challenging for VQA.
2 Related Work
We briefly review previous studies on VQA, especially those studies that introduce co-attention models.
Visual Question Answering (VQA). VQA has been of increasing interest over the last few years. The multimodal fusion of global features is the most straightforward VQA solution. The image and question are first represented as global features and then fused by a multimodal fusion model to predict the answer. Some approaches introduce a more complex model to learn better question representations with LSTM networks, or a better multimodal fusion model with residual networks.
One limitation of the aforementioned multimodal fusion models is that the global feature representation of an image may lose critical information needed to correctly answer questions about local image regions (e.g., “what is in the woman’s left hand”). Therefore, recent approaches have introduced the visual attention mechanism into VQA by adaptively learning the attended image features for a given question, and then performing multimodal feature fusion to obtain an accurate prediction. Chen et al. proposed a question-guided attention map that projected the question embeddings into the visual space and formulated a configurable convolutional kernel to search the image attention region [7]. Yang et al. proposed a stacked attention network to learn the attention iteratively [31]. Fukui et al. [10], Kim et al. [16], Yu et al. [32, 33] and Ben et al. [6] exploited different multimodal bilinear pooling methods to integrate the visual features from the image’s spatial grids with the textual features from the questions to predict the attention. Anderson et al. introduced a bottom-up and top-down attention mechanism to learn the attention on candidate objects rather than spatial grids [1].
Co-Attention Models. Beyond understanding the visual contents of the image, VQA also requires fully understanding the semantics of the natural language question. Therefore, it is necessary to learn the textual attention for the question and the visual attention for the image simultaneously. Lu et al. proposed a co-attention learning framework to alternately learn the image attention and question attention [20]. Yu et al. reduced the co-attention method to two steps: self-attention for a question embedding and question-conditioned attention for a visual embedding [32]. Nam et al. proposed a multi-stage co-attention learning model to refine the attentions based on memory of previous attentions [23]. However, these co-attention models learn separate attention distributions for each modality (image or question), and neglect the dense interaction between each question word and each image region. This becomes a bottleneck for understanding fine-grained relationships of multimodal features. To address this issue, dense co-attention models have been proposed, which establish the complete interaction between each question word and each image region [24, 14]. Compared to the previous co-attention models with coarse interactions, the dense co-attention models deliver significantly better VQA performance.
3 Modular Co-Attention Layer
Before presenting the Modular Co-Attention Network, we first introduce its basic component, the Modular Co-Attention (MCA) layer. The MCA layer is a modular composition of the two basic attention units, i.e., the self-attention (SA) unit and the guided-attention (GA) unit, inspired by the scaled dot-product attention proposed in [29]. Using different combinations, we obtain three MCA variants with different motivations.
3.1 Self-Attention and Guided-Attention Units
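The input of scaled dot-product attention consists of queries and keys of dimension $d_{key}$, and values of dimension $d_{value}$. For simplicity, $d_{key}$ and $d_{value}$ are usually set to the same number $d$. We calculate the dot products of the query with all keys, divide each by $\sqrt{d}$, and apply a softmax function to obtain the attention weights on the values. Given a query $q \in \mathbb{R}^{1 \times d}$ and $n$ key-value pairs (packed into a key matrix $K \in \mathbb{R}^{n \times d}$ and a value matrix $V \in \mathbb{R}^{n \times d}$), the attended feature $f \in \mathbb{R}^{1 \times d}$ is obtained by weighted summation over all values $V$ with respect to the attention learned from $q$ and $K$:
$$f = A(q, K, V) = \mathrm{softmax}\left(\frac{qK^{T}}{\sqrt{d}}\right)V \qquad (1)$$
To further improve the representation capacity of the attended features, multi-head attention is introduced in [29], which consists of $h$ parallel ‘heads’. Each head corresponds to an independent scaled dot-product attention function. The attended output feature $f$ is given by:
$$f = MA(q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h]\,W^{o} \qquad (2)$$
$$\mathrm{head}_j = A(qW_j^{Q}, KW_j^{K}, VW_j^{V}) \qquad (3)$$
where $W_j^{Q}, W_j^{K}, W_j^{V} \in \mathbb{R}^{d \times d_h}$ are the projection matrices for the $j$-th head, and $W^{o} \in \mathbb{R}^{h d_h \times d}$. $d_h$ is the dimensionality of the output features from each head. To prevent the multi-head attention model from becoming too large, we usually have $d_h = d/h$. In practice, we can compute the attention function on a set of $m$ queries $Q = [q_1; \ldots; q_m] \in \mathbb{R}^{m \times d}$ seamlessly by replacing $q$ with $Q$ in Eq.(2), to obtain the attended output features $F \in \mathbb{R}^{m \times d}$.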
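The scaled dot-product attention and its multi-head extension described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`softmax`, `matmul`, `attention`, `multi_head`, `rand_matrix`) are hypothetical, the weights are random stand-ins for learned projections, and the toy sizes (d = 8, h = 2) are assumptions chosen for readability.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    K_t = [list(col) for col in zip(*K)]           # (d, n)
    scores = matmul(Q, K_t)                        # (m, n) dot products
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)                      # (m, d)

def multi_head(Q, K, V, heads, W_o):
    """Run one attention per head, concatenate the outputs, project with W_o.

    heads is a list of (W_q, W_k, W_v) triples, each matrix of shape (d, d_h).
    """
    outs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
            for W_q, W_k, W_v in heads]
    concat = [[v for out in outs for v in out[i]] for i in range(len(Q))]
    return matmul(concat, W_o)

def rand_matrix(rows, cols, rng):
    """Hypothetical helper: random weights standing in for learned ones."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, h = 8, 2
d_h = d // h
heads = [tuple(rand_matrix(d, d_h, rng) for _ in range(3)) for _ in range(h)]
W_o = rand_matrix(h * d_h, d, rng)
X = rand_matrix(3, d, rng)    # m = 3 queries
Y = rand_matrix(5, d, rng)    # n = 5 key-value pairs
F = multi_head(X, Y, Y, heads, W_o)
print(len(F), len(F[0]))      # one d-dimensional output per query
```

Calling `multi_head(X, X, X, ...)` corresponds to the self-attention case, while `multi_head(X, Y, Y, ...)` corresponds to the guided case; in both, the output keeps one row per query.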
We build two attention units on top of the multi-head attention to handle the multimodal input features for VQA, namely the self-attention (SA) unit and the guided-attention (GA) unit. The SA unit (see Figure 2(a)) is composed of a multi-head attention layer and a pointwise feed-forward layer. Taking one group of input features $X = [x_1; \ldots; x_m] \in \mathbb{R}^{m \times d_x}$, the multi-head attention learns the pairwise relationship between each paired sample $\langle x_i, x_j \rangle$ within $X$ and outputs the attended output features $Z \in \mathbb{R}^{m \times d}$ by weighted summation of all the instances in $X$. The feed-forward layer takes the output features of the multi-head attention layer and further transforms them through two fully-connected layers with ReLU activation and dropout (FC($4d$)-ReLU-Dropout(0.1)-FC($d$)). Moreover, a residual connection [12] followed by layer normalization [3] is applied to the outputs of the two layers to facilitate optimization. The GA unit (see Figure 2(b)) takes two groups of input features $X$ and $Y = [y_1; \ldots; y_n] \in \mathbb{R}^{n \times d_y}$, where $Y$ guides the attention learning for $X$. Note that the shapes of $X$ and $Y$ are flexible, so they can be used to represent the features for different modalities (e.g., questions and images). The GA unit models the pairwise relationship between each paired sample $\langle x_i, y_j \rangle$ from $X$ and $Y$, respectively.
Interpretation: Since the multi-head attention in Eq.(2) plays a key role in the two attention units, we take a closer look at it to see how it works with respect to different types of inputs. For an SA unit with input features $X$, for each $x_i \in X$, its attended feature $f_i = MA(x_i, X, X)$ can be understood as reconstructing $x_i$ from all the samples in $X$ with respect to their normalized similarities to $x_i$. Analogously, for a GA unit with input features $X$ and $Y$, the attended feature $f_i = MA(x_i, Y, Y)$ for $x_i$ is obtained by reconstructing $x_i$ from all the samples in $Y$ with respect to their normalized cross-modal similarity to $x_i$.
3.2 Modular Composition for VQA
Based on the two basic attention units in Figure 2, we combine them to obtain three modular co-attention (MCA) layers (see Figure 3) to handle the multimodal features for VQA. All three MCA layers can be cascaded in depth, such that the outputs of the previous MCA layer can be directly fed to the next MCA layer. This implies that the number of input features is equal to the number of output features, without instance reduction.
The ID(Y)-GA(X,Y) layer in Figure 3(a) is our baseline. In ID(Y)-GA(X,Y), the input question features are directly passed through to the output features with an identity mapping, and the dense inter-modal interaction between each region and each word $\langle x_i, y_j \rangle$ is modeled in a GA(X,Y) unit. These interactions are further exploited to obtain the attended image features. Compared to the ID(Y)-GA(X,Y) layer, the SA(Y)-GA(X,Y) layer in Figure 3(b) adds an SA(Y) unit to model the dense intra-modal interaction between each question word pair $\langle y_i, y_j \rangle$. The SA(Y)-SGA(X,Y) layer in Figure 3(c) further adds an SA(X) unit to the SA(Y)-GA(X,Y) layer to model the intra-modal interaction between each image region pair $\langle x_i, x_j \rangle$. (Footnote 1: In our implementation, we omit the feed-forward layer and norm layer of the SA(X) unit to save memory costs.)
Note that the three MCA layers above do not cover all the possible compositions. We have also explored other MCA variants, such as the symmetric architectures GA(X,Y)-GA(Y,X) and SGA(X,Y)-SGA(Y,X). However, these MCA variants do not deliver competitive performance, so we do not discuss them further due to space limitations.
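The three layer variants can be read as plain function composition over two units. The sketch below is illustrative only and makes several simplifying assumptions that are not from the paper: `sa` and `ga` are hypothetical stand-ins built on single-head attention without learned weights, the feed-forward and normalization steps are left out, both inputs are assumed to share the same feature size, and the GA unit is assumed to be guided by the question features produced within the same layer.

```python
import math

def attend(queries, memory):
    """Single-head scaled dot-product attention with keys = values = memory."""
    d = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in memory]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, memory))
                    for j in range(d)])
    return out

def residual(x, fx):
    """Element-wise residual connection x + f(x)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, fx)]

def sa(X):
    """Simplified self-attention unit: X attends over itself."""
    return residual(X, attend(X, X))

def ga(X, Y):
    """Simplified guided-attention unit: X attends over Y."""
    return residual(X, attend(X, Y))

def id_ga(X, Y):
    """ID(Y)-GA(X,Y): question features pass through unchanged."""
    return ga(X, Y), Y

def sa_ga(X, Y):
    """SA(Y)-GA(X,Y): add self-attention over question words."""
    Y = sa(Y)
    return ga(X, Y), Y

def sa_sga(X, Y):
    """SA(Y)-SGA(X,Y): additionally add self-attention over image regions."""
    Y = sa(Y)
    return ga(sa(X), Y), Y

X = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]   # 3 toy image regions
Y = [[0.4, 0.4], [-0.2, 0.1]]               # 2 toy question words
for layer in (id_ga, sa_ga, sa_sga):
    X_out, Y_out = layer(X, Y)
    print(layer.__name__, len(X_out), len(Y_out))
```

Each variant returns as many image and question features as it receives, which is what allows the layers to be cascaded in depth.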
4 Modular Co-Attention Networks
In this section, we describe the Modular Co-Attention Networks (MCAN) architecture for VQA. We first explain the image and question feature representation from the input question and image. Then, we propose two deep co-attention models, namely stacking and encoder-decoder, which consist of multiple MCA layers cascaded in depth to gradually refine the attended image and question features. Once we obtain the attended image and question features, we design a simple multimodal fusion model to fuse the multimodal features and finally feed them to a multi-label classifier to predict the answer. An overview flowchart of MCAN is shown in Figure 4.
We name the MCAN model with the stacking strategy MCAN$_{sk}$-$L$ and the MCAN model with the encoder-decoder strategy MCAN$_{ed}$-$L$, where $L$ is the total number of MCA layers cascaded in depth.
4.1 Question and Image Representations
The input image is represented as a set of regional visual features in a bottom-up manner [1]. These features are the intermediate features extracted from a Faster R-CNN model (with ResNet-101 as its backbone) [26] pre-trained on the Visual Genome dataset [18]. We set a confidence threshold on the probabilities of detected objects and obtain a dynamic number of objects $m \in [10, 100]$. The $i$-th object is represented as a feature $x_i \in \mathbb{R}^{d_x}$ by mean-pooling the convolutional feature from its detected region. Finally, the image is represented as a feature matrix $X \in \mathbb{R}^{m \times d_x}$.
The input question is first tokenized into words, and trimmed to a maximum of 14 words similar to [28, 14]. Each word in the question is further transformed into a vector using the 300-D GloVe word embeddings [25] pre-trained on a large-scale corpus. This results in a sequence of words of size $n \times 300$, where $n \in [1, 14]$ is the number of words in the question. The word embeddings are then passed through a one-layer LSTM network [13] with $d_y$ hidden units. In contrast to [28], which only uses the final state (i.e., the output feature for the last word) as the question feature, we maintain the output features for all words and output a question feature matrix $Y \in \mathbb{R}^{n \times d_y}$.
To deal with the variable number of objects $m$ and variable question length $n$, we use zero-padding to fill $X$ and $Y$ to their maximum sizes (i.e., $m = 100$ and $n = 14$, respectively). During training, we mask the padding logits with $-\infty$ to get zero probability before every softmax layer to avoid the underflow problem.
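The padding-and-masking step can be illustrated with a small sketch. The function names (`pad_rows`, `padding_mask`, `masked_softmax`) are hypothetical, and detecting padding by all-zero rows is an assumption made here for simplicity; the sketch also assumes at least one position is unmasked.

```python
import math

def pad_rows(rows, max_len, dim):
    """Trim to max_len rows, then zero-pad up to exactly max_len rows."""
    rows = rows[:max_len]
    return rows + [[0.0] * dim for _ in range(max_len - len(rows))]

def padding_mask(rows):
    """True where a row is all zeros, i.e. where it is treated as padding."""
    return [all(v == 0.0 for v in row) for row in rows]

def masked_softmax(logits, mask):
    """Softmax that gives exactly zero probability to masked positions."""
    masked = [-math.inf if m else v for v, m in zip(logits, mask)]
    peak = max(masked)                       # finite if any position is unmasked
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Two real question words padded to a toy maximum length of 4.
Y = pad_rows([[1.0, 2.0], [3.0, 4.0]], max_len=4, dim=2)
mask = padding_mask(Y)
probs = masked_softmax([0.5, 0.5, 9.0, 9.0], mask)
print(mask)    # [False, False, True, True]
print(probs)   # [0.5, 0.5, 0.0, 0.0]
```

Because `exp(-inf)` is exactly zero, the padded positions contribute nothing to the weighted sum even when their raw logits are large.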
4.2 Deep Co-Attention Learning
Taking the aforementioned image features $X$ and the question features $Y$ as inputs, we perform deep co-attention learning by passing the input features through a deep co-attention model consisting of $L$ MCA layers cascaded in depth (denoted by MCA$^{(1)}$, MCA$^{(2)}$, …, MCA$^{(L)}$). Denoting the input features for MCA$^{(l)}$ as $X^{(l-1)}$ and $Y^{(l-1)}$ respectively, their output features are denoted by $X^{(l)}$ and $Y^{(l)}$, which are further fed to MCA$^{(l+1)}$ as its inputs in a recursive manner:
$$[X^{(l)}, Y^{(l)}] = \mathrm{MCA}^{(l)}([X^{(l-1)}, Y^{(l-1)}]) \qquad (4)$$
For MCA$^{(1)}$, we set its input features $X^{(0)} = X$ and $Y^{(0)} = Y$, respectively.
Taking the SA(Y)-SGA(X,Y) layer as an example (the other two MCA layers proceed in the same manner), we formulate two deep co-attention models in Figure 5.
The stacking model (Figure 5(a)) simply stacks $L$ MCA layers in depth and outputs $X^{(L)}$ and $Y^{(L)}$ as the final attended image and question features. The encoder-decoder model (Figure 5(b)) is inspired by the Transformer model proposed in [29]. It slightly modifies the stacking model by replacing the input features $Y^{(l)}$ of the GA unit in each MCA$^{(l)}$ with the question features $Y^{(L)}$ from the last MCA layer. The encoder-decoder strategy can be understood as an encoder that learns the attended question features $Y^{(L)}$ with $L$ stacked SA units and a decoder that uses $Y^{(L)}$ to learn the attended image features $X^{(L)}$ with stacked SGA units.
The two deep models are of the same size for the same $L$. In the special case of $L = 1$, the two models are strictly equivalent to each other.
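The difference between the two strategies is only in which question features guide each image-side unit, which a control-flow sketch makes concrete. The function names and the toy units below are hypothetical: `toy_sa` and `toy_sga` are not attention at all, they merely record how many question-side units have run and which version of the question features guided each image-side unit.

```python
def run_stacking(X, Y, layers):
    """Stacking: each layer guides its image unit with that layer's question features."""
    for sa_y, sga in layers:
        Y = sa_y(Y)
        X = sga(X, Y)
    return X, Y

def run_encoder_decoder(X, Y, layers):
    """Encoder-decoder: every image unit is guided by the final question features."""
    for sa_y, _ in layers:       # encoder: stacked question-side units
        Y = sa_y(Y)
    for _, sga in layers:        # decoder: stacked image-side units
        X = sga(X, Y)
    return X, Y

def toy_sa(Y):
    """Toy stand-in: count how many question-side units have been applied."""
    return Y + 1

def toy_sga(X, Y):
    """Toy stand-in: record which version of Y guided this image-side unit."""
    return X + [Y]

layers = [(toy_sa, toy_sga)] * 3
print(run_stacking([], 0, layers))          # ([1, 2, 3], 3)
print(run_encoder_decoder([], 0, layers))   # ([3, 3, 3], 3)
```

With a single layer both functions perform the same two calls in the same order, which mirrors the equivalence noted above for one layer.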
4.3 Multimodal Fusion and Output Classifier
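After the deep co-attention learning stage, the output image features $X^{(L)} = [x_1^{(L)}; \ldots; x_m^{(L)}] \in \mathbb{R}^{m \times d}$ and question features $Y^{(L)} = [y_1^{(L)}; \ldots; y_n^{(L)}] \in \mathbb{R}^{n \times d}$ already contain rich information about the attention weights over the question words and image regions. Therefore, we design an attentional reduction model with a two-layer MLP (FC($d$)-ReLU-Dropout(0.1)-FC(1)) for $Y^{(L)}$ (or $X^{(L)}$) to obtain its attended feature $\tilde{y}$ (or $\tilde{x}$). Taking $X^{(L)}$ as an example, the attended feature $\tilde{x}$ is obtained as follows:
$$\alpha = \mathrm{softmax}(\mathrm{MLP}(X^{(L)})), \qquad \tilde{x} = \sum_{i=1}^{m} \alpha_i x_i^{(L)} \qquad (5)$$
where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m] \in \mathbb{R}^{m}$ are the learned attention weights. We can obtain the attended feature $\tilde{y}$ for $Y^{(L)}$ using an independent attentional reduction model by analogy.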
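A minimal sketch of the attentional reduction follows, assuming that the MLP produces one score per feature row, that the scores are normalized with a softmax over the rows, and that the attended feature is the resulting weighted sum. Dropout is omitted (as at inference time), and the names `linear` and `attentional_reduction` as well as the weight layout (input size by output size) are illustrative choices.

```python
import math

def linear(x, W, b):
    """Affine map of a vector x with W of shape (len(x), out_dim) and bias b."""
    return [sum(xi * w for xi, w in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

def attentional_reduction(X, W1, b1, W2, b2):
    """Pool the rows of X into one vector with MLP-scored softmax weights."""
    logits = []
    for x in X:
        hidden = [max(0.0, v) for v in linear(x, W1, b1)]   # FC + ReLU
        logits.append(linear(hidden, W2, b2)[0])            # FC to one score
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    alpha = [e / total for e in exps]                        # weights over rows
    pooled = [sum(a * x[k] for a, x in zip(alpha, X)) for k in range(len(X[0]))]
    return pooled, alpha

# With all-zero MLP weights every row gets the same score, so the pooled
# feature reduces to the plain average of the rows.
X = [[1.0, 2.0], [3.0, 4.0]]
W1, b1 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
W2, b2 = [[0.0], [0.0]], [0.0]
pooled, alpha = attentional_reduction(X, W1, b1, W2, b2)
print(alpha)    # [0.5, 0.5]
print(pooled)   # [2.0, 3.0]
```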
Using the computed $\tilde{y}$ and $\tilde{x}$, we design the linear multimodal fusion function as follows:
$$z = \mathrm{LayerNorm}(W_x^{T}\tilde{x} + W_y^{T}\tilde{y}) \qquad (6)$$
where $W_x, W_y \in \mathbb{R}^{d \times d_z}$ are two linear projection matrices. $d_z$ is the common dimensionality of the fused feature. LayerNorm is used here to stabilize training [3].
The fused feature $z$ is projected into a vector $s \in \mathbb{R}^{N}$ followed by a sigmoid function, where $N$ is the number of the most frequent answers in the training set. Following [28], we use binary cross-entropy (BCE) as the loss function to train an $N$-way classifier on top of the fused feature $z$.
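The fusion and classification step can be sketched as follows, assuming that the stabilizing operation is layer normalization applied to the sum of the two linear projections. The layer normalization here has no learnable gain or bias, the helper names are hypothetical, the numbers are toy values, and the loss is not hardened against extreme logits.

```python
import math

def project(v, W):
    """Row vector times a matrix W of shape (len(v), out_dim)."""
    return [sum(a * w for a, w in zip(v, col)) for col in zip(*W)]

def layer_norm(v, eps=1e-6):
    """Layer normalization without learnable gain and bias (a simplification)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def fuse(x_att, y_att, W_x, W_y):
    """Sum the two projected features, then normalize."""
    summed = [a + b for a, b in zip(project(x_att, W_x), project(y_att, W_y))]
    return layer_norm(summed)

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def bce_loss(logits, targets):
    """Mean binary cross-entropy over the answer candidates."""
    total = 0.0
    for s, t in zip(logits, targets):
        p = sigmoid(s)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(logits)

x_att, y_att = [1.0, 2.0], [0.5, -1.0]            # toy attended features
W_x = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]        # toy projections to 3 dims
W_y = [[0.1, 0.1, 0.1], [-0.3, 0.2, 0.5]]
z = fuse(x_att, y_att, W_x, W_y)
W_s = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.1, -0.1, 0.2], [0.2, 0.0, 0.1, -0.3]]
logits = project(z, W_s)                          # one score per candidate answer
loss = bce_loss(logits, [1.0, 0.0, 0.0, 0.0])
print(len(z), len(logits), loss > 0.0)
```

Using a sigmoid per answer rather than a softmax over answers treats each candidate as an independent binary decision, which is what makes the binary cross-entropy loss applicable.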
5 Experiments
In this section, we conduct experiments to evaluate the performance of our MCAN models on the largest VQA benchmark dataset, VQA-v2 [11]. Since the different MCA variants and deep co-attention models may influence final performance, we perform extensive quantitative and qualitative ablation studies to explore the reasons why MCAN performs well. Finally, with the optimal hyper-parameters, we compare our best model with current state-of-the-art models under the same settings.
5.1 Datasets
VQA-v2 is the most commonly used VQA benchmark dataset [11]. It contains human-annotated question-answer pairs relating to the images from the MS-COCO dataset [19], with 3 questions per image and 10 answers per question. The dataset is split into three: train (80k images and 444k QA pairs); val (40k images and 214k QA pairs); and test (80k images and 448k QA pairs). Additionally, there are two test subsets called test-dev and test-standard to evaluate model performance online. The results consist of three per-type accuracies (Yes/No, Number, and Other) and an overall accuracy.
5.2 Implementation Details
The hyper-parameters of our model used in the experiments are as follows. The dimensionality of input image features $d_x$, input question features $d_y$, and fused multimodal features $d_z$ are 2,048, 512, and 1,024, respectively. Following the suggestions in [29], the latent dimensionality $d$ in the multi-head attention is 512, the number of heads $h$ is set to 8, and the latent dimensionality for each head is $d_h = d/h = 64$. The size of the answer vocabulary is set to $N = 3{,}129$ using the strategy in [28]. The number of MCA layers is $L \in \{2, 4, 6, 8\}$.
To train the MCAN model, we use the Adam solver [17] with $\beta_1 = 0.9$ and $\beta_2 = 0.98$. The base learning rate is set to $\min(2.5t \times 10^{-5}, 1 \times 10^{-4})$, where $t$ is the current epoch number starting from 1. After 10 epochs, the learning rate is decayed by 1/5 every 2 epochs. All the models are trained for up to 13 epochs with the same batch size of 64. For the results on the val split, only the train split is used for training. For the results on the test-dev or test-standard splits, both train and val splits are used for training, and a subset of VQA samples from Visual Genome [18] is also used as the augmented dataset to facilitate training.
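For illustration, the dimensionalities quoted above can be collected in a small configuration object that derives the per-head size instead of storing it. The class name `MCANConfig` and its layout are hypothetical; only the numeric values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MCANConfig:
    """Hypothetical container; field names are illustrative, values are from the text."""
    image_dim: int = 2048     # input image feature size
    question_dim: int = 512   # input question feature size
    fused_dim: int = 1024     # fused multimodal feature size
    latent_dim: int = 512     # latent size in multi-head attention
    num_heads: int = 8        # number of attention heads

    def __post_init__(self):
        # The per-head size is latent_dim / num_heads, so it must divide evenly.
        if self.latent_dim % self.num_heads != 0:
            raise ValueError("latent_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.latent_dim // self.num_heads

cfg = MCANConfig()
print(cfg.head_dim)   # 64
```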
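One possible reading of this schedule is sketched below. It rests on three assumptions that should be checked against the released code: the base rate warms up as min(2.5t × 10^-5, 10^-4) for epoch t, "decayed by 1/5" means multiplying the rate by 0.2, and the first decay takes effect at epoch 11. The function name and parameters are hypothetical.

```python
def learning_rate(epoch, decay_start=10, decay_every=2, decay_rate=0.2):
    """Learning rate for a 1-based epoch under the assumptions stated above."""
    base = min(2.5e-5 * epoch, 1e-4)          # linear warm-up, capped at 1e-4
    if epoch <= decay_start:
        return base
    # Number of decay steps taken so far: 1 for epochs 11-12, 2 for epoch 13.
    steps = (epoch - decay_start + decay_every - 1) // decay_every
    return base * decay_rate ** steps

for epoch in range(1, 14):
    print(epoch, learning_rate(epoch))
```

Under this reading the rate rises over the first four epochs, stays at 1e-4 through epoch 10, drops to 2e-5 for epochs 11 and 12, and ends at 4e-6 in epoch 13.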
5.3 Ablation Studies
We run a number of ablations to investigate the reasons why MCAN is effective. The results shown in Table 1 and Figure 6 are discussed in detail below.
MCA Variants: From the results in Table 1, we can see that SA(Y)-GA(X,Y) outperforms ID(Y)-GA(X,Y) for all answer types. This verifies that modeling self-attention for question features benefits VQA performance, which is consistent with previous works. Moreover, we can see that SA(Y)-SGA(X,Y) also outperforms SA(Y)-GA(X,Y). This reveals, for the first time, that modeling self-attention for image features is meaningful. Therefore, we use SA(Y)-SGA(X,Y) as our default MCA in the following experiments unless otherwise stated.
Stacking vs. Encoder-Decoder: From the results in Table 1, we can see that with increasing $L$, the performance of both deep co-attention models steadily improves and finally saturates at $L = 6$. The saturation can be explained by the unstable gradients during training when $L > 6$, which make the optimization difficult. Similar observations are also reported by [5]. Furthermore, the encoder-decoder model consistently outperforms the stacking model, especially when $L$ is large. This is because the learned self-attention from an early SA(Y) unit is inaccurate compared to that from the last SA(Y) unit. Directly feeding it to a GA(X,Y) unit may damage the learned guided-attention for images. The visualization in Section 5.4 supports this explanation. Finally, MCAN is much more parameter-efficient than other approaches, with MCAN$_{ed}$-2 (27M) reporting 66.2% accuracy, BAN-4 (45M) 65.8% accuracy [14], and MFH (116M) 65.7% accuracy [33]. More in-depth comparisons can be found in the supplementary material.
MCA vs. Depth: In Figure 6, we show the detailed performance of MCAN$_{ed}$-$L$ with different MCA variants. With increasing $L$, the performance gaps between the three variants increase. Furthermore, an interesting phenomenon occurs in Figure 6(c). When $L = 6$, the number type accuracies of the ID(Y)-GA(X,Y) and SA(Y)-GA(X,Y) models are nearly identical, while the SA(Y)-SGA(X,Y) model reports a 4.5-point improvement over them. This verifies that self-attention for images plays a key role in object counting.
Question Representations: Table 1 summarizes ablation experiments on different question representations. We can see that using the word embeddings pre-trained by GloVe [25] significantly outperforms random initialization. Other tricks, such as fine-tuning the GloVe embeddings or replacing the position encoding [29] with an LSTM network to model the temporal information, can slightly improve the performance further.
5.4 Qualitative Analysis
In Figure 7, we visualize the learned attentions from MCAN$_{sk}$-6 and MCAN$_{ed}$-6. Due to space limitations, we only show one example and visualize six attention maps from different attention units and different layers. More visualizations can be found in the supplementary material. From the results, we have the following observations.
Question Self-Attention SA(Y): The attention maps of SA(Y)-1 form vertical stripes, and the words like ‘how’ and ‘see’ obtain large attention weights. This unit acts as a question type classifier. Besides, the large values in the attention maps of SA(Y)-6 occur in the column ‘sheep’. This reveals that all the attended features tend to use the feature of ‘sheep’ for reconstruction. That is to say, the keyword ‘sheep’ is identified correctly.
Image Self-Attention SA(X): Values in the attention maps of SA(X)-1 are uniformly distributed, suggesting that the key objects for sheep are unclear. The large values in the attention maps of SA(X)-6 occur on the 1st, 3rd, and 11th columns, which correspond to the three sheep in the image. This explains why introducing SA(X) can greatly improve object counting performance.
Question Guided-Attention GA(X,Y): The attention maps of GA(X,Y)-1 do not focus on the current objects in the image, whereas the attention maps of GA(X,Y)-6 tend to focus on all values in the ‘sheep’ column. This can be explained by the fact that the input features have been reconstructed by the sheep features in SA(X)-6. Moreover, the GA(X,Y) units of the stacking model contain much more noise than those of the encoder-decoder model. This verifies our hypothesis presented in Section 5.3.
In Figure 8, we also visualize the final image and question attentions learned by Eq.(5). For the correctly predicted examples, the learned question and image attentions usually focus closely on the key words and the most relevant image regions (e.g., the word ‘holding’ and the region of ‘hand’ in the first example, and the word ‘vegetable’ and the region of ‘broccoli’ in the second example). From the incorrect examples, we can identify some weaknesses of our approach. For example, it occasionally makes mistakes in distinguishing the key words in questions (e.g., the word ‘left’ in the third example and the word ‘catcher’ in the last example). These observations are useful to guide further improvements in the future.
5.5 Comparison with State-of-the-Art
By taking the ablation results into account, we compare our best single model MCAN$_{ed}$-6 with the current state-of-the-art methods in Table 3. Using the same bottom-up attention visual features [1], MCAN$_{ed}$-6 significantly outperforms the current best approach BAN [14] by 1.1 points in terms of overall accuracy. Compared to BAN+Counter [14], which additionally introduces the counting module [35] to significantly improve object counting performance, our model is still 0.6 points higher. Moreover, our method obtains comparable object counting performance (i.e., the number type) to BAN+Counter, and in doing so does not use any auxiliary information like the bounding-box coordinates of each object [35]. This suggests that MCAN is more general and can naturally learn to deduplicate redundant objects based on the visual features alone. The comparative results with model ensembling are presented in the supplementary material.
6 Conclusion
In this paper, we present a novel deep Modular Co-Attention Network (MCAN) for VQA. MCAN consists of a cascade of modular co-attention (MCA) layers, each of which consists of the self-attention and guided-attention units to model the intra- and inter-modal interactions synergistically. By stacking MCA layers in depth using the encoder-decoder strategy, we obtain a deep MCAN model that achieves new state-of-the-art performance for VQA.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61702143, Grant 61836002, Grant 61622205, and in part by the Australian Research Council Projects under Grant FL-170100117, Grant DP-180103424 and Grant IH-180100002.
-  Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6077–6086, 2018.
-  Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
-  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
-  Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training deeper neural machine translation models with transparent attention. arXiv preprint arXiv:1808.07561, 2018.
-  Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. Mutan: Multimodal tucker fusion for visual question answering. In International Conference on Computer Vision (ICCV), pages 2612–2620, 2017.
-  Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
-  Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in neural information processing systems (NIPS), pages 577–585, 2015.
-  Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634, 2015.
-  Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
-  Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6904–6913, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. Advances in neural information processing systems (NIPS), 2018.
-  Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for visual qa. In Advances in neural information processing systems (NIPS), pages 361–369, 2016.
-  Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling. In International Conference on Learning Representation (ICLR), 2017.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014.
-  Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in neural information processing systems (NIPS), pages 289–297, 2016.
-  Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in neural information processing systems (NIPS), pages 1682–1690, 2014.
-  Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems (NIPS), pages 2204–2212, 2014.
-  Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471, 2016.
-  Duy-Kien Nguyen and Takayuki Okatani. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6087–6096, 2018.
-  Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (NIPS), pages 91–99, 2015.
-  Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4613–4621, 2016.
-  Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
-  Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), volume 14, pages 77–81, 2015.
-  Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 21–29, 2016.
-  Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In IEEE International Conference on Computer Vision (ICCV), pages 1839–1848, 2017.
-  Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems, 29(12):5947–5959, 2018.
-  Zhou Yu, Jun Yu, Chenchao Xiang, Zhou Zhao, Qi Tian, and Dacheng Tao. Rethinking diversified and discriminative proposal generation for visual grounding. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1114–1120, 2018.
-  Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to count objects in natural images for visual question answering. In International Conference on Learning Representations (ICLR), 2018.
-  Zhou Zhao, Zhu Zhang, Shuwen Xiao, Zhou Yu, Jun Yu, Deng Cai, Fei Wu, and Yueting Zhuang. Open-ended long-form video question answering via adaptive hierarchical reinforced networks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 3683–3689, 2018.
-  Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.
Appendix A Model Ensembling
To compare MCAN with the best results on the VQA-v2 leaderboard (https://visualqa.org/roe.html), we train 4 MCAN-6 models with slightly different hyper-parameters and ensemble them. The comparative results in Table 3 indicate that MCAN surpasses the top solutions on the leaderboard. It is worth noting that our solution uses only the basic bottom-up attention visual features and far fewer models in the ensemble.
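The appendix does not state how the predictions of the 4 models are combined. A minimal sketch, assuming the common choice of averaging per-answer scores across models and taking the highest-scoring answer, is given below. The function names and the score values are hypothetical and serve only as an illustration; they are not taken from the released code or from Table 3.

```python
from typing import Dict, List


def ensemble_average(model_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-answer scores over models (assumed combination rule)."""
    if not model_scores:
        raise ValueError("need at least one model")
    answers = model_scores[0].keys()
    n_models = len(model_scores)
    return {a: sum(s[a] for s in model_scores) / n_models for a in answers}


def predict(model_scores: List[Dict[str, float]]) -> str:
    """Return the answer with the highest averaged score."""
    averaged = ensemble_average(model_scores)
    return max(averaged, key=averaged.get)


# Hypothetical scores from 4 models for one question (illustrative values only).
scores = [
    {"yes": 0.60, "no": 0.40},
    {"yes": 0.40, "no": 0.60},
    {"yes": 0.70, "no": 0.30},
    {"yes": 0.55, "no": 0.45},
]
answer = predict(scores)
```

In this toy case one model disagrees with the other three, and the averaged scores still favor the majority answer.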
Appendix B Comparisons of Model Stability and Computational Costs
We compare MCAN-6 with the two best approaches (MFH and BAN-8) in Table 4 in terms of overall accuracy with its standard deviation, number of parameters, and FLOPs. The accuracies are reported on the val split, and the standard deviation for each method is calculated by training three models with the same architecture but different initializations. The FLOPs are calculated for one testing sample. MCAN-6 outperforms its counterparts in both accuracy and stability, while also being more efficient in parameters and computation.
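The stability measure described above can be sketched as follows. The text does not say whether the sample or the population standard deviation is used; this sketch assumes the sample version (`statistics.stdev`). The accuracy values are placeholders chosen for illustration and are not the numbers reported in Table 4.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent training runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


# Placeholder val accuracies for three runs with different initializations.
runs = [67.0, 67.2, 67.1]
avg_acc, std_acc = summarize_runs(runs)
print(f"{avg_acc:.2f} +/- {std_acc:.2f}")
```

With only three runs the estimate is coarse, so it is best read as a rough indicator of sensitivity to initialization rather than a precise variance.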
Appendix C More Visualized Results
Similar to Figure 7 in the main text, we visualize the learned attentions of two more examples from MCAN-6 in Figure 9. For each example, we visualize the attention maps from three attention units (SA(X), SA(Y), GA(X,Y)) and from two layers (1st and 6th). For each unit, we show the attention maps from 2 of the 8 parallel heads. From these results, we draw observations and explanations similar to those in the main text. The visualized attentions clearly explain the reasoning process by which MCAN predicts the correct answers. Furthermore, we find that different heads may provide complementary information that benefits VQA performance, which is similar to the ‘multi-glimpses’ strategy in existing VQA approaches [10, 33].
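To make concrete what a per-head attention map is, the sketch below computes one softmax-normalized map per head using scaled dot-product attention. It is a simplified illustration, not the released implementation: the learned query and key projections are omitted, each head simply takes a contiguous slice of the feature dimension, and the function name and toy inputs are hypothetical. For a self-attention unit the same features would be passed as both queries and keys; for GA(X,Y) the sketch assumes queries come from X and keys from Y.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention_maps(queries: Matrix, keys: Matrix, num_heads: int) -> List[Matrix]:
    """Return one (num_queries x num_keys) attention map per head."""
    d_model = len(queries[0])
    if d_model % num_heads != 0:
        raise ValueError("feature size must be divisible by the number of heads")
    d_head = d_model // num_heads
    maps = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head  # feature slice used by this head
        head_map = []
        for q in queries:
            logits = [
                sum(qi * ki for qi, ki in zip(q[lo:hi], k[lo:hi])) / math.sqrt(d_head)
                for k in keys
            ]
            head_map.append(softmax(logits))
        maps.append(head_map)
    return maps


# Toy inputs: 2 query items and 3 key items with 4-dimensional features.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
y = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
maps = head_attention_maps(x, y, num_heads=2)
```

Each row of a map sums to 1, so a row can be rendered directly as a heat map over the key items, which is how different heads can be compared side by side.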