Compact Trilinear Interaction for Visual Question Answering
In Visual Question Answering (VQA), answers are strongly correlated with the meaning of the question and the visual content. Thus, to selectively utilize image, question, and answer information, we propose a novel trilinear interaction model which simultaneously learns high-level associations between these three inputs. In addition, to overcome the interaction complexity, we introduce a multimodal tensor-based PARALIND decomposition which efficiently parameterizes the trilinear interaction between the three inputs. Moreover, knowledge distillation is applied for the first time to Free-Form Opened-Ended (FFOE) VQA. It serves not only to reduce the computational cost and required memory but also to transfer knowledge from the trilinear interaction model to a bilinear interaction model. Extensive experiments on the benchmark datasets TDIUC, VQA-2.0, and Visual7W show that the proposed compact trilinear interaction model achieves state-of-the-art results when using a single model on all three datasets. The source code is available at https://github.com/aioz-ai/ICCV19_VQA-CTI.
1 Introduction
The aim of VQA is to find the correct answer to a given question, consistent with the visual content of a given image [25, 3, 10]. There are two main variants of VQA: Free-Form Opened-Ended (FFOE) VQA and Multiple Choice (MC) VQA. In FFOE VQA, an answer is a free-form response to a given image-question pair, while in MC VQA, an answer is chosen from an answer list for a given image-question pair.
Traditional approaches to both VQA tasks mainly aim to learn joint representations between images and questions, while the answers are treated in a “passive” form, i.e., the answers are only considered as classification targets. However, an answer is expected to correlate strongly with its corresponding question-image input; hence, jointly and explicitly extracting information from these three inputs should give a highly meaningful joint representation. In this paper, we propose a novel trilinear interaction model which simultaneously learns high-level associations between all three inputs, i.e., image, question, and answer.
The main difficulty in trilinear interaction is dimensionality, which leads to expensive computation and a huge memory requirement. To tackle this challenge, we propose to use the PARALIND decomposition [6], which factorizes a large tensor into smaller tensors and thereby reduces both the computational cost and the memory usage.
The proposed trilinear interaction takes images, questions, and answers as inputs. However, answer information in FFOE VQA [1, 40, 26, 39] is only available in the training phase but not in the testing phase. To apply the trilinear interaction to FFOE VQA, we propose to use knowledge distillation to transfer knowledge from the trilinear model to a bilinear model. The distilled bilinear model only requires image-question pairs as inputs, hence it can be used in the testing phase. For MC VQA [47, 19, 27, 15, 30, 44], the answer information can be easily extracted, thanks to the given answer list, which contains a few candidate answers for each image-question pair and is available in both the training and testing phases. Thus, the proposed trilinear interaction can be directly applied to MC VQA.
To evaluate the effectiveness of the proposed model, extensive experiments are conducted on the benchmark datasets TDIUC, VQA-2.0, and Visual7W. The results show that the proposed model achieves state-of-the-art results on all three datasets.
The main contributions of the paper are as follows. (i) We propose a novel trilinear interaction model which simultaneously learns a high-level joint representation between image, question, and answer information in the VQA task. (ii) We utilize the PARALIND decomposition to deal with the dimensionality issue in trilinear interaction. (iii) To make the proposed trilinear interaction applicable to FFOE VQA, we propose to use knowledge distillation to transfer knowledge from the trilinear interaction model to a bilinear interaction model. The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 presents the proposed compact trilinear interaction (CTI). Section 4 presents the proposed models when applying CTI to FFOE VQA and MC VQA. Section 5 presents ablation studies, experimental results, and analysis.
2 Related Work
Joint embedding in Visual Question Answering. Many approaches have been proposed for VQA [18, 5, 8, 45, 20, 42, 24, 2, 28, 23, 38, 46, 29, 40]. Most of the successful methods focus on learning a joint representation between the input question and image [8, 5, 18, 45]. In state-of-the-art VQA, the features of the input image and question are usually represented in matrix form. For example, each image is described by a number of regions of interest, and each region is represented by a feature vector. A similar idea is applied to the question, i.e., a question contains a number of words and each word is represented by a feature vector. A fully expressive interaction between an image region and a word should be the outer product between their two corresponding vectors. The outer product allows a multiplicative interaction between all elements of both vectors. However, a fully bilinear interaction using the outer product between every possible pair of regions and words dramatically increases the output space. Hence, instead of directly computing the full bilinear interaction with the outer product, most works try to compress or decompose it.
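To make the growth of the output space concrete, the following minimal sketch computes one region-word outer product and counts the entries needed if every pair kept its full outer product. The function names and all sizes are hypothetical and chosen only for illustration; they are not the settings used in the paper.

```python
def outer(u, v):
    """Outer product of two vectors as a nested list (len(u) x len(v))."""
    return [[a * b for b in v] for a in u]


def full_bilinear_size(n_regions, d_region, n_words, d_word):
    """Entries needed if every region-word pair keeps its full outer product."""
    return n_regions * n_words * d_region * d_word


region = [1.0, 2.0]         # toy region feature, d_region = 2
word = [3.0, 4.0, 5.0]      # toy word feature, d_word = 3
pair = outer(region, word)  # 2 x 3 multiplicative interactions

# Hypothetical sizes (assumption), used only to show how quickly the count grows.
size = full_bilinear_size(n_regions=36, d_region=2048, n_words=12, d_word=1024)
print(len(pair), len(pair[0]), size)
```

The count is the product of all four sizes, which is why the methods discussed next compress or factorize the interaction instead of materializing it.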
In [8], the authors proposed Multimodal Compact Bilinear pooling, which is an efficient method to compress the bilinear interaction. The method works by projecting the visual and linguistic features to a higher dimensional space and then convolving both vectors efficiently by using an element-wise product in Fast Fourier Transform space. In [5], the authors proposed Multimodal Tucker Fusion, which is a tensor-based Tucker decomposition that efficiently parameterizes the bilinear interaction between visual and linguistic representations. In [45], the authors proposed Factorized Bilinear Pooling, which uses two low-rank matrices to approximate the fully bilinear interaction. Recently, in [18], the authors proposed Bilinear Attention Networks (BAN), which find bilinear attention distributions to utilize the given visual-linguistic information seamlessly. BAN also uses a low-rank approximation of the bilinear interaction for each pair of vectors from the image and the question.
There are other works that consider answer information, besides image and question information, to improve VQA performance [16, 9, 36, 14, 41, 34]. For example, in [14], the authors learned two embedding functions to transform an image-question pair and an answer into a joint embedding space. The distance between the jointly embedded image-question pair and the embedded answer is then measured to determine the output answer. In [41], the authors computed joint representations between image and question, and between image and answer. They then learned a joint embedding between the two computed representations.
In [34], the authors computed “ternary potentials” which capture the dependencies between the three inputs, i.e., image, question, and answer. For every triplet of vectors, one from each input, instead of calculating the outer products, the authors computed the sum of the element-wise product of the three vectors. This greatly reduces the computational cost, but it might not be expressive enough to fully capture the complex associations between the three vectors.
Different from previous works that mainly aim to learn joint representations from pairs of modalities [8, 5, 18, 45, 14, 41] or greatly simplify the interaction between the three modalities by using an element-wise operator [34], in this paper we propose a principled and direct approach, a trilinear interaction model, which simultaneously learns a joint representation between the three modalities. In particular, we first derive a fully trilinear interaction between the three modalities. We then rely on a decomposition approach to develop a compact model for the interaction.
Knowledge Distillation. Knowledge distillation is a general approach for transferring knowledge from a cumbersome model (the teacher model) to a lighter model (the student model) [13, 11, 33, 7, 4]. In FFOE VQA, the trilinear interaction model, which takes image, question, and answer as inputs, can only be applied in the training phase but not in the testing phase, because the answer is not available at test time. To overcome this challenge and also to reduce the computational cost, inspired by Hinton’s seminal work [13], we propose to use knowledge distillation to transfer knowledge from the trilinear model to a bilinear model.
3 Compact Trilinear Interaction (CTI)
3.1 Fully parameterized trilinear interaction
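Let $M = \{M_1, M_2, M_3\}$ be the representations of the three inputs, with $M_t \in \mathbb{R}^{n_t \times d_t}$, where $n_t$ is the number of channels of the input $M_t$ and $d_t$ is the dimension of each channel. For example, if $M_1$ is the region-based representation of an image, then $n_1$ is the number of regions and $d_1$ is the dimension of the feature representation of each region. Let $m_{t_e} \in \mathbb{R}^{1 \times d_t}$ be the $e$-th row of $M_t$, i.e., the feature representation of the $e$-th channel in $M_t$, where $t \in \{1, 2, 3\}$.
The joint representation resulting from a fully parameterized trilinear interaction over the three inputs is denoted by $z \in \mathbb{R}^{d_z}$ and is computed as follows
$$z^\top = \left(\left(\mathcal{T} \times_1 \mathrm{vec}(M_1)\right) \times_2 \mathrm{vec}(M_2)\right) \times_3 \mathrm{vec}(M_3) \qquad (1)$$
where $\mathcal{T} \in \mathbb{R}^{d_{M_1} \times d_{M_2} \times d_{M_3} \times d_z}$ is a learnable tensor; $d_{M_t} = n_t \times d_t$; $\mathrm{vec}(M_t)$ is a vectorization of $M_t$ which outputs a row vector; and the operator $\times_i$ denotes the $i$-mode tensor product.
The tensor $\mathcal{T}$ helps to learn the interaction between the three inputs through the $i$-mode product. However, learning such a large tensor is infeasible when the dimension $d_{M_t}$ of each input modality is high, which is the usual case in VQA. Thus, it is necessary to reduce the size of $\mathcal{T}$ to make the learning feasible.
Inspired by [43], we rely on the idea of a unitary attention mechanism. Specifically, let $z_p \in \mathbb{R}^{d_z}$ be the joint representation of the $p$-th triplet of channels, where each channel in the triplet comes from a different input. The representations of the channels in a triplet are $m_{1_i}$, $m_{2_j}$, $m_{3_k}$, where $i \in [1, n_1]$, $j \in [1, n_2]$, $k \in [1, n_3]$, respectively. There are $n_1 \times n_2 \times n_3$ possible triplets over the three inputs. The joint representation $z_p$ resulting from a fully parameterized trilinear interaction over the three channel representations of the $p$-th triplet is computed as
$$z_p^\top = \left(\left(\mathcal{T}_{sc} \times_1 m_{1_i}\right) \times_2 m_{2_j}\right) \times_3 m_{3_k} \qquad (2)$$
where $\mathcal{T}_{sc} \in \mathbb{R}^{d_1 \times d_2 \times d_3 \times d_z}$ is the learnable tensor between the channels in the triplet.
Following the idea of unitary attention [43], the joint representation $z$ is approximated by using the joint representations of all triplets described in (2) instead of the fully parameterized interaction over the three inputs in (1). Hence, we compute
$$z = \sum_{p=1}^{n_1 \times n_2 \times n_3} \mathcal{M}_p \, z_p \qquad (3)$$
Note that in (3), we compute a weighted sum over all possible triplets. The $p$-th triplet is associated with a scalar weight $\mathcal{M}_p$. The set of weights $\mathcal{M}_p$ is called the attention map $\mathcal{M}$, where $\mathcal{M} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$.
The attention map $\mathcal{M}$ resulting from a reduced parameterized trilinear interaction over the three inputs $M_1$, $M_2$, and $M_3$ is computed as follows
$$\mathcal{M} = \left(\left(\mathcal{T}_{\mathcal{M}} \times_1 M_1\right) \times_2 M_2\right) \times_3 M_3 \qquad (4)$$
where $\mathcal{T}_{\mathcal{M}} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ is the learnable tensor of the attention map. Substituting (2) into (3), the joint representation can be rewritten as
$$z^\top = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} \mathcal{M}_{ijk} \left(\left(\left(\mathcal{T}_{sc} \times_1 m_{1_i}\right) \times_2 m_{2_j}\right) \times_3 m_{3_k}\right) \qquad (5)$$
where $\mathcal{M}_{ijk}$ is the scalar attention weight $\mathcal{M}_p$ of the triplet $p = (i, j, k)$.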
3.2 Parameter factorization
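Although the large tensor $\mathcal{T}$ of the trilinear interaction model is replaced by two smaller tensors $\mathcal{T}_{\mathcal{M}}$ and $\mathcal{T}_{sc}$, these two tensors are still large, which makes the learning difficult. To further reduce the computational complexity, the PARALIND decomposition [6] is applied to $\mathcal{T}_{\mathcal{M}}$ and $\mathcal{T}_{sc}$. The PARALIND decomposition of the learnable tensor $\mathcal{T}_{\mathcal{M}} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ can be calculated as
$$\mathcal{T}_{\mathcal{M}} \approx \sum_{r=1}^{R} \left(\left(\mathcal{G}_r \times_1 W_{1_r}\right) \times_2 W_{2_r}\right) \times_3 W_{3_r} \qquad (6)$$
where $R$ is a slicing parameter, establishing a trade-off between the decomposition rate (which is directly related to the memory usage and the computational cost) and the performance. Each $\mathcal{G}_r$ is a smaller learnable tensor called a Tucker tensor. The number of these Tucker tensors equals $R$. The maximum value of $R$ is usually set to the greatest common divisor of $d_1$, $d_2$, and $d_3$. In our experiments, we found that gives a good trade-off between the decomposition rate and the performance.
Here, we have dimensions $d_{1_r} = d_1 / R$, $d_{2_r} = d_2 / R$, and $d_{3_r} = d_3 / R$; $W_{1_r} \in \mathbb{R}^{d_1 \times d_{1_r}}$, $W_{2_r} \in \mathbb{R}^{d_2 \times d_{2_r}}$, and $W_{3_r} \in \mathbb{R}^{d_3 \times d_{3_r}}$ are learnable factor matrices. Figure 1 illustrates the PARALIND decomposition of a tensor $\mathcal{T}_{\mathcal{M}}$.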
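The sum-of-Tucker-slices structure described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes each slice projects every input with its own factor matrix and then contracts the projections with a small core tensor, and the names `matmul`, `tucker_slice`, and `paralind_attention` are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of A (p x q) and B (q x s)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def tucker_slice(G, P1, P2, P3):
    """Contract a core G (a x b x c) with projected inputs P_t (n_t x dim_t)."""
    return [[[sum(G[a][b][c] * p1[a] * p2[b] * p3[c]
                  for a in range(len(p1))
                  for b in range(len(p2))
                  for c in range(len(p3)))
              for p3 in P3] for p2 in P2] for p1 in P1]


def paralind_attention(M1, M2, M3, slices):
    """Sum of R Tucker slices; each slice is a tuple (G_r, W1_r, W2_r, W3_r)."""
    n1, n2, n3 = len(M1), len(M2), len(M3)
    att = [[[0.0] * n3 for _ in range(n2)] for _ in range(n1)]
    for G, W1, W2, W3 in slices:
        part = tucker_slice(G, matmul(M1, W1), matmul(M2, W2), matmul(M3, W3))
        for i in range(n1):
            for j in range(n2):
                for k in range(n3):
                    att[i][j][k] += part[i][j][k]
    return att


# Toy inputs (assumption): d_1 = d_2 = d_3 = 2 and R = 2, so each slice has width 1.
M1 = [[1, 2], [3, 4]]   # n_1 = 2 channels
M2 = [[1, 0]]           # n_2 = 1 channel
M3 = [[0, 1], [1, 1]]   # n_3 = 2 channels
slices = [
    ([[[1.0]]], [[1], [0]], [[1], [0]], [[1], [0]]),
    ([[[2.0]]], [[0], [1]], [[1], [0]], [[0], [1]]),
]
att = paralind_attention(M1, M2, M3, slices)
print(att)
```

Each slice only ever touches tensors of size $d_t / R$ per mode, which is where the saving over a single $d_1 \times d_2 \times d_3$ tensor comes from.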
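The shortened form of $\mathcal{T}_{\mathcal{M}}$ in (6) can be rewritten as
$$\mathcal{T}_{\mathcal{M}} \approx \sum_{r=1}^{R} [\![\mathcal{G}_r; W_{1_r}, W_{2_r}, W_{3_r}]\!] \qquad (7)$$
Substituting (7) into (4), the attention map can be computed as
$$\mathcal{M} = \sum_{r=1}^{R} [\![\mathcal{G}_r; M_1 W_{1_r}, M_2 W_{2_r}, M_3 W_{3_r}]\!] \qquad (8)$$
Similar to $\mathcal{T}_{\mathcal{M}}$, the PARALIND decomposition is also applied to the tensor $\mathcal{T}_{sc}$ in (5) to reduce the complexity. It is worth noting that the size of $\mathcal{T}_{sc}$ directly affects the dimension of the joint representation $z$. Hence, to minimize the loss of information, we set the slicing parameter $R = 1$ and the projection dimension of the factor matrices to $d_z$, i.e., the same dimension as the joint representation $z$.
Therefore, $z$ in (5) can be calculated as
$$z^\top = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} \mathcal{M}_{ijk} \, [\![\mathcal{G}_{sc}; m_{1_i} W_{z_1}, m_{2_j} W_{z_2}, m_{3_k} W_{z_3}]\!] \qquad (10)$$
where $W_{z_1} \in \mathbb{R}^{d_1 \times d_z}$, $W_{z_2} \in \mathbb{R}^{d_2 \times d_z}$, $W_{z_3} \in \mathbb{R}^{d_3 \times d_z}$ are learnable factor matrices and $\mathcal{G}_{sc}$ is a smaller tensor (compared to $\mathcal{T}_{sc}$).
Here, it is interesting to note that $\mathcal{G}_{sc}$ in (10) has rank 1. Thus, the result of the $i$-mode tensor products in (10) can be approximated by Hadamard products without the rank-1 tensor $\mathcal{G}_{sc}$. In particular, $z$ in (10) can be computed without using $\mathcal{G}_{sc}$ as
$$z^\top = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} \mathcal{M}_{ijk} \left(m_{1_i} W_{z_1} \circ m_{2_j} W_{z_2} \circ m_{3_k} W_{z_3}\right) \qquad (11)$$
Note that $d_z$, which is the joint embedding dimension, is a user-defined parameter which makes a trade-off between the capability of the representation and the computational cost. In our experiments, we found that gives a good trade-off.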
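The Hadamard-product form of the joint representation can be sketched as follows. This is a minimal illustration under the assumption that each channel is projected to the joint dimension by its own factor matrix and that the attention map is given; the names `project` and `compact_joint` and the toy values are hypothetical.

```python
def project(m, W):
    """Row vector m (length d) times matrix W (d x d_z)."""
    return [sum(x * w for x, w in zip(m, col)) for col in zip(*W)]


def compact_joint(M1, M2, M3, att, W1, W2, W3):
    """z = sum over (i, j, k) of att[i][j][k] * (m1_i W1 * m2_j W2 * m3_k W3), elementwise."""
    P1 = [project(m, W1) for m in M1]
    P2 = [project(m, W2) for m in M2]
    P3 = [project(m, W3) for m in M3]
    d_z = len(P1[0])
    z = [0.0] * d_z
    for i, p1 in enumerate(P1):
        for j, p2 in enumerate(P2):
            for k, p3 in enumerate(P3):
                w = att[i][j][k]
                for e in range(d_z):
                    z[e] += w * p1[e] * p2[e] * p3[e]
    return z


# Toy example (assumption): identity projections and a uniform attention map.
identity = [[1, 0], [0, 1]]
M1 = [[1, 0], [0, 1]]
M2 = [[1, 1]]
M3 = [[2, 0], [0, 3]]
att = [[[1.0, 1.0]], [[1.0, 1.0]]]   # shape n_1 x n_2 x n_3 = 2 x 1 x 2
z = compact_joint(M1, M2, M3, att, identity, identity, identity)
print(z)
```

Only the three projection matrices are learned here, so the parameter count grows linearly in the joint dimension rather than with the size of a four-way core tensor.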
4 Compact Trilinear Interaction for VQA
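The input for training VQA is a set of triplets $(V, Q, A)$, in which $V \in \mathbb{R}^{v \times d_v}$ is an image representation, where $v$ is the number of regions of interest (or bounding boxes) in the image and $d_v$ is the dimension of the representation of a region; $Q \in \mathbb{R}^{q \times d_q}$ is a question representation, where $q$ is the number of hidden states and $d_q$ is the dimension of each hidden state; and $A \in \mathbb{R}^{a \times d_a}$ is an answer representation, where $a$ is the number of hidden states and $d_a$ is the dimension of each hidden state.
By applying the Compact Trilinear Interaction (CTI) to each triplet $(V, Q, A)$, we obtain the joint representation $z \in \mathbb{R}^{d_z}$. Specifically, we first compute the attention map $\mathcal{M}$ by (8) as follows
$$\mathcal{M} = \sum_{r=1}^{R} [\![\mathcal{G}_r; V W_{v_r}, Q W_{q_r}, A W_{a_r}]\!] \qquad (12)$$
Then the joint representation $z$ is computed by (11) as follows
$$z^\top = \sum_{i=1}^{v} \sum_{j=1}^{q} \sum_{k=1}^{a} \mathcal{M}_{ijk} \left(V_i W_{z_v} \circ Q_j W_{z_q} \circ A_k W_{z_a}\right) \qquad (13)$$
where $V_i$, $Q_j$, and $A_k$ denote the $i$-th, $j$-th, and $k$-th rows of $V$, $Q$, and $A$, respectively.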
4.1 Multiple Choice Visual Question Answering
To make a fair comparison with the state of the art in MC VQA [14, 41], we follow the representations used in those works. Specifically, each input question and each answer are trimmed to a maximum of 12 words and zero-padded if shorter than 12 words. Each word is then represented by a 300-D GloVe word embedding [32]. Each image is represented by a grid feature (i.e., cells; each cell is with a -D feature), extracted from the second-to-last layer of ResNet-152, which is pre-trained on ImageNet [12].
Following [41], input samples are divided into positive samples and negative samples. A positive sample, which is labelled as in binary classification, contains the image, the question, and the right answer. A negative sample, which is labelled as in binary classification, contains the image, the question, and a wrong answer. These samples are then passed through our proposed CTI to get the joint representation. The joint representation is passed through a binary classifier to get the prediction. The Binary Cross Entropy loss is used for training the proposed model. Figure 2 visualizes the proposed model when applying CTI to MC VQA.
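As a rough illustration of this binary formulation, the sketch below computes the Binary Cross Entropy for one sample and selects an answer from a candidate list. Two things are assumptions of the sketch rather than statements from the text: positive and negative samples are encoded as 1 and 0, and the candidate with the highest classifier probability is taken as the prediction at test time. The function names are hypothetical.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-12):
    """BCE for one sample; label is assumed to be 1 (right answer) or 0 (wrong answer)."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))


def pick_answer(candidate_probs):
    """Index of the candidate answer with the highest predicted probability."""
    return max(range(len(candidate_probs)), key=lambda i: candidate_probs[i])


# Hypothetical classifier outputs for the candidate answers of one image-question pair.
probs = [0.10, 0.70, 0.15, 0.05]
choice = pick_answer(probs)
loss_good = binary_cross_entropy(0.9, 1)   # confident and correct
loss_bad = binary_cross_entropy(0.1, 1)    # confident and wrong
print(choice, loss_good, loss_bad)
```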
4.2 Free-Form Opened-Ended Visual Question Answering
Unlike MC VQA, FFOE VQA treats answering as a classification problem over the set of predefined answers. Hence, the set of possible answers for each question-image pair is much larger than in MC VQA. Therefore, the model design proposed in Section 4.1, in which the model takes every possible answer from the answer list of each question-image input to compute the joint representation, incurs a high computational cost. In addition, the proposed CTI requires all three inputs to compute the joint representation. However, during testing, no answer information is available in FFOE VQA. To overcome these challenges, we propose to use Knowledge Distillation [13] to transfer the learned knowledge from a teacher model to a student model. Figure 3 visualizes the proposed design for FFOE VQA.
Our teacher model takes triplets of image, question, and right answer as inputs. Each triplet is passed through the proposed CTI to get the joint representation. The joint representation is then passed through a multiclass classifier (over the set of predefined answers) to get the prediction which is similar to . The Cross Entropy loss is used for training the teacher model. Regarding the student models, any state-of-the-art VQA model can be used. In our experiments, we use BAN2 [18] or SAN [43] as student models. The student models take image-question pairs as inputs and treat the prediction as a multiclass classification problem. The loss function for the student model is defined as
where stands for Cross Entropy loss; is the standard softmax output of the student; is the ground-truth answer labels; is a hyper-parameter for controlling the importance of each loss component; are the softened outputs of the student and the teacher using the same temperature parameter , which are computed as follows
where, for both the teacher and the student models, the logits are the predictions output by the corresponding classifiers.
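The structure of such a distillation objective can be sketched as below, in the spirit of [13]. This is not the paper's exact loss: which term the weighting hyper-parameter multiplies, and the absence of any extra temperature scaling factor, are assumptions of this sketch, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred, target, eps=1e-12):
    """CE(pred, target) = -sum_c target[c] * log(pred[c])."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))


def distillation_loss(student_logits, teacher_logits, label, alpha, temperature):
    """Weighted sum of a hard-label term and a softened teacher-matching term.

    Assumption: alpha weights the hard-label term, (1 - alpha) the teacher term.
    """
    hard = cross_entropy(softmax(student_logits), label)
    soft = cross_entropy(softmax(student_logits, temperature),
                         softmax(teacher_logits, temperature))
    return alpha * hard + (1.0 - alpha) * soft


# Toy logits over three candidate answers (hypothetical values).
student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.0, -2.0]
label = [1.0, 0.0, 0.0]
loss = distillation_loss(student, teacher, label, alpha=0.5, temperature=3.0)
print(loss)
```

A temperature above 1 flattens both distributions, so the student also receives signal from the relative scores the teacher assigns to the non-target answers.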
Following the current state of the art in FFOE VQA , for the image representation we use object detection-based features from an FPN detector [22] (ResNet152 backbone), in which the maximum number of detected bounding boxes is set to . For the question and answer representations, we trim each question and answer to a maximum of 12 words and zero-pad it if shorter than 12 words. Each word is then represented by a 600-D vector that is a concatenation of the 300-D GloVe word embedding [32] and the augmenting embedding from training data as . In other words, a question has a representation of size . The same holds for the answer.
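The trimming and zero-padding step can be sketched as follows. The function name, the use of integer token ids, and the choice to pad on the right are assumptions of this sketch; the text only states that inputs are trimmed to 12 words and zero-padded when shorter.

```python
def trim_and_pad(token_ids, max_len=12, pad_id=0):
    """Keep at most max_len tokens, then right-pad with pad_id (padding side is an assumption)."""
    trimmed = list(token_ids[:max_len])
    return trimmed + [pad_id] * (max_len - len(trimmed))


short_question = trim_and_pad([5, 6, 7])          # three tokens, padded to 12
long_question = trim_and_pad(list(range(1, 21)))  # twenty tokens, trimmed to 12
print(short_question)
print(long_question)
```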
5 Experiments
5.1 Dataset and evaluation protocol
Dataset. We conduct the experiments on three benchmark VQA datasets: Visual7W [47] for MC VQA, and VQA-2.0 [10] and TDIUC [17] for FFOE VQA. For the ablation study, we use the training set to train and the validation set to evaluate on all of these datasets.
Implementation details. Our CTI is implemented using PyTorch [31]. The experiments are conducted on NVIDIA Titan V GPUs with 12GB RAM. In all experiments, the learning rate is set to . Batch size is set to for training MC VQA and for training FFOE VQA. When training both the MC VQA model (Section 4.1) and the FFOE VQA model (Section 4.2), all components except the image representation extraction are trained end-to-end. The temperature parameter in (15) is set to . The dimension of the joint representation is set at for both MC VQA and FFOE VQA.
Evaluation Metrics. We follow the literature [3, 17, 47], in which the evaluation metrics for each VQA task are different. For FFOE VQA, the single accuracy, which is the standard VQA accuracy (Acc) [3], is applied to both the TDIUC and VQA-2.0 datasets. In addition, due to the imbalance in the question types of the TDIUC dataset, following [17], we also report four other metrics that compensate for the skewed question-type distribution. They are Arithmetic MPT (Ari), Arithmetic Norm-MPT (Ari-N), Harmonic MPT (Har), and Harmonic Norm-MPT (Har-N). For MC VQA, we follow the evaluation metric (Acc-MC) proposed by [47], in which the performance is measured by the portion of correct answers selected by the VQA model from the candidate answer set.
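As an illustration of why mean-per-type (MPT) metrics compensate for a skewed question-type distribution, the sketch below computes the arithmetic and harmonic means of per-type accuracies on a toy set of records. The function names and the data are hypothetical, and the normalized variants (Ari-N, Har-N) are not covered here.

```python
def per_type_accuracy(records):
    """records: iterable of (question_type, is_correct) pairs."""
    totals, hits = {}, {}
    for qtype, ok in records:
        totals[qtype] = totals.get(qtype, 0) + 1
        hits[qtype] = hits.get(qtype, 0) + (1 if ok else 0)
    return {q: hits[q] / totals[q] for q in totals}


def arithmetic_mpt(per_type):
    """Arithmetic mean of the per-type accuracies."""
    return sum(per_type.values()) / len(per_type)


def harmonic_mpt(per_type):
    """Harmonic mean of the per-type accuracies (0 if any type has zero accuracy)."""
    if any(acc == 0 for acc in per_type.values()):
        return 0.0
    return len(per_type) / sum(1.0 / acc for acc in per_type.values())


# Toy records (hypothetical): a frequent type with 50% accuracy and a rare one with 100%.
records = [("color", True), ("color", False), ("color", True),
           ("color", False), ("count", True)]
per_type = per_type_accuracy(records)
ari = arithmetic_mpt(per_type)
har = harmonic_mpt(per_type)
print(per_type, ari, har)
```

The harmonic mean is pulled toward the weakest type, so a model cannot score well on it by doing well only on the frequent question types.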
5.2 Ablation study
The effectiveness of CTI on FFOE VQA. We compare our distilled BAN2 (BAN2-CTI) and distilled SAN (SAN-CTI) student models to the state-of-the-art baselines BAN2 [18] and SAN [43]. Table 1 presents a comprehensive evaluation on five different metrics on TDIUC. Overall, across all metrics, our BAN2-CTI and SAN-CTI outperform the corresponding baselines by a noticeable margin. These results confirm the effectiveness of our proposed CTI for learning the joint representation. In addition, the proposed teacher model (Figure 3) is also effective: it successfully transfers useful learned knowledge to the student models. Note that in Table 1, the “Absurd” question category indicates the cases in which the input questions are irrelevant to the image contents. Thus, the answers are always “does not apply”, i.e., “no answer”. Using these meaningless answers when training the teacher has a negative effect on learning the joint representation, and hence reduces the model capacity. If the “Absurd” category is not taken into account, the proposed model achieves larger improvements over the baselines.
|Util & Aff||36.3||24.0||31.0||26.3|
Table 2 presents detailed performance with the Acc metric over each question category of TDIUC when all categories, including “Absurd”, are used for training. The results show that we achieve the best results on all question categories except “Absurd”. We note that in real applications, the “Absurd” question problem may be mitigated in some cases by a simple trick, i.e., asking a “presence question” before asking the main question. For example, suppose we have an image with no human but the main question is “Are the people wearing hats?”, i.e., an “Absurd” question. By first asking a “presence question” such as “Are there any people in the picture?”, we can confirm the presence of a human in the considered image before asking the main question.
Table 3 presents comparative results between our distilled student models and the two baselines BAN2 and SAN with the Acc metric on VQA-2.0. Although our proposal outperforms the baselines, the margin is small. This is understandable because the VQA-2.0 dataset has a large number of questions whose answers are “yes/no” or contain only one word (i.e., answers to “number” question types). These answers carry little semantic meaning, which prevents the proposed trilinear interaction from showing its full benefit.
The effectiveness of CTI on MC VQA. We again use the state-of-the-art BAN2 [18] and SAN [43] as baselines and conduct experiments on the Visual7W dataset. In MC VQA, in both training and testing, each image-question pair has a corresponding answer list that contains four answers. To make a fair comparison, we try different pair combinations over the three modalities (image, question, and answer) for the baselines BAN2 and SAN. Similar to [41], we find that the following combination gives the best results for the baselines. Using BAN2 (or SAN), we first compute the joint representation between image and question, and the joint representation between image and answer. Then, we concatenate the two computed representations to get the joint “image-question-answer” representation, and pass it through a VQA classifier with the cross entropy loss to train the baseline.
|Ref models||Visual7W validation set|
|Acc-MC||Number of parameters|
Table 4 presents comparative results on Visual7W with the Acc-MC metric. The results show that our proposed model outperforms the baselines by a noticeable margin. These results confirm that the joint representation learned by the proposed trilinear interaction achieves better performance than the combination of joint representations computed by BAN (or SAN) over pairs of modalities. In addition, in Table 4 we also provide the total number of parameters of our proposed MC VQA model with CTI (Figure 2) and of BAN2 and SAN. The results show that our model requires less memory than those baselines. That is, the proposed MC VQA model with CTI not only outperforms the baselines in terms of accuracy, but is also more efficient than those baselines in terms of memory usage. Figure 4 visualizes the attention map produced by CTI for an example image-question-answer triplet. The attention map is computed by (12).
5.3 Comparison with the state of the art
To further evaluate the effectiveness of CTI, we conduct a detailed comparison with the current state of the art. For FFOE VQA, we compare our proposal with recent state-of-the-art methods on the TDIUC and VQA-2.0 datasets, including SAN [43], QTA [35], BAN2 [18], Bottom-up [37], MCB [8], and RAU [29]. For MC VQA, we compare with the state-of-the-art methods on the Visual7W dataset, including BAN2 [18], SAN [43], MLP [16], MCB [8], STL [41], and fPMC [14]. It is worth noting that, depending on the task (FFOE VQA or MC VQA), we use different representations for images and questions, as described in Section 4. This ensures a fair comparison with the compared methods.
Regarding FFOE VQA, Tables 3 and 5 show comparative results on VQA-2.0 and TDIUC, respectively. Specifically, Table 5 shows that our distilled student BAN2-CTI outperforms all compared methods over all metrics by a large margin, i.e., our model outperforms the current state-of-the-art QTA [35] on TDIUC by and on Ari and Har metrics, respectively. The results confirm that the proposed trilinear interaction has learned informative representations from the three inputs and that the learned information is effectively transferred to the student models by distillation.
|Visual7W test set||MLP ||67.1|
Regarding MC VQA, Table 6 shows that the proposed model (denoted as CTI in Table 6) outperforms the compared methods by a noticeable margin. Our model outperforms the current state-of-the-art STL [41]. Again, this validates the effectiveness of the proposed joint representation learning, which precisely and simultaneously learns interactions between the three inputs. We note that when comparing with other methods on Visual7W, for the image representations we used the grid features extracted from ResNet-152 [12] for a fair comparison. Our proposed model can achieve further improvements by using the object detection-based features used in FFOE VQA. With the new features, our model, denoted as CTIwBoxes in Table 6, achieves accuracy with the Acc-MC metric, which improves over the current state-of-the-art STL [41].
5.4 Further analysis
The effectiveness of PARALIND decomposition. In this section, we compute the decomposition rate of PARALIND. For a full interaction between the three inputs, using (1), we would need to learn billions parameters which is infeasible in practice. By using the PARALIND decomposition presented in Section 3 with the provided settings, i.e., the number of slicing and the dimension of the joint representation , the number of parameters that need to learn is only millions. In other words, we achieve a decomposition rate .
In BAN [18], each input contains an image representation $V$ and a question representation $Q$. The trilinear interaction model can be modified to adapt to these two inputs. The joint representation in (1) can be adapted for two inputs as
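The kind of counting behind a decomposition rate can be sketched as follows. All sizes below are toy values chosen for illustration and are not the paper's settings. The layout of each slice (a core of size $d_1/R \times d_2/R \times d_3/R$ plus three factor matrices of size $d_t \times d_t/R$) and the three $d_t \times d_z$ projection matrices of the Hadamard form are assumptions of this sketch, and the function names are hypothetical.

```python
def full_trilinear_params(n, d, d_z):
    """Entries of the full tensor in (1): product over t of (n_t * d_t), times d_z."""
    count = d_z
    for n_t, d_t in zip(n, d):
        count *= n_t * d_t
    return count


def paralind_attention_params(d, R):
    """Parameters of R Tucker slices for the attention tensor (assumed layout)."""
    core = 1
    factors = 0
    for d_t in d:
        assert d_t % R == 0, "R must divide every d_t"
        core *= d_t // R
        factors += d_t * (d_t // R)
    return R * (core + factors)


def compact_joint_params(d, d_z):
    """Three factor matrices of size d_t x d_z used by the Hadamard form."""
    return sum(d_t * d_z for d_t in d)


# Hypothetical toy sizes, NOT the paper's settings.
n, d, d_z, R = (4, 3, 3), (8, 8, 8), 16, 4
full = full_trilinear_params(n, d, d_z)
compact = paralind_attention_params(d, R) + compact_joint_params(d, d_z)
print(full, compact, full / compact)
```

Even at these toy sizes the factorized model needs a small fraction of the parameters of the full tensor, and the gap widens rapidly as the input dimensions grow.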
where is a learnable tensor; is the vectorization of and is the vectorization of which output row vectors; ; .
where and are learnable factor matrices; is an attention weight of attention map which can be computed from (12) as
where and are learnable factor matrices; ; ; each is a learnable Tucker tensor.
where is the element of the joint representation ; and are column in factor matrices and . Note that in (19), our attention map results from the PARALIND decomposition, while in BAN [18], the attention map is computed by bilinear pooling.
6 Conclusion
We propose a novel compact trilinear interaction which simultaneously learns high-level associations between image, question, and answer in both MC VQA and FFOE VQA. In addition, knowledge distillation is applied for the first time to FFOE VQA to overcome the computational complexity and memory issues of the interaction. Extensive experimental results show that the proposed models achieve state-of-the-art results on three benchmark datasets.
References
[1] (2018) Don’t just assume; look and answer: overcoming priors for visual question answering. In CVPR.
[2] (2018) Bottom-up and top-down attention for image captioning and VQA. In CVPR.
[3] (2015) VQA: Visual Question Answering. In ICCV.
[4] (2014) Do deep nets really need to be deep? In NIPS.
[5] (2017) MUTAN: multimodal Tucker fusion for visual question answering. In ICCV.
[6] (2009) Modeling multi-way data with linearly dependent loadings. Journal of Chemometrics: A Journal of the Chemometrics Society, pp. 324–340.
[7] (2017) Learning efficient object detection models with knowledge distillation. In NIPS.
[8] (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP.
[9] (2017) VQS: linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation. In ICCV.
[10] (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In CVPR.
[11] (2016) Cross modal distillation for supervision transfer. In CVPR.
[12] (2016) Deep residual learning for image recognition. In CVPR.
[13] (2014) Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop.
[14] (2018) Learning answer embeddings for visual question answering. In CVPR.
[15] (2016) A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485.
[16] (2016) Revisiting visual question answering baselines. In ECCV.
[17] (2017) An analysis of visual question answering algorithms. In ICCV.
[18] (2018) Bilinear attention networks. In NIPS.
[19] (2016) Multimodal residual learning for visual QA. In NIPS.
[20] (2017) Hadamard product for low-rank bilinear pooling. In ICLR.
[21] (2009) Tensor decompositions and applications. SIAM Review, pp. 455–500.
[22] (2017) Feature pyramid networks for object detection. In CVPR.
[23] (2016) Hierarchical question-image co-attention for visual question answering. In NIPS.
[24] (2018) Visual question answering with memory-augmented networks. In CVPR.
[25] (2014) Towards a visual Turing challenge. In NIPS Workshop.
[26] (2015) Ask your neurons: a neural-based approach to answering questions about images. In ICCV, pp. 1–9.
[27] (2017) Dual attention networks for multimodal reasoning and matching. In CVPR.
[28] (2018) Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In CVPR.
[29] (2016) Training recurrent answering units with joint loss minimization for VQA. arXiv preprint arXiv:1606.03647.
[30] (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR.
[31] (2017) Automatic differentiation in PyTorch. In NIPS 2017 Workshop.
[32] (2014) GloVe: global vectors for word representation. In EMNLP.
[33] (2015) FitNets: hints for thin deep nets. In ICLR.
[34] (2017) High-order attention models for visual question answering. In NIPS.
[35] (2018) Question type guided attention in visual question answering. In ECCV.
[36] (2016) Where to look: focus regions for visual question answering. In CVPR.
[37] (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In CVPR.
[38] (2016) Zero-shot visual question answering. arXiv preprint arXiv:1611.05546.
[39] (2017) Graph-structured representations for visual question answering. In CVPR.
[40] (2018) Visual question answering as a meta learning task. In ECCV.
[41] (2018) Structured triplet learning with POS-tag guided attention for visual question answering. In WACV.
[42] (2016) Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In ECCV.
[43] (2016) Stacked attention networks for image question answering. In CVPR.
[44] (2017) Multi-level attention networks for visual question answering. In CVPR.
[45] (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In ICCV.
[46] (2015) Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167.
[47] (2016) Visual7W: grounded question answering in images. In CVPR.