VQABQ: Visual Question Answering by Basic Questions
Given an image and a natural language question about it, our method outputs a text-based answer to the question, a task known as Visual Question Answering (VQA). Our algorithm has two main modules. The first module takes the question as input and outputs the basic questions of that main question. The second module takes the main question, the image and these basic questions as input and outputs the text-based answer to the main question. We formulate basic question generation as a LASSO optimization problem, and also propose a criterion for how to exploit these basic questions to help answer the main question. Our method is evaluated on the challenging VQA dataset [1] and yields state-of-the-art accuracy, 60.34% in the open-ended task.
1 Introduction
Visual Question Answering (VQA) is a young and challenging research field that can help machines achieve one of the ultimate goals of computer vision, holistic scene understanding [34]. VQA is a computer vision task: a system is given an arbitrary text-based question about an image, and it should output a text-based answer to that question. The given question may involve many sub-problems in computer vision, e.g.,
Scene classification - Is it a rainy day?
Object recognition - What is on the desk?
Attribute classification - What color is the ground?
Counting - How many people are in the room?
Object detection - Are there any apples in the image?
Activity recognition - What kind of exercise is the man doing?
Beyond these, many more complicated questions can be asked in real life. So, in some sense, VQA can be considered an important basic research problem in computer vision. The sub-problems above suggest that holistic scene understanding in one step is probably too difficult, so we try to divide the holistic scene understanding task into many computer vision sub-tasks. This task-dividing concept inspires us to do Visual Question Answering by Basic Questions (VQABQ), illustrated by Figure 1. That is, in VQA, we can divide the query question into some basic questions and then exploit these basic questions to help answer the main query question.
Since 2014, there has been a lot of progress in designing systems with the VQA ability [17, 1, 18, 24, 16, 6]. Most of these works can be considered visual-attention VQA works, because they put much effort into the image part but not the text part. Recently, however, some works [14, 12] have put more effort into the question part. In [12], the authors proposed a Question Representation Update (QRU) mechanism that updates the original query question to increase the accuracy of the VQA algorithm. VQA depends strongly on both the image and the question, so we should pay equal attention to both, not only one of them. In practice, when people have an image and a question about it, they usually notice the keywords of the question and then focus on the parts of the image related to the question to give the answer. So, paying equal attention to both parts is a more reasonable way to do VQA. In [14], the authors proposed a Co-Attention mechanism, jointly utilizing information about visual and question attention, for VQA and achieved state-of-the-art accuracy.
The Co-Attention mechanism inspires us to build part of our VQABQ model, illustrated by Figure 2. The VQABQ model has two main modules, the basic question generation module (Module 1) and the co-attention visual question answering module (Module 2). We take the query question, called the main question (MQ), encoded by Skip-Thought Vectors [11], as the input of Module 1. In Module 1, we encode all of the questions from the training and validation sets of the VQA [1] dataset, also by Skip-Thought Vectors, as a 4800 by 215623 basic question (BQ) matrix, and then solve the LASSO optimization problem with the MQ to find the 3 BQ of the MQ. These BQ are the output of Module 1. We then take the MQ, the BQ and the given image as the input of Module 2, the VQA module with the co-attention mechanism, which outputs the final answer to the MQ. We claim that the BQ can help Module 2 get the correct answer and thus increase the VQA accuracy. In this work, our main contributions are summarized below:
We propose a method to generate the basic questions of the main question and to utilize these basic questions, with a proper criterion, to help answer the main question in VQA.
We also propose a new basic question dataset generated by our basic question generation algorithm.
The rest of this paper is organized as follows. We first discuss the motivation for this work in Section 2. In Section 3, we review the related work, and Section 4 briefly introduces the proposed basic question dataset. We discuss the detailed methodology in Section 5. Finally, the experimental results are presented in Section 6.
2 Motivation
Two important reasons motivate us to do Visual Question Answering by Basic Questions (VQABQ). First, most recent VQA works emphasize the image part, the visual features, and put less effort into the question part, the text features. However, image and question features are both important for VQA. If we focus on only one of them, we probably cannot get good VQA performance in the near future. Therefore, we should put effort into both at the same time. In [14], the authors proposed a novel co-attention mechanism that jointly performs image-guided question attention and question-guided image attention for VQA. [14] also proposed a hierarchical architecture to represent the question and to construct image-question co-attention maps at the word level, phrase level and question level. These co-attended features are then combined recursively across the word, phrase and question levels to predict the final answer to the query question based on the input image. [12] is another recent work focusing on the text-based question part, the text feature. In [12], the authors presented a reasoning network that updates the question representation iteratively each time the question interacts with the image content. Both [14] and [12] yield better performance than previous works by putting more effort into the question part.
Second, in daily life, when people try to solve a difficult problem, they usually divide it into small basic problems that are easier than the original one. So, why not apply this dividing concept to the input question of VQA? If we can divide the input main question into some basic questions, it will help the current VQA algorithm achieve a higher probability of answering the main question correctly.
Thus, our goal in this paper is to generate the basic questions of the input question and then exploit these questions, together with the given image, to help the VQA algorithm answer the input question correctly. Note that the generated basic questions can be considered extra useful information for the VQA algorithm.
3 Related Work
Recently, many papers [1, 26, 2, 9, 15, 25, 36, 29] have proposed methods to solve the VQA problem. Our method draws on several areas of machine learning, natural language processing (NLP) and computer vision. In the following, we discuss recent works related to our approach to the VQA problem.
Sequence modeling by Recurrent Neural Networks.
Recurrent Neural Networks (RNN) can handle sequences of variable length. Long Short Term Memory (LSTM) [7] is a particular variant of RNN that has been applied successfully to natural language tasks such as machine translation [27, 3]. In , the authors exploit RNN and Convolutional Neural Network (CNN) to build a question generation algorithm, but the generated question sometimes has invalid grammar. The input in  is the concatenation of each word embedding with the same feature vector of the image.  encodes the input question sentence by LSTM and joins the image feature to the final output.  groups the neighbouring word and image features by convolution. In [21], the question is encoded by a Gated Recurrent Unit (GRU) , which is similar to LSTM, and the authors also introduce a dynamic parameter layer in the CNN whose weights are adaptively predicted by the encoded question feature.
In order to analyze the relationship among words, phrases and sentences, several works, such as [23, 11, 20], proposed methods for mapping text into a vector space. Once we have the vector representation of text, we can use vector analysis to study the relationship among texts. [23, 20] map words to a vector space such that words sharing common contexts in the corpus have encoded vectors close to each other. In [11], the authors propose a framework of encoder-decoder models, called skip-thoughts. In this model, the authors exploit an RNN encoder with GRU activations  and an RNN decoder with a conditional GRU . Because the skip-thoughts model emphasizes whole-sentence encoding, in our work we encode whole question sentences into a vector space with the skip-thoughts model and use these skip-thought vectors for further analysis of the question sentences.
In some sense, VQA is related to image captioning [32, 10, 28, 5]. [5] uses a language model to combine a set of possible words detected in several regions of the image and generate an image description. In [28], the authors use a CNN to extract high-level image features and treat them as the first input of the recurrent network that generates the image caption. [32] proposes an algorithm that generates one word at a time by paying attention to the local image regions related to the currently predicted word. In [10], the deep neural network learns to embed language and visual information into a common multi-modal space. However, current image captioning algorithms can only generate a rough description of an image, and there is no widely accepted metric to evaluate the quality of an image caption , even though BLEU [22] can be used for this purpose.
Several VQA models are able to focus on specific image regions related to the input question by integrating an image attention mechanism [26, 2, 33, 12]. In [12], in the pooling step, the authors exploit an image attention mechanism to help determine the relevance between the original questions and the updated ones. Before [14], no work had applied a language attention mechanism to VQA, although NLP researchers had already modeled language attention. In [14], the authors propose a co-attention mechanism that jointly performs language attention and image attention. Because both question and image information are important in VQA, we introduce the co-attention mechanism into our VQABQ model.
4 Basic Question Dataset
We propose a new dataset, called Basic Question Dataset (BQD), generated by our basic question generation algorithm. BQD is the first basic question dataset. The dataset format is {image, MQ, 3 (BQ + corresponding similarity score)}. All of our images are from the testing images of the MS COCO dataset [13]; the MQ (main questions) are from the testing questions of the open-ended VQA dataset [1]; the BQ (basic questions) are from the training and validation questions of the open-ended VQA dataset [1]; and the corresponding similarity score of each BQ is generated by our basic question generation method, described in Section 5. We also apply the same procedure to the multiple-choice questions in the VQA dataset [1]. Note that we remove repeated questions from the VQA dataset, so the total number of questions is slightly smaller than in the VQA dataset [1]. BQD contains 81434 images, 244302 MQ and 732906 (BQ + corresponding similarity score) pairs. We also exploit BQD to do VQA and achieve accuracy competitive with the state of the art.
5 Methodology
In Section 5, we mainly discuss how to encode questions and generate BQ, and why we exploit the Co-Attention Mechanism VQA algorithm [14] to answer the query question. The overall architecture of our VQABQ model is shown in Figure 2. The model has two main parts, Module 1 and Module 2. Module 1 takes the encoded MQ as input and uses the matrix of encoded BQ to output the BQ of the query question. Module 2 is a VQA algorithm with the Co-Attention Mechanism [14]; it takes the output of Module 1, the MQ and the given image as input and outputs the final answer to the MQ. The detailed architecture of Module 1 is also shown in Figure 2.
5.1 Question encoding
There are many popular text encoders, such as Word2Vec [20], GloVe [23] and Skip-Thoughts [11]. Among these encoders, Skip-Thoughts captures not only word-to-word meaning but also the semantic meaning of the whole sentence. So, we choose Skip-Thoughts as our question encoding method. The Skip-Thoughts model uses an RNN encoder with GRU  activations, and we use this encoder to map an English sentence into a vector. GRU has been shown to perform as well as LSTM [7] on sequence modeling applications while being conceptually simpler, because GRU units have only 2 gates and do not need a cell.
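Question encoder. Let $w_i^1, \dots, w_i^N$ be the words in question $q_i$, where $N$ is the total number of words in $q_i$. Note that $w_i^t$ denotes the $t$-th word of $q_i$ and $x_i^t$ denotes its word embedding. At each time step $t$, the question encoder generates a hidden state $h_i^t$, which can be considered the representation of the sequence $w_i^1, \dots, w_i^t$. So, the hidden state $h_i^N$ can represent the whole question. For convenience, here we drop the index $i$ and iterate the following sequential equations to encode a question:
$$r^t = \sigma(W_r x^t + U_r h^{t-1})$$
$$z^t = \sigma(W_z x^t + U_z h^{t-1})$$
$$\bar{h}^t = \tanh(W x^t + U(r^t \odot h^{t-1}))$$
$$h^t = (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t$$
where $W_r$, $U_r$, $W_z$, $U_z$, $W$ and $U$ are the matrices of weight parameters. $\bar{h}^t$ is the state update at time step $t$, $r^t$ is the reset gate, $\odot$ denotes an element-wise product and $z^t$ is the update gate. Both gates take values between zero and one.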
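As an illustration of the recurrence described above, the following is a minimal NumPy sketch of a GRU question encoder. It is not the Skip-Thoughts implementation: the weights are random rather than trained, the toy dimensions (8-dimensional embeddings, 16-dimensional hidden state) and the helper names `gru_encode` and `init_params` are assumptions made only for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_encode(X, params, hidden_size):
    """Run a GRU over word embeddings X (shape N x d) and return the last hidden state."""
    W_r, U_r, W_z, U_z, W, U = params
    h = np.zeros(hidden_size)
    for x_t in X:
        r = sigmoid(W_r @ x_t + U_r @ h)        # reset gate, values in (0, 1)
        z = sigmoid(W_z @ x_t + U_z @ h)        # update gate, values in (0, 1)
        h_bar = np.tanh(W @ x_t + U @ (r * h))  # proposed state update
        h = (1.0 - z) * h + z * h_bar           # blend old state and proposal
    return h

def init_params(embed_dim, hidden_size, seed=0):
    """Hypothetical random weights; a real encoder would load trained ones."""
    rng = np.random.default_rng(seed)
    shapes = [(hidden_size, embed_dim), (hidden_size, hidden_size)] * 3
    return [0.1 * rng.standard_normal(s) for s in shapes]

embed_dim, hidden_size = 8, 16
params = init_params(embed_dim, hidden_size)
# Five toy "word embeddings" standing in for an encoded question.
question = np.random.default_rng(1).standard_normal((5, embed_dim))
h_N = gru_encode(question, params, hidden_size)  # vector representing the whole question
```

The final hidden state plays the role of the question vector; in the paper's setting this vector has 4800 dimensions.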
5.2 Problem Formulation
Our idea is to generate BQ for the MQ while using only the minimum number of BQ needed to represent the MQ, so it is appropriate to model the problem as the following optimization problem:
$$\min_{x} \; \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$$
where $A$ is the matrix of encoded BQ, $b$ is the encoded MQ and $\lambda$ is a parameter of the regularization term.
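To make the formulation concrete, here is a small sketch that solves a LASSO problem of the form $\frac{1}{2}\|Ax-b\|_2^2 + \lambda\|x\|_1$ with the iterative shrinkage-thresholding algorithm (ISTA). The symbols `A` (encoded BQ matrix), `b` (encoded MQ) and `lam` (regularization parameter) are naming assumptions for this example, the data are synthetic, and ISTA is only one possible solver; the source does not say which solver the authors used.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(A, b, lam, n_iter=500):
    """Approximately minimize 0.5 * ||A x - b||_2^2 + lam * ||x||_1 with ISTA."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                   # gradient of the least-squares term
        x = soft_threshold(x - grad / L, lam / L)  # gradient step, then shrinkage
    return x

def objective(A, b, x, lam):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Synthetic stand-in: 50 "basic questions" encoded as 20-dimensional columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 42]] = [1.0, 0.8, 0.5]  # the "main question" mixes three columns
b = A @ x_true
x_hat = lasso_ista(A, b, lam=0.1)
```

The $\ell_1$ term pushes most entries of the solution toward zero, which matches the stated goal of representing the MQ with as few BQ as possible.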
5.3 Basic Question Generation
We now describe how to generate the BQ of a query question, illustrated by Figure 2. Note that we only describe the open-ended case, because the multiple-choice case is handled in the same way. Following Section 5.1, we encode all the questions from the training and validation sets of the VQA dataset [1] by Skip-Thought Vectors, which gives us the matrix of encoded basic questions. Each column of the matrix is the 4800 by 1 vector representation of a basic question, and there are 215623 columns. That is, the BQ matrix, called $A$, is 4800 by 215623. We also encode the query question by Skip-Thought Vectors as a 4800 by 1 column vector, called $b$. Now we can solve the optimization problem from Section 5.2 to get the solution, $x$. We consider the elements of the solution vector $x$ as the weights of the corresponding BQ in the BQ matrix $A$: the first element of $x$ corresponds to the first column, i.e. the first BQ, of $A$. Then, we rank all the weights in $x$ and pick the 3 largest weights, with their corresponding BQ, as the BQ of the query question. Intuitively, because BQ are important to the MQ, the weights of the BQ can also be considered importance scores, and a BQ with a larger weight is more important to the MQ. Finally, we find the BQ of all 142093 testing questions from the VQA dataset and collect them together, in the format {image, MQ, 3 (BQ + corresponding similarity score)}, as the BQD of Section 4.
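The ranking step described above can be sketched as follows. The weight vector and question strings are toy values invented for illustration, and `top_k_basic_questions` is a hypothetical helper name, not part of the authors' code.

```python
import numpy as np

def top_k_basic_questions(x, questions, k=3):
    """Return the k questions with the largest weights in x, sorted by descending weight."""
    order = np.argsort(x)[::-1][:k]  # indices of the k largest weights
    return [(questions[i], float(x[i])) for i in order]

# Toy stand-ins: one weight per column (basic question) of the BQ matrix.
questions = [
    "Is it raining?",
    "What color is the sky?",
    "How many people are there?",
    "Is this outdoors?",
    "What animal is this?",
]
x = np.array([0.05, 0.0, 0.31, 0.72, 0.18])
top3 = top_k_basic_questions(x, questions)
```

Here `top3` holds the three (question, weight) pairs that would be stored as the BQ and similarity scores for one MQ.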
5.4 Basic Question Concatenation
In this section, we propose a criterion for using these BQ. In BQD, each MQ has three corresponding BQ with scores, in the following format: {MQ, (BQ1, score1), (BQ2, score2), (BQ3, score3)}. These scores are all between 0 and 1 and are ordered as
$$score1 \geq score2 \geq score3,$$
and we define 3 thresholds, $s1$, $s2$ and $s3$. We also compute 3 averages (avg) and 3 standard deviations (std) of score1, score2/score1 and score3/score2, respectively, and then use avg ± std, referring to Table 3, as the initial guess of proper thresholds. The BQ utilization process is given in Table 1. The BQ concatenation algorithm is discussed in detail in Section 6.4.
Table 1: Basic Question Concatenation Algorithm.
5.5 Co-Attention Mechanism
There are two types of Co-Attention Mechanism [14], Parallel and Alternating. In our VQABQ model, we only use the VQA algorithm with the Alternating Co-Attention Mechanism as our VQA module, referring to Figure 2, because in [14] the Alternating Co-Attention VQA module achieves higher accuracy than the Parallel one. Moreover, we want to compare against the more accurate of the two VQA methods in [14], the Alternating one. The Alternating Co-Attention Mechanism sequentially alternates between generating question attention and image attention. That is, this mechanism consists of three main steps:
First, the input question is summarized into a single vector $q$.
Second, attend to the given image based on $q$.
Third, attend to the question based on the attended image feature.
We define $\hat{x} = \mathcal{A}(X; g)$ as an attention operator, which is a function of $X$ and $g$. This operator takes the question (or image) feature $X$ and an attention guider $g$ derived from the image (or question) as inputs, and outputs the attended question (or image) vector. The operation can be written as the following steps:
$$H = \tanh(W_x X + (W_g g)\mathbb{1}^T)$$
$$a^x = \mathrm{softmax}(w_{hx}^T H)$$
$$\hat{x} = \sum_i a_i^x x_i$$
where $a^x$ is the attention weight of feature $X$, $\mathbb{1}$ is a vector whose elements are all equal to 1, and $W_x$, $W_g$ and $w_{hx}$ are matrices of parameters.
Concretely, at the first step of the Alternating Co-Attention Mechanism, $g$ is $0$ and $X = Q$. At the second step, $X = V$, where $V$ is the image features, and the guider $g$ is the intermediate attended question feature $\hat{s}$ from the first step. At the final step, the attended image feature $\hat{v}$ is used as the guider to attend to the question again. That is, $X = Q$ and $g = \hat{v}$.
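The three alternating steps can be sketched in NumPy as follows. This is an illustrative sketch, not the implementation used in the experiments: the parameters are random, each step gets its own parameter set (an assumption), question and image features are assumed to share one dimension `d`, and the names `attend`, `make_params` and `alternating_co_attention` are hypothetical.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def attend(X, g, W_x, W_g, w_hx):
    """Attention operator: X is a (d, n) feature matrix (one feature per column), g a (d,) guider."""
    H = np.tanh(W_x @ X + (W_g @ g)[:, None])  # (k, n); guider term added to every column
    a = softmax(w_hx @ H)                      # (n,) attention weights that sum to 1
    return X @ a, a                            # attended feature (d,) and the weights

def make_params(d, k, rng):
    """Random stand-ins for the learned parameters of one attention step."""
    return (0.1 * rng.standard_normal((k, d)),
            0.1 * rng.standard_normal((k, d)),
            0.1 * rng.standard_normal(k))

def alternating_co_attention(Q, V, p1, p2, p3):
    d = Q.shape[0]
    s_hat, _ = attend(Q, np.zeros(d), *p1)  # step 1: summarize the question (guider is 0)
    v_hat, a_v = attend(V, s_hat, *p2)      # step 2: attend to the image, guided by s_hat
    q_hat, a_q = attend(Q, v_hat, *p3)      # step 3: attend to the question, guided by v_hat
    return v_hat, q_hat, a_v, a_q

rng = np.random.default_rng(0)
d, k = 6, 4
Q = rng.standard_normal((d, 5))  # 5 toy question-word features
V = rng.standard_normal((d, 7))  # 7 toy image-region features
p1, p2, p3 = (make_params(d, k, rng) for _ in range(3))
v_hat, q_hat, a_v, a_q = alternating_co_attention(Q, V, p1, p2, p3)
```

With a zero guider in step 1 the operator reduces to plain self-attention over the question, which is why its output can serve as the question summary.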
6 Experiment
In Section 6, we describe the details of our implementation and discuss the experimental results of the proposed method.
We conduct our experiments on the VQA [1] dataset. The VQA dataset is based on the MS COCO dataset [13] and contains the largest number of questions. There are 614163 questions: 248349 for training, 121512 for validation and 244302 for testing. In the VQA dataset, each question is associated with 10 answers annotated by different people from Amazon Mechanical Turk (AMT). About 98% of the answers do not exceed 3 words, and 90% of the answers are single words. Note that we only test our method on the open-ended case of the VQA dataset, because it has the most open-ended questions among all available datasets and because we think the open-ended task is closer to real situations than the multiple-choice one.
To support our claim that BQ can help accuracy, and to compare with the state-of-the-art VQA method [14], our Module 2 uses the same settings, dataset and source code as [14]. Module 1 of the VQABQ model is our basic question generation module. In other words, the only difference between our model and [14] is our Module 1, illustrated by Figure 2.
|Open-Ended Case (Total: 142093 questions)|
|LSTM Q+I ||36.8||80.5||43.0||57.8||58.2|
|LSTM Q+I ||36.46||80.87||43.40||58.02||58.18|
6.3 Evaluation Metrics
The VQA dataset provides multiple-choice and open-ended tasks for evaluation. In the open-ended task, the answer can be any phrase or word. In the multiple-choice task, however, an answer must be chosen from 18 candidate answers. In both cases, answers are evaluated by an accuracy that reflects human consensus. The accuracy is given by:
$$\mathrm{Accuracy}_{VQA} = \frac{1}{N}\sum_{i=1}^{N}\min\left\{\frac{\sum_{t \in T_i}\mathbb{I}[a_i = t]}{3},\; 1\right\}$$
where $N$ is the total number of examples, $\mathbb{I}[\cdot]$ denotes an indicator function, $a_i$ is the predicted answer and $T_i$ is the answer set of the $i$-th example. That is, a predicted answer is considered correct if at least 3 annotators agree with it; otherwise, the score depends on the number of annotators who agree with it.
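The consensus-based accuracy described above can be sketched in a few lines. The predictions and annotator answers are made up for illustration, `vqa_accuracy` is a hypothetical helper name, and answers are compared by exact string match (an assumption of this sketch).

```python
def vqa_accuracy(predictions, answer_sets):
    """Average over examples of min(#annotators agreeing with the prediction / 3, 1)."""
    total = 0.0
    for pred, answers in zip(predictions, answer_sets):
        matches = sum(1 for a in answers if a == pred)  # annotators who gave this answer
        total += min(matches / 3.0, 1.0)                # full credit at 3 or more agreements
    return total / len(predictions)

# Three toy examples, each with 10 annotator answers.
preds = ["red", "2", "yes"]
answer_sets = [
    ["red"] * 4 + ["dark red"] * 6,  # 4 agreements: full credit
    ["2"] * 2 + ["3"] * 8,           # 2 agreements: partial credit of 2/3
    ["no"] * 10,                     # 0 agreements: no credit
]
acc = vqa_accuracy(preds, answer_sets)
```

Because partial credit is possible, the accuracy is a soft score rather than a plain fraction of exactly correct answers.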
|Main Question||Corresponding Weight||Basic Question|
|What type of computer is this?||
|Is this a farm?||
|What are these animals?||
|Is the bed made?||
|Where is the baby?||
|What dessert is pictured on the plate?||
6.4 Results and Analysis
Here, we present our final results and analysis in the following parts:
Does Basic Question Help Accuracy?
The answer is yes. Here we only discuss the open-ended case. In our experiment, we use avg ± std, referring to Table 3, as the initial guess of proper thresholds for s1, s2 and s3 in Table 1. We find that s1 = 0.43, s2 = 0.82 and s3 = 0.53 give better utilization of BQ. The threshold s1 = 0.43 can be interpreted as follows: 43% of the testing questions from the VQA dataset cannot find a basic question in the training and validation sets of the VQA dataset, and only 57% of the testing questions can. Note that we combine the training and validation sets of the VQA dataset to form our basic question dataset. The threshold s2 = 0.82 means that 82% of those 57% of testing questions, i.e. 46.74%, can find only 1 basic question, and 18% of those 57%, i.e. 10.26%, can find at least 2 basic questions. Furthermore, s3 = 0.53 means that 53% of those 10.26% of testing questions, i.e. around 5.44%, can find only 2 basic questions, and 47% of those 10.26%, i.e. around 4.82%, can find 3 basic questions. These details are given in Table 2.
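The percentage breakdown above follows from multiplying the thresholds through; the short check below reproduces the arithmetic under the paper's interpretation of s1, s2 and s3 as fractions of testing questions. The variable names are chosen for this example only.

```python
s1, s2, s3 = 0.43, 0.82, 0.53

no_bq = s1                      # questions with no usable basic question
some_bq = 1.0 - s1              # 0.57: questions with at least one BQ
one_bq = s2 * some_bq           # exactly one BQ
two_plus = (1.0 - s2) * some_bq # at least two BQ
two_bq = s3 * two_plus          # exactly two BQ
three_bq = (1.0 - s3) * two_plus  # all three BQ

for name, frac in [("0 BQ", no_bq), ("1 BQ", one_bq), ("2 BQ", two_bq), ("3 BQ", three_bq)]:
    print(f"{name}: {100 * frac:.2f}%")
```

The four groups partition the testing questions, so their fractions add up to 1.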
According to Table 2, 43% of the testing questions from the VQA dataset cannot find proper basic questions in the VQA training and validation sets; some failed examples of this case are shown in Table 6. We also find that many questions in the VQA training and validation sets are almost identical, which reduces the diversity of the basic question dataset. Although only 57% of the testing questions can benefit from basic questions, our method still improves the state-of-the-art accuracy [14] from 60.32% to 60.34%, referring to Tables 4 and 5. With 142093 testing questions, this means our method answers about 28 more questions correctly than the state-of-the-art method. In other words, with a good enough basic question dataset, we could increase accuracy further, especially for counting-type questions, referring to Tables 4 and 5. Because the Co-Attention Mechanism is good at localizing, counting-type questions improve more than the others. So, based on our experiment, we conclude that basic questions can help accuracy.
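The figure of about 28 additional questions follows from applying the 0.02 percentage-point gain to the number of testing questions, as this short check shows. Treating the accuracy difference as a count of whole questions is an approximation, since the accuracy metric also awards partial credit.

```python
total_questions = 142093
acc_before, acc_after = 60.32, 60.34  # accuracies in percent

# Accuracy gain expressed as an equivalent number of fully correct answers.
extra_correct = (acc_after - acc_before) / 100.0 * total_questions
print(round(extra_correct))
```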
Comparison with State-of-the-art.
Recently, [14] proposed the Co-Attention Mechanism for VQA and achieved state-of-the-art accuracy. However, when we use their code and the same setup described in their paper to re-run the experiment, we cannot reproduce the accuracy reported in their work. The re-run results are presented in Table 5. So, under fair conditions, our method is competitive with the state of the art.
7 Conclusion and Future Work
In this paper, we propose the VQABQ model for visual question answering. The VQABQ model has two main modules, the Basic Question Generation Module and the Co-Attention VQA Module. The former generates the basic questions for the query question, and the latter takes the image, the basic questions and the query question as input and outputs the text-based answer to the query question. According to Section 6.4, because the basic question dataset generated from the VQA dataset is not good enough, only 57% of all testing questions can benefit from basic questions. However, we still answer 28 more questions correctly than the state of the art. We believe that with a good enough basic question dataset, the accuracy gain would be much larger.
Previous state-of-the-art VQA methods all achieve their highest accuracy on Yes/No-type questions. So, how to effectively exploit only Yes/No-type basic questions to do VQA is an interesting direction, illustrated by Figure 3. Other open questions are how to generate other specific types of basic questions based on the query question, and how to better combine visual and textual features in order to decrease semantic inconsistency. These directions will be our next research focus.
This work is supported by competitive research funding from King Abdullah University of Science and Technology (KAUST). We would also like to thank Fabian Caba, Humam Alwassel and Adel Bibi for helpful discussions about this work.
- [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- [2] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
- [3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- [4] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
- [5] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.
- [6] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304, 2015.
- [7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [8] I. Ilievski, S. Yan, and J. Feng. A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485, 2016.
- [9] K. Kafle and C. Kanan. Answer-type prediction for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4976–4984, 2016.
- [10] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
- [11] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In NIPS, pages 3294–3302, 2015.
- [12] R. Li and J. Jia. Visual question answering with question representation update (QRU). In NIPS, pages 4655–4663, 2016.
- [13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
- [14] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, pages 289–297, 2016.
- [15] L. Ma, Z. Lu, and H. Li. Learning to answer questions from image using convolutional neural network. arXiv preprint arXiv:1506.00333, 2015.
- [16] L. Ma, Z. Lu, and H. Li. Learning to answer questions from image using convolutional neural network. In AAAI, page 16, 2016.
- [17] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, pages 1682–1690, 2014.
- [18] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–9, 2015.
- [19] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A deep learning approach to visual question answering. International Journal of Computer Vision (IJCV), 2017. To appear.
- [20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
- [21] H. Noh, P. Hongsuck Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 30–38, 2016.
- [22] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
- [23] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
- [24] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pages 2953–2961, 2015.
- [25] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pages 2953–2961, 2015.
- [26] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4613–4621, 2016.
- [27] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
- [28] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
- [29] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, pages 4622–4630, 2016.
- [30] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. arXiv, 1603, 2016.
- [31] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, pages 451–466. Springer, 2016.
- [32] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pages 77–81, 2015.
- [33] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
- [34] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR 2012, pages 702–709. IEEE, 2012.
- [35] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.
- [36] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004, 2016.