Automatic Generation of Grounded Visual Questions
In this paper, we propose the first model able to generate visually grounded questions of diverse types for a single image. Visual question generation is an emerging topic that aims to ask questions in natural language based on visual input. To the best of our knowledge, there are no automatic methods that generate meaningful questions of various types for the same visual input.
To address this problem, we propose a model that automatically generates visually grounded questions of varying types. Our model takes as input both images and the captions generated by a dense caption model, samples the most probable question types, and then generates the questions. The experimental results on two real-world datasets show that our model outperforms the strongest baseline by a wide margin in terms of both correctness and diversity.
Multi-modal learning of vision and language is an important task in artificial intelligence because it is the basis of many applications such as education, user query prediction, and interactive navigation. Apart from describing visual scenes with declarative sentences [Chen and Zitnick, 2014; Gupta and Mannem, 2012; Karpathy and Fei-Fei, 2015; Hodosh et al., 2013; Kulkarni et al., 2011; Kuznetsova et al., 2012; Li et al., 2009; Vinyals et al., 2015; Xu et al., 2015], automatic answering of visually related questions (VQA) has recently attracted a lot of attention in the computer vision community [Antol et al., 2015; Malinowski and Fritz, 2014; Gao et al., 2015; Ren et al., 2015; Yu et al., 2015; Zhu et al., 2015]. However, there is little work on the automatic generation of questions for images.
“The art of proposing a question must be held of higher value than solving it.” (Georg Cantor) An intelligent system should be able to ask meaningful questions given its environment. Beyond demonstrating a high level of AI, multi-modal question-asking modules are of practical use in a wide range of AI systems, such as child education and dialogue systems.
To the best of our knowledge, almost all existing VQA systems rely on manually constructed questions [Antol et al., 2015; Malinowski and Fritz, 2014; Gao et al., 2015; Ren et al., 2015; Yu et al., 2015; Zhu et al., 2015]. A common assumption of the existing VQA systems is that answers are visually grounded, so that all relevant information can be found in the visual input. However, the construction of such datasets is labor-intensive and time-consuming, which limits the diversity and coverage of the questions being asked. As a consequence, data incompleteness poses a special challenge for VQA systems based on supervised learning.
In light of the above analysis, we focus on the automatic generation of visually grounded questions, which we term VQG. The generated questions should be grammatically well-formed, reasonable for the given images, and as diverse as possible. However, existing systems are either rule-based, generating questions from only a few textual patterns [Ren et al., 2015; Zhu et al., 2015], or able to ask only one question per image, and their generated questions are frequently not visually grounded [Simoncelli and Olshausen, 2001].
To tackle this task, we propose the first model capable of asking questions of various types for the same image. As illustrated in Fig. 2, we first apply DenseCap [Johnson et al., 2015] to construct dense captions that provide almost complete coverage of the information needed for questions. Then we feed these captions into the question type selector to sample the most probable question types. Taking as input the question types, the dense captions, and the visual features generated by VGG-16 [Simonyan and Zisserman, 2014], the question generator decodes all of this information into questions. We conduct extensive experiments to evaluate our model as well as the most competitive baseline with three kinds of measures adapted from those commonly used in image caption generation and machine translation.
The contributions of our paper are three-fold:
We propose the first model capable of asking visually grounded questions of diverse types for a single image.
Our model outperforms the strongest baseline by up to 216% in terms of the coverage of asked questions.
The questions generated by our model also surpass those of the strongest baseline by a wide margin in grammaticality and in relatedness to the visual input.
The rest of the paper is organized as follows: we cover related work in Section 2 and present our model in Section 3. After introducing the experimental setup in Section 4, we discuss the results in Section 5 and conclude in Section 6.
2 Related Work
The generation of textual descriptions for visual information has gained popularity in recent years. This includes joint learning of both visual information and text [Barnard et al., 2003; Kong et al., 2014; Zitnick et al., 2013]. A typical task is to describe images with a few declarative sentences, often referred to as image captions [Barnard et al., 2003; Chen and Zitnick, 2014; Gupta and Mannem, 2012; Karpathy and Fei-Fei, 2015; Hodosh et al., 2013; Kulkarni et al., 2011; Kuznetsova et al., 2012; Li et al., 2009; Vinyals et al., 2015; Xu et al., 2015].
Visual Question Answering
Automatic answering of questions based on visual input is one of the most popular tasks in computer vision [Geman et al., 2015; Malinowski and Fritz, 2014; Malinowski et al., 2015; Pirsiavash et al., 2014; Ren et al., 2015; Weston et al., 2015; Yu et al., 2015]. VQA models are now evaluated on a few datasets [Antol et al., 2015; Malinowski and Fritz, 2014; Gao et al., 2015; Ren et al., 2015; Yu et al., 2015; Zhu et al., 2015]. For these datasets, while the images are collected by sub-sampling MS-COCO [Lin et al., 2014], the question-answer pairs are either generated manually [Antol et al., 2015; Gao et al., 2015; Yu et al., 2015; Zhu et al., 2015] or produced by NLP tools [Ren et al., 2015] that convert limited types of image captions into queries.
Visual Question Generation
While asking questions automatically has been explored in depth in NLP, it has rarely been studied for visually related questions. Such questions are in strong demand for creating VQA datasets. Early methods simply convert image labels into questions, which allows only the generation of low-level questions. Diversifying the questions per image, however, still requires considerable manual labor [Antol et al., 2015; Gao et al., 2015; Malinowski and Fritz, 2014]. Zhu et al. [Zhu et al., 2015] recently categorized manually generated questions into 7W question types, such as what, where, and when. Yu et al. [Yu et al., 2015] consider question generation as a task of selectively removing answer-related content words from a caption. In a similar manner, Ren et al. [Ren et al., 2015] design rules to transform image captions into questions of limited types. Apart from that, the closest work is [Simoncelli and Olshausen, 2001], which explores abstract, human-like questions based on visual input. However, what they generate are ambiguous open questions for which no definite answer is available within the visual input. In short, the automatic generation of reasonable and, at the same time, versatile closed-form questions remains a challenging problem.
Knowledge Base (KB) based Question Answering (KB-QA)
KB-QA has attracted considerable attention due to the ubiquity of the World Wide Web and the rapid development of artificial intelligence (AI) technology. Large-scale structured KBs, such as DBpedia [Auer et al., 2007], Freebase [Bollacker et al., 2008], and YAGO [Suchanek et al., 2007], provide abundant resources and rich general human knowledge, which can be used to respond to users' queries in open-domain question answering (QA). However, how to bridge the gap between visual questions and structured data in KBs remains a huge challenge.
The existing KB-QA methods can be broadly classified into two main categories, namely, semantic parsing based methods [Kwiatkowski et al., 2013; Reddy et al., 2016] and information retrieval based methods [Yao and Durme, 2014; Bordes et al., 2014]. Most semantic parsing based methods transform a question into its meaning representation (i.e., a logical form), which is then translated into a KB query to retrieve the correct answer(s). Information retrieval based methods first roughly retrieve a set of candidate answers, and subsequently perform an in-depth analysis to re-rank the candidates and select the correct ones. These methods focus on modeling the correlation of question-answer pairs from the perspective of question topic, relation mapping, answer type, and so forth.
3 Question Generation
Our goal is to generate visually grounded questions of diverse question types directly from images. We start by randomly picking a caption from a set of automatically generated captions, each of which describes a certain region of the image in natural language. Then we sample a reasonable question type for the chosen caption. In the last step, our question generator learns the correlation between the caption and the image, and generates a question of the chosen type.
Formally, for each raw image $x$, our model generates a set of captions $C_x = \{c_1, \dots, c_M\}$, samples a set of question types $\{t_1, \dots, t_M\}$, and then yields a set of grounded questions $\{q_1, \dots, q_M\}$. Herein, a caption or a question is a sequence of words
\[ (w_1, w_2, \dots, w_L), \]
where $L$ is the length of the word sequence. Each word $w_i$ employs 1-of-$V$ encoding, where $V$ is the size of the vocabulary. A question type is represented by the first word of a question, adopting 1-of-$T$ encoding, where $T$ is the number of question types. As in [Zhu et al., 2015], we consider six question types in our experiments: what, when, where, who, why, and how.
For each image $x$, we apply a dense caption model (DenseCap) [Johnson et al., 2015] trained on the Visual Genome dataset [Krishna et al., 2016] to produce a set of captions $C_x$. The generative process is then described as follows:
Choose a caption $c$ from $C_x$.
Choose a question type $t$ given $c$.
Generate a question $q$ conditioned on $c$ and $t$.
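The three steps above can be sketched as a small sampling routine. This is only an illustration under stated assumptions: the function `sample_question` and its arguments are hypothetical names, and the toy distributions in the usage lines stand in for DenseCap, the question type selector, and the question generator.

```python
import random


def sample_question(caption_probs, type_probs, generate, rng):
    """Run the three generative steps once.

    caption_probs: dict mapping caption -> probability (hypothetical prior)
    type_probs:    function caption -> dict of question type -> probability
    generate:      function (caption, question_type) -> question string
    rng:           a random.Random instance, for reproducibility
    """
    # Step 1: choose a caption from the caption set.
    captions = list(caption_probs)
    caption = rng.choices(
        captions, weights=[caption_probs[c] for c in captions], k=1
    )[0]
    # Step 2: choose a question type given the caption.
    dist = type_probs(caption)
    types = list(dist)
    qtype = rng.choices(types, weights=[dist[t] for t in types], k=1)[0]
    # Step 3: generate a question conditioned on the caption and the type.
    return caption, qtype, generate(caption, qtype)


# Toy stand-ins (assumptions, not the trained components).
toy_captions = {"floor is brown": 1.0}
toy_types = lambda caption: {"what": 1.0, "who": 0.0}
toy_generator = lambda caption, qtype: f"{qtype} color is the floor?"

result = sample_question(toy_captions, toy_types, toy_generator, random.Random(0))
print(result)
```

Because every toy distribution is degenerate, the call always returns the same caption, type, and question; with real distributions, repeated calls would yield questions of different types about different regions.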
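Denoting all model parameters by $\theta$, for each image $x$, the joint distribution of $q$, $t$, and $c$ is factorized as follows:
\[ P(q, t, c \mid x; \theta) = P(q \mid c, x, t; \theta_q)\, P(t \mid c; \theta_t)\, P(c \mid C_x), \]
where $\theta = \theta_q \cup \theta_t$, $P(q \mid c, x, t; \theta_q)$ is the distribution for generating a question, and $P(t \mid c; \theta_t)$ and $P(c \mid C_x)$ are the distributions for sampling the question type and the caption, respectively. More details are given in the following sections.
Since we do not observe the alignment between captions and questions, $c$ is latent. Summing over $c$, we obtain:
\[ P(q, t \mid x; \theta) = \sum_{c \in C_x} P(q, t, c \mid x; \theta). \]
Let $Q_x$ denote the question set of image $x$. The probability of the training dataset is given by taking the product of the above probabilities over all images and their questions.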
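As a worked example of the marginalization over the latent caption, the sketch below sums the product of the three factors over a toy caption set. The function name and all probability values are made up for illustration.

```python
def marginal_probability(caption_prior, type_given_caption, question_given_caption):
    """Sum over captions of P(question | caption, type) * P(type | caption) * P(caption).

    All three arguments are dicts keyed by caption; the values are the
    probabilities of one fixed question and its question type.
    """
    return sum(
        question_given_caption[c] * type_given_caption[c] * p_c
        for c, p_c in caption_prior.items()
    )


# Hypothetical numbers for the question "what color is the floor?" (type "what").
prior = {"floor is brown": 0.5, "man riding a bike": 0.5}
p_type = {"floor is brown": 0.5, "man riding a bike": 0.25}
p_question = {"floor is brown": 0.5, "man riding a bike": 0.0}

print(marginal_probability(prior, p_type, p_question))  # 0.125
```

Only the caption that actually supports the question contributes to the sum, which is why the alignment between captions and questions matters during training.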
For word representations, we initialize a word embedding matrix by using GloVe [Pennington et al., 2014], which is trained on 840 billion words. For the image representations, we apply a VGG-16 model [Simonyan and Zisserman, 2014] trained on ImageNet [Deng et al., 2009] without fine-tuning to produce 300-dimensional feature vectors. The dimension is chosen to match the size of the pre-trained word embeddings.
Compared to the question generation model of [Simoncelli and Olshausen, 2001], which generates only one question per image, the probabilistic nature of our model allows generating questions of multiple types that refer to different regions of interest, because each caption predicted by DenseCap is associated with a different region.
3.1 Sample captions and question types
The caption model DenseCap generates a set of captions for a given image. Each caption is associated with a region and a confidence score for the proposed region. Intuitively, a caption with higher confidence should receive a higher probability than one with lower confidence. Thus, given the caption set of an image, we define the prior distribution over captions so that the probability of a caption grows with its confidence.
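A minimal sketch of such a prior, assuming it is simply the region confidence normalized over the caption set. The text only requires that a more confident caption receives a higher probability, so this normalization and the name `caption_prior` are assumptions, and the confidences are assumed to be non-negative.

```python
def caption_prior(confidences):
    """Turn non-negative region confidences into a probability distribution.

    Assumption: the prior is proportional to the confidence score.
    """
    if any(s < 0 for s in confidences):
        raise ValueError("confidences are assumed to be non-negative")
    total = sum(confidences)
    if total == 0:
        raise ValueError("at least one confidence must be positive")
    return [s / total for s in confidences]


# Hypothetical confidences for two captions of the same image.
print(caption_prior([3.0, 1.0]))  # [0.75, 0.25]
```

If the confidence scores can be negative, for instance when they are unnormalized log scores, a softmax over the scores would be a natural alternative.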
A caption is either a declarative sentence, a word, or a phrase. For a chosen caption, we are able to ask many different types of questions, but not all of them. For example, for the caption “floor is brown” we can ask “what color is the floor”, but it would be awkward to ask a who question. Thus, our model draws a question type $t$ given a caption $c$ with the probability $P(t \mid c)$, assuming that the caption alone suffices to infer the question type.
Our key idea is to learn the association between question types and key words or phrases in captions. The model consists of two components. The first one is a Long Short-Term Memory (LSTM) network [Hochreiter and Schmidhuber, 1997] that maps a caption into a hidden representation. An LSTM is a recurrent neural network taking the following form:
\[ (h_i, m_i) = \mathrm{LSTM}(e_i, h_{i-1}, m_{i-1}), \]
where $e_i$ is the input at time step $i$, $h_i$ and $m_i$ are the hidden state and memory state of the LSTM at time step $i$, and $h_{i-1}$ and $m_{i-1}$ are those at time step $i-1$. As the representation of the whole sequence, we take the last state generated at the end of the sequence. This representation is then fed into a softmax layer, the second component, to compute a probability vector over all question types. The probability vector characterizes a multinomial distribution over all question types.
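The final softmax layer can be illustrated in a few lines. The sketch below assumes the caption has already been encoded into a state vector; `type_distribution`, the weight matrix, and the bias are hypothetical stand-ins for the trained layer, not values from the paper.

```python
import math

QUESTION_TYPES = ["what", "when", "where", "who", "why", "how"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def type_distribution(last_state, weights, bias):
    """Map the last encoder state to a distribution over question types.

    weights holds one row per question type; bias holds one value per type.
    """
    scores = [
        sum(w * h for w, h in zip(row, last_state)) + b
        for row, b in zip(weights, bias)
    ]
    return dict(zip(QUESTION_TYPES, softmax(scores)))


# Hypothetical 2-dimensional state and a layer that favours "what".
state = [0.2, -0.1]
weights = [[0.0, 0.0]] * 6
bias = [2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
dist = type_distribution(state, weights, bias)
print(max(dist, key=dist.get))  # what
```

Sampling from this distribution, rather than always taking the most probable type, is what allows several question types to be asked about the same caption.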
3.2 Generate questions
At the core of our model is the question generation module, which models the distribution of a question given a chosen caption, the image, and a question type. It is composed of three modules: i) an LSTM encoder that generates caption embeddings; ii) a correlation module that learns the association between images and captions; iii) a decoder consisting of an LSTM decoder and an n-gram language model.
A grounded question is deeply anchored in both the sampled caption and the associated image. In our preliminary experiments, we found it useful to let the LSTM encoder read the image features prior to reading the caption. In particular, at the initial time step, we initialize the state vector to zero and feed the image features as input. At the 1st time step, the encoder reads in a special token indicating the start of a sentence, which is a good practice adopted by many caption generation models [Vinyals et al., 2015]. After reading the whole caption, the encoder yields the last state vector as the embedding of the caption.
The correlation module takes as input the caption embeddings from the encoder and the image features from VGG-16, and produces a 300-dimensional joint feature map. We apply a linear layer followed by a PReLU [He et al., 2015] layer to learn the associations between captions and images. Since an image gives the overall context and the chosen caption provides the focus within the image, the joint representation provides sufficient context to generate grounded questions. Although the LSTM encoder incorporates image features before reading captions, this correlation module strengthens the correlation between images and text by building more abstract representations.
Our decoder extends the LSTM decoder of [Vinyals et al., 2015] with an n-gram language model. The LSTM decoder consists of an LSTM layer and a softmax layer. The LSTM layer starts by reading the joint feature map and the start token in the same fashion as the caption encoder. From then on, the softmax layer predicts at each time step the most likely word given the state vector yielded by the LSTM layer at that step. A word sequence ends when the end-of-sequence token is produced.
Although the LSTM decoder alone can generate questions, we found that it frequently produces repeated words and phrases such as “the the”. The problem did not disappear even when beam search [Koehn et al., 2003] was applied. This is because the state vectors produced at adjacent time steps tend to be similar. Since repeated words and phrases are rarely observed in text corpora, we discount such occurrences by decoding jointly with an n-gram language model. Given a word sequence $(w_1, \dots, w_L)$, a bigram language model is defined as:
\[ P(w_1, \dots, w_L) = P(w_1) \prod_{i=2}^{L} P(w_i \mid w_{i-1}). \]
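Instead of using neural models, we adopt word-count-based estimation of the model parameters. In particular, we apply Kneser-Ney smoothing [Kneser and Ney, 1995] to estimate $P(w_i \mid w_{i-1})$, which is given by:
\[ P(w_i \mid w_{i-1}) = \frac{\max(\mathrm{count}(w_{i-1}, w_i) - d,\ 0)}{\mathrm{count}(w_{i-1})} + \lambda(w_{i-1})\, P_{KN}(w_i), \]
where $\mathrm{count}(\cdot)$ denotes the corpus frequency of a term, and $P_{KN}(w_i)$ is a back-off statistic of the unigram $w_i$, used in case the bigram $(w_{i-1}, w_i)$ does not appear in the training corpus. The discount $d$ is usually fixed to 0.75 to avoid overfitting to low-frequency bigrams, and $\lambda(w_{i-1})$ is a normalizing constant conditioned on $w_{i-1}$.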
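To make the count-based estimation concrete, here is a small sketch of an interpolated Kneser-Ney bigram model in its common textbook form. It is an illustration rather than the authors' implementation: the class name, the three-sentence corpus, and the choice of the continuation probability as the unigram back-off statistic are assumptions.

```python
from collections import Counter, defaultdict


class KneserNeyBigram:
    """Interpolated Kneser-Ney bigram model with a fixed discount."""

    def __init__(self, sentences, discount=0.75):
        self.discount = discount
        self.bigrams = Counter()
        for words in sentences:
            self.bigrams.update(zip(words, words[1:]))
        self.context = Counter()          # how often a word occurs as a history
        followers = defaultdict(set)      # distinct words seen after a history
        predecessors = defaultdict(set)   # distinct histories seen before a word
        for (prev, word), n in self.bigrams.items():
            self.context[prev] += n
            followers[prev].add(word)
            predecessors[word].add(prev)
        self.n_followers = {p: len(ws) for p, ws in followers.items()}
        self.n_predecessors = {w: len(ps) for w, ps in predecessors.items()}
        self.n_types = len(self.bigrams)  # number of distinct bigrams

    def continuation(self, word):
        """Back-off statistic: share of bigram types that end in this word."""
        if self.n_types == 0:
            return 0.0
        return self.n_predecessors.get(word, 0) / self.n_types

    def prob(self, word, prev):
        """Smoothed probability of word given the previous word."""
        c_prev = self.context.get(prev, 0)
        if c_prev == 0:
            # Unseen history: fall back to the unigram statistic alone.
            return self.continuation(word)
        seen = max(self.bigrams.get((prev, word), 0) - self.discount, 0.0) / c_prev
        lam = self.discount * self.n_followers[prev] / c_prev
        return seen + lam * self.continuation(word)


# Toy corpus of tokenized questions (hypothetical).
corpus = [
    ["what", "is", "the", "man", "doing"],
    ["what", "is", "the", "color"],
    ["where", "is", "the", "dog"],
]
lm = KneserNeyBigram(corpus)
print(lm.prob("man", "the") > lm.prob("the", "the"))  # True
```

A repetition such as "the the" never occurs in the corpus, so it only receives the small back-off mass, which is the property the decoder relies on to suppress repeated words.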
We incorporate bigram statistics into the LSTM decoder only after the initial time steps, because the LSTM decoder can already predict the first words of questions well. The LSTM decoder essentially captures the conditional probability of each word given all previously generated words, while the bigram model considers only the previous word by using word counts. We obtain the final probability of each word by interpolating these two probabilities with an interpolation weight. In addition, we fix the first words of questions during decoding according to the chosen question types.
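One decoding step of this joint scheme can be sketched as follows, assuming a linear interpolation in which `weight` multiplies the LSTM probability and `1 - weight` multiplies the bigram probability. Both the linear form and the assignment of the weight are assumptions, and the probability tables are invented for the example.

```python
def interpolate(lstm_probs, bigram_probs, weight):
    """Mix two next-word distributions given as dicts word -> probability."""
    words = set(lstm_probs) | set(bigram_probs)
    return {
        w: weight * lstm_probs.get(w, 0.0) + (1.0 - weight) * bigram_probs.get(w, 0.0)
        for w in words
    }


def next_word(lstm_probs, bigram_probs, weight):
    """Pick the most probable next word under the interpolated distribution."""
    mixed = interpolate(lstm_probs, bigram_probs, weight)
    return max(mixed, key=mixed.get)


# Hypothetical step right after the word "the": the LSTM alone would repeat it.
lstm_step = {"the": 0.5, "man": 0.4, "?": 0.1}
bigram_step = {"the": 0.0, "man": 0.7, "?": 0.3}

print(next_word(lstm_step, bigram_step, weight=1.0))  # the
print(next_word(lstm_step, bigram_step, weight=0.5))  # man
```

With `weight=1.0` the bigram model is ignored and the repetition survives; lowering the weight lets the count-based statistics veto it.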
The key challenge of training is the involvement of latent variables, which indicate the alignment between captions and gold-standard questions, in a deep neural network. We estimate the latent variables in a fashion similar to EM, but in a computationally more efficient way.
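Suppose we are given the training set $D$ of question-image pairs $(q, x)$. The loss is given by:
\[ \mathcal{L}(\theta) = -\sum_{(q, x) \in D} \log \sum_{c \in C_x} P(q, t, c \mid x; \theta). \]
Let $r(c)$ denote some proposal distribution such that $\sum_{c \in C_x} r(c) = 1$ and $r(c) \geq 0$. Consider the following:
\[ \mathcal{L}(\theta) = -\sum_{(q, x) \in D} \log \sum_{c \in C_x} r(c)\, \frac{P(q, t, c \mid x; \theta)}{r(c)} \leq -\sum_{(q, x) \in D} \sum_{c \in C_x} r(c) \log \frac{P(q, t, c \mid x; \theta)}{r(c)}. \tag{5} \]
The last step uses Jensen's inequality. Equation (5) gives an upper bound of the loss $\mathcal{L}(\theta)$. When the bound is tight, we have $r(c) = P(c \mid q, t, x; \theta)$.
To avoid the EM loop, we propose a non-parametric estimation of $r(c)$. As a result, for each question-image pair $(q, x)$, we maximize the corresponding lower bound of the log-likelihood by optimizing:
\[ \max_{\theta} \sum_{c \in C_x} r(c) \log P(q, t, c \mid x; \theta). \]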
This in fact assigns a weight to each instance. By using a non-parametric estimation, we are still able to apply backpropagation and SGD-style optimization algorithms, simply by augmenting each instance with an estimated weight.
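The effect of the instance weights on the training objective can be shown with a short sketch. It assumes each training instance is a caption paired with a question, carrying a fixed estimated weight and the probability the model currently assigns to it; the function name and the numbers are hypothetical.

```python
import math


def weighted_nll(instances):
    """Weighted negative log-likelihood over (weight, model_probability) pairs.

    The weights are treated as constants, so gradients would flow only
    through the model probabilities.
    """
    return -sum(weight * math.log(prob) for weight, prob in instances)


# Two candidate captions for one question: a good match and a poor one.
batch = [(0.9, 0.5), (0.1, 0.5)]
print(weighted_nll(batch))
```

Because the weights are fixed before training, the objective is an ordinary weighted loss, and no alternating E-step is needed.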
Given a question and a caption set from the training set, we estimate the distribution of captions given the question by using a kernel density estimator [Scott, 2008] built on a similarity function between a question and a caption. We assume that the caption and the question type are conditionally independent given the question, because we can directly extract the question type from the question by looking at its first few words.
For a given question, there are usually very few matching captions generated by DenseCap, hence the distribution of captions given a question is highly skewed. It is therefore sufficient to randomly draw a caption each time to compute the probability based on Equation (6).
We formulate the similarity between a question and a caption by using both string similarity and embedding based similarity measures.
The surface string of a caption could be an exact or partial match of a given question. Thus we employ the Jaccard index as the string similarity measure between the surface string of a caption and that of a question:
\[ \mathrm{sim}_{str}(q, c) = \frac{|S_q \cap S_c|}{|S_q \cup S_c|}, \]
where $S_q$ and $S_c$ denote the surface strings of the question and the caption, respectively. Both strings are broken down into sets of character-based trigrams during the computation, so that this measure still gives a high similarity if two strings differ only in small variations such as the singular and plural forms of nouns.
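The string similarity is straightforward to sketch. The version below lowercases the input and keeps spaces inside the trigrams; both choices, like the function names, are assumptions that the text does not specify.

```python
def char_trigrams(text):
    """Set of character trigrams of a lowercased string."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(a, b):
    """Jaccard index between the character-trigram sets of two strings."""
    grams_a, grams_b = char_trigrams(a), char_trigrams(b)
    union = grams_a | grams_b
    # Strings shorter than three characters have no trigrams; treat as no match.
    return len(grams_a & grams_b) / len(union) if union else 0.0


print(jaccard("dogs", "dog"))  # 0.5
print(jaccard("cat", "dog"))   # 0.0
```

The singular and plural forms share the trigram "dog", so they still receive a substantial score, whereas a whole-word comparison would treat them as entirely different.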
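When synonyms or words of similar meaning come in different forms, such as “car” and “automobile”, we use the pre-trained word embeddings to calculate their similarity from the weighted average of the word embeddings:
\[ \mathrm{sim}_{emb}(q, c) = \cos\left( \frac{\sum_{w \in q} \mathrm{idf}(w)\, v_w}{\sum_{w \in q} \mathrm{idf}(w)},\ \frac{\sum_{w \in c} \mathrm{idf}(w)\, v_w}{\sum_{w \in c} \mathrm{idf}(w)} \right), \]
where $\cos$ denotes the cosine similarity, $v_w$ is the embedding of word $w$, and $\mathrm{idf}(w)$ is the inverse document frequency of word $w$, computed over the corpus containing all questions, answers, and captions.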
The final similarity measure is computed as an interpolation of the two measures, with the interpolation weight treated as a hyperparameter.
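The embedding-based measure and the final interpolation can be sketched together. The two-dimensional vectors, the default idf of 1.0 for unlisted words, and the linear form of `combined_similarity` are assumptions made for the example; in the paper the vectors come from pre-trained embeddings and the idf values from the corpus.

```python
import math


def weighted_average(words, vectors, idf):
    """idf-weighted mean of the embeddings of the words that have a vector."""
    known = [w for w in words if w in vectors]
    total = sum(idf.get(w, 1.0) for w in known)
    if not known or total == 0:
        return None
    dim = len(vectors[known[0]])
    return [
        sum(idf.get(w, 1.0) * vectors[w][k] for w in known) / total
        for k in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity of two vectors, 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embedding_similarity(question_words, caption_words, vectors, idf):
    """Cosine between the idf-weighted average embeddings of two word lists."""
    u = weighted_average(question_words, vectors, idf)
    v = weighted_average(caption_words, vectors, idf)
    return 0.0 if u is None or v is None else cosine(u, v)


def combined_similarity(string_sim, embed_sim, weight):
    """Linear interpolation of the string and embedding measures (assumed form)."""
    return weight * string_sim + (1.0 - weight) * embed_sim


# Toy embeddings in which "car" and "automobile" point the same way.
toy_vectors = {"car": [1.0, 0.0], "automobile": [1.0, 0.0], "tree": [0.0, 1.0]}
emb = embedding_similarity(["car"], ["automobile"], toy_vectors, idf={})
print(emb)                                 # 1.0
print(combined_similarity(0.0, emb, 0.5))  # 0.5
```

The pair has no character trigram in common, so the string measure alone would score it at zero; the embedding term is what recovers the match.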
4 Experimental Setup
We conduct our experiments on two datasets: the VQA dataset [Antol et al., 2015] and Visual7W [Zhu et al., 2015]. The former is the most popular benchmark for VQA, and the latter is a recently created dataset with more visually grounded questions per image than VQA.
VQA: a sample of the MS-COCO dataset [Lin et al., 2014] that contains 254,721 images and 764,163 manually compiled questions. Each image is associated with three questions on average.
Visual7W: a dataset composed of 327,939 QA pairs on 47,300 COCO images, likewise collected from the MS-COCO dataset [Lin et al., 2014]. In addition, it includes 1,311,756 human-generated answers in the form of multiple choice and 561,459 object groundings from 36,579 categories. Each image is associated with five questions on average.
In this paper, we construct a baseline by training the image caption generation model NeuralTalk2 [Vinyals et al., 2015] on image-question pairs. The baseline is almost the same as [Simoncelli and Olshausen, 2001], which is the only work generating questions from visual input. NeuralTalk2 differs from [Simoncelli and Olshausen, 2001] only in the RNN used in the decoder: NeuralTalk2 adopts an LSTM, while [Simoncelli and Olshausen, 2001] chooses a GRU [Cho et al., 2014]. The two RNN models achieve almost identical performance in language modeling [Chung et al., 2015].
4.3 Evaluation measures
As is common practice for evaluating generated word sequences, we employ three different evaluation metrics: BLEU [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005], and ROUGE-L [Lin, 2004].
BLEU is a modified n-gram precision. We varied the n-gram size from one to four, computed the corresponding measures for each image, and averaged the results across all images. Both METEOR and ROUGE-L are F-measures favoring precision (we use the same F-measure as the implementation at https://github.com/tylin/coco-caption), computed against the reference question with the highest score among all reference questions of the same image. The measures are then averaged across all images. Therefore, all three of them are precision-oriented measures.
To measure the diversity of our generated questions, we also compute the same set of evaluation measures by comparing each reference sentence with the best-matching generated sentence of the same image. This provides an estimate of coverage, analogous to recall.
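The difference between the two evaluation directions can be sketched with any sentence-level score. Below, a clipped unigram precision serves as a simple stand-in for BLEU, METEOR, or ROUGE-L; the function names and example questions are hypothetical, and keeping the generated question in the candidate slot for both directions is an assumption.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Share of candidate words that also occur in the reference (clipped)."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    remaining = Counter(reference.lower().split())
    matched = 0
    for word in cand:
        if remaining[word] > 0:
            matched += 1
            remaining[word] -= 1
    return matched / len(cand)


def precision_oriented(generated, references, score):
    """Average over generated questions of their best score against any reference."""
    return sum(max(score(g, r) for r in references) for g in generated) / len(generated)


def coverage_oriented(generated, references, score):
    """Average over references of the best score reached by any generated question."""
    return sum(max(score(g, r) for g in generated) for r in references) / len(references)


generated = ["what color is the floor"]
references = ["what color is the floor", "who is riding the bike"]

print(precision_oriented(generated, references, unigram_precision))
print(coverage_oriented(generated, references, unigram_precision))
```

The single generated question matches one reference exactly, so the precision-oriented value is perfect, while the coverage-oriented value is pulled down by the reference that no generated question addresses. Generating more diverse questions raises the second number.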
4.4 Implementation details
We optimize all models with Adam [Kingma and Ba, 2014]. We fix the batch size to 64. We set the maximum number of epochs to 64 for Visual7W and to 128 for VQA. The corresponding model hyperparameters were tuned on the validation sets.
5 Results and Discussions
Figure 3 illustrates all three precision-oriented measures evaluated on the Visual7W and VQA datasets, respectively. Our baseline is able to generate only one question per image. When we compare its results with the highest-scoring question per image generated by our model, our model outperforms the baseline by a wide margin. On the VQA test set, in the case of the BLEU measures, the improvement over the baseline grows from 24% with unigrams to 97% with four-grams. It is evident that our model is capable of generating many more of the higher-order n-grams that occur in the reference questions. This improvement is also consistent with ROUGE-L, which is based on the longest common subsequence between generated questions and reference questions. Moreover, our model does not outperform the baseline merely because it reproduces more exact higher-order n-grams from the reference questions: METEOR considers unigram alignment and allows multiple matching modules to account for synonymy and alternative word forms. With this measure, our model still scores 65% higher than the baseline on the VQA test set. We also observe a similar level of improvement over the baseline on the Visual7W dataset.
On both datasets, when the number of generated questions per image grows, the precision-oriented measures of our model either stay similar or decline slightly, because our model often generates meaningful questions that are not included in the ground truth. The more questions we generate, the more likely it is that they are not covered by the manually constructed ones.
To measure the coverage of generated questions, we compared each reference question against all generated questions per image with all evaluation measures. As shown in Figure 3, all measures improve as the number of questions grows. Herein, both ROUGE-L and METEOR are far better than the baseline regardless of the number of generated questions on both datasets. When all six questions are generated, our model is 130% better than the baseline across all measures. In particular, with METEOR, our model shows an improvement of 216% and 179% over the baseline on VQA and Visual7W, respectively. When the number of manually constructed questions is small, our model provides even more types of questions than the manual ones, as shown by the examples in Figure 4.
The distribution of question types generated by our model is more balanced than that of the ground truth, in which almost 55% of the questions in Visual7W and 89% in VQA start with “what”, as illustrated in Figure 6. Our model also shows no tendency to generate questions that are too long or too short, because the length distribution of the generated questions is very similar to that of the manually constructed datasets.
We also evaluate the effectiveness of integrating the bigram language model on both datasets. Herein, we compare two variants of our model, with and without the bigram model during decoding. As shown in Figure 5, in terms of both precision and recall, decoding with the bigram model consistently outperforms decoding without it. The inclusion of the bigram model effectively eliminates almost all repeated terms such as “the the”, because the statistics collected by the bigram model favor grammatically well-formed sentences. This observation is also reflected in BLEU, where the gaps are larger for higher-order n-grams.
In this paper, we propose the first model to automatically generate visually grounded questions of varying types. Our model is capable of automatically selecting the most likely question types and generating the corresponding questions based on images and captions constructed by DenseCap. Experiments on the VQA and Visual7W datasets demonstrate that the proposed model is able to generate reasonable and grammatically well-formed questions with high diversity. For future work, we will consider the automatic generation of visual question-answer pairs, which will likely enhance the training of VQA systems.
This work is supported by National Key Technology R&D Program of China: 2014BAK09B04, National Natural Science Foundation of China: U1636116, 11431006, Research Fund for International Young Scientists: 61650110510, Ministry of Education of Humanities and Social Science: 16YJC790123.
- [\citeauthoryearAntol et al.2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015.
- [\citeauthoryearAuer et al.2007] Sóren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. Dbpedia: A nucleus for a web of open data. In ISWC/ASWC, 2007.
- [\citeauthoryearBanerjee and Lavie2005] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005.
- [\citeauthoryearBarnard et al.2003] Kobus Barnard, Pinar Duygulu, David Forsyth, Nando de Freitas, David M Blei, and Michael I Jordan. Matching words and pictures. Journal of machine learning research, 2003.
- [\citeauthoryearBollacker et al.2008] Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.
- [\citeauthoryearBordes et al.2014] Antoine Bordes, Jason Weston, and Nicolas Usunier. Open question answering with weakly supervised embedding models. In ECML/PKDD, pages 165–180, 2014.
- [\citeauthoryearChen and Zitnick2014] Xinlei Chen and C Lawrence Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
- [\citeauthoryearCho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- [\citeauthoryearChung et al.2015] Junyoung Chung, Caglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. CoRR, abs/1502.02367, 2015.
- [\citeauthoryearDeng et al.2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- [\citeauthoryearGao et al.2015] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, 2015.
- [\citeauthoryearGeman et al.2015] Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 2015.
- [\citeauthoryearGupta and Mannem2012] Ankush Gupta and Prashanth Mannem. From image annotation to image description. In ICNIP. Springer, 2012.
- [\citeauthoryearHe et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
- [\citeauthoryearHochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.
- [\citeauthoryearHodosh et al.2013] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 2013.
- [\citeauthoryearJohnson et al.2015] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. arXiv preprint arXiv:1511.07571, 2015.
- [\citeauthoryearKarpathy and Fei-Fei2015] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
- [\citeauthoryearKingma and Ba2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [\citeauthoryearKneser and Ney1995] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In ICASSP-95, 1995.
- [\citeauthoryearKoehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In NAACL, pages 48–54, 2003.
- [\citeauthoryearKong et al.2014] Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. What are you talking about? text-to-image coreference. In CVPR, 2014.
- [\citeauthoryearKrishna et al.2016] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016.
- [\citeauthoryearKulkarni et al.2011] Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
- [\citeauthoryearKuznetsova et al.2012] Polina Kuznetsova, Vicente Ordonez, Alexander C Berg, Tamara L Berg, and Yejin Choi. Collective generation of natural image descriptions. In ACL, 2012.
- [\citeauthoryearKwiatkowski et al.2013] Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP, 2013.
- [\citeauthoryearLi et al.2009] Li-Jia Li, Richard Socher, and Li Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR, 2009.
- [\citeauthoryearLin et al.2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
- [\citeauthoryearLin2004] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: ACL-04 workshop, 2004.
- [\citeauthoryearMalinowski and Fritz2014] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, NIPS’14, 2014.
- [\citeauthoryearMalinowski et al.2015] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015.
- [\citeauthoryearPapineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
- [\citeauthoryearPennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 2014.
- [\citeauthoryearPirsiavash et al.2014] Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. Inferring the why in images. arXiv preprint arXiv:1406.5472, 2014.
- [\citeauthoryearReddy et al.2016] Siva Reddy, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. Transforming dependency structures to logical forms for semantic parsing. In TACL, 2016.
- [\citeauthoryearRen et al.2015] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, 2015.
- [\citeauthoryearScott2008] David W Scott. Kernel density estimators. Multivariate Density Estimation: Theory, Practice, and Visualization, pages 125–193, 2008.
- [\citeauthoryearSimoncelli and Olshausen2001] Eero P Simoncelli and Bruno A Olshausen. Natural image statistics and neural representation. Annual review of neuroscience, 2001.
- [\citeauthoryearSimonyan and Zisserman2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- [\citeauthoryearSuchanek et al.2007] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In WWW, 2007.
- [\citeauthoryearSzegedy et al.2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [\citeauthoryearVinyals et al.2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
- [\citeauthoryearWeston et al.2015] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
- [\citeauthoryearXu et al.2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
- [\citeauthoryearYao and Durme2014] Xuchen Yao and Benjamin Van Durme. Information extraction over structured data: Question answering with freebase. In ACL, 2014.
- [\citeauthoryearYu et al.2015] Licheng Yu, Eunbyung Park, Alexander C Berg, and Tamara L Berg. Visual madlibs: Fill in the blank image generation and question answering. arXiv preprint arXiv:1506.00278, 2015.
- [\citeauthoryearZhu et al.2015] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. arXiv preprint arXiv:1511.03416, 2015.
- [\citeauthoryearZitnick et al.2013] C Lawrence Zitnick, Devi Parikh, and Lucy Vanderwende. Learning the visual interpretation of sentences. In ICCV, pages 1681–1688, 2013.