Automatic Generation of Grounded Visual Questions
In this paper, we propose the first model able to generate visually grounded questions of diverse types for a single image. Visual question generation is an emerging topic that aims to ask questions in natural language based on visual input. To the best of our knowledge, there are no automatic methods that generate meaningful questions of various types for the same visual input.
To address this problem, we propose a model that automatically generates visually grounded questions of varying types. Our model takes as input both images and the captions produced by a dense caption model, samples the most probable question types, and generates the questions in sequence. Experimental results on two real-world datasets show that our model outperforms the strongest baseline by a wide margin in terms of both correctness and diversity.
Multi-modal learning of vision and language is an important task in artificial intelligence because it is the basis of many applications such as education, user query prediction, interactive navigation, and so forth. Apart from describing visual scenes by using declarative sentences [?], recently, automatic answering of visually related questions (VQA) has also attracted a lot of attention in computer vision communities [?]. However, there is little work on automatic generation of questions for images.
“The art of proposing a question must be held of higher value than solving it.” (Georg Cantor). An intelligent system should be able to ask meaningful questions about its environment. Beyond demonstrating a high level of AI, multi-modal question-asking modules find practical use in a wide range of AI systems such as child education and dialogue systems.
To the best of our knowledge, almost all existing VQA systems rely on manually constructed questions [?]. A common assumption of existing VQA systems is that answers are visually grounded, so all relevant information can be found in the visual input. However, the construction of such datasets is labor-intensive and time-consuming, which limits the diversity and coverage of the questions being asked. As a consequence, this data incompleteness poses a particular challenge for supervised-learning-based VQA systems.
In light of the above analysis, we focus on the automatic generation of visually grounded questions, coined VQG. The generated questions should be grammatically well-formed, reasonable for the given images, and as diverse as possible. However, existing systems are either rule-based, generating questions from a few limited textual patterns [?], or they ask only one question per image, and the generated questions are frequently not visually grounded [?].
To tackle this task, we propose the first model capable of asking questions of various types about the same image. As illustrated in Figure 2, we first apply DenseCap [?] to construct dense captions that provide an almost complete coverage of the information needed for questions. Then we feed these captions into the question type selector to sample the most probable question types. Taking as input the question types, the dense captions, and the visual features generated by VGG-16 [?], the question generator decodes this information into questions. We conduct extensive experiments to evaluate our model as well as the most competitive baseline, with three kinds of measures adapted from those commonly used in image caption generation and machine translation.
The contributions of our paper are three-fold:
We propose the first model capable of asking visually grounded questions with diverse types for a single image.
Our model outperforms the strongest baseline by up to 216% in terms of the coverage of the asked questions.
The grammaticality of the questions generated by our model, as well as their relatedness to the visual input, also surpasses the strongest baseline by a wide margin.
The rest of the paper is organized as follows: we cover the related work in Section 2, followed by presenting our model in Section 3. After introducing the experimental setup in Section 4, we discuss the results in Section 5, and draw the conclusion in Section 6.
The generation of textual description for visual information has gained popularity in recent years. This includes joint learning of both visual information and text [?]. A typical task is to describe images with a few declarative sentences, often referred to as image captions [?].
Visual Question Answering Automatic answering of questions based on visual input is one of the most popular tasks in computer vision [?]. VQA models are now evaluated on several datasets [?]. For these datasets, while the images are collected by sub-sampling MS-COCO [?], the question-answer pairs are generated either manually [?] or with NLP tools [?] that convert limited types of image captions into queries.
Visual Question Generation While asking questions automatically has been explored in depth in NLP, it has rarely been studied for visually related questions. Such questions are in strong demand for creating VQA datasets. Early methods simply convert image labels into questions, which allows the generation of only low-level questions, and diversifying the questions per image remains labor-intensive [?]. Zhu et al. [?] recently categorize manually generated questions into 7W question types, e.g., what, where, and when. Yu et al. [?] consider question generation as a task of selectively removing answer-related content words from a caption. In a similar manner, Ren et al. [?] design rules that transform image captions into questions of limited types. Apart from that, the most closely related work [?] generates abstract, human-like questions according to the visual input. However, the generated questions are ambiguous open questions for which no determinate answer is available within the visual input. In short, the automatic generation of reasonable and, at the same time, versatile closed-form questions remains a challenging problem.
Knowledge Base (KB) based Question Answering (KB-QA) KB-QA has attracted considerable attention due to the ubiquity of the World Wide Web and the rapid development of the artificial intelligence (AI) technology. Large-scale structured KBs, such as DBpedia [?], Freebase [?], and YAGO [?], provide abundant resources and rich general human knowledge, which can be used to respond to users’ queries in open-domain question answering (QA). However, how to bridge the gap between visual questions and structured data in KBs remains a huge challenge.
The existing KB-QA methods can be broadly classified into two main categories, namely, semantic parsing based methods [?] and information retrieval based methods [?]. Most semantic parsing based methods transform a question into its meaning representation (i.e., logical form), which is then translated into a KB query to retrieve the correct answer(s). Information retrieval based methods initially retrieve a rough set of candidate answers, and subsequently perform an in-depth analysis to re-rank the candidates and select the correct ones. These methods focus on modeling the correlation of question-answer pairs from the perspective of question topic, relation mapping, answer type, and so forth.
Our goal is to generate visually grounded questions with diverse question types directly from images. We start by randomly picking a caption from a set of automatically generated captions, each of which describes a certain region of the image in natural language. Then we sample a reasonable question type conditioned on the caption. In the last step, our question generator, which learns the correlation between the caption and the image, generates a question of the chosen type.
Formally, for each raw image $I$, our model generates a set of captions $C$, samples a set of question types $T$, and yields a set of grounded questions $Q$. Herein, a caption $c$ or a question $q$ is a sequence of words $(w_1, \dots, w_N)$, where $N$ is the length of the word sequence. Each word $w_i$ employs a 1-of-$V$ encoding, where $V$ is the size of the vocabulary. A question type $t$ is represented by the first word of a question, adopting a 1-of-$K$ encoding, where $K$ is the number of question types. Following [?], we consider six question types in our experiments: what, when, where, who, why, and how.
For each image $I$, we apply a dense caption model (DenseCap) [?] trained on the Visual Genome dataset [?] to produce a set of captions $C$. Then the generative process is described as follows:
Choose a caption $c$ from $C$.
Choose a question type $t$ given $c$.
Generate a question $q$ conditioned on $t$ and $c$.
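Assuming simple stand-in interfaces for the learned components (the `type_dist` and `generate` callables below are hypothetical placeholders, not the paper's actual networks), the three-step generative process can be sketched as:

```python
import random

def sample_question(captions, confidences, type_dist, generate, rng):
    """Three-step generative process: caption -> question type -> question.
    `type_dist` and `generate` are hypothetical stand-ins for the learned
    question type selector and question generator."""
    # Step 1: choose a caption, weighted by DenseCap confidence.
    caption = rng.choices(captions, weights=confidences, k=1)[0]
    # Step 2: choose a question type given the caption.
    types, probs = type_dist(caption)
    qtype = rng.choices(types, weights=probs, k=1)[0]
    # Step 3: generate a question conditioned on type and caption.
    return generate(qtype, caption)
```

Because each caption is tied to a different image region, repeated sampling naturally yields questions about different parts of the image.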
Denote by $\theta$ all model parameters; for each image $I$, the joint distribution of $q$, $t$, and $c$ is factorized as follows:

$$P(q, t, c \mid I; \theta) = P(q \mid t, c, I; \theta)\, P(t \mid c; \theta)\, P(c \mid I)$$

where $P(q \mid t, c, I; \theta)$ is the distribution for generating the question, and $P(t \mid c; \theta)$ and $P(c \mid I)$ are the distributions for sampling the question type and the caption, respectively. More details are given in the following sections.
Since we do not observe the alignment between captions and questions, $c$ is latent. Summing over $c$, we obtain:

$$P(q, t \mid I; \theta) = \sum_{c \in C} P(q \mid t, c, I; \theta)\, P(t \mid c; \theta)\, P(c \mid I)$$
Let $Q_I$ denote the question set of the image $I$; the probability of the training dataset is given by taking the product of the above probabilities over all images and their questions.
For word representations, we initialize a word embedding matrix using GloVe [?] vectors trained on 840 billion tokens. For the image representations, we apply a VGG-16 model [?] trained on ImageNet [?] without fine-tuning to produce 300-dimensional feature vectors. The dimension is chosen to match the size of the pre-trained word embeddings.
Compared to the question generation model [?], which generates only one question per image, the probabilistic nature of this model allows generating questions of multiple types which refer to different regions of interests, because each caption predicted by DenseCap is associated with a different region.
3.1 Sample Captions and Question Types
The caption model DenseCap generates a set of captions for a given image. Each caption is associated with a region and a confidence score for the proposed region. Intuitively, captions with higher confidence should receive higher probability than those with lower confidence. Thus, given the caption set $C$ of an image $I$, we define the prior distribution as:

$$P(c_i \mid I) = \frac{s_i}{\sum_{c_j \in C} s_j}$$

where $s_i$ denotes the confidence of caption $c_i$.
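As a minimal sketch, assuming the prior is the plainly normalized region confidence (a softmax over confidences would also fit the description above):

```python
def caption_prior(confidences):
    """Prior P(c_i | I): DenseCap region confidences normalized to sum to 1.
    Assumes plain normalization of the confidence scores."""
    total = sum(confidences)
    return [s / total for s in confidences]
```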
A caption is either a declarative sentence, a word, or a phrase. For a chosen caption, we are able to ask many different types of questions, but not all of them. For example, for the caption “floor is brown” we can ask “what color is the floor”, but it would be awkward to ask a who question. Thus, our model draws a question type $t$ given a caption $c$ with probability $P(t \mid c; \theta)$, assuming that the caption suffices to infer the question type.
Our key idea is to learn the association between question types and key words/phrases in captions. The model consists of two components. The first is a Long Short-Term Memory (LSTM) network [?] that maps a caption into a hidden representation. An LSTM is a recurrent neural network taking the following form:

$$(\mathbf{h}_t, \mathbf{m}_t) = \mathrm{LSTM}(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{m}_{t-1})$$

where $\mathbf{x}_t$ is the input and $\mathbf{h}_t$ the hidden state of the LSTM at time step $t$, and $\mathbf{h}_{t-1}$ and $\mathbf{m}_{t-1}$ are the hidden state and memory state of the LSTM at time step $t-1$, respectively. As the representation of the whole sequence, we take the last state generated at the end of the sequence. This representation is further fed into a softmax layer to compute a probability vector over all question types. The probability vector characterizes a multinomial distribution over all question types.
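The softmax layer on top of the final LSTM state can be sketched in plain Python (the weight shapes are illustrative; in the real model these parameters are trained end-to-end):

```python
import math

def type_distribution(hidden, weights, bias):
    """Map the LSTM's last hidden state to a multinomial distribution over
    question types: a linear layer followed by a numerically stable softmax.
    `weights` has one row per question type."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```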
At the core of our model is the question generation module, which models $P(q \mid t, c, I; \theta)$ given a chosen caption $c$ and a question type $t$. It is composed of three modules: i) an LSTM encoder to generate caption embeddings; ii) a correlation module to learn the association between images and captions; iii) a decoder consisting of an LSTM decoder and an n-gram language model.
A grounded question is deeply anchored in both the sampled caption and the associated image. In our preliminary experiments, we found it useful to let the LSTM encoder read the image features prior to reading the captions. In particular, at time step $t=0$, we initialize the state vector to zero and feed the image features as the input $\mathbf{x}_0$. At the first time step, the encoder reads in a special token indicating the start of a sentence, a good practice adopted by many caption generation models [?]. After reading the whole caption of length $N$, the encoder yields the last state vector as the embedding of the caption.
The correlation module takes as input the caption embeddings from the encoder and the image features from VGG-16, and produces a 300-dimensional joint feature map. We apply a linear layer followed by a PReLU [?] layer to learn the associations between captions and images. Since the image gives the overall context and the chosen caption provides the focus in the image, the joint representation provides sufficient context to generate grounded questions. Although the LSTM encoder incorporates image features before reading captions, this correlation module enhances the correlation between images and text by building more abstract representations.
Our decoder extends the LSTM decoder of [?] with an n-gram language model. The LSTM decoder consists of an LSTM layer and a softmax layer. The LSTM layer starts by reading the joint feature map and the start token in the same fashion as the caption encoder. At each subsequent time step $t$, the softmax layer predicts the most likely word given the state vector yielded by the LSTM layer at time $t$. A word sequence ends when the end-of-sequence token is produced.
Joint decoding Although the LSTM decoder alone can generate questions, we found that it frequently produced repeated words and phrases such as “the the”. The problem did not disappear even when beam search [?] was applied. This is due to the fact that the state vectors produced at adjacent time steps tend to be similar. Since repeated words and phrases are rarely observed in text corpora, we discount such occurrences by joint decoding with an n-gram language model. Given a word sequence $(w_1, \dots, w_N)$, a bigram language model is defined as:

$$P(w_1, \dots, w_N) = P(w_1) \prod_{i=2}^{N} P(w_i \mid w_{i-1})$$
Instead of using neural models, we adopt a word-count-based estimation of the model parameters. In particular, we apply Kneser–Ney smoothing [?] to estimate $P(w_i \mid w_{i-1})$, which is given by:

$$P(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - \delta,\, 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i)$$

where $c(\cdot)$ denotes the corpus frequency of a term, $P_{\mathrm{cont}}(w_i)$ is a back-off statistic of the unigram $w_i$ in case the bigram does not appear in the training corpus, the discount parameter $\delta$ is usually fixed to 0.75 to avoid overfitting on low-frequency bigrams, and $\lambda(w_{i-1})$ is a normalizing constant conditioned on $w_{i-1}$.
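A count-based sketch of the interpolated Kneser–Ney bigram estimate, with the discount fixed to 0.75 as in the text (a minimal toy implementation, not the paper's exact estimator):

```python
from collections import Counter

def kneser_ney_bigram(corpus, delta=0.75):
    """Estimate P(w | prev) with interpolated Kneser-Ney smoothing from
    raw word counts. `corpus` is a list of tokenized sentences."""
    uni = Counter(w for sent in corpus for w in sent)
    bi = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
    follow = Counter()   # number of distinct word types following `prev`
    precede = Counter()  # number of distinct word types preceding `w`
    for a, b in bi:
        follow[a] += 1
        precede[b] += 1
    n_types = len(bi)    # number of distinct bigram types

    def prob(prev, w):
        cont = precede[w] / n_types               # continuation (back-off) statistic
        if uni[prev] == 0:
            return cont                           # unseen history: back off fully
        discounted = max(bi[(prev, w)] - delta, 0.0) / uni[prev]
        lam = delta * follow[prev] / uni[prev]    # normalizing mass for `prev`
        return discounted + lam * cont
    return prob
```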
We incorporate the bigram statistics with the LSTM decoder only after the first few time steps, because the LSTM decoder already predicts the first words of questions well. The LSTM decoder essentially captures the conditional probability $P(w_i \mid w_{<i}, t, c, I)$, while the bigram model considers only the previous word by using word counts. By interpolating the two, we obtain the final probability as:

$$P(w_i) = \beta\, P_{\mathrm{LSTM}}(w_i \mid w_{<i}, t, c, I) + (1 - \beta)\, P_{\mathrm{bigram}}(w_i \mid w_{i-1})$$

where $\beta$ is an interpolation weight. In addition, we fix the first words of questions during decoding according to the chosen question types.
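One joint decoding step can then be sketched as interpolating the two distributions and taking the best-scoring word. The value β = 0.7 below is an arbitrary illustrative choice, not the paper's tuned value:

```python
def joint_decode_step(lstm_probs, prev_word, bigram_prob, beta=0.7):
    """One step of joint decoding: interpolate the LSTM softmax output with
    the bigram estimate and return the highest-scoring word.
    `lstm_probs` maps candidate words to LSTM probabilities."""
    scores = {w: beta * p + (1.0 - beta) * bigram_prob(prev_word, w)
              for w, p in lstm_probs.items()}
    return max(scores, key=scores.get)
```

Because the bigram model assigns negligible probability to repetitions such as “the the”, the interpolated score steers the decoder away from them.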
The key challenge of training is the presence of latent variables indicating the alignment between captions and gold-standard questions in a deep neural network. We estimate the latent variables in a similar fashion to EM, but computationally more efficiently.
Suppose we are given the training set $D = \{(q, I)\}$; the loss is given by:

$$\ell(\theta) = -\sum_{(q, I) \in D} \log P(q, t \mid I; \theta)$$
Let $\alpha(c)$ denote some proposed distribution over captions such that $\sum_{c} \alpha(c) = 1$ and $\alpha(c) \geq 0$. Consider the following:

$$-\log \sum_{c \in C} P(q, t, c \mid I; \theta) = -\log \sum_{c \in C} \alpha(c) \frac{P(q, t, c \mid I; \theta)}{\alpha(c)} \leq -\sum_{c \in C} \alpha(c) \log \frac{P(q, t, c \mid I; \theta)}{\alpha(c)}$$

The last step uses Jensen’s inequality. This gives an upper bound of the loss $\ell(\theta)$. The bound is tight when $\alpha(c) = P(c \mid q, t, I; \theta)$.
To avoid the EM loop, we propose a non-parametric estimation of $\alpha(c)$. As a result, for each question-image pair $(q, I)$, we maximize the lower bound by optimizing:

$$\sum_{c \in C} \alpha(c) \log P(q, t, c \mid I; \theta)$$
This in fact assigns a weight to each training instance. By using a non-parametric estimation, we are still able to apply backpropagation and SGD-style optimization algorithms by simply augmenting each instance with an estimated weight.
Given a question $q$ and a caption set $C$ from the training set, we estimate $\alpha(c)$ using a kernel density estimator [?]:

$$\alpha(c) = \frac{\mathrm{sim}(q, c)}{\sum_{c' \in C} \mathrm{sim}(q, c')}$$

where $\mathrm{sim}(q, c)$ is a similarity function between a question and a caption. We assume that $\alpha(c)$ is conditionally independent of the question type $t$, because we can directly extract the question type from the question by looking at its first few words.
For a given question, there are usually very few matching captions generated by DenseCap, hence the distribution of captions given a question is highly skewed. It therefore suffices to randomly draw a single caption each time and compute the probability based on the estimator above.
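A sketch of these non-parametric alignment weights, with the similarity kernel left abstract (any question-caption similarity function can be plugged in):

```python
def caption_weights(question, captions, sim):
    """Non-parametric estimate of the latent alignment alpha(c):
    similarity-kernel values normalized over the caption set."""
    raw = [sim(question, c) for c in captions]
    total = sum(raw)
    if total == 0.0:
        # No caption matches at all: fall back to a uniform distribution.
        return [1.0 / len(captions)] * len(captions)
    return [r / total for r in raw]
```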
We formulate the similarity between a question and a caption using both string-based and embedding-based similarity measures.
The surface string of a caption can be an exact or partial match of a given question. Thus we employ the Jaccard index as the string similarity between the surface string of a caption and that of a question:

$$\mathrm{sim}_{\mathrm{str}}(q, c) = \frac{|S_q \cap S_c|}{|S_q \cup S_c|}$$

where $S_q$ and $S_c$ denote the sets derived from the surface strings of the question and the caption, respectively. Both strings are broken down into sets of character-based trigrams during the computation, so that this measure still yields a high similarity if two strings differ only in small variations such as the singular and plural forms of nouns.
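The character-trigram Jaccard measure can be sketched as:

```python
def char_trigrams(s):
    """Break a surface string into its set of character-based trigrams."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def string_similarity(question, caption):
    """Jaccard index over character trigrams, robust to small variations
    such as singular vs. plural noun forms."""
    a, b = char_trigrams(question), char_trigrams(caption)
    if not a and not b:
        return 1.0  # both strings too short to form trigrams
    return len(a & b) / len(a | b)
```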
In case synonyms or words of similar meaning come in different forms, such as “car” and “automobile”, we adopt the pre-trained word embeddings and calculate the similarity using the weighted average of word embeddings:

$$\mathrm{sim}_{\mathrm{emb}}(q, c) = \cos\!\left(\frac{\sum_{w \in q} \mathrm{idf}(w)\, \mathbf{e}_w}{\sum_{w \in q} \mathrm{idf}(w)},\; \frac{\sum_{w \in c} \mathrm{idf}(w)\, \mathbf{e}_w}{\sum_{w \in c} \mathrm{idf}(w)}\right)$$

where $\cos(\cdot, \cdot)$ denotes the cosine similarity, $\mathbf{e}_w$ is the embedding of word $w$, and $\mathrm{idf}(w)$ is the inverse document frequency of $w$ computed on the corpus $D$ containing all questions, answers, and captions.
The final similarity measure is computed as an interpolation of the two measures:

$$\mathrm{sim}(q, c) = \gamma\, \mathrm{sim}_{\mathrm{str}}(q, c) + (1 - \gamma)\, \mathrm{sim}_{\mathrm{emb}}(q, c)$$

where $\gamma \in [0, 1]$ is a hyperparameter.
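A sketch of the embedding-based measure and the final interpolation, using toy embeddings; the idf table, the similarity function passed as `str_sim`, and γ = 0.5 are all illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def idf_weighted_vector(words, emb, idf):
    """idf-weighted average of the word embeddings of `words`."""
    dim = len(next(iter(emb.values())))
    vec, total = [0.0] * dim, 0.0
    for w in words:
        if w in emb:
            weight = idf.get(w, 0.0)
            vec = [v + weight * e for v, e in zip(vec, emb[w])]
            total += weight
    return [v / total for v in vec] if total else vec

def embedding_similarity(q_words, c_words, emb, idf):
    return cosine(idf_weighted_vector(q_words, emb, idf),
                  idf_weighted_vector(c_words, emb, idf))

def final_similarity(q_words, c_words, emb, idf, str_sim, gamma=0.5):
    """Interpolation of string and embedding similarity; gamma is a
    hyperparameter (0.5 here is an arbitrary illustrative value)."""
    return (gamma * str_sim(q_words, c_words)
            + (1 - gamma) * embedding_similarity(q_words, c_words, emb, idf))
```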
We conduct our experiments on two datasets: VQA-Dataset [?] and Visual7W [?]. The former is the most popular benchmark for VQA and the latter is a recently created dataset with more visually grounded questions per image than VQA.
VQA: a sample from the MS-COCO dataset [?] containing 254,721 images and 764,163 manually compiled questions. Each image is associated with three questions on average.
Visual7W: a dataset composed of 327,939 QA pairs on 47,300 COCO images, likewise collected from the MS-COCO dataset [?]. In addition, it includes 1,311,756 human-generated answers in the form of multiple choice and 561,459 object groundings from 36,579 categories. Each image is associated with five questions on average.
In this paper, we consider a baseline obtained by training the image caption generation model NeuralTalk2 [?] on image-question pairs. The baseline is almost the same as [?], which is the only prior work generating questions from visual input. NeuralTalk2 differs from [?] only in the RNN used in the decoder: NeuralTalk2 adopts an LSTM while [?] chooses a GRU [?]. The two RNN models achieve almost identical performance in language modeling [?].
As a common practice for evaluating generated word sequences, we employ three different evaluation metrics: BLEU [?], METEOR [?], and ROUGE-L [?].
BLEU is a modified n-gram precision. We varied the size of the n-grams from one to four, computed the corresponding measures for each image, and averaged the results across all images. Both METEOR and ROUGE-L are F-measure-based metrics.
To measure the diversity of the generated questions, we also compute the same set of evaluation measures by comparing each reference sentence with the best-matching generated sentence for the same image. This provides an estimate of coverage, in analogy to recall.
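The recall-style coverage computation described above can be sketched with an abstract per-pair metric (any of BLEU, METEOR, or ROUGE-L could be plugged in for `metric`):

```python
def coverage(references, generated, metric):
    """For each reference question, take the best score against any
    generated question, then average over the references (recall-style)."""
    best = [max(metric(ref, g) for g in generated) for ref in references]
    return sum(best) / len(best)
```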
We optimize all models with Adam [?]. We fix the batch size to 64. We set the maximal number of epochs to 64 for Visual7W and to 128 for VQA. The corresponding model hyperparameters, including the interpolation weights, were tuned on the validation sets.
5 Results and Discussions
Figure 3 illustrates all three precision-oriented measures evaluated on the Visual7W and VQA datasets, respectively. Our baseline is able to generate only one question per image. When we compare its results with the highest-scored question per image generated by our model, our model outperforms the baseline by a wide margin. On the VQA test set, for the BLEU measures, the improvement over the baseline grows from 24% with unigrams to 97% with four-grams. It is evident that our model is capable of generating many more higher-order n-grams that co-occur in the reference questions. This improvement is also consistent with ROUGE-L, which is based on the longest common subsequence between generated and reference questions. Our model's advantage is not only due to more exact higher-order n-gram matches with the reference questions: METEOR considers unigram alignments, with multiple matching modules that account for synonymy and alternative word forms, and with this measure our model is still 65% higher than the baseline on the VQA test set. We observe a similar level of improvement over the baseline on the Visual7W dataset.
On both datasets, as the number of generated questions per image grows, the precision-oriented measures of our model stay similar or decline slightly, because our model often generates meaningful questions that are not included in the ground truth. The more questions we generate, the more likely it is that some of them are not covered by the manually constructed ones.
To measure the coverage of the generated questions, we computed each reference question against all generated questions per image with all evaluation measures. As shown in Figure 3, all measures improve as the number of questions grows. Both ROUGE-L and METEOR are substantially better than the baseline regardless of the number of generated questions on both datasets. When all six questions are generated, our model is at least 130% better than the baseline across all measures. In particular, with METEOR, our model shows an improvement of 216% and 179% over the baseline on VQA and Visual7W, respectively. When the number of manually constructed questions is small, our model even provides more question types than the manual ones, as shown by the examples in Figure 4.
The distribution of question types generated by our model is more balanced than that of the ground truth, in which almost 55% of the questions in Visual7W and 89% in VQA start with “what”, as illustrated in Figure 6. Our model also has no tendency to generate overly long or short questions, because the length distribution of the generated questions is very similar to that of the manually constructed datasets.
We also evaluate the effectiveness of integrating the bigram language model on both datasets. Herein, we compare two variants of our model, with and without the bigram model during decoding. As shown in Figure 5, in terms of both precision and recall, decoding with the bigram model consistently outperforms decoding without it. The inclusion of the bigram model effectively eliminates almost all repeated terms such as “the the”, because the statistics collected by the bigram model favor grammatically well-formed sentences. This observation is also reflected in the larger gaps in BLEU with higher-order n-grams.
In this paper, we propose the first model to automatically generate visually grounded questions of varying types. Our model automatically selects the most likely question types and generates the corresponding questions based on images and captions constructed by DenseCap. Experiments on the VQA and Visual7W datasets demonstrate that the proposed model is able to generate reasonable and grammatically well-formed questions with high diversity. For future work, we will consider the automatic generation of visual question-answer pairs, which will likely enhance the training of VQA systems.
This work is supported by National Key Technology R&D Program of China: 2014BAK09B04, National Natural Science Foundation of China: U1636116, 11431006, Research Fund for International Young Scientists: 61650110510, Ministry of Education of Humanities and Social Science: 16YJC790123.
- We use the same F-measure as the implementation in https://github.com/tylin/coco-caption.