GQA: a new dataset for compositional question
answering over real-world images
We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and robust question engine that leverages scene graph structures to create 22M diverse reasoning questions, all of which come with functional programs that represent their semantics. We use the programs to gain tight control over the answer distribution and present a new tunable smoothing technique to mitigate language biases. Accompanying the dataset is a suite of new metrics that evaluate essential qualities such as consistency, grounding and plausibility. An extensive analysis is performed for baselines as well as state-of-the-art models, providing fine-grained results for different question types and topologies. Whereas a blind LSTM obtains a mere 42.1%, and strong VQA models achieve 54.1%, human performance tops out at 89.3%, offering ample opportunity for new research to explore. We strongly hope GQA will provide an enabling resource for the next generation of models with enhanced robustness, improved consistency, and deeper semantic understanding for images and language.
It takes more than a smart guess to answer a good question. The ability to truly assimilate knowledge and use it to draw inferences is among the holy grails of artificial intelligence. A tangible form of this goal is embodied in the task of Visual Question Answering (VQA), where a system has to answer free-form questions by reasoning about presented images. The task demands a rich set of abilities as varied as object recognition, commonsense understanding and relation extraction, spanning both the visual and linguistic ends. In recent years, it has sparked substantial interest throughout the research community, with a host of datasets being constructed [3, 10, 15, 43, 20] and numerous models being proposed [4, 40, 5, 9, 11].
The multi-modal nature of the task and the diversity of skills required to address different questions make VQA particularly challenging. Yet, designing a good test that will reflect its full qualities and complications may not be as trivial. In spite of the great strides the field recently made, it has been established through a series of studies that existing benchmarks suffer from critical vulnerabilities that render them highly unreliable in measuring the actual degree of visual understanding capacities [41, 10, 1, 7, 2, 13, 18].
The most notable flaw of existing benchmarks is the strong language priors displayed throughout the data [41, 10, 2] – indeed, most tomatoes are red and most tables are wooden. These in turn are exploited by VQA models, which become heavily reliant upon statistical biases and tendencies, rather than on true scene understanding skills [1, 10, 15]. They memorize the precise answer distributions of different questions to handle them with relative ease, while only glancing over the provided images and at times not even considering them, let alone understanding their content [1, 7]. Consequently, early benchmarks have led to an inflated sense of the state of visual scene understanding, severely diminishing their credibility.
Apart from the prevalent biases within the questions, current real-image VQA datasets suffer from multiple other issues and deficiencies: For one thing, they commonly use basic, non-compositional language which rarely requires much beyond object recognition. Second, the immense variability in potential ways to refer to or describe objects and scenes makes it particularly hard for systems to distill and learn clear, unambiguous grounded semantics, a crucial element of cogent scene understanding. Finally, the lack of annotations regarding question structure, type and content makes it difficult to identify and fix the root causes behind the mistakes models make.
To address these shortcomings, while retaining the visual and semantic richness of real-world images, we introduce GQA, a new dataset for visual reasoning and compositional question answering. We have developed and carefully refined a robust question engine, which leverages content: information about objects, attributes and relations provided through the Visual Genome Scene Graphs, along with structure: a newly-created extensive linguistic grammar which couples hundreds of structural patterns and detailed lexical semantic resources, partly derived from the VQA dataset. Together, we combine them to generate over 22 million novel and diverse questions, all of which come with structured representations in the form of functional programs that specify their contents and semantics, and are visually grounded in the image scene graphs.
Many of the GQA questions involve varied reasoning skills, and multi-step inference in particular, standing in sharp contrast with existing real-image VQA datasets [3, 10, 43] which tend to have fairly simple questions from both linguistic and semantic perspectives. We further use the associated functional representations to greatly reduce biases within the dataset and control for its question type composition, downsampling it to create a balanced dataset of 1.7M questions. Contrary to VQA2.0, here we balance not only binary questions, but also open ones, by applying a tunable smoothing technique that makes the answer distribution for each question group more uniform, thereby enabling tight control over the dataset composition. Just like a well designed exam, our benchmark makes the educated-guess strategy far less rewarding, and demands instead more refined comprehension of both the visual and linguistic contents. At the same time, we recognize the importance of research on robustness against biases, and so will provide both the dataset’s balanced and original unbalanced versions.
Along with the data, we have designed a suite of new metrics, which include consistency, validity, plausibility, grounding and distribution scores, to complement the standard accuracy measure commonly used in assessing methods’ performance. Indeed, studies have shown that the accuracy metric alone does not account for a range of anomalous behaviors that models demonstrate, such as ignoring key question words or attending to irrelevant image regions [1, 7]. Other works have argued for the need to devise new evaluation measures and techniques to shed more light on systems’ inner workings [18, 35, 36, 17]. In fact, beyond providing new metrics, GQA can even directly support the development of more interpretable models, as it provides a sentence-long explanation that corroborates each answer, and further associates each word from both the questions and the responses with a visual pointer to the relevant region/s in the image, similar in nature to datasets by Yuke et al., Park et al., and Li et al. These in turn can serve as a strong supervision signal to train models with enhanced transparency and accessibility.
In the following, we delineate the design of our question engine, explain the multi-step question generation process, analyze the resultant dataset and compare it with existing benchmarks. We present and discuss the new metrics and use them to evaluate an array of baselines and state-of-the-art models, comparing them to human subjects and revealing a large gap in performance across multiple axes. Finally, we discuss models’ strengths and weaknesses discovered through the analysis on GQA, and propose potential research directions to overcome them.
GQA combines the best of both worlds, having clearly defined and crisp semantic representations on the one hand but enjoying the semantic and visual richness of real-world images on the other. Our three main contributions are (1) the GQA dataset as a resource for studying visual reasoning; (2) development of an effective method for generating a large number of semantically varied questions, which marries scene graph representations with computational linguistic methods; (3) new metrics for GQA, that allow for better assessment of system success and failure modes, as demonstrated through a comprehensive performance analysis of existing models on this task. We hope that the GQA dataset will provide fertile ground for the development of novel methods that push the boundaries of question answering and visual reasoning.
2 Related Work
The last few years have witnessed tremendous progress in visual understanding in general and VQA in particular, as we move beyond classic perceptual tasks towards problems that ask for high-level semantic understanding and integration of multiple modalities. However, as discussed in section 1, many of these benchmarks suffer from systematic biases, allowing models to circumvent the need for thorough visual understanding, and instead make use of the prevalent real-world language priors to predict plenty of answers with confidence. On the common VQA1.0 dataset, blind models achieve over 50% in accuracy without considering the images at all.
Initial attempts have been made to remedy this situation [10, 41, 2, 15], but they fall short in providing an adequate solution: Some approaches operate over constrained and synthetic images [41, 15], neglecting the realism and diversity natural photos provide. Meanwhile, Goyal et al. associate most of the questions in VQA1.0 with a pair of similar pictures that result in different answers. While offering partial relief, this technique fails to address open questions, leaving their answer distribution largely unbalanced. In fact, since the method does not cover 29% of the questions, even within the binary ones biases still remain. (Footnote 1: According to Goyal et al., 22% of the original questions are left unpaired, and 9% of the paired ones get the same answer due to annotation errors.) Indeed, baseline experiments reveal that 67% and 27% of the binary and open questions respectively are answered correctly by a blind model with no access to the input images.
At the other extreme, Agrawal et al.  partition the questions into training and validation sets such that their respective answer distributions become intentionally dissimilar. While undoubtedly challenging, these adversarial settings penalize models, maybe unjustly, for learning salient properties of the training data. In the absence of other information, making an educated guess is actually the right choice – a valid and beneficial strategy pursued by machines and people alike [28, 6, 27]. While the ability to generalize in the face of change is certainly important, it is ancillary to the task of visual understanding in its purest form. Instead, what we essentially need is a fair but balanced test that is more resilient to such gaming strategies, as we strive to achieve with GQA.
In creating our dataset, we drew inspiration from the CLEVR task, which consists of compositional questions over synthetic images. However, its artificial nature and low diversity, with only 3 classes of objects and 12 different properties, make it particularly vulnerable to memorization. In other words, its space is small enough that a model can easily learn an independent representation for each of the 96 combinations such as “large red sphere”, reducing its effective degree of compositionality. Conversely, GQA operates over real images and a large semantic space, making it much more realistic and challenging. Even though our questions are not natural as in other VQA datasets [10, 43], they display a broad vocabulary and diverse grammatical structures. They may serve in fact as a cleaner benchmark to assess models in a more controlled and comprehensive fashion, as discussed below. Specifically, our dataset builds on top of the scene graph annotations of Visual Genome, which is a crowdsourced dataset specifying the objects, attributes and relations present in 108K different images, all through natural, unconstrained language. Compared to synthetic datasets such as CLEVR, constructing a generation pipeline to cover such linguistic diversity and rich vocabulary entails unique challenges for GQA, as we further discuss in section 3.
Somewhat related to our work are several datasets created through templates [25, 17, 24], most of them for the purpose of data augmentation. However, they are either small in scale or use only a restricted set of objects and a handful of non-compositional templates [17, 24]. (Footnote 2: According to [17, 24], 74% and 86% of their questions respectively are of the form “Is there X in the picture?”. In the latter, the answer to them is invariably “Yes”.) Neural alternatives to visual question generation have been recently proposed [29, 14, 42], but they aim at a quite different goal of creating “interesting and engaging” questions about the wider context of the image, e.g. subjective evoked feelings or speculative events that may lead to or result from the depicted scenes. On top of that, the neurally generated questions may actually be incorrect, irrelevant, or nonsensical. In contrast, here we are focusing on pertinent, factual and accurate questions with objective answers, seeking to create a challenging benchmark for the task of VQA.
3 GQA Dataset Creation
GQA is a new dataset for visual reasoning and compositional question answering over real-world images, designed to foster the development of models capable of advanced reasoning skills and improved scene understanding capabilities. By creating a balanced set of challenging questions over images, along with detailed annotations of the question and image semantics and a suite of new metrics, we allow comprehensive diagnosis of methods’ performance, and open the door for novel models with more transparent and coherent knowledge representation and reasoning.
Figure 2 provides a brief overview of the GQA components and generation process, and figure 3 offers multiple instances from the dataset. More examples are provided in figure 10. The dataset, along with further information, is available at visualreasoning.net.
The GQA dataset consists of 113K images and 22M questions of assorted types and varying compositionality degrees, measuring performance on an array of reasoning skills such as object and attribute recognition, transitive relation tracking, spatial reasoning, logical inference and comparisons.
The images, questions and corresponding answers are all accompanied by matching semantic representations: Each image is annotated with a dense Scene Graph [16, 20], representing the objects, attributes and relations it contains. Each question is associated with a functional program which lists the series of reasoning steps that have to be performed to arrive at the answer. Each answer is augmented with both textual and visual justifications, pointing to the relevant region within the image.
The structured representations offer multiple advantages, as they enable tight control over the question and answer distribution, discussed further in section 3.5, and facilitate assessment of models’ performance along various axes, including question type, topology and semantic length (section 4.2 and section D). In addition, they may support the development of more interpretable grounded models by serving as a strong supervision signal during training. Finally, they enable the design of a new consistency metric, as we can see in section 4.4.
We proceed by describing the four-step pipeline of the dataset construction: First, we thoroughly clean, normalize, consolidate and augment the Visual Genome scene graphs linked to each image. Then, we traverse the graphs to collect information about objects and relations, which is then coupled with grammatical patterns gleaned from VQA2.0 and sundry probabilistic grammar rules to produce a semantically rich and diverse set of questions. At the third stage, we use the underlying semantic forms to reduce biases in the conditional answer distribution, resulting in a balanced dataset that is more robust against shortcuts and guesses. Finally, we provide further detail about the questions’ functional representations, and explain how we utilize these to compute entailment between questions, which will be further used in section 4.4.
3.2 Scene Graph Normalization
Our starting point in creating the GQA dataset is the Visual Genome Scene Graph annotations (Footnote 3: We use the cleaner 1.4 version of the dataset, following Xu et al.) that cover 113k images from COCO and Flickr. (Footnote 4: We expand the original Visual Genome dataset with 5k new scene graphs collected through crowdsourcing.) The scene graph serves as a formalized representation of the image: each node denotes an object, a visual entity within the image, like a person, an apple, grass or clouds. It is linked to a bounding box specifying its position and size, and is marked up with about 1-3 attributes, properties of the object: e.g. its color, shape, material or activity. The objects are connected by relation edges, representing actions (i.e. verbs), spatial relations (e.g. prepositions), and comparatives.
The scene graphs are annotated with unconstrained natural language. Our first goal is thus to convert the annotations into a clear and unambiguous semantic ontology. (Footnote 5: Note that we cannot effectively use the WordNet annotations used throughout the Visual Genome dataset since they are highly inaccurate, relating objects to irrelevant senses, e.g. accountant for a game controller, Cadmium for a CD, etc.) We begin by cleaning up the graph vocabulary, removing stop words, fixing typos, consolidating synonyms and filtering rare or amorphous concepts. (Footnote 6: During this stage we also address additional linguistic subtleties such as the use of noun phrases (“pocket watch”) and opaque compounds (“soft drink”, “hard disk”).) We then classify the vocabulary into predefined categories (e.g. animals and fruits for objects; colors and materials for attributes), using word embedding distances to get preliminary annotations, which are then followed by manual curation. This results in a class hierarchy over the scene graph vocabulary, which we further augment with various semantic and linguistic features such as part of speech, voice, plurality and synonyms – information that will be used to create grammatically correct questions in further steps. Our final ontology contains 1740 objects, 620 attributes and 330 relations, grouped into a hierarchy that consists of 60 different categories and subcategories. A visualization of the ontology can be found in figure 9.
At the next step, we prune graph edges that sound unnatural or are otherwise inadequate to be incorporated within the questions to be generated, such as (woman, in, shirt), (tail, attached to, giraffe), or (hand, hugging, bear). We filter these triplets using a combination of category-based rules, n-gram frequencies, dataset co-occurrence statistics, and manual curation.
In order to generate correct and unambiguous questions, some cases will require us to validate the uniqueness or absence of an object. Visual Genome, while meant to be as exhaustive as possible, cannot guarantee full coverage (as it may be practically infeasible). Hence, in those cases we use object detectors, trained on Visual Genome with a low detection threshold, to conservatively confirm the object's absence or uniqueness. (Footnote 7: An object's uniqueness can be validated by confidently denying the existence of other same-class objects within the image.)
Next, we augment the graph objects with absolute and relative positional information: objects appearing within margins, horizontal or vertical, are annotated accordingly. Object pairs for which we can safely determine horizontal positional relations (e.g. one is to the left of the other), are annotated as well. (Footnote 8: We do not annotate objects with vertical positional relations since these cannot be confidently determined from their bounding boxes – we cannot distinguish between cases of below/above and behind/in-front.) We also annotate object pairs if they share the same color, material or shape. Finally, we enrich the graph with global information about the image location or weather, if these can be directly inferred from the objects it contains.
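As a rough illustration of the horizontal annotation step, the sketch below labels one bounding box as left or right of another only when the two boxes do not overlap horizontally, and flags boxes that lie entirely inside a side margin. The (x, y, w, h) box format, the 25% margin width and the function names are assumptions made for illustration; the paper does not specify these details.

```python
from typing import NamedTuple, Optional

class Box(NamedTuple):
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

def horizontal_relation(a: Box, b: Box) -> Optional[str]:
    """Return 'left' or 'right' for a relative to b, or None when unclear."""
    if a.x + a.w <= b.x:
        return "left"
    if b.x + b.w <= a.x:
        return "right"
    return None  # boxes overlap horizontally: not safe to annotate

def margin_position(box: Box, image_width: float, margin: float = 0.25) -> Optional[str]:
    """Label a box lying entirely within the left or right image margin.

    The margin fraction is an assumed value, not one taken from the paper.
    """
    if box.x + box.w <= margin * image_width:
        return "left"
    if box.x >= (1 - margin) * image_width:
        return "right"
    return None

apple = Box(10, 50, 20, 20)
fridge = Box(100, 0, 80, 200)
relation = horizontal_relation(apple, fridge)
```

Returning None for overlapping boxes mirrors the conservative stance of the text: a relation is recorded only when it can be determined safely.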
By the end of this stage, the resulting scene graphs have clean, unified, rich and unambiguous semantics for both the nodes and the edges.
3.3 The Question Engine
At the heart of our pipeline is the question engine, responsible for producing diverse, relevant and grammatical questions in varying degrees of compositionality. The generation process harnesses two resources: one is the scene graphs which fuel the engine with rich content – information about objects, attributes and relationships; the other is the structural patterns, a mold that shapes the content, casting it into a question.
Our engine operates over 524 patterns, spanning 117 question groups. Each group is associated with three components: (1) a functional program that represents its semantics; (2) a set of textual rephrases which express it in natural language, e.g. “What|Which <type> [do you think] <is> <theObject>?”; (3) a pair of short and long answers: e.g. <attribute> and “The <object> <is> <attribute>.” respectively. (Footnote 9: Note that the long answers can serve as textual justifications, especially for questions that require increased reasoning such as logical inference, where a question like “Is there a red apple in the picture?” may have the answer: “No, there is an apple, but it is green”.)
We begin from a seed set of 250 manually constructed patterns, and extend it with 274 natural patterns derived from VQA1.0 through anonymization of words from our ontology. (Footnote 10: For instance, a question-answer pair in VQA1.0 such as “What color is the apple? red” turns after anonymization into “What <type> <is> the <object>? <attribute>”.) To increase the question diversity, apart from using synonyms for objects and attributes, we incorporate probabilistic sections into the patterns, such as optional phrases [x] and alternate expressions (x|y), which get instantiated at random.
It is important to note that the patterns do not strictly limit the structure or depth of each question, but only outline their high-level form, as many of the template fields can be populated with nested compositional references. For instance, in the pattern above, we may replace <theObject> with “the apple to the left of the white refrigerator”.
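The probabilistic sections can be illustrated with a small pattern expander. This is a minimal sketch under stated assumptions: the function name `expand`, the 0.5 probability of keeping an optional phrase, and the uniform choice among alternates are all hypothetical, since the paper does not give the sampling probabilities.

```python
import random
import re

def expand(pattern: str, fields: dict, rng: random.Random) -> str:
    """Instantiate one surface form of a question pattern."""
    # Optional phrases "[x]": keep with probability 0.5 (assumed), else drop.
    pattern = re.sub(r"\[([^\]]*)\]",
                     lambda m: m.group(1) if rng.random() < 0.5 else "",
                     pattern)
    # Alternate expressions "(x|y)": pick one branch uniformly at random.
    pattern = re.sub(r"\(([^()]*\|[^()]*)\)",
                     lambda m: rng.choice(m.group(1).split("|")),
                     pattern)
    # Template fields "<name>": substitute the supplied value.
    pattern = re.sub(r"<(\w+)>", lambda m: fields[m.group(1)], pattern)
    # Collapse whitespace left behind by dropped optional phrases.
    return re.sub(r"\s+", " ", pattern).strip()

PATTERN = "(What|Which) <type> [do you think] <is> <theObject>?"
fields = {"type": "color", "is": "is", "theObject": "the apple"}
question = expand(PATTERN, fields, random.Random(0))
```

Because fields are substituted last, a field value such as a nested reference is inserted verbatim and is not itself re-expanded.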
To achieve that compositionality, we compute for each object a set of candidate references, which can either be direct, e.g. the bear, this animal, or indirect, using modifiers, e.g. the white bear, the bear on the left, the animal behind the tree, the bear that is wearing a coat. Direct references are used when the uniqueness of the object can be confidently confirmed by object detectors, making the corresponding references unambiguous. Alternatively, we use indirect references, leading to multi-step questions as varied as Who is looking at the animal that is wearing the red coat in front of the window?, and thus greatly increasing the patterns’ effective flexibility. This is the key ingredient behind the automatic generation of compositional questions.
Finally, we compute a set of decoys for the scene graph elements. Indeed, some questions, such as negative ones or those that involve logical inference, pertain to the absence of an object or to an incorrect attribute. Examples include Is the apple green? for a red apple, or Is the girl eating ice cream? when she is in fact eating a cake. Given a triplet (s, r, o), e.g. (girl, eating, cake), we select a distractor considering its likelihood to be in relation with s and its plausibility to co-occur in the context of the other objects in the depicted scene. A similar technique is applied in selecting attribute decoys (e.g. a green apple). While choosing distractors, we exclude from consideration candidates that we deem too similar (e.g. pink and orange), based on a manually defined list for each concept in the ontology.
Having all resources prepared: (1) the clean scene graphs, (2) the structural patterns, (3) the object references and (4) the decoys, we can proceed to generating the questions! We traverse the graph, and for each object, object-attribute pair or subject-relation-object triplet, we produce relevant questions by instantiating a randomly selected question pattern, e.g. “What <type> is <theObject>, <attribute> or <cAttribute>?”, populating all the fields with the matching information, yielding, for example, the question: What (color) (is) the (apple on the table), (red) or (green)?. When choosing object references, we avoid selecting those that disclose the answer or repeat information, e.g. What color is the red apple? or Which dessert sits beside the apple to the left of the cake?. We also avoid asking about relations that tend to have multiple instances for the same object, e.g. asking what object is on the table, as there may be multiple valid answers.
Finally, we use the linguistic features associated with each vocabulary item along with n-gram frequencies to correctly resolve final grammatical subtleties and further increase the questions’ linguistic variance, ironing out the last kinks and twists of the question engine. (Footnote 11: The considered nuances include determining prepositions, choosing articles and selecting the person for verbs and pronouns. Among other adjustments performed randomly are changes to verb tense, use of contractions or apostrophes, and minor rearrangements in preposition order.)
By the end of this stage, we obtain a diverse set of 22M interesting, challenging and grammatical questions, pertaining to each and every aspect of the image.
3.4 Functional Representation and Entailment
Each question pattern is associated with a structured representation in the form of a functional program. For instance, the question What color is the apple on the white table? is semantically equivalent to the following program: select: table → filter: white → relate(subject,on): apple → query: color. As we can see, these programs are composed of atomic operations such as object selection, traversal along a relation edge or an attribute verification, which are in turn concatenated together to create challenging reasoning questions.
The semantically unambiguous representations offer multiple advantages over free-form unrestricted questions. For one thing, they enable comprehensive assessment of methods by dissecting their performance along different axes of question textual and semantic lengths, type and topology, thus facilitating the diagnosis of their success and failure modes (section 4.2 and section D). Second, they aid us in balancing the dataset distribution, mitigating its language priors and guarding against educated guesses (section 3.5). Finally, they allow us to identify entailment and equivalence relations between different questions: knowing the answer to the question What color is the apple? allows a coherent learner to infer the answer to the questions Is the apple red? Is it green? etc. The same goes especially for questions that involve logical inference like or and and operations or spatial reasoning, e.g. left and right. Please refer to figure 4 and figure 15 for entailment examples.
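To make the example program concrete, the sketch below runs its four steps over a toy scene graph. The graph layout, the helper names and the `COLORS` set are assumptions for illustration only and do not reflect the actual GQA data format or tooling.

```python
COLORS = {"red", "green", "white", "brown"}  # assumed attribute category

# Toy scene graph: object id -> name, attributes, outgoing (relation, target) edges.
scene = {
    "o1": {"name": "table", "attributes": {"white", "wooden"}, "relations": []},
    "o2": {"name": "apple", "attributes": {"red"}, "relations": [("on", "o1")]},
    "o3": {"name": "apple", "attributes": {"green"}, "relations": [("on", "o4")]},
    "o4": {"name": "table", "attributes": {"brown"}, "relations": []},
}

def select(graph, name):
    """All objects with the given name."""
    return {oid for oid, obj in graph.items() if obj["name"] == name}

def filter_attr(graph, ids, attr):
    """Keep only the objects in `ids` that carry the attribute."""
    return {oid for oid in ids if attr in graph[oid]["attributes"]}

def relate_subject(graph, ids, relation, name):
    """Objects called `name` that stand in `relation` to some object in `ids`."""
    return {oid for oid, obj in graph.items()
            if obj["name"] == name
            and any(r == relation and tgt in ids for r, tgt in obj["relations"])}

def query_color(graph, ids):
    """Color of the single referenced object; fails if the reference is ambiguous."""
    (oid,) = ids
    (color,) = graph[oid]["attributes"] & COLORS
    return color

# select: table, filter: white, relate(subject,on): apple, query: color
tables = select(scene, "table")
white_tables = filter_attr(scene, tables, "white")
apples = relate_subject(scene, white_tables, "on", "apple")
answer = query_color(scene, apples)
```

Each step consumes the object set produced by the previous one, which is why the indirect reference "the apple on the white table" resolves to a single apple even though the scene contains two.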
As further discussed in section 4.4, this entailment property can be used to measure the coherence and consistency of the models, shedding new light on their inner workings, compared to the widespread but potentially misleading accuracy metric. We define direct entailment relations between the various functional programs and use these to recursively compute all the questions that can be entailed from a given source. A complete catalog of the functions, their associated question types, and the entailment relations between them is provided in table 3 and figure 15.
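The recursive computation can be pictured as a reachability search over the direct entailment relations. The sketch below assumes the relations are stored as a plain dictionary from a question to the questions it directly entails; the function name `entailed` and the toy questions are illustrative and are not taken from the GQA tooling.

```python
def entailed(source, direct):
    """All questions reachable from `source` through direct entailment edges."""
    seen = set()
    stack = [source]
    while stack:
        question = stack.pop()
        for nxt in direct.get(question, ()):
            if nxt not in seen and nxt != source:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical direct entailments for a scene with a single apple.
direct = {
    "What color is the apple?": ["Is the apple red?", "Is the apple green?"],
    "Is the apple red?": ["Is there a red apple?"],
}
closure = entailed("What color is the apple?", direct)
```

An explicit stack is used instead of recursion so that long entailment chains cannot hit Python's recursion limit; the `seen` set also keeps the search finite if the relations contain cycles.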
3.5 Sampling and Balancing
One of the main issues of existing VQA datasets is the prevalent question-conditional biases that allow learners to make educated guesses without truly understanding the presented images, as explained in section 1. However, precise representation of the questions’ semantics can allow tighter control over these biases, having potential to greatly alleviate the problem. We leverage this observation and use the functional programs attached to each question to smooth out the answer distribution.
Given a question’s semantic program, we derive two labels, global and local: The global label assigns the question to its answer type, e.g. color for What color is the apple?. The local label further considers the main subject/s of the question, e.g. apple-color or table-material. We use these labels to partition the questions into groups, and smooth the answer distribution of each group within the two levels of granularity, first globally, and then locally.
For each group, we first compute its answer distribution, and then sort the answers based on their frequency within the group. Then, we downsample the questions (formally, using rejection sampling) to fit a smoother answer distribution derived through the following procedure: We iterate over the answers in decreasing frequency order, and reweight the head of the distribution up to the current iteration to make it more comparable to the tail size. While repeating this operation as we go through the answers, iteratively “moving” probability from the head into the tail, we also maintain minimum and maximum ratios between each pair of subsequent answers (sorted by frequency). This ensures that the relative frequency-based answer ranking stays the same.
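The following sketch shows the general shape of such a downsampling step. It is a simplified stand-in rather than the paper's procedure: it only enforces a maximum ratio between consecutive answers in frequency order, and the `max_ratio` value, the function names and the toy data are assumptions.

```python
import random
from collections import Counter

def smoothed_targets(counts, max_ratio=1.5):
    """Cap each answer's count at max_ratio times the next less frequent one."""
    targets = {}
    cap = float("inf")
    # Walk from the tail (rarest answer) towards the head (most common answer).
    for answer, count in reversed(counts.most_common()):
        targets[answer] = min(count, cap)
        cap = targets[answer] * max_ratio
    return targets

def downsample(questions, max_ratio=1.5, seed=0):
    """Rejection-sample (question, answer) pairs of one group towards the targets."""
    rng = random.Random(seed)
    counts = Counter(answer for _, answer in questions)
    targets = smoothed_targets(counts, max_ratio)
    # Keep each question with probability target / original count of its answer.
    return [(q, a) for q, a in questions
            if rng.random() < targets[a] / counts[a]]

counts = Counter({"red": 800, "green": 150, "blue": 50})
targets = smoothed_targets(counts)
```

Since every target is at least as large as the target of the next less frequent answer, the expected frequency ranking of the answers is preserved while the head is shrunk towards the tail.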
The main advantage of this scheme is that it retains the general real-world tendencies, smoothing them out up to a tunable degree to make the benchmark more challenging and less biased. Refer to figure 5 for a visualization and to section B for a precise depiction of the procedure. Since we perform this balancing in two granularity levels, the obtained answer distributions are made more uniform both locally and globally. Quantitatively, the entropy of the answer distribution is increased by 72%, confirming the success of this stage.
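The entropy comparison mentioned above can be reproduced for any pair of answer distributions with a few lines of code. The sketch below uses Shannon entropy in bits; the two toy count tables are made up for illustration and are not GQA statistics.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of an answer distribution given as counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total)
                for c in counts.values() if c > 0)

# Hypothetical counts for one question group before and after smoothing.
before = {"red": 800, "green": 150, "blue": 50}
after = {"red": 112, "green": 75, "blue": 50}
relative_increase = entropy(after) / entropy(before) - 1
```

A uniform distribution maximizes this quantity, so a higher entropy after balancing indicates that the most common answer has become a less reliable guess.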
Finally, we downsample the questions based on their type to control the dataset type composition, and filter out redundant questions that are too semantically similar to existing ones. We split the dataset into 70% train, 10% validation, 10% test and 10% challenge, making sure that all the questions about a given image appear in the same split.
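One simple way to guarantee that all questions about an image land in the same split is to derive the split from the image identifier alone. The sketch below hashes the identifier into [0, 1) and maps it onto caller-supplied split weights; the hashing scheme, the function name and the example weights are assumptions, not the method used to build GQA.

```python
import hashlib

def split_for_image(image_id, weights):
    """Deterministically assign an image (and hence all its questions) to a split."""
    digest = hashlib.md5(str(image_id).encode("utf-8")).hexdigest()
    u = int(digest, 16) / 16 ** len(digest)  # pseudo-uniform value in [0, 1)
    total = sum(weights.values())            # weights are normalized here
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if u < cumulative:
            return name
    return list(weights)[-1]  # guard against floating-point rounding

# Illustrative weights; every question inherits the split of its image.
weights = {"train": 7, "val": 1, "test": 1, "challenge": 1}
questions = [("2370799", "What color is the apple?"),
             ("2370799", "Is the apple red?")]
splits = {split_for_image(image_id, weights) for image_id, _ in questions}
```

Note that this balances the number of images per split rather than the number of questions, so the question-level proportions only match approximately.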
4 Analysis and Baseline experiments
In the following, we provide an analysis of the GQA dataset, perform a head-to-head comparison with the common VQA2.0 dataset, and evaluate the performance of baselines as well as state-of-the-art models. We introduce the new metrics that complement our dataset, provide quantitative results and discuss their implications and merits. To establish the diversity and realism of GQA questions, we then show test transfer performance between the GQA and VQA datasets. Finally, in section D, we proceed with further diagnosis of the current top-performing model, MAC, evaluating it along multiple axes such as training-set size, question length and compositionality degree.
4.1 Dataset Analysis and Comparison
The GQA dataset includes 22,669,678 questions over 113,018 images. As figure 7 shows, the questions are of varied lengths, and are longer on average than those of the VQA2.0 benchmark, alluding to their higher compositionality degree (figure 6). It has a vocabulary size of 3097 words and 1878 possible answers. While inadvertently smaller than natural language datasets, further investigation reveals that it covers 88.8% and 70.6% of VQA questions and answers respectively, corroborating its wide diversity. A wide selection of dataset visualizations is provided in section A.
We associate each question with two types: structural and semantic. The structural type is derived from the final operation in the question’s functional program. It can be (1) verify for yes/no questions, pertaining to object existence, attribute or relation verification, (2) query for all open questions, (3) choose for questions that present two alternatives to choose from, e.g. “Is it red or blue?”, (4) logical for questions which involve logical inference, and (5) compare for comparison questions between two or more objects. The semantic type refers to the main subject of the question: (1) object: for existence questions, (2) attribute: for questions that consider the properties or position of an object, (3) category: related to object identification within some class, (4) relation: for questions asking about the subject or object of a described relation (e.g. “what is the girl wearing?”), and (5) global: about overall properties of the scene such as weather or place. As shown in figure 6, the questions have a diverse set of types at both the semantic and structural levels.
We proceed by performing a head-to-head comparison with the VQA2.0 dataset, the findings of which are summarized in table 1. Apart from the higher average question length, we can see that GQA consequently contains more verbs and prepositions than VQA (as well as more nouns and adjectives), providing further evidence for its increased compositionality. Semantically, we can see that the GQA questions are significantly more compositional than VQA’s, and involve a variety of reasoning skills at a much higher frequency (spatial, logical, relational and comparative).
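Deriving the structural type from the final operation amounts to a lookup on the last step of the program. The sketch below assumes a program is a list of (operation, argument) pairs; the operation names in the mapping are hypothetical placeholders, since the full catalog of functions is given only in table 3.

```python
# Hypothetical mapping from a program's final operation to its structural type.
FINAL_OP_TO_TYPE = {
    "verify": "verify", "exist": "verify",
    "query": "query",
    "choose": "choose",
    "and": "logical", "or": "logical",
    "same": "compare", "different": "compare",
}

def structural_type(program):
    """Structural type of a question, read off the last step of its program."""
    final_op, _argument = program[-1]
    return FINAL_OP_TO_TYPE[final_op]

program = [("select", "table"), ("filter", "white"),
           ("relate", "apple"), ("query", "color")]
question_type = structural_type(program)
```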
Some VQA question types are not covered by GQA, such as intention (why) questions or ones involving OCR or external knowledge. The GQA dataset focuses on factual questions and multi-hop reasoning in particular, rather than covering all types. Comparing to VQA, GQA questions are objective, unambiguous, more compositional and can be answered from the images only, potentially making this benchmark more controlled and convenient for making research progress on.
4.2 Baseline Experiments
We analyse an assortment of models on GQA, including both baselines and state-of-the-art models. The baselines include a “blind” LSTM model with access to the questions only, a “deaf” CNN model with access to the images only, an LSTM+CNN model, and two prior models based on the question group, local or global, which return the most common answer for each group, as defined in section 3.4. Besides these, we evaluate the performance of the bottom-up attention model, the winner of the 2017 VQA challenge, and the MAC model, a state-of-the-art compositional attention model for CLEVR. For human evaluation, we used Amazon Mechanical Turk to collect human responses for 4000 random questions, taking the majority over 5 answers per question. Further description of the evaluated models along with implementation details can be found in section C.
The evaluation results, including the overall accuracy and the accuracies for each question type, are summarized in table 2. As we can see, the priors and the blind LSTM model achieve very low results of 41.07%: inspection of specific question types reveals that LSTM achieves only 22.7% for open query questions, and is not far above chance on the remaining, binary question types. We can further see that the “deaf” CNN model also achieves low results across almost all question types, as expected. On the other hand, state-of-the-art models such as MAC and Bottom-Up Attention perform much better than the baselines, but still well below human scores, offering ample opportunity for further research in the visual reasoning domain.
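The prior baselines described above can be sketched in a few lines. This is a minimal illustration under the assumption that each example is a dictionary carrying its answer and a group identifier; the field names `global_group` and `answer` and both helper functions are hypothetical.

```python
from collections import Counter, defaultdict


def fit_prior(train_examples, group_key):
    """Map every question group to its most common training answer."""
    counts = defaultdict(Counter)
    for example in train_examples:
        counts[example[group_key]][example["answer"]] += 1
    return {group: c.most_common(1)[0][0] for group, c in counts.items()}


def predict_prior(prior, example, group_key, fallback=None):
    """Answer with the stored group answer, or a fallback for unseen groups."""
    return prior.get(example[group_key], fallback)


train = [
    {"global_group": "color", "answer": "red"},
    {"global_group": "color", "answer": "red"},
    {"global_group": "color", "answer": "green"},
    {"global_group": "weather", "answer": "sunny"},
]
global_prior = fit_prior(train, "global_group")
print(predict_prior(global_prior, {"global_group": "color"}, "global_group"))
```

The same two functions would serve for the local prior by switching `group_key` to a local group field, assuming such a field is available.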
4.3 Transfer Performance
|Aspect||VQA||GQA|
|question length||6.2 ± 1.9||7.9 ± 3.1|
|verbs||1.4 ± 0.6||1.6 ± 0.7|
|nouns||1.9 ± 0.9||2.5 ± 1.0|
|adjectives||0.6 ± 0.6||0.7 ± 0.7|
|prepositions||0.5 ± 0.6||1 ± 1|
We tested the transfer performance between the GQA and VQA datasets, training on one and testing on the other: a MAC model trained on GQA achieves 52.1% on VQA before fine-tuning and 60.5% afterwards. Compare these with 51.6% for LSTM+CNN and 68.3% for MAC, when both are trained and tested on VQA. These relatively strong transfer results demonstrate the realism and diversity of GQA questions, showing that the dataset can serve as a good proxy for human-like questions. In contrast, MAC trained on VQA gets 39.8% on GQA before fine-tuning and 46.5% afterwards, illustrating the further challenge GQA poses.
4.4 New Evaluation Metrics
|Metric||Global Prior||Local Prior||CNN||LSTM||CNN+LSTM||BottomUp||MAC||Humans|
Apart from the standard accuracy metric, and the more detailed type-based diagnosis our dataset supports, we introduce five new metrics to get further insight into visual reasoning methods and point to missing capabilities we believe coherent reasoning models should possess.
Consistency. This metric measures the consistency of responses across different questions. Recall that in section 3.4, we used the questions’ semantic representation to derive equivalence and entailment relations between them. When presented with a new question, any learner striving to be trustworthy should not contradict its previous answers. It should not answer “green” to a new question about an apple it has just identified as “red”.
For each question-answer pair, we define a set of entailed questions, the answers to which can be unambiguously inferred given that pair. For instance, given the question-answer pair “Is there a red apple to the left of the white plate?”, we can infer the answers to questions such as “Is the plate to the right of the apple?”, “Is there a red fruit to the left of the plate?”, “What is the white thing to the right of the apple?”, etc. For each question in the set of questions the model answered correctly, we measure the model’s accuracy over the entailed questions and then average these scores across all questions in that set.
We can see that while people have exceptional consistency of 98.4%, even the best models are inconsistent in about 1 out of 5 questions, and models such as LSTM contradict themselves almost half the time. Apparently, achieving high consistency requires deeper understanding of the question semantics in the context of the image, and, in contrast with accuracy, is more robust against “random” educated guesses as it inspects connections between related questions, and thus may serve as a better measure of models’ true visual understanding skills.
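The computation described above can be sketched as follows. This is an illustrative reading of the definition rather than the official evaluation script; the dictionary layout and the choice to skip questions without entailed questions are assumptions.

```python
def consistency(predictions, gold, entailed):
    """Average accuracy over entailed questions of correctly answered ones.

    predictions, gold: dict mapping question id -> answer.
    entailed: dict mapping question id -> list of entailed question ids.
    """
    per_question = []
    for qid, answer in predictions.items():
        if answer != gold[qid]:
            continue  # only questions the model answered correctly count
        related = entailed.get(qid, [])
        if not related:
            continue  # assumption: skip questions with no entailed questions
        correct = sum(predictions.get(r) == gold[r] for r in related)
        per_question.append(correct / len(related))
    return sum(per_question) / len(per_question) if per_question else 0.0


gold = {"q1": "yes", "q2": "yes", "q3": "plate", "q4": "no"}
predictions = {"q1": "yes", "q2": "yes", "q3": "apple", "q4": "no"}
entailed = {"q1": ["q2", "q3"], "q4": ["q1"]}
print(consistency(predictions, gold, entailed))
```

In this toy case, q1 is answered correctly but only one of its two entailed questions is, while q4 and its single entailed question are both correct, so the score is the mean of 0.5 and 1.0.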
Validity and Plausibility. The validity metric checks whether a given answer is in the scope of the question, e.g. responding with some color to a color question. The plausibility score goes a step further, measuring whether the answer is reasonable, or makes sense, given the question (e.g. elephants usually do not eat, say, pizza). Specifically, we check whether the answer occurs at least once in relation with the question’s subject, across the whole dataset. Thus, for instance, we consider red and green as plausible apple colors, whereas purple is not. (While the plausibility metric may not be fully precise, especially for infrequent objects, due to potential data scarcity issues, it may provide a good sense of the general level of world-knowledge the model has acquired.) The experiments show that models fail to respond with plausible or even valid answers 5-15% of the time, indicating limited comprehension of some questions. Given that these properties are noticeable statistics of the dataset’s conditional answer distribution, not even depending on the specific images, we would expect a sound method to achieve higher scores.
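A minimal sketch of the two checks is given below, assuming each example exposes the question’s subject, a scope label (such as color) and its answer. The field names, the `scope_vocab` dictionary and all three helpers are hypothetical and serve only to make the definitions concrete.

```python
from collections import defaultdict


def build_seen_answers(dataset):
    """Collect the answers observed for each (subject, scope) pair."""
    seen = defaultdict(set)
    for example in dataset:
        seen[(example["subject"], example["scope"])].add(example["answer"])
    return seen


def is_valid(answer, scope, scope_vocab):
    """Valid: the answer belongs to the question's scope, e.g. a color."""
    return answer in scope_vocab.get(scope, set())


def is_plausible(answer, subject, scope, seen):
    """Plausible: the answer occurs at least once with this subject."""
    return answer in seen.get((subject, scope), set())


dataset = [
    {"subject": "apple", "scope": "color", "answer": "red"},
    {"subject": "apple", "scope": "color", "answer": "green"},
    {"subject": "grape", "scope": "color", "answer": "purple"},
]
scope_vocab = {"color": {"red", "green", "purple"}}
seen = build_seen_answers(dataset)
print(is_valid("purple", "color", scope_vocab))
print(is_plausible("purple", "apple", "color", seen))
```

Here purple is a valid answer to a color question but is not plausible for an apple, since it never co-occurs with that subject in the toy dataset.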
Distribution. To get further insight into the extent to which methods manage to model the conditional answer distribution, we define the distribution metric, which measures the overall match between the true answer distribution and the distribution predicted by the model. For each global question group (section 3.4), we compare the golden and predicted distributions using the Chi-Square statistic, and then average across all the groups. This allows us to see whether the model predicts not only the most common answers but also the less frequent ones. Indeed, the experiments demonstrate that better models such as the state-of-the-art BottomUp and MAC models score lower than the baselines (for this metric, lower is better), indicating increased capacity in fitting more subtle trends of the dataset’s distribution.
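One possible reading of this metric is sketched below. The text does not spell out the exact form of the statistic, so the sketch assumes Pearson’s form, with expected counts obtained by scaling the golden answer frequencies to the number of predictions in the group; the field names are hypothetical, and predicted answers that never occur in the golden distribution are ignored in this simplified version.

```python
from collections import Counter, defaultdict


def chi_square(gold_answers, predicted_answers):
    """Pearson chi-square of predicted counts against gold-derived counts."""
    gold = Counter(gold_answers)
    predicted = Counter(predicted_answers)
    n_gold, n_pred = len(gold_answers), len(predicted_answers)
    stat = 0.0
    for answer, gold_count in gold.items():
        # Expected count if predictions followed the golden distribution.
        expected = gold_count / n_gold * n_pred
        stat += (predicted[answer] - expected) ** 2 / expected
    return stat


def distribution_score(examples):
    """Average the per-group statistic over all global groups."""
    groups = defaultdict(lambda: ([], []))
    for example in examples:
        gold_answers, predicted_answers = groups[example["group"]]
        gold_answers.append(example["answer"])
        predicted_answers.append(example["prediction"])
    stats = [chi_square(g, p) for g, p in groups.values()]
    return sum(stats) / len(stats) if stats else 0.0


examples = [
    {"group": "color", "answer": "red", "prediction": "red"},
    {"group": "color", "answer": "red", "prediction": "red"},
    {"group": "color", "answer": "green", "prediction": "red"},
    {"group": "color", "answer": "green", "prediction": "red"},
    {"group": "weather", "answer": "sunny", "prediction": "sunny"},
]
print(distribution_score(examples))
```

A model that always predicts the most common answer, as in the color group here, is penalized for never producing the less frequent one, while a perfect match, as in the weather group, contributes zero.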
Grounding. For attention-based models, the grounding score checks whether the model attends to regions within the image that are relevant to the question. For each dataset instance, we define a pointer to the visual region which the question or answer refers to, and compare it to the model’s visual attention, summing up the overall attention the model gives to that region. This metric allows us to evaluate the degree to which the model grounds its reasoning in the image, rather than just making educated guesses based on language priors or world tendencies.
We can see from the experiments that the attention models in fact attend mostly to the right and relevant regions in the image, with grounding scores of about 80%. To verify the reliability of the metric, we further perform experiments with spatial features instead of the object-informed ones used by BottomUp and MAC, which lead to a much lower score of 43%, demonstrating that object-based features indeed provide models with better granularity for the task, allowing them to focus on more pertinent regions than the coarser spatial features.
In this paper, we introduced the GQA dataset for visual reasoning and compositional question answering. We described the dataset generation process, provided baseline experiments and defined new measures to get more insight into models’ behavior and performance. We believe this benchmark can help drive VQA research in the right directions of deeper semantic understanding, sound reasoning, enhanced robustness and improved consistency. A potential avenue towards such goals may involve more intimate integration between visual knowledge extraction and question answering, two flourishing fields that oftentimes have been pursued independently. We strongly hope that GQA will motivate and support the development of more compositional, interpretable and cogent reasoning models, to advance research in scene understanding and visual question answering.
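The idea of summing the attention that falls on the relevant region can be sketched as follows. The sketch assumes that the model attends over object bounding boxes with weights that sum to one, and that each box contributes in proportion to the fraction of its area lying inside the pointed region; this weighting rule and both helper functions are assumptions made for illustration.

```python
def overlap_fraction(box, region):
    """Fraction of the box area inside the region; boxes are (x0, y0, x1, y1)."""
    width = min(box[2], region[2]) - max(box[0], region[0])
    height = min(box[3], region[3]) - max(box[1], region[1])
    if width <= 0 or height <= 0:
        return 0.0  # no overlap, or a degenerate box
    area = (box[2] - box[0]) * (box[3] - box[1])
    return (width * height) / area


def grounding_score(attention, boxes, region):
    """Sum the attention mass that falls on the pointed region."""
    return sum(a * overlap_fraction(b, region) for a, b in zip(attention, boxes))


attention = [0.7, 0.2, 0.1]
boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (5, 0, 15, 10)]
region = (0, 0, 10, 10)
print(grounding_score(attention, boxes, region))
```

In this example the first box lies entirely inside the region, the second lies outside it, and half of the third overlaps it, so roughly three quarters of the attention mass is counted as grounded.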
We wish to thank Justin Johnson for the discussions about the early versions of this work, and Ross Girshick for his inspirational talk at the VQA workshop 2018. We further would like to thank Ranjay Krishna, Eric Cosatto and Alexandru Niculescu-Mizil for the helpful suggestions and comments. Stanford University gratefully acknowledges the generous support of Facebook Inc. as well as the Defense Advanced Research Projects Agency (DARPA) Communicating with Computers (CwC) program under ARO prime contract no. W911NF15-1-0462 for supporting this work.
-  A. Agrawal, D. Batra, and D. Parikh. Analyzing the behavior of visual question answering models. In EMNLP, pages 1955–1960, 2016.
-  A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4971–4980, 2018.
-  A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra. VQA: Visual question answering. International Journal of Computer Vision, 123(1):4–31, 2017.
-  P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998, 2017.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
-  Y. Attali and M. Bar-Hillel. Guess where: The position of correct answers in multiple-choice test items as a psychometric variable. Journal of Educational Measurement, 40(2):109–128, 2003.
-  A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 163:90–100, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
-  A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
-  Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6325–6334, 2017.
-  D. A. Hudson and C. D. Manning. Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067, 2018.
-  G. Inc. Google books ngram corpus.
-  A. Jabri, A. Joulin, and L. van der Maaten. Revisiting visual question answering baselines. In European conference on computer vision, pages 727–739. Springer, 2016.
-  U. Jain, Z. Zhang, and A. G. Schwing. Creativity: Generating diverse questions using variational autoencoders. In CVPR, pages 5415–5424, 2017.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
-  J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3668–3678, 2015.
-  K. Kafle and C. Kanan. An analysis of visual question answering algorithms. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1983–1991. IEEE, 2017.
-  K. Kafle and C. Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding, 163:3–20, 2017.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
-  H. O. Lancaster and E. Seneta. Chi-square distribution. Encyclopedia of biostatistics, 2, 2005.
-  Q. Li, Q. Tao, S. Joty, J. Cai, and J. Luo. Vqa-e: Explaining, elaborating, and enhancing your answers for visual questions. arXiv preprint arXiv:1803.07464, 2018.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  A. Mahendru, V. Prabhu, A. Mohapatra, D. Batra, and S. Lee. The promise of premise: Harnessing question premises in visual question answering. arXiv preprint arXiv:1705.00601, 2017.
-  M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in neural information processing systems, pages 1682–1690, 2014.
-  G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
-  J. Millman, C. H. Bishop, and R. Ebel. An analysis of test-wiseness. Educational and Psychological Measurement, 25(3):707–726, 1965.
-  J. J. Mondak and B. C. Davis. Asked and answered: Knowledge levels when we won’t take ‘don’t know’ for an answer. Political Behavior, 23(3):199–224, 2001.
-  N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. arXiv preprint arXiv:1603.06059, 2016.
-  D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In 31st IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
-  A. Suhr, S. Zhou, I. Zhang, H. Bai, and Y. Artzi. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491, 2018.
-  D. Teney, P. Anderson, X. He, and A. van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017.
-  D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. arXiv preprint, 2017.
-  B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
-  A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE, 2011.
-  D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
-  P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Balancing and answering binary visual questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5014–5022, 2016.
-  S. Zhang, L. Qu, S. You, Z. Yang, and J. Zhang. Automatic generation of grounded visual questions. arXiv preprint arXiv:1612.06530, 2016.
-  Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7w: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004, 2016.
Appendix A Dataset Visualizations
|Open/Binary||Structural||Semantic||Form||Example|
|open||query||global||select: scene/query: type||How is the weather in the image?|
|binary||verify||global||select: scene/verify type: attr||Is it cloudy today?|
|open||choose||global||select: scene/choose type: ab||Is it sunny or cloudy?|
|open||query||attribute||select: obj/…/query: type||What color is the apple?|
|binary||verify||attribute||select: obj/…/verify type: attr||Is the apple red?|
|binary||logical||attribute||select: obj/…/verify t1: a1/verify t2: a2/and||Is the apple red and shiny?|
|open||choose||attribute||select: obj/…/choose type: ab||Is the apple green or red?|
|binary||verify||object||select: obj/…/exist||Is there an apple in the picture?|
|binary||verify||relation||select: subj/…/relate (rel): obj/exist||Is there an apple on the black table?|
|binary||logical||object||select: obj1/…/exist/select: obj2/…/exist/or||Do you see either an apple or a banana there?|
|binary||logical||obj/attr||select: obj1/…/exist/select: obj2/…/exist/and||Do you see both green apples and bananas there?|
|open||query||category||select: category/…/query: name||What kind of fruit is on the table?|
|open||choose||category||select: category/…/choose: ab||What kind of fruit is it, an apple or a banana?|
|open||query||relation||select: subj/…/relate (rel): obj/query: name||What is the small girl wearing?|
|binary||verify||relation||select: subj/…/verifyRel (rel): obj||Is she wearing a blue dress?|
|open||choose||relation||select: subj/…/chooseRel (r1r2): obj||Is the cat to the left or to the right of the flower?|
|open||choose||relation||select: subj/…/relate (rel): obj/choose: ab||What is the boy eating, an apple or a slice of pizza?|
|binary||compare||object||select: obj1/…/select: obj2/…/compare type||Who is taller, the boy or the girl?|
|open||compare||object||select: obj1/…/select: obj2/…/common||What is common to the shirt and the flower?|
|binary||compare||object||select: obj1/…/select: obj2/…/same||Do the shirt and the flower have the same color?|
|binary||compare||object||select: obj1/…/select: obj2/…/different||Are the table and the chair made of different materials?|
|binary||compare||object||select: allObjs/same||Are all the people there the same gender?|
|binary||compare||object||select: allObjs/different||Are the animals in the image of different types?|
VQA 1. What animal is the lady feeding? 2. Is it raining? 3. Is the man wearing sunglasses?
VQA 1. What is the man holding? 2. Where are the people playing? 3. Is the player safe? 4. What is the sport being played?
VQA 1. Where is the bus driver? 2. Why is the man in front of the bus? 3. What numbers are repeated in the bus number?
VQA 1. What are the yellow lines called? 2. Why don’t the trees have leaves? 3. Where is the stop sign?
Appendix B Dataset Balancing
Appendix C Baselines Implementation Details
In section 4.2, we perform experiments over multiple baselines and state-of-the-art models. All CNN models use spatial features pre-trained on ImageNet, whereas state-of-the-art approaches such as BottomUp and MAC are based on object-based features produced by a Faster R-CNN detector. All models use GloVe word embeddings of dimension 300. To allow a fair comparison, all the models use the same LSTM, CNN and classifier components, and so the only difference between the models stems from their core architectural design.
We have used a sigmoid-based classifier and trained all models using Adam for 15 epochs, each of which takes about an hour to complete. For MAC, we use the official code released by its authors, with 4 cells. For BottomUp, since the official implementation is unfortunately not publicly available, we re-implemented the model, carefully following the details presented in [4, 35]. To ensure the correctness of our implementation, we have tested the model on the standard VQA dataset, achieving 67%, which matches the original scores reported by Anderson et al.
Appendix D Further Diagnosis
Following section 4.2, and in order to get more insight into models’ behaviors and tendencies, we perform further analysis of the top-scoring model for the GQA dataset, MAC. The MAC network is a recurrent attention network that reasons in multiple concurrent steps over both the question and the image, and is thus geared towards compositional reasoning as well as rich scenes with several regions of relevance.
We assess the model along multiple axes of variation, including question length, both textually, i.e. number of words, and semantically, i.e. number of reasoning operations required to answer it, where an operation can be e.g. following a relation from one object to another, attribute identification, or a logical operation such as or, and or not. We provide additional results for different network lengths (namely, the number of cells) and varying training-set sizes, all of which can be found in figure 12.
Interestingly, question textual length correlates positively with the model accuracy. It may be the case that longer questions reveal more cues or information that the model can exploit, potentially sidestepping direct reasoning about the image. However, question semantic length has the opposite effect, as expected: 1-step questions are markedly easier for models than the compositional ones, which involve more steps.
We can further see that longer MAC networks with more cells are more competent in performing the GQA task, substantiating its increased compositionality. Other experiments show that increasing the training set size has a significant impact on the model’s performance, as also found by Kafle et al. Apparently, the training set size has not reached saturation yet and so models may benefit from even larger datasets.
Finally, we have measured the impact of different input representations on the performance. We encode the visual scene with three different methods, ranging from standard pretrained CNN-based spatial features, to object-informed features obtained through Faster R-CNN detectors, up to a “perfect sight” model that has access to the precise semantic scene graph through direct node and edge embeddings. As figure 12 shows, the more high-level and semantic the representation is, the better the results.
On the question side, we explore training on both the standard textual questions and the semantic functional programs. MAC achieves 53.8% accuracy and 81.59% consistency on the textual questions, compared with 59.7% and 85.85% on the programs, demonstrating the usefulness of the former and the further challenge they embody. Indeed, the programs consist of only a small vocabulary of operations, whereas the questions use both synonyms and hundreds of possible structures, incorporating probabilistic rules to make them more natural and diverse. In particular, GQA questions exhibit sundry subtle and challenging linguistic phenomena such as long-range dependencies, which are absent from the canonical programs. The textual questions thus provide us with the opportunity to engage with real, interesting and significant aspects of natural language, and consequently foster the development of models with enhanced language comprehension skills.
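A breakdown of accuracy by semantic length can be computed along the following lines. This is a small sketch under the assumption that each evaluated example stores its functional program as a list of operations together with the gold answer and the model prediction; the field names and the helper are hypothetical.

```python
from collections import defaultdict


def accuracy_by_steps(examples):
    """Group examples by number of program operations and report accuracy."""
    totals = defaultdict(lambda: [0, 0])  # steps -> [correct, total]
    for example in examples:
        steps = len(example["program"])
        totals[steps][0] += int(example["prediction"] == example["answer"])
        totals[steps][1] += 1
    return {steps: c / n for steps, (c, n) in sorted(totals.items())}


examples = [
    {"program": ["select", "exist"], "answer": "yes", "prediction": "yes"},
    {"program": ["select", "exist"], "answer": "no", "prediction": "no"},
    {"program": ["select", "relate", "query"], "answer": "hat", "prediction": "hat"},
    {"program": ["select", "relate", "query"], "answer": "dress", "prediction": "shirt"},
]
print(accuracy_by_steps(examples))
```

Replacing the grouping key with the number of words in the question would give the corresponding breakdown by textual length.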