Language-Conditioned Graph Networks for Relational Reasoning

Ronghang Hu      Anna Rohrbach      Trevor Darrell      Kate Saenko
University of California, Berkeley; Boston University
Abstract

Solving grounded language tasks often requires reasoning about relationships between objects in the context of a given task. For example, to answer the question “What color is the mug on the plate?” we must check the color of the specific mug that satisfies the “on” relationship with respect to the plate. Recent work has proposed various methods capable of complex relational reasoning. However, most of their power is in the inference structure, while the scene is represented with simple local appearance features. In this paper, we take an alternate approach and build contextualized representations for objects in a visual scene to support relational reasoning. We propose a general framework of Language-Conditioned Graph Networks (LCGN), where each node represents an object, and is described by a context-aware representation from related objects through iterative message passing conditioned on the textual input. E.g., conditioning on the “on” relationship to the plate, the object “mug” gathers messages from the object “plate” to update its representation to “mug on the plate”, which can be easily consumed by a simple classifier for answer prediction. We experimentally show that our LCGN approach effectively supports relational reasoning and improves performance across several tasks and datasets.

1 Introduction

Grounded language comprehension tasks, such as visual question answering (VQA) or referring expression comprehension (REF), require finding the relevant objects in the scene and reasoning about certain relationships between them. For example, in Figure 1, to answer the question "is there a person to the left of the woman holding a blue umbrella?", we must locate the relevant objects (person, woman and blue umbrella) and model the specified relationships (to the left of and holding).

Figure 1: In this work, we create context-aware representations for objects by sending messages between relevant objects in a dynamic way that depends on the input language. In the left example, the first round of message passing updates object 2 with features of object 3 based on the woman holding a blue umbrella (green arrow), and the second round updates object 1 with object 2’s features based on person to the left (red arrow). The final answer prediction can be made by a single attention hop over the most relevant object (blue box).

How should we build a model to perform reasoning in grounded language comprehension tasks? Prior works have explored various approaches from learning joint visual-textual representations (e.g. [8, 29]) to pooling over pairwise relationships (e.g. [33, 41]) or constructing explicit reasoning steps with modular or symbolic representations (e.g. [2, 40]). Although these models are capable of performing complex relational inference, their scene representations are built upon local visual appearance features that do not contain much contextual information. Instead, they tend to rely heavily on manually designed inference structures or modules to perform reasoning about relationships, and are often specific to a particular task.

In this work, we propose an alternative way to facilitate reasoning with a context-aware scene representation, suitable for multiple tasks. Our proposed Language-Conditioned Graph Network (LCGN) model augments the local appearance feature of each entity in the scene with a relational contextualized feature. Our model is a graph network built upon visual entities in the scene, which collects relational information through multiple iterations of message passing between the entities. It dynamically determines which objects to collect information from on each round, by weighting the edges in the graph, and sends messages through the graph to propagate just the right amount of relational information. The key idea is to condition the message passing on the specific contextual relationships described in the input text. Figure 1 illustrates this process, where the person would be represented not only by her local appearance, but also by contextualized features indicating her relationship to other relevant objects in the scene, e.g., left of a woman. Our contextualized representation can be easily plugged into task-specific models to replace standard local appearance features, facilitating reasoning with rich relational information. E.g. for the question answering task, it is sufficient to perform a single attention hop over the relevant object, whose representation is contextualized (e.g. blue box in Figure 1).

Importantly, our scene representation is constructed with respect to the given reasoning task. An object in the scene may be involved in multiple relations in different contexts: in Figure 1, the person can be simultaneously left of a woman holding a blue umbrella, holding a white bag, and standing on a sidewalk. Rather than building a complete representation of all the first- and higher-order relational information for each object (which can be enormous and unnecessary), we focus the contextual representation on relational information that is helpful to the reasoning task by conditioning on the input text (Figure 1 left vs. right).

We apply our Language-Conditioned Graph Networks to two reasoning tasks with language inputs—Visual Question Answering (VQA) and Referring Expression Comprehension (REF). In these tasks, we replace the local appearance-based visual representations with the context-aware representations from our LCGN model, and demonstrate that our context-aware scene representations can be used as inputs to perform complex reasoning via simple task-specific approaches, with a consistent improvement over the local appearance features across different tasks and datasets. We obtain state-of-the-art results on the GQA dataset [17] for VQA and the CLEVR-Ref+ dataset [24] for REF.

2 Related work

We first provide an overview of the reasoning tasks addressed in this paper. Then we review related work on graph networks and other contextualized representations. Finally, we discuss alternative approaches to reasoning problems.

Visual question answering (VQA) and referring expression comprehension (REF)

VQA and REF are two popular tasks that require reasoning about image content. While in VQA the goal is to answer a question about an image [3], in REF one has to localize an image region that corresponds to a referring expression [27]. While the real-world VQA datasets [3, 10] focus more on perception than complex reasoning, the more recent synthetic CLEVR dataset [18] is a standard benchmark for relational reasoning. The even more recent GQA dataset [17] brings together the best of both worlds: real images and relational questions. It is built upon the Visual Genome dataset [21] and constructs balanced question-answer pairs from scene graphs.

For REF, there are a number of standard benchmarks such as RefCOCO [42] and RefCOCOg [27], with natural language referring expressions and images from the COCO dataset [23]. However, many of the expressions in these datasets do not require resolving relations. Recently, a new CLEVR-Ref+ dataset [24] has been proposed for REF. It is built using the CLEVR environment and involves very complex queries, aiming to assess the reasoning capabilities of existing models and find their limitations.

In this work we tackle both VQA and REF tasks on three datasets in total. Notably, in all cases, we use the same approach, Language-Conditioned Graph Network (LCGN), to build contextualized representations of objects/image regions. This shows the generality and effectiveness of our approach for various visual reasoning tasks.

Graph networks and contextualized representations

Graph networks are powerful models that can perform relational inference through message passing [4, 9, 20, 22, 36, 44]. The core idea is to enable communication between image regions to build contextualized representations of these regions. Graph networks have been successfully applied to various tasks, from object detection [25] and region classification [7] to human-object interaction [30] and activity recognition [12]. Besides, self-attention models [35] and non-local networks [38] can also be cast as graph networks in a general sense. Below we review some of the recent works that rely on graph networks and other contextualized representations for VQA and REF.

A prominent work that introduced relational reasoning in VQA is [33], which proposes Relation Networks (RNs) for modeling relations between all pairs of objects, conditioned on a question. [6] extends RNs with the Broadcasting Convolutional Network module, which globally broadcasts objects’ visuo-spatial features. The first work to use graph networks in VQA is [34], which combines dependency parses of questions and scene graph representations of abstract scenes. [45] proposes modeling structured visual attention over a Conditional Random Field on image regions. A recent work, [28], conditions on a question to learn a graph representation of an image, capturing object interactions with the relevant neighbours via spatial graph convolutions. Later, [5] extends this idea to modeling spatial-semantic pairwise relations between all pairs of regions.

For the REF task, [37] proposes Language-guided Graph Attention Networks, where attention over nodes and edges is guided by a referring expression, which is decomposed into subject, intra-class and inter-class relationships.

Our work is related to, yet distinct from, the approaches above. While [28] predicts a sparsely connected graph (conditioned on the question) that remains fixed for each step of graph convolution, our LCGN model predicts dynamic edge weights to focus on different connections in each message passing iteration. Besides, [28] is tailored to VQA and is non-trivial to adapt to REF (since it includes max-pooling over node representations). Compared to [5], instead of max-pooling over explicitly constructed pairwise vectors, our model predicts normalized edge weights that both improve computational efficiency in message passing and make it easier to visualize and inspect connections. Finally, [37] is tailored to REF by modeling specific subject attention and intra- and inter-class relations, and does not gather higher-order relational information in an iterative manner. We propose a more general approach for scene representation that is applicable to both VQA and REF.

Reasoning models

A multitude of approaches have been recently proposed to tackle visual reasoning tasks such as VQA and REF. Neural Module Networks (NMNs) [2, 14] are interpretable multi-step models that build question-specific layouts and execute them against an image. NMNs have also been applied to REF, e.g. in Compositional Modular Networks [15] and Stack-NMN [13]. (The latter is a multi-task approach to VQA and REF.) An alternative approach, Memory, Attention, and Composition (MAC) [16], also performs multi-step reasoning while recording information in its memory. FiLM [29] modulates the image representation with the given question via conditional batch normalization, and is extended in [39] with Cascaded Mutual Modulation, a multi-step reasoning procedure where both modalities can modulate each other. The Neural-Symbolic approach [40] disentangles reasoning from image and language understanding, by first extracting symbolic representations from images and text, and then executing symbolic programs over them. MAttNet [41], a state-of-the-art approach to REF, is conceptually related to NMNs as it uses attention to parse an expression and ground it through subject, location and relation modules.

Our approach is not meant to substitute the aforementioned reasoning models, but to complement them. Our contextualized visual representation can be combined with other reasoning models to replace the local feature representation. A prominent reasoning model capable of addressing both VQA and REF is Stack-NMN [13], and we empirically compare to it in Section 4.

3 Language-Conditioned Graph Networks

Given a visual scene and a textual input for a reasoning task such as VQA or REF, we propose to construct a contextualized representation for each entity in the scene that contains the relational information needed for the reasoning procedure specified in the language input.

This contextualized representation is obtained with our novel Language-Conditioned Graph Networks (LCGN) model, through iterative message passing conditioned on the language input. It can then be used as input to a task-specific output module such as a single-hop VQA classifier.

3.1 Context-aware scene representation

Figure 2: We propose Language-Conditioned Graph Networks (LCGN) to address reasoning tasks such as VQA and REF. Our model constructs a context-aware representation for each object through iterative message passing conditioned on the input text. During message passing, each object $i$ is represented by a local feature $x_i^{loc}$ and a context feature $x_i^{ctx,t}$. In every iteration $t$, each object $j$ sends a message vector $m_{t,j \to i}$ to each object $i$, which is collected by $i$ to update its context feature $x_i^{ctx,t}$. The local feature $x_i^{loc}$ and the final context feature $x_i^{ctx,T}$ are combined into a joint context-aware feature $x_i^{out}$, which is used in simple task-specific output modules for VQA or REF.

For an image $I$ and a textual input $L$ that represents a reasoning task, let $N$ be the number of entities in the scene, where each entity can be a detected object or a spatial location on the convolutional feature map of the image. Let $x_i^{loc}$ (where $i = 1, \ldots, N$) be the local feature representation of the $i$-th entity, i.e. the $i$-th detected object's bounding box features or the convolutional features at the $i$-th location on the feature grid. We would like to output a context-aware representation $x_i^{out}$ for each entity, conditioned on the textual input $L$, that contains the relational context associated with entity $i$. This is obtained through iterative message passing over $T$ iterations with our Language-Conditioned Graph Networks, as shown in Figure 2.

We use a fully-connected graph over the scene, where each node corresponds to an entity as defined above, and there is a directed edge between every pair of entities $j$ and $i$. Each node $i$ is represented by a local feature $x_i^{loc}$ that is fixed during message passing, and a context feature $x_i^{ctx,t}$ that is updated during each iteration $t = 1, \ldots, T$. A learned parameter is used as the initial context representation $x_i^{ctx,0}$ for all nodes, before the message passing starts.

Textual command extraction

To incorporate the textual input in the iterative message passing, we build a textual command vector $c_t$ for each iteration $t$ (where $t = 1, \ldots, T$). Given a textual input $L$ for the reasoning task, such as a question in VQA or a query in REF, we extract a set of vectors $\{c_t\}$ from the text $L$, using the same multi-step textual attention mechanism as in Stack-NMN [13] and MAC [16]. Specifically, $L$ is encoded into a sequence $[h_1, \ldots, h_S]$ and a summary vector $q$ with a bi-directional LSTM as:

$[h_1, \ldots, h_S],\; q = \mathrm{BiLSTM}(L)$   (1)

where $S$ is the number of words in $L$, and $h_s$ is the concatenation of the forward and backward hidden states for word $s$ from the bi-directional LSTM output. At each iteration $t$, a textual attention $\alpha_{t,s}$ is computed over the words, and the textual command $c_t$ is obtained from the textual attention as follows:

$\alpha_{t,s} = \underset{s}{\mathrm{softmax}}\big( w_{att}^{\top} ( h_s \odot (W_2^{(t)} q) ) \big)$   (2)
$c_t = \sum_{s=1}^{S} \alpha_{t,s}\, h_s$   (3)

where $\odot$ is element-wise multiplication. Each $c_t$ can be seen as a textual command supplied during the $t$-th iteration. Unlike all other parameters that are shared across message passing iterations, here $W_2^{(t)}$ is learned separately for each iteration $t$.
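
To make the command extraction concrete, the following is a minimal PyTorch-style sketch of Eqns. 1-3. The module name, layer sizes, and the exact form of the word attention are our own illustrative assumptions; only the overall structure (bi-LSTM encoding, one attention readout per iteration, and a command transform learned separately per iteration) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextualCommandExtractor(nn.Module):
    """Sketch of Eqns. 1-3: encode the text with a bi-LSTM and read out
    one command vector c_t per message passing iteration via attention."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_iters=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        d = 2 * hidden_dim  # dimension of each h_s and of the summary vector q
        # the per-iteration command transforms (the only parameters not shared across iterations)
        self.cmd_transform = nn.ModuleList([nn.Linear(d, d) for _ in range(num_iters)])
        self.att_weight = nn.Linear(d, 1)  # w_att in Eqn. 2
        self.num_iters = num_iters

    def forward(self, tokens):
        # tokens: (batch, S) word indices
        h, _ = self.bilstm(self.embed(tokens))           # (batch, S, d) word states h_s
        q = torch.cat([h[:, -1, : h.size(2) // 2],       # last forward state
                       h[:, 0, h.size(2) // 2:]], -1)    # last backward state -> summary q
        commands = []
        for t in range(self.num_iters):
            query = self.cmd_transform[t](q).unsqueeze(1)        # (batch, 1, d)
            logits = self.att_weight(h * query).squeeze(-1)      # Eqn. 2: w_att^T (h_s ⊙ W_2^(t) q)
            alpha = F.softmax(logits, dim=1)                     # attention over the S words
            commands.append((alpha.unsqueeze(-1) * h).sum(1))    # Eqn. 3: c_t = Σ_s α_{t,s} h_s
        return commands, q  # T command vectors and the summary vector
```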

Language-conditioned message passing

At the $t$-th iteration (where $t = 1, \ldots, T$), we first build a joint representation $x_i^{joint,t}$ of each entity. Then, we compute the (directed) connection weights $w_{t,j,i}$ from every entity $j$ (the sender) to every entity $i$ (the receiver). Finally, each entity $j$ sends a message vector $m_{t,j \to i}$ to each entity $i$, and each entity $i$ sums up all of its incoming messages to update its contextual representation from $x_i^{ctx,t-1}$ to $x_i^{ctx,t}$, as described below.

Step 1. We build a joint representation for each node, by concatenating $x_i^{loc}$ and $x_i^{ctx,t-1}$ and their element-wise product (after linear mapping) as

$x_i^{joint,t} = \big[\, x_i^{loc};\; x_i^{ctx,t-1};\; (W_3 x_i^{loc}) \odot (W_4 x_i^{ctx,t-1}) \,\big]$   (4)

Step 2. We compute the directed connection weights $w_{t,j,i}$ from node $j$ (the sender) to node $i$ (the receiver), conditioning on the textual command $c_t$ at iteration $t$. Here, the connection weights are normalized with a softmax function over $j$, so that the sender weights sum up to $1$ for each receiver, i.e. $\sum_{j=1}^{N} w_{t,j,i} = 1$ for all $i$, as follows:

$w_{t,j,i} = \underset{j}{\mathrm{softmax}}\Big( \big(W_5 x_j^{joint,t}\big)^{\top} \big(W_6 (x_i^{joint,t} \odot (W_7 c_t))\big) \Big)$   (5)

Step 3. Each node $j$ sends a message $m_{t,j \to i}$ to each node $i$, conditioning on the textual input and weighted by the connection weight $w_{t,j,i}$. Then, each node $i$ sums up the incoming messages and updates its context representation:

$m_{t,j \to i} = w_{t,j,i} \cdot W_8 \big( x_j^{joint,t} \odot (W_9 c_t) \big)$   (6)
$x_i^{ctx,t} = W_{10} \big[\, x_i^{ctx,t-1};\; \textstyle\sum_{j=1}^{N} m_{t,j \to i} \,\big]$   (7)

A naive implementation would involve $N^2$ pairwise vectors $m_{t,j \to i}$, which is inefficient for large $N$. We implement it more efficiently by building an $N$-row matrix containing the unweighted messages in Eqn. 6, which is left-multiplied by the edge weight matrix (whose $(i,j)$-th entry is $w_{t,j,i}$) to obtain the sums in Eqn. 7 for all nodes in a single matrix multiplication. With this implementation, we can train our LCGN model efficiently with $N$ as large as 196 in our experiments.
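
Below is a simplified PyTorch sketch of one language-conditioned message passing round, including the single-matrix-multiplication aggregation described above. The layer names, the exact way the command modulates the connection logits and message contents, and the toy dimensions are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageConditionedMessagePassing(nn.Module):
    """Sketch of Eqns. 4-7 for one round: text-conditioned edge weights,
    message construction, and context update via a single matmul."""

    def __init__(self, loc_dim, ctx_dim, cmd_dim):
        super().__init__()
        joint_dim = loc_dim + 2 * ctx_dim
        self.map_loc = nn.Linear(loc_dim, ctx_dim)     # linear maps for the
        self.map_ctx = nn.Linear(ctx_dim, ctx_dim)     # element-wise product in Eqn. 4
        self.query = nn.Linear(joint_dim, ctx_dim)     # receiver transform (Eqn. 5)
        self.key = nn.Linear(joint_dim, ctx_dim)       # sender transform (Eqn. 5)
        self.cmd_to_ctx = nn.Linear(cmd_dim, ctx_dim)  # project the command c_t
        self.msg = nn.Linear(joint_dim, ctx_dim)       # message content (Eqn. 6)
        self.update = nn.Linear(2 * ctx_dim, ctx_dim)  # context update (Eqn. 7)

    def forward(self, x_loc, x_ctx, cmd):
        # x_loc: (B, N, loc_dim), x_ctx: (B, N, ctx_dim), cmd: (B, cmd_dim)
        c = self.cmd_to_ctx(cmd).unsqueeze(1)                                    # (B, 1, ctx_dim)
        joint = torch.cat([x_loc, x_ctx,
                           self.map_loc(x_loc) * self.map_ctx(x_ctx)], dim=-1)  # Eqn. 4
        queries = self.query(joint) * c                              # receivers, text-modulated
        keys = self.key(joint)                                       # senders
        logits = torch.bmm(queries, keys.transpose(1, 2))            # (B, N_recv, N_send)
        w = F.softmax(logits, dim=-1)                                # Eqn. 5: sender weights sum to 1
        messages = self.msg(joint) * c                               # unweighted messages (Eqn. 6)
        incoming = torch.bmm(w, messages)                            # weighted sums, one matmul (Eqn. 7)
        return self.update(torch.cat([x_ctx, incoming], dim=-1))     # updated x_ctx

# toy usage: B = 2 images, N = 4 entities
mp = LanguageConditionedMessagePassing(loc_dim=2048, ctx_dim=512, cmd_dim=512)
x_loc, cmd = torch.randn(2, 4, 2048), torch.randn(2, 512)
x_ctx = torch.zeros(2, 4, 512)   # simplification: the paper uses a learned initial context
for t in range(4):               # T rounds
    x_ctx = mp(x_loc, x_ctx, cmd)
x_out = torch.cat([x_loc, x_ctx], dim=-1)  # combined with the local feature (Eqn. 8 adds a linear layer)
```

In the toy loop above, the same command vector is reused in every round and the initial context is set to zeros for brevity; in the full model a different c_t from Eqn. 3 is supplied at each iteration and the initial context is a learned parameter.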

Final representation

We combine each entity's local feature $x_i^{loc}$ and context feature $x_i^{ctx,T}$ (after $T$ iterations) as its final representation $x_i^{out}$:

$x_i^{out} = W_{out} \big[\, x_i^{loc};\; x_i^{ctx,T} \,\big]$   (8)

The $x_i^{out}$ can be used as input to subsequent task-specific modules such as VQA or REF models, instead of the original local representation $x_i^{loc}$.

3.2 Application to VQA and REF

To apply our LCGN model to language-based reasoning tasks such as Visual Question Answering (VQA) and Referring Expression Comprehension (REF), we build simple task-specific output modules based on the language input and the contextualized representation of each entity. Our LCGN model and the subsequent task-specific modules are jointly trained end-to-end.

A single-hop answer classifier for VQA

The VQA task requires outputting an answer for an input image and a question. We adopt the commonly used classification approach and build a single-hop attention model as a classifier to select one of the possible answers from the training set.

First, the question is encoded into a vector $q$ with the Bi-LSTM in Eqn. 1. Then a single-hop attention $a_i$ is used over the objects to aggregate visual information, which is fused with $q$ to predict a score vector $y$ over the answers.

$a_i = \underset{i}{\mathrm{softmax}}\big( w_{vqa}^{\top} ( x_i^{out} \odot (W_{11} q) ) \big)$   (9)
$y = \mathrm{MLP}\big( \big[\, \textstyle\sum_{i=1}^{N} a_i\, x_i^{out};\; q \,\big] \big)$   (10)

During training, a softmax classification loss is applied on the output scores for answer classification.
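
A minimal sketch of this single-hop output module (Eqns. 9-10), with hypothetical layer names and sizes; the exact fusion of the question vector with the visual features is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHopClassifier(nn.Module):
    """Sketch of the single-hop VQA output module (Eqns. 9-10): one attention
    hop over the context-aware entity features, fused with the question vector."""

    def __init__(self, feat_dim, q_dim, num_answers, hidden=512):
        super().__init__()
        self.proj_q = nn.Linear(q_dim, feat_dim)
        self.att = nn.Linear(feat_dim, 1)
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers))

    def forward(self, x_out, q):
        # x_out: (B, N, feat_dim) context-aware features; q: (B, q_dim) question vector
        logits = self.att(x_out * self.proj_q(q).unsqueeze(1)).squeeze(-1)  # Eqn. 9 attention logits
        a = F.softmax(logits, dim=1)                                        # attention over entities
        attended = (a.unsqueeze(-1) * x_out).sum(dim=1)                     # (B, feat_dim)
        return self.classifier(torch.cat([attended, q], dim=-1))            # Eqn. 10: answer scores
```

The returned answer scores would then be trained with the softmax classification loss mentioned above (e.g. nn.CrossEntropyLoss).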

GroundeR [32] for REF

The REF task requires outputting a target bounding box as the grounding result for an input referring expression. Here, we use a retrieval approach as in previous works and select one target entity from the $N$ candidate entities in the scene (either object detection results or spatial locations on a convolutional feature map). To select the target object from the candidates, we encode the expression into a vector $q$ as in Eqn. 1 and build a model similar to the fully-supervised version of GroundeR [32] to output a matching score $s_i^{grd}$ for each entity $i$. In the case of using spatial locations on a convolutional feature map, we further output a 4-dimensional vector $b_i$ to predict the bounding box offset from the feature grid location.

$s_i^{grd} = w_{grd}^{\top} \big( x_i^{out} \odot (W_{12} q) \big)$   (11)
$i^{*} = \arg\max_i\; s_i^{grd}$   (12)
$b_i = W_{bbox}\, x_i^{out} + b_{bbox}$   (13)

During training, we use a softmax loss over the scores $s_i^{grd}$ among the $N$ candidates to select the target entity $i^{*}$, and an L2 loss over the box offset $b_{i^{*}}$ to refine the box location.
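
A corresponding sketch of this GroundeR-style output module (Eqns. 11-13), again with hypothetical layer names; predicting an offset at every grid location and supervising only the target location is our simplification:

```python
import torch
import torch.nn as nn

class GroundeRHead(nn.Module):
    """Sketch of the REF output module (Eqns. 11-13): a matching score per
    entity plus a 4-d box offset when entities are feature grid locations."""

    def __init__(self, feat_dim, q_dim):
        super().__init__()
        self.proj_q = nn.Linear(q_dim, feat_dim)
        self.score = nn.Linear(feat_dim, 1)
        self.offset = nn.Linear(feat_dim, 4)  # (dx, dy, dw, dh) relative to the grid cell

    def forward(self, x_out, q):
        # x_out: (B, N, feat_dim) context-aware features; q: (B, q_dim) expression vector
        fused = x_out * self.proj_q(q).unsqueeze(1)
        scores = self.score(fused).squeeze(-1)   # Eqn. 11: matching score per entity
        target = scores.argmax(dim=1)            # Eqn. 12: selected entity (used at test time)
        offsets = self.offset(x_out)             # Eqn. 13: box offsets for all locations
        return scores, target, offsets
```

At test time the box at the arg-max location, refined by its predicted offset, is returned as the grounding result.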

4 Experiments

We apply our LCGN model to two tasks, VQA and REF, for language-conditioned reasoning. For the VQA task, we evaluate on the GQA dataset [17] and the CLEVR dataset [18], which both require resolving relations between objects. For the REF task, we evaluate on the CLEVR-Ref+ dataset [24]. In particular, the CLEVR and CLEVR-Ref+ datasets contain many complicated questions or expressions with higher-order relations, such as "the ball on the left of the object behind a blue cylinder".

4.1 Visual Question Answering (VQA)

Evaluation on the GQA dataset

We first evaluate our LCGN model on the GQA dataset [17] for visual question answering. The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset [21] and balanced question-answer pairs. Each training and validation image is also associated with scene graph annotations describing the classes and attributes of the objects in the scene, and their pairwise relations. Along with the images and question-answer pairs, the GQA dataset provides two types of pre-extracted visual features for each image: convolutional grid features extracted from a ResNet-101 network [11] trained on ImageNet, and object detection features from a Faster R-CNN detector [31] (where the number of detected objects per image is at most 100).

We apply our LCGN model together with the single-hop classifier ("single-hop + LCGN") in Sec. 3.2 for answer prediction. We use multiple rounds of message passing in our LCGN model, which takes approximately 20 hours to train using a single Titan Xp GPU. As a comparison to the context-aware representation $x_i^{out}$ from our LCGN model, we also train the single-hop classifier with only the local features $x_i^{loc}$ in Eqn. 9 ("single-hop").

Method                      val      test-dev   test
CNN+LSTM [17]               49.2%    n/a        46.6%
Bottom-Up [1]               52.2%    n/a        49.7%
MAC [16]                    57.5%    n/a        54.1%
single-hop                  59.7%    50.7%      51.5%
single-hop + LCGN (ours)    63.8%    55.6%      56.0%
Table 1: VQA accuracy on the GQA dataset.

We first experiment with using the released object detection features in the GQA dataset as our local features $x_i^{loc}$, which are shown in [17] to perform better than the convolutional grid features, and compare with previous works. (We learned from the GQA dataset authors that its test-dev and test splits were collected differently from its train and val splits, with a noticeable domain shift causing a performance drop from val to test-dev and test. We train on the train split and report results on three GQA splits: val, test-dev and test. The performance of previous work on val was obtained from the dataset authors.) The results are shown in Table 1. Comparing "single-hop + LCGN" with "single-hop" in the last two rows, it can be seen that our LCGN model brings over 4% (absolute) improvement in accuracy, indicating that our LCGN model facilitates reasoning by replacing the local features with contextualized features that carry rich relational information for the reasoning task. Figure 3 shows question answering examples from our model on this dataset.

We compare to three previous approaches in Table 1. CNN+LSTM [17] and Bottom-Up [1] are simple fusion approaches between the text and the image, using the released GQA convolutional grid features or object detection features, respectively. The MAC model [16] is a multi-step attention and memory model with specially designed control, reading and writing cells, and is trained on the same object detection features as our model. Our approach outperforms the MAC model that performs multi-step inference, obtaining state-of-the-art results on the GQA dataset.

Method               Local features                   Accuracy (val)   Accuracy (test-dev)
single-hop           convolutional grid features      52.4%            45.3%
single-hop + LCGN    convolutional grid features      55.3%            49.5%
single-hop           object features from detection   59.7%            50.7%
single-hop + LCGN    object features from detection   63.8%            55.6%
single-hop           GT objects and attributes        84.6%            n/a
single-hop + LCGN    GT objects and attributes        90.2%            n/a
Table 2: Ablation on different local features on the GQA dataset.

We further apply our LCGN model to other types of local features, and experiment with using either the same convolutional grid features as used in CNN+LSTM in Table 1 (where each $x_i^{loc}$ is the feature vector at the $i$-th spatial location on the grid) or an "oracle" symbolic local representation at both training and test time, based on a set of ground-truth objects along with their class and attribute annotations ("GT objects and attributes") in the scene graph data of the GQA dataset. In the latter setting with symbolic representation, we construct two one-hot vectors to represent each object's class and attributes, and concatenate them as each object's $x_i^{loc}$. (In this setting, we can only evaluate on the val split with public scene graph annotations. We note that this is the only setting where we use the scene graphs in the GQA dataset. In all other settings, we only use the images and question-answer pairs to train our models. Also, our model does not rely on the GQA question semantic step annotations in any setting.) The results are shown in Table 2, where our LCGN model delivers consistent improvements over all three types of local feature representations.
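
As an illustration of the "GT objects and attributes" setting, the sketch below builds such a symbolic local feature from hypothetical scene graph entries. The dictionary format and vocabulary mappings are assumptions, and we use a multi-hot attribute vector here for illustration since an object may carry several attributes.

```python
import torch

def symbolic_local_features(objects, class_to_idx, attr_to_idx):
    """Sketch: each object's local feature is a one-hot class vector concatenated
    with attribute indicators built from (hypothetical) scene graph annotations."""
    num_classes, num_attrs = len(class_to_idx), len(attr_to_idx)
    feats = torch.zeros(len(objects), num_classes + num_attrs)
    for i, obj in enumerate(objects):
        feats[i, class_to_idx[obj["class"]]] = 1.0          # one-hot class
        for a in obj.get("attributes", []):
            feats[i, num_classes + attr_to_idx[a]] = 1.0    # attribute indicators
    return feats  # (N, num_classes + num_attrs), used as x_i^{loc}

# toy example with a hypothetical two-object scene
feats = symbolic_local_features(
    [{"class": "mug", "attributes": ["white"]}, {"class": "plate", "attributes": []}],
    class_to_idx={"mug": 0, "plate": 1}, attr_to_idx={"white": 0, "blue": 1})
```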

Evaluation on the CLEVR dataset

We also evaluate our LCGN model on the CLEVR dataset [18], a dataset for VQA with complicated relational questions, such as "what number of other objects are there of the same size as the brown shiny object?". Following previous works, we use the convolutional grid features extracted from the C4 block of an ImageNet-pretrained ResNet-101 network [11] as the local features $x_i^{loc}$ on the CLEVR dataset (i.e. each $x_i^{loc}$ is a 1024-dimensional vector and $N = 196$).

Similar to our experiments on the GQA dataset, we apply our LCGN model together with the single-hop answer classifier and compare it with using only the local features in the answer classifier. We also compare to previous works that also use only question-answer pairs as supervision (without relying on the functional program annotations in [18]).

The results are shown in Table 3. It can be seen that the single-hop classifier only achieves 72.6% accuracy when using the local convolutional grid features ("single-hop"), which is unsurprising since the CLEVR dataset often involves resolving multiple and higher-order relations beyond the capacity of the single-hop classifier alone. However, when trained together with the context-aware representation from our LCGN model, this same single-hop classifier ("single-hop + LCGN") achieves a significantly higher accuracy of 97.9%, comparable to several state-of-the-art approaches on this dataset, showing that our LCGN model is able to embed relational context information in its output scene representation $x_i^{out}$. Among previous works, Stack-NMN [13] and MAC [16] rely on multi-step inference procedures to predict an answer. RN [33] pools over all pairwise object-object vectors to collect relational information in a single step. FiLM [29] modulates the batch normalization parameters of a convolutional network with the input question. NS-CL [26] learns symbolic representations of the scene and uses quasi-logical reasoning. Except for Stack-NMN [13], most previous works are tailored to the VQA task, and it is non-trivial to apply them to other tasks such as REF, while our LCGN model provides a generic scene representation applicable to multiple tasks. Figure 4 shows question answering examples of our model.

Method Accuracy
Stack-NMN [13] 93.0%
RN [33] 95.5%
FiLM [29] 97.6%
MAC [16] 98.9%
NS-CL [26] 99.2%
single-hop 72.6%
single-hop + LCGN (ours) 97.9%
Table 3: VQA performance on the test split of the CLEVR dataset. We use multiple rounds of message passing in our LCGN model.

We further experiment with varying the number of message passing iterations $T$ in our LCGN model. In addition, to isolate the effect of conditioning on textual inputs during message passing, we also train and evaluate a restricted version of LCGN without text conditioning ("single-hop + LCGN w/o txt"), by replacing the $c_t$'s from Eqn. 3 with a vector of all ones. The results are shown in Table 4, where it can be seen that using multiple rounds of message passing ($T > 1$) leads to a significant performance increase, and it is crucial to incorporate the textual information into the message passing procedure. This is likely because the CLEVR dataset involves complicated questions that need multi-step context propagation. In addition, it is more efficient to collect the specific relational context relevant to the input question, instead of building a scene representation with a complete and unconditional knowledge base of all relational information that any input question could query from.

Method Steps Accuracy
single-hop n/a 72.6%
single-hop + LCGN 94.0%
single-hop + LCGN 94.5%
single-hop + LCGN 96.4%
single-hop + LCGN 97.9%
single-hop + LCGN 96.9%
single-hop + LCGN w/o txt 78.6%
single-hop + LCGN w/ static weights 96.5%
Table 4: Ablation on iteration steps and whether to condition on the text or have dynamic weights, on the CLEVR validation split.

Given that multi-round message passing ($T > 1$) works better than using only a single round ($T = 1$), we further study whether it is beneficial to have dynamic connection weights $w_{t,j,i}$ in Eqn. 5 that can differ in each iteration, allowing an object to focus on different context objects in different rounds. As a comparison, we train a restricted version of LCGN with static connection weights ("single-hop + LCGN w/ static weights"), where we only predict the weights in Eqn. 5 for the first round $t = 1$ and reuse them in all subsequent rounds (i.e. setting $w_{t,j,i} = w_{1,j,i}$ for all $t > 1$). From the last row of Table 4 it can be seen that there is a performance drop when restricting to static connection weights predicted only in the first round, and we observe a similar (but larger) drop for the REF task in Sec. 4.2 and Table 5. This suggests that it is better to have dynamic connections during each iteration, instead of first predicting a fixed connection structure on which iterative message passing is performed (e.g. [28]).

4.2 Referring Expression Comprehension (REF)

Our LCGN model provides a generic approach to building context-aware scene representations and is not restricted to a specific task such as VQA. We also apply our LCGN model to the referring expression comprehension (REF) task, where given a referring expression that describes an object in the scene, the model is asked to localize the target object with a bounding box.

We experiment with the CLEVR-Ref+ dataset [24], which contains images similar to those in the CLEVR dataset [18] for VQA, and complicated referring expressions requiring relation resolution. On the CLEVR-Ref+ dataset, we evaluate on the bounding box detection task in [24], where the output is a bounding box of the target object and there is only one single target object described by the expression. A localization is considered correct if it overlaps with the ground-truth box with at least 50% IoU. As in our VQA experiments on the CLEVR dataset in Sec. 4.1, here we also use the convolutional grid features from the ResNet-101 C4 block as our local features (i.e. each $x_i^{loc}$ is 1024-dimensional and $N = 196$), with multiple rounds of message passing. The final target bounding box is predicted with the 4-dimensional bounding box offset vector in Eqn. 13 at the grid location selected in Eqn. 12.
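
For reference, this accuracy criterion can be sketched as a standard IoU check (the usual detection-style computation, not code from the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / max(area_a + area_b - inter, 1e-10)

def ref_accuracy(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of expressions whose predicted box overlaps the ground-truth
    box with IoU >= thresh (0.5 in the CLEVR-Ref+ detection protocol)."""
    correct = sum(iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes))
    return correct / len(gt_boxes)
```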

Method Accuracy
Stack-NMN [13] 56.5%
SLR [43] 57.7%
MAttNet [41] 60.9%
GroundeR [32] 61.7%
GroundeR + LCGN w/o txt 65.0%
GroundeR + LCGN w/ static weights 71.4%
GroundeR + LCGN (ours) 74.8%
Table 5: Performance on the CLEVR-Ref+ dataset for REF.

We apply our LCGN model to build a context-aware representation conditioned on the input referring expression, which is used as input to our implementation of the GroundeR approach [32] (Sec. 3.2) for bounding box prediction ("GroundeR + LCGN"). As a comparison, we train and evaluate the GroundeR model without our context-aware representation ("GroundeR"), using the local features $x_i^{loc}$ as inputs in Eqn. 11. Similar to our experiments on the CLEVR dataset for VQA in Sec. 4.1, we also ablate our LCGN model by not conditioning on the input expression in message passing ("GroundeR + LCGN w/o txt") or by using static connection weights predicted from the first round ("GroundeR + LCGN w/ static weights").

The results are shown in Table 5, where our context-aware scene representation from LCGN leads to approximately 13% (absolute) improvement in REF accuracy. Consistent with our observation on the VQA task, for the REF task we find it important for the message passing procedure to depend on the input expression, and to allow the model to have dynamic connection weights that can differ for each round. Our model outperforms previous work by a large margin, achieving state-of-the-art performance for REF on the CLEVR-Ref+ dataset. Figure 5 shows example predictions of our model on the CLEVR-Ref+ dataset.

In previous works, SLR [43] and MAttNet [41] are specifically designed for the REF task. SLR jointly trains an expression generation model (speaker) and an expression comprehension model (listener), while MAttNet relies on a modular structure for subject, location and relation comprehension. While Stack-NMN [13] is also a generic approach that is applicable to both the VQA task and the REF task, the major contribution of Stack-NMN is to construct an explicit step-wise inference procedure with compositional modules, and it relies on hand-designed module structures and local appearance-based scene representations. On the other hand, our work augments the scene representation with rich relational context. We show that our approach outperforms Stack-NMN on both the VQA and the REF tasks.


question: is the fence in front of the elephant green and metallic?   prediction: yes   ground-truth: yes

question: the frisbee is on what animal?   prediction: dog   ground-truth: dog

Figure 3: Examples from our LCGN model on the validation split of the GQA dataset for VQA. In the middle 4 columns, each red line shows an edge along the message passing paths (among the detected objects) where the connection edge weight exceeds a threshold. The blue star on each line is the sender node, and the line width corresponds to its connection weight. In the upper example, the person, the elephant and the fence propagate messages with each other, and the fence receives messages from the elephant. In the lower example, the frisbee collects messages from the dog as contextual information in multiple rounds, and is picked up by the single-hop classifier. The red star (along with the box) in the last column shows the object with the highest single-hop attention in Eqn. 9.

question: what color is the matte ball that is the same size as the gray metal thing?   prediction: yellow   ground-truth: yellow

question: how many other things are the same size as the yellow rubber ball?   prediction: 3   ground-truth: 3

Figure 4: Examples from our LCGN model on the validation split of the CLEVR dataset for VQA. The middle 4 columns show the connection edge weights similar to Figure 3, where the blue stars are the sender nodes. The last column shows the single-hop attention in Eqn. 9 over the feature grid. In the upper example, the matte ball (leftmost) first collects messages from the gray metal ball (of the same size), and then messages are propagated within the convolutional grids on the matte ball, possibly to refine the collected context from the gray ball. In the lower example, all four balls first propagate messages within the convolutional grids of each ball region, and then the three other balls (of the same size) receive messages from the rubber ball (leftmost) and are picked up by the single-hop classifier.

referring expression: any other things that are the same shape as the big matte thing(s)

referring expression: the second one of the cube(s) from right

Figure 5: Examples from our LCGN model on the validation split of the CLEVR-Ref+ dataset for REF. The middle 4 columns show the connection edge weights similar to Figure 3, where the blue stars are the sender nodes. The last column shows the selected target grid location on the spatial grid (the red star) in Eqn. 12, along with the ground-truth box (yellow) and the predicted box (red, from bounding box regression in Eqn. 13). In the upper example, the blue cube (the target object) first collects messages from the two other objects, and then further collects messages from the big matte green cube on the left (which has the same shape). In the lower example, the green cube checks for other cubes by collecting messages from things on its right.

5 Conclusion

In this work, we propose Language-Conditioned Graph Networks (LCGN), a generic approach to language-based reasoning tasks such as VQA and REF. Instead of building task-specific inference procedures, our LCGN model constructs rich context-aware representations of the scene through iterative message passing. Experimentally, we show that the context-aware representations from our LCGN model greatly improve over local appearance-based representations across various types of local features and multiple datasets, and that it is crucial for the message passing procedure to depend on the language inputs.

Acknowledgement

This work was partially supported by Berkeley AI Research, NSF, and DARPA XAI.

References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998, 2017.
  • [2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
  • [4] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [5] R. Cadene, H. Ben-younes, M. Cord, and N. Thome. MuRel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [6] S. Chang, J. Yang, S. Park, and N. Kwak. Broadcasting convolutional network for visual relational reasoning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 754–769, 2018.
  • [7] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta. Iterative visual reasoning beyond convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7239–7248, 2018.
  • [8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
  • [9] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR. org, 2017.
  • [10] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [12] R. Herzig, E. Levi, H. Xu, E. Brosh, A. Globerson, and T. Darrell. Classifying collisions with spatio-temporal action graph networks. arXiv preprint arXiv:1812.01233, 2018.
  • [13] R. Hu, J. Andreas, T. Darrell, and K. Saenko. Explainable neural computation via stack neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 53–69, 2018.
  • [14] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [15] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1115–1124, 2017.
  • [16] D. A. Hudson and C. D. Manning. Compositional attention networks for machine reasoning. In Proceedings of the International Conference on Learning Representation (ICLR), 2018.
  • [17] D. A. Hudson and C. D. Manning. GQA: A new dataset for compositional question answering over real-world images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [18] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
  • [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [20] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • [21] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [22] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016.
  • [23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [24] R. Liu, C. Liu, Y. Bai, and A. Yuille. CLEVR-Ref+: Diagnosing visual reasoning with referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [25] Y. Liu, R. Wang, S. Shan, and X. Chen. Structure inference net: object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6985–6994, 2018.
  • [26] J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In ICLR, 2019.
  • [27] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.
  • [28] W. Norcliffe-Brown, S. Vafeias, and S. Parisot. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pages 8344–8353, 2018.
  • [29] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.
  • [30] S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 401–417, 2018.
  • [31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [32] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer, 2016.
  • [33] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4974–4983, 2017.
  • [34] D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2017.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [36] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
  • [37] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [38] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [39] Y. Yao, J. Xu, F. Wang, and B. Xu. Cascaded mutual modulation for visual reasoning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
  • [40] K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pages 1039–1050, 2018.
  • [41] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1307–1315, 2018.
  • [42] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
  • [43] L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7282–7290, 2017.
  • [44] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
  • [45] C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma. Structured attentions for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 1291–1300, 2017.

Appendix A Implementation details

In our implementation, we use a fixed dimensionality for the textual vectors (the word representations $h_s$, the summary vector $q$, and the commands $c_t$), and a fixed dimensionality for the context features $x_i^{ctx,t}$ of each entity. On the GQA dataset, we first reduce the dimensionality of the input local features (convolutional grid features, object detection features, or GT objects and attributes in Table 2 of the main paper) to the same dimensionality as the context features with a single fully-connected layer (without non-linearity). During training, we use the Adam optimizer [19]. On the CLEVR dataset and the CLEVR-Ref+ dataset, we first apply a small two-layer convolutional network on the ResNet-101-C4 features, so that the feature dimensionality at each location on the output feature map is also reduced to the same value. We again train with the Adam optimizer [19].
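
A hypothetical sketch of the input-feature projections described above. Only the overall structure (a single fully-connected layer for the GQA features, and a small two-layer convolutional network on the ResNet-101-C4 maps) follows the text; the target dimensionality, input feature sizes, kernel sizes, and non-linearity are placeholders.

```python
import torch
import torch.nn as nn

d = 512  # placeholder target dimensionality, not the paper's actual value

# GQA: project local features to d with a single FC layer, without non-linearity
project_gqa = nn.Linear(2048, d)   # 2048 is an illustrative input feature size
obj_feats = project_gqa(torch.randn(2, 100, 2048))   # (B, N_objects, d) local features

# CLEVR / CLEVR-Ref+: a small two-layer conv net on the 1024-channel C4 feature map
# (kernel sizes and the ELU non-linearity are our own choices)
project_clevr = nn.Sequential(
    nn.Conv2d(1024, d, kernel_size=3, padding=1), nn.ELU(),
    nn.Conv2d(d, d, kernel_size=3, padding=1))

feat_map = torch.randn(2, 1024, 14, 14)                          # toy C4 map (14x14 grid, N = 196)
grid_feats = project_clevr(feat_map).flatten(2).transpose(1, 2)  # (B, N=196, d) local features
```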

The shapes of the parameters in our Language-Conditioned Graph Networks (LCGN) and task-specific output modules are shown in Table 6. All our models are trained using a single Titan Xp GPU.

Table 6: The shape of each parameter in our LCGN model and the task-specific output modules, grouped into four blocks: textual command extraction, language-conditioned message passing, the single-hop answer classifier for VQA, and GroundeR for REF. All parameters are shared across different time steps $t$, except for the per-iteration command transform in Eqn. 2, which is learned separately for each iteration.

Appendix B Additional visualization examples

Figures 6 and 7 show additional visualization examples for the VQA task on the GQA dataset and the CLEVR dataset, respectively. Figure 8 shows additional examples for the REF task on the CLEVR-Ref+ dataset.


question: are there carts near the pond?   prediction: yes   ground-truth: yes

question: what color is the flag?   prediction: white   ground-truth: white

question: what type of vehicle is in front of the hanging wires?   prediction: train   ground-truth: train

question: on what does the man sit?   prediction: bench   ground-truth: bench

question: are there both a tennis ball and a racket in the image?   prediction: yes   ground-truth: yes

question: what vehicle is on the highway?   prediction: truck   ground-truth: ambulance

question: who is holding the umbrella?   prediction: woman   ground-truth: lady

Figure 6: Additional examples from our LCGN model on the validation split of the GQA dataset for VQA. In the middle 4 columns, each red line shows an edge along the message passing paths (among the detected objects) where the connection edge weight exceeds a threshold. The blue star on each line is the sender node. In these examples, the objects of interest receive messages from other objects through the connections with high weights (the red lines). The red star (along with the box) in the last column shows the object with the highest attention in the single-hop VQA classifier in Sec. 3.2 of the main paper. The last two rows show two failure examples on the GQA dataset. Some failure cases are due to ambiguity in the answers in the GQA dataset (e.g. "woman" vs. "lady" in the last example).

question: there is a small gray block ; are there any spheres to the left of it?   prediction: yes   ground-truth: yes

question: is the purple thing the same shape as the large gray rubber thing?   prediction: no   ground-truth: no

question: do the large metal sphere and the matte block have the same color?   prediction: yes   ground-truth: yes

question: is there anything else that has the same material as the red thing?   prediction: yes   ground-truth: yes

question: is there any other thing that is the same color as the cylinder?   prediction: no   ground-truth: no

question: what number of other objects are there of the same size as the gray sphere?   prediction: 5   ground-truth: 5

question: is the number of small cylinders behind the cyan thing greater than the number of cubes that are behind the green block?   prediction: yes   ground-truth: no

question: how many other objects are the same shape as the purple metallic thing?   prediction: 6   ground-truth: 7

Figure 7: Additional examples from our LCGN model on the validation split of the CLEVR dataset for VQA. The middle 4 columns show the connection edge weights similar to Figure 6, where the blue stars are the sender nodes. The last column shows the attention in the single-hop VQA classifier in Sec. 3.2 of the main paper over the feature grid. In these examples, the relevant objects in the question usually first propagate messages within the convolutional grids of the same object (possibly to form an object representation from the CNN features), and then the object of interest tends to collect messages from other context objects. The last two rows show two failure examples on the CLEVR dataset.

referring expression: any other yellow shiny objects that have the same size as the first one of the objects from front

referring expression: any other tiny objects that have the same material as the third one of the objects from left

referring expression: the second one of the things from left

referring expression: any other matte things that have the same shape as the first one of the red metal things from right

referring expression: the first one of the things from front that are on the right side of the first one of the purple spheres from front

referring expression: the second one of the shiny objects from front

referring expression: any other matte things of the same shape as the fifth one of the rubber things from right

referring expression: look at sphere that is right of the first one of the things from front; the second one of the objects from right that are in front of it

Figure 8: Additional examples from our LCGN model on the validation split of the CLEVR-Ref+ dataset for REF. The middle 4 columns show the connection edge weights similar to Figure 6, where the blue stars are the sender nodes. The last column shows the selected target grid location on the spatial grid (the red star) in the GroundeR model in Sec. 3.2 of the main paper, along with the ground-truth (yellow) box and the predicted box (red box from bounding box regression in GroundeR). In these examples, the objects of interest tend to collect messages from other context objects. The last two rows show two failure examples on the CLEVR-Ref+ dataset.