Progressive Reasoning by Module Composition

Seung Wook Kim        Makarand Tapaswi           Sanja Fidler
Department of Computer Science, University of Toronto
Vector Institute, Canada
{seung,makarand,fidler}@cs.toronto.edu
Abstract

Humans learn to solve tasks of increasing complexity by building on top of previously acquired knowledge. Typically, there exists a natural progression in the tasks that we learn – most do not require completely independent solutions, but can be broken down into simpler subtasks. We propose to represent a solver for each task as a neural module that calls existing modules (solvers for simpler tasks) in a program-like manner. Lower modules are a black box to the calling module, and communicate only via a query and an output. Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output. Each module also contains a residual component that learns to solve aspects of the new task that lower modules cannot solve. Our model effectively combines previous skill-sets, does not suffer from forgetting, and is fully differentiable. We test our model in learning a set of visual reasoning tasks, and demonstrate state-of-the-art performance in Visual Question Answering, the highest-level task in our task set. By evaluating the reasoning process using non-expert human judges, we show that our model is more interpretable than an attention-based baseline.

 


1 Introduction

Humans acquire skills and knowledge in a curriculum by building on top of previously acquired knowledge. For example, in school we first learn simple mathematical operations such as addition and multiplication before moving on to solving equations. Similarly, the ability to answer complex visual questions requires the skills to understand attributes such as color, recognize a variety of objects, and be able to spatially relate them. Just like humans, machines may also benefit from learning tasks of progressively increasing complexity and composing the acquired knowledge along the way.

The process of training a machine learning model to solve multiple tasks, or multi-task learning (MTL), has been widely studied [5, 20, 25, 26, 27]. The dominant approach is to have a model that shares parameters (e.g., the bottom layers of a CNN) but has individualized prediction heads [5, 20]. By sharing parameters, the goal is to obtain a better data representation that is task-agnostic. However, the tasks themselves remain disconnected and are not combined to solve tasks of increasing complexity. Ideally, a model for one task would directly learn to process the predictions of other tasks, thereby fully exploiting the benefits of multi-task learning.

We address the problem of MTL where tasks naturally progress in complexity. Neural Module Networks (NMN) [3], which call a sequence of modules to solve a complex task, showed that modules with architectures tailored to a subproblem improve performance on the higher-level task. In particular, [3, 11] addressed Visual Question Answering (VQA), where, given a question, a sequence of modules such as describe-region or find-object is called. This sequence was either parsed from the question [3] or learned via policy gradient optimization [11].

In this paper, we propose Progressive Module Networks (PMN), a framework for multi-task learning by progressively designing modules on top of existing modules. Each module is a neural network that can query modules for lower-level tasks, which in turn may query modules for even simpler tasks. The modules communicate by learning to query (input) and process outputs, while the internal module processing remains a blackbox. This is similar to a computer program that uses available libraries without having to know their internal operations. Parent modules can choose which lower-level modules they want to query via a soft gating mechanism. Additionally, each module also has a “residual” submodule that learns to address aspects of the new task that low-level modules cannot.

Our model can be seen as a generalization of NMN. PMN is compositional, i.e. modules build on modules which build on modules, and is fully differentiable. This modularity allows efficient use of data by not needing to re-learn previously acquired knowledge. By learning selective information flow between modules, interpretability arises naturally.

We demonstrate PMN on a set of visual reasoning tasks such as counting, captioning, and visual question answering. Our compositional model outperforms a flat baseline on all tasks. We further analyze the interpretability of PMN’s reasoning process with non-expert human judges.

2 Related Work

Multi-task learning. The dominant approach to multi-task learning is to have a model that shares parameters in a soft [7, 30] or hard way [5]. Soft sharing refers to each task having independent weights that are constrained to be similar (e.g., via regularization [7] or the trace norm [30]), while hard sharing typically means that all tasks share a base network but have independent layers on top (e.g., [17, 22]). While sharing parameters helps to compute a task-agnostic representation that is not overfit to a specific task, tasks do not directly share information or exploit each other's predictions.

Bilen et al. [4] propose the Multinet architecture, in which tasks can interact with each other in addition to sharing image features. Multinet solves one task at each time step and appends the encoded output of each task to the existing data representation, starting from CNN image features; at the next time step, the new task thus uses an enriched data representation. A similar idea, Progressive Neural Networks (PNNs), is proposed by Rusu et al. [27]. PNNs use a new neural network for each task and are designed to prevent catastrophic forgetting: they transfer knowledge from previous tasks through lateral connections to the representations of previously learned tasks. In both Multinet and PNN, multiple tasks interact only indirectly, mainly to learn a better data representation. We go one step further by enabling direct task-wise interactions.

Module networks. As pioneering work on modular structure, NMN [3, 11] addresses VQA, where questions have a compositional structure. From an inventory of small network fragments, or modules, NMN produces a layout for assembling those modules for a given question. We extend their modularity idea further and treat each task as compositional. Our approach is more general and can be used for any task where an exploitable learning sequence exists.

Visual question answering. VQA has seen great progress in recent years: improved multimodal pooling functions [8], multi-hop attention [31], driving attention through both bottom-up and top-down schemes [2], and recurrently modeling attention between words and image regions [12] have been some of the important advances. There are also attempts to automatically generate programs, or sequences of modules, that yield a list of interpretable steps [11, 15] using policy gradient optimization. Our approach treats visual reasoning as a compositional multi-task problem, and shows that using sub-tasks compositionally can help improve performance and interpretability.

Figure 1: Overview of PMN. Rectangles with a single border denote terminal modules and double borders denote compositional modules. Each arrow represents a communication. Left: general architecture of a compositional module. Given an input and the environment variables, it calls lower-level modules, gathers information, and produces an output; red arrows indicate the currently active call. For clarity, not all connections to the other modules are shown. Right: an example computation graph for PMN with four tasks. Note that a module need not call every lower-level module directly; it uses a lower-level module only when necessary.

3 Progressive Module Networks

Most complex reasoning tasks can be broken down into a series of sequential reasoning steps. We hypothesize that there exists a hierarchy with regards to complexity and order of execution: high-level tasks (e.g., counting) are more complex and benefit from leveraging outputs from lower-level tasks (e.g., classification). For each task, Progressive Module Networks (PMN) learn a module that requests and uses information from lower modules to aid in solving the given task. Crucially, this process is compositional: lower-level modules may call modules at an even lower level. Solving a task is thus equivalent to executing a directed acyclic computation graph where each node represents a module. This is schematically shown in Fig. 1 (right).

Formally, a module for a task at level ℓ can query other modules at levels k such that k < ℓ. Each module is designed to solve a particular task (i.e., to produce its best prediction) given an input query and the environment. The environment is accessible to every module and represents the broader set of “sensory” information available to the model; for example, it may contain visual information such as an image, and text in the form of words (e.g., a question).

PMN has two types of modules: (i) terminal modules, which execute the simplest tasks that do not require information from other modules; and (ii) compositional modules, which learn to efficiently communicate with and exploit lower-level modules to solve a task. We describe them in detail next.

3.1 Terminal Modules

Terminal modules are by definition at the lowest level (level 0). They are analogous to base cases in a recursive function. Given an input query, a terminal module directly generates an output, where the mapping from query to output is implemented with a neural network. A typical example of a terminal module is an object classifier that takes a visual descriptor as input and predicts an object label.
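For concreteness, a terminal module can be as simple as a small classifier over region features. The sketch below uses sizes mentioned later in the paper (2048-d region features, a 300-d embedding, 1,600 object classes), but the layers themselves are illustrative rather than the authors' exact implementation.

```python
import torch.nn as nn

class ObjectModule(nn.Module):
    """Terminal module sketch: maps a region's visual descriptor to an object prediction."""

    def __init__(self, d_visual=2048, d_embed=300, n_classes=1600):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(d_visual, d_embed), nn.ReLU())
        self.classify = nn.Linear(d_embed, n_classes)  # used when training this module on its own task

    def forward(self, x, return_logits=False):
        h = self.embed(x)                              # penultimate vector (what higher modules consume)
        return self.classify(h) if return_logits else h
```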

3.2 Compositional Modules

A compositional module makes a sequence of calls to lower level modules which in turn make calls to their children. Fig. 1 visualizes the structure of the module, which we explain in detail next.

We denote by S the ordered list of modules that a compositional module is allowed to call; every module in S has a level strictly lower than that of the calling module. Since the lower-level modules need not be sufficient to fully solve the new task, we also add a special terminal module, the residual module, which performs such “residual” reasoning.

The compositional aspect of PMN means that modules in S can have their own hierarchy of calls. We make S an ordered list, and calls are made sequentially, starting with the first module in the list. This way, the information output by a particular module can be used when generating the query for the next. For example, if one module performs object detection, we may want to use its output (extracted box proposals) to query other modules such as an attribute classifier. Note that we append the residual module last on the list S, so that it can maximally exploit the information gathered from the other modules.

Notice that the number of back-and-forth communications grows exponentially if each module makes use of every lower-level module. In practice, we therefore restrict the list S to those lower-level modules that intuitively make sense for the task. We emphasize that the module can still (softly) choose among them; our only hard choice is to remove lower-level modules that are redundant or uninformative for the task.

1: function Module(x)  ▷ the environment E and the module list S are global variables
2:     s_0 ← StateInit(x)  ▷ initialize the state variable
3:     for t = 1 to T do  ▷ T is the maximum time step
4:         g ← Importance(s_{t-1})  ▷ compute importance scores
5:         pad ← { }  ▷ wipe out the scratch pad
6:         for i = 1 to |S| do  ▷ S is the sequence of lower modules
7:             q_i ← Transmit_i(s_{t-1}, pad)  ▷ produce the query for module S_i
8:             o_i ← S_i(q_i)  ▷ call module S_i, generate its output
9:             õ_i ← Receive_i(o_i)  ▷ receive and project the output
10:            pad ← pad ∪ {(g_i, õ_i)}  ▷ write gate and projected output to the pad
11:        s_t ← Update(s_{t-1}, pad)  ▷ update the module state
12:    y ← Predict(s_0, …, s_T)  ▷ produce the output
13:    return y
Algorithm 1: Computation performed by our Progressive Module Network, for one module

A compositional module keeps track of a state variable at each time step t, which contains useful information obtained by querying other modules. For example, the state can be the hidden state of an RNN. For the VQA task, we choose the state to be a tuple consisting of the question vector for the current time step and the information accumulated up to time t.

Each module also has a scratch pad to store the outputs it receives from its list of lower modules S. The module accesses the pad to produce queries for lower-level modules and to update its state. At every time step, the scratch pad is wiped clean and the module steps through the list of modules in S.

State initializer.   Given the task query (input), the initial state is produced by a state initializer. It could be a simple MLP or an assignment function, as in our VQA implementation, which sets the initial state to the question vector paired with a zero vector. Details are in Appendix A.

Importance function.   For each module (including the residual module) in S, we compute an importance score with an importance function. Its purpose is to let the calling module (softly) choose which modules to use, and it allows us to train the model end-to-end with back-propagation. Since the scores are input dependent, the module can effectively control which lower-level modules to call at which time step. The importance function can be implemented as an MLP followed by either a softmax over the submodules or a sigmoid that outputs a score for each submodule. Note, however, that the proposed setup could be modified to completely ignore lower modules that are deemed unimportant (e.g., using a threshold or sampling, and adopting RL).

Query transmitter and receiver.   A query for each lower module is produced by a query transmitter, and the output received from that module is transformed by a receiver function. One can think of these functions as interpreting and translating the inputs and outputs into the calling module's own “language”. The projected outputs are stored in the scratch pad. Note that the residual module is called directly, i.e., we do not use a query transmitter or a receiver for it.

State updater.   The module updates its internal state using a state updater, which completes one time step of the module's computation. The state updater can be a simple gated sum of the received outputs weighted by their importance scores. As a more complex example, our VQA module passes the gated sum as input to a GRU (whose hidden state is initialized from the initial state) that produces the query vector for the next time step; the new state is then the pair of this query vector and the gated sum.

Prediction function.   After the final time step, the module output is produced by a prediction function applied to the state(s). For example, this can be a mean-pool over an MLP applied to the state at every time step, or an MLP applied directly to the final state.

All module functions (the state initializer, importance function, query transmitters, receivers, state updater, residual module, and final predictor) are implemented as neural networks. Note that the exact form of these functions can differ across modules. The complete inference procedure of a PMN module is summarized in Algorithm 1. Module details are in Appendix A.
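To make the interface of Algorithm 1 concrete, the following is a minimal PyTorch-style sketch of a compositional module. The class, its layers, and their dimensions are illustrative only; the actual functions differ per module (Appendix A), and for simplicity the residual entry is treated here like any other submodule.

```python
import torch
import torch.nn as nn

class CompositionalModule(nn.Module):
    """Sketch of one PMN compositional module (illustrative, not the exact architecture)."""

    def __init__(self, sub_modules, d_query, d_state, d_out, n_steps=1):
        super().__init__()
        self.sub_modules = nn.ModuleList(sub_modules)  # lower-level modules, residual module last
        self.n_steps = n_steps
        n = len(sub_modules)
        self.state_init = nn.Linear(d_query, d_state)                          # state initializer
        self.importance = nn.Linear(d_state, n)                                # importance scores
        self.transmit = nn.ModuleList([nn.Linear(d_state, d_query) for _ in range(n)])
        self.receive = nn.ModuleList([nn.Linear(d_out, d_state) for _ in range(n)])
        self.update = nn.GRUCell(d_state, d_state)                             # state updater
        self.predict = nn.Linear(d_state, d_out)                               # prediction function

    def forward(self, x, env=None):
        s = torch.tanh(self.state_init(x))                                     # s_0
        for _ in range(self.n_steps):
            gates = torch.softmax(self.importance(s), dim=-1)                  # soft module choice
            pad = []                                                           # scratch pad, wiped each step
            for i, m in enumerate(self.sub_modules):
                q = self.transmit[i](s)                                        # query for the lower module
                o = m(q, env) if isinstance(m, CompositionalModule) else m(q)  # call the module
                pad.append(self.receive[i](o))                                 # project its output
            info = (gates.unsqueeze(-1) * torch.stack(pad, dim=1)).sum(dim=1)  # gated sum
            s = self.update(info, s)                                           # update module state
        return self.predict(s)
```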

Training.   We train our modules sequentially, from low-level to high-level tasks, one at a time. When training a higher-level module, the internal weights of the lower-level modules are not updated, thus preserving their performance on the original tasks. We do, however, train the weights of the residual module. The state initializer, importance function, query transmitters, receivers, state updater, and prediction function are trained by allowing gradients to pass through the lower-level modules. Thus, while the existing lower modules are held fixed, the new module learns to communicate with them via its query transmitters and receivers. The loss function depends on the task.
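A sketch of how this sequential training can be set up in practice; the helper and variable names are illustrative, and the optimizer settings simply mirror those reported in Appendix B.2.

```python
import torch

def train_new_module(new_module, lower_modules, data_loader, loss_fn, lr=5e-4, epochs=20):
    """Train a newly added PMN module while keeping lower modules frozen (illustrative sketch)."""
    # Freeze the internal weights of all previously trained modules.
    for module in lower_modules:
        for p in module.parameters():
            p.requires_grad_(False)

    # Optimize only the parameters that still require gradients: the new module's own
    # functions (state initializer, importance, transmitters, receivers, updater, residual, predictor).
    params = [p for p in new_module.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)

    for _ in range(epochs):
        for x, env, target in data_loader:
            pred = new_module(x, env)     # gradients still flow *through* the frozen lower modules
            loss = loss_fn(pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```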

4 Progressive Module Networks for Visual Reasoning

We present an example of how PMN can be applied to a set of tasks related to visual reasoning. In particular, we consider six tasks: object classification, attribute classification, relationship detection, object counting, image captioning, and visual question answering. We state the level at which we consider each task in the corresponding section below.

The sensory input that forms our environment consists of: (i) objects: image features, each with corresponding bounding box coordinates, extracted with Faster R-CNN [24]; and (ii) language: a vector representation of a word or a sentence (in our case, a question). Below, we discuss each task and the module designed to solve it. Note that when we say ‘attend’ or ‘attention map’, we refer to a soft-attention mechanism over the image regions. We only provide a rough idea of what each module does; further implementation details of all module architectures are provided in Appendix A.

4.1 Object and Attribute Classification

The tasks of object and attribute classification require naming the object inside a provided image crop, or providing the object's attributes (e.g., color). We treat these two tasks as the simplest and place the object and attribute classifiers as terminal modules at level 0.

Both modules take as input a visual descriptor for one or more bounding boxes. The object classification module consists of an MLP and produces the penultimate vector prior to classification; the attribute module has a similar structure. These are the only modules for which we do not use the direct output labels, as passing the penultimate features to higher-level tasks empirically gave better results.

4.2 Image Captioning

For the task of image captioning, one needs to produce a natural language description of the image. We design the captioning module as a compositional module that can exploit information from the object and attribute modules. We implement the state updater as a two-layer Gated Recurrent Unit (GRU) network whose hidden states form the module state. At each time step, the query transmitter attends over image regions using the hidden state of the first layer, similar to [2], and produces a query (the image vector at the attended location) for the lower-level modules, which supply nouns and adjectives when appropriate. Our module also has a residual submodule that processes other image-related semantic information. The outputs from the queried modules are projected to vectors of the same dimension by the receivers and stored in the scratch pad. Their gated sum, weighted by the importance scores, is used to update the module state.

The natural language sentence is obtained by producing a word at each time step using a fully connected layer on the hidden state of the second GRU layer of the state updater. Note that a strength of our framework is that we can treat captioning just like any other module.

4.3 Relationship Detection

This task requires one to produce triplets in the form of “subject - relationship - object” [21]. We re-purpose this task as one that involves finding the relevant item (region) in an image that is related to a given input through a given relationship.

The input to the module is a pair consisting of a one-hot encoding of the input box and a one-hot encoding of the relationship category (e.g., above, behind). The module produces an output corresponding to the box of the subject/object related to the input box through the given relationship. We place the relationship module on the first level, allowing it to use object and attribute information that can be useful for inferring relationships. We train the module using the cross-entropy loss.

4.4 Object Counting

Our next task is object counting. Given a vector representation of a natural language question (e.g., how many cats are in this image?), the goal of this module is to produce a count. As a by-product, it produces a relevance attention map over the image regions that can also be useful for other tasks.

Object counting is a higher-level task since it may also require understanding relationships between objects. We thus place the counting module on the second level and give it access to the object, attribute, and relationship modules. The module queries the object and attribute modules to obtain an enriched representation of each image feature, and finds relevant objects by sending this enriched representation through its residual submodule. It can also query the relationship module if the question requires relational reasoning. For example, “how many cats are on the blue chair?” requires counting cats on top of the blue chair. To answer such a question, we expect the query transmitter to produce a query for the relationship module that includes the chair bounding box and the relationship “on top of”, so that the relationship module returns boxes containing cats on the chair. Note that both the relationship module and the residual submodule produce attention maps; the state updater softly chooses between them via a softmax over their importance scores. For the prediction function, we adopt the counting algorithm of [33], which builds a graph representation from attention maps to count objects. The module returns the count vector (a representation of the number) together with the attention map.

4.5 Visual Question Answering

Our final and most complex task is Visual Question Answering (VQA). Given a vector representation of a natural language question, the VQA module uses the object, attribute, counting, and captioning modules (the relationship module is exploited implicitly through the counting module). The VQA module first queries the counting module, which produces a count vector and an attention map; we treat them as two separate entries in the scratch pad. The attention map is fed to the downstream modules by the query transmitters, and the received outputs are weighted by their importance scores. For the captioning module, which produces a caption for the whole image, the receiver attends over the words of the produced caption to find the relevant ones. We also use multiple glimpses (corresponding to multiple time steps of the module) similar to SAN [31]. The prediction function produces an output vector based on the question vector and the states at all time steps.

5 Experiments

We present experiments demonstrating the impact of learning modules progressively on three datasets (see Appendix B.1 for details): Visual Genome (VG) [18], VQA 2.0 [9], and MS-COCO [19]. All of these contain natural images and are thus much more challenging in visual and linguistic diversity than CLEVR [14], which contains synthetic scenes. Neural module networks [3, 11] show excellent performance on CLEVR, but their performance on natural images is well below the state of the art. For all datasets, we extract bounding boxes and their visual representations using a pretrained model from [2].

5.1 Progressive Learning of Tasks and Modules

Object and attribute classification.   We train these modules with annotated bounding boxes from the VG dataset. We follow [2] and use the 1,600 and 400 most commonly occurring object and attribute classes, respectively. Each extracted box is associated with the ground truth label of the object with the greatest overlap, and is ignored if no ground truth box has an IoU of at least 0.5. In this way, each box is annotated with one object label and zero or more attribute labels. The object module achieves 54.9% top-1 accuracy and 86.1% top-5 accuracy. We report mean average precision (mAP) for attribute classification, which is a multi-label classification problem: the attribute module achieves 0.14 mAP and 0.49 weighted mAP, where mAP is the mean over all classes and weighted mAP weights each class by its number of instances. As there are many redundant classes (e.g., car, cars, vehicle) and boxes have sparse attribute annotations, the accuracy may seem artificially low.

Image captioning.   We report results on the MS-COCO dataset for image captioning. We use the standard split from the 2014 captioning challenge to avoid data contamination with VQA 2.0 or VG. This split contains 30% less training data than the split proposed in [16], which most current works adopt. We report performance using the CIDEr [29] metric. A baseline (non-compositional) module achieves a strong CIDEr score of 108; using the object and attribute modules, we obtain 109 CIDEr. While this is not a big improvement, we suspect one reason is the limited vocabulary: MS-COCO has a limited set of object categories (80) and does not benefit much from modules that are trained on much more diverse data. We believe PMN would benefit more from a diverse captioning dataset with more object classes. Including higher-level modules in the captioning module would also be an interesting direction for future work.

Model composition          Object Acc. (%)   Subject Acc. (%)
BASE                       51.0              55.9
BASE + OBJ + ATT           53.4              57.8
Table 1: Performance of the relationship detection module.

Model composition          Acc. (%)
BASE                       45.4
BASE + OBJ + ATT           47.4
BASE + OBJ + ATT + REL     50.0
Table 2: Accuracy of the counting module.

Relationship detection.   We use the 20 most commonly occurring relationship categories, each defined by a set of words with similar meaning (e.g., in, inside, standing in). Relationship tuples of the form “subject - relationship - object” are extracted from Visual Genome [18, 21]. We train and validate the relationship detection module using 200K/38K train/val tuples whose subject and object boxes both overlap with the ground truth boxes (IoU ≥ 0.7). Table 1 reports results. Even though the accuracy is relatively low, the model errors are qualitatively reasonable; part of the gap is due to multiple answers being correct while only one ground truth answer is annotated.

Counting.   We extract questions starting with ‘how many’ from the VQA 2.0 dataset, which results in a training set of 50K questions. We additionally add 89K synthetic questions to our training set, formed on the VG dataset by counting the object boxes and generating ‘how many’ questions. This synthetic data increases accuracy by 1% for all module variants. Since the number of questions that involve both relational reasoning and counting (e.g., how many people are sitting on the sofa?) is limited, we also sample relational synthetic questions from each VG training image (e.g., how many plates on table?), which are used to train only the module communication parameters when the relationship module is included. Table 2 reports results. The improvement of the compositional module over the flat baseline is significant (4.6%).

Model composition                                   Accuracy (%)
BASE (no modules)                                   62.05 ± 0.11
BASE + OBJ + ATT                                    63.38 ± 0.05
intermediate compositions (adding REL, CNT, CAP)    63.64 ± 0.07, 64.06 ± 0.05, 64.36 ± 0.06
full composition (OBJ, ATT, REL, CNT, CAP)          64.68 ± 0.04
Table 3: Model ablation for VQA. We report mean ± std computed over three runs. The steady increase indicates that information from modules helps, and that PMN makes use of lower modules effectively. The base model does not use any modules. REL is included implicitly whenever CNT is used.
Model                     Ens | VQA 2.0 val (Y/N, Num, Other, All) | test-dev (Y/N, Num, Other, All) | test-std (Y/N, Num, Other, All)
Andreas [3] CVPR16*        –  | 73.38 33.23 39.93 51.62            | – – – –                         | – – – –
Yang [31] CVPR16*          –  | 68.89 34.55 43.80 52.20            | – – – –                         | – – – –
Teney [28] CVPR18          –  | 80.07 42.87 55.81 63.15            | 81.82 44.21 56.05 65.32         | 82.20 43.90 56.26 65.67
Teney [28] CVPR18          ✓  | – – – –                            | 86.08 48.99 60.80 69.87         | 86.60 48.64 61.15 70.34
Zhou [32] TNNLS18          –  | – – – –                            | 84.27 49.56 59.89 68.76         | – – – –
Zhou [32] TNNLS18          ✓  | – – – –                            | – – – –                         | 86.65 51.13 61.75 70.92
Zhang [33] ICLR18          –  | – 49.36 – 65.42                    | 83.14 51.62 58.97 68.09         | 83.56 51.39 59.11 68.41
baseline (ours)            –  | 80.28 43.06 53.21 62.05            | – – – –                         | – – – –
PMN (ours)                 –  | 82.48 48.15 55.53 64.68            | 84.07 52.12 57.99 68.07         | – – – –
PMN (ours)                 ✓  | – – – –                            | 85.74 54.39 60.60 70.25         | 86.34 54.26 60.80 70.68
Table 4: Comparing VQA accuracy of PMN with state-of-the-art models. Rows marked ✓ in the Ens column denote ensemble models. test-dev is the development test set and test-std is the standard test set for VQA 2.0. Entries marked * are taken from [1].
Figure 2: Example of PMN's module execution trace on the VQA task. For brevity, calls to the object and attribute modules by the relationship, counting, and captioning modules are not shown. Yellow circles denote execution order. Importance scores for the module outputs stored in the scratch pad are shown. For the captioning module, words highlighted more intensely in red are deemed more relevant by the receiver.

Visual Question Answering.   We present ablation studies on the val set of VQA 2.0 in Table 3. As seen, PMN strongly benefits from utilizing different modules. Note that all results here are without additional questions from the VG data.

To verify that the gain is not merely from increased model capacity, we trained a baseline model whose number of parameters approximately matches the total number of parameters of the full PMN model. This larger baseline also achieved 62.0%, indicating that the gain is not due to more parameters. Unlike the other modules, whose parameters are fixed, we fine-tune the counting module because it expects the same form of input: an embedding of a natural language question. The performance of the counting module depends crucially on the quality of its attention map over bounding boxes; by employing more questions from the whole VQA dataset, it obtains a better attention map, and its accuracy increases from 50.0% to 55.8% with fine-tuning. We also compare the performance of PMN on the VQA task with state-of-the-art models in Table 4. Although we start with a much lower baseline performance of 62.05% on the val set (vs. 65.42% [33] and 63.15% [28]), PMN's performance is on par with these models. For results on VQA val, models are trained on the train split; for test-dev and test-std, models are trained on both the train and val splits.

5.2 Interpretability Analysis

Visualizing the model's reasoning process. We present a qualitative analysis of the answering process. The VQA module first queries the counting module to obtain the attention map. In Fig. 2, the counting module sends the relationship module a query in which the box corresponds to the blue box ‘bird’ and the relationship corresponds to ‘on top of’. The figure shows that the counting module correctly chooses to use the relationship module's output rather than the output of its own residual submodule, since the question requires relational reasoning. With the attended green box obtained from the relationship module, the VQA module mostly uses the object and captioning modules to produce the final answer. The implementation details of this reasoning process are in Appendix A, and more examples are presented in Appendix B.3.

PMN correct   Baseline correct   # Q     Human rating (PMN)   Human rating (Baseline)
✓             ✓                  715     3.13                 2.86
✓             ✗                  584     2.78                 1.40
✗             ✓                  162     1.73                 2.47
✗             ✗                  139     1.95                 1.66
All                              1600    2.54                 2.24
Table 5: Average human judgments on a scale from 0 to 4. ✓ indicates that the model got the final answer right, and ✗ that it got it wrong.

Judging Answering Quality. The modular structure of PMN makes it easy to interpret the reasoning that led to the outputs. We conduct a human evaluation with Amazon Mechanical Turk on 1,600 randomly chosen questions. Each worker is asked to rate the explanation generated by the baseline model and the PMN like a teacher grades student exams. The baseline explanation is composed of the bounding box it attends to and the final answer. For PMN, we form a rule-based natural language explanation based on the prominent modules that it uses. An example is shown in Fig. 3.

Figure 3: Examples of PMN's reasoning process. Top: it correctly finds a person first and then uses the relationship module to find the tree behind him. Bottom: it finds the wire, uses the attribute module to correctly infer its attributes (white, long, electrical), and then outputs the correct answer.

Each question is assessed by three human workers. The workers are instructed to score how satisfactory the explanations are on a scale of 0 (very bad), 1 (bad), 2 (satisfactory), 3 (good), and 4 (very good). Incorrect reasoning steps are penalized, so if PMN produces wrong reasoning steps that do not lead to the correct answer, it can get a low score. On the other hand, the baseline model often scores well on simple questions that do not need complex reasoning (e.g., what color is the cat?).

We report results in Table 5 and show more examples in Appendix C. Human evaluators tend to give low scores to wrong answers and high scores to correct answers regardless of the explanations, but PMN always scores higher when both PMN and the baseline get a question correct, or both get it wrong. Interestingly, a correct answer from PMN is rated 1.38 points higher than a wrong baseline answer, whereas a correct baseline answer is rated only 0.74 points higher than a wrong PMN answer. This shows that PMN receives partial credit even when its final answer is wrong, since the evaluators are able to judge its reasoning steps.

Low Data Regime. PMN benefits from re-using modules and only needs to learn the communication between them. This allows us to achieve good performance even when using a fraction of the training data. Table 6 presents the absolute gain in accuracy that PMN achieves over the baseline; for this experiment, we use the VQA module. When the amount of data is very small, PMN does not help because there is not enough data to learn to communicate with the lower modules. The maximum gain is obtained when using 10% of the data. This shows that PMN can help in situations where training data is limited, since it can exploit previously learned knowledge from other modules; the gain then remains roughly constant at about 2%.

Fraction of VQA training data (%)    1       5      10     25     50     100
Absolute accuracy gain (%)           -0.49   2.21   4.01   2.66   1.79   2.04
Table 6: Absolute gain in accuracy when using a fraction of the training data.

6 Conclusion

In this work, we proposed Progressive Module Networks (PMN), which train task modules in a compositional manner by exploiting previously learned lower-level task modules. PMN can produce queries to call other modules and make use of the returned information to solve the current task. PMN is data efficient and provides a more interpretable reasoning process. It is also an important step towards more intelligent machines, as it can easily accommodate novel and increasingly complex tasks.

Acknowledgments. Partially supported by the DARPA Explainable AI (XAI) program and NSERC. We also thank NVIDIA for their donation of GPUs.

References

  • [1] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. In CVPR, 2018.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and Top-down Attention for Image Captioning and VQA. In CVPR, 2018.
  • [3] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural Module Networks. In CVPR, 2016.
  • [4] H. Bilen and A. Vedaldi. Integrated Perception with Recurrent Multi-task Neural Networks. In NIPS, 2016.
  • [5] R. Caruana. Multitask Learning: A Knowledge-Based Source of Inductive Bias. In ICML, 1993.
  • [6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078, 2014.
  • [7] L. Duong, T. Cohn, S. Bird, and P. Cook. Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser. In Association for Computational Linguistics (ACL), 2015.
  • [8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Empirical Methods in Natural Language Processing (EMNLP), 2016.
  • [9] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In CVPR, 2017.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
  • [11] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to Reason: End-to-End Module Networks for Visual Question Answering. In ICCV, 2017.
  • [12] D. A. Hudson and C. D. Manning. Compositional Attention Networks for Machine Reasoning. In ICLR, 2018.
  • [13] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167, 2015.
  • [14] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In CVPR, 2017.
  • [15] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and Executing Programs for Visual Reasoning. In ICCV, 2017.
  • [16] A. Karpathy and L. Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR, 2015.
  • [17] I. Kokkinos. UberNet: Training a ‘Universal’ Convolutional Neural Network for Low-, Mid-, and High-Level Vision using Diverse Datasets and Limited Memory. In CVPR, 2017.
  • [18] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv preprint arXiv:1602.07332, 2016.
  • [19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
  • [20] M. Long, Z. CAO, J. Wang, and P. S. Yu. Learning Multiple Tasks with Multilinear Relationship Networks. In NIPS, 2017.
  • [21] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual Relationship Detection with Language Priors. In ECCV, 2016.
  • [22] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-Stitch Networks for Multi-task Learning. In CVPR, 2016.
  • [23] J. Pennington, R. Socher, and C. Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In NIPS, 2015.
  • [25] S. Ruder. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv preprint, arXiv:1706.05098, 2017.
  • [26] S. Ruder, J. Bingel, I. Augenstein, and A. Sogaard. Learning what to share between loosely related tasks. arXiv preprint, arXiv:1705.08142, 2017.
  • [27] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive Neural Networks. arXiv preprint, arXiv:1606.04671, 2016.
  • [28] D. Teney, P. Anderson, X. He, and A. v. d. Hengel. Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge. In CVPR, 2018.
  • [29] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based Image Description Evaluation. In CVPR, 2015.
  • [30] Y. Yang and T. M. Hospedales. Trace Norm Regularised Deep Multi-Task Learning. In ICLR Workshop Track, 2017.
  • [31] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked Attention Networks for Image Question Answering. In CVPR, 2016.
  • [32] Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao. Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 2018.
  • [33] Y. Zhang, J. Hare, and A. Prügel-Bennett. Learning to Count Objects in Natural Images for Visual Question Answering. In ICLR, 2018.

Appendices

Appendix A Module architectures

We discuss the detailed architecture of each module. We first describe the shared environment and the soft-attention mechanism.

Environment.   The sensory input that forms our environment consists of: (i) objects: image features, each with corresponding bounding box coordinates, extracted with Faster R-CNN [24]; and (ii) language: a vector representation of a word or a sentence (in our example, a question).

Soft attention.   For all parts that use a soft-attention mechanism, an MLP is employed. Given a key vector q and a set of data vectors {v_i} to be attended, we compute

    \alpha_i = w\big( f_q(q) \odot f_v(v_i) \big),    (1)

where f_q and f_v are each a linear layer followed by a ReLU activation that project q and v_i into the same dimension, \odot combines the two projections into a joint representation, and w is a linear layer that projects the joint representation into a single number. Note that we do not specify a normalization here because a sigmoid, rather than a softmax, is applied to the scores \alpha_i in some cases.
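A minimal sketch of this attention block, assuming an element-wise product for the joint representation and a 512-d projection (both assumptions; the text does not fix these choices):

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Scores a set of vectors v_i against a key q (sketch; exact joint op is assumed)."""

    def __init__(self, d_key, d_val, d_hidden=512):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(d_key, d_hidden), nn.ReLU())  # project key
        self.f_v = nn.Sequential(nn.Linear(d_val, d_hidden), nn.ReLU())  # project values
        self.w = nn.Linear(d_hidden, 1)                                  # joint -> scalar score

    def forward(self, q, v, use_sigmoid=False):
        # q: (B, d_key), v: (B, N, d_val)
        joint = self.f_q(q).unsqueeze(1) * self.f_v(v)    # assumed element-wise joint representation
        scores = self.w(joint).squeeze(-1)                # (B, N)
        # softmax over regions by default; sigmoid for the cases noted above
        return torch.sigmoid(scores) if use_sigmoid else torch.softmax(scores, dim=-1)
```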

A.1 Object and attribute classification

The input to both modules is a visual descriptor for one or more bounding boxes in the image. Each module projects the visual feature to a 300-dimensional vector through a single-layer neural network followed by a non-linearity.

A.2 Image captioning

The captioning module takes a zero vector as the module input and produces a natural language sentence based on the environment (the detected image regions). It can call the object and attribute modules (plus its residual submodule) and runs for a maximum number of time steps or until it reaches the end-of-sentence token. The module is implemented similarly to the captioning model of Anderson et al. [2]. We employ a two-layer GRU [6] as the recurrent state updater, with the module state consisting of the hidden states of the first and second layers. Each layer has 1000-d hidden states.

State initializer.   The state initializer sets the initial hidden states of the GRU, i.e., the module state, to zero vectors.

Importance function.   Based on the current state, the importance function determines which lower-level modules would be useful. It is implemented as a linear layer (producing scores for the three modules) that takes the module state as input.

Query transmitter and receiver.   The query transmitter first selects an image region to look at based on the current module state. This is implemented with the soft-attention mechanism above, producing attention probabilities \alpha_i (via softmax) over the image features. The inputs to the attention function are the visual features v_i (for the i-th box) and the module state as the key vector. The query sent to the object module is then computed as

    x_{obj} = \sum_{i=1}^{N} \alpha_i v_i,    (2)

where N is the number of bounding boxes. The output from the object module is projected to a 1000-d vector by the receiver, which is a sequence of a linear layer, batch normalization (BN) [13], and a non-linearity. The exact same procedure, with different parameters, is used to query the attribute module. Both projected outputs are added to the scratch pad.

Residual computer.   The residual submodule processes its input through a sequence of a linear layer, BN, and a non-linearity, producing a 1000-d vector that is added to the scratch pad.

State updater.   As stated above, the state updater is a two-layer GRU. At time t, the first layer takes as input the average visual feature from the environment, the embedding vector of the previous word, and the previous hidden state of the second layer. At the first time step, the beginning-of-sentence embedding and a zero vector are used for the previous word and the second-layer hidden state, respectively. The second layer is fed the hidden state of the first layer as well as the information from the other modules,

    \omega_t = \sum_i \mathrm{softmax}(g)_i \; \tilde{o}_i,    (3)

which is a gated summation of the projected outputs \tilde{o}_i in the scratch pad with the softmaxed importance scores g. The new state is given by the updated hidden states of the two GRU layers.

Output generator.   The output is the sequence of words produced by the prediction function, a linear layer that projects the second-layer hidden state at each time step onto the output word vocabulary.
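Putting the pieces of A.2 together, one decoding step can be sketched roughly as below. The wiring follows the description above, but the attention width, the input to the importance function, and the module-calling interface are assumptions.

```python
import torch
import torch.nn as nn

class CaptioningStep(nn.Module):
    """One decoding step of the captioning module (illustrative sketch, not the exact model)."""

    def __init__(self, d_vis=2048, d_word=300, d_hid=1000, d_att=512, n_sub=3, vocab_size=10000):
        super().__init__()
        self.gru1 = nn.GRUCell(d_vis + d_word + d_hid, d_hid)  # layer 1: mean image feat + word + h2
        self.gru2 = nn.GRUCell(d_hid + d_hid, d_hid)           # layer 2: h1 + gated module info
        self.att_q = nn.Sequential(nn.Linear(d_hid, d_att), nn.ReLU())
        self.att_v = nn.Sequential(nn.Linear(d_vis, d_att), nn.ReLU())
        self.att_w = nn.Linear(d_att, 1)
        self.importance = nn.Linear(d_hid, n_sub)              # scores for obj / att / residual outputs
        self.word_out = nn.Linear(d_hid, vocab_size)           # projects h2 onto the word vocabulary

    def forward(self, v, w_prev, h1, h2, call_modules):
        # v: (B, N, d_vis) region features; w_prev: (B, d_word) previous word embedding
        # call_modules: callable that queries the object/attribute modules and the residual
        #               submodule with the attended image vector, returning (B, n_sub, d_hid)
        h1 = self.gru1(torch.cat([v.mean(dim=1), w_prev, h2], dim=-1), h1)
        scores = self.att_w(self.att_q(h1).unsqueeze(1) * self.att_v(v)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)                  # attention over image regions
        query = (alpha.unsqueeze(-1) * v).sum(dim=1)           # attended image vector (cf. Eq. 2)
        pad = call_modules(query)                              # scratch pad for this step
        gates = torch.softmax(self.importance(h1), dim=-1)     # softly choose among module outputs
        info = (gates.unsqueeze(-1) * pad).sum(dim=1)          # gated sum (cf. Eq. 3)
        h2 = self.gru2(torch.cat([h1, info], dim=-1), h2)
        return self.word_out(h2), h1, h2
```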

A.3 Relationship detection

The relationship detection task requires producing triplets in the form “subject - relationship - object” [21]. We re-purpose this task as one of finding the relevant item (region) in an image that is related to a given input through a given relationship. The input to the module is a pair consisting of a one-hot encoded input bounding box (whose entry for the input box is 1 and others 0) and a one-hot encoded relationship category (e.g., above, behind). The module can call the object and attribute modules and goes through N time steps, where N is the number of bounding boxes; at time step t, the module looks at the t-th box. It uses the object and attribute modules only as feature extractors for each bounding box, and therefore does not have a complex structure.

State initializer.   The state initializer projects the one-hot relationship category to a 512-dimensional vector with an embedding layer, and the resulting vector is set as the first state.

Importance function.   The importance function is not used in this module.

Query transmitter and receiver.   At time step t, the query transmitters pass the image vector corresponding to the t-th bounding box to the object and attribute modules. The receivers are identity functions, i.e., we do not modify the returned object and attribute vectors. The outputs are added to the scratch pad.

Residual computer.   The residual submodule projects the coordinates of the current box to a 512-dimensional vector, which is added to the scratch pad.

State updater.   At time step t, the state updater concatenates the visual feature of the current box with the outputs stored in the scratch pad. The concatenated vector is fed through an MLP, resulting in a 512-dimensional vector that becomes the new state.

Output generator.   The first state, which contains the relationship information, is multiplied element-wise with the state corresponding to the input box (the state at time step t corresponds to the t-th box). Let the resulting vector be the key. The output generator produces an attention map over all bounding boxes: the inputs to the attention function are the per-box states (i.e., all image regions) and this key vector. This attention map is the output of the relationship module.

A.4 Counting

Given a vector representation of a natural language question (e.g., how many cats are in this image?), the goal of this module is to produce a count. As a by-product, it produces a relevance attention map over the image regions that can also be useful for other tasks.

The input is a vector representing the natural language question. When training the counting module, this question vector is computed with a one-layer GRU with a 512-dimensional hidden state. The input to the GRU at each time step is the embedding of the corresponding question word; word embeddings are initialized with 300-dimensional GloVe word vectors [23] and fine-tuned thereafter. Similar to the visual features obtained through a CNN, the question vector is treated as an environment variable. The counting module can call the object, attribute, and relationship modules, and goes through only one time step.

State initializer.   The state initializer is a simple function that just sets the initial state to the question vector.

Importance function.   The importance function is implemented as a linear layer (producing scores for two modules: the relationship module and the residual submodule) that takes the question vector as input. The object and attribute modules act only as feature extractors and are always used (their weight is 1).

Query transmitter and receiver.   The query transmitter for the relationship module first computes a relationship category through an MLP that produces a 20-dimensional relationship vector (for the 20 relationship categories); the input to this MLP is the question vector. The transmitter also produces a soft attention map over the boxes, which can be thought of as the attention corresponding to ‘table’ in the question ‘How many cats are on the table?’. The visual features and the question vector (as the key) are fed into the attention function. The transmitter then passes the relationship vector and this attention map to the relationship module. The receiver is an identity function; the output, an attention map produced by the relationship module, is added to the scratch pad.

The query transmitters for the object and attribute modules pass all visual features to them. The receivers are identity functions, so the resulting outputs are the object and attribute feature vectors themselves.

Residual computer.   The residual submodule computes an attention map over the bounding boxes. The inputs to the attention function are the per-box features from the scratch pad (the object and attribute vectors corresponding to the same bounding box are concatenated) and the question vector as the key. The resulting attention map is put into the scratch pad.

State updater.   The state updater first computes the probabilities of using the relationship module or the residual submodule by applying a softmax over their importance scores. Their attention maps are weighted and summed with these probabilities, resulting in the new state containing the combined attention map. Thus, the state updater chooses the map from the relationship module if the given question involves relational reasoning.

Output generator.   The output generator returns a count vector along with the attention map. The count vector is computed with the counting algorithm of Zhang et al. [33], which builds a graph representation from attention maps to count objects. The method takes the attention map (passed through a sigmoid) and the bounding box coordinates as inputs. The algorithm of [33] is fully differentiable, and the resulting count vector corresponds to a one-hot encoding of a number; we let the count range from 0 to 12. Please refer to [33] for details of the counting algorithm.

A.5 Visual Question Answering

Our final module addresses the most complex task, Visual Question Answering (VQA). Given a vector representation of a natural language question, the VQA module uses the object, attribute, counting, and captioning modules, together with its residual submodule. It first queries the counting module, which produces a count vector and an attention map; we treat them as two separate entries in the scratch pad. The attention map is fed to the downstream modules by the query transmitters, and the received outputs are used depending on their importance scores. For the captioning module, which produces a caption for the whole image, the receiver attends over the words of the produced caption to find relevant words. We also use multiple glimpses (corresponding to multiple time steps of the module) similar to Stacked Attention Networks [31]. The prediction function produces an output vector based on the question vector and the states at all time steps.

The input is a vector representing the natural language question. When training the VQA module, this question vector is computed using the same GRU as for the counting task.

State initializer.   The state initializer sets the initial state to the tuple consisting of the question vector and a zero vector.

Importance function.   The importance function is implemented as a linear layer (producing scores for the five modules) that takes the state, specifically the current question vector, as input.

Query transmitter and receiver.   The query transmitter passes the question vector to the counting module, from which we get a count vector and an attention map. The receiver projects the count vector into a 512-dimensional vector through a sequence of a linear layer, BN, and a non-linearity. It also computes a softmax over the attention map. The projected count vector and the softmaxed attention map are added to the scratch pad as two separate entries.

The query transmitter passes a zero vector to the captioning module, and the receiver obtains a natural language caption of the image. It converts the words in the caption into vectors through an embedding layer, initialized with 300-dimensional GloVe word vectors [23] and fine-tuned. It then performs a softmax attention over the word vectors with the question vector as the key, resulting in word probabilities. The attended sentence representation is projected into a 512-dimensional vector using the same sequence of layers as above and is added to the scratch pad.

The query transmitters for the object and attribute modules pass the sum of the visual features weighted by the attention map. The receivers project the returned object and attribute vectors into 512-dimensional vectors through the same sequence of layers, and both are added to the scratch pad.

Residual computer.   The residual submodule computes residual information from the sum of the visual features weighted by the attention map, through an MLP, and the result is stored in the scratch pad.

State updater.   The state updater first applies a softmax over the importance scores of the modules. Let the information vector be the sum of the scratch pad entries weighted by these softmaxed scores. The state updater internally maintains a GRU that produces the query (question) vector for the next time step: its initial hidden state is the question vector, and its input at time t is the information vector. The new state is the tuple of the new hidden state and the information vector. In this way, the GRU computes a new question vector based on what has been asked and seen so far.

Output generator.   The output generator computes the final output based on the initial question vector and the states at all time steps. The question vector and each state are fused with gated-tanh layers and fed through a final classification layer, similar to Anderson et al. [2], and the logits from all time steps are added. The resulting logits form the answer prediction.
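For reference, a gated-tanh layer and the fusion-plus-classification head can be sketched as follows. The gated-tanh form follows [2]; the head wiring and the answer vocabulary size are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GatedTanh(nn.Module):
    """y = tanh(W x) * sigmoid(W' x), as in [2] (sketch)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.gate = nn.Linear(d_in, d_out)

    def forward(self, x):
        return torch.tanh(self.fc(x)) * torch.sigmoid(self.gate(x))

class VQAOutput(nn.Module):
    """Fuses the question vector with each per-step state and classifies (illustrative)."""

    def __init__(self, d_q, d_s, d_joint=512, n_answers=3000):
        super().__init__()
        self.f_q = GatedTanh(d_q, d_joint)
        self.f_s = GatedTanh(d_s, d_joint)
        self.cls = nn.Linear(d_joint, n_answers)

    def forward(self, q, states):
        # q: (B, d_q); states: list of (B, d_s), one per time step; logits are summed over steps
        return sum(self.cls(self.f_q(q) * self.f_s(s)) for s in states)
```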

Appendix B More experimental details

In this section, we provide more details about datasets and module training. We also give more examples of execution traces of PMN on the visual question answering task.

B.1 Datasets

We extract bounding boxes and their visual representations using a pretrained model from [2], which is a Faster R-CNN [24] based on ResNet-101 [10]. It produces 10 to 100 boxes with a 2048-d feature vector for each region. To accelerate training, we remove overlapping bounding boxes that are most likely duplicates (area overlap IoU > 0.7) and keep only the 36 most confident boxes (when available).
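A sketch of this de-duplication step, assuming boxes in (x1, y1, x2, y2) format with per-box confidence scores (the exact ordering and tie-breaking used by the authors are not specified):

```python
import numpy as np

def filter_boxes(boxes, scores, features, iou_thresh=0.7, max_boxes=36):
    """Drop likely-duplicate boxes (IoU > iou_thresh) and keep the most confident ones (sketch)."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-8)

    order = np.argsort(-scores)          # most confident boxes first
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
        if len(keep) == max_boxes:
            break
    return boxes[keep], features[keep]
```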

MS-COCO contains 100K images with annotated bounding boxes and captions. It is widely used for benchmarking several vision tasks such as captioning and object detection.

Visual Genome was collected to relate image concepts to image regions. It has over 108K images with annotated bounding boxes, containing 1.7M visual question answering pairs, 3.8M object instances, 2.8M attributes, and 1.5M relationships between pairs of boxes. Since the dataset contains MS-COCO images, we ensure that we do not train on any MS-COCO validation or test images.

VQA 2.0 is the most popular visual question-answering dataset, with 1M questions on 200K natural images. Questions in this dataset require reasoning about objects, actions, attributes, spatial relations, counting, and other inferred properties, making it an ideal dataset for our visual-reasoning PMN.

B.2 Training

Here, we give the training details of each module. We train our modules sequentially, from low-level to high-level tasks, one at a time. When training a higher-level module, the internal weights of the lower-level modules are not updated, thus preserving their performance on the original tasks. We do, however, train the weights of the residual module. The state initializer, importance function, query transmitters, receivers, state updater, and prediction function are trained by allowing gradients to pass through the lower-level modules. Thus, while the existing lower modules are held fixed, the new module learns to communicate with them via its query transmitters and receivers.

Object and attribute classification. The object module is trained to minimize the cross-entropy loss for predicting the object class, by including an additional linear layer on top of the module output. The attribute module also includes an additional linear layer on top of the module output and is trained to minimize the binary cross-entropy loss for predicting attribute classes, since one detected image region can contain zero or more attributes. We make use of 780K/195K train/val object instances paired with attributes from the Visual Genome dataset. Both modules are trained with the Adam optimizer at a learning rate of 0.0005 with batch size 32 for 20 epochs.

Image captioning. The captioning module is trained using the cross-entropy loss at each time step (maximum likelihood). Parameters are updated using the Adam optimizer at a learning rate of 0.0005 with batch size 64 for 20 epochs. We use the standard split of the MS-COCO captioning dataset.

Relationship detection. The relationship module is trained using the cross-entropy loss on “subject - relationship - object” tuples with the Adam optimizer at a learning rate of 0.0005 with batch size 128 for 20 epochs. The tuples are extracted from the Visual Genome dataset such that both the subject and object boxes overlap with ground truth boxes (IoU ≥ 0.7), resulting in 200K/38K train/val tuples.

Counting. The counting module is trained using the cross-entropy loss on questions starting with ‘how many’ from the VQA 2.0 dataset. We use the Adam optimizer with a learning rate of 0.0001 and batch size 128 for 20 epochs. As stated in the experiments section, we additionally create 89K synthetic questions to increase our training set by counting the object boxes in VG images and forming ‘how many’ questions (e.g., (Q: how many dogs are in this picture?, A: 3) from an image containing three bounding boxes of dogs). We also sample relational synthetic questions from each VG training image; these are used to train only the module communication parameters when the relationship module is included. We use the same 200K/38K split as in the relationship detection task by concatenating ‘how many’ + subject + relationship or ‘how many’ + relationship + object (e.g., how many plates on table?, how many behind door?). The module communication parameters in this case belong to the query transmitter, which computes the relationship category and the input image region to be passed to the relationship module. To be clear, we supervise the query sent to the relationship module by minimizing the cross-entropy loss on the predicted relationship category and input box.

Visual Question Answering. The VQA module is trained using the binary cross-entropy loss with the Adam optimizer at a learning rate of 0.0005 with batch size 128 for 7 epochs. We empirically found the binary cross-entropy loss to work better than cross-entropy, which was also reported by [2]. Unlike the other modules, whose parameters are fixed, we fine-tune the counting module because it expects the same form of input: an embedding of a natural language question. The performance of the counting module depends crucially on the quality of its attention map over bounding boxes. By employing more questions from the whole VQA dataset, the counting module obtains a better attention map, and its accuracy increases from 50.0% to 55.8% with fine-tuning.

B.3 PMN execution illustrated

We provide more examples of the execution traces of PMN on the visual question answering task in Figure 4 where yellow circles denote execution order.

Figure 4: Examples of PMN's module execution trace on the VQA task. For brevity, calls to the object and attribute modules by the relationship, counting, and captioning modules are not shown. Yellow circles denote execution order. Importance scores for the module outputs stored in the scratch pad are shown. For the captioning module, words highlighted more intensely in red are deemed more relevant by the receiver.

Appendix C Examples of PMN’s reasoning

We provide more examples of the human evaluation experiment on interpretability of PMN compared with the baseline model in Figure 5.

Figure 5: Examples of PMN's reasoning process compared with the baseline, given the question on the left. ✓ and ✗ denote correct and wrong answers, respectively.