Graph-based Heuristic Search for Module Selection Procedure in Neural Module Network

Abstract

Neural Module Network (NMN) is a machine learning model for solving visual question answering tasks. NMN uses programs to encode the structure of its modules, and its modularized architecture enables it to solve logical problems more reasonably. However, because of the non-differentiable procedure of module selection, NMN is hard to train end-to-end. To overcome this problem, existing work either includes ground-truth programs in the training data or applies reinforcement learning to explore the program. However, both of these methods still have weaknesses. In consideration of this, we propose a new learning framework for NMN. Graph-based Heuristic Search is the algorithm we propose to discover the optimal program through a heuristic search on the data structure named Program Graph. Our experiments on the FigureQA and CLEVR datasets show that our methods can realize the training of NMN without ground-truth programs and achieve superior efficiency over existing reinforcement learning methods in program exploration.

1 Introduction

With the development of machine learning in recent years, more and more tasks have been accomplished, such as image classification, object detection, and machine translation. However, there are still many tasks on which human beings perform much better than machine learning systems, especially those that require logical reasoning ability. Neural Module Network (NMN) is a recently proposed model targeted at these reasoning tasks [2, 1]. It first predicts a program indicating the required modules and their layout, and then constructs a complete network with these modules to accomplish the reasoning. With the ability to break down complicated tasks into basic logical units and to reuse previous knowledge, NMN achieved super-human performance on challenging visual reasoning tasks like CLEVR [10]. However, because module selection is a discrete and non-differentiable process, it is not easy to train NMN end-to-end.

To deal with this problem, a general solution is to separate the training into two parts: the program predictor and the modules. In this case, the program becomes a necessary intermediate label. The two common ways to provide this program label are either to include the ground-truth programs in the training data or to apply reinforcement learning to explore the optimal candidate program. However, these two solutions still have the following limitations. The dependency on ground-truth program annotation makes it hard to extend NMN to datasets without this kind of annotation. This annotation is also highly expensive because it must be hand-made by humans. Therefore, program annotation cannot always be expected to be available for tasks in real-world environments. In view of this, methods relying on ground-truth program annotation cannot be considered complete solutions for training NMN. On the other hand, the main problem of the approaches based on reinforcement learning is that, as the length of programs and the number of modules grow, the search space of possible programs becomes so huge that a reasonable program may not be found in an acceptable time.

In consideration of this, we still regard the training of NMN as an open problem. Motivated by the goal of taking advantage of NMN on broader tasks while overcoming this training difficulty, in this work we propose a new learning framework that solves the non-differentiable module selection problem in NMN.

Figure 1: Our learning framework enables the NMN to solve the visual reasoning problem without ground-truth program annotation.

In this learning framework, we put forward the Graph-based Heuristic Search algorithm to enable the model to find the most appropriate program by itself. Basically, this algorithm is inspired by Monte Carlo Tree Search (MCTS). Similar to MCTS, our algorithm conducts a heuristic search to discover the most appropriate program in the space of possible programs. Besides, inspired by the intrinsic connections between programs, we propose the data structure named Program Graph to represent the space of possible programs in a way more reasonable than the tree structure used by MCTS. Further, to deal with cases in which the search space is extremely large, we propose the Candidate Selection Mechanism to narrow down the search space.

With these proposed methods, our learning framework implements the training of NMN despite the non-differentiable module selection procedure. Compared to existing work, our proposed learning framework has the following notable characteristics:

  • It can implement the training of NMN with only the triplets of {question, image, answer} and without the ground-truth program annotation.

  • It can explore larger search spaces more reasonably and efficiently.

  • It can work on both trainable modules with neural architectures and non-trainable modules with discrete processing.

2 Related Work

2.1 Visual Reasoning

Generally, Visual Reasoning can be considered a kind of Visual Question Answering (VQA) [3]. Beyond the requirement of understanding information from both images and questions in common VQA problems, Visual Reasoning further asks for the capacity to recognize abstract concepts such as spatial, mathematical, and logical relationships. CLEVR [9] is one of the most famous and widely used datasets for Visual Reasoning. It provides not only the triplets of {question, image, answer} but also a functional program paired with each question. FigureQA [12] is another Visual Reasoning dataset we focus on in this work. It provides questions in fifteen different templates asked about five different types of figures.

To solve Visual Reasoning problems, a naive approach would be the combination of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). Here, the CNN and RNN are responsible for extracting information from images and questions, respectively. The extracted information is then combined and fed to a decoder to obtain the final answer. However, this methodology of treating Visual Reasoning simply as a classification problem sometimes cannot achieve desirable performance due to the difficulty of learning abstract concepts and relations between objects [3, 12, 10]. Instead, more recent work applied models based on NMN to solve Visual Reasoning problems [10, 8, 7, 15, 20, 30, 14].

2.2 Neural Module Network

Neural Module Network (NMN) is a machine learning model proposed in 2016 [2, 1]. Generally, the overall architecture of NMN can be considered as a controller and a set of modules. Given the question and the image, the controller of NMN first takes the question as input and outputs a program indicating the required modules and their layout. Then, the specified modules are concatenated with each other to construct a complete network. Finally, the image is fed to the assembled network and the answer is acquired from the root module. In our view, the advantage of NMN can be attributed to the ability to break down complicated questions into basic logical units and the ability to reuse previous knowledge efficiently.
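
To make the controller-plus-modules pipeline concrete, the following minimal Python sketch (with hypothetical class and function names, not taken from any cited implementation) shows how a predicted program in tree form can be executed recursively once the modules are given:

    # Minimal sketch of NMN-style inference; names are illustrative only.
    class ProgramNode:
        """One node of a program tree: a module name plus the subtrees feeding its inputs."""
        def __init__(self, module_name, children=None):
            self.module_name = module_name
            self.children = children or []

    def execute(node, modules, image_features):
        """Recursively evaluate the module tree bottom-up; the root output is the answer."""
        if node.module_name == "END":                 # leaf placeholder, carries no computation
            return None
        child_outputs = [execute(c, modules, image_features) for c in node.children]
        module = modules[node.module_name]            # look up the module implementation
        return module(image_features, *child_outputs)

    # Usage (assuming `module_dict` maps names to callables and `features` holds the image encoding):
    # answer = execute(predicted_program_root, module_dict, features)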

Based on the architecture of their modules, NMNs can further be categorized into three subclasses: feature-based, attention-based, and object-based NMNs.

For feature-based NMNs, the modules apply CNNs and their calculations are conducted directly on feature maps. Feature-based NMNs are the most concise implementation of NMN and were the most used in early work [10].

For attention-based NMNs, the modules also apply neural networks, but their calculations are conducted on attention maps. Compared to feature-based NMNs, attention-based NMNs retain the original information within images better, so they achieved higher reasoning precision and accuracy [2, 1, 8, 15].

Object-based NMNs regard the information in an image as a set of discrete representations of objects instead of a continuous feature map. Correspondingly, their modules conduct pre-defined discrete calculations. Compared to feature-based and attention-based NMNs, object-based NMNs achieved the highest reasoning precision [20, 30]. However, their discrete design usually requires more prior knowledge and pre-defined attributes on objects.

2.3 Monte Carlo Methods

The Monte Carlo Method is the general name of a group of algorithms that use random sampling to obtain approximate estimates in numerical computing [16]. These methods are broadly applied to tasks for which it is impossible or too time-consuming to get exact results through deterministic algorithms. Monte Carlo Tree Search (MCTS) is an algorithm that applies the Monte Carlo Method to decision making in game playing, such as computer Go [13, 5]. Generally, this algorithm arranges the possible state space of a game into a tree structure and then applies Monte Carlo estimation to determine the action to take at each round of the game. In recent years, approaches have also appeared that establish collaborations between Deep Learning and MCTS. This line of work, represented by AlphaGo, has beaten top-level human players at Go, which is considered one of the most challenging games for computer programs [21, 22].

3 Proposed Method

3.1 Overall Architecture

The general architecture of our learning framework is shown in Fig.2. As stated above, the training of the whole model can be divided into two parts: a. the Program Predictor and b. the modules. The main difficulty of training comes from the side of the Program Predictor because of the lack of expected programs as training labels. To overcome this difficulty, we propose the algorithm named Graph-based Heuristic Search to enable the model to find the optimal program by itself through a heuristic search on the data structure Program Graph. After this searching process, the most appropriate program found is utilized as the program label so that the Program Predictor can be trained in a supervised manner. In other words, this searching process can be considered a procedure that provides training labels for the Program Predictor.

The overall training workflow is presented as Algorithm 1. In the following, we write q for the question, p for the program, {m} for the set of modules available in the current task, {i} for the set of images that the question is asked about, and {a} for the set of answers paired with these images. Details about the Sample function are provided in Appendix A.

Figure 2: Our Graph-based Heuristic Search algorithm assists the learning of the Program Predictor.
1:function Train()
2:     Program_Predictor, {m} ← Initialize()
3:     for loop in range(N_loop) do
4:         q, {i}, {a} ← Sample(Dataset)
5:         p ← Graph-based_Heuristic_Search(q, {m}, {i}, {a})
6:         Program_Predictor.train(q, p)
7:     end for
8:end function
Algorithm 1 Total Training Workflow

3.2 Program Graph

To start with, we first give a precise definition of the program we use. Note that each of the available modules in the model has a unique name, a fixed number of inputs, and one output. Therefore, a program can be defined as a tree meeting the following rules:

i) Each non-leaf node stands for a module, and each leaf node holds an END flag.

ii) The number of children of a node equals the number of inputs of the module that the node represents.

For convenience of representation in prediction, a program can also be transformed into a sequence of modules together with END flags via pre-order tree traversal. Considering that the number of inputs of each module is fixed, the tree form can be rebuilt uniquely from such a sequence.
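
As an illustration, the sketch below (with a hypothetical arity table; ProgramNode as in the sketch of Section 2.2) serializes a program by pre-order traversal and rebuilds the tree uniquely using the known number of inputs of each module:

    # Sketch: pre-order serialization of a program and its unique reconstruction.
    # ARITY maps each module name to its fixed number of inputs (illustrative values).
    ARITY = {"Find_Element": 1, "Look_Up": 1, "Discriminator": 2, "END": 0}

    def to_sequence(node):
        """Pre-order traversal: module names and END flags."""
        return [node.module_name] + [tok for c in node.children for tok in to_sequence(c)]

    def from_sequence(tokens):
        """Rebuild the tree; the fixed arity of each module makes the parse unambiguous."""
        def build(pos):
            name = tokens[pos]
            pos += 1
            children = []
            for _ in range(ARITY[name]):
                child, pos = build(pos)
                children.append(child)
            return ProgramNode(name, children), pos
        root, _ = build(0)
        return root

    # Example: ["Discriminator", "Look_Up", "Find_Element", "END", "Find_Element", "END"]
    # rebuilds to Discriminator(Look_Up(Find_Element(END)), Find_Element(END)).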

The Program Graph is the data structure we use to represent the relations between all programs that have been reached throughout the searching process, and it is also the data structure that our Graph-based Heuristic Search algorithm works on. A Program Graph is built according to the following rules (a minimal sketch of the corresponding structure follows the list):

i) Each graph node represents a unique program that has been reached.

ii) There is an edge between two nodes if and only if the edit distance between their programs is one. Here, insertion, deletion, and substitution are the three basic edit operations, each defined to have an edit distance of one. Note that the edit distance between programs is judged on their tree form.

iii) Each node in the graph maintains a score. This score is initialized, when the node is created, as the probability that the Program Predictor assigns to the node's program, and it is updated when the program is executed.
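
A minimal sketch of the corresponding data structure, assuming adjacency is stored explicitly on each node (the field names are ours):

    # Sketch of a Program Graph node and edge creation (field names are illustrative).
    class GraphNode:
        def __init__(self, program, prior_probability):
            self.program = program            # the program tree this node represents
            self.score = prior_probability    # initialized from the Program Predictor output,
                                              # later updated with the achieved accuracy
            self.neighbors = set()            # nodes whose programs are at edit distance one
            self.visited = False              # whether the program has been executed
            self.visit_count = 0              # how often the node has been expanded on
            self.fully_explored = False

    def add_edge(a, b):
        """Connect two graph nodes whose programs differ by exactly one edit operation."""
        a.neighbors.add(b)
        b.neighbors.add(a)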

Fig.3 illustrates part of a Program Graph consisting of several program nodes together with their program trees as examples. To distinguish a node in the tree of a program from a node in the Program Graph, the former will be referred to as a "module node" and the latter as a "program node" in the following discussion. Details about the initialization of the Program Graph are provided in Appendix B.

Figure 3: Illustration of part of a Program Graph

3.3 Graph-based Heuristic Search

1:function Main(q, {m}, {i}, {a})
2:     G ← InitializeGraph(q)
3:     for step in range(N_step) do
4:         {n_c} ← {n for n in G if n.fully_explored == False}
5:         n.Exp ← FindExpectation(G, n) for n in {n_c}
6:         n* ← n s.t. n.Exp = max{n'.Exp for n' in {n_c}}
7:         Expand(G, n*, {m}, {i}, {a})
8:     end for
9:     n* ← n s.t. n.score = max{n'.score for n' in G}
10:     return n*.program
11:end function
12:function Expand(G, n, {m}, {i}, {a})
13:     n.visit_count ← n.visit_count + 1
14:     if n.visited == False then
15:         n.score ← accuracy(n.program, {m}, {i}, {a})
16:         n.visited ← True
17:     end if
18:     {t_c} ← {t for t in n.program if t.expanded == False}
19:     t ← Sample({t_c})
20:     {p'} ← Mutate(n.program, t, {m})
21:     for p in {p'} do
22:         if LegalityCheck(p) == True then
23:             G.update(p)
24:         end if
25:     end for
26:     t.expanded ← True
27:     n.fully_explored ← True if {t_c} \ {t} == ∅
28:end function
Algorithm 2 Graph-based Heuristic Search

Graph-based Heuristic Search is the core algorithm in our proposed learning framework. Its basic workflow is presented as the Main function starting at line 1 of Algorithm 2. After the Program Graph is initialized, the basic workflow can be described as a recurrent exploration of the Program Graph consisting of the following four steps:

i) Collecting all the program nodes in the Program Graph that have not been fully explored yet as the set of candidate nodes {n_c}.

ii) Calculating the Expectation for all the candidate nodes.

iii) Selecting the node with the highest Expectation value among all the candidate nodes.

iv) Expanding on the selected node to generate new program nodes and update the Program Graph.

The details about the calculation of the Expectation and the expansion strategy are as follows.

Expectation

Expectation is a grade defined on each program node to determine which node should be selected for the following expansion step. This Expectation is calculated through the following Equation 1:

Exp(n) = Σ_{d=1}^{D} w_d · max_{n' ∈ N_d(n)} score(n') + λ / (1 + visit_count(n))     (1)

Intuitively, this equation measures how desirable a program is to guide the modules to answer a given question reasonably. Here, N_d(n) denotes the set of program nodes at distance d from n on the Program Graph, and D, w = (w_1, ..., w_D), and λ are hyperparameters indicating the maximum distance in consideration, the weight coefficients used when summing the best scores at the different distances, and the scale coefficient that encourages visiting unexplored nodes, respectively.

In this equation, the first term observes the nodes nearby and finds the highest score at each distance from 1 to D. These scores are then weighted by w_d and summed up. Note that the distance here is measured on the Program Graph, which equals the edit distance between the two programs. The second term is a balance term negatively correlated with the number of times a node has been visited and expanded on. This term balances the grades of unexplored or less explored nodes.
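
A minimal Python sketch of this computation, using a breadth-first pass over the Program Graph and the GraphNode fields sketched in Section 3.2; the exact shape of the balance term, λ / (1 + visit_count), is our assumption:

    from collections import deque

    def expectation(node, D, w, lam):
        """Exp(n): weighted best scores within graph distance 1..D, plus a visit-count balance term."""
        best = {}                               # distance -> highest score seen at that distance
        seen = {node}
        frontier = deque([(node, 0)])
        while frontier:
            cur, dist = frontier.popleft()
            if dist >= D:
                continue
            for nb in cur.neighbors:
                if nb not in seen:
                    seen.add(nb)
                    d = dist + 1
                    best[d] = max(best.get(d, 0.0), nb.score)
                    frontier.append((nb, d))
        nearby_term = sum(w[d - 1] * best.get(d, 0.0) for d in range(1, D + 1))
        balance_term = lam / (1.0 + node.visit_count)
        return nearby_term + balance_term

With the hyperparameters of Table 1, this would be called as expectation(node, 4, (0.5, 0.25, 0.15, 0.1), 0.05).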

Expansion Strategy

Expansion is another important procedure in our proposed algorithm, shown as the Expand function starting at line 12 of Algorithm 2. The main objective of this procedure is to generate new program nodes and update the Program Graph. Its five main steps are as follows:

i) If the node n in the Program Graph is visited for the first time, try its program by building the model with the specified modules to answer the question, then update the score of the node with the obtained accuracy. If there are modules with neural architectures, these modules are also trained here, but the updated parameters are retained only if the new accuracy exceeds the previous one.

ii) Collect the module nodes within the program that have not been expanded on yet, then sample one of them as the module node t to expand on.

iii) Mutate the program at the module node t to generate a new set of programs {p'} using the three edit operations: insertion, deletion, and substitution.

iv) For each new program judged to be legal, if there is not yet a node representing the same program in the Program Graph, create a new program node representing this program and add it to the graph. The related edge is also added if it does not exist yet.

v) If all of the module nodes have been expanded on, then mark this program node as fully explored.

For the mutation in step iii), the three edit operations are illustrated in Fig.4. Here, insertion adds a new module node between the node t and its parent node. The new module can be any of the available modules in the model. If the new module has more than one input, t is set as one of its children, and the rest of the children are set to leaf nodes with END flags.

Deletion deletes the node t and sets its child as the new child of t's parent. If t has more than one child, only one of them is retained and the others are abandoned.

Substitution replaces the module of t with another module. The new module can be any of the modules that have the same number of inputs as t.

For insertion and deletion, if there are multiple possible mutations because the related node has more than one child, as shown in Fig.4, all of them are retained.
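
A sketch of the three edit operations on a program tree, reusing the ProgramNode class from the sketch in Section 2.2; nodes are addressed by a path of child indices, and each mutation works on a deep copy so the original program is untouched (the helper names are ours):

    import copy

    def node_at(root, path):
        """Follow a list of child indices from the root to a module node."""
        node = root
        for i in path:
            node = node.children[i]
        return node

    def substitute(root, path, new_module, arity):
        """Replace the module at `path` with another module of the same arity."""
        new_root = copy.deepcopy(root)
        target = node_at(new_root, path)
        assert len(target.children) == arity[new_module]
        target.module_name = new_module
        return new_root

    def insert(root, path, new_module, arity):
        """Insert `new_module` between the node at `path` and its parent; the old node
        becomes one child, and the remaining input slots are filled with END leaves."""
        new_root = copy.deepcopy(root)
        target = node_at(new_root, path)
        wrapper = ProgramNode(new_module,
                              [target] + [ProgramNode("END") for _ in range(arity[new_module] - 1)])
        if path:                                   # re-attach the wrapper in place of the target
            parent = node_at(new_root, path[:-1])
            parent.children[path[-1]] = wrapper
            return new_root
        return wrapper                             # the target was the root

    def delete(root, path, kept_child=0):
        """Remove the node at `path`, promoting one of its children; the others are abandoned."""
        new_root = copy.deepcopy(root)
        target = node_at(new_root, path)
        child = target.children[kept_child]
        if path:
            parent = node_at(new_root, path[:-1])
            parent.children[path[-1]] = child
            return new_root
        return child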

These rules ensure that newly generated programs always have legal structures, but there are still cases in which these programs are not legal in the sense of semantics, e.g., the output data type of a module does not match the input data type of its parent. A legality check is conducted to determine whether a program is legal and should be added to the Program Graph; more details about this function are provided in Appendix C.

Figure 4: Example of the mutations generated by the three operations insertion, deletion, and substitution.

3.4 Candidate Selection Mechanism for Modules

The learning framework presented above is already complete and can realize the training of NMN. However, in practice we found that as the length of programs and the number of modules grow, the size of the search space explodes exponentially, which makes the search difficult. To overcome this problem, we further propose the Candidate Selection Mechanism (CSM), an optional component within our learning framework. Generally speaking, if CSM is activated, it selects only a subset of modules from the full set of available modules. Then, only these selected modules are used in the following Graph-based Heuristic Search. The training workflow with CSM is presented as Algorithm 3.

1:function Train()
2:     Program_Predictor, Necessity_Predictor, {m} ← Initialize()
3:     for loop in range(N_loop) do
4:         q, {i}, {a} ← Sample(Dataset)
5:         {m_c} ← Necessity_Predictor(q, {m})
6:         p ← Graph-based_Heuristic_Search(q, {m_c}, {i}, {a})
7:         Necessity_Predictor.train(q, p)
8:         Program_Predictor.train(q, p)
9:     end for
10:end function
Algorithm 3 Training Workflow with Candidate Selection Mechanism

Here, we include another model named the Necessity Predictor in the learning framework. This model takes the question as input and predicts an M-dimensional vector as shown in Fig.5, where M is the total number of modules. Each value in the output vector is a real number in the range [0, 1] indicating the probability that the corresponding module is necessary for the solution of the given question. K_n and K_r are the two hyperparameters of the candidate module selection procedure. K_n is the number of modules selected according to the predicted probability values, i.e., the modules with the top K_n predictions. K_r is the number of modules selected randomly in addition to those K_n modules. The union of these two selections becomes the set of candidate modules for the following search.

For the training of this Necessity Predictor, the best program found in the search is transformed into an M-dimensional boolean vector indicating whether each module appears in the program. This boolean vector is then used as the training label so that the Necessity Predictor can also be trained in a supervised manner, as the Program Predictor is.
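
A sketch of the candidate selection and of the label construction for the Necessity Predictor (the function names are ours; K_n and K_r are the two hyperparameters described above):

    import random

    def select_candidates(necessity_scores, module_names, k_n, k_r):
        """Take the k_n modules with the highest predicted necessity, plus k_r random extras."""
        ranked = sorted(module_names, key=lambda m: necessity_scores[m], reverse=True)
        top, rest = ranked[:k_n], ranked[k_n:]
        extras = random.sample(rest, min(k_r, len(rest)))
        return set(top) | set(extras)

    def necessity_label(best_program_tokens, module_names):
        """Boolean vector: whether each module appears in the best program found by the search."""
        used = set(best_program_tokens)
        return [1.0 if m in used else 0.0 for m in module_names]

With K_n = 15 and K_r = 5 as in the CLEVR experiment of Section 4.2, this yields at most 20 candidate modules per question.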

Figure 5: The process of selecting the candidate modules

4 Experiments and Results

Our experiments are conducted on the FigureQA and the CLEVR dataset. Their settings and results are presented in the following subsections respectively.

4.1 FigureQA Dataset

The main purpose of the experiment on FigureQA is to certify that our learning framework can realize the training of NMN on a dataset without ground-truth program annotations and outperform existing methods based on models other than NMN.

An overview of how our methods work on this dataset is shown in Fig.6. Considering that the size of the search space of the programs used in FigureQA is relatively small, the CSM introduced in Section 3.4 is not activated.

Generally, the workflow consists of three main parts. First, object detection [6] together with optical character recognition [23] is applied to transform the raw image into discrete element representations as shown in Fig.6.a. For this part, we applied Faster R-CNN [18, 29] with a ResNet-101 backbone for object detection and the Tesseract open-source OCR engine [24, 25] for text recognition. All images are resized to 256 by 256 pixels before the subsequent calculations.

Second, for the program prediction part shown in Fig.6.b, we applied our Graph-based Heuristic Search algorithm for the training. The settings of the hyperparameters for this part are shown in Table 1. The type of figure is treated as an additional token appended to the question.

Third, for the modules shown in Fig.6.c, we designed several pre-defined modules with discrete calculations on objects. Their functions correspond to the reasoning abilities required by FigureQA. These pre-defined modules are used together with modules that have neural architectures. Details of all these modules are provided in Appendix D.

Figure 6: An example of the inference process on FigureQA

Table 2 shows the results of our methods compared with baselines and existing methods. "Ours" is the result obtained with the experimental settings presented above. Besides, we also provide the result named "Ours + GE", where "GE" stands for ground-truth elements. In this case, element annotations are obtained directly from the ground-truth plotting annotations provided by FigureQA instead of from the object detection results. We applied this setting to measure the influence of noise in the object detection results.

N_loop = 100, N_step = 1000, D = 4, w = (0.5, 0.25, 0.15, 0.1), λ = 0.05
Table 1: Setting of hyperparameters in our experiment
Method    Validation Set 1    Validation Set 2    Test Set 1    Test Set 2
Text only [12] 50.01% 50.01%
CNN+LSTM [12] 56.16% 56.00%
Relation Network [12, 19] 72.54% 72.40%
Human [12] 91.21%
FigureNet [17] 84.29%
PTGRN [4] 86.25% 86.23%
PReFIL [11] 94.84% 93.26% 94.88% 93.16%
Ours 95.74% 95.55% 95.61% 95.28%
Ours + GE 96.61% 96.52%
Table 2: Comparison of accuracy with previous methods on the FigureQA dataset.

From the results, it can first be noticed that both our method and our method with GE outperform all the existing methods. In our view, the superiority of our method mainly comes from the successful application of NMN. As stated in Section 2.2, NMN has shown an outstanding capacity for solving logical problems. However, limited by the non-differentiable module selection procedure, the application of NMN could hardly be extended to tasks without ground-truth program annotations like FigureQA. In our work, the proposed learning framework realizes the training of NMN without ground-truth programs, so we succeeded in applying NMN to FigureQA. This observation can also be certified through the comparison between our results and PReFIL.

Compared to PReFIL, considering that we applied nearly the same 40-layer DenseNet to process the image, the main difference in our model is the application of modules. The modules other than the final Discriminator ensure that the inputs fed to the Discriminator are more closely related to what the question is asking about.

Another interesting fact shown by the results is the difference between the accuracies reached on Set 1 and Set 2 of both the validation and test sets. Note that in FigureQA, validation set 1 and test set 1 adopt the same color scheme as the training set, while validation set 2 and test set 2 adopt an alternated color scheme. This difference makes generalization from the training set to the two Set 2 splits harder. As a result, for PReFIL the accuracy on each Set 2 drops by more than 1.5% from the corresponding Set 1. For our method with NMN, this decrease is less than 0.4%, which shows the better generalization capacity brought by the successful application of NMN.

Appendix E reports the accuracies achieved on test set 2 by question type and figure type. It is worth mentioning that our work is the first to exceed human performance on every question type and figure type.

4.2 CLEVR Dataset

The main purpose of the experiment on CLEVR is to certify that our learning framework can achieve superior searching efficiency compared to the classic reinforcement learning method.

For this experiment, we created a subset of CLEVR containing only those training examples whose questions appear at least twice among all training questions. There are 31,252 different questions together with their corresponding programs in this subset. The reason for using such a subset is that the whole space of possible programs is so huge that no existing method can search it without any prior knowledge or simplification of programs. Considering that the training of modules is highly time-consuming, we only activate the program prediction part of our learning framework, shown as Fig.6.b. With this setting, the modules specified by a program are not actually trained. Instead, a boolean value indicating whether the program is correct is returned to the model as a substitute for the question answering accuracy. Here, only programs that are exactly the same as the ground-truth program paired with the given question are considered correct.

In this experiment, comparative experiments were conducted both with and without the CSM activated. The structures of the models used as the Program Predictor and the Necessity Predictor are as follows. For the Program Predictor, we applied a 2-layer bidirectional LSTM with a hidden state size of 256 as the encoder and a 2-layer LSTM with a hidden state size of 512 as the decoder. The input embedding size of both the encoder and the decoder is 300. The hyperparameters are the same as for FigureQA, as shown in Table 1, except that the number of search steps is not limited. For the Necessity Predictor, we applied a 4-layer MLP. The input of the MLP is a boolean vector indicating whether each word in the dictionary appears in the question; the output is a 39-dimensional vector, as there are 39 modules in CLEVR; the size of all hidden layers is 256. The hyperparameters K_n and K_r are set to 15 and 5, respectively. For the sentence embedding model utilized in the initialization of the Program Graph, we applied the GenSen model with pre-trained weights [27, 26].
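
For concreteness, a PyTorch-style sketch of the two predictors with the sizes listed above (class and variable names are ours; the decoding loop, attention, and training details are omitted):

    import torch.nn as nn

    class ProgramPredictor(nn.Module):
        """Seq2seq skeleton: 2-layer BiLSTM encoder (hidden 256), 2-layer LSTM decoder (hidden 512)."""
        def __init__(self, vocab_size, num_tokens, emb_size=300):
            super().__init__()
            self.src_emb = nn.Embedding(vocab_size, emb_size)
            self.tgt_emb = nn.Embedding(num_tokens, emb_size)
            self.encoder = nn.LSTM(emb_size, 256, num_layers=2, bidirectional=True, batch_first=True)
            self.decoder = nn.LSTM(emb_size, 512, num_layers=2, batch_first=True)
            self.out = nn.Linear(512, num_tokens)

    class NecessityPredictor(nn.Module):
        """4-layer MLP over a bag-of-words question vector, one sigmoid output per module (39 for CLEVR)."""
        def __init__(self, vocab_size, num_modules=39, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(vocab_size, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_modules), nn.Sigmoid(),
            )

        def forward(self, bow_vector):
            return self.net(bow_vector)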

For the baseline, we applied REINFORCE [28], as most existing work does [10, 14], to train the same Program Predictor model.

The searching processes of our method, our method without CSM, and REINFORCE are shown in Fig.7. In this figure, the horizontal axis indicates the number of searches performed and the vertical axis indicates the number of correct programs found. The experiments on our method and our method without CSM are repeated four times each, and the experiment on REINFORCE is repeated eight times. We also show the average results as the thick solid lines, indicating the average number of searches needed to find a given number of correct programs. Although the numbers of correct programs finally found on this subset of CLEVR are quite similar for the three methods, their searching processes show great differences. From this result, three main conclusions can be drawn.

Figure 7: Relation between the times of search and the number of correct programs found within the searching processes of three methods.

Firstly, in terms of the average case, our method shows significantly higher efficiency in searching for appropriate programs.

Secondly, the searching process of our method is much more stable, while the best case and worst case of REINFORCE differ greatly.

Thirdly, the comparison between the results of our method and our method without CSM certifies the effectiveness of the CSM.

5 Conclusion

In this work, to overcome the difficulty of training NMN caused by its non-differentiable module selection procedure, we proposed a new learning framework for the training of NMN. Our main contributions in this framework can be summarized as follows.

Firstly, we proposed the data structure named Program Graph to represent the search space of programs more reasonably.

Secondly and most importantly, we proposed the Graph-based Heuristic Search algorithm to enable the model to find the most appropriate program by itself, removing the dependency on ground-truth programs in training.

Thirdly, we proposed the Candidate Selection Mechanism to improve the performance of the learning framework when the search space is huge.

The experiment on FigureQA certified that our learning framework can realize the training of NMN on a dataset without ground-truth program annotations and outperform existing methods based on models other than NMN. The experiment on CLEVR certified that our learning framework achieves superior efficiency in searching for programs compared to the classic reinforcement learning method. In view of this evidence, we conclude that our proposed learning framework is a valid and advanced approach for training NMN.

Nevertheless, our learning framework still cannot deal with extremely huge search spaces, e.g., the whole space of possible programs in CLEVR. We leave the study of methods that can search such enormous spaces as future work.

Acknowledgment

This work was supported by JSPS KAKENHI Grant Number JP19K22861.

Appendix A Training Data Sampling

The basic sampling unit of training data is the triplet (q, {i}, {a}). Generally, as shown in Fig.8, we maintain three sets, {D_unused}, {D_unsolved}, and {D_solved}, to distinguish training data in different statuses.

Intuitively, {D_unused} contains training data that have not been met and used yet. At the beginning of learning, all the training data is stored in {D_unused}.

{D_unsolved} contains training data that have been sampled from {D_unused} but for which the final accuracy achieved in the following search did not reach a given accuracy threshold (a hyperparameter).

{D_solved} contains training data that have been sampled from {D_unused} or {D_unsolved} and for which the final accuracy reached the threshold.

We denote the numbers of training data triplets in these three sets as N_unused, N_unsolved, and N_solved, respectively.
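
A small sketch of this bookkeeping; the set names and the accuracy threshold are our own labels for the three sets and the hyperparameter described above:

    def update_status(triplet, accuracy, threshold, unused, unsolved, solved):
        """Move a (question, images, answers) triplet to the set matching the accuracy
        that its best program achieved in the latest search."""
        unused.discard(triplet)
        unsolved.discard(triplet)
        if accuracy >= threshold:
            solved.add(triplet)
        else:
            unsolved.add(triplet)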

Figure 8: Data sampling strategy

For each step in each training loop, the training data is sampled from either {D_unused} or {D_unsolved}, with probabilities given by Equation 2.

(2a)
(2b)

Appendix B Program Graph Initialization

To initialize the Program Graph, at most three initial program nodes are created as starting points for the following search. Their programs are:

i) The program predicted by the Program Predictor model.

ii) The program found for the question within {D_solved} that is closest to the current given question.

iii) The shortest legal program.

Specifically for ii), this term only works when {D_solved} is not empty. If so, a pre-trained sentence embedding model is utilized to judge the semantic distance between questions and find the question q* from {D_solved} that is semantically closest to the current given question q. This process can be expressed as Equation 3, where Embed takes a question sentence as input and outputs a fixed-length vector, and Distance judges the distance between Embed(q) and Embed(q'). Then, the program found for q* in previous searches becomes one of the initial programs for the Program Graph.

q* = argmin_{q' ∈ {D_solved}} Distance(Embed(q), Embed(q'))     (3)
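
A sketch of this retrieval step, assuming a generic sentence-embedding function and Euclidean distance as one possible choice for the Distance function:

    import numpy as np

    def closest_solved_question(question, solved_items, embed):
        """Return the program found for the semantically closest previously solved question.
        `solved_items` is an iterable of (question, program) pairs; `embed` maps a sentence
        to a fixed-length vector (e.g., a pre-trained sentence encoder)."""
        q_vec = embed(question)
        best_program, best_dist = None, float("inf")
        for prev_question, prev_program in solved_items:
            dist = np.linalg.norm(q_vec - embed(prev_question))
            if dist < best_dist:
                best_program, best_dist = prev_program, dist
        return best_program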

Appendix C Legality Check for Programs

As stated in Section 3.3, our rules for generating mutations of programs ensure legal structure but not necessarily legal semantics. Here, the illegality of semantics mainly comes from the type system of the modules. Within NMN, the inputs and outputs passed between modules are restricted to types such as feature map, number, object, or set of objects. The calculation of NMN fails if the intermediate data fed to a module does not match the data type that the module requires.

Generally, there are two solutions to this problem. One is to add the illegal programs to the Program Graph anyway, mark them as non-executable, and skip the step of executing them to get their accuracies. However, excessive illegal programs within the Program Graph waste plenty of search steps, so the efficiency of the search drops noticeably.

The other solution is to simply refuse to add these illegal programs to the Program Graph. However, in this way the Program Graph may become disconnected, and some sub-graphs may never be reached from others.

In consideration of this, we applied a compromise between these two solutions. We use a hyperparameter that restricts the maximum number of data type mismatches that can be tolerated. Programs whose number of data type mismatches does not exceed this limit are still added to the Program Graph, although they cannot be executed to obtain an accuracy. This setting balances the efficiency and coverage of the search.
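
A sketch of this tolerant legality check, assuming each module declares its expected input types and its output type (the type table below is illustrative, not the full table used in our experiments):

    # Hypothetical type signatures: module name -> (list of input types, output type).
    SIGNATURES = {
        "Find_Element":  (["None"], "Element"),
        "Find_Same":     (["Element"], "Elements"),
        "Discriminator": (["Any", "Any"], "Answer"),
        "END":           ([], "None"),
    }

    def count_type_mismatches(node):
        """Return this subtree's output type and its total number of type mismatches."""
        inputs, output = SIGNATURES[node.module_name]
        mismatches = 0
        for expected, child in zip(inputs, node.children):
            child_type, child_mismatches = count_type_mismatches(child)
            mismatches += child_mismatches
            if expected != "Any" and child_type != expected:
                mismatches += 1
        return output, mismatches

    def legality_check(program_root, max_mismatch):
        """Keep a program only if its mismatch count stays within the tolerated limit;
        programs with zero mismatches are additionally executable."""
        _, mismatches = count_type_mismatches(program_root)
        return mismatches <= max_mismatch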

Appendix D Modules Used in FigureQA Dataset

The modules used in the experiment on the FigureQA dataset are shown in Table 3. The column "Shape" indicates the number and types of inputs and the output. The column "Architecture" indicates whether the module is pre-defined with a rule-based calculation or is a trainable neural network.

Name            Shape                                    Architecture     Number
Find Element    (None) → Element                         pre-defined      2
Look Up         (Element) → Element                      pre-defined      1
Look Down       (Element) → Element                      pre-defined      1
Look Left       (Element) → Element                      pre-defined      1
Look Right      (Element) → Element                      pre-defined      1
Find Same       (Element) → Elements                     pre-defined      1
Discriminator   (Element/Elements/None) * 2 → Answer     neural network   N
Table 3: Modules used in the experiment on the FigureQA dataset

Regarding the behavior of each module, "Find Element" finds an element that matches a given keyword among all the detected elements. Here, keywords are the names of colors extracted from the questions. Because there are at most two keywords within a question, two instances of this module are required, each corresponding to one of the keywords.

"Look Up" finds the closest element that lies in the area spanning from 45 degrees to the upper left to 45 degrees to the upper right of the given element.

“Look Down”, “Look Left”, and “Look Right” behave similarly to “Look Up”.

"Find Same" finds the set of elements with the same attributes as the given element. In this experiment, we fix this attribute to be color.

Figure 9: Architecture of our Discriminator with a 40-layer DenseNet as the backbone

"Discriminator" has two inputs. For each input, it masks the original image with the bounding boxes of the given element or set of elements. Then, the masked images are fed to a neural network to infer the answer. An input can also be empty; in this case, the original image is fed to the neural network directly. To compare our method with existing work fairly, we use a 40-layer DenseNet similar to the one applied in PReFIL as the backbone of the Discriminator. The architecture of the Discriminator is shown in Fig.9. The number of filters in the first convolutional layer of the DenseNet is 64. Considering that the two inputs of the Discriminator are parallel and most of their features are similar, the first convolutional layer processes them independently with shared weights. All three following dense blocks have 12 layers, and their growth rate is set to 12. The number of final classes is 2, representing the answers "Yes" and "No" in FigureQA.

For training, we used a cross-entropy loss and an SGD optimizer with learning rate decay. The batch size is set to 64. The learning rate is initialized to 0.1 and drops to 0.01, 0.001, 0.0001, and 0.00001 at epochs 8, 12, 16, and 20, respectively. The maximum number of training epochs is 24; however, considering that training a 40-layer DenseNet for the full 24 epochs is highly time-consuming, during the search only the first 4 epochs of training are conducted and the validation accuracy is returned at that point. After the search on each question is completed, the Discriminator specified by the optimal program is trained again for the full 24 epochs.
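
A sketch of the optimizer and learning-rate schedule described above in PyTorch style; `discriminator` and `train_loader` are assumed to be the DenseNet-based model and a loader over (masked image, answer) batches:

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(discriminator.parameters(), lr=0.1)
    # Drop the learning rate by 10x at epochs 8, 12, 16, and 20 (0.1 -> 0.01 -> ... -> 0.00001).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 12, 16, 20], gamma=0.1)

    for epoch in range(24):            # only the first 4 epochs are run during the search
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(discriminator(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()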

Appendix E Results by Question Type and Figure Type in FigureQA Dataset

Question Template RN Human PReFIL Ours
Is X the minimum? 76.78 97.06 97.20 98.44
Is X the maximum? 83.47 97.18 98.07 98.79
Is X the low median? 66.69 86.39 93.07 94.07
Is X the high median? 66.50 86.91 93.00 94.29
Is X less than Y? 80.49 96.15 98.20 99.43
Is X greater than Y? 81.00 96.15 98.07 99.45
Does X have the minimum area under the curve? 69.57 94.22 94.00 95.77
Does X have the maximum area under the curve? 78.45 95.36 96.91 97.65
Is X the smoothest? 58.57 78.02 71.87 80.90
Is X the roughest? 56.28 79.52 74.67 85.19
Does X have the lowest value? 69.65 90.33 92.17 95.42
Does X have the highest value? 76.23 93.11 94.83 96.68
Is X less than Y? 67.75 90.12 92.38 95.19
Is X greater than Y? 67.12 89.88 92.00 95.29
Does X intersect Y? 68.75 89.62 91.25 95.22
Overall 72.18 91.21 92.79 95.28
Table 4: Accuracy on Test Set 2 by different question types.
Figure Type RN Human PReFIL Ours
Vertical Bar 77.13 95.90 98.25 98.55
Horizontal Bar 77.02 96.03 97.98 99.32
Pie 73.26 88.26 92.84 94.31
Line 66.69 90.55 87.79 92.66
Dot Line 69.22 87.20 89.57 93.11
Overall 72.18 91.21 92.79 95.28
Table 5: Accuracy on Test Set 2 by different figure types.

Footnotes

  1. email: {wuyuxuan,nakayama}@nlab.ci.i.u-tokyo.ac.jp

References

  1. J. Andreas, M. Rohrbach, T. Darrell and D. Klein (2016) Learning to compose neural networks for question answering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1545–1554.
  2. J. Andreas, M. Rohrbach, T. Darrell and D. Klein (2016) Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48.
  3. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick and D. Parikh (2015) VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433.
  4. Q. Cao, X. Liang, B. Li and L. Lin (2019) Interpretable visual question answering by reasoning on dependency trees. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  5. R. Coulom (2006) Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pp. 72–83.
  6. R. Girshick, J. Donahue, T. Darrell and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
  7. R. Hu, J. Andreas, T. Darrell and K. Saenko (2018) Explainable neural computation via stack neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 53–69.
  8. R. Hu, J. Andreas, M. Rohrbach, T. Darrell and K. Saenko (2017) Learning to reason: end-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 804–813.
  9. J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick and R. Girshick (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910.
  10. J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. Lawrence Zitnick and R. Girshick (2017) Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2989–2998.
  11. K. Kafle, R. Shrestha, B. Price, S. Cohen and C. Kanan (2019) Answering questions about data visualizations using efficient bimodal fusion. arXiv preprint arXiv:1908.01801.
  12. S. E. Kahou, V. Michalski, A. Atkinson, Á. Kádár, A. Trischler and Y. Bengio (2018) FigureQA: an annotated figure dataset for visual reasoning. In International Conference on Learning Representations.
  13. L. Kocsis and C. Szepesvári (2006) Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293.
  14. J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum and J. Wu (2019) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations.
  15. D. Mascharka, P. Tran, R. Soklaski and A. Majumdar (2018) Transparency by design: closing the gap between performance and interpretability in visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4942–4950.
  16. N. Metropolis and S. Ulam (1949) The Monte Carlo method. Journal of the American Statistical Association 44 (247), pp. 335–341.
  17. R. Reddy, R. Ramesh, A. Deshpande and M. M. Khapra (2019) FigureNet: a deep learning model for question-answering on scientific plots. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
  18. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS).
  19. A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia and T. Lillicrap (2017) A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pp. 4967–4976.
  20. J. Shi, H. Zhang and J. Li (2019) Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8376–8384.
  21. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam and M. Lanctot (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
  22. D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai and A. Bolton (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
  23. S. Singh (2013) Optical character recognition techniques: a survey. Journal of Emerging Trends in Computing and Information Sciences 4 (6), pp. 545–550.
  24. R. Smith (2007) An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2, pp. 629–633.
  25. R. Smith (2019) Tesseract open source OCR engine. https://github.com/tesseract-ocr/tesseract.
  26. S. Subramanian, A. Trischler, Y. Bengio and C. J. Pal (2018) GenSen. https://github.com/Maluuba/gensen.
  27. S. Subramanian, A. Trischler, Y. Bengio and C. J. Pal (2018) Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations.
  28. R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
  29. J. Yang, J. Lu, D. Batra and D. Parikh (2017) A faster PyTorch implementation of Faster R-CNN. https://github.com/jwyang/faster-rcnn.pytorch.
  30. K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli and J. Tenenbaum (2018) Neural-symbolic VQA: disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pp. 1031–1042.