A Question-Answering framework for plots using Deep learning


Deep learning has pushed boundaries in a wide variety of tasks. One area of interest is tackling problems in reasoning and understanding, with the aim of emulating human intelligence. In this work, we describe a deep learning model that addresses the reasoning task of question answering on bar graphs and pie charts. We introduce a novel architecture that learns to identify various plot elements, quantify the represented values and determine a relative ordering of these statistical values. We test our model on the recently released FigureQA dataset, which provides images and accompanying questions for bar graphs and pie charts, augmented with rich annotations. Our approach outperforms the state-of-the-art Relation Networks baseline and traditional CNN-LSTM models when evaluated on this dataset. Our model also trains considerably faster, taking approximately 2 days on 1 GPU, whereas the Relation Networks baseline requires around two weeks on 4 GPUs.


1 Introduction

Deep learning has transformed the computer vision and natural language processing landscapes and has become a ubiquitous tool in their associated applications. The potential of convolutional neural networks on images was demonstrated by their success in the ImageNet classification task [Krizhevsky et al., 2012]. Long short-term memory networks [Hochreiter and Schmidhuber, 1997] have demonstrated a capability to tackle complex tasks like sentence summarization [Rush et al., 2015], machine passage comprehension [Hermann et al., 2015] and neural machine translation [Bahdanau et al., 2014]. Neural network models are in fact a result of preliminary attempts to model the brain, and hence a natural area of interest is to accurately model “reasoning”. A plethora of visual reasoning tasks have been created to benchmark these capabilities of neural networks [Lin et al., 2014; Johnson et al., 2017]. Visual question answering tasks require a combination of reasoning, natural language processing and computer vision techniques. The model must be capable of obtaining representations of the image and the question, and of intelligently combining these representations to generate an answer. This task helps machines gain the ability to process visual signals and use them to solve multi-modal problems.

Rudimentary convolutional neural network (CNN) and long short-term memory (LSTM) models were incapable of handling these datasets. However, reasoning-specific architectures have managed to achieve super-human scores on these reasoning-based tasks [Perez et al., 2017; Santoro et al., 2017]. One point of note is that these datasets have predominantly addressed spatial and relational reasoning. Kahou et al. [2017] designed a dataset that uses scientific graphs and figures to test count-based, numeric, spatial and relational reasoning. Scientific figures are a compact representation of statistical information. These figures are generally line plots, bar graphs and pie charts. They are found not only in scientific research papers but also in business analysis reports, consensus reports and various other places where textual information can be supplemented with figures. Automating the understanding of this visual information could therefore be of great utility to human analysts, since it allows inferences to be drawn from various reports and papers. An architecture addressing this task is hence useful in its own right and also bridges the gap towards a universal reasoning module.

We propose a neural network architecture, FigureNet, that incorporates various entities in the plot to address the reasoning task. FigureNet works on the principle of divide and conquer. Different modules are used to emulate different logical components and are put together, ensuring that the model is end-to-end differentiable. To ensure that the functionality of each module is well defined, we employ supervised pre-training of each module on a relevant individual sub-task.

In this work, we tackle a subset of the FigureQA dataset that comprises bar graphs and pie charts. We compare our model against the Relation Network architecture [Santoro et al., 2017] and a standard CNN-LSTM architecture. Our model outperforms these baselines with a computation time that is roughly 20 times lower than that of Relation Networks. The rest of the paper is structured as follows. Section 2 reviews related work. Section 3 describes the FigureQA dataset and the baselines reported by Kahou et al. [2017]. In Section 4, we lay out our approach for question answering on bar graphs and pie charts. In Section 5, we explain our training process and show the improvements over various baselines on the FigureQA dataset. Section 6 gives a methodology for extending our approach to real-life figures. Finally, Section 7 concludes the paper and gives directions for future work.

2 Related work

There are a variety of visual question-answering datasets [Lin et al., 2014; Johnson et al., 2017]. These datasets, however, have questions which deal solely with the positional relationships between objects. Hence, the key function of the neural network is to identify the different objects and codify their positions.

The baselines for this task involve naively combining the LSTM and CNN architectures. Ren et al. [2015] describe an end-to-end differentiable architecture which sets the bar for neural networks on spatial reasoning tasks. Malinowski et al. [2017] report results on a varied set of combinations of textual model embeddings and image embeddings. These baselines were subsequently superseded by attention-based models which use the image embeddings to generate attention maps over the text [Nam et al., 2016]. Parallel to the development of attention-based architectures, several works explored different fusion functions that combine image and sentence representations [Ben-younes et al., 2017]. A large body of work also addresses the visual question-answering problem using modular networks, where different modules are used to replicate different logical components [Andreas et al., 2016; Hu et al., 2017]. The state-of-the-art models in visual question answering use a rather simple, end-to-end differentiable model and achieve super-human performance on relational reasoning tasks [Perez et al., 2017; Santoro et al., 2017].

There is a plethora of literature on the advantages of pre-training in deep learning. Erhan et al. [2009] discuss the difficulty of training deep architectures and the effect of unsupervised pre-training. They infer that starting the supervised optimization from pre-trained weights rather than from randomly initialized weights consistently yields better-performing classifiers. Erhan et al. [2010] suggest that unsupervised pre-training acts as a regularizer and guides the learning towards basins of attraction of minima that support better generalization from the training data set. We employ supervised pre-training in this work, since the FigureQA dataset has the advantage of extensive annotations. These annotations can be used to formulate supervised subtasks that simplify the principal task of answering the questions.

The disadvantage of Relation Networks and FiLM [Perez et al., 2017] is their computational demand. Our architecture is computationally lightweight in comparison. A key requirement for our neural network model is to identify colours. Traditional convolutional layers typically mix the information content present in the various channels. Inspired by the depthwise separable convolution operation present in the Xception model [Chollet, 2016], we adopt a similar family of convolution models in our design.

3 Preliminaries

In this section, we first describe the FigureQA dataset (see footnote 2), which was introduced by Kahou et al. [2017]. This is followed by a description of the Relation Networks baseline for this dataset.

3.1 The FigureQA Dataset

FigureQA [Kahou et al., 2017] is a visual reasoning corpus which contains over a million question-answer pairs grounded in scientific-style figures like line plots, dot-line plots, horizontal and vertical bar graphs, and pie charts. In our work, we consider only bar graphs and pie charts. The training set contains 1.3 million questions grounded in 100,000 images. The test and validation sets each contain over 250,000 questions derived from 20,000 images. FigureQA is a synthetic corpus that has been designed to focus specifically on reasoning. It follows the general visual question answering setup, but also provides annotated data with bounding boxes for each figure.

The dataset uses 100 unique colours, covering the entire colour spectrum, chosen from the X11 named colour set. FigureQA’s training, validation and test sets are constructed such that all 100 colours are seen during training. The 100 colours are divided into two disjoint subsets of 50 colours each. In the training set, each figure type is coloured by choosing colours from one, and only one, of the two subsets. For the test and validation sets, colours are drawn from the alternate subset for that figure type, i.e. if the first subset was used for pie charts in the training set, then the second subset is used for pie charts in the validation and test sets. This colouring for the validation and test sets is called the “alternate colouring scheme”. Validation and test sets with the “same colouring scheme” are also provided. Figure 1 and Figure 3 are examples of different figure types with question-answer pairs. Figure 2 shows an example of the annotations available for each figure. Images are taken from Kahou et al. [2017].
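The colour split described above can be sketched in a few lines of Python. This is only an illustration: the placeholder colour names, the random seed and the rule that decides which subset a figure type uses for training are assumptions, not details of the FigureQA generator.

```python
import random


def split_colours(colours, seed=0):
    """Split the colour list into two disjoint halves."""
    shuffled = list(colours)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def assign_subsets(figure_types, first, second):
    """Give each figure type one subset for training and the other subset
    for the alternate-colouring validation and test sets."""
    scheme = {}
    for index, figure_type in enumerate(figure_types):
        # Assumption: alternate the training subset between figure types so
        # that every colour is seen somewhere in the training set.
        if index % 2 == 0:
            train, alternate = first, second
        else:
            train, alternate = second, first
        scheme[figure_type] = {"train": train, "alternate": alternate}
    return scheme


# Placeholder names standing in for the 100 X11 colours.
colours = [f"colour_{i:03d}" for i in range(100)]
first_subset, second_subset = split_colours(colours)
scheme = assign_subsets(
    ["vertical_bar", "horizontal_bar", "pie"], first_subset, second_subset
)
```

With this layout, a model is evaluated under the alternate scheme on colours it has seen in training, but never for that particular figure type.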

Fig. 1: Vertical Bar graph with question-answer pairs
Fig. 2: Horizontal Bar graph with annotations
Fig. 3: Pie chart with question-answer pairs

3.2 Relation Networks

Relation Networks (RNs) were introduced by Santoro et al. [2017] as a simple yet powerful neural module for relational reasoning. Relation Networks have the ability to compute relations, just as convolutional neural networks have the ability to generate image feature maps and recurrent neural networks have the ability to capture sequential dependencies. RNs have been demonstrated to achieve state-of-the-art, superhuman performance on a challenging dataset called CLEVR [Johnson et al., 2017]. An RN takes the object representations as input and processes the relations between objects as follows:

$$RN(O) = f_\phi\Big(\sum_{i,j} g_\theta(o_i, o_j)\Big), \qquad i, j \in \{1, \dots, d^2\}$$

where $O$ is the matrix in which the $i$-th row contains the object representation $o_i$. Here, $g_\theta$ calculates the relations between a pair of objects, and $f_\phi$ aggregates these relations and computes the final output of the model. $d$ refers to the width of the feature map.

For the FigureQA evaluation, the object representations are obtained from a convolutional neural network. The CNN output contains 64 feature maps, each of size $8 \times 8$. Each pixel of this output corresponds to an object $o_i$. We have 64 ($8 \times 8$) such objects, wherein each object has a 64-dimensional representation. The row and column coordinates of the pixel are appended to the corresponding object’s representation so as to include information about the location of objects inside the feature map.

The input to the relation network is the set of all pairs of object representations, each concatenated with the question encoding. The question encoding is obtained from an LSTM which has a hidden unit size of 256 in the RN baseline. $g_\theta$ processes each of the object pairs separately to produce a representation of the relation between the two objects. These relation representations are then summed and given as input to $f_\phi$, which gives the final output. For training the model, four parallel workers were used. The average of the gradients from the workers was used to update the parameters.
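The pairwise computation described above can be illustrated with a minimal Python sketch. The functions `toy_g` and `toy_f` are hypothetical stand-ins for the learned networks, and the three two-dimensional objects are made-up inputs; only the structure (apply a function to every pair plus the question, sum, then apply a second function) follows the description.

```python
def relation_network(objects, question, g, f):
    """Apply g to every ordered pair of objects (each pair concatenated with
    the question encoding), sum the results element-wise, then apply f."""
    total = None
    for o_i in objects:
        for o_j in objects:
            relation = g(o_i + o_j + question)  # list concatenation
            if total is None:
                total = list(relation)
            else:
                total = [a + b for a, b in zip(total, relation)]
    return f(total)


# Toy stand-ins for the learned networks (assumptions, not the real MLPs).
def toy_g(pair_and_question):
    return [sum(pair_and_question), max(pair_and_question)]


def toy_f(summed):
    return summed[0] - summed[1]


objects = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # 3 objects, 2 features each
question = [0.5]
output = relation_network(objects, question, toy_g, toy_f)

# With the 64 objects of the FigureQA baseline, every ordered pair is visited:
num_pairs = 64 * 64
```

Because the pair representations are summed, the output does not depend on the order in which the objects are listed. The quadratic number of pairs is also what makes the module comparatively expensive.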

4 FigureNet

In this section, we describe the FigureNet architecture, which tackles the question-answering task on bar plots and pie charts. These plots contain bars or sectors, which we refer to as plot elements. In these figure types, the plot elements are generally distinguished by their respective colours. Thus, we can recognize a plot element by identifying the colour in which it is drawn. For example, in Figure 1, the five vertical bars are drawn in five different colours. Each image represents a sequence of numeric values, and obtaining this sequence allows one to answer any relevant question. For the FigureQA dataset in particular, the absolute values are not required and the relative ordering suffices. For example, in Figure 1, the relative ordering of the five bars is [1,5,4,3,2]. Lower numbers represent lower numerical values for plot elements, and this representation allows questions involving maximum, greater than, high median, etc. to be answered easily.
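To make the claim concrete, the following sketch answers the six FigureQA question templates for bar graphs and pie charts from a relative ordering alone. The function name, the template strings and the convention for low and high median with an even number of elements are assumptions made for illustration.

```python
def answer(ordering, template, x, y=None):
    """Answer a yes/no question from the relative ordering alone.

    ordering[k] is the rank of plot element k (1 = smallest value);
    x and y are 0-based indices of plot elements.
    """
    n = len(ordering)
    low_median_rank = (n + 1) // 2   # assumption: lower of the middle ranks
    high_median_rank = n // 2 + 1    # assumption: upper of the middle ranks
    if template == "minimum":
        return ordering[x] == 1
    if template == "maximum":
        return ordering[x] == n
    if template == "low median":
        return ordering[x] == low_median_rank
    if template == "high median":
        return ordering[x] == high_median_rank
    if template == "less than":
        return ordering[x] < ordering[y]
    if template == "greater than":
        return ordering[x] > ordering[y]
    raise ValueError(f"unknown template: {template}")


ordering = [1, 5, 4, 3, 2]  # the Figure 1 example from the text
first_is_minimum = answer(ordering, "minimum", 0)
second_is_maximum = answer(ordering, "maximum", 1)
fifth_less_than_third = answer(ordering, "less than", 4, 2)
```

In the full model this lookup is not hard-coded; the final feed-forward network has to learn it from the module outputs.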

We hypothesize that the larger task of answering the questions can be solved by handling two subtasks: identifying the plot elements, and then arriving at a relative ordering of these plot elements. We employ supervised pre-training for each of the subtasks, using the annotations provided in the FigureQA dataset. The model comprises modules which are each logically intended to tackle one specific subtask.

4.1 Spectral Segregator Module

Fig. 4: Architecture of Spectral Segregator Module - The left image helps one visualize the sequence of convolution operations and the image on the right is a representation of the utilized LSTM architecture.

The purpose of this module is to identify all plot elements and the colour of each of these elements. For vertical bar graphs, the model identifies the plot elements from left to right; for horizontal bar graphs, from bottom to top; and for pie charts, in an anti-clockwise direction (starting from 0 degrees). The pre-training targets enforce the identification order for the respective plots. The module takes the figure as input and outputs the probabilities of colours for each of the plot elements. Taking advantage of the fact that the number of plot elements in bar graphs and pie charts of FigureQA is always less than 11, the module has 11 output units, where each output unit is a probability distribution over the 100 colours and a STOP label. For example, in Figure 1, the targets for the module would be [Royal Blue, Aqua, Midnight Blue, Purple, Tomato, STOP, STOP, STOP, STOP, STOP, STOP], where STOP indicates that there are no more plot elements present and Royal Blue represents a one-hot vector (a probability distribution with unit probability for the colour Royal Blue).
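A minimal sketch of how such targets could be built is shown below. The function name, the position of the STOP label as the last index and the 95 placeholder colour names are assumptions; only the five named colours and the padding to 11 outputs come from the example above.

```python
MAX_OUTPUTS = 11
STOP = "STOP"


def segregator_targets(element_colours, colour_names):
    """Return 11 one-hot target vectors over the colours plus a STOP label."""
    if len(element_colours) >= MAX_OUTPUTS:
        raise ValueError("expected fewer than 11 plot elements")
    labels = list(colour_names) + [STOP]  # assumption: STOP is the last index
    index = {name: i for i, name in enumerate(labels)}
    padded = list(element_colours) + [STOP] * (MAX_OUTPUTS - len(element_colours))
    targets = []
    for name in padded:
        one_hot = [0] * len(labels)
        one_hot[index[name]] = 1
        targets.append(one_hot)
    return targets


figure_1_colours = ["Royal Blue", "Aqua", "Midnight Blue", "Purple", "Tomato"]
# Placeholders stand in for the remaining 95 X11 colour names.
colour_names = figure_1_colours + [f"colour_{i}" for i in range(95)]
targets = segregator_targets(figure_1_colours, colour_names)
```

Each of the 11 target vectors has 101 entries, so a figure with five elements yields five colour targets followed by six STOP targets.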

Traditional convolution layers do not suffice, since they tend to aggregate the information and give an activation map that is a coarse representation of the image. Another peculiarity of the convolution operation is that the information across channels is summed. Ideally, the channel information should be kept separate. Hence we solely use convolutions followed by scaling layers and depthwise convolutions.

The input to this module is an image with dimensions . The first convolutional layer filters the input image with 64 kernels of size . This is followed by a max-pooling layer that lowers the 2D feature map dimensions to . The second, third and fourth convolution layers apply convolutions with number of filters for each layer being 64, 128 and 256 respectively. The output feature map is of dimensions . This is followed by a scaling layer that performs channel-wise multiplication of each of the 256 channels. In other words, each channel is multiplied by a scalar parameter . This operation will not change the dimensions of the feature map. The idea behind adding the convolution layers and scaling layer is that different colours have different channel values and these operations will help differentiate between the colours.

In the next layer, we perform depthwise convolutions with 30 kernels of size each. Since there are 256 channels in the feature map, each kernel produces a 256-dimensional vector, thereby giving an output with dimensions $30 \times 256$. We add two fully connected layers on top of this, with 1048 and 512 hidden units respectively, to finally output a 512-dimensional image representation. The motivation behind adding the depthwise convolutions is that each filter can be understood to aggregate the count of a particular colour, thereby quantifying the values represented by the various coloured plot elements.
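The counting intuition can be illustrated with a toy Python sketch of a per-channel scaling layer followed by a depthwise kernel. It assumes, for simplicity, that the kernel covers the whole spatial extent of a tiny 2 x 2 feature map and that each channel responds to one colour; the function names and all values are illustrative and are not the sizes used in the model.

```python
def scale_channels(feature_map, scales):
    """Multiply every value of each channel by that channel's own scalar."""
    return [
        [[value * scale for value in row] for row in channel]
        for channel, scale in zip(feature_map, scales)
    ]


def depthwise_full(feature_map, kernel):
    """Apply one spatial kernel to each channel separately.

    Channels are never summed together, so the result has one number per
    channel rather than one number overall.
    """
    return [
        sum(k * v for k_row, v_row in zip(kernel, channel) for k, v in zip(k_row, v_row))
        for channel in feature_map
    ]


# Two channels over a 2 x 2 map: channel 0 fires on three pixels, channel 1 on one.
feature_map = [
    [[1.0, 1.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
ones_kernel = [[1.0, 1.0], [1.0, 1.0]]

pixel_counts = depthwise_full(feature_map, ones_kernel)
scaled_counts = depthwise_full(scale_channels(feature_map, [0.5, 2.0]), ones_kernel)
```

With an all-ones kernel the per-channel outputs are simply the number of active pixels in each channel, which is proportional to the area of a bar or sector of that colour.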

Finally, to output the colour probabilities for each plot element, we use a modified version of a two-layered LSTM network. The architecture can be seen in Figure 4. The 512-dimensional image representation is the initial state that is input to the LSTM. The output at every time step is a probability distribution over the 100 colours and the STOP label. The output at time step $t$ gives the probability of colours for the $t$-th plot element. In order to mitigate the differences between the training and testing phases, the output probabilities at time step $t-1$ are given as input to the LSTM at time step $t$. This is different from a traditional LSTM, in which the output is sampled from the probabilities at time step $t-1$ and then given as input at time step $t$, i.e. we do away with the sampling. This also allows propagating gradients from the input at time step $t$ to the output of time step $t-1$. The input at time step 1 is a 101-dimensional parameter that is learned by the network. The motivation behind using an LSTM comes mainly from the fact that the number of plot elements in a figure is not fixed, and we found that using an LSTM performs better than predicting the 11 outputs in one go. If $h^1_{t-1}$ and $h^2_{t-1}$ are the hidden states at time step $t-1$ for the first and second layers respectively, the equations for finding the output probabilities at time step $t$ are given below:
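The feedback of probabilities instead of sampled labels can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_step` is a hypothetical tanh update standing in for the two-layered LSTM, and three classes are used instead of 101. Only the loop structure, in which the softmax output of one step becomes the input of the next, reflects the description.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(step, initial_state, first_input, num_steps):
    """Run the recurrence, feeding each step's probabilities to the next step."""
    state, previous = initial_state, first_input
    outputs = []
    for _ in range(num_steps):
        state, logits = step(state, previous)
        previous = softmax(logits)  # no sampling or argmax between steps
        outputs.append(previous)
    return outputs


def toy_step(state, previous_probs):
    """Hypothetical stand-in for the two-layered LSTM: a simple tanh update."""
    new_state = [math.tanh(s + p) for s, p in zip(state, previous_probs)]
    logits = [2.0 * s for s in new_state]
    return new_state, logits


initial_state = [0.0, 0.0, 0.0]  # plays the role of the image representation
first_input = [0.1, 0.2, 0.3]    # plays the role of the learned first input
outputs = decode(toy_step, initial_state, first_input, num_steps=4)
```

Since every operation between consecutive steps is a smooth function, gradients can flow from a later step back through the probabilities of an earlier one, which a sampled one-hot input would block.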

4.2 Order Extraction Module

This module identifies and quantifies the statistical value of each plot element, and then sorts these values into a linear order. Since the number of plot elements in bar graphs and pie charts of FigureQA is always less than 11, the possible positions in the sorted order are [1,2,3,4,5,6,7,8,9,10], where lower numbers represent lower statistical values, with 0 reserved as the order for plot elements that are absent. For example, in Figure 1, the targets for the Order Extraction module would be the one-hot encodings of [1,5,4,3,2,0,0,0,0,0,0] (i.e. each element is a one-hot vector). The module takes the image as input and outputs, for each plot element, the probabilities of its position in the sorted order. We observed that the final feed-forward network learns to ignore the output probabilities for the plot elements which are absent.
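A small sketch of how these targets could be derived from annotated values is given below. The bar heights are made-up numbers chosen to reproduce the Figure 1 ordering, and breaking ties by position is an assumption.

```python
def order_targets(values, max_outputs=11):
    """Rank the values (1 = smallest) and pad with 0 for absent elements."""
    by_value = sorted(range(len(values)), key=lambda k: values[k])  # stable sort
    ranks = [0] * len(values)
    for position, k in enumerate(by_value, start=1):
        ranks[k] = position
    return ranks + [0] * (max_outputs - len(values))


def one_hot(labels, num_classes=11):
    """Encode each rank (0 to 10) as a one-hot vector."""
    return [[1 if c == label else 0 for c in range(num_classes)] for label in labels]


bar_heights = [10, 90, 70, 50, 30]  # hypothetical values for five bars
ranks = order_targets(bar_heights)
targets = one_hot(ranks)
```

Here `ranks` evaluates to `[1, 5, 4, 3, 2, 0, 0, 0, 0, 0, 0]`, matching the example in the text.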

The architecture of this module is almost the same as that of the Spectral Segregator module, except that it has three fully connected layers with 2048, 1024 and 512 hidden units respectively after the depthwise convolutions. The output of the two-layered LSTM network at each time step is a probability distribution over the 11 possible relative ordering values (0 to 10). The additional parameters are required to perform the heavy lifting of the sorting operation.

4.3 Final Feed-forward network

Fig. 5: Architecture of final feedforward network

We concatenate the output probabilities from the 11 time steps of the Spectral Segregator and Order Extraction modules. Thus, we get a 1232-dimensional figure representation ($11 \times 101 + 11 \times 11$). We use the output probabilities instead of sampling the outputs so that we can backpropagate gradients through these modules when the entire network is trained end-to-end. The question representation consists of two parts: the question encoding and the question-colour encoding. The question encoding is produced by passing the question through an LSTM with 256 hidden units. The question is passed to the LSTM as a sequence of words (each represented as a one-hot vector). The question-colour encoding is obtained by concatenating the 100-dimensional one-hot vector of the first colour in the question with the 101-dimensional one-hot vector (100 colours plus one label for no second colour) of the second colour in the question. The question encoding and question-colour encoding together form the question representation. The figure representation is concatenated with the question representation and given as input to the feed-forward neural network.
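The sizes stated above imply the following input width for the feed-forward network. The arithmetic below is a worked check with illustrative variable names; it assumes that the colour distribution at each time step has 101 entries (100 colours plus STOP), as described for the Spectral Segregator module.

```python
NUM_STEPS = 11
COLOUR_CLASSES = 101  # 100 colours + STOP
ORDER_CLASSES = 11    # ranks 0 to 10

figure_dim = NUM_STEPS * COLOUR_CLASSES + NUM_STEPS * ORDER_CLASSES

QUESTION_ENCODING = 256  # LSTM hidden units
FIRST_COLOUR = 100       # one-hot over the 100 colours
SECOND_COLOUR = 101      # 100 colours + "no second colour"

question_dim = QUESTION_ENCODING + FIRST_COLOUR + SECOND_COLOUR
input_dim = figure_dim + question_dim
```

Under these assumptions the figure representation has 1232 entries, the question representation 457, and the concatenated input 1689.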

The feed-forward network has four hidden layers and one output layer. The hidden layers have 1024, 512, 256 and 256 hidden units respectively, and the output layer has a single unit. The activations are ReLU for the hidden layers and sigmoid for the output layer. The architecture is shown in Figure 5.

5 Experiments

The training set contains 60,000 images, with 20,000 each for vertical bar graphs, horizontal bar graphs and pie charts. The validation set in FigureQA contains 12,000 images. Since the test set for FigureQA has not been released to date, we split the validation set (with the same colour scheme as training) into 4500 images (1500 vertical bars + 1500 horizontal bars + 1500 pie charts) used for validation and the remaining 7500 images used for testing. For the supervised pre-training task, the targets for the modules are generated from the annotations provided for each image in the FigureQA dataset.
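One way to realise such a split is sketched below. The assumptions are that the 12,000 validation images are spread evenly over the three figure types (4000 each) and that the 1500 validation images per type are drawn at random with a fixed seed; the image identifiers are placeholders.

```python
import random

FIGURE_TYPES = ("vertical_bar", "horizontal_bar", "pie")


def split_validation(images_by_type, per_type_validation=1500, seed=0):
    """Take a fixed number of images per figure type for validation and
    keep the rest for testing."""
    rng = random.Random(seed)
    validation, held_out_test = [], []
    for figure_type in FIGURE_TYPES:
        shuffled = list(images_by_type[figure_type])
        rng.shuffle(shuffled)
        validation.extend(shuffled[:per_type_validation])
        held_out_test.extend(shuffled[per_type_validation:])
    return validation, held_out_test


# Placeholder identifiers: 4000 images per figure type (an assumption).
images_by_type = {t: [f"{t}_{i}" for i in range(4000)] for t in FIGURE_TYPES}
validation, held_out_test = split_validation(images_by_type)
```

The two resulting lists contain 4500 and 7500 images and share no image.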

5.1 Training Specifics

For pre-training the modules, we use a cross-entropy loss between the softmax output probabilities at each time step and the one-hot targets generated from the annotations. For the question-answering task, we use a sigmoid cross-entropy loss on the output unit of the feed-forward network.
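The two losses can be written out in plain Python as follows. This is a sketch for illustration: the small `eps` constant, the summation over time steps (rather than averaging) and the function names are assumptions.

```python
import math


def cross_entropy(probabilities, one_hot_target, eps=1e-12):
    """Cross entropy between a predicted distribution and a one-hot target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probabilities, one_hot_target))


def sequence_loss(step_probabilities, step_targets):
    """Pre-training loss for a module: sum of the per-time-step cross entropies."""
    return sum(cross_entropy(p, t) for p, t in zip(step_probabilities, step_targets))


def sigmoid_cross_entropy(logit, label):
    """Binary cross entropy on a raw logit, in a numerically stable form:
    max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


uniform_two_class = cross_entropy([0.5, 0.5], [1, 0])
undecided_answer = sigmoid_cross_entropy(0.0, 1)
two_step_loss = sequence_loss([[0.5, 0.5], [0.5, 0.5]], [[1, 0], [0, 1]])
```

Both `uniform_two_class` and `undecided_answer` equal log 2, the loss of a prediction that carries no information about a binary outcome.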

The first step is the supervised pre-training of the Spectral Segregator and Order Extraction modules. The learning rate is 0.00025 and we train each of the modules for 70 epochs. Subsequently, the parameters of the modules are fixed and the final feed-forward network is trained on the question-answering task for 50 epochs with a learning rate of 0.00025. Finally, the learning rate is lowered to 0.000025 and the entire architecture (including the modules) is trained end-to-end for 50 epochs.
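The three-stage schedule can be summarised as a small configuration structure. The dictionary keys, component names and helper function are illustrative assumptions; the epochs and learning rates are those stated above.

```python
MODULES = ("spectral_segregator", "order_extraction")
ALL_COMPONENTS = MODULES + ("feed_forward",)

SCHEDULE = [
    {
        "stage": "pre-train modules on their sub-tasks",
        "trainable": MODULES,
        "epochs": 70,  # per module
        "learning_rate": 0.00025,
    },
    {
        "stage": "train feed-forward network with modules fixed",
        "trainable": ("feed_forward",),
        "epochs": 50,
        "learning_rate": 0.00025,
    },
    {
        "stage": "end-to-end fine-tuning",
        "trainable": ALL_COMPONENTS,
        "epochs": 50,
        "learning_rate": 0.000025,
    },
]


def not_updated(stage):
    """Components whose parameters receive no updates in the given stage."""
    return [c for c in ALL_COMPONENTS if c not in stage["trainable"]]


fixed_in_second_stage = not_updated(SCHEDULE[1])
```

Laying the schedule out this way makes the tenfold reduction in learning rate for the final end-to-end stage easy to see.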

5.2 Results

Table 1 compares the performance of CNN + LSTM, Relation Networks, FigureNet and a human baseline. These numbers are obtained on a subset of the test set (as reported by Kahou et al. [2017]). The CNN + LSTM baseline is a simple architecture that concatenates the representation of an image, obtained by passing it through a CNN, with the representation of the text, obtained by passing it through an LSTM. This concatenated representation is passed through feed-forward layers to obtain the answer. The RN baseline is identical to that described in Section 3.2.

Model         | Accuracy (%)
CNN + LSTM    | 59.94
RN (Baseline) | 77.33
Our Model     | 83.95
Human         | 93.29
Table 1: Accuracy (%)

Our model outperforms the baselines on all three figure types (see Table 2). It comes closest to human performance on pie charts, although its highest absolute accuracy is on vertical bar graphs.

Figure Type    | CNN + LSTM | RN (Baseline) | Our Model | Human
Vertical Bar   | 60.84      | 77.53         | 87.36     | 95.90
Horizontal Bar | 61.06      | 75.76         | 81.57     | 96.03
Pie Chart      | 57.91      | 78.71         | 83.13     | 88.26
Table 2: Accuracy (%) per figure type

We observed that model performance is close to human performance for questions on maximum, minimum and comparison of plot elements. From Table 3, one observes that the model struggles with questions on the low median and high median which is also the case for humans.

Template              | CNN + LSTM | RN (Baseline) | Our Model | Human
Is X the minimum?     | 60.12      | 75.55         | 89.31     | 97.06
Is X the maximum?     | 64.70      | 89.29         | 89.88     | 97.18
Is X the low median?  | 54.87      | 68.94         | 73.58     | 86.39
Is X the high median? | 55.83      | 69.37         | 73.17     | 86.91
Is X less than Y?     | 62.31      | 80.63         | 89.30     | 96.15
Is X greater than Y?  | 62.07      | 80.85         | 89.20     | 96.15
Table 3: Accuracy (%) per question type

5.3 Ablation Studies

We perform an ablation analysis to highlight the importance of the different components of the model.

Effect of modification to LSTM

In the two-layered LSTMs present in each of the modules, the output probabilities at time step $t-1$ are given as input at time step $t$. This is a modification of the standard approach, where the output is sampled from the output probabilities at time step $t-1$ and the sampled one-hot vector is fed as input at time step $t$. The disadvantage of the standard approach is the discrepancy between the training and testing phases. Instead, we directly feed the output probabilities of the previous time step as input to the current time step. An improvement in performance is seen when the sampling step is avoided (see Table 4).

Model                  | Accuracy (%)
Our Model              | 83.95
Model with normal LSTM | 81.61
Table 4: Comparing LSTM training methods

Importance of question colour encoding

We perform another ablation study in which we train a model without the question-colour encoding as input to the final feed-forward network. The training process is the same as given in Section 5.1, except that the final end-to-end training is done for 100 epochs. We show that the model is robust enough to learn the mappings between the one-hot values of colours and the colour names in the question when the one-hot values of the question colour names are not given as input. We observe a drop in accuracy of only 1.33 percentage points, as indicated in Table 5.

Model                               | Accuracy (%)
Our Model                           | 83.95
Model without question colour input | 82.62
Table 5: Effect of additional colour input encoding

Effect of using two layered LSTM

Finally, we investigate the effect of using a two-layered LSTM. We train another model that uses a single-layer LSTM. We observe a drop of 8.66 percentage points in accuracy (from 83.95 to 75.29), as shown in Table 6, which suggests the greater representational capacity of a two-layered LSTM. The drop in performance of the Order Extraction module was much higher than that of the Spectral Segregator module, indicating that the second layer of the LSTM is essential for the sorting sub-task.

Model                   | Accuracy (%)
Our Model               | 83.95
Model with 1 layer LSTM | 75.29
Table 6: Comparing performances of different LSTM depths
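The three ablations can be compared side by side with a few lines of arithmetic on the accuracies reported in Tables 4 to 6. The dictionary labels are descriptive names chosen here for illustration.

```python
FULL_MODEL = 83.95

ABLATIONS = {
    "sampled LSTM inputs (Table 4)": 81.61,
    "no question colour encoding (Table 5)": 82.62,
    "single-layer LSTM (Table 6)": 75.29,
}

# Drop in accuracy, in percentage points, relative to the full model.
drops = {name: round(FULL_MODEL - accuracy, 2) for name, accuracy in ABLATIONS.items()}
largest_drop = max(drops, key=drops.get)
```

The drops are 2.34, 1.33 and 8.66 percentage points respectively, so reducing the LSTM to a single layer hurts the most among the three changes.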

6 Extending beyond Synthetic Figures

Real-life scientific figures need not have a fixed mapping between a plot element's colour and its name, since the plot elements can be coloured arbitrarily in each figure. Hence, there is a need to identify the plot element names from the axis or legend of the figure. Here, we give an approach for extending the current modules to real-life figures:

  1. The bounding box annotations, as shown in Figure 2, can be used to train a detection model. This model detects the bounding boxes around the plot element names on the axis or legend.

  2. Optical Character Recognition (OCR) can be used to get the plot element names from the detected bounding boxes. The detection model + OCR replaces the Spectral Segregator module that we used earlier.

  3. The Order Extraction module can be used as is, to obtain the relative ordering of plot elements.

  4. The figure representation is formed by concatenating the word embeddings of plot element names obtained, with the outputs from Order Extraction module.

  5. This figure representation, combined with the question encoding, can be used for the final question answering task on realistic scientific plots/figures.
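The steps above can be sketched as a composition of interchangeable components. Everything in this block is hypothetical: the four callables stand in for a detection model, an OCR engine, a word-embedding lookup and the Order Extraction module, and the stub implementations only exist to make the example runnable.

```python
def figure_representation(image, detect_label_boxes, read_text, embed_word,
                          order_probabilities):
    """Build a figure representation from names read off the figure (steps 1
    and 2) and the ordering predicted for the plot elements (step 3)."""
    boxes = detect_label_boxes(image)                 # step 1: detection
    names = [read_text(image, box) for box in boxes]  # step 2: OCR
    ordering = order_probabilities(image)             # step 3: Order Extraction
    representation = []
    for name in names:                                # step 4: concatenate
        representation.extend(embed_word(name))
    for step_probabilities in ordering:
        representation.extend(step_probabilities)
    return names, representation


# Stub components, for illustration only.
def stub_detector(image):
    return [(0, 0, 10, 5), (20, 0, 30, 5)]


def stub_ocr(image, box):
    return "cats" if box[0] == 0 else "dogs"


def stub_embedding(word):
    return [float(len(word)), 1.0]


def stub_order_module(image):
    return [[0.1, 0.9], [0.8, 0.2]]


names, representation = figure_representation(
    "figure.png", stub_detector, stub_ocr, stub_embedding, stub_order_module
)
```

The resulting vector would then be concatenated with the question encoding (step 5) and passed to a feed-forward network, as in the synthetic setting.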

7 Conclusion and Future Work

In this work, we proposed a novel architecture for question answering on bar graphs and pie charts. The model aims to tackle visual and numeric reasoning with modular components. We formulated supervised pre-training tasks to train simpler modules and then combined these modules to solve the question answering task. We ensure that each of the modules is differentiable so that once we incorporate the pre-trained modules into our network, the entire architecture can be trained end-to-end.

Our model performs significantly better than the state-of-the-art Relation Networks baseline and the CNN+LSTM baseline. We show improvements in accuracy for each figure type and question type bridging the gap towards human-level performance. We also obtain significant improvements in training time as our model takes 2 days to train on 1 GPU compared to RNs which required around 2 weeks on 4 GPUs.

In future work, we intend to extend the approach to question answering on line plots and to improve the performance on low-median and high-median questions. A more ambitious extension is to tackle a wider variety of question-answering tasks on real-life scientific figures. Another line of work is to make the current model colour-agnostic, in order to test the model on unseen plot colour combinations.


  1. This work was presented at the 1st Workshop on Humanizing AI (HAI) at IJCAI’18 in Stockholm, Sweden.
  2. https://datasets.maluuba.com/FigureQA


  1. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
  2. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  3. Hedi Ben-younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. MUTAN: Multimodal Tucker fusion for visual question answering. In The IEEE International Conference on Computer Vision (ICCV), volume 1, page 3, 2017.
  4. François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, 2016.
  5. Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, pages 153–160, 2009.
  6. Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
  7. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc., 2015.
  8. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  9. Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. CoRR, abs/1704.05526, 3, 2017.
  10. Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
  11. Samira Ebrahimi Kahou, Adam Atkinson, Vincent Michalski, Akos Kadar, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017.
  12. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  13. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  14. Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A deep learning approach to visual question answering. International Journal of Computer Vision, 125(1-3):110–135, 2017.
  15. Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471, 2016.
  16. Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
  17. Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In Advances in neural information processing systems, pages 2953–2961, 2015.
  18. Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
  19. Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4974–4983, 2017.