Textually Enriched Neural Module Networks for Visual Question Answering

Khyathi Chandu, Mary Arpita Pyreddy, Matthieu Felix, Narendra Nath Joshi

Language Technologies Institute, School of Computer Science
Carnegie Mellon University, Pittsburgh, PA

{kchandu, mpyreddy, matthief, nnj}@andrew.cmu.edu
Abstract

Problems at the intersection of language and vision, like visual question answering, have recently been gaining a lot of attention in the field of multi-modal machine learning as computer vision research moves beyond traditional recognition tasks. There has been recent success in visual question answering using deep neural network models which use the linguistic structure of the questions to dynamically instantiate network layouts. In the process of converting the question to a network layout, the question is simplified, which results in loss of information in the model. In this paper, we enrich the image information with textual data using image captions and external knowledge bases to generate more coherent answers. We achieve 57.1% overall accuracy on the test-dev open-ended questions from the visual question answering (VQA 1.0) real image dataset.

 


1 Introduction

Visual question answering (VQA), introduced by Antol et al. (2015), has generated a lot of interest in recent computer vision research. Given an image and a natural language question about the image, the task of VQA is to provide an accurate natural language answer (for instance, a simple question could be “How many people are in the picture?”). Achieving consistently good performance in VQA could have a tremendous impact on both the research community and society in general: it could, for instance, assist visually impaired people in daily navigation, and it could lead us to deeper understanding and better representations for other computer vision tasks such as image search. We believe it can have a high impact on multi-modal conversational agent research as well.

Andreas et al. (2016b) and Andreas et al. (2016a) propose neural module networks (NMN) for VQA where a network layout is generated by putting together neural modules based on a natural language dependency parse (that is, a tree structure representing relationship between the words) of the question. One limitation in this model is that some of the information present in the question is destroyed when the question is converted to a network layout: NMNs obtain better results with short parse trees, which represent only the main elements of the original question.

In this paper, we propose two approaches to textually enrich NMNs for VQA by leveraging the image captions. The first approach is to incorporate image caption (see Karpathy and Fei-Fei (2015), for instance) information into the model, making the model resilient to information deletion stemming from incorrect or simplified parses. The second approach is a modification of the first approach where we attend to the caption to pick only the useful parts instead of using the caption as a whole. This is to ensure that irrelevant captions do not introduce noise into the system. We also propose a third approach where we leverage information from external knowledge sources to provide better answers to questions that might benefit from additional knowledge. Additionally, we implement the Measure module as proposed by Andreas et al. (2016b), predominantly for answering yes/no questions.

We first review related work in the next section, then propose our approaches, describe our experimental setup, and report and analyze our results. Finally, we conclude by discussing some possible future directions.

2 Related Work

2.1 Visual Question Answering

We review here the approaches that we deem most representative of current work in VQA.

Late Fusion

Antol et al. (2015), Malinowski et al. (2015), and Gao et al. (2015) use a model where a long short-term memory network (LSTM, Hochreiter and Schmidhuber (1997)) and a convolutional neural network (CNN, LeCun et al. (1989)), both pre-trained, run independently on the questions and the images, respectively. The image embedding is transformed to a smaller one (for instance, 1024 dimensions) by a fully connected layer with a nonlinearity. The outputs of the two networks are fused via elementwise multiplication, and a final fully connected softmax layer is added. Ren et al. (2015) perform asymmetric fusion by feeding the output of the CNN into the LSTM as though it were the first word of the sentence. This performs somewhat better than naive late fusion.
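As an illustration of this family of models, the following is a minimal late-fusion sketch in PyTorch. It is our own rendering, not the cited authors' code; the word-embedding size, the 1024-dimensional fusion space, the 1000-answer output, and the use of the final LSTM state are all illustrative assumptions.

import torch
import torch.nn as nn

class LateFusionVQA(nn.Module):
    """Minimal late-fusion sketch: question LSTM and projected image features
    are fused by elementwise multiplication, then classified over candidate answers."""
    def __init__(self, vocab_size, img_feat_dim=4096, hidden_dim=1024, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden_dim, batch_first=True)
        self.img_proj = nn.Sequential(nn.Linear(img_feat_dim, hidden_dim), nn.Tanh())
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_tokens, image_features):
        # question_tokens: (batch, seq_len) word ids; image_features: (batch, img_feat_dim)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q = h_n[-1]                        # final LSTM state as question embedding
        v = self.img_proj(image_features)  # image embedding mapped to the same size
        fused = q * v                      # elementwise multiplication (late fusion)
        return self.classifier(fused)      # logits over candidate answers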

Attention Models

This family of models assigns weights to different parts of the image in order to filter out irrelevant information that could introduce noise into the answer. The model developed by Shih et al. (2016) does this by computing dot products between text features extracted from the question and region-by-region features extracted by a CNN. The model then weighs the information from each region by the corresponding dot-product value to produce a final answer (the information for each region is generated much as in the late-fusion approach).
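A minimal sketch of this region-weighting idea is given below in plain NumPy; the shapes and the absence of learned projections are simplifying assumptions (the actual models learn the projections end to end).

import numpy as np

def attend_over_regions(question_feat, region_feats):
    # question_feat: (d,) question embedding; region_feats: (num_regions, d) CNN region features
    scores = region_feats @ question_feat                  # one relevance score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over regions
    return (weights[:, None] * region_feats).sum(axis=0)   # attention-weighted image feature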

Lu et al. (2016) and Nam et al. (2016) argue that it is equally important to model “which words to listen to” (question attention). They present a co-attention model for VQA that jointly performs image and question attention. In addition, their model attends to the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a 1-dimensional CNN. Nam et al. (2016) show that Dual Attention Networks (DANs) jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language, and that DANs attend to specific regions in images and words in text through multiple steps and gather essential information from both modalities.

Fukui et al. (2016) perform fusion between the visual and text modalities by computing the tensor product of the two feature vectors. Since this operation would create a very large $d^2$-sized vector for $d$-sized input vectors, several tricks are used to compute a smaller approximation of this product, as suggested by Gao et al. (2016). The general architecture of the model is then set up as in most attention models, with this compact bilinear pooling used to generate both the attention maps and the final features.

Addition of a Knowledge Base

Wu et al. (2016) have proposed using an external knowledge base in a VQA system. Their approach is to obtain attributes from an image using an image labeling model (Åström et al. (2016)), query an external knowledge base, and use that information, along with image features generated from a CNN, to seed the initial hidden state of an LSTM. The question is then fed to the LSTM and an answer is generated.

2.2 Image Captioning

Image captioning is closely related to VQA: both tasks try to generate textual data from visual inputs, and captions have been used to provide supplemental information in VQA models (as in Wu et al. (2016)). In particular, Karpathy and Fei-Fei (2015) have proposed a widely used captioning model composed of two main parts. In the first part, a CNN is used to obtain image region embeddings, an RNN is used to obtain a caption representation, and an alignment between them is computed using a Markov random field. The aligned pairs are then fed to the second part, which uses an RNN to generate a caption for each region of the image, using features computed from that region with a CNN as the initial hidden state of the RNN. Xu et al. (2015) propose another model where image features are generated with a CNN and passed to an LSTM with attention to generate the caption.

2.3 Counting with Neural Networks

VQA models generally perform significantly worse on counting questions than on other types of questions (this is the case in all reviewed prior work: Antol et al. (2015), Shih et al. (2016), Gao et al. (2016), and Andreas et al. (2016b), for instance). Yet approaches to counting with neural networks have been proposed. Seguí et al. (2015) use a straightforward model to count even digits or pedestrians in images. Their system uses two or more convolutional layers, followed by one or several fully connected layers.

3 Proposed Approach

In this work, we extend the baseline of neural module networks (NMN) proposed by Andreas et al. (2016b). This model is trained with (question, image, answer) triples, from which it dynamically learns to assemble a neural network in which individual modules perform specific tasks. The modules to use are selected based on the dependency parse tree of the question (this work uses the Stanford dependency parser (Chen and Manning, 2014) for this purpose) and on the type of the question word (i.e., the first or first few words in the question). Essentially, this combines the good performance of neural networks in image recognition and captioning with the power of classical NLP methods, by assembling the network using linguistic information.

Specifically, a question is mapped to a logical representation of meaning, from which all the nouns, verbs and prepositional phrases that are directly related to the ‘wh’ word (“what”, “where”, “how”, “how many”, etc.) are collected. The common nouns and verbs are mapped to a find module. The modules can then be combined using and modules, and a measure or a describe module is inserted at the top. Table 1 lists the roles and implementations of these modules.

Module | Description | Input | Output
Find | Convolves every position in the input image with a weight vector to produce an attention map. | Image | Attention
And | Merges two attention maps into a single attention map using an elementwise product. | Attention x Attention | Attention
Describe | Computes an average over image features weighted by the attention, then passes the result through a single fully connected layer. | Image x Attention | Label
Measure | Passes the attention through a fully connected layer, a ReLU nonlinearity, another fully connected layer, and a softmax to generate a distribution over labels. | Attention | Label

Table 1: Different NMN modules
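For concreteness, the following is a minimal sketch of these modules in PyTorch. It is our own illustrative rendering, not the authors' ApolloCaffe implementation; in particular, the sigmoid on the attention map, the attention-normalized average in Describe, and the layer sizes are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Find(nn.Module):
    """1x1 convolution over the image feature map -> single-channel attention map."""
    def __init__(self, feat_channels):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels, 1, kernel_size=1)
    def forward(self, image_feats):                   # (batch, C, H, W)
        return torch.sigmoid(self.conv(image_feats))  # (batch, 1, H, W)

class And(nn.Module):
    """Merge two attention maps with an elementwise product."""
    def forward(self, att1, att2):
        return att1 * att2

class Describe(nn.Module):
    """Attention-weighted average of image features, then a linear layer over labels."""
    def __init__(self, feat_channels, num_labels):
        super().__init__()
        self.fc = nn.Linear(feat_channels, num_labels)
    def forward(self, image_feats, att):
        weighted = (image_feats * att).sum(dim=(2, 3)) / (att.sum(dim=(2, 3)) + 1e-8)
        return self.fc(weighted)                      # label logits

class Measure(nn.Module):
    """FC -> ReLU -> FC -> softmax over labels, from the attention map alone."""
    def __init__(self, map_size, hidden, num_labels):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(map_size, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_labels))
    def forward(self, att):                           # (batch, 1, H, W)
        return F.softmax(self.net(att.flatten(1)), dim=-1)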

In any case, answer prediction is formulated as a classification problem in which we select the answer from the 2,000 most common answers encountered in the training dataset. This approach (with varying sizes for the number of answers considered) is very common in prior work, and it works well because the majority of answers are short and occur several times in the dataset.
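A minimal sketch of how such an answer vocabulary can be built is shown below; the lowercasing and the handling of out-of-vocabulary answers are assumptions rather than details reported in the paper.

from collections import Counter

def build_answer_vocab(train_answers, top_k=2000):
    """Keep the top_k most frequent training answers as classification targets;
    anything outside this list is ignored (or mapped to an <unk> class) at training time."""
    counts = Counter(a.strip().lower() for a in train_answers)
    answers = [a for a, _ in counts.most_common(top_k)]
    answer_to_idx = {a: i for i, a in enumerate(answers)}
    return answers, answer_to_idx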

The main contribution of our work is the enrichment of the features through text. We incorporate this information in three ways: two depend primarily on information from captions, and the third incorporates information from external knowledge bases. The caption information is obtained from the pre-trained image captioning model of Karpathy and Fei-Fei (2015), which is described in the related work section and is trained on the MSCOCO dataset (Lin et al., 2014).

We now describe the three proposed approaches in detail. In the following, let $w^q_i$ represent the word embedding for the $i$-th question word and let $n_q$ be the maximum number of words in a question in that batch. Hence the entire question is represented as $Q = [w^q_1, w^q_2, \ldots, w^q_{n_q}]$.

In a similar way, let the maximum number of words in a caption be $n_c$ and let $w^c_j$ represent the word embedding of the $j$-th caption word. Hence the caption is represented as $C = [w^c_1, w^c_2, \ldots, w^c_{n_c}]$.

3.1 Caption Information

In this subsection, we describe the methodology for incorporating the entire caption information to assist the prediction of NMNs. Figure 1(a) represents the architecture of this approach.

(a) Textual enrichment of NMN through caption information
(b) Textual enrichment of NMN through attention on captions
Figure 1: Schematic representation of our two captioning-based approaches

$Q$ is processed through a single-layer LSTM and a fully connected layer, from which a question context vector $q$ is obtained. The same procedure is applied to $C$, using another LSTM and a fully connected layer, to obtain the caption context vector $c$. The image features along with the dependency parse are provided as input to the NMN; let the output vector of the NMN be $v_{nmn}$. An elementwise addition is performed on $v_{nmn}$, the question context vector, and the caption context vector, followed by a rectified linear unit (ReLU) nonlinearity and finally another fully connected layer (with $W_a$ as the weight matrix). To obtain the final answer distribution, we compute a softmax over this vector; call the output $p_{ans}$. The final answer is the word corresponding to the maximum value in this vector. These steps are represented by the following equations:

$h = \mathrm{ReLU}(v_{nmn} + q + c)$

$p_{ans} = \mathrm{softmax}(W_a h)$
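To make the fusion step concrete, the following is a minimal sketch of this head in PyTorch; it is not the authors' ApolloCaffe code, and the hidden sizes, the assumption that $v_{nmn}$ already has the same dimensionality as $q$ and $c$, and the use of the final LSTM state as the context vector are all illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionFusionHead(nn.Module):
    """Fuses the NMN output with question and caption context vectors
    by elementwise addition, ReLU, a fully connected layer, and a softmax."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.q_lstm = nn.LSTM(300, dim, batch_first=True)
        self.c_lstm = nn.LSTM(300, dim, batch_first=True)
        self.q_fc = nn.Linear(dim, dim)
        self.c_fc = nn.Linear(dim, dim)
        self.w_a = nn.Linear(dim, num_answers)    # W_a in the text

    def forward(self, v_nmn, question_embs, caption_embs):
        # question_embs / caption_embs: (batch, seq_len, 300) word embeddings
        _, (hq, _) = self.q_lstm(question_embs)
        _, (hc, _) = self.c_lstm(caption_embs)
        q = self.q_fc(hq[-1])                     # question context vector
        c = self.c_fc(hc[-1])                     # caption context vector
        h = F.relu(v_nmn + q + c)                 # elementwise addition + ReLU
        return F.softmax(self.w_a(h), dim=-1)     # distribution over answers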

3.2 Caption Attention

With the previous approach, we notice that there are cases where the entire caption may not be helpful in answering the question being asked. Instead, attending to only the necessary information in the caption, conditioned on the question and the prediction from the NMN, and backpropagating the error end to end can help the model localize the answer. Figure 1(b) outlines the architecture of this approach.

In order to compute attention over the caption words, the caption word embeddings and the question context vector are individually passed through a fully connected layer with a nonlinear activation function $f$. The weight matrices used for the caption word embeddings ($C$) and the question context vector ($q$) are $W_c$ and $W_q$, respectively. An elementwise product is then performed on the two resulting vectors. The hidden representation is:

$h = f(W_c C) \odot f(W_q q)$

The elementwise multiplication lets the backpropagated error downweight caption words that are not important with respect to the question. The projected space is then brought back to the dimensions of the caption length, and a normalized probability distribution is computed over this vector using a softmax function (with $W_{att}$ as the weight parameter) to obtain the attention vector $\alpha$ over the caption words. The caption attention context vector $c_{att}$ is then the average of the caption word embeddings weighted by the attention, as shown below:

$\alpha = \mathrm{softmax}(W_{att} h)$

$c_{att} = \sum_{j=1}^{n_c} \alpha_j \, w^c_j$

$v_{nmn}$, $q$, and the caption attention context vector $c_{att}$ are added elementwise, followed by a ReLU nonlinearity and a fully connected layer (with $W_a$ as the weight matrix), over which we obtain the probability distribution over the answer space using a softmax function, as before:

$p_{ans} = \mathrm{softmax}(W_a\,\mathrm{ReLU}(v_{nmn} + q + c_{att}))$
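The sketch below shows one way this attention head could be implemented in PyTorch; the use of tanh as the nonlinearity $f$, the extra projection of the attended caption into the fusion space, and all layer sizes are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionAttentionHead(nn.Module):
    """Attends over caption word embeddings conditioned on the question context
    vector, then fuses the attended caption with the NMN output and the question."""
    def __init__(self, emb_dim, q_dim, hidden, num_answers):
        super().__init__()
        self.w_c = nn.Linear(emb_dim, hidden)     # applied to every caption word
        self.w_q = nn.Linear(q_dim, hidden)       # applied to the question context
        self.w_att = nn.Linear(hidden, 1)         # scores each caption word
        self.proj = nn.Linear(emb_dim, q_dim)     # map attended caption to fusion space
        self.w_a = nn.Linear(q_dim, num_answers)

    def forward(self, v_nmn, q, caption_embs):
        # caption_embs: (batch, n_c, emb_dim); q, v_nmn: (batch, q_dim)
        h = torch.tanh(self.w_c(caption_embs)) * torch.tanh(self.w_q(q)).unsqueeze(1)
        alpha = F.softmax(self.w_att(h).squeeze(-1), dim=-1)     # attention over caption words
        c_att = (alpha.unsqueeze(-1) * caption_embs).sum(dim=1)  # weighted caption average
        fused = F.relu(v_nmn + q + self.proj(c_att))
        return F.softmax(self.w_a(fused), dim=-1)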

3.3 External Knowledge Sources

A fraction of the questions in the dataset seem to call for some external general knowledge. Two examples are given in Figure 2. In 2(a), the top answer (“meat”) is present in the Wikipedia abstract of the article titled “Pizza”. In 2(b), one can easily argue that the image is not necessary to answer the question, while access to information about what a helmet is for is useful for giving a direct answer. While it is difficult to objectively evaluate how many questions can benefit from an external knowledge source, a cursory look at 100 images from the VQA 1.0 dataset suggests that about one image in every fifteen would be helped by the addition of general knowledge.

(a) Question: What toppings are on the pizza? Answers: meat; onion; peppers. Excerpt from Wikipedia article “Pizza”: “It is commonly topped with a selection of meats, vegetables and condiments.”
(b) Question: Why are the people wearing helmets? Answers (top 3): so they don’t injure their heads; for protection; safety. Excerpt from Wikipedia article “Helmet”: “A helmet is a form of protective gear worn to protect the head from injuries.”

Figure 2: Examples of images that could be helped by the addition of an external knowledge source

To alleviate this problem, we add support for external knowledge bases in NMN. As suggested by Wu et al. (2016), we use information extracted from our knowledge base as the seed for the hidden state of the LSTM that parses the question. While it is known that LSTMs tend to forget their initial state if their input is too long (see Neubig (2017), for instance), we do not believe this to be an issue here, since questions are generally under ten words in length (only 3% of all questions exceed this length).

The novelty of our approach essentially resides in the selection and preprocessing of the knowledge base. We use the DBpedia collection of English Wikipedia abstracts (Auer et al., 2007) as our knowledge base. Articles are filtered to keep only those that correspond to proper or common nouns, in order to remove the many observed false matches on movie or song titles. This allows us to match articles on the content of the abstract, instead of the title or meta-information only, potentially giving access to more information if no direct matches are found, as well as taking in more information, since Wikipedia abstracts are more detailed than the DBpedia ontology elements used by Wu et al. (2016). A potential drawback of this approach is that longer abstracts may be harder to exploit than expected if they contain more information that is irrelevant to the question. Essentially, our hypothesis for this part is that giving the model access to more external information should be beneficial. The entire collection of abstracts is indexed in an Apache Lucene (https://lucene.apache.org/) database. Each question is then parsed to select only nouns using NLTK, the Natural Language Toolkit (Loper and Bird, 2002), and a query is run against the database using the nouns extracted from the question. All selected articles are then turned into 300-dimensional vectors using Doc2Vec (Le and Mikolov, 2014). We use a pre-trained model from Lau and Baldwin (2016) for this task.
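A sketch of this retrieval pipeline is given below. The lucene_search callable is a hypothetical wrapper around the Lucene index (the paper does not specify the client library), and averaging the abstract vectors into a single seed vector is an assumption about the aggregation step.

import nltk
from gensim.models.doc2vec import Doc2Vec

def extract_nouns(question):
    """Keep only nouns (NN* tags) from the question to form the knowledge-base query."""
    tokens = nltk.word_tokenize(question)
    return [word for word, tag in nltk.pos_tag(tokens) if tag.startswith('NN')]

def knowledge_vector(question, lucene_search, doc2vec_path, top_k=3):
    """Turn the retrieved Wikipedia abstracts into a single Doc2Vec vector that seeds
    the question LSTM. `lucene_search` is a hypothetical callable wrapping the Lucene
    index (e.g. via PyLucene); it returns abstract strings."""
    nouns = extract_nouns(question)
    abstracts = lucene_search(" ".join(nouns), top_k)   # hypothetical retrieval call
    model = Doc2Vec.load(doc2vec_path)                  # pre-trained 300-d model
    vectors = [model.infer_vector(a.lower().split()) for a in abstracts]
    # Average the abstract vectors; the paper does not specify the aggregation,
    # so averaging is an assumption here.
    return sum(vectors) / max(len(vectors), 1)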

4 Experimental Setup

We evaluate the performance of our model on the VQA 1.0 dataset (Antol et al., 2015). This dataset is widely used for VQA tasks and comprises around 200,000 images from MSCOCO (Lin et al., 2014). Each image is associated with three different questions, and each question has ten answers generated by human annotators. We train our model using the standard train/val/test split. The visual features used by the NMN are taken from the conv5 layer of a 16-layer VGGNet (a deep CNN, Simonyan and Zisserman (2014)) after max pooling, and are normalized to have a mean of 0 and a standard deviation of 1.
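As an illustration, the normalization step could look like the NumPy sketch below; computing the statistics per feature dimension over the training set is our assumption, since the paper does not specify the axis.

import numpy as np

def normalize_features(conv5_feats):
    """Normalize pooled VGG conv5 features to zero mean and unit standard deviation,
    with statistics computed over the training set (an assumption about the exact
    normalization, which the paper leaves unspecified)."""
    mean = conv5_feats.mean(axis=0, keepdims=True)
    std = conv5_feats.std(axis=0, keepdims=True) + 1e-8
    return (conv5_feats - mean) / std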

We use ApolloCaffe (Jia et al., 2014) to develop our model. Since we treat the problem as a classification problem, categorical cross-entropy is used as the loss function. We use the ADADELTA optimizer with standard parameter settings (Zeiler, 2012). We set the batch size to 100 and train for up to 12 epochs, with early stopping if the validation accuracy stops improving. In the caption attention model, the hidden layer size of $W_c$ and $W_q$ is set to 200. The question context vector is generated by a single-layer LSTM with 1000 hidden units, and the caption context vector in the caption attention model is generated by a single-layer LSTM with 500 hidden units.

5 Results and Discussion

The two baselines we explore are hierarchical co-attention models (Lu et al., 2016) and NMNs (Andreas et al., 2016b). Table 2 shows the results we replicated by training these models with train+val as the training set.

Models | Yes/No | Number | Other | Overall
Hierarchical Question-Image Co-Attention | 79.6 | 38.4 | 49.1 | 60.5
Neural Module Networks | 80.8 | 36.1 | 43.7 | 58.1
Neural Module Networks (reported in paper) | 81.2 | 35.2 | 43.3 | 58.0

Table 2: Comparison of baseline models on test-dev open-ended questions. The published NMN implementation performs somewhat better than reported in the paper, and both do worse than hierarchical co-attention.

The major focus of our work lies in improving NMNs by combining them with some advantages of attention models. Henceforth, we present the results of our system obtained by improving the NMN baseline. We choose this baseline, even though it is not the state-of-the-art model, because of its intuitive and interesting use of dependency trees to dynamically adapt the neural network by assembling different neural modules.

5.1 Experimental results

Model Name | Train Acc | Validation Acc | Yes/No | Number | Other | Overall
Neural Module Networks (NMN) | 60.0 | 55.2 | 79.0 | 37.5 | 42.4 | 56.9
Caption Attention Only | 50.0 | 48.2 | 78.4 | 36.5 | 27.8 | 49.5
NMN + Caption Information | 61.9 | 56.4 | 79.8 | 37.4 | 42.1 | 57.1
NMN + Caption Attention | 60.3 | 55.2 | 79.2 | 35.8 | 42.1 | 56.6
NMN + External Knowledge Source | 61.3 | - | 79.2 | 36.4 | 42.2 | 56.8

Table 3: Comparison of our different models and the baseline on test-dev open-ended questions

Table 3 uses the train set as the training set, whereas Table 4 uses train+val. As we observe in Table 3, the train and test accuracies in the second row, which corresponds to an experiment that removes the NMN output entirely and relies only on attention over the captions, are about 50%. This indicates that the model is able to learn to predict some answers from the image captions alone.

Adding caption information increases the train and validation accuracies by 1.9 and 1.2 percentage points, respectively. However, our caption attention model slightly degrades NMN performance. One possible reason for this is that we force the model to attend to some words in the caption even in cases where the caption is not relevant to the question being asked.

To combat the information loss that occurs when the question parse is simplified into a network layout, we experimented with using larger parse trees. For example, consider the question “What is this person playing?”. The shorter parse that the baseline NMN uses is Describe(Find(Person)), which does not include any information about “playing”. Hence we consider the tree from the longer parse Describe(And(Find(Person), Find(Playing))). In addition, we implement a measure module to specifically address counting questions. (The measure module is described in Andreas et al. (2016b) but is not implemented in the published code; we implement it as outlined in the paper, and it is very similar to the counting system of Seguí et al. (2015).)

Model Name | Train Acc | Yes/No | Number | Other | Overall
Neural Module Networks (NMN) | 62.9 | 80.8 | 36.1 | 43.7 | 58.1
Longest Parse | 61.4 | 79.7 | 35.8 | 42.3 | 57.0
Measure Module | 64.0 | 79.6 | 34.8 | 41.7 | 56.5
Table 4: Comparison of results using some simple tweaks to the NMN model. All of our changes lead to worse results, so we do not use them for our other experiments.

As we can see from Table 4, performing less simplification on the parse tree decreases the scores. This could be due to a higher chance of error in complex parses, which are consequently not mapped to appropriate neural modules.

Adding a measure module at the top of the parse tree instead of a describe module for counting questions (those that start with “how many”) does not improve the results either. This could be due to the errors in the attention maps from the find module, or to the fact that while the describe module takes both image and attention maps as inputs, the measure module only takes the attention maps, so some information from image data might have been lost.

5.2 Comparative Qualitative Analysis

To analyze the performance of our approach with respect to the NMN baseline, we randomly select some images from the validation set and compare the answers produced by our system with the correct answers and with those produced by the NMN. We present this analysis in Table 5. The notations ‘Q’ and ‘C’ represent the corresponding questions and captions. “Predicted Answer NMN” is the output of the NMN model, and “Predicted Answer NMN+CA” is the answer predicted by our model after adding attention on the captions with respect to the question.

(a) Q: How many planes are flying? C: two planes flying in the sky in the sky. Correct Answer: 2. Predicted Answer NMN: 4. Predicted Answer NMN+CA: 2.
(b) Q: What is the food called? C: a little boy sitting at a table with a pizza. Correct Answer: pizza. Predicted Answer NMN: pizza. Predicted Answer NMN+CA: pizza.
(c) Q: What is the child holding? C: a young boy brushing his teeth in the bathroom. Correct Answer: toothbrush. Predicted Answer NMN: phone. Predicted Answer NMN+CA: toothbrush.
(d) Q: How many people are standing? C: a group of people in a field with a frisbee. Correct Answer: 6. Predicted Answer NMN: 2. Predicted Answer NMN+CA: 5.
(e) Q: What color is the sign? C: a group of people riding motorcycles down a street. Correct Answer: black and white. Predicted Answer NMN: red. Predicted Answer NMN+CA: red.
(f) Q: Is there a tree on the desk? C: a laptop computer sitting on top of a desk. Correct Answer: no. Predicted Answer NMN: yes. Predicted Answer NMN+CA: yes.
Table 5: Comparative performance analysis before and after adding caption attention to the NMN. Examples (a)-(c) show questions answered correctly by our model; examples (d)-(f) show questions where both the NMN and our model predict the wrong answer.

In example (a), the information about “two planes” is clearly mentioned in the caption, and this additional information helps our model capture the answer from the context vector obtained by passing the caption through an LSTM. Similarly, the answer words “pizza” and “toothbrush” in (b) and (c) are explicitly present in the caption. An obvious drawback of our approach arises when the caption produced for the image is not relevant to the question being asked.

For the examples in (d)-(f), the captions are not exactly relevant to the question, and hence attention over the captions has not been particularly helpful. The answers predicted by our model exactly match those from the NMN in (e) and (f). We also observed an interesting type of error that our model makes by mapping generic terms to frequent associations in the captions, as shown in (d). The caption for this image contains the phrase “a group of people”, which corresponds to the number of people in the image. While the NMN predicted the answer 2, the model with caption attention predicted the answer 5. This could be due to frequent associations between the caption word “group” and the answer “5” in the training data, which our model learns during end-to-end backpropagation. Though the final predicted answer is wrong, the model appears to have learnt the common-sense notion that two people are not usually called a group, and that a group implies a larger number.

6 Conclusion and Future Directions

We show that incorporating information from captions improves results slightly (by 1.9 points on training accuracy and 1.2 points on validation accuracy), especially in cases where the caption is relevant to the question being asked. However, our model fails when the generated caption is not relevant to the question, and hence one future direction is generating captions that contain certain required words. This approach also learns to map generic terms to their more probable associations in the captions of the training data. An interesting direction to explore in the future would be the analysis of irrelevant captions generated for an image, in order to better fit attention models.

Acknowledgments

We thank Professor Louis-Philippe Morency, Dr. Tadas Baltrusaitis, Amir Zadeh, and Chaitanya Ahuja for their guidance, constant and timely feedback, and the resources required for this project.

References

  • Andreas et al. (2016a) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016a). Learning to compose neural networks for question answering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics (ACL).
  • Andreas et al. (2016b) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016b). Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48.
  • Antol et al. (2015) Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. (2015). VQA: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV). Institute of Electrical and Electronics Engineers (IEEE).
  • Åström et al. (2016) Åström, F., Petra, S., Schmitzer, B., and Schnörr, C. (2016). Image labeling by assignment. Journal of Mathematical Imaging and Vision, pages 1–28.
  • Auer et al. (2007) Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007). DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722–735.
  • Chen and Manning (2014) Chen, D. and Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.
  • Fukui et al. (2016) Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., and Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.
  • Gao et al. (2015) Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., and Xu, W. (2015). Are you talking to a machine? dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304.
  • Gao et al. (2016) Gao, Y., Beijbom, O., Zhang, N., and Darrell, T. (2016). Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 317–326.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
  • Jia et al. (2014) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
  • Karpathy and Fei-Fei (2015) Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
  • Lau and Baldwin (2016) Lau, J. H. and Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368.
  • Le and Mikolov (2014) Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196.
  • LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551.
  • Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
  • Loper and Bird (2002) Loper, E. and Bird, S. (2002). NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, pages 63–70. Association for Computational Linguistics.
  • Lu et al. (2016) Lu, J., Yang, J., Batra, D., and Parikh, D. (2016). Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297.
  • Malinowski et al. (2015) Malinowski, M., Rohrbach, M., and Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In 2015 IEEE International Conference on Computer Vision (ICCV). Institute of Electrical and Electronics Engineers (IEEE).
  • Nam et al. (2016) Nam, H., Ha, J.-W., and Kim, J. (2016). Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471.
  • Neubig (2017) Neubig, G. (2017). Neural machine translation lecture notes: Attentional neural machine translation.
  • Ren et al. (2015) Ren, M., Kiros, R., and Zemel, R. (2015). Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pages 2953–2961.
  • Seguí et al. (2015) Seguí, S., Pujol, O., and Vitria, J. (2015). Learning to count with deep object features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 90–96.
  • Shih et al. (2016) Shih, K. J., Singh, S., and Hoiem, D. (2016). Where to look: Focus regions for visual question answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Institute of Electrical and Electronics Engineers (IEEE).
  • Simonyan and Zisserman (2014) Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
  • Wu et al. (2016) Wu, Q., Wang, P., Shen, C., Dick, A., and van den Hengel, A. (2016). Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622–4630.
  • Xu et al. (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
  • Zeiler (2012) Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.