DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog

Abstract

Visual Dialog is a vision-language task that requires an AI agent to engage in a conversation with humans grounded in an image. It remains a challenging task since it requires the agent to fully understand a given question before making an appropriate response, drawing not only on the textual dialog history but also on the visually-grounded information. Previous models typically leverage single-hop or single-channel reasoning to deal with this complex multimodal reasoning task, which is intuitively insufficient. In this paper, we therefore propose a novel and more powerful Dual-channel Multi-hop Reasoning Model for Visual Dialog, named DMRM. DMRM synchronously captures information from the dialog history and the image to enrich the semantic representation of the question by exploiting dual-channel reasoning. Specifically, DMRM maintains a dual channel to obtain the question- and history-aware image features and the question- and image-aware dialog history features by a multi-hop reasoning process in each channel. Additionally, we design an effective multimodal attention mechanism to further enhance the decoder to generate more accurate responses. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate that the proposed model is effective and outperforms the compared models by a significant margin.

Introduction

With the rapid development of both computer vision and natural language processing, vision-language tasks such as image captioning [26, 2, 3] and visual question answering [20, 8, 17, 3] have attracted increasing attention in recent years. Although these tasks have inspired tremendous efforts on integrating vision and language to develop smarter AI, they are mostly single-round, while human conversations are generally multi-round. Therefore, the Visual Dialog task was proposed by Das et al. (2017) to encourage research on multi-round visually-grounded dialog.

In Visual Dialog, an agent is required to answer a question given the dialog history and the visual context. In order to make an appropriate response, the agent needs to gain a proper understanding of the question, which requires it to exploit the textual dialog history and the visual context. To this end, some studies [5, 15] design models to obtain features from both modalities. Das et al. (2017) propose Late Fusion (LF), which directly concatenates individual representations of the question, the dialog history, and the image, and then generates a new joint representation by a linear transformation on them. Lu et al. (2017) design a history-conditioned attention image encoder to generate the representation of the question, the question-aware dialog history and the history-conditioned image features, and then concatenate them into a joint representation.

Nevertheless, the approaches mentioned above are single-hop approaches, which show limited reasoning ability and neglect the latent information in the interactions among the question, the dialog history and the image. For better solutions, researchers [5, 25, 18, 12] investigate multi-hop reasoning approaches [11, 10] to conduct interactions among modalities. For example, Wu et al. (2018) provide a sequential co-attention encoder, which first obtains question-aware image features, then extracts history features via a co-attention mechanism with the question features and the extracted image features, next gets the attended question features from the extracted history and image features, and finally joins the three attended features and sends them to the decoder. Niu et al. (2019) propose a recursive visual attention model, which recursively reviews the dialog history to find the reference of the question, and then extracts the image features by an attention model with the extracted history context and the question. These are single-channel approaches, which first use the question to find references in the dialog history and then extract the image context from both the question and the history context. However, humans usually deal with a visually-grounded, multi-turn dialog by simultaneously comprehending both sources of information, namely the textual dialog history and the visual context. That is to say, the question can find references first from the image and then from the dialog history to enrich the question representation, and vice versa.

Dual-channel reasoning, i.e., acquiring information from the dialog history and the image simultaneously, is beneficial for gaining a comprehensive understanding of the question from the dialog history and the image. Meanwhile, multi-hop reasoning, i.e., reasoning among the question, the dialog history and the image, is conducive to utilizing the abundant latent information among the three inputs. Therefore, in this paper, we propose a Dual-channel Multi-hop Reasoning Model for Visual Dialog, named DMRM. DMRM synchronously captures information from the dialog history and the image to enrich the semantic representation of the question by exploiting dual-channel reasoning, which is composed of a Track Module and a Locate Module. Track Module aims to enrich the representation of the question from the visual information, while Locate Module aims to reach the same goal from the textual dialog history. Specifically, DMRM maintains dual channels to obtain the question- and history-aware image features and the question- and image-aware dialog history features by a multi-hop reasoning process in each channel. In addition, we design an effective multimodal attention mechanism to further enhance the decoder to generate more accurate responses.

We validate the DMRM model on two large-scale datasets: VisDial v0.9 and v1.0 [5]. DMRM achieves state-of-the-art results on some metrics compared to other methods. We also conduct ablation studies to demonstrate the effectiveness of our proposed components. Furthermore, we conduct a human evaluation to demonstrate the effectiveness of our model in inferring answers.

Our main contributions are threefold:

  • We propose a dual-channel multi-hop reasoning model to deal with this complex multimodal reasoning task. It enriches the semantic representation of the question so that the agent can make an appropriate response.

  • We are the first to apply multimodal attention to the decoder for visual dialog, and we demonstrate the necessity and effectiveness of this attention mechanism for decoding in visual dialog.

  • We evaluate our method on two large-scale datasets and conduct ablation studies and a human evaluation. Experimental results on VisDial v0.9 and v1.0 demonstrate that the proposed model achieves state-of-the-art results on some metrics.¹

Figure 1: The framework of the DMRM model. DMRM synchronously captures information from the dialog history and the image to enrich the semantic representation of the question by exploiting dual-channel reasoning, which is composed of Track Module and Locate Module. Track Module aims to gain a full understanding of the question from the aspect of the image. Locate Module aims to gain a full understanding of the question from the aspect of the dialog history. Finally, the outputs of the dual-channel reasoning are sent to the decoder after the Att-Enhance and multimodal fusion operations.

Our Approach

In this section, we formally describe the visual dialog task and our proposed method, the Dual-channel Multi-hop Reasoning Model (DMRM). Following Das et al. (2017), the inputs of a visual dialog agent consist of an image $I$, a caption $C$ describing the image, a dialog history of question-answer pairs up to round $t-1$, $H_t = \{C, (Q_1, A_1), \ldots, (Q_{t-1}, A_{t-1})\}$, and the current question $Q_t$ at round $t$. The goal of the visual dialog agent is to generate a response $A_t$ to the question $Q_t$.

Given the problem setup, DMRM for visual dialog consists of four components: (1) Input Representation, where the representations of the image and the textual information are generated for reasoning; (2) Dual-channel Multi-hop Reasoning, where our reasoning is applied to encode the input representations; (3) Multimodal Fusion, where we fuse the multimodal information; (4) Generative Decoder, where we use our multimodal attention decoder to generate the response. Specifically, we use Track Module and Locate Module to implement our dual-channel multi-hop reasoning. As shown in Figure 1, Track Module aims to enrich the representation of the question from the visual information by exploiting the question and the dialog history. Locate Module aims to enrich the representation of the question from the textual dialog history by exploiting the question and the image. The answer decoder takes the outputs of Track Module and Locate Module as inputs and generates an appropriate response.

We first introduce the representations of inputs (both image features and the language features). Then we describe the detailed architectures of dual-channel multi-hop reasoning and multimodal fusion operation. Finally, we present the multimodal attention answer decoder.

Figure 2: Schematic representation of multi-hop reasoning. Please see Section Dual-channel Multi-hop Reasoning for details. The queries at different hops denote different query features, $V$ denotes the image features and $M$ denotes the dialog history features.

Input Representation

Image Features

We use a pre-trained Faster R-CNN [21] to extract object-level image features. Specifically, the image features $V$ for the image $I$ are represented by:

$V = \mathrm{RCNN}(I) = \{v_1, v_2, \ldots, v_K\}, \quad v_i \in \mathbb{R}^{d_v}$   (1)

where $K$ denotes the total number of object detection features per image and $d_v$ denotes the dimension of each feature, respectively. We extract the object features by using a fixed number $K$.
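
To make this input step concrete, the following is a minimal sketch (not the authors' code) of preparing object-level image features. It assumes features were pre-extracted with Faster R-CNN and stored as an array of shape (num_boxes, feat_dim); the fixed number of objects K and the feature dimension below are illustrative assumptions.

```python
# Minimal sketch: padding/truncating pre-extracted Faster R-CNN features to a fixed K.
import numpy as np

K = 36           # assumed fixed number of object features per image
FEAT_DIM = 2048  # typical Faster R-CNN (ResNet-101) pooled feature size

def pad_or_truncate(boxes_feats: np.ndarray, k: int = K) -> np.ndarray:
    """Return a (k, feat_dim) matrix V: truncate extra boxes, zero-pad missing ones."""
    v = np.zeros((k, boxes_feats.shape[1]), dtype=np.float32)
    n = min(k, boxes_feats.shape[0])
    v[:n] = boxes_feats[:n]
    return v

# toy usage: an image with 20 detected objects
raw = np.random.randn(20, FEAT_DIM).astype(np.float32)
V = pad_or_truncate(raw)
print(V.shape)  # (36, 2048)
```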

Language Features

We first embed each word in the current question $Q_t$ to $\{x_1, x_2, \ldots, x_T\}$ by using pre-trained GloVe embeddings [19], where $T$ denotes the number of tokens in $Q_t$. We use a one-layer BiLSTM to generate a sequence of hidden states $U = \{u_1, u_2, \ldots, u_T\}$ with $u_i = [\,\overrightarrow{h}_i\,;\,\overleftarrow{h}_i\,]$, and use the last hidden state of the BiLSTM as the question features $q$ as follows:

$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h}_{i-1})$   (2)
$\overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h}_{i+1})$   (3)
$q = [\,\overrightarrow{h}_T\,;\,\overleftarrow{h}_1\,]$   (4)

Also, each question-answer pair in the dialog history and the answer are embedded in the same way as the current question (Eq. 2 - Eq. 4), yielding the dialog history features $M = \{m_1, m_2, \ldots, m_t\}$ and the answer features $a$. $Q_t$, $H_t$ and $A_t$ are embedded with the same word embedding vectors but three different BiLSTMs.
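
A minimal PyTorch sketch of this text encoder is given below. The vocabulary size, embedding dimension and hidden size are illustrative assumptions; in the paper the embeddings are pre-trained GloVe vectors and the BiLSTMs have 512-dimensional hidden states, with separate BiLSTMs (but shared word embeddings) for the question, history and answer.

```python
# Minimal sketch of the BiLSTM text encoder (Eq. 2-4); sizes are assumptions.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=8958, emb_dim=300, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # bidirectional=True gives the forward/backward LSTMs of Eq. 2-3
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, T) integer ids
        x = self.embed(tokens)                    # (batch, T, emb_dim)
        U, (h_n, _) = self.bilstm(x)              # U: (batch, T, 2*hidden)
        # concatenate the last forward and last backward hidden states (Eq. 4)
        q = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2*hidden)
        return U, q

enc = TextEncoder()
U, q = enc(torch.randint(1, 8958, (2, 16)))       # a batch of 2 questions, 16 tokens
print(U.shape, q.shape)                           # (2, 16, 1024) and (2, 1024)
```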

Dual-channel Multi-hop Reasoning

The dual-channel multi-hop reasoning framework is implemented via two modules, i.e., Track Module and Locate Module. Track Module aims to gain a full understanding of the question from the aspect of the image. Locate Module aims to gain a full understanding of the question from the aspect of the dialog history. The multi-hop reasoning pathway of Track Module is illustrated as $Q \rightarrow V \rightarrow M \rightarrow V$ and the multi-hop reasoning pathway of Locate Module is illustrated as $Q \rightarrow M \rightarrow V \rightarrow M$. Next, we formally describe the single-hop Track Module and Locate Module, and then extend them to multi-hop ones. We use 3-hop reasoning in this paper.

Track Module

Track Module is designed to help enrich the semantic representation of the question from the image. In order to obtain the question- and history-aware representation of the image, we implement Track Module by taking inspiration from the bottom-up attention mechanism [3]. Track Module takes the query features $r$ (for instance, the question features $q$ at reasoning hop 1) and the image features $V$ (Eq. 1) as inputs, and then outputs a query-aware representation of the image. We first project these two vectors to dimension $d$ and compute soft attention over all the object detection features as follows:

$z_i = W_v\big(f_r(r) \circ f_v(v_i)\big)$   (5)
$\alpha^v = \mathrm{softmax}(z)$   (6)

where $f_r$ and $f_v$ denote 2-layer perceptrons with ReLU activation which transform the dimension of the input features to $d$, $W_v$ is the projection matrix for the softmax activation and $\circ$ denotes the Hadamard product. From these equations, we get the query-aware attention weights $\alpha^v$. Next, we apply the query-aware attention weights to the image features to compute the query-aware representation of the image as follows:

$\tilde{v} = \sum_{i=1}^{K} \alpha^v_i\, v_i$   (7)

We use $\tilde{v} = \mathrm{Track}(r, V)$ to represent the operations of Track Module, namely Eq. 5 - Eq. 7, here and after.

Furthermore, we use Track Module in the multi-hop reasoning process to enrich the semantic representation of the question from the image. Details are formalized in Section Multi-hop Reasoning.
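
A minimal PyTorch sketch of the Track Module attention (Eq. 5-7), under the notation reconstructed above, is shown below; the layer sizes are assumptions, not the authors' settings.

```python
# Minimal sketch of Track Module: Hadamard-product attention over object features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """2-layer perceptron with ReLU, projecting inputs to dimension d."""
    def __init__(self, in_dim, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, d), nn.ReLU(), nn.Linear(d, d))
    def forward(self, x):
        return self.net(x)

class TrackModule(nn.Module):
    def __init__(self, q_dim, v_dim, d=512):
        super().__init__()
        self.f_r = MLP(q_dim, d)     # projects the query
        self.f_v = MLP(v_dim, d)     # projects each object feature
        self.w_z = nn.Linear(d, 1)   # projection before the softmax (Eq. 5)

    def forward(self, r, V):
        # r: (batch, q_dim) query features; V: (batch, K, v_dim) object features
        joint = self.f_r(r).unsqueeze(1) * self.f_v(V)          # Hadamard product, (batch, K, d)
        z = self.w_z(joint).squeeze(-1)                         # (batch, K)
        alpha = F.softmax(z, dim=-1)                            # attention weights (Eq. 6)
        v_tilde = torch.bmm(alpha.unsqueeze(1), V).squeeze(1)   # weighted sum (Eq. 7)
        return v_tilde                                          # (batch, v_dim)

track = TrackModule(q_dim=1024, v_dim=2048)
out = track(torch.randn(2, 1024), torch.randn(2, 36, 2048))
print(out.shape)  # (2, 2048)
```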

Locate Module

Locate Module is designed to obtain a rich representation of the question from the dialog history. Similar to Track Module, Locate Module takes the query features $r$ (for instance, the question features $q$ at reasoning hop 1) and the dialog history features $M$ (Eq. 4) as inputs, and then outputs a query-aware representation of the dialog history as follows:

$z_j = W_m\big(g_r(r) \circ g_m(m_j)\big)$   (8)
$\alpha^m = \mathrm{softmax}(z)$   (9)

where $g_r$ and $g_m$ denote 2-layer perceptrons with ReLU activation which transform the dimension of the input features to $d$, $W_m$ is the projection matrix for the softmax activation and $\circ$ denotes the Hadamard product. From these equations, we get the query-aware attention weights $\alpha^m$. Next, we apply the query-aware attention weights to the dialog history features to compute the query-aware representation of the dialog history as follows:

$\tilde{m} = \sum_{j=1}^{t} \alpha^m_j\, m_j$   (10)

Next, we pass $\tilde{m}$ through a 2-layer perceptron with ReLU activation in between, and then add the result to the representation of the caption $c$. Layer normalization [12] is also applied in this step:

$\tilde{m}' = g_1(\tilde{m})$   (11)
$\hat{m} = \mathrm{LayerNorm}(\tilde{m}' + c)$   (12)

We use $\hat{m} = \mathrm{Locate}(r, M)$ to represent the operations of Locate Module, namely Eq. 8 - Eq. 12, here and after.

Furthermore, we use Locate Module in the multi-hop reasoning process to enrich the semantic representation of the question from the dialog history. Details are formalized in Section Multi-hop Reasoning.
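
The sketch below mirrors the previous one for Locate Module (Eq. 8-12), adding the caption residual and layer normalization described in the text; the sizes and exact MLP shapes are again assumptions.

```python
# Minimal sketch of Locate Module: attention over history rounds + caption residual + LayerNorm.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocateModule(nn.Module):
    def __init__(self, q_dim, m_dim, d=512):
        super().__init__()
        self.g_r = nn.Sequential(nn.Linear(q_dim, d), nn.ReLU(), nn.Linear(d, d))
        self.g_m = nn.Sequential(nn.Linear(m_dim, d), nn.ReLU(), nn.Linear(d, d))
        self.w_z = nn.Linear(d, 1)                                # Eq. 8 projection
        self.g_1 = nn.Sequential(nn.Linear(m_dim, m_dim), nn.ReLU(),
                                 nn.Linear(m_dim, m_dim))         # Eq. 11
        self.norm = nn.LayerNorm(m_dim)                           # Eq. 12

    def forward(self, r, M, c):
        # r: (batch, q_dim) query; M: (batch, t, m_dim) history; c: (batch, m_dim) caption
        joint = self.g_r(r).unsqueeze(1) * self.g_m(M)                # Hadamard product
        alpha = F.softmax(self.w_z(joint).squeeze(-1), dim=-1)        # Eq. 9
        m_tilde = torch.bmm(alpha.unsqueeze(1), M).squeeze(1)         # Eq. 10
        return self.norm(self.g_1(m_tilde) + c)                       # Eq. 11-12

locate = LocateModule(q_dim=1024, m_dim=1024)
out = locate(torch.randn(2, 1024), torch.randn(2, 10, 1024), torch.randn(2, 1024))
print(out.shape)  # (2, 1024)
```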

Multi-hop Reasoning

Dual-channel multi-hop reasoning contains two types of multi-hop reasoning. One starts from the question and ends with the image, illustrated as $Q \rightarrow V \rightarrow M \rightarrow V$ for 3 hops. The other starts from the question and ends with the dialog history, illustrated as $Q \rightarrow M \rightarrow V \rightarrow M$. We implement each reasoning pathway via Track Module and Locate Module. The $Q \rightarrow V \rightarrow M \rightarrow V$ pathway includes the following steps: at hop 1, Track Module attends to the image with the question features, $\tilde{v}^{(1)} = \mathrm{Track}(q, V)$; at hop 2, Locate Module attends to the dialog history with the output of hop 1 as the query, $\hat{m}^{(2)} = \mathrm{Locate}(\tilde{v}^{(1)}, M)$; at hop 3, Track Module attends to the image again with the (att-enhanced, see Multimodal Fusion) output of hop 2 as the query, $\tilde{v}^{(3)} = \mathrm{Track}(\hat{m}^{(2)}, V)$, yielding the question- and history-aware image features.

The $Q \rightarrow M \rightarrow V \rightarrow M$ pathway mirrors this process with the roles of the image and the dialog history swapped: $\hat{m}^{(1)} = \mathrm{Locate}(q, M)$, $\tilde{v}^{(2)} = \mathrm{Track}(\hat{m}^{(1)}, V)$ and $\hat{m}^{(3)} = \mathrm{Locate}(\tilde{v}^{(2)}, M)$, yielding the question- and image-aware dialog history features.

Parameters of modules at each reasoning hop are not shared with the others. Note that the reasoning process is valid only if the number of hops is odd. In this paper, we use 3-hop reasoning for Visual Dialog.
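
A minimal sketch of the dual-channel 3-hop reasoning reconstructed above is given below. It assumes the TrackModule and LocateModule classes from the previous sketches are in scope, that all features have been projected to a common dimension for simplicity, and it omits the Att-Enhance step between hops 2 and 3 (see Multimodal Fusion) for brevity; parameters are not shared across hops, as stated in the text.

```python
# Minimal sketch of dual-channel 3-hop reasoning (not the authors' implementation).
import torch
import torch.nn as nn

class DualChannelReasoning(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # Track channel: Q -> V -> M -> V  (hops 1, 2, 3)
        self.track1 = TrackModule(dim, dim, d=dim)
        self.locate2 = LocateModule(dim, dim, d=dim)
        self.track3 = TrackModule(dim, dim, d=dim)
        # Locate channel: Q -> M -> V -> M
        self.locate1 = LocateModule(dim, dim, d=dim)
        self.track2 = TrackModule(dim, dim, d=dim)
        self.locate3 = LocateModule(dim, dim, d=dim)

    def forward(self, q, V, M, c):
        # Track channel: question- and history-aware image features
        v1 = self.track1(q, V)
        m2 = self.locate2(v1, M, c)
        v3 = self.track3(m2, V)
        # Locate channel: question- and image-aware history features
        m1 = self.locate1(q, M, c)
        v2 = self.track2(m1, V)
        m3 = self.locate3(v2, M, c)
        return v3, m3
```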

Multimodal Fusion

In this section, we introduce multimodal fusion. As shown in Figure 1, before we fuse the multimodal representations $\tilde{v}^{(3)}$ and $\hat{m}^{(3)}$ generated by Track Module and Locate Module, we use the question features $q$ to enhance these representations as follows:

$\bar{v} = f_1(q) \circ f_2(\tilde{v}^{(3)})$   (13)
$\bar{m} = f_1(q) \circ f_3(\hat{m}^{(3)})$   (14)

where $f_1$, $f_2$ and $f_3$ denote 2-layer perceptrons with ReLU activation. Both Eq. 13 and Eq. 14 are named the Att-Enhance module. We also use Att-Enhance modules between hop 2 and hop 3. Then we fuse the representations of the two channels as follows:

$e_v = \tanh(W_1 \bar{v} + b_1), \qquad e_m = \tanh(W_2 \bar{m} + b_2)$   (15)
$e = \tanh\big(W_3\,[\,e_v\,;\,e_m\,] + b_3\big)$   (16)

where $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation, and $W_1$, $W_2$, $W_3$ and $b_1$, $b_2$, $b_3$ are learned parameters.
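
The following is a minimal sketch of the Att-Enhance and fusion steps (Eq. 13-16), under the reconstruction given above; the exact fusion form and the layer sizes are our assumptions, not the authors' specification.

```python
# Minimal sketch of Att-Enhance (Eq. 13-14) followed by channel fusion (Eq. 15-16).
import torch
import torch.nn as nn

class AttEnhanceFusion(nn.Module):
    def __init__(self, dim=1024, out_dim=512):
        super().__init__()
        mlp = lambda: nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.f1, self.f2, self.f3 = mlp(), mlp(), mlp()   # Eq. 13-14
        self.w1 = nn.Linear(dim, out_dim)                 # Eq. 15
        self.w2 = nn.Linear(dim, out_dim)
        self.w3 = nn.Linear(2 * out_dim, out_dim)         # Eq. 16

    def forward(self, q, v3, m3):
        v_bar = self.f1(q) * self.f2(v3)                  # Att-Enhance on the Track channel
        m_bar = self.f1(q) * self.f3(m3)                  # Att-Enhance on the Locate channel
        e_v = torch.tanh(self.w1(v_bar))
        e_m = torch.tanh(self.w2(m_bar))
        return torch.tanh(self.w3(torch.cat([e_v, e_m], dim=-1)))  # fused vector e

fuse = AttEnhanceFusion()
e = fuse(torch.randn(2, 1024), torch.randn(2, 1024), torch.randn(2, 1024))
print(e.shape)  # (2, 512)
```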

Figure 3: Multimodal Attention Decoder. We use the multimodal context vector $e$ (Eq. 16) to initialize the decoder LSTM, utilize the hidden state to attend to the question features $U$, the history features $M$ and the image features $V$, and combine the attended representations $\hat{q}_j$, $\hat{m}_j$ and $\hat{v}_j$ with the hidden state to predict the next word.

Generative Decoder

As illustrated in Figure 3, our generative decoder is adapted from spatial-attention-based decoders [16]. In the encoder-decoder framework, with a recurrent neural network (RNN), we model the conditional probability as:

$p(y_j \mid y_{<j}, Q_t, H_t, I) = \mathrm{softmax}\big(f_o([\,s_j\,;\,c_j\,])\big)$   (17)

where $f_o$ is a 2-layer perceptron with ReLU activation, $c_j$ is the multimodal context vector at time step $j$ and $s_j$ is the hidden state of the RNN at time step $j$. In this paper, we use an LSTM, and $s_j$ is modeled as:

$s_j = \mathrm{LSTM}(w_j, s_{j-1})$   (18)

where $w_j$ is the representation of the generated answer word at time step $j$.

Given the question features $U$, the dialog history features $M$, the image features $V$ and the hidden state $s_j$, we feed them through a 1-layer perceptron with a softmax function to generate three attention distributions over the question tokens, the rounds of the history and the object detection features of the image, respectively. First, the attended question vector is defined as follows:

$z^q_j = w_q^\top \tanh\big(W_q U + (W_s s_j)\mathbf{1}^\top\big)$   (19)
$\alpha^q_j = \mathrm{softmax}(z^q_j)$   (20)

where $\mathbf{1}$ is a vector with all elements set to 1, and $w_q$, $W_q$ and $W_s$ are learned parameters. All the bias terms in Eq. 19 - Eq. 22 are omitted for simplicity. Then we obtain the attended question vector as follows:

$\hat{q}_j = \sum_{i=1}^{T} \alpha^q_{j,i}\, u_i$   (21)

Similar to the computation of the attended question vector, we obtain the attended history vector $\hat{m}_j$ and the attended image vector $\hat{v}_j$. Then we fuse these three context vectors to obtain the multimodal context vector $c_j$:

$c_j = W_c\,[\,\hat{q}_j\,;\,\hat{m}_j\,;\,\hat{v}_j\,]$   (22)

where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation and $W_c$ is a learned parameter. $c_j$ and $s_j$ are combined to predict the next word (Eq. 17).

In addition, we use the encoder output $e$ (Eq. 16) as the embedding input to initialize our decoder LSTM. Formally,

$s_0 = \mathrm{LSTM}(e, q)$   (23)

where $q$ is the last state of the question LSTM in the encoder and is used as the initial state of the decoder LSTM.
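
A minimal PyTorch sketch of one step of the multimodal attention decoder (Eq. 17-22), following the reconstruction above, is shown below. It is not the authors' implementation; the shared feature dimension, the additive attention form and the output head are assumptions.

```python
# Minimal sketch of one multimodal-attention decoding step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnOver(nn.Module):
    """Soft attention of the decoder state over a set of feature vectors (Eq. 19-21)."""
    def __init__(self, feat_dim, state_dim, d=512):
        super().__init__()
        self.w_f = nn.Linear(feat_dim, d, bias=False)
        self.w_s = nn.Linear(state_dim, d, bias=False)
        self.w = nn.Linear(d, 1, bias=False)

    def forward(self, feats, s):
        # feats: (batch, N, feat_dim); s: (batch, state_dim)
        z = self.w(torch.tanh(self.w_f(feats) + self.w_s(s).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(z, dim=-1)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)

class DecoderStep(nn.Module):
    def __init__(self, emb_dim, state_dim, feat_dim, vocab_size):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim, state_dim)                    # Eq. 18
        self.att_q = AttnOver(feat_dim, state_dim)
        self.att_m = AttnOver(feat_dim, state_dim)
        self.att_v = AttnOver(feat_dim, state_dim)
        self.w_c = nn.Linear(3 * feat_dim, state_dim, bias=False)      # Eq. 22
        self.out = nn.Sequential(nn.Linear(2 * state_dim, state_dim), nn.ReLU(),
                                 nn.Linear(state_dim, vocab_size))     # Eq. 17

    def forward(self, w_j, state, U, M, V):
        s_j, c_cell = self.cell(w_j, state)
        q_hat = self.att_q(U, s_j)          # attended question vector
        m_hat = self.att_m(M, s_j)          # attended history vector
        v_hat = self.att_v(V, s_j)          # attended image vector
        c_j = self.w_c(torch.cat([q_hat, m_hat, v_hat], dim=-1))
        logits = self.out(torch.cat([s_j, c_j], dim=-1))               # next-word scores
        return logits, (s_j, c_cell)

step = DecoderStep(emb_dim=300, state_dim=512, feat_dim=1024, vocab_size=8958)
logits, state = step(torch.randn(2, 300), (torch.zeros(2, 512), torch.zeros(2, 512)),
                     torch.randn(2, 16, 1024), torch.randn(2, 10, 1024), torch.randn(2, 36, 1024))
print(logits.shape)  # (2, 8958)
```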

Experiments

Datasets

We evaluate our proposed approach on the VisDial v0.9 and v1.0 datasets [5]. VisDial v0.9 contains 83k dialogs on COCO-train [15] and 40k dialogs on COCO-val [15] images, for a total of 1.23M dialog question-answer pairs. The VisDial v1.0 dataset is an extension of the VisDial v0.9 dataset with an additional 10k COCO-like images from Flickr. Overall, the VisDial v1.0 dataset contains 123k (all images from v0.9), 2k and 8k images as train, validation and test splits, respectively.

Model MRR R@1 R@5 R@10 Mean
AP [5] 37.35 23.55 48.52 53.23 26.50
NN [5] 42.74 33.13 50.83 58.69 19.62
LF [5] 51.99 41.83 61.78 67.59 17.07
HREA [5] 52.42 42.28 62.33 68.71 16.79
MN [5] 52.59 42.29 62.85 68.88 17.06
HCIAE [15] 53.86 44.06 63.55 69.24 16.01
CoAtt [25] 54.11 44.32 63.82 69.75 16.47
CoAtt† [25] 55.78 46.10 65.69 71.74 14.43
RvA [18] 55.43 45.37 65.27 72.97 10.71
DMRM 55.96 46.20 66.02 72.43 13.15
Table 1: Performance on VisDial val v0.9 [5]. Higher is better for mean reciprocal rank (MRR) and recall@k (R@1, R@5, R@10), while lower is better for mean rank. Our proposed model outperforms all other models on MRR, R@1, and R@5. † indicates that the model is trained by using reinforcement learning.
Model MRR R@1 R@5 R@10 Mean
MN [5] 47.99 38.18 57.54 64.32 18.60
HCIAE [15] 49.10 39.35 58.49 64.70 18.46
CoAtt [25] 49.25 39.66 58.83 65.38 18.15
ReDAN [7] 49.69 40.19 59.35 66.06 17.92
DMRM 50.16 40.15 60.02 67.21 15.19
Table 2: Performance on VisDial val v1.0 [5]. All the models are re-implemented by Gan et al. (2019).

Evaluation Metrics

We follow Das et al. (2017) and use a retrieval setting to evaluate the individual responses at each round of a dialog. Specifically, at test time, apart from the image, the ground-truth dialog history and the question, a list of 100 candidate answers is also given. The model is evaluated on retrieval metrics: (1) the rank of the human response, (2) the existence of the human response in the top-k ranked responses, i.e., recall@k, and (3) the mean reciprocal rank (MRR) of the human response. Since we focus on evaluating the generalization ability of our generator, the sum of the log-likelihoods of each option is used for ranking.
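
The small sketch below (not tied to the official evaluation code) shows how these rank-based metrics can be computed: given per-candidate log-likelihood scores and the index of the ground-truth answer among the 100 candidates, it derives MRR, recall@k and mean rank.

```python
# Minimal sketch of the VisDial retrieval metrics from candidate scores.
import numpy as np

def visdial_metrics(scores: np.ndarray, gt_index: np.ndarray):
    """scores: (num_examples, 100); gt_index: (num_examples,) ground-truth positions."""
    # rank of the ground-truth option (1 = best) under descending score order
    order = np.argsort(-scores, axis=1)
    ranks = np.argmax(order == gt_index[:, None], axis=1) + 1
    return {
        "MRR": float(np.mean(1.0 / ranks)),
        "R@1": float(np.mean(ranks <= 1)),
        "R@5": float(np.mean(ranks <= 5)),
        "R@10": float(np.mean(ranks <= 10)),
        "Mean": float(np.mean(ranks)),
    }

# toy usage with random scores for 4 examples
rng = np.random.default_rng(0)
print(visdial_metrics(rng.normal(size=(4, 100)), rng.integers(0, 100, size=4)))
```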

Implementation Details

To process the data, we first lowercase all the texts, convert digits to words, and then remove contractions before tokenizing. The captions, questions and answers are further truncated to ensure that they are no longer than 24, 16 or 8 tokens, respectively. We then construct the vocabulary of tokens that appear at least 5 times in the training split, giving us a vocabulary of 8,958 words on VisDial v0.9 and 10,366 words on VisDial v1.0. All the BiLSTMs in our model are one-layer with 512-dimensional hidden states. The Adam optimizer [14] is used with a base learning rate of 1e-3, further decreasing to 1e-5 with a warm-up process.
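
A hedged sketch of this optimization setup (Adam, base learning rate 1e-3, warm-up, then decay towards 1e-5) is given below. The exact warm-up length and decay schedule are not specified in the text, so the values and schedule shape are assumptions.

```python
# Minimal sketch of Adam with linear warm-up followed by decay to a minimum LR.
import torch

model = torch.nn.Linear(10, 10)                      # stand-in for the DMRM parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_epochs, total_epochs, min_lr, base_lr = 2, 20, 1e-5, 1e-3

def lr_at(epoch: int) -> float:
    if epoch < warmup_epochs:                        # linear warm-up
        return base_lr * (epoch + 1) / warmup_epochs
    # exponential decay from base_lr down to min_lr over the remaining epochs
    frac = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return base_lr * (min_lr / base_lr) ** frac

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda ep: lr_at(ep) / base_lr)

for epoch in range(total_epochs):
    # ... training loop over VisDial batches would go here ...
    optimizer.step()                                 # placeholder step
    scheduler.step()
```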

Results and Analysis

We compare our proposed model to the state-of-the-art generative models developed in previous works. As shown in Table 1 and Table 2, our proposed model achieves state-of-the-art results on some metrics on the VisDial v0.9 and v1.0 datasets. The key observations are as follows:

  • By comparing with single-hop approaches (LF [5] and HCIAE [15]), we demonstrate the validity of multi-hop reasoning, because it utilizes the abundant latent information among modalities.

  • By comparing with single-channel approaches (CoAtt [25] and RvA [18]), we conclude that dual-channel reasoning is beneficial for gaining a comprehensive understanding of the question from both the dialog history and the image.

  • By comparing with the other state-of-the-art approaches (HREA [5], MN [5] and ReDAN [7]), our approach achieves state-of-the-art results on some metrics, which demonstrates the superiority of our model.

Ablation Study

As shown in Table 3, the different settings illustrate the importance of each part of our model. By comparing the “DMRM w/ n-hop” variants, we see the effectiveness of multi-hop reasoning. By comparing “DMRM w/o Locate” and “DMRM w/o Track” with DMRM, we see the effectiveness of our dual-channel design. By comparing our final model DMRM with “DMRM w/o AttD”, we see the improvement due to the multimodal attention in the decoder.

Model MRR R@1 R@5 R@10 Mean
DMRM w/ 1-hop 55.04 45.55 64.46 70.49 14.68
DMRM w/ 2-hop 54.87 44.85 65.05 71.75 13.66
DMRM w/ 3-hop 55.57 45.80 65.54 72.09 13.51
DMRM w/o Locate 54.77 45.35 64.04 70.01 14.81
DMRM w/o Track 53.28 43.06 63.47 70.06 14.54
DMRM w/o AttD 55.57 45.80 65.54 72.09 13.51
DMRM 55.96 46.20 66.02 72.43 13.15
Table 3: Ablation study of our proposed model on VisDial val v0.9 [5]. “DMRM w/ n-hop” means the model uses n-hop reasoning. “DMRM w/o AttD” means the model does not use the multimodal attention decoder. Note that “DMRM w/ 2-hop” is an incomplete reasoning process under our designed architecture, and the ablation study of n-hop reasoning is based on the model “DMRM w/o AttD”.

Significance Test

We use a t-test and analysis of variance (ANOVA) to analyze the responses generated by our model and the HCIAE model [15]. The p-values of both analyses are less than 0.01, indicating that the results are significantly different.

HCIAE Ours
Human evaluation method 1 (M1): 0.60 0.65
Human evaluation method 2 (M2): 0.53 0.62
Table 4: Human evaluation on 100 sampled responses on VisDial val v0.9. M1: percentage of responses that pass the Turing Test. M2: percentage of responses evaluated as better than or equal to human responses.

Human Evaluation

We randomly sample 100 examples for human evaluation. The evaluation results are shown in Table 4 and demonstrate the effectiveness of our model.

Qualitative Results Analysis

As shown in Figure 4, our model generates responses that are highly consistent with human answers, which shows its effectiveness. Compared with “DMRM w/o AttD”, DMRM generates more correct and meaningful responses. Figure 5 visualizes our reasoning process: for the question “what color is his bike ?”, the model infers step by step and finally attends to the bike to answer the current question.

Related Work

Figure 4: Qualitative results of our final model (DMRM) on VisDial v0.9 compared to human ground-truth answers and our baseline model (“DMRM w/o AttD”). Compared with “DMRM w/o AttD”, DMRM utilizes the multi-modal attention in the decoder. The improvements in correctness (marked in green and red) and interpretability (marked in blue) of the generated answers due to our multi-modal attention in the decoder are partially colored.

Figure 5: Visualization of our reasoning process. (a) The attended image at 1-hop via Track Module. (b) The attended image at 2-hop via Track Module. (c) The attended image at 3-hop via Track Module. With the question “what color is his bike ?”, our model finally attends to the bike to get the answer.

Vision-language Task

Vision-language tasks, such as image captioning [20, 8, 13, 23, 6] and visual question answering (VQA) [27, 3, 1, 4, 24], have aroused great interest in recent years. Image captioning is the task of describing the visual content of an image in one or more sentences, while visual question answering focuses on providing a natural language answer given an image and a free-form, open-ended question. Visual dialog [25, 15, 22, 9] can be seen as an extension of the image captioning and VQA tasks. Visual dialog enables an AI agent not only to interact with the visual environment but also to hold a continuous conversation with humans.

Visual Dialog

Visual dialog has attracted widespread attention. Some previous works are similar to ours but fundamentally different. Das et al. (2017) propose a dialog-RNN, which takes the question, the image and the last round of the history as inputs, and then produces both an encoding representation for this round and a dialog context for the next round. Das et al. (2017) exploit a dialog-RNN to deal with the multi-turn dialog only by using the information of the last round of history, while we apply multi-hop reasoning for visual dialog at each turn and exploit the whole dialog history. Besides, Gan et al. (2019) provide a multi-step reasoning model via an RNN, which first leverages the query and the history to attend to the image, then uses the query and the image to attend to the history, and finally utilizes the image and the history to update the RNN state. In contrast, we propose dual-channel multi-hop reasoning via two modules, where Track Module only deals with the image and Locate Module only utilizes the information of the dialog history. Moreover, we inject the representation of the question into the representations generated by Track Module and Locate Module between reasoning hops.

Conclusion

We introduce the Dual-channel Multi-hop Reasoning Model (DMRM) for visual dialog, a new framework that simultaneously captures information from the dialog history and the image to enrich the semantic representation of the question by exploiting dual-channel reasoning. This dual-channel multi-hop reasoning process provides a more fine-grained understanding of the question by utilizing the textual information and the visual context simultaneously via multi-hop reasoning, thus boosting the performance of answer generation. Experiments conducted on VisDial v0.9 and v1.0 confirm the effectiveness of our proposed method.

Acknowledgments

This work was supported by the Major Project for New Generation of AI (Grant No. 2018AAA0100400), the National Natural Science Foundation of China (Grant No. 61602479), and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB32070000).

Footnotes

  1. Code is available at https://github.com/phellonchen/DMRM.

References

  1. C. Alberti, J. Ling, M. Collins and D. Reitter (2019) Fusion of detected objects in text for visual question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 2131–2140.
  2. P. Anderson, B. Fernando, M. Johnson and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision, pp. 382–398.
  3. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086.
  4. R. Cadene, H. Ben-Younes, M. Cord and N. Thome (2019) MUREL: multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1989–1998.
  5. A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh and D. Batra (2017) Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 326–335.
  6. S. Ding, S. Qu, Y. Xi, A. K. Sangaiah and S. Wan (2019) Image caption generation with high-level image features. Pattern Recognition Letters 123, pp. 89–95.
  7. Z. Gan, Y. Cheng, A. E. Kholy, L. Li, J. Liu and J. Gao (2019) Multi-step reasoning via recurrent dual attention for visual dialog. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6463–6474.
  8. H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang and W. Xu (2015) Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pp. 2296–2304.
  9. D. Guo, C. Xu and D. Tao (2019) Image-question-answer synergistic network for visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10434–10443.
  10. R. Hu, J. Andreas, T. Darrell and K. Saenko (2018) Explainable neural computation via stack neural module networks. In Proceedings of the European Conference on Computer Vision, pp. 53–69.
  11. D. A. Hudson and C. D. Manning (2018) Compositional attention networks for machine reasoning. In International Conference on Learning Representations.
  12. G. Kang, J. Lim and B. Zhang (2019) Dual attention networks for visual reference resolution in visual dialog. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 2024–2033.
  13. P. Kinghorn, L. Zhang and L. Shao (2018) A region-based image caption generator with refined descriptions. Neurocomputing 272, pp. 416–424.
  14. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  15. J. Lu, A. Kannan, J. Yang, D. Parikh and D. Batra (2017) Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems, pp. 314–324.
  16. J. Lu, C. Xiong, D. Parikh and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 375–383.
  17. J. Lu, J. Yang, D. Batra and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pp. 289–297.
  18. Y. Niu, H. Zhang, M. Zhang, J. Zhang, Z. Lu and J. Wen (2019) Recursive visual attention in visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6679–6688.
  19. J. Pennington, R. Socher and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543.
  20. M. Ren, R. Kiros and R. Zemel (2015) Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pp. 2953–2961.
  21. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  22. P. H. Seo, A. Lehrmann, B. Han and L. Sigal (2017) Visual reference resolution using attention memory for visual dialog. In Advances in Neural Information Processing Systems, pp. 3719–3729.
  23. Y. H. Tan and C. S. Chan (2019) Phrase-based image caption generator with hierarchical LSTM network. Neurocomputing 333, pp. 86–100.
  24. R. Vedantam, K. Desai, S. Lee, M. Rohrbach, D. Batra and D. Parikh (2019) Probabilistic neural symbolic models for interpretable visual question answering. In Proceedings of the International Conference on Machine Learning, pp. 6428–6437.
  25. Q. Wu, P. Wang, C. Shen, I. Reid and A. van den Hengel (2018) Are you talking to me? Reasoned visual dialog generation through adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6106–6115.
  26. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, pp. 2048–2057.
  27. Z. Yang, X. He, J. Gao, L. Deng and A. Smola (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29.