
In this supplementary material we provide additional quantitative (see \secref{sec:quan}) and qualitative (see \secref{sec:qual}) results.

\section{Quantitative Evaluation}
\label{sec:quan}

\textbf{Ensemble model for VisDial v0.9:} For VisDial v1.0, a simple ensemble technique significantly improved the results, as discussed in the main paper. We observe a similar effect for VisDial v0.9, pushing the current state of the art for MRR from 0.6525 to 0.6892, as summarized in \tabref{tab:abl-atten}. We achieve this result with an ensemble of 9 models which differ only in their initial seed. For VisDial v1.0 we report a 5-model ensemble score; due to the restriction on the number of submissions to the evaluation server, we could not evaluate a larger ensemble. The results in \tabref{tab:abl-atten} suggest that the VisDial v1.0 score in the paper can be further improved with a larger ensemble model.
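The seed ensemble above amounts to averaging each member's candidate-answer scores before ranking. The following sketch illustrates the idea with placeholder score arrays (the actual scores come from separately trained networks, and the MRR helper here is a generic implementation, not our evaluation-server code):

```python
import numpy as np

def ensemble_scores(model_scores):
    """Average candidate-answer scores across models that differ only
    in their initial random seed.
    model_scores: (num_models, num_questions, num_candidates)."""
    return np.mean(model_scores, axis=0)

def mean_reciprocal_rank(scores, gt_index):
    """MRR of the ground-truth answers under (ensembled) candidate scores.
    scores: (num_questions, num_candidates); gt_index: (num_questions,)."""
    gt_scores = scores[np.arange(scores.shape[0]), gt_index][:, None]
    # rank = 1 + number of candidates scored strictly higher than the gt answer
    ranks = 1 + (scores > gt_scores).sum(axis=1)
    return float((1.0 / ranks).mean())
```

Because the members differ only in their seed, their errors are partially decorrelated, which is why plain score averaging already helps.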

\textbf{Analysis of Factor Graph Attention weights:} To infer the attention belief for a utility, we aggregate marginalized joint and local interactions, as well as local-information and prior terms. To calibrate each cue, we use scalar weights. To better understand the reasoning process and analyze attention, we suggest an importance score:
\be
S(\gamma) = \frac{|m_\gamma \cdot \gamma|}{\sum_{\delta\in\{\hat{w}_i, w_i, (w_{i,j})_{j\in\mathcal{U}}\}} |m_\delta \cdot \delta|},
\label{eq:score}
\ee
where $\gamma$ is the scalar weight of a cue and $m_\gamma$ is the mean term of the corresponding cue, computed over the entire validation set. Note that $\hat{w}_i$, $w_i$ and $(w_{i,j})_{j\in\mathcal{U}}$ are the scalar weights. $S(w_{i,j})$ captures the importance of the $j$-th cue for utility $i$: a high score means the $i$-th utility's attention belief heavily relies on cue $j$. Similarly, $S(\hat{w}_i)$ and $S(w_i)$ capture the importance of the local-interaction and local-information cues for the $i$-th utility, and the prior cue's importance is obtained in the same manner. We report the scores in \tabref{tab:weights}. We observe that the answer utility relies mostly on local-interactions. The question heavily relies on the prior, but also makes use of history-answer and history-question cues. The caption ignores all utilities other than the prior. For the image utility, the question is the most important cue; interestingly, we also observe the importance of priors. Image attention relies on the caption, while the caption ignores all other cues and preserves the prior behavior. The history questions and answers rely on the question and the local factors.
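Computing the importance score is a straightforward normalization of the absolute weighted mean terms. A minimal sketch follows; the cue names and numeric values are illustrative placeholders, not the learned weights of our model:

```python
def importance_scores(weights, mean_terms):
    """Normalized importance S(gamma) = |m_gamma * gamma| / sum_delta |m_delta * delta|
    over all cues feeding one utility's attention belief.
    weights: {cue_name: scalar weight}; mean_terms: {cue_name: mean cue value}."""
    contrib = {cue: abs(mean_terms[cue] * w) for cue, w in weights.items()}
    total = sum(contrib.values())
    return {cue: c / total for cue, c in contrib.items()}

# hypothetical cues for one utility: two pairwise cues, a local and a prior cue
w = {"question": 1.8, "caption": 0.4, "local": 0.6, "prior": 1.1}
m = {"question": 0.9, "caption": 0.5, "local": 0.3, "prior": 0.4}
s = importance_scores(w, m)  # scores sum to 1 across the utility's cues
```

Since the scores of one utility sum to one, they can be compared across rows of a table to see which cue dominates each attention belief.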

\textbf{Computation and insignificant interactions:} After training, some interactions may be found to be unnecessary, and our model can then easily be optimized: 1) The score in \equref{eq:score} can be used to omit less significant interactions. Previous multimodal attention approaches do not model pairwise interaction scores, making it hard to eliminate computations. 2) For the same image but a different question, we can re-use the computed joint interactions, such as the local-interaction, image-caption, \etc. This is impossible for approaches that pool cues, since the question changes. 3) As mentioned in Sec. 4.2, it is possible to share weights between similar utilities, \eg, different history questions/answers.
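The second optimization, re-using question-independent factors across the rounds of a dialog, can be sketched as a simple cache keyed by image and factor name (the class and factor names below are illustrative, not part of our implementation):

```python
class FactorCache:
    """Cache question-independent factors (e.g. image local-interaction,
    image-caption) so each is computed once per image and reused across
    all dialog rounds."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn   # the expensive factor computation
        self._cache = {}

    def get(self, image_id, factor_name):
        key = (image_id, factor_name)
        if key not in self._cache:     # compute once, then reuse
            self._cache[key] = self.compute_fn(image_id, factor_name)
        return self._cache[key]
```

Question-dependent factors (image-question, image-answer) bypass the cache, so the per-round cost reduces to only the factors that actually change.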

Currently, we do not pursue most of these options, as the model trains quickly (8 hours vs. 33 hours for the previous state of the art) and fits into a single 12GB GPU.

\section{Qualitative Evaluation}
\label{sec:qual}

\textbf{Factors visualization:} We provide additional visualizations in \figref{fig:onecol}. We visualize the scores that each type of factor assigns to each image region. The `Image-Local-Information,' `Image-Caption' and `Image-Local-Interaction' factors are constant across questions, while `Image-Question,' `Image-Answer,' `Image-History-Q' and `Image-History-A' change with every question. We calculated the variance of these interactions and observe that `Image-Question' has the highest variance, while `Image-Answer,' `Image-History-Q' and `Image-History-A' exhibit lower variance. Beyond the importance score, the high variance also suggests that the `Image-Question' cue is the most important one.

\textbf{Attention over dialogs:} In \figref{fig:res}, we present a randomly picked set of 50 images along with their corresponding dialogs. An automatic script is used to generate the figures. We highlight that image attention is aware of the scene in the context of the question, and is able to attend to the correct foreground or background regions. Question attention attends to informative words, and answer attention frequently correlates with the predicted answer. History attention emphasizes nuances.
