DialGraph: Sparse Graph Learning Networks for Visual Dialog
Abstract
Visual dialog is a task of answering a sequence of questions grounded in an image while utilizing a dialog history. Previous studies have implicitly explored the problem of reasoning about semantic structures among the history using softmax attention. However, we argue that softmax attention yields dense structures that can distract the model when answering questions that require partial or even no contextual information. In this paper, we formulate the visual dialog task as a graph structure learning task. To tackle the problem, we propose Sparse Graph Learning Networks (SGLNs) consisting of a multimodal node embedding module and a sparse graph learning module. The proposed model explicitly learns sparse dialog structures by incorporating binary and score edges and leveraging a new structural loss function. It then outputs the answer, updating each node via a message passing framework. As a result, the proposed model outperforms the state-of-the-art approaches on the VisDial v1.0 dataset, using only 10.95% of the dialog history, and also improves interpretability compared to baseline methods.
Keywords:
visual dialog, visual QA, graph neural networks, structural learning, multimodal deep learning

1 Introduction
A human dialogue by its nature exhibits highly complex structures. Specifically, when we have a dialogue, some utterances are semantically dependent on previous ones (i.e., context), while others are independent, due to an abrupt change in topic. Previous topics could be readdressed later on in the dialogue. Furthermore, we take advantage of multimodal inputs, including visual, linguistic, and auditory information, to capture the temporal topics of conversation. Notably, Visual Dialog (VisDial) [8], which is an extended version of visual question answering (VQA) [2, 12], reflects the complex and multimodal nature of the dialogue. Unlike VQA, it is designed to answer a sequence of questions given an image, utilizing a dialog history as context. For example, to answer an ambiguous question like “Do you think they are her parents?” (D4 in Fig. 1), a dialog agent should attend to the meaningful context from a dialog history as well as visual information. This task demands a rich set of abilities – understanding a sequence of multimodal contents (i.e., an input image, questions, and dialog history), and reasoning semantic structures among them.
Previous approaches in visual dialog have explored the problem of reasoning about semantic structures in dialogs by employing the soft-attention mechanism [4, 43]. Typically, most of the previous research has focused on extracting rich question-relevant representations from the given image and dialog history, while implicitly finding their relationships [8, 28, 42, 13, 9, 38]. Another line of research has tackled the problem of visual coreference resolution [39, 32, 19], and one other approach [47] attempts to find the inherent structures of the dialog. However, all previous work relies on the soft-attention mechanism, and we argue that applying it to the previous utterances severely limits a dialog agent's ability to learn various types of semantic relationships. Specifically, soft attention, which is based on a softmax function, always assigns a non-zero weight to all previous utterances, which results in dense (i.e., fully-connected) relationships. Herein lies the problem: even for questions that are partially dependent on (Q5 in Fig. 1) or independent of (Q6 in Fig. 1) the dialog history, all previous utterances are still taken into consideration and integrated into the contextual representations. As a consequence, the dialog agent overly relies on all previous utterances, even when they are irrelevant to the given question, which can hurt both performance and interpretability.
In this paper, we propose Sparse Graph Learning Networks (SGLNs) that explicitly discover the sparse structures of visually-grounded dialogs. We present a dialog as a graph structure where each node corresponds to a round of dialog, and edges represent the semantic dependencies between the nodes, as shown in Fig. 1. The proposed SGLNs infer the graph structure and predict the answer simultaneously. SGLNs involve two novel modules: a multimodal node embedding module and a sparse graph learning module. Inspired by the bottom-up and top-down attention mechanism [1], the node embedding module embeds the given image and each round of dialog in a joint fashion, yielding multimodal joint embeddings. We represent each embedding vector as a node of the graph. The sparse graph learning module infers two edge weights: binary (i.e., 0 or 1) and score edges. It then discovers the sparse and weighted structure by incorporating them. Note that the sparse graph learning module allows an isolated node when all elements in the binary edge weights are zero. It updates each node by integrating the neighborhood nodes via a message passing framework and feeds the updated node features to the answer decoder. Furthermore, we introduce a new structural loss function to encourage our model to infer explicit and reliable dialog structures by leveraging supervision that is readily obtainable. Consequently, as shown in Fig. 1(c), our model learns various types of semantic relationships: (1) dense relationships as in D1-D4, (2) sparse relationships as in D5, and (3) no relationships as in D6. The main contributions of our paper are as follows:

We propose Sparse Graph Learning Networks (SGLNs) that consider the sparse nature of a visually-grounded dialog. By using a multimodal node embedding module and a sparse graph learning module, our proposed model circumvents the conceptual shortcoming of dense structures by pruning unnecessary relationships.

We propose a new structural loss function to encourage SGLNs to learn the aforementioned semantic relationships explicitly. SGLNs are the first approach that predicts the sparse structures of a visually-grounded dialog with a structural loss function.

SGLNs achieve new state-of-the-art results on the VisDial v1.0 dataset using only 10.95% of the dialog history. We also compare SGLNs with baseline models to demonstrate the effectiveness of the proposed method. Finally, we perform a qualitative analysis of our proposed model, showing that SGLNs reasonably infer the underlying sparse structures and improve interpretability compared to a baseline model.
2 Related Work
Visual Dialog. The visual dialog task [8] was recently introduced as a temporal extension of VQA [2, 12]. In this task, a dialog agent should answer a sequence of questions by using an image and the dialog history as clues. We categorize the previous studies on visual dialog into three groups: (1) soft attention-based methods that compute attended representations of the image and the history [8, 28, 42, 13, 9, 38, 30], (2) visual coreference resolution methods [39, 25, 32, 19] that clarify ambiguous expressions (e.g., it, them) in the question and link them to a specific entity in the image, and (3) a structural learning method [47] that attempts to infer dialog structures. Our approach belongs to the third group. Zheng et al. [47] designed a structure inference model that predicts the answer in the context of an expectation-maximization (EM) algorithm. Specifically, they proposed a model based on graph neural networks (GNNs) that approximates the process of the EM algorithm. However, similar to the soft attention-based methods, they inferred dense semantic structures using a softmax function in the GNNs. Moreover, they recovered the structures only implicitly, using supervision for the given questions alone. To address these two aspects, we propose SGLNs that explicitly infer sparse structures with a definite objective (i.e., a structural loss function).
Meanwhile, a few studies [32, 20] have noticed the sparse property of visual dialog, but their reasoning capability is still quite limited. CDF [20] randomly extracted up to three elements of the dialog history to avoid excessive exploitation of the whole history. For visual coreference resolution, RvA [32] backtracked the history and selectively retrieved the visual attention maps of the previous dialogs that were determined to be useful.
Graph Neural Networks (GNNs) [11, 37] have sparked tremendous interest at the intersection of deep neural networks and structural learning approaches. There are two existing lines of methods involving GNNs: (1) methods that operate on graph-structured data [24, 6, 14, 31, 44], and (2) methods that construct a graph with neural networks to approximate the learning or inference process of graphical models [41, 5, 10, 23]. More recently, graph learning networks (GLNs), which are an extension of the second line, were proposed by [35, 33], with the goal of reasoning about the underlying structures of input data. Note that GLNs consider unstructured data and dynamic domains (e.g., time-varying domains). Accordingly, CB-GLNs [33] attempted to discover the compositional structure of long video data by using a normalized graph-cut algorithm [40]. Our method belongs to the GLN family. However, SGLNs are significantly different from previous studies in that SGLNs learn to build sparse structures adaptively, without relying on a predefined algorithm, and the dataset we use is highly multimodal.
3 Sparse Graph Learning Networks
In this section, we formulate the visual dialog task using graph structures, then describe our proposed model, Sparse Graph Learning Networks (SGLNs). The visual dialog task [8] is defined as follows: given an image $I$, a caption $C$ describing the image, a dialog history until round $t-1$, $H_{t-1} = (C, (Q_1, A_1), \ldots, (Q_{t-1}, A_{t-1}))$, and a question $Q_t$ at round $t$, the goal is to find an appropriate answer to the question among the 100 answer candidates $\mathcal{A}_t = \{A_t^{(1)}, \ldots, A_t^{(100)}\}$. Following the previous work [8], we use the ground-truth answers for the dialog history.
In our approach, we consider the task as a graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$ with $t+1$ nodes, where each node corresponds to the multimodal feature for the caption, a previous round of the dialog history, or the current question $Q_t$. The semantic dependencies among the nodes are represented as weighted edges $\mathcal{E}_t$.
Fig. 2 provides an overview of our proposed model, Sparse Graph Learning Networks (SGLNs). Specifically, the SGLNs consist of three components: a multimodal node embedding module, a sparse graph learning module, and an answer decoder. The multimodal node embedding module aims to learn rich visual-linguistic representations for each round of dialog by employing a simple attention mechanism. We represent the multimodal joint feature vector for each round of dialog as a node of the graph. The sparse graph learning module estimates the binary and score edges among the nodes and combines these two edge weights into sparse weighted edges. Then, the sparse graph learning module aggregates the neighborhood node feature vectors for the current question via the message-passing algorithm [10]. The aggregated hidden feature is fed into the answer decoder, which yields the most likely answer. Furthermore, the binary edges (i.e., 0 or 1) that represent the semantic relevance among the nodes are fed into the structural loss function so that the model can predict reliable dialog structures at test time. Drawing comparisons to human cognition, the multimodal node embedding module acts similarly to human episodic memory [3], where each node corresponds to a unit of episodic memory that contains visual and linguistic information for each round of dialog. Also, the sparse graph learning module mimics the behavior of a human who adaptively recalls relevant multimodal information from their episodic memory.
In the following subsections, we will introduce input features for SGLNs, then describe the detailed architectures of the multimodal node embedding module, the sparse graph learning module, and the answer decoder. Finally, we present the objective function for SGLNs.
3.1 Input Features
Visual Features. From the given image $I$, we extract $d_v$-dimensional visual features of $K$ objects by employing the pre-trained Faster R-CNN model [36, 1], which are denoted as $V = \{v_1, \ldots, v_K\}$.
Language Features. We first encode the question $Q_t$, which is a word sequence $(w_1, \ldots, w_L)$ of length $L$, by using a bidirectional LSTM (BiLSTM) [16] as follows:
(1) $\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(w_i, \overrightarrow{h}_{i-1})$
(2) $\overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(w_i, \overleftarrow{h}_{i+1})$
(3) $q_t = W_q\,[\overrightarrow{h}_L ; \overleftarrow{h}_1]$
where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ denote the forward and backward hidden states of the $i$-th word, respectively. Note that we use the concatenation of the last hidden states from each direction, followed by a projection matrix $W_q$, which results in the question feature $q_t \in \mathbb{R}^d$. Likewise, each element of the dialog history (i.e., the caption and each previous question-answer pair) is encoded into language features $u_0, u_1, \ldots, u_{t-1}$, and all the answer candidates at the $t$-th round are also embedded into $\{a_t^{(1)}, \ldots, a_t^{(100)}\}$ with additional BiLSTMs.
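As a reference, the following is a minimal PyTorch sketch of the BiLSTM sentence encoder described above; the layer sizes and module names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """BiLSTM sentence encoder: concatenate the last forward and backward
    hidden states and project them to a d-dimensional feature (Eqs. 1-3)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        emb = self.embed(tokens)                    # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)                # h_n: (2, batch, hidden_dim)
        last = torch.cat([h_n[0], h_n[1]], dim=-1)  # forward and backward final states
        return self.proj(last)                      # (batch, out_dim)
```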
3.2 Multimodal Node Embedding Module
As shown in Fig. 2, the multimodal node embedding module embeds the visual-linguistic joint representation associated with each node by performing visual grounding of each language feature. To implement this process, we take inspiration from the bottom-up and top-down attention mechanism [1, 21]. For the object-level visual features $V$ and the corresponding language feature $u_s$ of the $s$-th node, the node embedding module first finds the spatial objects that the language feature describes with a soft attention mechanism. Formally,
(4) $z = p^\top \big( f_v(V) \odot f_u(u_s)\,\mathbb{1}^\top \big)$
(5) $\alpha = \mathrm{softmax}(z)$
(6) $\hat{v}_s = \textstyle\sum_{k=1}^{K} \alpha_k v_k$
where $f_v$ and $f_u$ are non-linear functions that transform inputs to a $d$-dimensional space, such as multi-layer perceptrons (MLPs), $\odot$ denotes the Hadamard product (i.e., element-wise multiplication), and $\mathbb{1} \in \mathbb{R}^{K}$ is a vector whose elements are all one. The attention function is parametrized by the vector $p \in \mathbb{R}^{d}$. Then, the multimodal feature $e_s$ is obtained from the attended visual feature $\hat{v}_s$ and the language feature $u_s$ as follows:
(7) $e_s = g_v(\hat{v}_s) \odot g_u(u_s)$
where $g_v$ and $g_u$ are projection functions. As a consequence, we obtain visual-linguistic joint representations for all $t+1$ nodes, which can be represented in matrix form as $X = [e_0; e_1; \cdots; e_t] \in \mathbb{R}^{(t+1) \times d}$.
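A minimal sketch of this node embedding step in PyTorch is shown below, assuming the reconstructed Eqs. 4-7; the feature dimensions and names (e.g., `NodeEmbedding`, `f_v`, `g_u`) are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeEmbedding(nn.Module):
    """Attend over object features with the language feature, then fuse the
    attended visual feature and the language feature with a Hadamard product."""
    def __init__(self, v_dim=2048, u_dim=512, d=512):
        super().__init__()
        self.f_v = nn.Sequential(nn.Linear(v_dim, d), nn.Tanh())
        self.f_u = nn.Sequential(nn.Linear(u_dim, d), nn.Tanh())
        self.p = nn.Linear(d, 1)          # attention vector p (Eq. 4)
        self.g_v = nn.Linear(v_dim, d)
        self.g_u = nn.Linear(u_dim, d)

    def forward(self, V, u):              # V: (batch, K, v_dim), u: (batch, u_dim)
        joint = self.f_v(V) * self.f_u(u).unsqueeze(1)        # (batch, K, d)
        alpha = F.softmax(self.p(joint).squeeze(-1), dim=-1)  # attention weights (Eq. 5)
        v_hat = torch.einsum('bk,bkd->bd', alpha, V)          # attended visual feature (Eq. 6)
        return self.g_v(v_hat) * self.g_u(u)                  # node feature e_s (Eq. 7)
```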
3.3 Sparse Graph Learning Module
The sparse graph learning module infers the underlying sparse and weighted graph structure between nodes, where the edge weights are estimated based on the node features. To make the graph structure sparse, we propose two types of edges on the graph $\mathcal{G}_t$: binary edges and score edges, whose corresponding adjacency matrices are $A^{\mathrm{bin}}$ and $A^{\mathrm{scr}}$, respectively. To simplify the notation, we omit the round subscript $t$ in the following equations.
Binary Edges. We first define the binary edge between two nodes $x_i$ and $x_j$ as a binary random variable $e^{\mathrm{bin}}_{ij} \in \{0, 1\}$, for all $i$ and $j$. The sparse graph learning module estimates the likelihood of the binary variables given the node features, where the probability implies whether the two nodes are semantically related or not. We regard the binary variable as a two-class categorical variable and define the probability distribution as follows:
(8) $z_{ij} = W_b\,[x_i ; x_j]$
(9) $p(e^{\mathrm{bin}}_{ij} \mid x_i, x_j) = \mathrm{softmax}(z_{ij} / \tau)$
where $W_b$ is a learnable parameter and $\tau$ is the softmax temperature. Since the binary edge is discrete and non-differentiable, we employ the Straight-Through Gumbel-Softmax estimator (i.e., ST-Gumbel) [17] to ensure end-to-end training. During the forward propagation, ST-Gumbel makes a discrete decision by using the Gumbel-Max trick [29]:
(10) $A^{\mathrm{bin}}_{ij} = \arg\max_{c \in \{0, 1\}} \big( \log p(e^{\mathrm{bin}}_{ij} = c \mid x_i, x_j) + g_c \big)$
where the random variables $g_c$ are drawn from a $\mathrm{Gumbel}(0, 1)$ distribution [17]. In the backward pass, ST-Gumbel utilizes the derivative of the continuous softmax probabilities by approximating $\nabla_\theta A^{\mathrm{bin}}_{ij} \approx \nabla_\theta\, p(e^{\mathrm{bin}}_{ij})$, thus enabling backpropagation and end-to-end training.
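For illustration, below is a minimal sketch of this binary-edge estimation with the Straight-Through Gumbel-Softmax, using PyTorch's built-in `gumbel_softmax`; the pairwise scorer architecture is an assumption for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryEdge(nn.Module):
    """Two-class logits per node pair, discretized with ST-Gumbel so the forward
    pass is hard (0 or 1) while gradients flow through the soft probabilities."""
    def __init__(self, d=512, tau=1.0):
        super().__init__()
        self.scorer = nn.Linear(2 * d, 2)   # logits for "not related" / "related"
        self.tau = tau

    def forward(self, x_i, x_j):             # x_i, x_j: (batch, d)
        logits = self.scorer(torch.cat([x_i, x_j], dim=-1))   # (batch, 2)
        # hard=True: one-hot sample in the forward pass (Gumbel-Max trick),
        # softmax gradient in the backward pass (straight-through estimator)
        sample = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        return sample[..., 1]                 # binary edge in {0, 1}
```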
Score Edges. We also define the score edges that measure the extent to which two nodes are relevant, and the weighted adjacency matrix $A^{\mathrm{scr}}$ is computed as:
(11) $A^{\mathrm{scr}}_{ij} = \big( x_i^\top W_s\, x_j \big)^2$
with a learnable parameter $W_s$. Following the relational graph learning algorithm [46], we compute the score edges with a squared operation for stable training.
Sparse Weighted Edges. The sparse graph learning module multiplies the binary edges and score edges, finally yielding the sparse and weighted adjacency matrix $A$ as:
(12) $A = A^{\mathrm{bin}} \odot A^{\mathrm{scr}}$
With the above edge weight estimations, the sparse graph learning module is able to model three types of relationships on $\mathcal{G}$: (1) dense relationships, similar to the previous softmax-based approaches, if $A^{\mathrm{bin}}_{ij} = 1$ for all $j$ (i.e., all entries in the $i$-th row of $A^{\mathrm{bin}}$ are one), (2) sparse relationships if only some of the entries are one, and (3) no relationships if $A^{\mathrm{bin}}_{ij} = 0$ for all $j$ (i.e., node $i$ is isolated).
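Below is a minimal sketch combining the score edge (Eq. 11) with the binary edge (Eq. 12); the bilinear form used for the squared score is an assumption consistent with the prose, not necessarily the exact parametrization of the paper.

```python
import torch
import torch.nn as nn

class ScoreEdge(nn.Module):
    """Squared bilinear relevance score between two nodes, multiplied by the
    hard 0/1 binary edge so that pruned relationships are zeroed out."""
    def __init__(self, d=512):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)   # learnable parameter W_s

    def forward(self, x_i, x_j, binary_edge):  # x_i, x_j: (batch, d), binary_edge: (batch,)
        score = (self.W(x_i) * x_j).sum(dim=-1) ** 2   # squared score for stable training
        return binary_edge * score                     # sparse weighted edge A_ij
```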
Message-passing and Update. Based on the sparse weighted adjacency matrix $A$, the sparse graph learner updates the hidden states of all nodes through a message-passing framework [10]. Similar to graph convolutional networks [24], we implement the message-passing layer as a linear projection of the node features, followed by a normalized weighted sum according to the edge weights:
(13) $m_i = \sum_{j} \frac{A_{ij}}{D_{ii}}\, W_m x_j$
Note that $D$ is the degree matrix of $A$. The hidden features of the nodes are calculated via the update layer, which adds the input feature and the aggregated messages and then feeds them into a non-linear function $\sigma$:
(14) $\tilde{x}_i = \sigma(x_i + m_i)$
Notice that the sparse graph structure inference followed by the hidden state update can be viewed as dialog reasoning. Moreover, the model is able to perform multi-step reasoning by repeatedly conducting the inference and update based on the hidden states. In this paper, for the sake of simplicity, we assume that only the edges connected to the question node exist (i.e., $A_{ij} = 0$ unless $i = t$). For the question node $x_t$, the message vector and hidden state vector are simply represented as:
(15) $m_t = \sum_{j=0}^{t-1} \frac{A_{tj}}{D_{tt}}\, W_m x_j, \qquad \tilde{x}_t = \sigma(x_t + m_t)$
The sparse graph learner outputs the hidden state vector $\tilde{x}_t$ of the question node to predict the answer.
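The message-passing and update step (Eqs. 13-15) can be sketched as follows; this is a simplified, dense-tensor illustration assuming the reconstructed equations, with illustrative module names.

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """One message-passing step: linear projection of node features, a
    degree-normalized weighted sum over neighbors given the sparse adjacency,
    and a residual update followed by a nonlinearity."""
    def __init__(self, d=512):
        super().__init__()
        self.msg = nn.Linear(d, d)   # W_m
        self.act = nn.ReLU()

    def forward(self, X, A):         # X: (N, d) node features, A: (N, N) sparse weighted adjacency
        deg = A.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # degree; isolated nodes get zero messages
        M = (A / deg) @ self.msg(X)                        # normalized aggregation (Eq. 13)
        return self.act(X + M)                             # residual update (Eq. 14)
```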
3.4 Answer Decoder
Discriminative Decoder.
The discriminative decoder computes the likelihood of the answer candidates by dot-product operations between the hidden vector $\tilde{x}_t$ and the feature vectors of the answer candidates $\{a_t^{(n)}\}_{n=1}^{100}$. Then, the SGLNs are optimized by minimizing the negative log-likelihood of the ground-truth answer as:
(16) $p_t = \mathrm{softmax}\big( [\, \tilde{x}_t^\top a_t^{(1)}, \ldots, \tilde{x}_t^\top a_t^{(100)} \,] \big)$
(17) $\mathcal{L}_{D} = -\sum_{n=1}^{100} y_n \log p_{t,n}$
where $y$ is the one-hot encoded label vector of the ground-truth answer. For evaluation, the answer candidates are ranked according to their likelihoods.
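A minimal PyTorch sketch of this decoder (Eqs. 16-17) is given below; the function signature is illustrative.

```python
import torch
import torch.nn.functional as F

def discriminative_decode(x_t, answer_feats, gt_index=None):
    """Dot-product scores between the question node state and the 100 candidate
    answer embeddings, ranked for evaluation and trained with the NLL loss."""
    # x_t: (batch, d), answer_feats: (batch, 100, d), gt_index: (batch,) long
    scores = torch.einsum('bd,bnd->bn', x_t, answer_feats)   # (batch, 100), Eq. 16 logits
    ranks = scores.argsort(dim=-1, descending=True)          # ranked candidate indices
    loss = F.cross_entropy(scores, gt_index) if gt_index is not None else None  # Eq. 17
    return scores, ranks, loss
```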
Generative Decoder.
Similar to a sequence-to-sequence model, the generative decoder aims to generate the ground-truth answer's word sequence autoregressively via an LSTM:
(18) $\mathcal{L}_{G} = -\sum_{m=1}^{M} \log p\big(w^{gt}_{m} \mid w^{gt}_{<m}, \tilde{x}_t\big)$
where $\tilde{x}_t$ is the output of the sparse graph learning module, and the ground-truth answer consists of words $(w^{gt}_1, \ldots, w^{gt}_M)$. We initialize the hidden state of the LSTM with $\tilde{x}_t$ (i.e., $h_0 = \tilde{x}_t$). Following the Visual Dialog task [8], we utilize the log-likelihood scores of the candidate answers to determine their ranks during evaluation.
3.5 Objective Function
Structural Loss Function.
Along with the two decoder loss functions $\mathcal{L}_D$ and $\mathcal{L}_G$, we introduce a structural loss function $\mathcal{L}_S$ to encourage the SGLNs to infer explicit, reliable dialog structures. Inspired by the visual coreference resolution model [25], our method utilizes structural supervision in addition to the ground-truth answer at each round. Specifically, we automatically obtain the semantic dependencies among the rounds of dialog in the form of a lower-triangular binary matrix $S$ from an off-the-shelf neural coreference resolution tool:
(19) $\mathcal{L}_{S} = \ell\big(A^{\mathrm{bin}}, S\big)$
where $\ell$ denotes the element-wise mean squared error. Here, $\mathcal{L}_S$ encourages the SGLNs to predict a reliable adjacency matrix (i.e., dialog structure). Note that the SGLNs use the structural supervision only during training and infer the dialog structures on their own at test time. We clarify that the efficacy of coreference resolution for visual dialog was explored in previous work [25]; however, their gain is limited as they use a different approach from ours.
Multi-task Learning.
To predict the dialog structure and answer the given questions, the SGLNs are trained to minimize the sum of the structural loss and the answer decoder loss: $\mathcal{L} = \lambda_D \mathcal{L}_D + \lambda_S \mathcal{L}_S$ or $\mathcal{L} = \lambda_G \mathcal{L}_G + \lambda_S \mathcal{L}_S$, where $\lambda_{(\cdot)}$ are the weights for each loss. Optionally, the SGLNs take a dual-decoder strategy by minimizing the three losses simultaneously: $\mathcal{L} = \lambda_D \mathcal{L}_D + \lambda_G \mathcal{L}_G + \lambda_S \mathcal{L}_S$. Unless stated otherwise, the default loss is $\lambda_D \mathcal{L}_D + \lambda_S \mathcal{L}_S$. The implementation details and results will be discussed in Section 4.
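As a reference, here is a minimal sketch of the multi-task objective, assuming the reconstructed loss terms; the loss-weight values are placeholders, not the paper's settings.

```python
import torch.nn.functional as F

def total_loss(decoder_loss, binary_edges, supervision,
               gen_loss=None, lambda_s=1.0, lambda_g=1.0):
    """Decoder loss plus the structural loss (element-wise MSE between the
    predicted binary edges and the coreference-based supervision, Eq. 19),
    optionally adding the generative decoder loss (dual-decoder strategy)."""
    struct_loss = F.mse_loss(binary_edges, supervision)
    loss = decoder_loss + lambda_s * struct_loss
    if gen_loss is not None:
        loss = loss + lambda_g * gen_loss
    return loss
```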
4 Experiments
In this section, we describe the details of our experiments on the Visual Dialog dataset. We first introduce the Visual Dialog dataset, evaluation metrics, and implementation details. Then, we compare the SGLNs with baseline models and stateoftheart methods. Note that the qualitative analysis of our proposed model is described in Sec. 5.
4.1 Experimental Setup
Dataset. We benchmark our proposed model on the Visual Dialog (i.e., VisDial) v1.0 dataset. The VisDial dataset [8] was collected in a two-player chatting environment, where a questioner tries to figure out an unseen image by asking free-form questions, and an answerer responds to the questions based on the image. As a result, the VisDial v1.0 dataset contains 1.2M, 20k, and 44k question-answer pairs in the train, validation, and test splits, respectively. The 123,287 images from COCO [27] and 2,064 and 8k images from Flickr are used to collect the dialog data for each split, respectively. A list of 100 answer candidates accompanies each question-answer pair.
Evaluation. We follow the standard protocol for evaluating visual dialog models, as proposed in the earlier work [8]. Specifically, the visual dialog model ranks a list of 100 candidate answers and returns the ranked list for further evaluation. There are four kinds of evaluation metrics in the Visual Dialog task: (1) mean reciprocal rank (MRR) of the ground-truth answer in the ranked list, (2) recall@k (R@k), which measures whether the ground-truth answer appears in the top-k ranked list, (3) mean rank (Mean) of the ground-truth answer, and (4) normalized discounted cumulative gain (NDCG). Contrary to the classical retrieval metrics (MRR, R@k, and mean rank), which are based on a single ground-truth answer, NDCG takes into account all relevant answers from the 100-answer list by using the densely annotated relevance scores. It penalizes lower-ranked answers with high relevance scores, and swapping candidates with the same relevance does not affect NDCG. Due to these properties, NDCG is regarded as the primary metric and used to evaluate methods on the VisDial v1.0 dataset.
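For concreteness, a minimal NumPy sketch of the NDCG computation over a 100-candidate list is shown below; it is a simplified illustration of the metric described above (assuming, as in the VisDial evaluation, that the cut-off K equals the number of candidates with non-zero relevance), not the official evaluation code.

```python
import numpy as np

def ndcg(relevance, ranked_indices):
    """Discount lower-ranked answers and normalize by the ideal ordering, so
    swapping candidates with equal relevance leaves the score unchanged."""
    rel = np.asarray(relevance, dtype=float)      # dense relevance score per candidate
    k = int((rel > 0).sum())                      # cut-off: number of relevant answers
    if k == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = float((rel[np.asarray(ranked_indices)[:k]] * discounts).sum())
    idcg = float((np.sort(rel)[::-1][:k] * discounts).sum())
    return dcg / idcg
```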
Implementation Details. The SGLNs embed all the language inputs into 300-dimensional vectors initialized with GloVe [34]. All three BiLSTMs used for encoding the word embedding vectors are single-layer with 512 hidden units. We also use the bottom-up attention features [1] from Faster R-CNN [36] pre-trained on the Visual Genome [26]. The number of object features per image is , and the dimension of each feature is . The dimension $d$ of the joint embedding space is 512. The hyperparameters for the multi-task learning are , , and . We employ the Adam optimizer [22] with an initial learning rate that is warmed up until epoch 4 and halved every two epochs from epoch 5 to 10. We train on the VisDial v1.0 training split and evaluate our proposed model on the validation and test splits.
Table 1: Comparison with the baseline models.
Method  NDCG  MRR  R@1  R@5  R@10  Mean  Sparsity
Dense  59.53  63.21  49.56  80.02  89.25  4.42  0%
Sparse-hard  61.89  61.28  47.41  78.13  88.27  4.70  81.82%
SGLNs  62.83  60.54  46.64  77.59  87.33  4.89  89.15%
4.2 Quantitative Results
Comparison with Baselines.
We compare SGLNs with baseline models to demonstrate the effectiveness of our method. We define two models as baselines: Dense and Sparse-hard. The Dense model utilizes a softmax attention mechanism, which results in a fully-connected graph. Contrary to the Dense model, the Sparse-hard model picks exactly one element of the dialog history by applying the Gumbel-Softmax over the whole dialog history. Note that the structural supervision is provided to the Sparse-hard model. The results are summarized in Table 1. The SGLNs achieve better performance than the baseline models on the NDCG metric while maintaining competitive performance on the ground-truth-dependent metrics (i.e., MRR, R@k, and mean rank). We also observe that the Dense model, which overly exploits the dialog history, shows the best performance on the ground-truth-dependent metrics. We argue that the Dense model mainly focuses on finding the single ground-truth answer with a rich set of dialog history, at the cost of sacrificing the ability to provide 'flexible' answers (i.e., NDCG). Similarly, the NDCG performance of the Sparse-hard model tends to increase as the sparsity increases.
Question-type Analysis.
Under the same setup as the above experiment, we conduct a question-type analysis of the NDCG scores to verify our hypothesis discussed in Sec. 1. Based on the semantic dependency information introduced in Sec. 3, we categorize all questions in the VisDial v1.0 validation split into three groups: (1) independent questions that can be answered without the dialog history, (2) partially dependent questions that demand a few elements of the dialog history, and (3) densely dependent questions that require all previous dialogs. As illustrated in Fig. 3, we compare our proposed model with the softmax-based Dense model, showing that the SGLNs significantly outperform the Dense model on all types of questions. The performance gap between the two models is 3.74%, 2.61%, and 0.83% for each type of question, respectively. We observe that the Dense model relatively suffers from finding relevant answers for independent questions. This validates that excessive exploitation of the dialog history can cause a distraction for such questions.
Table 2: Comparison with the state-of-the-art discriminative models on the VisDial v1.0 test-std split.
Method  NDCG  MRR  R@1  R@5  R@10  Mean  Sparsity
LF [8]  45.31  55.42  40.95  72.45  82.83  5.95   
HRE [8]  45.46  54.16  39.93  70.45  81.50  6.41   
MN [8]  47.50  55.49  40.98  72.30  83.30  5.92   
GNN [47]  52.82  61.37  47.33  77.98  87.83  4.57   
CorefNMN [25]  54.70  61.50  47.55  78.10  88.80  4.40   
RvA [32]  55.59  63.03  49.03  80.40  89.83  4.18   
DualVD [18]  56.32  63.23  49.25  80.23  89.70  4.11   
FGA [38]  56.93  66.22  52.75  82.92  91.08  3.81   
HACAN [45]  57.17  64.22  50.88  80.63  89.45  4.20   
DL61 [13]  57.32  62.20  47.90  80.43  89.95  4.17   
DAN [19]  57.59  63.20  49.63  79.75  89.35  4.30   
NMN [25]  58.10  58.80  44.15  76.88  86.88  4.81   
Transformer [30]  60.92  60.65  47.00  77.03  87.75  4.90   
SGLNs  60.77  58.40  44.15  75.65  85.70  5.22  89.14% 
SGLNs (dual decoder)  61.27  59.97  45.68  77.12  87.10  4.85  89.05%
Table 3: Comparison with the generative decoder-based models on the VisDial v1.0 validation split.
Method  NDCG  MRR  R@1  R@5  R@10  Mean  Sparsity
MN [8]  56.99  47.83  38.01  57.49  64.08  18.76   
HCIAE [28]  59.70  49.07  39.72  58.23  64.73  18.32   
CoAtt [42]  59.24  49.64  40.09  59.37  65.92  17.86   
ReDAN [9]  60.47  50.02  40.27  59.93  66.78  17.40   
SGLNs  60.82  48.82  39.64  57.58  64.37  18.03  87.03% 
Comparison with the State-of-the-art.
We compare our proposed model with the state-of-the-art methods on the VisDial v1.0 dataset. As shown in Table 2, SGLNs with the discriminative decoder outperform all other methods with respect to the NDCG metric, including the concurrent work, Transformer [30]. That work demonstrated the effectiveness of training the discriminative and generative decoders simultaneously. Accordingly, we also apply the dual-decoder strategy described in Sec. 3 for a fair comparison, lifting our model's NDCG to 61.27%. The results of the dual-decoder model are obtained from the output of the discriminative decoder. Note that the sparsity of the SGLNs is 89.05%, which means that our proposed model only utilizes 10.95% of the dialog history. The sparsity is calculated as the percentage of zero-valued edges in the graph. We consider these results encouraging as they indicate that the SGLNs adaptively attend to the dialog history while achieving new state-of-the-art performance on the primary metric. Furthermore, we report the performance of the generative decoder-based models on the VisDial v1.0 validation split. As shown in Table 3, the SGLNs achieve new state-of-the-art performance on NDCG with a sparsity of 87.03%. Note that all entries in Table 3 are re-implemented by [9], utilizing the object-level visual features from Faster R-CNN [36] and GloVe [34] vectors for a fair comparison.
5 Discussions
Visualization of the Inferred Graph Structures.
For qualitative analysis, in Fig. 4, we visualize the images, the corresponding dialogs from the validation split, and the inferred adjacency matrices, as well as the ones from the Dense model as a counterpart.
Compared to the dense structure in the baseline, the proposed SGLNs indeed learn the innate sparse structures, and the question nodes receive the information from the other nodes in a selective fashion.
For instance, in the first dialog example, the questions from Q3 to Q10 have non-zero binary edges to all previous contexts except D1 and D2, which do not contain relevant information about 'the woman'.
In contrast, Q1 and Q2 are not connected to any other node, not even the caption node, because they can be answered without additional context.
Knowledge Transfer of Semantic Structure. As discussed in Section 3.5, the structural loss function can be seen as a knowledge distillation loss [15] that transfers knowledge from the pre-trained neural coreference resolution model to our sparse graph learning module. Even though we employ ST-Gumbel to mitigate the unpredictability of training the binary edges, this structural loss was decisively helpful in boosting the early stage of training.
6 Conclusions
In this paper, we formulate the visual dialog task as a graph structure learning task where the edges represent the semantic dependencies among multimodal embedding nodes learned from the given image, caption, questions, and dialog history.
The proposed Sparse Graph Learning Networks (SGLNs) learn the sparse dialog structures by incorporating binary and score edges, leveraging structural supervisions.
Our experiments demonstrate the efficacy of SGLNs by achieving state-of-the-art NDCG performance on the VisDial v1.0 dataset with 61.27 on the test-std split, using only 10.95% of the dialog history.
Qualitatively, the visualization of the inferred graph structures shows adaptive reasoning mechanisms depending on the type of question.
Acknowledgements. The authors would like to thank SK TBrain for sharing GPU resources. This work was partly supported by the Korea government (2015000310SW.StarLab, 2017001772VTT, 2019001367BabyMind).
Supplementary
Structure Inference. At the inference stage, the SGLNs greedily infer the binary edges with the largest probability without drawing the sample in Eq. 10. This strategy is similar to the RvA [32] model that also makes discrete decisions for the visual coreference resolution in the visual dialog.
Structural Supervision. We readily obtain the semantic dependency information from the neural coreference resolution tool based on [7] and use it as the structural supervision. As shown in Fig. 5(c), the structural supervision represents the sentence-level semantic dependencies between the given questions (i.e., Q1-Q6 in rows) and each element of the dialog history (i.e., C and D1-D6 in columns) in the form of a binary matrix. Specifically, the one-valued entries in the structural supervision indicate that both sentences include noun phrases or a pronoun referring to the same entity. On the other hand, the zero-valued entries denote that the two sentences do not share any entity. The upper-triangular part of the structural supervision matrix (i.e., the gray area) is zero because of the temporal nature of the dialog: a question can only depend on earlier elements. The sparsity of the structural supervision is 85.50%, calculated as the percentage of zero-valued entries in the blue area.
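A minimal sketch of how such a supervision matrix could be assembled, assuming the coreference tool has already grouped mentions into entity clusters per dialog element; the function and the toy entity sets below are hypothetical illustrations, not the exact pipeline.

```python
import numpy as np

def build_supervision(sentence_entities):
    """Mark S[i, j] = 1 when dialog element i shares a coreferent entity with an
    earlier element j (lower-triangular, reflecting the temporal order)."""
    n = len(sentence_entities)
    S = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(i):
            if sentence_entities[i] & sentence_entities[j]:
                S[i, j] = 1.0
    return S

# toy example: caption and three rounds; rounds 1 and 3 mention the caption's entity
print(build_supervision([{"woman"}, {"woman"}, {"dog"}, {"woman", "dog"}]))
```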
Structural Loss Function. We define the structural loss function as an element-wise mean squared error between the structural supervision and the binary edges inferred by the SGLNs. By minimizing this loss, the SGLNs learn to infer the binary edges based on the structural supervision. Although the structural supervision automatically obtained from the off-the-shelf coreference resolution tool may not cover the exact semantic dependencies in the visual dialog, we demonstrate the effectiveness of the proposed method both quantitatively and qualitatively.
Footnotes
 https://github.com/huggingface/neuralcoref based on the work [7].
References
[1] (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
[2] (2015) VQA: visual question answering. In ICCV.
[3] (2000) The episodic buffer: a new component of working memory? In Trends in Cognitive Sciences.
[4] (2014) Neural machine translation by jointly learning to align and translate. In ICLR.
[5] (2016) Interaction networks for learning about objects, relations and physics. In NIPS.
[6] (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
[7] (2016) Deep reinforcement learning for mention-ranking coreference models. In ACL.
[8] (2017) Visual dialog. In CVPR.
[9] (2019) Multi-step reasoning via recurrent dual attention for visual dialog. In ACL.
[10] (2017) Neural message passing for quantum chemistry. In ICML.
[11] (2005) A new model for learning in graph domains. In IJCNN.
[12] (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In CVPR.
[13] (2019) Image-question-answer synergistic network for visual dialog. In CVPR.
[14] (2017) Inductive representation learning on large graphs. In NIPS.
[15] (2014) Distilling the knowledge in a neural network. In NIPS 2014 Deep Learning Workshop.
[16] (1997) Long short-term memory. In Neural Computation.
[17] (2017) Categorical reparameterization with Gumbel-Softmax. In ICLR.
[18] (2019) DualVD: an adaptive dual encoding model for deep visual understanding in visual dialogue. In AAAI.
[19] (2019) Dual attention networks for visual reference resolution in visual dialog. In EMNLP.
[20] (2020) Modality-balanced models for visual dialogue. In AAAI.
[21] (2017) Hadamard product for low-rank bilinear pooling. In ICLR.
[22] (2014) Adam: a method for stochastic optimization. In ICLR.
[23] (2018) Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687.
[24] (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
[25] (2018) Visual coreference resolution in visual dialog using neural module networks. In ECCV.
[26] (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. In ICCV.
[27] (2014) Microsoft COCO: common objects in context. In ECCV.
[28] (2017) Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model. In NIPS.
[29] (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
[30] (2019) Efficient attention mechanism for handling all the interactions between many inputs with application to visual dialog. arXiv preprint arXiv:1911.11390.
[31] (2016) Learning convolutional neural networks for graphs. In ICML.
[32] (2018) Recursive visual attention in visual dialog. In CVPR.
[33] (2020) Cut-based graph learning networks to discover compositional structure of sequential video data.
[34] (2014) GloVe: global vectors for word representation. In EMNLP.
[35] (2019) Graph learning network: a structure learning algorithm. arXiv preprint arXiv:1905.12665.
[36] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
[37] (2008) The graph neural network model. In IEEE Transactions on Neural Networks.
[38] (2019) Factor graph attention. In CVPR.
[39] (2017) Visual reference resolution using attention memory for visual dialog. In NIPS.
[40] (2000) Normalized cuts and image segmentation. In IEEE Transactions on Pattern Analysis and Machine Intelligence.
[41] (2016) Learning multiagent communication with backpropagation. In NIPS.
[42] (2018) Are you talking to me? Reasoned visual dialog generation through adversarial learning. In CVPR.
[43] (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML.
[44] (2018) How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
[45] (2019) Making history matter: history-advantage sequence training for visual dialog. In ICCV.
[46] (2018) GLoMo: unsupervised learning of transferable relational graphs. In NIPS.
[47] (2019) Reasoning visual dialogs with structural and partial observations. In CVPR.