DialGraph: Sparse Graph Learning Networks for Visual Dialog

Abstract

Visual dialog is the task of answering a sequence of questions grounded in an image while exploiting a dialog history. Previous studies have implicitly explored the problem of reasoning about semantic structures among the history using softmax attention. However, we argue that softmax attention yields dense structures that can be distracting for questions requiring partial or even no contextual information. In this paper, we formulate visual dialog as a graph structure learning task. To tackle the problem, we propose Sparse Graph Learning Networks (SGLNs) consisting of a multimodal node embedding module and a sparse graph learning module. The proposed model explicitly learns sparse dialog structures by incorporating binary and score edges and leveraging a new structural loss function. It then predicts the answer after updating each node via a message passing framework. As a result, the proposed model outperforms state-of-the-art approaches on the VisDial v1.0 dataset while using only 10.95% of the dialog history, and it also improves interpretability compared to baseline methods.

Keywords:
visual dialog, visual QA, graph neural networks, structural learning, multi-modal deep learning

1 Introduction

A human dialogue by its nature exhibits highly complex structures. Specifically, when we have a dialogue, some utterances are semantically dependent on previous ones (i.e., context), while others are independent, due to an abrupt change in topic. Previous topics could be readdressed later on in the dialogue. Furthermore, we take advantage of multimodal inputs, including visual, linguistic, and auditory information, to capture the temporal topics of conversation. Notably, Visual Dialog (VisDial) [8], which is an extended version of visual question answering (VQA) [2, 12], reflects the complex and multimodal nature of the dialogue. Unlike VQA, it is designed to answer a sequence of questions given an image, utilizing a dialog history as context. For example, to answer an ambiguous question like “Do you think they are her parents?” (D4 in Fig. 1), a dialog agent should attend to the meaningful context from a dialog history as well as visual information. This task demands a rich set of abilities – understanding a sequence of multimodal contents (i.e., an input image, questions, and dialog history), and reasoning semantic structures among them.

Figure 1: An example from the VisDial dataset. (a): a given image. (b): dialogue regarding the image, including image caption (C), and each round of dialog (D1-D6). (c) and (d): the semantic structures from our proposed model and the soft attention-based model, respectively. The left and right column in each figure denote the dialog history and the current question, respectively. We argue that the dense structure could be a distraction for the questions that demand partial (Q5) or no (Q6) contextual information. The thicker and darker links indicate the higher semantic dependencies.

Previous approaches in visual dialog have explored the problem of reasoning about semantic structures in dialogs by employing the soft-attention mechanism [4, 43]. Typically, most of the previous research has focused on extracting rich question-relevant representations from the given image and dialog history, while implicitly finding their relationships [8, 28, 42, 13, 9, 38]. Another line of research has tackled the problem of visual coreference resolution [39, 32, 19], and yet another approach [47] attempts to find the inherent structures of the dialog. However, all previous work relies on the soft-attention mechanism, and we argue that applying it to the previous utterances severely limits a dialog agent's ability to learn various types of semantic relationships. Specifically, soft attention, which is based on a softmax function, always assigns a non-zero weight to all previous utterances, which results in dense (i.e., fully-connected) relationships. Herein lies the problem: even for questions that are partially dependent on (Q5 in Fig. 1) or independent of (Q6 in Fig. 1) the dialog history, all previous utterances are still taken into consideration and integrated into the contextual representations. As a consequence, the dialog agent overly relies on all previous utterances, even when these utterances are irrelevant to the given question. This may hurt both performance and interpretability.

In this paper, we propose Sparse Graph Learning Networks (SGLNs) that explicitly discover the sparse structures of visually-grounded dialogs. We represent a dialog as a graph where each node corresponds to a round of dialog and edges represent the semantic dependencies between the nodes, as shown in Fig. 1. The proposed SGLNs infer the graph structure and predict the answer simultaneously. SGLNs involve two novel modules: a multimodal node embedding module and a sparse graph learning module. Inspired by the bottom-up and top-down attention mechanism [1], the node embedding module jointly embeds the given image and each round of dialog, yielding multimodal joint embeddings. We represent each embedding vector as a node of the graph. The sparse graph learning module infers two edge weights, binary (i.e., 0 or 1) and score edges, and then discovers the sparse and weighted structure by incorporating them. Note that the sparse graph learning module allows an isolated node when all elements of the binary edge weights are zero. It updates each node by integrating the neighborhood nodes via a message passing framework and feeds the updated node features to the answer decoder. Furthermore, we introduce a new structural loss function to encourage our model to infer explicit and reliable dialog structures by leveraging supervision that is readily obtainable. Consequently, as shown in Fig. 1(c), our model learns various types of semantic relationships: (1) dense relationships as in D1-D4, (2) sparse relationships as in D5, and (3) no relationships as in D6. The main contributions of our paper are as follows:

  1. We propose Sparse Graph Learning Networks (SGLNs) that consider the sparse nature of a visually-grounded dialog. By using a multimodal node embedding module and a sparse graph learning module, our proposed model circumvents the conceptual shortcoming of dense structures by pruning unnecessary relationships.

  2. We propose a new structural loss function to encourage SGLNs to learn the aforementioned semantic relationships explicitly. SGLNs are the first approach that predicts the sparse structures of the visually-grounded dialog with the structural loss function.

  3. SGLNs achieve new state-of-the-art results on the VisDial v1.0 dataset using only 10.95% of the dialog history. Also, we compare SGLNs with baseline models to demonstrate the effectiveness of the proposed method. Finally, we perform a qualitative analysis of our proposed model, showing that SGLNs reasonably infer the underlying sparse structures and improve interpretability compared to a baseline model.

2 Related Work

Visual Dialog. The visual dialog task [8] was recently introduced as a temporal extension of VQA [2, 12]. In this task, a dialog agent should answer a sequence of questions by using an image and the dialog history as clues. We carefully categorize the previous studies on visual dialog into three groups: (1) soft attention-based methods that compute attended representations of the image and the history [8, 28, 42, 13, 9, 38, 30], (2) visual coreference resolution methods [39, 25, 32, 19] that clarify ambiguous expressions (e.g., it, them) in the question and link them to a specific entity in the image, and (3) a structural learning method [47] that attempts to infer dialog structures. Our approach belongs to the third group. Zheng et al. [47] designed a structure inference model that predicts the answer in the context of an expectation-maximization (EM) algorithm. Specifically, they proposed a model based on graph neural networks (GNNs) that approximates the EM procedure. However, similar to the soft attention-based methods, they inferred dense semantic structures using a softmax function in the GNNs. Moreover, they implicitly recovered the structures using only the supervision for the given questions. To address these two aspects, we propose SGLNs that explicitly infer sparse structures with a definite objective (i.e., a structural loss function).

A few prior works [32, 20] have noticed the sparse nature of visual dialog, but their reasoning capability is still quite limited. The CDF [20] randomly extracts up to three elements of the dialog history to avoid excessive exploitation of the whole history. For visual coreference resolution, RvA [32] backtracks the history and selectively retrieves the visual attention maps of the previous dialogs that are determined to be useful.

Graph Neural Networks (GNNs) [11, 37] have sparked tremendous interest at the intersection of deep neural networks and structural learning approaches. There are two existing lines of work involving GNNs: (1) methods that operate on graph-structured data [24, 6, 14, 31, 44], and (2) methods that construct a graph with neural networks to approximate the learning or inference process of graphical models [41, 5, 10, 23]. More recently, graph learning networks (GLNs), an extension of the second line, were proposed by [35, 33] with the goal of reasoning about the underlying structures of input data. Note that GLNs consider unstructured data and dynamic domains (e.g., time-varying domains). Accordingly, CB-GLNs [33] attempted to discover the compositional structure of long video data by using a normalized graph-cut algorithm [40]. Our method belongs to the GLN family. However, SGLNs differ significantly from previous studies in that SGLNs learn to build sparse structures adaptively, without relying on a predefined algorithm, and the dataset we use is highly multimodal.

3 Sparse Graph Learning Networks

In this section, we formulate the visual dialog task using graph structures, then describe our proposed model, Sparse Graph Learning Networks (SGLNs). The visual dialog task [8] is defined as follows: given an image $I$, a caption $C$ describing the image, a dialog history $H_t = \{C, (q_1, a_1), \dots, (q_{t-1}, a_{t-1})\}$ up to round $t-1$, and a question $q_t$ at round $t$, the goal is to find an appropriate answer to the question among the 100 answer candidates $A_t = \{a_t^{(1)}, \dots, a_t^{(100)}\}$. Following the previous work [8], we use the ground-truth answers for the dialog history.

Figure 2: Overview of Sparse Graph Learning Networks (SGLNs). The SGLNs consist of three components: a multimodal node embedding module, a sparse graph learning module, and an answer decoder.

In our approach, we consider the task as a graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$, where each node corresponds to the multimodal feature for an element of the dialog history or the current question $q_t$. The semantic dependencies among the nodes are represented as weighted edges $\mathcal{E}_t$.

Fig. 2 provides an overview of our proposed model, Sparse Graph Learning Networks (SGLNs). Specifically, the SGLNs consist of three components: a multimodal node embedding module, a sparse graph learning module, and an answer decoder. The multimodal node embedding module aims to learn rich visual-linguistic representations for each round of dialog by employing a simple attention mechanism. We represent the multimodal joint feature vector for each round of dialog as a node of the graph. The sparse graph learning module estimates the binary and score edges among the nodes and combines these two edge weights into sparse weighted edges. Then, the sparse graph learning module aggregates the neighborhood node features for the current question via the message passing algorithm [10]. The aggregated hidden feature is fed into the answer decoder, which yields the most likely answer. Furthermore, the binary edges (i.e., 0 or 1) that represent the semantic relevance among the nodes are fed into the structural loss function so that the model predicts reliable dialog structures at test time. Drawing comparisons to human cognition, the multimodal node embedding module acts similarly to human episodic memory [3], where each node corresponds to a unit of episodic memory that contains visual and linguistic information for each round of dialog. Also, the sparse graph learning module mimics the behavior of a human who adaptively recalls relevant multimodal information from their episodic memory.

In the following sub-sections, we will introduce input features for SGLNs, then describe the detailed architectures of the multimodal node embedding module, the sparse graph learning module, and the answer decoder. Finally, we present the objective function for SGLNs.

3.1 Input Features

Visual Features. From the given image $I$, we extract $d_v$-dimensional visual features for $K$ detected objects by employing the pre-trained Faster R-CNN model [36, 1], which are denoted as $V = \{v_1, \dots, v_K\}$ with $v_k \in \mathbb{R}^{d_v}$.

Language Features. We first encode the question $q_t$, which is a word sequence $\{w_1, \dots, w_T\}$ of length $T$, by using a bidirectional LSTM [16] as follows:

$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(w_i, \overrightarrow{h}_{i-1})$ (1)
$\overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(w_i, \overleftarrow{h}_{i+1})$ (2)
$u_q = W_q\, [\,\overrightarrow{h}_T ; \overleftarrow{h}_1\,]$ (3)

where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ denote the forward and backward hidden states of the $i$-th word, respectively. Note that we use the concatenation of the last hidden states from each LSTM, followed by a projection matrix $W_q$, which results in the question feature $u_q$. Likewise, each round of the dialog history is encoded into a history feature, and all the answer candidates at the $t$-th round are also embedded with additional LSTMs.
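For concreteness, a minimal PyTorch sketch of the question encoder described above is given below. The module name QuestionEncoder, the layer sizes, and the use of the final hidden states of a single-layer BiLSTM are assumptions for illustration rather than the authors' exact implementation.

import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, out_dim=512):
        super().__init__()
        # word embeddings (initialized with GloVe in the paper)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, out_dim)  # projection matrix W_q in Eq. 3

    def forward(self, tokens):                  # tokens: (batch, T) word indices
        w = self.embed(tokens)                  # (batch, T, emb_dim)
        _, (h_n, _) = self.lstm(w)              # h_n: (2, batch, hidden_dim)
        # concatenate the last forward and backward hidden states (Eqs. 1-3)
        u = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.proj(u)                     # (batch, out_dim) language feature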

3.2 Multimodal Node Embedding Module

As shown in Fig. 2, the multimodal node embedding module embeds the visual-linguistic joint representation associated with each node by performing visual grounding of each language feature. To implement this process, we take inspiration from the bottom-up and top-down attention mechanism [1, 21]. For the object-level visual features $V$ and the corresponding language feature $u$ (i.e., the encoded question or a round of the dialog history), the node embedding module first finds the spatial objects that the language feature describes with the soft attention mechanism. Formally,

$z = w_a^\top \big( f_v(V) \odot (f_u(u)\,\mathbf{1}^\top) \big)$ (4)
$\alpha = \mathrm{softmax}(z)$ (5)
$\hat{v} = \sum_{k=1}^{K} \alpha_k v_k$ (6)

where $f_v$ and $f_u$ are non-linear functions that transform the inputs to a $d$-dimensional space, such as multi-layer perceptrons (MLPs), $\odot$ denotes the Hadamard product (i.e., element-wise multiplication), and $\mathbf{1} \in \mathbb{R}^{K}$ is a vector whose elements are all one. The attention function is parametrized by the vector $w_a$. Then, the multimodal feature is obtained from the attended visual feature $\hat{v}$ and the language feature $u$ as follows:

$x = g_v(\hat{v}) \odot g_u(u)$ (7)

where $g_v$ and $g_u$ are projection functions. As a consequence, we obtain visual-linguistic joint representations for all nodes, which can be stacked into the matrix form $X$.
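A minimal sketch of the node embedding step (Eqs. 4-7) is shown below, under the assumption that f_v, f_u, g_v, and g_u are simple learned projections and that the fusion in Eq. 7 is a Hadamard product; all names and layer choices are illustrative, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeEmbedding(nn.Module):
    def __init__(self, v_dim, u_dim, d=512):
        super().__init__()
        self.f_v = nn.Sequential(nn.Linear(v_dim, d), nn.ReLU())  # visual transform f_v
        self.f_u = nn.Sequential(nn.Linear(u_dim, d), nn.ReLU())  # language transform f_u
        self.w_a = nn.Linear(d, 1)                                # attention vector w_a
        self.g_v = nn.Linear(v_dim, d)                            # projection g_v
        self.g_u = nn.Linear(u_dim, d)                            # projection g_u

    def forward(self, V, u):                  # V: (K, v_dim) objects, u: (u_dim,) language feature
        joint = self.f_v(V) * self.f_u(u).unsqueeze(0)         # broadcast over K objects (Eq. 4)
        alpha = F.softmax(self.w_a(joint).squeeze(-1), dim=0)  # attention over objects (Eq. 5)
        v_att = (alpha.unsqueeze(-1) * V).sum(dim=0)           # attended visual feature (Eq. 6)
        return self.g_v(v_att) * self.g_u(u)                   # multimodal node feature x (Eq. 7)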

3.3 Sparse Graph Learning Module

The sparse graph learning module infers the underlying sparse and weighted graph structure between nodes, where the edge weights are estimated based on the node features. To make the graph structure sparse, we propose two types of edges on the graph: binary edges and score edges, whose corresponding adjacency matrices are $A^{\mathrm{bin}}$ and $A^{\mathrm{sc}}$, respectively. To simplify notation, we omit the round subscript in the following equations.

Binary Edges. We first define the binary edge between two nodes $i$ and $j$ as a binary random variable $e^{\mathrm{bin}}_{ij} \in \{0, 1\}$ for all node pairs. The sparse graph learning module estimates the likelihood of the binary variables given the node features, where the probability implies whether the two nodes are semantically related or not. We regard the binary variable as a two-class categorical variable and define the probability distribution as follows:

$\pi_{ij} = \mathrm{softmax}\big( W_b\, [\,x_i ; x_j\,] / \tau \big)$ (8)
$p(e^{\mathrm{bin}}_{ij} = c \mid x_i, x_j) = \pi_{ij,c}, \quad c \in \{0, 1\}$ (9)

where $W_b$ is a learnable parameter and $\tau$ is the softmax temperature. Since $e^{\mathrm{bin}}_{ij}$ is discrete and non-differentiable, we employ the Straight-Through Gumbel-Softmax estimator (i.e., ST-Gumbel) [17] to ensure end-to-end training. During the forward propagation, the ST-Gumbel makes a discrete decision by using the Gumbel-Max trick [29]:

$e^{\mathrm{bin}}_{ij} = \arg\max_{c \in \{0, 1\}} \big( \log \pi_{ij,c} + g_c \big)$ (10)

where the random variables $g_c$ are i.i.d. samples drawn from $\mathrm{Gumbel}(0, 1)$ [17]. In the backward pass, the ST-Gumbel utilizes the derivative of the continuous probabilities $\pi_{ij}$ by approximating $\nabla e^{\mathrm{bin}}_{ij} \approx \nabla \pi_{ij}$, thus enabling back-propagation and end-to-end training.
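Below is a sketch of the binary-edge estimation with the ST-Gumbel estimator, relying on PyTorch's built-in F.gumbel_softmax; the concatenation-based pairwise scorer and its name BinaryEdges are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryEdges(nn.Module):
    def __init__(self, d=512, tau=1.0):
        super().__init__()
        self.scorer = nn.Linear(2 * d, 2)   # two-class logits per (history, question) pair
        self.tau = tau                      # softmax temperature

    def forward(self, x_q, X_hist):         # x_q: (d,) question node, X_hist: (N, d) history nodes
        pair = torch.cat([X_hist, x_q.unsqueeze(0).expand_as(X_hist)], dim=-1)
        logits = self.scorer(pair)          # (N, 2)
        # hard=True: discrete one-hot samples in the forward pass,
        # soft gradients in the backward pass (Straight-Through Gumbel-Softmax)
        onehot = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=-1)
        return onehot[:, 1]                 # (N,) binary edge indicators in {0, 1}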

Score Edges. We also define the score edges that measure the extent to which two nodes are relevant, and the weighted adjacency matrix $A^{\mathrm{sc}}$ is computed as:

$A^{\mathrm{sc}}_{ij} = \big( x_i^\top W_s\, x_j \big)^2$ (11)

with a learnable parameter $W_s$. Following the relational graph learning algorithm [46], we compute the score edges using the squared operation for stabilized training.

Sparse Weighted Edges. The sparse graph learning module multiplies the binary edges and score edges, finally yielding the sparse and weighted adjacency matrix as:

$A = A^{\mathrm{bin}} \odot A^{\mathrm{sc}}$ (12)

With the above edge weight estimations, the sparse graph learning module is able to model three types of relationships on $A$: (1) dense relationships, similar to the previous softmax-based approaches, when all entries in $A^{\mathrm{bin}}$ are one, (2) sparse relationships when only some entries are one, and (3) no relationships when all entries are zero (i.e., the question node is isolated).
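A sketch of the score edges (Eq. 11) and their combination with the binary edges (Eq. 12) follows; the bilinear form of the relevance score is an assumed instantiation of the squared score described above, and all names are illustrative.

import torch
import torch.nn as nn

class SparseWeightedEdges(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.W_s = nn.Linear(d, d, bias=False)      # learnable parameter W_s

    def forward(self, x_q, X_hist, e_bin):          # e_bin: (N,) binary edges from BinaryEdges
        score = (X_hist @ self.W_s(x_q)) ** 2       # squared relevance scores (Eq. 11)
        return e_bin * score                        # element-wise product of the two edge types (Eq. 12)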

Message-passing and Update. Based on the sparse weighted adjacency matrix $A$, the sparse graph learning module updates the hidden states of all nodes through a message-passing framework [10]. Similar to graph convolutional networks [24], we implement the message-passing layer as a linear projection of the node features, followed by a weighted sum normalized by the adjacency weights:

$M = D^{-1} A\, X W_m$ (13)

Note that $D$ is the degree matrix of $A$ and $W_m$ is a learnable projection matrix. The hidden features of the nodes are calculated via the update layer, which adds the input features and the aggregated messages and then feeds them into a non-linear function $\sigma$:

$H = \sigma( X + M )$ (14)

Notice that the sparse graph structure inference followed by the hidden state update can be viewed as dialog reasoning. Moreover, the model is able to perform multi-step reasoning by repeatedly conducting the inference and update based on the hidden states. In this paper, for the sake of simplicity, we assume that only the edges connected to the question node exist. For the question node $t$, the message vector $m_t$ and hidden state vector $h_t$ are simply computed as:

$m_t = \dfrac{1}{\sum_{j} A_{tj}} \sum_{j \ne t} A_{tj}\, W_m x_j, \qquad h_t = \sigma( x_t + m_t )$ (15)

The sparse graph learning module outputs the hidden state vector $h_t$ of the question node to predict the answer.
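A sketch of the message-passing and update step restricted to the question node (Eq. 15) is given below, assuming a ReLU non-linearity for the update function; the clamped degree keeps the message at zero when the question node is isolated.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionNodeUpdate(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.W_m = nn.Linear(d, d)                      # message projection W_m

    def forward(self, x_q, X_hist, edges):              # edges: (N,) sparse weighted edges A_tj
        deg = edges.sum().clamp(min=1e-8)                # degree of the question node
        msg = (edges.unsqueeze(-1) * self.W_m(X_hist)).sum(dim=0) / deg
        return F.relu(x_q + msg)                         # updated hidden state h_t (Eq. 15)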

3.4 Answer Decoder

Discriminative Decoder.

The discriminative decoder computes the likelihood of the answer candidates via dot-product operations between the hidden vector $h_t$ and the feature vectors of the answer candidates. Then, the SGLNs are optimized by minimizing the negative log-likelihood of the ground-truth answer:

$p = \mathrm{softmax}\big( [\, h_t^\top a_t^{(1)}, \dots, h_t^\top a_t^{(100)} \,] \big)$ (16)
$\mathcal{L}_D = - \textstyle\sum_{n=1}^{100} y_n \log p_n$ (17)

where $y$ is the one-hot encoded label vector. For evaluation, the answer candidates are ranked according to the likelihood $p$.
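For illustration, a sketch of the discriminative decoder objective (Eqs. 16-17) using the equivalent cross-entropy formulation; tensor shapes and names are assumptions.

import torch
import torch.nn.functional as F

def discriminative_loss(h_q, cand_feats, gt_index):
    # h_q: (d,) question hidden state, cand_feats: (100, d) candidate embeddings,
    # gt_index: LongTensor scalar holding the index of the ground-truth answer
    logits = cand_feats @ h_q                              # (100,) dot-product scores
    loss = F.cross_entropy(logits.unsqueeze(0), gt_index.view(1))
    ranks = logits.argsort(descending=True)                # candidate ranking for evaluation
    return loss, ranks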

Generative Decoder.

Similar to a sequence-to-sequence model, the generative decoder aims to generate the ground-truth answer's word sequence auto-regressively via an LSTM:

$\mathcal{L}_G = - \textstyle\sum_{s=1}^{S} \log p\big( y_s \mid y_{<s}, h_t \big)$ (18)

where $h_t$ is the output of the sparse graph learning module, and the ground-truth answer consists of the words $\{y_1, \dots, y_S\}$. We initialize the hidden state of the LSTM with $h_t$. Following the Visual Dialog task [8], we utilize the log-likelihood scores to determine the rank of candidate answers in the evaluation process.
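A sketch of the generative decoder loss (Eq. 18), assuming teacher forcing and an LSTM whose hidden size matches the dimension of h_t; the module name and layer sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h_q, answer_tokens):       # h_q: (hidden_dim,), answer_tokens: (1, S)
        h0 = h_q.view(1, 1, -1)                  # initialize the LSTM hidden state with h_t
        c0 = torch.zeros_like(h0)
        emb = self.embed(answer_tokens[:, :-1])  # inputs shifted by one step (teacher forcing)
        out, _ = self.lstm(emb, (h0, c0))
        logits = self.out(out)                   # (1, S-1, vocab_size)
        # negative log-likelihood of the ground-truth words (Eq. 18)
        return F.cross_entropy(logits.squeeze(0), answer_tokens[0, 1:])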

3.5 Objective Function

Structural Loss Function.

Along with the two decoder loss functions $\mathcal{L}_D$ and $\mathcal{L}_G$, we introduce a structural loss function to encourage the SGLNs to infer explicit, reliable dialog structures. Inspired by the visual coreference resolution model [25], our method utilizes structural supervision in addition to the ground-truth answer at each round. Specifically, we automatically obtain the semantic dependencies among the rounds of dialog in the form of a lower triangular binary matrix $A^{\mathrm{gt}}$ from an off-the-shelf neural coreference resolution tool 1 and use this information as the structural supervision. Consequently, the SGLNs minimize the distance between the structural supervision and the binary adjacency matrix $A^{\mathrm{bin}}$ predicted by our model:

$\mathcal{L}_S = \mathrm{MSE}\big( A^{\mathrm{bin}}, A^{\mathrm{gt}} \big)$ (19)

where $\mathrm{MSE}(\cdot, \cdot)$ denotes the element-wise mean squared error. Here, $\mathcal{L}_S$ encourages the SGLNs to predict a reliable adjacency matrix (i.e., dialog structure). Note that the SGLNs use the structural supervision only during training and infer the dialog structures on their own at test time. We clarify that the efficacy of coreference resolution for visual dialog was explored in previous work [25]; however, its gain is limited since their approach differs from ours.
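For a single question, the structural loss reduces to a simple element-wise MSE, sketched below; the assumption is that the supervision is provided as the row of the coreference-based binary matrix corresponding to that question.

import torch
import torch.nn.functional as F

def structural_loss(e_bin_pred, supervision_row):
    # e_bin_pred: (N,) binary edges from ST-Gumbel; supervision_row: (N,) {0, 1} targets
    return F.mse_loss(e_bin_pred, supervision_row.float())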

Multi-task Learning.

To predict the dialog structure and answer the given questions, the SGLNs are trained to minimize the sum of the structural loss and the answer decoder loss: $\mathcal{L} = \lambda_D \mathcal{L}_D + \lambda_S \mathcal{L}_S$ or $\mathcal{L} = \lambda_G \mathcal{L}_G + \lambda_S \mathcal{L}_S$, where $\lambda_D$, $\lambda_G$, and $\lambda_S$ are weights for each loss. Optionally, the SGLNs take a dual-decoder strategy by minimizing the three losses simultaneously: $\mathcal{L}_{\mathrm{dual}} = \lambda_D \mathcal{L}_D + \lambda_G \mathcal{L}_G + \lambda_S \mathcal{L}_S$. Unless stated otherwise, the default loss is the discriminative one, $\mathcal{L} = \lambda_D \mathcal{L}_D + \lambda_S \mathcal{L}_S$. The implementation details and results will be discussed in Section 4.
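A sketch of the combined objective, with placeholder weight names standing in for the (unspecified) loss-weight hyperparameters:

def total_loss(loss_disc, loss_struct, loss_gen=None,
               w_disc=1.0, w_struct=1.0, w_gen=1.0):
    # default: discriminative decoder loss + structural loss
    loss = w_disc * loss_disc + w_struct * loss_struct
    if loss_gen is not None:                 # optional dual-decoder setting
        loss = loss + w_gen * loss_gen
    return loss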

4 Experiments

In this section, we describe the details of our experiments on the Visual Dialog dataset. We first introduce the Visual Dialog dataset, evaluation metrics, and implementation details. Then, we compare the SGLNs with baseline models and state-of-the-art methods. Note that the qualitative analysis of our proposed model is described in Sec. 5.

4.1 Experimental Setup

Dataset. We benchmark our proposed model on the Visual Dialog (i.e., VisDial) v1.0 dataset. The VisDial dataset [8] was collected in a two-player chatting environment, where a questioner tries to figure out an unseen image by asking free-form questions, and an answerer responds to the questions based on the image. As a result, the VisDial v1.0 dataset contains 1.2M, 20k, and 44k question-answer pairs in the train, validation, and test splits, respectively. The 123,287 images from COCO [27] and 2,064 and 8k images from Flickr are used to collect the dialog data for each split, respectively. A list of 100 answer candidates accompanies each question-answer pair.

Evaluation. We follow the standard protocol for evaluating visual dialog models, as proposed in the earlier work [8]. Specifically, the visual dialog model ranks a list of 100 candidate answers and returns the ranked list for further evaluation. There are four kinds of evaluation metrics in the Visual Dialog task: (1) mean reciprocal rank (MRR) of the ground-truth answer in the ranked list, (2) recall@k (R@k), which measures whether the ground-truth answer appears in the top-k of the ranked list, (3) mean rank (Mean) of the ground-truth answer, and (4) normalized discounted cumulative gain (NDCG). Contrary to the classical retrieval metrics (MRR, R@k, and mean rank), which are based on a single ground-truth answer, NDCG takes into account all relevant answers from the 100-answer list by using the densely annotated relevance scores. It penalizes lower-ranked answers with high relevance scores, and swapping candidates with the same relevance does not affect NDCG. Due to these properties, NDCG is regarded as the primary metric used to evaluate methods on the VisDial v1.0 dataset.
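For reference, a sketch of the NDCG computation as described above, assuming dense relevance scores for all 100 candidates; it follows the standard DCG/IDCG definition with k set to the number of relevant answers and may differ in minor details from the official VisDial evaluation code.

import numpy as np

def ndcg(ranked_relevances):
    # relevance scores of the 100 candidates, ordered by the model's predicted ranking
    rel = np.asarray(ranked_relevances, dtype=float)
    k = int((rel > 0).sum())                           # evaluate at k = number of relevant answers
    if k == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, k + 2))     # 1 / log2(rank + 1) for ranks 1..k
    dcg = float((rel[:k] * discounts).sum())
    idcg = float((np.sort(rel)[::-1][:k] * discounts).sum())
    return dcg / idcg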

Implementation Details. The SGLNs embed all the language inputs into 300-dimensional vectors initialized with GloVe [34]. All three BiLSTMs used for encoding the word embedding vectors are single-layer with 512 hidden units. We also use the bottom-up attention features [1] from Faster R-CNN [36] pre-trained on the Visual Genome [26]. The number of object features per image is , and the dimension of each feature is . The dimension of is 512. The hyperparameters for the multi-task learning are , , and . We employ the Adam optimizer [22] with initial learning rate . The learning rate is warmed up to until epoch 4 and is halved every two epochs from epoch 5 to 10. We train on the VisDial v1.0 training split and evaluate our proposed model on the validation and test splits.

Method NDCG MRR R@1 R@5 R@10 Mean Sparsity
Dense 59.53 63.21 49.56 80.02 89.25 4.42 0%
Sparse-hard 61.89 61.28 47.41 78.13 88.27 4.70 81.82%
SGLNs 62.83 60.54 46.64 77.59 87.33 4.89 89.15%
Table 1: Comparison with the baseline models on the VisDial v1.0 validation split. All models in this table are based on the discriminative decoder.

4.2 Quantitative Results

Comparison with Baselines.

We compare SGLNs to the baseline models to demonstrate the effectiveness of our method. We define two baselines: Dense and Sparse-hard. The Dense model utilizes a softmax attention mechanism, which results in a fully-connected graph. In contrast, the Sparse-hard model picks exactly one element of the dialog history by applying the Gumbel-Softmax to the whole dialog history. Note that the structural supervision is also provided to the Sparse-hard model. The results are summarized in Table 1. The SGLNs achieve better performance than the baseline models on the NDCG metric, while maintaining competitive performance on the ground-truth dependent metrics (i.e., MRR, R@k, and Mean rank). We also observe that the Dense model, which overly exploits the dialog history, shows the best performance on the ground-truth dependent metrics. We argue that the Dense model mainly focuses on finding the single ground-truth answer with a rich set of dialog history, at the cost of sacrificing the ability to provide 'flexible' answers (i.e., NDCG). Similarly, the NDCG performance tends to increase as the sparsity increases.

Figure 3: NDCG scores (%) for each question type. We divide all the questions in the VisDial v1.0 validation split into three groups: independent, partially dependent, and densely dependent questions.

Question-type Analysis.

Using the same setup as the above experiment, we conduct a question-type analysis of the NDCG scores to verify our hypothesis discussed in Sec. 1. Based on the semantic dependency information introduced in Sec. 3, we categorize all the questions in the VisDial v1.0 validation split into three groups: (1) independent questions that can be answered without the dialog history, (2) partially dependent questions that demand a few elements of the dialog history, and (3) densely dependent questions that require all previous dialogs. As illustrated in Fig. 3, we compare our proposed model with the softmax-based Dense model and find that the SGLNs significantly outperform the Dense model on all types of questions. The performance gap between the two models is 3.74%, 2.61%, and 0.83% for each type of question, respectively. We observe that the Dense model suffers comparatively at finding relevant answers for independent questions. This validates that excessive exploitation of the dialog history can be a distraction for such questions.

Method NDCG MRR R@1 R@5 R@10 Mean Sparsity
LF [8] 45.31 55.42 40.95 72.45 82.83 5.95 -
  HRE [8] 45.46 54.16 39.93 70.45 81.50 6.41 -
  MN [8] 47.50 55.49 40.98 72.30 83.30 5.92 -
  GNN [47] 52.82 61.37 47.33 77.98 87.83 4.57 -
  CorefNMN [25] 54.70 61.50 47.55 78.10 88.80 4.40 -
  RvA [32] 55.59 63.03 49.03 80.40 89.83 4.18 -
  DualVD [18] 56.32 63.23 49.25 80.23 89.70 4.11 -
  FGA [38] 56.93 66.22 52.75 82.92 91.08 3.81 -
  HACAN [45] 57.17 64.22 50.88 80.63 89.45 4.20 -
  DL-61 [13] 57.32 62.20 47.90 80.43 89.95 4.17 -
  DAN [19] 57.59 63.20 49.63 79.75 89.35 4.30 -
  NMN [25] 58.10 58.80 44.15 76.88 86.88 4.81 -
  Transformer [30] 60.92 60.65 47.00 77.03 87.75 4.90 -
  SGLNs 60.77 58.40 44.15 75.65 85.70 5.22 89.14%
  SGLNs† 61.27 59.97 45.68 77.12 87.10 4.85 89.05%
Table 2: Test-std performance of the discriminative model on the VisDial v1.0 dataset. Higher performance is better for NDCG, MRR, and R@k, while lower performance is better for Mean. † denotes the use of dual decoders.
Method NDCG MRR R@1 R@5 R@10 Mean Sparsity
MN [8] 56.99 47.83 38.01 57.49 64.08 18.76 -
HCIAE [28] 59.70 49.07 39.72 58.23 64.73 18.32 -
CoAtt [42] 59.24 49.64 40.09 59.37 65.92 17.86 -
ReDAN [9] 60.47 50.02 40.27 59.93 66.78 17.40 -
SGLNs 60.82 48.82 39.64 57.58 64.37 18.03 87.03%
Table 3: VisDial v1.0 validation performance of the models that utilize the generative decoder. All baseline models are re-implemented by [9] for a fair comparison.

Comparison with the State-of-the-art.

We compare our proposed model with the state-of-the-art methods on the VisDial v1.0 dataset. As shown in Table 2, SGLNs with the discriminative decoder outperform all other methods with respect to the NDCG metric, including the concurrent work, Transformer [30]. The authors of [30] demonstrated the effectiveness of training the discriminative and generative decoders simultaneously. Accordingly, we also apply the dual-decoder strategy described in Sec. 3.5 for a fair comparison, lifting our model's NDCG to 61.27%. The results of the dual-decoder model are obtained from the output of the discriminative decoder. Note that the sparsity of the SGLNs is 89.05%, which means that our proposed model only utilizes 10.95% of the dialog history. The sparsity is calculated as the percentage of zero-valued edges in the graph. We consider these results encouraging as they indicate that the SGLNs adaptively attend to the dialog history while achieving new state-of-the-art performance on the primary metric. Furthermore, we report the performance of the generative decoder-based models on the VisDial v1.0 validation split. As shown in Table 3, the SGLNs achieve new state-of-the-art performance on NDCG with a sparsity of 87.03%. Note that all baseline entries in Table 3 are re-implemented by [9], utilizing the object-level visual features from Faster R-CNN [36] and GloVe [34] vectors for a fair comparison.
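For clarity, a sketch of how the reported sparsity can be computed, under the assumed reading that it is the percentage of zero-valued binary edges over all question-history pairs:

import torch

def sparsity(binary_edges):
    # binary_edges: tensor of {0, 1} edges collected over all questions and history elements
    total = binary_edges.numel()
    zeros = (binary_edges == 0).sum().item()
    return 100.0 * zeros / total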

Figure 4: A visualization of the inferred sparse structures. From left to right: the given image and caption, the dialog history, and the semantic structures from our model and the baseline. A darker fill indicates a higher score.

5 Discussions

Visualization of the Inferred Graph Structures. For qualitative analysis, in Fig. 4, we visualize the images and the corresponding dialogs in the validation split, together with the inferred adjacency matrices and those from the Dense model as a counterpart. Compared to the dense structure of the baseline, the proposed SGLNs indeed learn the innate sparse structures, and the question nodes receive information from the other nodes in a selective fashion. For instance, in the first dialog example, the questions from Q3 to Q10 have non-zero binary edges to all previous contexts except D1 and D2, which do not contain relevant information about 'the woman'. On the contrary, Q1 and Q2 are not connected to any other node, not even the caption node, because they can be answered without additional context.

Knowledge Transfer of Semantic Structure. As discussed in Section 3.5, the structural loss function can be seen as a knowledge distillation loss [15] that transfers knowledge from the pre-trained neural coreference resolution model to our sparse graph learning module. Even though we employ ST-Gumbel to mitigate the unpredictability of training the binary edges, this structural loss was decisively helpful in boosting the early stage of training.

6 Conclusions

In this paper, we formulate the visual dialog task as a graph structure learning task where the edges represent the semantic dependencies among the multimodal embedding nodes learned from the given image, caption, question, and dialog history. The proposed Sparse Graph Learning Networks (SGLNs) learn sparse dialog structures by incorporating binary and score edges, leveraging structural supervision. Our experiments demonstrate the efficacy of SGLNs by achieving state-of-the-art NDCG performance of 61.27 on the test-std split of the VisDial v1.0 dataset, using only 10.95% of the dialog history. Qualitatively, our analysis of the inferred graph structures shows adaptive reasoning depending on the type of question.

Acknowledgements. The authors would like to thank SK T-Brain for sharing GPU resources. This work was partly supported by the Korea government (2015-0-00310-SW.StarLab, 2017-0-01772-VTT, 2019-0-01367-BabyMind).

Supplementary

Structure Inference. At the inference stage, the SGLNs greedily infer the binary edges with the largest probability without drawing the sample in Eq. 10. This strategy is similar to the RvA [32] model that also makes discrete decisions for the visual coreference resolution in the visual dialog.
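A sketch of this greedy inference, assuming the two-class logits produced during training (with class 1 meaning the edge is present):

import torch

def infer_binary_edges(logits):
    # logits: (N, 2) two-class edge logits for the current question
    return logits.argmax(dim=-1).float()    # deterministic {0, 1} edges, no Gumbel sampling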

Structural Supervision. We readily obtain the semantic dependency information from the neural coreference resolution tool based on [7] and use it as the structural supervision. As shown in Fig. 5(c), the structural supervision represents the sentence-level semantic dependencies between the given question (i.e., Q1-Q6 in rows) and each element of the dialog history (i.e., C and D1-D6 in columns) in the form of a binary matrix. Specifically, the one-valued entries in the structural supervision indicate that both sentences include noun phrases or a pronoun referring to the same entity. On the other hand, the zero-valued entries denote that the two sentences do not share any entity. The upper triangular part of the structural supervision matrix (i.e., the gray area) is zero because of the temporal nature of the dialog: a question cannot depend on future rounds. The sparsity of the structural supervision is 85.50%, calculated as the percentage of zero-valued entries in the blue area.

Structural Loss Function. We define the structural loss function as an element-wise mean squared error between the structural supervision and the binary edges inferred by the SGLNs. By minimizing the loss, the SGLNs learn to infer the binary edges based on the structural supervision. Although the structural supervision automatically obtained from the off-the-shelf coreference resolution tool may not cover the exact semantic dependencies in the visual dialog, we demonstrate the effectiveness of the proposed method quantitatively and qualitatively.

Figure 5: (a): a given image. (b): dialogue for a given image, including image caption (C), and each round of dialog (D1-D6). (c): the structural supervision obtained from the neural coreference resolution tool and the binary edges inferred from the SGLNs.

Footnotes

  1. https://github.com/huggingface/neuralcoref based on the work [7].

References

  1. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
  2. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick and D. Parikh (2015) VQA: visual question answering. In ICCV.
  3. A. Baddeley (2000) The episodic buffer: a new component of working memory?. In Trends in Cognitive Sciences.
  4. D. Bahdanau, K. Cho and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. In ICLR.
  5. P. Battaglia, R. Pascanu, M. Lai and D. J. Rezende (2016) Interaction networks for learning about objects, relations and physics. In NIPS.
  6. P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro and R. Faulkner (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
  7. K. Clark and C. D. Manning (2016) Deep reinforcement learning for mention-ranking coreference models. In ACL.
  8. A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh and D. Batra (2017) Visual dialog. In CVPR.
  9. Z. Gan, Y. Cheng, A. E. Kholy, L. Li, J. Liu and J. Gao (2019) Multi-step reasoning via recurrent dual attention for visual dialog. In ACL.
  10. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML.
  11. M. Gori, G. Monfardini and F. Scarselli (2005) A new model for learning in graph domains. In IJCNN.
  12. Y. Goyal, T. Khot, D. Summers-Stay, D. Batra and D. Parikh (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In CVPR.
  13. D. Guo, C. Xu and D. Tao (2019) Image-question-answer synergistic network for visual dialog. In CVPR.
  14. W. Hamilton, Z. Ying and J. Leskovec (2017) Inductive representation learning on large graphs. In NIPS.
  15. G. Hinton, O. Vinyals and J. Dean (2014) Distilling the knowledge in a neural network. In NIPS 2014 Deep Learning Workshop.
  16. S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. In Neural Computation.
  17. E. Jang, S. Gu and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In ICLR.
  18. X. Jiang, J. Yu, Z. Qin, Y. Zhuang, X. Zhang, Y. Hu and Q. Wu (2019) DualVD: an adaptive dual encoding model for deep visual understanding in visual dialogue. In AAAI.
  19. G. Kang, J. Lim and B. Zhang (2019) Dual attention networks for visual reference resolution in visual dialog. In EMNLP.
  20. H. Kim, H. Tan and M. Bansal (2020) Modality-balanced models for visual dialogue. In AAAI.
  21. J. Kim, K. W. On, W. Lim, J. Kim, J. Ha and B. Zhang (2017) Hadamard product for low-rank bilinear pooling. In ICLR.
  22. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In ICLR.
  23. T. Kipf, E. Fetaya, K. Wang, M. Welling and R. Zemel (2018) Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687.
  24. T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
  25. S. Kottur, J. M. Moura, D. Parikh, D. Batra and M. Rohrbach (2018) Visual coreference resolution in visual dialog using neural module networks. In ECCV.
  26. R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li and D. A. Shamma (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. In ICCV.
  27. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV.
  28. J. Lu, A. Kannan, J. Yang, D. Parikh and D. Batra (2017) Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model. In NIPS.
  29. C. J. Maddison, A. Mnih and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
  30. V. Nguyen, M. Suganuma and T. Okatani (2019) Efficient attention mechanism for handling all the interactions between many inputs with application to visual dialog. arXiv preprint arXiv:1911.11390.
  31. M. Niepert, M. Ahmed and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In ICML.
  32. Y. Niu, H. Zhang, M. Zhang, J. Zhang, Z. Lu and J. Wen (2018) Recursive visual attention in visual dialog. In CVPR.
  33. K. On, E. Kim, Y. Heo and B. Zhang (2020) Cut-based graph learning networks to discover compositional structure of sequential video data.
  34. J. Pennington, R. Socher and C. Manning (2014) GloVe: global vectors for word representation. In EMNLP.
  35. D. S. Pilco and A. R. Rivera (2019) Graph learning network: a structure learning algorithm. arXiv preprint arXiv:1905.12665.
  36. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
  37. F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner and G. Monfardini (2008) The graph neural network model. In IEEE Transactions on Neural Networks.
  38. I. Schwartz, S. Yu, T. Hazan and A. G. Schwing (2019) Factor graph attention. In CVPR.
  39. P. H. Seo, A. Lehrmann, B. Han and L. Sigal (2017) Visual reference resolution using attention memory for visual dialog. In NIPS.
  40. J. Shi and J. Malik (2000) Normalized cuts and image segmentation. In IEEE Transactions on Pattern Analysis and Machine Intelligence.
  41. S. Sukhbaatar and R. Fergus (2016) Learning multiagent communication with backpropagation. In NIPS.
  42. Q. Wu, P. Wang, C. Shen, I. Reid and A. Van Den Hengel (2018) Are you talking to me? reasoned visual dialog generation through adversarial learning. In CVPR.
  43. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML.
  44. K. Xu, W. Hu, J. Leskovec and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826.
  45. T. Yang, Z. Zha and H. Zhang (2019) Making history matter: history-advantage sequence training for visual dialog. In ICCV.
  46. Z. Yang, J. Zhao, B. Dhingra, K. He, W. W. Cohen, R. R. Salakhutdinov and Y. LeCun (2018) GLoMo: unsupervised learning of transferable relational graphs. In NIPS.
  47. Z. Zheng, W. Wang, S. Qi and S. Zhu (2019) Reasoning visual dialogs with structural and partial observations. In CVPR.