A Read-Write Memory Network for Movie Story Understanding


Abstract

We propose a novel memory network model named Read-Write Memory Network (RWMN) to perform question and answering tasks for large-scale, multimodal movie story understanding. The key focus of our RWMN model is to design the read network and the write network that consist of multiple convolutional layers, which enable memory read and write operations to have high capacity and flexibility. While existing memory-augmented network models treat each memory slot as an independent block, our use of multi-layered CNNs allows the model to read and write sequential memory cells as chunks, which is more reasonable to represent a sequential story because adjacent memory blocks often have strong correlations. For evaluation, we apply our model to all the six tasks of the MovieQA benchmark [25], and achieve the best accuracies on several tasks, especially on the visual QA task. Our model shows a potential to better understand not only the content in the story, but also more abstract information, such as relationships between characters and the reasons for their actions.

1 Introduction

Figure 1: The intuition of the RWMN (Read-Write Memory Network) model for movie question and answering tasks. Using read/write networks of multi-layered CNNs, it abstracts a given series of frames stepwise to capture higher-level sequential information and stores it into memory slots. It eventually helps answer complex questions of movie QAs.

For many problems of video understanding, including video classification [1], video captioning [29], and MovieQA [25], a key to success is for models to correctly process, represent, and store long sequential information. In the era of deep learning, one prevailing approach to modeling sequential input is to use recurrent neural networks (RNNs) [17], which store the given information in a hidden memory and update it over time. However, RNNs accumulate information in a single fixed-length memory regardless of the length of an input sequence, and thus tend to fail to utilize far-distant information due to the vanishing gradient problem, which is still not fully solved even with advanced models such as LSTM [12] and GRU [3].

As another recent alternative to resolve this issue, many studies attempt to leverage an external memory structure for neural networks, often referred to as neural memory networks [8]. One key benefit of external memory is that it enables a neural model to cache sequential inputs in memory slots and explicitly utilize even far-earlier information. Such ability is particularly powerful for solving question and answering (QA) problems, which often require models to memorize a large amount of information and to correctly access the information most relevant to a given question. For this reason, memory networks have been popularly applied as state-of-the-art approaches to many QA tasks, such as the bAbI tasks [27], SQuAD [22], and LSMDC [23].

MovieQA [25] is another challenging visual QA dataset, in which models need to understand movies over two hours long and solve QA problems related to movie content and plots. The MovieQA benchmark consists of six tasks according to which sources of information are usable to solve the QA problems, including videos, subtitles, DVS, scripts, plot synopses, and open-ended information. Understanding a movie is a highly challenging task; it is necessary not only to understand the content of individual video frames, such as characters’ actions and the places of events, but also to infer more abstract and high-level knowledge, such as the reasons for characters’ behaviors and the relationships between them. For instance, in the Harry Potter movie, to answer a question (Q. What does Harry trick Lucius into doing? A. Freeing Dobby), models need to realize that Dobby was Lucius’s house elf, wanted to escape from him, had a positive relationship with Harry, and that Harry helped him. Some of such information is visually or textually observable in the movie, but much information, like relationships between characters and correlations between events, must be deduced.

Our objective is to propose a novel memory network model to perform QA tasks for large-scale, multimodal movie story understanding. That is, the input to the model can be very long (e.g. videos more than two hours long) or multimodal (e.g. text-only or video-text pairs). The key focus of our novel memory network, named Read-Write Memory Network (RWMN), is on defining memory read/write operations with high capacity and flexibility, for which we propose read and write networks that consist of multiple convolutional layers. Existing neural memory network models treat each memory slot as an independent block. However, adjacent memory blocks often have strong correlations, which is especially the case when representing a sequential story. That is, when a human understands a story, the entire story is often recognized as a sequence of closely interconnected abstract events. Hence, memory networks should preferably read and write sequential memory cells as chunks, which we implement with the multiple convolutional layers of the read and write networks.

To conclude the introduction, we summarize the contributions of this work as follows.

  1. We propose a novel memory network named RWMN that enables the model to flexibly read and write more complex and abstract information into memory slots through read/write networks. To the best of our knowledge, it is the first attempt to leverage multi-layer CNNs for read/write operations of a memory network.

  2. The RWMN shows the best accuracies on several tasks of MovieQA benchmark [25]; as of the ICCV2017 submission deadline (March 27, 2017 23:59 GMT), our RWMN achieves the best performance for four out of five tasks in the validation set, and four out of six tasks in the test set. Our quantitative and qualitative evaluation also assures that the read/write networks effectively utilize higher-level information in the external memory, especially on the visual QA task.

2 Related Work

Neural Memory Networks

. Recently, much research has been done to model sequential data using explicit memory architectures. The memory access of existing memory network models can be classified into content-based addressing and location-based addressing [8]. Content-based addressing (e.g. [9]) lets the controller generate a key vector and measures its similarity with each memory cell, to find out which cells should be attended to as relevant to the key vector. Location-based addressing (e.g. [8]), on the other hand, uses simple arithmetic operations to find the addresses at which to store or retrieve information, regardless of the content of the key vector.

Neural Turing Machine (NTM) [8] and its extensions, DNC [9] and D-NTM [10], focus on learning the entire process of memory interaction (read/write operations), and thus the degree of freedom (or capability) of the model in solving a given problem is high. They have been successfully applied to complex tasks such as sorting, sequence copying, and graph traversal. The memory networks of [16] address QA problems using a continuous memory representation similar to the NTM. However, while the NTM leverages both content-based and location-based addressing, they use only the former (content-based) memory interaction. They apply the concept of multi-hops to recurrently read the memory, which results in performance improvement on QA problems that require causal reasoning. The work of [19] proposes a key-value memory network that stores information in the form of (key, value) pairs in an external knowledge base. These methods are good at solving QA problems that focus on the content or facts in a context, such as WikiMovies [19] and the bAbI dataset [27].

The works of [2, 21] deal with how to make read/write operations scalable to an extremely large amount of memory. Chandar et al. [2] propose to organize memory hierarchically, and Rae et al. [21] make read and write operations sparse, thereby increasing scalability and reducing the cost of operations.

Compared to all the previous models, our RWMN model is explicitly equipped with learnable read/write networks of CNNs, which are specialized in storing and utilizing more abstract information, such as relationships between characters, reasons for characters’ specific behaviors, as well as understanding of facts in a given story.

Models for MovieQA

. Among the models applied to the MovieQA benchmark [25], the end-to-end memory network [24] is the state-of-the-art approach. It splits each movie into subshots, and constructs memory slots from video and subtitle features. It then uses content-based addressing to attend to the information relevant to a given question. Recently, Wang and Jiang [26] presented the compare-aggregate framework for word-level matching to measure the similarity of sentences. However, it is applied to only a single task (plot synopses) of MovieQA.

There have also been several studies to solve video QA tasks on other datasets, such as LSMDC [23], MSR-VTT [29], and TGIF-QA [13], which mainly focus on understanding short video clips and answering questions about factual elements in the clips. Yu et al. [30] achieve compelling performance in video captioning, video QA, and video retrieval by constructing an end-to-end trainable concept-word detector along with vision-to-language models.

3 Read-Write Memory Network (RWMN)

Figure 2: Illustration of the proposed Read-Write Network. (a) The multimodal movie embedding $\mathbf{E}$ is obtained using the ResNet feature and the Word2Vec representation from movie subshots and subtitles (Section 3.1). (b) The write network abstracts higher-level sequential information through multiple convolution layers to build the memory $\mathbf{M}$ (Section 3.2). (c) The query-dependent memory $\mathbf{M}_q$ is obtained via the Compact Bilinear Pooling (CBP) between the query and each slot of $\mathbf{M}$, and then the read memory $\mathbf{M}_r$ is constructed through convolution layers (Section 3.3). (d) Finally, the answer with the highest confidence score is chosen out of the five candidates (Section 3.4).

Figure 2 shows the overall structure of our RWMN. The RWMN is trained to store the movie content with a proper representation in memory, extract relevant information from the memory cells in response to a given query, and select the correct answer from the five choices.

Based on the QA format of the MovieQA dataset [25], the input of the model is (i) a sequence of video subshot and subtitle pairs $\{(v_i, s_i)\}_{i=1}^{n}$ for the whole movie, which runs about two hours on average, (ii) a question $q$ for the movie, and (iii) five answer candidates $\{a_j\}_{j=1}^{5}$. In the video+subtitle task of MovieQA, for example, each $s_i$ is a dialogue sentence of a character, and $v_i$ is a video subshot (i.e. a set of frames) sampled at 6 fps that is temporally aligned with $s_i$. The output is a confidence score vector over the five answer candidates.

In the following, we explain the architecture according to information flow, from movie embedding to answer selection via write/read networks.

3.1 Movie Embedding

We convert each subshot and each text sentence into a feature representation as follows. For each frame of subshot $v_i$, we first obtain its feature by applying ResNet-152 [11] pretrained on ImageNet [4]. We then mean-pool over all frames to obtain $\mathbf{v}_i$ as the representation of the subshot $v_i$. For each sentence $s_i$, we first divide the sentence into words, apply the pretrained Word2Vec [18], and then mean-pool with the position encoding (PE) [24] to obtain $\mathbf{s}_i$.
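To make the sentence-side pooling concrete, the snippet below is a minimal NumPy sketch under our reading of the text: Word2Vec word vectors are weighted by the position-encoding scheme of Sukhbaatar et al. [24] and then mean-pooled. The function names, the 300-dimensional word vectors, and the random placeholder input are illustrative assumptions rather than the authors' released code.

```python
import numpy as np

def position_encoding(J, d):
    """Position-encoding weights of [24]: l[j, k] = (1 - j/J) - (k/d)(1 - 2j/J),
    with 1-based word index j and dimension index k."""
    j = np.arange(1, J + 1, dtype=float)[:, None]
    k = np.arange(1, d + 1, dtype=float)[None, :]
    return (1.0 - j / J) - (k / d) * (1.0 - 2.0 * j / J)

def sentence_feature(word_vecs):
    """Mean-pool PE-weighted Word2Vec vectors into a sentence feature s_i.
    word_vecs: (num_words, embedding_dim)."""
    J, d = word_vecs.shape
    return (position_encoding(J, d) * word_vecs).mean(axis=0)

# Example: a 12-word subtitle sentence with (assumed) 300-d Word2Vec embeddings.
s_i = sentence_feature(np.random.randn(12, 300))
```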

Finally, to obtain a multimodal space embedding of $\mathbf{v}_i$ and $\mathbf{s}_i$, we use the Compact Bilinear Pooling (CBP) [6] as

$$\mathbf{E}_i = \mathrm{CBP}(\mathbf{v}_i, \mathbf{s}_i). \qquad (1)$$

We perform this procedure for all pairs of subshots and text, resulting in a 2D movie embedding matrix $\mathbf{E} = [\mathbf{E}_1; \dots; \mathbf{E}_n]$, which is the input of our write network.
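For the fusion step of Eq. (1), here is a minimal NumPy sketch of Compact Bilinear Pooling in the count-sketch/FFT form of Fukui et al. [6]. The 4096-dimensional output, the fixed random seed (the sketch maps must stay identical across all subshot-subtitle pairs), and the placeholder inputs are assumptions for illustration.

```python
import numpy as np

def count_sketch(x, h, s, d_out):
    """Project x to d_out dims with a count sketch, given index map h and sign map s."""
    y = np.zeros(d_out)
    np.add.at(y, h, s * x)
    return y

def compact_bilinear_pooling(v, t, d_out=4096, seed=0):
    """Approximate the outer product of v and t: count-sketch both vectors,
    multiply in the FFT domain (circular convolution), and invert."""
    rng = np.random.RandomState(seed)   # fixed seed => same maps for every pair
    h_v = rng.randint(d_out, size=v.shape[0]); s_v = rng.choice([-1, 1], v.shape[0])
    h_t = rng.randint(d_out, size=t.shape[0]); s_t = rng.choice([-1, 1], t.shape[0])
    fv = np.fft.rfft(count_sketch(v, h_v, s_v, d_out))
    ft = np.fft.rfft(count_sketch(t, h_t, s_t, d_out))
    return np.fft.irfft(fv * ft, n=d_out)

# One row E_i of the movie embedding: fuse a mean-pooled ResNet-152 subshot
# feature v_i (2048-d) with a mean-pooled Word2Vec subtitle feature s_i (300-d).
E_i = compact_bilinear_pooling(np.random.randn(2048), np.random.randn(300))
```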

3.2 The Write Network

The write network takes a movie embedding matrix $\mathbf{E}$ as input and generates a memory tensor $\mathbf{M}$ as output. The write network is motivated by the observation that when a human understands a movie, she does not remember it as a simple sequence of speech and visual content, but rather ties several adjacent utterances and scenes together into events or episodes. That is, each memory cell needs to associate neighboring movie embeddings, instead of storing each movie embedding separately. To implement this idea of jointly storing adjacent embeddings into every slot, we exploit a convolutional neural network (CNN) as the write network. We arrive at the following CNN design after thorough tests, varying the dimensions, depths, and strides of the convolution layers.

To the movie embedding $\mathbf{E}$, we first apply a fully connected (FC) layer with parameters $\mathbf{W}^{w}$ and $\mathbf{b}^{w}$ to project each $\mathbf{E}_i$ into a $d$-dimensional vector, yielding $\tilde{\mathbf{E}} = \mathbf{E}\mathbf{W}^{w} + \mathbf{b}^{w}$. The FC layer reduces the dimension of $\mathbf{E}$ in order to equalize the dimensions of the query embedding and the answer embedding, which is also beneficial for reducing the number of required convolution operations later. We then use a convolution layer consisting of a filter $\mathbf{w}^{wc}$, whose vertical and horizontal filter sizes are $f^{w}_{v}$ and $f^{w}_{h}$, whose number of filter channels is $c^{w}$, and whose strides are $s^{w}_{v}$ and $s^{w}_{h}$, respectively:

$$\mathbf{M} = \mathrm{ReLU}(\mathrm{conv}(\tilde{\mathbf{E}}, \mathbf{w}^{wc}, \mathbf{b}^{wc})), \qquad (2)$$

where $\mathrm{conv}(\text{input}, \text{filter}, \text{bias})$ indicates the convolution layer, $\mathbf{b}^{wc}$ is a bias, and ReLU indicates the element-wise ReLU activation [20]. Finally, the generated memory is $\mathbf{M} \in \mathbb{R}^{m \times d \times c^{w}}$, where the number of memory slots $m$ is determined by $n$, $f^{w}_{v}$, and $s^{w}_{v}$.

Note that the write network can employ multiple convolutional layers. If the number of layers is $n_{w}$, then we obtain the final memory by recursively applying

$$\mathbf{M}^{(\ell)} = \mathrm{ReLU}(\mathrm{conv}(\mathbf{M}^{(\ell-1)}, \mathbf{w}^{wc}_{\ell}, \mathbf{b}^{wc}_{\ell})), \quad \ell = 1, \dots, n_{w}, \qquad (3)$$

starting from $\mathbf{M}^{(0)} = \tilde{\mathbf{E}}$ and setting $\mathbf{M} = \mathbf{M}^{(n_{w})}$. In Section 4, we report the results of an ablation study to find the best-performing $n_{w}$.
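As an illustration of how such a write network could be assembled, the PyTorch sketch below projects the movie embedding with an FC layer and applies one or more strided convolutions along the time axis so that chunks of adjacent embeddings become memory slots. The 4096-d input, the 300-d projection, the 1-wide kernel along the feature axis, and the default (40, 30, 3) layer spec (height, stride, channels, following the convention of Table 1) are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class WriteNetwork(nn.Module):
    """Sketch of the write network: FC projection + stacked strided convolutions
    (Eqs. (2)-(3)) that store chunks of adjacent movie embeddings as memory slots."""
    def __init__(self, in_dim=4096, d=300, layer_specs=((40, 30, 3),)):
        super().__init__()
        self.fc = nn.Linear(in_dim, d)
        convs, in_ch = [], 1
        for f_v, s_v, c in layer_specs:          # (filter height, stride, channels)
            convs.append(nn.Conv2d(in_ch, c, kernel_size=(f_v, 1), stride=(s_v, 1)))
            in_ch = c
        self.convs = nn.ModuleList(convs)

    def forward(self, E):                        # E: (n, in_dim) movie embedding
        x = self.fc(E).unsqueeze(0).unsqueeze(0) # (1, 1, n, d)
        for conv in self.convs:
            x = torch.relu(conv(x))              # convolve over adjacent time steps
        return x.squeeze(0).permute(1, 2, 0)     # memory M: (m, d, channels)

# Example: a movie of n = 1200 subshot-subtitle embeddings.
M = WriteNetwork()(torch.randn(1200, 4096))      # -> (39, 300, 3) with the defaults
```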

3.3 The Read Network

The read network takes a question $q$ and generates an answer based on the compatibility between $q$ and the memory $\mathbf{M}$.

Question embedding

. We embed the question sentence $q$ as follows. We first obtain its Word2Vec representation $\mathbf{q}_{w2v}$ [18] as done in Section 3.1, and then project it as

$$\mathbf{q} = \mathbf{W}^{q}\, \mathbf{q}_{w2v} + \mathbf{b}^{q}, \qquad (4)$$

where $\mathbf{W}^{q}$ and $\mathbf{b}^{q}$ are parameters.

Next, the read network takes the memory $\mathbf{M}$ and the query embedding $\mathbf{q}$ as input, and generates the confidence score vector over the answer candidates as follows.

Query-dependent memory embedding

. We first transform the memory to be query-dependent. The intuition is that, according to the query, different types of information must be retrieved from the memory slots. For example, for the Harry Potter movie, suppose that one memory slot contains the information about a particular scene where Harry is chanting magic spells. This memory slot should be read differently according to two different questions, $q_1$: What color is Harry wearing? and $q_2$: Why is Harry chanting magic spells? In Section 4, we empirically show the effectiveness of this question-dependent memory update.

To transform the memory $\mathbf{M}$ into a query-dependent memory $\mathbf{M}_q$, we apply the CBP [6] between each memory cell of $\mathbf{M}$ and the query embedding $\mathbf{q}$ as

$$\mathbf{M}_q[i] = \mathrm{CBP}(\mathbf{M}[i], \mathbf{q}), \qquad (5)$$

for all $i = 1, \dots, m$.

Convolutional memory read

. As done in the write network, we also leverage a CNN to implement the read network. Our intuition is that, to answer a question about a movie correctly, it is important to connect and relate a series of scenes as a whole. Therefore, we use the CNN architecture to access chunks of sequential memory slots. We obtain the reconstructed memory by applying convolution layers with a filter $\mathbf{w}^{rc}$, whose vertical and horizontal filter sizes are $f^{r}_{v}$ and $f^{r}_{h}$, whose number of filter channels is $c^{r}$, and whose strides are $s^{r}_{v}$ and $s^{r}_{h}$, respectively. The reconstructed memory is $\mathbf{M}_r \in \mathbb{R}^{r \times d \times c^{r}}$, where $r$ is the resulting number of read memory slots:

$$\mathbf{M}_r = \mathrm{ReLU}(\mathrm{conv}(\mathbf{M}_q, \mathbf{w}^{rc}, \mathbf{b}^{rc})), \qquad (6)$$

where $\mathbf{b}^{rc}$ is a bias term. As in the write network, the read network can also have a stack of convolution layers; the formulation is the same as Eq. (3), except that the write-network memory and parameters are replaced with their read-network counterparts. We also report the results of an ablation study over different numbers of read layers $n_{r}$ in Section 4.
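A matching sketch of the read side is given below. To keep it short, an element-wise product stands in for the per-slot CBP of Eq. (5), and the default (3, 1, 1) spec mirrors the read settings in Table 1; these simplifications are ours rather than the paper's.

```python
import torch
import torch.nn as nn

class ReadNetwork(nn.Module):
    """Sketch of the read network: query-conditioned memory followed by stacked
    strided convolutions over chunks of adjacent slots (Eq. (6))."""
    def __init__(self, in_channels=3, layer_specs=((3, 1, 1),)):
        super().__init__()
        convs, in_ch = [], in_channels
        for f_v, s_v, c in layer_specs:          # (filter height, stride, channels)
            convs.append(nn.Conv2d(in_ch, c, kernel_size=(f_v, 1), stride=(s_v, 1)))
            in_ch = c
        self.convs = nn.ModuleList(convs)

    def forward(self, M, q):                     # M: (m, d, c), q: (d,)
        Mq = M * q.view(1, -1, 1)                # query-dependent memory (CBP surrogate)
        x = Mq.permute(2, 0, 1).unsqueeze(0)     # (1, c, m, d)
        for conv in self.convs:
            x = torch.relu(conv(x))
        return x.squeeze(0).permute(1, 2, 0)     # read memory M_r: (r, d, c')

# Example with the write network's output (39, 300, 3) and a 300-d query.
M_r = ReadNetwork()(torch.randn(39, 300, 3), torch.randn(300))   # -> (37, 300, 1)
```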

3.4 Answer Selection

Next we compute the attention vector $\mathbf{p} \in \mathbb{R}^{r}$ by applying the softmax to the dot product between the query embedding $\mathbf{q}$ and each cell of the memory $\mathbf{M}_r$:

$$p_i = \mathrm{softmax}(\mathbf{q} \cdot \mathbf{M}_r[i]), \qquad (7)$$

where $\cdot$ indicates the dot product. Finally, the output vector $\mathbf{o}$ is obtained through a weighted sum between the memory cells of $\mathbf{M}_r$ and the attention vector $\mathbf{p}$:

$$\mathbf{o} = \sum_{i=1}^{r} p_i\, \mathbf{M}_r[i]. \qquad (8)$$

Next we obtain the embeddings $\{\mathbf{a}_j\}_{j=1}^{5}$ of the five answer candidate sentences as done for the question in Eq. (4), sharing the parameters $\mathbf{W}^{q}$ and $\mathbf{b}^{q}$.

We compute the confidence vector $\mathbf{z} \in \mathbb{R}^{5}$ by measuring the similarity between each $\mathbf{a}_j$ and the weighted sum of $\mathbf{o}$ and $\mathbf{q}$:

$$z_j = (\mathbf{o} + \alpha\, \mathbf{q}) \cdot \mathbf{a}_j, \qquad (9)$$

where $\alpha$ is a trainable parameter. Finally, we predict the answer with the highest confidence score: $\hat{y} = \arg\max_{j} z_j$.
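The scoring step can be summarized by the short function below, which follows Eqs. (7)-(9) as reconstructed above; keeping a single read channel and mixing $\mathbf{o}$ and $\mathbf{q}$ with a scalar $\alpha$ are our assumptions about the exact form.

```python
import torch
import torch.nn.functional as F

def answer_scores(M_r, q, answers, alpha):
    """M_r: (r, d, 1) read memory, q: (d,) query embedding,
    answers: (5, d) candidate embeddings, alpha: trainable scalar."""
    M_flat = M_r[..., 0]                        # (r, d), assuming one read channel
    p = F.softmax(M_flat @ q, dim=0)            # attention over slots, Eq. (7)
    o = (p.unsqueeze(1) * M_flat).sum(dim=0)    # output vector o, Eq. (8)
    return answers @ (o + alpha * q)            # confidence scores z, Eq. (9)

# Example: the predicted answer is the argmax over the five confidence scores.
z = answer_scores(torch.randn(37, 300, 1), torch.randn(300),
                  torch.randn(5, 300), torch.tensor(0.5))
prediction = int(torch.argmax(z))
```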

3.5 Training

To train our model, we minimize the softmax cross-entropy between the prediction $\mathbf{z}$ and the groundtruth one-hot vector $\mathbf{y}$. All training parameters are initialized with the Xavier method [7]. Experimentally, we select the Adagrad [5] optimizer with a mini-batch size of 32, a learning rate of 0.001, and an initial accumulator value of 0.1. We train our model for up to 200 epochs, although we actively use early stopping to avoid overfitting due to the small size of the MovieQA dataset. We repeat training each model with 12 different random initializations, and select the one with the lowest cost.
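A minimal sketch of the stated optimization setup (Adagrad with learning rate 0.001 and initial accumulator value 0.1, softmax cross-entropy loss) is shown below; the model definition, mini-batching, and early stopping are omitted, and the tensors are placeholders.

```python
import torch
import torch.nn as nn

def make_optimizer(model: nn.Module) -> torch.optim.Adagrad:
    """Adagrad configured as described above (mini-batch size 32 and early
    stopping belong to the surrounding training loop, not shown here)."""
    return torch.optim.Adagrad(model.parameters(), lr=0.001,
                               initial_accumulator_value=0.1)

# Softmax cross-entropy between the confidence scores z and the groundtruth index.
criterion = nn.CrossEntropyLoss()
z = torch.randn(1, 5, requires_grad=True)   # confidence scores for one QA example
loss = criterion(z, torch.tensor([2]))      # index of the correct answer
loss.backward()
```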

4 Experiments

We evaluate the proposed RWMN model for all the tasks of MovieQA benchmark [25]. We defer more experimental results and implementation details to the supplementary file.

4.1 MovieQA Tasks and Experimental Setting

The number of movies and QA pairs according to data sources in the MovieQA dataset [25].
Story sources        # movies   # QA pairs
Videos and subtitles 140 6,462
Subtitles 408 14,944
DVS 60 2,446
Scripts 199 7,810
Plot synopses 408 14,944

As summarized in Table ?, the MovieQA dataset [25] contains 408 movies and 14,944 multiple-choice QA pairs, each of which consists of five answer choices with only one correct answer. The dataset provides five types of story sources associated with the movies: videos, subtitles, DVS, scripts, and plot synopses, based on which the MovieQA challenge hosts six subtasks according to which sources of information can be used: (i) video+subtitle, (ii) subtitles only, (iii) DVS only, (iv) scripts only, (v) plot synopses only, and (vi) open-ended. That is, there are one video-text QA task, four text-only QA tasks, and one open-ended QA task with no restriction on additional story sources. We strictly follow the test protocols of the challenge, including the training/validation/test split and evaluation metrics. More details of the dataset and rules are available in [25] and its homepage1.

Among the six tasks, we discuss our results with more focus on the video+subtitle task, because it is the only VQA task that requires both video and text understanding, whereas the other tasks are text-only. We place less weight on the plot synopses only task; since plot synopses are given with a question and all the QA pairs are generated from plot synopses, this task can be tackled using simple word/sentence matching algorithms (with little movie understanding), already achieving a very high accuracy of 77.63%.

We solve the video+subtitle task using the proposed RWMN model in Figure 2. For the four text-only QA tasks, no visual sources are given, so we use only the sentence features $\mathbf{s}_i$ to construct the movie embedding of Eq. (1), without the CBP. Except for this, we use the same RWMN model to solve the four text-only QA tasks.

Performance comparison for the video+subtitle task on the MovieQA public validation/test sets. (–) means that the method does not participate in the task. Baselines include DEMN (Deep embedded memory network), OVQAP (Only video question answer pairs), and VCFSM (Video clip features with simple MLP).

Methods                         val      test
OVQAP                            –      23.61
Simple MLP                       –      24.09
LSTM + CNN                       –      23.45
LSTM + Discriminative CNN        –      24.32
VCFSM                            –      24.09
DEMN [15]                        –      29.97
MEMN2N [25]                    34.20      –
RWMN-noRW                      34.20      –
RWMN-noR                       36.50      –
RWMN-noQ                       38.17      –
RWMN-noVid                     37.20      –
RWMN                           38.67    36.25
RWMN-bag                       38.37    35.69
RWMN-ensemble                  38.30      –

4.2 Baselines

We compare the performance of our approach with those of all the methods proposed in the original MovieQA paper [25] or in the official MovieQA leaderboard2. We describe the baseline names in the caption of each result table.

In order to measure the effects of the key components of the RWMN, we experiment with five variants: (i) (RWMN-noRW) the model without read/write networks, (ii) (RWMN-noR) the model with only the write network, (iii) (RWMN-noQ) the model without query-dependent memory embedding, (iv) (RWMN-noVid) the model trained without using videos, to quantify the importance of visual input, and (v) (RWMN) the model with both write/read networks.

We also test two ensemble versions of our model. Since the MovieQA dataset size is relatively small compared to the task difficulty (e.g. 4,318 training QA examples in the video+subtitle category), models often suffer from severe overfitting, which ensemble methods can mitigate. The first (RWMN-bag) is a bagged version of our approach, in which we independently learn RWMN models on 30 bootstrapped datasets and average their predictions, as sketched below. The second (RWMN-ensemble) is a simple ensemble, in which we independently train 20 models with different random initializations and average their predictions.
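The bagging variant can be sketched as follows; `train_fn` and `predict_fn` are hypothetical placeholders for the RWMN training and inference routines, and averaging the five-way confidence scores is our reading of "averaged prediction".

```python
import numpy as np

def bagged_prediction(train_fn, predict_fn, dataset, n_models=30, seed=0):
    """Train one model per bootstrap resample of the training set and
    average their confidence scores over the answer candidates."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_models):
        idx = rng.choice(len(dataset), size=len(dataset), replace=True)
        model = train_fn([dataset[i] for i in idx])   # bootstrap resample
        scores.append(predict_fn(model))              # (num_questions, 5) scores
    return np.mean(scores, axis=0)                    # averaged prediction
```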

4.3 Quantitative Results

Below, we report the results of each method on the validation and test sets, neither of which is used for training at all. While the original MovieQA paper [25] reports results on the validation set only, the official leaderboard shows performance on the test set only, for which groundtruth answers are not observable and evaluation is performed through the evaluation server. Test submissions to the server are limited to once every 72 hours.

As of the ICCV2017 submission deadline, our RWMN achieves the best performance for four out of five tasks in the validation set, and four out of six tasks in the test set.

Results of VQA task

. Table ? compares the performance of our RWMN model with those of the baselines for the video+subtitle task. We observe that the RWMN achieves the best performance on both validation and test sets. For example, on the test set, the RWMN attains 36.25%, which is significantly better than the runner-up DEMN at 29.97%.

As expected, the RWMN with both read/write networks is the best among our variants on both validation and test sets. This indicates that the read/write networks play a key role in improving movie understanding. For example, the RWMN-noR with only the write network attains higher performance than the RWMN-noRW, whose performance is similar to or lower than that of other existing models. The RWMN-noQ without question-dependent memory embedding also underperforms the normal RWMN, which shows that updating the memory according to the question is indeed helpful for selecting an answer more relevant to the question. Finally, the RWMN-noVid is not as good as the RWMN, meaning that our RWMN successfully exploits both full videos and subtitles for training. Interestingly, the ensemble methods of our model, RWMN-bag and RWMN-ensemble, slightly underperform the single RWMN model.

Results of text-only tasks

. Table ? shows the results on the validation and test sets for the text-only categories (i.e. subtitles only, DVS only, scripts only, and plot synopses only).

For the open-ended task, we simply use the plot synopses version of our method, which outperforms the only trivial baseline on the test set (i.e. selecting the longest answer choice).

Our RWMN achieves the best performance on all tasks except for the DVS test set and the plot synopses task.

We also observe that the ensemble methods hardly improve the performance of our method. As discussed before, memory network approaches, including our RWMN and MEMN2N, are not outstanding in the plot synopses only category. This is mainly because the queries and answer choices are made directly from the plot sentences, and thus this task can be tackled better by word/sentence matching methods with little story comprehension. In addition, each plot synopsis consists of about 35 sentences on average as a summary of a movie, which is much shorter than the other data types, for example, about 1,558 sentences of subtitles per movie. Therefore, the memory abstraction by our method is less critical for the problems in this category.

One important difference between the four text-only tasks is that each story source has a different $n$ (i.e. the number of sentences), and thus the density of information contained in each sentence is also different. For example, the average $n$ of the scripts is about 2,877 per movie, while the average $n$ of the DVS is about 636; thus, each sentence in a script contains low-level details, while each sentence in the DVS contains high-level and abstract content. Given that the performance improvement by our RWMN is more significant on the DVS only task (e.g. RWMN: 40.0 vs. MEMN2N: 33.0), our proposed read/write networks may be more beneficial for understanding and answering high-level and abstract content.

4.4 Ablation Results

We examine how the performance varies according to the structure of the CNNs in the write/read networks. Among the hyperparameters of the RWMN, the following three have significant effects on model performance: i) the conv-filter/stride sizes of the write network, ii) the conv-filter/stride sizes of the read network, and iii) the number of write/read CNN layers $n_w$ and $n_r$. Regarding the convolutions, the larger the convolution filter sizes, the more memories are read/written as a chunk. Also, as the stride size decreases or the number of output channels increases, the total number of memory blocks increases.

Table 1 summarizes the performance variation on the video+subtitle task according to different combinations of these three hyperparameters. We make several observations from the results. First, as the number of CNN layers in the read/write networks increases, the capacity of memory interaction may increase as well, yet the performance worsens. Presumably, the main reason is overfitting due to the relatively small dataset size of MovieQA, as discussed. Our results hint that the two-layer CNN is the best for training performance, while the one-layer CNN is the best for validation. Second, we observe that there is no absolute magic number for how many memory slots should be read/written as a single chunk, nor for how far the memory controller should stride. If the stride height is too small or too large compared to the height of the convolution filter, the performance decreases. That is, performance can degrade when too much information is read/written into a single abstracted slot, when too much information is overlapped in adjacent reads/writes (due to a small stride), or when the information overlap is too coarse (due to a large stride). We present more ablation results in the supplementary file.

Figure 3 compares the MEMN2N [25] and our RWMN model according to question types on the video+subtitle task. We examine the results for six question types, according to the starting word of the question: Who, Where, When, What, Why, and How. Usually, Why questions require abstraction and high-level reasoning to answer correctly (e.g. Why did Harry end his relationship with Helen?, Why does Michael depart for Sicily?). On the other hand, Who and When questions primarily deal with factual elements (e.g. Who is Harry’s girlfriend?, When does Grissom plan to set up Napier to be murdered?). Compared to the MEMN2N [25], our RWMN shows higher performance gains on questions starting with Why, which may indicate the superiority of the RWMN in dealing with high-level reasoning questions.

Table 1: Performance of the RWMN on the video+subtitle task, according to the structure parameters of the write/read networks. $n_w$ / $n_r$: the number of layers for the write/read networks; each tuple lists the height and the stride of a convolution filter and its number of output channels.
n_w  n_r  Write network                   Read network        Acc.
0    0    –                               –                   34.2
1    0    (40,7,1)                        –                   33.9
1    0    (40,30,3)                       –                   36.5
1    1    (40,30,3)                       (3,1,1)             38.6
1    1    (40,60,3)                       (3,1,1)             33.6
2    1    (40,10,3), (10,5,3)             (3,1,1)             37.2
2    1    (5,3,1), (5,3,1)                (3,1,1)             37.3
2    2    (4,2,1), (4,2,1)                (3,1,1), (3,1,1)    36.9
2    2    (4,2,1), (4,2,1)                (4,2,1), (4,2,1)    37.3
3    1    (10,3,3), (40,3,3), (100,3,3)   (3,1,1)             35.1
3    1    (40,3,3), (10,3,3), (10,3,3)    (3,1,1)             37.9
3    1    (40,3,3), (40,3,3), (40,3,3)    (3,1,1)             35.7
3    1    (100,3,3), (40,3,3), (10,3,3)   (3,1,1)             35.8

4.5 Qualitative Results

Figure 3: Accuracy comparison between the RWMN and the MEMN2N baseline on the video+subtitle task according to question types. The RWMN yields higher improvement for Why questions, which often require abstract and high-level understanding.
Figure 4: Qualitative examples of MovieQA video+subtitle problems solved by our methods (success cases in the top two rows, and failure cases in the last row). Bold sentences are groundtruth answers, and red check symbols indicate our model’s selections. For each example, we also show on which parts of the entire movie our RWMN model attends. The attention by the RWMN often matches well with the groundtruth (GT) parts where the question was actually generated.

Figure 4 illustrates selected qualitative examples of video+subtitle problems solved by our methods, including four success and two near-miss cases. For each example, we present a sampled query video, a question, and five answer choices, in which the groundtruth is in bold and our model’s selection is marked with a red check. We also show on which parts of the entire movie our RWMN attends, along with the groundtruth (GT) attention maps provided by the dataset, which indicate the temporal locations of the clips from which the question was generated. As the examples show, movie question answering is highly challenging, and is sometimes not easy even for humans.

Our predicted attention often agrees well with the GT; the RWMN implicitly learns where to place its attention in a very long movie for answering, although such information is not available during training. However, the RWMN sometimes finds correct answers even when its attention mismatches the GT. This is because the MovieQA dataset also includes many questions that are hardly solvable by attending only to the GT parts. That is, some questions require understanding the relationships between characters or the progress of events, for which attending beyond the GT parts is necessary.

5 Conclusion

We proposed a new memory network model named Read-Write Memory Network (RWMN), whose key idea is a CNN-based read/write network that enables the model to have highly capable and flexible read/write operations. We empirically validated that the proposed read/write networks indeed improve the performance on visual question answering tasks for large-scale, multimodal movie story understanding. Specifically, our approach achieved the best accuracies on multiple tasks of the MovieQA benchmark, with a significant improvement on the visual QA task. We believe there are several promising research directions beyond this work. First, we can apply our approach to other QA tasks that require complicated story understanding. Second, we can explore better video and text representations beyond ResNet and Word2Vec.

Acknowledgements

. This research is partially supported by SK Telecom and Basic Science Research Program through National Research Foundation of Korea (2015R1C1A1A02036562). Gunhee Kim is the corresponding author.

Footnotes

  1. http://movieqa.cs.toronto.edu/.
  2. http://movieqa.cs.toronto.edu/leaderboard/ as of the ICCV2017 submission deadline (March 27, 2017 23:59 GMT).

References

  1. Youtube-8M: A Large-scale Video Classification Benchmark.
    S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. arXiv preprint arXiv:1609.08675, 2016.
  2. Hierarchical Memory Networks.
    S. Chandar, S. Ahn, H. Larochelle, P. Vincent, G. Tesauro, and Y. Bengio. arXiv preprint arXiv:1605.07427, 2016.
  3. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation.
    K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. In EMNLP, 2014.
  4. Imagenet: A Large-scale Hierarchical Image Database.
    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. In CVPR, 2009.
  5. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
    J. Duchi, E. Hazan, and Y. Singer. JMLR, pages 2121–2159, 2011.
  6. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding.
    A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. In EMNLP, 2016.
  7. Understanding the Difficulty of Training Deep Feedforward Neural Networks.
    X. Glorot and Y. Bengio. In AISTATS, 2010.
  8. Neural Turing Machines.
    A. Graves, G. Wayne, and I. Danihelka. arXiv preprint arXiv:1410.5401, 2014.
  9. Hybrid Computing Using a Neural Network with Dynamic External Memory.
    A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Nature, 538:471–476, 2016.
  10. Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes.
    C. Gulcehre, S. Chandar, K. Cho, and Y. Bengio. In ICLR, 2017.
  11. Deep Residual Learning for Image Recognition.
    K. He, X. Zhang, S. Ren, and J. Sun. In CVPR, 2016.
  12. Long Short-term Memory.
    S. Hochreiter and J. Schmidhuber. Neural computation, 9(8):1735–1780, 1997.
  13. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering.
    Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim. In CVPR, 2017.
  14. Large-scale Video Classification with Convolutional Neural Networks.
    A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. In CVPR, 2014.
  15. Deepstory: video story qa by deep embedded memory networks.
    K.-M. Kim, M.-O. Heo, S.-H. Choi, and B.-T. Zhang. IJCAI, 2017.
  16. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing.
    A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher. ICML, 2016.
  17. Recurrent Neural Network Based Language Model.
    T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur. In Interspeech, 2010.
  18. Distributed Representations of Words and Phrases and Their Compositionality.
    T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. In NIPS, 2013.
  19. Key-value Memory Networks for Directly Reading Documents.
    A. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston. In EMNLP, 2016.
  20. Rectified Linear Units Improve Restricted Boltzmann Machines.
    V. Nair and G. E. Hinton. In ICML, 2010.
  21. Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes.
    J. Rae, J. J. Hunt, I. Danihelka, T. Harley, A. W. Senior, G. Wayne, A. Graves, and T. Lillicrap. In NIPS, 2016.
  22. Squad: 100,000+ Questions for Machine Comprehension of Text.
    P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. In EMNLP, 2016.
  23. Movie Description.
    A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. IJCV, 123(1):94–120, 2017.
  24. End-to-End Memory Networks.
S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. In NIPS, 2015.
  25. MovieQA: Understanding Stories in Movies through Question-answering.
    M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. In CVPR, 2016.
  26. A Compare-Aggregate Model for Matching Text Sequences.
    S. Wang and J. Jiang. In ICLR, 2017.
  27. Towards AI-complete Question Answering: A Set of Prerequisite Toy Tasks.
    J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov. arXiv preprint arXiv:1502.05698, 2015.
  28. Memory Networks.
    J. Weston, S. Chopra, and A. Bordes. ICLR, 2015.
  29. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language.
    J. Xu, T. Mei, T. Yao, and Y. Rui. In CVPR, 2016.
  30. End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering.
    Y. Yu, H. Ko, J. Choi, and G. Kim. In CVPR, 2017.
  31. Dynamic Key-Value Memory Network for Knowledge Tracing.
    J. Zhang, X. Shi, I. King, and D.-Y. Yeung. In WWW, 2017.