Multimodal Matching Transformer for Live Commenting

Abstract

Automatic live commenting aims to provide real-time comments on videos for viewers. It encourages user engagement on online video sites, and is also a good benchmark for video-to-text generation. Recent work on this task adopts encoder-decoder models to generate comments. However, these methods do not model the interaction between videos and comments explicitly, so they tend to generate popular comments that are often irrelevant to the videos. In this work, we aim to improve the relevance between live comments and videos by modeling the cross-modal interactions among different modalities. To this end, we propose a multimodal matching transformer to capture the relationships among comments, vision, and audio. The proposed model is based on the transformer framework and can iteratively learn the attention-aware representations for each modality. We evaluate the model on a publicly available live commenting dataset. Experiments show that the multimodal matching transformer model outperforms the state-of-the-art methods.

1 Introduction

Live commenting is an emerging feature of online video sites that allows real-time comments to fly across the screen or roll at the right side of the videos, so that viewers can see comments and videos at the same time. Automatic live commenting aims to provide additional opinions on videos and to respond to live comments from other viewers, which encourages user engagement on online video sites. Automatic live commenting is also a good testbed for a model's ability to deal with multi-modality information [16]. It requires the model to understand the vision, text, and audio, and to organize the language to produce comments on the videos. Therefore, it is an interesting and important task for human-AI interaction.

Although great progress has been made in multimodal learning [15, 24, 25], live commenting is still a challenging task. Recent work on live commenting implements an encoder-decoder model to generate the comments [16]. However, these methods do not model the interaction between the videos and the comments explicitly. Therefore, the generated comments are often generic enough to fit almost any video and irrelevant to the specific input video. Figure 1 shows an example of the comments generated by an encoder-decoder model. It shows that the encoder-decoder model tends to output popular sentences, such as “Oh my God !”, while the reference comment is much more informative and relevant to the video. The reason is that the encoder-decoder model focuses more on language modeling than on the interaction between the videos and the comments, so generating popular comments is a safe way for the model to reduce the empirical risk. As a result, the encoder-decoder model is more likely to generate a frequent sentence than an informative and relevant comment.


Case 1 Oh My God !!!!!
Case 2 So am I.
Reference The cat is afraid of spicy.
Figure 1: An example of the generated comments by the encoder-decoder model. Above is a frame extracted from a selected video. Below are two cases generated by the encoder-decoder model around the above frame, as well as a reference comment by human.

Another problem with current state-of-the-art live commenting models is that they do not take the audio into consideration. Audio, as an important part of videos, carries information that may not appear in the vision or text. For example, when the video is about playing the piano, it is difficult to make a proper comment without the audio. The audio also includes dialogues or background music, which helps understand the story in videos. Therefore, the audio should not be neglected if the model needs to fully understand videos and make an informative comment.

In this work, we build a novel live commenting model to make more relevant comments. Based on the above observations, we propose a multimodal matching transformer to learn the cross-modal interaction between videos and comments explicitly. The proposed multimodal matching network matches the most relevant comments with the given videos from a candidate set, so it encourages the produced comments to be more informative and less generic. Our model is based on the transformer architecture, and it jointly learns the cross-modal representations of text, vision, and audio. We evaluate our model on a live commenting dataset [16]. Experiments show that the proposed multimodal matching transformer model is effective and significantly outperforms the state-of-the-art methods.

The contributions of this paper can be summarized as follows:

  • We propose using the audio information for the task of live commenting, which is neglected by previous work.

  • We propose a novel multimodal matching network to capture the relationship among text, vision, and audio, based on the state-of-the-art transformer framework.

  • Experiments show that the proposed multimodal matching model significantly outperforms the state-of-the-art methods.


Figure 2: Architecture of the multi-head attention.

2 Background

In this work, our proposed model is based on the Transformer network [20]. This section gives an introduction to the core modules of the Transformer network.

2.1 Input Representation

In the Transformer network architecture, there is no recurrence and no convolution. To leverage the order of the sequence, it introduces positional embeddings, and each positional embedding is computed based on the token's position in the sequence:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) \tag{1}$$

where $pos$ is the position in the sequence and $i$ is the dimension.

Namely, the input representation of the Transformer network contains two parts: word embeddings $E = \{e_1, e_2, \dots, e_n\}$ and positional embeddings $P = \{p_1, p_2, \dots, p_n\}$, where $n$ is the length of the input sentence. The two parts are fused by an addition operation.
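
To make Eq. (1) concrete, below is a minimal PyTorch sketch of the sinusoidal positional embeddings; the even/odd dimension split and the additive fusion with word embeddings follow the standard Transformer formulation, and an even embedding dimension (512 in this paper) is assumed.

```python
import math
import torch

def positional_embedding(max_len, d_model):
    """Sinusoidal positional embeddings of Eq. (1): sine on even dimensions,
    cosine on odd dimensions, with geometrically increasing wavelengths.
    Assumes d_model is even."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)       # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                   # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# First-layer input: element-wise sum of the word embeddings (n, d_model)
# and positional_embedding(n, d_model).
```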

2.2 Multi-Head Attention

After obtaining outputs of previous layers, the Transformer network uses a multi-head attention mechanism to learn the context-aware representation for the sequence.

Figure 2 shows the architecture of the multi-head attention. Three matrices, $Q$, $K$, and $V$, are the inputs derived from previous layers, and $O$ is the output. The multi-head attention can be denoted as:

$$O = \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O \tag{2}$$

where $W^O$ is a trainable parameter and $h$ is the number of parallel attention layers (heads). $\text{head}_i$ is computed by Eq. (3):

$$\text{head}_i = \text{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V) \tag{3}$$

where $W_i^Q$, $W_i^K$, and $W_i^V$ are trainable parameters and $\text{Attention}(\cdot)$ is the scaled dot-product attention, which can be denoted as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \tag{4}$$

where $d_k$ is the dimension of the keys.
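
The following PyTorch sketch implements Eqs. (2)-(4); packing the per-head projections into single linear layers and the batched tensor shapes are conventional implementation choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Eq. (4): softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(torch.nn.Module):
    """Eqs. (2)-(3): h parallel heads with their own projections, concatenated
    and projected by W^O."""
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = torch.nn.Linear(d_model, d_model)
        self.w_k = torch.nn.Linear(d_model, d_model)
        self.w_v = torch.nn.Linear(d_model, d_model)
        self.w_o = torch.nn.Linear(d_model, d_model)   # W^O

    def forward(self, Q, K, V):
        B = Q.size(0)
        def split(x, proj):                            # (B, len, d_model) -> (B, h, len, d_k)
            return proj(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        heads = scaled_dot_product_attention(split(Q, self.w_q),
                                             split(K, self.w_k),
                                             split(V, self.w_v))
        concat = heads.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(concat)                        # (B, len_q, d_model)
```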

Figure 3: Architecture of the matching transformer. Part (a) illustrates the whole architecture. In this model, the matching layer consists of stacked matching blocks, and part (b) illustrates the structure of a matching block.

3 Multimodal Matching Transformer

Automatic live commenting aims to make comments on a video clip. According to the analysis in [16], the live comments are relevant not only to the video clip, but also to the surrounding comments from other viewers. In this work, we find it helpful to incorporate the audio information into the live commenting model. However, the surrounding comments, the vision part, and the audio part of the videos come from different modalities, and it is not trivial to model their relationships. To address these issues, we propose a multimodal matching model, which we denote as the Matching Transformer. The model is based on the popular transformer architecture [20, 9]. Our model jointly learns the cross-modal representations of the textual context, the visual context, and the audio context, and models the relationships among them.

3.1 Task Definition

We formulate automatic live commenting as a ranking problem. Formally, given a video $V$ and a time-stamp $t$, automatic live commenting aims to select the comment $y$ from a candidate set $Y$ that is most relevant to the video clip near the time-stamp, based on the surrounding comments $T$, the visual part $I$, and the audio part $A$. Concretely, we extract the comments near the time-stamp as $T = \{T_1, T_2, \dots, T_m\}$, where $T_i$ is a comment. For the vision part, we sample video frames near the time-stamp as $I = \{I_1, I_2, \dots, I_n\}$, where $I_i$ is an image and the interval between two images is 1 second. We convert a 5-second audio clip surrounding the time-stamp into a log-magnitude mel-frequency spectrogram, which can be denoted as $A = \{A_1, A_2, \dots, A_s\}$, where $A_i$ is a vector and $s$ is the length of the audio clip. The candidate comment can be denoted as $y = \{w_1, w_2, \dots, w_k\}$, where $w_i$ is a word and $k$ is the number of words.

In this way, the task can be formulated as searching for the comment that is most relevant to the video clip in the multimodal semantic space:

$$\hat{y} = \arg\max_{y \in Y} f(T, I, A, y) \tag{5}$$

where $f(\cdot)$ is a model that produces the similarity between $(T, I, A)$ and $y$.
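
In practice, Eq. (5) amounts to scoring every candidate comment against the video context and keeping the best one. A hypothetical sketch, where `model` stands for any scoring function $f(T, I, A, y)$ such as the matching transformer described below:

```python
import torch

def rank_candidates(model, T, I, A, candidates):
    """Score each candidate comment against the context (T, I, A) and return
    the candidates sorted by similarity, best first (Eq. (5) keeps the top one)."""
    with torch.no_grad():
        scores = torch.stack([model(T, I, A, y) for y in candidates])
    order = torch.argsort(scores, descending=True)
    return [candidates[int(i)] for i in order]
```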

3.2 Model Overview

Figure 3 shows the architecture of the Matching Transformer. The model consists of three components: (1) an encoder layer that converts the different modalities of a video clip (the surrounding comments, the vision part, and the audio part) and a candidate comment into vectors; (2) a matching layer that iteratively learns the attention-aware representation for each modality; (3) a prediction layer that outputs a score measuring the matching degree between a video clip and a comment.

Formally, given the different contexts $(T, I, A)$ of a video clip and a candidate comment $y$, the model can be denoted as:

$$f(T, I, A, y) = \text{Predict}\big(\text{Match}\big(\text{Encode}(T, I, A, y)\big)\big) \tag{6}$$

where Encode, Match, and Predict denote the encoder layer, the matching layer, and the prediction layer respectively.

Next, we introduce each layer in detail.

3.3 Encoder Layer

As shown in part (a) of Figure 3, our model contains three kinds of encoder: a comment encoder, a vision encoder and an audio encoder. These encoders convert a comment, a vision clip and an audio clip into vectors respectively. In our model, the existing surrounding comments and the candidate comment share the same comment encoder.

Comment Encoder

In our model, the comments near the time-stamp, $T = \{T_1, T_2, \dots, T_m\}$, are first concatenated into one comment $T = \{x_1, x_2, \dots, x_N\}$, where $x_i$ is the i-th word in the comment and $N$ is the total number of words. Then, the comment encoder converts the words of the comment into vectors $e^T = \{e^T_1, e^T_2, \dots, e^T_N\}$ by looking up the embedding table $W_e \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of the embedding and $|V|$ is the size of the vocabulary. Similarly, the comment encoder also converts the candidate comment $y$ into vectors: $e^y = \{e^y_1, e^y_2, \dots, e^y_k\}$.
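
The comment encoder is essentially a shared embedding lookup; a minimal sketch, assuming the comments are already tokenized into integer ids and the embedding table plays the role of $W_e$:

```python
import torch

class CommentEncoder(torch.nn.Module):
    """Shared embedding lookup used for both the concatenated surrounding
    comments and the candidate comment."""
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, d_model)   # W_e

    def forward(self, token_ids):            # (batch, length) integer ids
        return self.embedding(token_ids)     # (batch, length, d_model)
```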

Vision Encoder

The vision encoder converts a vision clip into vectors $e^I = \{e^I_1, e^I_2, \dots, e^I_n\}$ by a pre-trained model, where the number of vectors is equal to the number of sampled frames $n$. Similar to [16], we leverage a pre-trained 18-layer ResNet [13] to encode the frames within a vision clip. It can be denoted as:

$$e^I_i = \text{ResNet}(I_i) \tag{7}$$
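
A sketch of Eq. (7) built on the pre-trained ResNet-18 from torchvision. Dropping the final classification layer to obtain a 512-d vector per frame and freezing the backbone (the frame vectors are kept fixed, see Section 4.3) are our assumptions about how the features are extracted.

```python
import torch
import torchvision

class VisionEncoder(torch.nn.Module):
    """Eq. (7): encode each sampled frame with a pre-trained 18-layer ResNet."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18(pretrained=True)
        self.backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the fc layer
        for p in self.backbone.parameters():   # frame features are fixed during training
            p.requires_grad = False

    def forward(self, frames):                 # (n, 3, 224, 224) preprocessed frames
        feats = self.backbone(frames)          # (n, 512, 1, 1) after global average pooling
        return feats.flatten(1)                # (n, 512), one vector per frame
```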

Audio Encoder

For audio encoding, we first slice a 5-second audio clip into five audio frame sets, $\{A^{(1)}, A^{(2)}, \dots, A^{(5)}\}$, based on the timestamp. Then, we use a GRU [7] to encode each set. It can be denoted as:

$$h^{(i)}_t = \text{GRU}\big(h^{(i)}_{t-1}, A^{(i)}_t\big) \tag{8}$$

At last, we use the last hidden state of each set as the representation of the audio clip: $e^A = \{e^A_1, e^A_2, \dots, e^A_5\}$.
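
A sketch of Eq. (8): the five one-second slices of the log-mel spectrogram are treated as a batch, a GRU runs over the frames of each slice, and the last hidden state of each slice is kept as its representation. The 64-d input and 512-d hidden size follow Section 4.3.

```python
import torch

class AudioEncoder(torch.nn.Module):
    """Eq. (8): GRU over each audio frame set; keep the last hidden state of each set."""
    def __init__(self, mel_dim=64, hidden_dim=512):
        super().__init__()
        self.gru = torch.nn.GRU(mel_dim, hidden_dim, batch_first=True)

    def forward(self, audio_slices):           # (5, frames_per_slice, mel_dim)
        _, h_n = self.gru(audio_slices)        # h_n: (1, 5, hidden_dim)
        return h_n.squeeze(0)                  # (5, hidden_dim): e^A_1, ..., e^A_5
```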

Positional Embedding

To exploit the temporal information in each modality, following [20], we also use positional embedding (PE) by adding it to the output of each encoder.

3.4 Matching Layer

Inspired by recent successful deep learning frameworks [13, 20], we adopt a matching layer which consists of $L$ stacked matching blocks to iteratively learn the attention-aware representation for each modality. The structure of a matching block is shown in part (b) of Figure 3. Each matching block is composed of four parts: a multi-head self-attention, a multi-head cross attention, and two position-wise FNNs. Compared to the basic block defined in [20], our matching block adds a multi-head cross attention and a position-wise FNN. We use these auxiliary mechanisms to learn attention-aware representations from the other modalities.

For simplicity, we take the candidate comment as the example to illustrate the matching layer.

Formally, in the $l$-th block, given the output of the previous matching block corresponding to the candidate comment, $C^{l-1}_y$, we first utilize a multi-head self-attention and a position-wise FNN to learn the context of the candidate comment $C_y$:

$$\hat{C}_y = \text{MultiHead}(C^{l-1}_y, C^{l-1}_y, C^{l-1}_y) \tag{9}$$
$$C_y = \text{FNN}(\hat{C}_y) \tag{10}$$

Similar to Eq. (9) and Eq. (10), we also compute the context vectors of the surrounding comments $C_T$, the visual clip $C_I$, and the audio clip $C_A$. Then we employ a multi-head cross attention to learn the attention-aware representation from each modality:

$$M_T = \text{MultiHead}(C_y, C_T, C_T) \tag{11}$$
$$M_I = \text{MultiHead}(C_y, C_I, C_I) \tag{12}$$
$$M_A = \text{MultiHead}(C_y, C_A, C_A) \tag{13}$$

After getting these three attention-aware representations, we use an MLP to build a fusional gate and combine them with a weighted sum:

$$g = \text{softmax}\big(\text{MLP}([M_T; M_I; M_A])\big) \tag{14}$$
$$M = g_T \odot M_T + g_I \odot M_I + g_A \odot M_A \tag{15}$$

where $\odot$ means the element-wise product and $d$ is the dimension of $M_T$, $M_I$, and $M_A$.

Finally, we feed $M$ into a position-wise FNN to produce the output of the $l$-th matching block corresponding to the candidate comment:

$$C^l_y = \text{FNN}(M) \tag{16}$$

As described above, Eq. (9)-Eq. (16) illustrate how to compute the representation of the candidate comment $y$. In our implementation, we adopt the same procedure to compute the representations of the surrounding comments $T$, the vision clip $I$, and the audio clip $A$.
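
Putting Eqs. (9)-(16) together, one matching block for the candidate-comment branch can be sketched as follows. `MultiHeadAttention` refers to the sketch in Section 2.2; sharing a single cross-attention module across the three context modalities and reducing the gating MLP to one linear layer are simplifying assumptions, since the paper does not spell out these details.

```python
import torch
import torch.nn.functional as F

class MatchingBlock(torch.nn.Module):
    """One matching block (candidate-comment branch): self-attention + FNN,
    cross attention to the three context modalities, gated fusion, final FNN."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.cross_attn = MultiHeadAttention(d_model, h)
        self.ffn1 = torch.nn.Sequential(torch.nn.Linear(d_model, d_ff),
                                        torch.nn.ReLU(),
                                        torch.nn.Linear(d_ff, d_model))
        self.ffn2 = torch.nn.Sequential(torch.nn.Linear(d_model, d_ff),
                                        torch.nn.ReLU(),
                                        torch.nn.Linear(d_ff, d_model))
        self.gate = torch.nn.Linear(3 * d_model, 3)    # fusional gate over the three modalities

    def forward(self, C_y, C_T, C_I, C_A):
        # Eqs. (9)-(10): self-attention and position-wise FNN on the candidate comment.
        C = self.ffn1(self.self_attn(C_y, C_y, C_y))
        # Eqs. (11)-(13): cross attention from the candidate comment to each modality.
        M_T = self.cross_attn(C, C_T, C_T)
        M_I = self.cross_attn(C, C_I, C_I)
        M_A = self.cross_attn(C, C_A, C_A)
        # Eqs. (14)-(15): gated weighted sum of the three attention-aware representations.
        g = F.softmax(self.gate(torch.cat([M_T, M_I, M_A], dim=-1)), dim=-1)
        M = g[..., 0:1] * M_T + g[..., 1:2] * M_I + g[..., 2:3] * M_A
        # Eq. (16): final position-wise FNN produces the block output C^l_y.
        return self.ffn2(M)
```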

3.5 Prediction Layer

The prediction layer outputs a score measuring the matching degree between $(T, I, A)$ and $y$. In this layer, we first employ a weighted pooling to convert the output of the last matching block, $C^L_y = \{c_1, c_2, \dots, c_k\}$, into a fixed-length vector:

$$\alpha_i = \text{softmax}\big(v^\top \tanh(W c_i + b)\big) \tag{17}$$
$$r_y = \sum_i \alpha_i c_i \tag{18}$$

where $v$, $W$, and $b$ are trainable parameters.

Similarly, we get the vectors $r_T$, $r_I$, and $r_A$ for $T$, $I$, and $A$ respectively. Then, we adopt a fusional gate to combine $r_T$, $r_I$, and $r_A$ into $r_v$:

$$g = \text{softmax}\big(\text{MLP}([r_T; r_I; r_A])\big) \tag{19}$$
$$r_v = g_T \odot r_T + g_I \odot r_I + g_A \odot r_A \tag{20}$$

where $d$ is the dimension of $r_T$, $r_I$, and $r_A$.

Finally, we use the cosine similarity to measure the similarity between $r_v$ and $r_y$:

$$f(T, I, A, y) = \frac{r_v \cdot r_y}{\lVert r_v \rVert \, \lVert r_y \rVert} \tag{21}$$
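
A sketch of the prediction layer (Eqs. (17)-(21)). Sharing the weighted-pooling parameters across the four branches and using a single linear layer as the fusional gate are simplifications we make for brevity.

```python
import torch
import torch.nn.functional as F

class PredictionLayer(torch.nn.Module):
    """Weighted pooling per modality, gated fusion of the video-side vectors,
    and cosine similarity with the candidate-comment vector."""
    def __init__(self, d_model=512):
        super().__init__()
        self.pool_w = torch.nn.Linear(d_model, d_model)        # W, b in Eq. (17)
        self.pool_v = torch.nn.Linear(d_model, 1, bias=False)  # v in Eq. (17)
        self.gate = torch.nn.Linear(3 * d_model, 3)

    def weighted_pool(self, C):                                # Eqs. (17)-(18)
        alpha = F.softmax(self.pool_v(torch.tanh(self.pool_w(C))), dim=1)  # (B, len, 1)
        return (alpha * C).sum(dim=1)                          # (B, d_model)

    def forward(self, C_T, C_I, C_A, C_y):
        r_T, r_I, r_A, r_y = map(self.weighted_pool, (C_T, C_I, C_A, C_y))
        # Eqs. (19)-(20): fusional gate over the three video-side vectors.
        g = F.softmax(self.gate(torch.cat([r_T, r_I, r_A], dim=-1)), dim=-1)  # (B, 3)
        r_v = g[:, 0:1] * r_T + g[:, 1:2] * r_I + g[:, 2:3] * r_A
        # Eq. (21): cosine similarity between the fused video vector and the comment vector.
        return F.cosine_similarity(r_v, r_y, dim=-1)           # (B,)
```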

3.6 Training

To learn the model parameters $\theta$, we leverage the max-margin loss function, which can be formulated as:

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \max\big(0,\; \Delta - f(T_i, I_i, A_i, y^+_i) + f(T_i, I_i, A_i, y^-_i)\big) \tag{22}$$

where $N$ is the number of instances in the training set, $y^-_i$ is the negative sample and $y^+_i$ is the positive sample. $\Delta$ is the margin that needs to be specified manually. $\theta$ denotes all the trainable parameters of our model. For training, we employ Adam [14] as the optimizer.
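
The objective of Eq. (22) reduces to a hinge loss over positive/negative score pairs. A minimal sketch, averaging over the batch instead of summing (which only rescales the gradient); the margin value follows Section 4.3.

```python
import torch

def max_margin_loss(pos_scores, neg_scores, margin=0.1):
    """Eq. (22): push the score of the ground-truth comment above the score of a
    sampled negative comment by at least `margin`."""
    return torch.clamp(margin - pos_scores + neg_scores, min=0.0).mean()

# pos_scores / neg_scores: outputs of the prediction layer for the positive and
# the sampled negative comment of each training instance.
```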

4 Experiments

4.1 Dataset

We evaluate our model on a live commenting dataset1 released by [16]. The live commenting dataset is a large-scale video-comment dataset containing 2,361 videos and 895,929 comments. The data is collected from Bilibili, a popular Chinese video streaming website, so the comments are authentic and practical. In our experiments, we use the same partition as [16]. The detailed statistics of the dataset are shown in Table 1.

Train Dev Test Total
#Video 2,161 100 100 2,361
#Comment 820k 42k 34k 896k
#Word 4,419k 248k 193k 4,860k
Avg. Words 5.39 5.85 5.58 5.42
Hours 103.81 5.02 5.01 113.84
Table 1: Statistics of the Live Comment Dataset.
Model Text Vision Audio Recall@1 Recall@5 Recall@10 MR MRR
S2S ✓ ✓ – 12.89 33.78 50.29 17.05 0.2454
Fusional RNN ✓ ✓ – 17.25 37.96 56.10 16.14 0.2710
Unified Transformer ✓ ✓ – 18.01 38.12 55.78 16.01 0.2753
Matching Transformer-C ✓ – – 18.02 42.83 59.37 12.28 0.3087
Matching Transformer-CF ✓ ✓ – 22.77 46.71 62.87 11.19 0.3519
Matching Transformer-CFA ✓ ✓ ✓ 23.52 46.99 64.24 11.05 0.3596
Table 2: The performance comparison on the live commenting dataset (Recall@k, MRR: higher is better; MR: lower is better). Our matching transformer significantly outperforms the baselines in terms of all metrics. Meanwhile, our model achieves better performance than baselines by using the same two modalities.
Text Vision Audio Recall@1 Recall@5 Recall@10 MR MRR
Single-Modal ✓ – – 18.02 42.83 59.37 12.28 0.3087
– ✓ – 18.55 38.38 50.98 16.33 0.2920
– – ✓ 17.95 36.89 50.52 15.33 0.2861
Double-Modal ✓ ✓ – 22.77 46.71 62.87 11.19 0.3519
✓ – ✓ 19.93 44.39 59.68 12.21 0.3276
– ✓ ✓ 18.03 39.00 52.77 15.60 0.2933
Triple-Modal ✓ ✓ ✓ 23.52 46.99 64.24 11.05 0.3596
Table 3: Effect of different modalities used in the Matching Transformer (Recall@k, MRR: higher is better; MR: lower is better). It shows that more modalities always lead to better performance, which indicates that the proposed model can capture the semantic information of different modalities to help the live commenting task.
Model Rel Cor
S2S 2.23 2.91
Fusional RNN 2.95 3.34
Unified Transformer 3.07 3.45
Matching Transformer 3.25 3.57
Human 3.31 4.11
Table 4: Human evaluation results of different models (Rel refers to the relevance score; Cor refers to the correctness score; Human means the natural comments in the dataset.). It shows that the produced comments of our model are more relevant than those of the baseline models. Besides, our model can produce more correct and proper comments.

4.2 Evaluation Metric

Following previous work [8, 16], we adopt Recall@k, Mean Rank (MR), and Mean Reciprocal Rank (MRR) for automatic evaluation, which are standard evaluation metrics for ranking tasks. For testing, we construct a candidate comment set in which each video clip has 100 candidate comments, exactly the same as in previous work [16] for a fair comparison. The candidate comment set is comprised of three parts: (1) the ground-truth comments; (2) the top 20 popular comments; (3) randomly selected comments. We evaluate our model on the test set.
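
For reference, the three automatic metrics can be computed from the rank of the ground-truth comment within each 100-candidate list. The sketch below assumes a single ground-truth comment per clip, which is a simplification of the actual protocol.

```python
def ranking_metrics(gt_ranks, k_values=(1, 5, 10)):
    """Recall@k, Mean Rank (MR) and Mean Reciprocal Rank (MRR), given the 1-based
    rank of the ground-truth comment in each candidate list."""
    n = len(gt_ranks)
    recall = {k: sum(r <= k for r in gt_ranks) / n for k in k_values}
    mean_rank = sum(gt_ranks) / n
    mrr = sum(1.0 / r for r in gt_ranks) / n
    return recall, mean_rank, mrr

# Example: ranks [1, 3, 12] give Recall@1 = 0.33, Recall@5 = 0.67,
# Recall@10 = 0.67, MR = 5.33, MRR ≈ 0.47.
```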

In addition, we also test the performance of our model by human evaluation. Following [16], we use the metrics of relevance (Rel) and correctness (Cor) to evaluate our model. Relevance measures how relevant the produced comments are to the videos, and correctness measures the confidence that the produced comments were made by a human in the context of the video. We do not evaluate the fluency of the produced comments, because our model selects a comment from a candidate comment set, so the output is naturally fluent. For both relevance and correctness, a higher score denotes a better comment. When testing, three human annotators are asked to give a score to the top-one comment produced by our model, and we use the average score as the final result.

(a) Frame 1.
(b) Frame 2.
(c) Frame 3.
Figure 4: An example of the produced comments of different models on a video. Above are three selected frames in the videos. Below are the existing comments in the video and the produced comments of different models.

4.3 Settings

In our experiments, the word embeddings and the video frame vectors have 512 dimensions, while the audio frame vectors have 64 dimensions. The GRU for the audio encoder has 512 hidden dimensions. For positional embedding, we use the fixed sinusoidal positional embedding with a dimension of 512. The word embeddings are randomly initialized and updated during training, while the video frame vectors and audio frame vectors are fixed. There are 6 matching blocks in the matching layer. In each matching block, the number of heads in the multi-head attention is 8 and the inner dimension of the position-wise FNN is 2,048. The margin $\Delta$ is set to 0.1 in our experiments. We employ Adam [14] for training, whose hyper-parameters $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999 respectively. The initial learning rate of Adam is set to 0.00009, and the learning rate is halved when the accuracy on the development set drops. We also employ dropout [18] and layer normalization [4] to reduce the risk of over-fitting. The dropout rate is set to 0.2 and the batch size is 64.

For pre-processing, we use the Stanford tokenizer [17] to tokenize the comments and audio feature extractor2 released by [12] to process the audio clip. During training, we draw 1 negative sample for each video clip.

4.4 Baselines

  • S2S [21] is a traditional sequence to sequence model without the attention mechanism. Specifically, the model uses two encoders to encode visual and textual information respectively. During decoding, the decoder uses the concatenation of the outputs from the two encoders as input.

  • Fusional RNN [16] consists of three parts: a video encoder, a comment encoder and a comment decoder. The three parts are all RNN-based networks and they are related by an attention layer. This model uses the visual and textual context as input.

  • Unified Transformer [16] is a transformer-based generative model. Similar to Fusional RNN, this model is comprised of three parts: a video encoder, a comment encoder and a comment decoder. The difference is that these three parts are all stacked attention-based transformer blocks.

4.5 Overall Results

Table 2 shows the automatic evaluation results of the baseline models and our proposed models. The baselines (S2S, Fusional RNN and Unified Transformer) only use the text and vision of the videos. Our matching transformer leverages three modalities (text, vision, and audio) and significantly outperforms the baselines in terms of all metrics. Moreover, we also report the results of our model with only one modality (fewer than the baselines) and with two modalities (equal to the baselines). It shows that the matching transformer achieves performance comparable to the baselines using only one modality. Meanwhile, our model achieves better performance than the baselines when using the same two modalities, which verifies the effectiveness of the proposed model. Finally, the triple-modality model significantly outperforms the baselines, achieving +5.51 points on Recall@1, +8.87 points on Recall@5 and +8.46 points on Recall@10.

4.6 Effect of Different Modalities

We would also like to know how different modalities contribute to our proposed model. Therefore, we conduct ablation experiments by removing different modalities from our model. Table 3 summarizes the results of the ablation experiments. Under the single-modality setting, the model with the text modality achieves better performance than the other two modalities. Among the possible double-modality combinations, the combination of text and vision obtains the best performance. Finally, the model with all three modalities gets the highest scores in terms of all the automatic metrics. Besides, it is observed that more modalities always lead to better performance, which indicates that the proposed model can capture the semantic information of different modalities to help the live commenting task.

4.7 Human Evaluation

We randomly sample 100 video clips from the test set to evaluate our model in terms of relevance and correctness. For both metrics, we use a score ranging from 1 to 5 to denote the degree, the higher the better. Three human annotators are asked to give a score that evaluates the top-one comment produced by each model, and we use the average score as the reported result.

The result is shown in Table 4. It shows that the produced comments of our model are more relevant than those of the baseline models. Besides, our model can produce more correct and proper comments. The scores of both relevance and correctness degrees are also closer to that of the comments made by human. This result indicates that our model is able to produce the relevant comments to the videos by modeling the relevance in different modalities.

4.8 Case Study

To further compare our model with the baselines, we provide an example for a case study. This example is about a Chinese food called soup dumplings. As illustrated in Figure 4, the example consists of three frames, three surrounding comments, and three target comments. Since the audio is not visible, we do not show the audio part of the video. The surrounding comments are in the first row of the table below the three frames. The second row contains three target comments which are naturally made by human viewers and correspond to a specific time-stamp. We compare the comments produced by different models with the target comments. When we select the top-one output as the produced comment, both the unified transformer and the matching transformer produce a comment relevant to the target comments. However, the comments produced by the fusional RNN and S2S are of low relevance to the video: the output of S2S talks about eggs and the output of the fusional RNN is about dancing, both of which are far away from the video clip. Furthermore, we compare the comments produced by the matching transformer and the unified transformer. According to the case in Figure 4, the comment from the matching transformer is clearly more relevant to the video clip. The matching transformer makes a comment about the soup dumplings, which are exactly the key point of the video clip, while the unified transformer only produces a comment about how to handle dirty soup dumplings that fell on the ground, which does not appear in the video. In conclusion, the comments made by our matching transformer are more relevant and correct than those of the other baselines.

5 Related Work

Automatic live commenting aims to comment on a video clip based on the surrounding comments, the video clip itself, and the corresponding audio clip. This task is similar to image captioning and video captioning, both of which have attracted much attention for a long time.

Image Captioning

Image captioning involves taking an image, analyzing its visual content, and generating a textual description [5]. [26] try to adopt a retrieval-based model to produce a description of an image from a multimodal space. [27] propose a retrieval approach based on the features extracted by VGG. [23] use a CNN-based model to encode the image and an LSTM to generate the description. [15] try to utilize a merging gate to merge the information in the image and the topics.

Video Captioning

Video captioning aims to automatically generate natural language sentences that describe the content of a video [1]. [22] present a CNN-LSTM architecture for generating natural language descriptions of videos. [19] use one LSTM to extract features from video frames and then pass the feature vector through another LSTM for decoding. [24] propose a different neural network architecture based on reinforcement learning for video captioning. [25] release a knowledge-rich video captioning dataset and propose a new knowledge-aware video description network.

Matching Model

A matching model aims to compute the relation between two objects. For text-to-text matching, [6] adopt an LSTM-based model with cross-attention to predict the relation between two sentences. Inspired by the transformer, [28] use self-attention and cross-attention to encode two sentences and model the relation between them. For text-to-image matching, [11] propose a multi-step reasoning model for visual dialog, which measures the similarity between text-image pairs. [2] present a bottom-up and top-down attention mechanism for image captioning and visual question answering. For text-to-audio matching, [3] use transfer learning to learn aligned representations for image, sound, and text. [10] propose a framework that learns joint embeddings from a shared lexico-acoustic space for text and audio.

Despite the similarity to image captioning and video captioning, automatic live commenting has its own characteristics. Compared to existing research, it has more diverse contexts including textual context, visual context and audio context, which is more difficult to tackle. To this end, we propose a multimodal matching transformer model. It can jointly learn the representation of three modalities and the relations among them. Therefore, the proposed model can better integrate information from different angles.

6 Conclusion and Future Work

In this paper, we propose a multimodal matching transformer model for automatic live commenting. It jointly learns the representations of the visual context, the audio context, and the textual context. In addition, the matching transformer model also explicitly leverages the relations among the three modalities to enrich the representation of each one. We evaluate our model on a publicly available live commenting dataset. Experiments show that the proposed multimodal matching transformer significantly outperforms the state-of-the-art approaches.

For future research, we will further investigate the multimodal interactions among vision, audio, and text in real-world applications. Moreover, we believe multimodal pre-training is a promising direction to explore, where tasks like image captioning and video captioning will benefit from pre-trained models.

Footnotes

  1. https://github.com/lancopku/livebot
  2. https://github.com/tensorflow/models/tree/master/research/audioset

References

  1. N. Aafaq, S. Z. Gilani, W. Liu and A. Mian (2018) Video description: a survey of methods, datasets and evaluation metrics. arXiv preprint arXiv:1806.00186. Cited by: §5.
  2. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §5.
  3. Y. Aytar, C. Vondrick and A. Torralba (2017) See, hear, and read: deep aligned representations. arXiv preprint arXiv:1706.00932. Cited by: §5.
  4. J. L. Ba, J. R. Kiros and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.3.
  5. R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat and B. Plank (2016) Automatic description generation from images: a survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research 55, pp. 409–442. Cited by: §5.
  6. Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang and D. Inkpen (2017) Enhanced lstm for natural language inference. In Proc. ACL, Cited by: §5.
  7. J. Chung, C. Gulcehre, K. Cho and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.3.
  8. A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh and D. Batra (2017) Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 326–335. Cited by: §4.2.
  9. J. Devlin, M. Chang, K. Lee and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.
  10. B. Elizalde, S. Zarar and B. Raj (2019) Cross modal audio search and retrieval with joint embeddings based on text and audio. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4095–4099. Cited by: §5.
  11. Z. Gan, Y. Cheng, A. E. Kholy, L. Li, J. Liu and J. Gao (2019) Multi-step reasoning via recurrent dual attention for visual dialog. arXiv preprint arXiv:1902.00579. Cited by: §5.
  12. J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. Cited by: §4.3.
  13. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.3, §3.4.
  14. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.6, §4.3.
  15. F. Liu, X. Ren, Y. Liu, H. Wang and X. Sun (2018) SimNet: stepwise image-topic merging network for generating detailed and comprehensive image captions. arXiv preprint arXiv:1808.08732. Cited by: §1, §5.
  16. S. Ma, L. Cui, D. Dai, F. Wei and X. Sun (2018) Livebot: generating live video comments based on visual and textual contexts. arXiv preprint arXiv:1809.04938. Cited by: §1, §1, §1, §3.3, §3, 2nd item, 3rd item, §4.1, §4.2, §4.2.
  17. C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard and D. McClosky (2014) The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp. 55–60. Cited by: §4.3.
  18. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §4.3.
  19. N. Srivastava, E. Mansimov and R. Salakhudinov (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §5.
  20. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §3.3, §3.4, §3.
  21. S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell and K. Saenko (2015) Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pp. 4534–4542. Cited by: 1st item.
  22. S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney and K. Saenko (2014) Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. Cited by: §5.
  23. O. Vinyals, A. Toshev, S. Bengio and D. Erhan (2015) Show and tell: a neural image caption generator. In Proc. of CVPR, pp. 3156–3164. Cited by: §5.
  24. X. Wang, W. Chen, J. Wu, Y. Wang and W. Yang Wang (2018) Video captioning via hierarchical reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4213–4222. Cited by: §1, §5.
  25. S. Whitehead, H. Ji, M. Bansal, S. Chang and C. Voss (2018) Incorporating background knowledge into video description generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3992–4001. Cited by: §1, §5.
  26. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §5.
  27. S. Yagcioglu, E. Erdem, A. Erdem and R. Cakici (2015) A distributed representation based query expansion approach for image captioning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Vol. 2, pp. 106–111. Cited by: §5.
  28. A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi and Q. V. Le (2018) Qanet: combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541. Cited by: §5.