Live Video Comment Generation Based on Surrounding Frames and Live Comments

Damai Dai
School of EECS, Peking University
daidamai@pku.edu.cn
July 17, 2019
Abstract

In this paper, we propose the task of live comment generation. Live comments are a new form of comments on videos, which can be regarded as a mixture of comments and chats. A high-quality live comment should be not only relevant to the video but also interactive with other users. In this work, we first construct a new dataset for live comment generation. Then, we propose a novel end-to-end model that generates human-like live comments by referring to the video and to other users' comments. Finally, we evaluate our model on the constructed dataset. Experimental results show that our method can significantly outperform the baselines. The dataset and the code will be released to the public if the manuscript is accepted.


1 Introduction

In this paper, we focus on the task of automatically generating live comments. Live comments (also known as "弹幕", or "bullet screen", in Chinese) are a new form of comments that appear on videos. We show an example of live comments in Figure 1. Live comments are popular among young users because they serve not only for sharing opinions but also for chatting. Automatically generating live comments can make a video more interesting and appealing.

Different from other video-to-text tasks, such as video captioning, a live comment appears at a certain point of the video timeline, which gives it some unique characteristics. Live comments can be casual chats with other users about a topic rather than serious descriptions of the video. Therefore, a human-like comment should be not only relevant to the video but also interactive with other users.

Figure 1: An example of live comments. The colorful Chinese characters are the real-time live comments.
Figure 2: An illustration of our joint video and live comment model. We make use of not only the surrounding frames but also the surrounding live comments to generate the target live comment.

In this paper, we aim at generating human-like live comments for videos. We propose a novel end-to-end model to generate the comments by referring to the video and to other users' comments. Because a live comment and its associated frame appear in the context of a series of surrounding frames and live comments, we have access to not only the current frame but also this surrounding context. To make use of the information in both parts, we design a model that encodes the surrounding frames and live comments together into a vector, based on which we decode the new live comment. Experimental results show that our model can generate human-like live comments.

Our contributions are twofold:

  • We propose a new task of automatically generating live comments and build a dataset with videos and live comments for live comment generation.

  • We propose a novel joint video and live comment model to make use of the current frame, the surrounding frames, and the surrounding live comments to generate a new live comment. Experimental results show that our model can generate human-like live comments.

2 Proposed Model

In Figure 2, we show the architecture of our model. Our live comment generation model is composed of four parts: a video encoder, a text encoder, a gated component, and a live comment generator. The video encoder encodes the consecutive frames into a vector, and the text encoder encodes the surrounding live comments into another vector. The gated component aggregates the video and the comments into a joint representation. Finally, the live comment generator generates the target live comment.

2.1 Video Encoder

In our task, each generated live comment is attached to a sequence of consecutive frames $I_1, \dots, I_n$. In the video encoding part, each frame is first encoded into a vector $f_i$ by convolution layers. We then use a GRU (Cho et al., 2014) layer to encode all the frame vectors into their hidden states $h^v_i$:

$f_i = \mathrm{CNN}(I_i)$  (1)
$h^v_i = \mathrm{GRU}(f_i, h^v_{i-1})$  (2)

We set the last hidden state $h^v_n$ as the representation of the video, denoted as $v$.
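For illustration, the following is a minimal PyTorch-style sketch of this video encoder; the small CNN, the pooling step, and the layer sizes are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Sketch of the video encoder: a CNN per frame followed by a GRU over frame vectors.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    def __init__(self, frame_channels=3, hidden_size=300):
        super().__init__()
        # Convolutional layers map each raw frame to a feature vector (Eq. 1).
        self.cnn = nn.Sequential(
            nn.Conv2d(frame_channels, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # (batch*frames, 32, 1, 1)
        )
        self.proj = nn.Linear(32, hidden_size)
        # A GRU runs over the sequence of frame vectors (Eq. 2).
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, frames):
        # frames: (batch, num_frames, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.view(b * t, c, h, w)).flatten(1)  # (b*t, 32)
        feats = self.proj(feats).view(b, t, -1)                   # (b, t, hidden)
        _, last_hidden = self.gru(feats)                          # (1, b, hidden)
        return last_hidden.squeeze(0)                             # video vector v

# Example: encode a batch of 2 clips, each with 5 RGB frames of size 64x64.
if __name__ == "__main__":
    enc = VideoEncoder()
    v = enc(torch.randn(2, 5, 3, 64, 64))
    print(v.shape)  # torch.Size([2, 300])
```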

2.2 Text Encoder

In the comment encoding part, each live comment $c_j$ with $m$ words is first encoded into a series of word-level hidden states ($h^w_1, \dots, h^w_m$) using a word-level GRU layer. We use the last hidden state $h^w_m$ as the representation of $c_j$, denoted as $r_j$. Then we use a sentence-level GRU layer to encode the vectors of the different live comments into their hidden states $h^c_j$:

$r_j = \mathrm{GRU}_{word}(c_j)$  (3)
$h^c_j = \mathrm{GRU}_{sent}(r_j, h^c_{j-1})$  (4)

The last hidden state $h^c_k$ is used as the representation of all the live comments, denoted as $s$.
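A minimal sketch of this hierarchical text encoder is given below, assuming a PyTorch implementation; the vocabulary, embedding, and hidden sizes are illustrative choices.

```python
# Sketch of the text encoder: a word-level GRU per comment and a
# sentence-level GRU over the resulting comment vectors.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=34100, embed_size=300, hidden_size=300):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        # Word-level GRU encodes each live comment into a vector r_j (Eq. 3).
        self.word_gru = nn.GRU(embed_size, hidden_size, batch_first=True)
        # Sentence-level GRU encodes the sequence of comment vectors (Eq. 4).
        self.sent_gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, comments):
        # comments: (batch, num_comments, num_words) of word ids
        b, n, m = comments.shape
        emb = self.embedding(comments.view(b * n, m))    # (b*n, m, embed)
        _, word_h = self.word_gru(emb)                   # (1, b*n, hidden)
        comment_vecs = word_h.squeeze(0).view(b, n, -1)  # (b, n, hidden)
        _, sent_h = self.sent_gru(comment_vecs)          # (1, b, hidden)
        return sent_h.squeeze(0)                         # comment vector s

if __name__ == "__main__":
    enc = TextEncoder()
    s = enc(torch.randint(0, 34100, (2, 5, 12)))  # 2 frames, 5 comments, 12 words each
    print(s.shape)                                # torch.Size([2, 300])
```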

2.3 Gated Selection

In order to decide how much information we should take from the video and from the live comments, we apply a gated multi-layer perceptron (MLP) to combine $v$ and $s$ and obtain the final vector $e$:

$g = \sigma(W_g [v; s] + b_g)$  (5)
$\tilde{e} = g \odot v + (1 - g) \odot s$  (6)
$e = \tanh(W_e \tilde{e} + b_e)$  (7)

where $W_g$, $W_e$, $b_g$, and $b_e$ are trainable parameters, $\sigma$ is the sigmoid function, $[\cdot;\cdot]$ denotes concatenation, and $\odot$ denotes element-wise multiplication.
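The sketch below illustrates this gated combination in PyTorch, following the formulation in Equations (5)-(7); the exact parameterization of the gate is an assumption, and the final projection layer is included only to mirror the MLP described above.

```python
# Sketch of the gated selection between the video vector v and the comment vector s.
import torch
import torch.nn as nn

class GatedSelection(nn.Module):
    def __init__(self, hidden_size=300):
        super().__init__()
        # The gate is computed from the concatenation of v and s (Eq. 5).
        self.gate = nn.Linear(2 * hidden_size, hidden_size)
        # A final projection produces the joint vector e (Eq. 7).
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, v, s):
        # v, s: (batch, hidden)
        g = torch.sigmoid(self.gate(torch.cat([v, s], dim=-1)))  # Eq. (5)
        mixed = g * v + (1.0 - g) * s                             # Eq. (6)
        return torch.tanh(self.proj(mixed))                       # Eq. (7)

if __name__ == "__main__":
    gate = GatedSelection()
    e = gate(torch.randn(2, 300), torch.randn(2, 300))
    print(e.shape)  # torch.Size([2, 300])
```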

2.4 Live Comment Generator

We use a GRU to decode the live comment. The encoders represent the frames and the live comments jointly as the vector $e$. The probability of generating a sentence $y = (y_1, \dots, y_T)$ given the encoded vector $e$ is defined as

$P(y \mid e) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, e)$  (8)

More specifically, the probability distribution of the word $y_t$ is calculated as follows,

$s_t = \mathrm{GRU}(y_{t-1}, s_{t-1})$  (9)
$P(y_t \mid y_{<t}, e) = \mathrm{softmax}(W_o s_t + b_o)$  (10)

where the initial decoder state $s_0$ is derived from $e$, and $W_o$ and $b_o$ are trainable parameters.
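A minimal sketch of the generator is shown below, assuming PyTorch; the linear projection from the joint vector to the decoder state is an illustrative assumption (the encoders above produce 300-dimensional vectors while the decoder uses a 600-dimensional hidden state), and the sketch returns pre-softmax logits for use with a cross-entropy loss.

```python
# Sketch of the GRU live comment generator (Eqs. 8-10) with teacher-forced inputs.
import torch
import torch.nn as nn

class CommentGenerator(nn.Module):
    def __init__(self, vocab_size=34100, embed_size=300,
                 enc_size=300, hidden_size=600):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Project the joint encoder vector e to the decoder hidden size (assumption).
        self.init_h = nn.Linear(enc_size, hidden_size)
        self.gru = nn.GRUCell(embed_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, e, targets):
        # e: (batch, enc_size); targets: (batch, T) gold word ids (teacher forcing).
        h = torch.tanh(self.init_h(e))
        logits = []
        for t in range(targets.size(1)):
            h = self.gru(self.embedding(targets[:, t]), h)  # Eq. (9)
            logits.append(self.out(h))                      # Eq. (10), pre-softmax
        return torch.stack(logits, dim=1)                   # (batch, T, vocab)

if __name__ == "__main__":
    gen = CommentGenerator()
    scores = gen(torch.randn(2, 300), torch.randint(0, 34100, (2, 7)))
    print(scores.shape)  # torch.Size([2, 7, 34100])
```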

3 Experiments

In this section, we show the experimental results of our proposed model and compare it with three baselines on the dataset we construct from Youku (http://www.youku.com).

3.1 Live Comment Dataset Construction

Video frames: We extract frames from an animated TV series named “Tianxingjiuge” (“天行九歌”) at a frequency of 1 frame per second, obtaining frames from 40 videos in total, each frame with a fixed shape. We split the frames into a training set and a test set of 600 frames.

Live comments: We use the developer tools in Google Chrome to manually locate the live comment sources, from which we obtain all live comments of the 40 videos. For each extracted frame, we select the 5 live comments whose timestamps are nearest to the frame.
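The alignment step can be sketched as follows, assuming OpenCV is available; the file name, the comment record layout, and the OpenCV-based extraction are illustrative assumptions rather than our actual preprocessing scripts.

```python
# Sketch of the preprocessing: extract one frame per second and attach the
# 5 live comments whose timestamps are closest to each frame.
import cv2  # pip install opencv-python

def extract_frames(video_path, fps=1):
    """Yield (timestamp_in_seconds, frame) pairs at roughly `fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = int(round(native_fps / fps))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / native_fps, frame
        index += 1
    cap.release()

def nearest_comments(frame_time, comments, k=5):
    """Return the k comments whose timestamps are closest to the frame."""
    # comments: list of dicts like {"time": 12.3, "text": "..."} (assumed layout)
    return sorted(comments, key=lambda c: abs(c["time"] - frame_time))[:k]

if __name__ == "__main__":
    comments = [{"time": t, "text": f"comment at {t}s"} for t in range(0, 60, 3)]
    for ts, _frame in extract_frames("episode01.mp4", fps=1):  # hypothetical file
        print(ts, [c["text"] for c in nearest_comments(ts, comments)])
```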

Reference set: Besides the live comments in our training and test sets, we crawl extra live comments to serve as the reference set for calculating the BLEU score and perplexity, which evaluate the fluency of the generated live comments (refer to Table 2).

Copyright statement: The dataset we construct can only be used for scientific research. The copyright belongs to the original website Youku.

3.2 Baselines

Besides the model described in Section 2, we have three baseline methods:

Frame-to-Comment (F2C) (Vinyals et al., 2015) applies a CNN to encode the current frame to a vector, based on which the decoder generates the target live comment.

Moment-to-Comment (M2C) applies an RNN to encode one live comment near the current frame, in addition to the CNN for the frame. The two encoded vectors are concatenated to form the initial hidden state of the decoder.

Context-to-Comment (C2C) is similar to Venugopalan et al. (2015); it makes use of a series of surrounding frames and live comments by encoding them with additional higher-level RNNs.

3.3 Evaluation Metrics

We design two types of evaluation metrics: human evaluation and automatic evaluation.

Human Evaluation: We evaluate three aspects. Fluency measures whether the generated live comments are fluent, setting aside their relevance to the video. Relevance measures the relevance between the generated live comments and their associated frames. Overall Score measures the overall confidence that a generated live comment could have been made by a human in the context of the video. For all three aspects, the score is an integer on a fixed scale; the higher, the better. The scores are given by three seasoned native speakers, and we take the average over the three raters as the final result.

Automatic Evaluation: We adopt two metrics: the BLEU score (Papineni et al., 2002) and perplexity. These two metrics measure whether the generated live comments accord with a human-like style. For the BLEU score, we calculate the BLEU-4 score of each generated live comment against every live comment in the reference set and take the maximum as its final score. Perplexity measures the language quality of the generated live comments and is estimated as

$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid w_{<t})\right)$,

where $N$ is the number of words in the sentence and $w_t$ is the $t$-th predicted word.
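A minimal sketch of the two automatic metrics is given below, assuming NLTK is available; the tokenization and the smoothing function are illustrative choices, not necessarily those used in our evaluation.

```python
# Sketch of maximum BLEU-4 against a reference set, and perplexity from
# per-token log-probabilities.
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def max_bleu4(generated, reference_set):
    """Max BLEU-4 of one generated comment against every reference comment."""
    smooth = SmoothingFunction().method1
    hyp = generated.split()
    return max(
        sentence_bleu([ref.split()], hyp,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
        for ref in reference_set
    )

def perplexity(token_log_probs):
    """Perplexity computed from log P(w_t | w_<t) for each word in a sentence."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

if __name__ == "__main__":
    refs = ["this scene is so good", "the music here is great"]
    print(max_bleu4("this scene is great", refs))
    print(perplexity([-1.2, -0.8, -2.0, -0.5]))
```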

3.4 Experimental Details

The vocabulary is limited to the 34,100 most common words in the training dataset. We share the word embeddings between the encoder and the decoder and set the word embedding size to 300. The embeddings are randomly initialized and learned during training.

For the 3 GRUs used in the encoding stage, we set the hidden size to 300. For the decoding GRU, we set the hidden size to 600. For the encoding CNN, we use 3 convolution layers and 3 linear layers and obtain a final vector of size 300. The batch size is 512. We train the model with the Adam optimizer (Kingma and Ba, 2014); the learning rate, the two momentum parameters, and ε are fixed beforehand. During training, we use teacher forcing (Williams and Zipser, 1989) with a fixed ratio to make the model converge faster.
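For illustration, a minimal sketch of one training step with Adam, cross-entropy loss, and teacher forcing is shown below; the model interface, the padding id, and the learning rate are illustrative assumptions, not our exact training code.

```python
# Sketch of one training step with Adam and a teacher-forcing ratio.
import random
import torch
import torch.nn as nn

def train_step(model, optimizer, batch, teacher_forcing_ratio=1.0):
    """One optimization step. `model(batch, use_teacher_forcing=...)` is an
    assumed interface returning vocabulary logits of shape (batch, T, vocab)."""
    criterion = nn.CrossEntropyLoss(ignore_index=0)  # 0 = padding id (assumption)
    use_tf = random.random() < teacher_forcing_ratio
    logits = model(batch, use_teacher_forcing=use_tf)
    targets = batch["comment_ids"]                   # (batch, T) gold word ids
    loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical setup (illustrative learning rate):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```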

3.5 Results and Analysis

Model Fluency Relevance Overall
F2C 3.80 1.73 2.56
M2C 3.88 1.74 2.58
C2C 4.11 1.83 2.94
Proposal 4.45 1.95 3.30
Human 4.84 2.60 3.87
Table 1: Results of human evaluation metrics on the test set. Human means the real-world live comments from videos. Overall is a comprehensive metric given by our annotators.

As shown in Table 1, our model achieves the highest scores among all models on all three aspects. When only the current frame or one extra live comment is considered (F2C and M2C), the generated live comments receive low scores. After considering more surrounding frames and live comments (C2C), all of the scores become higher. Finally, with the gate mechanism that automatically decides the weights of the surrounding frames and live comments, our proposal achieves the highest scores, which are close to those of real-world live comments. We use Spearman's rank correlation coefficient to evaluate the agreement among the raters. The coefficients between any two raters are all around 0.6, with an average of 0.63. These coefficients show that our human evaluation scores are consistent and credible.
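The inter-rater agreement computation can be sketched as follows, assuming SciPy; the example scores are made up for illustration.

```python
# Sketch of average pairwise Spearman's rank correlation among raters.
from itertools import combinations
from scipy.stats import spearmanr

def pairwise_spearman(rater_scores):
    """Average Spearman's rho over all pairs of raters.
    rater_scores: one list of scores per rater, aligned by evaluated item."""
    rhos = [spearmanr(a, b).correlation
            for a, b in combinations(rater_scores, 2)]
    return sum(rhos) / len(rhos)

if __name__ == "__main__":
    scores = [[4, 3, 5, 2, 4], [5, 3, 4, 2, 4], [4, 2, 5, 3, 5]]  # made-up scores
    print(pairwise_spearman(scores))
```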

Relevance: The relevance scores presented in Table 1 show that the live comments generated by all models are not highly relevant, which means that many of the generated live comments are not relevant to the current frame. We go through the real live comments and find that about 75.7% of them are not relevant to the current frame but are just chatting. In fact, Table 1 shows that the relevance score of the real live comments is not high either. Therefore, the low relevance scores are reasonable. Still, our proposal generates more relevant live comments, owing to its ability to combine information from the surrounding frames and live comments.

Fluency: From the fluency score presented in Table 1 and the BLEU-4 and perplexity scores presented in Table 2, we can see that our proposal generates live comments that best accord with a human-like style.

Informativeness: From the average length in Table 2, we can see that our proposal increases the length of the generated live comments, which indicates that they carry more meaningful information.

Model BLEU-4 Perplexity Average Length
F2C 63.8 181.15 4.50
M2C 65.4 146.15 4.28
C2C 77.2 93.82 4.72
Proposal 83.5 56.85 5.17
Human 97.9 41.01 5.86
Table 2: Automatic evaluation results of the BLEU-4 scores, perplexities and average lengths of live comments on the test set.

4 Related Work

Inspired by the great success of the sequence-to-sequence learning framework in machine translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014), Vinyals et al. (2015) and Mao et al. (2014) proposed to use a deep convolutional neural network (CNN) to encode the image and a recurrent neural network (RNN) to generate image captions. Xu et al. (2015) further applied an attention mechanism to focus on certain parts of the image during decoding. Using a CNN to encode the image and an RNN to decode the description is natural and effective for generating textual descriptions.

One task that is similar to live comment generation is image caption generation, an area that has been studied for a long time. Farhadi et al. (2010) generated descriptions of an image by retrieving from a large sentence pool. Kulkarni et al. (2011) generated descriptions based on the parsing result of the image with a simple language model. These systems are often built in a pipeline fashion, and the generated descriptions are not creative. More recent work uses a stepwise merging network to improve the performance (Liu et al., 2018).

Another task similar to this work is video caption generation. Venugopalan et al. (2015) proposed to use a CNN to extract image features and an LSTM to encode them and decode a sentence. Similar models (Shetty and Laaksonen, 2016; Jin et al., 2016; Ramanishka et al., 2016; Dong et al., 2016; Pasunuru and Bansal, 2017; Shen et al., 2017) have also been proposed for video caption generation. Das et al. (2017a, b) introduced the task of Visual Dialog, which requires an AI agent to answer a question about an image, given the image and a dialog history.

We cast this problem as a natural language generation problem, and we are also inspired by recent work on natural language generation models with text inputs (Ma et al., 2018a, b; Xu et al., 2018a, b).

5 Conclusion

In this paper, we propose the task of live comment generation. In order to generate high-quality comments, we propose a novel neural model that makes use of the surrounding frames in the video and the surrounding live comments. Experimental results show that our model performs better than the baselines on various metrics and even approaches human performance.

References

  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Das et al. (2017a) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. 2017a. Visual dialog. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, volume 2017-January, pages 1080–1089.
  • Das et al. (2017b) Abhishek Das, Satwik Kottur, Jose M.F. Moura, Stefan Lee, and Dhruv Batra. 2017b. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. In Proceedings of the IEEE International Conference on Computer Vision, volume 2017-October, pages 2970–2979.
  • Dong et al. (2016) Jianfeng Dong, Xirong Li, Weiyu Lan, Yujia Huo, and Cees GM Snoek. 2016. Early embedding and late reranking for video captioning. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1082–1086. ACM.
  • Farhadi et al. (2010) Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David A Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 15–29.
  • Jin et al. (2016) Qin Jin, Jia Chen, Shizhe Chen, Yifan Xiong, and Alexander Hauptmann. 2016. Describing videos using multi-modal fusion. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1087–1091. ACM.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Kulkarni et al. (2011) Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1601–1608.
  • Liu et al. (2018) Fenglin Liu, Xuancheng Ren, Yuanxin Liu, Houfeng Wang, and Xu Sun. 2018. Stepwise image-topic merging network for generating detailed and comprehensive image captions. In EMNLP 2018.
  • Ma et al. (2018a) Shuming Ma, Xu Sun, Wei Li, Sujian Li, Wenjie Li, and Xuancheng Ren. 2018a. Query and output: Generating words by querying distributed word representations for paraphrase generation. CoRR, abs/1803.01465.
  • Ma et al. (2018b) Shuming Ma, Xu Sun, Yizhong Wang, and Junyang Lin. 2018b. Bag-of-words as target for neural machine translation. CoRR, abs/1805.04871.
  • Mao et al. (2014) Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L Yuille. 2014. Explain images with multimodal recurrent neural networks. arXiv: Computer Vision and Pattern Recognition.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Pasunuru and Bansal (2017) Ramakanth Pasunuru and Mohit Bansal. 2017. Multi-task video captioning with video and entailment generation. arXiv preprint arXiv:1704.07489.
  • Ramanishka et al. (2016) Vasili Ramanishka, Abir Das, Dong Huk Park, Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, and Kate Saenko. 2016. Multimodal video description. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1092–1096. ACM.
  • Shen et al. (2017) Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue. 2017. Weakly supervised dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, page 10.
  • Shetty and Laaksonen (2016) Rakshith Shetty and Jorma Laaksonen. 2016. Frame-and segment-level features and candidate pool evaluation for video caption generation. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1073–1076. ACM.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Venugopalan et al. (2015) Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pages 4534–4542.
  • Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164.
  • Williams and Zipser (1989) Ronald J. Williams and David Zipser. 1989. Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1(1):87–111.
  • Xu et al. (2018a) Jingjing Xu, Xu Sun, Xuancheng Ren, Junyang Lin, Bingzhen Wei, and Wei Li. 2018a. DP-GAN: diversity-promoting generative adversarial network for generating informative and diversified text. CoRR, abs/1802.01345.
  • Xu et al. (2018b) Jingjing Xu, Xu Sun, Qi Zeng, Xuancheng Ren, Xiaodong Zhang, Houfeng Wang, and Wenjie Li. 2018b. Unpaired sentiment-to-sentiment translation: A cycled reinforcement learning approach. CoRR, abs/1805.05181.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.