Using Inter-Sentence Diverse Beam Search to Reduce Redundancy in Visual Storytelling

Chao-Chun Hsu, Szu-Min Chen, Ming-Hsun Hsieh, Lun-Wei Ku
Academia Sinica, Taiwan
{joe32140, b02902026, troutman, lwku}

Visual storytelling involves two important components: coherence between the story and the images, and the structure of the story itself. In image-to-text neural models, similar images in a sequence provide nearly identical information to the story generator, which then produces nearly identical sentences. However, repeatedly narrating the same objects or events undermines good story structure. In this paper, we propose an inter-sentence diverse beam search to generate more expressive stories. Compared to recent visual storytelling models, which generate each sentence without considering the sentence generated for the previous picture, our proposed method avoids producing identical sentences even when given a sequence of similar pictures.


1 Introduction

Visual storytelling has gained popularity in recent years. The task requires machines both to comprehend the content of a stream of images and to produce a narrative story. Rapid recent progress in neural networks has enabled models to achieve promising performance on image captioning (Xu et al., 2015; Vinyals et al., 2017), which is also an image-to-text problem. Nonetheless, visual storytelling is considerably more difficult: beyond generating each sentence in isolation, forming a complete story requires taking into account coherence, the main focus of the story, and even creativity.

Park and Kim (2015) viewed this problem as a retrieval task, incorporating information about discourse entities to model coherence. Liu et al. (2016) designed a variation of the GRU that can "skip" some inputs, strengthening its ability to handle longer dependencies. However, retrieval-based methods are less general, being limited to previously seen sentences. In a task where the output varies dramatically, a more flexible generation approach seems more appropriate. Huang et al. (2016) published a large dataset of photo streams, each paired with a story, and proposed a neural baseline for this task based on the sequence-to-sequence framework: all images are first encoded, and the encoder hidden state is then used as the initial hidden state of a decoder GRU that produces the whole story. Beyond visual grounding, we expect generated stories to be as similar as possible to those written by humans. This can be broken down into characteristics such as style, repeated use of words, and how detailed the paragraph is. Since it is hard to enumerate all possible features, Wang et al. (2018a) adopt the SeqGAN framework with two discriminators: one is responsible for the degree of matching between an image and a sentence, while the other focuses on the text alone, trying to mimic human language style. More recent work analyzes the relation between automatic metrics such as BLEU (Papineni et al., 2002) and human evaluation; Wang et al. (2018b) use an inverse reinforcement learning framework, attempting to "learn" a reward through an adversarial network that approximates the criteria of human judgment.

However, these neural models aim to learn a human distribution, have nothing to do with creativity, and consume considerable computing resources and time. In this paper, we propose an inter-sentence diverse beam search that generates interesting sentences and avoids redundancy given a sequence of similar photos. Whereas diverse beam search generates different groups of sentences for a single condition, our inter-sentence version focuses on producing varied sub-stories across a sequence of images. The proposed method improves the max METEOR score on the visual storytelling dataset from 29.3% to 31.7% compared to the baseline model.

the friends were excited to go out for a night .
we had a lot of fun .
we had a great time .
we had a great time .
we all had a great time .
Table 1: Repetitive story generated by the baseline model.

2 Method

2.1 Basic Architecture

The baseline model is an encoder-decoder framework, as shown in Figure 1. We then apply the proposed method on top of this baseline.


We utilize a pretrained ResNet-152 and extract the output of the second-to-last layer to form a 2048-dimensional image representation vector. Since a story is composed of five images, the model should take their order into consideration as well as memorize the content of previous pictures. Thus, a bidirectional GRU takes the five image vectors as input and produces five context-aware image embeddings.
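The encoding step can be sketched in NumPy as follows. This is a schematic, not the trained model: the weights are random, bias terms are omitted, and the 256-dimensional hidden size is taken from the hyperparameters in Section 4.1.

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step (biases omitted for brevity)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)             # update gate
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h)) # candidate state
    return (1 - z) * h + z * h_tilde

def bidirectional_context(images, params_f, params_b, hidden=256):
    """Run a GRU over the five image vectors in both directions and
    concatenate the two states into context-aware embeddings."""
    n = len(images)
    hf, forward = np.zeros(hidden), []
    for x in images:                          # left-to-right pass
        hf = gru_step(x, hf, *params_f)
        forward.append(hf)
    hb, backward = np.zeros(hidden), [None] * n
    for i in range(n - 1, -1, -1):            # right-to-left pass
        hb = gru_step(images[i], hb, *params_b)
        backward[i] = hb
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

rng = np.random.default_rng(0)
# Wz, Uz, Wr, Ur, Wh, Uh with input dim 2048 and hidden dim 256
make = lambda: tuple(rng.normal(scale=0.01, size=(256, d))
                     for d in (2048, 256, 2048, 256, 2048, 256))
images = [rng.normal(size=2048) for _ in range(5)]  # ResNet-152 features
ctx = bidirectional_context(images, make(), make())
```

Each of the five resulting embeddings is 512-dimensional (forward and backward states concatenated) and depends on the whole photo sequence, not only on its own image.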


Given an image embedding, another GRU produces one sentence at a time. To form a complete story, we first generate five sub-stories separately and then concatenate them. We tried both using the image embedding as the initial hidden state and concatenating the image embedding with the word embedding as the input at each time step. The latter yields a smoother validation curve, so we adopted this setting in our final submission.
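The two conditioning options differ only in where the image embedding enters the decoder, as this minimal shape sketch shows. The 256-dimensional sizes follow Section 4.1; the word-embedding size is our assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
image_emb = rng.normal(size=256)   # context-aware image embedding
word_emb = rng.normal(size=256)    # embedding of the previous word (size assumed)

# Option 1: the image embedding initializes the decoder hidden state;
# the per-step input is then the word embedding alone.
h0 = image_emb

# Option 2 (adopted): concatenate image and word embeddings at every
# step, so the decoder is reminded of the image throughout decoding.
step_input = np.concatenate([image_emb, word_emb])  # shape (512,)
```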

2.2 Decoding Techniques

We noticed a major issue with the baseline model: the generated sentences may repeat themselves, as in Table 1, because of similar images or generic story lines (e.g., "i have a great time"). To overcome this, we propose a variation of beam search that takes previously generated sentences into account.

Inspired by the diverse beam search method (Vijayakumar et al., 2016), when decoding one sub-story our model is aware of the words used in the previous, incomplete story; it then calculates a score from each candidate next word's probability and a diversity penalty, discussed below. The model then reranks the candidates according to these scores and selects the next word. In our setting, we represent the previous sentences as a bag of words and adopt the relatively simple Hamming diversity, which penalizes a selected token in proportion to its previous occurrences, though any kind of penalty can be plugged into this framework. Unlike Vijayakumar et al. (2016), whose work addresses intra-sentence diversity, ours focuses on inter-sentence diversity, which is more suitable for this task.

Inter-sentence Diverse Beam Search

For decoding the first sentence, we perform a regular beam search with no diversity penalty. For each subsequent sentence, we consider diversity with respect to all previously generated sentences.

Given a finite vocabulary $\mathcal{V}$ and an input $x$, the model outputs a sequence $y$ based on the probability $P(y \mid x)$, where $y = (y_1, \ldots, y_T)$ with $y_t \in \mathcal{V}$. Let $\theta(y_t) = \log p(y_t \mid y_{t-1}, \ldots, y_1, x)$; we denote the partial sequence $(y_1, \ldots, y_t)$ by $y_{[t]}$ and its accumulated log-probability by $\Theta(y_{[t]}) = \sum_{\tau \le t} \theta(y_\tau)$. At each time step $t$, the set $Y^i_{[t]}$ is updated by considering the set $Y^i_{[t-1]} \times \mathcal{V}$ for the $i$-th image with $B$ beams. Diverse beam search adds a diversity penalty by scoring sequences with a diversity function $\Delta(\cdot)$ multiplied by the diversity strength $\lambda$. Using the notation of Vijayakumar et al. (2016), each time step of this process for image $i$ can be presented as

$$Y^i_{[t]} = \operatorname*{argmax}_{y^i_{1,[t]}, \ldots, y^i_{B,[t]}} \sum_{b=1}^{B} \Big[ \Theta\big(y^i_{b,[t]}\big) + \lambda \sum_{j=1}^{i-1} \Delta\big(y^i_{b,[t]}, y^j\big) \Big],$$

where $y^j$ denotes the completed sub-story for image $j$.
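The inter-sentence decoding described above can be sketched in plain Python. The "language model" below is a toy placeholder, not the paper's trained network: it always prefers the same words, mimicking the generic-sentence failure mode. Hamming diversity is implemented as each candidate token's count in the bag of words of previous sub-stories; the first sub-story incurs no penalty because that bag is empty.

```python
from collections import Counter
import math

def inter_sentence_diverse_beam_search(step_logprobs, bow_prev, beam=3,
                                       lam=2.0, max_len=5, eos="</s>"):
    """Beam search over one sub-story: each candidate token is scored by
    its log-probability minus lam times its count in previous sub-stories."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:          # finished beam carries over
                candidates.append((seq, score))
                continue
            for tok, lp in step_logprobs(seq).items():
                penalty = lam * bow_prev[tok]   # Hamming diversity
                candidates.append((seq + [tok], score + lp - penalty))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return beams[0][0]

# Toy model: always prefers "great", regardless of context.
def toy_lm(seq):
    return {"great": math.log(0.5), "fun": math.log(0.3),
            "peaceful": math.log(0.2)}

story_bow, story = Counter(), []
for _ in range(3):                               # three one-word sub-stories
    sent = inter_sentence_diverse_beam_search(toy_lm, story_bow, max_len=1)
    story.append(sent)
    story_bow.update(sent)                       # update the shared bag of words
```

Without the penalty the toy model would emit "great" three times; with it, each sub-story picks a fresh word, which is exactly the redundancy reduction the method targets.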



baseline:
the trees were beautiful.
we took a picture of the mountains.
we took a picture of the mountains.
the mountains were beautiful and the.
the river was the lake was beautiful.

ours:
i went on a hike last week.
we had to take a picture of the mountains.
they took pictures of the lake and scenery.
this is my favorite part of the trip with my wife, i've never seen such a beautiful view!
it was very peaceful and serene.

Table 2: Generated stories from the two models.
Figure 1: The architecture of the baseline model.

In our experiments, we applied Hamming diversity for $\Delta$ and found that $\lambda = 2$ achieved better results. Our proposed inter-sentence diverse beam search is illustrated in Algorithm 1.

3 Visual Storytelling Dataset

The Visual Storytelling dataset (VIST) is the first dataset of image sequences paired with text, with two kinds of annotation: (1) descriptions of images-in-isolation (DII); and (2) stories for images-in-sequence (SIS). The dataset contains more than 20,000 unique photos in 50,000 sequences. We eliminate the stories that contain broken images.

4 Experiments and results

4.1 Model Setup

We first scale the images to 224×224 and apply horizontal flipping during training. The images are then normalized to fit the pretrained ResNet-152 CNN. For the hyperparameters, the image feature size is 256 and the hidden size of the decoder GRU cell is 512. Words appearing more than three times in the corpus are added to the vocabulary. We use the Adam optimizer with a learning rate of 2e-4. Scheduled sampling and batch normalization are used during training.
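The setup above can be collected into a single config sketch. The values come from the text; the key names themselves are illustrative, not from any released code.

```python
# Hyperparameters from Section 4.1; key names are illustrative.
CONFIG = {
    "image_size": (224, 224),
    "augmentation": ["horizontal_flip"],   # training-time only
    "cnn": "resnet152",                    # pretrained feature extractor
    "image_feature_dim": 256,
    "decoder_hidden_dim": 512,
    "vocab_min_count": 4,                  # words appearing more than 3 times
    "optimizer": "adam",
    "learning_rate": 2e-4,
    "scheduled_sampling": True,
    "batch_norm": True,
}
```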

4.2 Results

Every photo sequence in the test set has 2 to 5 reference stories. We evaluate our models by the maximum METEOR score over the references for each photo sequence. As shown in Table 3, inter-sentence diverse beam search improves the max METEOR score from 29.3 to 31.7. Beyond the improvement on the metric, the proposed method generates more interesting sentences (Table 2).

        Baseline  Ours
METEOR  29.3      31.7
Table 3: Max METEOR score (%).
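The "max over references" aggregation can be sketched as follows. Here `unigram_f1` is only a stand-in scorer so the sketch is self-contained; real METEOR additionally matches stems and synonyms and applies a fragmentation penalty. Only the aggregation mirrors the evaluation described above.

```python
def unigram_f1(hyp, ref):
    """Stand-in sentence scorer (NOT METEOR): unigram-overlap F1."""
    hyp_set, ref_set = set(hyp.split()), set(ref.split())
    overlap = len(hyp_set & ref_set)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(hyp_set), overlap / len(ref_set)
    return 2 * p * r / (p + r)

def max_score(story, references, score=unigram_f1):
    """Best score over the 2-5 reference stories of one photo sequence."""
    return max(score(story, ref) for ref in references)

def corpus_score(stories, all_refs):
    """Average the per-sequence max scores over the test set."""
    return sum(max_score(s, r) for s, r in zip(stories, all_refs)) / len(stories)
```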

5 Conclusion and Future Work

We proposed a new decoding approach for the visual storytelling task that avoids generating repeated information across the photo sequence; instead, our model attempts to produce a diverse expression for each image.

Nevertheless, the value of the diversity weight currently requires human heuristics to set the trade-off between diversity and output sequence probability. Future efforts should be dedicated to a data-driven method that lets machines learn to attend either to the sequence probability or to the distinct details in the image.


References

  • Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233–1239.
  • Yu Liu, Jianlong Fu, Tao Mei, and Chang Wen Chen. 2016. Storytelling of photo stream with bidirectional multi-thread recurrent neural network. arXiv preprint arXiv:1606.00625.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
  • Cesc C. Park and Gunhee Kim. 2015. Expressing an image stream with a sequence of natural sentences. In Advances in Neural Information Processing Systems, pages 73–81.
  • Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2017. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):652–663.
  • Jing Wang, Jianlong Fu, Jinhui Tang, Zechao Li, and Tao Mei. 2018a. Show, reward and tell: Automatic generation of narrative paragraph from photo stream by adversarial training.
  • Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018b. No metrics are perfect: Adversarial reward learning for visual storytelling. arXiv preprint arXiv:1804.09160.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.