What Makes A Good Story? Designing Composite Rewards for Visual Storytelling
Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr. In this paper, we re-examine this problem from a different angle, by looking deep into what defines a realistically-natural and topically-coherent story. To this end, we propose three assessment criteria: relevance, coherence and expressiveness, which we observe through empirical analysis could constitute a “high-quality” story to the human eye. Following this quality guideline, we propose a reinforcement learning framework, ReCo-RL, with reward functions designed to capture the essence of these quality criteria. Experiments on the Visual Storytelling Dataset (VIST) with both automatic and human evaluations demonstrate that our ReCo-RL model achieves better performance than state-of-the-art baselines on both traditional metrics and the proposed new criteria.
There has been a recent surge of interest in enabling machines to understand the semantic concepts of complex visual scenarios and depict visual objects/relations with natural language. One main line of research is grounding the visual concepts of a single image to textual descriptions, known as image captioning [fang2015captions, vinyals2015show, you2016image]. Visual storytelling [huang2016visual] takes one step further, aiming at understanding long-range photo streams and generating a sequence of sentences to describe a coherent story.
Most existing visual storytelling methods focus on maximizing data likelihood [yu2017hierarchically], topic consistency [huang2018hierarchically], or expected rewards through imitation learning [xinwang-wenhuchen-ACL-2018]. However, maximizing data likelihood or implicit rewards does not necessarily optimize the quality of generated stories. In fact, simply optimizing on automatic evaluation metrics may even hurt the performance of story generation, losing on other assessment dimensions that are more important to the human eye.
In this paper, we revisit the visual storytelling problem by asking ourselves the question: what makes a good story? Given a photo stream, the first and foremost goal should be telling a story that accurately describes the objects and the concepts that appear in the photos. This can be termed as the ‘Relevance’ dimension. Secondly, the created story should read smoothly. In other words, the consecutive sentences should be semantically and logically coherent with each other, instead of being mutually-independent sentences describing each photo separately. This can be termed as the ‘Coherency’ dimension. Lastly, to tell a compelling story that can vividly describe the visual scenes and actions in the photos, the language used for creating the story should contain a rich vocabulary and diverse styles. We call this the ‘Expressiveness’ dimension.
Most existing storytelling approaches that optimize on BLEU or CIDEr do not perform very well on these human-judging dimensions. As shown in Figure 1, compared with the model-generated story, the human-written one is more semantically relevant to the content of the photo stream (e.g., describing more fine-grained visual concepts), more structurally coherent across sentences, and more diversified in language style (e.g., less repetition in pattern and vocabulary).
Motivated by this, we propose a reinforcement learning framework with composite reward functions designed to encourage the model to generate a relevant, expressive and coherent story given a photo stream. The proposed ReCo-RL (Relevance-Expressiveness-Coherence) framework consists of two layers: a high-level decoder (i.e., manager) and a low-level decoder (i.e., worker). The manager summarizes the visual information from each image into a goal vector, taking into account the overall story flow, the visual context, and the sentences generated for previous images. It then passes the goal vector to the worker, which generates a word-by-word description of each image, guided by the manager’s goal.
The proposed model consists of three quality enhancement components. The first, a relevance function, gives a high reward to a generated description that mentions fine-grained concepts in an image. The second, a coherence function, measures the fluency of a generated sentence given its preceding sentence, using a pre-trained language model at the sentence level. The third, an expressiveness function, penalizes phrasal overlap between a generated sentence and its preceding sentences. The framework aggregates these rewards and optimizes them with the REINFORCE algorithm [williams1992simple]. Empirical results demonstrate that ReCo-RL can achieve better performance than state-of-the-art baselines. Our main contributions can be summarized as follows:
We propose three new criteria to assess the quality of text generation for the visual storytelling task.
We propose a reinforcement learning framework, ReCo-RL, with composite reward functions designed to align with the proposed criteria, using policy gradient for model training.
We provide quantitative and qualitative analyses, together with a human evaluation, to demonstrate the effectiveness of our proposed model.
Visual storytelling is a task in which, given a photo stream, the machine is trained to generate a coherent story in natural language to describe the photos. Compared with visual captioning tasks [vinyals2015show, krishna2017dense, RennieMMRG17, gan2017semantic], visual storytelling requires the ability to understand more complex visual scenarios and to compose more structured expressions. Pioneering work applied a sequence-to-sequence model to this task [park2015expressing]. The benchmark dataset VIST was later provided for this task [huang2016visual], and promising results on VIST have been shown with a multi-task learning algorithm for both album summarization and sentence generation [yu2017hierarchically].
Recent efforts have explored REINFORCE training, by learning an implicit reward function [xinwang-wenhuchen-ACL-2018] to mimic human behavior or by injecting a topic consistency constraint during training [huang2018hierarchically]. A hierarchical generative model has been proposed to create relevant and expressive narrative paragraphs [DBLP:conf/aaai/WangFTLM18]. To improve structure and diversity, a traditional retrieval-based method has been reconciled with a modern learning-based method to form a hybrid agent [NIPS2018_7426]. Notably, these studies did not directly (or explicitly) examine what accounts for a truly good story to the human eye, which is the main focus of our work.
State-of-the-art text generation methods use encoder-decoder architectures for sequence-to-sequence learning [Bordes2016LearningEG, Sutskever2014SSL]. To better model structured information, hierarchical models have been proposed [fan2018hierarchical, P15-1107]. Follow-up work tried to overcome the exposure bias resulting from MLE training [Bengio:2015:SSS:2969239.2969370, NIPS2016_6099]. In recent years, reinforcement learning (RL) has gained popularity in many tasks [Ranzato2015SequenceLT], such as image captioning [RennieMMRG17] and text summarization [DBLP:journals/corr/PaulusXS17]. Other techniques such as adversarial learning [seqgan, dai17] and inverse reinforcement learning (IRL) [NIPS2016_6391, shi2018towards] have also been applied. Compared with previous work, we define explicit rewards and propose a reinforcement learning framework to optimize them.
Meanwhile, how to assess the quality of generated text still remains a big challenge. BLEU [papineni2002bleu] and METEOR [banerjee2005meteor] are widely used in machine translation. CIDEr [vedantam2015cider] and SPICE [anderson2016spice] are used for image captioning. ROUGE-L [lin2004rouge] is used for evaluating text summarization. However, these metrics all have limitations in evaluating natural language output, as there exists a huge gap between automatic metrics and assessment by humans. There have been some recent studies on more natural assessment for text generation tasks, such as evaluating on structuredness, diversity and readability [plainandwrite, dai17, chen2018fast, xinwang-wenhuchen-ACL-2018], although these studies do not explicitly consider relevance between a stream of images and a story for the task of visual storytelling. Similar to these studies, we argue that the aforementioned automatic metrics are not sufficient to evaluate the visual storytelling task, which requires high readability and naturalness in generated stories.
In this section, we first describe the hierarchical structure of the story generator, then present the three reward functions, and finally introduce the training strategies for model optimization.
Given a stream of images, we denote their features extracted by a pre-trained Convolutional Neural Network as a sequence of vectors , respectively. The reference descriptions are denoted as a sequence of sentences , where is a sequence of word indices that depicts the -th image. We define a dataset of input-output pairs as . Based on the -th image, our model generates the corresponding sentence , where denotes the -th word in . We denote as the word embedding matrix, and as the word embedding of . We denote the hidden state of the manager and the worker as and , respectively. We use a bold letter to denote a vector or matrix, and use an unbold letter to denote a sequence or a set.
The encoder module consists of a pre-trained Convolutional Neural Network, ResNet-101 [he2016deep], which extracts deep visual features from each image. The encoder obtains the overall summary of a photo stream by averaging the visual features of all the images, i.e., .
The manager module in our model is a Long Short-term Memory (LSTM) network, which captures the long-term consistency of the generated story at the sentence level. When depicting one image of a photo stream, the manager should take into account three aspects: 1) the overall flow of the photo stream; 2) the context information in the current image; and 3) the sentences generated from previous images in the photo stream. To do so, for each image in the -th step, the manager takes as input the features of the whole image sequence , the features of the -th image , and the worker’s last hidden state from the previous image. The manager then predicts a hidden state as the goal vector.
where denotes vector concatenation. The goal vector is then passed on to the worker, and the worker is responsible for completing the generation of word description based on the goal from the manager.
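The way the manager's input is assembled can be illustrated with a minimal sketch. It only shows the averaging done by the encoder and the concatenation fed to the manager; the LSTM itself is omitted, and the function names (`album_summary`, `manager_input`) and the toy feature sizes are assumptions made for illustration, not names from the paper.

```python
from statistics import fmean


def album_summary(image_features):
    """Average the per-image feature vectors, dimension by dimension."""
    return [fmean(dim) for dim in zip(*image_features)]


def manager_input(image_features, i, prev_worker_state):
    """Concatenate the album summary, the i-th image features, and the
    worker's last hidden state from the previous image."""
    summary = album_summary(image_features)
    return summary + list(image_features[i]) + list(prev_worker_state)


# Toy example: three images with 2-dimensional features (hypothetical sizes).
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = manager_input(features, 1, [0.0, 0.0, 0.0])
```

In a real model the resulting vector would be passed to an LSTM cell whose hidden state serves as the goal vector.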
The worker module is a fine-grained LSTM network, which predicts one word at a time and controls the fluency of one sentence. Intuitively, the worker is guided by the goal from the manager, and focuses more on fine-grained context information in the current image. More specifically, when predicting one word at the -th step, the worker takes as input the features of the -th image , the manager’s goal vector , and the word embedding of the previously generated word . The worker then predicts a hidden state and applies a linear layer to approximate the probability of choosing the next word in Eq. (3).
Composite Rewards Design
One way to measure the relevance between an image and its generated description is to ground the entities mentioned in the description to corresponding bounding boxes in the image. However, a straightforward way of comparing the n-gram overlap between the reference sentence and the generated sentence (e.g., BLEU or METEOR) treats each word in the sentence equally, without taking into account the semantic relevance of the words to the image.
To tackle this limitation, we propose to measure the semantic similarity between entities mentioned in the reference and generated sentences. More specifically, we are given a set of reference sentences for the -th image. We then extract a set of entities mentioned in its reference sentences with a Part-Of-Speech (POS) tagger, and count the frequency of the entities in its reference sentences as . The normalized frequency of an entity is computed by dividing by the sum of the frequency of all entities of in in Eq. (5).
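The normalized entity frequency described above can be sketched as follows. The sketch assumes the entities have already been extracted from each reference sentence (the POS-tagging step is not reproduced), and it counts every mention of an entity; the function name `normalized_entity_frequency` is hypothetical.

```python
from collections import Counter


def normalized_entity_frequency(reference_entities):
    """Map each entity to its count divided by the total entity count.

    `reference_entities` holds one list of extracted entities per
    reference sentence of the same image.
    """
    counts = Counter(e for entities in reference_entities for e in entities)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {entity: count / total for entity, count in counts.items()}


# Entities from three hypothetical reference sentences for one image.
refs = [["bride", "cake"], ["bride", "groom"], ["bride"]]
freq = normalized_entity_frequency(refs)
```

Here "bride" is mentioned by all three annotators and therefore receives the largest share of the probability mass.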
Similarly, we extract all the entities mentioned in an n-gram of a hypothesis sampled by the model, and denote the hypothesis n-gram as and its entity set as . To measure the relevance of each hypothesis n-gram with respect to the key concepts in an image, we compute the relevance weight of an n-gram in Eq. (6).
If a hypothesis n-gram contains any key entities in , is greater than 1, which distinguishes it from other n-grams that do not ground to any bounding objects in the image. Notice that the weight is proportional to the number of key entities in and the entity frequency in the reference sentences . Intuitively, the more entities an n-gram contains, the more bounding objects in the image this n-gram grounds to. If an entity is mentioned by multiple annotators in the reference sentences, the weight of mentioning this entity in the hypothesis should be high.
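A small sketch can make the behavior of the n-gram weight concrete. The exact form of Eq. (6) is not reproduced here; the sketch assumes the weight is 1 plus a scaled sum of the normalized frequencies of the key entities found in the n-gram, which matches the properties stated above (greater than 1 when a key entity is present, growing with the number of entities and with their frequency). The `scale` parameter and the token-level entity matching are illustrative assumptions.

```python
def relevance_weight(ngram, entity_freq, scale=1.0):
    """Assumed weight: 1 + scale * (sum of frequencies of key entities in the n-gram).

    `ngram` is a tuple of tokens; `entity_freq` maps key entities of the
    image to their normalized frequency in the reference sentences.
    """
    bonus = sum(entity_freq.get(token, 0.0) for token in set(ngram))
    return 1.0 + scale * bonus


entity_freq = {"bride": 0.6, "cake": 0.2, "groom": 0.2}
w_plain = relevance_weight(("was", "very"), entity_freq)    # no key entity
w_one = relevance_weight(("the", "bride"), entity_freq)     # one key entity
w_two = relevance_weight(("bride", "cake"), entity_freq)    # two key entities
```

Under this assumption, an n-gram without key entities keeps weight 1, so it is treated exactly as in an unweighted precision.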
Inspired by the modified n-gram precision in the BLEU score calculation, we aim to avoid rewarding multiple identical n-grams in the hypothesis. To this end, we count the maximum number of times an n-gram exists in any single reference sentence in Eq. (7), and clip the count of each hypothesis n-gram by its maximum reference count in Eq. (8). We then compute the weighted precision of all the n-grams in the hypothesis in Eq. (9).
The relevance score of a sampled hypothesis with respect to the key concepts of an image is computed as the product of a brevity penalty and the geometric mean of the weighted n-gram precision in Eq. (10). In our implementation, we consider unigram and bigram, i.e., , since most entities only contain one or two words.
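The clipping, weighted precision, and final score described in the two paragraphs above can be sketched as follows. This is an illustration under assumptions rather than the exact Eqs. (7) to (10): the weighted precision is taken to be the weighted clipped count divided by the weighted total count, the brevity penalty follows the usual BLEU definition, and the n-gram weight is passed in as a function.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_precision(hyp, refs, n, weight):
    """Weighted n-gram precision with counts clipped by the max reference count."""
    hyp_counts = Counter(ngrams(hyp, n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    num = sum(weight(g) * min(c, max_ref[g]) for g, c in hyp_counts.items())
    den = sum(weight(g) * c for g, c in hyp_counts.items())
    return num / den


def brevity_penalty(hyp_len, ref_lens):
    """BLEU-style penalty using the reference length closest to the hypothesis."""
    closest = min(ref_lens, key=lambda r: (abs(r - hyp_len), r))
    return 1.0 if hyp_len >= closest else math.exp(1.0 - closest / hyp_len)


def relevance_score(hyp, refs, weight, max_n=2):
    """Brevity penalty times the geometric mean of weighted precisions (n = 1..max_n).

    Requires at least one reference sentence.
    """
    precisions = [weighted_precision(hyp, refs, n, weight) for n in range(1, max_n + 1)]
    if not hyp or min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(len(hyp), [len(r) for r in refs]) * math.exp(log_mean)


uniform = lambda gram: 1.0  # stand-in for the entity-based n-gram weight
hyp = "the bride cut the cake".split()
ref = "the bride and groom cut the cake".split()
score = relevance_score(hyp, [ref], uniform)
```

With a uniform weight the score reduces to a unigram-plus-bigram BLEU; plugging in an entity-based weight shifts the precision toward n-grams that mention key entities.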
A coherent story should organize its sentences in a correct sequential order and preserve the same topic among adjacent sentences. One way to measure coherence between two sentences is a sentence coherence discriminator that models the probability of two sentences and being continuous in a correct sequential order as well as containing the same topic.
To this end, we leverage a language model with a next-sentence-prediction objective, as explored in BERT [devlin2018bert]. We first construct a sequence by concatenating two sentences and decoded by our model, and get the sequence representation using a pre-trained language model. Then, we apply a linear layer to the sequence representation followed by a function and a softmax function to predict a binary label, which indicates whether the second sentence is the next sentence of the first one.
where indicates is the next sentence of .
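The classification head described above can be sketched in a few lines. The sequence representation would come from the pre-trained language model; here it is just a hypothetical vector. The intermediate nonlinearity is assumed to be tanh for the purpose of this sketch, and class index 1 is assumed to mean "is the next sentence"; both are illustrative choices, as are the function and parameter names.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coherence_reward(seq_repr, weights, bias):
    """Probability that the second sentence follows the first.

    seq_repr: representation of the concatenated sentence pair (hypothetical input).
    weights:  two rows of linear-layer weights, one per class.
    bias:     two bias terms, one per class.
    """
    logits = [
        math.tanh(sum(w * x for w, x in zip(row, seq_repr)) + b)  # tanh is an assumption
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)[1]  # index 1 = "is next sentence" (assumed convention)


seq_repr = [0.5, -1.0, 2.0]
untrained = coherence_reward(seq_repr, [[0.0] * 3, [0.0] * 3], [0.0, 0.0])
confident = coherence_reward(seq_repr, [[0.0] * 3, [1.0, 0.0, 1.0]], [0.0, 0.0])
```

The returned probability can be used directly as the coherence reward for the second sentence of the pair.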
An expressive story should contain diverse phrases to depict the rich content of a photo stream, rather than repeatedly using identical phrases. To capture this expressiveness, we keep track of already-generated n-grams, and punish the model when it generates repeated n-grams.
To this end, we propose a diversity reward which measures the n-gram overlap between the current sentence and previously decoded sentences . More specifically, we first regard all the preceding sentences as the reference sentences to the current sentence , and compute the BLEU score of the current sentence compared to the reference sentences. Finally, we subtract this value from 1 to obtain the expressiveness reward in Eq. (15). Intuitively, if the current sentence shares many n-grams with any one of the preceding decoded sentences, the BLEU score of the current sentence with respect to that already-generated sentence is high, and the story loses expressiveness when the current sentence is added. In our implementation of BLEU in Eq. (15), we only consider the precision of bigrams, trigrams and 4-grams, since we want to focus on repeated phrases that have more than one word.
We first train our proposed model using maximum likelihood estimation (MLE), and then continue training the model using the REINFORCE algorithm together with an MLE objective.
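A minimal sketch of this reward is given below. It treats the preceding sentences as references and computes a BLEU-like overlap from the clipped precision of bigrams, trigrams and 4-grams. Two simplifications are assumptions of the sketch: the brevity penalty is left out, and a sentence with no preceding context receives the maximum reward of 1.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hyp, refs, n):
    hyp_counts = Counter(ngrams(hyp, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    return sum(min(c, max_ref[g]) for g, c in hyp_counts.items()) / total


def expressiveness_reward(current, previous, orders=(2, 3, 4)):
    """1 minus the n-gram overlap of `current` with the preceding sentences."""
    if not previous:
        return 1.0  # assumption: nothing to repeat yet
    precisions = [clipped_precision(current, previous, n) for n in orders]
    if min(precisions) == 0.0:
        overlap = 0.0  # geometric mean is zero if any order has no match
    else:
        overlap = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return 1.0 - overlap


first = "the bride and groom were very happy".split()
repeat = "the bride and groom were very happy".split()
fresh = "all of the girls posed for pictures".split()
r_repeat = expressiveness_reward(repeat, [first])
r_fresh = expressiveness_reward(fresh, [first])
```

A verbatim repetition of an earlier sentence gets a reward of 0, while a sentence sharing no bigram with the earlier context gets the full reward of 1.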
Maximum Likelihood Estimation
REINFORCE [williams1992simple] makes it possible to learn a policy by maximizing an arbitrary expected reward in Eq. (18). This allows us to design reward functions specifically for the visual storytelling task. We compute the weighted sum of the aforementioned three reward functions, to encourage the model to focus on those key aspects of a good story and control the generation quality of the sentences.
where is the policy, and is a reinforcement baseline that reduces the variance of the expected rewards, , and are the weights of the three designed rewards. In our implementation, we sample hypotheses generated by the current policy for the -th image, and approximate the expected rewards with respect to the empirical distribution . We compute the reinforcement baseline by using the average reward of all the sampled hypotheses, i.e., .
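The sampling-based estimate described above can be sketched as follows. The composite reward is a weighted sum of the three rewards, the baseline is the mean reward over the sampled hypotheses, and the surrogate loss is the negative advantage-weighted log-probability averaged over the samples. The function names and the example numbers are hypothetical; in a real implementation the log-probabilities would be differentiable tensors.

```python
def composite_reward(relevance, coherence, expressiveness, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three rewards (equal weights by default)."""
    w_rel, w_coh, w_exp = weights
    return w_rel * relevance + w_coh * coherence + w_exp * expressiveness


def advantages(rewards):
    """Subtract the mean reward of the sampled hypotheses (the baseline)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def reinforce_loss(log_probs, rewards):
    """Surrogate loss: -(1/K) * sum_k (r_k - b) * log p(hypothesis_k)."""
    adv = advantages(rewards)
    return -sum(a * lp for a, lp in zip(adv, log_probs)) / len(rewards)


# Three sampled hypotheses for one image, with made-up reward components.
rewards = [
    composite_reward(0.2, 0.5, 0.3),
    composite_reward(0.6, 0.8, 0.6),
    composite_reward(1.0, 1.0, 1.0),
]
log_probs = [-1.0, -2.0, -3.0]
loss = reinforce_loss(log_probs, rewards)
```

Because the baseline is the sample mean, the advantages sum to zero: hypotheses with above-average reward have their log-probability pushed up, and the rest are pushed down.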
Rather than starting from a random policy model, we start from a model pre-trained with the MLE objective, and continue training the model jointly with the MLE and REINFORCE objectives on each mini-batch in Eq. (20), a strategy previously explored in [Ranzato2015SequenceLT].
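One common way to combine the two objectives on a mini-batch is a convex combination of the two losses. The sketch below assumes that form; the exact expression of Eq. (20) is not reproduced here, and the parameter name `mix` and its default value are purely illustrative.

```python
def joint_loss(mle_loss, rl_loss, mix=0.5):
    """Assumed joint objective: mix * RL loss + (1 - mix) * MLE loss."""
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    return mix * rl_loss + (1.0 - mix) * mle_loss


# Example values for one mini-batch (made up for illustration).
total = joint_loss(mle_loss=2.0, rl_loss=1.0, mix=0.25)
```

Setting `mix` to 0 recovers pure MLE training, and setting it to 1 recovers pure REINFORCE training.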
Dataset and Baseline
Dataset: The VIST dataset [huang2016visual] used in our evaluation consists of 10,117 Flickr albums with 210,819 unique photos. Each sample contains one story that describes 5 selected images from a photo stream, and the same album is paired with 5 different stories as references. The split is similar to previous work, with 40,098 samples for training, 4,988 for validation and 5,050 for testing. The vocabulary size of VIST is 12,977. The released data was processed by a named entity recognition (NER) tagger to address the sparsity issue of low-frequency words. The names of persons, locations and organizations are replaced by [male]/[female], [location], and [organization], respectively.
Implementation Details: The visual features are extracted from the last fully-connected layer of ResNet152 pretrained on ImageNet [he2016deep]. The word embeddings of size 300 are uniformly initialized within . We use a 512-hidden-unit LSTM layer for both the manager and the worker modules. We apply dropout to the embedding layer and every LSTM layer with the rate of 0.3. We set the hyper-parameters to assign equal weights to all the three aspects of the reward functions, and set to balance both MLE and REINFORCE objectives during training. We use BERT [devlin2018bert] as our next sentence predictor and fine-tune the next sentence prediction on sentence pairs in the correct and random order in the VIST dataset. For negative sentence pairs, we randomly concatenate two sentences in two different albums to make sure that the topics of these sentences are different.
Baseline: We compare our method with the following baselines: (1) AREL [xinwang-wenhuchen-ACL-2018] (code: https://github.com/eric-xw/AREL.git), an approach to learning an implicit reward function with imitation learning; (2) HSRL [huang2018hierarchically] (code provided by the authors), a hierarchical RL approach that injects a topic consistency constraint during training. These two recent approaches achieved state-of-the-art results on VIST, and we follow the same parameter settings as in the original papers.
In addition, we also compare three variants of our model: (1) MLE that uses MLE training in Eq. (17); (2) BLEU-RL that is jointly trained by MLE and REINFORCE, using sentence-level BLEU as a reward; and (3) ReCo-RL that is jointly trained by MLE and REINFORCE, using the designed rewards in Eq. (20). The decoding outputs generated by all the models are evaluated by the same scripts as AREL [xinwang-wenhuchen-ACL-2018].
Automatic metrics, including BLEU-4, CIDEr, METEOR and ROUGE-L, are used for quantitative evaluation. Table 1 summarizes the results of all the methods in comparison. Our models (MLE, BLEU-RL and ReCo-RL) achieve competitive or better performance over the baselines on most metrics except CIDEr. Specifically, BLEU-RL achieves better performance in METEOR, ROUGE-L and BLEU-4, while ReCo-RL improves the CIDEr score.
In addition to the standard automatic metrics, we can also use the designed reward functions to score each story generated by different methods. To evaluate the overall performance of one method at the corpus level, we average the reward scores of all stories generated by the method on the test set. Similar to the automatic evaluation metrics, we multiply the average reward scores by 100 and report the scaled results of all the methods on the test set in Table 3. Our proposed ReCo-RL outperforms all the state-of-the-art methods and our variants (BLEU-RL, MLE) on all three quality aspects.
|AREL||the officers of the military officers are in charge of the military. he was very proud of his speech. the meeting was a great success. the president of the company gave a speech to the audience. we had a great time.||the wedding was held at a church. the wedding was beautiful. the bride and groom cut the cake. the bride and groom were very happy. the whole family was there to celebrate.|
|HSRL||i was so excited to see my new team. he was very happy to see the new professor. i had a lot of time to talk about. i had a great time. i had a great time.||the wedding was a beautiful wedding. the bride and groom cut the dance together. the bride and groom were very happy to be married. the bride and groom were very happy. at the end of the night, the bride and groom were happy to be married.|
|BLEU-RL||at the end of the day, the men were very proud of the military. they had a lot of people there. this is a picture of the meeting. a group of people had a great time. after the end of the day , we all had a lot of questions.||it was a beautiful day for the wedding. at the end of the night, the bride and groom were very happy. the bride and groom were very happy. she was so happy to be married. the bride and groom pose for pictures.|
|ReCo-RL||today was a picture of the military officer, he was ready to go to the organization. they were very happy to see the awards ceremony. the speaker was very excited to be able to talk about the meeting. everyone was having a great time to get together for the event after the ceremony. we all had a lot of people there.||it was a beautiful day at the wedding party. the bride and groom were so happy to be married. [female] was happy and she was so excited to celebrate. she had a great time to take a picture of her wedding. all of the girls posed for pictures.|
Due to the subjective nature of the storytelling task, we further conduct human evaluation to explicitly examine the quality of the stories generated by all the models, through crowdsourcing using Amazon Mechanical Turk (AMT). Specifically, we randomly sampled 500 stories generated by all the models for the same photo streams. Given one photo stream and the stories generated by two models, three turkers were asked to perform a pairwise comparison and select the better one from the two stories based on three criteria: relevance, coherence and expressiveness. The user interface of the evaluation tool also provides a neutral option, which can be selected if the turker thinks both outputs are equally good on one particular criterion. The order of the outputs for each assignment is randomly shuffled for fair comparison. Notice that in the pairwise human evaluation, each pair of system outputs for one photo stream was judged by a different group of three people. The total number of turkers for all photo streams is 862.
Table 2 reports the pairwise comparison between ReCo-RL and three other methods. Based on human judgment, the quality of the stories generated by ReCo-RL is significantly better than that of the BLEU-RL variant on all dimensions, even though BLEU-RL is fine-tuned to obtain comparable scores on existing automatic metrics. Compared with two strong baselines, AREL and HSRL, ReCo-RL still achieves better performance. For each pairwise comparison between two model outputs, we also scored each story based on the number of votes from three turkers, and performed the Student’s paired t-test between the scores of two systems. Our ReCo-RL is significantly better than all baseline methods with .
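The vote-based significance test can be illustrated with a short sketch that computes the paired t-statistic from per-story vote counts. The vote counts below are made up for illustration and are not data from the study; converting the statistic into a p-value requires the Student's t distribution, which is left out of this stdlib-only sketch.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical votes (out of three turkers) for five photo streams.
# Votes may sum to less than 3 when a turker picks the neutral option.
votes_system_a = [3, 2, 3, 1, 2]
votes_system_b = [0, 1, 0, 2, 1]
t_stat, dof = paired_t_statistic(votes_system_a, votes_system_b)
```

A large positive t-statistic indicates that the first system consistently receives more votes than the second across photo streams.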
|BLEU-RL||a group of friends gathered together for dinner. the turkey was delicious. the guests were having a great time. at the end of the night, we had a great time. at the end of the night, we had a great time.||2.47||11.06||37.10||73.57|
|ReCo-RL||a group of friends gathered together for a party . the turkey was delicious . it was a delicious meal . everyone was having a great time . after the party , we all sat down and talked about the night . my friend and [female] were very happy to drink .||3.32||16.99||41.71||78.51|
In Figure 3, we show two image streams and the stories generated by four models. For the second image stream on the right, BLEU-RL repeatedly generates uninformative segments, such as “we had a lot of people there”, even though BLEU-RL achieves high scores on automatic metrics. The same problem exists in the stories generated by HSRL in the first and second examples, such as “i had a great time.”. From our observation, when the images are similar across an image stream, the three baseline methods are not encouraged to discover the content that differs between subsequent images, and thus generate repeated sentences with redundant information.
With regards to the relevance between the visual concepts of the photo stream and the stories, ReCo-RL consistently generates more specific concepts highly correlated to the appearing objects in the image stream. In Figure 3, words highlighted in yellow represent the entities that can be grounded in the images. In the second example, our ReCo-RL is encouraged to generate descriptions on rare entities such as “sign” and “flags” in addition to frequent entities such as “people”.
In the first example of Figure 3, sentence pairs that are not semantically coherent are highlighted with a wavy underline. The fourth sentence generated by AREL mentions “the president of the company”, which is quite different from the previously described entity “military officer”, showing that AREL forgets the content of previous images when it generates the next sentence. Similarly, the second sentence generated by HSRL suddenly changes the subject of the story from “i” to “he”, and mentions the “new professor”, which is quite different from the previously described entity “new team”. From our observation, this type of disconnection is quite common in stories generated by the three baseline methods. The stories generated by ReCo-RL are considerably more coherent in content.
Moreover, we further compare the stories generated by our proposed ReCo-RL and our variant BLEU-RL. These two methods use different sentence-level reward functions during the reinforcement training. In Figure 4, we find that ReCo-RL generates more related entities such as “meal” and “drink”. We also observe that the key entities generated by ReCo-RL make the story more consistent in topic, while BLEU-RL forgets the previous context when generating the last two sentences. Even more surprisingly, our proposed ReCo-RL obtains not only higher scores on our proposed rewards, but also a higher BLEU-4 score than BLEU-RL.
In this paper, we propose ReCo-RL, a novel approach to visual storytelling, which directly optimizes story generation quality on three dimensions natural to the human eye: relevance, coherence, and expressiveness. Experiments demonstrate that our model outperforms state-of-the-art methods on both the existing automatic metrics and the proposed assessment criteria. In future work, we will extend the proposed model to other text-generation tasks, such as storytelling based on writing prompts [fan2018hierarchical] and table-to-text generation [table2text].
Appendix A
We provide the template of our human evaluation on AMT.
We provide more examples of generated stories in the following.