Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training

Bei Liu (Kyoto University), Jianlong Fu (Microsoft Research Asia), Makoto P. Kato (Kyoto University), Masatoshi Yoshikawa (Kyoto University)

This work was performed when Bei Liu was visiting Microsoft Research Asia as a research intern.

Automatic generation of natural language from images has attracted extensive attention. In this paper, we take one step further and investigate the generation of poetic language (with multiple lines) for an image, with the goal of automatic poetry creation. This task involves multiple challenges, including discovering poetic clues from the image (e.g., hope from green) and generating poems that satisfy both relevance to the image and poeticness at the language level. To address these challenges, we formulate poem generation as two correlated sub-tasks solved by multi-adversarial training via policy gradient, through which cross-modal relevance and poetic language style can be ensured. To extract poetic clues from images, we propose to learn a deep coupled visual-poetic embedding, in which the poetic representations of objects, sentiments and scenes in an image can be jointly learned. (In this paper, we consider both adjectives and verbs as sentiment words, which can usually express many types of emotions and feelings in a poem.) Two discriminative networks are further introduced to guide the poem generation: a multi-modal discriminator and a poem-style discriminator. To facilitate the research, we have collected two poem datasets with human annotators, with two distinct properties: 1) the first human-annotated image-to-poem pair dataset (with 8,292 pairs in total), and 2) to date the largest public English poem corpus dataset. Extensive experiments are conducted with poems generated by our model for 8K images, among which 1.5K images are randomly picked for evaluation. Both objective and subjective evaluations show superior performance against state-of-the-art methods for poem generation from images. A Turing test carried out with human subjects, 30 of whom are poetry experts, demonstrates the effectiveness of our approach.

1 Introduction

Research that involves both vision and language has attracted great attention recently, as witnessed by the burst of work on image description such as image captioning and paragraphing [1, 4, 16, 27]. Image description research aims to generate sentence(s) that describe the facts in an image in human-level language. In this paper, we take one step further to tackle a more cognitive task: generating poetic language for an image for the purpose of poetry creation, which has attracted tremendous interest in both research and industry.

Figure 1: Example of human written description and poem of the same image. We can see a significant difference from words of the same color in these two forms. Instead of describing facts in the image, poem tends to capture deeper meaning and poetic symbols from objects, scenes and sentiments from the image (such as knight from falcon, hunting and fight from eating, and waiting from standing).

In the natural language processing field, poem generation related problems have been studied. For instance, in [11, 32], the authors mainly focused on the quality of style and rhythm. In [7, 32, 37], these works have taken one more step to generate poems from topics. In the industrial field, Facebook has proposed to generate English rhythmic poetry with neural networks [11], and Microsoft has developed a system called Xiaoice, in which poem generation is one of the most important features. Nevertheless, generating poems from images in an end-to-end fashion remains a new topic with grand challenges.

Compared with image captioning and paragraphing that focus on generating descriptive sentences about an image, generation of poetic language is a more challenging problem. There is a larger gap between visual representations and poetic symbols that can be inspired from images and facilitate better generation of poems. For example, “man” detected in image captioning can further indicate “hope” with “bright sunshine” and “opening arm”, or “loneliness” with “empty chairs” and “dark” background in poem creation. Figure 1 shows a concrete example of the differences between descriptions and poems for the same image.

In particular, to generate a poem from an image, we face the following three challenges. First of all, it is a cross-modality problem, in contrast to poem generation from topics. An intuitive way to generate poems from images is to first extract keywords or captions from the images and then use them as seeds for poem generation, as poem generation from topics does. However, keywords or captions miss a lot of information in images, not to mention the poetic clues that are important for poem generation [7, 37]. Secondly, compared with image captioning and image paragraphing, poem generation from images is a more subjective task: an image can be relevant to several poems from various aspects, while image captioning/paragraphing is more about describing facts in the images and results in similar sentences. Thirdly, the form and style of poem sentences differ from those of narrative sentences. In this research, we mainly focus on free verse, which is an open form of poetry. Although we do not require meter, rhyme or other traditional poetic techniques, some sense of poetic structure and poetic language style still remains in such poems. We define this quality of a poem as poeticness in this research. For example, poems are usually not very long, specific words are preferred in poems compared with image descriptions, and the sentences in one poem should be consistent with one topic.

To address the above challenges, we collect two poem datasets with human annotators, and conduct the research on poetry creation by integrating retrieval and generation techniques in one system. Specifically, to better learn poetic clues from images for poem generation, we first learn a deep coupled visual-poetic embedding model with CNN features of images and skip-thought vector features [15] of poems from a multi-modal poem dataset (namely “MultiM-Poem”) that consists of thousands of image and poem pairs. This embedding model is then used to retrieve relevant and diverse poems for images from a larger uni-modal poem corpus (namely “UniM-Poem”). Images with these retrieved poems, together with MultiM-Poem, construct an enlarged image-poem pair dataset (namely “MultiM-Poem (Ex)”). We further propose to leverage state-of-the-art sequential learning techniques to train an end-to-end poem generation model on the MultiM-Poem (Ex) dataset. Such a framework ensures that substantial poetic clues, which are significant for poem generation, can be discovered and modeled from those extended pairs.

To avoid the exposure bias problem caused by the long sequence (all poem lines together) and the problem that no specific loss is available to score a generated poem, we propose a recurrent neural network (RNN) for poem generation with multi-adversarial training and further optimize it by policy gradient. Two discriminative networks are used to provide rewards in terms of the generated poem’s relevance to the given image and the poeticness of the generated poem. We conduct experiments on MultiM-Poem, UniM-Poem and MultiM-Poem (Ex) to generate poems for images. The generated poems are evaluated in both automatic and manual ways. We define automatic evaluation metrics concerning relevance, novelty and translative consistency, and conduct user studies about relevance, coherence and imaginativeness to compare our generated poems with those generated by baseline methods. The contributions of this research are summarized as follows:

  • We propose to generate poems (English free verse) from images in an automatic fashion. To the best of our knowledge, this is the first attempt to study the image-inspired poem generation problem in a holistic framework, which enables a machine to approach human capability in cognition tasks.

  • We incorporate a deep coupled visual-poetic embedding model and an RNN-based generator for joint learning, in which two discriminators provide rewards for measuring cross-modality relevance and poeticness by multi-adversarial training.

  • We collect the first paired dataset of images and poems annotated by human annotators and the largest public poem corpus dataset. Extensive experiments demonstrate the effectiveness of our approach compared with several baselines by using both automatic and manual evaluation metrics, including a Turing test with human subjects. To better promote the research in poetry generation from images, we will release these datasets in the near future.

Figure 2: The framework of poetry generation with multi-adversarial training. We first use image and poem pairs (a) from the human-annotated paired image and poem dataset (MultiM-Poem) to train a deep coupled visual-poetic embedding model (e). The image features (b) are poetic multi-CNN features obtained by fine-tuning a CNN with the poetic symbols (e.g., objects, scenes and sentiments) extracted from poems by a POS parser (Stanford NLP tool). The sentence features (d) of poems are extracted from a skip-thought model (c) trained on the largest public poem corpus (UniM-Poem). An RNN-based sentence generator (f) is trained as the agent, and two discriminators acting as multi-modal (g) and poem-style (h) critics of a generated poem for a given image provide rewards to the policy gradient (i). The POS parser extracts Part-Of-Speech words from poems.

2 Related Work

2.1 Poetry Generation

Traditional approaches to poetry generation include template and grammar-based methods [19, 20, 21], generative summarization under constrained optimization [32] and statistical machine translation models [10, 12]. With the application of deep learning approaches in recent years, research on poetry generation has entered a new stage. Recurrent neural networks are widely used to generate poems that readers can find hard to tell apart from poems written by human poets [7, 8, 11, 33, 37]. Previous works on poem generation mainly focus on the style and rhythmic qualities of poems [11, 32], while recent studies introduce a topic as a condition for poem generation [7, 8, 32, 37]. For a poem, a topic is still a rather abstract concept without a specific scenario. Inspired by the fact that many poems were created when poets were in a particular scenario and looking at specific views, we take one step further to tackle the problem of generating poems inspired by a visual scenario. Compared with previous research, our work faces more challenges, especially in terms of multi-modal problems.

2.2 Image Description

Image captioning was first regarded as a retrieval problem, which aims to search a dataset for captions matching a given image [5, 13] and hence cannot provide accurate and proper descriptions for all images. To overcome this problem, methods such as template filling [17] and the paradigm of integrating a convolutional neural network (CNN) with a recurrent neural network (RNN) [2, 27, 34] were proposed to generate readable human-level sentences. Recently, generative adversarial networks (GANs) have been applied to generate captions under different problem settings [1, 35]. Image paragraphing has followed a similar path. Recent research on image paragraphing mainly focuses on region detection and hierarchical structure for the generated sentences [16, 18, 23]. However, as we have noted, image captioning and paragraphing aim to generate descriptive sentences that tell the facts in images, while poem generation tackles a more advanced linguistic form that requires poeticness and language style constraints.

3 Approach

In this research, we aim to generate poems from images so that the generated poems are relevant to the input images and satisfy poeticness. For this purpose, we cast our problem as a multi-adversarial procedure [9] and further optimize it with policy gradient [30, 36]. A CNN-RNN generative model acts as an agent. The parameters of this agent define a policy whose execution decides which word is picked as an action. When the agent has picked all words in a poem, it observes a reward. We define two discriminative networks that provide rewards according to whether the generated poem is paired with the input image and whether the generated poem is poetic. The goal of our poem generation model is to generate a sequence of words as a poem for an image so as to maximize the expected end reward. This policy-gradient method has shown significant effectiveness on many tasks with non-differentiable metrics [1, 24, 35].

As shown in Figure 2, the framework consists of two parts: (1) a deep coupled visual-poetic embedding model (e) to learn poetic representations from images, and (2) a multi-adversarial training procedure optimized by policy gradient. An RNN-based generator (f) serves as the agent, and two discriminative networks (g and h) provide rewards to the policy gradient.

3.1 Deep Coupled Visual-Poetic Embedding

The goal of the visual-poetic embedding model [6, 14] is to learn an embedding space into which points of different modalities, e.g. images and sentences, can be projected. As in the image captioning problem, we assume that a paired image and poem share similar poetic semantics, which makes the embedding space learnable. By embedding both images and poems into the same feature space, we can directly compute the relevance between a poem and an image from their poetic vector representations. Moreover, the embedding feature can be further utilized to initialize an optimized representation of poetic clues for poem generation.

The structure of our deep coupled visual-poetic embedding model is shown in the left part of Figure 2. For image input, we leverage deep convolutional neural networks (CNNs) for three aspects that indicate important poetic clues in images, namely object ($v_1$), scene ($v_2$) and sentiment ($v_3$), chosen after a prior user study about important factors for poem creation from images. We observed that concepts in poems are often imaginative and poetic, while concepts in the classification datasets we use to train our CNN models are concrete and common. To narrow the semantic gap between the visual representation of images and the textual representation of poems, we propose to fine-tune these three networks with the MultiM-Poem dataset. Specifically, frequently used keywords about objects, sentiments and scenes in the poems are picked as the label vocabulary, and we then build three multi-label datasets based on MultiM-Poem for object, sentiment and scene detection respectively. Once the multi-label datasets are built, we fine-tune the pre-trained CNN models on the three datasets independently, optimizing the sigmoid cross-entropy loss in Eqn. (1):

$$\text{loss} = -\frac{1}{C} \sum_{n=1}^{C} \big( l_n \log p_n + (1 - l_n) \log (1 - p_n) \big), \tag{1}$$

where $C$ is the size of the label vocabulary, $l_n \in \{0, 1\}$ is the ground-truth label and $p_n$ is the sigmoid output for label $n$. After that, we adopt the $D$-dimension deep features for each aspect from the penultimate fully-connected layer of the CNN models, and get a concatenated $N$-dimension ($N = 3D$) feature vector $v \in \mathbb{R}^N$ as input of the visual-poetic embedding for each image $I$:

$$v_1 = f_{\text{Object}}(I), \quad v_2 = f_{\text{Scene}}(I), \quad v_3 = f_{\text{Sentiment}}(I), \quad v = (v_1, v_2, v_3), \tag{2}$$

in which we use the “FC7” layer outputs as the features for $v_1$, $v_2$ and $v_3$. The output of the visual-poetic embedding, $x$, is a $K$-dimension vector representing the image embedding, obtained by a linear mapping from the image features:

$$x = W_v \cdot v + b_v \in \mathbb{R}^K, \tag{3}$$

where $W_v \in \mathbb{R}^{K \times N}$ is the image embedding matrix and $b_v \in \mathbb{R}^K$ is the image bias vector. Meanwhile, the representation feature vector of a poem is computed as the average of its sentences’ skip-thought vectors [15]. The combine-skip vector with $M$ dimensions, denoted by $t \in \mathbb{R}^M$, is used as it demonstrates better performance as shown in [15]. The skip-thought model is trained on the UniM-Poem dataset. Similar to the image embedding, the poem embedding is denoted as:

$$m = W_t \cdot t + b_t \in \mathbb{R}^K, \tag{4}$$

where $W_t \in \mathbb{R}^{K \times M}$ is the poem embedding matrix and $b_t \in \mathbb{R}^K$ is the poem bias vector. Finally, the image and poem are embedded together by minimizing a pairwise ranking loss with dot-product similarity:

$$\mathcal{L} = \sum_{(x, m)} \Big\{ \sum_{k} \max(0, \alpha - x \cdot m + x \cdot m_k) + \sum_{k} \max(0, \alpha - m \cdot x + m \cdot x_k) \Big\}, \tag{5}$$

where $m_k$ is a contrastive (irrelevant, unpaired) poem for image embedding $x$, and vice versa with $x_k$. $\alpha$ denotes the contrastive margin. As a result, the trained model will produce higher cosine similarity (consistent with dot-product similarity) between the embedding features of original image-poem pairs than between randomly generated pairs.
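To make the linear embedding and the pairwise ranking loss concrete, the following is a minimal NumPy sketch. It is an illustration under assumptions, not the authors' implementation: the function names, the toy dimensions, the random features and the margin value are all made up here, and the other items in the batch are used as contrastive examples instead of separately sampled unpaired poems.

```python
import numpy as np

def embed(features, W, b):
    """Linear map into the shared K-dimensional space (one row per item)."""
    return features @ W.T + b

def pairwise_ranking_loss(x, m, margin):
    """Bidirectional hinge ranking loss with dot-product similarity.

    x, m: (B, K) image and poem embeddings; row i of x is paired with row i of m.
    Assumption: every other row in the batch acts as a contrastive example.
    """
    scores = x @ m.T                      # scores[i, j] = x_i . m_j
    pos = np.diag(scores)                 # similarity of the true pairs
    # image i against contrastive poem j: max(0, margin - x_i.m_i + x_i.m_j)
    cost_poem = np.maximum(0.0, margin - pos[:, None] + scores)
    # poem j against contrastive image i: max(0, margin - m_j.x_j + m_j.x_i)
    cost_image = np.maximum(0.0, margin - pos[None, :] + scores)
    np.fill_diagonal(cost_poem, 0.0)      # a true pair is not its own contrast
    np.fill_diagonal(cost_image, 0.0)
    return cost_poem.sum() + cost_image.sum()

# Toy usage with made-up sizes and an arbitrary margin.
rng = np.random.default_rng(0)
B, D, M, K = 8, 4, 5, 6
v = rng.normal(size=(B, 3 * D))           # concatenated object/scene/sentiment features
t = rng.normal(size=(B, M))               # mean-pooled sentence vectors, one per poem
W_v, b_v = rng.normal(size=(K, 3 * D)), np.zeros(K)
W_t, b_t = rng.normal(size=(K, M)), np.zeros(K)
loss = pairwise_ranking_loss(embed(v, W_v, b_v), embed(t, W_t, b_t), margin=0.5)
```

In a real training loop the two embedding matrices would be updated by gradient descent on this loss; the sketch only evaluates it once.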

3.2 Poem Generator as an Agent

A conventional CNN-RNN model for image captioning is used in our approach to serve as the agent. Instead of the hierarchical methods recently used to generate multiple sentences in image paragraphing [16], we use a non-hierarchical recurrent model by treating the end-of-sentence token as a word in the vocabulary. The reason is that poems often consist of fewer words than paragraphs. Moreover, the hierarchy between sentences in the training poems is less consistent, which makes it much more difficult to learn. We also conduct an experiment with a hierarchical recurrent language model as a baseline and show the result in the experiment part.
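The non-hierarchical treatment of line ends can be sketched as a simple flattening step. This is an illustrative sketch only: the token strings `<bos>`, `<eos>` and `<eol>` and both helper names are hypothetical, chosen here for the example.

```python
BOS, EOS, EOL = "<bos>", "<eos>", "<eol>"  # hypothetical special tokens

def poem_to_tokens(poem):
    """Flatten a multi-line poem into one token sequence with line-end tokens."""
    tokens = [BOS]
    for line in poem.strip().splitlines():
        words = line.split()
        if words:                      # skip empty lines
            tokens.extend(words)
            tokens.append(EOL)         # the line break is just another word
    tokens.append(EOS)
    return tokens

def tokens_to_poem(tokens):
    """Rebuild the line structure from a flat token sequence."""
    lines, current = [], []
    for tok in tokens:
        if tok in (BOS, EOS):
            continue
        if tok == EOL:
            lines.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:                        # trailing words without a final line end
        lines.append(" ".join(current))
    return "\n".join(lines)
```

With this encoding a single flat decoder can emit a whole poem, and the line structure is recovered afterwards from the line-end tokens.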

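The generative model includes CNNs as the image encoder and an RNN as the poem decoder. In this research, we apply Gated Recurrent Units (GRUs) [3] for the decoder. We use the image-embedding features learned by the deep coupled visual-poetic embedding model explained in Section 3.1 as input of the image encoder. Suppose $\theta$ denotes the parameters of the model. Traditionally, the target is to learn $\theta$ by maximizing the likelihood of the observed sentence $y = y_{1:T} \in \mathbb{Y}^*$, where $T$ is the maximum length of the generated sentence (including the start-of-sentence token, the end-of-sentence token and line breaks) and $\mathbb{Y}^*$ denotes the space of all sequences of selected words.

Let $r(y_{1:t})$ denote the reward achieved at time $t$ and let $R(y_{1:T})$ be the cumulative reward, namely $R(y_{1:T}) = \sum_{t=1}^{T} r(y_{1:t})$. Let $p_\theta(y_t \mid y_{1:(t-1)})$ be a parametric conditional probability of selecting $y_t$ at time step $t$ given all the previous words $y_{1:(t-1)}$. $p_\theta$ is defined as a parametric function of the policy parameters $\theta$. The reward of policy gradient in each batch can be computed as the sum over all sequences of valid actions as the expected future reward. Iterating over all possible action sequences has exponential cost, but the objective can be written as an expectation so that it can be approximated with an unbiased estimator:

$$J(\theta) = \sum_{y_{1:T} \in \mathbb{Y}^*} p_\theta(y_{1:T}) \, R(y_{1:T}) = \mathbb{E}_{y_{1:T} \sim p_\theta} \big[ R(y_{1:T}) \big]. \tag{6}$$

We aim to maximize $J(\theta)$ by following its gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y_{1:T} \sim p_\theta} \Big[ R(y_{1:T}) \sum_{t=1}^{T} \nabla_\theta \log p_\theta(y_t \mid y_{1:(t-1)}) \Big]. \tag{7}$$

In practice the expected gradient can be approximated using a Monte Carlo sample by sequentially sampling each $y_t$ from the model distribution $p_\theta(y_t \mid y_{1:(t-1)})$ for $t$ from $1$ to $T$. As discussed in [24], a baseline $b$ can be introduced to reduce the variance of the gradient estimate without changing the expected gradient. Thus, the expected gradient with a single sample is approximated as follows:

$$\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta \log p_\theta(y_t \mid y_{1:(t-1)}) \, \big( R(y_{1:T}) - b \big). \tag{8}$$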
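The single-sample policy-gradient estimate with a baseline can be illustrated on a toy problem. The sketch below is not the paper's generator: it swaps the GRU decoder for a tabular bigram policy (each word depends only on the previous one), uses a made-up reward in place of the discriminator outputs, and uses a running average as the baseline. All names and constants are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_sequence(logits, T, rng, start=0):
    """Sample y_1..y_T from a toy bigram policy p(y_t | y_{t-1})."""
    seq, prev = [], start
    for _ in range(T):
        y = int(rng.choice(logits.shape[1], p=softmax(logits[prev])))
        seq.append(y)
        prev = y
    return seq

def reinforce_gradient(logits, seq, reward, baseline, start=0):
    """Single-sample estimate: sum_t grad log p(y_t | y_{t-1}) * (R - b)."""
    grad = np.zeros_like(logits)
    prev = start
    for y in seq:
        probs = softmax(logits[prev])
        one_hot = np.zeros_like(probs)
        one_hot[y] = 1.0
        # gradient of log-softmax w.r.t. the logits is (one_hot - probs)
        grad[prev] += (one_hot - probs) * (reward - baseline)
        prev = y
    return grad

# Toy loop: the reward is the fraction of sampled tokens equal to a target token.
rng = np.random.default_rng(0)
V, T, target = 4, 6, 2
logits = np.zeros((V, V))
baseline = 0.0
for _ in range(300):
    seq = sample_sequence(logits, T, rng)
    reward = seq.count(target) / T               # stand-in for the end reward
    logits += 0.5 * reinforce_gradient(logits, seq, reward, baseline)
    baseline = 0.9 * baseline + 0.1 * reward     # running-average baseline
```

Subtracting the baseline leaves the expected gradient unchanged because each `(one_hot - probs)` term has zero mean under the policy; it only reduces the variance of the update.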
3.3 Discriminators as Rewards

A good poem for an image has to satisfy at least two criteria: the poem (1) is relevant to the image, and (2) has some sense of poeticness concerning proper length, the poem’s language style and consistency between sentences. Based on these two requirements, we propose two discriminative networks to guide the generated poem: a multi-modal discriminator and a poem-style discriminator. Deep discriminative networks have been shown to be highly effective in text classification tasks [1, 35], especially for tasks for which good loss functions cannot be established. In this work, both discriminators we propose have several classes, including one positive class and several negative classes.

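Multi-Modal Discriminator. In order to check whether the generated poem $y$ is paired with the input image $x$, we train a multi-modal discriminator ($D_m$) to classify $(x, y)$ as paired, unpaired or generated. $D_m$ includes a multi-modal encoder, a modality fusion layer and a classifier with a softmax function:

$$h = \mathrm{LSTM}_\rho(y), \tag{9}$$

$$f = \tanh(W_c \cdot h + b_c) \odot \tanh(W_x \cdot x + b_x), \tag{10}$$

$$C_m = \mathrm{softmax}(W_m \cdot f + b_m), \tag{11}$$

where $\rho$, $W_c$, $b_c$, $W_x$, $b_x$, $W_m$, $b_m$ are parameters to be learned, $\odot$ is element-wise multiplication and $C_m$ denotes the probabilities over the three classes of the multi-modal discriminator. We utilize an LSTM-based sentence encoder for discriminator training. Equation 11 provides a way to generate the probability of $(x, y)$ being classified into each class, denoted by $C_m(c \mid x, y)$ where $c \in \{\text{paired}, \text{unpaired}, \text{generated}\}$.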
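The fusion layer and the three-way softmax classifier of the multi-modal discriminator can be sketched as follows. This is a hedged illustration: the LSTM sentence encoder is not implemented and is replaced by a given poem vector, and the parameter names, layer sizes and random weights are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multimodal_discriminator(poem_code, image_emb, p):
    """Return probabilities over the classes (paired, unpaired, generated).

    poem_code: encoded poem (stands in for the LSTM encoder output).
    image_emb: image embedding from the visual-poetic embedding model.
    """
    # modality fusion: element-wise product of the two projected modalities
    fused = np.tanh(p["W_c"] @ poem_code + p["b_c"]) * np.tanh(p["W_x"] @ image_emb + p["b_x"])
    return softmax(p["W_m"] @ fused + p["b_m"])

# Toy usage with made-up sizes: poem code H, image embedding K, fused layer F.
rng = np.random.default_rng(0)
H, K, F = 5, 6, 4
params = {
    "W_c": rng.normal(size=(F, H)), "b_c": np.zeros(F),
    "W_x": rng.normal(size=(F, K)), "b_x": np.zeros(F),
    "W_m": rng.normal(size=(3, F)), "b_m": np.zeros(3),
}
probs = multimodal_discriminator(rng.normal(size=H), rng.normal(size=K), params)
```

The probability assigned to the paired class is the quantity that is later used as the relevance reward for the generator.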

Poem-Style Discriminator. In contrast with most poem generation research, which emphasizes meter, rhyme or other traditional poetic techniques, we focus on free verse, which is an open form of poetry. Even so, we require our generated poems to have the quality of poeticness as defined in Section 1. Without making specific templates or rules for poems, we propose a poem-style discriminator ($D_p$) to guide generated poems towards human-written poems. In $D_p$, generated poems are classified into four classes: poetic, disordered, paragraphic and generated.

The class poetic serves as the positive class for poems that satisfy poeticness. The other three classes are all regarded as negative examples. The class disordered concerns the inner structure and coherence between the sentences of a poem, and the class paragraphic uses paragraph sentences as negative examples. In the poem-style discriminator, we use UniM-Poem as positive poetic samples. To construct disordered poems, we first build a poem sentence pool by splitting all poems in UniM-Poem. Examples of the class disordered are poems that we reconstruct from sentences randomly picked from the poem sentence pool, with a reasonable number of lines. The paragraph dataset provided by [16] is used for paragraphic examples.
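The construction of disordered negative examples can be sketched in a few lines. This is a minimal illustration under one stated assumption: a "reasonable" number of lines is drawn from the line counts observed in the real poems. The helper names and the placeholder poems are hypothetical.

```python
import random

def build_sentence_pool(poems):
    """Split every poem (a list of lines) into one flat pool of lines."""
    return [line for poem in poems for line in poem if line.strip()]

def make_disordered_poem(pool, line_counts, rng):
    """Assemble a fake poem from randomly picked lines of the pool.

    Assumption: the number of lines is sampled from line counts of real poems.
    """
    n_lines = min(rng.choice(line_counts), len(pool))
    return rng.sample(pool, n_lines)   # lines drawn without replacement

# Placeholder corpus standing in for UniM-Poem.
poems = [
    ["line a1", "line a2", "line a3"],
    ["line b1", "line b2"],
    ["line c1", "line c2", "line c3", "line c4"],
]
pool = build_sentence_pool(poems)
line_counts = [len(p) for p in poems]
disordered = make_disordered_poem(pool, line_counts, random.Random(0))
```

Each line of such a poem is individually poetic, but the lines come from unrelated poems, so the example isolates coherence between lines as the property the discriminator has to learn.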

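A completed generated poem is encoded by an LSTM and passed to a fully connected layer, and the probability of falling into each of the four classes is computed by a softmax function. This procedure is formulated as follows:

$$C_p = \mathrm{softmax}(W_p \cdot \mathrm{LSTM}_\eta(y) + b_p), \tag{12}$$

where $\eta$, $W_p$, $b_p$ are parameters to be learned. The probability of classifying a generated poem $y$ into a class $c$ is denoted by $C_p(c \mid y)$ where $c \in \{\text{poetic}, \text{disordered}, \text{paragraphic}, \text{generated}\}$.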

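Reward Function. We define the reward function for policy gradient as a linear combination of the probabilities of classifying the generated poem $y$ for an input image $x$ into the positive class of each discriminator (paired for the multi-modal discriminator $D_m$ and poetic for the poem-style discriminator $D_p$), weighted by a tradeoff parameter $\lambda$:

$$R(y \mid \cdot) = \lambda \, C_m(c = \text{paired} \mid x, y) + (1 - \lambda) \, C_p(c = \text{poetic} \mid y). \tag{13}$$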
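As a small worked example of the reward, the sketch below combines the two positive-class probabilities linearly. It assumes that the tradeoff parameter weights the multi-modal term and its complement weights the poem-style term; the function name and the numbers are illustrative only.

```python
def combined_reward(p_paired, p_poetic, lam):
    """Linear combination of the two positive-class probabilities.

    p_paired: probability of the 'paired' class from the multi-modal discriminator.
    p_poetic: probability of the 'poetic' class from the poem-style discriminator.
    lam: tradeoff parameter in [0, 1] (assumed to weight the multi-modal term).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_paired + (1.0 - lam) * p_poetic

# Illustrative values: equal weighting of relevance and poeticness.
r = combined_reward(p_paired=0.6, p_poetic=0.2, lam=0.5)
```

Setting the tradeoff parameter to 1 or 0 recovers a reward from a single discriminator, which corresponds to the single-discriminator settings compared in the experiments.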
3.4 Multi-Adversarial Training

Before adversarial training, we pre-train a generator based on image captioning method [27] which can provide a better policy initialization for generator. The generator and discriminators are iteratively updated in an adversarial way. The generator aims to generate poems that have higher rewards for both discriminators so that they can fool the discriminators while the discriminators are trained to distinguish the generated poems from paired and poetic poems. The probabilities of classifying generated poem into positive classes in both discriminators are used as rewards to policy gradient as explained above.

Multiple discriminators (two in this work) are trained by providing positive examples from the real data (paired poems for the multi-modal discriminator and the poem corpus for the poem-style discriminator) and negative examples from poems generated by the generator as well as other negative forms of real data (unpaired poems for the multi-modal discriminator, and paragraphs and disordered poems for the poem-style discriminator). Meanwhile, by employing policy gradient and Monte Carlo sampling, the generator is updated based on the expected rewards from the multiple discriminators. Since we have two discriminators, we apply a multi-adversarial training method that trains the two discriminators in parallel.

4 Experiments

4.1 Datasets

Figure 3: Examples in two datasets: UniM-Poem and MultiM-Poem.

| Name | #Poem | #Line/poem | #Word/line |
|---|---|---|---|
| MultiM-Poem | 8,292 | 7.2 | 5.7 |
| UniM-Poem | 93,265 | 5.7 | 6.2 |
| MultiM-Poem (Ex) | 26,161 | 5.4 | 5.9 |

Table 1: Detailed information about the three datasets. The first two datasets are collected by ourselves and the third one is extended by VPE.

To facilitate the research of poetry generation from images, we collected two poem datasets, in which one consists of image and poem pairs, namely Multi-Modal Poem dataset (MultiM-Poem), and the other is a large poem corpus, namely Uni-Modal Poem dataset (UniM-Poem). By using the embedding model we have trained, the image and poem pairs are extended by adding the nearest three neighbor poems from the poem corpus without redundancy, and an extended image and poem pair dataset is constructed and denoted as MultiM-Poem (Ex). The detailed information about these datasets is listed in Table 1. Examples of the two collected datasets can be seen in Figure 3. To better promote the research in poetry generation from images, we will release these datasets in the near future.
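The extension step can be sketched as a nearest-neighbor search in the shared embedding space. The sketch below is an assumption-laden illustration: it uses cosine similarity, and it interprets "without redundancy" as never assigning the same corpus poem to more than one image, which the text does not spell out. The function names and toy vectors are made up.

```python
import numpy as np

def l2_normalize(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)

def extend_pairs(image_embs, poem_embs, k=3):
    """Pair each image with its k most similar, not yet used, corpus poems."""
    sims = l2_normalize(image_embs) @ l2_normalize(poem_embs).T  # cosine similarity
    used, pairs = set(), []
    for i, row in enumerate(sims):
        picked = 0
        for j in np.argsort(-row):          # most similar poems first
            j = int(j)
            if j in used:                   # assumption: a poem is used only once
                continue
            used.add(j)
            pairs.append((i, j))
            picked += 1
            if picked == k:
                break
    return pairs

# Toy embeddings: two images and seven corpus poems in a 2-D space.
images = np.array([[1.0, 0.0], [0.0, 1.0]])
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                   [0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]])
new_pairs = extend_pairs(images, corpus, k=3)
```

Each returned pair is an (image index, poem index) tuple that would be added to the original human-annotated pairs to form the extended dataset.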

For the MultiM-Poem dataset, we first crawled 34,847 image-poem pairs on Flickr from groups that aim to illustrate human-written poems with images. Five human assessors majoring in English literature were then asked to label each pair as relevant or irrelevant by judging whether the image can exactly inspire the poem, considering the associations of objects, sentiments and scenes. We filtered out pairs labeled as irrelevant and kept the remaining 8,292 pairs to construct the MultiM-Poem dataset.

UniM-Poem is crawled from several public online poetry websites, such as Poetry Foundation and PoetrySoup. To achieve robust model training, a poem pre-processing procedure is conducted to filter out poems with too many or too few lines. We also remove poems with strange characters, poems in languages other than English and duplicate poems.

4.2 Compared Methods

To investigate the effectiveness of the proposed methods, we compare with four baseline models with different settings. The show-and-tell [27] and SeqGAN [35] models are selected due to their state-of-the-art results in image captioning. A competitive image paragraphing model is selected for its strong capability for modeling diverse image content. Note that all methods use MultiM-Poem (Ex) as the training dataset, and can generate multiple lines as poems. The detailed methods and experiment settings are as follows:

Show and tell (1CNN): CNN-RNN model trained with only the object CNN features from VGG-16.

Show and tell (3CNNs): CNN-RNN model trained with three CNN features from VGG-16.

SeqGAN: CNN-RNN model optimized with one discriminator that is used to tell generated poems from ground-truth poems.

Regions-Hierarchical: Hierarchical paragraph generation model based on [16]. To better align with the poem distribution, we restrict the maximum number of lines to 10 and each line to at most 10 words in the experiment.

Our Model: To demonstrate the effectiveness of the two discriminators, we train our model (Image to Poem with GAN, I2P-GAN) in four settings: pre-trained model without discriminators (I2P-GAN w/o discriminator), with the multi-modal discriminator only (I2P-GAN w/ $D_m$), with the poem-style discriminator only (I2P-GAN w/ $D_p$) and with both discriminators (I2P-GAN).

4.3 Automatic Evaluation Metrics

Evaluation of poems is generally a difficult task and there are no established metrics in existing works, not to mention for the new task of generating poems from images. To better assess the quality of the generated poems, we propose to evaluate them in both automatic and manual ways.

We propose to employ three evaluation metrics for automatic measurement, namely BLEU, novelty and relevance. An overall score is then summarized based on the three metrics after normalization.

BLEU. We first use the Bilingual Evaluation Understudy (BLEU) [22] score to examine how closely the generated poems approximate the ground-truth ones, as is usual in image captioning and paragraphing research. It is also used in some poem generation works [32]. For each image, we only use the human-written poems as ground-truth poems.

Novelty. By introducing the poem-style discriminator $D_p$, the generator is supposed to introduce words or phrases from the UniM-Poem dataset, resulting in words or phrases that are not very frequent in MultiM-Poem (Ex). We use novelty as proposed by [31] to measure the number of infrequent words or phrases observed in the generated poems. Two scales of n-gram are explored, namely bigram and trigram, as Novelty-2 and Novelty-3. We first rank the n-grams that occur in the training dataset of MultiM-Poem (Ex) by frequency and take the top 2,000 as frequent ones. Novelty is computed as the proportion of n-grams in the generated poem that occur in the training dataset but are not among the frequent ones.
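The novelty computation described above can be sketched with a frequency counter. This is one reading of the description, stated as an assumption: n-grams are taken over the whole token sequence of a poem, and the score is the share of the generated poem's n-grams that appear in the training data but outside the top-k most frequent ones. The function names and the toy data are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty(generated, training_poems, n, top_k=2000):
    """Share of generated n-grams seen in training but not among the top_k frequent."""
    counts = Counter(g for poem in training_poems for g in ngrams(poem, n))
    frequent = {g for g, _ in counts.most_common(top_k)}
    infrequent = set(counts) - frequent
    grams = ngrams(generated, n)
    if not grams:                      # poem shorter than n tokens
        return 0.0
    return sum(g in infrequent for g in grams) / len(grams)

# Toy data: "a b" is the single most frequent bigram in the training poems.
training = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "e"]]
score = novelty(["a", "b", "c", "x"], training, n=2, top_k=1)
```

In the toy example the generated poem has three bigrams: one is frequent, one is infrequent but seen in training, and one is unseen, so only the middle one counts toward novelty.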

Relevance. Different from poem generation research that places no or weak constraints on poem contents, we consider the relevance of the generated poem to the given image as an important measurement in this research. However, unlike captions, which are more concerned with factual descriptions of images, different poems can be relevant to the same image from various aspects. Thus, instead of computing relevance between the generated poem and ground-truth poems, we define relevance between a poem and an image using our learned deep coupled visual-poetic embedding model. After mapping the image and the poem to the same space through our embedding model, cosine similarity is used to measure their relevance. Although our embedding model can approximately quantify relevance between images and poems, we also leverage subjective evaluation to better explore the effectiveness of our generated poems at the human level.

Overall. We compute an overall score based on the above three metrics. For each value $a$ among all values $A$ of one metric, we first normalize it with the following method:

$$a_{\text{norm}} = \frac{a - \min(A)}{\max(A) - \min(A)}. \tag{14}$$

After that, we average the normalized values for BLEU (i.e., BLEU-1, BLEU-2 and BLEU-3) and for novelty (i.e., Novelty-2 and Novelty-3). A final score is computed by averaging the normalized values of the three metrics, to ensure equal contribution of the different metrics.
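The aggregation into an overall score can be sketched as below, assuming a min-max normalization per metric column. The text does not state over which set of values each metric is normalized, so this sketch normalizes across the compared methods and is not expected to reproduce the reported overall scores. The function names and the numbers are made up.

```python
def min_max(values):
    """Min-max normalize a list of values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # avoid division by zero for constant columns
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def overall_scores(relevance, bleu, novelty):
    """relevance: one value per method; bleu, novelty: dict of name -> values per method."""
    def group_mean(columns):
        normed = [min_max(col) for col in columns]
        return [sum(vals) / len(vals) for vals in zip(*normed)]

    rel = min_max(relevance)
    b = group_mean(list(bleu.values()))       # average over BLEU-1/2/3
    nv = group_mean(list(novelty.values()))   # average over Novelty-2/3
    # equal contribution of relevance, BLEU and novelty
    return [(r + x + y) / 3.0 for r, x, y in zip(rel, b, nv)]

# Made-up numbers for two hypothetical methods.
scores = overall_scores(
    relevance=[1.0, 2.0],
    bleu={"BLEU-1": [10.0, 20.0]},
    novelty={"Novelty-2": [5.0, 3.0]},
)
```

Averaging within the BLEU and novelty groups first keeps a metric family with more sub-scores from dominating the final score.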

However, in such an open-ended task, there are no particularly suitable metrics that can perfectly evaluate the quality of generated poems. The automatic metrics we use can be regarded as guidance to some extent. To better assess the generated poems from the perspective of human perception, we further conduct extensive user studies as follows.

4.4 Human Evaluation

We conducted human evaluation on Amazon Mechanical Turk (AMT). In particular, three types of tasks are assigned to AMT workers as follows:

Task1: to explore the effectiveness of our deep coupled visual-poetic embedding model, annotators were requested to give a score on a 0-10 scale to a poem given an image, considering their relevance in terms of content, emotion and scene.

Task2: the aim of this task is to compare the poems generated by different methods (four baseline methods and our four model settings) for one image on different aspects. Given an image, the annotators were asked to rate a poem on a 0-10 scale with respect to four criteria: relevance (to the image), coherence (whether the poem is coherent across lines), imaginativeness (how imaginative and creative the poem is for the given image) and overall impression.

Task3: a Turing test was conducted by asking annotators to select the human-written poem from a mix of human-written and generated poems. Note that the Turing test was implemented in two settings, i.e., poems with images and poems only.

For each task, we randomly picked 1K images, and each task is assigned to three assessors. As a poem is a form of literature, we also asked 30 annotators whose majors are related to English literature (among whom ten are native English speakers) to do the Turing test as expert users.

4.5 Training Details

In the deep coupled visual-poetic embedding model, we use deep features of the same dimension from each CNN. Object features are extracted from VGG-16 [26] trained on ImageNet [25], scene features from the Place205-VGGNet model [29], and sentiment features from the sentiment model [28].

To better extract visual features for poetic symbols, we first collect nouns, verbs and adjectives that occur at least five times in the UniM-Poem dataset. Then we manually pick adjectives and verbs for sentiment (328 labels), and nouns for object (604 labels) and scene (125 labels). As for poem features, we extract a combined skip-thought vector for each sentence (combining the uni-directional and bi-directional encodings), and finally obtain poem features by mean pooling. The contrastive margin is set based on the empirical experiments in [14]. We randomly select 127 poems as unpaired poems for an image and use them as contrastive poems (the contrastive terms in Equation 5), and we re-sample them in each epoch. We empirically set the tradeoff parameter by comparing automatic evaluation results over a range of values.

4.6 Evaluations

Figure 4: Example of poems generated by six methods for an image.
| | Ground-Truth | VPE w/o FT | VPE w/ FT |
| --- | --- | --- | --- |
| Relevance | 7.22 | 6.32 | 5.82 |

Table 2: Average score of relevance to images for three types of human-written poems on a 0-10 scale (0-irrelevant, 10-relevant). One-way ANOVA revealed that the differences among these poems are statistically significant.

Retrieved Poems. We compare three kinds of poems with respect to their relevance to images: ground-truth poems, poems retrieved with VPE and image features before fine-tuning (VPE w/o FT), and poems retrieved with VPE and fine-tuned image features (VPE w/ FT). Table 2 compares these three types of poems on a scale of 0-10 (0 means irrelevant and 10 means the most relevant). With the proposed visual-poetic embedding model, the retrieved poems achieve a relevance score above the midpoint of the scale (i.e., a score of five). Image features fine-tuned with poetic symbols can improve the relevance significantly.

Figure 5: Example of poems generated by our approach I2P-GAN.
| Method | Relevance | Novelty-2 | Novelty-3 | BLEU-1 | BLEU-2 | BLEU-3 | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Show and Tell (1CNN) [27] | 1.79 | 43.66 | 76.76 | 11.88 | 3.35 | 0.76 | 5.61 |
| Show and Tell (3CNNs) [27] | 1.91 | 48.09 | 81.37 | 12.64 | 3.34 | 0.8 | 11.89 |
| SeqGAN [35] | 2.03 | 47.52 | 82.32 | 13.40 | 3.72 | 0.76 | 15.86 |
| Regions-Hierarchical [16] | 1.81 | 46.75 | 79.90 | 11.64 | 2.5 | 0.67 | 2.35 |
| I2P-GAN w/o discriminator | 1.94 | 45.25 | 80.13 | 13.35 | 3.69 | 0.88 | 14.65 |
| I2P-GAN w/ multi-modal discriminator | 2.07 | 43.37 | 78.98 | 15.15 | 4.13 | 1.02 | 22.09 |
| I2P-GAN w/ poem-style discriminator | 1.90 | 60.66 | 89.74 | 12.91 | 3.05 | 0.72 | 16.00 |
| I2P-GAN | 2.25 | 54.32 | 85.37 | 14.25 | 3.84 | 0.94 | 27.57 |

Table 3: Automatic evaluation. Note that BLEU scores are computed against human-annotated ground-truth poems (one poem per image). The overall score is computed as an average of three metrics after normalization (Equation 14). All scores are reported as percentages (%).
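To make the BLEU columns concrete, the sketch below computes the clipped (modified) n-gram precision at the core of BLEU [22] for one generated poem against a single reference, matching the one-poem-per-image setup. It is a simplified illustration only: it omits the brevity penalty and the combination across n-gram orders, the exact BLEU configuration used for Table 3 is not stated here, and the two example sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Invented example sentences, not taken from any dataset.
generated = "the sun is warm and the sky is blue".split()
ground_truth = "the sun is bright and the sky is clear".split()

p1 = modified_precision(generated, ground_truth, 1)  # 7 of 9 unigrams match
p2 = modified_precision(generated, ground_truth, 2)  # 5 of 8 bigrams match
```

Because higher-order n-grams match far less often than unigrams, scores drop quickly from BLEU-1 to BLEU-3, which is the pattern visible in every row of Table 3.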

Generated Poems. Table 3 shows the automatic evaluation results of the proposed model under four settings, as well as the four baselines proposed in previous works. Comparing the results of the caption model with one CNN and with three CNNs, we can see that using multiple CNNs helps to generate poems that are more relevant to images. The Regions-Hierarchical model places more emphasis on topic coherence between sentences, whereas many human-written poems cover several topics or use different symbols for one topic. SeqGAN shows the advantage of applying adversarial training to poem generation over caption models that use only a CNN-RNN, but it falls short in generating novel concepts in poems. The better performance of our pre-trained model with VPE over the caption model demonstrates the effectiveness of VPE in extracting poetic features from images for better poem generation. Our three adversarially trained models outperform the others on most of the metrics, with each one performing best on one aspect. The model with only the multi-modal discriminator guides generation towards the ground-truth poems, so it achieves the highest BLEU scores, which emphasize n-gram similarity in a translation-like way. The poem-style discriminator is designed to guide the generated poem to be more poetic in language style, and the highest novelty score of I2P-GAN with only the poem-style discriminator shows that this discriminator helps to bring more novel and imaginative words into the generated poem. Overall, I2P-GAN combines the advantages of both discriminators, with reasonable intermediate BLEU and novelty scores, while still outperforming the other generation models. Moreover, our model with both discriminators generates the poems with the highest relevance on our embedding relevance metric.
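The per-metric claims in this paragraph can be checked directly against the numbers in Table 3. The short script below copies the table and looks up the best method for each column. One assumption to note: the two single-discriminator rows are labeled here according to the paragraph's own description (the multi-modal discriminator row has the highest BLEU, the poem-style discriminator row has the highest novelty), since the discriminator symbols are not legible in the table.

```python
METRICS = ["Relevance", "Novelty-2", "Novelty-3", "BLEU-1", "BLEU-2", "BLEU-3", "Overall"]

# Values copied from Table 3; row labels for the single-discriminator
# settings are an assumption based on the surrounding text.
TABLE_3 = {
    "Show and Tell (1CNN)": [1.79, 43.66, 76.76, 11.88, 3.35, 0.76, 5.61],
    "Show and Tell (3CNNs)": [1.91, 48.09, 81.37, 12.64, 3.34, 0.80, 11.89],
    "SeqGAN": [2.03, 47.52, 82.32, 13.40, 3.72, 0.76, 15.86],
    "Regions-Hierarchical": [1.81, 46.75, 79.90, 11.64, 2.50, 0.67, 2.35],
    "I2P-GAN w/o discriminator": [1.94, 45.25, 80.13, 13.35, 3.69, 0.88, 14.65],
    "I2P-GAN w/ multi-modal discriminator": [2.07, 43.37, 78.98, 15.15, 4.13, 1.02, 22.09],
    "I2P-GAN w/ poem-style discriminator": [1.90, 60.66, 89.74, 12.91, 3.05, 0.72, 16.00],
    "I2P-GAN": [2.25, 54.32, 85.37, 14.25, 3.84, 0.94, 27.57],
}

# For every metric, find the method with the highest score.
best = {
    metric: max(TABLE_3, key=lambda method: TABLE_3[method][i])
    for i, metric in enumerate(METRICS)
}

for metric, method in best.items():
    print(f"{metric}: {method}")
```

The lookup shows the split described in the text: the multi-modal discriminator setting leads all three BLEU columns, the poem-style discriminator setting leads both novelty columns, and the full I2P-GAN leads on relevance and on the overall score.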

| Method | Rel | Col | Imag | Overall |
| --- | --- | --- | --- | --- |
| Show and Tell (1CNN) [27] | 6.31 | 6.52 | 6.57 | 6.67 |
| Show and Tell (3CNNs) [27] | 6.41 | 6.59 | 6.63 | 6.75 |
| SeqGAN [35] | 6.13 | 6.43 | 6.50 | 6.63 |
| Regions-Hierarchical [16] | 6.35 | 6.54 | 6.63 | 6.78 |
| I2P-GAN w/o discriminator | 6.44 | 6.64 | 6.77 | 6.85 |
| I2P-GAN w/ multi-modal discriminator | 6.59 | 6.83 | 6.94 | 7.06 |
| I2P-GAN w/ poem-style discriminator | 6.53 | 6.75 | 6.80 | 6.93 |
| I2P-GAN | 6.83 | 6.95 | 7.05 | 7.18 |
| Ground-Truth | 7.10 | 7.26 | 7.23 | 7.37 |

Table 4: Human evaluation results of the compared methods on four criteria: relevance (Rel), coherence (Col), imaginativeness (Imag) and Overall. All criteria are evaluated on a 0-10 scale (0-bad, 10-good).

The human evaluation results are compared in Table 4. Unlike in the automatic evaluation, where Regions-Hierarchical performs poorly, here it scores slightly better than the caption model, because sentences that all address the same topic tend to leave a better impression on users. Our three models outperform the four baseline methods on all metrics. Both discriminators improve human assessments of the poems compared with the pre-trained model. The model with both discriminators generates better poems from images in terms of relevance, coherence and imaginativeness. Figure 4 shows one example of poems generated by three baselines and our methods for a given image. More examples generated by our approach can be seen in Figure 5.

| Data | Users | Ground-Truth | Generated |
| --- | --- | --- | --- |
| Poem w/ Image | AMT | 0.51 | 0.49 |
| Poem w/ Image | Expert | 0.60 | 0.40 |
| Poem w/o Image | AMT | 0.55 | 0.45 |
| Poem w/o Image | Expert | 0.57 | 0.43 |

Table 5: Accuracy of the Turing test for AMT users and expert users on poems with and without images.

Turing Test. For the Turing test with AMT annotators, we hired 548 workers, each completing 10.9 tasks on average. For experts, 15 people were asked to identify human-written poems shown with images, and another 15 annotators were asked to do the test with poems only. Each expert was assigned 20 images, giving 600 tasks conducted by expert users in total. Table 5 shows the probability of each type of poem being selected as the human-written poem for a given image. As we can see, the generated poems confuse both ordinary annotators and experts to a considerable degree, although experts identify the human-written poem more accurately than ordinary annotators. Interestingly, experts are better at identifying the correct poem when images are shown, while AMT workers do better with poems only.
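The task counts quoted above follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative, and the AMT total is only approximate because 10.9 is a reported average.

```python
# Figures as reported in the text.
amt_workers = 548
avg_tasks_per_worker = 10.9
experts_per_setting = {"poems with images": 15, "poems only": 15}
images_per_expert = 20

# Approximate number of AMT tasks (workers times average tasks per worker).
amt_tasks = round(amt_workers * avg_tasks_per_worker)

# Expert tasks per setting and in total.
expert_tasks_per_setting = {
    setting: n_experts * images_per_expert
    for setting, n_experts in experts_per_setting.items()
}
expert_tasks = sum(expert_tasks_per_setting.values())

print(amt_tasks, expert_tasks_per_setting, expert_tasks)
```

Each expert setting therefore rests on 300 judgments, far fewer than the AMT side, which is worth keeping in mind when comparing the expert and AMT rows of Table 5.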

5 Conclusion

As the first work on poetry (English free verse) generation from images, we propose a novel approach that models the problem by combining a deep coupled visual-poetic embedding model with RNN-based adversarial training, in which multiple discriminators provide rewards for policy gradient. Furthermore, we introduce the first image-poem pair dataset (MultiM-Poem) and a large poem corpus (UniM-Poem) to support research on poem generation, especially from images. Extensive experiments have demonstrated that our embedding model can approximately learn a reasonable visual-poetic embedding space. Automatic and manual evaluation results demonstrate the effectiveness of our poem generation model.


References

  • [1] T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and M. Sun. Show, adapt and tell: Adversarial training of cross-domain image captioner. In ICCV, 2017.
  • [2] X. Chen and C. L. Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. In CVPR, pages 2422–2431, 2015.
  • [3] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS, 2014.
  • [4] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, pages 1473–1482, 2015.
  • [5] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, pages 15–29, 2010.
  • [6] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.
  • [7] M. Ghazvininejad, X. Shi, Y. Choi, and K. Knight. Generating topical poetry. In EMNLP, pages 1183–1191, 2016.
  • [8] M. Ghazvininejad, X. Shi, J. Priyadarshi, and K. Knight. Hafez: An interactive poetry generation system. In ACL, pages 43–48, 2017.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
  • [10] J. He, M. Zhou, and L. Jiang. Generating Chinese classical poems with statistical machine translation models. In AAAI, 2012.
  • [11] J. Hopkins and D. Kiela. Automatically generating rhythmic verse with neural networks. In ACL, volume 1, pages 168–178, 2017.
  • [12] L. Jiang and M. Zhou. Generating Chinese couplets using a statistical MT approach. In COLING, pages 377–384, 2008.
  • [13] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, pages 1889–1897, 2014.
  • [14] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • [15] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In NIPS, pages 3294–3302, 2015.
  • [16] J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In CVPR, 2017.
  • [17] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
  • [18] Y. Liu, J. Fu, T. Mei, and C. W. Chen. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In AAAI, 2017.
  • [19] H. M. Manurung. A chart generator for rhythm patterned text. In Proceedings of the First International Workshop on Literature in Cognition and Computer, pages 15–19, 1999.
  • [20] H. Oliveira. Automatic generation of poetry: An overview. Universidade de Coimbra, 2009.
  • [21] H. G. Oliveira. PoeTryMe: A versatile platform for poetry generation. Computational Creativity, Concept Invention, and General Intelligence, 1:21, 2012.
  • [22] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002.
  • [23] C. C. Park and G. Kim. Expressing an image stream with a sequence of natural sentences. In NIPS, pages 73–81, 2015.
  • [24] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.
  • [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [27] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.
  • [28] J. Wang, J. Fu, Y. Xu, and T. Mei. Beyond object recognition: Visual sentiment analysis with deep coupled adjective and noun neural networks. In IJCAI, pages 3484–3490, 2016.
  • [29] L. Wang, S. Guo, W. Huang, and Y. Qiao. Places205-VGGNet models for scene recognition. arXiv preprint arXiv:1508.01667, 2015.
  • [30] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • [31] Z. Xu, B. Liu, B. Wang, C. Sun, X. Wang, Z. Wang, and C. Qi. Neural response generation via GAN with an approximate embedding layer. In EMNLP, pages 628–637, 2017.
  • [32] R. Yan, H. Jiang, M. Lapata, S.-D. Lin, X. Lv, and X. Li. i, Poet: Automatic Chinese poetry composition through a generative summarization framework under constrained optimization. In IJCAI, pages 2197–2203, 2013.
  • [33] X. Yi, R. Li, and M. Sun. Generating Chinese classical poems with RNN encoder-decoder. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pages 211–223. Springer, 2017.
  • [34] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, pages 4651–4659, 2016.
  • [35] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
  • [36] W. Zaremba and I. Sutskever. Reinforcement learning neural Turing machines-revised. arXiv preprint arXiv:1505.00521, 2015.
  • [37] X. Zhang and M. Lapata. Chinese poetry generation with recurrent neural networks. In EMNLP, pages 670–680, 2014.