Generating Thematic Chinese Poetry with Conditional Variational Autoencoder


Xiaopeng Yang, Xiaowen Lin, Shunda Suo, and Ming Li
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, ON, Canada N2L 3G1
{x335yang, x65lin, sdsuo, mli}

Computer poetry generation is our first step towards computer writing. Writing must have a theme. Current approaches based on sequence-to-sequence models with attention often produce non-thematic poems. We present a conditional variational autoencoder with an augmented word2vec architecture that explicitly represents the topic or theme information. This approach significantly improves the relevance of the generated poems by representing each line of the poem not only in a context-sensitive manner but also in a holistic way that is highly related to the given keyword and the learned topic. The proposed augmented word2vec model further improves the rhythm and symmetry. We also present a straightforward evaluation metric, the RHYTHM score, to automatically measure the rule-consistency of generated poems. Tests show that 45.24% of the poems generated by our model are judged by human evaluators to have been written by real people.


1 Introduction

Poetry is a beauty of simplicity. Its abstractness, concise format, and strict rules provide regularities that make it a natural first target for language generation. Such regularity is especially pronounced in classical Chinese poetry, for example in the quatrain, where each poem (1) consists of four lines, each with five (or seven) characters, (2) requires the last characters of the second and fourth lines to rhyme, and (3) follows a tonal pattern that requires characters in particular positions to hold particular tones, Ping (level tone) or Ze (downward tone) Wang (2002). An example of a quatrain written by Bo Wang, a famous poet of the Tang Dynasty, is shown in Table 1. As illustrated there, a good quatrain should follow all three pattern regularities mentioned above.

Besides the rules, a poem is an expression of a certain theme or human emotion, and it has to maintain consistent semantic meaning and emotional expression. Even for people, it is not trivial to create a quatrain that follows the rules of rhythm and tone while expressing a consistent theme or emotion. Automatically generating poetry that expresses what we want it to express is a primary task of language generation.



Long stay by the Yangtze River,
Thousands of Miles away from home,
Yellow Leaves in late autumn wind,
Fall and float in hills, make me sad.
Table 1: An example of five-character quatrain written by Bo Wang. The tonal pattern is shown at the end of each line, where ‘P’ indicates a level tone, ‘Z’ indicates a downward tone, and ‘*’ indicates the tone can be either. The translation is from Tang (2005).

Major progress has been made in poetry generation, from rule/template-based methods He et al. (2012); Yan et al. (2013), to statistical machine translation (SMT) models Jiang and Zhou (2008); He et al. (2012), and to deep learning methods Bahdanau et al. (2014); Wang et al. (2016a); Zhang and Lapata (2014); Wang et al. (2016b); Zhang et al. (2017). Neural approaches have received much more attention recently and have proven capable of generating more fluent poems.

Even though existing approaches have shown great power in automatic poetry generation, they still suffer from a major problem: the lack of a consistent theme representation and unique emotional expression. Take the poem shown in Table 1 for instance: its consistent theme is nostalgia, and every single line relates closely to that theme and emotion. Recent work Wang et al. (2016b); Hopkins and Kiela (2017) has tried to generate poems with a smooth and consistent theme by using a topic-planning scheme or similar-word extensions. It remains hard for these methods to represent topics explicitly and use them to further improve the quality of generated poems.

In this paper, we try to solve the difficulty of learning the theme representation while leveraging it to boost the generation of corresponding poems. As the Variational AutoEncoder (VAE) Kingma and Welling (2013); Rezende et al. (2014) has proven effective at representing topics through learned latent variables for text generation Bowman et al. (2015); Serban et al. (2017); Semeniuta et al. (2017), we regard the VAE as a possible solution to the above problem. Moreover, since most written poems are composed under a certain “intent,” we turn to the Conditional Variational AutoEncoder (CVAE), a recent modification of the VAE that generates diverse images/texts conditioned on certain attributes Yan et al. (2016b); Sohn et al. (2015); Zhao et al. (2017). In our work, part of the “intent” is represented in the form of keywords serving as conditions for the VAE, and the rest is expressed by the latent variables learned by the CVAE. To further improve the rhythm and symmetry delivered in poems, we propose to add vertical slices of poems as additional sentences in the training data for the word2vec model, and call the result an augmented word2vec model. In addition, because it is difficult to automatically judge the quality of generated poems, we propose a straightforward and easily applied evaluation metric to measure poetry rule-consistency. Specifically, the contributions of this paper can be summarized as follows:

  • We propose to use conditional variational autoencoders to learn the theme information from poetry lines. To the best of our knowledge, this represents one of the first attempts at using variational autoencoders for poetry generation.

  • We present an augmented word2vec model (AW2V) to improve the rhythm and symmetry delivered in poems. Experiments show that AW2V not only boosts the rule-consistency of generated poems, but can also be used to search for characters with similar semantic meanings in Chinese poems.

  • We introduce a simple but reasonable evaluation metric named RHYTHM score to automatically evaluate the structural adherence of generated poems.

  • We build a Chinese poetry generation system that takes the author's “intent” into the generation process. The experimental results show that our system, using the proposed approach, is able to generate good quatrains which satisfy the rules and carry a consistent topic and unique emotion.

2 Related Work

Poetry generation is our first step in experimenting with language generation. According to the methodology used, we categorize existing methods into three major directions: approaches based on rules/templates He et al. (2012); Yan et al. (2013), approaches using Statistical Machine Translation (SMT) models Jiang and Zhou (2008); He et al. (2012), and approaches using neural networks Bahdanau et al. (2014); Wang et al. (2016a); Zhang and Lapata (2014); Wang et al. (2016b); Zhang et al. (2017); Xie et al. (2017); Hopkins and Kiela (2017).

The first kind of approach is based on rules and/or templates, such as phrase search Tosa et al. (2008); Wu et al. (2009), word association norm Netzer et al. (2009), template search Oliveira (2012), genetic search Zhou et al. (2010), and text summarization Yan et al. (2013).

The second kind of approach involves statistical machine translation. Rather than hand-designing rules or templates, SMT-based approaches, whose parameters are derived from the analysis of bilingual text corpora, treat the previous line of each poem as the source-language sentence and the following line as the target-language sentence in a Machine Translation (MT) task Jiang and Zhou (2008); He et al. (2012).

Figure 1: The framework of our Chinese poetry generation approach. The symbol ⊕ denotes the concatenation of input vectors [best viewed in color].

Because all the approaches mentioned above are based on the superficial meanings of words or characters, they lack a deep understanding of the poems' semantic meaning. To address this issue, many approaches using neural networks have been proposed and have attracted much attention in recent years. For example, Zhang and Lapata (2014) proposed an approach using a Recurrent Neural Network (RNN) that generates each new poem line character by character (see also Hopkins and Kiela (2017)), with all the previously generated lines as contextual input. Experimental results show that quatrains of reasonable quality can be generated with this approach. Following this RNN-based approach, Wang et al. (2016a) proposed a character-based RNN treating a poem as one long character sequence, which can easily be extended to various genres such as Song Iambics. This approach has the advantages of flexibility and easy implementation, but the long generation sequence makes the poem's theme unstable. To avoid this, Wang et al. (2016a) further brought the attention mechanism Bahdanau et al. (2014) into the RNN-based framework and encoded human intention to guide the poetry generation. Yan (2016) proposed an RNN-based poetry generation model with an iterative polishing scheme: users' writing intent is encoded first and then decoded with a hierarchical recurrent neural network. Wang et al. (2016b) proposed a two-step generation method, which first plans the sub-topics of the poem according to the user's writing intent, and then generates each line sequentially using an attention-based sequence-to-sequence model, where attention is placed not only on the human input but also on all characters generated so far. Recently, Zhang et al. (2017) proposed a memory-augmented neural model that tries to imitate poets' writing process.
This approach uses the augmented memory to refine poems generated by the neural model, which balances the requirements of linguistic accordance and aesthetic innovation to some extent. Parallel efforts have been made in generating English poems; more recent works include Xie et al. (2017); Hopkins and Kiela (2017). For example, Hopkins and Kiela (2017) considered adding a list of similar words to a key theme.

We follow the third type of approach to automatically generate Chinese poetry. As introduced above, all of the neural models mentioned attempt to produce poems with regulated rules, a consistent theme, and meaningful semantics, but none of them considers representing the poem's theme explicitly and using it to further boost the results. To address this issue, we propose a conditional variational autoencoder with an augmented word2vec architecture, in which the learned latent variables combined with the conditional keywords convey topical information for the entire poem.

3 Approaches

3.1 Overview

As most human poets write poems from a sketch of ideas, we use a two-stage Chinese poem generation approach: writing intent representation and thematic poem generation. Specifically, our system can take a word, a sentence, or even a document as input containing the user's writing intent, and then sequentially generate a rule-complying and theme-consistent poem using an improved conditional variational autoencoder. Similar work has been done in Wang et al. (2016b); the main distinction is that the neural model implemented in our work is a generative one.

The framework of our Chinese poetry generation approach using the proposed Conditional Variational AutoEncoder with Augmented Word2Vec model (AW2V_CVAE) is illustrated in Fig. 1. Suppose the input query “冬天雪花纷飞” (“The snowflakes are flying in winter”) is given. In the writing intent representation stage, the sentence is transformed into four keywords $(k_1, \dots, k_4)$, i.e., “冬天” (winter), “雪花” (snowflake), “纷飞” (fly), and “庭院” (courtyard), where $k_i$ represents the sub-topic for the corresponding $i$-th line $L_i$. In the thematic poem generation stage, assuming that keywords alone are not enough to convey the topic information for the entire poem, each line is first encoded into a latent variable $z$ to learn a distribution over potential writing intent by a prior network, and is then generated by decoding from the concatenation of the learned latent variable $z$ and the extracted or expanded keyword $k_i$. As a result, the poem is created automatically not only from the sub-topic provided by the corresponding keyword, but also from the topic information stored in the latent variables, which are learned from the current line $L_i$, the previously generated lines $L_{1:i-1}$, and the corresponding keyword $k_i$. Note that the seven-character quatrain given in Fig. 1 was produced automatically by our generation system.
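The two-stage process just described can be sketched as a small pipeline. The three callables below (`extract_keywords`, `prior_net`, `decode_line`) are hypothetical stand-ins for the trained components, not the paper's actual implementation:

```python
def generate_poem(query, extract_keywords, prior_net, decode_line):
    """Two-stage generation sketch: (1) turn the query into four keywords,
    (2) for each line, sample a latent variable from the prior network
    conditioned on the keyword and the lines generated so far, then decode.
    All three callables are hypothetical stand-ins for trained components.
    """
    keywords = extract_keywords(query)          # stage 1: writing intent
    lines = []
    for k in keywords:                          # stage 2: thematic generation
        z = prior_net(k, lines)                 # latent writing intent
        lines.append(decode_line(z, k))         # decode from [z; k]
    return lines
```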

3.2 Writing Intent Representation

Because each line of a quatrain consists of only five or seven characters, we hypothesize that the sub-topic of each line can be represented by one keyword. It is therefore important to evaluate the importance of the words extracted from the input query provided by the user. To this end we use the TextRank algorithm Mihalcea and Tarau (2004), a graph-based ranking algorithm based on PageRank Brin and Page (2012). In the TextRank graph, each vertex represents a candidate word and an edge between two words indicates their co-occurrence, with the edge weight set according to the total co-occurrence strength between the two words. The TextRank score is computed iteratively until convergence according to the following equation:

$$S(V_i) = (1 - d) + d \sum_{V_j \in E(V_i)} \frac{w_{ji}}{\sum_{V_k \in E(V_j)} w_{jk}} S(V_j), \quad (1)$$

where $w_{ji}$ is the weight of the edge between nodes $V_j$ and $V_i$, $E(V_i)$ is the set of vertices connected with $V_i$, and $d$ is a damping factor. Empirically, the damping factor is usually set to 0.85, and the initial score $S(V_i)$ is set to 1.
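As a concrete illustration, the iteration above can be sketched in Python over a weighted, undirected co-occurrence graph. This is a minimal sketch: the graph construction is simplified and a fixed iteration count stands in for a convergence test:

```python
from collections import defaultdict

def textrank(edges, d=0.85, iters=50):
    """Iteratively compute TextRank scores on a weighted co-occurrence graph.

    edges: dict mapping (word_a, word_b) -> co-occurrence weight (undirected).
    Returns a dict word -> score; all scores start at 1 as in the paper.
    """
    # Build adjacency: neighbors and edge weights for each node.
    nbrs = defaultdict(dict)
    for (a, b), w in edges.items():
        nbrs[a][b] = w
        nbrs[b][a] = w
    scores = {v: 1.0 for v in nbrs}
    for _ in range(iters):
        new = {}
        for v in nbrs:
            s = 0.0
            for u, w_uv in nbrs[v].items():
                denom = sum(nbrs[u].values())  # total weight incident to u
                s += w_uv / denom * scores[u]
            new[v] = (1 - d) + d * s
        scores = new
    return scores
```

Words with many strong co-occurrence links accumulate higher scores, which is exactly what the keyword-selection step relies on.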

When the number of keywords extracted from the user's input query is less than the required number, we leverage an RNNLM-based method (RNNLM: Recurrent Neural Network Language Model) Mikolov et al. (2010) and a knowledge-based method Wang et al. (2016b) to extend the keywords, selecting the candidate word with the highest TextRank score as the new keyword.

Figure 2: The training procedure of poetry generation using CVAE. The black dashed lines represent the residual connection between layers [best viewed in color].

3.3 Conditional Variational Autoencoder with Augmented Word2Vec

Compared with the Variational AutoEncoder (VAE) Kingma and Welling (2013); Rezende et al. (2014), one of the most popular generative models, the Conditional Variational AutoEncoder (CVAE) is a recent modification that generates diverse images or texts conditioned on certain attributes, such as generating different human faces given a skin color Yan et al. (2016b); Sohn et al. (2015) or generating various textual responses given dialog contexts Zhao et al. (2017). For Chinese poetry generation, since most human poets create poems from a plain outline, we believe that the keywords $(k_1, \dots, k_4)$ obtained from the first stage of our generation framework can partially represent the user's writing intent, and we regard them as conditions for the CVAE.

We define the conditional distribution $p(x, z \mid k) = p(x \mid z, k)\,p(z \mid k)$, and set the learning target to approximating $p(z \mid k)$ and $p(x \mid z, k)$ via deep neural networks parameterized by $\theta$. The CVAE is trained to maximize the conditional log-likelihood of $x$ given $k$ while minimizing the KL divergence between the posterior distribution and the prior distribution $p_\theta(z \mid k)$. Inspired by Zhao et al. (2017), we use a recognition network $q_\phi(z \mid x, k)$ to approximate the true posterior distribution and a prior network $p_\theta(z \mid k)$ to approximate the prior distribution. To sum up, the objective of the CVAE takes the following form:

$$\mathcal{L}(\theta, \phi; x, k) = \mathbb{E}_{q_\phi(z \mid x, k)}\left[\log p_\theta(x \mid z, k)\right] - \mathrm{KL}\left(q_\phi(z \mid x, k)\,\|\,p_\theta(z \mid k)\right). \quad (2)$$
As shown in Eqn. 2, the objective of the CVAE is a valid lower bound on the true log-likelihood of the data under conditions $k$, making the CVAE a generative model. The generative process of $x$ can then be summarized as sampling a latent variable $z$ from $p_\theta(z \mid k)$ and generating $x$ through $p_\theta(x \mid z, k)$. The CVAE can be efficiently trained with the Stochastic Gradient Variational Bayes (SGVB) framework Kingma and Welling (2013) by maximizing the variational lower bound of the conditional log-likelihood Sohn et al. (2015). Fig. 2 illustrates the training procedure of poetry generation using the CVAE. As shown in Fig. 2, we use a Bidirectional Recurrent Neural Network (BRNN) Schuster and Paliwal (1997) with Long Short-Term Memory (LSTM) units Hochreiter and Schmidhuber (1997) as an encoder, encoding each concatenation of the current line $L_i$, the corresponding keyword $k_i$, and the previously generated lines $L_{1:i-1}$ into a fixed-size vector by concatenating the last hidden states of the forward and backward RNNs, $\overrightarrow{h}$ and $\overleftarrow{h}$. Then $x$ can simply be represented by $[\overrightarrow{h}; \overleftarrow{h}]$. We adopt multiple layers Hinton (2007) in both the encoder and the decoder, and residual connections He et al. (2016) between layers are used to help learn a descriptive latent variable $z$. We suppose $z$ follows a multivariate Gaussian distribution with a diagonal covariance matrix for both the recognition network and the prior network, and then we have:

$$q_\phi(z \mid x, k) \sim \mathcal{N}(\mu, \sigma^2 I), \qquad p_\theta(z \mid k) \sim \mathcal{N}(\mu', \sigma'^2 I).$$
We use the reparameterization trick Kingma and Welling (2013): $z$ is sampled from the recognition network during training and predicted by the prior network during testing. Finally, the multi-layer LSTM decoder predicts the characters sequentially, with its initial state set to the concatenation of $z$ and the keyword $k_i$.
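A minimal numerical sketch of the two ingredients above, the reparameterization trick and the KL term between two diagonal Gaussians, might look as follows (NumPy only; in the actual model, $\mu$ and $\log \sigma^2$ are produced by the recognition and prior networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (Kingma & Welling, 2013),
    keeping the sampling step differentiable w.r.t. mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ),
    i.e. the KL term between the recognition and prior distributions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )
```

When the two distributions coincide, the KL term vanishes; during training the objective in Eqn. 2 trades this term off against reconstruction quality.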

3.4 Optimization

Although the CVAE has achieved impressive results in image generation, it is non-trivial to adapt it to natural language generation due to the vanishing latent variable problem Bowman et al. (2015); Zhao et al. (2017). KL annealing Bowman et al. (2015), which gradually increases the weight of the KL term from 0 to 1 during training, is a powerful tool against the vanishing latent variable problem. Another solution, word-drop decoding, which sets a certain percentage of the target words to 0, may hurt performance when the drop rate is too high. We therefore adopt KL annealing rather than word-drop decoding when training the CVAE.
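A linear KL-annealing schedule of the kind described can be sketched in a few lines; `total_annealing_steps` is an assumed hyperparameter for illustration, not a value from the paper:

```python
def kl_weight(step, total_annealing_steps=10000):
    """Linear KL-annealing schedule (Bowman et al., 2015): the weight on the
    KL term grows from 0 to 1 over the first `total_annealing_steps` updates,
    so the model learns to reconstruct before the latent code is regularized.
    """
    return min(1.0, step / total_annealing_steps)

def annealed_loss(reconstruction_loss, kl_term, step):
    """Training objective with the annealed KL weight applied."""
    return reconstruction_loss + kl_weight(step) * kl_term
```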

Since we use the combination of the keywords extracted from the user's query and the latent vector learned by the CVAE to represent the poetry theme, the representation of keywords is, to some extent, key to the performance. We therefore mine the nature of quatrains to obtain a good representation of poetry words. We notice that in some lines of quatrains, mostly the third and fourth lines, characters at the same position in the two lines often match each other under certain constraints on semantic and/or syntactic relatedness.

Take the two lines “千山鸟飞绝 (A thousand mountains without birds flying), 万径人踪灭 (Ten thousand paths without a footprint)” of the famous five-character quatrain “江雪” (River Snow) as an example: the characters “千” (thousand) and “万” (ten thousand) both represent numbers, while “绝” (gone) and “灭” (disappeared) both deliver similar meanings of nonexistence. Even though the constraints on quatrains are not as strict as those on Chinese antithetical couplets Yan et al. (2016a), we propose to initialize the word-embedding vectors using an augmented word2vec model (AW2V) to further enhance the rhythm and symmetry delivered in poems. This model adds vertical slices of poems as additional sentences to the training data of word2vec Mikolov et al. (2013). AW2V not only boosts the rule-consistency of generated poems, but can also be used to search for characters with similar semantic meanings in Chinese poems. We demonstrate the capability of AW2V in Section 5.1.3.
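The data augmentation behind AW2V can be sketched as follows, assuming each quatrain is given as four equal-length lines and each character is treated as a word2vec token; the resulting sentence list would then be fed to a standard word2vec trainer:

```python
def augment_with_vertical_slices(quatrains):
    """Augment word2vec training data with 'vertical slices' of quatrains:
    for each poem, the characters at the same position across the four lines
    are joined into an extra pseudo-sentence, so position-matched characters
    (e.g. 千/万, 绝/灭) share training contexts.

    quatrains: list of poems, each a list of four equal-length lines (strings).
    Returns a list of tokenized sentences (lists of single characters).
    """
    sentences = []
    for lines in quatrains:
        sentences.extend(list(line) for line in lines)        # original lines
        for pos in range(len(lines[0])):
            sentences.append([line[pos] for line in lines])   # vertical slice
    return sentences
```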

4 Experimental Setup

4.1 Dataset

Two large-scale datasets are used in our experiments. The first is a Chinese poem corpus (CPC) containing 284,899 traditional Chinese poems in various genres, including Tang quatrains, Song Iambics, Yuan Songs, and Ming and Qing poems. We use this dataset to train the word embeddings for Chinese characters. Since we focus on generating quatrains, which have four lines of the same length of five or seven characters each, we filter 76,305 quatrains from CPC, forming the Chinese quatrain corpus (CQC), to train the neural network model. Specifically, we randomly choose 2,000 poems for validation, 1,000 for testing, and the remaining non-overlapping ones for training. We segment all the poems into words and calculate the TextRank score for each word. Then, the word with the highest TextRank score is selected as the keyword for each line, so that each quatrain owns four keywords.
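The per-line keyword selection described above can be sketched as follows, assuming pre-segmented lines and a precomputed TextRank score table:

```python
def line_keywords(segmented_lines, textrank_scores):
    """Pick the highest-TextRank word of each segmented line as its keyword.

    segmented_lines: list of lines, each a list of words.
    textrank_scores: dict word -> TextRank score (unseen words score 0).
    """
    return [max(line, key=lambda w: textrank_scores.get(w, 0.0))
            for line in segmented_lines]
```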

4.2 Training

We choose the 6,000 most frequently used characters as the vocabulary. The word-embedding vectors are initialized with our proposed augmented word2vec model (AW2V), with the dimension set to 128. The recurrent hidden layers of the encoder and decoder contain 128 hidden units each, and the number of layers in both the encoder and the decoder is set to 4. We use 64-dimensional latent variables. The model's parameters are randomly initialized from a uniform distribution with support [-0.08, 0.08]. The model is trained with the AdaDelta algorithm Zeiler (2012), with a mini-batch size of 64 and a learning rate of 0.001. In addition, the dropout technique Srivastava et al. (2014) is adopted with a dropout rate of 0.2. The perplexity on the validation set is used for early stopping to avoid over-fitting.

5 Evaluation

Generally, it is difficult to judge the quality of poems generated by computers. We conduct both automatic and human evaluations to verify the feasibility and effectiveness of our proposed Chinese poetry generation approach.

As the comparative approach, we mainly compare our proposal with the attention-based sequence-to-sequence model (AS2S) presented in Wang et al. (2016b), which has proven capable of generating different genres of Chinese poems. The reasons we choose the attention-based sequence-to-sequence model rather than others can be summarized in three aspects. First, this model has been fully compared with previous methods such as SMT, RNNLM, RNNPG, and ANMT in Wang et al. (2016b) and shown to be better than all of them. Second, the first phase of our approach, i.e., writing intent representation, is similar to the procedure introduced in Wang et al. (2016b), while the second phase is completely different; comparing our framework with theirs therefore isolates the effect of our proposed Conditional Variational AutoEncoder with Augmented Word2Vec model (AW2V_CVAE). Third, since poetry generation is a subjective matter, human experts must inevitably be invited to provide feedback on the quality of generated poems; to reduce human effort, it is reasonable to compare against one of the most popular approaches achieving state-of-the-art performance.

Approach Training Validation Testing
Perplexity KL cost Perplexity KL cost Perplexity KL cost
AS2S 31.54 - 42.77 - 41.13 -
VAE 28.26 8.0635e-3 42.13 7.0086e-3 40.57 6.1395e-3
CVAE 30.11 7.7867e-3 42.09 0.0113 40.86 0.0110
AW2V_VAE 25.73 0.1194 43.03 8.9510e-3 41.46 7.3427e-3
AW2V_CVAE 29.16 0.0126 42.6 0.0179 41.23 0.0155
Table 2: Language modeling results of Chinese quatrain corpus, reported as the reconstruction perplexity and KL terms on training, validation, and testing dataset.

5.1 Automatic Evaluation

5.1.1 Rule-Consistency Evaluation

Some previous works Wang et al. (2016a) use the BLEU score Papineni et al. (2002) to measure and compare the effects of different approaches. However, poetry generation is a creative task rather than word-by-word translation, which makes BLEU less appropriate for poetry generation than for machine translation. In addition, Chinese quatrains have strict regulations, and a good quatrain should follow particular tonal and structural rules. Therefore, we propose a new RHYTHM score instead of BLEU to automatically measure the effects of the various methods. We define the RHYTHM score as

$$\text{RHYTHM} = \frac{1}{4} \sum_{i=1}^{4} \mathbb{1}\left[ r_i \in T_{n_i} \cup R_{n_i} \right],$$

where $L_i$ denotes the $i$-th line of the poem, $n_i$ is the number of characters of line $L_i$, $r_i$ is the tonal/rhyming rule realized by line $L_i$, and $T_{n_i}$ and $R_{n_i}$ represent the sets of admissible tonal patterns and rhyming patterns for lines of length $n_i$, respectively.
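Under one reading of this metric, a minimal sketch of the computation follows; the pattern sets are placeholders here, since the admissible patterns come from classical quatrain regulations:

```python
def rhythm_score(poem_rules, tonal_patterns, rhyming_patterns):
    """Fraction of lines whose realized rule appears in the admissible
    tonal or rhyming pattern sets (a sketch of the RHYTHM metric).

    poem_rules: list of the rule string realized by each line of the poem.
    tonal_patterns / rhyming_patterns: sets of admissible pattern strings.
    """
    ok = sum(1 for r in poem_rules
             if r in tonal_patterns or r in rhyming_patterns)
    return ok / len(poem_rules)
```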

Approach Mean Standard Deviation
Groundtruth 0.8975 0.1818
AS2S 0.8356 0.2483
VAE 0.8536 0.2342
CVAE 0.8456 0.2378
AW2V_VAE 0.8732 0.2216
AW2V_CVAE 0.8846 0.2075
Table 3: RHYTHM scores of different generation approaches in test dataset containing 1,000 quatrains. The mean value and standard deviation value in terms of the RHYTHM score are reported respectively.

The results of the automatic evaluation based on the RHYTHM score are shown in Table 3. A higher mean and a lower standard deviation indicate a better capability of generating poems with regulated rhythm and structure. Note that Groundtruth denotes the human-written poems in the test dataset. Besides AS2S, we also use the Variational AutoEncoder (VAE) as a comparative approach; its framework is similar to the CVAE's, the difference being that the VAE uses only the learned latent vector, without the keywords, to represent the poetry's thematic information.

From Table 3, we find that our proposed conditional variational autoencoder with the augmented word2vec model (AW2V_CVAE) outperforms the other methods in terms of RHYTHM. AS2S obtains a mean RHYTHM score of 0.8356 with a standard deviation of 0.2483. Directly applying VAE and CVAE improves the mean RHYTHM score to 0.8536 and 0.8456, respectively, which likely stems from the introduction of latent variables. Comparing CVAE with VAE, we notice a slight drop for CVAE, indicating that the raw keyword representations from the original word2vec model are problematic; this motivates our augmented word2vec model (AW2V) for better representations of poetry characters. The last two rows of Table 3 show that the approaches with AW2V perform better than their counterparts without it: AW2V_VAE and AW2V_CVAE both obtain a higher mean RHYTHM score and a lower standard deviation than VAE and CVAE, respectively. This demonstrates that exploiting the positions of matching characters in quatrains improves the word2vec representation and further pushes the generated poems to comply with the regulations. Among all the comparative approaches, AW2V_CVAE achieves the best performance, raising the mean RHYTHM score by 5.86%, from 0.8356 for AS2S to 0.8846, on the entire test dataset. To sum up, our proposed AW2V_CVAE can generate rule-complying poems with a competitive RHYTHM score on average.

Approach Readability Consistency Aesthetic Evocative Overall Identification Probability
Groundtruth 3.38 3.34 3.26 3.33 3.35 69.86%
AS2S 3.22 3.12 3.08 3.09 3.13 30.40%
AW2V_CVAE 3.33 3.17 3.18 3.14 3.2 45.24%
Table 4: Average scores for five subjective evaluation metrics and the average probability of poems identified as humanly written poetry.

5.1.2 Poetry Language Modeling Results

Language modeling results on the Chinese quatrain corpus are shown in Table 2, which reports the reconstruction perplexity and KL terms on the training, validation, and testing datasets.

From Table 2, we find that our proposed AW2V_CVAE obtains steady, good performance on all datasets. Due to the vanishing latent variable problem, we notice that it is much harder for the poetry generation task to learn a high KL cost than for general natural language generation, even though KL annealing is adopted during training. We conjecture this is due to the brevity of poems. Compared with VAE and CVAE, their counterparts with our proposed augmented word2vec model, i.e., AW2V_VAE and AW2V_CVAE, obtain a relatively higher KL cost, which demonstrates a better capability of learning latent semantics.

5.1.3 Poetry Character Similarity

We measure the similarity of poetry characters to verify the superiority of our proposed augmented word2vec model (AW2V) over the original word2vec model (W2V).

Taking the same poem “江雪” (River Snow) mentioned in Section 3.4 as an instance, the similarity between 千 (thousand) and 万 (ten thousand) is 0.4389 using AW2V, versus 0.4039 using W2V. It is worth noting that 绝 (gone) and 灭 (disappeared) get a similarity score of 0.2745 under AW2V but only 0.0205 under W2V. Beyond that, we can use AW2V to search for similar words: for instance, searching for words similar to 年 (year) returns 旬 (ten days), 番 (multiple times), and 时 (time), all time-related Chinese characters.

5.2 Human Evaluation

We launch a crowd-sourced online study asking users to distinguish automatically generated poetry from human-written poetry, in order to validate the effect of our proposed approach on subjective expression. This evaluation is similar to the Turing Test Turing (1950). All of the participants are well educated and have a great passion for poetry.

Since there are no explicit topics in the CQC dataset, each quatrain's four keywords are obtained as in Section 4.1: the poems are segmented into words and the word with the highest TextRank score is selected as the keyword for each line. We assume these four keywords can represent the thematic information of each poem. Therefore, we use the poems in the test dataset as the human-written comparative quatrains, and each poem's four keywords as the input query for both AS2S and AW2V_CVAE, in order to make the human evaluation a fair comparison.

Figure 3: The human evaluation interface asking participants to rate poetry based on several subjective evaluation metrics and distinguish between automatically generated poems and humanly written ones [best viewed in color].

Specifically, there are 1,000 poems in each of the three categories: human-written, AS2S-generated, and CVAE-generated. During each round of the evaluation, we randomly select 30 poems from these categories with probabilities of 50%, 25%, and 25%, respectively; the selection probabilities are not revealed to the participants. Even though the four keywords are not shown in the evaluation interface, we believe that humans can judge the thematic consistency between the displayed poetry lines. We do not push participants to rate all poems in a round, i.e., they can stop the evaluation job whenever they feel exhausted. Besides this, all participants are asked to rate each poem on five subjective evaluation metrics: Readability (whether the sentences read smoothly and fluently), Consistency (whether the entire poem delivers a consistent theme), Aesthetic (whether the quatrain stimulates any aesthetic feeling), Evocative (whether the quatrain expresses meaningful emotion), and Overall (whether the quatrain is well written overall). We built a user-friendly web-based environment to conduct this experiment; the interface for human interaction is illustrated in Fig. 3.

Table 4 presents the results of the human evaluation. Columns two to six show the average scores for the readability, consistency, aesthetic, evocative, and overall metrics, respectively, and the last column gives the probability of poems being identified as human-written poetry, labeled Identification Probability. In total, 438 human-written poems, 227 AS2S-generated poems, and 210 CVAE-generated poems were annotated by the participating experts. Among the 210 CVAE-generated poems, 95 were considered human-written, while only 69 of the 227 AS2S-generated poems received the same judgment. In other words, 45.24% of the CVAE-generated poems were regarded as human-written, in contrast to only 30.40% of the AS2S-generated poems. Moreover, our proposed AW2V_CVAE outperforms AS2S on all five subjective metrics, which further proves its effectiveness at generating high-quality, thematic, and inspiring poetry.

5.3 Examples

We give specific examples for various queries to demonstrate the thematic information delivered by each individual line and the overall quality of the generated poems. Several automatically generated quatrains are shown in Tables 5 through 8. Each of these generated poems intuitively conforms to the tonal and structural rules of quatrains and carries a consistent, meaningful theme.



Burned candle flickered at dawn,
A dim light shone on the man home alone.
Drinking, my body and mind flowed,
While you were fainting in the snow.
Table 5: An example of five-character quatrain based on the given keyword “蜡烛 (candle)”.


Lonely, a plum blossom is homeless,
Apprehensive, the ephemeral beauty will wither.
Hey, don’t worry. I will pick a twig in the woods,
And house it in my flute.
Table 6: An example of seven-character quatrain based on the given query “梅 (wintersweet)”.

6 Conclusions

In this work, we have studied poetry generation. We presented a two-step generation approach, consisting of writing-intent representation and thematic poem generation, to imitate the way human poets create poems. We proposed a Conditional Variational AutoEncoder with an augmented word2vec model (AW2V_CVAE) to mine the implicit topic information in each line of a poem. The augmented word2vec model (AW2V) further enhances the rhythm and symmetry delivered in poems and improves the training procedure. By learning latent variables, the generative neural model gains additional flexibility to represent the thematic message contained within poem lines.

We conducted experiments with several evaluation metrics and compared our proposed approach with existing ones. Experimental results demonstrate that our approach not only produces quatrains with regulated rules and consistent themes but also learns a good representation of the latent elements in poems. Our conditional variational autoencoder with the augmented word2vec model is shown to outperform the attention-based sequence-to-sequence model.

Currently, we are working on improving the handling of the KL cost and on using reinforcement learning to further improve poetry quality. We would also like to incorporate more poem characteristics, such as sentiment classification and entity identification, into the poetry generation process.



Through shimmering spring river I can see the old lake,
Misty rain brings memory of sunshine in the summer days.
Gazing and whispering at the autumn sky,
Just a blink and at the farm winter has arrived.
Table 7: An example of seven-character quatrain based on the given query “春天 (spring), 夏天 (summer), 秋天 (autumn), 冬天 (winter)”.


Walk back alone from the east of grass field,
Coincide with the wildflowers in the highland.
If pick up a bunch to linger the mast,
All of them will be among the lotus.
Table 8: An example of seven-character quatrain based on the given query “路边的野花不要采 (do not pick up roadside wildflowers)”.


  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
  • Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349 .
  • Brin and Page (2012) Sergey Brin and Lawrence Page. 2012. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer networks 56(18):3825–3833.
  • He et al. (2012) Jing He, Ming Zhou, and Long Jiang. 2012. Generating Chinese classical poems with statistical machine translation models. In AAAI.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 770–778.
  • Hinton (2007) Geoffrey E Hinton. 2007. Learning multiple layers of representation. Trends in cognitive sciences 11(10):428–434.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • Hopkins and Kiela (2017) Jack Hopkins and Douwe Kiela. 2017. Automatically generating rhythmic verse with neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 168–178.
  • Jiang and Zhou (2008) Long Jiang and Ming Zhou. 2008. Generating Chinese couplets using a statistical MT approach. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1. Association for Computational Linguistics, pages 377–384.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 .
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech. volume 2, page 3.
  • Netzer et al. (2009) Yael Netzer, David Gabay, Yoav Goldberg, and Michael Elhadad. 2009. Gaiku: Generating haiku with word associations norms. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity. Association for Computational Linguistics, pages 32–39.
  • Oliveira (2012) Hugo Gonçalo Oliveira. 2012. PoetryMe: a versatile platform for poetry generation. Computational Creativity, Concept Invention, and General Intelligence 1:21.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, pages 311–318.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082 .
  • Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.
  • Semeniuta et al. (2017) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390 .
  • Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI. pages 3295–3301.
  • Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems. pages 3483–3491.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research 15(1):1929–1958.
  • Tang (2005) Yihe Tang. 2005. English Translation for Tang Poems (Ying Yi Tang Shi San Bai Shou). Tianjin People Publisher.
  • Tosa et al. (2008) Naoko Tosa, Hideto Obara, and Michihiko Minoh. 2008. Hitch haiku: An interactive supporting system for composing haiku poem. In International Conference on Entertainment Computing. Springer, pages 209–216.
  • Turing (1950) Alan M Turing. 1950. Computing machinery and intelligence. Mind 59(236):433–460.
  • Wang (2002) Li Wang. 2002. A Summary of Rhyming Constraints of Chinese Poems (Shi Ci Ge Lv Gai Yao), volume 1. Beijing Press.
  • Wang et al. (2016a) Qixin Wang, Tianyi Luo, Dong Wang, and Chao Xing. 2016a. Chinese song iambics generation with neural attention-based model. arXiv preprint arXiv:1604.06274 .
  • Wang et al. (2016b) Zhe Wang, Wei He, Hua Wu, Haiyang Wu, Wei Li, Haifeng Wang, and Enhong Chen. 2016b. Chinese poetry generation with planning based neural network. arXiv preprint arXiv:1610.09889 .
  • Wu et al. (2009) Xiaofeng Wu, Naoko Tosa, and Ryohei Nakatsu. 2009. New hitch haiku: An interactive renku poem composition supporting tool applied for sightseeing navigation system. Entertainment Computing–ICEC 2009 pages 191–196.
  • Xie et al. (2017) Stanley Xie, Ruchir Rastogi, and Max Chang. 2017. Deep poetry: Word-level and character-level language models for Shakespearean sonnet generation.
  • Yan (2016) Rui Yan. 2016. i, poet: Automatic poetry composition through recurrent neural networks with iterative polishing schema. In IJCAI. pages 2238–2244.
  • Yan et al. (2013) Rui Yan, Han Jiang, Mirella Lapata, Shou-De Lin, Xueqiang Lv, and Xiaoming Li. 2013. i, poet: Automatic Chinese poetry composition through a generative summarization framework under constrained optimization. In IJCAI. pages 2197–2203.
  • Yan et al. (2016a) Rui Yan, Cheng-Te Li, Xiaohua Hu, and Ming Zhang. 2016a. Chinese couplet generation with neural network structures. In ACL (1).
  • Yan et al. (2016b) Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. 2016b. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision. Springer, pages 776–791.
  • Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 .
  • Zhang et al. (2017) Jiyuan Zhang, Yang Feng, Dong Wang, Yang Wang, Andrew Abel, Shiyue Zhang, and Andi Zhang. 2017. Flexible and creative chinese poetry generation using neural memory. arXiv preprint arXiv:1705.03773 .
  • Zhang and Lapata (2014) Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In EMNLP. pages 670–680.
  • Zhao et al. (2017) Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960 .
  • Zhou et al. (2010) Cheng-Le Zhou, Wei You, and Xiaojun Ding. 2010. Genetic algorithm and its implementation of automatic generation of Chinese Songci. Journal of Software 21(3):427–437.