Does Order Matter? An Empirical Study on Generating Multiple Keyphrases as a Sequence

Rui Meng Xingdi Yuan Tong Wang
Peter Brusilovsky Adam Trischler Daqing He
School of Computing and Information, University of Pittsburgh
Microsoft Research, Montréal

Recently, concatenating multiple keyphrases as a target sequence has been proposed as a new learning paradigm for keyphrase generation. Existing studies concatenate target keyphrases in different orders but no study has examined the effects of ordering on models’ behavior. In this paper, we propose several orderings for concatenation and inspect the important factors for training a successful keyphrase generation model. By running comprehensive comparisons, we observe one preferable ordering and summarize a number of empirical findings and challenges, which can shed light on future research on this line of work.

1 Introduction & Related Works

Keyphrases are multi-word units that summarize the high-level meaning of a longer text and highlight certain important topics or information. As an important capability of language understanding and knowledge extraction systems, keyphrase extraction has been studied for over a decade (Witten et al., 1999; Liu et al., 2011; Wang et al., 2016; Yang et al., 2017; Luan et al., 2017; Subramanian et al., 2017; Sun et al., 2019). However, extractive models lack the ability to predict keyphrases that are absent from the source text (i.e., keyphrases that abstract and summarize key ideas of the source text). Meng et al. (2017) first propose CopyRNN, a neural model that both generates words from a vocabulary and copies words from the source text. It has since served as the basis for a host of later works (Chen et al., 2018; Ye and Wang, 2018; Yuan et al., 2018; Çano and Bojar, 2019).

Given a piece of source text, our objective is to generate a set of multi-word phrases. This falls naturally under the paradigm of set generation (i.e., the output phrases should be permutation invariant). Some literature explores the effects of different orderings in language modeling (Vinyals et al., 2016; Ford et al., 2018). Recently, Yang et al. (2018) also propose a sequence-to-set model trained with reinforcement learning, which captures the correlation between labels and reduces the dependence on label order in multi-label text classification.

To the best of our knowledge, however, no existing work has successfully applied set generation to real-world, large-scale language generation tasks such as keyphrase generation. Compared to multi-class classification tasks, the output space in keyphrase generation is combinatorially larger: each phrase is a multi-word sequence drawn from a usually very large vocabulary.

Given this complexity, most existing keyphrase generation methods train models to generate a single keyphrase for each source text. During decoding, beam search is then used to produce a large number of candidate phrases. However, this decoding strategy generates all target phrases independently, resulting in many similar or identical phrases. Furthermore, as an intrinsic limitation of beam search, a single beam may dominate the search process and further diminish the diversity of the final output.

Many studies have noted this problem (Chen et al., 2018; Ye and Wang, 2018). Consequently, instead of independent generation, another line of work proposes to generate multiple phrases in one output sequence (Yuan et al., 2018), where models are trained to generate the concatenation of target phrases. While overcoming the problem of independent generation, this latter approach introduces a new question of ordering among the now inter-dependent phrases; as Vinyals et al. (2016) point out, order matters for sequence modeling. However, previous studies have largely overlooked this problem.

In this study, we aim to fill this research gap by systematically examining the influence of concatenation ordering, as well as other factors such as beam width and model complexity, on sequential generation models for keyphrase generation. Through comprehensive empirical experiments, we find that our model delivers superior performance, indicating that learning to generate multiple phrases as a sequence is an effective paradigm for this task. More importantly, models trained with certain orderings consistently outperform others.

2 Generating Multiple Keyphrases as a Sequence

2.1 Model Architecture

In this paper we use One2One to denote the training and decoding strategy in which each source text corresponds to a single target keyphrase; a common practice is to use beam search to over-generate multiple keyphrases. This is in contrast to One2Seq, where each source text corresponds to a sequence of keyphrases concatenated with a delimiter token <SEP>. With simple greedy search, a model in the One2Seq setting is capable of generating a sequence of multiple phrases, but over-generation with beam search is often necessary to boost recall. Please refer to Yuan et al. (2018); Ye and Wang (2018) for details.
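The difference between the two settings can be made concrete with a small sketch (illustrative code, not the authors' implementation; the function names are hypothetical):

```python
def one2one_targets(keyphrases):
    # One2One: each keyphrase becomes its own (source, target) training pair.
    return list(keyphrases)

def one2seq_target(keyphrases, sep="<SEP>"):
    # One2Seq: all keyphrases are joined into a single target sequence,
    # separated by the delimiter token.
    return f" {sep} ".join(keyphrases)

kps = ["neural networks", "keyphrase generation", "beam search"]
one2seq_target(kps)
# "neural networks <SEP> keyphrase generation <SEP> beam search"
```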

We adopt a sequence-to-sequence framework with the pointer-generator and coverage mechanisms proposed by See et al. (2017). Our focus in this work is to study which factors are most critical for models trained in the One2Seq setting, such as keyphrase ordering and beam width, rather than the model structure itself; we therefore describe the details of the model structure in Appendix A.

2.2 Ordering for Concatenating Phrases

In this subsection, we define six strategies for ordering the concatenated target phrases. We are interested in whether different orderings affect performance on keyphrase generation, and which orderings are optimal for training models in the One2Seq setting.

  • Random: Randomly shuffle the target phrases. As the goal of keyphrase generation is to output an order-invariant structure (a set of phrases), we expect models trained with randomly shuffled targets to capture this nature better than variants with more fixed orderings.

  • No-Sort: Keep phrases in original order. Also used by Ye and Wang (2018).

  • Length: Sort phrases by their lengths from short to long. Phrases of the same length are sorted in original order.

  • Alpha: Sort phrases in alphabetical order (by their first word).

  • Appear-Pre: Sort present phrases by their first occurrences in the source text, and prepend absent phrases at the beginning. Absent phrases are randomly shuffled.

  • Appear-Ap: Same as Appear-Pre, but absent phrases are appended at the end. Also used by Yuan et al. (2018).
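The six orderings can be sketched as follows, assuming phrases are given as strings and presence is determined by exact substring match (a simplification for illustration; the paper does not specify these implementation details):

```python
import random

def order_phrases(phrases, source, strategy, seed=0):
    """Sketch of the six concatenation orderings (hypothetical helper,
    not the authors' code)."""
    rng = random.Random(seed)
    if strategy == "random":
        out = list(phrases)
        rng.shuffle(out)
        return out
    if strategy == "no_sort":
        return list(phrases)
    if strategy == "length":
        # stable sort: equal-length phrases keep their original order
        return sorted(phrases, key=lambda p: len(p.split()))
    if strategy == "alpha":
        # alphabetical by first word
        return sorted(phrases, key=lambda p: p.split()[0])
    # appear_*: present phrases sorted by first occurrence in the source,
    # absent phrases shuffled and prepended (pre) or appended (ap)
    present = sorted((p for p in phrases if p in source), key=source.find)
    absent = [p for p in phrases if p not in source]
    rng.shuffle(absent)
    if strategy == "appear_pre":
        return absent + present
    if strategy == "appear_ap":
        return present + absent
    raise ValueError(f"unknown strategy: {strategy}")
```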

2.3 Efficient Decoding Strategy

A previous study (Ye and Wang, 2018) adopts beam search and a phrase-ranking technique to collect an excessive number of unique phrases. In the One2Seq setting, however, this decoding strategy incurs very high computational cost, because the decoded sequences are longer and the beam search process is much deeper.

To make One2Seq decoding more computationally affordable, we propose an early-stop technique for beam search. Instead of expanding all search branches until a given maximum depth is reached, we terminate the beam search once the best sequence is found. This is a common speed-up heuristic in single-sequence generation tasks such as translation and summarization. We observe that it is also effective for One2Seq decoding, making decoding up to 10 times faster. By the time the top sequence is completed, there are usually enough good phrases; meanwhile, the quality of sequences generated afterwards degenerates drastically, and most of them are duplicates of existing phrases. This early-stop technique therefore achieves a significant efficiency gain without sacrificing the quality of the output phrases.
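The early-stop heuristic can be sketched on a toy model as follows; `step_fn` is a hypothetical stand-in for one decoder step returning token log-probabilities, and the details differ from any specific implementation:

```python
import heapq

def beam_search_early_stop(step_fn, bos, eos, beam_width=3, max_len=20):
    """Minimal early-stop beam search sketch (illustrative, not the paper's
    code). `step_fn(prefix)` returns {token: log_prob}. The search terminates
    as soon as the highest-scoring hypothesis ends in `eos`, instead of
    expanding all beams to `max_len`."""
    beams = [(0.0, [bos])]  # (cumulative log-prob, token sequence)
    for _ in range(max_len):
        # Early stop: the best hypothesis is already complete.
        if beams[0][1][-1] == eos:
            return beams[0]
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:  # keep finished hypotheses as-is
                candidates.append((score, seq))
                continue
            for tok, lp in step_fn(seq).items():
                candidates.append((score + lp, seq + [tok]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams[0]
```

With a toy `step_fn` that always prefers continuing over stopping, the search still returns as soon as the completed hypothesis outranks every active beam, rather than expanding to the maximum depth.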

Method        Inspec (Avg=7.8)   Krapivin (Avg=3.4)   NUS (Avg=6.1)    SemEval (Avg=6.7)   Average
              F@5     F@10       F@5      F@10        F@5     F@10     F@5      F@10       F@5     F@10
One2One       0.244   0.289      0.305    0.266       0.376   0.352    0.318    0.318      0.336   0.335
Random        0.283   0.206      0.288    0.183       0.344   0.238    0.304    0.218      0.305   0.211
Length        0.298   0.224      0.321    0.206       0.364   0.259    0.311    0.222      0.324   0.228
No-Sort       0.323   0.253      0.317    0.209       0.376   0.264    0.318    0.244      0.333   0.243
Alpha         0.319   0.283      0.329    0.238       0.376   0.289    0.343    0.266      0.342   0.269
Appear-Pre    0.320   0.307      0.322    0.245       0.369   0.302    0.327    0.291      0.334   0.286
Appear-Ap     0.344   0.333      0.320    0.236       0.367   0.295    0.324    0.286      0.339   0.287
Table 1: Present keyphrase prediction performance (F1 score) on four benchmark datasets. Dataset names are listed in the header, followed by the average number of target phrases per document. The baseline One2One model is on the first row, followed by the One2Seq variants. Bold indicates the best score in each column.

3 Experiment Settings

Following the experimental setting of Meng et al. (2017), we train all models on the KP20k training set, which contains 514,154 scientific papers, each with a title, an abstract, and a list of keyphrases provided by the author(s). We use four common datasets, Inspec, Krapivin, NUS, and SemEval, for testing. We report the test performance of the best checkpoints, determined by F1@5 score on a validation set of 500 examples drawn from the KP20k validation set.

In both the One2Seq and One2One settings, we use the model described in §2.1. In the One2Seq setting, we experiment with beam widths of [10, 25, 50] in the decoding phase; in One2One, we use a beam width of 200 for a fair comparison with Meng et al. (2017). Note that only unique phrases are used for evaluation.
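For reference, a minimal sketch of F1@k over unique predictions (exact-match and stemming details vary across papers; this is an illustrative version, not the official evaluation script):

```python
def f1_at_k(predicted, gold, k):
    """F1@k sketch: duplicates are dropped before truncating to the top k,
    since only unique phrases are evaluated."""
    unique = list(dict.fromkeys(predicted))[:k]  # dedupe, preserve order
    gold_set = set(gold)
    correct = sum(1 for p in unique if p in gold_set)
    if correct == 0:
        return 0.0
    precision = correct / len(unique)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```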

4 Results and Discussion

We present and discuss results on present phrases (i.e., phrases appearing verbatim in the source text) in §4.1-4.3, and on absent phrases in §4.4.

4.1 Effects of Ordering

We report experimental results on four common benchmark datasets (totalling 1,241 testing data points) in Table 1. Our implementation of the One2One model (Meng et al., 2017) yields much better scores on NUS and SemEval, and comparable performance on the remaining datasets. Meanwhile, One2Seq models produce top-5 phrases of better quality than the baseline, with no noticeable difference in their F1@5. However, the scores deteriorate when the top 10 outputs are included. As shown in Table 1, the average performance increases from Random to Appear-Ap, a trend that becomes particularly obvious for F1@10. We elaborate on this observation and offer possible explanations below.

Table 2 presents statistics on the output of the various models. We can see a notable correlation between F1 score and the number of unique predicted keyphrases (#(UniqKP)). Specifically, both Random and Length predict fewer than 5 unique phrases on average, leading to lower F1. Intuitively, we expected Random to help capture the order-invariance among phrases. In practice, however, the random phrase order seems to have induced more difficulty than robustness in learning. Since most phrases are between 1 and 4 words long, the Length ordering presents only a very weak signal for models to exploit, especially given that counting itself can be a rather challenging task. Similarly, despite the static nature of the phrase order in No-Sort, there exists a high degree of arbitrariness in the way different authors list keyphrases in different papers. Such intrinsic randomness again makes learning difficult.

In contrast, Appear-Pre and Appear-Ap produce the largest numbers of unique predictions as well as the best F1@10 scores. In particular, Appear-Ap outnumbers the others in both the number of beams and the number of phrases by a large margin. We postulate that the order of the targets' occurrence in the source text provides the pointer (copying) attention with a fairly reliable pattern to follow. This is in line with previous findings that pointer attention can greatly facilitate the training of keyphrase generation models (Meng et al., 2017). In addition, appending the absent phrases at the end of the concatenated sequence may cause less confusion to the model. Finally, the Alpha ordering surprisingly yields satisfactory scores and training stability. We manually checked the output sequences and noticed that the model is actually able to retain alphabetical order among the predicted keyphrases, hinting that the model might be capable of learning simple morphological dependencies (sorting words alphabetically) even without access to any character-level representation.

              #(Beam)   Len(Beam)   #(UniqKP)   #(KP)
Random        13.01     21.72       4.27        47.13
Length        8.84      17.74       4.89        32.58
No-Sort       14.33     23.60       5.25        56.19
Alpha         12.15     23.31       6.71        51.24
Appear-Pre    13.09     19.05       7.60        45.55
Appear-Ap     23.73     28.86       7.82        128.32
Table 2: Statistics of predictions from various models. For each model variant, #(Beam) and Len(Beam) are the average number and average length of all predicted beams (BeamWidth=10); #(UniqKP) is the average number of unique predicted keyphrases and #(KP) is the average number of all predicted keyphrases.

4.2 Effects of Beam Width

We report model performance with different beam widths in Table 3. The results show a noticeable trend: regardless of the target order, all models receive a significant performance boost from a larger beam width, largely due to the larger number of unique predictions. However, the performance gap among the different orderings remains clear, indicating that the training ordering has a consistent effect on One2Seq models. With beam width 50, Appear-Ap beats One2One on both F1@5 and F1@10. To a certain extent, however, more unique predictions may also introduce noise: compared with the other four orderings, Appear-Ap and Appear-Pre have much lower precision and F1 scores for the top 5 predictions at beam widths 25 and 50.

Beam Width    10              25              50
              F@5     F@10    F@5     F@10    F@5     F@10
Random        0.305   0.211   0.345   0.261   0.358   0.304
Length        0.324   0.228   0.347   0.279   0.351   0.319
No-Sort       0.333   0.243   0.357   0.295   0.364   0.325
Alpha         0.342   0.269   0.349   0.320   0.354   0.341
Appear-Pre    0.334   0.286   0.340   0.326   0.337   0.345
Appear-Ap     0.339   0.287   0.340   0.329   0.339   0.347
Table 3: Average F1 scores on present keyphrase generation with different beam widths.

4.3 Effects of Model Complexity

After concatenating multiple phrases, the target sequences of One2Seq models become much longer and may require more parameters to model the dependencies. We are therefore interested in whether better performance can be achieved by increasing model complexity. Besides the aforementioned base model (referred to as BaseRNN), two larger models are used for comparison: (1) BigRNN, used in Ye and Wang (2018), with the same architecture as the base model except for a larger embedding size (128) and hidden size (512); (2) Transformer, used in Gehrmann et al. (2018), a four-layer transformer with 8 heads, 512 hidden units, and copy attention. As shown in Table 4, neither is able to outperform BaseRNN. Interestingly, the performance differences among the six orderings are consistently observed on the bigger models.

Model         BaseRNN         BigRNN          Transformer
#(Param)      13M             37M             80M
              F@5     F@10    F@5     F@10    F@5     F@10
Random        0.358   0.304   0.356   0.305   0.359   0.289
Length        0.351   0.319   0.349   0.321   0.361   0.318
No-Sort       0.364   0.325   0.361   0.329   0.358   0.329
Alpha         0.354   0.341   0.358   0.341   0.353   0.336
Appear-Pre    0.337   0.345   0.339   0.341   0.352   0.343
Appear-Ap     0.339   0.347   0.344   0.346   0.357   0.345
Table 4: F1 scores on present keyphrase generation of One2Seq models with different model complexities (BeamWidth=50).

4.4 Effects on Absent Keyphrase Generation

Absent keyphrase prediction examines a model's ability to generate synonymous expressions based on a semantic understanding of the text. We report absent keyphrase results in Table 5. Although One2Seq models exhibit superior performance in predicting present phrases, they perform poorly on absent ones, primarily due to the very limited number of unique predictions they are able to generate: Recall@50 stays very low because they can hardly produce more than 10 absent phrases. Even with a larger beam width, both RNN models yield very low recall. In contrast, the Transformer demonstrates good abstractiveness and beats the baseline on Recall@10.

Model         BaseRNN         BigRNN          Transformer
              R@10    R@50    R@10    R@50    R@10    R@50
One2One       0.044   0.101
Random        0.012   0.012   0.017   0.017   0.044   0.044
Length        0.012   0.012   0.022   0.022   0.045   0.045
No-Sort       0.013   0.013   0.016   0.016   0.052   0.052
Alpha         0.018   0.018   0.026   0.026   0.057   0.057
Appear-Pre    0.011   0.011   0.023   0.023   0.055   0.055
Appear-Ap     0.015   0.015   0.026   0.026   0.070   0.071
Table 5: Recall@10 and Recall@50 on absent keyphrase generation for different models (BeamWidth=50).

5 Conclusion

We present an empirical study on how different orderings affect the performance of One2Seq models for keyphrase generation. We conclude our discussion with the following take-aways:

  • The ordering of the concatenated target phrases matters. Consistent with Vinyals et al. (2016), target ordering plays a key role in successfully training models, perhaps due to differences in the trainability of each ordering. Appear-Ap demonstrates the best overall performance among the six orderings we experimented with.

  • A larger beam width yields a stable performance boost and reduces the gap among orderings. Model complexity has no significant effect on present keyphrase prediction.

  • Training with concatenated phrases seems to strengthen a model’s extraction capacities more than abstraction capacities. On the other hand, abstraction capacity can be enhanced by increasing model complexity. How to balance the two would be a good direction for future study.

  • Even with over-generation (beam width 10), fewer than 20% of the phrases generated by One2Seq models are unique. Decoding for keyphrase generation remains an open challenge that deserves more research attention.


References

  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
  • E. Çano and O. Bojar (2019) Keyphrase generation: a text summarization struggle. CoRR abs/1904.00110.
  • J. Chen, X. Zhang, Y. Wu, Z. Yan, and Z. Li (2018) Keyphrase generation with correlation constraints. CoRR abs/1808.07185.
  • K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078.
  • N. Ford, D. Duckworth, M. Norouzi, and G. E. Dahl (2018) The importance of generation order in language modeling. CoRR abs/1808.07910.
  • S. Gehrmann, Y. Deng, and A. M. Rush (2018) Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
  • Z. Liu, X. Chen, Y. Zheng, and M. Sun (2011) Automatic keyphrase extraction by bridging vocabulary gap. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning.
  • Y. Luan, M. Ostendorf, and H. Hajishirzi (2017) Scientific information extraction with semi-supervised neural tagging. CoRR abs/1708.06075.
  • R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky, and Y. Chi (2017) Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 582–592.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. CoRR abs/1704.04368.
  • S. Subramanian, T. Wang, X. Yuan, and A. Trischler (2017) Neural models for key phrase detection and question generation. CoRR abs/1706.04560.
  • Z. Sun, J. Tang, P. Du, Z. Deng, and J. Nie (2019) DivGraphPointer: a graph pointer network for extracting diverse keyphrases. In SIGIR.
  • O. Vinyals, S. Bengio, and M. Kudlur (2016) Order matters: sequence to sequence for sets. In International Conference on Learning Representations (ICLR).
  • X. Wan and J. Xiao (2008) Single document keyphrase extraction using neighborhood knowledge.
  • M. Wang, B. Zhao, and Y. Huang (2016) PTR: phrase-based topical ranking for automatic keyphrase extraction in scientific publications. In Proceedings of the 23rd International Conference on Neural Information Processing (ICONIP 2016).
  • I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning (1999) KEA: practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries (DL '99), pp. 254–255.
  • P. Yang, S. Ma, Y. Zhang, J. Lin, Q. Su, and X. Sun (2018) A deep reinforced sequence-to-set model for multi-label text classification. CoRR abs/1809.03118.
  • Z. Yang, J. Hu, R. Salakhutdinov, and W. W. Cohen (2017) Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
  • H. Ye and L. Wang (2018) Semi-supervised learning for neural keyphrase generation. CoRR abs/1808.06773.
  • X. Yuan, T. Wang, R. Meng, K. Thaker, D. He, and A. Trischler (2018) Generating diverse numbers of diverse keyphrases. CoRR abs/1810.05241.

Appendix A Model Architecture

We adopt a sequence-to-sequence framework with the pointer-generator and coverage mechanisms proposed by See et al. (2017). The basic sequence-to-sequence model uses a bi-directional GRU (Cho et al., 2014) encoder (for the Transformer experiments described in §4.3, the GRUs are replaced by transformer modules). It takes the word embedding vectors of the source text as input and produces a sequence of encoder hidden states $h_1, \dots, h_n$. On the decoding side, at step $t$, a uni-directional GRU decoder takes the word embedding of the token $y_{t-1}$ as input and generates the decoder hidden state $s_t$. The attention mechanism (Bahdanau et al., 2014) is applied:

$$e_{ti} = v^\top \tanh(W_h h_i + W_s s_t + b_{attn}), \qquad a_t = \mathrm{softmax}(e_t),$$

where $v$, $W_h$, $W_s$, and $b_{attn}$ are learnable parameters, and $a_t$ is a probability distribution measuring the correlation between the token at step $t$ and each of the source words. The weighted sum of the encoder hidden states, $h_t^{*} = \sum_i a_{ti} h_i$, is further used to generate the output $P_{vocab}$ (a probability distribution over a given vocabulary) at step $t$.

The pointer-generator learns a linear interpolation between these two probability distributions (i.e., the attention distribution over source words, $a_t$, and the distribution over the vocabulary, $P_{vocab}$), so that the combined output takes both into account. One advantage of the pointer-generator is that it can point to out-of-vocabulary tokens when they appear in the source text.
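A toy numeric sketch of this interpolation (made-up numbers, not real model outputs): with generation probability p_gen, the final distribution assigns p_gen times the vocabulary probability plus (1 - p_gen) times the attention mass on each source token.

```python
vocab = ["the", "model", "unk"]          # tiny vocabulary (illustrative)
p_vocab = [0.5, 0.4, 0.1]                # decoder's distribution over vocab
source_tokens = ["keyphrase", "model"]   # "keyphrase" is out-of-vocabulary
attn = [0.7, 0.3]                        # attention over source tokens
p_gen = 0.6                              # generation probability

# Combined distribution over the vocabulary plus OOV source words.
final = {w: p_gen * p for w, p in zip(vocab, p_vocab)}
for tok, a in zip(source_tokens, attn):
    final[tok] = final.get(tok, 0.0) + (1 - p_gen) * a

# "model" receives mass from both paths (0.6*0.4 + 0.4*0.3 = 0.36);
# the OOV word "keyphrase" can only be produced by copying (0.4*0.7 = 0.28).
```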

As for the coverage mechanism, the model maintains a coverage vector $c_t$, the sum of the attention distributions over all previous decoder time steps. $c_t$ serves as an episodic memory that helps prevent the model from generating repetitions.

For more details about the model, we refer readers to the model description in See et al. (2017).

Appendix B Generating Variable-number of Phrases

One critical advantage of the One2Seq setting is that the model is capable of generating a variable number of phrases. Specifically, during the decoding phase, we take as output the multiple phrases contained in one completely decoded sequence, which is usually the top-ranked sequence (i.e., the top beam) from beam search. We refer to this decoding strategy as self-terminating generation.
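Concretely, self-terminating generation amounts to splitting the single top beam on the delimiter token (a hypothetical helper sketch, not the authors' code):

```python
def split_self_terminating(top_beam, sep="<SEP>", eos="<EOS>"):
    """Recover a variable number of phrases from one decoded token sequence:
    the model decides how many phrases to emit by generating `eos`."""
    phrases, current = [], []
    for tok in top_beam:
        if tok == eos:          # model self-terminates here
            break
        if tok == sep:          # delimiter closes the current phrase
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:                 # flush the last phrase
        phrases.append(" ".join(current))
    return phrases
```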

We report average F1 scores for variable-number keyphrase generation (self-terminating) in Table 6. Even without over-generation, One2Seq models achieve decent performance. A larger beam width yields an improvement on the top 10 phrases but a drop on the top 5, a phenomenon different from that observed in §4.2.

Method        BeamWidth=1     BeamWidth=10    BeamWidth=25    BeamWidth=50
              F@5     F@10    F@5     F@10    F@5     F@10    F@5     F@10
Random        0.242   0.242   0.263   0.263   0.262   0.262   0.253   0.253
Length        0.269   0.269   0.284   0.284   0.276   0.276   0.273   0.273
No-Sort       0.272   0.272   0.286   0.286   0.278   0.278   0.268   0.268
Alpha         0.270   0.270   0.305   0.305   0.292   0.292   0.281   0.281
Appear-Pre    0.274   0.274   0.280   0.280   0.275   0.275   0.265   0.265
Appear-Ap     0.293   0.294   0.300   0.300   0.292   0.292   0.279   0.279
Table 6: Results on predicting a variable number of phrases (self-terminating). The best score in each column is highlighted in bold.

Model         BaseRNN         BigRNN          Transformer
              F@5     F@10    F@5     F@10    F@5     F@10
One2One       0.118   0.131
Random        0.147   0.154   0.143   0.155   0.133   0.119
Length        0.140   0.156   0.138   0.139   0.136   0.145
No-Sort       0.141   0.151   0.141   0.151   0.136   0.153
Alpha         0.135   0.152   0.131   0.154   0.123   0.131
Appear-Pre    0.102   0.138   0.109   0.133   0.113   0.144
Appear-Ap     0.123   0.165   0.110   0.155   0.133   0.152
Table 7: F1 scores on the DUC news dataset for different models (BeamWidth=50).

Appendix C Transferring to Another Domain

Unlike the four scientific-writing datasets mentioned in §3, DUC (Wan and Xiao, 2008) is a dataset collected from the news domain. Using DUC, we investigate how different orderings affect a One2Seq model's test performance when transferring to another domain. Table 7 presents the results on the DUC dataset. Note that DUC contains only present phrases, which helps explain why most One2Seq models outperform the baseline One2One model.
