Does Order Matter? An Empirical Study on Generating Multiple Keyphrases as a Sequence
Recently, concatenating multiple keyphrases as a target sequence has been proposed as a new learning paradigm for keyphrase generation. Existing studies concatenate target keyphrases in different orders but no study has examined the effects of ordering on models’ behavior. In this paper, we propose several orderings for concatenation and inspect the important factors for training a successful keyphrase generation model. By running comprehensive comparisons, we observe one preferable ordering and summarize a number of empirical findings and challenges, which can shed light on future research on this line of work.
1 Introduction & Related Works
Keyphrases are multi-word units used for summarizing high-level meaning of a longer text and highlighting certain important topics or information. As an important capacity of language understanding and knowledge extraction systems, keyphrase extraction has been discussed for over a decade (Witten et al., 1999; Liu et al., 2011; Wang et al., 2016; Yang et al., 2017; Luan et al., 2017; Subramanian et al., 2017; Sun et al., 2019). However, extractive models lack the ability to predict keyphrases that are absent from the source text (i.e., keyphrases that abstract and summarize key ideas of the source text). Meng et al. (2017) first propose CopyRNN, a neural model that both generates words from a vocabulary and copies words from the source text. It has since served as the basis for a host of later works (Chen et al., 2018; Ye and Wang, 2018; Yuan et al., 2018; Çano and Bojar, 2019).
Given a piece of source text, our objective is to generate a set of multi-word phrases. This falls naturally under the paradigm of set generation (i.e., output phrases should be permutation invariant). There exists some literature exploring the effects of using different orderings in the task of language modeling (Vinyals et al., 2016; Ford et al., 2018). Recently, Yang et al. (2018) also propose a sequence-to-set model trained with Reinforcement Learning, which captures the correlation between labels and reduces the dependence on label order for multi-label text classification.
To the best of our knowledge, however, no existing work has successfully applied set generation to real-world, large-scale language generation tasks such as keyphrase generation. Compared to multi-class classification tasks, the complexity of keyphrase generation is combinatorially larger: each phrase is a multi-word sequence drawn from a usually very large vocabulary.
To this end, most existing keyphrase generation methods aim to generate a single keyphrase for each source text during training. During decoding, beam search is often used to produce a large number of candidate phrases. However, this decoding strategy generates all target phrases independently, resulting in many similar or identical phrases being generated. Furthermore, as an intrinsic limitation of beam search, a single beam may dominate the search process and further diminish the diversity of the final output.
Many studies have noted this problem (Chen et al., 2018; Ye and Wang, 2018). Consequently, instead of independent generation, another line of work proposes to generate multiple phrases in one output sequence (Yuan et al., 2018), where models are trained to generate the concatenation of target phrases. While overcoming the problem of independent generation, this latter approach introduces a new question of ordering among the now inter-dependent phrases, as pointed out by Vinyals et al. (2016), order matters for sequence modeling. However, previous studies have largely overlooked this problem.
In this study, we aim to fill this research gap by systematically examining the influence of concatenation ordering, as well as other factors such as beam width and model complexity, on sequential generation models for keyphrase generation. Through comprehensive empirical experiments, we find that our model delivers superior performance, indicating that learning to generate multiple phrases as a sequence is an effective paradigm for this task. More importantly, models trained with certain orderings consistently outperform others.
2 Generating Multiple Keyphrases as a Sequence
2.1 Model Architecture
In this paper we use One2One to denote the training and decoding strategy in which each source text corresponds to a single target keyphrase; a common practice is to use beam search to over-generate multiple keyphrases. This is in contrast to One2Seq, where each source text corresponds to a sequence of keyphrases concatenated with a delimiter token <SEP>. With simple greedy search, a model trained in the One2Seq setting is capable of generating a sequence of multiple phrases, but over-generation with beam search is often necessary to boost recall. Please refer to Yuan et al. (2018) and Ye and Wang (2018) for details.
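As an illustrative sketch (the function names and toy example are ours, not from the paper), the two target formats can be constructed as follows:

```python
def make_one2one_targets(keyphrases):
    """One2One: each keyphrase becomes a separate training target
    paired with the same source text."""
    return list(keyphrases)

def make_one2seq_target(keyphrases, sep="<SEP>"):
    """One2Seq: all keyphrases are concatenated into a single target
    sequence, joined by a delimiter token."""
    return f" {sep} ".join(keyphrases)

kps = ["neural networks", "keyphrase generation", "beam search"]
print(make_one2seq_target(kps))
# neural networks <SEP> keyphrase generation <SEP> beam search
```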
We adopt a sequence-to-sequence based framework with pointer-generator and coverage mechanism proposed by See et al. (2017). Our focus in this work is to study what factors are most critical for models trained with One2Seq setting, such as keyphrase ordering and beam width, rather than the model structure itself, thus we describe details of the model structure in Appendix A.
2.2 Ordering for Concatenating Phrases
In this subsection, we define six ordering strategies for concatenating target phrases as follows. We are interested in seeing if different orderings affect the performance on keyphrase generation and which orderings may be optimal for training models in the One2Seq setting.
Random: Randomly shuffle the target phrases. Since the goal of keyphrase generation is to output an order-invariant structure (a set of phrases), we expect models trained with randomly shuffled targets to capture this nature better than variants trained with more fixed orderings.
No-Sort: Keep phrases in original order. Also used by Ye and Wang (2018).
Length: Sort phrases by their lengths from short to long. Phrases of the same length are sorted in original order.
Alpha: Sort phrases in alphabetical order (by their first word).
Appear-Pre: Sort present phrases by their first occurrences in the source text, and prepend absent phrases at the beginning. Absent phrases are randomly shuffled.
Appear-Ap: Same as Appear-Pre, but absent phrases are appended at the end. Also used by Yuan et al. (2018).
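The six strategies can be sketched as follows (a simplified illustration that treats presence as verbatim string matching; the function and variable names are ours, not the paper's implementation):

```python
import random

def order_phrases(phrases, source, strategy, seed=0):
    """Order target keyphrases for concatenation under one of the six
    strategies. `phrases` is the authors' original list; `source` is the
    source text; present phrases are those appearing verbatim in it."""
    rng = random.Random(seed)
    if strategy == "random":
        out = list(phrases)
        rng.shuffle(out)
        return out
    if strategy == "no-sort":
        return list(phrases)
    if strategy == "length":
        # Python's sort is stable: equal-length phrases keep original order
        return sorted(phrases, key=lambda p: len(p.split()))
    if strategy == "alpha":
        # alphabetical by first word
        return sorted(phrases, key=lambda p: p.split()[0])
    if strategy in ("appear-pre", "appear-ap"):
        present = [p for p in phrases if p in source]
        absent = [p for p in phrases if p not in source]
        present.sort(key=source.index)        # order of first occurrence
        rng.shuffle(absent)                   # absent phrases are shuffled
        return absent + present if strategy == "appear-pre" else present + absent
    raise ValueError(f"unknown strategy: {strategy}")
```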
2.3 Efficient Decoding Strategy
A previous study (Ye and Wang, 2018) adopts beam search and a phrase-ranking technique to collect a large number of unique phrases. In the One2Seq setting, however, this decoding strategy incurs a very high computational cost, as a result of longer decoded sequences and a much deeper beam search process.
In order to make One2Seq decoding more computationally affordable, we propose to use an early-stop technique during beam search. Instead of expanding all search branches until reaching a given maximum depth, we terminate the beam search once the best sequence is found. This is a common heuristic for speed-up in single-sequence generation tasks such as translation and summarization. We observe that it is also effective for One2Seq decoding, leading to up to 10 times faster decoding. By the time the top sequence is completed, there are usually enough good phrases; meanwhile, the quality of sequences generated later degrades drastically, and most of them are duplicates of existing phrases. Therefore, this early-stop technique achieves a significant efficiency gain without sacrificing the quality of the output phrases.
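A minimal sketch of this stopping rule (the scoring interface `step_fn` and all names are our own simplification, not the paper's implementation):

```python
import heapq

EOS = "</s>"

def beam_search_early_stop(step_fn, start_token, beam_width=3, max_len=20):
    """Beam search that terminates as soon as the top-ranked hypothesis is
    finished, instead of expanding every branch to `max_len`.
    `step_fn(seq)` returns candidate (token, logprob) continuations."""
    beams = [(0.0, [start_token])]            # (cumulative logprob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == EOS:                # keep finished hypotheses as-is
                candidates.append((score, seq))
                continue
            for tok, logprob in step_fn(seq):
                candidates.append((score + logprob, seq + [tok]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if beams[0][1][-1] == EOS:            # early stop: best beam complete
            return beams[0][1]
    return beams[0][1]
```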
[Table 1: per-dataset results for Inspec (Avg=7.8), Krapivin (Avg=3.4), NUS (Avg=6.1), and SemEval (Avg=6.7), and their average.]
3 Experiment Settings
Following the experimental setting of Meng et al. (2017), we train all models on the KP20k training set, which contains 514,154 scientific papers, each with a title, an abstract, and a list of author-provided keyphrases. We use four common datasets, Inspec, Krapivin, NUS, and SemEval, for testing. We report the test performance of the best checkpoints, determined by F1@5 score on a validation set of 500 examples drawn from the KP20k validation set.
In both the One2Seq and One2One settings, we use the model described in §2.1. In the One2Seq setting, we experiment with beam widths of [10, 30, 50] in the decoding phase, while in One2One we use a beam width of 200 for fair comparison with Meng et al. (2017). Note that only unique phrases are used for evaluation.
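For reference, F1@k over unique predictions can be sketched as follows (our own simplification: exact string match with no stemming, and precision divided by the number of unique predictions; some implementations divide by k instead):

```python
def f1_at_k(predictions, gold, k=5):
    """F1@k: deduplicate predictions (preserving rank order), truncate to
    the top k, and score exact matches against the gold keyphrases."""
    unique = list(dict.fromkeys(predictions))[:k]   # order-preserving dedup
    matched = sum(1 for p in unique if p in set(gold))
    precision = matched / len(unique) if unique else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```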
4 Results and Discussion
4.1 Effects of Ordering
We report experimental results on four common benchmark datasets (totaling 1,241 testing data points) in Table 1. Our implementation of the One2One model (Meng et al., 2017) yields much better scores on NUS and SemEval, and comparable performance on the remaining datasets. Meanwhile, One2Seq models produce top 5 phrases of better quality than the baseline, and there is no noticeable difference in their F1@5. However, the scores deteriorate once we include the top 10 outputs. As shown in Table 1, the average performance increases from Random to Appear-Ap, a trend that becomes particularly obvious for F1@10. We elaborate on this observation below and offer possible explanations.
Table 2 presents statistics on the output of the various models. We can see a notable correlation between F1-score and the number of unique predicted keyphrases (#(UniqKP)). Specifically, both Random and Length predict fewer than 5 unique phrases on average, leading to lower F1. Intuitively, we expected Random to help capture the order invariance among phrases. In practice, however, the random phrase order seems to have induced more difficulty than robustness in learning. Since the lengths of most phrases fall between 1 and 4 words, the Length ordering presents only a very weak clue for models to exploit, especially given that counting itself can be a rather challenging task. Similarly, despite the static nature of the phrase order in No-Sort, there is a high degree of arbitrariness in the way different authors list keyphrases in different papers. Such intrinsic randomness again poses much difficulty in learning.
In contrast, Appear-Pre and Appear-Ap produce the largest number of unique predictions as well as the best F1@10 scores. In particular, Appear-Ap outnumbers the others in both the number of beams and the number of phrases by a large margin. We postulate that the order of the targets' occurrence in the source text provides the pointer (copying) attention with a fairly reliable pattern to follow. This is in line with previous findings that pointer attention can greatly facilitate the training of keyphrase generation models (Meng et al., 2017). In addition, appending the absent phrases at the end of the concatenated sequence may cause less confusion to the model. Finally, the Alpha ordering surprisingly yields satisfactory scores and training stability. We manually checked the output sequences and noticed that the model is actually able to retain alphabetical order among the predicted keyphrases, hinting that the model may be capable of learning simple morphological dependencies (sorting words alphabetically) even without access to any character-level representation.
4.2 Effects of Beam Width
We report model performance with different beam widths in Table 3. The results show a noticeable trend: regardless of the target order, all models receive a significant performance boost from a bigger beam width, largely due to the larger number of unique predictions. However, the performance gap among the different orderings remains clear, indicating that the training ordering has a consistent effect on One2Seq models. With beam width 50, Appear-Ap beats One2One on both F1@5 and F1@10. To a certain extent, however, more unique predictions may also introduce noise: compared with the other four orderings, Appear-Ap and Appear-Pre have much lower precision and F1 scores for the top 5 predictions with beam widths 25 and 50.
4.3 Effects of Model Complexity
After concatenating multiple phrases, the target sequence of One2Seq models becomes much longer and may require more parameters to model the dependencies. We are therefore interested in whether one can achieve better performance by increasing model complexity. Besides the aforementioned base model (referred to as BaseRNN), two larger models are used for comparison: (1) BigRNN, used in Ye and Wang (2018), with the same architecture as the base model except for a larger embedding size (128) and hidden size (512); (2) Transformer, used in Gehrmann et al. (2018), a four-layer Transformer with 8 heads, 512 hidden units, and copy attention. As shown in Table 4, neither of the two is able to outperform BaseRNN. Interestingly, the performance differences among the six orderings are consistently observed with the bigger models.
4.4 Effects on Absent Keyphrase Generation
Absent keyphrase prediction examines models’ abilities to generate synonymous expressions based on the semantic understanding of the text. We report the absent keyphrase results in Table 5. Although One2Seq models exhibit superior performance in predicting present phrases, they work poorly for absent ones, primarily due to the very limited number of unique predictions they are able to generate. Recall@50 becomes very low as they can hardly produce more than 10 absent phrases. Even with a larger beam width, both RNN models yield very low recall. In contrast, Transformer demonstrates good abstractiveness and beats the baseline on Recall@10.
5 Conclusion
We present an empirical study on how different orderings affect the performance of One2Seq models for keyphrase generation. We conclude our discussion with the following takeaways:
The ordering of concatenated target phrases matters. Consistent with Vinyals et al. (2016), target ordering plays a key role in successfully training models, perhaps due to differences in the trainability of each ordering. Appear-Ap demonstrates the best overall performance among the six orderings we experimented with.
A larger beam width yields a more stable performance boost and reduces the gap among orderings. Model complexity has no significant effect on present keyphrase prediction.
Training with concatenated phrases seems to strengthen a model’s extraction capacities more than abstraction capacities. On the other hand, abstraction capacity can be enhanced by increasing model complexity. How to balance the two would be a good direction for future study.
Even with over-generation (beam width 10), fewer than 20% of the phrases generated by One2Seq models are unique. Decoding for keyphrase generation remains an open challenge that deserves more research attention.
- Bahdanau et al. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
- Çano and Bojar. 2019. Keyphrase generation: a text summarization struggle. CoRR abs/1904.00110.
- Chen et al. 2018. Keyphrase generation with correlation constraints. CoRR abs/1808.07185.
- Cho et al. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078.
- Ford et al. 2018. The importance of generation order in language modeling. CoRR abs/1808.07910.
- Gehrmann et al. 2018. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
- Liu et al. 2011. Automatic keyphrase extraction by bridging vocabulary gap. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning.
- Luan et al. 2017. Scientific information extraction with semi-supervised neural tagging. CoRR abs/1708.06075.
- Meng et al. 2017. Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 582–592.
- See et al. 2017. Get to the point: summarization with pointer-generator networks. CoRR abs/1704.04368.
- Subramanian et al. 2017. Neural models for key phrase detection and question generation. CoRR abs/1706.04560.
- Sun et al. 2019. DivGraphPointer: a graph pointer network for extracting diverse keyphrases. In SIGIR.
- Vinyals et al. 2016. Order matters: sequence to sequence for sets. In International Conference on Learning Representations (ICLR).
- Wan and Xiao. 2008. Single document keyphrase extraction using neighborhood knowledge.
- Wang et al. 2016. PTR: phrase-based topical ranking for automatic keyphrase extraction in scientific publications. In Proceedings of the 23rd International Conference on Neural Information Processing (ICONIP 2016).
- Witten et al. 1999. KEA: practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries (DL '99), pp. 254–255.
- Yang et al. 2018. A deep reinforced sequence-to-set model for multi-label text classification. CoRR abs/1809.03118.
- Yang et al. 2017. Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
- Ye and Wang. 2018. Semi-supervised learning for neural keyphrase generation. CoRR abs/1808.06773.
- Yuan et al. 2018. Generating diverse numbers of diverse keyphrases. CoRR abs/1810.05241.
Appendix A Model Architecture
We adopt a sequence-to-sequence framework with the pointer-generator and coverage mechanism proposed by See et al. (2017). The basic sequence-to-sequence model uses a bi-directional GRU (Cho et al., 2014) encoder (for the Transformer experiments described in §4.3, GRUs are replaced by Transformer modules). The encoder takes the word embedding vectors of the source text as input and produces a sequence of encoder hidden states $h_1, \dots, h_n$. On the decoding side, at step $t$, a uni-directional GRU decoder takes the word embedding of the previously generated token as input and produces the decoder hidden state $s_t$. The attention mechanism (Bahdanau et al., 2014) is applied:

$$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b_{attn}), \qquad a^t = \mathrm{softmax}(e^t),$$

where $v$, $W_h$, $W_s$, and $b_{attn}$ are learnable parameters. $a^t$ represents the probability distribution capturing the correlation between decoding step $t$ and each of the source words. The weighted sum of encoder hidden states, $h_t^* = \sum_i a_i^t h_i$, is further used to generate the output (a probability distribution over a given vocabulary) at step $t$.
The pointer-generator learns a linear interpolation between the two probability distributions (i.e., the copy distribution over source words and the generation distribution over the vocabulary), so that the combined output considers both. One advantage of the pointer-generator is that it can point to out-of-vocabulary tokens when they appear in the source text.
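The interpolation can be sketched as follows (our own notation and function name; the mixing weight and the two distributions are assumed to be given by the model):

```python
def pointer_generator_mix(p_gen, p_vocab, attn, src_tokens):
    """Combine the vocabulary distribution with the copy (attention)
    distribution: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention
    on source positions where w occurs. Out-of-vocabulary source tokens
    receive probability mass only through the copy term."""
    extended = {w: p_gen * p for w, p in p_vocab.items()}
    for a, tok in zip(attn, src_tokens):
        extended[tok] = extended.get(tok, 0.0) + (1 - p_gen) * a
    return extended
```

Because copying assigns mass to any source token, the mixture naturally extends the output vocabulary with in-source OOV words.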
As for the coverage mechanism, the model maintains a coverage vector $c^t = \sum_{t'=0}^{t-1} a^{t'}$, the sum of attention distributions over all previous decoder time steps. $c^t$ serves as an episodic memory that helps prevent the model from generating repetitions.
For more details about the model, we refer readers to the clearly described model section of See et al. (2017).
Appendix B Generating Variable-number of Phrases
One critical advantage of the One2Seq setting is that the model is capable of generating a variable number of phrases. Specifically, during the decoding phase, we take as output the multiple phrases from one completely decoded sequence, usually the top-ranked sequence (i.e., the top beam) from beam search. We refer to this decoding strategy for generating a variable number of phrases as self-terminating generation.
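A sketch of extracting the variable number of phrases from the single top-beam sequence (the delimiter token follows §2.1; the end-of-sequence token and function name are our assumptions):

```python
def phrases_from_sequence(tokens, sep="<SEP>", eos="</s>"):
    """Split one decoded One2Seq token sequence into its component
    keyphrases, stopping at the end-of-sequence token."""
    phrases, current = [], []
    for tok in tokens:
        if tok == eos:
            break
        if tok == sep:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(" ".join(current))
    return phrases
```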
We report the average scores of variable-number keyphrase generation (self-terminating) in Table 6. We can see that even without over-generation, One2Seq models achieve decent performance. A larger beam width results in an improvement on the top 10 phrases but a drop on the top 5, a different phenomenon from §4.2.
Appendix C Transferring to Another Domain
Different from the four scientific-writing datasets mentioned in §3, DUC (Wan and Xiao, 2008) is a dataset collected from the news domain. Using DUC, we investigate how different orderings affect a One2Seq model's test performance when transferring to another domain. Table 7 presents the results on the DUC dataset. Note that the DUC dataset contains only present phrases; thus most One2Seq models outperform the baseline One2One model.