Generating Diverse Numbers of Diverse Keyphrases

Xingdi Yuan*
Microsoft Research Montréal
Montréal, Québec, Canada

Tong Wang*
Microsoft Research Montréal
Montréal, Québec, Canada

Rui Meng*
School of Computing and Information
University of Pittsburgh
Pittsburgh, PA, 15213

Khushboo Thaker
School of Computing and Information
University of Pittsburgh
Pittsburgh, PA, 15213

Daqing He
School of Computing and Information
University of Pittsburgh
Pittsburgh, PA, 15213

Adam Trischler
Microsoft Research Montréal
Montréal, Québec, Canada

* These authors contributed equally. The order is determined by a fidget spinner.

Existing keyphrase generation studies suffer from the problems of generating duplicate phrases and deficient evaluation based on a fixed number of predicted phrases. We propose a recurrent generative model that generates multiple keyphrases sequentially from a text, with specific modules that promote generation diversity. We further propose two new metrics that consider a variable number of phrases. With both existing and proposed evaluation setups, our model demonstrates superior performance to baselines on three types of keyphrase generation datasets, including two newly introduced in this work: StackExchange and TextWorld ACG. In contrast to previous keyphrase generation approaches, our model generates sets of diverse keyphrases of a variable number.



1 Introduction

Keyphrases are short pieces of text that humans use to summarize the high-level meaning of a longer text, or to highlight certain important topics or information. Keyphrase generation is the task of automatically predicting keyphrases given a source text. Models that perform this task should be capable not only of distilling high-level information from a document, but also of locating specific, important snippets within it. Complicating the problem, keyphrases may or may not appear directly and verbatim in their source text (they may be present or absent).

A given source text is usually associated with a set of keyphrases. Thus, keyphrase generation is an instance of set generation, where each element in the set is a short sequence of tokens and the size of the set varies depending on the source. Most prior studies approach keyphrase generation similarly to summarization, relying on sequence-to-sequence (Seq2Seq) methods (meng2017deep; chen2018kp_correlation; ye2018kp_semi; chen2018kp_title). Conditioned on a source text, Seq2Seq models generate phrases individually or as a longer, concatenated sequence with delimiting tokens throughout. Standard Seq2Seq models generate only one sequence at a time. To overcome this limitation so that a sufficient set of diverse phrases can be generated, one common approach is to use beam-search decoding with a fixed, large beam width. Models are then evaluated by taking the top $k$ results from the over-generated beam of phrases ($k$ is typically 5 or 10) and comparing them to “groundtruth” keyphrases.

Though this approach has achieved good results, we argue that it suffers from two major problems. First, the evaluation setup is suboptimal because of the mismatch between a fixed $k$ and the number of groundtruth keyphrases for a text. The appropriate number of keyphrases for each text can vary drastically, depending on factors like the length or topic of the text or the granularity of keyphrase annotation. Therefore, arbitrarily using the same $k$ to evaluate all data samples may not be appropriate. With the existing evaluation setup, for example, we find that the upper bounds for $F_1@5$ and $F_1@10$ on KP20k are 0.858 and 0.626, respectively (see Section 4.1 for more details), and they are worse for datasets where fewer keyphrases are available. Second, the beam-search strategy ignores interactions between generated phrases. This often results in insufficient diversity in decoded phrases. Although models such as those in chen2018kp_correlation or ye2018kp_semi can take diversity into account during training, they rarely achieve it during decoding since they must over-generate and rank phrases with beam search.

To overcome the above issues, we propose two improvements with regard to decoding and evaluation for keyphrase generation frameworks. First, we propose a novel keyphrase generation model to fit the demand of generating variable numbers of diverse phrases. This model predicts the optimal number of phrases to generate for a given text, and uses a target encoder and orthogonal regularization to facilitate more diverse phrase generation. Second, we propose two variable numbers, $\mathcal{M}$ and $\mathcal{V}$, to serve as the evaluation cutoff when computing scores such as $F_1$. They show better empirical characteristics than previous metrics based on a fixed $k$. Besides the two improvements in modeling and evaluation, a third major contribution of our study is two brand-new datasets for keyphrase generation: StackExchange and TextWorld ACG. Because their source material is distinct from the scientific publications used in previous corpora, we expect these datasets to contribute to a more comprehensive testbed for keyphrase generation.

2 Related Work

2.1 Keyphrase Extraction and Generation

Traditional keyphrase extraction has been studied extensively in past decades. In most existing literature, keyphrase extraction has been formulated as a two-step process. First, lexical features such as part-of-speech tags are used to determine a list of phrase candidates by heuristic methods (witten1999kea; liu2011gap; wang2016ptr; yang2017semisupervisedqa). Second, a ranking algorithm is adopted to rank the candidate list, and the top-ranked candidates are selected as keyphrases. A wide variety of methods have been applied for ranking, such as bagged decision trees (medelyan2009human_competitive; lopez2010humb), Multi-Layer Perceptron and Support Vector Machine (lopez2010humb), and PageRank (Mihalcea2004textrank; le2016unsupervised; wan2008neighborhood_knowledge). Recently, zhang2016twitter; luan2017tagging; gollapalli2017expert_knowledge used sequence labeling models to extract keyphrases from text. Similarly, subramanian2017kp used Pointer Networks to point to the start and end positions of keyphrases in a source text.

The main drawback of keyphrase extraction is that keyphrases are sometimes absent from the source text, in which case an extractive model will fail to predict them. meng2017deep first proposed CopyRNN, a neural generative model that both generates words from a vocabulary and points to words from the source text. Recently, based on the CopyRNN architecture, chen2018kp_correlation proposed CorrRNN, which takes states and attention vectors from previous steps into account in both encoder and decoder to reduce duplication and improve coverage. ye2018kp_semi proposed semi-supervised methods that leverage both labeled and unlabeled data for training. chen2018kp_title; ye2018kp_semi proposed to use structure information (e.g., the title of the source text) to improve keyphrase generation performance. However, none of the above models can generate variable numbers of keyphrases.

2.2 Sequence to Sequence Generation

Sequence to Sequence (Seq2Seq) learning was first introduced by sutskever2014seq2seq; together with the soft attention mechanism of bahdanau2014attntion, it has been widely used in natural language generation tasks. gulcehre2016pointersoftmax; Gu2016copy used a mixture of generation and pointing to overcome the problem of large vocabulary size. paulus2017summarization; zhou2017summarization applied Seq2Seq models on summary generation tasks, while du2017qgen; yuan2017qgen generated questions conditioned on documents and answers from machine comprehension datasets. Seq2Seq was also applied on neural sentence simplification (zhang2017simplification) and paraphrase generation tasks (xu2018dpage).

2.3 Representation Learning for Language

Representation learning for language has been studied widely in the past few years. mikolov2013word2vec proposed Word2Vec, in which a contrastive loss is used to predict context words given a focus word. kiros2015skipthought further proposed Skip-thought vectors, which use Recurrent Neural Networks to predict context sentences. subramanian2017reprentations leveraged multi-task learning by sharing a single recurrent sentence encoder across weakly related tasks to learn general sentence representations. logeswaran2017representations formulated the sentence-representation-learning task as a classification problem, where the classifier learns to distinguish a context sentence from contrastive negative samples based on their vector representations. Recently, vandenoord2018cpc proposed Contrastive Predictive Coding (CPC), which learns sentence representations by maximizing the mutual information between sequence encodings at different time-steps, also using a contrastive loss.

3 Model Architecture

Given a piece of source text, our objective is to generate a variable number of multi-word phrases. To this end, we opt for the sequence-to-sequence framework as the basis of our model, combined with attention and pointer softmax mechanisms in the decoder. To teach the model to vary the number of generated phrases, we join a variable number of multi-word phrases, separated by delimiters, as a single sequence. This concatenated sequence is then the target for sequential generation during training. An overview of our model’s structure is shown in Figure 1. (We plan to release the datasets and code in the near future.)

Figure 1: Overall structure of our proposed model. A represents the last states of the bi-directional source encoder; B represents decoder states where target tokens are delimiters; C indicates target encoder states where input tokens are delimiters. During orthogonal regularization, all B states are used; during target encoder training, we maximize the mutual information between the A states and each of the C states. The red dashed arrow indicates a detached path, i.e., we do not backpropagate through this path.


In the following subsections, we use $w$ to denote input text tokens, $x$ to denote token embeddings, $h$ to denote hidden states, and $y$ to denote output text tokens. Superscripts denote time-steps in a sequence, and subscripts $e$ and $d$ indicate whether a variable resides in the encoder or the decoder of the model, respectively. The absence of a superscript indicates multiplicity in the time dimension. $L$ refers to a linear transformation and $L^f$ refers to it followed by a non-linear activation function $f$. Angled brackets, $\langle \cdot \rangle$, denote concatenation.

3.1 Source Encoding

Given a source text consisting of $N$ words $w_e^1, \dots, w_e^N$, the encoder converts these discrete symbols into a set of $N$ real-valued vectors $h_e = (h_e^1, \dots, h_e^N)$. Specifically, we first embed each word $w_e^t$ into an embedding vector $x_e^t$, which is then fed into a bidirectional LSTM (hochreiter1997lstm) for deriving $h_e^t$ from contextual information in the source text:

$$h_e^t = \mathrm{LSTM}_e(x_e^t, h_e^{t-1}, h_e^{t+1}) \tag{1}$$

Dropout (srivastava2014dropout) is applied to both $x_e$ and $h_e$ for regularization.

3.2 Attentive Decoding

The decoder is a recurrent model that takes the source encodings $h_e$ and generates a distribution $p(y^t)$ over possible output tokens at each time-step $t$. With pointer softmax (gulcehre2016pointersoftmax), the target distribution consists of two parts: a distribution $p_a$ over a prescribed vocabulary (abstractive), and a pointing distribution $p_x$ over the tokens in the source text (extractive). We will focus on the derivation of $p_a$ in this subsection.

The first component of the decoder is a uni-directional LSTM. At each time-step $t$, the decoder generates a new state $h_d^t$ from the embedding vector $x_d^t$ and its recurrent state $h_d^{t-1}$. Specifically, $x_d^t$ is the embedding of $w_d^t$. During training by teacher forcing, $w_d^t$ is the groundtruth target token at the previous time-step $t-1$; during evaluation, $w_d^t = y^{t-1}$ is the prediction at the previous time-step:

$$h_d^t = \mathrm{LSTM}_d(x_d^t, h_d^{t-1}) \tag{2}$$

The initial state $h_d^0$ is derived from the final encoder state $h_e^N$ by applying a single-layer feed-forward neural net (FNN): $h_d^0 = L_0^{f_0}(h_e^N)$. Dropout is applied to both the embeddings $x_d$ and the LSTM states $h_d$.

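In order to better incorporate information from the source text, an attention mechanism (bahdanau2014attntion) is employed when generating token $y^t$. The objective is to infer a notion of importance for each source word based on the current decoder state $h_d^t$. This is achieved by measuring the “energy” between $h_d^t$ and the $i$-th source encoding $h_e^i$ with a 2-layer FNN:

$$\mathrm{energy}(h_d^t, h_e^i) = L_2\big(L_1^{f_1}(\langle h_d^t, h_e^i \rangle)\big) \tag{3}$$

The output of the second layer is a scalar-valued score. The energies over all encoder states thus define a distribution over the source sequence for each decoding step $t$:

$$\alpha^{t,i} = \frac{\exp\big(\mathrm{energy}(h_d^t, h_e^i)\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{energy}(h_d^t, h_e^j)\big)} \tag{4}$$

To generate the new token $y^t$, the final step is to derive the generative distribution $p_a$ by applying a 2-layer FNN to the concatenation of the decoder LSTM state $h_d^t$ and the sum $c^t$ of the source encodings $h_e$ weighted by the distribution $\alpha^t$:

$$c^t = \sum_{i=1}^{N} \alpha^{t,i} h_e^i, \qquad p_a(y^t) = L_4^{\mathrm{softmax}}\big(L_3^{f_3}(\langle h_d^t, c^t \rangle)\big) \tag{5}$$

where the output size of the second layer equals the target vocabulary size.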
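The attention step described above (energies normalized into weights, then a weighted sum of source encodings) can be illustrated with a minimal pure-Python sketch. The function names `softmax` and `attend` are hypothetical, and the energies are taken as given rather than computed by a feed-forward network.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(energies, encodings):
    """Turn per-position energies into attention weights and a context vector.

    energies  : list of N scalar scores, one per source position
    encodings : list of N source encodings (each a list of floats)
    """
    alpha = softmax(energies)
    dim = len(encodings[0])
    # Weighted sum of the source encodings, one dimension at a time.
    context = [sum(a * h[d] for a, h in zip(alpha, encodings)) for d in range(dim)]
    return alpha, context

alpha, context = attend([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With equal energies the weights are uniform, so the context vector is simply the mean of the source encodings.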

3.3 Pointer Softmax

We employ the pointer softmax (gulcehre2016pointersoftmax) mechanism to choose between generating tokens from the general vocabulary and pointing to tokens in the source text. Essentially, the pointer softmax module computes a scalar switch $s^t$ at each generation time-step and uses it to interpolate the abstractive distribution $p_a(y^t)$ over the vocabulary (see Equation 5) and the extractive distribution $p_x(y^t)$ over the source text tokens:

$$p(y^t) = s^t \cdot p_a(y^t) + (1 - s^t) \cdot p_x(y^t) \tag{6}$$

Semantically, the switch should be conditioned on both the source representation and the decoder state at each decoding step $t$. Specifically, we project the attention-weighted sum of the source encodings, $c^t = \sum_{i} \alpha^{t,i} h_e^i$, and the decoder state $h_d^t$ into the same space by two separate linear transformations, and further transform the sum of the resulting vectors into the scalar switch value with a 1-layer FNN:

$$s^t = L_7^{\mathrm{sigmoid}}\big(L_5(c^t) + L_6(h_d^t)\big) \tag{7}$$

We use the source attention weights $\alpha^t$ from Equation 4 as the extractive distribution for time-step $t$:

$$p_x(y^t) = \alpha^t \tag{8}$$
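The interpolation performed by the pointer softmax can be sketched as follows. This is an illustration only: the function name `pointer_softmax` is hypothetical, and the sketch assumes the switch weights the vocabulary distribution while its complement weights the copy distribution.

```python
def pointer_softmax(switch, p_vocab, attention, source_tokens):
    """Mix an abstractive and an extractive distribution.

    switch        : scalar in [0, 1]; weight on the vocabulary distribution
                    (assumed convention, see the lead-in)
    p_vocab       : dict token -> probability (abstractive part)
    attention     : list of attention weights over source positions
    source_tokens : list of source tokens aligned with `attention`
    """
    mixed = {tok: switch * p for tok, p in p_vocab.items()}
    for tok, a in zip(source_tokens, attention):
        # Copy mass for a token that occurs several times is accumulated.
        mixed[tok] = mixed.get(tok, 0.0) + (1.0 - switch) * a
    return mixed

p = pointer_softmax(0.5, {"a": 0.5, "b": 0.5}, [1.0, 0.0], ["c", "a"])
```

Because both inputs are proper distributions, the mixture also sums to one, and a source-only token such as "c" receives probability purely from the pointing part.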


3.4 Decoding Strategies

With the probability $p(y^t)$ defined in Equation 6, various decoding methods can be applied to decode the target word $y^t$. Selecting a decoding strategy is important because it determines whether the generated keyphrase set is of fixed or variable size.

To the best of our knowledge, all existing models generate fixed-size sets by first over-generating a large number of candidate keyphrases and then applying some ranking algorithm to truncate the candidate set to a fixed number of final results. One major limitation is that such approaches are incompatible with the variable-number nature of keyphrases. In the KP20k dataset, for example, the average number of keyphrases in the training set is 5.27, while the variance is as high as 14.22. In addition, over-generation is usually achieved by setting a large beam size in beam search (e.g., 150 and 200 in chen2018kp_title; meng2017deep, respectively), which is computationally rather expensive.

Since our proposed model is trained to generate a dynamic number of phrases as a single sequence joined by delimiters, SEP, we can obtain variable length output by simply decoding a single sequence for each sample, either by greedy search or by taking the top beam sequence from beam search. The resulting model thus undertakes the additional task of dynamically estimating the proper size of the target phrase set. We will show in later sections that the model remains empirically competitive when compared to baselines that lack this desirable capacity. Another notable attribute of our decoding strategy is that, by generating a set of phrases in a single sequence, the model conditions its generation on all its generation history. Compared to the strategy used in previous works (i.e., phrases are generated in parallel by beam search, without interdependency), our model can learn dependencies among target phrases in a more explicit way.
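As a small illustration of how a single decoded sequence yields a variable-size phrase set, the sketch below splits a token sequence on the delimiter and stops at the end-of-sequence token. The token strings `<sep>` and `<eos>` and the function name `split_phrases` are placeholders chosen for this example.

```python
SEP, EOS = "<sep>", "<eos>"  # hypothetical token strings

def split_phrases(tokens):
    """Split one decoded token sequence into a list of keyphrases."""
    phrases, current = [], []
    for tok in tokens:
        if tok == EOS:
            break  # everything after EOS is ignored
        if tok == SEP:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(" ".join(current))
    return phrases

phrases = split_phrases(["test", "generation", SEP, "gui", EOS, "ignored"])
```

The number of phrases returned is determined entirely by how many delimiters the model emits before the end-of-sequence token.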

3.5 Diversified Generation

There are typically multiple keyphrases for a given text because each keyphrase represents certain aspects of the text. Therefore, keyphrase diversity is desirable in keyphrase generation, both to increase semantic coverage of the source text and to reduce redundancy among generated phrases. Most previous keyphrase generation models generate multiple phrases by over-generation, which is prone to producing similar phrases due to the nature of beam search. Given our objective to generate variable numbers of keyphrases, we need to adopt new strategies for achieving better diversity in the output.

Recall that we represent variable numbers of keyphrases as delimiter-separated sequences. One particular issue we observed during error analysis is that the model tends to produce identical tokens following the delimiter token. For example, suppose a target sequence contains $n$ delimiter tokens at time-steps $t_1, \dots, t_n$, respectively. During training, the model is rewarded for generating the same delimiter token at these time-steps, which presumably introduces much homogeneity in the corresponding decoder states $h_d^{t_1}, \dots, h_d^{t_n}$. When these states are subsequently used as inputs at the time-steps immediately following the delimiter, the decoder naturally produces highly similar distributions over the following tokens, resulting in identical tokens being decoded. To alleviate this problem, we propose two plug-in components for the sequential generation model.

3.5.1 Orthogonal Regularization

We first propose to explicitly encourage the delimiter-generating decoder states to be different from each other. This is inspired by bousmalis2016domainsep, who use orthogonal regularization to encourage representations across domains to be as distinct as possible. Specifically, we stack the decoder hidden states corresponding to delimiters together to form the matrix $H = \langle h_d^{t_1}, \dots, h_d^{t_n} \rangle$ and use the following equation as the orthogonal regularization loss:

$$\mathcal{L}_{OR} = \left\| H^\top H - I_n \right\|_2 \tag{9}$$

where $H^\top$ is the matrix transpose of $H$, $I_n$ is the identity matrix of rank $n$, and $\|M\|_2$ indicates the $L^2$ norm of a matrix $M$. This loss function prefers orthogonality among the hidden states $h_d^{t_1}, \dots, h_d^{t_n}$ and thus improves diversity in the tokens following the delimiters.
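To make the regularizer concrete, here is a minimal sketch under two stated assumptions: that the loss is the norm of the Gram matrix of the delimiter states minus the identity, and that the matrix norm is the element-wise (Frobenius) $L^2$ norm. The function name `orthogonal_reg_loss` is hypothetical.

```python
import math

def orthogonal_reg_loss(states):
    """Penalty on the Gram matrix of delimiter states.

    states : list of n hidden-state vectors (one per delimiter)
    Returns the Frobenius norm of (G - I_n), where G[i][j] = <h_i, h_j>.
    Both the form of the loss and the choice of norm are assumptions.
    """
    n = len(states)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(states[i], states[j]))
            target = 1.0 if i == j else 0.0
            total += (dot - target) ** 2
    return math.sqrt(total)

loss_orthogonal = orthogonal_reg_loss([[1.0, 0.0], [0.0, 1.0]])
loss_identical = orthogonal_reg_loss([[1.0, 0.0], [1.0, 0.0]])
```

Mutually orthogonal unit-norm states incur zero penalty, whereas two identical states are penalized through the off-diagonal entries of the Gram matrix.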

3.5.2 Target Encoding

We propose an additional mechanism that focuses more on the semantic representations of generated phrases. Again, our goal here is to reduce covariance between certain decoder states and the corresponding target tokens (namely, delimiters). Specifically, we introduce another uni-directional recurrent model $\mathrm{LSTM}_{te}$, dubbed the target encoder, which encodes decoder-generated tokens $y^\tau$, where $\tau \in [0, t)$, into hidden states $h_{te}^t$. This state is then taken as an extra input to the decoder LSTM, modifying Equation 2 to:

$$h_d^t = \mathrm{LSTM}_d(\langle x_d^t, h_{te}^t \rangle, h_d^{t-1}) \tag{10}$$

We do not train the target encoder with the training signal from generation (i.e., errors do not backpropagate from the decoder LSTM to the target encoder, as shown in Figure 1), because the resulting decoder would be equivalent to a 2-layer LSTM with residual connections. Instead, inspired by logeswaran2017representations; vandenoord2018cpc, we train the target encoder in an unsupervised fashion. We extract the target encoder hidden states $h_{te}^{t_i}$, $i = 1, \dots, n$, for which $y^{t_i}$ is the delimiter token, SEP, and use these as phrase representations. We train by maximizing the mutual information between each of these phrase representations and the final state of the source encoder, $h_e^N$, as follows. For each phrase representation vector $h_{te}^{t_i}$, we take a set of encodings $\{h_{e,1}^N, \dots, h_{e,K}^N\}$ of $K$ different sources. This set contains one positive sample $h_{e,\mathrm{true}}^N$, the encoder representation for the source sequence whose keyphrases are being generated, and $K-1$ negative samples from other source sequences (sampled at random from the training data). The target encoder is optimized on the classification loss,

$$\mathcal{L}_{TE} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{g(h_{e,\mathrm{true}}^N, h_{te}^{t_i})}{\sum_{j=1}^{K} g(h_{e,j}^N, h_{te}^{t_i})}, \qquad g(h_a, h_b) = \exp(h_a^\top B h_b) \tag{11}$$

where $B$ is a bi-linear transformation.

The motivation here is to constrain the representation of each generated keyphrase to be semantically close to the overall meaning of the source text. With a recurrent architecture in the target encoder, the model can potentially help to capture phrase-level semantics in the generated output, and thus effectively address the diversity issue stemming from the delimiter tokens.
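The classification objective used to train the target encoder can be sketched for a single phrase representation as below. This is an illustrative sketch: the names `bilinear_score` and `contrastive_loss` are hypothetical, vectors are plain Python lists, and the score is assumed to be an exponentiated bi-linear form normalized over one positive and several negative source encodings.

```python
import math

def bilinear_score(h_src, B, h_phrase):
    """Compute h_src^T B h_phrase for list-based vectors and a list-of-rows B."""
    Bh = [sum(b * x for b, x in zip(row, h_phrase)) for row in B]
    return sum(s * v for s, v in zip(h_src, Bh))

def contrastive_loss(h_phrase, sources, true_index, B):
    """Negative log-probability of picking the true source among candidates."""
    scores = [bilinear_score(h, B, h_phrase) for h in sources]
    m = max(scores)
    # Log of the softmax normalizer, computed stably.
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[true_index]

identity = [[1.0, 0.0], [0.0, 1.0]]
loss_tied = contrastive_loss([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], 0, identity)
loss_easy = contrastive_loss([1.0, 0.0], [[5.0, 0.0], [-5.0, 0.0]], 0, identity)
```

When the positive and negative sources score identically the loss equals the log of the number of candidates; it approaches zero as the phrase representation aligns with the true source and away from the negatives.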

3.5.3 Training Loss

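We adopt the widely used negative log-likelihood loss in our sequence generation model, denoted as $\mathcal{L}_{NLL}$. The overall loss we use in our model is

$$\mathcal{L} = \mathcal{L}_{NLL} + \lambda_{OR} \cdot \mathcal{L}_{OR} + \lambda_{TE} \cdot \mathcal{L}_{TE} \tag{12}$$

where $\lambda_{OR}$ and $\lambda_{TE}$ are scalars. As shown in Figure 1, at each time-step $t$, the decoder LSTM has 3 inputs: the embedding vector $x_d^t$ of the input token $w_d^t$, the target encoding $h_{te}^t$, and the hidden state $h_d^{t-1}$ from the previous time-step. Both $\mathcal{L}_{OR}$ and $\mathcal{L}_{TE}$ are designed to boost generation diversity.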

4 Datasets and Experiments

In this section we introduce the datasets we use for evaluation and report results from our models. In the following subsections, catSeq refers to the sequence-to-concatenated-sequences model described in Section 3, without the target encoder and orthogonal regularization; catSeqD refers to the model augmented with the target encoder and orthogonal regularization. During teacher forcing, to build the groundtruth sequences, we sort target keyphrases by their order of first occurrence in the source text, and append absent keyphrases at the end. This order may guide the attention mechanism to attend to source positions in a smoother way. Implementation details can be found in Appendix A.
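The construction of groundtruth target sequences (present phrases ordered by first occurrence, absent phrases appended) can be sketched as follows. The helper names, the token strings `<sep>` and `<eos>`, the exact-match test for presence, and the trailing end-of-sequence token are assumptions made for illustration.

```python
SEP, EOS = "<sep>", "<eos>"  # hypothetical token strings

def first_occurrence(source, phrase):
    """Index of the first verbatim match of `phrase` in `source`, or -1."""
    n = len(phrase)
    for i in range(len(source) - n + 1):
        if source[i:i + n] == phrase:
            return i
    return -1

def build_target(source, phrases):
    """Order phrases (present first, by position; absent last) and join them.

    source  : list of source tokens
    phrases : list of keyphrases, each a list of tokens
    """
    located = [(first_occurrence(source, p), p) for p in phrases]
    present = sorted((x for x in located if x[0] >= 0), key=lambda x: x[0])
    absent = [x for x in located if x[0] < 0]
    target = []
    for k, (_, p) in enumerate(present + absent):
        if k > 0:
            target.append(SEP)
        target.extend(p)
    return target + [EOS]

source = "a visual test development environment".split()
target = build_target(source, [["testing"], ["test", "development"], ["visual"]])
```

Here "visual" and "test development" are present and appear in source order, while the absent phrase "testing" is placed last.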

4.1 Evaluation Methods

As mentioned in Section 3.4, previous models are only able to generate a fixed number of keyphrases, and oftentimes, they over-generate. As a result, these studies opt for evaluating either the top 5 or top 10 results against the groundtruth labels. We argue that this evaluation method is unsuitable for the keyphrase generation task and variable-size set generation tasks in general. To validate this hypothesis, we calculated the performance upper bounds for $F_1@5$ and $F_1@10$ on the KP20k validation set. For this we assume an oracle that generates the groundtruth sets of keyphrases, and when the number of generated keyphrases differs from 5 or 10, we insert random wrong answers. The $F_1@5$ and $F_1@10$ scores of this system are 0.858 and 0.626, respectively, suggesting that the existing evaluation approach is problematic. Furthermore, different datasets have different natures, so using the same arbitrarily chosen $k$ to compute $F_1@k$ on all datasets may be inappropriate. Statistics of all datasets we use in this work are shown in Table 1.
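To see why a fixed cutoff caps the achievable score, consider the per-sample arithmetic for such an oracle. The sketch below is illustrative only: `oracle_f1_at_k` is a hypothetical helper, and it assumes that the oracle is padded with wrong answers when there are fewer than $k$ groundtruth phrases and truncated to $k$ correct phrases when there are more.

```python
def oracle_f1_at_k(num_gold, k):
    """F1@k of an oracle that outputs the gold set, padded or truncated to k."""
    correct = min(num_gold, k)
    precision = correct / k        # padding with wrong answers lowers precision
    recall = correct / num_gold    # truncation lowers recall
    return 2 * precision * recall / (precision + recall)

few = oracle_f1_at_k(3, 5)     # 3 gold phrases, cutoff 5
exact = oracle_f1_at_k(5, 5)   # gold count equals the cutoff
many = oracle_f1_at_k(10, 5)   # 10 gold phrases, cutoff 5
```

Only samples whose number of groundtruth phrases equals the cutoff can reach a perfect score; averaging over a dataset with varying set sizes therefore yields an upper bound below 1.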

| Dataset | #Train | #Valid | #Test | Avg# | Var# |
|---|---|---|---|---|---|
| KP20k | 567,830 | 20k | 20k | 5.27 | 14.22 |
| Inspec | - | 1500 | 500 | 9.57 | 22.42 |
| Krapivin | - | 1844 | 460 | 5.24 | 6.64 |
| NUS | - | 169 | 42 | 11.54 | 64.57 |
| SemEval | - | 144 | 100 | 15.67 | 15.10 |
| StackExchange | 298,965 | 16k | 16k | 2.69 | 1.37 |
| TextWorld ACG | 12,837 | 575 | 575 | 9.96 | 25.01 |

Table 1: Statistics of datasets we use in this work. Avg# and Var# indicate the mean and variance of numbers of target phrases per data point, respectively.

On the other hand, a well-trained neural sequence generation model should have the ability to decide when to stop, i.e. decide how many keyphrases to predict by itself, by generating a stop token (e.g., EOS). Taking all this into consideration, we propose two new evaluation methods for set generation tasks:

  • $F_1@\mathcal{M}$ (number determined by model): For models that generate variable numbers of outputs, compute the $F_1$ score between all phrases generated by the model and the phrases in the groundtruth.

  • $F_1@\mathcal{V}$ (number determined by validation performance): For models that over-generate, take the $k$ value that gives the highest $F_1@k$ on the validation set, and report test performance using this value. On data points where the model still generates fewer than $\mathcal{V}$ outputs, the score is computed over all generated outputs.
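The two evaluation methods can be sketched in a few lines. The function names are hypothetical, predictions are assumed to be ranked and free of duplicates, matching is by exact string equality, and the candidate cutoffs 1 to 10 are an arbitrary choice for this example.

```python
def f1_score(predicted, gold):
    """F1 between a list of predicted phrases and a set of gold phrases."""
    if not predicted or not gold:
        return 0.0
    correct = sum(1 for p in predicted if p in gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

def f1_at_m(predicted, gold):
    """Score everything the model chose to generate."""
    return f1_score(predicted, gold)

def f1_at_k(predicted, gold, k):
    """Score the top-k predictions (or all of them if fewer than k exist)."""
    return f1_score(predicted[:k], gold)

def choose_v(validation_set, candidates=range(1, 11)):
    """Pick the cutoff with the best mean F1@k over (predicted, gold) pairs."""
    def mean_f1(k):
        return sum(f1_at_k(p, g, k) for p, g in validation_set) / len(validation_set)
    return max(candidates, key=mean_f1)

validation = [(["a", "b", "x", "y"], {"a", "b"}), (["c", "d", "z"], {"c", "d"})]
best_cutoff = choose_v(validation)
```

On this toy validation set a cutoff of 2 is selected, because both samples have exactly two groundtruth phrases ranked at the top; the selected cutoff would then be applied unchanged to the test set.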

In the following subsections, we report our experimental results on multiple datasets and compare with CopyRNN and other existing works. (We compare all our results under the new evaluation methods with CopyRNN because, at the time of writing, it is the only open-sourced model that enables us to modify the evaluation code for fair comparisons.)

During generation, we use two strategies: Greedy and Recall+. In the Greedy strategy, the decoder generates only one target sequence greedily, halting generation whenever an EOS token is generated. In the Recall+ strategy, we generate more outputs to boost recall by utilizing beam search, where each beam generates a sequence of keyphrases. We then follow ye2018kp_semi to rank the generated set as follows: 1) Split each beam by SEP, remove duplicates; 2) Beams with higher probabilities have higher rank; 3) Within a beam, keyphrases that are generated earlier have higher rank.
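The three ranking rules of the Recall+ strategy can be sketched as below. The function name `rank_phrases` and the `<sep>` token string are placeholders, and each beam is assumed to arrive as a (log-probability, token list) pair with any end-of-sequence token already removed.

```python
SEP = "<sep>"  # hypothetical delimiter token

def rank_phrases(beams):
    """Flatten beams into one ranked, duplicate-free phrase list.

    beams : list of (log_probability, token_list) pairs
    """
    ranked, seen = [], set()
    # Rule 2: higher-probability beams come first.
    for _, tokens in sorted(beams, key=lambda b: b[0], reverse=True):
        phrase = []
        # Rule 3: within a beam, earlier phrases are visited (and ranked) first.
        for tok in tokens + [SEP]:  # trailing SEP flushes the last phrase
            if tok != SEP:
                phrase.append(tok)
                continue
            text = " ".join(phrase)
            # Rule 1: split by SEP and drop duplicates.
            if text and text not in seen:
                seen.add(text)
                ranked.append(text)
            phrase = []
    return ranked

ranked = rank_phrases([
    (-2.0, ["gui", SEP, "test"]),
    (-1.0, ["test", SEP, "tool"]),
])
```

The more probable beam contributes "test" and "tool" first; the less probable beam then adds only "gui", since its "test" is a duplicate.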

4.2 Experiments on Scientific Publication Datasets

Following meng2017deep; chen2018kp_correlation; ye2018kp_semi; chen2018kp_title, we use the KP20k dataset to train our model and test it on a collection of scientific publication datasets, namely KP20k, Inspec, Krapivin, NUS, and SemEval.

KP20k is a dataset constructed by meng2017deep that comprises a large number of scientific publications. For each article, the abstract and title are used as the source text while the author keywords are used as the target. The other four datasets contain far fewer articles, so we use them to test the transferability of our model without tuning on them. Our proposed $F_1@\mathcal{V}$ metric requires validation data to determine the value of $\mathcal{V}$, so for these four datasets, we use the training set for validation if it exists, and otherwise split the test set into two portions: we order all the data points alphabetically by file name and take the first 20% as the test set and the remaining 80% as the validation set.
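The alphabetical 20/80 split can be reproduced with a short sketch. The function name `split_test_set` and the file-name pattern are hypothetical, and rounding the cut point down is an assumption, although it does reproduce the NUS sizes in Table 1 when applied to 211 documents.

```python
def split_test_set(file_names, test_fraction=0.2):
    """Sort file names alphabetically; first 20% test, remaining 80% validation."""
    ordered = sorted(file_names)
    cut = int(len(ordered) * test_fraction)  # rounding down is an assumption
    return ordered[:cut], ordered[cut:]

names = ["doc%03d.txt" % i for i in range(211)]
test_files, valid_files = split_test_set(names)
```

Sorting by file name makes the split deterministic, so the same partition can be recovered without storing an index file.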

We report our model’s performance on the present-keyphrase portion of the KP20k dataset in Table 2, using the $F_1@\mathcal{M}$ and $F_1@\mathcal{V}$ metrics. To compare with previous works, we also compute $F_1@5$ and $F_1@10$ scores; when the model generates fewer than 5 or 10 keyphrases, we use whatever it generates to compare with the groundtruth. From the table, we can see that TG-Net performs the best on $F_1@5$, but its score drops immediately on the top 10 predictions, perhaps because the quality of its 6th to 10th keyphrases is poor. An interesting observation is that, unlike existing models, our model always has better scores on $F_1@10$ than on $F_1@5$. This suggests that our model always has an accurate expectation regarding the number of keyphrases there should be, i.e., it decides to generate more than 5 keyphrases when necessary.

catSeqD outperforms catSeq on all metrics, suggesting the target encoder and orthogonal regularization help the model to generate higher quality keyphrases.

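| Model | $F_1@5$ | $F_1@10$ | $F_1@\mathcal{M}$ | $F_1@\mathcal{V}$ |
|---|---|---|---|---|
| catSeq | 34.1 | 34.5 | 34.5 | 34.5 |
| catSeqD | 35.8 | 36.1 | 36.2 | 36.2 |
| CopyRNN | 32.8 | 25.5 | - | 33.1 |
| Multi-Task | 30.8 | 24.3 | - | - |
| TG-Net | 37.2 | 31.6 | - | - |
| catSeq | 33.8 | 34.5 | 34.6 | 34.6 |
| catSeqD | 35.0 | 35.8 | 35.9 | 35.9 |

Table 2: Present keyphrase predicting performance on KP20k test set. Compared with CopyRNN (meng2017deep), Multi-Task (ye2018kp_semi), and TG-Net (chen2018kp_title).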

We report our models’ performance on the four transfer learning datasets in Table 3, using $F_1@\mathcal{M}$ and $F_1@\mathcal{V}$ for the Greedy and Recall+ modes, respectively. The scores are computed on the present portion of the four datasets. Our model performs competitively with CopyRNN on all four, and even outperforms it on NUS and SemEval. From Table 1, we know the statistics of these four datasets are very different from KP20k. For example, in KP20k there are on average 5.27 keyphrases per data point, while SemEval has 15.67. Without the ability to transfer, a model would memorize the distribution of the training data, and would thus have no chance of performing better on SemEval than an over-generating model (e.g., CopyRNN). This is evidence that our model generalizes effectively on the scientific publication datasets.

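| Model | Inspec | Krapivin | NUS | SemEval |
|---|---|---|---|---|
| Greedy ($F_1@\mathcal{M}$) | | | | |
| catSeq | 21.8 | 32.0 | 33.0 | 28.2 |
| catSeqD | 22.5 | 32.0 | 35.9 | 30.6 |
| Recall+ ($F_1@\mathcal{V}$) | | | | |
| CopyRNN | 34.8 | 32.4 | 32.9 | 28.1 |
| catSeq | 28.3 | 28.3 | 35.9 | 28.9 |
| catSeqD | 30.9 | 27.4 | 36.3 | 29.9 |

Table 3: Model performance on transfer learning datasets.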

On these datasets, generating absent keyphrases is extremely difficult and is far from solved. The current state-of-the-art model (chen2018kp_title) gets 0.267 and 0.075 recall on KP20k and SemEval datasets, respectively, by forcing the model to generate 50 keyphrases. We argue that evaluating a set generation task by only recall is problematic. In achieving higher recall by generating more keyphrases, a model sacrifices precision. An extreme case would be if a model generates all possible English phrases: recall would be high but precision would be close to 0. On the contrary, the precision score can be high if the model always generates only one keyphrase. According to our experiments, the scores on the absent portion of KP20k are around , which makes it hard to compare different models’ performance. We therefore do not report results on the absent portion and leave it for future work.

4.3 Experiments on StackExchange Dataset

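| Model | Dev present | Dev absent | Test present | Test absent |
|---|---|---|---|---|
| Greedy ($F_1@\mathcal{M}$) | | | | |
| catSeq | 53.7 | 9.6 | 54.5 | 9.2 |
| catSeqD | 54.4 | 12.7 | 54.7 | 12.6 |
| Recall+ ($F_1@\mathcal{V}$) | | | | |
| CopyRNN | 57.6 | 24.6 | 58.0 | 24.8 |
| catSeq | 56.9 | 17.4 | 57.3 | 17.2 |
| catSeqD | 61.5 | 26.2 | 61.1 | 26.6 |

Table 4: Model performance on StackExchange dataset.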

Inspired by the StackLite tag recommendation task on Kaggle, we build a new benchmark based on the public StackExchange data. We choose 19 computer-science-related topics from the Oct. 2017 dump. From the raw data, we use the question title and question text as the source, and use user-assigned tags as target keyphrases.

Since oftentimes the questions on StackExchange contain less information than in scientific publications, there are fewer keyphrases per data point in StackExchange. Furthermore, StackExchange uses a tag recommendation system that suggests topic-relevant tags to users while submitting questions; therefore, we are more likely to see general terminology such as Linux and Java. This characteristic challenges models with respect to their ability to distill major topics of a question rather than selecting specific snippets from the text.

We report our models’ performance on StackExchange in Table 4. Results show catSeqD performs the best with both generation strategies; on the absent-keyphrase generation tasks, it outperforms catSeq by a large margin.

4.4 Text-based Game Command Generation

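| Model | Greedy ($F_1@\mathcal{M}$) Dev | Greedy ($F_1@\mathcal{M}$) Test | Recall+ ($F_1@\mathcal{V}$) Dev | Recall+ ($F_1@\mathcal{V}$) Test |
|---|---|---|---|---|
| CopyRNN | - | - | 66.9 | 68.5 |
| catSeq | 88.3 | 85.4 | 87.2 | 88.4 |
| catSeqD | 90.6 | 86.2 | 87.3 | 88.1 |

Table 5: Model performance on TextWorld ACG dataset.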
Source: a visual test development environment for gui systems
We have implemented an experimental test development environment (TDE) intended to raise the effectiveness of tests produced for GUI systems, and raise the productivity of the GUI system tester. The environment links a test designer, a test design library, and a test generation engine with a standard commercial capture/replay tool. These components provide a human tester the capabilities to capture sequences of interactions with the system under test (SUT), to visually manipulate and modify the sequences, and to create test designs that represent multiple individual test sequences. Test development is done using a high-level model of the SUT’s GUI, and graphical representations of test designs. TDE performs certain test maintenance tasks automatically, permitting previously written test scripts to run on a revised version of the SUT.

catSeq: test development ; test development environment ; test ; test generation

catSeqD: test generation ; gui ; tool ; version ; capabilities ; systems ; design ; test ; human ; generation

Groundtruth: engine ; developer ; design ; human ; standardization ; tool ; links ; graphics ; model ; libraries ; replay ; component ; interaction ; product ; development environment ; script ; visualization ; capabilities ; systems ; experimentation ; test designer ; environments ; test generation ; testing ; maintenance ; test maintenance ; version ; effect ; sequence

Table 6: Example from KP20k validation set, predictions generated by catSeq and catSeqD models.

Text-based games have recently received growing attention in the NLP, machine learning, and reinforcement learning communities. They are turn-based games in which all communication between the game engine and the player occurs through text. At each game step, an agent receives a text observation describing the environment as input, then issues a text command to take an action in the game. The engine responds in turn with textual feedback and a numerical reward.

We here consider the admissible commands that an agent can issue in response to an observation to be keyphrases. Admissible commands are those that have some effect on the environment described by the text observation, and thus require picking out key details from it. In this formulation, the observation is the source text and the set of admissible commands forms the target.
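As a minimal sketch of this formulation, the snippet below turns one (observation, command-set) pair into a source/target pair. Joining the commands with a separator token mirrors the sequential generation used by our models, but the token strings `<sep>` and `<eos>` and the helper name `make_datapoint` are assumptions made for illustration, not the actual preprocessing code.

```python
from typing import List, Tuple

SEP = "<sep>"  # assumed delimiter token between target phrases
EOS = "<eos>"  # assumed end-of-sequence token


def make_datapoint(observation: str, commands: List[str]) -> Tuple[str, str]:
    """Build a (source, target) pair from an observation and its admissible commands."""
    # Collapse runs of whitespace in the observation and in each command.
    source = " ".join(observation.split())
    phrases = [" ".join(c.split()) for c in commands if c.strip()]
    # Concatenate the commands into one target sequence, then close it with EOS.
    target = f" {SEP} ".join(phrases)
    target = f"{target} {EOS}".strip()
    return source, target


source, target = make_datapoint(
    "You open the antique trunk, revealing an old key. You are carrying nothing.",
    ["close antique trunk", "open chest drawer", "take old key from antique trunk"],
)
print(target)
```

The example uses the Bedroom datapoint from Appendix C; the order of the commands in the target is simply the order in which they are listed.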

To generate keyphrase-style training data from text-based games, we use TextWorld (cote18textworld), a text-based learning environment, to procedurally generate a collection of games and their playthroughs. Some example (observation, command-set) datapoints are shown in Appendix C. We call this dataset TextWorld ACG; its statistics can be found in Table 1.

We report our models’ performance on this dataset in Table 5. Note that since almost all target commands are absent from the source text, we do not differentiate present and absent keyphrases in this dataset. From the table, we can see catSeqD outperforms catSeq with the Greedy generation strategy and gets similar performance when over-generating. Both of our models outperform CopyRNN by a large margin.

5 Analysis and Discussion

5.1 Effect of Target Encoding and Orthogonal Regularization

To verify our assumption that target encoding and orthogonal regularization help to boost the diversity of generated sequences, we use two metrics, one quantitative and one qualitative, to measure the diversity of generation. First, we calculate the average number of unique predictions produced by catSeq and catSeqD on the KP20k and TextWorld ACG datasets. The averages from catSeqD are consistently slightly greater than those from catSeq; further details are given in Appendix B.
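The first measure can be sketched as follows. This is an illustrative implementation only: the function name `average_unique_predictions` is hypothetical, and deduplicating by lowercased, whitespace-normalized strings is an assumption (other normalizations, such as stemming, would give different counts).

```python
from typing import List


def average_unique_predictions(predictions: List[List[str]]) -> float:
    """Mean number of distinct phrases per example after simple normalization."""
    if not predictions:
        return 0.0
    counts = []
    for phrases in predictions:
        # Assumed normalization: lowercase and collapse whitespace.
        normalized = {" ".join(p.lower().split()) for p in phrases}
        normalized.discard("")  # ignore empty predictions
        counts.append(len(normalized))
    return sum(counts) / len(counts)


example = [
    ["test generation", "gui", "gui", "tool"],    # 3 unique phrases
    ["go east", "Go east", "open refrigerator"],  # 2 unique phrases
]
print(average_unique_predictions(example))  # 2.5
```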

Second, while running the model on the KP20k validation set, we randomly sample 2000 decoder hidden states at steps following a delimiter and project them with t-SNE (tsne), an unsupervised dimensionality-reduction method. Figure 2 shows that hidden states sampled from catSeqD form more clearly separated clusters, whereas hidden states sampled from catSeq yield one mass of vectors with no obvious distinct clusters. Results on both metrics suggest that target encoding and orthogonal regularization indeed help diversify the generation of our model.
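The state-selection step of this analysis can be sketched as below, under stated assumptions: the delimiter is written `<sep>`, hidden states are stored as one vector per decoding step, and the `offset` parameter (how many steps after the delimiter to look) and both function names are hypothetical.

```python
import random
from typing import List, Sequence

DELIMITER = "<sep>"  # assumed surface form of the delimiter token


def positions_after_delimiter(tokens: Sequence[str], offset: int = 1) -> List[int]:
    """Indices of decoding steps that lie `offset` steps after a delimiter."""
    return [
        i + offset
        for i, tok in enumerate(tokens)
        if tok == DELIMITER and i + offset < len(tokens)
    ]


def sample_states(
    sequences: Sequence[Sequence[str]],
    states: Sequence[Sequence[Sequence[float]]],
    n: int = 2000,
    seed: int = 0,
) -> List[Sequence[float]]:
    """Pool the hidden states at post-delimiter steps and sample up to n of them."""
    pool = [
        states[s][i]
        for s, tokens in enumerate(sequences)
        for i in positions_after_delimiter(tokens)
    ]
    if len(pool) <= n:
        return pool
    return random.Random(seed).sample(pool, n)


tokens = ["gui", "<sep>", "test", "generation", "<sep>", "tool", "<sep>"]
print(positions_after_delimiter(tokens))  # [2, 5]
```

The sampled vectors can then be passed to any t-SNE implementation (for example, scikit-learn's `sklearn.manifold.TSNE`) to obtain the two-dimensional projection.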

Figure 2: t-SNE results on decoder hidden states. Upper row: catSeq; lower row: catSeqD; column shows hidden states sampled from tokens at steps following a delimiter.

5.2 Result Examples

To illustrate how the predictions of catSeq and catSeqD differ, we show an example from the KP20k validation set in Table 6. This example has 29 groundtruth phrases. Neither model generates all of the keyphrases, but the predictions from catSeq all start with “test”, while the predictions from catSeqD are more diverse. This supports, to some extent, our assumption that without the target encoder and orthogonal regularization, decoder states generated from delimiter SEP are less diverse.
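The observation about leading words can be checked directly on the predictions in Table 6. The snippet below is a small illustration, not part of our evaluation protocol; `first_token_diversity` and `parse` are hypothetical helper names.

```python
from typing import List


def parse(prediction_line: str) -> List[str]:
    """Split a ';'-separated prediction string into phrases."""
    return [p.strip() for p in prediction_line.split(";") if p.strip()]


def first_token_diversity(phrases: List[str]) -> int:
    """Number of distinct first words among the predicted phrases."""
    return len({p.split()[0] for p in phrases})


catseq = parse("test development ; test development environment ; test ; test generation")
catseqd = parse(
    "test generation ; gui ; tool ; version ; capabilities ; systems ; "
    "design ; test ; human ; generation"
)

print(first_token_diversity(catseq))   # 1
print(first_token_diversity(catseqd))  # 9
```

All four catSeq phrases share a single first word, whereas the ten catSeqD phrases begin with nine different words.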

6 Conclusion and Future Work

We propose a recurrent generative model that generates multiple keyphrases sequentially, with two extra modules that promote generation diversity. We also propose new metrics to evaluate keyphrase generation. Our model shows promising performance on three keyphrase generation datasets, including two newly introduced in this work. In future work, we plan to investigate how target phrase order affects generation behavior, and to further explore ways to generate sets in an order-invariant way. We would also like to investigate how to leverage reinforcement learning to help keyphrase generation.


Appendix A Implementation Details

Implementation details of our proposed models are as follows. In all experiments, the word embeddings are initialized with 150-dimensional random matrices. The number of hidden units in both the encoder and decoder LSTMs is 300. The number of hidden units in the target encoder LSTM is 64. The vocabulary size is 50,000.

The numbers of hidden units in MLPs described in Section 3 are as follows.

Dimension 300 300 1 50k 300 1 64 64

During target encoder training, we maintain a queue of size 128 to store the most recent source representations. During negative sampling, we randomly draw 32 samples from this queue; the target encoding loss in Equation 11 is thus a 33-way classification loss. In catSeqD, we set both hyperparameters in Equation 12 to 0.03. In all experiments, we use a dropout rate of 0.3.
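The queue-based negative sampling can be sketched as follows. The queue size (128) and number of negatives (32) come from the description above; everything else is an assumption for illustration: vectors are plain Python lists, candidates are scored with a dot product (the actual scoring function is defined by Equation 11), the queue is updated after the loss is computed, and the names `contrastive_loss` and `step` are hypothetical.

```python
import math
import random
from collections import deque
from typing import List, Optional, Sequence

QUEUE_SIZE = 128    # number of recent source representations kept
NUM_NEGATIVES = 32  # negatives drawn per step, giving a 33-way classification

queue: deque = deque(maxlen=QUEUE_SIZE)  # oldest entries are dropped automatically
rng = random.Random(0)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(target_vec, positive, negatives: List[Sequence[float]]) -> float:
    """Cross-entropy of picking the true source among 1 + len(negatives) candidates."""
    scores = [dot(target_vec, positive)] + [dot(target_vec, n) for n in negatives]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]


def step(target_vec, source_vec) -> Optional[float]:
    """Compute the loss for one example, then store its source representation."""
    loss = None
    if len(queue) >= NUM_NEGATIVES:
        negatives = rng.sample(list(queue), NUM_NEGATIVES)
        loss = contrastive_loss(target_vec, source_vec, negatives)
    queue.append(source_vec)
    return loss


losses = [step([0.0, 0.0], [0.0, 0.0]) for _ in range(200)]
print(len(queue))  # 128
```

With uninformative (all-zero) vectors, every candidate receives the same score, so the loss equals log 33, the value expected for a uniform guess over 33 classes.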

We use Adam (kingma2014adam) as the step rule for optimization. The learning rate is . The model is implemented using PyTorch (paszke2017automatic).

Appendix B Average Number of Unique Predictions by Models

                     catSeq              catSeqD
Dataset          Greedy  Recall+    Greedy  Recall+
KP20k              3.43     5.05      3.54     5.23
TextWorld ACG      8.24    10.97      8.88    11.30
Table 7: Average number of generated keyphrases on KP20k and TextWorld ACG.

Appendix C TextWorld ACG Data Examples

See Table 8.

Observations -= Kitchen =- You find yourself in a kitchen. An usual one. Let’s see what’s in here.
You can see a closed refrigerator in the room. You see a counter.
Why don’t you take a picture of it, it’ll last longer! The counter appears to be empty.
You swear loudly. You can see a kitchen island.
You shudder, but continue examining the kitchen island. The kitchen island is typical.
On the kitchen island you can see a note. You can see a stove.
On the stove you make out a tomato plant.
You idly wonder how they came up with the name TextWorld for this place. It’s pretty fitting.
There is an open screen door leading east. There is an open wooden door leading west.
There is an unblocked exit to the north. You need an unblocked exit? You should try going south.
You put the tomato plant on the stove. You are carrying: an old key.
Admissible close screen door; close wooden door; cook tomato plant; drop old key; open refrigerator;
Commands go east; go north; go south; go west; put old key on counter; put old key on kitchen island;
put old key on stove; take note from kitchen island; take tomato plant from stove
Observations -= Bedroom =- You’re now in a bedroom. You make out a chest drawer.
Look over there! an antique trunk. The antique trunk contains an old key.
You make out a king-size bed. But there isn’t a thing on it.
What, you think everything in TextWorld should have stuff on it?
There is a closed wooden door leading east.
You open the antique trunk, revealing an old key. You are carrying nothing.
Admissible close antique trunk ; open chest drawer ;
Commands take old key from antique trunk
Table 8: Example observations and admissible commands in TextWorld ACG dataset.