Neural Text deGeneration with Unlikelihood Training


Sean Welleck*, Ilia Kulikov*, Stephen Roller, Emily Dinan, Kyunghyun Cho, Jason Weston
New York University, Facebook AI Research, CIFAR Azrieli Global Scholar
* Equal contribution; the ordering was decided by a coin flip.

Neural text generation is a key tool in natural language applications, but it is well known there are major problems at its core. In particular, standard likelihood training and decoding lead to dull and repetitive outputs (holtzman2019curious). While some post-hoc fixes have been proposed, in particular top-k and nucleus sampling, they do not address the fact that the token-level probabilities predicted by the model are poor. In this paper we show that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution. We propose a new objective, unlikelihood training, which forces unlikely generations to be assigned lower probability by the model. We show that both token and sequence level unlikelihood training give less repetitive, less dull text while maintaining perplexity, giving superior generations using standard greedy or beam search. According to human evaluations, our approach with standard beam search also outperforms the currently popular decoding methods of nucleus sampling or beam blocking, thus providing a strong alternative to existing techniques.



1 Introduction

Neural text generation is a vital tool in a wide range of natural language applications. However, the standard approach – training a sequence to sequence model, e.g. Transformer (vaswani2017attention), to maximize log-likelihood and approximately decoding the most likely sequence – is known to be flawed. Generated text in open-ended applications such as language modeling or dialogue has been observed to be dull, with high frequency tokens used too often and interesting content words used too rarely (holtzman2019curious; dinan2019second). Moreover, the models repeat themselves at the token, phrase, and sentence levels, and statistics comparing a set of human-generated utterances and model-generated responses indicate a discrepancy between the human and model word distributions. This does not appear to be rectified by training on more data (radford2019language). Recent fixes involve modifying the decoding strategy using sampling or more sophisticated beam search variants. However, these decoding strategies do not address the core issue: the model’s underlying sequence probabilities are clearly not correct.

Several reasons for exactly why neural text is degenerate have been posited, with the cause currently unknown. Possible candidates include the problem being (i) a by-product of the model architecture, e.g. the Transformer architecture preferring repeats (holtzman2019curious; vig), (ii) an intrinsic property of human language (holtzman2019curious) rather than a modeling deficiency, or that (iii) a training objective relying on fixed corpora cannot take into account the real goal of using the language (yejin_talk). Our work shows that, while the above may be factors, a primary factor is the use of the likelihood objective itself, as we demonstrate that degeneration is alleviated if we replace the likelihood objective with our proposal.

While low perplexity in the limit should lead to predicting the correct next target word, there are two major flaws of the likelihood objective: (i) it pays relatively little attention to the argmax or the top of the ranked list of next token probabilities, instead optimizing the likelihood of the entire distribution; (ii) it is not focused on optimizing sequence generation, only on producing the next token. The first issue means that greedy or beam search decoding, which rely on the top of the list to generate, are not optimized – there is a discrepancy between maximizing the log-probability of a ground-truth token and ensuring the rank of the ground-truth token to be one. The second issue means that during sequence generation, any imperfection in next token prediction leads to error accumulation that is not addressed by likelihood training.

In this work, we introduce unlikelihood training, an approach that addresses the two aforementioned issues. It combines two types of updates: a likelihood update on the true target tokens so that they are assigned high probability, and an unlikelihood update on tokens that are otherwise assigned too high a probability. We can collect these unlikely token candidates either during next-token prediction or from generated sequences, allowing us to train at both the token and sequence levels. Both token and sequence level unlikelihood training are shown to improve metrics that measure dullness and repetition of the model, while maintaining performance in other metrics such as perplexity or token accuracy compared to the maximum likelihood baseline. Finally, we assess our models using human evaluations. We find that our generations have vastly improved quality compared to likelihood trained models when both models use beam search decoding. Moreover, our approach when using beam search also significantly improves over likelihood trained models using either beam blocking or nucleus sampling, thus outperforming the current state-of-the-art.

2 Related Work

Neural Text Degeneration

Recently, several papers have observed various forms of neural text degeneration, especially in open-ended generation tasks. In dialogue, it has been shown that there is a mismatch between model and human word distributions, where generative models are more likely to output frequent words, but less likely to produce rare words compared to humans. For example, this was observed across all generative models submitted to the ConvAI2 NeurIPS 2018 competition (dinan2019second). In language modeling, the work of holtzman2019curious highlighted problems with the word frequency distribution and level of repetition in model generations compared to human text. These issues are not remedied by simply increasing the amount of the training data; e.g. large-scale GPT-2 language models (radford2019language) display the same issues.

Improved Decoding Algorithms

Several methods have been proposed to rectify these issues. The primary ones involve changing the decoding method to a sophisticated beam search variant or to stochastic decoding, e.g. sampling. Different variants of beam search have been explored (li2016simple; vijayakumar2018diverse; kulikov2018importance; holtzman2018learning) which can decrease a model’s level of repetition by selecting candidates that are unlike previously chosen ones. Separately, hard or soft beam blocking has been investigated (paulus2017deep; klein2017opennmt), whereby previously generated n-grams are blocked from subsequent generation. This approach is often used in dialogue generation, fixing some token or phrase level repetitions but removing repetitions that would naturally occur in human text.

The second major approach is that of sampling from the model at generation time. Top-k sampling (fan2018hierarchical) and nucleus sampling (holtzman2019curious) are two methods that sample sequences based on a function of the predicted next token probability distribution given by the model. Both approaches vastly improve the repetition issue, as the randomization often reduces the number of duplicate tokens in a decoded sequence, even if highly scored paths under the model (represented by beam search candidates) contain repetitions. However, as the underlying model is unchanged, it often prefers semantically similar phrasing, depending on the temperature parameter of the sampling (holtzman2019curious). Furthermore, this solution is less relevant in less open-ended tasks such as machine translation, where beam search variants are the preferred method. Ideally we would like a model that can work with both beam and sampling decoding methods.

Improved Learning Algorithms

The proposed learning criteria are closely related to structured output prediction methods in which the goal is to increase the scores assigned by a model to true examples while decreasing those assigned to negative examples often generated by the model itself. Some representative algorithms include structured perceptron (collins-2002-discriminative), energy-based models (lecun2006tutorial) and more recently reflective likelihood (dieng2018learning). A particular variant in this family of algorithms, called negative training, was recently used by he2019negative to prevent generic and malicious responses in dialogue models. Similarly, these structured prediction algorithms with neural language models have been applied to machine translation in recent years by shen2015minimum and edunov2017classical.

3 Neural Text Generation

Language Modeling

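In language modeling, our goal is to model a probability distribution $p_*(x)$ over variable-length text sequences $x = (x_1, \ldots, x_{|x|})$ composed of tokens from a vocabulary, $x_t \in \mathcal{V}$. We wish to find a model $p_\theta(x)$ which resembles $p_*(x)$, meaning that samples $\hat{x} \sim p_\theta$ are similar to samples from $p_*$, and $p_\theta(x) \approx p_*(x)$ for all $x$. When $p_\theta(x)$ is parameterized by a neural network, we call $p_\theta$ a neural language model. We assume that $p_\theta$ takes the form $p_\theta(x) = \prod_{t=1}^{|x|} p_\theta(x_t \mid x_{<t})$.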

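The de facto approach to training such a model is to find parameters $\theta$ that maximize the log-likelihood of a finite set of samples $\mathcal{D}$ from $p_*$ by minimizing:

$$\mathcal{L}_{\text{MLE}}(p_\theta, \mathcal{D}) = -\sum_{i=1}^{|\mathcal{D}|} \sum_{t=1}^{|x^{(i)}|} \log p_\theta(x_t^{(i)} \mid x_{<t}^{(i)}). \tag{1}$$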
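
As a minimal sketch of the maximum likelihood objective, assume the model's per-token probabilities of the true tokens are already available as plain Python lists (the function names and the list-of-lists input format below are illustrative assumptions, not part of the paper's code). The loss is the summed negative log-probability, and perplexity is its per-token exponential:

```python
import math

def sequence_nll(true_token_probs):
    """Negative log-likelihood of one sequence.

    true_token_probs[t] is the model probability of the ground-truth token
    at step t given the preceding tokens (hypothetical input format).
    """
    return -sum(math.log(p) for p in true_token_probs)

def mle_loss(dataset_probs):
    """Sum of per-sequence negative log-likelihoods over a dataset."""
    return sum(sequence_nll(seq) for seq in dataset_probs)

def perplexity(dataset_probs):
    """Exponential of the average per-token negative log-likelihood."""
    n_tokens = sum(len(seq) for seq in dataset_probs)
    return math.exp(mle_loss(dataset_probs) / n_tokens)
```

A model that assigns probability 0.5 to every true token has perplexity 2 under this definition, regardless of how the tokens are grouped into sequences.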

Sequence Completion

A closely related problem consists of sampling a sub-sequence, or prefix, $x_{1:k} \sim p_*$, then using $p_\theta$ to conditionally decode a continuation, $\hat{x}_{k+1:N} \sim p_\theta(\cdot \mid x_{1:k})$. We now want the resulting completion $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ to resemble a sample from $p_*$.

We use sequence completion as a setting to study the behavior of neural language models due to its generality. For instance, sequence completion encompasses story generation (fan2018hierarchical), contextual text completion (radford2019language), language modeling (for $k = 0$), and dialogue modeling (Zhang2018PersonalizingToo) where $x_{1:k}$ is a dialogue history and a continuation is a next utterance.

Given $p_\theta$ and a prefix $x_{1:k}$, finding the optimal continuation is not tractable, so in practice approximate deterministic or stochastic decoding strategies are used to generate continuations.

Deterministic Decoding

Two widely used deterministic decoding approaches are greedy search and beam search. The former can be seen as a special case of the latter. Greedy search selects the highest probability token at each time step: $x_t = \arg\max_{x} p_\theta(x \mid x_{<t})$. Beam search maintains a fixed-size set of partially-decoded sequences, called hypotheses. At each time step, beam search forms new hypotheses by appending each token in the vocabulary to each existing hypothesis, scoring the resulting sequences, and keeping the highest-scoring ones. As we describe in Section 4, these deterministic decoding strategies, which depend highly on underlying model probabilities, expose issues with conventionally trained neural language models.
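
The two deterministic strategies can be sketched as follows. This is an illustrative toy, not the decoder used in the experiments: `toy_model` is a hypothetical stand-in for a language model that returns a next-token distribution as a dictionary, and hypotheses are scored by summed log-probability without length normalization.

```python
import math

def toy_model(seq):
    """Hypothetical next-token distribution, keyed on the last token only."""
    last = seq[-1]
    if last == "<s>":
        return {"a": 0.6, "b": 0.4}
    if last == "a":
        return {"x": 0.55, "y": 0.45}
    return {"x": 0.9, "y": 0.1}

def greedy_search(next_probs, prefix, steps):
    """Append the single most probable token at every step."""
    seq = list(prefix)
    for _ in range(steps):
        probs = next_probs(seq)
        seq.append(max(probs, key=probs.get))
    return seq

def beam_search(next_probs, prefix, steps, beam_size):
    """Keep the beam_size highest-scoring hypotheses at every step."""
    beams = [(0.0, list(prefix))]  # (log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, p in next_probs(seq).items():
                if p > 0.0:
                    candidates.append((score + math.log(p), seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][1]
```

On this toy model greedy search commits to "a" (probability 0.6) and ends with a sequence of probability 0.33, whereas a beam of size 2 recovers the sequence through "b" with probability 0.36. With `beam_size=1` the two procedures coincide, which is the sense in which greedy search is a special case of beam search.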

Stochastic Decoding

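An alternative is to sample from a model-dependent distribution at each step, $x_t \sim q(x_t \mid x_{<t}, p_\theta)$. In order to prevent sampling low probability tokens, a typical approach is to restrict sampling to a subset of the vocabulary $U \subset \mathcal{V}$ at each step:

$$q(x_t \mid x_{<t}, p_\theta) = \begin{cases} p_\theta(x_t \mid x_{<t}) / Z & x_t \in U \\ 0 & \text{otherwise,} \end{cases}$$

where $Z = \sum_{x \in U} p_\theta(x \mid x_{<t})$. The top-k sampler restricts sampling to the $k$ most-probable tokens; i.e. $U$ is the size $k$ subset of $\mathcal{V}$ which maximizes $\sum_{x \in U} p_\theta(x \mid x_{<t})$ (fan2018hierarchical). The nucleus sampler instead restricts sampling to the smallest set of tokens with total mass above a threshold $p \in [0, 1]$; i.e. $U$ is the smallest subset with $\sum_{x \in U} p_\theta(x \mid x_{<t}) \geq p$ (holtzman2019curious).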
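
A minimal sketch of the two restricted samplers, assuming the next-token distribution is given as a dictionary from token to probability. The function names and the example distribution are hypothetical; the nucleus cut-off uses "at least the threshold" as the stopping rule, which is one reasonable reading of the definition above.

```python
import random

def top_k_subset(probs, k):
    """The k most probable tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def nucleus_subset(probs, p):
    """Smallest set of most-probable tokens whose total mass reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    subset, mass = [], 0.0
    for tok in ranked:
        subset.append(tok)
        mass += probs[tok]
        if mass >= p:
            break
    return subset

def restricted_distribution(probs, subset):
    """Renormalize over the subset; tokens outside it get probability 0."""
    z = sum(probs[t] for t in subset)
    return {t: (probs[t] / z if t in subset else 0.0) for t in probs}

def sample(probs, subset, rng=random):
    """Draw one token from the restricted, renormalized distribution."""
    q = restricted_distribution(probs, subset)
    tokens = list(q)
    return rng.choices(tokens, weights=[q[t] for t in tokens], k=1)[0]
```

For example, with `probs = {"the": 0.5, "cat": 0.3, "dog": 0.15, "zebra": 0.05}`, top-k with `k=2` keeps "the" and "cat", while the nucleus sampler with `p=0.9` also keeps "dog". In neither case can "zebra" be sampled, which is how these methods avoid the low-probability tail.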

4 Neural Text Degeneration

In this section we discuss two degenerate properties that frequently occur in conventional neural language models trained with the maximum likelihood objective (Equation 1).


First, model-generated continuations exhibit sequence-level repetition, especially with deterministic decoding. The problem is seen by observing samples in Appendix Table 4, which shows completions from the state-of-the-art GPT-2 language model (radford2019language). Greedy decoding as well as top-k and nucleus sampling exhibit degenerate repetition (with a certain hyper-parameter setting), although greedy decoding shows the worst degradation. Using a Transformer language model trained with maximum likelihood (§6), we find that the average percentage of repeated n-grams in model continuations with greedy decoding (43%) far exceeds that of humans (0.5%), computed over prefixes drawn from a validation corpus.

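Unlike previous work which only focused on degenerate sequence-level repeats (holtzman2019curious), we additionally observe that neural language models exhibit substantially more repetition in next-token prediction compared to human text:

$$\Pr\left(\hat{x}_{k+1} = \arg\max p_\theta(x \mid x_{1:k}) \in x_{1:k}\right) > \Pr\left(x_{k+1} \in x_{1:k}\right). \tag{2}$$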


For instance, the Transformer language model (§6) predicted next-tokens that appeared in the preceding 128 words 62% of the time, versus 49% in ground-truth text. This is especially concerning since the maximum-likelihood objective focuses on optimizing next-token conditional distributions.

Token Distribution Mismatch

Second, both greedy continuations and next-token predictions from conventional neural text generators have different token distributions from human text. As demonstrated by holtzman2019curious, such models with greedy or beam search tend to produce high frequency tokens too often and low frequency tokens too rarely, where frequency is defined by the human token distribution. With the Transformer language model (§6), the set of next-token greedy predictions on a held-out validation set had roughly 40% fewer unique tokens than the ground-truth tokens (11.6k vs. 18.9k), and overproduced frequent tokens (Appendix Figure 1). Such behavior has been linked to generations being judged as dull by humans because rare words can add engaging specificity (weston2018retrieve; see2019makes).

5 The Unlikelihood Training Objective

We now describe unlikelihood training for neural language models, then in Section 6 demonstrate empirically that our proposal substantially reduces neural text degeneration (§4).

5.1 Unlikelihood Training

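The key idea behind unlikelihood training is decreasing the model’s probability of certain tokens, called negative candidates. Given a sequence $(x_1, \ldots, x_T)$ and a set of negative candidate tokens $\mathcal{C}^t = \{c_1, \ldots, c_m\}$, where each $c_j \in \mathcal{V}$, we define the unlikelihood loss for step $t$ as:

$$\mathcal{L}^t_{\text{UL}}(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t) = -\sum_{c \in \mathcal{C}^t} \log(1 - p_\theta(c \mid x_{<t})). \tag{3}$$

The loss decreases as $p_\theta(c \mid x_{<t})$ decreases. We incorporate the unlikelihood loss into a token-level unlikelihood objective which augments each time-step of maximum likelihood training:

$$\mathcal{L}^t_{\text{UL-token}}(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t) = -\alpha \cdot \sum_{c \in \mathcal{C}^t} \log(1 - p_\theta(c \mid x_{<t})) - \log p_\theta(x_t \mid x_{<t}). \tag{4}$$

As candidates, we use previous context tokens:

$$\mathcal{C}^t_{\text{prev-context}} = \{x_1, \ldots, x_{t-1}\} \setminus \{x_t\}. \tag{5}$$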


Intuitively, minimizing the unlikelihood loss with this candidate set makes (i) incorrect repeating tokens less likely, as the previous context contains potential repeats, and (ii) frequent tokens less likely, as these tokens appear often in the previous context. These candidates are efficient to compute, without requiring additional supervision.
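
A minimal sketch of the token-level objective at a single step, assuming the model's next-token distribution is available as a dictionary. The helper names, the 0-based step index, and the example probabilities are illustrative assumptions; the per-step loss is the usual negative log-likelihood of the target plus `alpha` times the sum of `-log(1 - p(c))` over the previous-context candidates.

```python
import math

def prev_context_candidates(tokens, t):
    """Tokens seen before step t (0-based), excluding the target at step t."""
    return set(tokens[:t]) - {tokens[t]}

def unlikelihood_loss(step_probs, candidates):
    """Sum of -log(1 - p(c)) over the negative candidates."""
    return -sum(math.log(1.0 - step_probs.get(c, 0.0)) for c in candidates)

def token_level_loss(step_probs, target, candidates, alpha=1.0):
    """Likelihood term on the target plus weighted unlikelihood term."""
    return alpha * unlikelihood_loss(step_probs, candidates) - math.log(step_probs[target])

# Usage with a hypothetical sequence and next-token distribution.
tokens = ["the", "cat", "saw", "the", "dog"]
candidates = prev_context_candidates(tokens, 4)
step_probs = {"the": 0.5, "cat": 0.2, "saw": 0.1, "dog": 0.2}
loss = token_level_loss(step_probs, "dog", candidates, alpha=1.0)
```

Here the model puts half of its mass on "the", a token that already appears twice in the context, so the unlikelihood term contributes `-log(0.5)` for that candidate alone. Note that at step 3, where the target itself is "the", that token is removed from the candidate set, so correct repeats are not penalized.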

Gradient analysis

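We assume $p_\theta(x_t \mid x_{<t}) = \text{softmax}(a)$ and consider the gradient of (4) with respect to the softmax input $a \in \mathbb{R}^{V}$. With a single negative candidate, the (negative) gradient is:

$$\nabla \mathcal{L}_a = x^* - m \odot p, \qquad m_i = \begin{cases} 1 - \alpha \frac{p_{\text{neg}}}{1 - p_{\text{neg}}} & \text{if } i \neq i_{\text{neg}} \\ 1 + \alpha & \text{if } i = i_{\text{neg}}, \end{cases} \tag{6}$$

where $x^* \in \{0, 1\}^{V}$ is a one-hot ground-truth vector, $m \in \mathbb{R}^{V}$, $p = p_\theta(\cdot \mid x_{<t})$, and $p_{\text{neg}}$ is the probability of the negative candidate at index $i_{\text{neg}}$ (derivation in Appendix A).

This unlikelihood gradient (6) differs from the likelihood gradient, $(x^* - p)$, due to the term $m$ which varies based on the hyper-parameter $\alpha$ and the model’s negative candidate probability, $p_{\text{neg}}$. At the ground-truth token index $i^*$, the unlikelihood gradient is positive, increasing the ground-truth token’s probability with a magnitude that grows with $p_{\text{neg}}$. Conversely, at the negative candidate index $i_{\text{neg}}$ the gradient is negative. At all other token indices $i \notin \{i^*, i_{\text{neg}}\}$, the gradient moves from negative to positive as $p_{\text{neg}}$ increases. For instance, with $\alpha = 1.0$ the gradient increases the probability of each token $x_i$ when the model assigns high probability to the negative candidate ($p_{\text{neg}} > 0.5$).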
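
The closed form can be checked numerically. The sketch below, with hypothetical logits and function names, compares the analytic expression (a one-hot ground-truth vector minus the element-wise product of the mask and the softmax output, where the mask is `1 + alpha` at the negative candidate and `1 - alpha * p_neg / (1 - p_neg)` elsewhere) against central finite differences of the negative single-candidate loss `log p(true) + alpha * log(1 - p(neg))`.

```python
import math

def softmax(a):
    """Numerically stable softmax over a list of logits."""
    top = max(a)
    exps = [math.exp(v - top) for v in a]
    z = sum(exps)
    return [e / z for e in exps]

def neg_loss(a, i_true, i_neg, alpha):
    """Negative token-level loss with a single negative candidate."""
    p = softmax(a)
    return math.log(p[i_true]) + alpha * math.log(1.0 - p[i_neg])

def analytic_gradient(a, i_true, i_neg, alpha):
    """Closed form: one-hot(i_true) - m * p, element-wise."""
    p = softmax(a)
    p_neg = p[i_neg]
    grad = []
    for i, p_i in enumerate(p):
        m_i = (1.0 + alpha) if i == i_neg else (1.0 - alpha * p_neg / (1.0 - p_neg))
        x_star = 1.0 if i == i_true else 0.0
        grad.append(x_star - m_i * p_i)
    return grad

def numeric_gradient(a, i_true, i_neg, alpha, eps=1e-6):
    """Central finite differences of neg_loss with respect to each logit."""
    grad = []
    for i in range(len(a)):
        up, down = list(a), list(a)
        up[i] += eps
        down[i] -= eps
        diff = neg_loss(up, i_true, i_neg, alpha) - neg_loss(down, i_true, i_neg, alpha)
        grad.append(diff / (2.0 * eps))
    return grad

# Hypothetical logits: the negative candidate (index 0) has probability above 0.5.
logits = [2.0, 0.5, -1.0, 0.0]
g = analytic_gradient(logits, i_true=1, i_neg=0, alpha=1.0)
```

With these logits the negative candidate holds roughly 0.71 of the mass, so the entry of `g` at index 0 is negative while the entries at the ground-truth index and at the two remaining indices are all positive, matching the behaviour described above for `alpha = 1.0`.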

5.2 Sequence-Level Unlikelihood Training

While the token-level unlikelihood objective efficiently augments maximum likelihood training with token-level penalties, it is limited to prefixes drawn from the training distribution. The resulting distribution mismatch between training sequences and generated sequences is a known issue with maximum-likelihood training, motivating objectives that operate on model-generated sequences (daume2009search; ross2011reduction; ranzato2015sequence; yu2016seqgan).

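We thus propose a sequence-level unlikelihood objective which uses unlikelihood on decoded continuations. That is, given a prefix $(x_1, \ldots, x_k) \sim p_*$, we decode a continuation $(x_{k+1}, \ldots, x_{k+N}) \sim p_\theta(\cdot \mid x_1, \ldots, x_k)$, construct per-step negative candidate sets $(\mathcal{C}^{k+1}, \ldots, \mathcal{C}^{k+N})$, and define each per-step sequence-level loss for $t \in \{k+1, \ldots, k+N\}$ as:

$$\mathcal{L}^t_{\text{ULS}}(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t) = -\sum_{c \in \mathcal{C}^t} \log(1 - p_\theta(c \mid x_{<t})). \tag{7}$$

Intuitively, the negative candidates can identify problematic tokens for the loss to penalize. We choose to penalize repeating n-grams in the continuation:

$$\mathcal{C}^t_{\text{repeat-}n} = \{x_t\} \text{ if } (x_{t-i}, \ldots, x_t, \ldots, x_{t+j}) \in x_{<t-i} \text{ for any } (j - i) = n,\ i \leq n \leq j, \text{ and } \emptyset \text{ otherwise,} \tag{8}$$

which says that $x_t$ is the (single) negative candidate for step $t$ if it is part of a repeating n-gram. (Footnote 1: An alternative we tried is to choose a penalization probability $p_{\text{penalize}}$, and use $x_t$ as the single negative candidate for time $t$ when $z_t \sim \text{Bernoulli}(p_{\text{penalize}})$ is 1, and no negative candidate for time $t$ otherwise; this approach was effective but under-performed the $\mathcal{C}_{\text{repeat-}n}$ candidates; see Appendix D.)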
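
One way to compute repeating n-gram candidates is sketched below. It is an illustrative reading rather than the paper's implementation: the function name is hypothetical, and a position is flagged whenever it lies inside an n-gram that has already occurred earlier in the same token list (whether the prefix is included in that list is left to the caller).

```python
def repeat_ngram_candidates(tokens, n):
    """Per-step negative candidate sets.

    Step t gets {tokens[t]} if tokens[t] lies inside an n-gram that already
    occurred earlier in the sequence, and the empty set otherwise.
    """
    seen = set()
    in_repeat = [False] * len(tokens)
    for start in range(len(tokens) - n + 1):
        ngram = tuple(tokens[start:start + n])
        if ngram in seen:
            for t in range(start, start + n):
                in_repeat[t] = True
        seen.add(ngram)
    return [{tok} if flag else set() for tok, flag in zip(tokens, in_repeat)]

# Usage on a short hypothetical continuation with one repeated 4-gram.
continuation = "we are going to crash . we are going to win".split()
candidate_sets = repeat_ngram_candidates(continuation, 4)
```

In this example only the second occurrence of "we are going to" is flagged, so the sequence-level loss would push down the probability of those four tokens at those four steps and leave every other step untouched.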

In our experiments we apply this sequence loss in two ways: (i) using it to fine-tune a standard MLE baseline; and (ii) using it to fine-tune an unlikelihood model trained at the token level, $\mathcal{L}_{\text{UL-token}}$. We refer to the former as $\mathcal{L}_{\text{UL-seq}}$ and the latter as $\mathcal{L}_{\text{UL-token+seq}}$. In both cases, fine-tuning is done by equally mixing sequence-level unlikelihood updates (7) and the token-level loss from which it was initially trained (either likelihood updates (1) or token-level unlikelihood updates (4)).


Any objective that requires explicitly decoding a sequence is constrained by sample efficiency when decoding is slow; if sample efficiency is low, the total decoding time is too large for practical use. In our experiments we show that when used for fine-tuning, the sequence-level unlikelihood objective substantially reduced degeneration in under 1,500 updates, rendering it practical for modern large-scale neural models, even with high decoding costs.

6 Experiments

We follow a standard language modeling setup from baevski2018adaptive and evaluate our method on the task of sequence completion, detailed below. (Footnote 2: Our code is publicly available and is implemented with Fairseq (ott2019fairseq).)

Model Architecture

Recent large-scale language models are based on the Transformer architecture, a multi-layer feed-forward network with self-attention (vaswani2017attention). We use a 16-layer Transformer with 8 attention heads, embedding dimension 1024, and fully-connected dimension 4096; the architecture is based on baevski2018adaptive but with standard embedding and softmax layers. Our proposed method is architecture agnostic; we choose this one as a representative of recent large-scale language models, e.g. radford2019language.


We use the Wikitext-103 dataset (merity2016pointer), a large-scale collection of Wikipedia articles containing over 100 million words and 260 thousand unique tokens. As a document-level dataset, Wikitext-103 is an open-source representative of recent datasets used for large-scale language modeling (baevski2018adaptive; radford2019language). We perform experiments at the word level.


We train on fixed-length contiguous sequences, in our case of length 1,536, which was selected based on GPU memory constraints. For the token-level losses ($\mathcal{L}_{\text{MLE}}$, $\mathcal{L}_{\text{UL-token}}$), we train each model on 8 GPUs for a maximum of 150k updates, evaluating on the validation set and saving the model state every 10k updates. For the experiments below, we select the saved model state with the best validation perplexity.

Sequence-level fine-tuning begins with the model state selected based on the validation perplexity. Models are fine-tuned for 1,500 total updates. With probability 0.5 an update uses $\mathcal{L}_{\text{ULS}}$ and otherwise uses the token-level loss with which the model was trained. For an $\mathcal{L}_{\text{ULS}}$ update, we split each training sequence and greedily decode continuations (details below). The experiments use a fixed prefix length $k$ and continuation length $N$ for fine-tuning.


We evaluate a model on sequence completion by using the model to decode continuations of prefixes derived from the validation (or test) set. Specifically, the validation (or test) set is first partitioned into sequences of 1,536 tokens, as in training. Then we split each sequence into a batch of prefixes of length $k$ (discarding extra tokens), and decode a continuation of length $N$ for each prefix. The experiments below use fixed values of $k$ and $N$ for evaluation. For deterministic decoding we use greedy search and beam search with beam size 10, and for stochastic decoding we use top-k sampling and nucleus sampling.
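
The prefix construction can be sketched as below. The function name and the example prefix length are hypothetical, and the sketch only covers the splitting step (cutting one evaluation sequence into consecutive, non-overlapping prefixes and dropping the remainder), not decoding.

```python
def split_into_prefixes(sequence, prefix_len):
    """Cut a token sequence into consecutive prefixes of length prefix_len.

    Tokens left over after the last full prefix are discarded.
    """
    n_prefixes = len(sequence) // prefix_len
    return [sequence[i * prefix_len:(i + 1) * prefix_len] for i in range(n_prefixes)]

# Usage: a 1,536-token sequence and an assumed prefix length of 50.
sequence = list(range(1536))
prefixes = split_into_prefixes(sequence, 50)
```

With these illustrative numbers, one 1,536-token sequence yields 30 prefixes and the final 36 tokens are dropped; each prefix would then be passed to the decoder to produce one continuation.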

| Source | Text | seq-rep-4 |
| Prefix | … Lyrically the song has excerpts of different languages including French , Spanish | |
| $\mathcal{L}_{\text{MLE}}$ | , Italian , Spanish , Italian , Spanish , Italian , Spanish , Spanish , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Spanish , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese | 0.744 |
| $\mathcal{L}_{\text{UL-token+seq}}$ | , and German . In the first verse , the protagonist sings about being a “ girl who ’s been in love with someone else ” , while the second verse describes the relationship between the protagonist and her lover . In the third verse , the protagonist sings | 0.063 |
| Prefix | … starboard engines and was going to crash . “ We ’re going in ,” | |
| $\mathcal{L}_{\text{MLE}}$ | he said . “ We ’re going to crash . We ’re going to crash . We ’re going to crash . We ’re going to crash . We ’re going to crash . We ’re going to crash . We ’re going to crash . We ’re going to | 0.787 |
| $\mathcal{L}_{\text{UL-token+seq}}$ | Hood said . “ I ’m going to make sure we ’re going to get back to the water . ” The order to abandon ship was given by Admiral Beatty , who ordered the remaining two battlecruisers to turn away . At 18 : 25 , Hood turned his | 0.000 |
| Prefix | … career - high 27 points on 8 - for - 11 shooting with three rebounds | |
| $\mathcal{L}_{\text{MLE}}$ | and two assists . On January 3 , 2012 , he was named to the 2012 – 13 All - Atlantic 10 first team . On February 3 , 2012 , he was named to the Atlantic 10 first team . On February 5 , 2012 , he was named | 0.277 |
| $\mathcal{L}_{\text{UL-token+seq}}$ | and a career - high 7 assists against the Minnesota Timberwolves . On February 3 , 2012 , he was named to the 2012 All - NBA First Team . On March 7 , 2012 , he was named one of five finalists for the Naismith Award , which is | 0.064 |

Table 1: Example greedy completions showing representative examples of the MLE model’s degenerate single-token repetition (top), phrase-level repetition (middle), and ‘structural’ repetition (bottom), as well as the proposed method’s ability to fix these degenerate behaviors.

6.1 Evaluation Metrics


As a token-level metric for repetition, we use the fraction of next-token (top-1) predictions that occur in the previous $\ell$ tokens (rep/$\ell$); given a set $\mathcal{D}$ of length-$T$ sequences,

$$\text{rep}/\ell = \frac{1}{|\mathcal{D}| T} \sum_{x \in \mathcal{D}} \sum_{t=1}^{T} \mathbb{1}\left[\arg\max p_\theta(x \mid x_{<t}) \in x_{t-\ell:t-1}\right].$$

A predicted token is called a “single-token repeat” when $\mathbb{1}[\cdot]$ is 1. Some of these single-token repeats also occur in the human-generated sequences, and we thus report a variant which only counts single-token repeats that are additionally not equal to the ground-truth next-token (wrep/$\ell$).
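
A minimal sketch of the two token-level repetition metrics for a single sequence, assuming the model's top-1 predictions have already been computed under teacher forcing. The function name and input format are hypothetical; averaging over a dataset would pool the counts across sequences.

```python
def rep_metrics(predictions, targets, window):
    """Fraction of top-1 predictions that repeat a recent ground-truth token.

    predictions[t] is the model's top-1 prediction for position t given
    targets[:t]. Returns (rep, wrep), where wrep counts only repeats that
    also differ from the ground-truth token at position t.
    """
    rep = wrep = 0
    for t, pred in enumerate(predictions):
        context = targets[max(0, t - window):t]
        if pred in context:
            rep += 1
            if pred != targets[t]:
                wrep += 1
    total = len(predictions)
    return rep / total, wrep / total

# Usage on a hypothetical four-token sequence with a window of 2.
targets = ["a", "b", "a", "c"]
predictions = ["a", "a", "a", "b"]
rep, wrep = rep_metrics(predictions, targets, window=2)
```

Here three of the four predictions repeat a token from the two-token window, but one of them is a correct repeat (the ground truth at that position really is "a"), so `rep` is 0.75 and `wrep` is 0.5.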

We use the portion of duplicate n-grams (seq-rep-n) in a generated sequence to measure sequence-level repetition. That is, for a continuation $x_{k+1:k+N}$ we compute,

$$\text{seq-rep-}n = 1.0 - \frac{|\text{unique } n\text{-grams}(x_{k+1:k+N})|}{|n\text{-grams}|},$$

and average over continuations. seq-rep-n is zero when the continuation has no repeating n-grams, and increases towards 1.0 as the model repeats. We compute seq-rep-n on the continuation.
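
The sequence-level metric is straightforward to compute: one minus the ratio of distinct n-grams to all n-grams in the continuation. The sketch below uses a hypothetical function name and returns 0 for continuations shorter than `n`, which is a convention chosen here rather than one stated in the text.

```python
def seq_rep_n(tokens, n):
    """1 - (number of distinct n-grams) / (number of n-grams)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # convention for continuations shorter than n
    return 1.0 - len(set(ngrams)) / len(ngrams)

def mean_seq_rep_n(continuations, n):
    """Average of seq_rep_n over a list of continuations."""
    return sum(seq_rep_n(c, n) for c in continuations) / len(continuations)
```

A continuation with no repeated bigram scores 0 for `n=2`, while five copies of the same token contain four bigrams of which only one is distinct, giving 0.75.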

Token Distribution

We quantify a model’s predicted token distribution using the number of unique tokens. As a token-level metric (uniq), we use the number of unique next-token predictions on a validation or test set $\mathcal{D}$, i.e. $|\{\arg\max p(x_t \mid x_{<t}) \mid x_{<t} \in \mathcal{D}\}|$. As a sequence-level metric (uniq-seq) we use the number of unique tokens in continuations of validation or test prefixes (§6).

Language Modeling Quality

We use perplexity (ppl), and next-token prediction accuracy (acc), defined as $\frac{1}{N} |\{\arg\max p(x_t \mid x_{<t}) = x_t^* \mid x_{<t} \in \mathcal{D}\}|$, with $N$ prefixes $x_{<t}$ and true next tokens $x_t^*$.

6.2 Results

Token-level and sequence-level results on the test set are in Table 2 (valid set in Appendix Table 5).


The baseline model trained with maximum likelihood ($\mathcal{L}_{\text{MLE}}$) achieved 25.64 test perplexity, comparable to a current state-of-the-art system (baevski2018adaptive) (24.92). However, the greedy baseline’s seq-level repeats (seq-rep-4 .442) and single-token repeats (rep .627) far exceed those in human text (.006, .487 respectively). The baseline continuations have far fewer unique tokens than human text (uniq-seq 10.8k vs 19.8k), with a high rate of frequent tokens (Figure 1).

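| Model | search | seq-rep-4 | uniq-seq | ppl | acc | rep | wrep | uniq |
| $\mathcal{L}_{\text{MLE}}$ | greedy | .442 | 10.8k | 25.64 | .395 | .627 | .352 | 11.8k |
| | beam | .523 | 9.5k | | | | | |
| $\mathcal{L}_{\text{UL-token}}$ | greedy | .283 | 13.2k | 26.91 | .390 | .577 | .311 | 12.7k |
| | beam | .336 | 11.7k | | | | | |
| $\mathcal{L}_{\text{UL-seq}}$ | greedy | .137 | 13.1k | 25.42 | .399 | .609 | .335 | 12.8k |
| | beam | .019 | 18.3k | | | | | |
| $\mathcal{L}_{\text{UL-token+seq}}$ | greedy | .058 | 15.4k | 26.72 | .395 | .559 | .293 | 13.8k |
| | beam | .013 | 19.1k | | | | | |
| Human | - | .006 | 19.8k | - | - | .487 | - | 19.8k |

Table 2: Results for token-level objectives (upper) and sequence-level fine-tuning (lower) according to sequence-level (left) and token-level (right) metrics using the test subset of Wikitext-103.
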
Token-Level Objective

The proposed token-level unlikelihood objective ($\mathcal{L}_{\text{UL-token}}$) reduced next-token wrong repetition (wrep .311 vs. .352) while increasing the number of unique next-tokens (uniq 12.7k vs. 11.8k) compared to the baseline ($\mathcal{L}_{\text{MLE}}$). Perplexity and accuracy were similar.

Importantly, the token-level unlikelihood objective yielded substantial improvements in sequence-level generations. With greedy search, token-level unlikelihood training improved the 4-gram repetition in continuations by 36% (seq-rep-4 .283 vs. .442) while generating roughly 22% more unique tokens than the baseline (uniq-seq 13.2k vs. 10.8k), and a more favorable rate of infrequent tokens (Figure 1). With beam search, unlikelihood training showed similar improvements over the baseline.

Sequence-Level Objective

The sequence level fine-tuning ($\mathcal{L}_{\text{UL-token+seq}}$) yielded further improvements, with a 97% reduction in 4-gram repetitions (seq-rep-4 .013 vs. .442) from the baseline level (greedy $\mathcal{L}_{\text{MLE}}$), and 77% more unique tokens (uniq-seq 19.1k vs. 10.8k) with beam search.

Compared to the token-level unlikelihood model ($\mathcal{L}_{\text{UL-token}}$) which was the starting point of fine-tuning, the fine-tuned model’s repetition substantially improved (seq-rep-4 .058 vs. .283), unique tokens increased (uniq-seq 15.4k vs. 13.2k), and token-level metrics such as perplexity improved (ppl 26.72 vs. 26.91), despite using only 1,500 updates. The token distribution improved, with infrequent tokens produced more often than the baseline, and frequent tokens approaching the human level (Figure 1). Finally, after sequence-level fine-tuning, beam search out-performed greedy search.

To visualize how these improvements in metrics translate to generation quality, Table 1 shows greedy completions that characterize the baseline’s degeneration and $\mathcal{L}_{\text{UL-token+seq}}$’s improved behavior.

GPT-2 Fine-Tuning

In the preceding experiment, sequence-level fine-tuning alone ($\mathcal{L}_{\text{UL-seq}}$) showed substantial improvements over the baseline using a small number of updates. This indicates that the proposed sequence-level fine-tuning can be a cheap, effective way to improve existing pre-trained language models. We demonstrate this by fine-tuning a pre-trained GPT-2 (radford2019language) language model with sequence-level unlikelihood, using a comparable experimental setup to §6 (details in Appendix C). Fine-tuning with unlikelihood yielded similar improvements in sequence-level repetition (seq-rep-4 .042 vs. .506) to those observed in Table 5, while maintaining language modeling quality according to perplexity and accuracy (see Appendix Table 7).

Stochastic Decoding

Although we have focused on deterministic decoding, we also confirm that a model trained with the proposed unlikelihood objectives may still be used with stochastic decoders. Appendix Table 6 shows metrics for completions generated with top-k sampling (fan2018hierarchical) and nucleus sampling (holtzman2019curious). Models trained with unlikelihood objectives maintain language modeling quality compared to the baseline, but with improvements in repetition.

Human Evaluation

We perform a crowdworker evaluation to judge the quality of the generations of our proposed models compared to each other, the baseline, two other generation methods, and the reference. We employ a pairwise setup: an evaluator is presented with a prefix and shown continuations from two different models and asked to select which continuation they found more natural. Following li2019acute, we filter workers using quality controls (detailed in Appendix E) and limit the number of annotations that they may complete. Prompts are from the Wikitext-103 test set. All models used beam search (beam size 10) for generation, except for those that use stochastic decoding. We report the win rates for each pairwise comparison.

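| Winner | Loser | Crowdworkers win rate | Experts win rate |
| $\mathcal{L}_{\text{UL-token}}$ | $\mathcal{L}_{\text{MLE}}$ baseline | 57% | |
| $\mathcal{L}_{\text{UL-seq}}$ | $\mathcal{L}_{\text{MLE}}$ baseline | *71% | |
| $\mathcal{L}_{\text{UL-token+seq}}$ | $\mathcal{L}_{\text{MLE}}$ baseline | *82% | |
| $\mathcal{L}_{\text{UL-token+seq}}$ | $\mathcal{L}_{\text{MLE}}$ nucleus sampling | 59% | *83% |
| $\mathcal{L}_{\text{UL-token+seq}}$ | $\mathcal{L}_{\text{MLE}}$ beam blocking (4-gram) | 60% | *74% |

Table 3: Human eval results. * denotes statistical significance (2-sided binomial test).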
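
For readers unfamiliar with the significance test named in the caption, the following is a minimal sketch of an exact two-sided binomial test against a 50% win rate, written with the standard library only. The number of annotations per comparison is not stated in this section, so the trial count of 100 used below is purely a hypothetical illustration and the resulting p-values should not be read as the paper's.

```python
from math import comb

def binomial_two_sided_p(wins, trials, p0=0.5):
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are no more likely than the
    observed number of wins under the null win probability p0.
    """
    pmf = [comb(trials, k) * p0 ** k * (1.0 - p0) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[wins]
    tail = sum(q for q in pmf if q <= observed * (1.0 + 1e-9))
    return min(1.0, tail)

# Hypothetical example: 100 pairwise judgments per comparison (an assumption).
p_57 = binomial_two_sided_p(57, 100)
p_82 = binomial_two_sided_p(82, 100)
```

Under that assumed sample size, a 57% win rate would not be distinguishable from chance at conventional thresholds, whereas an 82% win rate would be, which illustrates why some rows in such a table can carry a significance marker and others not.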

The main results are presented in Table 3, with additional experiments in Appendix Table 9. We find that all proposed models are preferred over the baseline, and that congruent with automatic metrics, win rates improve after adding the sequence level objective. Our best model also outperforms the baseline used with either nucleus sampling or beam blocking.

We also collected limited annotations from other NLP researchers. These Expert annotators were given the same UI as the crowdworkers, and were not told which models they were evaluating, but all annotators were familiar with language models. As shown in Table 3, the $\mathcal{L}_{\text{UL-token+seq}}$ model significantly outperforms both nucleus sampling and beam blocking according to the experts.

7 Conclusion

We described unlikelihood training, an approach to training neural language models. We observed that state-of-the-art models trained to maximize likelihood exhibit neural text degeneration, which we characterized and quantified in terms of repetition and token distribution mismatch. Our results show that the likelihood objective is not constrained enough, in the sense that two models with the same perplexity can exhibit wildly different generation performance. We empirically showed that unlikelihood training, at both the token and sequence levels, substantially reduced degeneration according to automatic metrics, and outperformed likelihood-trained models with various decoding methods according to human evaluation, being superior to the current state-of-the-art approaches.


Prefix Lyrically the song has excerpts of different languages including French , Spanish ,
GPT2-greedy German , Italian , Portuguese , Spanish , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , Portuguese , …
GPT2-top3 German , Italian , Portuguese , Spanish , Portuguese , Portuguese, Italian and Spanish. It is also available in English, French and Spanish. In addition to its lyrics, the album features a number of original songs, …
GPT2-nucleus-0.3 German , Italian , Portuguese , Spanish , Portuguese , Italian , Portuguese , Spanish , Portuguese , Spanish , Portuguese , Spanish , Portuguese , Spanish , Portuguese , Spanish , Portuguese , Spanish , Portuguese , Spanish , Portuguese , …
Prefix The first pair of pereiopods is armed with a large , asymm
GPT2-greedy etrical, and long-range laser cannon. The second pair is armed with a large , asymmetrical, and long-range laser cannon. The third pair is armed with a large , asymmetrical, and long-range laser cannon. The fourth pair is armed with a large …
GPT2-top3 etrical, and highly mobile head, and the second pair has a small , asymmetrical, and highly mobile head. The second pair has a large and highly mobile head, and the third pair is armed with a large and highly mobile head. The first pair …
GPT2-nucleus-0.3 etrical head and a large body. The first pair of pereiopods is armed with a large , asymmetrical head and a large body. The first pair of pereiopods is armed with a large , asymmetrical head and a large body. The first pair of pereiopods is armed …
Table 4: Top: Degenerate repetition in completions from a state-of-the-art large-scale language model (radford2019language). The examples contain single-word repetitions, phrase-level repetitions, and structural repetitions where some tokens within a repeating phrase vary. Recently proposed stochastic samplers (top-k, nucleus) exhibit degeneration depending on hyper-parameter settings.
Figure 1: Sequence-level token distribution using the test subset of Wikitext-103. Panels: (a) different combinations of unlikelihood; (b) unlikelihood vs. stochastic decoding. Nucleus sampling and beam blocking are used with the maximum likelihood baseline.

Appendix A Gradient


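Let $x^*_t$ be the true next-token (index $i^* \in \mathcal{V}$) at step $t$, and let $x_{\text{neg}}$ be a negative candidate (index $i_{\text{neg}}$). Let $p = p(x_t \mid x_{<t}) \in \mathbb{R}^{|\mathcal{V}|}$ be the output of $\text{softmax}(a)$ where $a \in \mathbb{R}^{|\mathcal{V}|}$.

Denote the probability of an element $i \in \{1, \ldots, |\mathcal{V}|\}$ as $p_i = p(x_t^i \mid x_{<t})$, and let $p_*$, $p_{\text{neg}}$, and $\tilde{p}_i$ be probabilities of the true next-token, negative-candidate token, and any other token with $i \notin \{i^*, i_{\text{neg}}\}$.

A.1 Derivation

The (negative) token-level loss with a single candidate is,

$$\mathcal{L}_t = \log p(x^*_t \mid x_{<t}) + \alpha \cdot \log\left(1 - p(x_{\text{neg}} \mid x_{<t})\right),$$

and its gradient with respect to a logit $a_i$ is:

$$\frac{\partial \mathcal{L}_t}{\partial a_i} = \frac{\partial \mathcal{L}_t}{\partial p_*}\frac{\partial p_*}{\partial a_i} + \frac{\partial \mathcal{L}_t}{\partial p_{\text{neg}}}\frac{\partial p_{\text{neg}}}{\partial a_i} = \left(\mathbb{I}[i = i^*] - p_i\right) - \alpha \frac{p_{\text{neg}}}{1 - p_{\text{neg}}}\left(\mathbb{I}[i = i_{\text{neg}}] - p_i\right).$$

We consider the gradient when $i$ is the true next-token, a negative-candidate, and any other token.

True Next-Token ($i = i^*$)

$$\frac{\partial \mathcal{L}_t}{\partial a_{i^*}} = (1 - p_*) - \alpha \frac{p_{\text{neg}}}{1 - p_{\text{neg}}}(0 - p_*) = 1 - p_*\left(1 - \alpha \frac{p_{\text{neg}}}{1 - p_{\text{neg}}}\right)$$

Negative Candidate ($i = i_{\text{neg}}$)

$$\frac{\partial \mathcal{L}_t}{\partial a_{i_{\text{neg}}}} = (0 - p_{\text{neg}}) - \alpha \frac{p_{\text{neg}}}{1 - p_{\text{neg}}}(1 - p_{\text{neg}}) = -p_{\text{neg}}(1 + \alpha)$$

Other Token ($i \notin \{i^*, i_{\text{neg}}\}$)

$$\frac{\partial \mathcal{L}_t}{\partial a_i} = (0 - \tilde{p}_i) - \alpha \frac{p_{\text{neg}}}{1 - p_{\text{neg}}}(0 - \tilde{p}_i) = -\tilde{p}_i\left(1 - \alpha \frac{p_{\text{neg}}}{1 - p_{\text{neg}}}\right)$$

Combining the three cases above, we get:

$$\nabla_a \mathcal{L}_t = x^* - m \odot p,$$

where $x^* \in \{0, 1\}^{|\mathcal{V}|}$ is 1 at index $i^*$ and 0 otherwise, and $m \in \mathbb{R}^{|\mathcal{V}|}$ is:

$$m_i = \begin{cases} 1 - \alpha \dfrac{p_{\text{neg}}}{1 - p_{\text{neg}}} & i \neq i_{\text{neg}} \\ 1 + \alpha & i = i_{\text{neg}} \end{cases}$$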
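The closed form can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the single-candidate objective is the log-probability of the true token plus alpha times the log of one minus the negative candidate's probability, with probabilities given by a softmax over logits. The function names and the toy logits are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    z = sum(exps)
    return [e / z for e in exps]

def objective(logits, i_true, i_neg, alpha):
    # (Negative) token-level loss with a single negative candidate.
    p = softmax(logits)
    return math.log(p[i_true]) + alpha * math.log(1.0 - p[i_neg])

def closed_form_grad(logits, i_true, i_neg, alpha):
    # Gradient w.r.t. the logits in the form x* - m (elementwise) p.
    p = softmax(logits)
    ratio = alpha * p[i_neg] / (1.0 - p[i_neg])
    grad = []
    for i, p_i in enumerate(p):
        m_i = (1.0 + alpha) if i == i_neg else (1.0 - ratio)
        x_i = 1.0 if i == i_true else 0.0
        grad.append(x_i - m_i * p_i)
    return grad

def finite_diff_grad(logits, i_true, i_neg, alpha, eps=1e-6):
    # Central finite differences, one logit at a time.
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = objective(up, i_true, i_neg, alpha) - objective(down, i_true, i_neg, alpha)
        grad.append(diff / (2.0 * eps))
    return grad

# Toy example (illustrative values only).
logits = [1.2, -0.3, 0.5, 2.0, -1.0]
analytic = closed_form_grad(logits, i_true=0, i_neg=3, alpha=1.0)
numeric = finite_diff_grad(logits, i_true=0, i_neg=3, alpha=1.0)
max_abs_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

In this sketch the entry for the negative candidate is always negative, since it equals the negative candidate's probability scaled by minus (1 + alpha), so following the gradient lowers that token's logit.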

Multiple Candidates

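In general the objective considers multiple candidates (see section 5):

$$\mathcal{L}^t_{\text{UL-token}}\left(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t\right) = -\alpha \cdot \sum_{c \in \mathcal{C}^t} \log\left(1 - p_\theta(c \mid x_{<t})\right) - \log p_\theta(x_t \mid x_{<t}).$$

We regroup the token-level objective to be a weighted sum of per-candidate objectives:

$$-\mathcal{L}^t_{\text{UL-token}}\left(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t\right) = \frac{1}{|\mathcal{C}^t|} \sum_{c \in \mathcal{C}^t} \left(\log p_\theta(x_t \mid x_{<t}) + \alpha_c \cdot \log\left(1 - p_\theta(c \mid x_{<t})\right)\right),$$

where $\alpha_c = \alpha \cdot |\mathcal{C}^t|$.

Now the gradient can be generalized to multiple candidates, in which case the gradient takes the same form as Eqn. 20, but with $\alpha_c$ in place of $\alpha$.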
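The regrouping can be illustrated with a short sketch. It assumes the token-level objective is the usual negative log-likelihood plus alpha times the sum, over candidates, of the negative log of one minus each candidate's probability, and that the candidate set is non-empty. The function names and the toy distribution are hypothetical.

```python
import math

def ul_token_loss(probs, target, candidates, alpha):
    # Likelihood term on the target plus unlikelihood terms on the candidates.
    nll = -math.log(probs[target])
    penalty = -sum(math.log(1.0 - probs[c]) for c in candidates)
    return nll + alpha * penalty

def ul_token_loss_regrouped(probs, target, candidates, alpha):
    # Same loss written as an average of single-candidate objectives,
    # each using the rescaled weight alpha_c = alpha * |C|.
    # Requires at least one candidate.
    alpha_c = alpha * len(candidates)
    total = sum(
        math.log(probs[target]) + alpha_c * math.log(1.0 - probs[c])
        for c in candidates
    )
    return -total / len(candidates)

# Toy next-token distribution over a 4-token vocabulary (illustrative values).
probs = [0.5, 0.2, 0.2, 0.1]
direct = ul_token_loss(probs, target=0, candidates=[1, 3], alpha=1.0)
regrouped = ul_token_loss_regrouped(probs, target=0, candidates=[1, 3], alpha=1.0)
```

Both forms give the same value, which is why the single-candidate gradient carries over once the weight is rescaled by the number of candidates.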

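| Model | search | seq-rep-4 | uniq-seq | ppl | acc | rep | wrep | uniq |
|---|---|---|---|---|---|---|---|---|
| $\mathcal{L}_{\text{MLE}}$ | greedy | .429 | 10.6k | 24.59 | .401 | .619 | .346 | 11.6k |
| | beam | .495 | 9.4k | | | | | |
| $\mathcal{L}_{\text{UL-token}}$ | greedy | .274 | 12.6k | 25.62 | .396 | .569 | .305 | 12.5k |
| | beam | .327 | 11.2k | | | | | |
| $\mathcal{L}_{\text{UL-seq}}$ | greedy | .130 | 12.7k | 24.28 | .406 | .603 | .329 | 12.4k |
| | beam | .018 | 16.8k | | | | | |
| $\mathcal{L}_{\text{UL-token+seq}}$ | greedy | .051 | 14.8k | 25.37 | .401 | .551 | .287 | 13.4k |
| | beam | .013 | 17.6k | | | | | |
| Human | - | .005 | 18.9k | - | - | .479 | - | 18.9k |

Table 5: Results for token-level objectives (upper) and sequence-level fine-tuning (lower) according to sequence-level (left) and token-level (right) metrics using the validation subset of Wikitext-103.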
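| Search | Model | seq-rep-4 | uniq-seq | ppl | acc | rep | wrep | uniq |
|---|---|---|---|---|---|---|---|---|
| top-k-3 | $\mathcal{L}_{\text{MLE}}$ | .0991 | 14.7k | 25.70 | .350 | .597 | .355 | 12.6k |
| | $\mathcal{L}_{\text{UL-token}}$ | .0491 | 16.4k | 27.02 | .344 | .539 | .306 | 13.6k |
| | $\mathcal{L}_{\text{UL-seq}}$ | .0068 | 17.9k | 25.11 | .353 | .581 | .341 | 13.6k |
| | $\mathcal{L}_{\text{UL-token+seq}}$ | .0087 | 15.2k | 26.84 | .347 | .524 | .292 | 14.6k |
| top-k-50 | $\mathcal{L}_{\text{MLE}}$ | .0165 | 21.9k | 25.70 | .302 | .511 | .303 | 16.1k |
| | $\mathcal{L}_{\text{UL-token}}$ | .006 | 23.5k | 27.02 | .286 | .440 | .247 | 17.8k |
| | $\mathcal{L}_{\text{UL-seq}}$ | .0005 | 25.7k | 25.11 | .291 | .497 | .291 | 17.3k |
| | $\mathcal{L}_{\text{UL-token+seq}}$ | .0009 | 23.7k | 26.84 | .289 | .430 | .238 | 18.8k |
| top-p-0.3 | $\mathcal{L}_{\text{MLE}}$ | .273 | 13.6k | 25.70 | .264 | .339 | .154 | 12.6k |
| | $\mathcal{L}_{\text{UL-token}}$ | .101 | 16.5k | 27.02 | .247 | .290 | .121 | 13.9k |
| | $\mathcal{L}_{\text{UL-seq}}$ | .0033 | 20.8k | 25.11 | .266 | .327 | .145 | 13.6k |
| | $\mathcal{L}_{\text{UL-token+seq}}$ | .0041 | 19.1k | 26.84 | .250 | .284 | .116 | 14.9k |
| top-p-0.9 | $\mathcal{L}_{\text{MLE}}$ | .0154 | 26.9k | 25.70 | .288 | .462 | .263 | 18.6k |
| | $\mathcal{L}_{\text{UL-token}}$ | .004 | 30.2k | 27.02 | .266 | .381 | .202 | 22.3k |
| | $\mathcal{L}_{\text{UL-seq}}$ | .0003 | 34.7k | 25.11 | .290 | .450 | .254 | 19.6k |
| | $\mathcal{L}_{\text{UL-token+seq}}$ | .0007 | 32.4k | 26.84 | .269 | .376 | .198 | 22.7k |
| Human | - | .006 | 19.8k | - | - | .487 | - | 19.8k |

Table 6: Stochastic decoding results according to sequence-level (left) and token-level (right) metrics using the test subset of Wikitext-103.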

Appendix B Stochastic Decoding Results

Table 6 provides automatic metrics for top-$k$ and nucleus sampling (called top-$p$) on the Wikitext-103 test set. These can be compared with the main results of the paper in Table 2. In general, sampling methods yield worse next-token predictions than deterministic approaches (0.302 vs. 0.394 acc for top-k-50 vs. greedy MLE, where acc for stochastic decoding measures the probability that the decoding strategy chooses the ground truth word given a ground truth context). As the choice of sampling hyperparameter gets closer to greedy (i.e. lower values of $k$ and $p$), next-token accuracy improves, eventually approaching the greedy MLE results. The unlikelihood-trained sampling models have similar next-token accuracy (acc) to their likelihood-trained counterparts, but exhibit fewer repetitions. For lower values of $p$ and $k$ the improvements of unlikelihood training are larger, e.g. 0.277 reduced to 0.0041 for 4-gram sequence repetitions (seq-rep-4) using top-p-0.3. At higher levels of $p$ and $k$, the continuations from all methods contain more unique tokens than human text, meaning those values may be too high.
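To make the accuracy measure for stochastic decoding concrete, here is a minimal sketch of top-k and nucleus (top-p) filtering on a single next-token distribution. It assumes the standard formulations: top-k keeps the k most probable tokens, and nucleus sampling keeps the smallest set of most probable tokens whose cumulative mass reaches p, with renormalization in both cases. The function names and the toy distribution are hypothetical.

```python
def renormalize(probs, keep):
    # Zero out tokens outside `keep` and rescale the rest to sum to one.
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    # Keep the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    # Keep the smallest prefix of the sorted tokens whose mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalize(probs, keep)

# Toy next-token distribution (illustrative values only).
probs = [0.5, 0.3, 0.15, 0.05]
greedy_like = top_k_filter(probs, 1)      # all mass on the most likely token
top2 = top_k_filter(probs, 2)
nucleus_low = top_p_filter(probs, 0.3)    # a small p also collapses to greedy here
nucleus_high = top_p_filter(probs, 0.9)   # keeps the three most likely tokens

# Probability that each strategy picks the ground-truth token, here index 0.
acc_top2 = top2[0]
acc_nucleus_high = nucleus_high[0]
```

Under this reading, the filtered probability of the ground-truth token is the per-step quantity that acc averages, and it approaches the greedy value as k or p shrinks.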

| Model | search | seq-rep-4 | ppl | acc | rep | wrep | uniq |
|---|---|---|---|---|---|---|---|
| GPT-2 | greedy | .506 | 20.75 | .430 | .589 | .306 | 13.3k |
| GPT-2$_{\text{MLE}}$ | greedy | .460 | 15.82 | .464 | .612 | .305 | 11.8k |
| GPT-2$_{\text{UL-seq}}$ | greedy | .042 | 18.49 | .444 | .613 | .317 | 11.3k |
| Human | - | .005 | - | - | .407 | - | 17.7k |

Table 7: GPT-2 results according to sequence-level and token-level metrics using the validation subset of Wikitext-103. seq-rep-4 is computed on the word level; ppl, acc, rep, wrep are computed on the BPE level.

Appendix C GPT-2 Fine-tuning

We evaluated the GPT-2 medium pre-trained model (‘GPT-2’) and two separate fine-tuning variants on Wikitext-103. The first variant (‘GPT-2$_{\text{MLE}}$’) was fine-tuned using maximum likelihood; we select the model state with the lowest validation perplexity. The second variant (‘GPT-2$_{\text{UL-seq}}$’) was fine-tuned using the sequence-level unlikelihood objective (§5.2). For both evaluation and sequence-level tuning, we used a prefix length of 50 BPE tokens and a continuation length of 100 BPE tokens. In order to train on a single GPU, we used a batch size of 1024 tokens for MLE updates, and 300 prefix tokens for unlikelihood updates. Due to the smaller batch size and single-GPU setting, we used 10,000 updates during sequence-level fine-tuning, comparable to the 1,500 updates in the main experiment (§6) in terms of the total number of tokens. Results are shown in Table 7.

Appendix D Sequence-level random candidates

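In Sec. 5.2 we described a way to penalize tokens that occurred in an n-gram repetition. One alternative is to penalize a random subset of the generated sequence. That is, given a continuation $x_{k+1}, \ldots, x_{k+N}$, we now define per-step candidates $(\mathcal{C}^{k+1}, \ldots, \mathcal{C}^{k+N})$ as:

$$\mathcal{C}^t = \begin{cases} \{x_t\} & \text{if } z_t = 1 \\ \emptyset & \text{if } z_t = 0 \end{cases}$$

for each $t \in \{k+1, \ldots, k+N\}$, where $z_t \sim \text{Bernoulli}(p_{\text{penalize}})$, and $p_{\text{penalize}} \in [0, 1]$ is a fixed hyper-parameter. Intuitively, these candidates identify random tokens in the generated sequence (hence ‘random-seq’), which are then penalized by the sequence-level loss (Eqn. 7).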
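The candidate selection can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each generated token is selected independently with a fixed probability, and a selected token becomes the sole negative candidate at its own step. The name `random_seq_candidates`, the parameter name `p_penalize`, and the toy continuation are hypothetical.

```python
import random

def random_seq_candidates(continuation, p_penalize, rng):
    # One candidate set per generated step: the token itself with
    # probability p_penalize, otherwise the empty set.
    candidates = []
    for token in continuation:
        if rng.random() < p_penalize:
            candidates.append({token})
        else:
            candidates.append(set())
    return candidates

# Toy continuation (illustrative only); a seeded generator keeps runs repeatable.
rng = random.Random(0)
continuation = ["the", "cat", "sat", "on", "the", "mat"]
candidates = random_seq_candidates(continuation, p_penalize=0.5, rng=rng)

# Edge cases: nothing is penalized at 0.0, every token is penalized at 1.0.
none_penalized = random_seq_candidates(continuation, 0.0, rng)
all_penalized = random_seq_candidates(continuation, 1.0, rng)
```

Unlike the n-gram repetition candidates, this selection never inspects the content of the continuation, which is what makes it a useful control.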

Results with different values of $p_{\text{penalize}}$ are shown in Table 8. Penalizing 10% of the generated tokens led to substantial relative reductions in seq-rep-4 for both greedy and beam search compared to the baseline (e.g. 41% for $\mathcal{L}_{\text{UL-seq}}$ greedy, 73% for $\mathcal{L}_{\text{UL-token+seq}}$ greedy), though using n-gram repetition candidates yielded further improvements (§5.2, Table 5). Improvements in single-token metrics were similar to those from the n-gram repetition candidates (e.g. wrep .287). These results with random-seq candidates demonstrate that sequence fine-tuning can yield improvements without explicitly using the notion of repetition for candidate selection. We also find that penalizing 90% of the generated tokens yields substantial improvements in beam search, but not greedy search; investigating this is left as future work.
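For reference, the repetition metric and the relative reductions quoted above can be reproduced with a few lines. The sketch assumes seq-rep-n is one minus the fraction of unique n-grams in a continuation, following the definition in the main text, and computes it for a single sequence rather than averaging over a dataset. The function names are hypothetical; the seq-rep-4 values are the greedy rows of Table 8.

```python
def seq_rep_n(tokens, n=4):
    # Portion of duplicate n-grams in one sequence: 1 - unique / total.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def relative_reduction(baseline, value):
    # Fractional decrease of `value` relative to `baseline`.
    return (baseline - value) / baseline

# A looping sequence has 9 four-grams, of which only 4 are unique.
looping = "a b c d a b c d a b c d".split()
rep_looping = seq_rep_n(looping)

# A sequence without repeated 4-grams scores zero.
rep_clean = seq_rep_n("a b c d e f g h".split())

# Greedy seq-rep-4 from Table 8: baseline .429, then .253 and .116.
reduction_seq = relative_reduction(0.429, 0.253)
reduction_token_seq = relative_reduction(0.429, 0.116)
```

Rounded to whole percentages, the two reductions come out to 41% and 73%, matching the figures in the text.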

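| Model | $p_{\text{penalize}}$ | search | seq-rep-4 | uniq-seq | ppl | acc | rep | wrep | uniq |
|---|---|---|---|---|---|---|---|---|---|
| $\mathcal{L}_{\text{MLE}}$ | - | greedy | .429 | 10.6k | 24.590 | .401 | .619 | .346 | 11.6k |
| | - | beam | .495 | 9.4k | | | | | |
| $\mathcal{L}_{\text{UL-seq}}$ | 0.1 | greedy | .253 | 9.9k | 24.329 | .404 | .602 | .330 | 12.3k |
| | | beam | .274 | 13.1k | | | | | |
| | 0.9 | greedy | .434 | 5.3k | 26.519 | .399 | .600 | .330 | 12.2k |
| | | beam | .231 | 13.5k | | | | | |
| $\mathcal{L}_{\text{UL-token+seq}}$ | 0.1 | greedy | .116 | 12.5k | 25.518 | .399 | .551 | .287 | 13.2k |
| | | beam | .146 | 14.2k | | | | | |
| | 0.9 | greedy | .423 | 6.7k | 26.629 | .396 | .551 | .288 | 13.2k |
| | | beam | .080 | 16k | | | | | |
| Human | - | - | .005 | 18.9k | - | - | .479 | - | 18.9k |

Table 8: Results for sequence-level fine-tuning using random-seq candidates according to sequence-level (left) and token-level (right) metrics using the validation subset of Wikitext-103.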

Appendix E Human Evaluation Details

E.1 UI Screenshot

Figure 2: Screenshot of the user interface used in the human evaluation.

E.2 Crowdworker Quality Controls

We require workers to correctly answer both of the following quality control questions for their evaluations to be included. Both quality controls compare the true completion against a greedy baseline model.

Following li2019acute, we informed workers that they must provide reasoning for their choices. We filtered workers who did not provide reasoning for at least 80% of their choices.

63% of workers fail at least one of our three quality control mechanisms (the two quality control questions and the reasoning requirement). Specifically, 61% fail at least one quality control question, 16% fail both, and 4% fail to give reasoning for their choices.
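The filtering rule described above can be written down as a small predicate. This is a sketch only: the record layout and all field and function names are hypothetical, and the 80% threshold is the one stated in the text.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    # Hypothetical per-worker record; field names are illustrative only.
    passed_qc1: bool
    passed_qc2: bool
    num_choices: int
    num_with_reasoning: int

def passes_quality_controls(worker, min_reasoning_rate=0.8):
    # A worker is kept only if both quality control questions were answered
    # correctly and reasoning was given for at least 80% of the choices.
    if not (worker.passed_qc1 and worker.passed_qc2):
        return False
    if worker.num_choices == 0:
        return False
    return worker.num_with_reasoning / worker.num_choices >= min_reasoning_rate

# Toy records (illustrative only).
kept = Worker(True, True, num_choices=10, num_with_reasoning=8)
failed_question = Worker(True, False, num_choices=10, num_with_reasoning=10)
too_few_reasons = Worker(True, True, num_choices=10, num_with_reasoning=7)
```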

E.2.1 Quality Control 1


= = In the decades since its release , The Hustler has cemented its reputation as a classic . Roger Ebert , echoing earlier praise for the performances , direction , and cinematography and adding laurels for editor Dede Allen , cites the film as ” one of

Correct answer

those films where scenes have such psychic weight that they grow in our memories . ” He further cites Fast Eddie Felson as one of ” only a handful of movie characters so real that the audience refers to them as touchstones . ” TV Guide calls the film a ” dark stunner ” offering ” a grim world whose only bright spot is the top of the pool table , yet [ with ] characters [ who ] maintain a shabby nobility and grace . ” The four leads are again lavishly praised for their performances and the

Incorrect answer

the most influential films of the year ” . In his review for the Chicago Sun @-@ Times , Richard Corliss calls it ” a film of the highest order ” and ” a film of the highest order ” . In his review for the Chicago Sun @-@ Times , Richard Corliss calls it ” a film of the highest order ” and ” a film of the highest order ” . In his review for the Chicago Sun @-@ Times , Richard Corliss calls it ” a film of the highest order ” and ” a film of

E.2.2 Quality Control 2


, which had a major negative effect on the state ’s large merchant fleet . Gore was in 1808 elected to the Massachusetts House of Representatives , where he successfully led Federalist efforts to ensure the selection of a Federalist slate of presidential electors . He also spearheaded actions to

Correct answer

drive Senator John Quincy Adams from the Federalist Party over his support of Thomas Jefferson ’s foreign policy . The legislature elected Adams ’ successor nine months early , and gave Adams sufficiently distasteful instructions that he resigned the post and joined with the Republicans . = = Governor = = Gore led the Federalists to victory in 1809 against Sullivan ’s successor , Levi Lincoln , Sr. , who had taken over as acting governor upon Sullivan ’s death late in 1808 . During Gore ’s term the principal domestic issue occupying state politics

Incorrect answer

prevent the American Revolutionary War from being fought by the British , and to prevent the British from using the country to launch a war against the British . Gore ’s actions in the House of Representatives were a major turning point in his political career . He was elected to the House of Representatives in 1811 , and served until his death in 1815 . = = Early life and education = = </s> </s> Gore was born in Boston , Massachusetts , on February 22 , 1798 , the son of Benjamin Gore and his

E.3 Full Human Evaluation Results

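| Winner | Loser | Crowdworkers win rate | Crowdworkers W–L | Experts win rate | Experts W–L |
|---|---|---|---|---|---|
| $\mathcal{L}_{\text{UL-token}}$ | $\mathcal{L}_{\text{MLE}}$ baseline | 57% | 17–13 | | |
| $\mathcal{L}_{\text{UL-seq}}$ | $\mathcal{L}_{\text{MLE}}$ baseline | *71% | 41–17 | | |
| $\mathcal{L}_{\text{UL-token+seq}}$ | $\mathcal{L}_{\text{MLE}}$ baseline | *82% | 41–9 | | |
| $\mathcal{L}_{\text{UL-token+seq}}$ | $\mathcal{L}_{\text{UL-token}}$ | *75% | 56–19 | | |
| $\mathcal{L}_{\text{UL-token+seq}}$ | $\mathcal{L}_{\text{UL-seq}}$ | 59% | 38–27 | | |
| $\mathcal{L}_{\text{UL-token+seq}}$ | Nucleus | 59% | 47–33 | *83% | 30–6 |
| $\mathcal{L}_{\text{UL-token+seq}}$ | Beam blocking | 60% | 50–34 | *74% | 25–9 |
| Reference | $\mathcal{L}_{\text{MLE}}$ baseline | *85% | 17–3 | | |
| Reference | Nucleus | *69% | 38–17 | | |
| Reference | Beam blocking | *68% | 48–23 | | |
| Reference | $\mathcal{L}_{\text{UL-token}}$ | *73% | 44–16 | | |
| Reference | $\mathcal{L}_{\text{UL-seq}}$ | 50% | 30–30 | | |
| Reference | $\mathcal{L}_{\text{UL-token+seq}}$ | *64% | 46–26 | | |

Table 9: Full human evaluation results. Each row reads "Winner beats Loser". Includes additional comparisons omitted for brevity, and the raw number of wins and losses for each comparison.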