# Extracting Training Data from Large Language Models

## Abstract

It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.

We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model’s training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences is included in just one document in the training data.

We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. For example, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.

## 1 Introduction

Language models (LMs)—statistical models which assign a probability to a sequence of words—are fundamental to many natural language processing tasks. Modern neural-network-based LMs use very large model architectures (e.g., 175 billion parameters [5]) and train on massive datasets (e.g., nearly a terabyte of English text [51]). This scaling increases the ability of LMs to generate fluent natural language [70, 49, 68], and also allows them to be applied to a plethora of other tasks [51, 26, 35], even without updating their parameters [5].

At the same time, machine learning models are known to leak information about their (potentially private) training data—both in general [60, 42] and in the specific case of language models [6, 40]. For instance, for certain models it is known that adversaries can apply membership inference [60] to predict if a given example was in the training data.

Such privacy leakage is typically associated with overfitting [69]—when a model’s training error is significantly lower than its test error—because overfitting often indicates that a model has memorized examples from its training set. Indeed, overfitting is a sufficient condition for privacy leakage [67] and many attacks work by exploiting overfitting [60].

The association between overfitting and memorization has—erroneously—led many to assume that state-of-the-art LMs will not leak information about their training data. Because these models are often trained on massive de-duplicated datasets only for a single epoch [5, 51], they exhibit little to no overfitting [49]. Accordingly, the prevailing wisdom has been that “the degree of copying with respect to any given work is likely to be, at most, de minimis” [66] and that models do not significantly memorize any particular training example.

Contributions. In this work, we demonstrate that large language models memorize and leak individual training examples. In particular, we propose a simple and efficient method for extracting verbatim sequences from a language model’s training set using only black-box query access. Our key insight is that, although training examples do not have noticeably lower losses than test examples on average, certain worst-case training examples are indeed memorized.

In our attack, we first generate a large, diverse set of high-likelihood samples from the model, using one of three general-purpose sampling strategies. We then sort each sample using one of six different metrics that estimate the likelihood of each sample using a separate reference model (e.g., another LM), and rank highest the samples with abnormally high likelihood ratio between the two models.

Our attacks directly apply to any language model, including those trained on sensitive and non-public data [8, 14]. We use the GPT-2 LM [50] released by OpenAI as a representative LM in our experiments. We choose to attack GPT-2 to minimize real-world harm—the GPT-2 model and original training data source are already public.

To make our results quantitative, we give a testable definition of memorization. We then generate 1,800 candidate memorized samples, 100 under each of the $3 \times 6$ attack configurations, and find that over 600 of them are verbatim samples from the GPT-2 training data (confirmed in collaboration with the creators of GPT-2). In the best attack configuration, 67% of candidate samples are verbatim training examples. Our most obviously sensitive attack extracts the full name, physical address, email address, phone number, and fax number of an individual (see Figure 1). We comprehensively analyze our attack, including studying how model size and string frequency affect memorization, as well as how different attack configurations change the types of extracted data.

We conclude by discussing numerous practical strategies to mitigate privacy leakage. For example, differentially-private training [1] is theoretically well-founded and guaranteed to produce private models if applied at an appropriate record level, but it can result in longer training times and typically degrades utility. We also make recommendations, such as carefully de-duplicating documents, that empirically will help to mitigate memorization but cannot prevent all attacks.

## 2 Background & Related Work

To begin, we introduce the relevant background on large (billion-parameter) neural network-based language models (LMs) as well as data privacy attacks.

### 2.1 Language Modeling

Language models are a fundamental building block of current state-of-the-art natural language processing pipelines [48, 10, 28, 46, 51]. While the unsupervised objectives used to train these models vary, one popular choice is a “next-step prediction” objective [4, 39, 28, 48]. This approach constructs a generative model of the distribution

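$$\Pr(x_1, x_2, \ldots, x_n),$$

where $x_1, x_2, \ldots, x_n$ is a sequence of tokens from a vocabulary $\mathcal{V}$, by applying the chain rule of probability

$$\Pr(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \Pr(x_i \mid x_1, \ldots, x_{i-1}).$$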

State-of-the-art LMs use neural networks to estimate this probability distribution. We let $f_\theta(x_i \mid x_1, \ldots, x_{i-1})$ denote the likelihood of token $x_i$ when evaluating the neural network $f$ with parameters $\theta$. While it used to be common to use recurrent neural networks (RNNs) [39, 24] for the neural network architecture, attention-based models [3] have recently replaced RNNs in state-of-the-art models. In particular, Transformer LMs [65] consist of a sequence of attention layers and are the current model architecture of choice. Because we believe our results are independent of the exact architecture used, we will not describe the Transformer architecture in detail here and instead refer to existing work [2].

Training Objective. A language model is trained to maximize the probability of the data in a training set $\mathcal{X}$. In this paper, each training example is a text document—for example, a specific news article or webpage from the internet. Formally, training involves minimizing the loss function

$$\mathcal{L}(\theta) = -\log \prod_{i=1}^{n} f_\theta(x_i \mid x_1, \ldots, x_{i-1})$$

over each training example in the training dataset $\mathcal{X}$. Because of this training setup, the “optimal” solution to the task of language modeling is to memorize the answer to the question “what token follows the sequence $x_1, \ldots, x_{i-1}$?” for every prefix in the training set. However, state-of-the-art LMs are trained with massive datasets, which causes them to not exhibit significant forms of memorization: empirically, the training loss and the test loss are nearly identical [49, 51, 5].
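As a minimal sketch of the chain-rule factorization and the resulting loss, the snippet below scores a token sequence under a toy conditional model. The `BIGRAM` table and both function names are hypothetical, and conditioning on only the previous token is a simplifying assumption; a neural LM conditions on the entire prefix.

```python
import math

# Hypothetical toy model: Pr(next token | previous token). "<s>" marks the start.
# Assumption: one token of context, purely to keep the table small.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}


def sequence_log_prob(tokens):
    """Chain rule: sum the log conditional probability of each token."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p


def loss(tokens):
    """Negative log-likelihood of one training example."""
    return -sequence_log_prob(tokens)
```

Summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow on long sequences; the two forms are equivalent because the logarithm of a product is the sum of the logarithms.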

Generating Text. A language model can generate new text (potentially conditioned on some prefix $x_1, \ldots, x_i$) by iteratively sampling $\hat{x}_{i+1} \sim f_\theta(x_{i+1} \mid x_1, \ldots, x_i)$ and then feeding $\hat{x}_{i+1}$ back into the model to sample $\hat{x}_{i+2} \sim f_\theta(x_{i+2} \mid x_1, \ldots, \hat{x}_{i+1})$. This process is repeated until a desired stopping criterion is reached. Variations of this text generation method include deterministically choosing the most-probable token rather than sampling (i.e., “greedy” sampling) or setting all but the top-$n$ probabilities to zero and renormalizing the probabilities before sampling (i.e., top-$n$ sampling [16]).
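The two decoding variations just described can be sketched as follows. This is an illustrative implementation over a plain list of next-token probabilities; the function names are hypothetical and not taken from any particular library.

```python
import random


def top_n_filter(probs, n):
    """Zero all but the n largest probabilities, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


def sample_token(probs, n, rng):
    """Draw one token index from the truncated, renormalized distribution."""
    filtered = top_n_filter(probs, n)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


def greedy_token(probs):
    """Deterministically pick the most probable token index."""
    return max(range(len(probs)), key=lambda i: probs[i])
```

Passing an explicit `random.Random` instance as `rng` keeps sampling reproducible, which helps when comparing generation strategies.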

GPT-2. Our paper focuses on the GPT variant of Transformer LMs [48, 50, 5]. Specifically, we demonstrate training data extraction attacks on GPT-2, a family of LMs all trained using the same dataset and training algorithm, but with varying model sizes.

GPT-2 XL is the largest model with 1.5 billion parameters. For the remainder of this paper, the “GPT-2” model refers to this 1.5 billion parameter model or, when we specifically indicate this, its Small and Medium variants with 124 million and 334 million parameters, respectively.

GPT-2 was trained on data scraped from the public Internet. The authors collected a dataset by following outbound links from the social media website Reddit. The webpages were cleaned of HTML, with only the document text retained, and then de-duplicated at the document level. This results in a final dataset of 40 GB of text data, over which the model was trained for approximately 12 epochs. As a result, GPT-2 does not overfit: the training loss is only roughly 10% smaller than the test loss across all model sizes.

### 2.2 Training Data Privacy

It is undesirable for models to remember any details that are specific to their (potentially private) training data. The field of training data privacy develops attacks (to leak training data details) and defenses (to prevent leaks).

Privacy Attacks. When models are not trained with private algorithms, they are vulnerable to numerous privacy attacks. The least revealing form of attack is the membership inference attack [60, 42, 62, 25]: given a trained model, an adversary can predict whether or not a particular example was used to train the model. Separately, model inversion attacks [19] reconstruct representative views of a subset of examples (e.g., a model inversion attack on a face recognition classifier might recover a fuzzy image of a particular person that the classifier can recognize).

Training data extraction attacks, like model inversion attacks, aim to reconstruct training datapoints. However, training data extraction attacks aim to reconstruct verbatim training examples and not just representative “fuzzy” examples. This makes them significantly more dangerous, e.g., they can extract secrets such as verbatim social security numbers or passwords. Training data extraction attacks have until now been limited to small LMs trained on academic datasets under artificial training setups (e.g., for more epochs than typical) [6, 61, 63], settings with white-box model access [45], or settings where the adversary has a priori knowledge of the secret they want to extract (e.g., a social security number) [6].

Protecting Privacy. An approach to minimizing memorization of training data is to apply differentially-private training of deep learning models [56, 7, 59, 1, 38]. Unfortunately, training models with differentially-private mechanisms often reduces accuracy [31] because it causes models to fail to capture the long tails of the data distribution [62, 18, 17]. Moreover, it increases training time, which can further reduce accuracy because current LMs are limited by the cost of training [32, 34, 51]. As a result, state-of-the-art LMs such as GPT-2 [49], GPT-3 [5], and T5 [51] do not apply these privacy-preserving techniques.

## 3 Threat Model & Ethics

Training data extraction attacks are often seen as theoretical or academic and are thus unlikely to be exploitable in practice [66]. This is justified by the prevailing intuition that privacy leakage is correlated with overfitting [67], and because state-of-the-art LMs are trained on large (near terabyte-sized [5]) datasets for a few epochs, they tend to not overfit [49].

Our paper demonstrates that training data extraction attacks are practical. To accomplish this, we first precisely define what we mean by “memorization”. We then state our threat model and our attack objectives. Finally, we discuss the ethical considerations behind these attacks and explain why they are likely to be a serious threat in the future.

### 3.1 Defining Language Model Memorization

There are many ways to define memorization in language modeling. As mentioned earlier, memorization is in many ways an essential component of language models because the training objective is to assign high overall likelihood to the training dataset. LMs must, for example, “memorize” the correct spelling of individual words.

Indeed, there is an entire area of research that analyzes neural networks as repositories of (memorized) knowledge [47, 55]. For example, when GPT-2 is prompted to complete the sentence “My address is 1 Main Street, San Francisco CA”, it generates the next token “94107”: a correct zip code for San Francisco, CA. While this is clearly memorization in some abstract form, we aim to formalize our definition of memorization in order to restrict it to cases that we might consider “unintended” [6].

#### Eidetic Memorization of Text

We define eidetic memorization as a particular type of memorization. Informally, eidetic memorization is data that has been memorized by a model despite only appearing in a small set of training instances. The fewer training samples that contain the data, the stronger the eidetic memorization is.

To formalize this notion, we first define what it means for a model to have knowledge of a string $s$. Our definition is loosely inspired by knowledge definitions in interactive proof systems [22]: a model $f_\theta$ knows a string $s$ if $s$ can be extracted by interacting with the model. More precisely, we focus on black-box interactions where the model generates $s$ as the most likely continuation when prompted with some prefix $c$:

###### Definition 1 (Model Knowledge Extraction)

A string $s$ is extractable from an LM $f_\theta$ if there exists a prefix $c$ such that:

$$s \leftarrow \underset{s' :\, |s'| = N}{\arg\max}\; f_\theta(s' \mid c)$$

Note that we abuse notation slightly here to denote by $f_\theta(s' \mid c)$ the likelihood of an entire sequence $s'$. Since computing the most likely sequence $s$ is intractable for large $N$, the $\arg\max$ in Definition 1 can be replaced by an appropriate sampling strategy (e.g., greedy sampling) that reflects the way in which the model generates text in practical applications. We then define eidetic memorization as follows:

###### Definition 2 ($k$-Eidetic Memorization)

A string $s$ is $k$-eidetic memorized (for $k \geq 1$) by an LM $f_\theta$ if $s$ is extractable from $f_\theta$ and $s$ appears in at most $k$ examples in the training data $\mathcal{X}$:

$$|\{x \in \mathcal{X} : s \subseteq x\}| \leq k.$$

Note that here we count the number of distinct training examples containing a given string, and not the total number of times the string occurs—a string may appear multiple times in a single example, and our analysis counts this as $k = 1$.
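The two conditions of Definition 2 can be illustrated with a small sketch. It assumes greedy decoding as the stand-in for the most-likely continuation (one of the substitutions Definition 1 permits), treats tokens as whitespace-separated words, and takes a hypothetical `next_token_probs` callable that maps a token list to a dictionary of next-token probabilities. All names are illustrative.

```python
def greedy_continuation(next_token_probs, prefix, length):
    """Extend the prefix by `length` tokens, always taking the likeliest one."""
    tokens = list(prefix)
    for _ in range(length):
        probs = next_token_probs(tokens)  # dict: token -> probability
        tokens.append(max(probs, key=probs.get))
    return tokens[len(prefix):]


def count_containing_docs(s, training_docs):
    """Number of distinct documents containing s (repeats in one doc count once)."""
    return sum(1 for doc in training_docs if s in doc)


def is_k_eidetic(s_tokens, prefix, next_token_probs, training_docs, k):
    """True if s is greedily extractable from the prefix and in at most k docs."""
    extracted = greedy_continuation(next_token_probs, prefix, len(s_tokens))
    if extracted != list(s_tokens):
        return False
    # Assumption: documents are whitespace-tokenized text.
    return count_containing_docs(" ".join(s_tokens), training_docs) <= k
```

Note that this only tests a single candidate prefix; the definition asks whether any prefix exists, so a `False` result here does not show that a string is not extractable.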

This definition allows us to define memorization as a spectrum. While there is no definitive value of $k$ at which we might say that memorization is unintentional and potentially harmful, smaller values are more likely to be so. For any given $k$, memorizing longer strings is also intuitively more harmful than shorter ones.

For example, under this definition, memorizing the correct spelling of one particular word is not particularly severe if the word occurs in many training examples (i.e., $k$ is large). Memorizing the zip code of a particular city might be eidetic memorization, depending on whether the city was mentioned in many training examples. Referring back to Figure 1, memorizing an individual person’s name and phone number clearly is at the (informally) worrying end of the spectrum, and also satisfies our formal definition: it is contained in just a few documents on the Internet—and hence the training data.

### 3.2 Threat Model

Adversary’s Capabilities. We consider an adversary who has black-box input-output access to a language model. This allows the adversary to compute the probability of arbitrary sequences $f_\theta(x_1, \ldots, x_n)$, and as a result allows the adversary to obtain next-word predictions, but it does not allow the adversary to inspect individual weights or hidden states (e.g., attention vectors) of the language model.

This threat model is highly realistic as many LMs are available through black-box APIs. For example, the GPT-3 model [5] created by OpenAI is available through black-box API access. Auto-complete models trained on actual user data have also been made public, although they reportedly use privacy-protection measures during training [8].

Adversary’s Objective. The adversary’s objective is to extract memorized training data from the model. The strength of an attack is measured by how private (formalized as being $k$-eidetic memorized) a particular example is. Stronger attacks extract more examples in total (both more total sequences, and longer sequences) and examples with lower values of $k$.

We do not aim to extract targeted pieces of training data, but rather indiscriminately extract training data. While targeted attacks have the potential to be more adversarially harmful, our goal is to study the ability of LMs to memorize data generally, not to create an attack that can be operationalized by real adversaries to target specific users.

Attack Target. We select GPT-2 [50] as a representative LM to study for our attacks. GPT-2 is nearly a perfect target. First, from an ethical standpoint, the model and data are public, and so any memorized data that we extract is already public. Second, from a research standpoint, the dataset (despite being collected from public sources) was never actually released by OpenAI. Thus, it is not possible for us to unintentionally “cheat” and develop attacks that make use of knowledge of the GPT-2 training dataset.

### 3.3 Risks of Training Data Extraction

Training data extraction attacks present numerous privacy risks. From an ethical standpoint, most of these risks are mitigated in our paper because we attack GPT-2, whose training data is public. However, since our attacks would apply to any LM, we also discuss potential consequences of future attacks on models that may be trained on private data.

Data Secrecy. The most direct form of privacy leakage occurs when data is extracted from a model that was trained on confidential or private data. For example, GMail’s auto-complete model [8] is trained on private text communications between users, so the extraction of unique snippets of training data would break data confidentiality.

Contextual Integrity of Data. The above privacy threat corresponds to a narrow view of data privacy as data secrecy. A broader view of the privacy risks posed by data extraction stems from the framework of data privacy as contextual integrity [43]. That is, data memorization is a privacy infringement if it causes data to be used outside of its intended context. An example violation of contextual integrity is shown in Figure 1. This individual’s name, address, email, and phone number are not secret—they were shared online in a specific context of intended use (as contact information for a software project)—but are reproduced by the LM in a separate context. Due to failures such as these, user-facing applications that use LMs may inadvertently emit data in inappropriate contexts, e.g., a dialogue system may emit a user’s phone number in response to another user’s query.

Note that the two privacy threats described above might hold even for models that do not exhibit $k$-eidetic memorization (see Section 3.1) for small values of $k$. Nevertheless, we focus on $k$-eidetic memorization with a small $k$ value because it makes extraction attacks more impactful.

Moreover, note that although we frame our paper as an “attack”, LMs will output memorized data even in the absence of an explicit adversary. We treat LMs as black-box generative functions, and the memorized content that we extract can be generated through honest interaction with the LM. Indeed, we have even discovered at least one memorized training example among the GPT-3 samples that OpenAI originally released in its official repository [44].

### 3.4 Ethical Considerations

In this paper, we will discuss and carefully examine specific memorized content that we find in our extraction attacks. There are ethical considerations for this analysis because some of the data that we extract contains information about individual users.

As previously mentioned, we minimize ethical concerns by using data that is already public. We attack the GPT-2 model, which is publicly available. Moreover, the GPT-2 training data was collected from the public Internet [50], and is in principle available to anyone who performs the same (documented) collection process as OpenAI, e.g., see [21].

However, there are still ethical concerns even though the model and data are public. It is possible—and indeed we find it is the case—that we might extract personal information for individuals from the training data. For example, as shown in Figure 1, we recovered a person’s full name, address, and phone number. In this paper, whenever we succeed in extracting personally-identifying information—usernames, phone numbers, etc.—we partially mask out this content with a redaction token (rendered as a black bar). We are aware of the fact that this does not provide complete mediation: disclosing that the vulnerability exists allows a malicious actor to perform these attacks on their own to recover this personal information.

Just as responsible disclosure still causes some (limited) harm, we believe that the benefits of publicizing these attacks outweigh the potential harms. Further, to make our attacks public, we must necessarily reveal some sensitive information. We contacted the individual whose information is partially shown in Figure 1 to disclose this fact to them in advance and received permission to use this example. Our research findings have also been disclosed to OpenAI.

Unfortunately, we cannot hope to contact all researchers who train large LMs in advance of our publication. We thus hope that this publication will spark further discussions on the ethics of memorization and extraction among other companies and research teams that train large LMs.

## 4 Initial Training Data Extraction Attack

We begin with a simple strawman baseline for extracting training data from a language model in a two-step procedure.

• Generate text. We generate a large quantity of data by unconditionally sampling from the model (Section 4.1).

• Predict which outputs contain memorized text. We next remove the generated samples that are unlikely to contain memorized text using a membership inference attack (Section 4.2).

These two steps correspond directly to extracting model knowledge (Definition 1), and then predicting which strings might be $k$-eidetic memorization (Definition 2).

### 4.1 Initial Text Generation Scheme

To generate text, we initialize the language model with a special start-of-sentence token and then repeatedly sample tokens in an autoregressive fashion from the model (see Section 2.1 for background). We hope that by sampling according to the model’s assigned likelihood, we will sample sequences that the model considers “highly likely”, and that likely sequences correspond to memorized text. Concretely, we sample exactly 256 tokens for each trial using the top-$n$ strategy from Section 2.1 with $n = 40$.

### 4.2 Initial Membership Inference

Given a set of samples from the model, the problem of training data extraction reduces to one of membership inference: predict whether each sample was present in the training data [60]. In their most basic form, past membership inference attacks rely on the observation that models tend to assign higher confidence to examples that are present in the training data [41]. Therefore, a potentially high-precision membership inference classifier is to simply choose examples that are assigned the highest likelihood by the model.

Since LMs are probabilistic generative models, there is a natural way to evaluate the likelihood of a given string: the perplexity of a sequence measures how well the LM “predicts” the tokens in that sequence. Concretely, given a sequence of tokens $x_1, \ldots, x_n$, the perplexity is defined as

$$\mathcal{P} = \exp\left(-\frac{1}{n} \sum_{i=1}^{n} \log f_\theta(x_i \mid x_1, \ldots, x_{i-1})\right)$$

That is, if the perplexity is low, then the model is not very “surprised” by the sequence and has assigned on average a high probability to each subsequent token in the sequence.
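As a small illustration of the perplexity formula and of likelihood-based ranking, the sketch below assumes the per-token natural-log probabilities have already been obtained from the model. The function names and the `(text, token_log_probs)` pair layout are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean per-token log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def rank_by_perplexity(samples):
    """Sort (text, token_log_probs) pairs from lowest to highest perplexity."""
    return sorted(samples, key=lambda sample: perplexity(sample[1]))
```

A sequence whose tokens each receive probability 0.25 has perplexity 4: the model is, on average, as uncertain as if it were choosing uniformly among four options.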

### 4.3 Initial Extraction Results

We generate 200,000 samples using the largest version of the GPT-2 model (XL, 1558M parameters) following the text generation scheme described in Section 4.1. We then sort these samples according to the model’s perplexity measure and investigate those with the lowest perplexity.

This simple baseline extraction attack can find a wide variety of memorized content. For example, GPT-2 memorizes the entire text of the MIT public license, as well as the user guidelines of Vaughn Live, an online streaming site. While this is “memorization”, it is only $k$-eidetic memorization for a large value of $k$—these licenses occur thousands of times.

The most interesting (but still not eidetic memorization for low values of $k$) examples include the memorization of popular individuals’ Twitter handles or email addresses (omitted to preserve user privacy). In fact, all memorized content we identify in this baseline setting is likely to have appeared in the training dataset many times.

This initial approach has two key weaknesses that we can identify. First, our sampling scheme tends to produce a low diversity of outputs. For example, out of the 200,000 samples we generated, several hundred are duplicates of the memorized user guidelines of Vaughn Live.

Second, our baseline membership inference strategy suffers from a large number of false positives, i.e., content that is assigned high likelihood but is not memorized. The majority of these false positive samples contain “repeated” strings (e.g., the same phrase repeated multiple times). Despite such text being highly unlikely, large LMs often incorrectly assign high likelihood to such repetitive sequences [27].

## 5 Improved Training Data Extraction Attack

The proof-of-concept attack presented in the previous section has low precision (high-likelihood samples are not always in the training data) and low recall (it identifies no $k$-memorized content for low $k$). Here, we improve the attack by incorporating better methods for sampling from the model (Section 5.1) and membership inference (Section 5.2).

### 5.1 Improved Text Generation Schemes

The first step in our attack is to randomly sample from the language model. Above, we used top-$n$ sampling and conditioned the LM on the start-of-sequence token as input. This strategy has clear limitations [29]: it will only generate sequences that are likely from beginning to end. As a result, top-$n$ sampling from the model will cause it to generate the same (or similar) examples several times. Below we describe two alternative techniques for generating more diverse samples from the LM.

#### Sampling With A Decaying Temperature

As described in Section 2.1, an LM outputs the probability of the next token given the prior tokens, $\Pr(x_i \mid x_1, \ldots, x_{i-1})$. In practice, this is achieved by evaluating the neural network to obtain the “logit” vector $z$, and then computing the output probability distribution as $y = \mathrm{softmax}(z)$, defined by $\mathrm{softmax}(z)_i = \exp(z_i) / \sum_j \exp(z_j)$.

One can artificially “flatten” this probability distribution to make the model less confident by replacing the output $\mathrm{softmax}(z)$ with $\mathrm{softmax}(z/t)$, for $t > 1$. Here, $t$ is called the temperature. A higher temperature causes the model to be less confident and more diverse in its output.

However, maintaining a high temperature throughout the generation process would mean that even if the sampling process began to emit a memorized example, it would likely randomly step off the path of the memorized output. Thus, we use a softmax temperature that decays over time, starting at $t = 10$ and decaying down to $t = 1$ over a period of the first 20 tokens (roughly 10% of the length of the sequence). This gives a sufficient amount of time for the model to “explore” a diverse set of prefixes while also allowing it to follow any high-confidence paths that it finds.
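A decaying-temperature softmax could look like the sketch below. The text does not state the shape of the decay, so the linear schedule here is an assumption, as are the function names; the start value, end value, and decay length are left as parameters.

```python
import math


def softmax_with_temperature(logits, t):
    """softmax(z / t); larger t flattens the distribution."""
    scaled = [z / t for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def decayed_temperature(step, t_start, t_end, decay_steps):
    """Assumed linear decay from t_start to t_end, then constant at t_end."""
    if step >= decay_steps:
        return t_end
    return t_start + (t_end - t_start) * step / decay_steps
```

At generation step `i`, the sampler would call `softmax_with_temperature(logits, decayed_temperature(i, t_start, t_end, decay_steps))`, so early tokens are drawn from a flatter distribution and later tokens from the model's unmodified one.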

#### Conditioning on Internet Text

Even when applying temperature sampling, there are still some prefixes that are unlikely to be sampled but nevertheless occur in actual data. Our third sampling strategy therefore seeds the model with prefixes from our own scrapes of the Internet. This sampling strategy ensures that we will generate samples with a diverse set of prefixes that are similar in nature to the type of data GPT-2 was trained on.

We follow a different data collection process from the one used for GPT-2 (which follows Reddit links) in order to reduce the likelihood that our dataset has any intersection with the model’s training data. In particular, we select samples from a subset of Common Crawl to feed as context to the model.

As in prior work [51], we perform basic data-sanitization by removing HTML and JavaScript from webpages, and we de-duplicate data on a line-by-line basis. This gives us a dataset of 50 MB of text. We randomly sample between 5 and 10 tokens of context from this scraped data and then continue LM generation with top-$n$ sampling as in Section 4.1.
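The line-level de-duplication and prefix sampling steps might be sketched as follows. Whitespace tokenization is an assumption made for brevity (GPT-2 itself uses a subword tokenizer), and the function names and length bounds are illustrative parameters.

```python
import random


def dedupe_lines(text):
    """Keep the first occurrence of each line, preserving order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def sample_prefix(tokens, min_len, max_len, rng):
    """Pick a random contiguous run of min_len..max_len tokens as context.

    Requires len(tokens) >= max_len.
    """
    length = rng.randint(min_len, max_len)
    start = rng.randrange(0, len(tokens) - length + 1)
    return tokens[start:start + length]
```

The sampled prefix would then be passed to the model as conditioning context, with generation continuing from its last token.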

### 5.2 Improved Membership Inference

Performing membership inference by filtering out samples with low likelihood has poor precision due to failures in the underlying language model: there are many samples that are assigned spuriously high likelihood. There are predominantly two categories of such samples:

• Trivial memorization. We identify many cases where GPT-2 outputs content that is uninteresting because of how common the text is. For example, it repeats the numbers from 1 to 100 with high probability.

• Repeated substrings. One common failure mode of LMs is their propensity to repeatedly emit the same string over and over [33, 27]. We found many of the high-likelihood samples that are not memorized are indeed repeated texts (e.g., “I love you. I love you…”).

Our insight is that we can filter out these uninteresting (yet still high-likelihood) samples by comparing to a second LM. Given a second model that accurately captures text likelihood, we should expect it will also assign high likelihood to these forms of memorized content. Therefore, a natural strategy for finding more diverse and rare forms of memorization is to filter samples where the original model’s likelihood is “unexpectedly high” compared to a second model. Below we discuss four methods for achieving this.

Comparing to Other Neural Language Models. Assume that we have access to a second LM that memorizes a different set of examples than GPT-2. One way to achieve this would be to train a model on a disjoint set of training data, in which case it is unlikely that the two models will memorize the same data for small $k$. An alternate strategy is to take a much smaller model trained on the same underlying dataset: because smaller models have less capacity for memorization, we conjecture that there are samples that are $k$-eidetic memorized (for small $k$) by the largest GPT-2 model, but which are not memorized by smaller GPT-2 models. Specifically, we use the Small (117M parameters) and Medium (345M parameters) models.

Comparing to zlib Compression. It is not necessary that we compare to another neural LM; any technique that quantifies some notion of “surprise” for a given sequence can be useful. As a simple baseline method, we compute the zlib [20] entropy of the text: the number of bits of entropy when the sequence is compressed with zlib compression. We then use the ratio of the GPT-2 perplexity and the zlib entropy as our membership inference metric. Although text compressors are simple, they can identify many of the examples of trivial memorization and repeated patterns described above (e.g., they are excellent at modeling repeated substrings).
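A minimal sketch of the zlib-based quantity, using Python's standard `zlib` module, is shown below. Counting the compressed length in bits is taken from the description above; the function names are hypothetical, and the log-perplexity is assumed to be supplied by the caller from the language model.

```python
import zlib


def zlib_entropy_bits(text):
    """Number of bits in the zlib-compressed UTF-8 encoding of the text."""
    return 8 * len(zlib.compress(text.encode("utf-8")))


def zlib_ratio(log_perplexity, text):
    """Ratio of the model's log-perplexity to the zlib entropy of the text."""
    return log_perplexity / zlib_entropy_bits(text)
```

Highly repetitive text compresses to very few bits, so for a fixed model log-perplexity it yields a very different ratio from varied text of the same length. This is what lets the metric separate repeated substrings from more interesting high-likelihood samples.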

Comparing to Lowercased Text. Instead of detecting memorization by comparing one model to another model, another option detects memorization by comparing the perplexity of the model to the perplexity of the same model on a “canonicalized” version of that sequence. Specifically, we measure the ratio of the perplexity on the sample before and after lowercasing it, which can dramatically alter the perplexity of memorized content that expects a particular casing.

Perplexity on a Sliding Window. Sometimes a model is not confident when the sample contains one memorized substring surrounded by a block of non-memorized (and high perplexity) text. To handle this, we use the minimum perplexity when averaged over a sliding window of 50 tokens.
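The sliding-window metric can be sketched as below, assuming per-token log-probabilities for the sample are already available. The window default of 50 follows the metric description in Section 6.1; the input values are invented.

```python
import math


def min_window_perplexity(token_logprobs, window=50):
    """Lowest perplexity over any contiguous window of `window` tokens."""
    n = len(token_logprobs)
    if n == 0:
        raise ValueError("need at least one token")
    window = min(window, n)  # short samples: use the whole sequence
    window_sum = sum(token_logprobs[:window])
    best_sum = window_sum
    for i in range(window, n):
        # Slide the window one token to the right.
        window_sum += token_logprobs[i] - token_logprobs[i - window]
        best_sum = max(best_sum, window_sum)  # highest likelihood = lowest perplexity
    return math.exp(-best_sum / window)


# Hypothetical sample: 5 confidently predicted tokens inside unlikely text.
logprobs = [-5.0] * 10 + [-0.1] * 5 + [-5.0] * 10
windowed = min_window_perplexity(logprobs, window=5)
overall = math.exp(-sum(logprobs) / len(logprobs))
```

In this example the overall perplexity is dominated by the surrounding text, while the windowed value isolates the confidently predicted span.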

## 6 Evaluating Memorization

We now evaluate the various data extraction methods and study common themes in the resulting memorized content.

### 6.1 Methodology

An overview of our experimental setup is shown in Figure 2. We first generate 200,000 samples of 256 tokens using each of the three text generation strategies:

• Top-$n$: The strategy from Section 4.1 that generates using top-$n$ sampling from the start-of-sequence token.

• Temperature: The strategy from Section 5.1.1 that uses a higher sampling temperature for the initial tokens.

• Internet: The strategy from Section 5.1.2 that conditions the LM on Internet text and then does top-$n$ sampling.

For each strategy, we sort the generated samples according to each of our six membership inference metrics:

• Perplexity: the perplexity of the largest GPT-2 model.

• Small: the ratio of log-perplexities of the largest GPT-2 model and the Small GPT-2 model.

• Medium: the ratio as above, but for the Medium GPT-2.

• zlib: the ratio of the (log) GPT-2 perplexity to the zlib entropy (as computed by compressing the text).

• Lowercase: the ratio of perplexities of the GPT-2 model on the original sample and on the lowercased sample.

• Window: the minimum perplexity of the largest GPT-2 model across any sliding window of 50 tokens.

For each of these $3 \times 6 = 18$ configurations, we select 100 samples from among the top-1000 samples according to the chosen metric. This gives us 1,800 total samples of potentially memorized content. In real-world attacks, adversaries will look to uncover large amounts of memorized content and thus may generate many more samples. We focus on a smaller set as a proof-of-concept attack.

Data De-Duplication. To avoid “double-counting” memorized content, we apply an automated fuzzy de-duplication step when we select the samples for each configuration.

Given a sample $s$, we define the trigram-multiset of $s$, denoted $\text{tri}(s)$, as a multiset of all word-level trigrams in $s$ (with words split on whitespace and punctuation characters). For example, the sentence “my name my name my name” has two trigrams (“my name my” and “name my name”), each of multiplicity 2. We mark a sample $s_1$ as a duplicate of another sample $s_2$ if their trigram multisets are similar, specifically if $|\text{tri}(s_1) \cap \text{tri}(s_2)| \geq |\text{tri}(s_1)| / 2$.
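The de-duplication check can be sketched as follows. The regular expression used for word splitting and the handling of samples with fewer than three words are our assumptions, and the overlap threshold of one half is exposed as a parameter.

```python
import re
from collections import Counter


def trigram_multiset(sample: str) -> Counter:
    """Multiset of word-level trigrams; words are maximal runs of word characters."""
    words = re.findall(r"\w+", sample)
    return Counter(zip(words, words[1:], words[2:]))


def is_duplicate(s1: str, s2: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of s1's trigrams (with multiplicity) occur in s2."""
    tri1, tri2 = trigram_multiset(s1), trigram_multiset(s2)
    if not tri1:
        return False  # assumption: samples without any trigram are never duplicates
    overlap = sum((tri1 & tri2).values())  # multiset intersection size
    return overlap >= threshold * sum(tri1.values())


example = trigram_multiset("my name my name my name")
near_copy = is_duplicate("the cat sat on the mat today",
                         "the cat sat on the mat yesterday")
unrelated = is_duplicate("a b c d", "w x y z")
```

Note that the check is asymmetric: it measures how much of the first sample is covered by the second, so a short sample can be a duplicate of a long one without the reverse holding.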

Evaluating Memorization Using Manual Inspection. For each of the selected samples, one of four authors manually determined whether the sample contains memorized text. Since the training data for GPT-2 was sourced from the public Web, our main tool is Internet searches. We mark a sample as memorized if we can identify a non-trivial substring that returns an exact match on a page found by a Google search.

Validating Results on the Original Training Data. Finally, given the samples that we believe to be memorized, we work with the original authors of GPT-2 to obtain limited query access to their training dataset. We sent the GPT-2 authors the output samples that we believe to be memorized, along with the memorized substrings. For efficiency, they performed a fuzzy 3-gram match to account for memorization with different possible tokenizations. We marked samples as memorized when all 3-grams in the memorized sequence occurred in close proximity in the training dataset. Whenever we report exact counts below, the GPT-2 authors performed a separate independent grep over the entire dataset to get exact counts.

### 6.2 Results

In total across all strategies, we identify 604 unique memorized training examples from among the 1,800 candidates, for an aggregate true positive rate of 33.5% (our best variant has a true positive rate of 67%). Below, we categorize what types of content are memorized by the model, and also study which attack methods are most effective.

Categories of Memorized Content. We manually grouped the memorized samples into different categories (a description of these categories is in Appendix B). The results are shown in Table 1. Most memorized content is fairly canonical text from news headlines, log files, entries from forums or wikis, or religious text. However, we also identify a significant amount of unique data, containing 128-bit UUIDs, (correctly-resolving) URLs containing random strings, and contact information of individual people. In Section 6.3, we study these cases in more detail.

Efficacy of Different Attack Strategies. Table 2 shows the number of memorized samples broken down by the different text generation and membership inference strategies. Sampling conditioned on Internet text is the most effective way to identify memorized content; however, all generation schemes reveal a significant amount of memorized content. For example, the baseline strategy of generating with top-$n$ sampling yields 191 unique memorized samples, whereas conditioning on Internet text increases this to 273.

As discussed earlier, looking directly at the LM perplexity is a poor membership inference metric when classifying data generated with top-$n$ or temperature sampling: just 9% and 3% of inspected samples are memorized, respectively. The comparison-based metrics are significantly more effective at predicting if content was memorized. For example, 67% of Internet samples marked by zlib are memorized.

Figure 3 compares the zlib entropy and the GPT-2 XL perplexity for each sample, with memorized examples highlighted. Plots for the other strategies are shown in Figure 4 in Appendix C. Observe that most samples fall along a diagonal, i.e., samples with higher likelihood under one model also have higher likelihood under another model. However, there are numerous outliers in the top left: these samples correspond to those that GPT-2 assigns a low perplexity (a high likelihood) but zlib is surprised by. These points, especially those which are extreme outliers, are more likely to be memorized than those close to the diagonal.

The different extraction methods differ in the type of memorized content they find. A complete breakdown of the data is given in Appendix B; however, to briefly summarize:

1. The zlib strategy often finds non-rare text (i.e., text with a high $k$-eidetic memorization value). The samples are typically news headlines, license files, or repeated strings from forums or wikis. For example, there is only one “high entropy” value found with this strategy.

2. Lower-casing finds content that is likely to have irregular capitalization, such as news headlines (where words are capitalized) or error logs (with many uppercase words).

3. The Small and Medium strategies often find rare content. There are 13 and 10 high entropy examples found by using the Small and Medium GPT-2 variants, respectively.

### 6.3 Examples of Memorized Content

We next manually analyze some categories of memorized content that we find particularly compelling. Additional examples are presented in Appendix D.

Personally Identifiable Information. There are several examples of individual people’s names, phone numbers, addresses, and social media accounts. Some of this memorized content is exclusive to just a few documents. For example, we extract the usernames of six users participating in an IRC conversation that appeared in exactly one document.

URLs. We identify examples of memorized URLs that correctly resolve to live webpages. Many of these URLs contain uncommon pieces of text, such as random numbers or base-64 encoded strings. We also identify several URLs that resolve correctly but we cannot identify their source (and we thus do not count them as “memorized” in our evaluation).

Code. We identify generated samples that contain snippets of memorized source code. Despite our ability to recover the source code verbatim, we are almost always unable to recover the original authorship notices or terms of use. Often, this information is given either before the code itself or in a LICENSE file that appears separately. For many of these samples, we can also extend their length and recover thousands of lines of (near verbatim) source code (see Section 6.4).

Unnatural Text. Memorization is not limited to natural-looking text. We find instances of random number sequences with at least 50 bits of entropy. For example, we extract the following UUID:

1e4bd2a8-e8c8-4a62-adcd-40a936480059

from the model; a Google search for this string identifies just 3 documents containing this UUID, and it is contained in just one GPT-2 training document (i.e., it is $1$-eidetic memorization). Other memorized random number sequences include UUIDs contained in only a few documents (not listed to preserve privacy), git commit hashes, random IDs used for ad tracking, and product model numbers.

Table 3 gives nine examples of $k = 1$ eidetic memorized content, each of which is a random sequence between 10 and 87 characters long. In each of these cases, the memorized example is contained in exactly one training document, and the total number of occurrences within that single document varies between just 10 and 311.

Data From Two Sources. We find samples that contain two or more snippets of memorized text that are unrelated to one another. In one example, GPT-2 generates a news article about the (real) murder of a woman in 2013, but then attributes the murder to one of the victims of a nightclub shooting in Orlando in 2016. Another sample starts with the memorized Instagram biography of a pornography producer, but then goes on to incorrectly describe an American fashion model as a pornography actress. This type of generation is not $k$-eidetic memorization (these independent pieces of information never appear in the same training documents), but it is an example of a contextual integrity violation.

Removed Content. Finally, GPT-2 memorizes content that has since been removed from the Internet, and is thus now primarily accessible through GPT-2. Some of this data is not particularly interesting in its own right, e.g., error logs due to a misconfigured webserver that has since been fixed. However, the fact that this type of memorization occurs highlights that LMs that are trained entirely on (at-the-time) public data may end up serving as an unintentional archive for removed data.

### 6.4 Extracting Longer Verbatim Sequences

In our previous experiments, we extract strings of 256 tokens in length. Here, we briefly investigate if we can extract longer sequences. In particular, we extend the length of some of the memorized sequences by seeding the model with each sample and continuing to generate. Rather than using sampling or greedy decoding, we instead use a beam-search-like decoding method introduced in prior work [6].

We can extend many of the memorized samples. For example, we identify a piece of source code taken from a repository on GitHub. We can extend this snippet to extract an entire file, namely 1450 lines of verbatim source code. We can also extract the entirety of the MIT, Creative Commons, and Project Gutenberg licenses. This indicates that while we have extracted memorized examples, we could likely extend many of these to much longer snippets of memorized content.

### 6.5 Memorization is Context-Dependent

Consistent with recent work on constructing effective “prompts” for generative LMs [5, 58], we find that the memorized content is highly dependent on the model’s context.

For example, GPT-2 will complete the prompt “3.14159” with the first 25 digits of $\pi$ correctly using greedy sampling. However, we find that GPT-2 “knows” (under Definition 2) more digits of $\pi$ because using the beam-search-like strategy introduced above extracts 500 digits correctly.

Interestingly, by providing the more descriptive prompt “pi is 3.14159”, straight greedy decoding gives the first 799 digits of $\pi$, more than with the sophisticated beam search. Further providing the context “e begins 2.7182818, pi begins 3.14159”, GPT-2 greedily completes the first 824 digits of $\pi$.

This example demonstrates the importance of the context: in the right setting, orders of magnitude more extraction is feasible than when the context is just slightly suboptimal. We find that this holds true for our memorized examples as well. None of the extracted samples found using Internet conditioning can be reliably reproduced when using the same prefix initially provided to GPT-2 that produced this sample. However, nearly all can be reproduced with high probability if we provided the entire sequence of data up to (but not including) the beginning of the memorized content.

The important lesson here is that our work vastly under-estimates the true amount of content GPT-2 memorizes. There are likely prompts that would identify much more memorized content, but because we stick to simple prompts we do not find this memorized content.

## 7 Correlating Memorization with Model Size & Insertion Frequency

Thus far, we have shown that language models can memorize verbatim training strings, even when they are trained for few epochs and achieve small train-test accuracy gaps. A natural question is how many times a string must appear for it to be memorized (i.e., $k$ in Definition 2). Prior work has investigated LM memorization by varying the number of times particular “canary” tokens were inserted into a training dataset [6]. The main limitation of this approach is that it is synthetic: canaries are inserted artificially after the dataset has been collected and may not be representative of natural data.

Here, we study how well GPT-2 memorizes naturally occurring canaries in the training data. In particular, we consider a piece of memorized content with the following prefix:

"color":"fuchsia","link":"https://www.reddit.com/r/The_Donald/comments/

The reddit.com URL above is completed by a specific 6-character article ID and a title. We located URLs in this specific format in a single document on pastebin.com. Each URL appears a varying number of times in this document, and hence in the GPT-2 training dataset. Table 4 shows a subset of the URLs that appear more than once, and their respective counts in the document. This allows us to ask the question: how many times must an example appear in the training dataset for us to extract it?

Methods. We attempt two approaches to extract URLs of this format, and run three variants of GPT-2 (XL, Medium, and Small). The two approaches vary the “difficulty” of the attack, so even if the more difficult fails the easier may succeed.

First, we directly prompt each variant of GPT-2 with the prefix above, and use top-$n$ sampling to generate possible extensions. Then, we test whether any of the URLs in the training document were among those that were emitted by GPT-2. We count a URL as emitted if it matches verbatim with one of the generations.
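The verbatim-match test can be sketched as a simple substring check. The article IDs, titles, and generations below are made-up placeholders, not entries from the actual training document.

```python
def emitted_urls(target_urls, generations):
    """Return the target URLs that appear verbatim in at least one generation."""
    return {url for url in target_urls if any(url in text for text in generations)}


PREFIX = "https://www.reddit.com/r/The_Donald/comments/"
# Hypothetical targets: 6-character ID followed by a title.
targets = [
    PREFIX + "aaaaaa/first_example_title/",
    PREFIX + "bbbbbb/second_example_title/",
]
# Hypothetical model generations; the second one is cut off mid-title.
generations = [
    '"link":"' + PREFIX + 'aaaaaa/first_example_title/","score":1',
    '"link":"' + PREFIX + 'bbbbbb/second_exam',
]
found = emitted_urls(targets, generations)
```

Only the first target counts as emitted here, since a partial match on the second URL does not satisfy the verbatim criterion.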

Some URLs are not extractable with this technique, and so we make the problem easier for GPT-2 by additionally providing GPT-2 the 6-character random token that begins each URL. Given this additional prefix, we then sample from the model using the beam search procedure. This task is easier in two ways: we have first provided more context and additionally use a higher recall sampling strategy.

Results. Table 4 summarizes the key results. Under the more difficult of the two approaches, the full-sized 1.5 billion parameter GPT-2 model emits all examples that are inserted 33 times or more, the medium-sized 345 million parameter memorizes half of the URLs, and the smallest 117 million parameter model memorizes none of these URLs.

When given the additional context and using beam search, the medium model can emit four more URLs, and the small model only emits the one URL that was inserted 359 times.

These results illustrate two fundamental lessons in LM memorization. First, larger models memorize significantly more training data: even hundreds of millions of parameters are not enough to memorize some of the training points. The ability of LMs to improve with model size has been extensively studied [32, 34]; we show a negative trend where these improvements come at the cost of decreased privacy. Second, for the largest LM, complete memorization occurs after just 33 insertions. This implies that any potentially sensitive information that is repeated a non-trivial number of times is at risk of memorization, even if it was only repeated multiple times in a single training document.

## 8 Mitigating Privacy Leakage in LMs

Now that we have shown that memorized training data can be extracted from LMs, a natural question is how to mitigate these threats. Here we describe several possible strategies.

Training With Differential Privacy. Differential privacy (DP) [11, 12] is a well-established notion of privacy that offers strong guarantees on the privacy of individual records in the training dataset. Private machine learning models can be trained with variants of the differentially private stochastic gradient descent (DP-SGD) algorithm [1] which is widely implemented [23, 15]. Large companies have even used DP in production machine learning models to protect users’ sensitive information [64, 13]. The tradeoffs between privacy and utility of models have been studied extensively: differentially-private training typically prevents models from capturing the long tails of the data distribution and thus hurts utility [62, 18, 17].
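To make the mechanism concrete, the following is a minimal sketch of a single DP-SGD update: per-example gradients are clipped to a fixed norm, summed, perturbed with Gaussian noise, and averaged. It is plain Python written for illustration only; the function name and parameters are ours, it does not reproduce the API of any DP library, and it omits the privacy accounting needed to state an actual guarantee.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DP-SGD update on a flat list of parameters."""
    dim = len(params)
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += grad[i] * scale
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


# Hypothetical gradients: the first example has norm 5 and gets clipped to norm 1.
grads = [[3.0, 4.0], [0.3, 0.4]]
noiseless = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                        noise_multiplier=0.0, rng=random.Random(0))
noisy = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                    noise_multiplier=1.0, rng=random.Random(0))
```

Clipping bounds how much any single example can move the parameters, which is also why rare, long-tail examples tend to be learned less well under this kind of training.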

In the context of language modeling, recent work demonstrates the privacy benefits of user-level DP models [52]. Unfortunately, this work requires labels for which users contributed each document; such labels are unavailable for data scraped from the open Web. It may instead seem natural to aim for DP guarantees at the granularity of individual web pages, but rare snippets of text (e.g., an individual’s name and contact information as in Figure 1) might appear in more than one web page. It is thus unclear how to apply DP in a principled and effective way on Web data.

Curating the Training Data. One cannot manually vet the extremely large training datasets used for training LMs. However, there are methods to limit the amount of sensitive content that is present, e.g., by identifying and filtering personal information or content with restrictive terms of use [54, 9].

Aside from attempting to remove sensitive content, it is also important to carefully de-duplicate the data. Many language modeling datasets are de-duplicated at the document- or paragraph-level, which means that a single document can still contain many repeated occurrences of a sensitive piece of content. We envision more sophisticated strategies to de-duplicate the training data, or limit the contribution of any single source of training data.

It is also vital to carefully source the training data. Many of the potentially-sensitive training examples that we extracted (e.g., individuals’ personal information) came from websites that are known to host sensitive content, e.g., pastebin is the 12th most popular domain in GPT-2’s training set.

Overall, sanitizing data is imperfect—some private data will always slip through—and thus it serves as a first line of defense and not an outright prevention against privacy leaks.

Limiting Impact of Memorization on Downstream Applications. In many downstream applications, e.g., dialogue systems [70] and summarization models [26], LMs are fine-tuned on task-specific data. On the positive side, this finetuning process may cause the LM to “forget” [37, 53] some of the data that is memorized during the pre-training stage. On the negative side, fine-tuning may introduce its own privacy leakages if the task-specific data also contains private information. An interesting direction for future work is to explore how memorization is inherited by fine-tuned models.

Auditing ML Models for Memorization. Finally, after mitigating privacy leaks, it is vital to audit models to empirically determine the privacy level they offer in practice [30]. Auditing is important even when using differential privacy, as it can complement theoretical upper bounds on privacy leakage [1]. We envision using our proposed methods, as well as existing attacks [60, 67, 30, 6], to audit LMs.

## 9 Lessons and Future Work

Extraction Attacks Are a Practical Threat. Prior work shows that ($100\times$ to $1000\times$ smaller) language models potentially memorize training data in semi-realistic settings [6]. Our results show that state-of-the-art LMs do memorize their training data in practice, and that adversaries can extract this data with simple techniques. Our attacks are practical even when the data contains a given sequence only a handful of times.

As our attacks interact with a language model as a black-box, our results approximate the worst-case behavior of language models when interacting with benign users. In particular, among 600,000 (honestly) generated samples, our attacks find that at least 604 (or 0.1%) contain memorized text.

Note that this is likely an extremely loose lower bound. We only manually inspected 1,800 potential candidate memorized samples; if we had started with more candidates we would likely have identified significantly more memorized content. Developing improved techniques for extracting memorized data, including attacks that are targeted towards specific content, is an interesting area for future work.

Memorization Does Not Require Overfitting. It is often believed that by preventing overfitting (i.e., reducing the train-test generalization gap) it is possible to prevent models from memorizing training data. However, large LMs have no significant train-test gap and yet we are still able to extract numerous examples verbatim from the training set. The key reason is that even though on average the training loss is only slightly lower than the validation loss, there are still some training examples that have anomalously low losses.

Larger Models Memorize More Data. Throughout our experiments, larger LMs consistently memorize more training data than smaller LMs. For example, in one setting the 1.5 billion parameter GPT-2 model memorizes over $18\times$ as much content as the 117 million parameter model (Section 7). Worryingly, it is likely that as LMs become bigger (they already have become $100\times$ larger than GPT-2 [5]), privacy leakage will become even more prevalent.

Memorization Can Be Hard to Discover. Much of the training data that we extract is only discovered when prompting the LM with a particular prefix. Currently, we simply attempt to use high-quality prefixes and hope that they might elicit memorization. Better prefix selection strategies [58] might identify more memorized data.

Adopt and Develop Mitigation Strategies. We discuss several directions for mitigating memorization in LMs, including training with differential privacy, vetting the training data for sensitive content, limiting the impact on downstream applications, and auditing LMs to test for memorization. All of these are interesting and promising avenues of future work, but each has weaknesses and is an incomplete solution to the full problem. Memorization in modern LMs must be addressed as new generations of LMs are emerging and becoming building blocks for a range of real-world applications.

## 10 Conclusion

For large language models to be widely adopted, they must address the training data memorization problems that we have identified. Our extraction attacks are practical and efficient, and can recover hundreds of training examples from a model, even when they are contained in just one training document.

Our analysis is best viewed as a cautionary tale of what could happen when training large LMs on sensitive data. Even though our attacks target GPT-2 (which allows us to ensure that our work is not harmful), the same techniques apply to any LM. Moreover, because memorization gets worse as LMs become larger, we expect that these vulnerabilities will become significantly more important in the future.

Training with differentially-private techniques is one method for mitigating privacy leakage; however, we believe that it will be necessary to develop new methods that can train models at this extreme scale (e.g., billions of parameters) without sacrificing model accuracy or training time. More generally, there are many open questions that we hope will be investigated further, including why models memorize, the dangers of memorization, and how to prevent memorization.

## Acknowledgements

We are grateful for comments on early versions of this paper by Dan Boneh, Andreas Terzis, Carey Radebaugh, Daphne Ippolito, Christine Robson, Kelly Cooke, Janel Thamkul, Austin Tarango, Jack Clark, Ilya Mironov, and Om Thakkar.

## Summary of Contributions

• Nicholas, Dawn, Ariel, Tom, Colin and Úlfar proposed the research question of extracting training data from GPT-2 and framed the threat model.

• Colin, Florian, Matthew, and Nicholas stated the memorization definitions.

• Florian, Ariel, and Nicholas wrote code to generate candidate memorized samples from GPT-2 and verify the ground truth memorization.

• Florian, Nicholas, Matthew, and Eric manually reviewed and categorized the candidate memorized content.

• Katherine, Florian, Eric, and Colin generated the figures.

• Adam, Matthew, and Eric ran preliminary investigations in language model memorization.

• Nicholas, Florian, Eric, Colin, Katherine, Matthew, Ariel, Alina, Úlfar, Dawn, and Adam wrote and edited the paper.

• Tom, Adam, and Colin gave advice on language models and machine learning background.

• Alina, Úlfar, and Dawn gave advice on the security goals.

## Appendix A Privacy Leakage from Text Tokenization

As described in Section 2.1, LMs represent text as a sequence of tokens from a vocabulary $\mathcal{V}$. Classically, the tokenization process occurs at a word-level. However, modern LMs make use of subword-level tokenization [57]. In particular, each word is broken up into one or more word pieces, e.g., the word pineapple might be represented as pine and apple.

To determine the subword units, it is common to build vocabularies using Byte Pair Encoding (BPE) [57]. First, the tokens in the vocabulary are initialized with every character that is present in the training data. Next, new tokens are iteratively added to the vocabulary by merging existing tokens using two steps:

1. Identify the pair of tokens that occurs most frequently in the training data.

2. Replace all occurrences of this substring with a new unique token.

This process is terminated when a desired number of merges occurs. The result of this procedure is a mapping from substrings of text to unique identifiers (50,257 in the case of GPT-2), with the property that every token is contained in the training data. Hereafter, we illustrate two ways in which this tokenization process can be exploited to extract memorized training data.
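The two merge steps above can be sketched as follows. This is a simplified character-level illustration that treats the whole corpus as one sequence; production BPE implementations (including GPT-2's byte-level variant) additionally handle word boundaries and byte encodings, which are omitted here. The function name is ours.

```python
from collections import Counter


def learn_bpe(corpus: str, num_merges: int):
    """Learn up to `num_merges` merges; return the merges and the final token list."""
    tokens = list(corpus)  # initial vocabulary: individual characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Step 1: identify the most frequent adjacent pair of tokens.
        (left, right), _count = pair_counts.most_common(1)[0]
        merges.append((left, right))
        # Step 2: replace every occurrence of the pair with a new token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = learn_bpe("aaabdaaabac", num_merges=1)
```

Because merges are driven purely by pair frequency, a string repeated many times in the corpus will eventually be merged into a single token, regardless of how many distinct documents it came from.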

### A.1 Improved Membership Inference by Checking Tokenization

Before being input into a modern LM, each example is tokenized into individual subword units according to the built vocabulary. As the procedure to convert any given word to its tokenization is a deterministic process, there is a single “correct” tokenization of every given word. For example, the word “goodness” could be split into the tokens “good” and “ness”, or perhaps the tokens “goo”, “d”, “ness”, or even have its own token “goodness” if the word occurs frequently enough in the training data. When given a string, a tokenizer will have a preference amongst the possible tokenizations. Typically, tokenizers will greedily select the longest possible subword, in this case, “goodness” if it is present in the vocab.

However, when the trained LM is used to generate text, it can emit the tokens “goo”, “d” and “ness” in sequence. If this were to happen, we would know that this is definitely not memorized text, because it would never have occurred that way in the training dataset.

More generally, given any sequence of tokens, it is possible to check if this sequence is consistent with the tokenization of the training data by decoding the tokens to text, re-tokenizing that text, and evaluating if the result equals the original token sequence. If not, then the input is not verbatim memorized text. This simple check gives a “free” reduction in false positives at no cost to the true positive rate.
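A minimal sketch of this consistency check, using a toy greedy longest-match tokenizer as described above. The vocabulary is made up for illustration, and real tokenizers (such as GPT-2's merge-based BPE) choose their canonical tokenization differently, so this only illustrates the round-trip idea.

```python
def greedy_tokenize(text, vocab):
    """Tokenize by repeatedly taking the longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens


def is_canonical(tokens, vocab):
    """True if re-tokenizing the decoded text reproduces the same token sequence."""
    return greedy_tokenize("".join(tokens), vocab) == list(tokens)


# Hypothetical vocabulary containing several ways to spell "goodness".
vocab = {"goodness", "good", "goo", "d", "ness"}
canonical = is_canonical(["goodness"], vocab)
split_up = is_canonical(["goo", "d", "ness"], vocab)
```

A generated sequence such as `["goo", "d", "ness"]` fails the check and can be discarded as a candidate for verbatim memorization before any more expensive metric is computed.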

### A.2 Eidetic Memorization in the Vocabulary

The token vocabulary itself can contain rare snippets of training data. As a model’s vocabulary is typically public, extracting these memorized snippets is trivial (e.g., OpenAI’s GPT-3 model, which is not publicly accessible, uses the same public vocabulary as GPT-2). The GPT-2 vocabulary contains 50,257 different word-pieces. This is a sufficiently large number that it is likely that the tails of the distribution will contain sequences of text that are (relatively) rare in the original training dataset.

Assuming that the text scrape follows an approximate Zipf’s law distribution, we should expect that the th most likely token should occur just times on the 40GB dataset that was used to construct the BPEs. Indeed, we show there are numerous tokens in the vocabulary that appear only a small number of times on the Web.

Memorized Usernames. There are BPE tokens for several usernames of individual people. For example, the Twitter handle for Donald Trump, realDonaldTrump, is represented by a single token in the encoding dictionary. However, this is not an instance of eidetic memorization, as this token is contained in thousands of webpages. In contrast, through manual review of the BPEs, we identify three BPE tokens that correspond to usernames of individual users on Reddit. These three tokens are otherwise unique on the Internet: Google searches yield 24, 29 and 34 results for each of these usernames; all results correspond to content related to these users.

Similarly, we identify one token that corresponds to the GitHub repository name of a particular user. This repository has only two “stars” on GitHub, and there are 40 results for this phrase contained on Google.

Software Logs and Configuration Files. We also identify several other examples of tokens that are unique to only a few documents. For example, the string EStreamFrame is present in the vocab, and Google searches for this string yield only seven results (one of which was a link to the GPT-2 BPE).

Why Does This Memorization Happen? By construction, the tokens in the BPE vocabulary correspond directly to the most commonly repeated substrings in the training data. Thus, the only way that these BPEs could have been added to the encoding is if they were repeated multiple times in the training dataset. Upon investigation, we found a single text file on pastebin that contains a collection of many Reddit usernames repeated several times. The three usernames described above appeared respectively 128, 60, and 484 times in this one text file. Similarly, there is a single document on the Web that contains the identified Github repository name repeated over times. It is thus not (too) surprising that these individual names would become encoded into the dictionary.

Generally speaking, this privacy leakage gets at the difference between some text appearing hundreds of times within one particular document versus the same text appearing in hundreds of different documents. The former case might still be problematic, while the latter case is much less likely to be. Unfortunately, common methods for constructing subword vocabularies do not differentiate between these two situations.

## Appendix B Categorization of Memorized Data

Table 5 describes the high-level categories that we assigned to the 604 memorized samples extracted from GPT-2. Note that a single sample can belong to multiple categories. Tables 6 and 7 show the categorization broken down by attack strategy.

## Appendix C Distribution of Model Perplexities

Figure 4 shows the distribution of the perplexities of samples generated with each of our three text generation strategies and ordered based on our six membership inference strategies.

## Appendix D Additional Case Studies of Memorization

Here we present additional results from our manual analysis of the memorized content.

Memorized Leaked Podesta Emails from WikiLeaks. We identify several memorized URLs that originated from the leaked Podesta Emails available on WikiLeaks (see footnote 14). There is only a single training document that contains these memorized URLs. Due to the nature of email, the text of one message is often included in subsequent replies to this email. As a result, a URL that is used (intentionally) only once can be included in the dataset tens of times due to the replies.

Memorized Donald Trump Quotes and Tweets. The GPT-2 training dataset was collected when the 2016 US Presidential election was often in the news. As a result, we find several instances of memorized quotes from Donald Trump, both in the form of official remarks made as President (found in the official government records), as well as statements made on Twitter.

Memorized Promotional Content. We extract memorized samples of promotional content, such as advertisements for books, beauty products, and software products. One of these samples includes a link to an author’s valid Patreon account, along with a list of named and pseudonymous prior donors.

Memorized Number Sequences. We identify many examples where GPT-2 emits common number sequences. Nearly ten examples contain the integers counting up from some specific value. We also find examples of GPT-2 counting the squares 1, 2, 4, 8, 16, 25, 36, Fibonacci numbers 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, or digits of π, 3.14159265358979323846264. None of these examples should be unexpected, but the quantity of memorized number sequences was surprising to us.

Memorized News Headlines. Numerous memorized text snippets are verbatim copies of news articles and headlines. A large number of these memorized samples are attributed to a single source: thehill.com, an American news website. Interestingly, most of these samples follow the exact same template: (1) they contain a list of different news headlines separated by a “pipe” symbol (|), (2) the sample begins with two merged words, e.g., “TrumpJesuit”, (3) the headline list ends with the all-caps word “MORE”, and (4) the sample contains the all-caps word “ADVERTISEMENT”.

We indeed find pages on the Web that contain copies of headlines from thehill.com under this exact template. The peculiarities of these snippets likely contributed to their memorization. For example, the token TrumpJesuit does not appear in any other context on the entire Web.

Memorized Base-64 Content. One particularly interesting form of memorization that we identify is the ability of GPT-2 to emit base-64 encoded content. For example, we extract the following sequence from the model:

    bWFzdGVyfGltYWdlc3w3OTkxOXxpbWFnZS9wbmd
    8aW1hZ2VzL2hkZS9oMDQvODg0NTY3MjYxMTg3MC
    5wbmd8ZmFkMTMlNmFiYWJhZjFiMjJlYTAyNzU0Z


which decodes to the sequence “master|images|79919|image/png|images/hde/h04/8845672611870.png|…”. Despite our attempts, we are unable to identify where this content originates.
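The decoding can be reproduced with a few lines of standard-library Python. This is a minimal sketch: the variable names are our own, and because the extracted sequence is cut off mid-group, the sketch assumes it is acceptable to drop the incomplete trailing characters before decoding.

```python
import base64

# The extracted sequence shown above, joined into a single string.
encoded = (
    "bWFzdGVyfGltYWdlc3w3OTkxOXxpbWFnZS9wbmd"
    "8aW1hZ2VzL2hkZS9oMDQvODg0NTY3MjYxMTg3MC"
    "5wbmd8ZmFkMTMlNmFiYWJhZjFiMjJlYTAyNzU0Z"
)

# Base-64 is decoded in groups of four characters; the sample is truncated,
# so keep only the complete groups (assumption: the remainder is discarded).
usable = len(encoded) - (len(encoded) % 4)
decoded = base64.b64decode(encoded[:usable]).decode("ascii", errors="replace")

print(decoded)
```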

### Footnotes

1. For notational clarity, we write top-n instead of the more common top-k because we will use the constant k for a separate purpose.
2. Personal communication with the authors.
3. Eidetic memory (more commonly called photographic memory) is the ability to recall information after seeing it only once.
4. This definition admits certain pathological corner cases. For example, many LMs when prompted with the sequence “Repeat the following sentence: _____.” will do so correctly. This technically allows any string to be known under our definition. Simple refinements of this definition do not solve the issue, as LMs can also be asked to, for example, down-case a particular sentence. We circumvent these pathological cases by prompting LMs only with short prefixes.
5. Since the training data is sourced from the public Web, all the outputs of our extraction attacks can also be found via Internet searches. Indeed, to evaluate whether we have found memorized content, we search for the content on the Internet and are able to find these examples relatively easily.
6. http://commoncrawl.org/
7. It is possible there is some intersection between these two datasets, effectively allowing this strategy to “cheat”. We believe this does not considerably affect results. First, any overlap between the two datasets is rare on average. Second, because we only use the first or tokens of each sample, any possible overlap will be small in absolute terms.
8. Chosen after a cursory hyper-parameter sweep and manual analysis.
9. To favor low-ranked samples, while also exploring some of the higher-ranked samples, we select the samples so that the fraction of selected samples with rank below is .
10. We estimate the entropy through manual analysis by guessing the entropy space given the format of the string.
11. The purpose of this text dump was to tag users of Reddit who posted frequently on specific topics. In doing so, this page repeats some of the same links many times because many users comment on the same links.
12. We confirmed with OpenAI that the counts here are within 5% of the true counts of these URLs in the training data.
13. As described in Section 3.4, we intentionally omit the personal information of these users.
14. https://en.wikipedia.org/wiki/Podesta_emails

### References

1. M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar and L. Zhang (2016) Deep learning with differential privacy. In ACM CCS, Cited by: §1, §2.2, §8, §8.
2. J. Alammar (2018) The illustrated transformer. Visualizing Machine Learning One Concept at a Time. Cited by: §2.1.
3. D. Bahdanau, K. Cho and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §2.1.
4. Y. Bengio, R. Ducharme, P. Vincent and C. Jauvin (2003) A neural probabilistic language model. JMLR. Cited by: §2.1.
5. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry and A. Askell (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1, §1, §2.1, §2.1, §2.2, §3.2, §3, §6.5, §9.
6. N. Carlini, C. Liu, Ú. Erlingsson, J. Kos and D. Song (2019) The secret sharer: evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium, Cited by: §1, §2.2, §3.1, §6.4, §7, §8, §9.
7. K. Chaudhuri and C. Monteleoni (2009) Privacy-preserving logistic regression. In NIPS, Cited by: §2.2.
8. M. X. Chen, B. N. Lee, G. Bansal, Y. Cao, S. Zhang, J. Lu, J. Tsay, Y. Wang, A. M. Dai, Z. Chen, T. Sohn and Y. Wu (2019) Gmail smart compose: Real-Time assisted writing. In KDD, Cited by: §1, §3.2, §3.3.
9. A. Continella, Y. Fratantonio, M. Lindorfer, A. Puccetti, A. Zand, C. Kruegel and G. Vigna (2017) Obfuscation-Resilient Privacy Leak Detection for Mobile Apps Through Differential Analysis. In NDSS, Cited by: §8.
10. J. Devlin, M. Chang, K. Lee and K. Toutanova (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §2.1.
11. C. Dwork, F. McSherry, K. Nissim and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In TCC, Cited by: §8.
12. C. Dwork (2008) Differential privacy: a survey of results. In TAMC, Cited by: §8.
13. Ú. Erlingsson, V. Pihur and A. Korolova (2014) RAPPOR: randomized aggregatable privacy-preserving ordinal response. In ACM CCS, Cited by: §8.
14. A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau and S. Thrun (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature. Cited by: §1.
15. Facebook. Opacus. https://github.com/pytorch/opacus. Cited by: §8.
16. A. Fan, M. Lewis and Y. Dauphin (2018) Hierarchical neural story generation. In ACL, Cited by: §2.1.
17. V. Feldman and C. Zhang (2020) What neural networks memorize and why: Discovering the long tail via influence estimation. In NeurIPS, Cited by: §2.2, §8.
18. V. Feldman (2020) Does learning require memorization? A short tale about a long tail. In STOC, Cited by: §2.2, §8.
19. M. Fredrikson, S. Jha and T. Ristenpart (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In ACM CCS, Cited by: §2.2.
20. J. Gailly and M. Adler. zlib compression library. Cited by: §5.2.
21. A. Gokaslan and V. Cohen (2019) OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus. Cited by: §3.4.
22. S. Goldwasser, S. Micali and C. Rackoff (1989) The knowledge complexity of interactive proof systems. SICOMP. Cited by: §3.1.1.
23. Google. Tensorflow Privacy. https://github.com/tensorflow/privacy. Cited by: §8.
24. A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §2.1.
25. S. Hisamoto, M. Post and K. Duh (2020) Membership inference attacks on sequence-to-sequence models: Is my data in your machine translation system? In TACL, Cited by: §2.2.
26. A. Hoang, A. Bosselut, A. Celikyilmaz and Y. Choi (2019) Efficient adaptation of pretrained transformers for abstractive summarization. arXiv preprint arXiv:1906.00138. Cited by: §1, §8.
27. A. Holtzman, J. Buys, M. Forbes and Y. Choi (2020) The curious case of neural text degeneration. In ICLR, Cited by: §4.3, 2nd item.
28. J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In ACL, Cited by: §2.1.
29. D. Ippolito, D. Duckworth, C. Callison-Burch and D. Eck (2020) Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1808–1822. Cited by: §5.1.
30. M. Jagielski, J. Ullman and A. Oprea (2020) Auditing differentially private machine learning: how private is private SGD? In NeurIPS, Cited by: §8.
31. B. Jayaraman and D. Evans (2019) Evaluating differentially private machine learning in practice. In USENIX Security Symposium, Cited by: §2.2.
32. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Cited by: §2.2, §7.
33. J. Li, M. Galley, C. Brockett, J. Gao and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In NAACL, Cited by: 2nd item.
34. Z. Li, E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein and J. E. Gonzalez (2020) Train large, then compress: rethinking model size for efficient training and inference of transformers. In ICML, Cited by: §2.2, §7.
35. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
36. Y. Long, V. Bindschaedler, L. Wang, D. Bu, X. Wang, H. Tang, C. A. Gunter and K. Chen (2018) Understanding membership inferences on well-generalized learning models. arXiv preprint arXiv:1802.04889. Cited by: §9.
37. M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §8.
38. H. B. McMahan, D. Ramage, K. Talwar and L. Zhang (2018) Learning differentially private recurrent language models. In ICLR, Cited by: §2.2.
39. T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ and S. Khudanpur (2010) Recurrent neural network based language model. In Interspeech, Cited by: §2.1, §2.1.
40. R. Munroe (2019) Predictive models. https://xkcd.com/2169/. Cited by: §1.
41. M. Nasr, R. Shokri and A. Houmansadr (2018) Machine learning with membership privacy using adversarial regularization. In ACM SIGSAC, Cited by: §4.2.
42. M. Nasr, R. Shokri and A. Houmansadr (2019) Comprehensive privacy analysis of deep learning: passive and active white-box inference attacks against centralized and federated learning. In IEEE S&P, Cited by: §1, §2.2.
43. H. Nissenbaum (2004) Privacy as contextual integrity. Washington Law Review. Cited by: §3.3.
44. OpenAI (2020) Language models are few-shot learners. https://github.com/openai/gpt-3. Cited by: §3.3.
45. X. Pan, M. Zhang, S. Ji and M. Yang (2020) Privacy risks of general-purpose language models. In IEEE S&P, Cited by: §2.2.
46. M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, Cited by: §2.1.
47. F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller and S. Riedel (2019) Language models as knowledge bases? In EMNLP, Cited by: §3.1.
48. A. Radford, K. Narasimhan, T. Salimans and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §2.1, §2.1.
49. A. Radford, J. Wu, D. Amodei, D. Amodei, J. Clark, M. Brundage and I. Sutskever (2019) Better language models and their implications. OpenAI Blog. Cited by: §1, §1, §2.1, §2.2, §3.
50. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §1, §2.1, §3.2, §3.4.
51. C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. In JMLR, Cited by: §1, §1, §2.1, §2.1, §2.2, §5.1.2.
52. S. Ramaswamy, O. Thakkar, R. Mathews, G. Andrew, H. B. McMahan and F. Beaufays (2020) Training production language models without memorizing user data. arXiv preprint arXiv:2009.10031. Cited by: §8.
53. R. Ratcliff (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review 97 (2), pp. 285. Cited by: §8.
54. J. Ren, A. Rao, M. Lindorfer, A. Legout and D. Choffnes (2016) ReCon: revealing and controlling PII leaks in mobile network traffic. In MobiSys, Cited by: §8.
55. A. Roberts, C. Raffel and N. Shazeer (2020) How much knowledge can you pack into the parameters of a language model? In EMNLP, Cited by: §3.1.
56. B. I. Rubinstein, P. L. Bartlett, L. Huang and N. Taft (2012) Learning in a large function space: privacy-preserving mechanisms for SVM learning. Privacy and Confidentiality. Cited by: §2.2.
57. R. Sennrich, B. Haddow and A. Birch (2016) Neural machine translation of rare words with subword units. In ACL, Cited by: Appendix A, Appendix A.
58. T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace and S. Singh (2020) AutoPrompt: eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980. Cited by: §6.5, §9.
59. R. Shokri and V. Shmatikov (2015) Privacy-preserving deep learning. In ACM CCS, Cited by: §2.2.
60. R. Shokri, M. Stronati, C. Song and V. Shmatikov (2017) Membership inference attacks against machine learning models. In IEEE S&P, Cited by: §1, §1, §2.2, §4.2, §8.
61. C. Song and A. Raghunathan (2020) Information leakage in embedding models. In ACM CCS, Cited by: §2.2.
62. C. Song and V. Shmatikov (2018) Auditing data provenance in text-generation models. In KDD, Cited by: §2.2, §2.2, §8.
63. O. Thakkar, S. Ramaswamy, R. Mathews and F. Beaufays (2020) Understanding unintended memorization in federated learning. arXiv preprint arXiv:2006.07490. Cited by: §2.2.
64. A. G. Thakurta, A. H. Vyrros, U. S. Vaishampayan, G. Kapoor, J. Freudiger, V. R. Sridhar and D. Davidson (2017) Learning new words. Google Patents. US Patent 9,594,741. Cited by: §8.
65. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §2.1.
66. K. Walsh (2020) USPTO request for comments on intellectual property protection for artificial intelligence innovation – public comment by the electronic frontier foundation. https://www.uspto.gov/sites/default/files/documents/Electronic%20Frontier%20Foundation_RFC-84-FR-58141.PDF. Cited by: §1, §3.
67. S. Yeom, I. Giacomelli, M. Fredrikson and S. Jha (2018) Privacy risk in machine learning: analyzing the connection to overfitting. In IEEE CSF, Cited by: §1, §3, §8.
68. R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner and Y. Choi (2019) Defending against neural fake news. In NeurIPS, Cited by: §1.
69. C. Zhang, S. Bengio, M. Hardt, B. Recht and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. ICLR. Cited by: §1.
70. Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu and B. Dolan (2020) DialoGPT: Large-scale generative pre-training for conversational response generation. In ACL Demo Track, Cited by: §1, §8.