CTRL: A Conditional Transformer Language Model for Controllable Generation
Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.6 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at github.com/salesforce/ctrl.
With enough data, model capacity, and compute, generative models can learn distributions powerful enough to produce high-quality samples from complex domains. In computer vision, the advent of generative adversarial networks (Goodfellow et al., 2014) improved image generation. Much research then focused on methods for controlling the generation process and improving estimation of generative distributions (Arjovsky et al., 2017; Chen et al., 2016; Kingma and Welling, 2013).
In natural language processing, language models are often trained as conditional language models for specific tasks that require text generation (Brants et al., 2007; Sutskever et al., 2014; Rush et al., 2015). They are also used as a means of learning word vectors (Mikolov et al., 2013), document vectors (Kiros et al., 2015), or contextualized word vectors (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2018) for transfer learning. The language models themselves have been transferred to new tasks through fine-tuning as well (Radford et al., 2018; Howard and Ruder, 2018). Less is understood about generation that is not constrained to any specific task. Typically prompts generated by models (Fan et al., 2018) or written by humans can only be used to provide a rough guide or starting point for the generated text. This raises the question of how text generation can be controlled more explicitly.
Inspired by the degree of control available in image generation as well as the recent progress in text generation (Radford et al., 2019) and multitask learning (McCann et al., 2018), we train a language model that is conditioned on a variety of control codes (Pfaff, 1979; Poplack, 1980) that make desired features of generated text more explicit. With 1.6 billion parameters, our Conditional Transformer Language (CTRL) model can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior. To preserve the generality of the language model trained in an unsupervised setting, we train CTRL on control codes derived from structure that naturally co-occurs with the raw text typically collected for training large language models. For example, large resources like Wikipedia, Project Gutenberg, and Amazon Reviews can each be assigned a domain-related control code. Smaller resources, like the content extracted from individual subreddits, often occur with both a broader domain name, reddit, as well as subdomain information, r/subdomain. In the vast majority of cases, text collected for training is associated with a URL, which often contains information pertinent to the text it represents. Humans can use these codes to trigger generation of text from different linguistic communities without having to understand how to prompt with particular linguistic patterns. Text can be generated in more predictable ways by controlling for content or changing the domain even when the initial prompt remains fixed.
Because all control codes can be traced back to a particular subset of the training data, CTRL can be used to predict the subset of training data that is most likely given a sequence. This explicit relationship between CTRL and its training data can be exploited to analyze the correlations that the language model has learned from each domain, and it provides a means of studying large amounts of text through the language model.
These control codes also allow for the straightforward inclusion of task-specific data in a way that improves important skills without harming the generality of the model. Control codes for question answering and machine translation make these skills easily accessible with CTRL. These codes can be combined during generation to create novel cross-over between control codes that specify task-specific behavior and those that are related to domain and content.
In order to push towards more controllable, general models for natural language processing, we have released multiple full-sized, pretrained versions of CTRL at github.com/salesforce/ctrl. We hope that the release leads to further research into how controllable generation can enhance natural language understanding.
2 Language Modeling
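Given example sequences of the form $x = (x_1, \dots, x_n)$ where each $x_i$ comes from a fixed set of symbols, the goal of language modeling is to learn $p(x)$. Because $x$ is a sequence, it is natural to factorize this distribution using the chain rule of probability (Bengio et al., 2003):
$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$$
This decomposes language modeling into next-word prediction. Current state-of-the-art methods (Dai et al., 2019; Radford et al., 2019) train a neural network with parameters $\theta$ to minimize the negative log-likelihood over a dataset $D = \{x^1, \dots, x^{|D|}\}$:
$$\mathcal{L}(D) = -\sum_{k=1}^{|D|} \sum_{i=1}^{n} \log p_\theta(x_i^k \mid x_{<i}^k)$$
Because language models learn $p_\theta(x_i \mid x_{<i})$, a new $\tilde{x}$ of length $m$ can be generated by sequentially sampling its constituent symbols: $p_\theta(x_1), p_\theta(x_2 \mid \tilde{x}_1), \dots, p_\theta(x_m \mid \tilde{x}_{<m})$.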
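The factorization and the sequential sampling loop can be illustrated with a toy model. The following is a minimal sketch and not the paper's implementation: the bigram table `BIGRAM`, the `<s>` and `</s>` boundary symbols, and the helper names are hypothetical stand-ins for a trained network.

```python
import math
import random

# Hypothetical toy "model": next-symbol probabilities given only the previous
# symbol. A real language model conditions on the whole prefix x_{<i}.
BIGRAM = {
    "<s>": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.1, "b": 0.6, "</s>": 0.3},
    "b": {"a": 0.7, "b": 0.1, "</s>": 0.2},
}

def next_symbol_probs(prefix):
    """Return p(x_i | x_{<i}) for the toy model (uses the last symbol only)."""
    return BIGRAM[prefix[-1]]

def negative_log_likelihood(sequence):
    """Chain rule: -log p(x) = -sum_i log p(x_i | x_{<i})."""
    prefix = ["<s>"]
    nll = 0.0
    for symbol in sequence:
        nll -= math.log(next_symbol_probs(prefix)[symbol])
        prefix.append(symbol)
    return nll

def sample_sequence(max_length, rng):
    """Generate by sampling one symbol at a time from p(x_i | x_{<i})."""
    prefix = ["<s>"]
    for _ in range(max_length):
        probs = next_symbol_probs(prefix)
        symbols = list(probs)
        symbol = rng.choices(symbols, weights=[probs[s] for s in symbols])[0]
        if symbol == "</s>":
            break
        prefix.append(symbol)
    return prefix[1:]
```

Summing `negative_log_likelihood` over every sequence in a dataset gives the training objective; `sample_sequence(10, random.Random(0))` draws one sequence of at most ten symbols.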
3 Language Modeling with CTRL
CTRL is a conditional language model that is always conditioned on a control code $c$ and learns the distribution $p(x \mid c)$. The distribution can still be decomposed using the chain rule of probability and trained with a loss that takes the control code into account:
$$p(x \mid c) = \prod_{i=1}^{n} p(x_i \mid x_{<i}, c), \qquad \mathcal{L}(D) = -\sum_{k=1}^{|D|} \sum_{i=1}^{n} \log p_\theta(x_i^k \mid x_{<i}^k, c^k)$$
The control code $c$ provides a point of control over the generation process. This is true even when sampling the first token $x_1$, in contrast to the traditional language modeling framework described in Sec. 2.
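CTRL learns $p_\theta(x_i \mid x_{<i}, c)$ by training on sequences of raw text prepended with control codes. After minimal preprocessing (described in Sec. 3.2), a single example sequence containing $n$ tokens is embedded as a sequence of $n$ corresponding vectors in $\mathbb{R}^d$. Each vector is the sum of a learned token embedding and a sinusoidal positional embedding as in the original Transformer architecture (Vaswani et al., 2017). This sequence of vectors is stacked into a matrix $X_0 \in \mathbb{R}^{n \times d}$ so that it can be processed by $l$ attention layers (Vaswani et al., 2017). The $i$th layer consists of two blocks, each of which preserves the model dimension $d$.
The core of the first block is multi-head attention with $k$ heads that uses a causal mask to preclude attending to future tokens:
$$\mathrm{Attention}(X, Y, Z) = \mathrm{softmax}\left(\frac{\mathrm{mask}(XY^\top)}{\sqrt{d}}\right) Z$$
$$\mathrm{MultiHead}(X, k) = [h_1; \cdots; h_k]\, W_o \quad \text{where } h_j = \mathrm{Attention}(XW_j^1, XW_j^2, XW_j^3)$$
The core of the second block is a feedforward network with ReLU activation (Nair and Hinton, 2010) that projects inputs to an inner dimension $f$, with parameters $U \in \mathbb{R}^{d \times f}$ and $V \in \mathbb{R}^{f \times d}$:
$$\mathrm{FF}(X) = \max(0, XU)\, V$$
Each block precedes its core functionality with layer normalization (Ba et al., 2016; Child et al., 2019) and follows it with a residual connection. Together, they yield $X_{i+1}$:
|Block 1||Block 2|
|$\bar{X}_i = \mathrm{LayerNorm}(X_i)$||$\bar{H}_i = \mathrm{LayerNorm}(H_i)$|
|$H_i = \mathrm{MultiHead}(\bar{X}_i) + \bar{X}_i$||$X_{i+1} = \mathrm{FF}(\bar{H}_i) + \bar{H}_i$|
Scores for each token in the vocabulary are computed from the output of the last layer:
$$\mathrm{Scores}(X_0) = \mathrm{LayerNorm}(X_l)\, W_{\mathrm{vocab}}$$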
During training, these scores are the inputs of a cross-entropy loss function. During generation, the scores corresponding to the final token are normalized with a softmax, yielding a distribution for sampling a new token.
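The effect of the causal mask can be checked on a small example. Below is a minimal single-head sketch in plain Python. The function names are hypothetical, and the learned projections, multiple heads, layer normalization, and residual connections are all omitted, so this illustrates only the masking and normalization step rather than the CTRL implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head attention where position t attends only to positions <= t.

    queries, keys, values: lists of n vectors (lists of floats) of dimension d.
    Learned projections are omitted to keep the sketch short.
    """
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Causal mask: only keys at positions 0..t are visible to position t.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs
```

Because each position sees only its own prefix, changing a later token leaves the outputs at earlier positions unchanged, which is what allows next-token prediction to be trained on all positions of a sequence at once.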
3.1 Data
We train on 140 GB of text drawing from a wide variety of domains: Wikipedia (En, De, Es, Fr), Project Gutenberg (we use a modified version of https://github.com/chiphuyen/lazynlp), submissions from 45 subreddits, OpenWebText (we use a modified version of https://github.com/jcpeterson/openwebtext.git), a large collection of news data (Hermann et al., 2015; Barrault et al., 2019; Sandhaus, 2008; Grusky et al., 2018), Amazon Reviews (McAuley et al., 2015), Europarl and UN data from WMT (En-De, En-Es, En-Fr) (Barrault et al., 2019), question-answer pairs (no context documents) from ELI5 (Fan et al., 2019) and the MRQA shared task (https://github.com/mrqa/MRQA-Shared-Task-2019), which includes the Stanford Question Answering Dataset (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2016), TriviaQA (Joshi et al., 2017), SearchQA (Dunn et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). A full account of training data and associated control codes can be found in Table 7 in the Appendix.
3.2 Experimental Settings
We learn BPE (Sennrich et al., 2015) codes and tokenize the data using fastBPE (https://github.com/glample/fastBPE), but we use a large vocabulary of roughly 250K tokens. This includes the sub-word tokens necessary to mitigate problems with rare words, but it also reduces the average number of tokens required to generate long text by including most common words. We use English Wikipedia and a 5% split of our collected OpenWebText data for learning BPE codes. We also introduce an unknown token so that during preprocessing we can filter out sequences that contain more than 2 unknown tokens. This, along with the compressed storage for efficient training (TFRecords) (Abadi et al., 2016), reduces our training data to 140 GB from the total 180 GB collected.
Data was treated as a single stream of tokens with non-domain control codes inserted where appropriate (often at document boundaries). The stream was chunked into contiguous sequences of tokens. Each sequence originates from a domain and has the corresponding domain control code prepended as the first token in the sequence. In this way, domain control codes receive special treatment: they are propagated to all text in the domain as the first token. This is similar to how codes and natural language sequences have been used in multi-task settings (Wu et al., 2016; Johnson et al., 2017; McCann et al., 2018) to control conditional language models. All other control codes are injected into the data without such special treatment.
We experimented with sequence lengths of 256 and 512 due to memory and optimization constraints. Despite training on relatively short sequences compared to other approaches, we found that a sliding-window approach allows for generation beyond these windows, and we also found little difference in quality between the two models within the first 256 tokens. Further, we note that our vocabulary is approximately 4 times larger than similar approaches, hence the effective sequence length in characters is comparable.
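The chunking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released preprocessing code: the function names are hypothetical, the control code is assumed to occupy one of the `sequence_length` slots, a short final chunk is kept, and the unknown-token filter is applied to the finished chunks.

```python
def chunk_stream(tokens, domain_code, sequence_length):
    """Split a token stream into contiguous sequences, each starting with
    the domain control code as its first token."""
    body = sequence_length - 1  # assumption: one slot is used by the control code
    if body < 1:
        raise ValueError("sequence_length must be at least 2")
    return [
        [domain_code] + tokens[start:start + body]
        for start in range(0, len(tokens), body)
    ]

def drop_noisy(chunks, unknown_token, max_unknown):
    """Discard sequences containing more than max_unknown unknown tokens."""
    return [c for c in chunks if c.count(unknown_token) <= max_unknown]
```

For example, `chunk_stream(list("abcdefg"), "Wikipedia", 4)` yields three sequences that each begin with `"Wikipedia"`, so every training example carries its domain code regardless of where it falls in the original document.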
CTRL has model dimension $d = 1280$, inner dimension $f = 8192$, 48 layers, and 16 heads per layer. Dropout with probability 0.1 follows the residual connections in each layer. Token embeddings were tied with the final output embedding layer (Inan et al., 2016; Press and Wolf, 2016).
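As a rough sanity check on the 1.6 billion parameter figure, the weight matrices can be counted directly. The sketch below assumes the configuration $d = 1280$, $f = 8192$, 48 layers, and a vocabulary rounded to 250,000 tokens; it ignores biases and layer-normalization parameters, and the function name is hypothetical.

```python
def approx_parameter_count(d, f, layers, vocab_size):
    """Count weight-matrix parameters of a Transformer language model
    with tied input and output embeddings (biases and LayerNorm ignored)."""
    attention = 4 * d * d          # query, key, value, and output projections
    feedforward = 2 * d * f        # U (d x f) and V (f x d)
    embeddings = vocab_size * d    # shared with the output layer
    return layers * (attention + feedforward) + embeddings

total = approx_parameter_count(d=1280, f=8192, layers=48, vocab_size=250_000)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions roughly 1.3 billion parameters sit in the layers and about 0.3 billion in the embedding table, which shows how much of the budget the large vocabulary consumes.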
CTRL was implemented in TensorFlow (Abadi et al., 2016) and trained with a global batch size of 1024 distributed across 256 cores of a Cloud TPU v3 Pod for 800k iterations. Training took approximately 2 weeks using Adagrad (Duchi et al., 2011) with a linear warmup from 0 to 0.05 over 25k steps. The norm of gradients was clipped to 0.25 as in Merity et al. (2017). Learning rate decay was not necessary due to the monotonic nature of the Adagrad accumulator. We compared to the Adam optimizer (Kingma and Ba, 2014) while training smaller models, but we noticed comparable convergence rates and significant memory savings with Adagrad. We also experimented with explicit memory-saving optimizers including SM3 (Anil et al., 2019), Adafactor (Shazeer and Stern, 2018), and NovoGrad (Ginsburg et al., 2019) with mixed results.
4 Controllable Generation
4.1 Sampling
Typically, temperature-controlled stochastic sampling methods are used for generating text from a trained language model. It is also common to limit the sampling only to the top-$k$ alternatives. Given a temperature $T > 0$ and scores $x_i \in \mathbb{R}$ for each token $i$ in the vocabulary, the probability of predicting the $i$th token is given by:
$$p_i = \frac{\exp(x_i / T)}{\sum_j \exp(x_j / T)} \tag{1}$$
The next token is then chosen by sampling from a multinomial distribution with probabilities clipped at the top-$k$ tokens. In the equation above, $T \to 0$ approximates a greedy distribution, which magnifies the peaks in the probability distribution, while $T \to \infty$ flattens the distribution to make it more uniform. Rather than choosing a fixed value of $k$, as is common practice, Holtzman et al. (2019) suggested adapting $k$ heuristically. The nucleus sampling approach chooses a probability threshold $p_t$ and sets $k$ to be the lowest value such that $\sum_i \mathrm{sort}(p_i) > p_t$. If the model is confident in its next-word prediction, then $k$ will be lower and vice versa. Despite the improved generative capabilities of models with such heuristics, there still exists a trade-off between these parameters depending on the generation intended.
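These sampling heuristics are simple to state in code. The following is a minimal sketch in plain Python with hypothetical function names; it operates on a list of scores for a single decoding step and is not taken from the CTRL release.

```python
import math
import random

def temperature_softmax(scores, temperature):
    """Softmax of scores divided by a temperature T > 0."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_k(probs, threshold):
    """Smallest k whose top-k probabilities sum to more than the threshold."""
    cumulative = 0.0
    for k, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative > threshold:
            return k
    return len(probs)

def sample_token(scores, temperature, k, rng):
    """Sample a token index from the top-k, temperature-scaled distribution."""
    kept = top_k_filter(temperature_softmax(scores, temperature), k)
    tokens = list(kept)
    return rng.choices(tokens, weights=[kept[t] for t in tokens])[0]
```

Nucleus sampling then amounts to calling `sample_token` with `k = nucleus_k(probs, threshold)` recomputed at every step, so a peaked distribution yields a small `k` and a flat one yields a large `k`.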
Given a prompt: Q: What is the capital of Australia?, a well-trained model assigns higher probability mass to the correct answer, Canberra, but a non-zero probability mass to other cities such as Melbourne, Sydney, Brisbane, Darwin, and Perth, see Figure 1.
By choosing to sample, we mistrust the model, despite it being correct. A natural solution to this is to choose the next token greedily. However, this is known to create repetitions of phrases or sentences even for large well-trained models (Radford et al., 2019; Holtzman et al., 2019). To reconcile the two, we propose a new sampling scheme that trusts the model distribution through near-greedy sampling but prevents repetitions through a penalty. This penalized sampling works by discounting the scores of previously generated tokens. The motivation is similar to coverage mechanisms (See et al., 2017) and other losses designed to discourage repetition (Welleck et al., 2019), but penalized sampling is not used during training. Given a list of generated tokens $g$, using the notation from equation 1, the probability distribution for the next token is defined as:
$$p_i = \frac{\exp\big(x_i / (T \cdot I(i \in g))\big)}{\sum_j \exp\big(x_j / (T \cdot I(j \in g))\big)}, \qquad I(c) = \theta \text{ if } c \text{ is True else } 1$$
We find that using greedy sampling and $\theta \approx 1.2$ yields a good balance between truthful generation and lack of repetition. Note that $\theta = 1$ is equivalent to equation 1. We note in passing that this approach succeeds only if the model has learned a sufficiently reliable distribution.
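Penalized sampling can be sketched in a few lines. The code below follows the formula as written and is an illustration under stated assumptions rather than the released implementation: the function names and the `score_fn` callback are hypothetical, and only generated tokens (not prompt tokens) are penalized.

```python
import math

def penalized_probabilities(scores, generated, temperature=1.0, penalty=1.2):
    """Softmax over scores where tokens already generated have their score
    divided by the penalty (theta). With penalty=1.0 this is plain softmax.

    Note: dividing by a penalty > 1 lowers only positive scores; a negative
    score would be raised, so the formula presumes positive scores.
    """
    generated = set(generated)
    adjusted = [
        s / (temperature * (penalty if i in generated else 1.0))
        for i, s in enumerate(scores)
    ]
    peak = max(adjusted)
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_penalized_decode(score_fn, prompt, steps, penalty=1.2):
    """Greedy decoding that discounts tokens generated so far.

    score_fn: hypothetical callable mapping a token list to vocabulary scores.
    """
    tokens = list(prompt)
    generated = []
    for _ in range(steps):
        probs = penalized_probabilities(score_fn(tokens), generated, penalty=penalty)
        token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(token)
        generated.append(token)
    return generated
```

With a score function that always returns `[3.0, 2.9, 1.0]`, plain greedy decoding (`penalty=1.0`) repeats token 0 indefinitely, while `penalty=1.2` alternates between the two near-tied tokens without ever promoting the clearly worse third one.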
4.2 Control Codes
Style by domain.
|Domain control code||Prompt and generated text|
|Wikipedia||Anarchism is a political philosophy that advocates the abolition of all forms of hierarchy and domination, including capitalism, patriarchy, racism, sexism, heterosexism and other oppressive social structures.|
|Horror||A knife handle pulled through the open hole in the front. I jumped when the knife hit.|
|Reviews||A knife is a tool and this one does the job well.|
|Relationships||My neighbor is a jerk and I don’t know what to do|
|Legal||My neighbor is threatening to sue me for not letting him use my pool|
Most control codes for our model specify the overall style of generated text by indicating a particular domain of training data. Examples in Table 1 demonstrate that even for identical prompts, control codes allow for predictable variation in generation. The examples in Table 2 show how CTRL can generate domain-specific text without any prompt.
|Domain control code||Additional codes and generated text|
|Science||Title: Scientists have discovered a new type of bacteria that can survive in the presence of high levels of carbon dioxide|
|Politics||Title: The US is the only country in history to have a national debt of more than $20 trillion.|
|Running||Text: I have been running for about a year and a half now but never really got into it.|
|Horror||Text: I was a little girl when my parents got divorced. My dad had been in the military for years and he left me with my mom. She worked as an RN at a hospital so she could take care of me.|
|Reviews||Rating: 5.0|
|Reviews||Rating: 1.0|
More complex control codes.
Additional control codes can be added after the domain code to further constrain generation. The examples in Table 2 demonstrate two such constraints: that the generated text should start with a title, and that a review should carry a particular rating.
Examples of more advanced control are given in Table 3. In our version of OpenWebText, we include the URL after each document as the start of the input sequence. During training, CTRL learns relationships between the structure of these URLs and the text that follows. At inference, novel URLs can be used to specify a variety of features: domain, subdomain, entities, entity relations, and even dates.
|Domain control code||URL and generated text|
|Links||https://www.cnn.com/2007/09/20/us-president-meets-british-pm LONDON, England (CNN) – U.S. President George W. Bush met with British Prime Minister Tony Blair on Monday to discuss the war in Iraq, according to a statement from Blair’s office.|
|Links||https://www.cnn.com/2014/09/20/us-president-meets-british-pm|
|Links||https://www.cnn.com/2018/09/20/us-president-meets-british-pm|
|Links||https://www.cnn.com/style/09/20/2018/george-clooney-interview George Clooney on the future of his acting career|
|Links||https://www.cnn.com/politics/09/20/2018/george-clooney-interview JUST WATCHED|
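Because the URL is ordinary text that follows the Links control code, prompts that vary one URL feature at a time can be assembled with plain string formatting. The helpers below are a hypothetical sketch; their names are not part of the CTRL release, and the URL pattern simply mirrors the examples above.

```python
def links_prompt(url, text=""):
    """Return a prompt that starts with the Links control code and a URL."""
    return " ".join(part for part in ("Links", url, text) if part)

def dated_url(site, year, month, day, slug):
    """Build a URL whose path encodes a date, as in the examples above."""
    return f"https://{site}/{year:04d}/{month:02d}/{day:02d}/{slug}"

# Vary only the year to probe how the date in the URL changes generation.
prompts = [
    links_prompt(dated_url("www.cnn.com", year, 9, 20, "us-president-meets-british-pm"))
    for year in (2007, 2014, 2018)
]
```

Holding the slug fixed while changing the date, or holding the date fixed while changing the path segment (for example `style` versus `politics`), isolates which part of the URL the model is responding to.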
Triggering specific tasks.
|Domain control code||Template, prompt, and generated text|
|Questions||Q: What is the capital of India? A: New Delhi|
|Translation||English : We release a new model for coherent language generation ; French : Nous publions un nouveau modéle de génération cohérente du langage|
A small number of control codes are related to specific tasks like question answering and translation. These codes constrain the generation process the most, by triggering task-specific generation. In Table 4, we demonstrate relatively complex control codes for question answering and machine translation that act as a template mixed with a natural language prompt.
|Domain control code||Template, prompt, and generated text|
|Diet||English : I lost 10 kgs! ; German : Ich habe 10 Kilogramm verloren!|
|Politics||Title: Les Etats-Unis sont un pays de droite|
In the first example we mix a diet subreddit (r/keto) with machine translation control codes for English and German. In contrast to using Translation alone, the generated text with mixed codes is coherent across multiple translated lines. This structure reflects the influence of Diet, which had multiline examples in the training data, whereas the translation data consisted of shuffled single lines. In the second example we mix the politics subreddit (r/politics) with a prompt that starts in French, though no examples of this kind were found in the training data.
5 Source Attribution
|Query Prompt||Attributed Sources|
|Global warming is a lie.||r/unpopularopinion, r/conspiracy, r/science|
|Global warming is a lie||r/eli5, r/science, r/unpopularopinion|
|Global warming is a real phenomenon||r/eli5, r/science, r/changemyview|
|Global warming is a real phenomenon.||OpenWebText, r/changemyview, r/science|
|I don’t think women should be allowed to vote.||r/christianity, r/atheism, r/unpopularopinion|
|Carbs are your enemy when you want to get lean.||r/fitness, r/loseit, r/keto|
|I just want to be a fun aunt. I’m not interested in babies.||r/babybumps, r/childfree, r/twoxchromosome|
|My landlord is suing me for unpaid rent.||r/legaladvice, r/personalfinance, r/frugal|
|FROM fairest creatures we desire increase,||Gutenberg, Wikipedia, OpenWebText|
The domain control codes can be used to partition the training data into mutually exclusive sets. This supports a simple method for determining which subsets of the training data the language model considers most likely given a sequence. Recall that the language model has learned a distribution $p_\theta(x \mid c)$. By specifying a prior $p(c)$ over domain control codes, it is straightforward to compute a ranking of domains:
$$p_\theta(c \mid x) \propto p_\theta(x \mid c)\, p(c)$$
We found that the empirical prior of the training data weights domains with large amounts of data too heavily. Instead, we use a uniform prior over the domain control codes. Examples can be found in Table 6.
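The ranking is a direct application of Bayes' rule. The sketch below assumes that the log-likelihood of the query under each domain code has already been computed by scoring the query with that code prepended; the function name is hypothetical, and the numbers in the usage note are made up for illustration.

```python
import math

def rank_domains(log_likelihoods, prior=None):
    """Rank domain codes by p(c | x), proportional to p(x | c) * p(c).

    log_likelihoods: dict mapping domain code -> log p(x | c).
    prior: dict mapping domain code -> p(c); uniform when None.
    """
    codes = list(log_likelihoods)
    if prior is None:
        prior = {c: 1.0 / len(codes) for c in codes}
    log_joint = {c: log_likelihoods[c] + math.log(prior[c]) for c in codes}
    # Normalize with log-sum-exp for numerical stability.
    peak = max(log_joint.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_joint.values()))
    posterior = {c: math.exp(v - log_norm) for c, v in log_joint.items()}
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)
```

With illustrative values such as `{"r/science": -40.0, "Wikipedia": -42.0, "r/keto": -45.0}`, the uniform prior ranks `r/science` first, whereas a prior that puts almost all mass on `Wikipedia` moves it to the top. This mirrors the observation that an empirical prior over-weights domains with large amounts of data.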
We note that the data used to train this model does not have universal coverage and contains the cultural associations present in the original sources. All applications of the model inherently depend on those original associations for prediction. In fact, this method of source attribution relies on exploiting the original associations to establish relationships between the language model and its training data.
The model does not have a notion of whether any particular cultural association is good or bad, right or wrong, true or false. It only learns correlations between cultural associations and domains. This is evidenced by the fact that contradictory statements are often attributed to the same sources: competing claims often appear in the same contexts. CTRL provides model-based evidence that certain domains are more likely to contain language similar to given statements, but it should not be used to make normative or prescriptive claims. It is a descriptive tool for analyzing correlations in large amounts of text.
6 Related Work
Language models (Bengio et al., 2003) have played an important role in natural language processing through transferable word vectors (Mikolov et al., 2013), contextualized word vectors (Peters et al., 2018; Devlin et al., 2018; Lample and Conneau, 2019), and models (Howard and Ruder, 2018; Radford et al., 2018). Recent work on memory mechanisms (Dai et al., 2019; Lample et al., 2019) has improved perplexities on the most common benchmarks, and even without these memories, large Transformer architectures (Vaswani et al., 2017) like GPT-2 (Radford et al., 2019), OpenGPT-2 (https://blog.usejournal.com/opengpt-2-we-replicated-gpt-2-because-you-can-too-45e34e6d36dc), and Megatron (https://github.com/NVIDIA/Megatron-LM) can achieve state-of-the-art results without directly training for any particular language modeling benchmark. Because these latter language models are trained on far more diverse data than is used in the supervised setting, they demonstrate impressive text generation capabilities (Radford et al., 2019; Zellers et al., 2019).
These models demonstrate the potential to learn multiple tasks as well as quick adaptation to patterns in input prompts (Radford et al., 2019). This potential showed that language models can offer an alternative to supervised multi-task learning as framed by several recent benchmarks (Wang et al., 2018; McCann et al., 2018). Language models might also offer a foundation to extend proposals of unified, multi-task systems for all of NLP (Collobert and Weston, 2008; Collobert et al., 2011), parsing and tagging (Hashimoto et al., 2016), multiple languages (Wu et al., 2016; Johnson et al., 2017), and multiple modalities (Luong et al., 2015; Kaiser et al., 2017). Several works have pointed to natural language as a means for controlling these multi-task systems (McCann et al., 2018; Radford et al., 2019; Keskar et al., 2019), and several point to the benefits of a code book either specified explicitly (Wu et al., 2016) or learned in a latent space (Kaiser et al., 2018). This work attempts to balance these approaches.
Sampling methods and coverage mechanisms.
Recent work in sampling methods for text generation has focused on reducing repetition by replacing it with novel, coherent text (Fan et al., 2018; Holtzman et al., 2019). The problem of repetition can instead be approached by altering the training objectives, as with coverage mechanisms (See et al., 2017) and context-based losses (Welleck et al., 2019). When prioritizing control, the trade-off between novelty in the generated text and consistency with prompts and prior generated text remains a difficult challenge, but this work found that relying on inference-time methods (Fan et al., 2018; Holtzman et al., 2019) that are closer in behavior to context-based losses (See et al., 2017; Welleck et al., 2019) provides a reasonable solution as long as the distribution of the language model is sufficiently confident in its decisions.
7 Future Directions
More control codes and finer-grained control.
The particular choice of control codes in this work is intended to represent a reasonably large variety in control over domain, topic, entities, entity relations, and dates. A very flexible means of control is through the natural structure of the internet in the form of URLs. Many of the domains that were mapped in this work to a single control code (e.g. Wikipedia, Project Gutenberg), could be refined to provide more fine-grained control either through further exploitation of URL structure (en.wikipedia.org, de.wikipedia.org, en.wikipedia.org/wiki/Anarchism, en.wikipedia.org/wiki/Anarchism#History) or through the manual extraction of structure already present in the data (e.g. Books Author Title Chapter). We hope future work explores extensions of CTRL to new domains in ways that provide further insight into controllable text generation.
Extensions to other areas in NLP.
This work suggests that including data for specific tasks need not harm the general nature of an unsupervised learning process. For important skills, the inclusion of supervised data or task-specific data generated through unsupervised means (Artetxe et al., 2017; Lewis et al., 2019) can lead to obvious improvements. While this work experimented with trivia-style question answering (without context documents) and small amounts of machine translation data, it remains an open question whether these language models can learn to effectively perform tasks like extractive question answering or state-of-the-art multilingual machine translation while still preserving general pattern recognition and text generation functionality.
Many tasks present difficult challenges to the supervised setting. Commonsense reasoning (Levesque et al., 2012) and abstractive summarization (Rush et al., 2015) represent two areas where these challenges remain readily apparent (Kryściński et al., 2019). Yet language models show potential for mitigating these problems directly (Trinh and Le, 2018; Radford et al., 2019) or indirectly (Rajani et al., 2019; Xenouleas et al., 2019; Scialom et al., 2019). We hope that in future work CTRL can be extended to far more tasks through the use of both unsupervised and supervised techniques.
Analyzing the relationships between language models and training data.
CTRL is trained on a small subset of the possible data available. Therefore the model is biased towards the patterns of language used in the training data. The data is likely not representative of many linguistic communities, but CTRL offers an explicit method for analyzing the relationship between the model and its current training data. As methods improve, more data is collected, and training of these large models continues, we hope to use this tool to better understand the particular cultural associations the model learns from each data source.
Making the interface between humans and language models more explicit and intuitive.
CTRL is designed to make the interface between humans and language models more intuitive. Text generation can be a powerful tool for enhancing creativity and exploration. In future work, we hope to study how the beneficial applications of such models can be enhanced by providing more control to human users.
8 CTRL-ALT-DEL: The Ethics of Large Language Models
Openness and replicability are central aspects of the scientific ethos that, prima facie, suggest the release of complete scientific research results. We reify these principles by releasing all trained CTRL models.
Although much scientific research and innovation can benefit the public, it may also be diverted to harmful uses or have unintended negative impacts (without animus). Brundage et al. (2019), among others, have argued artificial intelligence has such an omni-use character and have suggested governance policies emerging from the responsible innovation literature (Brundage, 2016). Historical evidence has pointed to the inadequacy of self-moratoriums for governing omni-use technologies (Kaiser and Moreno, 2012); we take a course of action that differs from such self-regulation.
Our actions reflect principles from a recent sociology-based AI governance framework that aims to expand responsible innovation to consider networks of users, dynamics, and feedback (Varshney et al., 2019).
Rather than self-governance, we sought to diversify inputs to governance through pre-release review from experts at the Partnership on AI (PAI). These experts, in turn, drew on emerging norms and governance processes that incorporate a broad set of values from across society.
Prior to release, the research team conducted a technology foresight exercise to anticipate possible malicious use cases. In particular, we used a scenario planning approach to technology foresight that systematically attempts to envision plausible longer-term future states of science, technology, and society. This anticipatory focus on possibilities rather than probabilities lessens several shortcomings of formal risk assessment in the face of contested assumptions, which has proven ineffective in identifying the most profound future impacts of innovation (Stilgoe et al., 2013).
As part of our model release, we include a code of conduct in the README at github.com/salesforce/ctrl. This code of conduct is modeled after emerging community norms ensconced in the Do No Harm and Just World Licenses. Simultaneously recognizing that it has no legal force and that users are agents of technological change embedded in social networks, the aim is to encourage reflection at the consumption junction (Cowan, 1987) through norm-setting and reduce unintended uses.
The README also includes a subset of the questions that the team discussed when deliberating release of the models, drawn from early drafts of community-driven PAI documents (to be released in the near future). This may further encourage users to reflect on norms and responsibilities associated with models that generate artificial content. In particular, users are asked to share answers to the included questions, to pose further questions, and suggest solutions by emailing firstname.lastname@example.org.
Finally, the README asks users to develop appropriate documentation (ABOUT ML, 2019; Arnold et al., 2018; Mitchell et al., 2019) when building on CTRL and to tell the research team how they are using CTRL by emailing email@example.com. This facilitates a post-release monitoring plan that observes how people are using CTRL in the wild (together with active observations). Such post-market plans recognize that most innovations are unexpected and hard to forecast. The plan is intended to enable a responsive approach to responsible innovation, not just with respect to harmful uses but also unintended negative impacts without animus.
With 1.6 billion parameters, CTRL is the largest publicly released language model to date. It is trained with control codes so that text generation can be more easily controlled by human users. These codes allow users to explicitly specify domain, subdomain, entities, relationships between entities, dates, and task-specific behavior. We hope that the release of this model at github.com/salesforce/ctrl pushes towards more controllable, general models for natural language processing, and we encourage future discussion about artificial generation with our team by emailing firstname.lastname@example.org.
We would like to thank Kathy Baxter for her help in the ethical considerations of our work and facilitating the external review process; Srinath Meadusani, Lavanya Karanam, Ning Dong, and Navin Ramineni for their help with setting up and maintaining compute infrastructure; Zak Stone and his team at Google for assistance with TPU infrastructure and code; and Joseph Olsen, Roy Davis, Joshua Simmons, Denise Lo, and Sam Edwards for their help with open sourcing.
- TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- Memory-efficient adaptive optimization for large-scale learning. arXiv preprint arXiv:1901.11150.
- Annotation and benchmarking on understanding and transparency of machine learning lifecycles (ABOUT ML). Partnership on AI (PAI), v0, 2019.
- Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
- FactSheets: increasing trust in AI services through supplier’s declarations of conformity. arXiv preprint arXiv:1808.07261.
- Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.
- Layer normalization. CoRR abs/1607.06450.
- Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 1–61.
- A neural probabilistic language model. Journal of Machine Learning Research 3 (Feb), pp. 1137–1155.
- Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858–867.
- The malicious use of artificial intelligence: forecasting, prevention, and mitigation. arXiv preprint arXiv:1802.07228.
- Artificial intelligence and responsible innovation. In Fundamental Issues of Artificial Intelligence, V. C. Müller (Ed.), pp. 543–554.
- InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (Aug), pp. 2493–2537.
- A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
- The consumption junction: a proposal for research strategies in the sociology of technology. In The Social Construction of Technological Systems, W. E. Bijker, T. P. Hughes, and T. J. Pinch (Eds.), pp. 261–280.
- Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159.
- SearchQA: a new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
- ELI5: long form question answering. arXiv preprint arXiv:1907.09190.
- Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.
- Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286.
- Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- NEWSROOM: a dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, pp. 708–719.
- A joint many-task model: growing a neural network for multiple NLP tasks. arXiv preprint arXiv:1611.01587.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701.
- The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
- Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
- Tying word vectors and word classifiers: a loss framework for language modeling. arXiv preprint arXiv:1611.01462.
- Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351.
- TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- Self-censorship is not enough. Nature 492 (7429), pp. 345–347.
- One model to learn them all. arXiv preprint arXiv:1706.05137.
- Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
- Unifying question answering and text classification via span extraction. arXiv preprint arXiv:1904.09286.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Skip-thought vectors. In Advances in Neural Information Processing Systems, pp. 3294–3302.
- Neural text summarization: a critical evaluation. arXiv preprint arXiv:1908.08960.
- Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466.
- Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
- Large memory layers with product keys. arXiv preprint arXiv:1907.05242.
- The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
- Unsupervised question answering by cloze translation. arXiv preprint arXiv:1906.04980.
- Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
- Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52.
- Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305.
- The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730.
- Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.
- Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19).
- Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
- Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.
- Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Constraints on language mixing: intrasentential code-switching and borrowing in Spanish/English. Language, pp. 291–318.
- Sometimes I’ll start a sentence in Spanish y termino en español: toward a typology of code-switching. Linguistics 18 (7-8), pp. 581–618.
- Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859.
- Improving language understanding by generative pre-training. …ing_paper.pdf.
- Language models are unsupervised multitask learners. …ers.pdf.
- Explain yourself! Leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361.
- SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia 6 (12), pp. e26752.
- Answers unite! Unsupervised metrics for reinforced summarization models. arXiv preprint arXiv:1909.01610.
- Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1073–1083.
- Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Adafactor: adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235.
- Developing a framework for responsible innovation. Research Policy 42 (9), pp. 1568–1580.
- Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
- A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
- NewsQA: a machine comprehension dataset. arXiv preprint arXiv:1611.09830.
- Pretrained AI models: performativity, mobility, and change. arXiv preprint arXiv:1909.03290.
- Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
- GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.
- Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- SumQE: a BERT-based summary quality estimation model. arXiv preprint arXiv:1909.00578.
- HotpotQA: a dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
- Defending against neural fake news. arXiv preprint arXiv:1905.12616.
Appendix A Data sources and breakdown
|Books||Books from Project Gutenberg|
|Reviews||Amazon Reviews data (McAuley et al., 2015)|
|Links||OpenWebText (See Sec. 3.2)|
|Translation||WMT translation data (Barrault et al., 2019)|
|News||News articles from CNN/DailyMail (Nallapati et al., 2016), New York Times, and Newsroom (Grusky et al., 2018)|
|multilingual||Wikipedias in German, Spanish and French|
|Questions||(Questions and answers only) MRQA shared task (See Section 3.1)|
|Explain||(Only main post) (Fan et al., 2019)|
|Sub-reddit data (Title, Text and Score/Karma) collected from pushshift.io.|