Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This article is accepted for publication in the Journal of Artificial Intelligence Research (JAIR).
This paper surveys the current state of the art in Natural Language Generation (nlg), defined as the task of generating text or speech from non-linguistic input. A survey of nlg is timely in view of the changes that the field has undergone over the past two decades, especially in relation to new (usually data-driven) methods, as well as new applications of nlg technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in nlg and the architectures in which such tasks are organised; (b) highlight a number of recent research topics that have arisen partly as a result of growing synergies between nlg and other areas of artificial intelligence; (c) draw attention to the challenges in nlg evaluation, relating them to similar challenges faced in other areas of nlp, with an emphasis on different evaluation methods and the relationships between them.
- 1 Introduction
- 2 NLG Tasks
- 3 NLG Architectures and Approaches
- 3.1 Modular approaches
- 3.2 Planning-based approaches
- 3.3 Stochastic approaches to NLG
- 3.4 Discussion
- 4 The vision-language interface: Image captioning and beyond
- 5 Variation: Generating text with style, personality and affect
- 6 Generating creative and entertaining text
- 7.1 Intrinsic methods
- 7.2 Extrinsic evaluation methods
- 7.3 Black box vs glass box evaluation
- 7.4 On the relationship between evaluation methods
- 7.5 Evaluation: Concluding remarks
- 8 Discussion and future directions
- 9 Conclusion
1 Introduction
In his intriguing story The Library of Babel (La biblioteca de Babel, 1941), Jorge Luis Borges describes a library in which every conceivable book can be found. It is probably the wrong question to ask, but readers cannot help wondering: who wrote all these books? Surely, this could not be the work of human authors? The emergence of automatic text generation techniques in recent years provides an interesting twist to this question. Consider Philip M. Parker, who offered more than 100,000 books for sale via Amazon.com, including for example his The 2007-2012 Outlook for Tufted Washable Scatter Rugs, Bathmats, and Sets That Measure 6-Feet by 9-Feet or Smaller in India. Obviously, Parker did not write these 100,000 books by hand. Rather, he used a computer program that collects publicly available information, possibly packaged in human-written texts, and compiles these into a book. Just as the Library of Babel contains many books that are unlikely to appeal to a broad audience, Parker’s books need not find many readers. In fact, even if only a small percentage of his books get sold a few times, this would still make him a sizeable profit.
Parker’s algorithm can be seen to belong to a research tradition of so-called text-to-text generation methods, applications that take existing texts as their input, and automatically produce a new, coherent text as output. Other example applications that generate new texts from existing (usually human-written) text include:
machine translation, from one language to another \shortcite¡e.g.,¿hutchins1992,och2003;
fusion and summarization of related sentences or texts to make them more concise \shortcite¡e.g.,¿Clarke2010;
simplification of complex texts, for example to make them more accessible for low-literacy readers \shortcite¡e.g.,¿Siddharthan2014 or for children \shortciteMacdonald2016;
automatic spelling, grammar and text correction \shortcite¡e.g.,¿kukich1992techniques,dale2012hoo;
automatic generation of peer reviews for scientific papers \shortcitebartoli2016your;
generation of paraphrases of input sentences \shortcite¡e.g.,¿bannard2005paraphrasing,kauchak2006paraphrasing; and
automatic generation of questions, for educational and other purposes \shortcite¡e.g.,¿brown2005automatic,rus2010overview.
Often, however, it is necessary to generate texts which are not grounded in existing ones. Consider, as a case in point, the minor earthquake that took place close to Beverly Hills, California on March 17, 2014. The Los Angeles Times was the first newspaper to report it, within 3 minutes of the event, providing details about the time, location and strength of the quake. This report was automatically generated by a ‘robo-journalist’, which converted the incoming automatically registered earthquake data into a text, by filling gaps in a predefined template text (see http://www.slate.com/blogs/future_tense/2014/03/17/quakebot_los_angeles_times_robot_journalist_writes_article_on_la_earthquake.html).
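Template filling of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the template wording, the slot names and the sample record are all invented here and are not taken from the Los Angeles Times system.

```python
from string import Template

# Hypothetical template: wording and slot names are assumptions for this sketch.
QUAKE_TEMPLATE = Template(
    "A magnitude $magnitude earthquake was reported at $time, "
    "$distance_km km from $place."
)


def render_report(record: dict) -> str:
    """Fill the template slots with values from one data record."""
    # substitute() raises KeyError when a slot has no value, so an
    # incomplete record does not silently produce a broken report.
    return QUAKE_TEMPLATE.substitute(record)


# Invented record, for illustration only.
sample = {"magnitude": 3.1, "time": "10:15", "distance_km": 12, "place": "Exampleville"}
report = render_report(sample)
print(report)
```

The design choice worth noting is that all linguistic decisions are fixed in advance by the template author; the program only selects which values fill which gaps.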
Robo-journalism and associated practices, such as data journalism, are examples of what is usually referred to as data-to-text generation. They have had a considerable impact in the fields of journalism and media studies \shortciteVanDalen2012,Clerwall2014,Hermida2015. The technique used by the Los Angeles Times was not new; many applications have been developed over the years which automatically generate text from non-linguistic data including, but not limited to, systems which produce:
soccer reports \shortcite¡e.g.,¿Theune2001,Chen2008;
virtual ‘newspapers’ from sensor data \shortciteMolina2011 and news reports on current affairs \shortciteLepp2017;
text addressing environmental concerns, such as wildlife tracking \shortciteSiddharthan2012,Ponnamperuma2013, personalised environmental information \shortciteWanner2015, and enhancing engagement of citizen scientists via generated feedback \shortciteVanderWal2016;
weather and financial reports \shortciteGoldberg1994,Reiter2005,Turner2008a,Ramos-Soto2015,Plachouras2016;
summaries of patient information in clinical contexts \shortciteHuske-Kraus2003,Harris2008,Portet2009,Gatt2009,Banaee2013;
interactive information about cultural artefacts, for example in a museum context \shortcite¡e.g.,¿ODonnell2001,Stock2007; and
text intended to persuade \shortciteCarenini2006, or motivate behaviour modification \shortciteReiter2003.
These systems may differ considerably in the quality and variety of the texts they produce, their commercial viability and the sophistication of the underlying methods, but all are examples of data-to-text generation. Many of the systems mentioned above focus on imparting information to the user. On the other hand, as shown by the examples cited above of systems focussed on persuasion or behaviour change, informing need not be the exclusive goal of nlg. Nor is it a trivial goal in itself, since in order to successfully impart information, a system needs to select what to say, distinguishing it from what can be easily inferred (possibly also depending on the target user), before expressing it coherently.
Generated texts need not have a large audience. There is no need to automatically generate a report of, say, the Champions League European football final, which is covered by many of the best journalists in the field anyway. However, there are many other games, less important to the general public (but presumably very important to the parties involved). Typically, all sports statistics (who played?, who scored? etc.) for these games are stored, but such statistics are not as a rule perused by sports reporters. Companies like Narrative Science (https://www.narrativescience.com) fill this niche by automatically generating sport reports for these games. Automated Insights (https://automatedinsights.com) even generates reports based on user-provided ‘fantasy football’ data. In a similar vein, the automatic generation of weather forecasts for offshore oil platforms \shortciteSripada2003, or from sensors monitoring the performance of gas turbines \shortciteYu2006, has proven to be a fruitful application of data-to-text techniques. Such applications are now the mainstay of companies like Arria-NLG (http://www.arria.com).
Taking this idea one step further, data-to-text generation paves the way for tailoring texts to specific audiences. For example, data from babies in neonatal care can be converted into text differently, with different levels of technical detail and explanatory language, depending on whether the intended reader is a doctor, a nurse or a parent \shortciteMahamood2011. One could also easily imagine that different sport reports are generated for fans of the respective teams; the winning goal of one team is likely to be considered a lucky one from the perspective of the losing team, irrespective of its ‘objective’ qualities \shortcitevan2017pass. A human journalist would not dream of writing separate reports about a sports match (if only for lack of time), but for a computer this is not an issue and this is likely to be appreciated by a reader who receives a more personally appropriate report.
1.1 What is Natural Language Generation?
Both text-to-text generation and data-to-text generation are instances of Natural Language Generation (nlg). In the most widely-cited survey of nlg methods to date \shortciteReiter1997,Reiter2000, nlg is characterized as ‘the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information’ \shortcite[p.1]Reiter1997. Clearly this definition fits data-to-text generation better than text-to-text generation, and indeed \shortciteAReiter2000 focus exclusively on the former, helpfully and clearly describing the rule-based approaches that dominated the field at the time.
It has been pointed out that precisely defining nlg is rather difficult \shortcite¡e.g.,¿evans2002nlg: everybody seems to agree on what the output of an nlg system should be (text), but what the exact input is can vary substantially \shortciteMcDonald1993. Examples include flat semantic representations, numerical data and structured knowledge bases. More recently, generation from visual input such as image or video has become an important challenge \shortcite¡e.g.,¿[among many others]Mitchell2012,Kulkarni2013,Thomason2014.
A further complication is that the boundaries between different approaches are themselves blurred. For example, text summarisation was characterized above as a text-to-text application. However, many approaches to text-to-text generation (especially abstractive summarisation systems, which do not extract content wholesale from the input documents) use techniques which are also used in data-to-text, as when opinions are extracted from reviews and expressed in completely new sentences \shortcite¡e.g.,¿labbe2012towards. Conversely, a data-to-text generation system could conceivably rely on text-to-text generation techniques for learning how to express pieces of data in different or creative ways \shortciteMcintyre2009,Gatt2009,Kondadadi2013.
Considering other applications of nlg similarly highlights how blurred boundaries can get. For example, the generation of spoken utterances in dialogue systems \shortcite¡e.g.,¿Walker2007,Rieser2009,Dethlefs2014 is another application of nlg, but typically it is closely related to dialogue management, so that management and realisation policies are sometimes learned in tandem \shortcite¡e.g.,¿Rieser2011.
The position taken in this survey is that what distinguishes data-to-text generation is ultimately its input. Although this varies considerably, it is precisely the fact that such input is not – or isn’t exclusively – linguistic that is the main challenge faced by most of the systems and approaches we will consider. In what follows, unless otherwise specified in context, the terms ‘Natural Language Generation’ and ‘nlg’ will be used to refer to systems that generate text from non-linguistic data.
1.2 Why a survey on Natural Language Generation?
Arguably \shortciteAReiter2000 is still the most complete available survey of nlg. However, the field of nlg has changed drastically in the last 15 years, with the emergence of successful applications generating tailored reports for specific audiences, and with the emergence of text-to-text as well as vision-to-text generation applications, which also tend to rely more on statistical methods than traditional data-to-text. None of these are covered in \shortciteAReiter2000. Also notably absent are discussions of applications that move beyond standard, ‘factual’ text generation, such as those that account for personality and affect, or creative text such as metaphors and narratives. Finally, a striking omission by \shortciteAReiter2000 is the lack of discussion of evaluation methodology. Indeed, evaluation of nlg output has only recently started to receive systematic attention, in part due to a number of shared tasks that were conducted within the nlg community.
Since \shortciteAReiter2000, various other nlg overview texts have also appeared. \shortciteABateman2005 covers the cognitive, social and computational dimensions of nlg. \shortciteAMcDonald2010 offers a general characterization of nlg as ‘the process by which thought is rendered into language’ (p. 121). \shortciteAWanner2010 zooms in on automatic generation of reports, while \shortciteADiEugenio2010 looks at specific applications, especially in education and health-care. Various specialized collections of articles have also been published, including \shortciteAKrahmer2010a, which targets data-driven approaches; and \shortciteABangalore2014, which focusses on interactive systems. The web offers various unpublished technical reports, such as \shortciteATheune2003, which surveys dialogue systems, and \shortciteAPiwek2003 and \shortciteABelz2003 on affective nlg. While useful, these resources do not discuss recent developments or offer a comprehensive review. This indicates that a new state-of-the-art survey is highly timely.
1.3 Goals of this survey
The goal of the current paper is to present a comprehensive overview of nlg developments since 2000, both in order to provide nlg researchers with a synthesis and pointers to relevant research, and to introduce the field to researchers who are less familiar with nlg. Though nlg has been a part of ai and nlp from the early days \shortcite¡see e.g.,¿Winograd1972,Appelt1985, as a field it has arguably not been fully embraced by these broader communities, and has only recently begun to take full advantage of recent advances in data-driven, machine learning and deep learning approaches.
As in \shortciteAReiter2000, our main focus, especially in the first part of the survey, will be on data-to-text generation. In any case, doing full justice to recent developments in the various text-to-text generation applications is beyond the scope of a single survey, and many of these are covered in other individual surveys, including \shortciteAMani2001 and \shortciteANenkova2011 for summarisation; \shortciteAandroutsopoulos2010survey for paraphrasing; and \shortciteApiwek2012varieties for automatic question generation. However, we will in various places discuss connections between data-to-text and text-to-text generation, both because – as noted above – the boundaries are blurred, and, perhaps more importantly, because text-to-text systems have long been couched in the data-driven frameworks that are becoming increasingly popular in data-to-text generation, also giving rise to some hybrid systems that combine rule-based and statistical techniques \shortcite¡e.g.,¿Kondadadi2013.
Our review will start with an updated overview of the core nlg tasks that were introduced in \shortciteAReiter2000, followed by a discussion of architectures and approaches, where we pay special attention to those not covered in the \shortciteAReiter2000 survey. These two sections constitute the ‘foundational’ part of the survey. Beyond these, we highlight several new developments, including approaches where the input data is visual; and research aimed at generating more varied, engaging or creative and entertaining texts, taking nlg beyond the factual, repetitive texts it is sometimes accused of producing. We believe that these applications are not only interesting in themselves, but may also inform more ‘utility’-driven text generation applications. For example, by including insights from narrative generation we may be able to generate more engaging reports and by including insights from metaphor generation we may be able to phrase information in these reports in a more original manner. Finally, we will discuss recent developments in evaluation of natural language generation applications.
In short, the goals of this survey are:
To give an up-to-date synthesis of research on the core tasks in nlg and the architectures in which such tasks are organised (Sections 2 and 3);
To highlight a number of recent research topics that have arisen partly as a result of growing synergies between nlg and other areas of artificial intelligence (Sections 4, 5 and 6);
To draw attention to the challenges in nlg evaluation, relating them to similar challenges faced in other areas of nlp, with an emphasis on different evaluation methods and the relationships between them (Section 7).
2 NLG Tasks
Traditionally, the nlg problem of converting input data into output text was addressed by splitting it up into a number of subproblems. The following six are frequently found in many nlg systems \shortciteReiter1997,Reiter2000; their role is illustrated in Figure 1:
Content determination: Deciding which information to include in the text under construction,
Text structuring: Determining in which order information will be presented in the text,
Sentence aggregation: Deciding which information to present in individual sentences,
Lexicalisation: Finding the right words and phrases to express information,
Referring expression generation: Selecting the words and phrases to identify domain objects,
Linguistic realisation: Combining all words and phrases into well-formed sentences.
These tasks can be thought of as ranging from ‘early’ decision processes (which information to convey to the reader?) to ‘late’ ones (which words to use in a particular sentence, and how to put them in their correct order?). Here, we refer to ‘early’ and ‘late’ tasks by way of distinguishing between choices that are more oriented towards the data (such as what to say) and choices that are of an increasingly linguistic nature (e.g., lexicalisation, or realisation). This characterization reflects a long-running distinction in nlg between strategy and tactics \shortcite¡a distinction that goes back at least to¿Thompson1977. This distinction also suggests a temporal order in which the tasks are executed, at least in systems with a modular, pipeline architecture (discussed in Section 3.1): for example, the system first needs to decide which input data to express in the text, before it can order information for presentation. However, such ordering of modules is nowadays increasingly put into question in the data-driven architectures discussed below (Section 3).
In this section, we briefly describe these six tasks, illustrating them with examples, and highlight recent developments in each case. As we shall see, while the ‘early’ tasks are crucial for the development of nlg systems, they are often intimately connected to the specific application. By contrast, ‘late’ tasks are more often investigated independently of an application, and hence have resulted in approaches that can be shared between applications.
2.1 Content determination
As a first step in the generation process, the nlg system needs to decide which information should be included in the text under construction, and which should not. Typically, more information is contained in data than we want to convey through text, or the data is more detailed than we care to express in text. This is clear in Figure 1(a), where the input signal – a patient’s heart rate – only contains a few patterns of interest. Selection may also depend on the target audience (e.g. does it consist of experts or novices?) and on the overall communicative intention (e.g. should the text inform readers or convince them to do something?).
Content determination involves choice. In a soccer report, we may not want to verbalise each pass and foul committed, even though the data may contain this information. In the case of neonatal care, data might be collected continuously from sensors measuring heart rate, blood pressure and other physiological parameters. Data thus needs to be filtered and abstracted into a set of preverbal messages, semantic representations of information which are often expressed in a formal representation language, such as logical or database languages, attribute-value matrices or graph structures. They can express, among other things, which relations hold between which domain entities, for example, expressing that player X scored the first goal for team Y at time T.
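The filtering and abstraction step described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical event log and a hand-picked set of report-worthy event types; the `Event` class, the `REPORTABLE` set and the attribute names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # e.g. "pass", "foul", "goal"
    player: str
    team: str
    minute: int


# Assumption for this sketch: only these event types are worth reporting.
REPORTABLE = {"goal", "red_card"}


def determine_content(events):
    """Filter raw events and abstract them into attribute-value messages."""
    messages = []
    for ev in events:
        if ev.kind not in REPORTABLE:
            continue  # passes, fouls etc. are left unverbalised
        messages.append(
            {"type": ev.kind, "agent": ev.player, "team": ev.team, "time": ev.minute}
        )
    return messages


events = [
    Event("pass", "X", "Y", 3),
    Event("foul", "Z", "W", 7),
    Event("goal", "X", "Y", 12),
]
messages = determine_content(events)
print(messages)
```

Each resulting dictionary plays the role of a preverbal message: it records which relation holds between which domain entities, without committing to any wording.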
Though content determination is present in most nlg systems \shortcite¡cf.¿Mellish2006, approaches are typically closely related to the domain of application. A notable exception is \shortciteAGuhe2007, which offers a cognitively plausible, incremental account of content determination based on studies of speakers’ descriptions of dynamic events as they unfold. This work belongs to a strand of research which considers nlg first and foremost as a methodology eminently suitable for understanding human language production.
In recent years, researchers have started exploring data-driven techniques for content determination \shortcite¡see e.g.,¿Barzilay2004,BouayadAgha2013,Kutlak2013,Venigalla2013. \shortciteABarzilay2004, for example, used Hidden Markov Models to model topic shifts in a particular domain of discourse (say, earthquake reports), where the hidden states represented ‘topics’, modelled as sentences clustered together by similarity. A clustering approach was also used by \shortciteADuboue2003 in the biography domain, using texts paired with a knowledge base, from which semantic data was clustered and scored according to its occurrence in text. In a similar vein, \shortciteABarzilay2005 used a database of American football records and corresponding text. Their aim was not only to identify bits of information that should be mentioned, but also dependencies between them, since mentioning a certain event (say, a score by a quarterback) may warrant the mention of another (say, another scoring event by a second quarterback). The solution proposed by \citeauthorBarzilay2005 was to compute both individual preference scores for events, and a link preference score.
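The interplay between individual and link preference scores can be made concrete with a toy example. The sketch below is not the optimisation procedure of the cited work: it assumes a simple objective (sum of individual scores of the selected events, minus a penalty for every linked pair that is split between selected and unselected) and solves it by exhaustive search. The event names and scores are invented.

```python
from itertools import product


def select_events(individual, links):
    """Pick the subset of events with the best total score.

    individual: dict event -> score; positive values favour selection.
    links: dict (event_a, event_b) -> non-negative weight, paid as a
           penalty whenever exactly one of the two events is selected.
    Exhaustive search, so only suitable for a handful of events.
    """
    events = sorted(individual)
    best_score, best_set = float("-inf"), set()
    for labels in product([False, True], repeat=len(events)):
        chosen = {e for e, keep in zip(events, labels) if keep}
        score = sum(individual[e] for e in chosen)
        score -= sum(
            w for (a, b), w in links.items() if (a in chosen) != (b in chosen)
        )
        if score > best_score:
            best_score, best_set = score, chosen
    return best_set


# Invented scores: the second touchdown is weak on its own merits.
individual = {"td_qb1": 2.0, "td_qb2": -0.5, "punt": -1.0}
links = {("td_qb1", "td_qb2"): 1.0}

print(select_events(individual, {}))     # without links
print(select_events(individual, links))  # with the link preference
```

With the link in place, the weakly scored second event is selected together with the first, which is the kind of dependency the paragraph above describes.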
More recently, various researchers have addressed the question of how to automatically learn alignments between data and text, also in the broader context of grounded language acquisition, i.e., modelling how we learn language by looking at correspondences between objects and events in the world and the way we refer to them in language \shortciteRoy2002,yu2004multimodal,yu2013grounded. For example, \shortciteALiang2009 extended the work by \shortciteABarzilay2005 to multiple domains (soccer and weather), relying on weakly supervised techniques; in a similar vein, \shortciteAkoncel2014multi presented a weakly supervised multilevel approach, to deal with the fact that there is no one-to-one correspondence between, for example, soccer events in data and sentences in associated soccer reports. We shall return to these methods as part of a broader discussion of data-driven approaches below (Section 3.3).
2.2 Text structuring
Having determined what messages to convey, the nlg system needs to decide on their order of presentation to the reader. For example, Figure 1(b) shows three events of the same type (all bradycardia events, that is, brief drops in heart rate), selected (after abstraction) from the input signal and ordered as a temporal sequence.
This stage is often referred to as text (or discourse or document) structuring. In the case of the soccer domain, for example, it seems reasonable to start with general information (where and when the game was played, how many people attended, etc.), before the goals are described, typically in temporal order. In the neonatal care domain, a temporal order can be imposed among specific events, as in Figure 1(b), but larger spans of text may reflect ordering based on importance, and grouping of information based on relatedness (e.g. all events related to a patient’s respiration) \shortcitePortet2009. Naturally, alternative discourse relations may exist between separate messages, such as contrasts or elaborations. The result of this stage is a discourse, text or document plan, which is a structured and ordered representation of messages.
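For the soccer case, the ordering preference just described (general information first, then goals in temporal order) can be sketched as a simple sort. The message types and their priorities below are assumptions made for the illustration; a real text plan would also encode grouping and discourse relations, which this flat ordering does not.

```python
# Assumed priority of message types: lower numbers are presented first.
TYPE_PRIORITY = {"venue": 0, "attendance": 1, "goal": 2}


def structure_text(messages):
    """Order messages by type priority, then by time within a type."""
    return sorted(
        messages,
        key=lambda m: (TYPE_PRIORITY[m["type"]], m.get("time", 0)),
    )


messages = [
    {"type": "goal", "agent": "B", "time": 78},
    {"type": "attendance", "count": 30000},
    {"type": "goal", "agent": "A", "time": 15},
    {"type": "venue", "place": "home ground"},
]
plan = structure_text(messages)
print([m["type"] for m in plan])
```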
These examples again imply that the application domain imposes constraints on ordering preferences. Early approaches, such as \shortciteAMcKeown1985, often relied on hand-crafted, domain-dependent structuring rules (which McKeown called schemata). To account for discourse relations between messages, researchers have alternatively relied on Rhetorical Structure Theory \shortcite¡rst; e.g.,¿Mann1988,Scott1990,Hovy1993, which also typically involved domain-specific rules. For example, \shortciteAWilliams2008 used rst relations to identify ordering among messages that would maximise clarity to low-skilled readers.
Various researchers have explored the possibilities of using machine learning techniques for document structuring \shortcite¡e.g.,¿Dimitromanolaki2003,althaus2004, sometimes doing this in tandem with content selection \shortciteDuboue2003. General approaches for information ordering \shortciteBarzilay2004,Lapata2006 have been proposed, which automatically try to find an optimal ordering of ‘information-bearing items’. These approaches can be applied to text structuring, where the items to be ordered are typically preverbal messages; however, they can also be applied in (multidocument) summarisation, where the items to be ordered are sentences from the input documents which are judged to be summary-worthy enough to include \shortcite¡e.g.,¿Barzilay2002,Bollegala2010.
2.3 Sentence aggregation
Not every message in the text plan needs to be expressed in a separate sentence; by combining multiple messages into a single sentence, the generated text becomes potentially more fluid and readable \shortcite¡e.g.,¿Dalianis1999,Cheng2000, although there are also situations where it has been argued that aggregation should be avoided (discussed in Section 5.2). For instance, the three events selected in Figure 1(b) are shown as ‘merged’ into a single pre-linguistic representation, which will be mapped to a single sentence. The process by which related messages are grouped together in sentences is known as sentence aggregation.
To take another example, from the soccer domain, one (unaggregated) way to describe the fastest hat-trick in the English Premier League would be:
(1) Sadio Mane scored for Southampton after 12 minutes and 22 seconds.
(2) Sadio Mane scored for Southampton after 13 minutes and 46 seconds.
(3) Sadio Mane scored for Southampton after 15 minutes and 18 seconds.
Clearly, this is rather redundant, not very concise or coherent, and generally unpleasant to read. An aggregated alternative, such as the following, would therefore be preferred:
(4) Sadio Mane scored three times for Southampton in less than three minutes.
In general, aggregation is difficult to define, and has been interpreted in various ways, ranging from redundancy elimination to linguistic structure combination. \shortciteAReape1999 offer an early survey, distinguishing between aggregation at the semantic level (as illustrated in Figure 1(c)) and at the level of syntax, illustrated in the transition from (1-3) to (4) above.
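The step from the three separate hat-trick sentences to the single aggregated one can be sketched with a small hand-written rule. This is an illustration only, assuming that all input messages concern the same player and team; the function name, the tuple layout and the number-word table are invented for the example.

```python
NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}


def aggregate_goals(goals):
    """Merge goal messages for one player and team into a single sentence.

    goals: list of (player, team, minute, second) tuples; this sketch
    assumes at least two goals, all by the same player for the same team.
    """
    if len(goals) < 2:
        raise ValueError("aggregation needs at least two messages")
    player, team = goals[0][0], goals[0][1]
    times = sorted(m * 60 + s for _, _, m, s in goals)
    span = times[-1] - times[0]
    # Smallest whole number of minutes strictly greater than the span.
    bound = span // 60 + 1
    count = NUMBER_WORDS.get(len(goals), str(len(goals)))
    limit = NUMBER_WORDS.get(bound, str(bound))
    unit = "minute" if bound == 1 else "minutes"
    return f"{player} scored {count} times for {team} in less than {limit} {unit}."


goals = [
    ("Sadio Mane", "Southampton", 12, 22),
    ("Sadio Mane", "Southampton", 13, 46),
    ("Sadio Mane", "Southampton", 15, 18),
]
sentence = aggregate_goals(goals)
print(sentence)
```

The rule is domain-specific by construction: it knows about goals, players and match clocks, which is typical of hand-crafted aggregation rules.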
It is probably fair to say that much early work on aggregation was strongly domain-dependent. This work focussed on domain- and application-specific rules (e.g. ‘if a player scores two consecutive goals, describe these in the same sentence’), that were typically hand-crafted \shortcite¡e.g.,¿Hovy1988,Dalianis1999,Shaw1998. Once again, more recent work is gradually moving towards data-driven approaches, where aggregation rules are acquired from corpus data \shortcite¡e.g.,¿Walker2001,Cheng2000. \shortciteABarzilay2006 present a system that learns how to aggregate on the basis of a parallel corpus of sentences and corresponding database entries, by looking for similarities between entries. As was the case with the content selection method of \shortciteABarzilay2005, \shortciteABarzilay2006 view the problem in terms of global optimisation: an initial classification is done over pairs of database entries which determines whether they should be aggregated or not on the basis of their pairwise similarity. Subsequently, a globally optimal set of linked entries is selected based on transitivity constraints (if $\langle e_i, e_j \rangle$ and $\langle e_j, e_k \rangle$ are linked, then so should $\langle e_i, e_k \rangle$ be) and global constraints, such as how many sentences should be aggregated in a document. Global optimisation is cast in terms of Integer Linear Programming, a well-known mathematical optimization technique \shortcite¡e.g.,¿nemhauser1988integer.
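The transitivity constraint can be illustrated in isolation. The sketch below does not use Integer Linear Programming and does not decide which links to keep under global constraints; it simply assumes the pairwise classifier has already produced a set of positive links and computes their transitive closure with a union-find structure. The function name and the integer encoding of entries are assumptions.

```python
def group_entries(n_entries, linked_pairs):
    """Group entries 0..n_entries-1 by the transitive closure of links."""
    parent = list(range(n_entries))

    def find(i):
        # Follow parent pointers to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in linked_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n_entries):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Entries 0-1 and 1-2 are linked pairwise, so 0 and 2 end up together.
groups = group_entries(5, [(0, 1), (1, 2)])
print(groups)
```

Each resulting group corresponds to a set of database entries that would be realised in one sentence. An optimisation-based formulation differs in that it may reject some pairwise links in order to satisfy document-level constraints.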
With syntactic aggregation, it is arguably more feasible to define domain-independent rules to eliminate redundancy \shortciteHarbusch2009,Kempen2009. For example, converting (5) into (6) below
(5) Sadio Mane scored in the 12th minute and he scored again in the 13th minute.
(6) Sadio Mane scored in the 12th minute and again in the 13th.
could be achieved by identifying the parallel verb phrases in the two conjoined sentences and eliding the subject and verb in the second. Recent work has explored the possibility of acquiring such rules from corpora automatically. For example, \shortciteAStent2009 describe an approach to the acquisition of sentence-combining rules from a discourse treebank, which are then incorporated into the sparky sentence planner described by \shortciteAWalker2007. A more general approach to the same problem is discussed by \shortciteAWhite2015.
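The elision of a shared subject and verb can be sketched over a deliberately simplified clause representation. The example assumes that each clause has already been analysed into subject, verb and remainder, and that the pronoun in the second clause has been resolved to its antecedent; the dictionary keys are invented for the illustration. The further ellipsis of the repeated noun ‘minute’ is not attempted.

```python
def reduce_conjunction(clause_a, clause_b):
    """Conjoin two clauses, eliding a shared subject and verb in the second.

    Each clause is a dict with 'subject', 'verb' and 'rest' strings.
    """
    first = f"{clause_a['subject']} {clause_a['verb']} {clause_a['rest']}"
    shared = (
        clause_a["subject"] == clause_b["subject"]
        and clause_a["verb"] == clause_b["verb"]
    )
    if shared:
        second = clause_b["rest"]  # subject and verb are elided
    else:
        second = f"{clause_b['subject']} {clause_b['verb']} {clause_b['rest']}"
    return f"{first} and {second}."


a = {"subject": "Sadio Mane", "verb": "scored", "rest": "in the 12th minute"}
b = {"subject": "Sadio Mane", "verb": "scored", "rest": "again in the 13th minute"}
reduced = reduce_conjunction(a, b)
print(reduced)
```

Because the rule refers only to syntactic roles and not to goals or players, it would apply unchanged in another domain, which is the sense in which such rules are domain-independent.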
Arguably, aggregation on the syntactic level can only account for relatively small reductions, compared to aggregation at the level of messages. Furthermore, syntactic aggregation assumes that the sentence planning process (which includes lexicalisation) is complete. Hence, while traditional approaches to nlg view aggregation as part of sentence planning, which occurs prior to syntactic realisation, the validity of this claim depends on the type of aggregation being performed \shortcite¡see also¿theune2006.
2.4 Lexicalisation
Once the content of the sentence has been finalised, possibly also as a result of aggregation at the message level, the system can start converting it into natural language. In our example (Figure 1(c)), the outcomes of aggregation and lexicalisation are shown together: here, the three events have been grouped, and mapped to a representation that includes a verb (be) and its arguments, though the arguments themselves still have to be rendered in a referring expression (see below). This reflects an important decision, namely, which words or phrases to use to express the messages’ building blocks. A complication is that often a single event can be expressed in natural language in many different ways. A scoring event in a soccer match, for example, can be expressed as ‘to score a goal’, ‘to have a goal noted’, ‘to put the ball in the net’, among many others.
The complexity of this lexicalisation process critically depends on the number of alternatives that the nlg system can entertain. Often, contextual constraints play an important role here as well: if the aim is to generate texts with a certain amount of variation \shortcite¡e.g.,¿Theune2001, the system can decide to randomly select a lexicalisation option from a set of alternatives (perhaps even from a set of alternatives not used earlier in the text). However, stylistic constraints come into play: ‘to score a goal’ is an unfortunate way of expressing an own goal, for example. In other applications, lexical choice may even be informed by other considerations, such as the attitude or affective stance towards the event in question \shortcite¡e.g.,¿[and the discussion in Section 5]Fleischman2002. Whether nlg systems aim for variation in their output depends on the domain. For example, variation in soccer reports is presumably more appreciated by readers than variation in weather reports \shortcite¡on which see¿Reiter2005; it may also depend on where in a text the variation occurs. For example, variation in expressing timestamps may be less appreciated than variation in referential forms \shortcitecastro2016towards.
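Random selection among alternatives, with a preference for phrases not yet used in the text, can be sketched as follows. The lexicon, the concept names and the function name are assumptions for the illustration; keeping a separate entry for own goals is one simple way of respecting the constraint mentioned above.

```python
import random

# Hypothetical lexicon: alternative phrases per domain concept.
LEXICON = {
    "goal": ["score a goal", "have a goal noted", "put the ball in the net"],
    "own_goal": ["score an own goal"],
}


def choose_lexicalisation(concept, used, rng):
    """Pick a phrase for the concept, preferring ones not used earlier."""
    options = LEXICON[concept]
    fresh = [p for p in options if p not in used]
    # Fall back to the full set once every alternative has been used.
    choice = rng.choice(fresh or options)
    used.add(choice)
    return choice


rng = random.Random(0)  # seeded so that a run can be repeated
used = set()
picks = [choose_lexicalisation("goal", used, rng) for _ in range(3)]
print(picks)
```

In this sketch the first three choices for the same concept are always distinct, whatever the seed, because used phrases are filtered out before sampling.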
One straightforward model for lexicalisation – the one assumed in Figure 1 – is to operate on preverbal messages, converting domain concepts directly into lexical items. This might be feasible in well-defined domains. More often, lexicalisation is harder, for at least two reasons \shortcite¡cf.¿Bangalore2000: First, it can involve selection between semantically similar, near-synonymous or taxonomically related words \shortcite¡e.g. animal vs dog;¿Stede1996,Edmonds2002. Second, it is not always straightforward to model lexicalisation in terms of a crisp concept-to-word mapping. One source of difficulty is vagueness, which arises, for example, with terms denoting properties that are gradable. For example, selecting the adjectives ‘wide’ or ‘tall’ based on the dimensions of an entity requires the system to reason about the width or height of similar objects, perhaps using some standard of comparison \shortcite¡since a ‘tall glass’ is shorter than a ‘short man’; cf.¿Kennedy2005a,VanDeemter2012. A similar issue has been noted in the context of presenting numerical information, such as timestamps and quantities \shortciteReiter2005,power2012generating. For example, \shortciteAReiter2005 discussed time expressions in the context of weather-forecast generation, pointing out that a timestamp 00:00 could be expressed as late evening, midnight, or simply evening \shortcite[p. 143]Reiter2005. Not surprisingly, humans (including the professional forecasters who contributed to \citeauthorReiter2005’s evaluation) show considerable variation in their lexical choices.
It is interesting to note that many issues related to lexicalisation have also been discussed in the psycholinguistic literature on lexical access \shortciteLevelt1989,Levelt1999:lexical. Among these is the question of how speakers home in on the right word and under what conditions they are liable to make errors, given that the mental lexicon is a densely connected network in which lexical items are connected at multiple levels (semantic, phonological, etc). This has also been a fruitful topic for computational modelling \shortcite¡e.g.,¿Levelt1999:lexical. In contrast to cognitive modelling approaches, however, research in nlg increasingly views lexicalisation as part of surface realisation (discussed below) \shortcite¡a similar observation is made by¿[p.351]Mellish1998a. A fundamental contribution in this context is by \shortciteAElhadad1997, who describe a unification-based approach, unifying conceptual representations (i.e., preverbal messages) with grammar rules encoding lexical as well as syntactic choices.
2.5 Referring expression generation
Referring Expression Generation (reg) is characterised by \shortciteA[p.11]Reiter1997 as “the task of selecting words or phrases to identify domain entities”. This characterisation suggests a close similarity to lexicalisation, but \shortciteAReiter2000 point out that the essential difference is that referring expression generation is a “discrimination task, where the system needs to communicate sufficient information to distinguish one domain entity from other domain entities”. reg is among the tasks within the field of automated text generation that have received the most attention in recent years \shortciteMellish2006,Siddharthan2011. Since it can be separated relatively easily from a specific application domain and studied in its own right, various ‘standalone’ solutions for the reg problem exist.
In our running example, the three bradycardia events shown in Figure 0(b) are later represented as a set of three entities under the theme argument of be, following lexicalisation (Figure 0(c)). How the system refers to them will depend, among other things, on whether they have already been mentioned (in which case, a pronoun or definite description might work) and if so, whether they need to be distinguished from any other similar entities (in which case, they might need to be distinguished by some properties, such as the time when they occurred).
The first choice is therefore related to referential form: whether entities are referred to using a pronoun, a proper name or an (in)definite description, for example. This depends partly on the extent to which the entity is ‘in focus’ or ‘salient’ \shortcite¡see e.g.,¿Poesio2004 and indeed such notions underlie many computational accounts of pronoun generation \shortcite¡e.g.,¿McCoy1999,Callaway2002,Kibble2004. Choosing referential forms has recently been the topic of a series of shared tasks on the Generation of Referring Expressions in Context \shortcite¡grec;¿Belz2010, using data from Wikipedia articles, which included choices such as reflexive pronouns and proper names. Many systems participating in this challenge framed the problem in terms of classification among these many options. Still, it is probably fair to say that much work on referential form has focussed on when to use pronouns. Forms such as proper names remain understudied, although recently various researchers have highlighted the problems of proper name generation \shortciteSiddharthan2011,deemter2016designing,ferreira2017generating.
Determining the referential content usually comes into play when the chosen form is a description. Typically, there are multiple entities which have the same referential category or type in a domain (more than one player, for example, or several bradycardias). As a result, other properties of the entity will need to be mentioned if it is to be identified by the reader or hearer. Earlier reg research often worked with simple visual domains, such as Figure 1(a) or its corresponding tabular representation, taken from the gre3d corpus \shortciteViethen2008. In this example, the reg content selection problem is to find a set of properties for a target (say ) that singles it out from its two distractors ( and ).
reg content determination algorithms can be thought of as performing a search through the known properties of the referent for the ‘right’ combination that will distinguish it in context. What constitutes the ‘right’ combination depends on the underlying theory. Too much information in the description (as in the small blue ball before the large green cup) might be misleading or even boring; too little (the ball) might hinder identification. Much work on reg has appealed to the Gricean maxim stating that speakers should make sure that their contributions are sufficiently informative for the purposes of the exchange, but not more so \shortciteGrice1975. This maxim has been given a number of algorithmic interpretations, including:
Conducting an exhaustive search through the space of possible descriptions and choosing the smallest set of properties that will identify the target referent, the strategy incorporated by the Full Brevity procedure \shortciteDale1989. In our example domain, this would select size.
Selecting properties incrementally, but choosing the one which rules out most distractors at each step, thereby minimising the possibility of including information that isn’t directly relevant to the identification task. This is the underlying idea of the Greedy Heuristic algorithm \shortciteDale1989,Dale1992, and it has more recently been revived in stochastic utility-based models such as \shortciteAFrank2009. In our example scene, such an algorithm would once again consider size first.
Selecting properties incrementally, but based on domain-specific preference or cognitive salience. This is the strategy incorporated in the Incremental Algorithm \shortciteDale1995, which would predict that color should be preferred over size in our example.
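The three strategies listed above can be contrasted in a minimal sketch. The toy domain below is hypothetical (it is not the gre3d scene), but it is constructed so that size alone distinguishes the target while colour rules out only one distractor. The function names are our own, and the Incremental Algorithm is simplified: it takes the preference order as a parameter and omits details of the published algorithm, such as the special treatment of the type attribute.

```python
from itertools import combinations

# Hypothetical toy domain: d1 is the target, d2 and d3 are distractors.
DOMAIN = {
    "d1": {"type": "ball", "colour": "green", "size": "small"},
    "d2": {"type": "ball", "colour": "red", "size": "large"},
    "d3": {"type": "ball", "colour": "green", "size": "large"},
}
ATTRIBUTES = ["type", "colour", "size"]


def ruled_out(attr, target, distractors, domain):
    """Distractors whose value for attr differs from the target's."""
    return {d for d in distractors if domain[d].get(attr) != domain[target].get(attr)}


def full_brevity(target, domain, attributes):
    """Exhaustive search for a smallest distinguishing attribute set."""
    distractors = set(domain) - {target}
    for k in range(1, len(attributes) + 1):
        for combo in combinations(attributes, k):
            remaining = set(distractors)
            for attr in combo:
                remaining -= ruled_out(attr, target, remaining, domain)
            if not remaining:
                return list(combo)
    return None  # no distinguishing description exists


def greedy_heuristic(target, domain, attributes):
    """At each step, add the attribute that rules out most distractors."""
    remaining = set(domain) - {target}
    candidates, chosen = list(attributes), []
    while remaining and candidates:
        best = max(candidates, key=lambda a: len(ruled_out(a, target, remaining, domain)))
        removed = ruled_out(best, target, remaining, domain)
        if not removed:
            return None  # no remaining attribute helps
        chosen.append(best)
        remaining -= removed
        candidates.remove(best)
    return chosen if not remaining else None


def incremental_algorithm(target, domain, preference_order):
    """Walk a fixed preference order, keeping any attribute that helps."""
    remaining = set(domain) - {target}
    chosen = []
    for attr in preference_order:
        removed = ruled_out(attr, target, remaining, domain)
        if removed:
            chosen.append(attr)
            remaining -= removed
        if not remaining:
            break
    return chosen if not remaining else None
```

In this toy domain, `full_brevity` and `greedy_heuristic` both return `["size"]`, whereas `incremental_algorithm("d1", DOMAIN, ["colour", "size"])` returns `["colour", "size"]`: colour is included because it is preferred and rules out one distractor, even though size alone would have sufficed.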
While these heuristics focus exclusively on the requirement that a referent be unambiguously identified, research on reference in dialogue \shortcite¡e.g.,¿Jordan2005 has shown that under certain conditions, referring expressions may also include ‘redundant’ properties in order to achieve other communicative goals, such as confirmation of a previous utterance by an interlocutor. Similarly, \shortciteAWhite-Clark-Moore:2010 present a system which generates user-tailored descriptions in spoken dialogue, arguing that, for example, a frequent flyer would prefer different descriptions of flights than a student who only flies occasionally.
These various algorithms compute (possibly different) distinguishing descriptions for target referents (more precisely: they select sets of properties that distinguish the target, but that still need to be expressed in words; see Section 2.6 below). Various strands of more recent work can be distinguished \shortcite¡surveyed in¿Krahmer2012. Some researchers have focussed on extending the expressivity of the ‘classical’ algorithms, to include plurals (the two balls) and relations (the ball in front of a cube) \shortcite¡e.g.,¿[among many others]Horacek1997,Stone2000,Gardent2002,Kelleher2006,Viethen2008. Other work has cast the problem in probabilistic terms; for example, \shortciteAFitzgerald2013 frame reg as a problem of estimating a log-linear distribution over a space of logical forms representing expressions for sets of objects. Other work has concentrated on evaluating the performance of different reg algorithms, by collecting controlled human references and comparing these with the references predicted by various algorithms \shortcite¡e.g.,¿[again among many others]Belz2008,Gatt2010,Jordan2005. In a similar vein, researchers have also started exploring the relevance of reg algorithms as psycholinguistic models of human language production \shortcite¡e.g.,¿VanDeemter2012a.
A different line of work has moved away from the separation between content selection and form, performing these tasks jointly. For example, \shortciteAEngonopoulos2014 use a synchronous grammar that directly relates surface strings to target referents, using a chart to compute the possible expressions for a given target. This work bears some relationship to planning-based approaches we discuss in Section 3.2 below, which exploit grammatical formalisms as planning operators \shortcite¡e.g.¿Stone1998,Koller2007, solving realisation and content determination problems in tandem (including reg as a special case).
Finally, in earlier work visual information was typically ‘simplified’ into a table (as we did above), but there has been substantial progress on reg in more complex scenarios. For example, the give challenge \shortcitekoller2010report provided impetus for the exploration of situated reference to objects in a virtual environment \shortcite¡see also¿Stoia2006,Garoufi2013. More recent work has started exploring the interface between computer vision and reg to produce descriptions of objects in complex, realistic visual scenes, including photographs \shortcite¡e.g.,¿Mitchell2013,Kazemzadeh2014,Mao2016. This forms part of a broader set of developments focussing on the relationship between vision and language, which we turn to in Section 4.
2.6 Linguistic realisation
Finally, when all the relevant words and phrases have been decided upon, these need to be combined to form a well-formed sentence. The simple example in Figure 0(d) shows the structure underlying the sentence there were three successive bradycardias down to 69, the linguistic message corresponding to the portion selected from the original signal in Figure 0(a).
Usually referred to as linguistic realisation, this task involves ordering constituents of a sentence, as well as generating the right morphological forms (including verb conjugations and agreement, in those languages where this is relevant). Often, realisers also need to insert function words (such as auxiliary verbs and prepositions) and punctuation marks. An important complication at this stage is that the output needs to include various linguistic components that may not be present in the input (an instance of the ‘generation gap’ discussed in Section 3.1 below); thus, this generation task can be thought of in terms of projection between non-isomorphic structures \shortcite¡cf.¿Ballesteros2015. Many different approaches have been proposed, of which we will discuss three:
human-crafted templates;
human-crafted grammar-based systems;
statistical approaches.
2.6.1 Templates
When application domains are small and variation is expected to be minimal, realisation is a relatively easy task, and outputs can be specified using templates \shortcite¡e.g.,¿Reiter1995,mcroy2003augmented, such as the following.
$player scored for $team in the $minute minute.
This template has three variables, which can be filled with the names of a player, a team, and the minute in which this player scored a goal. It can thus serve to generate sentences like:
Ivan Rakitic scored for Barcelona in the 4th minute.
An advantage of templates is that they allow for full control over the quality of the output and avoid the generation of ungrammatical structures. Modern variants of the template-based method include syntactic information in the templates, as well as possibly complex rules for filling the gaps \shortciteTheune2001, making it difficult to distinguish templates from more sophisticated methods \shortciteVanDeemter2005. The disadvantage of templates is that they are labour-intensive if constructed by hand \shortcite¡though templates have recently been learned automatically from corpus data, see e.g.,¿[ and the discussion in Section 3.3 below]angeli2012parsing,Kondadadi2013. They also do not scale well to applications which require considerable linguistic variation.
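The soccer template above happens to use the same `$variable` notation as Python's standard `string.Template` class, so template filling can be sketched directly. The helper names `ordinal` and `realise_goal` are hypothetical, introduced only for this illustration.

```python
from string import Template

# The template from the text, with three variables.
GOAL_TEMPLATE = Template("$player scored for $team in the $minute minute.")


def ordinal(n):
    """Render an integer as an English ordinal, e.g. 4 -> '4th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def realise_goal(player, team, minute):
    """Fill the template slots; raises KeyError if a slot is left unfilled."""
    return GOAL_TEMPLATE.substitute(player=player, team=team, minute=ordinal(minute))


sentence = realise_goal("Ivan Rakitic", "Barcelona", 4)
```

Even this small example needs a rule for filling a gap (the ordinal suffix), which hints at why richer template systems become hard to distinguish from more sophisticated realisation methods.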
2.6.2 Hand-coded grammar-based systems
An alternative to templates is provided by general-purpose, domain-independent realisation systems. Most of these systems are grammar-based, that is, they make some or all of their choices on the basis of a grammar of the language under consideration. This grammar can be manually written, as in many classic off-the-shelf realisers such as fuf/surge \shortciteElhadad1996, mumble \shortciteMeteer1987, kpml \shortciteBateman1997, nigel \shortciteMann1983, and RealPro \shortciteLavoie1997. Hand-coded grammar-based realisers tend to require very detailed input. For example, kpml \shortciteBateman1997 is based on Systemic-Functional Grammar \shortcite¡sfg; ¿Halliday2004, and realisation is modelled as a traversal of a network in which choices depend on both grammatical and semantico-pragmatic information. This level of detail makes these systems difficult to use as simple ‘plug-and-play’ or ‘off the shelf’ modules \shortcite¡e.g.,¿Kasper1989, something which has motivated the development of simple realisation engines which provide syntax and morphology apis, but leave choice-making up to the developer \shortciteGatt2009,Vaudry2013,Bollmann2011,DeOliveira2014.
One difficulty for grammar-based systems is how to make choices among related options, such as the following, where hand-crafted rules with the right sensitivity to context and input are difficult to design:
Ivan Rakitic scored for Barcelona in the 4th minute.
For Barcelona, Ivan Rakitic scored in minute four.
Barcelona player Ivan Rakitic scored after four minutes.
2.6.3 Statistical approaches
Recent approaches have sought to acquire probabilistic grammars from large corpora, cutting down on the amount of manual labour required, while increasing coverage. Essentially, two approaches have been taken to include statistical information in the realisation process. One approach, introduced by the seminal work of Langkilde and Knight \shortciteLangkilde2000,Langkilde-Geary2002 on the halogen/nitrogen systems, relies on a two-level approach, in which a small, hand-crafted grammar is used to generate alternative realisations represented as a forest, from which a stochastic re-ranker selects the optimal candidate. Langkilde and Knight rely on corpus-based statistical knowledge in the form of n-grams, whereas others have experimented with more sophisticated statistical models to perform reranking \shortcite¡e.g.,¿Bangalore2000,Ratnaparkhi2000,cahill2007stochastic. The second approach does not rely on a computationally expensive generate-and-filter approach, but uses statistical information directly at the level of generation decisions. An example of this approach is the pcru system developed by \shortciteABelz2008, which generates the most likely derivation of a sentence, given a corpus, using a context-free grammar. In this case, the statistics are exploited to control the generator’s choice-making behaviour as it searches for the optimal solution.
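The generate-and-rank idea behind the first approach can be sketched as follows. This is only an illustration under strong assumptions: the candidate realisations are listed by hand rather than produced by a grammar and packed into a forest, the training corpus is a two-sentence toy, and the ranker is a bigram model with add-one smoothing. None of this reproduces the actual halogen/nitrogen implementation, and the names `BigramModel` and `rerank` are hypothetical.

```python
import math
from collections import Counter


def bigrams(sentence):
    """Lower-cased bigrams with sentence boundary markers."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        vocab = set()
        for sentence in corpus:
            for w1, w2 in bigrams(sentence):
                self.bigram_counts[(w1, w2)] += 1
                self.context_counts[w1] += 1
                vocab.update((w1, w2))
        self.vocab_size = len(vocab)

    def log_prob(self, sentence):
        total = 0.0
        for w1, w2 in bigrams(sentence):
            numerator = self.bigram_counts[(w1, w2)] + 1
            denominator = self.context_counts[w1] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def rerank(candidates, model):
    """Select the candidate realisation the model finds most probable."""
    return max(candidates, key=model.log_prob)


# Toy corpus and hand-listed candidates (assumptions for illustration).
CORPUS = [
    "Messi scored for Barcelona in the 10th minute",
    "Suarez scored for Barcelona in the 4th minute",
]
CANDIDATES = [
    "For Barcelona Rakitic scored in minute four",
    "Rakitic scored for Barcelona in the 4th minute",
    "In the 4th minute scored Rakitic for Barcelona",
]
best = rerank(CANDIDATES, BigramModel(CORPUS))
```

Here the ranker prefers the candidate whose word order matches the corpus. Note that unnormalised log-probabilities favour shorter strings, so a practical ranker would typically normalise for length or use a richer model.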
In both approaches, the base generator is hand-crafted, while statistical information is used to filter outputs. An obvious alternative would be to also rely on statistical information for the base-generation system. Fully data-driven grammar-based approaches have been developed by acquiring grammatical rules from treebanks. For example, the Openccg framework \shortcitehypertagging:acl08,white-rajkumar:2009:EMNLP,deplen:2012:EMNLP presents a broad coverage English surface realizer, based on Combinatory Categorial Grammar \shortcite¡ccg; ¿Steedman2000, relying on a corpus of ccg representations derived from the Penn Treebank \shortciteHockenmaier2007 and using statistical language models for re-ranking. There are several other approaches to realisation that adopt a similar rationale, based on a variety of grammatical formalisms, including Head-Driven Phrase Structure Grammar \shortcite¡hpsg; ¿Nakanishi2005,Carroll2005, Lexical-Functional Grammar \shortcite¡lfg; ¿Cahill2006 and Tree Adjoining Grammar \shortcite¡tag; ¿Gardent2015. In many of these systems, the base generator uses some variant of the chart generation algorithm \shortciteKay1996 to iteratively realise parts of an input specification and merge them into one or more final structures, which can then be ranked \shortcite¡see¿[for further discussion]Rajkumar2014. The existence of stochastic realisers with wide-coverage grammars has motivated a greater focus on subtle choices, such as how to avoid structural ambiguity, or how to handle choices such as explicit complementiser insertion in English \shortcite¡see e.g.,¿Rajkumar2011. In a somewhat similar vein, the statistical approach to microplanning proposed by \shortciteAgardent2017statistical focuses on interactions between surface realization, aggregation, and sentence segmentation in a joint model.
Other approaches to realisation also rely on one or more classifiers to improve outputs. For example, \shortciteAFilippova2007,Filippova2009 describe an approach to linearisation of constituents using a two-step approach with Maximum Entropy classifiers, first determining which constituent should occupy sentence-initial position, then ordering the constituents in the remainder of the sentence. \shortciteABohnet2010 present a realiser using underspecified dependency structures as input, in a framework based on Support Vector Machines, where classifiers are organised in a cascade. An initial classifier decodes semantic input into the corresponding syntactic features, while two subsequent classifiers first linearise the syntax and then render the correct morphological realisation for the component lexemes.
Modelling choices using classifier cascades is not restricted to realisation alone. Indeed, in some cases, it has been adopted as a model for the nlg process as a whole, a topic we will return to in Section 3.3.3. One outcome of this view of nlg is that the nature of the input representation also changes: the more decisions that are made within the statistical generation system, the less linguistic and more abstract the input representation becomes, paving the way for integrated, end-to-end stochastic generation systems, such as \shortciteAKonstas2013, which we also discuss in the next section.
This section has given an overview of some classic tasks that are found in most nlg systems. One of the common trends that can be identified in each case is the steady move from early, hand-crafted approaches based on rules, to the more recent stochastic approaches that rely on corpus data, with a concomitant move towards more domain-independent approaches. Historically, this was the case already for individual tasks, such as referring expression generation or realisation, which became topics of intensive research in their own right. However, as more and more approaches to all nlg tasks begin to take a statistical turn, there is increasing emphasis on learning techniques; the domain-specific aspect is, as it were, incidental, a property of the training data itself. As we shall see in the next section, this trend has also influenced the way different nlg tasks are organised, that is, the architecture of systems for text generation from data.
3 NLG Architectures and Approaches
Having given an overview of the most common sub-tasks that nlg systems incorporate, we now turn to the way such tasks can be organised. Broadly speaking, we can distinguish between three dominant approaches to nlg architectures:
Modular architectures: By design, such architectures involve fairly crisp divisions among sub-tasks, though with significant variations among them;
Planning perspectives: Viewing text generation as planning links it to a long tradition in ai and affords a more integrated, less modular perspective on the various sub-tasks of nlg;
Integrated or global approaches: Now the dominant trend in nlg (as it is in nlp more generally), such approaches cut across task divisions, usually by placing a heavy reliance on statistical learning of correspondences between (non-linguistic) inputs and outputs.
The above typology of nlg is based on architectural considerations. An orthogonal question concerns the extent to which a particular approach relies on symbolic or knowledge-based methods, as opposed to stochastic, data-driven methods. It is important to note that none of the three architectural types listed above is inherently committed to one or the other of these. Thus, it is possible for a system to have a modular design but incorporate stochastic methods in several, or even all, sub-tasks. Indeed, our survey of the various tasks in Section 2 included several examples of stochastic approaches. Below, we will also discuss a number of data-driven systems whose design is arguably modular. Similarly, it is possible for a system to take a non-modular perspective, but eschew the use of data-driven models (this is a feature of some planning-based nlg systems discussed in Section 3.2 below, for instance).
The fact that many modular nlg systems are not data-driven is largely due to historical reasons since, of the three designs outlined above, the modular one is the oldest. As we will show below, however, challenges to the classical modular pipeline architecture – once designated by \shortciteAReiter1994 as the consensus at the time – have included blackboard and revision-based architectures that were not stochastic. At the same time, it must be acknowledged that the large-scale adoption of integrated, non-modular approaches has been impacted significantly by the uptake of data-driven techniques within the nlg community and the development of repositories of data to support training and evaluation.
In summary, there are at least two orthogonal ways of classifying nlg systems, based on their design or on the methods adopted in their development. Our survey in this section follows the typology outlined above for convenience of exposition. The caveats raised here should, however, be borne in mind by the reader, and will in any case be brought up repeatedly in what follows, as we discuss different approaches under each heading.
3.1 Modular approaches
Existing surveys of nlg, including \shortciteAReiter1997,Reiter2000 and \shortciteAReiter2010 typically refer to some version of the pipeline architecture displayed in Figure 3 as the ‘consensus’ architecture in the field. Originally introduced by \shortciteAReiter1994, the pipeline was a generalisation based on actual practice and was claimed to have the status of a ‘de facto standard’. This, however, has been contested repeatedly, as we shall see.
Different modules in the pipeline incorporate different subsets of the tasks described in Section 2. The first module, the Text Planner (or Document Planner, or Macroplanner), combines content selection and text structuring (or document planning). Thus, it is concerned mainly with strategic generation \shortciteMcDonald1993, the choice of ‘what to say’. The resulting text plan, a structured representation of messages, is the input to the Sentence Planner (or microplanner), which typically combines sentence aggregation, lexicalisation and referring expression generation \shortciteReiter2000. If text planning amounts to deciding what to say, sentence planning can be understood as deciding how to say it. All that remains then is to actually say it, i.e., generate the final sentences in a grammatically correct way, by applying syntactic and morphological rules. This task is performed by the Linguistic Realiser. Together, sentence planning and realisation encompass the set of tasks traditionally referred to as tactical generation.
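The division of labour among the three modules can be made concrete with a deliberately simplified sketch. All function names, the event format and the reduced-reference rule below are hypothetical choices for illustration; real systems use far richer representations at each interface.

```python
def text_planner(events):
    """Strategic generation ('what to say'): select and order messages."""
    goals = [e for e in events if e["type"] == "goal"]
    return sorted(goals, key=lambda e: e["minute"])


def sentence_planner(messages):
    """Microplanning ('how to say it'): lexicalisation and reference."""
    mentioned = set()
    specs = []
    for m in messages:
        player = m["player"]
        # Referring expression: full name first, surname on later mentions.
        subject = player.split()[-1] if player in mentioned else player
        mentioned.add(player)
        # Lexicalisation: the goal event is mapped to the verb 'score'.
        specs.append({"subject": subject, "verb": "score",
                      "team": m["team"], "minute": m["minute"]})
    return specs


def realiser(specs):
    """Linguistic realisation: morphology, function words, punctuation."""
    past_tense = {"score": "scored"}
    sentences = [
        f'{s["subject"]} {past_tense[s["verb"]]} for {s["team"]} in minute {s["minute"]}.'
        for s in specs
    ]
    return " ".join(sentences)


def generate(events):
    """Run the three modules in strict sequence, with no feedback."""
    return realiser(sentence_planner(text_planner(events)))


EVENTS = [
    {"type": "goal", "player": "Ivan Rakitic", "team": "Barcelona", "minute": 60},
    {"type": "foul"},
    {"type": "goal", "player": "Ivan Rakitic", "team": "Barcelona", "minute": 4},
]
text = generate(EVENTS)
```

Information flows strictly one way in `generate`: the realiser cannot ask the text planner to reconsider a choice, which is the property that the alternative architectures discussed in this section relax.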
The pipeline architecture shares some characteristics with a widely-used architecture in text summarisation \shortciteMani2001,Nenkova2011, where the process is sub-divided into (a) analysis of source texts and selection of information; (b) transformation of the selected information to enhance fluency; and (c) synthesis of the summary.
A second related architecture, which was also noted by \shortciteAReiter1994, is that proposed in psycholinguistics for human speech production, where the most influential psycholinguistic model of language production, proposed by \shortciteALevelt1989,Levelt1999, makes a similar distinction between deciding what to say and determining how to say it. Levelt’s model allows for a limited degree of self-monitoring through feedback loops, a feature that is absent in Reiter’s nlg pipeline, but continues to play an important role in psycholinguistics \shortcite¡cf.¿Pickering2013, though here too there has been increasing emphasis on more integrated models.
A hallmark of the architecture in Figure 3 is that it represents clear-cut divisions among tasks that are traditionally considered to belong to the ‘what’ (strategic) and the ‘how’ (tactical). However, this does not imply that this division is universally accepted in practice. In an earlier survey, \shortciteAMellish2006 concluded that while several nlg systems incorporate many of the core tasks outlined in Section 2, their organisation varies considerably from system to system. Indeed, some tasks may be split up across modules. For example, the content determination part of referring expression generation might be placed in the sentence planner, but decisions about form (such as whether to use an anaphoric np, and if so, what kind of np to produce) may have to wait until at least some realisation-related decisions have been taken. Based on these observations, \citeauthorMellish2006 proposed an alternative formalism, the ‘objects-and-arrows’ framework, within which different types of information flow between nlg sub-tasks can be accommodated. Rather than offering a specific architecture, this framework was intended as a formalism within which high-level descriptions of different architectures can be specified. However, it retains the principle that the tasks, irrespective of their organisation, are relatively well-defined and distinguished.
Another recent development in relation to the pipeline architecture in Figure 3 is a proposal by \shortciteAReiter2007 to accommodate systems in which input consists of raw (often numeric) data that requires some preprocessing before it can undergo the kind of selection and planning that the Text Planner is designed to execute. The main characteristic of these systems is that input is unstructured, in contrast to systems which operate over logical forms, or database entries. Examples of application domains where this is the case include weather reporting \shortcite¡e.g.,¿Goldberg1994,Buseman1997,Coch1998,Turner2008a,Sripada2003,Ramos-Soto2015, where the input often takes the form of numerical weather predictions; and generation of summaries from patient data \shortcite¡e.g.,¿Hueske-kraus2003,Harris2008,Gatt2009,Banaee2013. In such cases, nlg systems often need to perform some form of data abstraction (for example, identifying broad trends in the data), followed by data interpretation. The techniques used to perform these tasks range from extensions of signal processing techniques \shortcite¡e.g.,¿Portet2009 to the application of reasoning formalisms based on fuzzy set theory \shortcite¡e.g.,¿Ramos-Soto2015. \shortciteAReiter2007’s proposal accommodates these steps by extending the pipeline ‘backwards’, incorporating stages prior to Text Planning.
Notwithstanding its elegance and simplicity, there are challenges associated with a pipeline nlg architecture, of which two are particularly worth highlighting:
The generation gap \shortciteMeteer1991 refers to mismatches between strategic and tactical components, so that early decisions in the pipeline have unforeseen consequences further downstream. To take an example from \shortciteAInui1992, a generation system might determine a particular sentence ordering during the sentence planning stage, but this might turn out to be ambiguous once sentences have actually been realised and orthography has been inserted;
Generating under constraints: Itself perhaps an instance of the generation gap, this problem can occur when the output of a system has to match certain requirements, for example, it cannot exceed a certain length \shortcite¡see¿[for discussion]Reiter2000a. Formalising this constraint might appear possible at the realisation stage – by stipulating the length constraint in terms of number of words or characters, for instance – but it is much harder at the earlier stages, where the representations are pre-linguistic and their mapping to the final text is potentially unpredictable.
These, and related problems, motivated the development of alternative architectures. For instance, some early nlg systems were based on an interactive design, in which a module’s initially incomplete output could be fleshed out based on feedback from a later module \shortcite¡the pauline system is an example of this;¿Hovy1988. An even more flexible stance is taken in blackboard architectures, in which task-specific procedures are not rigidly pre-organised, but perform their tasks reactively as the output, represented in a data structure shared between tasks, evolves \shortcite¡e.g.,¿Nirenburg1989. Finally, revision-based architectures allow a limited form of feedback between modules under monitoring, with the possibility of altering choices which prove to be unsatisfactory \shortcite¡e.g.,¿Mann1981,Inui1992. This has the advantage of not requiring ‘early’ modules to be aware of the consequences of their choices for subsequent modules, since something that goes wrong can always be revised \shortciteInui1992. Revision need not be carried out exclusively to rectify shortcomings. For instance, \shortciteARobin1993 used revision in the context of sports summaries; an initial draft was revised to add historical background information that was made relevant by the events reported in the draft, also taking decisions as to where to place them in relation to the main text. The price that all of these alternatives potentially incur is, of course, a reduction in efficiency, as noted by \shortciteASmedt1996.
Despite early criticisms of the modular approach, the strategic versus tactical division continues to influence recent data-driven approaches to nlg, including a number of those discussed in Sections 3.3 and 3.3.5 below \shortcite¡e.g.¿[among others]Dusek2015,Dusek2016.
However, other alternatives to pipelines often end up blurring the boundaries between modules in the nlg system. This is a feature that is more evident in some planning-based and integrated approaches proposed in recent years. It is to these that we now turn.
3.2 Planning-based approaches
In ai, the planning problem can be described as the process of identifying a sequence of one or more actions to satisfy a particular goal. An initial goal can be decomposed into sub-goals, satisfied by actions each of which has its preconditions and effects. In the classical planning paradigm \shortcite¡strips;¿Fikes1971, actions are represented as tuples of such preconditions and effects.
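To make the operator representation concrete, the following is a minimal sketch of a strips-style planner using breadth-first forward search. The toy dialogue domain, the action names and the `plan` helper are hypothetical and serve only to illustrate how preconditions and effects drive the search; they are not taken from any system discussed here.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    preconditions: frozenset   # facts that must hold before the action
    add_effects: frozenset     # facts made true by the action
    delete_effects: frozenset  # facts made false by the action


def plan(initial, goal, actions):
    """Breadth-first forward search; returns a list of action names or None."""
    start, goal = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for act in actions:
            if act.preconditions <= state:
                nxt = (state - act.delete_effects) | act.add_effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [act.name]))
    return None


# Hypothetical toy domain: a system that must recommend a restaurant.
ACTIONS = [
    Action("greet", frozenset(), frozenset({"greeted"}), frozenset()),
    Action("ask_cuisine", frozenset({"greeted"}),
           frozenset({"knows_cuisine"}), frozenset()),
    Action("recommend", frozenset({"knows_cuisine"}),
           frozenset({"recommended"}), frozenset()),
]

steps = plan(set(), {"recommended"}, ACTIONS)
print(steps)
```

The goal is decomposed implicitly: `recommend` cannot fire until the effects of `ask_cuisine` hold, which in turn depends on `greet`.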
The connection between planning and nlg lies in that text generation can be viewed as the execution of planned behaviour to achieve a communicative goal, where each action leads to a new state, that is, a change in a context that includes both the linguistic interaction or discourse history to date, but also the physical or situated context and the user’s beliefs and actions \shortcite¡see¿[for some recent perspectives on this topic]Lemon2008,Rieser2009,Dethlefs2014,Garoufi2013,Garoufi2014. This perspective on nlg is therefore related to the view of ‘language as action’ \shortciteClark1996a, itself rooted in a philosophical tradition inaugurated by the work of \shortciteAAustin1962 and \shortciteASearle1969. Indeed, some of the earliest ai work in this tradition \shortcite¡especially¿Cohen1979,Cohen1985 sought an explicit formulation of preconditions (akin to Searle’s felicity conditions) for speech acts and their consequences.
Given that there is in principle no restriction on what types of actions can be incorporated in a plan, it is possible for plan-based approaches to nlg to cut across the boundaries of many of the tasks that are normally encapsulated in the classic pipeline architecture, combining both tactical and strategic elements by viewing the problems of what to say and how to say it as part and parcel of the same set of operations. Indeed, there are important precedents in early work for a unified view of nlg as a hierarchy of goals, the kamp system \shortciteAppelt1985 being among the best known examples. For instance, to generate referring expressions in kamp, the starting point was reasoning about interlocutors’ beliefs and mutual knowledge, whereupon the system generated sub-goals that percolated all the way down to property choice and realisation, finally producing a referential np whose predicted effect was to alter the hearer’s belief state about the referent \shortcite¡see¿[for a similar approach to the generation of referring expressions in dialogue]Heeman1995.
One problem with these perspectives, however, is that deep reasoning about beliefs, desires and intentions \shortcite¡or bdi, as it is often called following the work of¿Bratman1987 requires highly expressive formalisms and incurs considerable computational expense. One solution is to avoid general-purpose reasoning formalisms and instead adapt a linguistic framework to the planning paradigm for nlg.
3.2.1 Planning through the grammar
The idea of interpreting linguistic formalisms in planning terms is again prefigured in early nlg work. For example, some early systems \shortcite¡e.g. kpml, which we briefly discussed in the context of realisation in Section 2.6;¿Bateman1997 were based on Systemic-Functional Grammar \shortcite¡sfg; ¿Halliday2004, which can be seen as a precursor to contemporary planning-based approaches, since sfg models linguistic constructions as the outcome of a traversal through a decision network that extends backwards to pragmatic intentions. In a similar vein, both \shortciteAHovy1991 and \shortciteAMoore1993 interpreted the relations of Rhetorical Structure Theory \shortciteMann1988 as operators for text planning.
Some recent approaches integrate much of the planning machinery into the grammar itself, viewing linguistic structures as planning operators. This requires grammar formalisms which integrate multiple levels of linguistic analysis, from pragmatics to morpho-syntax. It is common for contemporary planning-based approaches to nlg to be couched in the formalism of Lexicalised Tree Adjoining Grammar \shortcite¡ltag; ¿Joshi1997, though other formalisms, such as Combinatory Categorial Grammar \shortciteSteedman2000 have also been shown to be adequate to the task \shortcite¡see especially¿[for an approach to generation using Discourse Combinatory Categorial Grammar]Nakatsu2010.
In an ltag, pieces of linguistic structure (so-called elementary trees in a lexicon) can be coupled with semantic and pragmatic information that specify (a) what semantic preconditions need to obtain in order for the item to be felicitously used; and (b) what pragmatic goals the use of that particular item will achieve \shortcite¡see¿[for planning-based work using ltag]Stone1998,Garoufi2013,Koller2002. As an example of how such a formalism could be deployed in a planning framework, let us focus on the task of referring to a target entity. \shortciteAKoller2007 formulated the task in a way that obviates the need to distinguish between the content determination and realisation phases \shortcite¡an approach already taken by¿Stone1998. Furthermore, they do not separate sentence planning, reg and realisation, as is done in the traditional pipeline. Consider the sentence Mary likes the white rabbit. Simplifying the formalism for ease of presentation, we can represent the lexical item likes as follows \shortcite¡this example is based on¿[albeit with some simplifications]Garoufi2014:
likes($u$, $x$, $y$) action:
preconditions:
The proposition that $x$ likes $y$ is part of the knowledge base (i.e. the statement is supported);
The current utterance $u$ can be substituted into the derivation under construction;
effects:
$u$ is now part of the derivation;
New np nodes for $x$ in agent position and $y$ in patient position have been set up (and need to be filled).
As in strips, an operator consists of preconditions and effects. Note that the preconditions associated with the lexical item require support in the knowledge base (thus making reference to the input kb, which normally would not be accessible to the realiser), and include semantic information (such as that the agent needs to be animate). Having inserted likes as the sentence’s main verb, we have two noun phrases which need to be filled by generating nps for the arguments $x$ and $y$. Rather than deferring this task to a separate reg module, \citeauthorKoller2007 build referring expressions by associating further pragmatic preconditions on the linguistic operators (elementary trees) that will be incorporated in the referential np. First, the entity must be part of the hearer’s knowledge state, since an identifying description (say, to $x$) presupposes that the hearer is familiar with it. Second, an effect of adding words to the np (such as the predicates rabbit or white) is that the phrase excludes distractors, i.e. entities of which those properties are not true. In a scenario with one human being and two rabbits, only one of which (the $y$ in our example) is white, the derivation would proceed by first updating the np corresponding to $y$ with rabbit, thereby excluding the human from the distractor set, but leaving the goal to distinguish $y$ unsatisfied (since $y$ is not the only rabbit). The addition of another predicate to the np (white) does the trick.
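The effect of each added predicate on the distractor set can be illustrated with a small sketch. This is not the planner of \citeauthorKoller2007; it only mimics, under simplifying assumptions, how adding words to the np narrows down the set of possible referents. The knowledge base, the entity names and the `build_description` function are hypothetical.

```python
# Hypothetical knowledge base: entity -> properties that hold of it.
KB = {
    "m": {"human"},
    "r1": {"rabbit", "white"},
    "r2": {"rabbit", "brown"},
}


def build_description(target, kb, candidate_properties):
    """Add properties one at a time, tracking which distractors remain."""
    distractors = set(kb) - {target}
    chosen = []
    for prop in candidate_properties:
        if not distractors:
            break  # the target is already uniquely identified
        if prop not in kb[target]:
            continue  # precondition: the property must be true of the target
        excluded = {e for e in distractors if prop not in kb[e]}
        if excluded:  # effect: the property rules out some distractors
            chosen.append(prop)
            distractors -= excluded
    return chosen, distractors


props, remaining = build_description("r1", KB, ["rabbit", "white"])
print(props, remaining)
```

With only `rabbit` available, the second rabbit would remain in the distractor set, which corresponds to the unsatisfied goal described above.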
A practical advantage to planning-based approaches is the availability of a significant number of off-the-shelf planners. Once the nlg task is formulated in an appropriate plan description language, such as the Planning Domain Definition Language \shortcite¡pddl; ¿McDermott2000, it becomes possible in principle to use any planner to generate text. However, planners remain beset by problems of efficiency. In a set of experiments on nlg tasks of differing complexity, \shortciteAKoller2011 noted that planners tend to spend significant amounts of time on preprocessing, though solutions could often be found efficiently once preprocessing was complete.
3.2.2 Stochastic planning under uncertainty using Reinforcement Learning
The approaches to planning we have discussed so far are largely rule-based and tend to view the relationship between a planned action and its consequences (that is, its impact on the context), as fixed \shortcite¡though exceptions exist, as in contingency planning, which generates multiple plans to address different possible outcomes;¿Steedman2007.
As \shortciteARieser2009 note, this view is unrealistic. Consider a system that generates a restaurant recommendation. The consequences of its output (that is, the new state it gives rise to) are subject to noise arising from several sources of uncertainty. In part, this is due to trade-offs, for example, between needing to include the right amount of information while avoiding excessive prolixity. Another source of uncertainty is the user, whose actions may not be the ones predicted by the system. An instance of Meteer’s \citeyearMeteer1991 generation gap can rear its head, for instance if a stochastic realiser renders the content of a message in an ambiguous, or excessively lengthy utterance \shortciteRieser2009, a problem that could be addressed by allowing different sub-tasks to share knowledge sources and be guided by overlapping constraints \shortcite[discussed below]Dethlefs2015.
In short, planning a good solution to reach a communicative goal could be viewed as a stochastic optimisation problem (a theme we revisit in Section 3.3.3 below). This view is shared by many recent approaches based on Reinforcement Learning \shortcite¡rl;¿Lemon2008,Rieser2009,Rieser2011, especially those that tackle nlg within a dialogue context. In this framework, generation can be modelled as a Markov decision process where states are associated with possible actions and each state-action pair is associated with a probability of moving from a state $s$ at time $t$ to a new state $s'$ at $t+1$ via action $a$. Crucially for the learning algorithm, transitions are associated with a reinforcement signal, via a reward function that quantifies the optimality of the generated output. Learning usually involves simulations in which different generation strategies or ‘policies’ – essentially, plans corresponding to possible paths through the state space – come to be associated with different rewards. The rl framework has been argued to be better at handling uncertainty in dynamic environments than supervised learning or classification, since these do not enable adaptation in a changing context \shortciteRieser2009. \shortciteARieser2011a showed that this approach is effective in optimising information presentation when generating restaurant recommendations. \shortciteAJanarthanam2014 used it to optimise the choice of information to select in a referring expression, given a user’s knowledge. The system learns to adapt its user model as the user acquires new knowledge in the course of a dialogue.
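As a rough illustration of how a policy can be learned from rewards, the sketch below runs tabular Q-learning on a toy decision process. Everything here is an assumption made for the example: the states, actions and reward values are invented, transitions are deterministic for simplicity (a real dialogue setting would be stochastic), and exploration is purely random. It is not the learning setup of any of the cited systems.

```python
import random

# Hypothetical toy MDP: (state, action) -> (next_state, reward).
TRANSITIONS = {
    ("start", "list_all"): ("done", 1.0),       # complete but prolix
    ("start", "summarise"): ("summarised", 0.0),
    ("summarised", "recommend_best"): ("done", 3.0),
    ("summarised", "stop"): ("done", 0.0),
}


def actions_in(state):
    return [a for (s, a) in TRANSITIONS if s == state]


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {sa: 0.0 for sa in TRANSITIONS}
    for _ in range(episodes):
        state = "start"
        while state != "done":
            action = rng.choice(actions_in(state))  # pure exploration
            nxt, reward = TRANSITIONS[(state, action)]
            # Value of the best action available in the next state (0 if terminal).
            best_next = max((q[(nxt, a)] for a in actions_in(nxt)), default=0.0)
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)]
            )
            state = nxt
    return q


def greedy_policy(q):
    states = {s for (s, _) in q}
    return {s: max(actions_in(s), key=lambda a: q[(s, a)]) for s in states}


q = q_learning()
policy = greedy_policy(q)
print(policy)
```

Although `list_all` yields an immediate reward, the learned policy prefers `summarise` because the discounted reward of the follow-up recommendation is higher, which is the kind of trade-off between informativeness and prolixity mentioned above.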
An important contribution of this work has been in exploring joint optimisation, where the policy learned satisfies multiple constraints arising from different sub-tasks of the generation process, by sharing knowledge across the sub-tasks. \shortciteALemon2011a showed that joint optimisation can learn a policy that determines when to generate informative utterances or queries to seek more information from a user. Similarly, \shortciteACuay2011 used hierarchical rl to jointly optimise the problems of finding a short route and describing it, while adapting to a user’s prior knowledge, giving rise to a strategy whereby the user is guided past landmarks that they are familiar with, while avoiding potentially confusing junctions. Also in a route-finding setting, \shortciteADethlefs2015 develop a hierarchical model comprising a set of learning agents whose tasks range from content selection through realisation. They show that a joint framework in which agents share knowledge outperforms an isolated learning framework in which each task is modelled separately. For example, the joint policy learns to give high-level navigation instructions, but switches to low-level instructions if the user goes off-track. Furthermore, utterances produced by the joint policy are less verbose and lead to shorter interactions overall.
The joint optimisation framework is of course not unique to Reinforcement Learning and planning-based approaches. A number of approaches discussed in earlier sections, including the work of \shortciteAMarciniak2005 and \shortciteABarzilay2005, also use joint optimisation in their approach to content determination and realisation (see Section 2.1), as does the work of \shortciteALampouras2013. We return to optimisation in Section 3.3.3 below.
In summary, nlg research within the planning paradigm has highlighted the desirability of developing unified formalisms to represent constraints on the generation process at multiple levels, whether this is done using ai-based planning formalisms \shortciteKoller2011, or stochastically via Reinforcement Learning. Among its contributions, the latter line of work has shed light on the value of (a) hierarchical relationships among sub-problems; and (b) joint optimisation of different sub-tasks. Indeed, the latter trend belongs to a much broader range of research on integrated approaches to nlg, to which we turn our attention immediately below.
3.3 Stochastic approaches to NLG
As we noted at the start of this section, whether a system is data-driven or not is independent of its architectural organisation. Indeed, some of the earliest challenges to a modular or pipeline approach described in Section 3.1 above, including revision-based and blackboard architectures, were symbolic in their methodological orientation. At the same time, the shift towards data-driven methods and the availability of data sources has given greater impetus to integrated approaches to nlg, although this shift began somewhat later than in other areas of nlp. As a result, a discussion of integrated approaches will necessarily tend to emphasise statistical methods.
In the remainder of this section, we start with an overview of methods used to acquire training data for nlg – in particular, pairings of inputs (data) and outputs (text) – before turning to an overview of techniques and frameworks. One of the themes that will emerge from this overview is that, as in the case of planning, statistical methods often take a unified or ‘global’, rather than a modularised, view of the nlg process.
3.3.1 Acquiring data
As noted in Section 2, some nlg tasks support the transition to a stochastic approach fairly easily. For example, research on realisation often exploits the existence of treebanks from which input-output correspondences can be learned. Similarly, the emergence of corpora of referring expressions representing both input domains and output descriptions \shortcite¡e.g.,¿Gatt2007a,Viethen2011a,Kazemzadeh2014,Gkatzia2015 has facilitated the development of probabilistic reg algorithms. Shared tasks have also contributed to the development of both data sources and methods (see Section 7). As we show in Section 4 below, recent work on image-to-text generation has also benefited from the availability of large datasets. For statistical, end-to-end generation in other domains, there is less of an embarrassment of riches. However, this situation is improving as methods to automatically align input data with output text are developed. Still, it is worth emphasising that many of these alignment approaches use data which is semi-structured, rather than the raw, numerical input (e.g., signals) used by the data-to-text systems that \shortciteAReiter2007, among others, drew attention to.
Currently, there are a number of data-text corpora in specific domains, notably weather forecasting \shortciteReiter2005,Belz2008,Liang2009 and sports summaries \shortciteBarzilay2005,Chen2008. These usually consist of database records paired with free text. A promising recent trend is the introduction of statistical techniques that seek to automatically segment and align such data and text \shortcite¡e.g.,¿Barzilay2005,Liang2009,Konstas2013. In an influential paper, \shortciteALiang2009 described this framework in terms of a generative model that defines a distribution $p(w \mid s)$, for sequences of words $w$ and input states $s$, with latent variables specifying the correspondence between $w$ and $s$ in terms of three main components: (i) the likelihood of database records being selected, given $s$; (ii) the likelihood of certain fields being chosen for some record; (iii) the likelihood that a string of a certain length is generated given the records, fields and states. The parameters of the model can be found using the Expectation Maximization (em) algorithm. An example alignment is shown in Figure 4.
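The three components can be read as factors of a single distribution in which the latent alignment is summed out. The following is a sketch of that factorisation, where $w$ is the word sequence and $s$ the input state; the symbols $r$ (selected records), $f$ (chosen fields) and $c$ (segment lengths) are labels assumed here for the latent variables:

\[
p(w \mid s) \;=\; \sum_{r, f, c} p(r \mid s)\, p(f \mid r)\, p(c, w \mid r, f, s)
\]

Each factor corresponds to one of the components (i) to (iii), and em is used because $r$, $f$ and $c$ are not observed in the training data.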
These models perform alignment by identifying regular co-occurrences of segments of data and text. \shortciteAkoncel2014multi go beyond this by proposing a model that exploits linguistic structure to align at varying resolutions. For example, (3.3.1) below is related to two observations in a soccer game log (an aerial pass and a miss), but can be further analysed into two sub-parts (indicated by indices 1 and 2 in our example), which individually map to these two sub-events.
(Chamakh rises highest)$_1$ and (aims a header towards goal which is narrowly wide)$_2$.
A different approach to data acquisition is described by \shortciteAMairesse2014, who use crowd-sourcing techniques to elicit realisations for semantic/pragmatic inputs describing dialogue acts in the restaurant domain \shortcite¡see¿[for another recent approach to crowd-sourcing in a similar domain]Novikova2016. The key to the success of this technique is the development of a semantics that is sufficiently transparent for use with non-specialists. In an earlier paper, \shortciteAMairesseEtAl2010 describe a method to cut down on the amount of training data required for generation by using uncertainty sampling \shortciteLewis1994, whereby a system can be trained on a relatively small amount of input data; subsequently, the learned model is applied to new data, from which the system samples the cases of which it is least certain, forwarding these to a (possibly human) oracle for feedback, which potentially leads to a new training cycle.
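The selection step in uncertainty sampling can be sketched as follows. The entropy-based uncertainty measure, the `least_certain` helper and the toy predicted distributions are assumptions made for illustration; other uncertainty measures are possible, and nothing here reproduces the specific procedure of the cited work.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def least_certain(pool, predict_proba, k=1):
    """Return the k pool items whose predicted distribution has highest entropy."""
    return sorted(pool, key=lambda x: entropy(predict_proba(x)), reverse=True)[:k]


# Hypothetical model output: distribution over two candidate realisations
# for each unlabelled input.
PREDICTIONS = {
    "inform(name=X, food=Y)": [0.9, 0.1],
    "inform(name=X, area=Z)": [0.5, 0.5],
    "request(pricerange)": [0.7, 0.3],
}

# These are the inputs that would be forwarded to the oracle for feedback.
queries = least_certain(list(PREDICTIONS), PREDICTIONS.get, k=2)
print(queries)
```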
Many of the stochastic end-to-end systems we discuss below rely on well-defined formalisms and typically need fairly precise alignments between inputs and portions of the output. One of the limitations of these approaches is that the reliance on alignment makes such systems highly domain-specific, as noted by \shortciteAAngeli2010.
More recent stochastic methods obviate the need for alignment between input data and output strings. This is the case for many systems based on neural networks \shortcite¡e.g.,¿[discussed in Section 3.3.5]Wen2015,Dusek2016,Lebret2016,Mei2016 as well as other machine-learning approaches \shortcite¡e.g.,¿Dusek2015,Lampouras2016. For example, \shortciteADusek2015 use the dialogue acts from the bagel dataset \shortciteMairesseEtAl2010 as meaning representations; the bagel reference texts are parsed using an off-the-shelf deep syntactic analyser. They define a stochastic sentence planner, a variant of the A* algorithm, which builds optimal sentence plans using a base generator and a scoring function to rank candidates. Realisation is conducted using a rule-based realiser. The approach of \shortciteALampouras2016 also uses unaligned mr-text pairs from bagel, as well as the related sf hotel and restaurant dataset by \shortciteAWen2015. Here, content determination and realisation are both conceived as classification problems (choosing an attribute from the mr, or choosing a word for the output), but are optimised jointly in an iterative training algorithm using imitation learning.
3.3.2 NLG as a sequential, stochastic process
Given an alignment between data and text, one way of modelling the nlg process is to remain faithful to the division between strategic and tactical choices, using the statistical alignment to inform content selection, while deploying nlp techniques to acquire rules, templates or schemas \shortcite¡à la ¿McKeown1985 to drive sentence planning and realisation.
Recall that the generative model of \shortciteALiang2009 pairs data to text based on a sequential, Markov process, combining strategic choices (of db records and fields) with tactical choices (of word sequences) into a single probabilistic model. In fact, Markov-based language modelling approaches continue to feature prominently in data-driven nlg. One of the earliest examples is \shortciteAOh2002 in the context of a dialogue system in the travel domain, where the input takes the form of a dialogue act (e.g. a query that the system needs to make to obtain information about the user’s travel plans) with the attributes to include (e.g. the departure city). \citeauthorOh2002’s approach encompasses both content planning and realisation. It relies on dialogue corpora annotated with utterance classes, that is, the type of dialogue act that each utterance is intended to fulfil. On this basis, they construct separate n-gram language models for each utterance class, as well as for word-classes that can appear in the input (for example, words corresponding to departure city). Content planning is handled by a model that predicts which attributes should be included in an utterance on the basis of recent dialogue history. Realisation is handled using a combination of templates and n-gram models. Thus, generation is conceived as a two-step (planning followed by realisation) process.
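A minimal sketch of class-specific language models with slot filling is given below. The utterance class, the tiny delexicalised corpus and the helper names are hypothetical. For the sake of a deterministic example, the sketch uses bigrams and greedy decoding (always taking the most frequent continuation), which is a simplification and not a description of the original system.

```python
from collections import Counter, defaultdict


def train_bigrams(utterances):
    """Count bigrams over whitespace-tokenised utterances."""
    counts = defaultdict(Counter)
    for utt in utterances:
        tokens = ["<s>"] + utt.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts


def generate(counts, max_len=20):
    """Greedy decoding: always take the most frequent continuation."""
    out, prev = [], "<s>"
    for _ in range(max_len):
        if not counts[prev]:
            break  # no known continuation for this word
        cur = counts[prev].most_common(1)[0][0]
        if cur == "</s>":
            break
        out.append(cur)
        prev = cur
    return out


# Hypothetical corpus: one utterance class, with a slot token standing in
# for the word class "departure city".
CORPUS = {
    "confirm_depart_city": [
        "leaving from {depart_city}",
        "leaving from {depart_city}",
        "you are leaving from {depart_city}",
    ]
}
MODELS = {cls: train_bigrams(utts) for cls, utts in CORPUS.items()}

words = generate(MODELS["confirm_depart_city"])
utterance = " ".join(words).format(depart_city="Pittsburgh")
print(utterance)
```

The two steps are visible in the last lines: a word sequence is produced by the model for the utterance class, and the slot is then filled with the attribute value chosen during content planning.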
The reliance on standard language models has one potential drawback, in that such models are founded on a local history assumption, limiting the extent to which prior selections can influence current choices. An alternative, discriminative model \shortcite¡known to the nlp community at least since¿Ratnaparkhi96 is logistic regression (Maximum Entropy). The foundations for this approach in nlg can be found in \shortciteARatnaparkhi2000, who focussed primarily on realisation (albeit combined with elements of sentence planning). He compared two stochastic nlg systems based on a maximum entropy learning framework, to a baseline nlg system. The first of these (nlg2 in Ratnaparkhi’s paper) uses a conditional language model that generates sentences in an incremental, left-to-right fashion, by predicting the best word given both the preceding history (as in standard n-gram models) and the semantic attributes that remain to be expressed. The second (nlg3) augments the model with syntactic dependency relations, performing generation by recursively predicting the left and right children of a given constituent. In an evaluation based on judgements of correctness, \citeauthorRatnaparkhi2000 found that the system augmented with dependencies was generally preferred.
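A conditional model of this kind can be sketched in the generic maximum entropy (log-linear) form below. The notation is assumed for illustration: $w_i$ is the word to be predicted, $h_i$ the preceding word history, $A_i$ the set of semantic attributes still to be expressed, $f_j$ binary feature functions and $\lambda_j$ their learned weights.

\[
p(w_i \mid h_i, A_i) \;=\; \frac{\exp\big(\sum_j \lambda_j f_j(w_i, h_i, A_i)\big)}{\sum_{w'} \exp\big(\sum_j \lambda_j f_j(w', h_i, A_i)\big)}
\]

Because features may inspect any part of $h_i$ and $A_i$, the model is not restricted to a fixed local window in the way a standard n-gram model is.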
In later work, \shortciteAAngeli2010 describe an end-to-end nlg system that maintains a separation between content selection, sentence planning and realisation, modelling each process as a sequence of decisions in a log-linear framework, where choices can be conditioned on arbitrarily long histories of previous decisions. This enables them to handle long-range dependencies, such as coherence relations, more flexibly \shortcite¡e.g., a model can incorporate the information that a weather report which describes wind speed should do so after mentioning wind direction; see¿[for similar insights based on global optimisation]Barzilay2005. The separation of tasks is maintained insofar as a different set of features can be used to inform decisions at each stage of the process. Sentence planning and realisation decisions are based on templates acquired from corpus texts: a template is selected based on its likelihood given the database fields selected during content selection.
\shortciteAMairesse2014 describe a different approach, which also relies on alignments between database records and text, and seeks a global solution to generation, without a crisp distinction between strategic and tactical components. In this case, the basic representational framework is a tree of the sort shown in Figure 5. The root indicates a dialogue act type (in the example, the dialogue act seeks to inform). Leaves in the tree correspond to words or word sequences, while nonterminals are semantic stacks, that is, the pieces of input to which the words correspond. In this framework, content selection and realisation can be solved jointly by searching for the optimal stack sequence for a given dialogue act, and the optimal word sequence corresponding to that stack sequence. \citeauthorMairesse2014 use a factored language model (flm), which extends n-gram models by conditioning probabilities on different utterance contexts, rather than simply on word histories. Given an input dialogue act, generation works by applying a Viterbi search through the flm at each of the following stages: (a) mandatory semantic stacks are identified for the dialogue act; (b) these are enriched with possible non-mandatory stacks (those which are not in boldface in Figure 5), usually corresponding to function words; (c) realisations are found for the stack sequence. The approach is also extended to deal with n-best realisations, as well as to handle variation, in the form of paraphrases for the same input.
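The Viterbi search used at each stage can be illustrated on a toy lattice. The sketch below is plain Viterbi decoding over candidate phrases for two positions, not a factored language model; the candidate phrases and all probabilities are invented for the example.

```python
import math


def viterbi(candidates, emit, trans):
    """Find the highest-scoring path through a lattice.

    candidates: list of option lists, one per position.
    emit(i, option) and trans(prev, cur) return log-probabilities.
    Returns (best_log_score, best_path).
    """
    best = {c: (emit(0, c), [c]) for c in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for cur in candidates[i]:
            # Best way of reaching `cur` from any option at the previous position.
            score, path = max(
                ((s + trans(prev, cur) + emit(i, cur), p)
                 for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values(), key=lambda t: t[0])


# Hypothetical candidates for two semantic stacks of an inform act.
CANDIDATES = [["X is", "X serves"], ["a Y restaurant", "Y food"]]
EMIT = [
    {"X is": 0.6, "X serves": 0.4},
    {"a Y restaurant": 0.5, "Y food": 0.5},
]
TRANS = {
    ("X is", "a Y restaurant"): 0.8,
    ("X is", "Y food"): 0.2,
    ("X serves", "a Y restaurant"): 0.1,
    ("X serves", "Y food"): 0.9,
}

score, path = viterbi(
    CANDIDATES,
    emit=lambda i, c: math.log(EMIT[i][c]),
    trans=lambda p, c: math.log(TRANS[(p, c)]),
)
print(" ".join(path))
```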
3.3.3 NLG as classification and optimisation
An alternative way to think about nlg decisions at different levels is in terms of classification, already encountered in the context of specific tasks, such as content determination \shortcite¡e.g.,¿Duboue2003 and realisation \shortcite¡e.g.,¿Filippova2007. Since generation is ultimately about choice-making at multiple levels, one way to model the process is by using a cascade of classifiers, where the output is constructed incrementally, so that any classifier $C_i$ uses as (part of) its input the output of a previous classifier $C_{i-1}$. Within this framework, it is still possible to conceive of nlg in terms of a pipeline. As \shortciteAMarciniak2005 note, an alternative way of thinking about it is in terms of a weighted, multi-layered lattice, where generation amounts to a best-first traversal: at any stage $i$, classifier $C_i$ produces the most likely output, which leads to the next stage along the most probable path. This generalisation is conceptually related to the view of nlg in terms of policies in the Reinforcement Learning framework (see Section 3.2.2 above), which define a traversal through sequences of states which may be hierarchically organised \shortcite¡as in the work of¿[for example]Dethlefs2015.
\shortciteAMarciniak2004 start from a small corpus of manually annotated texts of route descriptions, dividing generation into a series of eight classification problems, from determining the linear precedence of discourse units, to determining the lexical form of verbs and the type of their arguments. Generation decisions are taken using the instance-based KStar algorithm, which is shown to outperform a majority baseline on all classification decisions. Instance-based approaches to nlg are also discussed by \shortciteAVarges2010, albeit in an overgenerate-and-rank approach where rules overgenerate candidates, which are then ranked by comparison to the instance base.
A similar framework was recently adopted by \shortciteAZarriess2013, once again taking as their starting point textual data annotated with a dependency representation, as shown in (14) below, where referents are marked v and p and the implicit head of the dependency is underlined.
Junge Familie auf dem Heimweg ausgeraubt
Young family on the way home robbed
‘A young family was robbed on their way home.’
These authors use a sequence of classifiers to perform referring expression generation and realisation. They use a ranking model based on Support Vector Machines which, given an input dependency representation extracted from annotated text such as (14), performs two tasks in either order: (a) mapping the input to a shallow syntactic tree for linearisation; and (b) inserting referring expressions. Interestingly, \shortciteAZarriess2013 observe that the performance of either task is order-dependent, in that both classification tasks perform worse when they are second in the sequence. They observe a marginal improvement when the tasks are performed in parallel, but achieve the best performance in a revision-based architecture, where syntactic mapping is followed by referring expression insertion, followed by a revision of the syntax.
Classification cascades for nlg maintain a clean separation between tasks, but research in this area has echoed earlier concerns about pipelines in general (see Section 3.1), the main problem being error propagation. Infelicitous choices will of course impact classification further downstream, a situation analogous to the problem of the generation gap. The conclusion by \shortciteAZarriess2013 in favour of a revision-based architecture, brings our account full circle, in that a well-known solution is shown to yield improvements in a new framework.
Our discussion so far has repeatedly highlighted the fact that a sequential organisation of nlg tasks is susceptible to error propagation, whether this takes the form of classifier errors, or decisions in a rule-based module that have a negative impact on downstream components. A potential solution is to view generation as an optimisation problem, where the best combination of decisions is sought in an exponentially large space of possible combinations. We have encountered the use of optimisation techniques, such as Integer Linear Programming (ilp), in the context of aggregation and content determination (Section 2.3). For example, \shortciteABarzilay2006 group content units based on their pairwise similarity, with an optimisation step to identify a set of pairs that are maximally similar. ilp has also been exploited by \shortciteAMarciniak2004,Marciniak2005, as a means to counteract the error propagation problem in their original classification-based approach. Similar solutions have been undertaken by \shortciteALampouras2013, in the context of generating text from owl ontologies. \citeauthorLampouras2013 show that joint optimisation using Integer Linear Programming to determine content selection, lexicalisation and aggregation produces more compact verbalisations of ontology facts, compared to a pipeline system \shortcite¡which the authors presented earlier in¿Androtsopoulos2013.
Conceptually, the optimisation framework is simple:
Each nlg task is once again modelled as classification or label-assignment, but this time, labels are modelled as binary choices (either a label is assigned or not), associated with a cost function, defined in terms of the probability of a label in the training data;
Pairs of tasks which are strongly inter-dependent \shortcite¡for example, syntactic choices and reg realisations, in the example from¿Zarriess2013 have a cost based on the joint probability of their labels;
An ilp model seeks the global labelling solution that minimises the overall cost, with the added constraint that if one of a pair of correlated labels is selected, the other must be too.
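The difference between a global labelling and independent per-task decisions can be seen in a small sketch. To keep it self-contained, the example replaces the ilp solver with exhaustive search over all label combinations, which is only feasible for toy problems. The two tasks, their labels and all probabilities are invented, and the cost function (negative log-probabilities of the labels plus that of the label pair) is one simple way of instantiating the scheme described above, not the formulation of any cited paper.

```python
import math
from itertools import product

# Hypothetical label probabilities for two inter-dependent tasks.
UNARY = {
    "syntax": {"active": 0.6, "passive": 0.4},
    "reg": {"pronoun": 0.55, "full_np": 0.45},
}
# Hypothetical joint probabilities of (syntax label, reg label) pairs.
JOINT = {
    ("active", "pronoun"): 0.2,
    ("active", "full_np"): 0.4,
    ("passive", "pronoun"): 0.35,
    ("passive", "full_np"): 0.05,
}


def cost(assignment):
    """Sum of per-label costs plus the cost of the label pair."""
    unary = sum(-math.log(UNARY[t][l]) for t, l in assignment.items())
    pair = -math.log(JOINT[(assignment["syntax"], assignment["reg"])])
    return unary + pair


def joint_best():
    """Exhaustive search standing in for an ILP solver (one label per task)."""
    tasks = list(UNARY)
    options = (
        dict(zip(tasks, labels))
        for labels in product(*(UNARY[t] for t in tasks))
    )
    return min(options, key=cost)


def pipeline_best():
    """Each task independently picks its locally most probable label."""
    return {t: max(p, key=p.get) for t, p in UNARY.items()}


print(joint_best(), pipeline_best())
```

Here the independent decisions select `pronoun`, the locally most probable label for reg, whereas the global solution prefers `full_np` because it combines better with the chosen syntactic label.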
Optimisation solutions have been shown to outperform different versions of the classification pipeline \shortcite¡e.g., that of¿Marciniak2004, much as the results of \shortciteADethlefs2015, discussed above, showed that reinforcement learning of a joint policy produces better dialogue interactions than learning isolated policies for separate nlg tasks. The imitation learning framework of \shortciteALampouras2016 (discussed earlier in Section 3.3.1), which seeks to jointly optimise content determination and realisation, was also shown to achieve competitive results, approaching the performance of the systems of \shortciteAWen2015 on sf and of \shortciteADusek2015 on bagel.
3.3.4 NLG as ‘parsing’
In recent years, there has been a resurgence of interest in viewing generation in terms of probabilistic context-free grammar (cfg) formalisms, or even as the ‘inverse’ of semantic parsing. For example, \shortciteABelz2008 formalises the nlg problem entirely in terms of cfgs: a base generator expands inputs (bits of weather data in this case) by applying cfg rules; corpus-derived probabilities are then used to control the choice of which rules to expand at each stage of the process. The base generator in this work is hand-crafted. However, it is possible to extract rules or templates from corpora, as has been done for aggregation rules \shortcite[and Section 2.3]Stent2009,White2015, and also for more general statistical approaches to sentence planning and realisation in a text-to-text framework \shortcite¡e.g.,¿Kondadadi2013. Similarly, approaches to nlg from structured knowledge bases, expressed in formalisms such as rdf, have described techniques to extract lexicalised grammars or templates from such inputs paired with textual descriptions \shortciteEll2012,Duma2013,Gyawali2014.
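The idea of a base generator whose rule choices are controlled by corpus-derived probabilities can be sketched as follows. The grammar, its probabilities and the two selection strategies are hypothetical and chosen for brevity; they are not the rules or decoding modes of the cited system.

```python
import random

# Hypothetical probabilistic grammar: nonterminal -> [(right-hand side, probability)].
GRAMMAR = {
    "FORECAST": [(("WIND", "then", "CHANGE"), 0.7), (("WIND",), 0.3)],
    "WIND": [(("DIR", "SPEED"), 1.0)],
    "DIR": [(("SW",), 0.6), (("southwesterly",), 0.4)],
    "SPEED": [(("10-15",), 0.8), (("around", "12"), 0.2)],
    "CHANGE": [(("veering", "W"), 0.6), (("becoming", "W"), 0.4)],
}


def expand(symbol, choose):
    """Recursively expand a symbol; anything not in GRAMMAR is a terminal."""
    if symbol not in GRAMMAR:
        return [symbol]
    rhs = choose(GRAMMAR[symbol])
    return [tok for s in rhs for tok in expand(s, choose)]


def greedy(rules):
    """Pick the most probable rule at each expansion step."""
    return max(rules, key=lambda r: r[1])[0]


def sampler(rng):
    """Return a chooser that samples rules in proportion to their probability."""
    def choose(rules):
        return rng.choices([r[0] for r in rules], weights=[r[1] for r in rules])[0]
    return choose


greedy_text = " ".join(expand("FORECAST", greedy))
sampled_tokens = expand("FORECAST", sampler(random.Random(0)))
print(greedy_text)
```

Greedy selection always yields the same text, while sampling produces variation that still respects the rule probabilities.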
The work of Mooney and colleagues \shortciteWong2007,Chen2008,Kim2010 has compared a number of different generation strategies inspired by the wasp semantic parser \shortciteWong2007, which uses probabilistic synchronous cfg rules learned from pairs of utterances and their semantic representations using statistical machine translation techniques. \shortciteAChen2008 use this framework for generation both by adapting wasp in a generation framework, and by further adapting it to produce a new system, wasper-gen. While wasp seeks to maximise the probability of a meaning representation (mr) given a sentence, wasper-gen does the opposite, seeking the maximally probable sentence given an input mr, as it were, learning a translation model from meaning to text. When trained on a dataset of sportscasts (the robocup dataset), wasper-gen outperforms wasp on corpus-based evaluation metrics, and is shown to achieve a level of fluency and semantic correctness which approaches that of human text, based on subjective judgements by experimental participants. Note, however, that this framework focusses mainly on tactical generation. Content determination is performed separately, using a variant of the em-algorithm to converge on a probabilistic model that predicts which events or predicates should be mentioned.
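The contrast between the two directions can be written compactly. The notation is an assumption made here for illustration ($s$ for a sentence, $m$ for a meaning representation) and says nothing about how either system estimates its probabilities.

```latex
\begin{align}
\hat{m} &= \arg\max_{m} P(m \mid s) && \text{(semantic parsing, as in wasp)} \\
\hat{s} &= \arg\max_{s} P(s \mid m) && \text{(generation, as in wasper-gen)}
\end{align}
```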
By contrast, the work of \shortciteAKonstas2012,Konstas2013, which also relies on cfgs, uses a unified framework throughout. The starting point is an alignment of text with database records, extending the proposal by \shortciteALiang2009. The process of converting input data to output text is modelled in terms of rules which implicitly incorporate different types of decisions. For example, given a database of weather records, the rules might take the (somewhat simplified) form shown below:
In these rules, the non-terminal symbols denote database records, sets of fields, individual fields within a record, and word sequences; every rule has an associated probability that conditions the rhs on the lhs, akin to the pcfgs used in parsing. The first rule specifies that a description of windSpeed should be followed in the text by a temperature and a rain report. According to the second rule, the minimum windspeed should be followed by a mention of the maximum windspeed with a certain probability. The third rule expands the minimum windspeed rule to a sequence of words according to a bigram language model \shortciteKonstas2012. \shortciteAKonstas2012 pack the set of rules acquired from the alignment stage into a hypergraph, and treat generation as decoding to find the maximally likely word sequence.
Under this view, generation is akin to inverted parsing. Decoding proceeds using an adaptation of the cyk algorithm. Since the model defining the mapping from input to output does not incorporate fluency heuristics, the decoder is interleaved with two further sources of linguistic knowledge by \shortciteAKonstas2013: (a) a weighted finite-state automaton (representing an n-gram language model); and (b) a dependency model \shortcite¡cf.¿[, also discussed above]Ratnaparkhi2000.
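To make the idea of generation as decoding over probabilistic rules concrete, here is a small sketch. It is not the hypergraph decoder of the cited work: it searches a tiny acyclic rule set recursively for the most probable derivation, without any language model or dependency model. All symbol names, rules, probabilities and word sequences are hypothetical.

```python
import math

# Hypothetical rule set: lhs -> list of (rhs symbols, probability).
RULES = {
    "R(windSpeed)": [
        (("FS(windSpeed,min)", "R(temperature)"), 0.7),
        (("FS(windSpeed,min)",), 0.3),
    ],
    "R(temperature)": [(("W(temperature)",), 1.0)],
    "FS(windSpeed,min)": [
        (("F(windSpeed,min)", "F(windSpeed,max)"), 0.6),
        (("F(windSpeed,min)",), 0.4),
    ],
    "F(windSpeed,min)": [(("W(windSpeed,min)",), 1.0)],
    "F(windSpeed,max)": [(("W(windSpeed,max)",), 1.0)],
}
# Hypothetical word sequences emitted by the word-level symbols.
WORDS = {
    "W(windSpeed,min)": "winds from 10",
    "W(windSpeed,max)": "to 20 mph",
    "W(temperature)": "with highs near 60",
}


def best_derivation(symbol):
    """Return (log probability, word chunks) of the best derivation of symbol."""
    if symbol in WORDS:
        return 0.0, [WORDS[symbol]]
    best = None
    for rhs, prob in RULES[symbol]:
        logp = math.log(prob)
        words = []
        for child in rhs:
            child_logp, child_words = best_derivation(child)
            logp += child_logp
            words.extend(child_words)
        if best is None or logp > best[0]:
            best = (logp, words)
    return best


logp, words = best_derivation("R(windSpeed)")
print(" ".join(words))
```

Because the rules score only content and ordering decisions, nothing here rewards fluent word sequences, which is why the cited decoder is combined with further sources of linguistic knowledge.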
3.3.5 Deep learning methods
We conclude our discussion of data-driven nlg with an overview of applications of deep neural network (nn) architectures. The decision to dedicate a separate section is warranted by the recent, renewed interest in these models \shortcite¡see¿[for an nlp-focussed overview]Goldberg2016, as well as the comparatively small (but steadily growing) range of nlg models couched within this framework to date. We will also revisit nn models for nlg under more specific headings in the following sections, especially in discussing stylistic variation (Section 5) and image captioning (Section 4), where they are now the dominant approach.
As a matter of fact, applications of nns in nlg hark back at least to \shortciteAKukich1987, though her work was restricted to small-scale examples. Since the early 1990s, when interest in neural approaches waned in the nlp and ai communities, cognitive science research has continued to explore their application to syntax and language production \shortcite¡e.g.,¿Elman1990,Elman1993,Chang2006. The recent resurgence of interest in nns is in part due to advances in hardware that can support resource-intensive learning problems \shortciteGoodfellow2016. More importantly, nns are designed to learn representations at increasing levels of abstraction by exploiting backpropagation \shortciteLeCun2015,Goodfellow2016. Such representations are dense, low-dimensional, and distributed, making them well-suited to capturing grammatical and semantic generalisations \shortcite¡see¿[inter alia]Mikolov2013,Luong2013,Pennington2014. nns have also scored notable successes in sequential modelling using feedforward networks \shortciteBengio2003,Schwenk2005, log-bilinear models \shortciteMnih2007 and recurrent neural networks \shortcite¡rnns, ¿Mikolov2010, including rnns with long short-term memory units \shortcite¡lstm, ¿Hochreiter1997. The latter are now the dominant type of rnn for language modelling tasks. Their main advantage over standard language models is that they handle sequences of varying lengths, while avoiding both data sparseness and an explosion in the number of parameters through the projection of histories into a low-dimensional space, so that similar histories share representations.
A demonstration of the potential utility of recurrent networks for nlg was provided by \shortciteASutskever2011, who used a character-level lstm model for the generation of grammatical English sentences. This, however, focussed exclusively on their potential for realisation. Models that generate from semantic or contextual inputs cluster around two related types of models:
An influential architecture is the Encoder-Decoder framework \shortciteSutskever2014, where an rnn is used to encode the input into a vector representation, which serves as the auxiliary input to a decoder rnn. This decoupling between encoding and decoding makes it possible in principle to share the encoding vector across multiple nlp tasks in a multi-task learning setting \shortcite¡see¿[for some recent case studies]Dong2015,Luong2016. Encoder-Decoder architectures are particularly well-suited to Sequence-to-Sequence (seq2seq) tasks such as Machine Translation, which can be thought of as requiring the mapping of variable-length input sequences in the source language, to variable-length sequences in the target \shortcite¡e.g.,¿Kalchbrenner2013,Bahdanau2015. It is easy to adapt this view to data-to-text nlg. For example, \shortciteACastroFerreira2017 adapt seq2seq models for generating text from abstract meaning representations (amrs).
A further important development within the Encoder-Decoder paradigm is the use of attention-based mechanisms, which force the encoder, during training, to weight parts of the input encoding more when predicting certain portions of the output during decoding \shortcite¡cf.¿Bahdanau2015,Xu2015. This mechanism obviates the need for direct input-output alignment, since attention-based models are able to learn input-output correspondences based on loose couplings of input representations and output texts \shortcite¡see¿[for discussion]Dusek2016.
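A single attention step can be sketched as follows. This is a simplified illustration, not the formulation of the cited papers: it scores each encoder state by a plain dot product with the decoder state (the cited work uses learned scoring functions), and the function names and toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Normalise raw scores into weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Each encoder state is scored against the current decoder state;
    the context vector is the weighted sum of the encoder states.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three encoded input positions, one decoder state.
weights, context = attend([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(weights, context)
```

The weights play the role of a soft alignment: no explicit pairing of input records with output words is supplied, yet each output step draws mostly on the input positions that score highest.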
In nlg, many approaches to response generation in an interactive context (such as dialogue or social media posts) adopt this architecture. For example, \shortciteAWen2015 use semantically-conditioned lstms to generate the next act in a dialogue; a related approach is taken by \shortciteASordoni2015, who use rnns to encode both the input utterance and the dialogue context, with a decoder to predict the next word in the response \shortcite¡see also¿Serban2016. \shortciteAGoyal2016 found an improvement in the quality of generated dialogue acts when using a character-based, rather than a word-based rnn.
\shortciteADusek2016 also use a seq2seq model with attention for dialogue generation, comparing an end-to-end model where content selection and realisation are jointly optimised (so that outputs are strings), to a model which outputs deep syntax trees, which are then realised using an off-the-shelf realiser \shortcite¡as done in¿Dusek2015. Like \shortciteAWen2015, they use a reranker during decoding to rank beam search outputs, penalising those that omit relevant information or include irrelevant information. Their evaluation, on bagel, shows that the joint optimisation setup is superior to the seq2seq model that generates trees for subsequent realisation. \shortciteAMei2016 also explicitly address the division into content selection and realisation, using weathergov data \shortciteAngeli2010. They use a bidirectional lstm encoder to map input records to a hidden state, followed by an attention-based aligner which models content selection, determining which records to mention as a function of their prior probability and the likelihood of their alignment with words in the vocabulary; a further refinement step weights the outcomes of the alignment with the priors, making it more likely that more important records will be verbalised. In this approach, lstms are able to learn long-range dependencies between records and descriptors, which the log-linear model of \shortciteAAngeli2010 factored in explicitly (see Section 3.3.2 above). Comparable approaches are now also used for automatic generation of poetry \shortcite¡see e.g., ¿zhang2014chinese, a topic to which we will return below.
Conditioned Language Models
A related perspective on the data-to-text process treats the generator as a conditioned language model, where output is generated by sampling words or characters from a distribution conditioned on input features, which may include semantic, contextual or stylistic attributes. For example, \shortciteALebret2016 restrict generation to the initial sentence of wikipedia biographies from the corresponding wiki fact table and model content selection and realisation jointly in a feedforward nn \shortciteBengio2003, conditioning output word probabilities on both local context and global features obtained from the input table. This biases the model towards full coverage of the contents of a field. For example, a field in the table containing a person’s name typically consists of more than one word and the model should concatenate the words making up the entire name. While simpler than some of the models discussed above, this model can also be thought of as incorporating an attentional mechanism. \shortciteALipton2016 use character-level rnns conditioned on semantic information and sentiment, to generate product reviews, while \shortciteATang2016 generate such reviews using an lstm conditioned on input ‘contexts’, where contexts incorporate both discrete (user, location etc) and continuous information. Similar approaches have been adopted in a number of models for stylistic and affective generation \shortcite¡see¿[and the discussion in Section 5 below]Li2016,Herzig2017,Ashgar2017,Hu2017,Ficler2017.
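The notion of a conditioned language model can be illustrated with a deliberately simple count-based sketch, far removed from the neural models cited above. Next-word counts are indexed by a conditioning attribute as well as the previous word, and decoding is greedy rather than sampled so that the output is reproducible. The data, attribute names and function names are all invented for illustration.

```python
from collections import Counter, defaultdict

# Toy training data: (conditioning attribute, text). Entirely made up.
DATA = [
    ("positive", "the food was great"),
    ("positive", "the food was fresh"),
    ("positive", "the service was great"),
    ("negative", "the food was cold"),
    ("negative", "the service was slow"),
    ("negative", "the food was cold"),
]


def train(data):
    """Count next words given (attribute, previous word)."""
    counts = defaultdict(Counter)
    for attribute, text in data:
        tokens = ["<s>"] + text.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[(attribute, prev)][word] += 1
    return counts


def generate(counts, attribute, max_len=10):
    """Greedily pick the most frequent next word under the given attribute."""
    prev, output = "<s>", []
    for _ in range(max_len):
        options = counts.get((attribute, prev))
        if not options:
            break  # unseen context: stop rather than guess
        word = options.most_common(1)[0][0]
        if word == "</s>":
            break
        output.append(word)
        prev = word
    return " ".join(output)


model = train(DATA)
print(generate(model, "negative"))  # the food was cold
print(generate(model, "positive"))  # the food was great
```

The same history ("the food was") leads to different continuations depending on the attribute, which is the essential property exploited by the stylistic and affective models mentioned above.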
3.4 Discussion
An important theme that has emerged from recent work is the blurring of boundaries between tasks that are encapsulated in traditional architectures. This is evident in planning-based approaches, but perhaps the most radical break from this perspective arises in stochastic data-to-text systems which capitalise on alignments between input data and output text, combining content-oriented and linguistic choices within a unified framework. Among the open questions raised by research on stochastic nlg is the extent to which sub-tasks need to be jointly optimised and, if so, which knowledge sources should be shared among them. This is also seen in recent work using neural models, where joint learning of content selection and realisation has been claimed to yield superior outputs, compared to models that leave the tasks separate \shortcite¡e.g.,¿Dusek2016.
An outstanding issue is the balancing act between achieving adequate textual output versus doing so efficiently and robustly. Early approaches that departed from a pipeline architecture tended to sacrifice the latter in favour of the former; this was the case in revision-based and blackboard architectures. The same is to some extent true of planning-based approaches which are rooted in paradigms with a long history in ai: As recent empirical work has shown \shortciteKoller2011, these too are susceptible to considerable computational cost, though this comes with the advantage of a unified view of language generation that is also compatible with well-understood linguistic formalisms, such as ltag.
Stochastic approaches present a different problem, namely, that of acquiring the right data to construct the necessary statistical models. While plenty of datasets have become available, for tasks such as recommendations in the restaurant or hotel domains, brief weather reports, or sports summaries, it remains to be seen whether data-driven nlg models can be scaled up to domains where large volumes of heterogeneous data (numbers, symbols etc) are the norm, and where longer texts need to be generated. While such data is not easy to come by, crowd-sourcing techniques can presumably be exploited \shortciteMairesse2014,Novikova2016.
As we have seen, systems vary in whether they require aligned data (by which we mean data where strings are paired with the portion of the input to which they correspond), or not. As deep learning approaches become more popular – and, as we shall see in the next section, they are now the dominant approach in certain tasks, such as generating image captions – the need for alignment is becoming less acute, as looser input-output couplings can constitute adequate training data, especially in models that incorporate attentional mechanisms. As these techniques become better understood, they are likely to feature more heavily in a broader range of nlg tasks, as well as end-to-end nlg systems.
A second possible outcome of the renewed interest in deep learning is its impact on representation learning and architectures. In a recent opinion piece, \shortciteAManning2015 suggested that the contribution of deep learning to nlp has to date been mainly due to the power of distributed representations, rather than the exploitation of the ‘depth’ of multi-layered models. Yet, as Manning also notes, greater depth can confer representational advantages. As researchers begin to define complex architectures that ‘self-organise’ during training by minimising a loss function, it might turn out that different components of such architectures acquire core representations pertaining to different aspects of the problem at hand. This raises the question whether such representations could be reusable, in the same way that the layers of deep convolutional networks in computer vision learn representations at different levels of granularity which turn out to be reusable in a range of tasks \shortcite¡not just object recognition, for instance, even though networks such as vgg are typically trained for such tasks; see¿Simonyan2015. A related aim, suggested by recent attempts at transfer learning, especially in the seq2seq paradigm, is to attempt to learn domain-invariant representations that carry over from one task to another.
Could nlp, and the field of nlg in particular, be about to witness a renewed emphasis on multi-levelled approaches to nlg, with ‘deep’ architectures whose components learn optimal representations for different sub-tasks, perhaps along the lines detailed in Section 2 above? And to what extent would such representations be reusable? As a number of other commentators have pointed out, the prospect of learning domain-invariant linguistic representations that facilitate transfer learning in nlp remains somewhat elusive, despite certain notable successes, not least those scored in the development of distributed word representations (for some remarks on this topic, see for example the blog entry by \shortciteARuder2017; a recent note of caution against unrealistic claims of success of neural methods in nlg was sounded by \shortciteAGoldberg2017). This could well be the next frontier in research on statistical nlg.
In the following sections, we turn our attention away from standard tasks and the way they are organised, focussing on three broad topics – image-to-text generation, stylistic variation and computational creativity – in which nlg research has also intersected with research in other areas of Artificial Intelligence and nlp.
4 The vision-language interface: Image captioning and beyond
Over the past few years, there has been an explosion of interest in the task of automatically generating captions for images, as part of a broader endeavour to investigate the interface between vision and language \shortciteBarnard2016. Image captioning is arguably a paradigm case of data-to-text generation, where the input comes in the form of an image. The task has become a research focus not only in the nlg community but also in the computer vision community, raising the possibility of more effective synergies between the two groups of researchers. Apart from its practical applications, the grounding of language in perceptual data has long been a matter of scientific interest in ai \shortcite¡see¿[for a variety of theoretical views on the computational challenges of the perception-language interface]Winograd1972,Harnad1990,Roy2005.
Figure 5 shows some examples of caption generation, sampled from publications spanning approximately 6 years. Current caption generation research focusses mainly on what \shortciteAHodosh2013 refer to as concrete conceptual image descriptions of elements directly depicted in a scene. As \shortciteADonahue2015 put it, image captioning is a task whose input is static and non-sequential (an image, rather than, say, a video), whereas the output is sequential (a multi-word text), in contrast to non-sequential outputs such as object labels \shortcite¡e.g.¿[among others]Duygulu2002,Ordonez2016.
Our discussion will be brief, since image captioning has recently been the subject of an extensive review by \shortciteABernardi2016, and has also been discussed against the background of broader issues in research on the vision-language interface by \shortciteABarnard2016. While the present section draws upon these sources, it is organised in a somewhat different manner, also bringing out the connections with nlg more explicitly.
A detailed overview of datasets is provided by \shortciteABernardi2016, while \shortciteAFerraro2015 offer a systematic comparison of datasets for both caption generation and visual question answering with an accompanying online resource (http://visionandlanguage.net).
Datasets typically consist of images paired with one or more human-authored captions (mostly in English) and vary from artificially created scenes \shortciteZitnick2013 to real photographs. Among the latter, the most widely used are Flickr8k \shortciteHodosh2013, Flickr30k \shortciteYoung2014 and ms-coco \shortciteLin2014. Datasets such as the sbu1m Captioned Photo Dataset \shortciteOrdonez2011 include naturally-occurring captions of user-shared photographs on sites such as Flickr; hence the captions included therein are not restricted to the concrete conceptual. There are also a number of specialised, domain-specific datasets, such as the Caltech ucsd Birds dataset \shortcite¡cub; ¿Wah2011.
There have also been a number of shared tasks in this area, including the coco (‘Common Objects in Context’) Captioning Challenge (http://mscoco.org/dataset/#captions-challenge2015), organised as part of the Large-Scale Scene Understanding Challenge (lsun; http://lsun.cs.princeton.edu/2016/), and the Multimodal Machine Translation Task \shortciteElliott2016. We defer discussion of evaluation of image captioning systems to Section 7 of this paper, where it is discussed in the context of nlg evaluation as a whole.
4.2 The core tasks
There are two logically distinguishable sub-tasks in an image captioning system, namely, image analysis and text generation. This is not to say that they need to be organised separately or sequentially. However, prior to discussing architectures as such, it is worth briefly giving an overview of the methods used to deal with these two tasks.
4.2.1 Image analysis
There are three main groups of approaches to treating visual information for captioning purposes.
Some systems rely on computer vision methods for the detection and labelling of objects, attributes, ‘stuff’ (typically mapped to mass nouns, such as grass), spatial relations, and possibly also action and pose information. This is usually followed by a step mapping these outputs to linguistic structures (‘sentence plans’ of the sort discussed in Sections 2 and 3), such as trees or templates \shortcite¡e.g.¿Kulkarni2011,Yang2011,Mitchell2012,Elliott2015,Yatskar2014,Kuznetsova2014a. Since performance depends on the coverage and accuracy of detectors \shortciteKuznetsova2014a,Bernardi2016, some work has also explored generation from gold standard image annotations \shortciteElliott2013,Wang2015,Muscat2015 or artificially created scenes in which the components are known in advance \shortciteOrtiz2015.
Holistic scene analysis
Here, a more holistic characterisation of a scene is used, relying on features that do not typically identify objects, attributes and the like. Such features include rgb histograms, scale-invariant feature transforms \shortcite¡sift;¿Lowe2004, or low-dimensional representations of spatial structure \shortcite¡as in gist;¿Oliva2001, among others. This kind of image processing is often used by systems that frame the task in terms of retrieval, rather than caption generation proper. Such systems either use a unimodal space to compare a query image to training images before caption retrieval \shortcite¡e.g.¿Ordonez2011,Gupta2012, or exploit a multimodal space representing proximity between images and captions \shortcite¡e.g.¿Hodosh2013,Socher2014.
Dense image feature vectors
Given the success of convolutional neural networks (cnn) for computer vision tasks \shortcite¡cf. e.g.,¿LeCun2015, many deep learning approaches use features from a pre-trained cnn such as AlexNet \shortciteKrizhevsky2012, vgg \shortciteSimonyan2015 or one of the models distributed with the Caffe framework \shortciteJia2014. Most commonly, caption generators use an activation layer from the pre-trained network as their input features \shortcite¡e.g.¿Kiros2014,Karpathy2014,Karpathy2015,Vinyals2015,Mao2015,Xu2015,Yagcioglu2015,Hendricks2016.
4.2.2 Text generation or retrieval
Depending on the type of image analysis technique, captions can be generated using a variety of different methods, of which the following are well-established.
Using templates or trees
Systems relying on detectors can map the output to linguistic structures in a sentence planning stage. For example, objects can be mapped to nouns, spatial relations to prepositions, and so on. \shortciteAYao2010 use semi-supervised methods to parse images into graphs and then generate text via a simple grammar. Other approaches rely on sequence classification algorithms, such as Hidden Markov Models \shortciteYang2011 and conditional random fields \shortciteKulkarni2011,Kulkarni2013. \shortciteA[see the example in Figure 5(b)]Kulkarni2013 experiment with both templates and web-derived n-gram language models, finding that the former are more fluent, but suffer from lack of variation, an issue we also addressed earlier, in connection with realisation (Section 2.6).
In the Midge system \shortcite[see Figure 5(d) for an example caption]Mitchell2012, input images are represented as triples consisting of object/stuff detections, action/pose detections and spatial relations. These are subsequently mapped to triples of nouns, verbs and prepositions, and realised using a tree substitution grammar. This is further enhanced with the ability to ‘hallucinate’ likely words using a probabilistic model, that is, to insert words which are not directly grounded in the detections performed on the image itself, but have a high probability of occurring, based on corpus data. In a human evaluation, Midge was shown to outperform the systems of both \shortciteAKulkarni2011 and \shortciteAYang2011 on a number of criteria, including humanlikeness and correctness.
\shortciteAElliott2013 use visual dependency representations (vdr), a dependency grammar-like formalism to describe spatial relations between objects based on physical features such as proximity and relative position. Detections from an image are mapped to their corresponding vdr relations prior to generation \shortcite¡see also¿[and the example in Figure 5(c)]Elliott2015. \shortciteAOrtiz2015 use ilp to identify pairs of objects in abstract scenes \shortciteZitnick2013a before mapping them to a vdr. Realisation is framed as a machine translation task over vdr-text pairs. A similar concern with identifying spatial relations is found in the work of \shortciteALin2015, who use scene graphs as input to a grammar-based realiser. \shortciteAMuscat2015 propose a naive Bayes model to predict spatial prepositions based on image features such as object proximity and overlap.
Using language models
Using language models has the potential advantage of facilitating joint training from image-language pairs. It may also yield more expressive or creative captions if it is used to overcome the limitations of grammars or templates \shortcite¡as shown by the example of Midge;¿Mitchell2012. In some cases, n-gram models are trained on out-of-domain data, the approach taken by \shortcite[using web-scale n-grams]Li2011 and \shortciteA[using a maximum entropy language model]Fang2015. Most deep learning architectures use language models in the form of vanilla rnns or long short-term memory networks \shortcite¡e.g.¿Kiros2014,Vinyals2015,Donahue2015,Karpathy2015,Xu2015,Hendricks2016,Hendricks2016a,Mao2016. These architectures model caption generation as a process of predicting the next word in a sequence. Predictions are biased both by the caption history generated so far (or the start symbol for initial words) and by the image features which, as noted above, are typically features extracted from a cnn trained on the object detection task.
Caption retrieval and recombination
Rather than generate captions, some systems retrieve them based on training data. The advantage of this is that it guarantees fluency, especially if retrieval is of whole, rather than partial, captions. \shortciteAHodosh2013 used a multimodal space to represent training images and captions, framing retrieval as a process of identifying the nearest caption to a query image. The idea of ‘wholesale’ caption retrieval has a number of precedents. For example, \shortciteAFarhadi2010 use Markov random fields to parse images into triples of object, action and scene, paired with parsed captions. A caption for a query image is retrieved by comparing it to the parsed images in the training data, finding the most similar based on WordNet. Similarly, the Im2Text \shortciteOrdonez2011 system ranks candidate captions for a query image. \shortciteADevlin2015 use a nearest neighbours approach, with caption similarity quantified using bleu \shortcitePapineni2002 and cider \shortciteVedantam2015. A different view of retrieval is proposed by \shortciteAFeng2010, who use extractive summarisation techniques to retrieve descriptions of images and associated narrative fragments from their surrounding text in news articles.
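Wholesale retrieval in its simplest form can be sketched as a nearest-neighbour lookup. The sketch below assumes a unimodal image-feature space and cosine similarity. The feature vectors and captions are invented, and real systems would use far higher-dimensional features, for instance from a pre-trained cnn.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


# Hypothetical training set: (image feature vector, human-authored caption).
TRAIN = [
    ([0.9, 0.1, 0.0], "a dog runs across a field"),
    ([0.1, 0.8, 0.1], "a man rides a bicycle"),
    ([0.0, 0.2, 0.9], "a plate of food on a table"),
]


def retrieve_caption(query_features, training_set):
    """Return the caption of the training image most similar to the query."""
    best = max(training_set, key=lambda item: cosine(query_features, item[0]))
    return best[1]


print(retrieve_caption([0.8, 0.2, 0.1], TRAIN))  # a dog runs across a field
```

The returned caption is fluent by construction, since it was written by a person, but nothing guarantees that it fits a query image that lies far from every training image.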
A potential drawback of wholesale retrieval is that captions in the training data may not be well-matched to a query image. For instance, \shortciteADevlin2015 note that the less similar a query is to training images, the more generic the caption returned by the system. A possible solution is to use partial matches, retrieving and recombining caption fragments. \shortciteAKuznetsova2014a use detectors to match query images to training instances, retrieving captions in the form of parse tree fragments which are then recombined. \shortciteAMason2014 use a domain-specific dataset to extract descriptions and adapt them to a query image using a joint visual and textual bag-of-words model. In the deep learning paradigm, both \shortciteASocher2014 and \shortciteAKarpathy2014 use word embeddings derived from dependency parses, which are projected, together with cnn image features, into a multimodal space. Subsequent work by \shortciteAKarpathy2015 showed that this fine-grained pairing works equally well with word sequences, eschewing the need for dependency parsing.
Recently, \shortciteADevlin2015a compared nearest-neighbour retrieval approaches to different types of language models for caption generation, specifically, the Maximum Entropy approach of \shortciteAFang2015, an lstm-based approach and rnns which are coupled with a cnn for image analysis \shortcite¡e.g.¿Vinyals2015,Donahue2015,Karpathy2015. A comparison of the linguistic quality of captions suggested that there was a significant tendency for all models to reproduce captions observed in the training set, repeating them for different images in the test set. This could be due to a lack of diversity in the data, which might also explain why the nearest neighbour approach compares favourably with language model-based approaches.
4.3 How is language grounded in visual data?
As the foregoing discussion suggests, views on the relationship between visual and linguistic data depend on how each of the two sub-tasks is dealt with. Thus, systems which rely on detections tend to make a fairly clear-cut distinction between input processing and content selection on the one hand, and sentence planning and realisation on the other \shortcite¡e.g.¿Kulkarni2011,Mitchell2012,Elliott2013. The link between linguistic expressions and visual features is mediated by the outcomes of the detectors. For example, Midge \shortciteMitchell2012 uses the object detections to determine which nouns to mention, before fleshing out the caption with attributes (mapped to adjectives) and verbs. Similarly, \shortciteAElliott2013 use vdrs to determine spatial expressions.
Retrieval-based systems relying on unimodal or multimodal similarity spaces represent the link between linguistic expressions and image features more indirectly. Here, similarity plays the dominant role. In a unimodal space \shortciteOrdonez2011,Gupta2012,Mason2014,Kuznetsova2012,Kuznetsova2014a, it is images which are compared, with (partial) captions retrieved based on image similarity. A number of deep learning approaches also broadly conform to this scheme. For example, both \shortciteAYagcioglu2015 and \shortciteADevlin2015 retrieve and rank captions for a query image, using a cnn for the representation of the visual space. By contrast, multimodal spaces involve a direct mapping between visual and linguistic features \shortcite¡e.g.¿Hodosh2013,Socher2014,Karpathy2014, enabling systems to map from images to ‘similar’ – that is, related or relevant – captions.
Much interesting work on vision-language integration is being carried out with deep learning models. \shortciteAKiros2014 introduced multimodal neural language models (mrnn), experimenting with two main architectures. Their Modality-Biased Log-Bilinear Model (mlbl-b) uses an additive bias to predict the next word in a sequence based on both the linguistic context and cnn image features. The Factored 3-way Log-Bilinear Model (mlbl-f) also gates the representation matrix for a word with image features. In a related vein, \shortciteADonahue2015 propose a combined cnn lstm architecture \shortcite¡also used by¿[for video captioning]Venugopalan2015,Venugopalan2015a where the next word is predicted as a function of both previous words and image features. In one version of the architecture, they inject cnn features into the lstm at each time-step. In a second version, they use two stacked lstms, the first of which takes cnn features and produces an output which constitutes the input to the next lstm to predict the word. Finally, \shortciteAMao2015 experiment with various mrnn configurations, obtaining their best results with an architecture in which there are two word embedding layers preceding the recurrent layer, which is in turn projected into a multimodal layer where linguistic features are combined with cnn features. An example caption is shown in Figure 5(e) above.
These neural network models shed light on the consequences of combining the two modalities at different stages, reflecting the point made by \shortciteA[cf. Section 3.3.5]Manning2015 that this paradigm encourages a focus on architectures and design. In particular, image features can be used to bias the recurrent, language generation layer – at the start, or at each time-step of the rnn – as in the work of \shortciteADonahue2015. Alternatively, the image features can be combined with linguistic features at a stage following the rnn, as in the work of \shortciteAMao2015.
4.4 Vision and language: Current and future directions for NLG
Image to text generation is one area of nlg where there is a clear dominance of deep learning methods. Current work focusses on a number of themes:
Generalising beyond training data is still a challenge, as shown by the work of \shortciteADevlin2015a. More generally, dealing with novel images remains difficult, though experiments have been performed on using out-of-domain training data to expand vocabulary \shortciteOrdonez2013, learn novel concepts \shortciteMao2015a or transfer features from image regions containing known labels, to similar, but previously unattested ones \shortcite[from which an example caption is shown in Figure 5(f)]Hendricks2016. Progress in zero-shot learning, where the aim is to identify or categorise images for which little or no training data is available, is likely to contribute to the resolution of data sparseness problems \shortcite¡e.g.¿Antol2014,Elhoseiny2015.
Attention is also being paid to what \shortciteABarnard2016 refers to as localisation, that is, the association of linguistic expressions with parts of images, and the ability to generate descriptions of specific image regions. Recent work includes that of \shortciteAKarpathy2015, \shortciteAJohnson2016 and \shortciteAMao2016, who focus on unambiguous descriptions of specific image regions and/or objects in images (see Section 2.5 above for some related work). Attention-based models are a further development on this front. These have been exploited in various seq2seq tasks, notably for machine translation \shortciteBahdanau2015. In the case of image captioning, the idea is to allocate variable weights to portions of captions in the training data, depending on the current context, to reflect the ‘relevance’ of a word given previous words and an image region \shortciteXu2015.
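The core computation behind soft attention can be sketched as follows: given the current decoder state and a set of image region vectors, each region receives a normalised weight, and the weighted sum of regions forms the context used to predict the next word. The dot-product scoring function and all vectors below are simplifying assumptions made for illustration; published models typically learn the scoring function.

```python
import math


def softmax(scores):
    """Normalise raw scores into weights that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, region_features):
    """Weight image regions by relevance to the decoder state (dot-product scoring)."""
    scores = [sum(h * r for h, r in zip(decoder_state, region)) for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical two-dimensional features for three image regions.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([2.0, 0.0], regions)
print(weights, context)
```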
Recent work has also begun to explore generation from images that goes beyond the concrete conceptual, for instance, producing explanatory descriptions \shortciteHendricks2016a. A further development is work on Visual Question Answering, where rather than descriptive captions, the aim is to produce responses to specific questions about images \shortciteAntol2015,Geman2015,Malinowski2015,Barnard2016,mostafazadeh2016. Recently, a new dataset was proposed providing both concrete conceptual and ‘narrative’ texts coupled with images \shortciteHuang2016, a promising new direction for this branch of nlg.
There is a growing body of work that generalises the task from static inputs to sequential ones, especially videos \shortcite¡e.g.¿Kojima2002,Regneri2013,Venugopalan2015,Venugopalan2015a. Here, the challenges include handling temporal dependencies between scenes, but also dealing with redundancy.
5 Variation: Generating text with style, personality and affect
Based on the preceding sections, the reader could be excused for thinking that nlg is mostly concerned with delivering factual information, whether this is in the form of a summary of weather data, or a description of an image. This bias was also flagged in the Introduction, where we gave a brief overview of some domains of application, and noted that informing was often, though not always, the goal in nlg.
Over the past decade or so, however, there has been a growing trend in the nlg literature to also focus on aspects of textual information delivery that are arguably non-propositional, that is, features of text that are not strictly speaking grounded in the input data, but are related to the manner of delivery. In this section, we focus on these trends, starting with the broad concept of ‘stylistic variation’, before turning to generation of affective text and politeness.
5.1 Generating with style: textual variation and personality
What does the term ‘linguistic style’ refer to? Most work on what we shall refer to as ‘stylistic nlg’ shies away from a rigorous definition, preferring to operationalise the notion in the terms most relevant to the problem at hand.
‘Style’ is usually understood to refer to features of lexis, grammar and semantics that collectively contribute to the identifiability of an instance of language use as pertaining to a specific author, or to a specific situation (thus, one distinguishes between levels of stylistic formality, or speaks of the distinctive characteristics of the style of William Faulkner). This implies that any investigation of style must concern itself, at least in part, with variation among the features that mark such authorial or situational variables. In line with this usage, this section reviews developments in nlg in which variation is the key concern, usually at the tactical, rather than the strategic, level, the idea being that a given piece of information can be imparted in linguistically distinct ways \shortcite¡cf.¿Sluis2010. This strategy was, for example, explicitly adopted by \shortciteAPower2003.
Given its emphasis on linguistic features, controlling style (however it is defined) is a problem of great interest for nlg since it directly addresses issues of choice, which are arguably the hallmark of any nlg system \shortcite¡cf.¿Reiter2010. Early contributions in this area defined stylistic features using rules to vary generation according to pragmatic or stylistic goals. For example, \shortciteAMcDonald1985 argued that “prose style is a consequence of what decisions are made during the transition from the conceptual representation level to the linguistic level” (p. 61), thereby placing the problem within the domain of sentence planning and realisation. This stance was also adopted by \shortciteADimarco1993, who focus on syntactic variation, proposing a stylistic grammar for English and French. \shortciteASheikha2011 proposed an adaptation of the SimpleNLG realiser \shortciteGatt2009 to handle formal versus informal language, via specific features, such as contractions (are not vs. aren’t) and lexical choice.
A related perspective on stylistic variation was adopted by \shortciteAWalker2002, in their description of how the spot sentence planner was adapted to learn strategies for different communicative goals, as reflected in the rhetorical and syntactic structures of the sentence plans. The planner was trained using a boosting technique to learn correlations between features of sentence plans and human ratings of the adequacy of a sample of outputs for different communicative goals.
Like \shortciteAWalker2002, contemporary approaches to stylistic variation have tended to eschew rules in favour of data-driven methods to identify relevant features and dimensions of variation from corpora, in what might be thought of as an inductive view of style, where variation is characterised by the distribution of whatever linguistic features are considered relevant. An important precedent for this view is Biber’s corpus-based multidimensional approach to style and register variation \shortciteBiber1988, roughly a contemporary of the grammar-inspired approach of \shortciteADimarco1993.
Biber’s model was at the heart of work by \shortciteAPaiva2005, which exhibits some characteristics in common with the ‘global’ statistical approaches to nlg discussed in Section 3.3, insofar as it exploits statistics to inform decision-making at the relevant choice points, rather than to filter the outputs of an overgeneration module. \shortciteAPaiva2005 used a corpus of patient information leaflets, conducting factor analysis on their linguistic features to identify two stylistic dimensions. They then allowed their system to generate a large number of texts, varying its decisions at a number of choice points (e.g. choosing a pronoun versus a full np) and maintaining a trace. Texts were then scored on the two stylistic dimensions, and a linear regression model was developed to predict the score on a dimension based on the choices made by the system. This model was used during testing to predict the best choice at each choice point, given a desired style. Style, however, is a global feature of a text, though it supervenes on local decisions. These authors solved the problem by using a best-first search algorithm to identify the series of local decisions, as scored by the linear models, that was most likely to maximise the desired stylistic effect, yielding variations such as the following \shortcite¡examples from¿[p. 61]Paiva2005:
The dose of the patient’s medicine is taken twice a day. It is two grams.
The two-gram dose of the patient’s medicine is taken twice a day.
The patient takes the two-gram dose of the patient’s medicine twice a day.
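The combination of a linear scoring model with best-first search over choice points can be sketched as below. The choice points, the regression weights and the single stylistic dimension are invented for illustration, and the search is a simple greedy best-first variant that expands the partial decision sequence whose predicted score is currently closest to the target; it should not be read as a reconstruction of the authors’ actual system.

```python
import heapq

# Hypothetical choice points, each mapping an option to its regression weight
# on one stylistic dimension (all values are invented).
CHOICE_POINTS = [
    ("referent", {"pronoun": 0.4, "full_np": -0.3}),
    ("voice", {"active": 0.5, "passive": -0.6}),
    ("aggregation", {"one_sentence": -0.2, "two_sentences": 0.3}),
]
INTERCEPT = 0.0


def predicted_score(decisions):
    """Linear, additive prediction of the style score for a decision sequence."""
    return INTERCEPT + sum(CHOICE_POINTS[i][1][opt] for i, opt in enumerate(decisions))


def best_first(target):
    """Greedy best-first search for decisions whose predicted score approaches target."""
    frontier = [(abs(target - INTERCEPT), ())]
    while frontier:
        _, decisions = heapq.heappop(frontier)
        if len(decisions) == len(CHOICE_POINTS):
            return decisions
        options = CHOICE_POINTS[len(decisions)][1]
        for opt in sorted(options):
            extended = decisions + (opt,)
            distance = abs(target - predicted_score(extended))
            heapq.heappush(frontier, (distance, extended))
    return ()


print(best_first(1.2))
print(best_first(-1.1))
```

Because the model is additive, each option contributes to the predicted score independently of the others, which is precisely the assumption questioned in the discussion that follows.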
Some authors \shortcite¡e.g.,¿[on which more below]Mairesse2011 have noted that certain features, once selected, may ‘cancel’ or obscure the stylistic effect of other features. This raises the question whether style can in fact be modelled as a linear, additive phenomenon, in which each feature contributes to an overall perception of style independently of others (modulo its weight in the regression equation).
A second question is whether stylistic variation could be modelled in a more specific fashion, for example, by tailoring style to a specific author, rather than to generic dimensions related to ‘formality’, ‘involvement’ and so on. For example, a corpus-based analysis of human-written weather forecasts by \shortciteAReiter2005 found that lexical choice varies in part based on the author. One line of work has investigated this using corpora of referring expressions, such as the tuna Corpus \shortciteDeemter2012, in which multiple referring expressions by different authors are available for a given input domain. For instance, \shortciteABohnet2008 and \shortciteADiFabbrizio2008 explore statistical methods to learn individual preferences for particular attributes, a strategy also used by \shortciteAViethen2010. \shortciteAHervas2013 use case-based reasoning to inform lexical choice when realising a set of semantic attributes for a referring expression, where the case base differentiates between authors in the corpus to take individual lexicalisation preferences into account \shortcite¡see also¿Hervas2016.
A more ambitious view of individual variation is present in the work of \shortciteAMairesse2010,Mairesse2011, in the context of nlg for dialogue systems. Here, the aim is to vary the output of a generator so as to project different personality traits. Similar to the model of \shortciteABiber1988, personality is here given a multidimensional definition, via the classic ‘Big 5’ model \shortcite¡e.g.,¿John1999, where personality is a combination of five major traits (e.g. introversion/extraversion). However, while stylistic variation is usually defined as a linguistic phenomenon, the linguistic features of personality are only indirectly reflected in speaking or writing \shortcite¡a hypothesis underlying much work on detection of personality and other features in text, including¿Oberlander2006,Argamon2007,Schwartz2013a,Youyou2015.
\citeauthorMairesse2011’s personage system, originally based on rules derived from an exhaustive review of psychological literature \shortciteMairesse2010, was developed in the restaurant domain. The subsequent, data-driven version of the system \shortciteMairesse2011 takes as input a pragmatic goal and, like the system of \shortciteAPaiva2005, a list of real-valued style parameters, this time representing scores on the five personality traits. The system estimates generation parameters for stylistic features based on the input traits, using machine-learned models acquired from a dataset pairing sample utterances with human personality judgements. For example, an utterance reflecting high extraversion might be more verbose and involve more use of expletives (first example below), compared to a more introverted style, which might demonstrate more uncertainty, for example through the use of stammering and hedging (second example below).
Kin Khao and Tossed are bloody outstanding. Kin Khao just has rude staff. Tossed features sort of unmannered waiters, even if the food is somewhat quite adequate.
Err… I am not really sure. Tossed offers kind of decent food. Mmhm… However, Kin Khao, which has quite ad-ad-adequate food, is a thai place. You would probably enjoy these restaurants.
An interesting outcome of the evaluation with human subjects reported by \shortciteAMairesse2011 is that readers vary significantly in their judgements of what personality is actually reflected by a given text. This suggests that the relationship between such psychological features and their linguistic effects is far from straightforward. \shortciteAWalker2011:Arboretum compared the ‘Big 5’ model incorporated in the rule-based version of personage, to a corpus-based model drawn from character utterances in film scripts. These models were used to generate utterances for characters in an augmented reality game; their main finding was that modelling characters’ style directly using corpora of utterances results in more specific and easily perceived traits than using a model based on personality traits, where the relationship between personality and individual style is more indirect. In another set of experiments on generating utterances for characters in a role-playing game, \shortciteAWalker2011:Film report the successful porting of personage to the new domain by tuning some of its parameters on features identified in film dialogue. Models learned from film corpora were found to be close in style to the characters they were actually based on.
5.2 Generating with feeling: affect and politeness
Personality is usually thought of in terms of traits, which are relatively stable across time. However, language use may vary not only across individuals, as a function of their stable characteristics, but also within individuals across time, as a function of their more transient affective states. ‘Affective nlg’ \shortcite¡a term due to¿Rosis2000 is concerned with variation that reflects emotional states which, unlike personality traits, are relatively transient. In this case, the goals can be twofold: (i) to induce an emotional state in the receiver; or (ii) to reflect the emotional state of the producer.
As in the case of personality, the relationship between emotion and language is far from clear, as noted by \shortciteABelz2003. For one thing, it isn’t clear whether only surface linguistic choices need be affected. Some authors have argued that a text’s affective impact impinges on content selection; this stance has been adopted, for example, in some applications in e-health where reporting of health-related issues should be sensitive to their potential emotional impact \shortciteDiMarco2007,Mahamood2011.
Most work on affective nlg has however focussed on tactical choices \shortcite¡e.g.¿Hovy1988,Fleischman2002,Strong2007,VanDeemter2008,Keshtkar2011. Various linguistic features that can have emotional impact have been identified, from the increased use of redundancy to enhance understanding of emotionally laden messages \shortciteWalker1995,Rosis2000, to the increased use of first-person pronouns and adverbs, as well as sentence ordering to achieve emphasis or reduce adverse emotional impact \shortciteRosis2000.
This research on affective nlg relies on models of emotion of various degrees of complexity and cognitive plausibility. The common trend underlying all these approaches however is that emotional states should impact lexical, syntactic and other linguistic choices. The question then is to what extent such choices are actually perceived by readers or users of a system.
In an empirical study, \shortciteASluis2010 reported on two experiments investigating the effect of various tactical decisions on the emotional impact of text on readers. In one experiment, texts gave a (fake) report to participants on their performance on an aptitude test, with manually induced variations, such as these:
Positive slant: On top of this you also outperformed most people in your age group with your exceptional scores for Imagination and Creativity (7.9 vs 7.2) and Logical- Mathematical Intelligence (7.1 vs. 6.5).
Neutral/factual slant: You did better than most people in your age group with your scores for Imagination and Creativity (7.9 vs 7.2) and Logical-Mathematical Intelligence (7.1 vs. 6.5).
Evaluation of these texts showed that the extent to which affective tactical decisions influence a hearer’s emotional state is dependent on a host of other factors, including the degree to which the reader is directly implicated in what the text says (in the case of an aptitude test, the reader would be assumed to feel the outcomes have personal relevance). An important question raised by this study is how affect should be measured: \shortciteASluis2010 used a standardised self-rating questionnaire to estimate changes in affect before and after reading a text, but the best way to measure emotion remains an open question.
The emotional slant in the language used by an author or speaker may have implications for the degree to which the listener or reader may feel ‘impinged upon’. This becomes particularly relevant in interactive systems, where nlg components are generating language in the context of dialogue. Consider, for example, the difference between these requests:
Direct strategy: Chop the tomatoes!
Approval strategy: Would it be possible for you to chop the tomatoes?
Autonomy strategy: Could you possibly chop the tomatoes?
Indirect strategy: The tomatoes aren’t chopped yet.
The four strategies exemplified above come across as having varying degrees of politeness which, according to one influential account \shortciteBrownLevinson1987, depends on face. Positive face reflects the speaker’s desire that some of her goals be shared with her interlocutors; negative face refers to the speaker’s desire not to have her goals impinged upon by others. The connection with affect that we suggested above hinges on these distinctions: different degrees of politeness reflect different degrees of ‘threat’ to the listener; hence, generating language based on the right face strategy could be seen as a branch of affective nlg.
In an early, influential proposal, \shortciteAWalker1997a proposed an interpretation of \shortciteBrownLevinson1987 in terms of the four dialogue strategies exemplified above. Subsequently, \shortciteAMoore2004 used this framework in the generation of tutorial feedback, where a discourse planner used a Bayesian network to inform linguistic choices compatible with the target politeness/affect value in a given context \shortcite¡see¿[for a related approach]Johnson2004.
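At the level of surface realisation, the four strategies in the tomato examples above can be approximated with simple templates, as in the following sketch. The function name, the template inventory and the assumption of a plural object in the indirect strategy are illustrative choices; the systems discussed in this section select a strategy through planning or probabilistic models rather than by direct lookup.

```python
# Hypothetical templates, one per politeness strategy; the wording follows
# the tomato examples given earlier in this section.
TEMPLATES = {
    "direct": "{verb_cap} {obj}!",
    "approval": "Would it be possible for you to {verb} {obj}?",
    "autonomy": "Could you possibly {verb} {obj}?",
    "indirect": "{obj_cap} aren't {participle} yet.",  # assumes a plural object
}


def realise_request(strategy, verb, participle, obj):
    """Realise a request for an action under the given politeness strategy."""
    template = TEMPLATES[strategy]
    return template.format(
        verb=verb,
        verb_cap=verb[0].upper() + verb[1:],
        participle=participle,
        obj=obj,
        obj_cap=obj[0].upper() + obj[1:],
    )


for name in ("direct", "approval", "autonomy", "indirect"):
    print(name, "->", realise_request(name, "chop", "chopped", "the tomatoes"))
```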
\shortciteAGupta2007 also used the four dialogue strategies identified by \shortciteAWalker1997a in the polly system, which used strips-based planning to generate a plan distributed among two agents in a collaborative task \shortcite¡see also¿Gupta2008. An interesting finding in their evaluation is that perception of face-threat depends on the speech act; for example, requests can be more threatening. \shortciteAGupta2007 also note possible cultural differences in perception of face threat (in this case, between uk and Indian participants).
5.3 Stylistic control as a challenge for neural nlg
In the past few years, stylistic – and especially affective – nlg has witnessed renewed interest by researchers working on neural approaches to generation. The trends that can be observed here mirror those outlined in our general overview of deep learning approaches (Section 3.3.5).
A number of models focus on response generation (in the context of dialogue, or social media exchanges), where the task is to generate a response, given an utterance. Thus, these models fit well within the seq2seq or Encoder-Decoder framework (see Section 3.3.5 for discussion). Often, these models exploit social media data, especially from Twitter, a trend which goes back at least to \shortciteARitter2011, who adapted a Phrase-Based Machine Translation model to response generation. For example, \shortciteALi2016 proposed a persona-based model in which the decoder lstm is conditioned on embeddings obtained from tweets pertaining to individual speakers/authors. An alternative model conditions on both speaker and addressee profiles, with a view to incorporating not only the ‘persona’ of the generator, but its variability with respect to different interlocutors. \shortciteAHerzig2017, also working on Twitter data, condition their decoder on personality features extracted from tweets based on the ‘Big Five’ model, rather than on speaker-specific embeddings. This has the advantage of enabling the generator to be tuned to specific personality settings, without re-training to adapt to a particular speaker style. While their personality-based model does not beat the persona-based model of \shortciteALi2016, a human evaluation showed that judges were able to identify high-trait responses as more expressive than low-trait responses, suggesting that the conditioning was having a noticeable impact on style. In a dialogue context, \shortciteAAshgar2017 proposed to achieve affective responses on three levels: (a) by augmenting word embeddings with data from an affective dictionary; (b) by decoding with an affect-sensitive beam search; and (c) by training with an affect-sensitive loss function.
On the other hand, a number of models condition an lstm on attributes reflecting affective or personality traits, with a view to generating strings that express such traits. \shortciteAGhosh2017 used lstms trained on speech corpora conditioned on affect category and emotional intensity to drive lexical choice. \shortciteAHu2017 used variational auto-encoders and attribute discriminators, to control the stylistic parameters of generated texts individually. They experimented on controlling sentiment and tense, but restricted the generation to sentences of up to 16 words. By contrast, \shortciteAFicler2017 extend the range of parameters used to condition the lstm, with two content-related attributes (sentiment and theme) and four stylistic parameters (length, whether the text is descriptive, whether it has a personal voice, and whether the style is professional). Their generator is trained on a corpus of movie reviews. Similarly, \shortciteADong2017 propose an attribute-to-sequence model for product review generation based on a corpus of Amazon user reviews \shortcite¡see also¿[for neural models for product review generation]Lipton2016,Tang2016. The conditioning includes the reviewer id, reminiscent of the persona-based response model of \shortciteALi2016; however, they also include the rating, which functions to modulate the affect in the output. Their model incorporates an attentional mechanism to concentrate on different parts of the input encoding when predicting the next word during decoding. For example, for a specific reviewer and a specific product, changing the input rating from 1 to 5 yields the following difference:
(Rating: 1) i’m sorry to say this was a very boring book. i didn’t finish it. i’m not a new fan of the series, but this was a disappointment
(Rating: 5) this was a very good book. i enjoyed the characters and the story line. i’m looking forward to reading more in this series.
5.4 Style and affect: concluding remarks
Controlling stylistic, affective and personality-based variation in nlg is still in a rather fledgling state, with several open questions of both theoretical and computational import. Among these is the question of how best to model complex, multi-dimensional constructs such as personality or emotion; this question speaks both to the cognitive plausibility of the models informing linguistic choices, and to the practical viability of different machine learning strategies that could be leveraged for the task (for example, linear, additive models versus more ‘global’ models of personality or style). Also important here is the kind of data used to inform generation strategies: as we have seen above, a lot of affective nlg work relies on ratings by human judges. However, some recent work in affective computing has questioned the use of ratings, comparing them to ranking-based and physiological methods \shortcite¡e.g.¿Martinez2014,Yannakakis2015. This and similar research is probably of high relevance to nlg researchers. Some recent work relied on automatic extraction of personality features using tools such as ibm’s Personality Insights \shortciteHerzig2017. As such tools \shortcite¡another example of which is Linguistic Inquiry and Word Count or liwc, ¿Pennebaker2007 become more reliable and widely available, we may see a turn towards less reliance on human elicitation.
A second important question is which linguistic choices truly convey the intended variation to the reader or listener. While current systems use a range of devices, from aggregation strategies to lexical choice, it is not clear which ones are actually perceived as having the desired effect.
A third important research avenue, which is especially relevant to interactive systems, is adaptivity, that is, the way speakers (or systems) alter their linguistic choices as a result of their interlocutors’ utterances \shortciteClark1996a,Niederhoffer2002,Pickering2004, a theme that has also begun to be explored in nlg \shortciteIsard2006,Herzig2017.
6 Generating creative and entertaining text
‘Good’ writers not only present their ideas in coherent and well-structured prose. They also succeed in keeping the attention of the reader through narrative techniques, and in occasionally surprising the reader, for example, through creative language use such as small jokes or well-placed metaphors \shortcite¡see e.g., among many others, ¿flower1981cognitive,nauman2011makes,veale2015distributed. The nlg techniques and applications discussed so far in this survey arguably do not simulate good writers in this sense, and as a result automatically generated texts can be perceived as somewhat boring and repetitive.
This lack of attention to creative aspects of language production within nlg is not due to a general lack of scholarly interest in these phenomena. Indeed, computational research into creativity has a long tradition, with roots that go back to the early days of ai \shortcite¡as¿[notes, the first story generation algorithm on record, Novel Writer, was developed by Sheldon Klein in 1973]Gervas2013. However, it is fair to say that, so far, there has been little interaction between researchers from the computational creativity and nlg communities respectively, even though both groups in our opinion could learn a lot from each other. In particular, nlg researchers stand to benefit from insights into what constitutes creative language production, as well as structural features of narrative that have the potential to improve nlg output even in data-to-text systems \shortcite¡see¿[for an argument to this effect in relation to a medical text generation system]Reiter2008. At the same time, researchers in computational creativity could also benefit from the insights provided by the nlg community where the generation of fluent language is concerned since, as we shall see, a lot of the focus in this research, especially where narrative is concerned, is on the generation of plans and on content determination.
In what follows, we give an overview of automatic approaches to creative language production, starting from relatively simple jokes and metaphors to more advanced forms, such as narratives.
6.1 Generating puns and jokes
What’s the difference between money and a bottom?
One you spare and bank, the other you bare and spank.
What do you call a weird market?
A bizarre bazaar.
These two (pretty good!) punning riddles were automatically generated by the jape system developed by \shortciteABinsted1994,Binsted1997a. Punning riddles form a specific joke genre and have received considerable attention in the context of computational humor, presumably because they are relatively straightforward to define, often relying on spelling or word sense ambiguities. Many good, human-produced examples have been collected in joke books and sites and may thus act as a source of inspiration or training data.
Simplifying somewhat, jape (Joke Analysis and Production Engine) relies on a template-based nlg system, combining fixed text (What’s the difference between X and Y? or What do you call X?) with slots, which are the source of the riddle. Various standard lexical resources are used for joke production, including a British pronunciation dictionary (to find different words with a similar pronunciation, such as ‘bizarre’ and ‘bazaar’) and WordNet \shortcite[to find words with a similar meaning, such as bazaar and market]Miller1995. jape uses various techniques to create the punning riddles, such as juxtaposition, in which related words are simply placed next to each other and treated as a normal construction, while making sure that the combination is novel (i.e., not in the jape database already). It is interesting to observe that in this way jape may automatically come up with existing jokes (a quick Google search reveals that many bizarre bazaars, as well as bazaar bizarres, exist).
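The template-plus-lexicon idea can be sketched as follows. The tiny hand-built tables stand in for the pronunciation dictionary and WordNet, and the function names are hypothetical; this is a toy illustration of slot filling with a novelty check, not the jape implementation.

```python
# Toy stand-ins for the lexical resources (all entries are illustrative).
SOUND_ALIKE = [("bizarre", "bazaar")]  # (adjective, noun) pairs that sound similar
SYNONYMS = {"bizarre": "weird", "bazaar": "market"}  # stand-in for WordNet lookup


def article(word):
    """Pick the indefinite article using a simple first-letter heuristic."""
    return "an" if word[0] in "aeiou" else "a"


def punning_riddles(seen=None):
    """Fill the 'What do you call X?' template, skipping answers already produced."""
    seen = set() if seen is None else seen
    riddles = []
    for adj, noun in SOUND_ALIKE:
        answer = f"{article(adj).capitalize()} {adj} {noun}."
        if answer in seen:
            continue  # novelty check: do not repeat a stored joke
        seen.add(answer)
        clue_adj, clue_noun = SYNONYMS[adj], SYNONYMS[noun]
        question = f"What do you call {article(clue_adj)} {clue_adj} {clue_noun}?"
        riddles.append((question, answer))
    return riddles


for question, answer in punning_riddles():
    print(question)
    print(answer)
```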
Following the seminal work of Binsted and Ritchie, various other systems have been developed which can automatically generate jokes, including for example the hahacronym system of \shortciteAStock2005, which produces humorous acronyms, and the system of \shortciteABinsted2003, which focusses on the generation of referential jokes (“It was so cold, I saw a lawyer with his hands in his own pockets.”).
\shortciteApetrovic2013unsupervised offer an interesting, unsupervised alternative to this earlier work, which does not require labelled examples or hard-coded rules. Like their predecessors, \citeauthorpetrovic2013unsupervised also start from a template – in their case I like my X like I like my Y, Z – where X and Y are nouns (e.g., coffee and war) and Z is an attribute (e.g., cold). Clearly, linguistic realisation is not an issue, but content selection – finding ‘funny’ triples X, Y and Z – is a challenge. Interestingly, the authors postulate a number of guiding principles for ‘good’ triples. In particular, they hypothesize that (a) the joke is funnier if the attribute Z can be used to describe both nouns X and Y; (b) the joke is funnier if attribute Z is both common and ambiguous; and (c) the joke is funnier the more dissimilar X and Y are. These three statements can be quantified relying on standard resources such as WordNet and the Google n-gram corpus \shortciteBrants2006, and using these measures their system outputs, for example:
I like my relationships like I like my source, open.
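One way to make the three guiding principles concrete is to score each candidate triple as a product of three factors, as in the sketch below. All counts, frequencies, sense numbers and similarity values are invented toy figures standing in for statistics that would be derived from resources such as n-gram counts and WordNet, and the simple product is an assumption made for illustration rather than the authors’ actual model.

```python
from itertools import combinations

# Invented toy statistics (noun-attribute co-occurrence counts, attribute
# frequency, attribute sense counts, and noun-noun similarity).
COOC = {
    ("coffee", "cold"): 30, ("war", "cold"): 40, ("tea", "cold"): 10,
    ("coffee", "hot"): 50, ("war", "hot"): 2, ("tea", "hot"): 45,
}
FREQUENCY = {"cold": 0.6, "hot": 0.7}
SENSES = {"cold": 3, "hot": 4}
SIMILARITY = {
    frozenset(("coffee", "war")): 0.05,
    frozenset(("coffee", "tea")): 0.9,
    frozenset(("tea", "war")): 0.05,
}


def score(x, y, z):
    """Product of the three factors: shared attribute, common/ambiguous, dissimilar."""
    shared = COOC.get((x, z), 0) * COOC.get((y, z), 0)   # principle (a)
    common_and_ambiguous = FREQUENCY[z] * SENSES[z]      # principle (b)
    dissimilarity = 1.0 - SIMILARITY[frozenset((x, y))]  # principle (c)
    return shared * common_and_ambiguous * dissimilarity


def best_triple(nouns, attributes):
    """Return the highest-scoring (X, Y, Z) triple over all noun pairs and attributes."""
    candidates = [
        (x, y, z) for x, y in combinations(sorted(nouns), 2) for z in sorted(attributes)
    ]
    return max(candidates, key=lambda triple: score(*triple))


x, y, z = best_triple({"coffee", "war", "tea"}, {"cold", "hot"})
joke = f"I like my {x} like I like my {y}, {z}."
print(joke)
```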
It is probably fair to say that computational joke generation research to date has mostly focussed on laying bare the basic structure of certain relatively simple puns and exploiting these to good effect \shortcite¡e.g.,¿Ritchie2009. However, many other kinds of jokes exist, often requiring sophisticated, hypothetical reasoning. Presumably, many of the central problems within ai need to be solved first before generation systems will be capable of producing these kinds of advanced jokes.
6.2 Generating metaphors and similes
Whether you think something is funny or not may be subjective, but in any case insights from joke generation can be useful as a stepping stone towards a better understanding of creative language use, including metaphor, simile and analogy. In all of these, a mapping is made between two conceptual domains, in such a way that terminology from the source domain is used to say something about the target domain, typically in a nonliteral fashion, which can be helpful in computer-generated texts to illustrate complex information. For example, \shortciteAhervas2006cross study analogies in narrative contexts, such as Luke Skywalker was the King Arthur of the Jedi Knights, which immediately clarifies an important aspect of Luke Skywalker for those not in the know. In a simile, the two domains are compared (A ‘is like’ B); in a metaphor they are equated. Jokes and metaphors/similes are related: the automatically generated jokes of \citeauthorpetrovic2013unsupervised are comparable to similes, while \shortciteAkiddon2011thats, for example, frame the problem of identifying double entendre jokes as a type of metaphor identification. Nevertheless, one could argue that generating jokes is more complex because of the extra funniness constraint.
Like computational humor, the automatic recognition and interpretation of metaphorical, non-literal language has received considerable attention since the early days of ai \shortcite¡see¿[for an overview]Shutova2013. \shortciteAMartin1990,Martin1994, for example, focussed on the recognition of metaphor in the context of Unix Support, as in the following examples:
How can I kill a process?
How can I enter lisp?
The first one, for example, makes a mapping between ‘life’ (source) and ‘processes’ (target), and is by now so common that it is almost a dead metaphor, but this was not the case in the early days of Unix. Clearly, understanding of the metaphors is a prerequisite for automatically answering these questions. Early research on the computational interpretation of metaphor already recognised that metaphors rely on semantic conventions that are exploited (‘broken’) to express new meanings. A system for metaphor understanding, as well as one for metaphor generation, therefore requires knowledge about what literal meanings are, and how these can be stretched or translated into metaphoric meanings \shortcite¡e.g.,¿Wilks1978,Fass1991.
Recent work by Veale and Hao \shortciteVeale2007,Veale2008 has shown that this kind of knowledge can be acquired from the web, and used for the generation of new metaphors and similes (comparisons). Their system, called Sardonicus, is capable of generating metaphors for user-provided targets (t), such as the following, expressing that Paris Hilton (“the person, not the hotel, though the distinction is lost on Sardonicus”, Veale & Hao, 2007, p. 1474) is skinny:
Paris Hilton is a stick
Sardonicus searches the web for nouns (n) that are associated with skinniness, which are included in a case-base and range from pole, pencil, and stick to snake and stick insect. Inappropriate ones (like cadaver) are ruled out, based on the theory of category-inclusion of \shortciteAGlucksberg2001. This list of potential similes is then used to create Google queries, inspired by the work of \shortciteAHearst1992, of the form n-like t (e.g., stick insect-like Paris Hilton, which actually occurs on the web), giving a ranking of the potential similes to be generated.
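The query-and-rank step just described can be sketched as follows. This is a minimal illustration, not the Sardonicus implementation: `hit_count` is a hypothetical callable standing in for a web search, and the hit counts below are invented.

```python
def simile_queries(target, vehicles):
    """Build one quoted 'n-like t' query string per candidate noun n."""
    return {n: f'"{n}-like {target}"' for n in vehicles}


def rank_vehicles(target, vehicles, hit_count):
    """Rank candidate nouns by the hit count of their query, highest first."""
    queries = simile_queries(target, vehicles)
    return sorted(vehicles, key=lambda n: hit_count(queries[n]), reverse=True)


# Made-up hit counts standing in for real web search results.
fake_hits = {
    '"stick-like Paris Hilton"': 40,
    '"stick insect-like Paris Hilton"': 12,
    '"pole-like Paris Hilton"': 3,
}
ranking = rank_vehicles(
    "Paris Hilton",
    ["pole", "stick insect", "stick"],
    lambda query: fake_hits.get(query, 0),
)
```

Passing the search function in as a parameter keeps the ranking logic independent of any particular search service.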
A comparable technique is used by \shortciteAVeale2013 to generate metaphors with an affective component, as in ‘Steve Jobs was a great leader, but he could be such a tyrant’. The Google n-gram corpus is used to find stereotypes suitable for simile generation (e.g., ‘lonesome as a cowboy’), a strategy reminiscent of the use of web-scale n-gram data to smooth the output of image-to-text systems (see Section 4). Next, an affective dimension is added, based on the assumption that properties occurring in a conjunction (‘as lush and green as a jungle’) are more likely to have the same affect than properties that do not. Using positive (e.g., ‘happy’, ‘wonderful’) and negative (e.g., ‘sad’, ‘evil’) seeds, coordination queries (e.g., ‘happy and X’) are used to collect positive and negative labels for stereotypes, indicating, for instance, that babies are positively associated with qualities such as ‘smiling’ and ‘cute’, and negatively associated with ‘crying’ and ‘sniveling’. This enables the automatic generation of positive (‘cute as a baby’) and negative (‘crying like a baby’) similes. \shortciteAVeale2013 even points out that by collecting, for example, a number of negative metaphors for Microsoft being a monopoly, and using these in a set of predefined tropes, it becomes possible to automatically generate a poem such as the following:
No Monopoly Is More Ruthless
Intimidate me with your imposing hegemony
No crime family is more badly organized, or controls more ruthlessly
Haunt me with your centralized organization
Let your privileged security support me
O Microsoft, you oppress me with your corrupt reign
In fact, automatic generation of poetry is an emerging area at the crossroads of computational creativity and natural language generation \shortcite¡see for example¿[for variations on this theme]Lutz1959,Gervas2001,Wong2008,Netzer2009,Greene2010,Colton2012,Manurung2012,zhang2014chinese. See \shortciteAOliveira2017 for a recent review.
6.3 Generating narratives
Computational narratology is concerned with computational models for the generation and interpretation of narrative texts \shortcite¡e.g.,¿Gervas2009,Mani2010,Mani2013. The starting point for many approaches to narrative generation is a view of narrative coming from classical narratology, a branch of literary studies with roots in the Formalist and Structuralist traditions \shortcite¡e.g.,¿Propp1968,Genette1980,Bal2009. This field has been concerned with analysing both the defining characteristics of narrative, such as plot or character, and more subtle features, such as the handling of time and temporal shifts, focalisation (that is, the ability to convey to the reader that a story is being recounted from a specific point of view), and the interaction of multiple narrative threads, in the form of sub-plots, parallel narratives, etc. An important recent development is the interest, on the part of narratologists, in bringing to bear insights from Cognitive Science and ai on their literary work, making this field ripe for multi-disciplinary interaction \shortcite¡see especially¿[for programmatic statements to this effect, as well as theoretical contributions]Herman1997,Herman2007,Meister2003.
Classical narratology makes a fundamental distinction between the ‘story world’ and the text that narrates the story. In line with the formalist and structuralist roots of this tradition, the distinction is usually articulated as a dichotomy between fabula (or story) and sjuzhet (or discourse). There is a parallel between this distinction and that between a text plan in nlg, versus the actual text which articulates that plan. However, the crucial difference is that in producing a plan for a narrative, a story generation system typically does not use input data of the sort required by most of the nlg systems reviewed thus far, since the story is usually fictional. On the other hand, narratological tools have also been successfully applied to real-world narratives, including oral narratives of personal experience \shortcite¡e.g.,¿Herman2001,Labov2010.
The focus of most work on narrative generation has been on the pre-linguistic stage, that is, on generating plans within a story world for fictional narratives, usually within a specific genre whose structural properties are well-understood, for example, fairy tales or Arthurian legends \shortcite¡see¿[for a review]Gervas2013. There are however links between the techniques used for such stories and those we have discussed above in relation to nlg (see especially Section 3.2). Prominent among these are planning and reasoning techniques to model the creative process as a problem-solving task. For example, minstrel \shortciteTurner1992 uses reasoning to model creativity from the author’s perspective, producing narrative plans based on authorial goals, such as the goal of introducing drama into a narrative, while ensuring thematic consistency.
More recently, brutus \shortciteBringsjord1999 used a knowledge base of story schemas, from which one is selected and elaborated using planning techniques to link causes and effects \shortcite¡see also¿[among others, for recent examples of the use of planning techniques to model the creative process in narrative generation]Young2008,Riedl2010.
As \shortciteAGervas2010 notes, the focus on planning story worlds and modelling creativity has often implied a sidelining of linguistic issues, so that rendering a story plan into text has often been viewed as a secondary consideration. For example, Figure 6(a) shows an excerpt of a story produced by the talespin system \shortciteMeehan1977: here, the emphasis is on using problem-solving techniques to produce a narrative in which events follow from each other in a coherent fashion, rather than on telling it in a fluent way. An important exception to this trend is the work of \shortciteACallaway2002, who explicitly addressed the gap between computational narratology and nlg. Their system took a narrative plan as a starting point, but focussed on the process of rendering the narrative in fluent English, handling time shifts, aggregation, anaphoric nps and many other linguistic phenomena, as the excerpt in Figure 6(b) shows. It is worth noting that this system has since been re-used in the context of generating interactive text for a portable museum guide by \shortciteAStock2007.
In addition, there have been a number of contributions from the generation community on more specific issues related to narrative, such as how to convey the temporal flow of narrative discourse \shortciteOberlander1992,Dorr1995,Elson2010. This is a problem that deserves more attention in nlg, since texts with a complex narrative structure often narrate events in an order different from that in which they occurred. For example, a narrative or narrative-like text may recount events in order of importance rather than in temporal order, even when they are grounded in real-world data \shortcite¡e.g.¿Portet2009. This makes the right choices of tense, aspect and temporal adverbials crucial to ensure clarity for the reader. This type of complexity in narrative structure also emerges in interactive narrative fiction \shortcite¡for example, in games; cf.,¿montfort2007ordering.
Beyond the focus on specific linguistic issues, there has also been some work that leverages data-driven techniques to generate stories. For example, \shortciteAMcintyre2009 propose a story generation system whose input is a database of entities and their interactions, extracted from a corpus of stories by parsing them, retrieving grammatical dependencies, and building chains of events in which specific entities play a role. The outcome is a graph encoding a partial order of events, with edges weighted by mutual information to reflect the degree of association between nodes. Sentence planning then takes place using template-like grammar rules specifying verbs with subcategorisation information, followed by realisation using realpro \shortciteLavoie1997. One of the most interesting features of this work is the coupling of the generation model with an interest model to predict which stories would actually be rated as interesting by readers. This was achieved by training a kernel-based classifier on shallow lexical and syntactic features of stories, a novel take on an old problem in narratology, namely, what makes a story ‘tellable’, thereby distinguishing it from a mere report \shortcite¡e.g.,¿Herman1997,Norrick2005,Bruner2011.
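To illustrate how edges in such an event graph might be weighted by association strength, the sketch below computes pointwise mutual information (PMI) for pairs of directly adjacent events in a set of event chains. This is a simplification and not the cited system: PMI is used here as one common way of operationalising mutual information between two specific events, and the chains and the function name `pmi_edges` are hypothetical.

```python
import math
from collections import Counter


def pmi_edges(chains):
    """Weight each ordered pair (a, b), where a directly precedes b, by PMI."""
    event_counts = Counter(e for chain in chains for e in chain)
    pair_counts = Counter(
        (a, b) for chain in chains for a, b in zip(chain, chain[1:])
    )
    n_events = sum(event_counts.values())
    n_pairs = sum(pair_counts.values())
    edges = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = event_counts[a] / n_events
        p_b = event_counts[b] / n_events
        # PMI is positive when a and b co-occur more often than chance predicts.
        edges[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return edges


# Invented event chains for a single protagonist.
chains = [
    ["meet", "marry", "live"],
    ["meet", "fight", "leave"],
    ["meet", "marry", "leave"],
]
edges = pmi_edges(chains)
```

Because each pair is stored in the order in which it occurs, the resulting dictionary can be read as a directed, weighted graph over events.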
Most story generation work is restricted to (very) short stories. It is certainly true that planning a book-length narrative along the lines sketched above is extremely challenging, but researchers have recently started exploring the possibilities, for instance in the context of NaNoGenMo (National Novel Generation Month), in which participants write a computer program capable of generating a ‘novel’. Perhaps the best known example is World Clock \shortcitemontfort2013world which describes 1440 (24 × 60) events taking place around the world, one randomly selected minute at a time. These are the first two:
It is now exactly 05:00 in Samarkand. In some ramshackle dwelling a person who is called Gang, who is on the small side, reads an entirely made-up word on a box of breakfast cereal. He turns entirely around.
It is now right about 18:01 in Matamoros. In some dim yet decent structure a man named Tao, who is no larger or smaller than one would expect, reads a tiny numeric code from a recipe clipping. He smiles a tiny smile.
The book was fully generated by 165 lines of Python code, written by the author in a few hours, and later published (together with the software) by Harvard Book Store press. There is even a Polish translation (by Piotr Marecki), created by translating the terms and phrases used in the Python implementation of the original algorithm.
6.4 Generating creative language: Concluding remarks
In this section we have highlighted recent developments in the broad area of creative language generation, a topic which is rather understudied in nlg. Nevertheless, we would like to argue that nlg researchers can improve the quality of their output by taking insights from computational creativity on board.
Work that exploits corpora and other lexical resources for the automatic generation of jokes, puns, metaphors and similes has revealed different ways in which words are related and can be juxtaposed to form unexpected and possibly even ‘funny’ or ‘poetic’ combinations. Given that, for example, metaphor is pervasive in everyday language \shortcite¡as argued, for example, by¿Lakoff1980, not just in overtly creative uses, nlg researchers interested in enhancing the readability – and especially the variability – of the text-generating capability of their models would benefit from a closer look at work in poetry, joke and metaphor generation.
In a similar vein, work on narratology is rich in insights on the interaction of multiple threads in a single narrative, and how the choice of events and their ordering can give rise to interesting stories \shortcite¡e.g.,¿Gervas2012. These insights are valuable, for example, in the development of more elaborate text planners in domains where time and causality play a role. Similarly, narratological work on character and focalisation can also help in the development of better nlg techniques to vary output according to specific points of view, an area that we touched on in Section 5.
We have deferred discussion of evaluation of creative nlg to Section 7, which deals with evaluation in general. Anticipating some of that discussion, it is worth noting that evaluation of creative language generation remains something of a bottleneck. In part, this is because it is not always easy to determine the ‘right’ question to ask in an evaluation of creative text. For instance, in the case of joke and poetry generators, demonstrating genre compatibility and recognition (‘Is this a joke?’) is arguably already an achievement, insofar as it suggests that a system is producing artefacts that conform to normative expectations (this is discussed further in Section 7.1.3 below). In other types of creative language generation, evaluation is more challenging because it is difficult to carry out without ensuring quality at all levels of the generation process, from planning to realisation. In the case of narrative generation, for example, if the emphasis is placed entirely on story planning, the perceived quality of the narrative will be compromised if story plans are rendered using an excessively simple realisation strategy (as is the case in Figure 6(a)). This is an area where the consensus in the field is that much further research effort is required \shortcite¡see¿[for a recent argument to this effect]Zhu2012. It is also an area in which nlg can potentially offer much to computational creativity researchers, including in the use of techniques to render text fluently and consistently, facilitating the evaluation of generated artefacts with human subjects.
Though we have touched on the subject of evaluation at various points, it deserves a full discussion as a topic which has become a central methodological concern in nlg. A factor that contributed to this development was the establishment of a number of nlg shared tasks, launched in the wake of an nsf-funded workshop held in Virginia in 2007 \shortciteDale2007. These tasks have focussed on referring expression generation \shortciteBelz2010,Gatt2010; surface realisation \shortciteBelz2011; generation of instructions in virtual environments \shortciteStriegnitz2011,Janarthanam2011; content determination \shortciteBouayadAgha2013,Banik2013; and question generation \shortciteRus2011. Recent proposals for new challenges extend these to narrative generation \shortciteconcepcion-EtAl:2016:INLG, generation from structured web data \shortcitecolin-EtAl:2016:INLG, and from pairs of meaning representations and text \shortcitenovikova-rieser:2016:INLG,May2017. In image captioning, shared tasks have helped the development of large-scale datasets and evaluation servers such as ms-coco (http://mscoco.org/dataset/#captions-upload; cf. Section 4.1).
In general, however, nlg evaluation is marked by a great deal of variety and it is difficult to compare systems directly. There are at least two reasons why this is the case.
There is no single, agreed-upon input format for nlg systems \shortciteMcDonald1993,Mellish1998a,evans2002nlg. Typically, one can only compare systems against a common benchmark if the input is similar. Examples are the image-captioning systems described in Section 4, or systems submitted to one of the shared tasks mentioned above. Even in case a common ‘standard’ dataset is available for evaluation, comparison may not be straightforward due to input variation, or due to implicit biases in the input data. For example, \shortciteARajkumar2014 observe that, despite many realisers being evaluated against the Penn Treebank, they make different assumptions about the input format, including how detailed the pre-syntactic input representation is, a problem also observed in the first Surface Realisation shared task \shortciteBelz2011. As \shortciteARajkumar2014 note, a comparison of realisers on the basis of scores on the Penn Treebank shows that the highest-ranking is the fuf/surge realiser (which is second in terms of coverage), based on experiments by \shortciteACallaway2005. However, these experiments required painstaking effort to extract the input representations at the level of detail needed by fuf/surge; other realisers support more underspecified input. In a related vein, image captioning evaluation studies have shown that many datasets contain a higher proportion of nouns than verbs, and few abstract concepts \shortciteFerraro2015, making systems that generate descriptions emphasising objects more likely to score better. The relevance of this observation is shown by \shortciteAElliott2015, who note that the ranking of their image captioning system based on visual dependency grammar depends in part on the data it is evaluated on, with better performance on data containing more images depicting actions (we return to this study below).
Multiple possible outputs
Even for a single piece of input and a single system, the range of possible outputs is open-ended, a problem that arguably holds for any nlp task involving textual output, including machine translation and summarisation. Corpora often display a substantial range of variation and it is often unclear, without an independent assessment, which outputs are to be preferred \shortciteReiterSripada2002. In the image captioning literature, authors who have framed the problem in terms of retrieval have motivated the choice in part based on this problem, arguing that ‘since there is no consensus on what constitutes a good image description, independently obtained human assessments of different caption generation systems should not be compared directly’ \shortcite[p. 580]Hodosh2013. While capturing variation may itself be a goal \shortcite¡e.g.,¿Belz2008,Viethen2010,Hervas2013,castro2016towards, as we also saw in our discussion of style in Section 5, this is not always the case. Thus, in a user-oriented evaluation, the SumTime-mousam system weather forecasts were preferred by readers over those written by forecasters because the latter’s lexicalisation decisions were susceptible to apparently arbitrary variation \shortciteReiter2005; similar outcomes were more recently reported for statistical nlg systems trained on the SumTime corpus \shortciteBelz2008,Angeli2010.
Rather than give an exhaustive review of nlg evaluation – hardly a realistic prospect given the diversity we have pointed out – the rest of this section will highlight some topical issues in current work. By way of an overview of these issues, consider the hypothetical scenario sketched in Figure 8, which is loosely inspired by work on various weather-reporting systems developed in the field. This nlg system is embedded in the environment of an offshore oil-rig; the relevant features of the setup \shortcite¡in the sense of¿SparckJones1996 are the system itself and its users, here a group of engineers. While the task of the system is to generate weather reports from numerical weather prediction data, its ultimate purpose is to facilitate users’ planning of drilling and maintenance operations. Figure 8 highlights some of the common questions addressed in nlg evaluation, together with a broad typology of the methods used to address them, in particular, whether they are objective – that is measurable against an external criterion, such as corpus similarity or experimentally obtained behavioural data – or subjective, requiring human judgements.
A fundamental methodological distinction, due to \shortciteASparckJones1996, is between intrinsic and extrinsic evaluation methods. In the case of nlg, an intrinsic evaluation measures the performance of a system without reference to other aspects of the setup, such as the system’s effectiveness in relation to its users. In our example scenario, questions related to text quality, correctness of output and readability qualify as intrinsic, whereas the question of whether the system actually achieves its goal in supporting adequate decision-making on the offshore platform is extrinsic.
7.1 Intrinsic methods
Intrinsic evaluation in nlg is dominated by two methodologies, one relying on human judgements (and hence subjective), the other on corpora.
7.1.1 Subjective (human) judgements
Human judgements are typically elicited by exposing naive or expert subjects to system outputs and getting them to rate them on some criteria. Common criteria include:
Fluency or readability, that is, the linguistic quality of the text \shortcite¡e.g.,¿[inter alia]Callaway2002,Mitchell2012,Stent2005a,Lapata2006,Cahill2009,Espinosa2010;
Accuracy, adequacy, relevance or correctness relative to the input, reflecting the system’s rendition of the content \shortcite¡e.g.¿Lester1997,Sripada2005,Hunter2012, a criterion often used in subjective evaluations of image-captioning systems as well \shortcite¡e.g.¿Kulkarni2011,Mitchell2012,Kuznetsova2012,Elliott2013.
Though they are the most common, these two sets of criteria do not exhaust the possibilities. For example, subjective ratings have also been elicited for argument effectiveness in a system designed to generate persuasive text for prospective house buyers \shortciteCarenini2006. In image captioning, at least one system was evaluated by asking users to judge the creativity of the generated caption, with a view to assessing the contribution of web-scale n-gram language models to the captioning quality \shortciteLi2011. Below, we also discuss judgements of genre compatibility (Section 7.1.3). In the case of fictional narrative, some evaluations have elicited judgments on qualities such as novelty \shortcite¡e.g.,¿Perez2011 or believability of characters \shortcite¡e.g.,¿Riedl2005a.
The use of scales to elicit judgements raises a number of questions. One has to do with the nature of the scale itself. While discrete, ordinal scales are the dominant method, a continuous scale – for example, one involving a visually presented slider \shortciteGatt2010,Belz2011a – might give subjects the possibility of giving more nuanced judgements. For example, a text generated by our hypothetical weather report system might be judged so disfluent as to be given the lowest rating on an ordinal scale; if the following text is judged as being worse, a subject would have no way of indicating this. A related question is whether subjects find it easier to compare items rather than judge each one in its own right. This question has begun to be addressed in the nlp evaluation literature, usually with binary comparisons, for example between the outputs of two mt systems \shortcite¡see¿[for discussion]Dras2015. In a recent study evaluating causal connectives produced by an nlg system, \shortciteASiddharthan2012a used Magnitude Estimation, whereby subjects are not given a predefined scale, but are asked to choose their own and proceed to make comparisons of each item to a ‘modulus’, which serves as a comparison point throughout the experiment \shortcite¡see¿Bard1996. (The modulus is an item – a text, or a sentence – which is selected in advance and which subjects are asked to rate first. All subsequent ratings or judgements are performed in comparison to this modulus item. Though subjects are able to use any scale they choose, this method allows all judgements to be normalised by the judgement given for the modulus. Typically, normalised judgements are analysed on a logarithmic scale.) \shortciteABelz2010a compared a preference-based paradigm to a standard rating scale to evaluate systems from two different domains (weather reporting and reg), and found that the former was more sensitive to differences between systems, and less susceptible to variance between subjects.
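The normalisation step in Magnitude Estimation can be sketched as follows: each judgement is divided by the same subject's judgement of the modulus, and the result is log-transformed. This is a minimal sketch under stated assumptions; the function name, the choice of base-10 logarithms and the ratings are illustrative and not taken from the cited studies.

```python
import math


def normalise_judgements(modulus_rating, ratings):
    """Divide each rating by the subject's modulus rating, then take log10."""
    if modulus_rating <= 0 or any(r <= 0 for r in ratings):
        raise ValueError("magnitude estimates must be positive")
    return [math.log10(r / modulus_rating) for r in ratings]


# Two hypothetical subjects who chose very different personal scales
# but made the same relative judgements.
subject_a = normalise_judgements(10, [20, 5, 10])
subject_b = normalise_judgements(100, [200, 50, 100])
```

After normalisation the two subjects' judgements coincide, which is what makes ratings on self-chosen scales comparable across subjects; an item rated the same as the modulus maps to zero.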
An additional concern with subjective evaluations is inter-rater reliability. Multiple judgements by different evaluators may exhibit high variance, a problem that was encountered in the case of Question Generation \shortciteRus2011. Recently, \shortciteAGodwin2016 suggested that such variance can be reduced by an iterative method whereby training of judges is followed by a period of discussion, leading to the updating of evaluation guidelines. This, however, is more costly in terms of time and resources.
It is probably fair to state that, these days, subjective, human evaluations are often carried out via online platforms such as Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) and CrowdFlower (https://www.crowdflower.com), though this is probably more feasible for widely-spoken languages such as English. A seldom-discussed issue with such platforms concerns their ethical implications \shortcite¡for example, they involve large groups of poorly paid individuals; see¿Fort2011 as well as the reliability of the data collected, though measures can be put in place to ensure, for instance, that contributors are fluent in the target language \shortcite¡see e.g.,¿goodman2013data,mason2012conducting.
7.1.2 Objective humanlikeness measures using corpora
Intrinsic methods that rely on corpora can generally be said to be addressing the question of ‘humanlikeness’, that is, the extent to which the system’s output matches human output under comparable conditions. From the developer’s perspective, the selling point of such methods is their cheapness, since they are usually based on automatically computed metrics. A variety of corpus-based metrics, often used earlier in related fields such as Machine Translation or Summarisation, have been used in nlg evaluation. Some of the main ones are summarised in Table 1, which groups them according to their principal characteristics, and for each adds a key reference.
|Type|Metric|Description|Origin|
|---|---|---|---|
|N-gram overlap|bleu|Precision score over variable-length n-grams, with a length penalty \shortcitePapineni2002 and, optionally, smoothing \shortciteLin2004.|mt|
|N-gram overlap|nist|A version of bleu with higher weighting for less frequent n-grams and a different length penalty \shortciteDoddington2002.|mt|
|N-gram overlap|rouge|Recall-oriented score, with options for comparing non-contiguous n-grams and longest common subsequences \shortciteLin2003.|as|
|N-gram overlap|meteor|Harmonic mean of unigram precision and recall, with options for handling (near-)synonymy and stemming \shortciteLavie2007.|mt|
|N-gram overlap|gtm|General Text Matcher. F-Score based on precision and recall, with greater weight for contiguous matching spans \shortciteTurian2003.|mt|
|N-gram overlap|cider|Cosine-based n-gram similarity score, with n-gram weighting using tf-idf \shortciteVedantam2015.|ic|
|N-gram overlap|wmd|Word-Mover Distance, a similarity score between texts, based on the (semantic) distance between words in the texts \shortciteKusner2015. For nlp, distance is operationalised using normalised bag of words (nbow) representations \shortciteMikolov2013.|ds; ic|
|String distance|Edit distance|Number of insertions, deletions, substitutions and, possibly, transpositions required to transform the candidate into the reference string \shortciteLevenshtein1966.|n/a|
|String distance|ter|Translation edit rate, a version of edit distance \shortciteSnover2006.|mt|
|String distance|terp|Version of ter handling phrasal substitution, stemming and synonymy \shortciteSnover2006.|mt|
|String distance|terpa|Version of ter optimised for correlations with adequacy judgements \shortciteSnover2006.|mt|
|Content overlap|Dice/Jaccard|Set-theoretic measures of overlap between two unordered sets (e.g. of predicates or other content units).|n/a|
|Content overlap|masi|Measure of agreement between set-valued items, a weighted version of Jaccard \shortcitePassonneau2006.|as|
|Content overlap|pyramid|Overlap measure relying on comparison of weighted Summarization Content Units (SCUs) \shortcitenenkova2004,yang2016.|as|
|Content overlap|spice|Measure of overlap between candidate and reference texts based on propositional content, obtained by parsing captions into scene graphs representing objects and relations \shortciteAnderson2016.|ic|
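To make the n-gram overlap family in Table 1 more concrete, the following sketch computes a clipped (‘modified’) n-gram precision of the kind underlying bleu, alongside a simple n-gram recall in the spirit of rouge. It is a simplified illustration rather than a reference implementation of either metric: it omits bleu's length penalty, the combination over several n-gram orders, and smoothing, and the function names are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the maximum count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def ngram_recall(candidate, reference, n):
    """Proportion of reference n-grams that are matched in the candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not ref_counts:
        return 0.0
    matched = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return matched / sum(ref_counts.values())


candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
precision = modified_precision(candidate, [reference], 1)
recall = ngram_recall(candidate, reference, 1)
```

Clipping prevents a degenerate candidate that repeats a frequent word from obtaining a perfect precision score: here only two of the seven occurrences of ‘the’ are credited.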
Measures of n-gram overlap or string edit distance, usually originating in Machine Translation or Summarisation \shortcite¡with some exceptions, such as cider,¿Vedantam2015 are frequently used for evaluating surface realisation \shortcite¡e.g.,¿White2007,Cahill2006,Espinosa2010,Belz2011 and occasionally also to evaluate short texts characteristic of data-driven systems in domains such as weather reporting \shortcite¡e.g.¿Reiter2009a,Konstas2013 and image captioning \shortcite¡see¿Bernardi2016,Kilickaya2017. Edit distance metrics have been exploited for realisation \shortciteEspinosa2010, but also for reg \shortciteGatt2010.
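As an illustration of the string distance family, the sketch below computes the basic edit distance between a candidate and a reference using insertions, deletions and substitutions only (no transpositions). It works on any pair of sequences, so it can be applied to characters or to tokens; the function name and examples are illustrative assumptions.

```python
def edit_distance(candidate, reference):
    """Minimum number of insertions, deletions and substitutions needed
    to turn the candidate sequence into the reference sequence."""
    # previous[j] holds the distance between the candidate prefix processed
    # so far and the first j elements of the reference.
    previous = list(range(len(reference) + 1))
    for i, c in enumerate(candidate, start=1):
        current = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if c == r else 1
            current.append(min(
                previous[j] + 1,         # delete c from the candidate
                current[j - 1] + 1,      # insert r into the candidate
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


char_level = edit_distance("kitten", "sitting")
token_level = edit_distance("the red chair".split(), "the large red chair".split())
```

Keeping only two rows of the dynamic-programming table is sufficient when only the distance, and not the alignment itself, is required.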
The focus of these metrics is on the output text, rather than its fidelity to the input. In a limited number of cases, surface-oriented metrics have been used to evaluate the adequacy with which output text reflects content \shortciteBanik2013,Reiter2009a. However, if content determination is the focus, a measure of surface overlap is at best a proxy, relying on an assumption of a straightforward correspondence between input and output. This assumption may be tenable if texts are brief and relatively predictable. In some cases, it has been possible to use metrics to measure content determination directly, based on semantically annotated corpora. For instance, reg algorithms have been evaluated in this fashion using set overlap metrics \shortciteViethen2007,Deemter2012. Also relevant in this connection is the pyramid method \shortcitenenkova2004 for summarisation, which relies on the identification of the content units (which maximally correspond to clauses) in multiple human summaries. These are weighted and ordered by their frequency of mention by human summarisers. A candidate summary is scored according to the ratio of the weight of the content units it includes to the weight of an ideal summary bearing the same number of content units \shortcite¡see¿[for discussion]Nenkova2011.
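The set overlap measures and the pyramid ratio described above can be sketched as follows. This is a minimal illustration under stated assumptions: content units are assumed to have been identified already and are represented as plain string labels, whereas in practice their identification requires annotation; the function names and data are hypothetical.

```python
from collections import Counter


def dice(a, b):
    """Dice coefficient between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard coefficient between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pyramid_score(candidate_units, human_summaries):
    """Weight of the candidate's content units, relative to the weight of an
    ideal summary containing the same number of units."""
    # A unit's weight is the number of human summaries that mention it.
    weights = Counter(u for summary in human_summaries for u in set(summary))
    candidate_units = set(candidate_units)
    observed = sum(weights[u] for u in candidate_units)
    ideal = sum(sorted(weights.values(), reverse=True)[:len(candidate_units)])
    return observed / ideal if ideal else 0.0


# Referring expressions as sets of attribute-value pairs.
human_description = {"type:chair", "colour:red"}
system_description = {"type:chair", "size:large"}
reg_dice = dice(human_description, system_description)
reg_jaccard = jaccard(human_description, system_description)

# Content units mentioned in three human summaries, and in a candidate.
summaries = [{"A", "B", "C"}, {"A", "B"}, {"A", "D"}]
score = pyramid_score({"A", "D"}, summaries)
```

In the example, the candidate selects the most frequently mentioned unit but also a rarely mentioned one, so it falls short of the ideal two-unit summary, which would contain the two highest-weighted units.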
Direct measurements of content overlap between generated and reference outputs will likely increase, as automatic data-text alignment techniques make such ‘semantically transparent’ corpora more readily available for end-to-end nlg \shortcite¡see e.g.,¿[and the discussion in Section 3.3]Chen2008,Liang2009. An important development away from pure surface overlap is the use of semantic resources \shortcite¡as in the case of meteor, ¿Lavie2007, or word embeddings \shortcite¡as in wmd, ¿Kusner2015, to compute the proximity of output to reference texts beyond literal string overlap. In a comparative evaluation of metrics for image captioning, \shortciteAKilickaya2017 found an advantage for wmd compared to other metrics.
7.1.3 Evaluating genre compatibility and stylistic effectiveness
A slightly different question that has occasionally been posed in evaluation studies asks whether the linguistic artefact produced by a system is a recognisable instance of a particular genre or style. As noted in Section 5, it is difficult to ascertain to what extent readers actually perceive subtle stylistic variation. Thus, \shortciteAMairesse2011 found inconsistent perceptions of personality in the evaluation of personage, which was complicated by the fact that stylistic features interact and may cancel each other out.
Genre perception is a central question for approaches to generating creative language (see Section 6). For example, \shortciteAHardcastle2008 describe an evaluation of a generation system for cryptic crossword clues based on a Turing test in which the objective was to determine whether the system’s outputs were recognisably different from human-authored clues. In a related vein, when evaluating the jape joke generation system, \shortciteA[see Section 6.1]Binsted1997 presented 120 8-11 year old children with a number of punning riddles, some automatically generated by jape and some selected from joke books. They also included a number of non-joke controls, such as:
What do you get when you cross a horse and a donkey?
For each stimulus that they were exposed to, children were asked to indicate whether they thought it was a joke, and how funny they considered it. The results revealed that computer generated riddles were recognised as jokes, and considered funnier than non-jokes. Interestingly, the joke children rated highest was automatically generated by jape (we urge the reader to inspect the original paper), although in general, human-produced jokes were considered funnier by children than automatically generated ones. In this evaluation study, therefore, an extrinsic aspect of the generated text, concerning its efficacy (here, its ‘funniness’) was found to be correlated with its recognisability as an instance of the target genre.
\shortciteApetrovic2013unsupervised evaluated their unsupervised approach to joke generation by harvesting human-written jokes from Twitter, conforming to the I like my X … template used by their system. Blind ratings by human judges of human-written and automatically generated jokes showed that their best-performing model was rated as funny in 16% of cases, compared to 33% of the human jokes (itself a relatively low rate).
While the questions posed in these studies clearly have an intrinsic orientation (‘Is the text compatible with the expected genre conventions?’), they also have a bearing on extrinsic factors, since the ability to recognise an artefact as an instance of a genre or as exhibiting a certain style or personality is arguably one of the sources of its impact, which in turn includes judgments of whether a text is funny or interesting, for example.
Of course, the intention behind variation in style, personality or affect may well be to ultimately increase effectiveness in achieving some ulterior goal. Indeed, any nlg system intended to be embedded in a specific environment will need to address stylistic and genre-based issues. For example, our hypothetical weather report generator might use a very brief, technical style given its professional pool of target users \shortcite¡as was the case with SumTime¿Reiter2005; in contrast, weather reports intended for public consumption, such as those in the WeatherGov corpus, would probably be longer and less technical \shortciteAngeli2010.
However, there is a difference between evaluating whether genre constraints or stylistic variation help contribute to a goal, and evaluating whether the text actually exhibits the desired variation. For example, \shortciteAMairesse2011 evaluated the personage system (see Section 5) by asking users to judge personality traits as reflected in generated dialogue fragments (rather than, say, measuring whether users were more likely to eat at a restaurant if this was recommended by a configuration of the system with a high degree of extraversion). This is similar in spirit to the question about jokehood asked by \shortciteABinsted1997, in contrast to the more explicitly extrinsic evaluation of the standup joke generator by \shortciteAWaller2009, which asked whether the system actually helped users improve their interactions with peers.
7.2 Extrinsic evaluation methods
In contrast to intrinsic methods, extrinsic evaluations measure effectiveness in achieving a desired goal. In the example scenario of Figure 8, such an evaluation might address the impact on planning by the engineers who are the target users of the system. Clearly, ‘effectiveness’ is dependent on the application domain and purpose of a system. Examples include:
persuasion and behaviour change, for example, through exposure to personalised smoking cessation letters \shortciteReiter2003;
purchasing decision after presentation of arguments for and against options on the housing market based on a user model \shortciteCarenini2006;
engagement with ecological issues after reading blogs about migrating birds \shortciteSiddharthan2012;
decision support in a medical setting following the generation of patient reports \shortcitePortet2009,Hunter2012;
enhancing linguistic interaction among users with complex communication needs via the generation of personal narratives \shortciteTintarev2016;
enhancing learning efficacy in tutorial dialogue \shortciteDieugenio2005,Fossati2015,Boyer2011,Lipschultz2011,Chi2014.
While questionnaire-based or self-report studies can be used to address extrinsic criteria \shortcite¡e.g.,¿Hunter2012,Siddharthan2012,Carenini2006, in many cases evaluation relies on some objective measure of performance or achievement. This can be done with the target users in situ, enhancing the ecological validity of the study, but can also take the form of a task that models the scenarios for which the nlg system has been designed. Thus, in the give Challenge \shortciteStriegnitz2011, in which nlg systems generated instructions for a user to navigate through a virtual world, a large-scale task-based evaluation was carried out by having users play the give game online, while various indices of success were logged, including the time it took a user to complete the game. reg algorithms, whose goal was to generate identifying descriptions of objects in visual domains, were evaluated in part based on the time it took readers to identify a referent based on a generated description, as well as their error rate \shortciteGatt2010. skillsum, a system to generate feedback reports from literacy assessments, was evaluated by measuring how users’ self-assessment of their own literacy skills improved after reading generated feedback, compared to control texts \shortciteWilliams2008.
A potential drawback of extrinsic studies, in addition to time and expense, is a reliance on an adequate user base (which can be difficult to obtain when users have to be sampled from a specific population, such as the engineers in our hypothetical scenario in Figure 8) and the possibility of carrying out the study in a realistic setting. Such studies also raise significant design challenges, due to the need to control for intervening and confounding variables, comparing multiple versions of a system (e.g. in an ablative design; see Section 7.3 below), or comparing a system against a gold standard or baseline. For example, \shortciteACarenini2006 note that evaluating the effectiveness of arguments presented in text needs to take into account aspects of a user’s personality which may impact how receptive they are to arguments in the first place.
An example of the trade-off between design and control issues and ecological validity is provided by the BabyTalk family of systems. A pilot system called bt-45 \shortcitePortet2009, which generated patient summaries from 45-minute spans of historical patient data, was evaluated in a task involving nurses and doctors, who chose from among a set of clinical actions to take based on the information given. These were then compared to ‘ground truth’ decisions by senior neonatal experts. This evaluation was carried out off-ward; hence, subjects took clinical decisions in an artificial environment without direct access to the patient. On the other hand, in the evaluation of bt-nurse, a successor to bt-45 which summarised patient data collected over a twelve-hour shift \shortciteHunter2012, the system was evaluated on-ward using live patient data, but ethical considerations precluded a task-based evaluation. For the same reasons, comparison to ‘gold standard’ human texts was also impossible. Hence, the evaluation elicited judgements, both on intrinsic criteria such as understandability and accuracy and on extrinsic criteria such as perceived clinical utility \shortcite¡see¿[for a similarly indirect extrinsic measure of impact, this time in an ecological setting]Siddharthan2012.
7.3 Black box vs glass box evaluation
With the exception of evaluations of specific modules or algorithms, as in the case of reg or surface realisers, most of the evaluation studies discussed so far would be classified as ‘black box’ evaluations of ‘end-to-end’, or complete, nlg systems. In a ‘glass box’ evaluation, on the other hand, it is the contribution of individual components that is under scrutiny, ideally in a setup where versions of a system with and without a component are evaluated in the same manner. Note that the distinction between black box and glass box evaluation is orthogonal to the question of which methods are used.
An excellent example of a glass-box evaluation is \shortciteACallaway2002, who used an ablative design, eliciting judgements of the quality of the output of their narrative generation system based on different configurations that omitted or included key components. In a related vein, \shortciteAElliott2013 compared image-to-text models that included fine-grained dependency representations of spatial as well as linguistic dependencies, to models with a coarser-grained image representation, finding an advantage for the former.
However, exhaustive component-wise comparisons are sometimes difficult to make and may result in a combinatorial explosion of configurations, with a concomitant reduction in data points collected per configuration (assuming subjects are limited and need to be divided among different conditions) and a reduction in statistical power. Alternatives do exist in the literature. \shortciteAReiter2005 elicited judgements on weather forecasts using human and machine-generated texts, together with a ‘hybrid’ version where the content was selected by forecasters, but the language was automatically generated. This enabled a comparison of human and automatic content selection. \shortciteAAngeli2010 used corpus-based and subjective measures to assess linguistic quality, coupled with precision and recall-based measures to assess content determination of their statistical system against human-annotated texts. In bt-nurse \shortciteHunter2012, nurses were prompted for free text comments (in addition to answering a questionnaire targeting extrinsic dimensions), which were then manually annotated and analysed to determine which elements of the system were potentially problematic.
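The combinatorial explosion mentioned above is easy to quantify. The following sketch enumerates every on/off configuration of a set of components and computes how many subjects would be left per condition in a between-subjects design. The component names and the subject pool size are hypothetical and are not taken from any of the systems cited.

```python
from itertools import product

def ablation_conditions(components):
    """Enumerate every configuration in which each component is either
    included (True) or omitted (False)."""
    return [dict(zip(components, flags))
            for flags in product([True, False], repeat=len(components))]

# Hypothetical components of a generation pipeline.
components = ["content_selection", "aggregation", "referring_expressions", "revision"]
conditions = ablation_conditions(components)

# Hypothetical pool of subjects, split evenly across conditions.
subjects = 48
per_condition = subjects // len(conditions)
print(len(conditions), per_condition)
```

With four binary components there are already sixteen conditions, and each additional component doubles that number, which is why full factorial ablation is rarely feasible with human subjects.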
7.4 On the relationship between evaluation methods
To what extent are the plethora of methods surveyed – from extrinsic, task-oriented to intrinsic ones relying on automatic metrics or human judgements – actually related? It turns out that multiple evaluation methods seldom give converging verdicts on a system, or on the relative ranking of a set of systems under comparison.
7.4.1 Metrics versus human judgements
Although corpus-based metrics used in mt and summarisation are typically validated by demonstrating their correlation with human ratings, meta-evaluation studies in these fields have suggested that the correspondence is somewhat weak \shortcite¡e.g.,¿Dorr2004,Callison-Burch2006,Caporaso2008. Similarly, shared task evaluations on referring expression generation showed that corpus-based, judgement-based and experimental or task-based methods frequently do not correlate \shortciteGatt2010. In their recent review, \shortciteABernardi2016 note a similar issue in image captioning system evaluation. Thus, \shortciteAKulkarni2013 found that their image description system did not outperform two earlier methods \shortciteFarhadi2010,Yang2011 on bleu scores; however, human judgements indicated the opposite trend, with readers preferring their system \shortcite¡similar observations are made by¿Kiros2014. \shortciteAHodosh2013 compared the agreement (measured by Cohen’s $\kappa$) between human judgements and bleu or rouge scores for retrieved captions, finding that outputs were not ranked similarly by humans and metrics, unless the retrieved captions were identical to the reference captions.
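For readers unfamiliar with chance-corrected agreement, the sketch below computes Cohen's kappa between two sequences of categorical labels. Treating thresholded metric scores as one 'annotator' and human judgements as the other is an assumption made here for illustration only; it is not a description of the procedure used in the study cited above, and the labels are invented.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Proportion of items on which the two sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each source's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources always use the same single label
    return (observed - expected) / (1 - expected)

# Hypothetical judgements for four captions.
human = ["good", "good", "bad", "bad"]
metric = ["good", "bad", "bad", "bad"]  # e.g. a metric score after thresholding
kappa = cohen_kappa(human, metric)
print(kappa)
```

Although the two sources agree on three of the four items, half of that agreement would be expected by chance given the label distributions, so kappa is considerably lower than the raw agreement rate.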
On occasion, the correlation between a metric and human judgements appears to differ across studies, suggesting that metric-based results are highly susceptible to variation due to generation algorithms and datasets. For instance, \shortciteAKonstas2013 (discussed in Section 3.3.4 above) find that on corpus-based metrics, the best-performing version of their model does not outperform that of \shortciteAKim2010 on the robocup domain, or that of \shortciteAAngeli2010 on their weather corpus (weathergov), though it performs better than \shortciteAAngeli2010 on the noisier atis travel dataset. However, an evaluation of fluency and semantic correctness, based on human judgements, showed that the system outperformed, by a small margin, both \shortciteAKim2010 and \shortciteAAngeli2010 on both measures in all domains with the exception of weathergov, where \citeauthorAngeli2010’s system did marginally better.
In a related vein, \shortciteAElliott2015 compare their image captioning system, based on visual dependency relations, to the Bidirectional rnn developed by \shortciteAKarpathy2015, on two different datasets. The two systems were close to each other on the vlt2k dataset, but not on Pascal1k, a result that the authors claim is due to vlt2k containing more pictures involving actions. As for the relationship between metrics and human judgements, \shortciteAElliott2013 concluded that meteor correlates better than bleu \shortcite¡see¿[for a systematic comparison of automatic metrics in this domain]Elliott2014, a finding also confirmed in their later work \shortciteElliott2015, as well as in the ms-coco Evaluation Challenge, which found that meteor was more robust. However, work by \shortciteAKuznetsova2014a showed variable results; their highest-scoring method as judged by humans, involving tree composition, was ranked higher by bleu than by meteor. In the ms-coco Evaluation Challenge, some systems outperformed a human-human upper bound when compared to reference texts using automatic metrics, but no system reached this level in an evaluation based on human judgements \shortcite¡see¿[for further discussion]Bernardi2016.
Some studies have explicitly addressed the relationship between methods as a research question in its own right. An important contribution in this direction is the study by \shortciteReiter2009a, which addressed the validity of corpus-based metrics in relation to human judgements, within the domain of weather forecast generation \shortcite¡a similar study has recently been conducted on image captioning by¿Elliott2014. In a first experiment, focussing on linguistic quality, the authors found a high correlation between expert and non-expert readers’ judgements, but the correlation between human judgements and the automatic metrics varied considerably (from to ), depending on the version of the metric used and whether the reference texts were included in the comparison by human judges. The second experiment evaluated both linguistic quality, by asking human judges to rate clarity/readability; and content determination, by eliciting judgements of accuracy/appropriateness (by comparing texts to the raw data). The automatic metrics correlated significantly with judgements of clarity, but far less with accuracy, suggesting that they were better at predicting the linguistic quality than correctness.
Other studies have yielded similarly inconsistent results. In a study on paraphrase generation, \shortciteAStent2005a found that automatic metrics correlated highly with judgements of adequacy (roughly akin to accuracy), but not fluency. By contrast, \shortciteAEspinosa2010 found that automatic metrics such as nist, meteor and gtm correlate moderately well with human fluency and adequacy judgements of English surface realisation quality, while \shortciteACahill2009 reported only a weak correlation for German surface realisation. \shortciteAWubben2012, comparing text simplification strategies, found low, but significant correlations between bleu and fluency judgements, and a very low, negative correlation between bleu and adequacy. These contrasting findings suggest that the relationship between metrics and human judgements may depend on the purpose and genre of the text under consideration; for example, \shortciteAReiter2009a used weather reports, while \shortciteAWubben2012 used Wikipedia articles.
Various factors can be adduced to explain the inconsistency of these meta-evaluation studies:
Metrics such as bleu are sensitive to the length of the texts under comparison. With shorter texts, n-gram based metrics are likely to result in lower scores.
The type of overlap matters: for example, many evaluations in image captioning rely on bleu-1 \shortcite¡the work of¿[was among the first to experiment with longer n-grams]Elliott2013,Elliott2014, but longer n-grams are harder to match, though they capture more syntactic information and are arguably better indicators of fluency.
Semantic variability is an important issue. Generated texts may be similar to reference texts, but differ on some near-synonyms, or subtle word order variations. As shown in Table 1, some metrics are designed to partially address these issues.
Many intrinsic corpus-based metrics are designed to compare against multiple reference texts, but this is not always possible in nlg. For example, while image captioning datasets typically contain multiple captions per image (usually around 5), this is not the case in other domains, like weather reporting or restaurant recommendations.
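Two of the factors listed above, the order of the n-grams and the number of reference texts, can be made concrete with a small sketch of clipped n-gram precision, the quantity on which bleu is built. This is not a full bleu implementation (it omits the brevity penalty and the combination of several n-gram orders), and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, where each
    n-gram is credited at most as often as it occurs in any one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())

# Hypothetical caption and reference captions.
candidate = "a man rides a horse".split()
references = ["a man is riding a horse".split(),
              "a person rides a brown horse".split()]

p1 = clipped_precision(candidate, references, 1)
p2 = clipped_precision(candidate, references, 2)
p3 = clipped_precision(candidate, references, 3)
p1_single = clipped_precision(candidate, references[:1], 1)
print(p1, p2, p3, p1_single)
```

In this example every unigram of the candidate is matched, three of its four bigrams are matched, and none of its trigrams are, even though the caption is a perfectly acceptable description. Dropping the second reference also lowers the unigram precision, since the verb form used by the candidate appears only there.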
The upshot is that nlg evaluations increasingly rely on multiple methods, a trend that is equally visible in other areas of nlp, such as mt \shortciteCallison-Burch2007,Callison-Burch2008.
7.4.2 Using controlled experiments
A few studies have validated evaluation measures against experimental data. For example, \shortciteASiddharthan2012a compared the outcomes of their magnitude estimation judgement study (see Section 7.1 above) to the results from a sentence recall task, finding that the results from the latter are largely consistent with judgements and concluding that they can substitute for task-based evaluations to shed light on breakdowns in comprehension at sentence level. A handful of studies have also used behavioural experiments and compared ‘online’ processing measures, such as reading time of referring expressions, to corpus-based metrics \shortcite¡e.g.¿Belz2010. Correlations with automatic metrics are usually poor. A somewhat different use of reading times was made by \shortciteALapata2006, who used them as an objective measure against which to validate Kendall’s $\tau$ as a metric for assessing information ordering in text (an aspect of text structuring). In a recent study, \shortciteAZarriess2015 compared generated texts to human-authored and ‘filler’ texts (which were manually manipulated to compromise their coherence). They found that reading-time measures were more useful to distinguish these classes of texts than offline measures based on elicited judgements of fluency and clarity.
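As an illustration of how a rank correlation coefficient can serve as an information ordering metric, the sketch below computes Kendall's tau between a generated ordering of sentences and a reference ordering, using the standard formulation in terms of pairwise inversions. It assumes both orderings contain the same items with no repetitions; the sentence identifiers are hypothetical.

```python
from itertools import combinations

def kendall_tau(predicted_order, reference_order):
    """Kendall's tau between two orderings of the same items:
    1 - 2 * inversions / (number of item pairs)."""
    position = {item: i for i, item in enumerate(reference_order)}
    ranks = [position[item] for item in predicted_order]
    n = len(ranks)
    if n < 2:
        raise ValueError("at least two items are needed")
    # An inversion is a pair of items whose relative order differs
    # between the predicted and the reference ordering.
    inversions = sum(1 for i, j in combinations(range(n), 2) if ranks[i] > ranks[j])
    return 1 - 2 * inversions / (n * (n - 1) / 2)

# Hypothetical sentence identifiers.
reference = ["s1", "s2", "s3", "s4"]
tau_swapped = kendall_tau(["s2", "s1", "s3", "s4"], reference)
tau_reversed = kendall_tau(list(reversed(reference)), reference)
print(tau_swapped, tau_reversed)
```

A single swap of adjacent sentences yields a value somewhat below 1, while a fully reversed ordering yields -1. Whether such differences correspond to differences in reading behaviour is precisely the kind of question a validation study has to answer.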
7.5 Evaluation: Concluding remarks
Against the background of this section, three main conclusions can be drawn:
There is a widespread acceptance of the necessity of using multiple evaluation methods in nlg. While these are not always consistent among themselves, they are useful in shedding light on different aspects of quality, from fluency and clarity of output, to adequacy of semantic content and effectiveness in achieving communicative intentions. The choice of method has a direct impact on the way in which results can be interpreted.
Meta-evaluation studies have yielded conflicting results on the relationship between human judgements, behavioural measures and automatically computed metrics. The correlation among them varies depending on task and application domain. This is a subject of ongoing research, with plenty of studies focussing on the reliability of metrics and their relationship to other measures, especially human judgements.
A question that remains under-explored concerns the dimensions of quality that are themselves the object of inquiry. (In this connection, it is worth noting that some kindred disciplines have sought to de-emphasise their role on the grounds that they are inconsistent; see \shortciteACallison-Burch2008, for example). For example, what are people judging when they judge fluency or adequacy and how consistently do they do so? It is far from obvious whether these judgements should really be expected to correlate with other measures, given that the latter are producer-oriented, focussing on output, while judgements are themselves often receiver-oriented, focussing on how the output is read or processed \shortcite¡for a related argument, see¿Oberlander1998. Furthermore, while meta-linguistic judgements can be expected to reflect the impact of a text on its readers, there is nevertheless the possibility that behavioural, online methods designed to directly investigate aspects of processing would yield a different picture, a result that has been obtained in some psycholinguistic studies \shortcite¡e.g.¿Engelhardt2006.
In conclusion, our principal recommendation to nlg practitioners, where evaluation is concerned, is to err in favour of diversity, by using multiple methods, as far as possible, and reporting not only their results, but also the correlation between them. Weak correlations need not imply that the results of a particular method are invalid. Rather, they may indicate that measures focus on different aspects of a system or its output.
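Reporting the correlation between methods requires little extra effort once per-system scores are available. The sketch below computes both Pearson's r and Spearman's rank correlation between an automatic metric and mean human ratings for a handful of systems. All scores are invented for illustration, and the functions are a plain-Python sketch rather than a substitute for a statistics package (in particular, no significance test is performed).

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical scores for five systems.
metric_scores = [0.31, 0.28, 0.35, 0.22, 0.30]
human_ratings = [3.9, 4.1, 3.6, 2.8, 4.0]

r = pearson(metric_scores, human_ratings)
rho = spearman(metric_scores, human_ratings)
print(r, rho)
```

With these invented numbers, Pearson's r is clearly positive because one system is weak on both measures, whereas the rank correlation is zero: the metric and the judges disagree on the ordering of the remaining systems. The choice of coefficient can therefore itself affect the conclusion drawn about how well two methods agree.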
8 Discussion and future directions
Over the past two decades, the field of nlg has advanced considerably, and many of these recent advances have not been covered in a comprehensive survey yet. This paper has sought to address this gap, with the following goals:
to give an update of the core tasks and architectures in the field, with an emphasis on recent data-driven techniques;
to briefly highlight recent developments in relatively new areas, including vision-to-text generation and the generation of stylistically varied, engaging or creative texts; and
to extensively discuss the problems and prospects of evaluating nlg applications.
Throughout this survey, various general, related themes have emerged. Probably the central theme has been the gradual shift away from traditional, rule-based approaches to statistical, data-driven ones, which, of course, has been taking place in ai in general. In nlg, this has had substantial impact on how individual tasks are approached (e.g., moving away from domain-dependent to more general, domain-independent approaches, relying on available data instead) as well as on how tasks are combined in different architectures (e.g., moving away from modular towards more integrated approaches). The trade-off between output quality of the generated text and the efficiency and robustness of an approach is becoming a central issue: data-driven approaches are arguably more efficient than rule-based approaches, but the output quality may be compromised, for reasons we have discussed. Another important theme has been the increased interplay between core nlg research and other disciplines, such as computer vision (in the case of vision-to-text) and computational creativity research (in the case of creative language use).
At the conclusion of this comprehensive survey of the state of the art in nlg, and given the fast pace at which developments occur both in industry and academia, we feel it is useful to point to some potential future directions, as well as to raise a number of questions which recent research has brought to the fore.
8.1 Why (and how) should NLG be used?
Towards the beginning of their influential survey on nlg, \shortciteAReiter2000 recommended to the developer that she pose this question before embarking on the design and implementation of a system. Can nlg really help in the target domain? Does a cheaper, more standard solution exist and would it work just as well? From the perspective of an engineer or a company, these are obviously relevant questions. As recent industry-based applications of nlg show, this technology is typically valuable whenever information that needs to be presented to users is relatively voluminous, and comes in a form which is not easily consumed and does not afford a straightforward mapping to a more user-friendly modality without considerable transformation. This is arguably where nlg comes into its own, offering a battery of techniques to select, structure and present the information.
However, the question whether nlg is worth using in a specific setting should also be accompanied by the question of how it should be used. Our survey has focussed on techniques for the generation of text, but text is not always presented in isolation. Other important dimensions include document structure and layout, an under-studied problem \shortcite¡but see¿Power2003. They also include the role of graphics in text, an area where there is the potential for further interaction between the nlg and visualisation communities, addressing such questions as which information should be rendered textually and which can be made more accessible in a graphical modality \shortcite¡e.g.,¿demir2012. These questions are of great relevance in some domains, especially those where accurate information delivery is a precursor to decision-making in fault-critical situations \shortcite¡for some examples, see¿Elting1999,Law2005,Meulen2007.
8.2 NLG isn’t about text-to-text…or is it?
In our introductory section, we distinguished text-to-text generation from data-to-text generation; this survey has focussed primarily on the latter. The two areas have distinguishing characteristics, not least the fact that nlg inputs tend to vary widely, as do the goals of nlg systems as a function of the domain under consideration. In contrast, the input in text-to-text generation, especially Automatic Summarisation, is comparatively homogeneous, and while its goals can vary widely, the field has also been successful at defining tasks and datasets (for instance, through the duc shared tasks), which have set the standard for subsequent research.
Yet, a closer look at the two types of generation will show more scope for convergence than the above characterisation suggests. To begin with, if nlg is concerned with going from data to text, then surely textual input should be considered as one out of a broad variety of forms in which input data might be presented. Some recent work, such as \shortciteAKondadadi2013 (discussed in Section 3.3) and \shortciteAMcintyre2009 (discussed in Section 6) has explicitly focussed on leveraging such data to generate coherent text. Other approaches to nlg, including some systems that conform to a standard, modular, data-to-text architecture \shortcite¡e.g.,¿Hunter2012, have had to deal with text as one out of a variety of input types, albeit using very simple techniques. Generation from heterogeneous inputs which include text as one type of data is a promising research direction, especially in view of the large quantities of textual data available, often accompanied by numbers or images.
8.3 Theories and models in search of applications?
In their overview of the status of evaluation in nlg in the late 1990s, \shortciteAMellish1998a discussed, among the possible ways of evaluating a system, its theoretical underpinnings and in particular whether the theoretical model underlying an nlg system or one of its components is adequate to the task and can generalise to new domains. Rather than evaluating an nlg system as such, this question targets the theory itself, and suggests that we view nlg as a potential testbed for such theories or models. But what are the theories that underlie nlg?
The prominence of theoretical models in nlg tends to depend on the task under consideration. For instance, many approaches to realisation discussed in Section 2.6 are based on a specific theory of syntactic structure; research on reg has often been based on insights from pragmatic theory, especially the Gricean maxims \shortciteGrice1975; and much research on text structuring has been inspired by Rhetorical Structure Theory \shortciteMann1988. Relatively novel takes on various sentence planning tasks – especially those concerned with style, affect and personality – tend to have a theoretical inspiration, in the form of a model of personality \shortciteJohn1999 or a theory of politeness \shortciteBrownLevinson1987, for example.
More often than not, such theories are leveraged in the process of formalising a particular problem to achieve a tractable solution. Treating their implementation in an nlg system as an explicit test of the theory, as \shortciteAMellish1998a seem to suggest, happens far less often. This is perhaps a reflection of a division between ‘engineering-oriented’ and ‘theoretically-oriented’ perspectives in the field: the former perspective emphasises workable solutions, robustness and output quality; the latter emphasises theoretical soundness, cognitive plausibility and so forth. However, the theory/engineering dichotomy is arguably a false one. While the goal of nlg research is often different from, say, that of cognitive modelling (for example, few nlg systems seek to model production errors explicitly), it is also true that theory-driven implementations are themselves worthy contributions to theoretical work.
Recently, some authors have argued that nlg practitioners should pay closer attention to theoretical and cognitive models. The reasons marshalled in favour of this argument are twofold. First, psycholinguistic results and theoretical models can actually help to improve implemented systems, as \shortciteARajkumar2014 show for the case of realisation. Second, as argued for example by \shortciteAVanDeemter2012a, theoretical models can benefit from the formal precision that is the bread-and-butter of computational linguistic research; a concrete case in point in nlp is provided by \shortciteAPoesio2004, whose implementation of Centering Theory \shortciteGrosz1995 shed light on a number of underspecified parameters in the original model and subsequent modifications of it. Our argument here is that nlg has provided a wealth of theoretical insights which should not be lost to the broader research community; similarly, nlg researchers would undoubtedly benefit from an awareness of recent developments in theoretical and experimental work.
8.4 Where do we go from here?
Finally, we conclude with some speculations on directions for future research for which the time seems ripe.
Within the field of Natural Language Processing as a whole, a remarkable recent development is the explosion of interest in social media, including online blogs, micro-blogs such as Twitter feeds, and social platforms such as Facebook. In one respect, interest in social media could be seen as a natural extension of long-standing topics in nlp, including the desire to deal with language ‘in the wild’. However, social media data has given more impetus to the exploration of non-canonical language \shortcite¡e.g.¿Eisenstein2013; the impact of social and demographic factors on language use \shortcite¡e.g.¿Hovy2015,Johannsen2015; the prevalence of paralinguistic features such as affect, irony and humour \shortcitePang2008,Lukin2013; and other variables such as personality \shortcite¡e.g.¿Oberlander2006,Farnadi2013,Schwartz2013a. Social media feeds are also important data streams for the identification of topical and trending events \shortcite¡see¿[for a recent review]Atefeh2015. There is as yet little work on generating textual or multimedia summaries of such data \shortcite¡but see, for example,¿Wang2014 or generating text in social media contexts \shortcite¡exceptions include¿Ritter2011,Cagan2014. Since much of social media text is subjective and opinionated, an increased interest in social media on the part of nlg researchers may also give new impetus to research on the impact of style, personality and affect on textual variation (discussed in Section 5), and on non-literal language (including some of the phenomena discussed in Section 6).
A second potential growth area for nlg is situated language generation. The term situated is usually taken to refer to language use in physical or virtual environments where production choices explicitly take into account perceptual and physical properties. Research on situated language processing has advanced significantly in the past several years, with frameworks for language production and understanding in virtual contexts \shortcite¡e.g.,¿Kelleher2005, as well as a number of contributions within nlg, especially for the generation of language in interactive environments \shortciteKelleher2006,Stoia2006,Garoufi2013,Dethlefs2015. The popular give Challenge added further impetus to this research \shortciteStriegnitz2011. Clearly, this work is also linked to the enterprise of grounding generated language in the perceptual world, of which the research discussed in Section 4 constitutes one of the current trends. However, there are many fields where situatedness is key, in which nlg can still make novel contributions. One of these is gaming. With the exception of a few endeavours to enhance the variety of linguistic expressions used in virtual environments \shortcite¡e.g.,¿Orkin2007, nlg technology is relatively unrepresented in research on games, despite significant progress on dynamic content generation in game environments \shortcite¡e.g.,¿Togelius2011. This may be due to the perception that linguistic interaction in games is predictable and can rely on ‘canned’ text. However, with the growing influence of gamification as a strategy for enhancing a variety of activities beyond entertainment, such as pedagogy, as well as the development of sophisticated planning techniques for varying the way in which game worlds unfold on the fly, the assumption of predictability where language use is concerned may well be up for revision.
Third, there is a growing interest in applying nlg techniques to generation from structured knowledge bases and ontologies \shortcite¡e.g.¿[some of which were briefly discussed in Section 3.3.4]Ell2012,Duma2013,Gyawali2014,Mrabet2016,Sleimi2016. Knowledge bases such as dbpedia, or folksonomies such as Freebase, not only constitute input sources in their own right, but also open up the possibility of exploring alignments between structured inputs and text in a broader variety of domains than has hitherto been the case.
Finally, while there has been a significant shift in the past few years towards data-driven techniques in nlg, many of these have not been tested in commercial or real-world applications, despite the growth in commercialisation of text generation services noted in the introductory section. Typically, the arguments for rule-based systems in commercial scenarios, or in cases where input is high-volume and heterogeneous, are that (1) their output is easier to control for target systems; (2) data is in any case unavailable in a given domain, rendering the use of statistical techniques moot; or (3) data-driven systems have not been shown to be able to scale up beyond experimental scenarios \shortcite¡some of these arguments are made, for instance, by¿Harris2008. A response to the first point depends on the availability of techniques which enable the developer to ‘look under the hood’ and understand the statistical relationships learned by a model. Such techniques are, for example, being developed to investigate or visualise the representations learned by deep neural networks. The second point calls for more investment in research on data acquisition and data-text alignment. Techniques for generation which rely on less precise alignments between data and text are also a promising future direction. Finally, scalability remains an open challenge. Many of the systems we have discussed have been developed within research environments, where the aim is of course to push the frontiers of nlg and demonstrate feasibility or correctness of novel approaches. While in some cases, research on data-to-text has addressed large-scale problems – notably in some of the systems that summarise numerical data – a greater concern with scalability would also focus researchers’ attention on issues such as the time and resources required to collect data and train a system and the efficiency of the algorithms being deployed.
Clearly, developments in hardware will alleviate these problems, as has happened with some statistical methods that have recently become more feasible.
9 Conclusion
Recent years have seen a marked increase in interest in automatic text generation. Companies now offer nlg technology for a range of applications in domains such as journalism, weather, and finance. The huge increase in available data and computing power, as well as rapid developments in machine learning, have created many new possibilities and motivated nlg researchers to explore a number of new applications, related to, for instance, image-to-text generation. Applications related to social media seem to be just around the corner, as witnessed, for instance, by the emergence of nlg-related techniques for automatic content creation as well as nlg for Twitter and chatbots \shortcite¡e.g.,¿Dale2016. With developments occurring at a steady pace, and the technology also finding its way into industrial applications, the future of the field seems bright. In our view, research in nlg should be further strengthened by more collaboration with kindred disciplines. It is our hope that this survey will serve to highlight some of the potential avenues for such multi-disciplinary work.
We thank the four reviewers for their detailed and constructive comments. In addition, we have greatly benefitted from discussions with and comments from Grzegorz Chrupala, Robert Dale, Raquel Hervás, Thiago Castro Ferreira, Ehud Reiter, Marc Tanti, Mariët Theune, Kees van Deemter, Michael White and Sander Wubben. EK received support from RAAK-PRO SIA (2014-01-51PRO) and The Netherlands Organization for Scientific Research (NWO 360-89-050), which is gratefully acknowledged.
- Althaus, E., Karamanis, N., & Koller, A. (2004). Computing locally coherent discourses. In Proc. ACL’04, pp. 399–406.
- Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic Propositional Image Caption Evaluation. In Proc. ECCV’16, pp. 1–17.
- Androutsopoulos, I., Lampouras, G., & Galanis, D. (2013). Generating natural language descriptions from OWL ontologies: The natural OWL system. Journal of Artificial Intelligence Research, 48, 671–715.
- Androutsopoulos, I., & Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38, 135–187.
- Angeli, G., Liang, P., & Klein, D. (2010). A Simple Domain-Independent Probabilistic Approach to Generation. In Proc. EMNLP’10, pp. 502–512.
- Angeli, G., Manning, C. D., & Jurafsky, D. (2012). Parsing time: Learning to interpret time expressions. In Proc. NAACL-HLT’12, pp. 446–455.
- Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual Question Answering. In Proc. ICCV’15, pp. 2425–2433.
- Antol, S., Zitnick, C. L., & Parikh, D. (2014). Zero-shot learning via visual abstraction. In Proc. ECCV’14, pp. 401–416.
- Appelt, D. (1985). Planning English Sentences. Cambridge University Press, Cambridge, UK.
- Argamon, S., Koppel, M., Pennebaker, J. W., & Schler, J. (2007). Mining the Blogosphere: Age, gender and the varieties of self-expression. First Monday, 12(9).
- Asghar, N., Poupart, P., Hoey, J., Jiang, X., & Mou, L. (2017). Affective Neural Response Generation. arXiv preprint, 1709.03968.
- Atefeh, F., & Khreich, W. (2015). A survey of techniques for event detection in twitter. Computational Intelligence, 31(1), 132–164.
- Austin, J. L. (1962). How to do things with words. Clarendon Press, Oxford.
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation By Jointly Learning To Align and Translate. In Proc. ICLR’15, pp. 1–15.
- Bal, M. (2009). Narratology (Third ed.). University of Toronto Press, Toronto.
- Ballesteros, M., Bohnet, B., Mille, S., & Wanner, L. (2015). Data-driven sentence generation with non-isomorphic trees. In Proc. NAACL-HLT’15, pp. 387–397.
- Banaee, H., Ahmed, M. U., & Loutfi, A. (2013). Towards NLG for Physiological Data Monitoring with Body Area Networks. In Proc. ENLG’13, pp. 193–197.
- Bangalore, S., & Rambow, O. (2000). Corpus-based lexical choice in Natural Language Generation. In Proc. ACL’00, pp. 464–471.
- Bangalore, S., & Stent, A. (2014). Natural Language Generation in Interactive Systems. Cambridge University Press.
- Banik, E., Gardent, C., & Kow, E. (2013). The KBGen Challenge. In Proc. ENLG’13, pp. 94–97.
- Bannard, C., & Callison-Burch, C. (2005). Paraphrasing with bilingual parallel corpora. In Proc. ACL’05, pp. 597–604.
- Bard, E. G., Robertson, D., & Sorace, A. (1996). Magnitude Estimation of Linguistic Acceptability. Language, 72(1), 32–68.
- Barnard, K. (2016). Computational Methods for Integrating Vision and Language. Morgan and Claypool Publishers.
- Bartoli, A., De Lorenzo, A., Medvet, E., & Tarlao, F. (2016). Your paper has been accepted, rejected, or whatever: Automatic generation of scientific paper reviews. In International Conference on Availability, Reliability, and Security, pp. 19–28.
- Barzilay, R., Elhadad, N., & McKeown, K. R. (2002). Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research, 17, 35–55.
- Barzilay, R., & Lapata, M. (2005). Collective content selection for concept-to-text generation. In Proc. HLT/EMNLP’05, pp. 331–338.
- Barzilay, R., & Lapata, M. (2006). Aggregation via Set Partitioning for Natural Language Generation. In Proc. HLT-NAACL’06, pp. 359–366.
- Barzilay, R., & Lee, L. (2004). Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. In Proc. HLT-NAACL’04, pp. 113–120.
- Bateman, J. A. (1997). Enabling technology for multilingual natural language generation: the KPML development environment. Natural Language Engineering, 3(1), 15–55.
- Bateman, J. A., & Zock, M. (2005). Natural Language Generation. In Mitkov, R. (Ed.), The Oxford Handbook of Computational Linguistics. Oxford University Press, Oxford, UK.
- Belz, A. (2003). And Now with Feeling: Developments in Emotional Language Generation (Technical Report No. ITRI-03-21). Tech. Rep., University of Brighton, Brighton, UK.
- Belz, A. (2008). Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Natural Language Engineering, 14(04).
- Belz, A., & Kow, E. (2010). Comparing rating scales and preference judgements in language evaluation. In Proc. INLG’10, pp. 7–15.
- Belz, A., & Kow, E. (2011). Discrete vs. Continuous Rating Scales for Language Evaluation in NLP. In Proc. ACL’11, pp. 230–235.
- Belz, A., Kow, E., Viethen, J., & Gatt, A. (2010). Generating referring expressions in context: The GREC task evaluation challenges. In Krahmer, E., & Theune, M. (Eds.), Empirical Methods in Natural Language Generation. Springer, Berlin and Heidelberg.
- Belz, A., White, M., Espinosa, D., Kow, E., Hogan, D., & Stent, A. (2011). The First Surface Realisation Shared Task: Overview and Evaluation Results. In Proc. ENLG’11, pp. 217–226.
- Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137–1155.
- Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., Keller, F., Muscat, A., & Plank, B. (2016). Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures. Journal of Artificial Intelligence Research, 55, 409–442.
- Biber, D. (1988). Variation Across Speech and Writing. Cambridge University Press, Cambridge.
- Binsted, K., Bergen, B., & McKay, J. (2003). Pun and non-pun humour in second-language learning. In Proc. CHI’03 Workshop on Humor Modeling in the Interface.
- Binsted, K., Pain, H., & Ritchie, G. D. (1997). Children’s evaluation of computer-generated punning riddles. Pragmatics & Cognition, 5(2), 305–354.
- Binsted, K., & Ritchie, G. D. (1994). An implemented model of punning riddles. In Proc. AAAI’94.
- Binsted, K., & Ritchie, G. D. (1997). Computational rules for generating punning riddles. Humor: International Journal of Humor Research, 10(1), 25–76.
- Bohnet, B. (2008). The fingerprint of human referring expressions and their surface realization with graph transducers. In Proc. INLG’08, pp. 207–210.
- Bohnet, B., Wanner, L., Mille, S., & Burga, A. (2010). Broad Coverage Multilingual Deep Sentence Generation with a Stochastic Multi-Level Realizer. In Proc. COLING’10, pp. 98–106.
- Bollegala, D., Okazaki, N., & Ishizuka, M. (2010). A bottom-up approach to sentence ordering for multi-document summarization. Information Processing & Management, 46(1), 89–109.
- Bollmann, M. (2011). Adapting SimpleNLG for German. In Proc. ENLG’11, pp. 133–138.
- Bouayad-Agha, N., Casamayor, G., Wanner, L., & Mellish, C. (2013). Overview of the First Content Selection Challenge from Open Semantic Web Data. In Proc. ENLG’13, pp. 98–102.
- Boyer, K. E., Phillips, R., Ingram, A., Ha, E. Y., Wallis, M., Vouk, M., & Lester, J. C. (2011). Investigating the relationship between dialogue structure and tutoring effectiveness: A hidden markov modeling approach. International Journal of Artificial Intelligence in Education, 21(1-2), 65–81.
- Brants, T., & Franz, A. (2006). Web 1T 5-gram Version 1. Tech. Rep., Linguistic Data Consortium.
- Bratman, M. E. (1987). Intentions, Plans and Practical Reason. CSLI, Stanford, CA.
- Bringsjord, S., & Ferrucci, D. A. (1999). Artificial Intelligence and Literary Creativity: Inside the Mind of BRUTUS, a Storytelling Machine. Lawrence Erlbaum Associates, Hillsdale, NJ.
- Brown, J. C., Frishkoff, G. A., & Eskenazi, M. (2005). Automatic question generation for vocabulary assessment. In Proc. EMNLP’05, pp. 819–826.
- Brown, P., & Levinson, S. C. (1987). Politeness: Some Universals in Language Usage. Cambridge University Press, Cambridge, UK.
- Bruner, J. (2011). The Narrative Construction of Reality. Critical Inquiry, 18(1), 1–21.
- Busemann, S., & Horacek, H. (1997). Generating Air Quality Reports From Environmental Data. In Busemann, S., Becker, T., & Finkler, W. (Eds.), DFKI Workshop on Natural Language Generation (DFKI Document D-97-06), pp. 1–7. DFKI, Saarbrücken.
- Cagan, T., Frank, S. L., & Tsarfaty, R. (2014). Generating Subjective Responses to Opinionated Articles in Social Media: An Agenda-Driven Architecture and a Turing-Like Test. In Proc. Joint Workshop on Social Dynamics and Personal Attributes in Social Media, pp. 58–67.
- Cahill, A. (2009). Correlating Human and Automatic Evaluation of a German Surface Realiser. In Proc. ACL-IJCNLP’09, pp. 97–100.
- Cahill, A., Forst, M., & Rohrer, C. (2007). Stochastic realisation ranking for a free word order language. In Proc. ENLG’07, pp. 17–24.
- Cahill, A., & Josef, V. (2006). Robust PCFG-Based Generation using Automatically Acquired LFG Approximations. In Proc. COLING-ACL’06, pp. 1033–1040.
- Callaway, C. B. (2005). The Types and Distributions of Errors in a Wide Coverage Surface Realizer Evaluation. In Proc. ENLG’05, pp. 162–167.
- Callaway, C. B., & Lester, J. C. (2002). Narrative prose generation. Artificial Intelligence, 139(2), 213–252.
- Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., & Schroeder, J. (2007). (Meta-) evaluation of machine translation. In Proc. StatMT’07, pp. 136–158.
- Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., & Schroeder, J. (2008). Further Meta-Evaluation of Machine Translation. In Proc. StatMT’08, pp. 70–106.
- Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the Role of BLEU in Machine Translation Research. In Proc. EACL’06, pp. 249–256.
- Caporaso, J. G., Deshpande, N., Fink, J. L., Bourne, P. E., Bretonnel Cohen, K., & Hunter, L. (2008). Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. Pacific Symposium on Biocomputing, 13, 640–651.
- Carenini, G., & Moore, J. D. (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11), 925–952.
- Carroll, J., & Oepen, S. (2005). High efficiency realization for a wide-coverage unification grammar. In Dale, R. (Ed.), Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP’05), pp. 165–176. Springer.
- Castro Ferreira, T., Calixto, I., Wubben, S., & Krahmer, E. (2017). Linguistic realisation as machine translation: Comparing different MT models for AMR-to-text generation. In Proc. INLG’17, pp. 1–10.
- Castro Ferreira, T., Krahmer, E., & Wubben, S. (2016). Towards more variation in text generation: Developing and evaluating variation models for choice of referential form. In Proc. ACL’16, pp. 568–577.
- Castro Ferreira, T., Wubben, S., & Krahmer, E. (2017). Generating flexible proper name references in text: Data, models and evaluation. In Proc. EACL’17, pp. 655–664.
- Chang, F., Dell, G. S., & Bock, K. (2006). Becoming syntactic. Psychological review, 113(2), 234–72.
- Chen, D. L., & Mooney, R. J. (2008). Learning to sportscast: a test of grounded language acquisition. In Proc. ICML’08, pp. 128–135.
- Cheng, H., & Mellish, C. (2000). Capturing the interaction between aggregation and text planning in two generation systems. In Proc. INLG’00, pp. 186–193.
- Chi, M., Jordan, P. W., & VanLehn, K. (2014). When Is Tutorial Dialogue More Effective Than Step-Based Tutoring?. In Proc. ITS’14, pp. 210–219.
- Clark, H. H. (1996). Using Language. Cambridge University Press, Cambridge, UK.
- Clarke, J., & Lapata, M. (2010). Discourse Constraints for Document Compression. Computational Linguistics, 36(3), 411–441.
- Clerwall, C. (2014). Enter the Robot Journalist. Journalism Practice, 8(5), 519–531.
- Coch, J. (1998). Interactive generation and knowledge administration in MultiMeteo. In Proc. IWNLG’98, pp. 300–303.
- Cohen, P. R., & Levesque, H. J. (1985). Speech acts and rationality. In Proc. ACL’85, pp. 49–60.
- Cohen, P. R., & Perrault, C. R. (1979). Elements of a plan-based theory of speech acts. Cognitive Science, 3, 177–212.
- Colin, E., Gardent, C., Mrabet, Y., Narayan, S., & Perez-Beltrachini, L. (2016). The webNLG challenge: Generating text from dbpedia data. In Proc. INLG’16, pp. 163–167.
- Colton, S., Goodwin, J., & Veale, T. (2012). Full-FACE Poetry Generation. In Proc. ICCC’12, pp. 95–102.
- Concepción, E., Méndez, G., Gervás, P., & León, C. (2016). A challenge proposal for narrative generation using CNLs. In Proc. INLG’16, pp. 171–173.
- Cuayáhuitl, H., & Dethlefs, N. (2011). Hierarchical Reinforcement Learning and Hidden Markov Models for Task-Oriented Natural Language Generation. In Proc. ACL’11, pp. 654–659.
- [\BCAYDaleDale1989] Dale, R. \BBOP1989\BBCP. \BBOQCooking up referring expressions\BBCQ In \BemProc. ACL’89, \BPGS 68–75.
- [\BCAYDaleDale1992] Dale, R. \BBOP1992\BBCP. \BemGenerating Referring Expressions: Constructing Descriptions in a Domain of Objects and Processes. MIT Press, Cambridge, MA.
- [\BCAYDaleDale2016] Dale, R. \BBOP2016\BBCP. \BBOQThe return of the chatbots\BBCQ \BemNatural Language Engineering, \Bem22(5), 811–817.
- [\BCAYDale, Anisimoff, \BBA NarrowayDale et al.2012] Dale, R., Anisimoff, I., \BBA Narroway, G. \BBOP2012\BBCP. \BBOQHoo 2012: A report on the preposition and determiner error correction shared task\BBCQ In \BemProc. 7th Workshop on Building Educational Applications Using NLP, \BPGS 54–62.
- [\BCAYDale \BBA ReiterDale \BBA Reiter1995] Dale, R.\BBACOMMA \BBA Reiter, E. \BBOP1995\BBCP. \BBOQComputational Interpretations of the Gricean Maxims in the Generation of Referring Expressions\BBCQ \BemCognitive Science, \Bem19(2), 233–263.
- [\BCAYDale \BBA WhiteDale \BBA White2007] Dale, R.\BBACOMMA \BBA White, M. \BBOP2007\BBCP. \BBOQShared Tasks and Comparative Evaluation in Natural Language Generation: Workshop Report\BBCQ \BTR, Ohio State University, Arlington, Virginia.
- [\BCAYDalianisDalianis1999] Dalianis, H. \BBOP1999\BBCP. \BBOQAggregation in Natural Language Generation\BBCQ \BemComputational Intelligence, \Bem15(4), 384–414.
- [\BCAYde Oliveira \BBA Sripadade Oliveira \BBA Sripada2014] de Oliveira, R.\BBACOMMA \BBA Sripada, S. \BBOP2014\BBCP. \BBOQAdapting SimpleNLG for Brazilian Portugese realisation\BBCQ In \BemProc. INLG’14, \BPGS 93–94.
- [\BCAYDe Rosis \BBA GrassoDe Rosis \BBA Grasso2000] De Rosis, F.\BBACOMMA \BBA Grasso, F. \BBOP2000\BBCP. \BBOQAffective Natural Language Generation\BBCQ In Paiva, A.\BED, \BemAffective interactions, \BPGS 204–218. Springer, Berlin and Heidelberg.
- [\BCAYDe Smedt, Horacek, \BBA ZockDe Smedt et al.1996] De Smedt, K., Horacek, H., \BBA Zock, M. \BBOP1996\BBCP. \BBOQArchitectures for Natural Language Generation : Problems and Perspectives\BBCQ In Adorni, G.\BBACOMMA \BBA Zock, M.\BEDS, \BemTrends in Natural Language Generation: an Artificial Intelligence Perspective, \BPGS 17–46. Springer, Berlin and Heidelberg.
- [\BCAYDemir, Carberry, \BBA McCoyDemir et al.2012] Demir, S., Carberry, S., \BBA McCoy, K. F. \BBOP2012\BBCP. \BBOQSummarizing information graphics textually\BBCQ \BemComputational Linguistics, \Bem38(3), 527–574.
- [\BCAYDethlefsDethlefs2014] Dethlefs, N. \BBOP2014\BBCP. \BBOQContext-Sensitive Natural Language Generation: From Knowledge-Driven to Data-Driven Techniques\BBCQ \BemLanguage and Linguistics Compass, \Bem8(3), 99–115.
- [\BCAYDethlefs \BBA CuayáhuitlDethlefs \BBA Cuayáhuitl2015] Dethlefs, N.\BBACOMMA \BBA Cuayáhuitl, H. \BBOP2015\BBCP. \BBOQHierarchical reinforcement learning for situated natural language generation\BBCQ \BemNatural Language Engineering, \Bem21(3), 391–435.
- [\BCAYDevlin, Cheng, Fang, Gupta, Deng, He, Zweig, \BBA MitchellDevlin et al.2015a] Devlin, J., Cheng, H., Fang, H., Gupta, S., Deng, L., He, X., Zweig, G., \BBA Mitchell, M. \BBOP2015a\BBCP. \BBOQLanguage Models for Image Captioning : The Quirks and What Works\BBCQ In \BemProc. ACL/IJCNLP’15, \BPGS 100–105.
- [\BCAYDevlin, Gupta, Girshick, Mitchell, \BBA ZitnickDevlin et al.2015b] Devlin, J., Gupta, S., Girshick, R., Mitchell, M., \BBA Zitnick, C. L. \BBOP2015b\BBCP. \BBOQExploring Nearest Neighbor Approaches for Image Captioning\BBCQ \BemarXiv preprint, \Bem1505.04467.
- [\BCAYDi Eugenio, Fossati, Yu, Haller, \BBA GlassDi Eugenio et al.2005] Di Eugenio, B., Fossati, D., Yu, D., Haller, S., \BBA Glass, M. \BBOP2005\BBCP. \BBOQAggregation improves learning: Experiments in natural language generation for intelligent tutoring systems\BBCQ In \BemProc. ACL’05, \BPGS 50–57.
- [\BCAYDi Eugenio \BBA GreenDi Eugenio \BBA Green2010] Di Eugenio, B.\BBACOMMA \BBA Green, N. \BBOP2010\BBCP. \BBOQEmerging applications of natural language generation in information visualization, education, and health-care\BBCQ In Indurkhya, N.\BBACOMMA \BBA Damerau, F.\BEDS, \BemHandbook of Natural Language Processing (2nd \BEd)., \BPGS 557–575. Chapman and Hall/CRC, London.
- [\BCAYDi Fabbrizio, Stent, \BBA BangaloreDi Fabbrizio et al.2008] Di Fabbrizio, G., Stent, A., \BBA Bangalore, S. \BBOP2008\BBCP. \BBOQTrainable Speaker-Based Referring Expression Generation\BBCQ In \BemProc. CoNLL’08, \BPGS 151–158.
- [\BCAYDiMarco, Covvey, Bray, Cowan, DiCiccio, Hovy, Mulholland, \BBA LipaDiMarco et al.2007] DiMarco, C., Covvey, H. D., Bray, P., Cowan, D., DiCiccio, V., Hovy, E. H., Mulholland, D., \BBA Lipa, J. \BBOP2007\BBCP. \BBOQThe Development of a Natural Language Generation System For Personalized e-Health Information\BBCQ In \BemProc. MedInfo’07.
- [\BCAYDiMarco \BBA HirstDiMarco \BBA Hirst1993] DiMarco, C.\BBACOMMA \BBA Hirst, G. \BBOP1993\BBCP. \BBOQA Computational Theory of Goal-Directed Style in Syntax\BBCQ \BemComputational Linguistics, \Bem19(3), 451–499.
- [\BCAYDimitromanolaki \BBA AndroutsopoulosDimitromanolaki \BBA Androutsopoulos2003] Dimitromanolaki, A.\BBACOMMA \BBA Androutsopoulos, I. \BBOP2003\BBCP. \BBOQLearning to Order Facts for Discourse Planning in Natural Language Generation\BBCQ In \BemProc. ENLG’03, \BPGS 23–30.
- [\BCAYDoddingtonDoddington2002] Doddington, G. \BBOP2002\BBCP. \BBOQAutomatic evaluation of machine translation quality using n-gram co-occurrence statistics\BBCQ In \BemProc. ARPA Workshop on Human Language Technology, \BPGS 128–132.
- [\BCAYDonahue, Hendricks, Rohrbach, Venugopalan, Guadarrama, Saenko, \BBA DarrellDonahue et al.2015] Donahue, J., Hendricks, L. A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., \BBA Darrell, T. \BBOP2015\BBCP. \BBOQLong-term Recurrent Convolutional Networks for Visual Recognition and Description\BBCQ In \BemProc. CVPR’15, \BPGS 1–14.
- [\BCAYDong, Wu, He, Yu, \BBA WangDong et al.2015] Dong, D., Wu, H., He, W., Yu, D., \BBA Wang, H. \BBOP2015\BBCP. \BBOQMulti-Task Learning for Multiple Language Translation\BBCQ In \BemProc. ACL/IJCNLP’15, \BPGS 1723–1732.
- [\BCAYDong, Huang, Wei, Lapata, Zhou, \BBA XuDong et al.2017] Dong, L., Huang, S., Wei, F., Lapata, M., Zhou, M., \BBA Xu, K. \BBOP2017\BBCP. \BBOQLearning to Generate Product Reviews from Attributes\BBCQ In \BemProc. EACL’17, \BPGS 623–632.
- [\BCAYDorr \BBA GaasterlandDorr \BBA Gaasterland1995] Dorr, B.\BBACOMMA \BBA Gaasterland, T. \BBOP1995\BBCP. \BBOQSelecting tense, aspect and connecting words in language generation\BBCQ In \BemProc. IJCAI’95, \BPGS 1299–1305.
- [\BCAYDorr, Monz, Oard, President, Zajic, \BBA SchwartzDorr et al.2004] Dorr, B., Monz, C., Oard, D., President, S., Zajic, D., \BBA Schwartz, R. \BBOP2004\BBCP. \BBOQExtrinsic Evaluation of Automatic Metrics (LAMP-TR-115)\BBCQ \BTR, University of Maryland, College Park, MD.
- [\BCAYDrasDras2015] Dras, M. \BBOP2015\BBCP. \BBOQEvaluating human pairwise preference judgments\BBCQ \BemComputational Linguistics, \Bem41(2), 309–317.
- [\BCAYDuboue \BBA McKeownDuboue \BBA McKeown2003] Duboue, P. A.\BBACOMMA \BBA McKeown, K. R. \BBOP2003\BBCP. \BBOQStatistical acquisition of content selection rules for natural language generation\BBCQ In \BemProc. EMNLP’03, \BPGS 121–128.
- [\BCAYDuma \BBA KleinDuma \BBA Klein2013] Duma, D.\BBACOMMA \BBA Klein, E. \BBOP2013\BBCP. \BBOQGenerating natural language from linked data: Unsupervised template extraction\BBCQ In \BemProc. IWCS’13, \BPGS 83–94.
- [\BCAYDušek \BBA JurčíčekDušek \BBA Jurčíček2015] Dušek, O.\BBACOMMA \BBA Jurčíček, F. \BBOP2015\BBCP. \BBOQTraining a Natural Language Generator From Unaligned Data\BBCQ In \BemProc. ACL/IJCNLP’15, \BPGS 451–461.
- [\BCAYDušek \BBA JurčíčekDušek \BBA Jurčíček2016] Dušek, O.\BBACOMMA \BBA Jurčíček, F. \BBOP2016\BBCP. \BBOQSequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings\BBCQ In \BemProc. ACL’16, \BPGS 45–51.
- [\BCAYDuygulu, Barnard, de Freitas, \BBA ForsythDuygulu et al.2002] Duygulu, P., Barnard, K., de Freitas, N., \BBA Forsyth, D. \BBOP2002\BBCP. \BBOQObject recognition as machine translation: Learning a lexicon for a fixed image vocabulary\BBCQ In \BemProc. ECCV’02, \BPGS 97–112.
- [\BCAYEdmonds \BBA HirstEdmonds \BBA Hirst2002] Edmonds, P.\BBACOMMA \BBA Hirst, G. \BBOP2002\BBCP. \BBOQNear-Synonymy and Lexical Choice\BBCQ \BemComputational Linguistics, \Bem28(2), 105–144.
- [\BCAYEisensteinEisenstein2013] Eisenstein, J. \BBOP2013\BBCP. \BBOQWhat to do about bad language on the internet\BBCQ In \BemProc. NAACL-HLT’13, \BPGS 359–369.
- [\BCAYElhadad \BBA RobinElhadad \BBA Robin1996] Elhadad, M.\BBACOMMA \BBA Robin, J. \BBOP1996\BBCP. \BBOQAn overview of SURGE: A reusable comprehensive syntactic realization component\BBCQ In \BemProceedings of the 8th International Natural Language Generation Workshop (IWNLG’98), \BPGS 1–4.
- [\BCAYElhadad, Robin, \BBA McKeownElhadad et al.1997] Elhadad, M., Robin, J., \BBA McKeown, K. R. \BBOP1997\BBCP. \BBOQFloating constraints in lexical choice\BBCQ \BemComputational Linguistics, \Bem23(2), 195–239.
- [\BCAYElhoseiny, Elgammal, \BBA SalehElhoseiny et al.2017] Elhoseiny, M., Elgammal, A., \BBA Saleh, B. \BBOP2017\BBCP. \BBOQWrite a Classifier: Predicting Visual Classifiers from Unstructured Text Descriptions\BBCQ \BemIEEE Transactions on Pattern Analysis and Machine Intelligence, \Bem39(12), 2539–2553.
- [\BCAYEll \BBA HarthEll \BBA Harth2014] Ell, B.\BBACOMMA \BBA Harth, A. \BBOP2014\BBCP. \BBOQA language-independent method for the extraction of RDF verbalization templates\BBCQ In \BemProc. INLG’14, \BPGS 26–34.
- [\BCAYElliott \BBA De VriesElliott \BBA De Vries2015] Elliott, D.\BBACOMMA \BBA De Vries, A. P. \BBOP2015\BBCP. \BBOQDescribing Images using Inferred Visual Dependency Representations\BBCQ In \BemProc. ACL/IJCNLP’15, \BPGS 42–52.
- [\BCAYElliott, Frank, Sima’an, \BBA SpeciaElliott et al.2016] Elliott, D., Frank, S., Sima’an, K., \BBA Specia, L. \BBOP2016\BBCP. \BBOQMulti30K: Multilingual English-German Image Descriptions\BBCQ \BemarXiv preprint, \Bem1605.00459.
- [\BCAYElliott \BBA KellerElliott \BBA Keller2013] Elliott, D.\BBACOMMA \BBA Keller, F. \BBOP2013\BBCP. \BBOQImage Description using Visual Dependency Representations\BBCQ In \BemProc. EMNLP’13, \BPGS 1292–1302.
- [\BCAYElliott \BBA KellerElliott \BBA Keller2014] Elliott, D.\BBACOMMA \BBA Keller, F. \BBOP2014\BBCP. \BBOQComparing Automatic Evaluation Measures for Image Description\BBCQ In \BemProc. ACL’14, \BPGS 452–457.
- [\BCAYElmanElman1990] Elman, J. L. \BBOP1990\BBCP. \BBOQFinding structure in time\BBCQ \BemCognitive Science, \Bem14(2), 179–211.
- [\BCAYElmanElman1993] Elman, J. L. \BBOP1993\BBCP. \BBOQLearning and development in neural networks: The importance of starting small\BBCQ \BemCognition, \Bem48, 71–99.
- [\BCAYElson \BBA McKeownElson \BBA McKeown2010] Elson, D.\BBACOMMA \BBA McKeown, K. R. \BBOP2010\BBCP. \BBOQTense and aspect assignment in narrative discourse\BBCQ In \BemProc. INLG’10, \BPGS 47–56.
- [\BCAYElting, Martin, Cantor, \BBA RubensteinElting et al.1999] Elting, L. S., Martin, C. G., Cantor, S. B., \BBA Rubenstein, E. B. \BBOP1999\BBCP. \BBOQInfluence of data display formats on physician investigators’ decisions to stop clinical trials: prospective trial with repeated measures\BBCQ \BemBMJ (Clinical research ed.), \Bem318(7197), 1527–1531.
- [\BCAYEngelhardt, Bailey, \BBA FerreiraEngelhardt et al.2006] Engelhardt, P., Bailey, K., \BBA Ferreira, F. \BBOP2006\BBCP. \BBOQDo speakers and listeners observe the Gricean Maxim of Quantity?\BBCQ \BemJournal of Memory and Language, \Bem54(4), 554–573.
- [\BCAYEngonopoulos \BBA KollerEngonopoulos \BBA Koller2014] Engonopoulos, N.\BBACOMMA \BBA Koller, A. \BBOP2014\BBCP. \BBOQGenerating effective referring expressions using charts\BBCQ In \BemProc. INLG’14, \BPGS 6–15.
- [\BCAYEspinosa, Rajkumar, White, \BBA BerleantEspinosa et al.2010] Espinosa, D., Rajkumar, R., White, M., \BBA Berleant, S. \BBOP2010\BBCP. \BBOQFurther Meta-Evaluation of Broad-Coverage Surface Realization\BBCQ In \BemProc. EMNLP’10, \BPGS 564–574.
- [\BCAYEspinosa, White, \BBA MehayEspinosa et al.2008] Espinosa, D., White, M., \BBA Mehay, D. \BBOP2008\BBCP. \BBOQHypertagging: Supertagging for surface realization with CCG\BBCQ In \BemProc. ACL-HLT’08, \BPGS 183–191.
- [\BCAYEvans, Piwek, \BBA CahillEvans et al.2002] Evans, R., Piwek, P., \BBA Cahill, L. \BBOP2002\BBCP. \BBOQWhat is nlg?\BBCQ In \BemProc. INLG’02, \BPGS 144–151.
- [\BCAYFang, Gupta, Iandola, Srivastava, Deng, Dollár, Gao, He, Mitchell, Platt, Zitnick, \BBA ZweigFang et al.2015] Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J. C., Zitnick, C. L., \BBA Zweig, G. \BBOP2015\BBCP. \BBOQFrom Captions to Visual Concepts and Back\BBCQ In \BemProc. CVPR’15, \BPGS 1473–1482.
- [\BCAYFarhadi, Hejrati, Sadeghi, Young, Rashtchian, Hockenmaier, \BBA ForsythFarhadi et al.2010] Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., \BBA Forsyth, D. \BBOP2010\BBCP. \BBOQEvery picture tells a story: Generating sentences from images\BBCQ In \BemProc. ECCV’10, \BVOL 6314 LNCS, \BPGS 15–29.
- [\BCAYFarnadi, Zoghbi, Moens, \BBA De CockFarnadi et al.2013] Farnadi, G., Zoghbi, S., Moens, M.-F., \BBA De Cock, M. \BBOP2013\BBCP. \BBOQRecognising Personality Traits Using Facebook Status Updates\BBCQ In \BemAAAI Technical Report WS-13-01: Computational Personality Recognition (Shared Task), \BPGS 14–18.
- [\BCAYFassFass1991] Fass, D. \BBOP1991\BBCP. \BBOQmet*: A Method for Discriminating Metonymy and Metaphor by Computer\BBCQ \BemComputational Linguistics, \Bem17(1), 49–90.
- [\BCAYFeng \BBA LapataFeng \BBA Lapata2010] Feng, Y.\BBACOMMA \BBA Lapata, M. \BBOP2010\BBCP. \BBOQHow many words is a picture worth? Automatic caption generation for news images\BBCQ In \BemProc. ACL’10, \BPGS 1239–1249.
- [\BCAYFerraro, Mostafazadeh, Huang, Vanderwende, Devlin, Galley, \BBA MitchellFerraro et al.2015] Ferraro, F., Mostafazadeh, N., Huang, T.-H., Vanderwende, L., Devlin, J., Galley, M., \BBA Mitchell, M. \BBOP2015\BBCP. \BBOQA Survey of Current Datasets for Vision and Language Research\BBCQ In \BemProc. EMNLP’15, \BPGS 207–213.
- [\BCAYFicler \BBA GoldbergFicler \BBA Goldberg2017] Ficler, J.\BBACOMMA \BBA Goldberg, Y. \BBOP2017\BBCP. \BBOQControlling Linguistic Style Aspects in Neural Language Generation\BBCQ In \BemProc. Workshop on Stylistic Variation, \BPGS 94–104.
- [\BCAYFikes \BBA NilssonFikes \BBA Nilsson1971] Fikes, R. E.\BBACOMMA \BBA Nilsson, N. J. \BBOP1971\BBCP. \BBOQStrips: A new approach to the application of theorem proving to problem solving\BBCQ \BemArtificial Intelligence, \Bem2(3-4), 189–208.
- [\BCAYFilippova \BBA StrubeFilippova \BBA Strube2007] Filippova, K.\BBACOMMA \BBA Strube, M. \BBOP2007\BBCP. \BBOQGenerating Constituent Order in German Clauses\BBCQ In \BemProc. ACL’07, \BPGS 320–327.
- [\BCAYFilippova \BBA StrubeFilippova \BBA Strube2009] Filippova, K.\BBACOMMA \BBA Strube, M. \BBOP2009\BBCP. \BBOQTree linearization in English: Improving language model based approaches\BBCQ In \BemProc. NAACL-HLT’09, \BPGS 225–228.
- [\BCAYFitzGerald, Artzi, \BBA ZettlemoyerFitzGerald et al.2013] FitzGerald, N., Artzi, Y., \BBA Zettlemoyer, L. \BBOP2013\BBCP. \BBOQLearning Distributions over Logical Forms for Referring Expression Generation\BBCQ In \BemProc. EMNLP’13, \BPGS 1914–1925.
- [\BCAYFleischman \BBA HovyFleischman \BBA Hovy2002] Fleischman, M.\BBACOMMA \BBA Hovy, E. H. \BBOP2002\BBCP. \BBOQEmotional Variation in speech-based Natural Language Generation\BBCQ In \BemProc. INLG’02, \BPGS 57–64.
- [\BCAYFlower \BBA HayesFlower \BBA Hayes1981] Flower, L.\BBACOMMA \BBA Hayes, J. R. \BBOP1981\BBCP. \BBOQA cognitive process theory of writing\BBCQ \BemCollege composition and communication, \Bem32(4), 365–387.
- [\BCAYFort, Adda, \BBA Bretonnel CohenFort et al.2011] Fort, K., Adda, G., \BBA Bretonnel Cohen, K. \BBOP2011\BBCP. \BBOQAmazon Mechanical Turk: Gold Mine or Coal Mine?\BBCQ \BemComputational Linguistics, \Bem37(2), 413–420.
- [\BCAYFossati, Di Eugenio, Ohlsson, Brown, \BBA ChenFossati et al.2015] Fossati, D., Di Eugenio, B., Ohlsson, S., Brown, C., \BBA Chen, L. \BBOP2015\BBCP. \BBOQData Driven Automatic Feedback Generation in the iList Intelligent Tutoring System\BBCQ \BemTechnology, Instruction, Cognition and Learning, \Bem10, 5–26.
- [\BCAYFrank, Goodman, \BBA TenenbaumFrank et al.2009] Frank, M. C., Goodman, N. D., \BBA Tenenbaum, J. B. \BBOP2009\BBCP. \BBOQUsing speakers’ referential intentions to model early cross-situational word learning\BBCQ \BemPsychological Science, \Bem20(5), 578–85.
- [\BCAYGardentGardent2002] Gardent, C. \BBOP2002\BBCP. \BBOQGenerating Minimal Definite Descriptions\BBCQ In \BemProc. ACL’02, \BPGS 96–103.
- [\BCAYGardent \BBA NarayanGardent \BBA Narayan2015] Gardent, C.\BBACOMMA \BBA Narayan, S. \BBOP2015\BBCP. \BBOQMultiple adjunction in feature-based tree-adjoining grammar\BBCQ \BemComputational Linguistics, \Bem41(1), 41–70.
- [\BCAYGardent \BBA Perez-BeltrachiniGardent \BBA Perez-Beltrachini2017] Gardent, C.\BBACOMMA \BBA Perez-Beltrachini, L. \BBOP2017\BBCP. \BBOQA statistical, grammar-based approach to microplanning\BBCQ \BemComputational Linguistics, \Bem43(1), 1–30.
- [\BCAYGaroufiGaroufi2014] Garoufi, K. \BBOP2014\BBCP. \BBOQPlanning-Based Models of Natural Language Generation\BBCQ \BemLanguage and Linguistics Compass, \Bem8(1), 1–10.
- [\BCAYGaroufi \BBA KollerGaroufi \BBA Koller2013] Garoufi, K.\BBACOMMA \BBA Koller, A. \BBOP2013\BBCP. \BBOQGeneration of effective referring expressions in situated context\BBCQ \BemLanguage and Cognitive Processes, \Bem29(8), 986–1001.
- [\BCAYGatt \BBA BelzGatt \BBA Belz2010] Gatt, A.\BBACOMMA \BBA Belz, A. \BBOP2010\BBCP. \BBOQIntroducing shared task evaluation to NLG: The TUNA shared task evaluation challenges\BBCQ In Krahmer, E.\BBACOMMA \BBA Theune, M.\BEDS, \BemEmpirical methods in natural language generation. Springer, Berlin and Heidelberg.
- [\BCAYGatt, Portet, Reiter, Hunter, Mahamood, Moncur, \BBA SripadaGatt et al.2009] Gatt, A., Portet, F., Reiter, E., Hunter, J. R., Mahamood, S., Moncur, W., \BBA Sripada, S. \BBOP2009\BBCP. \BBOQFrom data to text in the neonatal intensive care Unit: Using NLG technology for decision support and information management\BBCQ \BemAI Communications, \Bem22(3), 153–186.
- [\BCAYGatt, van der Sluis, \BBA van DeemterGatt et al.2007] Gatt, A., van der Sluis, I., \BBA van Deemter, K. \BBOP2007\BBCP. \BBOQEvaluating algorithms for the Generation of Referring Expressions using a balanced corpus\BBCQ In \BemProc. ENLG’07, \BPGS 49–56.
- [\BCAYGeman, Geman, Hallonquist, \BBA YounesGeman et al.2015] Geman, D., Geman, S., Hallonquist, N., \BBA Younes, L. \BBOP2015\BBCP. \BBOQVisual Turing test for computer vision systems\BBCQ \BemProceedings of the National Academy of Sciences, \Bem112(12), 3618–3623.
- [\BCAYGenetteGenette1980] Genette, G. \BBOP1980\BBCP. \BemNarrative Discourse: An Essay in Method. Cornell University Press, Ithaca, NY.
- [\BCAYGervásGervás2001] Gervás, P. \BBOP2001\BBCP. \BBOQAn expert system for the composition of formal Spanish poetry\BBCQ \BemKnowledge-Based Systems, \Bem14(3-4), 181–188.
- [\BCAYGervásGervás2009] Gervás, P. \BBOP2009\BBCP. \BBOQComputational approaches to storytelling and creativity\BBCQ \BemAI Magazine, \BemFall 2009, 49–62.
- [\BCAYGervásGervás2010] Gervás, P. \BBOP2010\BBCP. \BBOQEngineering Linguistic Creativity: Bird Flight and Jet Planes\BBCQ In \BemProc. 2nd Workshop on Computational Approaches to Linguistic Creativity, \BPGS 23–30.
- [\BCAYGervásGervás2012] Gervás, P. \BBOP2012\BBCP. \BBOQFrom the Fleece of Fact to Narrative Yarns: a Computational Model of Composition\BBCQ In \BemProc. Workshop on Computational Models of Narrative.
- [\BCAYGervásGervás2013] Gervás, P. \BBOP2013\BBCP. \BBOQStory Generator Algorithms\BBCQ In Hühn, P.\BED, \BemThe Living Handbook of Narratology. Hamburg University, Hamburg.
- [\BCAYGhosh, Chollet, Laksana, Morency, \BBA SchererGhosh et al.2017] Ghosh, S., Chollet, M., Laksana, E., Morency, L.-P., \BBA Scherer, S. \BBOP2017\BBCP. \BBOQAffect-LM: A Neural Language Model for Customizable Affective Text Generation\BBCQ In \BemProc. ACL’17, \BPGS 634–642.
- [\BCAYGkatzia, Rieser, Bartie, \BBA MackanessGkatzia et al.2015] Gkatzia, D., Rieser, V., Bartie, P., \BBA Mackaness, W. \BBOP2015\BBCP. \BBOQFrom the Virtual to the Real World: Referring to Objects in Real-World Spatial Scenes\BBCQ In \BemProc. EMNLP’15, \BPGS 1936–1942.
- [\BCAYGlucksbergGlucksberg2001] Glucksberg, S. \BBOP2001\BBCP. \BemUnderstanding figurative language: From metaphors to idioms. Oxford University Press, Oxford.
- [\BCAYGodwin \BBA PiwekGodwin \BBA Piwek2016] Godwin, K.\BBACOMMA \BBA Piwek, P. \BBOP2016\BBCP. \BBOQCollecting Reliable Human Judgements on Machine-Generated Language: The Case of the QGSTEC Data\BBCQ In \BemProc. INLG’16, \BPGS 212–216.
- [\BCAYGoldberg, Driedger, \BBA KittredgeGoldberg et al.1994] Goldberg, E., Driedger, N., \BBA Kittredge, R. I. \BBOP1994\BBCP. \BBOQUsing Natural Language Processing to Produce Weather Forecasts\BBCQ \BemIEEE Expert, \Bem2, 45–53.
- [\BCAYGoldbergGoldberg2016] Goldberg, Y. \BBOP2016\BBCP. \BBOQA Primer on Neural Network Models for Natural Language Processing\BBCQ \BemJournal of Artificial Intelligence Research, \Bem57, 345–420.
- [\BCAYGoldbergGoldberg2017] Goldberg, Y. \BBOP2017\BBCP. \BBOQAn adversarial review of ‘adversarial generation of natural language’\BBCQ https://goo.gl/EMipHQ.
- [\BCAYGoncalo OliveiraGoncalo Oliveira2017] Goncalo Oliveira, H. \BBOP2017\BBCP. \BBOQA Survey on Intelligent Poetry Generation: Languages, Features, Techniques, Reutilisation and Evaluation\BBCQ In \BemProc. INLG’17, \BPGS 11–20.
- [\BCAYGoodfellow, Bengio, \BBA CourvilleGoodfellow et al.2016] Goodfellow, I., Bengio, Y., \BBA Courville, A. \BBOP2016\BBCP. \BemDeep Learning. MIT Press, Cambridge, MA.
- [\BCAYGoodman, Cryder, \BBA CheemaGoodman et al.2013] Goodman, J., Cryder, C., \BBA Cheema, A. \BBOP2013\BBCP. \BBOQData collection in a flat world: The strengths and weaknesses of mechanical turk samples\BBCQ \BemJournal of Behavioral Decision Making, \Bem26(3), 213–224.
- [\BCAYGoyal, Dymetman, \BBA GaussierGoyal et al.2016] Goyal, R., Dymetman, M., \BBA Gaussier, E. \BBOP2016\BBCP. \BBOQNatural Language Generation through Character-Based RNNs with Finite-State Prior Knowledge\BBCQ In \BemProc. COLING’16, \BPGS 1083–1092.
- [\BCAYGreene, Bodrumlu, \BBA KnightGreene et al.2010] Greene, E., Bodrumlu, T., \BBA Knight, K. \BBOP2010\BBCP. \BBOQAutomatic Analysis of Rhythmic Poetry with Applications to Generation and Translation\BBCQ In \BemProc. EMNLP’10, \BPGS 524–533.
- [\BCAYGriceGrice1975] Grice, H. P. \BBOP1975\BBCP. \BBOQLogic and conversation\BBCQ In \BemSyntax and Semantics 3: Speech Acts, \BPGS 41–58. Elsevier, Amsterdam.
- [\BCAYGrosz, Joshi, \BBA WeinsteinGrosz et al.1995] Grosz, B. J., Joshi, A. K., \BBA Weinstein, S. \BBOP1995\BBCP. \BBOQCentering: A Framework for Modeling the Local Coherence of Discourse\BBCQ \BemComputational Linguistics, \Bem21(2), 203–225.
- [\BCAYGuheGuhe2007] Guhe, M. \BBOP2007\BBCP. \BemIncremental Conceptualization for Language Production. Lawrence Erlbaum Associates, Hillsdale, NJ.
- [\BCAYGupta, Verma, \BBA JawaharGupta et al.2012] Gupta, A., Verma, Y., \BBA Jawahar, C. V. \BBOP2012\BBCP. \BBOQChoosing Linguistics over Vision to Describe Images\BBCQ In \BemProc. AAAI’12, \BPGS 606–612.
- [\BCAYGupta, Walker, \BBA RomanoGupta et al.2007] Gupta, S., Walker, M. A., \BBA Romano, D. M. \BBOP2007\BBCP. \BBOQGenerating Politeness in Task Based Interaction: An Evaluation of Linguistic Form and Culture\BBCQ In \BemProc. ENLG’07, \BPGS 57–64.
- [\BCAYGupta, Walker, \BBA RomanoGupta et al.2008] Gupta, S., Walker, M. A., \BBA Romano, D. M. \BBOP2008\BBCP. \BBOQPOLLy: A Conversational System that uses a Shared Representation to Generate Action and Social Language\BBCQ In \BemProc. IJCNLP’08, \BPGS 7–12.
- [\BCAYGyawali \BBA GardentGyawali \BBA Gardent2014] Gyawali, B.\BBACOMMA \BBA Gardent, C. \BBOP2014\BBCP. \BBOQSurface Realisation from Knowledge-Bases\BBCQ In \BemProc. ACL’14, \BPGS 424–434.
- [\BCAYHalliday \BBA MatthiessenHalliday \BBA Matthiessen2004] Halliday, M.\BBACOMMA \BBA Matthiessen, C. M. \BBOP2004\BBCP. \BemIntroduction to Functional Grammar (3rd Edition \BEd). Hodder Arnold, London.
- [\BCAYHarbusch \BBA KempenHarbusch \BBA Kempen2009] Harbusch, K.\BBACOMMA \BBA Kempen, G. \BBOP2009\BBCP. \BBOQGenerating clausal coordinate ellipsis multilingually: A uniform approach based on postediting\BBCQ In \BemProc. ENLG’09, \BPGS 138–145.
- [\BCAYHardcastle \BBA ScottHardcastle \BBA Scott2008] Hardcastle, D.\BBACOMMA \BBA Scott, D. \BBOP2008\BBCP. \BBOQCan we evaluate the quality of generated text?\BBCQ In \BemProc. LREC’08, \BPGS 3151–3158.
- [\BCAYHarnadHarnad1990] Harnad, S. \BBOP1990\BBCP. \BBOQThe symbol grounding problem\BBCQ \BemPhysica, \BemD42(1990), 335–346.
- [\BCAYHarrisHarris2008] Harris, M. D. \BBOP2008\BBCP. \BBOQBuilding a large-scale commercial NLG system for an EMR\BBCQ In \BemProc. INLG’08, \BPGS 157–160.
- [\BCAYHearstHearst1992] Hearst, M. A. \BBOP1992\BBCP. \BBOQAutomatic Acquisition of Hyponyms from Large Text Corpora\BBCQ In \BemProc. COLING’92, \BPGS 539–545.
- [\BCAYHeeman \BBA HirstHeeman \BBA Hirst1995] Heeman, P. A.\BBACOMMA \BBA Hirst, G. \BBOP1995\BBCP. \BBOQCollaborating on referring expressions\BBCQ \BemComputational Linguistics, \Bem21(3), 351–382.
- [\BCAYHendricks, Akata, Rohrbach, Donahue, Schiele, \BBA DarrellHendricks et al.2016a] Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., \BBA Darrell, T. \BBOP2016a\BBCP. \BBOQGenerating Visual Explanations\BBCQ In \BemProc. ECCV’16.
- [\BCAYHendricks, Venugopalan, Rohrbach, Mooney, Saenko, \BBA DarrellHendricks et al.2016b] Hendricks, L. A., Venugopalan, S., Rohrbach, M., Mooney, R. J., Saenko, K., \BBA Darrell, T. \BBOP2016b\BBCP. \BBOQDeep Compositional Captioning: Describing Novel Object Categories without Paired Training Data\BBCQ In \BemProc. CVPR’16, \BPGS 1–10.
- [\BCAYHermanHerman1997] Herman, D. \BBOP1997\BBCP. \BBOQScripts, sequences and stories: Elements of a postclassical narratology\BBCQ \BemPMLA, \Bem112(5), 1046–1059.
- [\BCAYHermanHerman2001] Herman, D. \BBOP2001\BBCP. \BBOQStory logic in conversational and literary narratives\BBCQ \BemNarrative, \Bem9(2), 130–137.
- [\BCAYHermanHerman2007] Herman, D. \BBOP2007\BBCP. \BBOQStorytelling and the sciences of mind: Cognitive narratology, discursive psychology, and narratives in face-to-face interaction\BBCQ \BemNarrative, \Bem15(3), 306–334.
- [\BCAYHermidaHermida2015] Hermida, A. \BBOP2015\BBCP. \BBOQFrom Mr and Mrs Outlier to Central Tendencies: Computational Journalism and crime reporting at the Los Angeles Times\BBCQ \BemDigital Journalism, \Bem3(3), 381–397.
- [\BCAYHervás, Arroyo, Francisco, Peinado, \BBA GervásHervás et al.2016] Hervás, R., Arroyo, J., Francisco, V., Peinado, F., \BBA Gervás, P. \BBOP2016\BBCP. \BBOQInfluence of personal choices on lexical variability in referring expressions\BBCQ \BemNatural Language Engineering, \Bem22(2), 257–290.
- [\BCAYHervás, Francisco, \BBA GervásHervás et al.2013] Hervás, R., Francisco, V., \BBA Gervás, P. \BBOP2013\BBCP. \BBOQAssessing the influence of personal preferences on the choice of vocabulary for natural language generation\BBCQ \BemInformation Processing & Management, \Bem49(4), 817–832.
- [\BCAYHervás, Pereira, Gervás, \BBA CardosoHervás et al.2006] Hervás, R., Pereira, F., Gervás, P., \BBA Cardoso, A. \BBOP2006\BBCP. \BBOQCross-domain analogy in automated text generation\BBCQ In \BemProc. 3rd joint workshop on Computational Creativity, \BPGS 43–48.
- [\BCAYHerzig, Shmueli-scheuer, Sandbank, \BBA KonopnickiHerzig et al.2017] Herzig, J., Shmueli-scheuer, M., Sandbank, T., \BBA Konopnicki, D. \BBOP2017\BBCP. \BBOQNeural Response Generation for Customer Service based on Personality Traits\BBCQ In \BemProc. INLG’17, \BPGS 252–256.
- [\BCAYHochreiter \BBA SchmidhuberHochreiter \BBA Schmidhuber1997] Hochreiter, S.\BBACOMMA \BBA Schmidhuber, J. \BBOP1997\BBCP. \BBOQLong Short-Term Memory\BBCQ \BemNeural Computation, \Bem9(8), 1735–1780.
- [\BCAYHockenmaier \BBA SteedmanHockenmaier \BBA Steedman2007] Hockenmaier, J.\BBACOMMA \BBA Steedman, M. \BBOP2007\BBCP. \BBOQCCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank\BBCQ \BemComputational Linguistics, \Bem33(3), 355–396.
- [\BCAYHodosh, Young, \BBA HockenmaierHodosh et al.2013] Hodosh, M., Young, P., \BBA Hockenmaier, J. \BBOP2013\BBCP. \BBOQFraming image description as a ranking task: Data, models and evaluation metrics\BBCQ \BemJournal of Artificial Intelligence Research, \Bem47, 853–899.
- [\BCAYHoracekHoracek1997] Horacek, H. \BBOP1997\BBCP. \BBOQAn Algorithm For Generating Referential Descriptions With Flexible Interfaces\BBCQ In \BemProc. ACL’97, \BPGS 206–213.
- [\BCAYHovy \BBA SøgaardHovy \BBA Søgaard2015] Hovy, D.\BBACOMMA \BBA Søgaard, A. \BBOP2015\BBCP. \BBOQTagging Performance Correlates with Author Age\BBCQ In \BemProc. ACL’15, \BPGS 483–488.
- [\BCAYHovyHovy1988] Hovy, E. H. \BBOP1988\BBCP. \BemGenerating Natural Language Under Pragmatic Constraints. Lawrence Erlbaum Associates, Hillsdale, NJ.
- [\BCAYHovyHovy1991] Hovy, E. H. \BBOP1991\BBCP. \BBOQApproaches to the Planning of Coherent Text\BBCQ In Paris, C. L., Swartout, W. R., \BBA Mann, W. C.\BEDS, \BemNatural Language Generation in Artificial Intelligence and Computational Linguistics, \BPGS 83–102. Kluwer, Dordrecht.
- [\BCAYHovyHovy1993] Hovy, E. H. \BBOP1993\BBCP. \BBOQAutomated discourse generation using discourse structure relations\BBCQ \BemArtificial intelligence, \Bem63(1), 341–385.
- [\BCAYHu, Yang, Liang, Salakhutdinov, \BBA XingHu et al.2017] Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., \BBA Xing, E. P. \BBOP2017\BBCP. \BBOQToward Controlled Generation of Text\BBCQ In \BemProc. ICML’17, \BPGS 1587–1596.
- [\BCAYHuang, Ferraro, Mostafazadeh, Misra, Agrawal, Devlin, Girshick, He, Kohli, Batra, Zitnick, Parikh, Vanderwende, Galley, \BBA MitchellHuang et al.2016] Huang, T.-H., Ferraro, F., Mostafazadeh, N., Misra, I., Agrawal, A., Devlin, J., Girshick, R., He, X., Kohli, P., Batra, D., Zitnick, C. L., Parikh, D., Vanderwende, L., Galley, M., \BBA Mitchell, M. \BBOP2016\BBCP. \BBOQVisual Storytelling\BBCQ In \BemProc. NAACL-HLT’16, \BPGS 1233–1239.
- [\BCAYHueske-KrausHueske-Kraus2003] Hueske-Kraus, D. \BBOP2003\BBCP. \BBOQSuregen-2: a shell system for the generation of clinical documents\BBCQ In \BemProc. EACL’03, \BPGS 215–218.
- [\BCAYHunter, Freer, Gatt, Reiter, Sripada, \BBA SykesHunter et al.2012] Hunter, J. R., Freer, Y., Gatt, A., Reiter, E., Sripada, S., \BBA Sykes, C. \BBOP2012\BBCP. \BBOQAutomatic generation of natural language nursing shift summaries in neonatal intensive care: BT-Nurse\BBCQ \BemArtificial Intelligence in Medicine, \Bem56(3), 157–172.
- [\BCAYHüske-KrausHüske-Kraus2003] Hüske-Kraus, D. \BBOP2003\BBCP. \BBOQText generation in clinical medicine: A review\BBCQ \BemMethods of information in medicine, \Bem42(1), 51–60.
- [\BCAYHutchins \BBA SomersHutchins \BBA Somers1992] Hutchins, W. J.\BBACOMMA \BBA Somers, H. L. \BBOP1992\BBCP. \BemAn introduction to machine translation, \BVOL 362. Academic Press London.
- [\BCAYInui, Tokunaga, \BBA TanakaInui et al.1992] Inui, K., Tokunaga, T., \BBA Tanaka, H. \BBOP1992\BBCP. \BBOQText revision: A model and its implementation\BBCQ In Dale, R., Hovy, E. H., Rosner, D., \BBA Stock, O.\BEDS, \BemAspects of automated natural language generation, \BVOL 587, \BPGS 215–230. Springer, Berlin and Heidelberg.
- [\BCAYIsard, Brockmann, \BBA OberlanderIsard et al.2006] Isard, A., Brockmann, C., \BBA Oberlander, J. \BBOP2006\BBCP. \BBOQIndividuality and Alignment in Generated Dialogues\BBCQ In \BemProc. INLG’06, \BPGS 25–32.
- [\BCAYJanarthanam \BBA LemonJanarthanam \BBA Lemon2011] Janarthanam, S.\BBACOMMA \BBA Lemon, O. \BBOP2011\BBCP. \BBOQThe GRUVE Challenge: Generating Routes under Uncertainty in Virtual Environments\BBCQ In \BemProc. ENLG’11, \BPGS 208–211.
- [\BCAYJanarthanam \BBA LemonJanarthanam \BBA Lemon2014] Janarthanam, S.\BBACOMMA \BBA Lemon, O. \BBOP2014\BBCP. \BBOQAdaptive Generation in Dialogue Systems Using Dynamic User Modeling\BBCQ \BemComputational Linguistics, \Bem40(4), 883–920.
- [\BCAYJia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, \BBA DarrellJia et al.2014] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., \BBA Darrell, T. \BBOP2014\BBCP. \BBOQCaffe: Convolutional Architecture for Fast Feature Embedding\BBCQ In \BemProc. ACM International Conference on Multimedia, \BPGS 675–678. ACM.
- [\BCAYJohannsen, Hovy, \BBA SøgaardJohannsen et al.2015] Johannsen, A., Hovy, D., \BBA Søgaard, A. \BBOP2015\BBCP. \BBOQCross-lingual syntactic variation over age and gender\BBCQ In \BemProc. CoNLL’15, \BPGS 103–112.
- [\BCAYJohn \BBA SrivastavaJohn \BBA Srivastava1999] John, O.\BBACOMMA \BBA Srivastava, S. \BBOP1999\BBCP. \BBOQThe Big Five trait taxonomy: History, measurement, and theoretical perspectives\BBCQ In Pervin, L.\BBACOMMA \BBA John, O.\BEDS, \BemHandbook of Personality Theory and Research. Guilford Press, New York.
- [\BCAYJohnson, Karpathy, \BBA Fei-FeiJohnson et al.2016] Johnson, J., Karpathy, A., \BBA Fei-Fei, L. \BBOP2016\BBCP. \BBOQDenseCap: Fully Convolutional Localization Networks for Dense Captioning\BBCQ In \BemProc. CVPR’16, \BPGS 4565–4574.
- [\BCAYJohnson, Rizzo, Bosma, Kole, Ghijsen, \BBA Van WelbergenJohnson et al.2004] Johnson, W. L., Rizzo, P., Bosma, W., Kole, S., Ghijsen, M., \BBA Van Welbergen, H. \BBOP2004\BBCP. \BBOQGenerating socially appropriate tutorial dialog\BBCQ In Andre, E., Dybkjær, L., Minker, W., \BBA Heisterkamp, P.\BEDS, \BemAffective Dialog Systems: Proceedings of the ADS 2004 Tutorial and Research Workshop, \BVOL Lecture No, \BPGS 254–264. Springer, Berlin and Heidelberg.
- [\BCAYJordan \BBA WalkerJordan \BBA Walker2005] Jordan, P. W.\BBACOMMA \BBA Walker, M. A. \BBOP2005\BBCP. \BBOQLearning content selection rules for generating object descriptions in dialogue\BBCQ \BemJournal of Artificial Intelligence Research, \Bem24, 157–194.
- [\BCAYJoshi \BBA SchabesJoshi \BBA Schabes1997] Joshi, A. K.\BBACOMMA \BBA Schabes, Y. \BBOP1997\BBCP. \BBOQTree-Adjoining Grammars\BBCQ In \BemHandbook of Formal Languages, Vol. 3, \BPGS 69–123. Springer, New York.
- [\BCAYKalchbrenner \BBA BlunsomKalchbrenner \BBA Blunsom2013] Kalchbrenner, N.\BBACOMMA \BBA Blunsom, P. \BBOP2013\BBCP. \BBOQRecurrent Continuous Translation Models\BBCQ In \BemProc. EMNLP’13, \BPGS 1700–1709.
- [\BCAYKarpathy \BBA Fei-FeiKarpathy \BBA Fei-Fei2015] Karpathy, A.\BBACOMMA \BBA Fei-Fei, L. \BBOP2015\BBCP. \BBOQDeep visual-semantic alignments for generating image descriptions\BBCQ In \BemProc. CVPR’15, \BPGS 3128–3137.
- [\BCAYKarpathy, Joulin, \BBA Fei-FeiKarpathy et al.2014] Karpathy, A., Joulin, A., \BBA Fei-Fei, L. \BBOP2014\BBCP. \BBOQDeep Fragment Embeddings for Bidirectional Image Sentence Mapping\BBCQ In \BemProc. NIPS’14, \BPGS 1–9.
- [\BCAYKasperKasper1989] Kasper, R. T. \BBOP1989\BBCP. \BBOQA Flexible Interface for Linking Applications to Penman’s Sentence Generator\BBCQ In \BemProc. Workshop on Speech and Natural Language, \BPGS 153–158.
- [\BCAYKauchak \BBA BarzilayKauchak \BBA Barzilay2006] Kauchak, D.\BBACOMMA \BBA Barzilay, R. \BBOP2006\BBCP. \BBOQParaphrasing for automatic evaluation\BBCQ In \BemProc. NAACL-HLT’06, \BPGS 455–462.
- [\BCAYKayKay1996] Kay, M. \BBOP1996\BBCP. \BBOQChart Generation\BBCQ In \BemProc. ACL’96, \BPGS 200–204.
- [\BCAYKazemzadeh, Ordonez, Matten, \BBA BergKazemzadeh et al.2014] Kazemzadeh, S., Ordonez, V., Matten, M., \BBA Berg, T. \BBOP2014\BBCP. \BBOQReferItGame: Referring to Objects in Photographs of Natural Scenes\BBCQ In \BemProc. EMNLP’14, \BPGS 787–798.
- [\BCAYKelleher, Costello, \BBA Van GenabithKelleher et al.2005] Kelleher, J., Costello, F., \BBA Van Genabith, J. \BBOP2005\BBCP. \BBOQDynamically structuring, updating and interrelating representations of visual and linguistic discourse context\BBCQ \BemArtificial Intelligence, \Bem167, 62–102.
- [\BCAYKelleher \BBA KruijffKelleher \BBA Kruijff2006] Kelleher, J.\BBACOMMA \BBA Kruijff, G.-J. \BBOP2006\BBCP. \BBOQIncremental generation of spatial referring expressions in situated dialog\BBCQ In \BemProc. COLING-ACL’06, \BPGS 1041–1048.
- [\BCAYKempenKempen2009] Kempen, G. \BBOP2009\BBCP. \BBOQClausal coordination and coordinate ellipsis in a model of the speaker\BBCQ \BemLinguistics, \Bem47(3), 653–696.
- [\BCAYKennedy \BBA McNallyKennedy \BBA McNally2005] Kennedy, C.\BBACOMMA \BBA McNally, L. \BBOP2005\BBCP. \BBOQScale Structure, Degree Modification, and the Semantics of Gradable Predicates\BBCQ \BemLanguage, \Bem81(2), 345–381.
- [\BCAYKeshtkar \BBA InkpenKeshtkar \BBA Inkpen2011] Keshtkar, F.\BBACOMMA \BBA Inkpen, D. \BBOP2011\BBCP. \BBOQA pattern-based model for generating text to express emotion\BBCQ In \BemProc. ACII’11, \BPGS 11–21.
- [\BCAYKibble \BBA PowerKibble \BBA Power2004] Kibble, R.\BBACOMMA \BBA Power, R. \BBOP2004\BBCP. \BBOQOptimizing referential coherence in text generation\BBCQ \BemComputational Linguistics, \Bem30(4), 401–416.
- [\BCAYKiddon \BBA BrunKiddon \BBA Brun2011] Kiddon, C.\BBACOMMA \BBA Brun, Y. \BBOP2011\BBCP. \BBOQThat’s what she said: double entendre identification\BBCQ In \BemProc. ACL-HLT’11, \BPGS 89–94.
- [\BCAYKilickaya, Erdem, Ikizler-Cinbis, \BBA ErdemKilickaya et al.2017] Kilickaya, M., Erdem, A., Ikizler-Cinbis, N., \BBA Erdem, E. \BBOP2017\BBCP. \BBOQRe-evaluating Automatic Metrics for Image Captioning\BBCQ In \BemProc. EACL’17, \BPGS 199–209.
- [\BCAYKim \BBA MooneyKim \BBA Mooney2010] Kim, J.\BBACOMMA \BBA Mooney, R. J. \BBOP2010\BBCP. \BBOQGenerative Alignment and Semantic Parsing for Learning from Ambiguous Supervision\BBCQ In \BemProc. COLING’10, \BPGS 543–551.
- [\BCAYKiros, Zemel, \BBA SalakhutdinovKiros et al.2014] Kiros, R., Zemel, R. S., \BBA Salakhutdinov, R. \BBOP2014\BBCP. \BBOQMultimodal Neural Language Models\BBCQ In \BemProc. ICML’14, \BPGS 1–14.
- [\BCAYKojima, Tamura, \BBA FukunagaKojima et al.2002] Kojima, A., Tamura, T., \BBA Fukunaga, K. \BBOP2002\BBCP. \BBOQNatural language description of human activities from video images based on concept hierarchy of actions\BBCQ \BemInternational Journal of Computer Vision, \Bem50(2), 171–184.
- [\BCAYKoller \BBA PetrickKoller \BBA Petrick2011] Koller, A.\BBACOMMA \BBA Petrick, R. P. \BBOP2011\BBCP. \BBOQExperiences with planning for natural language generation\BBCQ \BemComputational Intelligence, \Bem27(1), 23–40.
- [\BCAYKoller \BBA StoneKoller \BBA Stone2007] Koller, A.\BBACOMMA \BBA Stone, M. \BBOP2007\BBCP. \BBOQSentence generation as a planning problem\BBCQ In \BemProc. ACL’07, \BPGS 336–343.
- [\BCAYKoller \BBA StriegnitzKoller \BBA Striegnitz2002] Koller, A.\BBACOMMA \BBA Striegnitz, K. \BBOP2002\BBCP. \BBOQGeneration as Dependency Parsing\BBCQ In \BemProc. ACL’02, \BPGS 17–24.
- [\BCAYKoller, Striegnitz, Gargett, Byron, Cassell, Dale, Moore, \BBA OberlanderKoller et al.2010] Koller, A., Striegnitz, K., Gargett, A., Byron, D., Cassell, J., Dale, R., Moore, J. D., \BBA Oberlander, J. \BBOP2010\BBCP. \BBOQReport on the second nlg challenge on generating instructions in virtual environments (give-2)\BBCQ In \BemProc. INLG’10, \BPGS 243–250.
- [\BCAYKoncel-Kedziorski, Hajishirzi, \BBA FarhadiKoncel-Kedziorski et al.2014] Koncel-Kedziorski, R., Hajishirzi, H., \BBA Farhadi, A. \BBOP2014\BBCP. \BBOQMulti-resolution language grounding with weak supervision\BBCQ In \BemProc. EMNLP’14, \BPGS 386–396.
- [\BCAYKondadadi, Howald, \BBA SchilderKondadadi et al.2013] Kondadadi, R., Howald, B., \BBA Schilder, F. \BBOP2013\BBCP. \BBOQA Statistical NLG Framework for Aggregated Planning and Realization\BBCQ In \BemProc. ACL’13, \BPGS 1406–1415.
- [\BCAYKonstas \BBA LapataKonstas \BBA Lapata2012] Konstas, I.\BBACOMMA \BBA Lapata, M. \BBOP2012\BBCP. \BBOQUnsupervised concept-to-text generation with hypergraphs\BBCQ In \BemProc. NAACL-HLT’12, \BPGS 752–761.
- [\BCAYKonstas \BBA LapataKonstas \BBA Lapata2013] Konstas, I.\BBACOMMA \BBA Lapata, M. \BBOP2013\BBCP. \BBOQA global model for concept-to-text generation\BBCQ \BemJournal of Artificial Intelligence Research, \Bem48, 305–346.
- [\BCAYKrahmer \BBA TheuneKrahmer \BBA Theune2010] Krahmer, E.\BBACOMMA \BBA Theune, M. \BBOP2010\BBCP. \BemEmpirical Methods in Natural Language Generation. Springer, Berlin & Heidelberg.
- [\BCAYKrahmer \BBA van DeemterKrahmer \BBA van Deemter2012] Krahmer, E.\BBACOMMA \BBA van Deemter, K. \BBOP2012\BBCP. \BBOQComputational generation of referring expressions: A survey\BBCQ \BemComputational Linguistics, \Bem38(1), 173–218.
- [\BCAYKrizhevsky, Sutskever, \BBA HintonKrizhevsky et al.2012] Krizhevsky, A., Sutskever, I., \BBA Hinton, G. \BBOP2012\BBCP. \BBOQImageNet Classification with Deep Convolutional Neural Networks\BBCQ In \BemProc. NIPS’12, \BPGS 1097–1105.
- [\BCAYKukichKukich1987] Kukich, K. \BBOP1987\BBCP. \BBOQWhere do phrases come from: Some preliminary experiments in connectionist phrase generation\BBCQ In \BemNatural Language Generation: New Results in Artificial Intelligence, Psychology and Linguistics. Springer, Berlin and Heidelberg.
- [\BCAYKukichKukich1992] Kukich, K. \BBOP1992\BBCP. \BBOQTechniques for automatically correcting words in text\BBCQ \BemACM Computing Surveys (CSUR), \Bem24(4), 377–439.
- [\BCAYKulkarni, Premraj, Dhar, Li, Choi, Berg, \BBA BergKulkarni et al.2011] Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., \BBA Berg, T. \BBOP2011\BBCP. \BBOQBaby Talk: Understanding and Generating Image Descriptions\BBCQ In \BemProc. CVPR’11, \BPGS 1601–1608.
- [\BCAYKulkarni, Premraj, Ordonez, Dhar, Li, Choi, Berg, \BBA BergKulkarni et al.2013] Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., \BBA Berg, T. \BBOP2013\BBCP. \BBOQBaby talk: Understanding and generating simple image descriptions\BBCQ \BemIEEE Transactions on Pattern Analysis and Machine Intelligence, \Bem35(12), 2891–2903.
- [\BCAYKusner, Sun, Kolkin, \BBA WeinbergerKusner et al.2015] Kusner, M. J., Sun, Y., Kolkin, N. I., \BBA Weinberger, K. Q. \BBOP2015\BBCP. \BBOQFrom Word Embeddings To Document Distances\BBCQ In \BemProc. ICML’15, \BPGS 957–966.
- [\BCAYKutlak, Mellish, \BBA van DeemterKutlak et al.2013] Kutlak, R., Mellish, C., \BBA van Deemter, K. \BBOP2013\BBCP. \BBOQContent Selection Challenge - University of Aberdeen Entry\BBCQ In \BemProc. ENLG’13, \BPGS 208–209.
- [\BCAYKuznetsova, Ordonez, Berg, Berg, \BBA ChoiKuznetsova et al.2012] Kuznetsova, P., Ordonez, V., Berg, A. C., Berg, T., \BBA Choi, Y. \BBOP2012\BBCP. \BBOQCollective Generation of Natural Image Descriptions\BBCQ In \BemProc. ACL’12, \BPGS 359–368.
- [\BCAYKuznetsova, Ordonez, Berg, \BBA ChoiKuznetsova et al.2014] Kuznetsova, P., Ordonez, V., Berg, T., \BBA Choi, Y. \BBOP2014\BBCP. \BBOQTREETALK: Composition and Compression of Trees for Image Descriptions\BBCQ \BemTransactions of the Association for Computational Linguistics, \Bem2, 351–362.
- [\BCAYLabbé \BBA PortetLabbé \BBA Portet2012] Labbé, C.\BBACOMMA \BBA Portet, F. \BBOP2012\BBCP. \BBOQTowards an abstractive opinion summarisation of multiple reviews in the tourism domain\BBCQ In \BemProc. International Workshop on Sentiment Discovery from Affective Data, \BPGS 87–94.
- [\BCAYLabovLabov2010] Labov, W. \BBOP2010\BBCP. \BBOQOral narratives of personal experience\BBCQ In Hogan, P. C.\BED, \BemCambridge Encyclopedia of the Language Sciences, \BPGS 546–548. Cambridge University Press, Cambridge, UK.
- [\BCAYLakoff \BBA JohnsonLakoff \BBA Johnson1980] Lakoff, G.\BBACOMMA \BBA Johnson, M. \BBOP1980\BBCP. \BemMetaphors we Live By. Chicago University Press, Chicago, Ill.
- [\BCAYLampouras \BBA AndroutsopoulosLampouras \BBA Androutsopoulos2013] Lampouras, G.\BBACOMMA \BBA Androutsopoulos, I. \BBOP2013\BBCP. \BBOQUsing Integer Linear Programming in Concept-to-Text Generation to Produce More Compact Texts\BBCQ In \BemProc. ACL’13, \BPGS 561–566.
- [\BCAYLampouras \BBA VlachosLampouras \BBA Vlachos2016] Lampouras, G.\BBACOMMA \BBA Vlachos, A. \BBOP2016\BBCP. \BBOQImitation learning for language generation from unaligned data\BBCQ In \BemProc. COLING’16, \BPGS 1101–1112.
- [\BCAYLangkilde-GearyLangkilde-Geary2000] Langkilde-Geary, I. \BBOP2000\BBCP. \BBOQForest-based statistical sentence generation\BBCQ In \BemProc. ANLP-NAACL’00, \BPGS 170–177.
- [\BCAYLangkilde-Geary \BBA KnightLangkilde-Geary \BBA Knight2002] Langkilde-Geary, I.\BBACOMMA \BBA Knight, K. \BBOP2002\BBCP. \BBOQHALogen Statistical Sentence Generator\BBCQ In \BemProc. ACL’02 (Demos), \BPGS 102–103.
- [\BCAYLapataLapata2006] Lapata, M. \BBOP2006\BBCP. \BBOQAutomatic Evaluation of Information Ordering: Kendall’s Tau\BBCQ \BemComputational Linguistics, \Bem32(4), 471–484.
- [\BCAYLavie \BBA AgarwalLavie \BBA Agarwal2007] Lavie, A.\BBACOMMA \BBA Agarwal, A. \BBOP2007\BBCP. \BBOQMETEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments\BBCQ In \BemProc. Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, \BPGS 65–72.
- [\BCAYLavoie \BBA RambowLavoie \BBA Rambow1997] Lavoie, B.\BBACOMMA \BBA Rambow, O. \BBOP1997\BBCP. \BBOQA fast and portable realiser for text generation\BBCQ In \BemProc. ANLP’97, \BPGS 265–268.
- [\BCAYLaw, Freer, Hunter, Logie, McIntosh, \BBA QuinnLaw et al.2005] Law, A. S., Freer, Y., Hunter, J. R., Logie, R. H., McIntosh, N., \BBA Quinn, J. \BBOP2005\BBCP. \BBOQA comparison of graphical and textual presentations of time series data to support medical decision making in the neonatal intensive care unit\BBCQ \BemJournal of clinical monitoring and computing, \Bem19(3), 183–94.
- [\BCAYLebret, Grangier, \BBA AuliLebret et al.2016] Lebret, R., Grangier, D., \BBA Auli, M. \BBOP2016\BBCP. \BBOQGenerating Text from Structured Data with Application to the Biography Domain\BBCQ \BemarXiv preprint, \Bem1603.07771.
- [\BCAYLeCun, Bengio, \BBA HintonLeCun et al.2015] LeCun, Y., Bengio, Y., \BBA Hinton, G. \BBOP2015\BBCP. \BBOQDeep learning\BBCQ \BemNature, \Bem521(7553), 436–444.
- [\BCAYLemonLemon2008] Lemon, O. \BBOP2008\BBCP. \BBOQAdaptive Natural Language Generation in Dialogue using Reinforcement Learning\BBCQ In \BemProc. LONDIAL’08, \BPGS 141–148.
- [\BCAYLemonLemon2011] Lemon, O. \BBOP2011\BBCP. \BBOQLearning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation\BBCQ \BemComputer Speech and Language, \Bem25(2), 210–221.
- [\BCAYLeppänen, Munezero, Granroth-Wilding, \BBA ToivonenLeppänen et al.2017] Leppänen, L., Munezero, M., Granroth-Wilding, M., \BBA Toivonen, H. \BBOP2017\BBCP. \BBOQData-Driven News Generation for Automated Journalism\BBCQ In \BemProc. INLG’17, \BPGS 188–197.
- [\BCAYLester \BBA PorterLester \BBA Porter1997] Lester, J. C.\BBACOMMA \BBA Porter, B. W. \BBOP1997\BBCP. \BBOQDeveloping and Empirically Evaluating Robust Explanation Generators: The KNIGHT Experiments\BBCQ \BemComputational Linguistics, \Bem23(1), 65–101.
- [\BCAYLeveltLevelt1989] Levelt, W. \BBOP1989\BBCP. \BemSpeaking: From Intention to Articulation. MIT Press, Cambridge, MA.
- [\BCAYLeveltLevelt1999] Levelt, W. \BBOP1999\BBCP. \BBOQProducing spoken language: a blueprint of the speaker\BBCQ In Brown, C.\BBACOMMA \BBA Hagoort, P.\BEDS, \BemThe Neurocognition of Language, \BPGS 83–122. Oxford University Press, Oxford and London.
- [\BCAYLevelt, Roelofs, \BBA MeyerLevelt et al.1999] Levelt, W., Roelofs, A., \BBA Meyer, A. S. \BBOP1999\BBCP. \BBOQA theory of lexical access in speech production\BBCQ \BemBehavioral and Brain Sciences, \Bem22(1), 1–75.
- [\BCAYLevenshteinLevenshtein1966] Levenshtein, V. I. \BBOP1966\BBCP. \BBOQBinary codes capable of correcting deletions, insertions, and reversals\BBCQ \BemSoviet Physics Doklady, \Bem10(8), 707–710.
- [\BCAYLewis \BBA CatlettLewis \BBA Catlett1994] Lewis, D. D.\BBACOMMA \BBA Catlett, J. \BBOP1994\BBCP. \BBOQHeterogeneous uncertainty sampling for supervised learning\BBCQ In \BemProc. ICML’94, \BPGS 148–156.
- [\BCAYLi, Galley, Brockett, Spithourakis, Gao, \BBA DolanLi et al.2016] Li, J., Galley, M., Brockett, C., Spithourakis, G. P., Gao, J., \BBA Dolan, B. \BBOP2016\BBCP. \BBOQA Persona-Based Neural Conversation Model\BBCQ In \BemProc. ACL’16, \BPGS 994–1003.
- [\BCAYLi, Kulkarni, Berg, Berg, \BBA ChoiLi et al.2011] Li, S., Kulkarni, G., Berg, T., Berg, A. C., \BBA Choi, Y. \BBOP2011\BBCP. \BBOQComposing simple image descriptions using web-scale n-grams\BBCQ In \BemProc. CoNLL’11, \BPGS 220–228.
- [\BCAYLiang, Jordan, \BBA KleinLiang et al.2009] Liang, P., Jordan, M. I., \BBA Klein, D. \BBOP2009\BBCP. \BBOQLearning Semantic Correspondences with Less Supervision\BBCQ In \BemProc. ACL-IJCNLP’09, \BPGS 91–99.
- [\BCAYLin \BBA HovyLin \BBA Hovy2003] Lin, C.-Y.\BBACOMMA \BBA Hovy, E. H. \BBOP2003\BBCP. \BBOQAutomatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics\BBCQ In \BemProc. HLT-NAACL’03, \BPGS 71–78.
- [\BCAYLin \BBA OchLin \BBA Och2004] Lin, C.-Y.\BBACOMMA \BBA Och, F. J. \BBOP2004\BBCP. \BBOQAutomatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics\BBCQ In \BemProc. ACL’04, \BPGS 605–612.
- [\BCAYLin \BBA KongLin \BBA Kong2015] Lin, D.\BBACOMMA \BBA Kong, C. \BBOP2015\BBCP. \BBOQGenerating Multi-sentence Natural Language Descriptions of Indoor Scenes\BBCQ In \BemProc. BMVC’15, \BPGS 1–13.
- [\BCAYLin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, \BBA ZitnickLin et al.2014] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., \BBA Zitnick, C. L. \BBOP2014\BBCP. \BBOQMicrosoft COCO: Common objects in context\BBCQ In \BemProc. ECCV’14, \BVOL 8693 LNCS, \BPGS 740–755. Springer.
- [\BCAYLipschultz, Litman, Jordan, \BBA KatzLipschultz et al.2011] Lipschultz, M., Litman, D. J., Jordan, P. W., \BBA Katz, S. \BBOP2011\BBCP. \BBOQPredicting Changes in Level of Abstraction in Tutor Responses to Students\BBCQ In \BemProc. FLAIRS’11, \BPGS 525–530.
- [\BCAYLipton, Vikram, \BBA McAuleyLipton et al.2016] Lipton, Z. C., Vikram, S., \BBA McAuley, J. \BBOP2016\BBCP. \BBOQGenerative Concatenative Nets Jointly Learn to Write and Classify Reviews\BBCQ \BemarXiv preprint, \Bem1511.03683, 1–11.
- [\BCAYLoweLowe2004] Lowe, D. G. \BBOP2004\BBCP. \BBOQDistinctive image features from scale invariant keypoints\BBCQ \BemInternational Journal of Computer Vision, \Bem60, 91–110.
- [\BCAYLukin \BBA WalkerLukin \BBA Walker2013] Lukin, S.\BBACOMMA \BBA Walker, M. A. \BBOP2013\BBCP. \BBOQReally? Well. Apparently Bootstrapping Improves the Performance of Sarcasm and Nastiness Classifiers for Online Dialogue\BBCQ In \BemProc. LSM’13, \BPGS 30–40.
- [\BCAYLuong, Le, Sutskever, Vinyals, \BBA KaiserLuong et al.2015] Luong, M.-T., Le, Q. V., Sutskever, I., Vinyals, O., \BBA Kaiser, L. \BBOP2015\BBCP. \BBOQMulti-Task Sequence to Sequence Learning\BBCQ \BemarXiv preprint, \Bem1511.06114.
- [\BCAYLuong, Socher, \BBA ManningLuong et al.2013] Luong, M.-T., Socher, R., \BBA Manning, C. D. \BBOP2013\BBCP. \BBOQBetter Word Representations with Recursive Neural Networks for Morphology\BBCQ In \BemProc. CoNLL’13, \BPGS 104–113.
- [\BCAYLutzLutz1959] Lutz, T. \BBOP1959\BBCP. \BBOQStochastische texte\BBCQ \BemAugenblick, \Bem4(1), 3–9.
- [\BCAYMacdonald \BBA SiddharthanMacdonald \BBA Siddharthan2016] Macdonald, I.\BBACOMMA \BBA Siddharthan, A. \BBOP2016\BBCP. \BBOQSummarising news stories for children\BBCQ In \BemProc. INLG’16, \BPGS 1–10.
- [\BCAYMahamood \BBA ReiterMahamood \BBA Reiter2011] Mahamood, S.\BBACOMMA \BBA Reiter, E. \BBOP2011\BBCP. \BBOQGenerating Affective Natural Language for Parents of Neonatal Infants\BBCQ In \BemProc. ENLG’11, \BPGS 12–21.
- [\BCAYMairesse, Gasic, Jurcicek, Keizer, Thompson, Yu, \BBA YoungMairesse et al.2010] Mairesse, F., Gasic, M., Jurcicek, F., Keizer, S., Thompson, B., Yu, K., \BBA Young, S. \BBOP2010\BBCP. \BBOQPhrase-based statistical language generation using graphical models and active learning\BBCQ In \BemProc. ACL’10, \BPGS 1552–1561.
- [\BCAYMairesse \BBA WalkerMairesse \BBA Walker2010] Mairesse, F.\BBACOMMA \BBA Walker, M. A. \BBOP2010\BBCP. \BBOQTowards personality-based user adaptation: Psychologically informed stylistic language generation\BBCQ \BemUser Modelling and User-Adapted Interaction, \Bem20(3), 227–278.
- [\BCAYMairesse \BBA WalkerMairesse \BBA Walker2011] Mairesse, F.\BBACOMMA \BBA Walker, M. A. \BBOP2011\BBCP. \BBOQControlling User Perceptions of Linguistic Style: Trainable Generation of Personality Traits\BBCQ \BemComputational Linguistics, \Bem37(3), 455–488.
- [\BCAYMairesse \BBA YoungMairesse \BBA Young2014] Mairesse, F.\BBACOMMA \BBA Young, S. \BBOP2014\BBCP. \BBOQStochastic language generation in dialogue using factored language models\BBCQ \BemComputational Linguistics, \Bem40(4), 763–799.
- [\BCAYMalinowski, Rohrbach, \BBA FritzMalinowski et al.2016] Malinowski, M., Rohrbach, M., \BBA Fritz, M. \BBOP2016\BBCP. \BBOQAsk your neurons: A neural-based approach to answering questions about images\BBCQ In \BemProc. ICCV’15, \BPGS 1–9.
- [\BCAYManiMani2001] Mani, I. \BBOP2001\BBCP. \BemAutomatic Summarization. John Benjamins Publishing Company, Amsterdam.
- [\BCAYManiMani2010] Mani, I. \BBOP2010\BBCP. \BemThe Imagined Moment: Time, Narrative and Computation. University of Nebraska Press, Lincoln, NE.
- [\BCAYManiMani2013] Mani, I. \BBOP2013\BBCP. \BemComputational Modeling of Narrative. Morgan and Claypool Publishers, USA.
- [\BCAYMann \BBA MatthiessenMann \BBA Matthiessen1983] Mann, W. C.\BBACOMMA \BBA Matthiessen, C. M. \BBOP1983\BBCP. \BBOQNigel: A systemic grammar for text generation (Technical Report RR-83-105)\BBCQ \BTR, ISI, University of Southern California, Marina del Rey, CA.
- [\BCAYMann \BBA MooreMann \BBA Moore1981] Mann, W. C.\BBACOMMA \BBA Moore, J. A. \BBOP1981\BBCP. \BBOQComputer generation of multiparagraph text\BBCQ \BemAmerican Journal of Computational Linguistics, \Bem7(1), 17–29.
- [\BCAYMann \BBA ThompsonMann \BBA Thompson1988] Mann, W. C.\BBACOMMA \BBA Thompson, S. A. \BBOP1988\BBCP. \BBOQRhetorical structure theory: Toward a functional theory of text organization\BBCQ \BemText, \Bem8(3), 243–281.
- [\BCAYManningManning2015] Manning, C. D. \BBOP2015\BBCP. \BBOQLast words: Computational linguistics and deep learning\BBCQ \BemComputational Linguistics, \Bem41, 701–707.
- [\BCAYManurung, Ritchie, \BBA ThompsonManurung et al.2012] Manurung, R., Ritchie, G. D., \BBA Thompson, H. \BBOP2012\BBCP. \BBOQUsing genetic algorithms to create meaningful poetic text\BBCQ \BemJournal of Experimental & Theoretical Artificial Intelligence, \Bem24(1), 43–64.
- [\BCAYMao, Huang, Toshev, Camburu, Yuille, \BBA MurphyMao et al.2016] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., \BBA Murphy, K. \BBOP2016\BBCP. \BBOQGeneration and Comprehension of Unambiguous Object Descriptions\BBCQ In \BemProc. CVPR’16, \BPGS 11–22.
- [\BCAYMao, Xu, Yang, Wang, Huang, \BBA YuilleMao et al.2015a] Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., \BBA Yuille, A. \BBOP2015a\BBCP. \BBOQDeep Captioning with Multimodal Recurrent Neural Networks (m-RNN)\BBCQ \BemarXiv preprint, \Bem1412.6632.
- [\BCAYMao, Xu, Yang, Wang, Huang, \BBA YuilleMao et al.2015b] Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., \BBA Yuille, A. \BBOP2015b\BBCP. \BBOQLearning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images\BBCQ In \BemProc. ICCV’15, \BPGS 2533–2541.
- [\BCAYMarciniak \BBA StrubeMarciniak \BBA Strube2004] Marciniak, T.\BBACOMMA \BBA Strube, M. \BBOP2004\BBCP. \BBOQClassification-based generation using TAG\BBCQ In \BemProc. INLG’04, \BPGS 100–109.
- [\BCAYMarciniak \BBA StrubeMarciniak \BBA Strube2005] Marciniak, T.\BBACOMMA \BBA Strube, M. \BBOP2005\BBCP. \BBOQBeyond the Pipeline: Discrete Optimization in NLP\BBCQ In \BemProc. CoNLL’05, \BPGS 136–143.
- [\BCAYMartinMartin1990] Martin, J. H. \BBOP1990\BBCP. \BemA Computational Model of Metaphor Interpretation. Academic Press, New York.
- [\BCAYMartinMartin1994] Martin, J. H. \BBOP1994\BBCP. \BBOQMetabank: A knowledge-base of metaphoric language conventions\BBCQ \BemComputational Intelligence, \Bem10(2), 134–149.
- [\BCAYMartinez, Yannakakis, \BBA HallamMartinez et al.2014] Martinez, H. P., Yannakakis, G. N., \BBA Hallam, J. \BBOP2014\BBCP. \BBOQDon’t classify ratings of affect; Rank Them!\BBCQ \BemIEEE Transactions on Affective Computing, \Bem5(3), 314–326.
- [\BCAYMason \BBA CharniakMason \BBA Charniak2014] Mason, R.\BBACOMMA \BBA Charniak, E. \BBOP2014\BBCP. \BBOQDomain-Specific Image Captioning\BBCQ In \BemProc. CoNLL’14, \BPGS 11–20.
- [\BCAYMason \BBA SuriMason \BBA Suri2012] Mason, W.\BBACOMMA \BBA Suri, S. \BBOP2012\BBCP. \BBOQConducting behavioral research on Amazon’s Mechanical Turk\BBCQ \BemBehavior Research Methods, \Bem44(1), 1–23.
- [\BCAYMay \BBA PriyadarshiMay \BBA Priyadarshi2017] May, J.\BBACOMMA \BBA Priyadarshi, J. \BBOP2017\BBCP. \BBOQSemEval-2017 Task 9: Abstract Meaning Representation Parsing and Generation\BBCQ In \BemProc. SemEval’17, \BPGS 536–545.
- [\BCAYMcCoy \BBA StrubeMcCoy \BBA Strube1999] McCoy, K. F.\BBACOMMA \BBA Strube, M. \BBOP1999\BBCP. \BBOQGenerating Anaphoric Expressions: Pronoun or Definite Description?\BBCQ In Cristea, D., Ide, N., \BBA Marcu, D.\BEDS, \BemThe Relation of Discourse/Dialogue Structure and Reference: Proceedings of the Workshop held in conjunction with ACL’99, \BPGS 63–71.
- [\BCAYMcDermottMcDermott2000] McDermott, D. \BBOP2000\BBCP. \BBOQThe 1998 AI planning systems competition\BBCQ \BemAI magazine, \Bem21(2), 1–33.
- [\BCAYMcDonaldMcDonald1993] McDonald, D. D. \BBOP1993\BBCP. \BBOQIssues in the Choice of a Source for Natural Language Generation\BBCQ \BemComputational Linguistics, \Bem19(1), 191–197.
- [\BCAYMcDonaldMcDonald2010] McDonald, D. D. \BBOP2010\BBCP. \BBOQNatural language generation\BBCQ In Indurkhya, N.\BBACOMMA \BBA Damerau, F.\BEDS, \BemHandbook of Natural Language Processing (2nd \BEd)., \BPGS 121–144. Chapman and Hall/CRC, London.
- [\BCAYMcDonald \BBA PustejovskyMcDonald \BBA Pustejovsky1985] McDonald, D. D.\BBACOMMA \BBA Pustejovsky, J. D. \BBOP1985\BBCP. \BBOQA computational theory of prose style for natural language generation\BBCQ In \BemProc. EACL’85, \BPGS 187–193.
- [\BCAYMcIntyre \BBA LapataMcIntyre \BBA Lapata2009] McIntyre, N.\BBACOMMA \BBA Lapata, M. \BBOP2009\BBCP. \BBOQLearning to Tell Tales: A Data-driven Approach to Story Generation\BBCQ In \BemProc. ACL-IJCNLP’09, \BPGS 217–225.
- [\BCAYMcKeownMcKeown1985] McKeown, K. R. \BBOP1985\BBCP. \BemText Generation. Cambridge University Press, Cambridge, UK.
- [\BCAYMcRoy, Channarukul, \BBA AliMcRoy et al.2003] McRoy, S. W., Channarukul, S., \BBA Ali, S. S. \BBOP2003\BBCP. \BBOQAn augmented template-based approach to text realization\BBCQ \BemNatural Language Engineering, \Bem9(4), 381–420.
- [\BCAYMeehanMeehan1977] Meehan, J. R. \BBOP1977\BBCP. \BBOQTALE-SPIN, An Interactive Program that Writes Stories\BBCQ In \BemProc. IJCAI’77, \BPGS 91–98.
- [\BCAYMei, Bansal, \BBA WalterMei et al.2016] Mei, H., Bansal, M., \BBA Walter, M. R. \BBOP2016\BBCP. \BBOQWhat to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment\BBCQ In \BemProc. NAACL-HLT’16, \BPGS 720–730.
- [\BCAYMeisterMeister2003] Meister, J. C. \BBOP2003\BBCP. \BemComputing Action. A Narratological Approach. Mouton de Gruyter, Berlin.
- [\BCAYMellish \BBA DaleMellish \BBA Dale1998] Mellish, C.\BBACOMMA \BBA Dale, R. \BBOP1998\BBCP. \BBOQEvaluation in the context of natural language generation\BBCQ \BemComputer Speech & Language, \Bem12(4), 349–373.
- [\BCAYMellish, Scott, Cahill, Paiva, Evans, \BBA ReapeMellish et al.2006] Mellish, C., Scott, D., Cahill, L., Paiva, D. S., Evans, R., \BBA Reape, M. \BBOP2006\BBCP. \BBOQA Reference Architecture for Natural Language Generation Systems\BBCQ \BemNatural Language Engineering, \Bem12(1), 1–34.
- [\BCAYMeteerMeteer1991] Meteer, M. W. \BBOP1991\BBCP. \BBOQBridging the generation gap between text planning and linguistic realization\BBCQ \BemComputational Intelligence, \Bem7(4), 296–304.
- [\BCAYMeteer, McDonald, Anderson, Forster, Gay, Huettner, \BBA SibunMeteer et al.1987] Meteer, M. W., McDonald, D. D., Anderson, S., Forster, D., Gay, L., Huettner, A., \BBA Sibun, P. \BBOP1987\BBCP. \BBOQMumble-86: Design and Implementation (Technical Report COINS 87-87)\BBCQ \BTR, University of Massachusetts at Amherst, Amherst, MA.
- [\BCAYMikolov, Chen, Corrado, \BBA DeanMikolov et al.2013] Mikolov, T., Chen, K., Corrado, G., \BBA Dean, J. \BBOP2013\BBCP. \BBOQDistributed Representations of Words and Phrases and their Compositionality\BBCQ In \BemProc. NIPS’13, \BPGS 3111–3119.
- [\BCAYMikolov, Karafiat, Burget, Cernocky, \BBA KhudanpurMikolov et al.2010] Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., \BBA Khudanpur, S. \BBOP2010\BBCP. \BBOQRecurrent Neural Network based Language Model\BBCQ In \BemProc. Interspeech’10, \BPGS 1045–1048.
- [\BCAYMillerMiller1995] Miller, G. A. \BBOP1995\BBCP. \BBOQWordNet: a lexical database for English\BBCQ \BemCommunications of the ACM, \Bem38(11), 39–41.
- [\BCAYMitchell, Dodge, Goyal, Yamaguchi, Stratos, Han, Mensch, Berg, Berg, \BBA Daume IIIMitchell et al.2012] Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Han, X., Mensch, A., Berg, A., Berg, T., \BBA Daume III, H. \BBOP2012\BBCP. \BBOQMidge: Generating Image Descriptions From Computer Vision Detections\BBCQ In \BemProc. EACL’12, \BPGS 747–756.
- [\BCAYMitchell, van Deemter, \BBA ReiterMitchell et al.2013] Mitchell, M., van Deemter, K., \BBA Reiter, E. \BBOP2013\BBCP. \BBOQGenerating Expressions that Refer to Visible Objects\BBCQ In \BemProc. NAACL’13, \BPGS 1174–1184.
- [\BCAYMnih \BBA HintonMnih \BBA Hinton2007] Mnih, A.\BBACOMMA \BBA Hinton, G. \BBOP2007\BBCP. \BBOQThree new graphical models for statistical language modelling\BBCQ In \BemProc. ICML’07, \BPGS 641–648.
- [\BCAYMolina, Stent, \BBA ParodiMolina et al.2011] Molina, M., Stent, A., \BBA Parodi, E. \BBOP2011\BBCP. \BBOQGenerating Automated News to Explain the Meaning of Sensor Data\BBCQ In \BemProc. IDA’11, \BPGS 282–293.
- [\BCAYMontfortMontfort2007] Montfort, N. \BBOP2007\BBCP. \BBOQOrdering events in interactive fiction narratives\BBCQ In \BemProc. AAAI Fall Symposium on Intelligent Narrative Technologies, \BPGS 87–94.
- [\BCAYMontfortMontfort2013] Montfort, N. \BBOP2013\BBCP. \BemWorld clock. Harvard Book Store Press, Cambridge, MA.
- [\BCAYMoore \BBA ParisMoore \BBA Paris1993] Moore, J. D.\BBACOMMA \BBA Paris, C. \BBOP1993\BBCP. \BBOQPlanning text for advisory dialogues: Capturing intentional and rhetorical information\BBCQ \BemComputational Linguistics, \Bem19(4), 651–694.
- [\BCAYMoore, Porayska-Pomsta, Zinn, \BBA VargesMoore et al.2004] Moore, J. D., Porayska-Pomsta, K., Zinn, C., \BBA Varges, S. \BBOP2004\BBCP. \BBOQGenerating Tutorial Feedback with Affect\BBCQ In \BemProc. FLAIRS’04, \BPGS 923–928.
- [\BCAYMostafazadeh, Misra, Devlin, Mitchell, He, \BBA VanderwendeMostafazadeh et al.2016] Mostafazadeh, N., Misra, I., Devlin, J., Mitchell, M., He, X., \BBA Vanderwende, L. \BBOP2016\BBCP. \BBOQGenerating natural questions about an image\BBCQ In \BemProc. ACL’16, \BPGS 1802–1813.
- [\BCAYMrabet, Vougiouklis, Kilicoglu, Gardent, Demner-Fushman, Hare, \BBA SimperlMrabet et al.2016] Mrabet, Y., Vougiouklis, P., Kilicoglu, H., Gardent, C., Demner-Fushman, D., Hare, J., \BBA Simperl, E. \BBOP2016\BBCP. \BBOQAligning Texts and Knowledge Bases with Semantic Sentence Simplification\BBCQ In \BemProc. WebNLG’16, \BPGS 29–36.
- [\BCAYMuscat \BBA BelzMuscat \BBA Belz2015] Muscat, A.\BBACOMMA \BBA Belz, A. \BBOP2015\BBCP. \BBOQGenerating Descriptions of Spatial Relations between Objects in Images\BBCQ In \BemProc. ENLG’15, \BPGS 100–104.
- [\BCAYNakanishi, Miyao, \BBA TsujiiNakanishi et al.2005] Nakanishi, H., Miyao, Y., \BBA Tsujii, J. \BBOP2005\BBCP. \BBOQProbabilistic Models for Disambiguation of an HPSG-Based Chart Generator\BBCQ In \BemProc. IWPT’05, \BPGS 93–102.
- [\BCAYNakatsu \BBA WhiteNakatsu \BBA White2010] Nakatsu, C.\BBACOMMA \BBA White, M. \BBOP2010\BBCP. \BBOQGenerating with Discourse Combinatory Categorial Grammar\BBCQ \BemLinguistic Issues in Language Technology, \Bem4(1), 1–62.
- [\BCAYNauman, Stirling, \BBA BorthwickNauman et al.2011] Nauman, A. D., Stirling, T., \BBA Borthwick, A. \BBOP2011\BBCP. \BBOQWhat makes writing good? an essential question for teachers\BBCQ \BemThe Reading Teacher, \Bem64(5), 318–328.
- [\BCAYNemhauser \BBA WolseyNemhauser \BBA Wolsey1988] Nemhauser, G. L.\BBACOMMA \BBA Wolsey, L. A. \BBOP1988\BBCP. \BemInteger programming and combinatorial optimization. Wiley, Chichester, UK.
- [\BCAYNenkova \BBA McKeownNenkova \BBA McKeown2011] Nenkova, A.\BBACOMMA \BBA McKeown, K. R. \BBOP2011\BBCP. \BBOQAutomatic Summarization\BBCQ \BemFoundations and Trends® in Information Retrieval, \Bem5(2-3), 103–233.
- [\BCAYNenkova \BBA PassonneauNenkova \BBA Passonneau2004] Nenkova, A.\BBACOMMA \BBA Passonneau, R. \BBOP2004\BBCP. \BBOQEvaluating content selection in summarization: The pyramid method\BBCQ In \BemProc. HLT-NAACL’04, \BPGS 145–152.
- [\BCAYNetzer, Gabay, Goldberg, \BBA ElhadadNetzer et al.2009] Netzer, Y., Gabay, D., Goldberg, Y., \BBA Elhadad, M. \BBOP2009\BBCP. \BBOQGaiku: Generating Haiku with Word Associations Norms\BBCQ In \BemProc. Workshop on Computational Approaches to Linguistics Creativity, \BPGS 32–39.
- [\BCAYNiederhoffer \BBA PennebakerNiederhoffer \BBA Pennebaker2002] Niederhoffer, K. G.\BBACOMMA \BBA Pennebaker, J. W. \BBOP2002\BBCP. \BBOQLinguistic Style Matching in Social Interaction\BBCQ \BemJournal of Language and Social Psychology, \Bem21(4), 337–360.
- [\BCAYNirenburg, Lesser, \BBA NybergNirenburg et al.1989] Nirenburg, S., Lesser, V., \BBA Nyberg, E. \BBOP1989\BBCP. \BBOQControlling a language generation planner\BBCQ In \BemProc. IJCAI’89, \BPGS 1524–1530.
- [\BCAYNorrickNorrick2005] Norrick, N. R. \BBOP2005\BBCP. \BBOQThe dark side of tellability\BBCQ \BemNarrative Inquiry, \Bem15(2), 323–343.
- [\BCAYNovikova \BBA RieserNovikova \BBA Rieser2016a] Novikova, J.\BBACOMMA \BBA Rieser, V. \BBOP2016a\BBCP. \BBOQThe analogue challenge: Non aligned language generation\BBCQ In \BemProc. INLG’16, \BPGS 168–170.
- [\BCAYNovikova \BBA RieserNovikova \BBA Rieser2016b] Novikova, J.\BBACOMMA \BBA Rieser, V. \BBOP2016b\BBCP. \BBOQCrowdsourcing NLG Data: Pictures elicit better data\BBCQ In \BemProc. INLG’16, \BPGS 265–273.
- [\BCAYOberlanderOberlander1998] Oberlander, J. \BBOP1998\BBCP. \BBOQDo the Right Thing … but Expect the Unexpected\BBCQ \BemComputational Linguistics, \Bem24(3), 501–507.
- [\BCAYOberlander \BBA LascaridesOberlander \BBA Lascarides1992] Oberlander, J.\BBACOMMA \BBA Lascarides, A. \BBOP1992\BBCP. \BBOQPreventing false temporal implicatures: Interactive defaults for text generation\BBCQ In \BemProc. COLING’92, \BPGS 721–727.
- [\BCAYOberlander \BBA NowsonOberlander \BBA Nowson2006] Oberlander, J.\BBACOMMA \BBA Nowson, S. \BBOP2006\BBCP. \BBOQWhose thumb is it anyway? Classifying author personality from weblog text\BBCQ In \BemProc. COLING/ACL’06, \BPGS 627–634.
- [\BCAYOch \BBA NeyOch \BBA Ney2003] Och, F. J.\BBACOMMA \BBA Ney, H. \BBOP2003\BBCP. \BBOQA systematic comparison of various statistical alignment models\BBCQ \BemComputational linguistics, \Bem29(1), 19–51.
- [\BCAYO’DonnellO’Donnell2001] O’Donnell, M. \BBOP2001\BBCP. \BBOQILEX: an architecture for a dynamic hypertext generation system\BBCQ \BemNatural Language Engineering, \Bem7(3), 225–250.
- [\BCAYOh \BBA RudnickyOh \BBA Rudnicky2002] Oh, A. H.\BBACOMMA \BBA Rudnicky, A. I. \BBOP2002\BBCP. \BBOQStochastic natural language generation for spoken dialog systems\BBCQ \BemComputer Speech and Language, \Bem16(3-4), 387–407.
- [\BCAYOliva \BBA TorralbaOliva \BBA Torralba2001] Oliva, A.\BBACOMMA \BBA Torralba, A. \BBOP2001\BBCP. \BBOQModeling the shape of the scene: A holistic representation of the spatial envelope\BBCQ \BemInternational Journal of Computer Vision, \Bem42(3), 145–175.
- [\BCAYOrdonez, Deng, Choi, Berg, \BBA BergOrdonez et al.2013] Ordonez, V., Deng, J., Choi, Y., Berg, A. C., \BBA Berg, T. \BBOP2013\BBCP. \BBOQFrom Large Scale Image Categorization to Entry-Level Categories\BBCQ In \BemProc. ICCV’13, \BPGS 2768–2775.
- [\BCAYOrdonez, Kulkarni, \BBA BergOrdonez et al.2011] Ordonez, V., Kulkarni, G., \BBA Berg, T. \BBOP2011\BBCP. \BBOQIm2text: Describing images using 1 million captioned photographs\BBCQ In \BemProc. NIPS’11, \BPGS 1143–1151.
- [\BCAYOrdonez, Liu, Deng, Choi, Berg, \BBA BergOrdonez et al.2016] Ordonez, V., Liu, W., Deng, J., Choi, Y., Berg, A. C., \BBA Berg, T. \BBOP2016\BBCP. \BBOQLearning to name objects\BBCQ \BemCommunications of the ACM, \Bem59(3), 108–115.
- [\BCAYOrkin \BBA RoyOrkin \BBA Roy2007] Orkin, J.\BBACOMMA \BBA Roy, D. \BBOP2007\BBCP. \BBOQThe restaurant game: Learning social behavior and language from thousands of players online\BBCQ \BemJournal of Game Development, \Bem3, 39–60.
- [\BCAYOrtiz, Wolff, \BBA LapataOrtiz et al.2015] Ortiz, L. G. M., Wolff, C., \BBA Lapata, M. \BBOP2015\BBCP. \BBOQLearning to Interpret and Describe Abstract Scenes\BBCQ In \BemProc. NAACL’15, \BPGS 1505–1515.
- [\BCAYPaiva \BBA EvansPaiva \BBA Evans2005] Paiva, D. S.\BBACOMMA \BBA Evans, R. \BBOP2005\BBCP. \BBOQEmpirically-based control of natural language generation\BBCQ In \BemProc. ACL’05, \BPGS 58–65.
- [\BCAYPang \BBA LeePang \BBA Lee2008] Pang, B.\BBACOMMA \BBA Lee, L. \BBOP2008\BBCP. \BBOQOpinion Mining and Sentiment Analysis\BBCQ \BemFoundations and Trends in Information Retrieval, \Bem1(2), 1–135.
- [\BCAYPapineni, Roukos, Ward, \BBA ZhuPapineni et al.2002] Papineni, K., Roukos, S., Ward, T., \BBA Zhu, W.-j. \BBOP2002\BBCP. \BBOQBLEU: a Method for Automatic Evaluation of Machine Translation\BBCQ In \BemProc. ACL’02, \BPGS 311–318.
- [\BCAYPassonneauPassonneau2006] Passonneau, R. J. \BBOP2006\BBCP. \BBOQMeasuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation\BBCQ In \BemProc. LREC’06, \BPGS 831–836.
- [\BCAYPennebaker, Booth, \BBA FrancisPennebaker et al.2007] Pennebaker, J. W., Booth, R. J., \BBA Francis, M. E. \BBOP2007\BBCP. \BemLinguistic Inquiry and Word Count (LIWC2007): A text analysis program. Austin, TX.
- [\BCAYPennington, Socher, \BBA ManningPennington et al.2014] Pennington, J., Socher, R., \BBA Manning, C. D. \BBOP2014\BBCP. \BBOQGloVe: Global Vectors for Word Representation\BBCQ In \BemProc. EMNLP’14, \BPGS 1532–1543.
- [\BCAYPérez, Ortiz, Luna, Negrete, Castellanos, Peñalosa, \BBA ÁvilaPérez et al.2011] Pérez, R., Ortiz, O., Luna, W., Negrete, S., Castellanos, V., Peñalosa, E., \BBA Ávila, R. \BBOP2011\BBCP. \BBOQA System for Evaluating Novelty in Computer Generated Narratives\BBCQ In \BemProc. ICCC’11, \BPGS 63–68.
- [\BCAYPetrovic \BBA MatthewsPetrovic \BBA Matthews2013] Petrovic, S.\BBACOMMA \BBA Matthews, D. \BBOP2013\BBCP. \BBOQUnsupervised joke generation from big data\BBCQ In \BemProc. ACL’13, \BPGS 228–232.
- [\BCAYPickering \BBA GarrodPickering \BBA Garrod2004] Pickering, M. J.\BBACOMMA \BBA Garrod, S. \BBOP2004\BBCP. \BBOQToward a mechanistic psychology of dialogue\BBCQ \BemBehavioral and Brain Sciences, \Bem27(2), 169–226.
- [\BCAYPickering \BBA GarrodPickering \BBA Garrod2013] Pickering, M. J.\BBACOMMA \BBA Garrod, S. \BBOP2013\BBCP. \BBOQAn integrated theory of language production and comprehension\BBCQ \BemBehavioral and Brain Sciences, \Bem36(4), 329–347.
- [\BCAYPiwekPiwek2003] Piwek, P. \BBOP2003\BBCP. \BBOQAn annotated bibliography of affective natural language generation\BBCQ \BTR, ITRI, University of Brighton.
- [\BCAYPiwek \BBA BoyerPiwek \BBA Boyer2012] Piwek, P.\BBACOMMA \BBA Boyer, K. E. \BBOP2012\BBCP. \BBOQVarieties of question generation: Introduction to this special issue\BBCQ \BemDialogue and Discourse, \Bem3(2), 1–9.
- [\BCAYPlachouras, Smiley, Bretz, Taylor, Leidner, Song, \BBA SchilderPlachouras et al.2016] Plachouras, V., Smiley, C., Bretz, H., Taylor, O., Leidner, J. L., Song, D., \BBA Schilder, F. \BBOP2016\BBCP. \BBOQInteracting with financial data using natural language\BBCQ In \BemProc. SIGIR’16, \BPGS 1121–1124.
- [\BCAYPoesio, Stevenson, Di Eugenio, \BBA HitzemanPoesio et al.2004] Poesio, M., Stevenson, R., Di Eugenio, B., \BBA Hitzeman, J. \BBOP2004\BBCP. \BBOQCentering: A parametric theory and its instantiations\BBCQ \BemComputational Linguistics, \Bem30(3), 309–363.
- [\BCAYPonnamperuma, Siddharthan, Zeng, Mellish, \BBA van der WalPonnamperuma et al.2013] Ponnamperuma, K., Siddharthan, A., Zeng, C., Mellish, C., \BBA van der Wal, R. \BBOP2013\BBCP. \BBOQTag2Blog: Narrative Generation from Satellite Tag Data\BBCQ In \BemProc. ACL’13, \BPGS 169–174.
- [\BCAYPortet, Reiter, Gatt, Hunter, Sripada, Freer, \BBA SykesPortet et al.2009] Portet, F., Reiter, E., Gatt, A., Hunter, J. R., Sripada, S., Freer, Y., \BBA Sykes, C. \BBOP2009\BBCP. \BBOQAutomatic generation of textual summaries from neonatal intensive care data\BBCQ \BemArtificial Intelligence, \Bem173(7-8), 789–816.
- [\BCAYPower, Scott, \BBA Bouayad-AghaPower et al.2003] Power, R., Scott, D., \BBA Bouayad-Agha, N. \BBOP2003\BBCP. \BBOQDocument Structure\BBCQ \BemComputational Linguistics, \Bem29(2), 211–260.
- [\BCAYPower \BBA WilliamsPower \BBA Williams2012] Power, R.\BBACOMMA \BBA Williams, S. \BBOP2012\BBCP. \BBOQGenerating numerical approximations\BBCQ \BemComputational Linguistics, \Bem38(1), 113–134.
- [\BCAYProppPropp1968] Propp, V. \BBOP1968\BBCP. \BemMorphology of the Folk Tale. University of Texas Press, Austin, TX.
- [\BCAYRajkumar \BBA WhiteRajkumar \BBA White2011] Rajkumar, R.\BBACOMMA \BBA White, M. \BBOP2011\BBCP. \BBOQLinguistically Motivated Complementizer Choice in Surface Realization\BBCQ In \BemProc. UCNLG+Eval’11, \BPGS 39–44.
- [\BCAYRajkumar \BBA WhiteRajkumar \BBA White2014] Rajkumar, R.\BBACOMMA \BBA White, M. \BBOP2014\BBCP. \BBOQBetter Surface Realization through Psycholinguistics\BBCQ \BemLanguage and Linguistics Compass, \Bem8(10), 428–448.
- [\BCAYRamos-Soto, Bugarin, Barro, \BBA TaboadaRamos-Soto et al.2015] Ramos-Soto, A., Bugarin, A. J., Barro, S., \BBA Taboada, J. \BBOP2015\BBCP. \BBOQLinguistic Descriptions for Automatic Generation of Textual Short-Term Weather Forecasts on Real Prediction Data\BBCQ \BemIEEE Transactions on Fuzzy Systems, \Bem23(1), 44–57.
- [\BCAYRatnaparkhiRatnaparkhi1996] Ratnaparkhi, A. \BBOP1996\BBCP. \BBOQA maximum entropy model for part-of-speech tagging\BBCQ In \BemProc. EMNLP’96, \BPGS 133–142.
- [\BCAYRatnaparkhiRatnaparkhi2000] Ratnaparkhi, A. \BBOP2000\BBCP. \BBOQTrainable methods for surface natural language generation\BBCQ In \BemProc. NAACL’00, \BPGS 194–201.
- [\BCAYReape \BBA MellishReape \BBA Mellish1999] Reape, M.\BBACOMMA \BBA Mellish, C. \BBOP1999\BBCP. \BBOQJust what is aggregation anyway?\BBCQ In \BemProc. ENLG’99, \BPGS 20–29.
- [\BCAYRegneri, Rohrbach, Wetzel, \BBA ThaterRegneri et al.2013] Regneri, M., Rohrbach, M., Wetzel, D., \BBA Thater, S. \BBOP2013\BBCP. \BBOQGrounding Action Descriptions in Videos\BBCQ \BemTransactions of the Association for Computational Linguistics, \Bem1, 25–36.
- [\BCAYReiterReiter1994] Reiter, E. \BBOP1994\BBCP. \BBOQHas a consensus NL generation architecture appeared, and is it psycholinguistically plausible?\BBCQ In \BemProc. IWNLG’94, \BPGS 163–170.
- [\BCAYReiterReiter2000] Reiter, E. \BBOP2000\BBCP. \BBOQPipelines and Size Constraints\BBCQ \BemComputational Linguistics, \Bem26(2), 251–259.
- [\BCAYReiterReiter2007] Reiter, E. \BBOP2007\BBCP. \BBOQAn architecture for data-to-text systems\BBCQ In \BemProc. ENLG’07, \BPGS 97–104.
- [\BCAYReiterReiter2010] Reiter, E. \BBOP2010\BBCP. \BBOQNatural Language Generation\BBCQ In Clark, A., Fox, C., \BBA Lappin, S.\BEDS, \BemHandbook of Computational Linguistics and Natural Language Processing, \BPGS 574–598. Wiley, Oxford.
- [\BCAYReiter \BBA BelzReiter \BBA Belz2009] Reiter, E.\BBACOMMA \BBA Belz, A. \BBOP2009\BBCP. \BBOQAn Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems\BBCQ \BemComputational Linguistcs, \Bem35(4), 529–558.
- [\BCAYReiter \BBA DaleReiter \BBA Dale1997] Reiter, E.\BBACOMMA \BBA Dale, R. \BBOP1997\BBCP. \BBOQBuilding natural-language generation systems\BBCQ \BemNatural Language Engineering, \Bem3, 57–87.
- [\BCAYReiter \BBA DaleReiter \BBA Dale2000] Reiter, E.\BBACOMMA \BBA Dale, R. \BBOP2000\BBCP. \BemBuilding Natural Language Generation Systems. Cambridge University Press, Cambridge, UK.
- [\BCAYReiter, Gatt, Portet, \BBA van Der MeulenReiter et al.2008] Reiter, E., Gatt, A., Portet, F., \BBA van Der Meulen, M. \BBOP2008\BBCP. \BBOQThe Importance of Narrative and Other Lessons from an Evaluation of an NLG System that Summarises Clinical Data\BBCQ In \BemProc. INLG’08, \BPGS 147–155.
- [\BCAYReiter, Mellish, \BBA LevineReiter et al.1995] Reiter, E., Mellish, C., \BBA Levine, J. \BBOP1995\BBCP. \BBOQAutomatic Generation of Technical Documentation\BBCQ \BemApplied Artificial Intelligence, \Bem9, 259–287.
- [\BCAYReiter, Robertson, \BBA OsmanReiter et al.2003] Reiter, E., Robertson, R., \BBA Osman, L. M. \BBOP2003\BBCP. \BBOQLessons from a failure: Generating tailored smoking cessation letters\BBCQ \BemArtificial Intelligence, \Bem144(1-2), 41–58.
- [\BCAYReiter \BBA SripadaReiter \BBA Sripada2002] Reiter, E.\BBACOMMA \BBA Sripada, S. \BBOP2002\BBCP. \BBOQShould corpora texts be gold standards for NLG?\BBCQ In \BemProc. INLG’02, \BPGS 97–104.
- [\BCAYReiter, Sripada, Hunter, Yu, \BBA DavyReiter et al.2005] Reiter, E., Sripada, S., Hunter, J. R., Yu, J., \BBA Davy, I. \BBOP2005\BBCP. \BBOQChoosing words in computer-generated weather forecasts\BBCQ \BemArtificial Intelligence, \Bem167(1-2), 137–169.
- [\BCAYRiedl \BBA YoungRiedl \BBA Young2005] Riedl, M. O.\BBACOMMA \BBA Young, R. M. \BBOP2005\BBCP. \BBOQAn objective character believability evaluation procedure for multi-agent story generation systems\BBCQ In Panayiotopoulos, T., Gratch, J., Aylett, R., Ballin, D., Olivier, P., \BBA Thomas Rist\BEDS, \BemProc. 5th International Conference on Intelligent Virtual Agents.
- [\BCAYRiedl \BBA YoungRiedl \BBA Young2010] Riedl, M. O.\BBACOMMA \BBA Young, R. M. \BBOP2010\BBCP. \BBOQNarrative planning: Balancing plot and character\BBCQ \BemJournal of Artificial Intelligence Research, \Bem39, 217–268.
- [\BCAYRieser, Keizer, Liu, \BBA LemonRieser et al.2011] Rieser, V., Keizer, S., Liu, X., \BBA Lemon, O. \BBOP2011\BBCP. \BBOQAdaptive Information Presentation for Spoken Dialogue Systems : Evaluation with human subjects\BBCQ In \BemProc. ENLG’11, \BPGS 102–109.
- [\BCAYRieser \BBA LemonRieser \BBA Lemon2009] Rieser, V.\BBACOMMA \BBA Lemon, O. \BBOP2009\BBCP. \BBOQNatural Language Generation as Planning Under Uncertainty for Spoken Dialogue Systems\BBCQ In \BemEACL’09, \BPGS 683–691.
- [\BCAYRieser \BBA LemonRieser \BBA Lemon2011] Rieser, V.\BBACOMMA \BBA Lemon, O. \BBOP2011\BBCP. \BemReinforcement Learning for Adaptive Dialogue Systems. Springer, Berlin and Heidelberg.
- [\BCAYRitchieRitchie2009] Ritchie, G. D. \BBOP2009\BBCP. \BBOQCan computers create humor?\BBCQ \BemAI Magazine, \Bem30(3), 71–81.
- [\BCAYRitter, Cherry, \BBA DolanRitter et al.2011] Ritter, A., Cherry, C., \BBA Dolan, W. B. \BBOP2011\BBCP. \BBOQData-driven response generation in social media\BBCQ In \BemProc. EMNLP’11, \BPGS 583–593.
- [\BCAYRobinRobin1993] Robin, J. \BBOP1993\BBCP. \BBOQA Revision-Based Generation Architecture for Reporting Facts in their Historical Context\BBCQ In Horacek, H.\BBACOMMA \BBA Zock, M.\BEDS, \BemNew Concepts in Natural Language Generation: Planning, Realization and Systems, \BPGS 238–268. Pinter, London.
- [\BCAYRoyRoy2002] Roy, D. \BBOP2002\BBCP. \BBOQLearning visually grounded words and syntax for a scene description task\BBCQ \BemComputer Speech and Language, \Bem16(3-4), 353–385.
- [\BCAYRoy \BBA ReiterRoy \BBA Reiter2005] Roy, D.\BBACOMMA \BBA Reiter, E. \BBOP2005\BBCP. \BBOQConnecting language to the world\BBCQ \BemArtificial Intelligence, \Bem167(1-2), 1–12.
- [\BCAYRuderRuder2017] Ruder, S. \BBOP2017\BBCP. \BBOQTransfer learning: Machine learning’s next frontier\BBCQ http://ruder.io/transfer-learning/.
- [\BCAYRus, Piwek, Stoyanchev, Wyse, Lintean, \BBA MoldovanRus et al.2011] Rus, V., Piwek, P., Stoyanchev, S., Wyse, B., Lintean, M., \BBA Moldovan, C. \BBOP2011\BBCP. \BBOQQuestion generation shared task and evaluation challenge: status report\BBCQ In \BemProc. ENLG’11, \BPGS 318–320.
- [\BCAYRus, Wyse, Piwek, Lintean, Stoyanchev, \BBA MoldovanRus et al.2010] Rus, V., Wyse, B., Piwek, P., Lintean, M., Stoyanchev, S., \BBA Moldovan, C. \BBOP2010\BBCP. \BBOQOverview of the first question generation shared task evaluation challenge\BBCQ In \BemProc. 3rd Workshop on Question Generation, \BPGS 45–57.
- [\BCAYSchwartz, Eichstaedt, Kern, Dziurzynski, Ramones, Agrawal, Shah, Kosinski, Stillwell, Seligman, \BBA UngarSchwartz et al.2013] Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M. E. P., \BBA Ungar, L. H. \BBOP2013\BBCP. \BBOQPersonality, gender, and age in the language of social media: the open-vocabulary approach\BBCQ \BemPloS one, \Bem8(9), 1–16.
- [\BCAYSchwenk \BBA GauvainSchwenk \BBA Gauvain2005] Schwenk, H.\BBACOMMA \BBA Gauvain, J.-l. \BBOP2005\BBCP. \BBOQTraining Neural Network Language Models\BBCQ In \BemProc. EMNLP/HLT’05, \BPGS 201–208.
- [\BCAYScott \BBA Sieckenius de SouzaScott \BBA Sieckenius de Souza1990] Scott, D.\BBACOMMA \BBA Sieckenius de Souza, C. \BBOP1990\BBCP. \BBOQGetting the message across in RST-based text generation\BBCQ In Dale, R., Mellish, C., \BBA Zock, M.\BEDS, \BemCurrent research in natural language generation, \BPGS 47–73. Academic Press Professional, Inc., San Diego, CA.
- [\BCAYSearleSearle1969] Searle, J. R. \BBOP1969\BBCP. \BemSpeech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, UK.
- [\BCAYSerban, Sordoni, Bengio, Courville, \BBA PineauSerban et al.2016] Serban, I. V., Sordoni, A., Bengio, Y., Courville, A., \BBA Pineau, J. \BBOP2016\BBCP. \BBOQBuilding End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models\BBCQ In \BemProc. AAAI’16, \BPGS 3776–3784.
- [\BCAYShawShaw1998] Shaw, J. \BBOP1998\BBCP. \BBOQClause aggregation using linguistic knowledge\BBCQ In \BemProc. IWNLG’98, \BPGS 138–148.
- [\BCAYSheikha \BBA InkpenSheikha \BBA Inkpen2011] Sheikha, F. A.\BBACOMMA \BBA Inkpen, D. \BBOP2011\BBCP. \BBOQGeneration of Formal and Informal Sentences\BBCQ In \BemProc. ENLG’11, \BPGS 187–193.
- [\BCAYShutova, Teufel, \BBA KorhonenShutova et al.2012] Shutova, E., Teufel, S., \BBA Korhonen, A. \BBOP2012\BBCP. \BBOQStatistical Metaphor Processing\BBCQ \BemComputational Linguistics, \Bem2(2013), 301–353.
- [\BCAYSiddharthanSiddharthan2014] Siddharthan, A. \BBOP2014\BBCP. \BBOQA survey of research on text simplification\BBCQ \BemInternational Journal of Applied Linguistics, \Bem165(2), 259–298.
- [\BCAYSiddharthan, Green, van Deemter, Mellish, \BBA van der WalSiddharthan et al.2013] Siddharthan, A., Green, M., van Deemter, K., Mellish, C., \BBA van der Wal, R. \BBOP2013\BBCP. \BBOQBlogging birds: Generating narratives about reintroduced species to promote public engagement\BBCQ In \BemProc. INLG’13, \BPGS 120–124.
- [\BCAYSiddharthan \BBA KatsosSiddharthan \BBA Katsos2012] Siddharthan, A.\BBACOMMA \BBA Katsos, N. \BBOP2012\BBCP. \BBOQOffline sentence processing measures for testing readability with users\BBCQ In \BemProc. PITR’12, \BPGS 17–24.
- [\BCAYSiddharthan, Nenkova, \BBA McKeownSiddharthan et al.2011] Siddharthan, A., Nenkova, A., \BBA McKeown, K. R. \BBOP2011\BBCP. \BBOQInformation Status Distinctions and Referring Expressions: An Empirical Study of References to People in News Summaries\BBCQ \BemComputational Linguistics, \Bem37(4), 811–842.
- [\BCAYSimonyan \BBA ZissermanSimonyan \BBA Zisserman2015] Simonyan, K.\BBACOMMA \BBA Zisserman, A. \BBOP2015\BBCP. \BBOQVery Deep Convolutional Networks for Large-Scale Image Recognition\BBCQ In \BemProc. ICLR’15, \BPGS 1–10.
- [\BCAYSleimi \BBA GardentSleimi \BBA Gardent2016] Sleimi, A.\BBACOMMA \BBA Gardent, C. \BBOP2016\BBCP. \BBOQGenerating Paraphrases from DBPedia using Deep Learning\BBCQ In \BemProc. WebNLG’16, \BPGS 54–57.
- [\BCAYSnover, Dorr, Schwartz, Micciulla, \BBA MakhoulSnover et al.2006] Snover, M., Dorr, B., Schwartz, R., Micciulla, L., \BBA Makhoul, J. \BBOP2006\BBCP. \BBOQA Study of Translation Edit Rate with Targeted Human Annotation\BBCQ In \BemProc. AMTA’06, \BPGS 223–231.
- [\BCAYSocher, Karpathy, Le, Manning, \BBA NgSocher et al.2014] Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., \BBA Ng, A. Y. \BBOP2014\BBCP. \BBOQGrounded Compositional Semantics for Finding and Describing Images with Sentences\BBCQ \BemTransactions of the Association for Computational Linguistics, \Bem2, 207–218.
- [\BCAYSordoni, Galley, Auli, Brockett, Ji, Mitchell, Nie, Gao, \BBA DolanSordoni et al.2015] Sordoni, A., Galley, M., Auli, M., Brockett, C., Ji, Y., Mitchell, M., Nie, J.-Y., Gao, J., \BBA Dolan, B. \BBOP2015\BBCP. \BBOQA Neural Network Approach to Context-Sensitive Generation of Conversational Responses\BBCQ In \BemProc. NAACL-HLT’15, \BPGS 196–205.
- [\BCAYSparck Jones \BBA GalliersSparck Jones \BBA Galliers1996] Sparck Jones, K.\BBACOMMA \BBA Galliers, J. R. \BBOP1996\BBCP. \BemEvaluating Natural Language Processing Systems: An Analysis and Review. Springer, Berlin and Heidelberg.
- [\BCAYSripada, Reiter, \BBA DavySripada et al.2003] Sripada, S., Reiter, E., \BBA Davy, I. \BBOP2003\BBCP. \BBOQSUMTIME-MOUSAM: Configurable Marine Weather Forecast Generator\BBCQ \BemExpert Update, \Bem6(1), 4–10.
- [\BCAYSripada, Reiter, \BBA HawizySripada et al.2005] Sripada, S., Reiter, E., \BBA Hawizy, L. \BBOP2005\BBCP. \BBOQEvaluation of an NLG System using Post-Edit Data: Lessons Learned\BBCQ In \BemProc. ENLG’05, \BPGS 133–139.
- [\BCAYStedeStede2000] Stede, M. \BBOP2000\BBCP. \BBOQThe hyperonym problem revisited: Conceptual and lexical hierarchies in language\BBCQ In \BemProc. INLG’00, \BPGS 93–99.
- [\BCAYSteedmanSteedman2000] Steedman, M. \BBOP2000\BBCP. \BemThe Syntactic Process. MIT Press, Cambridge, MA.
- [\BCAYSteedman \BBA PetrickSteedman \BBA Petrick2007] Steedman, M.\BBACOMMA \BBA Petrick, R. P. \BBOP2007\BBCP. \BBOQPlanning dialog actions\BBCQ In \BemProc. SIGDIAL’07, \BPGS 265–272.
- [\BCAYStent, Marge, \BBA SinghaiStent et al.2005] Stent, A., Marge, M., \BBA Singhai, M. \BBOP2005\BBCP. \BBOQEvaluating evaluation methods for generation in the presence of variation\BBCQ In \BemProc. CiCLing’05, \BPGS 341–351.
- [\BCAYStent \BBA MolinaStent \BBA Molina2009] Stent, A.\BBACOMMA \BBA Molina, M. \BBOP2009\BBCP. \BBOQEvaluating automatic extraction of rules for sentence plan construction\BBCQ In \BemProc. SIGDIAL’09, \BPGS 290–297.
- [\BCAYStock \BBA StrapparavaStock \BBA Strapparava2005] Stock, O.\BBACOMMA \BBA Strapparava, C. \BBOP2005\BBCP. \BBOQThe act of creating humorous acronyms\BBCQ \BemApplied Artificial Intelligence, \Bem19(2), 137–151.
- [\BCAYStock, Zancanaro, Busetta, Callaway, Krüger, Kruppa, Kuflik, Not, \BBA RocchiStock et al.2007] Stock, O., Zancanaro, M., Busetta, P., Callaway, C., Krüger, A., Kruppa, M., Kuflik, T., Not, E., \BBA Rocchi, C. \BBOP2007\BBCP. \BBOQAdaptive, intelligent presentation of information for the museum visitor in PEACH\BBCQ \BemUser Modeling and User-Adapted Interaction, \Bem17(3), 257–304.
- [\BCAYStoia \BBA ShockleyStoia \BBA Shockley2006] Stoia, L.\BBACOMMA \BBA Shockley, D. \BBOP2006\BBCP. \BBOQNoun phrase generation for situated dialogs\BBCQ In \BemProc. INLG’06, \BPGS 81–88.
- [\BCAYStoneStone2000] Stone, M. \BBOP2000\BBCP. \BBOQOn Identifying Sets\BBCQ In \BemProc. INLG’00, \BPGS 116–123.
- [\BCAYStone \BBA WebberStone \BBA Webber1998] Stone, M.\BBACOMMA \BBA Webber, B. \BBOP1998\BBCP. \BBOQTextual Economy through Close Coupling of Syntax and Semantics\BBCQ In \BemProc. INLG’98, \BPGS 178–187.
- [\BCAYStriegnitz, Gargett, Garoufi, Koller, \BBA TheuneStriegnitz et al.2011] Striegnitz, K., Gargett, A., Garoufi, K., Koller, A., \BBA Theune, M. \BBOP2011\BBCP. \BBOQReport on the second NLG challenge on generating instructions in virtual environments (GIVE-2)\BBCQ In \BemProc. ENLG’11, \BPGS 243–250.
- [\BCAYStrong, Mehta, Mishra, Jones, \BBA RamStrong et al.2007] Strong, C. R., Mehta, M., Mishra, K., Jones, A., \BBA Ram, A. \BBOP2007\BBCP. \BBOQEmotionally driven natural language generation for personality rich characters in interactive games\BBCQ In \BemProc. AIIDE’07, \BPGS 98–100.
- [\BCAYSutskever, Martens, \BBA HintonSutskever et al.2011] Sutskever, I., Martens, J., \BBA Hinton, G. \BBOP2011\BBCP. \BBOQGenerating Text with Recurrent Neural Networks\BBCQ In \BemProc. ICML’11, \BPGS 1017–1024.
- [\BCAYSutskever, Vinyals, \BBA LeSutskever et al.2014] Sutskever, I., Vinyals, O., \BBA Le, Q. V. \BBOP2014\BBCP. \BBOQSequence to sequence learning with neural networks\BBCQ In \BemProc. NIPS’14, \BPGS 3104–3112.
- [\BCAYTang, Yang, Carton, Zhang, \BBA MeiTang et al.2016] Tang, J., Yang, Y., Carton, S., Zhang, M., \BBA Mei, Q. \BBOP2016\BBCP. \BBOQContext-aware Natural Language Generation with Recurrent Neural Networks\BBCQ \BemarXiv preprint, \Bem1611.09900.
- [\BCAYTheune, Hielkema, \BBA HendriksTheune et al.2006] Theune, M., Hielkema, F., \BBA Hendriks, P. \BBOP2006\BBCP. \BBOQPerforming aggregation and ellipsis using discourse structures\BBCQ \BemResearch on Language and Computation, \Bem4, 353–375.
- [\BCAYTheune, Klabbers, de Pijper, Krahmer, \BBA OdijkTheune et al.2001] Theune, M., Klabbers, E., de Pijper, J.-R., Krahmer, E., \BBA Odijk, J. \BBOP2001\BBCP. \BBOQFrom data to speech: a general approach\BBCQ \BemNatural Language Engineering, \Bem7(1), 47–86.
- [\BCAYTheuneTheune2003] Theune, M. \BBOP2003\BBCP. \BBOQNatural language generation for dialogue: System survey\BBCQ \BTR, Twente University.
- [\BCAYThomason, Venugopalan, Guadarrama, Saenko, \BBA MooneyThomason et al.2014] Thomason, J., Venugopalan, S., Guadarrama, S., Saenko, K., \BBA Mooney, R. J. \BBOP2014\BBCP. \BBOQIntegrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild\BBCQ In \BemProc. COLING’14, \BPGS 1218–1227.
- [\BCAYThompsonThompson1977] Thompson, H. \BBOP1977\BBCP. \BBOQStrategy and Tactics: a Model for Language Production\BBCQ In \BemPapers from the 13th Regional Meeting of the Chicago Linguistic Society, \BVOL 13, \BPGS 651–668.
- [\BCAYTintarev, Reiter, Black, Waller, \BBA ReddingtonTintarev et al.2016] Tintarev, N., Reiter, E., Black, R., Waller, A., \BBA Reddington, J. \BBOP2016\BBCP. \BBOQPersonal storytelling: Using Natural Language Generation for children with complex communication needs, in the wild\BBCQ \BemInternational Journal of Human Computer Studies, \Bem92-93, 1–16.
- [\BCAYTogelius, Yannakakis, Stanley, \BBA BrowneTogelius et al.2011] Togelius, J., Yannakakis, G. N., Stanley, K. O., \BBA Browne, C. \BBOP2011\BBCP. \BBOQSearch-based procedural content generation: A taxonomy and survey\BBCQ \BemIEEE Transactions on Computational Intelligence and AI in Games, \Bem3(3), 172–186.
- [\BCAYTurian, Shen, \BBA MelamedTurian et al.2003] Turian, J., Shen, L., \BBA Melamed, I. D. \BBOP2003\BBCP. \BBOQEvaluation of Machine Translation and its Evaluation\BBCQ In \BemProc. MT Summit IX, \BPGS 386–393.
- [\BCAYTurner, Sripada, Reiter, \BBA DavyTurner et al.2008] Turner, R., Sripada, S., Reiter, E., \BBA Davy, I. \BBOP2008\BBCP. \BBOQSelecting the Content of Textual Descriptions of Geographically Located Events in Spatio-Temporal Weather Data\BBCQ In \BemApplications and Innovations in Intelligent Systems XV, \BPGS 75–88.
- [\BCAYTurnerTurner1992] Turner, S. R. \BBOP1992\BBCP. \BemMINSTREL: A computer model of creativity and storytelling. Ph.d. thesis, University of California at Los Angeles.
- [\BCAYvan Dalenvan Dalen2012] van Dalen, A. \BBOP2012\BBCP. \BBOQThe algorithms behind the headlines\BBCQ \BemJournalism Practice, \Bem6(5-6), 648–658.
- [\BCAYvan Deemtervan Deemter2012] van Deemter, K. \BBOP2012\BBCP. \BemNot exactly: In praise of vagueness. Oxford University Press, Oxford.
- [\BCAYvan Deemtervan Deemter2016] van Deemter, K. \BBOP2016\BBCP. \BBOQDesigning algorithms for referring with proper names\BBCQ In \BemProc. INLG 2016, \BPGS 31â–35.
- [\BCAYvan Deemter, Gatt, van der Sluis, \BBA Powervan Deemter et al.2012a] van Deemter, K., Gatt, A., van der Sluis, I., \BBA Power, R. \BBOP2012a\BBCP. \BBOQGeneration of Referring Expressions: Assessing the Incremental Algorithm\BBCQ \BemCognitive Science, \Bem36(5), 799–836.
- [\BCAYvan Deemter, Gatt, van Gompel, \BBA Krahmervan Deemter et al.2012b] van Deemter, K., Gatt, A., van Gompel, R. P. G., \BBA Krahmer, E. \BBOP2012b\BBCP. \BBOQToward a computational psycholinguistics of reference production\BBCQ \BemTopics in cognitive science, \Bem4(2), 166–83.
- [\BCAYvan Deemter, Krahmer, \BBA Theunevan Deemter et al.2005] van Deemter, K., Krahmer, E., \BBA Theune, M. \BBOP2005\BBCP. \BBOQReal versus template-based natural language generation: A false opposition?\BBCQ \BemComputational Linguistics, \Bem31(1), 15–24.
- [\BCAYvan Deemter, Krenn, Piwek, Klesen, Schröder, \BBA Baumannvan Deemter et al.2008] van Deemter, K., Krenn, B., Piwek, P., Klesen, M., Schröder, M., \BBA Baumann, S. \BBOP2008\BBCP. \BBOQFully generated scripted dialogue for embodied agents\BBCQ \BemArtificial Intelligence, \Bem172(10), 1219–1244.
- [\BCAYvan der Lee, Krahmer, \BBA Wubbenvan der Lee et al.2017] van der Lee, C., Krahmer, E., \BBA Wubben, S. \BBOP2017\BBCP. \BBOQPass: A dutch data-to-text system for soccer, targeted towards specific audiences\BBCQ In \BemProc. INLG’17, \BPGS 95–104.
- [\BCAYvan der Meulen, Logie, Freer, Sykes, McIntosh, \BBA Huntervan der Meulen et al.2007] van der Meulen, M., Logie, R. H., Freer, Y., Sykes, C., McIntosh, N., \BBA Hunter, J. \BBOP2007\BBCP. \BBOQWhen a Graph is Poorer Than 100 Words: A Comparison of Computerised Natural Language Generation, Human Generated Descriptions and Graphical Displays in Neonatal Intensive Care\BBCQ \BemApplied Cognitive Psychology, \Bem21, 1057–1075.
- [\BCAYvan der Sluis \BBA Mellishvan der Sluis \BBA Mellish2010] van der Sluis, I.\BBACOMMA \BBA Mellish, C. \BBOP2010\BBCP. \BBOQTowards Empirical Evaluation of Affective Tactical NLG\BBCQ In Krahmer, E.\BBACOMMA \BBA Theune, M.\BEDS, \BemEmpirical methods in natural language generation, \BPGS 242–263. Springer, Berlin and Heidelberg.
- [\BCAYvan der Wal, Sharma, Mellish, Robinson, \BBA Siddharthanvan der Wal et al.2016] van der Wal, R., Sharma, N., Mellish, C., Robinson, A., \BBA Siddharthan, A. \BBOP2016\BBCP. \BBOQThe role of automated feedback in training and retaining biological recorders for citizen science\BBCQ \BemConservation Biology, \Bem30(3), 550–561.
- [\BCAYVarges \BBA MellishVarges \BBA Mellish2010] Varges, S.\BBACOMMA \BBA Mellish, C. \BBOP2010\BBCP. \BBOQInstance-based natural language generation\BBCQ \BemNatural Language Engineering, \Bem16(3), 309–346.
- [\BCAYVaudry \BBA LapalmeVaudry \BBA Lapalme2013] Vaudry, P.-L.\BBACOMMA \BBA Lapalme, G. \BBOP2013\BBCP. \BBOQAdapting SimpleNLG for bilingual French-English realisation\BBCQ In \BemProc. ENLG’13, \BPGS 183–187.
- [\BCAYVealeVeale2013] Veale, T. \BBOP2013\BBCP. \BBOQOnce More, With Feeling! Using Creative Affective Metaphors to Express Information Needs\BBCQ In \BemProc. ICCM’13, \BPGS 16–23.
- [\BCAYVeale \BBA HaoVeale \BBA Hao2007] Veale, T.\BBACOMMA \BBA Hao, Y. \BBOP2007\BBCP. \BBOQComprehending and Generating Apt Metaphors: A Web-driven, Case-based Approach to Figurative Language\BBCQ In \BemProc. AAAI’07, \BPGS 1471–1476.
- [\BCAYVeale \BBA HaoVeale \BBA Hao2008] Veale, T.\BBACOMMA \BBA Hao, Y. \BBOP2008\BBCP. \BBOQA fluid knowledge representation for understanding and generating creative metaphors\BBCQ In \BemProc. COLING’08, \BPGS 945–952.
- [\BCAYVeale \BBA LiVeale \BBA Li2015] Veale, T.\BBACOMMA \BBA Li, G. \BBOP2015\BBCP. \BBOQDistributed divergent creativity: Computational creative agents at web scale\BBCQ \BemCognitive Computation, \Bem8(2), 175–186.
- [\BCAYVedantam, Zitnick, \BBA ParikhVedantam et al.2015] Vedantam, R., Zitnick, C. L., \BBA Parikh, D. \BBOP2015\BBCP. \BBOQCIDEr: Consensus-based image description evaluation\BBCQ In \BemProc. CVPR’15, \BPGS 4566–4575.
- [\BCAYVenigalla \BBA Di EugenioVenigalla \BBA Di Eugenio2013] Venigalla, H.\BBACOMMA \BBA Di Eugenio, B. \BBOP2013\BBCP. \BBOQUIC-CSC: The Content Selection Challenge Entry from the University of Illinois at Chicago\BBCQ In \BemProc. ENLG’13, \BPGS 210–211.
- [\BCAYVenugopalan, Rohrbach, Darrell, Donahue, Saenko, \BBA MooneyVenugopalan et al.2015a] Venugopalan, S., Rohrbach, M., Darrell, T., Donahue, J., Saenko, K., \BBA Mooney, R. J. \BBOP2015a\BBCP. \BBOQSequence to Sequence â Video to Text\BBCQ In \BemProc. ICCV’15, \BPGS 4534–4542.
- [\BCAYVenugopalan, Xu, Donahue, Rohrbach, Mooney, \BBA SaenkoVenugopalan et al.2015b] Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R. J., \BBA Saenko, K. \BBOP2015b\BBCP. \BBOQTranslating Videos to Natural Language Using Deep Recurrent Neural Networks\BBCQ In \BemProc. NAACL’15, \BPGS 1494–1504.
- [\BCAYViethen \BBA DaleViethen \BBA Dale2007] Viethen, J.\BBACOMMA \BBA Dale, R. \BBOP2007\BBCP. \BBOQEvaluation in natural language generation: Lessons from referring expression generation\BBCQ \BemTraitement Automatique des Langues, \Bem48(1), 141–160.
- [\BCAYViethen \BBA DaleViethen \BBA Dale2008] Viethen, J.\BBACOMMA \BBA Dale, R. \BBOP2008\BBCP. \BBOQThe Use of Spatial Relations in Referring Expression Generation\BBCQ In \BemProc. INLG’08, \BPGS 59–67.
- [\BCAYViethen \BBA DaleViethen \BBA Dale2010] Viethen, J.\BBACOMMA \BBA Dale, R. \BBOP2010\BBCP. \BBOQSpeaker-dependent variation in content selection for referring expression generation\BBCQ In \BemProc. 8th Australasian Language Technology Workshop, \BPGS 81–89.
- [\BCAYViethen \BBA DaleViethen \BBA Dale2011] Viethen, J.\BBACOMMA \BBA Dale, R. \BBOP2011\BBCP. \BBOQGRE3D7: A Corpus of Distinguishing Descriptions for Objects in Visual Scenes\BBCQ In \BemProc. UCNLG+Eval’11, \BPGS 12–22.
- [\BCAYVinyals, Toshev, Bengio, \BBA ErhanVinyals et al.2015] Vinyals, O., Toshev, A., Bengio, S., \BBA Erhan, D. \BBOP2015\BBCP. \BBOQShow and tell: A neural image caption generator\BBCQ In \BemProc. CVPR’15, \BPGS 3156–3164.
- [\BCAYWah, Branson, Welinder, Perona, \BBA BelongieWah et al.2011] Wah, C., Branson, S., Welinder, P., Perona, P., \BBA Belongie, S. \BBOP2011\BBCP. \BBOQThe Caltech-UCSD Birds-200-2011 Dataset (Technical Report CNS-TR-2011-001)\BBCQ \BTR, California Institute of Technology, California.
- [\BCAYWalkerWalker1992] Walker, M. A. \BBOP1992\BBCP. \BBOQRedundancy in Collaborative Dialogue\BBCQ In \BemProc. COLING’92, \BPGS 345–351.
- [\BCAYWalker, Cahn, \BBA WhittakerWalker et al.1997] Walker, M. A., Cahn, J. E., \BBA Whittaker, S. J. \BBOP1997\BBCP. \BBOQImprovising linguistic style: Social and affective bases for agent personality\BBCQ In \BemProc. Agents’97, \BPGS 96–105.
- [\BCAYWalker, Grant, Sawyer, Lin, Wardrip-Fruin, \BBA BuellWalker et al.2011a] Walker, M. A., Grant, R., Sawyer, J., Lin, G. I., Wardrip-Fruin, N., \BBA Buell, M. \BBOP2011a\BBCP. \BBOQPerceived or Not Perceived: Film Character Models for Expressive NLG\BBCQ In \BemProc. ICIDS’11, \BPGS 109–121.
- [\BCAYWalker, Lin, Sawyer, Grant, Buell, \BBA Wardrip-FruinWalker et al.2011b] Walker, M. A., Lin, G. I., Sawyer, J., Grant, R., Buell, M., \BBA Wardrip-Fruin, N. \BBOP2011b\BBCP. \BBOQMurder in the Arboretum: Comparing Character Models to Personality Models\BBCQ In \BemProc. AIIDEWS’11, \BPGS 106–114.
- [\BCAYWalker, Park, Rambow, \BBA RogatiWalker et al.2001] Walker, M. A., Park, F., Rambow, O., \BBA Rogati, M. \BBOP2001\BBCP. \BBOQSPoT: A Trainable Sentence Planner\BBCQ In \BemProc. NAACL’01, \BPGS 1–8.
- [\BCAYWalker, Rambow, \BBA RogatiWalker et al.2002] Walker, M. A., Rambow, O., \BBA Rogati, M. \BBOP2002\BBCP. \BBOQTraining a sentence planner for spoken dialogue using boosting\BBCQ \BemComputer Speech and Language, \Bem16(3-4), 409–433.
- [\BCAYWalker, Stent, Mairesse, \BBA PrasadWalker et al.2007] Walker, M. A., Stent, A., Mairesse, F., \BBA Prasad, R. \BBOP2007\BBCP. \BBOQIndividual and domain adaptation in sentence planning for dialogue\BBCQ \BemJournal of Artificial Intelligence Research, \Bem30, 413–456.
- [\BCAYWaller, Black, O’Mara, Pain, Ritchie, \BBA ManurungWaller et al.2009] Waller, A., Black, R., O’Mara, D., Pain, H., Ritchie, G. D., \BBA Manurung, R. \BBOP2009\BBCP. \BBOQEvaluating the STANDUP Pun Generating Software with Children with Cerebral Palsy\BBCQ \BemACM Transactions on Accessible Computing, \Bem1(3), 1–27.