This work has been supported by the Swedish Research Council under Grant No. 2012-5746 (Reliable Multilingual Digital Communication: Methods and Applications) and by the Centre for Language Technology in Gothenburg. The research leading to these results has also received funding from the Latvian State Research Programme NexIT (Project No. 1).

# A Multilingual FrameNet-based Grammar and Lexicon for Controlled Natural Language

## Abstract

Berkeley FrameNet is a lexico-semantic resource for English based on the theory of frame semantics. It has been exploited in a range of natural language processing applications and has inspired the development of framenets for many languages. We present a methodological approach to the extraction and generation of a computational multilingual FrameNet-based grammar and lexicon. The approach leverages FrameNet-annotated corpora to automatically extract a set of cross-lingual semantico-syntactic valence patterns. Based on data from Berkeley FrameNet and Swedish FrameNet, the proposed approach has been implemented in Grammatical Framework (GF), a categorial grammar formalism specialized for multilingual grammars. The implementation of the grammar and lexicon is supported by the design of FrameNet, providing a frame semantic abstraction layer, an interlingual semantic API (application programming interface), over the interlingual syntactic API already provided by GF Resource Grammar Library. The evaluation of the acquired grammar and lexicon shows the feasibility of the approach. Additionally, we illustrate how the FrameNet-based grammar and lexicon are exploited in two distinct multilingual controlled natural language applications. The produced resources are available under an open source license.

###### Keywords:
FrameNet · Grammatical Framework · Multilinguality · Natural Language Generation · Controlled Natural Language

## 1 Introduction

Kuhn (2014) defines Controlled Natural Language (CNL) as “a constructed language that is based on a certain natural language, being more restrictive concerning lexicon, syntax, and/or semantics, while preserving most of its natural properties.” In our work, we deviate from this definition in two aspects. First, our intention is to produce a reusable grammar that covers a restricted subset of a natural language instead of a grammar of a predefined constructed language. Second, we produce a currently bilingual but potentially multilingual grammar library which is therefore not based on exactly one natural language but has a shared semantic abstract syntax. Thus, we do not provide a CNL as such but a high-level API (application programming interface) for the facilitation of the development of CNL grammars, making them more flexible – easier to modify and extend. In a sense, we aim at bridging the gap between controlled and natural language.

A more general aim of this research is to make existing FrameNet (FN) resources uniformly and computationally accessible for multilingual natural language generation (NLG) and controlled semantic parsing via a shared semantico-syntactic grammar and lexicon API. We particularly consider the development of CNL interfaces to knowledge bases for authoring and verbalizing facts in a specific domain. For example, Figure 1 illustrates a predictive multilingual CNL editor for authoring object descriptions in the cultural heritage domain. The detailed syntactic constructors for building the verb phrases and clauses have been manually specified for each language. The FN-based API aims to diminish the manual efforts by providing more abstract – frame semantic – constructors, e.g. Create_physical_artwork that takes arguments Representation and Creator, and a target verb. The future potential of our approach is to provide a means for multilingual verbalization of FN-annotated knowledge bases that have been populated by FN-based information extraction systems (Das et al, 2014) and that could be automatically mapped to the appropriate frame constructors, similarly as sketched by Barzdins (2014).

At the CNL 2012 workshop, we proposed a conception of a general-purpose semantic grammar based on FrameNet (Gruzitis et al, 2012). The proposed approach builds on the technology of Grammatical Framework (GF). GF (Ranta, 2004), a type-theoretical grammar formalism and a toolkit, offers a wide-coverage resource grammar library (RGL) for currently 30 languages which implement a shared syntactic API (Ranta, 2009). The idea behind the FrameNet-based grammar is to provide a frame semantic abstraction layer, a shared semantic API, over the syntactic API of GF RGL.

Following this conception, we successfully extracted a shared abstract syntax of wide-coverage English and Swedish grammars from FN-annotated corpora (Dannélls and Gruzitis, 2014a). Soon after, we presented a more elaborate approach to the automatic extraction and generation of both the shared abstract syntax and the concrete syntaxes of the proposed grammar (Dannélls and Gruzitis, 2014b). In this article, we give an extended and updated presentation of the work published in the previous two papers. Additionally, we provide the design and implementation details of FN-based lexicons for English and Swedish that are also extracted from the annotated corpora. The experiments and tests presented here are based on the Berkeley FrameNet release 1.5, which has been available since September 2010, and on a snapshot of the Swedish FrameNet development version taken in December 2014.

Our approach is outlined in Figure 2. From the linguistic point of view, the particular characteristic is that we focus on the core argument structures (according to FrameNet), the arguments are combined compositionally, a verb is expected as the target word, and the word order is controlled according to the dominant corpus evidence. Although we focus on English and Swedish, the same approach is intended to be applicable to other languages as well. As a side result, we suggest a unified method for comparing and mapping semantic and syntactic valence patterns and lexical units across framenets.

The structure of this article is as follows. In Section 2, we provide background information about FrameNet and Grammatical Framework. Details of the FN-based approach, the experiment series and the implementation of the grammar are given in Section 3. The FN-based lexicon is further detailed in Section 4. These are followed by an illustration of the use of the FN-based API in two multilingual CNL applications in Section 5. We provide an evaluation of our method in Section 6 followed by a discussion in Section 7. Previous approaches to semantic multilingual CNL grammars are briefly discussed in Section 8. We conclude the article in Section 9.

## 2 Background

### 2.1 Berkeley FrameNet (BFN)

Berkeley FrameNet (Fillmore et al, 2003) is a lexico-semantic resource based on the theory of frame semantics (Fillmore, 1985). According to this theory, a semantic frame representing a cognitive scenario is characterized in terms of frame elements (FE) and is evoked by target words called lexical units (LU).

FEs are classified into four groups: core, core unexpressed, peripheral and extra-thematic (Ruppenhofer et al, 2010). The set of core FEs instantiates the conceptually necessary components of a frame and uniquely characterizes the frame, distinguishing it from other frames. Core unexpressed FEs are core FEs that may not be used in descendant frames. FEs such as Time, Place and Manner, which do not uniquely characterize a frame and can be instantiated in any semantically appropriate frame, are classified as peripheral. Extra-thematic FEs, unlike core and peripheral FEs, do not have a frame-specific understanding.

In our work, we distinguish two classes of FEs: core that includes core unexpressed, and non-core that includes peripheral and extra-thematic. Core FEs syntactically tend to correspond to verb arguments, in contrast to non-core FEs that typically are adjuncts.

As an example, consider the frame Desiring given in Table 1 where we find: (i) the definition of the frame, a lexicographic description of the scenario it represents, and (ii) lists of core and non-core FEs, the semantic roles.

An LU entry is a pairing of a lemma and a frame, and it carries both the semantic and the syntactic valence information about the possible realizations of the FEs participating in the frame. The syntactic valence is language-specific, and the valence patterns are derived from FN-annotated corpora. The syntactic constituents of the example sentences are annotated according to COMLEX Syntax (Meyers et al, 1995). To take an example, consider the valence patterns for the verb want given in Table 2.

There have also been experiments on the automatic alignment of LUs in BFN to synsets in Princeton WordNet (Ferrández et al, 2010), a complementary resource that would help to extend the coverage of BFN and to link LUs across languages. These links, however, are not available as a part of the FrameNet data release.

The FrameNet approach to frame semantics, as tested in BFN, provides a benchmark for representing large amounts of word senses and word usage patterns through the linguistic annotation of corpus examples. The exploitation of FN-like resources has therefore been appealing for a range of advanced NLP applications such as semantic parsing (Das et al, 2014), information extraction (Moschitti et al, 2003) and natural language generation (Roth and Frank, 2009). There are FNs available for German, Japanese, Spanish (Boas, 2009) and Swedish (Borin et al, 2010). More initiatives exist for other languages. In this article, we consider Berkeley FrameNet and Swedish FrameNet.

### 2.2 Swedish FrameNet (SweFN)

Swedish FrameNet has been developed within the Swedish FrameNet++ project at Språkbanken (Borin et al, 2010). One of the aims of the project was to integrate a number of existing Swedish lexical resources and harmonize the information they contain. All the integrated resources are linked to the lexical entries of SALDO (Borin et al, 2013), an association lexicon which contains morphological and lexical-semantic information for more than 125,000 Swedish words, of which 13,000 are verbs. SALDO is therefore considered the pivot of these integrated Swedish lexical resources.

Integrated lexical resources linked to one pivot lexicon have the advantage of yielding a large amount of information. Some of the information we gain access to is syntactic valence information for verbs from the SIMPLE and PAROLE lexicons (Lenci et al, 2000), and synsets and senses from WordNet.

The SweFN resource has been expanded from BFN, which means that it follows the same structure and theoretical principles as BFN. For example, the description of the frame Desiring shown in Table 1 is the same in SweFN. This is the language-independent aspect. Regarding the language-dependent aspect, there are other, more practical differences between the resources. These differences concern the content of the frames, including: lexical units from SALDO, compound pattern analysis and examples, syntactic annotations of the example sentences, and domain specification of frames.

Some of the LUs, e.g. the SALDO entries, that evoke the frame Desiring are: känna_för.vb.1 ‘to feel like’, längta.vb.1 ‘to yearn’, vilja.vb.1 ‘to want’, åtrå.vb.1 ‘to desire’, begärelse.nn.1 ‘wish’, åtrå.nn.2 ‘desire’. The number indicates SALDO’s sense identifier, ‘vb’ and ‘nn’ indicate the part-of-speech (POS) tags.

All example sentences in SweFN are syntactically annotated with the Swedish version of MaltParser for dependency structures (Nivre et al, 2004). The annotation scheme is based on the part-of-speech and morphosyntactic markup tag sets of KORP, Språkbanken’s corpus infrastructure (Borin et al, 2012). Table 3 shows some semantic valence patterns and their syntactic realizations for the verb känna för ‘to feel like’.

As mentioned, SweFN mostly uses the BFN frame inventory. However, around 50 additional frames are introduced in SweFN, and around 15 BFN frames have been modified. The modified frames are characterized by a change of core FEs and by a change of the frame content, which is made either more specific or less specific.

### 2.3 Grammatical Framework (GF)

The presented grammar is implemented in GF, a categorial grammar formalism specialized for multilingual (parallel) grammars (Ranta, 2004). One of the key features of GF grammars is the separation between an abstract syntax and concrete syntaxes. The abstract syntax defines the language-independent structure, the semantics of a domain-specific application grammar or a general-purpose grammar library, while the concrete syntaxes define the language-specific syntactic and lexical realization of the abstract syntax.

Remarkably, GF is not only a grammar formalism or a programming language. It also provides a reusable general-purpose resource grammar library (RGL) for currently 30 languages that implement the same abstract syntax, a shared syntactic API (Ranta, 2009). The grammars implement common syntactic constructions and describe the inflectional morphology of a language. The use of the shared syntactic types and functions allows for rapid and rather flexible development of multilingual application grammars without the need to specify low-level details like inflectional paradigms, syntactic agreement and word order. In order to hide the low-level details, RGL has a high-level interface that provides constructors like mkCl for building a clause from an NP and a VP.

## 3 FrameNet-Based Grammar

The language-independent layer of FrameNet, i.e. frames and FEs – the semantic valence – is defined in the abstract syntax of the proposed multilingual grammar library, while the language-specific layers, i.e. the surface realization of frames and LUs – the syntactic valence – are defined in the concrete syntaxes. The syntactic API of GF RGL is used for generalizing and unifying the grammatical types and constructions used in different framenets, which facilitates porting the implementation to other languages. The FrameNet-based grammar, in turn, provides a frame semantic abstraction layer over RGL, so that the application grammar developer can primarily work with plain semantic constructors in combination with some simple syntactic constructors, instead of comparatively complex syntactic constructors for building verb phrases (VP). Moreover, the frame constructors can typically be specified for all languages at once in the shared concrete syntax (functor) of an application grammar.

In this research, we consider only those frames for which there is at least one corpus example where the frame is evoked by a verb. In addition, we consider only core FEs which uniquely characterize the frame.

BFN version 1.5 defines 1,020 frames, of which, according to our calculations, 559 are evoked by 3,254 verb LUs in 69,260 annotated sentences. As of December 2014, the SweFN development version covers 995 frames of which 660 are evoked by 2,887 verb LUs in 4,400 annotated sentences.

### 3.1 Abstract Syntax

To acquire a common abstract syntax, a common semantic API, we have extracted a set of shared semantico-syntactic frame valence patterns from the annotated sentences in BFN and SweFN. For instance, the shared valence patterns for the frame Desiring are:

which correspond, for instance, to the following annotated examples in BFN:

1. [Dexter] [YEARNED] [for a cigarette]

2. [she] [WANTS] [a protector]

3. [I] would n’t [WANT] [to know]

The actual BFN phrase types (, various subtypes of , and its subtypes, etc.) are generalized to the RGL types , , , and . Verb types (, , , , , , , and ) are inferred from the syntactic valence of LUs in the respective examples.

In addition to phrase types, the extracted valence patterns also specify the inferred grammatical relations of NP-typed FEs: nsubj (subject), nsubjpass (passive subject), dobj (direct object) and iobj (indirect object), which correspond to the universal dependency relations (de Marneffe et al, 2014). Therefore, we also include the grammatical voice (Act/Pass) in the pattern comparison and in the pattern identifiers used in the abstract syntax. It is not necessary, however, to reflect the grammatical relations in the abstract syntax; this knowledge is taken into account in the pattern comparison and while generating the concrete syntaxes, but it is not required in order to use the resulting grammar as an API.

#### Sentence patterns versus normalized valence patterns

The first step in the extraction of shared valence patterns is to convert the annotated corpus examples into more general and uniform sentence patterns. A sentence pattern is a valence pattern that preserves the word order (the order of FEs in a particular sentence); subcategorizes FEs by the grammatical RGL types and includes the universal grammatical relations of NP-typed FEs (both inferred from the particular sentence); includes the prepositions or cases that are used to realize Adv-typed FEs (for deciding on LU-specific or even frame-specific default values in the future); and includes a reference to the target verb (LU). Duplicate sentence patterns are kept in the output for frequency counts.

The conversion process is language- and framenet-specific because there is no unified annotation model used across framenets. BFN and SweFN use not only different XML schemas and POS tagsets; they also use different approaches for annotating the syntactic structure of a sentence.

In BFN, a phrase-structure approach is used, complemented by a few shallow grammatical relations: an external argument (a phrase outside the VP of the target verb), the first object in the active voice (either direct or indirect), and a general dependent. A simplified excerpt from the BFN corpus for the verb want evoking the frame Desiring is:

    <sentence>
      <text>Traders in the city want a change.</text>
      <annotationSet>
        <layer rank="1" name="BNC">
          <label start="0" end="6" name="NP0"/>
          <label start="20" end="23" name="VVB"/>
          <label start="25" end="25" name="AT0"/>
        </layer>
      </annotationSet>
      <annotationSet status="MANUAL">
        <layer rank="1" name="FE">
          <label start="0" end="18" name="Experiencer"/>
          <label start="25" end="32" name="Event"/>
        </layer>
        <layer rank="1" name="GF">
          <label start="0" end="18" name="Ext"/>
          <label start="25" end="32" name="Obj"/>
        </layer>
        <layer rank="1" name="PT">
          <label start="0" end="18" name="NP"/>
          <label start="25" end="32" name="NP"/>
        </layer>
        <layer rank="1" name="Target">
          <label start="20" end="23" name="Target"/>
        </layer>
      </annotationSet>
    </sentence>
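As an illustration of how the stand-off layers of such an excerpt can be aligned, the following minimal Python sketch pairs each FE label with the phrase type (PT) and grammatical function (GF) labels that share its character span. It is not the extraction tool used in this work; the function names `read_layers`, `aligned_fes` and `target_word` are hypothetical, and the sketch assumes that the end offsets are inclusive, as they appear to be in the excerpt.

```python
import xml.etree.ElementTree as ET

# Simplified BFN-style sentence (the manual annotation set of the excerpt).
BFN_SENTENCE = """
<sentence>
  <text>Traders in the city want a change.</text>
  <annotationSet status="MANUAL">
    <layer rank="1" name="FE">
      <label start="0" end="18" name="Experiencer"/>
      <label start="25" end="32" name="Event"/>
    </layer>
    <layer rank="1" name="GF">
      <label start="0" end="18" name="Ext"/>
      <label start="25" end="32" name="Obj"/>
    </layer>
    <layer rank="1" name="PT">
      <label start="0" end="18" name="NP"/>
      <label start="25" end="32" name="NP"/>
    </layer>
    <layer rank="1" name="Target">
      <label start="20" end="23" name="Target"/>
    </layer>
  </annotationSet>
</sentence>
"""


def read_layers(sentence_xml):
    """Return the sentence text and a map: layer name -> {(start, end): label}."""
    root = ET.fromstring(sentence_xml)
    text = root.findtext("text")
    layers = {}
    for layer in root.iter("layer"):
        spans = layers.setdefault(layer.get("name"), {})
        for label in layer.iter("label"):
            span = (int(label.get("start")), int(label.get("end")))
            spans[span] = label.get("name")
    return text, layers


def aligned_fes(sentence_xml):
    """List the FEs in surface order with the PT and GF labels of the same span."""
    text, layers = read_layers(sentence_xml)
    rows = []
    for (start, end), fe in sorted(layers.get("FE", {}).items()):
        rows.append({
            "fe": fe,
            "pt": layers.get("PT", {}).get((start, end)),
            "gf": layers.get("GF", {}).get((start, end)),
            "text": text[start:end + 1],  # assumption: end offset is inclusive
        })
    return rows


def target_word(sentence_xml):
    """Return the surface form of the single annotated target."""
    text, layers = read_layers(sentence_xml)
    (start, end), = layers["Target"].keys()
    return text[start:end + 1]
```

Calling `aligned_fes(BFN_SENTENCE)` yields one record per FE in surface order, which is the raw material from which a sentence pattern can be derived.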

In SweFN, a dependency approach is used. A simplified excerpt from the SweFN corpus for the verb vilja ‘want’ evoking the frame Desiring is:

    <sentence>
      <w pos="JJ" ref="1" dephead="2" deprel="DT">Nästa</w>
      <w pos="NN" ref="2" dephead="3" deprel="TA">gång</w>
      <w pos="VB" ref="3" deprel="ROOT">skulle</w>
      <element name="Experiencer">
        <w pos="PN" ref="4" dephead="3" deprel="SS">jag</w>
      </element>
      <element name="LU">
        <w msd="VB.AKT" ref="5" dephead="3" deprel="VG">vilja</w>
      </element>
      <element name="Event">
        <w msd="VB.INF" ref="6" dephead="5" deprel="VG">ha</w>
        <w pos="RG" ref="7" dephead="8" deprel="DT">sju</w>
        <w pos="NN" ref="8" dephead="6" deprel="OO">sångare</w>
      </element>
    </sentence>
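In a dependency annotation like this, the grammatical type and relation of an FE have to be read off the head token of the annotated span. The following Python sketch shows one possible way to find that head, assuming that the head of an element is the token whose governor lies outside the element. The function name `element_heads` is hypothetical, and the heuristic is an assumption made for illustration rather than the rule set applied in this work.

```python
import xml.etree.ElementTree as ET

# Simplified SweFN-style sentence, following the excerpt.
SWEFN_SENTENCE = """
<sentence>
  <w pos="JJ" ref="1" dephead="2" deprel="DT">Nästa</w>
  <w pos="NN" ref="2" dephead="3" deprel="TA">gång</w>
  <w pos="VB" ref="3" deprel="ROOT">skulle</w>
  <element name="Experiencer">
    <w pos="PN" ref="4" dephead="3" deprel="SS">jag</w>
  </element>
  <element name="LU">
    <w msd="VB.AKT" ref="5" dephead="3" deprel="VG">vilja</w>
  </element>
  <element name="Event">
    <w msd="VB.INF" ref="6" dephead="5" deprel="VG">ha</w>
    <w pos="RG" ref="7" dephead="8" deprel="DT">sju</w>
    <w pos="NN" ref="8" dephead="6" deprel="OO">sångare</w>
  </element>
</sentence>
"""


def element_heads(sentence_xml):
    """For each annotated element, return its text and its head token.

    Assumption: the head is the first token whose governor (dephead)
    is not itself a token of the same element.
    """
    root = ET.fromstring(sentence_xml)
    result = []
    for element in root.iter("element"):
        tokens = list(element.iter("w"))
        refs = {w.get("ref") for w in tokens}
        head = next(w for w in tokens if w.get("dephead") not in refs)
        result.append({
            "name": element.get("name"),
            "text": " ".join(w.text for w in tokens),
            "head": head.text,
            "deprel": head.get("deprel"),
        })
    return result
```

For the excerpt, the head of the Event element is the infinitive ha, so a rule set could map this FE to a verb phrase, while the Experiencer is headed by a pronoun in the subject relation (SS).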

It should be noted that a characteristic of BFN is that FEs which are missing in the sentence are still annotated if the grammar allows or requires the omission, or if the identity or type of an FE is understood from the context (Ruppenhofer et al, 2010). Such FEs would be potentially interesting to consider; however, we ignore them as they have no grammatical annotations.

Because of the partial and often erroneous grammatical annotations, various framenet-specific rules and heuristics are applied for generalizing to RGL types, for inferring the grammatical voice and relations, and for partially correcting the automatic annotation errors.

When the uniform sentence patterns are acquired for all languages, a common language- and framenet-independent processor is used in all the remaining steps, including the generation of the abstract and concrete syntaxes and lexicons.

Sentence patterns are summarized and grouped into normalized valence patterns ignoring the word order and prepositions (or cases). As an example, a partial summary of patterns for the frame Desiring in BFN is:

    Act : 275
      Event/VP Experiencer/NP.nsubj : 61
        Experiencer/NP.nsubj Event/VP : 59
        Event/VP Experiencer/NP.nsubj : 2
      Experiencer/NP.nsubj Focal_participant/NP.dobj : 61
        Experiencer/NP.nsubj Focal_participant/NP.dobj : 55
        Focal_participant/NP.dobj Experiencer/NP.nsubj : 6
      Experiencer/NP.nsubj Focal_participant/Adv : 43
        Experiencer/NP.nsubj Focal_participant/Adv[for] : 26
        Experiencer/NP.nsubj Focal_participant/Adv[after] : 7
        Experiencer/NP.nsubj Focal_participant/Adv : 2
        …
      …
    Pass : 13
      Experiencer/NP.dobj Focal_participant/NP.nsubjpass : 5
        Focal_participant/NP.nsubjpass Experiencer/NP.dobj : 5
      …

For generating the abstract syntax, we consider only the normalized valence patterns. The most frequent sentence pattern of each normalized pattern contains sufficient information for generating the concrete syntax for the respective language.
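The grouping step can be sketched in Python as follows. This is a minimal illustration under assumed data structures, not the processor used in this work: a sentence pattern is represented as a pair of a voice tag and a tuple of FE descriptors in surface order, and the function names `normalize`, `summarize` and `representative` are hypothetical. The counts in the sample are taken from the Desiring summary above.

```python
import re
from collections import Counter, defaultdict


def normalize(sentence_pattern):
    """Drop the word order and the preposition (or case) of a sentence pattern."""
    voice, fes = sentence_pattern
    stripped = (re.sub(r"\[[^\]]*\]$", "", fe) for fe in fes)
    return voice, tuple(sorted(stripped))


def summarize(sentence_patterns):
    """Group sentence patterns by normalized pattern, keeping frequency counts."""
    groups = defaultdict(Counter)
    for pattern in sentence_patterns:
        groups[normalize(pattern)][pattern] += 1
    return groups


def representative(counter):
    """Pick the most frequent sentence pattern of a normalized pattern."""
    return counter.most_common(1)[0][0]


# A few Desiring sentence patterns from BFN, repeated by their corpus counts.
SAMPLE = (
    [("Act", ("Experiencer/NP.nsubj", "Event/VP"))] * 59
    + [("Act", ("Event/VP", "Experiencer/NP.nsubj"))] * 2
    + [("Act", ("Experiencer/NP.nsubj", "Focal_participant/Adv[for]"))] * 26
    + [("Act", ("Experiencer/NP.nsubj", "Focal_participant/Adv[after]"))] * 7
)

GROUPS = summarize(SAMPLE)
```

Here the two word orders of the Event/VP pattern collapse into one normalized pattern with 61 occurrences, and its representative is the subject-first order, which would then determine the word order generated in the concrete syntax.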

#### Experiment series

To roughly estimate the impact of certain decisions that have been made in the automatic extraction of the semantico-syntactic valence patterns, we have run a series of experiments with various settings:

0. Extract sentence patterns using the framenet-specific grammatical types (baseline).

1. In addition to 0, skip examples containing currently unconsidered realizations of FEs, namely quotation and few subtypes of (3.4% of BFN examples; no SweFN examples).

2. In addition to 1, generalize the grammatical types according to GF RGL.

3. In addition to 2, skip once-used valence patterns (if the frame has at least one pattern that is used more than once).

where each series includes two subseries:

A. Skip repeated FEs (mostly due to coordination, wh-words making discontinuous PPs, and anchors of relative clauses).

B. Skip non-core FEs and repeated FEs.
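Two of these filters lend themselves to a compact illustration. The Python sketch below shows the subseries filter that keeps only core FEs and drops repeated ones, and the filter that skips once-used valence patterns unless every pattern of the frame is used only once. The function names and the data representation are hypothetical, and the sketch assumes that "skipping repeated FEs" means keeping the first occurrence of each FE.

```python
def keep_core_unique(fes, core):
    """Keep core FEs only and drop repeated occurrences of the same FE.

    `fes` is a list of (FE name, grammatical type) pairs in surface order;
    `core` is the set of core FE names of the frame.
    """
    seen, kept = set(), []
    for name, gtype in fes:
        if name in core and name not in seen:
            seen.add(name)
            kept.append((name, gtype))
    return kept


def drop_once_used(pattern_counts):
    """Skip once-used patterns if the frame has a pattern used more than once."""
    if not any(count > 1 for count in pattern_counts.values()):
        return dict(pattern_counts)  # nothing better is attested: keep all
    return {p: count for p, count in pattern_counts.items() if count > 1}
```

For example, `drop_once_used({"a": 5, "b": 1})` keeps only pattern `a`, whereas a frame whose patterns are all attested once keeps all of them, so that the frame does not drop out of the grammar.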

The results are summarized in Tables 4 and 5. We are primarily interested in Settings 2.B and 3.B which seem to be optimal for SweFN and BFN respectively: the number of covered frames slightly decreases, but it makes the resulting patterns more prototypical and significantly reduces the potential number of API functions to be generated in the abstract syntax. For a large corpus like BFN, skipping once-used valence patterns helps to reduce the propagation of annotation errors, but, for a relatively small corpus like SweFN, it is not reasonable. In the future, it would be reasonable, however, to include typical non-core FEs in the resulting valence patterns: this would slightly increase the average number of FEs from 2 to 3.

#### Shared valence patterns

The extracted sets of semantico-syntactic valence patterns can vary across languages depending on corpora. Having multilingual applications in mind, we are primarily interested in valence patterns whose implementation can be generated, based on corpus evidence, for all considered languages. Thus, we focus on valence patterns that are shared between framenets. The multilingual criterion also helps in reducing the number of incorrectly derived patterns due to annotation errors introduced by the automatic POS tagging and syntactic parsing applied in both BFN and SweFN corpora. Frequent patterns that are not verified across framenets could be separated into language-specific extra modules of the library (in a similar way as it is done with some language-specific syntactic phenomena in RGL).

To find a representative yet condensed set of shared valence patterns, we compare the extracted normalized patterns by subsumption instead of equivalence. Pattern A subsumes pattern B if:

1. 21

If A subsumes B and B subsumes A, then A equals B. If a pattern of BFN is subsumed by a pattern of SweFN, it is added to the shared set (and vice versa). In the final shared set, patterns which are subsumed by other patterns in the set are removed. For instance, in the following example, is subsumed by , is subsumed by and , and are to be removed:

1. :

2. :

3. :

This approach is supported by the design of the FrameNet-based grammar which accepts an empty phrase as an argument to a frame building function if the corresponding FE is not expressed in the sentence.
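The comparison by subsumption can be sketched in Python as follows. Since the exact subsumption conditions are not reproduced here, the sketch makes an explicit assumption: one pattern subsumes another if both belong to the same frame, verb type and voice, and the typed FEs of the subsumed pattern are a subset of those of the subsuming one. The names `Pattern`, `subsumes` and `shared_patterns` are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Pattern(NamedTuple):
    frame: str
    verb_type: str
    voice: str
    fes: FrozenSet[str]  # FEs with their types and relations


def subsumes(a, b):
    """Assumed condition: same frame, verb type and voice; b's FEs within a's."""
    same_head = (a.frame, a.verb_type, a.voice) == (b.frame, b.verb_type, b.voice)
    return same_head and b.fes <= a.fes


def shared_patterns(left, right):
    """Patterns of either framenet that are subsumed by a pattern of the other,
    minus those subsumed by another pattern of the resulting set."""
    candidates = {p for p in left if any(subsumes(q, p) for q in right)}
    candidates |= {p for p in right if any(subsumes(q, p) for q in left)}
    return {
        p for p in candidates
        if not any(q != p and subsumes(q, p) for q in candidates)
    }


FULL = Pattern("Desiring", "V2", "Act",
               frozenset({"Experiencer/NP.nsubj", "Focal_participant/NP.dobj"}))
PARTIAL = Pattern("Desiring", "V2", "Act", frozenset({"Experiencer/NP.nsubj"}))
OTHER = Pattern("Desiring", "VV", "Act",
                frozenset({"Event/VP", "Experiencer/NP.nsubj"}))
```

With `left = {FULL, OTHER}` and `right = {FULL, PARTIAL}`, only `FULL` survives: `PARTIAL` is dropped because `FULL` subsumes it, and `OTHER` has no counterpart in the second set. Dropping `PARTIAL` loses no coverage, because a frame function accepts an empty phrase for an unexpressed FE.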

The comparison is first done between BFN and SweFN sets of verb frames (Table 6) and then between sets of valence patterns that belong to the shared set (intersection) of verb frames (Table 7). For a number of shared frames, no shared valence patterns are found, therefore the final set of shared frames is smaller (Table 7). Intuitively, this is partly because of the size of SweFN (about 6 examples per frame in SweFN versus more than 115 examples per frame in BFN) and partly because of non-compositionality across languages.

As a result, from around 64,000 annotated sentences in BFN (Settings 3.B) and around 4,200 annotated sentences in SweFN (Settings 2.B), we have extracted a set of 869 shared valence patterns covering 483 frames. The result is a proper subset of what would be acquired if Settings 2.B were applied to BFN.

The 869 valence patterns reuse 541 semantico-syntactic types: 339 FEs of type , 159 FEs of type , 17 FEs of type , 17 FEs of type and 9 FEs of type . If considering only the semantic types, there are 429 different FEs.

#### Implementation

The shared valence patterns are declared as frame building functions (henceforth called frame functions) that take one or more core FEs and one target verb as arguments. FEs are expected in the alphabetical order while the verb is always the last argument. The language-specific word order is specified in the concrete syntaxes.

For each frame, the set of core FEs is often split into several alternative functions according to the corpus evidence. Different subsets of core FEs may require different types of target verbs. We also differentiate between functions that return clauses in the passive voice and functions that return clauses in the active voice, because the subject and object FEs swap their grammatical relations and/or their order, which is not reflected in the abstract syntax.

The verb type is always added as a suffix to the function name, and the voice tag is appended in the case of the passive voice. If this is not sufficient to make the function name unique, a discriminative number is appended as well. For instance, consider the following abstract functions derived from the extracted valence patterns given at the beginning of Section 3.1:

1. 24
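The naming scheme described above can be illustrated with a short Python sketch. The concrete separators and tags are assumptions made for the example (an underscore as separator and `Pass` as the passive voice tag), and the function name `function_names` is hypothetical; only the order of the components follows the description.

```python
from collections import Counter


def function_names(patterns):
    """Derive one function name per valence pattern.

    `patterns` is a list of (frame, verb type, voice) triples. The verb type
    is always appended, the voice tag only for the passive voice, and a
    discriminative number only if the name would otherwise not be unique.
    """
    base_names = []
    for frame, verb_type, voice in patterns:
        name = f"{frame}_{verb_type}"
        if voice == "Pass":
            name += "_Pass"  # assumed spelling of the voice tag
        base_names.append(name)

    totals = Counter(base_names)
    seen = Counter()
    names = []
    for base in base_names:
        if totals[base] == 1:
            names.append(base)
        else:
            seen[base] += 1
            names.append(f"{base}_{seen[base]}")
    return names
```

For the three active-voice Desiring patterns with an intransitive, a transitive and a verb-phrase-complement verb, this scheme yields three distinct names without any discriminative number, since the verb type alone makes them unique.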

In GF, constituents and features of phrases are stored in objects of record types, and functions are applied to such objects to construct phrase trees. In the abstract syntax, the argument types and the value type of a function are separated by right-associative arrows, i.e. all functions are curried. The arguments of a frame function are combined into an object of a dedicated clause type that differs from the RGL clause type Cl. Its linearization type is a record comprising two constituents of RGL types: it is a deconstructed Cl in which the subject NP is separated from the rest of the clause. The motivation for this is to allow for nested frames (see Section 5.1) and for adding non-core FEs before combining the NP and VP parts into a clause (see Section 5.2).

The RGL-subcategorized FEs of the shared valence patterns are declared as common semantic types (categories). Although the conception of BFN states that core FEs are unique to the frame, even though their names are not unique across frames, we do not make such a distinction at the level of types; they are implicitly made frame-specific by the frame functions. The only distinction is based on the syntactic realization.

In order to keep the FE names unique, the RGL types are added as suffixes:

Note that the Focal_participant is typically realized as a noun phrase, but some intransitive Desiring verbs require it as a prepositional phrase (PP); hence this FE is subcategorized using the RGL types NP and Adv (adverbial modifier). In GF, the type Adv covers both adverbs and PPs, and there is no separate type for PPs. Also note that all FEs are specified as optional arguments in the concrete syntaxes, i.e. any FE can be an empty phrase if it is not expressed in the sentence.

The frame-evoking target verb is always expected as the last, mandatory argument. We assume that verbs of the same type evoking the same frame share, in general, a subset of normalized semantico-syntactic valence patterns of that frame. Patterns requiring, for instance, a transitive verb cannot be evoked by an intransitive verb. Otherwise, the current approach does not limit the set of verbs that can evoke a frame, and the set of prepositions that can be used for an FE if it is realized as a PP. We expect that appropriate verbs and prepositions are specified by the application grammar that uses the FrameNet-based grammar as an API. Hence, this approach allows evoking a frame by a metaphor, i.e. an LU that normally evokes another frame.

The design and implementation of the abstract and concrete syntaxes of lexical entries is described in Section 4.

### 3.2 Concrete Syntaxes

The exact behaviour (linearization) of types and functions declared in the abstract syntax is defined in the concrete syntax of each language.

The mapping from the semantic BFN types (FEs) to the syntactic RGL types is straightforward and is shared for all languages in a functor, for instance:

To allow for optional FEs (verb arguments that might not be expressed in the sentence), all linearization types are of type Maybe, whose behaviour is similar to the analogous type in Haskell: a value of type Maybe T either contains a value of type T (represented as Just), or it is not provided (represented as Nothing).

To implement the frame functions, particularly, to fill the verb phrase part of objects, RGL constructors are applied to the arguments depending on their grammatical types and relations, and the grammatical voice. The implementation of functions declared in Section 3.1.4 is systematically generated for English and Swedish as follows:

To the subject field of a clause object, either the value of the corresponding FE argument or an empty phrase is assigned. This choice is handled by a helper function that takes a Maybe value and returns a predefined empty phrase of the respective type if the value is not provided (Nothing); otherwise it returns the provided value. Optional verb complements are handled similarly.
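The handling of optional FEs can be mimicked in Python, where `None` plays the role of a value that is not provided. The sketch below is a deliberately simplified, string-based analogue and not GF code: the names `from_maybe`, `ClauseParts`, `desiring_v2` and `linearize` are hypothetical, and real RGL phrases are records with inflection tables rather than plain strings.

```python
from dataclasses import dataclass
from typing import Optional

EMPTY_PHRASE = ""  # stand-in for a predefined empty phrase


def from_maybe(default: str, value: Optional[str]) -> str:
    """Return the provided value, or the default if no value is provided."""
    return default if value is None else value


@dataclass
class ClauseParts:
    """A deconstructed clause: the subject is kept apart from the verb phrase."""
    np: str
    vp: str


def desiring_v2(experiencer: Optional[str],
                focal_participant: Optional[str],
                verb: str) -> ClauseParts:
    """Toy frame function: FEs in alphabetical order, the verb as last argument."""
    subject = from_maybe(EMPTY_PHRASE, experiencer)
    obj = from_maybe(EMPTY_PHRASE, focal_participant)
    vp = " ".join(part for part in (verb, obj) if part)
    return ClauseParts(np=subject, vp=vp)


def linearize(clause: ClauseParts) -> str:
    """Combine the subject and the verb phrase into a surface string."""
    return " ".join(part for part in (clause.np, clause.vp) if part)
```

Keeping the subject apart from the verb phrase until the last step mirrors the motivation given in Section 3.1: further material can still be attached to the verb phrase before the two parts are combined.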

In order to produce a value of the verb phrase field, RGL constructors and RGL structural words (e.g. the prepositions by and av in English and Swedish respectively) are applied, for instance:

The RGL-based code templates used to implement the above functions can be systematically reused for many other frame functions. Given the set of shared valence patterns, there are only 32 syntactic patterns that cover all 869 semantico-syntactic patterns (Table 8). By syntactic valence patterns we mean patterns that specify only the grammatical types and relations of FEs, and the grammatical voice. As Table 8 shows, the syntactic patterns underlying functions , , and already cover more than 54% of all the shared frame functions. For the same verb types (, , ), other syntactic patterns cover another 39% of frame functions for which the code templates are derived in several ways:

• complements of type are added by recursively applying the respective constructor, or they are eliminated altogether;

• the subject field of the clause is fixed to the empty string if the valence pattern does not include the subject FE (e.g. due to examples only in the imperative mood);

• the agent FE that would be the subject in the active voice but is missing in the passive voice is fixed to the empty string.

The remaining less than 7% of the shared frame functions represent the use of other verb types – , , , and  – for which the respective RGL constructors are applied:

1. [I] do [REMEMBER] [we did a few gigs]

1. [he] [RECOGNIZED] [where he was]

1. [you] specifically [REQUEST] [me] [to do so]

1. [you] [DENIED] [her] [any life of her own]

1. [he] [PERSUADED] [himself] [that they helped]

Note that the type , an embedded declarative sentence, is used only if the subclause can be paraphrased using the subjunction () that; otherwise such FEs are subcategorized as , and the application grammar has to specify the subjunction by applying the RGL constructor .

Also note that FEs of type , and , and encapsulating represent nested frames. We use the types and instead of and to allow for specifying sentence level parameters like tense, anteriority and polarity of the nested frames.

The implementation of frame functions, although currently kept separate for each language, could mostly be shared in a functor thanks to the syntactic abstraction provided by RGL. In general, however, the order of FEs differs across languages.

## 4 FrameNet-Based Lexicon

In GF, there is no formal distinction between syntactic rules and lexical entries. Lexical entries are represented by functions that normally take no arguments and usually but not necessarily return values of lexical categories (e.g. versus ).

LUs between BFN and SweFN (and other framenets) are not explicitly aligned, therefore we first extract and generate a framenet-specific lexicon for each language. Second, we have conducted an experiment to automatically produce a shared lexicon by partially aligning LUs between BFN and SweFN.

### 4.1 Abstract Syntaxes

Following the design of the FrameNet-based grammar, LUs in our approach are subcategorized by GF RGL verb types; therefore, for each LU there are one or more lexical entries in the lexicon.

The abstract lexical identifiers (function names) start with the language-specific base form of the verb. To distinguish between different types and senses of LUs, the verb type and the frame name are appended to the identifiers as illustrated in Tables 9 and 10.

The generation of the abstract language-specific lexicons is straightforward. Given the set of 869 shared valence patterns (Section 3.1.3), we select all the distinct target verbs from the sentence patterns (Section 3.1.1) that belong to the shared patterns. Then we append the corresponding verb type and frame name to the base form of the target verb and declare all the resulting identifiers as nullary functions returning verbs of the respective types.
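The generation step can be sketched as follows. This is a minimal Python illustration under stated assumptions: the input is taken to be a list of (base form, verb type, frame) triples, the underscore separator is assumed rather than taken from Tables 9 and 10, and the sample data and the function name `lexical_identifiers` are hypothetical.

```python
from typing import Iterable, List, Tuple


def lexical_identifiers(patterns: Iterable[Tuple[str, str, str]]) -> List[str]:
    """Build one identifier per distinct (verb, verb type, frame) triple."""
    seen = set()
    identifiers = []
    for verb, verb_type, frame in patterns:
        key = (verb, verb_type, frame)
        if key in seen:
            continue  # the same entry may be attested by many sentence patterns
        seen.add(key)
        # Assumed naming scheme: base form, verb type and frame joined by "_".
        identifiers.append(f"{verb}_{verb_type}_{frame}")
    return identifiers


# Illustrative sample only; not extracted from BFN or SweFN.
sample = [
    ("want", "V2", "Desiring"),
    ("want", "VV", "Desiring"),
    ("want", "V2", "Desiring"),  # duplicate attestation
    ("yearn", "VV", "Desiring"),
]
entries = lexical_identifiers(sample)
```

Here a single LU (want in Desiring) yields two lexical entries because it is attested with two verb types, which is the reason the number of entries exceeds the number of LUs.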

From the BFN corpus, we have extracted 2,831 LUs resulting in 3,432 lexical entries (due to alternative verb types). For Swedish, the numbers are 1,844 and 1,899 respectively. The ratio of lexical entries per LU is considerably smaller for Swedish (1.03 versus 1.21) because of the small number of SweFN examples per LU (around 1.5 versus around 20 in BFN; see Tables 4 and 5 in Section 3.1.2).

### 4.2 Concrete Syntaxes

In order to generate concrete lexicons, first, we have to specify an appropriate inflectional paradigm for each verb independently of its potential senses (frames) and valence types. Inflectional paradigms are represented by language-specific constructors provided in the RGL ParadigmsL modules. Each constructor, which can be overloaded, expects specific verb forms as arguments from which all forms of the verb can be generated, for instance:

1. “feel” “felt” “felt”

2. “want”

3. “yearn” “yearns” “yearned” “yearned” “yearning”

1. “känna” “kände” “känt”

2. “längtar”

3. “vilja” “vill” “vilj” “ville” “velat” “velad”

The first argument usually is the base form, but it can be another form from which the base form can be straightforwardly derived (e.g. längtar ‘[one] longs’).

We extract such verb-constructor pairs from the existing monolingual and multilingual RGL dictionaries and other modules (in the reverse order of preference):

1. L/DictL (6,034 pairs for English, 7,324 for Swedish)

2. translator/DictionaryL (6,037 pairs for English, 2,430 for Swedish)

3. L/LexiconL (98 pairs for English, 96 for Swedish)

4. L/IrregL (173 pairs for English, 182 for Swedish)

5. L/StructuralL (2 pairs for English, 4 for Swedish)

In total, we have extracted constructors for 6,040 English verbs and 7,492 Swedish verbs. Still, 59 BFN verbs and 28 SweFN verbs are out-of-vocabulary.28
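The merging of sources can be sketched as follows. This is a minimal Python illustration that reads "reverse order of preference" as meaning that sources later in the list override earlier ones; the function name `merge_constructors` and the toy constructor strings are hypothetical and are not taken from the RGL modules.

```python
from typing import Dict, List


def merge_constructors(sources: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge verb-constructor maps; later (more preferred) sources override."""
    merged: Dict[str, str] = {}
    for source in sources:
        merged.update(source)
    return merged


# Toy data: the constructor strings are illustrative only.
dict_eng = {"feel": 'mkV "feel"', "want": 'mkV "want"'}
irreg_eng = {"feel": 'mkV "feel" "felt" "felt"'}

# Sources are listed from least to most preferred.
constructors = merge_constructors([dict_eng, irreg_eng])
```

A verb found in several modules thus keeps the constructor from the most preferred module, while verbs found in only one module are still covered.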

Second, for each lexical entry, we generate its linearization rule based on (i) the corresponding verb constructor, (ii) particles and reflexive pronouns, if any, that constitute the LU and (iii) the verb type of the lexical entry, for instance:

1. “want”

1. “känna” “kände” “känt” “för”

2. “känna” “kände” “känt”

As a result, we were able to generate linearization rules for 3,350 (98%) out of 3,432 BFN entries and for 1,789 (94%) out of 1,899 SweFN entries.

At this point, it should be noted that each sentence pattern (Section 3.1.1) includes not only a reference to the LU but also a morphological description of the LU constituents, which is important in the case of multi-word expressions (MWE), e.g. feel like, känna för, känna sig etc. Moreover, the morphological descriptions are unified across languages according to the universal POS tags29 and features30 allowing for a common generator of concrete lexicons.

The current approach to the FrameNet-based grammar and lexicon supports linearization of relatively simple MWEs that, apart from the main verb, include particles (constructor ) and reflexive pronouns (constructor ) in any combination.

Considering only the shared frame valence patterns, we have extracted 98 such lexical entries for English, which is about 3% of all entries extracted from BFN and about 84% of all MWE entries extracted from BFN. All these entries correspond to the same morphological pattern:

1.

where (adposition) represents a particle. For Swedish, we have extracted 465 such entries, which is about 25% of all entries and about 85% of all MWE entries extracted from SweFN. In addition to the MWE pattern found in BFN, SweFN covers several other patterns of simple MWEs:

1.

2.

3.

4.

Patterns of currently unsupported, more complex MWEs are summarized in Table 11. This leads to 19 MWE entries in the English lexicon and 82 MWE entries in the Swedish lexicon having no linearization.
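The reported MWE shares can be cross-checked with a short calculation. This Python sketch assumes that the total number of MWE entries per framenet is the sum of the supported and unsupported entries given in the text; the text does not state that total directly.

```python
# Figures reported in the text.
supported = {"BFN": 98, "SweFN": 465}    # simple MWE entries with a linearization
unsupported = {"BFN": 19, "SweFN": 82}   # more complex MWE entries without one
all_entries = {"BFN": 3432, "SweFN": 1899}

mwe_stats = {}
for fn in supported:
    # Assumption: supported + unsupported = all MWE entries of that framenet.
    total_mwe = supported[fn] + unsupported[fn]
    mwe_stats[fn] = {
        "total_mwe": total_mwe,
        "share_of_all": round(100 * supported[fn] / all_entries[fn], 1),
        "share_of_mwe": round(100 * supported[fn] / total_mwe, 1),
    }
    print(fn, mwe_stats[fn])
```

Under this assumption the totals are 117 MWE entries for BFN and 547 for SweFN, and the computed shares agree with the rounded percentages given above.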

To address this issue, we could include lexical entries of type implying a similar syntactic valence as for verbs of type . However, this would require introducing separate frame functions. An alternative approach would be to extend the notion and support of particle verbs in RGL so that “particles” could be involved in the syntactic agreement.

### 4.3 Aligning Lexical Units Across Languages

The multilingual RGL lexicons – the large translation dictionary (modules DictionaryL) and the small lexicon of frequently used words (modules LexiconL) – can be used not only for the extraction of verb constructors but also for aligning LUs (i.e. lexical entries) across languages.

Let us consider the following example. For the frame Desiring, we have extracted several lexical entries of type as shown in Table 9 for English and in Table 10 for Swedish.

If we search for the English verbs feel, want and yearn, and for the Swedish verbs känna and vilja in the RGL modules DictionaryEng and DictionarySwe respectively, we find these mappings (among others):

1. DictionaryEng:

2. DictionarySwe: “känna” “kände” “känt”

1. DictionaryEng: “want”

2. DictionarySwe:

1. DictionaryEng: “yearn” “yearns” “yearned” …

2. DictionarySwe: “trängtar”

suggesting the following alignment between the framenet-specific lexicons:

We have collected all such suggestions in a separate shared lexicon where BFN identifiers are used as interlingua symbols in the abstract syntax, and the framenet-specific lexicons are used as resource libraries to implement the concrete syntaxes.

The generation of the concrete English lexicon is trivial, for instance:

The concrete Swedish lexicon is generated as illustrated in the alignment example above, and it can include alternative variants, for instance:

meaning that all variants will be considered while parsing a sentence, but only the first variant will be used for linearization. Currently, variants are ordered so that MWEs follow simple verbs; otherwise they are given in alphabetical order. Ideally, however, they should be ordered at least by frequency.
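The current ordering of variants can be sketched as follows. This is a minimal Python illustration under two assumptions: a variant counts as an MWE if it consists of more than one whitespace-separated token, and plain code-point order stands in for alphabetical order (it is not Swedish collation). The function name `order_variants` is hypothetical.

```python
from typing import List


def order_variants(variants: List[str]) -> List[str]:
    """Simple verbs first, then MWEs; alphabetical within each group."""
    # The sort key puts single-token variants (False) before MWEs (True).
    return sorted(variants, key=lambda v: (len(v.split()) > 1, v))


ordered = order_variants(["känna för", "vilja", "känna sig", "känna"])
```

Ordering by frequency instead would only require replacing the second component of the sort key with a (negated) corpus count.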

In the case of MWEs, we search for alignment variants based on the main verb if there is no match for the whole MWE. This improves the coverage (as it is illustrated with feel like above) but sometimes leads to incorrect alignments, for instance, exhale has been aligned with andas in ‘inhale’:

As a result, we have aligned 703 BFN entries (21%) with 900 SweFN entries (47%). The approach is promising, and there is clear room for improvement:

1. The alignment procedure failed for about 30% of BFN entries because of missing linearization for nearly 800 DictionarySwe entries.

2. For nearly half of the BFN entries, no alignment was found because there was no match among the SweFN entries of the same type belonging to the same frame, which is a consequence of the comparatively small size of SweFN (2.2 SweFN entries versus 4 BFN entries per shared valence pattern).
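The alignment procedure of this section, including the fallback to the main verb of an MWE, can be sketched as follows. This is a simplified Python illustration, not the actual implementation: the dictionaries are modelled as plain mappings, the main verb is assumed to be the first token of an MWE, and the function name `align`, the identifiers and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def align(bfn_lu: str,
          eng_to_ids: Dict[str, Set[str]],
          ids_to_swe: Dict[str, Set[str]],
          swefn_candidates: Set[str]) -> List[str]:
    """Suggest SweFN verbs for a BFN LU via shared dictionary identifiers."""
    def lookup(form: str) -> Set[str]:
        swedish: Set[str] = set()
        for ident in eng_to_ids.get(form, set()):
            swedish |= ids_to_swe.get(ident, set())
        # Keep only verbs attested in SweFN for the same frame and verb type.
        return swedish & swefn_candidates

    matches = lookup(bfn_lu)
    if not matches and " " in bfn_lu:
        # Fall back to the main verb of a multi-word expression.
        matches = lookup(bfn_lu.split()[0])
    return sorted(matches)


# Toy dictionaries; the shared identifiers are hypothetical.
eng_to_ids = {"feel": {"feel_V2"}, "want": {"want_VV"}}
ids_to_swe = {"feel_V2": {"känna"}, "want_VV": {"vilja"}}
swefn_desiring = {"känna", "vilja"}

suggestions = align("feel like", eng_to_ids, ids_to_swe, swefn_desiring)
```

The fallback is what improves coverage for MWEs, and it is also the step that can produce incorrect pairs when the particle changes the meaning of the main verb.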

## 5 Case Studies

We illustrate the use of the FrameNet-based API to GF RGL by re-engineering two existing multilingual CNL grammars: one for translating standard tourist phrases (Ranta et al, 2010) and another for generating descriptions of paintings (Dannélls et al, 2012), both developed in the MOLTO project.31 In both cases, we preserve the original functionality, and we do not make any changes in the application abstract syntax. Changes affect only the concrete syntaxes of English and Swedish.

### 5.1 Phrasebook

Although the Phrasebook grammar covers many idiomatic expressions that cannot be translated using the same frame or for which our approach would not be suitable as such, it includes around 20 complex clause-building functions that can be handled by the FN-based grammar. To illustrate the use of the FN-based grammar as a semantic API, we re-implement the following Phrasebook functions:

ALive   : Person -> Country -> Action  -- e.g. 'we live in Sweden'
AWant   : Person -> Object -> Action   -- e.g. 'I want a pizza'
AWantGo : Person -> Place -> Action    -- e.g. 'I want to go to a museum'


by applying the frame functions and introduced in Section 3, and some additional functions:

Motion_V_2    : Goal_Adv -> Source_Adv -> Theme_NP -> V -> Clause
Possession_V2 : Owner_NP -> Possession_NP -> V2 -> Clause
Residence_V   : Location_Adv -> Resident_NP -> V -> Clause


By using RGL constructors, ALive is implemented for English, Swedish and other languages in the same way, except that different verbs are used:

ALive p co = mkCl p.name (mkVP (mkVP (mkV "live")) (mkAdv in_Prep co))  -- Eng
ALive p co = mkCl p.name (mkVP (mkVP (mkV "bo")) (mkAdv in_Prep co))    -- Swe


First, the language-specific verbs can be factored out by introducing a shared abstract verb in the domain lexicon (e.g. that links and ). Second, the implementation of ALive can be done in a shared functor by using the FN-based API:

ALive p co = let cl : Clause =
in mkCl cl.np cl.vp


For AWant, neither the original RGL-based nor the current FN-based implementation can be done in the functor because, in Swedish, the verb vilja ‘to want’ evoking Desiring requires the auxiliary verb ha ‘to have’. This can be seen as a nested auxiliary frame Possession:

AWant p obj = mkCl p.name (mkV2 (mkV "want")) obj       -- Eng
Desiring_V2_Act (Just NP p.name) (Just NP obj) want_V2

AWant p obj = mkCl p.name want_VV (mkVP L.have_V2 obj)  -- Swe
Desiring_VV
(Just VP (Possession_V2 (Nothing NP) (Just NP obj) have_V2).vp)
(Just NP p.name) want_VV


Assuming that the auxiliary verb can optionally be used also with other Swedish verbs when applying this frame function, the nested frame could be hidden in the Swedish implementation of that function. This, however, is not the case with AWantGo, which in both languages requires a main nested frame and, thus, can be put in the functor:

AWantGo p place = mkCl p.name want_VV (mkVP (mkVP go_V) place.to)

Desiring_VV (Just VP
(Just NP p.name) want_VV


At first glance, the new code might look more complex; however, it does not specify how the verb phrases are built, and the same uniform code template is used in all cases.

The re-implemented version of Phrasebook accepts and generates the same set of sentences as before.32

### 5.2 Paintings

The painting grammar is a part of a large-scale controlled NLG grammar developed for the cultural heritage domain in order to verbalize data about museum objects stored in an RDF-based ontology (Dannélls et al, 2012). A set of RDF triples (subject-predicate-object expressions) forms the input to the application. As an example, a simplified set of triples representing information about the artwork Bacchus is given below:

<Bacchus> <createdBy> <Leonardo_da_Vinci>
<Bacchus> <hasDimension> <Bacchus_ImageDimesion>
<Bacchus> <hasCreationDate> <Bacchus_CreationDate>
<Bacchus> <hasCurrentLocation> <Musee_du_Louvre>
<Bacchus_ImageDimesion> <lengthValue> 115
<Bacchus_ImageDimesion> <heightValue> 177
<Bacchus_CreationDate> <timePeriodValue> 1510


This information is combined by the grammar to generate a coherent text. A simplified abstract function that combines the triples is

DPainting : Painting -> Painter -> Year -> Size -> Museum -> Description


Each argument of the function corresponds to a class in the ontology. In Figure 3, we show how the arguments are linearized in the original concrete syntax for English and how this syntax has been adapted to generate descriptions via the FN-based grammar. To adapt the original grammar, we first identified the frames that match the target verbs in the linearization rules. Then we matched the core FEs of the identified frames with the verb arguments.
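How the example triples map onto the arguments of DPainting can be sketched as follows. This is a minimal Python illustration that treats the triples as plain tuples and resolves the intermediate dimension and date nodes by lookup; the function name `painting_arguments` and the dictionary-based result are hypothetical and are not part of the application grammar.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, object]

# The simplified triples from the example above.
triples: List[Triple] = [
    ("Bacchus", "createdBy", "Leonardo_da_Vinci"),
    ("Bacchus", "hasDimension", "Bacchus_ImageDimesion"),
    ("Bacchus", "hasCreationDate", "Bacchus_CreationDate"),
    ("Bacchus", "hasCurrentLocation", "Musee_du_Louvre"),
    ("Bacchus_ImageDimesion", "lengthValue", 115),
    ("Bacchus_ImageDimesion", "heightValue", 177),
    ("Bacchus_CreationDate", "timePeriodValue", 1510),
]


def painting_arguments(painting: str, data: List[Triple]) -> Dict[str, object]:
    """Collect the five DPainting arguments for one painting."""
    # Index the triples by subject, then by predicate.
    index: Dict[str, Dict[str, object]] = {}
    for subject, predicate, obj in data:
        index.setdefault(subject, {})[predicate] = obj
    props = index[painting]
    size = index[props["hasDimension"]]      # follow the link to the dimension node
    date = index[props["hasCreationDate"]]   # follow the link to the date node
    return {
        "Painting": painting,
        "Painter": props["createdBy"],
        "Year": date["timePeriodValue"],
        "Size": (size["lengthValue"], size["heightValue"]),
        "Museum": props["hasCurrentLocation"],
    }


args = painting_arguments("Bacchus", triples)
```

The resulting values correspond one-to-one to the argument categories Painting, Painter, Year, Size and Museum in the signature of DPainting.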

Since the FN-based grammar currently does not cover non-core FEs, the adjunct Year is associated with no FE in Create_physical_artwork. Instead, it is attached to the corresponding clause in the final linearization rule (), illustrating how non-core FEs can be incorporated.

The grammar exploits patterns of frames Create_physical_artwork, Dimension and Placing:

Create_physical_artwork_V2_Pass :
Creator_NP -> Representation_NP -> V2 -> Clause
Dimension_V2 : Measurement_NP -> Object_NP -> V2 -> Clause
Placing_V2_Pass : Goal_Adv -> Theme_NP -> V2 -> Clause

Alternatively, we could easily replace the frame Placing with Being_located, evoked by the one-place verb hang in the active voice, which would preserve the meaning but alter the linearization.

The Swedish syntax was adapted in the same way. Descriptions generated by the new versions of DPainting are virtually equivalent to the descriptions produced by the original grammar.33 The only difference in comparison to the original grammar is that in Swedish we have imposed the use of the main verb mäta ‘to measure’ instead of the copula:

Eng: Bacchus was painted by Leonardo da Vinci in 1510. It measures 115 by 177 cm. This work is displayed at the Musée du Louvre.
Swe: Bacchus målades av Leonardo da Vinci år 1510. Den mäter 115 gånger 177 cm. Det här verket är utställt på Louvren.

## 6 Evaluation

We have conducted a simple intrinsic and extrinsic evaluation of the acquired FN-based grammar and lexicon. For an initial intrinsic evaluation, we count the number of examples in the source corpora that belong to the set of shared frames and that are covered by the shared semantico-syntactic valence patterns. Corpus examples are judged by the sentence patterns that represent them, disregarding non-core FEs, concrete prepositions and the word order, but including syntactic roles and the grammatical voice. This means that the original sentences are, in general, covered by paraphrasing.

We have extracted 57,615 examples from BFN and 3,348 examples from SweFN that belong to the shared set of 483 frames. For both BFN and SweFN, the concise set of 869 patterns covers 77.5% of those examples. This indicates that the set of shared patterns includes the most frequently used ones despite the modest amount of the annotated example sentences in SweFN.

Based on the FN-annotated sentences covered by the shared valence patterns, and the GF RGL type system for verbs, we have extracted 3,432 lexical entries (subcategorized LUs) from BFN, and 1,899 entries from SweFN. LUs between BFN and SweFN are not directly aligned; therefore, a specific lexicon is generated for each language. However, a partial shared lexicon has been automatically derived on top of the language-specific lexicons, currently providing a mapping between 703 LUs in BFN and 900 LUs in SweFN. The shared lexicon covers 25.1% (11,223) of BFN sentences and 35.8% (928) of SweFN sentences, counting only those sentences which are represented by the shared valence patterns.
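The lexicon coverage figures can be reproduced with a short calculation. This Python sketch assumes that the denominator for each framenet is the number of extracted examples multiplied by the 77.5% pattern coverage reported above; the exact sentence counts behind that percentage are not given in the text.

```python
# Figures reported in the text.
examples = {"BFN": 57615, "SweFN": 3348}       # examples in the shared frames
pattern_coverage = 0.775                       # share covered by the 869 patterns
lexicon_sentences = {"BFN": 11223, "SweFN": 928}

lexicon_share = {}
for fn, total in examples.items():
    # Assumed denominator: sentences represented by the shared valence patterns.
    covered = total * pattern_coverage
    lexicon_share[fn] = round(100 * lexicon_sentences[fn] / covered, 1)
```

Under this assumption the computed shares are 25.1% for BFN and 35.8% for SweFN, matching the reported values.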

For an initial extrinsic evaluation, we compare the original application grammars with their FN-based counterparts in terms of code complexity. Since we do not modify the abstract syntax of the application grammars, the number of linearization rules remains the same. Therefore, we count the number of constructors used to linearize the functions. In the painting application, the number of constructors is considerably reduced, from 21 to 13. In the case of Phrasebook, the number is slightly reduced, from 10 in English and 11 in Swedish to 8 in both languages.

Another aspect of the evaluation with regard to the original application grammars is the large number of accurate high-level frame constructors which are available to the CNL application developers. Instead of having to search for typical and valid syntactic patterns in a corpus to match the semantic representations of the application and domain, and to implement them, developers can choose among the abstract but still corpus-based semantico-syntactic patterns. The frame semantics is consistent and can be mapped to the semantic representations of various applications in various domains having different levels of expressiveness.

## 7 Discussion

The presented approach is based on several assumptions that limit the scope of the shared grammar and lexicon. The first difficulty is the small number of annotated example sentences in SweFN. The difference in the number of examples has become noticeable in the set of extracted shared valence patterns. Without going into further methodological details, we should note that the approach taken in the development of SweFN was more lexicographically focused, putting emphasis on enhancing frames with LUs rather than supplementing each LU with example sentences. One way of adding more valence patterns for verbs is to use the morpho-syntactic descriptions provided in the SIMPLE/PAROLE lexicons that are a part of the larger SweFN++ project. These lexicons contain descriptive linguistic analyses for around 3,000 Swedish verbs. Adding this information to the FN-annotated examples can yield a larger, more representative set of shared valence patterns.

Furthermore, the extraction of verb valence patterns practically assumes varied semantic descriptions, as well as large amounts of sentence examples that are representative for the language in question. While the BFN approach is likely to suggest frequent patterns and more general linguistic descriptions, the SweFN approach is more likely to cover the linguistic variation for each verb. The question of how to balance between the two approaches has to be dealt with.

Another difficulty is selecting shared patterns in the case of more than two languages. The alternatives are: (1) an intersection of all languages, which means that the set of shared patterns inevitably gets smaller with each new language, but the intersection becomes more and more prototypical, provided that the corpora are of a reasonable size and coverage; (2) a union of intersections of language pairs, which, on the one hand, would lead to functions temporarily having no linearization in one or another language, but which, on the other hand, would be an efficient way to reveal non-compositional constructions and provide cross-lingual hints to the FN annotators and lexicographers.
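The two alternatives can be contrasted with a small set-based sketch. This is a minimal Python illustration on toy data: the pattern names, the placeholder language code "xx" for a hypothetical third framenet, and the function names are all assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, Set


def intersection_of_all(patterns: Dict[str, Set[str]]) -> Set[str]:
    """Alternative (1): patterns attested in every language."""
    return set.intersection(*patterns.values())


def union_of_pairwise_intersections(patterns: Dict[str, Set[str]]) -> Set[str]:
    """Alternative (2): patterns shared by at least one pair of languages."""
    shared: Set[str] = set()
    for a, b in combinations(patterns, 2):
        shared |= patterns[a] & patterns[b]
    return shared


# Toy pattern sets; "xx" is a placeholder for a hypothetical third framenet.
patterns = {
    "en": {"p1", "p2", "p3"},
    "sv": {"p1", "p2", "p4"},
    "xx": {"p1", "p3", "p4"},
}
strict = intersection_of_all(patterns)
relaxed = union_of_pairwise_intersections(patterns)

# Under alternative (2), these functions would lack a Swedish linearization.
missing_in_sv = relaxed - patterns["sv"]
```

In this toy setting the strict intersection keeps a single pattern, whereas the relaxed variant keeps all four but leaves one of them without a linearization in Swedish, which is the kind of gap that could be reported back to annotators.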

Non-compositional translation equivalents, where verb types differ or where verbs do not have any counterpart in the other language, are yet another issue. In SweFN, we find a range of verbs that lack an exact translation in English, such as vabba ‘to stay home because of a sick child’, heta ‘to be named’, duka ‘to set the table’, diska ‘to wash the dishes’ and bädda ‘to make the bed’. A related question is to what extent these can be represented in the grammar and how we can represent them automatically. One possible solution is the reuse of the GF RGL monolingual and multilingual dictionaries. Another solution is finding complementary resources for constructing the FN-based lexicons, for example by using WordNet for linking LUs.

For non-shared patterns and non-compositional translation equivalents, language-specific extra modules can be introduced. This would increase the coverage not only in monolingual applications but also in multilingual applications; however, it would require a manual, application-specific mapping between different frames.

The presented approach has some advantages with regard to GF RGL. It can potentially provide feedback to the RGL monolingual and multilingual dictionaries, yielding mutual benefits, such as: (1) verification of verb types; (2) verification of particle verbs; and (3) suggestion of new entries.

## 8 Related Work

The main difference between this work and the previous approaches to CNL grammars is that we present an effort to exploit a robust and well-established semantic model in the grammar development. Our approach can be compared with the work on multilingual verbalization of modular ontologies using GF and lemon (Davis et al, 2012), the Lexicon Model for Ontologies. We use additional lexical information about syntactic arguments for building the concrete syntax.

The grounding of NLG using the frame semantics theory has been addressed in the work on text-to-scene generation (Coyne et al, 2011) and in the work on text generation for navigational tasks (Roth and Frank, 2010). In that research, the content of frames is utilized through alignment between the frame-semantic structure and the domain-semantic representation. Discourse is supported by applying aggregation and pronominalization techniques. In the cultural heritage use case, we also show how an application which utilizes the FN-based grammar can become more discourse-oriented; something that is necessary in actual NLG applications and that has been demonstrated in GF before (Dannélls, 2010). In our current approach, the semantic representation of the domain and the linguistic structures of the grammar are based on FN-annotated data.

As suggested before (Gruzitis and Barzdins, 2010), a FN-like approach can be used to deal with polysemy in CNL texts. Although we consider lexicalisation alternatives and restrictions for LUs and FEs, we do not address the problem of selectional restrictions and word sense disambiguation in general.

## 9 Conclusion

In this article, we presented a computational approach to multilingual grammar and lexicon extraction and generation from FN-annotated corpora. The methodology for constructing the grammars and the lexicons was evaluated in a series of experiments. The results show that we are able to generalize over a set of valence patterns to capture the semantics and the syntax of two languages having a shared FN-based abstract syntax. We discussed a number of potential improvements that would lead to a larger coverage; however, the current coverage is already of practical use. We have tested the feasibility of the generated grammar library as a semantic API for developing CNL applications in GF. The major advantage is that language-dependent clause-level specifications are to a large extent hidden by the semantic API, making the application grammars more robust and flexible.

### Footnotes

1. thanks: This work has been supported by the Swedish Research Council under Grant No. 2012-5746 (Reliable Multilingual Digital Communication: Methods and Applications) and by the Centre for Language Technology in Gothenburg. The research leading to these results has received funding also from the Latvian State Research Programme NexIT (Project No. 1).
2. email: normunds.gruzitis@cse.gu.se, normunds.gruzitis@lu.lv
3. email: normunds.gruzitis@cse.gu.se, normunds.gruzitis@lu.lv
4. journal: Language Resources and Evaluation
5. https://framenet.icsi.berkeley.edu/fndrupal/framenet_data
6. http://remu.grammaticalframework.org/framenet/SweFN_2014-12-03.zip
7. https://framenet.icsi.berkeley.edu/
8. http://spraakbanken.gu.se/swefn/
9. http://spraakbanken.gu.se/saldo
10. http://spraakbanken.gu.se/swe/resurs/wordnet-saldo
11. http://spraakbanken.gu.se/eng/korp-info
12. http://www.grammaticalframework.org/lib/doc/synopsis.html
13. https://github.com/GrammaticalFramework/gf-contrib/tree/master/framenet
(The acquired grammar and lexicon; version 0.9.7 at the time of writing.)
14. http://www.grammaticalframework.org/framenet/
15. Where is a VP-modifying adverb,  – an embedded declarative sentence, and  – an embedded question.
16. Where is a one-place verb,  – a two-place verb,  – a three-place verb,  – a -complement verb,  – an -complement verb,  – a -complement verb,  – a verb with and complements,  – a verb with and complements, and  – a verb with and complements.
17. http://universaldependencies.github.io/docs/u/feat/Voice.html
18. SweFN tags are described at http://stp.lingfil.uu.se/~nivre/swedish_treebank/
19. Additionally, more than 100 examples are skipped in both corpora due to inconsistent semantico-syntactic annotations that were not fixed by the current heuristics.
20. If repeated FEs are of different RGL types, the whole example is currently skipped.
21. Taking into account the grammatical types and relations.
22. It is often practically impossible or uncommon that all core FEs are used in the same sentence. For instance, Area is mutually exclusive with five other core FEs in the frame Motion, and these five other -typed FEs normally are not used altogether.
23. E.g. in a high