# SLING: A framework for frame semantic parsing

## Abstract

We describe SLING, a framework for parsing natural language into semantic frames. SLING supports general transition-based, neural-network parsing with bidirectional LSTM input encoding and a Transition Based Recurrent Unit (TBRU) for output decoding. The parsing model is trained end-to-end using only the text tokens as input. The transition system has been designed to output frame graphs directly without any intervening symbolic representation. The SLING framework includes an efficient and scalable frame store implementation as well as a neural network JIT compiler for fast inference during parsing. SLING is implemented in C++ and it is available for download on GitHub.

## 1 Introduction

Recent advances in machine learning make it practical to train recurrent multi-level neural network classifiers, allowing us to rethink the design and implementation of natural language understanding (NLU) systems.

Earlier machine-learned NLU systems were commonly organized as pipelines of separately trained stages for syntactic and semantic annotation of text. A typical pipeline would start with part-of-speech (POS) tagging, followed by constituency or dependency parsing for syntactic analysis. Using the POS tags and parse trees as feature inputs, later stages in the pipeline could then derive semantically relevant annotations such as entity and concept mentions, entity types, coreference relationships, and semantic roles (SRL).

For simplicity and efficiency, each stage in a practical NLU pipeline would just output its best hypothesis and pass it on to the next stage [?]. Obviously, errors could then accumulate throughout the pipeline making it much harder for the system to perform accurately. For instance, F1 on SRL drops by more than 10% when going from gold to system parse trees [?].

However, applications may not need the intermediate annotations produced by the earlier stages of an NLU pipeline, so it would be preferable if all stages could be trained together to optimize an objective based on the output annotations needed for a particular application.

Earlier NLU pipelines often used linear classifiers for each stage. Linear classifiers achieve simplicity and training efficiency at the expense of feature complexity, requiring elaborate feature extraction, many different feature types, and feature combinations to achieve reasonable accuracy. With deep learning, we can use embeddings, multiple layers, and recurrent network connections to reduce the need for complex feature design. The internal learned representations in model hidden layers replace the hand-crafted feature combinations and intermediate representations in pipelined systems.

The SLING parser exploits deep learning to bypass those limitations of classic pipelined systems. It is a transition-based parser that outputs frame graphs directly without any intervening symbolic representation (see Section 5). Transition-based parsing is often associated with dependency parsing, but we have designed a specialized transition system that outputs frame graphs instead of dependency trees.

We use a recurrent feed-forward unit for predicting the actions in the transition sequence, where the hidden activations from predicting each transition step are fed back into subsequent steps. A bidirectional LSTM (biLSTM) encodes the input into a sequence of vectors (Figure 1). This neural network architecture has been implemented using DRAGNN [?] and TensorFlow [?].

The SLING framework and a semantic parser built in it are now available as open-source code on GitHub.¹

In Section 2 we introduce frame semantics, the linguistic theory that inspired SLING, and in Section 3 the SLING frame store, a C++ framework for representing and storing semantic frames compactly and efficiently. Section 4 introduces the parser’s frame-semantics-oriented attention mechanism, and Section 5 describes the transition system used for producing frame graphs. Section 6 describes the features used by the parser. In Sections 7 and 8 we describe our experiments on OntoNotes, and Section 9 describes the fast parser runtime.

## 2 Frame semantics

While frames in SLING are not tied to any particular linguistic theory or knowledge ontology, they are inspired by frame semantics, the theory of linguistic meaning originally developed by Charles Fillmore [?]. Frame semantics connects linguistic semantics to encyclopedic knowledge, with the central idea that understanding the meaning of a word requires access to all the essential knowledge that relates to that word. A word evokes a frame representing the specific concept it refers to.

A semantic frame is a set of statements that give “characteristic features, attributes, and functions of a denotatum, and its characteristic interactions with things necessarily or typically associated with it.” [?]. A semantic frame can also be viewed as a coherent group of concepts such that complete knowledge of one of them requires knowledge of all of them.

Frame semantics is not just for individual concepts, but can be generalized to phrases, entities, constructions, and other larger and more complex linguistic and ontological units. Semantic frames can also model world knowledge and inferential relationships in common sense, metaphor [?], metonymy, action [?], and perspective [?].

## 3 Frames in SLING

SLING represents frames with data structures consisting of a list of slots, where each slot has a name (role) and a value. The slot values can be literals like numbers and strings, or links to other frames. A collection of interlinked frames can thus be seen as a directed graph where the frames are the (typed) nodes and the slots are the (labeled) edges. A frame graph can also be viewed as a feature structure [?] and unification can be used for induction of new frames from existing frames. Frames can also be used to represent more basic data structures such as a C struct with fields, a JSON object, or a record in a database.
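The slot-list view of frames can be illustrated with a minimal Python sketch. This is not the SLING C++ frame store or its API; the `Frame` class, the `edges` helper, and the use of an `isa` slot holding a type name as a plain string are assumptions made only for illustration.

```python
class Frame:
    """Illustrative frame: an ordered list of (role, value) slots."""

    def __init__(self):
        self.slots = []

    def add(self, role, value):
        """Append a slot; the value is a literal or another Frame."""
        self.slots.append((role, value))

    def get(self, role):
        """Return the value of the first slot named `role`, or None."""
        for name, value in self.slots:
            if name == role:
                return value
        return None


def edges(frames):
    """View interlinked frames as a directed graph with labeled edges."""
    return [(frame, role, value)
            for frame in frames
            for role, value in frame.slots
            if isinstance(value, Frame)]


# Hypothetical example: a "hit" frame whose /pb/arg0 slot links to "John".
john = Frame()
john.add("isa", "/saft/person")   # literal slot value (assumed slot name)
john.add("name", "John")

hit = Frame()
hit.add("isa", "/pb/hit-01")
hit.add("/pb/arg0", john)         # link to another frame: a graph edge

graph_edges = edges([john, hit])
```

Only slots whose values are frames contribute edges, so `graph_edges` holds the single labeled edge from `hit` to `john`, while the literal slots stay local to their frames.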

SLING frames live inside a frame store. A store is a container that tracks all the frames that have been allocated in the store, and serves as a memory allocation arena for them. When making a new frame, one specifies the store where the frame should be allocated. The frame will live in this store until the store is deleted or the frame is garbage collected because there are no remaining live references to it.²

SLING frames are externally represented in a superset of JSON that allows references between frames (JSON objects) with the #n syntax. Frames can be assigned identifiers (ids) using the =#n syntax. SLING frames can have both numeric and named ids and both slot names and values can be frame references. Where JSON objects can only represent trees, SLING frames can be used for representing arbitrary graphs. SLING has special syntax for built-in slot names:

Documents are also represented using frames, where the document frame has slots for the document text, the tokens, and the mention phrases and the frames they evoke. See Figure ? for an example.

## 4 Attention

The SLING parser is a kind of sequence-to-sequence model. It first encodes the input token sequence with a bidirectional LSTM encoder and then runs the transition system on that encoding to produce a sequence of transitions. Each transition updates the system state, which together with the input encoding forms the input to the feed-forward cell that predicts the next transition (Figure 1).

Sequence-to-sequence models often rely on an “attention” mechanism to focus the decoder on the parts of the input most relevant for producing the next output symbol. In this work, however, we use a somewhat different attention mechanism, loosely inspired by neuroscience models of attention and awareness [?]. In our model, attention focuses on parts of the frame representation that the parser has created so far, rather than focusing on (encodings of) input tokens as is common for other sequence-to-sequence attention mechanisms.

We maintain an attention buffer as part of the transition system state. This is an ordered list of frames, where the order represents closeness to the center of attention. Transition system actions maintain the attention buffer, bringing a frame to the front when the frame is evoked or re-evoked by the input text. When a new frame is evoked, it will merge the concept and its roles into a new coherent chunk of meaning, which is represented by the new frame and its relations to other frames, and this will become the new center of attention. Our hypothesis is that by maintaining this attention mechanism, we only need to look at a few recent frames brought into attention to build the desired frame graph.
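The behaviour of the attention buffer can be sketched as a move-to-front list. The class and method names below (`AttentionBuffer`, `evoke`, `focus`, `top`) are hypothetical and chosen for illustration; they are not taken from the SLING code base.

```python
class AttentionBuffer:
    """Ordered list of frames; index 0 is the center of attention."""

    def __init__(self):
        self._frames = []

    def evoke(self, frame):
        """A newly evoked frame becomes the center of attention."""
        self._frames.insert(0, frame)

    def focus(self, index):
        """Bring the frame at `index` to the front (e.g. when re-evoked)."""
        frame = self._frames.pop(index)
        self._frames.insert(0, frame)
        return frame

    def top(self, k):
        """Return the k frames closest to the center of attention."""
        return self._frames[:k]


buf = AttentionBuffer()
for name in ["A", "B", "C"]:   # strings stand in for frames
    buf.evoke(name)            # buffer is now C, B, A
buf.focus(2)                   # re-evoke A: buffer is now A, C, B
recent = buf.top(2)
```

Restricting later computations to `top(k)` corresponds to the hypothesis that only a few recently attended frames are needed to build the frame graph.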

## 5 Transition system

Transition systems are widely used in parsing to build dependency parse trees as a side effect of performing a sequence of state transitions $(s_i, a_i)$, where $s_i$ is a state and $a_i$ is an action. Action $a_i$ computes the new state $s_{i+1}$ from state $s_i$. For example, the arc-standard transition system [?] uses a sequence of SHIFT, LEFT-ARC(label), and RIGHT-ARC(label) actions, operating on a state whose main component is a stack, to build a dependency parse tree.

We use the same idea to construct a frame graph where frames can be evoked by phrases in the input. But instead of using a stack in the state, we use the attention buffer introduced in the previous section that keeps track of the most salient frames in the discourse.

The attention buffer is a priority list of all the frames evoked so far. The front of the buffer serves as the working memory for the parser. Actions operate on the front of the buffer and in some cases other frames in the buffer. The transition system simultaneously builds the frame graph and maintains the attention buffer by moving the frame involved in an action to the front of the attention buffer. At any time, each evoked frame has a unique position in the attention buffer.

The transition system consists of the following actions:

• SHIFT – Moves to next input token. Only valid when not at the end of the input buffer.

• STOP – Signals that we have reached the end of the parse. This is only valid when at the end of the input buffer. Multiple STOP actions can be added to the transition sequence, e.g. to make all sequences in a beam have the same length. After a STOP is issued, no other actions are permitted except more STOP actions.

• EVOKE(type, n) – Evokes a frame of type type from the next n tokens in the input. The evoked frame is inserted at the front of the attention buffer, becoming the new center of attention.

• REFER(frame, n) – Makes a new mention from the next n tokens in the input evoking an existing frame in the attention buffer. This frame is moved to the front of the attention buffer and will become the new center of attention.

• CONNECT(source, role, target) – Adds slot to source frame in the attention buffer with name role and value target where target is an existing frame in the attention buffer. The source frame becomes the new center of attention.

• ASSIGN(source, role, value) – Adds slot to source frame in the attention buffer with name role and constant value value and moves the frame to the front of the buffer. This action is only used for assigning a constant value to a slot, in contrast to CONNECT where the value is another frame in the attention buffer.

• EMBED(target, role, type) – Creates a new frame with type type and adds a slot to it with name role and value target where target is an existing frame in the attention buffer. The new frame becomes the center of attention.

• ELABORATE(source, role, type) – Creates a new frame with type type and adds a slot to an existing frame source in the attention buffer with role set to the new frame. The new frame becomes the center of attention.

In summary, EVOKE and REFER are used to evoke frames from text mentions, while ELABORATE and EMBED are used to create frames not directly evoked by text.

This transition system can generate any connected frame graph where the frames are either directly or indirectly evoked by phrases in the text. A frame can be evoked by multiple mentions and the graph can have cycles.

The transition system can potentially have an unbounded number of actions since it is parameterized by phrase length and attention buffer indices, which can be arbitrarily large. In the current implementation, we only consider the top $k$ frames in the attention buffer and we do not consider any phrases longer than those in the training corpus.

Multiple transition sequences can generate the same frame annotations, but we have implemented an oracle sequence generator that takes a document and converts it to a canonical transition sequence in a way similar to how this is done for transition-based dependency parsing [?]. For example, the sentence “John hit the ball” generates the following transition sequence:

EVOKE(/saft/person, 1)
SHIFT
EVOKE(/pb/hit-01, 1)
CONNECT(0, /pb/arg0, 1)
SHIFT
SHIFT
EVOKE(/saft/consumer_good, 1)
CONNECT(1, /pb/arg1, 0)
SHIFT
STOP
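To make the action semantics concrete, the following Python sketch interprets the actions listed above and replays the sequence for “John hit the ball”. It is an illustration under stated assumptions, not the SLING implementation: `Frame`, `ParserState`, and `apply` are hypothetical names, frame arguments are assumed to be indices into the attention buffer (0 is the front), and the `frame` argument of REFER is assumed to be such an index too.

```python
class Frame:
    def __init__(self, frame_type):
        self.type = frame_type
        self.slots = []          # list of (role, value) pairs


class ParserState:
    def __init__(self, tokens):
        self.tokens = tokens
        self.current = 0         # index of the next input token
        self.attention = []      # index 0 is the center of attention
        self.mentions = []       # (begin, end, frame) evoking spans
        self.done = False

    def _to_front(self, frame):
        # Remove by identity, then reinsert at the front.
        self.attention = [f for f in self.attention if f is not frame]
        self.attention.insert(0, frame)

    def _mention(self, n, frame):
        self.mentions.append((self.current, self.current + n, frame))

    def apply(self, name, *args):
        if self.done and name != "STOP":
            raise ValueError("only STOP may follow STOP")
        if name == "SHIFT":
            if self.current >= len(self.tokens):
                raise ValueError("SHIFT at end of input")
            self.current += 1
        elif name == "STOP":
            if self.current < len(self.tokens):
                raise ValueError("STOP before end of input")
            self.done = True
        elif name == "EVOKE":
            frame_type, n = args
            frame = Frame(frame_type)
            self._mention(n, frame)
            self.attention.insert(0, frame)
        elif name == "REFER":
            index, n = args
            frame = self.attention[index]
            self._mention(n, frame)
            self._to_front(frame)
        elif name == "CONNECT":
            source, role, target = args
            src, tgt = self.attention[source], self.attention[target]
            src.slots.append((role, tgt))
            self._to_front(src)
        elif name == "ASSIGN":
            source, role, value = args
            src = self.attention[source]
            src.slots.append((role, value))
            self._to_front(src)
        elif name == "EMBED":
            target, role, frame_type = args
            frame = Frame(frame_type)
            frame.slots.append((role, self.attention[target]))
            self.attention.insert(0, frame)
        elif name == "ELABORATE":
            source, role, frame_type = args
            frame = Frame(frame_type)
            self.attention[source].slots.append((role, frame))
            self.attention.insert(0, frame)
        else:
            raise ValueError("unknown action: " + name)


state = ParserState(["John", "hit", "the", "ball"])
for action in [
    ("EVOKE", "/saft/person", 1), ("SHIFT",),
    ("EVOKE", "/pb/hit-01", 1), ("CONNECT", 0, "/pb/arg0", 1),
    ("SHIFT",), ("SHIFT",),
    ("EVOKE", "/saft/consumer_good", 1), ("CONNECT", 1, "/pb/arg1", 0),
    ("SHIFT",), ("STOP",),
]:
    state.apply(*action)

hit = state.attention[0]
```

After the replay, the `/pb/hit-01` frame is the center of attention and carries two slots: `/pb/arg0` pointing to the person frame evoked by “John” and `/pb/arg1` pointing to the frame evoked by “ball”.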

## 6 Features

The biLSTM uses only lexical features based on the current input word:

• The current word itself. During training we initialize the embedding for this feature from pre-trained word embeddings [?] for all the words in the training data.

• The prefixes and suffixes of the current input word. We use only prefixes up to three characters in our experiments.

• Word shape features based on the characters in the current input word: hyphenation, capitalization, punctuation, quotes, and digits. Each of these features has its own embedding matrix.
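The lexical features listed above could be extracted roughly as follows. This is a sketch only: the function names, the value sets for each shape feature (for example `"initial"` versus `"upper"` capitalization), and the choice to emit suffixes with the same three-character limit as prefixes are assumptions, since the text does not spell out these details.

```python
import string

# Assumed set of quote tokens; the actual inventory is not given in the text.
QUOTES = {'"', "'", "``", "''", "\u201c", "\u201d", "\u2018", "\u2019"}


def affixes(word, max_len=3):
    """Prefixes and suffixes of the word, up to max_len characters."""
    lengths = range(1, min(max_len, len(word)) + 1)
    prefixes = [word[:n] for n in lengths]
    suffixes = [word[-n:] for n in lengths]
    return prefixes, suffixes


def word_shape(word):
    """One categorical value per shape feature (each has its own embedding)."""
    if word.isupper():
        capitalization = "upper"
    elif word[:1].isupper():
        capitalization = "initial"
    elif any(c.isupper() for c in word):
        capitalization = "mixed"
    else:
        capitalization = "lower"

    if word.isdigit():
        digits = "all"
    elif any(c.isdigit() for c in word):
        digits = "some"
    else:
        digits = "none"

    return {
        "hyphenation": "-" in word,
        "capitalization": capitalization,
        "punctuation": bool(word) and all(c in string.punctuation for c in word),
        "quotes": word in QUOTES,
        "digits": digits,
    }


prefixes, suffixes = affixes("John")
shape = word_shape("F-16")
```

Each returned value would index a separate embedding matrix, and the resulting embeddings would be concatenated to form the biLSTM input for the token.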

The TBRU is a simple feed-forward unit with a single hidden layer. It takes the hidden activations from the biLSTM as well as the activations from the hidden layer from the previous steps as raw input features, and maps them through embedding matrices to get the input vector for the hidden layer. More specifically, the inputs to the TBRU are as follows:

• The left-to-right and right-to-left LSTMs supply their activations for the current token in the parser state.

• The attention feature looks at the top-$k$ frames in the attention buffer and finds the phrases in the text (if any) that evoked them. The activations from the left-to-right and right-to-left LSTMs for the last token of each of those phrases are included as TBRU inputs, serving as continuous lexical representations of the top-$k$ frames in the attention buffer.

• The hidden layer activations of the transition steps which evoked or brought into focus the top-$k$ frames in the attention buffer are also inputs to the TBRU, providing a continuous representation for the semantic frame contexts that evoked those frames most recently.

• The history feature uses the hidden activations in the feed-forward unit from the previous steps as feature inputs to the current step.

• Embeddings of triples of the form $(s, r, t)$ encode the fact that the frame at position $s$ in the attention buffer has a role $r$ with the frame at position $t$ in the attention buffer as its value. Back-off features are added for the source roles $(s, r)$, target roles $(r, t)$, and unlabeled roles $(s, t)$.
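The role feature in the last item can be sketched as follows: for each of the top frames in the attention buffer, emit a (source position, role, target position) triple for every slot whose value is also among the top frames, plus the three back-off pairs. The dictionary-based frame representation and the name `role_features` are assumptions for illustration only.

```python
def role_features(attention, k):
    """Collect role triples and back-offs among the top-k attention frames."""
    top = attention[:k]
    # Map each frame (by identity) to its attention buffer position.
    position = {id(frame): i for i, frame in enumerate(top)}
    features = {"triple": [], "source_role": [], "role_target": [], "unlabeled": []}
    for s, frame in enumerate(top):
        for role, value in frame["slots"]:
            if not isinstance(value, dict) or id(value) not in position:
                continue  # literal value, or frame outside the top k
            t = position[id(value)]
            features["triple"].append((s, role, t))
            features["source_role"].append((s, role))
            features["role_target"].append((role, t))
            features["unlabeled"].append((s, t))
    return features


# Attention buffer after parsing "John hit the ball": hit, ball, John.
john = {"type": "/saft/person", "slots": []}
ball = {"type": "/saft/consumer_good", "slots": []}
hit = {"type": "/pb/hit-01", "slots": [("/pb/arg0", john), ("/pb/arg1", ball)]}

all_roles = role_features([hit, ball, john], k=3)
top2_roles = role_features([hit, ball, john], k=2)
```

With `k=2` the person frame falls outside the window, so only the `/pb/arg1` link contributes features; this shows how the window size bounds the feature space.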

## 7 Experiments

We derived a corpus annotated with semantic frames from the OntoNotes corpus [?]. We took the PropBank SRL layer [?] and converted the predicate-argument structures into frame annotations. We also annotated the corpus with entity frames based on entity types from a state-of-the-art entity tagger. We determined the head token of each argument span and if this coincided with the span of an existing frame, then we used it as the evoking span for the argument frame, otherwise we just used the head token as the evoking span of the argument frame.

The various frame types mentioned above are listed in Table 1. They include 7 conventional entity types, 6 top-level non-entity types (e.g. date), 13 measurement types, and more than 5400 PropBank frame types. All the frame roles are collapsed onto /pb/arg0, /pb/arg1, and so on. Our training corpus size was sentences, tokens.

Table 2 shows action statistics for the transition sequences that generate the gold frames in the training corpus. As expected, there is one SHIFT action per training token, and one STOP action per training sentence. The EVOKE action occurred with unique (length, type) arguments in the corpus, for a raw count of roughly million action tokens. Overall our action space had action types, which is also the size of the softmax layer of our TBRU decoder.

Our final set of hyperparameters after grid search with a dev corpus was: , [?] with , , , no dropout, gradient clipping at , exponential moving average, no layer normalization, and a training batch size of . We use dimensional word embeddings, single layer LSTMs with dimensions, and a dimensional hidden layer in the feed-forward unit.

We stopped training after steps, where each step corresponds to processing one training batch, and evaluated on the dev corpus ( sentences) after every checkpoint (= steps). Figure 2 shows how the various evaluation metrics evolve as training progresses. Section 8 contains the details of how these metrics are evaluated. We picked the checkpoint with the best ‘Slot F1’ score.

## 8 Evaluation

An annotated document consists of a number of connected frames as well as phrases (token spans) that evoked these frames. We evaluated annotation quality by comparing the generated frames with the gold standard frame annotations from the evaluation corpus.

Two documents are matched by constructing a virtual graph where the document is the start node. The document node is then connected to the spans and the spans are connected to the frames that the spans evoke. This graph is then extended by following the frame-to-frame links via the roles. Quality is computed by aligning the gold and predicted graphs and computing precision, recall, and F1. Those scores are separately computed for spans, frames, frame types, roles that link to other frames (referred to as ‘roles’), and roles that link to global constants (referred to as ‘labels’).

We also report two aggregate quality scores: (a) Slot, which is an aggregate of Type, Role, and Label, and (b) Combined, which is an aggregate of Span, Frame, Type, Role, and Label.

We rated the checkpoints using the Slot-F1 metric and selected the checkpoint with the best Slot-F1. Intuitively, a high Slot score reflects that the right type of frames are being evoked, along with the right set of slots and links to other frames.
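As a reminder of how such scores are derived from match counts, the sketch below computes precision, recall, and F1, and forms an aggregate by pooling the counts of its component metrics. Pooling counts is an assumption; the text does not state how the Slot and Combined aggregates are computed. All counts in the example are made up for illustration and are not results from the experiments.

```python
def prf(correct, predicted, gold):
    """Precision, recall, and F1 from match counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def aggregate(*counts):
    """Pool (correct, predicted, gold) counts across metrics (assumed scheme)."""
    return tuple(sum(column) for column in zip(*counts))


# Illustrative counts only: (correct, predicted, gold).
type_counts = (90, 100, 100)
role_counts = (30, 50, 60)
label_counts = (8, 10, 16)

slot_counts = aggregate(type_counts, role_counts, label_counts)
slot_precision, slot_recall, slot_f1 = prf(*slot_counts)
```

Under this pooling scheme, metrics with more instances (here the type counts) weigh more heavily in the aggregate than sparse ones such as labels.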

Figure 2 shows that as training progresses, the model learns to output the spans and frames evoked from those spans with fairly good quality (SPAN F1 FRAME F1 ). It also gets the type of those frames right with a TYPE F1 of . ROLE F1 though is lower at just . ROLE F1 measures the accuracy of correctly getting the frame-frame link, including the label of the link. Further error analysis will be required to understand how frame-frame links are missed by the model. Also note that currently the roles feature is the only one that captures inter-frame link information. Augmenting this with more features should help improve ROLE quality, as we will investigate in future work.

Finally, we took the best checkpoint, with SLOT F1 at steps, and evaluated it on the test corpus. Table 3 lists the quality of this model on the test and dev corpora. With the exception of LABEL accuracies, all the other metrics exhibit less than half a percent difference between the test and dev corpora. This illustrates that despite the lack of dropout, the model generalizes well to unseen text. As for the disparity on LABEL F1 ( on dev against on test), we observe from Figure 2 that the LABEL accuracies follow a different improvement pattern during training. On the dev set, LABEL F1 peaked at at steps, and started degrading slightly from there on to at steps, possibly showing signs of overfitting which are absent in the other metrics.

We have tried increasing the sizes of the LSTM dimensions, hidden layers, and embeddings, but this did not improve the results significantly.

## 9 Parser runtime

The SLING parser uses TensorFlow [?] for training but it also supports annotating text with frame annotations at runtime. It can take advantage of batching and multi-threading to speed up parsing. However, in practical applications of the parser, it may not be convenient to batch documents for processing, so to have a realistic benchmark, we set the batch size to one at runtime. In this configuration, the TensorFlow-based SLING parser runs at 200 tokens per CPU second.

To speed up parsing, we have created Myelin, a just-in-time compiler for neural networks that compiles network cells into x64 machine code at runtime. The generated code exploits such specialized CPU features as SSE, AVX, and FMA3, if available. Tensor shapes and model parameters are fixed at runtime. This allows us to optimize the network by folding constants, unrolling loops, and pre-computing embeddings, among other transformations. The JIT compiler can also fix the data instance layout at compile-time to speed up runtime data access.

The Myelin-based SLING parser runs at 2500 tokens per CPU second, more than ten times faster than the TensorFlow-based version (Table 4).

The Myelin-based SLING parser is independent of TensorFlow so it only needs to link with the Myelin runtime (less than 500 KB) instead of the TensorFlow runtime library (37 MB), and it is also much faster to initialize (0.5 seconds including compilation time) than the TensorFlow-based parser (10 seconds). Figure 3 shows a breakdown of the CPU time for the Myelin-based parser runtime.

Half the time is spent computing the logits for the output actions. This is expensive because the OntoNotes-based corpus has 6968 actions, where the vast majority of the actions are of a form like EVOKE(/pb/hit-01, 1), one for each PropBank roleset predicate in the training data. Table 2 shows that only about 26% of all the actions are EVOKE actions. The output layer of the FF unit could be turned into a cascaded classifier, where if the first classifier predicts a generic EVOKE(/pb/predicate, 1) action, it would use a secondary classifier to predict the predicate type. This could almost double the speed of the parser.
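The cascaded output layer suggested above could look roughly like the following sketch, which is not part of the described system. A small first-stage classifier scores the non-predicate actions plus one generic EVOKE action, and a second-stage classifier is consulted only when the generic action wins. The weights, the `GENERIC_EVOKE` marker, and the `/pb/run-01` type are hypothetical values chosen for the example.

```python
GENERIC_EVOKE = ("EVOKE", "/pb/predicate", 1)


def dot(weights, hidden):
    return sum(w * h for w, h in zip(weights, hidden))


def best(candidates, hidden):
    """Return the label whose weight vector scores highest on `hidden`."""
    return max(candidates, key=lambda item: dot(item[1], hidden))[0]


def cascaded_predict(hidden, first_stage, predicate_stage):
    """Two-stage prediction: generic action first, predicate type second."""
    action = best(first_stage, hidden)
    if action != GENERIC_EVOKE:
        return action
    # Only now pay for scoring the (large) set of predicate types.
    predicate_type = best(predicate_stage, hidden)
    return ("EVOKE", predicate_type, 1)


# Toy two-dimensional weights, for illustration only.
first_stage = [
    (("SHIFT",), [0.1, 0.9]),
    (GENERIC_EVOKE, [0.8, 0.2]),
]
predicate_stage = [
    ("/pb/hit-01", [0.7, 0.0]),
    ("/pb/run-01", [0.3, 1.0]),
]

evoked = cascaded_predict([1.0, 0.0], first_stage, predicate_stage)
shifted = cascaded_predict([0.0, 1.0], first_stage, predicate_stage)
```

The saving comes from skipping the second stage on every step where the first stage selects a non-predicate action such as SHIFT.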

## 10 Conclusion

We have described SLING, a framework for parsing natural language into semantic frames. Our experiments show that it is feasible to build a semantic parser that outputs frame graphs directly without any intervening symbolic representation, only using the tokens as inputs. We illustrated this on the joint task of predicting entity mentions, entity types, measures, and semantic role labeling. While the LSTMs and TBRUs are expensive to compute, we can achieve acceptable parsing speed using the Myelin JIT compiler. We hope to make use of SLING in the future for further exploration into semantic parsing.

## Acknowledgements

We would like to thank Google for supporting us in this project and allowing us to make SLING available to the public community. We would also like to thank the TensorFlow and DRAGNN teams for making their systems publicly available. Without them, we could not have made SLING open source.

### Footnotes

2. See the SLING Guide for a detailed description of the SLING frame store implementation.