Shallow Syntax in Deep Water
Shallow syntax provides an approximation of the phrase-syntactic structure of sentences; it can be produced with high accuracy, and is computationally cheap to obtain. We investigate the role of shallow syntax-aware representations for NLP tasks using two techniques. First, we enhance the ELMo architecture (Peters:18b) to allow pretraining on predicted shallow syntactic parses, instead of just raw text, so that contextual embeddings make use of shallow syntactic context. Our second method involves shallow syntactic features obtained automatically on downstream task data. Neither approach leads to a significant gain on any of the four downstream tasks we considered, relative to ELMo-only baselines. Further analysis using black-box probes from Liu:19 confirms that our shallow-syntax-aware contextual embeddings do not transfer to linguistic tasks any more easily than ELMo’s embeddings. We take these findings as evidence that ELMo-style pretraining discovers representations which make additional awareness of shallow syntax redundant.
The NLP community is revisiting the role of linguistic structure in applications with the advent of contextual word representations (cwrs) derived from pretraining language models on large corpora (Peters:18; Radford:18; Howard:18; Devlin:18). Recent work has shown that downstream task performance may benefit from explicitly injecting a syntactic inductive bias into model architectures (Kuncoro:18), even when cwrs are also used (Strubell:18). However, high-quality linguistic structure annotation at a large scale remains expensive; a trade-off needs to be made between the quality of the annotations and the computational expense of obtaining them. Shallow syntactic structures (Abney:91; also called chunk sequences) offer a viable middle ground: they provide a flat, non-hierarchical approximation to phrase-syntactic trees (see Fig. 1 for an example), and can be obtained efficiently, and with high accuracy, using sequence labelers. In this paper we consider shallow syntax to be a proxy for linguistic structure.
While shallow syntactic chunks are almost as ubiquitous as part-of-speech tags in standard NLP pipelines (Jurafsky:00), their relative merits in the presence of cwrs remain unclear. We investigate the role of these structures using two methods. First, we enhance the ELMo architecture (Peters:18b) to allow pretraining on predicted shallow syntactic parses, instead of just raw text, so that contextual embeddings make use of shallow syntactic context (§2). Our second method involves classical addition of chunk features to cwr-infused architectures for four different downstream tasks (§3). Shallow syntactic information is obtained automatically using a highly accurate model (97% F1 on standard benchmarks). In both settings, we observe only modest gains on three of the four downstream tasks relative to ELMo-only baselines (§4).
Recent work has probed the knowledge encoded in cwrs and found they capture a surprisingly large amount of syntax (Blevins:18; Liu:19; Tenney:18). We further examine the contextual embeddings obtained from the enhanced architecture and a shallow syntactic context, using black-box probes from Liu:19. Our analysis indicates that our shallow-syntax-aware contextual embeddings do not transfer to linguistic tasks any more easily than ELMo embeddings (§4.2).
Overall, our findings show that while shallow syntax can be somewhat useful, ELMo-style pretraining discovers representations which make additional awareness of shallow syntax largely redundant.
2 Pretraining with Shallow Syntactic Annotations
We briefly review the shallow syntactic structures used in this work, and then present a model architecture to obtain embeddings from shallow syntactic context (mSynC).
2.1 Shallow Syntax
Base phrase chunking is a cheap sequence-labeling-based alternative to full syntactic parsing, where the sequence consists of non-overlapping labeled segments (Fig. 1 includes an example). Full syntactic trees can be converted into such shallow syntactic chunk sequences using a deterministic procedure (Jurafsky:00). Tjong:00 offered a rule-based transformation deriving non-overlapping chunks from the phrase-structure trees found in the Penn Treebank (Marcus:93). The procedure percolates the labels of certain syntactic phrase nodes down to the tokens in their span. All overlapping embedded phrases are then removed, and the remainder of each phrase receives the percolated label; this remainder usually corresponds to the head word of the phrase.
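A simplified sketch of this derivation follows (the official CoNLL 2000 script handles many more cases; the label set and helper names here are ours): assign each token the label of its lowest chunk-type ancestor phrase, then merge consecutive tokens that came from the same phrase node.

```python
import itertools

# Chunk-type phrase labels (subset, for illustration only)
CHUNK_LABELS = {"NP", "VP", "PP", "ADVP", "ADJP", "SBAR", "PRT"}

def assign(node, ancestor, fresh):
    """Yield one (phrase_id, label) pair per token, using the lowest
    chunk-type ancestor of that token."""
    label, body = node
    if isinstance(body, str):                 # preterminal: (POS, word)
        yield ancestor
        return
    if label in CHUNK_LABELS:
        ancestor = (next(fresh), label)       # this phrase is the new lowest
    for child in body:
        yield from assign(child, ancestor, fresh)

def tree_to_chunks(tree):
    """Flatten a phrase-structure tree into non-overlapping labeled spans."""
    tags = list(assign(tree, (None, "O"), itertools.count()))
    spans = []
    for key, group in itertools.groupby(enumerate(tags), key=lambda t: t[1]):
        toks = [i for i, _ in group]
        if key[1] != "O":
            spans.append((toks[0], toks[-1] + 1, key[1]))
    return spans

# (S (NP the dog) (VP sat (PP on (NP the mat))))
tree = ("S", [
    ("NP", [("DT", "the"), ("NN", "dog")]),
    ("VP", [("VBD", "sat"),
            ("PP", [("IN", "on"),
                    ("NP", [("DT", "the"), ("NN", "mat")])])]),
])
print(tree_to_chunks(tree))
# [(0, 2, 'NP'), (2, 3, 'VP'), (3, 4, 'PP'), (4, 6, 'NP')]
```

Note how the VP chunk is only the remainder "sat": its embedded PP and NP are removed from it, exactly the behavior described above.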
In order to obtain shallow syntactic annotations on a large corpus, we train a BiLSTM-CRF model (Lample:16; Peters:17), which achieves 97% F1 on the CoNLL 2000 benchmark test set. The training data is obtained from the CoNLL 2000 shared task (Tjong:00), as well as the remaining sections (except §23 and §20) of the Penn Treebank, using the official script for chunk generation (https://www.clips.uantwerpen.be/conll2000/chunking/). The standard task definition from the shared task includes eleven chunk labels, as shown in Table 1.
Table 1: The eleven chunk labels and their relative frequencies (%).

| Chunk label | % |
|---|---|
| Noun Phrase (NP) | 51.7 |
| Verb Phrase (VP) | 20.0 |
| Prepositional Phrase (PP) | 19.8 |
| Adverbial Phrase (ADVP) | 3.7 |
| Subordinate Clause (SBAR) | 2.1 |
| Adjective Phrase (ADJP) | 1.9 |
| Verb Particles (PRT) | 0.5 |
| Conjunctive Phrase (CONJ) | 0.06 |
| Interjective Phrase (INTJ) | 0.03 |
| List Marker (LST) | 0.01 |
| Unlike Coordination Phrase (UCP) | 0.002 |
2.2 Pretraining Objective
Traditional language models are estimated to maximize the likelihood of each word x_t given the words that precede it, p(x_t | x_{<t}). Given a corpus that is annotated with shallow syntax, we propose to condition on both the preceding words and their annotations.
We associate with each word x_t three additional variables (denoted c_t): the indices of the beginning and end of the last completed chunk before x_t, and its label (Fig. 2 shows an example). Chunks are only used as conditioning context via c_t; they are not predicted. (A different objective could consider predicting the next chunk along with the next word; however, such a chunker would have access to strictly less information than usual, since the entire sentence would no longer be available.) Because the labels depend on the entire sentence through the CRF chunker, conditioning each word’s probability on any c_t means that our model is, strictly speaking, not a language model, and it can no longer be meaningfully evaluated using perplexity.
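The conditioning variables can be computed deterministically from a chunked sentence. A minimal sketch (function and sentinel names are ours; spans use exclusive ends): for each position t, c_t is the (begin, end, label) triple of the last chunk completed before position t, with a sentinel when no chunk has completed yet.

```python
def conditioning_vars(n_tokens, chunks):
    """chunks: (begin, end, label) spans, exclusive ends, non-overlapping."""
    sentinel = (-1, -1, "<s>")                 # no chunk completed yet
    ordered = sorted(chunks, key=lambda s: s[1])   # by end position
    c, last, i = [], sentinel, 0
    for t in range(n_tokens):
        # advance past every chunk that has completed before position t
        while i < len(ordered) and ordered[i][1] <= t:
            last = ordered[i]
            i += 1
        c.append(last)
    return c

# [NP the dog] [VP sat] [PP on] [NP the mat]
spans = [(0, 2, "NP"), (2, 3, "VP"), (3, 4, "PP"), (4, 6, "NP")]
c_vars = conditioning_vars(6, spans)
print(c_vars[0], c_vars[2], c_vars[5])
# (-1, -1, '<s>') (0, 2, 'NP') (3, 4, 'PP')
```

Note that a chunk becomes visible only once it is fully closed, which is what keeps the conditioning context incremental in the left-to-right model.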
A right-to-left model is constructed analogously, conditioning on c_{>t} alongside x_{>t}. Following Peters:18, we use a joint objective maximizing the data likelihood in both directions, with shared softmax parameters.
2.3 Pretraining Model Architecture
Our model uses two encoders: e_seq for encoding the sequential history of words, and e_syn for the shallow syntactic (chunk) history. For both, we use transformers (Vaswani:17), which consist of large feedforward networks equipped with multiheaded self-attention mechanisms.
As inputs to e_seq, we use a context-independent embedding, obtained from a CNN character encoder (Kim:2016:CNL:3016100.3016285), for each token x_t. The outputs from e_seq, denoted h_t^seq, represent words in context.
Next, we build representations for (observed) chunks in the sentence by concatenating a learned embedding for the chunk label with the h^seq vectors for the chunk’s boundary tokens, and applying a linear projection. The projected output is input to e_syn, the shallow syntactic encoder, which yields contextualized chunk representations, h^syn. Note that the number of chunks in the sentence is less than or equal to the number of tokens.
Each h_t^seq is now concatenated with h_{m(t)}^syn, where m(t) indexes c_t, the last completed chunk before position t. Finally, the output is given by mSynC_t = W [h_t^seq ; h_{m(t)}^syn], where W is a model parameter. For training, mSynC_t is used to compute the probability of the next word, using a sampled softmax (Bengio:03). For downstream tasks, we use a learned linear weighting of all layers in the encoders to obtain a task-specific mSynC, following Peters:18.
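The learned layer weighting can be sketched as follows; this is a minimal re-implementation of the ELMo-style scalar mix idea (function and parameter names are ours, not the authors’ code): a softmax over per-layer scalars, a weighted sum of layer vectors, and a global scale gamma.

```python
import math

def scalar_mix(layers, weights, gamma=1.0):
    """Combine L layer vectors into one task-specific vector.
    layers: list of L vectors (lists of floats), all the same dimension.
    weights: L learnable scalars; softmax turns them into mixing weights."""
    exps = [math.exp(w) for w in weights]
    z = sum(exps)
    alphas = [e / z for e in exps]             # softmax over layers
    dim = len(layers[0])
    return [gamma * sum(a * layer[i] for a, layer in zip(alphas, layers))
            for i in range(dim)]

# Two layers with equal (zero) weights reduce to a simple average
out = scalar_mix([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
print(out)
# [2.0, 3.0]
```

During task training only the scalars and gamma are updated, so the pretrained encoder layers themselves stay fixed.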
Staged parameter updates
Jointly training both the sequential encoder e_seq and the syntactic encoder e_syn can be expensive, due to the large number of parameters involved. To reduce cost, we initialize our sequential cwrs h^seq using pretrained embeddings from ELMo-transformer. Once initialized as such, the encoder is fine-tuned to the data likelihood objective (§2.2). This results in a staged parameter update, which reduces training duration by a factor of 10 in our experiments. We discuss the empirical effect of this approach in §4.3.
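A schematic sketch of the update schemes compared in §4.3 (parameter-name prefixes are hypothetical; a real implementation would set per-parameter trainability flags in the training framework):

```python
# Hypothetical sketch: decide which parameter groups an optimizer step may
# update under each training scheme. "e_seq" denotes the pretrained
# sequential encoder; everything else (e_syn, projections) is new.

def update_mask(param_names, scheme):
    """Return {name: True if trainable} for a given scheme."""
    if scheme == "end-to-end":
        return {n: True for n in param_names}       # train everything jointly
    if scheme == "staged-frozen":
        # e_seq keeps its pretrained ELMo-transformer weights untouched
        return {n: not n.startswith("e_seq") for n in param_names}
    if scheme == "staged-fine-tuned":
        # e_seq starts from pretrained weights but is still updated
        return {n: True for n in param_names}
    raise ValueError(scheme)

names = ["e_seq.layer_0.w", "e_syn.layer_0.w", "projection.w"]
print(update_mask(names, "staged-frozen"))
# {'e_seq.layer_0.w': False, 'e_syn.layer_0.w': True, 'projection.w': True}
```

The staged schemes differ from end-to-end only in initialization and in whether the pretrained encoder continues to receive gradient updates.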
3 Shallow Syntactic Features
Our second approach incorporates shallow syntactic information in downstream tasks via token-level chunk label embeddings. Task training (and test) data is automatically chunked, and chunk boundary information is passed into the task model via a BIOUL encoding of the labels. We add randomly initialized chunk label embeddings to task-specific input encoders, which are then fine-tuned for task-specific objectives. This approach does not require a shallow syntactic encoder or chunk annotations for pretraining cwrs, only a chunker; hence, it can more directly measure the impact of shallow syntax for a given task. (In contrast, in §2, the shallow syntactic encoder itself, as well as the quality of predicted chunks on the large pretraining corpus, could affect downstream performance.)
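The BIOUL encoding itself is mechanical; a minimal sketch (helper name is ours) that maps labeled spans to per-token tags, with B/I/L marking the beginning, inside, and last token of multi-token chunks, U a unit-length chunk, and O a token outside any chunk:

```python
def to_bioul(n_tokens, spans):
    """spans: (begin, end, label) with exclusive ends, non-overlapping."""
    tags = ["O"] * n_tokens
    for begin, end, label in spans:
        if end - begin == 1:
            tags[begin] = f"U-{label}"         # unit-length chunk
        else:
            tags[begin] = f"B-{label}"         # beginning
            for i in range(begin + 1, end - 1):
                tags[i] = f"I-{label}"         # inside
            tags[end - 1] = f"L-{label}"       # last token
    return tags

# [NP the dog] [VP sat] [PP on] [NP the mat]
print(to_bioul(6, [(0, 2, "NP"), (2, 3, "VP"), (3, 4, "PP"), (4, 6, "NP")]))
# ['B-NP', 'L-NP', 'U-VP', 'U-PP', 'B-NP', 'L-NP']
```

Each resulting tag then indexes a randomly initialized embedding table that is concatenated to the task model’s token inputs.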
4 Experiments

Our experiments evaluate the effect of shallow syntax, via contextualization (mSynC, §2) and features (§3). We provide comparisons with four baselines: ELMo-transformer (Peters:18b), our reimplementation of the same, and two cwr-free baselines, with and without shallow syntactic features. Both ELMo-transformer and mSynC are trained on the 1B word benchmark corpus (Chelba:13); the latter also employs chunk annotations (§2.1). Experimental settings are detailed in Appendix §A.1.
4.1 Downstream Task Transfer
We employ four tasks to test the impact of shallow syntax. The first three, namely coarse-grained named entity recognition (NER), fine-grained NER, and constituency parsing, are span-based; the fourth is a sentence-level sentiment classification task. Following Peters:18, we do not apply finetuning to task-specific architectures, allowing us to do a controlled comparison with ELMo. Given an identical base architecture across models for each task, we can attribute any difference in performance to the incorporation of shallow syntax or contextualization. Details of downstream architectures are provided below, and overall dataset statistics for all tasks are shown in the Appendix, Table 5.
Table 2: Test-set performance on downstream tasks (mean ± standard deviation).

| Model | NER | Fine-grained NER | Parsing | Sentiment |
|---|---|---|---|---|
| Baseline (no cwr) | 88.1 ± 0.27 | 78.5 ± 0.19 | 88.9 ± 0.05 | 51.6 ± 1.63 |
| + shallow syn. features | 88.6 ± 0.22 | 78.9 ± 0.13 | 90.8 ± 0.14 | 51.1 ± 1.39 |
| ELMo-transformer (Peters:18b) | 91.1 ± 0.26 | — | 93.7 ± 0.00 | — |
| ELMo-transformer (our reimplementation) | 91.5 ± 0.25 | 85.7 ± 0.08 | 94.1 ± 0.06 | 53.0 ± 0.72 |
| + shallow syn. features | 91.6 ± 0.40 | 85.9 ± 0.28 | 94.3 ± 0.03 | 52.6 ± 0.54 |
| Shallow syn. contextualization (mSynC) | 91.5 ± 0.19 | 85.9 ± 0.20 | 94.1 ± 0.07 | 53.0 ± 1.07 |
We use the English portion of the CoNLL 2003 dataset (Tjong:03), which provides named entity annotations on newswire data across four different entity types (PER, LOC, ORG, MISC). A bidirectional LSTM-CRF architecture (Lample:16) and a BIOUL tagging scheme were used.
The same architecture and tagging scheme from above is also used to predict fine-grained entity annotations from OntoNotes 5.0 (Weischedel:11). There are 18 fine-grained NER labels in the dataset, including regular named entities as well as entities such as date, time, and common numerical entities.
We use the standard Penn Treebank splits, and adopt the span-based model from Stern:17. Following their approach, we used predicted part-of-speech tags from the Stanford tagger (Toutanova:03) for training and testing. About 51% of phrase-syntactic constituents align exactly with the predicted chunks used, with a majority being single-width noun phrases. Given that the rule-based procedure used to obtain chunks only propagates the phrase type to the head-word and removes all overlapping phrases to the right, this is expected. We did not employ jack-knifing to obtain predicted chunks on PTB data; as a result there might be differences in the quality of shallow syntax annotations between the train and test portions of the data.
We consider fine-grained (5-class) classification on Stanford Sentiment Treebank (Socher:13). The labels are negative, somewhat_negative, neutral, positive and somewhat_positive. Our model was based on the biattentive classification network (Mccann:17). We used all phrase lengths in the dataset for training, but test results are reported only on full sentences, following prior work.
Results are shown in Table 2. Consistent with previous findings, cwrs offer large improvements across all tasks. Though helpful to span-level task models without cwrs, shallow syntactic features offer little to no benefit to ELMo models. mSynC’s performance is similar. This holds even for phrase-structure parsing, where (gold) chunks align with syntactic phrases, indicating that task-relevant signal learned from exposure to shallow syntax is already learned by ELMo. On sentiment classification, chunk features are slightly harmful on average (but variance is high); mSynC again performs similarly to ELMo-transformer. Overall, the performance differences across all tasks are small enough to infer that shallow syntax is not particularly helpful when using cwrs.
4.2 Linguistic Probes
We further analyze whether awareness of shallow syntax carries over to other linguistic tasks, via probes from Liu:19. Probes are linear models trained on frozen cwrs to make predictions about linguistic (syntactic and semantic) properties of words and phrases. Unlike §4.1, there is minimal downstream task architecture, bringing into focus the transferability of cwrs, as opposed to task-specific adaptation.
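As a toy illustration of this methodology (not Liu:19’s exact setup; names and data are ours), a probe is just a linear classifier whose only trainable parameters are its own weights, so any signal it recovers must already be linearly accessible in the frozen representations:

```python
def train_probe(features, labels, classes, epochs=20, lr=0.1):
    """Multiclass perceptron-style linear probe over frozen feature vectors."""
    dim = len(features[0])
    w = {c: [0.0] * dim for c in classes}
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = max(classes,
                       key=lambda c: sum(wi * xi for wi, xi in zip(w[c], x)))
            if pred != y:                       # only the probe's weights move
                for i, xi in enumerate(x):
                    w[y][i] += lr * xi
                    w[pred][i] -= lr * xi
    return w

def predict(w, x):
    return max(w, key=lambda c: sum(wi * xi for wi, xi in zip(w[c], x)))

# Toy "frozen embeddings": two linearly separable classes
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = ["NP", "NP", "VP", "VP"]
w = train_probe(X, y, ["NP", "VP"])
print(predict(w, [1.0, 0.0]), predict(w, [0.0, 1.0]))
# NP VP
```

Because the feature extractor is never updated, probe accuracy is read as a measure of what the pretrained representations already encode, not of what a task model could learn on top of them.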
4.2.1 Probing Tasks
The ten different probing tasks we used include CCG supertagging (Hockenmaier:07), part-of-speech tagging from the PTB (Marcus:93) and the EWT (Universal Dependencies; Silveira:14), named entity recognition (Tjong:03), base-phrase chunking (Tjong:00), grammar error detection (Yannakoudakis:11), semantic tagging (Bjerva:16), preposition supersense identification (Schneider:18), and event factuality detection (Rudinger:18). Metrics and references for each are summarized in Table 6. For more details, please see Liu:19.
Results on the ten probes are shown in Table 3. Again, the performance of the baseline ELMo-transformer and mSynC are similar, with mSynC doing slightly worse on 7 out of 9 tasks. As we would expect, on the probe for predicting chunk tags, mSynC achieves 96.9 vs. 92.2 for ELMo-transformer, indicating that mSynC is indeed encoding shallow syntax. Overall, the results further confirm that explicit shallow syntax does not offer any benefits over ELMo-transformer.
4.3 Effect of Training Scheme
We test whether our staged parameter training (§2.3) is a viable alternative to end-to-end training of both the sequential and syntactic encoders. We make a further distinction between fine-tuning the pretrained sequential encoder vs. not updating it at all after initialization (frozen).
Table 4: Validation-set performance on fine-grained NER under different training schemes (mean ± standard deviation).

| Training scheme | Fine-grained NER |
|---|---|
| mSynC, end-to-end | 86.89 ± 0.04 |
| mSynC, staged, frozen | 87.36 ± 0.02 |
| mSynC, staged, fine-tuned | 87.44 ± 0.07 |
Downstream validation-set performance on fine-grained NER, reported in Table 4, shows that the end-to-end strategy lags behind the others, perhaps indicating the need to train longer than 10 epochs. However, a single epoch on the 1B-word benchmark takes 36 hours on 2 Tesla V100s, making this prohibitive. Interestingly, the frozen strategy, which takes the least amount of time to converge (24 hours on 1 Tesla V100), also performs almost as well as fine-tuning.
5 Conclusion

We find that exposing cwr-based models to shallow syntax, either through new cwr learning architectures or explicit pipelined features, has little effect on their performance, across several tasks. Linguistic probing also shows that cwrs aware of such structures do not improve task transferability. Our architecture and methods are general enough to be adapted for richer inductive biases, such as those given by full syntactic trees (RNNGs; Dyer:16), or to different pretraining objectives, such as masked language modeling (BERT; Devlin:18); we leave this pursuit to future work.
Appendix A Supplemental Material
A.1 Experimental Settings

Our baseline pretraining model was a reimplementation of that given in Peters:18b. Hyperparameters were generally identical, but we trained on only 2 GPUs with (up to) 4,000 tokens per batch. This difference in batch size meant we used 6,000 warm-up steps with the learning rate schedule of Vaswani:17.
The encoder e_seq is identical to the 6-layer biLM used in ELMo-transformer. e_syn, on the other hand, uses only 2 layers. The learned embeddings for the chunk labels have 128 dimensions and are concatenated with the two boundary representations, each of dimension 512; the linear projection thus maps 1,152 dimensions to 512. Further, we did not perform weight averaging over several checkpoints.
The size of the shallow syntactic feature embedding was 50 across all experiments, initialized uniformly at random.
All model implementations are based on the AllenNLP library Gardner:17.
Table 5: Dataset statistics: number of instances in the train, development, and test splits of each dataset.

| Dataset | Train | Dev | Test |
|---|---|---|---|
| CoNLL 2003 NER (Tjong:03) | 23,499 | 5,942 | 5,648 |
| OntoNotes NER (Weischedel:13) | 81,828 | 11,066 | 11,257 |
| Penn Treebank (Marcus:93) | 39,832 | 1,700 | 2,416 |
| Stanford Sentiment Treebank (Socher:13) | 8,544 | 1,101 | 2,210 |
Table 6: Probing tasks, datasets, and evaluation metrics.

| Probing task | Dataset | Metric |
|---|---|---|
| CCG Supertagging | CCGBank (Hockenmaier:07) | Accuracy |
| PTB part-of-speech tagging | Penn Treebank (Marcus:93) | Accuracy |
| EWT part-of-speech tagging | Universal Dependencies (Silveira:14) | Accuracy |
| Chunking | CoNLL 2000 (Tjong:00) | F1 |
| Named Entity Recognition | CoNLL 2003 (Tjong:03) | F1 |
| Grammar Error Detection | First Certificate in English (Yannakoudakis:11) | |
| Preposition Supersense Role | STREUSLE 4.0 (Schneider:18) | Accuracy |
| Preposition Supersense Function | STREUSLE 4.0 (Schneider:18) | Accuracy |
| Event Factuality Detection | UDS It Happened v2 (Rudinger:18) | Pearson R |