Why Attention? Analyzing and Remedying BiLSTM Deficiency in Modeling Cross-Context for NER
State-of-the-art approaches to NER have used sequence-labeling BiLSTM as a core module. This paper formally shows the limitation of BiLSTM in modeling cross-context patterns. Two types of simple cross-structures – self-attention and Cross-BiLSTM – are shown to effectively remedy the problem. On both OntoNotes 5.0 and WNUT 2017, clear and consistent improvements are achieved over bare-bone models, up to 8.7% on some of the multi-token mentions. In-depth analyses of several aspects of the improvements, especially the identification of multi-token mentions, are further given.
With state-of-the-art empirical results, most regard BiLSTM-CNN as a robust core module for sequence-labeling NER Ma and Hovy (2016); Chiu and Nichols (2016); Aguilar et al. (2018); Akbik et al. (2018); Clark et al. (2018). However, each direction of a BiLSTM sees and encodes only half of a sequence at each time step: for each token, the forward LSTM encodes only past context and the backward LSTM only future context. Neither models the patterns that cross past and future at that specific time step.
This paper explores two types of cross-structures to help cope with the problem: Cross-BiLSTM-CNN and Att-BiLSTM-CNN. Section 2 formulates the three models, with Section 2.2 giving a concrete proof that patterns forming an XOR cannot be modeled by the (Baseline-)BiLSTM-CNN used in all previous work. Section 3 evaluates the practical effectiveness of the approaches on two challenging NER datasets. The cross-structures bring consistent improvements over Baseline-BiLSTM-CNN without additional gazetteers, language modeling, or multi-task supervision. The improved core module surpasses comparable bare-bone models on OntoNotes and WNUT by 1.4% and 4.6% respectively. Ablation experiments reveal that emerging, complex, confusing, and multi-token entity mentions benefitted much from the cross-structures, up to 8.7% on some of the multi-token mentions.
For Baseline Chiu and Nichols (2016), a CNN is used to compute character-level word features alongside word embeddings, and a multi-layer BiLSTM is used to capture the future and the past for each time step.
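The computation (whose equations were omitted here) can be sketched as the standard multi-layer BiLSTM-CNN; symbols below are our reconstruction, with $e_{w_t}$ the word embedding and $\overrightarrow{h}^{(0)}_t = \overleftarrow{h}^{(0)}_t = x_t$:

```latex
x_t = [\, e_{w_t} \,;\, \mathrm{CNN}(w_t) \,]
\qquad \text{(word embedding + char-CNN features)}

\overrightarrow{h}^{(\ell)}_t
  = \overrightarrow{\mathrm{LSTM}}^{(\ell)}\!\big(\overrightarrow{h}^{(\ell)}_{t-1},\, \overrightarrow{h}^{(\ell-1)}_t\big),
\qquad
\overleftarrow{h}^{(\ell)}_t
  = \overleftarrow{\mathrm{LSTM}}^{(\ell)}\!\big(\overleftarrow{h}^{(\ell)}_{t+1},\, \overleftarrow{h}^{(\ell-1)}_t\big)

h_t = [\, \overrightarrow{h}^{(L)}_t \,;\, \overleftarrow{h}^{(L)}_t \,]
```

Note that in this baseline stacking, each direction's stack consumes only its own previous-layer outputs, which is the root of the limitation shown in Section 2.2.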
The probability of each token class is given by an affine-softmax layer. Using OSBIE sequential labels Chiu and Nichols (2016), when there are $n$ entity types, the number of token classes is $4n + 1$ (S, B, I, E per type, plus O).
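The class count follows directly from the labeling scheme; a tiny illustrative helper (hypothetical `osbie_labels`, not from the paper) makes it concrete:

```python
def osbie_labels(entity_types):
    # OSBIE tag set: O plus a typed S, B, I, E tag per entity type,
    # giving 4n + 1 token classes for n entity types
    return ["O"] + [f"{t}:{tag}" for t in entity_types for tag in ("S", "B", "I", "E")]

labels = osbie_labels(["person", "work-of-art"])
print(len(labels))  # 9 == 4 * 2 + 1
```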
2.2 XOR Limitation of Ordinary Multi-Layer BiLSTM
Consider the following four phrases that form an XOR, where the token "and" should be tagged as work-of-art:I in the first two cases and as O in the last two cases.
Key and Peele (work-of-art: show title)
You and I (work-of-art: song title)
Key and I
You and Peele
First, note that the score vector at each time step is the sum of the contributions of the two directions: since the affine layer acts on the concatenated hidden states $[\overrightarrow{h}_t ; \overleftarrow{h}_t]$, the logits decompose as $s_t = W^f \overrightarrow{h}_t + W^b \overleftarrow{h}_t + b$.
Suppose the indices of work-of-art:I and O are $i$, $j$ respectively, and write $s^{f,p}$, $s^{b,p}$ for the forward and backward score contributions at the "and" of phrase $p$. Then, to predict each "and" correctly, it must hold that

$$s^{f,1}_i + s^{b,1}_i > s^{f,1}_j + s^{b,1}_j, \qquad s^{f,2}_i + s^{b,2}_i > s^{f,2}_j + s^{b,2}_j,$$
$$s^{f,3}_j + s^{b,3}_j > s^{f,3}_i + s^{b,3}_i, \qquad s^{f,4}_j + s^{b,4}_j > s^{f,4}_i + s^{b,4}_i.$$
Now, phrase 1 and phrase 3 have the same past context for "and" ("Key and"), and hence the same forward contribution, i.e., $s^{f,1} = s^{f,3}$. Similarly, $s^{f,2} = s^{f,4}$ ("You and"), $s^{b,1} = s^{b,4}$ ("and Peele"), and $s^{b,2} = s^{b,3}$ ("and I"). Rewriting the last two inequalities with these equalities gives

$$s^{f,1}_j + s^{b,2}_j > s^{f,1}_i + s^{b,2}_i, \qquad s^{f,2}_j + s^{b,1}_j > s^{f,2}_i + s^{b,1}_i.$$
Finally, summing the first two inequalities and summing the last two inequalities gives two contradicting constraints that cannot be satisfied. In other words, even if an oracle is given to train the model, Baseline-BiLSTM-CNN can tag at most 3 out of the 4 "and" tokens correctly. No matter how many LSTM cells are stacked for each direction, the formulation in previous studies simply does not have enough modeling capacity to capture cross-context patterns for sequence-labeling NER.
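The infeasibility can also be checked numerically: sampling random per-direction score contributions, no assignment ever satisfies all four constraints at once (a small sanity check of the proof, not part of the paper):

```python
import random

def satisfied(f1, f2, b1, b2):
    # each argument is (score_i, score_j): one direction's contribution to
    # class work-of-art:I and class O at the token "and" for that context
    ineqs = [
        f1[0] + b1[0] > f1[1] + b1[1],   # phrase 1: "Key and Peele" -> I
        f2[0] + b2[0] > f2[1] + b2[1],   # phrase 2: "You and I"     -> I
        f1[0] + b2[0] < f1[1] + b2[1],   # phrase 3: "Key and I"     -> O
        f2[0] + b1[0] < f2[1] + b1[1],   # phrase 4: "You and Peele" -> O
    ]
    return sum(ineqs)

best = max(
    satisfied(*[(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(4)])
    for _ in range(100000)
)
print(best)  # never reaches 4: at most 3 of the 4 phrases can be tagged correctly
```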
To resolve the problem, we propose to use Cross-BiLSTM-CNN.
As the forward and backward hidden states are interleaved between stacked LSTM layers, Cross-BiLSTM-CNN models cross-context patterns by computing representations of the whole sequence in a feed-forward, additive manner.
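The wiring difference between the two stackings can be sketched with a toy stand-in for a directional LSTM (the hypothetical `lstm_layer` below is just a running decayed sum over scalars; a minimal sketch, not the paper's implementation):

```python
def lstm_layer(inputs, reverse=False):
    # toy stand-in for a directional LSTM: a running decayed sum,
    # enough to show which inputs each direction can see
    seq = list(reversed(inputs)) if reverse else inputs
    state, outputs = 0.0, []
    for x in seq:
        state = 0.5 * state + x
        outputs.append(state)
    return outputs[::-1] if reverse else outputs

def baseline_bilstm(xs, layers=2):
    fwd, bwd = xs, xs
    for _ in range(layers):
        fwd = lstm_layer(fwd)                # forward stack sees only past
        bwd = lstm_layer(bwd, reverse=True)  # backward stack sees only future
    return list(zip(fwd, bwd))

def cross_bilstm(xs, layers=2):
    fwd, bwd = xs, xs
    for _ in range(layers):
        merged = [f + b for f, b in zip(fwd, bwd)]  # interleave both directions
        fwd = lstm_layer(merged)                    # each direction of deeper
        bwd = lstm_layer(merged, reverse=True)      # layers now sees the whole sequence
    return list(zip(fwd, bwd))
```

In Baseline, a stacked forward state can never depend on future tokens; interleaving the merged states lets both directions of deeper layers condition on the whole sequence.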
Another way to resolve the problem is to add a self-attentive mechanism Vaswani et al. (2017) on top of the baseline BiLSTM.
Att-BiLSTM-CNN correlates past and future context for each token in a dot-product, multiplicative manner. To see this, note that with $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$, each head's (pre-softmax) attention score between tokens $t$ and $s$ can be rewritten as

$$\mathrm{score}(t,s) = (W^q h_t)^\top (W^k h_s) = \overrightarrow{h}_t^{\top} M_{11}\, \overrightarrow{h}_s + \overrightarrow{h}_t^{\top} M_{12}\, \overleftarrow{h}_s + \overleftarrow{h}_t^{\top} M_{21}\, \overrightarrow{h}_s + \overleftarrow{h}_t^{\top} M_{22}\, \overleftarrow{h}_s,$$

where $M = W^{q\top} W^k$ and the cross terms $M_{12}$, $M_{21}$ multiply past context with future context.
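A minimal single-head sketch of such a mechanism (in NumPy, with random weights standing in for trained parameters; the actual model uses multiple heads):

```python
import numpy as np

def self_attention(H, Wq, Wk, Wv):
    # H: (seq_len, 2d) concatenated forward/backward BiLSTM states
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # every token scores every token
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over source tokens
    return weights @ V, weights                     # context vectors, attention map

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))                         # 5 tokens, forward+backward dims
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
ctx, attn = self_attention(H, Wq, Wk, Wv)
```

The attention map `attn` is the kind of heat map visualized in the paper's figures; each context vector mixes both directions of every other token.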
Table 3: Improvement over Baseline by entity-mention length (number of tokens).

|       | OntoNotes 5.0 |       |       |       | WNUT 2017 |       |       |       |
|       | 1     | 2     | 3     | 3+    | 1     | 2     | 3     | 3+    |
| Cross | +0.3% | +0.6% | +1.8% | +1.3% | +1.7% | +2.9% | +8.7% | +5.4% |
| Att   | +0.1% | +1.1% | +2.3% | +1.8% | +1.5% | +2.0% | +2.6% | +0.9% |
We evaluated on two datasets: OntoNotes 5.0 Fine-Grained NER – a million-token corpus with diverse sources and 18 fine-grained entity types, including hard ones such as law, event, and work-of-art Hovy et al. (2006); Pradhan et al. (2013) – and WNUT 2017 Emerging NER – a corpus of noisy social-media text whose test set filters out surface forms seen in the training set Strauss et al. (2016); Derczynski et al. (2017).
Overall Results. Table 1 shows overall results. Besides Baseline-, Cross-, and Att-BiLSTM-CNN, results of bare-bone BiLSTM-CNN Chiu and Nichols (2016), CRF-BiLSTM(-BiLSTM) Strubell et al. (2017); Lin et al. (2017), and CRF-IDCNN Strubell et al. (2017) from the literature are also listed. The models proposed in this paper surpassed previously reported bare-bone models by 1.4% on OntoNotes and 4.6% on WNUT. The more substantial improvements on WNUT 2017 emerging NER suggest that cross-context patterns are even more crucial for emerging contexts and entities, which cannot be memorized by their surface forms.
Complex and Confusing Entity Mentions. Table 2 shows significant results per entity type. Harder entity types generally benefitted more from the cross-structures. For example, work-of-art/creative-work entities can in principle take any surface form and are written with unreliable capitalization on social media, requiring models to build a better understanding of their context. Both cross-structures were more capable of dealing with such hard entities than Baseline (2.1%/5.6%/3.2%/2.0%). Moreover, disambiguating fine-grained entity types is also a challenging task. For example, entities of language and NORP often take the same surface forms. Figure 1(a) shows a confusing example containing "Dutch" and "English", with the attention heat map (Figure 2(a)) telling the story that Att relied on its attention head to make context-aware decisions. Both cross-structures were much better at disambiguating these fine-grained types (4.1%/0.8%/3.3%/3.4%).
Multi-Token Entity Mentions. Table 3 shows results for different entity lengths. Cross-structures were much better at dealing with multi-token mentions (1.8%/2.3%/8.7%/2.6%). In fact, identifying correct mention boundaries for multi-token mentions requires modeling cross-context: a token should be tagged as Inside if and only if it immediately follows a Begin or an I and is immediately followed by an I or an End. Figure 1(b) shows a sentence with "the White house", a triple-token facility mention with unreliable capitalization resulting in an emerging surface form. While Att correctly tagged all three tokens, Baseline predicted a false single-token mention "White" without the hint of a seen surface form.
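The boundary rule above can be made concrete with a small, hypothetical OSBIE transition checker (illustrative only; entity-type consistency checks omitted):

```python
def valid_transition(prev, curr):
    # OSBIE chunking: which tag may legally follow which
    after = {
        "O": {"O", "B", "S"},   # outside: a mention may begin (B) or be single (S)
        "S": {"O", "B", "S"},   # a single-token mention behaves like a closed chunk
        "E": {"O", "B", "S"},
        "B": {"I", "E"},        # after Begin: continue Inside or End
        "I": {"I", "E"},        # Inside iff between B/I and I/E
    }
    return curr in after[prev]

def valid_sequence(tags):
    padded = ["O"] + list(tags) + ["O"]
    return all(valid_transition(p, c) for p, c in zip(padded, padded[1:]))
```

For "the White house", the gold tagging O B I E is valid, and so is Baseline's erroneous O S O O; only cross-context can decide between them.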
Entity-Chunking. We performed an ablation study focused on chunking tags to understand why cross-structures are better at locating multi-token mentions. In Table 4, the first column lists the performance of Att-BiLSTM-CNN; the other columns list performance relative to it when the BiLSTM hidden states or each attention context vector scores the chunking tags on its own. The figures are per-token recalls, telling whether a part of the model is capable of scoring each tag. Att appeared to surpass Baseline by designating the task of scoring I to the attention mechanism: the context vectors performed well on their own (-3.80) compared to the hidden states of Att on their own (-28.18). The context vectors also worked in cooperation: two of them focused more on scoring E (-36.45, -39.19) than I (-60.56, -50.19), while the third focused more on scoring B (-12.21) than I (-57.19). The quantitative results and the qualitative visualizations then explain each other: the first two heads tended to look for a preceding mention token (the diagonal shifted left in Figures 2(b) and 2(c)), enabling them to signal for End and Inside, while the third tended to look for a succeeding mention token (the diagonal shifted right in Figure 2(d)), enabling it to signal for Begin and Inside. In contrast, unable to model cross-context patterns, Baseline fell back to predicting single-token entities (0.13 vs. -0.63, -0.41, -0.38) when hints from familiar surface forms were unavailable.
We have formally analyzed the deficiency of the prevalently used BiLSTM-CNN in modeling cross-context for NER. A concrete proof of its inability to capture XOR patterns has been given. Additive and multiplicative cross-structures have been shown to be crucial in modeling cross-context, significantly enhancing recognition of emerging, complex, confusing, and multi-token entity mentions. Against comparable bare-bone models, 1.4% and 4.6% overall improvements on OntoNotes 5.0 and WNUT 2017 have been achieved, showing the importance of remedying the core module of NER.
- Aguilar et al. (2018) Modeling noisiness to recognize named entities using multitask neural networks on social media. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
- Akbik et al. (2018) Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics.
- Chiu and Nichols (2016) Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics.
- Clark et al. (2018) Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- Derczynski et al. (2017) Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text.
- Hovy et al. (2006) OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers.
- Lin et al. (2017) Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media. In Proceedings of the 3rd Workshop on Noisy User-generated Text.
- Ma and Hovy (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Pradhan et al. (2013) Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning.
- Strauss et al. (2016) Results of the WNUT16 named entity recognition shared task. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT).
- Strubell et al. (2017) Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
- Vaswani et al. (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30.