Why Attention? Analyzing and Remedying BiLSTM Deficiency in Modeling Cross-Context for NER
Peng-Hsuan Li, Tsu-Jui Fu, Wei-Yun Ma
Academia Sinica
{jacobvsdanniel,tsujuifu}@gmail.com, ma@iis.sinica.edu.tw

State-of-the-art approaches to NER have used sequence-labeling BiLSTM as a core module. This paper formally shows the limitation of BiLSTM in modeling cross-context patterns. Two types of simple cross-structures – self-attention and Cross-BiLSTM – are shown to effectively remedy the problem. On both OntoNotes 5.0 and WNUT 2017, clear and consistent improvements are achieved over bare-bone models, up to 8.7% on some of the multi-token mentions. In-depth analyses of the improvements are further given across several aspects, especially the identification of multi-token mentions.

1 Introduction

With state-of-the-art empirical results, most regard BiLSTM-CNN as a robust core module for sequence-labeling NER Ma and Hovy (2016); Chiu and Nichols (2016); Aguilar et al. (2018); Akbik et al. (2018); Clark et al. (2018). However, each direction of BiLSTM only sees and encodes half of a sequence at each time step: the forward LSTM encodes only past context; the backward LSTM encodes only future context. Neither models patterns that cross past and future at a specific time step.

This paper explores two types of cross-structures to help cope with the problem: Cross-BiLSTM-CNN and Att-BiLSTM-CNN. Section 2 formulates the three models, and Section 2.2 gives a concrete proof that patterns forming an XOR cannot be modeled by the (Baseline-)BiLSTM-CNN used in all previous work. Section 3 evaluates the practical effectiveness of the approaches on two challenging NER datasets. The cross-structures bring consistent improvements over Baseline-BiLSTM-CNN without additional gazetteers, language modeling, or multi-task supervision. The improved core module surpasses comparable bare-bone models on OntoNotes and WNUT by 1.4% and 4.6% respectively. Ablation experiments reveal that emerging, complex, confusing, and multi-token entity mentions benefited substantially from the cross-structures, up to 8.7% on some of the multi-token mentions.

2 Model

2.1 (Baseline-)BiLSTM-CNN

For Baseline Chiu and Nichols (2016), a CNN is used to compute character-level word features alongside word embeddings, and a multi-layer BiLSTM is used to capture the future and the past for each time step. Each direction is stacked on itself: with input vectors $x_t$,

$h^{f,1}_t = \mathrm{LSTM}^{f,1}(x_t, h^{f,1}_{t-1}) \qquad h^{f,k}_t = \mathrm{LSTM}^{f,k}(h^{f,k-1}_t, h^{f,k}_{t-1})$

and symmetrically for the backward states $h^{b,k}_t$, with the token representation concatenating the top-layer forward and backward states.

The probability of each token class is given by affine-softmax over the concatenated states. Using OSBIE sequential labels Chiu and Nichols (2016), when there are $P$ entity types, the number of token classes is $4P + 1$: four positional tags per type plus a shared O.
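As a quick sanity check of this count (a minimal sketch of our own; the tag names follow the O/S/B/I/E scheme used in Table 4):

```python
# Under O/S/B/I/E tagging, each entity type contributes four positional
# tags (S, B, I, E), and all types share a single O tag.
def num_token_classes(num_entity_types: int) -> int:
    return 4 * num_entity_types + 1

print(num_token_classes(18))  # OntoNotes 5.0: 18 entity types -> 73 classes
print(num_token_classes(6))   # WNUT 2017: 6 entity types -> 25 classes
```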

2.2 XOR Limitation of Ordinary Multi-Layer BiLSTM

Consider the following four phrases that form an XOR, where the token "and" should be tagged as work-of-art:I in the first two cases and as O in the last two cases.

  1. Key and Peele (work-of-art: show title)

  2. You and I (work-of-art: song title)

  3. Key and I

  4. You and Peele

First, note that the score vector at each time step is the sum of contributions of the two directions: $s_t = s^f_t + s^b_t$, where $s^f_t$ depends only on the past context and $s^b_t$ only on the future context.

Suppose the indices of work-of-art:I and O are i, j respectively. Then, to predict each "and" correctly, it must hold that (superscripts denote the phrase number)

$s^{f,1}_i + s^{b,1}_i > s^{f,1}_j + s^{b,1}_j$
$s^{f,2}_i + s^{b,2}_i > s^{f,2}_j + s^{b,2}_j$
$s^{f,3}_j + s^{b,3}_j > s^{f,3}_i + s^{b,3}_i$
$s^{f,4}_j + s^{b,4}_j > s^{f,4}_i + s^{b,4}_i$

Now, phrase 1 and phrase 3 have the same past context for "and", and hence the same forward contribution, i.e., $s^{f,1} = s^{f,3}$. Similarly, $s^{f,2} = s^{f,4}$, $s^{b,1} = s^{b,4}$, $s^{b,2} = s^{b,3}$. Rewriting the four conditions with these equalities gives

$s^{f,1}_i + s^{b,1}_i > s^{f,1}_j + s^{b,1}_j$
$s^{f,2}_i + s^{b,2}_i > s^{f,2}_j + s^{b,2}_j$
$s^{f,1}_j + s^{b,2}_j > s^{f,1}_i + s^{b,2}_i$
$s^{f,2}_j + s^{b,1}_j > s^{f,2}_i + s^{b,1}_i$

Finally, summing the first two inequalities and summing the last two gives two contradicting constraints that cannot be satisfied simultaneously. In other words, even if an oracle is given for training the model, Baseline-BiLSTM-CNN can tag at most 3 out of the 4 "and" tokens correctly. No matter how many LSTM cells are stacked in each direction, the formulation used in previous studies simply does not have enough modeling capacity to capture cross-context patterns for sequence-labeling NER.
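The contradiction can also be checked numerically. The sketch below (our illustration; names are ours) samples random per-direction score contributions for the two past contexts ("Key", "You") and two future contexts ("Peele", "I") and confirms that no assignment ever satisfies all four tagging constraints at once:

```python
import random

random.seed(0)

def satisfied(f1, f2, b1, b2):
    """Count how many of the four "and"-tagging constraints hold.

    f1, f2: forward-direction contributions after "Key" / "You".
    b1, b2: backward-direction contributions before "Peele" / "I".
    Each is a pair (score_I, score_O); a phrase's score is the sum of
    its forward and backward contributions.
    """
    def wins_I(f, b):  # work-of-art:I outscores O
        return f[0] + b[0] > f[1] + b[1]
    return sum([
        wins_I(f1, b1),       # 1. Key and Peele -> I
        wins_I(f2, b2),       # 2. You and I     -> I
        not wins_I(f1, b2),   # 3. Key and I     -> O
        not wins_I(f2, b1),   # 4. You and Peele -> O
    ])

best = 0
for _ in range(100_000):
    vecs = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(4)]
    best = max(best, satisfied(*vecs))
print(best)  # 3 -- all four constraints are never satisfied together
```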

2.3 Cross-BiLSTM-CNN

To resolve the problem, we propose Cross-BiLSTM-CNN, which interleaves the two directions between stacked layers: each layer above the first takes the concatenated forward and backward states of the previous layer as input,

$h^{f,k}_t = \mathrm{LSTM}^{f,k}([h^{f,k-1}_t; h^{b,k-1}_t], h^{f,k}_{t-1})$

and symmetrically for $h^{b,k}_t$.

As the forward and backward hidden states are interleaved between stacked LSTM layers, Cross-BiLSTM-CNN models cross-context patterns by computing representations of the whole sequence in a feed-forward, additive manner.
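The wiring difference can be sketched with toy recurrent units (a minimal illustration with tanh-RNN cells standing in for LSTM cells; all names and sizes are ours, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, T = 4, 3, 5  # toy hidden size, embedding size, sequence length

def rnn(xs, W, reverse=False):
    """Placeholder recurrent layer (tanh cell standing in for an LSTM)."""
    if reverse:
        xs = xs[::-1]
    h, out = np.zeros(D), []
    for x in xs:
        h = np.tanh(W @ np.concatenate([x, h]))
        out.append(h)
    return out[::-1] if reverse else out

def baseline_bilstm(xs, Wf1, Wb1, Wf2, Wb2):
    # Baseline: each direction is stacked on itself only, so no layer
    # ever mixes past and future context at a time step.
    f = rnn(rnn(xs, Wf1), Wf2)
    b = rnn(rnn(xs, Wb1, reverse=True), Wb2, reverse=True)
    return [np.concatenate(p) for p in zip(f, b)]

def cross_bilstm(xs, Wf1, Wb1, Wf2, Wb2):
    # Cross: layer 2 of BOTH directions reads the interleaved
    # forward+backward states of layer 1, mixing past and future.
    f1 = rnn(xs, Wf1)
    b1 = rnn(xs, Wb1, reverse=True)
    h1 = [np.concatenate(p) for p in zip(f1, b1)]
    f2 = rnn(h1, Wf2)
    b2 = rnn(h1, Wb2, reverse=True)
    return [np.concatenate(p) for p in zip(f2, b2)]

xs = [rng.normal(size=E) for _ in range(T)]
W1 = [rng.normal(size=(D, E + D)) for _ in range(2)]
base_out = baseline_bilstm(xs, *W1, *[rng.normal(size=(D, D + D)) for _ in range(2)])
cross_out = cross_bilstm(xs, *W1, *[rng.normal(size=(D, 2 * D + D)) for _ in range(2)])
print(len(cross_out), cross_out[0].shape)  # T token vectors of size 2*D
```

Note the only change is what layer 2 receives; the per-token output shape is identical, so the downstream affine-softmax is unaffected.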

2.4 Att-BiLSTM-CNN

Another way to resolve the problem is to add a self-attentive mechanism Vaswani et al. (2017) on top of the baseline BiLSTM: each token attends over the BiLSTM states of the whole sequence, and the resulting context vectors are used alongside the hidden states for token classification.

Att-BiLSTM-CNN correlates past and future context for each token in a dot-product, multiplicative manner. To see this, note that with $h_t = [h^f_t; h^b_t]$, a projected dot-product attention score decomposes into bilinear terms,

$(W_q h_t)^\top (W_k h_s) = (h^f_t)^\top A^{ff} h^f_s + (h^f_t)^\top A^{fb} h^b_s + (h^b_t)^\top A^{bf} h^f_s + (h^b_t)^\top A^{bb} h^b_s$

where the $A$'s are blocks of $W_q^\top W_k$; the two cross terms directly correlate one token's past context with another token's future context.
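A numeric sketch of this decomposition (our illustration, assuming generic query/key projections over the concatenated forward/backward state, not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3  # toy size of each direction's hidden state
f_t, b_t = rng.normal(size=d), rng.normal(size=d)  # past/future halves at step t
f_s, b_s = rng.normal(size=d), rng.normal(size=d)  # past/future halves at step s
Wq, Wk = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, 2 * d))

# Attention score between tokens t and s, computed on concatenated states.
h_t, h_s = np.concatenate([f_t, b_t]), np.concatenate([f_s, b_s])
score = (Wq @ h_t) @ (Wk @ h_s)

# Split the projections into forward/backward column blocks and expand.
Qf, Qb = Wq[:, :d], Wq[:, d:]
Kf, Kb = Wk[:, :d], Wk[:, d:]
# The two middle terms are the "cross" terms: they correlate one token's
# past context with the other token's future context.
decomposed = (f_t @ Qf.T @ Kf @ f_s + f_t @ Qf.T @ Kb @ b_s +
              b_t @ Qb.T @ Kf @ f_s + b_t @ Qb.T @ Kb @ b_s)
assert np.isclose(score, decomposed)
```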

                       OntoNotes 5.0                WNUT 2017
                       Prec.  Rec.   F1             Prec.  Rec.   F1
BiLSTM-CNN             86.04  86.53  86.28±0.26     -      -      -
CRF-IDCNN              -      -      86.84±0.19     -      -      -
CRF-BiLSTM(-BiLSTM*)   -      -      86.99±0.22     -      -      38.24
Baseline-BiLSTM-CNN    88.37  87.14  87.75±0.14     53.24  32.93  40.68±1.78
Cross-BiLSTM-CNN       88.37  88.17  88.27±0.17     58.28  33.92  42.85±0.99
Att-BiLSTM-CNN         88.71  88.11  88.40±0.18     55.82  34.08  42.26±0.82
Table 1: Overall results. *Used on WNUT for character-based vectors, reported better than CNN.
       OntoNotes 5.0                     WNUT 2017
       event  language  law   NORP*  work-of…  corporation  creative…  location
Cross  +3.0   +4.1      +4.5  +3.3   +2.1      +6.4         +3.2       +8.6
Att    +4.6   +0.8      +0.8  +3.4   +5.6      +0.3         +2.0       +5.3
Table 2: Types with significant results (≥3% absolute F1 difference vs. Baseline for at least one model). *Nationalities.
       OntoNotes 5.0                 WNUT 2017
       1      2      3      3+      1      2      3      3+
Cross  +0.3%  +0.6%  +1.8%  +1.3%   +1.7%  +2.9%  +8.7%  +5.4%
Att    +0.1%  +1.1%  +2.3%  +1.8%   +1.5%  +2.0%  +2.6%  +0.9%
Table 3: Improvements vs. Baseline among different mention lengths (tokens per mention).

3 Experiments

We evaluated on two datasets: OntoNotes 5.0 Fine-Grained NER – a million-token corpus with diverse sources and 18 fine-grained entity types, including hard ones such as law, event, and work-of-art Hovy et al. (2006); Pradhan et al. (2013) – and WNUT 2017 Emerging NER – a corpus of noisy social media text, whose test set has surface forms seen in the training set filtered out Strauss et al. (2016); Derczynski et al. (2017).

Overall Results. Table 1 shows overall results. Besides Baseline-, Cross-, and Att-BiLSTM-CNN, results of bare-bone BiLSTM-CNN Chiu and Nichols (2016), CRF-BiLSTM(-BiLSTM) Strubell et al. (2017); Lin et al. (2017), and CRF-IDCNN Strubell et al. (2017) from the literature are also listed. The models proposed in this paper surpassed previously reported bare-bone models by 1.4% on OntoNotes and 4.6% on WNUT. The more substantial improvement on WNUT 2017 emerging NER suggests that cross-context patterns are even more crucial for emerging contexts and entities, which cannot be memorized by their surface forms.

Complex and Confusing Entity Mentions. Table 2 shows significant results per entity type. Harder entity types generally benefited more from the cross-structures. For example, work-of-art/creative-work entities can in principle take any surface form and may be written with unreliable capitalization on social media, requiring models to build a better understanding of their context. Both cross-structures were more capable of dealing with such hard entities (2.1%/5.6%/3.2%/2.0%) than Baseline. Moreover, disambiguating fine-grained entity types is also challenging. For example, entities of language and NORP often take the same surface forms. Figure 1(a) shows a confusing example containing "Dutch" and "English", with the attention heat map (Figure 2(a)) telling the story that Att relied on its attention head to make context-aware decisions. Both cross-structures were much better at disambiguating these fine-grained types (4.1%/0.8%/3.3%/3.4%).

Multi-Token Entity Mentions. Table 3 shows results among different entity lengths. Cross-structures were much better at dealing with multi-token mentions (1.8%/2.3%/8.7%/2.6%). In fact, identifying correct mention boundaries for multi-token mentions requires modeling cross-context: a token should be tagged as Inside if and only if it immediately follows a Begin or an I and is immediately followed by an I or an End. Figure 1(b) shows a sentence with "the White house", a triple-token facility mention with unreliable capitalization, resulting in an emerging surface form. While Att correctly tagged the three tokens, Baseline predicted a false single-token mention "White" without hints of a seen surface form.
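The boundary rule above can be made concrete with a small validity checker (our sketch; entity types are omitted for brevity):

```python
def valid_bioes(tags):
    """Check chunking validity over O/S/B/I/E tags, including the rule
    that I must follow B or I and be followed by I or E."""
    padded = ["O"] + list(tags) + ["O"]
    for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
        if cur == "I" and not (prev in ("B", "I") and nxt in ("I", "E")):
            return False
        if cur == "B" and nxt not in ("I", "E"):
            return False
        if cur == "E" and prev not in ("B", "I"):
            return False
    return True

# "the White house" as a triple-token facility mention: B I E
assert valid_bioes(["B", "I", "E"])
# A single-token mention like Baseline's false "White": O S O
assert valid_bioes(["O", "S", "O"])
# An I without a preceding B/I is invalid
assert not valid_bioes(["O", "I", "E"])
```

Deciding each of these tags correctly requires looking at both neighbors at once, which is exactly the cross-context a single-direction LSTM cannot see.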

Entity-Chunking. We performed an ablation study focused on chunking tags to understand why cross-structures are better at locating multi-token mentions. In Table 4, the first column lists the performance of the full Att-BiLSTM-CNN; the other columns list the change in performance when each part of the model scores tags on its own, with the last column for Baseline. The figures are per-token recalls, telling whether a part of the model is responsible for scoring a tag. Att appeared to surpass Baseline by designating the task of scoring I to the attention mechanism: the attention context vectors performed well on their own (-3.80) compared to the BiLSTM states of Att on their own (-28.18). Moreover, the context vectors worked in cooperation: two of them focused more on scoring E (-36.45, -39.19) than I (-60.56, -50.19), while another focused more on scoring B (-12.21) than I (-57.19). The quantitative results and the qualitative visualizations explain each other: the former heads tended to look for a preceding mention token (the diagonal shifted left in Figure 2(b), 2(c)), enabling them to signal for End and Inside; the latter tended to look for a succeeding mention token (the diagonal shifted right in Figure 2(d)), enabling it to signal for Begin and Inside. In contrast, unable to model cross-context patterns, Baseline inadvertently retreated to predicting single-token entities (0.13 vs. -0.63, -0.41, -0.38) when hints from familiar surface forms were unavailable.

(a) A confusing surface form for language and nationality.
(b) A triple-token mention with unreliable capitalization.
Figure 1: Example problematic entities for Baseline-BiLSTM-CNN.
Att-BiLSTM-CNN Baseline-…
O 99.05 -1.68 0.75 0.95 -1.67 -45.57 -0.81 -35.46 -0.03
S 93.74 2.69 -91.02 -90.56 -90.88 -25.61 -86.25 -84.32 0.13
B 90.99 1.21 -52.26 -90.78 -88.08 -90.88 -12.21 -87.45 -0.63
I 90.09 -28.18 -3.80 -87.93 -60.56 -50.19 -57.19 -79.63 -0.41
E 93.23 2.00 -71.50 -93.12 -36.45 -39.19 -91.90 -90.83 -0.38
Table 4: Entity-chunking ablation on OntoNotes development set. (high/low values of interest)
(a) (Partial) attention heat map for "…Dutch into English…".
(b) Attention heat map for "…the White house…".
(c) Attention heat map for "…the White house…".
(d) Attention heat map for "…the White house…".
Figure 2: Attention heat maps for the mentions in Figure 1, best viewed on computer.

4 Conclusion

We have formally analyzed the deficiency of the prevalently used BiLSTM-CNN in modeling cross-context for NER. A concrete proof of its inability to capture XOR patterns has been given. Additive and multiplicative cross-structures have been shown to be crucial in modeling cross-context, significantly enhancing recognition of emerging, complex, confusing, and multi-token entity mentions. Against comparable bare-bone models, overall improvements of 1.4% and 4.6% on OntoNotes 5.0 and WNUT 2017 have been achieved, showing the importance of remedying the core module of NER.


  • [1] G. Aguilar, A. P. López Monroy, F. González, and T. Solorio (2018) Modeling noisiness to recognize named entities using multitask neural networks on social media. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Cited by: §1.
  • [2] A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Cited by: §1.
  • [3] J. Chiu and E. Nichols (2016) Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics. Cited by: §1, §2.1, §3.
  • [4] K. Clark, M. Luong, C. D. Manning, and Q. Le (2018) Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Cited by: §1.
  • [5] L. Derczynski, E. Nichols, M. van Erp, and N. Limsopatham (2017) Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, Cited by: §3.
  • [6] E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel (2006) OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, Cited by: §3.
  • [7] B. Y. Lin, F. Xu, Z. Luo, and K. Zhu (2017) Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media. In Proceedings of the 3rd Workshop on Noisy User-generated Text, Cited by: §3.
  • [8] X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §1.
  • [9] S. Pradhan, A. Moschitti, N. Xue, H. T. Ng, A. Björkelund, O. Uryupina, Y. Zhang, and Z. Zhong (2013) Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Cited by: §3.
  • [10] B. Strauss, B. Toma, A. Ritter, M. de Marneffe, and W. Xu (2016) Results of the WNUT16 named entity recognition shared task. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), Cited by: §3.
  • [11] E. Strubell, P. Verga, D. Belanger, and A. McCallum (2017) Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Cited by: §3.
  • [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, Cited by: §2.4.