Chinese Named Entity Recognition Augmented with Lexicon Memory
Inspired by the concept of content-addressable retrieval from cognitive science, we propose a novel fragment-based model augmented with a lexicon-based memory for Chinese NER, in which character-level and word-level features are combined to generate better feature representations for possible name candidates. We observe that locating the boundaries of entity names is useful for classifying them into pre-defined categories. Position-dependent features, including prefix and suffix, are introduced for NER in the form of distributed representations. The lexicon-based memory is used to help generate such position-dependent features and to deal with the problem of out-of-vocabulary words. Experimental results show that the proposed model, called LEMON, achieves state-of-the-art results on four datasets.
Named Entity Recognition (NER) aims to locate and classify elements in sentences into pre-defined categories such as persons’ names, organizations, locations, etc. NER systems have been developed using linguistic rule-based techniques or statistical models. Rule-based systems identify names by applying linguistic grammar rules governing the derivation of names, while statistical models identify names based on the distribution of their components in a larger corpus [43, 33]. Recently, neural networks have been applied to NER, such as recurrent neural networks (RNNs) [19, 22] and encoder-decoder architectures [18, 5]. There are two reasons for the success of neural networks. On the one hand, neural networks can memorize cases that have been seen during training. On the other hand, they can generalize to unseen cases. However, these models still suffer from two problems: ambiguous word boundaries and out-of-vocabulary words.
Ambiguity of word boundaries: Traditional approaches to Chinese NER can be divided into two paradigms: character-based and word-based models. Character-based models are not effective enough due to the lack of explicit word information [15, 26], while word-based models suffer from error propagation, since word segmentation provides significant information about the boundaries of named entities. Zhang and Yang (2018) proposed a lattice-based model to encode a sequence of characters as well as every potential word that matches a lexicon. However, the important boundary features (prefix and suffix) of each name candidate might be blurred because the model considers all possible segmentations, of which only a few are feasible, possibly introducing unnecessary noise. Named entities often take the form of a fragment (a sequence of contiguous words) rather than a single character or word, which indicates that fragment-based models deserve further exploration.
Out-of-vocabulary words: If word-level information can be harnessed in the form of embeddings, the adverse effect of unknown words can be much alleviated by leveraging a large unlabeled text corpus to learn word embeddings. As shown in Figure 1, “Microsoft” would be classified with a higher probability into the correct category (namely organization) because its embedding is close to those of “Google”, “Amazon”, and so on in the embedding space. Such regular patterns also apply to location entities such as “Rome”, “Tokyo” and “Beijing”. However, for less well-known names of persons or organizations, such as “司马懿” (Sima Yi) and “天美工作室” (Timi Studio), which cannot be found in the vocabulary, syntactic features may help. Most people’s names start with a common Chinese surname followed by one or two characters. Organization names usually begin with the name of a city or country, and end with one of a few words like “公司” (company), “大学” (university), “医院” (hospital), etc.
We propose a fragment-based approach to address the above problems, which combines information at different levels of granularity. Position-dependent features, including prefix, suffix and infix, deserve further investigation in the case of distributed representations. However, not all fragments in a sentence are common words or phrases, so we filter out the rare ones with the assistance of a lexicon. It has proven fruitful to incorporate a lexicon (an external dictionary) for NER [19, 6], although such word-level features are usually added by string matching in a rigid, discrete manner. Constructing a lexicon by hand, for example by collecting lists of person surnames and geographical dictionaries, is time-consuming, so it is worth exploring the possibility of deriving such features automatically from a large word corpus.
The fragment-based approach conforms to the way humans recognize names. Given a fragment, a person’s attention will be drawn towards the contents most relevant to her memory, which can be regarded as content-addressable retrieval, a concept borrowed from cognitive science by artificial intelligence. From the viewpoint of cognitive systems, the biological brain does not learn by a single, global optimization principle, but is modular and composed of distinct subsystems, such as memory and control, which can interact with each other [1, 36].
Inspired by the findings from cognitive science, we propose a fragment-based model for Chinese NER augmented with a lexicon-based memory, called LEMON (LExicon-MemOry-augmented-Ner). The model consists of three submodules: a character encoder that imitates the process of scanning each character in an input sentence to grasp the global semantics, a fragment encoder that simulates the procedure of reading a sub-sequence (such as words or fragments) in a sentence, and a memory that stores a large number of words that have been seen before. A ranking algorithm is used to determine whether a fragment is a valid name and which category it belongs to by taking its prefix, suffix, and infix features into account. Experimental results show that the proposed model achieves state-of-the-art results on four different benchmark datasets.
Xu et al. (2017) first presented a local detection approach for mention detection and name classification. Their model uses a fixed-size ordinally-forgetting encoding (FOFE) to represent all fragments in the context [51]. Our model differs from theirs in that we adopt a character encoder to establish connections between the fragment and its context and thereby provide global context features. Besides, position-dependent features are introduced for each candidate name via the lexicon-based memory.
Attention Mechanism and Memory Network
The attention mechanism was first proposed for machine translation [2, 28], where it learns an alignment between the source and target languages by estimating their correlation scores. It has also been applied to NER in several ways: integrating character-level information by attending to characters [34], capturing global context information by attending to different sentences in a document [44], and adopting an adaptive co-attention between texts and pictures [50]. Memory networks were first introduced for question answering [42, 12]. This study is among the first to incorporate word-level features through memory networks for NER.
We present the architecture of the proposed model in this section. As shown in Figure 2, LEMON is mainly composed of three parts: a character encoder, which maps each character into its feature vector; a fragment encoder, which encodes any variable-length sub-sequence of an input sentence into a fixed-sized vector representation; and a lexicon memory, which is designed to help disambiguate word boundaries and deal with the out-of-vocabulary problem by providing external syntactic and semantic features for the possible words occurring in any fragment.
Given a sentence $(c_1, c_2, \dots, c_n)$, each character $c_i$ is first mapped into its feature vector $\mathbf{x}_i$. The information derived from the results of word segmentation and part-of-speech (POS) tagging has proven to be useful for NER tasks [32, 47], and thus we augment the character representation with its soft-word and part-of-speech information. As shown in Figure 3, the BMES scheme is used to represent the results of word segmentation [46]. Each character is also assigned the POS tag of the word to which it belongs. The feature vector of each character is obtained by concatenating the feature vectors from the three parts as:
$$\mathbf{x}_i = [\mathbf{E}^{c}(c_i); \mathbf{E}^{s}(s_i); \mathbf{E}^{p}(p_i)]$$
where $\mathbf{E}^{c}$, $\mathbf{E}^{s}$, $\mathbf{E}^{p}$ are three look-up tables, and $c_i$, $s_i$, $p_i$ are the indices of the character, its soft-word label and its POS tag. The character encoder is used to get the context-aware representation of a character in a given sentence:
$$(\mathbf{h}_1, \dots, \mathbf{h}_n) = \mathrm{CharEncoder}(\mathbf{x}_1, \dots, \mathbf{x}_n)$$
where $\mathbf{h}_i \in \mathbb{R}^{d_c}$, and $d_c$ is the dimensionality of the context-aware character representation. Several networks can be adopted as the character encoder, such as a bi-directional LSTM, which is well suited to modelling long-distance dependencies [11, 16], and a transformer, which was first proposed for machine translation [39] to capture the dependencies between words at any distance in a sentence and has been gaining much attention recently.
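The construction of the per-character input features can be illustrated with a minimal sketch. It assumes the sentence is already segmented and POS-tagged; the function names (`bmes_labels`, `char_features`) and the toy look-up tables are hypothetical and only stand in for learned embedding tables.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def bmes_labels(words: Sequence[str]) -> List[str]:
    """Return one BMES soft-word label per character (words must be non-empty)."""
    labels: List[str] = []
    for word in words:
        if len(word) == 1:
            labels.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, End
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


def char_features(
    words: Sequence[str],
    pos_tags: Sequence[str],
    char_table: Dict[str, Vector],
    seg_table: Dict[str, Vector],
    pos_table: Dict[str, Vector],
) -> List[Vector]:
    """Concatenate character, soft-word and POS vectors for every character."""
    chars = [ch for word in words for ch in word]
    segs = bmes_labels(words)
    # Every character inherits the POS tag of the word it belongs to.
    tags = [tag for word, tag in zip(words, pos_tags) for _ in word]
    return [
        char_table[c] + seg_table[s] + pos_table[t]
        for c, s, t in zip(chars, segs, tags)
    ]


# Toy example with Latin letters standing in for Chinese characters.
toy_chars = {"a": [1.0], "b": [2.0], "c": [3.0]}
toy_segs = {"B": [0.1], "M": [0.2], "E": [0.3], "S": [0.4]}
toy_pos = {"NN": [7.0], "VV": [8.0]}
features = char_features(["ab", "c"], ["NN", "VV"], toy_chars, toy_segs, toy_pos)
```

In a trained model the three tables would be embedding matrices and the resulting vectors would be passed to the character encoder.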
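The fragment encoder is used to produce a feature vector for each $n$-gram in a sentence. Given a sequence of $k$ characters, $\mathbf{H} \in \mathbb{R}^{k \times d_c}$, where $d_c$ is the dimensionality of the context-aware character representation, the fragment encoder learns to map the matrix $\mathbf{H}$ to a fixed-sized vector $\mathbf{f} \in \mathbb{R}^{d_f}$, where $d_f$ is the dimensionality of the fragment embedding.
$$\mathbf{f}_{i,j} = \mathrm{FragEncoder}(\mathbf{h}_i, \mathbf{h}_{i+1}, \dots, \mathbf{h}_j)$$
where $\mathbf{h}_i$ is the context-aware representation of the $i$-th character and $\mathbf{f}_{i,j}$ denotes the representation of a candidate fragment spanning from character $i$ to character $j$.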
Assuming that the maximum length of named entities is $L$, for a sentence consisting of $n$ characters, the number of all possible fragments is $O(nL)$. Encoding every fragment independently takes $O(nL^2)$ time, which is rather time-consuming. However, an inherent recursive structure helps to reduce the complexity: the representations already produced for shorter fragments can be reused to generate those of longer ones, so all the fragments can be enumerated in $O(nL)$ time.
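The reuse of shorter fragments can be sketched as follows. This is only an illustration under an assumed encoder: a running sum of character vectors is used because it makes the recursion obvious, and the function names are hypothetical. Each fragment is obtained from the fragment that is one character shorter with a single vector addition.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Span = Tuple[int, int]  # (start, end), both inclusive


def enumerate_fragments(n: int, max_len: int) -> List[Span]:
    """All spans of at most max_len characters in a sentence of n characters."""
    return [(i, j) for i in range(n) for j in range(i, min(i + max_len, n))]


def running_sum_encodings(h: List[Vector], max_len: int) -> Dict[Span, Vector]:
    """Encode every fragment by extending the encoding of its one-shorter prefix."""
    encodings: Dict[Span, Vector] = {}
    for i in range(len(h)):
        acc = [0.0] * len(h[i])
        for j in range(i, min(i + max_len, len(h))):
            # Reuse the encoding of (i, j - 1): one addition per new fragment.
            acc = [a + b for a, b in zip(acc, h[j])]
            encodings[(i, j)] = acc
    return encodings


# Toy sentence of five characters with one-dimensional representations.
toy_h = [[1.0], [2.0], [3.0], [4.0], [5.0]]
toy_spans = enumerate_fragments(len(toy_h), max_len=3)
toy_enc = running_sum_encodings(toy_h, max_len=3)
```

A recurrent fragment encoder admits the same trick, since its state after reading characters $i$ to $j$ is computed from its state after reading characters $i$ to $j-1$.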
Several methods can serve as the fragment encoder. Xu et al. (2017) employ a Fixed-size Ordinally-Forgetting Encoding (FOFE) as such an encoder, which incorporates a forgetting factor to reflect position information [51]. The bag-of-words method, which simply averages the representations of words or characters, can also be used as a baseline encoder [20, 8].
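The difference between the two simple encoders can be illustrated with a short sketch. It assumes the usual left-to-right FOFE recursion $\mathbf{z}_t = \alpha \mathbf{z}_{t-1} + \mathbf{e}_t$ with a forgetting factor $\alpha$; the value 0.5 and the one-hot toy vectors are illustrative assumptions, not settings taken from the experiments.

```python
from typing import List

Vector = List[float]


def fofe(vectors: List[Vector], alpha: float) -> Vector:
    """Fixed-size ordinally-forgetting encoding of a non-empty vector sequence."""
    z = [0.0] * len(vectors[0])
    for v in vectors:
        # Older positions are discounted by alpha at every step.
        z = [alpha * zi + vi for zi, vi in zip(z, v)]
    return z


def bag_of_words(vectors: List[Vector]) -> Vector:
    """Order-insensitive baseline: the element-wise average of the vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


A, B = [1.0, 0.0], [0.0, 1.0]
# Same multiset of symbols, different order.
fofe_aba = fofe([A, B, A], alpha=0.5)
fofe_aab = fofe([A, A, B], alpha=0.5)
bow_aba = bag_of_words([A, B, A])
bow_aab = bag_of_words([A, A, B])
```

The two orderings receive different FOFE codes but identical bag-of-words codes, which is why the latter cannot model the order information of a sequence.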
The lexicon used in this study is not just a gazetteer (i.e., a vocabulary consisting of known named entities): it contains all the possible words extracted from a dataset, which allows us to leverage large-scale unlabeled data to obtain rich features about words. Like Zhang and Yang [53], the lexicon is obtained by automatically segmenting the Chinese Giga-Word dataset.
Lexicon Matching Modes
Given a fragment, we perform pattern matching on it over the constructed lexicon. We define four types of matching modes as follows:
- Exact matching: If there exists a word in the lexicon that is exactly the same as the fragment, the word can be directly used to replace this fragment.
- $k$-prefix matching: If the first $k$ characters of a fragment match a word, we call it $k$-prefix matching. For example, a fragment whose first two characters form a lexicon word matches that word in the 2-prefix matching mode. Such matching patterns provide informative features for identifying named entities whose prefixes are usually chosen from a limited number of words, such as commonly used Chinese surnames like “上官” (Shangguan) and “司马” (Sima).
- $k$-suffix matching: Similarly, if the last $k$ characters of a fragment match a word, we call it $k$-suffix matching. These matching patterns are quite useful for recognizing entities whose names end with one of a few words. For example, many locations and organizations share similar suffixes, such as “省” (Province) and “部” (Ministry).
- Infix matching: If a word can be found in the middle of a fragment, it is an infix matching. Its role is slightly different from the above modes: such a match serves as a hint that a fragment might contain a nested structure.
Since the first (or last) one and two characters are relatively more important for NER, the results of the different matching modes are grouped into multiple buckets according to their importance. We define LEMON-$K$ such that, for each distinct $k$, the feature derived from $k$-prefix (or $k$-suffix) matching is placed into a separate bucket if $k \le K$, while the remaining features ($k > K$) are grouped into a single bucket. The value $K$ is a hyper-parameter.
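The four matching modes can be sketched with plain string operations. This is a minimal illustration with a hypothetical function name and a toy lexicon; it assumes that prefix and suffix matches are proper (shorter than the fragment) and that an infix match touches neither end of the fragment.

```python
from typing import List, Set, Tuple

Match = Tuple[str, str, int]  # (matched word, matching mode, length k)


def match_fragment(fragment: str, lexicon: Set[str]) -> List[Match]:
    """Collect exact, k-prefix, k-suffix and infix matches of a fragment."""
    n = len(fragment)
    matches: List[Match] = []
    if fragment in lexicon:
        matches.append((fragment, "exact", n))
    for k in range(1, n):
        if fragment[:k] in lexicon:
            matches.append((fragment[:k], "prefix", k))
        if fragment[-k:] in lexicon:
            matches.append((fragment[-k:], "suffix", k))
    # Infix: a word strictly inside the fragment (assumption of this sketch).
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            if fragment[i:j] in lexicon:
                matches.append((fragment[i:j], "infix", j - i))
    return matches


toy_lexicon = {"司马", "马", "大学"}
person_matches = match_fragment("司马懿", toy_lexicon)
org_matches = match_fragment("北京大学", toy_lexicon)
```

Here the surname “司马” is found as a 2-prefix of the person name, and “大学” as a 2-suffix of the organization name, even though neither full name is in the toy lexicon.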
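The bucketing rule can be written down in a few lines. The sketch below is hedged: the parameter name `max_separate_k` stands for the bucketing hyper-parameter, the bucket names are made up for illustration, and keeping one shared bucket per mode (rather than one for prefixes and suffixes together) is an assumption.

```python
def bucket_of(mode: str, k: int, max_separate_k: int) -> str:
    """Name of the bucket that a match of the given mode and length falls into."""
    if mode in ("prefix", "suffix"):
        if k <= max_separate_k:
            return f"{k}-{mode}"  # short matches get their own bucket
        return f">{max_separate_k}-{mode}"  # longer matches share one bucket
    return mode  # exact and infix matches keep a bucket of their own


# With the hyper-parameter set to 2, 1- and 2-prefix matches stay separate.
buckets_k2 = [bucket_of("prefix", k, 2) for k in (1, 2, 3, 4)]
# With the hyper-parameter set to 0, all prefix matches are merged.
buckets_k0 = [bucket_of("prefix", k, 0) for k in (1, 2, 3, 4)]
```

Each bucket would then be associated with its own matching-mode feature vector.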
Attention over Lexicon Memory
Memory networks provide a feasible way to extract the relevant features from a lexicon-based memory with content-addressable retrieval [42, 37]. Given a matching instance $(w, m)$, where $w$ denotes a matched word in the lexicon and $m$ denotes one of the matching modes, the two are mapped into feature vectors and concatenated as a memory unit:
$$\mathbf{u} = [\mathbf{E}^{w}(w); \mathbf{E}^{m}(m)]$$
where $\mathbf{E}^{w} \in \mathbb{R}^{|V| \times d_w}$, $|V|$ is the size of the lexicon, and $d_w$ is the dimensionality of its vector space; $\mathbf{E}^{m} \in \mathbb{R}^{N_m \times d_m}$, $N_m$ is the number of the matching modes, and $d_m$ is the dimensionality of the feature vectors used to represent the different matching modes.
For a fragment, we first find all its matched words, then group them into multiple buckets in the way described in Section Lexicon Matching Modes, and finally assemble them into a matrix $\mathbf{M} \in \mathbb{R}^{N \times (d_w + d_m)}$, which is a lexicon memory dynamically built for the fragment, where $N$ is the number of matches over the lexicon.
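Reading from such a memory can be sketched with a generic attention step. The scoring function below is an assumption: plain dot-product attention is used as a stand-in, and the query is assumed to have the same dimensionality as a memory unit (in practice a learned projection would map the fragment representation into that space).

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, memory: List[Vector]) -> Tuple[List[float], Vector]:
    """Return attention weights over memory units and their weighted sum."""
    if not memory:  # a fragment without any lexicon match
        return [], [0.0] * len(query)
    scores = [sum(q * u for q, u in zip(query, unit)) for unit in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    summary = [
        sum(w * unit[d] for w, unit in zip(weights, memory)) for d in range(dim)
    ]
    return weights, summary


# Two toy memory units; the query is closer to the first one.
toy_weights, toy_summary = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The weighted sum plays the role of the lexicon feature vector that is combined with the fragment representation for classification.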
Classification and Decoding
For a fragment, its representation and the result of the attention over the lexicon are concatenated to produce the final representation. This representation is then fed into a multi-layer feed-forward neural network to predict the labels of entities. If a fragment does not belong to any entity, it is labelled as “NONE”. We use the recently proposed focal loss [25] as the training objective to mitigate the sample-imbalance problem.
where denotes the probability of the true label, is a parameter vector for the true label which will be tuned during the training process, and is a hyper-parameter that governs the relative importance of the positive samples with the negative ones. If all the values of , and are set to , the focal loss is reduced to the cross-entropy loss.
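For reference, the focal loss in its commonly used form can be sketched as below. This is the standard formulation from the object detection literature, $-\alpha (1 - p)^{\gamma} \log p$, and not necessarily the exact parametrization used here; in particular, the weight `alpha` is treated as a fixed scalar in this sketch rather than a tuned per-label parameter, and the default values are illustrative.

```python
import math


def focal_loss(p_true: float, alpha: float = 1.0, gamma: float = 2.0) -> float:
    """Focal loss for one sample, given the probability of its true label."""
    # (1 - p_true) ** gamma down-weights samples that are already easy.
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


easy = focal_loss(0.9)  # confidently correct sample
hard = focal_loss(0.1)  # badly mis-classified sample
cross_entropy = focal_loss(0.9, alpha=1.0, gamma=0.0)
```

In this standard form, setting `gamma` to zero and `alpha` to one recovers the cross-entropy loss, while larger values of `gamma` shrink the contribution of easy samples such as the many “NONE” candidates.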
A decoding layer is stacked on top of the entity detector to resolve the issue that overlapping fragments might occasionally all be recognized as valid entities [45]:
1. A threshold is used to filter the results. A fragment is identified as an entity if the model assigns the highest probability to an entity type and that probability is greater than the threshold; otherwise it is recognized as “NONE”.
2. If a recognized entity contains another candidate (nested) entity, only the outer entity is retained for further processing.
3. If two identified entities overlap each other, only the one with the higher probability is kept.
We found that this decoding strategy works well although it runs in a greedy way. The strategy can also be used to recognize nested entities simply by removing the second step.
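The decoding procedure can be sketched as follows. The tuple layout, the function names and the `keep_nested` flag are assumptions made for illustration; each candidate is taken to carry its most probable label, and overlaps are resolved greedily in order of decreasing probability.

```python
from typing import List, Tuple

Candidate = Tuple[int, int, str, float]  # (start, end inclusive, label, probability)


def contains(outer: Candidate, inner: Candidate) -> bool:
    """True if inner lies inside outer and the two spans differ."""
    same = (outer[0], outer[1]) == (inner[0], inner[1])
    return outer[0] <= inner[0] and inner[1] <= outer[1] and not same


def crosses(a: Candidate, b: Candidate) -> bool:
    """True if the spans share a position but neither contains the other."""
    intersect = a[0] <= b[1] and b[0] <= a[1]
    return intersect and not contains(a, b) and not contains(b, a)


def decode(cands: List[Candidate], threshold: float,
           keep_nested: bool = False) -> List[Candidate]:
    # Step 1: drop "NONE" predictions and low-probability candidates.
    kept = [c for c in cands if c[2] != "NONE" and c[3] > threshold]
    # Step 2: keep only outer entities (skipped when nested entities are wanted).
    if not keep_nested:
        kept = [c for c in kept if not any(contains(o, c) for o in kept)]
    # Step 3: among overlapping entities keep the more probable one, greedily.
    result: List[Candidate] = []
    for c in sorted(kept, key=lambda c: -c[3]):
        if not any(crosses(c, r) for r in result):
            result.append(c)
    return sorted(result)


toy_cands = [(0, 2, "ORG", 0.9), (0, 1, "PER", 0.8), (2, 4, "LOC", 0.7),
             (5, 6, "PER", 0.3), (7, 8, "NONE", 0.99)]
flat = decode(toy_cands, threshold=0.5)
nested = decode(toy_cands, threshold=0.5, keep_nested=True)
```

With `keep_nested=True` the second step is skipped, so the person name nested inside the organization survives, while the location that only partially overlaps the organization is still discarded.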
| Seg. / POS | Fragment encoder | Baseline P (%) | Baseline R (%) | Baseline F1 (%) | Transformer P (%) | Transformer R (%) | Transformer F1 (%) | Bi-RNN P (%) | Bi-RNN R (%) | Bi-RNN F1 (%) |
| gold | BOW + Lex | 78.77 | 70.40 | 74.35 (+7.54) | 76.73 | 73.48 | 75.07 (+8.26) | 78.27 | 75.34 | 76.78 (+5.51) |
| gold | FOFE + Lex | 77.33 | 71.90 | 74.52 (+4.74) | 79.92 | 72.65 | 76.11 (+17.37) | 79.49 | 73.77 | 76.53 (+2.99) |
| gold | Bi-RNN + Lex | 77.40 | 74.39 | 75.87 (+4.21) | 79.62 | 73.87 | 76.64 (+19.36) | 81.12 | 75.18 | 78.04 (+5.47) |
| auto | BOW + Lex | 76.33 | 64.75 | 70.06 (+5.18) | 73.96 | 64.69 | 69.02 (+4.14) | 78.42 | 67.06 | 72.30 (+5.38) |
| auto | FOFE + Lex | 77.24 | 63.91 | 69.95 (+5.71) | 78.46 | 62.93 | 69.85 (+2.87) | 76.24 | 68.76 | 72.31 (+4.15) |
| auto | Bi-RNN + Lex | 77.62 | 66.32 | 71.53 (+2.83) | 76.79 | 67.09 | 71.61 (+4.09) | 76.57 | 69.54 | 72.89 (+3.28) |
Rows give the fragment encoder and column groups give the character encoder (Baseline, Transformer, Bi-RNN). Rows marked “gold” use gold segmentation and part-of-speech tags, while rows marked “auto” use tags automatically generated by the THULAC toolkit.
| Features | Ground truth P (%) | Ground truth R (%) | Ground truth F1 (%) | Automatically labelled P (%) | Automatically labelled R (%) | Automatically labelled F1 (%) |
| char + seg | 70.58 | 69.96 | 70.27 | 70.77 | 63.33 | 66.85 |
| char + pos | 71.81 | 74.48 | 73.12 | 70.20 | 70.26 | 70.23 |
| char + seg + pos | 75.63 | 72.35 | 73.08 | 72.88 | 68.18 | 70.45 |
| char + seg | 72.16 | 66.09 | 68.99 | 70.48 | 62.65 | 66.33 |
| char + pos | 74.39 | 65.44 | 69.63 | 72.87 | 63.73 | 67.99 |
| char + seg + pos | 74.97 | 72.23 | 73.58 | 76.29 | 64.43 | 69.86 |
| char + seg | 78.40 | 70.75 | 74.38 | 76.11 | 64.63 | 69.91 |
| char + pos | 77.71 | 72.35 | 74.93 | 77.46 | 66.89 | 71.79 |
| char + seg + pos | 78.70 | 74.95 | 76.78 | 76.41 | 68.61 | 72.30 |
We evaluated our model on four different datasets: OntoNotes-4 [41], MSRA [23], Weibo NER [32, 31], and Resume [53]. The statistics of these four datasets are given in Table 1. As mentioned in Section Character Encoder, each character needs to be assigned a soft-word label as well as a POS tag. All the datasets were segmented and tagged using the THULAC toolkit [38], which achieved about of F1-score in the word segmentation on the datasets. For the OntoNotes-4 dataset, the gold segmentation and part-of-speech tags are available, and we report the NER results both with and without gold segmentation and POS tags.
The proposed model is implemented with the PyTorch deep learning framework [30].
We pretrained the word embeddings and character embeddings on Chinese Giga-word by Word2vec. We tuned all the hyper-parameters on the development set of OntoNotes-4 dataset. The dimensionalities of word embeddings and character embeddings were all set to , and the dimensionalities of soft-word embeddings and POS tag embeddings were both set to . Dropout mechanism was applied to the character encoder at the embedding layer with a drop rate of . All learned parameters are updated by the Adam Optimizer . It is worth mentioning that we use a sparse version of the Adam Optimizer
Experiments on OntoNotes-4
We carried out a set of preliminary experiments on the development set of OntoNotes-4 to optimize the architecture by trying a few different components, and to gain some understanding of how the choice of features affects the performance.
Evaluation with Different Architectures
We tried several combinations of different character and fragment encoders to find a suitable configuration for NER. Three different types of networks were tested as the character encoder, and we also tried three different architectures for the fragment encoder. An embedding look-up layer serves as a baseline for the character encoder. Besides, two popular sequence models, a transformer (6-layer, 8-heads, 512-dim) and a bi-directional LSTM (2-layer, 256-dim), are compared. As to the fragment encoder, we conducted experiments with Bag-of-Words (BOW), FOFE () and a bi-directional LSTM. Xu et al. (2017) used an embedding look-up layer as the character encoder and a FOFE as the fragment encoder, and they predict the type of a candidate $n$-gram with the help of its left and right contexts. We tried to integrate such context information for NER as Xu et al. (2017) did, but the results of preliminary experiments showed that its contribution to the performance is negligible.
The results of different combinations on the development set of OntoNotes-4 are shown in Table 2. The performance of all models decreases by approximately in F1-score if we use the word segmentation and POS-tagging results automatically generated by the THULAC toolkit instead of the ground truth. This shows that the NER performance is significantly influenced by the results of the upstream tasks through error propagation.
Character Encoder: Bi-RNN always outperforms the other character encoders due to its ability to model long-term dependencies. The transformer contributes little, performing only slightly better than the baseline, although it has achieved great success in machine translation. One reasonable explanation is that the number of training sentences is not sufficient to fit the model capacity of the transformer [9].
Fragment Encoder: Bi-RNN surpasses the other encoders, especially when the character encoder is not based on the Bi-RNN. BOW is inferior to the others since it is unable to model the order information of a sequence, which is critical for entity recognition. FOFE learns to produce a linear combination of the representations of the words in a sub-sequence, which is less flexible than the Bi-RNN in sequence modeling since the latter is capable of learning non-linear combinations.
Lexicon Memory: The incorporation of the lexicon memory greatly boosts the results of every combination of components, with an average increase of about in F1-score. This can be taken as strong evidence that the introduced lexicon memory enhances the model’s performance in NER.
The significance of different features is shown in Table 3. For comparison, we also trained an LSTM-CRF model as a traditional approach using NCRF++, an open-source neural sequence labeling toolkit [48]. The experimental results demonstrate that the features derived from word segmentation and POS-tagging always benefit all the models, no matter whether they are labeled by humans or produced by an automatic toolkit.
LEMON still beats the LSTM-CRF-based model by in F1-score without using any word segmentation or part-of-speech information, which shows that the introduced lexicon memory provides valuable position-dependent and word-level features via the attention mechanism.
|Model||P (%)||R (%)||F1 (%)|
|Model||P (%)||R (%)||F1 (%)|
Models indicated with are those in which the sequence labeling technique is used with LSTM + CRF. Results are extracted from [53].
|Model||P (%)||R (%)||F1 (%)|
The model indicated with denotes that gold word segmentation is used.
|Model||P (%)||R (%)||F1 (%)|
LEMON-2 achieved state-of-the-art results on all four datasets. As shown in Tables 4 and 5, LEMON performs slightly better than the Lattice LSTM on the MSRA and Resume NER datasets. Our model also achieved the highest F1-score on OntoNotes-4 (see Table 6). Note that the Weibo NER data is extracted from social media; it is full of non-standard expressions and only contains about k samples. The problems of out-of-vocabulary words and ambiguous word boundaries become more serious for NER on this dataset. However, LEMON still outperforms the other models by a fairly significant margin (at least increase), as we can see in Table 7.
We conducted experiments on the Weibo NER dataset to study the influence of the attention over the lexicon memory, and how the choice of the threshold and the focal loss coefficients affects the performance.
Attention over Lexicon Memory
Figure 5 illustrates which words are given more weight by the attention operations over the lexicon memory. As shown in the heat map, the model learns to assign more weight to the key words of named entities, and the attention is sharp for those words that are particularly informative for NER.
Taking the entities of “ORG” (organization) as examples, more weight is placed on the last two characters, such as “中心” (center), “政府” (government), “学校” (school), “组织” (organization), etc. This accords with the common sense that the last characters are more important in identifying Chinese names of organizations. We found a similar phenomenon when recognizing person names. For instance, famous names such as “江泽民” (Zemin Jiang) can be matched exactly and recognized as a person name, while for names of less well-known persons, the first character (i.e. the surname) tends to be given more attention.
Decoding Threshold Settings
We report the F1-scores for different settings of LEMON on the development set of Weibo NER in Figure 6. LEMON-2 generally performs better than LEMON-0 and LEMON-1, since the features derived from 1- and 2-prefix and suffix matching are useful for NER and should not be mixed into a single bucket, as described in Section Lexicon Matching Modes.
We found that the best value of the threshold is in range of and . As shown in Figure 6, if the value of is greater than or equal to , the performances are more sensitive to the values of threshold . The performance will drop dramatically if is less than and is set to . One possible explanation is that the focal loss tends to update the parameters by a far larger step for the samples that are hard to be recognized, especially when the probabilities assigned for those samples are pretty low.
Coefficients of Focal Loss
We compared the speed of convergence for different values of the coefficient used in the focal loss in Figure 5. If the coefficient is set to zero, the focal loss reduces to the cross-entropy loss. When the cross-entropy loss is used, the model is trapped at an extremely low performance for epochs, which indicates that such a loss is not optimal in situations with severe sample imbalance. Note that models usually suffer from sample imbalance in NER because most candidates are labelled as “NONE”. The model with the focal loss converges relatively faster because this loss adaptively assigns different update steps to mis-classified samples according to how hard they are to recognize. Although the model trained with the focal loss did not outperform the one trained with the cross-entropy loss, it does help to speed up the training process.
Observing that Chinese names are usually formed in some distinct patterns and that the features derived from their prefixes and suffixes are particularly useful for identifying them, we introduced a fragment-based model augmented with position-dependent features learned from a lexicon for Chinese NER tasks. Experimental results showed that the model using position-dependent features and lexicon-based memory achieved state-of-the-art results on four different NER datasets.
- https://pytorch.org/docs/stable/optim.html#torch.optim.SparseAdam
- The weight decay was set to for the Weibo NER dataset, otherwise the network will be hard to converge.
- (2004) An integrated theory of the mind.. Psychological review. Cited by: Introduction.
- (2015) Neural machine translation by jointly learning to align and translate. ICLR. Cited by: Attention Mechanism and Memory Network.
- (2013) Named entity recognition with bilingual constraints. In NAACL, Cited by: Table 6.
- (2006) Chinese named entity recognition with conditional probabilistic models. In SIGHAN Workshop, Cited by: Table 4.
- (2018) Learning to progressively recognize new named entities with sequence to sequence models. In Proceedings of the 27th International Conference on Computational Linguistics, Cited by: Introduction.
- (2016) Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics. Cited by: Introduction.
- (2011) Natural language processing (almost) from scratch. Journal of machine learning research. Cited by: Introduction.
- (2018) What you can cram into a single vector: probing sentence embeddings for linguistic properties. ACL. Cited by: Fragment Encoder.
- (2018) Bert: pre-training of deep bidirectional transformers for language understanding. NAACL. Cited by: Evaluation with Different Architectures.
- (2016) Character-based LSTM-CRF with radical-level features for chinese named entity recognition. In Natural Language Understanding and Intelligent Applications, Cited by: Table 4.
- (1990) Finding structure in time. Cognitive science. Cited by: Character Encoder.
- (2003) Named entity recognition with long short-term memory. In NAACL, Cited by: Attention Mechanism and Memory Network.
- (2017) Neuroscience-inspired artificial intelligence. Neuron. Cited by: Introduction.
- (2017) A unified model for cross-domain and semi-supervised named entity recognition in chinese social media. In AAAI, Cited by: Table 7.
- (2008) Chinese named entity recognition and word segmentation based on character. In SIGHAN Workshop, Cited by: Introduction.
- (1997) Long short-term memory. Neural computation. Cited by: Character Encoder.
- (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences. Cited by: Introduction.
- (2017) Addressing domain adaptation for chinese word segmentation with global recurrent structure. In IJCNLP, Cited by: Introduction.
- (2015) Bidirectional lstm-crf models for sequence tagging. Arxiv. Cited by: Introduction, Introduction.
- (2017) Bag of tricks for efficient text classification. In EACL, Cited by: Fragment Encoder.
- (2014) Adam: a method for stochastic optimization. ICLR. Cited by: Training Details.
- (2016) Neural architectures for named entity recognition. Arxiv. Cited by: Introduction.
- (2006) The third international chinese language processing bakeoff: word segmentation and named entity recognition. In SIGHAN Workshop, Cited by: Datasets.
- (2018) A survey on deep learning for named entity recognition. Arxiv. Cited by: Introduction.
- (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, Cited by: Training Objective.
- (2010) Chinese named entity recognition with a sequence labeling approach: based on characters, or based on words?. In Advanced intelligent computing theories and applications. With aspects of artificial intelligence, Cited by: Introduction.
- (2016) Multi-prototype chinese character embedding.. In LREC, Cited by: Table 4.
- (2015) Effective approaches to attention-based neural machine translation. EMNLP. Cited by: Attention Mechanism and Memory Network, Attention over Lexicon Memory.
- (2013) Distributed representations of words and phrases and their compositionality. In NIPS, Cited by: Lexicon Construction.
- (2017) Automatic differentiation in pytorch. NIPS Workshop. Cited by: Training Details.
- (2015) Named entity recognition for chinese social media with jointly trained embeddings. In EMNLP, Cited by: Datasets.
- (2016) Improving named entity recognition for chinese social media with word segmentation representation learning. ACL. Cited by: Character Encoder, Datasets, Table 7.
- (2013) Learning multilingual named entity recognition from wikipedia. Artificial Intelligence. Cited by: Introduction.
- (2016) Attending to characters in neural sequence labeling models. Arxiv. Cited by: Attention Mechanism and Memory Network.
- (2013) Natural language processing: semantic aspects. Cited by: Introduction.
- (1988) From neuropsychology to mental structure. Cited by: Introduction.
- (2015) End-to-end memory networks. In NIPS, Cited by: Attention over Lexicon Memory.
- (2016) Thulac: an efficient lexical analyzer for chinese. Technical Report. Cited by: Datasets.
- (2017) Attention is all you need. In NIPS, Cited by: Character Encoder, Attention over Lexicon Memory.
- (2013) Effective bilingual constraints for semi-supervised learning of named entity recognizers. In AAAI, Cited by: Table 6.
- (2011) OntoNotes release 4.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium. Cited by: Datasets.
- (2015) Memory networks. ICLR. Cited by: Attention Mechanism and Memory Network, Attention over Lexicon Memory.
- (2009) Phrase clustering for discriminative learning. In ACL, Cited by: Introduction.
- (2018) Improving clinical named entity recognition with global neural attention. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, Cited by: Attention Mechanism and Memory Network.
- (2017) A local detection approach for named entity recognition and mention detection. In ACL, Cited by: Introduction, Decoding Strategy.
- (2003) Chinese word segmentation as lmr tagging. In SIGHAN Workshop, Cited by: Character Encoder.
- (2016) Combining discrete and neural features for sequence labeling. In International Conference on Intelligent Text Processing and Computational Linguistics, Cited by: Character Encoder, Table 6.
- (2018) NCRF++: an open-source neural sequence labeling toolkit. In ACL, Cited by: Feature Combinations.
- (2016) Understanding deep learning requires rethinking generalization. ICLR. Cited by: Introduction.
- (2018) Adaptive co-attention network for named entity recognition in tweets. In AAAI, Cited by: Attention Mechanism and Memory Network.
- (2015) The fixed-size ordinally-forgetting encoding method for neural network language models. In ACL, Cited by: Local Detection, Fragment Encoder.
- (2006) Word segmentation and named entity recognition for sighan bakeoff3. In SIGHAN Workshop, Cited by: Table 4.
- (2018) Chinese NER using lattice LSTM. ACL. Cited by: Lexicon Construction, Table 5, Datasets, Table 4, Table 5, Table 6, Table 7.