Synapse at CAp 2017 NER challenge: Fasttext CRF
We present our system for the CAp 2017 NER challenge [LPB17] which is about named entity recognition on French tweets. Our system leverages unsupervised learning on a larger dataset of French tweets to learn features feeding a CRF model. It was ranked first without using any gazetteer or structured external data, with an F-measure of 58.89%. To the best of our knowledge, it is the first system to use fasttext [BGJM16] embeddings (which include subword representations) and an embedding-based sentence representation for NER.
Keywords: Named entity recognition, fasttext, CRF, unsupervised learning, word vectors
Named-Entity Recognition (NER) is the task of detecting word segments denoting particular instances such as persons, locations or quantities. It can be used to ground knowledge available in texts. While NER can achieve near-human performance [NNN98], it is still a challenging task on noisy texts such as tweets [RCME11] with scarce labels, especially when few linguistic resources are available. Those difficulties are all present in the CAp NER challenge.
A promising approach is using unsupervised learning to get meaningful representations of words and sentences. Fasttext [BGJM16] seems a particularly useful unsupervised learning method for named entity recognition since it is based on the skipgram model which is able to capture substantive knowledge about words while incorporating morphology information, a crucial aspect for NER. We will describe three methods for using such embeddings along with a CRF sequence model, and we will also present a simple ensemble method for structured prediction (section 2). Next, we will show the performance of our model and an interpretation of its results (section 4).
Figure 1 shows an overview of our model. This section will detail the components of the system.
The core of our model is a Conditional Random Field (CRF) [SM11], a structured prediction framework widely used in NER tasks. It models the probability $p(\mathbf{y} \mid \mathbf{x})$ of a tag sequence $\mathbf{y}$ given a sequence of words $\mathbf{x}$.
We use the linear-chain CRF restriction, where the sequence is modeled through the probability of transitions between consecutive labels:

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left(\sum_{t} \mathbf{w} \cdot \mathbf{f}(y_{t-1}, y_t, \mathbf{x}, t)\right)$$

$\mathbf{f}$ yields a feature vector, $\mathbf{w}$ is a weight vector, and $Z(\mathbf{x})$ is a normalization factor ensuring a proper probability distribution. CRFs allow for non-greedy optimization when learning sequence prediction, and offer much flexibility in the definition of the feature vector $\mathbf{f}$. Furthermore, a prior can be added on the learned weights for regularization purposes. The likelihood of the training data can be optimized using gradient descent. We chose $\mathbf{f}$ to yield two sets of features that are concatenated: handcrafted features and fasttext embedding-based features.
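To make the linear-chain structure concrete: once the dot products $\mathbf{w} \cdot \mathbf{f}$ are folded into per-position emission scores and label-transition scores, the normalizer $Z(\mathbf{x})$ can be computed with the forward algorithm. Below is a minimal numpy sketch of this computation; it is illustrative, not the paper's implementation (which relies on CRFsuite).

```python
import numpy as np

def crf_sequence_score(emissions, transitions, labels):
    """Unnormalized log-score of one label sequence: sum of emission
    and transition weights along the path."""
    score = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    return score

def crf_log_partition(emissions, transitions):
    """log Z(x) via the forward algorithm (log-sum-exp over all paths)."""
    alpha = emissions[0].copy()
    for t in range(1, emissions.shape[0]):
        # alpha[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        alpha = emissions[t] + np.logaddexp.reduce(alpha[:, None] + transitions, axis=0)
    return np.logaddexp.reduce(alpha)

def crf_log_prob(emissions, transitions, labels):
    """log p(y | x) for the linear-chain CRF."""
    return crf_sequence_score(emissions, transitions, labels) - crf_log_partition(emissions, transitions)
```

Because the forward recursion sums over all label paths exactly, the probabilities of every possible sequence sum to one, which is the "non-greedy" property that distinguishes CRFs from local classifiers.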
2.2 Handcrafted features
Table 1 shows the handcrafted features we used. The context column specifies whether a feature was also computed for the adjacent words.
| Feature | Context |
|---|---|
| length 1 prefix | ✓ |
| length 2 prefix | |
| length 1 suffix | ✓ |
| length 2 suffix | |
| word uppercase proportion | |
| word uppercase proportion × word length | |
| beginning of sentence | |
| end of sentence | |
2.3 Fasttext features
Fasttext skipgram is based on the word2vec skipgram model [MCCD13], where word representations are learned by optimizing the task of predicting context words. The main difference is that the representation of a word $w$ is not only $u_w$, the representation of its symbol. It is augmented with the sum of the representations of its subword units $g \in G_w$:

$$v_w = u_w + \sum_{g \in G_w} z_g$$

$G_w$ encompasses some of the character n-grams that $w$ contains, provided they are frequent enough and of a desirable length. The morphology of $w$ is thus taken into account in the representation of $w$, even though the order of the n-grams is ignored.
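The subword decomposition can be sketched as follows. Assumptions are flagged in the comments: real fasttext pads words with boundary markers and hashes n-grams into a fixed-size table rather than keeping a lookup dict.

```python
import numpy as np

def char_ngrams(word, nmin=3, nmax=6):
    """Character n-grams of the padded word ('<' and '>' mark the word
    boundaries, as in fasttext). nmin/nmax match Table 2 (3 and 6)."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(nmin, nmax + 1)
            for i in range(len(padded) - n + 1)]

def word_vector(word, word_vecs, ngram_vecs, dim=2):
    """v_w = u_w + sum of subword-unit vectors. Unknown n-grams are
    skipped here; fasttext instead hashes them into a shared table."""
    v = word_vecs.get(word, np.zeros(dim)).copy()
    for g in char_ngrams(word):
        if g in ngram_vecs:
            v += ngram_vecs[g]
    return v
```

Sharing n-gram vectors across words is what lets fasttext produce sensible vectors for misspelled or unseen tokens, which are frequent in tweets.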
$v_w$ can directly be used as a word-level feature. However, [GCWL14] showed that CRFs work better with discrete features, so we also use a clustering-based representation. Several approaches [Ahm13, Sie15, DGG17, GCWL14] use word embeddings for named entity recognition.
2.3.1 Clustering fasttext features
We cluster the fasttext representations of the unique words in train and test tweets using a Gaussian Mixture Model (GMM), and feed the vector of probability assignments as a word-level feature to the CRF. The GMM clusters the latent space so as to maximize the likelihood of the training data, assuming it is generated by a mixture of Gaussians.
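The "vector of probability assignments" is the posterior responsibility of each mixture component for a given word vector. A numpy sketch for a diagonal-covariance mixture (in practice one would likely use a library such as scikit-learn's `GaussianMixture`, whose `predict_proba` returns the same quantity):

```python
import numpy as np

def gmm_responsibilities(X, means, variances, weights):
    """Posterior cluster-assignment probabilities under a diagonal-covariance
    Gaussian mixture -- the word-level feature vectors fed to the CRF.
    X: (N, D) word vectors; means, variances: (K, D); weights: (K,)."""
    # log N(x | mu_k, diag(var_k)) for every point/component pair -> (N, K)
    log_dens = -0.5 * (
        np.sum(np.log(2 * np.pi * variances), axis=1)          # (K,)
        + ((X[:, None, :] - means) ** 2 / variances).sum(-1)   # (N, K)
    )
    log_post = np.log(weights) + log_dens
    # normalize in log-space so each row sums to one
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    return np.exp(log_post)
```

Soft assignments degrade gracefully for words lying between clusters, unlike a hard cluster id.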
2.3.2 Sentence representation
We also use the average of word representations in a tweet as a sentence level feature. It is a simple way to provide a global context even though a linear model will not exploit this information thoroughly.
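This sentence-level feature is just the mean of the tweet's word vectors, broadcast to every token. A small sketch (the `ft_sen_` naming follows the feature names discussed in section 4.3; exposing one named feature per component is an assumption about the encoding):

```python
import numpy as np

def sentence_features(token_vectors, prefix="ft_sen_"):
    """Tweet-level context feature: the mean of the fasttext word vectors,
    exposed as one named feature per component."""
    mean = np.mean(token_vectors, axis=0)
    return {f"{prefix}{i}": float(v) for i, v in enumerate(mean)}
```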
2.4 Ensemble method
We ensemble different models using a voting rule. We train several systems, each time training a new fasttext model. This is the only variation between models, but different embeddings can influence the weights learned for the handcrafted features as well. We then select the best prediction by picking, for each tweet, the most frequent label sequence predicted across the systems.
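The voting rule operates on whole label sequences, not individual tags, so the winning prediction is always a sequence that at least one member actually produced (and therefore remains a valid IOB labeling). A minimal sketch:

```python
from collections import Counter

def vote(predictions):
    """Majority vote over whole label sequences.
    predictions[m][t] is the label sequence predicted by ensemble
    member m for tweet t; returns one sequence per tweet."""
    n_tweets = len(predictions[0])
    winners = []
    for t in range(n_tweets):
        counts = Counter(tuple(member[t]) for member in predictions)
        winners.append(list(counts.most_common(1)[0][0]))
    return winners
```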
3 Experimental settings
Train and test data come from the CAp NER 2017 challenge: French tweets labeled with 13 kinds of segments in IOB format. Further details can be found in [LPB17]. We used CRFsuite [Oka07] through its sklearn-crfsuite Python bindings (http://sklearn-crfsuite.readthedocs.io/en/latest/), which follow the sklearn API and allow for better development speed. The original implementation of fasttext [BGJM16] was used through its Python bindings (https://github.com/salestock/fastText.py).
3.1 Additional data
To learn fasttext word representations, we used tweets from the OSIRIM platform at IRIT (http://osirim.irit.fr/site/fr/articles/corpus), where a fraction of the total Twitter feed has been collected since September 2015. We picked a random subset of French tweets and dropped a portion of those containing a URL, since many of them come from bots. The remaining URLs were kept because some URLs appear in the challenge data. We replaced a share of the mentions (@someone tokens) by the symbol @*, hoping to help generalization. This preprocessed additional data totals 40M tweets.
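The preprocessing steps above can be sketched with two regular expressions. The patterns are illustrative assumptions (the paper does not specify them), and this version applies both rules to every tweet, whereas the paper only dropped and rewrote a portion of them:

```python
import re

# Assumed patterns for URLs and @-mentions in tweets
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def preprocess(tweet, drop_urls=True):
    """Return None to drop the tweet (URL heuristic against bots),
    otherwise return it with mentions replaced by the generic symbol @*."""
    if drop_urls and URL_RE.search(tweet):
        return None
    return MENTION_RE.sub("@*", tweet)
```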
3.2 Parameter selection
Parameters and feature subsets were not thoroughly optimized through cross-validation, except for the regularization parameters.
We used Elasticnet regularization [ZH05] and the L-BFGS optimization algorithm, with a maximum of 100 iterations.
We ran a grid search, using sequence-level accuracy as the metric, over $\lambda_1$ and $\lambda_2$, the regularization weights for the L1 and L2 priors, each tested over a range of candidate values.
Fasttext skipgram uses negative sampling with the parameters described in Table 2. A separate skipgram model, with a different dimension, was used for the sentence representation. For the Gaussian Mixture Model, we used diagonal covariances.
| Parameter | Value |
|---|---|
| context window size | 5 |
| number of epochs | 4 |
| negative/positive samples ratio | 5 |
| minimum n-gram size | 3 |
| maximum n-gram size | 6 |
Many clusters correspond directly to named entities. Table 3 shows a sample of 10 handpicked clusters and the themes we identified.
| Cluster theme | Cluster sample |
|---|---|
| hours | 12h00 19h19 12h 7h44 |
| dates | 1947 1940 27/09 mars Lundi |
| joyful reactions | ptdrr mdrrrrrrr pff booooooooordel |
| TPMP (French show) | #TPMP #hanouna Castaldi |
| transportation lines | @LIGNEJ_SNCF @TER_Metz |
| video games | @PokemonFR manette RT |
| football players | Ribery Leonardo Chelsea Ramos |
We report the results of the system on the evaluation data when fitted on the full training data. The ensemble of models yields a sequence-level accuracy only slightly higher than that of a single model.
The challenge scoring metric was a micro F-measure computed on chunks of consecutive labels. Our ensemble system scores 58.89% with respect to this metric, and the competition results show that our system won with a rather large margin.
Fasttext features make a notable difference: the sequence-level accuracy drops when we remove all of them.
Table 4 gives an overview of scores per label, and could show us ways to improve the system. The 13 labels were separated according to their IOB encoding status.
4.3 Interpreting model predictions
The CRF is based on a linear model, so the learned weights are insightful: the highest weights indicate the most relevant features for predicting a given label, while the lowest weights indicate the most relevant features for preventing the prediction of a given label. Tables 5 and 6 show those weights for a single model trained on all features. ft_wo_i, ft_wo_c_i and ft_sen_i refer respectively to the $i$-th component of a fasttext raw word representation, cluster-based representation, and sentence-level representation. The model actually uses all three kinds of features to predict labels. Clustering embeddings can improve the interpretability of the system by linking a feature to a set of similar words. Sentence-level embeddings seem to prevent the model from predicting irrelevant labels, suggesting they might help with disambiguation.
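Extracting such extreme weights is a one-liner once the model exposes its (feature, label) weight map. The sketch below assumes the dict layout of sklearn-crfsuite's `state_features_` attribute, which maps `(attribute, label)` pairs to weights:

```python
def top_weights(state_features, k=3):
    """Return the k largest and k smallest (feature, label) -> weight pairs,
    in the spirit of Tables 5 and 6."""
    ranked = sorted(state_features.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], ranked[-k:]
```

Inspecting both ends of the ranking matters: large negative weights (e.g. a feature vetoing B-person) explain model behavior as much as large positive ones.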
| Weight | Label | Feature |
|---|---|---|
| 3.26 | O | end of sentence |
| 2.47 | O | beginning of sentence |
| -1.29 | B-other | previous POS: verb (future) |
| -1.27 | B-person | previous word prefix: l |
4.4 Computational cost
Fitting the CRF model on 3,000 examples (labeled tweets) takes about 4 minutes on a Xeon E5-2680 v3 CPU using a single thread, and inference on 3,688 examples only needs about 30 seconds. Fitting the fasttext model of dimension 200 on 40M tweets takes about 10 hours on a single thread, but only 30 minutes when using 32 threads.
5 Conclusion and further improvements
We presented a NER system using fasttext which was ranked first at the CAp 2017 NER challenge. Due to a lack of time, we did not optimize directly on the challenge evaluation metric, using sequence-level accuracy as a proxy, and we did not cross-validate all important parameters. Besides, there are other promising ways to increase the score of the system that we did not implement:
thresholding for F1 maximization: Our system's precision is significantly higher than its recall. A more balanced score could be obtained by introducing a negative bias towards predicting no label, which might improve the F1 score. Threshold optimization works well for non-structured prediction [CEN14], but it is not clear that it would bring about improvement in practical applications.
larger scale unsupervised learning: More tweets could be used, and/or domain adaptation could be applied in order to bias embeddings towards learning representations of words occurring in the challenge data.
DBPedia spotlight [DJHM13] could provide an off-the-shelf gazetteer, yielding potentially powerful features for NER.
- [ABP16] Vinayak Athavale, Shreenivas Bharadwaj, Monik Pamecha, Ameya Prabhu, and Manish Shrivastava. Towards deep learning in hindi NER: an approach to tackle the labelled data sparsity. CoRR, abs/1610.09756, 2016.
- [Ahm13] Zia Ahmed. Named Entity Recognition and Question Answering Using Word Vectors and Clustering. 2013.
- [BGJM16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. 2016.
- [CEN14] Zachary C. Lipton, Charles Elkan, and Balakrishnan Narayanaswamy. Thresholding Classifiers to Maximize F1 Score. arXiv e-prints, 2014.
- [DGG17] Arjun Das, Debasis Ganguly, and Utpal Garain. Named Entity Recognition with Word Embeddings and Wikipedia Categories for a Low-Resource Language. ACM Transactions on Asian and Low-Resource Language Information Processing, 16(3):1–19, 2017.
- [DJHM13] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N Mendes. Improving Efficiency and Accuracy in Multilingual Entity Extraction. Proceedings of the 9th International Conference on Semantic Systems, pages 121–124, 2013.
- [GCWL14] Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. Revisiting Embedding Features for Simple Semi-supervised Learning. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), (2005):110–120, 2014.
- [LC] Nut Limsopatham and Nigel Collier. Proceedings of the 2nd Workshop on Noisy User-generated Text.
- [LPB17] Cédric Lopez, Ioannis Partalas, Georgios Balikas, Nadia Derbas, Amélie Martin, Frédérique Segond, Coralie Reutenauer, and Massih-Reza Amini. French Named Entity Recognition in Twitter Challenge. Technical report, 2017.
- [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. Nips, pages 1–9, 2013.
- [NNN98] Elaine Marsh, Dennis Perzanowski, and Ralph Grishman. MUC-7 Evaluation of IE Technology: Overview of Results. MUC-7 Program Committee, 1998.
- [Oka07] Naoaki Okazaki. CRFsuite: a fast implementation of Conditional Random Fields (CRFs), 2007.
- [RCME11] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named entity recognition in tweets: An experimental study. In EMNLP, 2011.
- [Sch94] Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, 1994.
- [Sie15] Scharolta Katharina Sien. Adapting word2vec to Named Entity Recognition. Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), (Nodalida):239–243, 2015.
- [SM11] Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2011.
- [ZH05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic-net. Journal of the Royal Statistical Society, 67(2):301–320, 2005.