Synapse at CAp 2017 NER challenge: Fasttext CRF

Damien Sileo, Camille Pradel, Philippe Muller, Tim Van de Cruys
Abstract

We present our system for the CAp 2017 NER challenge [LPB17] which is about named entity recognition on French tweets. Our system leverages unsupervised learning on a larger dataset of French tweets to learn features feeding a CRF model. It was ranked first without using any gazetteer or structured external data, with an F-measure of 58.89%. To the best of our knowledge, it is the first system to use fasttext [BGJM16] embeddings (which include subword representations) and an embedding-based sentence representation for NER.

Keywords: Named entity recognition, fasttext, CRF, unsupervised learning, word vectors

1 Introduction

Named-Entity Recognition (NER) is the task of detecting word segments denoting particular instances such as persons, locations or quantities. It can be used to ground knowledge available in texts. While NER can achieve near-human performance [NNN98], it is still a challenging task on noisy texts such as tweets [RCME11], especially when labels are scarce and few linguistic resources are available. All of these difficulties are present in the CAp NER challenge.

A promising approach is using unsupervised learning to get meaningful representations of words and sentences. Fasttext [BGJM16] seems a particularly useful unsupervised learning method for named entity recognition since it is based on the skipgram model which is able to capture substantive knowledge about words while incorporating morphology information, a crucial aspect for NER. We will describe three methods for using such embeddings along with a CRF sequence model, and we will also present a simple ensemble method for structured prediction (section 2). Next, we will show the performance of our model and an interpretation of its results (section 4).

2 Model

Figure 1 shows an overview of our model. This section will detail the components of the system.

Figure 1: Overview of our system

2.1 CRF

The core of our model is a Conditional Random Field (CRF) [SM11], a structured prediction framework widely used in NER tasks. It models the probability of a tag sequence $y = (y_1, \dots, y_n)$ given a sequence of words $x = (x_1, \dots, x_n)$.

We use the linear-chain CRF restriction, where sequences are modeled through the probabilities of transitions between consecutive labels only:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{n} w \cdot \phi(y_{t-1}, y_t, x, t) \Big) \qquad (1)$$

$\phi$ yields a feature vector, $w$ is a weight vector, and $Z(x)$ is a normalization factor ensuring a proper probability distribution. CRFs allow for non-greedy optimization when learning sequence prediction, and provide much flexibility in defining the feature vector $\phi$. Furthermore, a prior can be placed on the learned weights for regularization purposes. The likelihood of the training data can be optimized using gradient descent. We chose a $\phi$ that yields two sets of features, which are concatenated: handcrafted features and fasttext embedding-based features.

2.2 Handcrafted features

Table 1 shows the handcrafted features we used. The context column specifies whether a feature was also computed for the adjacent words.

feature context
word (lowercased)
word length
length 1 prefix
length 2 prefix
length 1 suffix
length 2 suffix
is_upper
is_title
position
word uppercase proportion
word uppercase proportion*word length
is_emoji
hyphenation
POS tag
is_quote
beginning of sentence
end of sentence
Table 1: word-level handcrafted features

The emoji library (https://pypi.python.org/pypi/emoji/) was used for emoji detection, and we used the TreeTagger [Sch94] POS tagger.
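
As an illustration, such features can be packaged into the per-token dictionaries that sklearn-crfsuite expects. The sketch below is our own illustrative encoding, not the exact implementation; it assumes TreeTagger output is available as a pos_tags list, and the feature names are our own.

```python
import emoji  # https://pypi.python.org/pypi/emoji/

def word_features(tokens, i, pos_tags):
    """Handcrafted features for tokens[i], in the dict format used by
    sklearn-crfsuite. Feature names are illustrative."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "word.length": len(w),
        "prefix1": w[:1],
        "prefix2": w[:2],
        "suffix1": w[-1:],
        "suffix2": w[-2:],
        "is_upper": w.isupper(),
        "is_title": w.istitle(),
        "position": i,
        "upper_prop": sum(c.isupper() for c in w) / len(w),
        "upper_prop_x_len": sum(c.isupper() for c in w),  # proportion * length
        "is_emoji": bool(emoji.emoji_count(w)),
        "has_hyphen": "-" in w,
        "pos": pos_tags[i],  # TreeTagger POS tag
        "is_quote": w in {"'", '"', "«", "»"},
        "BOS": i == 0,
        "EOS": i == len(tokens) - 1,
    }
```

Context versions of some of these features (see Table 1) would additionally be drawn from tokens[i-1] and tokens[i+1].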

2.3 Fasttext features

Fasttext skipgram is based on the word2vec skipgram model [MCCD13], where word representations are learned to optimize a task of predicting context words. The main difference is that the representation of a word $w$ is not only $u_w$, the representation of its symbol: it is augmented with the sum of the representations of its subword units $G_w$:

$$v_w = u_w + \sum_{g \in G_w} z_g \qquad (2)$$

$G_w$ encompasses some of the character n-grams that $w$ contains, provided they are frequent enough and of a desirable length. The morphology of $w$ is thus taken into account in the representation $v_w$, even though the order of the n-grams is ignored.

$v_w$ can be used directly as a word-level feature. However, [GCWL14] showed that CRFs work better with discrete features, so we also use a clustering-based representation. Several approaches [Ahm13, Sie15, DGG17, GCWL14] use word embeddings for named entity recognition.

2.3.1 Clustering fasttext features

We cluster the fasttext representations of the unique words occurring in the train and test tweets using a Gaussian Mixture Model (GMM), and feed the vector of cluster assignment probabilities to the CRF as a word-level feature. GMM clusters the latent space so as to maximize the likelihood of the training data, under the assumption that it is generated by a mixture of Gaussians.
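
A minimal sketch of this step with scikit-learn's GaussianMixture; the component count and variable names are illustrative assumptions, while the diagonal covariance follows section 3.2.

```python
from sklearn.mixture import GaussianMixture

# word_vectors: (n_unique_words, d) array of fasttext embeddings for the
# unique words of the train and test tweets (illustrative names).
gmm = GaussianMixture(n_components=50,         # assumption: component count
                      covariance_type="diag",  # diagonal covariance (sec. 3.2)
                      random_state=0).fit(word_vectors)

# Posterior assignment probabilities, fed to the CRF as word-level features.
probas = gmm.predict_proba(word_vectors)       # (n_unique_words, 50)
cluster_feature = dict(zip(unique_words, probas))
```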

2.3.2 Sentence representation

We also use the average of the word representations in a tweet as a sentence-level feature. It is a simple way to provide global context, even though a linear model will not exploit this information thoroughly.
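
A minimal sketch, assuming model is a trained fasttext model:

```python
import numpy as np

def sentence_vector(tokens, model):
    """Tweet-level feature: average of the fasttext word vectors.
    Subword units give a vector even for unseen or misspelled words."""
    return np.mean([model.get_word_vector(t) for t in tokens], axis=0)
```

The resulting vector is appended to the feature set of every token in the tweet.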

2.4 Ensemble method

We ensemble different models using a voting rule. We train several systems, each with a newly trained fasttext model. This is the only variation between the models, but different embeddings can influence the weights learned for the handcrafted features. We then select the most frequent labeling sequence predicted for each tweet across the systems.
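
Since whole label sequences, not individual tags, are the voting unit, the rule can be sketched as follows (illustrative code, not the exact implementation):

```python
from collections import Counter

def vote(predictions):
    """Pick the most frequent label sequence predicted for one tweet.
    predictions: one label sequence per ensembled system."""
    counts = Counter(tuple(seq) for seq in predictions)
    return list(counts.most_common(1)[0][0])

# vote([["O", "B-geoloc"], ["O", "B-geoloc"], ["O", "O"]])
# -> ["O", "B-geoloc"]
```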

3 Experimental settings

Train and test data come from the CAp NER 2017 challenge; the data consists of French tweets labeled with 13 kinds of segments in IOB format. Further details can be found in [LPB17]. We used CRFsuite [Oka07] through its sklearn-crfsuite Python bindings (http://sklearn-crfsuite.readthedocs.io/en/latest/), which follow the sklearn API and allow for better development speed. The original implementation of fasttext [BGJM16] was used through its Python bindings (https://github.com/salestock/fastText.py).
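
A minimal sketch of the resulting training code; the c1/c2 values are placeholders for the tuned weights of section 3.2, and all_possible_transitions is our assumption rather than a stated setting.

```python
import sklearn_crfsuite

# X_*: lists of tweets, each a list of per-token feature dicts (section 2);
# y_*: the corresponding IOB label sequences.
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",              # L-BFGS, as in section 3.2
    c1=0.1, c2=0.1,                 # placeholders: tuned values not shown
    max_iterations=100,
    all_possible_transitions=True,  # assumption
)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)
```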

3.1 Additional data

To learn fasttext word representations, we used tweets from the OSIRIM platform at IRIT (http://osirim.irit.fr/site/fr/articles/corpus), where a fraction of the total Twitter feed has been collected since September 2015. We picked a random subset of French tweets and dropped part of the tweets containing a URL, since many of them come from bots. The remaining URLs were kept because some URLs appear in the challenge data. We replaced a portion of the mentions (@someone tokens) with the symbol @*, hoping to help generalization. This preprocessed additional data totals 40M tweets.
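
The mention replacement can be sketched as below; note that the actual pipeline replaced only a portion of the mentions, and the upstream URL-based filtering is omitted here.

```python
import re

MENTION = re.compile(r"@\w+")

def preprocess(tweet: str) -> str:
    """Replace @-mentions with the generic symbol @*."""
    return MENTION.sub("@*", tweet)

# preprocess("RT @someone gare de Lyon 12h00") -> "RT @* gare de Lyon 12h00"
```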

3.2 Parameter selection

Parameters and feature subsets were not thoroughly optimized through cross-validation, except for the regularization parameters. We used Elastic Net regularization [ZH05] and the L-BFGS optimization algorithm, with a maximum of 100 iterations.
We ran a grid search, using sequence-level accuracy as the metric, over the regularization weights of the L1 and L2 priors, and kept the best-performing combination.
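
Such a search can be sketched by wrapping the CRF in a scikit-learn grid search, as the sklearn-crfsuite API allows; the grids below are illustrative, since the exact ranges are not reproduced here.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
import sklearn_crfsuite

def sequence_accuracy(y_true, y_pred):
    """Fraction of tweets whose full label sequence is exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

search = GridSearchCV(
    sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100),
    param_grid={"c1": [0.01, 0.1, 1.0],   # illustrative L1 grid
                "c2": [0.01, 0.1, 1.0]},  # illustrative L2 grid
    scoring=make_scorer(sequence_accuracy),
)
search.fit(X_train, y_train)
```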

Fasttext skipgram uses negative sampling, with the parameters described in Table 2. A separate skipgram model, with a different dimension, was used for the sentence representation. For the Gaussian Mixture Model, we used diagonal covariances.

parameter value
learning rate 0.02
dimension 200
context window size 5
number of epochs 4
min_count 5
negative/positive samples ratio 5
minimum n-gram size 3
maximum n-gram size 6
sampling threshold
Table 2: Fasttext parameters
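
For reference, training with these parameters can be sketched using the current official fasttext package (the paper used the fastText.py bindings, whose API differs slightly; the input path is hypothetical).

```python
import fasttext

model = fasttext.train_unsupervised(
    "french_tweets.txt",  # hypothetical path to the 40M preprocessed tweets
    model="skipgram",
    lr=0.02, dim=200, ws=5, epoch=4,
    minCount=5, neg=5, minn=3, maxn=6,
)

# Subword units yield vectors even for noisy, unseen tokens:
model.get_word_vector("mdrrrrrrr")
```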

4 Results

4.1 Clustering

Many clusters correspond directly to named entities. Table 3 shows a random sample of 10 handpicked clusters and the themes we identified.

Cluster theme Cluster sample
hyperlinks https://t.co/d73eViSrbW
hours 12h00 19h19 12h 7h44
dates 1947 1940 27/09 mars Lundi
joyful reactions ptdrr mdrrrrrrr pff booooooooordel
TPMP (french show) #TPMP #hanouna Castaldi
transportation lines @LIGNEJ_SNCF @TER_Metz
emojis Pfffff :)
video games @PokemonFR manette RT
persons @olivierminne @Vibrationradio
football players Ribery Leonardo Chelsea Ramos
Table 3: Handpicked clusters and random samples

4.2 Performance

We report the results of the system on the evaluation data when fitted on the full training data. Using an ensemble of models, the system reaches its best sequence-level accuracy; note that a single model achieves a sequence-level accuracy that is only slightly lower.

The challenge scoring metric was a micro F-measure computed over chunks of consecutive labels. Our ensemble system scores 58.89% with respect to this metric, winning the competition with a rather large margin. Fasttext features make a notable difference, since the sequence-level accuracy drops when we remove all of them. Table 4 gives an overview of the scores per label and suggests ways to improve the system. The 13 labels are separated according to their IOB encoding status.

label precision recall f1-score support
B-person 0.767 0.618 0.684 842
I-person 0.795 0.833 0.814 294
B-geoloc 0.757 0.697 0.726 699
B-transportLine 0.978 0.926 0.951 517
B-musicartist 0.667 0.178 0.281 90
B-other 0.286 0.134 0.183 149
B-org 0.712 0.277 0.399 545
B-product 0.519 0.135 0.214 312
I-product 0.320 0.113 0.167 364
B-media 0.724 0.462 0.564 210
B-facility 0.639 0.363 0.463 146
I-facility 0.620 0.486 0.545 175
B-sportsteam 0.514 0.277 0.360 65
I-sportsteam 1.000 0.200 0.333 10
B-event 0.436 0.185 0.260 92
I-event 0.356 0.292 0.321 89
B-tvshow 0.429 0.058 0.102 52
I-tvshow 0.286 0.065 0.105 31
I-media 0.200 0.019 0.035 52
B-movie 0.333 0.045 0.080 44
I-other 0.000 0.000 0.000 73
I-transportLine 0.873 0.729 0.795 85
I-geoloc 0.650 0.409 0.502 159
I-musicartist 0.636 0.163 0.259 43
I-movie 0.250 0.049 0.082 41
Table 4: Fine grained score analysis

4.3 Interpreting model predictions

CRF is based on a linear model, so the learned weights are insightful: the highest weights indicate the most relevant features for predicting a given label, while the lowest weights indicate the most relevant features for preventing the prediction of a given label. Tables 5 and 6 show those weights for a single model trained on all features. ft_wo_i, ft_wo_c_i and ft_sen_i refer respectively to the i-th component of a fasttext raw word representation, cluster-based representation, and sentence-level representation. The model actually uses all three kinds of features to predict labels. Clustering embeddings can improve the interpretability of the system by linking a feature to a set of similar words. Sentence-level embeddings seem to prevent the model from predicting irrelevant labels, suggesting they might help for disambiguation.
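
In sklearn-crfsuite, these weights are exposed through the fitted model's state_features_ attribute, so tables like 5 and 6 can be reproduced in a few lines (a sketch, assuming crf is the fitted model):

```python
from collections import Counter

# state_features_ maps (feature name, label) pairs to learned weights.
weights = Counter(crf.state_features_)
for (feat, label), w in weights.most_common(10):          # highest weights
    print(f"{w:+.2f}  {label:15s}  {feat}")
for (feat, label), w in weights.most_common()[:-11:-1]:   # lowest weights
    print(f"{w:+.2f}  {label:15s}  {feat}")
```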

weight label feature
3.26 O end of sentence
2.47 O beginning of sentence
2.01 O previous word:rt
1.92 B-transportLine ft_wo_91
1.85 B-other previous word:les
1.80 B-geoloc previous word:#qml
1.76 B-geoloc previous word:pour
1.71 B-geoloc ft_sen_22
1.71 O ft_wo_c68
1.68 B-org current word:#ratp
Table 5: Highest weights
weight label feature
-1.65 B-product ft_sen_33
-1.60 B-org ft_sen_9
-1.48 O previous word:sur
-1.41 B-facility ft_sen_33
-1.40 O suffix:lie
-1.38 O suffix:ra
-1.29 B-other previous POS: verb (future)
-1.29 B-geoloc ft_wo_151
-1.27 B-person previous word prefix:l
-1.26 B-org ft_wo_130
Table 6: Lowest weights

4.4 Computational cost

Fitting the CRF model on 3,000 examples (labeled tweets) takes 4 minutes on a Xeon E5-2680 v3 CPU using a single thread, and inference on 3,688 examples only needs 30 seconds. Fitting the fasttext model of dimension 200 on 40M tweets takes 10 hours on a single thread, but only 30 minutes when using 32 threads.

5 Conclusion and further improvements

We presented a NER system using fasttext which was ranked first at the CAp 2017 NER challenge. Due to a lack of time, we did not optimize directly on the challenge evaluation metric, using sequence-level accuracy as a proxy instead, and we did not cross-validate all important parameters. Besides, there are other promising ways to increase the score of the system that we did not implement:

  1. thresholding for F1 maximization: our system's precision is significantly higher than its recall. A more balanced score could be obtained by introducing a negative bias towards predicting no label, which might improve the F1 score. Threshold optimization works well for non-structured prediction [CEN14], but it is not clear that it would bring about improvement in practical applications.

  2. larger scale unsupervised learning: More tweets could be used, and/or domain adaptation could be applied in order to bias embeddings towards learning representations of words occurring in the challenge data.

  3. RNN embeddings: unsupervised learning with recurrent neural networks can be used to learn "contextualized" embeddings of words, with training tasks such as language modeling or auto-encoding. RNNs have been used in NER without unsupervised training [ABP16, LC].

  4. DBPedia spotlight [DJHM13] could provide an off-the-shelf gazetteer, yielding potentially powerful features for NER.

References

  • [ABP16] Vinayak Athavale, Shreenivas Bharadwaj, Monik Pamecha, Ameya Prabhu, and Manish Shrivastava. Towards deep learning in hindi NER: an approach to tackle the labelled data sparsity. CoRR, abs/1610.09756, 2016.
  • [Ahm13] Zia Ahmed. Named Entity Recognition and Question Answering Using Word Vectors and Clustering. 2013.
  • [BGJM16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. 2016.
  • [CEN14] Zachary C. Lipton, Charles Elkan, and Balakrishnan Narayanaswamy. Thresholding Classifiers to Maximize F1 Score. ArXiv e-prints, 2014.
  • [DGG17] Arjun Das, Debasis Ganguly, and Utpal Garain. Named Entity Recognition with Word Embeddings and Wikipedia Categories for a Low-Resource Language. ACM Transactions on Asian and Low-Resource Language Information Processing, 16(3):1–19, 2017.
  • [DJHM13] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N Mendes. Improving Efficiency and Accuracy in Multilingual Entity Extraction. Proceedings of the 9th International Conference on Semantic Systems, pages 121–124, 2013.
  • [GCWL14] Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. Revisiting Embedding Features for Simple Semi-supervised Learning. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), (2005):110–120, 2014.
  • [LC] Nut Limsopatham and Nigel Collier. Bidirectional LSTM for Named Entity Recognition in Twitter Messages. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), 2016.
  • [LPB17] Cédric Lopez, Ioannis Partalas, Georgios Balikas, Nadia Derbas, Amélie Martin, Frédérique Segond, Coralie Reutenauer, and Massih-Reza Amini. French Named Entity Recognition in Twitter Challenge. Technical report, 2017.
  • [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pages 1–9, 2013.
  • [NNN98] Elaine Marsh and Dennis Perzanowski. MUC-7 Evaluation of IE Technology: Overview of Results. In Proceedings of the 7th Message Understanding Conference (MUC-7), 1998.
  • [Oka07] Naoaki Okazaki. CRFsuite: a fast implementation of Conditional Random Fields (CRFs), 2007.
  • [RCME11] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named entity recognition in tweets: An experimental study. In EMNLP, 2011.
  • [Sch94] Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, 1994.
  • [Sie15] Scharolta Katharina Sien. Adapting word2vec to Named Entity Recognition. Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), (Nodalida):239–243, 2015.
  • [SM11] Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2011.
  • [ZH05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic-net. Journal of the Royal Statistical Society, 67(2):301–320, 2005.