Confidence penalty, annealing Gaussian noise and zoneout for biLSTM-CRF networks for named entity recognition
Named entity recognition (NER) is used to identify relevant entities in text. A bidirectional LSTM (long short term memory) encoder with a neural conditional random field (CRF) decoder (biLSTM-CRF) is the state of the art method. In this work, we analyze several methods intended to optimize the performance of networks based on this architecture, some of which also help to avoid overfitting. These methods target exploration of parameter space, regularization of LSTMs and penalization of confident output distributions. Results show that the optimization methods improve the performance of the biLSTM-CRF NER baseline system, setting a new state of the art performance for the CoNLL-2003 Shared Task Spanish set with an F1 of 87.18.
Named entity recognition (NER) identifies relevant entities of interest in text, such as people or locations. Current state of the art results are achieved by methods based on variants of a bidirectional long short term memory (biLSTM) encoder and a neural linear-chain conditional random field (CRF) decoder (biLSTM-CRF), which are used in a variety of natural language processing tasks. Even though these methods are the state of the art, further research into techniques that could improve the performance of these networks would be beneficial. Training these networks is not trivial [12], but several aspects could be considered, such as regularization methods or optimizing the system with a loss function that approximates the target measure (e.g. F1).
We investigate recently proposed methods for the optimization of deep neural networks, some of which are specific to LSTMs. These methods address different aspects of biLSTM-CRF systems and do not impose a burden on the training process.
The first method adds gradient noise [11] to encourage active exploration of parameter space. The second method is zoneout [7], which regularizes LSTM nodes. Finally, since NER systems are typically trained to optimize a loss function, we explore penalizing confident output distributions [13] in the loss function.
Results on the CoNLL-2003 Shared Task set show a performance gain for English, achieving state of the art results, and a significant improvement for Spanish, setting a new state of the art for NER in Spanish on this set.
In this section, we introduce our baseline method for named entity recognition and describe the methods used to optimize its training. Then, the data sets used in the experiment and the word embedding and language modeling are presented.
2.1 BiLSTM-CRF NER base system
Our system is composed of a bidirectional LSTM encoder and a decoder that uses a neural linear-chain CRF [8, 1]. In the encoder, we use a stack of three bidirectional LSTMs with residual connections. The input to the system includes pretrained word embeddings and character-level word embeddings generated by a bidirectional LSTM.
The probability of a tag sequence y given a word sequence X is calculated using a softmax over all the possible tag sequences for X, as shown in equation 1.
The log-likelihood of a predicted sequence is calculated as indicated in equation 2. During decoding, the Viterbi algorithm is used to identify the sequence with the highest probability.
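The two computations described above can be sketched for a linear-chain CRF as follows. This is a minimal illustration, assuming (as in Lample et al.) that the score s(X, y) of a tag sequence is the sum of per-token emission scores and tag-to-tag transition scores. The function names and the toy scores are hypothetical and not part of the system described here, and start/end transitions are omitted for brevity.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sequence_score(emissions, transitions, tags):
    """s(X, y): emission scores plus transition scores along the tag path."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """log of the softmax denominator over all tag sequences (forward algorithm)."""
    n_tags = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + emissions[t][j]
            for j in range(n_tags)
        ]
    return logsumexp(alpha)


def log_likelihood(emissions, transitions, tags):
    """log p(y | X) = s(X, y) - log sum over y' of exp(s(X, y'))."""
    return sequence_score(emissions, transitions, tags) - log_partition(
        emissions, transitions
    )


def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence."""
    n_tags = len(transitions)
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        step, new_best = [], []
        for j in range(n_tags):
            i_best = max(range(n_tags), key=lambda i: best[i] + transitions[i][j])
            step.append(i_best)
            new_best.append(best[i_best] + transitions[i_best][j] + emissions[t][j])
        backpointers.append(step)
        best = new_best
    path = [max(range(n_tags), key=lambda j: best[j])]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    return path[::-1]


# Toy example: 3 tokens, 2 tags (all scores are made up).
emissions = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]
transitions = [[0.5, -0.2], [-0.4, 0.9]]
best_path = viterbi(emissions, transitions)  # [1, 1, 1]
```

The forward recursion avoids enumerating every tag sequence, which is what makes the softmax over all sequences tractable during training.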
2.2 Regularization by penalizing confident output distributions
When training deep neural networks, models may become overly confident in their predictions on the training set. Penalizing confident output distributions has been proposed for classification tasks and might be beneficial to reduce the risk of overfitting.
An entropy based confidence penalty derived from a classification trained model has been proposed by [13], as shown in equation 3, where x is the data instance, y_i is each one of the classes and p_θ(y_i|x) defines the probability of class y_i given x.
The confidence penalty can be combined linearly with the loss function, as shown in equation 4, using the hyperparameter β, which sets the importance of the penalty.
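As a reference point, the entropy and the penalized loss can be written as in the confidence penalty paper by Pereyra et al. The notation below is taken from that paper and is assumed, not guaranteed, to match the equations of this work.

```latex
% Entropy of the output distribution (confidence penalty term)
H\big(p_\theta(y \mid x)\big) = -\sum_i p_\theta(y_i \mid x) \, \log p_\theta(y_i \mid x)

% Loss with the confidence penalty weighted by \beta
L(\theta) = -\sum \log p_\theta(y \mid x) - \beta \, H\big(p_\theta(y \mid x)\big)
```

A peaked (confident) output distribution has low entropy, so subtracting the entropy term from the loss rewards less confident predictions.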
In our work, there is one correct sequence y for a word sequence X, but there is a large number of incorrect ones. The probabilities of these incorrect sequences would need to be estimated, which might be costly. We have simplified the entropy calculation to use only the entropy of the correct sequence, which is combined with the loss to generate the penalized loss, using the hyperparameter β to control the importance of the penalty. The penalty values evaluated in this work follow [13]. Results show that this simplification is still effective for our problem.
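One possible reading of this simplification is sketched below. It assumes that the entropy is approximated by the single term -p log p of the correct sequence and that this term is subtracted from the negative log-likelihood with weight beta. The function name and this exact form are assumptions made for illustration.

```python
import math


def penalized_loss(log_p_correct, beta):
    """Negative log-likelihood minus beta times the entropy term of the correct sequence.

    log_p_correct: log-probability of the correct tag sequence (<= 0).
    beta: weight of the confidence penalty (hypothetical parameter name).
    """
    p = math.exp(log_p_correct)
    entropy_term = -p * log_p_correct  # -p log p, non-negative since log p <= 0
    negative_log_likelihood = -log_p_correct
    return negative_log_likelihood - beta * entropy_term
```

With beta set to 0 the function reduces to the usual negative log-likelihood, so the penalty can be switched off without changing the rest of the training code.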
2.3 Gradient noise
Exploration of methods to robustly optimize neural network models is a recurrent research topic. While there is a tradition of using noise to train classical neural networks, its impact on novel neural network architectures requires further exploration. We investigate adding annealed Gaussian noise to the gradient [11], which encourages active exploration of parameter space. This technique is straightforward to implement in many systems and, as shown in the results, it can be effective in some cases.
Higher gradient noise at the beginning pushes the gradient away from 0 in the early stages of training. The noise decreases over time, at a rate controlled by an annealing parameter.
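The annealing can be sketched as follows, assuming the schedule of Neelakantan et al., in which the noise variance at training step t is eta / (1 + t)^gamma. The names eta and gamma and the default gamma = 0.55 are taken from that paper and are assumptions with respect to this work; the function names are hypothetical.

```python
import random


def noise_std(step, eta, gamma=0.55):
    """Standard deviation of the gradient noise at a given training step."""
    variance = eta / (1.0 + step) ** gamma
    return variance ** 0.5


def add_gradient_noise(gradients, step, eta, gamma=0.55, rng=random):
    """Return a copy of the gradients with annealed zero-mean Gaussian noise added."""
    std = noise_std(step, eta, gamma)
    return [g + rng.gauss(0.0, std) for g in gradients]


# Example: the same gradient perturbed early and late in training.
rng = random.Random(0)
early = add_gradient_noise([0.5, -1.0], step=0, eta=0.3, rng=rng)
late = add_gradient_noise([0.5, -1.0], step=10000, eta=0.3, rng=rng)
```

The noise is applied to the gradient before the optimizer update, so it composes with any optimizer without changes to the model itself.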
2.4 Zoneout
We have considered zoneout [7] for LSTMs, which uses random noise to train a pseudo-ensemble in recurrent neural networks.
In an LSTM [4], at each timestep t, the hidden state is divided into a memory vector c_t and a hidden vector h_t. The formulation of LSTMs includes a set of gates that control the flow of information. These gates include an input gate i_t, an output gate o_t and a forget gate f_t, computed over the previous hidden units h_{t-1} and the data entry x_t. A set of weight matrices and bias terms are learnt during training.
Zoneout connects the previous time step information from c_{t-1} and h_{t-1} with the current c_t and h_t. d_t^c and d_t^h are masks generated at each timestep using a binomial distribution with probabilities z_c and z_h, with values between 0 and 1, for c_t and h_t respectively.
z_c and z_h are the parameters used to define the probability for mask generation for c_t and h_t. The values used in our work are those used in [7]. Equations 12-15 show the implementation of zoneout for LSTM. It uses random noise to train a pseudo-ensemble, as in dropout, but as it keeps hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks.
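The effect of the masks can be illustrated with a small sketch of the zoneout update for one state vector. It follows the formulation of Krueger et al., where a mask entry of 1 keeps the value from the previous timestep and an entry of 0 takes the newly computed value. The function names are hypothetical, and plain lists stand in for the state vectors.

```python
import random


def sample_mask(size, zoneout_prob, rng):
    """Bernoulli mask: 1.0 (keep previous value) with probability zoneout_prob."""
    return [1.0 if rng.random() < zoneout_prob else 0.0 for _ in range(size)]


def zoneout(previous, candidate, zoneout_prob, rng=None, training=True):
    """Mix the previous state and the candidate state element-wise.

    During training each unit randomly keeps its previous value.
    At test time the expected value of the mask is used, as in dropout.
    """
    if training:
        mask = sample_mask(len(previous), zoneout_prob, rng or random)
    else:
        mask = [zoneout_prob] * len(previous)
    return [d * p + (1.0 - d) * c for d, p, c in zip(mask, previous, candidate)]


# Example: memory vector update with a zoneout probability of 0.5 (made-up value).
c_prev = [1.0, 2.0, 3.0]
c_candidate = [0.1, 0.2, 0.3]
c_next = zoneout(c_prev, c_candidate, 0.5, rng=random.Random(0))
```

The same function would be applied separately to the memory vector and to the hidden vector, each with its own probability.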
2.5 Data sets
We have prepared and evaluated the proposed methods on the English and Spanish sets of the CoNLL-2003 Shared Task Named Entity Recognition set (http://www.cnts.ua.ac.be/conll2003/ner). We have followed the training, development and test set configuration of the CoNLL-2003 Shared Task set. The Spanish dataset has 8323/1915/1517 sentences in the train/dev/test sets respectively. The English dataset is almost twice as large, with 14041/3250/3453 sentences in the train/dev/test sets. For all of our models, the word-embedding size is set to 100 for English and 64 for Spanish.
2.6 Word embedding
The English word embedding was obtained from Word2vec-api (https://github.com/3Top/word2vec-api/blob/master/README.md). The embedding dimension is 100 and it was trained using GloVe with AdaGrad.
2.7 Language Modeling
We have used both forward and backward language models (LM) as additional input for our system. Language models have been successfully used in similar tasks previously [14, 1]. The English forward language model was obtained from https://github.com/tensorflow/models/tree/master/lm_1b; it was trained on the One billion word benchmark (https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark) [2] and has a perplexity of 30. The backward English language model and the Spanish forward and backward ones were generated using an LSTM based baseline (https://github.com/rafaljozefowicz/lm) [6]. This code estimates a forward language model and was adapted to also estimate a backward language model. Language models were estimated using the One billion word benchmark. The vocabulary for the backward English model is the same as that of the pregenerated forward model. The perplexity of the estimated backward English language model is 46. The vocabulary for the Spanish language models has been generated using tokens with frequency 2. The perplexities of the forward and backward Spanish language models are 56 and 57 respectively.
We present results on both English and Spanish sets using the CoNLL-2003 Shared Task NER set. The training set has been used to train the system with several hyperparameter configurations, the development set has been used to select the best configuration, and the reported performance of the final system is based on the test set.
For all of our models, the word-embedding size is set to 100 for English and 64 for Spanish. The hidden vector size is 100 for both English and Spanish sets without the LM embeddings. With the LM embeddings, the hidden vector size is changed to 300. We trained the model with Stochastic Gradient Descent with momentum, with a learning rate of 0.005. Statistical significance has been determined using a randomization version of the paired sample t-test [3].
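A randomization test over paired scores can be sketched as follows. This is a simplified illustration that assumes one score per test item for each system and uses the mean paired difference as the test statistic; the actual statistic and the unit of pairing used in the experiments may differ. The function name and parameters are hypothetical.

```python
import random


def paired_randomization_test(scores_a, scores_b, n_rounds=10000, seed=0):
    """Estimate a two-sided p-value for the mean paired difference.

    In each round, the sign of every paired difference is flipped with
    probability 0.5, which corresponds to randomly swapping the two
    systems' scores within each pair.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_rounds):
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(shuffled) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (extreme + 1) / (n_rounds + 1)
```

A small p-value indicates that a difference as large as the observed one rarely arises when the system labels are exchanged at random.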
F1 results are shown in table 1. The baseline system is the biLSTM-CRF method. The confidence penalty significantly improves the baseline performance on both the English and Spanish sets for specific values of the penalty hyperparameter. Adding noise to the gradients yields a non-significant improvement for English, except for a specific noise setting. A similar performance increase is observed in Spanish. Zoneout significantly improves results on the Spanish set, while the performance increases are not significant for the English set. When combining penalty, noise and zoneout, the increase in performance for Spanish is significant, setting a new state of the art result for the Spanish set with an F1 of 87.18.
Overall, the proposed methods improve over the baseline system. The combination of the proposed methods in the CoNLL-2003 Shared Task set for Spanish sets a new state of the art result that significantly improves over previous results.
Adding a penalty to the loss function seems to be the most relevant method for improving on the English set, and it seems to improve the performance on the Spanish set as well. Adding noise does not have such a strong impact, but it is still able to provide an improvement on both sets. Zoneout yields the strongest performance improvement on the Spanish set, even though the improvement on the English set is not as significant.
For the English set, the performance improves, but in most cases the improvement is not as significant as the one obtained on the Spanish set. The training set for Spanish is smaller, and this could explain the larger improvement obtained by the proposed methods.
There are some configurations in which the results do not change significantly with respect to the baseline result. Examples are configurations in which the level of noise or penalty is high enough to prevent finding a better trained configuration of the model.
Using forward and backward language models improves the performance on the English set but decreases the performance on the Spanish set, as seen in previous work. With the modifications proposed in this work, the results using language models improve for both English and Spanish; again, the improvements are more significant for the Spanish set. Compared to previous work, the best performance on the English set was obtained by [14] with an F1 of 91.93, comparable to our result with an F1 of 91.96.
On the Spanish set, the previous state of the art result was obtained with the biLSTM-CRF system with residual connections and trainable bias decoding, with an F1 of 86.31. The modifications presented in our work improve this result by a significant margin, with a performance of 87.18, setting a new state of the art result.
5 Conclusions and Future Work
We have presented a set of methods that help improve the training of biLSTM-CRF systems applied to named entity recognition. Our initial investigation shows that these methods improve the baseline system on the CoNLL-2003 Shared Task set and, in the case of the Spanish set, provide a new state of the art result with an F1 of 87.18.
The optimization methods presented in this work are not specific to named entity recognition, and they might be applied to similar network architectures for different tasks or to more complex networks for named entity recognition. Additional regularization and optimization methods could be considered, as shown in [10].
These networks typically optimize a loss function, but they are evaluated using a different measure, such as F1. Previous work has tried to use a bias in decoding after training the system. We would like to explore ways in which the target evaluation measure might be better integrated into the training.
- [1] Authors-reference. Authors reference. Authors reference, 2017.
- [2] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
- [3] Paul R Cohen. Empirical methods for artificial intelligence. IEEE Intelligent Systems, (6):88, 1996.
- [4] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [5] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991, 2015.
- [6] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
- [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron Courville, et al. Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.
- [8] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270, 2016.
- [9] Fei Liu, Timothy Baldwin, and Trevor Cohn. Capturing long-range contextual dependencies with memory-enhanced conditional random fields. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 555–565. Asian Federation of Natural Language Processing, 2017.
- [10] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
- [11] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
- [12] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
- [13] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
- [14] Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108, 2017.
- [15] Erik F Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 142–147. Association for Computational Linguistics, 2003.
- [16] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.