Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents, while word embedding has been demonstrated to be a powerful representation for characterizing the statistical properties of natural language. In this study, we propose to use BLSTM-RNN with word embedding for the part-of-speech (POS) tagging task. When tested on the Penn Treebank WSJ test set, a state-of-the-art tagging accuracy of 97.40% is achieved. Without using morphological features, this approach can also achieve performance comparable with the Stanford POS tagger.
Bidirectional long short-term memory [Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997] (BLSTM) is a type of recurrent neural network (RNN) that can incorporate contextual information from long spans of both preceding and following inputs. It has proven to be a powerful model for sequential labeling tasks. For applications in natural language processing (NLP), it has helped achieve superior performance in language modeling [Sundermeyer et al., 2012; Sundermeyer et al., 2015], language understanding [Yao et al., 2013], and machine translation [Sundermeyer et al., 2014]. Since part-of-speech (POS) tagging is a typical sequential labeling task, it seems natural to expect that BLSTM RNN can also be effective for this task.
As a neural network model, BLSTM RNN cannot easily make use of conventional NLP features, such as morphological features. Because these features are discrete and have to be represented as one-hot vectors, using many features of this type leads to an input layer that is too large to maintain and update. Therefore, we avoid using such features except word form and simple capitalization features, and instead use word embedding. Word embedding is a low-dimensional real-valued vector used to represent a word. It is considered to contain part of the syntactic and semantic information of the word and has been shown to be a very attractive feature for various language processing tasks [Collobert and Weston, 2008; Turian et al., 2010a; Collobert et al., 2011]. Word embedding can be obtained by training a neural network model, in particular a neural network language model [Bengio et al., 2006; Mikolov et al., 2010] or a neural network designed for a specific task [Collobert et al., 2011; Mikolov et al., 2013a; Pennington et al., 2014a]. Currently many word embeddings trained on quite large corpora are available online. However, these embeddings are trained by neural networks that are very different from BLSTM RNN. This inconsistency is presumed to be a shortcoming that prevents us from making the most of these trained word embeddings. To overcome this shortcoming, we also propose a novel method to train word embedding on unlabeled data with BLSTM RNN.
The main contributions of this work are as follows. First, it shows an effective way to use BLSTM RNN for the POS tagging task and achieves state-of-the-art tagging accuracy. Second, a novel method for training word embedding is proposed. Finally, we demonstrate that competitive tagging accuracy can be obtained without using morphological features, which makes this approach more practical for tagging a language that lacks the necessary morphological knowledge.
2.1 BLSTM RNN for POS Tagging
Given a sentence $w_1, w_2, \ldots, w_n$ with tags $y_1, y_2, \ldots, y_n$, BLSTM RNN is used to predict the tag probability distribution of each word. The usage is illustrated in Figure 1.
Here $w_i$ is the one-hot representation of the current word. It is a binary vector of dimension $|V|$, where $V$ is the vocabulary. To reduce $|V|$, each letter in the input word is converted to lower case. To keep the upper case information, a function $f(w_i)$ is introduced to indicate the original case information of word $w_i$. More specifically, $f(w_i)$ returns a three-dimensional binary vector that tells whether $w_i$ is all lowercase, all uppercase, or begins with a capital letter. The input vector $I_i$ of the neural network is computed as:
$$I_i = W_1 w_i + W_2 f(w_i)$$
where $W_1$ and $W_2$ are weight matrices connecting two layers. $W_1 w_i$ is the word embedding of $w_i$, which has a much smaller dimension than $w_i$. In practice, $W_1$ is implemented as a lookup table, and $W_1 w_i$ is returned by looking up the word embedding of $w_i$ stored in this table. To use word embeddings trained by another task or method, we only need to initialize this lookup table with those external embeddings. For words without corresponding external embeddings, the word embeddings are initialized with uniformly distributed random values ranging from -0.1 to 0.1. The implementation of the BLSTM layer is described in detail in [Graves, 2012] and is therefore skipped in this paper. This layer incorporates information from the past and future histories when making a prediction for the current word and is updated as a function of the entire input sentence. The output layer is a softmax layer whose dimension is the number of tag types. It outputs the tag probability distribution of the input word $w_i$. All weights are trained using backpropagation and the gradient descent algorithm to maximize the likelihood on the training data:
$$\sum_{i=1}^{n} \log P(y_i \mid w_1, w_2, \ldots, w_n)$$
The probability distributions obtained at different steps are assumed to be independent of each other. The use of contextual information comes strictly from the BLSTM layer. Thus, in the inference phase, the likeliest tag $\hat{y}_i$ of input word $w_i$ can simply be chosen as:
$$\hat{y}_i = \arg\max_{j \in \{1, \ldots, m\}} P(y_i = j \mid w_1, w_2, \ldots, w_n)$$
where $m$ is the number of tag types.
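The input side and the inference step of this section can be illustrated with a minimal Python sketch. It assumes that the input vector is the sum of the looked-up word embedding and a linear projection of the three-dimensional case vector, which is one reading of the description above. The function names (`case_features`, `build_lookup`, `input_vector`, `most_likely_tags`), the dictionary-based lookup table, and the handling of mixed-case words are illustrative choices, not part of the original system.

```python
import random

def case_features(word):
    """Return [all lowercase, all uppercase, leading capital] as a binary vector.

    Words with no cased characters (e.g. "1987") or mixed case (e.g. "iPhone")
    map to [0, 0, 0]; this edge-case handling is an assumption of the sketch.
    """
    return [
        1 if word.islower() else 0,
        1 if word.isupper() else 0,
        1 if word[:1].isupper() and not word.isupper() else 0,
    ]

def build_lookup(vocab, external, dim, seed=0):
    """Build the embedding lookup table for every word in vocab.

    Words covered by the external embeddings copy them; all other words get
    uniformly distributed random values in [-0.1, 0.1].
    """
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in external:
            table[word] = list(external[word])
        else:
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

def input_vector(word, table, w2, unk="UNK"):
    """Embedding of the lowercased word plus the projected case vector.

    w2 is a (dim x 3) matrix given as a list of rows (hypothetical layout).
    """
    emb = table.get(word.lower(), table[unk])
    feats = case_features(word)
    proj = [sum(row[c] * feats[c] for c in range(3)) for row in w2]
    return [e + p for e, p in zip(emb, proj)]

def most_likely_tags(distributions):
    """Pick the highest-probability tag index independently at each position."""
    return [max(range(len(p)), key=p.__getitem__) for p in distributions]
```

Because the per-position distributions are treated as independent, inference reduces to a position-wise argmax and needs no sequence-level decoding such as Viterbi search.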
2.2 Word Embedding
In this section, we propose a novel method to train word embedding on unlabeled data with BLSTM RNN. In this approach, BLSTM RNN is also used for a tagging task, but it has only two types of tags to predict: incorrect and correct. The input is a sequence of words that forms a normal sentence with some words replaced by randomly chosen words. Replaced words are tagged 0 (incorrect) and words that are not replaced are tagged 1 (correct). Although some replaced words may still be reasonable in the sentence, they are nevertheless considered “incorrect”. Then BLSTM RNN is trained to minimize the binary classification error on the training corpus. The neural network structure is the same as that in Figure 1. Once the neural network is trained, the word embedding lookup table of its input layer contains all trained word embeddings.
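The construction of the training data for this binary tagging task can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `corrupt_sentence`, the per-word independent replacement, and the uniform sampling from the vocabulary are choices made for the sketch. The default replacement rate of 0.2 follows the figure of about 20% reported in Section 3.3.

```python
import random

def corrupt_sentence(words, vocab, replace_prob=0.2, rng=None):
    """Replace some words with random vocabulary words and tag each position.

    Returns (corrupted_words, tags) where tag 0 marks a replaced ("incorrect")
    word and tag 1 marks an untouched ("correct") word. A replaced word keeps
    tag 0 even if the random draw happens to fit the sentence.
    """
    rng = rng or random.Random()
    corrupted, tags = [], []
    for word in words:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))  # uniform sampling is an assumption
            tags.append(0)
        else:
            corrupted.append(word)
            tags.append(1)
    return corrupted, tags

if __name__ == "__main__":
    sentence = "the market closed higher on friday".split()
    vocab = ["bank", "river", "run", "blue", "quickly"]
    print(corrupt_sentence(sentence, vocab, rng=random.Random(0)))
```

No manual annotation is needed: the labels come for free from the corruption step, which is what allows the embeddings to be trained on a large unlabeled corpus.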
BLSTM RNN systems in our experiments are implemented with CURRENNT [Weninger et al., 2014], a machine learning library for RNN that supports GPU acceleration. The activation function of the input layer is the identity function and that of the hidden layer is the logistic function, while the output layer uses the softmax function for multiclass classification. The neural network is trained using the stochastic gradient descent algorithm with a constant learning rate.
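As a toy illustration of the output layer and the update rule, the following Python sketch computes a softmax distribution, the negative log-likelihood of the gold tag, and one gradient step with a constant learning rate. It is not the CURRENNT implementation: for brevity it treats the output-layer pre-activations themselves as the trainable parameters, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of pre-activations."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def neg_log_likelihood(logits, gold):
    """Negative log-probability assigned to the gold tag index."""
    return -math.log(softmax(logits)[gold])

def sgd_step(logits, gold, learning_rate):
    """One gradient descent step on the negative log-likelihood.

    The gradient with respect to the pre-activations is (probabilities - one-hot gold).
    """
    probs = softmax(logits)
    grad = [p - (1.0 if j == gold else 0.0) for j, p in enumerate(probs)]
    return [z - learning_rate * g for z, g in zip(logits, grad)]
```

Minimizing this negative log-likelihood is equivalent to maximizing the likelihood of the training tags; in the full network the same gradient is propagated back through the BLSTM layer to the lookup table.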
The part-of-speech tagged data used in our experiments is the Wall Street Journal data from Penn Treebank III [Marcus et al., 1993]. The training, development and test sets are split following the setup in [Collins, 2002]. Table 1 lists detailed information about the three data sets.
To train word embedding, we use the North American news corpus [Graff, 2008] as the unlabeled data. This corpus contains about 536 million words. It is tokenized using the Penn Treebank tokenizer script.
3.2 Hidden Layer Size
We evaluate different hidden layer sizes in BLSTM RNN to select the best structure for later experiments. The input layer size is set to 100 and the output layer size is fixed at 45 in all experiments. The accuracies on the WSJ test set are shown in Figure 2.
Figure 2 shows that the hidden layer size has only a limited impact on performance once it becomes large enough. To keep a good trade-off among accuracy, model size and running time, we choose 100, the smallest layer size that gives “reasonable” performance, as the hidden layer size in all the following experiments.
3.3 POS Tagging Accuracies
Table 2 compares the performance of our systems with other baseline systems.
|System||Acc (%)|
|[Toutanova et al., 2003]||97.24|
|[Huang et al., 2012]||97.35|
|[Collobert et al., 2011] NN||96.36|
|[Collobert et al., 2011] NN+WE||97.20|
Baseline systems. Four typical systems are chosen as baseline systems. [Toutanova et al., 2003] is one of the most commonly used approaches and is also known as the Stanford tagger. [Huang et al., 2012] is the system that reports the best accuracy on the WSJ test set (97.35%). In fact, [Spoustová et al., 2009] reports a higher accuracy (97.44%), but this work relies on multiple trained taggers and combines their tagging results. Here we focus on single-model tagging algorithms and therefore do not include this work as a baseline. In addition, [Moore, 2014] (97.34%) and [Shen et al., 2007] (97.33%) also reach accuracy above 97.3%. These two systems plus [Huang et al., 2012] are considered the current state-of-the-art systems. All these systems rely on rich morphological features. In contrast, [Collobert et al., 2011] NN only uses word form and capitalization features. [Collobert et al., 2011] NN+WE also incorporates word embeddings trained on unlabeled data, like our approach. The main difference is that [Collobert et al., 2011] uses a feedforward neural network instead of BLSTM RNN.
BLSTM-RNN is the system described in Section 2.1, which only uses word form and capitalization features. The vocabulary used in this experiment consists of all words appearing in the WSJ Penn Treebank training set, merged with the 100,000 most common words in the North American news corpus, plus a single “UNK” symbol that replaces all out-of-vocabulary words.
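The vocabulary construction described above can be sketched in a few lines of Python. The function names `build_vocab` and `map_oov` are hypothetical, and lowercasing every token before counting is an assumption that follows the lowercasing step of Section 2.1.

```python
from collections import Counter

def build_vocab(train_tokens, unlabeled_tokens, top_k=100000, unk="UNK"):
    """Union of all training-set words and the top_k most common unlabeled words.

    Tokens are lowercased first, so the uppercase "UNK" symbol cannot collide
    with a real word.
    """
    vocab = {w.lower() for w in train_tokens}
    counts = Counter(w.lower() for w in unlabeled_tokens)
    vocab.update(w for w, _ in counts.most_common(top_k))
    vocab.add(unk)
    return vocab

def map_oov(tokens, vocab, unk="UNK"):
    """Lowercase each token and replace out-of-vocabulary words with unk."""
    return [w.lower() if w.lower() in vocab else unk for w in tokens]
```

For example, with a training set containing "The" and "cat" and `top_k=2` over an unlabeled stream in which "dog" and "a" are the two most frequent words, the token "bird" is mapped to "UNK" while "Dog" is mapped to "dog".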
Without the help of morphological features, it is not surprising that BLSTM-RNN falls behind the state-of-the-art systems. However, BLSTM-RNN surpasses [Collobert et al., 2011] NN, which is also a neural network based method and uses the same input features. This is consistent with [Fernandez et al., 2014; Fan et al., 2014], in which BLSTM RNN outperforms feedforward neural networks.
|WE||Dim||Vocab Size||Training Corpus (# Tokens)||OOV||Acc (%)|
|[Mikolov, 2010]||80||82K||Broadcast news (400M)||0.31||96.91|
|[Turian et al., 2010b]||100||269K||RCV1 (37M)||0.18||96.81|
|[Mikolov et al., 2013b]||300||3M||Google news (10B)||0.17||96.86|
|[Pennington et al., 2014b]1||100||400K||Wiki (6B)||0.13||97.12|
|[Pennington et al., 2014b]2||100||1193K||Twitter (27B)||0.25||97.00|
|BLSTM RNN WE||100||100K||North American news (536M)||0.17||97.26|
BLSTM-RNN+WE. To construct the corpus for training word embeddings, about 20% of the words in normal sentences of the North American news corpus are replaced with randomly selected words. Then BLSTM RNN is trained to judge which words have been replaced, as described in Section 2.2. The vocabulary for this task contains the 100,000 most common words in the North American news corpus and one special “UNK” symbol. When training is finished, the word embedding lookup table in the BLSTM RNN for POS tagging is initialized with the trained word embeddings. The subsequent training and testing are the same as in the previous experiment.
Table 2 shows the results of using word embeddings trained on the first 10 million words (WE(10m)), the first 100 million words (WE(100m)) and all 536 million words (WE(all)) of the North American news corpus. While WE(10m) does not help much, WE(100m) and WE(all) significantly boost the performance. This shows that BLSTM RNN can benefit from word embeddings trained on a large unlabeled corpus and that a larger training corpus leads to better performance. It also suggests that the result may be further improved by using an even bigger unlabeled data set. With the help of a GPU, WE(all) can be trained in about one day (23 hrs). The training time increases linearly with the training corpus size.
WE(all) reduces the error rate of BLSTM-RNN by over 20% and makes the result comparable with [Toutanova et al., 2003]. Note that this result is obtained without using any morphological features. Current state-of-the-art systems [Moore, 2014; Shen et al., 2007; Huang et al., 2012] all utilize the morphological features proposed in [Ratnaparkhi, 1996], which involve $n$-gram prefixes and suffixes ($n$ = 1 to 4). Moreover, [Shen et al., 2007] also uses prefixes and suffixes of length 5 to 9. [Moore, 2014] adds extra, elaborately designed features, including flags indicating whether a word ends with particular suffixes. In practice, many languages with rich morphological forms lack the necessary or effective morphological processing tools. In these cases, a POS tagger that does not rely on morphological features is more realistic to use.
BLSTM-RNN+WE(all)+suffix2. In this experiment, we add the bigram suffix of each word as an extra feature. These last two characters are represented as a one-hot vector and appended to the original extra feature vector (the three-dimensional capitalization vector of Section 2.1). The rest of the configuration follows BLSTM-RNN+WE(all). The additional feature further pushes up the accuracy and lets the approach reach state-of-the-art performance (97.40%). However, adding more morphological features such as the trigram suffix does not further improve the performance. One possible reason is that such a feature yields a much longer extra feature vector, which requires retuning parameters such as the learning rate and hidden layer size to reach optimum performance.
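The suffix feature can be sketched as follows. This is an illustration only: the function names are hypothetical, and the choices to collect the suffix inventory from the training words, to lowercase before taking the suffix, and to give unseen suffixes an all-zero vector are assumptions of the sketch, not details stated in the text.

```python
def suffix_index(train_words, length=2):
    """Map every suffix of the given length seen in training to a position."""
    suffixes = sorted({w.lower()[-length:] for w in train_words})
    return {s: i for i, s in enumerate(suffixes)}

def suffix_one_hot(word, index, length=2):
    """One-hot vector for the word's suffix; all zeros if the suffix is unseen."""
    vec = [0] * len(index)
    pos = index.get(word.lower()[-length:])
    if pos is not None:
        vec[pos] = 1
    return vec

def extended_features(word, case_vec, index):
    """Append the suffix one-hot vector to the three-dimensional case vector."""
    return list(case_vec) + suffix_one_hot(word, index)
```

The length of the extra feature vector grows with the number of distinct suffixes, which illustrates why longer suffixes such as trigrams enlarge the input considerably.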
3.4 Different Word Embeddings
In this experiment, six types of published, well-trained word embeddings are evaluated. The basic information about these word embeddings and the results are listed in Table 3, where RCV1 denotes the Reuters Corpus Volume 1 news set. The OOV (out of vocabulary) column gives the fraction of words in the vocabulary of the BLSTM RNN for POS tagging that are not covered by the external word embedding vocabulary. The word embeddings are used in the same way as in the BLSTM-RNN+WE experiment, except that the input layer size here is equal to the dimension of the external word embedding.
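The OOV column can be computed as a simple set difference. The sketch below assumes both vocabularies are available as collections of normalized word strings; the function name `oov_rate` is hypothetical.

```python
def oov_rate(tagger_vocab, embedding_vocab):
    """Fraction of tagger vocabulary words missing from the external embeddings."""
    tagger_vocab = set(tagger_vocab)
    if not tagger_vocab:
        raise ValueError("tagger vocabulary must not be empty")
    missing = tagger_vocab - set(embedding_vocab)
    return len(missing) / len(tagger_vocab)
```

Words counted as missing here are the ones whose embeddings are initialized randomly, as described in Section 2.1.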
All word embeddings bring about higher accuracy. However, none of them enables BLSTM RNN tagging to reach a competitive accuracy, despite the larger corpora on which they are trained and their lower OOV rates. [Pennington et al., 2014b]1 (97.12%) has the highest accuracy among them, but it is still lower than [Toutanova et al., 2003] (97.24%). Although more experiments are needed to judge which word embeddings are better, this experiment at least shows that word embeddings trained by BLSTM RNN are essential for our POS tagging approach to achieve superior performance.
In this paper, BLSTM RNN is proposed for POS tagging and for training word embedding. Combined with word embedding trained on large unlabeled data, this approach achieves state-of-the-art accuracy on the WSJ test set without using rich morphological features. BLSTM RNN with word embedding is expected to be an effective solution for tagging tasks and is worth further exploration.
- Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer.
- Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In EMNLP, pages 1–8, Stroudsburg, PA, USA.
- Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167.
- Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.
- Ronan Collobert. 2011. SENNA. http://ml.nec-labs.com/senna/.
- Yuchen Fan, Yao Qian, Fenglong Xie, and Frank K. Soong. 2014. TTS synthesis with bidirectional LSTM based recurrent neural networks. In INTERSPEECH, Singapore, September.
- Raul Fernandez, Asaf Rendel, Bhuvana Ramabhadran, and Ron Hoory. 2014. Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In INTERSPEECH, Singapore, September.
- David Graff. 2008. North American News Text, Complete LDC2008T15. https://catalog.ldc.upenn.edu/LDC2008T15.
- Alex Graves. 2012. Supervised sequence labelling with recurrent neural networks, volume 385. Springer.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Liang Huang, Suphan Fayong, and Yang Guo. 2012. Structured Perceptron with Inexact Search. In HLT-NAACL, pages 142–151, Montréal, Canada.
- Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics, 19(2):313–330.
- Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048, Makuhari, Chiba, Japan.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. word2vec. https://code.google.com/p/word2vec/.
- Tomas Mikolov. 2010. RNNLM. http://rnnlm.org/.
- Robert Moore. 2014. Fast high-accuracy part-of-speech tagging by independent classifiers. In Coling, pages 1165–1176, Dublin, Ireland, August.
- Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014a. Glove: Global Vectors for Word Representation. In EMNLP, pages 1532–1543, Doha, Qatar, October.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014b. GloVe. http://nlp.stanford.edu/projects/glove/.
- Adwait Ratnaparkhi. 1996. A Maximum Entropy Model for Part-Of-Speech Tagging. In EMNLP, pages 133–142.
- Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681.
- Libin Shen, Giorgio Satta, and Aravind Joshi. 2007. Guided Learning for Bidirectional Sequence Classification. In ACL, pages 760–767, Prague, Czech Republic, June.
- Drahomíra “johanka” Spoustová, Jan Hajič, Jan Raab, and Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In EACL, pages 763–771, Athens, Greece.
- Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In INTERSPEECH, Portland, Oregon, USA.
- Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014. Translation Modeling with Bidirectional Recurrent Neural Networks. In EMNLP, pages 14–25, Doha, Qatar, October.
- Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 23(3):517–529.
- Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In HLT-NAACL.
- Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010a. Word Representations: A Simple and General Method for Semi-supervised Learning. In ACL, pages 384–394, Stroudsburg, PA, USA.
- Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010b. Word representations for NLP. http://metaoptimize.com/projects/wordreprs/.
- Felix Weninger, Johannes Bergmann, and Björn Schuller. 2014. Introducing CURRENNT–the Munich open-source CUDA RecurREnt Neural Network Toolkit. Journal of Machine Learning Research, 15.
- Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In INTERSPEECH, pages 2524–2528.