Sharing Network Parameters for Crosslingual Named Entity Recognition
Most state-of-the-art approaches for Named Entity Recognition rely on hand-crafted features and annotated corpora. Recently, neural network based models have been proposed which do not require hand-crafted features but still require annotated corpora. However, such annotated corpora may not be available for many languages. In this paper, we propose a neural network based model which allows sharing the decoder as well as word and character level parameters between two languages, thereby allowing a resource-fortunate language to aid a resource-deprived language. Specifically, we focus on the case when limited annotated corpora are available in one language (the target) and abundant annotated corpora are available in another language (the source). Sharing the network architecture and parameters between the source and target languages leads to improved performance in the target language. Further, our approach does not require any hand-crafted features but instead directly learns meaningful feature representations from the training data itself. We experiment with 4 language pairs and show that, in a resource constrained setup (less annotated corpora), a model jointly trained with data from another language indeed performs better than a model trained only on the limited corpora in one language.
Named Entity Recognition (NER) plays a crucial role in several downstream applications such as Information Extraction, Question Answering, Machine Translation, etc. Existing state-of-the-art systems for NER are typically supervised systems which require sufficient annotated corpora for training. In addition, they rely on language-specific handcrafted features (such as capitalization of the first character in English). Some of these features rely on knowledge resources in the form of gazetteers and other NLP tools such as POS taggers which in turn require their own training data. This requirement of resources in the form of training data, gazetteers, tools, feature engineering, etc. makes it hard to apply these approaches to resource deprived languages.
Recently, several neural network based approaches for NER have been proposed which circumvent the need for hand-crafted features and thereby the need for gazetteers, part-of-speech taggers, etc. They directly learn meaningful feature representations from the training data itself and can also benefit from large amounts of unannotated corpora in the language. However, they still require sufficient data for training the network and thus only partially address the problem of resource scarcity.
A recently proposed encoder-decoder based model for sequence labeling takes a sequence of bytes (characters) as input instead of words and outputs spans as well as labels for these spans. For example, in the case of part-of-speech tagging the span could identify one word and the associated label would be the part-of-speech tag of that word. Since the input consists of character sequences, the network can be jointly trained using annotated corpora from multiple languages by sharing the vocabulary (characters, in this case) and associated parameters. The authors show that such a jointly trained model can perform better than the same model trained on monolingual data. However, they do not focus on the resource constrained setup where one of the languages has very little annotated corpora. Further, the best results in their joint training setup are poor when compared even to the monolingual results reported in this paper.
In this paper, we propose a neural network based model which allows sharing of character dependent, word dependent and output dependent parameters. Specifically, given a sequence of words, we employ LSTMs at the word level and CNNs at the character level to extract complementary feature representations. The word level LSTMs can capture contextual information and the character level CNNs can encode morphological information. At the output layer we use a feedforward network to predict NER tags. As in this prior work, our character dependent parameters are shared across languages (which use the same character set). However, unlike that work, we do not use an encoder-decoder architecture. Further, our model also employs word level features which can be shared across languages by using bilingual word embeddings jointly learned from parallel corpora. Since the NER tags are the same across languages, even the output layer of our model is shared across languages.
We experiment with 4 language pairs, viz., English-Spanish, English-German, Spanish-German and Dutch-German, using standard NER datasets released as part of the CoNLL shared tasks and a publicly available German NER dataset. We artificially constrain the amount of training data available in one language and show that the network can still benefit from abundant annotated corpora in another language by jointly learning the shared parameters. Further, in the monolingual setup we report state-of-the-art results for two out of three languages without using any handcrafted features or gazetteers.
The remainder of this paper is organized as follows:
In this section we present a quick overview of (i) neural network based approaches for NER which now report state of the art results and (ii) approaches catering to multilingual NER.
Neural networks were first explored in the context of named entity recognition in early work, but a later line of work was the first to successfully use neural networks for several NLP tasks including NER. Unlike existing supervised systems, it used minimal handcrafted features and instead relied on automatically learning word representations from large unannotated corpora. The output layer was a CRF layer which modeled the entire sequence likelihood. This work also used the idea of sharing network parameters across different tasks (but not between different languages).
This idea was further developed to include character level information in addition to word level information. Convolutional Neural Networks (CNNs) with a fixed filter width were used to extract relevant character level information. The combined character features and word embeddings were fed to a time delay neural network, as in the earlier work, and used for Spanish and Portuguese NER.
There are a few works which use Bidirectional Long Short Term Memory networks (Bi-LSTMs) for encoding word sequence information for sequence tagging. For example, some approaches use LSTMs for encoding word sequences and then use CRFs for decoding tag sequences, while others use a combination of Bi-LSTMs and CNNs for NER. The decoder is still a CRF which is trained to maximize the entire sequence likelihood. Both these approaches also use some handcrafted features. More recent work proposed hierarchical Bi-LSTMs as an alternative to CNN-Bi-LSTMs, wherein a character level Bi-LSTM is followed by a word level Bi-LSTM, thus forming a hierarchy of LSTMs. This model also used a CRF at the output layer and was tested on English, Spanish, Dutch, and German. It reported state-of-the-art results when systems with no handcrafted feature engineering are considered.
Other recent work proposed a novel encoder-decoder architecture for language independent sequence tagging, which was subsequently extended to focus on both multi-task and multilingual settings. In the multi-task scenario, except for the output CRF layer, the rest of the network parameters were shared. In the multilingual setting only the character level features were shared across languages. Though some improvements were reported in the multilingual setting, this model is not suitable in a resource constrained setup (limited training data) because knowledge sharing between languages happens only through character level features.
Multilingual training of NER systems has a long history. Usually these systems train a language dependent NER tagger by (i) enforcing tag constraints along the aligned words in parallel tagged corpora or untagged parallel corpora and/or (ii) using cross-lingual features.
Unlike existing methods, our proposed deep learning model allows sharing of different parameters across languages and can be jointly trained without the need for any annotated parallel corpus or any handcrafted features.
In this section, we describe our model which encodes both character level as well as word level information for Named Entity Recognition. As shown in Figure 3, our model consists of three components, viz., (i) a convolutional layer for extracting character-level features, (ii) a bi-directional LSTM for encoding input word sequences and (iii) a feedforward output layer for predicting the tags.
3.1 Character level Convolutional Layer
The input to our model is a sequence of words $w_1, w_2, \ldots, w_T$. We consider each word to be further composed of a sequence of characters, i.e., $w = (c_1, c_2, \ldots, c_k)$ where $k$ is the number of characters in the word. Each character is represented as a one-hot vector of size $|C|$, where $|C|$ is the number of characters in the language. These one-hot representations of all the characters in the word are stacked to form a matrix $M \in \{0, 1\}^{k \times |C|}$. We then apply several filters of one dimensional convolution to this matrix. The width of these filters varies from 1 to $n$, i.e., these filters look at 1 to $n$-gram character sequences. The intuition is that a filter of width 1 could look at unigram characters and hopefully learn to distinguish between uppercase and lowercase characters. Similarly, a filter of width 4 could learn that the sequence “son$” at the end of a word indicates a PERSON (as in Thomson, Johnson, Jefferson, etc.).
The convolution operation is followed by a max-pooling operation to pick the most relevant feature (for example, as shown in Figure 1, the max-pooling layer picks up the feature corresponding to capitalization). Further, since there could be multiple relevant n-grams of the same length, we define multiple filters of each width. For example, each of the 4-gram sequences son$, corp, ltd. is relevant for NER and different filters of width 4 could capture the information encoded in these different 4-gram sequences. In other words, we have multiple filters of each width 1, 2, ..., $n$. If we have a total of $f$ such filters then we get an $f$-dimensional representation of the word, which we denote by $w_{char}$.
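The character-level feature extraction described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the character inventory and the random filter weights are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical character inventory; a real model would use the language's full
# character set so that case information is preserved.
CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ$"

def one_hot_word(word):
    """Stack one-hot character vectors into a (len(word), |C|) matrix."""
    M = np.zeros((len(word), len(CHARS)))
    for i, ch in enumerate(word):
        M[i, CHARS.index(ch)] = 1.0
    return M

def char_cnn_features(word, filters):
    """Apply 1-D convolutions of varying widths, then max-pool each filter.

    `filters` is a list of (width, weights) pairs where weights has shape
    (width, |C|); the result is one max-pooled value per filter.
    """
    M = one_hot_word(word)
    feats = []
    for width, W in filters:
        # Slide the filter over every character n-gram of this width.
        scores = [np.sum(M[i:i + width] * W)
                  for i in range(len(word) - width + 1)]
        feats.append(max(scores))  # max-pooling keeps the strongest match
    return np.array(feats)

rng = np.random.default_rng(0)
filters = [(w, rng.normal(size=(w, len(CHARS)))) for w in (1, 2, 3, 4)]
vec = char_cnn_features("Johnson$", filters)
print(vec.shape)  # one feature per filter
```

With more filters per width, the output vector simply grows to one dimension per filter, giving the $f$-dimensional word representation.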
The input to the bi-directional LSTM is a sequence of words where each word $w_t$ is represented by the concatenated vector $x_t = [e(w_t); w_{char}(w_t)]$. Here, $e(w_t)$ is simply the embedding of the word, which can be pre-trained (say, using word2vec) and then fine-tuned while training our model. The second part, $w_{char}(w_t)$, encodes character level information as described in the previous sub-section.
The forward LSTM reads this sequence of word representations from left to right whereas the backward LSTM does the same from right to left. This results in a hidden representation $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ for each word which contains two parts, where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the forward and backward LSTM's outputs respectively at time-step (position) $t$. We use the standard definitions of the LSTM functions.
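A minimal sketch of the bidirectional encoding and the concatenation of forward and backward states. A plain tanh recurrence stands in for the actual LSTM cells, and all dimensions and weights below are illustrative, not the paper's values.

```python
import numpy as np

def simple_rnn(inputs, W, U, h0):
    """Minimal recurrent encoder standing in for an LSTM."""
    h, outs = h0, []
    for x in inputs:
        h = np.tanh(W @ x + U @ h)
        outs.append(h)
    return outs

def bidirectional_encode(inputs, params):
    Wf, Uf, Wb, Ub = params
    h0 = np.zeros(Wf.shape[0])
    fwd = simple_rnn(inputs, Wf, Uf, h0)
    # Run over the reversed sequence, then reverse back to align positions.
    bwd = simple_rnn(inputs[::-1], Wb, Ub, h0)[::-1]
    # h_t = [forward state ; backward state]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(1)
d_in, d_h = 6, 4
params = tuple(rng.normal(size=s)
               for s in [(d_h, d_in), (d_h, d_h), (d_h, d_in), (d_h, d_h)])
sent = [rng.normal(size=d_in) for _ in range(5)]
H = bidirectional_encode(sent, params)
print(len(H), H[0].shape)
```

Each position thus sees both its left and right context, which is what the concatenated representation feeds to the decoder.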
Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $x^{(i)}$ is a sequence of words and $y^{(i)}$ is a corresponding sequence of entity tags, our goal is to maximize the log-likelihood of the training data as in Equation 2,

$\max_{\theta} \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}; \theta) \quad (2)$

where $\theta$ are the parameters of the network. The log conditional probability can be decomposed as in Equation 3,

$\log p(y \mid x) = \sum_{t=1}^{T} \log p(y_t \mid y_{t-1}, h_t) \quad (3)$

We model $p(y_t \mid y_{t-1}, h_t)$ using the following equation:

$p(y_t = j \mid y_{t-1} = i, h_t) = \frac{\exp(W_j h_t + A_{i,j})}{\sum_{j' \in \mathcal{T}} \exp(W_{j'} h_t + A_{i,j'})}$

where $W_j$ is a parameter vector for tag $j$ which, when multiplied with $h_t$, gives a score for assigning the tag $j$. The matrix $A$ can be viewed as a transition matrix whose entry $A_{i,j}$ gives the transition score from tag $i$ to tag $j$, and $\mathcal{T}$ is the set of all possible output tags.
In simple words, our decoder computes the probabilities of the entity tags by passing the output representations computed by LSTM at each position and the previous tag through a linear layer followed by a softmax layer. In this sense, our model is a complete neural network based solution as opposed to existing models which use CRFs at the output.
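The decoder step can be sketched as follows. This is a toy NumPy illustration with random weights; the tag inventory size and variable names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def tag_distribution(h_t, prev_tag, W, A):
    """p(y_t | y_{t-1}, h_t): emission score W[j] . h_t plus transition A[prev, j]."""
    scores = W @ h_t + A[prev_tag]
    return softmax(scores)

rng = np.random.default_rng(2)
n_tags, d = 5, 8
W = rng.normal(size=(n_tags, d))       # one weight vector per output tag
A = rng.normal(size=(n_tags, n_tags))  # A[i, j] = transition score from tag i to tag j
h = rng.normal(size=d)
p = tag_distribution(h, prev_tag=0, W=W, A=A)
print(p.shape, p.sum())
```

Because the softmax is taken per position conditioned on the previous tag, tags can be predicted greedily or by beam search at test time, with no CRF machinery needed.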
3.4 Sharing parameters across languages
As shown in Figure 3, our model contains the following parameters: (i) convolutional filters, (ii) word embeddings, (iii) LSTM parameters and (iv) decoder parameters. The convolutional filters operate on character sequences and hence can be shared between languages which share a common character set. This is true for many European languages and we consider some of these languages for our experiments (English, Spanish, Dutch and German). Recently there has been a lot of interest in jointly learning bilingual word representations. The aim here is to project words across languages into a common space such that similar words across languages lie very close to each other in this space. In this paper, we experiment with Bilbowa bilingual word embeddings, which allow us to share the space of word embeddings across languages. Similarly, we also share the output layer across languages since all languages have the same entity tagset. Finally, we also share the LSTM parameters across languages. Thus, irrespective of whether the model sees a Spanish training instance or an English training instance, the same set of filters, LSTM parameters and output parameters get updated based on the loss function (and, of course, the word embeddings corresponding to the words present in the sentence also get updated).
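The split between shared and language-specific parameter groups can be summarized schematically. This is a toy sketch; the group names are hypothetical labels, not identifiers from the paper's code.

```python
# Parameter groups and their sharing scope in joint training (names hypothetical).
params = {
    "char_filters": "shared",   # shared when languages use a common character set
    "lstm": "shared",           # always shared
    "decoder": "shared",        # shared because the NER tagset is identical
    "emb/en": "en-only",        # word embeddings stay language-specific unless
    "emb/es": "es-only",        # jointly trained bilingual embeddings are used
}

def updated_groups(instance_lang):
    """Groups that receive gradient updates for a training instance in `instance_lang`."""
    return [name for name, scope in params.items()
            if scope == "shared" or scope == f"{instance_lang}-only"]

print(updated_groups("es"))  # ['char_filters', 'lstm', 'decoder', 'emb/es']
```

The point of the sketch: every instance, whatever its language, pushes gradients through the shared filters, LSTM and decoder, while only its own embedding table is touched.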
In this section we describe the following: (i) the datasets used for our experiments (ii) publicly available word embeddings used for different languages and (iii) the hyperparameters considered for all our experiments.
For English, Spanish and Dutch we use the datasets which were released as part of the CoNLL Shared Tasks on NER. Specifically, for English we use the data released as part of the CoNLL 2003 English NER Shared Task. For Spanish and Dutch we use the data released as part of the CoNLL 2002 Shared Task. The following entity tags are considered in these Shared Tasks: Person, Location, Organization and Miscellaneous. For all three languages, the official splits are used as training, development and test files.
Apart from these three languages we also evaluate our models on German. However, we did not have access to the German data from CoNLL (as it requires a special license). Instead we used a publicly available German NER dataset. This data was constructed by manually annotating the first two German Europarl session transcripts with NER labels following the CoNLL 2003 annotation guidelines. We use the first session to create train and validation splits. Table 1 summarizes the dataset statistics. Note that the German data is different from the English, Spanish and Dutch data, which use news articles (as opposed to parliamentary proceedings). Also note that the German NER data is in IO format, so for all our experiments involving German we convert the data in the other languages to IO format as well. For the remaining NER experiments, data is converted to IOBES format.
[Table 1: Dataset statistics, listing #Train tokens and #Test tokens for each language.]
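The tag-scheme conversion mentioned above can be sketched as follows, assuming tags are well-formed strings such as 'B-PER': any B-/I-/E-/S- prefix from the IOB/IOBES schemes collapses to I- in the IO scheme.

```python
def to_io(tags):
    """Collapse IOB/IOBES tags to the simpler IO scheme used by the German data."""
    out = []
    for t in tags:
        if t == "O":
            out.append("O")
        else:
            # e.g. 'B-PER' -> prefix 'B', entity 'PER'
            prefix, entity = t.split("-", 1)
            out.append("I-" + entity)
    return out

print(to_io(["B-PER", "E-PER", "O", "S-LOC"]))  # ['I-PER', 'I-PER', 'O', 'I-LOC']
```

The conversion is lossy by design: IO cannot mark boundaries between adjacent entities of the same type, which is why it is only used where the German data forces it.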
We used pre-trained Spectral word embeddings for English, Spanish, German and Dutch. All the word embeddings are of 200 dimensions. We update these pre-trained word embeddings during training. We convert all words to lowercase before obtaining the corresponding word embedding. However, note that we preserve the case information when sending the character sequence through the CNN layer (as the case information is important for the character filters). Word embeddings for different languages lie in different feature spaces (unless we use bilingual word embeddings which are trained to reside in the same feature space). These word embeddings cannot be directly given as input to our model (as unrelated words from the two languages can have similar word embeddings, i.e., similar features). We therefore use a language dependent linear layer to map the words from the two languages to a common feature space in a task specific setting (common features w.r.t. the named entity task) and then feed these as input to the LSTM layer.
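The language-dependent linear layer can be sketched as follows. This is a minimal illustration with random weights; the class and parameter names are hypothetical.

```python
import numpy as np

class LanguageProjection:
    """Per-language linear map into a shared, task-specific feature space."""

    def __init__(self, dim, languages, seed=0):
        rng = np.random.default_rng(seed)
        # One projection matrix per language; the output space is shared.
        self.W = {lang: rng.normal(scale=0.1, size=(dim, dim))
                  for lang in languages}

    def project(self, lang, embedding):
        return self.W[lang] @ embedding

proj = LanguageProjection(dim=200, languages=["en", "es"])
e_en = np.ones(200)  # stand-in for a 200-dimensional pre-trained embedding
print(proj.project("en", e_en).shape)  # (200,)
```

Each language's matrix is trained with the rest of the network, so the mapping into the common space is learned specifically for the NER objective rather than fixed in advance.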
4.3 Resource constrained setup
In the resource constrained setup we assume that we have ample training data in one source language and only limited training data in the target language.
In all our resource constrained experiments the LSTM parameters are always shared between the source and target language. In addition, we share one or more of the following: (i) convolutional filters, (ii) the space of word embeddings and (iii) decoder parameters. By sharing the space of word embeddings, we mean that instead of using individually trained monolingual Spectral embeddings for the source and target language, we use jointly trained word embeddings which project the words into a common space. We use the off-the-shelf Bilbowa algorithm with default settings to train these bilingual word embeddings. Bilbowa takes both monolingual and bilingual corpora as input. For bilingual corpora, we use the relevant source-target portion of the Europarl corpus and Opus. For monolingual corpora, we obtain short abstracts for each of the 4 languages from DBpedia.
During training, we combine the training set of the source and target languages. Specifically, we merge all sentences from the training corpus of each language and randomly shuffle them to obtain a bilingual training set. This procedure is similarly repeated for the development set.
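The merge-and-shuffle procedure can be sketched as follows. This is a minimal illustration; the fixed seed is only there to make the sketch reproducible.

```python
import random

def make_bilingual_split(source_sents, target_sents, seed=13):
    """Merge the two corpora and shuffle, as done for the train and dev sets."""
    merged = list(source_sents) + list(target_sents)
    random.Random(seed).shuffle(merged)
    return merged

# Toy sentences tagged with their language of origin.
train = make_bilingual_split([("en", s) for s in range(3)],
                             [("de", s) for s in range(2)])
print(len(train))  # 5
```

Shuffling matters here: it interleaves the two languages so that gradient updates to the shared parameters alternate between them rather than processing one language's corpus in a block.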
Our model contains the following hyper-parameters: (i) LSTM size, (ii) maximum width $n$ of the CNN filters, (iii) number of filters per width (i.e., the number of filters sharing the same width) and (iv) the learning rate. All the hyper-parameters were tuned by doing a grid search and evaluating the error on the development set. For the LSTM size we considered values from 100 to 300 in steps of 50; for the maximum width of the CNN filters we considered values from $n$ = 4 to 9 (i.e., we use all filters of width 1 to $n$). We varied the number of filters per width from 10 to 30 in steps of 5 and the learning rate from 0.05 to 0.50 in steps of 0.05.
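The grid described above can be enumerated as follows (a small sketch of the search space; the key names are hypothetical):

```python
import itertools

# Hyper-parameter ranges taken from the grid search described in the text.
grid = {
    "lstm_size": list(range(100, 301, 50)),             # 100..300 step 50
    "max_filter_width": list(range(4, 10)),             # n = 4..9
    "filters_per_width": list(range(10, 31, 5)),        # 10..30 step 5
    "learning_rate": [round(0.05 * i, 2) for i in range(1, 11)],  # 0.05..0.50
}

def grid_points(grid):
    """Yield every hyper-parameter combination as a dict."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_points(grid))
print(len(configs))  # 5 * 6 * 5 * 10 = 1500
```

Each of the 1500 configurations would be trained and scored on the development set, with the best configuration kept for the final test runs.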
In this section we report our experimental results.
The main focus of this work is to see if a resource constrained language can benefit from a resource rich language. However, before reporting results in this setup, we would like to check how well our model performs for monolingual NER (i.e., training and testing in the same language). Table 2 compares our results with some very recently published state-of-the-art systems. We observe that our model gives state-of-the-art results for Dutch and English and comparable results for Spanish. This shows that a completely neural network based approach can perform at par with approaches which use a combination of neural networks and CRFs.
5.2 A naturally resource constrained scenario
We now discuss our results in the resource constrained setup. In our primary experiments, we treat German as the target language and English, Spanish and Dutch as the source language. The reason for choosing German as the target language is that the NER data available for German is indeed very small as compared to the English, Spanish and Dutch datasets (thus naturally forming a pair of resource rich (English, Dutch, Spanish) and resource poor (German) languages). We train our model jointly using the entire source (English or Dutch or Spanish) and target (German) data. We report separate results for the case when (i) the convolutional filters are shared (ii) the decoder is shared and (iii) both are shared. We compare these results with the case when we train a model using only the target (German) data. The results are summarized in Table ? (DE: German, EN: English, NL: Dutch, ES: Spanish).
We observe that sharing of parameters between the two languages helps achieve better results compared to the monolingual setting. Sharing of decoder between English and German helps the most. On the other hand, for German and Dutch we get best results when sharing both character level filters as well as decoder parameters. For German and Spanish sharing the filters helps achieve better results.
Next, we intend to use a common word embedding space for the source and target languages where related words across the two languages have similar embeddings. The intuition here is that if a source word is seen at training time but the corresponding target word (its translation) is only seen at test time, the model could still generalize since the embeddings of the source and target words are similar. For this, we use the jointly trained Bilbowa word embeddings as described in Section 4.3. In addition, the decoder and character filters are also shared between the two languages. These results are summarized in Table ?. We observe that we get larger gains when combining the source and target language data. However, the overall results are still poorer than those obtained using monolingual Spectral embeddings (as reported in Table ?). This is mainly because the monolingual corpora used for training the Bilbowa word embeddings were much smaller than those used for training the Spectral embeddings. For example, the English Spectral embeddings were trained on the larger GigaWord corpus (1 billion words) whereas the Bilbowa embeddings were trained on a smaller corpus comprising DBpedia abstracts (around 400 million words). Given the promising gains obtained by using these bilingual word embeddings, it would be interesting to train them on larger corpora. We leave this as future work.
5.3 A simulated resource constrained scenario
To help us analyze our model further we perform one more experiment using English as the source and Spanish as the target language. Since sufficient annotated corpora is available in Spanish, we artificially simulate a resource constrained setup by varying the amount of training data in Spanish from 10% to 90% in steps of 10%. These results are summarized in Figure 4. We see an improvement of around 0.73% to 1.87% when the amount of Spanish data is between 30% and 80%. The benefit of adding English data would of course taper off as more and more Spanish data becomes available. We hoped that the English data would be more useful when a smaller amount of Spanish data (less than 30%) is available, but this is not the case. We believe this happens because at lower Spanish data sizes, the English data dominates the training process, which perhaps prevents the model from learning certain Spanish-specific characteristics. Finally, Figure 5 summarizes the results obtained when using a common word embedding space (i.e., using Bilbowa word embeddings) and sharing the decoder and character filters. Once again we see larger improvements, but the overall results are lower than those obtained with Spectral embeddings for the reasons explained above.
We did some error analysis to understand the effect of sharing different network parameters. Although our primary experiments were on English-German, Spanish-German and Dutch-German, we restricted our error analysis to English-Spanish since we could understand these two languages.
6.1 Shared Decoder
Intuitively, sharing the decoder should allow one language to benefit from the tag sequence patterns learned from another language. Of course, this would not happen if the two languages have very different word orders (for example, English-Hindi), but this is not the case for English and Spanish. Indeed, we observed that the Spanish model was able to benefit from certain tag sequences which were not frequently seen in the Spanish training data but were seen in the English training data. For example, the tag sequence pattern (_O w_LOC is frequently confused and tagged as (_O w_ORG by the Spanish monolingual model. Here, the symbol “(” is tagged as Others and w is a place-holder for some word. However, this tag pattern was frequently observed in the English training data. For example, such patterns were observed in English sports news articles: “Ronaldo (_O Brazil_LOC ) scored 2 goals in the match.”. The joint model could benefit from this information coming from the English data and was thus able to reduce some of the errors made by the Spanish model.
6.2 Shared Character Filters
We observed that sharing character filters also helps in generalization by extracting language independent named entity features. For example, many location names begin with an uppercase character and end with the suffix ia, as in Australia, Austria, Columbia, India, Indonesia, Malaysia, etc. There were many such location named entities in the English corpus compared to the Spanish training corpus. We observed that Spanish benefited from this in the joint training setup and made fewer mistakes on such names (which it was otherwise confusing with the Organization tag in the monolingual setting).
In this work, we focused on the problem of improving NER in a resource deprived language by using additional annotated corpora from another language. To this end, we proposed a neural network based architecture which allows sharing of various parameters between the two languages. Specifically, we share the decoder, the filters used for extracting character level features and a common space of bilingual word embeddings. Since the parameters are shared, the model can be jointly trained using annotated corpora available in both languages. Our experiments involving 4 language pairs suggest that such joint training indeed improves performance in the resource deprived language.
There are a few interesting research directions that we would like to pursue in the future. Firstly, we observed that we get much larger gains when the space of word embeddings is shared. However, due to the poorer quality of the bilingual embeddings, the overall results are not better than in the case when we use monolingual word embeddings. We would like to see if training the bilingual word embeddings on a larger corpus would correct this situation. Further, the word embeddings are currently trained independently of the NER task and then fine-tuned during training. It would be interesting to design a model which allows us to jointly embed words and predict tags in multiple languages. Finally, in this work we used only two languages at a time. We would like to see if jointly training with multiple languages could give better results.
Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data.
Bogdan Babych and Anthony Hartley. Improving machine translation quality with automatic named entity recognition.
Wanxiang Che, Mengqiu Wang, Christopher D. Manning, and Ting Liu. Named entity recognition with bilingual constraints.
Yufeng Chen, Chengqing Zong, and Keh-Yih Su. On jointly recognizing and aligning bilingual named entities.
Jason P. C. Chiu and Eric Nichols. Named entity recognition with bidirectional lstm-cnns.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch.
Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Ungar. Eigenwords: Spectral word embeddings.
Cicero dos Santos, Victor Guimaraes, RJ Niterói, and Rio de Janeiro. Boosting named entity recognition with neural character embeddings.
Manaal Faruqui and Sebastian Padó. Training and evaluating a german named entity recognizer with semantic generalization.
Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. Named entity recognition through classifier combination.
Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. Multilingual language processing from bytes.
Stephan Gouws, Yoshua Bengio, and Greg Corrado. Bilbowa: Fast bilingual distributed representations without word alignments.
James Hammerton. Named entity recognition with long short-term memory.
Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging.
Philipp Koehn. Europarl: A Parallel Corpus for Statistical Machine Translation.
Guillaume Lample, Miguel Ballesteros, Kazuya Kawakami, Sandeep Subramanian, and Chris Dyer. Neural architectures for named entity recognition.
Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Chris Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia.
Qi Li, Haibo Li, Heng Ji, Wen Wang, Jing Zheng, and Fei Huang. Joint bilingual name tagging for parallel corpora.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations.
Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition.
Cicero D. Santos and Bianca Zadrozny. Learning character-level representations for part-of-speech tagging.
M. Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks.
Raivis Skadinš, Jörg Tiedemann, Roberts Rozis, and Daiga Deksne. Billions of parallel words for free: Building and using the eu bookshop corpus.
Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. Cross-lingual word clusters for direct transfer of linguistic structure.
Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition.
Erik F. Tjong Kim Sang. Introduction to the conll-2002 shared task: Language-independent named entity recognition.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning.
Mengqiu Wang and Christopher D. Manning. Cross-lingual projected expectation regularization for weakly supervised learning.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. Effective bilingual constraints for semi-supervised learning of named entity recognizers.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. Joint word alignment and bilingual named entity recognition using dual decomposition.
Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. Multi-task cross-lingual sequence tagging from scratch.