Few-shot classification in Named Entity Recognition Task

Alexander Fritzler (Russia, afritzler449@gmail.com), Varvara Logacheva (Russia, varvara.logacheva@gmail.com), and Maksim Kretov (Russia, kretov.mk@mipt.ru)

For many natural language processing (NLP) tasks the amount of annotated data is limited. This creates a need for semi-supervised learning techniques, such as transfer learning or meta-learning. In this work we tackle the Named Entity Recognition (NER) task using a prototypical network, a metric learning technique. It learns intermediate representations of words which cluster well into named entity classes. This property allows the model to classify words with an extremely limited number of training examples, and it can potentially be used as a zero-shot learning method. By coupling this technique with transfer learning we achieve well-performing classifiers trained on only 20 instances of a target class.

Named Entity Recognition, Prototypical networks, Few-shot learning, Semi-supervised learning, Transfer learning
The 34th ACM/SIGAPP Symposium on Applied Computing (SAC ’19), April 8–12, 2019, Limassol, Cyprus. DOI: 10.1145/3297280.3297378. ISBN: 978-1-4503-5933-7/19/04.

1. Introduction

Named Entity Recognition (NER) is the task of finding entities, such as names of persons, organizations, locations, etc., in unstructured text. These names can be individual words or phrases in a sentence. Therefore, NER is usually interpreted as a sequence labelling task. It is actively used in various information extraction frameworks and is one of the core components of goal-oriented dialogue systems (Zhao et al., 2017).

When large labelled datasets are available, the task of NER can be solved with very high quality (Akhundov et al., 2018). Common benchmarks for testing new NER methods are the CoNLL-2003 (Sang and Meulder, 2003) and Ontonotes (Pradhan et al., 2013) datasets. They both include enough data to train neural architectures in a supervised learning setting. However, in real-world applications such abundant datasets are usually not available, especially for low-resource languages. Even a large labelled corpus will inevitably contain rare entities that do not occur often enough to train a neural network to identify them accurately in text.

This creates a need for methods of few-shot NER: the successful identification of entities for which we have an extremely small number of labelled examples. One solution would be semi-supervised learning methods, i.e. methods that can yield well-performing models by combining the information from a small set of labelled data and large amounts of unlabelled data, which are available for virtually any language. Word embeddings, which are trained in an unsupervised manner and are used as the input to a neural network in the majority of NLP tasks, can be considered a way of incorporating unlabelled data. However, they only provide general (and not always suitable) information about word meaning, whereas we argue that unlabelled data can be used to extract more task-specific information on the structure of the data.

A prominent approach to the task of learning from few examples is metric learning (Bellet et al., 2013). This term denotes techniques that learn a metric to measure how well an object fits some class. Metric learning methods, such as matching networks (Vinyals et al., 2016) and prototypical networks (Snell et al., 2017), showed good results in few-shot learning for image classification. These methods can also be considered semi-supervised learning methods, because they use information about the structure of common objects in order to label the uncommon ones even without seeing many examples. This approach can even be used for zero-shot learning, i.e. instances of a target class do not need to be presented at training time. Therefore, such a model does not need to be re-trained in order to handle new classes. This property is extremely appealing for real-world tasks.

Despite its success in image processing, metric learning has not been widely used in NLP tasks. There, in low-resource settings researchers more often resort to transfer learning, i.e. the use of knowledge from a different domain or language. We apply prototypical networks to the NER task and compare them to commonly used baselines. We test a metric learning technique in a task which often emerges in real-world settings: identification of instances with an extremely small number of labelled examples. We show that although prototypical networks do not succeed in the zero-shot NER task, they outperform other models in the few-shot case.

The main contributions of the work are the following:

  1. we formulate the few-shot NER task as a semi-supervised learning task,

  2. we modify the prototypical network model to enable it to solve the NER task, and we show that it outperforms a state-of-the-art model in a low-resource setting.

The paper is organized as follows. In Section 2 we review the existing approaches to few-shot NER task. In Section 3 we describe the prototypical network model and its adaptation to the NER task. Section 4 defines the task and describes the models that we tested to solve it. Section 5 contains the description of our experimental setup. We report and analyze our results in Section 6, and in Section 7 we conclude and provide the directions for future work.

2. Related work

NER is a well-established task that has been solved in a variety of ways. Nowadays, as in the majority of other NLP tasks, the state of the art is sequence labelling with Recurrent Neural Networks (Akhundov et al., 2018; Peters et al., 2018a). However, neural architectures are very sensitive to the size of training data and tend to overfit on small datasets. Hence, the latest research on named entities concentrates on handling low-resourced cases, which often occur in narrow domains or low-resourced languages.

The work by Wang et al. (Wang et al., 2018) describes feature transform between domains which allows exploiting a large out-of-domain dataset for NER task. Numerous works describe a similar transition between languages: Dandapat and Way (Dandapat and Way, 2016) draw correspondences between entities in different languages using a machine translation system, Xie et al. (Xie et al., 2018) map words of two languages into a shared vector space. Both these methods allow “translating” a big dataset to a new language. Cotterell and Duh (Cotterell and Duh, 2017) describe a setting where the performance of a NER model for a low-resourced language is improved by training it jointly with a NER model for its well-resourced cognate.

Besides labelled data of a different domain or language, other sources such as ontologies, knowledge bases or heuristics can be used in limited data settings (Fries et al., 2017). Similarly, Tsai and Salakhutdinov (Tsai and Salakhutdinov, 2017) improve the image classification accuracy using side information.

Active learning is also a popular choice to reduce the amount of training data. In (Shen et al., 2017) the authors apply active learning to the few-shot NER task and succeed in improving the performance despite the fact that neural architectures usually require a large number of training examples. A somewhat similar approach is self-learning, i.e. training on examples labelled by the model itself. While it is ineffective in many settings, the authors of (Chen and Zhang, 2018) show that it can improve results on the few-shot NER task when combined with reinforcement learning.

The work most closely related to ours is the research by Ma et al. (Ma et al., 2016), where the authors learn embeddings for a fine-grained NER task with hierarchical labels. They train a model to map hand-crafted and other features of words to embeddings and use a mutual information metric to choose a prototype from sets of words. Analogously to this work, we aim to improve the performance of NER models on rare classes. However, we do not limit the model to hierarchical classes. This makes our model more flexible and applicable to the “cold start” problem (the problem of extending data with new classes).

Beyond NLP, there also exist multiple approaches to few-shot learning. The already mentioned metric learning technique (Bellet et al., 2013) benefits from structure shared by all objects in a task, and creates a representation that shows their differences relevant to the task. Meta-learning (Ravi and Larochelle, 2017) approach operates at two levels: it learns to solve a task from a small number of examples, and at the top level it learns more general regularities about the data across tasks. In (Santoro et al., 2016) the authors demonstrate that memory-augmented neural networks, such as Neural Turing Machines, have a capacity to perform meta-learning with few labelled examples.

To the best of our knowledge, prototypical networks (Snell et al., 2017) have not been applied to any NLP tasks before. They have the very attractive capacity to introduce new labels to a model without retraining it. None of the models described above can perform such zero-shot learning. Although natural language is indeed different from images, for which prototypical networks were originally suggested, we decided to test this model on an NLP task to see if it is possible to transfer this property to the text domain.

3. Prototypical Networks

3.1. Model

The work by Snell et al. (Snell et al., 2017) introduces the prototypical network, a model developed for classification in settings where labelled examples are scarce. This network is trained so that the representations of objects returned by its penultimate layer are similar for objects that belong to the same class and dissimilar for objects of different classes. In other words, this network maps objects to a vector space which allows easy separation of objects into meaningful task-specific clusters. This feature allows assigning a class to an unseen object even if the number of labelled examples of this class is very limited.

The model is trained on two sets of examples: a support set and a query set. The support set consists of $N$ labelled examples: $S = \{(x_1, y_1), \dots, (x_N, y_N)\}$, where each $x_i \in \mathbb{R}^D$ is a $D$-dimensional representation of an object and $y_i \in \{1, \dots, K\}$ is the label of this object. The query set contains $N'$ labelled objects: $Q = \{(x_{N+1}, y_{N+1}), \dots, (x_{N+N'}, y_{N+N'})\}$. Note that this partition is not stable across training steps: the support and query sets are sampled randomly from the training data at each step.

The training is conducted in two stages:

  1. For each class $k$ we define $S_k$, the set of objects from $S$ that belong to this class. We use these sets to compute prototypes:

    $$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i),$$

    where the function $f_\theta: \mathbb{R}^D \to \mathbb{R}^M$ maps the input objects to the $M$-dimensional space which is supposed to keep distances between classes. $f_\theta$ is usually implemented as a neural network. Its architecture depends on the properties of objects.

    The prototype is the averaged representation of objects in a particular class, or the centre of the cluster corresponding to this class in the $M$-dimensional space.

  2. We classify objects from $Q$. In order to classify an unseen example $x$, we map it to the $M$-dimensional space using $f_\theta$ and then assign it to the class whose prototype is closest to the representation of $x$. We compute the distance $d(f_\theta(x), c_k)$ for every $k$. We denote the measure of similarity of $x$ to $k$ as $l_k = -d(f_\theta(x), c_k)$. Finally, we convert these similarities to a distribution over classes using the softmax function: $\mathrm{softmax}(l_1, \dots, l_K)$. The model is agnostic about the distance function. Following (Snell et al., 2017), we use the squared Euclidean distance.

The model is trained by optimising the cross-entropy loss:

$$L = -\sum_{i=N+1}^{N+N'} \log \hat{y}_{i, y_i},$$

where $\hat{y}_i = \mathrm{softmax}(l_1, \dots, l_K)$ is the distribution predicted for the query object $x_i$ and $\hat{y}_{i, y_i}$ is its component for the true class $y_i$.
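As a hedged illustration of the two stages above, the following pure-Python sketch computes class prototypes from already-embedded support points and turns negative squared Euclidean distances into class probabilities. The encoder is assumed to have been applied beforehand, and the helper names (`prototypes`, `class_probabilities`, `cross_entropy`) and the toy data are illustrative assumptions, not part of the original implementation.

```python
import math
from collections import defaultdict


def prototypes(support):
    """Average the embedded support points of each class.

    `support` is a list of (embedding, label) pairs; an embedding is a list
    of floats that is assumed to be the output of the encoder network.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in support:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_probabilities(query_vec, protos):
    """Softmax over negative squared distances to every prototype."""
    logits = {k: -squared_euclidean(query_vec, c) for k, c in protos.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(l - top) for k, l in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def cross_entropy(query, protos):
    """Sum of negative log-probabilities of the true labels."""
    return -sum(math.log(class_probabilities(v, protos)[y]) for v, y in query)


# Toy 2-dimensional embeddings for two classes.
support = [([0.0, 0.0], "A"), ([0.0, 2.0], "A"), ([4.0, 4.0], "B")]
protos = prototypes(support)  # {"A": [0.0, 1.0], "B": [4.0, 4.0]}
probs = class_probabilities([0.5, 1.0], protos)
loss = cross_entropy([([0.5, 1.0], "A")], protos)
```

In a real model the loss would be back-propagated through the encoder; here the embeddings are fixed, so the sketch only shows how prototypes and predictions are formed.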

3.2. Adaptation to NER

In order to apply prototypical networks to NER task, we made the following changes to the baseline model described above:

Sequential vs independent objects

An image dataset contains separate images that are not related to each other. In contrast, in NLP tasks we often need to classify words which are grouped in sequences. Words in a sentence influence each other, and when labelling a word we should take into account the labels of neighbouring words. Considering a word in isolation does not make sense in such a setting. Nevertheless, in the NER task we need to classify separate words, so following the description of the model from the previous section, we should assemble the support set from pairs $(x_i, y_i)$, where $x_i$ is a word and $y_i$ is its label. However, this division can break the sentence structure if some words in a sentence are assigned to the support set and others to the query set. In order to prevent such situations we form our support and query sets from whole sentences.

Class “no entity”

In the NER task we have the class O, which denotes words that are not named entities. It cannot be interpreted in the same way as other classes, because objects of class O do not need to (and should not) be close to each other in a vector space. In order to mitigate this problem we modified our prediction function. We replaced the similarity score $l_O$ for the O class (the negative distance to an O prototype) with a scalar $b_O$, and used the following form of softmax: $\mathrm{softmax}(l_1, \dots, l_{K-1}, b_O)$. $b_O$ is trained along with the parameters of the model. The initial value of $b_O$ is a hyper-parameter.
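The effect of replacing the O-class similarity with a scalar can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `softmax_with_o_bias` is hypothetical, the scalar is held constant rather than trained, and the value -4 is simply the initial value reported later in the paper.

```python
import math


def softmax_with_o_bias(entity_logits, b_o):
    """Softmax over entity similarities plus a scalar logit for class O.

    `entity_logits` maps each entity class to its similarity score (the
    negative distance to the class prototype); `b_o` stands in for the
    similarity to an O prototype, which is never computed.
    """
    logits = dict(entity_logits)
    logits["O"] = b_o
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# A word close to the PERSON prototype (squared distance 0.5)
# and a word far from it (squared distance 9).
near = softmax_with_o_bias({"PERSON": -0.5}, b_o=-4.0)
far = softmax_with_o_bias({"PERSON": -9.0}, b_o=-4.0)
```

With this form, a word is labelled O whenever it is farther from every entity prototype than the threshold implied by the scalar, so O words do not have to form a cluster of their own.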

In-domain and out-of-domain training

In the original paper describing prototypical networks (Snell et al., 2017), they were applied in the zero-shot learning setting. Weights of the model are updated during the training phase, but once training is over, instances from test classes are only used for calculating prototypes. Given that it is usually easy to obtain a few labelled examples, we modified the original zero-shot setting to a few-shot setting: we use a small number of available labelled examples of the target class during the training phase. We refer to this data as the in-domain training set, and to the data for other classes as the out-of-domain training set. Note that the domains in the traditional NLP sense are the same: texts come from the same sources and word distributions are similar. Here “domain” refers to the discrepancy between the sets of named entity classes that the datasets use.

4. Few-shot NER

4.1. Task formulation

NER is a sequence labelling task, where each word in a sentence is assigned either one of entity classes (“Person”, “Location”, “Organisation”, etc.) or O class if it is not one of the desired entities.

While common classes are usually identified correctly by the existing methods, we specifically target rare classes for which we have only a very limited number of labelled examples. To increase the quality of their identification, we use the information from other classes. Therefore, we train a separate model for every class in order to see the performance on each of them in isolation. Such a formulation can also be considered a way to tackle the “cold start” problem: adapting a NER model to label entities of a new class given a very small number of labelled entities.

As described above, we have two training sets: out-of-domain and in-domain. Since we simulate the “cold start” problem in our experiments, these datasets have the following characteristics. The out-of-domain data is quite large and labelled with a number of named entity classes except the target class $C$; this is the initially available data. The in-domain dataset is very small and contains labels only for the class $C$; this is the new data which we acquire afterwards and which we would like to infuse into the model.

In order to train a realistic model we need to keep the frequency of $C$ in our in-domain training data similar to the frequency of this class in the general distribution. Therefore, if instances of this class occur on average in one of three sentences, then our in-domain training data has to contain sentences with no instances of class $C$ (“empty” sentences), and their number should be twice as large as the number of sentences with $C$. In practice this can be achieved by sampling sentences from unlabelled data until we obtain the needed number of instances of class $C$.

4.2. Basic models

We use two main architectures — the commonly used RNN baseline and a prototypical network adapted for the NER task. Other models we test use these two models as building blocks.

RNN + CRF model

As our baseline we use a NER model implemented in AllenNLP open-source library (Gardner et al., 2018). The model processes sentences in the following way:

  1. words are mapped to pre-trained embeddings (any embeddings, such as GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018b), etc. can be used)

  2. additional word embedding are produced using a character-level trainable Recurrent Neural Network (RNN) with LSTM cells,

  3. embeddings produced at stages (1) and (2) are concatenated and used as the input to a bi-directional RNN with LSTM cells. This network processes the whole sentence and creates context-dependent representations of every word

  4. a feed-forward layer converts hidden states of the RNN from stage (3) to logits that correspond to every label,

  5. the logits are used as input to a Conditional Random Field (CRF) (Lafferty et al., 2001) model that outputs the probability distribution of tags for every word in a sentence.

The model is trained by minimizing negative log-likelihood of true tag sequences. It has to be noted that this baseline is quite reasonable even in our limited resource setting.

Prototypical Network

The architecture of the prototypical network that we use for the NER task is very similar to that of our baseline model. The main change concerns the feed-forward layer. While in the baseline model it transforms RNN hidden states to logits corresponding to labels, in our prototypical network it maps these hidden states to the $M$-dimensional embedding space. The output of the feed-forward layer is then used to construct prototypes from the support set. These prototypes are used to classify examples from the query set as described in Section 3.1. We try variants of this model both with and without the CRF layer. The architecture of the prototypical network model is provided in Figure 1.

Figure 1. Model architecture

4.3. Experiments

We perform experiments with a number of different models. We test the different variants of prototypical network model and compare them with RNN baseline. In addition to that, we try transfer learning scenario and combine it with these models. Here we provide the description of all models we test.

RNN Baseline (Base)

This is the baseline RNN model described above. We train it using only in-domain training set.

Baseline prototypical network (BaseProto)

This is the baseline prototypical network model. We train it on in-domain training data. We divide it into two parts. If the in-domain set contains $N$ sentences with instances of the target class $C$ and $V$ “empty” sentences, we use $N/2$ sentences with instances of $C$ as the support set, and the other $N/2$ such sentences along with $V/2$ “empty” sentences serve as the query set. We use only half of the “empty” sentences to keep the original frequency of class $C$ in the query set. Note that the partition is new for every training iteration.
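The per-iteration partition can be sketched as below. This is an illustrative sketch only: it assumes an even split of the target-class sentences between support and query sets, represents sentences as plain strings, and uses a hypothetical helper name `split_support_query`.

```python
import random


def split_support_query(target_sentences, empty_sentences, rng):
    """Draw a fresh support/query partition for one training iteration.

    Half of the sentences with the target class go to the support set; the
    other half, together with half of the "empty" sentences, form the query.
    """
    targets = list(target_sentences)
    empties = list(empty_sentences)
    rng.shuffle(targets)
    rng.shuffle(empties)
    half = len(targets) // 2
    support = targets[:half]
    query = targets[half:] + empties[: len(empties) // 2]
    rng.shuffle(query)
    return support, query


rng = random.Random(0)
with_class = [f"target-{i}" for i in range(20)]
without_class = [f"empty-{i}" for i in range(40)]
support, query = split_support_query(with_class, without_class, rng)
```

Calling the function again with the same random generator yields a different partition, which mirrors the note that the split is redrawn at every iteration.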

Regularised prototypical network (Protonet)

The architecture and training procedure of this model are the same as those of BaseProto model. The only difference is the data we use for training. At each training step we select the training data using one of two scenarios:

  1. we use in-domain training data, i.e. data labelled with the target class (this setup is the same as the one we use in BaseProto),

  2. we change the target class: we (i) randomly select a new target class $C'$, (ii) sample sentences from the out-of-domain dataset until we find $N$ instances of $C'$, and (iii) re-label the sampled sentences so that they contain only labels of class $C'$.

At each step we choose scenario (1) with probability $p$, or scenario (2) with probability $1 - p$.

Therefore, throughout training the network is trained to predict our target class (scenario (1)), but occasionally it sees instances of some other classes and constructs prototypes for them (scenario (2)). We suggest that this model can be more efficient than BaseProto, because at training time it is exposed to objects of different classes, and the procedure that maps objects to prototype space becomes more robust. This is also a way to leverage out-of-domain training data.
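The choice between the two scenarios can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `choose_episode_class`, the list of classes and the random seed are assumptions, and `p` denotes the probability of taking scenario (1).

```python
import random


def choose_episode_class(target, other_classes, p, rng):
    """Return the class to build prototypes for at this training step.

    With probability p the real target class is used (scenario 1);
    otherwise a random out-of-domain class takes its place (scenario 2).
    """
    if rng.random() < p:
        return target, "in-domain"
    return rng.choice(other_classes), "out-of-domain"


rng = random.Random(13)
steps = [
    choose_episode_class("EVENT", ["PERSON", "GPE", "ORG"], 0.5, rng)
    for _ in range(1000)
]
in_domain_share = sum(source == "in-domain" for _, source in steps) / len(steps)
```

Setting `p` to 0 in this sketch means the target class is never selected during training, which corresponds to a zero-shot regime.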

Transfer learning baseline (WarmBase)

We test a common transfer learning model — use of knowledge about out-of-domain data to label in-domain samples. The training of this model is two-part:

  1. We train our baseline RNN+CRF model using out-of-domain training set.

  2. We keep all weights of the model except those of the CRF and the label prediction layer, and train the model again using the in-domain training set.
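The weight transfer in the second step can be illustrated with plain dictionaries. All parameter names below (`char_lstm.weight`, `encoder.weight`, `tag_projection.weight`, `crf.transitions`) are hypothetical placeholders chosen for this sketch and are not taken from AllenNLP or from the authors' code.

```python
def transferable_weights(state, skip_prefixes=("crf.", "tag_projection.")):
    """Keep every pre-trained parameter except those under `skip_prefixes`."""
    return {
        name: value
        for name, value in state.items()
        if not name.startswith(skip_prefixes)
    }


# Hypothetical parameters of a model pre-trained on out-of-domain data.
pretrained = {
    "char_lstm.weight": [0.1, 0.2],
    "encoder.weight": [0.3, 0.4],
    "tag_projection.weight": [0.5],
    "crf.transitions": [0.6],
}

# Freshly initialised model for the in-domain task (zeros for simplicity).
model = {name: [0.0] * len(value) for name, value in pretrained.items()}
model.update(transferable_weights(pretrained))
```

The output layers start from scratch because the label set changes between the out-of-domain and in-domain data, while the shared encoder layers carry over what was learnt from the other classes.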

Transfer learning + prototypical network (WarmProto)

In addition to that, we combine prototypical network with pre-training on out-of-domain data. We first train a Base model on the out-of-domain training set. Then, we train a Protonet model as described above, but initialise its weights with weights of this pre-trained Base model.

WarmProto with CRF (WarmProto-CRF)

This is the same prototypical network pre-trained on out-of-domain data, but it is extended with a CRF layer on top of the logits, as described in Section 4.2.

WarmProto for zero-shot training (WarmProtoZero)

We train the same WarmProto model, but with the probability $p$ of choosing the in-domain scenario set to 0. In other words, our model does not see instances of the target class at training time. It learns to produce representations from objects of other classes. Then, at test time, it is given entities of the target class as the support set, and words in test sentences are assigned to either this class or the O class based on their similarity to this prototype. This is the only zero-shot learning scenario that we test.

5. Experimental setup

5.1. Dataset

We conduct all our experiments on the Ontonotes dataset (Pradhan et al., 2013). It contains 18 classes (+ O class). The classes are not evenly distributed: the training set contains over instances of some common classes and fewer than 100 for rare classes. The distribution of classes is shown in Figure 2. The size of the training set is sentences, the size of the validation set is sentences.

Figure 2. Ontonotes dataset — statistics of classes frequency in the training data.

Like the majority of NER datasets, Ontonotes adopts BIO (Beginning, Inside, and Outside) labelling. It extends the class labels: every class label except O is prefixed with “B-” if the corresponding word is the first (or the only) word in an entity, or with “I-” otherwise.
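A short sketch of BIO labelling is given below. The helper name `spans_to_bio`, the span format (start index, exclusive end index, label) and the example sentence are assumptions made for illustration.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    `spans` is a list of (start, end, label) triples with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first (or only) word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining words of the entity
    return tags


tokens = ["Barack", "Obama", "visited", "Paris", "."]
tags = spans_to_bio(tokens, [(0, 2, "PERSON"), (3, 4, "GPE")])
```

Here the two-word entity receives one B tag and one I tag, while the single-word entity receives only a B tag.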

5.2. Data preparation: simulated few-shot experiments

We use the Ontonotes training data as out-of-domain training set (where applicable) and sample in-domain examples from the validation set. In our formulation, the in-domain data is the data where only instances of a target class (class we want to predict) are labelled. Conversely, the out-of-domain data contains instances of some set of classes, but not of the target class. Therefore, we prepare our data by replacing all labels B-C and I-C with O in the training data, and in the validation data we replace all labels except B-C and I-C with O. Note that since we run the experiments for each of 18 Ontonotes classes, we perform this re-labelling for every experiment.
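The two re-labelling operations can be sketched as follows. The function names `hide_target` and `keep_only_target` are hypothetical, and the tag sequence is a toy example rather than Ontonotes data.

```python
def hide_target(tags, target):
    """Out-of-domain view: labels of the target class become O."""
    hidden = (f"B-{target}", f"I-{target}")
    return ["O" if t in hidden else t for t in tags]


def keep_only_target(tags, target):
    """In-domain view: every label except the target class becomes O."""
    kept = (f"B-{target}", f"I-{target}")
    return [t if t in kept else "O" for t in tags]


tags = ["B-PERSON", "I-PERSON", "O", "B-GPE"]
out_of_domain = hide_target(tags, "GPE")
in_domain = keep_only_target(tags, "GPE")
```

Applied with the same target class, the two views are complementary: no entity label survives in both of them.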

The validation data is still too large for our low-resource scenario, so we use only a part of it for training. We sample our in-domain training data as follows. We randomly select sentences from the re-labelled validation set until we obtain $N$ sentences with at least one instance of the class $C$. Note that sentences of the validation set are not guaranteed to have instances of $C$, so our training data can have some “empty” sentences, i.e. sentences where all words are labelled with O. This sampling procedure allows keeping the proportion of instances of class $C$ close to that of the general distribution.

In our preliminary experiments we noticed that such a sampling procedure leads to large variation in the final scores, because the size of the in-domain training data can vary significantly. In order to reduce this variation we alter the sampling procedure. We define a function $pr(C)$ which computes the proportion of sentences in the validation set that contain labels of class $C$. Then we sample $N$ sentences containing instances of class $C$ and $N \cdot (1 - pr(C)) / pr(C)$ sentences without class $C$. Thus, we keep the proportion of instances of class $C$ in our in-domain training dataset equal to that of the validation set. We use the same procedure when sampling training examples from out-of-domain data for the Protonet model.
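A hedged sketch of the altered sampling procedure follows. It assumes that the proportion is measured over sentences, so that sampling a fixed number of sentences with the class fixes the number of “empty” sentences needed to preserve that proportion. The helper names and the toy pool are illustrative assumptions.

```python
import random


def contains(tags, target):
    """True if the tag sequence has at least one B-/I- tag of `target`."""
    return any(t[2:] == target for t in tags)


def sample_in_domain(sentences, target, n, rng):
    """Sample n sentences with the target class plus enough "empty"
    sentences to preserve the class frequency of the full pool."""
    with_class = [s for s in sentences if contains(s, target)]
    without_class = [s for s in sentences if not contains(s, target)]
    proportion = len(with_class) / len(sentences)
    n_empty = round(n * (1 - proportion) / proportion)
    return rng.sample(with_class, n) + rng.sample(without_class, n_empty)


# Toy pool in which one sentence in three mentions the target class.
pool = [["B-LAW", "I-LAW", "O"]] * 30 + [["O", "O"]] * 60
sample = sample_in_domain(pool, "LAW", 20, random.Random(0))
```

Because both counts are fixed in advance, the size of the sampled training set no longer varies between runs, which is the source of variance the paragraph above describes.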

5.3. Design of experiments

We conduct separate experiments for each of 18 Ontonotes classes. For each class we conducted 4 experiments with different random seeds. We report averaged results for each class.

We design separate experiments for selection of hyper-parameters and the optimal number of training epochs. For that we selected three well-represented classes — “GPE” (geopolitical entity), “Date”, and “Org” (organization) — to conduct validation experiments on them. We selected training sets as described above, and used the test set (consisting of sentences) to tune hyper-parameters and to stop the training. For other classes we did not perform hyper-parameter tuning. Instead, we used the values acquired in the validation experiments with the three validation classes. In these experiments we used the test set only for computing the performance of trained models.

The motivation of such setup is the following. In many few-shot scenarios researchers report experiments where they train on a small training set, and tune the model on a very large validation set. We argue that this scenario is unrealistic, because if we had a large number of labelled examples in a real-world problem, it would be more efficient to use them for training, and not for validation. On the other hand, a more realistic scenario is to have a very limited number of labelled sentences overall. In that case we could still reserve a part of them for validation. However, we argue that this is also inefficient. If we have 20 examples and decide to train on 10 of them and validate on another 10, this validation will be inaccurate, because 10 examples are not enough to evaluate the performance of a model reliably. Therefore, our evaluation will be very noisy and is likely to result in sub-optimal values of hyper-parameters. On the other hand, additional 10 examples can boost the quality of a model, as it can be seen in Figure 3. Therefore, we assume that optimal hyperparameters are the same for all labels, and use the values we found in validation experiments.

| Class name | Base | BaseProto | WarmProtoZero | Protonet | WarmProto | WarmBase | WarmProto-CRF |
|---|---|---|---|---|---|---|---|
| Validation classes | | | | | | | |
| GPE | 69.75 ± 9.04 | 69.8 ± 4.16 | 60.1 ± 5.56 | 78.4 ± 1.19 | 83.62 ± 3.89 | 75.8 ± 6.2 | 80.05 ± 5.4 |
| DATE | 54.42 ± 3.64 | 50.75 ± 5.38 | 11.23 ± 4.57 | 56.55 ± 4.2 | 61.68 ± 3.38 | 56.32 ± 2.32 | 65.42 ± 2.82 |
| ORG | 42.7 ± 5.54 | 39.1 ± 7.5 | 17.18 ± 3.77 | 56.35 ± 2.86 | 63.75 ± 2.43 | 63.45 ± 1.79 | 69.2 ± 1.2 |
| Test classes | | | | | | | |
| EVENT | 32.33 ± 4.38 | 24.15 ± 4.38 | 4.85 ± 1.88 | 33.95 ± 5.68 | 33.85 ± 5.91 | 35.15 ± 4.04 | 45.2 ± 4.4 |
| LOC | 31.75 ± 9.68 | 24.0 ± 5.56 | 16.62 ± 7.18 | 42.88 ± 2.03 | 49.1 ± 2.4 | 40.67 ± 4.85 | 52.0 ± 4.34 |
| FAC | 36.7 ± 8.15 | 29.83 ± 5.58 | 6.93 ± 0.62 | 41.05 ± 2.74 | 49.88 ± 3.39 | 45.4 ± 3.01 | 56.85 ± 1.52 |
| CARDINAL | 54.82 ± 1.87 | 53.7 ± 4.81 | 8.12 ± 7.92 | 64.05 ± 1.61 | 66.12 ± 0.43 | 62.98 ± 3.5 | 70.43 ± 3.43 |
| QUANTITY | 64.3 ± 5.06 | 61.72 ± 4.9 | 12.88 ± 4.13 | 65.05 ± 8.64 | 67.07 ± 5.11 | 69.65 ± 5.8 | 76.35 ± 3.09 |
| NORP | 73.5 ± 2.3 | 72.1 ± 6.0 | 39.92 ± 10.5 | 83.02 ± 1.42 | 84.52 ± 2.79 | 79.53 ± 1.32 | 82.4 ± 1.15 |
| ORDINAL | 68.97 ± 6.16 | 71.65 ± 3.31 | 1.93 ± 3.25 | 76.08 ± 3.55 | 73.05 ± 7.14 | 69.77 ± 4.97 | 75.52 ± 5.11 |
| WORK_OF_ART | 30.48 ± 1.42 | 27.5 ± 2.93 | 3.4 ± 2.37 | 28.0 ± 3.33 | 23.48 ± 5.02 | 30.2 ± 1.27 | 32.25 ± 3.11 |
| PERSON | 70.05 ± 6.7 | 74.1 ± 5.32 | 38.88 ± 7.64 | 80.53 ± 2.15 | 80.42 ± 2.13 | 78.03 ± 3.98 | 82.32 ± 2.51 |
| LANGUAGE | 72.4 ± 5.53 | 70.78 ± 2.62 | 4.25 ± 0.42 | 68.75 ± 6.36 | 48.77 ± 17.42 | 65.92 ± 3.52 | 75.62 ± 7.22 |
| LAW | 58.08 ± 4.9 | 53.12 ± 4.54 | 2.4 ± 1.15 | 48.38 ± 8.0 | 50.15 ± 7.56 | 60.13 ± 6.08 | 57.72 ± 7.06 |
| MONEY | 70.12 ± 5.19 | 66.05 ± 1.66 | 12.48 ± 11.92 | 68.4 ± 6.3 | 73.68 ± 4.72 | 68.4 ± 5.08 | 79.35 ± 3.6 |
| PERCENT | 76.88 ± 2.93 | 75.55 ± 4.17 | 1.82 ± 1.81 | 80.18 ± 4.81 | 85.3 ± 3.68 | 79.2 ± 3.76 | 88.32 ± 2.76 |
| PRODUCT | 43.6 ± 7.21 | 44.35 ± 3.48 | 3.75 ± 0.58 | 39.92 ± 7.22 | 35.1 ± 9.35 | 43.4 ± 8.43 | 49.32 ± 2.92 |
| TIME | 35.93 ± 6.35 | 35.8 ± 2.61 | 8.02 ± 3.05 | 50.15 ± 5.12 | 56.6 ± 2.28 | 45.62 ± 5.64 | 59.8 ± 0.76 |

Table 1. Results of experiments in terms of chunk-based $F_1$-score. Numbers are means across 4 runs, followed by standard deviations.

Figure 3. Performance of models trained on 10 and 20 sentences.

5.4. Model parameters

In all our experiments we set $N$ (the number of instances of the target class in the in-domain training data) to 20. This number of examples is small enough to be labelled easily by hand. At the same time, it produces models of reasonable quality. Figure 3 compares the performance of models trained on 10 and 20 examples. We see a significant boost in performance for the latter case. Moreover, in the rightmost plot the learning curve for the smaller dataset goes down after the 40th epoch, which does not happen when the larger dataset is used. This shows that $N = 20$ is a reasonable trade-off between model performance and cost of labelling.

In the Protonet model we set the probability $p$ of choosing the in-domain scenario to 0.5. Therefore, the model is trained on instances of the target class in half of the steps, and in the other half it is shown instances of some other randomly chosen class.

We optimize all models with the Adam optimizer as implemented in PyTorch. The Base and WarmBase methods use batches of 10 sentences during in-domain training. We train the out-of-domain RNN baseline (the warm-up for the WarmBase and WarmProto* models) using batches of size 32. All models based on the prototypical network use batches of size 100 (40 sentences in the support set and 60 in the query set). We also use L2-regularization with a multiplier of 0.1. All models are evaluated in terms of the chunk-based $F_1$-score for the target class (Sang and Meulder, 2003).
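For readers unfamiliar with chunk-based evaluation, the sketch below computes the score for a single tag sequence. It is a simplified illustration and not the CoNLL evaluation script: stray I- tags that do not continue a chunk are ignored, and counts are not pooled across sentences. The function names are assumptions.

```python
def chunks(tags):
    """Extract (start, end, label) entity chunks from a BIO tag sequence."""
    found, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            found.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return found


def chunk_f1(gold, predicted):
    """Harmonic mean of chunk-level precision and recall."""
    gold_chunks, pred_chunks = chunks(gold), chunks(predicted)
    correct = len(gold_chunks & pred_chunks)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_chunks)
    recall = correct / len(gold_chunks)
    return 2 * precision * recall / (precision + recall)


gold = ["B-ORG", "I-ORG", "O", "B-ORG", "O"]
pred = ["B-ORG", "O", "O", "B-ORG", "O"]
score = chunk_f1(gold, pred)
```

In this example the first predicted entity is truncated, so it does not count as correct even though its first word is tagged properly; only exact chunk matches contribute to the score.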

The open-source implementation of the models is available online: https://github.com/Fritz449/ProtoNER

6. Results

6.1. Performance of models

We selected hyperparameters in the validation experiment and then used them when training models for other classes. We use the following values. The initial value of $b_O$ (the logit for the O class) is set to -4. We use dropout with rate 0.5 in LSTM cells for all our experiments. The dimensionality of the embedding space for all models based on the prototypical network is set to 64. For all models we use learning rate of .

Table 1 shows the results of our experiments for all classes and methods. It is clear that 20 sentences are not enough to train a baseline RNN+CRF model. Moreover, we see that the baseline prototypical network (BaseProto) performs close to the RNN baseline. This shows that 20 instances of the target class are also not enough to construct a reliable prototype.

On the other hand, if a prototypical network is occasionally exposed to instances of other classes, as it is done in Protonet model, then the prototypes it constructs are better at identifying the target class. Protonet shows better results than Base and BaseProto on many classes.

The transfer learning baseline (WarmBase) achieves results which are comparable with those of Protonet. This allows us to conclude that the information on the structure of objects of other classes is helpful even for the conventional RNN baseline, and that pre-training on out-of-domain data is useful.

The prototypical network pre-trained on out-of-domain data (WarmProto) beats WarmBase and Protonet in more than half of the experiments. Analogously to the transfer learning baseline, it benefits from the use of out-of-domain data. Unfortunately, such a model is not suitable for zero-shot learning: the WarmProtoZero model performs worse than all other models, including the RNN baseline.

Finally, if we enable the CRF layer of the WarmProto model, the performance grows sharply. As we can see, WarmProto-CRF beats all other models in almost all experiments. Thus, the prototypical network is more effective than the RNN baseline in the setting where in-domain data is extremely limited.

6.2. Influence of BIO labelling

When such a small number of entities is available, the BIO labelling used in NER datasets can harm the performance of models. First, the majority of entities may consist of a single word, so the number of I tags can be too small if there are only 20 entities overall. This can dramatically decrease the quality of predicting these tags. Second, words labelled with B and I tags can be similar, and a model can have difficulty distinguishing between them using prototypes. This effect can be amplified by the fact that a very small number of instances is used for training, so the prototypes themselves have high variance.

| Class name | WarmBase + BIO | WarmBase + TO | WarmProto + BIO | WarmProto + TO |
|---|---|---|---|---|
| Validation Classes | | | | |
| GPE | 75.8 ± 6.2 | 74.8 ± 4.16 | 83.62 ± 3.89 | 82.02 ± 0.42 |
| DATE | 56.32 ± 2.32 | 58.02 ± 2.83 | 61.68 ± 3.38 | 64.68 ± 3.65 |
| ORG | 63.45 ± 1.79 | 62.17 ± 2.9 | 63.75 ± 2.43 | 65.22 ± 2.83 |
| Test Classes | | | | |
| EVENT | 35.15 ± 4.04 | 35.4 ± 6.04 | 33.85 ± 5.91 | 34.75 ± 2.56 |
| LOC | 40.67 ± 4.85 | 40.08 ± 2.77 | 49.1 ± 2.4 | 49.05 ± 1.04 |
| FAC | 45.4 ± 3.01 | 44.88 ± 5.82 | 49.88 ± 3.39 | 43.52 ± 3.09 |
| CARDINAL | 62.98 ± 3.5 | 63.27 ± 3.66 | 66.12 ± 0.43 | 69.2 ± 1.51 |
| QUANTITY | 69.65 ± 5.8 | 69.3 ± 3.41 | 67.07 ± 5.11 | 67.97 ± 2.98 |
| NORP | 79.53 ± 1.32 | 80.75 ± 2.38 | 84.52 ± 2.79 | 84.5 ± 1.61 |
| ORDINAL | 69.77 ± 4.97 | 70.9 ± 6.34 | 73.05 ± 7.14 | 74.7 ± 4.94 |
| WORK_OF_ART | 30.2 ± 1.27 | 25.78 ± 4.07 | 23.48 ± 5.02 | 25.6 ± 2.86 |
| PERSON | 78.03 ± 3.98 | 76.0 ± 3.12 | 80.42 ± 2.13 | 78.8 ± 0.26 |

Table 2. F1-scores for models WarmBase and WarmProto trained on data with and without BIO labelling. Numbers in bold mean the best score for a particular class, underlined numbers are the second best results. Values are means ± standard deviations over 4 runs.

To check whether these problems hamper the performance of our models, we performed another set of experiments. We removed BIO tagging: for the target class we replaced both B-C and I-C with C. This TO (tag/other) tagging reduces sparsity in the training data. We did so for both in-domain and out-of-domain training sets. The test set remained the same, because the chunk-based F1-score we use for evaluation is not affected by differences between BIO and TO labelling: it always considers a named entity as a whole.
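The relabelling described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names `bio_to_to` and `to_chunks` are hypothetical, and we assume that tags of classes other than the target are left untouched.

```python
from typing import List, Tuple


def bio_to_to(tags: List[str], target: str) -> List[str]:
    """Replace B-<target> and I-<target> with the bare class name <target>.

    Assumption: tags of all other classes are kept unchanged.
    """
    bio_forms = {f"B-{target}", f"I-{target}"}
    return [target if tag in bio_forms else tag for tag in tags]


def to_chunks(tags: List[str], target: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of maximal runs of <target>."""
    chunks = []
    start = None
    for i, tag in enumerate(tags):
        if tag == target:
            if start is None:
                start = i  # a new run of target tags begins here
        elif start is not None:
            chunks.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:
        chunks.append((start, len(tags)))  # run reaches the end of the sentence
    return chunks


bio = ["O", "B-GPE", "I-GPE", "O", "B-GPE", "O"]
to = bio_to_to(bio, "GPE")
spans = to_chunks(to, "GPE")
```

Here `to` keeps one tag per token and `spans` recovers whole entities as contiguous runs. Note that in this sketch two entities of the target class that directly follow each other would be merged into a single run, since TO tags carry no boundary marker.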

Table 2 shows the results of the WarmBase and WarmProto models trained on BIO-labelled and TO-labelled data. In the majority of cases the differences between the F1-scores of these models are not significant. Therefore, the choice between BIO and TO labelling has no significant effect on our models.
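The procedure used to judge significance is not specified in this section. As an illustration only, one way to compare two entries of Table 2 from their summary statistics is Welch's t statistic. The sketch below assumes that the reported deviations are sample standard deviations over 4 runs; the function name `welch_t` is hypothetical.

```python
import math


def welch_t(mean_a, std_a, mean_b, std_b, n_a=4, n_b=4):
    """Welch's t statistic and approximate degrees of freedom.

    Inputs are summary statistics (mean, sample standard deviation, run count).
    """
    var_a = std_a ** 2 / n_a  # squared standard error of mean A
    var_b = std_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# GPE row of Table 2: WarmProto + BIO versus WarmProto + TO
t, df = welch_t(83.62, 3.89, 82.02, 0.42)
```

For this row the statistic is well below the two-sided 5% critical value (about 3.18 for 3 degrees of freedom), which is consistent with the statement that the difference is not significant. With only 4 runs per setting such a test has little power, so it should be read as a rough check.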

7. Conclusions

In this work we suggested solving the task of NER with a metric learning technique that is actively used in other machine learning tasks but rarely applied to NLP. We adapted a metric learning method, namely the prototypical network, originally used for image classification, to the analysis of text. It projects all objects into a vector space which keeps distances between classes, so objects of one class are mapped to similar vectors. These mappings form a prototype of a class, and at test time we assign new objects to classes by the similarity of an object's representation to the class prototypes.
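The prototype-based assignment summarised above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the 2-dimensional vectors are made up (real representations come from the trained encoder), similarity is taken to be the negative squared Euclidean distance as in Snell et al. (2017), and the O class is treated as an ordinary class with its own prototype rather than through a separate logit. All function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """A class prototype is the mean of the embedded support examples."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(query: Vector, prototypes: Dict[str, Vector]) -> Dict[str, float]:
    """Softmax over negative squared distances from the query to each prototype."""
    logits = {
        label: -squared_distance(query, proto)
        for label, proto in prototypes.items()
    }
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy support set: two embedded examples per class (illustrative values only)
support = {
    "GPE": [[1.0, 0.0], [0.8, 0.2]],
    "O": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support)
probs = classify([0.9, 0.1], prototypes)
predicted = max(probs, key=probs.get)
```

The query lies next to the GPE prototype, so `predicted` is the GPE class. With very few support examples the mean is a noisy estimate, which matches the observation that prototypes built from a small number of instances have high variance.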

In addition, we considered the task of NER in a semi-supervised setting: we identified our target classes in text using information about words of other classes. We showed that the prototypical network is more effective in this setting than the state-of-the-art RNN model. Unlike the RNN, the prototypical network is suitable for cases where an extremely small amount of data is available.

According to its original formulation, the prototypical network can be used as a zero-shot learning method, i.e. a method which can assign an object to a particular class without seeing instances of this class at training time. We experimented with the zero-shot setting for NER and showed that prototypical networks can in principle be used for zero-shot text classification, although there is still much room for improvement. We consider this a promising direction for future research.

We saw that prototypical networks show considerably different performance on different classes of named entities. It would be interesting to perform a more thorough qualitative analysis to identify the characteristics of textual data that make it more suitable for this method.

Finally, in our current experiments we trained models to predict entities of only a single class. In future work we would like to check whether the good performance of the prototypical network scales to multiple classes. We will focus on training a prototypical network that can predict all classes of Ontonotes or another NER dataset at once.


  • Akhundov et al. (2018) Adnan Akhundov, Dietrich Trautmann, and Georg Groh. 2018. Sequence Labeling: A Practical Approach. CoRR abs/1808.03926 (2018). arXiv:1808.03926 http://arxiv.org/abs/1808.03926
  • Bellet et al. (2013) Aurélien Bellet, Amaury Habrard, and Marc Sebban. 2013. A Survey on Metric Learning for Feature Vectors and Structured Data. arXiv:1306.6709
  • Chen and Zhang (2018) Chenhua Chen and Yue Zhang. 2018. Learning How to Self-Learn: Enhancing Self-Training Using Neural Reinforcement Learning. CoRR abs/1804.05734 (2018). arXiv:1804.05734 http://arxiv.org/abs/1804.05734
  • Cotterell and Duh (2017) Ryan Cotterell and Kevin Duh. 2017. Low-Resource Named Entity Recognition with Cross-lingual, Character-Level Neural Conditional Random Fields. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Asian Federation of Natural Language Processing, 91–96. http://aclweb.org/anthology/I17-2016
  • Dandapat and Way (2016) Sandipan Dandapat and Andy Way. 2016. Improved Named Entity Recognition using Machine Translation-based Cross-lingual Information. Computacion y Sistemas 20, 3 (Jul/Sep 2016).
  • Fries et al. (2017) Jason A. Fries, Sen Wu, Alexander Ratner, and Christopher Ré. 2017. SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data. CoRR abs/1704.06360 (2017). arXiv:1704.06360 http://arxiv.org/abs/1704.06360
  • Gardner et al. (2018) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A Deep Semantic Natural Language Processing Platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS). Association for Computational Linguistics, 1–6. http://aclweb.org/anthology/W18-2501
  • Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML-2001. 282–289.
  • Ma et al. (2016) Yukun Ma, Erik Cambria, and Sa Gao. 2016. Label Embedding for Zero-shot Fine-grained Named Entity Typing. In International Conference on Computational Linguistics (COLING).
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
  • Peters et al. (2018a) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2227–2237. https://doi.org/10.18653/v1/N18-1202
  • Peters et al. (2018b) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018b. Deep contextualized word representations. In Proceedings of North American Chapter of the Association for Computational Linguistics.
  • Pradhan et al. (2013) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Bjorkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using ontonotes. CoNLL’12 Joint Conference on EMNLP and CoNLL - Shared Task (2013). http://www.aclweb.org/anthology/W13-3516
  • Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).
  • Sang and Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. CoNLL’03 (2003). aclweb.org/anthology/W03-0419
  • Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. 2016. One-shot Learning with Memory-Augmented Neural Networks. CoRR abs/1605.06065 (2016). http://arxiv.org/abs/1605.06065
  • Shen et al. (2017) Yanyao Shen, Hyokun Yun, Zachary Lipton, Yakov Kronrod, and Animashree Anandkumar. 2017. Deep Active Learning for Named Entity Recognition. In Proceedings of the 2nd Workshop on Representation Learning for NLP. Association for Computational Linguistics, 252–256. https://doi.org/10.18653/v1/W17-2630
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4077–4087. http://papers.nips.cc/paper/6996-prototypical-networks-for-few-shot-learning.pdf
  • Tsai and Salakhutdinov (2017) Yao-Hung Hubert Tsai and Ruslan Salakhutdinov. 2017. Improving One-Shot Learning through Fusing Side Information. CoRR abs/1710.08347 (2017). arXiv:1710.08347 http://arxiv.org/abs/1710.08347
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 3630–3638. http://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning.pdf
  • Wang et al. (2018) Zhenghui Wang, Yanru Qu, Liheng Chen, Jian Shen, Weinan Zhang, Shaodian Zhang, Yimei Gao, Gen Gu, Ken Chen, and Yong Yu. 2018. Label-Aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 1–15. https://doi.org/10.18653/v1/N18-1001
  • Xie et al. (2018) Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime Carbonell. 2018. Neural Cross-Lingual Named Entity Recognition with Minimal Resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 369–379. http://aclweb.org/anthology/D18-1034
  • Zhao et al. (2017) Tiancheng Zhao, Allen Lu, Kyusong Lee, and Maxine Eskenazi. 2017. Generative Encoder-Decoder Models for Task-Oriented Spoken Dialog Systems with Chatting Capability. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, 27–36. https://doi.org/10.18653/v1/W17-5505

