Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates


Annotating training data for sequence tagging tasks is usually very time-consuming. Recent advances in transfer learning for natural language processing in conjunction with active learning open the possibility to significantly reduce the necessary annotation budget. We are the first to thoroughly investigate this powerful combination in sequence tagging. We find that taggers based on deep pre-trained models can benefit from Bayesian query strategies with the help of the Monte Carlo (MC) dropout. Results of experiments with various uncertainty estimates and MC dropout variants show that the Bayesian active learning by disagreement query strategy coupled with the MC dropout applied only in the classification layer of a Transformer-based tagger is the best option in terms of quality. This option also has very little computational overhead. We also demonstrate that it is possible to reduce the computational overhead of AL by using a smaller distilled version of a Transformer model for acquiring instances during AL.


1 Introduction

In many natural language processing (NLP) tasks, such as named entity recognition (NER), obtaining gold standard labels for constructing the training dataset can be very time-consuming. This makes the annotation process expensive and limits the application of supervised machine learning models. This is especially the case in domains such as biomedical or scientific text processing, where crowdsourcing is either difficult or impossible. In these domains, highly qualified experts are needed to annotate data correctly, which dramatically increases the annotation cost.

Active learning (AL) is one of the techniques that can reduce the amount of annotation required to train a good model severalfold Settles and Craven (2008); Settles (2009). In contrast to exhaustive and redundant manual annotation of the entire corpus, AL drives the annotation process to focus expensive human expert time only on the most informative objects, which contribute to a substantial increase in the model quality.

AL is an iterative process that starts from a small number of labeled seeding instances. In each iteration, an acquisition model is trained on the currently annotated dataset and is applied to the large pool of unannotated objects. The model predictions are used by the AL query strategy to sample the informative objects, which are then presented to the expert annotators. When the annotators provide labels for these objects, the next iteration begins.
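The iterative process described above can be sketched as a generic pool-based loop. This is a minimal illustration, not the implementation used in the experiments: the function names (`active_learning_loop`, `train_threshold`, `margin_score`) and the toy one-dimensional threshold "model" are hypothetical and chosen only to keep the example self-contained.

```python
import random


def active_learning_loop(pool, oracle, train, score,
                         seed_size, batch_size, iterations, rng):
    """Generic pool-based AL loop (illustrative sketch).

    pool   : list of unlabeled instances
    oracle : instance -> label (the annotator; the gold standard in emulation)
    train  : list of (instance, label) pairs -> model
    score  : (model, instance) -> uncertainty, higher means more informative
    """
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    # Seeding: label a small random subset.
    labeled = [(x, oracle(x)) for x in unlabeled[:seed_size]]
    unlabeled = unlabeled[seed_size:]
    model = train(labeled)
    for _ in range(iterations):
        if not unlabeled:
            break
        # Query strategy: pick the instances the model is least certain about.
        unlabeled.sort(key=lambda x: score(model, x), reverse=True)
        query, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((x, oracle(x)) for x in query)
        # Retrain the acquisition model on the enlarged labeled set.
        model = train(labeled)
    return model, labeled


def train_threshold(labeled):
    """Toy 'model': the midpoint between the closest negative and positive."""
    negatives = [x for x, y in labeled if not y]
    positives = [x for x, y in labeled if y]
    low = max(negatives) if negatives else 0.0
    high = min(positives) if positives else 1.0
    return (low + high) / 2


def margin_score(threshold, x):
    """Toy uncertainty: points closer to the threshold are more uncertain."""
    return -abs(x - threshold)


pool = [i / 1000 for i in range(1000)]
model, labeled = active_learning_loop(
    pool, oracle=lambda x: x > 0.5, train=train_threshold, score=margin_score,
    seed_size=20, batch_size=20, iterations=5, rng=random.Random(0))
print(len(labeled), round(model, 3))
```

In a real setting, `train` would fine-tune a tagger and `score` would be one of the query strategies from Section 4.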

During AL, models have to be trained on very small amounts of labeled data, especially during the early iterations. Recently, this problem has been tackled by transfer learning with deep pre-trained models: ELMo Peters et al. (2018), BERT Devlin et al. (2019), ELECTRA Clark et al. (2020), and others. Pre-trained on a large amount of unlabeled data, they are capable of demonstrating remarkable performance using only hundreds or even dozens of labeled training instances. This trait suits the AL framework, but raises the question of whether the biased sampling provided by the AL query strategies is still useful.

In this work, we investigate AL with the aforementioned deep pre-trained models on the widely-used benchmarks for NER and AL and compare the results of this combination to the outcome of the models that do not take advantage of deep pre-training. We experiment with the query strategies based on the least confidence and the Bayesian uncertainty estimates obtained using the Monte Carlo (MC) dropout Gal and Ghahramani (2016b); Gal et al. (2017). The main contributions of this paper are the following:

  • We are the first to thoroughly investigate deep pre-trained models in the AL setting for sequence tagging.

  • We demonstrate benefits of using Bayesian uncertainty estimations approximated via the MC dropout with deep pre-trained models for querying informative instances during AL. We show that the Bayesian active learning by disagreement Houlsby et al. (2011) query strategy is superior to the variation ratio that is investigated in the previous works on AL for sequence tagging tasks Shen et al. (2017); Siddhant and Lipton (2018).

  • We experiment with various MC dropout options for AL with taggers based on deep pre-trained models and find the best options for each type of models.

  • We show that to acquire instances during AL, a full-size Transformer can be substituted with a distilled version, which yields better computational performance and reduces obstacles for applying deep AL in practice.

The rest of the paper is structured as follows. Section 2 covers relevant works on AL for sequence tagging. In Section 3, we describe the sequence tagging models. Section 4 describes the AL strategies used in the experiments. In Section 5, we discuss the experimental setup and present the evaluation results. Section 6 concludes the paper.

2 Related Work

AL for sequence tagging with classical machine learning algorithms and a feature-engineering approach has a long research history, e.g. Settles and Craven (2008); Settles (2009); Marcheggiani and Artières (2014). More recently, AL in conjunction with deep learning has received much attention.

In one of the first works on this topic, Shen et al. (2017) note that practical deep learning models for AL should be computationally efficient in both training and inference to reduce delays in the annotator’s work. They propose a CNN-CNN-LSTM architecture with convolutional character and word encoders and an LSTM tag decoder, which is a faster alternative to the widely adopted LSTM-CRF architecture Lample et al. (2016) with comparable quality. They also reveal disadvantages of the standard query strategy, least confident (LC), and propose a modification, namely Maximum Normalized Log-Probability (MNLP). Siddhant and Lipton (2018) experiment with Bayesian uncertainty estimates. They use CNN-CNN-LSTM and CNN-BiLSTM-CRF Ma and Hovy (2016) networks and two methods for calculating the uncertainty estimates: Bayes-by-Backprop Blundell et al. (2015) and the MC dropout Gal and Ghahramani (2016b). The experiments show that the variation ratio Freeman (1965) yields substantial improvements over MNLP. In contrast to them, we additionally experiment with the Bayesian active learning by disagreement (BALD) query strategy proposed by Houlsby et al. (2011) and perform a comparison with variation ratio.

There is a series of works that tackle AL with a trainable policy. For this purpose, imitation learning is used in Liu et al. (2018); Vu et al. (2019); Brantley et al. (2020), while Fang et al. (2017) use deep reinforcement learning. Although the proposed solutions can outperform other heuristic algorithms, they are not very practical due to high computational costs. Other notable works on deep active learning include Erdmann et al. (2019), which proposes an AL algorithm based on a bootstrapping idea, and Lowell et al. (2019), which concerns the problem of mismatch between a model used to construct a training dataset via AL and a final model that is trained on it.

Deep pre-trained models are evaluated in the AL setting for NER by Shelmanov et al. (2019). However, they perform evaluation only on specific biomedical datasets and do not consider Bayesian query strategies. Brantley et al. (2020) use pre-trained BERT in their experiments with NER, but they do not fine-tune it, which results in suboptimal performance. In this work, we try to fill the gap by evaluating deep pre-trained models, namely ELMo and various Transformers, in the AL setting with practical query strategies and on the widely-used benchmarks in this area.

3 Sequence Tagging Models

We use a tagger based on the Conditional Random Field model Lafferty et al. (2001), BiLSTM-CRF taggers Lample et al. (2016), and taggers based on state-of-the-art Transformer models.

3.1 Conditional Random Field

As a baseline for comparison, we use a feature-based linear-chain Conditional Random Field (CRF) model Lafferty et al. (2001). It is trained to maximize the conditional log-likelihood of entire training tag sequences. The inference is performed using the Viterbi decoding algorithm, which maximizes the joint probability of tags of all tokens in a sequence. Features used for the CRF model are presented in Appendix A.
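As an illustration of the Viterbi decoding step mentioned above, the following minimal sketch finds the highest-scoring tag sequence for a linear-chain model. It assumes that per-token emission scores and tag-to-tag transition scores are already given in log space; the function name `viterbi` and the toy scores are hypothetical and not taken from the CRF implementation used in the paper.

```python
def viterbi(emissions, transitions):
    """Return the best tag path and its score for a linear-chain model.

    emissions[i][c]   : log-space score of tag c at position i
    transitions[a][b] : log-space score of moving from tag a to tag b
    """
    n_tags = len(transitions)
    best = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for scores in emissions[1:]:
        prev_best, step, best = best, [], []
        for cur in range(n_tags):
            # Best previous tag for the current tag.
            prev = max(range(n_tags),
                       key=lambda p: prev_best[p] + transitions[p][cur])
            step.append(prev)
            best.append(prev_best[prev] + transitions[prev][cur] + scores[cur])
        backpointers.append(step)
    # Trace the best path back from the best final tag.
    last = max(range(n_tags), key=lambda c: best[c])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, best[last]


emissions = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]]
sticky = [[0.0, -5.0], [-5.0, 0.0]]   # switching tags is heavily penalized
free = [[0.0, 0.0], [0.0, 0.0]]       # no transition preferences
print(viterbi(emissions, sticky))
print(viterbi(emissions, free))
```

With the "sticky" transitions the decoder prefers a single consistent tag for the whole sequence, whereas with zero transition scores it reduces to a per-token argmax; this is the sense in which Viterbi decoding maximizes the joint score of all tags rather than each tag independently.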


3.2 BiLSTM-CRF

This model encodes embedded input tokens via a bidirectional long short-term memory neural network (LSTM) Hochreiter and Schmidhuber (1997). BiLSTM processes sequences in two passes, from left to right and from right to left, producing a contextualized token vector in each pass. These vectors are concatenated and used as features in a CRF layer that performs the final scoring of sequence tags.

We experiment with two versions of the BiLSTM-CRF model. The first one uses GloVe Pennington et al. (2014) word embeddings pre-trained on English Wikipedia and the Gigaword 5 corpus and a convolutional character encoder Ma and Hovy (2016), which helps to deal with out-of-vocabulary words. We will refer to it as CNN-BiLSTM-CRF. We consider this model as another baseline that does not feature deep pre-training. The second version of the BiLSTM-CRF model uses pre-trained medium-size ELMo Peters et al. (2018) to produce contextualized word representations. ELMo is a BiLSTM language model enhanced with a CNN character encoder. We will refer to this model as ELMo-BiLSTM-CRF.

3.3 Transformer-based Taggers

We perform AL experiments with state-of-the-art pre-trained Transformers: BERT Devlin et al. (2019), DistilBERT Sanh et al. (2019), and ELECTRA Clark et al. (2020). The sequence tagger in this case consists of a Transformer “body” and a decoding classifier with one linear layer. Unlike BiLSTM, which encodes text sequentially, these Transformers are designed to process the whole token sequence in parallel with the help of the self-attention mechanism Vaswani et al. (2017). This mechanism is bi-directional since it encodes each token on multiple neural network layers taking into account all other token representations in a sequence. These models are faster than their recurrent counterparts and show remarkable performance on many downstream tasks Li et al. (2020).

BERT is a masked language model (MLM). Its main learning objective is to restore randomly masked tokens, so it can be considered a variant of a denoising autoencoder. Although this objective makes the model learn many aspects of natural languages Tenney et al. (2019); Rogers et al. (2020), it has multiple drawbacks, including the fact that training is performed only using a small subset of masked tokens. ELECTRA has almost the same architecture as BERT but utilizes a novel pre-training objective, called replaced token detection (RTD), which is inspired by generative adversarial networks. In this task, the model has to determine which tokens in the input are corrupted by a separate generative model, in particular, a smaller version of BERT. Therefore, the model has to classify all tokens in the sequence, which increases training efficiency compared to BERT, and the RTD task is usually harder than MLM, which pushes the model to learn a better understanding of a language Clark et al. (2020).

DistilBERT is a widely-used compact version of BERT obtained via a distillation procedure Hinton et al. (2015). The main advantages of this model are smaller memory consumption and higher fine-tuning and inference speed, achieved at the cost of some quality. We note that good computational performance is a must for practical applicability of AL. Delays in the interactions between a human annotator and an AL system can be expensive. Therefore, although DistilBERT is inferior to other Transformers in terms of quality, it is a computationally cheaper alternative for acquiring training instances during AL that could be used for fine-tuning bigger counterparts. Lowell et al. (2019) showed that a mismatch between an “acquisition” model and a “successor” model (the model that is trained on the annotated data for the final application) can eliminate the benefits of AL. The similarity in the architecture and the knowledge contained in the smaller distilled Transformer can potentially help to alleviate this problem and deliver an AL solution that is both effective and practical.

4 Active Learning Query Strategies

We experiment with four query strategies for selection of training instances during AL.

Random sampling is the simplest query strategy possible: we just randomly select instances from the unlabeled pool for annotation. In this case, there is no active learning at all.

Uncertainty sampling Lewis and Gale (1994) methods select instances according to a probabilistic criterion that indicates how uncertain the model is about the label it assigned to the instance. The baseline method is Least Confident (LC): the samples are sorted in ascending order of the probability of the most likely tag sequence. Let $y_i$ be the tag of token $i$, which can take one class $c$ out of $C$ values, and let $x_j$ be the representation of token $j$ in an input sequence of length $n$. Then the LC score can be formulated as follows:

$$LC = 1 - \max_{y_1,\ldots,y_n} P[y_1,\ldots,y_n \mid \{x_j\}]$$

This score favors longer sentences, since long sentences usually have lower probability. Maximizing the probability is equivalent to maximizing the sum of log-probabilities:

$$\max_{y_1,\ldots,y_n} P[y_1,\ldots,y_n \mid \{x_j\}] \iff \max_{y_1,\ldots,y_n} \sum_{i=1}^{n} \log P[y_i \mid y_1,\ldots,y_{i-1}, \{x_j\}]$$

To make LC less biased towards longer sentences, Shen et al. (2017) propose a normalization of the log-probability sum. They call the method Maximum Normalized Log-Probability (MNLP). The MNLP score can be expressed as follows:

$$MNLP = -\max_{y_1,\ldots,y_n} \frac{1}{n}\sum_{i=1}^{n} \log P[y_i \mid y_1,\ldots,y_{i-1}, \{x_j\}]$$

In our experiments, we use this normalized version of uncertainty estimate since it has been shown to be slightly better than the classical LC Shen et al. (2017), and it is commonly applied in other works on active learning for NER.
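The length bias of LC and the effect of the MNLP normalization can be illustrated with a small sketch. It assumes that the per-token log-probabilities along the most likely tag sequence are already available, and it uses the convention that a higher score means higher uncertainty; the function names `lc_score` and `mnlp_score` are hypothetical.

```python
import math


def lc_score(token_log_probs):
    """Least Confident: one minus the probability of the best tag sequence."""
    return 1.0 - math.exp(sum(token_log_probs))


def mnlp_score(token_log_probs):
    """MNLP: negated log-probability of the best sequence, averaged over tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


# Two sentences tagged with the same per-token confidence (0.9),
# differing only in length.
short_sentence = [math.log(0.9)] * 5
long_sentence = [math.log(0.9)] * 20

print(lc_score(short_sentence), lc_score(long_sentence))
print(mnlp_score(short_sentence), mnlp_score(long_sentence))
```

LC ranks the longer sentence as far more uncertain purely because of its length, while the two MNLP scores coincide.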

Following the work of Siddhant and Lipton (2018), we implement extensions for the Transformer-based and BiLSTM-based sequence taggers applying the MC dropout technique. Gal and Ghahramani (2016b) showed that applying dropout at prediction time allows us to consider the model as a Bayesian neural network and calculate theoretically grounded approximations of uncertainty estimates by analyzing its multiple stochastic predictions. Like Shen et al. (2017) and Siddhant and Lipton (2018), we experiment with variation ratio (VR) Freeman (1965): the fraction of models whose predictions differ from the majority:

$$VR = 1 - \frac{1}{n}\sum_{i=1}^{n} \frac{\mathrm{count}\big(\mathrm{mode}(y_i^1,\ldots,y_i^M),\; y_i^1,\ldots,y_i^M\big)}{M},$$

where $M$ is the number of stochastic predictions, $y_i^m$ is the tag predicted for token $i$ in the $m$-th stochastic pass, and $n$ is the sequence length.
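A minimal sketch of the variation ratio for one sentence is given below. It assumes that each stochastic pass has already produced a hard tag sequence; the function name `variation_ratio` and the toy tags are hypothetical.

```python
from collections import Counter


def variation_ratio(predictions):
    """Variation ratio of a sentence, averaged over its tokens.

    predictions: list of tag sequences, one per stochastic pass,
                 all of the same length.
    """
    n_passes = len(predictions)
    n_tokens = len(predictions[0])
    agreement = 0.0
    for i in range(n_tokens):
        votes = Counter(sequence[i] for sequence in predictions)
        # Share of passes that agree with the most frequent (mode) tag.
        agreement += votes.most_common(1)[0][1] / n_passes
    return 1.0 - agreement / n_tokens


passes = [
    ["B-PER", "O"],
    ["B-PER", "O"],
    ["B-PER", "I-PER"],
    ["O", "O"],
]
print(variation_ratio(passes))
```

Here three of the four passes agree with the majority on each token, so a quarter of the predictions deviate from the mode.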


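Siddhant and Lipton (2018) and Shen et al. (2017) refer to this method as BALD. However, Bayesian active learning by disagreement (BALD), as proposed by Houlsby et al. (2011), leverages the mutual information between predictions and the Bayesian model posterior. Let $p_{im}^{c}$ be the probability of tag $c$ for token $i$ predicted in the $m$-th of $M$ stochastic passes of a model, for a sequence of $n$ tokens and $C$ tag classes. Then the BALD score can be approximated by the following expression:

$$BALD \approx \frac{1}{n}\sum_{i=1}^{n}\left(-\sum_{c=1}^{C}\bar{p}_i^{c}\log\bar{p}_i^{c} + \frac{1}{M}\sum_{c=1}^{C}\sum_{m=1}^{M} p_{im}^{c}\log p_{im}^{c}\right), \qquad \bar{p}_i^{c} = \frac{1}{M}\sum_{m=1}^{M} p_{im}^{c}.$$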

Although BALD can be considered similar to VR, it potentially can give better uncertainty estimates than VR since it leverages the whole probability distributions produced by the model. However, this method has not been tested in the previous works on active learning for sequence tagging.
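The difference between BALD and a plain uncertainty measure can be seen in a small sketch of the standard MC approximation: the entropy of the averaged prediction minus the average entropy of the individual stochastic predictions, averaged over tokens. The function name `bald_score` and the toy distributions are hypothetical.

```python
import math


def bald_score(probs):
    """Approximate BALD for one sentence.

    probs[m][i][c]: probability of class c for token i in stochastic pass m.
    """
    n_passes = len(probs)
    n_tokens = len(probs[0])
    n_classes = len(probs[0][0])
    total = 0.0
    for i in range(n_tokens):
        # Predictive distribution averaged over stochastic passes.
        mean = [sum(probs[m][i][c] for m in range(n_passes)) / n_passes
                for c in range(n_classes)]
        entropy_of_mean = -sum(p * math.log(p) for p in mean if p > 0)
        # Average entropy of the individual passes.
        mean_entropy = -sum(p * math.log(p)
                            for m in range(n_passes)
                            for p in probs[m][i] if p > 0) / n_passes
        total += entropy_of_mean - mean_entropy
    return total / n_tokens


# Both passes are equally unsure: high entropy, but no disagreement.
agree = [[[0.5, 0.5]], [[0.5, 0.5]]]
# Each pass is confident, but they contradict each other.
disagree = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(bald_score(agree), bald_score(disagree))
```

The first case has zero mutual information even though the prediction is uncertain, while the second reaches the maximum for two classes. This is information that a vote-counting estimate such as VR cannot distinguish from the full probability distributions.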

For Transformer models, we have two variants of the Monte Carlo dropout: MC dropout on the last layer before the classification layer (MC last), and on all layers (MC all). We note that calculating uncertainty estimates in the case of MC all requires multiple stochastic passes, and in each of them, we have to perform inference of the whole Transformer model. However, if we replace the dropout only in the classifier, multiple recalculations are needed only for the classifier, while it is enough to perform the inference of the massive Transformer “body” only once. Therefore, in this case, the overhead of calculating the Bayesian uncertainty estimates can be less than 1% (in the case of 10 stochastic passes for ELECTRA according to the number of parameters) compared to deterministic strategies like MNLP.
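The cost argument for the "MC last" variant can be sketched as follows: the token representations produced by the Transformer body are computed once and reused, and only the dropout mask and the linear classifier are re-evaluated in each stochastic pass. This is an illustrative sketch in plain Python, not the authors' implementation; the function names, the default dropout probability of 0.1, and the use of a fresh element-wise mask per token are assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_last_predictions(hidden, weights, bias,
                        n_passes=10, drop_prob=0.1, rng=None):
    """Stochastic predictions with dropout applied only before the classifier.

    hidden  : per-token feature vectors from ONE pass of the Transformer body
    weights : C rows of d weights (linear classification layer)
    bias    : C biases
    Returns probs[m][i][c] for m in range(n_passes).
    """
    rng = rng or random.Random()
    keep = 1.0 - drop_prob
    passes = []
    for _ in range(n_passes):
        token_probs = []
        for features in hidden:
            # Fresh dropout mask per pass; inverted-dropout rescaling.
            dropped = [v / keep if rng.random() < keep else 0.0
                       for v in features]
            logits = [sum(w * v for w, v in zip(row, dropped)) + b
                      for row, b in zip(weights, bias)]
            token_probs.append(softmax(logits))
        passes.append(token_probs)
    return passes


hidden = [[1.0, -2.0, 0.5], [0.3, 0.3, 0.3]]   # two tokens, d = 3
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # C = 2 classes
bias = [0.0, 0.0]
probs = mc_last_predictions(hidden, weights, bias,
                            n_passes=10, drop_prob=0.5, rng=random.Random(0))
print(len(probs), len(probs[0]), len(probs[0][0]))
```

The returned array has the shape expected by uncertainty estimates such as VR or BALD. In the "MC all" variant, by contrast, `hidden` itself would have to be recomputed in every pass.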

The BiLSTM-CRF model has two types of dropout: the word dropout that randomly drops entire words after the embedding layer and the locked dropout Gal and Ghahramani (2016a) that drops the same neurons in the embedding space of a recurrent layer for a whole sequence. Therefore, for the BiLSTM-CRF taggers, we have three options: replacing the locked dropout (MC locked), replacing the word dropout (MC word), and replacing both of them (MC all). We should note that obtaining Bayesian uncertainty estimates does not require the recalculation of the word embeddings (in our case, the character CNN or ELMo).

5 Experiments and Results

5.1 Experimental Setup

We experiment with two widely-used datasets for evaluation of sequence tagging models and AL query strategies: English CoNLL-2003 Sang and Meulder (2003) and English OntoNotes 5.0 Pradhan et al. (2013). Corpus statistics are presented in Appendix B.

Each experiment is an emulation of the AL cycle: selected samples are not presented to experts for annotation but are labeled automatically according to the gold standard. Each experiment is performed for each AL query strategy and is repeated 5 times for CoNLL-2003 and 3 times for OntoNotes to report confidence intervals. A random 2% subset of the whole training set is chosen for seeding, and 2% of examples are selected for annotation on each iteration. Overall, 24 AL iterations are made, so in the final iteration, exactly half of the training dataset is labeled. We do not use the validation sets provided in the corpora but keep 25% of the labeled corpus as the development set and train models from scratch on the rest. Details of models and the training procedure are presented in Appendix C. The evaluation is performed using the span-based F1-score Sang and Meulder (2003). For query strategies based on the MC dropout, we make 10 stochastic predictions.
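The annotation budget implied by this setup can be checked with a few lines of arithmetic. The sketch below uses the CoNLL-2003 training set size of 14,041 sentences from Appendix B; the function name `annotation_schedule` and the choice to round counts down are assumptions made for illustration.

```python
def annotation_schedule(n_train, seed_pct=2, step_pct=2, iterations=24):
    """Cumulative labeled share after seeding (iteration 0) and each AL step.

    Returns a list of (iteration, labeled percent, labeled count) tuples.
    Counts are rounded down, which is an assumption of this sketch.
    """
    schedule = []
    for iteration in range(iterations + 1):
        percent = seed_pct + step_pct * iteration
        schedule.append((iteration, percent, n_train * percent // 100))
    return schedule


schedule = annotation_schedule(14041)
print(schedule[0])    # seeding iteration
print(schedule[-1])   # final iteration
```

The seeding step labels 2% of the corpus, and after 24 further steps of 2% each the labeled share reaches 2 + 24 × 2 = 50%.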

5.2 Results and Discussion

Training on Entire Datasets

| Model | CoNLL-2003 | OntoNotes |
|---|---|---|
| CRF | 78.2 ± NA | 74.0 ± NA |
| CNN-BiLSTM-CRF | 88.5 ± 0.3 | 78.1 ± 0.3 |
| ELMo-BiLSTM-CRF | 91.2 ± 0.2 | 82.0 ± 0.3 |
| DistilBERT | 89.8 ± 0.2 | 84.6 ± 0.2 |
| BERT | 91.1 ± 0.2 | 85.4 ± 0.1 |
| ELECTRA | 91.5 ± 0.2 | 85.6 ± 0.3 |

Table 1: Performance of models built on the entire training datasets without active learning

Figure 1: The comparison of MNLP and random query strategies. Panels (a) CoNLL and (b) OntoNotes: CRF and BiLSTM-based models. Panels (c) CoNLL and (d) OntoNotes: Transformers.

Figure 2: The comparison of Bayesian (BALD and VR) and non-Bayesian (MNLP) query strategies. Panels (a) CoNLL and (b) OntoNotes: the ELMo-BiLSTM-CRF model. Panels (c) CoNLL and (d) OntoNotes: Transformer-based taggers.

Figure 3: The comparison of the best query strategies and models overall. Panels (a) CoNLL and (b) OntoNotes.

Figure 4: The performance of BERT during AL on the CoNLL-2003 dataset, when DistilBERT is used as an acquisition model.

Table 1 shows that the model performance pattern is almost the same for OntoNotes and CoNLL-2003. CRF, as the baseline model, has the lowest F1 score. Sequence taggers based on deep pre-trained models achieve substantially higher results compared to the classical CNN-BiLSTM-CRF model and CRF. BERT and ELECTRA significantly outperform ELMo-BiLSTM-CRF on OntoNotes, while on CoNLL, all models have comparable scores. DistilBERT is behind the bigger Transformers. It also has slightly lower performance than ELMo-BiLSTM-CRF on CoNLL-2003 but outperforms it on the OntoNotes dataset. We should note that our goal was not to achieve the state of the art on each dataset, but to determine the performance of the models trained with the hyperparameters used for AL. Tuning hyperparameters, especially increasing the number of epochs, would lead to better results for both LSTM-based models and Transformers, but that would also increase the amount of computation, which would make AL impractical.

Active Learning

The main results of the experiments with AL are presented in Figures 1–4. AL shows significant improvements over the random sampling baseline for all models and on both datasets. Performance gains are bigger for simpler models without deep pre-training, like CRF or CNN-BiLSTM-CRF, and for the more complex OntoNotes dataset. However, we see that both ELMo-BiLSTM-CRF and Transformers benefit from the biased sampling of AL query strategies, which magnifies their ability to be trained on extremely small amounts of labeled data. For example, to get 99% of the score achieved with training on the entire CoNLL dataset, only 20% of the annotated data is required for the ELECTRA tagger combined with the best AL strategy. For OntoNotes and BERT, only 18% of the corpus is required. Random sampling in both cases requires more than 40% of the annotated data.

MNLP strategy. For the CoNLL-2003 corpus, the best performance in AL with the MNLP query strategy is achieved by the ELECTRA model. It shows significantly better results in the beginning compared to BERT (see Figure 1c). ELECTRA also slightly outperforms the ELMo-BiLSTM-CRF tagger in the beginning and is on par with it on the rest of the AL curve (see Figure 3a). The CNN-BiLSTM-CRF model is always better than the baseline CRF, but worse than the models that take advantage of deep pre-training (see Figure 1a). DistilBERT appeared to be the weakest model in the experiments on the CoNLL-2003 dataset. With random sampling, it is on par with the baseline CNN-BiLSTM-CRF model, but with the MNLP strategy, DistilBERT falls significantly behind CNN-BiLSTM-CRF until iteration 11.

Although BERT is slightly worse than the ELECTRA and ELMo-BiLSTM-CRF taggers in the experiments on the CoNLL-2003 corpus, for OntoNotes, BERT has a significant advantage over them of 0.6–4.0% F1 in the AL setting. The ELMo-BiLSTM-CRF tagger falls behind the main Transformers on the OntoNotes corpus. This might be because the BiLSTM-based tagger underfits the bigger corpus with only 30 training epochs. In the same vein, the baseline CNN-BiLSTM-CRF model without deep pre-training falls significantly behind DistilBERT on this corpus for the MNLP and random query strategies.

We should note that for earlier iterations, random sampling tends to outperform the standard MNLP query strategy. The effect is not noticeable for OntoNotes due to the fact that 2% of this dataset is more than 1,800 examples, which makes up a substantial amount of training data even for the seeding (zero) iteration. However, for the CoNLL-2003 dataset, we see a significant gap between the random and MNLP strategies for almost all models (except ELMo-BiLSTM-CRF and ELECTRA) until iterations 3–7. The gap is larger for weaker models: CRF, DistilBERT, and CNN-BiLSTM-CRF. The poor performance of the MNLP strategy on the early iterations is due to the reliance of uncertainty estimates on the weak model’s predictions when not many labels are unveiled. To show good results, AL typically requires good uncertainty estimates, which can be provided by Bayesian techniques.

Bayesian AL. Using the Bayesian uncertainty estimates based on the MC dropout showed substantial improvements over the simpler MNLP strategy for Transformer-based and BiLSTM-CRF-based taggers. The comparison of BALD, VR, and MNLP follows the same pattern on both datasets. BALD shows the best results among all strategies, giving a substantial boost to the performance in the beginning of the AL process until iterations 7–9, and it is on par with other strategies on later iterations. The VR strategy is also better than MNLP in the beginning, but performance gains are usually smaller. This is especially noticeable for the ELMo-BiLSTM-CRF and DistilBERT taggers on the CoNLL dataset (Figures 2a and 2c). The performance of VR also slightly degrades compared to both BALD and MNLP for Transformers on later iterations, which can be seen in Figure 2d.

We compare the performance of Bayesian AL strategies when different dropout layers are replaced with the MC dropout (see Tables 4 and 5 in Appendix D). For ELMo-BiLSTM-CRF, we compare three options: replacing the dropout that follows word embeddings (embeddings acquired from ELMo), replacing the locked dropout in the recurrent layer, and replacing both. The best results are achieved by replacing the locked dropout. Replacing the dropout that follows the embedding layer degrades the performance significantly, even compared to MNLP. Replacing both is better, but it is substantially worse than replacing only the locked dropout. For Transformers, we compare two options: replacing the dropout only on the last classification layer and replacing all dropouts in the model. When using the VR strategy, replacing all dropout layers degrades the performance on early iterations for the CoNLL-2003 corpus. For OntoNotes, replacing all is slightly better for later iterations. However, for BALD, there is no significant difference. Therefore, replacing only the last layer is the best option since it is computationally cheaper and has the best performance with BALD.

The results for the best Bayesian AL strategies for the best Transformer and BiLSTM-CRF taggers are depicted in Figure 3. For the CoNLL-2003 dataset, the performance of the ELMo-BiLSTM-CRF model is on par with the performance of ELECTRA. ELECTRA is slightly better in the beginning until iteration 3, while ELMo-BiLSTM-CRF has a small advantage from iteration 4 to 11. For the OntoNotes dataset, the BERT model paired with BALD and the MC dropout on the last layer has the best performance.

DistilBERT as an acquisition model for BERT. Figure 4 shows the results of AL when we use DistilBERT as an acquisition model but evaluate the performance of the bigger BERT model fine-tuned on the acquired instances. The performance in this case is only slightly lower for iterations 3 to 7 than when using the “native” acquisition model and is always significantly better than random sampling. Since DistilBERT is faster and requires much less memory, using it in place of a big Transformer model for acquisition of training instances can help to alleviate practical obstacles to using AL in real-world scenarios.

6 Conclusion

In this work, we investigated the combination of AL with sequence taggers that take advantage of deep pre-trained models. In the AL setting, these sequence taggers substantially outperform the models that do not use deep pre-training. We show that the combination of AL and transfer learning is very powerful and can help to produce remarkably well-performing models with just a small fraction of the annotated data. For the CoNLL-2003 corpus, the combination of the best performing pre-trained model and AL strategy achieves 99% of the score that can be obtained with training on the full corpus, while using only 20% of the annotated data. For the OntoNotes corpus, one needs just 18%.

We are the first to apply Bayesian active learning by disagreement in conjunction with deep pre-trained models to sequence tagging tasks and show that it outperforms both the variation ratio uncertainty estimate investigated in previous works and the deterministic MNLP query strategy. We experiment with various MC dropout options and find that for Transformers, the best option is to apply the MC dropout before the last classification layer, which makes Bayesian query strategies not only a good choice in terms of quality, but also practical due to low computational overhead. Finally, we demonstrate that it is possible to reduce the computational overhead of AL by using a smaller distilled version of a Transformer model for acquiring instances that can be used for training a full-sized model.


Acknowledgments

We thank the reviewers for their valuable feedback. This work was done in the framework of the joint MTS-Skoltech lab. The Zhores supercomputer Zacharov et al. (2019) was used for computations. Finally, we are grateful to Sergey Ustyantsev for help with the execution of the experiments.

Appendix A Features Used by the CRF Model

  1. A lowercased word form.

  2. Trigram and bigram suffixes of words.

  3. Capitalization features.

  4. An indicator that shows whether a word is a digit.

  5. A part-of-speech tag of a word with specific info (plurality, verb tense, etc.).

  6. A generalized part-of-speech.

  7. An indicator whether a word is at the beginning or ending of a sentence.

  8. The aforementioned characteristics for the next word and previous word except suffixes.
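A feature function covering the list above could look like the following sketch. It is not the authors' code: the function name `token_features`, the feature names, and the use of the first two characters of a Penn Treebank tag as the "generalized" part-of-speech are assumptions, and part-of-speech tags are taken as given input rather than computed. For neighboring words, the sketch copies the word-level features without suffixes and, as a simplification, without the sentence-boundary indicators.

```python
def token_features(words, pos_tags, i):
    """Build a feature dictionary for the token at position i."""

    def basic(j):
        word = words[j]
        return {
            "lower": word.lower(),             # 1. lowercased word form
            "is_title": word.istitle(),        # 3. capitalization features
            "is_upper": word.isupper(),
            "is_digit": word.isdigit(),        # 4. digit indicator
            "pos": pos_tags[j],                # 5. fine-grained POS tag
            "pos_general": pos_tags[j][:2],    # 6. generalized POS (assumed)
        }

    features = dict(basic(i))
    features["suffix3"] = words[i][-3:]        # 2. trigram and bigram suffixes
    features["suffix2"] = words[i][-2:]
    features["bos"] = i == 0                   # 7. sentence boundary indicators
    features["eos"] = i == len(words) - 1
    # 8. the same characteristics for the neighbors, except suffixes
    for offset, prefix in ((-1, "prev"), (1, "next")):
        j = i + offset
        if 0 <= j < len(words):
            for name, value in basic(j).items():
                features[f"{prefix}:{name}"] = value
    return features


words = ["John", "lives", "in", "Paris"]
pos_tags = ["NNP", "VBZ", "IN", "NNP"]
features = token_features(words, pos_tags, 0)
print(sorted(features))
```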

Appendix B Dataset Characteristics

| | Train | Test |
|---|---|---|
| # of tokens | 1,088,503 | 152,728 |
| # of sentences | 59,924 | 8,262 |
| Entity types: | | |
| PERSON | 15,429 | 1,988 |
| GPE | 15,405 | 2,240 |
| ORG | 12,820 | 1,795 |
| DATE | 10,922 | 1,602 |
| CARDINAL | 7,367 | 935 |
| NORP | 6,870 | 841 |
| MONEY | 2,434 | 314 |
| PERCENT | 1,763 | 349 |
| ORDINAL | 1,640 | 195 |
| LOC | 1,514 | 179 |
| TIME | 1,233 | 212 |
| WORK_OF_ART | 974 | 166 |
| FAC | 860 | 135 |
| EVENT | 748 | 63 |
| QUANTITY | 657 | 105 |
| PRODUCT | 606 | 76 |
| LAW | 282 | 40 |
| Total entities | 81,828 | 11,257 |

Table 2: Characteristics of the OntoNotes 5.0 corpus (without PT section)

| | Train | Test |
|---|---|---|
| # of tokens | 203,621 | 46,435 |
| # of sentences | 14,041 | 3,453 |
| Entity types: | | |
| LOC | 7,140 | 1,668 |
| PER | 6,600 | 1,617 |
| ORG | 6,321 | 1,661 |
| MISC | 3,438 | 702 |
| Total entities | 23,499 | 5,648 |

Table 3: Characteristics of the CoNLL-2003 corpus

Appendix C Model and Training Details

C.1 CRF

We set the CRF L1 and L2 regularization terms equal to 0.1 and limit the number of iterations to 100.

C.2 BiLSTM-CRF Taggers

We implement the BiLSTM-CRF sequence tagger on the basis of the Flair package Akbik et al. (2018). We use the same parameters for all types of BiLSTM-CRF models. The recurrent network has one layer with 128 neurons. During training, we anneal the learning rate by half when the performance of the model stops improving on the development set for 3 epochs. After annealing, we restore the model from the epoch with the best validation score. The starting learning rate is 0.1. The maximal number of epochs is 30, and the batch size is 32. For optimization, we use the standard SGD algorithm.

C.3 Transformer-based Taggers

The implementation of Transformer-based taggers is based on the transformers library Wolf et al. (2019). We use the following pre-trained versions of BERT, ELECTRA, and DistilBERT, respectively: ‘bert-base-cased’, ‘google/electra-base-discriminator’, and ‘distilbert-base-cased’. The corrected version of Adam (AdamW from transformers) is used for optimization with a base learning rate of 5e-5. Learning rate warmup over the first 10% of steps and linear decay of the learning rate are applied following Devlin et al. (2019). The number of epochs is 4 and the batch size is 16. As in Shen et al. (2017), we see that it is critical to adjust the batch size on early AL iterations, when only a small amount of labeled data is available. We reduce the batch size to keep the number of iterations per epoch over 50, but limit the minimal batch size to 4.
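The batch size adjustment described above can be sketched as a simple rule. The exact procedure used in the experiments is not specified, so the closed-form rule below (take the largest batch size that still leaves at least the target number of steps per epoch, then clip it to the allowed range) and the function name `adjusted_batch_size` are assumptions.

```python
def adjusted_batch_size(n_train, base_batch_size=16,
                        min_steps=50, min_batch_size=4):
    """Shrink the batch size on small training sets.

    Keeps roughly min_steps or more optimizer steps per epoch,
    but never goes below min_batch_size or above base_batch_size.
    """
    largest_allowed = n_train // min_steps
    return max(min_batch_size, min(base_batch_size, largest_allowed))


for n_train in (100, 210, 600, 10000):
    batch_size = adjusted_batch_size(n_train)
    print(n_train, batch_size, n_train / batch_size)
```

For very small labeled sets the lower limit of 4 takes over, so the number of steps per epoch can still fall below the target; for large sets the base batch size of 16 is used unchanged.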

Appendix D Comparison of Various MC Dropout Options

| Model | Query strat. | 1 | 5 | 10 | 15 | 20 | 24 |
|---|---|---|---|---|---|---|---|
| ELMo-BiLSTM-CRF | MNLP | 82.6 ± 1.3 | 88.9 ± 0.4 | 90.8 ± 0.1 | 91.3 ± 0.2 | 91.6 ± 0.2 | 91.6 ± 0.2 |
| | Random | 82.7 ± 1.6 | 87.2 ± 0.6 | 89.3 ± 0.4 | 90.2 ± 0.1 | 90.5 ± 0.1 | 90.8 ± 0.1 |
| | VR (MC word) | 81.0 ± 0.6 | 88.1 ± 0.5 | 90.1 ± 0.3 | 90.9 ± 0.2 | 91.2 ± 0.1 | 91.4 ± 0.2 |
| | VR (MC locked) | 82.7 ± 0.4 | 89.3 ± 0.4 | 91.1 ± 0.2 | 91.5 ± 0.1 | 91.6 ± 0.1 | 91.6 ± 0.2 |
| | VR (MC all) | 82.3 ± 1.2 | 88.8 ± 0.4 | 90.7 ± 0.2 | 91.2 ± 0.1 | 91.5 ± 0.2 | 91.6 ± 0.1 |
| | BALD (MC word) | 84.9 ± 0.5 | 88.1 ± 0.3 | 89.9 ± 0.3 | 90.7 ± 0.3 | 91.2 ± 0.3 | 91.4 ± 0.1 |
| | BALD (MC locked) | 84.9 ± 1.0 | 90.3 ± 0.2 | 91.1 ± 0.1 | 91.4 ± 0.2 | 91.6 ± 0.2 | 91.6 ± 0.1 |
| | BALD (MC all) | 85.1 ± 1.1 | 89.4 ± 0.4 | 90.7 ± 0.1 | 91.3 ± 0.1 | 91.5 ± 0.1 | 91.6 ± 0.1 |
| DistilBERT | MNLP | 74.4 ± 0.7 | 81.7 ± 0.2 | 87.6 ± 0.2 | 88.9 ± 0.1 | 89.5 ± 0.2 | 89.3 ± 0.3 |
| | Random | 77.2 ± 0.6 | 83.5 ± 0.5 | 85.9 ± 0.5 | 87.3 ± 0.3 | 87.6 ± 0.1 | 88.4 ± 0.2 |
| | VR (MC last) | 76.5 ± 0.9 | 84.5 ± 0.4 | 87.9 ± 0.2 | 88.5 ± 0.3 | 88.8 ± 0.5 | 89.4 ± 0.3 |
| | VR (MC all) | 74.8 ± 1.4 | 81.4 ± 0.9 | 87.4 ± 0.3 | 88.8 ± 0.4 | 89.0 ± 0.3 | 89.3 ± 0.3 |
| | BALD (MC last) | 79.7 ± 1.0 | 85.3 ± 0.5 | 88.0 ± 0.2 | 89.0 ± 0.3 | 89.3 ± 0.2 | 89.5 ± 0.3 |
| | BALD (MC all) | 79.0 ± 0.7 | 85.3 ± 0.6 | 88.1 ± 0.3 | 88.8 ± 0.2 | 89.1 ± 0.2 | 89.6 ± 0.2 |
| ELECTRA | MNLP | 84.1 ± 1.1 | 89.4 ± 0.1 | 90.6 ± 0.1 | 91.2 ± 0.2 | 91.5 ± 0.2 | 91.5 ± 0.2 |
| | Random | 84.6 ± 0.6 | 87.7 ± 0.2 | 89.7 ± 0.2 | 90.2 ± 0.3 | 90.6 ± 0.2 | 91.0 ± 0.2 |
| | VR (MC last) | 85.4 ± 0.8 | 90.1 ± 0.3 | 90.7 ± 0.3 | 91.2 ± 0.2 | 91.3 ± 0.2 | 91.4 ± 0.2 |
| | VR (MC all) | 82.8 ± 1.3 | 88.3 ± 0.1 | 90.7 ± 0.2 | 91.4 ± 0.2 | 91.3 ± 0.2 | 91.5 ± 0.2 |
| | BALD (MC last) | 86.1 ± 1.0 | 89.8 ± 0.3 | 90.8 ± 0.1 | 91.4 ± 0.2 | 91.5 ± 0.2 | 91.5 ± 0.2 |
| | BALD (MC all) | 84.4 ± 0.6 | 89.5 ± 0.2 | 90.9 ± 0.2 | 91.4 ± 0.2 | 91.5 ± 0.2 | 91.5 ± 0.2 |

Table 4: Results of AL with various MC dropout options on the CoNLL-2003 dataset


| Model | Query strat. | 1 | 5 | 10 | 15 | 20 | 24 |
|---|---|---|---|---|---|---|---|
| ELMo-BiLSTM-CRF | MNLP | 76.1 ± 0.6 | 80.3 ± 0.1 | 82.4 ± 0.3 | 83.5 ± 0.2 | 84.0 ± 0.1 | 84.1 ± 0.1 |
| | Random | 73.4 ± 0.8 | 78.4 ± 0.3 | 80.7 ± 0.2 | 81.8 ± 0.2 | 82.4 ± 0.2 | 82.7 ± 0.1 |
| | VR (MC word) | 76.1 ± 0.4 | 80.7 ± 0.2 | 82.5 ± 0.2 | 83.3 ± 0.2 | 83.8 ± 0.1 | 83.9 ± 0.1 |
| | VR (MC locked) | 76.6 ± 0.3 | 80.6 ± 0.3 | 82.5 ± 0.2 | 83.5 ± 0.2 | 83.9 ± 0.2 | 84.1 ± 0.2 |
| | VR (MC all) | 76.2 ± 0.3 | 80.9 ± 0.3 | 82.6 ± 0.1 | 83.5 ± 0.1 | 83.9 ± 0.1 | 84.1 ± 0.2 |
| | BALD (MC word) | 76.4 ± 0.7 | 79.8 ± 0.3 | 82.2 ± 0.2 | 83.2 ± 0.1 | 83.9 ± 0.1 | 84.0 ± 0.2 |
| | BALD (MC locked) | 77.2 ± 0.3 | 80.9 ± 0.2 | 82.7 ± 0.2 | 83.7 ± 0.1 | 83.9 ± 0.1 | 84.1 ± 0.2 |
| | BALD (MC all) | 77.4 ± 0.2 | 80.9 ± 0.1 | 82.7 ± 0.1 | 83.6 ± 0.1 | 83.9 ± 0.1 | 84.0 ± 0.1 |
| DistilBERT | MNLP | 75.1 ± 0.5 | 81.6 ± 0.3 | 83.4 ± 0.3 | 84.0 ± 0.1 | 84.3 ± 0.1 | 84.3 ± 0.1 |
| | Random | 73.5 ± 0.5 | 79.3 ± 0.1 | 81.0 ± 0.2 | 82.1 ± 0.1 | 82.8 ± 0.3 | 83.1 ± 0.2 |
| | VR (MC last) | 76.2 ± 0.4 | 82.0 ± 0.2 | 83.0 ± 0.2 | 83.7 ± 0.3 | 83.8 ± 0.2 | 83.8 ± 0.1 |
| | VR (MC all) | 76.1 ± 0.6 | 81.9 ± 0.2 | 83.5 ± 0.2 | 84.1 ± 0.2 | 84.5 ± 0.1 | 84.3 ± 0.2 |
| | BALD (MC last) | 77.1 ± 0.3 | 82.3 ± 0.2 | 83.5 ± 0.2 | 84.0 ± 0.2 | 84.3 ± 0.1 | 84.3 ± 0.1 |
| | BALD (MC all) | 77.0 ± 0.3 | 82.3 ± 0.2 | 83.6 ± 0.2 | 84.2 ± 0.2 | 84.3 ± 0.2 | 84.3 ± 0.2 |
| BERT | MNLP | 78.3 ± 0.3 | 83.1 ± 0.1 | 84.9 ± 0.3 | 85.4 ± 0.3 | 85.2 ± 0.2 | 85.3 ± 0.2 |
| | Random | 76.5 ± 0.3 | 81.4 ± 0.2 | 82.5 ± 0.4 | 83.5 ± 0.2 | 84.1 ± 0.3 | 84.0 ± 0.1 |
| | VR (MC last) | 79.7 ± 0.3 | 83.0 ± 0.2 | 84.2 ± 0.3 | 84.7 ± 0.2 | 84.7 ± 0.2 | 84.6 ± 0.3 |
| | VR (MC all) | 78.9 ± 0.5 | 83.3 ± 0.1 | 84.6 ± 0.2 | 85.2 ± 0.2 | 85.0 ± 0.2 | 85.4 ± 0.1 |
| | BALD (MC last) | 79.7 ± 0.3 | 83.8 ± 0.1 | 85.1 ± 0.2 | 85.1 ± 0.2 | 85.2 ± 0.2 | 85.5 ± 0.2 |
| | BALD (MC all) | 79.5 ± 0.6 | 83.9 ± 0.2 | 85.0 ± 0.1 | 85.2 ± 0.3 | 85.3 ± 0.3 | 85.4 ± 0.1 |

Table 5: Results of AL with various MC dropout options on the OntoNotes dataset




References

  1. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1638–1649. Cited by: §C.2.
  2. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, F. R. Bach and D. M. Blei (Eds.), JMLR Workshop and Conference Proceedings, Vol. 37, pp. 1613–1622. Cited by: §2.
  3. ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Cited by: §1, §3.3.
  4. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Cited by: §C.3, §1, §3.3.
  5. Practical, efficient, and customizable active learning for named entity recognition in the digital humanities. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2223–2234. Cited by: §2.
  6. Learning how to active learn: a deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 595–605. Cited by: §2.
  7. Elementary applied statistics: for students in behavioral science. John Wiley & Sons. Cited by: §2, §4.
  8. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon and R. Garnett (Eds.), Vol. 29, pp. 1019–1027. Cited by: §4.
  9. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, M. Balcan and K. Q. Weinberger (Eds.), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1050–1059. Cited by: §1, §2.
  10. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 1183–1192. Cited by: §1.
  11. Distilling the knowledge in a neural network. CoRR abs/1503.02531. Cited by: §3.3.
  12. Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §3.2.
  13. Bayesian active learning for classification and preference learning. CoRR abs/1112.5745. Cited by: 2nd item.
  14. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, C. E. Brodley and A. P. Danyluk (Eds.), pp. 282–289. Cited by: §3.1, §3.
  15. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270. Cited by: §2, §3.
  16. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum), W. B. Croft and C. J. van Rijsbergen (Eds.), pp. 3–12. Cited by: §4.
  17. A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering. Cited by: §3.3.
  18. Practical obstacles to deploying active learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 21–30. Cited by: §2.
  19. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1064–1074. Cited by: §2, §3.2.
  20. An experimental comparison of active learning strategies for partially labeled sequences. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 898–906. Cited by: §2.
  21. GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. Cited by: §3.2.
  22. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. Cited by: §1, §3.2.
  23. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, pp. 143–152. Cited by: §5.1.
  24. A primer in BERTology: what we know about how BERT works. CoRR abs/2002.12327. Cited by: §3.3.
  25. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. pp. 142–147. Cited by: §5.1.
  26. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §3.3.
  27. An analysis of active learning strategies for sequence labeling tasks. In 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25-27 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1070–1079. Cited by: §1, §2.
  28. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. Cited by: §1, §2.
  29. Deep active learning for named entity recognition. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, Canada, pp. 252–256. Cited by: §C.3, 2nd item, §4.
  30. Deep Bayesian active learning for natural language processing: results of a large-scale empirical study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2904–2909. Cited by: 2nd item.
  31. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4593–4601. Cited by: §3.3.
  32. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan and R. Garnett (Eds.), pp. 5998–6008. Cited by: §3.3.
  33. HuggingFace’s transformers: state-of-the-art natural language processing. CoRR abs/1910.03771. Cited by: §C.3.
  34. “Zhores” — petaflops supercomputer for data-driven modeling, machine learning and artificial intelligence installed in Skolkovo Institute of Science and Technology. Open Engineering 9 (1), pp. 512–520. Cited by: Acknowledgments.