Investigating the Effectiveness of Word-Embedding Based Active Learning for Labelling Text Datasets

Supported by Science Foundation Ireland and Teagasc.

Jinghui Lu¹, Maeve Henchion², Brian Mac Namee¹

¹ Insight Centre for Data Analytics, University College Dublin, Ireland
² Teagasc Agriculture and Food Development Authority, Ireland

Manually labelling large collections of text data is a time-consuming, expensive, and laborious task, but one that is necessary to support machine learning based on text datasets. Active learning has been shown to be an effective way to alleviate some of the effort required in utilising large collections of unlabelled data for machine learning tasks without needing to fully label them. The representation mechanism used to represent text documents when performing active learning, however, has a significant influence on how effective the process will be. While simple vector representations such as bag-of-words have been shown to be an effective way to represent documents during active learning, the emergence of representation mechanisms based on the word embeddings prevalent in neural network research (e.g. word2vec and transformer-based models like BERT) offers a promising, and as yet not fully explored, alternative. This paper describes a large-scale evaluation of the effectiveness of different text representation mechanisms for active learning across 8 datasets from varied domains. This evaluation shows that using representations based on modern word embeddings, especially BERT, which have not yet been widely used in active learning, achieves a significant improvement over more commonly used vector-based methods like bag-of-words.

Keywords: active learning · text classification · word embeddings · BERT · FastText

1 Introduction

Active learning (AL) [1] is a semi-supervised machine learning technique that minimises the amount of labelled data required to build accurate prediction models. In active learning, only the most informative instances from an unlabelled dataset are selected to be labelled by an oracle (i.e. a human annotator). This property makes active learning attractive in scenarios where unlabelled data may be abundant but labelled data is expensive to obtain, such as image classification [2, 3], speech recognition [9], and text classification [4, 5, 6, 8], which is the focus of this work.

One crucial component of active learning for text classification is the mechanism used to represent documents in the tabular structure required by most machine learning algorithms. Vector-based representations such as bag-of-words (BOW) are among the most common representations used in active learning [12, 11, 10]. Considerable work, however, has shown that learned representations of natural language, which exploit massive broad-coverage unlabelled text corpora, are useful for a wide range of natural language processing (NLP) tasks including text classification, and have been widely adopted [13, 14, 15, 18, 19, 20, 44]. Standard pre-trained language models like word2vec [13], GloVe [14], and FastText [15, 16], as well as contextualised pre-trained language models such as CoVe [43] and ELMo [44], convert words to fixed-length dense vectors that capture semantic and syntactic features, and allow more complex structures (such as sentences, paragraphs, and documents) to be encoded as aggregates of these vectors. There are also sentence-level pre-trained language models, such as ULMFiT [18], OpenAI GPT [19], and BERT [20], that are followed by task-specific fine-tuning to significantly increase performance on NLP tasks, and have been shown to be useful for learning common language features. Among them, BERT (Bidirectional Encoder Representations from Transformers) has achieved state-of-the-art results across many NLP tasks [20]. We refer to the representations produced by these pre-trained language models as word embeddings in this paper. Even though word embeddings have been widely applied in text classification, there is little work devoted to leveraging them in active learning for text classification [6, 7, 22].

This paper describes a large-scale evaluation experiment to explore the effectiveness of word embeddings for active learning in a text classification context. This evaluation, based on 8 datasets from different domains (product reviews, news articles, blog posts, etc.), shows that representations based on word embeddings, and especially representations based on BERT, consistently outperform the more commonly used simple vector representations, demonstrating the greater effectiveness of an embedding-based active learning framework.

The rest of the paper is organized as follows: Section 2 presents related work and outlines the text representation mechanisms used in the paper along with the active learning selection strategies used; Section 3 describes the design of the experiment performed; Section 4 discusses the results of this experiment; and, finally, Section 5 draws conclusions and suggests future directions.

2 Related Work

2.1 Applying Word Embedding in Active Learning

In pool-based active learning, a small set of labelled instances is used to seed an initial labelled dataset, L. Then, according to a particular selection strategy, a batch of data to be presented to an oracle for labelling is chosen from the unlabelled data pool, U. After labelling, these newly labelled instances are removed from U and appended to L. This process repeats until a predefined stopping criterion has been met (e.g. a label budget has been exhausted).

The resulting labelled instances are used to train a predictive model if the goal is to induce a good classification model; if the goal is to label all instances, the induced classification model is applied to the remaining unlabelled instances in U to predict their classes, which saves manual labour compared to labelling the whole dataset by hand. We are interested in the latter scenario in this paper.

Although applying word embeddings in text classification has attracted considerable attention in the literature [13, 15, 18, 20], the use of word embeddings in active learning is still a largely unexplored research area. Zhang et al. [6] combined word2vec with convolutional neural networks (CNNs) and active learning to build classifiers for sentence-based and document-based datasets. Similarly, Zhao et al. [7] proposed leveraging recurrent neural networks with gated recurrent units and word2vec to predict the classes of short texts. Concurrently to our work, Zhang Ye [45] proposes a query strategy combining fine-tuned BERT with a CNN, but only compares the performance of different query strategies when BERT is applied, rather than the impact of different text representation techniques used in active learning. Siddhant and Lipton [22] compare the performance of GloVe-embedding-based active learning frameworks, composed of different classifiers such as bi-LSTMs and CNNs, across many NLP tasks. They find that GloVe embeddings selected by Bayesian Active Learning by Disagreement with Monte Carlo Dropout or Bayes-by-Backprop Dropout usually outperform a shallow baseline. However, Siddhant and Lipton take a linear SVM combined with a BOW representation, rather than GloVe embeddings, as the shallow baseline, which limits the conclusions that can be drawn. Additionally, these four studies focus on comparing the impact of query strategies when deep neural networks are applied, rather than that of text representations.

Another challenge faced by the studies above is the computational cost of active learning. Prediction models with high computational complexity, such as neural networks, are expensive to use in active learning because classifiers must be retrained frequently. Therefore, some studies combine word embeddings with classical machine learning algorithms to provide tractable approaches. Hashimoto et al. [21] propose a method, called the topic model (TM), that combines a vector-based representation with k-means to encode documents that are then fed into an SVM. Hashimoto et al. compare the performance of a certainty sampling selection strategy using doc2vec-TM to that of certainty sampling with other representations (BOW-TM, LDA, and word2vec), and conclude that doc2vec-TM outperforms the other representations when performing active learning to identify eligible/ineligible studies in clinical literature reviews. Interestingly, Singh et al. [8] extend the experiments in [21] with more datasets in the health domain, demonstrating that directly using doc2vec or BOW, rather than doc2vec-TM, can achieve better results, which is contrary to the findings of Hashimoto et al. Despite these promising results, these active learning studies adopt only one or two query strategies (i.e. Certainty Sampling and Certainty Information Gain) on imbalanced datasets from the medical domain, leading to less generalisable conclusions. Moreover, most work aims to produce a high-quality prediction model [6, 7, 22] instead of labelling the full dataset [21, 8], resulting in different evaluation methods, which increases the difficulty of comparison across papers.

This research fills the gap by comparing the performance of embedding-based active learning with that of a classical active learning framework, using a broader range of query strategies and datasets from various domains to fairly demonstrate the effectiveness of each representation. As far as we know, this is the first attempt to evaluate the performance of BERT as a representation, compared to other vector-based representations, in the context of expediting text labelling tasks via active learning.

2.2 Text Representations

Text representation is an important intermediate step that converts raw documents into the fixed-length feature vectors required by most classification algorithms. The most popular text representation techniques, along with brief descriptions, are listed below.

2.2.1 Bag-of-words

the most basic vector-based representation, has been widely used in many active learning applications [12, 23, 8, 22]. Each column of the BOW vector holds the term frequency (TF) of a distinct word appearing in the document, and 0 if the term is absent. Terms are often weighted by inverse document frequency to penalise terms commonly used in most documents, which is called TF-IDF. In this paper, we adopt both TF-IDF and TF, the latter normalised by the total word count of a document.
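
As an illustrative sketch (not the paper's code), the two BOW variants can be computed with scikit-learn on a toy corpus:

```python
# Illustrative sketch: length-normalised term frequency (TF) and TF-IDF.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the movie was great", "the plot was dull", "great plot great cast"]

# Raw counts, then normalise each row by the document's total word count
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)
tf = tf / tf.sum(axis=1, keepdims=True)

# TF-IDF down-weights terms that occur in most documents
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
```

Both variants produce one row per document over the same vocabulary; only the term weighting differs.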

2.2.2 Latent Dirichlet Allocation

(LDA) [27] is a topic modelling technique proposed to infer the topic distribution of a collection of documents. The model generates a term-topic matrix and a document-topic matrix. Specifically, each row of the document-topic matrix is a topic-based representation of a document, where the i-th column indicates the degree of association between the i-th topic and the document. Such topical representations of documents have been used in active learning for labelling inclusive/exclusive studies in literature reviews [21, 8, 24].
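
A minimal sketch of this document-topic representation, assuming a toy corpus and 2 topics rather than the 300 used in the experiments:

```python
# Illustrative sketch: encode each document by its row of the
# document-topic matrix inferred by LDA.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stocks fall as markets slide", "team wins the cup final",
        "markets rally on bank stocks", "striker scores in cup win"]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topic = lda.transform(X)   # shape (n_docs, n_topics); each row sums to 1
```

Each row of `doc_topic` is the topic-based feature vector used in place of the raw BOW vector.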

2.2.3 FastText

is a neural language model trained on large online unlabelled corpora. Compared to word2vec and GloVe, FastText enriches the training of word embeddings with subword information, which improves its ability to produce embeddings for out-of-vocabulary words. Since we could not find an applicable doc2vec model pre-trained on a large corpus, we consider the FastText model a good alternative given the strong performance reported on classification tasks [15, 16, 17]. In this paper, we adopt two versions of FastText: 1) inferring word embeddings with the original FastText model trained on large corpora; 2) continuing to train the original FastText model on the local corpus (without label information) and then inferring word embeddings, which is referred to as FastText_trained. In practice, we average the vectors of the words appearing in a document to obtain the document representation.
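
The averaging step can be sketched as follows; the toy 4-dimensional lookup table is a stand-in for the 300-dimensional pre-trained FastText model, and the vector values are invented for illustration:

```python
# Illustrative sketch: a document vector is the mean of its word vectors.
import numpy as np

embeddings = {                       # hypothetical pre-trained word vectors
    "good":  np.array([0.9, 0.1, 0.0, 0.2]),
    "movie": np.array([0.1, 0.8, 0.3, 0.0]),
    "plot":  np.array([0.2, 0.7, 0.4, 0.1]),
}

def doc_vector(doc, emb, dim=4):
    """Average the vectors of the words appearing in the document."""
    vecs = [emb[w] for w in doc.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = doc_vector("good movie plot", embeddings)
```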

2.2.4 BERT

has achieved impressive results in many NLP tasks. This model, which uses a multi-head attention mechanism based on the multi-layer bidirectional Transformer architecture, is trained on plain text using masked word prediction and next sentence prediction tasks to learn contextualised word embeddings. A contextualised word embedding means a word can have different embeddings according to its context, which alleviates problems caused by polysemy. Though the original paper [20] suggests using the vector of the "[CLS]" token added at the head of a sentence (here, a sentence means a text sequence of fixed length) as a sentence-level representation, in practice researchers have found that averaging the word embeddings of the sentence is an equivalent, and sometimes better, option. In this paper, we regard the mean of the word vectors as the sentence-level representation.

2.3 Query Strategies

A query strategy, the technique used to pick unlabelled data to be presented to the oracle for labelling, also plays a vital role in active learning. Many query strategies have been studied in the literature. One family of approaches, including uncertainty sampling, query-by-committee (QBC), and density-weighted methods [1], utilises models trained on the currently labelled instances, L, to infer the "informativeness" of unlabelled instances in U, among which the most informative are selected to be labelled by the oracle. We refer to these approaches as model-based query strategies. On the other hand, there are methods that rely entirely on the features of the instances in L and U to compute the "informativeness" of each candidate, such as Exploration Guided Active Learning (EGAL), which we refer to as model-free selection strategies [10]. In this paper, we adopt several commonly used query strategies, namely Random Sampling (sampling i.i.d. from U), Uncertainty Sampling [26], Query-by-Committee [30], Information Density (ID) [29], and EGAL [10], to alleviate the influence of any one selection strategy.
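
Uncertainty sampling with a linear SVM can be sketched as follows, using synthetic data in place of document vectors: the most uncertain pool instances are those with the smallest absolute decision-function value, i.e. closest to the separating hyperplane.

```python
# Illustrative sketch of uncertainty sampling with a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(40, 5))
y_lab = (X_lab[:, 0] > 0).astype(int)        # simulated oracle labels
X_pool = rng.normal(size=(200, 5))           # unlabelled pool U

clf = LinearSVC().fit(X_lab, y_lab)
margins = np.abs(clf.decision_function(X_pool))
query = np.argsort(margins)[:10]             # batch of 10 most uncertain
```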

3 Experimental Design

This section describes the design of an experiment performed to evaluate the effectiveness of different text representation mechanisms in active learning. To mitigate the influence of different selection strategies on the performance of the active learning process, we include a number of different selection strategies in the experiment. This section describes the experimental framework, the configuration of the models used, the performance measures used to judge the effectiveness of different approaches, and the datasets used in the experiments.

3.1 Active Learning Framework

We apply pool-based active learning using different text representation techniques and query strategies over several well-balanced, fully labelled datasets. All datasets are binary classification problems. The use of fully labelled datasets allows us to simulate data labelling by a human oracle, and is common in active learning research [10, 21, 8, 7, 6]. At the outset, we provide all learners with the same 10 instances (i.e. 5 positive and 5 negative) sampled i.i.d. at random from a dataset to seed the active learning process. Subsequently, 10 unlabelled instances, whose ground-truth labels are revealed to each learner, are selected according to a given selection strategy. These examples are moved from U to L and the classifiers are retrained. We assume it is unrealistic to collect more than 1,000 labels from an oracle, and so we stop the procedure when an annotation budget of 1,000 labels is used up. As the batch size for selection is 10, each experiment comprises 100 rounds of the active learning process. Each experiment is repeated 10 times using different random seeds, and the performance measures reported are averaged across these repetitions.
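
This simulated loop can be sketched as follows; synthetic data stands in for the document vectors, a 100-label budget stands in for the paper's 1,000, and random sampling stands in for the query strategies of Section 2.3:

```python
# Illustrative sketch of the simulated pool-based active learning loop:
# seed with 5 positive and 5 negative instances, then repeatedly query a
# batch of 10 and retrain until the label budget is spent.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # fully labelled "oracle" data

labelled = np.where(y == 1)[0][:5].tolist() + np.where(y == 0)[0][:5].tolist()
pool = [i for i in range(len(X)) if i not in set(labelled)]
budget, batch = 100, 10

clf = LinearSVC().fit(X[labelled], y[labelled])
while len(labelled) < budget:
    picked = rng.choice(pool, size=batch, replace=False)   # random "query"
    labelled += picked.tolist()
    pool = [i for i in pool if i not in set(picked.tolist())]
    clf = LinearSVC().fit(X[labelled], y[labelled])        # retrain
```

In the final-model step, `clf` would be applied to the remaining pool instances to complete the labelling of the dataset.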

3.2 Model Configuration

We evaluate the performance of active learning using linear SVM models, which have been shown empirically to perform well with high-dimensional data [32]. We tune the hyper-parameters of the SVM models every 10 iterations (i.e. every 100 labels requested). We preprocess text data by converting to lowercase, removing stop words, and removing rare terms (for the whole dataset, word count less than 10 or document frequency less than 5). We found this preprocessing improves the performance of BOW but has a negligible effect on word embeddings, hence we skip preprocessing for word embeddings. We set the number of topics used by LDA to 300 following [8]. For the FastText and FastText_trained representations, we use the pre-trained subword FastText model trained on Wikipedia (300 dimensions). For BERT, we use the bert-large-uncased model (1,024 dimensions). Since BERT is configured to take as input a maximum of 512 tokens, we divide longer sequences into fractions, which are then fed to BERT to infer per-fraction representations (each fraction has a "[CLS]" token in front of 511 tokens, i.e. 512 tokens in total). The vector of each fraction is the average of the embeddings of the words in that fraction, and the representation of the whole text sequence is the mean of all fraction vectors. It should be noted that we do not use any label information for fine-tuning models, to ensure fair comparisons. A summary of each representation is given in Table 1.
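
The splitting step for long documents can be sketched as follows; embedding the fractions with BERT itself is out of scope here, so this shows only how a token sequence is divided into "[CLS]"-prefixed fractions of at most 512 tokens:

```python
# Illustrative sketch: split a long token sequence into fractions of 511
# tokens, each with a "[CLS]" token prepended (512 tokens in total).
def fractions(tokens, max_len=512):
    body = max_len - 1                 # leave room for the [CLS] token
    return [["[CLS]"] + tokens[i:i + body] for i in range(0, len(tokens), body)]

tokens = ["tok%d" % i for i in range(1200)]
chunks = fractions(tokens)             # 1200 tokens -> bodies of 511, 511, 178
```

Each fraction would then be embedded separately, averaged within the fraction, and the fraction vectors averaged to give the document representation.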

In uncertainty sampling, the most uncertain examples are those closest to the class-separating hyperplane in the context of an SVM [33]. In the information density selection strategy, we use entropy to measure "informativeness" and all parameters are set following [29]. In QBC, we use linear SVM models trained using bagging as committee members, following [34]. Since there is no general agreement in the literature on the appropriate committee size for QBC [1], we adopt a committee size of 5 after some preliminary experiments. In EGAL, all parameters are set following the recommendations in [10], which are shown to perform well for text classification tasks.
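
The QBC setup can be sketched as follows on synthetic data: a committee of 5 linear SVMs trained on bootstrap samples (bagging), with disagreement on the pool measured by vote entropy (one reasonable disagreement measure; the paper does not specify which it uses).

```python
# Illustrative sketch of query-by-committee with a bagged SVM committee.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X_lab = rng.normal(size=(40, 5))
y_lab = (X_lab[:, 0] > 0).astype(int)
X_pool = rng.normal(size=(200, 5))

committee = []
for _ in range(5):
    idx = rng.integers(0, len(X_lab), size=len(X_lab))   # bootstrap sample
    committee.append(LinearSVC().fit(X_lab[idx], y_lab[idx]))

votes = np.stack([m.predict(X_pool) for m in committee])
p = votes.mean(axis=0)                                   # vote share for class 1
entropy = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
query = np.argsort(-entropy)[:10]                        # most-contested batch
```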

3.3 Performance Measures

As we are interested in the ability of an active learning process to fully label a dataset, we use the accuracy+ performance measure, which has been previously used in [10]. This measures the performance of the full active learning system, including the human annotator. It can be expressed as:

accuracy+ = (N^h + TP^m + TN^m) / N

where N is the total number of instances in a dataset, the superscripts h and m denote human-annotated and machine-generated labels respectively, N^h is the number of instances labelled by the human annotator, and TP^m and TN^m denote the number of true positives and true negatives among the machine-labelled instances. Intuitively, this metric computes the fraction of correctly labelled instances, whether labelled by the oracle or predicted by the trained classifier. We presume that a human annotator never makes mistakes. We also report the area under the learning curve (AULC) score for each accuracy+ curve, computed using the trapezoidal rule and normalised by the maximum possible area to bound the value between 0 and 1.
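The AULC computation can be sketched as follows, with invented curve values for illustration:

```python
# Illustrative sketch: trapezoidal area under an accuracy+ learning curve,
# normalised by the maximum possible area (perfect accuracy+ everywhere).
import numpy as np

def aulc(num_labelled, acc_plus):
    x = np.asarray(num_labelled, dtype=float)
    y = np.asarray(acc_plus, dtype=float)
    area = float(np.sum(0.5 * (y[1:] + y[:-1]) * (x[1:] - x[:-1])))
    return area / float(x[-1] - x[0])     # max area = curve width * 1.0

score = aulc([10, 20, 30, 40], [0.6, 0.7, 0.8, 0.8])
```

Here the trapezoidal area is (0.65 + 0.75 + 0.8) * 10 = 22.0 over a maximum possible area of 30, giving a score of about 0.733.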

3.4 Datasets

We evaluate the performance of active learning systems using 8 fully labelled datasets. Four of these datasets are based on long text sequences: Movie Review (MR) [35], Multi-Domain Customer Review (MDCR) [36], Blog Author Gender (BAG) [39], and Guardian2013 (G2013) [37]. The other four are based on sentences: Additional Customer Review (ACR) [38], Movie Review Subjectivity (MRS) [35], AG News (AGN), and DBpedia (DBP) [40]. Table 1 provides summary statistics describing each dataset.

# of Instances Representation Dimensionality
Dataset positives negatives TF TFIDF LDA FT FT_T BERT
MR 1,000 1,000 6,181 6,181 300 300 300 1,024
MDCR 4,000 3,566 4,165 4,165 300 300 300 1,024
BAG 1,675 1,552 4,936 4,936 300 300 300 1,024
G2013 843 1,292 5,345 5,345 300 300 300 1,024
ACR 1,335 736 403 403 300 300 300 1,024
MRS 5,000 5,000 1,868 1,868 300 300 300 1,024
AGN 1,000 1,000 723 723 300 300 300 1,024
DBP 1,000 1,000 552 552 300 300 300 1,024
Table 1: Statistics of the 4 document and 4 sentence datasets. The left column set gives the number of positive and negative instances in each dataset; the right column set gives the vector length of each representation for each dataset. FT and FT_T denote FastText and FastText_trained respectively.

4 Results and Discussion

Figure 1: Results over the Multi-Domain Customer Review (MDCR) dataset for each selection strategy: (a) Random, (b) Uncertainty, (c) Information Density, (d) QBC, (e) EGAL. The x-axis represents the number of documents that have been manually annotated and the y-axis denotes accuracy+. Each curve starts at 10 along the x-axis.
Figure 2: Results over the Movie Review Subjectivity (MRS) dataset for each selection strategy: (a) Random, (b) Uncertainty, (c) Information Density, (d) QBC, (e) EGAL. The x-axis represents the number of documents that have been manually annotated and the y-axis denotes accuracy+. Each curve starts at 10 along the x-axis.

To illustrate the performance differences observed between the different representations explored, Figures 1 and 2 show the learning curves for each representation (separated by selection strategy) for the MDCR and MRS datasets respectively. (Similar figures for the other 6 datasets can be found at a URL hidden for anonymous review.) In these plots the horizontal axis denotes the number of instances labelled so far, and the vertical axis denotes the accuracy+ score achieved. It should be noted that each curve starts at 10 rather than 0 along the horizontal axis, corresponding to the initial seed labelling described earlier.

Generally speaking, we observe that better performance is achieved when active learning is used in combination with a text representation based on word embeddings rather than the simpler vector-based text representations (i.e. TF and TF-IDF) or those based on topic modelling (i.e. LDA). More specifically, in Figure 1, we observe that BERT consistently outperforms every other representation by a reasonably large margin across all query strategies. Another interesting observation is that FastText, FastText_trained, and TF-IDF have similar performance, and LDA performs worst in all situations. In Figure 2, we see a similar pattern: in the majority of cases, the performance of the approaches based on BERT surpasses that achieved using other representations (except for EGAL, where FastText_trained gives the best performance). In addition, the remaining two word embeddings (FastText and FastText_trained) behave close to BERT under many query strategies, exceeding TF and TF-IDF by a large margin. Again, LDA performs poorly under all query strategies.

We summarize the results of all methods in Table 3. In this table, each column denotes the performance of the different active learning processes on a specific dataset. Different representation and selection strategy combinations are compared and the best results achieved for each dataset are highlighted. The numbers in brackets give the ranking of each method on a specific dataset, and the last column reports the average ranking of each representation-selection-strategy combination, where a smaller number means a higher rank.

Table 3 presents a very clear message: the word embedding representations perform well across all datasets, as evidenced by their relatively higher ranks compared to TF, TF-IDF, and LDA. Overall, BERT is the best performing representation, with the average ranks of 2.81 for BERT + uncertainty, 4.50 for BERT + information density, and 4.81 for BERT + QBC being the highest overall.

As suggested by [41], the Wilcoxon signed-rank test was applied for pairwise comparisons of the mean ranks between methods. The table of p values for all pairwise comparisons can be found on GitHub. As QBC is the best-performing query strategy across all text representations in Table 3, we list only the p values and win/draw/lose counts of each combination involving QBC in Table 2. The table demonstrates that all embedding-based methods are significantly different from methods based on TF, TF-IDF, and LDA (p < 0.05). However, the embedding-based methods are not significantly different from each other. Remarkably, BERT achieves the most wins compared to any other representation.
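
The test can be sketched as follows, pairing the AULC means of the BERT + QBC and TF-IDF + QBC rows of Table 3 across the 8 datasets (the paper's exact test configuration may differ):

```python
# Illustrative sketch of the pairwise Wilcoxon signed-rank test on paired
# AULC scores across the 8 datasets.
from scipy.stats import wilcoxon

bert_qbc  = [0.892, 0.818, 0.714, 0.971, 0.830, 0.919, 0.982, 0.988]
tfidf_qbc = [0.854, 0.713, 0.676, 0.965, 0.778, 0.812, 0.932, 0.952]
stat, p = wilcoxon(bert_qbc, tfidf_qbc)   # two-sided by default
```

BERT + QBC beats TF-IDF + QBC on all 8 datasets, so the test rejects the hypothesis of equal performance at the 0.05 level.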

Figure 3: t-SNE visualisations of customer reviews from MDCR (Figures 3(a) BERT, 3(b) TF-IDF, 3(c) LDA) and movie reviews from MRS (Figures 3(d) BERT, 3(e) TF-IDF, 3(f) LDA) in the corresponding feature spaces. Red squares and blue crosses indicate reviews of different classes.
        BERT     FT       FT_T     LDA      TF-IDF   TF
BERT    -        6/0/2    5/0/3    8/0/0    8/0/0    8/0/0
FT      0.0687   -        3/1/4    8/0/0    8/0/0    8/0/0
FT_T    0.0929   0.4982   -        8/0/0    7/0/1    7/0/1
LDA     0.0117   0.0117   0.0117   -        0/0/8    1/0/7
TF-IDF  0.0117   0.0117   0.0173   0.0117   -        7/0/1
TF      0.0117   0.0116   0.0173   0.0251   0.0117   -
Table 2: P values (lower triangle) and win/draw/lose counts (upper triangle) for pairwise comparisons of the QBC-based methods.
Rep Strategy MR MDCR BAG G2013 ACR MRS AGN DBP Rank
BERT random 0.857±0.022 (7.0) 0.805±0.009 (3.0) 0.717±0.008 (3.0) 0.946±0.002 (19.0) 0.789±0.003 (5.0) 0.899±0.011 (9.0) 0.962±0.003 (11.0) 0.976±0.006 (14.0) 8.88
uncertainty 0.897±0.024 (1.0) 0.823±0.009 (1.0) 0.728±0.009 (2.0) 0.976±0.001 (4.5) 0.823±0.007 (2.0) 0.920±0.009 (2.0) 0.985±0.002 (3.0) 0.989±0.003 (7.0) 2.81
ID 0.857±0.008 (8.0) 0.772±0.004 (4.0) 0.735±0.005 (1.0) 0.975±0.002 (7.0) 0.822±0.009 (3.0) 0.932±0.001 (1.0) 0.985±0.001 (4.0) 0.988±0.004 (8.0) 4.50
EGAL 0.853±0.025 (11.0) 0.716±0.030 (14.0) 0.665±0.011 (23.0) 0.941±0.003 (24.0) 0.769±0.009 (13.0) 0.875±0.006 (14.0) 0.957±0.007 (14.0) 0.973±0.006 (15.0) 16.00
QBC 0.892±0.026 (2.0) 0.818±0.010 (2.0) 0.714±0.010 (4.0) 0.971±0.002 (11.5) 0.830±0.005 (1.0) 0.919±0.009 (3.0) 0.982±0.001 (6.0) 0.988±0.000 (9.0) 4.81
FT random 0.821±0.006 (20.0) 0.724±0.007 (9.5) 0.705±0.006 (8.0) 0.950±0.002 (16.0) 0.740±0.011 (24.5) 0.888±0.003 (13.0) 0.953±0.002 (15.0) 0.979±0.007 (13.0) 14.88
uncertainty 0.853±0.008 (12.0) 0.728±0.014 (6.0) 0.701±0.018 (10.0) 0.980±0.003 (2.0) 0.776±0.018 (8.0) 0.894±0.011 (12.0) 0.979±0.005 (9.0) 0.992±0.004 (5.0) 8.00
ID 0.852±0.008 (14.0) 0.720±0.008 (12.0) 0.697±0.015 (12.0) 0.981±0.002 (1.0) 0.776±0.012 (9.0) 0.898±0.006 (10.0) 0.981±0.004 (7.0) 0.990±0.006 (6.0) 8.88
EGAL 0.814±0.005 (22.0) 0.647±0.041 (23.0) 0.656±0.018 (26.0) 0.947±0.003 (17.0) 0.737±0.009 (26.0) 0.859±0.031 (15.0) 0.961±0.004 (12.0) 0.980±0.007 (12.0) 19.12
QBC 0.857±0.005 (9.0) 0.724±0.007 (9.5) 0.706±0.013 (6.0) 0.976±0.001 (4.5) 0.788±0.007 (6.0) 0.904±0.002 (7.0) 0.981±0.001 (8.0) 0.994±0.001 (1.5) 6.44
FT_T random 0.819±0.005 (21.0) 0.726±0.009 (7.0) 0.711±0.006 (5.0) 0.946±0.004 (18.0) 0.750±0.009 (18.0) 0.902±0.001 (8.0) 0.959±0.001 (13.0) 0.982±0.007 (11.0) 12.62
uncertainty 0.847±0.012 (16.0) 0.725±0.007 (8.0) 0.706±0.011 (7.0) 0.978±0.004 (3.0) 0.775±0.022 (10.0) 0.908±0.004 (5.0) 0.985±0.003 (2.0) 0.993±0.004 (3.0) 6.75
ID 0.844±0.009 (17.0) 0.721±0.007 (11.0) 0.698±0.014 (11.0) 0.975±0.003 (6.0) 0.772±0.017 (11.0) 0.905±0.004 (6.0) 0.987±0.001 (1.0) 0.992±0.005 (4.0) 8.38
EGAL 0.805±0.005 (23.0) 0.656±0.038 (22.0) 0.666±0.007 (21.0) 0.942±0.004 (23.0) 0.740±0.013 (23.0) 0.898±0.005 (11.0) 0.963±0.003 (10.0) 0.985±0.004 (10.0) 17.88
QBC 0.851±0.005 (15.0) 0.730±0.008 (5.0) 0.702±0.014 (9.0) 0.975±0.001 (8.5) 0.793±0.005 (4.0) 0.909±0.003 (4.0) 0.985±0.000 (5.0) 0.994±0.001 (1.5) 6.50
LDA random 0.772±0.006 (28.0) 0.611±0.006 (26.0) 0.669±0.005 (20.0) 0.884±0.012 (29.0) 0.733±0.006 (27.0) 0.680±0.003 (27.0) 0.854±0.004 (29.0) 0.858±0.006 (29.0) 26.88
uncertainty 0.791±0.006 (27.0) 0.613±0.010 (25.0) 0.673±0.015 (18.0) 0.922±0.012 (26.0) 0.754±0.011 (16.0) 0.671±0.013 (28.0) 0.877±0.017 (26.0) 0.881±0.011 (27.0) 24.12
ID 0.796±0.005 (26.0) 0.606±0.006 (28.0) 0.672±0.005 (19.0) 0.914±0.010 (28.0) 0.750±0.010 (17.0) 0.620±0.008 (29.0) 0.862±0.010 (28.0) 0.863±0.012 (28.0) 25.38
EGAL 0.760±0.007 (30.0) 0.600±0.008 (29.0) 0.648±0.009 (28.0) 0.858±0.014 (30.0) 0.720±0.008 (29.0) 0.611±0.015 (30.0) 0.835±0.006 (30.0) 0.805±0.018 (30.0) 29.50
QBC 0.770±0.009 (29.0) 0.607±0.009 (27.0) 0.675±0.008 (16.0) 0.917±0.008 (27.0) 0.763±0.008 (14.0) 0.689±0.006 (26.0) 0.901±0.005 (22.0) 0.905±0.006 (24.0) 23.12
TF-IDF random 0.837±0.003 (18.0) 0.708±0.012 (16.0) 0.666±0.006 (22.0) 0.945±0.002 (20.0) 0.740±0.011 (24.5) 0.808±0.007 (18.0) 0.887±0.011 (24.0) 0.912±0.014 (23.0) 20.69
uncertainty 0.871±0.005 (3.0) 0.719±0.009 (13.0) 0.684±0.006 (13.0) 0.975±0.001 (8.5) 0.758±0.017 (15.0) 0.807±0.016 (19.0) 0.919±0.017 (18.0) 0.943±0.021 (18.0) 13.44
ID 0.862±0.004 (6.0) 0.696±0.004 (17.0) 0.683±0.003 (14.0) 0.971±0.002 (11.5) 0.744±0.013 (22.0) 0.803±0.016 (20.0) 0.915±0.010 (19.0) 0.926±0.018 (20.5) 16.25
EGAL 0.800±0.008 (24.0) 0.642±0.043 (24.0) 0.637±0.009 (29.0) 0.942±0.005 (22.0) 0.745±0.009 (21.0) 0.809±0.008 (17.0) 0.895±0.012 (23.0) 0.912±0.016 (22.0) 22.75
QBC 0.854±0.012 (10.0) 0.713±0.007 (15.0) 0.676±0.007 (15.0) 0.965±0.002 (14.0) 0.778±0.005 (7.0) 0.812±0.005 (16.0) 0.932±0.004 (16.0) 0.952±0.007 (16.0) 13.62
TF random 0.832±0.004 (19.0) 0.674±0.009 (21.0) 0.651±0.007 (27.0) 0.943±0.003 (21.0) 0.729±0.007 (28.0) 0.802±0.007 (22.0) 0.880±0.012 (25.0) 0.902±0.017 (25.0) 23.50
uncertainty 0.868±0.003 (4.0) 0.683±0.008 (19.0) 0.660±0.007 (24.0) 0.973±0.001 (10.0) 0.746±0.016 (20.0) 0.797±0.022 (23.0) 0.911±0.012 (20.0) 0.937±0.029 (19.0) 17.38
ID 0.867±0.002 (5.0) 0.694±0.004 (18.0) 0.675±0.002 (17.0) 0.970±0.001 (13.0) 0.748±0.011 (19.0) 0.796±0.008 (24.0) 0.911±0.006 (21.0) 0.926±0.018 (20.5) 17.19
EGAL 0.799±0.016 (25.0) 0.559±0.011 (30.0) 0.604±0.010 (30.0) 0.937±0.002 (25.0) 0.716±0.026 (30.0) 0.795±0.015 (25.0) 0.863±0.011 (27.0) 0.900±0.017 (26.0) 27.25
QBC 0.853±0.004 (13.0) 0.678±0.011 (20.0) 0.658±0.008 (25.0) 0.963±0.002 (15.0) 0.770±0.006 (12.0) 0.803±0.007 (21.0) 0.929±0.004 (17.0) 0.948±0.009 (17.0) 17.50
Table 3: Summary of AULC scores, computed by the trapezoidal rule and normalised by the maximum possible area, on each dataset for the different combinations of text representations and query strategies. The best performance on each dataset is highlighted, and the last column denotes the average ranking of each method, where a smaller number indicates a higher rank.

4.1 Analysis of Different Representations

The previous experiments illustrate the superior performance of word embeddings, especially BERT, in active learning. To provide insight into the impact of different representations, we examine their numerical feature spaces, comparing BERT, TF-IDF, and LDA. Figure 3 shows visualisations of instances from the Multi-Domain Customer Review (MDCR) and Movie Review Subjectivity (MRS) datasets generated using t-SNE [42] based on the BERT, TF-IDF, and LDA representations. We can see that for the BERT representation, instances of the same class tend to cluster near to each other and there is good separation between instances of the two classes (see Figures 3(a) and 3(d)), even though no label information is used in generating the BERT representation or these visualisations. For the equivalent TF-IDF (Figures 3(b) and 3(e)) and LDA (Figures 3(c) and 3(f)) visualisations, the classes are less well clustered and overlap much more. This ability of BERT to generate instance representations that are easily separable is likely to contribute strongly to its ability to produce highly performing active learning systems. We suppose that the unsatisfactory performance of the topical representation indicates that each class is likely to contain a mixture of most of the topics in these datasets.
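
The projection step can be sketched as follows; random vectors stand in for the BERT / TF-IDF / LDA document representations of the real datasets:

```python
# Illustrative sketch: project high-dimensional document vectors to 2-D
# with t-SNE for visualisation.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))            # 60 "documents", 50-dim features
emb = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(X)
```

Each row of `emb` gives the 2-D coordinates of one document, which would then be plotted with a marker per class.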

5 Conclusions

Active learning processes used with text data rely heavily on the document representation mechanism used. This paper presented an evaluation experiment which explored the effectiveness of different text representations in an active learning context. The performance of different text representation techniques combined with popular selection strategies was compared over datasets from different domains to investigate a general active learning framework for data labelling tasks. We found that the embedding-based representations, which are rarely used in active learning, lead to better performance than vector-based representations. Several of the most commonly used query strategies were applied in the experiments to provide a more convincing argument. Notably, BERT combined with uncertainty sampling greatly facilitates the application of active learning for text labelling. Hence, we suggest BERT with uncertainty sampling as the default framework, while BERT with QBC/ID and FastText_trained with QBC can be alternatives for text classification in the context of labelling tasks in some cases.

An important application of active learning is labelling the included/excluded studies in literature reviews [12, 23, 21], which typically give rise to imbalanced datasets. This motivates further exploration of the active learning framework over imbalanced datasets in future work.


  • [1] Settles, B., 2009. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences.
  • [2] Tong, S. and Chang, E., 2001, October. Support vector machine active learning for image retrieval. In Proceedings of the ninth ACM international conference on Multimedia (pp. 107-118). ACM.
  • [3] Zhang, C. and Chen, T., 2002. An active learning framework for content-based information retrieval. IEEE transactions on multimedia, 4(2), pp.260-268.
  • [4] Hoi, S.C., Jin, R. and Lyu, M.R., 2006, May. Large-scale text categorization by batch mode active learning. In Proceedings of the 15th international conference on World Wide Web (pp. 633-642). ACM.
  • [5] Liere, R. and Tadepalli, P., 1997, July. Active learning with committees for text categorization. In AAAI/IAAI (pp. 591-596).
  • [6] Zhang, Y., Lease, M. and Wallace, B.C., 2017, February. Active discriminative text representation learning. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [7] Zhao, W., 2017. Deep Active Learning for Short-Text Classification.
  • [8] Singh, G., Thomas, J. and Shawe-Taylor, J., 2018. Improving active learning in systematic reviews. arXiv preprint arXiv:1801.09496.
  • [9] Tur, G., Hakkani-Tür, D. and Schapire, R.E., 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45(2), pp.171-186.
  • [10] Hu, R., Delany, S.J. and Mac Namee, B., 2010, July. EGAL: Exploration guided active learning for TCBR. In International Conference on Case-Based Reasoning (pp. 156-170). Springer, Berlin, Heidelberg.
  • [11] Hu, R., Mac Namee, B. and Delany, S.J., 2008. Sweetening the dataset: Using active learning to label unlabelled datasets.
  • [12] Wallace, B.C., Small, K., Brodley, C.E. and Trikalinos, T.A., 2010, July. Active learning for biomedical citation screening. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 173-182). ACM.
  • [13] Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [14] Pennington, J., Socher, R. and Manning, C., 2014, October. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
  • [15] Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T., 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, pp.135-146.
  • [16] Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T., 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
  • [17] Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H. and Mikolov, T., 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
  • [18] Howard, J. and Ruder, S., 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
  • [19] Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf.
  • [20] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [21] Hashimoto, K., Kontonatsios, G., Miwa, M. and Ananiadou, S., 2016. Topic detection using paragraph vectors to support active learning in systematic reviews. Journal of biomedical informatics, 62, pp.59-65.
  • [22] Siddhant, A. and Lipton, Z.C., 2018. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. arXiv preprint arXiv:1808.05697.
  • [23] Miwa, M., Thomas, J., O’Mara-Eves, A. and Ananiadou, S., 2014. Reducing systematic review workload through certainty-based screening. Journal of biomedical informatics, 51, pp.242-253.
  • [24] Mo, Y., Kontonatsios, G. and Ananiadou, S., 2015. Supporting systematic reviews using LDA-based document representations. Systematic reviews, 4(1), p.172.
  • [25] Zhu, J., Wang, H., Yao, T. and Tsou, B.K., 2008. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1 (pp. 1137-1144). Association for Computational Linguistics.
  • [26] Lewis, D.D. and Gale, W.A., 1994. A sequential algorithm for training text classifiers. In SIGIR’94 (pp. 3-12). Springer, London.
  • [27] Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), pp.993-1022.
  • [28] Shannon, C.E., 1948. A mathematical theory of communication. Bell system technical journal, 27(3), pp.379-423.
  • [29] Settles, B. and Craven, M., 2008, October. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the conference on empirical methods in natural language processing (pp. 1070-1079). Association for Computational Linguistics.
  • [30] Seung, H.S., Opper, M. and Sompolinsky, H., 1992, July. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory (pp. 287-294). ACM.
  • [31] Dagan, I. and Engelson, S.P., 1995. Committee-based sampling for training probabilistic classifiers. In Machine Learning Proceedings 1995 (pp. 150-157). Morgan Kaufmann.
  • [32] Hsu, C.W., Chang, C.C. and Lin, C.J., 2003. A practical guide to support vector classification.
  • [33] Tong, S. and Koller, D., 2000. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov), pp.45-66.
  • [34] Abe, N. and Mamitsuka, H., 1998, July. Query learning strategies using boosting and bagging. In Machine Learning: Proceedings of the Fifteenth International Conference (ICML’98) (Vol. 1). Morgan Kaufmann.
  • [35] Pang, B. and Lee, L., 2004, July. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics (p. 271). Association for Computational Linguistics.
  • [36] Blitzer, J., Dredze, M. and Pereira, F., 2007, June. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th annual meeting of the association of computational linguistics (pp. 440-447).
  • [37] Belford, M., Mac Namee, B. and Greene, D., 2018. Stability of topic modeling via matrix factorization. Expert Systems with Applications, 91, pp.159-169.
  • [38] Ding, X., Liu, B. and Yu, P.S., 2008, February. A holistic lexicon-based approach to opinion mining. In Proceedings of the 2008 international conference on web search and data mining (pp. 231-240). ACM.
  • [39] Mukherjee, A. and Liu, B., 2010, October. Improving gender classification of blog authors. In Proceedings of the 2010 conference on Empirical Methods in natural Language Processing (pp. 207-217). Association for Computational Linguistics.
  • [40] Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657).
  • [41] Benavoli, A., Corani, G. and Mangili, F., 2016. Should we really use post-hoc tests based on mean-ranks?. The Journal of Machine Learning Research, 17(1), pp.152-161.
  • [42] Maaten, L.V.D. and Hinton, G., 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), pp.2579-2605.
  • [43] McCann, B., Bradbury, J., Xiong, C. and Socher, R., 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems (pp. 6294-6305).
  • [44] Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • [45] Zhang, Y., 2019. Neural NLP models under low-supervision scenarios (Doctoral dissertation).