Study of Methods for Abstract Screening in a Systematic Review Platform
A major task in systematic reviews is abstract screening: excluding hundreds or thousands of irrelevant citations, returned from one or several database searches, based on titles and abstracts. Most earlier efforts on studying systematic review methods for abstract screening evaluate the existing technologies in isolation, based on findings reported in the published literature. In general, there is no attempt to discuss and understand how these technologies would be rolled out in an actual systematic review system. In this paper, we evaluate a collection of commonly used abstract screening methods over a wide spectrum of metrics on a large set of reviews collected from Rayyan, a comprehensive systematic review platform. We also provide an equivalence grouping of the existing methods through a solid statistical test. Furthermore, we propose a new ensemble algorithm for producing a 5-star rating for a citation based on its relevance in a systematic review.
In our comparison of the different methods, we observe that, for each of the three prevalence groups, there is almost always a method that ranks first. However, there is no single dominant method across all metrics: various methods perform well for different prevalence groups and for different metrics. Thus, a holistic "composite" strategy is a better choice in a real-life abstract screening system. We indeed observe that the proposed ensemble algorithm combines the best of the evaluated methods and outputs an improved prediction. It is also substantially more interpretable, thanks to its 5-star based rating.
keywords: Abstract Screening Platform, Systematic Review.
Randomized controlled trials (RCTs) constitute a key component of medical research, and they are by far the best way of achieving results that increase our knowledge about treatment effectiveness Chalmers:81. Because of the large number of RCTs conducted and reported by different medical research groups, it is difficult for an individual to glean summary evidence from all the RCTs. The objective of systematic reviews is to overcome this difficulty by synthesizing the results from multiple RCTs.
Systematic reviews (SRs) involve multiple steps. Firstly, systematic reviewers formulate a research question and search multiple biomedical databases. Secondly, they identify relevant RCTs based on abstracts and titles (abstract screening). Thirdly, based on the full texts of a subset thereof, they assess methodological quality, extract different data elements, and synthesize them. Finally, they report conclusions on the review question Uman:11. However, the precision of the search step of an SR process is usually low, and in most cases it returns a large number of citations, from hundreds to thousands. SR authors have to manually screen each of these citations, making the task tedious and time-consuming. Therefore, a platform that can automate the abstract screening process is fundamental to expediting systematic reviews. To fulfill this need, we have built Rayyan (https://rayyan.qcri.org/), a web and mobile application for SRs ouzzani2016rayyan. As of November 2017, Rayyan users exceeded 8,000 from over 100 countries, conducting hundreds of reviews totaling more than 11M citations.
A primary objective of Rayyan is to expedite the abstract screening process. The first step is to upload a set of citations obtained from searches in one or more databases. Once citations are processed and deduplicated, Rayyan provides an easy-to-use interface with which systematic reviewers can label the citations as relevant or irrelevant. As soon as the reviewer labels a certain number of citations from both classes (relevant and irrelevant), Rayyan runs its 5-star rating algorithm and sorts the unlabeled citations based on their relevance. As the reviewer continues to label citations, the system may trigger multiple runs of the 5-star rating algorithm. The process ends when all the citations are labeled. Therefore, one of the design requirements of the system is the ability to quickly give users feedback on the relevance of studies. Figure 1 shows a pictorial depiction of our system.
Earlier efforts on studying systematic review methods in the biomedical Alison.Thomas:15 and software engineering Olorisade.De:16 domains evaluate existing technologies for abstract screening based on findings presented in the published literature. As pointed out by Alison.Thomas:15, these evaluations do not make it clear how the technologies would be rolled out for use in an actual review system. In this study, we present our insights into designing methods (from both feature representation and algorithm design perspectives) for a light-weight abstract screening platform, Rayyan (see Section 3 for challenges and Section LABEL:sec:discussion for study limitations).
A key difference of our evaluation, compared to existing studies such as Olorisade.De:16; Alison.Thomas:15, is that our reviews come from an actual deployed systematic review system. In addition, many of the results presented in the literature are hard to generalize as they do not provide performance over multiple datasets and over different prevalence groups. We thus collect a large set of reviews from the Rayyan platform and evaluate a large set of methods based on their practical utility in solving the class imbalance problem. Additionally, to give clear insight into the class imbalance problem, we divide the dataset into three prevalence groups, perform well-defined statistical tests, and present our findings over a large set of metrics. The results of our evaluation give insight into the best approach for abstract screening in a real-life environment. Besides evaluating the existing methods, we also propose an ensemble method for the task of abstract screening that presents its prediction results through a 5-star rating scale. We make our detailed results and related resources available online (https://tksaha.github.io/paper_resources/).
The primary aim of this study is to assess the performance of widely used SVM-based methods on the abstract screening task, using a large set of reviews from a real abstract screening platform. This paper therefore addresses the following research questions from an abstract screening system design perspective:
What are the challenges of using different feature space representations?
What kind of algorithms are suitable?
How do SVM-based methods perform on different metrics over a large set of reviews and different class imbalance ratios?
What aspects should be considered for designing a new algorithm for abstract screening?
Can an ensemble method improve performance?
The remainder of this paper is organized as follows. In Section 2, we discuss related work. In Section 3, we provide an overview of the SVM-based methods used in the evaluation and details of the proposed 5-star rating method. Section LABEL:sec:exp_result presents the experimental results. In Section LABEL:sec:discussion, we discuss our key findings. We conclude in Section LABEL:sec:conclusion.
2 Related Work
We organize the related work into the following four groups: 1) work on “abstract screening” methods; 2) methodologies for handling “data imbalance” in classification, one of the main challenges in the abstract screening process; 3) methodologies for “active learning”, a popular approach for addressing data imbalance and the user’s labeling burden; and finally, 4) work on “linear review” from the legal domain, which bears some similarity to abstract screening in systematic reviews.
2.1 Abstract Screening
A large body of past research has focused on automating the abstract screening process Ambert:10; Cohen:08; Cohen.Ambert:12; Cohen.Hersh:06; Khabsa.Elmagarmid:15; Miwa.Thomas:14. In terms of feature representation, most of the existing approaches Cohen:08; Cohen.Hersh:06 use unigrams, bigrams, and MeSH (Medical Subject Headings). An alternative to MeSH terms is to extract LDA based latent topics from the titles and abstracts and use them as features Miwa.Thomas:14. Other methods Cohen:08; Khabsa.Elmagarmid:15 utilize external information as features, such as citations and co-citations. In terms of classification methods, SVM-based learning algorithms are commonly used Ambert:10; Cohen:08; Cohen.Ambert:12. According to a recent study Olorisade.De:16, different types of algorithms, including SVM, decision trees, naive Bayes, k-nearest neighbor, and neural networks, have been proposed in existing works on abstract screening. To the best of our knowledge, none of the existing works use the structured output version of SVM, SVM-perf Joachims:05, with different loss functions, which we have considered in this study. Note that SVM-perf's learning is faster than a transductive SVM's Joachim:99. For prediction, the former only keeps feature weights, which makes its prediction module very fast. On the other hand, transductive SVM takes both labeled and unlabeled citations into account, so its learning is slower than SVM-perf's. Wallace.Kuiper:16 uses a distant supervision technique to extract PICO (Population/Problem, Intervention, Comparator, and Outcome) sentences from clinical trial reports. Distant supervision trains a model using previously conducted reviews; for the abstract screening task, no such database exists for a particular review, so we do not use distant supervision. In this work, we evaluate various SVM-based classification methods with different loss functions over a set of abstract screening tasks.
2.2 Data Imbalance
Data imbalance in supervised classification is a well studied problem Aggarwal:14; Joachim:99; Joachims:05; Shalev-Shwartz.Singer:11; Sun.Deng:12. Two kinds of approaches have been proposed to solve it. Methods of the first kind focus on designing suitable loss functions, such as KL divergence Esuli.Sebastiani:15, quadratic mean Liu.Chawla:11, cost sensitive classification Elkan:01, mean average precision Yue.Finley:07, random forest with meta-cost Chen.Liaw:04, and AUC Joachims:05. Methods of the second kind generate synthetic data for artificially balancing the ratio of labeled relevant and irrelevant citations. Examples of such methods are borderline-SMOTE Han.Wang:05, safe-level-SMOTE Bunkhumpornpat.Sinapiromsaran:09, and oversampling of the instances of the minority class along with under-sampling of the instances of the majority class. The authors in Niculescu-Mizil.Caruana:05; Wallace.Dahabreh:12; Wallace.Dahabreh:14 use probability calibration techniques. In this paper, we evaluate the loss-centric approaches, ignoring the data-centric and probability calibration based ones. The reason is that synthetic data generation is computationally expensive, which makes it very difficult to overcome extreme imbalance, a typical scenario for abstract screening. Also, probability calibration requires a validation set, which is not always available in a typical abstract screening session. For the loss-centric approaches, we have modified SVM-perf as proposed in Esuli.Sebastiani:15; Liu.Chawla:11 to incorporate the KL divergence and quadratic mean losses, because the default implementation does not support these loss functions.
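To make the loss-centric approaches concrete, the sketch below computes two class-imbalance-aware losses from aggregate counts: a quadratic-mean error (root mean square of the per-class error rates, in the spirit of Liu.Chawla:11) and a KL divergence between true and predicted class distributions (in the spirit of Esuli.Sebastiani:15). The exact formulations in the cited papers may differ in details such as smoothing; this is an illustrative sketch, not the papers' implementations.

```python
import math

def quad_mean_error(tp, fp, tn, fn):
    """Quadratic mean of the per-class error rates:
    sqrt((err_pos^2 + err_neg^2) / 2). A high error rate on either
    class inflates the loss, so the minority class cannot be ignored."""
    err_pos = fn / (tp + fn)  # error rate on the relevant (positive) class
    err_neg = fp / (tn + fp)  # error rate on the irrelevant (negative) class
    return math.sqrt((err_pos ** 2 + err_neg ** 2) / 2.0)

def kld_loss(true_pos_frac, pred_pos_frac, eps=1e-9):
    """KL divergence between the true and predicted class distributions
    (a quantification-style loss), smoothed with eps to avoid log(0)."""
    loss = 0.0
    for p, q in zip((true_pos_frac, 1 - true_pos_frac),
                    (pred_pos_frac, 1 - pred_pos_frac)):
        p, q = p + eps, q + eps
        loss += p * math.log(p / q)
    return loss

# A classifier labeling everything "irrelevant" on a 5%-prevalence set
# has 95% accuracy, but the quadratic-mean error exposes the failure.
print(quad_mean_error(tp=0, fp=0, tn=95, fn=5))  # ~0.707
```

Note how the degenerate majority-class classifier, invisible to plain accuracy, receives a large quadratic-mean loss.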
2.3 Active Learning
As systematic reviews are done in batches, the problem of abstract screening can be modeled as an instance of batch-mode active learning. Online learning retrains the classifier after adding every citation, whereas batch-mode active learning (BAL) does the same after adding batches of citations. However, BAL does not have the theoretical guarantees of online learning Hanneke:14; Tong.Koller:00. The task in BAL is to select batches of informative samples (citations) that help learn a better classification model. There are two popular ways to select samples: (1) certainty based and (2) uncertainty based. In certainty based methods, the “most certain” samples are selected to train the classifier, while in uncertainty based methods the “most uncertain” ones are selected. For uncertainty sampling, a large number of uncertainty metrics have been proposed; examples include entropy Dagan.Engelson:95, smallest-margin Scheffer.Decomain:01, least confidence Culotta.McCallum:05, committee disagreement Seung.Opper:92, and version space reduction Nowak:09; Tong.Koller:00. Using the “most uncertain” samples under these metrics improves the classifier’s ability to find the best separating hyperplane, and thus its accuracy on new instances Miwa.Thomas:14. On the other hand, certainty-based methods have also been shown to be effective for active learning on imbalanced datasets, as demonstrated in Fu.Lee:11. In this paper, we present some interesting findings based on “certainty” based sampling in the abstract screening task.
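A minimal sketch of certainty- versus uncertainty-based batch selection, assuming a linear classifier that outputs signed distances from the separating hyperplane (function and variable names are illustrative, not from any cited system):

```python
def select_batch(scores, batch_size, strategy="certain"):
    """Pick a batch of unlabeled citations by decision score.

    scores: dict mapping citation id -> signed distance from the
            separating hyperplane (output of any linear classifier).
    "certain"   -> largest |score| (most confident predictions, shown
                   effective on imbalanced data, cf. Fu.Lee:11);
    "uncertain" -> smallest |score| (closest to the hyperplane).
    """
    if strategy == "certain":
        key = lambda c: -abs(scores[c])
    else:
        key = lambda c: abs(scores[c])
    return sorted(scores, key=key)[:batch_size]

scores = {"c1": 2.7, "c2": -0.1, "c3": -3.2, "c4": 0.4}
print(select_batch(scores, 2, "certain"))    # ['c3', 'c1']
print(select_batch(scores, 2, "uncertain"))  # ['c2', 'c4']
```

The selected batch would then be labeled by the reviewer and fed back into the next training round.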
2.4 Linear Review
A task similar to abstract screening is Technology Assisted Review (TAR) Manfred.Paskach:13; Cormack:13; Grossman.Cormack:10; Henry:15; Roitblat.Kershaw:10; Saha.Hasan:15, which is popular among law firms. The main objective of both review systems is the same: to separate relevant results from irrelevant ones in a very imbalanced setup. TAR is used to prioritize attorneys’ time toward screening documents relevant to a lawsuit rather than irrelevant ones. Generally, the number of documents is much larger (in the millions) in TAR than in systematic reviews, so active learning is very popular in the TAR domain. However, unlike systematic reviews, TAR researchers are interested in finding a good stabilization point at which the training of the classifier should be stopped. Moreover, in TAR, achieving at least 75% recall is considered acceptable, whereas in systematic reviews 95%-100% recall is desired. Linear review faces a similar set of challenges to abstract screening, and it often overcomes those challenges with similar solutions.
For abstract screening, we have as input the titles and abstracts of a set of citations, C. We represent each citation c_i in C as a tuple (x_i, y_i): x_i is a d-dimensional feature space representation of the citation and y_i is a binary label denoting whether c_i is relevant or not for the given abstract screening task. For a labeled citation, y_i is known, and for an unlabeled citation it is not. We use L and U to denote the sets of labeled and unlabeled citations, respectively, X(.) to represent the features of a set of citations, and h for the hypothesis or hyperplane learned by training on L. Some of our feature representation methods embed a word/term into a latent space; for a word w, we use phi(w) for the latent space representation of this word.
3.1 Task Challenges
The desiderata for supporting abstract screening in an SR platform like Rayyan include the following: (1) feature extraction should be fast and the features should be readily computable from the available data, (2) the learning and prediction algorithms should be efficient, (3) the prediction algorithm should be able to handle extreme data imbalance so that it can overcome the shortcomings of low search precision, and (4) the prediction algorithm should solve an instance of a two-class ranking problem such that the relevant citations are ranked higher than the irrelevant ones. Based on these requirements, the choices for feature representations, prediction algorithms, and prediction metrics are substantially reduced, as we discuss in the following paragraphs.
For feature extraction, we focus on two classes of methods, namely uni-bigram and word2vec. Uni-bigram is a traditional feature extraction method for text data. Word2vec Mikolov.Sutskever:13 is a recent distributed method which provides vector representations of words that capture semantic similarity between them. Both methods are fast, and their feature representations are computed from the citation data, which are readily available. We avoided other feature extraction choices as they are either hard to compute or the information from which their feature values are computed is not readily available. For example, features like co-citations are hard to obtain. Another possible feature is the frequency of MeSH (Medical Subject Heading) terms. While the MeSH terms of a citation may be obtained from PubMed, this has some practical limitations, especially in an automated system like Rayyan. In particular, (i) PubMed does not necessarily cover all of the existing biomedical literature, and (ii) to obtain MeSH terms, one has to either provide the PMID (PubMed Article Identifier) of each reference, which is not always available, or use the title search API, which is an error-prone process since it could return more than one result. Thus, we avoided using this feature in our experiments. We also discarded LDA (Latent Dirichlet Allocation) based feature generation and a matrix factorization based method kontonatsios2017semi, as an LDA based method is slow and there is no easy way to set some of its critical parameters, such as the number of topics. Methods based on distributed representations (e.g., word2vec Mikolov.Sutskever:13) have been shown to perform well in practice compared to the matrix factorization based alternatives.
Additionally, for distributed representation methods, Natural Language Processing (NLP) steps such as stemming (a heuristic method that chops the ends of words) and lemmatization (a morphological analysis method that finds the root form of a word) are usually not required, as these methods automatically discover the relationship between inflected word forms (for example, require, requires, required) from co-occurrence statistics and represent them closely in the vector space. For the Uni-Bi (unigram, bigram) feature representation, one can use lemmatization or stemming. However, one immediate problem with lemmatization is that it is quite difficult to perform accurately on very large untagged corpora, so often a simple stemming algorithm (such as the Porter stemmer) is applied instead to remove the most common morphological and inflectional word endings from the corpus Bullinaria2012. We avoided such a preprocessing step, however, as it adds extra cost, and in our initial analysis the performance difference was not statistically significant.
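To illustrate why stemming is the cheaper heuristic, here is a toy suffix-stripping stemmer. It is a deliberately simplified stand-in: real stemmers such as the Porter stemmer apply ordered, context-sensitive rewrite rules rather than bare suffix removal.

```python
def simple_stem(word, suffixes=("ation", "ing", "es", "ed", "s")):
    """Strip the first matching common English suffix, keeping at least
    a 3-character stem. A crude illustration of suffix stripping; note
    it conflates 'requires'/'required' but leaves 'require' untouched,
    the kind of inconsistency real stemmers work to reduce."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([simple_stem(w) for w in ["requires", "required", "require"]])
# ['requir', 'requir', 'require']
```

A distributed representation sidesteps this entirely, since inflected forms end up near each other in the learned vector space.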
Since irrelevant citations are more common than relevant ones, abstract screening algorithms suffer from the data-imbalance issue. Among binary classification methods, maximum-margin based methods (e.g., SVM) have become very popular in the abstract screening domain Alison.Thomas:15. Therefore, we use SVM variants that are more likely to overcome the data-imbalance problem. Specifically, we use SVM-cost (cost-sensitive SVM) and SVM-perf (SVM with multivariate performance measures) with different loss functions for our evaluation.
Finally, various prediction metrics have been used in the abstract screening domain, which makes it difficult to compare models. A recent survey of current approaches in abstract screening Alison.Thomas:15 reported that, among the publications surveyed, some report performance using Recall, some using Precision, and some using AUC. The non-uniform usage of these metrics makes it difficult to draw conclusions about the performance of different methods. For example, if a particular work only reports Recall but not Precision, it is hard to assess its applicability. Among all the above metrics, the most common is the Area Under the ROC Curve (AUC), which can be computed from the ranking of the citations as provided by the classification model. Hence, to use AUC, the classification model for abstract screening needs to provide a ranked order of the citations. Among other ranking based metrics, AUPRC (area under the Precision-Recall curve) is also used extensively. Moreover, the authors of a recent work Raeder:12 suggested studying the variability of the reported metrics and advocated using a large number of repetitions to ensure reproducibility. We also observed that most of the existing studies were performed on a small number of SRs, reported evaluations with differing sets of metrics, validated with a small number of repetitions, and mostly did not perform proper statistical significance tests, which would provide a statistical guarantee of the superiority of one method over the others.
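Since AUC can be computed directly from the ranked order of citations, a minimal sketch of that computation (the pairwise-ranking formulation, equivalent to the probability that a random relevant citation outranks a random irrelevant one) is:

```python
def auc_from_ranking(ranked_labels):
    """AUC from a list of true labels (1 = relevant, 0 = irrelevant)
    ordered by predicted relevance, most relevant first. Counts the
    fraction of (relevant, irrelevant) pairs ranked correctly."""
    pos_seen, pairs_correct = 0, 0
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    for label in ranked_labels:
        if label == 1:
            pos_seen += 1
        else:
            # every relevant citation already seen outranks this irrelevant one
            pairs_correct += pos_seen
    return pairs_correct / (n_pos * n_neg)

print(auc_from_ranking([1, 1, 0, 1, 0, 0]))  # 8/9 ~ 0.889
```

A perfect ranking (all relevant citations first) yields 1.0; a fully inverted ranking yields 0.0.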
In the following, we describe the feature space representation, then the existing methods for automated abstract screening, and finally our proposed 5-star rating algorithm.
3.2 Feature Space Representation
We use two types of feature representation: (1) uni-bigram (Uni-Bi) and (2) word2vec (w2v). The Uni-Bi based feature representation uses the frequencies of uni-grams and bi-grams in a document. Note that uni-grams and bi-grams are special cases of n-grams, the collection of all contiguous sequences of n words from a given text. Since the sum of the number of distinct words (uni-grams) and the number of distinct word-pairs (bi-grams) over the document collection in a particular review task is generally large, the Uni-Bi feature representation is high-dimensional. Besides, it produces a sparse feature representation, i.e., a large number of entries in the feature matrix are zero. Such a high-dimensional and sparse feature representation is poor for training a classification model, so several alternative feature representations have been proposed to overcome this issue. Among them, word2vec Mikolov.Sutskever:13 is a popular alternative. It adopts a distributed framework to represent the words in a dense d-dimensional latent feature space. The feature vector of each word in this latent space is learned by shallow neural network models, and the feature vector of a citation is then obtained by averaging the latent representations of all the words in that citation. It has been hypothesized that these condensed real-valued vector representations, learned from unlabeled data, outperform Uni-Bi based representations.
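The averaging step for building a citation vector from word vectors can be sketched as follows; the function name and the toy two-dimensional vocabulary are illustrative (real word2vec vectors have hundreds of dimensions):

```python
def citation_vector(text, word_vectors):
    """Average the latent vectors of the words in a citation's title and
    abstract; words missing from the vocabulary are skipped. word_vectors
    is a dict word -> list of floats (e.g. loaded from a word2vec model)."""
    dims = len(next(iter(word_vectors.values())))
    total, n = [0.0] * dims, 0
    for word in text.lower().split():
        vec = word_vectors.get(word)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
            n += 1
    return [t / n for t in total] if n else total

# Toy 2-dimensional vocabulary for illustration only.
vocab = {"liver": [1.0, 0.0], "cirrhosis": [0.8, 0.2]}
print(citation_vector("Liver cirrhosis outcomes", vocab))  # ~[0.9, 0.1]
```

Out-of-vocabulary words ("outcomes" above) simply do not contribute to the average.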
To learn vector representations of words using word2vec Mikolov.Sutskever:13 for our corpus, we train the model on all abstracts and titles available in Rayyan. We use Gensim Radim:10, setting the number of dimensions in the latent space to d=500 (chosen through validation), along with fixed values for the number of context words and the minimum word count. Thus, for each word in the set of all available words, we learn a d-dimensional latent space representation. After averaging the latent vectors of words, we obtain the latent vector of a citation, on which we apply two kinds of normalization: (1) instance normalization (row normalization) and (2) feature based normalization (column normalization). These normalizations give statistically significantly better results than raw features for threshold agnostic metrics such as AUC and AUPRC. In instance normalization, we normalize the extracted feature vector of each citation to unit length. In column normalization, we normalize along each of the d dimensions. After both normalizations, we keep the feature values up to a fixed number of decimal places to minimize the memory requirement in our system. Note that there exist some neural network based models, such as sen2vec Le.Mikolov:14, which can learn the representation of a citation holistically instead of averaging the representations of its words. However, the sen2vec model is transductive in nature, i.e., for new citations we need to execute a few passes over the trained model, which is time-consuming.
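The two normalizations can be sketched as follows. The choice of the L2 (unit-length) norm and of four decimal places here are assumptions for illustration; the deployed system's exact norm and precision are not specified above.

```python
import math

def l2_normalize_rows(matrix, decimals=4):
    """Instance (row) normalization: scale each citation's feature
    vector to unit L2 norm, then round to a fixed number of decimal
    places to reduce memory (decimals=4 is an assumed value)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([round(v / norm, decimals) for v in row])
    return out

def l2_normalize_cols(matrix, decimals=4):
    """Feature (column) normalization: unit L2 norm per dimension,
    implemented by transposing, normalizing rows, transposing back."""
    transposed = list(zip(*matrix))
    return list(map(list, zip(*l2_normalize_rows(transposed, decimals))))

print(l2_normalize_rows([[3.0, 4.0]]))  # [[0.6, 0.8]]
```

Row normalization makes citations of different lengths comparable; column normalization puts the latent dimensions on a common scale.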
After learning the representation, for a limited number of words, we manually validate whether the latent space representation of a word captures the known semantic similarities of that word with various biomedical terms. Semantic similarities are computed using cosine similarity in the latent space, following Mikolov.Sutskever:13. Our manual validation shows encouraging results. For example, the cosine similarity between “liver” and “cirrhosis” is high, which is notable considering that the vectors have 500 dimensions. Also, for the query “which words are related to cirrhosis in the same way breast and cancer are related”, the w2v feature representation returns “liver” as one of the top answers.
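The cosine-similarity check and the analogy query can be sketched as follows, using hand-crafted toy vectors (not learned ones) constructed so that the breast:cancer :: liver:cirrhosis relation holds:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(vectors, a, b, c, top_k=1):
    """Answer "a is to b as c is to ?" by ranking all other words on
    cosine similarity to vec(b) - vec(a) + vec(c) (Mikolov-style)."""
    target = [bb - aa + cc
              for aa, bb, cc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: -cosine(vectors[w], target))[:top_k]

# Hand-crafted 2-d toy vectors: dimension 1 ~ "organ", dimension 2 ~ "disease".
toy = {"breast": [2.0, 0.0], "cancer": [2.0, 1.0],
       "cirrhosis": [0.5, 1.0], "liver": [0.5, 0.0], "kidney": [1.0, -1.0]}

# cancer is to breast as cirrhosis is to ...
print(analogy(toy, "cancer", "breast", "cirrhosis"))  # ['liver']
```

With real 500-dimensional learned vectors, the same query form reproduces the "liver" answer reported above.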
Table 1 (excerpt): evaluated SVM-based methods with their feature representations, parameters/loss functions, and method Ids.

Features | Method (parameters, loss) | Id
Uni-Bi   | SVM-perf (b=1, AUC)       | 2
Uni-Bi   | SVM-perf (b=1, KLD)       | 3
Uni-Bi   | SVM-perf (b=1, QuadMean)  | 4
Uni-Bi   | SVM-cost (J, b=1)         | 7
w2v row  | SVM-perf (b=1, AUC)       | 21
w2v row  | SVM-perf (b=1, KLD)       | 22
w2v row  | SVM-perf (b=1, QuadMean)  | 23
w2v row  | SVM-cost (J, b=1)         | 25
w2v col  | SVM-perf (b=1, AUC)       | 31
w2v col  | SVM-perf (b=1, KLD)       | 32
w2v col  | SVM-perf (b=1, QuadMean)  | 33
w2v col  | SVM-cost (J, b=1)         | 35
3.3 Existing Algorithms
A recent study Olorisade.De:16 reports that the Support Vector Machine (SVM) is the most used algorithm in abstract screening: it is used in 31% of the studies and in at least one experiment annually since 2006. Moreover, as discussed in Section 2, SVM-based methods are widely used in both the data-imbalance and batch-mode active learning settings Fu.Lee:11; Miwa.Thomas:14. We thus restrict our evaluation to existing SVM-based algorithms. SVM is a supervised classification algorithm which uses a set of labeled data instances and learns a maximum-margin separating hyperplane by solving a quadratic optimization problem. We should mention that the number of labeled citations can be very small at the start of a citation screening process, and that the number of citations varies from a few hundred to a few thousand across reviews. Thus, we did not try any supervised deep-learning based technique in our evaluation.
We use three types of SVM methods: (1) inductive, (2) transductive, and (3) SVM-perf. Inductive SVM learns a hypothesis induced by the labeled instances. Transductive SVM Joachim:99 reduces the problem of finding a hypothesis to a different learning problem whose goal is to find one equivalence class among the infinitely many induced by all the instances in the labeled and unlabeled sets. SVM-perf exploits an alternative structural formulation of the SVM optimization problem for conventional binary classification with error rate Joachims:06. We use three different loss functions for the SVM-perf implementation: (i) AUC, (ii) Kullback-Leibler Divergence (KLD), and (iii) QuadMean Error Liu.Chawla:11. Table 1 shows the different SVM-based algorithms we used in our comparison along with their parameters and loss functions. We use the default regularization in all cases, and the cost factor J in SVM-cost is set to the ratio of the number of labeled irrelevant citations to the number of labeled relevant ones. This ratio biases the hypothesis learner (the training algorithm) to penalize mistakes on the minority class (the relevant class) J times more than on the majority class (the negative class). In Table 1, we assign a distinct integer identifier to each of the algorithms. For example, the first row in Table 1 refers to a method identified by Id 2 that uses SVM-perf with b=1 and AUC as the loss function. Comparison results among different methods (such as the results shown in Table LABEL:tab:rank-group) are shown compactly by referring to each method by its identifier instead of its name.
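The cost factor for cost-sensitive SVM training can be computed from the labeled set; this sketch assumes the ratio of irrelevant to relevant counts described above (the interface to the actual SVM implementation is not shown):

```python
def cost_factor(labels):
    """Cost factor J for cost-sensitive SVM training: the ratio of
    labeled irrelevant (majority) to relevant (minority) citations,
    so that a mistake on a relevant citation is penalized J times
    more than one on an irrelevant citation."""
    n_rel = sum(1 for y in labels if y == 1)
    n_irr = sum(1 for y in labels if y != 1)
    if n_rel == 0:
        raise ValueError("need at least one labeled relevant citation")
    return n_irr / n_rel

# 4 relevant vs 96 irrelevant labeled citations -> J = 24
print(cost_factor([1] * 4 + [0] * 96))  # 24.0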
Making a comparison with the exact baseline algorithms proposed in various studies is not straightforward: 31% of the studies reported in Olorisade.De:16 use SVM as their classifier, but each employs a different feature representation and a different SVM implementation. Thus, in this study we restrict our attention to a specific set of feature representations that are more suitable for our abstract screening platform, and use the linear kernel with different loss functions and the implementation provided by the author of each SVM algorithm.
3.4 The Proposed 5-star Rating Algorithm
In our SR platform, we want to rank citations based on their graded relevance; the intuition is to help reviewers better manage their time. To this end, we rate the citations from 1 to 5 stars using our proposed algorithm. Citations with 5 stars are relevant with high probability and need more attention, whereas 1-starred citations are irrelevant with high probability and may need less attention. Among the five star levels, the higher ones are conceptualized to indicate relevant citations and the lower ones irrelevant citations.
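Purely as a hypothetical illustration of how classifier decision scores could be discretized into star ratings, consider thresholding the score range; the thresholds, the thresholding scheme, and the function name below are placeholders, not the ensemble algorithm proposed in this paper:

```python
def stars_from_scores(scores, thresholds=(-1.0, -0.25, 0.25, 1.0)):
    """Map classifier decision scores to 1-5 stars by counting how many
    of four ordered thresholds each score exceeds. The threshold values
    here are illustrative placeholders only."""
    ratings = {}
    for cid, s in scores.items():
        ratings[cid] = 1 + sum(1 for t in thresholds if s > t)
    return ratings

print(stars_from_scores({"c1": 1.8, "c2": 0.0, "c3": -2.0}))
# {'c1': 5, 'c2': 3, 'c3': 1}
```

Sorting by stars then gives reviewers the likely-relevant citations first, matching the workflow described above.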
Our 5-star rating algorithm,