Which techniques does your application use?:
An information extraction framework for scientific articles
Every field of research consists of multiple application areas, each with various techniques routinely used to solve its problems. With the exponential growth in research volume, it has become difficult to keep track of the ever-growing number of application areas as well as the corresponding problem-solving techniques. In this paper, we consider the computational linguistics domain and present a novel information extraction system that automatically constructs a pool of all application areas in this domain and links them with the corresponding problem-solving techniques. Further, we categorize individual research articles based on their application area and the techniques proposed or used in the article. An n-gram based discounting method, along with handwritten rules and bootstrapped pattern learning, is employed to extract application areas. Subsequently, a language modelling approach is proposed to characterize each article based on its application area. Similarly, regular expressions and high-scoring noun phrases are used to extract the problem-solving techniques. We propose a greedy approach to characterize each article based on the techniques. Towards the end, we present a table of the most frequent techniques adopted for a particular application area. Finally, we propose three use cases presenting an extensive temporal analysis of the usage of techniques and application areas.
Soham Dan, Sanyam Agarwal, Mayank Singh, Pawan Goyal and Animesh Mukherjee
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur, WB, India
•Information systems Information retrieval; Retrieval tasks and goals; Information extraction;
Areas, techniques, language models, citation context, temporal analysis
It is not uncommon for researchers to envisage an information extraction system for scientific articles that can answer queries like (i) what are all the techniques and tools used in machine translation? (ii) which are the subareas of computational linguistics where Malt parser is frequently used? However, the meta-information necessary for constructing such a system is rarely available. Each research domain consists of multiple application areas, which are typically associated with various techniques used to solve problems in these areas. Knowledge of the techniques typically adopted for an application area is of crucial importance to a researcher focusing on that area. For instance, two commonly used techniques in information extraction are conditional random fields and hidden Markov models. Answering queries such as listing the significant techniques for a given application area, or listing all application areas for a given technique, is valuable to every researcher, but particularly to those venturing into a new area. An automated information extraction system would save the considerable manual effort needed to compile such lists for each research field. Moreover, the lists are dynamic in nature, which presents further challenges in generation and automation: new techniques are added to an area with time and changing needs. It is therefore of interest to the researcher to know the significant techniques for an area within a time frame. This temporal aspect raises diverse research questions: for example, how have the techniques for "part-of-speech tagging" varied over time, or what are the most important areas of computational linguistics that have been addressed in the last 5 years?
In this work, we demonstrate the system for the domain of computational linguistics because of the availability of full-text research articles. However, the proposed schema can be easily generalized to any other field of research. Next, we define two common keywords used in the current paper:
Area: An area is an application area of the computational linguistics domain. Common application areas include ‘machine translation’, ‘dependency parsing’, ‘part-of-speech tagging’, ‘information extraction’, ‘question answering’, etc.
Technique: A technique is either a tool or a task used in an area. Common examples include ‘Bleu score’, ‘Rouge score’, ‘Charniak parser’, ‘TnT tagger’, ‘Malt parser’, ‘MST parser’, etc. Note that a technique of one paper can potentially be an area of another paper. For example, in “Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment” [?], ‘word alignment’ is an area, but in “Using Word-Dependent Transition Models in HMM-Based Word Alignment for Statistical Machine Translation” [?], ‘word alignment’ is a technique for machine translation.
In the section describing our approach, we present an algorithmic procedure to construct a mapping between a list of areas and a list of techniques, which is further used to construct a summary table of the metadata. The approach is organized into five steps:
(1) We first create a ranked list of application areas;
(2) Given a research paper, we automatically assign it to one of the areas based on its title and abstract;
(3) Next, we create a ranked list of techniques;
(4) Given a paper, we automatically extract all the techniques it introduces;
(5) Finally, we automatically construct a mapping from the list of areas to the list of techniques.
We achieve high performance in each of the above steps. The precision of the first step is 84% (for the top 25) and its recall is 87%. For the second step, the accuracy is 73.3%. The third step results in a precision of 80% (for the top 25) and a recall of 81%. In the fourth step, our system achieves an accuracy of 60%. Finally, for the fifth step, we present a number of relevant examples from the entire mapping. As a use case for this study, we analyze the temporal characteristics of the techniques in a later section. Within an area, certain techniques become quite popular over time (as determined by the number of times the technique is cited by any paper), while others fall out of favor. The advent of social media is an example of this phenomenon: its different writing styles (short text length, use of out-of-vocabulary words, character flooding, etc.) created a particular need for new techniques. We further explore other aspects of temporal analysis, such as how the popular areas in computational linguistics (as determined by the number of papers published in each area) vary with time. We also investigate the temporal variation of the popular areas for specific conferences, namely ACL and COLING.
Extracting application areas and techniques is primarily an information extraction task. Information extraction (IE) from scientific articles combines approaches from natural language processing and data mining and has generated substantial research interest in recent times. In particular, there has been burgeoning research interest in the domain of biomedical documents. Shah et al. [?] extracted keywords from the full text of biomedical articles and claim that there is heterogeneity in the keywords from different sections. Muller et al. [?] developed the Textpresso framework, which leverages ontologies for information retrieval and extraction. In a similar work, Fukuda et al. [?] proposed an IE system for protein name extraction. There has been significant work on information extraction in the area of protein structure analysis. Gaizauskas et al. [?] proposed PASTA, an IE system developed and evaluated for the protein structure domain. Friedman et al. [?] developed a similar system to extract structure information about cellular pathways using a knowledge model. Biological information extraction has seen extensive work covering diverse aspects, with a large number of survey papers. Cohen et al.’s [?] survey on biomedical text mining, Krallinger et al.’s [?] survey on information extraction and applications for biology, and Wimalasuriya et al.’s [?] survey on ontology-based information extraction are examples of popular surveys on IE for the biomedical domain.
Information extraction in other domains has also received equally strong attention from researchers. Hyponym relations have been extracted automatically in the celebrated work by Hearst et al. [?]. Caraballo et al. [?] extended previous work on automatically building semantic lexicons to the automatic construction of a hierarchy of nouns and their hypernyms. Teufel [?] proposed information management and information foraging for researchers and introduced a new document analysis technique called argumentative zoning, which is useful for generating user-tailored and task-tailored summaries. Kim et al. [?] and Lopez et al. [?] are two popular works in automatic keyphrase extraction from scientific articles. Qazvinian et al. [?] explored summarization of scientific papers using citation summary networks and citation summarization through keyphrase extraction [?].
Jones [?] introduced an approach for entity extraction from labeled and unlabeled text. They proposed algorithms that alternately look at noun phrases and their local contexts to recognize members of a semantic class in context. A relatively recent work by Gupta et al. [?] developed a pattern learning system with bootstrapped entity extraction. In [?], they investigated the dynamics of a research community by extracting key aspects from scientific papers and showed how extracting key information helps in analyzing the influence of one community on another. Jin et al. [?] proposed a supervised sequence labeling system that identifies scientific terms and their accompanying definitions.
In our work, we have developed and tested an information extraction system that can extract the area and techniques from scientific papers and build a repertoire of such areas and techniques from the corpus. Further, we have constructed a mapping from each area to the list of techniques used for that area. As a use case for our system, we carry out a thorough temporal analysis of the variation of techniques for an area with time, the temporal variation of popular areas for the entire AAN dataset, and the temporal variation of popular areas for top conferences in computational linguistics, namely ACL and COLING.
We use the AAN (ACL Anthology Network) [?] dataset, which consists of 21,213 full-text papers from the domain of computational linguistics and natural language processing. The dataset consists of papers between the years 1965 – 2013 from 342 ACL venues. We further pre-process the full-text articles to remove OCR errors. The dataset and code are available online at http://tinyurl.com/hhrpfge.
In this section, we describe a method to construct a mapping between a list of areas and a list of techniques. As we already pointed out in the introduction, the mapping task is organized into five steps: (1) creation of a ranked list of areas, (2) categorizing papers on the basis of areas, (3) creation of a ranked list of techniques, (4) categorizing papers on the basis of techniques, and (5) the final mapping from the list of areas to the list of techniques.
The current work makes a two-fold contribution. First, we develop a system which automatically extracts the area and techniques used in a paper. Second, we create a mapping from each area to the corresponding popular techniques. Together, they can be thought of as structured metadata for individual scientific articles as well as for a research field.
Next, we describe the five phases in further detail:
We employ paper title information to extract areas. AAN provides paper titles as metadata. We use hand-written rules to extract phrases which are likely to contain the area names. We observe that some functional keywords, such as “for”, “via”, “using” and “with”, act as delimiters for such phrases. For example, the paper title “Moses: Open source toolkit for statistical machine translation” [?] represents an instance of the form X for Y, where Y is the application area. We also observe that the phrase succeeding “for”, preceding “using”, or both (e.g., in “Decision procedures for dependency parsing using graded constraints” [?]) is likely to contain the name of an application area.
Seed set creation: We create a seed set of the above functional keywords and use bootstrapped pattern learning to gather more such words along with areas. For example, given the word ‘via’ in the seed set and the paper title “Non-Monotonic Sentence Alignment via Semisupervised Learning” [?], we extract the leading phrase (n-gram) before ‘via’. Further, we extend the seed set by extracting more functional keywords. For example, given the paper title “Improving English Russian sentence alignment through POS tagging and Damerau Levenshtein distance” [?] and the above extracted phrase (n-gram) alignment, we enrich the seed set with the functional keyword ‘through’. We started with seven functional keywords and, by bootstrapped pattern learning, augmented this to a final set of 11 functional keywords.
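The title rules and the bootstrapping step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `candidate_phrases` and `learn_keyword` are hypothetical, only the four keywords named in the text are used as seeds, and the choice of which keywords are followed (rather than preceded) by the area is an assumption.

```python
import re

# Seed functional keywords named in the text; the full set of 11 is not listed there.
SEED_KEYWORDS = {"for", "via", "using", "with"}

# Assumption: the area follows these keywords and precedes all the others.
AREA_FOLLOWS = {"for"}


def tokenize(title):
    """Lower-case a title and split it into word tokens."""
    return re.findall(r"[a-z0-9+-]+", title.lower())


def candidate_phrases(title, keywords=SEED_KEYWORDS):
    """Split a title at functional keywords and return likely area phrases."""
    tokens = tokenize(title)
    positions = [i for i, tok in enumerate(tokens) if tok in keywords]
    candidates = []
    for n, pos in enumerate(positions):
        prev_cut = positions[n - 1] + 1 if n > 0 else 0
        next_cut = positions[n + 1] if n + 1 < len(positions) else len(tokens)
        if tokens[pos] in AREA_FOLLOWS:
            phrase = tokens[pos + 1:next_cut]   # phrase succeeding the keyword
        else:
            phrase = tokens[prev_cut:pos]       # phrase preceding the keyword
        text = " ".join(phrase)
        if text and text not in candidates:
            candidates.append(text)
    return candidates


def learn_keyword(title, known_phrase):
    """Return the token right after a known phrase: a candidate new keyword."""
    tokens = tokenize(title)
    phrase = known_phrase.lower().split()
    for i in range(len(tokens) - len(phrase)):
        if tokens[i:i + len(phrase)] == phrase:
            return tokens[i + len(phrase)]
    return None
```

Candidates produced this way are still noisy (for instance, they keep modifiers such as "statistical"), which is why a ranking step over their n-grams is needed afterwards.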
Ranking of the extracted phrases: The previous step extracts all the candidate phrases that could potentially be area names. This set contains many noisy phrases such as “machine translation system combination and evaluation”. Here, “machine translation” must be separated from the surrounding noisy words. We therefore use empirical ranking algorithms to extract the exact area names from this collection. We observe that empirical ranking algorithms produce good results in extracting the exact area names from long phrases. We employ three ranking schemes, described below:
Scheme 1: In this scheme, we rank according to individual n-gram scores. The score of a given n-gram ($g$) is calculated as:
$$\mathrm{score}(g) = \frac{c(g)}{\sum_{g'} c(g')}$$
where $c(g)$ represents the occurrence count of the n-gram $g$ and the denominator represents the total count of all the n-grams.
Scheme 2: In this scheme, the scoring method is very similar to that of the previous scheme. However, an additional heuristic is employed: if the score of an n-gram is greater than the scores of both of its border grams, the border grams are left out. The intuition behind this is as follows: the trigram “word sense disambiguation” will have a higher score than its border bigrams, “word sense” and “sense disambiguation”, causing these bigrams to be left out.
Scheme 3: This is an improvement on the previous method in that different threshold scores are chosen for bigrams, trigrams, 4-grams and 5-grams, below which they are eliminated. All higher order n-grams are also eliminated. Unigrams were not selected either, since a pilot experiment with only unigrams indicated that application areas are not usually unigrams. The thresholds were selected manually by observing the individual lists for each value of n from 2 to 5. In the results section, we compare the precision of each of these methods; we finally adopted Scheme 3 since it gives the best results. We present 24 of the top 30 areas judged accurate by domain experts:
machine translation, natural language processing, word sense disambiguation, speech recognition, question answering, dependency parsing, information extraction, chinese word segmentation, semantic role labeling, information retrieval, entity recognition, word alignment, conditional random fields, maximum entropy, coreference resolution, machine learning, dialogue systems, textual entailment, natural language understanding, active learning, part-of-speech tagging, relation extraction, sentiment analysis, sense induction
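The three ranking schemes described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the score is normalised separately for each n-gram order (one reading of "total count of all the n-grams"), Scheme 3 is taken to apply its thresholds on top of the Scheme 2 filter, and the threshold values are supplied by the caller because the paper's manually chosen values are not given.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def scheme1_scores(phrases, orders=(2, 3, 4, 5)):
    """Scheme 1: relative frequency of each n-gram among n-grams of its order."""
    scores = {}
    for n in orders:
        counts = Counter(g for p in phrases for g in ngrams(p.split(), n))
        total = sum(counts.values())
        for gram, c in counts.items():
            scores[gram] = c / total
    return scores


def scheme2_filter(scores):
    """Scheme 2: drop both border grams of any n-gram that outscores them."""
    dropped = set()
    for gram, s in scores.items():
        toks = gram.split()
        left, right = " ".join(toks[:-1]), " ".join(toks[1:])
        if left in scores and right in scores and s > scores[left] and s > scores[right]:
            dropped.update((left, right))
    return {g: s for g, s in scores.items() if g not in dropped}


def scheme3_filter(scores, thresholds):
    """Scheme 3: Scheme 2 plus a minimum score per n-gram order.

    `thresholds` maps an order (2..5) to its cut-off; orders without an
    entry are eliminated entirely.
    """
    kept = scheme2_filter(scores)
    return {g: s for g, s in kept.items()
            if s >= thresholds.get(len(g.split()), float("inf"))}
```

With per-order normalisation, a frequent trigram such as "word sense disambiguation" can outscore its border bigrams even though its raw count can never exceed theirs, which is what the Scheme 2 heuristic relies on.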
From the first step, we obtain a pool of application areas, and we have to assign individual papers to one of these areas. Individual papers from the entire corpus are categorized into their corresponding areas on the basis of two strategies: direct match, and relevance according to the language models defined for the various areas (see the accompanying figure).
Direct match: In the direct match approach, we search for an explicit string match between the title or abstract and one of the areas. One non-trivial task here is locating the abstract in the text of a paper. We achieve this with handwritten rules (regular expressions in which "?" marks the section number as optional) that match the portion of the paper between the abstract and the introduction. We also convert the text to lower case during pre-processing. Subsequently, using the set of areas obtained from the previous step, we check whether any of them matches the title directly. If the title contains only one area name from the list, the paper is assigned to that area. If we do not find a match in the title, we check for a direct match with the abstract of the paper. If the abstract contains only one matching area, the paper is categorized into that area. On the other hand, if the title or the abstract contains more than one direct match with the area set, we use the language modeling approach to classify the paper.
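The direct-match cascade can be sketched as below. The function name `direct_match` and its return convention are hypothetical, and plain substring search stands in for whatever matching the authors used; the abstract is assumed to have been located already.

```python
def direct_match(title, abstract, areas):
    """Return (area, needs_language_model) following the direct-match cascade.

    The title is checked first and the abstract only if the title has no
    match. Exactly one matching area assigns the paper; several matches, or
    none at all, defer the decision to the language models.
    """
    for text in (title.lower(), abstract.lower()):
        hits = [area for area in areas if area in text]
        if len(hits) == 1:
            return hits[0], False
        if len(hits) > 1:
            return None, True   # ambiguous: defer to the language models
    return None, True           # no match anywhere: also defer
```

Papers resolved by a single match are the ones later used to train the per-area language models, so a conservative rule here keeps that training data clean.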
Language modeling for classification of the area of an individual paper: In this approach, we create a language model for each area and classify a document into one of these areas based on the probability that the title and abstract of the document are generated from the language models of the various areas. To create a language model for each area, we select the papers which could be classified on the basis of a single direct match. The titles and abstracts of all the papers belonging to one area are taken together to construct the language model of that area with Jelinek-Mercer (JM) smoothing. A document not categorized using direct match is treated as a query consisting of the words in its title and abstract. The Porter stemmer is applied to the bag-of-words for each area and for the query. As validated by [?], the parameter $\lambda$ for JM smoothing should be high for long queries, which is true in our case. After experimenting on a small sample of papers, we fixed $\lambda$ to 0.7. In our context, the prior probability of an application area $A$, $P(A)$, is proportional to the number of papers which were assigned to that area by a single direct match of either the title or the abstract. Hence, given a query paper $q$, the area which scores the highest for the query ($P(A \mid q)$) is assigned as the area of the given paper.
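A minimal sketch of such a classifier follows. It assumes the common formulation of Jelinek-Mercer smoothing, P(w | A) = (1 - λ) P_ml(w | A) + λ P(w | C), where C pools all areas and λ weights the collection model; the paper does not spell out its exact formula. The function names are hypothetical, tokens are assumed to be stemmed already, and words unseen in every area are skipped.

```python
import math
from collections import Counter


def train_area_models(docs_by_area):
    """docs_by_area maps an area to a list of token lists (title + abstract)."""
    models = {area: Counter(tok for doc in docs for tok in doc)
              for area, docs in docs_by_area.items()}
    collection = Counter()
    for counts in models.values():
        collection.update(counts)
    n_docs = sum(len(docs) for docs in docs_by_area.values())
    # Prior proportional to the number of papers assigned by direct match.
    priors = {area: len(docs) / n_docs for area, docs in docs_by_area.items()}
    return models, collection, priors


def classify(query_tokens, models, collection, priors, lam=0.7):
    """Return the area maximising log P(A) + sum over words of log P(w | A)."""
    coll_total = sum(collection.values())
    best_area, best_score = None, -math.inf
    for area, counts in models.items():
        area_total = sum(counts.values())
        score = math.log(priors[area])
        for w in query_tokens:
            if collection[w] == 0:
                continue  # word unseen in every area model: skip it
            p = (1 - lam) * counts[w] / area_total + lam * collection[w] / coll_total
            score += math.log(p)
        if score > best_score:
            best_area, best_score = area, score
    return best_area
```

Working in log space avoids numerical underflow when the query (a full title and abstract) contains many words.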
The central idea here is based on the notion of method papers: these are papers that introduce a novel technique or provide a toolkit in an area of computational linguistics. For instance, the paper introducing the Stanford CoreNLP toolkit is one such example. The novel technique introduced or improved upon is the reason why method papers are cited. These papers are thus characterized by two features. First, they are expected to have been cited a number of times above some threshold, indicating that the technique introduced or improved upon is frequently used. Second, the fraction of their citations that occur in the “methodology” section of other papers is at least some threshold percentage, indicating that they are primarily “method papers”. We have set these thresholds by observing a large subset of papers of the ACL corpus. The values we chose to identify method papers are and %.
Identifying method papers: One non-trivial task in coming up with the list of method papers is to identify the method section in the text of a research paper. For this, we searched the text with the pattern [digit][S], where S is a string containing keywords like “methodology”, “method”, “approach” or syntactic variations of these. The digit indicates the section number. This step helps us identify which of the citations occur in the method section. Once the “method papers” are identified using the two thresholds above, we need to associate them with the techniques they are used for. For example, the Stanford CoreNLP toolkit paper will be cited whenever the citing paper uses its part-of-speech tagging module, tokenizer module or named-entity recognition module, although these are quite different techniques. Hence we need to extract from the Stanford CoreNLP toolkit paper the techniques part-of-speech tagging, tokenization and named entity recognition, among others.
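The two ingredients above, spotting a method-section heading and applying the two citation thresholds, can be sketched as follows. The regular expression is only one possible reading of the [digit][S] pattern, and the threshold parameters `min_citations` and `min_method_pct` are hypothetical names whose values must be supplied by the caller.

```python
import re

# A section number followed by a heading that contains a keyword from the text.
# Matching "method" as a prefix also covers "methods" and "methodology".
METHOD_HEADER = re.compile(r"^\s*(\d+)\.?\s+.*\b(method|approach)", re.IGNORECASE)


def is_method_header(line):
    """True if a line looks like the heading of a method section."""
    return bool(METHOD_HEADER.match(line))


def is_method_paper(total_citations, method_citations, min_citations, min_method_pct):
    """Apply the two thresholds that characterise a method paper.

    total_citations  : number of times the paper is cited in the corpus
    method_citations : how many of those occur in a method section
    """
    if total_citations == 0:
        return False
    pct = 100.0 * method_citations / total_citations
    return total_citations > min_citations and pct >= min_method_pct
```

In practice the heading test would be applied only to short lines, since a body sentence that happens to start with a digit could otherwise match.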
Here we make use of the fact that when the citing paper applies a technique from the cited paper, it cites that paper and also mentions the name of the technique in the citation context (i.e., the sentence where the citation is made). Our objective is to extract all the techniques a method paper is used for, from the citation context(s). In other words, given all the citation contexts for a particular “method paper", we wish to identify all the specific techniques for which that method paper is used. We now describe the algorithm in detail.
Step 1: As a first step, we need to identify the citations in the text of the paper. One example of a citation is: “(String et al., year)”. There are numerous forms in which the citations could be written. Sometimes the year is shortened to two digits or if there are only two authors, their names are separated by “and" and so on. Further, we had to take into account the possible errors in the text like missing parenthesis. We hand-crafted a set of regular expressions to match the citation format. We tested these on a small sample of 30 papers and augmented the set. The final set was used for extracting the citation contexts.
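A simplified sketch of such citation matching is shown below. It is a single illustrative pattern, not the authors' hand-crafted set: it covers "et al." and two-author forms, two- or four-digit years, and missing parentheses, and it requires a comma before the year to limit false positives. The sentence splitter is deliberately naive.

```python
import re

CITATION = re.compile(
    r"\(?"                                      # opening parenthesis may be missing
    r"(?P<authors>[A-Z][A-Za-z-]+"              # first author surname
    r"(?:\s+et\s+al\.?|\s+(?:and|&)\s+[A-Z][A-Za-z-]+)?)"  # "et al." or second author
    r",\s*"
    r"(?P<year>(?:19|20)?\d{2})[a-z]?"          # 4- or 2-digit year, optional suffix
    r"\)?"                                      # closing parenthesis may be missing
)


def citation_contexts(text):
    """Return (sentence, authors, year) for every citation found in the text."""
    # Naive split: sentence-final punctuation followed by a capitalised word.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    found = []
    for sentence in sentences:
        for match in CITATION.finditer(sentence):
            found.append((sentence, match.group("authors"), match.group("year")))
    return found
```

Each returned sentence is the citation context from which noun phrases are later extracted.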
Step 2: For every method paper in the corpus, we extract all its citation contexts in which the citation is made in the method section. One observation we make is that the techniques are almost exclusively noun phrases in the citation context. We build a global vector of noun phrases across all citation contexts for all the method papers. The i-th component of the vector is the raw count of the i-th noun phrase, with the phrases ordered lexicographically, over the method citations of the entire corpus. We consider this global vector as the ranked list of all the techniques used in the computational linguistics domain. Some of the top ranking noun phrases are listed below.
penn treebank, stanford parser, rate training, berkeley parser, machine translation, statistical machine translation, charniak parser, moses toolkit, word sense disambiguation, maximum entropy, ibm model, bleu score, perceptron algorithm, word alignment, stanford pos tagger, collins parser, natural language processing, bleu metric, coreference resolution, moses decoder, giza++ toolkit, brill tagger, tnt tagger, anaphora resolution, mst parser, ccg parser, malt parser, minimum error rate training
For each method paper X, we extracted all the noun phrases that have been used in its citation context(s). We built a similar vector of these noun phrases, where the i-th component of the vector is the raw count, within the citation contexts of X, of the i-th noun phrase of the global vector introduced above. If a particular noun phrase from the global vector is missing in the citation contexts for X, its weight is set to zero. Once this vector is created for method paper X, we simply take a dot product between this local vector of X and the global vector to get a ranked list of possible techniques for X. We then read off the top techniques on this ranked list as the techniques the method paper X is used for.
However, note that since we have taken a dot product between the local and global noun phrase vectors, noun phrases which occur frequently across almost all papers might wrongly turn up as a technique even if they occur only once in the local vector. Noun phrases such as “citation”, “previous work” and “recent work” are typical examples of such occurrences that we found in our experiment. Thus, we maintain a stop-phrase list of noun phrases that commonly occur in the citation context but are not techniques. Further, we ignore a technique which exactly matches, contains or is contained in the area, since this is a redundant case. For example, for the application area “Statistical Machine Translation”, we ignore the technique “Machine Translation”. After this stage we were able to extract a list of top techniques for each method paper.
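The ranking just described can be sketched as follows. Two assumptions are made: noun phrases have already been extracted from each citation context (by any chunker), and the "dot product" is interpreted as the component-wise product of the local and global counts, since a ranked list needs one score per phrase. The function names and the `top_k` parameter are hypothetical.

```python
from collections import Counter

# Example stop phrases quoted in the text; the full list is not given there.
STOP_PHRASES = {"citation", "previous work", "recent work"}


def global_vector(contexts_by_paper):
    """contexts_by_paper: method paper id -> list of noun-phrase lists."""
    counts = Counter()
    for contexts in contexts_by_paper.values():
        for phrases in contexts:
            counts.update(phrases)
    return counts


def rank_techniques(paper_id, contexts_by_paper, global_counts, area=None, top_k=3):
    """Rank the noun phrases of one method paper by local count x global count."""
    local = Counter(p for phrases in contexts_by_paper[paper_id] for p in phrases)
    scored = {}
    for phrase, local_count in local.items():
        if phrase in STOP_PHRASES:
            continue
        if area and (phrase in area or area in phrase):
            continue  # matches, contains or is contained in the area: redundant
        scored[phrase] = local_count * global_counts[phrase]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda p: (-scored[p], p))[:top_k]
```

Phrases absent from a paper's contexts never enter `local`, which corresponds to the zero weights of the local vector.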
Construction of the mapping table is the final and the most important step of our problem. For building this mapping we follow a simple method of count updating.
In Algorithm 1, Area(P) is a function that returns the area of a paper P and similarly Technique(M) is a function that returns the techniques introduced or improved upon by method paper M. Further, MethodPapersCitedBy(P) is a function that returns all the method papers cited by paper P in its methodology section. Thus, basically in this algorithm, we pick each paper P from the corpus, find its area and all the techniques of the method papers that it cites in its methodology section and append all these techniques to the list corresponding to the extracted area for this paper.
We also tried a simple variation of this technique by keeping track of the number of times a particular technique features in an area. This way we also get to know which are the most popular techniques for an area.
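The count-updating procedure, including the variation that tracks how often each technique appears in an area, can be sketched as below. The three callables mirror Area(P), Technique(M) and MethodPapersCitedBy(P) from the description; their Python names and the use of `Counter` are choices made for this illustration.

```python
from collections import Counter, defaultdict


def build_mapping(papers, area_of, techniques_of, method_papers_cited_by):
    """Accumulate, per area, counts of techniques reached via method citations.

    area_of(p)                -> area of paper p, or None if unassigned
    techniques_of(m)          -> techniques of method paper m
    method_papers_cited_by(p) -> method papers cited in p's methodology section
    """
    mapping = defaultdict(Counter)
    for paper in papers:
        area = area_of(paper)
        if area is None:
            continue
        for method_paper in method_papers_cited_by(paper):
            mapping[area].update(techniques_of(method_paper))
    return mapping
```

Calling `mapping[area].most_common(k)` then gives the k most popular techniques for an area, while the keys alone give the plain list of techniques.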
An example of the kind of mapping that we expect is: machine translation → word alignment, gale church algorithm, bleu score, Moses toolkit, etc. A detailed set of example entries that we obtained through our experiments is presented in the corresponding results section.
In this paper, we have presented a variety of experiments that require a careful analysis of the results to evaluate the efficacy of the algorithms. The first four stages are crucial to the construction of the information extraction system and thus a critical evaluation is presented. Construction of the mapping table is an outcome of the first four stages and examples have been presented to depict the nature of the mapping table.
First, we present the relative performance of the three schemes (in terms of precision) for the creation of the ranked list in the table below. Since Scheme 3 gave the highest precision (assignments were judged by an expert) on the list of top-30 areas, we used this scheme for the creation of the ranked list in the subsequent stages.
| Scheme ID | Precision (%) |
|---|---|
| 1 | 57 |
| 2 | 73 |
| 3 | 80 |

Table: The precision values for the top 30 areas extracted by various schemes.
In this part of the evaluation, we have a ranked list of potential areas in the computational linguistics domain extracted from the ACL corpus. We use standard precision-recall measures for the purpose of evaluation. However, note that there is no easy way to measure the recall of the algorithm on the entire ACL corpus, since all the possible areas in the corpus are not known a priori. This is because the only way to construct a definitive list of all possible areas is to manually identify them from every individual paper in the corpus. However, it is relatively easy to measure the recall of our algorithm on a smaller subset. Therefore, we selected a random set of 200 research papers and manually identified each of their areas. One domain expert found 23 distinct areas in this smaller corpus. The manually identified areas were matched to the closest area from the pool of areas, so that we can verify the set of areas returned by the algorithm without ambiguity. This set of areas is the ground truth for our algorithm. Once this smaller corpus of 200 papers was created, we ran the n-gram discounting algorithm with thresholds (Scheme 3) to arrive at a set of areas. Of the 23 areas, 20 were successfully identified. Thus, the recall of our algorithm is 87%.
Measuring the precision of our algorithm was relatively straightforward: we determine what fraction of the top K areas are genuine computational linguistics areas, as judged by domain experts. The annotation was done by a domain expert in the following way: 1 if the area is a true positive and 0 if it is a false positive. The table below lists the precision obtained for the top K application areas at several values of K. As we can see, the precision falls as the value of K increases, which indicates that actual areas are ranked higher by our ranking methodology.
| K | Precision (%) |
|---|---|
| 25 | 84 |
| 30 | 80 |
| 50 | 72 |
| 75 | 51 |
| 100 | 43 |

Table: The precision values using Scheme 3 for K = 25, 30, 50, 75 and 100 for extraction of the list of application areas.
We also asked another domain expert to annotate the first 30 results independently of the first judge. Inter-annotator agreement (Cohen’s kappa coefficient) was calculated, and the value of $\kappa$ came out to be 0.79. The matrix with the agreement/disagreement counts between the experts is presented in the table below.
Table: The matrix of agreement and disagreement between two domain experts for annotation of the area list (rows: Domain Expert 1; columns: Domain Expert 2; categories: Yes, No, Total).
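The two evaluation measures used here, precision at K and Cohen's kappa for two annotators, can be computed as in the following sketch. The function names are hypothetical and the code assumes binary judgments (1 for a genuine item, 0 otherwise) as in the annotation scheme described above.

```python
def precision_at_k(judgments, k):
    """Fraction of the top-k ranked items judged correct (1) by an expert."""
    return sum(judgments[:k]) / k


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators giving binary (1/0) labels."""
    n = len(labels_a)
    # Observed agreement: fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own rate of "yes" labels.
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)
```

Kappa is undefined when the chance agreement equals 1 (both annotators give the same single label to every item), so a caller would need to guard against that case.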
Once we have a probable area for all the papers in the ACL corpus, we need to validate the accuracy of our assignment algorithm. We validated these assignments through an online survey answered by a team of domain experts, who evaluated a set of 120 papers. The accuracy of our method is calculated as $c/N$, where $c$ is the number of application area assignments marked as correct by the judges and $N$ is the number of papers evaluated. Out of the 120 papers evaluated, the judges marked 88 assignments as correct, and hence the accuracy of our method is 73.3%.
Evaluating this step is very similar to evaluating the pool of application areas generated by our algorithm. In this case, recall calculation is difficult if we work with the top techniques for each method paper. To simplify the task, we calculate recall for only the highest ranked technique for each method paper. The experiment is set up in the following way: for a randomly selected set of 30 papers, we aggregate all their citation contexts from the method sections of the citing papers. This smaller corpus of 30 papers, as judged by one domain expert, introduced or improved upon 26 distinct techniques. Then we ran the noun phrase extraction algorithm to obtain the list of techniques. 21 of these 26 techniques could be extracted, and thus the recall of our algorithm is around 81%.
Precision is calculated by finding what percentage of the top K techniques are genuine techniques of the computational linguistics domain, as judged by domain experts. Through manual annotation by a domain expert, we found the precision at top K for the technique list at various values of K. The table below shows the precision obtained by the technique extraction algorithm for these values of K. As the value of K increases, the precision falls, which once again indicates that actual techniques are ranked higher by our ranking methodology.
| K | Precision (%) |
|---|---|
| 25 | 80 |
| 50 | 64 |
| 75 | 48 |
| 100 | 41 |

Table: The precision values for K = 25, 50, 75 and 100 for extraction of the list of techniques.
Here again we asked another domain expert to annotate the results independently of the first judge. We also calculated the inter-annotator agreement (Cohen’s kappa coefficient) for the top 25 techniques, and $\kappa$ came out to be 0.65. The matrix of agreement/disagreement counts is presented in the table below.
| | Domain Expert 2: Yes | Domain Expert 2: No | Total |
|---|---|---|---|
| Domain Expert 1: Yes | | | |
| Domain Expert 1: No | | | |
| Total | | | |

Table: The matrix of agreement and disagreement between two domain experts for annotation of the technique list.
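For reference, Cohen's kappa for two annotators making yes/no judgments can be computed from the four cells of such an agreement matrix. The sketch below uses the standard definition $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The function name `cohens_kappa` and the counts in the example are hypothetical; they are not the counts obtained in our annotation.

```python
def cohens_kappa(yes_yes: int, yes_no: int, no_yes: int, no_no: int) -> float:
    """Cohen's kappa for two annotators with binary (yes/no) labels.

    yes_no means annotator 1 said yes and annotator 2 said no, and so on.
    """
    total = yes_yes + yes_no + no_yes + no_no
    if total == 0:
        raise ValueError("the agreement matrix is empty")
    # Observed agreement: both said yes or both said no.
    observed = (yes_yes + no_no) / total
    # Chance agreement, from each annotator's marginal rate of saying yes.
    a1_yes = (yes_yes + yes_no) / total
    a2_yes = (yes_yes + no_yes) / total
    expected = a1_yes * a2_yes + (1 - a1_yes) * (1 - a2_yes)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical counts, for illustration only.
kappa = cohens_kappa(yes_yes=20, yes_no=5, no_yes=10, no_no=15)
print(round(kappa, 2))
```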
Once we have the repertoire of tasks for the entire corpus, we can use the dot-product based ranking to identify the specific techniques a method paper is used for. We now need to evaluate the accuracy of this assignment of techniques to a method paper. To keep the evaluation simple, we report only the top ranked technique for a method paper. For the evaluation, a sample of 60 papers was selected randomly, and the judgments of a set of domain experts, collected through an online survey, were used to test the accuracy of our method. Each domain expert evaluated the technique assigned to each paper submitted to them and reported how many of these assignments were correct. The accuracy of our algorithm is then calculated as $\frac{N_c}{60}$, where $N_c$ is the number of correct assignments of a technique to a method paper. In this evaluation, our algorithm made 36 correct assignments, and thus the accuracy of our method is 60%.
Earlier, we discussed the procedure for constructing the mapping from areas to the list of techniques. In this section we present some of the entries from the mapping table for some of the higher ranked areas of computational linguistics. As the examples show, the extracted techniques consist of sub-tasks, tools and data-sets popularly used in an area. Further, the entries are quite detailed and include many techniques that are very specific to an area, for example, bleu score and gale church algorithm : machine translation, nivres arc eager : dependency parsing and spin model : opinion mining. It is also interesting to observe that the extracted techniques span a wide range of time; for example, Collins parser, Berkeley parser, Charniak parser, Stanford parser, MST parser and Malt parser : Dependency Parsing were introduced to the computational linguistics (CL) community in substantially different time periods.
In this section, we present three use cases. Each use case is a temporal study of areas and techniques, in which we analyze the evolution of application areas and the corresponding techniques over a given time period. In the first use case, we study the evolution of areas in the computational linguistics field. The second use case deals with the evolution of techniques for a given area. Lastly, we study the evolution of major areas in the top conferences.
From the list of popular areas in AAN (ranked by the total number of papers published in an area), we present six representative areas, namely, “machine translation”, “dependency parsing”, “speech recognition”, “information extraction”, “summarization” and “semantic role labeling”. We study their popularity, defined as the percentage of papers in that area out of all papers published in the same time period, from 1980 to 2013 in 5-year windows. The accompanying figure shows the temporal variation for these areas and how they evolve with time.
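The popularity measure defined above can be sketched in a few lines. This is an illustrative sketch only: the function names `window_start` and `area_popularity`, the `(year, area)` input format, and the example papers are all assumptions made here, not part of the system or of the AAN data.

```python
from collections import Counter, defaultdict


def window_start(year, first_year=1980, width=5):
    """First year of the fixed-width window that contains `year`."""
    return first_year + ((year - first_year) // width) * width


def area_popularity(papers, first_year=1980, width=5):
    """Map each window start to {area: percentage of that window's papers}.

    `papers` is an iterable of (publication_year, assigned_area) pairs.
    """
    totals = Counter()               # papers per window
    per_area = defaultdict(Counter)  # papers per window and area
    for year, area in papers:
        w = window_start(year, first_year, width)
        totals[w] += 1
        per_area[w][area] += 1
    return {
        w: {area: 100.0 * n / totals[w] for area, n in per_area[w].items()}
        for w in totals
    }


# Hypothetical (year, area) assignments, for illustration only.
papers = [
    (1981, "speech recognition"), (1983, "speech recognition"),
    (1984, "machine translation"), (1982, "summarization"),
    (2011, "machine translation"), (2012, "machine translation"),
    (2013, "dependency parsing"), (2010, "summarization"),
]
popularity = area_popularity(papers)
print(popularity[1980]["speech recognition"], popularity[2010]["machine translation"])
```

Normalizing by the number of papers in each window, rather than reporting raw counts, keeps the growth in overall publication volume from masking the relative rise or decline of an area.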
Observations: While areas like “machine translation” and “dependency parsing” are on the rise, “information extraction” and “semantic role labeling” are in decline. A further interesting observation is that until 1994 the ACL community had a lot of interest in “speech recognition”, which then saw a sharp decline, possibly because the speech community gradually separated out.
In the second use case, we study the evolution of techniques for a given area. For this analysis, we divide the time-line into fixed buckets of years. Next, for each bucket, we extract the popular techniques (based on the number of times any paper has cited that technique) using our proposed system. This study reveals several trends. The accompanying table presents the popular techniques for five example areas. Some of the interesting trends from the table are listed below:
Dependency Parsing: New techniques (Malt parser, minimum spanning tree (MST) parser, etc.) came into existence in 2005 – 2009. In the next bucket, these parsers overtake earlier parsers such as the Collins parser and the Berkeley parser in popularity and are almost on par with the Charniak parser. In addition, we observe that the Penn treebank is used extensively for dependency parsing across almost all time periods.
Machine Translation: We found that word alignment and inversion transduction grammar are popular techniques for machine translation across all time periods. The BLEU score has also been a popular technique since its introduction in 2000 – 2004. Similarly, the Moses toolkit and the IBM model are both popular techniques across most time periods.
Sentiment Analysis: In this area, mutual information and word sense disambiguation are popular techniques for most of the time periods. Latent Dirichlet allocation (introduced in 2003) found important use in sentiment analysis in 2005 – 2009. The spin model also gained popularity in 2005 – 2009.
Cross Lingual Textual Entailment: Distributional similarity and mutual information are important techniques and are popular in multiple time periods. Verb ocean becomes popular in 2005 – 2009 and 2010 – 2013. It is also very interesting to note that machine translation is itself an important tool for this area and is very popular in 2005 – 2009. However, in 2010 – 2013 its popularity goes down. A probable explanation is the introduction of techniques that perform cross-lingual textual entailment without machine translation, for example, FBK: Cross-Lingual Textual Entailment Without Translation [?].
Grammatical Error Correction: Techniques to address out-of-vocabulary (OOV) words have become important in recent times. Over the years, the Collins parser was replaced by the Charniak parser and finally by the Berkeley parser. The Penn treebank is an important dataset for this area.
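The bucket-wise ranking behind the second use case can be sketched as below. This is a simplified illustration under stated assumptions: the bucket boundaries, the `(citing_year, area, technique)` input format, the function names `bucket_of` and `popular_techniques`, and the example mentions are all hypothetical and do not reproduce the output of our system.

```python
from collections import Counter, defaultdict

# Hypothetical fixed buckets of years (inclusive bounds).
BUCKETS = [(2000, 2004), (2005, 2009), (2010, 2013)]


def bucket_of(year, buckets=BUCKETS):
    """Return the (start, end) bucket containing `year`, or None if there is none."""
    for start, end in buckets:
        if start <= year <= end:
            return (start, end)
    return None


def popular_techniques(mentions, buckets=BUCKETS, top_n=3):
    """Rank techniques per (area, bucket) by how often they are cited.

    `mentions` is an iterable of (citing_year, area, technique) triples.
    """
    counts = defaultdict(Counter)
    for year, area, technique in mentions:
        bucket = bucket_of(year, buckets)
        if bucket is not None:  # ignore years outside every bucket
            counts[(area, bucket)][technique] += 1
    return {
        key: [technique for technique, _ in counter.most_common(top_n)]
        for key, counter in counts.items()
    }


# Hypothetical citation mentions, for illustration only.
mentions = [
    (2006, "dependency parsing", "malt parser"),
    (2007, "dependency parsing", "malt parser"),
    (2008, "dependency parsing", "mst parser"),
    (2011, "dependency parsing", "mst parser"),
    (2012, "dependency parsing", "mst parser"),
    (2012, "dependency parsing", "malt parser"),
    (1995, "dependency parsing", "collins parser"),  # outside every bucket
]
ranking = popular_techniques(mentions)
print(ranking[("dependency parsing", (2010, 2013))])
```

Comparing the ranked lists of consecutive buckets for the same area is what surfaces trends such as one parser overtaking another.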
We shortlisted two top conferences in the computational linguistics domain, namely, the Annual Meeting of the Association for Computational Linguistics (ACL) and the International Conference on Computational Linguistics (COLING). We study 40 years of conference history. For each conference, the 40-year period is divided into four 10-year buckets. Next, for each conference, we extract the ten most popular areas (based on citation counts) for each 10-year bucket. The accompanying figure presents phrase clouds representing the evolution of areas in these two conferences in comparison to the full AAN dataset itself. Some of the interesting observations from this analysis are:
Full AAN dataset: Here, we observe that while in the earlier decades, areas such as “semantic role labeling”, “evaluation of natural language” and “speech recognition” were dominant, they fade away in the recent decades. On the other hand, areas such as “machine translation” and “dependency parsing”, which were less prevalent in the earlier decades gain significant importance in the recent decades. We also see “sentiment analysis” as one of the major areas in the last decade.
ACL: In the earlier decades of this conference, one finds that the community is interested in areas like “linguistic knowledge sources” and “semantic role labeling.” Over the recent decades, however, the community seems to be more interested in areas such as “machine translation” and “dependency parsing”. Interestingly, in the time period 2005 – 2013, an upcoming area like “social media” is found to gain importance.
COLING: For this conference, we observe that in the earlier decades, areas like “lexical semantics” and “linguistic knowledge sources” were of interest to the community. However, in recent years, areas like “machine translation”, “dependency parsing” and “bilingual lexicon extraction” have gained importance. An interesting observation here is that “semantic role labeling” has remained a thrust area for this conference throughout.
In this paper, we have proposed a novel information extraction system for scientific articles. The system extracts a ranked list of all application areas in the computational linguistics domain. At a more granular level, it also extracts the application area of a given paper. In addition, it extracts a ranked list of all techniques and performs paper-wise technique extraction. Finally, we construct a mapping from application areas to all the techniques. We evaluate our system with domain experts and show that it performs reasonably well on both precision and recall. As a use case, we present an extensive analysis of the temporal variation in the popularity of the techniques for a given area. Some of the interesting observations we make here are that areas like “machine translation” and “dependency parsing” are rising in popularity, while areas like “speech recognition”, “linguistic knowledge sources” and “evaluation of natural language” are on the decline.
In future, we plan to work on constructing a multi-level mapping table that maps application areas to techniques and, further, techniques to a set of parameters. For example, the application area Machine Translation has the BLEU score as one of its techniques. The BLEU score is an algorithm that takes a few input parameters, and changing these parameters changes the resulting score. One such parameter is $n$, the order of the $n$-grams.
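To illustrate how such a parameter affects the outcome, the sketch below computes the clipped (modified) $n$-gram precision, which is one component of the BLEU score, for different values of $n$. It is not a full BLEU implementation: the complete metric additionally combines several $n$-gram orders and applies a brevity penalty. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Hypothetical sentences, for illustration only.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

for n in (1, 2, 3):
    print(n, round(modified_precision(candidate, reference, n), 3))
```

For this pair of sentences the precision drops as `n` grows, since longer $n$-grams are harder to match, which shows why the choice of $n$ would be worth recording alongside the technique in a multi-level mapping.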
We also plan to run the current system on a larger corpus of scientific articles. All our methods can be generalized to domains other than computational linguistics. Currently, we plan to extend the dataset to the Microsoft academic research dataset (a larger dataset) and a biomedical dataset (a different domain). We also plan to study the temporal characteristics of techniques for a given application area, to see whether we can predict if a technique's popularity will increase or decrease in the years to come.
-  S. A. Caraballo. Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 120–126. Association for Computational Linguistics, 1999.
-  A. M. Cohen and W. R. Hersh. A survey of current work in biomedical text mining. Briefings in bioinformatics, 6(1):57–71, 2005.
-  C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17(suppl 1):S74–S82, 2001.
-  K.-i. Fukuda, T. Tsunoda, A. Tamura, T. Takagi, et al. Toward information extraction: identifying protein names from biological papers. Citeseer.
-  R. Gaizauskas, G. Demetriou, P. J. Artymiuk, and P. Willett. Protein structures and information extraction from biological texts: the pasta system. Bioinformatics, 19(1):135–143, 2003.
-  S. Gupta and C. D. Manning. Analyzing the dynamics of research by extracting key aspects of scientific papers. In IJCNLP, pages 1–9, 2011.
-  S. Gupta and C. D. Manning. Spied: Stanford pattern-based information extraction and diagnostics. Sponsor: Idibon, 38, 2014.
-  X. He. Using word dependent transition models in hmm based word alignment for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 80–87. Association for Computational Linguistics, 2007.
-  M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, COLING ’92, pages 539–545, Stroudsburg, PA, USA, 1992. Association for Computational Linguistics.
-  Y. Jin, M.-Y. Kan, J.-P. Ng, and X. He. Mining scientific terms and their definitions: A study of the ACL anthology. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 780–790, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.
-  R. Jones. Learning to extract entities from labeled and unlabeled text. PhD thesis, Citeseer, 2005.
-  S. N. Kim, O. Medelyan, M.-Y. Kan, and T. Baldwin. Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21–26. Association for Computational Linguistics, 2010.
-  P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics, 2007.
-  M. Krallinger, A. Valencia, and L. Hirschman. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome biology, 9(Suppl 2):1–14, 2008.
-  A. Kutuzov. Improving english-russian sentence alignment through pos tagging and damerau-levenshtein distance. ACL 2013, page 63, 2013.
-  P. Lopez and L. Romary. Humb: Automatic key term extraction from scientific articles in grobid. In Proceedings of the 5th international workshop on semantic evaluation, pages 248–251. Association for Computational Linguistics, 2010.
-  Y. Mehdad, M. Negri, and J. G. C. de Souza. Fbk: cross-lingual textual entailment without translation. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 701–705. Association for Computational Linguistics, 2012.
-  W. Menzel and I. Schröder. Decision procedures for dependency parsing using graded constraints. In Proceedings of ACL’90. Citeseer, 1998.
-  H.-M. Müller, A. Rangarajan, T. K. Teal, and P. W. Sternberg. Textpresso for neuroscience: searching the full text of thousands of neuroscience research papers. Neuroinformatics, 6(3):195–204, 2008.
-  V. Qazvinian and D. R. Radev. Scientific paper summarization using citation summary networks. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 689–696. Association for Computational Linguistics, 2008.
-  V. Qazvinian, D. R. Radev, and A. Özgür. Citation summarization through keyphrase extraction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 895–903. Association for Computational Linguistics, 2010.
-  X. Quan, C. Kit, and Y. Song. Non-monotonic sentence alignment via semisupervised learning. In ACL (1), pages 622–630. Citeseer, 2013.
-  D. R. Radev, P. Muthukrishnan, and V. Qazvinian. The ACL anthology network corpus. In Proceedings, ACL Workshop on Natural Language Processing and Information Retrieval for Digital Libraries, Singapore, 2009.
-  T. Schoenemann. Training nondeficient variants of ibm-3 and ibm-4 for word alignment. In ACL (1), pages 22–31, 2013.
-  P. K. Shah, C. Perez-Iratxeta, P. Bork, and M. A. Andrade. Information extraction from full text scientific articles: Where are the keywords? BMC Bioinformatics, 4(1):1–9, 2003.
-  S. Teufel et al. Argumentative zoning: Information extraction from scientific text. PhD thesis, Citeseer.
-  D. C. Wimalasuriya and D. Dou. Ontology-based information extraction: An introduction and a survey of current approaches. Journal of Information Science, 2010.
-  C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179–214, 2004.