Bib2vec: An Embedding-based Search System for Bibliographic Information
We propose a novel embedding model that represents relationships among several elements in bibliographic information with high representation ability and flexibility. Based on this model, we present a novel search system that shows the relationships among the elements in the ACL Anthology Reference Corpus. The evaluation results show that our model achieves high prediction ability and produces reasonable search results. A demonstration is available at http://tti-coin.jp/demo/bib2vec/
Modeling relationships among several types of information, such as nodes in information networks, has attracted great interest in natural language processing (NLP) and data mining (DM), since it can uncover hidden information in data. Topic models such as the author-topic model have been widely studied to represent relationships among these types of information. These models, however, require considerable effort to incorporate new types and do not scale well as the number of types increases, since they explicitly model the relationships between types in the generative process.
Word representation models, such as the skip-gram and continuous bag-of-words (CBOW) models, have achieved great success in NLP. They have been widely used to represent texts, and recent studies have started to apply them to other types of information, e.g., authors or papers in citation networks.
We propose a novel embedding model that represents relationships among several elements in bibliographic information, which is useful for discovering hidden relationships such as authors' interests and similar authors. Based on this model, we built a search system over the ACL Anthology Reference Corpus that lets users search for authors and words related to a given author. Building on the skip-gram and CBOW models, our system assigns vectors not only to words but also to other elements of bibliographic information, such as authors and references, which gives it high representation ability and flexibility. The vectors can be used to measure how closely elements are related, using similarity measures such as cosine similarity and inner products. For example, these similarities can be used to find words or authors related to a specific author. Our model can easily incorporate new types without changing the model structure, and it scales well with the number of types.
Most previous studies on modeling several elements in bibliographic information have been based on topic models such as the author-topic model. Although these models work fairly well, they have comparatively low flexibility and scalability since they explicitly model the generative process. Our model employs word representation-based models instead of topic models.
Some previous studies assigned embedding vectors to such elements. Among them, large-scale information network embedding (LINE) embeds a vector for each node in an information network. LINE handles a single type of information and prepares a network for each element type separately. By contrast, our model handles all types of information simultaneously.
We propose a novel method that represents bibliographic information by assigning embedding vectors to elements, based on the skip-gram and CBOW models.
We assume the bibliographic data set has the following structure. The data set is composed of bibliographic information of papers. Each paper consists of several categories. Categories are divided into two groups: a textual category (e.g., titles and abstracts
Our model focuses on a target element and predicts a context element from it. We use only elements in non-textual categories as contexts to reduce the computational cost. Figure ? shows the case in which an element in a non-textual category is the target. For the black-painted target element, the shaded elements in the same paper are used as its contexts.
When the target comes from the textual category, we do not treat each element as a separate target. Instead, as in CBOW, we consider the textual category to have a single element that represents all the elements in the category. Figure ? illustrates this case, in which the average of the vectors of all the elements in the textual category serves as the target.
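The target and context selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout of a paper, the category name `text`, the `<avg>` placeholder for the averaged textual target, and the choice to use every other non-textual element of the same paper as a context are all hypothetical and are not taken from the original implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical paper record: category name -> list of elements.
Paper = Dict[str, List[str]]
Element = Tuple[str, str]  # (category, element)

TEXT = "text"  # assumed name of the single textual category


def training_pairs(paper: Paper) -> List[Tuple[Element, Element]]:
    """Return (target, context) pairs for one paper.

    Contexts are restricted to non-textual elements. The textual category
    contributes one pseudo-target, (TEXT, "<avg>"), which stands for the
    average of all its word vectors.
    """
    non_textual = [(cat, e) for cat, elems in paper.items()
                   if cat != TEXT for e in elems]
    targets = list(non_textual)
    if paper.get(TEXT):
        targets.append((TEXT, "<avg>"))

    pairs = []
    for target in targets:
        for context in non_textual:
            if context != target:  # an element is not its own context
                pairs.append((target, context))
    return pairs


paper = {
    "author": ["a1", "a2"],
    "year": ["2015"],
    "text": ["machine", "translation"],
}
pairs = training_pairs(paper)
```

With three non-textual elements and one averaged textual target, this toy paper yields nine pairs, none of which has a word as its context.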
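We describe our probabilistic model for predicting a context element from a target in a given paper. We define two $d$-dimensional vectors $\mathbf{v}_e$ and $\mathbf{w}_e$ to represent an element $e$ as a target and as a context, respectively. Similarly to the skip-gram model, the probability of predicting element $e_O$ in the context from input $e_I$ is defined as follows:
$$p(e_O \mid e_I) = \frac{\exp(\mathbf{w}_{e_O}^\top \mathbf{v}_{e_I} + b_{e_O})}{\sum_{(\mathbf{w}_{e'}, b_{e'}) \in S_{c_O}} \exp(\mathbf{w}_{e'}^\top \mathbf{v}_{e_I} + b_{e'})} \qquad (1)$$
where $b_{e'}$ denotes a bias corresponding to $\mathbf{w}_{e'}$, and $S_{c_O}$ denotes the set of pairs of $\mathbf{w}_{e'}$ and $b_{e'}$ that belong to the category $c_O$ of $e_O$. As mentioned above, our model considers the textual category to have only one averaged vector. For a paper whose textual category contains the set of elements $T$, this vector can be described as:
$$\mathbf{v}_{T} = \frac{1}{|T|} \sum_{e \in T} \mathbf{v}_e \qquad (2)$$
Our target loss can be defined as:
$$L = -\sum_{(e_I, e_O) \in D} \log p(e_O \mid e_I) \qquad (3)$$
where $D$ denotes the set of all correct pairs of elements in the data set. To reduce the cost of the summation in Eq. (1), we applied noise-contrastive estimation (NCE) to minimize the loss.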
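As a rough illustration of the model just described, the following pure-Python sketch computes the category-wise softmax probability, the averaged textual target vector, and the negative log-likelihood loss. The function names, the toy vectors, and the data layout are assumptions made for this example. It evaluates the full softmax directly and does not implement the NCE approximation used for training.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
# One category's context table: element -> (context vector, bias).
Category = Dict[str, Tuple[Vector, float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean, used as the single target vector for a paper's text."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict_prob(target: Vector, context: str, category: Category) -> float:
    """Softmax probability of `context` among the elements of its category."""
    scores = {e: dot(w, target) + b for e, (w, b) in category.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    total = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[context] - top) / total


def loss(pairs: List[Tuple[Vector, str, Category]]) -> float:
    """Negative log-likelihood summed over (target, context, category) triples."""
    return -sum(math.log(predict_prob(t, c, cat)) for t, c, cat in pairs)


# Toy example with two authors and a two-word text (values are made up).
authors: Category = {"a1": ([1.0, 0.0], 0.0), "a2": ([0.0, 1.0], 0.0)}
text_target = average([[2.0, 0.0], [0.0, 0.0]])
p_a1 = predict_prob(text_target, "a1", authors)
nll = loss([(text_target, "a1", authors)])
```

Because the normalization runs only over the elements of one category, each category has its own softmax, which is why adding a new category does not require changing the model structure.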
3.3 Predicting related elements
We predict the top elements related to a query element by calculating their similarities to the query element. We calculate the similarities using one of three similarity measures: the linear function in Eq. (1), the dot product, and cosine similarity.
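The ranking step can be sketched as follows. This is a minimal example, not the system's actual code: the candidate table layout, the measure names, and the decision to score the query's vector against each candidate's vector and bias are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def top_k(query: Vector,
          candidates: Dict[str, Tuple[Vector, float]],
          k: int,
          measure: str = "linear") -> List[str]:
    """Rank candidates (name -> (vector, bias)) by similarity to the query."""
    def score(name: str) -> float:
        vec, bias = candidates[name]
        if measure == "linear":   # dot product plus the candidate's bias
            return dot(vec, query) + bias
        if measure == "dot":
            return dot(vec, query)
        if measure == "cosine":
            return cosine(vec, query)
        raise ValueError(f"unknown measure: {measure}")

    return sorted(candidates, key=score, reverse=True)[:k]


# Toy vectors (made up) showing that the measures can rank differently.
query = [1.0, 0.0]
candidates = {
    "x": ([2.0, 0.0], 0.0),
    "y": ([0.5, 0.5], 3.0),
    "z": ([0.0, 1.0], 0.0),
}
by_linear = top_k(query, candidates, 2, "linear")
by_dot = top_k(query, candidates, 2, "dot")
by_cosine = top_k(query, candidates, 2, "cosine")
```

In this toy case the large bias of `y` puts it first under the linear function, while the dot product and cosine similarity both rank `x` first.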
As pre-processing, we deleted commas and periods attached to the ends of words and removed non-alphabetical words such as numbers and brackets from abstracts and titles. We then lowercased the words and built phrases using the word2phrase tool.
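A minimal sketch of the token-level cleaning might look like the following. The exact definition of a non-alphabetical word is not given in the text, so the regular expression here (letters with optional internal hyphens) is an assumption, and phrase construction with word2phrase is not covered.

```python
import re
from typing import List

# Assumed definition of an "alphabetical" word: ASCII letters, optionally
# joined by internal hyphens (e.g. "state-of-the-art").
ALPHABETICAL = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def preprocess(text: str) -> List[str]:
    """Strip trailing commas/periods, drop non-alphabetical tokens, lowercase."""
    tokens = []
    for raw in text.split():
        word = raw.rstrip(",.")  # commas and periods attached to the end
        if ALPHABETICAL.fullmatch(word):
            tokens.append(word.lower())
    return tokens


tokens = preprocess("Statistical Machine Translation, 2007 ( SMT ) works.")
```

Here the number and the brackets are dropped, and the remaining words are lowercased with their trailing punctuation removed.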
We prepared five categories: author, paper-id, reference, year and text. author consists of the list of a paper's authors, without regard to author order. paper-id is a unique identifier assigned to each paper; this mimics the paragraph vector model. reference includes the paper ids of the papers in this data set that the paper cites. Although ids are shared between paper-id and reference, we did not assign the same vectors to them since they belong to different categories. year is the publication year of the paper. text includes the words and phrases in both abstracts and titles; it is the textual category, while every other category is treated as non-textual. We regard elements as unknown when they appear fewer times than the minimum frequencies in Table 1.
We split the data set into training and test sets: 17,475 papers for training and the remaining 2,000 papers for evaluation. In the test set, we regarded elements that do not appear in the training set as unknown elements.
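The handling of unknown elements can be sketched as below. The thresholds in the example are invented for illustration (the actual values are those of Table 1), and the `<unk>` placeholder and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set

Paper = Dict[str, List[str]]
UNK = "<unk>"  # hypothetical placeholder for unknown elements


def build_vocab(train: List[Paper], min_freq: Dict[str, int]) -> Dict[str, Set[str]]:
    """Keep, per category, the elements seen at least min_freq[category] times."""
    counts = defaultdict(Counter)
    for paper in train:
        for cat, elems in paper.items():
            counts[cat].update(elems)
    return {cat: {e for e, n in counter.items() if n >= min_freq.get(cat, 1)}
            for cat, counter in counts.items()}


def mark_unknown(paper: Paper, vocab: Dict[str, Set[str]]) -> Paper:
    """Replace elements outside the training vocabulary with UNK."""
    return {cat: [e if e in vocab.get(cat, set()) else UNK for e in elems]
            for cat, elems in paper.items()}


# Toy data; the thresholds below are made up, not those of Table 1.
train = [
    {"author": ["a", "b"], "year": ["2014"]},
    {"author": ["a"], "year": ["2014"]},
]
test_paper = {"author": ["a", "c"], "year": ["2015"]}
vocab = build_vocab(train, {"author": 2, "year": 1})
train_marked = mark_unknown(train[0], vocab)
test_marked = mark_unknown(test_paper, vocab)
```

Building the vocabulary from the training set alone covers both rules at once: rare training elements and elements that appear only in the test set are mapped to the same placeholder.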
We set the dimension of vectors to 300 and show the results with the linear function.
|Input Author||Relevant Words||Similar Authors||Topic Words||Topic Authors|
|Philipp Koehn||machine translation||Hieu Hoang||alignment||Chris Dyer|
| ||hmeant||Alexandra Birch||translation||Qun Liu|
| ||human translators||Eva Hasler||align||Hermann Ney|
|Ryan McDonald||dependency parsing||Keith Hall||parse||Michael Collins|
| ||extrinsic||Slav Petrov||sentense||Joakim Nivre|
| ||hearing||David Talbot||parser||Jens Nilson|
We automatically built multiple-choice questions and evaluated the accuracy of our model on them. We also compared some results of our model with those of the author-topic model.
Our method models elements in several categories and allows us to estimate relationships among the elements with high flexibility, but this makes evaluation complex. Since it is impractical to evaluate all possible combinations of inputs and targets, we focused on relationships between authors and other categories. We prepared an evaluation data set that requires the system to estimate an author from other elements. We removed one known (i.e., not unknown) author from each paper in the evaluation set and asked the system to predict the removed author from all the other elements in the paper. Since choosing the correct author from among all authors is extremely difficult, we prepared 10 candidates per question. To evaluate the effectiveness of our model, we compared its accuracy on this data set with that of logistic regression. Our model achieved an accuracy of 74.3% (1,486 / 2,000), which is comparable to the 74.1% (1,482 / 2,000) obtained by logistic regression.
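The multiple-choice protocol can be sketched as follows. Several details are assumptions made for this example and are not stated in the text: distractors are sampled uniformly from authors who are not on the paper, and the model is abstracted as a hypothetical `score(remaining_elements, candidate)` function.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (remaining elements of the paper, removed author, answer candidates)
Question = Tuple[List[str], str, List[str]]


def make_question(authors: Sequence[str], others: Sequence[str],
                  all_authors: Sequence[str], rng: random.Random,
                  n_candidates: int = 10) -> Question:
    """Remove one author and mix it with randomly sampled distractors."""
    answer = rng.choice(list(authors))
    remaining = [a for a in authors if a != answer] + list(others)
    pool = [a for a in all_authors if a not in authors]
    candidates = rng.sample(pool, n_candidates - 1) + [answer]
    rng.shuffle(candidates)
    return remaining, answer, candidates


def accuracy(questions: List[Question],
             score: Callable[[List[str], str], float]) -> float:
    """Fraction of questions where the top-scoring candidate is the answer."""
    correct = 0
    for remaining, answer, candidates in questions:
        guess = max(candidates, key=lambda c: score(remaining, c))
        correct += guess == answer
    return correct / len(questions)


rng = random.Random(0)
all_authors = [f"a{i}" for i in range(30)]
question = make_question(["a0", "a1"], ["2015"], all_authors, rng)

# An oracle scorer answers every question correctly.
oracle = lambda remaining, cand: 1.0 if cand == question[1] else 0.0
oracle_accuracy = accuracy([question], oracle)

# The reported accuracies follow directly from the reported counts.
ours = round(100 * 1486 / 2000, 1)
logreg = round(100 * 1482 / 2000, 1)
```

With 10 candidates per question, random guessing would reach about 10% accuracy, which gives a reference point for both reported figures.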
Table 2 shows examples of the search results produced by our model. The leftmost column shows the authors we input to our model. For the rightmost two columns, we manually picked words and authors belonging to an author-topic model topic that can be considered to correspond to the input author. The table shows that our model predicts related words and similar authors reasonably well, although its words differ from those of the author-topic model.
Figure 1 shows a screenshot of our system. The left-hand box shows a word cloud of the words related to the query, and the right-hand box shows the closest authors. A query can be entered by typing it into the text box or by clicking one of the authors in the right-hand box, and a similarity measure can be selected with a radio button.
When training the model, we did not use elements in the textual category as contexts. This reduced the computational cost, but it may have hurt the accuracy of the embeddings. Furthermore, because we used the averaged vector for the textual category, we did not consider the importance of each word. Our model might also ignore the inter-dependency among elements since it is based on skip-gram. To resolve these problems, we plan to incorporate attention so that the model can pay more attention to the elements that are important for predicting other elements.
We also found that some elements have several aspects. For example, the words related to an author can spread over several different NLP tasks. We may be able to model this by assigning multiple vectors to each element.
This paper proposed a novel embedding method that represents several elements in bibliographic information with high representation ability and flexibility, and presented a system that can search for relationships among those elements. The search results in Table 2 show that our model predicts related words and similar authors reasonably well. We plan to extend the model by incorporating attention and by assigning multiple vectors to an element. Since the model has high flexibility and scalability, it can be applied not only to papers but also to a variety of bibliographic information in broad fields.
We would like to thank the anonymous reviewer for helpful comments and suggestions.
- Note that we have only one textual category since the categories for texts are usually not distinguished in most word representation models.
Steven Bird, Robert Dale, Bonnie J. Dorr, Bryan R. Gibson, Mark Thomas Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir R. Radev, and Yee Fan Tan. The ACL anthology reference corpus: A reference dataset for bibliographic research in computational linguistics.
Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.
Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents.
Wang Ling, Yulia Tsvetkov, Silvio Amir, Ramon Fermandez, Chris Dyer, Alan W Black, Isabel Trancoso, and Chu-Cheng Lin. Not all contexts are created equal: Better word representations with variable attention.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality.
Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. Efficient non-parametric estimation of multiple embeddings per word in vector space.
Michal Rosen-Zvi, Thomas L. Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents.
Yanchuan Sim, Bryan R. Routledge, and Noah A. Smith. A utility model of authors in the scientific community.
Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: large-scale information network embedding.