Interactive Exploration and Discovery of Scientific Publications with PubVis
With an exponentially growing number of scientific papers published each year, advanced tools for exploring and discovering publications of interest are becoming indispensable. To empower users beyond a simple keyword search, as provided e.g. by Google Scholar, we present the novel web application PubVis. Powered by a variety of machine learning techniques, it combines essential features to help researchers find the content most relevant to them. An interactive visualization of a large collection of scientific publications provides an overview of the field and encourages the user to explore articles beyond a narrow research focus. This is augmented by personalized content-based article recommendations as well as an advanced full-text search to discover relevant references. The open-source implementation of the app can be easily set up and run locally on a desktop computer to provide access to content tailored to the specific needs of individual users. Additionally, a PubVis demo with access to a collection of papers can be tested online.
Machine Learning Group, Technische Universität Berlin, Germany
The Web of Science (by Thomson Reuters, http://wokinfo.com/) has a record of almost 60 million published scientific papers from all areas and, for biomedical literature specifically, the PubMed database (https://www.ncbi.nlm.nih.gov/pubmed/) comprises more than 26 million citations from MEDLINE, life science journals, and online books. This is already an incomprehensibly large amount of published literature, and the number of articles published per year is increasing exponentially.
With many subfields and fragmented communities, getting an overview of current and accumulated research can be challenging and time-consuming, especially if you do not know where to start. While review papers and conferences aim to bring researchers up to speed on developments in the field, they often provide a biased view of the subject matter, tainted by current trends and the personal preferences of authors and reviewers. Generally, researchers have to rely on simple keyword searches, e.g. as implemented by Google Scholar, to obtain a specific piece of information. While easy to use and widely applicable, keyword searches can only comb through the texts to match a query, thereby only scratching the surface of the semantic relations encoded in the unstructured data. Google does a very good job of delivering the papers you searched for, but what if you do not yet know what you are looking for? What if you are new to the field or just want to explore research beyond your field of expertise? It is hard to search for keywords you do not even know exist.
While content discovery and recommendation systems are widely used in commercial settings, for example for movies or news, comparatively few solutions are available for scientific publications. Besides the fact that many of the existing services seem to be poorly maintained after their initial publication, they often rely on user ratings [9, 10] or citations, which can again lead to biased recommendations. The Science Concierge computes the similarities between papers’ abstracts to provide content-based recommendations; however, it still requires the user to initially search for articles of interest. As a promising and well-maintained project, the Arxiv Sanity Preserver by Andrej Karpathy (http://www.arxiv-sanity.com/) lists recent arXiv preprints and enables the user to explore similar papers as well as get personalized recommendations based on papers of interest. However, none of these existing solutions provides the user with a global overview of a field to put relevant papers into context.
We present the novel web application PubVis (http://pubvis.herokuapp.com/ and http://arxvis.herokuapp.com/), which combines a set of features to help researchers discover the content most relevant to them. An interactive visualization of a large collection of scientific publications provides the initial overview of a field, including clusters of subtopics, and invites users to also explore articles from outside their research focus. Besides a simple keyword search to find articles of interest, it is possible to upload the entire abstract of a current paper draft to search for related articles, e.g. to ensure you are not missing recently published key references. Additionally, personalized article recommendations are available based on the content of other papers the user has marked as relevant.
In the following sections, the individual features of the web application are described in more detail with a special focus on the machine learning techniques behind them. This includes obtaining an exemplary dataset of PubMed abstracts about various cancer types (Section 3), computing similarities between articles (Section 4), creating an interactive visualization of the dataset (Section 5), searching for related articles based on content similarity (Section 6), and generating personalized article recommendations (Section 7). The paper concludes with a discussion and outlook.
2 Overview of the app
The PubVis web application is implemented in Python using the Flask microframework to build a REST API back-end delivering all the content in JSON format. On top, the front-end displays the retrieved data, relying heavily on the d3.js library to generate the interactive visualization. The app can be run locally on a desktop computer (with a reasonable amount of RAM) and includes all the code necessary to download paper abstracts from the web and prepare this data for exploration and discovery (https://github.com/cod3licious/pubvis).
For the initial setup of the app, as well as for updating it with new content, a series of actions detailed in the next sections has to be performed: scraping new content, updating the article similarities and search index, and computing the embedding coordinates used for the visualization. To save expensive resources when deploying the app to a cloud-based service such as Heroku, all data updates can still be executed from a local machine by simply connecting to the database running in the cloud.
3 Obtaining data
As approximately 40% of all men and women are diagnosed with a form of cancer during their lifetime, it is no surprise that cancer is one of the most researched biological topics with many research papers publicly available on PubMed. Using the PubMed API (http://www.ncbi.nlm.nih.gov/books/NBK25500/) we create an exemplary scientific literature dataset containing almost abstracts associated with ten different cancer types (Figure 1).
While all following results are based on this dataset, it is also easy to populate PubVis with different papers. By default, the app includes functions to obtain abstracts from both the PubMed and arXiv API, and it is possible to add content from other sources to PubVis by writing a custom scraper.
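As a rough illustration of the first step of such a scraper, the sketch below builds a search request for the NCBI E-utilities `esearch` endpoint, which returns the PubMed IDs matching a query. The helper name `build_pubmed_search_url` and the example query are assumptions made for illustration; they are not taken from the PubVis code base, and the request itself is not sent here.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(query, max_results=100):
    """Return the esearch URL that lists PubMed IDs matching `query`.

    Hypothetical helper for illustration, not part of PubVis.
    """
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": query,          # free-text query, e.g. a cancer type
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_pubmed_search_url("breast cancer", max_results=50)
print(url)
```

A custom scraper for another source would follow the same pattern: build the request, fetch the result, and store title, abstract, and metadata for each article.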
4 Similar publications
For every article in PubVis, related papers are suggested based on the similarities of the articles’ content (Figure 2).
To compute these similarities, the papers first have to be encoded as numerical feature vectors. By computing weighted counts of all words in a paper’s text, a tf-idf bag-of-words (BOW) representation is generated for each article. Tf stands for ‘term frequency’, which is computed for every document and term by counting how often a word occurs in the text. Some words, such as ‘the’ or ‘and’, occur frequently in almost all documents, but are not very descriptive. Their influence can be reduced by weighting the term frequencies with their inverse document frequency (idf). The idf of a term $t_i$ is calculated as the logarithm of the total number of documents in the corpus, $N$, divided by the number of documents which contain term $t_i$, i.e.

$$\mathrm{idf}(t_i) = \log \frac{N}{|\{d : t_i \in d\}|}$$
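The weighting described in this paragraph can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions (lowercasing and whitespace tokenization, raw counts as term frequency); the preprocessing used in PubVis itself may differ, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    # Assumption: simple lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    return vectors


docs = [
    "the tumor growth in the lung",
    "the diet and the risk of cancer",
    "the lung cancer and diet",
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Because ‘the’ occurs in every document of this toy corpus, its idf is log(1) = 0 and it drops out of all vectors, whereas a word such as ‘tumor’, which occurs in only one document, receives the largest weight.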
The resulting sparse, high dimensional BOW representations of the documents are then transformed using latent semantic analysis (LSA). LSA is a simple topic modeling technique used to reduce noise and create more overlap between document vectors from similar topics. To this end, the dimensionality of the BOW vectors is reduced to a fixed number of components using singular value decomposition. Finally, the similarity between two documents is computed as the cosine similarity of their LSA vectors.
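The two steps in this paragraph, dimensionality reduction by singular value decomposition followed by cosine similarity, can be illustrated with NumPy on a toy document-term matrix. This is only a sketch: the matrix values and function names are made up for illustration, and a dense SVD is used for brevity, whereas a large sparse corpus would call for a truncated sparse solver.

```python
import numpy as np


def lsa_embed(X, n_components):
    """Project the rows of X onto the top singular directions (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def cosine_similarity_matrix(Z):
    """Pairwise cosine similarities between the rows of Z."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Z_unit = Z / np.maximum(norms, 1e-12)  # guard against all-zero rows
    return Z_unit @ Z_unit.T


# Toy document-term matrix: rows are documents, columns are terms.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
])
Z = lsa_embed(X, n_components=2)
S = cosine_similarity_matrix(Z)
print(np.round(S, 2))
```

In the original space the first two documents have a cosine similarity of 0.8; after the projection onto two components they collapse onto the same direction and their similarity is approximately 1, while both remain orthogonal to the third document. This is the "more overlap between document vectors from similar topics" effect in miniature.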
5 Interactive visualization
To create the interactive visualization (Figure 5 in the Appendix), the heart of PubVis, the papers’ LSA vectors are embedded in two dimensions using t-SNE. The algorithm’s ability to preserve local neighborhoods in the embedding makes it an excellent choice for creating a visualization that can be explored to discover related articles.
To obtain the low-dimensional coordinates for a set of data points, t-SNE first constructs a probability distribution over pairs of high-dimensional input data points based on their Euclidean distance in the original space. A similar probability distribution is defined over the pairwise distances in the low-dimensional embedding space, and the solution is then obtained by iteratively minimizing the KL-divergence between both distributions using gradient descent until a local minimum is reached.
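For reference, the quantities involved can be written out as follows. The notation is an assumption made here, following the standard formulation by van der Maaten and Hinton rather than this text: $x_i$ are the high-dimensional inputs (here the LSA vectors), $y_i$ the two-dimensional embedding coordinates, $N$ the number of points, and $\sigma_i$ a per-point bandwidth.

```latex
% Conditional neighbor probabilities in the input space (Gaussian kernel)
p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Pairwise probabilities in the embedding space (Student-t kernel)
q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}

% Cost function minimized by gradient descent over the y_i
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Because a large $p_{ij}$ paired with a small $q_{ij}$ is penalized heavily, points that are close in the input space are pulled together in the embedding, which is what preserves local neighborhoods.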
When computing the probability distribution for the input data, a perplexity parameter has to be set, which relates to the expected number of nearest neighbors of a data point. Additionally, the dimensionality and sparseness of the input vectors and the resulting pairwise Euclidean distances influence the solution. We experimented with different values and set the perplexity to 15 and the number of LSA components to 150, which provides a good trade-off between grouping together articles from the same field and preserving subclusters for specific topics. To verify the quality of the embedding, a postdoc in oncology examined the visualization and identified the topics of papers belonging to the same cluster (Figure 3). Besides topics related to a specific cancer type, it was also possible to identify interdisciplinary clusters, e.g. concerning the quality of life or the influence of certain diets on the development of cancer, which supports the idea that valuable insights might be gained from broadening the literature research beyond one’s usual focus.
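To make the link between perplexity and "number of neighbors" concrete, the following sketch computes the perplexity of the Gaussian neighbor distribution of a single point. It assumes the standard definition from the t-SNE paper, perplexity = 2^H with H the Shannon entropy in bits; the function names and toy distances are hypothetical.

```python
import math


def conditional_probabilities(sq_distances, sigma):
    """Gaussian neighbor probabilities for one point.

    sq_distances: squared Euclidean distances from the point to all others.
    """
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probabilities):
    """Perplexity 2^H(P), with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy


# Four neighbors at equal distance are equally likely, so the perplexity
# is (up to rounding) the number of neighbors.
uniform = conditional_probabilities([1.0, 1.0, 1.0, 1.0], sigma=1.0)
print(perplexity(uniform))

# Two close and two distant neighbors: only the close ones carry weight.
mixed = conditional_probabilities([0.1, 0.1, 50.0, 50.0], sigma=1.0)
print(perplexity(mixed))
```

In t-SNE the bandwidth of each point is chosen by a search so that this value matches the user-defined perplexity, here 15, which is why the parameter can be read as an effective neighborhood size.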
6 Search by similarity
In addition to a simple keyword search, PubVis offers the possibility to match a whole abstract against the database of papers. This can be helpful, for example, when drafting a new paper, to quickly verify that no recently published papers central to your case were overlooked. For the search to work, first an index has to be created, linking every word occurring in any of the collected texts to the documents it occurs in. This can easily be accomplished by transposing the documents’ tf-idf BOW vectors, which additionally ensures that frequent but meaningless words do not have a significant influence on the search results. When a new abstract is then submitted for search, the set of all words occurring in it is used to access the index and the scores associated with all matching articles are aggregated to yield the final results (Figure 6 in the Appendix). The submitted abstracts are only used to search for related articles and not stored in the database.
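The index construction and lookup described above can be sketched as follows. This is an illustrative toy version, not the PubVis implementation: the tf-idf weights are made up, and the query is tokenized by lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_index(doc_vectors):
    """Transpose {doc_id: {term: weight}} into {term: {doc_id: weight}}."""
    index = defaultdict(dict)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term][doc_id] = weight
    return index


def search(index, query_text, top_k=3):
    """Rank documents by the summed weights of the query's distinct words."""
    scores = defaultdict(float)
    for term in set(query_text.lower().split()):
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical tf-idf weights for three articles.
doc_vectors = {
    "a1": {"lung": 0.5, "tumor": 1.0},
    "a2": {"diet": 0.5, "risk": 1.0},
    "a3": {"lung": 0.5, "diet": 0.5, "cancer": 0.25},
}
index = build_index(doc_vectors)
results = search(index, "Lung tumor growth")
print(results)
```

Words that do not occur in the index (here ‘growth’) are simply ignored, and a word with a low tf-idf weight contributes little to the score of any document, which is the reason frequent but uninformative words barely affect the ranking.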
7 Personalized recommendations
For every paper in the app, users can indicate whether it is relevant or irrelevant to them. Based on this collected information, personalized content-based article recommendations can be generated for each user (Figure 4).
For all articles the user has marked as relevant, similar articles are retrieved. These articles are then combined in a set of potential suggestions and labeled with the maximum similarity score each article received from one of the relevant articles. Taking the maximum rather than the average of the similarity scores for an article associated with multiple papers the user has marked as relevant accounts for the fact that users might be interested in different topics whose articles are not necessarily related to each other. From this list of potential suggestions, all articles that the user has previously marked as irrelevant are then removed. These negative ratings are not taken into account when generating the initial list of suggestions, however, as this does not improve the quality of such content-based recommendations.
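A minimal sketch of this procedure is given below. It assumes that the similarities between articles are already available as a nested dictionary; the function name, the data layout, and the additional removal of articles the user has already marked as relevant are assumptions made for this illustration.

```python
def recommend(similar, relevant, irrelevant, top_k=5):
    """Content-based suggestions from the articles a user marked as relevant.

    similar: {article_id: {other_id: similarity}}, e.g. precomputed cosine scores.
    """
    candidates = {}
    for article_id in relevant:
        for other_id, score in similar.get(article_id, {}).items():
            # Keep the maximum similarity over all relevant articles.
            if other_id not in candidates or score > candidates[other_id]:
                candidates[other_id] = score
    # Remove articles marked as irrelevant; as an assumption of this sketch,
    # articles already marked as relevant are not suggested again either.
    for rated_id in set(relevant) | set(irrelevant):
        candidates.pop(rated_id, None)
    return sorted(candidates.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical similarity scores between five articles.
similar = {
    "p1": {"p2": 0.9, "p3": 0.2, "p4": 0.6},
    "p5": {"p3": 0.8, "p4": 0.1, "p1": 0.3},
}
suggestions = recommend(similar, relevant=["p1", "p5"], irrelevant=["p4"])
print(suggestions)
```

Article p3 is only weakly related to p1 but strongly related to p5; with the maximum it keeps the score 0.8, whereas an average would have pushed it down to 0.5 even though it matches one of the user's interests well.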
For the PubVis instance running online, the preferences of multiple users are tracked by identifying them via cookies.
8 Discussion and Outlook
With an elaborate set of features, the PubVis app can aid researchers in the exploration and discovery of scientific publications. With an interactive visualization of a large collection of articles, the user can quickly obtain an overview of a field. Additionally, personalized article recommendations, an easy to navigate network of similar papers, and advanced search functionalities effectively provide users with relevant content. The open source app is easy to set up and run locally on a desktop computer and by using the built-in interface to the PubMed and arXiv APIs, the articles displayed in PubVis can be tailored to the user’s interests.
While, with a reasonable amount of RAM (8-16GB) available, the app can easily cope with more than articles, scalability is certainly an issue. Running the app itself is less problematic, but the content updates in the current setup, especially computing the article similarities and the two-dimensional embedding with t-SNE (both $\mathcal{O}(N^2)$), can quickly exceed the available resources. If more papers should be included in the app, it might be necessary to switch to iterative updates and the more efficient (but less exact) Barnes-Hut implementation of t-SNE ($\mathcal{O}(N \log N)$). However, since it can be assumed that individual users are generally only interested in well-constrained research areas and it is enough to execute these content updates only a few times per month, the current performance should be sufficient for average use cases. We see this app as a big step towards making literature research simpler and more enjoyable.
Further development of PubVis will focus on making the app even easier to set up (i.e. not requiring the usage of a terminal to run it), providing access to other journals and conference abstracts via custom web scrapers, and using the app for other types of content, e.g. to complement movie recommendation systems.
I would like to thank Dmitry Monin for programming the PubVis front-end, Alice Nomura for annotating the topic clusters in the PubVis visualization, and Antje Relitz for her helpful comments on the manuscript draft.
- [1] Titipat Achakulvisut, Daniel E Acuna, Tulakan Ruangrong, and Konrad Kording. Science Concierge: A fast content-based recommendation system for scientific publications. PLOS ONE, 11(7):e0158423, 2016.
- [2] Robert M Bell and Yehuda Koren. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79, 2007.
- [3] Bo-Christer Björk, Annikki Roos, and Mari Lauri. Global annual volume of peer reviewed scholarly articles and the share available via different open access options. In ELPUB2008, 2008.
- [4] Bela Gipp, Jöran Beel, and Christian Hentschel. Scienstein: A research paper recommender system. In Proceedings of the International Conference on Emerging Trends in Computing (ICETiC’09), pages 309–315, 2009.
- [5] N Howlander, AM Noone, M Krapcho, J Garshell, D Miller, SF Altekruse, et al. SEER cancer statistics review, 1975–2011. Bethesda, MD: National Cancer Institute; 2014, 2013.
- [6] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
- [7] Laurens van der Maaten. Barnes-Hut-SNE. In Proceedings of the International Conference on Learning Representations, 2013.
- [8] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
- [9] Chong Wang and David M Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 448–456. ACM, 2011.
- [10] Takashi Yoneya and Hiroshi Mamitsuka. PURE: a PubMed article recommendation system based on content-based filtering. Genome Informatics, 18:267–276, 2007.