Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols
Despite the central importance of research papers to scientific progress, they can be difficult to read. Comprehension is often stymied when the information needed to understand a passage resides somewhere else—in another section, or in another paper. In this work, we envision how interfaces can bring definitions of technical terms and symbols to readers when and where they need them most. We introduce ScholarPhi, an augmented reading interface with four novel features: (1) tooltips that surface position-sensitive definitions from elsewhere in a paper, (2) a filter over the paper that “declutters” it to reveal how the term or symbol is used across the paper, (3) automatic equation diagrams that expose multiple definitions in parallel, and (4) an automatically generated glossary of important terms and symbols. A usability study showed that the tool helps researchers of all experience levels read papers. Furthermore, researchers were eager to have ScholarPhi’s definitions available to support their everyday reading.
Researchers are charged with keeping on top of immense, rapidly changing literatures. Naturally, then, reading constitutes a major part of a researcher’s everyday work. Senior researchers, such as faculty members, spend over one hundred hours a year reading the literature, consuming over one hundred papers annually (Tenopir et al., 2009). And despite the formidable background knowledge that a researcher gains over the course of their career, they will still often find that papers are prohibitively difficult to read.
As they read, a researcher is constantly trying to fit the information they find into schemas of their prior knowledge, but the success of this assimilation is by no means guaranteed (Bazerman, 1985). A researcher may struggle to understand a paper due to gaps in their own knowledge, or due to the intrinsic difficulty of reading a specific paper (Bazerman, 1985). Reading is made all the more challenging by the fact that scholars increasingly read selectively, looking for specific information by skimming and scanning (Nicholas et al., 2010; Hillesund, 2010; Tenopir et al., 2019).
We are motivated by the question “Can a novel interface improve the reading experience by reducing diversions and distractions that interrupt the reading flow?” This work takes a measured step to address the general design question by focusing on the specific case of helping readers understand cryptic technical terms and symbols defined within a paper, which are called “nonce words” in the field of linguistics. Formally, a nonce word is a word that is coined for a particular use, which is unlikely to become a permanent part of the vocabulary (Mattiello, 2017). Because a nonce word is localized to a specific paper, a reader cannot know precisely what it means when they start reading the paper. Because it is only intended for use within a single paper, it is likely to be defined somewhere within that same paper, but finding that definition may require significant effort by the reader. By their nature, nonce words are an interesting focus for augmenting reading tools because readers will have questions about them, and those questions will be answerable by (and only by) searching the text that contains them.
Two aspects of nonce words constrain the design of any reading application that is built to define them. First, they are abundant. A paper can contain hundreds of them; indeed, a single passage may contain dozens closely packed together. For example, an equation may be composed of dozens of symbols, and sub-symbols that make up those symbols. Similarly, tables of results may have rows and columns which are indexed by abbreviations for metrics, datasets, and experimental conditions (Figure 2). In such settings a reader is likely to have demands on their working memory and may also want to see definitions for multiple nonce words in the same vicinity. Second, nonce words are sometimes assigned multiple definitions within the same paper. One example is a single symbol that, over the course of a single paper, may variously stand for a dummy variable in a summation operation, the number of components in a mixture of Gaussian models, and the number of clusters output by a clustering algorithm (see the scenario in Section 4). These two aspects of nonce words raise the question of whether conventional solutions for showing definitions of terms (e.g., the electronic glossaries explored in second-language learning research (Cheng and Good, 2009; Yanagisawa et al., 2020) or Wikipedia’s page previews (MediaWiki contributors, [n.d.])) also suit a researcher who is puzzling their way through dense, cryptic, ambiguous notation.
In this work, we design, develop, and assess ScholarPhi, a prototype tool for helping readers concentrate on the cognitively demanding task of reading scientific papers by providing them efficient access to definitions of nonce words. This paper begins with a formative study of nine readers as they read a scientific text of their own choice (Section 3.1). Most readers expressed confusion at nonce words in the text. Many readers were reluctant to look up what the words meant given the anticipated cost of doing so. This inspired the subsequent design of tools that could have answered those readers’ questions while requiring so little effort that readers would actually use the tool.
We then describe design motivations for a new reading interface (Section 3.2) that are grounded in insights from four pilot studies of early prototypes of ScholarPhi, conducted with 24 researchers. Key insights from the research include the importance of tailoring definitions to the passage where a reader seeks to understand a nonce word, and the competing goals of providing scent (i.e., visual cues (Pirolli and Card, 1999)) of what is defined without distracting from a reading task that is already cognitively demanding on its own.
Building on the motivations found in the pilot research, the ScholarPhi system is presented (Section 4). The basic design of ScholarPhi is one of an interactive hypertext interface. A reader’s paper is augmented with subtle hyperlinks indicating which nonce words can be clicked in order to access definition information. Readers can click nonce words to access definitions for those words in a compact tooltip (Figure 1). These definitions are position-sensitive—that is, if there are multiple definitions of a nonce word in the text, ScholarPhi uses the heuristic of showing readers the most recent definition that appears before the selected usage of the nonce word. Definitions are also linked to the passage they were extracted from: a reader can click on a hyperlink next to the definition to jump to where it appears in the paper. In addition to definitions, the tooltip makes available a list of all usages of the nonce word throughout the text (including definitions), as well as a special view of formulae that include the nonce word.
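The position-sensitive heuristic described above can be illustrated with a minimal sketch. It is not ScholarPhi's implementation: the `Definition` record, the use of character offsets as positions, the example symbol `k`, and the choice to return nothing when no definition precedes the usage are all assumptions made for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Definition:
    position: int  # hypothetical character offset of the defining sentence
    text: str


def most_recent_definition(
    definitions: list[Definition], usage_position: int
) -> Optional[Definition]:
    """Return the last definition that appears strictly before the usage.

    Returns None when no definition precedes the usage (an assumption;
    the paper does not state what is shown in that case).
    """
    ordered = sorted(definitions, key=lambda d: d.position)
    positions = [d.position for d in ordered]
    # bisect_left counts the definitions whose position is < usage_position.
    count_before = bisect_left(positions, usage_position)
    return ordered[count_before - 1] if count_before > 0 else None


# Hypothetical definitions of one symbol ("k") at three places in a paper.
definitions_of_k = [
    Definition(120, "dummy variable in a summation"),
    Definition(900, "number of mixture components"),
    Definition(2400, "number of clusters"),
]
selected = most_recent_definition(definitions_of_k, usage_position=1000)
```

Under these assumptions, a usage at offset 1000 resolves to the definition at offset 900 rather than to the first or last definition in the paper.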
Beyond these basic affordances, ScholarPhi provides a suite of features, each of which provides readers with efficient yet non-intrusive methods for accessing information about nonce words. First, ScholarPhi provides efficient, precise selection mechanics for selecting mathematical symbols and their sub-symbols through single clicks, rather than error-prone text selections (Section 4.1). Second, ScholarPhi provides a novel filter over the paper called “declutter” that helps a reader search for information about a nonce word by low-lighting all sentences in the paper that do not include that word (Section 4.2). Third, ScholarPhi generates equation diagrams and overlays them on top of display equations, affixing labels to all symbols and sub-symbols in the equation for which definitions are available (Section 4.3). The final feature is a priming glossary comprising definitions of all nonce words that appear in a paper, prepended to the start of the document to let a reader review the nonce words for a paper before they begin to read it (Section 4.4).
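As a rough illustration of the “declutter” filter, the sketch below marks which sentences would stay prominent and which would be low-lighted for a selected nonce word. It assumes the paper has already been split into sentences and that a usage can be found by whole-token text matching; both are simplifications, and the function name `declutter` is hypothetical rather than taken from the system.

```python
import re


def declutter(sentences: list[str], nonce_word: str) -> list[tuple[str, bool]]:
    """Pair each sentence with True if it mentions the nonce word.

    Sentences paired with False are the ones a reader interface would
    low-light. Matching is on whole tokens only (an assumption), so the
    symbol "k" does not match inside the word "kernel".
    """
    pattern = re.compile(rf"(?<!\w){re.escape(nonce_word)}(?!\w)")
    return [(sentence, bool(pattern.search(sentence))) for sentence in sentences]


# Hypothetical passage and selection.
passage = [
    "We set k to 5.",
    "The kernel is Gaussian.",
    "Clusters are merged.",
]
marked = declutter(passage, "k")
```

Returning a flag per sentence, rather than deleting sentences, mirrors the idea of low-lighting: the surrounding text stays in place so that the reader does not lose their position.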
The emphasis in the design of each of these features is on acknowledging the inherent complexity of the setting of scientific papers, and hence designing features for looking up definitions that are easy to invoke and minimally distracting. To enable these features, new methods were introduced for analyzing scientific papers in order to make nonce words interactive. A paper processing pipeline was built that automatically segments equations into symbols and their sub-symbols, detects all usages of a nonce word, and detects the precise bounding box locations of nonce words so that they may be clicked. A custom PDF annotation tool was built to facilitate the manual extraction of definitions and annotation of nonce words with these definitions. This pipeline was sufficient for enabling us to research the design and use of ScholarPhi, and has been made available for
This work concludes with a controlled usability study with twenty-seven researchers (Section 5). Researchers were observed as they used three versions of ScholarPhi—one with all the features described above, one with only the “declutter” feature, and one that behaved exactly like a standard, un-augmented PDF reader. When readers had access to ScholarPhi’s features, they could answer questions about a scientific paper in significantly less time, while viewing significantly less of the paper in order to come to an answer. They also reported that they found it easier to answer questions about the paper, and were more confident about their answers, when they used ScholarPhi. Researchers were also observed as they used ScholarPhi for a fifteen-minute period of unstructured reading time. Researchers made use of all of ScholarPhi’s features. Feedback was overwhelmingly positive. Most participants expressed an interest in using the features “often” or “always” for future papers, with a particular emphasis on the perceived utility of definition tooltips and equation diagrams.
In summary, this work makes four contributions. First, it characterizes the problem of searching for information about nonce words as one of the challenges of reading scientific papers, grounded in a small formative study. Second, it provides design motivations for designing interactive tools that define nonce words, grounded in the iterative design of a tool. Third, it presents ScholarPhi, an augmented reading interface with a suite of novel features for helping readers understand nonce words in scientific papers. Finally, it provides evidence of the usefulness of the design in searching and reading scientific papers through a controlled study with twenty-seven researchers.
2. Background and Related Work
2.1. How researchers read papers
Researchers read papers to become aware of foundational ideas and to stay apprised of the latest developments in their field. However, reading papers is difficult. Challenges in reading a paper can come from gaps in the reader’s knowledge, or from ideas in the paper that are poorly explained (Bazerman, 1985). Papers may be read out-of-order and piecemeal (Bazerman, 1985; Hillesund, 2010; Nicholas et al., 2010). As a result, a passage of a paper may be read out of context. The need to assimilate information scattered across one or many papers is representative of what has been observed in the human-computer interaction community for active reading behavior of other types of knowledge workers (O’Hara, 1996; Adler et al., 1998; Tashman and Edwards, 2011a).
Papers that include mathematical content can impose additional demands on a reader. Reading mathematical texts often entails grappling with unfamiliar terminology and notational idioms, which can be particularly challenging for less experienced readers (Shepherd and Van De Sande, 2014). Self-reports from mathematicians have suggested that the process of reading math involves backtracking as a reader attempts to scaffold their understanding (Weber, 2008), a pattern which has also been observed in eye-tracking studies of reading math (Inglis and Alcock, 2012; Kohlhase et al., 2018). When attempting to understand an equation, readers will look to nearby equations and text for clarifications (Kohlhase et al., 2018).
While reading papers in physical volumes and print-outs used to be the norm, it is increasingly the case that researchers consult papers in digital reading applications (Tenopir et al., 2009; Late et al., 2019), particularly for some types of scholarly communication such as conference proceedings (Late et al., 2019). This suggests the value of investing in reading user interfaces that take advantage of the unique interactive potential of digital interfaces to augment the reading experience.
2.2. Augmented reading interfaces
Since the beginning of human-computer interaction as a discipline, one of the foundational challenges has been equipping knowledge workers with tools that extend their cognition during reading. Vannevar Bush in his vision of the memex proposed a system that enabled readers to build trails across the literature, linking passages across related readings in a way that makes implicit connections clear (Bush, 1945). This vision has expressed itself in many forms, from the invention of hypertext (Conklin, 1987) to experiments with interactive books (Norman, 2013) and “fluid documents” that can adapt their form and content to elaborate where readers need clarification (Chang et al., 1998). In the first decade of the CHI conference, myriad techniques were proposed to help readers navigate text using social annotations (Hill et al., 1992), augment hypertexts with glosses that could dynamically change the layout of the text (Zellweger et al., 1998), and provide navigational affordances that allowed readers to see overviews of document content and jump quickly to passages of interest (Graham, 1999; Schilit et al., 1998).
Glossaries, Definitions, and Explanations
Today, many reading and editing tools show dictionary definitions when a reader hovers over or clicks on a word. The Word Wise feature in the Amazon Kindle lets readers view definitions of tricky words in the space between consecutive lines of text (Look Up Words, People, and Places While You Read, [n.d.]). In 2014, Wikipedia began to roll out page previews as a feature that allowed readers to preview the content of a referenced page by hovering over a link to that page. Based on positive usability evaluation results, Wikipedia decided to make the feature a permanent fixture on the site (MediaWiki contributors, [n.d.]). Recent proceedings of human-computer interaction conferences have introduced prototypes that allow readers to answer their questions about how to use web pages (Chilana et al., 2012), the meaning of cryptic programming syntax (Head et al., 2015), hard-to-visualize quantities (Hullman et al., 2018), and unfamiliar words from a second language (Lungu et al., 2018).
ScholarPhi uses an advanced symbol selection technique that draws from related work. Zeleznik et al. (2010) introduced gestures for a multi-touch display that support the efficient selection of mathematical expressions. Bier et al. (2006) designed a technique for rapid selection of entities (such as addresses) with a single click. The symbol selection mechanism in ScholarPhi can be seen as a combination of these two features, supporting single-click selection of mathematical expressions, with refinement of the selection to choose specific sub-symbols of that expression via additional clicks. In the future, ScholarPhi may support the efficient selection of many nonce words at once in a passage using fuzzy text selection techniques such as those proposed by Hinckley et al. (2012) and Chang et al. (2016).
Information Highlighting and Fading
ScholarPhi was designed to provide the efficiency of visual querying present in contemporary code editors like VSCode (VSCode, [n.d.]), in which arbitrary text (i.e., a variable or expression) can be selected, and all other appearances of that same text are instantly highlighted everywhere else in the text. In the design of its lists of definitions and usages, ScholarPhi also draws inspiration from tools such as LiquidText (Tashman and Edwards, 2011b), which support viewing lists of within-text search results side-by-side with the query term highlighted. In its design of the “declutter” filter, ScholarPhi draws on the design of visual filters already present in prototype and production tools. The fading out of content in order to direct a reader’s focus to information of interest is a design pattern that has been put to use for interactive tutorials (Kelleher and Pausch, 2005) in which instructions are highlighted while the rest of the user interface is faded, as well as interactive debugging tools (Ko and Myers, 2009; Dragicevic et al., 2011).
Readability versus Document Augmentations
On the whole, evidence has supported embedding explanations in texts. In the context of second-language learning, embedded glosses for unfamiliar vocabulary have been shown to lead to vocabulary learning (Yanagisawa et al., 2020) and improved comprehension (Taylor, 2006).
That said, in making texts interactive, there is a key tension between assisting the reader and distracting them. On the one hand, studies such as one run by Rott (2007) suggest that the best comprehension outcomes can be achieved when all words that have glosses are marked. On the other hand, interactive texts change readers’ behavior. Understandably, readers are more likely to click on words that are visibly interactive (De Ridder, 2002), leading to what has been called by some “click-happy behavior” (Roby, 1999). Furthermore, studies of texts augmented with hyperlinks have sometimes shown that these augmentations have led to worse comprehension of the texts, rather than better comprehension (DeStefano and LeFevre, 2007). What the evidence suggests overall is that amidst the appeal of interactive reading interfaces, great care must be taken during design to make sure not to introduce features that will ultimately distract readers from the cognitively demanding task of reading.
2.3. Tools for reading scientific papers
Links to External Resources
Tools can help researchers read scientific papers in a number of ways. To reduce the need to click away from the paper currently being read, some journal publishers now allow readers to view metadata by clicking on citations (eLife, 2013; PubMed, [n.d.]; Springer, [n.d.]). Experimental tools have been built that augment papers with additional information about cited papers (Powley et al., 2009), bias in study design (Marshall et al., 2016), and links to external learning resources (Liu et al., 2015; Jiang et al., 2018). They have supported explanations from one person to another, allowing peer reviewers (PeerLibrary, [n.d.]), collaborators (Yoon et al., 2014), instructors (McCartney et al., 2018), strangers (Fermat’s Library, [n.d.]), and crowds (Jiang and Dogan, 2014) to annotate and discuss arbitrary passages of papers. Other approaches to saving the scientist time include tools to support literature search (e.g., (Zhang et al., 2008; Ponsard et al., 2016; Qian et al., 2019)), summarize the text (Scholarcy, [n.d.]; Cachola et al., 2020) or rewrite passages in simpler language (Kim et al., 2016).
Links Within Papers
Reading interfaces can also help researchers by helping them navigate to information of interest within the paper. For several years, interfaces for reading PDFs have provided standard affordances for jumping to hyperlinks within the paper. Typesetting software like LaTeX can automatically embed clickable links from references to figures, equations, and sections to the content they refer to, and from citations to reference sections.
Prototype tools have been built to further assist readers in finding passages about topics of interest (Graham, 1999), in jumping between a passage that describes research results to the relevant parts of data tables (Kong et al., 2014; Kim et al., 2018; Badam et al., 2019), and jumping to passages that answer natural language questions (Zhao and Lee, 2020). Other research has augmented static figures in papers with interactive overlays (Grossman et al., 2015; Masson et al., 2020).
Of particular relevance to this paper is a class of experimental systems that surface explanations of terms and symbols in scientific papers. Tools have been developed that link from terms to pages that define them on Wikipedia (Abekawa and Aizawa, 2016), expand acronyms (SciHive, [n.d.]), and direct readers from key phrases in papers to topic pages that define those phrases and present relevant excerpts from other papers (ScienceDirect, [n.d.]).
Tools for Reading Math
In response to the unique challenges of reading mathematical texts, prototype tools have been designed to help readers find explanations of math expressions in the text (Alcock, 2009; Pagel and Schubotz, 2014; Kohlhase et al., 2018). The e-Proof system focuses on single-page proofs rather than papers. As part of a guided tour of a proof, it selectively fades out the parts of the proof that are not currently the focus of the tour (Alcock, 2009). Another approach is a prototype that lets readers look up the meanings of operator symbols in external knowledge bases, and reveals simplified versions of equations with details elided (Kohlhase et al., 2011). Of these tools, only Alcock (2009) was evaluated with human users (see (Roy, 2014; Roy et al., 2017)) in a math education setting. It was found that while readers used the tools of their own accord (Roy, 2014), many features that were introduced to assist readers, such as audio walkthroughs of the content, got in readers’ way (Roy et al., 2017). ScholarPhi consolidates and extends features from these prior prototypes, and introduces additional features and affordances, with the goal of helping readers understand nonce words in papers. This work contributes a system that can help scale these interactions to scientific papers, and an understanding of how to tailor the interactions to better support readers based on iterative evaluation of interactive prototypes.
3. Design Motivation and Principles
The design principles for ScholarPhi are based on several rounds of iterative design, first with a formative study of how scholars currently read scientific texts, and then with increasingly fleshed out versions of the prototype. This section simultaneously reports on these rounds of iterative design and presents the resulting design principles.
3.1. Formative study
To better understand how the presence of nonce words affects the reading experience, we conducted a small formative study. Nine readers (four graduate students, five undergraduate students, referred to as R1–9 below) participated in an observational study in which they read a scientific text of their own choice. Six participants brought research papers (R1–5, R8). Five of these papers were about computer science and one was about architecture. Three participants brought instructional texts on the topics of data science (R6), experimental design (R7), and formal analysis (R9). These latter three participants were included to see whether the obstacles encountered in reading scientific papers also occurred for readers of other types of scientific texts.
Readers were asked to read their text for forty minutes. During this time, they thought aloud. Readers reported when they encountered confusing passages of text. Then, they described whether they intended to look up information to clarify their confusion. If they chose to look for such clarifying information, they described where they looked and why. Our findings were as follows:
All but one reader expressed confusion at a term used in the text (R1–3, R5–9). In some cases, the confusion was about a term that was specific to the scientific discipline of the text (R3, R5–9), such as the terms “diacritic” (R3) or “population parameter” (R6). For papers from computer science, such terms included both benchmarks used to test an algorithm (R3) as well as baselines against which an algorithm of interest was compared (R5).
In other cases, the terms causing confusion came from within the same paper. Authors introduced terms to describe their methods (“symbolic validator” (R1), “backtranslation” (R3)) that had nuanced meanings within the text, but whose meaning the reader could not summon when viewed out of the context of its definition. Authors would invent shorthand for running examples (e.g., a test set of cow images named “cow”) that they then referred back to by that shorthand (i.e., “cow”) throughout the figures, which could be confusing if the reader was reading the text out of order (R5). Texts could also be sprinkled with vague back-references to assumptions (R5), analyses (R6), parameters (R8), and theorems (R9) that readers could not recall. In some cases (R6, R8), readers were not sure whether a reference referred to a passage in the current text or in another text.
Mathematical symbols were another source of confusion (R2–4, R6). Questions that readers had about symbols were of several kinds. Readers sometimes could simply not understand the meaning of a symbol (R2–4, R6). In other cases, they wanted information about how a set of symbols was used in combination. For example, R4 scanned the appendix of the research paper they were reading to better understand the meaning of a ratio that appeared in one of the equations. Readers also wondered about the values that symbols were assigned (R2, R3, R6). For example, one reader (R2) wondered what value the regularization parameter was set to when a model was trained. Another reader (R3) wanted to see example data that could be used as inputs to a translation algorithm.
Thus, confusion about terms and symbols (nonce words, in our terminology) was common among the readers in the study. Readers’ strategies for resolving this confusion varied based on how important it was that they understood a nonce word. If it mattered that they understood a nonce word, a reader often attempted to infer meaning from context (R3, R6–9). If they could not surmise the meaning from context, readers would sometimes delay looking up an explanation with the hope that they might find one later in the text (R1, R3, R4, R6–9). A drawback of this approach, described by R1, is that a reader may reach a point in the text at which they lack an understanding of so many important terms that they can no longer understand the text without stopping and searching for explanations.
Eventually, many readers needed to stop reading in order to look up explanations. One participant referred to this as an undesirable “context switch” which takes them out of the “headspace” of understanding a complicated passage (R4). When looking for explanations, five readers looked elsewhere in the same text (R2–4, R8, R9). This entailed backtracking within the text (R3, R4), jumping forward (R2, R4), opening within-text glossaries (R8), and performing within-text search (i.e., “Control-F” search) within the reading application (R9). Those reading instructional texts often consulted external references like web search results (R6, R8), dictionary applications (R7), and Wikipedia (R9). One reader took a proactive approach to reducing the cost of within-paper lookups by assembling glossaries for key symbols in the margins of the text (R4, see Figure 3).
This study indicated that readers of scientific papers, and scientific texts more generally, frequently have questions about nonce words. To answer these questions, readers either infer answers from context, wait for an answer, or look for explanations elsewhere. While readers do look for explanations elsewhere, they try to avoid doing so as it takes them away from the text they are trying to understand. These observations suggest that readers could benefit from interfaces that make explanations of nonce words available to them without distracting them from the task of a careful reading.
3.2. Design guidelines from iterative design
The design of ScholarPhi was refined through an iterative design process lasting twelve months. Improvements to the design were motivated by feedback from 24 researchers who used prototypes of the tool to read scientific papers.
Four pilot studies were conducted, each one of a prototype of ScholarPhi at a different stage of design.
First study (declutter lens only): 4 researchers (D1–4)
Second study (side notes containing definitions, defining formulae, and usages): 4 researchers (S1–4)
Third study (tooltips instead of side notes): 9 researchers (T1–9)
Fourth study (equation diagrams and a complex version of tooltip interaction flow): 9 researchers (E1–9)
The first two studies (with participants D1–4 and S1–4) were observation studies. Participants thought aloud as they used the tool. For the last two studies (with participants T1–9 and E1–9), participants read on their own, participated in a 30-minute focus group discussion, and filled out a questionnaire about their experience. Seven participants in these last two studies (T1–3, E1–4) participated in a 15-minute follow-up interview. In each study, participants read a different scientific paper. Two researchers (S2, S3) participated in multiple studies.
Participant feedback motivated improvements to the interface which are reported in Section 4. One author analyzed transcripts from all studies following a qualitative approach. This yielded the following six design motivations for designing effective interfaces for providing in-situ explanations within scientific texts.
M1. Tailor definitions to the location of appearance. The same nonce word can have multiple conflicting definitions throughout a paper. For example, in the paper used as stimulus in the formal study (Strubell et al., 2018), one symbol took on multiple distinct senses, including referring to the dimensionality of a vector, being part of a composite symbol used to refer to a layer in a neural network, and being used as the matrix transposition operation in several display equations. Additionally, several meanings of this symbol were never explicitly defined.
When readers used a prototype that showed definitions of all of these senses in a list, they wanted to know which ones were the most appropriate to the passage that they were reading (S1–3). Readers requested that the tool show the definitions appropriate to the place where they asked for them (S1). They also asked to see the surrounding context of a definition (S2, S3).
A related principle is eliminating redundant definitions. If a reader selected a nonce word within a passage where it was being defined, they did not wish to see a tooltip containing the definition sentence they were already reading (S1, T9).
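The redundancy rule above can be expressed as a simple filter. The sketch below is illustrative only: the `Definition` record, its character-offset fields, and the function name `non_redundant` are hypothetical, and it assumes that the extent of each defining sentence is known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    sentence_start: int  # hypothetical offset where the defining sentence begins
    sentence_end: int    # hypothetical offset just past the end of that sentence
    text: str


def non_redundant(
    definitions: list[Definition], selection_offset: int
) -> list[Definition]:
    """Drop any definition whose own sentence contains the reader's selection."""
    return [
        d
        for d in definitions
        if not (d.sentence_start <= selection_offset < d.sentence_end)
    ]
```

Whether the interface should then fall back to an earlier definition or show no tooltip at all is a separate design choice that this sketch leaves open.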
M2. Connect readers to definitions in context. Five readers requested the ability to jump from a definition to the passage where it appeared in the paper (S1–3, T5, T6). This would aid judging relevance (S1–3) and assessing what appeared to be errors in the extraction algorithm (T5).
M3. Consolidate information. While the information that explains a nonce word can be scattered across a paper, readers want explanations that consolidate all of that information in one compact, concise package. When they clicked on a composite symbol, they wanted to see explanations of each sub-symbol that made it up (E2, E4). They also expected the interface to be able to gather explanations for semantically similar symbols that differed in their surface features, such as applying a definition extracted for a function to a symbol that differed from it only in surface form (E1).
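One way to picture the consolidation of sub-symbol explanations is as a traversal of a symbol tree. The following is a sketch under stated assumptions, not the system's method: the `Symbol` tree, the dictionary of definitions keyed by symbol name, and the example symbol names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    name: str
    children: list["Symbol"] = field(default_factory=list)


def consolidated_explanations(
    symbol: Symbol, definitions: dict[str, str]
) -> list[tuple[str, str]]:
    """Collect definitions for a symbol and all of its sub-symbols.

    The tree is walked depth-first in reading order; symbols without a
    known definition are skipped.
    """
    found: list[tuple[str, str]] = []
    stack = [symbol]
    while stack:
        node = stack.pop()
        if node.name in definitions:
            found.append((node.name, definitions[node.name]))
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return found


# Hypothetical composite symbol with three sub-symbols.
composite = Symbol("W_i^T", [Symbol("W"), Symbol("i"), Symbol("T")])
known = {"W": "weight matrix", "T": "matrix transposition"}
explanations = consolidated_explanations(composite, known)
```

Here a single click on the composite symbol would yield the explanations for `W` and `T` together, while the undefined sub-symbol `i` contributes nothing.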
M4. Provide scent. In all prototypes, nonce words were marked with a light dotted underline. Readers appreciated that the underlines provided scent of which words they could click to see definitions (S2–4). Participants did not turn off this affordance, although they were provided with this option in later versions of the design.
M5. Minimize occlusion In two prototypes, tooltips were packed with definitions, defining formulae, and usages for symbols. Readers reported that these tooltips occluded text that they wished to see (T4, T6, E7) without providing much value beyond the first definition (T1, T4–6). Still, some readers desired tooltips as opposed to side notes, as it allowed them to view definitions without losing their place in the text (E3, E4). The current prototype attempts to balance these conflicting needs by providing a compact tooltip that contains only the most recent definition of a nonce word and a few small buttons for accessing lists of definitions, defining formulae, and usages. A tooltip for a nonce word can be hidden by clicking on a “close” button within the tooltip.
M6. Minimize distractions The user interface was revised several times to remove features that, while originally envisioned as being helpful, distracted from the reading task. One reader aptly described, “I was trying to pay more attention to the paper than the tool and the paper requires a lot of overhead to understand. So I didn’t have much left over for the tool” (E1). One prototype used several highlighting colors to indicate appearances, usages, and definitions of a selected nonce word; however, this added visual clutter that was hard to understand (E3). The current prototype uses a single static highlight color. Readers were asked across multiple studies whether they found underlines beneath the nonce words distracting. They repeatedly reported that they did not (S2–4, T5, T7). However, one reader did request the ability to turn them off (E1), which has been included in all recent prototypes of the interface.
4. User Interface Walkthrough
We illustrate the experience of using ScholarPhi through a set of four scenarios, where a reader wishes to know the meaning of a specific nonce word. Each scenario is chosen so that one of ScholarPhi’s features is uniquely well-suited to the reader’s task.
To explain the design decisions underlying a feature, we refer back to findings from the formative research. Specifically, we note whenever a design choice was informed by one of the design motivations M1–6 that were introduced in Section 3.2. Implementation details can be found in Appendix A.1.
4.1. Definition tooltips
When a reader wants to know the meaning of a nonce word, ScholarPhi lets them look up the meaning by clicking the nonce word. This reveals a definition tooltip (see Figure 1).
Definition tooltips appear directly beneath the selected nonce word. This placement is intentional. Because the definition is placed beneath the word, as opposed to in a document margin or a glossary elsewhere in the text, a reader need not divert their gaze from the text. In this way, the tooltip placement is chosen to minimize distraction (M6). Furthermore, to avoid occluding the text (M5), tooltips are compact. Their width never exceeds half the page width, nor are they permitted to be taller than four lines.
If there are multiple definitions of a nonce word available within the paper, ScholarPhi shows the definition that it infers to be most relevant to the context. Specifically, it uses a heuristic of showing the definition that appeared most recently before that appearance of the word. This spares the reader the mental effort of sorting through multiple definitions of the nonce word (M1) and reduces the amount of text occluded by the tooltip (M5).
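The recency heuristic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ScholarPhi's implementation: the `Definition` record, the use of character offsets as positions, and the choice to return `None` when no definition precedes the selected appearance are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Definition:
    text: str
    position: int  # hypothetical: character offset of the defining sentence


def most_recent_definition(
    definitions: Sequence[Definition], usage_position: int
) -> Optional[Definition]:
    """Return the definition that appears latest at or before the usage."""
    earlier = [d for d in definitions if d.position <= usage_position]
    # Assumption: with no earlier definition, nothing is shown (None).
    return max(earlier, key=lambda d: d.position, default=None)


# Hypothetical data: one symbol defined twice with different senses.
definitions = [
    Definition("index of a component in a mixture of Gaussians", 120),
    Definition("number of clusters output by the clustering algorithm", 900),
]
early = most_recent_definition(definitions, usage_position=300)
late = most_recent_definition(definitions, usage_position=1000)
```

Under these assumptions, a lookup between the two defining sentences resolves to the first sense, and a lookup after the second defining sentence resolves to the second sense.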
In the passage shown below, refers to an index of a component in a mixture of Gaussians.
However, in a later passage, is given an entirely different meaning—a parameter that controls the number of clusters output by a clustering algorithm. When the reader opens a definition tooltip in this other passage, they also see the appropriate definition.
After seeing a definition in the tooltip, a reader may want more information about the nonce word. For instance, they may want to find whether the authors recommended that a specific number of components be used in the mixture of Gaussians. To help the reader answer questions like this, ScholarPhi connects the reader to definitions in context (M2). The reader can view the definition in context by clicking the hyperlink next to the definition (i.e., “page 2” in the figure above). ScholarPhi scrolls the paper to the definition, highlighting the sentence that the definition came from:
When the reader has finished consulting the highlighted passage, they can click their web browser’s “Back” button to return to the definition tooltip at their previous position in the document.
Lists of usages A reader can also look for more information about a nonce word by reviewing the usages of the word. To connect a reader with these usages, the definition tooltip provides three buttons. The buttons let a reader open lists of all prose definitions of the word, all defining formulae (i.e., formulae in which the nonce word appears on the left-hand side of an assignment), and all usages (i.e., passages that refer to the nonce word). Together, the buttons provide a way for readers to access a consolidated collection of everything that ScholarPhi knows about a nonce word (M3). Each button provides scent that helps a reader understand how a nonce word is defined and used (M4). By hovering over a button, the reader can see how many definitions, defining formulae, or usages there are for the nonce word. The button is disabled when no definitions, defining formulae, or usages exist. When a reader clicks the button for “all usages,” the list of usages opens in a dedicated sidebar, rather than in the tooltip, to avoid occluding the text (M5). For example, below we show the usages list for the nonce word .
Each usage in the list comprises one sentence referring to the nonce word and a link to the sentence where it appears in the paper (M2). To help readers evaluate the relevance of a usage, which can contain the visual clutter of dense text and equations, the nonce word is highlighted wherever it appears in a usage.
To avoid disorienting the reader, a tooltip always makes the same information available to a reader in the same layout: buttons for lists of definitions, defining formulae, and usages, as well as a definition if one is available. If a tooltip is opened for a nonce word within the sentence where the word is defined, the definition tooltip reports, “Defined here.” This way, tooltips do not distract the reader from the text with a definition they have already seen, or are about to see (M6). If no definition exists for the nonce word, then the three buttons to access the usage lists are still shown, but those with no information behind them are grayed out.
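The fixed tooltip layout described above can be modeled as a small data structure. The sketch below is one possible reading of the text, with hypothetical names (`ListButton`, `Tooltip`, `build_tooltip`); it assumes a button is enabled exactly when its list is non-empty and that the “Defined here.” message replaces the definition text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ListButton:
    label: str
    count: int  # number of entries behind the button, shown on hover

    @property
    def enabled(self) -> bool:
        # Buttons with no information behind them are grayed out.
        return self.count > 0


@dataclass(frozen=True)
class Tooltip:
    message: Optional[str]  # definition text, "Defined here.", or None
    buttons: Tuple[ListButton, ...]


def build_tooltip(
    definition: Optional[str],
    selected_in_defining_sentence: bool,
    n_definitions: int,
    n_formulae: int,
    n_usages: int,
) -> Tooltip:
    """Always produce the same layout: a message plus three list buttons."""
    message = "Defined here." if selected_in_defining_sentence else definition
    buttons = (
        ListButton("definitions", n_definitions),
        ListButton("defining formulae", n_formulae),
        ListButton("usages", n_usages),
    )
    return Tooltip(message, buttons)


tooltip = build_tooltip("a hypothetical definition", False, 2, 0, 5)
```

Keeping the three buttons present in every case, and only toggling whether each is enabled, is what makes the layout stable across nonce words with and without definitions.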
Scent While some nonce words are defined in a paper, others are not. Authors may assume the meaning of a nonce word is implicit or they may simply forget to define the nonce word. ScholarPhi provides visual scent (Pirolli and Card, 1999) to help readers determine whether they’ll find a definition for a nonce word before they click on it (M4). This visual scent is provided in the form of an underline beneath the nonce word. For instance, in the following passage, readers can open definition tooltips for any of the underlined nonce words, “CoNLL-2005,” “SRL,” and “LISA.”
So that it does not divert a user’s attention from the text needlessly (M6), ScholarPhi assumes that a reader will not want to view a nonce word in a sentence that defines it, and so does not underline such nonce words. The rules for underlining symbols are more nuanced. Papers can contain composite symbols where certain sub-symbols (e.g., subscripts or superscripts) are defined, but the symbol as a whole is not. In such a case, ScholarPhi highlights sub-symbols for which definitions are available. In the passage below, ScholarPhi highlights symbols to indicate it has definitions for “,” “,” and “.” Because the composite symbol “” is defined in the sentence, it is not underlined.
Symbol selection In a conventional interface for reading papers, one challenge to searching for information about a symbol is simply selecting the symbol. Because the text for a symbol is often split across multiple baselines (i.e., in subscripts or superscripts), conventional text selection mechanisms may fail to select precisely those characters that belong to the symbol. To reduce the cost of accessing explanations, ScholarPhi supports efficient selection of symbols. Symbols can be selected by clicking them once (steps “1” and “2” below). Once a symbol is selected, all sub-symbols that belong to it are highlighted and can be selected with a click (“3”).
By helping readers rapidly select sub-symbols, it is hoped that ScholarPhi lets readers understand the meaning of a composite symbol in terms of the meanings of its parts (M3).
Beyond the core features of definition tooltips and lists of usage, ScholarPhi provides three innovative views to help readers access definitions of nonce words when and where they need them, which we describe next.
4.2. Declutter
To help readers quickly find information about a nonce word that is scattered across a paper, ScholarPhi provides a novel feature called “decluttering.” When a reader selects a nonce word, ScholarPhi “declutters” the paper — by highlighting segments of text that contain matches, and fading out all other sentences — in an effort to help readers scan the paper for usages.
ScholarPhi provides visual scent (M4) of where usages can be found via a conventional search bar. The search bar counts how many times the nonce word appears in the paper, and shows the page number of the usage the reader selected. While readers are expected to navigate a decluttered document by scrolling through it, the search bar also supports navigation between usages with “Next” and “Previous” buttons with arrow key keyboard shortcuts.
Decluttering has several advantages over the list of usages: it connects readers to definitions in context by providing a view that is grounded in the text (M2) and it reduces distractions by hiding content in the paper, rather than exposing additional user interface widgets (M6). Like the list of usages, decluttering does not occlude text (M5).
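The core of decluttering, as described above, is a per-sentence decision to highlight or fade, plus a running count for the search bar. The following Python sketch assumes the paper has already been split into sentences and that matching is a whole-word, case-sensitive string match; both are simplifying assumptions, and the function name `declutter` is hypothetical.

```python
import re
from typing import List, Tuple


def declutter(sentences: List[str], nonce_word: str) -> Tuple[List[str], int]:
    """Label each sentence "highlight" or "fade" and count all matches."""
    # Lookarounds keep the match from firing inside a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(nonce_word) + r"(?!\w)")
    styles = []
    total_matches = 0
    for sentence in sentences:
        matches = len(pattern.findall(sentence))
        total_matches += matches
        styles.append("highlight" if matches else "fade")
    return styles, total_matches


sentences = [
    "We train LISA on CoNLL-2005.",
    "Attention heads are shared.",
    "LISA outperforms prior SRL models; LISA is fast.",
]
styles, total = declutter(sentences, "LISA")
```

A real implementation would also need to match symbols inside equations, which plain string matching does not handle.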
4.3. Equation diagrams
Some passages are rife with nonce words. For instance, tables of empirical results are indexed by abbreviations that represent experimental conditions and measurements. Equations contain dozens of symbols. For dense passages like these, readers may desire the ability to consult the definitions for many nonce words at the same time. For display equations in particular (i.e., equations that are shown on their own line separated from the text), ScholarPhi provides the ability to view definitions of all symbols at the same time. To see the definitions of all symbols in a display equation, a reader can click that equation. Definitions are affixed to all symbols simultaneously.
Definitions are shown for symbols (e.g., “”) and the sub-symbols they are composed of (“”, “”). Thus, definition information that would otherwise be split across multiple tooltips is consolidated into one place (M3). Like the definitions that appear in tooltips, the definitions for equation diagrams are position-sensitive (M1). By clicking a label for a symbol, a reader can open the definition tooltip for the symbol, providing access through the definition tooltip to the context of the definition (M2).
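One way to picture how an equation diagram gathers its labels is as a traversal over each symbol and its sub-symbols, keeping those with an available definition. The sketch below is hypothetical throughout: the `Symbol` tree, the `diagram_labels` function, and the example symbols are illustrative stand-ins, and the position-sensitive choice of definition (M1) is assumed to have been made already when building the `definitions` mapping.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Symbol:
    name: str
    children: List["Symbol"] = field(default_factory=list)


def diagram_labels(
    equation: List[Symbol], definitions: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Collect (symbol, definition) pairs for symbols and sub-symbols."""
    labels = []
    stack = list(reversed(equation))
    while stack:
        symbol = stack.pop()
        if symbol.name in definitions:
            labels.append((symbol.name, definitions[symbol.name]))
        # Visit sub-symbols right after their parent, left to right.
        stack.extend(reversed(symbol.children))
    return labels


# Hypothetical equation with one composite symbol and one plain symbol.
equation = [Symbol("x_i", [Symbol("x"), Symbol("i")]), Symbol("W")]
definitions = {"x_i": "input token", "i": "token index", "W": "weights"}
labels = diagram_labels(equation, definitions)
```

Symbols without a definition (here, the hypothetical `x`) simply receive no label, which matches the behavior of showing definitions only where they are available.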
Brushing and linking connects the definitions to the symbols; as a reader hovers over a definition, the symbol it defines is highlighted with a more saturated color than the other symbols. Leader lines connect the definitions to the symbols. The leader lines connecting definitions to symbols are diagonal, proceeding straight from the definition label to the symbol. This style of leader line was chosen as opposed to orthogonal leaders (i.e., leaders comprising one horizontal and one vertical segment). While in general, orthogonal leaders have been observed to have great legibility (Barth et al., 2019), we have found that diagonal lines stand out better amidst the clutter of other marks in an equation (M6).
4.4. Priming glossary
Scientific texts like textbooks often contain glossaries that allow readers to look up definitions of terms in a predictable place. One type of glossary that can be particularly helpful to readers is what Widdowson (1978, page 82) called a “priming glossary,” or a glossary that is shown to readers before a text to help prepare them for problematic words that may appear in the text. ScholarPhi prepends a priming glossary to scientific papers. The glossary includes a list of key terms and symbols, ordered by their appearance in the paper.
The glossary is intended to help readers in two ways. First, it lets them prime themselves on the terms and symbols that will be used in the paper. And second, it provides a reference that can be printed and viewed side-by-side with the paper. One advantage to presenting definitions in a priming gloss as opposed to tooltips is that definitions for all nonce words can be consolidated into one place (M3), letting a reader understand groups of related nonce words by studying all of their definitions together. Furthermore, the gloss provides scent (M4) of the density of nonce words, and the presence of definitions of those words, before they start reading the paper.
5. Usability Study
We performed a formal remote usability study to ascertain the answers to the following questions: Do the features of ScholarPhi aid readers’ ability to understand the use of nonce words when reading complex scientific papers? Do readers elect to use the features when given unstructured reading time? How are the features used to support the reading experience?
In a within-participants design, we compared the full features of ScholarPhi to a simplified version and a standard PDF reader on a series of close reading tasks on a machine learning paper. The quantitative and subjective results were strongly in favor of the affordances supplied by ScholarPhi over a standard PDF reader, with one exception.
5.1. Study design
Participants The criterion for inclusion was having previously read a machine learning paper. A total of 27 participants were recruited through university and company mailing lists. were doctoral students, were master’s students, were undergraduate students, and was a professional researcher. of the participants identified their discipline as machine learning, and were somewhat or very comfortable with reading machine learning papers. Participants were thanked with a $20 (USD) gift certificate. All study sessions were 1 hour long and held remotely over Zoom, a video conferencing platform; participant interactions were logged and screen activity was captured. Participants opened a version of the application in a private browser window, and were asked to share their screen with the experimenters. This led to participants using the interface on a variety of screen sizes and configurations.
Stimulus Paper For this study, all participants read “Linguistically-Informed Self-Attention for Semantic Role Labeling” (LISA) (Strubell et al., 2018). (Several examples in Section 4 are drawn from this paper.) This paper makes for an interesting case because it won a best paper award and yet some notation is used inconsistently and some symbols are never defined explicitly.
Tasks Each 1 hour session ran as follows: (1) Greeting and consent form. (2) Interactive tutorial with all features on a two-page paper (Cohen et al., 2016). (3) Read the abstract of the stimulus paper. (4) Complete a timed practice question with the full interface. (5) Complete three timed test questions using each of the three test interfaces (4 minutes each), each followed with a question about confidence and ease of use. (6) Unstructured reading of the stimulus paper (15-20 minutes). (7) Questionnaire on background and subjective responses.
In the unstructured reading portion participants were encouraged to make use of the tools if they anticipated they would be helpful. The intention of this segment was to observe which aspects of the tool were used when not under time pressure.
Interfaces Three interfaces were compared within-participants:
“Basic” is a basic PDF reader with standard search functionality (specifically, being able to find words using “Control-F” with a toggle button to match case and the ability to highlight all matches).
“Declutter” is a PDF reader with additional declutter functionality.
“ScholarPhi” is a PDF reader with all ScholarPhi features.
Test questions The three multiple-choice test questions were each intended to assess a different aspect of pain points identified by formative studies.
“Results”: “According to Table 1, which model achieves the best recall on WSJ data when GloVE embeddings are used?”
“Dataset”: “Which text corpora is the ConLL-2005 dataset made from? Select all that apply.”
“Symbols”: “What does T (upper case) mean in this paper? Select all senses in which T is used.”
Assessment Measures For each of the test questions, we measured the following quantitative metrics:
“Confidence” is a five-point Likert scale variable indicating the participant’s self-assessment of the following prompt: “I am confident I came up with the right answer.” A score of 5 indicates strong agreement, and a score of 1 indicates strong disagreement.
“Ease” is a five-point Likert scale variable indicating the participant’s self-assessment of the following prompt: “It was easy to find the answer.” A score of 5 indicates strong agreement, and a score of 1 indicates strong disagreement.
“Time” is the number of seconds the participant spent to answer the question. It is measured from when the question first appeared on the participant’s screen, to when the participant clicked the next button or the question timer expired (whichever event occurred first).
“Correct” is a binary variable indicating whether the participant’s response to the question was correct. For questions requiring a response with multiple selections, a response was considered correct if it included all and only the correct selections.
“Area” is the proportion of the full paper viewed. It is computed as the cumulative total pixel area viewed over the total available pixel area in the entire paper. It ranges between values 0 (none of the paper viewed) and 1 (entire paper viewed).
“Distance” is a continuous variable measuring the cumulative (normalized) absolute vertical pixel distance (that is, the number of document lengths) traversed by the participant’s screen. Normalization controls for different pixel heights across participants’ devices. The distance between the top and bottom pixels on each page is scaled such that the entire paper’s total height sums to 1; traversing the length of the paper twice would contribute 2 to the total Distance.
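Three of the measures above (Correct, Area, and Distance) can be sketched as small functions. This is an illustrative reading rather than the study's logging code: it assumes the viewport always spans the full page width, so that viewed area reduces to a union of vertical intervals, and that scroll positions are available as a sequence of pixel offsets. All function and parameter names are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def is_correct(selected: Set[str], answer_key: Set[str]) -> bool:
    """Correct only if the selections are all and only the right ones."""
    return selected == answer_key


def viewed_area(
    viewports: Iterable[Tuple[float, float]], paper_height: float
) -> float:
    """Fraction of the paper covered by the union of (top, bottom) spans."""
    covered = 0.0
    reach = float("-inf")  # bottom edge of the region counted so far
    for top, bottom in sorted(viewports):
        top, bottom = max(top, 0.0), min(bottom, paper_height)
        if bottom <= top:
            continue  # span lies entirely outside the paper
        if bottom > reach:
            covered += bottom - max(top, reach)
            reach = bottom
    return covered / paper_height


def scroll_distance(scroll_tops: Sequence[float], paper_height: float) -> float:
    """Cumulative absolute vertical travel, in document lengths."""
    travelled = sum(abs(b - a) for a, b in zip(scroll_tops, scroll_tops[1:]))
    return travelled / paper_height


area = viewed_area([(0, 100), (50, 150), (300, 400)], paper_height=1000)
distance = scroll_distance([0, 1000, 0], paper_height=1000)
```

Dividing by the paper's pixel height is what makes Distance comparable across devices with different rendered page sizes.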
Unstructured reading task measurements Measurements in the unstructured reading tasks included usage of key features and subjective feedback.
Assignment Using a repeated measures factorial design, we assigned each participant to three of nine possible configurations — interface-question pairs — while ensuring that (i) each participant observed each interface and each question type exactly once and (ii) all nine configurations had the same number of assigned participants.
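One simple scheme that satisfies both constraints is a cyclic (Latin square) rotation of questions across interfaces. The sketch below is an assumption about how such an assignment could be generated, not a description of the procedure actually used; it also leaves out any randomization of participants or presentation order.

```python
from collections import Counter
from typing import List, Tuple

INTERFACES = ("Basic", "Declutter", "ScholarPhi")
QUESTIONS = ("Results", "Dataset", "Symbols")


def assign(participant: int) -> List[Tuple[str, str]]:
    """Pair each interface with a distinct question via a cyclic shift."""
    shift = participant % len(QUESTIONS)
    return [
        (interface, QUESTIONS[(i + shift) % len(QUESTIONS)])
        for i, interface in enumerate(INTERFACES)
    ]


# Tally how often each interface-question pair occurs over 27 participants.
counts = Counter(pair for p in range(27) for pair in assign(p))
```

With 27 participants and three rotations, each of the nine interface-question pairs is assigned to nine participants.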
Analysis For each of the quantitative measurements, we fit a generalized linear mixed-effects model (GLMM) with fixed effects for the interface and question factors (and a fixed-effects interaction term). Details can be found in Appendix A.2.
Reduced Controls Due to Remote Testing Since the study was held remotely, some standard controls could not be employed: the size of the screen, the speed of the user’s computer (the PDF reader appeared to have lag for some participants and not for others), and the distraction in the environment (background noise could be heard for many of the participants). These differences might account for variation in performance and subjective accounts of the experience. Rather than degrading the quality of the data, these factors make the study better represent variation that we anticipate readers using this tool would have in their environments.
5.2. Results: Quantitative
Figure 4 summarizes how the quantitative measures on the test questions vary across the three interfaces. We report results from two-sided -test analyses of pairwise contrasts in Table 1. These results indicate which patterns shown in Figure 4 are statistically significant.
We observed that ScholarPhi outperformed the other interfaces in terms of subjective scores on Ease and Confidence. (Ease was higher for Declutter than for Basic, but Confidence was not.) ScholarPhi also outperformed the other interfaces in terms of the time required to answer the test questions (Time). (Declutter and Basic were not significantly different.) ScholarPhi and Basic performed equally on the number of participants answering test questions correctly (Correct), and Declutter scored higher on Correct than both, but these pairwise differences were not statistically significant. As such, we observed that participants using ScholarPhi were able to answer questions as correctly as they would using the other interfaces, but in less time. Finally, we observed that participants traversed less screen Distance and viewed less Area of the paper under ScholarPhi and Declutter compared to Basic; ScholarPhi outperformed Declutter on Area but did not significantly outperform Declutter on Distance. Overall, these results suggest that even the lighter-weight version of the tool, with the Declutter overlay alone, yields benefits over the standard PDF reader, but the full set of features in ScholarPhi is especially beneficial.
| Measure | ScholarPhi vs. Declutter | p | Declutter vs. Basic | p | ScholarPhi vs. Basic | p |
|---|---|---|---|---|---|---|
| Ease (1–5) | 0.93 | 0.005 | 0.78 | 0.020 | 1.70 | <0.0001 |
| Time (seconds) | -27.6 | 0.015 | -16.8 | 0.218 | -45.4 | 0.0001 |
| Distance (# doc lengths) | -0.24 | 0.572 | -0.66 | 0.023 | -0.90 | 0.001 |
Upon further inspection of the results on Correct, we found that performance on one particular question explains why ScholarPhi performed similarly to Basic (and why Declutter yielded slightly higher results than the other two). Participants performed better on both Results and Dataset using ScholarPhi, but performed very poorly on Symbols with this interface. Recall from the discussion in Section 3.2 (M1) that the LISA paper uses the symbol T inconsistently and also does not define all senses of this symbol. We found that participants almost always answered this question incorrectly using ScholarPhi because the definitions did not cover all of the usages, while participants expected the definitions to show all of the senses of the symbol. This highlights an important potential drawback of a tool like ScholarPhi: it can mislead if it implies incorrect information.
5.3. Subjective impressions
Subjective responses from participants (referred to here as “readers” collectively, and P1–27 individually) were obtained both from oral comments during the study and from open-ended questions in the final questionnaire. Readers’ impressions of ScholarPhi were overwhelmingly positive. Readers were enthusiastic about the support that ScholarPhi provided for the reading task. They described the tool as “cool” (P8), “very cool” (P13), “super cool” (P12), and “amazing” (P4, P16, P19). Eight of the 27 responses to the open-ended questionnaire forms contained exclamation marks conveying participant excitement for the tool. Several readers commented on the polish of the prototype (P7, P24), which reflects the careful refinement of the interface over several cycles of iterative design.
Readers appreciated ScholarPhi for three supporting roles they saw it as playing in reading tasks. First, they valued its ability to preserve what multiple participants called “reading flow” (P16, P27). In the words of one participant, ScholarPhi helped them “focus on the aspects of the paper that interested me, and not waste time on other stuff” like reminding themselves of definitions (P4). The features provided timely reminders (P10, P21, P26), and eliminated the need to traverse “back and forth” within the paper (P11). Second, ScholarPhi helped them “check their understanding” of the meanings of nonce words (P16) and the passages of text they appeared in (P20). Third, readers thought that ScholarPhi could help readers engage with papers that they otherwise would not have had the vocabulary to read easily (P4, P23), in effect “lowering the barrier” to reading papers in fields outside of one’s expertise.
Anticipated usage To determine which of ScholarPhi’s features would be of greatest interest to researchers in the future, and hence which features should be developed further, readers were asked to report how often they expected they would use each feature if it was available in the software they used to read papers. Readers expected they would use several of the features very frequently, including definition tooltips for symbols (16 / 27 “always”, 8 / 27 “often”), definition tooltips for terms (15 / 27 “always”, 9 / 27 “often”), and equation diagrams (17 / 27 “always”, 6 / 27 “often”). The other features seemed to have less universal appeal, in particular declutter for symbols (5 / 27 “always”, 13 / 27 “sometimes”), declutter for terms (2 / 27 “always”, 15 / 27 “often”), and the priming glossary (8 / 27 “always”, 6 / 27 “often”). Even amidst this variation, responses on the whole were positive. While readers could report that they “never” saw themselves using a feature, we did not see a single participant report they would never use one of the features.
5.4. Usability of features
To understand successes and gaps in the design, usage logs were collected during the unstructured reading task. All readers except for one (96%) used at least one of ScholarPhi’s features during the unstructured reading time. Analysis of the aforementioned data led to the following conclusions about the usability of ScholarPhi’s features:
Definition tooltips For most readers, tooltips were ScholarPhi’s most essential feature. As noted above, it was the feature that the most participants anticipated using regularly if available in their reading interfaces. Readers appreciated tooltips for their intended purpose: their support for looking up definitions of nonce words that appeared elsewhere in the paper (P10). An additional use case was to check if a passage the reader was consulting was indeed the definition of a nonce word, so the reader could make sure they were not missing information of interest (P2).
Readers used definition tooltips more than any other feature in ScholarPhi. All but three participants opened at least one tooltip for a symbol, and all but one participant opened at least one tooltip for a term. When readers used tooltips, they used them often. Readers opened tooltips for symbols a median of 10 times (, ), and for terms a median of 5 times (, ).
Declutter In contrast to tooltips, which were unanimously liked, the declutter feature saw disagreement. Some readers valued the feature, and others did not. Those who did not sometimes did not understand what the point of the feature was (P25), or thought the feature provided little value over the definition tooltips (P22). Others felt that the standard “Control-F” search provided a more efficient interface for searching a paper than scrolling through a paper with declutter (P2). One obstacle to using declutter was that, in contrast to standard text search, readers could only start searching with declutter if the nonce word they wanted to search for was within view. With standard text search, a search can be initiated anywhere by typing an arbitrary query into an always-available query widget. It could be frustrating not to be able to do the same with declutter, particularly if the reader wanted to temporarily deactivate declutter so that they could consult some of the hidden text, and then resume declutter for the same nonce word as before (P14).
That said, readers’ behavior indicates that most readers likely expected declutter to be useful for finding answers to questions in a paper: all participants activated declutter at least once in the test task when they used an interface with only the declutter feature enabled. Several readers indicated that they believed declutter could be useful for finding information about nonce words (P6, P11, P15, P23, P26). One reader noted that the feature made the paper look “less cluttered,” despite not having been told that the feature was in fact named “declutter” (P11). Furthermore, declutter could make readers feel “less overwhelmed” by the text in the paper (P27).
Lists of usages Nearly all (20 / 27) readers opened a list of definitions, defining formulae, or usages during the unstructured reading task. 18 readers opened a list of definitions, 3 opened a list of defining formulae, and 10 opened a list of usages. Some participants used the lists heavily—one participant opened the lists of definitions and usages eight times each (P4). Readers used the list of usages to develop an understanding of the purpose of the paper (P9) and to gather additional context to check their understanding of a term (P16).
One reader used the list of usages in a novel way, describing the list as a “guide” that supported non-linear reading (P27). The reader left the list open for minutes at a time. Because the usages pane loads usages for whatever nonce word is currently selected, they could jump from one passage to the next, finding usages of nonce words that drew their interest, jumping to those usages, and then viewing the usages of nonce words in the passage they just jumped to. The reader believed that by supporting this reading pattern, the list allowed them to answer questions they had about the text as they were raised, rather than waiting for them to be resolved in a later passage.
Equation diagrams Equation diagrams were a favorite feature for many readers. More readers expected they would use this feature “always” for future readings than any other feature. Nearly all (21 / 27) readers opened at least one equation diagram during the fifteen-minute unstructured reading session, and most readers opened multiple; the median participant opened three equation diagrams while they read (, ).
The primary use of equation diagrams was to understand the symbols in an equation without needing to attend to the surrounding text (P1, P6, P11, P13, P14, P21, P24). Diagrams were seen as particularly useful when an equation was long (P24) or complex (P11). One of the equations, for instance, comprised four lines of notation with a total of fourteen symbols for which definitions were available, and many others for which definitions weren’t. Readers were regularly observed opening the diagram for this equation and then viewing it for some time. Beyond the primary use of answering questions about symbols, equation diagrams supported new ways of navigating the text. For instance, the diagrams permitted one reader to skim the technical section by opening the diagrams for one equation after another, without feeling the need to carefully consult the text in between (P7).
Priming glossary Among the features of ScholarPhi, priming glossaries were the least used during the reading task. A few readers (6 / 27) were observed consulting the priming glossary for a non-trivial amount of time, declared in our protocol to be 10 or more seconds. Although readers rarely consulted the priming glossary during the study, they saw the glossary as being useful in two scenarios. First, the priming glossary was envisioned as a useful tool for orienting to the terminology used in a paper before reading it (P13, P16). Two readers spent a substantial amount of time (i.e., two minutes (P16) and over five minutes (P1)) carefully studying the glossary at the beginning of the reading task. Second, readers hoped that the glossary might provide additional information about a nonce word that could not be found in a definition tooltip. In fact, it appeared that several readers accessed the glossary as a fallback when the definition tooltip did not contain the information that readers were looking for (P3, P12, P14, P22).
Coordination of features Observations of readers’ behavior offer evidence of the usability of the holistic set of features. The tool features appeared to be discoverable. During the tutorial task, readers often discovered features on their own by tinkering with the interface, like the ability to jump to a definition in the paper from the definition tooltip (P2), or the presence of lists of definitions and usages (P6). Furthermore, during the unstructured reading task, readers sometimes chained together interactions with multiple ScholarPhi features as they sought information about nonce words. For instance, one participant (P6) clicked an equation to reveal a diagram, selected one of the symbols in the diagram, opened the list of definitions for the symbol, and then clicked on a link that took them to one of those definitions. Sequences of interactions like this sometimes lasted only a few seconds from start to end. Several readers chained interactions across multiple ScholarPhi features in a similar way (P6, P8, P13, P19). Readers’ positive experiences with the individual features as well as the tool as a whole indicate the suitability of ScholarPhi’s design for helping readers find what they need to know about nonce words in scientific papers.
6. Discussion and Future Work
6.1. Summary of results
The outcomes of the usability study produced the following answers to the research questions:
Do the features of ScholarPhi aid readers’ ability to understand the use of nonce words when reading complex scientific papers? Yes. When asked to answer questions requiring understanding of nonce words, readers answered questions significantly more quickly with ScholarPhi than with a baseline PDF reader, while viewing significantly less of the paper.
Do readers elect to use the features when given unstructured reading time? Yes. 96% of readers used ScholarPhi’s features at least once during 15 minutes of unstructured reading time. Tooltips were the most frequently used feature: readers opened a median of 10 tooltips for symbols, and 5 for terms. Equation diagrams were opened a median of 3 times. Almost all participants opened a list of definitions, defining formulae, or usages at least once.
How are the features used to support the reading experience? On the whole, readers used the features for the reasons expected: they referred to tooltips to remind themselves of forgotten definitions, activated declutter to find information about nonce words within a less cluttered view of the paper, and opened equation diagrams to view the definitions of many symbols at once. Readers also used the tools to support the reading experience in unconventional ways, for instance using the list of usages as a “guide” to support a non-linear, curiosity-driven reading, and skimming a section by jumping from one equation diagram to the next.
6.2. Limitations
A major limitation of the usability study is its focus on a single paper, where performance was measured for only three tasks. Papers vary widely in clarity and readability. To improve generalizability of the study, the paper was selected to be a widely-read scientific paper exhibiting some of the very problems the system was seeking to address. Furthermore, the three tasks were chosen to require an understanding of different types of nonce words: terms referring to datasets, baselines, and symbols. In the future, we will continue to evaluate ScholarPhi on a variety of research papers, as has been done to date through the iterative design process for the tool. A second limitation, which pertains to the tool’s suitability for supporting unstructured reading, is that readers in the study only used the tool for 15–20 minutes, and may not have had enough time to discover limitations that would preclude them from using the tool in the future. Observations from our pilot studies have suggested that readers continue to find aspects of the tool useful after 20 minutes of reading, but longitudinal studies are necessary to better assess how readers would employ ScholarPhi in day-to-day use.
6.3. Future work
The study of ScholarPhi has revealed three opportunities for future research to advance the potential of intelligent reading interfaces to aid in the authoring and reading of scientific papers.
Connecting readers to definitions beyond the paper
Readers in the formative studies, pilot studies, and usability study all asked for the ability to look for definitions of terms that resided outside of a paper. This means that readers were looking for definitions of terms that were not nonce words, but were rather jargon or domain-specific vocabulary. Readers also asked for the ability to look up information about cited papers within the paper they were currently reading. Substantial further design work is needed to provide just-in-time, relevant definitions like these that connect readers to external information sources, though prototypes of such tools are already being built (e.g., (Abekawa and Aizawa, 2016; Jiang et al., 2018)). A key design challenge, which represents an opportunity for novel research, is how to address the design motivations of providing scent, tailoring definitions, and minimizing distractions in a setting where definitions are sourced from massive corpora composed of source documents of widely varying quality (e.g., other papers, Wikipedia, or the science blogosphere). Our research suggests that researchers would enthusiastically embrace a well-designed tool that solves this design problem.
Co-development of reading interfaces and machine learning models
Are today’s machine learning models up to the task of detecting definitions of nonce words so that users can use ScholarPhi for arbitrary papers? A recent study (Anonymized authors, [n.d.]) indicates that the state-of-the-art algorithms for definition detection currently have a problem of recall when it comes to detecting definitions in scientific papers. This raises the question of whether readers would still want to use ScholarPhi if some definitions were not detected by the system, or if some of the predictions were wrong. Furthermore, it is unclear how best to tune the precision-recall tradeoff of an AI method, since we don’t yet know whether false positives are more detrimental to the reader experience than false negatives. Researchers in human-computer interaction have explored how users interact with an imperfect artificial intelligence (Yin et al., 2019; Kocielnik et al., 2019). Tools like ScholarPhi may benefit from an analogous thread of research which explores how models for augmenting texts with interactive affordances can convey uncertainty. Conventional solutions like showing multiple, alternative predictions (i.e., definitions) may not suit the setting of scientific papers, where showing additional definitions may distract and ultimately lead to disuse of the tool. The success of both human-computer interaction research into augmented reading and applied natural language processing depends on a co-exploration of the underlying algorithms and management of user expectations at the same time.
ScholarPhi for writing scientific papers
Is there a dual of ScholarPhi that could support the task of writing clear scientific papers? Such a tool might better support the goals of ScholarPhi than a post-hoc augmented reading interface by placing a small burden on an author in order to reduce the mental effort expended by the author’s many readers. Features that an author might wish for are the ability to know when they have left a nonce word undefined, when they use the same nonce word to mean two different things (as is often the case for symbols), and to know when they are using two redundant nonce words to refer to the same thing. The same paper processing technologies that can detect definitions and relate two nonce words to each other could suit writing just as well as reading. As we saw in the development of ScholarPhi, the design exploration of augmented writing interfaces likely needs to begin with careful observations of writers to understand how lightweight, non-intrusive features can support the writing task without distracting authors.
Our formative study showed that readers find nonce words in scientific texts confusing, but may choose not to look up what the nonce words mean given the anticipated cost of doing so. The ScholarPhi system was designed to help readers concentrate on the cognitively demanding task of reading scientific papers by providing them efficient access to definitions of nonce words. The iterative design of the system revealed that systems like ScholarPhi need to tailor definitions to the passage where a reader seeks an understanding of a nonce word, provide scent, and avoid distracting readers from their reading. A usability study with 27 researchers showed that, with ScholarPhi, they could answer questions that required an understanding of nonce words in less time, and while viewing less of the paper, than with a standard PDF reader. Readers could see themselves using ScholarPhi’s definition tooltips and equation diagrams “often” or “always” if they were available in their reading interface. These strong empirical results suggest that researchers are eager and ready for tools like ScholarPhi that support the reading task by providing just-in-time, position-sensitive definitions of nonce words when and where they need them.
Acknowledgements. We thank Zachary Kirby, Jocelyn Sun, Luming Chen, Nidhi Kakulawaram, RJ Pimentel, and Benjamin Barantschik for their help in designing, building, and evaluating prototypes of the ScholarPhi system. We also thank Luca Weihs, Brendan Roof, and Alvaro Herrasti for developing a prototype algorithm for localizing colorized LaTeX equations that inspired the algorithm used in the ScholarPhi pipeline. This research receives funding from the Alfred P. Sloan Foundation, the Allen Institute for AI, Office of Naval Research grant N00014-15-1-2774, and the University of Washington Washington Research Foundation/Thomas J. Cable Professorship.
Appendix A
A.1. Implementation of ScholarPhi
This section describes our suite of algorithms for preparing papers to be read in ScholarPhi. These implementations, along with an interactive paper annotation tool for cleaning up the outputs of these algorithms, are available for other tool builders to use in our public repository.
Paper preprocessing ScholarPhi currently supports papers which have been written using the TeX typesetting language. By restricting the domain of papers to those that have TeX, ScholarPhi is able to more precisely identify the locations of symbols and relationships between them. Given the TeX source for a paper, plain text sentences are extracted by removing macros and replacing citations and equations with placeholders. The plain text is split into a sequence of sentences using pysbd (Sadvilkar, 2020), a state-of-the-art sentence boundary detector. These sentences act as inputs to the algorithms for detecting definitions and usages of nonce words.
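The plain-text extraction step described above can be illustrated with a minimal sketch. It assumes a handful of regular expressions rather than the custom TeX processing used by ScholarPhi; the placeholder tokens and function names are hypothetical, and the naive splitter at the end only stands in for pysbd, which is not called here.

```python
import re
from typing import List

# Hypothetical placeholder tokens; the tokens used by ScholarPhi are not stated in the paper.
CITATION_PLACEHOLDER = "CITATION"
EQUATION_PLACEHOLDER = "EQUATION"


def tex_to_plain_text(tex: str) -> str:
    """Roughly convert TeX to plain text (a sketch, not a full TeX parser)."""
    # Replace citation commands such as \cite{key} or \citep[p. 3]{a,b}.
    text = re.sub(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{[^}]*\}", CITATION_PLACEHOLDER, tex)
    # Replace inline equations delimited by single dollar signs.
    text = re.sub(r"\$[^$]+\$", EQUATION_PLACEHOLDER, text)
    # Unwrap simple one-argument macros such as \emph{...}, keeping the argument.
    text = re.sub(r"\\[a-zA-Z]+\*?\{([^{}]*)\}", r"\1", text)
    # Drop any remaining bare macros such as \noindent.
    text = re.sub(r"\\[a-zA-Z]+\*?", "", text)
    # Collapse the whitespace left behind by removed macros.
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive stand-in for a sentence boundary detector: split after ., !, or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


tex = r"We use \emph{attention} \cite{vaswani2017}. The loss is $L = \sum_i l_i$."
sentences = split_sentences(tex_to_plain_text(tex))
```

Replacing citations and equations with single tokens keeps periods and abbreviations inside them from being mistaken for sentence boundaries, which is presumably why placeholders are used before sentence splitting.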
Symbol detection To detect symbols in a paper, ScholarPhi first extracts all equations from the TeX for the paper using a custom TeX lexer. Each equation is parsed using KaTeX (KaTeX, [n.d.]), an open source library for rendering LaTeX equations in the browser. This yields, for each equation, a representation of that equation in MathML (Froumentin, [n.d.]), a flavor of XML where elements correspond to identifiers, operators, numbers, and combinations thereof.
ScholarPhi climbs the MathML tree, building up symbols that are more and more complex, assigning those made at lower levels of the tree as sub-symbols of those made at higher levels of the tree. In this manner, composite symbols are identified.
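The bottom-up composition of symbols can be sketched as a post-order walk over a MathML tree. The sketch below is an illustration under simplifying assumptions, not ScholarPhi's implementation: it assumes namespace-free MathML, treats only `mi` elements as atomic symbols, and composes symbols only for `msub` and `msup` elements. The `Symbol` class and `collect_symbols` function are hypothetical names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Symbol:
    text: str  # flattened text of the symbol, e.g. "h_t"
    children: List["Symbol"] = field(default_factory=list)  # sub-symbols


# Assumed subset of MathML elements that combine two sub-symbols into one symbol.
COMPOSITE_TAGS = {"msub": "_", "msup": "^"}


def collect_symbols(node: ET.Element, found: List[Symbol]) -> Optional[Symbol]:
    """Visit children first, then build the symbol rooted at `node`, if there is one."""
    child_symbols = [collect_symbols(child, found) for child in node]
    if node.tag == "mi":
        symbol = Symbol(text=(node.text or "").strip())
    elif node.tag in COMPOSITE_TAGS and len(child_symbols) == 2 and all(child_symbols):
        base, script = child_symbols
        symbol = Symbol(
            text=base.text + COMPOSITE_TAGS[node.tag] + script.text,
            children=[base, script],
        )
    else:
        return None  # operators, rows, and other containers are not symbols themselves
    found.append(symbol)
    return symbol


mathml = (
    "<math><mrow><msub><mi>h</mi><mi>t</mi></msub><mo>=</mo>"
    "<mi>W</mi><msub><mi>x</mi><mi>t</mi></msub></mrow></math>"
)
symbols: List[Symbol] = []
collect_symbols(ET.fromstring(mathml), symbols)
```

In this example, the symbols `h` and `t` are found first and then attached as sub-symbols of the composite symbol `h_t`, mirroring the climb from lower to higher levels of the tree.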
Nonce word localization in PDFs To make nonce words interactive, ScholarPhi must know the positions of those words in the PDF. It is non-trivial to extract structured representations of mathematical symbols from PDFs based on the information available in PDFs. Hence, ScholarPhi makes use of a technique described by Siegel et al. (2018) to find the bounding boxes of objects of interest in PDFs when TeX source files are available. Specifically, the technique perturbs the colors of the objects by detecting the text span that creates the object, and wrapping the span in coloring commands. Then, the TeX document is compiled into a PDF, and simple computer vision techniques are used to detect the regions of the colorized PDF that differ from the original PDF. These regions form the bounding box for the object.
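The two halves of this technique, colorizing a span of TeX and then locating the pixels that changed, can be sketched as follows. This is a minimal illustration that assumes pages have already been rasterized into grids of RGB pixels; compilation and rasterization are omitted. The function names are hypothetical, and `\textcolor` (from the xcolor package) stands in for the custom coloring macros used by ScholarPhi.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels


def colorize_span(tex: str, start: int, end: int, color: str = "red") -> str:
    """Wrap tex[start:end] in a coloring command (assumes xcolor is loaded)."""
    return tex[:start] + "\\textcolor{" + color + "}{" + tex[start:end] + "}" + tex[end:]


def diff_bounding_box(original: Image, colorized: Image) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom), inclusive, of the differing pixels, or None."""
    changed = [
        (x, y)
        for y, (row_a, row_b) in enumerate(zip(original, colorized))
        for x, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]
    if not changed:
        return None
    xs = [x for x, _ in changed]
    ys = [y for _, y in changed]
    return (min(xs), min(ys), max(xs), max(ys))


WHITE, BLACK, RED = (255, 255, 255), (0, 0, 0), (255, 0, 0)
glyph_pixels = [(2, 1), (3, 1), (2, 2)]  # (x, y) positions of a toy glyph

original_page = [[WHITE for _ in range(6)] for _ in range(4)]
colorized_page = [[WHITE for _ in range(6)] for _ in range(4)]
for x, y in glyph_pixels:
    original_page[y][x] = BLACK   # the glyph as originally rendered
    colorized_page[y][x] = RED    # the same glyph after colorization

box = diff_bounding_box(original_page, colorized_page)
```

A practical version would also need to handle the case where colorization shifts the layout of the page, which this sketch does not attempt.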
To adapt this technique to the detection of symbols, ScholarPhi needs to know which spans of characters in a TeX file correspond to a symbol. Therefore, the KaTeX equation parser (see above) was instrumented to report which character spans of each TeX equation correspond to which MathML elements in the MathML tree produced by the KaTeX parser. Once the character offsets of each symbol in the paper’s TeX are known, the technique by Siegel et al. (2018) can be used to locate the precise bounding box of each symbol in the PDF. The bounding boxes of terms and sentences are detected using the same method. The character offsets of terms within the TeX are extracted by the custom TeX processor, which can take an arbitrary list of term names as input and determine the offsets of all appearances of those terms. The character offsets of sentences are extracted by the sentence boundary detector.
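Determining the offsets of all appearances of a list of terms can be sketched with a whole-word, case-insensitive search. This is an assumption about matching behavior made for illustration; the paper does not specify how the custom TeX processor matches terms, and `find_term_offsets` is a hypothetical name.

```python
import re
from typing import Dict, List, Tuple


def find_term_offsets(tex: str, terms: List[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each term to the (start, end) character offsets of its appearances in `tex`."""
    offsets: Dict[str, List[Tuple[int, int]]] = {}
    for term in terms:
        # Require non-alphanumeric characters on both sides so that "encoder"
        # does not match inside a longer word such as "autoencoders".
        pattern = re.compile(
            r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])",
            re.IGNORECASE,
        )
        offsets[term] = [match.span() for match in pattern.finditer(tex)]
    return offsets


tex = "The encoder feeds the decoder. Each encoder layer uses attention."
term_offsets = find_term_offsets(tex, ["encoder", "attention"])
```

The resulting offsets are end-exclusive, so `tex[start:end]` recovers the matched text, and each span can be wrapped in coloring commands to find its bounding box.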
Term and definition detection For some of the prototypes assessed in Section 3.2, terms and definitions were extracted using an automatic, state-of-the-art definition recognition algorithm (Anonymized authors, [n.d.]). As is the case with most algorithms in Natural Language Processing, the results are not 100% correct. This system is under active development as part of this project and accuracy is anticipated to improve steadily.
Because we wanted to assess the designed interface affordances of ScholarPhi without delving into issues relating to error recovery (which is a separate and relevant topic), for the usability study described in Section 5 we manually corrected and selected the terms and definitions shown to participants. To scaffold our prototyping efforts, a custom PDF annotation tool was developed (which is also included in our suite of open source tools). It supported the tagging of arbitrary text as terms, and the tagging of those terms with arbitrary lists of definitions and usages.
Usage extraction Usages of a nonce word were extracted as all sentences that contained the nonce word. Containment was determined by comparing the character offsets of the sentences and nonce words where they appeared in the TeX. Defining formulae were extracted for symbols by searching for equations in which the symbol appeared on the left-hand side of an equation (i.e., to the left of a definition operator like “=”). Each appearance of the nonce word in a usage was wrapped in HTML tags that allowed the nonce word to be highlighted in lists of usages in the web interface.
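The containment test and the search for defining formulae can be sketched as below. The sketch makes several assumptions for illustration: spans are end-exclusive character offsets, a symbol is matched on the left-hand side by a plain substring check (which would also match `x` inside `x_i`), and the HTML tag and class used for highlighting are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    start: int  # character offset in the TeX, inclusive
    end: int    # character offset in the TeX, exclusive

    def contains(self, other: "Span") -> bool:
        return self.start <= other.start and other.end <= self.end


def extract_usages(sentences: List[Span], appearances: List[Span]) -> List[Span]:
    """Return the sentences that contain at least one appearance of the nonce word."""
    return [s for s in sentences if any(s.contains(a) for a in appearances)]


def is_defining_formula(equation_tex: str, symbol_tex: str, operator: str = "=") -> bool:
    """True if the symbol appears to the left of the first definition operator."""
    left, separator, _ = equation_tex.partition(operator)
    return bool(separator) and symbol_tex in left


def highlight(sentence: str, nonce_word: str) -> str:
    """Wrap each appearance of the nonce word in a (hypothetical) highlight tag."""
    return sentence.replace(nonce_word, '<span class="highlight">' + nonce_word + "</span>")


sentence_spans = [Span(0, 20), Span(21, 50), Span(51, 80)]
appearance_spans = [Span(5, 8), Span(60, 63)]
usages = extract_usages(sentence_spans, appearance_spans)
```

Here the first and third sentences are returned as usages because each fully encloses one appearance of the nonce word.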
User interface The user interface builds on top of the Mozilla Foundation’s open source pdf.js web viewer (Mozilla and individual contributors, [n.d.]). ScholarPhi’s interactive features, including definition tooltips, lists of usages, decluttering, symbol selection, equation diagrams, and priming glossaries, are all implemented as an overlay atop the pdf.js PDF reader. Visual styling is accomplished with custom styles that use Material UI components (Material-UI, [n.d.]) as a starting point. The features are written in 10.5k lines of React code, which complements the 10.2k lines of Python code and 200 lines of custom TeX coloring macros that are used to process the papers before they reach the user interface.
Symbols and formulae are rendered throughout the interface. Rather than show symbols using bitmaps extracted from the PDFs, TeX for equations and symbols is rendered within the browser using KaTeX. This has the advantage of rendering symbols and equations at a high resolution. In addition, definitions and usages that contain equations can be rendered in views like tooltips and lists of usages where their text must be able to wrap.
To display equation diagrams, definitions for symbols and sub-symbols are overlaid on top of the page. Labels are placed on the top and bottom boundaries of the equation with a fixed margin between the equation and labels. Labels are spaced horizontally using a constraint-based layout algorithm implemented in Labella.js (Labella.js, [n.d.]). They are split evenly between the top and bottom of the equation, with label position determined by which side of the equation the symbol is closest to.
Algorithms for both straight (i.e., diagonal, single-segment) and orthogonal (i.e., two-segment, horizontal-then-vertical) leader lines are implemented in the ScholarPhi code.
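The label placement and the two leader line styles can be sketched geometrically. The sketch below is one possible reading of the description, not the ScholarPhi code: it assumes screen coordinates with y growing downward, assigns the half of the symbols nearest the top boundary to the top, and leaves horizontal spacing (handled by Labella.js in ScholarPhi) out entirely. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def assign_sides(symbol_ys: List[float], eq_top: float, eq_bottom: float) -> List[str]:
    """Split labels evenly: the symbols relatively closest to the top get top labels."""
    # Negative keys mean the symbol is closer to the top boundary than to the bottom.
    order = sorted(
        range(len(symbol_ys)),
        key=lambda i: abs(symbol_ys[i] - eq_top) - abs(symbol_ys[i] - eq_bottom),
    )
    sides = ["bottom"] * len(symbol_ys)
    for i in order[: (len(symbol_ys) + 1) // 2]:
        sides[i] = "top"
    return sides


def straight_leader(label: Point, symbol: Point) -> List[Point]:
    """Single diagonal segment from the label to the symbol."""
    return [label, symbol]


def orthogonal_leader(label: Point, symbol: Point) -> List[Point]:
    """Two segments: horizontal from the label, then vertical to the symbol."""
    corner = (symbol[0], label[1])
    if corner == label:  # label sits directly above or below the symbol
        return [label, symbol]
    return [label, corner, symbol]


sides = assign_sides([12.0, 38.0, 15.0, 30.0], eq_top=10.0, eq_bottom=40.0)
path = orthogonal_leader((10.0, 0.0), (40.0, 25.0))
```

Each returned path is a list of points that could be drawn as a polyline, with the straight style producing two points and the orthogonal style producing up to three.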
Technical limitations. Most aspects of ScholarPhi are currently automated, with a few limitations. As it stands, much of the document processing is automated, such as symbol detection, sentence detection, and nonce word localization in PDFs. (Some minor adjustments were made to correct errors for the usability study.) As mentioned, the current implementation is applied only to documents with TeX source, but we intend for future versions of ScholarPhi to be able to deliver full functionality on arbitrary PDFs, perhaps by making use of state-of-the-art tools for symbol extraction from scholarly documents (e.g., SymbolScraper (Davila et al., 2019)).
Definition detection is the one stage that requires further advances to the state-of-the-art to achieve full functionality. In future work, we intend to investigate error recovery mechanisms and user-supplied corrections. We also see work like ScholarPhi as informing the direction of further advances for the state of the art in automated definition recognition.
A.2. Details of Statistical Analysis
Modeling mixed-effects in repeated measures studies
For the analysis of results in Section 5, we use the generalized linear mixed-effects model (GLMM). GLMMs are often used to analyze repeated measures studies, in which the same subject contributes multiple (potentially correlated) measurements (Lindstrom and Bates, 1990). They have been used for such studies in medicine (Cnaan et al., 1997), behavioral studies (Cudeck, 1996) and even usability studies of semantic layouts (Hearst et al., 2020).
F-tests for significant effect of interface
For each of the quantitative measurements, we fit a GLMM with fixed effects for the interface and question factors (and a fixed-effects interaction term), using the lme4 package in R (Bates et al., 2015). The link function of the GLMM depends on the measurement.
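One plausible way to write a model of this kind is sketched below. The notation is an assumption made for illustration rather than the notation of the analysis: $y_{pq}$ denotes a measurement for participant $p$ on question $q$, $g$ the link function, $i(p,q)$ the interface that participant $p$ used for question $q$, and the $\beta$ terms the fixed effects. The by-participant random intercept $u_p$ is likewise assumed; it is a common choice for repeated measures, but the random-effects structure is not specified in the surrounding text.

```latex
g\big(\mathbb{E}[\, y_{pq} \mid u_p \,]\big)
  = \beta_0
  + \beta^{\text{interface}}_{i(p,q)}
  + \beta^{\text{question}}_{q}
  + \beta^{\text{interface} \times \text{question}}_{i(p,q),\, q}
  + u_p,
\qquad
u_p \sim \mathcal{N}(0, \sigma_u^2)
```

With the identity link $g(\mu) = \mu$ this reduces to a linear mixed-effects model, and with the logit link $g(\mu) = \log\frac{\mu}{1 - \mu}$ it models the log-odds of a binary outcome.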
t-tests for pairwise contrasts between interfaces
We conduct a post-hoc analysis of pairwise contrasts to quantify the differences in the mean effect of interface on each measurement under the GLMM (controlling for question). Two-sided t-tests for pairwise contrasts are computed using the emmeans R package.
Ordinal regression for Likert-scale variables
As Ease and Confidence are measured on a 5-point Likert scale, the means estimated by a linear GLMM may be ill-suited for analysis, especially if Ease and Confidence are sufficiently non-Normally distributed. We additionally perform likelihood ratio tests after fitting analogous cumulative link mixed-effects models (CLMM) provided in the ordinal R package (Christensen, 2018). Likelihood ratio tests, which are similar to F-tests but more conservative, yielded similar p-values for Ease and Confidence, and resulted in the same conclusions as those when using the GLMM. Since pairwise contrasts are not available through emmeans (or other libraries) for CLMMs, we’ve opted to use the GLMM model for Ease and Confidence to enable subsequent analysis for Table 1.
- conference: CHI ’21: ACM SIGCHI Conference on Human Factors in Computing Systems; May 8–13, 2021; Yokohama, Japan
- ccs: Human-centered computing Interactive systems and tools
- See the project repository at https://github.com/allenai/scholarphi.
- See also the video figure at https://bit.ly/scholarphi-video-walkthrough.
- We use the identity link for Ease, Confidence, Time, Distance, and Area. We use the logit link for Correct, which is treated as a Bernoulli variable.
- The F-test is not applicable when the measurement is Bernoulli, so we performed the similar, but slightly less conservative, likelihood ratio test for Correct (Kuznetsova et al., 2017).
- Because the GLMM for Correct was fit using a logit link, direct testing of pairwise contrasts is not possible. We used the transform option in emmeans to perform the contrast tests on the log-odds scale, which are linear under the GLMM, before applying the inverse-link transformation to return to the probability scale. This yields the estimated (absolute) differences in Correct reported in Table 1.
- Takeshi Abekawa and Akiko Aizawa. 2016. SideNoter: Scholarly Paper Browsing System based on PDF Restructuring and Text Annotation. In Proceedings of the International Conference on Computational Linguistics. 136–140.
- Annette Adler, Anuj Gujar, Beverly L. Harrison, Kenton O’Hara, and Abigail Sellen. 1998. A diary study of work-related reading: design implications for digital reading devices. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 241–248.
- Lara Alcock. 2009. e-Proofs: Student Experience of Online Resources to Aid Understanding of Mathematical Proofs. In Proceedings of the Conference on Research in Undergraduate Mathematics Education.
- Anonymized authors. Under peer review.
- Sriram Karthik Badam, Zhicheng Liu, and Niklas Elmqvist. 2019. Elastic Documents: Coupling Text and Tables through Contextual Visualizations for Enhanced Document Reading. IEEE Transactions on Visualization and Computer Graphics 25, 1 (January 2019), 661–671.
- Lukas Barth, Andreas Gemsa, Benjamin Niedermann, and Martin Nöllenburg. 2019. On the readability of leaders in boundary labeling. Information Visualization 18, 1 (2019), 110–132.
- Douglas Bates, Martin Mächler, Benjamin M. Bolker, and Steven C. Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1 (October 2015), 1–48.
- Charles Bazerman. 1985. Physicists Reading Physics: Schema-Laden Purposes and Purpose-Laden Schema. Written Communication 2, 1 (January 1985), 3–23.
- Eric A. Bier, Edward W. Ishak, and Ed Chi. 2006. Entity Quick Click: Rapid Text Copying Based on Automatic Entity Extraction. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 562–567.
- Vannevar Bush. 1945. As we may think. The Atlantic 176, 1 (July 1945), 101–108.
- Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S. Weld. 2020. TLDR: Extreme Summarization of Scientific Documents. (2020). arXiv:2004.15011 [cs.CL]
- Bay-Wei Chang, Jock D. Mackinlay, Polle T. Zellweger, and Takeo Igarashi. 1998. A Negotiation Architecture for Fluid Documents. In Proceedings of the Symposium on User Interface Software and Technology. ACM, 123–132.
- Joseph Chee Chang, Nathan Hahn, and Aniket Kittur. 2016. Supporting Mobile Sensemaking Through Intentionally Uncertain Highlighting. In Proceedings of the Symposium on User Interface Software and Technology. ACM, 61–68.
- Ying-Hsueh Cheng and Robert L. Good. 2009. L1 glosses: Effects on EFL learners’ reading comprehension and vocabulary retention. Reading in a Foreign Language 2009, 2 (October 2009), 119–142.
- Parmit K. Chilana, Amy J. Ko, and Jacob O. Wobbrock. 2012. LemonAid: Selection-Based Crowdsourced Contextual Help for Web Applications. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 1549–1558.
- Rune Haubo B Christensen. 2018. Cumulative Link Models for Ordinal Regression with the R Package ordinal. http://cran.uni-muenster.de/web/packages/ordinal/vignettes/clm_article.pdf.
- Avital Cnaan, Nan M. Laird, and Peter Slasor. 1997. Using the general linear mixed model to analyse unbalanced repeated measures and longitudinal data. Statistics in Medicine 16, 20 (1997), 2349–2380.
- Joseph Paul Cohen, Henry Z. Lo, Tingting Lu, and Wei Ding. 2016. Crater Detection via Convolutional Neural Networks. (2016). arXiv:1601.00978 [cs.CV]
- Jeff Conklin. 1987. Hypertext: An Introduction and Survey. Computer 20, 9 (September 1987), 17–41.
- Robert Cudeck. 1996. Mixed-effects Models in the Study of Individual Differences with Repeated Measures Data. Multivariate Behavioral Research 31, 3 (1996), 371–403.
- Kenny Davila, Ritvik Joshi, Srirangaraj Setlur, Venu Govindaraju, and Richard Zanibbi. 2019. Tangent-V: Math Formula Image Search Using Line-of-Sight Graphs. In Proceedings of the European Conference on Information Retrieval. Springer, 681–695.
- Isabelle De Ridder. 2002. Visible or invisible links? In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 624–625.
- Diana DeStefano and Jo-Anne LeFevre. 2007. Cognitive load in hypertext reading: A review. Computers in Human Behavior 23, 3 (May 2007), 1616–1641.
- Pierre Dragicevic, Stéphane Huot, and Fanny Chevalier. 2011. Gliimpse: Animating from Markup Code to Rendered Documents and Vice Versa. In Proceedings of the Symposium on User Interface Software and Technology. ACM, 257–262.
- eLife. 2013. Seeing through the eLife Lens: A new way to view research. https://elifesciences.org/inside-elife/0414db99/seeing-through-the-elife-lens-a-new-way-to-view-research.
- Fermat’s Library. https://fermatslibrary.com/. Last accessed September 16, 2020.
- Max Froumentin. Mathematical Markup Language (MathML). https://www.w3.org/Math/whatIsMathML.html. Last accessed September 16, 2020.
- Jamey Graham. 1999. The Reader’s Helper: A Personalized Document Reading Environment. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 481–488.
- Tovi Grossman, Fanny Chevalier, and Rubaiat Habib Kazi. 2015. Your Paper is Dead! Bringing Life to Research Articles with Animated Figures. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 461–475.
- Andrew Head, Codanda Appachu, Marti A. Hearst, and Björn Hartmann. 2015. Tutorons: Generating Context-Relevant, On-Demand Explanations and Demonstrations of Online Code. In Proceedings of the Symposium on Visual Languages and Human-Centric Computing. IEEE, 3–12.
- Marti A. Hearst, Emily Pedersen, Lekha Patil, Elsie Lee, Paul Laskowski, and Steven Franconeri. 2020. An Evaluation of Semantically Grouped Word Cloud Designs. IEEE Transactions on Visualization and Computer Graphics 26, 9 (September 2020), 2748–2761.
- William C. Hill, James D. Hollan, Dave Wroblewski, and Tim McCandless. 1992. Edit wear and read wear. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 3–9.
- Terje Hillesund. 2010. Digital reading spaces: How expert readers handle books, the Web and electronic paper. First Monday 15, 4 (April 2010).
- Ken Hinckley, Xiaojun Bi, Michel Pahud, and Bill Buxton. 2012. Informal Information Gathering Techniques for Active Reading. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 1893–1896.
- Sture Holm. 1979. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6, 2 (1979), 65–70.
- Jessica Hullman, Yea-Seul Kim, Francis Nguyen, Lauren Speers, and Maneesh Agrawala. 2018. Improving Comprehension of Measurements Using Concrete Re-Expression Strategies. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 34.
- Matthew Inglis and Lara Alcock. 2012. Expert and Novice Approaches to Reading Mathematical Proofs. Journal for Research in Mathematics Education 43, 4 (July 2012), 358–390.
- Nan Jiang and Huseyin Dogan. 2014. CrowdHiLite: A Peer Review Service to Support Serious Reading on the Screen. In Proceedings of the International BCS Human Computer Interaction Conference. 323–328.
- Zhuoren Jiang, Liangcai Gao, Ke Yuan, Zheng Gao, Zhi Tang, and Xiaozhong Liu. 2018. Mathematics Content Understanding for Cyberlearning via Formula Evolution Map. In Proceedings of the International Conference on Information and Knowledge Management. ACM, 37–46.
- KaTeX. https://katex.org. Last accessed September 16, 2020.
- Caitlin Kelleher and Randy Pausch. 2005. Stencils-Based Tutorials: Design and Evaluation. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 541–550.
- Dae Hyun Kim, Enamul Hoque, Juho Kim, and Maneesh Agrawala. 2018. Facilitating Document Reading by Linking Text and Tables. In Proceedings of the Symposium on User Interface Software and Technology. ACM, 423–434.
- Yea-Seul Kim, Jessica Hullman, Matthew Burgess, and Eytan Adar. 2016. SimpleScience: Lexical Simplification of Scientific Terminology. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1066–1071.
- Amy J. Ko and Brad A. Myers. 2009. Finding Causes of Program Output with the Java Whyline. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 1569–1578.
- Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 411.
- Andrea Kohlhase, Michael Kohlhase, and Taweechai Ouypornkochagorn. 2018. Discourse Phenomena in Mathematical Documents. In Proceedings of the Conference on Intelligent Computer Mathematics. 147–163.
- Michael Kohlhase, Joseph Corneli, Catalin David, Deyan Ginev, Constantin Jucovschi, Andrea Kohlhase, Christoph Lange, Bogdan Matican, Stefan Mirea, and Vyacheslav Zholudev. 2011. The Planetary System: Web 3.0 & Active Documents for STEM. Procedia Computer Science 4 (2011), 598–607.
- Nicholas Kong, Marti A. Hearst, and Maneesh Agrawala. 2014. Extracting References Between Text and Charts via Crowdsourcing. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 31–40.
- Alexandra Kuznetsova, Peter Brockhoff, and Rune H. B. Christensen. 2017. lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software 82, 13 (December 2017), 1–26.
- Labella.js. https://twitter.github.io/labella.js/. Last accessed September 16, 2020.
- Elina Late, Carol Tenopir, Sanna Talja, and Lisa Christian. 2019. Reading practices in scholarly work: from articles and books to blogs. Journal of Documentation 75, 3 (2019), 478–499.
- Mary J. Lindstrom and Douglas M. Bates. 1990. Nonlinear Mixed Effects Models for Repeated Measures Data. Biometrics 46, 3 (September 1990), 673–687.
- Xiaozhong Liu, Zhuoren Jiang, and Liangcai Gao. 2015. Scientific Information Understanding via Open Educational Resources (OER). In Proceedings of the International Conference on Research and Development in Information Retrieval. ACM, 645–654.
- Look Up Words, People, and Places While You Read. https://www.amazon.com/b?ie=UTF8&node=17717476011. Last accessed September 16, 2020.
- Mircea F. Lungu, Luc van den Brand, Dan Chirtoaca, and Martin Avagyan. 2018. As We May Study: Towards the Web as a Personalized Language Textbook. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 338.
- Iain J Marshall, Joël Kuiper, and Byron C Wallace. 2016. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association 23, 1 (January 2016), 193–201.
- Damien Masson, Sylvain Malacria, Edward Lank, and Géry Casiez. 2020. Chameleon: Bringing Interactivity to Static Digital Documents. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 432.
- Material-UI. https://material-ui.com/. Last accessed September 16, 2020.
- Elisa Mattiello. 2017. Analogy in Word-Formation: A Study of English Neologisms and Occasionalisms. De Gruyter Mouton.
- Melissa McCartney, Chazman Childers, Rachael R. Baiduc, and Kitch Barnicle. 2018. Annotated Primary Literature: A Professional Development Opportunity in Science Communication for Graduate Students and Postdocs. Journal of Microbiology & Biology Education 19, 1 (March 2018), 1–13.
- MediaWiki contributors. Page Previews. https://www.mediawiki.org/wiki/Page_Previews. Last accessed September 16, 2020.
- Mozilla and individual contributors. pdf.js. https://mozilla.github.io/pdf.js/. Last accessed September 16, 2020.
- David Nicholas, Peter Williams, Ian Rowlands, and Hamid R. Jamali. 2010. Researchers’ e-journal use and information seeking behaviour. Journal of Information Science 36, 4 (2010), 494–516.
- Don Norman. 2013. The Design of Everyday Things. Basic Books. See pages 288–291, section “The Future of Books”.
- Kenton O’Hara. 1996. Towards a Typology of Reading Goals. Technical Report. Rank Xerox Research Centre.
- Robert Pagel and Moritz Schubotz. 2014. Mathematical Language Processing Project. In Proceedings of the Conference on Intelligent Computer Mathematics.
- PeerLibrary. https://peerlibrary.org/. Last accessed September 16, 2020.
- Peter Pirolli and Stuart K. Card. 1999. Information Foraging. Psychological Review 106, 4 (1999), 643–675.
- Antoine Ponsard, Francisco Escalona, and Tamara Munzner. 2016. PaperQuest: A Visualization Tool to Support Literature Review. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 2264–2271.
- Brett Powley, Robert Dale, and Ilya Anisimoff. 2009. Enriching a Document Collection by Integrating Information Extraction and PDF Annotation. In Document Recognition and Retrieval.
- PubMed. https://www.ncbi.nlm.nih.gov/pmc/. Last accessed September 16, 2020.
- Xin Qian, Matt J. Erhart, Aniket Kittur, Wayne G. Lutters, and Joel Chan. 2019. Beyond iTunes for Papers: Redefining the Unit of Interaction in Literature Review Tools. In Proceedings of the Conference on Computer-Supported Cooperative Work and Social Computing. ACM, 341–346.
- Warren B. Roby. 1999. “What’s in a gloss?”: A commentary on Lara L. Lomicka’s “To gloss or not to gloss”: An investigation of reading comprehension online. Language Learning & Technology 2, 2 (January 1999), 94–101.
- Susanne Rott. 2007. The Effect of Frequency of Input-Enhancements on Word Learning and Text Comprehension. Language Learning 57, 2 (June 2007), 165–199.
- Somali Roy. 2014. Evaluating novel pedagogy in higher education: A case study of e-Proofs. Ph.D. Dissertation. Loughborough University.
- Somali Roy, Matthew Inglis, and Lara Alcock. 2017. Multimedia resources designed to support learning from written proofs: An eye-movement study. Educational Studies in Mathematics 96, 2 (2017), 249–266.
- Nipun Sadvilkar. 2020. pySBD: Python Sentence Boundary Disambiguation (SBD). https://github.com/nipunsadvilkar/pySBD. Last accessed July 27, 2020.
- Bill N. Schilit, Gene Golovchinsky, and Morgan N. Price. 1998. Beyond Paper: Supporting Active Reading with Free Form Digital Ink Annotations. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 249–256.
- Scholarcy. https://www.scholarcy.com/scholarcy-features/. Last accessed July 24, 2020.
- ScienceDirect. https://www.sciencedirect.com. Last accessed September 16, 2020.
- SciHive. https://start.scihive.org/. Last accessed September 16, 2020.
- Mary D. Shepherd and Carla C. Van De Sande. 2014. Reading mathematics for understanding—From novice to expert. The Journal of Mathematical Behavior 35 (September 2014), 74–86.
- Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar. 2018. Extracting Scientific Figures with Distantly Supervised Neural Networks. In Proceedings of the Joint Conference on Digital Libraries. ACM, 223–232.
- Springer. https://link.springer.com. Last accessed September 16, 2020.
- Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. (2018). arXiv:1804.08199 [cs.CL]
- Craig Tashman and W. Keith Edwards. 2011a. Active Reading and Its Discontents: The Situations, Problems and Ideas of Readers. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 2927–2936.
- Craig S. Tashman and W. Keith Edwards. 2011b. LiquidText: A Flexible, Multitouch Environment to Support Active Reading. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 3285–3294.
- Alan Taylor. 2006. The Effects of CALL versus Traditional L1 Glosses on L2 Reading Comprehension. CALICO Journal 23, 2 (2006), 309–318.
- Carol Tenopir, Donald W. King, Sheri Edwards, and Lei Wu. 2009. Electronic journals and changes in scholarly article seeking and reading patterns. Aslib Proceedings 61, 1 (2009), 5–32.
- Carol Tenopir, Elina Late, Sanna Talja, and Lisa Christian. 2019. Changes in Scholarly Reading in Finland Over a Decade: Influences of E-Journals and Social Media. Libri 69, 3 (2019), 169–187.
- VSCode. https://code.visualstudio.com/. Last accessed September 16, 2020.
- Keith Weber. 2008. How Mathematicians Determine if an Argument Is a Valid Proof. Journal for Research in Mathematics Education 39, 4 (July 2008), 431–459.
- H. G. Widdowson. 1978. Teaching Language as Communication. Oxford University Press.
- Akifumi Yanagisawa, Stuart Webb, and Takumi Uchihara. 2020. How do different forms of glossing contribute to L2 vocabulary learning from reading? A meta-regression analysis. Studies in Second Language Acquisition 42, 2 (May 2020), 411–438.
- Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Understanding the Effect of Accuracy on Trust in Machine Learning Models. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 279.
- Dongwook Yoon, Nicholas Chen, François Guimbretière, and Abigail Sellen. 2014. RichReview: Blending Ink, Speech, and Gesture to Support Collaborative Document Review. In Proceedings of the Symposium on User Interface Software and Technology. 481–490.
- Robert Zeleznik, Andrew Bragdon, Ferdi Adeputra, and Hsu-Sheng Ko. 2010. Hands-On Math: A page-based multi-touch and pen desktop for technical work and problem solving. In Proceedings of the Symposium on User Interface Software and Technology. ACM, 17–26.
- Polle T. Zellweger, Bay-Wei Chang, and Jock D. Mackinlay. 1998. Fluid Links for Informed and Incremental Link Transitions. In Proceedings of the Conference on Hypertext and Hypermedia. ACM, 50–57.
- Xiaolong Zhang, Yan Qu, C. Lee Giles, and Piyou Song. 2008. CiteSense: Supporting Sensemaking of Research Literature. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 677–680.
- Tianchang Zhao and Kyusong Lee. 2020. Talk to Papers: Bringing Neural Question Answering to Academic Search. In Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 30–36.