TopicSifter: Interactive Search Space Reduction
Through Targeted Topic Modeling
Topic modeling is commonly used to analyze and understand large document collections. However, in practice, users want to focus on specific aspects or “targets” rather than the entire corpus. For example, given a large collection of documents, users may want only a smaller subset which more closely aligns with their interests, tasks, and domains. In particular, our paper focuses on large-scale document retrieval with high recall, where any missed relevant document can be critical. A simple keyword matching search is generally neither effective nor efficient because 1) it is difficult to find a list of keyword queries that can cover the documents of interest before exploring the dataset, 2) some documents may not contain the exact keywords of interest but may still be highly relevant, and 3) some words have multiple meanings, which would result in irrelevant documents being included in the retrieved subset. In this paper, we present TopicSifter, a visual analytics system for interactive search space reduction. Our system utilizes targeted topic modeling based on nonnegative matrix factorization and allows users to give relevance feedback in order to refine their target and guide the topic modeling to the most relevant results.
CCS Concepts: Human-centered computing → Visualization → Visualization application domains → Visual analytics; Information systems → Information retrieval → Users and interactive retrieval → Search interfaces
As the world becomes increasingly digital and huge amounts of text data are generated every minute, it becomes more challenging to discover useful information from them for applications such as situational awareness, patient phenotype discovery, event detection, or detecting the onset of violence within a diverse population. More often than not, topics of interest are only implicitly covered in vast amounts of text data, and the relevant data items are sparse and not immediately obvious. This scenario is especially prevalent in large-scale data analytics, where the data are obtained from passive sources and not all data items are relevant to the questions at hand. In these cases, users want to focus on a subset of documents about specific aspects or “targets”, rather than analyzing entire document collections. For example, a journalist may want to analyze social media data that are related to a specific event. Similarly, a marketing expert may want reviews that are relevant to certain products or brands only. Both examples require the search space of the entire document collection to be reduced to the relevant documents.
Discovering and extracting data items of relevance from a large collection of documents is a challenging and important step in text analytics. In particular, we are interested in the high-recall retrieval problem, where missing any relevant document can be critical. For instance, a legal analyst searching for relevant cases in a large legal document collection may want to collect as many documents as possible, even if some of them are only slightly relevant to her targets. Another example is a graduate student who is preparing a literature review and does not want to miss any related work. This is different from the traditional information retrieval problem of finding a list of results that are most relevant to a query, e.g., a student searching for the top 5 papers to learn about an unfamiliar research field. Our focus is on not missing relevant results in addition to high precision. To solve this, our goal is to retrieve documents that are relevant to targets from large-scale document collections, which we will refer to as search space reduction throughout this paper.
Traditional static keyword search is not suitable for our search space reduction setting. First, it is often difficult to know or express the target aspect in advance without exploring the dataset. Next, even when the users are familiar with their target concept, it is hard to cover all relevant keywords, which results in false negatives. Lastly, a keyword may have multiple unrelated meanings, and when keywords are matched out of context, static keyword matching can produce false positives. More advanced approaches such as query expansion and relevance feedback have been introduced in information retrieval. These approaches expand query keywords and use feedback on documents to update the query. However, since they are designed for high-precision problems of retrieving a limited number of the most relevant data items, they may not cover all relevant data items.
To this end, we take a human-in-the-loop approach and advocate interactive and exploratory retrieval. In our framework, users explore retrieved documents, learn from them, and interactively build targets, which are then used to sift through documents. Instead of having users rate a number of retrieved documents generated by the system, our method allows the users to proactively modify target keywords and give relevance feedback. In addition, we adopt targeted topic modeling to support this process. Targeted topic modeling techniques find relevant topics and disregard irrelevant aspects of document collections. Utilizing results from targeted topic modeling, our approach allows users to discover relevant subtopics and refine the targets using topic-level relevance feedback.
In this paper, we propose a novel framework for interactive search space reduction along with an effective visual analytics system called TopicSifter. TopicSifter tightly integrates the underlying computational methods and interactive visualization to support topic model exploration and targeted topic modeling.
The primary contributions of this work include:
A novel iterative and interactive technique for search space reduction through interactive target building, sifting, and targeted topic modeling.
A visual analytics prototype, TopicSifter, that supports tight integration between the interactive visualization and the underlying algorithms.
Experiments and use cases that illustrate the effectiveness of TopicSifter.
1 Related Work
In this section, we discuss prior works on information retrieval and topic modeling in the context of search space reduction.
1.1 Visualizing Search Results/Space
Various information visualization techniques have been applied to improve user interfaces for search. Some systems augment search result lists with additional small visualizations. For example, TileBars, INSYDER, and HotMap visualize query-document relationships as icons or glyphs alongside search results. Another approach is to visualize search results in a spatial layout where proximity represents similarity. Systems such as InfoSky and IN-SPIRE are examples. FacetAtlas overlays additional heatmaps to visualize density. ProjSnippet visualizes text snippets in a 2-D layout. Many others cluster the search results and offer faceted navigation. FacetMap and ResultMap utilize treemap-style visualizations to represent facets. These systems may guide users well in exploring search results, but they are mostly based on static search queries. Our system goes beyond search results exploration and offers interactive target (query) building.
1.2 Query Expansion and Relevance Feedback
Information retrieval is finding (unstructured) documents that satisfy an information need from large collections. However, users of information retrieval systems may not have a clear idea of what to search for, may not know how to construct an optimal query, or may not understand what kind of information is available. To this end, various interactive methods to assist the retrieval process have been proposed. Interactive query expansion allows the users to choose additional query terms from a suggested list of keywords. Instead of lists, Fowler et al. and Hoeber et al. display keyword suggestions as graphs. Sparkler visualizes multiple query results so that users can compare them and identify the best query among the expanded queries. In our system, we suggest additional keywords for queries in two categories: good-to-have keywords and bad-to-have keywords. Another interactive approach is relevance feedback, in which users are asked to mark documents as relevant to steer the system to modify the original query. For instance, VisIRR allows users to rate retrieved documents on a 5-star scale. IntentRadar [39, 38] models intents behind search queries and lets users give relevance feedback on the intents to interactively update them. We adopt a similar approach to give relevance feedback on documents as well as on groups of documents (topics). These existing systems are designed for the traditional information retrieval setting of obtaining the most relevant data items with high precision, and thus are not well suited for our search space reduction setting, which requires high recall. Closer to our work is ReQ-ReC, which combines iterative query expansion and iterative classifier refinement to solve the high-recall retrieval problem. A major difference is that the ReQ-ReC system requires users to label given documents, while our system allows the users to explore the documents and their topics and give relevance feedback if needed.
1.3 Aspect-Specific Topic Summarization
Although topic summarization has been studied for a long time, discovering a topic summary of a specific aspect (or target) is a relatively new research problem. TTM is the first work to propose the term ‘targeted topic modeling’. This work proposes a probabilistic model that is a variation of latent Dirichlet allocation (LDA). Given a static keyword list defining a particular aspect, the model identifies topic keywords related to this aspect. Wang et al. identify a list of target words from review data and disentangle aspect words and opinion words from the list. APSUM assigns aspects to each word in a generative process. Since the aforementioned models generate topic keywords based on a static keyword list, a dynamic model is desired. An automatic method to generate keywords dynamically has been proposed. This method focuses on the online environment of Twitter and automatically generates keywords based on a time-evolving word graph.
1.4 Interactive Topic Modeling
Interactive topic models allow users to steer the topics to improve the topic modeling results. Various topic steering interactions such as adding, editing, deleting, splitting, and merging topics have been introduced [33, 27, 6, 8, 24, 17, 41, 21, 22]. These interactions can be applied to refine relevant topics and remove irrelevant topics to identify targeted topics when most of the data items are relevant and only a small portion is irrelevant. However, in our large-scale search space reduction setting, a more tailored approach is needed. In this paper, we propose interactive targeted topic modeling to steer the topics to discover the target-relevant topics and documents.
2 Interactive Search Space Reduction
In many practical cases in large-scale text analyses, users have specific aspects they want to focus on, which we will refer to as targets. Although there are many tools available with powerful natural language processing and text mining features, they tend to lack the ability to concentrate on the targets. We define this problem of retrieving a subset of documents that are relevant to given targets from large-scale datasets with high recall as search space reduction. Our solution is to iteratively retrieve the relevant documents utilizing user feedback. Over multiple iterations, users inspect a topical summary of previously retrieved documents and give feedback, and our system updates targets to better reflect their mental model and retrieve relevant documents through sifting.
In this section, we first formulate the problem of interactive search space reduction, then describe our iterative workflow and algorithm.
Table 1: Notation used in this paper (descriptions only).
|Given document collection|
|Given keyword dictionary|
|Word-document matrix of the given document collection|
|Current iteration number|
|Set of documents retrieved at the current iteration|
|Relevance score of a document or topic at the current iteration|
|Topic score of a document at the current iteration|
|Set of targets at the current iteration|
|Set of target vectors for the targets|
|Set of topics at the current iteration|
|Set of top ten keywords of a topic (or a document)|
|Set of documents that belong to a topic|
|Set of good-to-have (bad-to-have) keywords given by users at the current iteration|
|Set of topics upvoted (downvoted) by users at the current iteration|
|Set of documents upvoted (downvoted) by users at the current iteration|
|Word-document matrix of the retrieved documents|
|Word-topic matrix of the retrieved documents|
|Topic-document matrix of the retrieved documents|
|Columns of the word-document, word-topic, and topic-document matrices, respectively|
|The set of nonnegative real numbers|
|The Frobenius norm|
|A standard basis vector|
|A row of a matrix|
|A column of a matrix|
|The index of the largest element in a vector|
2.1 Problem Formulation and Algorithm Workflow
Given a document collection, our goal is to retrieve, with high recall, the subset of documents that are relevant to the targets. Note that we do not limit the number of retrieved documents, as opposed to traditional information retrieval. Our iterative approach updates the targets based on user feedback and retrieves documents over multiple iterations.
Our algorithm workflow is outlined in Fig. 2, with notation listed in Table 1. Each iteration consists of three computational steps: target building, sifting, and targeted topic modeling. An iteration starts with user feedback from the previous iteration. After exploring the previously retrieved documents and their topics, users can modify keyword queries and/or give positive or negative feedback on topics or documents. Based on the user input, the interactive target building step (Section 2.2) updates the targets. Next, the sifting step (Section 2.3) selects a new set of documents using the updated targets. Finally, the targeted topic modeling step (Section 2.4) generates topics and the system visualizes them. The users can repeat the iterative process until satisfied.
2.2 Interactive Target Building Based on User Feedback
We represent targets as a set of single keywords (e.g., “apple”) or keyword compounds (e.g., “apple, orange”). The former looks for documents containing the single keyword and the latter looks for those containing all of the keywords in the keyword compound. In the search space reduction problem, users may not be familiar with their target domains. Even for domain experts, constructing a good static query is a challenging task without exploring and understanding the given dataset in advance. Both cases can be addressed with interactive target building. At each iteration, our interactive target building step updates the targets based on user feedback.
Unlike existing information retrieval approaches that use only positive queries, we use negative as well as positive targets. This allows users to express a complicated mental target model. For example, the users may be interested in a target but not in a similar concept (e.g., retrieve “apple, fruit” and ignore “orange, fruit”). Negative targets can also deal with words that have multiple meanings (e.g., retrieve “apple, iphone” and ignore “apple, fruit”). In detail, we allow the users to directly update the keyword sets, including good-to-have keywords, bad-to-have keywords, and stopwords to be ignored. Stopwords are words that are not useful in text analysis, including very frequent words such as articles, prepositions, and pronouns. In addition to the commonly used English stopwords, we allow the users to add custom stopwords that are data-specific or domain-specific. For example, when exploring medical records, ignoring common medical terms may increase the quality of topic modeling and sifting. Also, the users can indirectly update the targets by giving item-level (document) or group-level (topic) relevance feedback.
Our approach incorporates seven kinds of user relevance feedback into target building:
(1) Edit good-to-have keywords
(2) Edit bad-to-have keywords
(3) Edit stopwords (words to be ignored)
(4) Upvote topics
(5) Downvote topics
(6) Upvote documents
(7) Downvote documents
1, 4, and 6 are positive relevance feedback, indicating that the corresponding words, topics, or documents, respectively, are relevant to the user’s mental target. In contrast, 2, 5, and 7 are negative relevance feedback, indicating that the corresponding words, topics, or documents, respectively, are irrelevant to the user’s mental target. Lastly, 3 modifies the set of stopwords, which affects the follow-up topic modeling process described in Section 2.4.
Given user relevance feedback, we model the targets and their representative vectors as follows:
TargetModel computes the targets and their vectors at each iteration from the user-supplied input. The targets consist of a positive and a negative part, each of which has an explicit and an implicit component. Users can change the explicit component directly through keyword modification. For the implicit component, when users give relevance feedback on a topic or a document, we extract its top keywords and add the resulting keyword compound as an implicit target.
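As an illustration of the target building described above, the following is a minimal sketch and not the authors' TargetModel procedure. The `Feedback` container, the function names, and the indicator-style target vectors are assumptions made for this example; the text only states that targets have explicit (keyword) and implicit (top keywords of voted topics or documents) parts on both the positive and negative side.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Hypothetical container for one iteration of user feedback."""
    good_keywords: set = field(default_factory=set)   # explicit positive compounds
    bad_keywords: set = field(default_factory=set)    # explicit negative compounds
    upvoted_top_keywords: list = field(default_factory=list)    # top keywords of upvoted topics/documents
    downvoted_top_keywords: list = field(default_factory=list)  # top keywords of downvoted topics/documents


def build_targets(fb):
    """Return (positive, negative) target sets; each target is a frozenset of keywords."""
    positive = set(fb.good_keywords)
    negative = set(fb.bad_keywords)
    # Implicit targets: the top keywords of a voted item form one keyword compound.
    positive |= {frozenset(kws) for kws in fb.upvoted_top_keywords}
    negative |= {frozenset(kws) for kws in fb.downvoted_top_keywords}
    return positive, negative


def target_vector(compound, vocab):
    """Assumed representation: l2-normalized indicator vector over the vocabulary."""
    vec = [1.0 if w in compound else 0.0 for w in vocab]
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec
```

For example, a good-to-have compound “apple, iphone” and an upvoted topic whose top keywords are “mac, laptop” would yield two positive targets, one explicit and one implicit.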
2.2.1 Keyword Suggestion
Manually entering keywords can be burdensome. To this end, we recommend candidates for the good-to-have and bad-to-have keyword sets in real time. Candidate recommendation is based on similarity to the current good-to-have and bad-to-have keyword sets. Similarities between words can be calculated with several distance measures; among them, we adopt the vector-space model of word representation. To learn word vectors, we use empirical pointwise mutual information (ePMI), which measures co-occurrence between word pairs. The ePMI score between the word pair $(w_i, w_j)$ is defined as:
$$\mathrm{ePMI}(w_i, w_j) = \log \frac{\#(w_i, w_j)\, N}{\#(w_i)\, \#(w_j)} \qquad (1)$$
where $N$ denotes the total number of co-occurring word pairs, and $\#(w_i, w_j)$ and $\#(w_i)$ denote the number of occurrences of the word pair $(w_i, w_j)$ and the single word $w_i$, respectively. Following prior work, we first construct a matrix of pairwise ePMI scores, perform low-rank matrix factorization on it, and use the left factor as the vector representations of words after l2-normalization. We computed word vectors for each dataset to obtain dataset-specific word similarities, but pre-trained word vectors such as word2vec or GloVe can also be used in our algorithm.
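The word-vector construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: co-occurrence is counted per document, single-word counts are taken as marginals of the pair counts, negative and undefined ePMI entries are clipped to zero, and the low-rank factorization is a truncated SVD. The function names `epmi_word_vectors` and `nearest_words` are hypothetical.

```python
import numpy as np


def epmi_word_vectors(docs, vocab, rank=2):
    """Build l2-normalized word vectors from a low-rank factorization of an ePMI matrix."""
    index = {w: i for i, w in enumerate(vocab)}
    m = len(vocab)
    pair_counts = np.zeros((m, m))
    for doc in docs:
        ids = sorted({index[w] for w in doc if w in index})
        for a in ids:
            for b in ids:
                if a != b:
                    pair_counts[a, b] += 1.0  # words a and b co-occur in this document
    total = pair_counts.sum()                 # total number of co-occurring pairs
    word_counts = pair_counts.sum(axis=1)     # assumption: marginals as single-word counts
    with np.errstate(divide="ignore", invalid="ignore"):
        epmi = np.log(pair_counts * total / np.outer(word_counts, word_counts))
    epmi[~np.isfinite(epmi)] = 0.0            # pairs that never co-occur
    epmi = np.maximum(epmi, 0.0)              # assumption: keep only positive associations
    u, s, _ = np.linalg.svd(epmi)
    vectors = u[:, :rank] * np.sqrt(s[:rank])  # left factor of the rank-`rank` approximation
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms > 0, norms, 1.0)


def nearest_words(word, vocab, vectors, k=3):
    """Return the k words most similar to `word` by cosine similarity."""
    i = vocab.index(word)
    sims = vectors @ vectors[i]
    order = [j for j in np.argsort(-sims) if j != i]
    return [vocab[j] for j in order[:k]]
```

Keyword suggestions for a good-to-have or bad-to-have list could then be drawn from `nearest_words` of the keywords already in that list.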
2.3 Sifting Documents and Words
After the target modeling step, we retrieve a new set of documents using the updated targets. We provide two retrieval options: hard filtering by target keywords and soft sifting.
HardSift throws out documents that contain any of the negative target elements or their nearest words, and retrieves documents that contain any of the positive target elements or their nearest words. The nearest words of a word are the words most similar to it. Note that we apply negative feedback first and positive feedback later to take a conservative approach to filtering out documents.
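One possible reading of this hard filtering step is sketched below. The function names, the dictionary-based document representation, and the way nearest words are substituted into compounds are assumptions made for illustration; the original algorithm listing is not reproduced here. Negative targets are applied first and positive targets second, so a document matching both ends up retrieved.

```python
def matches(doc_words, compound):
    """A keyword compound matches when all of its keywords occur in the document."""
    return compound <= doc_words


def expand(compounds, nearest):
    """Add variants in which one keyword is replaced by one of its nearest words."""
    expanded = set(compounds)
    for compound in compounds:
        for word in compound:
            for neighbor in nearest.get(word, ()):
                expanded.add(frozenset((compound - {word}) | {neighbor}))
    return expanded


def hard_sift(docs, retrieved, positive, negative, nearest):
    """docs: {doc_id: set of words}; retrieved: ids kept at the previous iteration."""
    pos, neg = expand(positive, nearest), expand(negative, nearest)
    # Step 1 (negative feedback): drop previously retrieved documents matching a negative target.
    kept = {d for d in retrieved if not any(matches(docs[d], c) for c in neg)}
    # Step 2 (positive feedback): add every document matching a positive target.
    added = {d for d, words in docs.items() if any(matches(words, c) for c in pos)}
    return kept | added
```

Applying the positive step last favors recall: a document is only lost if it matches a negative target and no positive one.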
SoftSift incorporates a relevance score model to rank documents by how similar they are to the explicit and implicit targets. The relevance score of a document with respect to a target is calculated as the cosine similarity between the target vector and the document vector. All target vectors and document vectors are l2-normalized. To calculate the final relevance score of a document, we take a weighted average of its previous relevance score and its relevance scores with respect to the positive and negative feedback at the current iteration. To put more emphasis on recall than precision, we use a smaller weight for the negative feedback score than for the positive feedback score.
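The scoring idea can be sketched as follows. This is an illustrative approximation only: the weights `w_prev`, `w_pos`, and `w_neg`, the use of the maximum similarity over targets, and the subtraction of the negative score are assumptions; the text specifies only cosine similarity, a weighted combination with the previous score, and a smaller weight for negative feedback.

```python
import numpy as np


def l2_normalize(a, axis):
    """Scale vectors along `axis` to unit length, leaving zero vectors unchanged."""
    norms = np.linalg.norm(a, axis=axis, keepdims=True)
    return a / np.where(norms > 0, norms, 1.0)


def soft_sift_scores(X, pos_targets, neg_targets, prev_scores,
                     w_prev=0.4, w_pos=0.4, w_neg=0.2):
    """X: words x documents; targets: words x number of targets; returns one score per document."""
    docs = l2_normalize(X, axis=0)

    def best_similarity(targets):
        if targets.shape[1] == 0:
            return np.zeros(X.shape[1])
        # Cosine similarity of every target with every document; keep the best match.
        return (l2_normalize(targets, axis=0).T @ docs).max(axis=0)

    # Negative feedback is weighted less than positive feedback (w_neg < w_pos).
    return (w_prev * prev_scores
            + w_pos * best_similarity(pos_targets)
            - w_neg * best_similarity(neg_targets))
```

Documents could then be ranked by this score, or retrieved when the score exceeds a threshold chosen by the analyst.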
2.4 Targeted Topic Modeling
The last step of an iteration is targeted topic modeling. Targeted topic modeling finds a target-specific topical summary of documents that are retrieved from the previous sifting step. The calculated topics and their representative documents are visualized to the users so that they can easily understand what kind of documents are retrieved at the current iteration and perform relevance feedback for the next iteration.
2.4.1 Background: NMF for Topic Modeling
Given a nonnegative matrix $X \in \mathbb{R}_{+}^{m \times n}$, NMF approximates $X$ as a product of nonnegative factor matrices $W \in \mathbb{R}_{+}^{m \times k}$ and $H \in \mathbb{R}_{+}^{k \times n}$, i.e., $X \approx WH$, with $k \ll \min(m, n)$. This can be solved by optimizing the following formula:
$$\min_{W, H \geq 0} \| X - WH \|_F^2 \qquad (2)$$
In the topic modeling context, $X$ is a word-document matrix where $x_j$ (the $j$-th column vector of $X$) is a bag-of-words representation of the $j$-th document over $m$ keywords. $X$ is based on the TF-IDF representation of the document set and is usually normalized with the l2-norm. $k$ is set to be the number of topics. Factor matrices $W$ and $H$ represent word-topic and topic-document relationships, respectively. $w_i$, the $i$-th column of $W$, represents the $i$-th topic as a distribution over words. Large values in $w_i$ indicate that the corresponding keywords are strongly associated with the $i$-th topic. $h_j$, the $j$-th column of $H$, represents the $j$-th document as a weighted combination of topics. The $j$-th document belongs to the $i$-th topic if the $i$-th element of $h_j$ is its maximum, i.e., $i = \arg\max(h_j)$. Each topic is thus defined by its word distribution vector $w_i$ and the set of documents that belong to it.
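To make the factorization concrete, here is a minimal NumPy sketch of NMF-based topic modeling. It uses the classic multiplicative update rules, which are swapped in for simplicity and are not the solver used by the authors; the function names `nmf` and `topic_summary` are hypothetical.

```python
import numpy as np


def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Approximate a nonnegative words x documents matrix X by W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps   # word-topic factor
    H = rng.random((k, n)) + eps   # topic-document factor
    for _ in range(iters):
        # Multiplicative updates keep both factors nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def topic_summary(W, H, vocab, top=10):
    """Top keywords per topic and the topic membership (argmax) of each document."""
    keywords = [[vocab[i] for i in np.argsort(-W[:, t])[:top]]
                for t in range(W.shape[1])]
    membership = H.argmax(axis=0)
    return keywords, membership
```

Each column of `W` gives a topic's keyword weights, and each document is assigned to the topic with the largest entry in its column of `H`.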
2.4.2 Targeted Topic Modeling using NMF
To reflect a target built by users into the topic modeling process, we introduce an additional constraint term to the standard NMF formula, Eqn. 2, as follows:
where the operator in the additional term denotes elementwise multiplication. The additional term forces certain topics’ word representations to be similar to the corresponding target elements with the help of the masking coefficient matrix. The parameter ρ controls the balance between the original term and the additional term. A larger ρ results in stronger incorporation of the targets in topic modeling: the larger ρ is, the closer the topics become to the targets, at the expense of representing the data less faithfully. When ρ = 0, the formulation is equivalent to standard topic modeling. Also, the strength of the constraint is inversely proportional to the number of positive targets. To compute the masking coefficient matrix and the target matrix, we find, for each positive target vector, its closest topic vector and constrain that topic to the target.
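One plausible form of such a constrained objective is $\|X - WH\|_F^2 + \rho \|M \odot (W - V)\|_F^2$, where $V$ holds target vectors in the columns of the constrained topics and $M$ is a 0/1 mask selecting those columns. The symbols $V$ and $M$, the binary mask, and the multiplicative updates below are assumptions made for this sketch; the exact formulation and solver of the original method may differ.

```python
import numpy as np


def objective(X, W, H, V, M, rho):
    """Reconstruction error plus the masked penalty pulling W toward the targets V."""
    return np.linalg.norm(X - W @ H) ** 2 + rho * np.linalg.norm(M * (W - V)) ** 2


def targeted_nmf(X, k, V, M, rho=1.0, iters=200, seed=0, eps=1e-9):
    """Assumed form: min ||X - WH||_F^2 + rho * ||M * (W - V)||_F^2 with W, H >= 0.

    V (m x k): target vectors placed in the columns of the constrained topics.
    M (m x k): 0/1 mask that is 1 on the constrained columns and 0 elsewhere.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    history = []
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # The penalty adds rho*M*V to the numerator and rho*M*W to the denominator.
        W *= (X @ H.T + rho * M * V) / (W @ H @ H.T + rho * M * W + eps)
        history.append(objective(X, W, H, V, M, rho))
    return W, H, history
```

With `rho=0` the penalty vanishes and the updates reduce to plain NMF, which mirrors the statement that the formulation is then equivalent to standard topic modeling.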
The detailed algorithm at the -th iteration is as follows:
TargetedTopicModel applies a constrained NMF algorithm to the current document set and the current targets to compute the topics. Additionally, we calculate each topic’s relevance score with respect to the targets, based on the rank of each word within the topic’s topic vector. For speedup, we use a fast rank-2 NMF algorithm to initialize the factor matrices in Eqn. 2.
In this section, we present TopicSifter, our interactive document search space reduction system. Our visualization system is tightly integrated with the underlying algorithms described in Section 2 to support the various user feedback interactions listed in Section 2.2.
TopicSifter is designed to meet these design goals:
Given targets, retrieve relevant documents with high recall: TopicSifter should retrieve documents that are relevant to targets.
Show summary and details of sifted documents: TopicSifter should provide a topical summary and details of retrieved documents to help users understand them.
Modify targets over iterations: TopicSifter should allow users to update targets easily and iteratively.
Observe changes between iterations: TopicSifter should show differences in retrieved documents between iterations.
Export results for further analysis: TopicSifter is designed for one step of a complex text analysis workflow. Users should be able to export the retrieved documents for in-depth analyses.
TopicSifter consists of a web-based visualization interface using D3.js and a backend system in Python and MATLAB using the Django framework.
3.1 System Overview
TopicSifter consists of three panels: (1) the control panel, (2) the main view, and (3) the detail panel (Fig. 1). The control panel contains the keyword module to modify good-to-have words, bad-to-have words, and stopwords (supporting 1, 2, 3) and control buttons to update the main view. The main view shows the sifting status and the topical overview of the documents retrieved at the current iteration and allows the users to upvote or downvote topics and documents (supporting 4, 5, 6, 7). The relevance feedback on words, topics, and documents is reflected in the next iteration (Fig. 3). Lastly, the detail panel has the document table, which shows additional detail for all documents, and the history view, which shows historical trends over iterations. The width of each panel can be adjusted by dragging the divider in order to allocate more or less space to the panel. The system design is shown in Fig. 1.
The users follow the workflow in Fig. 3. Each iteration starts with the users exploring the retrieved documents and their topics in the main view. To give relevance feedback, the users can modify keyword sets in the control panel or upvote/downvote topics and documents in the main view. They can export the results or move on to the next iteration using buttons in the control panel.
3.2 Control Panel
The users can utilize the control panel to update the main view.
The control panel contains the keyword input module and the control buttons.
The keyword input module shows the current sets of good-to-have keywords, bad-to-have keywords, and stopwords and allows users to modify them (1, 2, 3).
Keyword Input Module
In the keyword input module (Fig. 1(d)), the users can add new keywords using an input text box or see the current keyword lists for good-to-have keywords, bad-to-have keywords, and stopwords. To add a keyword, users can enter the keyword in the input text box. While they type, possible matching keywords in the dictionary are listed in a pop-up, as shown in Fig. 4(a). The list is sorted by word frequency and updated as the user types more letters. After selecting one of the keywords in the pop-up list, the users can either enter the keyword as a single keyword (e.g., “visual”) or build a keyword compound (e.g., “visual” AND “analyt(ic)” in Fig. 4(b)). By clicking one of the green, red, or gray buttons in Fig. 4(c), the users add the entered keyword or keyword compound to the good-to-have keyword list, the bad-to-have keyword list, or the stopword list, respectively. Keywords and keyword compounds in each keyword list are visualized as word buttons inside the colored areas (good-to-have: green, bad-to-have: red, stopword: gray), as in Fig. 4(c). To remove a keyword or a keyword compound, the users can click the icon on the keyword button.
As discussed in Section 2.2.1, our technique suggests additional keywords based on the current set of keywords. The keywords recommended for the good-to-have or bad-to-have lists are visualized under the corresponding keyword list as keyword buttons with dashed borders and a + icon. The users can add one of the suggested keywords by clicking the + icon. The recommended keywords are updated in real time as the users add keywords to or remove keywords from the keyword lists.
Changing the number of topics
Users can change the topic granularity by increasing or decreasing the number of topics using the button group in the control panel. When the generated topics are too fine-grained or too coarse-grained, giving relevance feedback can be problematic. For example, the user may want to give feedback on all “fruit”-related topics, but there may be too many fine-grained “fruit”-related topics to interact with. On the other hand, the user may be interested in only part of a topic (e.g., the “apple, mac” part of “apple, mac, fruit”, but not the “apple, fruit” part). The current number of topics is shown in the middle of the button group. The users can click the buttons to decrease the number of topics by 5 or 1, or to increase it by 1 or 5. Note that a new set of topics is generated using the same retrieved document subset. The visual update after changing the number of topics is fast since this happens within an iteration without triggering the target building, sifting, and targeted topic modeling processes (Fig. 3).
The users can run our backend algorithms by clicking the sift button. After modifying good-to-have keywords, bad-to-have keywords, and stopwords (1, 2, 3) and upvoting or downvoting topics and documents (4, 5, 6, 7), the users move on to the next iteration. The sift button triggers the target building, sifting, and targeted topic modeling processes to retrieve a new set of documents and visualize their topic summary. This process is shown in Fig. 3.
The users can export the results using the export button. When the users are satisfied with the retrieved documents after multiple iterations, our system provides an option to save the results. The results are saved as a JSON file that includes the targets, the topics, and the IDs, topic memberships, and relevance scores of the retrieved documents.
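A sketch of what such an export could look like is given below. The field names (`targets`, `topics`, `documents`, `relevance`, and so on) and the function name `export_results` are hypothetical; the text states only which kinds of information the JSON file contains, not its schema.

```python
import json


def export_results(targets, topics, documents):
    """Serialize sifting results to a JSON string (the schema is an illustrative assumption)."""
    payload = {
        "targets": {
            # Keyword compounds are stored as sorted lists for stable output.
            "positive": [sorted(c) for c in targets["positive"]],
            "negative": [sorted(c) for c in targets["negative"]],
        },
        "topics": [{"id": t["id"], "keywords": list(t["keywords"])} for t in topics],
        "documents": [
            {"id": d["id"], "topic": d["topic"], "relevance": d["relevance"]}
            for d in documents
        ],
    }
    return json.dumps(payload, indent=2)
```

A downstream analysis step can then load the file with `json.loads` and select documents by topic membership or relevance score.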
3.3 Main View
The main view visualizes a topic summary of the documents retrieved at the current iteration, along with the sifting status bar, which shows the difference between the current iteration and the previous iteration.
In the topic visualization, the users can upvote or downvote topics and documents to indicate whether they are relevant to the targets (4, 5, 6, 7).
Fig. 5 shows the status bar chart.
The total length of all bars represents the total number of documents in the dataset.
The total length of the blue bars represents the number of documents retrieved at the current iteration, while the total length of the gray bars represents the number of documents sifted out.
Solid-colored bars represent documents that stay retrieved (solid blue) or stay sifted out (solid gray) between the previous iteration and the current iteration.
Patterned bars represent documents whose status has changed since the previous iteration.
In detail, the blue patterned bar represents incoming documents that were not retrieved at the previous iteration but are retrieved at the current iteration.
The gray patterned bar represents outgoing documents that were retrieved at the previous iteration but are sifted out at the current iteration.
Longer patterned bars indicate that the users' latest interactions have resulted in a larger change in the retrieved documents.
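The four bar segments correspond to simple set operations on the document IDs retrieved at two consecutive iterations. The sketch below illustrates this; the function and key names are hypothetical.

```python
def status_bar_counts(all_ids, prev_retrieved, curr_retrieved):
    """Return the length of each status bar segment as a document count."""
    all_ids = set(all_ids)
    prev, curr = set(prev_retrieved), set(curr_retrieved)
    return {
        "stay_retrieved": len(prev & curr),              # solid blue
        "incoming": len(curr - prev),                    # patterned blue
        "outgoing": len(prev - curr),                    # patterned gray
        "stay_sifted_out": len(all_ids - prev - curr),   # solid gray
    }
```

The four counts always sum to the total number of documents in the dataset, which matches the total length of all bars.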
Topics computed from the retrieved documents are visualized as rectangular cells (Fig. 1). At the top of each topic cell, its top ten keywords are shown, and its representative documents are visualized as small squares. The sizes of the cells are proportional to the number of retrieved documents that belong to each topic. The layout of the cells is calculated by D3’s built-in treemap algorithm. The color hue of a topic cell represents how relevant the topic is to the target, from green (relevant) to red (irrelevant), as in Fig. 6. The color hue of each topic is shared by its keywords and its documents. If a topic has changed from the previous iteration, its new representative keywords are shown in bold and underlined (the yellow topic in Fig. 6). For topic cells with narrow width, the users can hover over the top keywords to see the full list of keywords. To give positive (or negative) feedback on a topic, the users can click the menu button in the top right corner of the topic cell to open a pop-up menu with upvote and downvote buttons (Fig. 7).
The number of representative documents that are visualized as squares in a topic cell is determined by the size of the cell. Our system picks the documents to be visualized by how close they are to their topic, since closer documents are more representative of the topic. The color lightness of a document square represents how close the document is to its topic, from dark (close) to light (less close). The document squares are also ordered by closeness to their topic from top-left to bottom-right (Fig. 6). To see the details of a document, the users can hover over the square to see its document ID in a pop-up menu or click the square to see its details in the document table in the detail panel. Users can give positive (or negative) feedback on documents to indicate that they are relevant (or irrelevant) to their mental targets by toggling the upvote (or downvote) button in the pop-up menu of each document square, as in Fig. 7. Upvoted topics and documents are highlighted with a border, and downvoted topics and documents are whited out.
3.4 Detail Panel
The detail panel has two tabs to toggle between the document table view and the history view.
The document table view shows the list of all documents and their raw text details.
The history view shows the history of previous iterations to keep track of the iterative sifting process.
The document table shows additional information about all documents in the dataset.
Each row of the table shows document details such as the document ID, title, and raw text, along with the document's topic membership and topic-relevance score.
The document table is linked with the topic visualization.
Hovering over a document square highlights the corresponding table row, and vice versa.
Column fields may vary depending on datasets used.
The raw texts can be long, so our system does not show them by default, but a row can be expanded to show the raw text when clicked.
One challenge is that rendering all document rows is impractical in our large-scale text analytics setting.
To solve this, we use the Clusterize.js library (available at https://clusterize.js.org/) to render only the currently visible rows and to reuse those HTML elements when the table is scrolled.
Another challenge is navigating and scrolling through tens of thousands of rows.
For easy navigation, when a document square in the main view is right-clicked, the document table automatically scrolls to the corresponding row.
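The idea of rendering only the visible rows can be sketched as below. This is a simplified illustration with fixed-height rows, not Clusterize.js's API or internals; the function names and the `overscan` parameter are assumptions introduced here.

```python
def visible_row_range(scroll_top, viewport_height, row_height, total_rows, overscan=5):
    """Return the half-open range [first, last) of table rows worth rendering.

    All sizes are in pixels. `overscan` adds a few extra rows above and below
    the viewport so that small scroll movements do not expose blank space.
    """
    first = max(0, scroll_top // row_height - overscan)
    # Ceiling division finds the first row that starts below the viewport.
    below = (scroll_top + viewport_height + row_height - 1) // row_height
    last = min(total_rows, below + overscan)
    return first, last


def scroll_offset_for_row(row_index, row_height):
    """Scroll position that brings a given row to the top of the viewport."""
    return row_index * row_height
```

For a table of 50,000 rows with 20-pixel rows and a 400-pixel viewport, only a few dozen rows are materialized at any time, and jumping to a document's row reduces to setting the scroll position.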
Fig. 8 shows the history view, which contains a stacked bar chart (left) and the keyword summary history (right). The stacked bar chart shows the status bars from all previous iterations. It reveals the changes at each iteration and whether the sifting results have become stable. The keyword summary history shows the top keywords of the retrieved documents at each previous iteration. Users can check whether their interactions have had the expected effect.
In this section, we provide a quantitative evaluation that uses simulated user feedback on a labeled dataset. We also present use cases that illustrate the usefulness of TopicSifter for search space reduction on two datasets: a TED dataset and a Twitter dataset.
4.1 Dataset Description
The 20 Newsgroup dataset (source: http://qwone.com/~jason/20Newsgroups) is a collection of 19.8K newsgroup documents partitioned into 20 categories. Its dictionary size is 128K. The TED talk transcript dataset (source: https://github.com/saranyan/TED-Talks) contains 2,896 documents transcribed from English TED talk videos. The talks cover various topics including technology, education, etc. Its dictionary size is 18,275. The documents are written in spoken language in a subtitle-like style. The Twitter dataset (source: https://archive.org/details/twitter_cikm_2010) was originally explored by Cheng et al. We use a part of the data containing 500K tweets. After removing documents with fewer than five words, we are left with 300K documents and 32.3K words. We applied the Porter stemming algorithm for pre-processing and built TF-IDF matrices for the datasets.
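The short-document filter and TF-IDF construction can be sketched as follows. The paper does not state which TF-IDF variant was used, so the weighting below (raw term frequency times log(N/df)) is an assumption, as are the function name and the expectation that the input tokens are already stemmed.

```python
import math
from collections import Counter


def build_tfidf(documents, min_words=5):
    """Drop short documents and compute sparse TF-IDF rows.

    documents: list of token lists (assumed to be stemmed beforehand).
    Returns the kept documents and, per document, a dict term -> weight.
    """
    kept = [doc for doc in documents if len(doc) >= min_words]
    n_docs = len(kept)

    # Document frequency: the number of kept documents containing each term.
    doc_freq = Counter()
    for doc in kept:
        doc_freq.update(set(doc))

    rows = []
    for doc in kept:
        counts = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        rows.append({term: tf * math.log(n_docs / doc_freq[term])
                     for term, tf in counts.items()})
    return kept, rows
```

Under this weighting, a term that appears in every kept document receives a weight of zero, which is one reason corpus-wide frequent words contribute little to the resulting matrix.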
4.2 Quantitative Evaluation
In this section, we present results of a study simulating user input to test the effectiveness of our technique.
4.2.1 Experiment Setup
To simulate user relevance feedback, we used the 20 Newsgroup dataset, which has category labels. Among the 20 categories, we chose two labels, “rec.sport.baseball” (989 documents) and “rec.sport.hockey” (993 documents), as the relevant/true labels; together they make up about 10% of the entire dataset.
First, we entered “game, team, player, play”, the four most representative keywords of the documents in the two categories, as the initial target words. At each iteration, we selected two documents or topics to give relevance feedback on (an upvote or a downvote based on the true label). We compared six strategies: 1) upvote two true documents, 2) upvote two true topics, 3) downvote two false documents, 4) downvote two false topics, 5) upvote a true document and downvote a false document, and 6) upvote a true topic and downvote a false topic.
Table 2 summarizes the performance of the different feedback strategies at the 10th iteration, averaged over three runs. We used four measures: precision, recall, F1-score (the harmonic mean of precision and recall), and PRES, which is a recall-oriented measure. For each strategy, we tried a range of parameter values and chose the combination with the best F1 score. The sifting threshold was held fixed. All strategies converged after 4 to 6 iterations.
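For reference, precision, recall, and F1 for a retrieved set can be computed as in the sketch below, using their standard definitions. The function name and the representation of documents as ID collections are assumptions for illustration; PRES is omitted because it additionally depends on the ranking of the retrieved documents.

```python
def precision_recall_f1(retrieved, relevant):
    """Standard set-based retrieval measures.

    retrieved: IDs of the documents kept by the sifting step.
    relevant:  IDs of the documents whose true label is relevant.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    true_positives = len(retrieved & relevant)

    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

In the simulated setting, `relevant` would correspond to the documents labeled with the two chosen newsgroup categories.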
Simulated positive feedback yielded higher recall and lower precision than negative feedback. Giving both positive and negative feedback yielded scores better than or comparable to giving positive feedback only, which supports our novel negative targeting. In addition, positive topic-level feedback outperformed the others in F1 and PRES scores. This suggests that our topic-level relevance feedback is beneficial for search space reduction.
4.3 Use Case 1: Exploring Scraped Data
Jim is an art-major student who is also interested in technology. He is looking for technology areas where he can incorporate his artistic sense, and uses TopicSifter to retrieve talks related to his interest.
His visual exploration starts with an initial topic modeling that shows ten topics over all documents of the TED transcript dataset. From the main view (Fig. 9(left)), he observes that the TED dataset covers a wide variety of topics, so he decides to focus on his interest, art and technology. He adds the keyword compound “art, technology” to the good-to-have keyword list and runs TopicSifter. He discovers some topics that do not interest him, such as those related to biology/medicine or the economy, and downvotes them by clicking the topic cells. During the process, he notices in the history view of the detail panel that “laughter” is the third most frequent word in the TED dataset (upper right of Fig. 10). He recalls that the TED dataset is based on transcripts of the talks, in which “laughter” usually describes the audience’s reaction. He adds the keyword to the stopword list so that it cannot influence the sifting process (left of Fig. 10). As he proceeds, he sees a topic with the keywords “market, africa, dollar” and downvotes it since it looks unrelated. Here, TopicSifter does not simply remove all documents in the downvoted topic. It still retrieves target-relevant documents that were in the downvoted topic, thereby preserving high recall. For example, the documents titled “The surprising seeds of a big data revolution in healthcare” and “Tim Brown urges designers to think big” are both highly related to art and technology, yet were assigned to the “money” topic. TopicSifter retained them by taking their overall relevancy into account. He finds the “comput, robot” topic interesting, inspects its documents in the table view, and upvotes it (Fig. 9(middle)). In this topic, Jim finds interesting subtopics and corresponding documents about 3-D printers and human-computer interaction.
Finally, he continues iterations until he is satisfied with his target documents about art and technology (Fig. 9(right)).
4.4 Use Case 2: Exploring Social Media Data
Next, we follow the case of Pam, a marketer at a travel agency, who uses the proposed technique to identify consumers’ interests in travel experiences. Pam starts by loading the Twitter dataset and looking at the initial topics.
Since Twitter is a social media platform, many tweets are about everyday life and emotions. For example, Pam sees that some topics include top keywords such as “rt”, “home”, “day”, and “today”. She adds the keyword “#travel” to the good-to-have list to observe how users behave when using hashtag keywords about traveling on Twitter. As recommended good-to-have words pop up around the selected keywords, she selects relevant ones among them, such as “airline”, “plane”, “travel”, and “vacation”, to capture broader user interests about traveling (Fig. 12). After a single sifting phase, she observes that a red (and thus less relevant) topic includes the keyword “intern” (rectangle in Fig. 11(left)). Many tweets in it are comments about internships, such as “Why are like 80% of the PokerRoad intern applicants from Canada? […]”. She finds it strange that a topic about internships is retrieved for travel-related targets. As it turns out, the word “international”, which is relevant to the targets, is stemmed to “intern”, so tweets about internships are incorrectly identified as relevant. Pam adds “intern” to the stopword list to avoid this issue. After one iteration, the “intern” topic is removed (Fig. 11(middle)). She spots an unusual topic, “wind,mph”. Tweets in this topic are mostly generated automatically by a weather bot Twitter account, such as “HD: Light Rain and Breezy and 52 F at New York/John F. Kennedy […]”. Another topic, “@DL_KOPC,chasin,miami”, contains various spam messages such as advertisements for a trip to Miami. She downvotes these two topics to remove additional spam tweets (black rectangles in Fig. 11(middle)). At the next iteration, there are many casual tweets such as “Family, food, games, and football. That’s Thanksgiving.” or “Just chatted w/ Jane Lindskold & husband Jim here at the airport. Very cool people.”. Pam continues her exploration to find more specific tweets that represent customers’ interests related to traveling (Fig. 11(right)). One big travel-related concern is “flight delay”, as shown in the top-right topic in Fig. 11(right). Another interest is “free Wi-Fi”, as shown in the bottom-right topic in Fig. 11(right). She starts designing travel packages that include free Wi-Fi options and flight delay insurance. The application helped her identify customer concerns and customize the agency’s products.
Iterative methods are computational methods that update approximate solutions over iterations. In general, iterative methods have stopping criteria or stopping rules that terminate them based on their objective functions or evaluation measures, e.g., when a score converges to a local minimum. Likewise, many visual analytics systems that adopt interactive machine learning or optimization methods use some form of measure to evaluate their tasks and applications. These measures can be kept internally for monitoring, or they can be shown to the users as charts or some form of visual encoding to inform them about the status of the current iteration. In our case, the relevancy scores of documents can be used as a measure. Unfortunately, our iterative retrieval approach updates not only the solution (the retrieved set of relevant documents) but also the target against which we measure the relevance scores of the documents. For this reason, comparing relevance scores between iterations is meaningless if the target has changed. That is, a higher relevance score in one iteration does not necessarily mean a better solution than a lower relevance score in another iteration. One naïve workaround would be to compute the relevance scores of previously retrieved documents against the current target. However, this workaround requires the system to store all historical results and recalculate the relevance scores at every iteration, which is not practical. Instead, for the TopicSifter prototype system, we decided to show retrieval status changes, similar to membership changes in clustering. As shown in Fig. 8, the history view in the detail panel shows the changes in retrieved documents over the iterations in the stacked bar chart. In addition, we use a colored triangle mark to indicate whether a topic has changed much from the previous iteration, as in Fig. 6. These kinds of visual cues can guide the users’ decision on when to stop iterating (e.g., when there is limited change between iterations).
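A retrieval-status comparison of this kind can be sketched as below. This is an illustrative assumption rather than the system's actual logic: the function names, the change ratio (changed documents over the union of both retrieved sets), and the `tolerance` value are hypothetical. The sketch compares only set membership, so it does not depend on relevance scores that were measured against different targets.

```python
def retrieval_change(previous_ids, current_ids):
    """Summarize how the retrieved set changed between two iterations."""
    previous, current = set(previous_ids), set(current_ids)
    added = current - previous      # newly retrieved documents
    removed = previous - current    # documents sifted out
    kept = previous & current       # documents retrieved in both iterations
    union = previous | current
    change_ratio = (len(added) + len(removed)) / len(union) if union else 0.0
    return {
        "added": len(added),
        "removed": len(removed),
        "kept": len(kept),
        "change_ratio": change_ratio,
    }


def looks_stable(history, tolerance=0.01):
    """True if the last two retrieved sets differ by at most `tolerance`."""
    if len(history) < 2:
        return False
    return retrieval_change(history[-2], history[-1])["change_ratio"] <= tolerance
```

The added, removed, and kept counts correspond to the kind of per-iteration segments a stacked bar chart can display, and a small change ratio is one possible signal that further iterations are unlikely to alter the result much.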
In this paper, we proposed a novel sifting technique that solves the search space reduction problem interactively and iteratively. Our technique combines interactive target building and targeted topic modeling to sift through document collections and retrieve as many relevant documents as possible. As a proof of concept, we built an interactive search space reduction system that offers tight integration between the visualization and the underlying algorithms.
Acknowledgements. The authors wish to thank Ji Yeon Kim for her help with the illustrations for the paper. This work was supported in part by NSF grants OAC-1642410 and IIS-1750474. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.
-  K. Andrews, W. Kienreich, V. Sabol, J. Becker, G. Droschl, F. Kappe, M. Granitzer, P. Auer, and K. Tochtermann. The infoSky visual explorer: Exploiting hierarchical structure and document similarities. Information Visualization, 1(3-4):166–181, 2002. doi: 10.1057/palgrave.ivs.9500023
-  D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3:993–1022, 2003.
-  N. Cao, J. Sun, Y.-R. Lin, D. Gotz, S. Liu, and H. Qu. FacetAtlas: Multifaceted visualization for rich text corpora. IEEE Transactions on Visualization and Computer Graphics, 16(6):1172–1181, Nov 2010. doi: 10.1109/TVCG.2010.154
-  Z. Cheng, J. Caverlee, and K. Lee. You are where you tweet: a content-based approach to geo-locating twitter users. In Proc. the ACM International Conference on Information and Knowledge Management, pp. 759–768. ACM, 2010.
-  J. Choo, H. Kim, E. Clarkson, Z. Liu, C. Lee, F. Li, H. Lee, R. Kannan, C. D. Stolper, J. Stasko, and H. Park. VisIRR: A visual analytics system for information retrieval and recommendation for large-scale document data. ACM Transactions on Knowledge Discovery from Data, 12(1):8:1–8:20, Jan. 2018. doi: 10.1145/3070616
-  J. Choo, C. Lee, C. K. Reddy, and H. Park. UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics, 19(12):1992–2001, Dec 2013. doi: 10.1109/TVCG.2013.212
-  E. Clarkson, K. Desai, and J. Foley. Resultmaps: Visualization for search interfaces. IEEE Transactions on Visualization and Computer Graphics, 15(6):1057–1064, 2009.
-  W. Cui, S. Liu, Z. Wu, and H. Wei. How hierarchical topics evolve in large text corpora. IEEE Transactions on Visualization and Computer Graphics, 20(12):2281–2290, Dec 2014. doi: 10.1109/TVCG.2014.2346433
-  B. Drake, T. Huang, A. Beavers, R. Du, and H. Park. Event detection based on nonnegative matrix factorization: Ceasefire violation, environmental, and malware events. In D. Nicholson, ed., Advances in Human Factors in Cybersecurity, pp. 158–169. Springer International Publishing, New York City, USA, 2018. doi: 10.1007/978-3-319-60585-2_16
-  R. Du, D. Kuang, B. Drake, and H. Park. DC-NMF: Nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling. Journal of Global Optimization, 68(4):777–798, Aug. 2017. doi: 10.1007/s10898-017-0515-z
-  M. El-Assady, F. Sperrle, O. Deussen, D. A. Keim, and C. Collins. Visual Analytics for Topic Model Optimization based on User-Steerable Speculative Execution. IEEE Transactions on Visualization and Computer Graphics, 2018. doi: 10.1109/TVCG.2018.2864769
-  R. H. Fowler, W. A. Fowler, and B. A. Wilson. Integrating query, thesaurus, and documents through a common visual representation. In Proc. the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 142–151. ACM, 1991.
-  E. Gomez-Nieto, F. San Roman, P. Pagliosa, W. Casaca, E. S. Helou, M. C. F. de Oliveira, and L. G. Nonato. Similarity preserving snippet-based visualization of web search results. IEEE Transactions on Visualization and Computer Graphics, 20(3):457–470, 2014.
-  D. Harman. Towards interactive query expansion. In Proc. the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 321–331. ACM, 1988.
-  S. Havre, E. Hetzler, K. Perrine, E. Jurrus, and N. Miller. Interactive visualization of multiple query results. In Proc. the IEEE Symposium on Information Visualization, pp. 105–. IEEE Computer Society, Washington, DC, USA, 2001.
-  M. A. Hearst. TileBars: visualization of term distribution information in full text information access. In Proc. the SIGCHI Conference on Human Factors in Computing Systems, pp. 59–66. ACM Press/Addison-Wesley Publishing Co., 1995.
-  D. Herr, Q. Han, S. Lohmann, and T. Ertl. Hierarchy-based projection of high-dimensional labeled data to reduce visual clutter. Computers & Graphics, 62:28–40, 2017. doi: 10.1016/j.cag.2016.12.004
-  E. Hetzler and A. Turner. Analysis experiences using information visualization. IEEE Computer Graphics and Applications, 24(5):22–26, Sept 2004. doi: 10.1109/MCG.2004.22
-  O. Hoeber and X. D. Yang. A comparative user study of web search interfaces: Hotmap, concept highlighter, and google. In Proc. the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 866–874. IEEE, 2006.
-  O. Hoeber, X.-D. Yang, and Y. Yao. Visualization support for interactive query refinement. In Proc. the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 657–665. IEEE, 2005.
-  E. Hoque and G. Carenini. Convisit: Interactive topic modeling for exploring asynchronous online conversations. In Proc. the International Conference on Intelligent User Interfaces, pp. 169–180. ACM, New York, NY, USA, 2015. doi: 10.1145/2678025.2701370
-  E. Hoque and G. Carenini. Interactive topic hierarchy revision for exploring a collection of online conversations. Information Visualization, 0(0):1473871618757228, 2018. doi: 10.1177/1473871618757228
-  H. Kim, J. Choo, C. Lee, H. Lee, C. K. Reddy, and H. Park. PIVE: Per-iteration visualization environment for real-time interactions with dimension reduction and clustering. In Proc. the AAAI Conference on Artificial Intelligence, 2017.
-  M. Kim, K. Kang, D. Park, J. Choo, and N. Elmqvist. TopicLens: Efficient multi-level visual topic exploration of large-scale document collections. IEEE Transactions on Visualization and Computer Graphics, 23(1):151–160, Jan 2017. doi: 10.1109/TVCG.2016.2598445
-  D. Kuang and H. Park. Fast rank-2 nonnegative matrix factorization for hierarchical document clustering. In Proc. the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 739–747. ACM, 2013. doi: 10.1145/2487575.2487606
-  D. Kuang, S. Yun, and H. Park. SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering. Journal of Global Optimization, 62(3):545–574, Nov. 2014. doi: 10.1007/s10898-014-0247-2
-  H. Lee, J. Kihm, J. Choo, J. Stasko, and H. Park. iVisClustering: An interactive visual document clustering via topic modeling. Computer Graphics Forum, 31(3pt3):1155–1164, 2012. doi: 10.1111/j.1467-8659.2012.03108.x
-  O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Proc. the International Conference on Neural Information Processing Systems - Volume 2, pp. 2177–2185. MIT Press, Cambridge, MA, USA, 2014.
-  C. Li, Y. Wang, P. Resnick, and Q. Mei. Req-rec: High recall retrieval with query pooling and interactive classification. In Proc. the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 163–172. ACM, 2014.
-  W. Magdy and G. J. Jones. PRES: a score metric for evaluating recall-oriented information retrieval applications. In Proc. the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 611–618. ACM, 2010.
-  C. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Natural Language Engineering, 16(1):100–103, 2010.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
-  C. North, A. Endert, and P. Fiaux. Semantic interaction for sensemaking: Inferring analytical reasoning for model steering. IEEE Transactions on Visualization and Computer Graphics, 18:2879–2888, 2012. doi: 10.1109/TVCG.2012.260
-  J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proc. the Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014.
-  M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
-  V. Rakesh, W. Ding, A. Ahuja, N. Rao, Y. Sun, and C. K. Reddy. A sparse topic model for extracting aspect-specific summaries from online reviews. In Proc. the World Wide Web Conference, pp. 1573–1582. International World Wide Web Conferences Steering Committee, 2018.
-  H. Reiterer, G. Tullius, and T. M. Mann. Insyder: a content-based visual-information-seeking system for the web. International Journal on Digital Libraries, 5(1):25–41, Mar 2005. doi: 10.1007/s00799-004-0111-y
-  T. Ruotsalo, G. Jacucci, P. Myllymäki, and S. Kaski. Interactive intent modeling: information discovery beyond search. Communications of the ACM, 58(1):86–92, 2015.
-  T. Ruotsalo, J. Peltonen, M. Eugster, D. Głowacka, K. Konyushkova, K. Athukorala, I. Kosunen, A. Reijonen, P. Myllymäki, G. Jacucci, et al. Directing exploratory search with interactive intent modeling. In Proc. the ACM International Conference on Information and Knowledge Management, pp. 1759–1764. ACM, 2013.
-  I. Ruthven and M. Lalmas. A survey on the use of relevance feedback for information access systems. The Knowledge Engineering Review, 18(2):95–145, 2003.
-  A. Smith, V. Kumar, J. Boyd-Graber, K. Seppi, and L. Findlater. Closing the loop: User-centered design and evaluation of a human-in-the-loop topic modeling system. In Proc. the International Conference on Intelligent User Interfaces, pp. 293–304. ACM, New York, NY, USA, 2018. doi: 10.1145/3172944.3172965
-  G. Smith, M. Czerwinski, B. Meyers, D. Robbins, G. Robertson, and D. S. Tan. FacetMap: A scalable search and browse visualization. IEEE Transactions on Visualization and Computer Graphics, 12(5):797–804, 2006.
-  S. Wang, Z. Chen, G. Fei, B. Liu, and S. Emery. Targeted topic modeling for focused analysis. In Proc. the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235–1244. ACM, 2016.
-  S. Wang, M. Zhou, S. Mazumder, B. Liu, and Y. Chang. Disentangling aspect and opinion words in target-based sentiment analysis using lifelong learning. arXiv preprint arXiv:1802.05818, 2018.
-  R. White and R. Roth. Exploratory Search: Beyond the Query-Response Paradigm. Morgan & Claypool, 2013.
-  X. Zheng, A. Sun, S. Wang, and J. Han. Semi-supervised event-related tweet identification with dynamic keyword generation. In Proc. of the ACM on Conference on Information and Knowledge Management, pp. 1619–1628. ACM, 2017.