Dynamic Search - Optimizing the Game of Information Seeking

Zhiwen Tang and Grace Hui Yang

This article presents the emerging topic of dynamic search (DS). To position dynamic search in a larger research landscape, the article discusses in detail its relationship to related research topics and disciplines. The article reviews approaches to modeling dynamics during information seeking, with an emphasis on Reinforcement Learning (RL)-enabled methods. Details are given for how different approaches are used to model interactions among the human user, the search system, and the environment. The paper ends with a review of evaluations of dynamic search systems.

Dynamic search, reinforcement learning, information seeking, human-centered artificial intelligence

1 Introduction

Information seeking is a type of human activity where “humans purposely engage to change their state of knowledge” [82]. Kids talk to Amazon Alexa to try out its functionalities and test the limits of its artificial intelligence. Parents search online and gather information from their friends and real estate agents to decide which house to purchase. Students read textbooks and Wikipedia to understand what a wormhole is and why Schrödinger’s cat was both dead and alive. The advancement of information technology and artificial intelligence has made access to large amounts of information very convenient. Digital assistants, such as search engines, are the best friends of human beings who are seeking information. One might expect the experience of interacting with an intelligent agent to always be as straightforward as finding the front page of the Apple Store website. Yet that is not always the case. The experience can be quite tedious and tortuous. For instance, you would really wish for more magic when seeking materials for a job interview or finding a nursing home that your grandmother would agree to go to.

If we look closer at these tortuous search tasks, we realize that most of them are linked to some telic goals. These goals concern financial needs, health care, and other serious matters. The eventual successes of these search activities would lead to ultimate human joy; but at the moment, they seem only remotely related to leisure, entertainment, and enjoyment. Users are well aware of the inconvenience of these processes and of the effort required to accomplish their goals.

Let us look at one example. As someone who had been troubled by a persistent cough for days, you decided to get some medication. A doctor’s appointment could only be granted in a few days, but you could not bear the symptoms any more. So you decided to purchase some over-the-counter medicines from Amazon. You typed in “cough medicine” and found many types of cough and cold medicines available, each of which targets different symptoms. Your own symptoms included headache, cough, and runny nose. By comparing your symptoms with what a medicine’s description said it could treat, you narrowed down the choices to just a few: Robitussin, Delsym, and Nyquil. You then searched for each of them, compared their ingredients, and made sure you were not allergic to any ingredient by forming new queries like “Delsym ingredients”. You also checked whether the medicines could be taken together. After that, you read the comments and ratings from other customers to check if those medicines were found to be as effective as the sellers claimed. Finally, you chose Robitussin and made the purchase.

Fig. 1: Information Seeking Process, from Marchionini [82].

These “serious” search tasks are closely related to one’s need for problem solving, decision making, and reasoning. Marchionini called them information seeking and defined information as “anything that can potentially change people’s knowledge” [82]. It can be a “communication act,” a piece of knowledge that “increases or decreases the uncertainty,” or instructive material that “imparts knowledge” [22]. Figure 1 [82] shows the process of information seeking. It is conducted through several steps. First, “an information seeker,” most likely a human user, recognizes and accepts an information problem, defines and understands the problem, and then chooses a search system, e.g. Google. After that, the human user formulates a query, executes the search function, examines the results, extracts useful information from them, and reflects on what has been extracted. This constitutes one search iteration. The user usually repeats similar cycles and iterates through the entire process until the task is completed or abandoned, or no further search is supported.

Note that this description is given only from the user’s perspective. We should be aware that the search engine kicks in after the user “executes the search function” and before the user “examines the results.” A more complete elucidation is given by Luo et al. [80], which views information seeking as a cooperative process between a human agent and a search-engine agent (Figure 2). The idea that the process entails communication between two agents was fleshed out in [77], where a human agent queries and clicks to send his message, and a search-engine agent ranks and selects the content and presentation to communicate back to the user. If we would like to make more comparisons, we might note that the entire discipline of Information Retrieval (IR) uses another more traditional yet different view. Most IR research studies and designs are described from the search engine’s point of view, in that a search iteration starts from receiving a user-issued query, and then a computer algorithm checks its pre-built index to find the most relevant documents to return. This one-step algorithm, or function, that has been extensively studied–perhaps more than all others–is called ad-hoc retrieval.

Fig. 2: Dual Agents in Information Seeking, by Luo et al. [80].

One evident difference between information seeking and ad-hoc retrieval is the number of search iterations that are used in their processes. Ad-hoc retrieval takes in one query and finds a ranked list of documents. It runs only one iteration and is called “the narrowest sense” of IR by Yang [142]. On the other hand, information seeking takes a much longer process that consists of rounds and rounds of querying and returning of documents.

A less noticeable difference between the two lies in their uses. Ad-hoc retrieval is mainly responsible for look-ups [83]: for instance, looking up a word’s synonym, a web site that you always visit, next year’s academic calendar, or this weekend’s program at the Kennedy Center. On the other hand, information seeking takes place when people learn and investigate [83]. Although most users use the same tool, web search engines, to perform both ad-hoc retrieval and information seeking, the two are in fact different in nature, beyond the surface difference that one takes a single step and the other multiple steps.

When people learn and investigate, tasks are sometimes open-ended. They do not possess telic goals, and the search processes are not means-to-ends. For example, finding materials to write a term paper could be quite exploratory and enjoyable. These are the subject of the study of exploratory search (ES) [133], which does not require a specific goal to get started and can be primarily driven by curiosity. Other times, however, the search tasks have telic goals and are of a serious nature, such as finance, jobs, health care, legal issues, and family affairs. The line between the two types might seem fine. It is because exploration, the characteristic of ES, can be found in almost all IS tasks, including the most serious types, such as catching a human-trafficker, usually as an early stage. And this exploratory phase could re-appear from time to time when a user needs to broaden her search scope. As subtle as these comparisons are, we think their differences are greater than their similarities: ES bears no particular intent to derive a solution that must converge to some conclusion; while the other “serious” type does. In this article, we are mainly interested in the views, algorithms, and solutions for the latter.

Fig. 3: Communication between User and Search Engine, Luo et al. [77].

All types of IS processes, including ad-hoc retrieval and exploratory search, claim that they would like to “satisfy the user’s information need,” and determine on that basis whether the search is successful. However, the definition of an information need is vague, and how it could be met is never made crystal-clear, at least not in a mathematical sense. The most-accepted and commonly-used expression is to find documents that are “relevant.” Implicitly, it suggests that finding relevant documents would sufficiently meet the user’s information need. However, “being relevant” does not equal “satisfying the user’s information need.” How exactly could relevant documents satisfy an information need? Is it that the need is a threshold, and the documents enable the user to reach a certain level, so that if the level reached meets the threshold, then we can say that the need is met? Is it that the need is a hypothesis, and if the documents contain a proof of it (that would be too easy), or partially and indirectly support the user in proving it, then the need is met? Is it that the need is a set of conjunctive disjunctive clauses, and the documents allow the user to find at least one non-empty set for it, and then we say the need is met? Or, is it that the user needs a feeling of relaxation, such that only a Netflix movie marathon could satisfy the need? It is a pity that we cannot easily find a universal and mathematical formulation of what we really mean by “to satisfy an information need.”

This lack of a clear formulation of “what is the expectation in the end” makes it difficult to take a rational approach to finding the optimal, most suitable solution to our research problem, if there is one. On the other hand, a large number of excellent optimization methods have flourished through great advancements in machine learning and its applications, where optimization is at the heart of the discipline. So we have two choices: to keep using an imprecise, vague formulation of how an end is met without being able to effectively optimize it, or to re-define the research problem, aim at optimizing the entire search process, and be able to utilize an arsenal of data-driven, optimization-based learning methods. We think this is a call to every IR researcher.

Over the last decade, we have witnessed the emergence of two different philosophies and their related schools of methods, for ad-hoc retrieval. The two schools are heuristic-driven methods, including a vector-space model (VSM) [33], probabilistic relevance models [33], Okapi BM25 [103], language modeling [149], and optimization-based, data-driven supervised learning approaches, also known as learning-to-rank approaches [73]. We think it would be a little rash to judge which one is better, for one of them could be a necessary stage for the IR discipline before the other has naturally developed. We think both are great for solving research problems in IR, and that they will co-exist for a long time, because in the end, human goals are indeed difficult to define, and their clarity can only be improved to a certain degree. Currently, what we have observed is that large commercial search engines have moved from heuristic approaches to learning approaches, but features like BM25 are still unquestionably necessary. The state of the art for ad-hoc retrieval is a mixed use of the two: taking an optimization-based approach but using features that bear the spirit of heuristics.

We can expect similar situations in IS. Early solutions to information seeking have been studied under the name of Interactive Information Retrieval (IIR) [107]. IIR studies how to modify the query space or document space or both based on the user’s reactions to previously retrieved results. The most famous method of this kind is relevance feedback (RF). As you may expect, IIR did not take an optimization approach. Its algorithms are mostly hand-crafted by experts, rather than learned from data. As a consequence, IIR lacks the ability to “purposefully” plan for subsequent retrievals, and is not optimized for any long-term goal.
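To make the idea of modifying the query space from user feedback concrete, here is a minimal sketch of one classic hand-crafted update, Rocchio's formula. The text above does not name this formula; it is used here only as a familiar illustration. The function name `rocchio`, the weights `alpha`, `beta`, `gamma`, and the toy term vectors are all assumptions.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move a query vector toward relevant and away from non-relevant documents."""
    def centroid(docs):
        # Mean term-weight vector of a document set (zero vector if the set is empty).
        if not docs:
            return [0.0] * len(query)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(query))]

    rel_c, non_c = centroid(relevant), centroid(non_relevant)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, rel_c, non_c)]
    return [max(0.0, w) for w in updated]   # negative term weights are clipped to zero


# Hypothetical three-term vocabulary; the user marked two documents relevant, one not.
new_query = rocchio(
    query=[1.0, 0.0, 0.0],
    relevant=[[1, 1, 0], [1, 1, 0]],
    non_relevant=[[0, 0, 1]],
)
print(new_query)  # [1.75, 0.75, 0.0]
```

Note that the weights are fixed by hand and the update looks only one step back, which is the limitation the paragraph above points out: nothing here is learned from data or optimized for a long-term goal.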

Finding optimal search paths during information seeking is the mission of the later proposed Dynamic Search (DS). It was first introduced in the 2014 tutorial by Yang et al. [144]. Figure 4 shows a rough relationship between DS and the other research domains that we have discussed so far. Dynamic search cares about the optimality of the entire search process rather than the optimality of a single step. The user and search engine traverse an information space, collecting information at each stop. A search task can be completed in multiple different ways, as long as the ultimate goal is met. Thus, we can think of DS as a type of information-seeking solution that provides optimal treatment to multi-step search tasks that are motivated by telic goals.

Fig. 4: Relationship illustration of DS and others.

In this article, we review approaches to dynamic search. Most reviewed works take reinforcement learning as their framework. Reinforcement learning (RL) [120] provides natural solutions to dynamic search. As Luo et al. [79] pointed out, the match between the two stems from their shared conceptions of temporal dependency, a trial-and-error process, and optimization over a long-term goal. Reinforcement learning has achieved great success in robotics, computer vision, natural language processing, and finance. It is a type of learning that originated from the study of animal learning, which concerns how animals set up their responses based on their observation of the surrounding environments so as to maximize their chances of survival and fertility. RL-based methods for dynamic search are the focus of our paper, and we organize the works of DS by their different origins in RL. We show how DS can be done by imitation learning, multi-armed bandits (MAB), value-based RL methods, policy-based RL methods, and model-based RL methods. Early non-optimization-based IIR methods are also included to give a complete review of all past research efforts. Note that DS is still a young field. Most works reviewed here were developed after 2014. We hope that through our review, readers will identify new directions, as there must be plenty of room for this young field to grow and bloom.

2 Reinforcement Learning Background

2.1 The Settings

In reinforcement learning (RL) [120], there is an agent and an environment. The environment represents a dynamic world, with which the agent continuously interacts in order to learn a strategy to survive, succeed, or win (a game). The winning is usually marked by having accumulated adequate rewards, the sum of which is called return. The rewards are feedback given by the environment to the agent based on the agent’s actions. What the agent learns is what actions to take under which situations to maximize its long-term return. The learning target is to find and use the best winning strategy.

Reinforcement learning uses a complex set of symbols to describe its setting. These essential elements are put into a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{M} \rangle$:

  • $\mathcal{S}$: the set of states. A state $s_t \in \mathcal{S}$ is a representation of the environment at time $t$;

  • $\mathcal{A}$: the set of actions, discrete or continuous. An action $a_t \in \mathcal{A}$ is an operation that an agent can choose to change the state at time $t$;

  • $\mathcal{R}$: the immediate reward. $R(s, a)$ denotes the reward signal given by the environment if the agent takes action $a$ at state $s$. Its value, sometimes called reinforcement, is denoted by $r$. It can be a function, too, such that $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$.

  • $\mathcal{M}$: the transition function between states. $M(s' \mid s, a)$ is the probability that the environment goes from state $s$ to next state $s'$ if the agent takes action $a$.

Fig. 5: Reinforcement learning framework, by Sutton and Barto [120].

These elements work together dynamically. At time step $t$, the agent observes the state ($s_t$) from the environment, and takes an action ($a_t$). The action impacts the environment, brings in a reward ($r_t$), and produces a new state ($s_{t+1}$). This process loops until reaching the end of its episode, generating a trajectory $s_0, a_0, r_0, s_1, a_1, r_1$, and so on. Figure 5 [120] illustrates the RL framework.

The optimization goal of RL is to find an optimal strategy to respond to various situations. With this strategy, the expected return from the environment would be maximized. The strategy is called a policy. It is denoted by $\pi$ and is a stochastic function from $\mathcal{S}$ to $\mathcal{A}$, which gives instructions to an agent on how to behave. The optimization can be expressed by the following:

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right]$$

where $\gamma \in [0, 1]$ is a factor discounting future rewards. And the return is

$$G_{t} = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$$

The expected return is thus the expected long-term rewards over the course of the entire episode, such as one complete game or one accomplished task.
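As a small worked illustration of a discounted return, the sketch below sums the rewards of one short, made-up episode, with each reward weighted by a power of the discount factor. The function name `discounted_return` and the reward values are assumptions chosen only for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite episode."""
    g = 0.0
    # Accumulate backwards: the return at a step is its reward plus
    # gamma times the return from the next step onward.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: a small loss first, a large payoff at the end.
rewards = [-1.0, 0.0, 10.0]
print(discounted_return(rewards, gamma=0.5))  # -1 + 0.5*0 + 0.25*10 = 1.5
```

The example also shows why a short-term loss can be acceptable: the first step is penalized, yet the return of the whole episode is positive.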

An RL agent’s learning is characterized by two of its great abilities, adaptation and exploration. All of its learning and optimization are about being the best adaptor to a dynamic environment and able to handle issues emerging in diverse situations. Some situations may have been encountered during the training, and others may be unexpected. This ability to be adaptive allows RL agents to better handle unseen challenges. Being exploratory is another good trait. RL always recognizes the value of being exploratory. From the outset of the field, not being greedy, especially in the short term, has been a consistent practice in RL’s optimization. What RL cares about most is always eventual success.

For the DS approaches that we review in this paper, most share a common characteristic: a search engine takes the role of the RL agent, and the environment is either the ensemble of the user and the document collection, or just the documents alone, depending on whether the user is modeled as another agent. Suppose we use the first setting, in which the user is part of the environment. We could then describe the framework in the following manner. The search engine agent observes the environment’s state ($s_t$) at time $t$, takes actions ($a_t$) to retrieve documents, and shows them to the user. The user provides feedback, $r_t$, which expresses the extent to which the retrieved documents satisfy the information need. The documents retrieved by the search engine agent’s action would probably also change the user’s perceptions about what is relevant, and to what stage the search task has progressed. These changes are expressed as state transitions $s_t \rightarrow s_{t+1}$. The optimization goal is to maximize the expected long-term return for the entire information-seeking task.
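The first setting described above, with the user folded into the environment, can be sketched as a toy interaction loop. Everything in this sketch is hypothetical: the class `SimulatedUserEnvironment`, the relevance values, the state representation (the set of documents already shown), and the placeholder policy are illustrative assumptions, not a method from the reviewed literature.

```python
class SimulatedUserEnvironment:
    """Toy environment: a simulated user plus a small document collection."""

    def __init__(self, relevance, max_steps):
        self.relevance = relevance      # hidden graded relevance per document id
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.shown = set()              # documents the user has already seen
        self.steps = 0
        return frozenset(self.shown)    # initial state

    def step(self, doc_id):
        """Show one document; return (next_state, reward, done)."""
        # Feedback: a document is rewarded only the first time it is shown.
        reward = 0.0 if doc_id in self.shown else self.relevance[doc_id]
        self.shown.add(doc_id)
        self.steps += 1
        done = self.steps >= self.max_steps
        return frozenset(self.shown), reward, done


def run_episode(env, policy, gamma=1.0):
    """Run one search session and return its (discounted) return."""
    state, done, ret, discount = env.reset(), False, 0.0, 1.0
    while not done:
        action = policy(state)                    # search engine picks a document
        state, reward, done = env.step(action)    # user reacts, state changes
        ret += discount * reward
        discount *= gamma
    return ret


relevance = {0: 0.0, 1: 1.0, 2: 3.0, 3: 0.0}      # made-up relevance judgments
env = SimulatedUserEnvironment(relevance, max_steps=3)


def unseen_first_policy(state):
    # Placeholder policy: the lowest-numbered document not yet shown.
    return min(d for d in relevance if d not in state)


print(run_episode(env, unseen_first_policy))      # 0 + 1 + 3 = 4.0
```

A learned policy would replace `unseen_first_policy`; the loop itself stays the same, which is what makes the RL framing convenient for multi-step search.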

A common misconception in RL is that the agent gets trained and produces results at the same time. We suspect that this misconception comes from a misinterpretation of RL’s “learning by interacting” – that the agent could respond to the environment based on what rewards it has been fed in the current episode. However, this is not how RL works. Similar to supervised learning, in fact, RL also needs a training phase that is separate from the testing. Training an RL agent is like training a baby to become an expert. The training requires many episodes, such as playing a game many, many times in order to understand how to win. The training episodes (or trials) might repeat thousands or many more times. From each training episode, the RL agent learns gradually how to adapt and explore, and eventually grows into an expert that is good at the game and ready for a real test. In the testing phase, the mature trained agent then acts to deal with complex situations, with the aim of winning or surviving in the dynamic environment.

2.2 Types of RL Algorithms

Based on how the optimization is achieved, RL algorithms can be categorized into the following general types.

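The first type is value-based approaches. The agent in a value-based RL method would attempt to find policies that land in high-value states. A value function is needed to measure the “value” of a state. This function is defined as the expectation of the return from a given state or a given state-action pair, under a certain policy. A value function has two different forms: state-value function ($V$-function), and action-value function ($Q$-function). They are similar in many ways, and both can be approximated via supervised machine learning, such as deep neural networks. With $G_t$ denoting the return from time $t$ and $\pi$ the policy, the definitions of $V$ and $Q$ are shown below:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_{t} \mid s_{t} = s \right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_{t} \mid s_{t} = s, a_{t} = a \right]$$

They are closely related to each other:

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi}(s, a) \right]$$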

A typical example of a value-based RL method is Deep Q-Network (DQN) [89], which outperforms human players in many Atari video games, such as Pong, Boxing, and Breakout. Earlier works [120] on value iteration, Monte-Carlo methods, and Temporal-Difference (TD) methods, such as Sarsa and Q-learning, also belong to this type.
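As a minimal sketch of a value-based method, the code below runs tabular Q-learning on a hypothetical five-state line world, then evaluates the learned table greedily without further updates, which also illustrates the separation between training and testing. The environment, the hyperparameters, and the function names `train` and `evaluate` are assumptions made for illustration.

```python
import random

N_STATES, GOAL = 5, 4          # states 0..4 on a line; state 4 is terminal
ACTIONS = (0, 1)               # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy transition; reward 1 only when the goal is reached."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def greedy(q, state, rng):
    """Pick an action with the highest Q-value, breaking ties at random."""
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Training phase: many episodes of epsilon-greedy interaction."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):                      # cap on episode length
            explore = rng.random() < epsilon
            action = rng.choice(ACTIONS) if explore else greedy(q, state, rng)
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value
            # (no bootstrapping beyond the terminal state).
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def evaluate(q):
    """Testing phase: act greedily with the learned table, no more updates."""
    rng, state, steps = random.Random(1), 0, 0
    while state != GOAL and steps < 100:
        state, _, _ = step(state, greedy(q, state, rng))
        steps += 1
    return steps


q_table = train()
print(evaluate(q_table))   # number of steps the trained agent takes to reach the goal
```

The epsilon-greedy choice during training is one simple way to keep the agent exploratory; DQN replaces the table with a neural network but keeps the same kind of update target.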

The second type includes the policy-based RL approaches. These methods aim to learn an optimal policy directly from rewards. What makes this possible is the policy gradient theorem [120]. The theorem expresses the gradient of the expected return with respect to the policy parameters in terms of the gradient of the policy itself, weighted by action values; therefore the expected return can be improved by adjusting the policy parameters directly. This can be achieved by standard stochastic gradient methods that ascend the gradient of the expected return. First, these methods construct a parameterized mapping from state to action. For instance, they parameterize the action preference as a linear function over state features, and then wrap this function inside a softmax function to obtain a distribution over the actions. Then, they find the optimal policy by iteratively estimating the policy gradient from sampled returns and updating the policy parameters. Example algorithms include the vanilla REINFORCE [120], Trust-Region Policy Optimization [110], and Proximal Policy Optimization [111]. The last two are among the most widely used policy-gradient algorithms. Policy-based and value-based methods can be combined. The combined methods are called actor-critic methods, and they are highly effective. Examples are A3C [88] and DDPG [69].
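The linear-plus-softmax parameterization mentioned above can be sketched with a vanilla REINFORCE update on a hypothetical one-step task: there are two states with one-hot features, and the reward is 1 when the chosen action index matches the state index. The task, the learning rate, and the function names are assumptions for illustration only.

```python
import math
import random


def softmax(logits):
    m = max(logits)                                # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, features):
    """Action distribution: softmax over linear scores theta[a] . features."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in theta]
    return softmax(logits)


def reinforce(episodes=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_states = n_actions = 2
    theta = [[0.0] * n_states for _ in range(n_actions)]
    for _ in range(episodes):
        s = rng.randrange(n_states)
        phi = [1.0 if i == s else 0.0 for i in range(n_states)]   # one-hot features
        probs = policy(theta, phi)
        a = rng.choices(range(n_actions), weights=probs)[0]       # sample an action
        ret = 1.0 if a == s else 0.0                              # one-step return
        # For a linear softmax policy, the gradient of log pi(a|s) with respect
        # to theta[b] is (1[b == a] - pi(b|s)) * phi.
        for b in range(n_actions):
            coeff = (1.0 if b == a else 0.0) - probs[b]
            for i in range(n_states):
                theta[b][i] += lr * ret * coeff * phi[i]
    return theta


theta = reinforce()
for s in range(2):
    phi = [1.0 if i == s else 0.0 for i in range(2)]
    print(s, [round(p, 3) for p in policy(theta, phi)])
```

The update moves probability mass toward actions that were followed by a high return; no value function is learned, which is what distinguishes this sketch from the value-based type.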

Neither value-based nor policy-based methods utilize the state transition in the environment. Therefore, they are called model-free methods. When we make use of the knowledge carried by the transition model $M$, we encounter the third type, model-based RL methods. “Model” is an over-used word in machine learning. Yet in model-based RL, the word “model” specifically refers to the transition model $M$ (it could also mean both $M$ and the reward function $R$). The model can be provided or learned. When learning a model, the learning can happen before, through, or alternating with, policy learning. The first case can be found in early works, such as optimal control and planning. Monte-Carlo Tree Search (MCTS) is an example of planning. It was used in combination with actor-critic approaches in AlphaGo [117], which beat human Go champions in 2016 and became a glorious chapter in the history of AI research. The model can also be learned alternately with the policy, and Dyna [120] is a classic of this type. Lastly, the model can be learned by calculating a gradient through both the model and the policy, as in PILCO [36] and Guided Policy Search (GPS) [64]. Due to this separation of model learning and policy learning, model-based RL methods allow potential injection of domain knowledge and human expertise into reinforcement learning, suggesting promise when RL meets supervised ML. Much new research is under development along this direction.
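To illustrate the case where the model is learned before the policy, the sketch below estimates a transition model and a reward function from random interaction counts, and then plans on the learned model with value iteration. This is a generic illustration and not an implementation of MCTS, Dyna, PILCO, or GPS. The toy line world, its 80% success rate, and all function names are assumptions.

```python
import random
from collections import defaultdict

N_STATES, GOAL, ACTIONS = 4, 3, (0, 1)   # toy line world; 0 = left, 1 = right


def true_step(state, action, rng):
    """Hidden environment dynamics: the intended move succeeds 80% of the time."""
    move = action if rng.random() < 0.8 else 1 - action
    nxt = max(0, state - 1) if move == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def learn_model(samples=5000, seed=0):
    """Estimate transition probabilities and mean rewards from random interaction."""
    rng = random.Random(seed)
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for _ in range(samples):
        s, a = rng.randrange(GOAL), rng.choice(ACTIONS)   # non-terminal states only
        nxt, r = true_step(s, a, rng)
        counts[(s, a)][nxt] += 1
        reward_sum[(s, a)] += r
    model, reward = {}, {}
    for sa, nxt_counts in counts.items():
        n = sum(nxt_counts.values())
        model[sa] = {nxt: c / n for nxt, c in nxt_counts.items()}
        reward[sa] = reward_sum[sa] / n
    return model, reward


def plan(model, reward, gamma=0.9, sweeps=100):
    """Value iteration on the learned model; returns a greedy action per state."""
    v = [0.0] * N_STATES                                   # the goal's value stays 0

    def q(s, a):
        return reward[(s, a)] + gamma * sum(p * v[n] for n, p in model[(s, a)].items())

    for _ in range(sweeps):
        for s in range(GOAL):
            v[s] = max(q(s, a) for a in ACTIONS)
    return [max(ACTIONS, key=lambda a: q(s, a)) for s in range(GOAL)]


model, reward = learn_model()
print(plan(model, reward))   # greedy action for each non-terminal state
```

Because the model is a separate object, it could in principle be replaced or corrected by hand, which is one way to read the remark above about injecting domain knowledge.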

2.3 Emphasis on Success

                     | Being correct               | Being successful       | Being entertaining
Research Problems    | Regression                  | Information Seeking    | Exploratory Search
                     | Classification              | Dynamic Search         | Dialogue Systems
                     | Question Answering          | Ad-hoc Retrieval       | Conversational AI
Suitable ML Methods  | Supervised Machine Learning | Reinforcement Learning | RL?

TABLE I: Correctness vs. Success vs. Entertainingness.

Reinforcement learning is a type of machine learning that emphasizes success. It is rooted in animal learning, which studies how animals learn their survival skills and choose their behaviors among all other options. The agent’s goal is to find the optimal strategy (policy) that would yield the maximum long-term expected return. This could be summarized in one word: all RL cares about is to succeed. This emphasis spans a long range of steps, which means that in the short term, RL would tolerate some temporary loss, or even strategically choose to lose at some point. Also, there could be multiple ways to be successful, so the agent is flexible in choosing which winning trajectory to take. What it pursues is a definite win in the end, in which the return is maximized, the task is successfully accomplished, and the game is won.

This emphasis on success makes RL quite distinct from supervised machine learning (SML) that might be more familiar to ML/AI practitioners. SML concerns whether predicted labels would be correct. SML’s focus on getting the correct predictions makes it suitable for problem domains that have standard answers with which to compare. Examples include regression and classification. Question answering is also a domain that focuses on correctness.

Being correct and being successful are different. The former concerns whether a single decision is correct or not, while the latter concerns an entire process and whether it would eventually succeed, even if some of its steps contain incorrect decisions. Table I classifies a few common research problems based on whether they work more for “being correct” or “being successful.” The table itself is a classification by author-crafted decision rules. It puts regression, classification, and question answering under the category of “being correct” and suggests that the most suitable solutions for them are supervised learning methods.

The table puts information seeking and dynamic search in a different column. IS and DS both concern the end result of a process and if it would be successful. The aim of their search systems is to successfully satisfy a user’s information need. In them, the search tasks bear telic goals, or we say they are means-to-ends, as shown in Figure 4 and discussed in the introduction. We think RL is the most suitable category of solutions for them given their emphasis on interaction and ultimate success.

Ad-hoc retrieval is a bit special, and we could be a little controversial here. At first glance, ad-hoc retrieval seems to be a problem of “being correct.” Is it really? As a discipline, IR’s top priority is to find documents that are “relevant.” What does “being relevant” mean? We think it has at least two meanings. First of all, “being relevant” does not equal “being correct.” In many cases, contrary to intuition, users would prefer an incorrect but definite document in order to quickly rule out bad options, and reach the final search success sooner. They might even prefer getting no matches at all. These search results are still “relevant,” but definitely not “correct.” Second, “being relevant” is frequently mentioned together with another expression, “satisfying the needs,” which means that the needs would have to be successfully met. Considering these two perspectives, we think “being relevant” actually means “being successful.” The whole IS process aims to be successful, and ad-hoc retrieval is just an intermediate step that leads to eventual success. Only because retrieval is a single step, singled out from the other parts of the entire process, does it appear to be a task that cares about “being correct”; in fact it is not. In this sense, even though supervised learning-to-rank methods are currently prevailing for ad-hoc retrieval, they might not be the most suitable solutions. Table I puts ad-hoc retrieval in the same column as IS and DS.

Another type of research problem is “being entertained,” which is closely linked to “engagement” and “enjoyment”. Under this category, there are exploratory search, dialogue systems, and most conversational AI agents. In them, a user would be concerned mostly with how much they can explore, be entertained and engaged, and have fun. This is a completely different optimization goal and should not be confused with “being correct” and “being successful.” One may argue that an algorithm could be optimized to find “the correct model that entertains the user” or “the model that most successfully engages the user”. However, then the algorithm must handle the exploration requirements and emotional joy as parts of its optimization goal. For human beings, we believe this is as important as being correct and being successful. We list them in a third category in Table I and think maybe RL is more suitable to provide their solutions, given its exploration ability.

3 Related Work

In this section, we review the related disciplines one by one and eventually derive the relationships among them and dynamic search. They include information seeking (IS), exploratory search (ES), dialogue systems (including conversational search), and online learning-to-rank. Interactive information retrieval (IIR) is considered as an early DS method, so we discuss it in another section. These research problems both overlap with and differ from DS.

3.1 Information Seeking

Information seeking is the process where “people purposefully engage to change their state of knowledge” [82]. In the old days, most work in IS was about how to ask a librarian to find the information that you were looking for, or as a librarian, how to serve people. Things have changed since the rapid spread of search engines, digital libraries, electronic catalogues, and social media. Therefore, the more modern sense of IS has become how to make use of the digital tools to best assist human information seekers.

Fig. 6: Information Seeking Factors, from Marchionini [82].

An information seeking process is one in which an information seeker works with a search system in a certain domain, attempting to accomplish a task in a given setting; the process produces its outcome as the foraged information [81]. Factors in an information seeking process include an information seeker, task, search system, domain, setting, and outcome [82] (Figure 6). Here, the information seeker is a human user who actively changes his or her state of knowledge. The task manifests the information need and is the driver of the entire process. Usually it belongs to just one domain, which is the body of knowledge concentrated on a specified field, e.g. chemistry or biology. The task is also called the “problem context” by White et al. [133], which drives the search activities. It describes the gap between what the user knows and what the user wants to know. The search system represents the knowledge stored in a pre-built database or index, which contains knowledge that is potentially available in the form of books, libraries, or online resources, and provides access to the user via a user interface. Settings constrain the IS process; they can be physical, such as time constraints and hardware availability, or psychological, such as the user’s confidence or mood. The outcome is what is eventually obtained, which includes both the product, i.e., the solution to the user’s problem, and the process and all episodes that it has experienced.

White et al. [133] describe the dynamics in information seeking: Initially, the user is unclear about the solution to the problem, and even unsure about the problem itself. As the search goes on, the user lowers uncertainty, achieves a better understanding of the problem, and constructs the solution by foraging the information obtained from external resources. Hopefully the user would close the gap eventually.

The human-side efforts involved in information seeking consist of multiple rounds of querying and examining returned items. In this process, much user effort is spent on comparing and investigating search results. Marchionini [83] classified a user’s search activities into three overlapping types, i.e. looking-up, learning, and investigation. Look-ups are the most basic search operations; typical examples include web search, question answering, and fact retrieval. Learning and investigation are the main activities of the search process. Learning searches are the searches where the user develops new knowledge by reading and comparing the raw material. Instances that fall into this category include developing certain professional skills. Investigation involves critical assessment of the search results before integrating them into existing knowledge bases, and cares more about recall than precision. Professional searches, such as legal search or patent search, fall into this category.

The system side of IS is not as well studied as the user side. Most current system-related research is on developing search engines and friendly user interfaces; other aspects of the system side remain underdeveloped. Dynamic search is interested in the system side of information seeking. Our focus is on how to mathematically and statistically handle the dynamics of the IS process, and how a search system could better assist the user.

3.2 Exploratory Search

Fig. 7: An illustration of exploratory search process, by Bates [13].

Exploratory search is a type of open-ended information seeking. It does not require a telic goal. Instead, its search activities are more triggered by curiosity and desire to understand and to learn. It often happens when users face an unfamiliar domain.

Many ES processes create knowledge-intensive products, for instance, writing a research paper or understanding the housing market. ES typically involves high-level intellectual activities such as synthesis and evaluation.

The process of ES demonstrates different features from telic IS tasks. Researchers have described the ES process in various metaphors. White and Roth [134] called it “way-finding.” “Way-finding” emphasizes the prerequisite that the user needs to conceptualize the sought-after information first before navigating to the desired part.

Bates [13] gave it a different metaphor, “berry-picking.” “Berry-picking” refers to evolution of an information need, where a user starts with a vague perception of an information need and traverses the document collection, collecting pieces of information along the way. Newly-collected fragments may change the information need and the user’s behavior for the next round of “picking.” Fig. 7 shows the ES process with the berry-picking metaphor.

Another feature of ES is that its users tend to spend more time in searching. Even after they have gathered all the information fragments they need, they may continue searching, because the process is motivated by learning and understanding. Hence they tend to keep learning and validating the information from different sources. This is distinct from DS, in that after the goal is met, the search would be considered complete, and end.

3.3 Dialogue Systems

Fig. 8: Conversational Search System Architecture, by Chen et al. [29].

Dialogue systems, also known as chatbots, also interact with human users [52, 40]. They differ from DS in the medium of interaction: DS uses documents to communicate with users, whereas chatbots use short machine-generated natural language utterances.

Figure 8 shows a common pipeline in dialogue systems. In this pipeline, first, human utterances go through a natural language understanding (NLU) component to get parsed and interpreted by the chatbot. The chatbot then tracks their dialogue states by techniques such as slot filling. This part could be thought of as parallel to transition models in RL. Then, a “policy learning” module learns from training data to decide the chatbot’s dialogue act, based on the dialogue state. This can be done through expert-crafted rules [132], supervised machine learning [112], or reinforcement learning [67]. Finally, a natural language generation (NLG) component generates the chatbot’s responsive utterances from the dialogue acts determined by the policy. This is another difference between chatbots and DS. In DS, no NLG is expected, because users would access the retrieved content by viewing rather than hearing.

Most chatbots assume a highly-structured formality in their tasks. For example, to successfully make a doctor’s appointment, a chatbot would need to know the when, where, and who of the visit; to find a target smart phone, the chatbot may ask the brand, price, color, and storage capacity. Because of this high task formality, slot-filling appears to be adequate for the tasks. In DS, the domains of the information-seeking tasks are much less structured, and slot-filling would not be as effective as it is for chatbots.

Modern dialogue systems are trained end-to-end with supervised machine learning methods, especially with deep neural networks. Based on the way in which the dialogue system generates/selects its response, those systems can be categorized as retrieval-based chatbots or generation-based chatbots. Retrieval-based chatbots select the response from a pre-built candidate pool, choosing the one that suits the context best with text-matching techniques [145, 16, 59, 136, 138, 154, 152, 97]. Generation-based chatbots generate the response by the system with natural language generation techniques instead of using pre-built ones [66, 115, 119, 124, 126, 95]. Generative approaches appear to be more flexible, but also increase the complexity of the problem.

When chatbots are mainly used for information seeking, they are sometimes called by another name, conversational search. Radlinski and Craswell [99] proposed a general framework for conversational search, without assuming high task formality. In this sense, it is closer to DS. They define conversational search as a mixed-initiative communication running back-and-forth between a user and a dialogue agent. When generating conversations, the agent’s focus is more on eliciting user preferences and identifying their search targets. The work was a theoretical framework that covered a wide range of research domains, including dialogue systems, recommender systems, and commercial search engines.

3.4 Online Learning-to-Rank

Learning-to-rank (L2R) [73] is a family of ad-hoc retrieval methods that build retrieval models by supervised machine learning. Online learning-to-rank (online L2R) [41] is the online learning version of it. It collects training data on-the-fly and trains the model. The newly-collected training data update the search engine model so that its search effectiveness in subsequent search iterations will be better.

Online learning-to-rank is also a dynamic system, but it is quite different from DS. First, it is not specifically designed for complex search tasks. It aims to solve general-purpose ad-hoc retrieval tasks, for instance, finding the nominations for the 91st Academy Awards. Second, it accumulates training labels from large amounts of live user clicks and other implicit feedback data to figure out personal preferences, trends, and so on, and to create more personalized search engine models for each user. It is more like a recommender system in this sense. Third, its training is not separated by search episodes. In DS, we have an information-seeking task, and search episodes that mark the beginning and end of the task; in online learning-to-rank, learning continues across sessions and tasks.

3.5 A Venn Diagram of All

Fig. 9: Venn Diagram of Dynamic Search and its Related Disciplines.

Figure 9 depicts our understanding of the scope of these related disciplines. Dynamic search stands at the intersection of information seeking and reinforcement learning. Information seeking goes beyond the query-document-matching paradigm in ad-hoc retrieval, and we think it is a larger concept than IR. Moreover, dynamic search and reinforcement learning are part of artificial intelligence. In our opinion, DS is a type of human-centered AI, whose mission will always be servicing human users.

3.6 Miscellaneous

Apart from building retrieval models, other aspects of IS and IR have also been investigated thoroughly. These include search log (also known as query log) analysis, user interface design, and collaborative search.

Identifying Search tasks from query logs: Modern commercial search engines record user activities and keep search logs that contain rich interaction signals such as queries, dwelling time, and clicks. However, boundaries between search sessions, which overlap with individual information-seeking tasks, are not explicitly marked in the logs. Much research has been devoted to segmenting search sessions or identifying search tasks from the logs. Jones and Klinkner [57] discovered that it was of limited utility to claim a separate session after a fixed inactivity period, since many sessions could be interleaved and hierarchically organized. Instead, they recognized individual search tasks via supervised learning, with temporal and word-editing features. Wang et al. [129] worked on long-term search task recognition, with a semi-supervised learning method that exploited inter-query dependencies derived from behavioral data. Work by Lucchese et al. [75] utilized Wiktionary and Wikipedia to detect query pairs that are related semantically but not lexically. However, some still found that a properly set inactivity threshold, for instance 30 minutes, would work adequately to separate sessions, and this method is applied in a wide range of online applications [44].
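The inactivity-threshold heuristic mentioned above can be sketched in a few lines. This is only a minimal illustration, assuming a log given as time-sorted (timestamp in seconds, query) pairs; the function name `segment_sessions` and the log layout are hypothetical, and the 30-minute default simply mirrors the threshold quoted above.

```python
from typing import List, Tuple

LogEntry = Tuple[float, str]  # (timestamp in seconds, query text); hypothetical layout


def segment_sessions(log: List[LogEntry],
                     gap_seconds: float = 30 * 60) -> List[List[LogEntry]]:
    """Split a time-sorted query log into sessions at long inactivity gaps."""
    sessions: List[List[LogEntry]] = []
    for entry in log:
        # Open a new session for the first entry, or when the time since
        # the previous query exceeds the inactivity threshold.
        if not sessions or entry[0] - sessions[-1][-1][0] > gap_seconds:
            sessions.append([entry])
        else:
            sessions[-1].append(entry)
    return sessions
```

A fixed threshold of this kind cannot separate interleaved tasks, which is the limitation that the learning-based methods above try to address.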

User Interface for DS: Since DS handles more complex search interactions, research has been done to design more effective User Interfaces (UI) for DS systems. Sarrafzadeh and Lank [109] added a hierarchical knowledge network into a search engine UI, which can visualize related concepts in the documents. Qvarfordt et al. [98] designed a query preview control, which can display overlap between newly retrieved documents and previously retrieved ones before the search system runs the query. Hecht et al. [47] showed a visualization of query concepts with thematic cartography. Ruotsalo et al. [106] proposed Intent Radar, a system that could exhibit and organize user intents in a radial layout. Bozzon et al. [19] proposed a Liquid Query interface, which would enable users to explore information across several domains and gradually come closer to the sought search targets.

Collaborative DS: The exploration of an unknown information space can also be done by a team of users instead of an individual user. Pickens et al. [96] proposed a mediation framework which would allow several users to work on the same search task and communicate with each other. Tani et al. [122] used an author-topic model to identify multiple users’ interests and retrieve relevant documents for them. Sometimes, collaboration can produce better search results than the sum of those produced by each information seeker. This was called the synergic effect by Shah and González-Ibáñez [113].

4 Approaches to Dynamic Search

We survey approaches to dynamic search and organize them into several categories, including early IIR methods, imitation learning, active learning, value-based RL, policy-based RL, model-based RL, and multi-armed bandits.

We must point out that much research here was originally developed by IS and IR researchers. Yet they are re-classified by us into categories based on RL. We re-examine them from our point of view and explain their nature in the language of RL. Due to the differences between the IS/IR community and the RL community, our view may disagree with that of the usual audience of these works.

4.1 Early Approaches

Fig. 10: Information Retrieval, adapted from Buckland and Plaunt [23].

IIR offered early solutions to DS. These early approaches do not bear an explicit optimization goal, so strictly speaking, they may not be counted as DS. But we think they made successful attempts to tackle a very similar research challenge, and their approaches provided ideas for deriving DS policies. These IIR approaches could be quite efficient, and probably sufficient for small-scale search systems, especially in early stages of an information-seeking process. At the later stages, however, errors carried from prior search iterations would very quickly add up and yield poor search results.

Relevance feedback (RF) [61, 102] is perhaps the most well-known IIR method. The intuition behind RF is that after observing the results retrieved by a search engine agent, the user provides feedback to indicate which documents are relevant and which are not, and the search engine then modifies either the query space or the document space to bias its next retrieval towards where the feedback prefers. Figure 10 shows where this modification would appear in an information retrieval pipeline.

One representative RF method is the Rocchio algorithm [104]. Based on a set of user feedback that indicates positive and negative relevance for documents, it modifies the query representation vector, from $\vec{q}$ to $\vec{q}'$, to move it closer to where the feedback prefers in the query representation space:

$$\vec{q}' = \alpha \vec{q} + \frac{\beta}{|D^{+}|} \sum_{\vec{d} \in D^{+}} \vec{d} - \frac{\gamma}{|D^{-}|} \sum_{\vec{d} \in D^{-}} \vec{d}$$

where $\vec{d}$ and $\vec{q}$ are vector representations, most likely bag-of-words (BoW) representations, of a document or a query, $\vec{q}$ is the initial query, and $\vec{q}'$ is the modified query. The user feedback is encapsulated in $D^{+}$, the relevant documents, and $D^{-}$, the irrelevant ones; $\alpha$, $\beta$, and $\gamma$ are coefficients.

Basically, the intuition is to close the gap between the query and the information need so that the latter can be matched with documents retrieved by the newly modified query. It is a good heuristic, but it does not have an optimization goal, and thus cannot plan any long search trajectory. Its modification of the query might be short-sighted and result in disorientation in the long run.
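A Rocchio-style update can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulation of any particular system: queries and documents are sparse term-to-weight dictionaries, the default coefficient values are merely illustrative, and dropping non-positive weights at the end is a common convention that we assume here rather than part of the rule itself. The name `rocchio` is hypothetical.

```python
from typing import Dict, List

Vector = Dict[str, float]  # sparse bag-of-words vector: term -> weight


def rocchio(query: Vector, relevant: List[Vector], irrelevant: List[Vector],
            alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> Vector:
    """Move the query toward relevant documents and away from irrelevant ones."""
    # Start from the scaled initial query.
    new_query = {term: alpha * weight for term, weight in query.items()}
    # Add the centroid of relevant documents, subtract that of irrelevant ones.
    for docs, coeff in ((relevant, beta), (irrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] = new_query.get(term, 0.0) + coeff * weight / len(docs)
    # Assumption: terms whose weight is not positive are dropped.
    return {term: weight for term, weight in new_query.items() if weight > 0}
```

For example, with the query `{"jaguar": 1.0}`, one relevant document about the animal, and one irrelevant document about the car, the modified query gains weight on animal-related terms and carries no weight on car-related ones.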

Relevance feedback models retrieve documents based on directly calculated document-to-query relevance scores. Direct uses of RF in DS can be found in [54, 63, 114]. Jiang and He [54] combined the current search query, past search queries, and clicked documents to update the document-to-query relevance score and demote duplicated documents. Levine et al. [63] exploited query-term changes (the difference between consecutive queries $q_{t-1}$ and $q_t$), which reflect the state transitions, to adjust term weights for both queries and documents based on whether the query term is added, retained, or removed. Shalaby and Zadrozny [114] re-ranked documents in a neural learning-to-rank model by extending its word2vec [86] representation with words and documents in the positive relevance feedback.

Other IIR methods do not directly measure document relevance; instead, they focus on modifying the queries and then feeding them into an off-the-shelf search engine, which is usually powered by ad-hoc retrieval algorithms such as VSM, BM25, or L2R, to retrieve documents using the new query. There are three ways to modify the queries: (1) query expansion, (2) query re-weighting, and (3) query suggestion.

Query expansion beyond the original query is done in various ways. Bah et al. [11] expanded queries based on query suggestions from a commercial search engine. Liu et al. [74] augmented the current query with all past queries within the same search session, and put heavier weights on the most recent queries. Yuan et al. [148], Wang et al. [131] and Adeyanju et al. [1] also broadened the queries based on past queries in the same session; and Yuan et al. [148] also excluded terms that were estimated to not be able to improve search effectiveness. Albakour and Kruschwitz [6] added anchor text to expand the original query; anchor text has been shown to be an effective representation of what human creators think about a web document. Likewise, queries can also be expanded with snippets [43], clicked documents [25], proximity phrases, or pseudo-relevance feedback [60].
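Session-based expansion with heavier weights on recent queries can be sketched as below. This is a rough illustration only: the geometric decay, the whitespace tokenization, and the name `expand_query` are our assumptions and are not taken from the cited works.

```python
from typing import Dict, List


def expand_query(session_queries: List[str], decay: float = 0.5) -> Dict[str, float]:
    """Merge all queries of a session into one weighted bag of terms.

    `session_queries` is ordered from oldest to newest; the last entry is the
    current query and receives weight 1, the one before it `decay`, and so on.
    """
    weights: Dict[str, float] = {}
    newest = len(session_queries) - 1
    for position, query in enumerate(session_queries):
        query_weight = decay ** (newest - position)
        for term in query.lower().split():
            weights[term] = weights.get(term, 0.0) + query_weight
    return weights
```

The resulting term weights could then be passed to any retrieval function that accepts weighted query terms.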

Query re-weighting is similar to what we have seen in Rocchio. Zhang et al. [150] manually defined rules to re-weight query terms based on user attention time and user clicks. Liu et al. [71] classified search sessions into four types based on whether their search goal is specific or amorphous, and whether the expected outcome is factual or intellectual. Each type of search session corresponds to a unique set of pre-defined parameters for the re-weighting of terms.

Query suggestion is done based on query-log analysis. When query logs from other users are available, they can be used to suggest new queries to help the current user and her search task. The assumption here is that similar search tasks would use similar queries even if they are issued by different users. These methods first recognize patterns in how different queries are constructed by large numbers of users, and then make use of this knowledge to create new queries. Click graphs are widely used to generate query suggestions. A click graph (Craswell and Szummer [32]) is a bipartite graph between queries and clicked documents, where an edge between a query and a document indicates that a user has clicked on the document for the query, and the weight on each edge can be estimated via random walk. For instance, Feild [38] and He et al. [46] use a click graph to induce new queries, and Ozertem et al. [94] build a set of suggested queries aiming to maximize overall query utilities.
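To make the click-graph idea concrete, the sketch below ranks candidate suggestions by the probability of reaching them with a two-step random walk (query to document, then document to query). It is a simplified illustration, assuming a log of (query, clicked document) pairs; the function name `suggest_queries` and the two-step walk are our assumptions, not the exact procedure of the cited works.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def suggest_queries(clicks: List[Tuple[str, str]], query: str, top_k: int = 3) -> List[str]:
    """Suggest queries that share clicked documents with `query`."""
    # Build the bipartite click graph with click counts as edge weights.
    query_to_docs: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    doc_to_queries: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for q, d in clicks:
        query_to_docs[q][d] += 1
        doc_to_queries[d][q] += 1

    # Two-step walk: query -> clicked document -> other query.
    scores: Dict[str, float] = defaultdict(float)
    out_clicks = sum(query_to_docs[query].values())
    for doc, count in query_to_docs[query].items():
        doc_clicks = sum(doc_to_queries[doc].values())
        for other, other_count in doc_to_queries[doc].items():
            if other != query:
                scores[other] += (count / out_clicks) * (other_count / doc_clicks)

    # Highest walk probability first; ties are broken alphabetically.
    return sorted(scores, key=lambda q: (-scores[q], q))[:top_k]
```

A query that was never seen in the log has no outgoing edges, so the function returns an empty list for it.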

Relevance feedback models are also used in combination with active learning. Active learning is a type of supervised learning method where a learning agent asks a human, from time to time, to provide annotations (feedback) on its predictions. The main challenge for active learning is to decide which data samples to select for manual labeling. This selection needs to be intelligent, because human labeling is costly. A common practice is to select the samples that fall on or near the decision boundary, for their high ambiguity. Tian and Lease [123] propose using active learning for IIR. They use a support vector machine (SVM) to score the documents’ relevance. At each search iteration, documents that are maximally uncertain, i.e., documents that are the closest to the hyper-plane of the SVM, are selected and shown to the user, to ask for feedback about their relevance. Liu et al. [70] predict document usefulness via a decision tree (DT), which is grown by recursively choosing the most indicative user behaviour features, such as length of dwelling time or number of visits, from the relevance feedback. The decision tree is then used to estimate the usefulness of documents and modify the query space.
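The uncertainty-based selection step can be sketched as follows. The snippet assumes that a linear relevance classifier (for example, a linear SVM) has already been trained, so that its weight vector and bias are given; the function name `most_uncertain` and the dense feature lists are hypothetical. Documents are ranked by the absolute value of the decision function, which orders them by distance to the hyper-plane because the weight norm is the same for every document.

```python
from typing import List, Sequence


def most_uncertain(docs: Sequence[Sequence[float]], weights: Sequence[float],
                   bias: float, k: int = 1) -> List[int]:
    """Return indices of the k documents closest to the decision boundary."""
    def margin(features: Sequence[float]) -> float:
        # |w . x + b|: small values mean the classifier is unsure.
        return abs(sum(w * x for w, x in zip(weights, features)) + bias)

    ranked = sorted(range(len(docs)), key=lambda i: margin(docs[i]))
    return ranked[:k]
```

The selected documents would be shown to the user for labeling, and the classifier retrained with the new labels before the next iteration.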

It is worth noting that active learning approaches would interrupt the natural communication flow between the user and the search engine. The search engine agent is much more ‘active’ in this case, and the turns between the two are led by it. This is a different and more artificial setting than that of IS and DS. Information need is also not modeled in active learning IIR approaches; instead, they attempt to classify documents into “relevant” vs. “irrelevant” classes, which means that they do not adapt to changing states in information-seeking processes, regardless of how “adaptive” or “interactive” they seem to be.

Other IIR research has reached out to interesting fields to borrow ideas. For instance, Azzopardi [9] uses economics models for IIR. The work modeled information-seeking processes as outputting value (relevance) by spending some costs, namely inputting queries and feedback assessments. Applying microeconomics theory, the author identifies search strategies that minimize the cost, that is, the number of queries and assessments, of a search process. Follow-up development along this line explains several user search behaviors with the economics models [10]. Albakour et al. [5] study IIR with a model inspired by the human immune system [91]. The model represents a user’s profile with a network, where each node is a word, and each link between two nodes represents their affinity. In the network, terms are ordered by decreasing weights, and form a hierarchy. The estimation of a document’s relevance is conducted through an activation spread process: When estimating document relevance, terms (nodes) matching the information need are activated. Their higher-ranked neighbors are also activated by receiving parts of the weights passed down from those terms. These higher-ranked neighbors can continue the dissemination of weights to their own neighbors. The whole activation process runs until no more nodes can be reached, and the term weights after the spread can be used to calculate document relevance scores. The newly found documents change the topology of this network as new words (nodes) are added into it and links among terms are updated, which changes the activation spread process and document relevance estimation.

4.2 Imitation Learning

A dynamic search process consists of a search trajectory formed by a sequence of actions. How best to choose actions to formulate queries, retrieve documents, and approach an eventual success could be learned from how humans act. In fact, from the outset, supervised learning approaches have been studied for DS. The research problem is formulated as learning from human decision-making and then making predictions for future decisions. To do so, we can record the history of past action sequences of humans, and learn from successfully accomplished processes.

Fig. 11: Imitation Learning, from Ross [105]. A dataset is collected in the form of state-action pairs $(s, a^*)$ from expert demonstration. A supervised learning algorithm then learns the mapping from $s$ to $a^*$. The trained supervised learning model is used as the policy.

These supervised methods for sequential decision-making are called imitation learning [93], also known as learning by demonstration or programming by examples. Imitation learning is widely used in applications that need to make sequential decisions, such as self-driving cars. Figure 11 illustrates the workflow of imitation learning. The algorithm collects training data, usually as state-action pairs $(s, a^*)$, where $a^*$ is an action demonstrated by a human expert. Then, the algorithm fits a function $a^* = f(s)$, and this function is used as policy $\pi$. The policy is learned in a supervised fashion by minimizing the error between estimated actions and actions that have been performed by the human.
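The simplest instance of this workflow is a lookup-table learner, sketched below. It is deliberately minimal and assumes discrete, hashable states: for every demonstrated state it keeps the action the expert chose most often, which minimizes the 0-1 error on the demonstrations. Practical systems would replace the table with a classifier or a sequence model; the name `fit_policy` and the example state and action labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple


def fit_policy(demos: Iterable[Tuple[Hashable, str]]) -> Callable[[Hashable], Optional[str]]:
    """Fit a tabular policy from expert (state, action) demonstrations."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    # For each state, keep the action demonstrated most often.
    table = {state: c.most_common(1)[0][0] for state, c in counts.items()}

    def policy(state: Hashable) -> Optional[str]:
        # States never demonstrated have no prediction.
        return table.get(state)

    return policy


demonstrations = [
    ("few_results", "add_term"),
    ("few_results", "add_term"),
    ("few_results", "remove_term"),
    ("many_results", "remove_term"),
]
policy = fit_policy(demonstrations)
```

The returned function plays the role of the learned policy: `policy("few_results")` reproduces the majority action of the expert in that state.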

In the context of DS, some exciting news is that there are large numbers of human demonstrations collected and stored in the form of query logs. Most commercial search engines have plenty of such query logs, hence imitation learning is popular due to the availability of this type of training data [147, 101, 125, 87]. There have also been academic efforts, such as [137, 49], to study an offline version of DS in the name of session search. From 2010 to 2014, the TREC Session Tracks [26] organized research teams to develop search engines that learn from query logs, which contain queries, ranked documents, time spent, and clicks, to improve the current (last) iteration of search. Participating systems have developed imitation learning methods in which a human user demonstrates which documents are relevant and how to reformulate queries in the logs. The search engine agent learns from these demonstrations how to recognize relevant documents and reformulate queries.

Supervised learning methods that can handle a sequence of inputs are popular in imitation learning for DS; the Sequence-to-Sequence (Seq2Seq) model is an example. It is a recurrent neural network that encodes a sequence in one domain and decodes it in another domain. It is used in DS to predict sequences of clicks, query reformulations, and query auto-completions (QAC). For instance, Borisov et al. [17] predict clicks during DS processes on the Yandex Relevance Prediction dataset (https://academy.yandex.ru/events/data_analysis/relpred2011/) with a Seq2Seq deep neural network. Its input sequence is a sequence of queries and retrieved items, and its output sequence indicates whether the items are clicked or not. The model was improved by getting continuous human demonstrations via interaction. Ren et al. [101] employed a Seq2Seq model with two layers of attention networks [12] to generate new queries based on previous queries and search results. Here, the state consisted of previous queries and other user behavior data in previous interactions. Different sessions from many users were used as demonstrations to derive the agent’s action, which was to reformulate the next query. Dehghani et al. [35] proposed a customized Seq2Seq model for session-based query auto-completion and suggestion. On top of the basic Seq2Seq method, they incorporated a copying mechanism in the decoder, which enables the model to copy query terms from the session’s context instead of generating all the query terms on its own.

Other supervised sequence learning methods have also been explored. Earlier works include estimating document relevance with logistic regression by Huurnink et al. [49] and SVMRank by Xue et al. [137]. Mitra [87] used a convolutional latent semantic model [116] to explore whether a distributed representation of queries would help query auto-completion task. For example, , where is the embedding vector of a given word. Yang et al. [147] used a deep self-attention network [125] for query reformulation. They modeled a two-round interaction between user and search engine in an E-commerce website and formulated new queries based on words presented in the previous queries.

4.3 Value-based RL Approaches

Value-based RL approaches aim to arrive at high-value states. It could be one super high-value state or many reasonably high-value states. Since it is important to know the values, obtaining good approximations of the value functions via interaction with the environment is the focus of these methods.

There are two types of value functions, the state-value function $V(s)$ and the action-value function $Q(s, a)$. Both can be learned (approximated) by supervised machine learning methods. One way of obtaining their approximations is to use Monte Carlo methods. These methods sample a batch of trajectories, each from the beginning to the end of an episode, and then use these samples to calculate the sample average return for each state or action. These sample returns are used as approximations of the value functions. For instance,

$$Q(s, a) \approx \frac{1}{N} \sum_{i=1}^{N} G_t^{(i)}$$

where $G_t^{(i)}$ is the return of the $t$-th step of the $i$-th trajectory, where the agent receives state $s$ and takes action $a$ at step $t$, and $N$ is the number of such sampled trajectories.
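The sample-average estimate can be sketched as follows. The sketch assumes that each episode is a list of (state, action, reward) triples, where the reward is the one received after taking the action, and it uses the every-visit variant, in which each occurrence of a state-action pair contributes one return. The discount factor `gamma` and the function name `mc_action_values` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Step = Tuple[Hashable, Hashable, float]  # (state, action, reward)


def mc_action_values(episodes: List[List[Step]],
                     gamma: float = 1.0) -> Dict[Tuple[Hashable, Hashable], float]:
    """Every-visit Monte Carlo estimate of Q(s, a) as the average return."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so that the return of step t is r_t + gamma * G_{t+1}.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            returns[(state, action)].append(g)
    return {pair: sum(gs) / len(gs) for pair, gs in returns.items()}
```

Because complete returns are needed, the estimate can only be updated after an episode has ended.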

In 2013, Jin et al. [55] proposed a Monte Carlo approach for multi-page results ranking. Multi-page ranking is a special type of DS. For web search engines, because their datasets are gigantic, returned results are usually displayed through multiple Search Engine Results Pages (SERPs). When a user clicks the ‘go to next page’ button to view the continued results listed on the next SERP, the DS algorithm captures this special user action and re-ranks the results on the next page before the user sees them. It is similar to making ranking decisions by taking into account previous search logs before the moment of ‘go to next page.’ Their method treated states as a combination of document similarity and document relevance. Each action was a ranked list of documents. Reward was determined by the Discounted Cumulative Gain (DCG) score [50] of the corresponding list. Finding the combination of documents that constitutes the optimal ranking order would be intractable; instead, they greedily selected documents one by one to approximate the optimal ranking list. The state value functions were estimated via Monte Carlo sampling. Experiments were conducted by simulating a user who would check the top documents from the list and provide clicks each time. The proposed algorithm outperformed BM25 [103], maximal marginal relevance (MMR) [24], and Rocchio [104] on several TREC datasets.
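As an illustration of a DCG-based reward, the snippet below scores a ranked list from its graded relevance labels. Several DCG variants exist; this sketch assumes the form that divides each gain by the base-2 logarithm of the rank plus one, which may differ from the exact variant used in the work above.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted Cumulative Gain of a ranked list of relevance grades."""
    # Ranks start at 1, so the top document is divided by log2(2) = 1.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))
```

Placing highly relevant documents earlier yields a larger score, so `dcg([3, 2, 0])` is greater than `dcg([0, 2, 3])`.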

Temporal-difference (TD) learning is another common family of value-based RL methods. TD learning also relies on sampling, but it does not need to sample the entire episode as the Monte Carlo methods do. Instead, it only samples one step ahead. It calculates the difference between its estimate of the current action-value, $Q(s, a)$, and a better estimate of it, $r + \gamma \max_{a'} Q(s', a')$, which is made after looking one step ahead. This difference between the two estimates is known as the TD error. It is closely related to dopamine in animal learning and triggers agents to adjust their behaviors. The assumption is that by using some knowledge from the next state, in particular the next state’s estimated value and the reward obtained from moving from this step to the next, our estimation of the current state could be more accurate.

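TD learning updates its estimate of the current value by a fraction proportional to the TD error. The best known off-policy TD learning method is Q-learning. It updates its action-value by:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $\alpha$ is the learning rate, $s, a$ are the current state and action, and $s', a'$ are the next state and action. The quantity $r + \gamma \max_{a'} Q(s', a')$ is the target, $Q(s, a)$ is the current estimate, and $r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ is the TD error.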
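A tabular version of this update is short enough to write out. The sketch below assumes a dictionary-based Q-table that defaults to zero for unseen state-action pairs and a fixed, finite action set; the function name `q_learning_update` is hypothetical.

```python
from typing import Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def q_learning_update(q: QTable, state: Hashable, action: Hashable, reward: float,
                      next_state: Hashable, actions: Sequence[Hashable],
                      alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Apply one Q-learning update in place and return the TD error."""
    # Target: reward plus the discounted value of the best next action.
    target = reward + gamma * max(q.get((next_state, a), 0.0) for a in actions)
    current = q.get((state, action), 0.0)
    td_error = target - current
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] = current + alpha * td_error
    return td_error
```

Because the target takes the maximum over next actions regardless of which action the agent actually takes next, the update is off-policy.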

Note that $Q(s, a)$ and $Q(s', a')$ are instantiations with different inputs for the same function $Q$. This function can be approximated by deep neural networks. Deep Q-Network (DQN) [89] is a classic example of this kind. DQN approximates the action-value functions by minimizing the loss between the target value function and the estimated value function:

$$L(\theta) = \mathbb{E}_{(s, a, r, s')} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right] \tag{10}$$

where parameter $\theta^{-}$ is kept a few steps behind $\theta$. At each step, DQN chooses the action with the highest action-value with probability $1 - \epsilon$, and the remaining actions share probability $\epsilon$ equally. It then samples a batch of state transitions from a replay buffer to learn the $Q$ function as in Eq. 10. DQN training can be further stabilized with double Q-learning.
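The action-selection rule described above can be sketched as follows. The snippet follows the description literally: the greedy action receives probability 1 − ε and the remaining actions split ε equally, which differs slightly from the common variant that samples uniformly over all actions with probability ε. The function names and the default ε are assumptions for illustration.

```python
import random
from typing import List, Optional, Sequence


def action_probabilities(q_values: Sequence[float], epsilon: float = 0.1) -> List[float]:
    """Selection probabilities: 1 - epsilon for the best action, the rest share epsilon."""
    n = len(q_values)
    if n == 1:
        return [1.0]
    best = max(range(n), key=lambda i: q_values[i])
    return [1.0 - epsilon if i == best else epsilon / (n - 1) for i in range(n)]


def select_action(q_values: Sequence[float], epsilon: float = 0.1,
                  rng: Optional[random.Random] = None) -> int:
    """Sample an action index according to the probabilities above."""
    rng = rng or random.Random()
    probs = action_probabilities(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Setting ε to zero makes the selection purely greedy, while larger values trade exploitation for exploration.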

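The use of value-based RL for DS can be traced back to the 2000s. Leuski [62] defined state, $s$, as the inter-document similarities and action, $a$, as the next document to examine. The $Q$ function, parameterized by $\theta$, was updated via TD learning and the update rule was

$$\theta \leftarrow \theta + \alpha \left[ r + \gamma \max_{a'} Q_{\theta}(s', a') - Q_{\theta}(s, a) \right] \nabla_{\theta} Q_{\theta}(s, a)$$

where $\alpha$ was the learning rate.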

More recently, Tang and Yang [121] used DQN for DS. They adopted a standard DQN framework to reformulate queries at each time step and then used an off-the-shelf retrieval algorithm to obtain the documents. In their work, state was defined as a tuple consisting of the current query, relevant passages, and the current index of the search iterations. Reward was defined as the relevance of documents discounted based on ranking and novelty. Actions were several query reformulation options, including adding terms, removing terms, and re-weighting terms. Zheng et al. [153] proposed an extended DQN for news recommendation. They incorporated user behavioral patterns when constructing the reward function. State was defined as the embedding vector of users, and action was defined as the feature representation of news articles. Both online and offline experiments showed that recommendation performance could be improved by using dueling bandit gradient descent (DBGD) and features that indicated user engagement.

4.4 Policy-Based RL Approaches

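Policy-based RL approaches skip the step of finding out state-values or action-values; instead, they directly learn a policy $\pi$ by observing rewards $r$. This class of RL methods assumes that the policy can be parameterized by some parameter $\theta$, and the goal is to learn the best $\theta$ that optimizes some performance measure $J(\theta)$ of the policy:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=1}^{T} r_t \right]$$

where $\tau$ is a trajectory from time 1 to $T$, i.e. $\tau = (s_1, a_1, r_1, \dots, s_T, a_T, r_T)$.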

A classic policy-based RL algorithm is REINFORCE [135]. It is still widely used by many applications, including dialogue systems and robotics. REINFORCE is a Monte Carlo policy gradient method. It first generates a trajectory, then computes the return $G_t$ for each time step $t$ and updates the policy parameter $\theta$. Via a gradient ascent process, it finds the best $\theta$ that maximizes $J(\theta)$:

$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_{\theta} \ln \pi_{\theta}(a_t \mid s_t)$$

where $\alpha$ is the learning rate. Algorithm 1 details the REINFORCE algorithm.

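while True do
       Generate a trajectory $s_1, a_1, r_1, \dots, s_T, a_T, r_T$ following $\pi_{\theta}$;
       for each step $t = 1, \dots, T$ do
              $G_t \leftarrow \sum_{k=t}^{T} \gamma^{k-t} r_k$;
              $\theta \leftarrow \theta + \alpha \, G_t \, \nabla_{\theta} \ln \pi_{\theta}(a_t \mid s_t)$;
       end for
end while
ALGORITHM 1 REINFORCE by Williams [135].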
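A toy run of REINFORCE helps to see the update at work. The sketch below is an illustration under strong simplifying assumptions: a single-state problem with a softmax policy over a few actions ("arms"), deterministic rewards per arm, and no baseline. For a softmax policy, the gradient of the log-probability of the chosen action with respect to each parameter is an indicator for that action minus the action's probability. All names and default values are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax(theta: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the policy parameters."""
    top = max(theta)
    exps = [math.exp(t - top) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(arm_rewards: Sequence[float], episodes: int = 200, horizon: int = 1,
              alpha: float = 0.5, gamma: float = 1.0, seed: int = 0) -> List[float]:
    """Train a softmax policy with REINFORCE and return its action probabilities."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)
    for _ in range(episodes):
        # Generate a trajectory following the current policy.
        probs = softmax(theta)
        actions = rng.choices(range(len(theta)), weights=probs, k=horizon)
        rewards = [arm_rewards[a] for a in actions]
        # Compute the return G_t of every step, working backwards.
        returns = [0.0] * horizon
        g = 0.0
        for t in reversed(range(horizon)):
            g = rewards[t] + gamma * g
            returns[t] = g
        # Policy-gradient step for each time step of the trajectory.
        for t, action in enumerate(actions):
            probs = softmax(theta)
            for k in range(len(theta)):
                grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
                theta[k] += alpha * returns[t] * grad_log_pi
    return softmax(theta)
```

With two arms paying 0 and 1, the learned policy concentrates its probability mass on the second arm; in a search setting, the arms would correspond to search engine actions and the rewards to relevance-based feedback.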

In 2015, Luo et al. [77] first proposed directly learning a policy for dynamic search. Each search iteration is broken into three phases: browsing, querying, and ranking. Each of those phases is parameterized by its own parameters. They then computed the gradient of the summed rewards in terms of the parameters in the ranking phase. The action is sampled from search results generated by an off-the-shelf ad-hoc retrieval method. Their system was trained in a fashion similar to REINFORCE. Algorithm 2 shows the direct policy learning method for DS.

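repeat
       Sample history $h$ from history set;
       $h = \{(q_t, D_t, C_t, T_t)\}$, where $q_t$ is the query, $D_t$ is the set of retrieved documents, $C_t$ are the clicks, and $T_t$ is the dwelling time;
       for $t$ in range($|h|$) do
              The user takes a browsing action;
              The user takes a querying action;
              Sample a search engine (ranking) action from the ranking policy;
              Update the ranking parameters along the gradient of the summed rewards;
       end for
until the policy parameters converge or history set is empty;
ALGORITHM 2 Direct Policy Learning for Dynamic Search [77].

$q_0 \leftarrow$ encode($Q$), where $Q$ is the input question;
for $t$ in range($T$) do
       $P_t \leftarrow$ retriever($q_t$);
       $(a_t, z_t, S_t) \leftarrow$ reader($P_t$, $Q$), where $a_t$ is the answer span, $z_t$ is the score of the span, $S_t$ is the reader state;
       $q_{t+1} \leftarrow$ reasoner($q_t$, $S_t$);
end for
return answer span with highest score
ALGORITHM 3 Multi-step reasoning for open-domain QA [34]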

Policy-based RL methods have been widely used in other interactive AI systems, such as dialogue systems, multi-turn question answering (QA), and query reformulation. Dhingra et al. [37] proposed KB-InfoBot, a multi-turn dialogue agent for movie recommendation that operates on a knowledge base (KB). User intents are tracked through a belief tracker, which constructs probabilities of slots in the utterances. They also use a soft look-up function that constructs a posterior distribution over all the items in the KB from the output of the belief tracker. The posterior distribution suggests user preferences. The output of the belief tracker and the output of the look-up function together constitute the state of the RL agent. The action is either to ask the user about one of the slots of the item of interest, or to inform the user of the final answer. The policy is optimized to maximize the returns.

In 2019, Das et al. [34] proposed a multi-hop reasoning framework for open-domain QA, where two agents, a retriever and a reader, iteratively interact with each other. The retriever’s job is to find relevant passages, from which the reader extracts answers to the question. The query in this work is a feature vector, which is initialized by embedding the input question. A multi-step reasoner reformulates the query – i.e., modifies the query vector – based on the output of the reader and the input question vector. State is defined as the input question and the inner state of the reader. The action is to select which paragraphs to return. Reward is measured by how well the output of the reader matches the ground-truth answer. The model may be trained with REINFORCE, and the algorithm is shown in Algorithm 3.

Li et al. [67] applied reinforcement learning to generate dialogues between two virtual agents. State is their previous round of conversation, $[p_i, q_i]$. Action is to generate the next utterance, $p_{i+1}$, which may be of arbitrary length. Reward is measured in terms of the utterance’s informativity, its coherence, and the ease with which it can be answered. The policy is parameterized via a Long Short-Term Memory (LSTM) encoder-decoder neural network. The system is trained by REINFORCE [135].

Nogueira and Cho [92] proposed an RL-based query reformulation system. Its state is the set of retrieved documents, and its action is to select terms and formulate a query. The original query and the candidate terms, along with their context, are fed into a siamese neural network, whose output is the probability of putting a term into the reformulated query and the estimated state-value. This system was also trained with REINFORCE. Chen et al. [30] scaled REINFORCE to an extremely large action space and proposed Top-K off-policy correction for recommender systems.

In 2018, Aissa et al. [3] proposed a conversational search system that translates an information need into keyword queries. They trained a Seq2Seq network with REINFORCE. The Seq2Seq network acts as the policy: its input is the information need, expressed as a sequence of words, and its output is a sequence of binary variables of the same length that indicate whether to keep the corresponding word in the query. Reward is defined as the mean average precision (MAP) of search results retrieved using the formulated query. The task is to generate topic titles from topic descriptions, using datasets from the TREC Robust Track [128] and Web Track [45]. Compared with a supervised approach, training the Seq2Seq model with RL achieved better performance.

4.5 Model-based RL Approaches

In RL, the word “model” specifically means the state transition function $P$ and sometimes also the reward function $R$. It represents knowledge about the dynamics of the environment:

$$M: (s_t, a_t) \mapsto (\hat{s}_{t+1}, \hat{r}_t)$$

where $a_t$ is the action taken by the agent after it perceives state $s_t$; the model $M$ estimates the resultant state as $\hat{s}_{t+1}$ and the immediate reward as $\hat{r}_t$.

Fig. 12: Model-based RL and Model-free RL, from Sutton and Barto [120].

Sections 4.3 and 4.4 present RL methods that are “model-free” and are not aware of $M$. Model-based RL methods make plans with the model before taking actions. They first collect and learn the model from past experiences, which are sometimes simulated from a policy, and then make the plan. Figure 12 shows the paradigm of model-based reinforcement learning.

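Value iteration is perhaps the simplest form of model-based RL. It is derived from the Bellman equation and approximates the optimal state-value as

$$V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right] \tag{16}$$

where $s$ is the current state, $s'$ is the next state, $R(s, a, s')$ is the immediate reward, $\gamma$ is the discount factor, and $P(s' \mid s, a)$ is the transition function, which may be represented by a table.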
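As an illustration of value iteration with a tabular transition function, here is a small sketch on a two-state toy MDP. The MDP, the state and action names, and the function `value_iteration` are all hypothetical and chosen only to keep the example short; the table stores, for each state-action pair, a list of (probability, next state, reward) triples.

```python
def value_iteration(states, actions, transitions, gamma=0.9, tol=1e-8):
    """Tabular value iteration.

    transitions[(s, a)] is a list of (probability, next_state, reward) triples.
    Returns the approximate optimal state-values and a greedy policy.
    """
    values = {s: 0.0 for s in states}

    def backup(s, a):
        # Expected one-step reward plus discounted value of the next state.
        return sum(p * (r + gamma * values[s2]) for p, s2, r in transitions[(s, a)])

    while True:
        delta = 0.0
        for s in states:
            best = max(backup(s, a) for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: backup(s, a)) for s in states}
    return values, policy


# Toy MDP: "move" reaches the absorbing goal with probability 0.8 and pays 1.
states = ["start", "goal"]
actions = ["stay", "move"]
transitions = {
    ("start", "stay"): [(1.0, "start", 0.0)],
    ("start", "move"): [(0.8, "goal", 1.0), (0.2, "start", 0.0)],
    ("goal", "stay"): [(1.0, "goal", 0.0)],
    ("goal", "move"): [(1.0, "goal", 0.0)],
}
values, policy = value_iteration(states, actions, transitions)
print(values, policy)
```

For this toy problem the fixed point can be checked by hand: the value of "start" satisfies V = 0.8 + 0.2 · 0.9 · V, so V = 0.8 / 0.82.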

In 2013, Guan et al. [42] proposed the Query Change Model (QCM) for DS. It is a term-weighting scheme optimized via value iteration. QCM defines states as queries, and actions as term-weighting operations, such as increasing, decreasing, and maintaining term weights. The number of adjusted weights is determined based on syntactic changes between two adjacent queries. QCM denotes the state transition function as a query-change probability, $P(q_i \mid q_{i-1}, D_{i-1}, a)$. Here $q_i$ is the current query, $q_{i-1}$ is the previous query, $D_{i-1}$ is the set of documents retrieved by $q_{i-1}$, and $a$ is the query change action. Different query change actions would result in different $q_i$. To calculate the state transition, the two adjacent queries $q_{i-1}$ and $q_i$ are first broken into tokens. The probability $P(q_i \mid q_{i-1}, D_{i-1}, a)$ is then calculated based on how each of the tokens appears in $q_{i-1}$, $q_i$, and $D_{i-1}$. QCM then estimates the state value for each document and uses it to score the document:

$$\mathrm{Score}(q_i, d) = P(q_i \mid d) + \gamma \sum_{a} P(q_i \mid q_{i-1}, D_{i-1}, a) \max_{D_{i-1}} P(q_{i-1} \mid D_{i-1})$$

where the first term, $P(q_i \mid d)$, measures the relevance between a document $d$ and the current query $q_i$, with a ranking score obtained from a standard ad-hoc retrieval method. The second term is more involved: $P(q_i \mid q_{i-1}, D_{i-1}, a)$ is the transition probability from $q_{i-1}$ to $q_i$, given the query change $a$; $\max_{D_{i-1}} P(q_{i-1} \mid D_{i-1})$ is the maximum possible state value of the previous retrieval that has just happened; and $\gamma$ is the discount factor.

Fig. 13: Examples of state transitions in Win-Win search, adapted from [80].

The win-win search method proposed by Luo et al. [80] in 2014 studied dynamic search as a POMDP. Two agents – i.e., the search engine system and the user – participate in this process. The state is modeled as a cross-product of two binary dimensions: relevant versus non-relevant, and exploration versus exploitation. Actions on the system side include changing term weights and using certain retrieval models. Actions on the user side include adding or removing query terms. The model – i.e., the state transition probabilities – is estimated by analyzing query logs – e.g., how likely a user would go from relevance & exploration ($S_{RR}$) to relevance & exploitation ($S_{RT}$). The states of queries in the query log are manually annotated and used to train the model. The estimated state transition probabilities are then used to select actions that would optimize system performance – i.e., maximize the expected return. Examples of the state transition probabilities are shown in Figure 13.
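One simple way to picture the estimation of state transition probabilities from annotated logs is to count how often one state label follows another. The sketch below assumes plain maximum-likelihood counting over per-session label sequences; this is an illustrative assumption and not necessarily the estimator used in [80]. The label strings and the function `estimate_transitions` are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(sessions):
    """Maximum-likelihood estimate of P(next_state | state) from labeled sessions."""
    counts = defaultdict(Counter)
    for labels in sessions:
        # Count every adjacent pair of state labels within a session.
        for current, following in zip(labels, labels[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for current, following in counts.items()
    }


# Each session is the sequence of manually annotated states of its queries.
sessions = [
    ["relevant/explore", "relevant/exploit", "relevant/exploit"],
    ["relevant/explore", "nonrelevant/explore", "relevant/exploit"],
]
model = estimate_transitions(sessions)
print(model["relevant/explore"])
```

With so little data the estimates are coarse; in practice some smoothing would likely be needed for transitions that never occur in the log.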

At each step, the state transition model is used to compute the $Q$-values on the user side – i.e., $Q_u$. The system chooses an action that jointly optimizes both $Q_{se}$ and $Q_u$:

$$a_{se} = \arg\max_{a} \left( Q_{se}(b, a) + Q_u(b, a_u) \right)$$

where $Q_{se}$ is the action-value on the system’s side, $Q_u$ is the action-value on the user’s side, and $a_u$ is the user’s action. Both are defined over the belief $b$ and the observation $\omega$ of the POMDP; their full definitions are given in [80]. Follow-up work by Luo et al. [76] is a model-free version of the algorithm, which learns a policy directly through the EM algorithm [14].

In 2016, Zhang and Zhai [151] proposed a model-based RL method for search engine UI improvement. It is also optimized by value iteration. The task was to choose an interface card and show it to the user at each step. Its state $s_t$ was defined as the set of interface cards that had not been presented to the user, while action was to choose the next card to show – i.e., $a_t = e$ with $e \in s_t$, where $e$ is an interface card. Its transition is to exclude an element from a set – i.e.,

$$s_{t+1} = s_t \setminus \{e\}$$

They used value iteration, as in Eq. 16, to solve the resulting Markov Decision Process (MDP). Simulation and a user study showed that the method could automatically adjust to various screen sizes and users’ stopping tendencies.
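The set-valued state and its "remove the shown card" transition can be sketched as follows. The reward model here is a hypothetical simplification, not the one in [151]: each card has a fixed gain, and the user continues to the next card with a constant probability `continue_prob`, which stands in for the stopping tendency. Because every action shrinks the set, the Bellman optimality equation can be solved exactly by memoized backward induction instead of repeated sweeps.

```python
from functools import lru_cache


def plan_cards(gains, continue_prob=0.8):
    """Return (expected value, card order) for showing cards one at a time.

    State: frozenset of cards not yet shown. Action: pick one card from it.
    Transition: remove the chosen card from the set.
    """

    @lru_cache(maxsize=None)
    def best(remaining):
        if not remaining:
            return 0.0, ()
        options = []
        for card in sorted(remaining):
            future_value, future_order = best(remaining - {card})
            value = gains[card] + continue_prob * future_value
            options.append((value, (card,) + future_order))
        return max(options)

    return best(frozenset(gains))


# Hypothetical cards and gains, for illustration only.
gains = {"map": 3.0, "reviews": 2.0, "photos": 1.0}
value, order = plan_cards(gains, continue_prob=0.5)
print(value, order)
```

Under this simplified reward, the optimal plan shows cards in descending order of gain, since later cards are discounted by the chance that the user has already stopped.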

In 2018, Feng et al. [39] proposed an actor-critic framework for search result diversification. Its state is defined as a tuple $s_t = (q, Z_t, X_t)$, where $q$ is the query, $Z_t$ the current ranked list, and $X_t$ the set of candidate documents. The action $a_t$ is to choose the next document from $X_t$. The transition is then to append the document at the end of $Z_t$ and remove it from $X_t$:

$$s_{t+1} = (q, Z_t \oplus \{a_t\}, X_t \setminus \{a_t\})$$

where $\oplus$ is an appending operation. They use Monte-Carlo Tree Search (MCTS) [28] to simulate and evaluate the policy.

4.6 Bandits Algorithms

The exploration/exploitation trade-off is one of the core challenges in RL. Athukorala et al. [7] investigated how the exploration rate in a bandit algorithm, LinRel [8], impacts retrieval performance and user satisfaction. In general, bandits algorithms are based on a stateless MDP. They approximate the $Q$-value with various error correction mechanisms and regularization methods.

Li et al. [65] proposed an extended version of the upper confidence bound (UCB) algorithm [2] in a DS scenario where multiple queries are available simultaneously. Each arm is formulated as a query, and the task is to select a query from the pool at each step to maximize the overall rewards. UCB takes into account both the estimated optimality of the action and the uncertainty in the estimation; it selects the action by

$$a_t = \arg\max_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right] \tag{23}$$

where $Q_t(a)$ is the current estimation of the action value, $N_t(a)$ is the number of times action $a$ has been selected, $t$ is the current step, and $c$ controls the degree of exploration. The first term in Eq. 23 evaluates the optimality of the action in the current estimation, while the second term accounts for the uncertainty. Li et al. [65] extended UCB by estimating the action value of a new query based on its lexical similarity with existing queries.
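The UCB selection rule can be sketched in a few lines. The example below is the plain UCB rule on simulated Bernoulli arms, not the extended version of [65]; the arms could be thought of as candidate queries, and the success probabilities, the exploration constant `c`, and the function names are assumptions made for illustration.

```python
import math
import random


def ucb_select(estimates, counts, step, c=2.0):
    """Pick the arm maximizing Q(a) + c * sqrt(ln(step) / N(a))."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every arm once before trusting the bonus
    scores = [q + c * math.sqrt(math.log(step) / n) for q, n in zip(estimates, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb(success_probs, steps=2000, c=2.0, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(success_probs)
    counts = [0] * len(success_probs)
    for step in range(1, steps + 1):
        arm = ucb_select(estimates, counts, step, c)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean keeps Q(a) equal to the average observed reward.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_ucb([0.2, 0.5, 0.8])
print(counts)  # the arm with the highest success probability should dominate
```

A rarely tried arm receives a large bonus even when its current estimate is low, which is what drives exploration; as its count grows the bonus shrinks and the estimate takes over.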

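Yang and Yang [139] applied to DS a contextual bandit algorithm, LinUCB [68]. Their bandits algorithm chooses among several query reformulation options, such as adding terms, removing terms, re-weighting terms, and stopping the entire process. At each step, LinUCB selects an action based on

$$a_t = \arg\max_{a} \left( x_{t,a}^\top \hat{\theta}_a + \alpha \sqrt{x_{t,a}^\top A_a^{-1} x_{t,a}} \right)$$

where $\hat{\theta}_a = A_a^{-1} D_a^\top c_a$ and $A_a = D_a^\top D_a + I_d$. Here $d$ is the dimension of the feature space, $x_{t,a}$ is the feature vector, $\hat{\theta}_a$ is the coefficient vector, $D_a$ is the matrix of training examples, $c_a$ is the vector holding the corresponding responses, and $\alpha$ controls the width of the confidence bound.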
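A compact sketch of the per-arm bookkeeping in disjoint LinUCB is given below. It is written in pure Python and, as an implementation choice of this sketch rather than of [68] or [139], maintains the inverse matrix directly with the Sherman-Morrison rank-one update instead of re-inverting at every step. The class `LinUCBArm`, the function `select_arm`, and the arm labels in the usage lines are hypothetical.

```python
import math


class LinUCBArm:
    """One arm of disjoint LinUCB, storing A^{-1} and b = D^T c for d-dimensional features."""

    def __init__(self, dim):
        # A starts as the identity, so its inverse is the identity too.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def score(self, x, alpha):
        """Estimated reward plus alpha times the confidence width."""
        theta = self._a_inv_times(self.b)  # ridge-regression coefficients
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + alpha * width

    def update(self, x, reward):
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x)
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def select_arm(arms, features, alpha=1.0):
    """Return the index of the arm with the highest upper confidence bound."""
    scores = [arm.score(x, alpha) for arm, x in zip(arms, features)]
    return max(range(len(arms)), key=scores.__getitem__)


# Usage: two arms (say, "add terms" and "remove terms") with 2-d context features.
arms = [LinUCBArm(2), LinUCBArm(2)]
arms[0].update([1.0, 0.0], reward=1.0)
chosen = select_arm(arms, [[1.0, 0.0], [1.0, 0.0]], alpha=1.0)
score_after_update = arms[0].score([1.0, 0.0], alpha=1.0)
print(chosen, score_after_update)
```

After one rewarded observation, the first arm has a higher estimated reward but a narrower confidence width along that feature direction, while the untouched arm keeps its full width.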

Wang et al. [130] proposed a factorization-based bandits algorithm for interactive recommender systems, for which a sub-linear upper regret bound was proved to hold with high probability. Observable contextual features and user dependencies were used to improve the convergence rate of the algorithm and to alleviate the cold-start problem in recommendation.

Exploration has also been studied in IR without explicit optimization; heuristics are used instead. Diversity and novelty are two heuristic goals when ranking documents. Jiang et al. [53] penalized duplicate results by simulating user behavior when ranking documents. Bron et al. [20] diversified retrieval results based on maximal marginal relevance (MMR) [24] and latent Dirichlet allocation (LDA) [15]. Raman et al. [100] studied the intrinsic diversity in search sessions. They identified intrinsically diverse sessions with linguistic features, and then re-ranked documents by greedily selecting those that maximize intrinsic diversity.

5 A Testbed for Dynamic Search Systems

Fig. 14: TREC Dynamic Domain Track 2015-2017, by Yang et al. [143].

Communities from the Text REtrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF) have devoted considerable effort to developing proper testbeds for dynamic search systems. Such efforts include the TREC Interactive Tracks from 1997 to 2002 [127], the TREC Session Tracks from 2011 to 2014 [26], the TREC Dynamic Domain Tracks from 2015 to 2017 [141, 140, 143], and the CLEF Dynamic Search Lab in 2018 [58].

The TREC Interactive Tracks evaluated the search effectiveness of a user and a search engine working as a team in interactive IR. They did not separate the user from the search engine: what the user did and what the search system did were evaluated together. Aspect precision and aspect recall were the main metrics used. The interactions were live, but the evaluation was not reproducible. In contrast, the TREC Session Tracks involved no user in real time. The search systems were provided with a search log generated by someone in the past and were expected to improve the search results for the last query based on the search history. The search logs contained recorded queries, retrieved URLs, clicks, and dwell time. Common web search evaluation metrics, e.g. the normalized discounted cumulative gain (nDCG), were used for measuring effectiveness in the Session Tracks.
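For readers unfamiliar with nDCG, a minimal sketch is given below. It uses the linear-gain form of DCG with a base-2 logarithmic discount and normalizes by the ideal ordering of the same list; official track evaluations may use a different gain function and compute the ideal ranking from all judged documents, so this is only an approximation for illustration. The function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevance scores."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the same scores sorted in the ideal order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


ranking = [3, 2, 3, 0, 1, 2]  # graded relevance of the returned documents, top first
score = ndcg(ranking)
print(round(score, 4))
```

A list that is already sorted by relevance scores exactly 1, and placing highly relevant documents lower in the list reduces the score through the logarithmic discount.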

The TREC Dynamic Domain (DD) Tracks replaced the human user in the Interactive Track with a simulator. The simulator provided feedback to a DS agent. The feedback included relevance ratings for the returned documents and highlighted which passages were relevant to which subtopics. This design created a live interaction and enabled reproducible experiments via simulation. Figure 14 shows the interaction process between the simulated user and the search system. Initially, the search system receives the name of a search topic. The search system then retrieves five documents and returns them to the simulated user. Feedback provided by the simulated user includes graded relevance scores at the subtopic level and highlighted passages that are pertinent to the information need. The search system is expected to adjust its search algorithm and find more relevant documents at each search iteration. This loop continues until the search system calls for a stop.
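The interaction loop can be pictured with the following sketch. It is not the official TREC DD simulator or its API: the ground-truth format, the function names, the stopping rule (stop after an iteration that reveals no new subtopic), and the placeholder document IDs such as `doc-a` are all assumptions made for illustration. The system here simply walks down a fixed ranked list, whereas a real DS agent would re-rank using the feedback.

```python
def simulated_user(doc_ids, ground_truth):
    """Return subtopic-level feedback, a list of (subtopic_id, rating), per document."""
    return {doc_id: ground_truth.get(doc_id, []) for doc_id in doc_ids}


def run_session(ranked_docs, ground_truth, batch_size=5, patience=1):
    """Return documents in batches until `patience` iterations bring no new subtopic."""
    covered, history, idle = set(), [], 0
    for start in range(0, len(ranked_docs), batch_size):
        batch = ranked_docs[start:start + batch_size]
        feedback = simulated_user(batch, ground_truth)
        history.append(feedback)
        found = {sub for hits in feedback.values() for sub, _rating in hits}
        new_subtopics = found - covered
        covered |= new_subtopics
        idle = 0 if new_subtopics else idle + 1
        if idle >= patience:
            break  # the system calls for a stop
    return covered, history


# Toy ground truth: two relevant documents, everything else is non-relevant.
ground_truth = {"1407502": [(460, 2)], "1523859": [(462, 3)]}
ranked_docs = ["0190584", "1407502", "1523859", "doc-a", "doc-b",
               "doc-c", "doc-d", "doc-e", "doc-f", "doc-g",
               "doc-h", "doc-i"]
covered, history = run_session(ranked_docs, ground_truth)
print(sorted(covered), len(history))
```

Because the simulator answers deterministically from stored judgments, the same system run always receives the same feedback, which is what makes the experiments reproducible.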

The search topics used in TREC DD were two-layer hierarchical topics. One example search topic is shown in Table II. Its information need is to find information on factors that would affect kangaroo survival. The topic is composed of 6 subtopics, each of which addresses one or several factors, such as traffic accidents and illegal hunting. The hierarchical structure of the information need, except the topic name, was not available to the search system at any time.

Topic (id: DD17-49): Kangaroo Survival
Subtopic 1 (id: 464): Road Danger - 7 Relevant Passages
Subtopic 2 (id: 462): Effect on Land - 15 Relevant Passages
Subtopic 3 (id: 463): Monetizing Kangaroos - 65 Relevant Passages
Subtopic 4 (id: 459): Kangaroo Hunting - 17 Relevant Passages
Subtopic 5 (id: 552): Other Kangaroo Dangers - 4 Relevant Passages
Subtopic 6 (id: 460): Protection and Rescues - 57 Relevant Passages
TABLE II: An Example Search Topic and its Subtopics in TREC DD.

User: search DD17-49: kangaroo survival
System: Return document 0190584
User: Non-relevant document.
System: Return document 1407502
User: Relevant on subtopic 460 with a rating of 2, “They have rescued an orphaned kangaroo.”
System: Return document 1523859
User: Relevant on subtopic 462 with a rating of 3, “kangaroos have strayed from nature reserves to graze on suburban lawns in Canberra.”
TABLE III: Example Search History.

Dozens of teams participated in TREC DD from 2015 to 2017. Most of them still used hand-crafted policies, either in a way similar to relevance feedback models [56, 4, 21] or by diversifying the search results based on topic hierarchies [90, 84]. There were also a few attempts to use RL-based approaches [121].

Evaluation metrics adopted in TREC DD include Cube Test [78], $\alpha$-DCG [31], session DCG [51], nERR-IA [27], and Expected Utility [146]. Most of these metrics evaluate the performance of a dynamic search system in terms of the relevance of the information acquired and the effort the user puts into the search process, based on different user models. $\alpha$-DCG, session DCG, and nERR-IA only consider the relevant information retrieved and discount the raw relevance based on ranking order or diversity, which comes from human heuristics. Cube Test [78] evaluates the efficiency, or the speed of retrieval, of dynamic search systems. Expected Utility [146] measures the “net gain” of the search process, i.e. the amount of relevant information minus the user effort. Apart from the metrics used in TREC DD, there are other metrics that are designed for, or can be used in, dynamic search, such as Time-Biased Gain (TBG) [118] and U-measure [108]. These two take into account the fact that users may spend different amounts of time on different parts of the search results. Even though relevance itself is situational [18] and may not be consistent in the dynamic search process [85], it has been shown that relevance judgments correlate well with user satisfaction at the session level [48] and that the last query in the search process plays a vital role [72].

The CLEF Dynamic Search Lab in 2018 [58] further developed the TREC DD Track. The dynamic search system was decomposed into two agents: a Q-agent, which reformulated queries, and an A-agent, which retrieved documents. Tasks included query suggestion and results composition. Reformulated queries were evaluated based on their effectiveness in retrieving documents, rather than their similarity to human-generated ones.

6 Conclusion

This article reviews the state of the art in dynamic search. A dynamic search system acts as a digital assistant: by interacting with it, a human user acquires information and, with its support, makes rational decisions. This emerging research field is related to several disciplines that have attracted long-lasting research interest. We have devoted much effort to comparing our subject with related fields, including ad-hoc retrieval, information seeking, exploratory search, interactive information retrieval, dialogue systems, online learning-to-rank, supervised machine learning, and reinforcement learning; this article is perhaps the first to make such a comparison across a broad set of targets. Our views might be subjective, but we think the comparison is necessary.

The majority of the article concentrates on interpreting the works in dynamic search from the point of view of reinforcement learning. This angle makes a strong but valid assumption that reinforcement learning is the more suitable family of methods for dynamic search. We made this assumption based on how naturally the two fit each other and how much they both relate to animal learning. Their handling of transitions, interactions, and rewards, and their aim for success, all run parallel. Reinforcement learning itself is undergoing a leap at the moment we are writing this article. Empowered by large amounts of available data and advances in deep neural networks, this sophisticated branch of machine learning is experiencing its renaissance right now. We expect reinforcement learning, as a general solution for handling interactions, to bring growth to all studies of interactive agents. Dynamic search would be one very important application, as search engines are the most widely used interactive agents and would share in the forthcoming progress.

Adaptation and exploration would be two major research themes for dynamic search. We envision that future systems will be able to adjust themselves and adapt to evolving information goals and changing users. The research will become more challenging, as optimizing over a “moving target” is new but crucial to future dynamic search research. Being able to quickly and wisely explore all possible options would also be ideal but has not yet been achieved. We anticipate that much improvement will be made in these directions.

Dynamic search is unique for persistently putting human users at the center of its research. It is unquestionably human-centered “artificial intelligence”, if one would like to use the new buzzword. Years of research in information retrieval, and the rich experience that only the IR community has accumulated in dealing with human-agent interactions (in this case the agent is the search engine), would definitely nurture artificial intelligence research, including reinforcement learning, as long as that research has a human user to serve, to satisfy, and to help succeed.


The authors would like to thank Ian Soboroff, Jiyun Luo, Shiqi Liu, Angela Yang, and Xuchu Dong for their past efforts during our long-term collaboration on dynamic search. We thank the annotators from the National Institute of Standards and Technology (NIST) for helping us build the TREC DD evaluation dataset. We also thank Connor Lu for proofreading the paper. This research was supported by NSF CAREER grant IIS-145374 and DARPA Memex Program FA8750-14-2-0226. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect those of the sponsors.


  • [1] I. Adeyanju, D. Song, F. M. Nardini, M. Albakour, and U. Kruschwitz (2011) RGU-isti-essex at TREC 2011 session track. In TREC ’11, Cited by: §4.1.
  • [2] R. Agrawal (1995) Sample mean based index policies by o (log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 (4). Cited by: §4.6.
  • [3] W. Aissa, L. Soulier, and L. Denoyer (2018) A reinforcement learning-driven translation model for search-oriented conversational systems. In Proceedings of the 2nd International Workshop on Search-Oriented Conversational AI, SCAI@EMNLP 2018, Cited by: §4.4.
  • [4] A. Albahem, D. Spina, L. Cavedon, and F. Scholer (2016) RMIT @ TREC 2016 dynamic domain track: exploiting passage representation for retrieval and relevance feedback. In TREC ’16, Cited by: §5.
  • [5] M. Albakour, U. Kruschwitz, B. Neville, D. Lungley, M. Fasli, and N. Nanas (2011) University of essex at the TREC 2011 session track. In TREC ’11, Cited by: §4.1.
  • [6] M. Albakour and U. Kruschwitz (2012) University of essex at the trec 2012 session track. In TREC ’12, Cited by: §4.1.
  • [7] K. Athukorala, A. Medlar, K. Ilves, and D. Glowacka (2015) Balancing exploration and exploitation: empirical parameterization of exploratory search systems. In CIKM ’15, Cited by: §4.6.
  • [8] P. Auer (2002) Using confidence bounds for exploitation-exploration trade-offs. JMLR 3. Cited by: §4.6.
  • [9] L. Azzopardi (2011) The economics in interactive information retrieval. In SIGIR ’11, Cited by: §4.1.
  • [10] L. Azzopardi (2014) Modelling interaction with economic models of search. In SIGIR ’14, Cited by: §4.1.
  • [11] A. Bah, K. Sabhnani, M. Zengin, and B. Carterette (2014) University of delaware at TREC 2014. In TREC ’14, Cited by: §4.1.
  • [12] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR ’15, Cited by: §4.2.
  • [13] M. J. Bates (1989) The design of browsing and berrypicking techniques for the online search interface. Online review 13 (5). Cited by: Fig. 7, §3.2.
  • [14] J. A. Bilmes et al. (1998) A gentle tutorial of the em algorithm and its application to parameter estimation for gaussian mixture and hidden markov models. International Computer Science Institute 4 (510). Cited by: §4.5.
  • [15] D. M. Blei, A. Y. Ng, and M. I. Jordan (2001) Latent dirichlet allocation. In NIPS ’01, Cited by: §4.6.
  • [16] A. Bordes, Y. Boureau, and J. Weston (2017) Learning end-to-end goal-oriented dialog. In ICLR ’17, Cited by: §3.3.
  • [17] A. Borisov, M. Wardenaar, I. Markov, and M. de Rijke (2018) A click sequence model for web search. In SIGIR ’18, Cited by: §4.2.
  • [18] P. Borlund (2003) The concept of relevance in IR. JASIST 54 (10). Cited by: §5.
  • [19] A. Bozzon, M. Brambilla, S. Ceri, and P. Fraternali (2010) Liquid query: multi-domain exploratory search on the web. In WWW ’10, Cited by: §3.6.
  • [20] M. Bron, J. He, K. Hofmann, E. Meij, M. de Rijke, M. Tsagkias, and W. Weerkamp (2010) The university of amsterdam at TREC 2010: session, entity and relevance feedback. In TREC ’10, Cited by: §4.6.
  • [21] E. D. Buccio and M. Melucci (2016) Evaluation of a feedback algorithm inspired by quantum detection for dynamic search tasks. In TREC ’16, Cited by: §5.
  • [22] M. K. Buckland (1991) Information as thing. JASIS 42 (5). Cited by: §1.
  • [23] M. Buckland and C. Plaunt (1994) On the construction of selection systems. Library Hi Tech 12 (4). Cited by: Fig. 10.
  • [24] J. G. Carbonell and J. Goldstein (1998) The use of mmr, diversity-based reranking for reordering documents and producing summaries. In SIGIR ’98, Cited by: §4.3, §4.6.
  • [25] B. Carterette and P. Chandar (2011) Implicit feedback and document filtering for retrieval over query sessions. In TREC ’11, Cited by: §4.1.
  • [26] B. Carterette, P. D. Clough, M. M. Hall, E. Kanoulas, and M. Sanderson (2016) Evaluating retrieval over sessions: the TREC session track 2011-2014. In SIGIR ’16, Cited by: §4.2, §5.
  • [27] O. Chapelle, S. Ji, C. Liao, E. Velipasaoglu, L. Lai, and S. Wu (2011) Intent-based diversification of web search results: metrics and algorithms. Inf. Retr. 14 (6). Cited by: §5.
  • [28] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck (2008) Monte-carlo tree search: A new framework for game AI. In Proceedings of the Fourth Artificial Intelligence and Interactive Digital Entertainment Conference, Cited by: §4.5.
  • [29] H. Chen, X. Liu, D. Yin, and J. Tang (2017) A survey on dialogue systems: recent advances and new frontiers. SIGKDD Explorations 19 (2). Cited by: Fig. 8.
  • [30] M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi (2019) Top-k off-policy correction for a REINFORCE recommender system. In WSDM ’19, Cited by: §4.4.
  • [31] C. L. A. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon (2008) Novelty and diversity in information retrieval evaluation. In SIGIR ’08, Cited by: §5.
  • [32] N. Craswell and M. Szummer (2007) Random walks on the click graph. In SIGIR ’07, Cited by: §4.1.
  • [33] W. B. Croft, D. Metzler, and T. Strohman (2009) Search engines - information retrieval in practice. Pearson Education. ISBN 978-0-13-136489-9. Cited by: §1.
  • [34] R. Das, S. Dhuliawala, M. Zaheer, and A. McCallum (2019) Multi-step retriever-reader interaction for scalable open-domain question answering. In ICLR ’19, Cited by: §4.4, 3.
  • [35] M. Dehghani, S. Rothe, E. Alfonseca, and P. Fleury (2017) Learning to attend, copy, and generate for session-based query suggestion. In CIKM ’17, Cited by: §4.2.
  • [36] M. P. Deisenroth and C. E. Rasmussen (2011) PILCO: A model-based and data-efficient approach to policy search. In ICML ’11, Cited by: §2.2.
  • [37] B. Dhingra, L. Li, X. Li, J. Gao, Y. Chen, F. Ahmed, and L. Deng (2017) Towards end-to-end reinforcement learning of dialogue agents for information access. In ACL ’17, Cited by: §4.4.
  • [38] H. Feild (2014) Endicott college at 2014 TREC session track. In TREC ’14, Cited by: §4.1.
  • [39] Y. Feng, J. Xu, Y. Lan, J. Guo, W. Zeng, and X. Cheng (2018) From greedy selection to exploratory decision-making: diverse ranking with policy-value networks. In SIGIR ’18, Cited by: §4.5.
  • [40] J. Gao, M. Galley, and L. Li (2019) Neural approaches to conversational AI. Foundations and Trends® in Information Retrieval 13 (2-3). ISSN 1554-0669. Cited by: §3.3.
  • [41] A. Grotov and M. de Rijke (2016) Online learning to rank for information retrieval: SIGIR 2016 tutorial. In SIGIR ’16, Cited by: §3.4.
  • [42] D. Guan, S. Zhang, and H. Yang (2013) Utilizing query change for session search. In SIGIR ’13, Cited by: §4.5.
  • [43] M. Hagen, M. Völske, J. Gomoll, M. Bornemann, L. Ganschow, F. Kneist, A. H. Sabri, and B. Stein (2013) Webis at TREC 2013-session and web track. In TREC ’13, Cited by: §4.1.
  • [44] A. Halfaker, O. Keyes, D. Kluver, J. Thebault-Spieker, T. T. Nguyen, K. Shores, A. Uduwage, and M. Warncke-Wang (2015) User session identification based on strong regularities in inter-activity time. In WWW ’15, Cited by: §3.6.
  • [45] D. Hawking (2000) Overview of the TREC-9 web track. In TREC ’00, Cited by: §4.4.
  • [46] J. He, V. Hollink, C. Boscarino, A. P. de Vries, and R. Cornacchia (2011) CWI at TREC 2011: session, web, and medical. In TREC ’11, Gaithersburg, Maryland, USA, November 15-18, 2011, Cited by: §4.1.
  • [47] B. J. Hecht, S. Carton, M. Quaderi, J. Schöning, M. Raubal, D. Gergle, and D. Downey (2012) Explanatory semantic relatedness and explicit spatialization for exploratory search. In SIGIR ’12, Cited by: §3.6.
  • [48] S. B. Huffman and M. Hochster (2007) How well does result relevance predict session satisfaction?. In SIGIR ’07, Cited by: §5.
  • [49] B. Huurnink, R. Berendsen, K. Hofmann, E. Meij, and M. de Rijke (2011) The university of amsterdam at the TREC 2011 session track. In TREC ’11, Cited by: §4.2, §4.2.
  • [50] K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of IR techniques. TOIS 20 (4), pp. 422–446. Cited by: §4.3.
  • [51] K. Järvelin, S. L. Price, L. M. L. Delcambre, and M. L. Nielsen (2008) Discounted cumulated gain based evaluation of multiple-query IR sessions. In ECIR ’08, Cited by: §5.
  • [52] Z. Ji, Z. Lu, and H. Li (2014) An information retrieval approach to short text conversation. CoRR abs/1408.6988. Cited by: §3.3.
  • [53] J. Jiang, D. He, and S. Han (2012) On duplicate results in a search session. In TREC ’12, Cited by: §4.6.
  • [54] J. Jiang and D. He (2013) Pitt at TREC 2013: different effects of click-through and past queries on whole-session search performance. In TREC ’13, Cited by: §4.1.
  • [55] X. Jin, M. Sloan, and J. Wang (2013) Interactive exploratory search for multi page search results. In WWW ’13, Cited by: §4.3.
  • [56] R. Joganah, L. Lamontagne, and R. Khoury (2015) Laval university and lakehead university at TREC dynamic domain 2015: combination of techniques for subtopics coverage. In TREC ’15, Cited by: §5.
  • [57] R. Jones and K. L. Klinkner (2008) Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In CIKM ’08, Cited by: §3.6.
  • [58] E. Kanoulas, L. Azzopardi, and G. H. Yang (2018) Overview of the CLEF dynamic search evaluation lab 2018. In CLEF ’18, Cited by: §5, §5.
  • [59] T. Kenter and M. de Rijke (2017) Attentive memory networks: efficient machine reading for conversational search. CoRR abs/1712.07229. Cited by: §3.3.
  • [60] B. King and I. Provalov (2010) Cengage learning at the TREC 2010 session track. In TREC ’10, Cited by: §4.1.
  • [61] V. Lavrenko and W. B. Croft (2001) Relevance-based language models. In SIGIR ’01, Cited by: §4.1.
  • [62] A. Leuski (2000) Relevance and reinforcement in interactive browsing. In CIKM ’00, Cited by: §4.3.
  • [63] N. Levine, H. Roitman, and D. Cohen (2017) An extended relevance model for session search. In SIGIR ’17, Cited by: §4.1.
  • [64] S. Levine and V. Koltun (2013) Guided policy search. In ICML ’13, Cited by: §2.2.
  • [65] C. Li, P. Resnick, and Q. Mei (2016) Multiple queries as bandit arms. In CIKM ’16, Cited by: §4.6.
  • [66] J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao, and W. B. Dolan (2016) A persona-based neural conversation model. In ACL ’16, Cited by: §3.3.
  • [67] J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016) Deep reinforcement learning for dialogue generation. In EMNLP ’16, Cited by: §3.3, §4.4.
  • [68] L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. In WWW ’10, Cited by: §4.6.
  • [69] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In ICLR ’16, Cited by: §2.2.
  • [70] C. Liu, N. J. Belkin, and M. J. Cole (2012) Personalization of search results using interaction behaviors in search sessions. In SIGIR ’12, Cited by: §4.1.
  • [71] C. Liu, M. J. Cole, E. Baik, and N. J. Belkin (2012) Rutgers at the TREC 2012 session track. In TREC ’12, Cited by: §4.1.
  • [72] M. Liu, Y. Liu, J. Mao, C. Luo, and S. Ma (2018) Towards designing better session search evaluation metrics. In SIGIR ’18, Cited by: §5.
  • [73] T. Liu (2009) Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3 (3). Cited by: §1, §3.4.
  • [74] W. Liu, H. Lin, Y. Ma, and T. Chang (2011) DUTIR at the session track in TREC 2011. In TREC ’11, Cited by: §4.1.
  • [75] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, and G. Tolomei (2011) Identifying task-based sessions in search engine query logs. In WSDM ’11, Cited by: §3.6.
  • [76] J. Luo, X. Dong, and H. Yang (2015) Learning to reinforce search effectiveness. In ICTIR ’15, Cited by: §4.5.
  • [77] J. Luo, X. Dong, and H. Yang (2015) Session search by direct policy learning. In ICTIR ’15, Cited by: Fig. 3, §1, §4.4, 2.
  • [78] J. Luo, C. Wing, H. Yang, and M. A. Hearst (2013) The water filling model and the cube test: multi-dimensional evaluation for professional search. In CIKM ’13, Cited by: §5.
  • [79] J. Luo, S. Zhang, X. Dong, and H. Yang (2015) Designing states, actions, and rewards for using POMDP in session search. In ECIR ’15, Cited by: §1.
  • [80] J. Luo, S. Zhang, and H. Yang (2014) Win-win search: dual-agent stochastic game in session search. In SIGIR ’14, Cited by: Fig. 2, §1, Fig. 13, §4.5.
  • [81] G. Marchionini and B. Shneiderman (1988) Finding facts vs. browsing knowledge in hypertext systems. IEEE Computer 21 (1). Cited by: §3.1.
  • [82] G. Marchionini (1997) Information seeking in electronic environments. Cambridge University Press. Cited by: Fig. 1, §1, §1, Fig. 6, §3.1, §3.1.
  • [83] G. Marchionini (2006) Exploratory search: from finding to understanding. Commun. ACM 49 (4). Cited by: §1, §3.1.
  • [84] R. McCreadie, S. Vargas, C. MacDonald, I. Ounis, S. Mackie, J. Manotumruksa, and G. McDonald (2015) University of Glasgow at TREC 2015: experiments with terrier in contextual suggestion, temporal summarisation and dynamic domain tracks. In TREC ’15, Cited by: §5.
  • [85] A. Medlar and D. Glowacka (2018) How consistent is relevance feedback in exploratory search?. In CIKM ’18, Cited by: §5.
  • [86] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS ’13, Cited by: §4.1.
  • [87] B. Mitra (2015) Exploring session context using distributed representations of queries and reformulations. In SIGIR ’15, Cited by: §4.2, §4.2.
  • [88] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In ICML ’16, Cited by: §2.2.
  • [89] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540). Cited by: §2.2, §4.3.
  • [90] F. Moraes, R. L. T. Santos, and N. Ziviani (2016) UFMG at the TREC 2016 dynamic domain track. In TREC ’16, Cited by: §5.
  • [91] N. Nanas and A. N. D. Roeck (2009) Autopoiesis, the immune system, and adaptive information filtering. Natural Computing 8 (2). Cited by: §4.1.
  • [92] R. Nogueira and K. Cho (2017) Task-oriented query reformulation with reinforcement learning. In EMNLP ’17, Cited by: §4.4.
  • [93] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters (2018) An algorithmic perspective on imitation learning. Foundations and Trends in Robotics 7 (1-2). Cited by: §4.2.
  • [94] U. Ozertem, E. Velipasaoglu, and L. Lai (2011) Suggestion set utility maximization using session logs. In CIKM ’11, Cited by: §4.1.
  • [95] M. Patidar, P. Agarwal, L. Vig, and G. Shroff (2018) Automatic conversational helpdesk solution using seq2seq and slot-filling models. In CIKM ’18, Cited by: §3.3.
  • [96] J. Pickens, G. Golovchinsky, C. Shah, P. Qvarfordt, and M. Back (2008) Algorithmic mediation for collaborative exploratory search. In SIGIR ’08, Cited by: §3.6.
  • [97] M. Qiu, L. Yang, F. Ji, W. Zhou, J. Huang, H. Chen, W. B. Croft, and W. Lin (2018) Transfer learning for context-aware question matching in information-seeking conversations in e-commerce. In ACL ’18, Cited by: §3.3.
  • [98] P. Qvarfordt, G. Golovchinsky, T. Dunnigan, and E. Agapie (2013) Looking ahead: query preview in exploratory search. In SIGIR ’13, Cited by: §3.6.
  • [99] F. Radlinski and N. Craswell (2017) A theoretical framework for conversational search. In CHIIR ’17, Cited by: §3.3.
  • [100] K. Raman, P. N. Bennett, and K. Collins-Thompson (2013) Toward whole-session relevance: exploring intrinsic diversity in web search. In SIGIR ’13, Cited by: §4.6.
  • [101] G. Ren, X. Ni, M. Malik, and Q. Ke (2018) Conversational query understanding using sequence to sequence modeling. In WWW ’18, Cited by: §4.2, §4.2.
  • [102] S. E. Robertson and K. S. Jones (1976) Relevance weighting of search terms. JASIS 27 (3). Cited by: §4.1.
  • [103] S. E. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3 (4). Cited by: §1, §4.3.
  • [104] J. J. Rocchio (1971) Relevance feedback in information retrieval. The SMART retrieval system: experiments in automatic document processing. Cited by: §4.1, §4.3.
  • [105] S. Ross (2013) Interactive learning for sequential decisions and predictions. Ph.D. Thesis, Carnegie Mellon University. Cited by: Fig. 11.
  • [106] T. Ruotsalo, J. Peltonen, M. J. A. Eugster, D. Glowacka, K. Konyushkova, K. Athukorala, I. Kosunen, A. Reijonen, P. Myllymäki, G. Jacucci, and S. Kaski (2013) Directing exploratory search with interactive intent modeling. In CIKM ’13, Cited by: §3.6.
  • [107] I. Ruthven (2008) Interactive information retrieval. ARIST 42 (1). Cited by: §1.
  • [108] T. Sakai and Z. Dou (2013) Summaries, ranked retrieval and sessions: a unified framework for information access evaluation. In SIGIR ’13, Cited by: §5.
  • [109] B. Sarrafzadeh and E. Lank (2017) Improving exploratory search experience through hierarchical knowledge graphs. In SIGIR ’17, Cited by: §3.6.
  • [110] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz (2015) Trust region policy optimization. In ICML ’15, Cited by: §2.2.
  • [111] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: 1707.06347 Cited by: §2.2.
  • [112] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI ’16, Cited by: §3.3.
  • [113] C. Shah and R. I. González-Ibáñez (2011) Evaluating the synergic effect of collaboration in information seeking. In SIGIR ’11, Cited by: §3.6.
  • [114] W. Shalaby and W. Zadrozny (2018) Toward an interactive patent retrieval framework based on distributed representations. In SIGIR ’18, Cited by: §4.1.
  • [115] L. Shang, Z. Lu, and H. Li (2015) Neural responding machine for short-text conversation. In ACL ’15, Cited by: §3.3.
  • [116] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil (2014) Learning semantic representations using convolutional neural networks for web search. In WWW ’14, Cited by: §4.2.
  • [117] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587). Cited by: §2.2.
  • [118] M. D. Smucker and C. L. A. Clarke (2012) Time-based calibration of effectiveness measures. In SIGIR ’12, Cited by: §5.
  • [119] A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan (2015) A neural network approach to context-sensitive generation of conversational responses. In NAACL HLT ’15, Cited by: §3.3.
  • [120] R. S. Sutton and A. G. Barto (1998) Reinforcement learning - an introduction. Adaptive computation and machine learning, MIT Press. External Links: ISBN 0262193981 Cited by: §1, Fig. 5, §2.1, §2.1, §2.2, §2.2, §2.2, Fig. 12.
  • [121] Z. Tang and G. H. Yang (2017) A reinforcement learning approach for dynamic search. In TREC ’17, Cited by: §4.3, §5.
  • [122] N. Tani, D. Bollegala, N. P. Chandrasiri, K. Okamoto, K. Nawa, S. Iitsuka, and Y. Matsuo (2011) Collaborative exploratory search in real-world context. In CIKM ’11, Cited by: §3.6.
  • [123] A. Tian and M. Lease (2011) Active learning to maximize accuracy vs. effort in interactive information retrieval. In SIGIR ’11, Cited by: §4.1.
  • [124] Z. Tian, R. Yan, L. Mou, Y. Song, Y. Feng, and D. Zhao (2017) How to make context more useful? an empirical study on context-aware neural conversational models. In ACL ’17, Cited by: §3.3.
  • [125] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS ’17, Cited by: §4.2, §4.2.
  • [126] O. Vinyals and Q. V. Le (2015) A neural conversational model. CoRR abs/1506.05869. External Links: 1506.05869 Cited by: §3.3.
  • [127] E. M. Voorhees (2002) Overview of TREC 2002. In TREC ’02, Cited by: §5.
  • [128] E. M. Voorhees (2004) Overview of the TREC 2004 robust track. In TREC ’04, Cited by: §4.4.
  • [129] H. Wang, Y. Song, M. Chang, X. He, R. W. White, and W. Chu (2013) Learning to extract cross-session search tasks. In WWW ’13, Cited by: §3.6.
  • [130] H. Wang, Q. Wu, and H. Wang (2017) Factorization bandits for interactive recommendation. In AAAI ’17, Cited by: §4.6.
  • [131] S. Wang, Z. Bao, S. Huang, and R. Zhang (2018) A unified processing paradigm for interactive location-based web search. In WSDM ’18, Cited by: §4.1.
  • [132] J. Weizenbaum (1966) ELIZA - a computer program for the study of natural language communication between man and machine. Commun. ACM 9 (1), pp. 36–45. Cited by: §3.3.
  • [133] R. W. White, B. Kules, S. M. Drucker, et al. (2006) Supporting exploratory search: introduction to the special issue. Commun. ACM 49 (4). Cited by: §1, §3.1, §3.1.
  • [134] R. W. White and R. A. Roth (2009) Exploratory search: beyond the query-response paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers. Cited by: §3.2.
  • [135] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8. Cited by: §4.4, §4.4, 1.
  • [136] Y. Wu, W. Wu, C. Xing, M. Zhou, and Z. Li (2017) Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In ACL ’17, Cited by: §3.3.
  • [137] Y. Xue, G. Cui, X. Yu, Y. Liu, and X. Cheng (2014) ICTNET at session track TREC2014. In TREC ’14, Cited by: §4.2, §4.2.
  • [138] R. Yan, Y. Song, and H. Wu (2016) Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In SIGIR ’16, Cited by: §3.3.
  • [139] A. Yang and G. H. Yang (2017) A contextual bandit approach to dynamic search. In ICTIR ’17, Cited by: §4.6.
  • [140] G. H. Yang and I. Soboroff (2016) TREC ’16 dynamic domain track overview. In TREC ’16, Cited by: §5.
  • [141] G. H. Yang, Z. Tang, and I. Soboroff (2017) TREC ’17 dynamic domain track overview. In TREC ’17, Cited by: §5.
  • [142] G. H. Yang (2019) Information retrieval fundamentals. Note: The first ACM SIGIR/SIGKDD Africa Summer School on Machine Learning for Data Mining and Search (AFIRM 2019) Cited by: §1.
  • [143] H. Yang, J. Frank, and I. Soboroff (2015) TREC ’15 dynamic domain track overview. In TREC ’15, Cited by: Fig. 14, §5.
  • [144] H. Yang, M. Sloan, and J. Wang (2014) Dynamic information retrieval modeling. In SIGIR ’14, Cited by: §1.
  • [145] L. Yang, M. Qiu, C. Qu, J. Guo, Y. Zhang, W. B. Croft, J. Huang, and H. Chen (2018) Response ranking with deep matching networks and external knowledge in information-seeking conversation systems. In SIGIR ’18, Cited by: §3.3.
  • [146] Y. Yang and A. Lad (2009) Modeling expected utility of multi-session information distillation. In ICTIR ’09, Cited by: §5.
  • [147] Y. Yang, Y. Gong, and X. Chen (2018) Query tracking for e-commerce conversational search: A machine comprehension perspective. In CIKM ’18, Cited by: §4.2, §4.2.
  • [148] X. Yuan, J. Liu, and N. Sa (2012) U. Albany & USC at the TREC 2012 session track. In TREC ’12, Cited by: §4.1.
  • [149] C. Zhai and J. D. Lafferty (2001) A study of smoothing methods for language models applied to ad hoc information retrieval. In SIGIR ’01, Cited by: §1.
  • [150] C. Zhang, X. Wang, S. Wen, and R. Li (2012) BUPT_pris at TREC 2012 session track. In TREC ’12, Cited by: §4.1.
  • [151] Y. Zhang and C. Zhai (2016) A sequential decision formulation of the interface card model for interactive IR. In SIGIR ’16, Cited by: §4.5.
  • [152] Y. Zhang, X. Chen, Q. Ai, L. Yang, and W. B. Croft (2018) Towards conversational search and recommendation: system ask, user respond. In CIKM ’18, Cited by: §3.3.
  • [153] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li (2018) DRN: A deep reinforcement learning framework for news recommendation. In WWW ’18, Cited by: §4.3.
  • [154] X. Zhou, D. Dong, H. Wu, S. Zhao, D. Yu, H. Tian, X. Liu, and R. Yan (2016) Multi-view response selection for human-computer conversation. In EMNLP ’16, Cited by: §3.3.