Intent Models for Contextualising and Diversifying Query Suggestions


Query suggestion, or auto-completion, mechanisms help users to type less while interacting with a search engine. A basic approach that ranks suggestions according to their frequency in the query logs is suboptimal. Firstly, many candidate queries with the same prefix can be removed as redundant. Secondly, the suggestions can be personalised based on the user’s context. These two directions for improving the quality of such mechanisms can be in opposition: while the latter aims to promote suggestions that address search intents the user is likely to have, the former aims to diversify the suggestions to cover as many intents as possible. We introduce a contextualisation framework that utilises short-term context based on the user’s behaviour within the current search session, such as the previous query, the documents examined, and the candidate query suggestions that the user has discarded. This short-term context is used to contextualise and diversify the ranking of query suggestions, by modelling the user’s information need as a mixture of intent-specific user models. The evaluation is performed offline on a set of approximately 1.0M test user sessions. Our results suggest that the proposed approach significantly improves query suggestions compared to the baseline approach.


Eugene Kharitonov, Craig Macdonald, Pavel Serdyukov, Iadh Ounis
Yandex School of Computing Science, University of Glasgow

{kharitonov, pavser} {craig.macdonald, iadh.ounis}

A short version of this paper [?] was presented at ACM CIKM 2013.

Query suggestion (the term query auto-completion [?] is also used) is a mechanism that helps a search engine’s users to type less while submitting a query. Usually, such suggestions are represented as a list of queries filtered by the prefix entered by the user. This list appears as the user starts to enter a new query and changes as the user types new characters. The most common approach to generating query suggestion candidates for web search is based on query log mining (e.g., [?, ?]). Usually the process can be divided into offline preparation and online suggestion steps. In the offline step, queries are aggregated from search logs, cleaned of inappropriate/adult queries and clustered to identify near duplicates. Finally, the queries are indexed. In the online step, this index is used to provide a user with a small set of possible candidates (usually, no more than 10) immediately after the user starts to type the query. Since short prefixes can have thousands of highly frequent candidates, the problem is to select a subset that has a maximum probability to contain the query that the user is trying to submit.

Since up to 50% of queries in the query stream are preceded by other queries within the same session, the user’s earlier behaviour in the session can be a valuable source of contextual information [?], which can be used to provide the user with a better list of suggestions. To illustrate the utility of this kind of context, consider a user who has just entered the query [apache], clicked on several pages devoted to Native Americans and skipped the remaining results. It is natural to assume that the user’s next input [apache t] refers to [apache territory] and not to [apache tomcat]. Queries that have several possible interpretations (intents), such as [apache], are referred to as ambiguous [?].

Another possible direction to improve the query suggestion ranking is to diversify a list of proposed suggestions, i.e. to avoid unnecessarily redundant candidates. To illustrate, let us consider a prefix [apache t] and two possible candidate sets: ([apache tomcat], [apache tomcat install], [apache tomcat download]) and ([apache tomcat], [apache tomcat install], [apache territory]). Despite the fact that [apache tomcat download] may be more frequent than [apache territory], its utility in the presence of the two other candidates can be lower than that of [apache territory] and hence the latter set can be more useful to the users.

Often, the diversification and contextualisation approaches are studied independently (e.g., [?, ?]). However, if considered independently, they are to some extent contrasting: while diversification implies a broad coverage of intents in order to satisfy as many users as possible, contextualisation aims to cover only the needs of a particular user. In some sense, both approaches constitute possible ways to deal with the lack or presence of the user’s intent preference information and a problem arises in how to combine them in a mathematically motivated manner.

In this paper, we describe a framework that is capable of providing a user whose intentions are clear with highly contextualised queries and a user without context with diversified suggestions. We investigate how both the short-term search context and the query candidates diversification can be combined to provide a user with suggestions that are more useful with fewer typed characters. Our contributions in this paper can be summarised as follows:

  • A framework to perform contextualised and diversified query suggestions ranking;

  • A generative model of user behaviour that is capable of inferring the user’s real intent from their short-term context, namely the previously entered query, the documents skipped and clicked, and the discarded query suggestions;

  • A thorough experimental evaluation of the framework.

The rest of this paper is organised as follows. After reviewing some related work, we discuss our framework to perform a contextualised and diversified ranking of query suggestions. Next, we describe our proposed approach to model user behaviour and how it can be used to represent the user context. We then show how our proposed query suggestion framework is related to an existing state-of-the-art diversification framework. The experimental setup of our evaluation is described thereafter. Finally, we report the results obtained and close the paper with some conclusions.

Three papers, by Bar-Yossef et al. [?], Yan et al. [?], and Shokouhi [?], are the most closely related to our work. Bar-Yossef et al. proposed a method to contextualise query suggestions, which relies on representing suggestions and context (previous queries) as high-dimensional term-vectors. The ranking of suggestions is then based on a linear combination of the query frequency and the similarity with the previous context. Shokouhi [?] proposed a feature-based machine learning framework to personalise query suggestions. Shokouhi considered long-term features (e.g. age, the user’s previous queries) as well as short-term features (e.g. the user’s queries in the same session) and demonstrated that his approach outperforms the baseline that ranks suggestions according to their frequency. We note three key differences between our work and that of Bar-Yossef et al. [?] and Shokouhi [?]: the context we consider also includes the documents that were clicked and skipped; our explicit modelling of user intents makes it possible to diversify the suggested queries while keeping them contextualised; and, in contrast to the work of Bar-Yossef et al., our method allows us to adjust the trade-off between query popularity and relatedness with the context in a context-dependent manner.

After submitting a query, some search engines provide the users with query recommendations. The approach used by Yan et al. [?] to perform query recommendations in the presence of ambiguous queries differs from our work in that it ignores the search results the user previously skipped. In addition, although the user needs are formulated in terms of query intents, the diversification of the recommendations was not discussed. Cao et al. [?] studied the related problem of contextualising query recommendations, but without considering the user’s document examination behaviour as part of the context or addressing explicit diversification. The method we use to extract latent user intents from the click and reformulation behaviour can be considered analogous to the method introduced by Cao et al. [?]; however, they only mention the possibility of leveraging that information in the query suggestion mechanism. Song et al. [?] proposed an approach to perform a diversified ranking of query recommendations. The underlying idea behind their work is to promote queries with a high level of novelty with respect to the previously submitted query. Another algorithm to generate diversified query recommendations was introduced by Ma et al. [?]; it leverages a Markov random walk process on a query-URL bipartite graph to infer the most probable recommendation. The first difference with our work is that both approaches [?, ?] rely on the query candidates’ implicit similarity and dissimilarity without explicitly modelling the possible user intentions. Secondly, the possibility of additional contextualisation of recommendations by the user’s click behaviour is not considered.

While our work addresses the diversification of query suggestions, the related task of search result diversification has seen much research in recent years. Various models for search result diversification have been proposed, including IA-Select [?] and xQuAD [?], which both build a diversified ranking of documents in an iterative manner, by greedily selecting at each step the document that has the maximal probability to satisfy a user given that the previously selected documents have failed to do so; the next selected document should therefore cover the intents that are least covered by those previously selected. Vallet et al. [?] introduced extensions to diversification frameworks, such as IA-Select [?], to perform a diversified and personalised ranking of web results. Our work differs from that of Vallet et al. in the nature of the tackled task (query suggestions vs. results diversification) as well as the nature of the considered user behaviour (long-term vs. short-term).

As can be seen from the related work, little attention has been paid to the contextualisation of query suggestions by means of analysing the user’s document examination behaviour, as well as to the problem of combining contextualisation and diversification. In the next section we introduce a novel framework that combines the user’s short-term query and document examination context and the diversification of the query suggestions in a unified manner, to improve the users’ satisfaction with the query suggestions.

Our approach to combine diversification and contextualisation is to reformulate the diversification problem as a special case of contextualisation. We use the underlying idea of IA-Select [?], where at each iteration, when a new document is selected to be added to the ranked list, the documents previously selected are assumed to have failed to satisfy the user’s needs. Similarly, we consider the set of queries already placed in the list of query suggestions to have failed to satisfy the user, and this set of queries forms the diversification part of the context. Apart from this, the user’s previous behaviour (previously submitted query, documents clicked or skipped) constitutes the historical part of the context.

Informally, our framework to build diversified and contextualised suggestions can be recursively defined as follows. We consider the session of a user who submitted the query $q$, interacted with the search result page and submitted several characters of their next query. On the $i$-th step of building a query suggestion list, we have already selected $i-1$ suggestion candidates: $s_1, \dots, s_{i-1}$. The task is to select the next candidate with the highest probability to guess the user’s target query, given their historical context $H$, and $D_i = \{s_1, \dots, s_{i-1}\}$ as a diversification context. We denote the full context available at step $i$ of the algorithm as $C_i = (H, D_i)$. Then, the next query suggestion is greedily selected as the one with the highest probability to be submitted by the user given the current context: $s_i = \arg\max_s P(s \mid C_i)$. After finding the suggestion candidate with the highest probability, it is included into the diversification context: $D_{i+1} = D_i \cup \{s_i\}$. We repeat the procedure until the required number of suggestions is selected.
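The greedy procedure above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: `build_suggestions` and `p_submit` are hypothetical names, and `p_submit` stands in for whatever model estimates the probability that a candidate is the user's target query given the historical context and the suggestions ranked so far.

```python
def build_suggestions(candidates, history, p_submit, k=10):
    """Greedily build a suggestion list, treating the suggestions ranked
    so far as having failed to match the user's target query."""
    ranked = []                    # the diversification part of the context
    remaining = list(candidates)
    while remaining and len(ranked) < k:
        # Pick the candidate most likely to be submitted given the
        # historical context and the already-ranked (failed) suggestions.
        best = max(remaining, key=lambda s: p_submit(s, history, ranked))
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

With a scoring callback that penalises candidates too similar to those already ranked, the loop reproduces the [apache t] behaviour discussed in the introduction: after [apache tomcat] is placed first, [apache territory] can overtake the more frequent [apache tomcat download]-style candidates.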

As the above algorithm can encapsulate different forms of information, it constitutes a framework that can generate different query suggestion lists. In the remainder of this section, we describe a context-aware method to estimate the probability $P(s \mid C_i)$ for ranking candidate query suggestions given the current context $C_i$. (For clarity, in the following we use the simpler notation $s$ instead of $s_i$ for a candidate query suggestion identified at step $i$. The query the user actually submits is referred to as $q^*$.)

Since we are considering the problem of ranking query suggestions after observing a part of the user’s session, it is natural to assume that the set of possible user intentions (search tasks) coincides with the set of possible interpretations of the previous query $q$, i.e. $T = T_q$. This allows us to deal with the small set of intentions associated with $q$, instead of the potentially huge space of all possible search intents users may have. On the other hand, we lose the ability to diversify the candidate queries that are not represented by the intents of $q$, which may reduce the usefulness of diversification. A previous search task can be completely unrelated to a new query if a user has satisfied his/her information need and is starting a new search task with the new query. Following Ozertem et al. [?], we use the notion of search task continuation to account for this effect. Let us introduce an indicator variable $c$, which is equal to $1$ if the user’s previous task is continued, and denote the probability of the continuation given the user’s context as $P(c{=}1 \mid C)$.

We assume that before starting to submit the next query, the user can be in one of $|T_q| + 1$ states: either the user is satisfied with the results and is not continuing their previous search task (one state); or the user is dissatisfied with the results and continues the task ($|T_q|$ intent states). At each step, we update the full context $C_i$ and consequently our beliefs about the user’s state. After that, we find a suggestion candidate with the highest expected probability to be submitted by the user. To formalise this idea, we expand the expected probability of submitting $s$ given the current context $C$:

$$P(s \mid C) = P(c{=}1 \mid C)\, P(s \mid c{=}1, C) + P(c{=}0 \mid C)\, P(s \mid c{=}0, C) \qquad (1)$$
The first term corresponds to the probability of the user submitting $s$ and continuing the previous search task. The second term equals the probability of submitting $s$ while starting a new search task. We assume that a user who is not going to continue their current search task issues a query with probability $P(s)$ (except for $P(s)$, all probabilities in this and the next section are conditioned on the previous query $q$; we omit it to simplify the notation), which is close to the observed frequency of the query in a query log. Taking that into account, we can drop the query’s dependency on the context in the second term, and as a consequence, we can re-write the previous expression as:

$$P(s \mid C) = P(c{=}1 \mid C)\, P(s \mid c{=}1, C) + \bigl(1 - P(c{=}1 \mid C)\bigr)\, P(s) \qquad (2)$$
Since the estimates for $P(s)$ can be found directly by counting query occurrences in the query log, and are independent of the user’s context, we focus on estimating our belief that the user continues their task (i.e. $P(c{=}1 \mid C)$) and the probability that the user submits $s$ given he/she continues the search task, $P(s \mid c{=}1, C)$. To estimate the latter probability, we assume that the context influences these probabilities only by affecting the distribution of user intents. To leverage this assumption, we firstly factorise over the possible intents $t \in T_q$:

$$P(s \mid c{=}1, C) = \sum_{t \in T_q} P(s \mid t, c{=}1, C)\, P(t \mid c{=}1, C) \qquad (3)$$
Under the above assumption, the probability of query $s$ is independent of the user’s context given the user’s intent, thus we can drop the conditioning on the context from the first term:

$$P(s \mid c{=}1, C) = \sum_{t \in T_q} P(s \mid t, c{=}1)\, P(t \mid c{=}1, C) \qquad (4)$$
Using Bayes’ rule, $P(t \mid c{=}1, C)$ can be estimated as follows:

$$P(t \mid c{=}1, C) = \frac{P(C \mid t, c{=}1)\, P(t)}{\sum_{t' \in T_q} P(C \mid t', c{=}1)\, P(t')} \qquad (5)$$
As a next step, we obtain the following expression to estimate the task continuation probability $P(c{=}1 \mid C)$:

$$P(c{=}1 \mid C) = \frac{P(C \mid c{=}1)\, P(c{=}1)}{P(C \mid c{=}1)\, P(c{=}1) + P(C \mid c{=}0)\, P(c{=}0)}, \quad P(C \mid c{=}1) = \sum_{t \in T_q} P(C \mid t, c{=}1)\, P(t) \qquad (6)$$
The probability $P(C \mid c{=}0)$ can be calculated in a similar way.

Finally, $P(s \mid C)$ can be estimated by combining the intent factorisation with the Bayes-rule estimates of the intent and task-continuation posteriors, and substituting these into the expanded submission probability. The obtained expression can be used for query ranking with various representations of the context $C$. Only the probabilities of observing the context, $P(C \mid t, c{=}1)$ and $P(C \mid c{=}0)$, depend on the context representation.
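To make the combined estimate concrete, the following sketch evaluates the final ranking score from toy inputs. Every name here (`suggestion_score`, the dictionaries of probabilities) is a hypothetical stand-in for quantities that, in the paper, come from query logs and the learned intent models.

```python
def suggestion_score(s, intents, p_bg, p_s_t, p_ctx_t, p_ctx_new, p_t, p_cont_prior):
    """Score a candidate s: mix the continued-task and new-task components.

    p_bg[s]      -- background query-log probability P(s)
    p_s_t[t][s]  -- P(s | t, c=1), intent-specific reformulation probability
    p_ctx_t[t]   -- P(C | t, c=1), context likelihood under intent t
    p_ctx_new    -- P(C | c=0), context likelihood for a new task
    p_t[t]       -- intent prior P(t)
    p_cont_prior -- prior continuation probability P(c=1)
    """
    # P(C | c=1): marginalise the context likelihood over intents.
    p_ctx_cont = sum(p_ctx_t[t] * p_t[t] for t in intents)
    # Posterior continuation probability P(c=1 | C) via Bayes' rule.
    z = p_ctx_cont * p_cont_prior + p_ctx_new * (1.0 - p_cont_prior)
    p_cont = p_ctx_cont * p_cont_prior / z
    # Posterior intent distribution P(t | c=1, C) via Bayes' rule.
    post_t = {t: p_ctx_t[t] * p_t[t] / p_ctx_cont for t in intents}
    # Continued-task term plus the context-independent new-task term.
    p_s_cont = sum(p_s_t[t].get(s, 0.0) * post_t[t] for t in intents)
    return p_cont * p_s_cont + (1.0 - p_cont) * p_bg.get(s, 0.0)
```

Note how a context that is much more likely under one intent pulls both the continuation posterior and the intent posterior towards that intent, which is exactly the contextualisation effect described above.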

We have thus far defined our framework. In the next two sections, we discuss two possible instantiations of the framework and discuss how $P(s \mid t, c{=}1)$, $P(t)$, and the context representation parameters can be learned from a query log.

In this section, we introduce a generative approach to model the user behaviour. This model provides us with the means to represent different forms of the user context, as we will discuss in the next section. This representation is further used in the query suggestion framework proposed above.

A part of the session, which starts with the user submitting a query $q$ and finishes with a query reformulation $q'$, is further referred to as an interaction. Let us consider the population of user interactions all starting with the query $q$, denoted $X_q$. We assume that $X_q$ is generated from a mixture of models, with each mixture component corresponding to an intent from the family $T_q$. With each interaction $x \in X_q$, we associate two latent variables: the intent $t$ the user had while submitting $q$, and a binary variable $c$ which is equal to 1 if the next query $q'$ belongs to the same search task as $q$, and is equal to 0 if the user decided to switch to another task.

Symbol   Description
$E_j$    was the $j$-th document examined?
$C_j$    was the $j$-th document clicked?
$A_j$    was the user attracted by the $j$-th document?
$S_j$    was the user satisfied with the $j$-th document?
$Q_j$    was the next query submitted after the examination of the $j$-th document?
$q'$     a query submitted by the user at the end of the interaction
$c$      a latent binary variable denoting if the user continues his/her search task while submitting $q'$
$t$      an intent latent variable

Table 1: Notations used

Each mixture component describes a model inspired by the Simplified DBN [?] click model, together with a unigram language model over possible query reformulations. Our model of user behaviour assumes that after submitting $q$, a user with intent $t$ examines results from top to bottom, one at a time. An examined document $d$ attracts a user’s click with probability $a_{d,t}$ and satisfies the user after clicking with probability $\sigma_{d,t}$. If the user is satisfied with the last result clicked, then the next submitted query is unrelated to the previous search task and its terms are distributed according to the background unigram language model of the whole query stream, $P_{bg}$. If the user is not satisfied, then the user continues to examine documents until they find a satisfying document or, after examining all the query results, they submit a new query with an intent-dependent term distribution $P_t$.

In other words, the model assumes that a term of $q'$ is generated from a mixture of two components, and the user’s satisfaction with the last clicked document determines which component will be used to generate it: a non-satisfied user submits a query with terms generated from the intent-specific distribution $P_t$, while a satisfied user generates the terms of the next query from the background distribution $P_{bg}$. A user who cannot find a satisfying document examines all query results. Due to these intent-dependent click and language models, interactions with similar click/skip patterns or reformulations tend to be associated with the same component of the mixture.

The underlying graphical model is depicted in Figure 1 and, for a given document position $j$, it uses the random variables described in Table 1. Denoting the document in the $j$-th position as $d_j$, the model can be described by means of the following equations:

$$
\begin{aligned}
&P(E_1 = 1) = 1 && \text{(7a)}\\
&P(E_{j+1} = 1 \mid E_j = 0) = 0 && \text{(7b)}\\
&P(C_j = 1 \mid E_j = 1, A_j = 1) = 1 && \text{(7c)}\\
&P(A_j = 1 \mid t) = a_{d_j,t} && \text{(7d)}\\
&P(S_j = 1 \mid C_j = 1, t) = \sigma_{d_j,t} && \text{(7e)}\\
&P(S_j = 1 \mid C_j = 0) = 0 && \text{(7f)}\\
&S_j = 1 \;\Rightarrow\; E_{j+1} = 0,\; Q_j = 1,\; c = 0 && \text{(7g)}\\
&P(E_{j+1} = 1 \mid E_j = 1, S_j = 0) = 1, \quad j < N && \text{(7h)}\\
&S_1 = \dots = S_N = 0 \;\Rightarrow\; Q_N = 1,\; c = 1 && \text{(7i)}\\
&P(q' \mid c = 0) = P_{bg}(q') && \text{(7j)}\\
&P(q' \mid c = 1, t) = P_t(q') && \text{(7k)}
\end{aligned}
$$
Indeed, the above equations describe the following constraints on the model: the first document is always examined (7a); documents are examined sequentially (7b); when an examined document is attractive, the user will click it (7c); the probability of attracting the user and the probability of satisfying the user are document parameters, conditioned on the intent (7d) & (7e); an unclicked document cannot satisfy a user (7f); the examination of the ranked document list terminates when the user is satisfied, meaning that the search task is not continued and the user submits a new query (7g); if the user is not satisfied, then the examination proceeds down the ranked list, as far as rank $N$ (7h); if the user is not satisfied with the top $N$ documents, then the user continues their search task with a new query (7i); a new query for a new search task of a satisfied user is drawn with the likelihood as given by the query log (7j); however, for the next query in a continuing search task, this probability is conditioned on the intent of the user (7k).
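As a sanity check on these constraints, the generative process can be simulated directly. This is a hedged sketch with a hypothetical helper name: `attract` and `satisfy` hold toy values in place of the learned per-document, per-intent parameters, and only the click/continuation part of the model is sampled (the reformulation language models are omitted).

```python
import random

def simulate_interaction(attract, satisfy, rng):
    """Sample one interaction: the click per result and the continuation
    indicator c. attract[j] and satisfy[j] play the roles of the
    per-document, per-intent attractiveness and satisfaction parameters."""
    clicks = []
    for a, s in zip(attract, satisfy):
        clicked = rng.random() < a          # examined doc is clicked iff attractive
        clicks.append(int(clicked))
        if clicked and rng.random() < s:    # satisfied: examination stops here
            clicks += [0] * (len(attract) - len(clicks))
            return clicks, 0                # c = 0: a new search task follows
    return clicks, 1                        # examined all N results: task continues
```

A user who finds no satisfying document examines every result and continues the task (c = 1), matching the constraints above.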

The maximum a posteriori (MAP) estimates of the model parameters $\theta$ (the click parameters $a_{d,t}$ and $\sigma_{d,t}$, and the language models $P_t$) as well as the distribution $P(t)$ are found from the available query log data by means of an Expectation-Maximisation (EM) procedure:

$$\hat{\theta} = \arg\max_{\theta} \; P(\theta) \prod_{x \in X_q} P(x \mid \theta)$$

where $P(x \mid \theta)$ denotes the probability of an observed interaction $x$ given the model parameters $\theta$. Following [?] and [?], we impose Beta and Dirichlet priors on the click ($a_{d,t}$, $\sigma_{d,t}$) and language model ($P_t$, $P_{bg}$) parameters, respectively. The probability $P(t \mid q, q')$ is obtained by performing a single Expectation-like step over the interactions ending with $q'$:

$$P(t \mid q, q') = \frac{1}{|X_{q,q'}|} \sum_{x \in X_{q,q'}} \gamma_x(t)$$

where $\gamma_x(t)$ are the responsibility values obtained on the Expectation step and $X_{q,q'}$ is the set of interactions starting with $q$ and ending with $q'$.
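The single Expectation-like step amounts to averaging normalised responsibilities. The sketch below uses hypothetical names and toy per-interaction likelihoods in place of the click-and-language-model likelihoods from the EM fit:

```python
def p_intent_given_reformulation(likelihoods, prior):
    """Average the per-interaction responsibilities over the set of
    interactions ending with a given reformulation q'.

    likelihoods -- one dict per interaction x, mapping intent t to P(x | t)
    prior       -- dict mapping intent t to P(t)
    """
    intents = list(prior)
    totals = dict.fromkeys(intents, 0.0)
    for lik in likelihoods:
        z = sum(lik[t] * prior[t] for t in intents)
        for t in intents:
            totals[t] += lik[t] * prior[t] / z   # responsibility of intent t for x
    return {t: totals[t] / len(likelihoods) for t in intents}
```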

Recall that within our framework, we consider the context for a suggestion at the $i$-th step to be defined as the previously ranked queries, $D_i$, and the historical context, $H$. In the following, we define two possible representations for the historical part of the context. The first representation includes the previous query only, while the second includes not only the query, but also the documents that the user has clicked or skipped during the session.

Query-only history. Once the query-only historical context is considered, our current belief in the user’s state is determined by the query-dependent intent probabilities $P(t \mid q)$ and the diversification part of the context, $D_i$. Since we assume that the probability of the suggestion to be submitted by the user is independent of the context given the user’s intent $t$, the probability that all previously ranked suggestions have failed to guess the user’s target query given the user’s intent is equal to:

$$P(D_i \mid t, c{=}1) = \prod_{s_j \in D_i} \bigl(1 - P(s_j \mid t, c{=}1)\bigr) \qquad (8)$$
On the other hand, if the user decided not to continue the task, then the assumption about the previously selected candidates leads to the following representation of the probability of the context $D_i$:

$$P(D_i \mid c{=}0) = \prod_{s_j \in D_i} \bigl(1 - P(s_j)\bigr) \qquad (9)$$
We expect the query-only context to be useful since the knowledge of the previous query should dramatically reduce the space of the user’s possible intentions.
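Both failure products described above have the same shape, so a single helper suffices. This is a sketch with a hypothetical name; the callback stands in for either the intent-specific submission probability (continued task) or the background probability (new task):

```python
def p_all_failed(ranked, p_submit_one):
    """Probability that every already-ranked suggestion failed to match
    the user's target query, assuming independence across candidates."""
    prob = 1.0
    for s in ranked:
        prob *= 1.0 - p_submit_one(s)
    return prob
```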

Figure 1: A graphical model of the user behaviour. Grey circles correspond to observed variables.

Query, clicks and skips as history. A more detailed search context includes not only the previous query, but also the clicked and skipped documents for that query. Taking the user intent model proposed in the previous section into account, and assuming that the diversification and historical contexts are independent given the user’s intent $t$ and the search task continuation indicator $c$, the probability of the context $C_i$ can be expressed as follows:

$$P(C_i \mid t, c{=}1) = P(D_i \mid t, c{=}1)\, P(H \mid t, c{=}1) \qquad (10)$$

$$P(C_i \mid c{=}0) = P(D_i \mid c{=}0)\, P(H \mid c{=}0) \qquad (11)$$

$$P(H \mid c{=}0) = \sum_{t \in T_q} P(H \mid t, c{=}0)\, P(t) \qquad (12)$$
with $P(D_i \mid t, c{=}1)$ and $P(D_i \mid c{=}0)$ being defined by Equations (8) & (9), respectively.

As in the previous section, $d_j$ denotes the document in the $j$-th position and $C_j$ is a binary variable representing whether $d_j$ was clicked. According to the user model described there, the probability of observing the historical part of the context is equal to:

$$P(H, c{=}1 \mid t) = \prod_{j=1}^{N} \Bigl( C_j\, a_{d_j,t}\,\bigl(1 - \sigma_{d_j,t}\bigr) + (1 - C_j)\,\bigl(1 - a_{d_j,t}\bigr) \Bigr) \qquad (13)$$
where $N$ is the number of results on a search page. Putting Equations (10), (11), (12) & (13) into the Bayes-rule estimates of the intent and task-continuation posteriors, we can find $P(s \mid C_i)$ with the search context represented by the documents’ clicks and skips. The values of $a_{d,t}$, $\sigma_{d,t}$, $P_t$ and $P(t)$ are estimated from a session log using the EM procedure, as discussed earlier.
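The click/skip likelihood under a single intent can be sketched as follows. This is a hedged illustration with hypothetical names, consistent with the click model above: under a continued task every result was examined and every click failed to satisfy; under an abandoned task the last click satisfied the user and examination stopped there.

```python
def p_history_given_intent(clicks, attract, satisfy, continued):
    """Likelihood of the observed click/skip pattern for the previous
    query under one intent's parameters, jointly with the continuation
    indicator.

    clicks[j]  -- 1 if the j-th result was clicked, else 0
    attract[j] -- attractiveness parameter of the j-th result
    satisfy[j] -- satisfaction parameter of the j-th result
    """
    if continued:
        # c = 1: all results were examined; no click satisfied the user.
        prob = 1.0
        for c, a, s in zip(clicks, attract, satisfy):
            prob *= a * (1.0 - s) if c else (1.0 - a)
        return prob
    # c = 0: the last click satisfied the user; examination stopped there
    # (under the model, c = 0 requires at least one click).
    last = max(j for j, c in enumerate(clicks) if c)
    prob = attract[last] * satisfy[last]
    for j in range(last):
        prob *= attract[j] * (1.0 - satisfy[j]) if clicks[j] else (1.0 - attract[j])
    return prob
```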

xQuAD [?] is one of the state-of-the-art frameworks for web search result diversification, as illustrated by its top performances in recent TREC Web track diversity evaluations (e.g. [?]). The framework builds the diversified result list of documents in a greedy manner, at each step selecting the document that maximises the following objective function:

$$d^{*} = \arg\max_{d} \; (1 - \lambda)\, P(d \mid q) + \lambda \sum_{t} P(t \mid q)\, P(d \mid q, t) \prod_{d' \in D} \bigl(1 - P(d' \mid q, t)\bigr) \qquad (14)$$
where $P(d \mid q)$ is the document’s relevance to the query $q$, $P(t \mid q)$ is the query intent distribution, $P(d \mid q, t)$ denotes the document’s relevance with respect to an intent $t$, $D$ is the set of documents already selected, and $\lambda$ is a free parameter.

Bearing in mind that in query suggestion ranking the query candidates take the place of documents, and comparing Equation (14) with Equation (8) and the derivation above, we can find several noticeable similarities: $\lambda$ corresponds to the probability of search task continuation $P(c{=}1 \mid C)$, the document’s non-diversified relevance $P(d \mid q)$ corresponds to the background probability $P(s)$, and the query’s intent distribution $P(t \mid q)$ corresponds to the intent posterior $P(t \mid c{=}1, C)$. Thus the proposed framework can be seen as a contextualised extension of xQuAD to the query suggestion domain. However, apart from the ability to leverage short-term context dependencies to infer the intent distribution, the two frameworks differ in the way the relevance-diversity (or popularity-relatedness) trade-off is addressed: while xQuAD and its personalised modification [?] rely on a model parameter to adjust the trade-off, our proposed framework uses the context to set the probability of the search task continuation.
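For comparison, the xQuAD objective of Equation (14) can be written down directly. The names and toy distributions below are hypothetical; the structure (relevance term plus novelty-discounted intent coverage) is the published objective:

```python
def xquad_score(d, selected, p_rel, p_intent, p_rel_t, lam):
    """xQuAD greedy objective: non-diversified relevance plus intent
    coverage discounted by how well already-selected documents cover
    each intent."""
    diversity = 0.0
    for t, p_t in p_intent.items():
        # Novelty: probability that no selected document satisfies intent t.
        novelty = 1.0
        for prev in selected:
            novelty *= 1.0 - p_rel_t[t].get(prev, 0.0)
        diversity += p_t * p_rel_t[t].get(d, 0.0) * novelty
    return (1.0 - lam) * p_rel.get(d, 0.0) + lam * diversity
```

The fixed `lam` here is precisely what our framework replaces with the context-dependent task-continuation probability.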

From that point of view, the work of Santos et al. [?] discusses similar ideas to some extent. Indeed, Santos et al. propose a feature-based machine learning approach to predict whether a query aspect implies a navigational or informational search task, and leverage that information in the diversification algorithm.

The preceding sections have defined our proposed framework, introduced a Bayesian model for query and document contexts, and have shown how it relates to an existing document diversification framework. In the following, we define the experimental setup that we use to empirically evaluate our framework’s behaviour within a large commercial search engine.

Our empirical study has the following goals. Firstly, we investigate if the proposed framework to perform a diversified and contextualised ranking leads to improvements over a basic baseline algorithm that ranks query candidates according to their frequency. Secondly, we compare the performance of different possible combinations of historical and diversification contexts, so as to determine which part of the context, historical or diversification, leads to better improvements.

To address these goals, we experiment with the baseline ranking and four different variations of the proposed ranking framework: (a) Ranking contextualised by only the previous query entered by the user; (b) Ranking contextualised by the previous query and the documents clicked and skipped; (c) Ranking diversified and contextualised by the previous query; (d) Ranking diversified and contextualised by the previous query and the clicked and skipped documents.

The case of contextualisation by the previous query only (a) (with an empty diversification context, $D = \emptyset$) corresponds to ranking query candidates according to the probability of their generation from a mixture of the background query stream distribution $P_{bg}$ and the distributions of query reformulations with the user’s task unchanged, $P_t$:

$$P(s \mid q) = P(c{=}0 \mid q)\, P(s) + P(c{=}1 \mid q) \sum_{t \in T_q} P(t \mid q)\, P(s \mid t, c{=}1)$$

The mixing coefficients $P(c{=}0 \mid q)$ and $P(c{=}1 \mid q)$ are query-dependent and can be found by marginalising the intent $t$ out of the joint distribution known from the model learning step.

Dataset. The dataset that we use in our evaluation experiments consists of training and test parts. The training part was generated from Yandex’s query log over the period from June 1 to August 28, 2012; the following two weeks were used to create the test set. We split the user actions into interactions (a part of a session between two queries, as defined earlier) by five minutes of inactivity. A production query suggestion mechanism with near-duplicate queries removed was used to calculate $P(s)$ on the training dataset. In order to avoid sparsity while learning the model parameters, we filter out all queries with fewer than 400 interactions observed during the training period. The test set contains only interactions starting with a query present in the training set. Some descriptive statistics of the datasets can be found in Table 2.

                        Train      Test
time period             3 months   2 weeks
#interactions           19M        960k
#unique reformulations  11M        340k
#unique queries         17k        7k

Table 2: Dataset statistics

Estimating The Model Parameters All model parameters - namely , , and are estimated from the training set using the Deterministic Annealing modification of the EM algorithm [?], as initial experiments found this to obtain the highest performance. A hold-out subset of the training data is used to adjust the parameters of the Beta and Dirichlet priors. While learning the language models of reformulations, the queries are lemmatised and stopwords are removed. The EM procedure used to estimate the model parameters is only able to find a local optima of the objective, thus the results obtained are sensitive to the initialisation of the latent intent variables. Moreover, in order to run the EM optimisation, the number of mixture components (user intents, ) should be determined beforehand. A variety of approaches to address these requirements of the EM procedure have been proposed, e.g. Figueiredo et al. [?] leverage the minimum description length principle to automatically adjust the number of mixture components. However, the optimal choice of initialisation parameters is out of the scope of the paper and, for this reason, we use an entity-based web search intent-mining algorithm both to set the number of intents as well as to initialise the latent intent variables: an interaction is assigned to an intent which is the most likely connected with the last clicked document in the interaction. The number of intents for each query is set equal to the number of intents identified by the web search diversification algorithm. The intent-mining algorithm works in two steps. At the first step, the system analyses the users’ queries and identifies entities occurring (films, books, etc.) in the queries. This process is weakly supervised and relies on query template mining. At the next step, each entity is classified into one or more manually predefined categories, based on category indicators that frequently co-occur with the entity in the users’ queries. 
These steps are related to the algorithm described by Paşca [?]. The category indicators and a set of possible intents for a category are extracted from the query log in a semi-automated manner. For instance, the query [casablanca] will be classified into both the “city" and “film" categories, with the “film" category having intents such as “buy a dvd" or “reviews". Since the underlying web result diversification algorithm is used only for the EM initialisation, the proposed learning algorithm does not depend on its implementation and can be used with any diversification algorithm. In fact, given some reasonable initialisation, the algorithm itself is capable of extracting latent intents from a query log. Due to the EM initialisation scheme, all queries without intents known from the underlying web search diversification algorithm are removed from the dataset. In order to speed up the learning process, we restrict each query to have no more than 5,000 associated interactions, uniformly sampling the required number of interactions for highly frequent queries. The optimisation is terminated either after performing 75 iterations or when the difference in log-likelihood between two consecutive iterations falls below a fixed threshold. Since the intent model mixtures are trained on a per-query basis, the learning process can be easily parallelised within a MapReduce framework [?].
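To make the estimation procedure concrete, the annealed EM loop can be sketched as follows. This is a simplified, hypothetical implementation, assuming unigram intent language models over reformulation terms and a single symmetric Dirichlet smoothing prior; the function name, signature and temperature schedule are illustrative, not the paper’s actual code:

```python
import numpy as np

def dae_em(counts, K, temps=(4.0, 2.0, 1.0), iters=75, tol=1e-6, rng=None):
    """Deterministic-annealing EM for a mixture of K unigram intent models.

    counts: (N, V) matrix of term counts over N interactions and a vocabulary
    of size V. Returns intent priors pi (K,) and intent language models
    theta (K, V). Hypothetical sketch: the paper's exact priors are omitted.
    """
    rng = np.random.default_rng(rng)
    N, V = counts.shape
    pi = np.full(K, 1.0 / K)                    # uniform intent priors
    theta = rng.dirichlet(np.ones(V), size=K)   # random intent language models
    prev_ll = -np.inf
    for T in temps:                             # anneal the temperature down to 1
        for _ in range(iters):
            # E-step: annealed responsibilities r_nk ∝ (pi_k p(d_n|k))^(1/T)
            log_p = np.log(pi) + counts @ np.log(theta).T   # (N, K)
            log_r = log_p / T
            log_r -= log_r.max(axis=1, keepdims=True)
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)
            # M-step with add-one (symmetric Dirichlet) smoothing
            pi = r.sum(axis=0) / N
            theta = r.T @ counts + 1.0
            theta /= theta.sum(axis=1, keepdims=True)
            # stop this stage once the log-likelihood stabilises
            ll = np.logaddexp.reduce(log_p, axis=1).sum()
            if abs(ll - prev_ll) < tol:
                break
            prev_ll = ll
    return pi, theta
```

High temperatures flatten the responsibilities early on, which is what makes the annealed procedure less sensitive to the initialisation than plain EM.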

Evaluation Unfortunately, there is no commonly accepted metric in the literature for evaluating query suggestion quality. Shokouhi et al. [?] evaluated the quality of suggestions for a given prefix by the reciprocal rank of the most popular result, and by the Spearman correlation between the predicted and ground-truth ranks of the queries. These metrics were averaged over a set of test prefixes; a query is considered relevant if it is top-ranked according to the ground-truth query frequencies. Strizhevskaya et al. [?] reported P@3, AP@3 and nDCG@3 averaged over the observed prefixes. We use the same session log-based scenario as Bar-Yossef et al. [?], with two minor changes. Since building the diversified ranking list has quadratic computational complexity with respect to the number of suggestion candidates, we perform diversification in two steps. In the first step, we use the corresponding non-diversified contextualised ranking to find the 100 top-scored candidates. Next, we perform the diversified re-ranking of these candidates. As a result of this scheme, weighting the scores by the number of candidates becomes less justified. Further, as query suggestion lists are usually no longer than 10 items, cut-off levels higher than 10 do not reflect the user’s actual experience. For all these reasons, we report mean reciprocal rank (MRR) at cut-off level 10 in our experiments. Following [?], we use a prefix of length 3 to filter the query candidates. Recently proposed metrics such as pSaved/eSaved [?] also assume the query log-based evaluation approach and could be used as well.
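The two-step diversified ranking can be sketched with a hypothetical IA-Select-style greedy re-ranker (in the spirit of Agrawal et al.); `intent_probs` and `rel` are assumed inputs, namely the intent distribution given the search context and per-intent relevance estimates for each candidate, not quantities the paper defines under these names:

```python
def diversified_rerank(candidates, intent_probs, rel, top_n=100, cutoff=10):
    """Two-step re-ranking sketch.

    Step 1: keep the top_n candidates by non-diversified contextual score.
    Step 2: greedily build a diversified list, discounting each intent's
    residual weight once a selected candidate has covered it.
    intent_probs: {intent: P(intent | context)} (hypothetical input);
    rel: {candidate: {intent: relevance estimate}} (hypothetical input).
    """
    def score(candidate, weights):
        return sum(weights[i] * rel[candidate].get(i, 0.0) for i in weights)

    # Step 1: linear-cost filtering with the non-diversified ranking.
    pool = sorted(candidates, key=lambda c: score(c, intent_probs),
                  reverse=True)[:top_n]
    # Step 2: quadratic greedy selection over the reduced pool.
    remaining = dict(intent_probs)   # residual intent weights
    ranking = []
    while pool and len(ranking) < cutoff:
        best = max(pool, key=lambda c: score(c, remaining))
        ranking.append(best)
        pool.remove(best)
        for i in remaining:          # discount intents covered by `best`
            remaining[i] *= 1.0 - rel[best].get(i, 0.0)
    return ranking
```

With two near-duplicate candidates serving the same dominant intent, the discounting step demotes the second one in favour of a candidate covering a different intent, which is exactly the redundancy-removal behaviour the diversification context is meant to produce.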

In a web search setting, the evaluation of diversification algorithms usually implies labelling documents manually by judges and calculating intent-aware quality metrics, such as ERR-IA [?]. In contrast, we measure diversification success as the ability to rank a suggestion candidate preferred by a user higher, using a query log as evidence of that preference and MRR as a metric. We believe that a query log-based approach better reflects the user experience. However, it has some drawbacks: for instance, it assumes that only the query the user submitted can satisfy her, even if semantically similar queries, which the user might not have seen in her session, are ranked higher. We thus consider the benefit obtained from the contextualisation and diversification of the query suggestions in our experiments as a lower bound of the real improvement.
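Under this log-based protocol, the MRR@10 computation itself is straightforward; the sketch below treats `suggest(prefix, context)` as a hypothetical interface returning a ranked candidate list, and scores the submitted query as the only relevant candidate:

```python
def mrr_at_10(sessions, suggest, prefix_len=3, cutoff=10):
    """Offline MRR@10 over logged interactions.

    sessions: iterable of (context, submitted_query) pairs from the log;
    suggest(prefix, context): hypothetical ranked-suggestion interface.
    The query the user eventually submitted is treated as the only relevant
    candidate, so the result is a lower bound on the real quality.
    """
    total, n = 0.0, 0
    for context, submitted in sessions:
        # filter candidates by the first prefix_len typed characters
        ranking = suggest(submitted[:prefix_len], context)[:cutoff]
        # reciprocal rank of the submitted query, 0 if absent from the top list
        rr = next((1.0 / r for r, c in enumerate(ranking, 1) if c == submitted), 0.0)
        total += rr
        n += 1
    return total / n if n else 0.0
```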

Overall, to the best of our knowledge, our work is the first to evaluate the effects of diversification using a query log-based offline approach.

#interactions QCntx QCntxDiv FCntx FCntxDiv
960k +0.302 +0.304 +0.307 +0.307
715k +0.509 +0.513 +0.518 +0.519
412k +0.811 +0.818 +0.825 +0.826
224k +0.977 +0.987 +1.001 +1.003
Table: Relative improvements in MRR@10 over the baseline after the user submits the first three characters of the query, grouped by the length of the second query in words.

In this section, we report the evaluated quality of different combinations of contexts. Moreover, as longer queries are harder to predict given the first three characters, we obtain additional insights into the framework’s performance by varying the length of the second query in the experiments. The results are presented in the first results table above. We use the following abbreviations: QCntx corresponds to the ranking with the previous query as a context; FCntx corresponds to the non-diversified ranking with the previous query, clicks and skips as a context; QCntxDiv and FCntxDiv correspond to the versions of QCntx and FCntx with the diversification context added.

Due to the proprietary nature of the system, we report only relative improvements over the baseline (e.g. +0.302 denotes a 30.2% relative improvement). All pairwise differences of these improvements (i.e., for any two cells on a single row with a non-zero difference reported) are statistically significant according to the paired t-test.
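For reference, the paired t-statistic over per-interaction metric values can be computed with the standard library alone; this hypothetical helper returns only the statistic, whose significance would then be read from the t-distribution with n-1 degrees of freedom (e.g. via a statistical table or scipy.stats):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic for per-interaction metric values of two rankers.

    scores_a, scores_b: equal-length sequences, e.g. the reciprocal rank
    achieved by each ranker on the same logged interactions.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0.0:
        raise ValueError("zero variance: t-statistic undefined")
    return mean(diffs) / (sd / math.sqrt(n))
```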

#interactions QCntx QCntxDiv FCntx FCntxDiv
730k +0.201 +0.203 +0.201 +0.202
515k +0.317 +0.322 +0.316 +0.317
230k +0.770 +0.772 +0.798 +0.798
200k +1.268 +1.269 +1.315 +1.315
Table: Relative improvements in MRR@10 over the baseline after the user submits the first three characters of the query, grouped by whether the second query contains the first query as a prefix and by the length of the second query in words.

On analysing the first results table, we observe that contextualisation leads to a considerable improvement over a basic query suggestion approach that simply ranks candidates by their frequency. This agrees with the results reported in [?]; however, the values of the improvements are not directly comparable.

A noteworthy observation is that the improvement from the context increases as the length of the second query in words grows. This seems reasonable, since longer queries are harder for the baseline approach to predict given the three-character prefix. Moreover, longer queries are associated with reformulation behaviour, where a user dissatisfied with the first query tries to specify it using additional keywords. In this scenario, the user’s search task continues and hence contextualisation is beneficial.

For all considered subsets of queries, adding the document examination context leads to further increases in the contextualisation performance. Moreover, the relative improvement from the additional contextual information grows with the query length. We believe that a richer context allows the framework to infer the possibility of task continuation and the user’s intent with higher confidence, thus yielding better results on sessions with reformulation behaviour.

In order to support the idea that query specification affects the benefit of contextualisation, we additionally consider the case of the second query containing the first as a prefix and report the results in the second results table. Indeed, we can see that contextualisation exhibits considerable performance improvements for interactions where the second query specifies the first one. In addition, for such interactions, the relative gain from adding click behaviour reaches its maximum. Our intuition behind this observation is that, given the richer search context, the framework is able to contextualise the candidates more aggressively; since for those interactions the second query is indeed related to the previous task, this results in significant improvements. On the other hand, contextualisation is also useful for interactions where the second query does not contain the first one as a prefix, demonstrating that the relatedness of queries goes beyond simple prefix-similarity.

In our experiments, the benefit of the diversification context is less marked than the benefit of the historical context (though statistically significant), especially when we have more evidence about the user intent (FCntx). This observation makes sense, since diversification is, by its nature, a method to address our uncertainty in the user’s search task, while contextualisation leverages information to predict the task and rank suggestion candidates accordingly. An interesting observation from the second results table is that if the user does not continue her task (which is not known at the time of ranking suggestion candidates), then the improvement from adding the diversification context is higher than in the opposite case. This observation supports the idea of diversification as a tool to mitigate uncertainty in the user’s intentions, making it more useful for users not continuing their search tasks and less useful for users continuing them.

On the other hand, this is not the case when ranking with a richer context, FCntx, possibly due to the fact that the click behaviour context allows the framework to infer the continuation and the user’s intent with a higher level of confidence.

Overall, our results support the benefit of enriching the search context with the document examination behaviour as an approach to improve the user’s satisfaction with the query suggestion mechanism. Further adding the diversification context does not hurt performance and results in small, though statistically significant, improvements in some cases.

To conclude, we find that our proposed framework is able to perform an effective contextualised ranking of query suggestions, by handling ambiguity in the user’s task.

In this paper, we presented a novel framework that performs a contextualised ranking of query suggestions, where the context encompasses the user’s previous query, the documents previously clicked and skipped, and the query suggestions already examined. In contrast to the approaches previously discussed in the literature, the proposed framework is capable of combining contextualisation and diversification in a uniform manner. To do so, the diversity requirement is represented as an intrinsic part of the user’s search context.

We experimented with two types of historical evidence for the search context: the first contains the previous query only, while the second additionally contains the documents clicked and skipped during the user’s interaction with the previous query. In order to infer the user’s intentions from their examination behaviour, we described an approach to model the user behaviour as a mixture of intent models. Our empirical study using a 3.5 month query log encapsulating about 20M interactions demonstrates that the proposed framework ranks query suggestions better than a baseline approach (approximately a relative improvement on the test set). Our results also show that enriching the search context with a finer-grained representation of user behaviour leads to further improvements in the suggestion ranking quality. Indeed, the FCntxDiv ranking with the richest context considered (the user’s previous query, document examination history, and diversification context) exhibits the best performance on all the considered subsets of queries and attains a relative improvement over the baseline in one of the experiments. A possible direction of future work is to apply the same approach to contextualise and diversify web search results.

  • [1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In WSDM 2009.
  • [2] Z. Bar-Yossef and N. Kraus. Context-sensitive query auto-completion. In WWW 2011.
  • [3] H. Cao, D. Jiang, J. Pei, E. Chen, and H. Li. Towards context-aware search by learning a very large variable length hidden markov model from search logs. In WWW 2009.
  • [4] H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li. Context-aware query suggestion by mining click-through and session data. In KDD 2008.
  • [5] O. Chapelle, S. Ji, C. Liao, E. Velipasaoglu, L. Lai, and S.-L. Wu. Intent-based diversification of web search results: metrics and algorithms. Information Retrieval, 14(6):572–592, 2011.
  • [6] O. Chapelle and Y. Zhang. A dynamic bayesian network click model for web search ranking. In WWW 2009.
  • [7] C. L. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2011 Web track. In Proc. of the 20th Text REtrieval Conference, TREC ’11.
  • [8] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
  • [9] M. A. T. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(3):381–396, 2002.
  • [10] E. Kharitonov, C. Macdonald, P. Serdyukov, and I. Ounis. Intent models for contextualising and diversifying query suggestions. In CIKM 2013.
  • [11] E. Kharitonov, C. Macdonald, P. Serdyukov, and I. Ounis. User model-based metrics for offline query suggestion evaluation. In SIGIR 2013.
  • [12] H. Ma, M. Lyu, and I. King. Diversifying query suggestion results. In AAAI 2010.
  • [13] U. Ozertem, O. Chapelle, P. Donmez, and E. Velipasaoglu. Learning to suggest: a machine learning framework for ranking query suggestions. In SIGIR 2012.
  • [14] M. Paşca. Weakly-supervised discovery of named entities using web search queries. In CIKM 2007.
  • [15] R. L. T. Santos, C. Macdonald, and I. Ounis. Exploiting query reformulations for web search result diversification. In WWW 2010.
  • [16] R. L. T. Santos, C. Macdonald, and I. Ounis. Intent-aware search result diversification. In SIGIR 2011.
  • [17] M. Shokouhi. Learning to personalize query auto-completion. In SIGIR 2013.
  • [18] M. Shokouhi and K. Radinsky. Time-sensitive query auto-completion. In SIGIR 2012.
  • [19] R. Song, Z. Luo, J.-R. Wen, Y. Yu, and H.-W. Hon. Identifying ambiguous queries in web search. In WWW 2007.
  • [20] Y. Song, D. Zhou, and L. He. Post-ranking query suggestion by diversifying search results. In SIGIR 2011.
  • [21] D. Sontag, K. Collins-Thompson, P. N. Bennett, R. W. White, S. Dumais, and B. Billerbeck. Probabilistic models for personalizing web search. In WSDM 2012.
  • [22] A. Strizhevskaya, A. Baytin, I. Galinskaya, and P. Serdyukov. Actualization of query suggestions using query logs. In WWW 2012 Companion.
  • [23] N. Ueda and R. Nakano. Deterministic annealing EM algorithm. Neural Networks, 11(2):271–282, 1998.
  • [24] D. Vallet and P. Castells. Personalized diversification of search results. In SIGIR 2012.
  • [25] X. Yan, J. Guo, and X. Cheng. Context-aware query recommendation by learning high-order relation in query logs. In CIKM 2011.
  • [26] C. Zhai. Statistical language models for information retrieval a critical review. Found. Trends Inf. Retr., 2(3):137–213, 2008.