
# Learning Optimal Card Ranking from Query Reformulation

## Abstract

Mobile search has recently been shown to be the major contributor to the growing search market. The key difference between mobile search and desktop search is that information presentation is limited to the screen space of the mobile device. Thus, major search engines have adopted a new type of search result presentation, known as information cards, in which each card presents summarized results from one domain/vertical, for a given query, to augment the standard blue-links search results. While it has been widely acknowledged that information cards are particularly suited to the mobile user experience, it is also challenging to optimize such result sets. Typically, user engagement metrics like query reformulation are defined on the whole ranked list of cards for each query, whereas most traditional learning-to-rank algorithms require per-item relevance labels. In this paper, we investigate the possibility of interpreting query reformulation into effective relevance labels for query-card pairs. We inherit the concept of conventional learning-to-rank, and propose pointwise, pairwise and listwise interpretations for query reformulation. In addition, we propose a learning-to-label strategy that learns the contribution of each card, with respect to a query, where such contributions can be used as labels for training card ranking models. We utilize a state-of-the-art ranking model and demonstrate the effectiveness of the proposed mechanisms on large-scale mobile data from a major search engine, showing that models trained from labels derived from user engagement can significantly outperform ones trained from human judgment labels.

Categories and Subject Descriptors: H.3.5 [Information Storage and Retrieval]: Online Information Services

General Terms: Design, Theory, Experimentation

Keywords: Card Ranking, Federated Search, Mobile Search, Labeling, Online Metrics, Reformulation

## 1 Introduction

Mobile search has recently been reported to be the major contributor to the search market. The key difference between mobile search and traditional desktop search is that information presentation is constrained to the screen space of a mobile device. For this reason, major search engines have adopted a new type of search result presentation, known as information cards, in which, for a given query, each card presents summarized results from one domain/vertical. For instance, when a user types “Local Restaurants” on a mobile device, modern search engines can directly pull out maps and relevant restaurant information organized in a clean and concise way, as shown in Figure 1. Another example is that, given the query “Brad Pitt”, a PersonCard can be triggered to present relevant information about this person, e.g., bio, photos and recent movies. In addition, a NewsCard can also be triggered with the latest news articles concerning the actor. The core concept is that, with one or a small number of cards, a user’s information needs can be satisfied directly, without scrolling down to the conventional web search results. It has been widely recognized that information cards are well-suited to mobile devices and have greatly improved the mobile search experience.

While information cards can effectively augment traditional web results, the definition of a successful search, or whether a user is satisfied with a particular ranking of cards with respect to a query, is becoming more difficult to establish. In the conventional notion of relevance, each search result is judged on whether it is relevant to a query using a graded scale. Such judgments are static and unique for a query-search result pair. Nevertheless, these judgments laid the foundation for a number of search metrics like Normalized-Discounted-Cumulative-Gain (NDCG) [14] or Expected-Reciprocal-Rank (ERR) [8]. In recent years, researchers also found that click-based implicit feedback [17] is a very important signal for relevance. However, for the problem of card ranking, the above feedback mechanisms fall short. For instance, the relevance between a card and a query can be temporal, and the relative ordering of such relevance among different cards might be subtle and can be very different for different users. Additionally, click-based signals may not be available. For example, a Question-to-Answer card (Q2ACard) could be relevant to a query if the user sees it and finds that it directly answers his/her question, thereby not requiring any further action on the user’s part. Therefore, it is impossible to depend solely on click-based signals as labels for card ranking. In order to mitigate these issues, search engines have started resorting to query reformulation, i.e., whether a user quickly rewrites his/her current query into a similar query, as one of the key online metrics to evaluate the quality of search results [12]. The core assumption is that if a user is satisfied with the search results, he/she is less likely to reformulate the query to search further within that query session. The same metric has also shown great impact on mobile search [22]. For this reason, we can expect that a good card ranking model should, for each query, rank the cards in a way that the user is less likely to reformulate his/her query. Therefore, the target of a card ranking model is ultimately to reduce the number of query reformulations.

However, it is challenging to optimize query reformulations directly for two reasons. First, a query reformulation event is not defined on a per query-result level; it is usually realized when a user inputs two similar queries consecutively (a predefined threshold is used to determine whether two queries are similar or not). Therefore, how to translate an observation of query reformulation from the query level to the level of each query-card pair can be seen as the core research challenge. Second, many cards can be relevant to a query even without any click (e.g., Q2ACard and WeatherCard). Thus, conventional labeling strategies based on users’ implicit feedback, such as clicks and skips on the search results [17], are not applicable to cards. One may consider pre-defining, for a given card, the types of user interaction that count as positive/negative labels, e.g., taking the interaction of “viewing with no click” as the positive label for Q2ACard. However, this strategy cannot scale in practice, where we could have a large number of cards, requiring tremendous effort to predefine all types of user interaction for each card.

In this work, we specifically investigate the possibility of interpreting query reformulations into effective relevance labels for training card ranking models. On a high level, a card ranking model shall take features extracted from queries, query-card pairs and lists of cards as input and output a ranked list of cards for the query. Although remarkable feature engineering is required, we emphasize that the key to training an effective ranking model lies in the quality of labels, which provides the learning basis of the relevance between queries and cards. In this paper, we aim to propose and discuss several strategies to derive labels for card ranking from query reformulations. Thus, card ranking models can be trained to optimize such metrics in online systems. The key research question we attempt to address in this paper is:

• How to learn from query reformulation for labeling query-card pairs?

We inherit the concept of conventional learning-to-rank, and propose pointwise, pairwise and listwise interpretations for query reformulations. In addition, we propose a learning-to-label strategy that learns the rewards for the cards shown to each query. The rewards can be further used as labels for training card ranking models. Note that the focus of this paper is not on the choice of ranking models. Throughout the paper, we use Gradient Boosted Trees (GBT) as the model [11], which has been shown to be an effective ranking model for web search, including the KDD CUP of Learning to Rank competition [5].

Our main contributions in this paper are summarized as follows:

1. We systematically analyze online users’ behaviors with respect to query reformulations and demonstrate that it is difficult to optimize such a user engagement metric through a conventional learning-to-rank procedure based on human judgments.

2. We propose several strategies to derive labels from query reformulations and provide a guideline to train ranking models based on them.

3. We compare these mechanisms on a large-scale data set from a major search engine and show that ranking models trained with the proposed methods can significantly outperform the ones trained from human labels.

The paper is organized as follows. In Section 2, we review the related work, followed by Section 3, in which we present the details of the proposed card ranking labeling strategies. In Section 4, we describe the ranking model used in our work. Then, in Section 5, we elaborate on our evaluation methodology and demonstrate the performance of the proposed labeling strategies. Section 6 concludes the paper.

## 2 Related Work

Our work is related to research topics of heterogeneous web search, including vertical selection and federated search, and the work of deriving labels from user implicit feedback. In this section, we discuss each of these aspects, and position our work within that research spectrum.

Heterogeneous Web Search: Selection of one or a few relevant information domains has been extensively studied as the problem of vertical selection. Vertical selection is one of the earliest research efforts to integrate heterogeneous information in specific domains into conventional web search. Note that the notion of “vertical” often used in desktop web search is equivalent to the notion of “card” in mobile search, as both are presented as a block of information from a particular domain in a search result page. Diaz [9] first studied selection models for the news domain. Arguello et al. [1] investigated the vertical selection problem on multiple domains, such as images and videos. As mentioned earlier, research on vertical selection has traditionally focused on choosing a few relevant domains, and thus the methods and models developed so far are inherently based on binary or multi-class classification (e.g., [9]), where a decision is made for each vertical on whether it is selected or not. While it is possible to obtain a ranking from independent binary decisions, the relative ordering of verticals was not modeled explicitly in those works. Aggregated search was then established as another research topic, dedicated to building models for ranking multiple verticals. In this sense, the notion of “aggregated search” is technically equivalent to the notion of “card ranking” in mobile search. Note that the area of federated search [29] in the literature is similar to aggregated search, since both are built by merging information from a variety of verticals. Arguello et al. [2] made a further contribution to aggregated search through a comprehensive analysis of both feature engineering and modeling approaches. This work is closely related to ours in the sense that we also aim to build effective card ranking models. However, the difference is still significant, since our focus is on deriving card ranking labels from users’ reformulation events, while the earlier work of [2] relied on human assessments, which, as discussed in Section 1, are not scalable. For more related literature along this line, please refer to [29].

Labeling from User Feedback: Most previous work resorts to human experts to judge relevance labels in the problems of vertical selection and aggregated search (e.g., [2]). Recently, a few contributions have been made to exploit implicit feedback signals from online users to derive query-vertical (comparable to query-card) relevance. In particular, a large body of those contributions is in a similar spirit to the fundamental work by Thorsten et al. [17], which exploits user implicit feedback, i.e., clicks and skips, for optimizing conventional web search. Ponnuswami et al. [25] provided a method to derive labels of verticals based on user click data. A label for a particular query-vertical pair is determined by whether the vertical has received a click from the user and by the position of the vertical relative to the first web results block. Jie et al. [16] cast the vertical ranking problem as a multi-armed bandit problem and learned a regression function to predict the rewards for each vertical shown at different positions. The rewards used in their paper are defined by click-skip actions from users on each vertical. As mentioned in Section 1, those works are inherently not applicable to cases where the relevance between a vertical and a query cannot be measured based on users’ click/skip actions. In contrast, our work is directly motivated by the online metric of query reformulations, which is completely decoupled from users’ click data. It is worth mentioning that one early work in conventional web search has exploited query reformulations for deriving the labels of web results [18]. However, that work only uses query reformulations to extend the click/skip-based labeling strategy, i.e., it still relies on user click data. For this reason, our work is substantially different from previous work, not only because we tackle the problem of card ranking rather than that of conventional web search, but also because we propose new strategies to exploit query reformulations, independent of click signals.

Information Cards: To our knowledge, there is no prior work on the problem of card ranking in mobile search. One of the latest works, by Shokouhi and Guo [27], was among the first to present the problem of serving cards to mobile users. However, their work and follow-up papers like [32] addressed the problem under the proactive search setting, i.e., generating card recommendations without queries from the user. Our work, on the other hand, addresses the typical search problem in which the cards are ranked in response to the user’s explicit queries.

Learning to Rank: This work is related to the field of learning-to-rank (LtR), but with a significant difference. The main focus of the LtR literature is to investigate ranking models for settings where relevance labels are available. Even in the so-called listwise case (e.g., [6]), where ranking models are explicitly trained against a list of results for a particular query, the availability of relevance labels for each query-result pair is a prerequisite. Indeed, as mentioned above, the classic setting of LtR [7] requires human relevance judgments, and these labels are provided on the query-result level, which is different from our setting. For a more thorough discussion of LtR, please refer to [24].

Note that we are aware of research efforts to define more fine-grained or, to some extent, better user engagement metrics than query reformulations for search results, such as dwell time [21] or more complex task-level satisfaction metrics [15]. We use query reformulations in this paper as a reasonable starting point and leave extensions on deriving labels from other, more advanced metrics for future work.

## 3 Labels for Card Ranking

Our input data consists of query-page-view (QPV) events where each QPV . Here, represents a query, drawn from where is the number of distinct queries. The input data could contain multiple occurrences of the same query. In , is a ranking of a (sub)set of cards where each card is drawn from a whole set of cards . The ranking represents that for a query , the card is ranked higher than and is ranked higher than . For each , we define as the induced set of cards from the ranking. In the QPV tuple, , the label, is if is not reformulated and if is reformulated into another query . Multiple queries can form a reformulation chain if is for multiple consecutive queries. Note that how is derived is out of the scope of this paper and we treat it as given. Roughly speaking, if the next query is very similar to the current query, we believe that the user is not satisfied with the current query and thus make its label negative. Another key point is that labels are defined at the QPV level, not at the query level. Thus, for the same query, different QPV events might have different labels.
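To make the data model concrete, the following is a minimal Python sketch of a QPV record and of how consecutive reformulated QPVs can be grouped into a chain. The class and field names (`QPV`, `query`, `ranking`, `label`), the helper `reformulation_chains`, and the encoding of the label as +1 (not reformulated) and -1 (reformulated) are illustrative assumptions, not notation taken from this paper.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class QPV:
    """One query-page-view event (field names are hypothetical)."""
    query: str          # the issued query
    ranking: List[str]  # card types, highest-ranked first
    label: int          # assumed encoding: +1 not reformulated, -1 reformulated

    @property
    def cards(self) -> Set[str]:
        """The induced (unordered) set of cards shown in this QPV."""
        return set(self.ranking)


def reformulation_chains(session: List[QPV]) -> List[List[QPV]]:
    """Group a session into chains: a chain runs through consecutive
    reformulated QPVs (label -1) and ends at the first QPV that is not
    reformulated (label +1), or at the end of the session."""
    chains: List[List[QPV]] = []
    current: List[QPV] = []
    for qpv in session:
        current.append(qpv)
        if qpv.label == +1:
            chains.append(current)
            current = []
    if current:
        chains.append(current)
    return chains


session = [
    QPV("obama", ["NewsCard", "PersonCard", "WebCard"], -1),
    QPV("obama news", ["NewsCard", "WebCard"], +1),
    QPV("weather", ["WeatherCard", "WebCard"], +1),
]
chains = reformulation_chains(session)
```

Because the label lives on the `QPV` record rather than on the query string, two records with the same `query` can carry different labels, which matches the QPV-level definition above.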

The goal of a card ranking model is to provide for each based on query features, card-level features, user features and contextual features. In this section, we argue that the key challenge in training the ranking model, however, lies in how to define the relevance label for each query-card pair based on user feedback. As discussed in Section 1, the card ranking model should be optimized for reducing query reformulations. Therefore, we propose a few strategies to label each query-card pair based on users’ query reformulation activity.

Note that, we define a card as a composite unit of multiple information widgets to serve a specific purpose or a task. In other words, even if the actual content of a card varies, two cards are considered the same type of cards when they are serving the same specific purpose. For instance, a NewsCard may have different news titles and links for two similar queries like “Obama” and “Obama News” but they are essentially the same card type. Thus, we use the notion of “card” to represent a particular card type throughout the paper.

### 3.1 Pointwise Labeling

The first and most intuitive proposal we have is to directly translate a query-reformulation event from the query level to the set of cards that are involved in each QPV. Namely, we need to derive for , the label for card , based on . We start with the most straightforward one, as below:

Under this definition, we treat all query-card pairs that appeared in one QPV, which is later reformulated, as equally negative examples, while all query-card pairs that appeared in one QPV which is reformulated from a query but not further reformulated are treated as equally positive examples. Note that we derive labels only for cards in the last QPV of a chain that has the negative label and ignore all earlier ones, because users might explore their information needs through a chain of queries and query reformulations may not be equally bad in those cases. Treating all for all no matter is an interesting future work. We illustrate this strategy in the following example. Suppose we have and and their corresponding ranked cards are shown below:

Then, we can derive query-card labels from this reformulation activity as shown below:

We can see that this labeling strategy has no consideration on the relative ordering of cards. For example, an ideal model trained from these labels would have the same relevance prediction for , , and for the QPV , while the relative ordering among them remains uncertain. For this reason, we take into account the rank position of each card in a query session to assign a weight to the corresponding label, leading to the following strategy.

The main assumption behind this strategy is that the higher a card is ranked in a page, the larger the impact it has on the user’s decision. For Example Equation 1, we may assume that the user decided to reformulate mainly because he/she was not satisfied with , the top card in . Therefore, we introduce an NDCG-style discounting function to encode the impact of the relative ordering of cards. Note that throughout the paper we use rank positions starting from 1. For Example Equation 1, we could derive the following labels:

As we can see, this labeling strategy not only translates positive and negative information to each card but also keeps the relative ordering. For positive examples, it encourages the ranking that matches the QPV, while for negative ones, it penalizes the ranking that matches the data.
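The two pointwise strategies discussed so far can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the QPV-level label is encoded as +1 or -1, and the NDCG-style discount is taken to be 1 / log2(1 + rank) with ranks starting from 1. The function names and the exact discount form are assumptions made for this sketch.

```python
import math
from typing import Dict, List


def pointwise_labels(ranking: List[str], qpv_label: int) -> Dict[str, float]:
    """Copy the QPV-level label to every card shown in the QPV."""
    return {card: float(qpv_label) for card in ranking}


def discounted_pointwise_labels(ranking: List[str], qpv_label: int) -> Dict[str, float]:
    """Weight the QPV-level label by an assumed NDCG-style discount,
    1 / log2(1 + rank), with rank positions starting from 1."""
    return {
        card: qpv_label / math.log2(1 + rank)
        for rank, card in enumerate(ranking, start=1)
    }


# A reformulated QPV (assumed label -1) showing three cards.
ranking = ["NewsCard", "PersonCard", "WebCard"]
plain = pointwise_labels(ranking, -1)
discounted = discounted_pointwise_labels(ranking, -1)
```

In the plain variant all three cards receive the same label, so no ordering can be learned from this QPV alone. In the discounted variant the top card carries the largest share of the penalty and the magnitude shrinks with rank.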

Up to now, the above two labeling strategies are still based on an individual QPV, i.e., when we define labels for QPV , we do not take into account observations from QPV . Therefore, the pointwise ranking model might not be effective in capturing the relative difference between two consecutive QPVs, and thus might miss the information that affects the user’s decision to reformulate. We present a labeling strategy which specifically addresses this concern. The main assumption here is that a query reformulation satisfies a user’s information need because it moves relevant cards up in the list and, at the same time, moves irrelevant cards down in the list, or brings relevant cards into the list. Specifically, we define a movement-based pointwise labeling strategy as follows:

The key ingredient is to define the function . We have five possibilities of card movement during a reformulation process, as described below:

1. In the case of for , as , we interpret that moving up in contributes to the user’s satisfaction.

2. When for , as , we interpret that moving down in may not be relevant to the user’s need.

3. When for , as , our interpretation in this case is that the card stayed in the same position has neutral impact, compared to other cards, on the user’s satisfaction.

4. If for but , it is possible that the user is satisfied because of the information brought by the newly appeared card. For this reason, we interpret that the appeared card has a positive contribution to the user’s satisfaction.

5. If for but , this means that the user can still be satisfied without such a card type. In other words, the disappeared card is not relevant to the user’s information need. Even if we have to include such a card in the list, we expect to rank it in a low position.

Considering all the five cases, we formulate the label based on the function as below:

in which and are the default values for the cards appeared or disappeared in the reformulation. We illustrate this strategy in an example similar to the one shown before, i.e., and as below:

Then, based on Movement-based Pointwise Labeling we can derive query-card labels as shown below:

In our work, we empirically tested the choice of and , and we found that a moderate magnitude of their values is satisfactory for the card ranking performance. As shown in Section 5, we choose to set and . Note also that two design choices underlie our strategy. First, we chose a specific formulation of the function , which may also be formulated differently. Our experimental evaluation shows that such a choice results in reasonable card ranking performance, and we leave a more elaborate design of this function to future work. Second, our labeling strategy takes the viewpoint of in the reformulation, as shown in the example, . We may also take the viewpoint of , which would result in a symmetric and equivalent labeling outcome.
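The five movement cases can be sketched as follows. This is a minimal illustration rather than the exact formulation used in our experiments: the movement function is assumed here to be the signed rank difference (previous rank minus new rank), and `appear_value` and `disappear_value` are placeholder defaults chosen only for the example.

```python
from typing import Dict, List


def movement_labels(
    prev_ranking: List[str],
    next_ranking: List[str],
    appear_value: float = 1.0,      # placeholder default for newly appeared cards
    disappear_value: float = -1.0,  # placeholder default for disappeared cards
) -> Dict[str, float]:
    """Label cards by how they moved from the reformulated QPV
    (prev_ranking) to the QPV that was not reformulated (next_ranking)."""
    prev_pos = {card: rank for rank, card in enumerate(prev_ranking, start=1)}
    next_pos = {card: rank for rank, card in enumerate(next_ranking, start=1)}
    labels: Dict[str, float] = {}
    for card in set(prev_pos) | set(next_pos):
        if card in prev_pos and card in next_pos:
            # Assumed movement function: positive when the card moved up,
            # negative when it moved down, zero when it stayed in place.
            labels[card] = float(prev_pos[card] - next_pos[card])
        elif card in next_pos:
            labels[card] = appear_value     # card appeared after reformulation
        else:
            labels[card] = disappear_value  # card disappeared after reformulation
    return labels


prev_ranking = ["NewsCard", "PersonCard", "WebCard"]   # reformulated QPV
next_ranking = ["PersonCard", "ImageCard", "WebCard"]  # QPV that satisfied the user
labels = movement_labels(prev_ranking, next_ranking)
```

In this toy case PersonCard moves up and gets a positive label, WebCard stays in place and gets a neutral label, ImageCard appears and receives the positive default, and NewsCard disappears and receives the negative default.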

### 3.2 Pairwise Labeling

The strategies presented in the previous subsection focus on deriving labels for each query-card pair. Here, we discuss a labeling method to obtain pairwise preferences between two cards from the data. The pairwise labeling strategy allows us to identify the relative contribution of each individual card through the reformulation process.

where is the label for the card pair , indicating whether the card is preferred over the card . Using Example Equation 1, we can derive the following labels:

Given cards in , for pointwise methods, labels would be derived while pairwise methods would derive labels, which is significantly more.

While it is straightforward to derive pairwise preferences as above, for a trained model it is NP-hard to obtain an optimal ranking from predicted preferences [24], although approximations exist that yield rankings with fewer agreements with the predicted preferences.

Therefore, we provide another approximation strategy to the pairwise labeling:

This mechanism essentially breaks down pairwise preferences to pointwise ones while keeping the relative ordering. For the query in Example Equation 1, we could have:

If we combine multiple labels for the same card into one, the labeling result yields as:

It turns out that the labeling results are similar to the ones produced by Discounted-Pointwise-Labeling, but they symmetrically emphasize/penalize the top/bottom results. Compared to the true pairwise case, Approximated-Pairwise-Labeling has a scoring time for cards. Therefore, we stick to this method in later experiments.
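The relationship between true pairwise labels and the approximated pointwise ones can be illustrated with a short Python sketch. It assumes that in a QPV that is not reformulated (label +1) every higher-ranked card is preferred over every lower-ranked one, with the preference reversed for a reformulated QPV (label -1), and that the approximation credits +1 to the preferred card and -1 to the other card of each pair before summing per card. These conventions and the function names are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_labels(ranking: List[str], qpv_label: int) -> Dict[Tuple[str, str], int]:
    """One preference per card pair, n * (n - 1) / 2 in total for n cards.
    Key (a, b) has a ranked above b; value +1 means a is preferred over b."""
    return {(hi, lo): qpv_label for hi, lo in combinations(ranking, 2)}


def approximated_pairwise_labels(ranking: List[str], qpv_label: int) -> Dict[str, int]:
    """Break each pairwise preference into two pointwise credits and
    combine all credits for the same card into one label."""
    scores = {card: 0 for card in ranking}
    for (hi, lo), pref in pairwise_labels(ranking, qpv_label).items():
        scores[hi] += pref
        scores[lo] -= pref
    return scores


ranking = ["NewsCard", "PersonCard", "WebCard"]
pairs = pairwise_labels(ranking, +1)
positive = approximated_pairwise_labels(ranking, +1)
negative = approximated_pairwise_labels(ranking, -1)
```

Under these assumptions the card at rank k among n cards receives n + 1 - 2k for a positive QPV, so the top and bottom cards get labels of equal magnitude and opposite sign, which is the symmetry noted above. Scoring then needs only one model evaluation per card instead of one per pair.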

### 3.3 Listwise Labeling

Apart from the pointwise and pairwise labeling strategies, we further propose a listwise labeling strategy. We point out that, in the literature, listwise learning-to-rank techniques were designed to approximately optimize a ranking loss, such as NDCG or ERR. However, labels are a prerequisite for those methods, and they do not tackle the issue of obtaining labels. In this work, the listwise labeling strategy is substantially different from previous work in LtR in the sense that we focus on deriving listwise labels instead of training ranking models. We define the listwise labeling strategy as follows:

where represents the label for ranking of cards. Taking Example Equation 1 again as an example, listwise labels are defined for as:

Note that in this strategy we actually label the whole QPV rather than query-card pair.

We shall emphasize two potential limitations of the listwise strategy and our consideration in respect to them:

• Feasibility

In testing, a listwise ranking model would take permutations of all possible subsets of the card set as input and choose the one with the highest predicted score. Such an approach is not practical due to its complexity for cards. In this sense, the listwise strategy is not able to scale up for a large (or even small) number of cards. However, we consider that this strategy should be applicable in practice for two reasons. First, since the card ranking model only serves to predict the relevance of a handful of relevant cards (cases where ), the actual running time of predictions only depends on those cards. Second, for a set of relevant cards, there could be some product design constraints that pre-set positioning rules for some cards. For instance, the Web card is designed to always be placed at the bottom of the list. For this reason, the number of rankings that are actually possible is small. As a result, it is feasible to run an effective listwise ranking model in production.

• Generalization

With the listwise strategy, a ranking model only captures relevance at the list level and is therefore limited in its ability to generalize. Specifically, if a particular card ranking list is not observed in the training set (i.e., the set of data we use to train the ranking model), the model cannot predict the relevance of such a list. In other words, all the card rankings that a model can learn are limited to observed lists. However, as mentioned above, the product design has set up quite a few card ranking constraints, and thus, in practice, we do not need to assess all the possible rankings of cards. We show in our experiments that the listwise strategy allows us to train a ranking model that is competitive with the other alternatives.
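The feasibility concern can be made concrete by counting candidate lists. The sketch below counts the ordered lists over all non-empty subsets of a card pool and compares that with the count when one card is pinned to the last position. The assumption that the pinned card is always shown, and both function names, are hypothetical and serve only to illustrate the effect of a positioning rule.

```python
from math import perm


def count_rankings(num_cards: int) -> int:
    """Number of ordered lists over all non-empty subsets of the pool:
    the sum over k of num_cards! / (num_cards - k)!."""
    return sum(perm(num_cards, k) for k in range(1, num_cards + 1))


def count_rankings_with_pinned_last(num_cards: int) -> int:
    """Same count when one card is always shown in the last position
    (an assumed product rule), so only the other cards vary freely."""
    free = num_cards - 1
    return sum(perm(free, k) for k in range(0, free + 1))


unconstrained = count_rankings(5)
constrained = count_rankings_with_pinned_last(5)
```

For a pool of five candidate cards this gives 325 lists without constraints and 65 with one pinned card, which suggests why a small candidate pool combined with positioning rules keeps listwise scoring tractable.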

Although we demonstrate in our work that the listwise labeling strategy is practically applicable, we acknowledge that further efforts to address the above limitations would be highly valuable. We leave this to future work.

### 3.4 Learning to Label

The last strategy we propose, namely, Learning to Label (LtL) is to exploit an additional learning algorithm for estimating the importance of each card on a query. The idea is borrowed from multi-touch attribution (e.g., [26]) in online advertising where regression models are used to allocate credits, i.e. conversions, to multiple advertising channels. Here, we want to allocate a credit, a QPV-level label, into different cards and use those distributed credits as pseudo labels to train ranking models. We start from a simple form to decompose :

where is the query-term-level bias, representing the natural uncertainty of the query, is the credit for card in QPV and . Logistic function is used as labels are binary. Ideally, should be different for different QPV . The central problem is that both and are unknown for all . If a query term has QPVs with average cards, the problem yields unknown variables for equations, making the problem hard to solve.

Instead of directly tackling , we take the following feature-based approach by obtaining from some simple features, resulting in far fewer parameters to learn:

where is an indicator feature, representing card being clicked, is an indicator feature, representing card being viewed while and are corresponding weights. Note that, both and are the same for the query term across all QPVs where . Thus, we reduce all unknown parameters from to for QPVs. For query terms, Equation 2 indicates separate regression problems and the whole setting can be embarrassingly parallelized. Note that, more features can be used, but in this paper, for simplicity, we only use these two features.

As Equation 2 implies, , the expected credit of a card , can be computed as:

where is essentially the mean of the feature value multiplied by the learned weight, and similarly for . We call the “total value” of the card , the “click value” and the “view value”. The total value of a card can be seen as the average contribution of a card with respect to reformulation. We formalize the LtL strategy as below:

Following Example Equation 1, and further assuming that none of the cards was clicked in , and was clicked in , we have the labels as shown below:
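The LtL procedure for a single query term can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions: the QPV label is encoded as 1 (not reformulated) or 0 (reformulated), the model is a logistic regression with one bias and one click weight and one view weight per card, it is fitted by batch gradient ascent with a small L2 penalty on the weights, and the toy QPVs are invented for the example. The optimizer, the penalty and all names are choices made for this sketch.

```python
import math
from typing import Dict, List, Tuple

# (viewed cards, clicked cards, label); label 1 = not reformulated (assumed).
QPV = Tuple[List[str], List[str], int]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_ltl(qpvs: List[QPV], steps: int = 500, lr: float = 0.1, l2: float = 0.01):
    """Fit bias + per-card click/view weights for one query term."""
    cards = sorted({c for viewed, _, _ in qpvs for c in viewed})
    bias = 0.0
    w_click = {c: 0.0 for c in cards}
    w_view = {c: 0.0 for c in cards}
    for _ in range(steps):
        g_bias = 0.0
        g_click = {c: 0.0 for c in cards}
        g_view = {c: 0.0 for c in cards}
        for viewed, clicked, y in qpvs:
            z = bias + sum(w_view[c] for c in viewed) + sum(w_click[c] for c in clicked)
            err = y - sigmoid(z)  # gradient of the log-likelihood w.r.t. z
            g_bias += err
            for c in viewed:
                g_view[c] += err
            for c in clicked:
                g_click[c] += err
        bias += lr * g_bias
        for c in cards:
            w_click[c] += lr * (g_click[c] - l2 * w_click[c])
            w_view[c] += lr * (g_view[c] - l2 * w_view[c])
    return bias, w_click, w_view


def total_values(qpvs: List[QPV], w_click: Dict[str, float], w_view: Dict[str, float]):
    """Total value = click weight * mean click feature + view weight * mean view feature."""
    n = len(qpvs)
    values = {}
    for c in w_view:
        mean_click = sum(c in clicked for _, clicked, _ in qpvs) / n
        mean_view = sum(c in viewed for viewed, _, _ in qpvs) / n
        values[c] = w_click[c] * mean_click + w_view[c] * mean_view
    return values


qpvs = [
    (["NewsCard", "WebCard"], ["NewsCard"], 1),
    (["NewsCard", "WebCard"], ["NewsCard"], 1),
    (["NewsCard", "WebCard"], ["NewsCard"], 1),
    (["WeatherCard", "WebCard"], [], 0),
    (["WeatherCard", "WebCard"], [], 0),
]
bias, w_click, w_view = fit_ltl(qpvs)
values = total_values(qpvs, w_click, w_view)
```

In this toy data NewsCard is only seen in QPVs that were not reformulated and WeatherCard only in reformulated ones, so NewsCard ends up with a positive total value and WeatherCard with a negative one. Since each query term is fitted independently, the per-term fits can run in parallel.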

### 3.5 Alternative Labeling

Other than labeling approaches derived from query reformulations, alternative methods do exist for ranking cards. Here, we discuss two important ones.

Click-Through-Rate Labeling: If a user clicks on links shown on a card, this can be interpreted as the user being interested in the card. Thus, it is reasonable to use Click-Through-Rate (CTR) as a signal of relevance, as similar ideas have been exploited before [17]. Here, CTR is computed as the number of links on a card that were clicked, normalized by the total number of links shown on the card. Thus, a higher CTR represents a higher degree of relevance of a card with respect to the query. However, as mentioned in Section 1, not all cards contain links, e.g., WeatherCard and Q2ACard. Therefore, CTR-based labels can only drive user engagement on link-based cards. Nevertheless, this method is a strong baseline to consider. Note that, as CTR is computed on the query-card level, it can also be treated as a pointwise method.
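The CTR label can be written down directly. The small Python sketch below also makes the limitation explicit: for a card that shows no links, the ratio is undefined, which the sketch represents by returning `None`. The function name and the choice of `None` are assumptions for illustration.

```python
from typing import Optional


def card_ctr(links_clicked: int, links_shown: int) -> Optional[float]:
    """CTR label for one query-card pair: clicked links divided by the
    links shown on the card. Returns None for cards that show no links
    (e.g. WeatherCard or Q2ACard), where the label is undefined."""
    if links_shown == 0:
        return None
    return links_clicked / links_shown


news_label = card_ctr(links_clicked=1, links_shown=4)
weather_label = card_ctr(links_clicked=0, links_shown=0)
```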

Human Judgments Labeling: As mentioned in Section 2, using human judgments to train ranking models is a standard method in previous research on vertical selection and federated search. The main limitation of such an approach is that the label is defined not on the QPV level but on the query-card level. Thus, it loses the ability to quantify the uncertainty of query reformulations for the same query, and models trained on such labels tend to have a strong bias towards one particular outcome (e.g., either reformulated or not reformulated). Despite these limitations, it still has advantages in scenarios like launching a new product, where user feedback data is not available. In this paper, we randomly sample top queries from the mobile query log of a major search engine with cards, yielding human judgments in the scale of {Excellent, Good, Neutral, Poor, Very Poor}. One example of a few judgments is shown in Table ?, for the query “Facebook”. We can observe two additional drawbacks of human judgments: 1) they lack a ranking of cards, as different cards may have the same relevance judgment, and 2) maintaining, revising and adding judgments is tremendously time-consuming. One may argue that this human label data set is small but, as we point out later, a much larger set does not solve the issues of human judgments.

## 4 Ranking Model

In the previous section, we mainly dealt with the problem of deriving labels from a QPV-level user engagement metric, QR. Here, we present a state-of-the-art ranking framework to train models with those labels. Recall that each strategy presented in Section 3 derives labels at the query-card, query-card-pair or query-card-list level. For each query and its card set, we can construct the corresponding feature vectors. A ranking model takes such feature vectors and outputs an ordered list of cards. In theory, for each query, the model would evaluate all possible candidate cards, pairs of cards or lists of cards. In practice, this is never the case: for each query, only a handful of cards could be relevant, so the final output is almost always a subset of cards, while the remaining cards are not shown to the user. Note that the step of deciding which cards are relevant can be done by the ranking model, but it is usually done by a simpler function that uses fewer features. Describing such a function is outside the scope of this paper. One can simply assume that, for each query, a small pool of cards is presented to the ranking model after this relevance evaluation step. The ranking model then operates under one of the following scenarios:

1. Pointwise Labels: For a candidate set of cards, the model evaluates each card and outputs a score. The final ranking is obtained by sorting the cards by score.

2. Pairwise Labels: As mentioned in Section 3.2, evaluating all possible pairs of cards to obtain the optimal ranking is NP-hard. Approximated-Pairwise-Labeling follows essentially the same procedure to obtain a ranking as the pointwise methods.

3. Listwise Labels: As mentioned in Section 3.3, in theory the model would evaluate all possible rankings, but in practice it evaluates only rankings that have been shown to users in the past.

4. Learning To Label: It follows exactly the same procedure as the pointwise methods.
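The pointwise procedure shared by scenarios 1, 2 and 4 (score each candidate card independently, then sort) can be sketched as follows. The names `rank_cards` and `linear_score`, the two-dimensional feature vectors and the weights are all hypothetical; the toy linear scorer merely stands in for a trained model.

```python
def rank_cards(candidates, score_fn):
    """Score every candidate card independently, then sort by descending score.

    `candidates` maps a card name to its feature vector for the current query.
    """
    scored = [(score_fn(features), card) for card, features in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [card for _, card in scored]


def linear_score(features, weights=(0.6, 0.4)):
    # Toy stand-in for a trained ranking model (weights are illustrative).
    return sum(w * f for w, f in zip(weights, features))


candidates = {
    "LocalCard": [0.1, 0.3],
    "NewsCard": [0.9, 0.2],
    "WebCard": [0.5, 0.5],
}
ranking = rank_cards(candidates, linear_score)
```

Because each card is scored in isolation, prediction cost grows linearly with the size of the candidate pool, in contrast to enumerating pairs or full rankings.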

For these scenarios, feature sets are adapted to meet their respective requirements. Note that, depending on the values of the labels, different strategies yield either classification or regression problems.

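In this paper, we use the GBT algorithm [11] to learn the ranking model for all scenarios mentioned above. GBT is an additive regression algorithm consisting of an ensemble of trees, each fitted to the current residuals (the gradients of the loss function) in a forward step-wise manner. It iteratively fits an additive model as:

$$f_t(x) = f_{t-1}(x) + \lambda \beta_t T_t(x; \Theta_t)$$

such that a certain loss function $L(y_i, f(x_i))$ (e.g., square loss, logistic loss) is minimized, where $T_t(x; \Theta_t)$ is a tree at iteration $t$, weighted by a parameter $\beta_t$, with a finite number of parameters $\Theta_t$, and $\lambda$ is the learning rate. At iteration $t$, tree $T_t(x; \Theta_t)$ is induced to fit the negative gradient by least squares. That is:

$$\hat{\Theta}_t = \arg\min_{\Theta} \sum_{i=1}^{N} w_i \left( -G_{it} - T_t(x_i; \Theta) \right)^2$$

where $w_i$ is the weight for data instance $i$, which is usually set to $1$, and $G_{it}$ is the gradient over the current prediction function: $G_{it} = \left[ \partial L(y_i, f(x_i)) / \partial f(x_i) \right]_{f = f_{t-1}}$. The optimal weight $\beta_t$ of the tree is determined by $\beta_t = \arg\min_{\beta} \sum_{i=1}^{N} L\left(y_i, f_{t-1}(x_i) + \beta T_t(x_i; \hat{\Theta}_t)\right)$. For more details about GBT, please refer to [11].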
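To make the forward step-wise fitting concrete, here is a deliberately small sketch of gradient boosting with squared loss, using one-split trees (stumps) on a single feature. It is an illustration under simplifying assumptions, not the production setup: all instance weights are 1, the tree weight is absorbed into the leaf values (for squared loss with least-squares leaves the line search gives a weight of 1), and the function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    mean = sum(residuals) / len(residuals)
    # Fallback: a constant predictor, kept if no split reduces the error.
    best = (sum((r - mean) ** 2 for r in residuals), max(xs), mean, mean)
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue  # every point falls on the left, so this is not a split
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbt(xs, ys, n_trees=50, learning_rate=0.1):
    """Forward step-wise boosting: each stump fits the current negative gradient."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For the loss 0.5 * (y - f)^2 the negative gradient is the residual y - f.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict_gbt(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
model = fit_gbt(xs, ys)
```

With a learning rate below 1, each round removes only a fraction of the remaining residual, which is why many shallow trees are combined rather than one deep tree.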

## 5 Experimental Evaluation

To evaluate the effectiveness of our proposed methods, we use a sample of QPV data from a major search engine. In particular, the data is randomly sampled from two weeks’ data produced by a production mobile card ranking system. It contains distinct queries and QPVs.

Comparisons: We compare the following labeling approaches in this section: 1) Pointwise Labeling, described in Section 3.1, which includes Naïvely-Pointwise-Labeling (NPL), Discounted-Pointwise-Labeling (DPL) and Movement-based-Pointwise-Labeling (MPL); 2) Pairwise Labeling, namely Approximated-Pairwise-Labeling, described in Section 3.2 and abbreviated as APL; 3) Listwise Labeling, described in Section 3.3 and abbreviated as LL; 4) Learning To Label, described in Section 3.4 and abbreviated as LtL; 5) Click-Through-Rate Labeling, described in Section 3.5 and abbreviated as CTR; and 6) Human Judgment Labeling, described in Section 3.5 and abbreviated as Human.

Evaluation Protocol: The key difficulty of training card ranking models is that there is no established method to properly evaluate them. As mentioned in Section 2, the classic evaluation method used for vertical selection and federated search requires human judgments as ground-truth labels, with traditional ranking metrics such as NDCG, MAP and ERR used to compare different models. Here, we use a different approach. For each QPV, we use its QR outcome as the ground-truth label, which is either positive (not reformulated) or negative (reformulated) for the whole ranked list of cards shown in that QPV. When testing, a predicted ranked list is produced by a ranking model. We compute two metrics:

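• True-Positive-Ratio (TPR) is defined as:

$$\mathrm{TPR} = \frac{1}{N^{+}} \sum_{i \in \mathcal{D}^{+}} \mathbb{I}\left(\hat{\pi}_i = \pi_i\right)$$

where $\mathcal{D}^{+}$ is the set of QPVs with a positive label, $N^{+} = |\mathcal{D}^{+}|$ is the total number of positive ranked lists, $\pi_i$ and $\hat{\pi}_i$ are the shown and the predicted ranked lists for QPV $i$, and $\mathbb{I}(\cdot) = 1$ when its argument is true and $0$ otherwise.

• True-Negative-Ratio (TNR) is defined as:

$$\mathrm{TNR} = \frac{1}{N^{-}} \sum_{i \in \mathcal{D}^{-}} \mathbb{I}\left(\hat{\pi}_i = \pi_i\right)$$

where $\mathcal{D}^{-}$ is the set of QPVs with a negative label and $N^{-} = |\mathcal{D}^{-}|$ is the total number of negative ranked lists.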

Note that, for both metrics, we require that the predicted ranked list matches the shown ranked list exactly. Both TPR and TNR resemble the importance sampling technique used in offline A/B testing evaluation methods [23]. TPR measures how often a ranker matches non-reformulated ranked lists, which is a positive sign for a model, whereas TNR measures how often a ranker matches reformulated ranked lists, which is a negative sign for a model. Therefore, a good ranker is one that achieves a high TPR and a low TNR. Given this observation, we define an $F$-Measure as:

$$F = \frac{2 \cdot \mathrm{TPR} \cdot (1 - \mathrm{TNR})}{\mathrm{TPR} + (1 - \mathrm{TNR})}$$

which is the harmonic mean of $\mathrm{TPR}$ and $1 - \mathrm{TNR}$, where $1 - \mathrm{TNR}$ is used to penalize a ranker that matches negative ranked lists. Under this definition, the ranker that produced the dataset has $\mathrm{TPR} = 1$ and $\mathrm{TNR} = 1$, resulting in an $F$-Measure of $0$.
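The evaluation protocol can be sketched in a few lines. This is a simplified illustration under stated assumptions: the F-Measure is read as the harmonic mean of TPR and 1 − TNR, each QPV is a hypothetical `(query, shown_ranking, is_positive)` tuple, and the ranker is a function of the query alone, which ignores per-QPV candidate pools.

```python
def evaluate(qpvs, ranker):
    """Compute (TPR, TNR, F) with exact matching of whole ranked lists.

    `is_positive` is True when the QPV was not reformulated.
    """
    pos_total = pos_match = neg_total = neg_match = 0
    for query, shown, is_positive in qpvs:
        match = list(ranker(query)) == list(shown)
        if is_positive:
            pos_total += 1
            pos_match += int(match)
        else:
            neg_total += 1
            neg_match += int(match)
    tpr = pos_match / pos_total if pos_total else 0.0
    tnr = neg_match / neg_total if neg_total else 0.0
    avoided = 1.0 - tnr  # reward for NOT reproducing reformulated lists
    f = 2 * tpr * avoided / (tpr + avoided) if tpr + avoided > 0 else 0.0
    return tpr, tnr, f


qpvs = [
    ("q1", ["News", "Web"], True),
    ("q1", ["Web", "News"], False),
    ("q2", ["Local", "Web"], True),
    ("q3", ["Web", "Local"], False),
]
good = {"q1": ["News", "Web"], "q2": ["Local", "Web"], "q3": ["Local", "Web"]}
bad = {"q1": ["Web", "News"], "q2": ["Web", "Local"], "q3": ["Web", "Local"]}

good_scores = evaluate(qpvs, good.get)
bad_scores = evaluate(qpvs, bad.get)
```

The `good` ranker reproduces only the non-reformulated lists, while `bad` reproduces only the reformulated ones, so the two sit at opposite ends of the measure.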

Parameters: We chose parameters by cross-validation; results are also reported from cross-validation and are statistically significant. For GBT, the number of trees, the number of nodes and the shrinkage are chosen by cross-validation and then fixed for all models. Squared loss is used in GBT; we found that logistic loss does not give any significant difference in terms of the evaluation metrics introduced above. For details about these parameters, please refer to [11].

Features: As mentioned in Section 4, we use a number of feature groups. In particular, we have:

• Lexical Features: These features include unigram, bigram and language models of query terms, which have been hashed [30] into a fixed number of bins. Then, a simpler linear model is trained through these lexical features to indicate whether a card is relevant to a query.

• Query Intent Features: Queries are classified into a hierarchical taxonomy where each node represents an intent. Certain intents may strongly indicate a particular card. For instance, a local intent may imply LocalCard or WeatherCard more strongly than other intents do.

• Card Backend Features: For a given query, whether a card’s backend system can handle it is also an important factor for a ranking system to consider. For instance, even if a query has a local intent, the LocalCard may not find relevant stores, restaurants or other local businesses. Therefore, the card ranking system incorporates the returned results or relevance scores from a backend as signals to leverage.

• Click Feedback Features: Sometimes, the relevance between a query and a card may be influenced by temporal factors. For instance, in general, the query “Apple” may not have a news intent. But if the Apple company announces new product releases, the NewsCard would become relevant during that period of time. Thus, we have features that track the click-through rate of URLs related to a query and relate them to a card, capturing the temporal dynamics of a card with respect to certain queries.
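As an illustration of the hashed lexical features in the first group, the sketch below maps query unigrams and bigrams into a fixed number of bins. It is a simplified variant of the hashing trick: it uses plain counts rather than the signed hashing of [30], and the bin count, the use of MD5 and the function name are assumptions chosen for the example.

```python
import hashlib


def hashed_ngram_features(query, num_bins=16):
    """Map the unigrams and bigrams of a query into a fixed-length count vector."""
    tokens = query.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    vector = [0] * num_bins
    for gram in ngrams:
        # hashlib gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_bins
        vector[index] += 1
    return vector


features = hashed_ngram_features("Local Restaurants Nearby")
```

The vector length stays fixed regardless of vocabulary size, at the cost of occasional collisions between unrelated n-grams.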

As this paper is not about card ranking models, we do not discuss features in detail.

### 5.1 Basic Statistics

In this sub-section, we first show some characteristics of the dataset. Figure 2 shows the number of queries with a given ratio of reformulated QPVs. The point in the upper left corner is the number of queries with a ratio of 0 of reformulated QPVs, meaning that all of their QPVs have a positive label, while the point in the upper right corner represents queries with a ratio of 1 of reformulated QPVs. After reviewing the data, we found that both points come from queries with few QPVs, and thus represent extreme cases. Other parts of the figure do not reveal strong patterns, which might indicate that any query can possibly be reformulated, regardless of whether it is a top query or a long-tail one.
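A per-query reformulation ratio of the kind plotted in Figure 2 could be computed along the following lines. The `(query, was_reformulated)` record layout, the bucket count and the function name are hypothetical and only serve to illustrate the statistic.

```python
from collections import Counter, defaultdict


def reformulation_ratio_histogram(qpvs, num_buckets=10):
    """Count queries per bucket of their ratio of reformulated QPVs."""
    totals = defaultdict(int)
    reformulated = defaultdict(int)
    for query, was_reformulated in qpvs:
        totals[query] += 1
        reformulated[query] += int(was_reformulated)
    histogram = Counter()
    for query, total in totals.items():
        ratio = reformulated[query] / total
        # A ratio of exactly 1.0 is placed in the last bucket.
        bucket = min(int(ratio * num_buckets), num_buckets - 1)
        histogram[bucket] += 1
    return histogram


qpvs = [("a", False), ("a", False), ("b", True), ("c", True), ("c", False)]
histogram = reformulation_ratio_histogram(qpvs)
```

Queries with very few QPVs, such as "b" here, can only land on extreme ratios, which matches the observation about the two corner points.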

We take two queries, “Facebook” (shown at the top) and “Kim Kardashian” (shown at the bottom), as examples in Figure 3. For each query, we show several card groups with their corresponding percentages of positive and negative QPVs. A card group is defined as a specific combination of cards shown for the query. As we can see, for both queries there exist reformulated QPVs in every card group, even though some card groups (e.g., N1-L-N2-W, N1-N2-W and N2-W) have a very low reformulation ratio. Additionally, there is no single card group with a high ratio of positive labels. In other words, a card group with a lower reformulation ratio does not imply that other card groups are less likely to be positive. From these two examples, we can see that some queries are inherently more likely to be reformulated than others.

In Table ?, the second column shows the percentage of QPVs containing a given number of cards. We can see that nearly of QPVs which contains cards while almost no QPVs show more cards and beyond. This distribution is intuitive, as mobile devices have very constrained screens. The third column shows the percentage of queries that contain a certain number of cards, demonstrating that a large portion of queries have cards. The last column of the table shows, for a given number of cards, the percentage of QPVs that have a positive or a negative label. Note that the ratio of positive to negative labels is highly skewed, as negative ones are sparse across different numbers of cards, indicating that learning to avoid negative examples is an inherently difficult task.

### 5.2 Comparisons of Labeling Strategies

In this sub-section, we compare different labeling strategies in terms of TPR, TNR and $F$-Measure, as shown in Table ?. The first notable observation is that all labeling strategies are significantly better than Human, not only in terms of $F$-Measure but also on . Human editors cannot predict what users want and, indeed, the rankings induced from their assessed relevance labels do not match what users like. In addition, the ranking model for Human does not scale: it is trained from only a small number of judgments, which yields sub-optimal results, while other strategies can handle millions of QPVs.

Although CTR is much better than Human, it is still the second-worst method in the results, which is understandable. First of all, CTR and QR are two different objectives: they might be related, but they are certainly not the same. As mentioned in Section 3.5, cards may not have links. Meanwhile, it is also hard to argue that the normalized click-through rate for different cards is a good indicator of a better ranking. For instance, even if two cards have the same number of links, it is not always true that the card with more clicks is better than the card with a single click. However, this result does not mean that clicks are not informative at all.

For pointwise methods, NPL, DPL and MPL all perform significantly better than the baseline and CTR. In particular, NPL performs surprisingly well given its simplicity. One possible explanation is that NPL, to some degree, resembles the idea of LL. For example, if and , LL would generate a label and NPL would generate labels: , and . Essentially, the two methods generate the same labels for this data instance. Although NPL and LL produce similar labels in many cases, they differ in some subtle scenarios. For instance, if and , LL would generate a label , which has nothing to do with the other label generated above. However, NPL would generate , and , which obviously interfere with the labels generated above. Therefore, we can see that, without ordering information, NPL confuses the effects of different original labels on the same set of cards. This is also observable in terms of performance, as LL is superior to NPL. The relatively worse performance of MPL might indicate that the movement of a single card is overly penalized or encouraged, and that the induced labels do not preserve the order of rankings. On the other hand, DPL performs quite well. Although it is a pointwise method, it carries over the ordering information of positive and negative examples and therefore it has a high , achieving a good $F$-Measure.

For pairwise and listwise methods, APL achieves the second-best performance. It has a relatively high TPR and a low TNR, demonstrating that it strikes a balance between maintaining good pairs of cards and avoiding bad pairs. As mentioned before, APL has the same prediction complexity as pointwise methods and is therefore even more attractive in terms of both performance and simplicity. LL, as expected, has the highest , as it literally remembers good and bad cases, and it achieves quite good performance overall.

LtL outperforms all other methods, as it has a high TPR and a relatively low TNR. The main advantage of this method is that it learns the contribution of each card to reformulation outcomes in a principled way. In addition, it generates only the same number of training instances as pointwise methods, while APL, despite its strong performance, generates significantly more data instances, as shown in the examples in Section 3.2. The only obvious drawback of LtL is that a model needs to be trained first to obtain labels. This shortcoming can be mitigated, however, as the labeling model can be trained from a large corpus and kept constant for a while, while card ranking models are re-trained more frequently. To demonstrate the effectiveness of LtL, we show the regression weights learned by LtL for two queries, “Barack Obama” and “Apple”, in Table ?. The first column is the card name, and the following columns give the “click value”, the click weight (learned from the model) and the click mean (computed from the training set), followed by the same quantities for “view”. The “total value” in the last column is defined in Equation 3. Cards are sorted by “total value” in the table. As we can see, “total value” gives a very intuitive functional explanation of cards and their engagement contributions. For instance, NewsCard and WebCard are much more important than other cards for “Barack Obama”, while NavigationCard is far more critical for “Apple”, as most people wanted to use NavigationCard to quickly jump to Apple’s homepage. In addition to its superior performance, LtL can provide valuable insights that other approaches cannot offer.

## 6 Conclusions

In this paper, we have presented a comprehensive series of strategies for exploiting users’ query reformulation activities to label query-card relevance, on the basis of which effective card ranking models can be optimized for mobile search. We demonstrated that the proposed labeling strategies achieve substantial improvements over the conventional human-judgment-based labeling strategy. In addition, our experimental results show that directly exploiting user feedback from query reformulation yields a better card ranking model than the conventional user feedback from CTR. Finally, the learning-to-label strategy succeeds in building discriminative query-level labeling models, which leads to a card ranking model that outperforms the alternatives. For future work, we will explore possibilities to derive labels based on task-level search metrics and develop ranking models to optimize them.

### References

1. Classification-based resource selection.
J. Arguello, J. Callan, and F. Diaz. In Proceedings of CIKM 2009, pages 1277–1286.
2. Learning to aggregate vertical results into web search results.
J. Arguello, F. Diaz, and J. Callan. In Proceedings of CIKM 2011, pages 201–210.
3. A methodology for evaluating aggregated search results.
J. Arguello, F. Diaz, J. Callan, and B. Carterette. In Proceedings of ECIR 2011, pages 141–152.
4. Sources of evidence for vertical selection.
J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo. In Proceedings of SIGIR 2009, pages 315–322.
5. Learning to rank using an ensemble of lambda-gradient models.
C. J. Burges, K. M. Svore, P. N. Bennett, A. Pastusiak, and Q. Wu. In Yahoo! Learning to Rank Challenge, pages 25–35, 2011.
6. Learning to rank: From pairwise approach to listwise approach.
Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. In Proceedings of ICML 2007, pages 129–136.
7. Yahoo! learning to rank challenge overview.
O. Chapelle and Y. Chang. In Proceedings of the Yahoo! Learning to Rank Challenge, pages 1–24, 2011.
8. Expected reciprocal rank for graded relevance.
O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. In Proceedings of CIKM 2009, pages 621–630.
9. Integration of news content into web results.
F. Diaz. In Proceedings of WSDM 2009, pages 182–191.
10. Adaptation of offline vertical selection predictions in the presence of user feedback.
F. Diaz and J. Arguello. In Proceedings of SIGIR 2009, pages 323–330.
11. Greedy function approximation: a gradient boosting machine.
J. H. Friedman. Annals of Statistics, pages 1189–1232, 2001.
12. Beyond clicks: Query reformulation as a predictor of search satisfaction.
A. Hassan, X. Shi, N. Craswell, and B. Ramsey. In Proceedings of CIKM 2013, pages 2019–2028.
13. Analyzing and evaluating query reformulation strategies in web search logs.
J. Huang and E. N. Efthimiadis. In Proceedings of CIKM 2009, pages 77–86.
14. Cumulated gain-based evaluation of IR techniques.
K. Järvelin and J. Kekäläinen. ACM Transactions on Information Systems, 20(4):422–446, 2002.
15. Automatic online evaluation of intelligent assistants.
J. Jiang, A. Hassan Awadallah, R. Jones, U. Ozertem, I. Zitouni, R. Gurunath Kulkarni, and O. Z. Khan. In Proceedings of WWW 2015, pages 506–516.
16. A unified search federation system based on online user feedback.
L. Jie, S. Lamkhede, R. Sapra, E. Hsu, H. Song, and Y. Chang. In Proceedings of KDD 2013, pages 1195–1203.
17. Optimizing search engines using clickthrough data.
T. Joachims. In Proceedings of KDD 2002, pages 133–142.
18. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search.
T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. ACM Transactions on Information Systems, 25(2), 2007.
19. Search engines that learn from implicit feedback.
T. Joachims and F. Radlinski. Computer, 40(8):34–40, 2007.
20. Implicit feedback for inferring user preference: A bibliography.
D. Kelly and J. Teevan. SIGIR Forum, 37(2):18–28, 2003.
21. Modeling dwell time to predict click-level satisfaction.
Y. Kim, A. Hassan, R. W. White, and I. Zitouni. In Proceedings of WSDM 2014, pages 193–202.
22. Towards better measurement of attention and satisfaction in mobile search.
D. Lagun, C.-H. Hsieh, D. Webster, and V. Navalpakkam. In Proceedings of SIGIR 2014, pages 113–122.
23. Toward predicting the outcome of an A/B experiment for search relevance.
L. Li, J. Y. Kim, and I. Zitouni. In Proceedings of WSDM 2015, pages 37–46.
24. Learning to rank for information retrieval.
T.-Y. Liu. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
25. On composition of a federated web search result page: Using online users to provide pairwise preference for heterogeneous verticals.
A. K. Ponnuswami, K. Pattabiraman, Q. Wu, R. Gilad-Bachrach, and T. Kanungo. In Proceedings of WSDM 2011, pages 715–724.
26. Data-driven multi-touch attribution models.
X. Shao and L. Li. In Proceedings of KDD 2011, pages 258–264.
27. From queries to cards: Re-ranking proactive card recommendations based on reactive search history.
M. Shokouhi and Q. Guo. In Proceedings of SIGIR 2015, pages 695–704.
28. Mobile query reformulations.
M. Shokouhi, R. Jones, U. Ozertem, K. Raghunathan, and F. Diaz. In Proceedings of SIGIR 2014, pages 1011–1014.
29. Federated search.
M. Shokouhi and L. Si. Foundations and Trends in Information Retrieval, 5(1):1–102, Jan. 2011.
30. Feature hashing for large scale multitask learning.
K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. In Proceedings of ICML 2009, pages 1113–1120.
31. Listwise approach to learning to rank: Theory and algorithm.
F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. In Proceedings of ICML 2008, pages 1192–1199.
32. Modelling user interest for zero-query ranking.
L. Yang, Q. Guo, Y. Song, S. Meng, M. Shokouhi, K. McDonald, and W. B. Croft. In Proceedings of ECIR 2016.
33. A general boosting method and its application to learning ranking functions for web search.
Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, and G. Sun. In Proceedings of NIPS 2008, pages 1697–1704.
34. Evaluating reward and risk for vertical selection.
K. Zhou, R. Cummins, M. Lalmas, and J. M. Jose. In Proceedings of CIKM 2012, pages 2631–2634.
35. Aligning vertical collection relevance with user intent.
K. Zhou, T. Demeester, D. Nguyen, D. Hiemstra, and D. Trieschnigg. In Proceedings of CIKM 2014, pages 1915–1918.