Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application

Yujing Hu (Alibaba Group), Qing Da (Alibaba Group), Anxiang Zeng (Alibaba Group), Yang Yu (Nanjing University), and Yinghui Xu (Alibaba Group)

In e-commerce platforms such as Amazon and TaoBao, ranking items in a search session is a typical multi-step decision-making problem. Learning to rank (LTR) methods have been widely applied to ranking problems. However, such methods often treat the different ranking steps in a session as independent, although these steps may in fact be highly correlated with each other. To better utilize the correlation between different ranking steps, in this paper we propose to use reinforcement learning (RL) to learn an optimal ranking policy which maximizes the expected accumulative rewards in a search session. Firstly, we formally define the concept of search session Markov decision process (SSMDP) to formulate the multi-step ranking problem. Secondly, we analyze the properties of SSMDP and theoretically prove the necessity of maximizing accumulative rewards. Lastly, we propose a novel policy gradient algorithm for learning an optimal ranking policy, which is able to deal with the problem of high reward variance and unbalanced reward distribution of an SSMDP. Experiments are conducted in simulation and in the TaoBao search engine. The results demonstrate that our algorithm performs much better than online LTR methods, with more than and growth of total transaction amount in the simulation and the real application, respectively.

reinforcement learning; online learning to rank; policy gradient

1. Introduction

Over the past decades, online shopping has become an important part of people’s daily lives, requiring e-commerce giants such as Amazon, eBay and TaoBao to provide stable and attractive services for hundreds of millions of users all over the world. Among these services, commodity search is the fundamental infrastructure of these e-commerce platforms, giving users the opportunity to search for commodities, browse product information and make comparisons. For example, every day millions of users choose to purchase commodities through the TaoBao search engine.

In this paper, we focus on the problem of ranking items in large-scale item search engines, which refers to assigning each item a score and sorting the items according to their scores. Generally, a search session between a user and the search engine is a multi-step ranking problem as follows:

  1. the user inputs a query in the blank of the search engine,

  2. the search engine ranks the items related to the query and displays the top items (e.g., ) in a page,

  3. the user makes some operations (e.g., click items, buy some certain item or just request a new page of the same query) on the page,

  4. when a new page is requested, the search engine reranks the rest of the items and displays the top items on a new page.

These four steps repeat until the user buys some items or just leaves the search session. Empirically, a successful transaction typically involves multiple rounds of the above process.

The operations of users in a search session may indicate their personal intentions and preferences on items. From a statistical view, these signals can be utilized to learn a ranking function which satisfies the users’ demand. This motivates the marriage of machine learning and information retrieval, namely the learning to rank (LTR) methods (Joachims, 2002; Liu et al., 2009), which learn a ranking function by classification or regression from training data. The major paradigms of supervised LTR methods are pointwise (Nallapati, 2004; Li et al., 2008), pairwise (Cao et al., 2006; Burges et al., 2005), and listwise (Cao et al., 2007). Recently, online learning techniques such as regret minimization (Auer, 2002; Langford and Zhang, 2008; Kveton et al., 2015a) have been introduced into the LTR domain for directly learning from user signals. Compared with offline LTR, online LTR avoids both the mismatch between manually curated labels and user intent (Yue and Joachims, 2009) and the expensive cost of creating labeled data sets. Although rigorous mathematical models are adopted for problem formalization (Yue and Joachims, 2009; Kveton et al., 2015a; Zoghi et al., 2017) and guarantees on regret bounds are established, most of those works only consider a one-shot ranking problem, in which the interaction between the search engine and each user contains only one round of ranking-and-feedback activity. However, in practice, a search session often contains multiple rounds of interactions, and the sequential correlation between rounds may be an important factor for ranking, which has not been well investigated.

In this paper, we consider the multi-step sequential ranking problem mentioned above and propose a novel reinforcement learning (RL) algorithm for learning an optimal ranking policy. The major contributions of this paper are as follows.

  • We formally define the concept of search session Markov decision process (SSMDP) to formulate the multi-step ranking problem, by identifying the state space, reward function and state transition function.

  • We theoretically prove that maximizing accumulative rewards is necessary, indicating that the different ranking steps in a session are tightly correlated rather than independent.

  • We propose a novel algorithm named deterministic policy gradient with full backup estimation (DPG-FBE), designed for the problem of high reward variance and unbalanced reward distribution of SSMDP, which can hardly be handled even by existing state-of-the-art RL algorithms.

  • We empirically demonstrate that our algorithm performs much better than online LTR methods, with more than and growth of total transaction amount in the simulation and the real application (TaoBao search engine), respectively.

The rest of the paper is organized as follows. Section 2 introduces the background of this work. The problem description, the analysis of SSMDP and the proposed algorithm are presented in Sections 3, 4, and 5, respectively. The experimental results are shown in Section 6, and Section 7 concludes the paper.

2. Background

In this section, we briefly review some key concepts of reinforcement learning and the related work in the online LTR domain. We start from the reinforcement learning part.

2.1. Reinforcement Learning

Reinforcement learning (RL) (Sutton and Barto, 1998) is a learning technique in which an agent learns from its interactions with the environment by trial and error. The fundamental mathematical model of RL is the Markov decision process (MDP).

Definition 2.1 (Markov Decision Process).

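A Markov decision process is a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space of the agent, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is the reward function, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is the state transition function and $\gamma \in [0, 1]$ is the discount rate.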

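The objective of an agent in an MDP is to find an optimal policy which maximizes the expected accumulative rewards starting from any state $s$ (typically under the infinite-horizon discounted setting), which is defined by $V^*(s) = \max_{\pi} \mathbb{E}^{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\}$, where $\pi$ denotes any policy of the agent, $\mathbb{E}^{\pi}$ stands for expectation under policy $\pi$, $t$ is the current time step, $k$ is a future time step, and $r_{t+k}$ is the immediate reward at the time step $(t+k)$. This goal is equivalent to finding the optimal state-action value $Q^*(s, a) = \max_{\pi} \mathbb{E}^{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\}$ for any state-action pair $(s, a)$. In the finite-horizon setting with a time horizon $T$, the objective of an agent can be reinterpreted as finding the optimal policy which maximizes the expected $T$-step discounted return $\mathbb{E}^{\pi}\{\sum_{k=0}^{T-1} \gamma^k r_{t+k} \mid s_t = s\}$ or undiscounted return $\mathbb{E}^{\pi}\{\sum_{k=0}^{T-1} r_{t+k} \mid s_t = s\}$ in the discounted and undiscounted reward cases, respectively. (The undiscounted return is the special case of the discounted setting with $\gamma = 1$.)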
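As a small illustration of the finite-horizon return described above, the following sketch computes the discounted sum of a reward sequence. The function name and the reward values are hypothetical and chosen only for illustration; setting the discount rate to 1 gives the undiscounted return.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Return sum_k gamma^k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so that each step multiplies by gamma exactly once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards r_t, r_{t+1}, r_{t+2}: only the last step pays off.
rewards = [0.0, 0.0, 10.0]
undiscounted = discounted_return(rewards, gamma=1.0)
discounted = discounted_return(rewards, gamma=0.5)
```

With a discount rate below 1, a reward that arrives late in the sequence contributes less to the return than the same reward arriving early.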

An optimal policy can be found by computing the optimal state-value function $V^*$ or the optimal state-action value function $Q^*$. Early methods such as dynamic programming (Sutton and Barto, 1998) and temporal-difference learning (Watkins, 1989) rely on a table to store and compute the value functions. However, such tabular methods cannot scale up to problems with large state/action spaces due to the curse of dimensionality. Function approximation is widely used to address the scalability issues of RL. By using a parameterized function (e.g., linear functions (Maei et al., 2010), neural networks (Mnih et al., 2015; Silver et al., 2016)) to represent the value function or the policy (known as value function approximation and policy gradient methods, respectively), the learning problem is transformed into optimizing the function parameters according to reward signals. In recent years, policy gradient methods (Sutton et al., 2000; Silver et al., 2014; Schulman et al., 2015) have drawn much attention in the RL domain. The explicit parameterized representation of the policy enables the learning agent to search directly in the policy space and avoids the policy degradation problem of value function approximation.

2.2. Related Work

Early attempts at online LTR date back to the evaluation of RankSVM in online settings (Joachims, 2002). As claimed by Hofmann et al., balancing exploitation and exploration should be a key ability of online LTR methods (Hofmann et al., 2013). The theoretical results in the online learning community (typically in the bandit problem domain) (Auer, 2002; Langford and Zhang, 2008) provide rich mathematical tools for online LTR problem formalization and algorithms for efficient exploration, which has motivated many online LTR methods. In general, these methods can be divided into two groups. The first group learns the best ranking function from a function space (Yue and Joachims, 2009; Hofmann et al., 2013). For example, Yue and Joachims (Yue and Joachims, 2009) define a dueling bandit problem in which actions are pairwise comparisons between documents and the goal is to learn a parameterized retrieval function which has sublinear regret. The second group of online LTR methods directly learns the best list under some model of user interactions (Radlinski et al., 2008; Slivkins et al., 2013), which can be treated as an assumption on how users react to a ranked list. Representative models include the cascade model (Kveton et al., 2015a, b; Zong et al., 2016; Li et al., 2016), the dependent-click model (Katariya et al., 2017), and the position-based model (Lagrée et al., 2016). Since no single model can entirely capture the behavior of all users, Zoghi et al. (Zoghi et al., 2017) recently proposed a stochastic click learning framework for online LTR in a broad class of click models.

Our work in this paper is closer to the first group of online LTR methods, which learn ranking functions. However, while most previous works consider a one-shot ranking problem, we focus on learning a ranking policy in a multi-step ranking problem, which contains multiple rounds of interactions and typically occurs in e-commerce scenarios.

Figure 1. A typical search session in TaoBao

3. Problem Formulation

As we mentioned in previous sections, in e-commerce platforms such as TaoBao and TMall, ranking items given a query is a multi-step decision-making problem, where the search engine should take a ranking action whenever an item page is requested by a user. Figure 1 shows a typical search session between the search engine and a mobile app user in TaoBao. In the beginning, the user inputs a query “Cola” into the blank of the search engine and clicks the “Search” button. Then the search engine takes a ranking action and shows the top items related to “Cola” in page 1. The user browses the displayed items and clicks some of them for details. When no items interest the user or the user wants to check more items for comparison, the user requests a new item page. The search engine again takes a ranking action and displays page 2. After a certain number of such ranking rounds, the search session finally ends when the user purchases items or just leaves the search session.

3.1. Search Session Modeling

Before we formulate the multi-step ranking problem as an MDP, we define some concepts to formalize the contextual information and user behaviours in a search session, which are the basis to define the state and state transitions of our MDP.

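Definition 3.1 (Top $K$ List).

For an item set $\mathcal{D}$, a ranking function $f: \mathcal{D} \rightarrow \mathbb{R}$, and a positive integer $K$ ($1 \le K \le |\mathcal{D}|$), the top $K$ list $\mathcal{L}_K(\mathcal{D}, f)$ is an ordered item list $(\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_K)$ which contains the top $K$ items when applying the ranking function $f$ to the item set $\mathcal{D}$, where $\mathcal{I}_k$ ($1 \le k \le K$) is the item in position $k$ and for any $k' \ge k$, it is the case that $f(\mathcal{I}_k) \ge f(\mathcal{I}_{k'})$.

Definition 3.2 (Item Page).

For each step $t$ ($t \ge 1$) during a session, the item page $p_t$ is the top $K$ list $\mathcal{L}_K(\mathcal{D}_{t-1}, a_{t-1})$ resulting from applying the ranking action $a_{t-1}$ of the search engine to the set of unranked items $\mathcal{D}_{t-1}$ in the last decision step $(t-1)$. For the initial step $t = 0$, $\mathcal{D}_0 = \mathcal{D}$, where $\mathcal{D}$ is the full item set. For any decision step $t \ge 1$, $\mathcal{D}_t = \mathcal{D}_{t-1} \setminus p_t$.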
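The two definitions above can be sketched in a few lines of code: a page is the list of best-scoring items among those not yet shown, and the shown items are removed before the next step. This is only a minimal sketch; the item identifiers, the scoring functions and the helper names `top_k_list` and `next_item_page` are hypothetical and not part of the paper.

```python
from typing import Callable, List, Sequence, Tuple

Item = str  # hypothetical item identifier, assumed for illustration


def top_k_list(items: Sequence[Item], score: Callable[[Item], float], k: int) -> List[Item]:
    """Return the k highest-scoring items, best first (a top-K list)."""
    return sorted(items, key=score, reverse=True)[:k]


def next_item_page(unranked: Sequence[Item], score: Callable[[Item], float],
                   k: int) -> Tuple[List[Item], List[Item]]:
    """Build one item page and return it with the items still unranked."""
    page = top_k_list(unranked, score, k)
    shown = set(page)
    remaining = [item for item in unranked if item not in shown]
    return page, remaining


# Toy example: five items, two per page, a different ranking action per step.
relevance = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7, "e": 0.3}
page_1, rest_1 = next_item_page(list(relevance), lambda item: relevance[item], k=2)
page_2, rest_2 = next_item_page(rest_1, lambda item: -relevance[item], k=2)
```

Here the second ranking action deliberately reverses the scores, to show that each page depends both on the current action and on which items earlier pages already consumed.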

Definition 3.3 (Item Page History).

In a search session, let $q$ be the input query. For the initial decision step $t = 0$, the initial item page history is $h_0 = q$. For each later decision step $t \ge 1$, the item page history up to $t$ is $h_t = (h_{t-1}, p_t)$, where $h_{t-1}$ is the item page history up to the step $(t-1)$ and $p_t$ is the item page of step $t$.

The item page history $h_t$ contains all information the user observes at the decision step $t$ ($t \ge 0$). Since the item set $\mathcal{D}$ is finite and each page shows $K$ items, there are at most $\lceil |\mathcal{D}| / K \rceil$ item pages, and correspondingly at most $\lceil |\mathcal{D}| / K \rceil$ decision steps in a search session. In TaoBao and TMall, users may choose to purchase items or just leave at different steps of a session. If we treat all possible users as an environment which samples user behaviors, this means that after observing any item page history, the environment may terminate a search session with a certain probability of transaction conversion or abandonment. We formally define these two types of probability as follows.

Definition 3.4 (Conversion Probability).

For any item page history $h_t$ ($t > 0$) in a search session, let $B(h_t)$ denote the conversion event that a user purchases an item after observing $h_t$. The conversion probability of $h_t$, which is denoted by $b(h_t)$, is the averaged probability that $B(h_t)$ occurs when $h_t$ takes place.

Definition 3.5 (Abandon Probability).

For any item page history $h_t$ ($t > 0$) in a search session, let $L(h_t)$ denote the abandon event that a user leaves the search session after observing $h_t$. The abandon probability of $h_t$, which is denoted by $l(h_t)$, is the averaged probability that $L(h_t)$ occurs when $h_t$ takes place.

Since $h_t$ is the direct result of the agent’s action $a_{t-1}$ in the last item page history $h_{t-1}$, the conversion probability $b(h_t)$ and the abandon probability $l(h_t)$ define how the state of the environment (i.e., the user population) will change after $a_{t-1}$ is taken in $h_{t-1}$: (1) terminating the search session by purchasing an item in $h_t$ with probability $b(h_t)$; (2) leaving the search session from $h_t$ with probability $l(h_t)$; (3) continuing the search session from $h_t$ with probability $1 - b(h_t) - l(h_t)$. For convenience of later discussion, we also define the continuing probability of an item page history.

Definition 3.6 (Continuing Probability).

For any item page history $h_t$ ($t \ge 0$) in a search session, let $C(h_t)$ denote the continuation event that a user continues searching after observing $h_t$. The continuing probability of $h_t$, which is denoted by $c(h_t)$, is the averaged probability that $C(h_t)$ occurs when $h_t$ takes place.

Obviously, for any item page history $h$, it holds that $c(h) = 1 - b(h) - l(h)$. In particular, the continuation event of the initial item page history $h_0$, which only contains the query $q$, is a sure event (i.e., $c(h_0) = 1$), as neither a conversion event nor an abandon event can occur before the first item page is displayed.

3.2. Search Session MDP

Now we are ready to define the instantiated Markov decision process (MDP) for the multi-step ranking problem in a search session, which we call a search session MDP (SSMDP).

Definition 3.7 (Search Session MDP).

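Let $q$ be a query, $\mathcal{D}$ be the set of items related to $q$, and $K$ ($K > 0$) be the number of items that can be displayed in a page. The search session MDP (SSMDP) with respect to $q$, $\mathcal{D}$ and $K$ is a tuple $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$, where

  • $T = \lceil |\mathcal{D}| / K \rceil$ is the maximal decision step of a search session,

  • $\mathcal{H} = \bigcup_{t=0}^{T} \mathcal{H}_t$ is the set of all possible item page histories, where $\mathcal{H}_t$ is the set of all item page histories up to $t$ ($0 \le t \le T$),

  • $\mathcal{S} = \mathcal{H}_C \cup \mathcal{H}_B \cup \mathcal{H}_L$ is the state space, where $\mathcal{H}_C = \{C(h_t) \mid h_t \in \mathcal{H}_t, 0 \le t < T\}$ is the nonterminal state set that contains all continuation events, and $\mathcal{H}_B = \{B(h_t) \mid h_t \in \mathcal{H}_t, 0 < t \le T\}$ and $\mathcal{H}_L = \{L(h_t) \mid h_t \in \mathcal{H}_t, 0 < t \le T\}$ are two terminal state sets which contain all conversion events and all abandon events, respectively,

  • $\mathcal{A}$ is the action space which contains all possible ranking functions of the search engine,

  • $\mathcal{R}: \mathcal{H}_C \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is the reward function,

  • $\mathcal{P}: \mathcal{H}_C \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is the state transition function. For any step $t$ ($0 \le t < T$), any item page history $h_t \in \mathcal{H}_t$, and any action $a \in \mathcal{A}$, let $h_{t+1} = (h_t, \mathcal{L}_K(\mathcal{D}_t, a))$. The transition probability from the nonterminal state $C(h_t)$ to any state $s' \in \mathcal{S}$ after taking action $a$ is

$$\mathcal{P}(C(h_t), a, s') = \begin{cases} b(h_{t+1}) & \text{if } s' = B(h_{t+1}), \\ l(h_{t+1}) & \text{if } s' = L(h_{t+1}), \\ c(h_{t+1}) & \text{if } s' = C(h_{t+1}), \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
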
In an SSMDP, the agent is the search engine and the environment is the population of all possible users. The states of the environment indicate the user status in the corresponding item page histories (i.e., continuation, abandonment, or transaction conversion). The action space can be set differently (e.g., discrete or continuous) according to the specific ranking task. The state transition function is directly based on the conversion probability and the abandon probability. The reward function depends heavily on the goal of the specific task; we discuss our reward setting in Section 4.2.
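To make the transition structure concrete, the sketch below builds the distribution over the three possible successor states of a nonterminal state from the conversion and abandon probabilities of the next item page history, and samples from it. The labels "B", "L" and "C" and both function names are hypothetical shorthand for the conversion, abandon and continuation events; the probabilities are invented for illustration.

```python
import random
from typing import Dict


def transition_probs(b_next: float, l_next: float) -> Dict[str, float]:
    """Distribution over successor states given the next history's b and l."""
    if b_next < 0 or l_next < 0 or b_next + l_next > 1:
        raise ValueError("need b >= 0, l >= 0 and b + l <= 1")
    # The continuing probability is whatever mass is left: c = 1 - b - l.
    return {"B": b_next, "L": l_next, "C": 1.0 - b_next - l_next}


def sample_next_state(b_next: float, l_next: float, rng: random.Random) -> str:
    """Sample conversion ("B"), abandonment ("L") or continuation ("C")."""
    probs = transition_probs(b_next, l_next)
    states = list(probs)
    return rng.choices(states, weights=[probs[s] for s in states], k=1)[0]


probs = transition_probs(b_next=0.05, l_next=0.30)
state = sample_next_state(0.05, 0.30, random.Random(0))
```

Only the continuation outcome leads to another decision step; the other two outcomes end the session.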

4. Analysis of SSMDP

Before we apply the search session MDP (SSMDP) model in practice, some details need to be further clarified. In this section, we first identify the Markov property of the states in an SSMDP to show that SSMDP is well defined. Then we provide a reward function setting for SSMDP, based on which we perform an analysis on the reward discount rate and show the necessity for a search engine agent to maximize long-time accumulative rewards.

4.1. Markov Property

The Markov property means that a state is able to summarize past sensations compactly in such a way that all relevant information is retained (Sutton and Barto, 1998). Formally, the Markov property requires that for any state-action sequence $s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t$ experienced in an MDP, it holds that

$$\Pr(s_t \mid s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}) = \Pr(s_t \mid s_{t-1}, a_{t-1}).$$

That is to say, the occurrence of the current state $s_t$ depends only on the last state-action pair $(s_{t-1}, a_{t-1})$ rather than on the whole sequence. Now we show that the states of a search session MDP (SSMDP) also have the Markov property.

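Proposition 4.1.

For the search session MDP $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$ defined in Definition 3.7, any state $s \in \mathcal{S}$ is Markovian.

Proof. We only need to prove that for any step $t$ ($1 \le t \le T$) and any possible state-action sequence $s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t$ with respect to $t$, it holds that

$$\Pr(s_t \mid s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}) = \Pr(s_t \mid s_{t-1}, a_{t-1}).$$

Note that all states except $s_t$ in the sequence must be nonterminal states. According to the state definition, for any step $t'$ ($0 \le t' < t$), there must be an item page history $h(t')$ corresponding to the state $s_{t'}$ such that $s_{t'} = C(h(t'))$. So the state-action sequence can be rewritten as $C(h(0)), a_0, C(h(1)), a_1, \ldots, C(h(t-1)), a_{t-1}, s_t$. Note that for any step $t'$ ($0 < t' < t$), it holds that

$$h(t') = \big(h(t'-1), \mathcal{L}_K(\mathcal{D}_{t'-1}, a_{t'-1})\big),$$

where $\mathcal{L}_K(\mathcal{D}_{t'-1}, a_{t'-1})$ is the top $K$ list (i.e., item page) with respect to the unranked item set $\mathcal{D}_{t'-1}$ and the ranking action $a_{t'-1}$ in step $(t'-1)$. Given $h(t'-1)$, the unranked item set $\mathcal{D}_{t'-1}$ is deterministic. Thus, $h(t')$ is the necessary and unique result of the state-action pair $(C(h(t'-1)), a_{t'-1})$. Therefore, the event $(C(h(t'-1)), a_{t'-1})$ can be equivalently represented by the event $h(t')$, and the following derivation can be conducted:

$$\begin{aligned} \Pr(s_t \mid s_0, a_0, \ldots, s_{t-1}, a_{t-1}) &= \Pr(s_t \mid C(h(0)), a_0, \ldots, C(h(t-1)), a_{t-1}) \\ &= \Pr(s_t \mid h(1), h(2), \ldots, h(t-1), C(h(t-1)), a_{t-1}) \\ &= \Pr(s_t \mid h(t-1), C(h(t-1)), a_{t-1}) \\ &= \Pr(s_t \mid C(h(t-1)), a_{t-1}) \\ &= \Pr(s_t \mid s_{t-1}, a_{t-1}). \end{aligned}$$

The third step of the derivation holds because for any step $t'$ ($0 < t' < t$), $h(t'-1)$ is contained in $h(t')$. Similarly, the fourth step holds because $C(h(t-1))$ contains the occurrence of $h(t-1)$. ∎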

4.2. Reward Function

In a search session MDP $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$, the reward function $\mathcal{R}$ is a quantitative evaluation of the action performance in each state. Specifically, for any nonterminal state $s \in \mathcal{H}_C$, any action $a \in \mathcal{A}$, and any other state $s' \in \mathcal{S}$, $\mathcal{R}(s, a, s')$ is the expected value of the immediate rewards that numerically characterize the user feedback when action $a$ is taken in $s$ and the state changes to $s'$. Therefore, we need to translate user feedback into numeric reward values that a learning algorithm can understand.

In the online LTR domain, user clicks are commonly adopted as a reward metric (Katariya et al., 2017; Lagrée et al., 2016; Zoghi et al., 2017) to guide learning algorithms. However, in e-commerce scenarios, successful transactions between users (who search items) and sellers (whose items are ranked by the search engine) are more important than user clicks. Thus, our reward setting is designed to encourage more successful transactions. For any decision step $t$ ($0 \le t < T$), any item page history $h_t \in \mathcal{H}_t$, and any action $a \in \mathcal{A}$, let $h_{t+1} = (h_t, \mathcal{L}_K(\mathcal{D}_t, a))$. Recall that after observing the item page history $h_{t+1}$, a user will purchase an item with a conversion probability $b(h_{t+1})$. Although different users may choose different items to buy, from a statistical view, the deal prices of the transactions occurring in $h_{t+1}$ must follow an underlying distribution. We use $m(h_{t+1})$ to denote the expected deal price of $h_{t+1}$. Then for the nonterminal state $C(h_t)$ and any state $s' \in \mathcal{S}$, the reward $\mathcal{R}(C(h_t), a, s')$ is set as follows:

$$\mathcal{R}(C(h_t), a, s') = \begin{cases} m(h_{t+1}) & \text{if } s' = B(h_{t+1}), \\ 0 & \text{otherwise,} \end{cases} \qquad (3)$$

where $B(h_{t+1})$ is the terminal state which represents the conversion event of $h_{t+1}$. The agent receives a positive reward from the environment only when its ranking action leads to a successful transaction. In all other cases, the reward is zero. It should be noted that the expected deal price of any item page history is most probably unknown beforehand. In practice, the actual deal price of a transaction can be directly used as the reward signal.

4.3. Discount Rate

The discount rate is an important parameter of an MDP which defines the importance of future rewards in the objective of the agent (defined in Section 2.1). For the search session MDP (SSMDP) defined in this paper, the choice of the discount rate brings out a fundamental question: “Is it necessary for the search engine agent to consider future rewards when making decisions?” We will find out the answer and determine an appropriate value of the discount rate by analyzing how the objective of maximizing long-time accumulative rewards is related to the goal of improving the search engine’s economic performance.

Let $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$ be a search session MDP with respect to a query $q$, an item set $\mathcal{D}$ and an integer $K$ ($K > 0$). Given a fixed deterministic policy $\pi$ of the agent (more precisely, $\pi: \mathcal{H}_C \rightarrow \mathcal{A}$ is a mapping from the nonterminal state set $\mathcal{H}_C$ to the action space $\mathcal{A}$; our conclusion also holds for stochastic policies, but we omit the discussion due to space limitations), denote the item page history occurring at step $t$ ($0 \le t \le T$) under $\pi$ by $h_t^{\pi}$. We enumerate all possible states that can be visited in a search session under $\pi$ in Figure 2. For better illustration, we show all item page histories (marked in red) in the figure. Note that they are not the states of the SSMDP $\mathcal{M}$. Next, we will rewrite $C(h_t^{\pi})$, $c(h_t^{\pi})$, $b(h_t^{\pi})$, and $m(h_t^{\pi})$ as $C_t^{\pi}$, $c_t^{\pi}$, $b_t^{\pi}$, and $m_t^{\pi}$ for simplicity.

Figure 2. All states that can be visited under policy $\pi$. The black circles are nonterminal states and the black squares are terminal states. The red circles are item page histories. The solid black arrow starting from each nonterminal state represents the execution of the policy $\pi$. The dotted arrows from each item page history are state transitions, with the corresponding transition probabilities marked in blue.

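Without loss of generality, we assume the discount rate of the SSMDP is $\gamma$ ($0 \le \gamma \le 1$). Denote the state value function (i.e., expected accumulative rewards) under $\gamma$ by $V_{\gamma}^{\pi}$. For each step $t$ ($0 \le t < T$), the state value of the nonterminal state $C_t^{\pi}$ is

$$V_{\gamma}^{\pi}(C_t^{\pi}) = \mathbb{E}^{\pi}\Big\{\sum_{k=1}^{T-t} \gamma^{k-1} r_{t+k} \,\Big|\, C_t^{\pi}\Big\} = \sum_{k=1}^{T-t} \gamma^{k-1}\, \mathbb{E}^{\pi}\{r_{t+k} \mid C_t^{\pi}\}, \qquad (4)$$

where for any $k$ ($1 \le k \le T-t$), $r_{t+k}$ is the immediate reward received at the future step $(t+k)$ in the item page history $h_{t+k}^{\pi}$. According to the reward function in Equation (3), the expected value of the immediate reward $r_{t+k}$ under $\pi$ is

$$\mathbb{E}^{\pi}\{r_{t+k}\} = b_{t+k}^{\pi}\, m_{t+k}^{\pi}, \qquad (5)$$

where $m_{t+k}^{\pi} = m(h_{t+k}^{\pi})$ is the expected deal price of the item page history $h_{t+k}^{\pi}$. However, since $V_{\gamma}^{\pi}(C_t^{\pi})$ is the expected discounted accumulative rewards on condition of the state $C_t^{\pi}$, the probability that the item page history $h_{t+k}^{\pi}$ is reached when $C_t^{\pi}$ is visited should be taken into account. Denoting the reaching probability from $C_t^{\pi}$ to $h_{t+k}^{\pi}$ by $\Pr(C_t^{\pi} \rightarrow h_{t+k}^{\pi})$, it can be computed as follows according to the state transition function in Equation (1):

$$\Pr(C_t^{\pi} \rightarrow h_{t+k}^{\pi}) = \begin{cases} 1 & \text{if } k = 1, \\ \prod_{j=1}^{k-1} c_{t+j}^{\pi} & \text{if } 1 < k \le T-t. \end{cases} \qquad (6)$$

The reaching probability from $C_t^{\pi}$ to $h_{t+1}^{\pi}$ is $1$ since $h_{t+1}^{\pi}$ is the direct result of the state-action pair $(C_t^{\pi}, \pi(C_t^{\pi}))$. For other future item page histories, the reaching probability is the product of all continuing probabilities along the path from $h_{t+1}^{\pi}$ to $h_{t+k-1}^{\pi}$. By substituting Equations (5) and (6) into Equation (4), $V_{\gamma}^{\pi}(C_t^{\pi})$ can be further computed as follows:

$$V_{\gamma}^{\pi}(C_t^{\pi}) = b_{t+1}^{\pi} m_{t+1}^{\pi} + \sum_{k=2}^{T-t} \gamma^{k-1} \Big(\prod_{j=1}^{k-1} c_{t+j}^{\pi}\Big) b_{t+k}^{\pi} m_{t+k}^{\pi}. \qquad (7)$$

With the conversion probability and the expected deal price of each item page history in Figure 2, we can also derive the expected gross merchandise volume (GMV) led by the search engine agent in a search session under the policy $\pi$ as follows:

$$\mathbb{E}_{gmv}^{\pi} = b_1^{\pi} m_1^{\pi} + \sum_{k=2}^{T} \Big(\prod_{j=1}^{k-1} c_j^{\pi}\Big) b_k^{\pi} m_k^{\pi}. \qquad (8)$$

By comparing Equations (7) and (8), it can be easily found that $\mathbb{E}_{gmv}^{\pi} = V_{\gamma}^{\pi}(C_0^{\pi})$ when the discount rate $\gamma = 1$. That is to say, when $\gamma = 1$, maximizing the expected accumulative rewards directly leads to the maximization of the expected GMV. However, when $\gamma < 1$, maximizing the value function $V_{\gamma}^{\pi}$ does not necessarily maximize $\mathbb{E}_{gmv}^{\pi}$, since the latter is only an upper bound of $V_{\gamma}^{\pi}(C_0^{\pi})$.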

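Proposition 4.2.

Let $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$ be a search session MDP. For any deterministic policy $\pi: \mathcal{H}_C \rightarrow \mathcal{A}$ and any discount rate $\gamma$ ($0 \le \gamma \le 1$), it is the case that $V_{\gamma}^{\pi}(C(h_0)) \le \mathbb{E}_{gmv}^{\pi}$, where $V_{\gamma}^{\pi}$ is the state value function defined in Equation (4), $C(h_0)$ is the initial nonterminal state of a search session, and $\mathbb{E}_{gmv}^{\pi}$ is the expected gross merchandise volume (GMV) of $\pi$ defined in Equation (8). Equality is guaranteed only when $\gamma = 1$.

Proof. The proof is straightforward, since the difference between $\mathbb{E}_{gmv}^{\pi}$ and $V_{\gamma}^{\pi}(C(h_0))$, namely $\sum_{k=2}^{T} (1 - \gamma^{k-1}) \big(\prod_{j=1}^{k-1} c_j^{\pi}\big) b_k^{\pi} m_k^{\pi}$, is nonnegative, and it is strictly positive when $\gamma < 1$ as long as at least one item page history after the first has a nonzero reaching probability, conversion probability and expected deal price. ∎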
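The relation between the discounted state value and the expected GMV can be checked numerically. The sketch below assumes a fixed policy whose item page histories have known conversion probabilities, continuing probabilities and expected deal prices; each later page is weighted by the product of the continuing probabilities of the pages before it. The function names and all numbers are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def initial_state_value(b: Sequence[float], c: Sequence[float],
                        m: Sequence[float], gamma: float) -> float:
    """Discounted value of the initial state; index i describes page i + 1."""
    value, reach, discount = 0.0, 1.0, 1.0
    for b_k, c_k, m_k in zip(b, c, m):
        value += discount * reach * b_k * m_k
        reach *= c_k        # probability that the session continues past this page
        discount *= gamma   # one more step of discounting for the next page
    return value


def expected_gmv(b: Sequence[float], c: Sequence[float], m: Sequence[float]) -> float:
    """Expected deal amount of a session: the same sum without discounting."""
    total, reach = 0.0, 1.0
    for b_k, c_k, m_k in zip(b, c, m):
        total += reach * b_k * m_k
        reach *= c_k
    return total


# Hypothetical three-page session.
b = [0.1, 0.2, 0.3]     # conversion probabilities of pages 1..3
c = [0.6, 0.5, 0.0]     # continuing probabilities of pages 1..3
m = [10.0, 20.0, 30.0]  # expected deal prices of pages 1..3

gmv = expected_gmv(b, c, m)
value_undiscounted = initial_state_value(b, c, m, gamma=1.0)
value_discounted = initial_state_value(b, c, m, gamma=0.9)
value_myopic = initial_state_value(b, c, m, gamma=0.0)
```

In this toy session the undiscounted value matches the expected GMV of 6.1, while a discount rate of 0.9 gives about 5.35 and a fully myopic agent only sees the first page's contribution of 1.0.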

Now we can answer the question posed at the beginning of this section: considering future rewards in a search session MDP is necessary, since maximizing the undiscounted expected accumulative rewards optimizes the performance of the search engine in terms of GMV. The sequential nature of our multi-step ranking problem requires the ranking decisions at different steps to be optimized jointly rather than independently.

5. Algorithm

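In this section, we propose a policy gradient algorithm for learning an optimal ranking policy in a search session MDP (SSMDP). We resort to the policy gradient method since directly optimizing a parameterized policy function addresses both the policy representation issue and the large-scale action space issue of an SSMDP. Now we briefly review the policy gradient method in the context of SSMDP. Let $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$ be an SSMDP and $\pi_{\theta}$ be the policy function with the parameter $\theta$. The objective of the agent is to find an optimal parameter which maximizes the expectation of the $T$-step returns along all possible trajectories

$$J(\theta) = \mathbb{E}_{\tau \sim \rho_{\theta}}\{R(\tau)\} = \mathbb{E}_{\tau \sim \rho_{\theta}}\Big\{\sum_{t=0}^{T-1} r_t\Big\},$$

where $\tau$ is a trajectory like $s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T$ which follows the trajectory distribution $\rho_{\theta}$ under the policy parameter $\theta$, and $R(\tau) = \sum_{t=0}^{T-1} r_t$ is the $T$-step return of the trajectory $\tau$. Note that if the terminal state of a trajectory is reached in fewer than $T$ steps, the sum of the rewards is truncated at that state. The gradient of the target $J(\theta)$ with respect to the policy parameter $\theta$ is

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \rho_{\theta}}\Big\{\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(s_t, a_t)\, R_t^T(\tau)\Big\},$$

where $R_t^T(\tau) = \sum_{t'=t}^{T-1} r_{t'}$ is the sum of rewards from step $t$ to the terminal step $T$ in the trajectory $\tau$. This gradient leads to the well-known REINFORCE algorithm (Williams, 1992). The policy gradient theorem proposed by Sutton et al. (Sutton et al., 2000) provides a framework which generalizes the REINFORCE algorithm. In general, the gradient of $J(\theta)$ can be written as

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim d^{\pi_{\theta}},\, a \sim \pi_{\theta}}\big\{\nabla_{\theta} \log \pi_{\theta}(s, a)\, Q^{\pi_{\theta}}(s, a)\big\},$$

where $d^{\pi_{\theta}}$ denotes the distribution of states visited under $\pi_{\theta}$ and $Q^{\pi_{\theta}}$ is the state-action value function under the policy $\pi_{\theta}$. If $\pi_{\theta}$ is deterministic, the gradient of $J(\theta)$ can be rewritten as

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim d^{\pi_{\theta}}}\big\{\nabla_{\theta} \pi_{\theta}(s)\, \nabla_a Q^{\pi_{\theta}}(s, a)\big|_{a = \pi_{\theta}(s)}\big\}.$$

Silver et al. (Silver et al., 2014) show that the deterministic policy gradient is the limiting case of the stochastic policy gradient as the policy variance tends to zero. The value function $Q^{\pi_{\theta}}$ can be estimated by temporal-difference learning (e.g., actor-critic methods (Sutton and Barto, 1998)) aided by a function approximator $Q^w$ with the parameter $w$ which minimizes the mean squared error $MSE(w) = \|Q^w - Q^{\pi_{\theta}}\|^2$.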

5.1. The DPG-FBE Algorithm

Instead of using stochastic policy gradient algorithms, we rely on the deterministic policy gradient (DPG) algorithm (Silver et al., 2014) to learn an optimal ranking policy in an SSMDP, since from a practical viewpoint, computing the stochastic policy gradient may require more samples, especially if the action space has many dimensions. However, we have to overcome the difficulty in estimating the value function $Q^{\pi_{\theta}}$, which is caused by the high variance and unbalanced distribution of the immediate rewards in each state. As indicated by Equation (3), the immediate reward of any state-action pair $(s, a)$ is either zero or the expected deal price $m(h)$ of the item page history $h$ resulting from $(s, a)$. Firstly, the reward variance is high because the deal price normally varies over a wide range. Secondly, the immediate reward distribution of $(s, a)$ is unbalanced because the conversion events led by $(s, a)$ occur much less frequently than the two other cases (i.e., abandon and continuing events), which produce zero rewards. Note that the same problem also exists for the $T$-step returns of the trajectories in an SSMDP, since in any possible trajectory only the reward of the last step may be nonzero. Therefore, estimating $Q^{\pi_{\theta}}$ by Monte Carlo evaluation or temporal-difference learning may cause inaccurate updates of the value function parameter and further affect the optimization of the policy parameter.
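A rough calculation illustrates how noisy a sampled immediate reward can be relative to its expectation. The sketch below makes the simplifying assumption that every purchase has the same deal price, so that the reward equals the price with the conversion probability and zero otherwise; real deal prices vary, which would add further variance. The function name and the numbers are hypothetical.

```python
from typing import Tuple


def reward_mean_and_std(conversion_prob: float, price: float) -> Tuple[float, float]:
    """Mean and standard deviation of a reward that is `price` with
    probability `conversion_prob` and zero otherwise (fixed-price assumption)."""
    mean = conversion_prob * price
    variance = conversion_prob * (1.0 - conversion_prob) * price ** 2
    return mean, variance ** 0.5


# Hypothetical numbers: a 2% conversion probability and a deal price of 100.
mean, std = reward_mean_and_std(conversion_prob=0.02, price=100.0)
```

With these numbers the expected reward is 2 while the standard deviation is 14, so a single sampled reward is a poor stand-in for the expectation.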

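Our way of solving the above problem is similar to the model-based reinforcement learning approaches (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002), which maintain an approximate model of the environment to help perform reliable updates of value functions. According to the Bellman equation (Sutton and Barto, 1998), the state-action value of any state-action pair $(s, a)$ under any policy $\pi$ is

$$Q^{\pi}(s, a) = \mathcal{T}^{\pi} Q^{\pi}(s, a) = \sum_{s' \in \mathcal{S}} \mathcal{P}(s, a, s') \big[\mathcal{R}(s, a, s') + Q^{\pi}(s', \pi(s'))\big],$$

where $\mathcal{T}^{\pi}$ denotes the Bellman operator, the discount rate is $\gamma = 1$ following Section 4.3, and the value of any terminal state is zero. Let $h'$ be the next item page history resulting from $(s, a)$. Only the states $C(h')$, $B(h')$, and $L(h')$ can be reached from $s$ with nonzero probability. Among these three states, only $B(h')$ involves a nonzero immediate reward and only $C(h')$ involves a nonzero $Q$ value. So the above equation can be simplified to

$$Q^{\pi}(s, a) = b(h')\, m(h') + c(h')\, Q^{\pi}(s', \pi(s')), \quad s' = C(h'),$$

where $b(h')$, $c(h')$, and $m(h')$ are the conversion probability, continuing probability and expected deal price of $h'$, respectively. Therefore, when the value function $Q^{\pi_{\theta}}$ is approximated by a parameterized function $Q^w$, we can use $\mathcal{T}^{\pi_{\theta}} Q^w$ as an estimation of $Q^{\pi_{\theta}}$ to approximately compute the mean squared error (MSE) of $Q^w$, then optimize $w$. Specifically, we have

$$MSE(w) \approx \|Q^w - \mathcal{T}^{\pi_{\theta}} Q^w\|^2 = \sum_{(s, a)} \Big(Q^w(s, a) - b(h')\, m(h') - c(h')\, Q^w(s', \pi_{\theta}(s'))\Big)^2,$$

where $h'$ represents the item page history resulting from each state-action pair $(s, a)$ and $s' = C(h')$. To minimize $MSE(w)$, every time a state-action pair $(s, a)$ as well as its next item page history $h'$ is observed, the parameter $w$ can be updated in a full backup manner:

$$\Delta w = \alpha_w \Big(b(h')\, m(h') + c(h')\, Q^w(s', \pi_{\theta}(s')) - Q^w(s, a)\Big) \nabla_w Q^w(s, a),$$

where $\alpha_w$ is a learning rate and $s' = C(h')$ is the state of the continuing event of $h'$. With this full backup updating method, the sampling errors caused by immediate rewards or returns can be avoided. Furthermore, the computational cost of full backups in our problem is almost equal to that of one-step sample backups.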
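The full backup update described above can be sketched for a linear critic. The target combines the modeled conversion probability, expected deal price and continuing probability of the next item page history with the critic's own estimate for the continuation state, so no sampled reward enters the update. This is a minimal sketch under stated assumptions: the linear form of the critic, the feature vector and all numbers are hypothetical, and the next-state value is assumed to be computed elsewhere from the current policy's action.

```python
from typing import List, Sequence


def q_linear(w: Sequence[float], features: Sequence[float]) -> float:
    """Linear critic: Q_w(s, a) = w . phi(s, a) (an assumed critic form)."""
    return sum(w_i * x_i for w_i, x_i in zip(w, features))


def full_backup_target(b_next: float, c_next: float, m_next: float,
                       q_next: float) -> float:
    """Target built from model outputs instead of a sampled reward:
    conversion prob * expected price + continuing prob * next Q value."""
    return b_next * m_next + c_next * q_next


def critic_update(w: Sequence[float], features: Sequence[float],
                  target: float, learning_rate: float) -> List[float]:
    """One gradient step that moves Q_w(s, a) towards the target."""
    error = target - q_linear(w, features)
    # For a linear critic the gradient of Q_w with respect to w is the feature vector.
    return [w_i + learning_rate * error * x_i for w_i, x_i in zip(w, features)]


# Hypothetical numbers for one observed state-action pair.
w = [0.0, 0.0]
phi_sa = [1.0, 2.0]
target = full_backup_target(b_next=0.1, c_next=0.5, m_next=20.0, q_next=4.0)
w_new = critic_update(w, phi_sa, target, learning_rate=0.1)
```

Because the target depends only on the model outputs and the critic itself, the update for a given state-action pair is the same regardless of whether the user in that particular session happened to buy.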

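Input: Learning rates $\alpha_{\theta}$ and $\alpha_w$, pretrained conversion probability model $b$, continuing probability model $c$, and expected deal price model $m$ of item page histories
1 Initialize the actor $\pi_{\theta}$ and the critic $Q^w$ with parameters $\theta$ and $w$;
2 foreach search session do
3        Use $\pi_{\theta}$ to generate a ranking action at each step with exploration;
4        Get the trajectory $\tau$ of the session with its final step index $t$;
5        $\Delta w \leftarrow 0$, $\Delta\theta \leftarrow 0$;
6        for $k = 0, 1, \ldots, t-1$ do
7               $(s_k, a_k, r_k, s_{k+1}) \leftarrow$ the sample tuple at step $k$;
8               $h_{k+1} \leftarrow$ the item page history of $s_{k+1}$;
9               if $s_{k+1} = B(h_{k+1})$ then
10                      Update the models $b$, $c$, and $m$ with the samples $(h_{k+1}, 1)$, $(h_{k+1}, 0)$, and $(h_{k+1}, r_k)$, respectively;
12              else
13                      Update the models $b$ and $c$ with the samples $(h_{k+1}, 0)$ and $(h_{k+1}, \mathbb{1}[s_{k+1} = C(h_{k+1})])$, respectively;
15              $s' \leftarrow C(h_{k+1})$, $a' \leftarrow \pi_{\theta}(s')$;
16               $y \leftarrow b(h_{k+1})\, m(h_{k+1}) + c(h_{k+1})\, Q^w(s', a')$;
17               $\Delta w \leftarrow \Delta w + \alpha_w \big(y - Q^w(s_k, a_k)\big) \nabla_w Q^w(s_k, a_k)$;
18               $\Delta\theta \leftarrow \Delta\theta + \alpha_{\theta} \nabla_{\theta} \pi_{\theta}(s_k)\, \nabla_a Q^w(s_k, a)\big|_{a = \pi_{\theta}(s_k)}$;
20       $w \leftarrow w + \Delta w$, $\theta \leftarrow \theta + \Delta\theta$;
Algorithm 1 Deterministic Policy Gradient with Full Backup Estimation (DPG-FBE)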

Our policy gradient algorithm is based on the deterministic policy gradient theorem (Silver et al., 2014) and the full backup estimation of the Q-value functions. Unlike previous works which entirely model the reward and state transition functions (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002), we only need to build the conversion probability model $b$, the continuing probability model $c$, and the expected deal price model $m$ of the item page histories in an SSMDP. These models can be trained using online or offline data by any suitable statistical learning method. We call our algorithm Deterministic Policy Gradient with Full Backup Estimation (DPG-FBE) and show its details in Algorithm 1.

As shown in Algorithm 1, the parameters $\theta$ and $w$ are updated after each search session between the search engine agent and users. Exploration (at line 3) can be done by, but is not limited to, $\epsilon$-greedy (in the discrete action case) or adding random noise to the output of $\pi_{\theta}$ (in the continuous action case). Although we make no assumptions on the specific models used for learning the actor $\pi_{\theta}$ and the critic $Q^w$ in Algorithm 1, nonlinear models such as neural networks are preferred due to the large state/action space of an SSMDP. To address the convergence problem and ensure a stable learning process, a replay buffer and target updates are also suggested (Mnih et al., 2015; Lillicrap et al., 2015).

6. Experiments

In this section, we conduct two groups of experiments: A simulated experiment in which we construct an online shopping simulator and test our algorithm DPG-FBE as well as some state-of-the-art online learning to rank (LTR) algorithms, and a real application in which we apply our algorithm in TaoBao, one of the largest e-commerce platforms in the world.

6.1. Simulation

The online shopping simulator is constructed based on the statistical information of items and user behaviors in TaoBao. An item is represented by a -dim () feature vector and a ranking action of the search engine is a -dim weight vector . The ranking score of the item under the ranking action is the inner product of the two vectors. We choose important features related to the item category of dress (e.g., price and quality) and generate an item set by sampling items from a distribution approximated with all the items of the dress category. Each page contains items so that there are at most ranking rounds in a search session. In each ranking round, the user's operations on the current item page (such as clicks, abandonment, and purchases) are simulated by a user behavior model, which is constructed from the user behavior data of the dress items in TaoBao. The simulator outputs the probability of each possible user operation given the recent item pages examined by the user. A search session ends when the user purchases an item or leaves.
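
The linear ranking scheme described above can be sketched in a few lines. The item names, the two-dimensional features, and the page size below are hypothetical values chosen for illustration; only the inner-product scoring and the page-by-page ranking of the remaining items follow the text.

```python
def next_page(remaining, weights, page_size):
    """Score the not-yet-shown items by inner product and return (page, rest)."""
    def score(item):
        _, features = item
        return sum(w * x for w, x in zip(weights, features))
    ranked = sorted(remaining, key=score, reverse=True)
    return ranked[:page_size], ranked[page_size:]

# Hypothetical item set: (item id, feature vector).
items = [("a", [0.9, 0.1]), ("b", [0.2, 0.8]), ("c", [0.5, 0.5]), ("d", [0.1, 0.1])]

# The ranking action (weight vector) may differ from one step to the next.
page_1, rest = next_page(items, weights=[1.0, 0.0], page_size=2)
page_2, rest = next_page(rest, weights=[0.0, 1.0], page_size=2)

page_1_ids = [item_id for item_id, _ in page_1]  # ['a', 'c']
page_2_ids = [item_id for item_id, _ in page_2]  # ['b', 'd']
```

Because each step ranks only the items that have not yet been displayed, the number of ranking rounds in a session is bounded by the size of the item set divided by the page size.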

Figure 3. The learning performance of the DDPG-FBE algorithm in the simulation experiment

Our implementation of the DPG-FBE algorithm is a deep RL version (DDPG-FBE) which adopts deep neural networks (DNN) as the policy and value function approximators (i.e., actor and critic). We also implement the deep DPG algorithm (DDPG) (Lillicrap et al., 2015). The state of the environment is represented by a -dim feature vector extracted from the last item pages of the current search session. The actor and critic networks of the two algorithms have two fully connected hidden layers with and units, respectively. We adopt ReLU and tanh as the activation functions for the hidden layers and the output layers of all networks, respectively. The network parameters are optimized by Adam with a learning rate of for the actor and for the critic. The parameter for the soft target updates (Lillicrap et al., 2015) is set to . We test the performance of the two algorithms under different settings of the discount rate . Five online LTR algorithms, point-wise LTR, BatchRank (Zoghi et al., 2017), CascadeUCB1 (Kveton et al., 2015a), CascadeKL-UCB (Kveton et al., 2015a), and RankedExp3 (Radlinski et al., 2008), are implemented for comparison. Like the two RL algorithms, the point-wise LTR method implemented in our simulation also learns a parameterized function which outputs a ranking weight vector in each state of a search session. We choose a DNN as the parameterized function and use the logistic regression algorithm to train the model, with an objective function that approximates the goal of maximizing GMV. The four other online LTR algorithms are regret minimization algorithms which are based on variants of the bandit problem model. The test of each algorithm contains search sessions and the transaction amount of each session is recorded. Results are averaged over runs and are shown in Figures 3, 4, and 5.
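
To make the network structure concrete, the following sketch shows an actor with two fully connected ReLU hidden layers and a tanh output layer, together with the soft target update of Lillicrap et al. (2015). All dimensions, the initialization range, and the value of `tau` are placeholder assumptions and not the settings used in our experiments; training (Adam, the critic, and the full backup estimation) is omitted.

```python
import math
import random

# Hypothetical sizes, chosen only to keep the example small.
STATE_DIM, HIDDEN_1, HIDDEN_2, ACTION_DIM = 6, 8, 4, 3

def init_layer(n_in, n_out, rng):
    """Return (weights, bias) for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def actor_forward(state, layers):
    """ReLU on the hidden layers, tanh on the output layer."""
    h = state
    for layer in layers[:-1]:
        h = [max(0.0, z) for z in dense(h, layer)]
    return [math.tanh(z) for z in dense(h, layers[-1])]

def soft_update(target, online, tau):
    """target <- tau * online + (1 - tau) * target, over flat parameter lists."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]

rng = random.Random(0)
layers = [
    init_layer(STATE_DIM, HIDDEN_1, rng),
    init_layer(HIDDEN_1, HIDDEN_2, rng),
    init_layer(HIDDEN_2, ACTION_DIM, rng),
]
action = actor_forward([0.2] * STATE_DIM, layers)        # ranking weight vector
blended = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)   # one soft target step
```

The tanh output keeps every component of the ranking weight vector in [-1, 1], and a small `tau` makes the target parameters track the online parameters slowly.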

Figure 4. The learning performance of the DDPG algorithm in the simulation experiment

Let us first examine the results of DDPG-FBE. It can be seen that the performance of DDPG-FBE improves as the discount rate increases. The learning curve corresponding to the setting (the green one) is far below other curves in Fig. 3, which indicates the importance of delayed rewards. The theoretical result in Section 4 is empirically verified since the DDPG-FBE algorithm achieves the best performance when , with growth of transaction amount per session compared to the second best performance. Note that in e-commerce scenarios, even growth is considerable. The DDPG algorithm also performs the best when , but it fails to learn as well as the DDPG-FBE algorithm. As shown in Fig. 4, all the learning curves of DDPG are under the value . Unlike the two RL algorithms which output weight vectors, the five online LTR algorithms can directly output a ranked item list according to their own ranking mechanisms. However, as we can observe in Fig. 5, the transaction amount led by each of these algorithms is much smaller than that led by DDPG-FBE and DDPG. This is not surprising since none of the online LTR algorithms are designed for the multi-step ranking problem, where the ranking decisions at different steps should be optimized jointly.
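
The role of the discount rate can be illustrated with a small worked example. The session below is hypothetical: the user browses two pages without buying and then purchases an item with a deal price of 100 on the third page. The discount rates are likewise illustrative values, not the settings tested above.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k], computed backwards from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

session_rewards = [0.0, 0.0, 100.0]  # reward arrives only at the final step
returns = {gamma: discounted_return(session_rewards, gamma) for gamma in (0.0, 0.5, 1.0)}
# returns == {0.0: 0.0, 0.5: 25.0, 1.0: 100.0}
```

With a discount rate of 0, the first ranking step receives no credit for the eventual purchase, whereas a discount rate of 1 credits it with the full deal price. This is consistent with the observation that small discount rates lead to poor learning curves.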

6.2. Application

We apply our algorithm in the TaoBao search engine to provide an online realtime ranking service. The searching task in TaoBao is characterized by high concurrency and large data volume. Every second, the TaoBao search engine should respond to hundreds of thousands of users' requests in concurrent search sessions and simultaneously deal with the data produced from user behaviours. On sale promotion days such as the TMall Double Global Shopping Festival (footnote 3: this refers to November 11th of each year; on that day, most sellers in TaoBao and TMall carry out sale promotion and billions of people in the world join in the online shopping festival), both the volume and the producing rate of the data would be multiple times larger than the daily values.

Figure 5. The learning performance of five online LTR algorithms in the simulation experiment
Figure 6. RL ranking system of TaoBao search engine

In order to satisfy the requirement of high concurrency and the ability to process massive data in TaoBao, we design a data stream-driven RL ranking system for implementing our algorithm DPG-FBE. As shown in Figure 6, this system contains five major components: a query planner, a ranker, a log center, a reinforcement learning component, and an online KV system. The workflow of our system mainly consists of two loops. The first one is an online acting loop (at the bottom right of Figure 6), in which the interactions between the search engine and TaoBao users take place. The second one is a learning loop (on the left of the online acting loop in Figure 6) where the training process happens. The two working loops are connected through the log center and the online KV system, which are used for collecting user logs and storing the ranking policy model, respectively. In the first loop, every time a user requests an item page, the query planner will extract the state feature, get the parameters of the ranking policy model from the online KV system, and compute a ranking action for the current state (with exploration). The ranker will apply the computed action to the unranked items and display the top items (e.g., ) in an item page, where the user will give feedback. Meanwhile, the log data produced in the online acting loop is injected into the learning loop to construct the training data source. In the log center, the user logs collected from different search sessions are transformed to training samples like , which are output continuously in the form of a data stream and utilized by our algorithm to update the policy parameters in the learning component. Whenever the policy model is updated, it will be rewritten to the online KV system. Note that the two working loops in our system work in parallel but asynchronously, because the user log data generated in any search session cannot be utilized for training immediately.
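
The interaction between the two loops can be sketched with a toy, single-process example. Everything here is a simplifying assumption: a dictionary stands in for the online KV system, a queue stands in for the log center, the policy is reduced to a bare weight vector, and `placeholder_update` is a dummy rule rather than DPG-FBE. The point is only to show that acting reads the policy from the KV store while learning consumes logged samples and writes the policy back later.

```python
from collections import deque

kv_store = {"policy": [0.5, 0.5]}  # stands in for the online KV system
log_center = deque()               # stands in for the log center's sample stream

def acting_step(state, user_feedback):
    """Online acting loop: fetch the current policy, act, and log the outcome."""
    weights = list(kv_store["policy"])           # query planner reads the parameters
    reward = user_feedback(state, weights)       # user responds to the displayed page
    log_center.append((state, weights, reward))  # the log flows to the learning loop
    return weights

def learning_step(update):
    """Learning loop: drain pending samples, update the policy, write it back."""
    samples = []
    while log_center:
        samples.append(log_center.popleft())
    if samples:
        kv_store["policy"] = update(kv_store["policy"], samples)
    return len(samples)

def placeholder_update(params, samples):
    # Placeholder only: nudges the parameters by the mean logged reward.
    mean_reward = sum(r for _, _, r in samples) / len(samples)
    return [p + 0.1 * mean_reward for p in params]

def fake_feedback(state, weights):
    # Hypothetical user model: rewards 1.0 when the inner product is large enough.
    return 1.0 if sum(s * w for s, w in zip(state, weights)) > 0.4 else 0.0

first = acting_step([1.0, 0.0], fake_feedback)
second = acting_step([0.0, 0.5], fake_feedback)
consumed = learning_step(placeholder_update)
```

Both acting steps use the same stored policy because the learning step runs only afterwards, which mirrors the asynchrony noted above: logged data influences ranking only once an updated model has been written back to the KV store.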

The linear ranking model used in our simulation is also adopted in this TaoBao application. The ranking action of the search engine is a -dim weight vector. The state of the environment is represented by a -dim feature vector, which contains the item page features, user features and query features of the current search session. We add user and query information to the state feature since the ranking service in TaoBao serves all types of users and there is no limitation on the input queries. We still adopt neural networks as the policy and value function approximators. However, to guarantee the online realtime performance and quick processing of the training data, the actor and critic networks have a much smaller scale than those used in our simulation, with only and units in each of their two fully connected hidden layers, respectively. We implement the DDPG and DDPG-FBE algorithms in our system and conduct a one-week A/B test to compare the two algorithms. On each day of the test, the DDPG-FBE algorithm leads to more transaction amount than the DDPG algorithm (footnote 4: we cannot report the accurate transaction amount due to the information protection rule of Alibaba; as a reference index, the GMV achieved by Alibaba’s China retail marketplace platforms surpassed billion U.S. dollars in the fiscal year of 2016 (Alizila, 2017)). The DDPG-FBE algorithm was also used for the online ranking service on the TMall Double Global Shopping Festival of . Compared with the baseline algorithm (an LTR algorithm trained offline), our algorithm achieved more than growth in GMV at the end of that day.

7. Conclusions

In this paper, we propose to use reinforcement learning (RL) for ranking control in e-commerce searching scenarios. Our contributions are as follows. Firstly, we formally define the concept of search session Markov decision process (SSMDP) to formulate the multi-step ranking problem in a search session. Secondly, we analyze the property of SSMDP and theoretically prove the necessity of maximizing accumulative rewards. Lastly, we propose a novel policy gradient algorithm for learning an optimal ranking policy in an SSMDP. Experimental results in simulation and in the TaoBao search engine show that our algorithm performs much better than online LTR methods in the multi-step ranking problem, with more than and growth in gross merchandise volume (GMV), respectively.


References

  • Alizila (2017) Alizila. 2017. Joe Tsai Looks Beyond Alibaba’s RMB 3 Trillion Milestone. (2017).
  • Auer (2002) Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3, Nov (2002), 397–422.
  • Brafman and Tennenholtz (2002) Ronen I. Brafman and Moshe Tennenholtz. 2002. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3 (2002), 213–231.
  • Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning. 89–96.
  • Cao et al. (2006) Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao-Wuen Hon. 2006. Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual International Conference on Research and Development in Information Retrieval (SIGIR’06). 186–193.
  • Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (ICML’07). ACM, 129–136.
  • Hofmann et al. (2013) Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2013. Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval. Information Retrieval 16, 1 (2013), 63–90.
  • Joachims (2002) Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02). ACM, 133–142.
  • Katariya et al. (2017) Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, and Zheng Wen. 2017. Stochastic Rank-1 Bandits. In Artificial Intelligence and Statistics. 392–401.
  • Kearns and Singh (2002) Michael Kearns and Satinder Singh. 2002. Near-optimal reinforcement learning in polynomial time. Machine Learning 49, 2-3 (2002), 209–232.
  • Kveton et al. (2015a) Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. 2015a. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 767–776.
  • Kveton et al. (2015b) Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. 2015b. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems (NIPS’15). 1450–1458.
  • Lagrée et al. (2016) Paul Lagrée, Claire Vernade, and Olivier Cappe. 2016. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems (NIPS’16). 1597–1605.
  • Langford and Zhang (2008) John Langford and Tong Zhang. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems. 817–824.
  • Li et al. (2008) Ping Li, Qiang Wu, and Christopher J Burges. 2008. Mcrank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems (NIPS’08). 897–904.
  • Li et al. (2016) Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. 2016. Contextual combinatorial cascading bandits. In International Conference on Machine Learning (ICML’16). 1245–1253.
  • Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
  • Liu et al. (2009) Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225–331.
  • Maei et al. (2010) Hamid R. Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. 2010. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning. 719–726.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
  • Nallapati (2004) Ramesh Nallapati. 2004. Discriminative models for information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’04). ACM, 64–71.
  • Radlinski et al. (2008) Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th international conference on Machine learning. ACM, 784–791.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15). 1889–1897.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
  • Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML’14). 387–395.
  • Slivkins et al. (2013) Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. 2013. Ranked bandits in metric spaces: learning diverse rankings over large document collections. Journal of Machine Learning Research 14, Feb (2013), 399–436.
  • Sutton and Barto (1998) R.S. Sutton and A.G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.
  • Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS’00). 1057–1063.
  • Watkins (1989) C.J.C.H. Watkins. 1989. Learning from delayed rewards. Ph.D. Dissertation. King’s College, Cambridge.
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229–256.
  • Yue and Joachims (2009) Yisong Yue and Thorsten Joachims. 2009. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML’09). ACM, 1201–1208.
  • Zoghi et al. (2017) Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. 2017. Online Learning to Rank in Stochastic Click Models. In International Conference on Machine Learning. 4199–4208.
  • Zong et al. (2016) Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. 2016. Cascading bandits for large-scale recommendation problems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (UAI’16). 835–844.