Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application
Abstract.
In ecommerce platforms such as Amazon and TaoBao, ranking items in a search session is a typical multistep decision-making problem. Learning to rank (LTR) methods have been widely applied to ranking problems. However, such methods often treat the different ranking steps in a session as independent, whereas in fact they may be highly correlated with each other. To better utilize the correlation between different ranking steps, in this paper we propose to use reinforcement learning (RL) to learn an optimal ranking policy which maximizes the expected accumulative rewards in a search session. Firstly, we formally define the concept of search session Markov decision process (SSMDP) to formulate the multistep ranking problem. Secondly, we analyze the property of SSMDP and theoretically prove the necessity of maximizing accumulative rewards. Lastly, we propose a novel policy gradient algorithm for learning an optimal ranking policy, which is able to deal with the problem of high reward variance and unbalanced reward distribution of an SSMDP. Experiments are conducted in simulation and in the TaoBao search engine. The results demonstrate that our algorithm performs much better than online LTR methods, with more than and growth of total transaction amount in the simulation and the real application, respectively.
1. Introduction
Over the past decades, online shopping has become an important part of people’s daily life, requiring ecommerce giants like Amazon, eBay and TaoBao to provide stable and attractive services for hundreds of millions of users all over the world. Among these services, commodity search is the fundamental infrastructure of these ecommerce platforms, affording users the opportunity to search for commodities, browse product information and make comparisons. For example, every day millions of users choose to purchase commodities through the TaoBao search engine.
In this paper, we focus on the problem of ranking items in large-scale item search engines, which refers to assigning each item a score and sorting the items according to their scores. Generally, a search session between a user and the search engine is a multistep ranking problem as follows:

(1) the user inputs a query in the search box of the search engine,

(2) the search engine ranks the items related to the query and displays the top $K$ items (e.g., ) in a page,

(3) the user performs some operations (e.g., clicking items, buying a certain item, or just requesting a new page of the same query) on the page,

(4) when a new page is requested, the search engine reranks the rest of the items and displays the top $K$ items on a new page.
These four steps will repeat until the user buys some items or just leaves the search session. Empirically, a successful transaction always involves multiple rounds of the above process.
The operations of users in a search session may indicate their personal intentions and preferences on items. From a statistical view, these signals can be utilized to learn a ranking function which satisfies the users’ demand. This motivates the marriage of machine learning and information retrieval, namely the learning to rank (LTR) methods (Joachims, 2002; Liu et al., 2009), which learn a ranking function by classification or regression from training data. The major paradigms of supervised LTR methods are pointwise (Nallapati, 2004; Li et al., 2008), pairwise (Cao et al., 2006; Burges et al., 2005), and listwise (Cao et al., 2007). Recently, online learning techniques such as regret minimization (Auer, 2002; Langford and Zhang, 2008; Kveton et al., 2015a) have been introduced into the LTR domain for directly learning from user signals. Compared with offline LTR, online LTR avoids both the mismatch between manually curated labels and user intent (Yue and Joachims, 2009) and the expensive cost of creating labeled data sets. Although rigorous mathematical models are adopted for problem formalization (Yue and Joachims, 2009; Kveton et al., 2015a; Zoghi et al., 2017) and guarantees on regret bounds are established, most of those works only consider a one-shot ranking problem, which means that the interaction between the search engine and each user contains only one round of ranking-and-feedback activity. However, in practice, a search session often contains multiple rounds of interactions, and the sequential correlation between rounds may be an important factor for ranking, which has not been well investigated.
In this paper, we consider the multistep sequential ranking problem mentioned above and propose a novel reinforcement learning (RL) algorithm for learning an optimal ranking policy. The major contributions of this paper are as follows.

We formally define the concept of search session Markov decision process (SSMDP) to formulate the multistep ranking problem, by identifying the state space, reward function and state transition function.

We theoretically prove that maximizing accumulative rewards is necessary, indicating that the different ranking steps in a session are tightly correlated rather than independent.

We propose a novel algorithm named deterministic policy gradient with full backup estimation (DPGFBE), designed for the problem of high reward variance and unbalanced reward distribution of SSMDP, which can hardly be handled even by existing state-of-the-art RL algorithms.

We empirically demonstrate that our algorithm performs much better than online LTR methods, with more than and growth of total transaction amount in the simulation and the real application (TaoBao search engine), respectively.
The rest of the paper is organized as follows. Section 2 introduces the background of this work. The problem description, the analysis of SSMDP, and the proposed algorithm are presented in Sections 3, 4, and 5, respectively. The experimental results are shown in Section 6, and Section 7 concludes the paper.
2. Background
In this section, we briefly review some key concepts of reinforcement learning and the related work in the online LTR domain. We start from the reinforcement learning part.
2.1. Reinforcement Learning
Reinforcement learning (RL) (Sutton and Barto, 1998) is a learning technique in which an agent learns from its interactions with the environment by trial and error. The fundamental mathematical model of RL is the Markov decision process (MDP).
Definition 2.1 (Markov Decision Process).
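A Markov decision process is a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space of the agent, $\mathcal{R}$ is the reward function, $\mathcal{P}$ is the state transition function and $\gamma$ ($0 \le \gamma \le 1$) is the discount rate.
The objective of an agent in an MDP is to find an optimal policy which maximizes the expected accumulative rewards starting from any state $s$ (typically under the infinite-horizon discounted setting), which is defined by $V^*(s) = \max_{\pi} \mathbb{E}^{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\}$, where $\pi$ denotes any policy of the agent, $\mathbb{E}^{\pi}$ stands for expectation under policy $\pi$, $t$ is the current time step, $k$ is a future time step, and $r_{t+k}$ is the immediate reward at the time step $(t+k)$. This goal is equivalent to finding the optimal state-action value $Q^*(s, a) = \max_{\pi} \mathbb{E}^{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\}$ for any state-action pair $(s, a)$. In the finite-horizon setting with a time horizon $T$, the objective of an agent can be reinterpreted as finding the optimal policy which maximizes the expected $T$-step discounted return $\mathbb{E}^{\pi}\{\sum_{k=0}^{T-1} \gamma^k r_{t+k} \mid s_t = s\}$ or undiscounted return $\mathbb{E}^{\pi}\{\sum_{k=0}^{T-1} r_{t+k} \mid s_t = s\}$ in the discounted and undiscounted reward cases, respectively. (Footnote 1: The undiscounted return is a special case of the discounted setting with $\gamma = 1$.)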
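As a small illustration of the finite-horizon return described above, the following sketch computes the discounted sum of a reward sequence. The function name `discounted_return` and the list-based interface are assumptions made for this example only; they do not come from the paper.

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence.

    rewards[0] is the reward at the current step; gamma=1.0 gives the
    undiscounted return. (Hypothetical helper, for illustration only.)
    """
    total = 0.0
    # Accumulate from the last reward backwards so that each earlier step
    # applies exactly one extra factor of gamma to everything after it.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    session_rewards = [0.0, 0.0, 10.0]  # reward only at the last step
    print(discounted_return(session_rewards, gamma=1.0))
    print(discounted_return(session_rewards, gamma=0.5))
```

With a reward that only appears at the end of a sequence, a smaller discount rate shrinks its contribution, which is the effect analyzed later for search sessions.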
An optimal policy can be found by computing the optimal state-value function $V^*$ or the optimal state-action value function $Q^*$. Early methods such as dynamic programming (Sutton and Barto, 1998) and temporal-difference learning (Watkins, 1989) rely on a table to store and compute the value functions. However, such tabular methods cannot scale up to problems with large state/action spaces due to the curse of dimensionality. Function approximation is widely used to address the scalability issues of RL. By using a parameterized function (e.g., linear functions (Maei et al., 2010), neural networks (Mnih et al., 2015; Silver et al., 2016)) to represent the value function or the policy (a.k.a. value function approximation and policy gradient methods, respectively), the learning problem is transformed into optimizing the function parameters according to reward signals. In recent years, policy gradient methods (Sutton et al., 2000; Silver et al., 2014; Schulman et al., 2015) have drawn much attention in the RL domain. The explicit parameterized representation of the policy enables the learning agent to directly search in the policy space and avoids the policy degradation problem of value function approximation.
2.2. Related Work
Early attempts at online LTR date back to the evaluation of RankSVM in online settings (Joachims, 2002). As claimed by Hofmann et al., balancing exploitation and exploration should be a key ability of online LTR methods (Hofmann et al., 2013). The theoretical results in the online learning community (typically in the bandit problem domain) (Auer, 2002; Langford and Zhang, 2008) provide rich mathematical tools for online LTR problem formalization and algorithms for efficient exploration, which have motivated many online LTR methods. In general, these methods can be divided into two groups. The first group learns the best ranking function from a function space (Yue and Joachims, 2009; Hofmann et al., 2013). For example, Yue and Joachims (Yue and Joachims, 2009) define a dueling bandit problem in which actions are pairwise comparisons between documents and the goal is to learn a parameterized retrieval function which has sublinear regret performance. The second group of online LTR methods directly learns the best list under some model of user interactions (Radlinski et al., 2008; Slivkins et al., 2013), which can be treated as an assumption on how users respond to a ranked list. Representative models include the cascade model (Kveton et al., 2015a, b; Zong et al., 2016; Li et al., 2016), the dependent-click model (Katariya et al., 2017), and the position-based model (Lagrée et al., 2016). Since no single model can entirely capture the behavior of all users, Zoghi et al. (Zoghi et al., 2017) recently proposed a stochastic click learning framework for online LTR in a broad class of click models.
Our work in this paper is more similar to the first group of online LTR methods, which learn ranking functions. However, while most previous works consider a one-shot ranking problem, we focus on learning a ranking policy in a multistep ranking problem, which contains multiple rounds of interactions and typically occurs in ecommerce scenarios.
3. Problem Formulation
As we mentioned in previous sections, in ecommerce platforms such as TaoBao and TMall, ranking items is a multistep decision-making problem given a query, where the search engine should take a ranking action whenever an item page is requested by a user. Figure 1 shows a typical search session between the search engine and a mobile app user in TaoBao. In the beginning, the user inputs a query “Cola” into the search box of the search engine and clicks the “Search” button. Then the search engine takes a ranking action and shows the top items related to “Cola” in page 1. The user browses the displayed items and clicks some of them for the details. When no items interest the user or the user wants to check more items for comparison, the user requests a new item page. The search engine again takes a ranking action and displays page 2. After a certain number of such ranking rounds, the search session will finally end when the user purchases items or just leaves the search session.
3.1. Search Session Modeling
Before we formulate the multistep ranking problem as an MDP, we define some concepts to formalize the contextual information and user behaviours in a search session, which are the basis to define the state and state transitions of our MDP.
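Definition 3.1 (Top $K$ List).
For an item set $\mathcal{D}$, a ranking function $f: \mathcal{D} \to \mathbb{R}$, and a positive integer $K$ ($1 \le K \le |\mathcal{D}|$), the top $K$ list $\mathcal{L}_K(\mathcal{D}, f)$ is an ordered item list $(I_1, I_2, \ldots, I_K)$ which contains the top $K$ items when applying the ranking function $f$ to the item set $\mathcal{D}$, where $I_k$ ($1 \le k \le K$) is the item in position $k$ and for any $k' > k$, it is the case that $f(I_k) \ge f(I_{k'})$.
Definition 3.2 (Item Page).
For each step $t$ ($t \ge 1$) during a session, the item page $p_t$ is the top $K$ list $\mathcal{L}_K(\mathcal{D}_{t-1}, a_{t-1})$ resulting from applying the ranking action $a_{t-1}$ of the search engine to the set of unranked items $\mathcal{D}_{t-1}$ in the last decision step $(t-1)$. For the initial step $t = 0$, $\mathcal{D}_0 = \mathcal{D}$. For any decision step $t \ge 1$, $\mathcal{D}_t = \mathcal{D}_{t-1} \setminus p_t$.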
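The two definitions above can be sketched in a few lines: a top list is the best-scoring prefix of the sorted item set, and an item page is the top list of the items not yet shown. The helper names `top_k_list` and `next_page` are hypothetical, and items are assumed to be hashable identifiers.

```python
def top_k_list(items, rank_fn, k):
    """Return the k highest-scoring items, best first.

    rank_fn maps an item to its ranking score (the ranking function).
    """
    return sorted(items, key=rank_fn, reverse=True)[:k]


def next_page(unranked, rank_fn, k):
    """Build one item page and return (page, remaining unranked items)."""
    page = top_k_list(unranked, rank_fn, k)
    shown = set(page)
    remaining = [item for item in unranked if item not in shown]
    return page, remaining


if __name__ == "__main__":
    items = [1, 2, 3, 4, 5]
    # First decision step: rank by the item id itself.
    page1, rest = next_page(items, lambda item: item, k=2)
    # Second decision step: a different ranking action on the leftovers.
    page2, rest = next_page(rest, lambda item: -item, k=2)
    print(page1, page2, rest)
```

Because each page is removed from the unranked set, a session over a finite item set can only produce a bounded number of pages.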
Definition 3.3 (Item Page History).
In a search session, let $q$ be the input query. For the initial decision step $t = 0$, the initial item page history is $h_0 = q$. For each later decision step $t \ge 1$, the item page history up to $t$ is $h_t = (h_{t-1}, p_t)$, where $h_{t-1}$ is the item page history up to the step $(t-1)$ and $p_t$ is the item page of step $t$.
The item page history $h_t$ contains all information the user observes at the decision step $t$. Since the item set $\mathcal{D}$ is finite and each page displays $K$ items, there are at most $\lceil |\mathcal{D}| / K \rceil$ item pages, and correspondingly at most $\lceil |\mathcal{D}| / K \rceil$ decision steps in a search session. In TaoBao and TMall, users may choose to purchase items or just leave at different steps of a session. If we treat all possible users as an environment which samples user behaviors, this would mean that after observing any item page history, the environment may terminate a search session with a certain probability of transaction conversion or abandonment. We formally define these two types of probability as follows.
Definition 3.4 (Conversion Probability).
For any item page history $h_t$ ($t > 0$) in a search session, let $B(h_t)$ denote the conversion event that a user purchases an item after observing $h_t$. The conversion probability of $h_t$, which is denoted by $b(h_t)$, is the averaged probability that $B(h_t)$ occurs when $h_t$ takes place.
Definition 3.5 (Abandon Probability).
For any item page history $h_t$ ($t > 0$) in a search session, let $L(h_t)$ denote the abandon event that a user leaves the search session after observing $h_t$. The abandon probability of $h_t$, which is denoted by $l(h_t)$, is the averaged probability that $L(h_t)$ occurs when $h_t$ takes place.
Since $h_t$ is the direct result of the agent’s action $a_{t-1}$ in the last item page history $h_{t-1}$, the conversion probability $b(h_t)$ and the abandon probability $l(h_t)$ define how the state of the environment (i.e., the user population) will change after $a_{t-1}$ is taken in $h_{t-1}$: (1) terminating the search session by purchasing an item in $h_t$ with probability $b(h_t)$; (2) leaving the search session from $h_t$ with probability $l(h_t)$; (3) continuing the search session from $h_t$ with probability $1 - b(h_t) - l(h_t)$. For convenience of later discussion, we also define the continuing probability of an item page history.
Definition 3.6 (Continuing Probability).
For any item page history $h_t$ ($t \ge 0$) in a search session, let $C(h_t)$ denote the continuation event that a user continues searching after observing $h_t$. The continuing probability of $h_t$, which is denoted by $c(h_t)$, is the averaged probability that $C(h_t)$ occurs when $h_t$ takes place.
Obviously, for any item page history $h$, it holds that $c(h) + b(h) + l(h) = 1$. In particular, the continuation event of the initial item page history $h_0$, which only contains the query $q$, is a sure event (i.e., $c(h_0) = 1$), as neither a conversion event nor an abandon event can occur before the first item page is displayed.
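The relation between the three probabilities can be made concrete with a short sketch: the continuing probability is whatever mass is left after conversion and abandonment, and one user reaction can be sampled from the three outcomes. The function names and the string labels for events are assumptions introduced for this example.

```python
import random


def continuing_probability(b, l):
    """Return 1 - b - l, checking that b and l are valid probabilities."""
    c = 1.0 - b - l
    if b < 0.0 or l < 0.0 or c < -1e-12:
        raise ValueError("b and l must be nonnegative and sum to at most 1")
    return max(c, 0.0)


def sample_event(b, l, rng=random):
    """Sample 'conversion', 'abandon', or 'continue' for one page history."""
    continuing_probability(b, l)  # validates the inputs
    u = rng.random()              # uniform draw in [0, 1)
    if u < b:
        return "conversion"
    if u < b + l:
        return "abandon"
    return "continue"


if __name__ == "__main__":
    rng = random.Random(0)
    print(continuing_probability(0.1, 0.3))
    print([sample_event(0.1, 0.3, rng) for _ in range(5)])
```

For the initial history that contains only the query, both the conversion and abandon probabilities are zero, so the sampled event is always a continuation.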
3.2. Search Session MDP
Now we are ready to define the instantiated Markov decision process (MDP) for the multistep ranking problem in a search session, which we call a search session MDP (SSMDP).
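Definition 3.7 (Search Session MDP).
Let $q$ be a query, $\mathcal{D}$ be the set of items related to $q$, and $K$ ($K > 0$) be the number of items that can be displayed in a page. The search session MDP (SSMDP) with respect to $q$, $\mathcal{D}$ and $K$ is a tuple $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$, where

$T = \lceil |\mathcal{D}| / K \rceil$ is the maximal decision step of a search session,

$\mathcal{H} = \bigcup_{t=0}^{T} \mathcal{H}_t$ is the set of all possible item page histories, where $\mathcal{H}_t$ is the set of all item page histories up to step $t$ ($0 \le t \le T$),

$\mathcal{S} = \mathcal{H}_C \cup \mathcal{H}_B \cup \mathcal{H}_L$ is the state space, where $\mathcal{H}_C$ is the nonterminal state set that contains all continuation events $C(h_t)$, and $\mathcal{H}_B$ and $\mathcal{H}_L$ are two terminal state sets which contain all conversion events $B(h_t)$ and all abandon events $L(h_t)$, respectively,

$\mathcal{A}$ is the action space which contains all possible ranking functions of the search engine,

$\mathcal{R}: \mathcal{H}_C \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function,

$\mathcal{P}: \mathcal{H}_C \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition function. For any step $t$ ($0 \le t < T$), any item page history $h_t \in \mathcal{H}_t$, and any action $a \in \mathcal{A}$, let $h_{t+1} = (h_t, \mathcal{L}_K(\mathcal{D}_t, a))$. The transition probability from the nonterminal state $C(h_t)$ to any state $s' \in \mathcal{S}$ after taking action $a$ is
$$\mathcal{P}(C(h_t), a, s') = \begin{cases} b(h_{t+1}) & \text{if } s' = B(h_{t+1}), \\ l(h_{t+1}) & \text{if } s' = L(h_{t+1}), \\ c(h_{t+1}) & \text{if } s' = C(h_{t+1}), \\ 0 & \text{otherwise}, \end{cases} \tag{1}$$
where $b$, $l$, and $c$ are the conversion, abandon, and continuing probabilities of Definitions 3.4, 3.5, and 3.6.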
In an SSMDP, the agent is the search engine and the environment is the population of all possible users. The states of the environment indicate the user status in the corresponding item page histories (i.e., continuation, abandonment, or transaction conversion). The action space can be set differently (e.g., discrete or continuous) according to specific ranking tasks. The state transition function is directly based on the conversion probability and abandon probability. The reward function highly depends on the goal of a specific task; we will discuss our reward setting in Section 4.2.
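To make the agent/environment split concrete, here is a minimal rollout sketch of one search session under the transition structure just described. Everything named here (`run_session`, `policy`, `b_fn`, `l_fn`, the event labels) is a hypothetical stand-in: in particular, `b_fn` and `l_fn` play the role of conversion and abandon probability models that would have to be estimated from data.

```python
import random


def run_session(items, k, policy, b_fn, l_fn, rng):
    """Roll out one search session and return a list of (page, event) pairs.

    policy(history, unranked) -> scoring function over items (ranking action)
    b_fn(history), l_fn(history) -> conversion / abandon probability
    All callables are hypothetical stand-ins, not the paper's models.
    """
    history, unranked, trace = [], list(items), []
    while unranked:
        score = policy(history, unranked)
        page = sorted(unranked, key=score, reverse=True)[:k]
        unranked = [item for item in unranked if item not in page]
        history = history + [page]  # next history = old history plus new page
        b, l = b_fn(history), l_fn(history)
        u = rng.random()
        if u < b:
            event = "conversion"    # terminal: the user buys
        elif u < b + l:
            event = "abandon"       # terminal: the user leaves
        else:
            event = "continue"      # nonterminal: a new page is requested
        trace.append((page, event))
        if event != "continue":
            break
    return trace


if __name__ == "__main__":
    rng = random.Random(0)
    trace = run_session(
        items=range(10), k=3,
        policy=lambda history, unranked: (lambda item: item),
        b_fn=lambda history: 0.2, l_fn=lambda history: 0.3, rng=rng,
    )
    for page, event in trace:
        print(page, event)
```

The loop also ends when no unranked items remain, which mirrors the bound on the number of decision steps implied by a finite item set.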
4. Analysis of SSMDP
Before we apply the search session MDP (SSMDP) model in practice, some details need to be further clarified. In this section, we first identify the Markov property of the states in an SSMDP to show that SSMDP is well defined. Then we provide a reward function setting for SSMDP, based on which we perform an analysis on the reward discount rate and show the necessity for a search engine agent to maximize long-term accumulative rewards.
4.1. Markov Property
The Markov property means that a state is able to summarize past sensations compactly in such a way that all relevant information is retained (Sutton and Barto, 1998). Formally, the Markov property means that for any state-action sequence $s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t$ experienced in an MDP, it holds that
$$\Pr(s_t \mid s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}) = \Pr(s_t \mid s_{t-1}, a_{t-1}). \tag{2}$$
That is to say, the occurrence of the current state $s_t$ is only conditional on the last state-action pair $(s_{t-1}, a_{t-1})$ rather than the whole sequence. Now we show that the states of a search session MDP (SSMDP) also have the Markov property.
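Proposition 4.1.
For the search session MDP $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$ defined in Definition 3.7, any state $s \in \mathcal{S}$ is Markovian.
Proof.
We only need to prove that for any step $t$ ($1 \le t \le T$) and any possible state-action sequence $s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t$ with respect to $t$, it holds that
$$\Pr(s_t \mid s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}) = \Pr(s_t \mid s_{t-1}, a_{t-1}).$$
Note that all states except $s_t$ in the sequence must be nonterminal states. According to the state definition, for any step $t'$ ($0 \le t' < t$), there must be an item page history $h_{t'}$ corresponding to the state $s_{t'}$ such that $s_{t'} = C(h_{t'})$. So the state-action sequence can be rewritten as $C(h_0), a_0, C(h_1), a_1, \ldots, C(h_{t-1}), a_{t-1}, s_t$. Note that for any step $t'$ ($0 < t' < t$), it holds that
$$h_{t'} = \big(h_{t'-1}, \mathcal{L}_K(\mathcal{D}_{t'-1}, a_{t'-1})\big),$$
where $\mathcal{L}_K(\mathcal{D}_{t'-1}, a_{t'-1})$ is the top $K$ list (i.e., item page) with respect to the unranked item set $\mathcal{D}_{t'-1}$ and ranking action $a_{t'-1}$ in step $(t'-1)$. Given $h_{t'-1}$, the unranked item set $\mathcal{D}_{t'-1}$ is deterministic. Thus, $h_{t'}$ is the necessary and unique result of the state-action pair $(C(h_{t'-1}), a_{t'-1})$. Therefore, the event $(C(h_{t'-1}), a_{t'-1})$ can be equivalently represented by the event $h_{t'}$, and the following derivation can be conducted:
$$\begin{aligned} \Pr(s_t \mid s_0, a_0, \ldots, s_{t-1}, a_{t-1}) &= \Pr(s_t \mid C(h_0), a_0, C(h_1), a_1, \ldots, C(h_{t-1}), a_{t-1}) \\ &= \Pr(s_t \mid h_1, h_2, \ldots, h_{t-1}, C(h_{t-1}), a_{t-1}) \\ &= \Pr(s_t \mid h_{t-1}, C(h_{t-1}), a_{t-1}) \\ &= \Pr(s_t \mid C(h_{t-1}), a_{t-1}) = \Pr(s_t \mid s_{t-1}, a_{t-1}). \end{aligned}$$
The third step of the derivation holds because for any step $t'$ ($0 < t' < t$), $h_{t'-1}$ is contained in $h_{t'}$. Similarly, the fourth step holds because $C(h_{t-1})$ contains the occurrence of $h_{t-1}$. ∎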
4.2. Reward Function
In a search session MDP $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$, the reward function $\mathcal{R}$ is a quantitative evaluation of the action performance in each state. Specifically, for any nonterminal state $s \in \mathcal{H}_C$, any action $a \in \mathcal{A}$, and any other state $s' \in \mathcal{S}$, $\mathcal{R}(s, a, s')$ is the expected value of the immediate rewards that numerically characterize the user feedback when action $a$ is taken in $s$ and the state is changed to $s'$. Therefore, we need to translate user feedback to numeric reward values that a learning algorithm can understand.
In the online LTR domain, user clicks are commonly adopted as a reward metric (Katariya et al., 2017; Lagrée et al., 2016; Zoghi et al., 2017) to guide learning algorithms. However, in ecommerce scenarios, successful transactions between users (who search items) and sellers (whose items are ranked by the search engine) are more important than user clicks. Thus, our reward setting is designed to encourage more successful transactions. For any decision step $t$ ($0 \le t < T$), any item page history $h_t \in \mathcal{H}_t$, and any action $a \in \mathcal{A}$, let $h_{t+1} = (h_t, \mathcal{L}_K(\mathcal{D}_t, a))$. Recall that after observing the item page history $h_{t+1}$, a user will purchase an item with a conversion probability $b(h_{t+1})$. Although different users may choose different items to buy, from a statistical view, the deal prices of the transactions occurring in $h_{t+1}$ must follow an underlying distribution. We use $m(h_{t+1})$ to denote the expected deal price of $h_{t+1}$. Then for the nonterminal state $C(h_t)$ and any state $s' \in \mathcal{S}$, the reward $\mathcal{R}(C(h_t), a, s')$ is set as follows:
$$\mathcal{R}(C(h_t), a, s') = \begin{cases} m(h_{t+1}) & \text{if } s' = B(h_{t+1}), \\ 0 & \text{otherwise}, \end{cases} \tag{3}$$
where $B(h_{t+1})$ is the terminal state which represents the conversion event of $h_{t+1}$. The agent will receive a positive reward from the environment only when its ranking action leads to a successful transaction. In all other cases, the reward is zero. It should be noted that the expected deal price of any item page history is most probably unknown beforehand. In practice, the actual deal price of a transaction can be directly used as the reward signal.
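The reward setting of Equation (3) is simple enough to write down directly. The sketch below assumes that the outcome of a step is encoded as one of the strings "conversion", "abandon", or "continue", and that the observed deal price stands in for the expected deal price, as the text suggests for practice. Both function names are hypothetical.

```python
def reward(next_event, deal_price=0.0):
    """Immediate reward: the deal price on a conversion, zero otherwise."""
    return float(deal_price) if next_event == "conversion" else 0.0


def expected_reward(conversion_prob, expected_deal_price):
    """Expected immediate reward of reaching a page history.

    Only the conversion outcome pays, so the expectation is the conversion
    probability times the expected deal price.
    """
    return conversion_prob * expected_deal_price


if __name__ == "__main__":
    print(reward("conversion", 59.9))
    print(reward("abandon"))
    print(expected_reward(0.05, 80.0))
```

Since most steps end in a continuation or an abandonment, most sampled rewards are zero, which is the imbalance the algorithm section later has to deal with.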
4.3. Discount Rate
The discount rate is an important parameter of an MDP which defines the importance of future rewards in the objective of the agent (defined in Section 2.1). For the search session MDP (SSMDP) defined in this paper, the choice of the discount rate raises a fundamental question: “Is it necessary for the search engine agent to consider future rewards when making decisions?” We will find out the answer and determine an appropriate value of the discount rate by analyzing how the objective of maximizing long-term accumulative rewards is related to the goal of improving the search engine’s economic performance.
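Let $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$ be a search session MDP with respect to a query $q$, an item set $\mathcal{D}$ and an integer $K$ ($K > 0$). Given a fixed deterministic policy $\pi$ of the agent, denote the item page history occurring at step $t$ ($0 \le t \le T$) under $\pi$ by $h_t^{\pi}$. (Footnote 2: More accurately, the policy $\pi$ is a mapping from the nonterminal state set $\mathcal{H}_C$ to the action space $\mathcal{A}$. Our conclusion in this paper also holds for stochastic policies, but we omit the discussion due to space limitations.) We enumerate all possible states that can be visited in a search session under $\pi$ in Figure 2. For better illustration, we show all item page histories (marked in red) in the figure. Note that they are not the states of the SSMDP $\mathcal{M}$. Next, we will rewrite $C(h_t^{\pi})$, $c(h_t^{\pi})$, $b(h_t^{\pi})$, and $m(h_t^{\pi})$ as $C_t^{\pi}$, $c_t^{\pi}$, $b_t^{\pi}$, and $m_t^{\pi}$ for simplicity.
Without loss of generality, we assume the discount rate of the SSMDP $\mathcal{M}$ is $\gamma$ ($0 \le \gamma \le 1$). Denote the state value function (i.e., expected accumulative rewards) under $\gamma$ by $V_{\gamma}^{\pi}$. For each step $t$ ($0 \le t < T$), the state value of the nonterminal state $C_t^{\pi}$ is
$$V_{\gamma}^{\pi}(C_t^{\pi}) = \mathbb{E}^{\pi}\Big\{ \sum_{k=1}^{T-t} \gamma^{k-1} r_{t+k} \,\Big|\, C_t^{\pi} \Big\}, \tag{4}$$
where for any $k$ ($1 \le k \le T - t$), $r_{t+k}$ is the immediate reward received at the future step $(t+k)$ in the item page history $h_{t+k}^{\pi}$. According to the reward function in Equation (3), the expected value of the immediate reward $r_{t+k}$ under $\pi$, given that $h_{t+k}^{\pi}$ is reached, is
$$\mathbb{E}^{\pi}\{ r_{t+k} \} = b_{t+k}^{\pi} \, m_{t+k}^{\pi}, \tag{5}$$
where $m_{t+k}^{\pi}$ is the expected deal price of the item page history $h_{t+k}^{\pi}$. However, since $V_{\gamma}^{\pi}(C_t^{\pi})$ is the expected discounted accumulative reward conditioned on the state $C_t^{\pi}$, the probability that the item page history $h_{t+k}^{\pi}$ is reached when $C_t^{\pi}$ is visited should be taken into account. Denote the reaching probability from $C_t^{\pi}$ to $h_{t+k}^{\pi}$ by $\Pr(C_t^{\pi} \to h_{t+k}^{\pi})$. It can be computed as follows according to the state transition function in Equation (1):
$$\Pr(C_t^{\pi} \to h_{t+k}^{\pi}) = \begin{cases} 1 & \text{if } k = 1, \\ \prod_{j=1}^{k-1} c_{t+j}^{\pi} & \text{if } 1 < k \le T - t. \end{cases} \tag{6}$$
The reaching probability from $C_t^{\pi}$ to $h_{t+1}^{\pi}$ is $1$ since $h_{t+1}^{\pi}$ is the direct result of the state-action pair $(C_t^{\pi}, \pi(C_t^{\pi}))$. For other future item page histories, the reaching probability is the product of all continuing probabilities along the path from $h_{t+1}^{\pi}$ to $h_{t+k-1}^{\pi}$. By substituting Equations (5) and (6) into Equation (4), $V_{\gamma}^{\pi}(C_t^{\pi})$ can be further computed as follows:
$$V_{\gamma}^{\pi}(C_t^{\pi}) = \sum_{k=1}^{T-t} \gamma^{k-1} \Pr(C_t^{\pi} \to h_{t+k}^{\pi}) \, b_{t+k}^{\pi} m_{t+k}^{\pi} = b_{t+1}^{\pi} m_{t+1}^{\pi} + \sum_{k=2}^{T-t} \gamma^{k-1} \Big( \prod_{j=1}^{k-1} c_{t+j}^{\pi} \Big) b_{t+k}^{\pi} m_{t+k}^{\pi}. \tag{7}$$
With the conversion probability and the expected deal price of each item page history in Figure 2, we can also derive the expected gross merchandise volume (GMV) led by the search engine agent in a search session under the policy $\pi$ as follows:
$$\mathbb{E}_{gmv}^{\pi} = b_1^{\pi} m_1^{\pi} + \sum_{k=2}^{T} \Big( \prod_{j=1}^{k-1} c_j^{\pi} \Big) b_k^{\pi} m_k^{\pi}. \tag{8}$$
By comparing Equations (7) and (8), it can be easily found that $\mathbb{E}_{gmv}^{\pi} = V_{\gamma}^{\pi}(C_0^{\pi})$ when the discount rate $\gamma = 1$. That is to say, when $\gamma = 1$, maximizing the expected accumulative rewards directly leads to the maximization of the expected GMV. However, when $\gamma < 1$, maximizing the value function $V_{\gamma}^{\pi}$ cannot necessarily maximize $\mathbb{E}_{gmv}^{\pi}$ since the latter is an upper bound of $V_{\gamma}^{\pi}(C_0^{\pi})$.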
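Proposition 4.2.
Let $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$ be a search session MDP. For any deterministic policy $\pi: \mathcal{H}_C \to \mathcal{A}$ and any discount rate $\gamma$ ($0 \le \gamma \le 1$), it is the case that $V_{\gamma}^{\pi}(C(h_0)) \le \mathbb{E}_{gmv}^{\pi}$, where $V_{\gamma}^{\pi}$ is the state value function defined in Equation (4), $C(h_0)$ is the initial nonterminal state of a search session, and $\mathbb{E}_{gmv}^{\pi}$ is the expected gross merchandise volume (GMV) of $\pi$ defined in Equation (8). Only when $\gamma = 1$, we have $V_{\gamma}^{\pi}(C(h_0)) = \mathbb{E}_{gmv}^{\pi}$.
Proof.
The proof is trivial since the difference between $\mathbb{E}_{gmv}^{\pi}$ and $V_{\gamma}^{\pi}(C(h_0))$, namely $\sum_{k=2}^{T} (1 - \gamma^{k-1}) \big( \prod_{j=1}^{k-1} c_j^{\pi} \big) b_k^{\pi} m_k^{\pi}$, is always positive when $\gamma < 1$ (assuming that at least one step $k \ge 2$ has a nonzero reaching probability and a nonzero expected reward). Here $c_j^{\pi}$, $b_k^{\pi}$, and $m_k^{\pi}$ denote the continuing probability, conversion probability, and expected deal price of the item page histories visited under $\pi$. ∎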
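The relation between the discounted value of the initial state and the expected GMV can be checked numerically. The sketch below assumes per-step lists of conversion probabilities, continuing probabilities, and expected deal prices for the pages visited under a fixed policy; the function names and the sample numbers are made up for illustration.

```python
def state_value(b, c, m, gamma):
    """Discounted value of the initial state, following Equation (7).

    b[k], c[k], m[k] are the conversion probability, continuing probability
    and expected deal price of the (k+1)-th item page history.
    """
    value, reach, discount = 0.0, 1.0, 1.0
    for b_k, c_k, m_k in zip(b, c, m):
        value += discount * reach * b_k * m_k
        reach *= c_k       # probability of continuing past this page
        discount *= gamma  # one more step of discounting
    return value


def expected_gmv(b, c, m):
    """Expected GMV of a session, following Equation (8) (no discounting)."""
    total, reach = 0.0, 1.0
    for b_k, c_k, m_k in zip(b, c, m):
        total += reach * b_k * m_k
        reach *= c_k
    return total


if __name__ == "__main__":
    b = [0.1, 0.2, 0.3]      # hypothetical conversion probabilities
    c = [0.5, 0.4, 0.0]      # hypothetical continuing probabilities
    m = [10.0, 20.0, 30.0]   # hypothetical expected deal prices
    print("GMV:", expected_gmv(b, c, m))
    for gamma in (0.0, 0.5, 1.0):
        print(gamma, state_value(b, c, m, gamma))
```

In this toy setting the value grows with the discount rate and meets the expected GMV only at a discount rate of one, which is the behavior the proposition describes.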
Now we can give the answer to the question posed at the beginning of this section: considering future rewards in a search session MDP is necessary since maximizing the undiscounted expected accumulative rewards optimizes the performance of the search engine in terms of GMV. The sequential nature of our multistep ranking problem requires the ranking decisions at different steps to be optimized jointly rather than independently.
5. Algorithm
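In this section, we propose a policy gradient algorithm for learning an optimal ranking policy in a search session MDP (SSMDP). We resort to the policy gradient method since directly optimizing a parameterized policy function addresses both the policy representation issue and the large-scale action space issue of an SSMDP. Now we briefly review the policy gradient method in the context of SSMDP. Let $\mathcal{M} = \langle T, \mathcal{H}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$ be an SSMDP and $\pi_{\theta}$ be the policy function with the parameter $\theta$. The objective of the agent is to find an optimal parameter which maximizes the expectation of the $T$-step returns along all possible trajectories
$$J(\theta) = \mathbb{E}_{\tau \sim \rho_{\theta}}\{ R(\tau) \} = \mathbb{E}_{\tau \sim \rho_{\theta}}\Big\{ \sum_{t=0}^{T-1} r_t(s_t, a_t) \Big\}, \tag{9}$$
where $\tau$ is a trajectory like $s_0, a_0, r_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T$ and follows the trajectory distribution $\rho_{\theta}$ under the policy parameter $\theta$, and $R(\tau) = \sum_{t=0}^{T-1} r_t(s_t, a_t)$ is the $T$-step return of the trajectory $\tau$. Note that if the terminal state of a trajectory is reached in less than $T$ steps, the sum of the rewards will be truncated in that state. The gradient of the target $J(\theta)$ with respect to the policy parameter $\theta$ is
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \rho_{\theta}}\Big\{ \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(s_t, a_t) \, R_t^T(\tau) \Big\}, \tag{10}$$
where $R_t^T(\tau) = \sum_{t'=t}^{T-1} r_{t'}(s_{t'}, a_{t'})$ is the sum of rewards from step $t$ to the terminal step $T$ in the trajectory $\tau$. This gradient leads to the well-known REINFORCE algorithm (Williams, 1992). The policy gradient theorem proposed by Sutton et al. (Sutton et al., 2000) provides a framework which generalizes the REINFORCE algorithm. In general, the gradient of $J(\theta)$ can be written as
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \pi, a \sim \pi_{\theta}}\big\{ \nabla_{\theta} \log \pi_{\theta}(s, a) \, Q^{\pi}(s, a) \big\},$$
where $Q^{\pi}$ is the state-action value function under the policy $\pi_{\theta}$. If $\pi_{\theta}$ is deterministic, the gradient of $J(\theta)$ can be rewritten as
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \pi}\big\{ \nabla_{\theta} \pi_{\theta}(s) \, \nabla_{a} Q^{\pi}(s, a) \big|_{a = \pi_{\theta}(s)} \big\}.$$
Silver et al. (Silver et al., 2014) show that the deterministic policy gradient is the limiting case of the stochastic policy gradient as the policy variance tends to zero. The value function $Q^{\pi}$ can be estimated by temporal-difference learning (e.g., actor-critic methods (Sutton and Barto, 1998)) aided by a function approximator $Q^{w}$ with the parameter $w$ which minimizes the mean squared error $MSE(w) = \| Q^{w} - Q^{\pi} \|^2$.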
5.1. The DPGFBE Algorithm
Instead of using stochastic policy gradient algorithms, we rely on the deterministic policy gradient (DPG) algorithm (Silver et al., 2014) to learn an optimal ranking policy in an SSMDP since, from a practical viewpoint, computing the stochastic policy gradient may require more samples, especially if the action space has many dimensions. However, we have to overcome the difficulty in estimating the value function $Q^{\pi}$, which is caused by the high variance and unbalanced distribution of the immediate rewards in each state. As indicated by Equation (3), the immediate reward of any state-action pair $(s, a)$ is zero or the expected deal price $m(h)$ of the item page history $h$ resulting from $(s, a)$. Firstly, the reward variance is high because the deal price normally varies over a wide range. Secondly, the immediate reward distribution of $(s, a)$ is unbalanced because the conversion events led by $(s, a)$ occur much less frequently than the two other cases (i.e., abandon and continuing events) which produce zero rewards. Note that the same problem also exists for the $T$-step returns of the trajectories in an SSMDP since in any possible trajectory, only the reward of the last step may be nonzero. Therefore, estimating $Q^{\pi}$ by Monte Carlo evaluation or temporal-difference learning may cause inaccurate updates of the value function parameter and further influence the optimization of the policy parameter.
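Our way of solving the above problem is similar to the model-based reinforcement learning approaches (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002), which maintain an approximate model of the environment to help with performing reliable updates of value functions. According to the Bellman Equation (Sutton and Barto, 1998), the state-action value of any state-action pair $(s, a)$ under any policy $\pi$ is
$$Q^{\pi}(s, a) = \mathcal{T}^{\pi} Q^{\pi}(s, a) = \mathbb{E}_{s' \sim \mathcal{P}(s, a, \cdot)}\big\{ \mathcal{R}(s, a, s') + Q^{\pi}(s', \pi(s')) \big\},$$
where $\mathcal{T}^{\pi}$ denotes the Bellman operator. Let $h'$ be the next item page history resulting from $(s, a)$. Only the states $C(h')$, $B(h')$, and $L(h')$ can be transferred to from $s$ with nonzero probability. Among these three states, only $B(h')$ involves a nonzero immediate reward and only $C(h')$ involves a nonzero Q-value. So the above equation can be simplified to
$$Q^{\pi}(s, a) = b(h') \, m(h') + c(h') \, Q^{\pi}\big(C(h'), \pi(C(h'))\big), \tag{11}$$
where $b(h')$, $c(h')$, and $m(h')$ are the conversion probability, continuing probability and expected deal price of $h'$, respectively. Therefore, when the value function $Q^{\pi}$ is approximated by a parameterized function $Q^{w}$, we can use $\mathcal{T}^{\pi} Q^{w}$ as an estimation of $Q^{\pi}$ to approximately compute the mean squared error (MSE) of $Q^{w}$, then optimize $w$. Specifically, we have
$$MSE(w) \approx \sum_{(s, a)} \Big( Q^{w}(s, a) - b(h') \, m(h') - c(h') \, Q^{w}\big(C(h'), \pi(C(h'))\big) \Big)^2,$$
where $h'$ represents the item page history resulting from each state-action pair $(s, a)$. For minimizing $MSE(w)$, every time a state-action pair $(s, a)$ as well as its next item page history $h'$ is observed, the parameter $w$ can be updated in a full backup manner:
$$w \leftarrow w + \alpha_w \nabla_w Q^{w}(s, a) \Big( b(h') \, m(h') + c(h') \, Q^{w}\big(s', \pi_{\theta}(s')\big) - Q^{w}(s, a) \Big),$$
where $\alpha_w$ is a learning rate and $s' = C(h')$ is the state of the continuing event of $h'$. With this full backup updating method, the sampling errors caused by immediate rewards or returns can be avoided. Furthermore, the computational cost of full backups in our problem is almost equal to that of one-step sample backups.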
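A minimal sketch of the full backup idea follows, under the simplifying assumption of a linear critic whose Q-value is a dot product between a weight vector and a feature vector. The target is built from model estimates of the conversion probability, continuing probability, and expected deal price of the next item page history, rather than from a sampled reward. All names (`q_value`, `full_backup_update`, the feature vectors) are hypothetical, and the paper's own implementation uses neural networks instead.

```python
def q_value(w, phi):
    """Linear critic: dot product of weights and features."""
    return sum(w_i * x_i for w_i, x_i in zip(w, phi))


def full_backup_update(w, phi_sa, phi_next, b, c, m, alpha):
    """One full-backup step for a linear critic.

    phi_sa:   features of the observed state-action pair
    phi_next: features of the continuing state paired with the policy's action
    b, c, m:  model estimates (conversion prob., continuing prob., expected
              deal price) for the next item page history
    alpha:    learning rate
    """
    # Expected reward plus the expected continuation value, no sampling.
    target = b * m + c * q_value(w, phi_next)
    error = target - q_value(w, phi_sa)
    # For a linear critic the gradient with respect to w is just phi_sa.
    return [w_i + alpha * error * x_i for w_i, x_i in zip(w, phi_sa)]


if __name__ == "__main__":
    w = [0.0, 0.0]
    for _ in range(3):
        w = full_backup_update(w, [1.0, 0.0], [0.0, 1.0],
                               b=0.2, c=0.5, m=10.0, alpha=0.5)
        print(w)
```

A one-step sample backup would replace the product of conversion probability and expected price with the realized reward, which is zero most of the time and occasionally large; the full backup avoids that source of noise at essentially the same cost per update.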
Our policy gradient algorithm is based on the deterministic policy gradient theorem (Silver et al., 2014) and the full backup estimation of the Q-value functions. Unlike previous works which entirely model the reward and state transition functions (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002), we only need to build the conversion probability model $b(h)$, the continuing probability model $c(h)$, and the expected deal price model $m(h)$ of the item page histories in an SSMDP. These models can be trained using online or offline data by any suitable statistical learning method. We call our algorithm Deterministic Policy Gradient with Full Backup Estimation (DPGFBE) and show its details in Algorithm 1.
As shown in Algorithm 1, the parameters $\theta$ and $w$ will be updated after any search session between the search engine agent and users. Exploration (at line ) can be done by, but is not limited to, $\epsilon$-greedy (in the discrete action case) or adding random noise to the output of $\pi_{\theta}$ (in the continuous action case). Although we make no assumptions on the specific models used for learning the actor and the critic in Algorithm 1, nonlinear models such as neural networks are preferred due to the large state/action space of an SSMDP. To address the convergence problem and ensure a stable learning process, a replay buffer and target updates are also suggested (Mnih et al., 2015; Lillicrap et al., 2015).
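Two of the practical ingredients mentioned above, noise-based exploration for continuous actions and soft target updates, can be sketched with plain lists of parameters. The Gaussian noise model, the function names, and the flat-list parameter representation are assumptions for this example; the soft update follows the usual form in which target parameters move a small fraction toward the online parameters.

```python
import random


def explore(action, sigma, rng):
    """Add zero-mean Gaussian noise to each component of a ranking action."""
    return [a + rng.gauss(0.0, sigma) for a in action]


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online value."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.3, -0.1, 0.8]           # hypothetical actor output
    print(explore(weights, sigma=0.05, rng=rng))
    target = [0.0, 0.0, 0.0]
    for _ in range(3):
        target = soft_update(target, weights, tau=0.1)
    print(target)
```

A small update fraction keeps the target parameters changing slowly, which is the usual motivation for using target networks alongside a replay buffer.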
6. Experiments
In this section, we conduct two groups of experiments: a simulated experiment in which we construct an online shopping simulator and test our algorithm DPGFBE as well as some state-of-the-art online learning to rank (LTR) algorithms, and a real application in which we apply our algorithm in TaoBao, one of the largest ecommerce platforms in the world.
6.1. Simulation
The online shopping simulator is constructed based on the statistical information of items and user behaviors in TaoBao. An item is represented by an $n$-dim ($n > 0$) feature vector and a ranking action of the search engine is an $n$-dim weight vector. The ranking score of the item under the ranking action is the inner product of the two vectors. We choose important features related to the item category of dress (e.g., price and quality) and generate an item set by sampling items from a distribution approximated with all the items of the dress category. Each page contains items so that there are at most ranking rounds in a search session. In each ranking round, the user’s operations on the current item page (such as clicks, abandonment, and purchase) are simulated by a user behavior model, which is constructed from the user behavior data of the dress items in TaoBao. The simulator outputs the probability of each possible user operation given the recent item pages examined by the user. A search session will end when the user purchases one item or leaves.
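The scoring rule used by the simulator, an inner product between an item feature vector and the action's weight vector, can be sketched as follows. The item identifiers, feature values, and function names are invented for illustration and do not reflect the actual simulator data.

```python
def ranking_score(features, weights):
    """Inner product of an item feature vector and a ranking weight vector."""
    return sum(x * w for x, w in zip(features, weights))


def rank_page(item_features, weights, k):
    """Return the ids of the k best-scoring items under the given action.

    item_features maps an item id to its feature list.
    """
    ranked = sorted(
        item_features,
        key=lambda item_id: ranking_score(item_features[item_id], weights),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical two-feature items, e.g. (price signal, quality signal).
    catalog = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    action = [0.5, 2.0]
    print(rank_page(catalog, action, k=2))
```

Under this parameterization the action is a continuous vector, so changing a single weight smoothly changes the ordering of the whole item set.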
Our implementation of the DPGFBE algorithm is a deep RL version (DDPGFBE) which adopts deep neural networks (DNN) as the policy and value function approximators (i.e., actor and critic). We also implement the deep DPG algorithm (DDPG) (Lillicrap et al., 2015). The state of the environment is represented by a dim feature vector extracted from the last item pages of the current search session. The actor and critic networks of the two algorithms have two fully connected hidden layers with and units, respectively. We adopt ReLU and tanh as the activation functions for the hidden layers and the output layers of all networks, respectively. The network parameters are optimized by Adam with a learning rate of for the actor and for the critic. The parameter for the soft target updates (Lillicrap et al., 2015) is set to . We test the performance of the two algorithms under different settings of the discount rate . Five online LTR algorithms, pointwise LTR, BatchRank (Zoghi et al., 2017), CascadeUCB1 (Kveton et al., 2015a), CascadeKLUCB (Kveton et al., 2015a), and RankedExp3 (Radlinski et al., 2008), are implemented for comparison. Like the two RL algorithms, the pointwise LTR method implemented in our simulation also learns a parameterized function which outputs a ranking weight vector in each state of a search session. We choose a DNN as the parameterized function and use the logistic regression algorithm to train the model, with an objective function that approximates the goal of maximizing GMV. The four other online LTR algorithms are regret minimization algorithms which are based on variants of the bandit problem model. The test of each algorithm contains search sessions and the transaction amount of each session is recorded. Results are averaged over runs and are shown in Figures 3, 4, and 5.
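The network shape and the soft target update described above can be sketched as follows. This is a dependency-free Python illustration rather than the implementation used in the experiments: the helper names, the parameter dictionary layout, and the value of tau passed by the caller are all assumptions.

```python
import math

def relu(z):
    return max(0.0, z)

def dense(x, weights, biases, activation):
    """One fully connected layer: activation(W x + b), with W given as a list of rows."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def actor_forward(state, params):
    """Two ReLU hidden layers followed by a tanh output layer (the weight vector)."""
    h1 = dense(state, params["W1"], params["b1"], relu)
    h2 = dense(h1, params["W2"], params["b2"], relu)
    return dense(h2, params["W3"], params["b3"], math.tanh)

def soft_update(target, online, tau):
    """Soft target update: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]
```

With a small tau, the target parameters track the online parameters slowly, which keeps the bootstrapped targets of the critic from shifting abruptly between updates.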
Now let us first examine the subfigure of DDPGFBE. It can be found that the performance of DDPGFBE improves as the discount rate increases. The learning curve corresponding to the setting (the green one) is far below the other curves in Fig. 3, which indicates the importance of delayed rewards. The theoretical result in Section 4 is empirically verified since the DDPGFBE algorithm achieves the best performance when , with growth of transaction amount per session compared to the second best performance. Note that in ecommerce scenarios, even growth is considerable. The DDPG algorithm also performs the best when , but it fails to learn as well as the DDPGFBE algorithm. As shown in Fig. 4, all the learning curves of DDPG are under the value . Unlike the two RL algorithms which output weight vectors, the five online LTR algorithms can directly output a ranked item list according to their own ranking mechanisms. However, as we can observe in Fig. 5, the transaction amount led by each of these algorithms is much smaller than that led by DDPGFBE and DDPG. This is not surprising since none of the online LTR algorithms are designed for the multistep ranking problem, where the ranking decisions at different steps should be optimized jointly.
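To make the role of the discount rate concrete, the following Python sketch computes the discounted return at every step of one session. The reward sequence is hypothetical (a purchase worth 100 on the third page, nothing before it) and is not taken from the experiments.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one session."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

session = [0.0, 0.0, 100.0]  # hypothetical: purchase happens on the third page
myopic = discounted_returns(session, 0.0)      # [0.0, 0.0, 100.0]
discounted = discounted_returns(session, 0.5)  # [25.0, 50.0, 100.0]
undiscounted = discounted_returns(session, 1.0)  # [100.0, 100.0, 100.0]
```

With a discount rate of zero, the first two ranking steps receive no learning signal from the eventual purchase, whereas with a rate of one every step is credited with the full transaction amount.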
6.2. Application
We apply our algorithm in the TaoBao search engine to provide an online real-time ranking service. The searching task in TaoBao is characterized by high concurrency and large data volume. In each second, the TaoBao search engine should respond to hundreds of thousands of users’ requests in concurrent search sessions and simultaneously deal with the data produced from user behaviours. On sale promotion days such as the TMall Double Global Shopping Festival (see footnote 3), both the volume and producing rate of the data would be multiple times larger than the daily values.
Footnote 3: This refers to November the 11th of each year. On that day, most sellers in TaoBao and TMall carry out sale promotions and billions of people in the world join in the online shopping festival.
To meet the requirements of high concurrency and massive data processing in TaoBao, we design a data stream-driven RL ranking system for implementing our algorithm DPGFBE. As shown in Figure 6, this system contains five major components: a query planner, a ranker, a log center, a reinforcement learning component, and an online KV system. The workflow of our system mainly consists of two loops. The first one is an online acting loop (in the right bottom of Figure 6), in which the interactions between the search engine and TaoBao users take place. The second one is a learning loop (on the left of the online acting loop in Figure 6) where the training process happens. The two working loops are connected through the log center and the online KV system, which are used for collecting user logs and storing the ranking policy model, respectively. In the first loop, every time a user requests an item page, the query planner will extract the state feature, get the parameters of the ranking policy model from the online KV system, and compute a ranking action for the current state (with exploration). The ranker will apply the computed action to the unranked items and display the top items (e.g., ) in an item page, where the user will give feedback. Meanwhile, the log data produced in the online acting loop is injected into the learning loop to construct the training data source. In the log center, the user logs collected from different search sessions are transformed to training samples like , which are output continuously in the form of a data stream and utilized by our algorithm to update the policy parameters in the learning component. Whenever the policy model is updated, it will be rewritten to the online KV system. Note that the two working loops in our system work in parallel but asynchronously, because the user log data generated in any search session cannot be utilized for training immediately.
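The decoupling of the two loops through the log center and the KV system can be sketched in a single process as follows. This is a toy Python illustration, not the production system: the class and function names, the sample format (state, action, reward), the placeholder policy, and the caller-supplied update rule are all assumptions, and the real loops run in parallel rather than being called in turn.

```python
from collections import deque

class KVStore:
    """Stand-in for the online KV system: holds the latest policy parameters."""
    def __init__(self, params):
        self._params = list(params)
        self.version = 0

    def get(self):
        return list(self._params)

    def put(self, params):
        self._params = list(params)
        self.version += 1

def acting_step(kv, log_center, state, feedback_fn):
    """Online acting loop: read the policy, compute an action, log the outcome."""
    weights = kv.get()
    action = [w * s for w, s in zip(weights, state)]  # placeholder policy
    reward = feedback_fn(action)                      # simulated user feedback
    log_center.append((state, action, reward))        # assumed sample format
    return action

def learning_step(kv, log_center, batch_size, update_fn):
    """Learning loop: drain logged samples, update the policy, write it back."""
    batch = [log_center.popleft() for _ in range(min(batch_size, len(log_center)))]
    if batch:
        kv.put(update_fn(kv.get(), batch))
    return len(batch)
```

The acting side only ever reads parameters from the store and appends to the log, and the learning side only consumes the log and writes parameters, so neither loop has to wait for the other.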
The linear ranking model used in our simulation is also adopted in this TaoBao application. The ranking action of the search engine is a dim weight vector. The state of the environment is represented by a dim feature vector, which contains the item page features, user features and query features of the current search session. We add user and query information to the state feature since the ranking service in TaoBao serves all types of users and there is no limitation on the input queries. We still adopt neural networks as the policy and value function approximators. However, to guarantee online real-time performance and quick processing of the training data, the actor and critic networks are much smaller than those used in our simulation, with only and units in each of their two fully connected hidden layers, respectively. We implement the DDPG and DDPGFBE algorithms in our system and conduct a one-week A/B test to compare the two algorithms. In each day of the test, the DDPGFBE algorithm can lead to more transaction amount than the DDPG algorithm (see footnote 4). The DDPGFBE algorithm was also used for the online ranking service on the TMall Double Global Shopping Festival of . Compared with the baseline algorithm (an LTR algorithm trained offline), our algorithm achieved more than growth in GMV at the end of that day.
Footnote 4: We cannot report the accurate transaction amount due to the information protection rule of Alibaba. Here we provide a reference index: the GMV achieved by Alibaba’s China retail marketplace platforms surpassed billion U.S. dollars in the fiscal year of 2016 (Alizila, 2017).
7. Conclusions
In this paper, we propose to use reinforcement learning (RL) for ranking control in ecommerce searching scenarios. Our contributions are as follows. Firstly, we formally define the concept of search session Markov decision process (SSMDP) to formulate the multistep ranking problem in a search session. Secondly, we analyze the property of SSMDP and theoretically prove the necessity of maximizing accumulative rewards. Lastly, we propose a novel policy gradient algorithm for learning an optimal ranking policy in an SSMDP. Experimental results in simulation and the TaoBao search engine show that our algorithm performs much better than online LTR methods in the multistep ranking problem, with more than and growth in gross merchandise volume (GMV), respectively.
References
 Alizila (2017) Alizila. 2017. Joe Tsai Looks Beyond Alibaba’s RMB 3 Trillion Milestone. http://www.alizila.com/joetsaibeyondalibabas3trillionmilestone/. (2017).
 Auer (2002) Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3, Nov (2002), 397–422.
 Brafman and Tennenholtz (2002) Ronen I. Brafman and Moshe Tennenholtz. 2002. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3 (2002), 213–231.
 Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning. 89–96.
 Cao et al. (2006) Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao-Wuen Hon. 2006. Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual International Conference on Research and Development in Information Retrieval (SIGIR’06). 186–193.
 Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (ICML’07). ACM, 129–136.
 Hofmann et al. (2013) Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2013. Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval. Information Retrieval 16, 1 (2013), 63–90.
 Joachims (2002) Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02). ACM, 133–142.
 Katariya et al. (2017) Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, and Zheng Wen. 2017. Stochastic Rank-1 Bandits. In Artificial Intelligence and Statistics. 392–401.
 Kearns and Singh (2002) Michael Kearns and Satinder Singh. 2002. Near-optimal reinforcement learning in polynomial time. Machine Learning 49, 2-3 (2002), 209–232.
 Kveton et al. (2015a) Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. 2015a. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15). 767–776.
 Kveton et al. (2015b) Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. 2015b. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems (NIPS’15). 1450–1458.
 Lagrée et al. (2016) Paul Lagrée, Claire Vernade, and Olivier Cappe. 2016. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems (NIPS’16). 1597–1605.
 Langford and Zhang (2008) John Langford and Tong Zhang. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems. 817–824.
 Li et al. (2008) Ping Li, Qiang Wu, and Christopher J Burges. 2008. McRank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems (NIPS’08). 897–904.
 Li et al. (2016) Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. 2016. Contextual combinatorial cascading bandits. In International Conference on Machine Learning (ICML’16). 1245–1253.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
 Liu et al. (2009) Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225–331.
 Maei et al. (2010) Hamid R. Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. 2010. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning. 719–726.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
 Nallapati (2004) Ramesh Nallapati. 2004. Discriminative models for information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’04). ACM, 64–71.
 Radlinski et al. (2008) Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning. ACM, 784–791.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15). 1889–1897.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML’14). 387–395.
 Slivkins et al. (2013) Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. 2013. Ranked bandits in metric spaces: learning diverse rankings over large document collections. Journal of Machine Learning Research 14, Feb (2013), 399–436.
 Sutton and Barto (1998) R.S. Sutton and A.G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS’00). 1057–1063.
 Watkins (1989) C.J.C.H. Watkins. 1989. Learning from delayed rewards. Ph.D. Dissertation. King’s College, Cambridge.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229–256.
 Yue and Joachims (2009) Yisong Yue and Thorsten Joachims. 2009. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML’09). ACM, 1201–1208.
 Zoghi et al. (2017) Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. 2017. Online Learning to Rank in Stochastic Click Models. In International Conference on Machine Learning. 4199–4208.
 Zong et al. (2016) Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. 2016. Cascading bandits for large-scale recommendation problems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (UAI’16). 835–844.