A deep reinforcement learning framework for allocating buyer impressions in e-commerce websites
We study the problem of allocating impressions to sellers in e-commerce websites, such as Amazon, eBay or Taobao, aiming to maximize the total revenue generated by the platform. When a buyer searches for a keyword, the website presents the buyer with a list of different sellers for this item, together with the corresponding prices. This can be seen as an instance of a resource allocation problem in which the sellers choose their prices at each step and the platform decides how to allocate the impressions, based on the chosen prices and the historical transactions of each seller. Due to the complexity of the system, most e-commerce platforms employ heuristic allocation algorithms that depend mainly on the sellers’ transaction records and do not take the rationality of the sellers into account, which makes them susceptible to price manipulation.
In this paper, we put forward a general framework for designing impression allocation algorithms in e-commerce websites given any behavioural model for the sellers, using deep reinforcement learning. The impression allocation problem is modeled as a Markov decision process, where the states encode the history of impressions, prices, transactions and generated revenue and the actions are the possible impression allocations at each round. To tackle the problem of continuity and high-dimensionality of states and actions, we adopt the ideas of the DDPG algorithm to design an actor-critic policy gradient algorithm which takes advantage of the problem domain in order to achieve convergence and stability. We compare our algorithm against natural heuristics, and it outperforms all of them in terms of the total revenue generated. Finally, contrary to the DDPG algorithm, our algorithm is robust to settings with variable sellers and converges easily.
A fundamental problem that all e-commerce websites are faced with is to decide how to allocate the buyer impressions to the potential sellers. When a buyer searches a keyword such as “iphone 7 rose gold”, the platform will return a ranking of different sellers providing an item that fits the keyword, with different prices and different historical sale records. The goal of the platform is to come up with algorithms that will allocate the impressions to the most appropriate sellers, eventually generating more revenue from the transactions.
This setting can be modeled as a resource allocation problem over a sequence of rounds, where in each round, buyers arrive, the algorithm takes as input the historical records of the sellers and their prices and outputs an allocation of impressions. The sellers and the buyers carry out their transactions and the historical records are updated. In reality, most e-commerce websites employ a class of heuristic algorithms, such as collaborative filtering or content-based filtering, many of which rank sellers in terms of “historical scores” calculated based on the transaction history of the sellers with buyers of similar characteristics. However, this approach does not typically take into account the fact that sellers strategize with the choice of prices, as certain sub-optimal prices at one round might affect the record histories of sellers in subsequent rounds, yielding more revenue for them in the long run. Even worse, since the sellers are usually not aware of the algorithm in use, they might “explore” with their pricing schemes, rendering the system uncontrollable at times. It seems natural that a more sophisticated approach that takes all these factors into account should be in place.
In the presence of strategic or rational individuals, the field of mechanism design has provided a concrete toolbox for managing or preventing the ill effects of selfish behaviour and achieving desirable objectives. Its main principle is the design of systems in such a way that the strategic behaviour of the participants will lead to outcomes that are aligned with the goals of the society, or the objectives of the designer. A common denominator in most of the classical work is that the participants have access to either full information or some distributional estimate of the preferences of others and, crucially, that they act fully rationally using this information when making their decisions.
For the reasons mentioned above, a large recent body of work has advocated that other types of agent behaviour, based on learning and exploring, are perhaps more appropriate for such large-scale online problems encountered in reality. In turn, this generates a requirement for new algorithmic techniques for solving those problems. Our approach is to use techniques from deep reinforcement learning for solving the problem of the impression allocation to sellers, given their selfish nature. In other words, given a rationality model for the sellers, we design reinforcement learning algorithms that take this model into account and solve the impression allocation problem efficiently. We call this general approach reinforcement mechanism design and we can view our contribution in this paper as an instance of this framework.
More concretely, the impression allocation problem can be modeled as a Markov decision process (MDP) in which the information about the prices, past transactions, past allocations of impressions and generated revenue is stored in the states and the actions correspond to all the different ways of allocating the impressions, with the rewards being the immediate revenue generated by each allocation. Given that the costs of the sellers (which depend on their production costs) are private information, it seems natural to employ reinforcement learning techniques for solving the MDP and obtain more sophisticated impression allocation algorithms than the heuristics mentioned above.
In our setting, however, both the state space and the action space are continuous and high-dimensional, which renders traditional reinforcement learning techniques such as temporal difference learning, or more specifically Q-learning, unsuitable for solving the MDP. In a highly influential paper, Mnih et al. used deep neural networks as function approximators to estimate the action-value function. The resulting algorithm, coined “Deep Q Network” (DQN), can handle large (or even continuous) state spaces but, crucially, it cannot be used for continuous action domains, as it relies on finding the action that maximizes the Q-function at each step.
To handle the continuous action space, policy gradient methods have been proposed in the literature of reinforcement learning, with actor-critic algorithms rising as prominent examples, where the critic estimates the Q-function by exploring, while the actor adjusts the parameters of the policy by stochastic gradient ascent. To handle the high-dimensionality of the action space, Silver et al. designed a deterministic actor-critic algorithm, coined “Deterministic Policy Gradient” (DPG), which performs well in standard reinforcement-learning benchmarks such as mountain car, pendulum and 2D puddle world. As Lillicrap et al. point out, however, the algorithm falls short in large-scale problems and for that reason, they developed the “Deep-DPG” (DDPG) algorithm which uses the idea from  and combines the deterministic policy gradient approach of DPG with deep neural networks as function approximators. To ensure convergence and stability, they employ previously known techniques such as batch normalization, target Q-networks, and experience replay, also used in .
We draw inspiration from the DDPG algorithm to design a deep-learning actor-critic algorithm for allocating impressions in e-commerce websites. However, DDPG cannot be applied verbatim in our setting: the policy space is the whole set of feasible allocations, and the two-layer fully connected network used there is not appropriate. As the number of sellers increases, the number of feasible allocations increases sharply, and convergence of the algorithm becomes harder, as the number of actions the network needs to explore increases. Additionally, while the fully connected network performs well for a small number of fixed sellers, its performance deteriorates rapidly in environments where sellers come and go or have variable costs for their items; the environments of e-commerce websites are typically of such nature. To remedy this issue, we make use of a certain property of our domain, namely that the immediate reward of each round is the sum of revenues generated by all sellers, and employ Recurrent Neural Networks (RNN) for the training. For the actor, we train a sub-actor network for each seller independently to output a score, and then use the softmax function to aggregate the scores into the allocation. For the critic, we add the Q-function values of all the independent sub-critic networks to output the final Q-function value. This modification reduces the policy space, which helps the algorithm converge for a larger number of sellers, and furthermore achieves the “permutation invariance” property, which dictates that the results are robust to settings where sellers are variable across rounds. As our experiments show, our algorithm performs better than the heuristic algorithms, which attests to the applicability of our framework on the fundamental task of impression allocation in e-commerce websites.
In the impression allocation problem of e-commerce websites, there are sellers who compete for a unit of buyer impression. In each round, a buyer
Typically, there are slots (e.g. positions on a webpage) to be allocated and we let denote the probability (or the fraction of time) that seller is allocated the impression at slot . With each slot, there is an associated click-through rate which captures the “clicking potential” of each slot, and is independent of the seller, as all items offered are identical. We let denote the probability that the buyer will click the item of seller . Given this definition (and assuming that sellers can appear in multiple slots in each page), the usual feasibility constraints for allocations, i.e.
can be alternatively written as
That is, for any such allocation , there is a feasible ranking that realizes (for the ease of notation, we assume the sum of click-through rates of all slots is ) and therefore we can allocate the buyer impression to sellers directly instead of outputting a ranking over these items when a buyer searches a keyword.
Let denote the record of seller at round , which is a tuple consisting of the following quantities:
is the expected fraction of buyer impressions that seller gets,
is the price that seller sets,
is the expected amount of transactions that seller makes and
is the expected revenue that seller makes at round .
Let denote the records of all sellers at round , and let denote the vectors of records of seller from round to round , which we will refer to as the history of the seller. At each round , seller chooses a price for its item and the algorithm allocates the buyer impression to sellers.
The setting can be formulated as a Markov decision process (MDP) defined by the following components: a continuous state space , a continuous action space , with an initial state distribution with density , and a transition distribution of states with conditional density satisfying the Markov property, i.e. . Furthermore, there is an associated reward function assigning payoffs to pairs of states and actions. Generally, a policy is a function that selects stochastic actions given a state, i.e., , where is the set of probability distributions on . Let denote the discounted sum of rewards from the state , i.e., , where . Given a policy and a state, the value function is defined to be the expected total discounted reward, i.e. and the action-value function is defined as .
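For reference, the quantities described in this paragraph are commonly written as follows in the deterministic policy gradient literature. The symbols used here ($R_t^{\gamma}$, $\gamma$, $V^{\pi}$, $Q^{\pi}$, and indexing the first step by 1) are an assumed notation, since the paper's own symbols are not recoverable from the text above:

```latex
\begin{align*}
R_t^{\gamma} &= \sum_{k=t}^{\infty} \gamma^{\,k-t}\, r(s_k, a_k), \qquad 0 < \gamma < 1,\\
V^{\pi}(s) &= \mathbb{E}\left[\, R_1^{\gamma} \mid s_1 = s;\ \pi \,\right],\\
Q^{\pi}(s, a) &= \mathbb{E}\left[\, R_1^{\gamma} \mid s_1 = s,\ a_1 = a;\ \pi \,\right].
\end{align*}
```

Here $r(s_k, a_k)$ is the reward for the state-action pair at step $k$, and the expectations are taken over trajectories generated by following the policy $\pi$.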
For our problem, a state of the MDP consists of the records of all sellers in the last rounds, i.e. , that is, the state is a tensor, the allocation outcome of the round is the action and the immediate reward is the expected total revenue generated in this round. The performance of an allocation algorithm is defined as the average expected total revenue over a sequence of rounds.
We model the behaviour of the buyer as being dependent on a valuation that comes from a distribution with cumulative distribution function . Intuitively, this captures the fact that buyers may have different spending capabilities (captured by the distribution). Specifically, the probability that the buyer purchases item is . That is, the probability of purchasing is decided by the impression allocation and the price seller sets. For simplicity and without loss of generality with respect to our framework, we assume that the buyer’s valuation is drawn from , i.e. the uniform distribution over .
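As an illustration of the buyer model, the sketch below assumes that a purchase from a seller happens when that seller receives the impression and the buyer's valuation is at least the posted price. This product form, and the names `purchase_probability` and `expected_revenue`, are assumptions made for this example rather than formulas taken from the text.

```python
def uniform_cdf(value):
    """CDF of the uniform distribution over [0, 1]."""
    return min(max(value, 0.0), 1.0)


def purchase_probability(impression_share, price, cdf=uniform_cdf):
    """Assumed form: the seller is shown with probability `impression_share`
    and the buyer's valuation is at least `price` with probability 1 - F(price)."""
    return impression_share * (1.0 - cdf(price))


def expected_revenue(impression_share, price, cdf=uniform_cdf):
    """Expected revenue of one seller in one round under the assumed form."""
    return price * purchase_probability(impression_share, price, cdf)


# A seller holding half of the impression and charging 0.4
prob = purchase_probability(0.5, 0.4)
revenue = expected_revenue(0.5, 0.4)
```

Under this assumed form with uniform valuations, the revenue per unit of impression is `price * (1 - price)`, which is largest at a price of 0.5 and equals 0.25 there.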
As we explained in the introduction, we will take the sellers’ strategies into account when designing the algorithms to optimize our objectives. The seller rationality can be modelled in many different ways; here we will adopt a model where with some probability the agent plays a simple strategy, based on the empirical distribution of the past actions and otherwise it will choose a “low regret” strategy with some room for exploration. A similar idea has also been proposed in  to model bounded-rational decision-makers
More concretely, we will use to denote the strategy of seller , which prescribes the price that the seller uses at each round. The function inputs the record of seller in rounds to and the cost of the seller and outputs a price for the next round, i.e, . The strategy of each seller is fixed and unknown to the platform. We consider a class of strategies of sellers, named -rational strategies.
-rational strategies: At round , each seller posts a random price. At any other round (with ), for some for ,
with probability , seller selects a price drawn from the empirical distribution of her historical price choices in previous rounds.
with probability , seller first calculates the discounted profit of the item in each round , i.e. (for ) and then it picks the price of round with the maximum discounted profit, i.e, , and furthermore adds Gaussian noise for exploration; that is, the seller posts a price at round , where is the Gaussian noise. The use of the discount factor follows the standard assumption that recent records will factor more in the decision of a seller.
The parameter of each seller is drawn from a Gaussian distribution , with , where denotes the “degree of rationality” of sellers (in real-life scenarios, the parameter can also be inferred from the past historical data of each seller). Larger values of mean sellers are more rational. Note that when we set , seller is (completely) rational; we call this strategy a rational strategy. We remark here that our framework can handle almost any kind of seller strategy, ranging from simple strategies (e.g. choose the price randomly) to more complex strategies.
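The seller behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `next_price`, the parameter `p_empirical` (the probability of sampling from the empirical price distribution), the profit form `transactions * (price - cost)`, the default discount and noise values, the uniform first-round price and the clipping of prices to [0, 1] are all choices made for this example.

```python
import random


def next_price(history, cost, p_empirical, discount=0.9, noise_std=0.05, rng=random):
    """Pick the next price of one seller.

    history: list of (price, transactions) tuples, oldest round first.
    """
    if not history:
        # First round: post a random price (assumed uniform over [0, 1]).
        return rng.random()

    if rng.random() < p_empirical:
        # Simple strategy: sample from the empirical distribution of past prices.
        return rng.choice([price for price, _ in history])

    rounds = len(history)

    def discounted_profit(index):
        price, transactions = history[index]
        age = rounds - 1 - index  # 0 for the most recent round
        return (discount ** age) * transactions * (price - cost)

    # Low-regret strategy: reuse the price of the best past round, plus noise.
    best_round = max(range(rounds), key=discounted_profit)
    noisy_price = history[best_round][0] + rng.gauss(0.0, noise_std)
    return min(max(noisy_price, 0.0), 1.0)


past = [(0.3, 0.2), (0.6, 0.1), (0.5, 0.15)]
greedy_price = next_price(past, cost=0.1, p_empirical=0.0, noise_std=0.0)
```

With `p_empirical=0.0` and no noise, the call returns the past price with the highest discounted profit, which makes the two branches easy to inspect separately.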
2.3 Heuristic allocation algorithms
As the strategies of the sellers are unknown to the platform, and the only information available is the sellers’ historical records, the platform can only use that information for the impression allocation. Next, we present two natural heuristic algorithms that we will use as benchmarks against which we will compare our deep reinforcement learning algorithm. We also test some others (e.g. random allocation or a non-myopic version of Greedy) but they do not perform better so we omit the results.
Greedy Myopic Algorithm: At round , the algorithm allocates the buyer impression to each seller with equal probability. At any other round (for ), the algorithm allocates the buyer impression to each seller with probability , i.e. proportional to the contribution of each seller to the total revenue of the last round.
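The Greedy Myopic rule can be sketched in a few lines. The function name and the fallback to a uniform allocation when no revenue was generated in the last round are assumptions of this example.

```python
def greedy_myopic_allocation(last_revenues):
    """Allocate the impression proportionally to last-round revenue.

    last_revenues: one non-negative number per seller. Passing all zeros
    (for example in the first round) yields the uniform allocation.
    """
    n_sellers = len(last_revenues)
    total = sum(last_revenues)
    if total <= 0:
        return [1.0 / n_sellers] * n_sellers
    return [revenue / total for revenue in last_revenues]


first_round = greedy_myopic_allocation([0.0, 0.0, 0.0, 0.0])
later_round = greedy_myopic_allocation([0.1, 0.3, 0.0])
```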
UCB Algorithm: For each seller , the algorithm records a value which is initialized to . At the end of each round , if seller is allocated the buyer impression at this round, then it updates the value of seller by , where . From round to round , the algorithm allocates the buyer impression to a seller which has not been allocated the buyer impression before (breaking ties arbitrarily), otherwise at each round the algorithm allocates the buyer impression to the seller with the maximum weighted value, i.e., . This algorithm employs the idea of the “UCB1 Algorithm” that gives higher priority to sellers which have not been explored before.
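Since the exact update and weighting formulas are not recoverable from the text above, the following sketch uses the textbook UCB1 rule instead: a running mean of observed revenue plus the bonus `sqrt(2 ln t / n_i)`. The class name `UCBAllocator` and both formulas are assumptions standing in for the paper's own definitions.

```python
import math


class UCBAllocator:
    """Textbook UCB1 over sellers (a stand-in, not the paper's exact variant)."""

    def __init__(self, n_sellers):
        self.values = [0.0] * n_sellers  # running mean revenue per seller
        self.counts = [0] * n_sellers    # times each seller got the impression
        self.round = 0

    def select(self):
        """Return the index of the seller that receives the impression."""
        self.round += 1
        # Sellers that were never tried are served first.
        for seller, count in enumerate(self.counts):
            if count == 0:
                return seller

        def upper_bound(seller):
            bonus = math.sqrt(2.0 * math.log(self.round) / self.counts[seller])
            return self.values[seller] + bonus

        return max(range(len(self.values)), key=upper_bound)

    def update(self, seller, revenue):
        """Fold the observed revenue into the seller's running mean."""
        self.counts[seller] += 1
        self.values[seller] += (revenue - self.values[seller]) / self.counts[seller]


allocator = UCBAllocator(3)
picks = []
for observed in [0.1, 0.5, 0.2]:
    chosen = allocator.select()
    picks.append(chosen)
    allocator.update(chosen, observed)
```

After every seller has been tried once, the exploration bonuses are equal, so the next call to `select` favours the seller with the highest observed revenue.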
2.4 Previous approaches: DPG and DDPG
Deterministic Policy Gradient: The shortcoming of DQN is that while it can handle continuous states, it cannot handle continuous actions or high-dimensional action spaces. Although stochastic actor-critic algorithms can handle continuous actions, they struggle to converge in high-dimensional action spaces. The DPG algorithm aims to train a deterministic policy with parameter vector . This algorithm consists of two components: an actor, which adjusts the parameters of the deterministic policy by stochastic gradient ascent on the discounted sum of rewards, and a critic, which approximates the action-value function.
Deep Deterministic Policy Gradient: Directly training neural networks for the actor and the critic of the DPG algorithm fails to achieve convergence; the main reason is the high degree of temporal correlation which introduces high variance in the approximation of the Q-function by the critic. For this reason, the DDPG algorithm uses a technique known as experience replay, according to which the experiences of the agent at each time step are stored in a replay buffer and then a minibatch is sampled uniformly at random from this set for learning, to eliminate the temporal correlation. The other modification is the employment of target networks for the regularization of the learning algorithm. The target network is used to update the values of and at a slower rate instead of updating by the gradient network; the prediction will be relatively fixed and violent jitter at the beginning of training is absorbed by the target network. A similar idea appears in  with the form of double Q-value learning.
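The two stabilization techniques described here can be sketched independently of any deep learning library. The names `ReplayBuffer` and `soft_update`, the flat list of weights, and the parameter name `tau` for the target update rate are assumptions of this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniform minibatch without replacement, which breaks temporal correlation."""
        size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)


def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction `tau` towards the online weight."""
    return [(1.0 - tau) * target + tau * online
            for target, online in zip(target_weights, online_weights)]


replay = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    replay.add(state=step, action=0.0, reward=0.0, next_state=step + 1)
minibatch = replay.sample(2)
new_target = soft_update([0.0, 1.0], [1.0, 0.0], tau=0.1)
```

A small `tau` keeps the target network nearly fixed between updates, which is what damps the early oscillations mentioned above.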
3 The IA(GRU) impression allocation algorithm
In this section, we present our main deep reinforcement learning algorithm, termed “IA(GRU)” (IA stands for “impression allocation”), which is at the center of our framework for impression allocations in e-commerce platforms and is based on the ideas of the DDPG algorithm. Before we present the algorithm, we highlight why simply applying DDPG to our problem cannot work.
Shortcomings of DDPG: First of all, while DDPG is designed for settings with continuous and often high-dimensional action spaces, the blow-up in the number of actions in our problem is very sharp as the number of sellers increases; this is because the action space is the set of all feasible allocations, which grows very rapidly with the number of sellers. As we will show in Section 4, the direct application of the algorithm fails to converge even for a moderately small number of sellers. The second problem comes from the inability of DDPG to handle variability in the set of sellers. Since the algorithm uses a two-layer fully connected network, the position of each seller plays a fundamental role; each seller is treated as a different entity according to that position. As we show in Section 4, if the costs of sellers at each round are randomly selected, the performance of the DDPG algorithm deteriorates rapidly. The settings in real-life e-commerce platforms, however, are quite dynamic, with sellers arriving and leaving or their costs varying over time, and for an allocation algorithm to be applicable, it should be able to handle such variability. We expect that each seller’s features are only affected by its historical records, not some “identity” designated by the allocation algorithm; we refer to this highly desirable property as “permutation invariance”. Based on time-series techniques, our algorithm uses Recurrent Neural Networks along the dimension of the sellers and achieves this property.
The IA(GRU) impression allocation algorithm: Next, we explain the design of our algorithm, but we postpone some implementation details to Section 4. At a high level, the algorithm uses the framework of DDPG with different network structures and different network inputs. It maintains a sub-actor network and a sub-critic network for each seller and employs “input preprocessing” at each training step, to ensure permutation invariance.
Input Preprocessing: In each step of training, with a state tensor of shape , we firstly utilize a “background network” to calculate a public vector containing information of all sellers: it transforms the state tensor to a tensor and performs RNN operations on the axis of rounds. At this step, it applies “permutation transformation”, i.e. a technique for maintaining permutation invariance; specifically, it first orders the sellers according to a certain metric, such as the weighted average of their past generated revenue and then inputs the (state, action) pair following this order to obtain the public vector . On the other hand, for each seller , it applies a similar RNN operation on its history, resulting in an individual temporal feature called . Combining those two features, we obtain a feature vector that we will use as inputs for the sellers’ sub-actor and sub-critic networks.
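The ordering step of the permutation transformation can be illustrated as follows. The text only says that sellers are ordered by a metric such as the weighted average of past revenue, so the exponential weighting, the `decay` parameter and the name `canonical_order` are assumptions of this sketch.

```python
def canonical_order(revenue_histories, decay=0.9):
    """Return seller indices sorted by weighted average past revenue, highest first.

    revenue_histories: one non-empty list of past revenues per seller,
    oldest round first; recent rounds receive larger weights.
    """
    def weighted_average(history):
        weights = [decay ** (len(history) - 1 - k) for k in range(len(history))]
        return sum(w * r for w, r in zip(weights, history)) / sum(weights)

    return sorted(range(len(revenue_histories)),
                  key=lambda seller: weighted_average(revenue_histories[seller]),
                  reverse=True)


histories = [[0.1, 0.1], [0.3, 0.2], [0.0, 0.5]]
order = canonical_order(histories, decay=0.5)

# Presenting the same sellers in a different order gives the same sorted input.
shuffled = [histories[1], histories[2], histories[0]]
sorted_original = [histories[i] for i in order]
sorted_shuffled = [shuffled[i] for i in canonical_order(shuffled, decay=0.5)]
```

Because the network only ever sees sellers in this sorted order, the position at which a seller happens to be listed by the platform no longer influences the resulting public vector.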
Actor network: For each seller, the sub-actor network takes as input and outputs a score. This algorithm uses a softmax function over the outputs of all sub-actor networks to choose an action. The structure of the policy ensures that policy space is much smaller than that of DDPG as the space of inputs of all sub-actor networks is restricted, and allows for easier convergence.
Critic network: For the critic, we make use of the problem-specific property, that the immediate reward of each round is the sum of revenues of all sellers and the record of each seller has the same space. Each sub-critic network inputs the expected fraction of buyer impression the seller gets (the sub-action) and (the sub-state) as input and outputs the Q-value of the corresponding seller, i.e, the expected discounted sum of revenues from the sub-state following the policy. Then, it sums up the estimated Q-value of all sub-critic networks to output the final estimated Q-value, with the assumption that the strategy of each seller is independent of the records of other sellers.
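The two aggregation steps, softmax over sub-actor scores and summation over sub-critic values, can be sketched without any neural network. The toy scoring functions below are hypothetical stand-ins for trained networks, and applying one shared function to every seller is an assumption made to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax: non-negative outputs that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def actor_allocation(features, sub_actor):
    """Score every seller separately, then turn the scores into an allocation."""
    return softmax([sub_actor(feature) for feature in features])


def critic_value(allocation, features, sub_critic):
    """Total Q-value as the sum of per-seller Q-values."""
    return sum(sub_critic(share, feature)
               for share, feature in zip(allocation, features))


# Hypothetical stand-ins for the trained sub-networks.
def toy_sub_actor(feature):
    return 2.0 * feature


def toy_sub_critic(share, feature):
    return share * feature


features = [0.1, 0.4, 0.2]
allocation = actor_allocation(features, toy_sub_actor)
q_value = critic_value(allocation, features, toy_sub_critic)
```

Reordering the sellers reorders the allocation in the same way and leaves the summed Q-value unchanged, which is the behaviour the permutation invariance property asks for.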
In this section, we present the evaluation of our algorithms in terms of convergence time and revenue performance against several benchmarks, namely the direct application of the DDPG algorithm (with a fully connected network) and the heuristic allocation algorithms that we defined in Section 2. We use TensorFlow and Keras as the engine for the deep learning, combining the idea of DDPG and the techniques mentioned in Section 3 to train the neural network.
Experimental Setup: In the implementation of DDPG, the actor network uses two fully connected layers, a rectifier (ReLU) as the activation function, and outputs the action by a softmax function, while the critic network takes the pair of state and action as input and outputs the estimation of the Q-value with a similar structure. The algorithm IA(GRU) uses the same structure, i.e. the fully connected network in sub-actor and sub-critic networks, and uses a Recurrent Neural Network with gated recurrent units (GRU) in recurrent layers to obtain the inputs of these networks. For the experiments we set . (i.e., the record of all items of the last round is viewed as the state).
Experimental Parameters: We use 1000 episodes for both training and testing, and there are 1000 steps in each episode. The valuation of the buyer in each round is drawn from the standard uniform distribution and the costs of sellers follow a Gaussian distribution with mean and variance . The size of the replay buffer is , the discount factor is , and the rate of update of the target network is . The actor network and the critic network are trained via the Adam algorithm, a gradient descent algorithm presented in , and the learning rates of these two networks are and respectively (our experiments show that the algorithms converge to better solutions with lower learning rates - hence the choice of the small values). Following the same idea as in , we add Gaussian noise to the action outputted by the actor network, with the mean of the noise decaying with the number of episodes in the exploration.
Convergence of DDPG and IA(GRU)
First, to show the difference in the convergence properties of these two algorithms, we train them for 200 rational sellers (i.e. ) with fixed costs. Figure ? shows the comparison between the rewards of the algorithms and Figure ? shows the comparison in terms of the training loss with the number of steps.
The gray band shows the variance of the vector of rewards near each step. From the figures, we see that DDPG does not converge, while IA(GRU) converges, as the training loss of the algorithm decreases with the number of steps.
Performance of Deep RL Algorithms and Heuristics for fixed sellers
We test DDPG and IA(GRU) with 200 fixed rational sellers (i.e. ) against the heuristic algorithms that we presented in Section 2. In Figures ? and ?, we show the performances of DDPG, IA(GRU), Greedy Myopic and UCB.
Every point of the above figures shows the reward at the corresponding step. Although the variance of the results of the IA(GRU) algorithm is large, its performance is close to 0.24 and clearly better than that of the other algorithms in terms of the average reward. The reason why DDPG performs worse than IA(GRU) is that DDPG does not converge with 200 sellers.
Performance of Deep RL Algorithms and Heuristics for variable sellers
We train DDPG and IA(GRU) with 200 variable sellers, i.e. the cost of each seller is sampled again in each round and the probability of following the empirical distribution for each seller is drawn from i.i.d . Then we compare the performance of the two deep reinforcement learning algorithms against the heuristics. In Figures ? and ?, we show the performance of DDPG, IA(GRU), Greedy Myopic and UCB.
DDPG is no longer stable and performs worse than in the setting with fixed sellers. On the other hand, IA(GRU) also performs well in this setting and is the best among the four algorithms in terms of the average reward. Greedy Myopic also performs relatively well, the reason being that it does not require any prior knowledge, which is good for the case of variable sellers; it allocates the buyer impression according to the revenue of sellers in the last round. UCB performs worst in this setting.
In this paper, we designed an algorithm (IA(GRU)) for the problem of allocating impressions in e-commerce platforms, which can be seen as part of a more general framework of employing deep reinforcement learning techniques for this problem, in contrast to the existing approaches that mainly use heuristic allocation algorithms. Our experiments show that IA(GRU) works very well in larger-scale problems and can handle dynamic environments, where different sellers might appear in each round, which is typically the case in such e-commerce systems. This is unlike the straightforward application of existing state-of-the-art deep reinforcement learning algorithms, which suffer in terms of convergence and seller variability. Future work could apply the same ideas to different seller behavioural models; one could actually use real data to infer the rationality of the sellers, and our framework allows for the development of algorithms for this case as well.
- Tsinghua University, China.
- University of Oxford, United Kingdom.
- Tsinghua University, China.
- Tsinghua University, China.
- This is also the case in , where the authors study the impression allocation problem under strong informational and traditional rationality assumptions and the strategies are the numbers of fake transactions generated by sellers.
- As the purchasing behavior is determined by the valuation of buyers over the item, without loss of generality we could consider only one buyer at each round.
- The framework of our algorithms extends to cases where we need to return similar but different items to a buyer, i.e., the algorithm outputs a ranking over these items. Furthermore, our approach extends trivially to the case when sellers have multiple items.
- We found out that training our algorithms for larger values of does not help to improve the performance.
- Experience replay for real-time reinforcement learning control.
S. Adam, L. Busoniu, and R. Babuska. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2):201–212, 2012.
- Finite-time analysis of the multiarmed bandit problem.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Machine learning, 47(2-3):235–256, 2002.
- Incremental natural actor-critic algorithms.
S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. In NIPS, pages 105–112, 2007.
- Mechanism design for personalized recommender systems.
Q. Cai, A. Filos-Ratsikas, C. Liu, and P. Tang. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 159–166. ACM, 2016.
- Learning in auctions: Regret is hard, envy is easy.
C. Daskalakis and V. Syrgkanis. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 219–228. IEEE, 2016.
- Q-learning.
P. Dayan and C. Watkins. Machine learning, 8(3):279–292, 1992.
- Model-free reinforcement learning with continuous action in practice.
T. Degris, P. M. Pilarski, and R. S. Sutton. In American Control Conference (ACC), 2012, pages 2177–2182. IEEE, 2012.
- No-regret learning in bayesian games.
J. Hartline, V. Syrgkanis, and E. Tardos. In Advances in Neural Information Processing Systems, pages 3061–3069, 2015.
- Memory-based control with recurrent neural networks.
N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver. arXiv preprint arXiv:1512.04455, 2015.
- Batch normalization: Accelerating deep network training by reducing internal covariate shift.
S. Ioffe and C. Szegedy. arXiv preprint arXiv:1502.03167, 2015.
- Adam: A method for stochastic optimization.
D. Kingma and J. Ba. arXiv preprint arXiv:1412.6980, 2014.
- Continuous control with deep reinforcement learning.
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. arXiv preprint arXiv:1509.02971, 2015.
- Learning and efficiency in games with dynamic population.
T. Lykouris, V. Syrgkanis, and É. Tardos. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 120–129. Society for Industrial and Applied Mathematics, 2016.
- Mechanism design: How to implement social goals.
E. S. Maskin. The American Economic Review, 98(3):567–576, 2008.
- Playing atari with deep reinforcement learning.
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. arXiv preprint arXiv:1312.5602, 2013.
- Human-level control through deep reinforcement learning.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Nature, 518(7540):529–533, 2015.
- Econometrics for learning agents.
D. Nekipelov, V. Syrgkanis, and E. Tardos. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 1–18. ACM, 2015.
- Human decision-making under limited time.
P. A. Ortega and A. A. Stocker. In Advances in Neural Information Processing Systems, pages 100–108, 2016.
- Introduction to recommender systems handbook.
F. Ricci, L. Rokach, and B. Shapira. Springer, 2011.
- Modeling bounded rationality.
A. Rubinstein. MIT press, 1998.
- Deterministic policy gradient algorithms.
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. In International Conference on Machine Learning (ICML), 2014.
- Learning to predict by the methods of temporal differences.
R. S. Sutton. Machine learning, 3(1):9–44, 1988.
- Policy gradient methods for reinforcement learning with function approximation.
R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. In NIPS, volume 99, pages 1057–1063, 1999.
- Deep reinforcement learning with double q-learning.
H. Van Hasselt, A. Guez, and D. Silver. In AAAI, pages 2094–2100, 2016.