Feedback-Based Tree Search for Reinforcement Learning
Abstract
Inspired by recent successes of Monte-Carlo tree search (MCTS) in a number of artificial intelligence (AI) application domains, we propose a model-based reinforcement learning (RL) technique that iteratively applies MCTS on batches of small, finite-horizon versions of the original infinite-horizon Markov decision process. The terminal condition of the finite-horizon problems, or the leaf-node evaluator of the decision tree generated by MCTS, is specified using a combination of an estimated value function and an estimated policy function. The recommendations generated by the MCTS procedure are then provided as feedback in order to refine, through classification and regression, the leaf-node evaluator for the next iteration. We provide the first sample complexity bounds for a tree search-based RL algorithm. In addition, we show that a deep neural network implementation of the technique can create a competitive AI agent for the popular multi-player online battle arena (MOBA) game King of Glory.
1 Introduction
Monte-Carlo tree search (MCTS), introduced in \citet{Coulom2006} and surveyed in detail by \citet{Browne2012}, has received attention in recent years for its successes in gameplay artificial intelligence (AI), culminating in the Go-playing AI AlphaGo \citep{Silver2016}. MCTS seeks to iteratively build the decision tree associated with a given Markov decision process (MDP) so that attention is focused on “important” areas of the state space, assuming a given initial state (or root node of the decision tree). The intuition behind MCTS is that if rough estimates of state or action values are given, then it is only necessary to expand the decision tree in the direction of states and actions with high estimated value. To accomplish this, MCTS utilizes the guidance of leaf-node evaluators (either a policy function rollout \citep{Chaslot2006}, a value function evaluation \citep{Campbell2002,Enzenberger2003}, or a mixture of both \citep{Silver2016}) to produce estimates of downstream values once the tree has reached a certain depth \citep{Browne2012}. The information from the leaf-nodes is then backpropagated up the tree. The performance of MCTS depends heavily on the quality of the policy/value approximations \citep{Gelly2007}. At the same time, the successes of MCTS in Go show that MCTS improves upon a given policy when the policy is used for leaf evaluation; in fact, it can be viewed as a policy improvement operator \citep{silver2017mastering}. In this paper, we study a new feedback-based framework, wherein MCTS updates its own leaf-node evaluators using observations generated at the root node.
MCTS is typically viewed as an online planner, where a decision tree is built starting from the current state as the root node \citep{Chaslot2006,Chaslot2008,Hingston2007,Maitrepierre2008,Cazenave2009,Mehat2010,Gelly2011,Gelly2012,Silver2016}. The standard goal of MCTS is to recommend an action for the root node only. After the action is taken, the system moves forward and a new tree is created from the next state (statistics from the old tree may be partially saved or completely discarded). MCTS is thus a “local” procedure (in that it only returns an action for a given state) and is inherently different from value function approximation or policy function approximation approaches where a “global” policy (one that contains policy information about all states) is built. In real-time decision-making applications, it is more difficult to build an adequate “on-the-fly” local approximation than it is to use a pre-trained global policy in the short amount of time available for decision-making. For games like Chess or Go, online planning using MCTS may be appropriate, but in games where fast decisions are necessary (e.g., Atari or MOBA video games), tree search methods are too slow \citep{Guo2014}. The proposed algorithm is intended to be used in an off-policy fashion during the reinforcement learning (RL) training phase. Once the training is complete, the policies associated with leaf-node evaluation can be implemented to make fast, real-time decisions without any further need for tree search.
Main Contributions. These characteristics of MCTS motivate our proposed method, which attempts to leverage the local properties of MCTS into a training procedure to iteratively build a global policy across all states. The idea is to apply MCTS on batches of small, finite-horizon versions of the original infinite-horizon Markov decision process (MDP). A rough summary is as follows: (1) initialize an arbitrary value function and a policy function; (2) start (possibly in parallel) a batch of MCTS instances, limited in search-depth, initialized from a set of sampled states, while incorporating a combination of the value and policy function as leaf-node evaluators; (3) update both the value and policy functions using the latest MCTS root node observations; (4) repeat starting from step (2). This method exploits the idea that an MCTS policy is better than either of the leaf-node evaluator policies alone \citep{Silver2016}, yet improved leaf-node evaluators also improve the quality of MCTS \citep{Gelly2007}. The primary contributions of this paper are summarized below.

We propose a batch, MCTS-based RL method that operates on continuous state, finite action MDPs and exploits the idea that leaf-evaluators can be updated to produce a stronger tree search using previous tree search results. Function approximators are used to track policy and value function approximations, where the latter is used to reduce the length of the tree search rollout (oftentimes, the rollout of the policy becomes a computational bottleneck in complex environments).

We provide a full sample complexity analysis of the method and show that with large enough sample sizes and sufficiently large tree search effort, the performance of the estimated policies can be made close to optimal, up to some unavoidable approximation error. To our knowledge, batch MCTS-based RL methods have not been theoretically analyzed.

An implementation of the feedback-based tree search algorithm using deep neural networks is tested on the recently popular MOBA game King of Glory (a North American version of the same game is titled Arena of Valor). The result is a competitive AI agent for the 1v1 mode of the game.
2 Related Work
The idea of leveraging tree search during training was first explored by \citet{Guo2014} in the context of Atari games, where MCTS was used to generate offline training data for a supervised learning (classification) procedure. The authors showed that by using the power of tree search offline, the resulting policy was able to outperform the deep Q-network (DQN) approach of \citet{Mnih2013}. A natural next step is to repeatedly apply the procedure of \citet{Guo2014}. In building AlphaGo Zero, \citet{silver2017mastering} extend the ideas of \citet{Guo2014} into an iterative procedure, where the neural network policy is updated after every episode and then reincorporated into tree search. The technique was able to produce a superhuman Go-playing AI (improving upon the previous AlphaGo versions) without any human replay data.
Our proposed algorithm is a provably near-optimal variant (and in some respects, generalization) of the AlphaGo Zero algorithm. The key differences are the following: (1) our theoretical results cover a continuous, rather than finite, state space setting, (2) the environment is a stochastic MDP rather than a sequential deterministic two-player game, (3) we use batch updates, (4) the feedback of previous results to the leaf-evaluator manifests as both policy and value updates rather than just the value (as \citet{silver2017mastering} do not use policy rollouts).
\citet{anthony2017thinking} propose a general framework called expert iteration that combines supervised learning with tree search-based planning. The methods described in \citet{Guo2014}, \citet{silver2017mastering}, and the current paper can all be (at least loosely) expressed under the expert iteration framework. However, no theoretical insights were given in any of these previous works, and our paper intends to fill this gap by providing a full theoretical analysis of an iterative, MCTS-based RL algorithm. Our analysis relies on the concentrability coefficient idea of \citet{munos2007performance} for approximate value iteration and builds upon the work on classification-based policy iteration \citep{lazaric2016analysis}, approximate modified policy iteration \citep{scherrer2015approximate}, and fitted value iteration \citep{munos2008finite}.
Sample complexity results for MCTS are relatively sparse. \citet{teraoka2014efficient} give a high probability upper bound on the number of playouts needed to achieve accuracy at the root node for a stylized version of MCTS called . More recently, \citet{kaufmann2017monte} provided high probability bounds on the sample complexity of two other variants of MCTS called  and . In this paper, we do not require any particular implementation of MCTS, but make a generic assumption on its accuracy that is inspired by these results.
3 Problem Formulation
Consider a discounted, infinite-horizon MDP with a continuous state space and finite action space . For all , the reward function satisfies . The transition kernel, which describes transitions to the next state given current state and action , is written , which is a probability measure over . Given a discount factor , the value function of a policy starting in is given by
(1) 
where is the state visited at time . Let be the set of all stationary, deterministic policies (i.e., mappings from state to action). The optimal value function is obtained by maximizing over all policies: .
Both and are bounded by . We let be the set of bounded, real-valued functions mapping to . We frequently make use of the shorthand operator , where the quantity is to be interpreted as the reward gained by taking an action according to , receiving the reward , and then receiving an expected terminal reward according to the argument :
It is well-known that is the unique fixed-point of , meaning \citep{Puterman}. The Bellman operator is similarly defined using the maximizing action:
It is also known that is the unique fixed-point of \citep{Puterman} and that acting greedily with respect to the optimal value function produces an optimal policy:
We use the notation to mean the compositions of the mapping , e.g., . Lastly, let and let be a distribution over . We define left and right versions of an operator :
Note that and is another distribution over .
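For orientation, the standard textbook forms of the objects described in this section can be written as follows. This is only a sketch under assumed notation: the symbols $\mathcal{S}$, $\mathcal{A}$, $r$, $p$, $\gamma$, $T_{\pi}$, and $T$ are conventional choices and are not confirmed by the surviving text.

```latex
\begin{align*}
V^{\pi}(s) &= \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r\big(s_t, \pi(s_t)\big) \,\Big|\, s_0 = s\Big], &
V^{*}(s) &= \sup_{\pi \in \Pi} V^{\pi}(s), \\
(T_{\pi} V)(s) &= r\big(s, \pi(s)\big) + \gamma \int_{\mathcal{S}} V(\tilde{s})\, p\big(d\tilde{s} \mid s, \pi(s)\big), &
T_{\pi} V^{\pi} &= V^{\pi}, \\
(T V)(s) &= \max_{a \in \mathcal{A}} \Big[ r(s,a) + \gamma \int_{\mathcal{S}} V(\tilde{s})\, p(d\tilde{s} \mid s, a) \Big], &
T V^{*} &= V^{*}, \\
\pi^{*}(s) &\in \operatorname*{arg\,max}_{a \in \mathcal{A}} \Big[ r(s,a) + \gamma \int_{\mathcal{S}} V^{*}(\tilde{s})\, p(d\tilde{s} \mid s, a) \Big].
\end{align*}
```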
4 FeedbackBased Tree Search Algorithm
We now formally describe the proposed algorithm. The parameters are as follows. Let be a space of approximate policies and be a space of approximate value functions (e.g., classes of neural network architectures). We let be the policy function approximation (PFA) and be the value function approximation (VFA) at iteration of the algorithm. Parameters subscripted with ‘0’ are used in the value function approximation (regression) phase and parameters subscripted with ‘1’ are used in the tree search phase. The full description of the procedure is given in Figure 1, using the notation , where maps all states to the action . We now summarize the two phases, VFA (Steps 2 and 3) and MCTS (Steps 4, 5, and 6).
VFA Phase. Given a policy , we wish to approximate its value by fitting a function using subroutine on states sampled from a distribution . Each call to requires repeatedly performing rollouts that are initiated from leaf-nodes of the decision tree. Because repeating full rollouts during tree search is expensive, the idea is that a VFA obtained from a one-time regression on a single set of rollouts can drastically reduce the computation needed for . For each sampled state , we estimate its value using full rollouts, which can be obtained using the absorption time formulation of an infinite horizon MDP \citep[Proposition 5.3.1]{Puterman}.
MCTS Phase. On every iteration , we sample a set of i.i.d. states from a distribution over . From each state, a tree search algorithm, denoted , is executed for iterations on a search tree of maximum depth . We assume here that the leaf evaluator is a general function of the PFA and VFA from the previous iteration, and , and it is denoted as a “subroutine” . The results of the procedure are piped into a subroutine , which fits a new policy using classification (from continuous states to discrete actions) on the new data. As discussed more in Assumption 4, uses observations (one-step rollouts) to compute a loss function.
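The absorption time idea mentioned for full rollouts can be illustrated with a small sketch: if each step is followed by absorption with probability one minus the discount factor, then the undiscounted sum of rewards collected before absorption has the discounted value as its expectation. The function names, the `step(state, action, rng)` simulator interface, and the toy environment below are assumptions made for illustration only.

```python
import random


def absorbing_rollout_return(step, policy, state, gamma, rng):
    """Return the undiscounted reward sum of one rollout with a geometric horizon.

    After every step the rollout continues with probability gamma, so the
    reward at time t is collected with probability gamma**t.
    """
    total = 0.0
    while True:
        action = policy(state)
        state, reward = step(state, action, rng)
        total += reward
        if rng.random() >= gamma:  # absorbed with probability 1 - gamma
            return total


def estimate_value(step, policy, state, gamma, num_rollouts, seed=0):
    """Average several absorbing rollouts started from the same state."""
    rng = random.Random(seed)
    returns = [
        absorbing_rollout_return(step, policy, state, gamma, rng)
        for _ in range(num_rollouts)
    ]
    return sum(returns) / num_rollouts


# Hypothetical toy simulator: a single state that pays reward 1 at every step,
# so the discounted value is 1 / (1 - gamma).
def toy_step(state, action, rng):
    return state, 1.0


value_hat = estimate_value(toy_step, lambda s: 0, 0, gamma=0.9, num_rollouts=20000)
```

In this toy case the estimate should be close to 10, the value of a constant unit reward under a discount factor of 0.9.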
The illustration given in Figure 2 shows the interactions (and feedback loop) of the basic components of the algorithm: (1) a set of tree search runs initiated from a batch of sampled states (triangles), (2) leaf evaluation using and during tree search, and (3) the PFA and VFA and updated using the tree search results.
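The two phases described above can be summarized as a short control-flow sketch. It is not the procedure of Figure 1 itself: the subroutine names (`regress`, `mcts`, `classify`), the two sampling callables, and their signatures are hypothetical placeholders that a concrete implementation would have to supply.

```python
def feedback_based_tree_search(init_policy, init_value, sample_vfa_states,
                               sample_mcts_states, regress, mcts, classify,
                               num_iterations):
    """Alternate a regression (VFA) phase and a tree search (MCTS) phase.

    All callables are assumed interfaces:
      regress(policy, states)            -> value function fitted to the policy
      mcts(state, policy, value)         -> root-node value estimate for `state`
      classify(pairs, policy, value)     -> new policy fitted to (state, estimate) pairs
    """
    policy, value = init_policy, init_value
    for _ in range(num_iterations):
        # VFA phase: fit a value function for the current policy.
        value = regress(policy, sample_vfa_states())

        # MCTS phase: run tree search from a batch of sampled root states,
        # using the current policy and value function for leaf evaluation.
        roots = sample_mcts_states()
        pairs = [(s, mcts(s, policy, value)) for s in roots]

        # Feedback: the tree search results define the next policy.
        policy = classify(pairs, policy, value)
    return policy, value
```

Because each root state is searched independently, the list comprehension in the MCTS phase is the natural place to parallelize.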
5 Assumptions
Figure 1 shows the algorithm written with general subroutines , , , and , allowing for variations in implementation suited for different problems. However, our analysis assumes specific choices and properties of these subroutines, which we describe now. The regression step solves a least absolute deviation problem to minimize an empirical version of
as described in the first assumption.
Assumption 1 ( Subroutine).
For each , define for all and for each , the state is drawn from . Let be an estimate of using rollouts and , the VFA resulting from , obtained via least absolute deviation regression:
(2)  
(3) 
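As a rough illustration of least absolute deviation regression, the sketch below selects, from a small finite class of candidate value functions, the one minimizing the empirical mean absolute error against rollout-based targets. The finite candidate class, the linear form of the candidates, and the data are all assumptions chosen to keep the example self-contained; in practice the class would be something like a neural network architecture.

```python
def empirical_l1_loss(value_fn, states, targets):
    """Mean absolute deviation between predictions and regression targets."""
    return sum(abs(value_fn(s) - y) for s, y in zip(states, targets)) / len(states)


def lad_regression(candidates, states, targets):
    """Pick the candidate with the smallest empirical L1 loss."""
    return min(candidates, key=lambda v: empirical_l1_loss(v, states, targets))


def make_linear(weight):
    """Hypothetical candidate: a value function proportional to a scalar state."""
    return lambda s: weight * s


# Candidate weights 0.0, 0.5, ..., 5.0.
candidates = [make_linear(0.5 * i) for i in range(11)]

# Made-up targets roughly following 2 * s, except for one outlier at s = 1.
states = [1.0, 2.0, 3.0, 4.0]
targets = [9.0, 4.1, 5.9, 8.0]

best = lad_regression(candidates, states, targets)
```

On this data the selected candidate has weight 2.0, showing that the absolute-value loss is not dominated by the single outlying target.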
There are many ways that may be defined. The standard leaf evaluator for MCTS is to simulate a default or “rollout” policy \citep{Browne2012} until the end of the game, though in related tree search techniques, authors have also opted for a value function approximation \citep{Campbell2002,Enzenberger2003}. It is also possible to combine the two approximations: \citet{Silver2016} use a weighted combination of a full rollout from a pre-trained policy and a pre-trained value function approximation.
Assumption 2 ( Subroutine).
Our approach uses a partial rollout of length and a value estimation at the end. produces unbiased observations of
(4) 
where .
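A partial rollout with a value estimate at the end can be sketched as follows: simulate the policy for a fixed number of steps, accumulate discounted rewards, and add the discounted value function evaluated at the final state. The `step(state, action)` simulator interface and the toy chain are assumptions for illustration; with a stochastic simulator, each call yields one random observation.

```python
def partial_rollout_estimate(step, policy, value_fn, state, horizon, gamma):
    """Discounted reward over `horizon` steps plus a discounted terminal value."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
    # After the loop, discount equals gamma ** horizon.
    return total + discount * value_fn(state)


# Hypothetical deterministic chain: the state counts steps and every step pays 1.
def chain_step(state, action):
    return state + 1, 1.0


estimate = partial_rollout_estimate(
    chain_step,
    policy=lambda s: 0,
    value_fn=lambda s: 10.0,
    state=0,
    horizon=3,
    gamma=0.5,
)
```

Here the estimate is 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0. Setting `horizon=0` recovers a pure value function evaluation.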
Assumption 2 is motivated by our MOBA game, on which we observed that even short rollouts (as opposed to simply using a VFA) are immensely helpful in determining local outcomes (e.g., dodging attacks, eliminating minions, health regeneration). At the same time, we found that numerous full rollouts simulated using the relatively slow and complex game engine are far too time-consuming within tree search.
We also need to make an assumption on the sample complexity of , of which there are many possible variations \citep{Chaslot2006,Coulom2006,Kocsis2006,Gelly2007,couetoux2011continuous,couetoux2011continuous2,AlKanj2016,Jiang2017}. Particularly relevant to our continuous-state setting are tree expansion techniques called progressive widening and double progressive widening, proposed in \citet{couetoux2011continuous}, which have proven successful in problems with continuous state/action spaces. To our knowledge, analysis of the sample complexity is only available for stylized versions of MCTS on finite problems, like \citet{teraoka2014efficient} and \citet{kaufmann2017monte}. Theorems from these papers show upper bounds on the number of iterations needed so that with high probability (greater than ), the value at the root node is accurate within a tolerance of . Fortunately, there are ways to discretize continuous state MDPs that enjoy error guarantees, such as \citet{bertsekas1975convergence}, \citet{dufour2012approximation}, or \citet{saldi2017asymptotic}. These error bounds can be combined with the MCTS guarantees of \citet{teraoka2014efficient} and \citet{kaufmann2017monte} to produce a sample complexity bound for MCTS on continuous problems. The next assumption captures the essence of these results (and if desired, can be made precise for specific implementations through the references above).
Assumption 3 ( Subroutine).
Consider a stage, finite-horizon subproblem of (1) with terminal value function and initial state . Let the result of be denoted . We assume that there exists a function , such that if iterations of are used, the inequality holds with probability at least .
Now, we are ready to discuss the subroutine. Our goal is to select a policy that closely mimics the performance of the result, similar to practical implementations in existing work \citep{Guo2014,silver2017mastering,anthony2017thinking}. The question is: given a candidate , how do we measure “closeness” to the policy? We take inspiration from previous work in classification-based RL and use a cost-based penalization of classification errors \citep{langford2005relating,li2007focus,lazaric2016analysis}. Since is an approximation of the performance of the policy, we should try to select a policy with similar performance. To estimate the performance of some candidate policy , we use a one-step rollout and evaluate the downstream cost using .
Assumption 4 ( Subroutine).
For each and , let be an estimate of the value of state-action pair using samples.
Let , the result of , be obtained by minimizing the discrepancy between the result and the estimated value of the policy under approximations :
where are i.i.d. samples from .
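The classification step can be sketched as a cost-sensitive selection: among candidate policies, choose the one whose estimated state-action value at each sampled state is closest, in average absolute difference, to the tree search result at that state. This is one reading of the prose description rather than the exact objective, and the finite policy class, the tabular `q_table`, and the numbers are illustrative assumptions.

```python
def classification_loss(policy, states, tree_search_values, q_hat):
    """Average gap between tree search values and the policy's estimated Q-values."""
    return sum(
        abs(u - q_hat(s, policy(s))) for s, u in zip(states, tree_search_values)
    ) / len(states)


def classify(policy_class, states, tree_search_values, q_hat):
    """Return the candidate policy with the smallest classification loss."""
    return min(
        policy_class,
        key=lambda pi: classification_loss(pi, states, tree_search_values, q_hat),
    )


# Hypothetical estimated Q-values for three states and two actions.
q_table = {
    (0, 0): 1.0, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.9,
    (2, 0): 0.8, (2, 1): 0.1,
}
states = [0, 1, 2]
tree_search_values = [1.0, 0.9, 0.8]  # made-up root-node results

policy_class = [
    lambda s: 0,      # always action 0
    lambda s: 1,      # always action 1
    lambda s: s % 2,  # action depends on state parity
]

best_policy = classify(policy_class, states, tree_search_values,
                       lambda s, a: q_table[(s, a)])
```

The loss penalizes a wrong action by how much value it gives up, so mistakes at states where the actions are nearly equivalent cost little.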
An issue that arises during the analysis is that even though we can control the distribution from which states are sampled, this distribution is transformed by the transition kernel of the policies used for rollout/lookahead. Let us now introduce the concentrability coefficient idea of \citet{munos2007performance} (used subsequently by many authors, including \citet{munos2008finite}, \citet{lazaric2016analysis}, \citet{scherrer2015approximate}, and \citet{haskell2016empirical}).
Assumption 5 (Concentrability).
Consider any sequence of policies . Suppose we start in distribution and that the state distribution attained after applying the policies in succession, , is absolutely continuous with respect to . We define an step concentrability coefficient
and assume that . Similarly, we assume , is absolutely continuous with respect to and assume that
is finite for any .
6 Sample Complexity Analysis
Before presenting the sample complexity analysis, let us consider an algorithm that generates a sequence of policies satisfying with no error. It is proved in \citet[pp. 30-31]{Bertsekas1996} that in the finite state and action setting. Our proposed algorithm in Figure 1 can be viewed as approximately satisfying this iteration in a continuous state space setting, where plays the role of and evaluation of uses a combination of accurate rollouts (due to ) and fast VFA evaluations (due to ). The sample complexity analysis requires the effects of all errors to be systematically analyzed.
For some , our goal is to develop a high probability upper bound on the expected suboptimality, over an initial state distribution , of the performance of policy , written as . Because there is no requirement to control errors with probability one, bounds in tend to be much more useful in practice than ones in the traditional . Notice that:
(5)  
where the left-hand-side is the loss function used in the classification step from Assumption 4. It turns out that we can relate the right-hand-side (albeit under a different distribution) to the expected suboptimality after iterations , as shown in the following lemma. Full proofs of all results are given in the supplementary material.
Lemma (Loss to Performance Relationship). The expected suboptimality of can be bounded as follows:
where .
From Lemma 6, we see that the expected suboptimality at iteration can be upper bounded by the suboptimality of the initial policy (in maximum norm) plus a discounted and reweighted version of accumulated over prior iterations. Hypothetically, if were small for all iterations and all states , then the suboptimality of converges linearly to zero. Hence, we may refer to as the “true loss,” the target term to be minimized at iteration . We now have a starting point for the analysis: if (5) can be made precise, then the result can be combined with Lemma 6 to provide an explicit bound on . The various errors that we incur when relating the objective of to the true loss include the error due to regression using functions in ; the error due to sampling the state space according to ; the error of estimating using the sample average of one-step rollouts ; and of course, the error due to .
We now give a series of lemmas that help us carry out the analysis. In the algorithmic setting, the policy is a random quantity that depends on the samples collected in previous iterations; however, for simplicity, the lemmas that follow are stated from the perspective of a fixed policy or fixed value function approximation rather than or . Conditioning arguments will be used when invoking these lemmas (see supplementary material).
Lemma 1 (Propagation of VFA Error).
The lemma above addresses the fact that instead of using directly, and only have access to the estimates and ( steps of rollout with an evaluation of at the end), respectively. Note that propagation of the error in is discounted by or and since the lemma converts between and , it is also impacted by the concentrability coefficients and .
Let be the VC-dimension of the class of binary classifiers and let be the pseudo-dimension of the function class . The VC-dimension is a measure of the capacity of and the notion of a pseudo-dimension is a generalization of the VC-dimension to real-valued functions (see, e.g., \citet{pollard1990empirical}, \citet{haussler1992decision}, \citet{mohri2012foundations} for definitions of both). Similar to \citet{lazaric2016analysis} and \citet{scherrer2015approximate}, we will present results for the case of two actions, i.e., . The extension to multiple actions is possible by performing an analysis along the lines of \citet[Section 6]{lazaric2016analysis}. We now quantify the error illustrated in Figure 3. Define the quantity , the sum of the coefficients from Lemma 1.
Lemma 2.
Suppose the regression sample size is
and the sample size , for estimating the regression targets, is
Furthermore, there exist constants , , , and , such that if and are large enough to satisfy
and if , then
with probability at least .
Sketch of Proof.
By adding and subtracting terms, applying the triangle inequality, and invoking Lemma 1, we see that:
Here, the error is split into two terms. The first depends on the sample and the history through while the second term depends on the sample and the history through . We can thus view as fixed when analyzing the first term and as fixed when analyzing the second term (details in the supplementary material). The first term contributes the quantity in the final bound with additional estimation error contained within . The second term contributes the rest. See Figure 3 for an illustration of the main proof steps. ∎
The first two terms on the right-hand-side are related to the approximation power of and and can be considered unavoidable. We upper-bound these terms by maximizing over , in effect removing the dependence on the random process in the analysis of the next theorem. We define:
two terms that are closely related to the notion of inherent Bellman error \citep{antos2008learning,munos2008finite,lazaric2016analysis,scherrer2015approximate,haskell2017empirical}. Also, let , which was assumed to be finite in Assumption 5.
Theorem 1.
Suppose the sample size requirements of Lemma 2 are satisfied with and replacing and , respectively. Then, the suboptimality of the policy can be bounded as follows:
with probability at least .
Search Depth. How should the search depth be chosen? Theorem 1 shows that as increases, fewer iterations are needed to achieve a given accuracy; however, the effort required of tree search (i.e., the function ) grows exponentially in . At the other extreme (), more iterations are needed and the “fixed cost” of each iteration of the algorithm (i.e., sampling, regression, and classification, which are all of the steps that do not depend on ) becomes more prominent. For a given problem and algorithm parameters, these computational costs can each be estimated and Theorem 1 can serve as a guide to selecting an optimal .
7 Case Study: King of Glory MOBA AI
We implemented Feedback-Based Tree Search within a new and challenging environment, the recently popular MOBA game King of Glory by Tencent (the game is also known as Honor of Kings and a North American release of the game is titled Arena of Valor). Our implementation of the algorithm is one of the first attempts to design an AI for the 1v1 version of this game.
Game Description. In King of Glory, players are divided into two opposing teams and each team has a base located on the opposite corners of the game map (similar to other MOBA games, like League of Legends or Dota 2). The bases are guarded by towers, which can attack the enemies when they are within a certain attack range. The goal of each team is to overcome the towers and eventually destroy the opposing team’s “crystal,” located at the enemy’s base. For this paper, we only consider the 1v1 mode, where each player controls a primary “hero” alongside less powerful game-controlled characters called “minions.” These units guard the path to the crystal and will automatically fire (weak) attacks at enemies within range. Figure 4 shows the two heroes and their minions; the upper-left corner shows the map, with the blue and red markers pinpointing the towers and crystals.
Experimental Setup. The state variable of the system is taken to be a 41-dimensional vector containing information obtained directly from the game engine, including hero locations, hero health, minion health, hero skill states, and relative locations to various structures. There are 22 actions, including move, attack, heal, and special skill actions, some of which are associated with (discretized) directions. The reward function is designed to mimic reward shaping \citep{ng1999policy} and uses a combination of signals including health, kills, damage dealt, and proximity to crystal. We trained five King of Glory agents, using the hero DiRenJie:

The “FBTS” agent is trained using our feedback-based tree search algorithm for iterations of 50 games each. The search depth is and rollout length is . Each call to ran for 400 iterations.

The second agent is labeled “NR” for no rollouts. It uses the same parameters as the FBTS agent except no rollouts are used. At a high level, this bears some similarity to the AlphaGo Zero algorithm \citep{silver2017mastering} in a batch setting.

The “DPI” agent uses the direct policy iteration technique of \citet{lazaric2016analysis} for iterations. There is no value function and no tree search (due to computational limitations, more iterations are possible when tree search is not used).

We then have the “AVI” agent, which implements approximate value iteration \citep{de2000existence,van2006performance,munos2007performance,munos2008finite} for iterations. This algorithm can be considered a batch version of DQN \citep{Mnih2013}.

Lastly, we consider an “SL” agent trained via supervised learning on a dataset of approximately 100,000 state/action pairs of human gameplay data. Notably, the policy architecture used here is consistent with the previous agents.
In fact, both the policy and value function approximations are consistent across all agents; they use fully-connected neural networks with five and two hidden layers, respectively, and SELU (scaled exponential linear unit) activation \citep{klambauer2017self}. The initial policy takes random actions: move (w.p. 0.5), directional attack (w.p. 0.2), or a special skill (w.p. 0.3). Besides biasing the move direction toward the forward direction, no other heuristic information is used by . was chosen to be a variant of UCT \citep{Kocsis2006} that is more amenable to parallel simulations: instead of using the argmax of the UCB scores, we sample actions according to the distribution obtained by applying softmax to the UCB scores.
In the practical implementation of the algorithm, uses a cosine proximity loss while uses a negative log-likelihood loss, differing from the theoretical specifications. Due to the inability to “rewind” or “fast-forward” the game environment to arbitrary states, the sampling distribution is implemented by first taking random actions (for a random number of steps) to arrive at an initial state and then following until the end of the game. To reduce correlation during value approximation, we discard of the states encountered in these trajectories. For , we follow the policy while occasionally injecting noise (in the form of random actions and random switches to the default policy) to reduce correlation. During rollouts, we use the internal AI for the hero DiRenJie as the opponent.
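The softmax-over-UCB selection rule can be sketched as below. The UCB1-style score, the exploration constant, and the temperature are assumptions, since the exact form used in the experiments is not stated here; the sketch also assumes every action has been visited at least once.

```python
import math
import random


def ucb_scores(mean_values, visit_counts, exploration=1.0):
    """UCB1-style scores: empirical mean plus an exploration bonus."""
    total_visits = sum(visit_counts)
    return [
        q + exploration * math.sqrt(math.log(total_visits) / n)
        for q, n in zip(mean_values, visit_counts)
    ]


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    weights = [math.exp((x - top) / temperature) for x in scores]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


def sample_action(mean_values, visit_counts, rng, exploration=1.0, temperature=1.0):
    """Sample an action index from softmax(UCB scores) instead of taking the argmax."""
    probs = softmax(ucb_scores(mean_values, visit_counts, exploration), temperature)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


rng = random.Random(0)
action, probs = sample_action([0.2, 0.8, 0.5], [10, 10, 10], rng)
```

Sampling rather than maximizing means that parallel workers reading the same statistics are less likely to all descend the same branch, which is the motivation given above for the variant.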
Results. As the game is nearly deterministic, our primary methodology for testing is to compare the agents’ effectiveness against a common set of opponents chosen from the internal AIs. We also added the internal DiRenJie AI as a “sanity check” baseline agent. To select the test opponents, we played the internal DiRenJie AI against other internal AIs (i.e., other heroes) and selected six heroes of the marksman type that the internal DiRenJie AI is able to defeat. Each of our agents, including the internal DiRenJie AI, was then played against every test opponent. Figure 5 shows the length of time, measured in frames, for each agent to defeat the test opponents (a value of 20,000 frames is assigned if the opponent won). Against the set of common opponents, FBTS significantly outperforms DPI, AVI, SL, and the internal AI. However, FBTS only slightly outperforms NR on average (which is perhaps not surprising as NR is the only other agent that also uses MCTS). Our second set of results helps to visualize head-to-head battles played between FBTS and the four baselines (all of which are won by FBTS): Figure 6 shows the ratio of the FBTS agent’s gold to its opponent’s gold as a function of time. Gold is collected throughout the game as heroes deal damage and defeat enemies, so a ratio above 1.0 (above the red region) indicates good relative performance by FBTS. As the figure shows, each game ends with FBTS achieving a gold ratio in the range of .
8 Conclusion & Future Work
In this paper, we provide a sample complexity analysis for feedback-based tree search, an RL algorithm based on repeatedly solving finite-horizon subproblems using MCTS. Our primary methodological avenues for future work are (1) to analyze a self-play variant of the algorithm and (2) to consider related techniques in multi-agent domains (see, e.g., \citet{hu2003nash}). The implementation of the algorithm in the 1v1 MOBA game King of Glory provided us with encouraging results against several related algorithms; however, significant work remains for the agent to become competitive with humans.
Acknowledgements
We sincerely appreciate the helpful feedback from four anonymous reviewers, which helped to significantly improve the paper. We also wish to thank our colleagues at Tencent AI Lab, particularly Carson Eisenach and Xiangru Lian, for assistance with the test environment and for providing the SL agent. The first author is very grateful for the support from Tencent AI Lab through a faculty award.
References
 AlKanj, Lina, Powell, Warren B, and BouzaieneAyari, Belgacem. The informationcollecting vehicle routing problem: Stochastic optimization for emergency storm response. arXiv preprint arXiv:1605.05711, 2016.
 Anthony, Thomas, Tian, Zheng, and Barber, David. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pp. 5366–5376, 2017.
 Antos, András, Szepesvári, Csaba, and Munos, Rémi. Learning nearoptimal policies with bellmanresidual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
 Bertsekas, Dimitri P. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.
 Bertsekas, Dimitri P and Tsitsiklis, John N. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
 Browne, Cameron B, Powley, Edward, Whitehouse, Daniel, Lucas, Simon M, Cowling, Peter I, Rohlfshagen, Philipp, Tavener, Stephen, Perez, Diego, Samothrakis, Spyridon, and Colton, Simon. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
 Campbell, Murray, Hoane Jr, A Joseph, and Hsu, Feng-hsiung. Deep Blue. Artificial Intelligence, 134(1-2):57–83, 2002.
 Cazenave, Tristan. Nested Monte-Carlo search. In International Joint Conference on Artificial Intelligence, pp. 456–461, 2009.
 Chaslot, Guillaume, Saito, Jahn-Takeshi, Uiterwijk, Jos WHM, Bouzy, Bruno, and van den Herik, H Jaap. Monte-Carlo strategies for computer Go. In 18th Belgian-Dutch Conference on Artificial Intelligence, pp. 83–90, 2006.
 Chaslot, Guillaume, Bakkes, Sander, Szita, Istvan, and Spronck, Pieter. Monte-Carlo tree search: A new framework for game AI. In AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 2008.
 Couëtoux, Adrien, Hoock, Jean-Baptiste, Sokolovska, Nataliya, Teytaud, Olivier, and Bonnard, Nicolas. Continuous upper confidence trees. In International Conference on Learning and Intelligent Optimization, pp. 433–445. Springer, 2011a.
 Couëtoux, Adrien, Milone, Mario, Brendel, Mátyás, Doghmen, Hassan, Sebag, Michele, and Teytaud, Olivier. Continuous rapid action value estimates. In Asian Conference on Machine Learning, pp. 19–31, 2011b.
 Coulom, Rémi. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pp. 72–83, 2006.
 De Farias, D Pucci and Van Roy, Benjamin. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589–608, 2000.
 Dufour, François and Prieto-Rumeau, Tomás. Approximation of Markov decision processes with general state space. Journal of Mathematical Analysis and Applications, 388(2):1254–1267, 2012.
 Enzenberger, Markus. Evaluation in Go by a neural network using soft segmentation. In Advances in Computer Games, pp. 97–108. Springer, 2004.
 Gelly, Sylvain and Silver, David. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning, pp. 273–280, 2007.
 Gelly, Sylvain and Silver, David. Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence, 175(11):1856–1875, 2011.
 Gelly, Sylvain, Kocsis, Levente, Schoenauer, Marc, Sebag, Michele, Silver, David, Szepesvári, Csaba, and Teytaud, Olivier. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55(3):106–113, 2012.
 Guo, Xiaoxiao, Singh, Satinder, Lee, Honglak, Lewis, Richard L, and Wang, Xiaoshi. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, pp. 3338–3346, 2014.
 Haskell, William B, Jain, Rahul, and Kalathil, Dileep. Empirical dynamic programming. Mathematics of Operations Research, 41(2):402–429, 2016.
 Haskell, William B, Jain, Rahul, Sharma, Hiteshi, and Yu, Pengqian. An empirical dynamic programming algorithm for continuous MDPs. arXiv preprint arXiv:1709.07506, 2017.
 Haussler, David. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
 Hingston, Philip and Masek, Martin. Experiments with Monte Carlo Othello. In IEEE Congress on Evolutionary Computation, pp. 4059–4064. IEEE, 2007.
 Hu, Junling and Wellman, Michael P. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
 Jiang, Daniel R, Al-Kanj, Lina, and Powell, Warren B. Monte Carlo tree search with sampled information relaxation dual bounds. arXiv preprint arXiv:1704.05963, 2017.
 Kaufmann, Emilie and Koolen, Wouter. Monte-Carlo tree search by best arm identification. In Advances in Neural Information Processing Systems, pp. 4904–4913, 2017.
 Klambauer, Günter, Unterthiner, Thomas, Mayr, Andreas, and Hochreiter, Sepp. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pp. 972–981, 2017.
 Kocsis, Levente and Szepesvári, Csaba. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293, 2006.
 Langford, John and Zadrozny, Bianca. Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd International Conference on Machine Learning, pp. 473–480, 2005.
 Lazaric, Alessandro, Ghavamzadeh, Mohammad, and Munos, Rémi. Analysis of classification-based policy iteration algorithms. Journal of Machine Learning Research, 17(19):1–30, 2016.
 Li, Lihong, Bulitko, Vadim, and Greiner, Russell. Focus of attention in reinforcement learning. Journal of Universal Computer Science, 13(9):1246–1269, 2007.
 Maîtrepierre, Raphaël, Mary, Jérémie, and Munos, Rémi. Adaptative play in Texas hold’em poker. In European Conference on Artificial Intelligence, 2008.
 Méhat, Jean and Cazenave, Tristan. Combining UCT and nested Monte Carlo search for single-player general game playing. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):271–277, 2010.
 Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of Machine Learning. MIT Press, 2012.
 Munos, Rémi. Performance bounds in $L_p$-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.
 Munos, Rémi and Szepesvári, Csaba. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
 Ng, Andrew Y, Harada, Daishi, and Russell, Stuart. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pp. 278–287, 1999.
 Pollard, David. Empirical processes: Theory and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pp. i–86. JSTOR, 1990.
 Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
 Saldi, Naci, Yüksel, Serdar, and Linder, Tamás. On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Mathematics of Operations Research, 42(4):945–978, 2017.
 Scherrer, Bruno, Ghavamzadeh, Mohammad, Gabillon, Victor, Lesner, Boris, and Geist, Matthieu. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research, 16(Aug):1629–1676, 2015.
 Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
 Teraoka, Kazuki, Hatano, Kohei, and Takimoto, Eiji. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions on Information and Systems, 97(3):392–398, 2014.
 Van Roy, Benjamin. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.