Devavrat Shah devavrat@mit.edu
LIDS, Massachusetts Institute of Technology, Cambridge, MA 02139
Qiaomin Xie qiaomin.xie@cornell.edu
ORIE, Cornell University, Ithaca, NY 14853
Zhi Xu zhixu@mit.edu
LIDS, Massachusetts Institute of Technology, Cambridge, MA 02139
###### Abstract

Inspired by the success of AlphaGo Zero (AGZ), which uses Monte Carlo Tree Search (MCTS) with supervised learning via a neural network to learn the optimal policy and value function, we establish formally that such an approach indeed finds the optimal policy asymptotically, and we provide non-asymptotic guarantees in the process. We focus on infinite-horizon discounted Markov Decision Processes (MDPs) to establish the results.

To start with, this requires establishing the property of MCTS claimed in the literature (cf. Kocsis and Szepesvári (2006); Kocsis et al. (2006)) that, for any given query state, MCTS provides an approximate value function for the state given enough simulation steps of the MDP. We provide a non-asymptotic analysis establishing this property by analyzing a non-stationary multi-arm bandit setup. (The proof of this property in the literature is incomplete; see the discussion in the Related work section for details.) Our proof suggests that MCTS needs to be utilized with a polynomial rather than logarithmic “upper confidence bound” to establish its desired performance; interestingly enough, AGZ chooses such a polynomial bound.

Using this as a building block, combined with nearest neighbor supervised learning, we argue that MCTS acts as a “policy improvement” operator; it has a natural “bootstrapping” property to iteratively improve the value function approximation for all states, due to combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an $\varepsilon$-approximation of the value function in $\ell_\infty$ norm, MCTS combined with nearest neighbors requires samples scaling as $\widetilde{O}\big(\varepsilon^{-(d+4)}\big)$, where $d$ is the dimension of the state space. This is nearly optimal due to a minimax lower bound of $\widetilde{\Omega}\big(\varepsilon^{-(d+2)}\big)$.

# On Reinforcement Learning Using Monte Carlo Tree Search with Supervised Learning: Non-Asymptotic Analysis


Keywords: Reinforcement Learning, Monte Carlo Tree Search, Non-asymptotic Analysis.

## Introduction

Monte Carlo Tree Search (MCTS) is a search framework for finding optimal decisions, based on the search tree built by random sampling of the decision space (Browne et al., 2012). MCTS has been widely used in sequential decision-making applications that have a tree representation, exemplified by games and planning problems.

Since MCTS was first introduced, many variations and enhancements have been proposed to extend and refine the basic algorithm. Recently, MCTS has been combined with deep neural networks for reinforcement learning, which achieves remarkable success for games of Go (Silver et al., 2016, 2017b), chess and shogi (Silver et al., 2017a). In particular, AlphaGo Zero (AGZ) (Silver et al., 2017b) mastered the game of Go entirely through self-play and achieved superhuman performance. Specifically, AGZ employs supervised learning to learn a policy/value function (represented by a neural network) from samples generated via MCTS; MCTS then uses the neural network to query the value of leaf nodes for simulation guidance in the next iteration.

MCTS is referred to as a “strong policy improvement operator” (Silver et al., 2017b). The intuition is that, starting with the current policy, MCTS produces a better target policy for the current state. Thus training the network with cross-entropy loss with respect to the target policy on the current state is a form of policy iteration (Sutton and Barto, 1998), which will hopefully improve the neural network policy. Although this explanation seems natural and is intuitively appealing, it is not at all clear why the novel combination of MCTS and supervised learning guarantees this.

Despite the wide application and remarkable empirical success of MCTS, there has been only limited work exploring theoretical guarantees of MCTS and its variants. Notable exceptions are the works of Kocsis and Szepesvári (2006) and Kocsis et al. (2006), which propose the use of the Upper Confidence Bound as the tree policy (UCT). They give a result showing that MCTS with UCT asymptotically converges to the optimal actions. However, the proof appears to be incomplete.

While asymptotic convergence of MCTS is conceptually natural, rigorous analysis is quite subtle. A key challenge is that the tree policy (e.g., UCT) used to select actions typically needs to balance exploration and exploitation, resulting in non-stationary (non-uniform) random sampling at each node across multiple simulations. The strong dependencies in sampling introduce substantial complications. In summary, we are interested in the following questions:

• To begin with, is it feasible to provide rigorous justification of the properties claimed about MCTS in the literature, cf. Kocsis and Szepesvári (2006); Kocsis et al. (2006)?

• Is the MCTS a “strong policy improvement” operator as believed in the literature, cf. Silver et al. (2017b)?

• Does MCTS with (appropriate) supervised learning find the optimal policy, asymptotically? If so, what about finite-sample (non-asymptotic) analysis?

Our contributions. As the main contribution of this work, we answer all of the above questions. Specifically, we provide rigorous finite-sample analysis of MCTS; we establish MCTS’s “strong policy improvement” property or the “bootstrap” property when coupled with non-parametric supervised learning and finally establish that MCTS with supervised learning leads to finding the optimal policy with near optimal sample complexity.

Finite Sample Analysis of MCTS. We provide an explicit non-asymptotic analysis of MCTS with a variant of UCT for infinite-horizon discounted-reward MDPs. In particular, consider MCTS with depth $H$, where an estimated value function with accuracy $\varepsilon_0$ is used to query the value of leaf nodes. We show that the expected estimated value for the root node converges to its optimal value within a radius of $\gamma^H \varepsilon_0$ at a polynomial rate $O(n^{\eta-1})$, where $\gamma$ is the discount factor, $n$ is the number of simulations and $\eta \in [1/2, 1)$ is a constant. That is, after $n$ simulations, $\big|\mathbb{E}[\hat{V}_n(s)] - V^*(s)\big| \le \gamma^H \varepsilon_0 + O(n^{\eta-1})$. Therefore, MCTS is guaranteed to produce estimates with a smaller error given sufficient simulations. This justifies the improvement property of MCTS. In addition, it establishes the property of MCTS claimed in the literature (Kocsis and Szepesvári, 2006; Kocsis et al., 2006).

Non-stationary Multi-Arm Bandit: Convergence and concentration. In the process of establishing the property of MCTS claimed above, the key intermediate step is analyzing a non-stationary MAB. Specifically, we establish the convergence and concentration of the non-stationary MAB. In particular, we argue that if the cumulative rewards of each arm individually satisfy “polynomial” concentration, then the UCT policy with a “polynomial” upper confidence bound leads to regret that concentrates around the appropriate (non-stationary) mean value “polynomially”. This is consistent with what is observed in the literature in terms of concentration of MAB in the standard, stationary regime, cf. Audibert et al. (2009); Salomon and Audibert (2011).

MCTS plus supervised learning as “policy improvement operator”. As the key contribution of this work, we establish the “bootstrap” property of MCTS when coupled with nearest neighbor non-parametric supervised learning, i.e. MCTS is indeed a “strong policy improvement operator” when coupled with nearest neighbor supervised learning. Specifically, we establish that with $\widetilde{O}\big(\varepsilon^{-(4+d)}\big)$ samples, MCTS with nearest neighbor can find an $\varepsilon$-approximation of the optimal value function with respect to the $\ell_\infty$-norm. This is nearly optimal due to a minimax lower bound of $\widetilde{\Omega}\big(\varepsilon^{-(d+2)}\big)$, cf. Shah and Xie (2018).

### Related work

We discuss some of the existing work on the MCTS and its various modifications used for reinforcement learning. Reinforcement learning can learn to approximate the optimal value function directly from experience data. A variety of algorithms have been developed in literature, including model-based approaches, model-free approaches like tabular Q-learning  (Watkins and Dayan, 1992), and parametric approximation such as linear architectures (Sutton, 1988). More recently, the optimal value function/policy is approximated by deep neural networks, which are trained by using temporal-difference learning or Q-learning (Van Hasselt et al., 2016; Mnih et al., 2016, 2013).

MCTS is an alternative approach, which aims to estimate the (optimal) value of states by building a search tree from Monte Carlo simulations (Browne et al., 2012). In particular, the search tree is built and updated according to a tree policy, which selects actions according to some scheme that balances exploration and exploitation. The most popular tree policy is UCT, which selects the child (action) maximizing the sum of the current action-value and a bonus term that encourages exploration. The standard bonus term is a logarithmic upper confidence bound derived from Hoeffding's inequality.

Kocsis and Szepesvári (2006) and Kocsis et al. (2006) show the asymptotic convergence of MCTS with standard UCT. However, the proof is incomplete (Szepesvári, 2019). A key step towards proving the claimed result is to show the convergence and concentration properties of the regret under UCB for the case of non-stationary reward distributions. In particular, to establish an exponential concentration of regret (Theorem 5, (Kocsis et al., 2006)), Lemma 14 is applied. However, the requirement on the conditional independence of sequence does not hold, which makes the conclusion of exponential concentration questionable. Therefore, the proof of Theorem 7, applying Theorem 5 with an inductive argument, does not seem to be complete as stated.

Beyond the incompleteness of the argument for Theorem 5 of Kocsis et al. (2006), the claimed result itself might be infeasible. For example, the work of Audibert et al. (2009) shows that for bandit problems, the regret under UCB concentrates around its expectation polynomially and not exponentially as desired in (Kocsis et al., 2006). Further, Salomon and Audibert (2011) prove that for any strategy that does not use knowledge of the time horizon, it is infeasible to improve beyond this polynomial concentration and establish the desired exponential concentration. Indeed, our result is consistent with this fundamental bound for stationary MAB: we establish polynomial concentration of regret for non-stationary MAB, which plays a crucial role in our ability to establish the desired property of MCTS.

Although not directly related, we note that a lot of research has been done on developing novel variants of MCTS for a diverse range of applications. In this regard, the work of Coquelin and Munos (2007) introduces flat UCB in order to improve the worst case regret bounds of UCT. Schadd et al. (2008) modify MCTS for single-player games by adding to the standard UCB formula a term that captures the possible deviation of the node. In the work by Sturtevant (2008), a variant of MCTS is introduced for multi-player games by adopting the $\max^n$ idea. In addition to turn-based games like Go and chess, MCTS has also been applied to real-time games (e.g., Ms. PacMan, Tron and Starcraft) and nondeterministic games with imperfect information. The applications of MCTS go beyond games, including areas such as optimization, scheduling and other decision-making problems. Interested readers are referred to the survey on MCTS by Browne et al. (2012) for other variations and applications.

Recently, a popular method of applying MCTS for reinforcement learning is to combine it with deep neural networks, which approximate the value function and/or policy (Silver et al., 2016, 2017b, 2017a). For instance, in AlphaGo Zero, MCTS uses the neural network to query the value of leaf nodes for simulation guidance; the neural network is updated with sample data generated by MCTS-based policy and then reincorporated into tree search in the next iteration. Following that, Azizzadenesheli et al. (2018) develop generative adversarial tree search that generates roll-outs with a learned GAN-based dynamic model and reward predictor, while using MCTS for planning over the simulated samples and a deep Q-network to query the Q-value of leaf nodes.

In terms of theoretical results, the closest work to this paper is by Jiang et al. (2018), who also consider a batch, MCTS-based reinforcement learning algorithm, a variant of the AlphaGo Zero algorithm. The key algorithmic difference from ours lies in the leaf-node evaluator of the search tree: they use a combination of an estimated value function and an estimated policy. The latest MCTS root node observations are then used to update the value and policy functions (leaf-node evaluator) for the next iteration. They also give a finite sample complexity analysis. However, we want to emphasize the main differences: their result is based on a key assumption on the sample complexity of MCTS and on assumptions about the approximation power of the value/policy architectures, whereas we provide an explicit finite-sample bound for MCTS and characterize the non-asymptotic error propagation under MCTS with non-parametric regression for leaf-node evaluation; consequently, they do not establish the “strong policy improvement” property of MCTS.

Two other closely related papers by Teraoka et al. (2014) and Kaufmann and Koolen (2017) study a simplified MCTS for two-player zero-sum games, where the goal is to identify the best action of the root in a given game tree. For each leaf node, a stochastic oracle is provided to generate i.i.d. samples of the true reward. Teraoka et al. (2014) give a high probability bound on the number of oracle calls needed for obtaining an $\varepsilon$-accurate score at the root. More recently, Kaufmann and Koolen (2017) develop new sample complexity bounds with a refined dependence on the problem instance. Both the setting and the algorithms are simpler compared to classical MCTS (e.g., UCT): the game tree is given in advance, rather than being built gradually through samples; the algorithm proposed by Teraoka et al. (2014) operates on the tree in a bottom-up fashion with uniform sampling at the leaf nodes. As a result, the analysis is significantly simpler and it is unclear whether the techniques can be extended to analyze other variants of MCTS.

It is also important to mention the work of Chang et al. (2005) that focuses on the analysis of adaptive multi-stage sampling applied to MDP, and the work of Kearns et al. (2002) who study a sparse sampling algorithm for large MDPs. However, we remark that these belong to a different class of algorithms.

### Organization

The remainder of the paper is organized as follows. In the Preliminary section, we describe the setting of learning MDPs. In the Algorithm section, we present a reinforcement learning framework that uses MCTS with supervised learning. The main results are presented in the Main Results section. The appendices contain the complete proof of the result on MCTS’s property, a summary of the variant of the non-stationary multi-armed bandit that this proof analyzes, the property of non-parametric supervised learning, and the proof of the non-asymptotic bound for the iterative algorithm using MCTS with supervised learning.

## Preliminary

In this section, we introduce the formal framework studied in this paper. We consider a general reinforcement learning setting where an agent interacts with an environment. The interaction is modeled as a discrete-time discounted Markov decision process (MDP). An MDP is described by a five-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $\mathcal{P} \equiv \mathcal{P}(s'|s,a)$ is the Markovian transition kernel, $\mathcal{R} \equiv \mathcal{R}(s,a)$ is a random reward function, and $\gamma \in (0,1)$ is a discount factor. In each time step, the system is in some state $s \in \mathcal{S}$. When an action $a \in \mathcal{A}$ is taken, the state transits to a next state $s' \in \mathcal{S}$ according to the transition kernel $\mathcal{P}$ and an immediate reward is generated according to $\mathcal{R}(s,a)$.

A stationary policy $\pi(a|s)$ gives the probability of performing action $a \in \mathcal{A}$ given the current state $s \in \mathcal{S}$. The value function for each state $s \in \mathcal{S}$ under policy $\pi$, denoted by $V^{\pi}(s)$, is defined as the expected discounted sum of rewards received following the policy $\pi$ from initial state $s$, i.e.,

$$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s\Big].$$

The goal is to find an optimal policy $\pi^*$ that maximizes the value from each initial state. The optimal value function $V^*$ is defined as $V^*(s) = V^{\pi^*}(s) = \sup_{\pi} V^{\pi}(s)$, $\forall s \in \mathcal{S}$. In this paper, we restrict our attention to MDPs satisfying the following assumptions:

###### Assumption 1 (MDP Regularity)

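(A1.) The action space $\mathcal{A}$ is a finite set; (A2.) The immediate rewards are non-negative random variables, uniformly bounded such that $0 \le \mathcal{R}(s,a) \le R_{\max}$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$; (A3.) The state transitions are deterministic, i.e. $\mathcal{P}(s'|s,a) \in \{0,1\}$ for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$.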

In particular, the deterministic transition assumption holds for many applications, such as Go/chess games and robotic navigation. We believe that this assumption is not essential: we can modify the expansion part of MCTS and apply our analysis. However, adding this assumption allows us to simplify many mathematical expressions. Define $\beta \triangleq 1/(1-\gamma)$ and $V_{\max} \triangleq \beta R_{\max}$. Since all the rewards are bounded by $R_{\max}$, it is easy to see that the value function of every policy is bounded by $V_{\max}$ (Even-Dar and Mansour, 2004; Strehl et al., 2006).
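The bound on the value function is a geometric-series argument, and a short numerical check may make it concrete. The sketch below is illustrative only: the function names and the particular values of the discount factor and reward bound are assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def value_upper_bound(r_max, gamma):
    """Geometric-series bound R_max / (1 - gamma) on any discounted return."""
    return r_max / (1.0 - gamma)


# Hypothetical numbers: the worst case is receiving the maximal reward forever.
gamma, r_max = 0.9, 1.0
truncated = discounted_return([r_max] * 200, gamma)  # 200-step truncation
bound = value_upper_bound(r_max, gamma)
```

Even when every reward equals the maximum, the truncated return stays below the bound and approaches it as the horizon grows.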

## Algorithm

Inspired by the success of AlphaGo Zero, we introduce and study a class of reinforcement learning algorithms in which two oracles work together to iteratively improve our objective, with one oracle providing better estimates for the other and vice versa at each iteration. To this end, we first lay down this meta algorithm in the Meta Algorithm subsection. A practical instance, which employs MCTS and non-parametric supervised learning as the two oracles, is then described in the subsection that follows. This is the core algorithm that we analyze in this paper.

### Meta Algorithm

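We introduce a generic meta algorithm for finding a quantity of interest in reinforcement learning, which we refer to as Bootstrapped Supervised Reinforcement Learning. Let $f^*$ be the function we wish to learn, defined on a domain $\mathcal{X}$, such as the optimal value or the optimal policy. Suppose we have access to a “bootstrap oracle” $\mathcal{O}_{\mathrm{B}}$ and a “learning oracle” $\mathcal{O}_{\mathrm{L}}$ with the following properties:

• bootstrap oracle $\mathcal{O}_{\mathrm{B}}$: given any $x \in \mathcal{X}$ and $f$ (an estimate) of $f^*$, the output $\mathcal{O}_{\mathrm{B}}(x, f)$ is such that $d\big(\mathcal{O}_{\mathrm{B}}(x, f), f^*(x)\big) < d\big(f(x), f^*(x)\big)$ for some distance measure $d$.

• learning oracle $\mathcal{O}_{\mathrm{L}}$: given observations $\{(x_i, f(x_i))\}_{i=1}^{n}$ with $n$ large enough, $\mathcal{O}_{\mathrm{L}}$ learns a function $\hat{f}$ so that $\hat{f} \approx f$.

Then, starting with an initial estimate, say $f^{(0)}$, we can iteratively refine the estimate of $f^*$ as follows: for each iteration $l \ge 1$,

1. Bootstrap Step: for appropriately sampled elements $x_1, \dots, x_n$ from $\mathcal{X}$, produce improved estimates $y_i = \mathcal{O}_{\mathrm{B}}(x_i, f^{(l-1)})$.

2. Supervised Step: from $\{(x_i, y_i)\}_{i=1}^{n}$, by using $\mathcal{O}_{\mathrm{L}}$, produce the function $f^{(l)}$.

By the property of $\mathcal{O}_{\mathrm{B}}$, its estimate at $x_i$ is closer to $f^*(x_i)$ than $f^{(l-1)}(x_i)$ is. The hope is that, using these “better” estimates at $n$ points, $\mathcal{O}_{\mathrm{L}}$ can learn and generalize this improvement to the entire domain $\mathcal{X}$, i.e. the resulting $f^{(l)}$ is such that $d\big(f^{(l)}, f^*\big) < d\big(f^{(l-1)}, f^*\big)$. If such is the case, then iteratively we get better estimates of $f^*$. Indeed, in this paper, we argue that such is the case when the bootstrap oracle is MCTS and the learning oracle is a nearest neighbor supervised learning algorithm (with appropriately sampled states) for learning the optimal value function. That is, MCTS with appropriate supervised learning realizes this simple, but remarkable meta algorithm for reinforcement learning.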
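The two-step loop can be sketched on a toy one-dimensional problem. Everything below is a hypothetical stand-in rather than the paper's algorithm: the target function, the bootstrap oracle (which cheats by using the target to shrink the error by a fixed factor, purely to simulate the improvement property), and the 1-nearest-neighbor learning oracle.

```python
import bisect


def f_star(x):
    """Hypothetical target function, used only to simulate the bootstrap oracle."""
    return x * x


def bootstrap_oracle(x, f, shrink=0.5):
    """Toy bootstrap step: the returned value has `shrink` times the error of f(x)."""
    return f_star(x) + shrink * (f(x) - f_star(x))


def learning_oracle(xs, ys):
    """Toy supervised step: 1-nearest-neighbor regression on sorted 1-D inputs."""
    def predict(x):
        i = bisect.bisect_left(xs, x)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(xs)]
        best = min(candidates, key=lambda j: abs(xs[j] - x))
        return ys[best]
    return predict


def bootstrapped_supervised_rl(num_iters, num_points=101):
    """Alternate bootstrap and supervised steps; track the max error on the samples."""
    xs = [k / (num_points - 1) for k in range(num_points)]
    f = lambda x: 0.0  # initial estimate
    errors = []
    for _ in range(num_iters):
        ys = [bootstrap_oracle(x, f) for x in xs]   # Bootstrap Step
        f = learning_oracle(xs, ys)                 # Supervised Step
        errors.append(max(abs(f(x) - f_star(x)) for x in xs))
    return f, errors


_, errors = bootstrapped_supervised_rl(5)
```

In this toy setting the error on the sampled points halves at every iteration, which is the geometric improvement the meta algorithm hopes to obtain; the generalization error between sample points is what the supervised learning analysis has to control.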

### MCTS with Non-parametric Supervised Learning

We now introduce a specific, practical instance of the Bootstrapped Supervised Reinforcement Learning template, which aims to find the optimal value function through a combination of finite-depth MCTS and nearest neighbor supervised learning. Algorithm 1 shows the overall structure of this instance, which is the core algorithm we study in this paper. MCTS is a fast simulation technique for exploring a large search space and hence is a natural choice for our bootstrap oracle. The details of the MCTS oracle are presented shortly in the next subsection. As for the supervised oracle, in modern practice, a suitable deep neural network could instead be applied. In this paper, for tractability we focus on a nearest neighbor supervised oracle. Other non-parametric kernel methods can also be employed to obtain similar results. Finally, in Algorithm 1, we assume that we are allowed to freely sample any state as the root node for running MCTS.

### The Fixed-depth MCTS Oracle

MCTS has been quite popular recently in many reinforcement learning tasks, and it is highly desirable to understand precisely its role and its rigorous guarantees within an iterative algorithm. In the following, we detail the specific form of MCTS that is used in Algorithm 1. Overall, we fix the search tree to be of depth $H$. Similar to most literature on this topic, it uses a variant of the Upper Confidence Bound (UCB) algorithm to select an action at each stage. At a leaf node (i.e., a state at depth $H$), we use the current value oracle to evaluate its value. Since we consider deterministic transitions, the tree is fixed once the root node (state) is chosen, and we use the notation $s \circ a$ to denote the next state after taking action $a$ at state $s$. Each edge hence represents a state-action pair, while each node represents a state. For clarity, we use the superscript $(h)$ to distinguish quantities related to different depths. The pseudocode for the MCTS procedure is given in Algorithm 2, and the accompanying figure shows the structure of the search tree and related notation.

In Algorithm 2, certain sequences of algorithmic parameters are required, namely, $\alpha^{(h)}$, $\beta^{(h)}$, $\xi^{(h)}$ and $\eta^{(h)}$. The choices for these constants will become clear in our non-asymptotic analysis. At a high level, the constants for the last layer (i.e., depth $H$), $\alpha^{(H)}$, $\beta^{(H)}$, $\xi^{(H)}$ and $\eta^{(H)}$, depend on the properties of the leaf nodes, while the rest are recursively determined by the constants one layer below.
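To make the procedure concrete, here is a minimal sketch of a fixed-depth tree search on a deterministic toy MDP. It is not a transcription of Algorithm 2: the toy transition and reward functions, the constant leaf oracle, and the bonus form `c * t**a / n**b` with a single set of constants shared across depths are all assumptions chosen for illustration.

```python
import math

GAMMA = 0.5
ACTIONS = (0, 1)


def step(state, action):
    """Deterministic toy transition and reward (hypothetical MDP)."""
    next_state = 2 * state + 1 + action
    reward = 1.0 if action == 1 else 0.0
    return next_state, reward


def leaf_value(state):
    """Current value oracle queried at depth H; here a crude constant guess."""
    return 0.0


def mcts(root, depth, num_sims, c=1.0, a=0.25, b=0.5):
    """Run num_sims simulations of depth `depth`; return the average root return."""
    visits = {}   # (state, action) -> visit count
    totals = {}   # (state, action) -> sum of sampled discounted returns
    root_sum = 0.0
    for _ in range(num_sims):
        path, state = [], root
        for _h in range(depth):
            node_visits = sum(visits.get((state, act), 0) for act in ACTIONS)

            def score(act):
                n = visits.get((state, act), 0)
                if n == 0:
                    return math.inf  # try every action once
                mean = totals[(state, act)] / n
                # Polynomial (not logarithmic) exploration bonus.
                return mean + c * (node_visits ** a) / (n ** b)

            action = max(ACTIONS, key=score)
            next_state, reward = step(state, action)
            path.append((state, action, reward))
            state = next_state
        ret = leaf_value(state)
        # Back up the discounted return along the visited path.
        for s, act, r in reversed(path):
            ret = r + GAMMA * ret
            visits[(s, act)] = visits.get((s, act), 0) + 1
            totals[(s, act)] = totals.get((s, act), 0.0) + ret
        root_sum += ret
    return root_sum / num_sims


estimate = mcts(root=0, depth=3, num_sims=2000)
```

In this toy MDP the optimal value is $1/(1-\gamma) = 2$ at every state and the leaf oracle is off by $\varepsilon_0 = 2$, so no depth-3 return can exceed $1.75 = 2 - \gamma^3 \varepsilon_0$; the averaged estimate sits slightly below that ceiling because some simulations are spent exploring.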

###### Remark 1

Note that in selecting the action at each depth (i.e., Line 6 of Algorithm 2), the upper confidence term is polynomial in the number of visits to the node, while a typical UCB algorithm would be logarithmic in it. The logarithmic factor in the original UCB algorithm was motivated by exponential tail probability bounds. In our case, it turns out that exponential tail bounds for each layer seem to be infeasible without further structural assumptions. As mentioned in the Related work section, prior work (Audibert et al., 2009; Salomon and Audibert, 2011) has justified the polynomial concentration for classical UCB. This implies that the concentration at intermediate depth (i.e., depth less than $H$) is at most polynomial. Indeed, we will prove these polynomial concentration bounds. Further technical remarks may be found in the proof in the appendix. Interestingly, the successful AlphaGo Zero uses a polynomial upper confidence term.
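The difference between the two kinds of exploration bonus can be seen numerically. In the sketch below, `log_bonus` is the classical UCB1 term, `puct_bonus` follows the form of the exploration term reported for AlphaGo Zero (Silver et al., 2017b), and `poly_bonus` is a generic polynomial term whose constants `c`, `a`, `b` are illustrative assumptions rather than the constants of Algorithm 2.

```python
import math


def log_bonus(t, n):
    """Classical UCB1 bonus sqrt(2 ln t / n), matched to exponential tail bounds."""
    return math.sqrt(2.0 * math.log(t) / n)


def poly_bonus(t, n, c=1.0, a=0.25, b=0.5):
    """Generic polynomial bonus c * t**a / n**b (c, a, b are assumed values)."""
    return c * (t ** a) / (n ** b)


def puct_bonus(t, n, prior=1.0, c_puct=1.0):
    """AlphaGo Zero style bonus c_puct * prior * sqrt(t) / (1 + n)."""
    return c_puct * prior * math.sqrt(t) / (1.0 + n)


# t: visits to the parent node, n: visits to the child (action) being scored.
rows = [(t, log_bonus(t, 100), poly_bonus(t, 100), puct_bonus(t, 100))
        for t in (10**3, 10**4, 10**5, 10**6)]
ratios = [poly / log for _, log, poly, _ in rows]
```

For a fixed child count, the polynomial terms keep growing much faster in the parent count than the logarithmic term, so rarely tried actions are revisited more aggressively.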

## Main Results

In this section, we present our main results. One of our main contributions is a rigorous analysis of the non-asymptotic error bounds for MCTS, which is stated in the subsection on the bootstrap property of MCTS. Combining the bootstrap property of MCTS with the finite-sample bound from supervised learning (the subsection on nearest neighbor supervised learning), we finally present our main result, a non-asymptotic analysis of the proposed algorithm, in the last subsection.

### The Bootstrap Property of MCTS

To establish a non-asymptotic bound for our overall algorithm, we need to understand precisely the property of MCTS of obtaining a better estimate for the root node. The key theorem below states the resulting non-asymptotic error bound for the fixed-depth MCTS oracle (Algorithm 2) applied to an MDP. In fact, we study a general fixed-depth MCTS oracle (see the proof in the appendix), which is of independent interest. Compared with Algorithm 2, the generalized model differs in the evaluation of leaf nodes (i.e., Line 10 of Algorithm 2): we assume that the rewards at leaf nodes follow non-stationary processes, instead of querying a deterministic value oracle. The following theorem for Algorithm 2 then follows easily as a special case of the general model.

###### Theorem 1

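For an MDP satisfying Assumption 1, suppose that we have an estimate $V$ for $V^*$ such that

$$\big|V(s) - V^*(s)\big| \le \varepsilon_0, \quad \forall s \in \mathcal{S},$$

where $\varepsilon_0$ is a constant. Suppose that the MCTS algorithm, described in Algorithm 2, is applied with depth $H$ and leaf nodes being evaluated by $V$. Then, there exist appropriate constants, $\{\alpha^{(h)}\}$, $\{\beta^{(h)}\}$, $\{\xi^{(h)}\}$, and $\{\eta^{(h)}\}$, such that for each query state $s \in \mathcal{S}$ the following claim holds for the output $\hat{V}_n(s)$ of MCTS with $n$ simulations

$$\big|\mathbb{E}[\hat{V}_n(s)] - V^*(s)\big| \le \gamma^{H} \varepsilon_0 + O(n^{-c}),$$

where $c > 0$ is a constant that depends on the parameters of the MDP and the MCTS algorithm.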

The theorem above shows that under a fixed-depth MCTS, the expected estimate converges to within a radius of $\gamma^H \varepsilon_0$ of the optimal value at a polynomial rate. In particular, the radius converges to $0$ at an exponential rate with respect to the depth $H$. Therefore, with a sufficiently large depth and enough simulation steps, MCTS is guaranteed to improve the accuracy of the approximate value function over the leaf estimator. That is, by choosing an appropriate depth and number of simulation steps, the MCTS oracle works as a desired bootstrap oracle to obtain improved estimates based on current ones. In addition, Theorem 1 immediately justifies the property of MCTS claimed in the literature (Kocsis and Szepesvári, 2006; Kocsis et al., 2006) that the evaluations of MCTS converge to the optimal value function asymptotically. The complete proof is presented in the appendix. Here we state a result from the proof that is worth mentioning. In fact, we could fix the sequences of the constants to be of particular forms so that the convergence rate admits a simple expression. In particular, let us fix $\eta \in [1/2, 1)$, together with last-layer constants $\alpha^{(H)}$ and $\xi^{(H)}$ satisfying the conditions required by the proof. Consider the following sequences of constants:

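$$
\begin{aligned}
\eta^{(h)} &= \eta^{(H)} \equiv \eta, && \forall h \in [H], \\
\alpha^{(h)} &= \eta(1-\eta)\big(\alpha^{(h+1)} - 1\big), && \forall h \in [H-1], \\
\xi^{(h)} &= \alpha^{(h+1)} - 1, && \forall h \in [H-1].
\end{aligned}
$$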

Note that the sequence $\{\beta^{(h)}\}$ is omitted here as it does not appear in the final convergence rate directly. In fact, it depends on the other constants, and the detailed equations can be found in the appendix. With this setup, we have the following corollary:

###### Corollary 1

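Consider the constants $\{\eta^{(h)}\}$, $\{\alpha^{(h)}\}$ and $\{\xi^{(h)}\}$ given by the three equations above. Suppose that the last-layer constants are chosen large enough. Then, there exist corresponding constants such that for each query state $s \in \mathcal{S}$ the following claim holds for the output $\hat{V}_n(s)$ of MCTS with $n$ simulations

$$\big|\mathbb{E}[\hat{V}_n(s)] - V^*(s)\big| \le \gamma^{H} \varepsilon_0 + O(n^{\eta - 1}).$$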

Since $\eta \in [1/2, 1)$, Corollary 1 implies a best case convergence rate of $O(n^{-1/2})$.
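The recursions and the two error terms of the bound lend themselves to a quick calculation. The sketch below is hedged in several ways: it reads $[H]$ as $\{1, \dots, H\}$, it treats the hidden constant in the $O(\cdot)$ term as a hypothetical `const`, and the starting value for the last-layer constant is an arbitrary illustrative choice.

```python
import math


def constant_sequences(depth, eta, alpha_last):
    """Apply the recursions for alpha^(h) and xi^(h) from depth H down to depth 1."""
    alpha = {depth: alpha_last}
    xi = {}
    for h in range(depth - 1, 0, -1):
        xi[h] = alpha[h + 1] - 1.0
        alpha[h] = eta * (1.0 - eta) * (alpha[h + 1] - 1.0)
    return alpha, xi


def depth_for_contraction(gamma, factor):
    """Smallest depth H with gamma**H <= factor (for 0 < gamma < 1, 0 < factor <= 1)."""
    return max(0, math.ceil(math.log(factor) / math.log(gamma)))


def sims_for_tolerance(tol, eta, const=1.0):
    """Smallest n with const * n**(eta - 1) <= tol; `const` is an assumed constant."""
    return math.ceil((const / tol) ** (1.0 / (1.0 - eta)))


alpha, xi = constant_sequences(depth=3, eta=0.5, alpha_last=100.0)
H = depth_for_contraction(gamma=0.9, factor=0.5)
n = sims_for_tolerance(tol=0.01, eta=0.5)
```

Because $\eta(1-\eta) \le 1/4$, the sequence $\alpha^{(h)}$ shrinks by at least a factor of four per layer toward the root, which gives some intuition for why the last-layer constant has to be chosen large. With $\eta = 1/2$, halving the tolerance on the simulation term quadruples the number of simulations.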

### Supervised Learning with Nearest Neighbor

To give an error bound on the non-parametric supervised learning for estimating the optimal value function, we make the following structural assumption about the MDP. Specifically, we assume that the optimal value function (i.e., true regression function) is smooth in some sense. The Lipschitz continuous assumption stated below is standard in literature on MDPs with continuous state spaces; see, e.g., the work of  Dufour and Prieto-Rumeau (2012, 2013), and Bertsekas (1975).

###### Assumption 2 (Smoothness)

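(A1.) The state space $\mathcal{S}$ is a compact subset of $\mathbb{R}^d$. The chosen distance metric $\rho$ associated with the state space is such that $(\mathcal{S}, \rho)$ forms a compact metric space. (A2.) The optimal value function $V^*$ satisfies Lipschitz continuity with parameter $C$, i.e., $|V^*(s) - V^*(s')| \le C \rho(s, s')$ for all $s, s' \in \mathcal{S}$.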

In this paper, we focus on the $K$-nearest neighbor ($K$-NN) regression method, where $K$ is a pre-specified number. To determine what the label (i.e., V-value) should be for a state $s$, we find its $K$ nearest neighbors in the training data $\{(s_i, y_i)\}_{i=1}^{m}$ with respect to the chosen distance metric and average their labels. For each $s \in \mathcal{S}$, let $\mathcal{N}_K(s)$ denote the set of its $K$ nearest neighbors from the training states. Then the $K$-NN estimate for the value function at state $s$ is given by

$$\hat{V}(s) = \frac{1}{K} \sum_{i:\, s_i \in \mathcal{N}_K(s)} y_i.$$
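The averaging rule just described can be sketched in a few lines. The function name, the use of the Euclidean metric, and the toy data are assumptions for illustration; the paper allows any metric that makes the state space a compact metric space.

```python
import math


def knn_value_estimate(query, states, labels, k):
    """Average the labels of the k training states closest to `query`."""
    if not 1 <= k <= len(states):
        raise ValueError("k must be between 1 and the number of training states")
    # Rank training states by Euclidean distance to the query state.
    order = sorted(range(len(states)), key=lambda i: math.dist(query, states[i]))
    nearest = order[:k]
    return sum(labels[i] for i in nearest) / k


# Toy 1-D training set: states are tuples, labels are noisy value estimates.
train_states = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_labels = [0.0, 1.0, 2.0, 10.0]
estimate = knn_value_estimate((0.9,), train_states, train_labels, k=2)
```

Averaging over several neighbors reduces the noise in the individual labels, at the price of a bias that the Lipschitz assumption keeps proportional to the distance to the neighbors.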

To control the noise due to nearest neighbor approximation, we assume that each state has excellent nearest neighbors from the training data set and enough of them, as stated below.

###### Assumption 3

The training data set satisfy the following conditions: (A1.) is a -covering set of , i.e., Moreover, for each , there exist at least points from that are within the bandwidth . That is, (A2.) The labels of the training data are independent random variables that satisfy the following: there exist constant and such that , and

The following theorem gives the expected error bound on the performance of supervised learning with the nearest neighbor method. The proof is provided in the appendix.

###### Theorem 2

For the -NN regression setup, under Assumptions 2 and 3, for any finite subset of states , and for each , we have

In particular, if we have

###### Remark 2

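Note that the size $m$ of the training data set should scale as $K \cdot N(h)$, where $N(h)$ is the $h$-covering number of the metric space $(\mathcal{S}, \rho)$. Let us consider a simple scenario where the state space is a unit volume ball in $\mathbb{R}^d$. Then the corresponding $N(h)$ scales as $\Theta(1/h^d)$, which is $\Theta(C^d/\varepsilon^d)$. Therefore,

$$m = \Theta\Big(\frac{K}{h^{d}}\Big) = \Theta\Big(\frac{C^{d}}{\varepsilon^{d+2}} \log\frac{2 V_{\max} |\mathcal{L}|}{\varepsilon}\Big)$$

suffices to obtain expected error of the same order as the training data.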

### Non-asymptotic Bound for Bootstrapped Supervised RL

As the main result of this paper, we obtain a non-asymptotic analysis of the bootstrapped supervised RL algorithm (Algorithm 1). Specifically, we find that the algorithm converges to the optimal value function exponentially fast with respect to the number of iterations, where the sample complexity at each iteration depends polynomially on the relevant problem parameters. Here, an iteration refers to a complete bootstrap step and a supervised step as described in Algorithm 1. The precise result is stated below and is proved in the appendix using Theorems 1 and 2.

###### Theorem 3

Suppose that Assumptions 1 and 2 hold. For each $\lambda \in (0, 1)$, there exist appropriately chosen parameters for Algorithm 1 (specific values are given with the proof), such that for each $L \ge 1$ the output $V^{(L)}$ of Algorithm 1 after $L$ iterations satisfies

$$\mathbb{E}\big[|V^{(L)}(s) - V^*(s)|\big] \le V_{\max} \lambda^{L}, \quad \forall s \in \mathcal{S}.$$

In particular, the above result is achieved with sample complexity of

$$\Theta\Big(\lambda^{-(2 + 1/(1-\eta))L} \cdot N(\lambda^{L}) \cdot \big(\log \lambda^{-1}\big) \cdot L\Big),$$

where $\eta \in [1/2, 1)$ is a constant and $N(\lambda^{L})$ is the $\lambda^{L}$-covering number of the metric space $(\mathcal{S}, \rho)$.

In the case where the optimal value function is smooth, this theorem shows that after $L$ iterations, we can obtain an estimate for $V^*$ within $V_{\max}\lambda^{L}$ error. In particular, the sample complexity for achieving $\varepsilon$ error depends polynomially on the relevant problem parameters. An important parameter is the covering number $N(\cdot)$, which provides a measure of the “complexity” of the state space $\mathcal{S}$. For instance, for the scenario where the state space is a unit volume hypercube in $\mathbb{R}^d$, the corresponding covering number $N(\lambda^{L})$ should scale as $\Theta(\lambda^{-Ld})$ (cf. Proposition 4.2.12 in Vershynin (2017)). As a direct corollary, we can characterize the dependence of the sample complexity on the desired accuracy level $\varepsilon$ for this special case.

###### Corollary 2

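In the setting of Theorem 3 with the state space being a unit volume hypercube in $\mathbb{R}^d$, for each $\varepsilon > 0$, after $L = \Theta(\log(1/\varepsilon))$ iterations, we have $\mathbb{E}\big[|V^{(L)}(s) - V^*(s)|\big] \le \varepsilon$ for all $s \in \mathcal{S}$. The total sample complexity is given by

$$M(L) = \kappa\, \varepsilon^{-(2 + 1/(1-\eta) + d)} \cdot \log\frac{1}{\varepsilon},$$

where $\kappa$ is determined by the problem parameters and is independent of $\varepsilon$.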
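The exponent in Corollary 2 can be traced back to Theorem 3 by a short substitution. The derivation below is a sketch under two assumptions: the bound of Theorem 3 is read as $\Theta\big(\lambda^{-(2+1/(1-\eta))L} \cdot N(\lambda^L) \cdot \log\lambda^{-1} \cdot L\big)$ with error $V_{\max}\lambda^L$, and the integer rounding of $L$ is ignored.

```latex
\begin{align*}
V_{\max}\lambda^{L} = \varepsilon
  \;&\Longleftrightarrow\; \lambda^{-L} = \frac{V_{\max}}{\varepsilon}
  \;\Longleftrightarrow\; L \log\lambda^{-1} = \log\frac{V_{\max}}{\varepsilon}, \\
N(\lambda^{L}) = \Theta\big(\lambda^{-Ld}\big)
  &= \Theta\Big(\big(V_{\max}/\varepsilon\big)^{d}\Big), \\
\lambda^{-\left(2+\frac{1}{1-\eta}\right)L}\cdot N(\lambda^{L})\cdot \log\lambda^{-1}\cdot L
  &= \Theta\Big(\big(V_{\max}/\varepsilon\big)^{2+\frac{1}{1-\eta}+d}\,
     \log\frac{V_{\max}}{\varepsilon}\Big).
\end{align*}
```

For fixed $V_{\max}$ and small $\varepsilon$ this is of the form $\kappa\,\varepsilon^{-(2+1/(1-\eta)+d)}\log(1/\varepsilon)$, and taking $\eta = 1/2$ gives the exponent $4 + d$.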

Corollary 1 implies that we can choose appropriate algorithm parameters such that $\eta = 1/2$. In this setting, Corollary 2 states that the sample complexity of Algorithm 1 scales as $\widetilde{O}\big(\varepsilon^{-(4+d)}\big)$ (omitting the logarithmic factor). Leveraging the minimax lower bound for the problem of non-parametric regression (Tsybakov, 2009; Stone, 1982), the recent work by Shah and Xie (2018) establishes a lower bound on the sample complexity for reinforcement learning algorithms. In particular, for any policy to learn the optimal value function within $\varepsilon$ approximation error, the number of samples required must scale as $\widetilde{\Omega}\big(\varepsilon^{-(2+d)}\big)$. Hence in terms of the dependence on the dimension, Algorithm 1 is nearly optimal. Optimizing the dependence of the sample complexity on other parameters is left for future work.

## Conclusion

In this paper, we establish a rigorous, non-asymptotic analysis of a fixed-depth MCTS oracle and provide a first step towards theoretically understanding reinforcement learning algorithms that blend MCTS with a powerful supervised learning approximator. This kind of algorithm has enjoyed superior performance in practice, notably in AlphaGo Zero, but a precise theoretical understanding of the kind developed in this paper was previously lacking. Several interesting questions remain for this line of research. In particular, many MDPs in practice, such as games, often contain rich structure. Under what structural assumptions can we take the most advantage of this meta approach, that is, faster exploration via MCTS and better generalization via supervised learning? We believe a more comprehensive theoretical understanding will shed light on the further development of this practically appealing meta approach.

• Audibert et al. (2009) Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
• Azizzadenesheli et al. (2018) Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C. Lipton, and Animashree Anandkumar. Sample-efficient deep RL with generative adversarial tree search. CoRR, abs/1806.05780, 2018.
• Bertsekas (1975) D. Bertsekas. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.
• Browne et al. (2012) Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
• Chang et al. (2005) Hyeong Soo Chang, Michael C Fu, Jiaqiao Hu, and Steven I Marcus. An adaptive sampling algorithm for solving markov decision processes. Operations Research, 53(1):126–139, 2005.
• Coquelin and Munos (2007) Pierre-Arnaud Coquelin and Rémi Munos. Bandit algorithms for tree search. arXiv preprint cs/0703062, 2007.
• Dufour and Prieto-Rumeau (2012) F. Dufour and T. Prieto-Rumeau. Approximation of Markov decision processes with general state space. Journal of Mathematical Analysis and applications, 388(2):1254–1267, 2012.
• Dufour and Prieto-Rumeau (2013) F. Dufour and T. Prieto-Rumeau. Finite linear programming approximations of constrained discounted Markov decision processes. SIAM Journal on Control and Optimization, 51(2):1298–1324, 2013.
• Even-Dar and Mansour (2004) E. Even-Dar and Y. Mansour. Learning rates for Q-learning. JMLR, 5, December 2004.
• Jiang et al. (2018) Daniel R Jiang, Emmanuel Ekwedike, and Han Liu. Feedback-based tree search for reinforcement learning. 2018.
• Kaufmann and Koolen (2017) Emilie Kaufmann and Wouter M Koolen. Monte-carlo tree search by best arm identification. In Advances in Neural Information Processing Systems, pages 4897–4906, 2017.
• Kearns et al. (2002) Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine learning, 49(2-3):193–208, 2002.
• Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.
• Kocsis et al. (2006) Levente Kocsis, Csaba Szepesvári, and Jan Willemson. Improved monte-carlo search. Univ. Tartu, Estonia, Tech. Rep, 2006.
• Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
• Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
• Salomon and Audibert (2011) Antoine Salomon and Jean-Yves Audibert. Deviations of stochastic bandit regret. In International Conference on Algorithmic Learning Theory, pages 159–173. Springer, 2011.
• Schadd et al. (2008) Maarten P. D. Schadd, Mark H. M. Winands, H. Jaap van den Herik, Guillaume M. J. B. Chaslot, and Jos W. H. M. Uiterwijk. Single-player monte-carlo tree search. In H. Jaap van den Herik, Xinhe Xu, Zongmin Ma, and Mark H. M. Winands, editors, Computers and Games, pages 1–12, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
• Shah and Xie (2018) Devavrat Shah and Qiaomin Xie. Q-learning with nearest neighbors. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3115–3125. Curran Associates, Inc., 2018.
• Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
• Silver et al. (2017a) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a.
• Silver et al. (2017b) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017b.
• Stone (1982) Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040–1053, 1982.
• Strehl et al. (2006) Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.
• Sturtevant (2008) Nathan R. Sturtevant. An analysis of uct in multi-player games. In H. Jaap van den Herik, Xinhe Xu, Zongmin Ma, and Mark H. M. Winands, editors, Computers and Games, pages 37–49, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
• Sutton (1988) Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
• Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
• Szepesvári (2019) Csaba Szepesvári. Personal communication. January 2019.
• Teraoka et al. (2014) Kazuki Teraoka, Kohei Hatano, and Eiji Takimoto. Efficient sampling method for monte carlo tree search problem. IEICE TRANSACTIONS on Information and Systems, 97(3):392–398, 2014.
• Tsybakov (2009) Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.
• Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016.
• Vershynin (2017) Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2017.
• Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.

## A Analysis of MCTS with Non-stationary Leaf Nodes and Proof of Theorem 1

In this section, we give a complete analysis for the fixed-depth Monte Carlo Tree Search (MCTS) algorithm illustrated in Algorithm 2 and prove Theorem 1. In fact, we analyze a more general fixed-depth MCTS, which should be of independent interest. The generalized MCTS algorithm is the same as Algorithm 2, except that instead of using a deterministic oracle to evaluate the values for leaf nodes, in the generalized algorithm we model the rewards at the leaf nodes as non-stationary processes. More precisely, in Line 10 of Algorithm 2, is generated by a non-stationary process. Doing so allows more flexibility and could be broadly applicable in other scenarios. Note that even if the rewards at leaf nodes are stationary processes, for example, IID, the complicated interaction within the tree would still make the rewards at intermediate nodes non-stationary. It is hence more appropriate to assume non-stationarity for the leaf nodes in the first place. Under such a setting, we provide a general non-asymptotic error bound for the return of the MCTS algorithm (i.e., the estimate of the value of the root node). The desired result, Theorem 1, can then be readily obtained as a special case.

Overview of the Analysis: The overall structure of our analysis is divided into four main parts:

• We first state precisely the non-stationary model for the leaf nodes under our generalized MCTS algorithm, and introduce basic notation as well as some preliminary implications from the assumptions and our algorithm.

• The first step of our proof is to analyze the reward sequences for nodes at depth , under the generalized MCTS algorithm. This amounts to analyzing the average reward for a non-stationary multi-armed bandit (MAB) problem, which might be of independent interest. The plan is to show that the average reward for each node at depth satisfies convergence and concentration properties similar to those assumed for the non-stationary processes at the leaf nodes (i.e., depth ).

• We then argue, by induction, that the same convergence and concentration properties hold for any depth , with different constants depending on the problem instance. This result allows us to derive a finite-time error bound for the root node.

• Finally, since our original MCTS oracle (i.e., Algorithm 2) is a special case of the generalized version, we apply the result from the general case to complete the proof of Theorem 1.

Organization: Appendix [ is organized as follows. Part 1, which includes Sections [[, provides the basic setup. Specifically, Section [ introduces our assumptions on the non-stationary processes and some necessary notation. Section [ then gives a short discussion about the implication of the assumptions on the specific variant of UCB used to select actions during the MCTS simulation. Part 2 consists of Sections [[. In Sections [ and [, we prove the convergence and concentration for the resulting non-stationary MAB problem at depth . In these sections, we drop the superscript to simplify the notation. To complete Part 2, we need to utilize the non-stationary MAB result to obtain node-independent constants for the concentration property at depth . To this end, in Section [, we first introduce additional notation to set the stage for our recursive analysis and recap the results obtained so far. These results will then be combined to derive precise formulas for the constants pertaining to depth . With results from Part 2, we are able to analyze the behavior of the nodes at each depth in Section [. The key result, the non-asymptotic error bound for the root node, is then summarized in Section [, and this finishes our analysis for the generalized MCTS algorithm. The final part is given in Section [, where we apply the general error bound to our specific case (i.e., Algorithm 2). This completes the proof of Theorem 1.

### Assumptions on the Non-stationary Leaf Nodes and Notation

To begin with, there are two kinds of rewards being collected during the simulation: (1) the intermediate rewards (i.e., rewards on the edges); (2) the final reward (i.e., reward on the leaf nodes of the tree). We assume that all the rewards are bounded and for simplicity, all the rewards belong to . The reward at each edge is the random reward obtained by playing a specific action at a particular node , namely, . For every edge, this reward is naturally IID according to our MDP model. Here we model the rewards at the leaf nodes as non-stationary processes that satisfy certain convergence and concentration properties, as stated below.

###### Assumption 4

For each leaf node we consider the following non-stationary reward sequences, . Let be the empirical average of choosing (i.e., selecting an action) the leaf node for times, and let be its expected value. We assume that is bounded for every and , and for simplicity, let . We assume the following convergence and concentration properties for these non-stationary processes:

1. (Convergence) We assume that converges to a value , which lies in the -vicinity of a true reward value , i.e.,

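$$\mu^{\mathrm{leaf}}_{i} = \lim_{n\to\infty} \mathbb{E}\big[\bar{X}^{\mathrm{leaf}}_{i,n}\big], \qquad \text{(a. Convergence)}$$

$$\big|\mu^{\dagger,\mathrm{leaf}}_{i} - \mu^{\mathrm{leaf}}_{i}\big| \le \varepsilon_0. \qquad \text{(b. Drift)}$$
2. (Concentration) Further, we assume that tail probability inequalities of a certain form hold for the empirical average: for each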

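$$\mathbb{P}\big(n\bar{X}^{\mathrm{leaf}}_{i,n} - n\mu^{\mathrm{leaf}}_{i} \ge \Phi(n,\delta)\big) \le \delta, \qquad \text{(c. Concentration)}$$

$$\mathbb{P}\big(n\bar{X}^{\mathrm{leaf}}_{i,n} - n\mu^{\mathrm{leaf}}_{i} \le -\Phi(n,\delta)\big) \le \delta,$$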

where and is a function that is non-increasing with and non-decreasing with .

In this analysis, we will focus on polynomial tails:

$$\Phi(n,\delta) = n^{\eta}\left(\frac{\beta}{\delta}\right)^{1/\xi},$$

where , and are constants. With this choice, we make the following concentration assumption on the leaf nodes: there exist three constants, , , and such that for every and every integer ,

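$$\mathbb{P}\big(n\bar{X}^{\mathrm{leaf}}_{i,n} - n\mu^{\mathrm{leaf}}_{i} \ge n^{\eta} z\big) \le \frac{\beta}{z^{\xi}},$$

$$\mathbb{P}\big(n\bar{X}^{\mathrm{leaf}}_{i,n} - n\mu^{\mathrm{leaf}}_{i} \le -n^{\eta} z\big) \le \frac{\beta}{z^{\xi}}.$$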

Note that if were IID, this concentration would follow from Hoeffding's inequality. In fact, if , the above inequality is also true, because trivially, . Finally, note that the choices of and could be related. If becomes larger, one could also find a larger so that the inequalities still hold.
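The link between the general tail function and the polynomial-tail form can be checked numerically: taking the deviation to be a multiple `z` of `n**eta` with `z = (beta/delta)**(1/xi)` turns the bound `beta / z**xi` back into `delta`. The sketch below is only an illustration; the function names and the sample parameter values are hypothetical.

```python
import math


def phi(n: int, delta: float, eta: float, beta: float, xi: float) -> float:
    """Polynomial tail function Phi(n, delta) = n^eta * (beta / delta)^(1/xi)."""
    return n ** eta * (beta / delta) ** (1.0 / xi)


def polynomial_tail(z: float, beta: float, xi: float) -> float:
    """Right-hand side beta / z^xi of the polynomial concentration inequality."""
    return beta / z ** xi


if __name__ == "__main__":
    # Illustrative (assumed) parameter values.
    n, delta, eta, beta, xi = 50, 0.05, 0.5, 2.0, 4.0
    z = phi(n, delta, eta, beta, xi) / n ** eta
    # A deviation of Phi(n, delta) = n^eta * z corresponds to probability delta.
    print(math.isclose(polynomial_tail(z, beta, xi), delta))
```

The same functions also show the monotonicity of the tail function: it grows with `n` and shrinks as `delta` grows.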

Combining Intermediate Reward and Leaf Reward. At depth , when playing an action , there are two rewards being collected: the intermediate reward associated with the corresponding edge and the final reward from the leaf node that the edge is connected to. The leaf rewards are assumed to satisfy the previous assumptions. On the other hand, the intermediate rewards are IID random variables in . That is, the intermediate rewards satisfy the Hoeffding inequality:

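$$\mathbb{P}\big(\big|n\bar{X}^{\mathrm{inter}}_{i,n} - n\mu^{\mathrm{inter}}_{i}\big| \ge x\big) \le 2\exp\big(-2x^{2}/n\big), \quad \forall x > 0.$$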

Equivalently, if we let $x = n^{\eta} z$, then

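$$\mathbb{P}\big(\big|n\bar{X}^{\mathrm{inter}}_{i,n} - n\mu^{\mathrm{inter}}_{i}\big| \ge n^{\eta} z\big) \le 2\exp\big(-2n^{2\eta-1} z^{2}\big), \quad \forall z > 0.$$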

Note that the tail bound is exponential, which is stronger than the polynomial bound assumed for the leaf nodes (cf. (\theequation) – (\theequation)). This means that a similar polynomial bound also holds for the intermediate rewards. In particular, since , we could fix the same and , as in (\theequation) – (\theequation), and choose a large enough so that

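$$\mathbb{P}\big(n\bar{X}^{\mathrm{inter}}_{i,n} - n\mu^{\mathrm{inter}}_{i} \ge n^{\eta} z\big) \le \frac{\beta'}{z^{\xi}},$$

$$\mathbb{P}\big(n\bar{X}^{\mathrm{inter}}_{i,n} - n\mu^{\mathrm{inter}}_{i} \le -n^{\eta} z\big) \le \frac{\beta'}{z^{\xi}}.$$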

Note that in (\theequation) – (\theequation), replacing with any larger number still makes the inequalities hold. For conciseness and without loss of generality, we assume that is large enough so that the inequalities (\theequation) – (\theequation) still hold with replaced by .
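One concrete way to see that the exponential Hoeffding tail implies a polynomial one is to compute an explicit constant. The sketch below assumes `eta >= 1/2`, so that `n**(2*eta - 1) >= 1` for every `n >= 1`; under that assumption the tail is at most `2*exp(-2*z**2)`, and maximizing `2 * z**xi * exp(-2*z**2)` over `z > 0` gives a sufficient constant. The function names and this particular constant are illustrative choices, not the values used in the paper.

```python
import math


def hoeffding_tail(n: int, z: float, eta: float) -> float:
    """Two-sided Hoeffding bound 2 * exp(-2 * n^(2*eta - 1) * z^2)."""
    return 2.0 * math.exp(-2.0 * n ** (2.0 * eta - 1.0) * z * z)


def beta_prime(xi: float) -> float:
    """sup over z > 0 of 2 * z^xi * exp(-2 z^2), attained at z = sqrt(xi) / 2."""
    z_star = math.sqrt(xi) / 2.0
    return 2.0 * z_star ** xi * math.exp(-2.0 * z_star ** 2)


if __name__ == "__main__":
    xi, eta = 4.0, 0.5  # illustrative values; eta >= 1/2 is assumed
    bp = beta_prime(xi)
    for n in (1, 10, 1000):
        for z in (0.5, 1.0, 2.0):
            # The exponential tail sits below the polynomial envelope bp / z^xi.
            print(n, z, hoeffding_tail(n, z, eta) <= bp / z ** xi * (1 + 1e-12))
```

Since the one-sided deviation probabilities are no larger than the two-sided one, the same constant serves for both inequalities.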

To simplify, with some abuse of notation, we let be the random variable of the total reward associated with an action in the subsequent analysis:

$$X_{i,n} \triangleq X^{\mathrm{inter}}_{i,n} + \gamma X^{\mathrm{leaf}}_{i,n}.$$

That is, is the reward that one would receive, at depth , when playing an action (i.e., an edge) that leads to the th leaf node for the th time. Here, is the discounting factor. Note that . Similarly, denote by the empirical average of the rewards, and let be its expectation. Consequently, . Furthermore, we denote the true reward by . Note that . Finally, we define

$$\delta_{i,n} \triangleq \mu_{i,n} - \mu_{i}$$

as the difference between the expected empirical average when is selected for times and the final converged expectation.

Since satisfies the Hoeffding inequality and is assumed to satisfy the polynomial tail bounds, we can safely assume the following: there exist constants, , , and such that for every and every ,

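$$\mathbb{P}\big(n\bar{X}_{i,n} - n\mu_{i} \ge n^{\eta} z\big) \le \frac{\beta}{z^{\xi}},$$

$$\mathbb{P}\big(n\bar{X}_{i,n} - n\mu_{i} \le -n^{\eta} z\big) \le \frac{\beta}{z^{\xi}}.$$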
###### Remark 3

Recall that in Remark 1, we have argued the necessity of polynomial tail bounds. Technically speaking, if we assume an exponential tail bound for the leaf nodes, instead of the tail bound in (\theequation) and (\theequation), we could still follow the same analysis in the following sections (Sections [ and [) to derive convergence and concentration properties for nodes at depth . However, the same analysis will lead to a polynomial tail bound instead of an exponential one. That is, the exponential property does not get preserved across different depths. Hence we are not able to recursively argue the convergence and concentration properties for nodes at each depth.

### Action Selection in MCTS

Starting from this section, we analyze the behavior of nodes at depth . For simplicity, we drop the superscript or , which denotes depth. Most MCTS algorithms in the literature, including our approach here, use variants of the Upper Confidence Bound (UCB) algorithm to select actions at each depth. Following the terminology in the Multi-Armed Bandit (MAB) literature and due to the deterministic transition, we use the terms action, edge, and arm interchangeably.

From now on, let us fix a particular node at depth and assume that there are possible actions in total (cf. Figure [). Combining the results for each node to derive a universal theorem for the whole depth will be the theme of Section [.

Let us now give some context for applying a general UCB algorithm at depth to select an action. Let be the edge played at time . Denote by the number of times the th edge has been played, up to and including time , i.e., . Then, at time , a generic UCB algorithm will select the action with the maximum upper confidence bound, i.e.,

where is an appropriately chosen bias sequence when the particular edge has been played times up to and including time .
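A generic UCB selection step of this kind can be sketched as follows. The bias is passed in as a function of the current time and the play count, and `polynomial_bias` instantiates it as the polynomial tail function evaluated at confidence level `t**(-alpha)` and divided by the play count. The function names, the rule of playing each untried arm once first, and the exact time-index convention are assumptions made for illustration and are not taken from Algorithm 2.

```python
from typing import Callable, Sequence


def polynomial_bias(t: int, s: int, eta: float, beta: float, xi: float, alpha: float) -> float:
    """Bias Phi(s, t^-alpha) / s, with Phi(n, delta) = n^eta * (beta / delta)^(1/xi)."""
    return s ** (eta - 1.0) * beta ** (1.0 / xi) * t ** (alpha / xi)


def ucb_select(
    avg_rewards: Sequence[float],
    counts: Sequence[int],
    t: int,
    bias: Callable[[int, int], float],
) -> int:
    """Return the index of the arm with the largest upper confidence bound."""
    # Assumed initialization rule: play every untried arm once.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    scores = [avg_rewards[i] + bias(t, counts[i]) for i in range(len(counts))]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Illustrative constants only.
    bias = lambda t, s: polynomial_bias(t, s, eta=0.5, beta=1.0, xi=4.0, alpha=2.0)
    # The rarely played arm gets a much larger bonus and is selected.
    print(ucb_select([0.5, 0.4], [100, 1], t=100, bias=bias))
```

In contrast to the logarithmic bonus of classical UCB, the bonus here grows polynomially in `t`, which forces more exploration of rarely played arms.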

Note that under our assumptions (cf. (\theequation)-(\theequation)), by setting to be

 Bt,s=Φ(s,t−α)s=sη⋅β1/ξ⋅tα/