# Static and Dynamic Values of Computation in MCTS

## Abstract

Monte-Carlo Tree Search (MCTS) is one of the most widely used methods for planning, and has powered many recent advances in artificial intelligence. In MCTS, one typically performs computations (i.e., simulations) to collect statistics about the possible future consequences of actions, and then chooses accordingly. Many popular MCTS methods such as UCT and its variants decide which computations to perform by trading off exploration and exploitation. In this work, we take a more direct approach, and explicitly quantify the value of a computation based on its expected impact on the quality of the action eventually chosen. Our approach goes beyond the myopic limitations of existing computation-value-based methods in two senses: (I) we are able to account for the impact of non-immediate (i.e., future) computations (II) on non-immediate actions. We show that policies that greedily optimize computation values are optimal under certain assumptions and obtain results that are competitive with the state-of-the-art.

## Introduction

Monte Carlo tree search (MCTS) is a widely used approximate planning method that has been successfully applied to many challenging domains such as computer Go [coulom2007, silver2016]. In MCTS, one estimates values of actions by stochastically expanding a search tree that captures potential future states and actions with their respective values. Most MCTS methods rely on rules concerning how to expand the search tree, typically trading off exploration and exploitation, as in UCT [kocsis2006]. Including such exploitative considerations is a heuristic, since no ‘real’ reward accrues during internal search [hay2011, hay2012, tolpin2012]. In this paper, we propose a more direct approach by calculating values of MCTS computations (i.e., tree expansions/simulations).

In the same way that the value of an action in a Markov decision process (MDP) depends on subsequent actions, the value of a computation in MCTS should reflect subsequent computations. However, computing the optimal computation values (i.e., the value of a computation under an optimal computation policy) is known to be intractable [lin2015, wefald1991]. We address this problem by proposing static and dynamic value functions that form lower and upper bounds for optimal computation values while being independent of future computations. We then utilize these functions to define the value of a computation as the expected increase in the static or dynamic value of a state resulting from the computation.

We show that MCTS policies that greedily maximize computation values are able to outperform various MCTS baselines.

## Background

In this section, we cover some of the relevant literature and introduce the notation.

### Monte Carlo Tree Search

MCTS algorithms function by incrementally and stochastically building a search tree to approximate state-action values. This incremental growth prioritizes the promising regions of the search space by directing the growth of the tree towards high-value states. To elaborate, a tree policy is used to traverse the search tree and select a node that is not fully expanded, meaning it has immediate successors that aren’t included in the tree. Then, the node is expanded once by adding one of its unexplored children to the tree, from which a trajectory is simulated for a fixed number of steps or until a terminal state is reached. Such trajectories are generated using a rollout policy, which is typically fast to compute (for instance, uniformly random). The outcome of this trajectory (i.e., the cumulative discounted reward along the trajectory) is used to update the value estimates of the nodes in the tree that lie along the path from the root to the expanded node.

Upper Confidence Bounds applied to trees (UCT) [kocsis2006] adapts a multi-armed bandit algorithm called UCB1 [auer2002] to MCTS. More specifically, UCT’s tree policy applies the UCB1 algorithm recursively down the tree starting from the root node. At each level, UCT selects the most promising action at state $s$ via $\arg\max_{a \in \mathcal{A}_s} \hat{Q}(s, a) + c \sqrt{\log N(s) / N(s, a)}$, where $\mathcal{A}_s$ is the set of available actions at $s$, $N(s, a)$ is the number of times $(s, a)$ is visited, $N(s) = \sum_{a \in \mathcal{A}_s} N(s, a)$, $\hat{Q}(s, a)$ is the average reward obtained by performing rollouts from $(s, a)$ or one of its descendants, and $c$ is a positive constant, which is typically selected empirically. The second term of the UCT rule assigns higher scores to nodes that are visited less frequently. As such, it can be thought of as an exploration bonus.
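As a minimal sketch of this selection rule, the snippet below scores each action by its mean return plus an exploration bonus, using one common form of the UCT rule. The `stats` layout, the function name `uct_select`, and the default constant are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def uct_select(stats, c=1.4):
    """Return the action maximizing mean return plus an exploration bonus.

    `stats` maps each action to a (visit_count, mean_return) pair; this
    layout and the default constant `c` are illustrative assumptions.
    """
    total_visits = sum(n for n, _ in stats.values())

    def score(action):
        n, mean = stats[action]
        if n == 0:
            return math.inf  # unvisited actions are tried first
        return mean + c * math.sqrt(math.log(total_visits) / n)

    return max(stats, key=score)

# Hypothetical statistics for three actions at one node.
stats = {"left": (10, 0.50), "right": (2, 0.45), "up": (0, 0.0)}
chosen = uct_select(stats)
```

Setting `c=0` recovers purely greedy selection on the mean returns, which makes the role of the second term as an exploration bonus easy to see.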

UCT is simple and has successfully been utilized for many applications. However, it has also been noted [tolpin2012, hay2012] that UCT’s goal is different from that of approximate planning. UCT attempts to ensure that the agent experiences little regret associated with the actions that are taken during the Monte Carlo simulations that comprise planning. However, since these simulations do not involve taking actions in the environment, the agent actually experiences no true regret at all. Thus, failing to explore actions based on this consideration could slow down discovery of their superior or inferior quality.

### Metareasoning & Value of Information

Howard [howard1966] was the first to quantify mathematically the economic gain from obtaining a piece of information. Russell and Wefald [wefald1991] formulated the rational metareasoning framework, which is concerned with how one should assign values to meta-level actions (i.e., computations). Hay et al. [hay2012] and Tolpin and Shimony [tolpin2012] applied the principles of this framework to MCTS by modifying the tree policy of UCT at the root node such that the selected child node maximizes the value of information. They showed empirically that such a simple modification can yield significant improvements.

The field of Bayesian optimization has evolved in parallel. For instance, what are known as knowledge gradients [ryzhov2012, frazier2017] are exactly information/computation value formulations for “flat” problems such as multi-armed bandits.

Computation values have also been used to explain certain human and animal behavior. For example, it has been suggested that humans might leverage computation values to solve planning tasks in a resource efficient manner [lieder2014, sezener2019], and animals might improve their policies by “replaying” memories with large computation values [mattar2018].

Despite these advances, the application of metareasoning/information-values to practical problems has been very limited. “First-order approximations” dominate, such as only considering the immediate impact of a computation on only immediate actions. This is because of the intractability of computing exact computation values. In fact, Lin et al. [lin2015] have shown that metareasoning is a harder computational problem than reasoning (e.g., finding the best action in a planning setting) for a general class of problems.

### Notation

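A finite Markov decision process (MDP) is a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $\mathcal{P}$ is the transition function such that $\mathcal{P}^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$, where $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, $\mathcal{R}$ is the expected immediate reward function such that $\mathcal{R}^a_{ss'} = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']$, where again $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, and $\gamma$ is the discount factor.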

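We assume an agent interacts with the environment via a (potentially stochastic) policy $\pi$, such that $\pi(s, a) = P(a_t = a \mid s_t = s)$. These probabilities typically depend on parameters; these are omitted from the notation. The value of an action $a$ at state $s$ is defined as the expected cumulative discounted reward following policy $\pi$, that is, $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \mid s_t = s, a_t = a\right]$.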

The optimal action value function is defined as $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all state-action pairs, and satisfies the Bellman optimality recursion:

$$Q^*(s, a) = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right].$$

We use $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(\mu, \sigma)$ to denote a multivariate and univariate Normal distribution respectively, with mean vector/value $\mu$ and covariance matrix $\Sigma$ or scale $\sigma$.

## State-action values in MCTS

To motivate the issues underlying this paper, consider the following example (Figure 1). Here, there are two rooms: one containing two boxes and the other containing five boxes. Each box contains an unknown but i.i.d. amount of money; and you are ultimately allowed to open only one box. However, you do so in stages. First you must choose a room, then you can open one of the boxes and collect the money. Which room should you choose? What if you know ahead of time that you could peek inside the boxes after choosing the room?

In the first case, it doesn’t matter which room one chooses, as all the boxes are equally valuable in expectation in the absence of any further information. By contrast, in the second case, choosing the room with five boxes is the better option. This is because one can obtain further information by peeking inside the boxes, and more boxes mean more money in expectation, as one has the option to choose the best one.

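Formally, let $X = \{X_i\}_{i=1}^{2}$ and $Y = \{Y_i\}_{i=1}^{5}$ be sets of random variables denoting rewards in the boxes of the first and the second room respectively. Assume all the rewards are sampled i.i.d. Then we have $\max_i \mathbb{E}[X_i] = \max_i \mathbb{E}[Y_i]$, which is why the two rooms are equally valuable if one has to choose a box blindly. On the other hand, $\mathbb{E}[\max_i X_i] \leq \mathbb{E}[\max_i Y_i]$ (with strict inequality whenever the reward distribution is non-degenerate), which is analogous to the case where boxes can be peeked in first.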
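The two-room example can be checked numerically. The sketch below is only an illustration: it assumes standard Normal rewards (the text does not specify a reward distribution), and the function name `room_values` and the trial count are arbitrary choices.

```python
import random

def room_values(num_boxes, num_trials=20000, seed=0):
    """Estimate a room's value when choosing blindly vs. after peeking."""
    rng = random.Random(seed)
    blind_total, peek_total = 0.0, 0.0
    for _ in range(num_trials):
        # Assumed reward model: i.i.d. standard Normal amounts per box.
        boxes = [rng.gauss(0.0, 1.0) for _ in range(num_boxes)]
        blind_total += boxes[0]    # any fixed box: nothing to choose with
        peek_total += max(boxes)   # peeking lets us take the best box
    return blind_total / num_trials, peek_total / num_trials

blind_small, peek_small = room_values(2)
blind_large, peek_large = room_values(5)
```

Under these assumptions the blind estimates of both rooms agree up to sampling noise, while the peeking estimate is noticeably larger for the five-box room.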

If we consider MCTS with this example in mind, when we want to value an action at the root of the tree, backing up the estimated mean values of the actions lower in the tree may be insufficient. This is because the value of a root action is a convex combination of the “downstream” (e.g., leaf) actions; and, as such, uncertainty in the values of the leaves contributes to the expected value at the root due to Jensen’s inequality. We formalize this notion of value as dynamic value in the following section and utilize it to define computation values later on.

### Static and Dynamic values

We assume a planning setting where the environment dynamics (i.e., $\mathcal{P}$ and $\mathcal{R}$) are known. We could then compute $Q^*$ in principle; however, this is typically computationally intractable. Therefore, we estimate $Q^*$ by performing computations such as random environment simulations (e.g., MCTS rollouts). Note that our uncertainty about $Q^*$ is not epistemic (the environment dynamics are known) but computational. In other words, if we do not know $Q^*$, it is because we haven’t performed the necessary computations. In this subsection, we introduce static and dynamic value functions, which are “posterior” estimates of $Q^*$ conditioned on computations.

Let us unroll the Bellman optimality equation for $n$ steps. For a given “root” state, $s_\rho$, let $\Gamma_n(s_\rho)$ be the set of leaf state-actions, that is, state-actions that can be transitioned to from $s_\rho$ in exactly $n$ steps. Let $Q^*_0$ be our prior belief distribution for $Q^*$ over $\Gamma_n(s_\rho)$. We then use $Q^*_n$ to denote the Bayes-optimal prior state-action values at $s_\rho$ that we define as a function of $Q^*_0$:

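$$Q^*_n(s, a) = \begin{cases} Q^*_0(s, a) & \text{if } n = 0 \\ \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a' \in \mathcal{A}_{s'}} Q^*_{n-1}(s', a') \right] & \text{else,} \end{cases}$$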

where $\mathcal{A}_{s'}$ is the set of actions available at $s'$.

We assume it is possible to obtain noisy evaluations of $Q^*_0$ for leaf state-actions by performing computations such as trajectory simulations. We further assume that the process by which a state-action value is sampled is given, and we are interested in determining which state-action to sample from. Therefore, we associate each computation with a single state-action in $\Gamma_n(s_\rho)$; however, the outcome of a computation might be informative for multiple leaf values if they are dependent. Let $\bar{\omega}$ be a candidate computation. We denote the unknown outcome of this computation at time $t$ with a random variable, which we assume to equal the corresponding leaf value plus an unknown noise term that has a known distribution and is sampled i.i.d. for each $t$. If we associate a candidate computation with its unknown outcome at time $t$, we refer to the resulting tuple as a closure and denote it as $\Omega_t$. Finally, we denote a performed computation at time $t$, by dropping the bar, as $\omega_t$, which pairs the computation with its observed outcome; the observed outcome is assumed to be a draw from the corresponding outcome variable, and thus $\omega_t$ is a realization of $\Omega_t$. In the context of MCTS, the observed outcome will be the cumulative discounted reward of a simulated trajectory from the associated state-action at time $t$. We will obtain these trajectories using an adaptive, asymptotically optimal sampler/simulator (e.g., UCT), such that the expected outcome converges to the optimal value as $t$ grows. This means the outcome is a non-stationary stochastic process in practice; yet, we will treat it as a stationary process, as reflected in our i.i.d. assumption.

Let $\omega_{1:t}$ be a sequence of performed computations concerning arbitrary state-actions in $\Gamma_n(s_\rho)$ and $s_\rho$ be the current state of the agent on which we can condition $Q^*_0$. Because $\omega_{1:t}$ contains the necessary statistics to condition leaf values, we will sometimes refer to it as the knowledge state. We denote the resulting posterior belief distribution for a leaf $(s, a) \in \Gamma_n(s_\rho)$ as $Q^*_0(s, a)|\omega_{1:t}$, or the joint distribution of leaf values as $Q^*_0|\omega_{1:t}$.

We define the dynamic value function as the expected value the agent should assign to an action at $s_\rho$ given $\omega_{1:t}$, assuming it could resolve all of the remaining uncertainty about posterior leaf state-action values $Q^*_0|\omega_{1:t}$.

###### Definition 1.

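The dynamic value function is defined as $\psi_n(s, a|\omega_{1:t}) := \mathbb{E}[\Upsilon_n(s, a|\omega_{1:t})]$, where

$$\Upsilon_n(s, a|\omega_{1:t}) := \begin{cases} Q^*_0(s, a)|\omega_{1:t} & \text{if } n = 0 \\ \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a' \in \mathcal{A}_{s'}} \Upsilon_{n-1}(s', a'|\omega_{1:t}) \right] & \text{else,} \end{cases}$$

and $\mathcal{A}_{s'}$ is the set of actions available at $s'$.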

The ‘dynamic’ in the term reflects the fact that the agent may change its mind about the best actions available at each state within $n$ steps; yet, this is reflected and accounted for in $\psi_n$. A useful property of $\psi_n$ is that it is time-consistent in the sense that it does not change with further computations in expectation. Let $\Omega_{1:k}$ be a sequence of closures. Then the following equality holds for any $\omega_{1:t}$:

$$\psi_n(s_\rho, a|\omega_{1:t}) = \mathbb{E}_{\Omega_{1:k}} \left[ \psi_n(s_\rho, a|\omega_{1:t}\Omega_{1:k}) \right], \tag{1}$$

due to the law of total expectation, where $\omega_{1:t}\Omega_{1:k}$ is a concatenation. This might seem paradoxical: why perform computations if action values do not change in expectation? The reason is that we care about the maximum of dynamic values over actions at $s_\rho$, which increases in expectation as long as computations resolve some uncertainty. Formally, $\mathbb{E}_{\Omega_{1:k}}\left[\max_a \psi_n(s_\rho, a|\omega_{1:t}\Omega_{1:k})\right] \geq \max_a \psi_n(s_\rho, a|\omega_{1:t})$, due to Jensen’s inequality, just as in the example of the boxes.

Dynamic values capture one extreme: valuation of actions assuming perfect information in the future. Next, we consider the other extreme, valuation under zero information in the future, which is given by the static value function.

###### Definition 2.

We define the static value function as

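$$\phi_n(s, a|\omega_{1:t}) := \begin{cases} \mathbb{E}[Q^*_0(s, a)|\omega_{1:t}] & \text{if } n = 0 \\ \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a' \in \mathcal{A}_{s'}} \phi_{n-1}(s', a'|\omega_{1:t}) \right] & \text{else,} \end{cases}$$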

where $\mathcal{A}_{s'}$ is the set of actions available at $s'$.

In other words, $\phi_n(s, a|\omega_{1:t})$ captures how valuable $(s, a)$ would be if the agent were to take $n$ actions before running any new computations. In Figure 2, we graphically contrast dynamic and static values, where the difference is the stage at which the expectation is taken. For the former, it is done at the level of the root actions; for the latter, at the level of the leaves.
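The difference in where the expectation is taken can be illustrated with a small sketch. It assumes deterministic transitions, no discounting, zero immediate rewards, and independent Normal posteriors over the leaves, so that each root action's value collapses to a single maximization over its leaves; the leaf parameters and function names are hypothetical.

```python
import random

def static_value(leaves):
    """phi: take expectations at the leaves first, then maximize."""
    return max(mean for mean, _ in leaves)

def dynamic_value(leaves, num_samples=20000, seed=0):
    """psi: maximize over sampled leaf values first, then average."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += max(rng.gauss(mean, scale) for mean, scale in leaves)
    return total / num_samples

# Hypothetical posterior (mean, scale) of the leaves under each root action.
leaves_by_root_action = {
    "a1": [(0.0, 1.0), (0.0, 1.0)],    # uncertain leaves
    "a2": [(0.1, 0.01), (0.1, 0.01)],  # nearly resolved leaves
}
phi = {a: static_value(l) for a, l in leaves_by_root_action.items()}
psi = {a: dynamic_value(l) for a, l in leaves_by_root_action.items()}
```

In this toy setup the static values favor `a2`, whereas the dynamic values favor `a1`, because the remaining uncertainty under `a1` is worth something once it can be resolved before acting.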

Going back to our example with the boxes, the dynamic value of a room assumes that you open all the boxes after entering the room, whereas the static value assumes you do not open any boxes. What can we say about the in-between cases: action values under a finite number of future computations? Assume we know that the agent will perform $k$ computations before taking an action at $s_\rho$. The optimal allocation of these computations to leaf nodes is known to be intractable even in a simpler bandit setting [madani2004]. That said, for any allocation (and for any finite $k$), static and dynamic values nevertheless form lower and upper bounds on expected action values. We formalize this for a special case below.

###### Proposition 1.

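Assume an agent at state $s_\rho$ and knowledge state $\omega_{1:t}$ decides to perform $\bar{\omega}_{1:k}$, a sequence of $k$ candidate computations, before taking $n$ actions. Then the expected future value of $a$ prior to observing any of the $k$ computation outcomes is equal to $\mathbb{E}_{\Omega_{1:k}}[\phi_n(s_\rho, a|\omega_{1:t}\Omega_{1:k})]$, where $\Omega_{1:k}$ is the sequence of closures corresponding to $\bar{\omega}_{1:k}$. Then,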

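$$\psi_n(s_\rho, a|\omega_{1:t}) \geq \mathbb{E}_{\Omega_{1:k}} \left[ \phi_n(s_\rho, a|\omega_{1:t}\Omega_{1:k}) \right] \geq \phi_n(s_\rho, a|\omega_{1:t}),$$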

where both bounds are tight.

We provide the proof in the appendix.

## Value of Computation

In this section, we introduce the value of computation formally and discuss some of its properties. We provide the proofs in the Appendix.

###### Definition 3.

We define the value of computation at state $s_\rho$ for a sequence of candidate computations $\bar{\omega}_{1:k}$ given a static or dynamic value function $f \in \{\phi_n, \psi_n\}$ and a knowledge state $\omega_{1:t}$ as

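$$\mathrm{VOC}_f(s_\rho, \bar{\omega}_{1:k}|\omega_{1:t}) = \mathbb{E}_{\Omega_{1:k}} \left[ \max_{a \in \mathcal{A}_{s_\rho}} f(s_\rho, a|\omega_{1:t}\Omega_{1:k}) \right] - \max_{a \in \mathcal{A}_{s_\rho}} f(s_\rho, a|\omega_{1:t}),$$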

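where $\bar{\omega}_{1:k}$ specifies the state-actions in $\Omega_{1:k}$, that is, the $i$th element of $\Omega_{1:k}$ is the closure of the $i$th element of $\bar{\omega}_{1:k}$.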

We refer to computation policies that choose computations by greedily maximizing VOC one at a time (i.e., $k = 1$) as VOC($\phi_n$)-greedy and VOC($\psi_n$)-greedy, depending on which value function is utilized. We assume these policies stop if and only if no candidate computation has a positive VOC. Alternatively, one can perform a forward search over future computation sequences, similar to the search over future action sequences in Guez et al. [guez2013]. However, this adds another meta-level to our metareasoning problem, thus further increasing the computational burden.
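A minimal sketch of one greedy step under static values is given below. It assumes a flat, deterministic setting with independent Normal leaf posteriors and Normal observation noise with a known scale, in which case the VOC of a single additional sample has the standard knowledge-gradient closed form. The function names and the example numbers are hypothetical, and at least two leaves are assumed.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def voc_static(means, scales, noise_scale, i):
    """VOC of sampling leaf i once more, under the max-of-means value."""
    # Scale of the change in leaf i's posterior mean after one noisy sample.
    s = scales[i] ** 2 / math.sqrt(scales[i] ** 2 + noise_scale ** 2)
    if s == 0.0:
        return 0.0
    best_other = max(m for j, m in enumerate(means) if j != i)
    z = -abs(means[i] - best_other) / s
    return s * (z * normal_cdf(z) + normal_pdf(z))

def greedy_computation(means, scales, noise_scale):
    """Pick the leaf whose next sample has the largest VOC, or None to stop."""
    vocs = [voc_static(means, scales, noise_scale, i) for i in range(len(means))]
    best = max(range(len(means)), key=vocs.__getitem__)
    return (best if vocs[best] > 0.0 else None), vocs

# Hypothetical posterior means and scales for three leaves.
means, scales = [0.5, 0.4, 0.0], [0.05, 0.5, 0.5]
choice, vocs = greedy_computation(means, scales, noise_scale=1.0)
```

In this example the policy prefers the uncertain runner-up leaf over the nearly resolved current best, since only the former has a realistic chance of changing the maximum.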

We analyze these greedy policies in terms of the Bayesian simple regret, which we define as the difference between two values. The first is the maximum value the agent could expect to reap assuming it can perform infinitely many computations, thus resolving all the uncertainty, before committing to an immediate action. Given our formulation, this is identical to $\mathbb{E}[\max_{a \in \mathcal{A}_{s_\rho}} \Upsilon_n(s_\rho, a|\omega_{1:t})]$ and thus is independent of the agent’s action policy and the (future) computation policy for a given knowledge state $\omega_{1:t}$. Furthermore, it remains constant in expectation as the knowledge state expands. The second term in the regret is the maximum static/dynamic action value assuming the agent cannot expand its knowledge state before taking an action.

###### Definition 4.

Given a knowledge state $\omega_{1:t}$, we define Bayesian simple regret at state $s_\rho$ as

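$$R_f(s_\rho, \omega_{1:t}) = \mathbb{E} \left[ \max_{a \in \mathcal{A}_{s_\rho}} \Upsilon_n(s_\rho, a|\omega_{1:t}) \right] - \max_{a \in \mathcal{A}_{s_\rho}} f(s_\rho, a|\omega_{1:t}),$$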

where $f \in \{\phi_n, \psi_n\}$.

###### Proposition 2.

VOC($\phi_n$)-greedy and VOC($\psi_n$)-greedy choose the computation that maximizes the expected decrease in $R_{\phi_n}$ and $R_{\psi_n}$ respectively.

We refer to policies that choose the regret-minimizing computation as being one-step optimal. Note that this result is different from what is typically referred to as myopic optimality. Myopia refers to considering the impact of a single computation only, whereas the VOC($\psi_n$)-greedy policy accounts for the impact of possible future computations that succeed the immediate action.

###### Proposition 3.

Given an infinite computation budget, VOC($\phi_n$)-greedy and VOC($\psi_n$)-greedy policies will find the optimal action at the root state.

###### Proof sketch.

Both policies will perform all computations infinitely many times, as shown for the flat setting in Ryzhov et al. [ryzhov2012]. Thus, dynamic and static values at the leaves (i.e., for $n = 0$) will converge to the true optimal values (i.e., $Q^*$), and so will the values that depend on them. ∎

We refer to such policies as being asymptotically optimal.

#### Alternative Definitions

A common [wefald1991, keramati2011, hay2012, tolpin2012] formulation for the value of computation is

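$$\mathrm{VOC}'_f(s_\rho, \bar{\omega}_{1:k}|\omega_{1:t}) = \mathbb{E}_{\Omega_{1:k}} \left[ \max_{a \in \mathcal{A}_{s_\rho}} f(s_\rho, a|\omega_{1:t}\Omega_{1:k}) - f(s_\rho, \alpha|\omega_{1:t}\Omega_{1:k}) \right], \tag{2}$$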

where $\alpha = \arg\max_{a \in \mathcal{A}_{s_\rho}} f(s_\rho, a|\omega_{1:t})$ and $f$ is a value function as before.

The difference between this and Definition 3 is that the second term in VOC$'$ also conditions on $\Omega_{1:k}$. This might seem intuitively correct. VOC$'$ is positive if and only if the policy at $s_\rho$ changes with some probability, that is, if the best action after the computations differs from $\alpha$ with non-zero probability. However, this approach can be too myopic, as it often takes multiple computations for the policy to change [hay2012]. Note that this is particularly troublesome for static values ($\phi_n$), which commonly arise in methods such as UCT that estimate mean returns of rollouts.

###### Proposition 4.

VOC$'$($\phi_n$)-greedy is neither one-step optimal nor asymptotically optimal.

By contrast, dynamic value functions escape this problem.

###### Proposition 5.

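For any $\omega_{1:t}$ and $\bar{\omega}_{1:k}$ we have $\mathrm{VOC}'_{\psi_n}(s_\rho, \bar{\omega}_{1:k}|\omega_{1:t}) = \mathrm{VOC}_{\psi_n}(s_\rho, \bar{\omega}_{1:k}|\omega_{1:t})$.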
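A short sketch of why the two definitions coincide for dynamic values, assuming (as in the alternative definition) that $\alpha$ denotes the best action under the current knowledge state $\omega_{1:t}$, and relying only on the time-consistency property in Equation 1:

$$\begin{aligned}
\mathrm{VOC}'_{\psi_n}(s_\rho, \bar{\omega}_{1:k}|\omega_{1:t})
&= \mathbb{E}_{\Omega_{1:k}}\left[\max_{a} \psi_n(s_\rho, a|\omega_{1:t}\Omega_{1:k})\right] - \mathbb{E}_{\Omega_{1:k}}\left[\psi_n(s_\rho, \alpha|\omega_{1:t}\Omega_{1:k})\right] \\
&= \mathbb{E}_{\Omega_{1:k}}\left[\max_{a} \psi_n(s_\rho, a|\omega_{1:t}\Omega_{1:k})\right] - \psi_n(s_\rho, \alpha|\omega_{1:t}) \\
&= \mathbb{E}_{\Omega_{1:k}}\left[\max_{a} \psi_n(s_\rho, a|\omega_{1:t}\Omega_{1:k})\right] - \max_{a} \psi_n(s_\rho, a|\omega_{1:t}) \\
&= \mathrm{VOC}_{\psi_n}(s_\rho, \bar{\omega}_{1:k}|\omega_{1:t}).
\end{aligned}$$

The second line uses Equation 1 and the third uses the definition of $\alpha$. For static values, only the inequality of Proposition 1 is available in place of Equation 1, so the same argument does not go through.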

## Value of Computation in MCTS

We now introduce an MCTS method based on the VOC-greedy policies we introduced. For this, as done in other information/computation-value-based MCTS methods [hay2012, tolpin2012], we utilize UCT as a “base” policy, meaning we call UCT as a subroutine to draw samples from leaf nodes. Because UCT is adaptive, these samples will be drawn from a non-stationary stochastic process in practice; yet, we will treat them as being i.i.d.

We introduce the model informally, and provide the exact formulas and pseudocode in the Appendix. We assume no discounting, i.e., $\gamma = 1$, and zero immediate rewards within $n$ steps of the root node for simplicity here, though as we show in the Appendix, the results trivially generalize.

We assume $Q^*_0 \sim \mathcal{N}(\mu_0, \Sigma_0)$, where $\mu_0$ is a prior mean vector and $\Sigma_0$ is a prior covariance matrix. We assume both these quantities are known, but it is also possible to assume a Wishart prior over $\Sigma_0$ or to employ optimization methods from the Gaussian process literature (e.g., maximizing the likelihood function via gradient descent). We assume computations return evaluations of $Q^*_0$ with added Normal noise with known parameters. Then, the posterior value function can be computed in constant time for an isotropic prior covariance, or at a higher cost using recursive update rules for multivariate Normal priors or using Gaussian process priors. We omit the exact form of the posterior distribution here as it is a standard result.

For computing the VOC($\phi_n$)-greedy policy, we need to evaluate how the expected values at the leaves change with a candidate computation $\bar{\omega}$. Note that, conditioned on $\omega_{1:t}$, the outcome of $\bar{\omega}$ follows the posterior predictive distribution for rollout returns from the corresponding leaf, and is normally distributed. Thus, the vector of updated expected leaf values is a multivariate Normal random variable whose dimension is the number of leaf nodes. The maximum of this variable is a piecewise linear function of the computation outcome, and thus its expectation can be computed exactly, as shown in Frazier et al. [frazier2009b]. If an isotropic prior covariance is assumed, the computations simplify greatly, as the problem reduces to the expectation of a truncated univariate normal distribution, which can be computed given the posterior distributions and the leaf node with the highest expected value. If the transitions are stochastic, then the same method can be utilized whether the covariance is isotropic or anisotropic, with an extra averaging step over transition probabilities at each node, increasing the computational costs.

Computing the VOC($\psi_n$)-greedy policy is much harder on the other hand, even for a deterministic state-transition function, because we need to calculate the expected maximum of possibly correlated random variables, $\mathbb{E}[\max Q^*_0|\omega_{1:t}]$. One could resort to Monte Carlo sampling. Alternatively, assuming an isotropic prior over leaf values, we can obtain the following by adapting a bound on the expected maximum of random variables [lrr76, ross2010]:

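$$\mathbb{E}[\max Q^*_0|\omega_{1:t}] \leq \lambda^{s_\rho}_t := c + \sum_{(s', a') \in \Gamma_n(s_\rho)} \left[ (\sigma^{s'a'}_t)^2 \, p^{s'a'}_t(c) + (\mu^{s'a'}_t - c) \left[ 1 - F^{s'a'}_t(c) \right] \right]$$

where $\mu^{s'a'}_t$ and $(\sigma^{s'a'}_t)^2$ are the posterior means and variances, that is, $Q^*_0(s', a')|\omega_{1:t} \sim \mathcal{N}(\mu^{s'a'}_t, \sigma^{s'a'}_t)$, $p^{s'a'}_t$ and $F^{s'a'}_t$ are the PDF and CDF of $Q^*_0(s', a')|\omega_{1:t}$, and $c$ is a real number. The tightest bound is realized for a $c$ that satisfies $\sum_{(s', a') \in \Gamma_n(s_\rho)} [1 - F^{s'a'}_t(c)] = 1$, which can be found by root-finding methods.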
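The bound and its tightest constant can be computed with a few lines of code. The sketch below assumes independent Normal leaf posteriors given as (mean, scale) pairs, uses the standard Normal identity $\mathbb{E}[(X - c)^+] = \sigma^2 p(c) + (\mu - c)[1 - F(c)]$ with $p$ the density, and picks bisection as one possible root-finding method; the function names, bracket width, and example values are illustrative assumptions. At least two leaves are needed for the root to exist.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def upper_bound(c, mus, sigmas):
    """c + sum_i E[(X_i - c)^+] for X_i ~ Normal(mu_i, sigma_i)."""
    total = c
    for mu, sigma in zip(mus, sigmas):
        total += sigma ** 2 * normal_pdf(c, mu, sigma)
        total += (mu - c) * (1.0 - normal_cdf(c, mu, sigma))
    return total

def tightest_c(mus, sigmas, tol=1e-10):
    """Bisection on sum_i P(X_i > c) = 1; the left side decreases in c."""
    def excess(c):
        return sum(1.0 - normal_cdf(c, m, s) for m, s in zip(mus, sigmas)) - 1.0
    lo = min(mus) - 10.0 * max(sigmas)
    hi = max(mus) + 10.0 * max(sigmas)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical posterior means and scales for three leaves.
mus, sigmas = [0.0, 0.0, 0.5], [1.0, 1.0, 0.5]
c_star = tightest_c(mus, sigmas)
bound = upper_bound(c_star, mus, sigmas)
```

Because the right-hand side of the bound is convex in $c$, the root of the tail-probability condition is its global minimizer, so any bracketing method suffices.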

The critical question is then how $\lambda^{s_\rho}_t$ changes with an additional sample from a leaf $(s', a')$. For this, we use the local sensitivity, $\partial \lambda^{s_\rho}_t / \partial n^{s'a'}_t$, as a proxy, where $n^{s'a'}_t$ is the number of samples drawn from $(s', a')$ until time $t$. We give the closed-form equation for this partial derivative along with some of its additional properties in the Appendix. Then we can approximately compute the VOC($\psi_n$)-greedy policy by choosing the computation that maximizes the magnitude of $\partial \lambda^{s_\rho}_t / \partial n^{s'a'}_t$. This approach only works if state transitions are deterministic, as this enables us to collapse the root action values into a single maximum over leaf values. If the state transitions are stochastic, this is no longer possible, as averaging over state transition probabilities is required. Alternatively, one can sample deterministic transition functions and average over the samples as an approximation.

Our VOC-greedy MCTS methods address important limitations of VOC$'$-based methods [tolpin2012, hay2012]. VOC-greedy does not suffer from the early stopping problem that afflicts VOC$'$-based methods. It is also less myopic in the sense that it can incorporate the impact of computations that may be performed in the future if dynamic value functions are utilized. Lastly, our proposal extends VOC calculations to non-root actions, as determined by $n$.

## Experiments

We compare the VOC-greedy policies against UCT [kocsis2006], VOI-based [hay2012], Bayes UCT [tesauro2010], and Thompson sampling for MCTS (DNG-MCTS) [bai2013] in two different environments: bandit-trees and peg solitaire.

Bayes UCT computes approximate posterior action values and uses a rule similar to UCT to select child nodes. DNG-MCTS also estimates the posterior action values but instead utilizes Thompson sampling recursively down the tree. We use the same conjugate Normal prior structure for the Bayesian algorithms: VOC-greedy, Bayes UCT, and DNG-MCTS. The prior and the noise parameters, as well as the exploration parameters of UCT and VOI-based, are tuned for each method via grid search using the same number of evaluations.

VOI-based, VOC-greedy, and Bayes UCT are hybrid methods, using one set of rules for the top of the search tree and UCT for the rest. We refer to this top part of the tree as the partial search tree (PST). By construction, VOI-based utilizes a PST of height . We implement the latter two methods using PSTs of height in bandit-trees and of in peg solitaire. These heights are determined based on the branching factors of the environments and the total computation budgets, such that each leaf node is sampled a few (5-8) times on average. For the experiments we explain next, we tune the hyperparameters of all the policies using grid search.

### Bandit-trees

The first environment in which we evaluate the MCTS policies is an MDP composed of a complete binary tree of height , similar to the setting presented in Tolpin and Shimony [tolpin2012] but with a deeper tree structure and stochastic transitions. The leaves of the tree are noisy “bandit arms” with unknown distributions. Agents perform “computations” to draw samples from the arms, which is analogous to performing rollouts for evaluating leaf values in MCTS. At each state, the agents select an action from (denoting the desired subtree of height ) and transition there with probability and to the other subtree with probability . In Figure 3, we illustrate a bandit tree of height .

At each time step $t$, agents sample one of the arms and update their value estimates at the root state $s_\rho$. We measure the simple objective regret at state $s_\rho$ at time $t$ for a deterministic policy, which depends on the knowledge state acquired by performing $t$ computations.

We sample the rewards of the bandit arms from a multivariate Normal distribution, where the covariance is obtained either from a radial basis function or from a white noise kernel. The noise of arms/computations follows an i.i.d. Normal distribution. We provide the exact environment parameters in the Appendix.

Figure 4a shows the results in the case with correlated bandit arms. These correlations are exploited in our implementation of VOC($\phi_n$)-greedy (via an anisotropic Normal prior over the leaf values of the PST). Note that we aren’t able to incorporate this extra assumption in other Bayesian methods. Bayes UCT utilizes a specific approximation for propagating values up the tree. Thompson sampling would require a prior over all state-actions in the environment, which is complicated due to the parent-child dependency among the nodes as well as computationally prohibitive. Because computing the VOC($\psi_n$)-greedy policy is very expensive if state transitions are stochastic, we only implement VOC($\phi_n$)-greedy for this environment, but implement both for the next environment.

We see that VOC($\phi_n$)-greedy outperforms all other methods. Note that this is a low-sample density setting: the number of bandit arms equals the maximum budget (see x-axis), so each arm gets sampled once on average. This is why many of the policies do not seem to have converged to the optimal solution. The outstanding performance of VOC($\phi_n$)-greedy is partially due to its ability to exploit correlations. In order to control for this, Figure 4b shows the results in a case in which the bandit rewards are actually uncorrelated (i.e., sampled from an isotropic Normal distribution). As we can see, VOC($\phi_n$)-greedy and Bayes UCT perform equally well, and better than the other policies. This implies that the good performance of VOC($\phi_n$)-greedy does not depend wholly on its ability to exploit the correlational structure.

### Peg solitaire

Peg solitaire—also known as Solitaire, Solo, or Solo Noble—is a single-player board game, where the objective for our purposes is to remove as many pegs as possible from the board by making valid moves. We use a board, with pegs randomly placed.

In the implementation of VOC-greedy policies, we assume an anisotropic prior over the leaf nodes. As shown in Figure 5, VOC($\phi_n$)-greedy has the best performance for small budget ranges, which is in line with our intuition, as $\phi_n$ is a more accurate valuation of action values for small computation budgets. For large budgets, we see that VOC($\psi_n$)-greedy performs as well as Thompson sampling, and better than the rest.

## Discussion

This paper offers principled ways of assigning values to actions and computations in MCTS. We address important limitations of existing methods by extending computation values to non-immediate actions while accounting for the impact of non-immediate future computations. We show that MCTS methods that greedily maximize computation values have desirable properties and are more sample-efficient in practice than many popular existing methods. The major drawback of our proposal is that computing VOC-greedy policies might be expensive, and may only be worth doing if rollouts (i.e., environment simulations) are computationally expensive.

Practical applications aside, we believe that the study of computation values might provide tools for a better understanding of MCTS policies, for instance, by providing notions of regret and optimality for computations similar to what already exists for actions.

## Appendix

### Value of Computation in MCTS

If the state transition function is deterministic, then static and dynamic value computations simplify greatly:

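$$\begin{aligned}
\psi_n(s, a|\omega_{1:t}) &= \mathbb{E} \left[ \max_{(s', a') \in \Gamma_n(s)} Z(s', a'|\omega_{1:t}) \right] && (3) \\
\phi_n(s, a|\omega_{1:t}) &= \max_{(s', a') \in \Gamma_n(s)} \mathbb{E} \left[ Z(s', a'|\omega_{1:t}) \right], && (4)
\end{aligned}$$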

where $Z(s', a'|\omega_{1:t})$ is the posterior leaf value scaled by $\gamma^n$ and shifted by the discounted immediate rewards along the path from $s$ to $s'$.

In this ‘flat’ case, the VOC($\phi_n$)-greedy policy is equivalent to a knowledge gradient policy, details of which can be found in Frazier et al. [frazier2009b] and Ryzhov et al. [ryzhov2012] for either isotropic or anisotropic Normal priors. On the other hand, the VOC($\psi_n$)-greedy policy has not been studied to the best of our knowledge. Computing the expected maximum of random variables, which is required for $\psi_n$, is generally hard. Below, we offer a novel approximation to remedy this problem.

#### Computing VOC($\psi_n$)

We utilize a bound [lrr76] that enables us to get a handle on $\psi_n$. This asserts,

$$\psi_n(s, a|\omega_{1:t}) \leq c + \sum_{(s', a') \in \Gamma_n(s)} \int_c^\infty \left[ 1 - F^{s'a'}_t(x) \right] dx$$

for any $c \in \mathbb{R}$, where $F^{s'a'}_t$ is the CDF of $Z(s', a'|\omega_{1:t})$. This bound does not assume independence and holds for any correlation structure by assuming the worst case. The tightest bound is obtained by differentiating the RHS with respect to $c$ and setting its derivative to zero, which in turn yields $\sum_{(s', a') \in \Gamma_n(s)} [1 - F^{s'a'}_t(c)] = 1$. Thus, the optimizing $c$ can be obtained via line search methods.

If is distributed according to a multivariate (isotropic or anisotropic) Normal distribution, then we can eliminate the integral [ross2010]:

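$$\psi_n(s, a|\omega_{1:t}) \leq \lambda^{sa}_t := c + \sum_{(s', a') \in \Gamma_n(s)} \left[ (\sigma^{s'a'}_t)^2 \, p^{s'a'}_t(c) + (\mu^{s'a'}_t - c) \left[ 1 - F^{s'a'}_t(c) \right] \right]$$

where $\mu^{s'a'}_t$ and $(\sigma^{s'a'}_t)^2$ are the posterior means and variances, that is, $Z(s', a'|\omega_{1:t}) \sim \mathcal{N}(\mu^{s'a'}_t, \sigma^{s'a'}_t)$, and $p^{s'a'}_t$ and $F^{s'a'}_t$ are the corresponding PDF and CDF.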

If we further assume an isotropic Normal prior with mean $\mu^{s'a'}_0$ and scale $\sigma^{s'a'}_0$, and i.i.d. Normal observation noise with scale $\sigma$, then we get the posterior mean and scale as

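$$\mu^{s'a'}_t = \frac{n^{s'a'}_t \hat{o}^{s'a'}_t / \sigma^2 + \mu^{s'a'}_0 / (\sigma^{s'a'}_0)^2}{n^{s'a'}_t / \sigma^2 + 1 / (\sigma^{s'a'}_0)^2}, \qquad \sigma^{s'a'}_t = \left( \frac{n^{s'a'}_t}{\sigma^2} + \frac{1}{(\sigma^{s'a'}_0)^2} \right)^{-1/2},$$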

where $\hat{o}^{s'a'}_t$ is the mean trajectory reward obtained from $(s', a')$ and $n^{s'a'}_t$ is the number of times a sample is drawn from $(s', a')$. Then, keeping $c$ fixed, we can estimate the “sensitivity” of $\lambda^{sa}_t$ with respect to an additional sample from $(s', a')$ with

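$$\frac{\partial \lambda^{sa}_t}{\partial n^{s'a'}_t} = \frac{\partial \lambda^{sa}_t}{\partial \sigma^{s'a'}_t} \frac{d \sigma^{s'a'}_t}{d n^{s'a'}_t} + \frac{\partial \lambda^{sa}_t}{\partial \mu^{s'a'}_t} \frac{d \mu^{s'a'}_t}{d n^{s'a'}_t},$$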

where

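$$\begin{aligned}
\frac{\partial \lambda^{sa}_t}{\partial \sigma^{s'a'}_t} &= \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(\mu^{s'a'}_t - c)^2}{2 (\sigma^{s'a'}_t)^2} \right) \\
\frac{d \sigma^{s'a'}_t}{d n^{s'a'}_t} &= -\frac{\sigma (\sigma^{s'a'}_0)^3}{2 \left( n^{s'a'}_t (\sigma^{s'a'}_0)^2 + \sigma^2 \right)^{3/2}} \\
\frac{\partial \lambda^{sa}_t}{\partial \mu^{s'a'}_t} &= \frac{1}{2} \left( 1 + \operatorname{erf}\left( \frac{\sqrt{2} (\mu^{s'a'}_t - c)}{2 \sigma^{s'a'}_t} \right) \right) \\
\frac{d \mu^{s'a'}_t}{d n^{s'a'}_t} &= \frac{\sigma^2 (\sigma^{s'a'}_0)^2 \left( \hat{o}^{s'a'}_t - \mu^{s'a'}_0 \right)}{(n^{s'a'}_t)^2 (\sigma^{s'a'}_0)^4 + 2 n^{s'a'}_t \sigma^2 (\sigma^{s'a'}_0)^2 + \sigma^4}.
\end{aligned}$$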

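We can then compute and utilize $\partial \lambda^{sa}_t / \partial n^{s'a'}_t$ as a proxy for the expected change in $\psi_n$. Because $\lambda^{sa}_t$ is an upper bound, we find that this scheme works best when the priors are optimistic, that is, when $\mu^{s'a'}_0$ is large. In fact, as long as the prior mean is larger than the empirical mean, $\mu^{s'a'}_0 > \hat{o}^{s'a'}_t$, we have $\partial \lambda^{sa}_t / \partial n^{s'a'}_t < 0$. Then we can safely choose the best leaf to sample from via $\arg\max_{(s', a')} \left| \partial \lambda^{sa}_t / \partial n^{s'a'}_t \right|$. We use this scheme when implementing VOC($\psi_n$)-greedy in peg solitaire and confirmed that the results are nearly indistinguishable from calculating VOC($\psi_n$)-greedy by drawing Monte Carlo samples in terms of the resulting regret curves.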
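The sensitivity-based selection can be sketched as follows, under the same assumptions as above: an isotropic Normal prior, i.i.d. Normal noise with known scale, and a fixed $c$. The function names, the shared prior, and the example statistics are hypothetical.

```python
import math

def posterior(mu0, sigma0, noise, n, mean_obs):
    """Conjugate Normal update: posterior mean and scale after n samples."""
    precision = n / noise ** 2 + 1.0 / sigma0 ** 2
    mu = (n * mean_obs / noise ** 2 + mu0 / sigma0 ** 2) / precision
    return mu, 1.0 / math.sqrt(precision)

def sensitivity(mu0, sigma0, noise, n, mean_obs, c):
    """d(lambda)/dn for one leaf, via the chain rule through mu_t, sigma_t."""
    mu, sigma = posterior(mu0, sigma0, noise, n, mean_obs)
    denom = n * sigma0 ** 2 + noise ** 2
    dlam_dsigma = math.exp(-(mu - c) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi)
    dsigma_dn = -noise * sigma0 ** 3 / (2.0 * denom ** 1.5)
    dlam_dmu = 0.5 * (1.0 + math.erf((mu - c) / (math.sqrt(2.0) * sigma)))
    dmu_dn = noise ** 2 * sigma0 ** 2 * (mean_obs - mu0) / denom ** 2
    return dlam_dsigma * dsigma_dn + dlam_dmu * dmu_dn

# Hypothetical leaves: (sample count, empirical mean), sharing an
# optimistic prior with mean 1.0 and scale 1.0; noise scale is 1.0.
leaves = [(1, 0.2), (5, 0.3), (20, 0.25)]
scores = [sensitivity(1.0, 1.0, 1.0, n, o, c=0.5) for n, o in leaves]
next_leaf = max(range(len(leaves)), key=lambda i: abs(scores[i]))
```

With an optimistic prior every score is negative, so the leaf with the largest magnitude is the one whose additional sample is expected to lower the bound the most.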

### VOC-greedy algorithm

We provide the pseudocode for the VOC-greedy MCTS policy in Algorithm 1. Throughout our analysis of this policy, we assume an infinite computation budget.

#### Time complexities

The computational complexity of VOC-greedy methods depends on a variety of factors, including the prior distribution of the leaf values, the stochasticity of the transitions, and the use of static vs. dynamic values. Here, we discuss the time complexity of computing the VOC($\phi_n$)-greedy policy with a conjugate Normal prior (with known variance) in MDPs with deterministic transitions.

The posterior values can be updated incrementally in constant time if the prior is isotropic Normal and in if it is anisotropic, where is the number of leaf nodes [Barber:2012]. Given the posterior distributions, the value of a computation can be computed in in the isotropic case and in in the anisotropic case. We refer the reader to [frazier2009b] for further details, as the analysis done for bandits with correlated Normal arms do apply directly.

### Proofs

#### Proof of Proposition 1

Let us consider the “base case” of $n = 1$ and define a higher-order function $g$, capturing the 1-step Bellman optimality equation for a state-action $(s, a)$:

$$g(h) := \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a'} h(s', a') \right].$$

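Then $\psi_1(s, a|\omega_{1:t}) = \mathbb{E}[g(Q^*_0|\omega_{1:t})]$ and $\phi_1(s, a|\omega_{1:t}) = g(\mathbb{E}[Q^*_0|\omega_{1:t}])$. Because $g$ is a convex function, we have $\psi_1(s, a|\omega_{1:t}) \geq \phi_1(s, a|\omega_{1:t})$ by Jensen’s inequality. We use this to prove the upper bound in Proposition 1: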

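$$\begin{aligned}
\psi_1(s, a|\omega_{1:t}) &= \mathbb{E}_{\Omega_{1:k}} \left[ \psi_1(s, a|\omega_{1:t}\Omega_{1:k}) \right] \\
&\geq \mathbb{E}_{\Omega_{1:k}} \left[ \phi_1(s, a|\omega_{1:t}\Omega_{1:k}) \right],
\end{aligned}$$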

where the first equality is due to Equation 1. For the lower bound, we have

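$$\begin{aligned}
\mathbb{E}_{\Omega_{1:k}} \left[ \phi_1(s, a|\omega_{1:t}\Omega_{1:k}) \right] &= \mathbb{E}_{\Omega_{1:k}} \left[ g\left( \mathbb{E}[Q^*_0|\omega_{1:t}\Omega_{1:k}] \right) \right] \\
&\geq g\left( \mathbb{E}_{\Omega_{1:k}} \left[ \mathbb{E}[Q^*_0|\omega_{1:t}\Omega_{1:k}] \right] \right) \\
&= g\left( \mathbb{E}[Q^*_0|\omega_{1:t}] \right) \\
&= \phi_1(s, a|\omega_{1:t}).
\end{aligned}$$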

These inequalities also hold for $n > 1$ for the same reasons. We omit the proof.

#### Proof of Proposition 2

Let $\bar{\omega}^*$ denote the optimal candidate computation (of length 1), which minimizes Bayesian simple regret in expectation in one step. That is,

$$\bar{\omega}^* := \arg\min_{\bar{\omega}} \mathbb{E}_{\Omega} \left[ R_f(s_\rho, \omega_{1:t}\Omega) \right]$$

where $\Omega$ is the closure corresponding to $\bar{\omega}$. Then, subtracting $R_f(s_\rho, \omega_{1:t})$, which does not depend on $\bar{\omega}$, we get

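$$\bar{\omega}^* = \arg\min_{\bar{\omega}} \left[ \mathbb{E}_{\Omega} \left[ R_f(s_\rho, \omega_{1:t}\Omega) \right] - R_f(s_\rho, \omega_{1:t}) \right].$$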

The first terms of the regrets cancel out as