# Computing the Value of Computation for Planning

Can Eren Sezener
October 2018
###### Abstract

An intelligent agent performs actions in order to achieve its goals. Such actions can either be externally directed, such as opening a door, or internally directed, such as writing data to a memory location or strengthening a synaptic connection. Some internal actions, to which we refer as computations, potentially help the agent choose better actions. Considering that (external) actions and computations might draw upon the same resources, such as time and energy, deciding when to act or compute, as well as what to compute, is critical to the performance of an agent.

In an environment that provides rewards depending on an agent’s behavior, an action’s value is typically defined as the sum of expected long-term rewards succeeding the action (itself a complex quantity that depends on what the agent goes on to do after the action in question). However, defining the value of a computation is not as straightforward, as computations are only valuable in a higher order way, through the alteration of actions.

This thesis offers a principled way of computing the value of a computation in a planning setting formalized as a Markov decision process. We present two different definitions of computation values: static and dynamic. They address two extreme cases of the computation budget: affording calculation of zero or infinitely many steps in the future. We show that these values have desirable properties, such as temporal consistency and asymptotic convergence.

Furthermore, we propose methods for efficiently computing and approximating the static and dynamic computation values. We describe a sense in which the policies that greedily maximize these values can be optimal. Furthermore, we utilize these principles to construct Monte Carlo tree search algorithms that outperform most of the state-of-the-art in terms of finding higher quality actions given the same simulation resources.

Year: 2018

Supervisor: Prof. Peter Dayan
University College London

###### Acknowledgements.
I feel incredibly privileged to have had Peter Dayan as my thesis supervisor. I am especially appreciative of him always going through my work with great care and providing helpful feedback. His mathematical rigor and scrutiny have greatly helped me become a better thinker. Not every master’s student is as lucky. I am very thankful to Klaus Obermayer for being an excellent internal supervisor and helping with many aspects of my thesis process and studies. Special thanks to Mehdi Keramati for supervising my first lab rotation, which has been pivotal for my research and career. In fact, the topics he has introduced me to have strongly influenced my interests, and paved the way for this thesis. I am grateful to my bachelor’s advisor Erhan Öztop for enabling me to experience the joy of scientific discovery and consequently setting me on the path of research. My master’s journey has been a long one. I would like to thank my friends and colleagues for making it also an enjoyable one: Murat and Hüseyin—for being excellent general-purpose buddies, Milena—for the thesis-writing-solidarity, Maraike—for the sanity-boosting, the Gatsby Unit—for providing an excellent, high information-content environment during my six-month visit, and the BCCN family, especially my classmates, “the year above”-mates, the faculty, Robert, and Margret—for many things, enumeration of which could only be done justly in the Supporting Information. Many thanks to colleagues whose input has shaped this thesis: Florian Wenzel for giving me a better grasp of Gaussian processes, Georg Chechelnizki for applying his logic superpowers to crush the trivial xor inconsistent statements on the abstract and helping with the writing of “Zusammenfassung”, Greg Knoll for his feedback on my thesis defense, and Fiona Mild for all the extracellular dopamine. Additional thanks to TEV and DAAD for the generous scholarship that made my studies significantly more comfortable. 
Finally, I am incredibly grateful to my parents, Çiğdem and Erdal, for their unconditional support.

## Chapter 1 Introduction

“First say to yourself what you would be; and then do what you have to do.”

– Epictetus, Discourses

Such questions do not only concern chess players. Rather, variants of them present themselves on a daily basis, where one has to decide between thinking and acting, as well as what to think about. Reckless behavior, “analysis-paralysis”, and rumination might be seen as the negative consequences of thinking too little or too much than is ideal, or of thinking about the “wrong” things.

An intelligent robot (or its programmer) has to deal with the same questions as well. For a mobile robot, acting and computing (i.e., thinking) often draw upon the same resources such as time and energy. Therefore, a resource-bounded robot needs a policy to decide when to act, when to compute, and what to compute.

In this thesis, we address these questions by proposing ways in which one could assign values to computations—what is typically referred to as the value of computation (VOC). We build upon the existing body of literature on VOC-like measures, which we discuss later in this chapter. We do so by operating in a Bayesian framework, where the agent is uncertain about the values of possible actions, and decreases its uncertainty by performing computations.

As we later discuss, this is not a straightforward task for a variety of reasons. For instance, the value of an action does not only depend on the present computation, but it also depends on future computations. A chess player running out of clock-time might decide to ignore a seemingly complicated move if she thinks she might not have enough time to think properly through all the complications resulting from the move.

Our proposal addresses this problem to some extent. We propose static and dynamic action value functions, which respectively cover the two extremes of zero and infinitely many future computations. Furthermore, we argue that the current work on VOC-like measures and related topics is limited in two severe ways: (I) it exclusively focuses on what we refer to as static values—ignoring the effects of future computations; (II) it assumes a bandit-like scenario where action values are stationary—whereas, in planning, action values improve in expectation as the agent performs more computations. Our formulations address these two limitations.

In the theoretical part of the thesis, we discuss the properties of static and dynamic values, and define the value of computation as the expected increase in these values. We show that choosing static/dynamic-value-maximizing computations is optimal in certain cases. Furthermore, we provide exact and efficient approximate algorithms for computing these values. In the applied part of the thesis, we give a Monte Carlo tree search algorithm based on VOC-maximization. We empirically demonstrate its efficacy by comparing it against other popular or similar methods in two different environments.

Before moving on to our discussion of computation values in the next chapter, we first provide some mathematical background, which will form the basis of our work. Then we discuss some of the work related to this thesis. Lastly, we provide an outline of the thesis.

### 1 Preliminaries

Below, we introduce multi-armed bandit problems and Markov decision processes, which we will utilize heavily throughout this thesis.

#### 1.1 Multi-armed bandit problems

A multi-armed bandit (MAB, or simply bandit) problem is composed of a set of actions $\mathcal{A}$, where each action is associated with a distinct reward distribution. Let $\mu$ be the expected reward function: $\mu(a) = \mathbb{E}[r \mid a]$ for $a \in \mathcal{A}$.

The goal then is to minimize some form of regret. In this thesis, we define the regret of an action $a$ as the expected reward one misses out on by taking it: $\mathcal{R}(a) = \max_{a' \in \mathcal{A}} \mu(a') - \mu(a)$. For instance, one can aim to minimize the cumulative regret, $\sum_{t=1}^{T} \mathcal{R}(a_t)$; or simply the regret at a particular time $t$, $\mathcal{R}(a_t)$, which is referred to as the simple regret. In either case, the actor cannot directly observe the regret, as $\mu$ is unknown, but typically forms estimates of it.
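To make these definitions concrete, here is a minimal sketch with a hypothetical three-armed bandit (the arm means and helper names are illustrative, not from the text):

```python
# Hypothetical 3-armed bandit: the expected rewards below are unknown
# to the agent; we use them only to *evaluate* regret after the fact.
mu = [0.3, 0.5, 0.7]      # expected reward of each arm
best = max(mu)            # expected reward of the optimal arm

def regret(a):
    """Expected reward missed out on by taking action a."""
    return best - mu[a]

def cumulative_regret(arms):
    """Cumulative regret of a sequence of pulls a_1, ..., a_T."""
    return sum(regret(a) for a in arms)

# Simple regret is just the regret of the single action taken:
print(regret(2))                                      # optimal arm: 0.0
print(round(cumulative_regret([0, 1, 2, 2, 2]), 6))   # 0.4 + 0.2 = 0.6
```

Minimizing cumulative regret rewards exploring cautiously along the way, while minimizing simple regret only cares about the quality of the final choice—a distinction that recurs throughout this thesis.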

#### 1.2 Markov decision processes

A finite Markov decision process (MDP) is a $5$-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where

• $\mathcal{S}$ is a finite set of states

• $\mathcal{A}$ is a finite set of actions

• $\mathcal{P}$ is the transition function such that $\mathcal{P}^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$, where $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$

• $\mathcal{R}$ is the expected immediate reward function such that $\mathcal{R}^a_{ss'} = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']$, where again $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$

• $\gamma$ is the discount factor such that $\gamma \in [0, 1]$.

We assume an agent interacts with the environment via a (potentially stochastic) policy $\pi$, such that $\pi(s, a) = P(a_t = a \mid s_t = s)$. This quantity is typically conditioned on some parameters that determine the policy of the agent, which we leave out of the notation here. The agent aims to maximize the expected value of the cumulative discounted reward $R_t$, where $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, by controlling its policy $\pi$.

These expected values are typically estimated by planning and reinforcement learning methods in some form. As such, they get preferential treatment in the notation, and are simply referred to as values, given by value functions. The value of a state is given by the state value function $V^\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[R_t \mid s_t = s\right]. \tag{1}$$

Similarly, the value of a state-action pair is given by the state-action value function $Q^\pi$:

$$Q^\pi(s,a) = \mathbb{E}_\pi\left[R_t \mid s_t = s, a_t = a\right]. \tag{2}$$

Both of these value functions can be defined recursively, by relating a state (or a state-action) to its neighboring states (or state-actions), by the Bellman equations:

$$Q^\pi(s,a) = \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma \sum_{a'} \pi(s',a')\, Q^\pi(s',a')\right] \tag{3}$$

$$V^\pi(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]. \tag{4}$$
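The Bellman equations translate directly into iterative policy evaluation. A minimal sketch on a hypothetical two-state MDP (the transition table, rewards, and function names are illustrative; rewards are simplified to depend only on the state-action pair):

```python
# P[s][a] maps next-state -> probability; R[s][a] is the expected
# immediate reward (assumed independent of s' here for brevity).
P = {0: {0: {0: 1.0},         1: {1: 1.0}},
     1: {0: {0: 0.5, 1: 0.5}, 1: {1: 1.0}}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
gamma = 0.9
policy = {0: {0: 0.5, 1: 0.5},   # pi(s, a): probability of a in s
          1: {0: 0.0, 1: 1.0}}

def evaluate(policy, iters=1000):
    """Iterate the Bellman equation for V^pi until (numerical) convergence."""
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {s: sum(policy[s][a]
                    * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                    for a in P[s])
             for s in P}
    return V

V = evaluate(policy)
# Q^pi then follows from V^pi by one backup:
Q = {(s, a): R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
     for s in P for a in P[s]}
```

In this toy MDP, state 1 under the given policy self-loops on action 1, so its value solves $V(1) = 2 + 0.9\,V(1) = 20$; the iteration recovers this numerically.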

Solving a planning or a reinforcement learning problem boils down to finding an optimal policy $\pi^*$ that is at least as good as all other policies. In other words, for an optimal policy

$$\pi^* = \operatorname*{argmax}_\pi V^\pi(s) \tag{5}$$

must hold for all $s \in \mathcal{S}$.

State and state-action values of an optimal policy are given respectively by the optimal state value function $V^*$ and the optimal state-action value function $Q^*$. Formally,

$$V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s) \tag{6}$$

for all $s \in \mathcal{S}$, and

$$Q^*(s,a) = Q^{\pi^*}(s,a) = \max_\pi Q^\pi(s,a), \tag{7}$$

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. These optimal value functions can also be defined recursively, through what we refer to as the Bellman optimality equations:

$$Q^*(s,a) = \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma \max_{a'} Q^*(s',a')\right] \tag{8}$$

$$V^*(s) = \max_{a} \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^*(s')\right]. \tag{9}$$
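The Bellman optimality equations suggest value iteration: repeatedly apply the update for $V^*$ until convergence, then act greedily. A hedged sketch on a hypothetical two-state MDP (all names and numbers are illustrative):

```python
# Same toy conventions as before: P[s][a] is a dict next-state -> prob,
# R[s][a] an expected immediate reward, both invented for illustration.
P = {0: {0: {0: 1.0}, 1: {1: 1.0}},
     1: {0: {0: 1.0}, 1: {1: 1.0}}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
gamma = 0.9

def value_iteration(tol=1e-10):
    """Iterate the Bellman optimality backup for V* to a fixed point."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                        for a in P[s])
                 for s in P}
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new

V_star = value_iteration()
# The optimal (greedy) policy picks argmax_a Q*(s, a):
pi_star = {s: max(P[s], key=lambda a: R[s][a]
                  + gamma * sum(p * V_star[s2] for s2, p in P[s][a].items()))
           for s in P}
```

Here the greedy policy takes action 1 everywhere: jumping to the self-looping reward-2 state dominates, giving $V^*(1) = 2/(1-\gamma) = 20$ and $V^*(0) = 1 + \gamma\,V^*(1) = 19$.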

### 2 Related work

We draw upon two different lines of research. The first considers the value of resolving one’s uncertainty, and relates to ideas such as the value of information, knowledge gradients, and metareasoning. The second line is concerned with finding approximately optimal actions in known environments—that is, approximate planning.

#### 2.1 Information values and metareasoning

We draw heavily on Howard’s work on information value theory [1], which is concerned with the economic advantage gained through reducing one’s uncertainties. The original form of this work addresses the question of how much one should pay to obtain a piece of information. The problem we investigate in this thesis is analogous; however, instead of the value of an external piece of information, we are interested in the value of an internal piece of information—one that results from thinking or computing.

Formal approaches to this problem include the work of Russell and Wefald [2], which lays down the theoretical foundations of rational metareasoning—finding and performing optimal internal actions (i.e., computations). However, as noted by the authors, solving the metareasoning problem optimally—that is, performing the optimal sequence of computations—is intractable for several reasons. One of the reasons is that optimal metareasoning is a harder computational problem than optimal reasoning, as recently proven by Lin et al. [3] for a general class of problems. This is because finding the optimal solution to the metareasoning problem imposes a great computational burden at the level of meta-metareasoning, optimization of which requires an even more meta reasoner. In other words, optimality at one level can only be achieved at great cost at the next (more meta) level. One reason for this is that (meta) state and action spaces grow as one goes up the metareasoning hierarchy. In many settings with a good reasoner, perfect metareasoning is likely to perform worse than no metareasoning—in the sense that the latter obtains the same solution but at a lower computational cost. That being said, some metareasoning can be better than either extreme.

We can perhaps better understand this with a neural network analogy. Consider three ways of updating the weights of a network: random walk, stochastic gradient descent (SGD), and Newton’s method. Random walk is the cheapest to compute (similar to no metareasoning) but performs the worst. Newton’s method is similar to perfect metareasoning: it is the costliest to compute, but is locally optimal. However, SGD, which falls between the two in terms of computational cost, often performs the best—even though its updates are noisy and thus suboptimal. One reason is that the computation spent on performing one “optimal” update can instead be spent on performing many sub-optimal updates, which often yields better results in practice.

Thus, Hay and others [4, 5] utilize “myopic” metareasoning—which is akin to performing computations that maximize the immediate (rather than the long-term) rewards of the meta-level problem (the definition of which can be found in the cited work)—to outperform non-meta reasoners in the context of approximate planning.

A parallel line of progress has recently been made by Ryzhov, Frazier, Powell, and their colleagues [6, 7, 8]. They propose the knowledge gradient (KG)—which closely resembles Howard’s information values—and show that it solves Bayesian optimization (specifically, selection and ranking) problems effectively.

We can consider two kinds of planning problems: “stateless” and “stateful”. In the former, the outcomes of actions are path independent; that is, if two action sequences are permutations of each other, then they must be equally rewarding in expectation. Path-dependence may exist in the latter kind. To elaborate, in stateful problems, actions not only yield rewards but also modify the state of the agent; therefore, the ordering of actions does impact the reward expectations. Multi-armed bandits are examples of stateless problems, and Markov decision processes are typically stateful.

To the best of our knowledge, all the work on rational metareasoning and knowledge gradients either focuses on stateless problems, or treats stateful problems as stateless ones. For example, in rational-metareasoning-based Monte Carlo tree search [4], the authors make a simplifying statelessness assumption by treating the root actions as “bandit arms” with fixed expected rewards attached—even though the rewards depend on the policy of the agent, and thus are not fixed. As also noted by the authors, this is perhaps the most significant shortcoming of their method. In this thesis we partially lift this constraint by extending metareasoning (or knowledge gradients) to distal (i.e., non-immediate) actions. Our remedy is only partial, because the extent of actions for which metareasoning is to be leveraged must be determined in advance. “Stateful” problems are typically referred to as planning problems, which we explain next.

#### 2.2 Approximate planning

Planning consists of finding an optimal sequence of external actions given an environment and a goal. When formalized as an MDP, the goal is to find an optimal or a near-optimal policy. However, achieving optimality is typically intractable, and one has to resort to approximately optimal solutions. Monte Carlo tree search (MCTS) is a popular family of methods for finding such solutions. MCTS methods build a search tree of possible future states and actions by using a model to simulate random trajectories. Good MCTS methods exploit nascent versions of the tree to make judicious choices of what to simulate. One main computational advantage of MCTS comes from the concept of sparse sampling [9]: it suffices to build a backup (expectimax) tree that is sparse, and as such, to avoid the curse of dimensionality associated with building complete backup trees.

UCT [10], which we describe in detail later, is a popular MCTS method that performs very well in many domains. Roughly speaking, UCT breaks down how simulations are chosen in MCTS into smaller sub-problems, each of which is then treated as a bandit problem. UCT is asymptotically optimal—meaning that, given infinite computation time, it finds the best action. However, it is also known to be biased in the sense that it explores less than is optimal when the goal is to minimize simple regret, as in MCTS [11].

This is because UCT treats MCTS—where the goal is to minimize the (simple) regret of the action eventually taken—as a traditional bandit problem—where the goal is to minimize the cumulative regret. To put it another way, in MCTS, simulations are valuable only because they help select the best action. However, UCT actually aims to maximize the sum of rewards obtained in simulations, rather than paying attention to the quality of actual (i.e., not simulated) actions. Consequently, it tries to avoid simulations with potentially low rewards, even though they might help select better actions. In fact, Hay et al. [4] improve the performance of UCT just by adding a very limited metareasoning capability, namely by computing an information value score for picking the immediate action from which to run a simulation, and then resorting to vanilla UCT for the rest of the decisions down the tree. The practical part of this thesis is concerned with pushing this envelope by building MCTS methods with better metareasoning capabilities.
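The cumulative-regret flavor can be seen in the selection rule itself. The following is a rough, illustrative sketch of a UCB1-style rule of the kind UCT applies at each tree node—not the authors' implementation; the function name and exploration constant are assumptions:

```python
import math

def ucb1_select(children, c=math.sqrt(2)):
    """Pick the child maximizing Q_hat + c * sqrt(ln N / n).

    children: list of (mean_reward_estimate, visit_count) pairs.
    Unvisited children are tried first.
    """
    N = sum(n for _, n in children)
    best_i, best_score = None, -float("inf")
    for i, (q_hat, n) in enumerate(children):
        if n == 0:
            return i  # always expand unvisited actions first
        score = q_hat + c * math.sqrt(math.log(N) / n)
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# A rarely tried low-estimate arm outranks a well-explored better arm,
# because the exploration bonus shrinks with the visit count:
print(ucb1_select([(0.8, 100), (0.2, 2)]))  # → 1
```

The score balances the running reward estimate against an exploration bonus—exactly the trade-off appropriate for cumulative regret, rather than for identifying the single best root action.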

### 3 Thesis organization

The thesis is organized as follows. In Chapter 2, we introduce different notions of computation values and provide formal definitions. In Chapter 3, we discuss how computation values can be computed or approximated efficiently. In Chapter 4, we materialize our ideas into a Monte Carlo tree search algorithm which leverages computation values. In Chapter 5, we evaluate our methods, together with other popular methods, in two different environments, and show that our computation-value-based algorithms perform either better than all other alternatives, or comparably to the best alternative.

## Chapter 2 Value of Computation

“In truth, either reason is a mockery, or it must aim solely at our contentment, and the sum of its labors must tend to make us live well and at our ease…”

– Michel de Montaigne, Essays

We begin by giving a general formal definition of computations and computation value. Then, we introduce additional assumptions, and consider multi-armed bandit problems and Markov decision processes.

### 4 Computations as higher-order functions

Let us leave MDPs and bandits aside for a moment, and consider a more abstract setting. Let $\mathcal{S}$ be the set of possible states of the agent, $\mathcal{A}$ be the set of possible actions, and $Q$ be the value function of the agent indicating the desirability of state-actions. Generally, we consider an action desirable if it is expected to lead to a valuable outcome.

Assume the agent has some control over $Q$—which can either result from obtaining better knowledge of the possible outcomes of actions and their respective probabilities, or can be a more direct form of control, like deeming hard-to-reach grapes sour, as in Aesop’s fable. The latter corresponds to the alteration of utility functions (which assign values to outcomes), i.e., primary reinforcers; the former, of outcome probabilities, and thus of expected utilities (secondary reinforcers) as well. We refer to both kinds of alterations as computations, and to both utilities and expected utilities as values for now. When we discuss computation values in Markov decision processes and multi-armed bandits, we will provide more concrete definitions, and limit the effect of computations to expected-utility alterations only.

In real-time planning schemes such as Monte Carlo tree search (MCTS)—which we introduce in a subsequent chapter—the agent aims to maximize the value of the best action, $\max_{a \in \mathcal{A}_s} Q(s,a)$, where $\mathcal{A}_s$ is the set of actions available at state $s$, by performing computations. (In MCTS, there are true values—those given by the Bellman optimality equation—determined by the environment; at the moment, however, we are considering a more abstract and general notion of values.) In general, we can think of a computation as a higher-order function $\omega$ transforming the value function, $Q \mapsto \omega[Q]$. The value of the best action at state $s$ after performing the computation will be $\max_{a \in \mathcal{A}_s} \omega[Q](s,a)$, causing a net change of

$$\max_{a \in \mathcal{A}_s} \omega[Q](s,a) - \max_{a \in \mathcal{A}_s} Q(s,a). \tag{10}$$

This is the value of a deterministic computation whose outcome is known to the agent, which the agent aims to maximize via,

$$\operatorname*{argmax}_{\omega \in \Omega}\left[\max_{a \in \mathcal{A}_s} \omega[Q](s,a) - \max_{a \in \mathcal{A}_s} Q(s,a)\right], \tag{11}$$

where $\Omega$ is the set of possible computations.

Even though we would like to determine the optimizing computation so that we could perform it, how to do so advantageously is not clear. If one had to actually perform all the computations in order to pick the best one, the exercise would likely be counterproductive from a cost-benefit point of view, as it would require paying the cost of many computations in return for performing a single computation.

However, if the agent has predictions concerning the effects of computations, possibly obtained by storing descriptive statistics of previous computations’ outcomes, this problem can be circumvented. We operate under this assumption, and redefine $\omega$ as a random-valued higher-order function that maps a value function to a random value function—as stochastic computations include deterministic ones as a subset. Then, for a known $Q$, $\omega[Q](s,a)$ is a random variable in $\mathbb{R}$. This reflects the fact that the agent is not entirely sure about the effect of a computation, and instead has a probability measure over the set of possible effects. Then, we obtain the value of computation (VOC) for $\omega$ as

$$\textsc{VOC}(\omega) = \mathbb{E}\left[\max_{a \in \mathcal{A}_s} \omega[Q](s,a)\right] - \max_{a \in \mathcal{A}_s} Q(s,a). \tag{12}$$

On the other hand, the commonly used definition [2, 4, 12] is slightly different, which we denote with $\textsc{VOC}'$:

$$\textsc{VOC}'(\omega) = \mathbb{E}\left[\max_{a \in \mathcal{A}_s} \omega[Q](s,a)\right] - \mathbb{E}\left[\omega[Q]\left(s, \operatorname*{argmax}_{a \in \mathcal{A}_s} Q(s,a)\right)\right]. \tag{13}$$

Let us rewrite the VOC definition in Equation 12 to make it more compatible with Equation 13,

$$\textsc{VOC}(\omega) = \mathbb{E}\left[\max_{a \in \mathcal{A}_s} \omega[Q](s,a)\right] - \mathbb{E}\left[Q\left(s, \operatorname*{argmax}_{a \in \mathcal{A}_s} Q(s,a)\right)\right] \tag{14}$$

$$\phantom{\textsc{VOC}(\omega)} = \mathbb{E}\left[\max_{a \in \mathcal{A}_s} \omega[Q](s,a)\right] - Q\left(s, \operatorname*{argmax}_{a \in \mathcal{A}_s} Q(s,a)\right). \tag{15}$$

The difference is subtle. $\textsc{VOC}'$ uses the new expected value of the previously best action as the baseline, whereas $\textsc{VOC}$ uses the old value of the previously best action. Thus, $\textsc{VOC}'$ only takes the differences due to policy changes into account. Consider a case where actions have objective values, meaning they are solely given by the environment, such as in a multi-armed bandit task, which we will formalize later on. In this case, computations cannot directly modify the values; however, they are useful because they can modify the agent’s posterior expectations of those values, which in turn benefits the agent because he can make a better decision concerning what action to take. $\textsc{VOC}'$ assumes this setting of objective values and therefore values a computation solely based on its impact on the agent’s policy at his current state.

Lastly, it follows from the definition in Equation 13 that $\textsc{VOC}'(\omega) \geq 0$. However, we cannot say much about $\textsc{VOC}$ yet. For this, we need further assumptions, which we introduce next.

### 5 Computations as evidence

Up until now, we have assumed , the value function, is known to the agent. Here, we relax this assumption, and treat it as a quantity that the agent estimates by performing computations. Computations might achieve this in two distinct ways: by resolving empirical uncertainty or computational uncertainty. The former uncertainty is due to insufficient external data. The latter type is due to insufficient processing of the existing external data. Consider the problem of estimating the parameters of a generative model from a dataset using Markov Chain Monte Carlo. In order to get a more accurate estimate, one can either collect more data, or run more Monte Carlo simulations. Alternatively, consider a tennis robot getting ready for a shot. He would like to reduce his uncertainty about what the best motor commands are. He can achieve this by collecting more data concerning the velocity and the position of the ball using his sensors. Or, he can use the data he has, but spend processor cycles unrolling the physics model he has into the future. In many cases, the total uncertainty is a combination of both empirical and computational uncertainties. That being said, we focus on resolving uncertainties of this latter kind—that is, computational—and thus consider planning in fully known environments. However, the results we obtain should transfer to the cases where the agent also has to learn the statistics of the environment (e.g., as in model-based learning).

Agents that operate in unknown environments typically collect statistics concerning the outcomes of their actions in order to make better decisions. We take this a step further, and assume the agent also keeps statistics concerning his past computations and their outcomes. Subsequently, we take a Bayesian approach and assume the value function, $Q$, is sampled from a known distribution, which could be given by the past experience of the agent. Note that, as before, we do not enforce the Bellman equation constraint yet, and treat the $Q$-function in a more general sense—state-actions that have higher $Q$-values are more desirable. The agent estimates this unknown $Q$-function by performing computations. Here, we change our formalization of computations for one last time. A computation is no longer a higher-order function, but is a function that outputs imperfect estimates of state-action values, which we refer to as observations. At a given time $t$, the agent will have performed a sequence of computations $\omega_{1:t}$ whose respective outcomes are $o_{1:t}$. Each computation belongs to a finite set $\Omega$, and each outcome belongs to a potentially infinite set $\mathcal{O}$. For notational simplicity, we assume that the outcome of a computation formally includes information about the computation from which it resulted. That is, if a computation $\omega$ yields information about an $(s,a)$, then the outcome of $\omega$ is in $\mathcal{S} \times \mathcal{A} \times \mathbb{R}$—as it carries information about a state-action and its value. (More generally, a computation might carry information about multiple state-actions; the outcome of the computation would then lie in the corresponding product space.) The agent uses $o_{1:t}$ to get a better estimate of $Q$, namely the posterior distribution of $Q$ given $o_{1:t}$. We introduce a random function $Z$, whose domain is $\mathcal{S} \times \mathcal{A}$, such that $Z(s,a \mid o_{1:t}) := Q(s,a) \mid o_{1:t}$ for all state-actions. In other words, the $Z$-function is the posterior distribution of the unknown $Q$-function conditioned on $o_{1:t}$.

We would like to evaluate the value of a computation prior to performing the computation itself. Therefore, we can think of a meta-computation, which evaluates the potential outcomes of a computation $\omega$. More specifically, we assume computations draw samples from $Q(s,a) + \epsilon$, where $\epsilon$ is a stochastic noise term capturing the imperfection of computations; thus, meta-computations draw samples from $Z(s,a \mid o_{1:t}) + \epsilon$, which is the posterior predictive distribution. An alternative formulation for meta-computations would be to draw samples directly from $Z(s,a \mid o_{1:t})$, without including the noise term $\epsilon$. This, in a sense, captures how much one could learn by performing sufficiently many computations that the noise cancels out due to the Law of Large Numbers, and is equivalent to the expected improvement formulation. However, we would like to quantify how much we could learn from a single computation; thus, the noise is included.

Previously, when computations were higher-order functions, we denoted the expected value of the function $Q$ after a computation $\omega$ as $\mathbb{E}[\omega[Q](s,a)]$. Now, $\omega$ yields an observation about $Q$, rather than transforming it. Therefore, we now denote the same value with $\mathbb{E}_{o \sim \omega}[\mathbb{E}[Z(s,a \mid o_{1:t} \oplus o)]]$, which captures the expected value of $Z$ after performing $\omega$. The process of evaluating this expression is a meta-computation, as it carries information about computation $\omega$. In most settings we will introduce, the cost of performing a meta-computation will be negligible compared to the cost of performing a computation. As such, we only apply the cost-benefit analysis to the computations themselves.

We assume there is a one-to-one mapping between actions and computations for a given state; that is, between $\mathcal{A}_s$, the actions available at $s$, and $\Omega_s$, the computations that directly affect the value estimates at $s$, such that each $\omega \in \Omega_s$ corresponds to sampling a specific $a \in \mathcal{A}_s$ and vice versa. Therefore, from now on, performing a computation and sampling an action are used interchangeably. This assumption is not necessary, but it is convenient to introduce now, as it is both intuitive and common in planning algorithms; consider as an example the roll-outs of the UCT algorithm [10], whose goal is to provide information about the values of root actions. Note that, despite this one-to-one mapping between computations and actions, a computation can have an effect on multiple state-action values, given the potential dependencies. The hypothetical outcome $o$ of a candidate computation $\omega$ affects $Z(s,a \mid o_{1:t})$ for all $(s,a)$ and transforms it to $Z(s,a \mid o_{1:t} \oplus o)$, where $\oplus$ is the concatenation operator.

We avoid using the notation $o_{1:t+1}$ for $o_{1:t} \oplus o$, because $o_{1:t}$ includes the outcomes of previously performed computations, whereas $o$ is the hypothetical outcome of a candidate computation, which has a different distribution (e.g., the posterior predictive) than the computations in $o_{1:t}$.

However, both past and hypothetical future computation outcomes have the same type; i.e., they belong to $\mathcal{O}$. As such, they can form a sequence when concatenated. Note that, for a given $o_{1:t}$, $\mathbb{E}[Z(s,a \mid o_{1:t})]$ is a value in $\mathbb{R}$. However, for an unknown hypothetical outcome $o$, $\mathbb{E}[Z(s,a \mid o_{1:t} \oplus o)]$ is a random variable, and $Z(\cdot,\cdot \mid o_{1:t} \oplus o)$ is a random function.

For all $\omega \in \Omega$, we have

$$\mathbb{E}_{o \sim \omega}\left[\mathbb{E}[Z(s,a \mid o_{1:t} \oplus o)]\right] = \mathbb{E}[Z(s,a \mid o_{1:t})], \tag{16}$$

due to the law of total expectation. This equality suggests that the beliefs of the agent are coherent and cannot be exploited by a Dutch book. In our formulation, this is achieved trivially, because we assume we know the prior distribution of $Q$. In the experiments section we show that, even when we do not know the true sampling distribution of $Q$ (i.e., when we have to assume one), algorithms based on this principle can perform very well. Furthermore, we postulate that humans likely leverage some form of informed priors, as we do here, which could be satisfied by a TD-learning-like rule that learns the second moments of action values in addition to the first.
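Equation 16 can be checked numerically. A minimal sketch under an assumed conjugate-Gaussian model (an illustrative choice, not one fixed by the text; all numbers are hypothetical):

```python
import random
import statistics

random.seed(0)
mu, var = 0.8, 2.0       # current posterior over Q(s,a): N(mu, var)
noise_var = 1.0          # variance of the computation noise eps

def posterior_mean_after(o):
    """Standard Gaussian conjugate update of the mean after observing o."""
    k = var / (var + noise_var)   # Kalman-style gain
    return mu + k * (o - mu)

# Hypothetical outcomes o are drawn from the posterior predictive
# N(mu, var + noise_var); Equation 16 says the *expected* posterior
# mean equals the current mean.
samples = [random.gauss(mu, (var + noise_var) ** 0.5) for _ in range(200000)]
lhs = statistics.fmean(posterior_mean_after(o) for o in samples)
print(abs(lhs - mu) < 0.05)  # → True
```

A learning rule that tracked only first moments could not run this check; the gain `k` requires the second moments as well, in line with the postulate above.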

If we consider the maximum of action values under computations, given that $\max$ is a convex function, we have

$$\mathbb{E}_{o \sim \omega}\left[\max_{a \in \mathcal{A}_s} \mathbb{E}[Z(s,a \mid o_{1:t} \oplus o)]\right] \geq \max_{a \in \mathcal{A}_s} \mathbb{E}[Z(s,a \mid o_{1:t})], \tag{17}$$

due to Jensen’s inequality. Most results in this chapter are derived directly from these two relations. Equation 16 states that although the posterior distribution of $Z$ may change after a computation, its expectation remains the same. Despite this, the maximum of the posterior expectations is expected to increase or remain constant, as implied by Equation 17. In this setting, let us reintroduce the value of computation definitions, which we will use for the rest of the text:

###### Definition 1.

We (re)define the value of computation,

$$\textsc{VOC}(\omega; s, o_{1:t}) = \mathbb{E}_{o \sim \omega}\left[\max_{a \in \mathcal{A}_s} \mathbb{E}[Z(s,a \mid o_{1:t} \oplus o)]\right] - \max_{a \in \mathcal{A}_s} \mathbb{E}[Z(s,a \mid o_{1:t})], \tag{18}$$

where $o = Q(s,a) + \epsilon$ for some $a \in \mathcal{A}_s$ and a noise term $\epsilon$.

###### Definition 2.

The alternative formulation, following Eq. 13, is

$$\textsc{VOC}'(\omega; s, o_{1:t}) = \mathbb{E}_{o \sim \omega}\left[\max_{a \in \mathcal{A}_s} \mathbb{E}[Z(s,a \mid o_{1:t} \oplus o)] - \mathbb{E}\left[Z\left(s, \operatorname*{argmax}_{a \in \mathcal{A}_s} \mathbb{E}[Z(s,a \mid o_{1:t})]\right) \,\middle|\, o_{1:t} \oplus o\right]\right]. \tag{19}$$
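To make Definitions 1 and 2 concrete, here is a minimal Monte Carlo sketch under assumed independent Gaussian posteriors per action with Gaussian computation noise (an illustrative model; `voc_pair` and all numbers are hypothetical):

```python
import random
import statistics

random.seed(1)
mu = {"a1": 1.0, "a2": 0.8}    # posterior means E[Z(s,a | o_{1:t})]
var = {"a1": 0.1, "a2": 2.0}   # posterior variances
noise_var = 1.0                 # variance of the noise term eps
n_sim = 100000                  # meta-computation: simulated outcomes o

def voc_pair(arm):
    """Monte Carlo estimates of VOC and VOC' for sampling `arm` once."""
    a_best = max(mu, key=mu.get)             # currently best action
    k = var[arm] / (var[arm] + noise_var)    # Gaussian conjugate gain
    new_max, new_best_old = [], []
    for _ in range(n_sim):
        # o ~ posterior predictive N(mu[arm], var[arm] + noise_var)
        o = random.gauss(mu[arm], (var[arm] + noise_var) ** 0.5)
        post = dict(mu)
        post[arm] = mu[arm] + k * (o - mu[arm])   # updated posterior mean
        new_max.append(max(post.values()))
        new_best_old.append(post[a_best])         # new value of old best
    voc = statistics.fmean(new_max) - max(mu.values())
    voc_prime = statistics.fmean(new_max) - statistics.fmean(new_best_old)
    return voc, voc_prime

voc, voc_p = voc_pair("a2")   # sample the uncertain arm
```

Since the sampled computation is at the current state (i.e., $\omega \in \Omega_s$), the two estimates coincide here, in line with Proposition 1 below; sampling the highly uncertain arm also yields a strictly positive value.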

We will compare these two definitions throughout this section. Recall that, for now, we assume computations directly inform us about the actions available at the current state $s$; more specifically, a computation draws samples from a specific action at $s$, and the result is used to potentially update the value estimates of all state-actions. Then we have:

###### Proposition 1.

If $\omega\in\Omega_s$, where $s$ is the current state of the agent, then

\[ \textsc{VOC}(\omega;s,o_{1:t}) = \textsc{VOC}'(\omega;s,o_{1:t}) \tag{20} \]
###### Proof.

We need to show that the second terms of Equations 18 and 19 are equal. Note that

\[ \max_{a\in A_s} E[Z(s,a\,|\,o_{1:t})] = E\big[Z\big(s, \operatorname*{argmax}_{a\in A_s} E[Z(s,a\,|\,o_{1:t})]\big)\,\big|\,o_{1:t}\big] \tag{21} \]

by rearrangement. Then we can utilize the law of total expectation to assert,

\[ E\big[Z\big(s, \operatorname*{argmax}_{a\in A_s} E[Z(s,a\,|\,o_{1:t})]\big)\,\big|\,o_{1:t}\big] = E_{o\sim\omega}\,E\big[Z\big(s, \operatorname*{argmax}_{a\in A_s} E[Z(s,a\,|\,o_{1:t})]\big)\,\big|\,o_{1:t}\oplus o\big]. \tag{22} \]

∎

A policy decides which action to take given a state. We now introduce meta-policies, which decide which computation to perform given a state and a sequence of past computation outcomes. A more formal treatment of meta-policies and meta-MDPs can be found in [4].

###### Definition 3.

VOC-greedy and VOC$'$-greedy are meta-policies that perform the computation that maximizes VOC and VOC$'$ respectively, until a fixed number of computations is reached or until the value of computation becomes non-positive for all computations. Having performed $t$ computations whose outcomes are $o_{1:t}$, if the agent takes an action, then he performs $\operatorname*{argmax}_{a\in A_s} E[Z(s,a\,|\,o_{1:t})]$.

Instead of performing a fixed number of computations, the agent might want to decide whether a computation is worthwhile, and only if so, perform it. An intuitive and somewhat popular [4, 2] stopping criterion is to continue computing as long as

\[ \max_{\omega\in\Omega} \textsc{VOC}(\omega;s,o_{1:t}) > c \tag{23} \]

where $c$ is the cost of performing a computation, typically in terms of time or energy.
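This stopping rule can be sketched as a simple loop; `voc`, `computations`, `cost`, and `perform` below are caller-supplied stand-ins (hypothetical names, not from the thesis):

```python
def meta_loop(voc, computations, cost, perform, max_steps=100):
    """VOC-greedy with the stopping rule of Eq. 23: keep performing the
    highest-VOC computation while its VOC exceeds the cost c.
    `voc` maps a computation to its current value; `perform` executes it
    (and, in a real agent, updates the posterior that `voc` reads)."""
    for _ in range(max_steps):
        best = max(computations, key=voc)
        if voc(best) <= cost:
            break  # no computation is worth its cost: act instead
        perform(best)
```

A toy run, where performing a computation depletes its value, illustrates how the loop alternates between computations and halts once all VOCs fall below the cost.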

We would like to evaluate the performance of meta-policies. The error function we are interested in minimizing is the so-called simple regret, because it captures the quality of the eventually selected action.

###### Definition 4.

The simple regret of a policy $\pi$ at state $s$ is defined as

\[ \textsc{SR}(s,\pi) = E_{a\sim\pi(s,\cdot)}\Big[\max_{a'\in A_s} Q(s,a') - Q(s,a)\Big] \tag{24} \]

where $\pi(s,\cdot)$ is a curried function whose domain is $A_s$.

However, in the current setting, the $Q$-function is given by the environment and is unknown. Therefore, at best, we can minimize our subjective estimate of simple regret, which captures how much regret the agent expects to face if he is to take an action right away, and which we refer to as Bayesian simple regret. Formally:

###### Definition 5.

We define the Bayesian simple regret (BSR) of a policy $\pi$ at state $s$ given a sequence of past computations $o_{1:t}$ as

\[ \textsc{BSR}(s,\pi;o_{1:t}) = E_{a\sim\pi(s,\cdot)}\Big[E\big[\max_{a'\in A_s} Z(s,a'\,|\,o_{1:t})\big] - E[Z(s,a\,|\,o_{1:t})]\Big] \tag{25} \]

where $\pi$ potentially takes $o_{1:t}$ into account.

It should be clear that the policy that minimizes BSR is a deterministic policy that assigns probability $1$ to the action $\operatorname*{argmax}_{a\in A_s} E[Z(s,a\,|\,o_{1:t})]$. Thus, we introduce the optimal Bayesian simple regret ($\textsc{BSR}^*$) as

\begin{align*}
\textsc{BSR}^*(s;o_{1:t}) &= \min_\pi \textsc{BSR}(s,\pi;o_{1:t}) \tag{26} \\
&= E\big[\max_{a\in A_s} Z(s,a\,|\,o_{1:t})\big] - \max_{a\in A_s} E[Z(s,a\,|\,o_{1:t})]. \tag{27}
\end{align*}

For a given policy $\pi$, we would like to see the relation between BSR and SR. Recall that, from the agent’s perspective, $Q \sim Z\,|\,o_{1:t}$. Thus, BSR is in a sense the subjective expected simple regret:

\[ \textsc{BSR}(s,\pi;o_{1:t}) = E_{Q\sim Z|o_{1:t}}\Big[\underbrace{E_{a\sim\pi(s,\cdot)}\big[\max_{a'\in A_s} Q(s,a') - Q(s,a)\big]}_{=\,\text{simple regret}}\Big] \tag{28} \]

where $Q$ is a random function.
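Eq 28 can be checked numerically. Below is a small sketch under the simplifying assumption of independent Gaussian posteriors per action (our assumption for illustration): computing BSR directly via Eq 25 for the greedy policy agrees with averaging simple regret over $Q$-functions sampled from the posterior.

```python
import numpy as np

def bsr_two_ways(mu, var, n_sim=200_000, rng=None):
    """Check Eq. 28: the BSR of the greedy policy equals the subjective
    expectation of simple regret under Q ~ Z | o_{1:t} (independent
    Gaussian posteriors assumed, for illustration only)."""
    rng = rng or np.random.default_rng(1)
    mu, var = np.asarray(mu, float), np.asarray(var, float)
    a = mu.argmax()  # greedy action on posterior means
    # Sample Q-functions from the posterior.
    Q = mu + np.sqrt(var) * rng.standard_normal((n_sim, len(mu)))
    bsr_direct = Q.max(axis=1).mean() - mu[a]           # Eq. 25
    sr_subjective = (Q.max(axis=1) - Q[:, a]).mean()    # inner term of Eq. 28
    return bsr_direct, sr_subjective
```

The two quantities coincide (up to Monte Carlo error) because $E[Q(s,a)]$ equals the posterior mean used by the greedy policy.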

We can see that the agent can minimize the BSR at state $s$ by taking better actions (i.e., controlling $\pi$) or performing better computations (i.e., controlling $\omega$). In reinforcement learning, where the model of the environment is not known, it might often be advantageous to take actions that seem to be suboptimal, as the exploration induced by such actions might be beneficial in the long run. In planning problems, however, exploration of the state space is neither necessary nor beneficial, as the model is already known. Therefore, the only reasonable way of selecting actions is to maximize the expected values obtained via the computations. Consequently, the only way to (non-trivially) reduce BSR is to select better computations.

###### Definition 6.

A meta-policy is said to be myopically optimal if and only if it performs a computation that minimizes the expected $\textsc{BSR}^*$ given the current state $s$ and a sequence of computation outcomes $o_{1:t}$. That is, it performs

\[ \omega^* \coloneqq \operatorname*{argmin}_{\omega\in\Omega} E_{o\sim\omega}\big[\textsc{BSR}^*(s;o_{1:t}\oplus o)\big]. \tag{29} \]
###### Proposition 2.

VOC-greedy is myopically optimal.

###### Proof.

The optimal computation as defined in Equation 29 remains unchanged if we subtract the current $\textsc{BSR}^*$. That is,

\[ \omega^* = \operatorname*{argmin}_{\omega\in\Omega} E_{o\sim\omega}\big[\textsc{BSR}^*(s;o_{1:t}\oplus o) - \textsc{BSR}^*(s;o_{1:t})\big]. \tag{30} \]

The first terms of the $\textsc{BSR}^*$’s are equal,

\[ E\big[\max_{a\in A_s} Z(s,a\,|\,o_{1:t})\big] = E_{o\sim\omega}\,E\big[\max_{a\in A_s} Z(s,a\,|\,o_{1:t}\oplus o)\big] \tag{31} \]

due to the law of total expectation. Hence we get,

\begin{align*}
\omega^* &= \operatorname*{argmin}_{\omega\in\Omega} E_{o\sim\omega}\Big[\max_{a} E[Z(s,a\,|\,o_{1:t})] - \max_{a} E[Z(s,a\,|\,o_{1:t}\oplus o)]\Big] \tag{32} \\
&= \operatorname*{argmax}_{\omega\in\Omega} \underbrace{E_{o\sim\omega}\Big[\max_{a} E[Z(s,a\,|\,o_{1:t}\oplus o)] - \max_{a} E[Z(s,a\,|\,o_{1:t})]\Big]}_{=\,\textsc{VOC}(\omega;s,o_{1:t})}. \tag{33}
\end{align*}

Thus, VOC is the expected change in $\textsc{BSR}^*$; and a meta-policy that minimizes $\textsc{BSR}^*$ maximizes the value of computation. ∎

###### Proposition 3.

VOC$'$-greedy is myopically optimal.

###### Proof.

This is a direct corollary of Propositions 1 and 2. ∎

What about the asymptotic behavior as the agent performs infinitely many computations?

###### Proposition 4.

If the sampling distributions are known, VOC-greedy is asymptotically optimal in the sense that simple regret approaches $0$ in the limit of infinitely many computations performed.

This is proved by Frazier et al. [13] in the setting of knowledge gradient policies, whose results directly apply here. Below, we give an intuitive but incomplete explanation of why this is the case.

###### Proof sketch.

There are two conditions under which VOC-greedy might fail to be asymptotically optimal: (I) the meta-policy halts early, before ensuring that simple regret is $0$; (II) the meta-policy does not halt, yet simple regret does not approach $0$.

Note that $\textsc{BSR}^*$ equals zero if and only if there exists a policy whose simple regret is zero. This is because SR is non-negative, and $\textsc{BSR}^*$ is the subjective expectation of SR under the best policy.

Also recall that the VOC of a computation is the expected reduction in $\textsc{BSR}^*$ resulting from performing it. Thus, if $\textsc{BSR}^* = 0$, all computations have zero VOC. The converse is also true for sampling distributions with infinite support. Thus, the meta-policy does not halt until $\textsc{BSR}^* = 0$, hence the simple regret is $0$, ruling out condition (I). Condition (II) can only occur if an action is sampled finitely many times, which cannot happen for probability distributions with infinite support, as there will always be a chance of a sub-optimal action turning out to be the optimal one after a computation. ∎

In addition to myopic and asymptotic optimality, finite error bounds can also be computed as done in [13].

Given our assumption that all the uncertainty is computational, posterior values become delta functions given enough computation. If a similar approach is taken in a model-based scenario, where the agent has to learn the statistics of the environment to make better decisions, the $Z$-function would capture the value marginalized over empirical but not computational uncertainties. Alternatively, the $Z$-function could account for both kinds of uncertainties; however, this would bring added complexity, as regret could not be made zero solely via computations, and would require a different formulation than the one we propose here.

We should point out that asymptotic optimality is easy to obtain. Heuristic policies such as uniform random sampling or $\epsilon$-greedy are also asymptotically optimal, as they sample all the actions infinitely many times. However, VOC-greedy is the only policy that is both myopically and asymptotically optimal (besides the cases where VOC-greedy and VOC$'$-greedy are equivalent). What can we say about optimality for the cases in between these two extremes? We have no optimality guarantees. In fact, computing the optimal meta-policy that maximizes the cumulative computation value of many computations is shown to be NP-hard in [14]. That being said, optimality at the two extremes suggests that the performance is likely good in between as well, though likely not optimal. Therefore, myopic and asymptotic optimality is the best we can obtain efficiently in general.

In this section, we were only concerned with computations and actions regarding the state occupied by the agent. We assumed the sampling distribution of $Z$ is given, which could be unrealistic in multi-state settings (MDPs), as computing the sampling distribution might require actually solving the problem. We address this problem in Section 7. Before that, we direct our attention to an easier problem, namely the multi-armed bandit problem.

### 6 Computations in multi-armed bandits

In this section, we consider a more concrete setting, and motivate the usefulness of VOC in multi-armed bandits, or equivalently, in discrete Bayesian optimization. Here, we will focus on selecting the optimal action to perform, rather than the optimal computation. We show that the formalism behind the two is analogous, and that our VOC results are directly applicable here. Consequently, we do not introduce any new ideas here, and readers who are not interested in a concrete example are encouraged to skip this section. We should also mention that this problem has been studied in the knowledge gradient setting, and similar results can be found in [13]. Below, we (re)introduce the multi-armed bandit problem we discussed in the Introduction.

###### Definition 7.

In a multi-armed bandit (MAB) problem, there is a finite set of actions $A$ and a value function $Q$, which is unknown to the agent. At each time step $t$, the agent takes an action $a_t$ and collects $R_t = Q(a_t) + \epsilon_t$, where $\epsilon_t$ is sampled i.i.d. from a zero-mean distribution that potentially depends on $a_t$. The goal is to minimize a measure concerning the expected simple regret, which is defined as $\max_{a'\in A} Q(a') - Q(a_t)$. One can either aim to minimize this directly at some terminal time step, or minimize its sum over time.

###### Example 1.

Consider the setting of a MAB problem, where $Q(a) \sim \mathcal{N}(\mu_0, \tau_0^{-1})$ i.i.d. for known $\mu_0$ and $\tau_0$, and $\epsilon_t \sim \mathcal{N}(0, \tau^{-1})$ i.i.d. for a known $\tau$. Let $r_{1:t}$ be a sequence of returns yielded by sampling a specific action $a$, and $\hat{r}_t$ their empirical mean. Thus, at time step $t$, if the agent takes action $a$ and receives reward $R_t$, then we define the observation resulting from this action as the action–return pair. Then, the posterior is obtained by

\[ Z(a\,|\,o_{1:t}) \sim \mathcal{N}\!\left(\frac{t\tau\hat{r}_t + \tau_0\mu_0}{t\tau+\tau_0},\; \frac{1}{t\tau+\tau_0}\right), \tag{34} \]

where $\mathcal{N}$ denotes the normal distribution. Then, the posterior predictive distribution of a return sampled from $a$ is $\mathcal{N}\!\big(\frac{t\tau\hat{r}_t+\tau_0\mu_0}{t\tau+\tau_0},\, \frac{1}{t\tau+\tau_0}+\frac{1}{\tau}\big)$. We can see that the formulation for picking actions based on observations is equivalent to picking computations as we have done previously. Let us consider whether the equality asserted by the law of total expectation holds as in Equation 16,

\[ E_{o\sim\omega_a}\big[E[Z(a\,|\,o_{1:t}\oplus o)]\big] \overset{?}{=} E[Z(a\,|\,o_{1:t})], \tag{35} \]

where $o$ is what the agent supposes to observe after performing action $a$. We use the notation $\omega_a$ in order to keep the notation consistent with our VOC formulation. We have

\begin{align*}
E_{o\sim\omega_a}\big[Z(a\,|\,o_{1:t}\oplus o)\big] &\sim \mathcal{N}\!\left(\frac{(t{+}1)\tau\hat{r}_{t+1} + \tau_0\mu_0}{(t{+}1)\tau+\tau_0},\; \frac{1}{(t{+}1)\tau+\tau_0}\right) \tag{36} \\
&= \mathcal{N}\!\left(\frac{\tau(t\hat{r}_t + r_{t+1}) + \tau_0\mu_0}{(t{+}1)\tau+\tau_0},\; \frac{1}{(t{+}1)\tau+\tau_0}\right). \tag{37}
\end{align*}

Given that $r_{t+1}$ is sampled from the posterior predictive, $\mathcal{N}\!\big(\frac{t\tau\hat{r}_t+\tau_0\mu_0}{t\tau+\tau_0},\, \frac{1}{t\tau+\tau_0}+\frac{1}{\tau}\big)$, we can substitute its mean and variance,

\begin{align*}
E_{o\sim\omega_a}\big[Z(a\,|\,o_{1:t}\oplus o)\big] &\sim \mathcal{N}\!\Bigg(\frac{\tau t\hat{r}_t+\tau_0\mu_0}{(t{+}1)\tau+\tau_0} + \frac{\tau}{(t{+}1)\tau+\tau_0}\cdot\frac{t\tau\hat{r}_t+\tau_0\mu_0}{t\tau+\tau_0}, \tag{38} \\
&\qquad\quad \frac{1}{(t{+}1)\tau+\tau_0} + \left(\frac{\tau}{(t{+}1)\tau+\tau_0}\right)^{\!2}\left(\frac{1}{t\tau+\tau_0}+\frac{1}{\tau}\right)\Bigg) \tag{39} \\
&= \mathcal{N}\!\Bigg((\tau t\hat{r}_t+\tau_0\mu_0)\left(\frac{1}{(t{+}1)\tau+\tau_0} + \frac{\tau/(t\tau+\tau_0)}{(t{+}1)\tau+\tau_0}\right), \tag{40} \\
&\qquad\quad \frac{1}{(t{+}1)\tau+\tau_0} + \left(\frac{\tau}{(t{+}1)\tau+\tau_0}\right)^{\!2}\frac{(t{+}1)\tau+\tau_0}{\tau(t\tau+\tau_0)}\Bigg) \tag{41} \\
&= \mathcal{N}\!\Bigg((\tau t\hat{r}_t+\tau_0\mu_0)\left(\frac{1}{(t{+}1)\tau+\tau_0} + \frac{\tau/(t\tau+\tau_0)}{(t{+}1)\tau+\tau_0}\right), \tag{42} \\
&\qquad\quad \frac{1}{(t{+}1)\tau+\tau_0} + \frac{\tau/(t\tau+\tau_0)}{(t{+}1)\tau+\tau_0}\Bigg) \tag{43} \\
&= \mathcal{N}\!\left(\frac{t\tau\hat{r}_t+\tau_0\mu_0}{t\tau+\tau_0},\; \frac{1}{t\tau+\tau_0}\right), \tag{44}
\end{align*}

which is the same distribution as that of $Z(a\,|\,o_{1:t})$, given in Eq 34. Then,

\[ Z(a\,|\,o_{1:t}) \overset{d}{=} E_{o\sim\omega_a}\big[Z(a\,|\,o_{1:t}\oplus o)\big], \tag{45} \]

where $\overset{d}{=}$ denotes a distributional equality, which in turn implies

\[ E[Z(a\,|\,o_{1:t})] = E_{o\sim\omega_a}\big[E[Z(a\,|\,o_{1:t}\oplus o)]\big]. \tag{46} \]

Thus the equality in Equation 35 holds and the probability estimates are consistent. Furthermore, we can also see that

\[ \textsc{VOC}(\omega_a;o_{1:t}) \coloneqq E_{o\sim\omega_a}\Big[\max_{a\in A} E[Z(a\,|\,o_{1:t}\oplus o)]\Big] - \max_{a\in A} E[Z(a\,|\,o_{1:t})] \geq 0, \tag{47} \]

due to Jensen’s inequality. Note that, in this bandit setting, we do not perform computations but actions. We are evaluating the information obtained by performing $a$, which is usually referred to as the value of information. However, because the underlying formulations are the same, we stick to VOC.
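The chain of identities ending in Eq 46 can also be verified numerically. The sketch below implements the conjugate-Gaussian posterior mean of Eq 34 and averages the one-step-updated posterior mean over draws from the posterior predictive; the parameter values are arbitrary, and the function names are ours:

```python
import numpy as np

def posterior_mean(r_sum, t, tau, tau0, mu0):
    """Posterior mean of Z(a | o_{1:t}) from Eq. 34, with r_sum = t * r_hat."""
    return (tau * r_sum + tau0 * mu0) / (t * tau + tau0)

def check_total_expectation(tau=1.0, tau0=0.5, mu0=0.3, r_sum=2.0, t=4,
                            n_sim=400_000, rng=None):
    """Eq. 35/46: averaging the updated posterior mean over the posterior
    predictive of the next return recovers the current posterior mean."""
    rng = rng or np.random.default_rng(2)
    m = posterior_mean(r_sum, t, tau, tau0, mu0)
    pred_sd = np.sqrt(1.0 / (t * tau + tau0) + 1.0 / tau)  # predictive sd
    r_next = m + pred_sd * rng.standard_normal(n_sim)      # o ~ omega_a
    m_next = posterior_mean(r_sum + r_next, t + 1, tau, tau0, mu0)
    return m, m_next.mean()
```

The two returned values agree up to Monte Carlo error, as the law of total expectation demands.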

Let us consider the first term in Equation 47. When considering the value of action $a$, given $t$ observations, we have

\[ E[Z(a\,|\,o_{1:t}\oplus o)] \sim \mathcal{N}\!\left(E[Z(a\,|\,o_{1:t})],\; \frac{1}{t\tau+\tau_0} - \frac{1}{(t{+}1)\tau+\tau_0}\right). \tag{48} \]

In a sense, computations add variability to the expectations of $Z$’s, which is good, as higher variability under convexity (i.e., under the $\max$ in our case) means higher expectation. This might seem puzzling, as we would like to reduce the uncertainty by performing computations; this is indeed the case, as we have

\[ \mathrm{Var}[Z(a\,|\,o_{1:t+1})] = \frac{1}{(t{+}1)\tau+\tau_0} \leq \mathrm{Var}[Z(a\,|\,o_{1:t})] = \frac{1}{t\tau+\tau_0}. \tag{49} \]

In other words, computations reduce the total variance by shifting a fraction of the variance into the expectation, which subsequently is resolved by performing a computation.

Lastly, in this example, where the $Q(a)$ are sampled independently across actions, the VOC can be calculated analytically, as it will be the expectation of a truncated normal distribution. In the case of correlated samples with a known covariance, an analytical solution still exists [6, 13]; however, the involved computations are more costly.
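For the independent case, the closed form can be sketched as follows, in the style of the knowledge gradient of [13]: sampling an arm once perturbs only that arm’s posterior mean, by a zero-mean Gaussian whose variance is the reduction in posterior variance, and the resulting expectation of the max is a truncated-normal expectation. The function below is our illustration of this formula, not code from the thesis:

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def voc_closed_form(mu, var, noise_var, arm):
    """Closed-form VOC for independent Gaussian arms: the posterior mean of
    `arm` moves by N(0, var_old - var_new) after one sample, and
    VOC = s * f(-|Delta|/s) with f(z) = z*Phi(z) + phi(z)."""
    var_new = 1.0 / (1.0 / var[arm] + 1.0 / noise_var)
    s = math.sqrt(var[arm] - var_new)  # sd of the posterior-mean update
    best_other = max(m for i, m in enumerate(mu) if i != arm)
    z = -abs(mu[arm] - best_other) / s
    return s * (z * std_normal_cdf(z) + std_normal_pdf(z))
```

On the two-arm instance used earlier ($\mu = (0, 0.1)$, unit variances and noise), this matches the Monte Carlo estimate of Eq 18.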

If the sampling distributions are not known, then BSR will not be equivalent to the expected simple regret. VOC-greedy will still be myopically optimal with respect to the assumed distribution of $Z$ by construction. However, it won’t be guaranteed to minimize the expected simple regret, as the latter is a function of the true $Q$, which has an unknown distribution. In terms of asymptotic optimality, the prior needs to ensure that each arm will be sampled infinitely many times, which is sufficient but not necessary.

### 7 Computations in Markov decision processes

We saw that VOC-greedy (and also VOC$'$-greedy) is myopically and asymptotically optimal in minimizing Bayesian simple regret in multi-armed bandits. Eventually, we would like to see if this transfers to MDPs. Assume we treat the actions at the current state as bandit arms. Can we apply the principles from the previous section to solve this bandit problem?

The main obstacle is that in MDPs, action values are not solely given by the environment, but are a product of both the environment and the agent’s policy. To complicate matters further, the policy of the agent likely improves as he performs more computations; thus, the action values should ideally reflect the knowledge of his future self. To circumvent this problem, we introduce static and dynamic values, which respectively capture how much the agent should value an action if he can no longer perform any computations, or if he can perform infinitely many computations in the future.

Before moving on, let us consider a puzzle that will yield insights into static and dynamic values. As illustrated in Figure 1, there are two rooms: the first room contains two boxes, and the second one contains five boxes. Each box contains an unknown amount of money. First you need to choose a room; then you can select any of the boxes in the room (peeking inside is not allowed) and collect the money. Which room should you choose? What if you could peek inside the boxes upon choosing the room?

In the first case, it doesn’t matter which room one chooses: since no further information is given, all the boxes are equally valuable in expectation. In the second case, however, choosing the second room, the one with five boxes, is the better option. This is because one can obtain further information by peeking inside the boxes, and more boxes mean more money in expectation, as one has the option to choose the best one. Formally, let $B_1$ and $B_2$ be the sets of boxes of the first and the second room respectively, and let $X_b$ denote the amount of money in box $b$. Assume all the $X_b$ are sampled i.i.d. from an unknown distribution. Then, if one cannot peek inside the boxes, the two rooms are equally valuable, as $E[X_b] = E[X_{b'}]$ for all $b, b'$. Whereas, if one can peek inside all the boxes, then the second room is more valuable in expectation, as $E[\max_{b\in B_2} X_b] \geq E[\max_{b\in B_1} X_b]$ whenever $|B_2| \geq |B_1|$.
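A quick simulation makes the puzzle concrete. The box-content distribution below is an arbitrary stand-in (Exponential(1)); the thesis leaves it unspecified:

```python
import numpy as np

def room_values(n_boxes, n_sim=100_000, rng=None):
    """Boxes puzzle: expected payoff of a room with and without peeking,
    assuming box contents are i.i.d. (Exponential(1), chosen arbitrarily)."""
    rng = rng or np.random.default_rng(3)
    boxes = rng.exponential(1.0, size=(n_sim, n_boxes))
    no_peek = boxes[:, 0].mean()     # without peeking, any box is as good
    peek = boxes.max(axis=1).mean()  # with peeking, pick the best box
    return no_peek, peek
```

Without peeking, both rooms are worth the common mean; with peeking, the five-box room dominates the two-box room (for Exponential(1), the expectations of the maxima are the harmonic numbers $H_2 = 1.5$ and $H_5 \approx 2.283$).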

Alternatively, what if one can only reveal the contents of three boxes, which needs to be done before committing to a room? Or what if one gets to reveal one box before selecting a room, and reveal another one after selecting the room?

In this chapter, we address these questions in a principled way, through static and dynamic values. In the next chapters, we show that the answers we provide are directly applicable to MCTS—where one has imperfect information about action values, and gets to “reveal” more information by performing rollouts.

Using these, we redefine computation values for MDPs. It turns out that VOC-greedy is still myopically and asymptotically optimal, whereas this is no longer the case for VOC$'$-greedy.

Before moving on to MDPs from MABs, we introduce an intermediate problem, where the idea is to “cut out” a “reduced MDP” by uniformly expanding in all directions from the current state for a fixed number of steps (i.e., actions); the expansion does not have to be uniform, but we consider this scenario for simplicity. To this end, we introduce new value function measures that we use to relate the action values that lie on the “cut” to the current state of the agent.

###### Definition 8.

Values of state-action pairs can be defined recursively, as in the Bellman policy evaluation equation. We refer to the result as the $n$-step value, obtained as

\[ Q^\pi_n(s,a) = \begin{cases} Q^\pi_0(s,a) & \text{if } n=0 \\ \sum_{s'} P^a_{ss'}\Big[R^a_{ss'} + \gamma \sum_{a'\in A_{s'}} \pi(s',a')\, Q^\pi_{n-1}(s',a')\Big] & \text{else,} \end{cases} \tag{50} \]

where $Q^\pi_0$ is a measure of how valuable frontier state-actions are.
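The recursion of Eq 50 can be sketched directly in code. The data layout (`P[s][a]` mapping next states to probabilities) is our own convention for illustration:

```python
def n_step_q(n, s, a, P, R, gamma, policy, q0):
    """Plain recursion for the n-step value Q^pi_n(s, a) of Eq. 50.
    P[s][a] maps next states to transition probabilities, R[s][a][s2] is
    the expected reward, and q0(s, a) supplies the frontier values."""
    if n == 0:
        return q0(s, a)
    return sum(p * (R[s][a][s2] + gamma * sum(
                   policy(s2, a2) * n_step_q(n - 1, s2, a2, P, R, gamma, policy, q0)
                   for a2 in P[s2]))
               for s2, p in P[s][a].items())
```

On a two-state chain with a uniform policy and zero frontier values, the two-step value is the immediate reward plus the discounted policy average of the next state’s one-step values.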

We will consider different assumptions concerning the frontier values. They can belong to the environment, as with the expected rewards of bandit arms. Alternatively, the spatial horizon of $n$ steps can be computational, meaning that even though there are reachable states beyond the horizon, the agent ignores them and treats the expected rewards of the frontier actions as bandit arms. In either case, we assume that they are independent of the agent’s policy. The justification for this assumption is two-fold. First, the impact of the policy on action values diminishes as we go farther away from the current state. Second, as we discuss later, we could implicitly account for the policy at remote states by leveraging an adaptive sampling policy such as the UCT-policy for Monte Carlo tree search.

In practice, if we have a spatial horizon of $n$, we will compute $n$-step values for the actions that are available at the current state, $(n{-}1)$-step values for the actions that are $1$-step reachable, and so on, until we get to the state-actions that lie on the cut, for which we will have the frontier values.

The $n$-step value is convenient because it provides a trade-off between a MAB problem and a full-fledged MDP. Let $n=1$, and let $Q_0$ be drawn from a known sampling distribution. The agent does not know $Q_0$, but can draw noisy samples from it. Then the problem of finding the best computation to maximize the root action value here is equivalent to the problem of finding the best action to perform in a MAB problem, where the goal is to minimize the expected simple regret of the next time step. For $n>1$, the $n$-step value generalizes this MAB problem to multiple states. Also, in the limit of large $n$, we essentially get the full Bellman backups that solve an MDP.

Note that, even if the estimates of $Q_0$ are highly unreliable, the estimates of proximal state-actions (e.g., of high $n$) will be more reliable, given the discounting that shrinks the errors and the expectation operations which “average out” the noise of the frontier estimates. We now reintroduce the $n$-step optimal value function, leaving the policy out of the equation:

###### Definition 9.

We define the $n$-step optimal value function as

\[ Q^*_n(s,a) = \begin{cases} Q_0(s,a) & \text{if } n=0 \\ \sum_{s'} P^a_{ss'}\Big[R^a_{ss'} + \gamma \max_{a'\in A_{s'}} Q^*_{n-1}(s',a')\Big] & \text{else,} \end{cases} \tag{51} \]

where $Q_0$ is a measure of how valuable frontier state-actions are.
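Eq 51 differs from Eq 50 only in replacing the policy average with a max; a recursive sketch under the same illustrative data layout as before (`P[s][a]` mapping next states to probabilities, our convention):

```python
def n_step_q_opt(n, s, a, P, R, gamma, q0):
    """Plain recursion for the n-step optimal value Q*_n(s, a) of Eq. 51:
    a max over next-state actions replaces the policy average of Eq. 50."""
    if n == 0:
        return q0(s, a)
    return sum(p * (R[s][a][s2] + gamma * max(
                   n_step_q_opt(n - 1, s2, a2, P, R, gamma, q0)
                   for a2 in P[s2]))
               for s2, p in P[s][a].items())
```

On the same two-state chain as above, the optimal two-step value exceeds the uniform-policy value, since the max backs up the better of the next state’s actions.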

The motivation for this comes from the fact that the structure of a Monte Carlo search tree resembles that of an $n$-step optimal value equation. The leaves of the tree offer imperfect estimates of action values, which are propagated to the root of the tree in order to rank and select the best action available at the root. One difference is that MCTS algorithms build trees iteratively, whereas we assume full expansion up to depth $n$. In fact, as a result of this expansion, we end up with a weighted directed acyclic graph (DAG) where the current state is the sink, and all $n$-step reachable nodes are sources. Thus, our data structure is not a tree as in MCTS, but a graph, which we refer to as a search graph. The crucial difference is that, in a tree, each node can have a single parent; whereas in a DAG, a node can have multiple parents. Therefore, a DAG is more (sample-wise) efficient than a tree for evaluating action values, as the latter does not share information across branches [15]. Consequently, instead of a root node as in MCTS, we have a sink, which is the state of the agent; and instead of leaves, we have sources. Values, or uncertainties, flow from sources to the sink in a sense. If a state-action pair is far from the sink, we refer to it as distal; if it is close, as proximal.

From now on, we assume we are given a search graph, and we would like to perform computations, which yield noisy estimates of source values, in order to minimize the $\textsc{BSR}^*$ at the sink. As before, our action value estimates are probabilistic and are denoted by $Z$.

###### Definition 10.

Given that we are now uncertain about the action values, we define the dynamic value function as $\psi_n(s,a\,|\,o_{1:t}) \coloneqq E[\Upsilon_n(s,a\,|\,o_{1:t})]$, where

\[ \Upsilon_n(s,a\,|\,o_{1:t}) \coloneqq \begin{cases} Z(s,a\,|\,o_{1:t}) & \text{if } n=0 \\ \sum_{s'} P^a_{ss'}\Big[R^a_{ss'} + \gamma \max_{a'\in A_{s'}} \Upsilon_{n-1}(s',a'\,|\,o_{1:t})\Big] & \text{else.} \end{cases} \tag{52} \]

$\psi_n$ captures how valuable $(s,a)$ is, assuming the agent will have obtained all the information needed to completely eliminate his uncertainties about the frontier action values. The ‘dynamic’ reflects the fact that the agent will likely change his mind about the best actions available at each state in the future; yet, this is reflected and accounted for in $\psi_n$.

This value is useful because, imagine you are forced to take an action right away, after which you will have the luxury of deliberation. In this case, the value of your action should take into account that the future version of you will be wiser, which $\psi_n$ achieves. The major downside is that it cannot be computed efficiently in general. However, in the algorithms section, we will introduce an efficient way of computing $\psi_n$ approximately. Alternatively, we can utilize another measure, which captures the other extreme, where the agent cannot obtain any new information.

###### Definition 11.

We define the static value function as

\[ \phi_n(s,a\,|\,o_{1:t}) \coloneqq \begin{cases} E[Z(s,a\,|\,o_{1:t})] & \text{if } n=0 \\ \sum_{s'} P^a_{ss'}\Big[R^a_{ss'} + \gamma \max_{a'\in A_{s'}} \phi_{n-1}(s',a'\,|\,o_{1:t})\Big] & \text{else.} \end{cases} \tag{53} \]

$\phi_n$ captures how valuable $(s,a)$ would be if the agent were to take actions before running any new computations. We also refer to $n$ as the spatial horizon.

In Figure 2, we graphically contrast dynamic and static values, where the difference is the stage at which the expectation is taken. For the former, it is done at the level of the root actions; for the latter, at the level of the leaves.
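The contrast can be made concrete with a one-step sketch: a single state whose actions lead directly to frontier values with independent Gaussian beliefs (our simplifying assumption). The static value applies the max to posterior means; the dynamic value takes the expectation outside the max:

```python
import numpy as np

def static_and_dynamic(mu, var, R=0.0, gamma=1.0, n_sim=200_000, rng=None):
    """One-step sketch of Eqs. 52-53 for a single state whose actions lead
    to frontier values Z ~ N(mu, var) (independent, for illustration):
    phi = R + gamma * max of posterior means (expectation at the leaves),
    psi = R + gamma * E[max Z]              (expectation at the root)."""
    rng = rng or np.random.default_rng(4)
    mu, var = np.asarray(mu, float), np.asarray(var, float)
    phi = R + gamma * mu.max()
    Z = mu + np.sqrt(var) * rng.standard_normal((n_sim, len(mu)))
    psi = R + gamma * Z.max(axis=1).mean()
    return phi, psi
```

As Eq 54 below states, the dynamic value always weakly dominates the static one.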

Note that $\Upsilon_n$ is a convex function of $Z$ evaluated at the frontier nodes. Thus, let us denote $\Upsilon_n(s,a\,|\,o_{1:t})$ as $g(Z)$, where $g$ is a convex functional. Then we have $\psi_n(s,a\,|\,o_{1:t}) = E[g(Z)]$ and $\phi_n(s,a\,|\,o_{1:t}) = g(E[Z])$. Consequently, by Jensen’s inequality, we get

\[ \psi_n(s,a\,|\,o_{1:t}) \geq \phi_n(s,a\,|\,o_{1:t}). \tag{54} \]

This shows that, regardless of the computation outcomes $o_{1:t}$, the dynamic value is at least as large as the static value, which is intuitive.

In addition, $\psi_n$ accounts for what we refer to as optionality, namely the fact that, given the agent might gather more information in the future, he should value the option to change his mind (and his policy). Flexible flight tickets, where one can change the dates without paying a penalty, are more expensive because of this optionality. By buying a flexible ticket to Sicily, you are paying for the option to postpone your trip for a few days if your holiday turns out to be at risk of being ruined by bad weather.

So far we have only examined the properties of $\psi$ and $\phi$ statically, given past computation outcomes $o_{1:t}$. Now we consider the effects of future computations, for which we need to consider two cases: (I) the sampling distributions of the frontier values and the noise are known and stationary; and (II) they are unknown and non-stationary.

##### 7.0.1 Known sampling distributions

Let us consider the case where the sampling distribution of $Q_0$, as well as that of the noise, is known, but $Q_0$ itself is unknown. We can think of this case as an MDP where all state-action pairs $n$ steps away from the current state are stationary bandits, and we would like to sample the bandit arms in order to determine the values of actions, which would then be propagated back to the state occupied by the agent. In other words, all the reducible uncertainty (in contrast to the irreducible uncertainty, which is given by the environment, for instance in terms of the stochasticity of state transitions) concerns the values of the frontier actions. Therefore, the agent attempts to gain certainty about those actions in an effort to improve his decision quality at his current state.

In this setting, $Z$ is the known sampling distribution of $Q_0$, hence the prior; and $Z\,|\,o_{1:t}$ is the posterior distribution conditioned on a sequence of computation outcomes $o_{1:t}$. Then we have,

\[ E_{Q_0\sim Z}\big[Q^*_n(s,a)\big] = \psi_n(s,a\,|\,\langle\rangle), \tag{55} \]

where $\langle\rangle$ is a null computation sequence, i.e., one with no elements. Furthermore, dynamic values are stable in expectation under any sequence of future computations $o_{1:k}$, which would add more computation outcomes to those existing in $o_{1:t}$. At this point, a computation consists of drawing a sample as well as computing the updated $\psi$ and $\phi$ given the sample.

We can see that $\psi$ is consistent with respect to the new computations, meaning it is constant in expectation,

\[ \psi_n(s,a\,|\,o_{1:t}) = E_{o_{1:k}\sim\omega_{1:k}}\big[\psi_n(s,a\,|\,o_{1:t}\oplus o_{1:k})\big], \tag{56} \]

due to the law of total expectation. Note that the new computations can be applied in bulk, without updating $\omega$ after every computation; alternatively, $\omega$ can be updated after every computation, such that each computation would be based on the posterior resulting from the previous computations. Eq 56 holds for either approach. We can also relate $\psi$ to the agent’s expectation of $Q^*_n$,

\[ \psi_n(s,a\,|\,o_{1:t}) = E_{Q_0\sim Z|o_{1:t}}\big[Q^*_n(s,a)\big]. \tag{57} \]

For the static values, we can assert that

\[ E_{o_{1:k}\sim\omega_{1:k}}\big[\phi_n(s,a\,|\,o_{1:t}\oplus o_{1:k})\big] \geq \phi_n(s,a\,|\,o_{1:t}), \tag{58} \]

which follows from Jensen’s inequality. This suggests that the static values cannot decrease in expectation after a computation sequence; yet, they are not consistent as the dynamic values are. Combining these, we get,

\[ \psi_n(s,a\,|\,o_{1:t}) \geq E_{o_{1:k}\sim\omega_{1:k}}\big[\phi_n(s,a\,|\,o_{1:t}\oplus o_{1:k})\big] \geq \phi_n(s,a\,|\,o_{1:t}). \tag{59} \]
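The chain of inequalities in Eq 59 can be checked numerically in the one-step Gaussian sketch used earlier (independent frontier arms, our simplifying assumption): after simulating one more computation on an arm, the expected static value rises but stays below the dynamic value.

```python
import numpy as np

def value_drift(mu, var, noise_var, arm, n_sim=200_000, rng=None):
    """Eq. 59 sketch (n = 1, independent Gaussian frontier): returns the
    current static value phi, the expected static value after one more
    computation on `arm`, and the dynamic value psi = E[max Z]."""
    rng = rng or np.random.default_rng(5)
    mu, var = np.asarray(mu, float), np.asarray(var, float)
    # Current static and dynamic values (single max over frontier arms).
    Z = mu + np.sqrt(var) * rng.standard_normal((n_sim, len(mu)))
    phi, psi = mu.max(), Z.max(axis=1).mean()
    # Simulate the computation outcome and the resulting posterior update.
    obs = mu[arm] + np.sqrt(var[arm] + noise_var) * rng.standard_normal(n_sim)
    prec = 1.0 / var[arm] + 1.0 / noise_var
    mu_new = (mu[arm] / var[arm] + obs / noise_var) / prec
    others = np.delete(mu, arm).max()
    exp_phi_new = np.maximum(mu_new, others).mean()  # E[phi after o]
    return phi, exp_phi_new, psi
```

On a two-arm instance, the three quantities come out in exactly the order Eq 59 predicts: $\phi \leq E[\phi'] \leq \psi$.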

In fact,