# Tree Exploration for Bayesian RL Exploration

This is a corrected and slightly expanded version of the homonymous paper presented at CIMCA’08.

Christos Dimitrakakis
EPFL
Lausanne
Switzerland
christos.dimitrakakis@gmail.com
###### Abstract

Research in reinforcement learning has produced algorithms for optimal decision making under uncertainty that fall within two main types. The first employs a Bayesian framework, where optimality improves with increased computational time. This is because the resulting planning task takes the form of a dynamic programming problem on a belief tree with an infinite number of states. The second type employs relatively simple algorithms, which are shown to suffer small regret within a distribution-free framework. This paper presents a lower bound and a high probability upper bound on the optimal value function for the nodes in the Bayesian belief tree, which are analogous to similar bounds in POMDPs. The bounds are then used to create more efficient strategies for exploring the tree. The resulting algorithms are compared with the distribution-free algorithm UCB1, as well as a simpler baseline algorithm, on multi-armed bandit problems.

## 1 Introduction

In recent work [17, 21, 10, 15, 7, 16, 6], Bayesian methods for exploration in Markov decision processes (MDPs) and for solving known partially-observable Markov decision processes (POMDPs), as well as for exploration in the latter case, have been proposed. All such methods suffer from computational intractability problems for most domains of interest.

The sources of intractability are two-fold. Firstly, there may be no compact representation of the current belief. This is especially true for POMDPs. Secondly, optimally behaving under uncertainty requires that we create an augmented MDP model in the form of a tree [7], where the root node is the current belief-state pair and children are all possible subsequent belief-state pairs. This tree grows large very fast, and it is particularly problematic to grow in the case of continuous observations or actions. In this work, we concentrate on the second problem – and consider algorithms for expanding the tree.

Since the Bayesian exploration methods require a tree expansion to be performed, we can view the whole problem as that of nested exploration. For the simplest exploration-exploitation trade-off setting, bandit problems, there already exist nearly optimal, computationally simple methods [1]. Such methods have recently been extended to tree search [12]. This work proposes to take advantage of the special structure of belief trees in order to design nearly-optimal algorithms for expansion of nodes. In a sense, by recognising that the tree expansion problem in Bayesian look-ahead exploration methods is also an optimal exploration problem, we develop tree algorithms that can solve this problem efficiently. Furthermore, we are able to derive interesting upper and lower bounds for the value of branches and leaf nodes which can help limit the amount of search. The ideas developed are tested in the multi-armed bandit setting for which nearly-optimal algorithms already exist.

The remainder of this section introduces the augmented MDP formalism employed within this work and discusses related work. Section 2 discusses tree expansion in exploration problems and introduces some useful bounds. These bounds are used in the algorithms detailed in Section 3, which are then evaluated in Section 4. We conclude with an outlook to further developments.

### 1.1 Preliminaries

We are interested in sequential decision problems where, at each time step $t$, the agent seeks to maximise the expected utility

$$\mathbf{E}[u_t \mid \cdot] \triangleq \sum_{k=1}^{\infty} \gamma^k \, \mathbf{E}[r_{t+k} \mid \cdot],$$

where $r_t$ is a stochastic reward and $u_t$ is simply the discounted sum of future rewards. We shall assume that the sequence of rewards arises from a Markov decision process, defined below.

###### Definition 1.1 (Markov decision process)

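A Markov decision process (MDP) is defined as the tuple $\mu = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$ comprised of a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition distribution $\mathcal{T}$ conditioning the next state on the current state and action,

$$\mathcal{T}(s' \mid s, a) \triangleq \mu(s_{t+1} = s' \mid s_t = s, a_t = a) \tag{1}$$

satisfying the Markov property $\mu(s_{t+1} \mid s_t, a_t) = \mu(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots)$, and a reward distribution $\mathcal{R}$ conditioned on states and actions:

$$\mathcal{R}(r \mid s, a) \triangleq \mu(r_{t+1} = r \mid s_t = s, a_t = a), \tag{2}$$

with $a \in \mathcal{A}$, $s, s' \in \mathcal{S}$, $r \in \mathbb{R}$. Finally,

$$\mu(r_{t+1}, s_{t+1} \mid s_t, a_t) = \mu(r_{t+1} \mid s_t, a_t)\, \mu(s_{t+1} \mid s_t, a_t). \tag{3}$$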

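We shall denote the set of all MDPs as $\mathcal{M}$. For any policy $\pi$ that is an arbitrary distribution on actions, we can define a $T$-horizon value function for an MDP $\mu \in \mathcal{M}$ at time $t$ as:

$$V^{\pi}_{\mu,t,T}(s, a) = \mathbf{E}[r_{t+1} \mid s_t = s, a_t = a, \mu] + \gamma \sum_{s'} \mu(s_{t+1} = s' \mid s_t = s, a_t = a)\, V^{\pi}_{\mu,t+1,T}(s').$$

Note that for the infinite-horizon case, $\lim_{T \to \infty} V^{\pi}_{\mu,t,T} = V^{\pi}_{\mu}$ for all $t$.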
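The recursion above can be illustrated with a minimal sketch of finite-horizon policy evaluation by backward induction. The function name `finite_horizon_value` and the nested-list layout of `P`, `R` and `policy` are assumptions made for this illustration only; the MDP is taken to be known and finite.

```python
def finite_horizon_value(P, R, policy, gamma, horizon):
    """Evaluate a policy over `horizon` steps by backward induction.

    P[s][a][s2]  : transition probability mu(s2 | s, a)   (assumed known)
    R[s][a]      : expected reward E[r | s, a]
    policy[s][a] : probability that the policy picks action a in state s
    Returns a list v with v[s] the value of state s at the first step.
    """
    n_states = len(P)
    v = [0.0] * n_states  # nothing is earned beyond the horizon
    for _ in range(horizon):
        # State-action values: immediate reward plus discounted next-step value.
        q = [
            [
                R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_states))
                for a in range(len(P[s]))
            ]
            for s in range(n_states)
        ]
        # Average over the policy's action distribution.
        v = [
            sum(policy[s][a] * q[s][a] for a in range(len(P[s])))
            for s in range(n_states)
        ]
    return v
```

As the horizon grows, the returned values approach the infinite-horizon value of the same policy, which is the limit noted above.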

In the case where the MDP is unknown, it is possible to use a Bayesian framework to represent our uncertainty (cf. [7]). This essentially works by maintaining a belief $\xi_t \in \Xi$ about which MDP $\mu \in \mathcal{M}$ corresponds to reality. In a Bayesian setting, $\xi_t(\mu)$ is our subjective probability measure that $\mu$ is true.

In order to optimally select actions in this framework, we need to use the approach suggested originally in [3] under the name of Adaptive Control Processes. The approach was investigated more fully in [6, 7]. This creates an augmented MDP, with a state comprised of the original MDP’s state $s_t$ and our belief state $\xi_t$. We can then solve the exploration problem in principle via standard dynamic programming algorithms such as backwards induction. We shall call such models Belief-Augmented MDPs, analogously to the Bayes-Adaptive MDPs of [7]. This is done by not only considering densities conditioned on the state-action pairs $(s_t, a_t)$, i.e. $p(r_{t+1}, s_{t+1} \mid s_t, a_t)$, but taking into account the belief $\xi_t \in \Xi$, a probability space over possible MDPs, i.e. augmenting the state space from $\mathcal{S}$ to $\mathcal{S} \times \Xi$ and considering the following conditional density: $p(r_{t+1}, s_{t+1}, \xi_{t+1} \mid s_t, a_t, \xi_t)$. More formally, we may give the following definition:

###### Definition 1.2 (Belief-Augmented MDP)

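A Belief-Augmented MDP (BAMDP) is an MDP $\nu = (\Omega, \mathcal{A}, \mathcal{T}', \mathcal{R}')$ where $\Omega = \mathcal{S} \times \Xi$, where $\Xi$ is the set of probability measures on $\mathcal{M}$, and $\mathcal{T}', \mathcal{R}'$ are the transition and reward distributions conditioned jointly on the MDP state $s_t$, the belief state $\xi_t$, and the action $a_t$. Here the density $p(\xi_{t+1} \mid r_{t+1}, s_{t+1}, s_t, a_t, \xi_t)$ is singular, so that we can define the transition

$$p(\omega_{t+1} \mid a_t, \omega_t) \equiv p(s_{t+1}, \xi_{t+1} \mid a_t, s_t, \xi_t).$$

It should be obvious that $s_t, \xi_t$ jointly form a Markov state in this setting, called the hyper-state. In general, we shall denote the components of a future hyper-state $\omega^i_t$ as $(s^i_t, \xi^i_t)$. However, on occasion we will abuse notation by referring to the components of some hyper-state $\omega$ as $s_\omega, \xi_\omega$. We shall use $\mathcal{M}_B$ to denote the set of BAMDPs.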

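As in the MDP case, finite horizon problems only require sampling all future actions until the horizon $T$.

$$V^{\pi^*}_{t,T}(\omega_t, a_t) = \mathbf{E}[r_{t+1} \mid \omega_t, a_t] + \gamma \int_{\Omega} V^{\pi^*}_{t+1,T}(\omega_{t+1})\, \nu(\omega_{t+1} \mid \omega_t, a_t)\, d\omega_{t+1}. \tag{4}$$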

However, because the set of hyper-states available at each time-step is necessarily different from those at other time-steps, the value function cannot be easily calculated for the infinite horizon case.

In fact, the only clear solution is to continue expanding a belief tree until we are certain of the optimality of an action. As has previously been observed [4, 5], this is possible since we can always obtain upper and lower bounds on the utility of any policy from the current hyper-state. We can apply such bounds on future hyper-states in order to efficiently expand the tree.

### 1.2 Related work

To date, most work has only used full expansion of the belief tree up to a certain depth. A notable exception is [22], which uses Thompson sampling [20] to expand the tree. In very recent work [18], the importance of tree expansion in the closely related POMDP setting has been recognised (the BAMDP setting is equivalent to a POMDP where the unobservable part of the state is stationary, but continuous; see chap. 5 of [7]). Therein, the authors contrast and compare many different methods for tree expansion, including branch-and-bound [13] methods and Monte Carlo sampling.

Monte Carlo sampling methods have also been recently explored in the upper confidence bounds on trees (UCT) algorithms, proposed in [8, 12] in the context of planning in games. Our case is similar; however, we can take advantage of the special structure of the belief tree. In particular, for each node we can obtain high-probability upper and lower bounds on the value of the optimal policy.

This paper’s contribution is to recognise that tree expansion in Bayesian exploration is itself an exploration problem with very special properties. Based on this insight, it proposes to combine sampling with lower bounds and upper bound estimates at the leaves. This allows us to obtain high-probability bounds for expansion of the tree. While the proposed methods are similar to the ones used in the discrete-state POMDP setting [18], the BAMDP requires the evaluation of different bounds at leaf nodes. On the experimental side, we present first results on bandit problems, for which nearly-optimal distribution-free algorithms are known. We believe that this is a very important step towards extending the applicability of Bayesian look-ahead methods in exploration.

## 2 Belief tree expansion

Let the current belief be $\xi_t$ and suppose we observe $x^i_t \triangleq (s^i_{t+1}, r^i_{t+1}, a^i_t)$. This observation defines a unique subsequent belief $\xi^i_{t+1}$. Together with the MDP state $s^i_{t+1}$, this creates a hyper-state transition from $\omega_t$ to $\omega^i_{t+1}$. By recursively obtaining observations for future beliefs, we can obtain an unbalanced tree with nodes $\{\omega^i_{t+k}\}$. However, we cannot hope to be able to fully expand the tree. This is especially true in the case where observations (i.e. states, rewards, or actions) are continuous, where we cannot perform even a full single-step expansion. Even in the discrete case the problem is intractable for infinite horizons – and far too complex computationally for the finite horizon case. However, had there been efficient tree expansion methods, this problem would be largely alleviated. The remainder of this section details bounds and algorithms that can be used to reduce the computational complexity of the Bayesian lookahead approach.

### 2.1 Expanding a given node

All tree search methods require the expansion of leaf nodes. However, in general, a leaf node may have an infinite number of children. We thus need some strategies to limit the number of children.

More formally, let us assume that we wish to expand in node $\omega^i_t = (\xi^i_t, s^i_t)$, with $\xi^i_t$ defining a density over $\mathcal{M}$. For discrete state/action/reward spaces, we can simply enumerate all the possible outcomes $\{\omega^j_{t+1}\}_{j=1}^{|\mathcal{S} \times \mathcal{A} \times \mathcal{R}|}$, where $\mathcal{R}$ is the set of possible reward outcomes. Note that if the reward is deterministic, there is only one possible outcome per state-action pair. The same holds if $\mathcal{T}$ is deterministic, in both cases making an enumeration possible. While in general this may not be the case, since rewards, states, or actions can be continuous, in this paper we shall only examine the discrete case.

### 2.2 Bounds on the optimal value function

At each point in the process, the next node to be expanded is the one maximising a utility $U(\omega)$. Let $\Omega_L$ be the set of leaf nodes. If their values were known, then we could easily perform the backwards induction procedure shown in Algorithm 1.
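As a rough illustration of backwards induction on a finite tree with known leaf values, consider the following sketch. It is not a reproduction of Algorithm 1; the `Node` class and its fields are hypothetical names chosen for this example, and outcomes are assumed to be discrete.

```python
class Node:
    """A hyper-state in a finite belief tree (hypothetical structure)."""

    def __init__(self, leaf_value=0.0, children=None):
        self.leaf_value = leaf_value
        # children[a] is a list of (probability, reward, child_node) triples,
        # one triple per possible outcome of taking action a at this node.
        self.children = children or {}


def backward_induction(node, gamma):
    """Return the value of `node`, given values at the leaves."""
    if not node.children:
        return node.leaf_value
    # Value of the best action: expectation over outcomes of reward plus
    # the discounted value of the resulting child hyper-state.
    return max(
        sum(p * (r + gamma * backward_induction(child, gamma))
            for p, r, child in outcomes)
        for outcomes in node.children.values()
    )
```

In practice the leaf values are unknown, which is why the bounds discussed next are needed.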

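The main problem is obtaining a good estimate for $V^*(\omega)$, i.e. the value of leaf nodes. Let $\pi^*(\mu)$ denote the policy such that, for any $\pi$,

$$V^{\pi^*(\mu)}_{\mu}(s) \geq V^{\pi}_{\mu}(s) \quad \forall s \in \mathcal{S}.$$

Furthermore, let the maximum probability MDP arising from the belief at hyper-state $\omega$ be $\hat{\mu}_\omega \triangleq \arg\max_{\mu} \xi_\omega(\mu)$. Similarly, we denote the mean MDP with $\bar{\mu}_\omega \triangleq \mathbf{E}[\mu \mid \xi_\omega]$.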

###### Proposition 2.1

The optimal value function at any leaf node is bounded by the following inequalities

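$$\int V^{\pi^*(\mu)}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu \;\geq\; V^*(\omega) \;\geq\; \int V^{\pi^*(\bar{\mu}_\omega)}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu. \tag{5}$$

###### Proof.

By definition, $V^*(\omega) \geq V^{\pi}(\omega)$ for all $\omega$, for any policy $\pi$. The lower bound follows trivially, since

$$V^{\pi^*(\bar{\mu}_\omega)}(\omega) \triangleq \int V^{\pi^*(\bar{\mu}_\omega)}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu. \tag{6}$$

The upper bound is derived as follows. First note that for any function $f$, $\sup_x \int f(x, u)\, du \leq \int \sup_x f(x, u)\, du$. Then, we remark that:

$$\begin{aligned}
V^*(\omega) &= \sup_{\pi} \int V^{\pi}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu && \text{(7a)} \\
&\leq \int \sup_{\pi} V^{\pi}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu && \text{(7b)} \\
&= \int V^{\pi^*(\mu)}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu. && \text{(7c)}
\end{aligned}$$

∎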

In POMDPs, a trivial lower bound can be obtained by calculating the value of the blind policy [9, 19], which always takes the same action. Our lower bound is in fact the BAMDP analogue of the value of the blind policy in POMDPs. This is because for any fixed policy $\pi$, it holds trivially that $V^{\pi}(\omega) \leq V^*(\omega)$. In our case, we have made this lower bound tighter by considering $\pi^*(\bar{\mu}_\omega)$, the policy that is greedy with respect to the current mean estimate.

The upper bound itself is analogous to the POMDP value function bound given in Theorem 9 of [9]. However, while the lower bound is easy to compute in our case, the upper bound can only be approximated via Monte Carlo sampling with some probability.

### 2.3 Calculating the bounds

In general, (6) and (7b) cannot be expressed in closed form. However, the integrals can be approximated via Monte Carlo sampling. Let the leaf node which we wish to expand be $\omega$. Then, we can obtain $c$ MDP samples from the belief at $\omega$: $\mu_1, \ldots, \mu_c \sim \xi_\omega(\mu)$.

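The lower bound can be calculated by performing value iteration in the mean MDP, in order to obtain the mean-MDP-optimal policy $\pi^*(\bar{\mu}_\omega)$, where $\bar{\mu}_\omega$ is the mean MDP for belief $\xi_\omega$. If the beliefs $\xi$ can be expressed in closed form, it is easy to calculate the mean transition distribution and the mean reward from $\xi$. For discrete state spaces, transitions can be expressed as multinomial distributions, to which the Dirichlet density is a conjugate prior. In that case, for Dirichlet parameters $\{\psi^{s,a}_i(\xi) : s, i \in \mathcal{S}, a \in \mathcal{A}\}$, we have $\bar{\mu}_\xi(s' \mid s, a) = \psi^{s,a}_{s'}(\xi) / \sum_{i \in \mathcal{S}} \psi^{s,a}_i(\xi)$. Similarly, for Bernoulli rewards, the corresponding mean model arising from the beta prior with parameters $\{\alpha^{s,a}(\xi), \beta^{s,a}(\xi) : s \in \mathcal{S}, a \in \mathcal{A}\}$ is $\mathbf{E}[r \mid s, a, \bar{\mu}_\xi] = \alpha^{s,a}(\xi) / (\alpha^{s,a}(\xi) + \beta^{s,a}(\xi))$. Then the optimal policy for the mean MDP can be found with standard dynamic programming.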
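The two mean models are simple ratios of the belief parameters. A minimal sketch follows, assuming the Dirichlet parameters are stored as a nested list `psi[s][a][s2]`; the function names and the data layout are illustrative choices, not part of the paper.

```python
def mean_transition(psi):
    """Mean transition model of a product-of-Dirichlets belief.

    psi[s][a][s2] holds the Dirichlet parameter for the transition
    s --a--> s2. Each row is normalised by its own sum.
    """
    return [
        [[x / sum(row) for x in row] for row in by_action]
        for by_action in psi
    ]


def mean_bernoulli_reward(alpha, beta):
    """Mean of a Beta(alpha, beta) belief over a Bernoulli reward probability."""
    return alpha / (alpha + beta)
```

For example, parameters `[1, 3]` for a given state-action pair give mean transition probabilities `[0.25, 0.75]`.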

We can now use the mean-optimal policy to obtain a stochastic lower bound on the optimal value function. First, we calculate the value function of the mean-optimal policy for each sampled MDP $\mu_k$. We then average these samples to obtain the following approximation to (6):

$$\bar{v}_c(\omega) \triangleq \frac{1}{c} \sum_{k=1}^{c} V^{\pi^*(\bar{\mu}_\omega)}_{\mu_k}(s_\omega). \tag{8}$$

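For upper bounds, we follow a similar procedure. For each $\mu_k$, we derive the optimal policy $\pi^*(\mu_k)$ and estimate its value function $\tilde{v}^*_k \triangleq V^{\pi^*(\mu_k)}_{\mu_k}$. We may then average these samples to obtain

$$\hat{v}^*_c(\omega) \triangleq \frac{1}{c} \sum_{k=1}^{c} \tilde{v}^*_k(s_\omega). \tag{9}$$

Let $\bar{v}^*(\omega) = \int V^{\pi^*(\mu)}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu$. It holds that $\lim_{c \to \infty} \hat{v}^*_c(\omega) = \bar{v}^*(\omega)$ and that $\mathbf{E}[\hat{v}^*_c(\omega)] = \bar{v}^*(\omega)$. Due to the latter, we can apply a Hoeffding inequality

$$\mathbf{P}\left( \left| \hat{v}^*_c(\omega) - \bar{v}^*(\omega) \right| > \epsilon \right) < 2 \exp\left( -\frac{2 c \epsilon^2}{(V_{\max} - V_{\min})^2} \right), \tag{10}$$

thus bounding the error within which we estimate the upper bound. For $r_t \in [0, 1]$ and discount factor $\gamma$, note that $V_{\max} - V_{\min} \leq 1/(1 - \gamma)$. A similar inequality holds for the lower bound.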
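For a Bernoulli bandit the sampled value functions are available in closed form, which makes the estimates (8) and (9) easy to illustrate. The sketch below assumes independent Beta beliefs per arm and uses the fact that always pulling an arm with mean $p$ earns $p\,\gamma/(1-\gamma)$ under the utility defined in Section 1.1. The function names are hypothetical, and the half-width is obtained by setting the right-hand side of (10) equal to a chosen $\delta$.

```python
import math
import random


def bandit_leaf_bounds(alpha, beta, gamma, c, rng):
    """Monte Carlo estimates of the lower and upper bound at a bandit leaf.

    alpha[i], beta[i] : Beta belief parameters for arm i at this leaf
    c                 : number of bandit problems sampled from the belief
    rng               : a random.Random instance
    """
    scale = gamma / (1.0 - gamma)  # sum_{k>=1} gamma^k
    means = [a / (a + b) for a, b in zip(alpha, beta)]
    # The mean-optimal policy always pulls the arm with the highest mean.
    greedy = max(range(len(means)), key=means.__getitem__)
    lower = 0.0
    upper = 0.0
    for _ in range(c):
        p = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]  # one sample
        lower += scale * p[greedy]  # mean-optimal policy evaluated in the sample
        upper += scale * max(p)     # the sample's own optimal policy
    return lower / c, upper / c


def hoeffding_halfwidth(c, delta, gamma):
    """Epsilon for which the bound (10) equals delta, using range 1/(1-gamma)."""
    value_range = 1.0 / (1.0 - gamma)
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * c))
```

Because each sample contributes `p[greedy] <= max(p)`, the lower estimate never exceeds the upper estimate when both are computed from the same samples.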

### 2.4 Bounds on parent nodes

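We can obtain upper and lower bounds on the value of every action $a$, at any part of the tree, by iterating over $\Omega_t$, the set of possible outcomes following $a$:

$$\begin{aligned}
\bar{v}(\omega_t, a) &= \sum_{i=1}^{|\Omega_t|} \mathbf{P}(\omega^i_t \mid \omega_t, a) \left[ r^i_t + \gamma\, \bar{v}(\omega^i_t) \right] && \text{(11)} \\
\hat{v}^*(\omega_t, a) &= \sum_{i=1}^{|\Omega_t|} \mathbf{P}(\omega^i_t \mid \omega_t, a) \left[ r^i_t + \gamma\, \hat{v}^*(\omega^i_t) \right], && \text{(12)}
\end{aligned}$$

where the probabilities are implicitly conditional on the beliefs at each $\omega_t$. For every node, we can calculate an upper and lower bound on the value of all actions. Obviously, if at the root node $\omega_t$, there exists some $\hat{a}$ such that $\bar{v}(\omega_t, \hat{a}) \geq \hat{v}^*(\omega_t, a)$ for all $a \neq \hat{a}$, then $\hat{a}$ is unambiguously the optimal action.

However, in general, there may be some other action, $a'$, whose upper bound is higher than the lower bound of $\hat{a}$. In that case, we should expand either one of the two trees.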
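The propagation in (11) and (12) and the dominance check at the root can be sketched as follows. The tuple layout of `outcomes` and both function names are assumptions made for this illustration.

```python
def action_bounds(outcomes, gamma):
    """Bounds on the value of one action, following (11) and (12).

    outcomes: list of (probability, reward, child_lower, child_upper),
    one entry per possible child hyper-state.
    """
    lower = sum(p * (r + gamma * lo) for p, r, lo, _ in outcomes)
    upper = sum(p * (r + gamma * hi) for p, r, _, hi in outcomes)
    return lower, upper


def dominant_action(bounds):
    """Return an action whose lower bound is at least every other action's
    upper bound, or None if no action dominates.

    bounds: dict mapping action -> (lower, upper).
    """
    for a, (lo, _) in bounds.items():
        if all(lo >= hi for b, (_, hi) in bounds.items() if b != a):
            return a
    return None
```

When `dominant_action` returns `None`, the bounds overlap and further expansion is needed before an action can be singled out.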

It is easy to see that the upper and lower bounds at any node can be expressed as a function of the respective bounds at the leaf nodes. Let $B(\omega_t, a)$ be the set of all branches from $\omega_t$ when action $a$ is taken. For each branch $b \in B(\omega_t, a)$, let $\xi(b)$ be the probability of the branch from $\omega_t$ and $u_b$ be the discounted cumulative reward along the branch. Finally, let $L(\omega_t, a)$ be the set of leaf nodes reachable from $\omega_t$ and $\omega_b \in L(\omega_t, a)$ be the specific node reachable from branch $b$, at depth $d_b$ below $\omega_t$. Then, upper or lower bounds on the value function can simply be expressed as $v(\omega_t, a) = \sum_{b \in B(\omega_t, a)} \xi(b) \left[ u_b + \gamma^{d_b} v(\omega_b) \right]$. This would allow us to use a heuristic for greedily minimising the uncertainty at any branch. However, the algorithms we shall consider here will only employ evaluation of upper and lower bounds.

## 3 Algorithms

At each time step $t$, $n$ expansions are performed, starting from state $\omega_t$. At the $k$-th expansion, a utility function $U$ is evaluated for every node in the set of leaf nodes $\Omega_L$. The main difference among the algorithms is the way $U$ is calculated.

1. Serial. This results in a nearly balanced tree, as the oldest leaf node is expanded next, i.e. $U(\omega_i) = -i$, the negative node index.

2. Random. In this case, we expand any of the current leaf nodes with equal probability, i.e. $U(\omega_i) = U(\omega_j)$ for all $i, j$. This can of course lead to unbalanced trees.

3. Highest lower bound. We expand the node maximising a lower bound, i.e. $U(\omega_i) = \bar{v}(\omega_i)$.

4. Thompson sampling. We expand the node for which the currently sampled upper bound is highest, i.e. $U(\omega_i) = \tilde{v}^*(\omega_i)$.

5. High probability upper bound. We expand the node with the highest mean upper bound $\hat{v}^*_c(\omega_i)$.

Methods 3 and 4 only use one sample from the upper bound calculation at every iteration, while the last two methods retain the samples obtained in the previous iterations and use them to calculate the mean estimate.
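The selection rules listed above differ only in the utility assigned to each leaf. A minimal sketch follows, in which each leaf is a dictionary with hypothetical keys `index`, `lower` and `upper_samples`. Treating the most recent entry of `upper_samples` as the "currently sampled" upper bound is an assumption of this example, and the discounting of deeper nodes mentioned in the experiments is not modelled.

```python
import random


def select_leaf(leaves, method, rng):
    """Pick the next leaf to expand.

    leaves : list of dicts with keys
             'index'         - creation order of the node (older = smaller)
             'lower'         - current lower-bound estimate
             'upper_samples' - list of sampled upper-bound values
    method : 'serial', 'random', 'lower', 'thompson' or 'mean_upper'
    rng    : a random.Random instance (used only by 'random')
    """
    if method == "random":
        return rng.choice(leaves)
    if method == "serial":
        utility = lambda n: -n["index"]
    elif method == "lower":
        utility = lambda n: n["lower"]
    elif method == "thompson":
        utility = lambda n: n["upper_samples"][-1]
    elif method == "mean_upper":
        utility = lambda n: sum(n["upper_samples"]) / len(n["upper_samples"])
    else:
        raise ValueError("unknown method: %s" % method)
    return max(leaves, key=utility)
```

Keeping the utility as a separate function makes it straightforward to try further expansion heuristics without changing the tree code.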

## 4 Experiments

We compared the regret of the tree expansion to the optimal policy in bandit problems with Bernoulli rewards with two benchmarks: the UCB1 algorithm [2], which suffers only logarithmic regret, and secondly a Bayesian algorithm that is greedy with respect to a mean Bayesian estimate with a prior density , i.e. Algorithm 3 applied with , which is a simple optimistic heuristic for such problems.

We compared the algorithms using the notion of expected undiscounted regret accumulated over $T$ time steps, i.e. the expected loss that a specific policy $\pi$ suffers over the policy which always chooses the arm $a^*$ with the highest mean reward:

$$\sum_{t=1}^{T} \mathbf{E}[r_t \mid a_t = a^*] - \sum_{t=1}^{T} \mathbf{E}[r_t \mid \pi].$$

In order to determine the expected regret experimentally we must perform multiple independent runs and average over them.
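As an illustration of this averaging, the sketch below estimates the regret of the UCB1 benchmark [2] on a Bernoulli bandit. It accumulates the gap between the best mean and the mean of the chosen arm, and averages over independent runs. The function names, the seed handling and the arm means used in any call are assumptions of this example and are unrelated to the experimental settings of the paper.

```python
import math
import random


def ucb1_regret(means, horizon, rng):
    """Cumulative expected regret of one UCB1 run on a Bernoulli bandit."""
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: play every arm once
        else:
            # Empirical mean plus confidence width; t - 1 plays so far.
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t - 1) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # expected loss of this choice
    return regret


def average_regret(means, horizon, runs, seed=0):
    """Average the regret over `runs` independent runs."""
    rng = random.Random(seed)
    return sum(ucb1_regret(means, horizon, rng) for _ in range(runs)) / runs
```

Using the gap between means rather than realised rewards removes the reward noise from the estimate, so fewer runs are needed for a stable average.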

Figure 1 shows the cumulative undiscounted regret for horizon , with and , averaged over 1000 runs. We compare the UCB1 algorithm (ucb) and the Bayesian baseline (base) with the BAMDP approach. The figure shows the cumulative undiscounted regret as a function of the number of look-aheads, for the following expansion algorithms: serial, random, highest lower bound (lower bound), and high probability upper bound (upper bound). The last two algorithms use $\gamma$-rate discounting for future node expansion.

It is evident from these results that the highest lower bound method never improves beyond the first expansion. This is due to the fact that the lower bounds never change after the first step when this algorithm is used. The simple serial expansion seems to perform only slightly better. On the other hand, while the serial expansion is consistently better than the random expansion, it does not manage to achieve less than half of the regret of the latter. It thus appears as though the stochastic selection of branches is in itself of quite some importance in this type of problem. For problems with more arms and longer horizons, the differences between methods are amplified. The results are in agreement with those obtained in the POMDP setting, where upper bound expansions appear to be best [18].

## 5 Conclusion

One of this paper’s aims was to draw attention to the interesting problem of tree expansion in Bayesian RL exploration. To this end, bounds on the optimal value function at belief tree leaf nodes have been derived and then utilised as heuristics for tree expansion. It is shown experimentally that the resulting expansion methods have very significant differences in computational complexity for bandit problems. While the results are preliminary in the sense that no experiments on more complex problems are presented and that only very simple expansion algorithms have been tried, they are nevertheless significant in the sense that the effect of the tree exploration method used is very large.

Apart from performing further experiments, especially with more sophisticated expansion algorithms, future work should include deriving bounds on the minimum and maximum depth reached for each algorithm, as well as more general regret bounds if possible. The regret could be measured either as simply the optimality of , or, more interestingly, bounds on the cumulative online regret suffered by each algorithm. More importantly, problems with infinite observation spaces (i.e. with continuous rewards) should also be examined.

My current work includes the analysis of stochastic branch-and-bound algorithms such as the ones described in [14, 11]. These algorithms are essentially the same as the high probability upper bound method used in the current paper. Another interesting approach would be to develop a new expansion algorithm that achieves a small anytime regret, perhaps along the lines of UCT [12]. Such algorithms have been very successful in solving problems with large spaces and may be useful in this problem as well, especially when the space of observations becomes larger.

#### Acknowledgments

This work was supported by the ICIS-IAS project. Thanks to Carsten Cibura and Frans Groen for useful discussions and to Aikaterini Mitrokotsa for proofreading.

## Appendix A The Bayesian inference in detail

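Let $\mathcal{M}$ be the set of MDPs with unknown transition probabilities and state space $\mathcal{S}$ of size $|\mathcal{S}|$. We denote our belief at time $t+1$ about which MDP is true as

$$\begin{aligned}
\xi_{t+1}(\mu) &\triangleq \xi_t(\mu \mid s_{t+1}, s_t, a_t) && \text{(13a)} \\
&= \frac{\mu(s_{t+1} \mid s_t, a_t)\, \xi_t(\mu)}{\int_{\mathcal{M}} \mu'(s_{t+1} \mid s_t, a_t)\, \xi_t(\mu')\, d\mu'}. && \text{(13b)}
\end{aligned}$$

Since this is an infinite set of MDPs, we can have each MDP correspond to a particular probability distribution over the state-action pairs. More specifically, let us define for each state-action pair $(s, a)$ a Dirichlet distribution

$$\xi_t(q_{s,a} = x) = \frac{\Gamma(\psi^{s,a}(t))}{\prod_{i \in \mathcal{S}} \Gamma(\psi^{s,a}_i(t))} \prod_{i \in \mathcal{S}} x_i^{\psi^{s,a}_i(t) - 1}, \tag{14}$$

with $\psi^{s,a}(t) = \sum_{i \in \mathcal{S}} \psi^{s,a}_i(t)$. We will denote by $\Psi(t)$ the matrix of state-action-state transition counts at time $t$, with $\Psi(0)$ being the matrix defining our prior Dirichlet distribution.

We shall now model the joint prior over transition distributions as simply the product of priors. Then we can denote the matrix of state-action-state transition probabilities for MDP $\mu$ as $Q^{\mu}$ and let $q^{\mu}_{s,a,i} \triangleq \mu(s_{t+1} = i \mid s_t = s, a_t = a)$. Then

$$\begin{aligned}
\xi_t(\mu) &= \xi_t(Q^{\mu}) = \xi_t(q_{s,a} = q^{\mu}_{s,a} \ \forall s \in \mathcal{S}, a \in \mathcal{A}) && \text{(15a)} \\
&= \prod_{s \in \mathcal{S}} \prod_{a \in \mathcal{A}} \xi_t(q_{s,a} = q^{\mu}_{s,a}) && \text{(15b)} \\
&= \prod_{s \in \mathcal{S}} \prod_{a \in \mathcal{A}} \frac{\Gamma(\psi^{s,a}(t))}{\prod_{i \in \mathcal{S}} \Gamma(\psi^{s,a}_i(t))} \prod_{i \in \mathcal{S}} \left( q^{\mu}_{s,a,i} \right)^{\psi^{s,a}_i(t) - 1}, && \text{(15c)}
\end{aligned}$$

where we assume that each state-action pair’s transition distribution is independent of the other transition distributions. This means that $\Psi(t)$ is a sufficient statistic for expressing the density over $\mathcal{M}$.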

We can additionally model $\mu(r_{t+1} \mid s_t, a_t)$ with a suitable belief and assume independence. This in no way complicates the exposition for MDPs.
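Under the Dirichlet model above, the posterior update amounts to incrementing a count, and a transition distribution can be sampled from the belief by normalising independent gamma variates. A minimal sketch, with hypothetical function names and a nested-list layout `psi[s][a][s2]` for the parameters:

```python
import random


def update_counts(psi, s, a, s_next):
    """Posterior update after observing the transition s --a--> s_next.

    psi[s][a][s2] are the Dirichlet parameters; conjugacy means the update
    simply adds one to the parameter of the observed next state.
    """
    psi[s][a][s_next] += 1
    return psi


def sample_transition_row(params, rng):
    """Draw one transition distribution from Dirichlet(params).

    Uses the standard construction: normalise independent
    Gamma(params[i], 1) variates. rng is a random.Random instance.
    """
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]
```

Sampling one row per state-action pair yields a full MDP sample of the kind used for the Monte Carlo bounds in Section 2.3.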

## References

• [1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learning Research, 3(Nov):397–422, 2002. A preliminary version has appeared in Proc. of the 41st Annual Symposium on Foundations of Computer Science.
• [2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002. A preliminary version has appeared in Proc. of the 15th International Conference on Machine Learning.
• [3] Richard Bellman and Robert Kalaba. A mathematical theory of adaptive control processes. Proceedings of the National Academy of Sciences of the United States of America, 45(8):1288–1290, 1959.
• [4] Richard Dearden, Nir Friedman, and Stuart J. Russell. Bayesian Q-learning. In AAAI/IAAI, pages 761–768, 1998.
• [5] Christos Dimitrakakis. Nearly optimal exploration-exploitation decision thresholds. In Int. Conf. on Artificial Neural Networks (ICANN), 2006. IDIAP-RR 06-12.
• [6] Michael O. Duff and Andrew G. Barto. Local bandit approximation for optimal learning problems. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 1019. The MIT Press, 1997.
• [7] Michael O’Gordon Duff. Optimal Learning Computational Procedures for Bayes-adaptive Markov Decision Processes. PhD thesis, University of Massachusetts at Amherst, 2002.
• [8] Sylvain Gelly and David Silver. Combining online and offline knowledge in UCT. In ICML ’07: Proceedings of the 24th international conference on Machine learning, pages 273–280, New York, NY, USA, 2007. ACM Press.
• [9] Milos Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, pages 33–94, Aug 2000.
• [10] Matthew Hoffman, Arnaud Doucet, Nando De Freitas, and Ajay Jasra. Bayesian policy learning with trans-dimensional MCMC. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.
• [11] A.J. Kleywegt, A. Shapiro, and T. Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2001.
• [12] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of ECML-2006, 2006.
• [13] LG Mitten. Branch-and-bound methods: General formulation and properties. Operations Research, 18(1):24–34, 1970.
• [14] Vladimir I. Norkin, Georg Ch. Pflug, and Andrzej Ruszczyński. A branch and bound method for stochastic global optimization. Mathematical Programming, 83(1):425–450, January 1998.
• [15] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. Proceedings of the 23rd international conference on Machine learning, pages 697–704, 2006.
• [16] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive POMDPs. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press.
• [17] Stephane Ross, Joelle Pineau, and Brahim Chaib-draa. Theoretical analysis of heuristic search methods for online POMDPs. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.
• [18] Stéphane Ross, Joelle Pineau, Sébastien Paquet, and Brahim Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32:663–704, July 2008.
• [19] T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-05), pages 542–547, 2005.
• [20] W.R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of two Samples. Biometrika, 25(3-4):285–294, 1933.
• [21] Marc Toussaint, Stefan Harmeling, and Amos Storkey. Probabilistic inference for solving (PO)MDPs, 2006.
• [22] Tao Wang, Daniel Lizotte, Michael Bowling, and Dale Schuurmans. Bayesian sparse sampling for on-line reward optimization. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 956–963, New York, NY, USA, 2005. ACM.