Abstract
In model-based solution approaches to the problem of learning in an unknown environment, exploring to learn the model parameters takes a toll on the regret. The optimal performance with respect to regret or PAC bounds is achievable if the algorithm exploits with respect to reward or explores with respect to the model parameters, respectively. In this paper, we propose TSEB, a Thompson Sampling based algorithm with an adaptive exploration bonus that aims to solve the problem with tighter PAC guarantees while being cautious on the regret as well. The proposed approach maintains distributions over the model parameters, which are successively refined with more experience. At any given time, the agent solves a model sampled from this distribution, and the sampled reward distribution is skewed by an exploration bonus in order to generate more informative exploration. The policy obtained by solving the sampled model is then used to generate more experience, which helps in updating the posterior over the model parameters. We provide a detailed analysis of the PAC guarantees and convergence of the proposed approach. We show that our adaptive exploration bonus encourages the additional exploration required for better PAC bounds on the algorithm. We provide empirical analysis on two different simulated domains.
1 Introduction
In the standard Reinforcement Learning (RL) framework, the environment with which the agent interacts is modeled as a Markov Decision Process (MDP), and the goal of the agent is to learn a policy such that the cumulative reward it receives is maximized over a finite or an infinite horizon. If the parameters of the MDP are known, then the learning process is straightforward, and the optimal policy can be learnt by traditional DP methods (rlbook). However, in any real-life application, the parameters of the MDP are not known a priori. In such a scenario, the agent can try to directly learn the policy that maximizes the return (model-free learning), or it can try to estimate the parameters of the MDP and learn a policy based on the learnt MDP (model-based learning).
Recently, model-based learning approaches have been receiving increasing attention (stren; kolter; singhvar; vanroy). In model-based RL the goal of the agent is twofold. First, it should estimate the true parameters of the model. Second, it should also behave optimally during the phase of learning the model parameters. This is yet another instance of the exploration-exploitation dilemma in Reinforcement Learning. The agent has to explore to learn the model parameters, but being explorative to improve the belief over the model parameters reduces the performance, i.e., the sum of cumulative rewards over a certain number of timesteps. In this approach, the belief over the parameters of the model gets updated as and when the agent receives a sample, $(s_t, a_t, s_{t+1}, r_{t+1})$, where $s_t$ is the state the agent is in at time $t$, $a_t$ is the action the agent took at time $t$, and $s_{t+1}$ and $r_{t+1}$ are the next state and its corresponding reward. As the number of samples increases, the belief converges to the true parameters of the MDP.
Among model-based methods, Bayesian approaches are particularly attractive due to their amenability to theoretical analysis and their convenient posterior update rule. Much of the recent work has focused on Thompson sampling (TS) (ts1933) based approaches, both in simpler bandit settings (chapelle; shipra; Shipralinear; Aditya) and in the full MDP problem (stren; adiTS). The Bayesian RL approach proposed in (stren) is an episodic way of incrementally converging to the true parameters of the model. The learning happens in phases, where after each phase the agent estimates a posterior distribution over the parameters and samples a model for the next episode. The agent solves for an optimal policy with the sampled parameters and uses it to generate trajectories in that episode, followed by updating the posterior. This approach of posterior sampling is known as Thompson sampling (ts1933). The structure of the Bayesian learning as discussed provides a nonzero probability mass over the true model parameters, guaranteeing convergence.
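The posterior update and model-sampling step of such an episodic scheme can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of (stren): it assumes a small tabular MDP, an independent Dirichlet posterior over every transition row with a uniform prior, and the helper names `sample_dirichlet`, `sample_transition_model`, and `update_counts` are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_transition_model(counts, rng):
    """Sample a full transition model from the posterior.

    counts[s][a][s2] holds prior pseudo-counts plus observed transitions.
    """
    return [[sample_dirichlet(row, rng) for row in per_state]
            for per_state in counts]

def update_counts(counts, s, a, s_next):
    """Dirichlet posterior update: add one to the observed transition count."""
    counts[s][a][s_next] += 1.0

rng = random.Random(0)
n_states, n_actions = 3, 2
# Uniform Dirichlet(1, ..., 1) prior on every (state, action) row (an assumption).
counts = [[[1.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]

# Pretend the agent observed the transition (s=0, a=1) -> s'=2 twenty times.
for _ in range(20):
    update_counts(counts, 0, 1, 2)

model = sample_transition_model(counts, rng)
print(model[0][1])  # most of the sampled mass sits on next state 2
```

In a full agent, the sampled model would be solved for a policy, the policy run for one episode, and the counts updated with every observed transition before the next sample is drawn.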
Ever since Chapelle and Li (chapelle) discussed the efficacy of TS approaches for reinforcement learning, there have been concerted attempts to achieve a better theoretical understanding of such approaches. Apart from the results in the bandit setting, the Thompson sampling approach for full RL has been shown to work well in practice and to be regret optimal (adiTS). However, there are no PAC guarantees in the literature for the Thompson sampling approach. To achieve PAC guarantees we need to encourage more aggressive exploration than enjoined by the basic TS approach, and one way to do that is to use an exploration bonus. For example, (kolter) proposed the Bayesian Exploration Bonus (BEB) algorithm, which added a constant exploration bonus to the problem of solving for an optimal policy in an unknown environment. (kolter) computes a point estimate of the MDP and solves for the optimal policy in every episode. This improves on another exploration bonus approach, MBIE-EB (mbieb), in terms of the PAC bounds. Note that, in general, adding an exploration bonus to the learning agent results in better performance with respect to PAC but may not result in a regret-optimal algorithm.
The primary contribution of our work is TSEB (Thompson Sampling with Exploration Bonus), a Thompson Sampling algorithm that uses an adaptive exploration bonus. As with usual TS approaches, TSEB maintains a distribution over the parameter space. But the sampling strategy employs an adaptive exploration bonus: when a model is sampled from the distribution in each phase, the rewards of the sampled model are modified. The exploration bonus at a state is related to the current uncertainty in the parameter estimates for that state, and this leads to more informative trajectories being generated in each episode. The exploration in most Thompson sampling approaches leads to optimal regret. We show that TSEB encourages the additional exploration required for better PAC bounds. To our knowledge, this is the first work in the literature to provide PAC guarantees for TS. We empirically show that by appropriately tuning a tradeoff parameter we can improve performance with respect to regret as well.
The major contributions of this work are:

Introducing an adaptive, based, exploration bonus to aid model based learning agent.

Providing a tighter PAC guarantee for TS, with exploration bonus.

Theoretically showing the convergence of the algorithm.

Empirically showing the inadequacy of TS to be PAC optimal.
The rest of the paper is organized as follows. Section 2 describes the preliminaries. We describe the TSEB algorithm in Section 3, followed by the theoretical analysis of TSEB in Section 4. Section 5 discusses the regret guarantees of TSEB. In Section 6, we experimentally analyse TSEB in two domains. Section 7 discusses the related work, and Section 8 concludes the paper.
2 Preliminaries
In reinforcement learning, a learning agent interacts with a world modeled as an MDP, $\langle S, A, R, T, \gamma \rangle$. The MDP description consists of $S$, the set of all states; $A$, the set of all actions; $R$, the reward function, $R: S \times A \to \mathbb{R}$; $T$, the transition function, $T: S \times A \times S \to [0,1]$; and the discount factor $\gamma$. The agent has to learn an optimal action mapping that maximizes the cumulative reward over a finite or infinite horizon. When the model parameters are known, the optimal policy, $\pi^*$, can be obtained by solving the MDP using classical DP techniques like value iteration and policy iteration, or by optimization methods.
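For the known-model case, value iteration can be sketched in a few lines. This is a generic textbook sketch rather than code from this work; the nested-list layout `T[s][a][s2]` and `R[s][a]`, the function name, and the toy two-state MDP are assumptions made for illustration.

```python
def value_iteration(T, R, gamma, tol=1e-8, max_iters=10_000):
    """Solve a known tabular MDP.

    T[s][a][s2] is a transition probability and R[s][a] an expected reward.
    Returns the value function and a greedy policy.
    """
    n_states, n_actions = len(T), len(T[0])
    V = [0.0] * n_states
    for _ in range(max_iters):
        # One synchronous Bellman optimality backup over all states.
        Q = [[R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(T[s][a]))
              for a in range(n_actions)] for s in range(n_states)]
        V_new = [max(q) for q in Q]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break
    policy = [max(range(n_actions), key=lambda a: Q[s][a])
              for s in range(n_states)]
    return V, policy

# Toy MDP (hypothetical): state 1 is absorbing and pays 1 per step;
# from state 0, action 1 moves to state 1 and action 0 stays put.
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 0.0],
     [1.0, 1.0]]
V, policy = value_iteration(T, R, gamma=0.9)
print(V, policy)
```

In the toy example the absorbing state is worth 1/(1 - 0.9) = 10, so the greedy choice in state 0 is to move there.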
Further in the discussion we will use a metric to bound the distance between the value functions of the sampled and the true MDP. The metric is inspired by the homomorphism literature (kbcs). Consider two different MDPs and . Let the norm over the difference in their rewards be , and in transition be , and the range of the rewards in be . The similarity between the problem of homomorphism and the problem of estimating the closeness of the sampled MDP to the true MDP is subtle. The structure, state space, and action space, , remain the same, and the , reward function in the true MDP, can be approximated with the samples obtained from the world. Thus, with the given descriptions, the difference in the values of the state in and , and , which we define as the f-function, can be bounded by the following expression (kbcs),
(1) 
If we have a Dirichlet distribution governing the transition, can be bounded by (singhvar), where is the number of times the state-action pair was observed. We refer to Eqn. 1 as . The is upper bounded by in a normalized bounded rewards setting (rewards in [-1,1]),
(2) 
is estimated from the difference between the expected reward sampled in an episode for an arbitrary state and the empirical mean of the rewards sampled with samples obtained till episode and timestep in that episode. This measure, the f-function, provides a measure of variance of the sampled MDP. This is an unbiased estimate of the distance from the true reward.
Let be the sample average constructed with n samples, across the episodes. As ,
(3) 
The above expression shows that the computed sample average is an unbiased estimate of the expectation of the random variable , reward of state s.
3 TSEB Algorithm
TSEB is an episodic approach, where the incrementally sampled model grows closer to the true model. Solving the converged model gives us a near optimal policy. From the problem as posed, it is intuitive to understand that the agent has to learn the true model to converge to an optimal policy, . TSEB has a modified Bellman update that considers the exploration bonus. The reward, thus, is a convex combination of the reward obtained from the sampled world, and the exploration bonus computed for that state (or ) in that episode. The Bellman update will be,
(4) 
(5) 
where provides an upper bound on the difference in the value of the state between the true and sampled MDP, and n(s,a) is the number of times the state-action pair (s,a) was visited.
The term is similar to the defined in the preliminary section except that instead of it will be computed for that particular state, ( gets replaced by the n(s,a)).
(6) 
The algorithm follows a greedy policy and takes the max action with respect to the modified Bellman update. As decays with every visit, the agent explores the state space adaptively. The updates from the sampled trajectories help the distribution narrow its belief, reducing the variance of the distribution. Because the agent follows a greedy policy, after a few episodes it samples more often from those states that are useful, so the sampled MDP cannot be guaranteed to be close to the true MDP. However, the parameters of the better rewarding states will be close to the true ones, thus providing an optimal policy. Note that TSEB learns an optimal policy for states which are useful, and the notion of useful states evolves over the episodes. Thus, even though TSEB does not learn an optimal policy for the true MDP, it learns a near-optimal policy which is closer to the optimal policy in states that will be often visited by the agent.
The linear decay makes the exploration bonus insignificant either when the parameters are close to those of the true MDP or when the number of visits is large.
TSEB, unlike most previous algorithms (except (singhvar)), uses the uncertainty in the estimates to structure the exploration bonus. This helps us model real-world systems, which have inherent uncertainty. Exploiting the inherent uncertainty in the events to decide on exploration is the key feature of our work. The exploration bonus entails the PAC guarantees of the algorithm. Further theoretical analysis shows that the bound is indeed tighter than in (singhvar). In addition, the exploration bonus here is computed without integrating over the parameters, unlike in (singhvar). There is also a principal difference from (kolter): the exploration of the agent is concentrated around the uncertain region, and the uncertainty is not assumed to be uniform over the world. This formulation can also be extended to provide an exploration bonus a priori. Apart from the prior over the model parameters, a prior over the exploration bonus helps in learning the model faster and better. We don’t analyze an a priori exploration bonus in this work.
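The idea of blending the sampled reward with a visit-dependent bonus inside a greedy backup can be illustrated with the sketch below. It is not the TSEB update itself: the symbol `beta` for the tradeoff parameter, the function names, and in particular `placeholder_bonus` (a simple stand-in that shrinks with the visit count, not the f-function-based bonus of this section) are assumptions made purely for illustration.

```python
def placeholder_bonus(n_visits, scale=1.0):
    """Stand-in bonus that shrinks with visits; NOT the paper's f-function."""
    return scale / (1.0 + n_visits)

def blended_reward(r_sampled, bonus, beta):
    """Convex combination of sampled reward and bonus.

    beta = 1 ignores the bonus (plain Thompson sampling);
    beta = 0 ignores the sampled reward (bonus only).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * r_sampled + (1.0 - beta) * bonus

def greedy_backup(s, V, T, R, n, beta, gamma):
    """One Bellman backup at state s on the sampled model with blended rewards.

    Returns the best blended action value and the greedy action.
    """
    q_values = []
    for a in range(len(T[s])):
        r = blended_reward(R[s][a], placeholder_bonus(n[s][a]), beta)
        q_values.append(r + gamma * sum(p * V[s2] for s2, p in enumerate(T[s][a])))
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return q_values[best], best

# One state, two actions: action 0 is well explored and slightly better,
# action 1 has never been tried (all numbers are hypothetical).
T = [[[1.0], [1.0]]]
R = [[0.5, 0.4]]
n = [[100, 0]]
V = [0.0]
print(greedy_backup(0, V, T, R, n, beta=0.5, gamma=0.9)[1])  # explores action 1
print(greedy_backup(0, V, T, R, n, beta=1.0, gamma=0.9)[1])  # greedy on reward
```

With the bonus active, the untried action wins the backup; once its count grows, the bonus fades and the choice reverts to the action with the higher sampled reward.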
4 Theoretical Analysis of TSEB
PAC analysis provides an upper bound on the number of suboptimal steps an agent needs before the algorithm converges to an optimal solution with probability . Let us assume the algorithm requires a set of samples, M. Let each sample be . Though TS theory gives us regret guarantees at (T) (adiTS), where is the number of timesteps, we do not have a notion of a PAC bound for TS. This is primarily because the algorithm has to be explorative to learn the model parameters for it to be PAC optimal. The conundrum here is that if the algorithm is explorative, its regret will be worse, while greedy action selection does not let the agent be explorative. TSEB adds an exploration bonus, thereby skewing the sampled MDP so that exploration becomes part of the rewards; the agent can then still be greedy in its action selection and provide a sample guarantee. The linear decay of the exploration bonus helps the agent converge to the optimal policy as in the TS setting.
Theorem 4.1.
Exploration using the defined exploration bonus , , leads to a monotonic convergence of sampled parameters to the optimal parameters.
Proof.
This theorem essentially says that adding exploration does not affect the monotonic convergence of Thompson Sampling. To prove this theorem, it is sufficient to show that the exploration bonus leads to a monotonic convergence of the f-function. The function defined in Eqn. 6 for every episode, by way of doing posterior sampling, decreases with samples. As the function is an unbiased estimate of the difference in the value of the true and the sampled MDP (Eqn. 1), the decrease in the f-function indicates that the sampled model parameters are closer to the true MDP. The f-function will converge to an such that . For any number of samples further, the sampled MDP lies within the ball of the true MDP. To bound the saturation point, let us estimate the rate of change of with respect to timestep, ,
(7) 
Since the rewards are bounded, rate of change of range can be bounded by a constant c, 2. At optimum, the first order derivative vanishes. Hence,
(8) 
The sum over differences across all the states gives an upper bound on the true distance between the sampled and actual MDP. Hence, in an episode this is,
(9) 
Now, with , the cardinality of the set of states, and , the cardinality of set of actions, this can be upper bounded by,
(10) 
Let the sum over differences be denoted by , then,
(11) 
Eqn. 11 states that n(s,a) is inversely related to . And, is directly proportional to . As we don’t discard the samples, the increases monotonically, thus letting the decrease monotonically. The saturation of the upper bound on the distance, the f-function, provides a formal guarantee of the convergence of TSEB. ∎
PAC-MDP: An RL algorithm is said to be PAC-MDP if, for any MDP M, $\epsilon > 0$, and $0 < \delta < 1$, the sample complexity of the algorithm is bounded by some function f that is polynomial in S, A, $1/\epsilon$, $1/\delta$, and $1/(1-\gamma)$, with probability at least $1 - \delta$.
Theorem 4.2.
After M = steps, TSEB converges to an optimal value function with probability 1.
Outline of the Proof. Let be the expected probability of selecting an action that will lead the agent to an unexplored state. Let us define a positive nonzero number, , which is the number of time steps in each episode such that the expected number of visits to unexplored states is at least 1. Let there be a finite positive integer , the number of times a state has to be visited for its exploration bonus to become insignificant, implying that the state has been explored.
With the above notation, the number of episodes required to converge to the true MDP parameters will be kSA, where and are the cardinalities of the state and action sets. We define the number of samples required for the algorithm to be optimal as kSAT. The expression contains k and T, which are not known. The expressions obtained in this section for the sample complexity map to k and T indirectly.
By showing that the difference between the true MDP and the sampled MDP decreases monotonically with every sample and that the f-function provides a finite-length converging sequence, we can compute the total sample complexity, . We consider a variance-based concentration measure, as it is more applicable for deriving the bounds for TSEB and provides a sharper concentration measure than the Chernoff bound used in the analysis of UCB-like algorithms. Let be independent random variables with and . Then for 0,
(12) 
This is an extension of the Chernoff bounds (chernoff1952) in a known variance setting.
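To give a feel for why a variance-aware bound can be sharper, the sketch below compares the number of samples needed for a given accuracy under Hoeffding's inequality and under Bernstein's inequality. Treating the bound above as Bernstein-type is an assumption made for this illustration, and the numbers (rewards in [-1, 1], variance 0.01, accuracy 0.1, confidence 0.95) are hypothetical.

```python
import math

def hoeffding_samples(value_range, eps, delta):
    """Samples n so that |mean - mu| < eps with probability >= 1 - delta,
    from Hoeffding: P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2 / range^2)."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

def bernstein_samples(variance, max_dev, eps, delta):
    """Same guarantee from Bernstein's inequality with known variance:
    P(|mean - mu| >= eps) <= 2 exp(-n eps^2 / (2 var + 2 max_dev eps / 3)),
    where max_dev bounds |X_i - mu| almost surely."""
    return math.ceil((2.0 * variance + 2.0 * max_dev * eps / 3.0)
                     * math.log(2.0 / delta) / eps ** 2)

# Rewards in [-1, 1]: the range is 2, so deviations from the mean are at most 2.
n_h = hoeffding_samples(value_range=2.0, eps=0.1, delta=0.05)
n_b = bernstein_samples(variance=0.01, max_dev=2.0, eps=0.1, delta=0.05)
print(n_h, n_b)  # the variance-aware requirement is much smaller here
```

When the variance is small relative to the range, the variance-aware requirement drops by roughly an order of magnitude in this example, which is the effect the analysis relies on.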
Proof.
From Variance bounds definition, for a sample of reward sequence from a single state, . Let , and
(13) 
(14) 
The above equation expresses the relation between and (). in the above equation is the summation of differences in value of states between the true and sampled MDP. This can be upper bounded by .
Further we need to establish that the exploration bonus decays with the variance of the model parameters. By definition, the exploration bonus is a cumulative sum of differences between the sampled state parameters and an unbiased estimate of the parameters. This,
(15) 
where, is the augmented notation for the state’s parameter, is an unbiased estimate of the mean, and is the sampled parameter.
The variance of the estimate will be 2norm of the estimate above, Eqn. 15, and we look at cumulative 1norm. The difference occurs in the magnitude of the convergence rate, but the point of convergence remains the same. Hence, we use a variance based complexity bound to upper bound the number of samples.
For a tighter , variance has to be smaller. This implies that has to be smaller. As repeated sampling of trajectory decreases the variance, this is set as an adaptive exploration bonus to the agent. Now, with being the number of visits to a state , and being the expected initial distance of the sampled MDP from the true MDP with respect to the prior,
(16) 
The upper bound on the number of visits to an individual state is given by,
(17) 
The total sample complexity, M, for () guarantee on the converged MDP, with S and A being the cardinalities of set of states and actions, is bounded by,
(18) 
(19) 
Eqn. 19 shows that the upper bound on the sample complexity is dependent on the initial estimates of the model. , the total sample complexity is adaptive, as it is a function of the distance between the sampled MDP and the true MDP. The bound, hence, is adaptive and theoretically better than the earlier bounds on sample complexity for a PAC-MDP algorithm. ∎
Algorithm  PAC Bounds 

MBIE (mbieb)  
BEB (kolter)  
Variance Based (singhvar)  
TSEB 
Table 1 lists the existing PAC bounds for the model-based learning setting. Note that TSEB is the only Thompson Sampling algorithm in this list. The PAC bound for TSEB is better than those of MBIE and the variance-based method, while term has term which makes this bound higher when compared to BEB. However, TSEB also performs better in regret, which is not guaranteed with BEB.
5 Arguments on Regret
The discussion so far elucidates the PAC guarantees offered by the TSEB algorithm. The claim that the algorithm does not do worse in regret has not been addressed so far. Following a greedy policy from the sampled MDP is not very different from the TS approach. The parameters sampled in every episode grow closer to the true model, as discussed empirically and theoretically in the corresponding sections. As the TSEB agent acts greedily with the sampled model parameters and the model parameters converge, the agent after a certain number of episodes will be acting optimally with the true parameters, . Because, the greedy policy in the will be an optimal policy .
The exploration bonus, a linearly decaying component in the modified Bellman update, will become insignificant even if it doesn’t become zero. This ensures that TSEB behaves like pure Thompson Sampling after sufficient exploration. Let us define regret at any arbitrary step , , as
(20) 
Where, is the expected reward by taking the optimal action and is the average reward obtained at a step . The expected regret ,
(21) 
where, is the number of suboptimal steps in an episode and , the expected regret in episode , different from the previous definition. From previous sections, we can observe that the is a converging sequence and so is , because of the greedy policy that is mandated in the TS algorithm.
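A per-step regret of this kind, the gap between the expected reward of the optimal action and the reward actually obtained, can be accumulated as in the sketch below. The function name and the reward sequence are hypothetical; the sketch only illustrates how a converging agent yields a flattening cumulative regret curve.

```python
def cumulative_regret(optimal_reward, obtained_rewards):
    """Running sum of (optimal expected reward - obtained reward) per step."""
    total = 0.0
    curve = []
    for r in obtained_rewards:
        total += optimal_reward - r
        curve.append(total)
    return curve

# Hypothetical run: the agent is suboptimal for two steps, then acts optimally.
curve = cumulative_regret(1.0, [0.2, 0.5, 1.0, 1.0])
print(curve)  # stops growing once the agent acts optimally
```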
From the algorithm, it is clear that it behaves like the true Thompson Sampling algorithm after the point at which the exploration bonuses become numerically insignificant. The suboptimal steps taken by the agent fall into two cases:

When the sampled parameters are off from the true model parameters and the agent takes a greedy action.

Taking an action that is not the optimal action with respect to the sampled MDP. (This may be due to the uncertainty in the action selection.)
The recent work on regret in parameterized MDPs (adiTS) is a major contribution to the regret analysis of the full RL Thompson sampling approach. The regret analysis of TSEB can be carried out similarly to that of TSMDP, but it differs in the additive constant term. The big-O notation of the regret makes this term insignificant, and hence the same analysis holds.
6 Empirical Analysis
In this section, we experimentally analyze the performance of TSEB. We run experiments in two simulated domains, Chain world (kolter) and Queuing Domain (adiTS). The aim of the experiments is to experimentally validate the claim of convergence of the belief and analyze the algorithm under different values of the tradeoff parameter, [0,1].
6.1 Chain World
The chain domain has 5 states and 2 actions and . The agent can take both the actions from any state. With probability 0.2 the agent takes the opposite of the selected action. The transitions and reward are shown in the figure. The first state has a stochastic reward, from a Gaussian . The optimal policy with is to take action in all the states. The algorithm is evaluated with different values of the tradeoff parameter (Table 2). The analysis shows better performance (cumulative sum of rewards) for every nonzero value. This is intuitive, because when the tradeoff parameter is 0 the algorithm acts only to reduce the variance and ignores the rewards obtained in the world. This behavior is expected. But the performance increases with the tradeoff parameter and decreases after 0.5. The performance has high variance and is inconsistent when the tradeoff parameter is 1; this is the TS case. A value of 0.1 gives the maximum cumulative reward in this case.
Tradeoff parameter  Average cumulative reward

0.0  1382.80 
0.1  1963.74 
0.2  1951.72 
0.3  1944.65 
0.4  1954.06 
0.5  1956.22 
0.6  1955.14 
0.7  1953.99 
0.8  1940.77 
0.9  1934.99 
1.0  1942.63 
Further, we analyze the convergence of the model parameters in Fig. 2(a) and Fig. 2(b). We plot the f-function against the episodes. The graphs show that posterior sampling with the exploration bonus converges faster. When , the plot shows that the algorithm converges to an inferior model. The inferiority of the model corresponds to the higher f-value. The f-value for a tradeoff parameter of 0.5 converges to a much better model. This can be argued because the agent considers both the variance in the model parameters and the reward obtained in the true world to be maximized, thus converging to a better model.
Thompson sampling, which is the special case in which the tradeoff parameter is 1, keeps oscillating and doesn’t converge. This is because of the lack of exploration. The graph relates the distance between the sampled MDP and the true MDP to the number of samples. As the agent in the TS setup acts greedily, its exploration is poor. The agent has to explore to converge to the true model parameters. TS, being regret optimal, always chooses the greedy action and doesn’t explore the state space well. Hence, the poor PAC guarantees of TS are experimentally validated. Similarly, the better PAC guarantees that can be obtained by introducing an exploration bonus are validated as well.
6.2 Queuing World
We analyse the TSEB algorithm with different values of the tradeoff parameter (Table 3) in the Queuing world defined in (adiTS). The state of the MDP is simply the number of packets in the queue at any given time, i.e., S = {0,1,2,…,50}. At any given time, one of 2 actions, Action 1 (SLOW service) and Action 2 (FAST service), may be chosen, i.e., A = {1,2}. Applying SLOW (resp. FAST) service results in serving one packet from the queue with probability 0.3 (resp. 0.8) if it is not empty, i.e., the service model is Bernoulli($\mu_i$), where $\mu_i$ is the packet processing probability under service type i = 1,2. Actions 1 and 2 incur a per-instant cost of 0 and 0.25 units respectively. In addition to this cost, there is a holding cost of 0.1 per packet in the queue at all times. The system gains a reward of +1 unit whenever a packet is served from the queue.
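A single transition of this queuing domain can be simulated as in the sketch below. The service probabilities, costs, and queue bound come from the description above; the arrival process is not specified there, so the Bernoulli arrival with `arrival_prob=0.5` and the order of events within a step (cost, then service, then arrival) are assumptions, as is the function name `queue_step`.

```python
import random

SERVICE_PROB = {1: 0.3, 2: 0.8}   # Action 1 = SLOW, Action 2 = FAST
ACTION_COST = {1: 0.0, 2: 0.25}   # per-instant cost of each action
HOLDING_COST = 0.1                # per packet currently in the queue
MAX_QUEUE = 50

def queue_step(state, action, rng, arrival_prob=0.5):
    """Simulate one step; arrival_prob and the event order are assumptions."""
    reward = -ACTION_COST[action] - HOLDING_COST * state
    # Service is attempted only if the queue is not empty.
    if state > 0 and rng.random() < SERVICE_PROB[action]:
        state -= 1
        reward += 1.0
    # Assumed Bernoulli arrival, capped at the maximum queue length.
    if rng.random() < arrival_prob:
        state = min(state + 1, MAX_QUEUE)
    return state, reward

rng = random.Random(0)
state, total = 10, 0.0
for _ in range(1000):
    state, reward = queue_step(state, 2, rng)  # always use FAST service
    total += reward
print(state, round(total, 2))
```

A learning agent would treat the service probabilities as unknown and estimate them from observed transitions of this simulator.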
Tradeoff parameter  Average cumulative reward

0.0  5050 
0.1  5061 
0.2  5038 
0.3  5048 
0.4  5051 
0.5  5042 
0.6  5040 
0.7  5062 
0.8  5038 
0.9  5026 
1.0  5023 
The comparison in Table 3 shows the cumulative reward for different settings. , the regret optimal case, outperformed the others. This is because the model didn’t have much variance in the parameters, so the learning was faster. Hence, the regret optimal way was better than the rest.
The two worlds provide two different scenarios: one in which the difference in performance across values of the tradeoff parameter is large (Chain domain), and one in which the difference is small (Queuing domain). Both experiments suggest that the combination of the exploration bonus and the true rewards in the MDP provides better performance. As the Chain domain doesn’t offer negative rewards to the agent, exploration pays off well for the agent. But relying only on the exploration bonus, =0, doesn’t let it converge to the optimal policy. Hence, the agent accumulates better reward when it is not exploration-centric. On the other hand, in the Queuing domain the agent receives negative rewards as well, which does not favor an over-explorative agent, and the variance in the model parameters is also small. Hence, it accumulates better cumulative reward when it is regret optimal. These two experiments suggest a heuristic for tuning the tradeoff parameter: it can be dynamically adapted with respect to the unbiased variance in the reward parameter estimate. We leave this as future work.
7 Related Work
Optimism in the face of uncertainty is a well-appreciated approach that is reasonably widely applied in practice. The approach overestimates the state-value or action-value estimates with some heuristic to aid the exploration of the agent. In (kael), an algorithm proposed as Interval Estimation Q-learning (IEQ), the action with the highest upper bound on the underlying Q-value gets chosen. This work also asserts that the gradual decay of the overestimation lets the agent converge to the optimal policy. This has been followed in approaches as early as UCB (auer), where the empirical mean, , of an arm i is overestimated by the confidence interval of the estimated mean. For solving an MDP, UCRL (ucrl) takes an approach inspired by the UCB technique of overestimation to aid exploration. This provides logarithmic regret bounds in an MDP setting. In an unknown environment setting, the variance-based approach of overestimating the value of a state to aid exploration was proposed in (singhvar), but it is not a TS approach.
Quite a few approaches have addressed the sample complexity issue in RL. However, being sample efficient makes the regret worse, and hence PSRL is better when regret-optimal learning is needed. Also, the theoretical guarantees of TS were not analyzed until recently (shipra). Similar guarantees, though, were not extended to the PAC setting. Recently, (adiTS) gave a regret analysis of TS in the full MDP setting that is logarithmic in , the timesteps. (vanroy) presented an information-theoretic analysis of TS, giving a better regret bound by considering the entropy of the action distribution. In the last decade, parameter estimation was extended to the MDP setting as an episodic way of solving for the model estimate in an unknown environment (stren).
More recently, (kolter) proposed the Bayesian Exploration Bonus (BEB) algorithm, which added a constant exploration bonus to standard (non-Thompson-sampling based) Bayesian RL. This improved upon MBIE-EB (mbie), an interval-based exploration bonus algorithm, by increasing the decay rate. (kolter) states that the Bayesian approach cannot have a PAC solution if it doesn’t encode an exploration bonus. So, BEB first proposed a bound on the samples, which is the first PAC analysis of Bayesian RL. In the line of (stren), BOSS, Best of Sampled Sets (boss), samples multiple models and merges them. The framework then runs trajectories on the derived MDP. It has a constant B, the number of visits for the agent to know a state’s parameters. Note that TSEB can be extended to the BOSS setting by sampling multiple MDPs and following the TSEB exploration bonus.
From the literature, it is evident that quite a few approaches have been considered for solving for an optimal policy. The most recent of them include computing the mean MDP (kolter) and the ML MDP (singhvar). As the two approaches compute a point estimate, it is theoretically possible that the probability mass over the true model parameters becomes zero or that the estimate converges to a very bad one in certain cases. The TS approach, on the other hand, is a pure Bayesian technique that keeps updating the belief and samples a new MDP from the updated belief. Though this converges, it is only regret optimal and so provides a very bad PAC estimate. Thus, we showed that adding a better exploration bonus can make traditional TS sample efficient and PAC-MDP.
8 Conclusion
In this work we propose TSEB, a Thompson sampling approach to model-based RL that uses an adaptive exploration bonus. This is the first TS variant that provides a PAC bound. We introduced a tradeoff parameter that controls how much the exploration bonus influences the policy learnt on a sampled MDP. Tuning this parameter allows us to achieve better empirical performance with respect to the regret as well. While this work provides initial intuition into the PAC analysis of TS, more work needs to be done to establish a theory of useful exploration bonuses and performance guarantees. Extending the model estimation to a non-parameterized setting, devoid of tight constraints over the parameter space, will also be a useful extension that will be applicable to a wide range of problems.