
# ISL: Optimal Policy Learning With Optimal Exploration-Exploitation Trade-Off

Lucas Cassano
Department of Electrical Engineering
University of California, Los Angeles
Los Angeles, CA 90095-1594
cassanolucas@ucla.edu
Ali H. Sayed
School of Engineering
École Polytechnique Fédérale de Lausanne
Lausanne, Switzerland CH-1015
ali.sayed@epfl.ch
The author is also with the Institute of Electrical Engineering at EPFL. Alternative address lucas.cassano@epfl.ch.
###### Abstract

Traditionally, off-policy learning algorithms (such as Q-learning) and exploration schemes have been derived separately, and the exploration-exploitation dilemma is often addressed through heuristics. In this article we show that both the learning equations and the exploration-exploitation strategy can be derived in tandem as the solution to a unique and well-posed optimization problem whose minimization leads to the optimal value function. We present a new algorithm that follows this idea. The algorithm is of the gradient type (and therefore has good convergence properties even when used in conjunction with function approximators such as neural networks); it is off-policy; and it specifies both the update equations and the strategy to address the exploration-exploitation dilemma. To the best of our knowledge, this is the first algorithm with all of these properties.

## 1 Introduction

Reinforcement learning (RL) is concerned with designing algorithms that seek to maximize long term cumulative rewards by interacting with an environment whose dynamics are unknown. Three main features are desirable for model-free algorithms to achieve this goal efficiently: (a) high sample efficiency, (b) guaranteed convergence (even when the algorithm is used in conjunction with expressive function approximators like neural networks) and (c) the ability to perform deep exploration. Recently, algorithms based on policy gradients have been introduced that have guaranteed convergence and achieve state-of-the-art results in many tasks; the most notable cases are TRPO TRPO , PPO ppo and A3C a3c . These algorithms have two main drawbacks: they have poor sample efficiency (due to the fact that they operate on-policy) and they are not capable of performing deep exploration (i.e., they tend to perform poorly in environments with sparse rewards). Another group of algorithms is based on Q-learning (like DQN dqn and DDQN ddqn ). These algorithms have high sample efficiency, but they also have two main drawbacks: they can diverge when used in conjunction with function approximators and they do not address the exploration-exploitation dilemma. Hence, heuristics are typically necessary to endow these algorithms with better exploration capabilities (for example, Bootstrap DQN boot_dqn ). More recently, a new family of algorithms has been introduced, which is based on the idea of learning policies that aim to maximize the long term cumulative rewards while also maximizing their own entropy. One notable algorithm within this group is SBEED sbeed , which has high sample efficiency and guaranteed convergence when used with function approximators. However, SBEED does not address the exploration-exploitation dilemma and hence does not perform efficient exploration.
The contribution of this work is the introduction of a novel algorithm which, to the best of our knowledge, is the first algorithm that has the three aforementioned properties (a)-(c). The main difference between our algorithm and previous work is that we use a novel cost function which makes the exploration-exploitation dilemma explicit and derives the learning rule and the exploratory strategy in tandem as the solution to a unique optimization problem.

### 1.1 Relation to prior work

Our paper is mostly related to recent work on maximum entropy algorithms. Some of the most prominent algorithms in this area are G-learning G_learning , soft Q-learning haarnoja2017reinforcement , PCL pcl , SAC sac , Trust-PCL trust_pcl , and SBEED sbeed . All these algorithms augment the traditional RL objective with a term that aims to maximize the entropy of the learned policy, which is weighted by a temperature parameter. The consequence of using this augmented objective is two-fold. First, it allows to derive convergent off-policy algorithms (even when used with function approximators). Second, it improves the exploration properties of the algorithms. However, using this augmented objective has two main drawbacks. In the first place, the policy to which these algorithms converge is biased away from the true optimal policy. This point can be handled by annealing the temperature parameter but this can slow down convergence. Furthermore, it is unclear what the optimal schedule is to perform such annealing and how it affects the conditioning of the optimization problem. In the second place, even though the exploration is improved, algorithms derived from this modified cost are not efficient at performing deep exploration. The reason for this is that a unique temperature parameter is used for all states. In order to perform deep exploration it is necessary to have a scheme that allows agents to learn policies which exploit more in states where the agent has high confidence in the optimal action and act in a more exploratory manner in unknown states. The main difference between our approach and these works is that we augment the traditional RL objective with a term that makes the exploration-exploitation trade-off explicit instead of the policy’s entropy. Under our scheme, agents converge to the true optimal policy without the need for annealing of any parameters and, moreover, an exploration strategy is derived that is capable of performing deep exploration.

## 2 Preliminaries

We consider the problem of policy optimization within the traditional reinforcement learning framework. We model our setting as a Markov Decision Process (MDP), with an MDP defined by the tuple $(\mathcal{S},\mathcal{A},\mathcal{P},r)$, where $\mathcal{S}$ is a set of states of size $S$, $\mathcal{A}$ is a set of actions of size $A$, $\mathcal{P}(s'|s,a)$ specifies the probability of transitioning to state $s'$ from state $s$ having taken action $a$, and $r(s,a,s')$ is the average reward when the agent transitions to state $s'$ from state $s$ having taken action $a$.

###### Assumption 1.

We assume the rewards are uniformly bounded, i.e., $|r(s,a,s')|\leq r_{\max}$ for all $(s,a,s')$.

In this work we consider the maximization of the infinite-horizon discounted reward as the objective of the RL agent:

 $\pi^{\dagger}(a|s)=\operatorname*{argmax}_{\pi}\;\mathbb{E}_{\mathcal{P},\pi}\Big(\sum_{t=0}^{\infty}\gamma^{t}r(\boldsymbol{s}_{t},\boldsymbol{a}_{t},\boldsymbol{s}_{t+1})\,\Big|\,\boldsymbol{s}_{0}=s\Big)$ (1)

where $\pi^{\dagger}$ is the optimal policy, $\gamma\in[0,1)$ is the discount factor, and $\boldsymbol{s}_{t}$ and $\boldsymbol{a}_{t}$ are the state and action at time $t$, respectively. We clarify that in this work, random variables are always denoted in bold font. We recall that each policy $\pi$ has an associated state value function $v^{\pi}(s)$ and state-action value function $q^{\pi}(s,a)$ given by (in this paper we will refer to both $v^{\pi}$ and $q^{\pi}$ as value functions indistinctly):

 $v^{\pi}(s)=\mathbb{E}_{\mathcal{P},\pi}\Big(\sum_{t=0}^{\infty}\gamma^{t}r(\boldsymbol{s}_{t},\boldsymbol{a}_{t},\boldsymbol{s}_{t+1})\,\Big|\,\boldsymbol{s}_{0}=s\Big)=\mathbb{E}_{\mathcal{P},\pi}\big(r(s,\boldsymbol{a},\boldsymbol{s}')+\gamma v^{\pi}(\boldsymbol{s}')\big)$ (2a)
 $q^{\pi}(s,a)=\mathbb{E}_{\mathcal{P},\pi}\Big(\sum_{t=0}^{\infty}\gamma^{t}r(\boldsymbol{s}_{t},\boldsymbol{a}_{t},\boldsymbol{s}_{t+1})\,\Big|\,\boldsymbol{s}_{0}=s,\boldsymbol{a}_{0}=a\Big)=\mathbb{E}_{\mathcal{P}}\big(r(s,a,\boldsymbol{s}')+\gamma v^{\pi}(\boldsymbol{s}')\big)$ (2b)

It is well-known that the optimal value functions satisfy the following fixed point equations puterman :

 $q^{\dagger}(s,a)=r(s,a)+\gamma\,\mathbb{E}_{\mathcal{P}}\max_{a'}q^{\dagger}(\boldsymbol{s}',a')$ (3a)
 $v^{\dagger}(s)=\max_{a}\big[r(s,a)+\gamma\,\mathbb{E}_{\mathcal{P}}\,v^{\dagger}(\boldsymbol{s}')\big]=\max_{a}q^{\dagger}(s,a)$ (3b)

where for convenience we defined $r(s,a)=\mathbb{E}_{\mathcal{P}}\,r(s,a,\boldsymbol{s}')$.
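To make the fixed-point relations (3) concrete, the following sketch runs Q-value iteration on a small synthetic MDP; the transition kernel and rewards are invented purely for illustration:

```python
import numpy as np

# Tiny 2-state, 2-action MDP: P[s, a, s'] and expected rewards r[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def q_value_iteration(P, r, gamma, tol=1e-10):
    """Iterate the fixed point (3a): q(s,a) = r(s,a) + gamma * E_P max_a' q(s',a')."""
    q = np.zeros_like(r)
    while True:
        q_new = r + gamma * P @ q.max(axis=1)  # P @ v gives E_P v(s') for each (s,a)
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new

q_opt = q_value_iteration(P, r, gamma)
v_opt = q_opt.max(axis=1)  # relation (3b)
```

Since the Bellman operator is a $\gamma$-contraction, the iteration converges to the unique fixed point $q^{\dagger}$.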

## 3 Algorithm derivation

Optimization problem (1) and relations (2) and (3) are useful for deriving algorithms for planning problems (i.e., problems in which the reward function and transition kernel are known) but are unfit to derive RL algorithms, because they ignore the fact that the agent relies on estimated quantities (which are subject to uncertainty). Hence, in this work, we modify (1) to reflect the fact that an RL agent is constrained by the uncertainty of its estimates. We change the goal of the agent to not just maximize the discounted cumulative rewards but also to minimize the uncertainty of its estimated quantities. For this purpose, we assume that at any point in time the agent has some estimate $\hat q(s,a)$ of the optimal value function, which is subject to some uncertainty. We quantify this uncertainty through the state-action Bellman error $\delta(s,a)$ and model it in a Bayesian manner. More specifically, we assume $\delta(s,a)$ follows a uniform probability distribution with zero mean:

 $\delta(s,a)\sim\mathcal{U}\big(-\ell(s,a),\,\ell(s,a)\big)$ (4)

We will refer to the probability density function of $\delta(s,a)$ as $d_{(s,a)}(\delta)$. We assume zero-mean uniform distributions for the following reasons:

• Zero mean: if the mean were different from zero (i.e., $\mathbb{E}\,\delta(s,a)=\mu(s,a)\neq 0$), it could be used to improve the estimate as $\hat q(s,a)+\mu(s,a)$, resulting in a new estimate for which the state-action Bellman error would be zero mean.

• Uniform distribution: under Assumption 1, we know that for any infinitely discounted MDP a symmetric bound for the state-action Bellman error exists in the form $|\delta(s,a)|\leq\ell(s,a)$ (this is due to the fact that the value functions are lower and upper bounded by $-r_{\max}/(1-\gamma)$ and $r_{\max}/(1-\gamma)$, and hence a finite bound always exists). Moreover, typically there is no prior information about the error distribution between these bounds, and therefore a non-informative uniform distribution with limit $\ell(s,a)$ becomes appropriate.

We further define the state Bellman error $\delta^{\pi}(s)$, whose distribution is given by a mixture of the state-action error distributions:

 $\delta^{\pi}(s)\sim\sum_{a}\pi(a|s)\,d_{(s,a)}(\delta)$ (5)
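A draw from the mixture (5) can be simulated by first sampling an action from the policy and then sampling that action's zero-mean uniform error; a minimal sketch (the function name is ours):

```python
import numpy as np

def sample_state_bellman_error(pi, ell, rng):
    """Draw one sample of the state Bellman error delta^pi(s) per (5):
    pick an action a ~ pi(.|s), then sample the zero-mean uniform
    state-action error U(-ell(s,a), ell(s,a))."""
    a = rng.choice(len(pi), p=pi)
    return rng.uniform(-ell[a], ell[a])
```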

Note that the Bayesian update for $\ell(s,a)$ is given by:

 $\ell(s,a)\leftarrow\max\{\ell(s,a),\,|\delta_{i}(s,a)|\}$ (6)

where $\delta_{i}(s,a)$ is the $i$-th sample from $d_{(s,a)}(\delta)$. Note that the Bayesian update (6) assumes a stationary error distribution. However, as the agent updates its estimate $\hat q(s,a)$, the error distributions will change over time. For this reason, we modify (6) to endow it with tracking capabilities:

 $\ell(s,a)\leftarrow(1-\alpha_{\ell})\max\{\ell(s,a),\,|\delta_{i}(s,a)|\}+\alpha_{\ell}|\delta_{i}(s,a)|$ (7)

Note that when an error is sampled that is bigger than its corresponding estimate $\ell(s,a)$, the update equations (6) and (7) coincide. Update equation (7) is in tabular form. However, in practical applications it is typically necessary to parameterize $\ell(s,a)$ with parameters $\nu$ (which we denote as $\ell(s,a;\nu)$) to reduce the dimensionality of the learning problem. To obtain an update equation for $\nu$, we define the following optimization problem:

 $\min_{\nu}\;\mathbb{E}_{\psi}\,2^{-1}\big[\ell(s,a;\nu)-(1-\alpha_{\ell})\max\{\ell(s,a;\nu),|\delta_{i}(s,a)|\}-\alpha_{\ell}|\delta_{i}(s,a)|\big]^{2}\;\triangleq\;J_{B}(\nu)$ (8)

where $\psi$ is some distribution according to which the state-action pairs are sampled. The gradient of (8) (which can be used to update $\nu$) is given by:

 $\nabla_{\nu}J_{B}(\nu)=\big(\ell(s,a;\nu)-(1-\alpha_{\ell})\max\{\ell(s,a;\nu),|\delta_{i}(s,a)|\}-\alpha_{\ell}|\delta_{i}(s,a)|\big)\nabla_{\nu}\ell(s,a;\nu)$ (9)
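The tabular update (7) and the sample gradient (9) are straightforward to implement; a minimal sketch (function names are ours):

```python
import numpy as np

def update_uncertainty_tabular(ell, delta_sample, alpha_l):
    """Tracking update (7) for the tabular uncertainty estimate ell(s, a).

    When |delta_sample| exceeds the current estimate this reduces to the
    pure Bayesian update (6); otherwise the estimate decays toward
    |delta_sample| at rate alpha_l, tracking a drifting error distribution.
    """
    return (1.0 - alpha_l) * max(ell, abs(delta_sample)) + alpha_l * abs(delta_sample)

def grad_JB(ell_val, grad_ell, delta_sample, alpha_l):
    """Sample gradient (9) of J_B(nu) for a parametric ell(s, a; nu).

    ell_val and grad_ell are the current value and parameter-gradient of
    the approximator at (s, a)."""
    target = (1.0 - alpha_l) * max(ell_val, abs(delta_sample)) + alpha_l * abs(delta_sample)
    return (ell_val - target) * grad_ell
```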

### 3.1 Optimization problem

We now define the optimization problem for our RL agent to be:

 $\pi^{\star}(a|s)=\operatorname*{argmax}_{\pi}\;\mathbb{E}_{\mathcal{P},\pi}\Big(\sum_{t=0}^{\infty}\gamma^{t}\big[r(\boldsymbol{s}_{t},\boldsymbol{a}_{t},\boldsymbol{s}_{t+1})-\kappa D_{\rm KL}\big(\delta^{\pi}(\boldsymbol{s}_{t})\,||\,\delta^{\bullet}(\boldsymbol{s}_{t})\big)\big]\,\Big|\,\boldsymbol{s}_{0}=s\Big)$ (10)

where $\kappa>0$ is a weighting parameter, $\delta^{\bullet}(s)$ is the Bellman state error corresponding to some policy $\pi^{\bullet}$ that seeks to maximize the cumulative discounted information (i.e., $\pi^{\bullet}(a|s)=\mathbb{1}\big(a=\operatorname{argmax}_{a'}\ell(s,a')\big)$, where $\mathbb{1}$ is the indicator function) and $D_{\rm KL}$ is the Kullback-Leibler divergence or relative entropy. In this work we refer to $\pi^{\star}$ as the uncertainty constrained optimal policy (or uc-optimal policy). Under this new objective, we redefine the value functions as:

 $v^{\pi}(s)=\mathbb{E}_{\mathcal{P},\pi}\Big(\sum_{t=0}^{\infty}\gamma^{t}\big[r(\boldsymbol{s}_{t},\boldsymbol{a}_{t},\boldsymbol{s}_{t+1})-\kappa D_{\rm KL}\big(\delta^{\pi}(\boldsymbol{s}_{t})\,||\,\delta^{\bullet}(\boldsymbol{s}_{t})\big)\big]\,\Big|\,\boldsymbol{s}_{0}=s\Big)$ (11a)
 $\phantom{v^{\pi}(s)}=\mathbb{E}_{\mathcal{P},\pi}\big(r(s,\boldsymbol{a},\boldsymbol{s}')+\gamma v^{\pi}(\boldsymbol{s}')\big)-\kappa D_{\rm KL}\big(\delta^{\pi}(s)\,||\,\delta^{\bullet}(s)\big)$ (11b)
 $q^{\pi}(s,a)=r(s,a)+\gamma\,\mathbb{E}_{\mathcal{P}}\,v^{\pi}(\boldsymbol{s}')$ (11c)

Using (11) we can rewrite (10) as:

 $\pi^{\star}(a|s)=\operatorname*{argmax}_{\pi}\;\sum_{a}\pi(a|s)\hat q(s,a)-\kappa D_{\rm KL}\big(\delta^{\pi}(s)\,||\,\delta^{\bullet}(s)\big)$ (12)

Note that the exploration-exploitation trade-off becomes explicit in our cost function. To maximize the first term of the summation, the agent has to exploit its knowledge of $\hat q(s,a)$, while to maximize the second term, the agent's policy needs to match $\pi^{\bullet}$, which is a policy that seeks to maximize the information gathered through exploration. Since the argument being maximized in (12) is differentiable with respect to $\pi(a|s)$, we can obtain a closed-form expression for $\pi^{\star}$. Before providing the closed-form solution for $\pi^{\star}$ we introduce the following useful lemma and definitions.

###### Definition 1.

Pareto dominated action: For a certain state $s$ we say that an action $a_{k}$ is Pareto dominated by action $a_{j}$ if $\hat q(s,a_{k})\leq\hat q(s,a_{j})$ and $\ell(s,a_{k})\leq\ell(s,a_{j})$.

###### Lemma 1.

For all Pareto dominated actions $a$ it holds that $\pi^{\star}(a|s)=0$.

###### Proof.

See appendix B.

The statement of Lemma 1 is intuitive, since choosing a Pareto dominated action lowers both the expected cumulative reward and the information gained, relative to choosing the action that dominates it. Also note that Lemma 1 implies that amongst the Pareto optimal actions it must be the case that if $\ell(s,a_{i})>\ell(s,a_{j})$ then $\hat q(s,a_{i})<\hat q(s,a_{j})$.

###### Definition 2.

Mixed Pareto dominated action: For a certain state $s$ we say that an action $a_{k}$ is mixed Pareto dominated if there exist two actions $a_{i}$ and $a_{j}$ (which satisfy $\ell(s,a_{j})<\ell(s,a_{k})<\ell(s,a_{i})$) such that:

 $\hat q(s,a_{k})<\dfrac{\big(\ell(s,a_{i})-\ell(s,a_{k})\big)\ell(s,a_{j})\hat q(s,a_{j})+\big(\ell(s,a_{k})-\ell(s,a_{j})\big)\ell(s,a_{i})\hat q(s,a_{i})}{\ell(s,a_{k})\big(\ell(s,a_{i})-\ell(s,a_{j})\big)}$ (13)
###### Definition 3.

Pareto optimal action: We define an action as Pareto optimal if it is neither Pareto dominated nor mixed Pareto dominated.

We now introduce the state dependent set of actions $\mathcal{E}_{s}$ with cardinality $|\mathcal{E}_{s}|$, which is formed by all the Pareto optimal actions corresponding to state $s$. Furthermore, we introduce the ordering functions $\sigma_{s}$ which for every state provide an ordering amongst the Pareto optimal actions from lowest uncertainty to highest (i.e., $\ell(s,\sigma(1))\leq\ell(s,\sigma(2))\leq\cdots\leq\ell(s,\sigma(|\mathcal{E}_{s}|))$). For instance, $\sigma(1)$ provides the index of the action at state $s$ which has the lowest uncertainty amongst the actions contained in $\mathcal{E}_{s}$.
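The set $\mathcal{E}_{s}$ and the ordering $\sigma$ can be computed directly from Definitions 1-3; a minimal sketch (function name is ours), where mixed dominance (13) is checked in the equivalent form "$\ell_{k}\hat q_{k}$ lies strictly below the chord between $(\ell_{j},\ell_{j}\hat q_{j})$ and $(\ell_{i},\ell_{i}\hat q_{i})$":

```python
def pareto_optimal_actions(q_hat, ell):
    """Return the Pareto optimal actions of one state, ordered as sigma
    (increasing uncertainty). q_hat[a] and ell[a] are per-action estimates."""
    A = len(q_hat)
    # Definition 1: drop actions weakly dominated in both coordinates.
    candidates = []
    for k in range(A):
        dominated = any(
            (q_hat[j] >= q_hat[k] and ell[j] >= ell[k]) and
            (q_hat[j] > q_hat[k] or ell[j] > ell[k])
            for j in range(A) if j != k
        )
        if not dominated:
            candidates.append(k)
    # Definition 2: drop a_k if (ell_k, ell_k*q_k) lies strictly below the
    # chord between two candidates with ell_j < ell_k < ell_i (inequality (13)).
    optimal = []
    for k in candidates:
        mixed = False
        for j in candidates:
            for i in candidates:
                if ell[j] < ell[k] < ell[i]:
                    lam = (ell[k] - ell[j]) / (ell[i] - ell[j])
                    chord = (1 - lam) * ell[j] * q_hat[j] + lam * ell[i] * q_hat[i]
                    if ell[k] * q_hat[k] < chord:
                        mixed = True
        if not mixed:
            optimal.append(k)
    return sorted(optimal, key=lambda a: ell[a])
```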

###### Theorem 1.

$\pi^{\star}(a|s)$ is given by:

 $\pi^{\star}(a_{j}|s)$ (14a)
 $\Delta_{\ell q}(j,s)=\dfrac{\ell_{\sigma(j)}(s)\hat q(s,\sigma(j))-\ell_{\sigma(j-1)}(s)\hat q(s,\sigma(j-1))}{\ell_{\sigma(j)}(s)-\ell_{\sigma(j-1)}(s)}$ (14b)

where to simplify notation we defined $\ell_{\sigma(j)}(s)\triangleq\ell(s,\sigma(j))$ and $\hat q_{\sigma(j)}(s)\triangleq\hat q(s,\sigma(j))$. We also set $\ell_{\sigma(0)}(s)\triangleq 0$.

###### Proof.

See Appendix C.

Note that, as expected, according to (14) the uc-optimal policy always assigns strictly positive probability to the action with the biggest uncertainty and to the action with the biggest $\hat q(s,a)$ (in cases where one action has both, then $|\mathcal{E}_{s}|=1$ and $\pi^{\star}$ becomes deterministic).

###### Lemma 2.

The value functions corresponding to policy $\pi^{\star}$ are given by:

 $v^{\star}(s)=\kappa\log\Bigg(\sum_{j=1}^{|\mathcal{E}_{s}|}\dfrac{\ell_{\sigma(j)}(s)-\ell_{\sigma(j-1)}(s)}{\ell_{\max}(s)}\exp\big[\kappa^{-1}\Delta_{\ell q}(j,s)\big]\Bigg)$ (15)
 $q^{\star}(s,a)=r(s,a)+\gamma\,\mathbb{E}_{\mathcal{P}}\,v^{\star}(\boldsymbol{s}')$ (16)

where $\ell_{\max}(s)\triangleq\max_{a}\ell(s,a)=\ell_{\sigma(|\mathcal{E}_{s}|)}(s)$.

###### Proof.

The proof follows by combining (15) with (11b) and (11c).
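For a single state, (15) can be evaluated with a numerically stable log-sum-exp; a minimal sketch (function name is ours), assuming $\ell_{\sigma(0)}(s)=0$ (so the $j=1$ term of (14b) reduces to $\hat q(s,\sigma(1))$) and inputs already ordered by $\sigma$ with strictly increasing uncertainties:

```python
import numpy as np

def v_star(q_hat_sorted, ell_sorted, kappa):
    """Evaluate v*(s) from (15) over the Pareto optimal actions of one state.

    q_hat_sorted / ell_sorted are ordered by increasing uncertainty (sigma),
    with ell_sigma(0) = 0 and ell_max(s) = ell_sorted[-1]."""
    ell = np.concatenate(([0.0], np.asarray(ell_sorted, dtype=float)))
    q = np.asarray(q_hat_sorted, dtype=float)
    q_prev = np.concatenate(([0.0], q[:-1]))
    # Delta_lq(j, s) from (14b); requires strictly increasing ell values.
    dlq = (ell[1:] * q - ell[:-1] * q_prev) / (ell[1:] - ell[:-1])
    weights = (ell[1:] - ell[:-1]) / ell[-1]
    # Numerically stable log-sum-exp of kappa * log(sum_j w_j exp(dlq_j / kappa)).
    m = np.max(dlq / kappa)
    return kappa * (m + np.log(np.sum(weights * np.exp(dlq / kappa - m))))
```

Consistently with Remark 1 below (17), as $\kappa\to 0^{+}$ the expression tends to $\max_{j}\Delta_{\ell q}(j,s)$.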

###### Remark 1.

Note that the pair satisfies two important conditions:

 $\lim_{\kappa\to 0^{+}}\big[\pi^{\star}(a|s),v^{\star}(s)\big]=\big[\pi^{\dagger}(a|s),v^{\dagger}(s)\big],\qquad\lim_{\ell_{A}\to\ell_{A-1}\to\cdots\to\ell_{1}}\big[\pi^{\star}(a|s),v^{\star}(s)\big]=\big[\pi^{\dagger}(a|s),v^{\dagger}(s)\big]$ (17)

The first condition is expected since, when the relative entropy term is eliminated, (1) and (10) become equivalent. The second condition reflects the fact that when the uncertainty is equal for all actions, the distributions $\delta^{\pi}(s)$ and $\delta^{\bullet}(s)$ become equal regardless of $\pi$; therefore $D_{\rm KL}(\delta^{\pi}(s)||\delta^{\bullet}(s))=0$, and hence (1) and (10) become equivalent. Condition (17) is of fundamental importance because it guarantees that as learning progresses and the uncertainty limits $\ell(s,a)$ diminish, policy $\pi^{\star}$ tends to the desired policy $\pi^{\dagger}$ (note that annealing of $\kappa$ is not necessary for this convergence of $\pi^{\star}$ towards $\pi^{\dagger}$).

### 3.2 Learning algorithm

Using the relations from Theorem 1 and Lemma 2 we pose the following optimization problem:

 $\min_{\omega}\;J(\omega)$ (18a)
 $J(\omega)=\frac{1}{2}\mathbb{E}_{\psi}\Bigg[\underbrace{\hat q(s,a;\omega)-r(s,a)-\gamma\kappa\,\mathbb{E}_{\mathcal{P}}\log\Bigg(\sum_{j=1}^{|\mathcal{E}_{s'}|}\frac{\ell_{\sigma(j)}(s')-\ell_{\sigma(j-1)}(s')}{\ell_{\max}(s')}\exp\big[\kappa^{-1}\Delta_{\ell q}(j,s')\big]\Bigg)}_{\triangleq\,\delta(s,a)}\Bigg]^{2}$ (18b)

where $\psi$ is the distribution according to which state-action pairs are sampled, and $\hat q(s,a;\omega)$ is a parametric approximation of $q^{\star}(s,a)$ with parameters $\omega$. Note that the gradient of (18b) with respect to $\omega$ is a product of expectations. Therefore, in the general case where transitions are stochastic, sample estimates of this gradient become biased. To bypass this issue, we use the popular duality trick sbeed ; cassano2019team ; macua2015distributed ; du2017stochastic ; FDPE_conf ; FDPE_jour as follows:

 $\frac{1}{2}\mathbb{E}_{\psi}\,\delta(s,a)^{2}=\mathbb{E}_{\psi}\max_{\rho}\Big[\rho(s,a)\delta(s,a)-\frac{1}{2}\rho(s,a)^{2}\Big]=\max_{\rho}\mathbb{E}_{\psi}\Big[\rho(s,a)\delta(s,a)-\frac{1}{2}\rho(s,a)^{2}\Big]$ (19)
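The inner maximization in (19) is a simple concave quadratic in $\rho$, maximized at $\rho=\delta$ with value $\delta^{2}/2$; the following sketch checks this numerically on synthetic error samples:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in Bellman-error samples, clipped so the grid below covers all maximizers.
deltas = np.clip(rng.normal(size=1000), -4.0, 4.0)

# Inner problem of (19): for each delta, max over rho of rho*delta - rho^2/2.
# The maximizer is rho = delta and the maximum equals delta^2 / 2, which is
# what lets the square be replaced by a maximization over a dual variable.
rho = np.linspace(-5.0, 5.0, 2001)
inner_max = np.max(deltas[:, None] * rho[None, :] - 0.5 * rho[None, :] ** 2, axis=1)
```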

Hence, minimization problem (18) becomes equivalent to the following primal-dual formulation:

 $\min_{\omega}\frac{1}{2}\mathbb{E}_{\psi}\,\delta(s,a)^{2}=\min_{\omega}\max_{\theta}\;\mathbb{E}_{\psi}\Big[\eta\Big(\rho(s,a;\theta)\delta(s,a)-\frac{1}{2}\rho(s,a;\theta)^{2}\Big)+\frac{(1-\eta)}{2}\delta(s,a)^{2}\Big]\;\triangleq\;J_{\eta}(\omega,\theta)$ (20)

where $\eta\in[0,1]$ and $\rho(s,a;\theta)$ is the parameterized version of $\rho(s,a)$ with parameters $\theta$. The gradients of (20) are given by:

 $\nabla_{\theta}J_{\eta}(\omega,\theta)=\eta\big(\delta(s,a)-\rho(s,a;\theta)\big)\nabla_{\theta}\rho(s,a;\theta)$ (21a)
 $\nabla_{\omega}J_{\eta}(\omega,\theta)=\big(\eta\rho(s,a;\theta)+(1-\eta)\delta(s,a)\big)\Bigg(\nabla_{\omega}\hat q(s,a;\omega)-\gamma\,\mathbb{E}_{\mathcal{P}}\sum_{j=1}^{|\mathcal{E}_{s'}|}\pi^{\star}(a_{j}|s')\nabla_{\omega}\hat q(s',a_{j};\omega)\Bigg)$ (21b)

Note that the parameter $\eta$ allows control of a bias-variance trade-off for the estimate of the gradient. In the particular case where the transitions of the MDP are deterministic, the optimal choice is $\eta=0$. However, as the entropy of the distribution over state transitions increases, higher values of $\eta$ become preferable. Also note that, as we explained before, to estimate $\nabla_{\nu}J_{B}(\nu)$ we need samples of the Bellman error $\delta(s,a)$. Observed transitions may be used to estimate these errors; however, using such samples has the disadvantage that the variance inherent to the rewards and state transitions does not diminish as learning progresses (and therefore neither will $\ell(s,a)$). For this reason, it is more convenient to use $\rho(s,a;\theta)$ to estimate the Bellman errors. With this clarification and using (9), (14) and (21), a new algorithm can be introduced, which we refer to as Information Seeking Learner (ISL); the detailed listing can be found in Appendix A. Notice that ISL has the following fundamental properties: it works off-policy, it is compatible with function approximation, and it specifies both the learning rule and the exploration-exploitation strategy.

## 4 Experiments

In this section we test the capability of ISL to perform deep exploration. We compare the performance of a tabular implementation of ISL against Bootstrap Q-learning in the Deep Sea game osband2017deep , which is a useful benchmark to test the exploration capabilities of RL algorithms. Implementation details and results for the stochastic versions of Deep Sea can be found in Appendix D. Figures 0(a) and 0(b) show the average regret curves, and Figure 0(c) shows the number of episodes required for the regret to drop below the dotted lines indicated in Figures 0(a) and 0(b). As can be seen in the figures, the number of episodes that ISL requires to learn the optimal policy scales at the optimal rate osband2017deep , and further it can be appreciated that the constant associated with ISL is smaller than the one corresponding to Bootstrap Q-learning.
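For reference, a minimal sketch of the Deep Sea dynamics (our simplified rendition of the benchmark in osband2017deep ; the per-step penalty scaling and fixed action mapping are assumptions of this sketch):

```python
class DeepSea:
    """Simplified Deep Sea grid of size N x N.

    The agent starts at the top-left and descends one row per step. Moving
    right costs a small penalty 0.01/N, moving left is free, and only the
    bottom-right cell pays reward 1, so dithering exploration needs
    exponentially many episodes to find the rewarding path.
    """
    def __init__(self, N):
        self.N = N

    def reset(self):
        self.row, self.col = 0, 0
        return (self.row, self.col)

    def step(self, action):  # action: 0 = left, 1 = right
        if action == 1:
            self.col = min(self.col + 1, self.N - 1)
            reward = -0.01 / self.N
        else:
            self.col = max(self.col - 1, 0)
            reward = 0.0
        self.row += 1
        done = self.row == self.N - 1
        if done and self.col == self.N - 1:
            reward += 1.0
        return (self.row, self.col), reward, done
```

Under this sketch, always moving right is the optimal policy, with return $1-(N-1)\cdot 0.01/N$.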

## Appendix B Proof of Lemma 1

We start by stating the following assumption to avoid having to use the ordering function $\sigma$ throughout the entire derivation.

###### Assumption 2.

Without loss of generality we assume that actions are numbered such that $\ell_{1}\leq\ell_{2}\leq\cdots\leq\ell_{A}$. This implies that action $a_{A}$ is the one whose Bellman error has the biggest uncertainty, while action $a_{1}$ is the one with the lowest.

We prove the lemma by contradiction. Assume that action $a_{j}$ is Pareto dominated by action $a_{i}$, and further assume that there exists a uc-optimal policy $\pi^{\square}$ for which $\pi^{\square}(a_{j}|s)>0$. We construct a policy $\pi^{\triangle}$ and show that it attains a strictly higher value of objective (12), and therefore $\pi^{\square}$ is not a uc-optimal policy.

 (22)

Note that since we assume $a_{i}$ dominates $a_{j}$, due to Assumption 2 this implies that $\ell_{j}\leq\ell_{i}$. Without loss of generality we assume $i>j$. We now proceed to show that $D_{\rm KL}(\delta^{\triangle}(s)||\delta^{\bullet}(s))\leq D_{\rm KL}(\delta^{\square}(s)||\delta^{\bullet}(s))$.

 $D_{\rm KL}\big(\delta^{\pi}(s)||\delta^{\bullet}(s)\big)=\int_{b}\sum_{a}\pi(a|s)d_{(s,a)}(b)\log\Bigg(\dfrac{\sum_{a'}\pi(a'|s)d_{(s,a')}(b)}{\sum_{a'}\pi^{\bullet}(a'|s)d_{(s,a')}(b)}\Bigg)db$ (23)
 $=\sum_{a}\pi(a|s)\int_{b}d_{(s,a)}(b)\log\Big(\sum_{a'}\pi(a'|s)d_{(s,a')}(b)\Big)db+\log(\ell_{A})$ (24)

where we used the fact that $\sum_{a'}\pi^{\bullet}(a'|s)d_{(s,a')}(b)=\ell_{A}^{-1}$ over the support of the integrand. Now using Assumption 2 and the fact that all error densities are uniform, we can write a closed form expression for the integral:

 $\int_{b}d_{(s,a_{j})}(b)\log\Big(\sum_{a'}\pi(a'|s)d_{(s,a')}(b)\Big)db=\int_{0}^{\ell_{j}}\ell_{j}^{-1}\log\Big(\sum_{a'}\pi(a'|s)d_{(s,a')}(b)\Big)db$ (25)
 $=\ell_{j}^{-1}\int_{\ell_{j-1}}^{\ell_{j}}\log\Big(\sum_{a'}\pi(a'|s)d_{(s,a')}(b)\Big)db+\ell_{j}^{-1}\int_{0}^{\ell_{j-1}}\log\Big(\sum_{a'}\pi(a'|s)d_{(s,a')}(b)\Big)db$ (26)
 $\;\;\vdots$ (27)
 $=\sum_{n=1}^{j}\dfrac{\ell_{n}-\ell_{n-1}}{\ell_{j}}\log\Bigg(\sum_{b=n}^{A}\dfrac{\pi(a_{b}|s)}{\ell_{b}}\Bigg)$ (28)

Combining (24) and (28) we get:

 $D_{\rm KL}\big(\delta^{\pi}(s)||\delta^{\bullet}(s)\big)=\sum_{k=1}^{A}\dfrac{\pi(a_{k}|s)}{\ell_{k}}\sum_{n=1}^{k}(\ell_{n}-\ell_{n-1})\log\Bigg(\sum_{b=n}^{A}\dfrac{\pi(a_{b}|s)}{\ell_{b}}\Bigg)+\log(\ell_{A})$ (29)
 $=\sum_{n=1}^{A}(\ell_{n}-\ell_{n-1})\Bigg(\sum_{k=n}^{A}\dfrac{\pi(a_{k}|s)}{\ell_{k}}\Bigg)\log\Bigg(\sum_{b=n}^{A}\dfrac{\pi(a_{b}|s)}{\ell_{b}}\Bigg)+\log(\ell_{A})$ (30)
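As a sanity check on (30), the following sketch evaluates the closed form for a two-action mixture and compares it against direct numerical integration of the KL integral (one-sided uniform error model as used in this appendix; function name is ours):

```python
import numpy as np

def kl_mixture_uniform(pi, ell):
    """Closed form (30) for D_KL(delta^pi(s) || delta^bullet(s)) when the
    state-action errors are uniform on [0, ell_a], with ell in ascending
    order and delta^bullet uniform on [0, ell_A]."""
    pi, ell = np.asarray(pi, float), np.asarray(ell, float)
    ell0 = np.concatenate(([0.0], ell))
    kl = np.log(ell[-1])
    for n in range(len(pi)):
        tail = np.sum(pi[n:] / ell[n:])  # mixture density on (ell_{n-1}, ell_n]
        kl += (ell0[n + 1] - ell0[n]) * tail * np.log(tail)
    return kl
```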

Combining (22) and (30) we can write:

 DKL(δ□(s)||δ∙(s))−log(ℓA)=A∑n=1(ℓn−ℓn−1)(A∑k=nπ□(ak|s)ℓk)log(A∑b=nπ□(b|s)ℓb) (31) =A∑n=i+1(ℓn−ℓn−1)(A∑k=nπ△(ak|s)ℓk)log(A∑b=nπ△(b|s)ℓb) +(ℓi−ℓj)(A∑k=iπ△(ak|s)ℓk+π△(aj|s)ℓi)log(A∑b=iπ△(b|s)ℓb+π△(aj|s)ℓi) +j∑n=1(ℓn−ℓn−1)⎛⎜ ⎜⎝A∑k=nk≠jπ△(ak|s)ℓk+π△(aj|s)ℓi⎞⎟ ⎟⎠log⎛⎜ ⎜⎝A∑b=nb≠jπ△(b|s)ℓb+π△(aj|s)ℓi⎞⎟ ⎟⎠ (32)

For we get:

 DKL(δ△(s)||δ∙(s))−log(ℓA)=A∑n=i+1(ℓn−ℓn−1)(A∑k=nπ△(ak|s)ℓk)log(A∑b=nπ△(b|s)ℓb) +(ℓi−ℓj)(A∑k=iπ△(ak|s)ℓk)log(A∑b=iπ△(b|s)ℓb) +j∑n=1(ℓn−ℓn−1)⎛⎜ ⎜⎝A∑k=nk≠jπ△(ak|s)ℓk+π△(aj|s)ℓj⎞⎟ ⎟⎠log⎛⎜ ⎜⎝A∑b=nb≠jπ△(b|s)ℓb+π△(aj|s)ℓj⎞⎟ ⎟⎠ (33)

For the purpose of simplifying the equations (and only for the remainder of this subsection) we define:

 $x_{n}\;\triangleq\sum_{k=n,\,k\neq j}^{A}\dfrac{\pi^{\triangle}(a_{k}|s)}{\ell_{k}}$ (34)
 $\pi_{j}\;\triangleq\;\pi^{\triangle}(a_{j}|s)$ (35)

Combining (32) and (33) we get:

 exp(DKL(δ△(s)||δ∙(s))−DKL(δ□(s)||δ∙(s)))=⎛⎜⎝xixi+πjℓi⎞⎟⎠(ℓi−ℓj)xi ⋅j∏n=1⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣(xn+πjℓj)(xn+πjℓj)(xn+πjℓi)(xn+πjℓi)⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦(ℓn−ℓn−1)(xi+πjℓi)−((ℓi−ℓj)ℓiπj) (36) =(1+πjxiℓi)−(ℓi−ℓj)xi(xi+πjℓi)−((ℓi−ℓj)ℓiπj)j∏n=1⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣(1+πjxnℓj)(xn+πjℓj)(1+πjxnℓi)(xn+πjℓi)x(ℓi−ℓjℓjℓiπj)n⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ℓn−ℓn−1 (37) ⋅j−1∏n=1⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣(1+πjxnℓj)(xn+πjℓj)(1+πjxnℓi)(xn+πjℓi)x(ℓi−ℓjℓjℓiπj)n⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ℓn−ℓn−1 (38) =(1+πjxiℓi)−(ℓi−ℓj)(xi+πjℓi)⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣(1+πjxiℓj)(xi+πjℓj)(1+πjxiℓi)(xi+πjℓi)⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ℓj⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣(1+πjxiℓj)(xi+πjℓj)(1+πjxiℓi)(xi+πjℓi)⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦−ℓj−1x−ℓj−1(ℓi−ℓjℓjℓiπj)i ⋅j−1∏n=1⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣(1+πjxnℓj)(xn+πjℓj)(1+πjxnℓi)(xn+πjℓi)x(ℓi−ℓ