ISL: Optimal Policy Learning With Optimal Exploration-Exploitation Trade-Off
Abstract
Traditionally, off-policy learning algorithms (such as Q-learning) and exploration schemes have been derived separately, with the exploration-exploitation dilemma often addressed through heuristics. In this article we show that both the learning equations and the exploration-exploitation strategy can be derived in tandem as the solution to a unique and well-posed optimization problem whose minimization leads to the optimal value function. We present a new algorithm that follows this idea. The algorithm is of the gradient type (and therefore has good convergence properties even when used in conjunction with function approximators such as neural networks); it is off-policy; and it specifies both the update equations and the strategy to address the exploration-exploitation dilemma. To the best of our knowledge, this is the first algorithm with these properties.
1 Introduction
Reinforcement learning (RL) is concerned with designing algorithms that seek to maximize long-term cumulative rewards by interacting with an environment whose dynamics are unknown. Three main features are desirable for model-free algorithms to achieve this goal efficiently: (a) high sample efficiency, (b) guaranteed convergence (even when the algorithm is used in conjunction with expressive function approximators like neural networks), and (c) the ability to perform deep exploration. Recently, algorithms based on policy gradients have been introduced with guaranteed convergence that achieve state-of-the-art results in many tasks, the most notable cases being TRPO [1], PPO [2] and A3C [3]. These algorithms have two main drawbacks: they have poor sample efficiency because they operate on-policy, and they are not capable of performing deep exploration (i.e., they tend to perform poorly in environments with sparse rewards). Another group of algorithms is based on Q-learning (like DQN [4] and DDQN [5]). These algorithms have high sample efficiency, but they also have two main drawbacks: they can diverge when used in conjunction with function approximators, and they do not address the exploration-exploitation dilemma. Hence, heuristics are typically necessary to endow these algorithms with better exploration capabilities (for example, Bootstrapped DQN [6]). More recently, a new family of algorithms has been introduced based on the idea of learning policies that maximize the long-term cumulative rewards while also maximizing their own entropy. One notable algorithm within this group is SBEED [7], which has high sample efficiency and guaranteed convergence when used with function approximators. SBEED still has the deficiency that it does not address the exploration-exploitation dilemma and hence does not perform efficient exploration.
The contribution of this work is the introduction of a novel algorithm which, to the best of our knowledge, is the first algorithm that has the three aforementioned properties (a)-(c). The main difference between our algorithm and previous work is that we use a novel cost function that makes the exploration-exploitation dilemma explicit and derives the learning rule and the exploratory strategy in tandem as the solution to a unique optimization problem.
1.1 Relation to prior work
Our paper is most closely related to recent work on maximum-entropy algorithms. Some of the most prominent algorithms in this area are G-learning [8], soft Q-learning [9], PCL [10], SAC [11], Trust-PCL [12], and SBEED [7]. All of these algorithms augment the traditional RL objective with a term that maximizes the entropy of the learned policy, weighted by a temperature parameter. The consequence of using this augmented objective is twofold. First, it allows the derivation of convergent off-policy algorithms (even when used with function approximators). Second, it improves the exploration properties of the algorithms. However, using this augmented objective has two main drawbacks. In the first place, the policy to which these algorithms converge is biased away from the true optimal policy. This can be handled by annealing the temperature parameter, but annealing can slow down convergence; furthermore, it is unclear what the optimal schedule is for such annealing and how it affects the conditioning of the optimization problem. In the second place, even though exploration is improved, algorithms derived from this modified cost are not efficient at performing deep exploration, because a single temperature parameter is used for all states. To perform deep exploration, it is necessary to have a scheme that allows agents to learn policies that exploit more in states where the agent has high confidence in the optimal action and act in a more exploratory manner in unfamiliar states. The main difference between our approach and these works is that we augment the traditional RL objective with a term that makes the exploration-exploitation trade-off explicit, rather than with the policy's entropy. Under our scheme, agents converge to the true optimal policy without the need for annealing any parameters and, moreover, the derived exploration strategy is capable of performing deep exploration.
2 Preliminaries
We consider the problem of policy optimization within the traditional reinforcement learning framework. We model our setting as a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S},\mathcal{A},\mathcal{P},r)$, where $\mathcal{S}$ is a set of states of size $S$, $\mathcal{A}$ is a set of actions of size $A$, $\mathcal{P}(s'|s,a)$ specifies the probability of transitioning to state $s'$ from state $s$ having taken action $a$, and $r(s,a,s')$ is the average reward when the agent transitions to state $s'$ from state $s$ having taken action $a$.
Assumption 1.
We assume the magnitude of the rewards is uniformly bounded, i.e., $|r(s,a,s')| \le r_{\max}$ for all $(s,a,s')$.
In this work we consider the maximization of the discounted infinite-horizon reward as the objective of the RL agent:

(1) $\pi^{\star} = \operatorname*{arg\,max}_{\pi}\; \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(\boldsymbol{s}_t,\boldsymbol{a}_t,\boldsymbol{s}_{t+1})\right]$
where $\pi^{\star}$ is the optimal policy, $\gamma \in [0,1)$ is the discount factor, and $\boldsymbol{s}_t$ and $\boldsymbol{a}_t$ are the state and action at time $t$, respectively. We clarify that in this work random variables are always denoted in bold font. We recall that each policy $\pi$ has an associated state value function $v^{\pi}(s)$ and state-action value function $q^{\pi}(s,a)$ given by:¹

¹In this paper we will refer to both $v^{\pi}$ and $q^{\pi}$ as value functions indistinctly.
(2a) $v^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(\boldsymbol{s}_t,\boldsymbol{a}_t,\boldsymbol{s}_{t+1}) \,\middle|\, \boldsymbol{s}_0 = s\right]$

(2b) $q^{\pi}(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(\boldsymbol{s}_t,\boldsymbol{a}_t,\boldsymbol{s}_{t+1}) \,\middle|\, \boldsymbol{s}_0 = s,\ \boldsymbol{a}_0 = a\right]$
It is wellknown that the optimal value functions satisfy the following fixed point equations puterman :
(3a) $v^{\star}(s) = \max_{a}\left( r(s,a) + \gamma \sum_{s'} \mathcal{P}(s'|s,a)\, v^{\star}(s') \right)$

(3b) $q^{\star}(s,a) = r(s,a) + \gamma \sum_{s'} \mathcal{P}(s'|s,a)\, \max_{a'} q^{\star}(s',a')$

where for convenience we defined $r(s,a) = \sum_{s'} \mathcal{P}(s'|s,a)\, r(s,a,s')$.
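To make the fixed-point relations (3) concrete, the following minimal sketch solves them by fixed-point iteration on a small tabular MDP (a planning setting where $\mathcal{P}$ and $r$ are known; the function name and array layout are our own illustration, not part of the paper):

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Solve the Bellman optimality equations (3) by fixed-point iteration.

    P: transition tensor of shape (S, A, S); r: expected rewards, shape (S, A).
    Returns the optimal state values v* and a greedy (deterministic) policy.
    """
    S, A, _ = P.shape
    v = np.zeros(S)
    while True:
        # q*(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) v*(s')
        q = r + gamma * (P @ v)        # shape (S, A)
        v_new = q.max(axis=1)          # v*(s) = max_a q*(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new
```

For deterministic two-state examples, the iteration converges geometrically at rate $\gamma$, mirroring the contraction property that makes (3) well posed.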
3 Algorithm derivation
Optimization problem (1) and relations (2) and (3) are useful for deriving algorithms for planning problems (i.e., problems in which the reward function and transition kernel are known) but are unfit for deriving RL algorithms, because they ignore the fact that the agent relies on estimated quantities (which are subject to uncertainty). Hence, in this work, we modify (1) to reflect the fact that an RL agent is constrained by the uncertainty of its estimates. We change the goal of the agent to not just maximize the discounted cumulative rewards but also to minimize the uncertainty of its estimated quantities. For this purpose, we assume that at any point in time the agent has some estimate $\hat q$ of the optimal value function, which is subject to some uncertainty. We quantify this uncertainty through the state-action Bellman error $\boldsymbol{\delta}(s,a)$ and model it in a Bayesian manner. More specifically, we assume $\boldsymbol{\delta}(s,a)$ follows a uniform probability distribution with zero mean:
(4) $\boldsymbol{\delta}(s,a) \sim \mathcal{U}\big[-\delta(s,a),\ \delta(s,a)\big]$
We will refer to the probability density function of $\boldsymbol{\delta}(s,a)$ as its error density. We assume zero-mean uniform distributions for the following reasons:

Zero mean: if the mean were different from zero, it could be subtracted from the estimate, resulting in a new estimate for which the state-action Bellman error would be zero mean.

Uniform distribution: under Assumption 1, we know that for any infinitely discounted MDP a symmetric bound for the state-action Bellman error exists.²

²This is due to the fact that the value functions are lower and upper bounded by $-r_{\max}/(1-\gamma)$ and $r_{\max}/(1-\gamma)$, and hence we know $|\boldsymbol{\delta}(s,a)| \le 2 r_{\max}/(1-\gamma)$. Moreover, typically there is no prior information about the error distribution between these bounds, and therefore a noninformative uniform distribution with limit $\delta(s,a)$ becomes appropriate.
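The bound invoked in the footnote can be sketched as follows (assuming, per Assumption 1, that rewards are bounded in magnitude by $r_{\max}$ and that the estimate $\hat q$ respects the same bounds as the true value function):

```latex
% Bounded rewards give bounded values:
|q(s,a)| \;\le\; \sum_{t=0}^{\infty} \gamma^{t} r_{\max} \;=\; \frac{r_{\max}}{1-\gamma}.
% Hence, for any estimate \hat{q} within the same bounds, the
% state-action Bellman error satisfies the symmetric bound
|\boldsymbol{\delta}(s,a)|
  \;\le\; |\hat{q}(s,a)| + r_{\max} + \gamma \max_{a'} |\hat{q}(s',a')|
  \;\le\; \frac{r_{\max}}{1-\gamma} + r_{\max} + \frac{\gamma\, r_{\max}}{1-\gamma}
  \;=\; \frac{2\, r_{\max}}{1-\gamma}.
```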
We further define the state Bellman error, whose distribution is given by a mixture of the state-action error distributions:
(5) 
Note that the Bayesian update for is given by:
(6) 
where the $i$-th sampled error is used to refine the uncertainty estimate. Note that the Bayesian update (6) assumes a stationary distribution. However, as the agent updates its estimate $\hat q$, the error distributions will change over time. For this reason, we modify (6) to endow it with tracking capabilities:
(7) 
Note that when a sampled error is bigger than its corresponding uncertainty estimate, the update equations (6) and (7) coincide. Update equation (7) is in tabular form. However, in practical applications it is typically necessary to parameterize the uncertainty estimates to reduce the dimensionality of the learning problem. To obtain an update equation for these parameters, we define the following optimization problem:
(8) 
where state-action pairs are sampled according to some given distribution. The gradient of (8) (which can be used to update the parameters) is given by:
(9) 
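As an illustration of the gradient update implied by (9), the following sketch performs one stochastic gradient step for a linear parameterization of the uncertainty estimate. The linear model, the names `omega` and `phi`, and the squared-error surrogate are our own assumptions for illustration; the paper leaves the parameterization general:

```python
import numpy as np

def uncertainty_grad_step(omega, phi, delta_target, lr=0.1):
    """One SGD step fitting a linear uncertainty model
    delta_omega(s, a) = omega . phi(s, a) to a sampled target,
    i.e. minimizing 0.5 * (delta_omega(s, a) - target)^2 as a
    surrogate for the objective in (8).

    omega: parameter vector; phi: feature vector of the sampled
    state-action pair; delta_target: tracked uncertainty target
    produced by an update of the form (7).
    """
    prediction = phi @ omega
    gradient = (prediction - delta_target) * phi  # chain rule on the square
    return omega - lr * gradient
```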
3.1 Optimization problem
We now define the optimization problem for our RL agent to be:
(10) 
where the second term penalizes, through the Kullback-Leibler divergence (or relative entropy), the mismatch with the Bellman state error distribution corresponding to some policy that seeks to maximize the cumulative discounted information. In this work we refer to the solution of (10) as the uncertainty-constrained optimal policy (or uc-optimal policy). Under this new objective, we redefine the value functions as:
(11a)  
(11b)  
(11c) 
Using (11) we can rewrite (10) as:
(12) 
Note that the exploration-exploitation trade-off becomes explicit in our cost function. To maximize the first term of the summation, the agent has to exploit its knowledge of the value estimates, while to maximize the second term, the agent's policy needs to match the information-seeking policy, which maximizes the information gathered through exploration. Since the argument being maximized in (12) is differentiable with respect to the policy, we can obtain a closed-form expression for the uc-optimal policy. Before providing this closed-form solution, we introduce the following useful lemma and definitions.
Definition 1.
Pareto dominated action: For a certain state $s$, we say that an action $a$ is Pareto dominated by action $b$ if $q(s,a) \le q(s,b)$ and $\delta(s,a) \le \delta(s,b)$, with at least one inequality strict.
Lemma 1.
For all Pareto dominated actions, the uc-optimal policy assigns zero probability.
Proof.
See appendix B.
The statement of Lemma 1 is intuitive, since choosing a Pareto dominated action lowers both the expected cumulative reward and the information gained, relative to choosing the action that dominates it. Also note that Lemma 1 implies that among Pareto optimal actions, higher uncertainty must be accompanied by a lower value estimate: if $\delta(s,a) > \delta(s,b)$ then $q(s,a) < q(s,b)$.
Definition 2.
Mixed Pareto dominated action: For a certain state, we say that an action is mixed Pareto dominated if there exist two actions $b$ and $c$ such that:
(13) 
Definition 3.
Pareto optimal action: We define an action as Pareto optimal if it is not Pareto dominated or mixed Pareto dominated.
We now introduce the state-dependent set formed by all the Pareto optimal actions corresponding to state $s$. Furthermore, we introduce ordering functions which, for every state, provide an ordering of the Pareto optimal actions from lowest uncertainty to highest. For instance, the first ordering function provides the index of the action at state $s$ with the lowest uncertainty among the Pareto optimal actions.
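A small sketch of how the Pareto optimal set might be computed from the current estimates. The dominance test follows our reading of Definition 1 (weak inequalities with at least one strict), and mixed Pareto dominance (Definition 2) is omitted for brevity:

```python
def pareto_optimal_actions(q, delta):
    """Return indices of actions not Pareto dominated (Definition 1):
    action a is dominated by b when q[b] >= q[a] and delta[b] >= delta[a],
    with at least one of the inequalities strict.

    q: per-action value estimates; delta: per-action uncertainty limits.
    Hedged sketch: removal of mixed Pareto dominated actions (those under
    the upper concave envelope of the front) is not implemented here.
    """
    n = len(q)
    keep = []
    for a in range(n):
        dominated = any(
            q[b] >= q[a] and delta[b] >= delta[a]
            and (q[b] > q[a] or delta[b] > delta[a])
            for b in range(n) if b != a
        )
        if not dominated:
            keep.append(a)
    return keep
```

Note that the surviving actions, sorted by increasing uncertainty, have decreasing value estimates, which is exactly the ordering exploited above.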
Theorem 1.
is given by:
(14a)  
(14b) 
where we introduced shorthand notation to simplify the expressions.
Proof.
See Appendix C.
Note that, as expected, according to (14) the uc-optimal policy always assigns strictly positive probability to the actions that have the biggest uncertainty and the biggest value estimate (in cases where one action has both, the policy concentrates on that action).
Lemma 2.
The value function corresponding to policy is given by:
(15)  
(16) 
where the shorthand from Theorem 1 is used.
Remark 1.
Note that the pair satisfies two important conditions:
(17) 
The first condition is expected since, when the relative entropy term is eliminated, (1) and (10) become equivalent. The second condition reflects the fact that when the uncertainty is equal across all actions, the exploratory and learned distributions become equal, and hence (1) and (10) again become equivalent. Condition (17) is of fundamental importance because it guarantees that, as learning progresses and the uncertainty limits diminish, the learned policy tends to the desired optimal policy (note that no annealing is necessary for this convergence).
3.2 Learning algorithm
Using the relations from Theorem 1 and Lemma 2 we pose the following optimization problem:
(18a)  
(18b) 
where the state-action pairs are sampled according to some given distribution, and a parametric approximation of the value function is used. Note that the gradient of this objective is a product of expectations. Therefore, in the general case where transitions are stochastic, sample estimates of this gradient become biased. To bypass this issue, we use the popular duality trick [7, 14, 15, 16, 17, 18] as follows:
(19) 
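The duality trick referenced above is, in its basic form, the Fenchel conjugate of the square function; a sketch of the identity (in our notation, not the paper's):

```latex
% For any scalar x:  x^2 = \max_{\nu} \left( 2\nu x - \nu^2 \right),
% attained at \nu^{\star} = x.  Applied to a squared expectation,
\big(\mathbb{E}[\boldsymbol{x}]\big)^2
  \;=\; \max_{\nu} \Big( 2\nu\,\mathbb{E}[\boldsymbol{x}] \;-\; \nu^2 \Big),
% which is linear in \mathbb{E}[\boldsymbol{x}], so single-sample
% stochastic gradients of the inner objective are unbiased, at the
% cost of a saddle-point (primal-dual) formulation as in (20).
```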
Hence, minimization problem (18) becomes equivalent to the following primaldual formulation:
(20) 
where the dual function is likewise parameterized. The gradients of (20) are given by:
(21a)  
(21b) 
Note that the trade-off parameter allows control of a variance-bias trade-off for the estimate of the gradient. In the particular case where the transitions of the MDP are deterministic, one extreme of this trade-off is optimal; however, as the entropy of the distribution over state transitions increases, moving toward the other extreme becomes preferable. Also note that, as explained before, estimating the state Bellman error requires sampling the state-action errors. Observed transitions may be used to estimate them; however, using these samples has the disadvantage that the variance inherent in the rewards and state transitions does not diminish as learning progresses (and therefore neither will the uncertainty estimates). For this reason, it is more convenient to use the learned estimates to compute the Bellman errors. With this clarification, and using (9), (14) and (21), a new algorithm can be introduced, which we refer to as Information Seeking Learner (ISL); a detailed listing can be found in Appendix A. Notice that ISL has the following fundamental properties: it works off-policy, it is compatible with function approximation, and it specifies both the learning rule and the exploration-exploitation strategy.
4 Experiments
In this section we test the capability of ISL to perform deep exploration. We compare the performance of a tabular implementation of ISL and Bootstrap Q-learning on the Deep Sea game [19], which is a useful benchmark for testing the exploration capabilities of RL algorithms. Implementation details and results for the stochastic versions of Deep Sea can be found in Appendix D. Figures 1(a) and 1(b) show the average regret curves, and figure 1(c) shows the number of episodes required for the regret to drop below the dotted lines indicated in figures 1(a) and 1(b). As can be seen in the figures, the number of episodes required to learn the optimal policy scales optimally [19]; furthermore, the constant associated with ISL is smaller than the one corresponding to Bootstrap Q-learning.
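For reference, a minimal deterministic version of the Deep Sea environment of [19] can be sketched as follows. The class name, the reward scaling, and the omission of the per-cell action randomization used in the original benchmark are our own simplifications:

```python
class DeepSea:
    """Minimal deterministic Deep Sea sketch: an N x N grid where the
    agent falls one row per step, choosing left or right. Only the
    rightmost path earns the final reward, and each 'right' move pays
    a small cost, so undirected (dithering) exploration needs a number
    of episodes exponential in N, while deep exploration does not.
    """
    def __init__(self, size):
        self.size = size
        self.reset()

    def reset(self):
        self.row, self.col = 0, 0
        return (self.row, self.col)

    def step(self, action):
        # action: 0 = left, 1 = right
        reward = -0.01 / self.size if action == 1 else 0.0
        if action == 1:
            self.col = min(self.col + 1, self.size - 1)
        else:
            self.col = max(self.col - 1, 0)
        self.row += 1
        done = self.row == self.size
        if done and self.col == self.size - 1:
            reward += 1.0  # treasure in the bottom-right corner
        return (self.row, self.col), reward, done
```

Always moving right for N steps yields a return of $1 - 0.01$ under this scaling, while any episode that ever moves left forfeits the treasure.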
References
 (1) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, New York, USA, 2015, pp. 1889–1897.
 (2) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv:1707.06347, August 2017.
 (3) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in Proc. International Conference on Machine Learning, New York, USA, 2016, pp. 1928–1937.
 (4) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 (5) H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI Conference on Artificial Intelligence, Arizona, USA, 2016.
 (6) I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, “Deep exploration via bootstrapped DQN,” in Proc. Advances in Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 4026–4034.
 (7) B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, “SBEED: Convergent reinforcement learning with nonlinear function approximation,” in Proc. International Conference on Machine Learning, Stockholm, Sweden, 2018, pp. 1133–1142.
 (8) R. Fox, A. Pakman, and N. Tishby, “Taming the noise in reinforcement learning via soft updates,” in Proc. Conference on Uncertainty in Artificial Intelligence, New York, USA, 2016, pp. 202–211.
 (9) T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” in Proc. International Conference on Machine Learning, Sydney, Australia, 2017, pp. 1352–1361.
 (10) O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the gap between value and policy based reinforcement learning,” in Proc. Advances in Neural Information Processing Systems, Long Beach, USA, 2017, pp. 2775–2785.
 (11) T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. International Conference on Machine Learning, Stockholm, Sweden, 2018, pp. 1856–1865.
 (12) O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Trust-PCL: An off-policy trust region method for continuous control,” arXiv:1707.01891, February 2018.
 (13) M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, NY, 2014.
 (14) L. Cassano, S. A. Alghunaim, and A. H. Sayed, “Team policy learning for multiagent reinforcement learning,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 3062–3066.
 (15) S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed, “Distributed policy evaluation under multiple behavior strategies,” IEEE Transactions on Automatic Control, vol. 60, no. 5, pp. 1260–1274, 2015.
 (16) S. S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou, “Stochastic variance reduction methods for policy evaluation,” in Proc. International Conference on Machine Learning, Sydney, Australia, 2017, pp. 1049–1058.
 (17) L. Cassano, K. Yuan, and A. H. Sayed, “Distributed valuefunction learning with linear convergence rates,” in Proc. of European Control Conference, Napoli, Italy, 2019, pp. 505–511.
 (18) L. Cassano, K. Yuan, and A. H. Sayed, “Multi-agent fully decentralized value function learning with linear convergence rates,” arXiv:1810.07792, October 2018.
 (19) I. Osband, B. Van Roy, D. Russo, and Z. Wen, “Deep exploration via randomized value functions,” arXiv:1703.07608, March 2019.
 (20) I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepesvari, S. Singh, B. Van Roy, R. Sutton, D. Silver, and H. Van Hasselt, “Behaviour suite for reinforcement learning,” arXiv:1908.03568, August 2019.
Appendix A Detailed listing of Information Seeking Learner (ISL)
ISL with replay buffer.
Initialize:
the value and uncertainty estimates randomly for all state-action pairs, and an empty replay buffer.
For each episode do:
For each environment transition do:
Sample transitions by following policy (14) and store them in the replay buffer.
For each optimization iteration do:
Arrange transitions from the replay buffer into minibatches:
For each minibatch do:
Appendix B Proof of Lemma 1
We start by stating the following assumption to avoid carrying the ordering functions through the entire derivation.
Assumption 2.
Without loss of generality, we assume that actions are numbered in order of decreasing uncertainty. This implies that the first action is the one whose Bellman error has the biggest uncertainty, while the last action is the one with the lowest.
We prove the lemma by contradiction. Assume that one action is Pareto dominated by another, and further assume that there exists an optimal policy that assigns strictly positive probability to the dominated action. We construct a perturbed policy and show that it attains a strictly larger objective, and therefore the original policy is not optimal.
(22) 
Note that since we assume one action dominates the other, Assumption 2 fixes the relative ordering of their indices; without loss of generality we adopt this ordering. We now proceed to show that the perturbed policy attains a strictly larger objective.
(23)  
(24) 
where the marked step uses the construction of the perturbed policy. Now, using Assumption 2 and the fact that all error densities are uniform, we can write a closed-form expression for the integral.
(25)  
(26)  
(27)  
(28) 
Combining (24) and (28) we get:
(29)  
(30) 
Combining (22) and (30) we can write:
(31)  
(32) 
For we get:
(33) 
For the purpose of simplifying the equations (and only for the remainder of this subsection) we define:
(34)  
(35) 
Combining (32) and (35) we get:
(36)  
(37)  
(38)  