We introduce exploration potential, a quantity that measures how much a reinforcement learning agent has explored its environment class. In contrast to information gain, exploration potential takes the problem’s reward structure into account. This leads to an exploration criterion that is both necessary and sufficient for asymptotic optimality (learning to act optimally across the entire environment class). Our experiments in multi-armed bandits use exploration potential to illustrate how different algorithms make the tradeoff between exploration and exploitation.
Finding good exploration strategies is currently a major obstacle for reinforcement learning (RL). The state of the art in deep RL (Mnih et al., 2015, 2016) relies on $\varepsilon$-greedy policies: in every time step, the agent takes a random action with some probability. Yet $\varepsilon$-greedy is a poor exploration strategy, and for environments with sparse rewards (for example the Atari game ‘Montezuma’s Revenge’) it is quite ineffective: it simply takes too long until the agent random-walks into the first reward.
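As a point of reference, the $\varepsilon$-greedy rule described above can be sketched in a few lines. This is a minimal illustration, not the implementation used by the cited deep RL systems; the function name `epsilon_greedy` and the list of action-value estimates `q_values` are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: uniformly random with probability epsilon,
    otherwise the action with the highest estimated value.

    `q_values` is a hypothetical list of action-value estimates.
    """
    if rng.random() < epsilon:
        # Explore: ignore the estimates entirely.
        return rng.randrange(len(q_values))
    # Exploit: ties are broken in favour of the lowest index.
    return q_values.index(max(q_values))
```

Note that the exploration step ignores both the reward structure and the agent's uncertainty, which is why it can take very long to reach the first reward in sparse-reward environments.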
More sophisticated exploration strategies have been proposed: using information gain about the environment (Sun et al., 2011; Orseau et al., 2013; Houthooft et al., 2016) or pseudo-count (Bellemare et al., 2016). In practice, these exploration strategies are employed by adding an exploration bonus (‘intrinsic motivation’) to the reward signal (Schmidhuber, 2010). While the methods above require the agent to have a model of its environment and formalize the strategy ‘explore by going to where the model has high uncertainty,’ there are also model-free strategies like the automatic discovery of options proposed by Machado and Bowling (2016). However, none of these explicit exploration strategies take the problem’s reward structure into account. Intuitively, we want to explore more in parts of the environment where the reward is high and less where it is low. This is readily exposed in optimistic policies like UCRL (Jaksch et al., 2010) and stochastic policies like PSRL (Strens, 2000), but these do not make the exploration/exploitation tradeoff explicitly.
In this paper, we propose exploration potential, a quantity that measures reward-directed exploration. We consider model-based reinforcement learning in partially or fully observable domains. Informally, exploration potential is the Bayes-expected absolute deviation of the value of optimal policies. Exploration potential is similar to information gain about the environment, but explicitly takes the problem’s reward structure into account. We show that this leads to an exploration criterion that is both necessary and sufficient for asymptotic optimality (learning to act optimally across an environment class): a reinforcement learning agent learns to act optimally in the limit if and only if the exploration potential converges to $0$. As such, exploration potential captures the essence of what it means to ‘explore the right amount’.
Another exploration quantity that is both necessary and sufficient for asymptotic optimality is information gain about the optimal policy (Russo and Van Roy, 2014; Reddy et al., 2016). In contrast to exploration potential, it is not measured on the scale of rewards, making an explicit value-of-information tradeoff more difficult.
For example, consider a 3-armed Gaussian bandit problem with means , , and . The information content is identical in every arm. Hence an exploration strategy based on maximizing information gain about the environment would query the third arm, which is easily identifiable as suboptimal, too frequently (linearly versus logarithmically). This arm provides information, but this information is not very useful for solving the reinforcement learning task. In contrast, an exploration potential based exploration strategy concentrates its exploration on the first two arms.
2 Preliminaries and Notation
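A reinforcement learning agent interacts with an environment in cycles: at time step $t$ the agent chooses an action $a_t$ and receives a percept $e_t = (o_t, r_t)$ consisting of an observation $o_t$ and a reward $r_t \in [0, 1]$; the cycle then repeats for $t + 1$. We use $h_{<t}$ to denote a history of length $t - 1$. With abuse of notation, we treat histories both as outcomes and as random variables.
A policy $\pi$ is a function mapping a history $h_{<t}$ and an action $a$ to the probability $\pi(a \mid h_{<t})$ of taking action $a$ after seeing history $h_{<t}$. An environment $\nu$ is a function mapping a history $h_{<t} a_t$ to the probability $\nu(e_t \mid h_{<t} a_t)$ of generating percept $e_t$ after this history $h_{<t} a_t$. A policy $\pi$ and an environment $\nu$ generate a probability measure $\nu^\pi$ over infinite histories; the expectation over this measure is denoted with $\mathbb{E}^\pi_\nu$. The value of a policy $\pi$ in an environment $\nu$ given history $h_{<t}$ is defined as
$$V^\pi_\nu(h_{<t}) := (1 - \gamma)\, \mathbb{E}^\pi_\nu\!\left[ \sum_{k=t}^{\infty} \gamma^{k-t} r_k \,\middle|\, h_{<t} \right],$$
where $\gamma \in (0, 1)$ is the discount factor. The optimal value is defined as $V^*_\nu(h_{<t}) := \sup_\pi V^\pi_\nu(h_{<t})$, and the optimal policy is $\pi^*_\nu := \arg\max_\pi V^\pi_\nu$. We use $\mu$ to denote the true environment.
We assume the nonparametric setting: let $\mathcal{M}$ denote a countable class of environments containing the true environment $\mu$. Let $w$ be a prior probability distribution on $\mathcal{M}$. After observing the history $h_{<t}$ the prior $w$ is updated to the posterior $w(\nu \mid h_{<t})$. A policy $\pi$ is asymptotically optimal in mean iff for every $\mu \in \mathcal{M}$, $\mathbb{E}^\pi_\mu\!\left[ V^*_\mu(h_{<t}) - V^\pi_\mu(h_{<t}) \right] \to 0$ as $t \to \infty$.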
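To make the posterior update concrete, the following sketch applies Bayes' rule to a finite class of Bernoulli bandit environments. The representation of an environment as a tuple of arm means and the helper names `bernoulli_likelihood` and `posterior` are assumptions made for illustration only; the setting in the text allows any countable class of (possibly partially observable) environments.

```python
def bernoulli_likelihood(theta, history):
    """Probability of the observed rewards under arm means `theta`.

    `history` is a list of (arm, reward) pairs with reward in {0, 1}.
    The policy's action probabilities are omitted because they are the
    same for every environment and cancel in the posterior.
    """
    likelihood = 1.0
    for arm, reward in history:
        likelihood *= theta[arm] if reward == 1 else 1.0 - theta[arm]
    return likelihood


def posterior(prior, environments, history):
    """Bayes' rule: posterior weight is proportional to prior times likelihood."""
    joint = [w * bernoulli_likelihood(theta, history)
             for w, theta in zip(prior, environments)]
    normalizer = sum(joint)
    if normalizer == 0.0:
        raise ValueError("history has probability zero under every environment")
    return [j / normalizer for j in joint]


# Two candidate environments, one observed success on arm 0.
weights = posterior([0.5, 0.5], [(0.9, 0.1), (0.1, 0.9)], [(0, 1)])
```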
3 Exploration Potential
We consider model-based reinforcement learning where the agent learns a model of its environment. With this model, we can estimate the value of any candidate policy. Concretely, let $\hat V^\pi_t$ denote our estimate of the value of the policy $\pi$ at time step $t$. We assume that the agent’s learning algorithm satisfies on-policy value convergence (OPVC):
$$V^\pi_\mu(h_{<t}) - \hat V^\pi_t(h_{<t}) \to 0 \quad \text{as } t \to \infty. \tag{1}$$
This does not imply that our model of the environment converges to the truth, only that we learn to predict the value of the policy that we are following. On-policy value convergence does not require that we learn to predict off-policy, i.e., the value of other policies. In particular, we might not learn to predict the value of the $\mu$-optimal policy $\pi^*_\mu$.
For example, a Bayesian mixture or an MDL-based estimator both satisfy OPVC if the true environment is in the environment class; for more details, see Leike (2016, Sec. 4.2.3).
We define the $\hat V$-greedy policy as $\pi^*_{\hat V} := \arg\max_\pi \hat V^\pi_t$.
[Exploration Potential] Let $\mathcal{M}$ be a class of environments and let $h_{<t}$ be a history. The exploration potential is defined as
$$\mathrm{EP}_{\mathcal{M}}(h_{<t}) := \sum_{\nu \in \mathcal{M}} w(\nu \mid h_{<t}) \left| V^*_\nu(h_{<t}) - \hat V^{\pi^*_\nu}_t(h_{<t}) \right|.$$
Intuitively, $\mathrm{EP}$ captures the amount of exploration that is still required before having learned the entire environment class. Asymptotically the posterior concentrates around environments that are compatible with the current environment. $\mathrm{EP}$ then quantifies how well the model understands the value of the compatible environments’ optimal policies.
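The informal description (the Bayes-expected absolute deviation of the value of optimal policies) can be illustrated on a finite class. The sketch below assumes exploration potential is the posterior-weighted sum of $|V^*_\nu - \hat V^{\pi^*_\nu}_t|$ over environments $\nu$, and it assumes a two-armed Bernoulli bandit class in which the model's value estimate for "always pull arm $j$" is the Bayes-mean of arm $j$. All names and numbers are hypothetical.

```python
def exploration_potential(posterior, optimal_values, estimated_values):
    """Posterior-weighted absolute gap between each environment's optimal
    value and the model's estimate of that environment's optimal policy."""
    if not (len(posterior) == len(optimal_values) == len(estimated_values)):
        raise ValueError("all inputs need one entry per environment")
    return sum(w * abs(v_star - v_hat)
               for w, v_star, v_hat in zip(posterior, optimal_values, estimated_values))


# Toy class: two bandit environments, given as tuples of arm means.
environments = [(0.9, 0.1), (0.1, 0.9)]
posterior_weights = [0.9, 0.1]

# Assumed model: the Bayes-mean of each arm under the posterior.
bayes_mean = [sum(w * theta[i] for w, theta in zip(posterior_weights, environments))
              for i in range(2)]
# Each environment's optimal policy always pulls its best arm.
best_arm = [max(range(2), key=theta.__getitem__) for theta in environments]
optimal_values = [theta[j] for theta, j in zip(environments, best_arm)]
estimated_values = [bayes_mean[j] for j in best_arm]

ep = exploration_potential(posterior_weights, optimal_values, estimated_values)
```

In this toy case the unlikely second environment contributes as much to the total as the likely first one, because the model badly underestimates the value of its optimal arm.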
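[Properties of $\mathrm{EP}$]
(i) $\mathrm{EP}_{\mathcal{M}}$ depends neither on the true environment $\mu$, nor on the agent’s policy $\pi$.
(ii) $\mathrm{EP}_{\mathcal{M}}$ depends on the choice of the prior $w$ and on the agent’s model of the world $\hat V$.
(iii) $0 \le \mathrm{EP}_{\mathcal{M}}(h_{<t}) \le 1$ for all histories $h_{<t}$.
The last item follows from the fact that the posterior and the value function are bounded between $0$ and $1$.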
[Bound on Optimality] For all ,
The bound of subsection 3.2 is to be understood as follows.
If we switch to the greedy policy $\pi^*_{\hat V}$, then $\hat V^{\pi^*_{\hat V}}_t - V^{\pi^*_{\hat V}}_\mu \to 0$ due to on-policy value convergence (1). This reflects how well the agent learned the environment’s response to the Bayes-optimal policy. Generally, following the greedy policy does not yield enough exploration for $\mathrm{EP}$ to converge to $0$. In order to get a policy that is asymptotically optimal, we have to start with an exploration policy which ensures that $\mathrm{EP} \to 0$ and then gradually phase out exploration by switching to the $\hat V$-greedy policy. Because of property (i), the agent can compute its current $\mathrm{EP}$-value and thus check how close it is to $0$. The higher the prior belief in the true environment $\mu$, the smaller this value will be (in expectation).
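To see why a bound of this kind is plausible, here is a short derivation sketch. It assumes that exploration potential is the posterior-weighted sum of $|V^*_\nu - \hat V^{\pi^*_\nu}_t|$ over $\nu \in \mathcal{M}$ and that the true environment has positive posterior weight $w(\mu \mid h_{<t}) > 0$; the notation is an assumption, and the resulting inequality may differ in form from the bound stated above.

```latex
% Step 1: every summand of EP is nonnegative, so the term for the true
% environment \mu is at most the whole sum.
% Step 2: divide by w(\mu | h_{<t}) > 0 and maximize over policies.
\begin{align*}
w(\mu \mid h_{<t}) \left| V^*_\mu(h_{<t}) - \hat V^{\pi^*_\mu}_t(h_{<t}) \right|
  &\le \sum_{\nu \in \mathcal{M}} w(\nu \mid h_{<t})
       \left| V^*_\nu(h_{<t}) - \hat V^{\pi^*_\nu}_t(h_{<t}) \right|
   = \mathrm{EP}_{\mathcal{M}}(h_{<t}), \\
V^*_\mu(h_{<t})
  &\le \hat V^{\pi^*_\mu}_t(h_{<t})
       + \frac{\mathrm{EP}_{\mathcal{M}}(h_{<t})}{w(\mu \mid h_{<t})}
   \le \sup_\pi \hat V^{\pi}_t(h_{<t})
       + \frac{\mathrm{EP}_{\mathcal{M}}(h_{<t})}{w(\mu \mid h_{<t})}.
\end{align*}
```

Under these assumptions, once exploration potential is small relative to the posterior weight of the true environment, the estimated value of the greedy policy cannot fall far below the true optimal value.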
[Policy Convergence] Let and be two policies. We say the policy converges to in -probability iff as in .
We assume that is continuous in the policy argument. If converges to in total variation in the sense that for all actions and , then converges to in .
[Strongly Unique Optimal Policy] An environment admits a strongly unique optimal policy iff there is a -optimal policy such that for all policies if
then converges to in .
Assuming that is continuous in , an environment has a unique optimal policy if there are no ties in . Admitting a strongly unique optimal policy is an even stronger requirement because it requires that there exist no other policies that approach the optimal value asymptotically but take different actions (i.e., there is a constant gap in the argmax). For any finite-state (PO)MDP with a unique optimal policy that policy is also strongly unique.
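[Asymptotic Optimality $\Rightarrow$ $\mathrm{EP} \to 0$] If the policy $\pi$ is asymptotically optimal in mean in the environment class $\mathcal{M}$ and each environment $\nu \in \mathcal{M}$ admits a strongly unique optimal policy, then $\mathrm{EP}_{\mathcal{M}} \to 0$ in $\mu^\pi$-probability for all $\mu \in \mathcal{M}$.
If we don’t require the condition on strongly unique optimal policies, then the policy $\pi$ could be asymptotically optimal while $\mathrm{EP}_{\mathcal{M}} \not\to 0$: there might be another policy $\pi'$ that is very different from any optimal policy $\pi^*_\mu$, but whose $\mu$-value approaches the optimal value: $V^*_\mu(h_{<t}) - V^{\pi'}_\mu(h_{<t}) \to 0$ as $t \to \infty$. Our policy $\pi$ could converge to $\pi'$ without $\mathrm{EP}_{\mathcal{M}}$ converging to $0$.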
4 Exploration Potential in Multi-Armed Bandits
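In this section we use experiments with multi-armed Bernoulli bandits to illustrate the properties of exploration potential. The class of Bernoulli bandits is $\Theta = [0, 1]^k$ (the arms’ means). In each time step, the agent chooses an action (arm) $i \in \{1, \ldots, k\}$ and receives a reward $r_t \sim \mathrm{Bernoulli}(\theta^*_i)$ where $\theta^* \in \Theta$ is the true environment. Since $\Theta$ is uncountable, exploration potential is defined with an integral instead of a sum:
$$\mathrm{EP}_\Theta(h_{<t}) := \int_\Theta p(\theta \mid h_{<t}) \left| \theta_{j(\theta)} - \hat\theta_{j(\theta)} \right| d\theta,$$
where $p(\theta \mid h_{<t})$ is the posterior distribution given the history $h_{<t}$, $\hat\theta := \int_\Theta \theta \, p(\theta \mid h_{<t}) \, d\theta$ is the Bayes-mean parameter, and $j(\theta) := \arg\max_i \theta_i$ is the index of the best arm according to $\theta$.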
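The integral over arm means can be estimated by sampling from the posterior. The sketch below is one way to do this, not the procedure used for the figures: it assumes a uniform prior, so that each arm has an independent Beta posterior, and averages the absolute gap between a sampled parameter's best-arm mean and the Bayes-mean of that same arm. The function name `bandit_ep` and the sample size are hypothetical.

```python
import random


def bandit_ep(successes, failures, n_samples=10000, seed=0):
    """Monte Carlo estimate of exploration potential in a Bernoulli bandit.

    Assumes a uniform prior, i.e. arm i has posterior
    Beta(1 + successes[i], 1 + failures[i]).
    """
    rng = random.Random(seed)
    alphas = [1 + s for s in successes]
    betas = [1 + f for f in failures]
    # Bayes-mean parameter: the posterior mean of each arm.
    bayes_mean = [a / (a + b) for a, b in zip(alphas, betas)]
    total = 0.0
    for _ in range(n_samples):
        theta = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
        best = max(range(len(theta)), key=theta.__getitem__)
        total += abs(theta[best] - bayes_mean[best])
    return total / n_samples


ep_prior = bandit_ep([0, 0], [0, 0])        # nothing observed yet
ep_later = bandit_ep([50, 10], [10, 50])    # 60 pulls of each arm
```

With these inputs the estimate after 60 pulls per arm is far smaller than the estimate under the prior, since the posterior has concentrated on parameters whose best arm is well predicted by the Bayes-mean.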
Figure 1 shows the exploration potential of several bandit algorithms, illustrating how much each algorithm explores. Notably, optimally confident UCB (Lattimore, 2015) stops exploring around time step 700 and focuses on exploitation (because in contrast to the other algorithms it knows the horizon). Thompson sampling, round robin (alternate between all arms), and $\varepsilon$-greedy explore continuously (but $\varepsilon$-greedy is less effective). The optimal strategy (always pull the first arm) never explores and hence its exploration potential decreases only slightly.
Exploration potential naturally gives rise to an exploration strategy: greedily minimize Bayes-expected exploration potential (MinEP); see Algorithm 1. This strategy unsurprisingly explores more than all the other algorithms when measured on exploration potential, but in bandits it also turns out to be a decent exploitation strategy because it focuses its attention on the most promising arms. For empirical performance see Figure 2. However, MinEP is generally not a good exploitation strategy in more complex environments like MDPs.
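The greedy strategy can be sketched as a one-step lookahead. This is an illustrative reading of "greedily minimize Bayes-expected exploration potential", not a reproduction of Algorithm 1: for each arm it averages the exploration potential of the two possible next posteriors, weighted by the posterior predictive probability of each reward, and picks the arm with the smallest result. Independent Beta posteriors, the Monte Carlo estimator, and all names are assumptions.

```python
import random


def bandit_ep(alphas, betas, n_samples, rng):
    """Monte Carlo exploration potential under independent Beta posteriors."""
    bayes_mean = [a / (a + b) for a, b in zip(alphas, betas)]
    total = 0.0
    for _ in range(n_samples):
        theta = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
        best = max(range(len(theta)), key=theta.__getitem__)
        total += abs(theta[best] - bayes_mean[best])
    return total / n_samples


def min_ep_action(alphas, betas, n_samples=2000, seed=0):
    """Return the arm whose pull minimizes the expected next-step EP."""
    rng = random.Random(seed)
    best_arm, best_value = None, float("inf")
    for arm in range(len(alphas)):
        # Posterior predictive probability of observing reward 1 on this arm.
        p_success = alphas[arm] / (alphas[arm] + betas[arm])
        after_success = list(alphas)
        after_success[arm] += 1
        after_failure = list(betas)
        after_failure[arm] += 1
        expected_ep = (p_success * bandit_ep(after_success, betas, n_samples, rng)
                       + (1 - p_success) * bandit_ep(alphas, after_failure, n_samples, rng))
        if expected_ep < best_value:
            best_arm, best_value = arm, expected_ep
    return best_arm


chosen = min_ep_action([1, 1, 1], [1, 1, 1])
```

Because the estimates are noisy, arms with nearly equal expected exploration potential may be ranked differently under another seed or sample size.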
Several variants on the definition of exploration potential given in subsection 3.1 are conceivable. However, they often fail to satisfy at least one of the properties that make our definition appealing: they break the necessity (subsection 3.2), the sufficiency (subsection 3.3), or our proofs thereof, or they make exploration potential hard to compute. For example, we could replace by where is the agent’s future policy. This preserves both necessity and sufficiency, but relies on computing the agent’s future policy. If the agent uses exploration potential for taking actions (e.g., for targeted exploration), then this definition becomes a self-referential equation and might be very hard to solve. Following Dearden et al. (1999), we could consider which has the convenient side-effect that it is model-free and therefore applies to more reinforcement learning algorithms. However, in this case the necessity guarantee (subsection 3.3) requires the additional condition that the agent’s policy converges to the greedy policy . Moreover, this does not remove the dependence on a model since we still need a model class and a posterior.
Based on the recent successes in approximating information gain (Houthooft et al., 2016), we are hopeful that exploration potential can also be approximated in practice. Since computing the posterior is too costly for complex reinforcement learning problems, we could (randomly) generate a few environments and estimate the sum in subsection 3.1 with them.
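One way to read this suggestion is as a self-normalized estimate over a handful of randomly generated environments. The sketch below is an assumption-laden illustration on Bernoulli bandits: environments are drawn uniformly at random, weighted by the likelihood they assign to the observed history, and the weighted sum then stands in for the full posterior sum. The function name `approx_ep`, the uniform sampling distribution, and the likelihood weighting are not taken from the text.

```python
import random


def approx_ep(history, n_arms, n_envs=200, seed=0):
    """Estimate exploration potential from a few sampled bandit environments.

    `history` is a list of (arm, reward) pairs with reward in {0, 1}.
    """
    rng = random.Random(seed)
    # Randomly generated candidate environments (vectors of arm means).
    envs = [[rng.random() for _ in range(n_arms)] for _ in range(n_envs)]
    # Weight each candidate by the likelihood of the observed history.
    weights = []
    for theta in envs:
        likelihood = 1.0
        for arm, reward in history:
            likelihood *= theta[arm] if reward == 1 else 1.0 - theta[arm]
        weights.append(likelihood)
    normalizer = sum(weights)
    weights = [w / normalizer for w in weights]
    # Weighted mean of each arm plays the role of the model's estimate.
    mean = [sum(w * theta[i] for w, theta in zip(weights, envs))
            for i in range(n_arms)]
    ep = 0.0
    for w, theta in zip(weights, envs):
        best = max(range(n_arms), key=theta.__getitem__)
        ep += w * abs(theta[best] - mean[best])
    return ep


ep_empty = approx_ep([], n_arms=2)
ep_informed = approx_ep([(0, 1)] * 20 + [(1, 0)] * 20, n_arms=2)
```

With few sampled environments the weights can collapse onto a handful of candidates after a long history, so the quality of such an estimate depends heavily on how the environments are generated.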
In this paper we only scratch the surface on exploration potential and leave many open questions. Is this the correct definition? What are good rates at which EP should converge to 0? Is minimizing EP the most efficient exploration strategy? Can we compute EP more efficiently than information gain?
We thank Tor Lattimore, Marcus Hutter, and our coworkers at the FHI for valuable feedback and discussion.
- Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Marc G Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. Technical report, 2016. http://arxiv.org/abs/1606.01868.
- Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. In Uncertainty in Artificial Intelligence, pages 150–159, 1999.
- Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Curiosity-driven exploration in deep reinforcement learning via Bayesian neural networks. Technical report, 2016. http://arxiv.org/abs/1605.09674.
- Marcus Hutter. Universal Artificial Intelligence. Springer, 2005.
- Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. Technical report, 2015. http://arxiv.org/abs/1507.07880.
- Jan Leike. Nonparametric General Reinforcement Learning. PhD thesis, Australian National University, 2016.
- Marlos C Machado and Michael Bowling. Learning purposeful behaviour in the absence of rewards. Technical report, 2016. http://arxiv.org/abs/1605.07700.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
- Laurent Orseau, Tor Lattimore, and Marcus Hutter. Universal knowledge-seeking agents for stochastic environments. In Algorithmic Learning Theory, pages 158–172. Springer, 2013.
- Gautam Reddy, Antonio Celani, and Massimo Vergassola. Infomax strategies for an optimal balance between exploration and exploitation. Journal of Statistical Physics, 163(6):1454–1476, 2016.
- Dan Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pages 1583–1591, 2014.
- Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
- Malcolm Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000.
- Yi Sun, Faustino Gomez, and Jürgen Schmidhuber. Planning to be surprised: Optimal Bayesian exploration in dynamic environments. In Artificial General Intelligence, pages 41–51. Springer, 2011.