Goal-oriented Trajectories for Efficient Exploration
Exploration is a difficult challenge in reinforcement learning, and even recent state-of-the-art curiosity-based methods rely on the simple epsilon-greedy strategy to generate novelty. We argue that pure random walks fail to properly expand the exploration area in most environments, and propose to replace single random action choices with the selection of random goals followed by several steps in their direction. This approach is compatible with any curiosity-based exploration method and off-policy reinforcement learning agent, and generates longer and safer trajectories than individual random actions. To illustrate this, we present a task-independent agent that learns to reach coordinates in screen frames, and demonstrate its ability to explore in the game Super Mario Bros., significantly improving the score of a baseline DQN agent.
Exploration is an inherent part of a reinforcement learning algorithm, allowing an agent to learn about its environment. The goal of exploration is to reduce uncertainty about the environment's transitions and to discover rewards. The most common approach to exploration in the absence of any knowledge about the environment is to perform random actions. As knowledge is gained, the agent can use it to increase its performance by taking greedy actions, while retaining some chance to choose random actions to further explore the environment (ε-greedy exploration).
However, if rewards are sparse or not sufficiently informative to allow performance improvements, ε-greedy fails to explore sufficiently far. Several methods have been described that bias the agent's actions towards novelty, mostly by using optimistic initialization or curiosity signals (Oudeyer et al., 2007; Oudeyer & Kaplan, 2009; Schmidhuber, 1991, 2010). While the former is mainly compatible with tabular methods, the latter relies on intrinsic rewards motivating, for example, information gain (Kearns & Koller, 1999; Brafman & Tennenholtz, 2002), state visitation counts (Bellemare et al., 2016; Tang et al., 2017) or prediction error (Stadie et al., 2015; Pathak et al., 2017). While these methods show success in sparse-reward environments, they dynamically modify the agent's reward function and thus make the environment's MDP appear non-stationary to the agent, while still mostly relying on completely random actions to discover novelty.
The main issue with random actions is that they can waste a lot of time by cancelling each other out or pushing into obstacles, and can frequently lead the agent to terminal states. We argue that if, during random exploration, actions are chosen in a coherent way, it is possible to extend the boundaries of the exploration area much faster. To this end, we propose to replace most of the random actions with short trajectories towards random goals.
As a practical implementation, we propose to use screen coordinates as goals and present a novel and efficient algorithm, which we call Q-map, that learns to reach them directly from screen frames.
The core concept of the proposed Q-map algorithm is that a single transition (s, a, s′) allows us to update, off-policy, the estimated distance from s to any goal g when taking action a. In (Andrychowicz et al., 2017), the goals used for such updates are taken from the set of states visited later in experienced trajectories. While this focuses updates on goals that can effectively be visited from a given state, it cannot learn to reach goals that never appear in subsequent states. On the contrary, we propose to use all possible goals for each update. This is achieved by providing only frames to an artificial neural network and requesting the 3D tensor of Q-values for all actions and goals at once. Other works use neural networks that take both states and goals as input, as proposed for Universal Value Function Approximators (Schaul et al., 2015a).
The Q-map agent uses screen coordinates as goals, where the learned Q-values represent a notion of the expected distance to a point in screen space. As these goals are likely correlated and depend on visual cues, we expect an auto-encoder-like architecture to efficiently handle the large output tensor. Convolutions are applied to the input frames, followed by fully-connected layers and deconvolutions (transposed convolutions) to output a set of 2D frames (Q-maps), one per action. In each frame, the rows and columns directly indicate a coordinate in screen space, covering the entire visible screen.
Providing only the state as input and expecting the output to consist of values for all goals is similar to the DQN architecture (Mnih et al., 2015), where the network produces Q-values for every action given a state as input. For the DQN agent, this allows the greedy action to be chosen with a single pass through the network. In the case of the Q-map agent, a similar implementation, illustrated in Figure 2, allows the network to exploit locality, to converge faster and generalize across visual cues, to update towards all possible goals at once, and to efficiently select goals.
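Reading the greedy action towards a particular goal then amounts to an argmax over the action dimension at that goal's row and column. A minimal NumPy sketch (the tensor layout (n_actions, n_rows, n_cols) and the function name are illustrative assumptions, not the paper's code):

```python
import numpy as np

def greedy_action_towards(q_map: np.ndarray, goal_row: int, goal_col: int) -> int:
    """Pick the action with the highest Q-value for the given goal coordinates.

    q_map: tensor of shape (n_actions, n_rows, n_cols) produced by a single
    forward pass; one 2D "Q-map" frame per action.
    """
    return int(np.argmax(q_map[:, goal_row, goal_col]))

# Toy example: 3 actions on a 4x4 grid of goals.
rng = np.random.default_rng(0)
q_map = rng.random((3, 4, 4))
q_map[2, 1, 3] = 2.0  # make action 2 clearly best for goal (1, 3)
action = greedy_action_towards(q_map, 1, 3)
```

A single forward pass thus serves both goal selection (by comparing values across the whole map) and action selection towards the chosen goal.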
Given a transition (s, a, s′), the value Q(s, a, g) for any goal coordinates g can be updated using a variant of Q-learning, with the following target: the target is 1 if the agent's coordinates in s′ coincide with g; 0 if the episode terminates without reaching g; and γ max_{a′} Q(s′, a′, g) otherwise.
Here γ is a discount factor used for the Q-map, not necessarily the same as the one discounting the rewards from the environment. The update is performed efficiently using the 3D tensors of Q-values from the network, as detailed in Algorithm 1.
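The all-goals update can be sketched as follows in NumPy (an illustration consistent with the description above, not a reproduction of Algorithm 1; the convention that a just-reached goal receives a target of 1 is an assumption of this sketch):

```python
import numpy as np

def qmap_target(next_q_map, agent_row, agent_col, done, gamma=0.9):
    """Q-learning targets for every goal at once.

    next_q_map: (n_actions, n_rows, n_cols) Q-values for the next state s'.
    (agent_row, agent_col): the agent's coordinates in s'.
    Returns an (n_rows, n_cols) array of targets, one per goal.
    """
    if done:
        targets = np.zeros(next_q_map.shape[1:])      # no bootstrapping at termination
    else:
        targets = gamma * next_q_map.max(axis=0)      # decayed bootstrap towards all goals
    targets[agent_row, agent_col] = 1.0               # the goal just reached gets full value
    return targets

next_q = np.full((3, 4, 4), 0.5)
t = qmap_target(next_q, 2, 2, done=False, gamma=0.9)  # t[2, 2] == 1.0, elsewhere 0.45
```

A single transition thus produces a full map of targets, one per goal, in one vectorized operation.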
Q-maps are suited to environments in which it is possible to locate the agent's position in screen coordinates, which can either be provided by the environment (e.g. from the RAM in video games) or obtained from a separate classifier trained to localise the agent. While coordinates are used to create the Q-learning targets, this does not preclude using only raw frames as input to the Q-map agent. In some games, such as Super Mario Bros. (used later in the experiments), the screen scrolls while the player's avatar moves, so only a portion of a level is shown at once. In the proposed Q-map implementation we chose to use the coordinates visible on the screen as possible goals, rather than coordinates over an entire level; the map is thus local to the area around the agent.
While the distance to goals could be represented more directly by the expected number of steps, the decay factor γ forces the values to be bounded: 0 for points which are unreachable and 1 for points which are immediately reachable, with the Q-values naturally decaying towards 0 for points which are never reached. However, it is important to note that the Q-map agent is not a planner: the Q-values represent an expectation which can become quite unreliable over several time steps, especially as the agent lacks a predictive model to deal with dynamic environments. Nevertheless, we expect the agent to produce sufficiently accurate local estimations, which can be further improved by requesting a new Q-map towards the chosen goal at every time step.
3 Combining Q-maps with reinforcement learning
The proposed Q-map exploration agent can be used with any off-policy reinforcement learning agent and can learn simultaneously with it. A combined agent then consists of an exploration agent learning Q-maps of the environment and a task-learner agent learning to maximize the rewards. At every time step, if a goal has already been chosen, the Q-map agent provides the next action towards it and decreases the remaining time allowed to reach it; if no goal is active, a new one is selected with some probability, and otherwise an action is requested from the task-learner agent. On top of these two agents, it is also necessary to keep some amount of random actions in order to train both of them efficiently. The process is detailed in Algorithm 2.
The number of steps spent following goals is not accurately predictable, so ensuring that the average proportion of exploratory steps (random and goal-directed actions) follows a scheduled proportion is not straightforward. To achieve a good approximation, we dynamically adjust the probability of selecting new goals so that a running average of the proportion of exploratory steps matches the scheduled exploration. This allows us to compare the performance of our proposed agent and the baseline DQN with a similar proportion of exploratory actions.
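One simple way to implement this adjustment is an exponential running average with a small corrective step on the goal-selection probability (a sketch under assumed smoothing and adjustment rates, not the paper's exact controller):

```python
def update_goal_probability(p_new_goal, explored, target_ratio, running_avg,
                            beta=0.01, step=0.01):
    """Adjust the new-goal probability so that a running average of the
    fraction of exploratory steps tracks the scheduled target.

    explored: 1.0 if the last step was exploratory (random or goal-directed),
    0.0 otherwise. beta and step are illustrative smoothing/adjustment rates.
    """
    running_avg = (1.0 - beta) * running_avg + beta * explored
    if running_avg < target_ratio:
        p_new_goal = min(1.0, p_new_goal + step)   # exploring too little: raise p
    else:
        p_new_goal = max(0.0, p_new_goal - step)   # exploring too much: lower p
    return p_new_goal, running_avg
```

Called once per step, this keeps the observed exploratory fraction close to the schedule even though each goal consumes an unpredictable number of steps.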
With ε-greedy exploration, random actions and greedy actions can conflict, making it difficult for the agent to move coherently. With our proposed approach this phenomenon can be exacerbated, since the exploratory steps are longer. To reduce this effect, with some fixed probability the goal selection is biased: the task-learner agent is first asked for the next greedy action it would like to perform, and only goals compatible with that action are kept.
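A sketch of this biased goal selection (assuming, for illustration, that a goal is "compatible" with the greedy action when that action has the highest Q-map value for it):

```python
import numpy as np

def biased_goal_candidates(q_map, greedy_action):
    """Goals whose best Q-map action matches the task-learner's greedy action.

    q_map: (n_actions, n_rows, n_cols). Returns a list of (row, col) goals.
    """
    best_actions = q_map.argmax(axis=0)                  # best first action per goal
    rows, cols = np.where(best_actions == greedy_action)
    return list(zip(rows.tolist(), cols.tolist()))
```

Sampling a goal from this filtered set means the first step of the goal-directed trajectory agrees with what the task-learner would have done anyway.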
The proposed Q-map-DQN agent is composed of two DQN-based sub-agents (Mnih et al., 2015): one learning the task-independent Q-maps and one learning the task from environmental rewards. The implementation is based on OpenAI Baselines' DQN (Dhariwal et al., 2017), using TensorFlow (Abadi et al., 2015). Inputs are a stack of three normalized grayscale frames. Each sub-agent uses a separate artificial neural network, but the same architecture is used for the encoder visible in Figure 2. We used double Q-learning (Hasselt, 2010) with periodic target network updates, and prioritized experience replay (Schaul et al., 2015b) (with default parameters from Baselines) using a shared replay buffer but separate priorities. Training of both networks starts after an initial warm-up period, and their outputs are only used after a further number of steps; both networks are trained periodically with independent batches. Two Adam optimizers (Kingma & Ba, 2014) are used, with hyperparameters left at TensorFlow's defaults apart from the learning rate. The probability of greedy actions from DQN increases linearly at the start of training.
The DQN sub-agent outputs one Q-value per action, and uses dueling networks (Wang et al., 2015), ReLU activations, and a discounted return on environment rewards. The Q-map sub-agent outputs one frame per action at a lower resolution than the input. The deconvolutions (transposed convolutions) use the same kernel shapes as the encoder but smaller strides, and the ELU activation function. Goals are selected within an expected range of timesteps, and the time given to reach them includes a supplement to account for possible random movements interfering with the trajectories. The goal selection is biased with some probability to match the greedy action from the DQN sub-agent. The discount factor γ decays the values of unreachable points towards 0 and forces the agent to reach goals as quickly as possible. Finally, the probability of taking a completely random action at any time is decayed over the first part of training.
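Since a goal reachable in k steps has a value of roughly γ^(k-1) under the decayed targets, selecting goals within an expected step range can be done by thresholding the Q-maps (a sketch under that assumption; the exact selection rule may differ):

```python
import numpy as np

def goals_in_range(q_map, gamma, t_min, t_max):
    """Candidate goals whose estimated distance (in steps) lies in [t_min, t_max].

    Uses the relation Q ~= gamma**(steps - 1) implied by the decayed targets
    (an assumption of this sketch). q_map: (n_actions, n_rows, n_cols).
    """
    best = q_map.max(axis=0)                       # best achievable value per goal
    lo, hi = gamma ** (t_max - 1), gamma ** (t_min - 1)
    rows, cols = np.where((best >= lo) & (best <= hi))
    return list(zip(rows.tolist(), cols.tolist()))
```

Thresholding this way excludes both trivially close goals and goals so distant that the local Q-map estimates become unreliable.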
To measure the impact of the proposed exploration compared to ε-greedy, we train both our agent and DQN (the same agent deprived of the Q-map) on the first level of the game Super Mario Bros. (All Stars) for the Super Nintendo Entertainment System (SNES), using the Retro Learning Environment (RLE) framework (Bhonker et al., 2016) based on OpenAI Gym (Brockman et al., 2016). Transitions are deterministic, and the action set is limited to the most basic actions: no action, move left and right, and jump up or diagonally to the left or right. Terminations (e.g. by touching enemies) are detected from the RAM, and only the original SNES rewards are used, rescaled by a constant. No bonus is provided for moving to the right, as originally implemented in RLE, and there is no penalty for dying. Rewards are given for breaking brick blocks, killing enemies, collecting coins, throwing turtle shells, eating consumables such as mushrooms, and reaching the final flag. The coordinates of Mario and of the scrolling window were extracted from the RAM. Episodes are naturally limited by the in-game timer. Videos are available at: https://sites.google.com/view/got-exploration.
To verify that the Q-map walk is more efficient than a random walk at pure exploration, we first ran both types of walk, with no task-learner agent present, and collected the coordinates traversed by Mario. Figure 3 (top) shows that the Q-map walk pushes the boundaries of the exploration much further than the random walk and almost manages to reach the flag. To verify that the Q-map effectively improves exploration when training DQN, we then trained the full proposed agent. Figure 3 (bottom) shows that the combination of Q-map and DQN ultimately pushes exploration to the end of the level, reaching the flag.
Finally, to measure the performance of the proposed agent in terms of the sum of rewards collected per episode and the number of flags reached, we trained it further and reported the results in Figure 4. Our proposed agent quickly outperforms the baseline DQN agent and maintains consistently better performance during the final third of training. The stars in the figure indicate that the flag has been reached, with our agent reaching it far more often than DQN.
5 Conclusion and future directions
We have shown that exploring the environment by following a coherent sequence of steps towards random goals can successfully expand the exploration boundaries and thus improve reinforcement learning agents' performance compared to common ε-greedy approaches. We have proposed and evaluated a practical implementation of such an exploration method combined with DQN, and achieved a significant performance gain over the baseline version.
Further work is needed to better evaluate the generalization properties of the proposed Q-map agent, such as by testing on a variety of other levels of the same game.
The research presented in this paper has been supported by Dyson Technology Ltd. and Samsung.
- Abadi et al. (2015) Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dandelion, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
- Andrychowicz et al. (2017) Andrychowicz, Marcin, Wolski, Filip, Ray, Alex, Schneider, Jonas, Fong, Rachel, Welinder, Peter, McGrew, Bob, Tobin, Josh, Abbeel, OpenAI Pieter, and Zaremba, Wojciech. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
- Bellemare et al. (2016) Bellemare, Marc, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
- Bhonker et al. (2016) Bhonker, Nadav, Rozenberg, Shai, and Hubara, Itay. Playing SNES in the Retro Learning Environment. arXiv preprint arXiv:1611.02205, 2016.
- Brafman & Tennenholtz (2002) Brafman, Ronen I and Tennenholtz, Moshe. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Brockman et al. (2016) Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- Dhariwal et al. (2017) Dhariwal, Prafulla, Hesse, Christopher, Klimov, Oleg, Nichol, Alex, Plappert, Matthias, Radford, Alec, Schulman, John, Sidor, Szymon, and Wu, Yuhuai. OpenAI Baselines. https://github.com/openai/baselines, 2017.
- Hasselt (2010) Hasselt, Hado V. Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.
- Kearns & Koller (1999) Kearns, Michael and Koller, Daphne. Efficient reinforcement learning in factored MDPs. In IJCAI, volume 16, pp. 740–747, 1999.
- Kingma & Ba (2014) Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Oudeyer & Kaplan (2009) Oudeyer, Pierre-Yves and Kaplan, Frederic. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2009.
- Oudeyer et al. (2007) Oudeyer, Pierre-Yves, Kaplan, Frdric, and Hafner, Verena V. Intrinsic motivation systems for autonomous mental development. IEEE transactions on evolutionary computation, 11(2):265–286, 2007.
- Pathak et al. (2017) Pathak, Deepak, Agrawal, Pulkit, Efros, Alexei A, and Darrell, Trevor. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
- Schaul et al. (2015a) Schaul, Tom, Horgan, Daniel, Gregor, Karol, and Silver, David. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015a.
- Schaul et al. (2015b) Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015b.
- Schmidhuber (1991) Schmidhuber, Jürgen. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pp. 222–227, 1991.
- Schmidhuber (2010) Schmidhuber, Jürgen. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
- Stadie et al. (2015) Stadie, Bradly C, Levine, Sergey, and Abbeel, Pieter. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
- Tang et al. (2017) Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, OpenAI Xi, Duan, Yan, Schulman, John, DeTurck, Filip, and Abbeel, Pieter. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2750–2759, 2017.
- Wang et al. (2015) Wang, Ziyu, Schaul, Tom, Hessel, Matteo, Van Hasselt, Hado, Lanctot, Marc, and De Freitas, Nando. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.