High-Dimensional Control Using Generalized Auxiliary Tasks
A long-standing challenge in reinforcement learning is the design of function approximations and efficient learning algorithms that provide agents with fast training, robust learning, and high performance in complex environments. To this end, the use of prior knowledge, while promising, is often costly and difficult to scale up. In contrast, we consider problem knowledge signals, which are any relevant indicators useful for solving a task, e.g., metrics of uncertainty or proactive prediction of future states. Our framework consists of predicting such complementary quantities associated with self-performance assessment and accurate expectations. Policy and value functions are therefore no longer optimized only for a reward but are also learned using environment-agnostic quantities. We propose a generally applicable framework for structuring reinforcement learning by injecting problem knowledge into policy gradient updates. In this paper: (a) We introduce MERL, our multi-head reinforcement learning framework for generalized auxiliary tasks. (b) We conduct experiments across a variety of standard benchmark environments. Our results show that MERL improves performance for on- and off-policy methods. (c) We show that MERL also improves transfer learning on a set of challenging tasks. (d) We investigate how our approach addresses the problem of reward sparsity and pushes the function approximations into a better-constrained parameter configuration.
Learning to control agents in simulated environments has been a source of many research efforts for decades [nguyen1990truck, werbos1989neural, schmidhuber1991learning, Robinson1989DynamicRD] and this approach is still at the forefront of recent works in deep reinforcement learning (DRL) [burda2018exploration, ha2018recurrent, silver_mastering_2016, espeholt2018impala]. However, current algorithms tend to be fragile and opaque [iyer2018transparency]. Most of them require a large amount of training data collected from an agent interacting with a simulated environment in which the reward signal is often critically sparse. Based on that, we make two observations.
First, previous work in reinforcement learning uses prior knowledge [lin1992self, clouse1992teaching, moreno2004using, ribeiro1998embedding] as a means of reducing sample inefficiency. While promising and undoubtedly necessary, the integration of such priors into current methods is potentially costly to implement. It may cause unwanted constraints and can hinder scaling up. Therefore, one goal of this paper is to integrate non-limiting constraints directly in the current RL methods, while being generally applicable to many tasks. Second, if the probability of receiving a reward by chance is arbitrarily low, the time required to learn from it will be arbitrarily long [whitehead1991complexity]. This barrier to learning will prevent agents from significantly reducing learning time. One way to overcome this barrier is to learn by getting alternative and complementary signals from different sources [oudeyer2007intrinsic, schmidhuber1991curious].
Building upon existing auxiliary task methods, we develop the MERL framework, which integrates complementary problem-oriented quantities into the learning process. In this framework, while the agent learns to maximize the returns by sampling an environment, it also learns to correctly evaluate its performance on the control problem by minimizing a MERL objective function augmented with several problem-focused quantities. One intuition the reader can have is that MERL transforms a reward-focused task into a task regularized with problem knowledge signals. In turn, agents can learn a richer representation of the environment and consequently better address the task at hand.
In the remainder of this paper, we first present the policy gradient methods used to demonstrate the performance of MERL. Then, we introduce the framework and detail how it can be applied to most DRL methods in any environment. We suggest some of the multi-head/problem knowledge quantities mentioned above, but the reader is further encouraged to introduce other relevant signals. We demonstrate that, while being able to correctly predict the quantities from the different MERL heads, the agent outperforms the on- and off-policy baselines that do not use the MERL framework on various continuous control tasks. We also show that our framework enables better transfer of learning from one task to others on several Atari 2600 games.
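We consider a Markov Decision Process (MDP) defined on a state space $\mathcal{S}$, an action space $\mathcal{A}$ and a reward function $r(s, a)$ where $s \in \mathcal{S}$, $a \in \mathcal{A}$. Let $\pi$ denote a stochastic policy and let the objective function be the traditional expected discounted reward:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right], \tag{1}$$
where $\gamma \in [0, 1)$ is a discount factor [sutton1998introduction] and $\tau = (s_0, a_0, s_1, \dots)$ is a trajectory sampled from the environment.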
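As a small illustration of the expected discounted reward, the sketch below estimates it by averaging discounted returns over sampled trajectories. The function names and the default discount of 0.99 are assumptions made for this example, not values taken from the paper.

```python
import numpy as np


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one trajectory (gamma=0.99 is an assumed default)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))


def expected_discounted_return(trajectories, gamma=0.99):
    """Monte Carlo estimate of the objective: mean discounted return over trajectories."""
    return float(np.mean([discounted_return(r, gamma) for r in trajectories]))
```

With `gamma=0.5`, the reward sequence `[1, 1, 1]` gives `1 + 0.5 + 0.25 = 1.75`.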
Policy gradient methods aim at modelling and optimizing the policy directly [sutton2000policy]. The policy is generally modeled with a function parameterized by $\theta$; in the rest of the paper, $\pi$ and $\pi_\theta$ denote the same policy. In DRL, the policy is represented by a neural network called the policy network. Policy gradient methods can generally be cast into two groups: off-policy gradient methods, such as Deep Deterministic Policy Gradients (DDPG) [lillicrap2015continuous], and on-policy methods, such as Proximal Policy Optimization (PPO) [schulman2017proximal]. These two methods are among the most commonly used state-of-the-art policy gradient methods.
On- and Off-Policy Gradient Methods
On-Policy with Clipped Surrogate Objective
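PPO-Clip [schulman2017proximal] is an on-policy gradient-based algorithm. In previous work, PPO has been tested on a set of benchmark tasks and has proven to produce impressive results in many cases despite a relatively simple implementation. For instance, instead of imposing a hard constraint like TRPO [schulman2015trust], PPO formalizes the constraint as a penalty in the objective function. In PPO, at each iteration, the new policy $\pi_{\theta_{k+1}}$ is obtained from the old policy $\pi_{\theta_k}$:
$$\theta_{k+1} = \arg\max_{\theta}\; \mathbb{E}_{s_t, a_t \sim \pi_{\theta_k}}\left[L^{PPO}(s_t, a_t, \theta_k, \theta)\right]. \tag{2}$$
We use the clipped version of PPO whose objective function is:
$$L^{PPO}(s_t, a_t, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}\, A^{\pi_{\theta_k}}(s_t, a_t),\; \mathrm{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon\right) A^{\pi_{\theta_k}}(s_t, a_t)\right), \tag{3}$$
where $A^{\pi_{\theta_k}}$ is the advantage function under the old policy and $\epsilon$ is the clipping parameter.
By taking the minimum of the two terms in Eq. (3), the ratio $\pi_\theta(a_t \mid s_t) / \pi_{\theta_k}(a_t \mid s_t)$ is constrained to stay within a small interval around 1. The expected advantage function for the new policy is estimated by an old policy and then re-calibrated using the probability ratio between the new and the old policy.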
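The clipped surrogate objective can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name and its arguments are hypothetical, and the default clipping value of 0.2 is the MuJoCo setting listed in Appendix B.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    new and old policies; advantages: advantage estimates from the old policy.
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)

    ratio = np.exp(logp_new - logp_old)  # probability ratio new / old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the minimum removes the incentive to move the ratio far from 1.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

For example, with ratios `[1.5, 0.5]` and advantages `[1, -1]`, the per-sample terms are `min(1.5, 1.2) = 1.2` and `min(-0.5, -0.8) = -0.8`, so the batch objective is `0.2`.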
Off-Policy with Actor-Critic algorithm
DDPG [lillicrap2015continuous] is a model-free off-policy actor-critic algorithm, combining Deterministic Policy Gradient (DPG) [silver2014deterministic] with Deep Q-Network (DQN) [mnih2013playing]. While the original DQN works in discrete action spaces and stabilizes the learning of the Q-function with experience replay and a target network, DDPG extends it to continuous action spaces with the actor-critic framework while learning a deterministic policy.
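Let $P(\cdot \mid s, a)$ denote the distribution from which the next state $s'$ is sampled. The Bellman equation describing the optimal action-value function $Q^*(s, a)$ is given by:
$$Q^*(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma \max_{a'} Q^*(s', a')\right].$$
Assume the function approximator of $Q^*$ is a neural network $Q_\phi$ with parameters $\phi$. An essential aspect of DDPG is that, because computing the maximum over actions is intractable in continuous action spaces, the algorithm uses a target policy network $\mu_{\theta_{\text{targ}}}$ to compute an action which approximately maximizes $Q_{\phi_{\text{targ}}}$. Given the collection of transitions $(s, a, r, s', d)$ in a set $\mathcal{D}$, where $d$ denotes whether $s'$ is terminal, we obtain the mean-squared Bellman error (MSBE) function with:
$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\left[\left(Q_\phi(s, a) - \left(r + \gamma (1 - d)\, Q_{\phi_{\text{targ}}}\big(s', \mu_{\theta_{\text{targ}}}(s')\big)\right)\right)^2\right].$$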
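To make the mean-squared Bellman error concrete, here is a minimal NumPy sketch over a batch of transitions. It assumes that the target critic has already been evaluated at the next state and at the action proposed by the target policy. The function and argument names are hypothetical.

```python
import numpy as np


def msbe(q_values, rewards, dones, next_q_target, gamma=0.99):
    """Mean-squared Bellman error for a batch of transitions.

    q_values:      critic outputs Q(s, a) for the sampled transitions
    next_q_target: target-critic outputs at (s', target_policy(s'))
    dones:         1.0 where s' is terminal, 0.0 otherwise
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    next_q_target = np.asarray(next_q_target, dtype=np.float64)

    # Terminal transitions do not bootstrap from the next state.
    targets = rewards + gamma * (1.0 - dones) * next_q_target
    return float(np.mean((q_values - targets) ** 2))
```

In a full implementation the targets would be treated as constants during the gradient step on the critic; this sketch only evaluates the loss value.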
MERL incorporates three key ideas: (a) the use of auxiliary quantities as a means to tackle the problem of reward sparsity, (b) the introduction of quantities of self-performance assessment and accurate expectations which constitute task-agnostic signals, more widely applicable than environment-specific prior knowledge, (c) policy and value function approximation with neural networks augmented with a multi-head layer.
Auxiliary tasks have been used to facilitate representation learning for decades [suddarth1990rule], along with intrinsic motivation [schmidhuber2010formal, pathak2017curiosity] and artificial curiosity [oudeyer2007intrinsic, schmidhuber1991curious]. [li2015recurrent] introduces a supervised loss for fitting a recurrent model on the hidden representations to predict the next observed state, in the context of imitation learning of sequences provided by experts. Other works on auxiliary tasks [jaderberg2016reinforcement, mirowski2016learning, burda2018large] allow the agents to maximize other pseudo-reward functions simultaneously. Our method differs from these previous approaches in three ways. First, neither the introduction of additional neural networks (e.g. for memory) nor the introduction of a replay buffer is needed. Second, the quantities (or auxiliary tasks) we introduce are compatible with any task, whereas previous methods only work on pixel-based environments. Third, MERL neither introduces additional iterations nor modifies the reward function of the policy gradient algorithms it is applied to. For all those reasons, MERL can be added out-of-the-box to most policy gradient algorithms with a negligible computational cost overhead.
From a different perspective, [garcia2015comprehensive] gives a detailed overview of previous work that has changed the optimality criterion as a safety factor. But most methods use a hard constraint rather than a penalty; one reason is that it is difficult to choose a single coefficient for this penalty that works well for different problems. MERL successfully addresses this problem. In [lipton2016combating], catastrophic actions are avoided by training an intrinsic fear model to predict whether a disaster will occur and using it to shape rewards. Compared to both methods, MERL is more scalable and lightweight, and it incorporates quantities of self-performance assessment (e.g. variance explained of the value function) and accurate expectations (e.g. next state prediction), leading to improved performance.
Finally, many previous studies have focused on the use of imitation learning to address a task directly. There are two main approaches: behavioral cloning [pomerleau1989alvinn], which attempts to learn a task policy through supervised learning, and inverse RL [abbeel2004apprenticeship], which attempts to learn a reward function from a set of demonstrations. However, these successful approaches often push the agent only to learn how to perform a task from expert demonstrations, with a relatively modest understanding of its own behavior. We propose a method that gives the agent relevant quantities that bring its learning closer to how to learn to accomplish the task at hand, in addition to being optimized to solve it.
MERL: Multi-Head Framework for Generalized Auxiliary Tasks
Our multi-head architecture and its associated learning algorithm are directly applicable to most state-of-the-art policy gradient methods. Let be the index of each MERL head: . In the context of DRL, we introduce two of the quantities predicted by these heads and show how to incorporate them in the policy gradient methods mentioned above.
Value Function and Policy
In DRL, the policy is generally represented in a neural network called the policy network, with parameters (weights) , and the value function is parameterized by the value network, with parameters . In the case of DDPG, the value network translates into the action-value network. Each MERL head takes as input the last embedding layer from the value network and is constituted of only one layer of fully-connected neurons, with parameters . The output size of each head corresponds to the size of the predicted MERL quantity. Below, we introduce two examples of these quantities.
Let us define the first quantity we use in MERL: the fraction of variance explained $\mathcal{V}^{\tau}$. It is the fraction of variance that the value function $V$ explains about the returns $\hat{R}$. Put differently, it corresponds to the proportion of the variance in the dependent variable that is predictable from the independent variables. We compute $\mathcal{V}^{\tau}$ at each policy gradient update with the samples used for the gradient computation. In statistics, this quantity is also known as the coefficient of determination $R^2$ [10.2307/2683704]. For the sake of clarity, we will not use this notation for the coefficient of determination, but we will refer to this criterion as:
$$\mathcal{V}^{\tau} = 1 - \frac{\sum_{t} \left(\hat{R}_t - V(s_t)\right)^2}{\sum_{t} \left(\hat{R}_t - \bar{R}\right)^2},$$
where $\hat{R}_t$ and $V(s_t)$ are respectively the return and the expected return from state $s_t$, and $\bar{R}$ is the mean of all returns in the trajectory $\tau$. It should be noted that this criterion may be negative for non-linear models, indicating a severe lack of fit [10.2307/2683704] of the corresponding function:
$\mathcal{V}^{\tau} = 1$ if the fitted value function perfectly explains the returns;
$\mathcal{V}^{\tau} = 0$ corresponds to a simple average prediction;
$\mathcal{V}^{\tau} < 0$ if the fitted value function provides a worse fit to the outcomes than the mean of the discounted rewards.
We denote as the corresponding MERL head, with parameters and its objective function is defined by:
measures the ability of the value function to fit the returns. implies that of the variability of the dependent variable has been accounted for, and the remaining of the variability is still unaccounted for. For instance in [flet2019samples], is used to filter the samples that will be used to update the policy. By its definition, this quantity is a highly relevant indicator for assessing self-performance in reinforcement learning.
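The fraction of variance explained follows the usual coefficient-of-determination formula, one minus the ratio of the residual sum of squares to the total sum of squares. The NumPy sketch below computes it from a batch of returns and value predictions. The function name is hypothetical, and the code assumes the returns are not all identical (otherwise the denominator is zero).

```python
import numpy as np


def variance_explained(returns, values):
    """Fraction of the variance of `returns` explained by `values`.

    Equals 1 for a perfect fit, 0 when predicting the mean return,
    and is negative when the fit is worse than the mean.
    Assumes `returns` is not constant.
    """
    returns = np.asarray(returns, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    residual = np.sum((returns - values) ** 2)
    total = np.sum((returns - returns.mean()) ** 2)
    return float(1.0 - residual / total)
```

For returns `[1, 2, 3]`, predicting `[1, 2, 3]` gives 1, predicting the mean `[2, 2, 2]` gives 0, and predicting the reversed values `[3, 2, 1]` gives `1 - 8/2 = -3`.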
At each timestep, one of the agent’s MERL heads tries to predict a future state from . While a typical MERL quantity can be fit by regression on mean-squared error, we observed that predictions of future states are better fitted with a cosine-distance error. We denote the corresponding head, with parameters , and the observation space size (size of vector ). We define its objective function as:
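A cosine-distance error between a predicted and an observed state vector can be written as one minus their cosine similarity. The sketch below is one common form of that error and is an assumption made for illustration: the exact normalization used by the authors is not recoverable from the text, and the function name and the small `eps` stabilizer are hypothetical.

```python
import numpy as np


def cosine_distance(predicted, target, eps=1e-8):
    """1 - cosine similarity along the last axis (one value per state vector).

    `eps` avoids division by zero for all-zero vectors; it is an assumption
    of this sketch, not a detail taken from the paper.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    dot = np.sum(predicted * target, axis=-1)
    norms = np.linalg.norm(predicted, axis=-1) * np.linalg.norm(target, axis=-1)
    return 1.0 - dot / (norms + eps)
```

The distance is close to 0 for vectors pointing in the same direction, 1 for orthogonal vectors and close to 2 for opposite vectors, regardless of their magnitudes.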
Problem-Constrained Policy Update
Once the set of MERL heads and their associated objective functions have been defined, we modify the gradient update step of the policy gradient algorithms. The objective function incorporates all the MERL head objectives, each weighted by its own coefficient. Since in this paper we introduce two MERL heads, the corresponding two hyper-parameters are reported along with the others in the supplementary materials. It is worth noting that we used the exact same MERL coefficients for all our experiments, which demonstrates the framework’s ease of applicability.
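One way to picture the modified update is as a weighted sum: the base algorithm's loss plus each head's loss multiplied by its coefficient. The sketch below assumes every term is expressed as a loss to be minimized. The function name, the dictionary keys and the numerical values in the example are hypothetical, not the coefficients used in the paper.

```python
def merl_total_loss(base_loss, head_losses, coefficients):
    """Base policy-gradient loss plus coefficient-weighted MERL head losses.

    head_losses and coefficients are dicts keyed by head name; every head
    must have exactly one coefficient.
    """
    if set(head_losses) != set(coefficients):
        raise ValueError("each MERL head needs exactly one coefficient")
    weighted = sum(coefficients[name] * head_losses[name] for name in head_losses)
    return base_loss + weighted
```

For instance, `merl_total_loss(1.0, {"ve": 2.0, "fs": 4.0}, {"ve": 0.5, "fs": 0.25})` returns `3.0`. Adding a new head only requires a new entry in each dictionary.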
We evaluate MERL in multiple high-dimensional environments, ranging from MuJoCo [todorov2012mujoco] to the Atari 2600 games [bellemare2013arcade] (we describe these environments in detail in Tables 5 and 6 in the supplementary materials). The experiments in MuJoCo allow us to evaluate the performance of MERL on a large number of different continuous control problems. It is worth noting that the universal characteristics of the auxiliary quantities we choose ensure that MERL is generally applicable to any task. Other popular auxiliary methods [jaderberg2016reinforcement, mirowski2016learning, burda2018large] cannot be applied to challenging continuous control tasks like MuJoCo. Thus, we compare the performance of our method with on- and off-policy state-of-the-art methods (respectively, PPO [schulman2017proximal] and DDPG [lillicrap2015continuous]).
The experiments on the Atari 2600 games allow us to study the transfer learning abilities of MERL on a set of diverse tasks.
For the continuous control MuJoCo tasks, the agents are trained using separate policy and value networks. In this case, we build upon the value network (named the action-value network in the DDPG algorithm) to incorporate our framework’s heads. In contrast, when playing Atari 2600 games from pixels, the agents are given a CNN shared between the policy and the value function. In that case, the MERL heads are naturally attached to the last embedding layer of the shared network. In both configurations, the outputs of the heads are the same size as the quantity they predict: for instance, the variance-explained prediction is a scalar whereas the future-state prediction has the size of a state.
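The head layout described above can be sketched with plain NumPy: each head is a single fully-connected layer applied to the last embedding, and its output size matches the quantity it predicts. The embedding width of 64 mirrors the hidden size quoted for the MuJoCo networks in Appendix C. The observation size of 17, the batch size, the initialization scale and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_head(embedding_dim, output_dim):
    """One fully-connected layer: a weight matrix and a bias vector."""
    return {
        "W": rng.normal(scale=0.1, size=(embedding_dim, output_dim)),
        "b": np.zeros(output_dim),
    }


def apply_head(head, embedding):
    """Linear map from the shared embedding to the predicted quantity."""
    return embedding @ head["W"] + head["b"]


embedding_dim = 64  # hidden size, as in Appendix C (assumed for the value network)
obs_dim = 17        # hypothetical observation size

ve_head = make_head(embedding_dim, 1)        # variance explained: a scalar
fs_head = make_head(embedding_dim, obs_dim)  # future state: a state-sized vector

# Stand-in for the last embedding layer of the value network, batch of 5 states.
embedding = rng.normal(size=(5, embedding_dim))
ve_pred = apply_head(ve_head, embedding)  # shape (5, 1)
fs_pred = apply_head(fs_head, embedding)  # shape (5, 17)
```

Because the heads only read the last embedding, the same construction applies whether that embedding comes from a separate value network or from a shared CNN.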
- Hyper-parameters Setting.
We used the same hyper-parameters as in the main text of the respective papers. We made this choice within a clear and objective protocol of demonstrating the benefits of using MERL. Hence, its reported performance is not necessarily the best that can be obtained, but it still exceeds the baselines. Using MERL adds as many hyper-parameters as there are heads in the multi-head layer and it is worth noting that MERL hyper-parameters are the same for all tasks. We report all hyper-parameters in the supplementary materials.
- Performance Measures.
We examine the performance across a large number of trials (with different seeds for each task). Standard deviation and average return are generally considered to be the most stable measures used to compare the performance of the algorithms being studied [islam2017reproducibility]. Therefore, in the rest of this work, we use those metrics to establish the performance of our framework quantitatively.
Single-Task Learning: Continuous Control
On-Policy Learning: PPO+MERL
First, we apply MERL to PPO in several MuJoCo environments. Due to space constraints, only three graphs from varied tasks are shown in Fig. 2. The complete set of 9 tasks is reported in Table 1 and the graphs are included in the supplementary materials.
We see from the curves that using MERL leads to better performance on a variety of continuous control tasks. Moreover, learning seems to be faster for some tasks, suggesting that MERL takes advantage of its heads to learn relevant quantities from the beginning of learning, when the reward may be sparse. Interestingly, by looking at the performance across all 9 tasks, we observed better results by predicting only the next state and not the subsequent ones.
Off-Policy Learning: DDPG+MERL
Next, we tested MERL on the same MuJoCo tasks by choosing DDPG as the off-policy baseline. We experimented with several open-source implementations, including the one from OpenAI. However, as others have also reported in the open-source repository, it is difficult to tune DDPG to reproduce results from other works, even when using their reported hyper-parameter settings and various network architectures. Therefore, in Fig. 4, we report experiments for the tasks that have successfully been learned by the DDPG baseline, and we test MERL on those tasks. As with PPO+MERL, the learning curves indicate that the loss modified by MERL is able to better train agents in situations where the learning is off-policy.
Transfer Learning: Atari Domain
Because of training time constraints, we consider a transfer learning setting where after the first training steps, the agent switches to a new task and is trained for another steps. The agent is not aware of the task switch. In total, we tested MERL on 20 pairs of tasks. Atari 2600 has been a challenging testbed for many years due to its high-dimensional video input (size 210 x 160) and the discrepancy of tasks between games. To investigate the advantages of using MERL for transfer learning, we choose a set of 6 different Atari games with an action space of size 9, which is the average size of the action space in the Atari domain. This choice has two benefits: first, the neural network shared between the policy, the value function and the MERL heads does not need to be further modified when performing transfer learning, and second, the 6 games provide a diverse range of game-play while sticking to the same size of action space.
The results from Fig. 7 demonstrate that our method can reasonably adapt to a different task when compared to the same method without MERL heads. The complete set of graphs is in the supplementary materials in Fig. 18. Interestingly, the very few cases where our method does not give the best results are those where the orange curve (no transfer) performs best. This means that for those tasks learning from scratch seems better suited. For all the other task pairs, MERL performs better. We interpret this result with the intuition that MERL heads learn and help represent information that is more generally relevant for other tasks, such as self-performance assessment or accurate expectations. In addition to adding a regularization to the objective function through problem knowledge signals, those auxiliary quantities make the neural network optimize for task-agnostic objectives.
We conduct an ablation study to evaluate the separate and combined contributions of the two heads. Fig. 9 shows the comparative results in HalfCheetah, Walker2d and Swimmer. Interestingly, with HalfCheetah, using only the head degrades the performance, but when combining it with the head, it outperforms PPO+FS. Results of the complete ablation analysis demonstrate that each head is potentially valuable for enhancing learning and that their combination can produce remarkable results.
From the experiments, we see that MERL successfully optimizes the policy according to complementary quantities seeking good performance and safe realization of tasks, i.e. it does not only maximize a reward but also ensures the control problem is appropriately addressed. Moreover, we show that MERL is directly applicable to policy gradient methods while adding a negligible computation cost. Indeed, for MuJoCo and Atari tasks, the computational cost overhead is respectively 5% and 7% with our training infrastructure. All of these factors result in an algorithm that robustly solves high-dimensional control problems in a variety of areas with continuous action spaces or by using only raw pixels for observations.
Thanks to a generalized auxiliary task framework and a consistent choice of complementary quantities injected in the optimization process, MERL can better align an agent’s objectives with higher-level insights into how to solve a control problem. Besides, since in many current methods successful learning depends on the agent’s ability to reach the goal by chance in the first place, correctly predicting the MERL quantities allows the agent to learn something useful while improving in this task. At the same time, it also addresses the problem of the sparsity of rewards.
In this paper, we introduced MERL, a generally applicable deep reinforcement learning framework for problem-focused representations in contrast with many current reward-centric algorithms. The virtual agent is able to predict problem-solving quantities in a multi-head layer to better address reinforcement learning problems. Our framework improves the performance of state-of-the-art on- and off-policy algorithms for continuous control MuJoCo tasks and Atari 2600 games in transfer learning tasks.
With MERL, we inject environment-agnostic problem knowledge directly into the policy gradient optimization. The advantage of this framework is threefold. First, the agent learns a better representation for single-task learning, one that is also generalizable to other tasks. The multi-head layer provides a more problem-focused representation to the function approximations, which is therefore not only reward-centric. Moreover, continuous problem-solving signals help to address the problem of reward sparsity. Second, MERL can be seen as a hybrid model-free and model-based framework with a small and lightweight component for self-performance assessment and accurate expectations. MERL heads introduce additional regularization to the function approximation, which results in better performance and improved transfer learning. Third, MERL is directly applicable to most policy gradient algorithms and environments; it does not need to be redesigned for different problems and is not restricted to pixel-based tasks. Finally, it can be extended with many other relevant problem-solving quantities.
Although the relevance and higher performance of MERL have only been shown empirically, it would be interesting to study the theoretical contribution of this framework from the perspective of an implicit regularization of the agent’s representation of its environment. We also believe that predicting complementary quantities related to the objective of a task is a worthwhile idea to explore in supervised learning too. Finally, the identification of additional MERL quantities (e.g. prediction of immediate reward, prediction of time until the end of a trajectory) and the effect of their combination is also a research topic that we find most relevant for future work.
The authors would like to acknowledge the support of Inria, and SequeL for providing a great environment for this research. The authors also acknowledge the support from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020. This work was supported by the French Ministry of Higher Education and Research.
Appendix A Full Benchmark results
Appendix B Hyper-parameters
|Horizon ($T$)||2048 (MuJoCo), 128 (Atari)|
|Adam stepsize||(MuJoCo), (Atari)|
|Nb. epochs||10 (MuJoCo), 3 (Atari)|
|Minibatch size||64 (MuJoCo), 32 (Atari)|
|Number of actors||1 (MuJoCo), 4 (Atari)|
|GAE parameter ($\lambda$)||0.95|
|Clipping parameter ($\epsilon$)||0.2 (MuJoCo), 0.1 (Atari)|
|Value function coef||0.5|
|Learning rate actor|
|Learning rate critic|
|Critic L2 reg.||0.01|
Appendix C Implementation details
Function Approximations with Neural Networks
Unless otherwise stated, the policy network used for the MuJoCo tasks is a fully-connected multi-layer perceptron with two hidden layers of 64 units. For Atari, the network is shared between the policy and the value function and is the same as in [mnih2016asynchronous]. Each additional head is composed of a small fully-connected layer and outputs the desired quantity.
MERL+PPO and MERL+DDPG Algorithms
In Eq. (14), the targets are computed, then in Eq. (15) and Eq. (16) respectively the Q-function and the MERL heads are updated by one step of gradient descent (each MERL objective is associated with its loss coefficient). In Eq. (17), the policy is updated by one step of gradient ascent. Finally, in Eq. (19), the target networks are updated with a hyper-parameter between 0 and 1.
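The final step, updating the target networks with a hyper-parameter between 0 and 1, is commonly implemented as a soft (Polyak) average of the online and target parameters. The sketch below assumes the convention in which the hyper-parameter, here called `rho`, weights the old target parameters. The paper's exact convention is not recoverable from the text, and the function name and default value are hypothetical.

```python
import numpy as np


def soft_update(target_params, online_params, rho=0.995):
    """Return new target parameters: rho * target + (1 - rho) * online.

    Both arguments are lists of arrays with matching shapes; rho close to 1
    makes the target networks change slowly.
    """
    return [
        rho * np.asarray(t, dtype=np.float64) + (1.0 - rho) * np.asarray(p, dtype=np.float64)
        for t, p in zip(target_params, online_params)
    ]
```

With `rho=0.9`, a target parameter of 0 and an online parameter of 1 give a new target value of 0.1. Some implementations use the opposite convention, where the hyper-parameter weights the online parameters instead.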
Appendix D Environments
|Ant-v2||Make a four-legged creature walk forward as fast as possible.|
|HalfCheetah-v2||Make a 2D cheetah robot run.|
|Hopper-v2||Make a two-dimensional one-legged robot hop forward as fast as possible.|
|Humanoid-v2||Make a three-dimensional bipedal robot walk forward as fast as possible, without falling over.|
|InvertedPendulum-v2||This is a MuJoCo version of CartPole. The agent’s goal is to balance a pole on a cart.|
|InvertedDoublePendulum-v2||This is a harder version of InvertedPendulum, where the pole has another pole on top of it. The agent’s goal is to balance a pole on a pole on a cart.|
|Reacher-v2||Make a 2D robot reach to a randomly located target.|
|Swimmer-v2||Make a 2D robot swim.|
|Walker2d-v2||Make a two-dimensional bipedal robot walk forward as fast as possible.|
|AsterixNoFrameskip-v4||The agent guides Taz between the stage lines in order to eat hamburgers and avoid the dynamites.|
|BeamRiderNoFrameskip-v4||The agent’s objective is to clear the Shield’s 99 sectors of alien craft while piloting the BeamRider ship.|
|CrazyClimberNoFrameskip-v4||The agent assumes the role of a person attempting to climb to the top of four skyscrapers.|
|EnduroNoFrameskip-v4||Enduro consists of manoeuvring a race car. The objective of the race is to pass a certain number of cars each day. Doing so will allow the player to continue racing for the next day.|
|MsPacmanNoFrameskip-v4||The gameplay of Ms. Pac-Man is very similar to that of the original Pac-Man. The player earns points by eating pellets and avoiding ghosts.|
|VideoPinballNoFrameskip-v4||Video Pinball is a loose simulation of a pinball machine: ball shooter, flippers, bumpers and spinners.|