High-Dimensional Control Using
Generalized Auxiliary Tasks
Abstract
A longstanding challenge in reinforcement learning is the design of function approximations and efficient learning algorithms that provide agents with fast training, robust learning, and high performance in complex environments. To this end, the use of prior knowledge, while promising, is often costly and, in essence, challenging to scale up. In contrast, we consider problem-knowledge signals, i.e., any relevant indicator useful for solving a task, e.g., metrics of uncertainty or proactive prediction of future states. Our framework consists of predicting such complementary quantities, associated with self-performance assessment and accurate expectations. Policy and value functions are therefore no longer optimized solely for a reward but are learned using environment-agnostic quantities. We propose a generally applicable framework for structuring reinforcement learning by injecting problem knowledge into policy gradient updates. In this paper: (a) We introduce MERL, our multi-head reinforcement learning framework for generalized auxiliary tasks. (b) We conduct experiments across a variety of standard benchmark environments. Our results show that MERL improves performance for on- and off-policy methods. (c) We show that MERL also improves transfer learning on a set of challenging tasks. (d) We investigate how our approach addresses the problem of reward sparsity and pushes the function approximations into a better-constrained parameter configuration.
Introduction
Learning to control agents in simulated environments has been a source of many research efforts for decades [nguyen1990truck, werbos1989neural, schmidhuber1991learning, Robinson1989DynamicRD] and this approach is still at the forefront of recent work in deep reinforcement learning (DRL) [burda2018exploration, ha2018recurrent, silver_mastering_2016, espeholt2018impala]. But current algorithms tend to be fragile and opaque [iyer2018transparency]. Most of them require a large amount of training data, collected from an agent interacting with a simulated environment in which the reward signal is often critically sparse. Based on this, we make two observations.
First, previous work in reinforcement learning uses prior knowledge [lin1992self, clouse1992teaching, moreno2004using, ribeiro1998embedding] as a means of reducing sample inefficiency. While promising and undoubtedly necessary, integrating such priors into current methods is potentially costly to implement, may impose unwanted constraints, and can hinder scaling up. Therefore, one goal of this paper is to integrate non-limiting constraints directly into current RL methods while remaining generally applicable to many tasks. Second, if the probability of receiving a reward by chance is arbitrarily low, the time required to learn from it will be arbitrarily long [whitehead1991complexity]. This barrier to learning prevents agents from significantly reducing learning time. One way to overcome it is to learn from alternative and complementary signals coming from different sources [oudeyer2007intrinsic, schmidhuber1991curious].
Building upon existing auxiliary-task methods, we develop the MERL framework, which integrates complementary problem-oriented quantities into the learning process. In this framework, while the agent learns to maximize returns by sampling the environment, it also learns to correctly evaluate its performance on the control problem by minimizing a MERL objective function augmented with several problem-focused quantities. One intuition the reader can keep in mind is that MERL transforms a reward-focused task into a task regularized with problem-knowledge signals. In turn, agents can learn a richer representation of the environment and consequently better address the task at hand.
In the remainder of this paper, we first present the policy gradient methods used to demonstrate the performance of MERL. Then, we introduce the framework and detail how it can be applied to most DRL methods in any environment. We suggest some of the multi-head/problem-knowledge quantities mentioned above, but the reader is encouraged to introduce other relevant signals. We demonstrate that, while correctly predicting the quantities from the different MERL heads, the agent outperforms the on- and off-policy baselines that do not use the MERL framework on various continuous control tasks. We also show that our framework enables better transfer of learning from one task to others on several Atari 2600 games.
Preliminaries
We consider a Markov Decision Process (MDP) defined on a state space $\mathcal{S}$, an action space $\mathcal{A}$ and a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, where $s \in \mathcal{S}$ and $a \in \mathcal{A}$. Let $\pi_\theta$ denote a stochastic policy and let the objective function $J$ be the traditional expected discounted reward:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t)\Big] \quad (1)$$
where $\gamma \in [0, 1)$ is a discount factor [sutton1998introduction] and $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory sampled from the environment.
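As an illustration of the quantities in Eq. (1), the following minimal sketch (our own, not from the paper) computes the discounted return of a trajectory, together with the per-timestep returns used later when fitting the value function; the function names are illustrative:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted return of a full trajectory: sum_t gamma^t * r_t."""
    return float(np.sum(gamma ** np.arange(len(rewards)) * np.asarray(rewards)))

def rewards_to_go(rewards, gamma=0.99):
    """Per-timestep returns R_t = sum_{k>=t} gamma^(k-t) * r_k, computed
    backwards over the trajectory in O(T)."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out
```

For instance, with rewards `[1.0, 1.0]` and `gamma=0.5`, the trajectory return is `1.5` and the rewards-to-go are `[1.5, 1.0]`.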
Policy gradient methods aim at modeling and optimizing the policy directly [sutton2000policy]. The policy is generally modeled with a function parameterized by $\theta$; in the rest of the paper, $\pi_\theta$ and $\pi$ denote the same policy. In DRL, the policy is represented by a neural network called the policy network. Policy gradient methods can generally be cast into two groups: off-policy gradient methods, such as Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous], and on-policy methods, such as Proximal Policy Optimization (PPO) [schulman2017proximal]. These two methods are among the most commonly used, state-of-the-art policy gradient methods.
On- and Off-Policy Gradient Methods
On-Policy with Clipped Surrogate Objective
PPO-Clip [schulman2017proximal] is an on-policy gradient-based algorithm. In previous work, PPO has been tested on a set of benchmark tasks and has produced impressive results in many cases despite a relatively simple implementation. For instance, instead of imposing a hard constraint like TRPO [schulman2015trust], PPO formalizes the constraint as a penalty in the objective function. In PPO, at each iteration, the new policy $\pi_{\theta_{\text{new}}}$ is obtained from the old policy $\pi_{\theta_{\text{old}}}$:
$$\theta_{\text{new}} = \arg\max_{\theta} \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}} \big[ L(s, a, \theta_{\text{old}}, \theta) \big] \quad (2)$$
We use the clipped version of PPO whose objective function is:
$$L(s, a, \theta_{\text{old}}, \theta) = \min\Big( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a), \; g\big(\epsilon, A^{\pi_{\theta_{\text{old}}}}(s, a)\big) \Big) \quad (3)$$
where
$$g(\epsilon, A) = \begin{cases} (1+\epsilon) A & \text{if } A \geq 0 \\ (1-\epsilon) A & \text{if } A < 0 \end{cases} \quad (4)$$
By taking the minimum of the two terms in Eq. (3), the probability ratio $\pi_\theta(a|s) / \pi_{\theta_{\text{old}}}(a|s)$ is constrained to stay within a small interval around 1. The expected advantage for the new policy is estimated with the old policy and then recalibrated using the probability ratio between the new and the old policy.
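A minimal sketch of the clipped surrogate objective (our illustration, not the authors' implementation), assuming log-probabilities of the taken actions are available under both policies; `ppo_clip_objective` is an illustrative name:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective of Eq. (3), averaged over a batch."""
    ratio = np.exp(logp_new - logp_old)            # pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the min keeps the ratio's effect within [1 - eps, 1 + eps].
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

For example, with a ratio of 2 and a positive advantage of 1, the clipped term caps the objective at $1 + \epsilon = 1.2$ rather than 2.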
Off-Policy with an Actor-Critic Algorithm
DDPG [lillicrap2015continuous] is a model-free, off-policy actor-critic algorithm combining the Deterministic Policy Gradient (DPG) [silver2014deterministic] with Deep Q-Networks (DQN) [mnih2013playing]. While the original DQN works in discrete action spaces and stabilizes the learning of the Q-function with experience replay and a target network, DDPG extends it to continuous action spaces within the actor-critic framework while learning a deterministic policy.
Let $P(\cdot \mid s, a)$ denote the distribution from which the next state $s'$ is sampled. The Bellman equation describing the optimal action-value function $Q^*(s, a)$ is given by:
$$Q^*(s, a) = \mathbb{E}_{s' \sim P}\Big[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \Big] \quad (5)$$
Assuming the function approximator of $Q^*$ is a neural network $Q_\phi$ with parameters $\phi$, an essential difficulty is that computing the maximum over actions is intractable in continuous action spaces; the algorithm therefore uses a target policy network $\mu_{\theta_{\text{targ}}}$ to compute an action which approximately maximizes $Q_{\phi_{\text{targ}}}$. Given the collection of transitions $(s, a, r, s', d)$ in a set $\mathcal{D}$, where $d$ denotes whether $s'$ is terminal, we obtain the mean-squared Bellman error (MSBE) function:
$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\bigg[ \Big( Q_\phi(s, a) - \big( r + \gamma (1 - d) \, Q_{\phi_{\text{targ}}}\big(s', \mu_{\theta_{\text{targ}}}(s')\big) \big) \Big)^2 \bigg] \quad (6)$$
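The MSBE of Eq. (6) can be sketched over a sampled minibatch as follows (our illustration under the stated conventions, not the authors' code; `msbe` and its argument names are illustrative):

```python
import numpy as np

def msbe(q_sa, rewards, dones, q_targ_next, gamma=0.99):
    """Mean-squared Bellman error of Eq. (6) over a batch of transitions.

    q_targ_next plays the role of Q_targ(s', mu_targ(s')): the target
    critic evaluated at the target policy's action, which replaces the
    intractable max over continuous actions."""
    targets = rewards + gamma * (1.0 - dones) * q_targ_next
    return float(np.mean((q_sa - targets) ** 2))
```

Note that the `(1 - dones)` factor zeroes out the bootstrap term on terminal transitions, so the target reduces to the immediate reward.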
Related Work
MERL incorporates three key ideas: (a) the use of auxiliary quantities as a means to tackle the problem of reward sparsity, (b) the introduction of quantities of self-performance assessment and accurate expectations, which constitute task-agnostic signals that are more widely applicable than environment-specific prior knowledge, (c) policy and value function approximation with neural networks augmented with a multi-head layer.
Auxiliary tasks have been used to facilitate representation learning for decades [suddarth1990rule], along with intrinsic motivation [schmidhuber2010formal, pathak2017curiosity] and artificial curiosity [oudeyer2007intrinsic, schmidhuber1991curious]. [li2015recurrent] introduces a supervised loss for fitting a recurrent model on the hidden representations to predict the next observed state, in the context of imitation learning of sequences provided by experts. Other works on auxiliary tasks [jaderberg2016reinforcement, mirowski2016learning, burda2018large] allow the agents to simultaneously maximize other pseudo-reward functions. Our method is distinct from those previous approaches. First, it requires neither additional neural networks (e.g., for memory) nor a replay buffer. Second, the quantities (or auxiliary tasks) we introduce are compatible with any task, whereas previous methods only work on pixel-based environments. Third, MERL neither introduces additional iterations nor modifies the reward function of the policy gradient algorithms it is applied to. For all these reasons, MERL can be added out-of-the-box to most policy gradient algorithms with a negligible computational cost overhead.
From a different perspective, [garcia2015comprehensive] gives a detailed overview of previous work that has changed the optimality criterion as a safety factor. Most methods use a hard constraint rather than a penalty; one reason is that it is difficult to choose a single penalty coefficient that works well for different problems. MERL successfully addresses this problem. In [lipton2016combating], catastrophic actions are avoided by training an intrinsic fear model to predict whether a disaster will occur and using it to shape rewards. Compared to both methods, MERL is more scalable and lightweight while incorporating quantities of self-performance assessment (e.g., variance explained of the value function) and accurate expectations (e.g., next-state prediction), leading to improved performance.
Finally, many previous studies have focused on the use of imitation learning to address a task directly. There are two main approaches: behavioral cloning [pomerleau1989alvinn], which attempts to learn a task policy through supervised learning, and inverse RL [abbeel2004apprenticeship], which attempts to learn a reward function from a set of demonstrations. However, these successful approaches often push the agent only to learn how to perform a task from expert demonstrations, with a relatively modest understanding of its own behavior. We propose a method that gives the agent relevant quantities bringing its learning closer to learning how to accomplish the task at hand, in addition to being optimized to solve it.
MERL: Multi-Head Framework for Generalized Auxiliary Tasks
Our multi-head architecture and its associated learning algorithm are directly applicable to most state-of-the-art policy gradient methods. Let $h$ be the index of each MERL head: $h \in \{1, \dots, H\}$. In the context of DRL, we introduce two of the quantities predicted by these heads and show how to incorporate them in the policy gradient methods mentioned above.
Value Function and Policy
In DRL, the policy is generally represented by a neural network called the policy network, with parameters (weights) $\theta$, and the value function is parameterized by the value network, with parameters $\phi$. In the case of DDPG, the value network translates into the action-value network. Each MERL head takes as input the last embedding layer of the value network and consists of only one layer of fully-connected neurons, with parameters $\psi^h$. The output size of each head corresponds to the size of the predicted MERL quantity. Below, we introduce two examples of these quantities.
Variance Explained
Let us define the first quantity we use in MERL: the fraction of variance explained, $\mathcal{E}_V$. It is the fraction of variance that the value function explains about the returns; put differently, it corresponds to the proportion of the variance in the dependent variable that is predictable from the independent variables. We compute $\mathcal{E}_V$ at each policy gradient update with the samples used for the gradient computation. In statistics, this quantity is also known as the coefficient of determination $R^2$ [10.2307/2683704]. For the sake of clarity, we will not use this notation for the coefficient of determination, but will refer to this criterion as:
$$\mathcal{E}_V = 1 - \frac{\sum_t \big(\hat{R}_t - V(s_t)\big)^2}{\sum_t \big(\hat{R}_t - \bar{R}\big)^2} \quad (7)$$
where $\hat{R}_t$ and $V(s_t)$ are respectively the return and the expected return from state $s_t$, and $\bar{R}$ is the mean of all returns in the trajectory. It should be noted that this criterion may be negative for non-linear models, indicating a severe lack of fit [10.2307/2683704] of the corresponding function:

- $\mathcal{E}_V = 1$: the fitted value function perfectly explains the returns;

- $\mathcal{E}_V = 0$: corresponds to a simple average prediction;

- $\mathcal{E}_V < 0$: the fitted value function provides a worse fit to the outcomes than the mean of the discounted rewards.
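Eq. (7) and its three regimes can be sketched directly (a minimal illustration of the criterion, not the authors' code; `variance_explained` is an illustrative name):

```python
import numpy as np

def variance_explained(returns, values):
    """Fraction of variance of the returns explained by the value
    function, Eq. (7): 1 - SS_residual / SS_total."""
    returns, values = np.asarray(returns), np.asarray(values)
    residual = np.sum((returns - values) ** 2)
    total = np.sum((returns - returns.mean()) ** 2)
    return float(1.0 - residual / total)
```

A perfect fit gives 1, predicting the mean of the returns gives 0, and a fit worse than the mean yields a negative value.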
We denote the corresponding MERL head by $h^{\mathcal{E}_V}$, with parameters $\psi^{\mathcal{E}_V}$; its objective function is defined by:
$$\mathcal{L}^{\mathcal{E}_V}(\psi) = \Big( h^{\mathcal{E}_V}_{\psi}(s_t) - \mathcal{E}_V \Big)^2 \quad (8)$$

Interpretation.
$\mathcal{E}_V$ measures the ability of the value function to fit the returns. A value $\mathcal{E}_V = x$ implies that a fraction $x$ of the variability of the dependent variable has been accounted for, and the remaining fraction $1 - x$ is still unaccounted for. For instance, in [flet2019samples], $\mathcal{E}_V$ is used to filter the samples that will be used to update the policy. By definition, this quantity is a highly relevant indicator for assessing self-performance in reinforcement learning.
Future States
At each timestep $t$, one of the agent's MERL heads tries to predict a future state $s_{t+1}$ from $s_t$. While a typical MERL quantity can be fitted by regression on the mean-squared error, we observed that predictions of future states are better fitted with a cosine-distance error. We denote the corresponding head by $h^{FS}$, with parameters $\psi^{FS}$, and let $S$ be the observation space size (the size of the state vector $s$). We define its objective function as:
$$\mathcal{L}^{FS}(\psi) = 1 - \frac{\sum_{i=1}^{S} s_{t+1, i} \cdot h^{FS}_{\psi}(s_t)_i}{\big\|s_{t+1}\big\|_2 \, \big\|h^{FS}_{\psi}(s_t)\big\|_2} \quad (9)$$
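A minimal sketch of the cosine-distance error between the predicted and observed next state (our illustration; `future_state_loss` is an illustrative name):

```python
import numpy as np

def future_state_loss(predicted, actual):
    """Cosine-distance error between the predicted and the observed
    next state: 1 minus the cosine similarity of the two vectors."""
    cos = np.dot(predicted, actual) / (
        np.linalg.norm(predicted) * np.linalg.norm(actual))
    return float(1.0 - cos)
```

The loss is 0 when the two vectors point in the same direction (regardless of magnitude) and 1 when they are orthogonal, which is one reason a cosine criterion can be better behaved than a mean-squared error on unnormalized state vectors.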
Problem-Constrained Policy Update
$$\mathcal{L}^{\text{MERL}}(\theta) = \mathcal{L}(\theta) - \sum_{h=1}^{H} c_h \, \mathcal{L}^{h}(\psi^h) \quad (10)$$
$$\theta_{\text{new}} = \arg\max_{\theta} \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}} \big[ \mathcal{L}^{\text{MERL}}(\theta) \big] \quad (11)$$
Once the set of MERL heads and their associated objective functions have been defined, we modify the gradient update step of the policy gradient algorithms. The objective function incorporates all the $\mathcal{L}^h$. Of course, each MERL objective is associated with its coefficient $c_h$. Since in this paper we introduce two MERL heads, the corresponding two hyperparameters are reported along with the others in the supplementary materials. It is worth noting that we used the exact same MERL coefficients for all our experiments, which demonstrates the framework's ease of applicability.
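The resulting update can be sketched as a simple weighted combination of the base policy-gradient loss and the head losses (our illustration; `merl_loss` is an illustrative name, and the usage example reuses the two coefficient values, 0.5 and 0.01, listed in the appendix tables):

```python
def merl_loss(base_loss, head_losses, coefs):
    """Total loss minimized at each gradient step: the base
    policy-gradient loss plus each MERL head loss weighted by its
    coefficient c_h."""
    assert len(head_losses) == len(coefs)
    return base_loss + sum(c * l for c, l in zip(coefs, head_losses))

# Example: base loss 1.0, two head losses weighted by 0.5 and 0.01.
total = merl_loss(1.0, [0.5, 2.0], [0.5, 0.01])  # 1.0 + 0.25 + 0.02 = 1.27
```

Because the heads only add weighted terms to the loss, the reward function and the sampling loop of the underlying algorithm are left untouched.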
Experiments
Methodology
We evaluate MERL in multiple high-dimensional environments, ranging from MuJoCo [todorov2012mujoco] to the Atari 2600 games [bellemare2013arcade] (we describe these environments in detail in Tables 5 and 6 in the supplementary materials). The experiments in MuJoCo allow us to evaluate the performance of MERL on a large number of different continuous control problems. It is worth noting that the universal character of the auxiliary quantities we choose ensures that MERL is generally applicable to any task. Other popular auxiliary-task methods [jaderberg2016reinforcement, mirowski2016learning, burda2018large] cannot be applied to challenging continuous control tasks like MuJoCo. Thus, we naturally compare the performance of our method with state-of-the-art on- and off-policy methods (respectively, PPO [schulman2017proximal] and DDPG [lillicrap2015continuous]).
The experiments on the Atari 2600 games allow us to study the transfer learning abilities of MERL on a set of diverse tasks.

Implementation.
For the continuous control MuJoCo tasks, the agents learn with separate policy and value networks. In this case, we build upon the value network (named the action-value network in the DDPG algorithm) to incorporate our framework's heads. On the contrary, when playing Atari 2600 games from pixels, the agents are given a CNN shared between the policy and the value function. In that case, the MERL heads are naturally attached to the last embedding layer of the shared network. In both configurations, the outputs of the heads have the same size as the quantity they predict: for instance, $\mathcal{E}_V$ is a scalar whereas a predicted future state has the size of an observation.
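The head layout described above might be sketched as follows (a toy forward pass of our own, not the paper's architecture; all dimensions and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, embed_dim = 8, 32

# Shared body producing the last embedding layer of the value network.
W_body = rng.normal(size=(obs_dim, embed_dim))
# One fully-connected layer per output, all attached to that embedding:
W_value = rng.normal(size=(embed_dim, 1))        # value estimate
W_ve    = rng.normal(size=(embed_dim, 1))        # variance-explained head (scalar)
W_fs    = rng.normal(size=(embed_dim, obs_dim))  # future-state head (state-sized)

def forward(obs):
    z = np.tanh(obs @ W_body)  # shared embedding
    return z @ W_value, z @ W_ve, z @ W_fs

# Batch of 5 observations: heads output a scalar, a scalar, and a state.
value, ve_pred, next_state_pred = forward(rng.normal(size=(5, obs_dim)))
```

Each head is a single linear layer, so the extra parameter count and compute are small relative to the shared body, consistent with the negligible overhead reported in the paper.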

Hyperparameter Setting.
We used the same hyperparameters as in the main text of the respective papers, a choice made within a clear and objective protocol for demonstrating the benefits of using MERL. Hence, its reported performance is not necessarily the best that can be obtained, but it still exceeds the baselines. Using MERL adds as many hyperparameters as there are heads in the multi-head layer, and it is worth noting that the MERL hyperparameters are the same for all tasks. We report all hyperparameters in the supplementary materials.

Performance Measures.
We examine the performance across a large number of trials (with different seeds for each task). Average return and standard deviation are generally considered the most stable measures for comparing the performance of the algorithms being studied [islam2017reproducibility]. Thereby, in the rest of this work, we use those metrics to establish the performance of our framework quantitatively.
SingleTask Learning: Continuous Control
OnPolicy Learning: PPO+MERL
First, we apply MERL to PPO in several MuJoCo environments. Due to space constraints, only three graphs from varied tasks are shown in Fig. 2. The complete set of 9 tasks is reported in Table 1 and the graphs are included in the supplementary materials.
Task  PPO  Ours 

Ant  
HalfCheetah  
Hopper  
Humanoid  
InvertedDoublePendulum  
InvertedPendulum  
Reacher  
Swimmer  
Walker2d 
We see from the curves that using MERL leads to better performance on a variety of continuous control tasks. Moreover, learning appears to be faster on some tasks, suggesting that MERL takes advantage of its heads to learn relevant quantities from the beginning of training, when the reward may be sparse. Interestingly, looking at the performance across all 9 tasks, we observed better results when predicting only the next state rather than subsequent ones.
OffPolicy Learning: DDPG+MERL
Next, we tested MERL on the same MuJoCo tasks, choosing DDPG as the off-policy baseline. We experimented with several open-source implementations, including the one from OpenAI. However, as others have reported in the open-source repository, it is difficult to tune DDPG to reproduce results from other works, even when using their reported hyperparameter settings and various network architectures. Therefore, in Fig. 4, we report experiments for the tasks that were successfully learned by the DDPG baseline, and we test MERL on those tasks. Similarly to PPO+MERL, the learning curves indicate that the loss modified by MERL is able to better train agents in the off-policy setting.
Transfer Learning: Atari Domain
Because of training time constraints, we consider a transfer learning setting where, after the first training steps, the agent switches to a new task and is trained for another steps. The agent is not aware of the task switch. In total, we tested MERL on 20 pairs of tasks. Atari 2600 has been a challenging testbed for many years due to its high-dimensional video input (size 210 x 160) and the discrepancy of tasks between games. To investigate the advantages of using MERL for transfer learning, we chose a set of 6 different Atari games with an action space of 9, which is the average size of the action space in the Atari domain. This choice has two benefits: first, the neural network shared between the policy, the value function and the MERL heads does not need to be further modified when performing transfer learning; second, the 6 games provide a diverse range of gameplay while sticking to the same action space size.


The results in Fig. 7 demonstrate that our method can reasonably adapt to a different task, compared to the same method without MERL heads. The complete set of graphs is in the supplementary materials, in Fig. 18. Interestingly, the very few cases where our method does not give the best results are those where the orange curve (no transfer) performs best, meaning that for those tasks learning from scratch seems more suitable. For all the other task pairs, MERL performs better. We interpret this result with the intuition that the MERL heads learn and help represent information that is more generally relevant to other tasks, such as self-performance assessment or accurate expectations. In addition to regularizing the objective function through problem-knowledge signals, those auxiliary quantities make the neural network optimize for task-agnostic objectives.
Ablation Study
We conduct an ablation study to evaluate the separate and combined contributions of the two heads. Fig. 9 shows the comparative results in HalfCheetah, Walker2d and Swimmer. Interestingly, with HalfCheetah, using only the future-state head degrades the performance, but when it is combined with the $\mathcal{E}_V$ head, the agent outperforms PPO+FS. The complete ablation analysis demonstrates that each head is potentially valuable for enhancing learning and that their combination can produce remarkable results.
Discussion
From the experiments, we see that MERL successfully optimizes the policy according to complementary quantities seeking good performance and safe realization of tasks: it does not only maximize a reward but also ensures that the control problem is appropriately addressed. Moreover, we show that MERL is directly applicable to policy gradient methods while adding a negligible computational cost: for the MuJoCo and Atari tasks, the computational overhead is respectively 5% and 7% with our training infrastructure. All of these factors result in an algorithm that robustly solves high-dimensional control problems in a variety of areas, with continuous action spaces or using only raw pixels for observations.
Thanks to the generalized auxiliary task framework and a consistent choice of complementary quantities injected into the optimization process, MERL can better align an agent's objectives with higher-level insights into how to solve a control problem. Besides, since in many current methods successful learning depends on the agent's ability to reach the goal by chance in the first place, correctly predicting the MERL heads allows the agent to learn something useful while improving at the task. At the same time, it also addresses the problem of reward sparsity.
Conclusion
In this paper, we introduced MERL, a generally applicable deep reinforcement learning framework for problem-focused representations, in contrast with many current reward-centric algorithms. The agent predicts problem-solving quantities in a multi-head layer to better address reinforcement learning problems. Our framework improves the performance of state-of-the-art on- and off-policy algorithms on continuous control MuJoCo tasks and on Atari 2600 games in transfer learning settings.
With MERL, we inject environment-agnostic problem knowledge directly into the policy gradient optimization. The advantage of this framework is threefold. First, the agent learns a better representation for single-task learning, one that also generalizes to other tasks. The multi-head layer provides a more problem-focused representation to the function approximations, which is therefore not only reward-centric; moreover, continuous problem-solving signals help address the problem of reward sparsity. Second, MERL can be seen as a hybrid model-free and model-based framework with a small, lightweight component for self-performance assessment and accurate expectations. The MERL heads introduce additional regularization to the function approximation, resulting in better performance and improved transfer learning. Third, MERL is directly applicable to most policy gradient algorithms and environments; it does not need to be redesigned for different problems and is not restricted to pixel-based tasks. Finally, it can be extended with many other relevant problem-solving quantities.
Although the relevance and higher performance of MERL have only been shown empirically, it would be interesting to study the theoretical contribution of this framework from the perspective of an implicit regularization of the agent's representation of its environment. We also believe that predicting complementary quantities related to the objective of a task is a worthwhile idea to explore in supervised learning as well. Finally, the identification of additional MERL quantities (e.g., prediction of the immediate reward, prediction of the time until the end of a trajectory) and the effect of their combination is also a research topic that we find most relevant for future work.
Acknowledgements
The authors would like to acknowledge the support of Inria and SequeL for providing a great environment for this research. The authors also acknowledge the support from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020. This work was supported by the French Ministry of Higher Education and Research.
References
Appendix A Full Benchmark results
SingleTask Learning



Transfer Learning




Appendix B Hyperparameters
Hyperparameter  Value 

Horizon ()  2048 (MuJoCo), 128 (Atari) 
Adam stepsize  (MuJoCo), (Atari) 
Nb. epochs  10 (MuJoCo), 3 (Atari) 
Minibatch size  64 (MuJoCo), 32 (Atari) 
Number of actors  1 (MuJoCo), 4 (Atari) 
Discount ()  0.99 
GAE parameter ()  0.95 
Clipping parameter ()  0.2 (MuJoCo), 0.1 (Atari) 
Value function coef  0.5 
Hyperparameter  Value 

Learning rate actor  
Learning rate critic  
Minibatch size  64 
Training steps  50 
Discount ()  0.99 
Buffer size  100000 
Nb. cycles  10 
Critic L2 reg.  0.01 
Hyperparameter  Value 

coef  0.5 
coef  0.01 
Appendix C Implementation details
Function Approximations with Neural Networks
Unless otherwise stated, the policy network used for the MuJoCo tasks is a fully-connected multi-layer perceptron with two hidden layers of 64 units. For Atari, the network is shared between the policy and the value function and is the same as in [mnih2016asynchronous]. Each additional head is composed of a small fully-connected layer and outputs the desired quantity.
MERL+PPO and MERL+DDPG Algorithms
$$\mathcal{L}^{\text{PPO+MERL}}(\theta) = \mathbb{E}_t \big[ L(s_t, a_t, \theta_{\text{old}}, \theta) \big] - \sum_{h=1}^{H} c_h \, \mathcal{L}^{h}(\psi^h) \quad (12)$$
$$\theta \leftarrow \arg\max_{\theta} \; \mathcal{L}^{\text{PPO+MERL}}(\theta) \quad (13)$$
$$y(r, s', d) = r + \gamma (1 - d) \, Q_{\phi_{\text{targ}}}\big(s', \mu_{\theta_{\text{targ}}}(s')\big) \quad (14)$$
$$\phi \leftarrow \phi - \eta \, \nabla_\phi \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} \big( Q_\phi(s, a) - y(r, s', d) \big)^2 \quad (15)$$
$$\psi^h \leftarrow \psi^h - \eta \, \nabla_{\psi^h} \, c_h \, \mathcal{L}^{h}(\psi^h), \quad h = 1, \dots, H \quad (16)$$
$$\theta \leftarrow \theta + \eta \, \nabla_\theta J_\mu(\theta) \quad (17)$$
$$\nabla_\theta J_\mu(\theta) = \frac{1}{|B|} \sum_{s \in B} \nabla_a Q_\phi(s, a) \big|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \quad (18)$$
$$\phi_{\text{targ}} \leftarrow \rho \, \phi_{\text{targ}} + (1 - \rho) \, \phi, \qquad \theta_{\text{targ}} \leftarrow \rho \, \theta_{\text{targ}} + (1 - \rho) \, \theta \quad (19)$$
In Eq. (14), the targets are computed; then, in Eq. (15) and Eq. (16) respectively, the Q-function and the MERL heads are updated by one step of gradient descent (each MERL objective is associated with its loss coefficient $c_h$). In Eq. (17), the policy is updated by one step of gradient ascent. Finally, in Eq. (19), the target networks are updated with a hyperparameter $\rho$ between 0 and 1.
Appendix D Environments
Environment  Description 

Antv2  Make a fourlegged creature walk forward as fast as possible. 
HalfCheetahv2  Make a 2D cheetah robot run. 
Hopperv2  Make a twodimensional onelegged robot hop forward as fast as possible. 
Humanoidv2  Make a threedimensional bipedal robot walk forward as fast as possible, without falling over. 
InvertedPendulumv2  This is a MuJoCo version of CartPole. The agent’s goal is to balance a pole on a cart. 
InvertedDoublePendulumv2  This is a harder version of InvertedPendulum, where the pole has another pole on top of it. The agent’s goal is to balance a pole on a pole on a cart. 
Reacherv2  Make a 2D robot reach to a randomly located target. 
Swimmerv2  Make a 2D robot swim. 
Walker2dv2  Make a twodimensional bipedal robot walk forward as fast as possible. 
Environment  Screenshot  Description 

AsterixNoFrameskipv4  The agent guides Taz between the stage lines in order to eat hamburgers and avoid the dynamites.  
BeamRiderNoFrameskipv4  The agent’s objective is to clear the Shield’s 99 sectors of alien craft while piloting the BeamRider ship.  
CrazyClimberNoFrameskipv4  The agent assumes the role of a person attempting to climb to the top of four skyscrapers.  
EnduroNoFrameskipv4  Enduro consists of manoeuvring a race car. The objective of the race is to pass a certain number of cars each day. Doing so will allow the player to continue racing for the next day.  
MsPacmanNoFrameskipv4  The gameplay of Ms. PacMan is very similar to that of the original PacMan. The player earns points by eating pellets and avoiding ghosts.  
VideoPinballNoFrameskipv4  Video Pinball is a loosesimulation of a pinball machine: ball shooter, flippers, bumpers and spinners. 