Regularization Matters in Policy Optimization
Deep reinforcement learning (deep RL) has been receiving increasing attention thanks to its encouraging performance on a variety of control tasks. Yet, conventional regularization techniques for training neural networks (e.g., $L_2$ regularization, dropout) have been largely ignored in RL methods, possibly because agents are typically trained and evaluated in the same environment, and because the deep RL community focuses more on high-level algorithm design. In this work, we present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks. Interestingly, we find that conventional regularization techniques on the policy networks can often bring large improvement, especially on harder tasks. We also compare with the widely used entropy regularization and find that $L_2$ regularization is generally better. Our findings are further shown to be robust against variations in training hyperparameters. We also study regularizing different components and find that regularizing only the policy network is typically the best. We hope our study provides guidance for future practices in regularizing policy optimization algorithms.
The use of regularization to prevent overfitting is a key technique in successfully training neural networks. Perhaps the most widely recognized regularization methods in deep learning are $L_2$ regularization (also known as weight decay) and dropout (Srivastava et al., 2014). These techniques are standard practice in supervised learning across many domains. Major tasks in computer vision, e.g., image classification (Krizhevsky et al., 2012; He et al., 2016) and object detection (Ren et al., 2015; Redmon et al., 2016), use $L_2$ regularization as a default option. In natural language processing, for example, the Transformer (Vaswani et al., 2017) uses dropout, and the popular BERT model (Devlin et al., 2018) uses $L_2$ regularization. In fact, it is rare to see state-of-the-art neural models trained without any regularization in a supervised setting.
However, in deep reinforcement learning (deep RL), these conventional regularization methods are largely absent or underutilized in past research, possibly because in most cases we are maximizing the return on the same task as in training. In other words, there is no generalization gap from the training environment to the test environment (Cobbe et al., 2018). Thus far, researchers in deep RL have focused on high-level algorithm design and largely overlooked issues related to network training, including regularization. For popular policy optimization algorithms such as Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018), conventional regularization methods were not considered. Even in popular codebases such as OpenAI Baselines (Dhariwal et al., 2017), $L_2$ regularization and dropout were not incorporated.
Instead, the most commonly used regularization in the RL community is an entropy regularization term that penalizes high-certainty outputs from the policy network, encouraging more exploration during training and preventing the agent from overfitting to certain actions. Entropy regularization was first introduced by Williams and Peng (1991) and is now used by many contemporary algorithms (Mnih et al., 2016; Schulman et al., 2017; Teh et al., 2017; Farebrother et al., 2018).
In this work, we take an empirical approach to assess the conventional paradigm that omits common regularization when learning deep RL models. We study agent performance on the current task (the environment the agent is trained on), rather than its generalization ability to different environments, which many recent works focus on (Zhang et al., 2018a; Zhao et al., 2019; Farebrother et al., 2018; Cobbe et al., 2018). We specifically focus on policy optimization methods, which are becoming increasingly popular and have achieved top performance on various tasks. We evaluate four popular policy optimization algorithms, namely SAC, PPO, TRPO, and the synchronous version of Advantage Actor-Critic (A2C), on multiple continuous control tasks. Various conventional regularization techniques are considered, including $L_1$/$L_2$ weight regularization, dropout, weight clipping (Arjovsky et al., 2017), and Batch Normalization (BN) (Ioffe and Szegedy, 2015). We compare the performance of these regularization techniques against no regularization, as well as against entropy regularization.
Surprisingly, even though the training and testing environments are the same, we find that many of the conventional regularization techniques, when imposed on the policy networks, can still improve performance, sometimes significantly. Among those regularizers, $L_2$ regularization, perhaps the simplest one, tends to be the most effective overall and generally outperforms entropy regularization. $L_1$ regularization and weight clipping can boost performance in many cases. Dropout and Batch Normalization tend to bring improvement only in off-policy algorithms. Additionally, all regularization methods tend to be more effective on more difficult tasks. We also verify our findings with a wide range of training hyperparameters and network sizes, and the results suggest that imposing proper regularization can sometimes save the effort of tuning other training hyperparameters. We further study which part of the policy optimization system should be regularized, and conclude that regularizing only the policy network generally suffices, as imposing regularization on value networks usually does not help. Finally, we discuss and analyze possible reasons for some experimental observations. Our main contributions can be summarized as follows:
We provide, to our best knowledge, the first comprehensive study of common regularization methods in policy optimization, which have been largely ignored in the deep RL literature.
We find conventional regularizers can be effective on continuous control tasks (especially on harder ones), with statistical significance. Remarkably, simple regularizers ($L_2$, $L_1$, weight clipping) can perform better than the more widely used entropy regularization, with $L_2$ generally the best. BN and dropout can only help in off-policy algorithms.
We experiment with multiple randomly sampled training hyperparameters for each algorithm and confirm our findings still hold.
We study which part of the network(s) should be regularized. The key lesson is to regularize the policy network but not the value network.
2 Related Work
Regularization in Deep RL. Conventional regularization methods have rarely been applied in deep RL. One rare case of such use is in Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), where Batch Normalization is applied to all layers of the actor and some layers of the critic, and $L_2$ regularization is applied to the critic.
Some recent studies have developed more complicated regularization approaches to continuous control tasks. Cheng et al. (2019) regularizes the stochastic action distribution using a control prior. The regularization weight at a given state is adjusted based on the temporal difference (TD) error. Galashov et al. (2019) introduces a default policy that receives limited information as a regularizer, which accelerates convergence and improves performance. Parisi et al. (2019) uses TD error regularization to penalize inaccurate value estimation and Generalized Advantage Estimation (GAE) (Schulman et al., 2016) regularization to penalize GAE variance. However, most of these regularizations are rather complicated (Galashov et al., 2019), catered to certain algorithms (Parisi et al., 2019), or need prior information (Cheng et al., 2019). Also, these techniques consider regularizing the output of the network, while conventional regularization methods mostly directly regularize the parameters. In this work, we focus on studying these simpler but under-utilized regularization methods.
Generalization in Deep RL typically refers to how the model performs in an environment different from the one it is trained on. The generalization gap can come from different modes/levels/difficulties of a game (Farebrother et al., 2018), simulation vs. real world (Tobin et al., 2017), parameter variations (Pattanaik et al., 2018), or different random seeds in environment generation (Zhang et al., 2018b). There are a number of methods designed to address this issue, e.g., training the agent over multiple domains/tasks (Tobin et al., 2017; Rajeswaran et al., 2017), adversarial training (Tobin et al., 2017), designing model architectures (Srouji et al., 2018), and adaptive training (Duan et al., 2016). Meta-RL approaches (Finn et al., 2017; Gupta et al., 2018; Al-Shedivat et al., 2017) try to learn generalizable agents by training on many environments drawn from the same family/distribution. There are also comprehensive studies of RL generalization with interesting findings (Zhang et al., 2018a, b; Zhao et al., 2019; Packer et al., 2018), e.g., algorithms that perform better in the training environment could perform worse under domain shift (Zhao et al., 2019).
Recently, several studies have investigated conventional regularization's effect on generalization. Farebrother et al. (2018) shows that in Deep Q-Networks (DQN), $L_2$ regularization and dropout can sometimes bring benefit when evaluated on the same Atari game with mode variations. Cobbe et al. (2018) shows that $L_2$ regularization, dropout, data augmentation, and Batch Normalization can improve generalization performance, but to a lesser extent than entropy regularization and $\epsilon$-greedy exploration. Different from those studies, we focus on regularization's effect in the same environment, a more direct goal than generalization, yet one for which conventional regularization is under-explored.
3 Regularization Methods
There are in general two common approaches to imposing regularization: one discourages complex models (e.g., weight regularization, weight clipping), and the other injects noise into network activations (e.g., dropout and Batch Normalization). Here we briefly introduce the methods we investigate in our experiments.
$L_1$/$L_2$ Weight Regularization. Large weights are usually believed to be a sign of overfitting to the training data, since the function they represent tends to be complex. One can encourage small weights by adding a loss term penalizing the norm of the weight vector. Suppose $L(\theta)$ is the original empirical loss we want to minimize. SGD updates the model on a mini-batch of training samples: $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$, where $\alpha$ is the learning rate. When applying $L_2$ regularization, we add an additional $L_2$-norm squared loss term $\frac{\lambda}{2}\|\theta\|_2^2$ to the training objective, so the SGD step becomes $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta) - \alpha\lambda\theta$. Similarly, in the case of $L_1$ weight regularization, the additional loss term is $\lambda\|\theta\|_1$, and the SGD step becomes $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta) - \alpha\lambda\,\mathrm{sign}(\theta)$.
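As a concrete illustration (a minimal numpy sketch, not the training code used in our experiments), the two regularized SGD steps above can be written as:

```python
import numpy as np

def sgd_step(theta, grad, lr, l2=0.0, l1=0.0):
    """One SGD step with optional L1/L2 weight regularization.

    The penalty terms (lam/2)*||theta||_2^2 and lam*||theta||_1 contribute
    gradients lam*theta and lam*sign(theta), respectively, which are simply
    added to the empirical loss gradient.
    """
    return theta - lr * (grad + l2 * theta + l1 * np.sign(theta))

# With a zero loss gradient, the L2 penalty shrinks each weight
# multiplicatively by (1 - lr * l2) per step.
w = np.array([1.0, -2.0])
w = sgd_step(w, np.zeros(2), lr=0.1, l2=0.5)  # -> [0.95, -1.9]
```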
Weight Clipping. Weight clipping is a simple operation: after each gradient update step, each individual weight $w$ is clipped to the range $[-c, c]$, where $c$ is a hyperparameter; formally, $w \leftarrow \max(\min(w, c), -c)$. In Wasserstein GANs (Arjovsky et al., 2017), weight clipping is used to enforce a Lipschitz continuity constraint. This plays an important role in stabilizing the training of GANs (Goodfellow et al., 2014), which were notoriously hard to train and often suffered from “mode collapse” before. Weight clipping can also be seen as a regularizer, since it reduces the complexity of the model space by preventing any weight's magnitude from exceeding $c$.
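The clipping operation itself is a one-liner; a hypothetical numpy sketch applied after each gradient update:

```python
import numpy as np

def clip_weights(params, c):
    """After each gradient step, project every weight back into [-c, c]:
    w <- max(min(w, c), -c)."""
    return [np.clip(w, -c, c) for w in params]

layers = [np.array([[3.0, -0.2], [0.5, -4.0]])]
layers = clip_weights(layers, c=1.0)  # all magnitudes now bounded by 1.0
```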
Dropout. Dropout (Srivastava et al., 2014) is one of the most successful regularization techniques developed specifically for neural networks. The idea is to randomly deactivate a certain percentage of neurons during training; during testing, a rescaling is applied to ensure the scale of the activations is the same as in training. One explanation for its effectiveness in reducing overfitting is that it prevents “co-adaptation” of neurons. Another is that dropout acts as implicit model ensembling, because during training a different sub-model is sampled to fit each mini-batch of data.
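A minimal sketch of (inverted) dropout, assuming the standard formulation rather than any particular RL codebase:

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: during training, zero each activation with
    probability p and rescale the survivors by 1/(1-p), so the expected
    activation matches test time; at test time, pass x through unchanged."""
    if not training or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p).astype(x.dtype) / (1.0 - p)
    return x * mask
```

Because the rescaling happens at training time, no extra computation is needed when sampling trajectories in test mode.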
Batch Normalization. Batch Normalization (BN) (Ioffe and Szegedy, 2015) was invented to address the problem of “internal covariate shift”. It applies the transformation $y = \gamma \frac{x - \mu_B}{\sigma_B} + \beta$, where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the input activations over the current batch $B$, and $\gamma$ and $\beta$ are trainable affine parameters (scale and shift) that make it possible to linearly transform the normalized activations back to any scale. BN turns out to greatly accelerate convergence and improve accuracy, and it has become a standard component, especially in convolutional networks. BN also acts as a regularizer (Ioffe and Szegedy, 2015): since the statistics $\mu_B$ and $\sigma_B$ depend on the current batch, BN subtracts and divides by different values in each iteration. This stochasticity can encourage subsequent layers to be robust against such input variation.
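The training-mode forward pass of BN for fully connected layers can be sketched as follows (the running averages used at test time are omitted for brevity):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """y = gamma * (x - mu_B) / sigma_B + beta, with mu_B and sigma_B
    recomputed from each mini-batch (rows = samples). The batch-dependent
    statistics are the source of BN's regularizing noise."""
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return gamma * (x - mu) / (sigma + eps) + beta
```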
Entropy Regularization. In a policy optimization framework, the policy network models a conditional distribution over actions, and entropy regularization is widely used to prevent the learned policy from overfitting to one or a few actions. More specifically, at each step, the output distribution of the policy network is encouraged to have high entropy. The policy entropy is calculated at each state $s$ as $H(\pi(\cdot|s)) = -\mathbb{E}_{a\sim\pi(\cdot|s)}[\log \pi(a|s)]$. The per-sample entropy is then averaged within the batch of state-action pairs to obtain the regularization term $L_H$. A coefficient $\lambda$ is also needed, and $\lambda L_H$ is added to the policy objective to be maximized during policy updates. Entropy regularization also encourages exploration due to increased randomness in actions, leading to better performance in the long run.
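For continuous control, where the policy is typically a diagonal Gaussian, the entropy has a closed form; a sketch of the entropy bonus computation (function names are ours, not from any specific codebase):

```python
import numpy as np

def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian policy:
    H = sum_i (log_std_i + 0.5 * log(2 * pi * e))."""
    return float(np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e)))

def entropy_bonus(batch_log_std, coef):
    """Average the per-state entropy over a batch and scale by the
    regularization coefficient; this term is added to the policy
    objective to be maximized."""
    return coef * np.mean([gaussian_entropy(ls) for ls in batch_log_std])
```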
4 Experiments
Algorithms. We evaluate the six regularization methods introduced in Section 3 on four popular policy optimization algorithms, namely A2C (Mnih et al., 2016), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018). The first three are on-policy, while the last is off-policy. For the first three, we adopt the code from OpenAI Baselines (Dhariwal et al., 2017), and for SAC, we use the official implementation (Haarnoja, 2018). Our code can be accessed at https://github.com/xuanlinli17/po-rl-regularization.
Tasks. The algorithms with different regularizers are tested on nine continuous control tasks: Hopper, Walker, HalfCheetah, Ant, Humanoid, and HumanoidStandup from the MuJoCo simulation environment (Todorov et al., 2012); and Humanoid, AtlasForwardWalk, and HumanoidFlagrun from the more challenging RoboSchool suite (OpenAI). Among the MuJoCo tasks, agents for Hopper, Walker, and HalfCheetah are easier to learn, while Ant, Humanoid, and HumanoidStandup are relatively harder (larger state-action space, more training samples needed). The three RoboSchool tasks are even harder than all the MuJoCo tasks, as they require more timesteps to converge. To better understand how different regularization methods work at different difficulties, we roughly categorize the first three environments as “easy” tasks and the last six as “hard” tasks.
Training. On MuJoCo tasks, we keep all training hyperparameters unchanged from the adopted codebase. Since hyperparameters for the RoboSchool tasks are not included in the original codebase, we briefly tune the hyperparameters for each algorithm before applying any regularization (details in Appendix C). For details on regularization strength tuning, please see Appendix B. The results shown in this section are obtained by regularizing only the policy network; a further study of this choice is presented in Section 6. We run each experiment independently with five seeds, then use the average return over the final episodes as the result. Each regularization method is evaluated independently, with other regularizers turned off. We refer to the result without any regularization as the baseline. For BN and dropout, we use their training mode when updating the network and test mode when sampling trajectories.
During training, negligible computation overhead is induced when a regularizer is applied: BN and dropout add a small increase in training time, while $L_1$, $L_2$, weight clipping, and entropy regularization add almost none. We used up to 16 NVIDIA Titan Xp GPUs and 96 Intel Xeon E5-2667 CPUs, and all experiments took roughly 57 days with resources fully utilized.
Note that entropy regularization is still applicable to SAC, even though SAC already incorporates entropy maximization in its reward. In our experiments, we add the entropy regularization term to the policy loss function in equation (12) of Haarnoja et al. (2018). Meanwhile, policy network dropout is not applicable to TRPO, because during policy updates different neurons in the old and new policy networks are dropped out, causing different shifts in the old and new action distributions given the same state, which violates the trust region constraint. In this case, the algorithm fails to perform any update from network initialization.
Table 1: Percentage of environments in which each regularization method (rows) “improves” upon the baseline, under each algorithm (A2C, TRPO, PPO, SAC) and in total.
Training curves. We plot the training curves from four environments (rows) on four algorithms (columns) in Figure 1; figures for the remaining five environments are deferred to Appendix K. In the figure, different colors denote different regularization methods, e.g., black denotes the baseline, and shaded areas denote the standard deviation range. Notably, these conventional regularizers can frequently boost performance across different tasks and algorithms, demonstrating that a study of regularization in deep RL is much needed. Interestingly, in some cases where the baseline (with the default hyperparameters in the codebase) does not converge to a reasonable solution, e.g., A2C on Ant and PPO on Humanoid, imposing some regularization can make training converge to a high level. Another observation is that BN always significantly hurts the baseline for on-policy algorithms; the reason will be discussed later. For the off-policy SAC algorithm, dropout and BN sometimes bring large improvements on hard tasks such as AtlasForwardWalk and RoboschoolHumanoid.
How often do regularizations help? To quantitatively measure the effectiveness of the regularizations on each algorithm across different tasks, we define the condition under which a regularization is said to “improve” upon the baseline in a certain environment. Denote the baseline mean return over five seeds on an environment as $\mu_b$, and the mean and standard deviation of the return obtained with a certain regularization method over five seeds as $\mu_r$ and $\sigma_r$. We say the performance is “improved” by the regularization if $\mu_r - \sigma_r > \max(\mu_b, T)$, where $T$ is the minimum return threshold of the environment, which ensures the return is at least at a reasonable level. We set $T$ higher for HumanoidStandup, whose returns are on a much larger scale, than for all other tasks.
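A minimal sketch of this criterion, assuming the condition that the regularized mean minus one standard deviation (over seeds) must exceed both the baseline mean and the environment threshold:

```python
import numpy as np

def improves(baseline_returns, reg_returns, threshold):
    """True if a regularizer 'improves' on an environment: its mean final
    return minus one standard deviation (over seeds) exceeds both the
    baseline mean and the environment's minimum return threshold T."""
    mu_b = float(np.mean(baseline_returns))
    mu_r = float(np.mean(reg_returns))
    sigma_r = float(np.std(reg_returns))
    return mu_r - sigma_r > max(mu_b, threshold)
```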
The results are shown in Table 1. Perhaps the most significant observation is that $L_2$ regularization most often improves upon the baseline. The A2C algorithm is an exception, where entropy regularization is the most effective. $L_1$ regularization behaves similarly to $L_2$ regularization but is outperformed by it. Weight clipping's usefulness depends strongly on the algorithm and environment: although in total it helps only 30.6% of the time, it can sometimes outperform entropy regularization by a large margin, e.g., on TRPO Humanoid and PPO Humanoid as shown in Figure 1. BN is not useful at all in the three on-policy algorithms (A2C, TRPO, and PPO). Dropout is not useful in A2C and only sometimes helps in PPO. However, BN and dropout can be useful in SAC. All regularization methods generally improve more often when used on harder tasks, perhaps because on easier ones the baseline is often strong enough to reach high performance.
Note that under our definition, not “improving” does not mean the regularization hurts performance. If we instead define “hurting” as $\mu_r + \sigma_r < \mu_b$ (the minimum return threshold is not considered here), then the total percentage of hurting cases is 0.0% for $L_2$, 2.8% for $L_1$, 5.6% for weight clipping, 44.4% for dropout, 66.7% for BN, and 0.0% for entropy. In other words, within our tuning range, $L_2$ and entropy regularization never hurt with appropriate strengths. For BN and dropout, we also note that almost all hurting cases occur in on-policy algorithms, the one exception being a single case for BN in SAC. In sum, all regularizations in our study very rarely hurt performance, except for BN/dropout in on-policy methods.
How much do regularizations improve? For each algorithm and environment (for example, PPO on Ant), we compute a $z$-score for each regularization method and for the baseline, by treating the results produced by all regularizations (including the baseline) and all five seeds together as a population, and taking each method's average $z$-score over its five final results (positively clipped). The $z$-score, also known as the “standard score”, is the signed number of standard deviations by which a data point lies above the population mean. For each algorithm and environment, a regularizer's $z$-score roughly measures its relative performance among the others. The $z$-scores are then averaged over the environments of a certain difficulty (easy/hard), and the results are shown in Table 2. In terms of the average improvement margin, we can draw mostly the same observations as from the improvement frequency (Table 1): $L_2$ tops the average $z$-score most often, and by a large margin in total; entropy regularization is best used with A2C; dropout and BN are only useful in the off-policy SAC algorithm; and the improvement over the baseline is larger on hard tasks. Notably, for all algorithms, every regularization on average outperforms the baseline on hard tasks, except dropout and BN in on-policy algorithms. On hard tasks, besides $L_2$, $L_1$ and weight clipping also score higher than entropy in total.
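The $z$-score computation can be sketched as follows (we assume here that the positive clipping applies to each method's averaged score):

```python
import numpy as np

def method_z_scores(final_returns):
    """final_returns: dict mapping method name (baseline included) to the
    final returns of its seeds. All runs are pooled into one population;
    each method's score is the mean z-score of its runs, clipped at zero."""
    pooled = np.concatenate([np.asarray(r, float) for r in final_returns.values()])
    mu, sigma = pooled.mean(), pooled.std()
    return {m: max(0.0, float((np.asarray(r, float).mean() - mu) / sigma))
            for m, r in final_returns.items()}
```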
Is the improvement statistically significant? For each regularization method, we collect the $z$-scores produced by all seeds and all environments of a certain difficulty (e.g., for $L_2$ on PPO over hard environments, we have 6 envs $\times$ 5 seeds = 30 $z$-scores), and perform Welch's t-test (a two-sample t-test with unequal variances) against the corresponding $z$-scores produced by the baseline. The resulting p-values are presented in Table 3. Note that whether significance indicates improvement or harm depends on the relative mean $z$-score in Table 2: for BN and dropout in on-policy algorithms, statistical significance denotes harm, while in most other cases it denotes improvement. From the results, we observe that the improvement is statistically significant (p < 0.05) for hard tasks in general, with only a few exceptions. In total, $L_2$, $L_1$, entropy, and weight clipping are all statistically significantly better than the baseline. For Welch's t-tests between entropy regularization and the other regularizers, see Appendix F. For more comparison metrics (e.g., average ranking, min-max scaled reward), see Appendix G.
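Welch's t statistic and its degrees of freedom can be computed directly (a sketch; in practice a library routine such as scipy's ttest_ind with equal_var=False also yields the p-value):

```python
import numpy as np

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances) and the
    Welch-Satterthwaite approximation of the degrees of freedom."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df
```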
5 Robustness with Hyperparameter Changes
In the previous section, the experiments were conducted mostly with the default hyperparameters in the codebase we adopt, which are not necessarily optimal. For example, the PPO Humanoid baseline performs poorly with default hyperparameters, not converging to a reasonable solution. Meanwhile, RL algorithms are known to be very sensitive to hyperparameter changes (Henderson et al., 2018), so our findings could be sensitive to such variations. To further confirm them, we evaluate the regularizations under a variety of hyperparameter settings. For each algorithm, we sample five hyperparameter settings for the baseline and apply each regularization to all of them. Due to the heavy computation budget, we only evaluate on five MuJoCo environments: Hopper, Walker, Ant, Humanoid, and HumanoidStandup. Under our sampled hyperparameters, poor baselines are mostly significantly improved. See Appendices D and L for details on sampling and training curves.
Similar to Tables 2 and 3, the $z$-scores and p-values are shown in Tables 4 and 5. For improvement percentages analogous to Table 1, please refer to Appendix E. We note that our main findings from Section 4 still hold. Interestingly, compared to the previous section, $L_2$, $L_1$, and weight clipping all tend to beat entropy regularization by larger margins.
To better visualize robustness against hyperparameter changes, we show results with a single hyperparameter varied in Figure 2. We note that certain regularizations can consistently improve the baseline across different hyperparameter values. In these cases, proper regularization can ease the hyperparameter tuning process, as it raises the performance of baselines with suboptimal hyperparameters above that of baselines with better ones.
6 Policy and Value Network Regularization
Our experiments so far only impose regularization on the policy network. To investigate the relationship between policy and value network regularization, we compare four options: 1) no regularization, 2) regularizing the policy network only, 3) regularizing the value network only, and 4) regularizing both. For 2) and 3) we tune the regularization strengths independently and then use the appropriate ones for 4) (more details in Appendix B). We evaluate all four algorithms on the six MuJoCo tasks and present the improvement percentages in Table 6. Note that entropy regularization is not applicable to the value network. For detailed training curves, please refer to Appendix M.
We observe that, in general, regularizing only the policy network improves performance most often, across almost all algorithms and regularizations. Regularizing the value network alone does not bring improvement as often as the other options. Though regularizing both networks is better than regularizing the value network alone, it is worse than regularizing only the policy network.
7 Analysis and Conclusion
Why does regularization benefit policy optimization? In RL, when we train and evaluate on the same environment, there is no generalization gap across different environments. However, there is still generalization between samples: the agent is only trained on the limited trajectories it has experienced, which cannot cover the whole state-action space of the environment. A successful policy needs to generalize from seen samples to unseen ones, which potentially makes regularization necessary. This might also explain why regularization can be more helpful on harder tasks, which have larger state spaces: the portion of the space encountered during training tends to be smaller, overfitting to that smaller portion causes more serious issues, and so regularization may help more.
To support the argument above, we take agents trained with and without regularization, evaluate the reward on 100 different trajectories, and plot the reward distributions over trajectories in Figure 3. These trajectories represent unseen samples during training, since the state space is continuous and it is impossible to traverse the same trajectories as in training. For the baseline, some trajectories yield relatively high rewards while others yield low rewards, demonstrating that the baseline cannot stably generalize to unseen samples; for regularized models, the rewards are more concentrated at a high level, demonstrating that they generalize to unseen samples more stably. This suggests that conventional regularization can improve the model's ability to generalize to a larger portion of unseen samples.
We also compare the reward under varying numbers of training samples/timesteps, since performance when learning from fewer samples is closely related to generalization ability. From the results in Figure 4, we find that regularized models need far fewer training samples to reach the same level of reward as the baseline. This suggests regularized models have better generalization ability than the baseline.
Why do BN and dropout work only with off-policy algorithms? One finding in our experiments is that BN and dropout can sometimes improve the off-policy algorithm SAC but mostly hurt on-policy algorithms. We hypothesize two possible reasons: 1) For both BN and dropout, training mode is used to update the network, while testing mode is used to sample actions during interaction with the environment, leading to a discrepancy between the sampling policy and the optimization policy (the same discrepancy arises if we always use training mode). For on-policy algorithms, if such discrepancy is large, it can cause severe “off-policy issues” that hurt the optimization process or even crash it, since their theory requires the data to be “on-policy”, i.e., the data-sampling and optimization policies to be the same. For off-policy algorithms, this discrepancy is not an issue, since they sample data from a replay buffer and do not require the two policies to be the same. 2) BN can be sensitive to input distribution shifts: since the mean and standard deviation statistics depend on the input, if the input distribution changes too quickly during training, the mapping functions of BN layers can also change quickly, possibly destabilizing training. One piece of evidence is that in supervised learning, when transferring an ImageNet-pretrained model to other vision datasets, the BN layers are sometimes fixed (Yang et al., 2017) and only the other layers are trained. In off-policy algorithms, the sample distribution changes relatively slowly, since we always draw from the whole replay buffer, which holds cumulative data; in on-policy algorithms, we always use samples generated from the latest policy, and this faster-changing input distribution could be harmful to BN. Previously, BN has also been shown to be useful in DDPG (Lillicrap et al., 2015), an off-policy algorithm.
In summary, we conducted the first comprehensive study of regularization methods with multiple policy optimization algorithms. We found that conventional regularizers ($L_2$, $L_1$, weight clipping) can be effective at improving performance, sometimes more so than the widely used entropy regularization. BN and dropout can be useful, but only in off-policy algorithms. Our findings were confirmed with multiple sampled hyperparameters. Further experiments showed that, in general, the best practice is to regularize the policy network but not the value network, nor both.
Appendix A Policy Optimization Algorithms
The policy optimization family of algorithms is one of the most popular methods for solving reinforcement learning problems. It directly parameterizes and optimizes the policy to gain more cumulative rewards. Below, we give a brief introduction to the algorithms we evaluate in our work.
Sutton et al. (2000) developed the policy gradient method, which updates the parametric policy via gradient ascent on the expected return. However, the gradient estimated this way suffers from high variance. Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) alleviates this problem by introducing a function approximator for values and replacing Q-values with advantage values. A3C also utilizes multiple actors to parallelize training. The only difference between A2C and A3C is that in a single training iteration, A2C waits for the parallel actors to finish sampling their trajectories before updating the neural network parameters, while A3C updates asynchronously.
Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) constrains each update within a safe region defined by KL divergence to guarantee policy improvement during training. Though TRPO obtains reliable performance, approximating the KL constraint is computationally expensive.
Proximal Policy Optimization (PPO) (Schulman et al., 2017) simplifies TRPO and improves computational efficiency by developing a surrogate objective that involves clipping the probability ratio to a reliable region, so that the objective can be optimized using first-order methods.
Soft Actor Critic (SAC) (Haarnoja et al., 2018) optimizes a maximum entropy objective incorporated into the reward (Ziebart et al., 2008), which is different from the objective of the on-policy methods above. SAC combines soft policy iteration, which maximizes the maximum entropy objective, with clipped double Q-learning (Fujimoto et al., 2018), which prevents overestimation bias, during the actor and critic updates, respectively.
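The clipped double-Q part of the critic update can be sketched as follows (a simplified target computation; the alpha and gamma values are illustrative, not the paper's settings):

```python
def sac_target(reward, done, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """Soft Bellman target with clipped double Q-learning:
    y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s')).
    Taking the min of the two Q estimates counteracts overestimation bias."""
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * (1.0 - done) * soft_value
```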
Appendix B Implementation and Tuning for Regularization Methods
For L1 and L2 regularization, we add $\lambda\|\theta\|_1$ and $\lambda\|\theta\|_2^2$, respectively, to the loss of the policy network or value network of each algorithm (for SAC's value regularization, we apply regularization only to the V network rather than also to the two Q networks). The L1 and L2 losses are applied to all the weights of the policy or value network. For A2C, TRPO, and PPO, we tune the strength $\lambda$ over one grid for L1 and another for L2; for SAC, whose losses have larger magnitude, we tune $\lambda$ over correspondingly larger grids.
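The two penalties can be sketched in pure Python as follows (a minimal illustration; in practice they are computed over the network's weight tensors, and the strength shown is an arbitrary placeholder):

```python
def l1_penalty(weights, lam):
    """lam * sum |w|, added to the policy (or value) loss."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lam * sum w^2, added to the policy (or value) loss."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]          # stand-in for all network weights
policy_loss = 1.0                   # hypothetical unregularized loss
regularized_loss = policy_loss + l2_penalty(weights, lam=1e-4)
```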
For weight clipping, the OpenAI Baselines implementation of the policy network for A2C, TRPO, and PPO outputs the mean of the policy action from a two-layer fully connected network (MLP), while the log standard deviation of the policy action is represented by a standalone trainable vector. We find that weight clipping performs much better when applied only to the weights of the MLP than when applied to only the logstd vector or to both. Thus, for these three algorithms, the policy network weight clipping results shown in the sections above come from clipping only the MLP part of the policy network. In the SAC implementation, by contrast, both the mean and the log standard deviation come from the same MLP, and there is no standalone log standard deviation vector, so we apply weight clipping to all the weights of the MLP. For all algorithms, we tune the policy network clipping range over a small grid. For the value network, the MLP produces a single output, the estimated value of a state, so we clip all the weights of the MLP, tuning the clipping range over a grid shared by A2C, TRPO, and PPO. For SAC, we clip only the V network (and not the two Q networks, for simplicity) and tune over a grid of larger clipping ranges, since its weights have larger magnitude.
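Weight clipping itself is a simple projection applied after each gradient step (a sketch; the clipping range c below is a placeholder, not a tuned value from the paper):

```python
def clip_weights(weights, c):
    """Project every weight into [-c, c] after a gradient update."""
    return [max(-c, min(c, w)) for w in weights]

clipped = clip_weights([0.05, -0.3, 1.7, -0.9], c=0.5)
```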
For BatchNorm, we apply it before the activation function of each hidden layer; for dropout, we apply it immediately after the activation function. When the policy or the value network is being updated using minibatches of trajectory data or replay buffer data, we use the train mode of the regularizer (for BatchNorm, updating the running mean and standard deviation). When the policy is sampling trajectories from the environment, we use the test mode, normalizing data with the existing running mean and standard deviation. For BatchNorm/dropout on the value network, only train mode is used, since the value network does not participate in sampling trajectories. Note that adding policy network dropout to TRPO causes the KL divergence constraint to be violated almost every policy update; thus, policy network dropout causes training to fail on TRPO, as the policy network cannot be updated.
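The train/test-mode distinction for dropout can be sketched as follows (standard inverted dropout, an illustration rather than the Baselines implementation):

```python
import random

def dropout(x, p, training, rng=random):
    """Inverted dropout: during training, zero each activation with
    probability p and scale survivors by 1/(1-p); at test time (e.g.,
    while sampling trajectories), act as the identity."""
    if not training:
        return list(x)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]
```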
For entropy regularization, we add $-\lambda H(\pi_\theta(\cdot|s))$ to the policy loss, where $H$ denotes the entropy of the policy's action distribution. The strength $\lambda$ is tuned over one grid for A2C, TRPO, and PPO and another for SAC. Note that for SAC, our entropy regularization is added directly to the optimization objective (equation 12 in Haarnoja et al. (2018)) and is different from the original maximum entropy objective inside the reward term.
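For a discrete-action policy this looks like the following (a toy sketch; continuous Gaussian policies would use the closed-form Gaussian entropy instead):

```python
import math

def entropy_bonus_loss(policy_loss, action_probs, lam):
    """Add -lam * H(pi) to the policy loss, where H = -sum_a p(a) log p(a).
    Maximizing entropy discourages premature collapse to a deterministic policy."""
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    return policy_loss - lam * entropy
```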
Note that for the three on-policy algorithms (A2C, TRPO, PPO) we use the same tuning ranges; the only exception is the off-policy SAC. The reason SAC's tuning ranges differ is that SAC uses a hyperparameter that controls the scaling of the reward signal, while A2C, TRPO, and PPO do not. In the original implementation of SAC, the reward signals are pre-tuned to be scaled up by an environment-specific factor. Also, unlike A2C, TRPO, and PPO, SAC uses unnormalized rewards, because, according to the original paper, the policy becomes almost uniform if the reward magnitude is small. For these reasons, the reward magnitude of SAC is much higher than that used by A2C, TRPO, and PPO; thus, the policy network loss and the value network loss have larger magnitude, and the appropriate regularization strengths become higher. Considering SAC's much larger reward magnitude, we selected a different range of hyperparameters for SAC before running the full set of experiments.
The optimal policy network regularization strength we selected for each algorithm and environment used in Section 4 can be seen in the legends of Appendix M. In addition to the results with environment-specific strengths presented in Section 4, we also present the results when the regularization strength is fixed across all environments for the same algorithm. The results are shown in Appendix H.
In Section 6, to investigate the effect of regularizing both policy and value networks, we combine the tuned optimal policy and value network regularization strengths. The detailed training curves are presented in Appendix M.
Appendix C Default Hyperparameter Settings for Baselines
Training timesteps. For A2C, TRPO, and PPO, we use a fixed training budget per environment, with separate budgets for Hopper, Walker, and HalfCheetah; for Ant, Humanoid (MuJoCo), and HumanoidStandup; for Humanoid (RoboSchool); and for AtlasForwardWalk and HumanoidFlagrun. For SAC, since its simulation speed is much slower than that of A2C, TRPO, and PPO (SAC updates its policy and value networks on a minibatch of replay buffer data at every timestep), and since it takes fewer timesteps to converge, we use shorter budgets: one for Hopper and Walker; one for HalfCheetah and Ant; one for Humanoid and HumanoidStandup; and one for the RoboSchool environments.
Hyperparameters for RoboSchool. The original PPO paper (Schulman et al., 2017) gives hyperparameters for the RoboSchool tasks, so we apply the same hyperparameters to our training, except that instead of linearly annealing the log standard deviation of the action distribution, we let it be learned by the algorithm, as implemented in OpenAI Baselines (Dhariwal et al., 2017). For TRPO, due to its proximity to PPO, we copy PPO's hyperparameters where they exist in both algorithms, and then tune the value update step size. For A2C, we keep the original hyperparameters and tune the number of actors and the number of timesteps each actor samples between consecutive policy updates. For SAC, we tune the reward scale.
Default hyperparameter settings (table): hidden layer size; number of hidden layers; samples per batch; replay buffer size; discount factor (γ); target smoothing coefficient (τ); target update interval.
Appendix D Hyperparameter Sampling Details
In Section 5, we present results based on five hyperparameter settings. To obtain such hyperparameter variations, we consider varying the learning rates and the hyperparameters that each algorithm is very sensitive to. For A2C, TRPO, and PPO, we consider a range of rollout timesteps between consecutive policy updates by varying the number of actors or the number of trajectory sampling timesteps for each actor. For SAC, we consider a range of reward scale and a range of target smoothing coefficient.
More concretely, for A2C, we sample the learning rate (with linear decay), the number of trajectory sampling timesteps per actor (nsteps), and the number of actors (nenvs). For TRPO, we sample the learning rate of the value network (vf_stepsize) and nsteps; the policy update uses conjugate gradient descent and is controlled by the maximum KL divergence. For PPO, we sample the learning rate, the number of actors (nenvs), and the probability ratio clipping range (cliprange). For SAC, we sample the learning rate, the target smoothing coefficient (τ), and the reward scale from small, default, and large modes, where the environment-specific default reward scale is scaled down or up accordingly. The sampled hyperparameter settings 1-5 for each algorithm are listed in Tables 11-14.
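The sampling procedure amounts to drawing each hyperparameter independently from a small grid, e.g. for PPO (the grid values below are hypothetical placeholders, not the paper's exact grids):

```python
import random

def sample_ppo_config(rng):
    """Draw one PPO hyperparameter setting from illustrative grids."""
    return {
        "lr": rng.choice([1e-4, 3e-4, 1e-3]),
        "nenvs": rng.choice([2, 4, 8]),
        "cliprange": rng.choice([0.1, 0.2, 0.3]),
    }

rng = random.Random(0)
settings = [sample_ppo_config(rng) for _ in range(5)]  # five settings, as in Section 5
```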
Appendix E Hyperparameter Experiment Improvement Percentage Result
|Reg \ Alg|A2C|TRPO|PPO|SAC|TOTAL|
Appendix F Statistical Significance Test Comparing Entropy Regularization with other Regularizations
As a complement to Table 3 in Section 4 and Table 5 in Section 5, we present the p-values from Welch's t-test comparing the z-scores of entropy regularization with the other regularizers in Table 16 and Table 17. Note that whether significance indicates improvement or harm relative to entropy regularization depends on the relative mean z-scores in Table 2 (default hyperparameter setting) and Table 4 (sampled hyperparameter setting). We observe that, in total, L2 regularization yields a significant improvement over entropy regularization in both the default and sampled hyperparameter settings, and L1 regularization and weight clipping are significantly better than entropy regularization under the sampled hyperparameter setting. In general, the improvement over entropy regularization is statistically more significant on harder tasks.
Appendix G Additional Metrics for Evaluation of Performance
G.1 Ranking All Regularizers
|Reg \ Alg|A2C|TRPO|PPO|SAC|TOTAL|
|Reg \ Alg|A2C|TRPO|PPO|SAC|TOTAL|
We compute the “average ranking” metric to compare the relative effectiveness of different regularization methods. Note that the average ranking of different methods across a set of tasks/datasets has been adopted as a metric before, as in Ranftl et al. (2019) and Knapitsch et al. (2017). Here, we rank the performance of all the regularization methods, together with the baseline, for each algorithm and task, and present the average ranks in Table 18 and Table 19. The ranks of returns among different regularizers are collected for each environment (after averaging over 5 random seeds), and then the mean rank over all environments is calculated. From Table 18 and Table 19, we observe that, except for BN and dropout in on-policy algorithms, all regularizers on average outperform the baseline. Again, L2 regularization is the strongest in most cases. Other observations are similar to those from previous tables. For every algorithm, the baseline ranks lower on harder tasks than on easier ones; in total, it ranks 3.50 on easier tasks and 5.25 on harder ones. This indicates that regularization is more effective when the tasks are harder.
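The metric can be computed as follows (a sketch of the ranking computation described above, on hypothetical scores):

```python
def average_ranks(scores_by_env):
    """scores_by_env: one {method: mean_return} dict per environment
    (each mean already averaged over random seeds). Returns each
    method's rank (1 = best) averaged over environments."""
    totals = {}
    for scores in scores_by_env:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    n = len(scores_by_env)
    return {method: total / n for method, total in totals.items()}
```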
G.2 Comparison and Significance Testing with Scaled Rewards
Min-max scaling is a linear mapping of values from a range [min, max] to [0, 1], via x' = (x − min)/(max − min). For each environment and policy optimization algorithm (for example, PPO on Ant), we calculate a “scaled reward” for each regularization method and the baseline, taking the maximum mean return obtained by any method (including the baseline) as the maximum and 0 as the minimum, on positively clipped returns. We then average the scaled rewards over environments of a given difficulty (easy/hard). We present the results under the default hyperparameter setting in Tables 20-22 and the results under sampled hyperparameter settings in Tables 23-25. To analyze whether regularization significantly improves over the baseline and whether conventional regularizers significantly improve over entropy regularization, we perform Welch's t-test on the scaled rewards, using the same approach as for the z-scores. Our observations are similar to those in Section 4 and Section 5.
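The scaled-reward computation can be sketched as follows (assuming, as described above, the per-environment best mean return as the maximum and zero as the minimum):

```python
def scaled_reward(mean_return, best_return):
    """Min-max scale a positively clipped mean return into [0, 1],
    using the best mean return across all methods (including the
    baseline) as the maximum and 0 as the minimum."""
    clipped = max(mean_return, 0.0)
    return clipped / best_return if best_return > 0 else 0.0
```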
|Reg \ Alg|A2C|TRPO|PPO|SAC|TOTAL|
|Reg \ Alg|A2C|TRPO|PPO|SAC|TOTAL|
Appendix H Regularization with a Single Strength
In previous sections, we tuned the strength of regularization for each algorithm and environment, as described in Appendix B. Here we restrict each regularization method to a single strength per algorithm, across different environments. The results are shown in Tables 26 and 27, and the selected strengths are presented in Table 28. We see that L2 regularization is still generally the best performing one, with SAC as an exception, where BN is better. This can be explained by the fact that in SAC, the reward scaling coefficient differs per environment, which potentially causes the optimal L1 and L2 strengths to vary a lot across environments, while BN has no strength parameter.
Appendix I Regularizing with both L2 and Entropy
We also investigate the effect of combining L2 regularization with entropy regularization, given that applying either alone yields a performance improvement. We apply the optimal strengths of L2 regularization and entropy regularization together and compare with applying either alone. From Figure 5, we find that performance increases for PPO on HumanoidStandup, stays approximately the same for TRPO on Ant, and decreases for A2C on HumanoidStandup. Thus, the benefits of the two regularizers are not always additive. This is possibly because the algorithms already achieve good performance with L2 regularization or entropy regularization alone, and further improvement is limited by the intrinsic capabilities of the algorithms.
Appendix J Comparing L2 Regularization with Fixed Weight Decay (AdamW)
For the Adam optimizer (Kingma and Ba, 2015), “fixed weight decay” (AdamW in Loshchilov and Hutter (2019)) differs from L2 regularization in that the gradient of the L2 penalty is not added to the gradient of the original loss; instead, the weights are decayed directly at the end of each gradient update. For Adam these two procedures are very different (see Loshchilov and Hutter (2019) for more details). In this section, we compare the effect of adding L2 regularization with that of using AdamW, with PPO on Humanoid and HumanoidStandup. The result is shown in Figure 6. As with L2 regularization, we briefly tune the weight decay strength in AdamW and use the optimal value. We find that while both L2 regularization and AdamW can significantly improve performance over the baseline, the performance of AdamW tends to be slightly lower than that of L2 regularization.
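The difference between the two procedures can be seen in a one-parameter sketch (an illustrative single Adam step; the lr and lam values are placeholders, not tuned settings):

```python
import math

def adam_step(w, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """One Adam update for a single scalar parameter."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

lam, w, g, lr = 0.01, 2.0, 0.5, 1e-3

# L2 regularization: the penalty gradient lam*w enters Adam and is
# rescaled by the adaptive per-parameter step size.
w_l2, _, _ = adam_step(w, g + lam * w, 0.0, 0.0, lr=lr)

# AdamW: the decay is applied directly to the weight, outside Adam's
# adaptive rescaling ("decoupled" weight decay).
w_adamw, _, _ = adam_step(w, g, 0.0, 0.0, lr=lr)
w_adamw -= lr * lam * w
```

With Adam, the two updates genuinely differ: the L2 penalty gradient is divided by the adaptive denominator, so frequently updated weights are effectively decayed less, whereas AdamW decays all weights uniformly.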
Appendix K Additional Training Curves Under Default Hyperparameters
Appendix L Training Curves for Hyperparameter Experiments
Appendix M Training Curves for Policy vs. Value Experiments
- Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641. Cited by: §2.
- Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §1, §3.
- Control regularization for reduced variance reinforcement learning. arXiv preprint arXiv:1905.05380. Cited by: §2.
- Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341. Cited by: §1, §1, §2.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
- OpenAI Baselines. GitHub. Note: https://github.com/openai/baselines Cited by: Appendix C, §1, §4.1.
- RL2: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779. Cited by: §2.
- Generalization and regularization in DQN. arXiv preprint arXiv:1810.00123. Cited by: §1, §1, §2, §2.
- Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §2.
- Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1587–1596. Cited by: Appendix A.
- Information asymmetry in KL-regularized RL. In International Conference on Learning Representations, Cited by: §2.
- Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §3.
- Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, pp. 5302–5311. Cited by: §2.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1856–1865. Cited by: Appendix A, Appendix B, §1, §4.1, §4.1.
- Soft actor-critic. GitHub. Note: https://github.com/haarnoja/sac Cited by: §4.1.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
- Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §5.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1, §3.
- Adam: a method for stochastic optimization. Cited by: Appendix J.
- Tanks and temples: benchmarking large-scale scene reconstruction. ACM Trans. Graph. 36 (4). External Links: Cited by: §G.1.
- ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
- Continuous control with deep reinforcement learning. In International Conference on Learning Representations, Cited by: §2, §7.
- Decoupled weight decay regularization. In International Conference on Learning Representations, Cited by: Appendix J.
- Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: Appendix A, §1, §4.1.
- Open-source software for robot simulation, integrated with OpenAI Gym. GitHub. Note: https://github.com/openai/roboschool Cited by: §4.1.
- Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282. Cited by: §2.
- TD-regularized actor-critic methods. In Machine Learning, pp. 1–35. Cited by: §2.
- Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2040–2042. Cited by: §2.
- Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6550–6561. Cited by: §2.
- Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. arXiv:1907.01341. Cited by: §G.1.
- You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
- Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
- Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: Appendix A, §1, §4.1.
- High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, Cited by: §2.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Appendix A, Appendix C, §1, §1, §4.1.
- Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §1, §3.
- Structured control nets for deep reinforcement learning. In International Conference on Machine Learning, pp. 4749–4758. Cited by: §2.
- Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: Appendix A.
- Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506. Cited by: §1.
- Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §2.
- MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §4.1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
- Function optimization using connectionist reinforcement learning algorithms. Connection Science 3 (3), pp. 241–268. Cited by: §1.
- A faster PyTorch implementation of Faster R-CNN. https://github.com/jwyang/faster-rcnn.pytorch. Cited by: §7.
- A dissection of overfitting and generalization in continuous reinforcement learning. ArXiv abs/1806.07937. Cited by: §1, §2.
- A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893. Cited by: §2.
- Investigating generalisation in continuous deep reinforcement learning. ArXiv abs/1902.07015. Cited by: §1, §2.
- Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI’08, pp. 1433–1438. External Links: Cited by: Appendix A.