Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic
Abstract
Reinforcement learning (RL) has achieved remarkable performance in numerous sequential decision-making and control tasks. However, a common problem is that the learned nearly optimal policy tends to overfit to the training environment and may not extend to situations never encountered during training. In practical applications, the randomness of the environment can lead to devastating events, which should be the focus of safety-critical systems such as autonomous driving. In this paper, we introduce the minimax formulation and the distributional framework to improve the generalization ability of RL algorithms, and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. The minimax formulation seeks an optimal policy under the most severe environmental variations, in which the protagonist policy maximizes the action-value function while the adversary policy tries to minimize it. The distributional framework learns a state-action return distribution, from which we can model the risk of different returns explicitly, thereby formulating a risk-averse protagonist policy and a risk-seeking adversarial policy. We implement our method on the decision-making task of autonomous vehicles at intersections and test the trained policy in distinct environments. Results demonstrate that our method greatly improves the generalization ability of the protagonist agent to different environmental variations.
Game theory, adversarial reinforcement learning, risk-aware policy learning, autonomous driving.
I Introduction
Numerous applications of deep reinforcement learning (RL) have demonstrated great performance in a range of challenging domains, such as games[15] and autonomous driving[2]. Mainstream RL algorithms focus on optimizing the policy based on its performance in the training environment, without considering its universality for situations never encountered during training. Studies have shown that this can reduce the generalization ability of the learned policy[8][17]. For intelligent agents such as autonomous vehicles, we usually need them to cope with multiple situations, including unknown scenarios.
A straightforward technique to improve the generalization ability of RL is training on a set of randomized environments. By randomizing the dynamics of the simulation environment, the developed policies are capable of adapting to the different dynamics encountered during training[11]. Furthermore, some works have proposed that directly adding noise to state observations can provide adversarial perturbations for the training process, which makes the learned policy less sensitive to environmental variation during testing[6, 10, 4]. However, these approaches can scarcely capture all variations of the environment, as the space of dynamics parameters can be far larger than the space of possible actions.
Alternative techniques to improve generalization include risk-sensitive policy learning. Generally, risk is related to the stochasticity of the environment and to the fact that even an optimal policy (in terms of expected return) may perform poorly in some cases. Risk-sensitive policy learning therefore includes a risk measure in the optimization process, either as the objective[13] or as a constraint[16]. This formulation seeks not only to maximize the expected reward but also to optimize a risk criterion, so that the trained policy reduces the likelihood of failure in a varying environment. In practice, the risk is usually modeled as the variance of the return, and the most representative algorithms include the mean-variance trade-off method[3] and the percentile optimization method[14]. However, existing methods can only model the risk by discretely sampling some trajectories from randomized environments, rather than learning the exact return distribution.
Another technique to improve generalization across different kinds of environmental variations is the minimax formulation. As a pioneering work in this field, Morimoto et al. (2005) first combined H-infinity control with RL to learn an optimal policy, which is the prototype of most existing minimax formulations of RL algorithms[7]. They formulated a differential game in which a protagonist agent tries to learn the control law by maximizing the accumulated reward, while an adversary agent aims to make the worst possible disruption by minimizing the same objective. In this way, the problem is reduced to finding a minimax solution of a value function. Later, Pinto et al. (2017) extended this work with deep neural networks and proposed the Robust Adversarial Reinforcement Learning (RARL) algorithm, in which the protagonist and adversary policies are trained alternately, with one being fixed while the other adapts[12]. Recently, Pan et al. (2019) introduced a risk-sensitive framework into RARL to prevent rare, catastrophic events such as automotive accidents[9]. For that purpose, the risk was modeled as the variance of value functions, estimated with an ensemble of Q-value networks trained in parallel. Their experiments on autonomous driving demonstrated that introducing a risk-sensitive framework into RARL is effective and even crucial, especially for safety-critical systems. However, these existing methods can only handle discrete and low-dimensional action spaces, as they select actions according to their Q-networks. Moreover, the value function must be divided into multiple discrete intervals in advance, which is inconvenient because different tasks usually require different numbers of intervals.
In this paper, we propose a new RL algorithm to improve the generalization ability of the learned policy. In particular, the learned policy should not only succeed in the training environment but also cope with situations never encountered before. To that end, we adopt the minimax formulation, which augments standard RL with an adversarial policy, to develop a minimax variant of the Distributional Soft Actor-Critic (DSAC) algorithm[1], called Minimax DSAC. We choose DSAC as the basis of our algorithm not only because it is a state-of-the-art RL algorithm, but also because it directly learns a continuous distribution of returns, which enables us to model the return variance as risk explicitly. By modeling risk, we can train stronger adversaries, and through this competition the protagonist policy gains a greater ability to cope with environmental changes. Additionally, the application of our algorithm to autonomous driving tasks shows that Minimax DSAC maintains good performance even when the environment changes drastically.
The rest of the paper is organized as follows: Section II states the preliminaries, and Section III introduces the formulation and implementation of the proposed Minimax DSAC method. Section IV introduces the simulation scenario and evaluates the trained model. Section V summarizes the major contributions and concludes the paper.
II Preliminaries
Before delving into the details of our algorithm, we first introduce notation and summarize maximum entropy RL and distributional RL mathematically.
II-A Notation
Standard reinforcement learning (RL) is designed to solve sequential decision-making tasks in which an agent interacts with the environment. Formally, we consider an infinite-horizon discounted Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is a continuous set of states, $\mathcal{A}$ is a continuous set of actions, $p(s_{t+1}|s_t,a_t)$ is the transition probability distribution, $r(s_t,a_t)$ is the reward function, and $\gamma \in (0,1)$ is the discount factor. At each time step $t$, the agent receives a state $s_t$ and selects an action $a_t$, and the environment returns the next state $s_{t+1}$ with probability $p(s_{t+1}|s_t,a_t)$ together with a scalar reward $r_t$. We use $\rho_\pi(s)$ and $\rho_\pi(s,a)$ to denote the state and state-action distributions induced by policy $\pi$ in the environment. For brevity, the current and next state-action pairs are also denoted as $(s,a)$ and $(s',a')$, respectively.
II-B Maximum entropy RL
Maximum entropy RL aims to maximize both the expected accumulated reward and the policy entropy, by augmenting the standard RL objective with an entropy maximization term:

(1)  $J_\pi = \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t)+\alpha\mathcal{H}(\pi(\cdot|s_t))\big)\Big]$

where $\alpha$ is the temperature parameter, which determines the relative importance of the entropy term against the reward and thus controls the stochasticity of the optimal policy. The Q-value of policy $\pi$ is defined as:

(2)  $Q^{\pi}(s,a)=\mathbb{E}\Big[r(s,a)+\sum_{t=1}^{\infty}\gamma^{t}\big(r(s_t,a_t)-\alpha\log\pi(a_t|s_t)\big)\Big]$

where $s_0 = s$ and $a_0 = a$.
Obviously, the maximum entropy objective differs from the maximum expected reward objective used in standard RL, though the conventional objective can be recovered in the limit $\alpha \to 0$. Prior works have demonstrated that the maximum entropy objective incentivizes the policy to explore more widely. In problem settings where multiple actions seem equally attractive, the policy will act as randomly as possible among those actions[5].
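As a concrete illustration of objective (1), the sketch below computes an entropy-augmented discounted return from per-step rewards and action log-probabilities; the function name and example values are illustrative, not from the paper.

```python
def entropy_augmented_return(rewards, log_probs, alpha=0.2, gamma=0.99):
    """Discounted return augmented with the policy-entropy bonus.

    Each step contributes r_t - alpha * log pi(a_t|s_t); as alpha -> 0
    the standard discounted return is recovered.
    """
    total = 0.0
    for t, (r, lp) in enumerate(zip(rewards, log_probs)):
        total += gamma ** t * (r - alpha * lp)
    return total

# A deterministic action (log-prob 0) yields the plain discounted return.
plain = entropy_augmented_return([1.0, 1.0], [0.0, 0.0], alpha=0.2)
# A stochastic policy (negative log-probs) earns an entropy bonus on top.
bonus = entropy_augmented_return([1.0, 1.0], [-1.0, -1.0], alpha=0.2)
```

Note how a more stochastic policy (more negative log-probabilities) receives a strictly larger augmented return, which is exactly the exploration incentive described above.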
II-C Distributional RL
The distributional framework has attracted much attention because distributional RL algorithms show improved sample complexity and final performance. The core idea of distributional RL is that the return

(3)  $Z^{\pi}(s,a)=r(s,a)+\sum_{t=1}^{\infty}\gamma^{t}\big(r(s_t,a_t)-\alpha\log\pi(a_t|s_t)\big)$

is viewed as a random variable, where $s_0=s$ and $a_0=a$, and we choose to directly learn its distribution instead of just its expected value, i.e., the Q-value in (2): $Q^{\pi}(s,a)=\mathbb{E}[Z^{\pi}(s,a)]$.
Under this framework, many works use a discrete distribution to model the return distribution, in which the value function must be divided into different intervals a priori. Recently, Duan et al.[1] proposed the Distributional Soft Actor-Critic (DSAC) algorithm to directly learn a continuous distribution of returns by truncating the difference between the target and the current return distribution. We therefore build on the continuous return distribution in the following.
The optimal policy is learned by a distributional variant of policy iteration, which alternates between policy evaluation and policy improvement. The corresponding variant of the Bellman operator can be derived as:

$\mathcal{T}^{\pi}Z(s,a) \overset{D}{=} r(s,a)+\gamma\big(Z(s',a')-\alpha\log\pi(a'|s')\big)$

where $A \overset{D}{=} B$ denotes that the two random variables $A$ and $B$ have equal probability laws, and the next state $s'$ and action $a'$ are distributed according to $p(\cdot|s,a)$ and $\pi(\cdot|s')$, respectively.
Supposing $Z^{\pi}(s,a)\sim\mathcal{Z}^{\pi}(\cdot|s,a)$, where $\mathcal{Z}^{\pi}(\cdot|s,a)$ denotes the distribution of $Z^{\pi}(s,a)$, the return distribution can be optimized by minimizing the distance between the Bellman-updated and the current return distribution:

(4)  $\mathcal{Z}_{\text{new}}=\arg\min_{\mathcal{Z}}\ \mathbb{E}_{(s,a)\sim\rho_\pi}\big[d\big(\mathcal{T}^{\pi}\mathcal{Z}_{\text{old}}(\cdot|s,a),\mathcal{Z}(\cdot|s,a)\big)\big]$

where $d$ is some metric measuring the distance between two distributions; for example, $d$ can be taken as the Kullback-Leibler (KL) divergence or the Wasserstein metric. In the policy improvement step, we aim to find a new policy $\pi_{\text{new}}$ that is better than the current policy $\pi_{\text{old}}$, such that $Q^{\pi_{\text{new}}}(s,a)\geq Q^{\pi_{\text{old}}}(s,a)$ for all state-action pairs:

(5)  $\pi_{\text{new}}=\arg\max_{\pi}\ \mathbb{E}_{s\sim\rho_\pi,\,a\sim\pi}\big[Q^{\pi_{\text{old}}}(s,a)-\alpha\log\pi(a|s)\big]$
It has been shown that the policy evaluation step in (4) and the policy improvement step in (5) can be alternated, gradually converging to the optimal policy [5, 1].
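To make the policy evaluation step in (4) concrete, the sketch below performs one Bellman update on a Gaussian return model and measures the distance $d$ with the closed-form KL divergence between univariate Gaussians. This is a minimal numerical sketch, assuming Gaussian return distributions as in DSAC; all function names and numbers are illustrative.

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) between two univariate Gaussians (closed form)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2) - 0.5)

def bellman_target(r, gamma, mu_next, sigma_next, alpha, logp_next):
    """Mean/std of the Bellman-updated return distribution
    T Z(s,a) = r + gamma * (Z(s',a') - alpha * log pi(a'|s'))."""
    mu = r + gamma * (mu_next - alpha * logp_next)
    sigma = gamma * sigma_next  # scaling a Gaussian scales its std by gamma
    return mu, sigma

# One evaluation step: build the target, then score the current model against it.
mu_t, sigma_t = bellman_target(r=1.0, gamma=0.99, mu_next=5.0,
                               sigma_next=2.0, alpha=0.2, logp_next=-1.0)
loss = kl_gaussian(mu_t, sigma_t, mu_q=4.0, sigma_q=2.0)  # d(T Z, Z)
```

Minimizing this KL term over the parameters of the current distribution is exactly the inner minimization of (4) for the Gaussian case.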
III Our Method
Although distributional RL algorithms such as DSAC account for the randomness of the return caused by the environment, the learned policy may still fail in a distinct environment. Here, we introduce the minimax formulation into the existing DSAC algorithm and model risk explicitly through the continuous return distribution.
III-A Minimax Distributional Soft Actor-Critic (Minimax DSAC)
In the minimax formulation, two policies are optimized, called the protagonist policy $\pi$ and the adversary policy $\mu$, respectively. Given the current state $s$, the protagonist policy takes action $u\sim\pi(\cdot|s)$, the adversary policy takes action $v\sim\mu(\cdot|s)$, and then the next state $s'$ is reached. The two policies obtain opposite rewards: at each time step, the protagonist receives the reward $r$ while the adversary receives $-r$. We use $\rho(s)$ and $\rho(s,u,v)$ to denote the state and state-action distributions induced by policies $\pi$ and $\mu$ in the environment.
Under this framework, the random return generated by $\pi$ and $\mu$ can be rewritten as:

$Z^{\pi,\mu}(s,u,v)=r(s,u,v)+\sum_{t=1}^{\infty}\gamma^{t}\big(r(s_t,u_t,v_t)-\alpha\log\pi(u_t|s_t)\big)$

and its expectation is the action-value function $Q^{\pi,\mu}(s,u,v)$:

$Q^{\pi,\mu}(s,u,v)=\mathbb{E}\big[Z^{\pi,\mu}(s,u,v)\big]$

Supposing $Z^{\pi,\mu}(s,u,v)\sim\mathcal{Z}(\cdot|s,u,v)$, we can use a method similar to (4) to update the return distribution. To learn risk-sensitive policies, we model risk as the variance of the learned continuous return distribution. The protagonist policy is optimized to mitigate risk, so as to avoid events that could lead to bad returns, i.e., it maximizes the following objective:
(6)  $J(\pi)=\mathbb{E}_{s\sim\rho}\big[\mathbb{E}[Z(s,u,v)]-\lambda_1\mathrm{Var}[Z(s,u,v)]\big]$
Meanwhile, the adversary policy seeks to increase risk to disrupt the learning process, i.e., it minimizes the following objective:
(7)  $J(\mu)=\mathbb{E}_{s\sim\rho}\big[\mathbb{E}[Z(s,u,v)]-\lambda_2\mathrm{Var}[Z(s,u,v)]\big]$
where $\lambda_1$ and $\lambda_2$ are constants weighting the variance term, which describe different risk levels.
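The opposing risk attitudes in (6) and (7) can be sketched with a single scoring function: both players score a candidate action by mean return minus a weighted variance, but the protagonist maximizes it while the adversary minimizes it. The helper name and the candidate numbers below are illustrative assumptions.

```python
def risk_adjusted_value(q_mean, q_var, lam):
    """mean - lam * variance of the predicted return distribution.

    The risk-averse protagonist maximizes this score, so the variance
    penalty steers it away from high-risk actions; the risk-seeking
    adversary minimizes it, so the same penalty rewards it for pushing
    the system toward low-mean, high-variance outcomes.
    """
    return q_mean - lam * q_var

# Two candidate actions with equal mean return but different risk:
safe, risky = (5.0, 1.0), (5.0, 9.0)  # (mean, variance) pairs
lam = 0.1
best = max([safe, risky], key=lambda mv: risk_adjusted_value(*mv, lam))
worst = min([safe, risky], key=lambda mv: risk_adjusted_value(*mv, lam))
```

With equal means, the protagonist's choice (`best`) is the low-variance action while the adversary's choice (`worst`) is the high-variance one, which is precisely the risk-averse/risk-seeking split described above.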
III-B Implementation of Minimax DSAC
To handle problems with large continuous domains, we use function approximators for the return distribution function and the two policies, each modeled as a Gaussian whose mean and variance are given by neural networks (NNs). We consider a parameterized state-action return distribution function $\mathcal{Z}_\theta(\cdot|s,u,v)$, a stochastic protagonist policy $\pi_\phi(u|s)$ and a stochastic adversarial policy $\mu_\psi(v|s)$, where $\theta$, $\phi$ and $\psi$ are parameters. Next we derive the update rules for these parameter vectors and present the details of Minimax DSAC.
In the policy evaluation step, the return distribution is updated by minimizing the difference between the target return distribution and the current return distribution. The formulation is similar to the DSAC algorithm[1], except that we consider two policies:

$J_{\mathcal{Z}}(\theta)=\mathbb{E}\big[D_{\mathrm{KL}}\big(\mathcal{T}^{\pi',\mu'}\mathcal{Z}_{\theta'}(\cdot|s,u,v),\mathcal{Z}_{\theta}(\cdot|s,u,v)\big)\big]$

The gradient with respect to the parameter $\theta$ can be written as:

$\nabla_\theta J_{\mathcal{Z}}(\theta)=-\mathbb{E}\big[\nabla_\theta\log\mathcal{P}\big(\mathcal{T}^{\pi',\mu'}Z(s,u,v)\,\big|\,\mathcal{Z}_\theta(\cdot|s,u,v)\big)\big]$
To prevent gradient explosion, we adopt a clipping technique that keeps the target return close to the expectation of the current distribution $Q_\theta(s,u,v)$:

$y \leftarrow \mathrm{clip}\big(y,\,Q_\theta(s,u,v)-b,\,Q_\theta(s,u,v)+b\big)$

where $b$ is a hyperparameter representing the clipping boundary. To stabilize the learning process, a target return distribution with parameters $\theta'$ and two target policy functions with separate parameters $\phi'$ and $\psi'$ are used to evaluate the target function. The target networks use a slow-moving update rate, parameterized by $\tau$, such as
(8)  $\omega' \leftarrow \tau\omega+(1-\tau)\omega'$
where $\omega$ represents the parameters $\theta$, $\phi$ and $\psi$.
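The two stabilization tricks above, the slow-moving target update (8) and the target-return clipping, can be sketched as follows; the array-based parameter representation is an assumption for illustration.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    """omega' <- tau * omega + (1 - tau) * omega', as in Eq. (8).

    Applied in place, one array per network layer.
    """
    for t, o in zip(target_params, online_params):
        t *= (1.0 - tau)
        t += tau * o

def clip_target_return(y, q_mean, b=20.0):
    """Keep the target return y within [Q - b, Q + b] around the mean of
    the current return distribution, to prevent exploding critic gradients."""
    return float(np.clip(y, q_mean - b, q_mean + b))

# Usage: the target network drifts slowly toward the online network.
target = [np.zeros(2)]
online = [np.ones(2)]
soft_update(target, online, tau=0.001)
```

With tau = 0.001 the target parameters move only 0.1% of the way toward the online parameters per update, which is what makes the evaluation target slow-moving and stable.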
In the policy improvement step, as discussed in (6), the protagonist policy aims to maximize the expected return with entropy while selecting actions with low variance:

(9)  $J(\phi)=\mathbb{E}_{u\sim\pi_\phi}\big[\mathbb{E}[Z_\theta(s,u,v)]-\lambda_1\mathrm{Var}[Z_\theta(s,u,v)]-\alpha\log\pi_\phi(u|s)\big]$
The adversarial policy in (7) aims to minimize the expected return while selecting actions with high variance:

(10)  $J(\psi)=\mathbb{E}_{v\sim\mu_\psi}\big[\mathbb{E}[Z_\theta(s,u,v)]-\lambda_2\mathrm{Var}[Z_\theta(s,u,v)]\big]$
Suppose the mean and variance of the return distribution can be explicitly parameterized by $\theta$. We can then derive the policy gradients of the protagonist and adversary policies using the reparameterization trick:

$u=f_\phi(\xi;s),\qquad v=g_\psi(\zeta;s)$

where $\xi$ and $\zeta$ are auxiliary variables sampled from some fixed distribution. The protagonist policy gradient of (9) then follows by differentiating $J(\phi)$ through $f_\phi$, and the adversarial policy gradient of (10) can be approximated analogously by differentiating $J(\psi)$ through $g_\psi$.
Finally, the temperature $\alpha$ is updated by minimizing the following objective:

$J(\alpha)=\mathbb{E}_{u\sim\pi_\phi}\big[-\alpha\log\pi_\phi(u|s)-\alpha\overline{\mathcal{H}}\big]$

where $\overline{\mathcal{H}}$ is the expected entropy. The details of our algorithm are summarized in Algorithm 1.
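The reparameterization trick used for the policy gradients can be sketched as below: the action is written as a deterministic, differentiable function of the policy parameters and fixed noise, so the gradient can flow through the sampling step. The Gaussian-with-tanh-squashing form is an assumption borrowed from common SAC-style implementations, not necessarily the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterized_action(mean, log_std, xi):
    """u = f(xi; s): a deterministic function of the policy outputs
    (mean, log_std) and noise xi ~ N(0, 1) drawn from a fixed distribution.

    Gradients w.r.t. mean and log_std pass straight through this
    expression, which is what makes objectives (9)-(10) differentiable.
    """
    return np.tanh(mean + np.exp(log_std) * xi)  # tanh squashes to (-1, 1)

xi = rng.standard_normal()             # noise is sampled once, outside f
u = reparameterized_action(mean=0.5, log_std=-1.0, xi=xi)
```

With zero noise the action reduces to the squashed mean, and all stochasticity is isolated in `xi`, so differentiating the sampled action with respect to the policy parameters is well-defined.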
IV Experiments
In this section, we evaluate our algorithm on an autonomous driving task, choosing an intersection as the driving scenario.
IV-A Simulation Environment
We focus on a typical four-direction intersection, shown in Fig. 1. Each direction is denoted by its location: up (U), down (D), left (L) and right (R). The intersection is unsignalized and each direction has one lane. The protagonist vehicle (red car in Fig. 1) attempts to travel from down to up, while two adversarial vehicles (green cars in Fig. 1) drive from right to left and from left to right, respectively. The trajectories of all three vehicles are predefined; as a result, there are two traffic conflict points between the path of the protagonist vehicle and those of the adversarial vehicles, shown as the solid circles in Fig. 1. In our experimental setting, the protagonist vehicle attempts to pass the intersection safely and quickly, while the two adversarial vehicles try to disrupt it by attempting to hit the protagonist vehicle.
We choose the position and velocity of each vehicle as the state, i.e., $(d, v)$, where $d$ is the distance between the vehicle and the center of the intersection. Note that $d$ is positive when a vehicle is heading toward the center and negative when it is leaving. For the action space, we choose the acceleration of each vehicle and assume that the vehicles can strictly track the desired acceleration. In total, a 6-dimensional continuous state space and a 3-dimensional continuous action space are constructed.
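The 6-dimensional state described above can be assembled as in the sketch below; the function name, argument ordering and numeric values are illustrative assumptions.

```python
def build_state(ego, adv1, adv2):
    """Concatenate (signed distance to intersection center, velocity) of the
    protagonist and the two adversarial vehicles into the 6-D state vector.

    Distances are positive while a vehicle heads toward the center and
    negative after it leaves, following the sign convention in the text.
    """
    state = []
    for d, v in (ego, adv1, adv2):
        state.extend([d, v])
    return state

# Protagonist 30 m before the center; one adversary already past it (d < 0).
s = build_state(ego=(30.0, 8.0), adv1=(25.0, 6.0), adv2=(-5.0, 7.0))
```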
The reward function is designed to consider both safety and time efficiency. The task is constructed in an episodic manner, with two termination conditions: collision or passing. First, if the protagonist vehicle passes the intersection safely, a large positive reward of +110 is given; second, if a collision happens anywhere, a large negative reward of -110 is given to the protagonist vehicle; besides, a small negative reward of -1 is given at every time step to encourage the protagonist vehicle to pass as quickly as possible. The adversarial vehicles obtain the opposite reward in each of the cases above.
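The zero-sum reward scheme above can be sketched directly from the three cases in the text; the function names are illustrative.

```python
def protagonist_reward(passed, collided):
    """Per-step reward for the protagonist vehicle, as described in the text."""
    if collided:
        return -110.0  # large penalty on any collision (episode terminates)
    if passed:
        return 110.0   # large bonus for clearing the intersection safely
    return -1.0        # small step cost encourages passing quickly

def adversary_reward(passed, collided):
    """Adversarial vehicles receive the exact negation (zero-sum game)."""
    return -protagonist_reward(passed, collided)
```

The negation in `adversary_reward` is what makes the game zero-sum, matching the minimax formulation in Section III.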
IV-B Algorithm Details and Results
Both the return distribution function and the two policies are approximated by multilayer perceptrons (MLPs) with 2 hidden layers and 256 units per layer. The policy of the protagonist vehicle aims to maximize the future expected return, while the policy of the adversarial vehicles aims to minimize it. The baseline of our algorithm is standard DSAC [1] without the adversarial policy, in which the protagonist vehicle learns to pass through the intersection in the presence of two random surrounding vehicles. We also adopt the asynchronous parallel architecture of DSAC, called PABAL, in which 4 learners and 3 actors are designed to accelerate learning. The hyperparameters used in training are listed in Table I and the training results are shown in Fig. 2.
TABLE I: Hyperparameters used in training

Max buffer size: 500
Sample batch size: 256
Hidden layers activation: gelu
Optimizer type: Adam
Adam parameter:
Actor learning rate:
Critic learning rate:
Learning rate:
Discount factor: 0.99
Temperature:
Target update rate: 0.001
Expected entropy: -(action dimensions)
Clipping boundary: 20
0.1
Results show that Minimax DSAC obtained a smaller mean of the average return, which is explained by the strong disruption that the adversary policy imposes on the learning of the protagonist policy. Besides, Minimax DSAC clearly fluctuates more than standard DSAC at the convergence stage. This can be explained by the fact that the protagonist vehicle has learned to avoid potential collisions by decelerating, even stopping and waiting, in the face of the malicious adversarial vehicles, which incurs a penalty at each step and finally results in a lower return.
IV-C Evaluation
Compared with the performance during training, we are more concerned with the performance in situations distinct from the training environment, i.e., the generalization ability. As the adversarial vehicles can be regarded as part of the environment, we can design different driving modes for the adversarial agents to adjust the difficulty of the environment and thus evaluate the generalization ability of the protagonist policy. Formally, we design three driving modes for the adversarial agents: aggressive, conservative and random. In aggressive mode, the two adversarial vehicles sample their accelerations from a positive interval, while in conservative mode they sample accelerations from a negative interval. In random mode, one adversarial vehicle samples its acceleration from the positive interval and the other from the negative interval.
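The three evaluation modes can be sketched as below. Note that the paper elides the numeric interval bounds, so the symmetric bound `A_MAX` used here is a placeholder assumption, not the paper's value.

```python
import random

# NOTE: the paper does not state the interval bounds; A_MAX and the
# [0, A_MAX] / [-A_MAX, 0] intervals below are illustrative placeholders.
A_MAX = 3.0

def sample_accelerations(mode, rng=random):
    """Sample the two adversarial vehicles' accelerations for one step."""
    if mode == "aggressive":    # both adversaries accelerate
        return rng.uniform(0.0, A_MAX), rng.uniform(0.0, A_MAX)
    if mode == "conservative":  # both adversaries decelerate
        return rng.uniform(-A_MAX, 0.0), rng.uniform(-A_MAX, 0.0)
    if mode == "random":        # one accelerates, the other decelerates
        return rng.uniform(0.0, A_MAX), rng.uniform(-A_MAX, 0.0)
    raise ValueError(f"unknown mode: {mode}")
```

Sweeping the trained protagonist over these modes, none of which exactly matches the training distribution, is what probes its generalization ability.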
The comparison of the two methods under the three modes is shown in Fig. 3, in which the corresponding values are also marked. Results show that Minimax DSAC greatly improves performance under the different modes of the adversarial vehicles, especially the aggressive and random modes. In conservative mode, the two algorithms show only a minor difference, because the adversarial vehicles drive at the lowest speed within the limit, so fewer potential collisions with the protagonist vehicle occur. However, Minimax DSAC still obtained a higher return because it adopted larger accelerations to improve passing efficiency. The t-test results in Fig. 3 show that the average reward of DSAC is significantly smaller than that of Minimax DSAC.
Fig. 4 shows the control performance of the trained protagonist policies under identical behavior of the adversarial vehicles. In aggressive mode, the protagonist vehicle of Minimax DSAC learned to decelerate in advance and wait until the adversarial vehicles had passed, while the DSAC agent suffered a collision resulting from its high speed. In conservative mode, both protagonist vehicles adopted a similar strategy and passed successfully, except that Minimax DSAC achieved a slightly shorter passing time; in this setting, driving through at high speed leads to fewer collisions and improves passing efficiency. In random mode, Minimax DSAC adjusted its speed more flexibly and passed the intersection in less time (6.1 s) than standard DSAC (8.9 s).
To sum up, although Minimax DSAC obtained a smaller average return during training, it maintains better performance when encountering different kinds of environmental variations.
V Conclusion
In this paper, we combined the minimax formulation with the distributional framework to improve the generalization ability of RL algorithms, in which the protagonist agent must compete with an adversarial agent to learn how to behave well. Based on the DSAC algorithm, we proposed the Minimax DSAC algorithm and implemented it on an autonomous driving task at intersections. Results show that our algorithm significantly improves the protagonist agent's robustness to variations in the environment. This study provides a promising approach for accelerating the application of RL algorithms in real-world settings such as autonomous driving, where algorithms are typically developed on a simulator that differs from the real environment.
References
 (2020) Distributional soft actor-critic: off-policy reinforcement learning for addressing value estimation errors. arXiv preprint arXiv:2001.02811. Cited by: §I, §II-C, §III-B, §IV-B.
 (2020) Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data. IET Intelligent Transport Systems 14 (5), pp. 297–305. Cited by: §I.
 (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480. Cited by: §I.
 (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations (ICLR). Cited by: §I.
 (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §II-B, §II-C.
 (2017) Adversarially robust policy learning: active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3932–3939. Cited by: §I.
 (2005) Robust reinforcement learning. Neural Computation 17 (2), pp. 335–359. Cited by: §I.
 (2018) Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282. Cited by: §I.
 (2019) Risk averse robust adversarial reinforcement learning. arXiv preprint arXiv:1904.00511. Cited by: §I.
 (2018) Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, pp. 2040–2042. Cited by: §I.
 (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §I.
 (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 2817–2826. Cited by: §I.
 (2018) Risk-sensitive reinforcement learning: a constrained optimization viewpoint. arXiv preprint arXiv:1810.09126. Cited by: §I.
 (2016) EPOpt: learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283. Cited by: §I.
 (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §I.
 (2015) Optimizing the CVaR via sampling. In 29th AAAI Conference on Artificial Intelligence (AAAI). Cited by: §I.
 (2019) Investigating generalisation in continuous deep reinforcement learning. arXiv preprint arXiv:1902.07015. Cited by: §I.