Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic

Reinforcement learning (RL) has achieved remarkable performance in numerous sequential decision-making and control tasks. However, a common problem is that the learned near-optimal policy often overfits to the training environment and may fail in situations never encountered during training. In practical applications, the randomness of the environment can trigger devastating events, which should be the focus of safety-critical systems such as autonomous driving. In this paper, we introduce the minimax formulation and the distributional framework to improve the generalization ability of RL algorithms, and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. The minimax formulation seeks an optimal policy under the most severe environmental variations: the protagonist policy maximizes the action-value function while the adversary policy tries to minimize it. The distributional framework learns a state-action return distribution, from which we can model the risk of different returns explicitly, thereby formulating a risk-averse protagonist policy and a risk-seeking adversarial policy. We implement our method on decision-making tasks for autonomous vehicles at intersections and test the trained policy in distinct environments. Results demonstrate that our method greatly improves the generalization ability of the protagonist agent to different environmental variations.


Index Terms: Game theory, adversarial reinforcement learning, risk-aware policy learning, autonomous driving.

I Introduction

Numerous applications of deep reinforcement learning (RL) have demonstrated great performance in a range of challenging domains such as games[15] and autonomous driving[2]. Mainstream RL algorithms optimize the policy for performance in the training environment, without considering its applicability to situations never encountered during training. Studies have shown that this can reduce the generalization ability of the learned policy[8][17]. Intelligent agents such as autonomous vehicles usually need to cope with multiple situations, including unknown scenarios.

A straightforward technique to improve the generalization ability of RL is training on a set of randomized environments. By randomizing the dynamics of the simulation environment, the developed policies can adapt to the different dynamics encountered during training [11]. Furthermore, some works have proposed directly adding noise to state observations to provide adversarial perturbations during training, which makes the learned policy less sensitive to environmental variation at test time[6, 10, 4]. However, these approaches can scarcely capture all possible variations of the environment, as the space of dynamics parameters can be far larger than the space of possible actions.

Alternative techniques to improve generalization include risk-sensitive policy learning. Generally, risk relates to the stochasticity of the environment and to the fact that even a policy that is optimal in terms of expected return may perform poorly in some cases. Risk-sensitive policy learning therefore includes a risk measure in the optimization process, either as the objective[13] or as a constraint[16]. This formulation not only seeks to maximize the expected reward but also optimizes a risk criterion, so that the trained policy reduces the likelihood of failure in a varying environment. In practice, risk is typically modeled as the variance of return, and the most representative algorithms include the mean-variance trade-off method[3] and the percentile optimization method[14]. However, existing methods can only model the risk by discretely sampling trajectories from randomized environments, rather than learning the exact return distribution.

Another technique to improve generalization across different kinds of environmental variations is the minimax formulation. As a pioneering work in this field, Morimoto et al. (2005) first combined H-infinity control with RL to learn an optimal policy, which is the prototype of most existing minimax formulations of RL[7]. They formulated a differential game in which a protagonist agent tries to learn the control law by maximizing the accumulated reward, while an adversary agent aims to make the worst possible disruption by minimizing the same objective. In this way, the problem is reduced to finding the minimax solution of a value function. Pinto et al. (2017) extended this work with deep neural networks and proposed the Robust Adversarial Reinforcement Learning (RARL) algorithm, in which the protagonist and adversary policies are trained alternately, one being fixed while the other adapts[12]. Recently, Pan et al. (2019) introduced a risk-sensitive framework into RARL to prevent rare, catastrophic events such as automotive accidents[9]. For that purpose, risk was modeled as the variance of value functions, estimated with an ensemble of Q-value networks trained in parallel. Their experiments on autonomous driving demonstrated that introducing the risk-sensitive framework into RARL is effective and even crucial, especially for safety-critical systems. However, these existing methods can only handle discrete and low-dimensional action spaces, as they select actions according to their Q-networks. Moreover, the value function must be divided into multiple discrete intervals in advance, which is inconvenient because different tasks usually require different numbers of intervals.

In this paper, we propose a new RL algorithm to improve the generalization ability of the learned policy, so that it not only succeeds in the training environment but also copes with situations never encountered before. To that end, we adopt the minimax formulation, which augments standard RL with an adversarial policy, to develop a minimax variant of the Distributional Soft Actor-Critic (DSAC) algorithm[1], called Minimax DSAC. We choose DSAC as the basis of our algorithm not only because of its state-of-the-art performance, but also because it directly learns a continuous distribution of returns, which enables us to model the return variance as an explicit risk measure. By modeling risk, we can train stronger adversaries, and through this competition the protagonist policy gains a greater ability to cope with environmental changes. The application of our algorithm to autonomous driving tasks shows that Minimax DSAC maintains good performance even when the environment changes drastically.

The rest of the paper is organized as follows: Section II states the preliminaries and Section III introduces the formulation and implementation of the proposed Minimax DSAC. Section IV introduces the simulation scenario and evaluates the trained model. Section V summarizes the major contributions and concludes the paper.

II Preliminaries

Before delving into the details of our algorithm, we first introduce notation and summarize maximum entropy RL and distributional RL mathematically.

II-A Notation

Standard reinforcement learning (RL) is designed to solve sequential decision-making tasks in which an agent interacts with an environment. Formally, we consider an infinite-horizon discounted Markov Decision Process (MDP), defined by the tuple (𝒮, 𝒜, p, r, γ), where 𝒮 is a continuous set of states, 𝒜 is a continuous set of actions, p(s_{t+1}|s_t, a_t) is the transition probability distribution, r(s_t, a_t) is the reward function, and γ ∈ (0, 1) is the discount factor. At each time step t, the agent receives a state s_t and selects an action a_t, and the environment returns the next state s_{t+1} with probability p(s_{t+1}|s_t, a_t) and a scalar reward r_t = r(s_t, a_t). We use ρ_π to denote the state and state-action distribution induced by policy π in the environment. For the sake of simplicity, the current and next state-action pairs are also denoted as (s, a) and (s', a'), respectively.
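To make the notation concrete, the following minimal sketch (with illustrative values only) computes the discounted accumulated reward over one episode:

```python
def discounted_return(rewards, gamma):
    """Accumulated discounted reward: the sum of gamma**t * r_t over an episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

g = discounted_return([1.0, 1.0, 1.0], 0.5)  # 1 + 0.5 + 0.25 = 1.75
```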

II-B Maximum Entropy RL

Maximum entropy RL aims to maximize both the expected accumulated reward and the policy entropy, by augmenting the standard RL objective with an entropy maximization term:

J(π) = ∑_t E_{(s_t, a_t)∼ρ_π} [ r(s_t, a_t) + α H(π(·|s_t)) ],   (1)

where α is the temperature parameter, which determines the relative importance of the entropy term against the reward and thus controls the stochasticity of the optimal policy. The Q-value of policy π is defined as:

Q^π(s_t, a_t) = E [ r_t + ∑_{i=1}^∞ γ^i ( r_{t+i} − α log π(a_{t+i}|s_{t+i}) ) ],   (2)

where a_{t+i} ∼ π(·|s_{t+i}) and s_{t+i+1} ∼ p(·|s_{t+i}, a_{t+i}).

Obviously, the maximum entropy objective differs from the maximum expected reward objective used in standard RL, though the conventional objective can be recovered in the limit α → 0. Prior works have demonstrated that the maximum entropy objective incentivizes the policy to explore more widely: in problem settings where multiple actions seem equally attractive, the policy will assign them similar probabilities[5].
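As a numerical illustration (toy values, not the paper's implementation), the entropy-augmented one-step bootstrap target simply adds −α log π(a'|s') to the usual bootstrapped value, and reduces to the standard target when α = 0:

```python
def soft_q_target(reward, gamma, q_next, logp_next, alpha):
    """One-step soft Q target: r + gamma * (Q(s',a') - alpha * log pi(a'|s'))."""
    return reward + gamma * (q_next - alpha * logp_next)

t_soft = soft_q_target(1.0, 0.99, 5.0, -1.5, alpha=0.2)  # entropy bonus included
t_std = soft_q_target(1.0, 0.99, 5.0, -1.5, alpha=0.0)   # conventional target
```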

II-C Distributional RL

The distributional framework has attracted much attention because distributional RL algorithms show improved sample complexity and final performance. The core idea of distributional RL is that the return

Z^π(s_t, a_t) = r_t + ∑_{i=1}^∞ γ^i ( r_{t+i} − α log π(a_{t+i}|s_{t+i}) )   (3)

is viewed as a random variable, and we choose to directly learn its distribution instead of just its expected value, i.e., the Q-value in (2): Q^π(s, a) = E[Z^π(s, a)].

Under this theme, many works use a discrete distribution to model the return distribution, which requires dividing the value range into intervals in advance. Recently, Duan et al.[1] proposed the Distributional Soft Actor-Critic (DSAC) algorithm to directly learn a continuous distribution of returns by truncating the difference between the target and current return distributions. We therefore build on the continuous return distribution in the following.

The optimal policy is learned by a distributional variant of policy iteration, which alternates between policy evaluation and policy improvement. The corresponding variant of the Bellman operator can be derived as:

T^π Z(s, a) :=_D r(s, a) + γ Z(s', a'),

where X :=_D Y denotes that the two random variables X and Y have equal probability laws, and the next state s' and action a' are distributed according to p(·|s, a) and π(·|s') respectively.

Supposing Z^π(s, a) ∼ 𝒵^π(·|s, a), where 𝒵^π denotes the distribution of Z^π, the return distribution can be optimized by minimizing the distance between the Bellman-updated and the current return distribution:

𝒵_new = arg min_𝒵 E_{(s,a)∼ρ_π} [ d( T^π 𝒵_old(·|s, a), 𝒵(·|s, a) ) ],   (4)

where d is some metric measuring the distance between two distributions; for example, d can be the Kullback-Leibler (KL) divergence or the Wasserstein metric. In the policy improvement step, we aim to find a new policy π_new that is better than the current policy π_old, such that for all state-action pairs:

π_new = arg max_π E_{a∼π} [ Q^{π_old}(s, a) − α log π(a|s) ].   (5)

It has been shown that the policy evaluation step in (4) and the policy improvement step in (5) can alternately roll forward and gradually converge to the optimal policy [5, 1].
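To make the evaluation step concrete, the sketch below fits a Gaussian return distribution by minimizing the negative log-likelihood of sampled Bellman targets. This is an assumption-level illustration of distribution matching, not necessarily DSAC's exact loss:

```python
import math

def gaussian_nll(mean, std, target):
    """Negative log-likelihood of a sampled Bellman target under the current
    Gaussian return distribution Z(.|s,a) = N(mean, std**2). Averaged over
    many targets, minimizing this shifts the predicted distribution toward
    the target return distribution."""
    var = std ** 2
    return 0.5 * math.log(2.0 * math.pi * var) + (target - mean) ** 2 / (2.0 * var)

# A prediction centered on the target scores a lower loss.
loss_good = gaussian_nll(5.0, 1.0, 5.0)
loss_bad = gaussian_nll(3.0, 1.0, 5.0)
```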

III Our Methods

Although distributional RL algorithms like DSAC consider the randomness of return caused by the environment, they may still fail in environments distinct from the training one. Here, we introduce the minimax formulation into the existing DSAC algorithm and model risk explicitly through the continuous return distribution.

III-A Minimax Distributional Soft Actor-Critic (Minimax DSAC)

In the minimax formulation, there are two policies to be optimized, called the protagonist policy π and the adversary policy π̄. Given the current state s, the protagonist takes action u ∼ π(·|s), the adversary takes action v ∼ π̄(·|s), and the next state s' is reached. The two policies receive opposite rewards: at each time step the protagonist gets r(s, u, v) while the adversary gets −r(s, u, v). We use ρ_{π,π̄} to denote the state and state-action distribution induced by π and π̄ in the environment.

Under this theme, the random return generated by π and π̄ can be rewritten as:

Z^{π,π̄}(s_t, u_t, v_t) = r_t + ∑_{i=1}^∞ γ^i ( r_{t+i} − α log π(u_{t+i}|s_{t+i}) ),

and its expectation is the action-value function:

Q^{π,π̄}(s, u, v) = E [ Z^{π,π̄}(s, u, v) ].
Suppose Z^{π,π̄}(s, u, v) ∼ 𝒵^{π,π̄}(·|s, u, v); then we can use the method in (4) to update the return distribution. To learn risk-sensitive policies, we model risk as the variance of the learned continuous return distribution. The protagonist policy is optimized to mitigate risk, avoiding events that may lead to bad returns, i.e., it maximizes the following objective:

J(π) = E_{(s,u,v)∼ρ_{π,π̄}} [ Q^{π,π̄}(s, u, v) − λ₁ Var[Z^{π,π̄}(s, u, v)] ],   (6)

while the adversary policy seeks to increase risk to disrupt the learning process, i.e., it minimizes the following objective:

J(π̄) = E_{(s,u,v)∼ρ_{π,π̄}} [ Q^{π,π̄}(s, u, v) − λ₂ Var[Z^{π,π̄}(s, u, v)] ],   (7)

where λ₁ and λ₂ are non-negative constants weighting the variance, which set the risk level of each policy.
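The effect of the variance penalty can be illustrated numerically (a toy sketch; the weight lam stands in for λ₁ or λ₂):

```python
def risk_adjusted_objective(mean, variance, lam):
    """Mean-variance objective J = E[Z] - lam * Var[Z]. The protagonist
    maximizes it (risk-averse); the adversary minimizes it, which pushes
    the expected return down and the variance up (risk-seeking)."""
    return mean - lam * variance

# With equal expected returns, the risk-averse protagonist prefers the
# action whose return distribution is narrower.
safe = risk_adjusted_objective(5.0, 0.5, 0.3)
risky = risk_adjusted_objective(5.0, 4.0, 0.3)
```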

III-B Implementation of Minimax DSAC

To handle problems with large continuous domains, we use function approximators for the return distribution function and the two policies, each modeled as a Gaussian whose mean and variance are given by neural networks (NNs). We consider a parameterized state-action return distribution function Z_θ(·|s, u, v), a stochastic protagonist policy π_φ(u|s) and a stochastic adversarial policy π̄_ψ(v|s), where θ, φ and ψ are the parameters. Next we derive the update rules for these parameter vectors and present the details of our Minimax DSAC.

In the policy evaluation step, the return distribution is updated by minimizing the distance between the target return distribution and the current one. The formulation is similar to that of DSAC[1], except that two policies are considered:

J_Z(θ) = −E_{(s,u,v)∼ρ} [ log P( y_z | Z_θ(·|s, u, v) ) ],

where y_z = r + γ ( Z_{θ̄}(s', u', v') − α log π_{φ̄}(u'|s') ) is a sample of the Bellman-updated return. The gradient with respect to the parameters θ can be written as:

∇_θ J_Z(θ) = −E [ ∇_θ log P( y_z | Z_θ(·|s, u, v) ) ].
To prevent the gradient from exploding, we adopt a clipping technique that keeps the sampled Bellman target y_z close to the expectation Q_θ(s, u, v) = E[Z_θ(·|s, u, v)] of the current distribution:

y_z ← clip( y_z, Q_θ(s, u, v) − b, Q_θ(s, u, v) + b ),

where b is a hyperparameter representing the clipping boundary. To stabilize the learning process, a target return distribution with parameters θ̄ and two target policy networks with separate parameters φ̄ and ψ̄ are used to evaluate the target. The target networks use a slow-moving update rate parameterized by τ:

ζ̄ ← τ ζ + (1 − τ) ζ̄,   (8)

where ζ represents each of the parameters θ, φ and ψ.
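Both stabilization tricks, target clipping and the slow-moving target update in (8), can be sketched in scalar form (illustrative only; the real updates act on network parameters):

```python
def clip_target(y, q_mean, b):
    """Clip a sampled Bellman target into [q_mean - b, q_mean + b] so a
    single extreme sample cannot produce an exploding gradient."""
    return max(q_mean - b, min(q_mean + b, y))

def soft_update(target_params, params, tau):
    """Slow-moving target update: zeta_bar <- tau * zeta + (1 - tau) * zeta_bar."""
    return [tau * p + (1 - tau) * tp for tp, p in zip(target_params, params)]
```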

In the policy improvement step, as discussed in (6), the protagonist policy aims to maximize the expected return with entropy while selecting actions with low variance:

J_π(φ) = E_{s∼ρ, u∼π_φ, v∼π̄_ψ} [ Q_θ(s, u, v) − λ₁ σ²_θ(s, u, v) − α log π_φ(u|s) ],   (9)

where Q_θ and σ²_θ denote the mean and variance of the parameterized return distribution. The adversarial policy, as in (7), aims to minimize the expected return while selecting actions with high variance:

J_π̄(ψ) = E_{s∼ρ, u∼π_φ, v∼π̄_ψ} [ Q_θ(s, u, v) − λ₂ σ²_θ(s, u, v) ].   (10)
Suppose the mean Q_θ(s, u, v) and variance σ²_θ(s, u, v) of the return distribution are explicitly given by the network with parameters θ. We can derive the policy gradients of the protagonist and adversary policies using the reparameterization trick:

u = f_φ(ξ_u; s),   v = g_ψ(ξ_v; s),

where ξ_u and ξ_v are auxiliary variables sampled from some fixed distribution (e.g., a standard normal). Then the protagonist policy gradient of (9) can be derived as:

∇_φ J_π(φ) = E [ ∇_u ( Q_θ(s, u, v) − λ₁ σ²_θ(s, u, v) − α log π_φ(u|s) ) ∇_φ f_φ(ξ_u; s) − α ∇_φ log π_φ(u|s) ],

and the adversarial policy gradient of (10) can be approximated with:

∇_ψ J_π̄(ψ) = E [ ∇_v ( Q_θ(s, u, v) − λ₂ σ²_θ(s, u, v) ) ∇_ψ g_ψ(ξ_v; s) ].
Finally, the temperature α is updated by minimizing the following objective:

J(α) = E_{u∼π_φ} [ −α ( log π_φ(u|s) + H̄ ) ],

where H̄ is the expected (target) entropy. The details of our method are summarized in Algorithm 1.
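A single scalar gradient step on this temperature objective can be sketched as follows (a simplified single-sample illustration; assumed learning-rate and floor values are not from the paper):

```python
def update_alpha(alpha, log_prob, target_entropy, lr):
    """One gradient step on J(alpha) = E[-alpha * (log pi(a|s) + H_bar)]:
    alpha rises when the policy's entropy (-log_prob) drops below the target
    H_bar, and falls when the policy is more random than required."""
    grad = -(log_prob + target_entropy)  # single-sample dJ/dalpha
    return max(1e-6, alpha - lr * grad)  # keep the temperature positive
```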

  Initialize parameters θ, φ, ψ and α
  Initialize target parameters θ̄ ← θ, φ̄ ← φ, ψ̄ ← ψ
  Initialize learning rates β_Z, β_π, β_π̄, β_α and τ
  repeat
     Select actions u ∼ π_φ(·|s), v ∼ π̄_ψ(·|s)
     Observe reward r and new state s'
     Store transition tuple (s, u, v, r, s') in buffer B
     Sample N transitions (s, u, v, r, s') from B
     Update return distribution: θ ← θ − β_Z ∇_θ J_Z(θ)
     Update protagonist policy: φ ← φ + β_π ∇_φ J_π(φ)
     Update adversarial policy: ψ ← ψ − β_π̄ ∇_ψ J_π̄(ψ)
     Adjust temperature: α ← α − β_α ∇_α J(α)
     Update target networks using (8)
  until convergence
Algorithm 1 Minimax DSAC Algorithm
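The alternating maximize/minimize structure of Algorithm 1 can be seen in miniature on a toy saddle-point problem (illustrative only; the actual updates act on network parameters θ, φ, ψ):

```python
def train(u, v, lr=0.05, steps=500):
    """Alternating protagonist-ascent / adversary-descent steps on the toy
    saddle objective f(u, v) = -u**2 + v**2 + u*v, whose saddle point is (0, 0).
    Gradients: df/du = -2u + v, df/dv = 2v + u."""
    for _ in range(steps):
        u = u + lr * (-2 * u + v)  # protagonist: gradient ascent in u
        v = v - lr * (2 * v + u)   # adversary: gradient descent in v
    return u, v

u, v = train(1.0, 1.0)  # both players converge toward the saddle (0, 0)
```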

IV Experiments

In this section, we evaluate our algorithm on an autonomous driving task, choosing an intersection as the driving scenario.

IV-A Simulation Environment

We focus on a typical four-direction intersection shown in Fig. 1. Each direction is denoted by its location, i.e., up (U), down (D), left (L) and right (R). The intersection is unsignalized and each direction has one lane. The protagonist vehicle (red car in Fig. 1) attempts to travel from down to up, while two adversarial vehicles (green cars in Fig. 1) drive from right to left and from left to right, respectively. The trajectories of all three vehicles are fixed in advance; as a result, there are two traffic conflict points between the path of the protagonist vehicle and those of the adversarial vehicles, shown as the solid circles in Fig. 1. In our experimental setting, the protagonist vehicle attempts to pass the intersection safely and quickly, while the two adversarial vehicles try to disrupt it by causing a collision.

Fig. 1: Intersection Scenario.

We choose the position and velocity of each vehicle as the state, i.e., (d, v) per vehicle, where d is the distance between the vehicle and the center of the intersection. Note that d is positive when a vehicle is heading toward the center and negative when it is leaving. For the action space, we choose the acceleration of each vehicle and assume that vehicles can strictly follow the desired acceleration. In total, a 6-dimensional continuous state space and a 3-dimensional continuous action space are constructed.
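The 6-dimensional state can be assembled as follows (a minimal sketch; the numerical values are illustrative):

```python
def make_observation(ego, adv1, adv2):
    """Flatten (signed distance to intersection center, velocity) for the
    three vehicles into the 6-dimensional state vector. Distance is positive
    while a vehicle heads toward the center and negative after it leaves."""
    state = []
    for d, v in (ego, adv1, adv2):
        state += [d, v]
    return state

state = make_observation((30.0, 8.0), (25.0, 6.0), (-5.0, 7.0))
```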

The reward function considers both safety and time efficiency. The task is constructed in an episodic manner, with two termination conditions: collision or passing. If the protagonist vehicle passes the intersection safely, a large positive reward of 110 is given; if a collision happens, a large negative reward of -110 is given to the protagonist vehicle; in addition, a minor negative reward of -1 is given at every time step to encourage the protagonist vehicle to pass as quickly as possible. The adversarial vehicles obtain the opposite reward in each of these cases.
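The zero-sum reward described above can be sketched directly from these rules:

```python
def step_reward(passed, collided):
    """Per-step protagonist reward: -110 on collision, +110 on safely
    passing, otherwise -1 to encourage a quick crossing. The adversarial
    vehicles receive the negated reward in every case (zero-sum)."""
    if collided:
        r = -110.0
    elif passed:
        r = 110.0
    else:
        r = -1.0
    return r, -r  # (protagonist reward, adversary reward)
```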

IV-B Algorithm Details and Results

Both the value function and the two policies are approximated by multi-layer perceptrons (MLPs) with 2 hidden layers and 256 units per layer. The policy of the protagonist vehicle aims to maximize the expected future return, while the policy of the adversarial vehicles aims to minimize it. The baseline is standard DSAC [1] without the adversarial policy, in which the protagonist vehicle learns to pass through the intersection in the presence of two randomly behaving surrounding vehicles. We also adopt the asynchronous parallel architecture of DSAC called PABLE, in which 4 learners and 3 actors are used to accelerate learning. The hyperparameters used in training are listed in Table I and the training results are shown in Fig. 2.

Max buffer size
Sample batch size 256
Hidden layers activation GELU
Optimizer type Adam
Adam parameters
Actor learning rate
Critic learning rate
α learning rate
Discount factor γ 0.99
Target update rate τ 0.001
Expected entropy −dim(𝒜)
Clipping boundary b 20

TABLE I: Training hyperparameters

Results show that Minimax DSAC obtained a smaller mean average return, which is explained by the strong disruption the adversary policy exerts on the learning of the protagonist policy. Besides, Minimax DSAC clearly fluctuates more than standard DSAC at the convergence stage. This is because the protagonist vehicle has learned to avoid potential collisions by decelerating, or even stopping and waiting, in the face of the malicious adversarial vehicles, which incurs a per-step penalty and finally results in a lower return.

Fig. 2: Average return during training process. The solid lines correspond to the mean and the shaded regions correspond to 95% confidence interval over 10 runs.

IV-C Evaluation

Compared with performance during training, we are more concerned with performance in situations distinct from the training environment, i.e., the generalization ability. As the adversarial vehicles can be regarded as part of the environment, we can design different driving modes for the adversarial agents to adjust the environment difficulty and thus evaluate the generalization ability of the protagonist policy. Formally, we design three driving modes for the adversarial agents: aggressive, conservative and random. In aggressive mode, the two adversarial vehicles sample their accelerations from a positive interval, while in conservative mode they sample from a negative interval. In random mode, one adversarial vehicle samples its acceleration from the positive interval and the other from the negative interval.
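The three evaluation modes can be sketched as below. Note the interval bounds (±2 m/s²) are assumed for illustration; the exact ranges are not restated here:

```python
import random

def sample_adversary_accels(mode, rng=None):
    """Sample one-step accelerations for the two adversarial vehicles under a
    given evaluation mode. Interval bounds are illustrative assumptions."""
    rng = rng or random.Random(0)
    positive, negative = (0.0, 2.0), (-2.0, 0.0)
    if mode == "aggressive":
        intervals = (positive, positive)
    elif mode == "conservative":
        intervals = (negative, negative)
    else:  # "random": one vehicle accelerates, the other decelerates
        intervals = (positive, negative)
    return tuple(rng.uniform(lo, hi) for lo, hi in intervals)

a1, a2 = sample_adversary_accels("random")
```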

The comparison of the two methods under the three modes is shown in Fig. 3, in which the corresponding p-values are also marked. Results show that Minimax DSAC greatly improves performance under different modes of the adversarial vehicles, especially in the aggressive and random modes. In conservative mode, the two algorithms show only a minor difference, because the adversarial vehicles drive at the lowest speed within the limit, so fewer potential collisions with the protagonist occur. Even so, Minimax DSAC still obtained a higher return because it adopted larger accelerations to improve passing efficiency. The t-test results in Fig. 3 show that the average return of DSAC is significantly smaller than that of Minimax DSAC.

Fig. 3: Average return during the testing process. Each boxplot is drawn based on 20 evaluation episodes.

Fig. 4: Result visualization of different policies. The brown line in each subplot shows the velocity of the protagonist vehicle and the colorbar shows the location of all vehicles at each time step. Performance of (a) Minimax DSAC under aggressive mode (crossing in 8.5 s), (b) Minimax DSAC under conservative mode (crossing in 7.4 s), (c) Minimax DSAC under random mode (crossing in 6.1 s), (d) DSAC under aggressive mode (a failed pass; collision at 2.3 s), (e) DSAC under conservative mode (crossing in 7.6 s), (f) DSAC under random mode (crossing in 8.9 s).

Fig. 4 shows the control effect of the trained protagonist policies under the same behavior of the adversarial vehicles. In aggressive mode, the protagonist vehicle of Minimax DSAC learned to decelerate early and wait until the adversarial vehicles had passed, while the DSAC agent suffered a collision resulting from its high speed. In conservative mode, both protagonist vehicles adopted a similar strategy and passed successfully, with Minimax DSAC achieving a slightly shorter passing time; in this environment, driving fast and passing first leads to fewer collisions and better efficiency. In random mode, Minimax DSAC adjusted its speed more flexibly and passed the intersection in less time (6.1 s) than standard DSAC (8.9 s).

To sum up, although Minimax DSAC obtained a smaller average return during training, it maintains better performance when encountering different kinds of environmental variations.

V Conclusion

In this paper, we combine the minimax formulation with the distributional framework to improve the generalization ability of RL algorithms, in which the protagonist agent must compete with an adversarial agent to learn how to behave well. Based on the DSAC algorithm, we propose the Minimax DSAC algorithm and implement it on an autonomous driving task at intersections. Results show that our algorithm significantly improves the protagonist agent's robustness to environmental variations. This study provides a promising approach for accelerating the application of RL algorithms in the real world, e.g., in autonomous driving, where algorithms are usually developed in a simulator that differs from the real environment.


References

  1. J. Duan, Y. Guan, S. E. Li, Y. Ren and B. Cheng (2020) Distributional soft actor-critic: off-policy reinforcement learning for addressing value estimation errors. arXiv preprint arXiv:2001.02811.
  2. J. Duan, S. E. Li, Y. Guan, Q. Sun and B. Cheng (2020) Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data. IET Intelligent Transport Systems 14 (5), pp. 297–305.
  3. J. García and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480.
  4. I. J. Goodfellow, J. Shlens and C. Szegedy (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations (ICLR).
  5. T. Haarnoja, A. Zhou, P. Abbeel and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
  6. A. Mandlekar, Y. Zhu, A. Garg, L. Fei-Fei and S. Savarese (2017) Adversarially robust policy learning: active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3932–3939.
  7. J. Morimoto and K. Doya (2005) Robust reinforcement learning. Neural Computation 17 (2), pp. 335–359.
  8. C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun and D. Song (2018) Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282.
  9. X. Pan, D. Seita, Y. Gao and J. Canny (2019) Risk averse robust adversarial reinforcement learning. arXiv preprint arXiv:1904.00511.
  10. A. Pattanaik, Z. Tang, S. Liu, G. Bommannan and G. Chowdhary (2018) Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pp. 2040–2042.
  11. X. B. Peng, M. Andrychowicz, W. Zaremba and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
  12. L. Pinto, J. Davidson, R. Sukthankar and A. Gupta (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 2817–2826.
  13. L. A. Prashanth and M. Fu (2018) Risk-sensitive reinforcement learning: a constrained optimization viewpoint. arXiv preprint arXiv:1810.09126.
  14. A. Rajeswaran, S. Ghotra, B. Ravindran and S. Levine (2016) EPOpt: learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283.
  15. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam and M. Lanctot (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–486.
  16. A. Tamar, Y. Glassner and S. Mannor (2015) Optimizing the CVaR via sampling. In 29th AAAI Conference on Artificial Intelligence (AAAI).
  17. C. Zhao, O. Sigaud, F. Stulp and T. M. Hospedales (2019) Investigating generalisation in continuous deep reinforcement learning. arXiv preprint arXiv:1902.07015.