# Adversarial Inverse Reinforcement Learning for Decision Making in Autonomous Driving

###### Abstract

Generative Adversarial Imitation Learning (GAIL) is an efficient way to learn sequential control strategies from demonstration. Adversarial Inverse Reinforcement Learning (AIRL) is similar to GAIL but also learns a reward function at the same time and has better training stability. In previous work, however, AIRL has mostly been demonstrated on robotic control in artificial environments. In this paper, we apply AIRL to a practical and challenging problem, decision making in autonomous driving, and also augment AIRL with a semantic reward to improve its performance. We use four metrics to evaluate its learning performance in a simulated driving environment. Results show that the vehicle agent can learn decent decision-making behaviors from scratch and can reach a level of performance comparable with that of an expert. Additionally, the comparison with GAIL shows that AIRL converges faster and achieves better and more stable performance than GAIL.

## I Introduction

The application of Reinforcement Learning (RL) in robotics has been very fruitful in recent years. The application tasks range from inverted helicopter flight [13] to robot soccer [15] and other robotic manipulation tasks [4][18]. Though the results are encouraging, one significant barrier to applying RL to real-world problems is the required definition of the reward function, which is typically unavailable or infeasible to design in practice.

Inverse Reinforcement Learning (IRL) [14] aims to tackle such problems by learning the reward function from expert demonstrations, thus avoiding reward function engineering and making good use of the collected data. Maximum entropy IRL [22][23] is commonly used, where trajectories are assumed to follow a Boltzmann distribution based on a cost function. However, because of the expensive reinforcement learning procedure in the inner loop, it has limited application in problems involving high-dimensional state and action spaces [6].

Generative Adversarial Imitation Learning (GAIL) [6] overcomes these challenges by learning a policy against a discriminator that tries to distinguish learnt actions from expert actions. It is very efficient because the policy generator and discriminator can be iteratively updated in a single loop. Guided Cost Learning (GCL) [4] tackles the problem from another direction, where importance sampling is used to estimate the partition function in the maximum entropy IRL formulation. Finn et al. [3] show that GAIL with a properly reparameterized discriminator is mathematically equivalent to GCL with some tweaks in importance sampling. Adversarial Inverse Reinforcement Learning (AIRL) [5] combines the GAIL and GCL formulations and learns the cost function together with the policy in an adversarial way. Compared to AIRL, GAIL does not recover a cost function and may suffer from training instability issues. In the implementation of both GAIL and AIRL, the discriminator takes individual state-action pairs instead of whole trajectories as input to reduce the high variance problem that exists in GCL. While the AIRL formulation is appealing, to our knowledge it has only been applied to robotic control in OpenAI Gym environments.

Autonomous driving is a complicated problem as it involves extensive interactions with other vehicles in a dynamically changing environment. This is particularly true for the decision-making task, which needs to monitor the environment and decide on maneuvering commands for the control module. In this paper, we apply AIRL to learn the challenging decision-making behavior in a simulated environment where each vehicle is interacting with all other vehicles in its surroundings. Our study, different from RL based studies, makes good use of demonstration data and, different from IRL based studies, learns both a reward function and a policy.

## II Related Work

The applications of adversarial learning based algorithms are mostly verified with control tasks in OpenAI Gym [4][6][5]. For example, GAIL was tested on 9 control tasks, which were either classic control tasks in OpenAI Gym or 3D control tasks in MuJoCo [6]. In AIRL, the experiments for testing the algorithm were a simple tabular MDP, 2D point mass navigation, and a running ant [5]. In GCL, the algorithm was verified on both simulated and physical robotic arms, but the experimental environment was stationary and there was no interaction with the environment [4]. Among these algorithms, AIRL has shown more robust performance than GAIL and GCL in applications where the dynamics of the environment change. Singh et al. [18] apply adversarial learning to real-world robotic control tasks like placing books and arranging objects without reward crafting, and the learning is based on a fraction of the state space with labels instead of entire expert demonstration trajectories.

The decision-making task in our use case of lane changes is much more complex than the OpenAI Gym-based experiments implemented in the original papers [4][6][5]. First, the decision-making task involves more complicated factors, as the goal is not just to change to the target lane; it also involves other criteria such as safety (e.g. not crashing into other objects during the lane change), efficiency (e.g. not being so cautious that the lane change takes too long), and comfort (e.g. not causing an unsmooth travelling experience). In contrast, the goal in the OpenAI Gym environments used in [4] and [5] is clear and simple, namely reaching a goal position or velocity. Second, different from the stationary background environment in the OpenAI Gym tasks, the driving environment we experiment with is dynamically changing, which creates diverse driving situations and complicates the rewarding mechanism, making the learning much more challenging. Last but not least, the decision-making task itself is a complicated problem as it is entangled with vehicle control. A decision impacts the control action at the next step, and the control action in turn affects the selection of the subsequent decision. This coupled effect makes the decision-making problem challenging.

In autonomous driving, many recent studies have applied RL to the vehicle control task [12][8][21]. Some studies have begun to apply it to decision making [17][20]. Only a few studies have applied adversarial learning to driving. One such work is by Kuefler et al. [9], who applied GAIL to learn a lane-keeping task in a driving simulator, with a recurrent neural network as the policy network. The task is relatively simple as lane keeping requires fewer interactions with road users. To the best of our knowledge, no prior work has applied AIRL to practical tasks in autonomous driving that need to deal with interactions with surrounding vehicles. Our work explores its feasibility for handling the challenging decision-making task.

## III Methodology

AIRL is directly related to maximum entropy IRL [23] and Guided Cost Learning [4]. It uses a special form of the discriminator different from that used in GAIL, and recovers a cost function and a policy simultaneously, as in GCL, but in an adversarial way. In this section, we give the details of AIRL and the augmented AIRL that we adopt in this paper.

### III-A Preliminaries

#### Maximum entropy inverse reinforcement learning

Maximum entropy Inverse Reinforcement Learning models the distribution of the demonstrated behaviors with a Boltzmann distribution [22]

$$p_\theta(\tau) = \frac{1}{Z}\exp\left(-c_\theta(\tau)\right) \tag{1}$$

where $\tau$ is a behavior sequence from the demonstrated data; $c_\theta(\tau)$ is the unknown cost function parameterized by $\theta$ that we want to learn; and $Z$ is the partition function, i.e. the integral of $\exp(-c_\theta(\tau))$ over all possible trajectories from the environment. Estimating this term is the main barrier to applying maximum entropy IRL to high-dimensional or continuous problems.

The goal of IRL is to find the cost function by solving a maximum likelihood problem

$$\max_\theta \; \mathbb{E}_{\tau \sim \mathcal{D}}\left[\log p_\theta(\tau)\right]$$

where $\mathcal{D}$ is the dataset of demonstrations. Among the studies that focus on estimating $Z$ [22][10][7], Finn et al. [4] proposed an importance sampling based method, Guided Cost Learning, to estimate $Z$ for high-dimensional and continuous problems.
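As a concrete illustration of the Boltzmann model in (1), the sketch below evaluates it over a small, finite set of trajectories, where the partition function reduces to a sum. The function name `boltzmann_distribution` and the cost values are hypothetical and chosen only for illustration; in the driving task the trajectory space is far too large to enumerate, which is why the partition function has to be estimated.

```python
import math


def boltzmann_distribution(costs):
    """Return (probabilities, Z) for p(tau) = exp(-c(tau)) / Z over a finite set."""
    # Shift by the minimum cost before exponentiating for numerical stability.
    c_min = min(costs)
    shifted = [math.exp(-(c - c_min)) for c in costs]
    z_shifted = sum(shifted)
    probs = [w / z_shifted for w in shifted]
    # Undo the shift to recover the true partition function.
    z = z_shifted * math.exp(-c_min)
    return probs, z


# Hypothetical costs of three candidate trajectories (illustrative values only).
costs = [1.0, 2.0, 4.0]
probs, z = boltzmann_distribution(costs)

# Negative log-likelihood of observing the first trajectory as a demonstration:
# -log p(tau) = c(tau) + log Z.
nll_first = costs[0] + math.log(z)
```

Lower-cost trajectories receive exponentially more probability mass, and the negative log-likelihood of a demonstration splits into its cost plus the log partition function.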

#### Guided cost learning

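Guided Cost Learning introduces a new sampling distribution $q(\tau)$ to generate samples for the estimation of $Z$. It is verified that $q(\tau)$, at optimality, is proportional to the maximum entropy IRL based distribution, $q(\tau) \propto \exp(-c_\theta(\tau))$; therefore, with $p$ denoting the true distribution of the demonstrations, the objective of the cost learning can be reformulated as

$$\mathcal{L}_{cost}(\theta) = \mathbb{E}_{\tau \sim p}\left[c_\theta(\tau)\right] + \log \mathbb{E}_{\tau \sim q}\left[\frac{\exp(-c_\theta(\tau))}{q(\tau)}\right]$$

The update of the importance sampling distribution is a policy optimization procedure with the goal of minimizing the KL divergence between $q(\tau)$ and the Boltzmann distribution $\frac{1}{Z}\exp(-c_\theta(\tau))$, which results in an objective of minimizing the learned cost function and maximizing the distribution's entropy

$$\mathcal{L}_{sampler}(q) = \mathbb{E}_{\tau \sim q}\left[c_\theta(\tau)\right] + \mathbb{E}_{\tau \sim q}\left[\log q(\tau)\right]$$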

The cost learning step and policy optimization step alternate until convergence is reached.
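The importance sampling idea behind GCL can be sketched on a toy problem with a handful of discrete trajectories, where the exact partition function is available for comparison. This is an illustrative sketch only: the costs, the sampling distribution `q`, and the function names are assumptions, and the real method works with sampled trajectories and a learned cost network.

```python
import math
import random


def exact_partition(costs):
    """Z as a plain sum over a finite trajectory set."""
    return sum(math.exp(-c) for c in costs)


def estimate_partition(costs, q, n_samples, rng):
    """Importance-sampling estimate of Z: the mean of exp(-c(tau)) / q(tau), tau ~ q."""
    indices = rng.choices(range(len(costs)), weights=q, k=n_samples)
    return sum(math.exp(-costs[i]) / q[i] for i in indices) / n_samples


# Hypothetical trajectory costs and sampling distribution (illustrative values).
costs = [0.5, 1.0, 1.5, 2.0, 3.0]
q = [0.30, 0.25, 0.20, 0.15, 0.10]

rng = random.Random(0)
z_exact = exact_partition(costs)
z_est = estimate_partition(costs, q, 20000, rng)

# When q is proportional to exp(-c), every importance weight equals Z,
# so the estimator has zero variance.
q_opt = [math.exp(-c) / z_exact for c in costs]
weights_opt = [math.exp(-c) / qi for c, qi in zip(costs, q_opt)]
```

The last two lines show why the sampler is pushed toward the Boltzmann distribution: the closer `q` is to it, the less the importance weights vary.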

### III-B Adversarial Inverse Reinforcement Learning

#### The form of discriminator

To be consistent with the previous section, we still use $p(\tau)$ to denote the true distribution of the demonstration, and $q(\tau)$ is the generator's density. Based on the mathematical proof in [3], a new form of discriminator can be designed as in (2), where $p(\tau)$ is estimated by the maximum entropy IRL distribution.

$$D_\theta(\tau) = \frac{\frac{1}{Z}\exp(-c_\theta(\tau))}{\frac{1}{Z}\exp(-c_\theta(\tau)) + q(\tau)} \tag{2}$$

The optimal solution of this form of discriminator is independent of the generator, which improves training stability, as we show later in our training results.

The use of the full trajectory in the cost calculation could result in high variance. Instead, a revised version of the discriminator based on individual state-action pairs is used in [5] to remedy this issue. By subtracting a constant from the cost function, the discriminator can be changed to the following form as in (3), where $q(\tau)$ is equivalently replaced with $\pi(a|s)$.

$$D_\theta(s,a) = \frac{\exp(-c_\theta(s,a))}{\exp(-c_\theta(s,a)) + \pi(a|s)} \tag{3}$$
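A minimal sketch of evaluating such a state-action discriminator, assuming it takes the form $D = \exp(f) / (\exp(f) + \pi(a|s))$ with $f = -c_\theta(s,a)$ as in the AIRL formulation [5]. Dividing numerator and denominator by $\pi(a|s)$ turns it into a logistic sigmoid of $f - \log \pi(a|s)$, which avoids overflow for large magnitudes. The function name and the inputs are hypothetical.

```python
import math


def discriminator(neg_cost, log_pi):
    """D(s, a) = exp(f) / (exp(f) + pi(a|s)), computed as sigmoid(f - log pi).

    neg_cost: f = -c_theta(s, a), the learned negative cost.
    log_pi:   log pi(a|s), the log-probability of the action under the policy.
    """
    x = neg_cost - log_pi
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


# Illustrative values: a learned negative cost of -1.0 and an action probability of 0.3.
d = discriminator(-1.0, math.log(0.3))
d_direct = math.exp(-1.0) / (math.exp(-1.0) + 0.3)
```

When $\exp(f)$ equals the policy probability, the discriminator outputs 0.5, i.e. it cannot tell expert from generated pairs.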

#### Objectives of Discriminator and Generator

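The objective of the discriminator is to distinguish whether the state-action pair is from the expert or is generated. This corresponds to minimizing the sum of two cross-entropy terms, one over the expert demonstrations $\mathcal{D}$ and one over samples from the current policy $\pi$:

$$\mathcal{L}_{disc}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[-\log D_\theta(s,a)\right] + \mathbb{E}_{(s,a) \sim \pi}\left[-\log\left(1 - D_\theta(s,a)\right)\right] \tag{4}$$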
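The two-term discriminator objective can be sketched as a binary cross-entropy over a batch of expert pairs (label 1) and a batch of generated pairs (label 0). This is a hedged illustration that assumes the standard logistic-regression loss; the function name, the clamping constant, and the batch values are hypothetical.

```python
import math

EPS = 1e-12  # clamp to keep log() finite; an implementation detail assumed here


def discriminator_loss(d_expert, d_generated):
    """Sum of the mean -log D over expert pairs and the mean -log(1 - D) over generated pairs."""
    def clamp(d):
        return min(max(d, EPS), 1.0 - EPS)

    expert_term = -sum(math.log(clamp(d)) for d in d_expert) / len(d_expert)
    generated_term = -sum(math.log(1.0 - clamp(d)) for d in d_generated) / len(d_generated)
    return expert_term + generated_term


# Discriminator outputs on hypothetical batches.
loss_confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])    # cannot tell the two apart
loss_confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])   # separates them well
```

With this convention each term equals $\log 2$ when the discriminator outputs 0.5 everywhere, and the loss falls toward zero as the two sources become separable.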

The generator’s task is to produce samples that fool the discriminator, i.e. to minimize the log probability of its samples being classified as generated. This signal alone is not enough to train the generator when the discriminator quickly learns to distinguish between generated data and expert data. Therefore, another term, interpreted as the discriminator’s confusion [3], is added to the generator’s loss as

$$\mathcal{L}_{gen}(\pi) = \mathbb{E}_{(s,a) \sim \pi}\left[\log\left(1 - D_\theta(s,a)\right)\right] - \mathbb{E}_{(s,a) \sim \pi}\left[\log D_\theta(s,a)\right] \tag{5}$$

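To convert it into an RL formulation, we obtain the following form as the generator’s objective

$$\max_\pi \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \left(\log D_\theta(s_t,a_t) - \log\left(1 - D_\theta(s_t,a_t)\right)\right)\right] \tag{6}$$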

The reward function can be extracted as in (7) and is used in our policy optimization.

$$r(s,a) = \log D_\theta(s,a) - \log\left(1 - D_\theta(s,a)\right) \tag{7}$$
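A short numerical check of the extracted reward, assuming the discriminator has the form $D = \exp(f) / (\exp(f) + \pi(a|s))$ with $f$ the learned negative cost. Under that assumption, $\log D - \log(1 - D)$ simplifies to $f - \log \pi(a|s)$. The function names and the numeric values are hypothetical.

```python
import math


def discriminator(f, pi):
    """Assumed AIRL-style discriminator on a single state-action pair."""
    return math.exp(f) / (math.exp(f) + pi)


def airl_reward(d):
    """Reward extracted from the discriminator output: log D - log(1 - D)."""
    return math.log(d) - math.log(1.0 - d)


f, pi = -1.2, 0.25          # illustrative negative cost and action probability
d = discriminator(f, pi)
r = airl_reward(d)
r_closed_form = f - math.log(pi)
```

The reward is positive when the discriminator believes the pair is more likely from the expert (D > 0.5) and negative otherwise, so the policy is pushed toward expert-like decisions.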

#### Optimization

The optimization of the discriminator and generator is similar to the idea of GCL in which the cost learning interleaves with the policy optimization procedure.

The optimization of discriminator is based on (4) as a binary logistic regression problem. The updated discriminator is fed to the reward function (7) for policy optimization. In GCL, the policy is optimized based on guided policy search [11], a model-based method for unknown dynamics in which it iteratively fits a time-varying linear dynamic model to the generated samples [4]. As the dynamic driving environment is too complicated to learn for the driving task, we instead use a model-free policy optimization method, Trust Region Policy Optimization (TRPO) [16], to update the policy by using the updated reward function.

### III-C Augmented AIRL

In AIRL, the reward function is formulated purely from the viewpoint of adversarial learning theory and learns the intrinsic rewarding mechanism embedded in the demonstrated behaviors. If we apply some domain knowledge and augment the learned reward function with a semantic but sparse reward signal, it should provide the agent with informative guidance and help it learn faster. Based on this insight, we augment the reward function with intuitively defined semantic rewards.

#### Semantic Reward

The semantic reward should be easily defined and obtained. In our study case, a clear goal is to successfully complete the lane change and not to collide with other objects. Additionally, we encourage the agent to initiate the lane-change as soon as possible once it receives the lane-change command, i.e. select the closest gap and commit lane change immediately, which corresponds to action 2 according to our definition (the definition of the action space is deferred to Section IV along with further explanations of the decision-making task). Therefore, we define the semantic reward as follows.

We add the semantic term to the generator reward function as follows.

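$$\hat{r}(s,a) = \log D_\theta(s,a) - \log\left(1 - D_\theta(s,a)\right) + r_s(s,a) \tag{8}$$

where $r_s(s,a)$ denotes the semantic reward. This treatment is equivalent to adding $r_s(s,a)$ to $-c_\theta(s,a)$ for policy updates, as it can be proved from the discriminator form (3) that

$$\log D_\theta(s,a) - \log\left(1 - D_\theta(s,a)\right) = -c_\theta(s,a) - \log \pi(a|s)$$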

In effect, when this form of reward is used to optimize the policy, we learn an entropy-regularized policy that maximizes the accumulated reward while keeping the policy as stochastic as possible. This makes the policy more robust to system noise. Thus the learned reward function characterizes the demonstrated behavior and can potentially be applied to other environments, and the learned policy can be used to generate decision-making strategies similar to those demonstrated by experts.
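The augmentation can be sketched as adding a sparse, hand-defined term to the learned reward. Everything specific below is an assumption made for illustration: the three semantic components follow the goals described in the text (complete the lane change, avoid collisions, prefer action 2), but their numeric values, the names, and the way they are combined are hypothetical and are not the values used in the experiments.

```python
import math

# Hypothetical semantic reward values (placeholders, not the paper's settings).
R_SUCCESS = 1.0      # assumed bonus when the lane change is completed
R_COLLISION = -1.0   # assumed penalty when a collision occurs
R_COMMIT = 0.1       # assumed bonus for action 2 (closest gap, commit immediately)


def semantic_reward(action, success=False, collision=False):
    """Sparse reward built from domain knowledge; zero on most steps."""
    reward = 0.0
    if success:
        reward += R_SUCCESS
    if collision:
        reward += R_COLLISION
    if action == 2:
        reward += R_COMMIT
    return reward


def augmented_reward(d, action, success=False, collision=False):
    """Learned adversarial reward log D - log(1 - D) plus the semantic term."""
    learned = math.log(d) - math.log(1.0 - d)
    return learned + semantic_reward(action, success, collision)
```

For example, `augmented_reward(0.5, 2)` returns only the assumed commit bonus, because the learned term vanishes when the discriminator is undecided.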

The core of the training procedure of the augmented AIRL method is in Algorithm 1.

## IV Decision-making Task

Decision-making is a broad term and the definition of the task may vary in different applications. In our study, the decision-making task is defined as selecting an appropriate gap on the target lane and deciding whether to merge into it at the current step, when a lane change requirement (e.g. "changing to the left/right lane") is issued by a separate module at the strategic level. In other words, the task of our focus includes a longitudinal decision, the selection of a target gap, and a lateral decision, whether to commit the lane change right now. After the decision is made, a low-level controller calculates the corresponding control commands to execute the decision.

To apply AIRL to learn the decision making behavior, we need to design reasonable action space and state space, and make their distributions representative and distinguishable. Details about the action space and state space are described in the next subsections.

### IV-A Action Space

The longitudinal decision corresponds to the target gap selection. Fig. 1 demonstrates a typical driving situation with candidate gaps that the ego vehicle may choose when it is to execute a left lane-change command. Gap 0, gap 1, gap 2 and gap 3 refer to the front gap, alongside gap, back gap and the current-lane gap, respectively. The decision on the lateral movement is defined as a binary decision task, i.e. whether to execute the lateral movement of the lane-change behavior at the current step.

In combination, there are 8 action pairs. When practical situations are taken into account, some combinations are not preferable. Take the action pair (0, 1) for example: if gap 0 is selected, the lateral decision should not be 1 (i.e. the agent should not take the action of "committing lane change right now"), as the vehicle has to adjust its longitudinal movement first to position itself at a proper location. Similarly, the combinations (2, 1) and (3, 1) should also be avoided in order to reduce the agent’s unnecessary exploration.

To keep it simple, we use a one-dimensional vector to represent the five possible action tuples. The mapping is $0 \mapsto (0, 0)$, $1 \mapsto (1, 0)$, $2 \mapsto (1, 1)$, $3 \mapsto (2, 0)$, $4 \mapsto (3, 0)$. Interestingly, such an action space design allows the agent to perform the behavior of "abort". That is, if the vehicle agent detects itself in a risky lane-change situation, it can choose to abort the lane change, return to its original lane, and wait for the decision at the next step. This is an appealing feature that not many learning-based decision-making studies have embedded.
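The pruning of the action space described above can be sketched as follows. The ordering of the five remaining tuples is an assumption (sorted by gap index, then lateral decision); it is consistent with the statement elsewhere in the paper that action 2 means selecting the closest gap and committing immediately. All identifier names are hypothetical.

```python
from itertools import product

GAPS = (0, 1, 2, 3)   # front gap, alongside gap, back gap, current-lane gap
LATERAL = (0, 1)      # 0: do not move laterally now, 1: commit the lane change now

# Pairs ruled out in the text: committing while targeting gap 0, 2 or 3.
EXCLUDED = {(0, 1), (2, 1), (3, 1)}

ALL_PAIRS = list(product(GAPS, LATERAL))
VALID_PAIRS = [pair for pair in ALL_PAIRS if pair not in EXCLUDED]

# Assumed ordering: enumerate the valid pairs in sorted (gap, lateral) order.
ACTION_TO_PAIR = dict(enumerate(VALID_PAIRS))
PAIR_TO_ACTION = {pair: action for action, pair in ACTION_TO_PAIR.items()}
```

A policy network then only needs a five-way categorical output, and `ACTION_TO_PAIR[action]` recovers the (gap, lateral) decision handed to the low-level controller.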

### IV-B State Space

In our daily driving, we normally pay attention only to the parts of the scene that relate to our driving task rather than to all the information in the observable field of view. This subconscious act lowers our burden of scene understanding and lets us focus on the most relevant information. A similar idea should be applied to the learning agent. Instead of providing all the available information that can be gathered by sensors, we choose to use only observations that are relevant to the decision-making task. That is, the state space in our study only includes features from relevant vehicles, i.e. vehicles from gap 0 to gap 3, as shown in Fig. 1. In this study, we use observations and states interchangeably, and do not consider hidden states.

For each relevant vehicle, we gather the vehicle kinematics that can be derived from on-board sensors, including vehicle speed, acceleration, position, lane id, vehicle id, etc. For the ego vehicle, we gather similar information plus a target lane id. In total, there are 44 features.

### IV-C Expert Data

The demonstrations can be data collected from real-world driving or from a well-developed simulator. Since the training involves data generation by current policy, for consistency, we use a driving simulator to collect the demonstration data. The training can be similarly carried out with real-world driving data from physical systems when available.

Generating expert behaviour is not a straightforward task. If rules can be formulated to produce reasonable behaviours in all random traffic scenarios, there is little need for learning methods as rule-based solutions will be adequate. Instead of defining explicit rules for all possible situations, in this work we generate expert data by imposing criteria that will lead to safe, efficient, and smooth driving maneuvers.

The primary concern is safety, namely, no crash is introduced by the lane-change vehicle’s behaviour. Safety in simulation is ensured, on the one hand, by exploiting the properties of behaviour models such as the Intelligent Driver Model [19], and on the other hand, by implementing an enforced safety threshold. Another point is efficiency, i.e. arriving at the target lane as fast as possible. By calculating a dummy time point for the completion of a lane-change task based on the current kinematics of both the lane-changing vehicle and the interacting vehicles, we can simulate behaviors that pursue driving efficiency. Comfort is also considered by minimizing the jerk for both the lane-changing vehicle and the vehicles that it interacts with.

With these practical considerations implemented in the simulator, the expert data we gather from simulation can represent reasonable driving behaviors under diverse driving situations.

## V Experiment

In our experiment, we evaluate the performance of AIRL and augmented AIRL on the decision-making task and compare the results with GAIL and expert data in simulated highway driving scenarios.

### V-A Simulation Environment

The simulation environment is a highway segment 500 meters long with three lanes in each direction. The traffic is moderately dense, with headways between adjacent vehicles of about 1-3 seconds, creating diverse challenging situations. Vehicles are generated randomly on all lanes and perform car-following behaviors modeled with an adapted Intelligent Driver Model [19]. The ego vehicle initially enters on the middle lane and receives a left or right lane-change command after travelling around 50 meters on the road. Once it receives the command, the ego vehicle starts the decision-making function modeled by the aforementioned adversarial algorithms. After a decision is made, a low-level controller, consisting of PID [1] based lateral control and sliding-mode [2] based longitudinal control, is activated to execute the decision. In our study, we allow the agent to perform the decision-making at a frequency of 10 Hz and we do not involve a planning model for near-future trajectory planning. A demonstration of the simulated environment is shown in Fig. 2.
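For reference, the car-following acceleration of the standard Intelligent Driver Model [19] can be sketched as below. The simulator uses an adapted version whose modifications are not specified here, so this is the textbook form only, and all parameter values are illustrative assumptions rather than the simulator's settings.

```python
import math

# Illustrative parameters (assumed, not taken from the simulator).
V_DESIRED = 30.0   # desired speed v0 [m/s]
T_HEADWAY = 1.5    # desired time headway T [s]
A_MAX = 1.0        # maximum acceleration a [m/s^2]
B_COMF = 2.0       # comfortable deceleration b [m/s^2]
S_MIN = 2.0        # minimum standstill gap s0 [m]
DELTA = 4.0        # acceleration exponent


def idm_acceleration(v, v_lead, gap):
    """Standard IDM acceleration for a follower at speed v behind a leader.

    v:      follower speed [m/s]
    v_lead: leader speed [m/s]
    gap:    bumper-to-bumper distance to the leader [m], must be positive
    """
    dv = v - v_lead  # approach rate, positive when closing in
    desired_gap = S_MIN + v * T_HEADWAY + v * dv / (2.0 * math.sqrt(A_MAX * B_COMF))
    return A_MAX * (1.0 - (v / V_DESIRED) ** DELTA - (desired_gap / gap) ** 2)


a_free = idm_acceleration(0.0, 0.0, 1e9)        # standstill on an empty road
a_cruise = idm_acceleration(30.0, 30.0, 1e9)    # already at the desired speed
a_closing = idm_acceleration(30.0, 20.0, 20.0)  # fast approach to a slower leader
```

The free-road term drives the vehicle toward its desired speed, while the interaction term brakes it when the actual gap falls below the desired gap.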

The simulator supports many features. For example, other vehicles can yield to or overtake the lane-changing vehicle, and the ego vehicle can also adapt its own speed to merge into gaps ahead of or behind it. These features enrich the diversity of vehicle behaviors and make the driving environment closer to a real-world setting. With this simulator, we can learn interactive decision-making behaviors.

### V-B Evaluation Metrics

The selection of proper evaluation metrics is important for assessing the performance of the algorithms. Simply using the accumulated reward and episode length (as in [4] and [5]) is not enough for evaluating the complicated lane-change behavior. In our study, we propose to use four metrics. The first is the successful lane change ratio, i.e. the number of successful lane change cases divided by the total number of lane change cases. The second is the number of decision-making steps that a lane change process consumes, counted from the moment the lane change command is issued to the time when the lane change process finishes. The third is the number of changing-lane steps, counted from the start of lateral movement to the completion of the lane change process. The last metric is the total reward accumulated during the lane-changing procedure. The reward value is obtained from a reward function with multiple components that we defined in the simulator. It is used as a criterion for evaluating the performance of different algorithms.
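The four metrics can be computed from per-episode logs along the following lines. The `Episode` record and its field names are hypothetical, and averaging the two step counts over successful episodes only is an assumption of this sketch, since a failed lane change has no completion step.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Episode:
    command_step: int                  # step at which the lane-change command is issued
    lateral_start_step: Optional[int]  # step at which lateral movement starts, if any
    end_step: int                      # step at which the lane change finishes or the episode ends
    success: bool                      # whether the lane change was completed
    rewards: List[float]               # per-step simulator rewards


def evaluate(episodes):
    """Return the four metrics; assumes at least one successful episode."""
    successes = [e for e in episodes if e.success]
    return {
        "success_ratio": len(successes) / len(episodes),
        "decision_steps": mean(e.end_step - e.command_step for e in successes),
        "changing_steps": mean(e.end_step - e.lateral_start_step for e in successes),
        "total_reward": mean(sum(e.rewards) for e in episodes),
    }


# Two hypothetical episodes: one completed lane change and one failure.
episodes = [
    Episode(command_step=50, lateral_start_step=70, end_step=120, success=True, rewards=[1.0, 2.0]),
    Episode(command_step=50, lateral_start_step=None, end_step=90, success=False, rewards=[-1.0]),
]
metrics = evaluate(episodes)
```

Keeping the decision steps and the changing steps separate makes it possible to tell a hesitant agent (many decision steps, few changing steps) from one that commits early.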

### V-C Training Results

The neural network architectures of the discriminator and generator are the same for the three algorithms (i.e. GAIL, AIRL, AugAIRL): two hidden layers of 100 units each for the generator, and two hidden layers of 512 units each for the discriminator. We use the same amount of expert data (5000 lane-change episodes) in the three trainings. Fig. 7 depicts the training curves of the four metrics over around 15,000 iterations. Table I gives the exact results of the three algorithms from the last saved models.

| Algorithms | Total Reward (Mean) | Total Reward (Std) | Success Ratio (Mean) | Success Ratio (Std) | Decision Steps (Mean) | Decision Steps (Std) | Changing Steps (Mean) | Changing Steps (Std) | Discrim. Loss (Mean) | Discrim. Loss (Std) |
|---|---|---|---|---|---|---|---|---|---|---|
| GAIL | 15.67 | 5.78 | 0.955 | 0.071 | 77.44 | 12.82 | 55.57 | 3.6 | – | – |
| AIRL | 17.58 | 4.58 | 0.99 | 0.01 | 67.54 | 8.41 | 52.53 | 3.56 | 0.694 | 0.032 |
| AIRL+SPARSE | 18.87 | 2.81 | 0.99 | 0.01 | 64.30 | 5.66 | 56.31 | 3.36 | 0.625 | 0.035 |
| Expert | 24.21 | – | 1.00 | – | 67.74 | – | 58.18 | – | – | – |

As shown in Fig. 7, the metric curves from the three adversarial methods all show convergence. In particular, the AIRL based methods demonstrate better performance and converge faster than GAIL in all four metrics. It is also noticeable that training with both AIRL methods is more stable than with GAIL, which supports the effectiveness of the special form of the discriminator.

More interesting results can be obtained from exploring the curves of the AIRL methods. From the success ratio curves we can observe that the agent completes lane changes with a high success rate, approaching the expert performance of 1.0, which is very appealing since the agent learns the decision-making strategy from scratch.

The values of the decision-making steps and lane-changing steps of AIRL methods indicate that the lane change behavior at convergence stays quite stable and consumes similar maneuvering time to that of the expert. Additionally, from these two curves, we can infer that at the beginning of the training, the agent is quite conservative and hesitates to commit lane change as the decision steps are high and the actual lane-changing steps are low. As training goes on, the agent begins to explore and imitates the demonstrated behaviors.

From the reward curves, we can observe that the accumulated rewards at convergence, particularly for the augmented AIRL, are around 20 and quite close to the expert value of 24.

The training performance of the augmented AIRL is slightly better than that of AIRL when we look into the details in Table I, which shows the results of the last saved model at 15,000 iterations. The values of total reward, success ratio, and changing steps from the augmented AIRL are closer to the expert’s performance, and their standard deviations are no larger than those of AIRL and GAIL. In general the improvement is consistent. A next step could be to explore other forms of semantic reward to provide more intuitive guidance to the agent.

### V-D Validation Results

We conduct testing on 5 saved checkpoints during training. For each checkpoint, we run 50 episodes and average their values. Fig. 12 shows the testing results of the four metrics with expert data plotted in dashed lines.

As shown in Fig. 12, the testing results are generally consistent with the training results in that the AIRL methods perform better and are more stable than GAIL. Additionally, we can see more clearly that at testing time the augmented AIRL method shows more satisfactory results. Its total reward is relatively higher, its success ratio approaches 1.0 faster, and its decision-making steps and lane-changing steps are lower, which means it completes the task faster.

## VI Conclusions and Discussions

In this paper, we applied Adversarial Inverse Reinforcement Learning (AIRL) methods to the challenging decision-making task in autonomous driving, and augmented the learned reward with a semantic reward term to improve learning. The results were compared with those of Generative Adversarial Imitation Learning (GAIL) and the expert data. The comparison shows that the performance of AIRL and augmented AIRL, in terms of the four metrics (i.e. total reward, success ratio, decision-making steps and lane-changing steps), is better than that of GAIL and is at a level comparable with that of the experts. This supports the effectiveness of the AIRL method, and the slightly but consistently improved performance of augmented AIRL also supports the effectiveness of adding the semantic reward term. The results also indicate the feasibility and capability of learning both a reward function and a policy simultaneously from the demonstrated behaviors.

Designing different forms of semantic reward and adding them to different parts of the learning framework are worth trying in order to gain more significant improvements in performance. Also, we will extend the current work to multiple driving style learning, where the demonstrations include multiple driving behaviors.

## References

- [1] (1995) PID controllers: theory, design, and tuning. Vol. 2, Instrument Society of America, Research Triangle Park, NC.
- [2] (1998) Sliding mode control: theory and applications. CRC Press.
- [3] (2016) A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852.
- [4] (2016) Guided cost learning: deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58.
- [5] (2017) Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248.
- [6] (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573.
- [7] (2014) Action-reaction: forecasting the dynamics of human interaction. In European Conference on Computer Vision, pp. 489–504.
- [8] (2015) Learning driving styles for autonomous vehicles from demonstration. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 2641–2646.
- [9] (2017) Imitating driver behavior with generative adversarial networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 204–211.
- [10] (2012) Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617.
- [11] (2013) Guided policy search. In International Conference on Machine Learning, pp. 1–9.
- [12] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- [13] (2006) Autonomous inverted helicopter flight via reinforcement learning. In Experimental Robotics IX, pp. 363–372.
- [14] (2000) Algorithms for inverse reinforcement learning. In ICML, Vol. 1, pp. 2.
- [15] (2009) Reinforcement learning for robot soccer. Autonomous Robots 27 (1), pp. 55–73.
- [16] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
- [17] (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
- [18] (2019) End-to-end robotic reinforcement learning without reward engineering. arXiv preprint arXiv:1904.07854.
- [19] (2000) Congested traffic states in empirical observations and microscopic simulations. Physical Review E 62 (2), pp. 1805.
- [20] (2018) A reinforcement learning based approach for automated lane change maneuvers. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1379–1384.
- [21] (2019) Automated driving maneuvers under interactive environment based on deep reinforcement learning. Technical report.
- [22] (2008) Maximum entropy inverse reinforcement learning.
- [23] (2010) Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. Thesis.