Dynamic Interaction-Aware Scene Understanding for Reinforcement Learning in Autonomous Driving

Maria Huegle, Gabriel Kalweit, Moritz Werling and Joschka Boedecker
Dept. of Computer Science, University of Freiburg, Germany. {hueglem,kalweitg,jboedeck}@cs.uni-freiburg.de
BMW Group, Unterschleissheim, Germany. Moritz.Werling@bmw.de
Cluster of Excellence BrainLinks-BrainTools, Freiburg, Germany.

The common pipeline in autonomous driving systems is highly modular and includes a perception component which extracts lists of surrounding objects and passes these lists to a high-level decision component. In this case, leveraging the benefits of deep reinforcement learning for high-level decision making requires special architectures to deal with multiple variable-length sequences of different object types, such as vehicles, lanes or traffic signs. At the same time, the architecture has to be able to cover interactions between traffic participants in order to find the optimal action to be taken. In this work, we propose the novel Deep Scenes architecture, that can learn complex interaction-aware scene representations based on extensions of either 1) Deep Sets or 2) Graph Convolutional Networks. We present the Graph-Q and DeepScene-Q off-policy reinforcement learning algorithms, both outperforming state-of-the-art methods in evaluations with the publicly available traffic simulator SUMO.

I Introduction

In autonomous driving scenarios, the number of traffic participants and lanes surrounding the agent can vary considerably over time. Common autonomous driving systems use modular pipelines, where a perception component extracts a list of surrounding objects and passes this list to other modules, including localization, mapping, motion planning and high-level decision making components. Classical rule-based decision-making systems are able to deal with variable-sized object lists, but are limited in terms of generalization to unseen situations or are unable to cover all interactions in dense traffic. Since Deep Reinforcement Learning (DRL) methods can learn decision policies from data and off-policy methods can improve from previous experience, they offer a promising alternative to rule-based systems. In the past years, DRL has shown promising results in various domains [1, 2, 3, 4, 5]. However, classical DRL architectures like fully-connected or convolutional neural networks (CNNs) are limited in their ability to deal with variable-sized, structured inputs or to model interactions between objects.

Prior works on reinforcement learning for autonomous driving that used fully-connected network architectures and fixed-sized inputs [6, 7, 5, 8, 9] are limited in the number of vehicles that can be considered. CNNs using occupancy grids [10, 11] are limited to their initial grid size. Recurrent neural networks are useful to cover temporal context, but cannot handle a variable number of objects in a way that is permutation invariant w.r.t. the input order at a fixed time step. In [12], limitations of these architectures are shown and a more flexible architecture based on Deep Sets [13] is proposed for off-policy reinforcement learning of lane-change maneuvers, outperforming traditional approaches in evaluations with the open-source simulator SUMO.

In this paper, we propose to use Graph Networks [14] as an interaction-aware input module in reinforcement learning for autonomous driving. We employ the structure of Graphs in off-policy DRL and formalize the Graph-Q algorithm. In addition, to cope with multiple object classes of different feature representations, such as different vehicle types, traffic signs or lanes, we introduce the formalism of Deep Scenes, that can extend Deep Sets and Graph Networks to fuse multiple variable-sized input sets of different feature representations. Both of these can be used in our novel DeepScene-Q algorithm for off-policy DRL. Our main contributions are:

  1. Using Graph Convolutional Networks to model interactions between vehicles in DRL for autonomous driving.

  2. Extending existing set input architectures for DRL to deal with multiple lists of different object types.

II Related Work

Graph Networks are a class of neural networks that can learn functions on graphs as input [15, 16, 17, 18, 19] and can reason about how objects in complex systems interact. They can be used in DRL to learn state representations [20, 21, 22, 17], e.g. for inference and control of physical systems with bodies (objects) and joints (relations). In the application for autonomous driving, Graph Networks were used for supervised traffic prediction while modeling traffic participant interactions [23], where vehicles were modeled as objects and interactions between them as relations. Another type of interaction-aware network architectures, Interaction Networks, were proposed to reason about how objects in complex systems interact [18]. A vehicle behavior interaction network that captures vehicle interactions was presented in [24]. In [25], a convolutional social pooling component was proposed using a CNN to model spatial connections between vehicles for vehicle trajectory prediction.

Fig. 1: Scheme of DeepScene-Q, using (a) Deep Sets and (b) Graphs. Both architectures combine multiple variable-length object lists in a scene, here a traffic sign, lanes and vehicles. The modules $\phi$, $\rho$ and $Q$ are fully-connected networks. As permutation invariant pooling operator, we use the sum. The vector $x^{static}$ includes static features and $Q$ is the action value output.

III Preliminaries

We model the task of high-level decision making for autonomous driving as a Markov Decision Process (MDP), where the agent is following a policy $\pi$ in an environment in a state $s_t$, applying a discrete action $a_t \sim \pi$ to reach a successor state $s_{t+1} \sim \mathcal{M}$ according to a transition model $\mathcal{M}$. In every time step $t$, the agent receives a reward $r_t$, e.g. for driving as close as possible to a desired velocity. The agent tries to maximize the discounted long-term return $R(s_t) = \sum_{i \geq t} \gamma^{i-t} r_i$, where $\gamma \in [0, 1]$ is the discount factor. In this work, we use Q-learning [26]. The Q-function $Q^{\pi}(s_t, a_t) = \mathbb{E}_{a_{i>t} \sim \pi}[R(s_t) \mid a_t]$ represents the value of following a policy $\pi$ after applying action $a_t$. The optimal policy can be inferred from the optimal action-value function $Q^*$ by maximization over actions.
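As a small illustration of the two quantities introduced above, the following Python sketch computes a discounted return from a finite reward sequence and selects the greedy action from a list of Q-values. The function names and the example numbers are assumptions made for illustration and are not taken from the implementation used in the experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], accumulated backwards from the last reward."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def greedy_action(q_values):
    """Index of the largest Q-value (the first index wins ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical rewards for three consecutive time steps.
example_rewards = [1.0, 0.5, 0.25]
example_return = discounted_return(example_rewards, gamma=0.5)

# Hypothetical Q-values for keep lane, left lane-change, right lane-change.
example_action = greedy_action([0.2, 0.7, 0.1])
```

With a discount factor of 0.5, later rewards contribute less: the example return is 1.0 + 0.5 * 0.5 + 0.25 * 0.25.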

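III-A Q-Function Approximation

We use DQN [1] to estimate the optimal $Q$-function by function approximator $Q$, parameterized by $\theta^{Q}$. It is trained in an offline fashion on minibatches sampled from a fixed replay buffer $\mathcal{R}$ with transitions collected by a driver policy $\hat{\pi}$. As loss, we use $L(\theta^{Q}) = \frac{1}{b} \sum_{i} \left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2$ with targets $y_i = r_i + \gamma \max_{a} Q'(s_{i+1}, a \mid \theta^{Q'})$, where $Q'$ is a target network, parameterized by $\theta^{Q'}$, and $\{(s_i, a_i, s_{i+1}, r_i)\}_{1 \leq i \leq b}$ is a randomly sampled minibatch from $\mathcal{R}$. For the target network, we use a soft update, i.e. $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$ with update step-size $\tau \in [0, 1]$. Further, we use a variant of Double-$Q$-learning [27] which is based on two Q-network pairs and uses the minimum of the predictions for the target calculation, similar to [28].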
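The target computation and the soft target-network update described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: parameters are represented as flat lists of floats, and the minimum over the two target networks is taken per action before the maximization over actions, which is one plausible reading of the text. All function names are hypothetical.

```python
def td_target(reward, gamma, next_q_a, next_q_b):
    """Target r + gamma * max_a min(Q'_A(s', a), Q'_B(s', a)).

    next_q_a and next_q_b are the per-action predictions of the two target
    networks for the successor state (assumed ordering: min first, then max).
    """
    pessimistic = [min(qa, qb) for qa, qb in zip(next_q_a, next_q_b)]
    return reward + gamma * max(pessimistic)


def mse_loss(predictions, targets):
    """Mean squared error between Q(s_i, a_i) and the targets y_i."""
    return sum((y - q) ** 2 for q, y in zip(predictions, targets)) / len(targets)


def soft_update(target_params, online_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', element-wise."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```

A small step-size `tau` lets the target network track the online network slowly, which keeps the regression targets stable between gradient steps.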

III-B Deep Sets

A network $Q_{DS}$ can be trained to estimate the $Q$-function for a state representation $s = (X^{dyn}, x^{static})$ and action $a$. The representation consists of a static input $x^{static}$ and a dynamic, variable-length input set $X^{dyn} = [x_1, \dots, x_n]^{\top}$, where $x_j$ are feature vectors for surrounding vehicles in sensor range. In [12], it was proposed to use Deep Sets to handle this input representation, where the Q-network consists of three network modules $\phi$, $\rho$ and $Q$. The representation of the dynamic input set is computed by $\Psi(X^{dyn}) = \rho\left(\sum_{x \in X^{dyn}} \phi(x)\right)$, which makes the Q-function permutation invariant w.r.t. the order of the dynamic input [13]. Static feature representations $x^{static}$ are fed directly to the $Q$-module, and the Q-values can be computed by $Q_{DS}(s, \cdot) = Q(\Psi(X^{dyn}) \,\|\, x^{static})$, where $\|$ denotes a concatenation of two vectors. The Q-learning algorithm is called DeepSet-Q [12].
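The permutation invariance obtained by sum pooling can be checked with a short Python sketch. It assumes bias-free single-layer modules with hand-picked weights (in practice all modules are learned multi-layer networks), and the names `W_PHI`, `W_RHO` and `deep_set_q_input` are hypothetical.

```python
def dense_relu(x, weights):
    """One bias-free fully-connected layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


# Hypothetical fixed weights standing in for learned parameters.
W_PHI = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # encoder: 2 features -> 3
W_RHO = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]     # pooled vector: 3 -> 2


def deep_set_q_input(vehicles, x_static):
    """Encode each vehicle, sum the encodings, apply the second module and
    concatenate the static features. An empty set pools to the zero vector."""
    pooled = [0.0] * len(W_PHI)
    for x in vehicles:
        pooled = [p + e for p, e in zip(pooled, dense_relu(x, W_PHI))]
    return dense_relu(pooled, W_RHO) + list(x_static)


vehicles = [[1.0, 0.5], [2.0, 3.0], [0.5, 0.25]]
q_input = deep_set_q_input(vehicles, x_static=[0.8])
q_input_reversed = deep_set_q_input(list(reversed(vehicles)), x_static=[0.8])
```

Both calls yield the same vector, and its length does not depend on the number of vehicles, so it can be passed to a fully-connected Q-module of fixed input size.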

IV Methods

IV-A Deep Scene-Sets

To overcome the limitation of DeepSet-Q to one variable-sized list of the same object type, we propose a novel architecture, Deep Scene-Sets, which is able to deal with $K$ input sets $X^{dyn}_1, \dots, X^{dyn}_K$, where every set has variable length. A combined, permutation invariant representation of all sets can be computed by

$$\Psi(X^{dyn}_1, \dots, X^{dyn}_K) = \rho\left(\sum_{k} \sum_{x \in X^{dyn}_k} \phi^{k}(x)\right),$$

where $1 \leq k \leq K$. The output vectors $\phi^{k}(\cdot) \in \mathbb{R}^{F}$ of the neural network modules $\phi^{k}$ have the same length $F$. We additionally propose to share the parameters of the last layer for the different $\phi^{k}$ networks. Then, $\phi^{k}(\cdot)$ can be seen as a projection of all input objects to the same encoded object space. We combine the encoded objects of different types by the sum (or other permutation invariant pooling operators, such as max) and use the network module $\rho$ to create an encoded scene, which is a fixed-sized vector. The encoded scene is concatenated to $x^{static}$ and the Q-values can be computed by $Q_{D}(s, \cdot) = Q(\Psi(X^{dyn}_1, \dots, X^{dyn}_K) \,\|\, x^{static})$. We call the corresponding Q-learning algorithm DeepScene-Q, shown in Algorithm 2 (Option 1) and Figure 1 (a).
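The idea of projecting objects with different feature sizes into one encoded object space can be sketched as follows. This is a minimal Python sketch, assuming two object types ("vehicle" with four features, "lane" with three), bias-free layers and hand-picked weights; the dictionary layout and all names are hypothetical and only illustrate the type-specific first layer, the shared last layer and the sum pooling.

```python
def dense_relu(x, weights):
    """One bias-free fully-connected layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


# Type-specific first encoder layers (different input sizes, same output size).
FIRST_LAYER = {
    "vehicle": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]],  # 4 -> 2
    "lane": [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]],               # 3 -> 2
}
# Last encoder layer, shared by all object types (2 -> F = 3).
SHARED_LAST_LAYER = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Module applied to the pooled scene (3 -> 2).
W_RHO = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]


def encode(obj_type, x):
    """Project one object of the given type into the shared object space."""
    return dense_relu(dense_relu(x, FIRST_LAYER[obj_type]), SHARED_LAST_LAYER)


def encode_scene(scene, x_static):
    """Sum the encodings of all objects of all types, then concatenate
    the static features to the resulting fixed-size scene vector."""
    pooled = [0.0] * len(SHARED_LAST_LAYER)
    for obj_type, objects in scene.items():
        for x in objects:
            pooled = [p + e for p, e in zip(pooled, encode(obj_type, x))]
    return dense_relu(pooled, W_RHO) + list(x_static)


scene = {
    "vehicle": [[1.0, 0.5, 0.0, 0.5], [2.0, 0.0, 1.0, 1.0]],
    "lane": [[0.5, 0.5, 1.0]],
}
q_input = encode_scene(scene, x_static=[0.9])
```

Because every encoder ends in the same output size, a single sum over all objects is well defined regardless of how many objects of each type are present.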

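initialize $Q = (\phi, \rho, Q)$ and $Q' = (\phi', \rho', Q')$, set replay buffer $\mathcal{R}$
for optimization step o = 1, 2, … do
    get minibatch $(s_i, a_i, (X^{dyn}_{i+1}, x^{static}_{i+1}), r_i)$ from $\mathcal{R}$
    foreach transition do
        foreach object $x_j$ in $X^{dyn}_{i+1}$ do
            $(\phi_{i+1})_j = \phi'(x_j)$
        $H^{(0)}_{i+1} = [(\phi_{i+1})_1, \dots, (\phi_{i+1})_n]^{\top}$
        compute $H^{(L)}_{i+1}$ by GCN with $A_{i+1}$; get $\rho_{i+1} = \rho'\left(\sum_j (H^{(L)}_{i+1})_j\right)$
        $y_i = r_i + \gamma \max_a Q'(\rho_{i+1} \,\|\, x^{static}_{i+1}, a)$
    perform a gradient step on loss: $\frac{1}{b} \sum_i \left(Q(s_i, a_i) - y_i\right)^2$
    update target network by: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$
Algorithm 1 Graph-Q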

IV-B Graphs

In the Deep Set architecture, relations between vehicles are not explicitly modeled and have to be inferred in $\rho$. We extend this approach by using Graph Networks, considering graphs as input. Graph Convolutional Networks (GCNs) [14] operate on graphs defined by a set of node features $H^{(0)}$ and a set of edges represented by an adjacency matrix $A$. The propagation rule of the GCN is $H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$, where we set $H^{(0)} = \phi(X^{dyn})$ using an encoder module $\phi$ similar to the Deep Sets approach. $\tilde{A} = A + I$ is an adjacency matrix with added self-connections, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ its degree matrix, $\sigma$ the activation function, $H^{(l)}$ the hidden layer activations and $W^{(l)}$ the learnable matrix of the $l$-th layer. The dynamic input representation can be computed from the last layer $L$ of the GCN: $\Psi(X^{dyn}) = \rho\left(\sum_j H^{(L)}_j\right)$, where $\rho$ is a neural network and the output vector $\phi(\cdot) \in \mathbb{R}^{F}$ has length $F$. The Q-values can be computed by $Q_{G}(s, \cdot) = Q(\Psi(X^{dyn}) \,\|\, x^{static})$. We call the corresponding Q-learning algorithm Graph-Q, see Algorithm 1.
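To make the propagation rule concrete, the following Python sketch implements one GCN layer with the symmetric normalization of the adjacency matrix with added self-connections, using plain nested lists. It is a sketch only: the experiments use a library implementation, and the helper names, the ReLU activation and the toy graph (an agent connected to a leader and a follower) are assumptions for illustration.

```python
import math


def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalized_adjacency(adj):
    """D^{-1/2} (A + I) D^{-1/2}, where D_ii is the row sum of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_hat, h, w):
    """One propagation step relu(a_hat @ h @ w)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def sum_nodes(h):
    """Permutation invariant pooling: sum the node features over all nodes."""
    return [sum(col) for col in zip(*h)]


# Node 0 is the agent, nodes 1 and 2 are its leader and follower.
adj = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
a_hat = normalized_adjacency(adj)
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical encoded node features
w0 = [[1.0], [1.0]]                        # hypothetical weights (2 -> 1)
h1 = gcn_layer(a_hat, h0, w0)
pooled = sum_nodes(h1)
```

Each node's new feature mixes its own encoding with those of its neighbors, weighted by the normalized adjacency; unconnected nodes (here the leader and the follower) do not exchange information within a single layer.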

IV-C Deep Scene-Graphs

The graph representation can be extended to deal with multiple variable-length lists of different object types $X^{dyn}_1, \dots, X^{dyn}_K$ by using $K$ encoder networks. As node features, we use $H^{(0)}_k = \phi^{k}(X^{dyn}_k)$ and $H^{(0)} = [H^{(0)}_1, \dots, H^{(0)}_K]^{\top}$, and compute the dynamic input representation from the last layer $L$ of the GCN:

$$\Psi(X^{dyn}_1, \dots, X^{dyn}_K) = \rho\left(\sum_j H^{(L)}_j\right),$$

with $1 \leq k \leq K$. Similar to the Deep Scene-Sets architecture, $\phi^{k}$ are neural network modules with output vector length $F$ and parameter sharing in the last layer. To create a fixed vector representation, we combine all node features by the sum into an encoded scene. The Q-values can be computed by $Q_{D}(s, \cdot) = Q(\Psi(X^{dyn}_1, \dots, X^{dyn}_K) \,\|\, x^{static})$. This module can replace the Deep Scene-Sets module in DeepScene-Q as shown in Algorithm 2 (Option 2) and in Figure 1 (b).

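initialize $Q_D = (\phi^{1}, \dots, \phi^{K}, \rho, Q)$ and $Q'_D = (\phi'^{1}, \dots, \phi'^{K}, \rho', Q')$, set replay buffer $\mathcal{R}$
for optimization step o = 1, 2, … do
    get minibatch $(s_i, a_i, (X^{dyn}_{1,i+1}, \dots, X^{dyn}_{K,i+1}, x^{static}_{i+1}), r_i)$ from $\mathcal{R}$
    foreach transition do
        foreach object type $k$ do
            foreach object $x_j$ in $X^{dyn}_{k,i+1}$ do
                $(\phi^{k}_{i+1})_j = \phi'^{k}(x_j)$
        Set (Option 1):
            get $\rho_{i+1} = \rho'\left(\sum_k \sum_j (\phi^{k}_{i+1})_j\right)$
        Graph (Option 2):
            $H^{(0)}_{i+1} = [\phi^{1}_{i+1}, \dots, \phi^{K}_{i+1}]^{\top}$
            compute $H^{(L)}_{i+1}$ by GCN with $A_{i+1}$; get $\rho_{i+1} = \rho'\left(\sum_j (H^{(L)}_{i+1})_j\right)$
        $y_i = r_i + \gamma \max_a Q'(\rho_{i+1} \,\|\, x^{static}_{i+1}, a)$
    perform a gradient step on loss and update target network as in Algorithm 1.
Algorithm 2 DeepScene-Q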

IV-D Graph Construction

We propose two different strategies to construct bidirectional edge connections between vehicles for Graphs and Deep Scene-Graphs representations:

  1. Close agent connections: Connect agent vehicle to its direct leader and follower in its own and the left and right neighboring lanes ( edges).

  2. All close vehicles connections: Connect all vehicles to their leader and follower in their own and the left and right lanes ( edges for surrounding vehicles).

Edge weights are computed by the inverse absolute distance between two vehicles, as shown in [23]. A fully-connected graph is avoided due to computational complexity.
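The first strategy (close agent connections) with inverse-distance edge weights can be sketched in Python as follows. The sketch assumes each surrounding vehicle is given as a pair of relative lane index and signed relative longitudinal distance to the agent (positive means ahead); this input format, the function name and the handling of vehicles exactly alongside the agent are assumptions for illustration.

```python
def close_agent_edges(vehicles):
    """Connect the agent (node 0) to its leader and follower in its own,
    the left and the right lane.

    vehicles: list of (relative_lane, relative_distance) pairs; vehicle j
    becomes node j + 1. Returns {(i, j): weight} with both directions set
    and weight = 1 / |distance|. Vehicles at distance 0 are skipped here
    to avoid a division by zero (an assumption of this sketch).
    """
    edges = {}
    for lane in (-1, 0, 1):
        ahead = [(d, idx) for idx, (l, d) in enumerate(vehicles, start=1)
                 if l == lane and d > 0]
        behind = [(-d, idx) for idx, (l, d) in enumerate(vehicles, start=1)
                  if l == lane and d < 0]
        for candidates in (ahead, behind):
            if candidates:
                dist, idx = min(candidates)  # closest vehicle in this slot
                weight = 1.0 / dist
                edges[(0, idx)] = weight
                edges[(idx, 0)] = weight
    return edges


# Hypothetical scene: (relative lane, relative distance in meters).
vehicles = [(0, 20.0), (0, 50.0), (0, -10.0), (1, 5.0), (-1, -40.0), (2, 8.0)]
edges = close_agent_edges(vehicles)
```

In this example the second vehicle ahead in the agent's lane and the vehicle two lanes away receive no edge, while closer neighbors receive larger weights than distant ones.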

IV-E MDP Formulation

The feature representations of the surrounding cars and lanes are shown in Section V-B. The action space consists of a discrete set of three possible actions in lateral direction: keep lane, left lane-change and right lane-change. Acceleration and collision avoidance are controlled by low-level controllers that are fixed and not updated during training. Maintaining a safe distance to the preceding vehicle is handled by an integrated safety module, as proposed in [11, 5]. If the chosen lane-change action is not safe, the agent keeps the lane. The reward function is defined as: $r(s, a) = 1 - \frac{|v_{current}(s) - v_{desired}(s)|}{v_{desired}(s)} - p_{lc}(a)$, where $v_{current}$ and $v_{desired}$ are the actual and desired velocity of the agent, and $p_{lc}$ is a penalty for choosing a lane-change action, which minimizes lane-changes for additional comfort.
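The interplay of the discrete action space, the safety fallback and the reward can be sketched in Python. The sketch assumes a reward of the form one minus the normalized absolute velocity deviation minus a lane-change penalty; this form, the penalty value of 0.01, the action encoding and the `is_safe` callback are assumptions for illustration and are not confirmed values of the experiments.

```python
KEEP_LANE, LEFT, RIGHT = 0, 1, 2


def reward(v_current, v_desired, action, lane_change_penalty=0.01):
    """Assumed reward: 1 - |v_current - v_desired| / v_desired - penalty,
    where the penalty applies only to lane-change actions."""
    penalty = lane_change_penalty if action != KEEP_LANE else 0.0
    return 1.0 - abs(v_current - v_desired) / v_desired - penalty


def safe_action(action, is_safe):
    """Fall back to keeping the lane if the chosen lane change is unsafe.

    is_safe is a hypothetical callback standing in for the safety module.
    """
    if action == KEEP_LANE or is_safe(action):
        return action
    return KEEP_LANE
```

For example, an agent driving at half its desired velocity receives a reward of 0.5 when keeping its lane and slightly less when changing lanes, so a lane change only pays off if it leads to higher velocities later.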

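| Driver Type | maxSpeed | lcCooperative | accel/decel | length | lcSpeedGain |
|---|---|---|---|---|---|
| agent driver | 10 | - | 2.6/4.5 | 4.5 | - |
| passenger drivers 1 | | | 2.6/4.5 | | |
| passenger drivers 2 | | | 2.6/4.5 | | |
| passenger drivers 3 | | | 2.6/4.5 | | |
| truck drivers | | | 1.3/2.25 | | |
| motorcycle drivers | | | 3.0/5.0 | | |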
TABLE I: SUMO parameters for different driver types. In each scenario, trucks and motorcycles are sampled with and probability, passenger cars and their driver types are sampled uniformly for the remaining number of vehicles.

V Experimental Setup

We use the open-source SUMO [29] traffic simulation to learn lane-change maneuvers.

V-A Scenarios

Highway

To evaluate and show the advantages of Graph-Q, we use the circular highway environment shown in [12] with three continuous lanes and one object class (passenger cars). To train our agents, we used a dataset with 500,000 transitions.

Fast Lanes

To evaluate the performance of DeepScene-Q, we use a more complex scenario with a variable number of lanes, shown in Figure 2. It consists of a m circular highway with three continuous lanes and additional fast lanes in two sections. At the end of lanes, vehicles slow down and stop until they can merge into an ongoing lane. The agent receives information about additional lanes in the form of traffic signs starting before every lane start or end. Further, different vehicle types are included, i.e. cars, trucks and motorcycles with different lengths and behaviors. For simplicity, we use the same feature representation for all vehicle classes. As dataset, we collected 500,000 transitions in the same manner as for the Highway environment.

Fig. 2: Fast Lanes scenario in SUMO. The agent (blue) is overtaking other vehicles (red) on the fast lane and has to merge before the lane ends.

V-B Input Features

In the Highway scenario, we use the same input features as proposed in [12]. For the Fast Lanes scenario, the input features used for vehicle are:

  • relative distance: ,
    , are longitudinal positions in a curvilinear coordinate system of the lane.

  • relative velocity:

  • relative lane index: ,
    where , are lane indices.

  • vehicle length:

The state representation for lane is:

  • lane start and end: distances (km) to lane start and end

  • lane valid: lane currently passable

  • relative lane index: ,
    where , are lane indices.

For the agent, the normalized velocity is included, where and are the current and desired velocity of the agent. Passenger cars, trucks and motorcycles use the same feature representation. When the agent reaches a traffic sign indicating a starting (ending) lane, the lane features get updated until the start (end) of the lane.

V-C Training & Evaluation Setup

All agents are trained off-policy on datasets collected by a rule-based agent with the integrated SUMO safety module enabled, performing random lane changes to the left or right whenever possible. For training, traffic scenarios with a random number of vehicles for Highway and with vehicles for Fast Lanes are used. Evaluation scenarios vary in the number of vehicles . For each fixed , we evaluate 20 scenarios with different a priori randomly sampled positions and driver types for each vehicle, to smooth the high variance.

In SUMO, we set the time step length to . The action step length of the reinforcement learning agents is and the lane change duration is . Desired time headway and minimum gap are and . All vehicles have no desire to keep right (). The sensor range of the agent is . LC2013 is used as lane-change controller for all other vehicles. To simulate traffic conditions as realistically as possible, different driver types are used with parameters shown in Table I.

Input() Input() Input()
: FC(), FC() : FC(), FC() : FC(), FC()
: FC(), FC() sum()
concat(, Input())
FC(100), FC(100), Linear(3)
Deep Scene-Sets Deep Scene-Graphs
Input() and Input()
: FC(20), FC(80),FC(80) : FC(20), FC(80),FC(80)
: FC(20), FC(80), FC(80) : FC(20), FC(80),FC(80)
: FC(), FC() sum()
concat(, Input())
FC(100), FC(100), Linear(3)
TABLE II: Network architectures. FC() are fully-connected layers. The CNN uses strides of . (*) For VBIN FC(200). (**) Parameters of the last layers are shared.

V-D Comparative Analysis

Each network is trained with a batch size of and optimized by Adam [30] with a learning rate of . As activation function, we use Rectified Linear Units (ReLU) in all hidden layers of all architectures. The target networks are updated with a step-size of . All network architectures, including the baselines, were optimized using Random Search with the same budget of 20 training runs. We preferred Random Search over Grid Search, since it has been shown to result in better performance using budgets in this range [31]. The Deep Sets architecture and hyperparameter-optimized settings for all encoder networks are taken from [12]. The network architectures are shown in Table II. Graph-Q is compared to two other interaction-aware Q-learning algorithms that use input modules originally proposed for supervised vehicle trajectory prediction. To support our architecture choices for the Deep Scene-Sets, we compare to a modification with separate networks. Since we do not focus on including temporal context, we adapt recurrent layers to fully-connected layers in all baselines. We use the following baselines:

Rule-Based Controller

Naive rule-based agent controller that uses the SUMO lane change model LC2013.

Convolutional Social Pooling (SocialCNN)

In [25], a social tensor is created by learning latent vectors of all cars by an encoder network and projecting them to a grid map in order to learn spatial dependencies.

Vehicle Behaviour Interaction Networks (VBIN)

In [24], instead of summarizing the output vectors as in the Deep Sets approach, the vectors are concatenated, which results in a limitation to a fixed number of cars. We consider the 6 vehicles surrounding the agent (leader and follower on own, left and right lane).

Multiple -networks

Deep Scene architecture where all object types are processed separately by using different -network modules. The resulting output vectors are concatenated as and fed into the Q-network module.

V-E Implementation Details & Hyperparameter Optimization

All networks were trained for optimization steps. The Random Search configuration space is shown in Table III. For all approaches except VBIN, we used the same and architectures. Due to stability issues, we adapted these parameters for VBIN. For SocialCNN, we used the optimized grid from [12] with a size of . The GCN architectures were implemented using the PyTorch Geometric library [32].

| Architecture | Parameter | Configuration Space |
|---|---|---|
| Encoders | : num layers | |
| | : hidden/output dims | |
| Deep Sets | : num layers | |
| | : hidden/output dims | |
| GCN | num GCN layers | 1, 2, 3 |
| | hidden and output dim | 20, 80 |
| | use edge weights | True, False |
| SocialCNN | CONV: num layers | |
| | kernel sizes | |
| VBIN | : output dim | 20, 80 |
| | : hidden dim | 20, 80, 160, 200 |
| | : hidden dim | 100, 200 |
| Deep Scene-Sets | : output dim | 20, 80 |
| | shared parameters | True, False |
| Deep Scene-Graphs | use network | True, False |
| | : output dim | 20, 80 |
| | shared parameters | True, False |
TABLE III: Random Search configuration space. For every architecture, we sampled 20 configurations to find the best setting.
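Sampling configurations for Random Search can be sketched in Python with the GCN rows of Table III. The dictionary keys, the function name and the fixed seed are assumptions for illustration; only the candidate values and the budget of 20 configurations come from the text.

```python
import random

# GCN part of the configuration space (values as listed in Table III).
GCN_SPACE = {
    "num_gcn_layers": [1, 2, 3],
    "hidden_and_output_dim": [20, 80],
    "use_edge_weights": [True, False],
}


def sample_configurations(space, budget, seed=0):
    """Draw `budget` configurations uniformly and independently.

    Sampling is with replacement, so duplicates can occur when the space
    is small (the GCN space above has only 12 distinct combinations).
    """
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(budget)]


configs = sample_configurations(GCN_SPACE, budget=20)
```

Each sampled configuration would then be trained and evaluated once, and the best-performing setting is kept.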
Fig. 3: Mean performance and standard deviation in the Highway scenario over 10 training runs for Graph-Q with all close vehicle connections, the Deep Sets [12] and two other interaction-aware Q-function input modules (left), and Graph-Q using the two proposed graph construction strategies (right). The number of vehicles indicates the traffic intensity, from light to dense traffic.
Fig. 4: Mean performance and standard deviation in the Fast Lanes scenario over 10 training runs for Deep Scene-Sets, Deep Scene-Graphs and the rule-based controller from SUMO (left), and different architecture choices of the Deep Scenes (right). The number of vehicles indicates the traffic intensity.

VI Results

The results for the Highway scenario are shown in Figure 3. Graph-Q using the GCN input representation (with all close vehicle connections) outperforms VBIN and SocialCNN. Further, the GCN input module yields a better performance than Deep Sets in all scenarios except very light traffic with rare interactions between vehicles. While the SocialCNN architecture has a high variance, VBIN shows a better and more robust performance and also outperforms the Deep Sets architecture in high traffic scenarios. This underlines the importance of interaction-aware network modules for autonomous driving, especially in urban scenarios. However, VBIN is still limited to fixed-sized input, and additional gains can be achieved by combining both variable input and interaction-aware methods as in Graph Networks. To verify that the shown performance increases are significant, we performed a t-test, using the 90-car scenarios as an example:

  • Independence of the mean performances of DeepSet-Q and Graph-Q is highly significant () with a p-value of 0.0011.

  • Independence of the mean performances between Graph-Q and VBIN is significant () with a p-value of 0.0848. Graph-Q is additionally more flexible and can consider a variable number of surrounding vehicles.

Figure 3 (right) shows the performance of the two graph construction strategies. A graph built with connections for all close vehicles outperforms a graph built with close agent connections only. However, the performance increase is only slight, which indicates that interactions with the direct neighbors of the agent are most important.

The evaluation results for Fast Lanes are shown in Figure 4 (left). The vehicles controlled by the rule-based controller rarely use the fast lane. In contrast, our agent learns to drive on the fast lane as much as possible ( of the driving time). We assume that the Deep Scene-Sets slightly outperform the Deep Scene-Graphs because the agent has to deal with fewer interactions than in the Highway scenario. Finally, we compare Deep Scene-Sets to a basic Deep Sets architecture with a fixed feature representation. Using the exact same lane features (if necessary filled with dummy values), both architectures show similar performance. However, the performance collapse of the Deep Sets agent that considers only its own, left and right lane shows that the ability to deal with an arbitrary number of lanes (or other object types) can be very important in certain situations. Due to its limited lane representation, the Deep Sets (closest lanes) agent is not able to see the fast lane and is thus significantly slower. Figure 4 (right) shows an ablation study, comparing the performance of the Deep Scene-Sets with and without shared parameters in the last layer of the encoder networks. Using shared parameters in the last layer leads to a slight increase in robustness and performance, and outperforms the architecture with separate networks.

VII Conclusion

In this paper, we propose Graph-Q and DeepScene-Q, interaction-aware reinforcement learning algorithms that can deal with variable input sizes and multiple object types in the problem of high-level decision making for autonomous driving. We showed that interaction-aware neural networks, and among them especially GCNs, can boost the performance in dense traffic situations. The Deep Scene architecture overcomes the limitation of fixed-sized inputs and can deal with multiple object types by projecting them into the same encoded object space. The ability to deal with objects of different types is necessary especially in urban environments. In the future, this approach could be extended by devising algorithms that dynamically adapt the graph structure of GCNs to the current traffic conditions. Based on our results, it would be promising to omit graph edges in light traffic, essentially falling back to the Deep Sets approach, and to model more interactions as traffic density increases.

