Multiagent Interactive Prediction under Challenging Driving Scenarios
Abstract
In order to drive safely on the road, an autonomous vehicle is expected to predict future outcomes of its surrounding environment and react properly. Many researchers have focused on behavioral prediction problems for autonomous vehicles. However, very few of them consider multiagent prediction under challenging driving scenarios such as urban environments. In this paper, we propose a prediction method that handles various complicated driving scenarios in which heterogeneous road entities, signal lights, and static map information are taken into account. Moreover, the proposed multiagent interactive prediction (MAIP) system is capable of simultaneously predicting any number of road entities while considering their mutual interactions. A case study of a simulated challenging urban intersection scenario is provided to demonstrate the performance and capability of the proposed prediction system.
I Introduction
I-A Motivation
For autonomous vehicles, the ability to drive safely in a sophisticated driving environment is required. Since road entities and traffic lights in the driving environment are dynamic, autonomous vehicles should be able to assess and react to these situations while ensuring their own safety. Advanced Driver Assistance Systems (ADAS) have been designed to assist human drivers and enhance the driving experience. In recent decades, approaches to predicting the future behaviors of on-road vehicles have made substantial progress, especially in simple scenarios such as highways. However, for complex driving environments such as the urban area in Fig. 1, designing comprehensive prediction algorithms remains a challenge. In this work, we aim to build a prediction structure that can be easily adapted to any complicated urban driving scenario given semantic map information. In particular, we aim at forecasting the future behaviors of all types of road entities while considering their mutual interactions.
I-B Related Works
In the fields of computer vision and robotics, the prediction of future vehicle behaviors has been studied with different approaches. For example, traditional methods such as dynamic models, kinematic models, and the Intelligent Driver Model (IDM) have been designed to predict the future behaviors and states of vehicles [3, 10, 15]. However, these methods are limited in that predictions are made by estimating each step recursively under assumptions such as constant velocity or constant acceleration. To make better predictions under model uncertainty, further frameworks have been developed, such as Gaussian Processes (GP), Hidden Markov Models (HMM) [14], and Monte Carlo sampling [6], which is designed to consider various motion patterns. These approaches achieve good results in motion prediction for single-vehicle driving scenarios but do not consider interactions with other vehicles. This is unrealistic, since in the real world the behavior of the predicted vehicle is influenced by its surroundings.
In order to take interactions into account, some works extract features from the environment that have a potential effect on the ego vehicle. Most of these works use learning-based methods to consider interactions among vehicles and other entities [8, 2, 11]. For example, [2] uses an LSTM to predict the most likely future longitudinal and lateral trajectories of the ego vehicle on a highway surrounded by nine other vehicles. The authors of [11] combine an LSTM-based structure with occupancy grid maps to predict vehicle behavior, taking the lane-change behaviors of surrounding vehicles into account. Works such as [16] and [13] also consider lane-change interactions among multiple vehicles. However, these works focus only on simple scenarios such as highways, where all vehicles move in the same direction and interactions between vehicles are straightforward because vehicle speeds are bounded and do not vary significantly. Approaches from these works cannot be adapted to an urban environment, where driving conditions and interactions are more complicated.
Several works have sought to establish more robust prediction systems in complicated driving contexts. The approach in [1] uses a stochastic filtering framework to predict the behaviors and trajectories of vehicles in a merged intersection. [17] and [7] build stereo-vision systems to track and predict the future trajectories of vehicles driving in roundabout scenarios. [19] considers an urban area with 4-way intersections using an HMM, and [20] uses a Monte Carlo method to predict multimodal trajectories in urban intersections. These works study interactions between vehicles in complex scenarios. However, in real urban driving situations, vehicles are also greatly influenced by dynamic traffic signals and by pedestrians crossing the streets, which these works do not consider.
Interactions between vehicles and heterogeneous road entities at intersections have been considered using vehicle-to-vehicle (V2V) communication [4], where all entities are assumed to be connected wirelessly. However, such connections between road entities are not assumed in this work, which makes our approach more suitable for the current real-world situation, where no preset connections are available.
In this work, we consider interactions between the predicted vehicle and heterogeneous road entities that might influence it. We aim to build an autonomous system that can predict future behaviors for all vehicles in challenging driving scenarios. We claim that the information in any driving scenario can be categorized into two parts: static information and dynamic information. A single vehicle driving on the street should obey traffic rules by considering static information such as lane directions, crosswalks, sidewalks, and infeasible driving regions. When the vehicle is surrounded by other moving entities, dynamic information, including traffic lights and the states of other road entities such as vehicles, pedestrians, or cyclists, should be taken into account simultaneously. By incorporating all this information, we are able to design a generic prediction architecture that can be easily adapted to various challenging driving environments.
I-C Contribution
In this paper, a Multiagent Interactive Prediction (MAIP) system is proposed for autonomous vehicles to predict the future behaviors of every road entity while taking interactions into account. We use a complicated urban intersection scenario to implement and examine the proposed system. The main contributions of this work are fourfold:

Taking into account various environment information such as heterogeneous road entities, lane directions, and traffic lights, where the number of heterogeneous entities in the prediction system is not limited.

Introducing an adaptive prediction system that uses grid maps to encode static and dynamic environments.

Considering the multimodal properties of predicted distributions.

Leveraging a masking mechanism to improve learning efficiency.
II Problem Formulation
In this problem, our goal is to solve trajectory prediction problems for vehicles considering environment interactions in an urban intersection area. To increase the environment complexity, we consider three types of driving lanes for the intersection, described in terms of the driving intentions F, L, and R, which represent “Forward”, “Left Turn”, and “Right Turn” respectively. The lane types are shown in Fig. 2.
Among all possible interactive situations, we focus on three challenging cases: Unprotected Left Turn, Right Turn Merge, and Pedestrian Avoidance. In these circumstances, the ego vehicle should consider the environment information of multiple road entities simultaneously to predict their future behaviors. To avoid accidents and traffic congestion, road geometries and surrounding conditions should also be taken into account. Note that although we consider only vehicles and pedestrians in this problem, our prediction structure is also capable of dealing with other road entities such as cyclists.
We assume that, at each time step, an ID is assigned to each car within the detection range of our autonomous vehicle. Each vehicle's state information is then recorded as its global position, yaw angle, speed, and acceleration. Similarly, each pedestrian is assigned an ID and has its own state information. The state of the traffic light at each time step is also recorded.
After obtaining the historical information of every dynamically changing object over a window of past time steps, our goal is to predict the speed and yaw angle of each vehicle over a horizon of future time steps. Note that the future states of each vehicle are highly related to the dynamic environment, which is considered in our proposed prediction structure.
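The formulation above can be sketched as a small data structure. This is a minimal illustration under our own assumptions: the class name `VehicleState`, its field names, and the helper `split_history_future` are hypothetical and do not appear in the original system. The sketch only shows how a recorded track might be split into a history window and a future label of (speed, yaw) pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VehicleState:
    """State of one vehicle at one time step (field names are assumptions)."""
    x: float      # global position, x coordinate
    y: float      # global position, y coordinate
    yaw: float    # yaw angle
    speed: float  # speed
    accel: float  # acceleration


def split_history_future(
    track: List[VehicleState], t: int, n_past: int, n_future: int
) -> Tuple[List[VehicleState], List[Tuple[float, float]]]:
    """Split a track at index t.

    Returns the last n_past states up to and including t, and the
    (speed, yaw) pairs of the next n_future states as the label.
    """
    if t - n_past + 1 < 0 or t + n_future >= len(track):
        raise ValueError("track is too short for the requested windows")
    history = track[t - n_past + 1 : t + 1]
    future = [(s.speed, s.yaw) for s in track[t + 1 : t + 1 + n_future]]
    return history, future
```

Only speed and yaw angle appear in the label because those are the two quantities the problem statement asks to predict.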
III Methodology
In this section, we first introduce the fundamental concepts used in our method. We then describe the overall prediction structure we propose. Finally, we describe the feature details of the proposed method.
III-A Fundamental Concepts
In prediction problems, it is important to produce probabilistic predictions instead of deterministic results, since the uncertainties of road entities need to be taken into account. The predictor should therefore be based on a probabilistic model. In this work, we use one such model to demonstrate the effectiveness of our proposed multiagent interactive prediction (MAIP) system: the Conditional Variational Autoencoder (CVAE) [18, 9], a latent variable model that is rooted in Bayesian inference and contains an encoder as well as a decoder.
The goal of CVAE is to approximate the likelihood distribution:
(1) 
Here, the target function is trained by the network to approximate the output given some input and a latent vector, and the distribution has an associated variance.
During training, the input is fed into the encoder along with a vector sampled from a distribution over the latent space. According to the CVAE structure, the following equation can be derived:
(2) 
Note that the prior distribution of the latent space is assumed to be normal so that the KL divergence has a closed-form solution. In order to maximize the log likelihood on the right-hand side of Eq. 2, we need to maximize the left-hand side, which yields the loss function of the CVAE structure:
(3) 
Here, the loss compares the ground truth with the output estimation, and a hyperparameter adjusts the relative weight of the two loss terms.
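As an illustration of the two-term loss described above, the following sketch computes a reconstruction error plus a weighted KL divergence. It rests on assumptions that the text does not confirm: the reconstruction term is taken to be a sum of squared errors, the prior is a standard normal, and the encoder outputs a mean and a log-variance per latent dimension. The function names and the parameter name `beta` are ours.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def cvae_loss(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    mu: Sequence[float],
    log_var: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Reconstruction error plus beta-weighted KL term (assumed form)."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    return reconstruction + beta * kl_to_standard_normal(mu, log_var)
```

The closed form used in `kl_to_standard_normal` is the reason a normal prior is convenient: no sampling is needed to evaluate the divergence term.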
III-B Proposed Prediction System
The overall structure of our Multiagent Interactive Prediction (MAIP) system is shown in Fig. 3. In order to predict future behaviors for every vehicle considering their interactions with the environment, we organize the input information into two groups: group1 and group2.
Group1 contains information about the entire environment, which includes a static map and a dynamic map. Group2, on the other hand, considers the information of the single vehicle that we choose to predict. In particular, group2 comprises the dynamic map for the selected vehicle as well as its state information, both of which are extracted from group1. We can then predict every entity in the scene by fixing group1 while changing the information in group2 to that of each vehicle we want to predict. In this way, any number of road entities can be predicted in parallel under the proposed prediction system. Details of each input are described in Section III-C.
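The idea of fixing group1 and swapping group2 can be sketched as a simple loop. This is only an illustration: `predict_all`, `extract_group2`, and `predictor` are hypothetical names, and the two callables stand in for components whose real interfaces are not given in the text.

```python
from typing import Any, Callable, Dict


def predict_all(
    group1: Any,
    vehicle_states: Dict[int, Any],
    extract_group2: Callable[[Any, int, Any], Any],
    predictor: Callable[[Any, Any], Any],
) -> Dict[int, Any]:
    """Predict every vehicle with the same group1 and a per-vehicle group2.

    group1 is shared by all calls; group2 is rebuilt for each vehicle ID.
    """
    predictions = {}
    for vehicle_id, state in vehicle_states.items():
        group2 = extract_group2(group1, vehicle_id, state)
        predictions[vehicle_id] = predictor(group1, group2)
    return predictions
```

The loop runs sequentially for clarity. Because each call depends only on the shared group1 and its own group2, the calls are independent and could be batched, which matches the parallel prediction described above.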
To clearly illustrate the framework flow, we take one input as a running example. According to Fig. 3, this input is fed into a convolutional neural network, CNN1, to extract the spatial features of the environment. The output of this operation is then fed into a fully connected layer, FC1, which applies an activation function and reduces the features to a one-dimensional vector.
Therefore, by considering all four outputs, we are able to obtain the compressed feature information as:
(4)  
where the remaining symbols denote the parameters of the network and the mapping function of the long short-term memory (LSTM) network. The compressed feature is then fed into the CVAE as one of the inputs to the encoder.
Since our goal is to predict the future behaviors of each vehicle from its historical observations, each input is expressed as a sequence over the historical time steps. Note that the static map information remains the same across time steps. Similarly, the corresponding output label for each car is a sequence over the future time steps, where each label is a subset of the vehicle's state information.
By integrating CVAE into our proposed prediction system, the overall loss function of our network can be expressed as:
(5) 
Moreover, to enable backpropagation, we use the reparameterization trick [12] to handle the non-differentiable sampling process in the latent space. Note that both the encoder and the decoder are used in the training process, whereas only the decoder is used during testing.
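The reparameterization trick mentioned above can be illustrated in a few lines. The sketch assumes that the encoder outputs a mean and a log-variance per latent dimension and that the prior is a standard normal; the function names are ours. Instead of sampling the latent vector directly, noise is sampled separately and then shifted and scaled, so the sample is a deterministic function of the encoder outputs.

```python
import math
import random
from typing import List, Sequence


def reparameterize(
    mu: Sequence[float], log_var: Sequence[float], rng: random.Random
) -> List[float]:
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives in eps only, so z is differentiable
    with respect to mu and log_var.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]


def sample_prior(dim: int, rng: random.Random) -> List[float]:
    """Draw a latent vector from the standard normal prior (test time)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]
```

At test time only `sample_prior` would be needed, since the decoder is then driven by latent vectors drawn from the prior rather than from the encoder.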
III-C Feature Details
In this work, we apply our method in an urban intersection, where interactions and driving conditions are complicated (see Fig. 1). To predict driving behavior of vehicles, road geometries, traffic light information, and the behaviors of heterogeneous surrounding road entities should be taken into account. In our approach, to incorporate all the environment information, we categorize information into four types.
Static Map
In the static map, we include information that does not change with time, such as road geometries, lane information, and traffic light locations. In order to distinguish different types of static information, we assign different numbers to the cells of the static grid map, which are illustrated by different colors in Fig. 3.
Dynamic Map
In the dynamic map, we consider information that changes with time. To take interactions with the environment into account, dynamic information for a single time step, such as the poses of all road entities and the traffic light signals, is encoded into the dynamic map. In Fig. 3, vehicles and pedestrians are represented by white and yellow boxes respectively, and the green and red boxes denote the traffic light colors for the two driving directions.
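A grid-map encoding of this kind can be sketched as follows. The sketch is simplified and makes several assumptions of its own: each entity is written into a single cell rather than drawn as an oriented box, and the integer codes, the function names, and the `origin` and `resolution` parameters are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical cell codes; the actual values are not given in the text.
EMPTY, VEHICLE, PEDESTRIAN, RED_LIGHT, GREEN_LIGHT = 0, 1, 2, 3, 4


def world_to_cell(
    x: float, y: float, origin: Tuple[float, float], resolution: float
) -> Tuple[int, int]:
    """Map a global position to (row, col) indices of the grid."""
    col = int((x - origin[0]) // resolution)
    row = int((y - origin[1]) // resolution)
    return row, col


def rasterize(
    entities: Sequence[Tuple[float, float, int]],
    rows: int,
    cols: int,
    origin: Tuple[float, float] = (0.0, 0.0),
    resolution: float = 1.0,
) -> List[List[int]]:
    """Encode (x, y, code) entities for one time step into a grid map.

    Entities that fall outside the grid are ignored.
    """
    grid = [[EMPTY] * cols for _ in range(rows)]
    for x, y, code in entities:
        row, col = world_to_cell(x, y, origin, resolution)
        if 0 <= row < rows and 0 <= col < cols:
            grid[row][col] = code
    return grid
```

One such grid would be produced per time step, so a history of the scene becomes a sequence of grids that a convolutional network can consume.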
Dynamic Map for Predicted Vehicle
This map focuses on dynamic information for the predicted vehicle; its contents are extracted from the dynamic map.
State Information for Predicted Vehicle
For the predicted vehicle at each time step, this input combines the state information of the vehicle with the state of the traffic light that controls it.
Mask
To improve learning efficiency, we introduce a masking mechanism. Each vehicle only needs to focus on nearby environment information that has a potential influence on it. Therefore, a mask is introduced as
(6) 
where the operator denotes an element-wise product. If the dynamic environment information located at a given row and column of the dynamic map is taken into account by the ego vehicle, the corresponding mask entry is 1; otherwise, it is 0. In our approach, we apply the masking mechanism to the dynamic map input. For example, if a vehicle is driving towards the intersection area and the signal light in front of it is green, it only focuses on information outside the gray “masked” region shown in Fig. 4.
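The masking step can be sketched as an element-wise product between a grid map and a binary mask. The helper names are ours, and the rectangular masked region is an assumption made for illustration; the actual shape of the masked region in Fig. 4 is not specified in the text.

```python
from typing import List


def mask_out_region(
    rows: int, cols: int, r0: int, r1: int, c0: int, c1: int
) -> List[List[int]]:
    """Binary mask that is 0 inside rows [r0, r1) x cols [c0, c1), else 1."""
    return [
        [0 if (r0 <= r < r1 and c0 <= c < c1) else 1 for c in range(cols)]
        for r in range(rows)
    ]


def apply_mask(
    dynamic_map: List[List[int]], mask: List[List[int]]
) -> List[List[int]]:
    """Element-wise product of a grid map and a binary mask."""
    if len(dynamic_map) != len(mask) or any(
        len(d) != len(m) for d, m in zip(dynamic_map, mask)
    ):
        raise ValueError("map and mask must have the same shape")
    return [
        [d * m for d, m in zip(d_row, m_row)]
        for d_row, m_row in zip(dynamic_map, mask)
    ]
```

Cells multiplied by 0 carry no information into the network, so the predictor only sees the part of the scene that is relevant to the selected vehicle.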
IV Experiment
IV-A Data Generation
We used the open-source autonomous driving simulator CARLA [5] to simulate an urban intersection area.
Vehicles moving into the intersection are controlled by two traffic light groups whose durations are similar to real-life situations. Pedestrians are generated with different speeds to cross each road. To simulate real-world scenarios, pedestrians keep crossing the road if the traffic light turns red while they are at the center of the road. Therefore, the prediction system is expected to learn that, for each on-road vehicle, pedestrians have higher priority than the traffic light.
To record data in various scenarios, we randomly generate four to six cars in the map and their driving behaviors are automatically controlled by the simulator. We collected 37,265 frames for training in total with sample frequency of . Among all the interacting situations, we focus on three challenging cases (Fig. 5):

Unprotected Left Turn: An unprotected left turn occurs at an intersection where there is no traffic light to signal the left turn (see Fig. 5 (a)). The critical rule for a vehicle making a left turn is that its driving behavior must not influence other vehicles moving forward from the opposite side of the road. In other words, to avoid accidents, vehicles driving straight have higher road priority in intersection areas than vehicles making a left turn. Hence, the left-turning car will not turn until the oncoming car has passed through the intersection.

Right Turn Merge: As shown in Fig. 5 (b), vehicles driving in two directions merge into the same lane. In our scenarios, there is no traffic signal for the right turn and vehicles are able to turn if it is safe. Generally, vehicles with a right-turn intention have low road priority and should wait until it is safe to turn. In this example, the merging vehicle is more likely to insert itself where the gap between the vehicles on the main road is larger.

Pedestrian Avoidance: Since we want to consider the interaction between vehicles and pedestrians, we place pedestrians on two crosswalks in the scene. When vehicles are driving towards pedestrians, they are expected to brake ahead of time to avoid accidents. In the case shown in Fig. 5 (c), one vehicle should stop since it has a strong interaction with the pedestrian. The behavior of the other vehicle, however, is less influenced by the pedestrian, and it can choose either to pass or to yield to the pedestrian.

TABLE I: Evaluation of different methods (RMSE ± standard deviation at each prediction horizon). Speed is in m/s and yaw angle is in degrees.

| Method | Quantity | 0.2s | 0.4s | 0.6s | 0.8s | 1s |
|---|---|---|---|---|---|---|
| IDM | Yaw angle | 24.73 | 34.72 | 42.20 | 48.65 | 54.26 |
| IDM | Speed | 0.49 | 0.63 | 0.83 | 0.99 | 1.12 |
| CNN-LSTM | Yaw angle | 6.82 ± 0.16 | 7.29 ± 0.19 | 7.65 ± 0.14 | 8.43 ± 0.15 | 9.32 ± 0.26 |
| CNN-LSTM | Speed | 0.47 ± 0.03 | 0.71 ± 0.02 | 0.78 ± 0.03 | 0.95 ± 0.03 | 1.06 ± 0.04 |
| CNN-CVAE | Yaw angle | 7.99 ± 0.51 | 8.29 ± 0.82 | 7.55 ± 0.86 | 8.80 ± 0.95 | 9.67 ± 1.22 |
| CNN-CVAE | Speed | 0.95 ± 0.11 | 1.01 ± 0.12 | 0.98 ± 0.13 | 1.03 ± 0.17 | 1.15 ± 0.20 |
| MAIP-Recursive | Yaw angle | 4.91 ± 0.68 | 6.32 ± 0.62 | 8.36 ± 0.11 | 9.54 ± 0.13 | 11.41 ± 2.38 |
| MAIP-Recursive | Speed | 0.41 ± 0.05 | 0.77 ± 0.09 | 0.85 ± 0.08 | 0.91 ± 0.11 | 1.03 ± 0.12 |
| MAIP (our method) | Yaw angle | 5.20 ± 0.70 | 5.23 ± 0.73 | 5.38 ± 0.81 | 6.06 ± 0.85 | 7.19 ± 1.03 |
| MAIP (our method) | Speed | 0.56 ± 0.04 | 0.61 ± 0.05 | 0.63 ± 0.05 | 0.68 ± 0.06 | 0.82 ± 0.08 |
IV-B Implementation Details
In this section, we introduce the implementation details of our MAIP system. All three convolutional neural networks use the same kernel size. A long short-term memory (LSTM) network with 16 hidden neurons is used to incorporate the historical state information over time. We also apply fully connected layers FC1 to FC4 with 8 neurons each and FC5 with 16 neurons. For the CVAE structure, we use two fully connected layers of 16 neurons for both the encoder and the decoder. According to the cross-validation results, we use a two-dimensional latent space to map our high-dimensional input features into a low-dimensional space.
IV-C Evaluation
Quantitative Results
To evaluate the performance of our approach, Root Mean Square Error (RMSE) is adopted as the evaluation metric. We compare our method with four other methods as follows:

IDM (Intelligent Driver Model). This method is a simple car-following approach in which the ego vehicle only considers the interaction with the vehicle in front of it.

CNN-LSTM. In this method, the future behaviors of vehicles are predicted using a structure similar to our approach but without the CVAE. Since the future behaviors predicted by this method are deterministic, an ensemble method is used to calculate its standard deviation and compare the results with the probabilistic models.

CNN-CVAE. This method is based on a CVAE and has a framework structure similar to our approach, except that the state information of the ego vehicle is fed into fully connected layers instead of an LSTM.

MAIP-Recursive. This method predicts future behaviors for the ego vehicle with the prediction horizon set to a single frame. The predicted behavior is then fed back into the network to predict recursively.
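For reference, the car-following rule behind the IDM baseline listed above can be sketched as follows. The acceleration formula is the standard IDM form; the default parameter values are illustrative assumptions of ours and are not the values used in the experiments.

```python
import math


def idm_acceleration(
    v: float,
    v_lead: float,
    gap: float,
    v0: float = 15.0,    # desired speed in m/s (assumed)
    T: float = 1.5,      # desired time headway in s (assumed)
    a_max: float = 1.0,  # maximum acceleration in m/s^2 (assumed)
    b: float = 2.0,      # comfortable deceleration in m/s^2 (assumed)
    s0: float = 2.0,     # minimum standstill gap in m (assumed)
    delta: float = 4.0,  # acceleration exponent
) -> float:
    """Standard IDM acceleration of a follower behind a lead vehicle.

    v: follower speed, v_lead: leader speed, gap: bumper-to-bumper distance.
    """
    if gap <= 0.0:
        raise ValueError("gap must be positive")
    dv = v - v_lead  # approach rate; positive when closing in
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```

The formula depends only on the follower's speed, the leader's speed, and the gap between them, which is why this baseline cannot account for traffic lights, road geometry, or vehicles other than the one directly ahead.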
To analyze the performance of the different methods at each time step, we compare the RMSE and its standard deviation for both yaw angle and speed at different prediction time steps. Results are shown in Table I. The standard deviation for IDM is not shown because it is a deterministic model that cannot make probabilistic predictions. For all methods, prediction errors accumulate as the prediction time step increases. The prediction performance of IDM is clearly far worse than that of the others. The reason is that IDM is a simple car-following model, so the predicted vehicle only considers the interaction with the vehicle ahead. This method is therefore very limited when the predicted vehicle interacts with multiple vehicles and traffic lights. Moreover, since IDM does not take the geometry of the scenario into account, its RMSE for yaw angle is nearly eight times larger than that of MAIP.
Comparing the results of CNN-CVAE and MAIP, the former is less accurate and has a higher standard deviation, which indicates that the LSTM better incorporates historical information in the future behavior prediction problem, especially as the prediction horizon gets longer. Comparing the results of CNN-LSTM and MAIP, the latter is more accurate, which shows that a direct ensemble method does not achieve better performance than the learning-based probabilistic model. It should be noted that the standard deviation of CNN-LSTM is smaller than that of the models using a CVAE because the CNN-LSTM method does not take the multimodal distribution into account. For short prediction time steps, MAIP-Recursive is more accurate than MAIP, but for longer prediction sequences its errors accumulate and lead to inaccurate predictions.
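The per-horizon RMSE used in this comparison can be computed as in the sketch below. The function names and the data layout (one list of predicted values per test sample, indexed by prediction step) are our assumptions; the sketch treats values as plain numbers and does not handle angle wrap-around.

```python
import math
from typing import List, Sequence


def rmse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))


def rmse_per_horizon(
    preds: Sequence[Sequence[float]], truths: Sequence[Sequence[float]]
) -> List[float]:
    """RMSE at each prediction step, taken over all test samples.

    preds[k][h] is the prediction for sample k at horizon step h.
    """
    n_steps = len(preds[0])
    return [
        rmse([p[h] for p in preds], [t[h] for t in truths])
        for h in range(n_steps)
    ]
```

Evaluating each horizon step separately is what makes error accumulation visible, for example the growth of the recursive variant's error at longer horizons.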
Visualization Results
To verify that our prediction system can incorporate environmental information in various scenarios, we choose four representative test cases that do not appear in the training data for illustration, which are shown in Fig. 6.

Case A: This case verifies that our system is able to incorporate traffic light information and interactions between vehicles driving in the same direction. At the first time step shown, both cars are predicted to stop because of the red traffic light. Once the traffic signal changes to green, the leading car is predicted to move forward while the following car is predicted to follow with a smaller velocity until the distance between them becomes larger.

Case B: This case illustrates the Unprotected Left Turn scenario and indicates that our approach can take into account road geometry and interactions among vehicles driving in different directions. As can be seen in the second column of Fig. 6, the left-turning vehicle is predicted not to turn until the oncoming vehicle has passed through the center area of the intersection. At the same time, the oncoming vehicle is predicted to maintain its velocity without yielding, since our system has learned that it has the right-of-way.

Case C: This case examines the ability of our prediction system under dense traffic in a Right Turn Merge scenario. As shown in the third column of Fig. 6, the gap between the two cars on the main road is not big enough. Thus, the merging vehicle is predicted to wait until both cars on the main road have passed before merging.

Case D: Here we examine the prediction ability of the proposed system when strong interactions between vehicles and pedestrians occur. According to Fig. 6, a pedestrian is about to cross the road and is closer to one vehicle than to the other. Therefore, the nearer vehicle is predicted to brake in front of the pedestrian, while the other vehicle is predicted to keep driving forward due to the green traffic light.
Moreover, we also test the robustness of our MAIP system in a more complicated driving scenario with 29 vehicles. As can be seen in Fig. 7, the MAIP model is able to predict any number of road entities and take their interactions into account even though the system was trained on a much smaller number of entities.
According to the selected cases, our approach is capable of predicting future behaviors for vehicles driving in various situations while considering interactions with other road entities. Moreover, in Case B and Case C, vehicles in some lanes can either go straight or make a turn. Therefore, as vehicles drive closer to the intersection center, the predictor is expected to account for prediction uncertainty by considering every possible driving direction. According to the results, the sampled trajectories indeed split into different groups, which illustrates that the desired multimodal distribution can be predicted by our proposed framework.
V Conclusions
In this paper, a Multiagent Interactive Prediction (MAIP) system is proposed, which uses static and dynamic environment information to predict every road entity while taking into account their mutual interactions. We examined the performance of the proposed system in a simulated urban intersection scenario. We first compared the prediction accuracy of our method with four different approaches and concluded that the proposed MAIP system outperforms the others in terms of the mean and standard deviation of the prediction error, especially for long prediction horizons. We then selected four representative test cases to illustrate the capability of our method in various unseen challenging scenarios. The results show that the proposed MAIP system successfully learned to reason about complicated environment information and provides rational prediction results for every on-road vehicle. In future work, we will extend the current allocentric input to an egocentric input, which may make our system more robust and adaptive.
Footnotes
 The simulation environment and some testing results can be found on https://youtu.be/TpPqDVaBl1I
References
[1] (2011) A Bayesian approach for driving behavior inference. In 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 595–600.
[2] (2017) An LSTM network for highway trajectory prediction. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 353–359.
[3] (2009) Real time trajectory prediction for collision risk estimation between vehicles. In 2009 IEEE 5th International Conference on Intelligent Computer Communication and Processing, pp. 417–422.
[4] (2012) Intersection management using vehicular networks. Technical report, SAE Technical Paper.
[5] (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938.
[6] (2008) Statistical threat assessment for general road scenes using Monte Carlo sampling. IEEE Transactions on Intelligent Transportation Systems 9 (1), pp. 137–147.
[7] (2010) Vehicle tracking and motion prediction in complex urban scenarios. In 2010 IEEE Intelligent Vehicles Symposium, pp. 26–33.
[8] (2018) Probabilistic prediction of vehicle semantic intention and motion. In Intelligent Vehicles Symposium (IV), 2018 IEEE, pp. 307–313.
[9] (2019) Multimodal probabilistic prediction of interactive behavior via an interpretable model. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV2019).
[10] (2004) IMM object tracking for high dynamic driving maneuvers. In IEEE Intelligent Vehicles Symposium, 2004, pp. 825–830.
[11] (2017) Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 399–404.
[12] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[13] (2013) Learning-based approach for online lane change intention prediction. In 2013 IEEE Intelligent Vehicles Symposium (IV), pp. 797–802.
[14] (2011) Probabilistic analysis of dynamic scenes and collision risks assessment to improve driving safety. IEEE Intelligent Transportation Systems Magazine 3 (4), pp. 4–19.
[15] (2012) Driver intent inference at urban intersections using the intelligent driver model. In 2012 IEEE Intelligent Vehicles Symposium, pp. 1162–1167.
[16] (2007) Lane change intent analysis using robust operators and sparse Bayesian learning. IEEE Transactions on Intelligent Transportation Systems 8 (3), pp. 431–440.
[17] (2013) A stereovision based object tracking approach at roundabouts. IEEE Intelligent Transportation Systems Magazine 5 (2), pp. 22–32.
[18] (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
[19] (2014) Prediction of driver intended path at intersections. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp. 134–139.
[20] (2014) Online maneuver recognition and multimodal trajectory prediction for intersection assistance using non-parametric regression. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp. 918–923.