LTN: Long-Term Network for Long-Term Motion Prediction
Abstract
Accurate motion prediction of surrounding agents such as pedestrians and vehicles is a critical task when robots try to perform autonomous navigation. Recent research on multimodal trajectory prediction, including regression and classification approaches, performs very well at short-term prediction. However, when it comes to long-term prediction, most Long Short-Term Memory (LSTM) based models tend to diverge far from the ground truth. Therefore, in this work, we present a two-stage framework for long-term trajectory prediction, named Long-Term Network (LTN). Our Long-Term Network integrates both the regression and classification approaches. We first generate a set of proposed trajectories from a distribution produced with a Conditional Variational Autoencoder (CVAE), then classify them with binary labels, and output the trajectory with the highest score. We demonstrate our Long-Term Network's performance with experiments on two real-world pedestrian datasets, ETH/UCY and the Stanford Drone Dataset (SDD), and one challenging real-world driving forecasting dataset, nuScenes. The results show that our method outperforms multiple state-of-the-art approaches in long-term trajectory prediction in terms of accuracy.
I Introduction
Accurately predicting the motion of surrounding agents such as pedestrians and vehicles is essential when mobile robots or autonomous vehicles try to perform navigation tasks. In traffic systems, the future behavior of each traffic participant is determined by multiple aspects, such as the movement of other traffic agents, the physical constraints, and the traffic rules [1, 2, 3]. Humans can navigate through a complex traffic scenario because they can reason about all the other people's actions and about how the physical constraints in the traffic system affect their movements. Therefore, for a robot navigating through a complex traffic system, we need to consider the movement of all the other surrounding traffic agents and the physical constraints in the traffic system.
With the introduction of the vanilla LSTM model, researchers started to use Long Short-Term Memory networks to regress the future trajectories of traffic agents. The LSTM processes data sequentially, so it is suitable for predicting trajectories, which are also sequential data. Starting from the Social LSTM [4] model, researchers began to model people's social interactions. When predicting the future trajectories of traffic agents, these models store knowledge about people, e.g. speed, direction, motion pattern, and their social interactions, in the hidden state [5, 6, 7, 8].
Later, map information was integrated by extracting map features with a Convolutional Neural Network (CNN) combined with the LSTM model, which largely improves prediction accuracy. Recent works compete with each other by using different structures for modeling social interactions, and introduce an LSTM-based encoder-decoder structure: the model encodes the past trajectory of the traffic agent, along with the nearby traffic agents, using the LSTM, produces a regression of the future trajectory, and decodes this trajectory using the LSTM. Trajectron++ [2], WWTG [7] (Where Will They Go), and Social-BiGAT [9], which are recent state-of-the-art models, outperform most popular LSTM models in future trajectory prediction accuracy. Trajectron++, for example, uses the traditional LSTM encoder-decoder structure, but encodes the past and future trajectories into a latent space using the Conditional Variational Autoencoder (CVAE) [10, 11]. For prediction, it draws a latent variable from the latent space and decodes it into a regression using a GRU [12] and the past trajectory information. However, there is still room for improvement in long-term prediction accuracy, especially 3 to 4 seconds after the current observation.
In this work, we propose a two-stage framework called Long-Term Network (LTN) to improve long-term trajectory prediction in terms of accuracy. In the first stage, LTN uses a traditional LSTM-GRU encoder-decoder structure along with the CVAE [10] to produce a set of possible future trajectory proposals. In the second stage, LTN performs classification and refinement on the trajectory proposals, and outputs the proposal with the highest score as the final trajectory prediction result. The trajectory proposals are generated based on the surrounding traffic agents identified by the LTN and the prior extracted map information, so that the model can identify the traversable space of our robot and the possible effects of surrounding traffic agents in order to make better proposals.
The contributions of this paper are summarized as follows: 1) We propose a newly modified GRU unit called the Mogrifier GRU, based on the idea of the Mogrifier LSTM [13]. By refining the hidden state, we improve long-term prediction accuracy by 10% just by replacing the regular GRU with our Mogrifier GRU. 2) We propose a two-stage approach that combines regression and classification methods and largely improves performance on long-term trajectory prediction. 3) Our model achieves state-of-the-art results on the widely used pedestrian trajectory prediction datasets ETH/UCY [14, 15] and the Stanford Drone Dataset (SDD) [16], and on one real-world driving forecasting dataset, nuScenes [17].
II Related Work
II-A Multi-Modal Trajectory Prediction
Many earlier works in human trajectory forecasting can be roughly divided into two categories: classic methods and deep learning approaches. The classic methods include kinematic equations and statistical models like polynomial fitting and Gaussian mixture models. In most recent works, the Recurrent Neural Network (RNN) and its variants, such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) [12], together with the Convolutional Neural Network, form the basis of most models. Researchers utilize a CNN to extract map features, and an RNN or its variants, LSTM and GRU, to capture the social interactions between traffic agents, and then regress the future trajectory [18, 9, 19, 20, 21].
II-B Deep Generative Models
Besides the classic methods and the CNN-RNN combined approaches, generative approaches have emerged as another state-of-the-art direction in trajectory prediction. This approach shifts some researchers' focus from regressing a single future trajectory to producing a distribution over future trajectories. With a full distribution of the future trajectory, more possible trajectory proposals can be produced. To produce such distributions, most works use a recurrent backbone architecture with a latent variable model, such as the Generative Adversarial Network (GAN) [22] or the Conditional Variational Autoencoder (CVAE) [10]. Currently, Trajectron++ [2], Social-BiGAT [9], and WWTG [7] are CVAE- and GAN-based models that outperform most state-of-the-art trajectory prediction models. Trajectron++ [2] and Social-BiGAT [9] are able to account for the social interactions between traffic agents and the physical constraints in the scene.
II-C Regression and Classification
Current models that produce full distributions mostly utilize the Gaussian Mixture Model and output the local maximum of the distribution as the final trajectory prediction result. But empirically, as the qualitative analyses in most of these works show, this output is not actually the trajectory from the full distribution closest to the ground truth. Hence, new approaches combining regression and classification have appeared, which generate a set of hypothesis trajectory proposals and output the proposal with the highest score as the final trajectory prediction result. Trajectory Proposal Network (TPNet) [19] is a state-of-the-art model that uses this regression-and-classification method: it fits a polynomial between the starting point and a proposed endpoint of the traffic agent while considering the social interactions and traffic rules. At the end, the model performs classification on these proposed trajectories and outputs the proposals with the top scores.
III Problem Formulation
In this work, we take our robot as the center point of the scene, determine its surrounding traffic agents, and predict their future trajectories. During the observation interval $[1, t_{obs}]$, we denote the number of traffic agents surrounding the robot at time $t$ as $N_t$, and denote the surrounding agents as $A_1, \dots, A_{N_t}$. Each agent is categorized as a pedestrian or a vehicle; for simplicity, the vehicle category also includes agents like bicycles, motorcycles, and cars. The model takes a series of past positions $\mathbf{X}_i = \{\mathbf{x}_i^{t} \mid t = 1, \dots, t_{obs}\}$ of each agent $A_i$, and also a series of past positions $\mathbf{X}_R$ of the robot, where each position $\mathbf{x}^{t} = (x, y) \in \mathbb{R}^2$. For each agent's future trajectory over the prediction interval $(t_{obs}, T]$, we write $\mathbf{Y}_i = \{\mathbf{y}_i^{t} \mid t = t_{obs}+1, \dots, T\}$, and for the robot's future trajectory we write $\mathbf{Y}_R$, where each $\mathbf{y}^{t} \in \mathbb{R}^2$. We also incorporate the map information in the same way as Trajectron++, encoding the local map around agent $A_i$ as $M_i$, where $i = 1, \dots, N_t$.
IV Method
IV-A Long-Term Network
To further improve the performance of current models in long-term trajectory prediction, we propose a two-stage framework called Long-Term Network (LTN). The framework is visualized in Figure 1.
IV-B Determining the Surrounding Traffic Agents
To determine the surrounding traffic agents, we first determine the number of agents in the scene and denote it as $N$. We include the agents $A_i$ ($i = 1, \dots, N$) that are close to the robot in distance. Formally, agent $A_i$ is selected if at time $t_{obs}$, $\lVert \mathbf{x}_i^{t_{obs}} - \mathbf{x}_R^{t_{obs}} \rVert_2 \le d$, where $d$ is a hyperparameter indicating the maximum perception distance.
Since we are going to predict each selected agent $A_i$'s future trajectory, we perform a similar process to select the surrounding traffic agents of $A_i$ itself. The agents around $A_i$ are again determined by distance. Formally, agent $A_j$ ($j \ne i$) is selected as a neighbor of $A_i$ if at time $t_{obs}$, $\lVert \mathbf{x}_j^{t_{obs}} - \mathbf{x}_i^{t_{obs}} \rVert_2 \le d$, where $d$ is again the same hyperparameter expressing the maximum perception distance.
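As a concrete illustration, the distance-based neighbor selection above can be sketched as follows (a minimal sketch; the function name, data layout, and variable names are ours, not the paper's):

```python
import math

def select_surrounding_agents(center_pos, agent_positions, d_max):
    """Return the ids of agents within distance d_max of center_pos.

    center_pos: (x, y) of the robot or of a selected agent;
    agent_positions: {agent_id: (x, y)} at the last observed timestep;
    d_max: the maximum perception distance hyperparameter d.
    """
    selected = []
    for agent_id, (ax, ay) in agent_positions.items():
        # Euclidean distance to the center point
        if math.hypot(ax - center_pos[0], ay - center_pos[1]) <= d_max:
            selected.append(agent_id)
    return selected
```

The same routine is applied twice: once centered on the robot, and once centered on each selected agent to gather that agent's own neighbors.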
IV-C Modeling the Agent History and the Social Interactions
To model the agent history, we primarily utilize the Mogrifier LSTM [13], a variant of the vanilla LSTM model. The Mogrifier LSTM has better long-term performance than the vanilla version, as the experiments in that paper demonstrate. The Mogrifier LSTM uses the same LSTM module, but before each unit, it updates the input and previous hidden state with several rounds of mutual gating, called mogrifying steps. The Mogrifier LSTM can also be implemented on top of a bidirectional LSTM. Since there is no public code for the Mogrifier LSTM, we implement the bidirectional Mogrifier LSTM by adding the mogrifying steps to a traditional bidirectional LSTM. The number of mogrifying steps is 6, and the number of layers is 2. To model the agent history, we input the agent's history trajectory into a Mogrifier LSTM network with 32 hidden dimensions to obtain the encoded agent history tensor.
To model the social interactions, we input the trajectories of the agent's surrounding traffic agents into a Mogrifier LSTM with 8 hidden dimensions. Then, we utilize an attention module that encodes these social interactions via additive attention: the encoded tensors of all surrounding traffic agents are aggregated into one attention tensor, which is then concatenated with the corresponding agent history tensor to obtain one complete history tensor.
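The additive-attention aggregation described above can be sketched as follows (shapes and parameter names are illustrative assumptions, not the paper's exact module):

```python
import numpy as np

def additive_attention_pool(query, keys, W_q, W_k, v):
    """Aggregate neighbor encodings into one attention tensor, then
    concatenate with the agent's own history encoding.

    query: (d_q,) agent-history encoding; keys: (n, d_k) neighbor encodings.
    W_q: (d_q, d_a), W_k: (d_k, d_a), v: (d_a,) are learnable parameters
    of the additive (Bahdanau-style) scoring function.
    """
    # Additive alignment scores, one per neighbor: (n,)
    scores = np.tanh(query @ W_q + keys @ W_k) @ v
    # Softmax over neighbors (numerically stable)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    pooled = weights @ keys                   # (d_k,) aggregated attention tensor
    return np.concatenate([query, pooled])    # complete history tensor
```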
IV-D Map Encoding and Future Encoding
To obtain the encoded map information, we utilize a Convolutional Neural Network (CNN) to encode the local map information, similar to Trajectron++ [2].
We encode the target agent's and its surrounding traffic agents' future trajectories during the training phase, in order to provide the information needed to formulate the future trajectory distribution used in training. We input the agent's future trajectory into a Mogrifier LSTM network with 32 hidden dimensions to obtain the encoded agent future tensor, and we input the trajectories of the agent's surrounding traffic agents into a Mogrifier LSTM with 8 hidden dimensions. Then, additive attention is used to aggregate the surrounding agents' tensors into one attention tensor, which we concatenate with the corresponding agent's future tensor to obtain one complete future tensor.
IV-E CVAE Latent Variable Framework
To address multimodality and produce a full distribution over the agent's future trajectory, we utilize the modified Conditional Variational Autoencoder (CVAE) [10] latent variable framework employed in [23] and [24]. The framework utilizes our encoded complete history tensor $\mathbf{x}$ and produces the full distribution over the future trajectory $\mathbf{y}$ of the agent by defining a discrete categorical latent variable $z$, which we can express as:

$p(\mathbf{y} \mid \mathbf{x}) = \sum_{z} p_\psi(\mathbf{y} \mid \mathbf{x}, z)\, p_\theta(z \mid \mathbf{x})$   (1)

where $\psi$ and $\theta$ are neural network weights that parameterize their respective distributions, and $z$ is a discrete latent variable that also aids in interpretability.

During training, we have the ground-truth future trajectory $\mathbf{y}$ in the dataset and use a bidirectional Mogrifier LSTM with 32 hidden dimensions to produce a ground-truth future trajectory distribution $q_\phi(z \mid \mathbf{x}, \mathbf{y})$, where $\phi$ is again a neural network weight.
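Because the latent variable is discrete, the marginalization above is a finite weighted sum over modes. A toy sketch (hypothetical per-mode mean trajectories; not the model's decoder):

```python
import numpy as np

def marginal_mean_trajectory(p_z, mode_means):
    """E[y | x] = sum_z p(z | x) * E[y | x, z] for a discrete latent z.

    p_z: (K,) categorical probabilities p(z | x);
    mode_means: (K, T, 2) per-mode mean trajectories E[y | x, z].
    Contracts the latent axis, leaving a (T, 2) expected trajectory.
    """
    return np.tensordot(p_z, mode_means, axes=1)
```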
IV-F Mogrifier GRU
To obtain the final trajectory distribution, we need to decode it from the latent variable $z$ and the complete history tensor, using GRU [12] units. We present a more powerful variant of the GRU unit, where we adopt an idea similar to the Mogrifier LSTM [13] and process the input and hidden state before each GRU unit. Suppose we have input $x_t$ and previous hidden state $h_{t-1}$; for a normal GRU, the current hidden state $h_t$ is calculated by:

$h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1}$   (2)

and the gates are calculated by:

$r_t = \sigma(W_{ir} x_t + W_{hr} h_{t-1}), \quad z_t = \sigma(W_{iz} x_t + W_{hz} h_{t-1}), \quad n_t = \tanh(W_{in} x_t + r_t \odot (W_{hn} h_{t-1}))$   (3)

where $r_t$, $z_t$, $n_t$ are the reset, update, and new gates, $\sigma$ is the sigmoid function, $\odot$ is the Hadamard product, and all the $W$ are learnable weight matrices.

Our Mogrifier GRU works by performing mogrifying steps before the usual GRU computation step. Suppose we perform the mogrifying step $r$ times; then $\mathrm{MogGRU}(x_t, h_{t-1}) = \mathrm{GRU}(x_t^{\uparrow}, h_{t-1}^{\uparrow})$, where $x_t^{\uparrow}$ and $h_{t-1}^{\uparrow}$ denote the highest-indexed updated input and hidden state below, with $x_t^{-1} = x_t$ and $h_{t-1}^{0} = h_{t-1}$. For each $i \in \{1, \dots, r\}$:

$x_t^{i} = 2\,\sigma(Q^{i} h_{t-1}^{i-1}) \odot x_t^{i-2}$   (4)

$h_{t-1}^{i} = 2\,\sigma(R^{i} x_t^{i-1}) \odot h_{t-1}^{i-2}$   (5)

where for odd $i$, equation (4) is performed, and for even $i$, equation (5) is performed. The number of mogrifying steps $r$ has to be an integer. In our model, we choose $r = 6$, and $r = 0$ recovers the traditional GRU unit. The matrices $Q^{i}$ and $R^{i}$ in the mogrifying steps are the same as the corresponding parameters in the Mogrifier LSTM, which are randomly initialized matrices. We report the experimental results of the Mogrifier GRU in the experiment section to show its performance compared with the traditional GRU.
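The mechanism above can be illustrated with a minimal numpy sketch of a single cell step (our illustrative implementation, not the paper's code; note that with zero-initialized gating matrices each mogrifying step scales by $2\sigma(0) = 1$, so the cell reduces to a plain GRU):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, W, U):
    """One standard GRU step. W: (3*d_h, d_x) input weights, U: (3*d_h, d_h)
    hidden weights, both laid out as stacked [reset; update; new] blocks."""
    d = h.shape[0]
    gi, gh = W @ x, U @ h
    r = sigmoid(gi[:d] + gh[:d])             # reset gate
    z = sigmoid(gi[d:2 * d] + gh[d:2 * d])   # update gate
    n = np.tanh(gi[2 * d:] + r * gh[2 * d:]) # candidate ("new") state
    return (1.0 - z) * n + z * h

def mog_gru_cell(x, h, W, U, Q, R):
    """Mogrifier GRU step: mutually gate x and h, then run the GRU.

    Q: list of (d_x, d_h) matrices used on odd mogrifying steps;
    R: list of (d_h, d_x) matrices used on even mogrifying steps.
    The total number of mogrifying steps is len(Q) + len(R).
    """
    for i in range(len(Q) + len(R)):
        if i % 2 == 0:   # step i+1 is odd: update the input from the hidden state
            x = 2.0 * sigmoid(Q[i // 2] @ h) * x
        else:            # step i+1 is even: update the hidden state from the input
            h = 2.0 * sigmoid(R[i // 2] @ x) * h
    return gru_cell(x, h, W, U)
```

With `Q = R = []` (zero mogrifying steps) the call is exactly `gru_cell`, matching the statement that $r = 0$ recovers the traditional GRU.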
IV-G Trajectory Proposal and Classification Module
During training, once we obtain the full distribution of the future trajectory of the agent, we produce the final trajectory proposals by sampling $N$ trajectories from the latent variable distribution, where the latent variable is determined by:

$z \sim p_\theta(z \mid \mathbf{x}), \qquad \hat{\mathbf{y}}_n \sim p_\psi(\mathbf{y} \mid \mathbf{x}, z), \quad n = 1, \dots, N$   (6)

Then, in the classification module, we follow the method used in TPNet [19]: we assign each proposal a binary class label indicating whether it is a good trajectory or not. We use the average distance between each sampled trajectory proposal and the ground truth trajectory as the criterion for measuring the proposal's quality, which can be expressed as:

$d_n = \frac{1}{T_f} \sum_{t=1}^{T_f} \lVert \hat{\mathbf{y}}_n^{t} - \mathbf{y}^{t} \rVert_2$   (7)

which is the average distance between the $n$-th sampled proposal $\hat{\mathbf{y}}_n$ and the ground truth trajectory $\mathbf{y}$, with $T_f$ the number of predicted timesteps. Then, a threshold $\tau$ is used: proposed trajectories with $d_n$ lower than $\tau$ are assigned positive labels, and proposed trajectories with $d_n$ larger than $\tau$ are assigned negative labels.
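The labeling rule can be sketched as follows (a minimal sketch; `tau` stands for the distance threshold):

```python
import numpy as np

def label_proposals(proposals, gt, tau):
    """Assign binary quality labels to sampled trajectory proposals.

    proposals: (N, T, 2) sampled trajectories; gt: (T, 2) ground truth;
    tau: distance threshold hyperparameter. A proposal gets a positive
    label (1) when its mean point-wise L2 distance to the ground truth
    is below tau, otherwise a negative label (0).
    """
    # Per-proposal mean point-wise L2 distance to the ground truth: (N,)
    dists = np.linalg.norm(proposals - gt[None], axis=-1).mean(axis=1)
    return (dists < tau).astype(int)
```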
IV-H Objective Function
In our model, there are two loss functions to be minimized: one is the regression loss, and the other is the classification loss.
Regression Loss: We adopt the objective function provided in Trajectron++ [2] and WWTG [7], where we aim to solve:

$\max_{\phi, \theta, \psi} \; \mathbb{E}_{z \sim q_\phi(z \mid \mathbf{x}, \mathbf{y})}\!\left[\log p_\psi(\mathbf{y} \mid \mathbf{x}, z)\right] - \beta\, D_{KL}\!\left(q_\phi(z \mid \mathbf{x}, \mathbf{y}) \,\|\, p_\theta(z \mid \mathbf{x})\right) + \alpha\, I_q(\mathbf{x}; z)$   (8)

where, following [2], $I_q$ is the mutual information between $\mathbf{x}$ and $z$ under the distribution $q_\phi$. The computation of $I_q$ is the same as in [25], and $\alpha$ and $\beta$ are hyperparameters.
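Since the latent variable is a discrete categorical variable, the KL term in the objective reduces to a finite sum over categories; a minimal numpy sketch (illustrative, not the training code):

```python
import numpy as np

def categorical_kl(q, p, eps=1e-12):
    """KL(q || p) for two categorical distributions over the discrete latent z.

    q, p: probability vectors of equal length summing to 1; eps guards
    against log(0) for zero-probability categories.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))
```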
Classification Loss: For the classification loss, we follow TPNet [19], where a binary cross-entropy loss is employed:

$L_{cls} = \frac{1}{N} \sum_{n=1}^{N} -\left[ c_n^{*} \log c_n + (1 - c_n^{*}) \log(1 - c_n) \right]$   (9)

The total loss is written as follows:

$L = L_{reg} + \lambda\, L_{cls}$   (10)

where $N$ is the number of trajectory proposals in the proposal set, $\lambda$ is a learnable weight, $c_n$ is the corresponding predicted label, $c_n^{*}$ is the corresponding ground truth label, and $L_{reg}$ is the regression loss, i.e., the negative of the objective in (8).
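The combined objective can be sketched as follows (illustrative; `lam` stands for the learnable weight, and the regression term is passed in as a precomputed scalar):

```python
import numpy as np

def classification_loss(scores, labels, eps=1e-7):
    """Binary cross-entropy averaged over the N proposal scores."""
    p = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(labels, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def total_loss(l_reg, scores, labels, lam=1.0):
    """Total objective: regression loss plus weighted classification loss."""
    return l_reg + lam * classification_loss(scores, labels)
```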
During the training phase, the regression module minimizes the regression loss, and the classification module minimizes the classification loss.
V Experiments
V-A Datasets
Our model is evaluated on four widely used public datasets: ETH, UCY, the Stanford Drone Dataset, and nuScenes. The ETH and UCY datasets focus on pedestrian trajectory prediction and contain complex social interactions. Together they consist of five subsets, named ETH, HOTEL, UNIV, ZARA1, and ZARA2. There are two settings for the prediction length, 3.2 s and 4.8 s, with an observation length of 3.2 s. The data is captured at 2.5 Hz (one frame every 0.4 s), so the dataset contains 8 frames for observation and 8/12 frames for prediction.
The Stanford Drone Dataset is a trajectory dataset captured by drones from a top-down view, so the scenes in the dataset are top-down views. The scenes are captured on a university campus with vehicles, cyclists, and crowds, so the dataset contains a large amount of heterogeneous data.
The nuScenes dataset is a challenging, large real-world driving forecasting dataset, with more than 1000 scenes captured in Boston and Singapore. Each scene is 20 seconds long, and the dataset contains high-definition semantic maps. All the scenes contain a large amount of heterogeneous data, with complex social interactions among up to 23 semantic object classes. The maps also provide data about the physical constraints in each scene.
V-B Evaluation Metrics
In our experiment, we use four metrics that are commonly used across trajectory prediction models: minADE, minFDE, ADE, and FDE.

minADE (minimum Average Displacement Error): the minimum mean L2 distance between the ground truth trajectory and the predicted trajectory over the best of 20 samples produced by the LTN.

minFDE (minimum Final Displacement Error): the minimum L2 distance between the ground truth final position and the predicted final position at the final timestep over the best of 20 samples produced by the LTN.

ADE: the mean L2 distance between the ground truth trajectory and the trajectory predicted by the LTN.

FDE: the L2 distance between the ground truth final position and the final position predicted by the LTN at the final timestep.
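These metrics can be computed as follows (a minimal numpy sketch under the best-of-N evaluation protocol, with N = 20 in the paper's setting):

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error: mean L2 distance over all timesteps.
    pred, gt: (T, 2) arrays of positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final displacement error: L2 distance at the last timestep."""
    return np.linalg.norm(pred[-1] - gt[-1])

def min_ade_fde(samples, gt):
    """Best-of-N variants: the minimum ADE/FDE over a list of sampled
    trajectories (e.g. the 20 samples drawn from the model)."""
    return (min(ade(s, gt) for s in samples),
            min(fde(s, gt) for s in samples))
```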
Table I: ADE (top) and FDE (bottom) on the ETH/UCY subsets (lower is better).

Metric | Dataset | LSTM | S-LSTM | S-GAN | Trajectron++ | S-BiGAT | TPNet | SoPhie | STGAT | LTN
ADE | ETH | 1.09 | 1.09 | 0.81 | 0.43 | 0.69 | 0.84 | 0.70 | 0.65 | 0.39
ADE | HOTEL | 0.86 | 0.79 | 0.81 | 0.12 | 0.69 | 0.24 | 0.76 | 0.49 | 0.16
ADE | UNIV | 0.61 | 0.67 | 0.72 | 0.22 | 0.4 | 0.42 | 0.54 | 0.55 | 0.20
ADE | ZARA1 | 0.41 | 0.47 | 0.60 | 0.17 | 0.55 | 0.33 | 0.30 | 0.30 | 0.18
ADE | ZARA2 | 0.52 | 0.56 | 0.34 | 0.12 | 0.30 | 0.26 | 0.38 | 0.36 | 0.15
ADE | Average | 0.70 | 0.72 | 0.42 | 0.20 | 0.36 | 0.42 | 0.54 | 0.48 | 0.22
FDE | ETH | 2.41 | 2.35 | 1.52 | 0.86 | 1.29 | 1.73 | 1.43 | 1.12 | 0.80
FDE | HOTEL | 1.91 | 1.76 | 1.61 | 0.19 | 1.01 | 0.46 | 1.67 | 0.66 | 0.18
FDE | UNIV | 1.31 | 1.40 | 1.26 | 0.43 | 1.32 | 0.94 | 1.24 | 1.10 | 0.41
FDE | ZARA1 | 0.88 | 1.00 | 0.63 | 0.32 | 0.62 | 0.75 | 0.63 | 0.69 | 0.29
FDE | ZARA2 | 1.11 | 1.17 | 0.84 | 0.25 | 0.75 | 0.60 | 0.78 | 0.60 | 0.23
FDE | Average | 1.52 | 1.54 | 1.18 | 0.41 | 1.00 | 0.90 | 1.15 | 1.08 | 0.38
Table II: minADE and minFDE on the Stanford Drone Dataset (lower is better).

Metric | Dataset | S-LSTM | S-GAN | S-ATTN | STGAT | Trajectron++ | CAR-Net | LTN
minADE | SDD | 31.4 | 27.0 | 33.3 | 18.8 | 19.3 | 25.72 | 15.2
minFDE | SDD | 55.6 | 43.9 | 55.9 | 31.3 | 32.7 | 51.80 | 25.8
V-C Baselines
We compare the performance of our model with the state-of-the-art models below:

Vanilla LSTM: An LSTM network utilizing only the agent's own history trajectory information.

Social LSTM [4]: An LSTM network that utilizes not only the agent's history trajectory information, but also uses LSTMs to model the trajectories of the agents around it.

Social GAN [26] (S-GAN): A GAN with social interaction considered. Each agent is modeled by a combined LSTM-GAN network, where the LSTM encoder-decoder serves as the generator of the GAN, and the generated trajectories are then evaluated against the ground truth trajectories by the discriminator.

Trajectron++ [2]: A CVAE-based trajectory prediction model with an LSTM-GRU encoder-decoder structure, where the LSTM models the agent's history and its corresponding social interactions, and the GRU decoder produces a full distribution of the predicted trajectory.

Social-BiGAT [9] (S-BiGAT): An LSTM-GAN that uses a Graph Attention Network to encode agents' social interactions.

TPNet [19]: A CNN-based network that produces a trajectory proposal set by first predicting potential endpoints from the given map information, then regressing trajectories between the starting point and each proposed endpoint based on the map. At the end, a classification is performed to output the proposals with high scores.

SoPhie [27]: A GAN-based trajectory prediction model that also leverages social interactions and physical information. Similar to Social GAN, the trajectories are produced by the generator, and the discriminator evaluates these predictions against the ground truth trajectories.

STGAT [28]: A spatial-temporal graph-based trajectory prediction model. The spatial interactions are captured by graph attention mechanisms, and LSTMs are used for the temporal interactions.

Social Attention [29] (S-ATTN): An attention-based trajectory prediction model that captures the relative importance, i.e., the attention, of each person when predicting future trajectories.

CAR-Net [30]: A prediction model that accounts for the dependencies between agents' behavior and their spatial environment, learning where to look in a large environment when predicting the trajectory of an agent.

CSP [24]: A prediction model built on an LSTM encoder-decoder framework, where the LSTM models the agent's social interactions and outputs a multimodal future distribution of the agent based on them.

SpAGNN [31]: A probabilistic model that utilizes a graph neural network to capture the interactions between vehicles and outputs a distribution over the future trajectory of the selected vehicle.
Implementation Details: For our experiments, we adopted the ETH/UCY/nuScenes dataset preprocessing provided in Trajectron++ [2]. SDD is preprocessed according to the methods provided in EvolveGraph [32]. For our Mogrifier GRU, we chose the number of mogrifying steps to be 6. In the regression objective, $\alpha = 1$ and $\beta$ is dynamically adjusted for the best performance, following Trajectron++, which provides the best results. We optimize the network using the Adam [33] optimizer with learning rate 0.002. The threshold we used is 3. We implemented LTN in PyTorch on a desktop running Ubuntu 20.04, equipped with one Intel i7-8700K CPU and two Nvidia GTX 1070 Ti GPUs.
V-D Evaluation of Trajectory Prediction
We compared our Mogrifier GRU version of LTN with several baseline methods on the ETH and UCY datasets in terms of the two metrics ADE and FDE in Table I. We report our results as LTN, taking the trajectory proposal with the highest score out of 20. In Table I, we can see that in terms of ADE, our LTN can outperform Trajectron++, which leads the rest of the models in the data we collected, on the ETH and UNIV subsets. In terms of FDE, our LTN outperforms all other methods across every subset, showing that our model excels at long-term trajectory prediction: we improved FDE, i.e., long-term prediction, by nearly 10%. However, we also notice that performance on both ETH and UCY has saturated, meaning further improvement is unlikely due to data annotation errors or other errors introduced during data collection, which leads us to analyze LTN's performance on another dataset, SDD.
We compared our Mogrifier GRU version of LTN with several baseline methods on the SDD dataset in terms of minADE and minFDE in Table II. Our method outperformed all the baseline methods, improving on the best baseline (STGAT) by roughly 19% in minADE and 18% in minFDE.
Having demonstrated LTN's long-term accuracy advantages on ETH, UCY, and SDD, we also introduce a dataset with more heterogeneous data, nuScenes.
Table III: Prediction error on nuScenes at 1–4 s horizons (lower is better; dashes mark values not reported).

Methods | 1s | 2s | 3s | 4s
S-LSTM | 0.47 | – | 1.61 | –
CSP | 0.46 | – | 1.50 | –
CAR-Net | 0.38 | – | 1.35 | –
SpAGNN | 0.36 | – | 1.23 | –
Trajectron++ | 0.07 | 0.45 | 1.14 | 2.20
LTN | 0.13 | 0.43 | 0.92 | 1.74
In Table III, we compare our LTN model with several baseline methods on the nuScenes dataset. Trajectron++ outperforms all the other baseline methods across the full 4-second time span, but our LTN outperforms Trajectron++, especially in the long term, where over 20 percent improvement is achieved at 3 seconds and 4 seconds. We also notice that in the short term, especially at 1 second, our method does not outperform Trajectron++, but we are not especially concerned by this, because our model demonstrates persuasive long-term prediction performance in terms of accuracy.
V-E Evaluation of the Mogrifier GRU
We now examine the performance of the Mogrifier GRU. As mentioned before, because we believe the ETH/UCY datasets have reached saturation, leaving little room for improvement, we conduct the rest of the experiments on the SDD and nuScenes datasets. As Table IV shows, we set up an experiment with two versions of Trajectron++, since the authors kindly provided their code:
Version 1: Trajectron++ with regular GRU as decoder.
Version 2: Trajectron++ with Mogrifier GRU as decoder.
Table IV: Trajectron++ with a regular GRU decoder versus our Mogrifier GRU decoder on nuScenes (1–4 s horizons, lower is better).

Methods | 1s | 2s | 3s | 4s
Trajectron++ (regular GRU) | 0.07 | 0.45 | 1.14 | 2.20
Trajectron++ (Mogrifier GRU) | 0.08 | 0.43 | 1.02 | 2.03
As Table IV shows, the Mogrifier GRU version of Trajectron++ outperforms the regular GRU version at the 2-, 3-, and 4-second horizons. At 3 seconds and 4 seconds, the improvement reaches around 10 percent, just by switching the regular GRU with our Mogrifier GRU. Our Mogrifier GRU is intended to refresh the future hidden states and inputs with earlier information, so our experiment shows that with this reminder of the agent's earlier states provided to the later time periods, the prediction results at 3 seconds and 4 seconds improve.
V-F Qualitative Analysis
We compare our visualized experiment results with Trajectron++ [2] in the same scene from the nuScenes [17] dataset. In Fig. 2(a) and Fig. 2(b), we can clearly see the advantage brought by our distribution. Comparing the shapes of the distributions in both figures, the distributions of the red and blue cars in Fig. 2(b) look more uniform, meaning they can generate samples with less bias. Therefore, the distribution in Fig. 2(b) can provide more samples near the ground truth. Also, the dense center of the distribution shifts towards the ground truth compared with that of the full distribution, so the proposed samples have a larger probability of being the samples closest to the ground truth. Although the variance of the distribution increases, we tackle this with our classification module, where the samples with high variance are filtered out by the threshold.
In Fig. 2(b), we can also see that plainly taking the argmax of the distribution does not yield the output closest to the ground truth. This is very clear for the red car and its end-point prediction: the distribution indicates an area at its very center that is closest to the ground truth trajectory, but the most likely output does not fall into that area. With our classification module, through good sampling from the distribution, filtering, and classification, we obtain the yellow hollow circle in Fig. 2(c), which falls into the area indicated by the distribution. The result is better compared with Trajectron++. For clarity, we only indicate the prediction of LTN at the final timestep, which is the end-point distribution in red for the red car.
VI Conclusion
In this work, we present our model LTN, a two-stage model for long-term trajectory prediction. LTN incorporates heterogeneous data and combines regression and classification methods to improve long-term trajectory prediction performance. Our LTN first generates a distribution over future trajectories, samples future trajectory proposals from it, and performs classification on the proposal set. Along with the data of surrounding agents and the map information, our model helps ensure that the final trajectory prediction is of high quality. Our LTN achieves state-of-the-art performance on several popular datasets and shows significant improvement in terms of long-term trajectory prediction accuracy.
References
 J. Li, H. Ma, and M. Tomizuka, “Conditional generative neural system for probabilistic trajectory prediction,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 6150–6156.
 T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, “Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data,” 2020.
 W. Zhan, J. Li, Y. Hu, and M. Tomizuka, “Safe and feasible motion generation for autonomous driving via constrained policy net,” in IECON 2017 - 43rd Annual Conference of the IEEE Industrial Electronics Society. IEEE, 2017, pp. 4588–4593.
 A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 J. Li, H. Ma, and M. Tomizuka, “Interaction-aware multi-agent tracking and probabilistic behavior prediction via adversarial learning,” in 2019 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2019.
 H. Ma, J. Li, W. Zhan, and M. Tomizuka, “Wasserstein generative learning with kinematic constraints for probabilistic interactive driving behavior prediction,” in 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 2477–2483.
 P. Felsen, P. Lucey, and S. Ganguly, “Where will they go? Predicting fine-grained adversarial multi-agent motion using conditional variational autoencoders,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
 W. Zhan, L. Sun, Y. Hu, J. Li, and M. Tomizuka, “Towards a fatalityaware benchmark of probabilistic reaction prediction in highly interactive driving scenarios,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 3274–3280.
 V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, S. H. Rezatofighi, and S. Savarese, “Social-BiGAT: Multimodal trajectory forecasting using Bicycle-GAN and graph attention networks,” 2019.
 K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 3483–3491.
 J. Li, H. Ma, Z. Zhang, and M. Tomizuka, “Social-WaGDAT: Interaction-aware trajectory prediction via Wasserstein graph double-attention network,” arXiv preprint arXiv:2002.06241, 2020.
 K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” 2014.
 G. Melis, T. Kočiský, and P. Blunsom, “Mogrifier LSTM,” 2020.
 S. Pellegrini, A. Ess, K. Schindler, and L. van Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 261–268.
 L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese, “Learning an image-based motion context for multiple people tracking,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3542–3549.
 A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Learning social etiquette: Human trajectory understanding in crowded scenes,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 549–565.
 H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019.
 J. Li, H. Ma, W. Zhan, and M. Tomizuka, “Coordination and trajectory prediction for vehicle interactions via bayesian generative modeling,” in 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 2496–2503.
 L. Fang, Q. Jiang, J. Shi, and B. Zhou, “TPNet: Trajectory proposal network for motion prediction,” 2020.
 J. Li, H. Ma, W. Zhan, and M. Tomizuka, “Generic probabilistic interactive situation recognition and prediction: From virtual to real,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 3218–3224.
 J. Li, W. Zhan, Y. Hu, and M. Tomizuka, “Generic tracking and probabilistic prediction framework and its application in autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 9, pp. 3634–3649, 2020.
 I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” 2014.
 B. Ivanovic, K. Leung, E. Schmerling, and M. Pavone, “Multimodal deep generative models for trajectory prediction: A conditional variational autoencoder approach,” in arXiv, 2020.
 N. Deo and M. M. Trivedi, “Multimodal trajectory prediction of surrounding vehicles with maneuver based lstms,” in 2018 IEEE Intelligent Vehicles Symposium (IV), 2018, pp. 1179–1184.
 S. Zhao, J. Song, and S. Ermon, “InfoVAE: Balancing learning and inference in variational autoencoders,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5885–5892, 07 2019.
 A. Gupta, J. Johnson, L. FeiFei, S. Savarese, and A. Alahi, “Social gan: Socially acceptable trajectories with generative adversarial networks,” 2018.
 A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese, “SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1349–1358.
 Y. Huang, H. Bi, Z. Li, T. Mao, and Z. Wang, “STGAT: Modeling spatial-temporal interactions for human trajectory prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
 A. Vemula, K. Muelling, and J. Oh, “Social attention: Modeling attention in human crowds,” 2018.
 A. Sadeghian, F. Legros, M. Voisin, R. Vesel, A. Alahi, and S. Savarese, “CAR-Net: Clairvoyant attentive recurrent network,” 2018.
 S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” 2019.
 J. Li, F. Yang, M. Tomizuka, and C. Choi, “EvolveGraph: Multi-agent trajectory prediction with dynamic relational reasoning,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
 D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017.