Conditional Generative Neural System for Probabilistic Trajectory Prediction
Abstract
Effective understanding of the environment and accurate trajectory prediction of surrounding dynamic obstacles are critical for intelligent systems, such as autonomous vehicles and wheeled mobile robots navigating in complex scenarios, to achieve safe and high-quality decision making, motion planning and control. Because the future is inherently uncertain, it is desirable to make inferences from a probabilistic perspective rather than to produce deterministic predictions. In this paper, we propose a conditional generative neural system (CGNS) for probabilistic trajectory prediction that approximates the data distribution, from which realistic, feasible and diverse future trajectory hypotheses can be sampled. The system combines the strengths of conditional latent space learning and variational divergence minimization, and leverages both static context and interaction information with soft attention mechanisms. We also propose a regularization method for incorporating soft constraints into deep neural networks with differentiable barrier functions, which can regulate and push the generated samples into the feasible regions. The proposed system is evaluated on several public benchmark datasets for pedestrian trajectory prediction and a roundabout naturalistic driving dataset collected by ourselves. The experimental results demonstrate that our model achieves better performance than various baseline approaches in terms of prediction accuracy.
I Introduction
A multi-agent prediction system should satisfy the following requirements in order to generate diverse, realistic future trajectories. 1) Context-aware: The system should be able to forecast trajectories which lie inside the traversable regions and are collision-free with static obstacles in the environment. For instance, when vehicles navigate in a roundabout (see Fig. 1(a)) they need to advance along the curves and avoid collisions with road boundaries. 2) Interaction-aware: The system needs to generate reasonable trajectories compliant with traffic or social rules, taking into account interactions and reactions among multiple entities. For instance, when vehicles approach an unsignalized intersection (see Fig. 1(b)), they need to anticipate others’ possible intentions and motions as well as the influence of their own behaviors on surrounding entities. 3) Feasibility-aware: The system should anticipate naturalistic and physically feasible trajectories which are compliant with vehicle kinematics or dynamics constraints, although these constraints can be ignored for pedestrians due to the large flexibility of their motions. 4) Probabilistic prediction: Since the future is full of uncertainty, the system should be able to learn an approximate distribution of future trajectories that is close to the data distribution and generate diverse samples which represent various possible behavior patterns.
In this work, we propose a generative neural system that satisfies all the aforementioned requirements for predicting trajectories in highly interactive scenarios. The system takes advantage of both explicit and implicit density learning in a unified generative system to predict the distributions of trajectories for multiple interactive agents, from which the sampled hypotheses are not only reasonable and feasible but also cover diverse possible motion patterns.
The main contributions of this paper are as follows:

A Conditional Generative Neural System (CGNS) is proposed to jointly predict future trajectories of multiple highly interactive agents, which takes into account the static context information, interactions among multiple entities and feasibility constraints.

A block attention mechanism and a Gaussian mixture attention mask are proposed and applied to historical trajectories and scene image sequences respectively, which are computationally efficient.

An effective strategy for soft constraint incorporation into deep neural networks is presented.

The latent space learning and variational divergence minimization approaches are integrated into a unified framework in a novel fashion, which combines their strengths on distribution learning.

The proposed CGNS is validated on multiple pedestrian trajectory forecasting benchmarks and is used to solve a task of anticipating motions of on-road vehicles navigating in highly interactive scenarios.
II Related Work
In this section, we provide a brief overview on related research and illustrate the distinction and advantages of the proposed generative system.
Trajectory and Sequence Prediction
Many research efforts have been devoted to predicting the behaviors and trajectories of pedestrians and on-road vehicles. Many classical approaches were employed for time-series prediction, such as variants of the Kalman filter based on system process models, time-series analysis and autoregressive models. However, such methods only suffice for short-term prediction in simple scenarios where interactions among entities can be ignored. More advanced learning-based models have been proposed to cope with more complicated scenarios, such as hidden Markov models [1, 2], Gaussian mixture regression [3, 4], Gaussian processes, dynamic Bayesian networks, and rapidly-exploring random trees. However, it is nontrivial for these approaches to handle high-dimensional data, and they require hand-designed input features, which limits the flexibility of representation learning. Moreover, these methods only predict behaviors for a single entity. A few works also took advantage of both recurrent neural networks [5, 6] and generative modeling to learn an explicit or implicit trajectory distribution, which achieved better performance [7, 8, 9]. However, they leveraged either only static context images or only trajectories of agents, which is not sufficient to make predictions for agents that interact with both static and dynamic obstacles. In this paper, we propose a conditional generative neural system which can leverage both historical scene evolution information and trajectories of multiple interactive agents and generate realistic and diverse trajectory hypotheses.
Soft Attention Mechanisms
Soft attention mechanisms have been widely used in neural networks to focus on a subset of input features, and have been extensively studied in the fields of image captioning [10], visual object tracking [11] and natural language processing. Several works also brought attention mechanisms into trajectory prediction tasks to identify the most informative and relevant obstacles [12, 13, 14, 15]. In this paper, we put forward a block attention mask mechanism for trajectories to extract the most critical features of each entity as well as a Gaussian mixture attention mechanism for context images to extract the most crucial static features.
Deep Bayesian Generative Modeling
The objective of generative models is to approximate the true data distribution, with which one can generate new samples similar to real data points with a proper variance. Generative models have been widely employed in tasks of representation learning and distribution approximation in the literature, and basically fall into two categories: explicit density models and implicit density models [16]. In recent years, since deep neural networks have been leveraged as universal distribution approximators thanks to their high flexibility, two deep generative models have been widely studied: the Variational Auto-Encoder (VAE) [17] and the Generative Adversarial Network (GAN) [18]. Since in trajectory forecasting tasks the predicted trajectories are sampled from the posterior distribution conditioned on historical information, the two models were extended to their conditional versions, resulting in the conditional VAE (CVAE) [19] and the conditional GAN (CGAN) [20, 12]. In this paper, we combine the strengths of conditional latent space learning via CVAE and variational divergence minimization via adversarial training.
III Problem Formulation
The objective of this paper is to develop a deep generative system that can accurately forecast motions and trajectories for multiple agents simultaneously. The system should take into account the historical state information, static context and interactions among dynamic entities.
Assume there are in total entities in the observation area, which may vary in different cases. We denote a set of trajectories covering the history and prediction horizons ( and ) as
(1) 
where is the 2D coordinate in the pixel space or world space. The latent random variable is denoted as , where is the current time step. The sequence of context images up to time step is denoted as . Our goal is to predict the conditional distribution of future trajectories given the historical context images and trajectories . The longterm prediction is realized by propagating the generative system multiple times to the future. To simplify the notations in the following sections, we denote the condition variable as , the sequence of predicted variables as .
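As a sketch of the setup described above, one notation consistent with the text could read as follows. The symbol names (N for the number of entities, T_h and T_f for the history and prediction horizons, k for the current time step, I for context images, C for the condition and Y for the predicted variables) are assumptions chosen for illustration and are not necessarily the authors' own.

```latex
\mathbf{T}_{k-T_h:k+T_f} = \left\{ \mathbf{t}^{i}_{k-T_h:k+T_f} \;\middle|\; i = 1, \dots, N \right\},
\qquad \mathbf{t}^{i}_{\tau} = \left( x^{i}_{\tau},\, y^{i}_{\tau} \right),
\\[4pt]
C = \left\{ \mathbf{T}_{k-T_h:k},\; \mathbf{I}_{k-T_h:k} \right\},
\qquad Y = \mathbf{T}_{k+1:k+T_f},
\qquad \text{goal: approximate } p(Y \mid C).
```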
IV Methodology
In this section, we first provide an overview of the key components and the architecture of the proposed Conditional Generative Neural System (CGNS). The detailed theories and models of each component are then illustrated.
IV-A System Overview
The architecture of CGNS is shown in Fig. 2. It consists of a deep feature extractor (DFE) with an environment attention mechanism (EAM) as well as a generative neural sampler (GNS). First, the DFE extracts deep features from a sequence of historical context images and trajectories of multiple interactive agents to obtain information on both static and dynamic obstacles, where the EAM indicates which areas and dynamic entities should receive more attention than others when predicting the trajectory of a certain entity. This information is used as the input of the GNS, which takes advantage of a deep latent variable model and a variational divergence minimization approach to generate a set of feasible, realistic and diverse future trajectories of all the involved entities. All the components are implemented with deep neural networks and can thus be trained end-to-end efficiently and consistently.
IV-B Environment-Aware Deep Feature Extraction
We take advantage of both context images and historical trajectories of interactive agents to extract deep features of both static and dynamic environments. In order to figure out the most crucial parts to consider when forecasting behaviors of certain agents, we propose a soft block attention mechanism applied to trajectories and a Gaussian mixture attention mechanism applied to context images. The details are illustrated below.
The historical and future trajectories are constructed as matrices which are treated as 2D images. The former is fed into a convolutional neural network (CNN) and an average pooling layer to obtain a contracted attention mask over the whole trajectory matrix, which is then expanded to the same size as the trajectory matrix by duplicating each column twice, corresponding to the coordinates x and y. The original trajectory matrix is multiplied element-wise by the block attention mask. This mechanism is not applied to the future trajectory matrix since it is unreasonable to have particular attention on the future evolution. The context image sequences are also fed into a CNN followed by fully connected layers to obtain a set of parameters of the Gaussian mixture distribution, which is used to calculate the context attention mask. The element-wise product of the original images and attention masks is fed to a pretrained feature extractor, which is the convolutional base of VGG19 [21] in this paper. The interaction-aware features and context-aware features are concatenated and fed into a recurrent layer followed by fully connected layers to obtain a comprehensive and consistent feature embedding.
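The two masking steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the trajectory matrix is assumed to have shape (time steps, 2 x agents) with columns ordered x1, y1, x2, y2, and so on; the per-agent weights and the Gaussian parameters are passed in directly rather than produced by a CNN; and the Gaussian components are assumed isotropic. All function names are hypothetical.

```python
import numpy as np


def expand_block_mask(agent_mask: np.ndarray) -> np.ndarray:
    """Duplicate each column twice so one weight covers an agent's (x, y) pair.

    agent_mask has shape (T, N); the result has shape (T, 2N).
    """
    return np.repeat(agent_mask, 2, axis=1)


def apply_block_attention(trajectories: np.ndarray, agent_mask: np.ndarray) -> np.ndarray:
    """Element-wise product of the trajectory matrix and the expanded mask."""
    return trajectories * expand_block_mask(agent_mask)


def gaussian_mixture_mask(height, width, weights, means, sigmas):
    """Build an (height, width) attention mask from isotropic Gaussian components.

    means holds (x, y) pixel centres; the mask is rescaled so its peak equals 1.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=float)
    for weight, (mean_x, mean_y), sigma in zip(weights, means, sigmas):
        squared_dist = (xs - mean_x) ** 2 + (ys - mean_y) ** 2
        mask += weight * np.exp(-squared_dist / (2.0 * sigma ** 2))
    return mask / mask.max()


# Two agents over three time steps; the second agent gets full attention.
trajectories = np.ones((3, 4))
agent_mask = np.array([[0.5, 1.0]] * 3)
masked_trajectories = apply_block_attention(trajectories, agent_mask)

# A single Gaussian centred in a 5 x 5 image.
context_mask = gaussian_mixture_mask(5, 5, [1.0], [(2, 2)], [1.0])
```

Multiplying an image by `context_mask` before the pretrained feature extractor would then down-weight pixels far from the attended region.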
IV-C Deep Generative Sampling
The GNS is composed of an encoder and a generator . The goal of the encoder is to learn a consistent distribution in a lower-dimensional latent space, from which the latent variable can be sampled efficiently. The generator aims to produce trajectories that are as realistic as possible. An auxiliary discriminator is adopted, which aims to distinguish fake trajectories from the ground truth. The generator and discriminator formulate a minimax game. The three components can be optimized jointly via conditional latent space learning and variational divergence minimization.
Conditional Latent Space Learning (CLSL)
The conditional latent variable model defined in this paper contains three classes of variables: condition variable , predicted variable and latent variable . We aim to obtain the conditional distribution . Given the training data , the model first samples from an arbitrary distribution . Our goal is to maximize the variational lower bound, which is written as
(2) 
where . This process can be realized with a Conditional Variational AutoEncoder which consists of an encoder network to obtain and a decoder (generator) network to model . The loss function can be formulated as a weighted sum of the reconstruction error and KL divergence:
,  (3) 
(4) 
where . The optimal encoder and generator can be obtained by
(5) 
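The weighted sum of reconstruction error and KL divergence described above can be sketched numerically as follows. The sketch assumes a diagonal Gaussian encoder output, a standard normal prior, a squared-error reconstruction term and a single weight `beta`; the text only states that the prior is arbitrary and that the sum is weighted, so these specifics and all function names are illustrative assumptions.

```python
import numpy as np


def kl_to_standard_normal(mu: np.ndarray, log_var: np.ndarray) -> np.ndarray:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def reparameterize(mu: np.ndarray, log_var: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def clsl_loss(y_true, y_recon, mu, log_var, beta=1.0) -> float:
    """Batch-averaged squared reconstruction error plus beta times the KL term."""
    reconstruction = np.mean(np.sum((y_true - y_recon) ** 2, axis=-1))
    kl = np.mean(kl_to_standard_normal(mu, log_var))
    return float(reconstruction + beta * kl)


rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0]])
log_var = np.zeros((1, 2))
z = reparameterize(mu, log_var, rng)   # one latent sample of dimension two
kl_value = kl_to_standard_normal(mu, log_var)
```

Sampling several values of `z` for the same condition is what would yield multiple trajectory hypotheses from the generator.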
Variational Divergence Minimization (VDM)
Given two conditional distributions and with absolutely continuous density functions and , which denote the real data distribution and its approximation by the GNS, the f-divergence [22] is defined as
(6) 
where f is a convex and lower-semicontinuous function with f(1) = 0. A lower bound of the f-divergence can be derived with the convex conjugate function f*
(7) 
where is an arbitrary class of mapping . In order to minimize the variational lower bound in (7), we can formulate a minimax game of and , which are parameterized by and , respectively. Then the optimal and can be obtained by
(8) 
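For reference, the standard unconditional forms of the f-divergence and its variational lower bound, as used in the convex risk minimization literature [22], are sketched below. The symbols P, Q, p, q, T and the function class are generic notation assumed here; the conditional version used in this section would additionally condition each distribution on the condition variable.

```latex
D_f(P \,\|\, Q) = \int q(x)\, f\!\left( \frac{p(x)}{q(x)} \right) dx,
\qquad f \text{ convex and lower-semicontinuous},\; f(1) = 0,
\\[4pt]
D_f(P \,\|\, Q) \;\ge\; \sup_{T \in \mathcal{T}}
\Big( \mathbb{E}_{x \sim P}\big[ T(x) \big] - \mathbb{E}_{x \sim Q}\big[ f^{*}(T(x)) \big] \Big),
\qquad f^{*}(t) = \sup_{u} \big\{ u\,t - f(u) \big\}.
```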
In this work, we propose to minimize the Pearson χ² divergence between and
(9) 
Since (9) is intractable, we leverage the adversarial learning techniques with a generator and a discriminator implemented as deep networks. The adversarial loss functions are derived as
(10) 
(11)  
To discriminate the effect of latent space learning, we also involve two additional terms and where the input are sampled from the encoded latent distribution. Thus, the optimal encoder, generator and discriminator by variational divergence minimization can be obtained as
(12)  
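To make the divergence above concrete, the sketch below evaluates the Pearson chi-squared divergence for two discrete distributions, which corresponds to the generator function f(u) = (u - 1)^2 in the f-divergence family. It also shows least-squares adversarial losses, which are commonly associated with this divergence in the adversarial learning literature; whether (10) and (11) take exactly this form, and the 0/1 target coding used here, are assumptions. Function names are hypothetical.

```python
import numpy as np


def pearson_chi2(p, q) -> float:
    """Pearson chi-squared divergence sum_x (p(x) - q(x))^2 / q(x) for discrete p, q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum((p - q) ** 2 / q))


def ls_discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Least-squares loss: push scores of real samples to 1 and of generated samples to 0."""
    return float(0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2))


def ls_generator_loss(d_fake: np.ndarray) -> float:
    """Least-squares loss: push discriminator scores of generated samples to 1."""
    return float(0.5 * np.mean((d_fake - 1.0) ** 2))


identical = pearson_chi2([0.5, 0.5], [0.5, 0.5])
shifted = pearson_chi2([0.75, 0.25], [0.5, 0.5])
```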
IV-D Soft Constraint Incorporation
In order to make generated samples compliant with the feasibility constraints of vehicle kinematics, we propose to incorporate a differentiable barrier (indicator) function into the loss function, which enables soft constraints in deep neural networks by pushing predicted trajectories toward the feasible regions. In this work, we denote the empirical upper bounds on the absolute values of accelerations and path curvatures as and , respectively. Then the feasibility loss can be calculated as
(13) 
where sgn() refers to the sign function and , can be calculated with the predicted waypoints. This loss term is not applied to human trajectory prediction.
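One way such a penalty could be computed from predicted waypoints is sketched below. The barrier form (excess over the bound, gated by a sign function so that it vanishes inside the feasible region), the use of finite differences, and the use of the acceleration vector norm are assumptions made for illustration; the names `a_max` and `kappa_max` are hypothetical stand-ins for the two empirical upper bounds.

```python
import numpy as np


def soft_barrier(values: np.ndarray, bound: float) -> np.ndarray:
    """(|v| - bound) * (sgn(|v| - bound) + 1) / 2: zero within the bound, linear beyond it."""
    excess = np.abs(values) - bound
    return excess * (np.sign(excess) + 1.0) / 2.0


def accelerations_and_curvatures(waypoints: np.ndarray, dt: float):
    """Finite-difference acceleration magnitudes and signed path curvatures.

    waypoints has shape (T, 2) with T >= 3; both outputs have shape (T - 2,).
    """
    velocity = np.diff(waypoints, axis=0) / dt          # (T - 1, 2)
    acceleration = np.diff(velocity, axis=0) / dt       # (T - 2, 2)
    v = velocity[1:]                                    # aligned with acceleration
    speed = np.linalg.norm(v, axis=1)
    cross = v[:, 0] * acceleration[:, 1] - v[:, 1] * acceleration[:, 0]
    curvature = cross / np.maximum(speed ** 3, 1e-9)    # guard against zero speed
    return np.linalg.norm(acceleration, axis=1), curvature


def feasibility_loss(waypoints: np.ndarray, dt: float, a_max: float, kappa_max: float) -> float:
    """Sum of barrier penalties on acceleration and curvature violations."""
    acc, curvature = accelerations_and_curvatures(waypoints, dt)
    return float(np.sum(soft_barrier(acc, a_max)) + np.sum(soft_barrier(curvature, kappa_max)))


# Constant-velocity straight line: no violation.
cruise = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
# Straight line with acceleration 1.0 per step squared, above a bound of 0.5.
speeding_up = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 0.0]])
```

Because the penalty is zero inside the bounds, it leaves feasible samples untouched and only produces gradients for samples that violate a bound.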
IV-E Conditional Generative Neural System (CGNS)
We leverage both CLSL and VDM in the proposed system, which provide complementary strengths. The objective function of the whole system is formulated as
(14)  
which can be trained end-to-end. In practice, due to the presence of the reconstruction loss, the generator tends to improve faster than the discriminator, which may result in unbalanced training. Therefore, we compensate for this imbalance by training the discriminator multiple times in each iteration.
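The update schedule described above can be sketched as a plain loop in which the discriminator is updated several times for each update of the encoder and generator. The callbacks, the number of discriminator steps, and the weighted-sum form of the combined objective are assumptions for illustration only.

```python
def total_objective(loss_clsl, loss_vdm, loss_feasibility,
                    w_clsl=1.0, w_vdm=1.0, w_feasibility=1.0):
    """Hypothetical weighted combination of the three loss terms."""
    return w_clsl * loss_clsl + w_vdm * loss_vdm + w_feasibility * loss_feasibility


def run_training(num_iterations, discriminator_steps,
                 update_discriminator, update_generator_and_encoder):
    """Run several discriminator updates per generator/encoder update."""
    for _ in range(num_iterations):
        for _ in range(discriminator_steps):
            update_discriminator()
        update_generator_and_encoder()


# Record the order of updates with stand-in callbacks.
calls = []
run_training(3, 2, lambda: calls.append("D"), lambda: calls.append("G"))
```

With `discriminator_steps=2`, the recorded order is two discriminator updates followed by one generator and encoder update, repeated for each iteration.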
V Experiments
In this section, we validate the proposed CGNS on three publicly available benchmark datasets for trajectory prediction and apply it to probabilistic behavior prediction for multiple interactive on-road vehicles in a roundabout scenario. The model performance is compared with several state-of-the-art baselines.
V-A Datasets
ETH [23] and UCY [24]: These datasets include bird's-eye-view videos and image annotations of pedestrians in various outdoor and indoor scenarios. The trajectories were extracted in the world space.
Stanford Drone Dataset (SDD) [25]: The dataset also contains a set of bird's-eye-view videos and the corresponding trajectories of involved entities, which were collected in multiple scenarios within a university campus full of pedestrians, bikers and vehicles. The trajectories were extracted in the pixel space instead of the world space.
INTERACTION Dataset (ID) [26, 27]: The raw dataset was collected by a drone with a camera and our testing vehicle equipped with LiDAR. The trajectories were extracted by visual detection. We visualized the real trajectories in our simulator to obtain the bird's-eye-view images, where the static context information came from Google Earth.
V-B Evaluation Metrics and Baselines
We evaluate the model performance in terms of average displacement error (ADE), defined as the average distance between the predicted trajectories and the ground truth over all the involved entities within the prediction horizon, as well as final displacement error (FDE), defined as the distance at the last predicted time step. To allow for fair comparisons with prior works [12, 28, 5], we predicted the future 12 time steps (4.8s) based on the previous 8 time steps (3.2s) for ETH and UCY in the Euclidean space. We used the standard training and testing split for SDD and made predictions in the pixel space. For our own dataset ID, we predicted the future 10 time steps (5s) based on the historical 4 time steps (2s) in the Euclidean space.
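The two metrics defined above can be computed as in the following sketch. It assumes predictions and ground truth are arrays of shape (entities, time steps, 2) and that FDE is averaged over entities; the function name is hypothetical.

```python
import numpy as np


def displacement_errors(predicted: np.ndarray, ground_truth: np.ndarray):
    """Return (ADE, FDE) for arrays of shape (num_entities, num_steps, 2)."""
    distances = np.linalg.norm(predicted - ground_truth, axis=-1)  # (entities, steps)
    ade = float(distances.mean())          # average over entities and time steps
    fde = float(distances[:, -1].mean())   # last predicted step, averaged over entities
    return ade, fde


# Two entities, three future steps; predictions drift by 1, 2 and 3 units along x.
ground_truth = np.zeros((2, 3, 2))
predicted = ground_truth.copy()
predicted[:, :, 0] = [1.0, 2.0, 3.0]
ade, fde = displacement_errors(predicted, ground_truth)
```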
Table I: ADE / FDE of the proposed CGNS and baseline models on the ETH and UCY datasets (Euclidean space).

| Dataset | CVM | LR | PLSTM | SLSTM | SGAN | SGANP | SoPhie | CGNS |
|---|---|---|---|---|---|---|---|---|
| ETH | 1.42 / 2.88 | 1.33 / 2.94 | 1.13 / 2.38 | 1.09 / 2.35 | 0.81 / 1.52 | 0.87 / 1.62 | 0.70 / 1.43 | 0.62 / 1.40 |
| HOTEL | 0.51 / 0.68 | 0.39 / 0.72 | 0.91 / 1.89 | 0.79 / 1.76 | 0.72 / 1.61 | 0.67 / 1.37 | 0.76 / 1.67 | 0.70 / 0.93 |
| UNIV | 0.73 / 1.63 | 0.82 / 1.59 | 0.63 / 1.36 | 0.67 / 1.40 | 0.60 / 1.26 | 0.76 / 1.52 | 0.54 / 1.24 | 0.48 / 1.22 |
| ZARA1 | 0.59 / 1.36 | 0.62 / 1.21 | 0.44 / 0.84 | 0.47 / 1.00 | 0.34 / 0.69 | 0.35 / 0.68 | 0.30 / 0.63 | 0.32 / 0.59 |
| ZARA2 | 0.84 / 1.55 | 0.77 / 1.48 | 0.51 / 1.16 | 0.56 / 1.17 | 0.42 / 0.84 | 0.42 / 0.84 | 0.38 / 0.78 | 0.35 / 0.71 |
| AVG | 0.82 / 1.62 | 0.79 / 1.59 | 0.72 / 1.53 | 0.72 / 1.54 | 0.58 / 1.18 | 0.61 / 1.21 | 0.54 / 1.15 | 0.49 / 0.97 |
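
Table II: ADE / FDE of the proposed CGNS and baseline models on the Stanford Drone Dataset (pixel space).

| Dataset | LR | PLSTM | SLSTM | SGAN | SoPhie | CARNet | DESIRE | CGNS |
|---|---|---|---|---|---|---|---|---|
| SDD | 37.1 / 63.5 | 35.8 / 55.4 | 31.2 / 57.0 | 24.8 / 38.6 | 17.8 / 32.1 | 25.7 / 51.8 | 19.3 / 34.1 | 15.6 / 28.2 |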
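
Table III: ADE / FDE on the INTERACTION dataset at different prediction horizons. Columns CVM through SGAN are baseline models; the remaining four columns are settings of the proposed CGNS.

| Horizon | CVM | LR | PLSTM | SLSTM | SGAN | + CLSL | + VDM | + CLSL + VDM | + + CLSL + VDM |
|---|---|---|---|---|---|---|---|---|---|
| 1.0s | 0.16 / 0.29 | 0.24 / 0.32 | 0.23 / 0.28 | 0.24 / 0.30 | 0.22 / 0.28 | 0.19 / 0.23 | 0.22 / 0.27 | 0.17 / 0.25 | 0.21 / 0.26 |
| 2.0s | 0.59 / 0.78 | 0.58 / 0.92 | 0.47 / 0.60 | 0.45 / 0.57 | 0.42 / 0.58 | 0.34 / 0.42 | 0.38 / 0.45 | 0.38 / 0.44 | 0.35 / 0.40 |
| 3.0s | 1.21 / 1.92 | 1.43 / 2.28 | 0.84 / 1.53 | 0.80 / 1.48 | 0.81 / 1.54 | 0.72 / 1.33 | 0.75 / 1.37 | 0.69 / 1.24 | 0.64 / 1.15 |
| 4.0s | 2.94 / 3.98 | 3.85 / 4.73 | 1.27 / 1.51 | 1.21 / 1.69 | 1.28 / 1.87 | 1.26 / 1.81 | 1.35 / 1.76 | 0.86 / 1.33 | 0.79 / 1.23 |
| 5.0s | 4.28 / 6.12 | 5.89 / 6.91 | 1.78 / 2.21 | 1.69 / 2.77 | 1.65 / 2.68 | 1.85 / 3.20 | 1.72 / 2.89 | 1.54 / 2.37 | 1.47 / 2.12 |
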
We compared the performance of our proposed system with the following baseline approaches on multiple datasets: Constant Velocity Model (CVM), Linear Regression (LR), Probabilistic LSTM (PLSTM), Social LSTM (SLSTM) [5], Social GAN (SGAN and SGANP) [28], Clairvoyant attentive recurrent network (CARNet) [13], SoPhie [12] and DESIRE [19].
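As an illustration of the simplest baseline listed above, a constant velocity model can be sketched as follows. Estimating the velocity from the last two observed positions is an assumption, since the text does not specify how the baseline velocity is obtained; the function name is hypothetical.

```python
import numpy as np


def constant_velocity_predict(history: np.ndarray, num_future: int) -> np.ndarray:
    """Extrapolate the last observed displacement for num_future steps.

    history has shape (T_h, 2) with T_h >= 2; the result has shape (num_future, 2).
    """
    velocity = history[-1] - history[-2]                 # displacement per time step
    steps = np.arange(1, num_future + 1)[:, None]        # column vector 1..num_future
    return history[-1] + steps * velocity


history = np.array([[0.0, 0.0], [1.0, 1.0]])
future = constant_velocity_predict(history, 3)
```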
V-C Implementation Details
Since the whole system consists of differentiable functions approximated by deep neural networks, it can be trained end-to-end efficiently. The detailed model architecture and hyperparameters are introduced below.
In the deep feature extractor, the contains one convolutional layer with kernel size and zero padding to keep the same dimension. The contains three convolutional layers with kernel size and the contains two layers with 64 hidden units. The is the convolutional base of the pretrained VGG19, whose weights are fixed during training. The , , and all have 128 hidden units. The Encoder has three fully connected layers with 256, 128 and 64 hidden units, respectively. The dimension of the encoded latent space is two. The Generator has 128 hidden units. The Discriminator has 128 hidden units and has three layers with 128, 128 and 1 units, respectively. In all the experiments, we set and . The Adam optimizer was employed with a learning rate of 0.002. Moreover, we found that the Gaussian mixture in the context image attention mask does not lead to an obvious improvement in prediction accuracy or diversity over a single Gaussian in this task. Therefore, we utilized the latter to reduce model complexity and report the corresponding results.
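The encoder sizes stated above can be sketched as a small NumPy forward pass to make the tensor shapes explicit. The layer widths, latent dimension and learning rate come from the text; the input feature size, the ReLU activations, the random initialization and the separate mean and log-variance output heads are assumptions for illustration.

```python
import numpy as np

ENCODER_LAYERS = (256, 128, 64)   # hidden units of the three fully connected layers
LATENT_DIM = 2                    # dimension of the encoded latent space
ADAM_LEARNING_RATE = 0.002        # learning rate reported for the Adam optimizer
INPUT_DIM = 128                   # assumption: size of the feature embedding fed to the encoder


def init_encoder(rng: np.random.Generator, input_dim: int = INPUT_DIM) -> dict:
    """Randomly initialise the hidden layers and two linear output heads."""
    sizes = (input_dim,) + ENCODER_LAYERS
    hidden = [
        (rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in), np.zeros(fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]

    def head():
        return rng.standard_normal((sizes[-1], LATENT_DIM)) * 0.01, np.zeros(LATENT_DIM)

    return {"hidden": hidden, "mu": head(), "log_var": head()}


def encode(params: dict, features: np.ndarray):
    """Map a batch of features to the mean and log-variance of the latent Gaussian."""
    h = features
    for weight, bias in params["hidden"]:
        h = np.maximum(h @ weight + bias, 0.0)   # ReLU activation (assumed)
    mu = h @ params["mu"][0] + params["mu"][1]
    log_var = h @ params["log_var"][0] + params["log_var"][1]
    return mu, log_var


params = init_encoder(np.random.default_rng(0))
mu, log_var = encode(params, np.zeros((4, INPUT_DIM)))
```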
V-D Quantitative Analysis
ETH and UCY Dataset: To allow for fair comparisons with multiple baseline approaches which only leverage historical trajectory information, we deactivated the branch of context feature extraction in our system to illustrate its superiority on prediction accuracy based on the same input as prior works. The ADE and FDE of the proposed CGNS and baseline models in Euclidean space are compared in Table I. Some of the reported statistics are adapted from the original papers.
It can be seen that the CVM performs the worst, as expected, since the constant velocity approximation is insufficient for a crowded scenario with highly interactive agents. The LR performs slightly better than the CVM in most scenarios and achieves the smallest error on the HOTEL dataset. A possible reason is that the human trajectories in this dataset tend to be straighter and smoother, which gives linear fitting methods an advantage. The PLSTM and SLSTM provide an improvement, with similar accuracy to each other, owing to the use of recurrent neural networks. The SGAN and SoPhie achieve further progress thanks to the implicit generative modeling of the trajectory distribution. Our approach makes a further step forward in prediction accuracy, which suggests the effectiveness of latent space learning.
Stanford Drone Dataset: We also compared the ADE and FDE of the CGNS and baseline models in pixel space, which is shown in Table II. Similarly, the linear method LR performs the worst and the ordinary PLSTM and SLSTM give a slightly better accuracy. The CARNet makes a step forward by utilizing a physical attention module. The SGAN and DESIRE provide better results than the above baselines since they solve the task from a probabilistic perspective by learning implicit data distribution and latent space representations, respectively. Our approach achieves the best performance in terms of prediction error, which implies the significance and necessity of leveraging both context and trajectory information. The combination of CLSL and VDM also contributes to the enhancement.
INTERACTION Dataset: We finally compared the model performance on our roundabout driving dataset in Table III. SoPhie, CARNet and DESIRE are not included since their code is not publicly available. The linear models CVM and LR perform similarly to the more advanced learning-based models for short-term prediction, since the velocity and yaw angle of vehicles cannot vary much in a short period due to kinematic feasibility constraints. However, as the prediction horizon increases, their performance deteriorates much faster. A potential reason is that, because of the curved roads within the roundabout area, the vehicles tend to advance along curved paths to avoid collisions, which cannot be captured by linear approximations. The PLSTM and SLSTM provide similar results, which suggests that the social pooling mechanism has little effect on feature extraction in this scenario. Our CGNS achieves a smaller prediction error than the baseline models in most cases, especially for long-term prediction.
V-E Qualitative Analysis
We provide a qualitative analysis of the prediction results on our INTERACTION dataset. To illustrate the effectiveness of the attention module, we visualize the context image masks and trajectory block masks of several typical testing cases in Fig. 3. Detailed analysis can be found in the caption. The distribution of generated future trajectories is approximated by kernel density estimation, which is visualized in Fig. 4. We can see that the system can generate smooth, feasible and realistic vehicle trajectories, which evolve along the road curves. The ground truth is located in the densest part of the distribution in most cases. In general, our proposed CGNS can achieve better generation performance in terms of realism and diversity.
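The kernel density estimation step mentioned above can be sketched as follows for 2D positions sampled at one future time step. The isotropic Gaussian kernel, the fixed bandwidth and the function name are assumptions; the text does not specify the kernel or bandwidth used for Fig. 4.

```python
import numpy as np


def gaussian_kde_2d(samples: np.ndarray, queries: np.ndarray, bandwidth: float) -> np.ndarray:
    """Evaluate an isotropic Gaussian kernel density estimate.

    samples has shape (S, 2), queries has shape (Q, 2); the result has shape (Q,).
    """
    diff = queries[:, None, :] - samples[None, :, :]          # (Q, S, 2)
    squared_dist = np.sum(diff ** 2, axis=-1)                 # (Q, S)
    kernel = np.exp(-squared_dist / (2.0 * bandwidth ** 2)) / (2.0 * np.pi * bandwidth ** 2)
    return kernel.mean(axis=1)


# Sampled end points clustered near the origin, plus two query locations.
rng = np.random.default_rng(0)
end_points = 0.1 * rng.standard_normal((200, 2))
density = gaussian_kde_2d(end_points, np.array([[0.0, 0.0], [5.0, 5.0]]), bandwidth=0.5)

# With a single sample, the density at that sample equals the kernel peak 1 / (2 * pi * h^2).
peak = gaussian_kde_2d(np.zeros((1, 2)), np.zeros((1, 2)), bandwidth=1.0)
```

Evaluating the density at the ground-truth position would give a simple check of whether it falls in a high-density region of the sampled hypotheses.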
V-F Ablative Analysis
We conduct an ablative analysis on the INTERACTION dataset to demonstrate the relative significance of each component in the proposed CGNS. The ADE and FDE of each model setting are shown in Table III. We notice that + CLSL and + VDM achieve similar performance in terms of prediction error, while + CLSL + VDM provides a notable improvement. Moreover, the complete system + + CLSL + VDM does not lead to an obvious improvement over the three partial systems for short-term prediction, while its superiority becomes more remarkable as the forecasting horizon increases. This is reasonable since the static context has little effect on driver behaviors in a short period. More specifically, since the trajectory segment within a short period can be approximated by a linear segment, learning the road curvature from context images does not provide much assistance for prediction. As the forecasting horizon increases, however, the restriction of road geometry on vehicle motions can no longer be ignored, which results in a larger performance gain from leveraging context information.
VI Conclusions
In this paper, we propose a conditional generative neural system for long-term trajectory prediction, which takes into account both static context information through images and dynamic evolution of traffic situations through trajectories of interactive agents. We also incorporate attention mechanisms to identify the most critical portions for predicting motions of a certain entity. The system combines the strengths of both latent space learning and variational divergence minimization to approximate the data distribution, from which realistic and diverse trajectory hypotheses can be sampled. The proposed system is validated on various benchmark datasets as well as a roundabout driving dataset collected by ourselves. The results show that our system can achieve better performance than various baseline models on most datasets in terms of prediction accuracy.
References
 [1] J. Li, H. Ma, W. Zhan, and M. Tomizuka, “Generic probabilistic interactive situation recognition and prediction: From virtual to real,” in 2018 IEEE Intelligent Transportation Systems Conference. IEEE, 2018.
 [2] W. Zhan, L. Sun, Y. Hu, J. Li, and M. Tomizuka, “Towards a fatality-aware benchmark of probabilistic reaction prediction in highly interactive driving scenarios,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 3274–3280.
 [3] J. Li, W. Zhan, and M. Tomizuka, “Generic vehicle tracking framework capable of handling occlusions based on modified mixture particle filter,” in Proceedings of 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 936–942.
 [4] J. Li, W. Zhan, Y. Hu, and M. Tomizuka, “Generic tracking and probabilistic prediction framework and its application in autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, to appear, 2019.
 [5] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
 [6] B. Kim, C. M. Kang, S. H. Lee, H. Chae, J. Kim, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” arXiv preprint arXiv:1704.07049, 2017.
 [7] J. Li, H. Ma, W. Zhan, and M. Tomizuka, “Coordination and trajectory prediction for vehicle interactions via bayesian generative modeling,” in 2019 Intelligent Vehicles Symposium (IV). IEEE, 2019.
 [8] H. Ma, J. Li, W. Zhan, and M. Tomizuka, “Wasserstein generative learning with kinematic constraints for probabilistic interactive driving behavior prediction,” in 2019 Intelligent Vehicles Symposium (IV). IEEE, 2019.
 [9] J. Li, H. Ma, and M. Tomizuka, “Interaction-aware multi-agent tracking and probabilistic behavior prediction via adversarial learning,” in 2019 IEEE International Conference on Robotics and Automation. IEEE, 2019.
 [10] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057.
 [11] A. Kosiorek, A. Bewley, and I. Posner, “Hierarchical attentive recurrent tracking,” in Advances in Neural Information Processing Systems, 2017, pp. 3053–3061.
 [12] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, and S. Savarese, “Sophie: An attentive gan for predicting paths compliant to social and physical constraints,” arXiv preprint arXiv:1806.01482, 2018.
 [13] A. Sadeghian, F. Legros, M. Voisin, R. Vesel, A. Alahi, and S. Savarese, “Carnet: Clairvoyant attentive recurrent network,” in European Conference on Computer Vision (ECCV), 2018.
 [14] A. Vemula, K. Muelling, and J. Oh, “Social attention: Modeling attention in human crowds,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–7.
 [15] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, “Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning,” arXiv preprint arXiv:1809.08835, 2018.
 [16] I. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
 [17] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
 [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
 [19] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “Desire: Distant future prediction in dynamic scenes with interacting agents,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 336–345.
 [20] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
 [21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [22] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
 [23] S. Pellegrini, A. Ess, and L. Van Gool, “Improving data association by joint modeling of pedestrian trajectories and groupings,” in European conference on computer vision. Springer, 2010, pp. 452–465.
 [24] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese, “Learning an image-based motion context for multiple people tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3542–3549.
 [25] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Learning social etiquette: Human trajectory understanding in crowded scenes,” in European conference on computer vision. Springer, 2016, pp. 549–565.
 [26] W. Zhan, L. Sun, D. Wang, Y. Jin, and M. Tomizuka, “Constructing a Highly Interactive Vehicle Motion Dataset,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.
 [27] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kümmerle, H. Königshof, C. Stiller, A. de La Fortelle, and M. Tomizuka, “INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Scenarios with Semantic Maps,” 2019.
 [28] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social gan: Socially acceptable trajectories with generative adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.