Multiple Futures Prediction
Abstract
Temporal prediction is critical for making intelligent and robust decisions in complex dynamic environments. Motion prediction needs to model the inherently uncertain future, which often contains multiple potential outcomes due to multi-agent interactions and the latent goals of others. Towards these goals, we introduce a probabilistic framework that efficiently learns latent variables to jointly model the multi-step future motions of agents in a scene. Our framework is data-driven and learns semantically meaningful latent variables to represent the multimodal future, without requiring explicit labels. Using a dynamic attention-based state encoder, we learn to encode the past as well as the future interactions among agents, efficiently scaling to any number of agents. Finally, our model can be used for planning via computing a conditional probability density over the trajectories of other agents given a hypothetical rollout of the ‘self’ agent. We demonstrate our algorithms by predicting vehicle trajectories on both simulated and real data, achieving state-of-the-art results on several vehicle trajectory datasets.
1 Introduction
The ability to make good predictions lies at the heart of robust and safe decision making. It is especially critical to be able to predict the future motions of all relevant agents in complex and dynamic environments. For example, in the autonomous driving domain, motion prediction is central both to the ability to make high level decisions, such as when to perform maneuvers, as well as to low level path planning optimizations Thrun et al. (2006); Paden et al. (2016).
Motion prediction is a challenging problem due to the various needs of a good predictive model. The varying objectives, goals, and behavioral characteristics of different agents can lead to multiple possible futures or modes. Agents’ states do not evolve independently from one another, but rather they interact with each other. As an illustration, we provide some examples in Fig. 1. In Fig. 1(a), there are a few different possible futures for the blue vehicle approaching an intersection. It can either turn left, go straight, or turn right, forming different modes in trajectory space. In Fig. 1(b), interactions between the two vehicles during a merge scenario show that their trajectories influence each other, depending on who yields to whom. Besides multimodal interactions, prediction needs to scale efficiently with an arbitrary number of agents in a scene and take into account auxiliary and contextual information, such as map and road information. Additionally, the ability to measure uncertainty by computing probability over likely future trajectories of all agents in closed form (as opposed to Monte Carlo sampling) is of practical importance.
Despite a large body of work in temporal motion prediction Lee et al. (2017); Casas et al. (2018); Cui et al. (2018); Ma et al. (2018); Deo and Trivedi (2018); Bansal et al. (2018); Rhinehart et al. (2019); Chai et al. (2019); Zhao et al. (2019), existing state-of-the-art methods often only capture a subset of the aforementioned features. For example, algorithms are either deterministic, not multimodal, or do not fully capture both past and future interactions. Multimodal techniques often require the explicit labeling of modes prior to training. Models which perform joint prediction often assume the number of agents present to be fixed Watters et al. (2017); Sun et al. (2019).
We tackle these challenges by proposing a unifying framework that captures all of the desirable features mentioned earlier. Our framework, which we call Multiple Futures Predictor (MFP), is a sequential probabilistic latent variable generative model that learns directly from multi-agent trajectory data. Training maximizes a variational lower bound on the log-likelihood of the data. MFP learns to model multimodal interactive futures jointly for all agents, while using a novel factorization technique to remain scalable to an arbitrary number of agents. After training, MFP can compute both (un)conditional trajectory probabilities in closed form, not requiring any Monte Carlo sampling.
MFP builds on the Seq2seq encoder-decoder framework Sutskever et al. (2014) by introducing latent variables and using a set of parallel RNNs (with shared weights) to represent the set of agents in a scene. Each RNN takes on the point-of-view of its agent and aggregates historical information for sequential temporal prediction for that agent. Discrete latent variables, one per RNN, automatically learn semantically meaningful modes to capture multimodality without explicit labeling. MFP can be further efficiently and jointly trained end-to-end for all agents in the scene. To summarize, we make the following contributions: First, semantically meaningful latent variables are automatically learned from trajectory data without labels. This addresses the multimodality problem. Second, interactive and parallel step-wise rollouts are performed for all agents in the scene. This addresses the modeling of interactions between actors during future prediction, see Sec. 3.1. We further propose a dynamic attentional encoding which captures both the relationships between agents and the scene context, see Sec. 3.1. Finally, MFP is capable of performing hypothetical inference: evaluating the conditional probability of agents’ trajectories conditioned on fixing one or more agents’ trajectories, see Sec. 3.2.
2 Related Work
The problem of predicting future motion for dynamic agents has been well studied in the literature. The bulk of classical methods focus on using physics based dynamic or kinematic models Welch et al. (1995); Haarnoja et al. (2016); Lefèvre et al. (2014). These approaches include Kalman filters and maneuver based methods, which compute the future motion of agents by propagating their current state forward in time. While these methods perform well for short time horizons, longer horizons suffer due to the lack of interaction and context modeling.
The success of machine learning and deep learning ushered in a variety of data-driven recurrent neural network (RNN) based methods Lee et al. (2017); Casas et al. (2018); Cui et al. (2018); Ma et al. (2018); Deo and Trivedi (2018); Bansal et al. (2018). These models often combine RNN variants, such as LSTMs or GRUs, with encoder-decoder architectures such as conditional variational autoencoders (CVAEs). These methods eschew physics-based dynamic models in favor of learning generic sequential predictors (e.g. RNNs) directly from data. Converting raw input data to input features can also be learned, often by encoding rasterized inputs using CNNs Casas et al. (2018); Cui et al. (2018).
Methods that can learn multiple future modes have been proposed in Deo and Trivedi (2018); Lee et al. (2017); Cui et al. (2018). However, Deo and Trivedi (2018) explicitly label six maneuvers/modes and learn to separately classify these modes. Lee et al. (2017); Cui et al. (2018) do not require mode labeling, but they also do not train in an end-to-end fashion by maximizing the data log-likelihood of the model. Most of the methods in the literature encode the past interactions of agents in a scene; however, prediction is often a rollout of a decoder RNN that is independent of the other agents' predicted future trajectories Deo and Trivedi (2018); Park et al. (2018). Encoding of spatial relationships is often done by placing other agents in a fixed and spatially discretized grid Deo and Trivedi (2018); Lee et al. (2017).
In contrast, MFP proposes a unifying framework which exhibits the aforementioned features. To summarize, we present a feature comparison of MFP with some of the recent methods in the supplementary materials.
3 Multiple Futures Prediction
We tackle motion prediction by formulating a probabilistic framework for a continuous-space but discrete-time system with a finite (but variable) number of interacting agents. We represent the joint state of all $N$ agents at time $t$ as $\mathbf{X}_t \in \mathbb{R}^{N \times d} \doteq \{x_t^1, x_t^2, \dots, x_t^N\}$, where $d$ is the dimensionality of each state (footnote 1: we assume states are fully observable and are agents’ coordinates on the ground plane, $d = 2$), and $x_t^n \in \mathbb{R}^d$ is the state of the $n$-th agent at time $t$. With a slight abuse of notation, we use the superscripted $\mathbf{X}^n \doteq \{x_{t-\tau}^n, x_{t-\tau+1}^n, \dots, x_t^n\}$ to denote the past states of the $n$-th agent and $\mathbf{X} \doteq \mathbf{X}_{t-\tau:t}^{1:N}$ to denote the joint agent states from time $t-\tau$ to $t$, where $\tau$ is the number of past history steps. The future state at time $\delta$ of all agents is denoted by $\mathbf{Y}_\delta \doteq \{y_\delta^1, y_\delta^2, \dots, y_\delta^N\}$, and the future trajectory of agent $n$, from time $t$ to time $T$, is denoted by $\mathbf{Y}^n \doteq \{y_t^n, y_{t+1}^n, \dots, y_T^n\}$. $\mathbf{Y} \doteq \mathbf{Y}_{t:T}^{1:N}$ denotes the joint state of all agents for the future timesteps. Contextual scene information, e.g. a rasterized image of the map, could be useful by providing important cues. We use $\mathcal{I}_t$ to represent any contextual information at time $t$.
The goal of motion prediction is then to accurately model $p(\mathbf{Y} \mid \mathbf{X}, \mathcal{I}_t)$. As in most sequential modelling tasks, it is both inefficient and intractable to model $p(\mathbf{Y} \mid \mathbf{X}, \mathcal{I}_t)$ jointly. RNNs are typically employed to sequentially model the distribution in a cascade form. However, there are two major challenges specific to our multi-agent prediction framework: (1) Multimodality: optimizing vanilla RNNs via backpropagation through time will lead to mode-averaging, since the mapping from $\mathbf{X}$ to $\mathbf{Y}$ is not a function, but rather a one-to-many mapping. In other words, multimodality means that for a given $\mathbf{X}$, there could be multiple distinctive modes that result in significant probability mass over different sequences of $\mathbf{Y}$. (2) Variable agents: the number of agents $N$ is variable and unknown, and therefore we cannot simply vectorize $\mathbf{X}_t$ as the input to a standard RNN at time $t$.


For multimodality, we introduce a set of stochastic latent variables $z^n$, one per agent, where $z^n$ can take on $K$ discrete values. The intuition here is that $z^n$ would learn to represent intentions (left/right/straight) and/or behavior modes (aggressive/conservative). Learning maximizes the marginalized distribution, where $z^n$ is free to learn any latent behavior so long as it helps to improve the data log-likelihood. Each $z^n$ is conditioned on $\mathbf{X}$ at the current time (before future prediction) and will influence the distribution over future states $\mathbf{Y}$. A key feature of the MFP is that $z^n$ is only sampled once at time $t$, and must be consistent for the next $T$ time steps. Compared to sampling $z^n$ at every timestep, this leads to tractability and more realistic intention/goal modeling, as we will discuss in more detail later. We now arrive at the following distribution:
$$p(\mathbf{Y} \mid \mathbf{X}) = \sum_{Z} p(\mathbf{Y}, Z \mid \mathbf{X}) = \sum_{Z} p(\mathbf{Y} \mid Z, \mathbf{X})\, p(Z \mid \mathbf{X}) \tag{1}$$
where $Z$ denotes the joint latent variables of all agents. Naïvely optimizing for Eq. 1 is prohibitively expensive and not scalable as the number of agents and timesteps may become large. In addition, the maximum number of possible modes is exponential: $O(K^N)$. We first make the model more tractable by factorizing across time, followed by factorization across agents. The joint future distribution assumes the form of a product of conditional distributions:
$$p(\mathbf{Y} \mid Z, \mathbf{X}) = \prod_{\delta=t}^{T} p(\mathbf{Y}_\delta \mid \mathbf{Y}_{t:\delta-1}, Z, \mathbf{X}) \tag{2}$$
$$p(\mathbf{Y}_\delta \mid \mathbf{Y}_{t:\delta-1}, Z, \mathbf{X}) := \prod_{n=1}^{N} p(y_\delta^n \mid \mathbf{Y}_{t:\delta-1}, z^n, \mathbf{X}) \tag{3}$$
The second factorization is sensible as the factorial component is conditioning on the joint states of all agents in the immediate previous timestep, where the typical temporal delta is very short (e.g. ms). Also note that the future distribution of the $n$-th agent is explicitly dependent on its own mode $z^n$ but implicitly dependent on the latent modes of other agents by re-encoding the other agents’ predicted states (please see discussion later and also Sec. 3.1). Explicitly conditioning only on an agent’s own latent modes is both more scalable computationally and more realistic: agents in the real world can only infer other agents’ latent goals/intentions via observing their states. Finally, our overall objective from Eq. 1 can be written as:
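$$\log p(\mathbf{Y} \mid \mathbf{X}) = \log \sum_{Z} p(Z \mid \mathbf{X}) \prod_{\delta=t}^{T} \prod_{n=1}^{N} p(y_\delta^n \mid \mathbf{Y}_{t:\delta-1}, z^n, \mathbf{X}) \tag{4}$$
$$= \log \sum_{z^1} \cdots \sum_{z^N} \prod_{n=1}^{N} \Big[ p(z^n \mid \mathbf{X}) \prod_{\delta=t}^{T} p(y_\delta^n \mid \mathbf{Y}_{t:\delta-1}, z^n, \mathbf{X}) \Big] \tag{5}$$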
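Because each agent carries its own discrete mode, and the previous joint states are taken from the data during training, the likelihood of one agent's trajectory is a mixture over that agent's modes and can be evaluated exactly. The following is a minimal sketch of that computation in plain Python, not the authors' implementation. The function names and the nested-list layout of the per-step log-densities are assumptions made for illustration, as is the assumption that the prior over modes factorizes across agents.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def agent_log_likelihood(step_log_probs, log_prior):
    """Marginal log-likelihood of one agent's future trajectory.

    step_log_probs[k][j]: log-density of the observed state at future step j
        under mode k (hypothetical layout, one list per mode).
    log_prior[k]: log-probability of mode k given the past.
    """
    # For each mode: log prior + sum over time of per-step log-densities.
    per_mode = [lp + sum(steps) for lp, steps in zip(log_prior, step_log_probs)]
    # Marginalize the discrete mode exactly with a sum over its values.
    return logsumexp(per_mode)

def scene_log_likelihood(agents):
    """Sum of per-agent marginals; agents is a list of (step_log_probs, log_prior)."""
    return sum(agent_log_likelihood(s, p) for s, p in agents)
```

The cost per agent is linear in the number of modes, so no sampling is needed to score a trajectory under this sketch.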
The graphical model of the MFP is illustrated in Fig. 1(a). While we show only three agents for simplicity, MFP can easily scale to any number of agents. Nonlinear interactions among agents make $p(y_\delta^n \mid \mathbf{Y}_{t:\delta-1}, z^n, \mathbf{X})$ complicated to model. Recurrent neural networks are powerful and flexible models that can efficiently capture and represent long-term dependencies in sequential data. At a high level, RNNs introduce deterministic hidden units $h_t$ at every timestep $t$, which act as features or embeddings that summarize all of the observations up until time $t$. At time step $t$, an RNN takes as its input the observation $x_t$ and the previous hidden representation $h_{t-1}$, and computes the update: $h_t = f_{rnn}(x_t, h_{t-1})$. The prediction $y_t$ is computed from the decoding layer of the RNN: $y_t = f_{dec}(h_t)$. $f_{rnn}$ and $f_{dec}$ are recursively applied at every timestep of the sequence.
Fig. 1(b) shows the computation graph of the MFP. A point-of-view (PoV) transformation is first used to transform the past states to each agent’s own reference frame by translation and rotation such that a coordinate axis aligns with the agent’s heading. We then instantiate an encoding and a decoding RNN per agent (footnote 2: we use GRUs Chung et al. (2014); LSTMs and GRUs perform similarly, but GRUs were slightly faster computationally). Each encoding RNN is responsible for encoding the past observations into a feature vector. Scene context is transformed via a convolutional neural network into its own feature. The features are combined via a dynamic attention encoder, detailed in Sec. 3.1, to provide inputs both to the latent variables as well as to the ensuing decoding RNNs. During predictive rollouts, each decoding RNN predicts its own agent’s state at every timestep. The predictions are aggregated and subsequently transformed via the state encoding of Sec. 3.1, providing inputs to every agent/RNN for the next timestep. Latent variables provide extra inputs to the decoding RNNs to enable multimodality. Finally, the output consists of a 5-dim vector governing a Bivariate Normal distribution: $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and correlation coefficient $\rho$.
While we instantiate two RNNs per agent, these RNNs share the same parameters across agents, which means we can efficiently perform joint predictions by combining inputs in a minibatch, allowing us to scale to an arbitrary number of agents. Making $z^n$ discrete and having only one set of latent variables influencing subsequent predictions is also a deliberate choice. We would like to model modes generated due to high-level intentions such as left/right lane changes or conservative/aggressive modes of agent behavior. These latent behavior modes also tend to stay consistent over the time horizon which is typical of motion prediction (e.g. 5 seconds).
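The two ends of this computation graph, the point-of-view transformation at the input and the Bivariate Normal at the output, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the choice of aligning the +x axis with the heading is an assumption, and the density is the textbook bivariate Normal with two means, two standard deviations, and a correlation coefficient.

```python
import math

def to_point_of_view(points, ego_xy, ego_heading):
    """Express world-frame (x, y) points in one agent's own frame.

    Translates so the agent sits at the origin, then rotates by -heading
    so that the agent's heading direction maps onto the +x axis (assumed).
    """
    cos_h, sin_h = math.cos(ego_heading), math.sin(ego_heading)
    out = []
    for x, y in points:
        dx, dy = x - ego_xy[0], y - ego_xy[1]
        out.append((cos_h * dx + sin_h * dy, -sin_h * dx + cos_h * dy))
    return out

def bivariate_normal_logpdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Log-density of a bivariate Normal given the five output parameters."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / one_minus_rho2
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y) + 0.5 * math.log(one_minus_rho2)
    return -log_norm - 0.5 * quad
```

In practice a network would need to constrain its raw outputs so that the standard deviations are positive and the correlation lies strictly between -1 and 1; how that is done is not specified here.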
Learning
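Given a set of training trajectory data $\mathcal{D} \doteq \{(\mathbf{X}, \mathbf{Y})_i\}_{i=1}^{|\mathcal{D}|}$, we use maximum likelihood estimation (MLE) to estimate the parameters $\theta^*$ that achieve the maximum marginal data log-likelihood (footnote 3: we have omitted the dependence on context $\mathcal{I}$ for clarity; the R.H.S. is derived from the common log-derivative trick):
$$\theta^* = \arg\max_\theta \sum_{(\mathbf{X}, \mathbf{Y}) \in \mathcal{D}} \log p(\mathbf{Y} \mid \mathbf{X}; \theta), \qquad \frac{\partial}{\partial \theta} \log p(\mathbf{Y} \mid \mathbf{X}; \theta) = \sum_{Z} p(Z \mid \mathbf{Y}, \mathbf{X}; \theta)\, \frac{\partial}{\partial \theta} \log p(\mathbf{Y}, Z \mid \mathbf{X}; \theta) \tag{6}$$
Optimizing for Eq. 6 directly is non-trivial as the posterior distribution is not only hard to compute, but also varies with $\theta$. We can however decompose the log-likelihood into the sum of the evidence lower bound (ELBO) and the KL-divergence between the true posterior and an approximating posterior $Q(Z)$ Neal and Hinton (1998):
$$\log p(\mathbf{Y} \mid \mathbf{X}; \theta) = \sum_{Z} Q(Z) \log \frac{p(\mathbf{Y}, Z \mid \mathbf{X}; \theta)}{Q(Z)} + D_{KL}(Q \,\|\, P) \;\geq\; \sum_{Z} Q(Z) \log p(\mathbf{Y}, Z \mid \mathbf{X}; \theta) + H(Q) \tag{7}$$
where Jensen’s inequality is used to arrive at the lower bound, $H$ is the entropy function, and $D_{KL}(Q \,\|\, P)$ is the KL-divergence between the approximating posterior and the true posterior. We learn by maximizing the variational lower bound on the data log-likelihood by first using the true posterior (footnote 4: the ELBO is the tightest when the KL-divergence is zero and $Q$ is the true posterior) at the current $\theta'$ as the approximating posterior: $Q(Z) \doteq p(Z \mid \mathbf{Y}, \mathbf{X}; \theta')$. We can then fix the approximate posterior and optimize the model parameters for the following function:
$$\mathcal{L}(\theta, \theta') = \sum_{Z} p(Z \mid \mathbf{Y}, \mathbf{X}; \theta') \log p(\mathbf{Y}, Z \mid \mathbf{X}; \theta) = \sum_{Z} p(Z \mid \mathbf{Y}, \mathbf{X}; \theta') \big\{ \log p(\mathbf{Y} \mid Z, \mathbf{X}; \theta_r) + \log p(Z \mid \mathbf{X}; \theta_z) \big\} \tag{8}$$
where $\theta = \{\theta_r, \theta_z\}$ denote the parameters of the RNNs and the parameters of the network layers for predicting $Z$, respectively. As our latent variables are discrete and have small cardinality (e.g. < 10), we can compute the posterior exactly for a given $\theta'$. The RNN parameter gradients are computed from $\partial \mathcal{L}(\theta, \theta') / \partial \theta_r$, and the gradient for $\theta_z$ is obtained from the KL-divergence $\mathrm{KL}\big(p(Z \mid \mathbf{Y}, \mathbf{X}; \theta') \,\|\, p(Z \mid \mathbf{X}; \theta_z)\big)$.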
Our learning algorithm is a form of the EM algorithm Dempster et al. (1977), where for the M-step we optimize RNN parameters using stochastic gradient descent. By integrating out the latent variable, MFP learns directly from trajectory data, without requiring any annotations or weak supervision for latent modes. We provide a detailed training algorithm pseudocode in the supplementary materials.
Classmates-forcing
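For a single agent with a handful of discrete modes, the two quantities this EM-style procedure needs are small enough to write out directly. The sketch below is an assumption-laden illustration rather than the authors' training code: it computes the exact posterior over modes from log-priors and per-mode trajectory log-likelihoods, and the posterior-weighted objective that the M-step would then ascend with gradient methods. All function names are hypothetical.

```python
import math

def posterior_over_modes(log_prior, log_lik):
    """Exact posterior over K discrete modes (the E-step quantity)."""
    joint = [p + l for p, l in zip(log_prior, log_lik)]
    m = max(joint)
    unnorm = [math.exp(j - m) for j in joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def expected_complete_log_likelihood(q, log_prior, log_lik):
    """Posterior-weighted log joint; q is held fixed while parameters move."""
    return sum(qk * (p + l) for qk, p, l in zip(q, log_prior, log_lik))

def entropy(q):
    """Shannon entropy of a discrete distribution, skipping zero entries."""
    return -sum(qk * math.log(qk) for qk in q if qk > 0.0)
```

When `q` is the exact posterior, the weighted objective plus the entropy of `q` equals the marginal log-likelihood, which is the sense in which the lower bound is tight.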
Teacher forcing is a standard technique (albeit biased) to accelerate RNN and sequence-to-sequence training by using ground truth values $y_t$ as the input to step $t+1$. Even with scheduled sampling Bengio et al. (2015), we found that over-fitting due to exposure bias could be an issue. Interestingly, an alternative is possible in the MFP: at time $t$, for agent $n$, the ground truth observations are used as inputs for all other agents $m \neq n$. However, for agent $n$ itself, we still use its previous predicted state instead of the true observation as its input. We provide empirical comparisons in Table 2.
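The input selection described here can be sketched as a one-line rule. This is only an illustration of the idea, with a hypothetical function name and states represented as plain tuples: when building the joint state seen by one agent's decoder, every other agent contributes its ground-truth state while the agent itself contributes its own previous prediction.

```python
def classmates_forcing_inputs(ground_truth, predicted, ego):
    """Joint state fed to agent `ego`'s decoder at the next step.

    ground_truth[n], predicted[n]: state of agent n at the current step.
    Other agents ("classmates") use observed states; `ego` keeps its prediction.
    """
    return [predicted[n] if n == ego else ground_truth[n]
            for n in range(len(ground_truth))]
```

Plain teacher forcing would instead return `ground_truth` unchanged, and a fully free-running rollout would return `predicted` unchanged.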
Connections to other Stochastic RNNs
Various stochastic recurrent models have been proposed in the existing literature: DRAW Gregor et al. (2015), STORN Bayer and Osendorfer (2014), VRNN Chung et al. (2015), SRNN Fraccaro et al. (2016), Z-forcing Goyal et al. (2017), and Graph-VRNN Sun et al. (2019). Besides the multi-agent modeling capability of the MFP, the key difference between these methods and MFP is that the other methods use continuous stochastic latent variables at every timestep, sampled from a standard Normal prior. The training is performed via pathwise derivatives, or the reparameterization trick. Having multiple continuous stochastic variables means that the posterior cannot be computed in closed form, and Monte Carlo (or lower-variance MCMC) estimators must be used to estimate the ELBO (footnote 5: even with IWAE Burda et al. (2015), 50 samples are needed to obtain a somewhat tight lower bound, making it prohibitively expensive to compute good log-densities for these stochastic RNNs for online applications). This makes it hard to efficiently compute the log-probability of an arbitrary imagined or hypothetical trajectory, which might be useful for planning and decision-making (see Sec. 3.2). In contrast, latent variables in MFP are discrete and can learn semantically meaningful modes (Sec. 4.1). With $K$ modes, it is possible to evaluate the exact log-likelihoods of trajectories in $O(K)$, without resorting to sampling.
3.1 State Encodings
As shown in Fig. 1(b), the input to the RNNs at step $t$ is first transformed via the point-of-view transformation, followed by state encoding, which aggregates the relative positions of other agents with respect to the $n$-th agent (ego agent, or the agent for which the RNN is predicting) and encodes the information into a feature vector. We denote the encoded feature . Here, we propose a dynamic attention-like mechanism where radial basis functions are used for matching and routing relevant agents from the input to the feature encoder, shown in Fig. 3.
Each agent uses a neural network to transform its state (positions, velocity, acceleration, and heading) into a key or descriptor, which is then matched via a radial basis function to a fixed number of “slots” with learned keys in the encoder network. The ego agent (footnote 6: we use ego to refer to the main or ‘self’ agent for whom we are predicting) has a separate slot to send its own state. Slots are aggregated and further transformed by a two-layer encoder network, encoding a state (e.g. dim vector). The entire dynamic encoder can be learned in an end-to-end fashion. The key-matching is similar to dot-product attention Vaswani et al. (2017); however, the use of radial basis functions allows us to learn spatially sensitive and meaningful keys to extract relevant agents. In addition, Softmax normalization in dot-product attention lacks the ability to differentiate between a single close-by agent vs. a far-away agent.
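A small sketch can make the contrast with softmax normalization concrete. The code below is a simplified illustration under stated assumptions, not the encoder used in the paper: keys are given as plain tuples rather than produced by a learned network, the Gaussian radial basis function and its `bandwidth` parameter are assumed choices, and all function names are hypothetical.

```python
import math

def rbf_weight(key, slot_key, bandwidth=1.0):
    """Gaussian radial basis match between an agent key and a slot key."""
    sq = sum((a - b) ** 2 for a, b in zip(key, slot_key))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def route_to_slots(agent_keys, agent_values, slot_keys, bandwidth=1.0):
    """Accumulate agent values into slots, weighted by unnormalized RBF matches."""
    dim = len(agent_values[0]) if agent_values else 0
    slots = []
    for slot_key in slot_keys:
        acc = [0.0] * dim
        for key, value in zip(agent_keys, agent_values):
            w = rbf_weight(key, slot_key, bandwidth)
            acc = [a + w * v for a, v in zip(acc, value)]
        slots.append(acc)
    return slots

def softmax(scores):
    """Standard softmax, shown only for comparison with the RBF weights."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With a single agent in the scene, `softmax` assigns it weight 1 whether its key matches the slot well or poorly, whereas `rbf_weight` shrinks toward 0 as the key moves away from the slot key. Because the slots have a fixed count, the encoded size does not depend on how many agents are present.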
3.2 Hypothetical Rollouts
Planning and decision-making must rely on prediction for what-ifs Howard et al. (2014). It is important to predict how others might react to different hypothetical ego actions (e.g. what if ego were to perform a more aggressive lane change?). Specifically, we are interested in the distribution when conditioning on any hypothetical future trajectory $\mathbf{Y}^n$ of one (or more) agents:
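$$p(\mathbf{Y}^{m \neq n} \mid \mathbf{Y}^n, \mathbf{X}) = \sum_{Z^{m \neq n}} p(Z^{m \neq n} \mid \mathbf{X}) \prod_{\delta=t}^{T} \prod_{m \neq n} p(y_\delta^m \mid \mathbf{Y}_{t:\delta-1}, z^m, \mathbf{X}) \tag{9}$$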
This can be easily computed within MFP by fixing the future states of the conditioning agent on the R.H.S. of Eq. 9 while the states of the other agents are not changed. This is possible because MFP performs interactive future rollouts in a synchronized manner for all agents: the joint predicted states of all agents at one timestep are used as inputs for predicting the states at the next timestep. As a comparison, most other prediction algorithms perform independent rollouts, which makes it impossible to perform hypothetical rollouts as there is a lack of interactions during the future timesteps.
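The mechanics of such a conditioned rollout can be sketched with a toy one-dimensional model. Everything here is hypothetical and chosen only for illustration: `toy_step` stands in for a learned per-agent predictor, states are single floats, and the rollout follows point predictions rather than distributions.

```python
def toy_step(n, joint):
    """Toy per-agent predictor: agent 0 advances by 1, others move to the
    midpoint between agent 0 and agent 1 (a stand-in for a learned model)."""
    if n == 0:
        return joint[0] + 1.0
    return 0.5 * (joint[0] + joint[1])

def hypothetical_rollout(initial_states, step_fn, horizon, fixed=None):
    """Roll all agents forward jointly, clamping the agents listed in `fixed`.

    step_fn(n, joint_states) -> next state of agent n given the last joint states.
    fixed: {agent index: [state at step 1, ..., state at step horizon]}.
    """
    fixed = fixed or {}
    joint = list(initial_states)
    trajectory = []
    for t in range(horizon):
        nxt = [step_fn(n, joint) for n in range(len(joint))]
        for n, plan in fixed.items():
            nxt[n] = plan[t]  # overwrite with the hypothetical state
        trajectory.append(nxt)
        joint = nxt  # the joint states feed every agent's next prediction
    return trajectory
```

Calling `hypothetical_rollout([0.0, 0.0], toy_step, 2, fixed={0: [4.0, 8.0]})` makes agent 1 respond to the faster hypothetical agent 0, which an independent per-agent rollout could not express.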
4 Experimental Results
We demonstrate the effectiveness of MFP in learning interactive multimodal predictions for the driving domain, where each agent is a vehicle. As a proof-of-concept, we first generate simulated trajectory data from the CARLA simulator Dosovitskiy et al. (2017), where we can specify the number of modes and script 2nd-order interactions. We demonstrate that MFP can learn semantically meaningful latent modes that capture all of the modes of the data, without any labeling of the latent modes. We then experiment on a widely known standard dataset of real vehicle trajectories, the NGSIM Colyar and Halkias (2007) dataset. We show that MFP achieves state-of-the-art results on modeling held-out test trajectories. In addition, we also benchmark MFP against previously published results on the more recent large-scale Argoverse motion forecasting dataset Chang et al. (2019). We provide MFP architecture and learning details in the supplementary materials.
4.1 CARLA
CARLA is a realistic, open-source, high-fidelity driving simulator based on the Unreal Engine Dosovitskiy et al. (2017). It currently contains six different towns and dozens of different vehicle assets. The simulation includes both highways and urban settings with traffic light intersections and four-way stops. Simple traffic-law-abiding “autopilot” CPU agents are also available.
We create a scenario at an intersection where one vehicle is approaching the intersection and two other vehicles are moving across horizontally (Fig. 4(a)). The first vehicle (red) has 3 different possibilities which are randomly chosen during data generation. The first mode aggressively speeds up and makes the right turn, cutting in front of the green vehicle. The second mode still makes the right turn, but slows down and yields to the green vehicle. For the third mode, the first vehicle slows to a stop, yielding to both of the other vehicles. The far left vehicle also chooses randomly between going straight or turning right. We report the performance of MFP as a function of the number of modes in Table 1.
The modes learned here are somewhat semantically meaningful. In Fig. 4(c), we can see that even for different vehicles, the same latent variable value learned a consistent, interpretable behavior. One mode (squares) learned to go straight, another (circles) learned to brake/stop, and a third (triangles) represents right turns. Finally, in Table 2, we compare the performance of teacher-forcing vs. the proposed classmates-forcing. In addition, we compare different types of encodings. DynEnc is the encoding proposed in Sec. 3.1. Fixed-encoding uses a fixed ordering, which is not ideal when there is an arbitrary number of agents. We can also look at how well we can perform hypothetical rollouts by conditioning our predictions of other agents on ego’s future trajectories. We report these results in Table 3.
CARLA PRECOG
We next evaluated MFP on a much larger CARLA dataset with published benchmark results. This dataset consists of over 60K training sequences collected from two different towns in CARLA Rhinehart et al. (2019). We trained MFP (with and without LIDAR; with 3, 5, and 7 modes) on the Town01 training set for 300K updates. We report the minMSD metric (in meters) at for 5 agents jointly. We compare our results with state-of-the-art results in Tables 4 and 5. Non-MFP results are reported from Rhinehart et al. (2019); Chai et al. (2019). Table 5 reports results on the Town02 test set.
Table 6: NLL (nats) and RMSE (m) at each prediction horizon.

NLL (nats)
Method                          1 sec.       2 sec.       3 sec.       4 sec.       5 sec.
Cons vel.                       3.72         5.37         6.40         7.16         7.76
CV-GMM Deo et al. (2018)        2.02         3.63         4.62         5.35         5.93
Kuefler et al. (2017)           -            -            -            -            -
MATF Zhao et al. (2019)         -            -            -            -            -
LSTM                            1.17         2.85         3.80         4.48         4.99
S-LSTM Alahi et al. (2016)      1.01         2.49         3.36         4.01         4.54
CS-LSTM (M)                     0.89 (0.58)  2.43 (2.14)  3.30 (3.03)  3.97 (3.68)  4.51 (4.22)
MFP-1                           0.73         2.33         3.17         3.77         4.26
MFP-2                           0.32         1.43         2.45         3.21         3.81
MFP-3                           0.58         1.26         2.32         3.07         3.69
MFP-4                           0.65         1.19         2.28         3.06         3.69
MFP-5                           0.45         1.36         2.42         3.17         3.76

RMSE (m)
Method                          1 sec.       2 sec.       3 sec.       4 sec.       5 sec.
Cons vel.                       0.73         1.78         3.13         4.78         6.68
CV-GMM                          0.66         1.56         2.75         4.24         5.99
Kuefler et al. (2017)           0.69         1.51         2.55         3.65         4.71
MATF                            0.66         1.34         2.08         2.97         4.13
LSTM                            0.68         1.65         2.91         4.46         6.27
S-LSTM                          0.65         1.31         2.16         3.25         4.55
CS-LSTM Deo and Trivedi (2018)  0.61         1.27         2.09         3.10         4.37
MFP-1                           0.54         1.16         1.90         2.78         3.83
MFP-2                           0.55         1.18         1.92         2.80         3.85
MFP-3                           0.54         1.17         1.91         2.78         3.83
MFP-4                           0.54         1.16         1.89         2.75         3.78
MFP-5                           0.55         1.18         1.92         2.78         3.80


4.2 NGSIM
Next Generation Simulation (NGSIM) Colyar and Halkias (2007) is a collection of video-transcribed datasets of vehicle trajectories on US-101, Lankershim Blvd. in Los Angeles, I-80 in Emeryville, CA, and Peachtree St. in Atlanta, Georgia. In total, it contains approximately minutes of vehicle trajectory data at Hz, consisting of diverse interactions among cars, trucks, buses, and motorcycles in congested flow.
We experiment with the US-101 and I-80 datasets, and follow the experimental protocol of Deo and Trivedi (2018), where the datasets are split into training, validation, and testing. We extract seconds trajectories, using the first seconds as history to predict seconds into the future.
In Table 6, we report both negative log-likelihood and RMSE on the test set. RMSE and other measures such as average/final displacement errors (ADE/FDE) are not good metrics for multimodal distributions and are only reported for MFP-1. For multimodal MFPs, we report minRMSE over 5 samples, which uses the ground truth to select the best trajectory and therefore could be overly optimistic. Note that this applies equally to other popular metrics such as minADE, minFDE, and minMSD.
The current state-of-the-art, multimodal CS-LSTM Deo and Trivedi (2018), requires a separate prediction of 6 fixed maneuver modes. As a comparison, MFP achieves significant improvements with fewer modes. Detailed evaluation protocols are provided in the supplementary materials. We also provide qualitative results on the different modes learned by MFP in Fig. 5. In the right panel, we can interpret the green mode as a fairly aggressive lane change, while the purple and red modes are more “cautious”. Ablative studies showing the contributions of both interactive rollouts and dynamic attention encoding are also provided in the supplementary materials. We obtain the best performance with the combination of both interactive rollouts and dynamic attention encoding.
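To see why a minimum-over-samples metric can be optimistic, it helps to write it out. The sketch below uses hypothetical function names and a simple per-trajectory RMSE over 2D positions, which is an assumed convention and may differ from the exact evaluation protocol; the key point is that the ground truth is consulted to pick the best sample.

```python
import math

def rmse(pred, truth):
    """Root-mean-square Euclidean error between two 2D trajectories."""
    sq = [(px - tx) ** 2 + (py - ty) ** 2
          for (px, py), (tx, ty) in zip(pred, truth)]
    return math.sqrt(sum(sq) / len(sq))

def min_rmse(samples, truth):
    """Best RMSE among sampled trajectories, selected using the ground truth."""
    return min(rmse(s, truth) for s in samples)
```

A model that produces one accurate sample among several poor ones still scores well under `min_rmse`, which is why likelihood-based metrics are reported alongside it.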
4.3 Argoverse Motion Forecasting
The Argoverse motion forecasting dataset is a large-scale trajectory prediction dataset with more than curated scenarios Chang et al. (2019). Each sequence is 5 seconds long in total and the task is to predict the next 3 seconds after observing 2 seconds of history.
We performed preliminary experiments by training an MFP with 3 modes for 20K updates and compared it to the existing official baselines in Table 7. MFP hyperparameters were not selected for this dataset, so we expect MFP performance to improve with additional tuning. We report validation set performance on both version 1.0 and version 1.1 of the dataset.
4.4 Planning and Decision Making
The original intuitive motivation for learning a good predictor is to enable robust decision making. We now test this by creating a simple yet nontrivial reinforcement learning (RL) task in the form of an unprotected left turn. Situated in Town05 of the CARLA simulator, the objective is to safely perform an unprotected (no traffic lights) turn, see Fig. 6. Two oncoming vehicles have random initial speeds. Collisions incur a penalty of while success yields . There is also a small reward for higher velocity and the action space is acceleration along the ego agent’s default path (blue).
Using predictions to learn the policy is in the domain of model-based RL Sutton (1990); Weber et al. (2017). Here, MFP can be used in several ways: 1) we can generate imagined future rollouts and add them to the experiences from which temporal difference methods learn Sutton (1990), or 2) we can perform online planning by using a form of the shooting methods Betts (1998), which allows us to optimize over future trajectories. We perform experiments with the latter technique, where we progressively train MFP to predict the joint future trajectories of all three vehicles in the scene. We find the optimal policy by leveraging the current MFP model and optimizing over ego’s future actions. We compare this approach to a couple of strong model-free RL baselines: DDPG and proximal policy gradients. In Fig. 7, we plot the reward vs. the number of environmental steps taken. In Table 4.4, we show that MFP-based planning is more robust to parameter variations in the testing environment.
5 Discussions
In this paper, we proposed a probabilistic latent variable framework that facilitates the joint multi-step temporal prediction of an arbitrary number of agents in a scene. Leveraging the ability to learn latent modes directly from data and interactively rolling out the future with different point-of-view encodings, MFP demonstrated state-of-the-art performance on several vehicle trajectory datasets. For future work, it would be interesting to add a mix of discrete and continuous latent variables as well as train and validate on pedestrian or bicycle trajectory datasets.
Acknowledgements We thank Barry Theobald, Russ Webb, Nitish Srivastava, and the anonymous reviewers for making this a better manuscript. We also thank the authors of Deo and Trivedi (2018) for open sourcing their code and dataset.
References
Alahi et al. (2016). Social LSTM: human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971.
Bansal et al. (2018). ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. CoRR abs/1812.03079.
Bayer and Osendorfer (2014). Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.
Bengio et al. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179.
Betts (1998). Survey of numerical methods for trajectory optimization. Journal of Guidance, Control, and Dynamics 21 (2), pp. 193–207.
Burda et al. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
Casas et al. (2018). IntentNet: learning to predict intention from raw sensor data. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, pp. 947–956.
Chai et al. (2019). MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449.
Chang et al. (2019). Argoverse: 3D tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8748–8757.
Chung et al. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Chung et al. (2015). A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980–2988.
Colyar and Halkias (2007). US highway 101 dataset. FHWA-HRT-07-030. Note: http://www.torcs.org
Cui et al. (2018). Multimodal trajectory predictions for autonomous driving using deep convolutional networks. CoRR abs/1809.10732.
Dempster et al. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39, pp. 1–38.
Deo et al. (2018). How would surround vehicles move? A unified framework for maneuver classification and motion prediction. IEEE Transactions on Intelligent Vehicles 3 (2), pp. 129–140.
Deo and Trivedi (2018). Convolutional social pooling for vehicle trajectory prediction. CoRR abs/1805.06771.
Dosovitskiy et al. (2017). CARLA: an open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16.
Fraccaro et al. (2016). Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pp. 2199–2207.
Goyal et al. (2017). Z-forcing: training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pp. 6713–6723.
Gregor et al. (2015). DRAW: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
Haarnoja et al. (2016). Backprop KF: learning discriminative deterministic state estimators. CoRR abs/1605.07148.
Howard et al. (2014). Model-predictive motion planning: several key developments for autonomous mobile robots. IEEE Robotics & Automation Magazine 21 (1), pp. 64–73.
Kuefler et al. (2017). Imitating driver behavior with generative adversarial networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 204–211.
Lee et al. (2017). DESIRE: distant future prediction in dynamic scenes with interacting agents. pp. 2165–2174.
Lefèvre et al. (2014). A survey on motion prediction and risk assessment for intelligent vehicles. ROBOMECH Journal 1 (1), pp. 1.
Ma et al. (2018). TrafficPredict: trajectory prediction for heterogeneous traffic-agents. CoRR abs/1811.02146.
Neal and Hinton (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pp. 355–368.
Paden et al. (2016). A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on Intelligent Vehicles 1.
Park et al. (2018). Sequence-to-sequence prediction of vehicle trajectory via LSTM encoder-decoder architecture. CoRR abs/1802.06338.
Rhinehart et al. (2019). PRECOG: prediction conditioned on goals in visual multi-agent settings. CoRR abs/1905.01296.
Sun et al. (2019). Stochastic prediction of multi-agent interactions from partial observations. CoRR abs/1902.09641.
Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
Sutton (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pp. 216–224.
Thrun et al. (2006). Stanley: the robot that won the DARPA Grand Challenge. Journal of Field Robotics 23 (9), pp. 661–692.
Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Watters et al. (2017). Visual interaction networks: learning a physics simulator from video. In Advances in Neural Information Processing Systems, pp. 4539–4547.
Weber et al. (2017). Imagination-augmented agents for deep reinforcement learning. CoRR abs/1707.06203.
Welch et al. (1995). An introduction to the Kalman filter.
Zhao et al. (2019). Multi-agent tensor fusion for contextual trajectory prediction. CoRR abs/1904.04776.