DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents
Abstract
We introduce a Deep Stochastic IOC (inverse optimal control, explained further throughout the paper) RNN Encoder-decoder framework, DESIRE, for the task of future predictions of multiple interacting agents in dynamic scenes. DESIRE effectively predicts future locations of objects in multiple scenes by 1) accounting for the multimodal nature of the future prediction (i.e., given the same context, the future may vary), 2) foreseeing the potential future outcomes and making a strategic prediction based on them, and 3) reasoning not only from the past motion history, but also from the scene context as well as the interactions among the agents. DESIRE achieves these in a single end-to-end trainable neural network model, while being computationally efficient. The model first obtains a diverse set of hypothetical future prediction samples employing a conditional variational autoencoder, which are ranked and refined by the following RNN scoring-regression module. Samples are scored by accounting for accumulated future rewards, which enables better long-term strategic decisions similar to IOC frameworks. An RNN scene context fusion module jointly captures past motion histories, the semantic scene context and interactions among multiple agents. A feedback mechanism iterates over the ranking and refinement to further boost the prediction accuracy. We evaluate our model on two publicly available datasets: KITTI and Stanford Drone Dataset. Our experiments show that the proposed model significantly improves the prediction accuracy compared to other baseline methods.
1 Introduction
It is far better to foresee even without certainty than not to foresee at all.
Henri Poincaré (Foundations of Science)
Considering the future as a consequence of a series of past events, a prediction entails reasoning about probable outcomes based on past observations. But predicting the future in many computer vision tasks is inherently riddled with uncertainty (see Fig. 1). Imagine a busy traffic intersection, where such ambiguity is exacerbated by diverse interactions of automobiles, pedestrians and cyclists with each other, as well as with semantic elements such as lanes, crosswalks and traffic lights. Despite tremendous recent interest in future prediction [3, 5, 17, 23, 26, 45, 46], the existing state of the art produces outcomes that are either deterministic or do not fully account for interactions, semantic context or long-term future rewards.
Figure 1: (a) Future prediction example. (b) Workflow of DESIRE.
In contrast, we present DESIRE, a Deep Stochastic IOC RNN Encoder-decoder framework, to overcome those limitations. The key traits of DESIRE are its ability to simultaneously: (a) generate diverse hypotheses to reflect a distribution over plausible futures, (b) reason about interactions between multiple dynamic objects and the scene context, (c) rank and refine hypotheses with consideration of long-term future rewards (see Fig. 1). These objectives are cast within a deep learning framework.
We model the scene as composed of semantic elements (such as roads and crosswalks) and dynamic participants or agents (such as cars and pedestrians). A static or moving observer is also considered as an instance of an agent. We formulate future prediction as determining the locations of agents at various instants in the future, relying solely on observations of the past states of the scene, in the form of agent trajectories and scene context derived from imagebased features or other sensory data if available. The problem is posed in an optimization framework that maximizes the potential future reward of the prediction. Specifically, we propose the following novel mechanisms to realize the above advantages, also illustrated in Fig. 2:

Diverse Sample Generation: Sec. 3.1 presents a conditional variational autoencoder (CVAE) framework [41] to learn a sampling model that, given observations of past trajectories, produces a diverse set of prediction hypotheses to capture the multimodality of the space of plausible futures. The CVAE introduces a latent variable to account for the ambiguity of the future, which is combined with a recurrent neural network (RNN) encoding of past trajectories, to generate hypotheses using another RNN.

IOC-based Ranking and Refinement: In Sec. 3.2, we propose a ranking module that determines the most likely hypotheses, while incorporating scene context and interactions. Since an optimal policy is hard to determine where multiple agents make strategic interdependent choices, the ranking objective is formulated to account for potential future rewards similar to inverse optimal control (IOC). This also ensures generalization to new situations further in the future, given limited training data. The module is trained in a multi-task framework with a regression-based refinement of the predicted samples. In the testing phase, we iterate the above multiple times to obtain more accurate refinements of the future prediction.

Scene Context Fusion: Sec. 3.3 presents the Scene Context Fusion (SCF) layer that aggregates interactions between agents and the scene context encoded by a convolutional neural network (CNN). The fused embedding is channeled to the aforementioned RNN scoring module, allowing it to produce rewards based on the contextual information.
While DESIRE is a general framework that is applicable to any future prediction task, we demonstrate its utility in two applications – traffic scene understanding for autonomous driving and behavior prediction in aerial surveillance. Sec. 4 demonstrates outstanding accuracy for predicting the future locations of traffic participants in the KITTI raw dataset and pedestrians in the Stanford Drone dataset.
To summarize, this paper presents DESIRE, which is a deep-learning-based stochastic framework for time-profiled distant future prediction, with several attractive properties:

Scalability: The use of deep learning rather than hand-crafted features enables end-to-end training and easy incorporation of multiple cues arising from past motions, scene context and interactions between multiple agents.

Diversity: The stochastic output of a deep generative model (CVAE) is combined with an RNN encoding of past observations to generate multiple prediction hypotheses that hallucinate ambiguities and multimodalities inherent in future prediction.

Accuracy: The IOC-based framework accumulates long-term future rewards for sampled trajectories and the regression-based refinement module learns to estimate a deformation of the trajectory, enabling more accurate predictions further into the future.
2 Related Works
Classical methods Path prediction problems have been studied extensively with different approaches, ranging from Kalman filters [18] and linear regressions [29] to non-linear Gaussian Process regression models [49, 33, 34, 48], autoregressive models [2] and time-series analysis [32]. Such predictions suffice for scenarios with few interactions between the agent and the scene or other agents (like a flight monitoring system). In contrast, we propose methods for more complex environments such as surveillance for a crowd of pedestrians or traffic intersections, where the locomotion of individual agents is severely influenced by the scene context (e.g., drivable road or building) and the other agents (e.g., people or cars try to avoid colliding with each other).
IOC for path prediction Kitani et al. recover human preferences (i.e., reward function) to forecast plausible paths for a pedestrian in [23] using inverse optimal control (IOC), or inverse reinforcement learning (IRL) [1, 52], while [26] adapt IOC and propose a dynamic reward function to address changes in environments for sequential path predictions. Combined with a deep neural network, deep IOC/IRL has been proposed to learn non-linear reward functions and has shown promising results in robot control [11] and driving [50] tasks. However, one critical assumption made in IOC frameworks, which makes them hard to apply to general path prediction tasks, is that the goal state or the destination of the agent should be given a priori, whereby feasible paths must be found to the given destination from the planning or control point of view. A few approaches relaxed this assumption with a so-called goal set [28, 10], but these goals are still limited to a target task space. Furthermore, a recovered cost function using IOC is inherently static, thus it is not suitable for time-profiled prediction tasks. Finally, past approaches do not incorporate interaction between agents, which is often a key constraint to the motion of multiple agents. In contrast, our methods are designed for more natural scenarios where agent goals are open-ended, unknown or time-varying and where agents interact with each other while dynamically adapting in anticipation of future behaviors.
Future prediction Walker et al. [47] propose a visual prediction framework with a data-driven unsupervised approach, but only on a static scene, while [5] learn scene-specific motion patterns and apply them to novel scenes for motion prediction as a knowledge transfer. A method for future localization from an egocentric perspective is also addressed successfully in [30]. But unlike our method, none of those can provide time-profiled predictions. Recently, a large dataset was collected in [36] to propose the concept of social sensitivity to improve forecasting models and the multi-target tracking task. However, their social force [14] based model has limited navigation styles represented merely using parameters of distance-based Gaussians.
Interactions When modeling the behavior of an agent, it should also be taken into account that the dynamics of an agent depend not only on its own behavior, but also on the behavior of others. Predicting the dynamics of multiple objects is also studied in [24, 25, 3, 31], to name a few. Recently, a novel pooling layer was presented by [3], where the hidden states of neighboring pedestrians are shared to jointly reason across multiple people. Nonetheless, these models lack predictive capacity as they do not take into account scene context. In [24], a dynamic Bayesian network to capture situational awareness is proposed as a context cue for pedestrian path prediction, but the model is limited to orientations and distances of pedestrians to vehicles and the curbside. A large body of work in reinforcement learning, especially game theoretical generalizations of Markov Decision Processes (MDPs), addresses multi-agent cases such as minimax-Q learning [27] and Nash-Q learning [16]. However, as noted in [38], learning in a multi-agent setting is typically inherently more complex than in a single-agent setting [40, 39, 6].
RNNs for sequence prediction Recurrent neural networks (RNNs) are natural generalizations of feedforward neural networks to sequences [42] and have achieved remarkable results in speech recognition [13], machine translation [4, 42, 7] and image captioning [19, 51, 9]. The power of RNNs for sequence-to-sequence modeling thus makes them a reasonable model of choice to learn to generate sequential future prediction outputs. Our approach is similar to [7] in making use of the encoder-decoder structure to embed a hidden representation for encoding and decoding variable length inputs and outputs. We choose to use gated recurrent units (GRUs) over long short-term memory units (LSTMs) [15] since the former are found to be simpler yet yield no degraded performance [8]. Despite the promise inherent in RNNs, however, only a few works have applied RNNs to behavior prediction tasks. Multiple LSTMs are used in [3] to jointly predict human trajectories, but their model is limited to producing fixed-length trajectories, whereas our model can produce variable-length ones. A Fusion-RNN that combines information from sensory streams to anticipate a driver’s maneuver is proposed in [17], but again their model outputs deterministic and fixed-length predictions.
Deep generative models Our work is also related to deep generative models [37, 35, 44], as we have a sample generation process that is built on a variational autoencoder (VAE) [22] within the framework. Since our prediction model essentially performs posterior-based probabilistic inference where candidate samples are generated based on conditioning variables (i.e., past motions besides latent variables), we naturally extend our method to exploit a conditional variational autoencoder (CVAE) [21, 41] during the sample generation process. Dense trajectories of pixels are predicted from a single image using CVAE in [46], while we focus on predicting long-term behaviors of multiple interacting agents in dynamic scenes.
Unlike our framework, all aforementioned approaches lack either consideration of scene context, modeling of interaction with other agents or capabilities in producing continuous, time-profiled and long-term accurate predictions.
3 Method
We formulate the future prediction problem as an optimization process, where the objective is to learn the posterior distribution of multiple agents’ future trajectories given their past trajectories and sensory input where is the number of agents. The future trajectory of an agent is defined as , and the past trajectory is defined similarly as . Here, each element of a trajectory (e.g., ) is a vector in (or ) representing the coordinates of agent at time , and and refer to the maximum length of time steps for future and past respectively. Since direct optimization of continuous and high dimensional Y is not feasible, we design our method to first sample a diverse set of future predictions and assign a probabilistic score to each of the samples to approximate . In this section, we describe the details of DESIRE (Fig. 2) in the following structure: Sample Generation Module (Sec. 3.1), Ranking and Refinement Module (Sec. 3.2), and Scene Context Fusion (Sec. 3.3).
3.1 Diverse Sample Generation with CVAE
Future prediction can be inherently ambiguous and has uncertainties as multiple plausible scenarios can be explained under the same past situation (e.g., a vehicle heading toward an intersection can make different turns as seen in Fig. 1). Thus, learning a deterministic function that directly maps to will underrepresent potential prediction space and easily overfit to training data. Moreover, a naively trained network with a simple loss will produce predictions that average out all possible outcomes.
In order to tackle the uncertainty, we adopt a deep generative model, conditional variational autoencoder (CVAE) [41], inside the DESIRE framework. CVAE is a generative model that can learn the distribution of the output conditioned on the input by introducing a stochastic latent variable (note that we learn the distribution independently over different agents in this step; interaction between agents is considered in Sec. 3.2). It is composed of multiple neural networks, such as recognition network , (conditional) prior network , and generation network . Here, denote the parameters of corresponding networks. The prior of the latent variables is modulated by the input ; however, this can be relaxed to make the latent variables statistically independent of input variables, i.e., [21, 41]. Essentially, a CVAE introduces stochastic latent variables that are learned to encode a diverse set of predictions given input , making it suitable for modeling one-to-many mapping. During training, is learned such that it gives higher probability to that is likely to produce a reconstruction close to actual prediction given the full context and . At test time is sampled randomly from the prior distribution and decoded through the decoder network to produce a prediction hypothesis. This enables probabilistic inference which serves to handle multimodalities in the prediction space.
Train phase: Firstly, the past and future trajectories of an agent , and respectively, are encoded through two RNN encoders with separate sets of parameters (i.e., RNN Encoder1 and RNN Encoder2 in Fig. 2). The resulting two encodings, and , are concatenated and passed through one fully connected () layer with a non-linear activation (e.g., ). Two side-by-side layers follow to produce both the mean and the standard deviation over . The distribution of is modeled as a Gaussian distribution (i.e., ) and is regularized by the divergence against a prior distribution during the training. Upon successful training, the target distribution is learned in the latent variable , which allows one to draw a random sample from a Gaussian distribution to reconstruct at test time. Since backpropagation is not possible through random sampling, we adopt the standard reparameterization trick [22] to make it differentiable.
In order to model , is combined with as follows. The sampled latent variable is passed to one layer to match the dimension of that is followed by a layer, producing . Then that is combined with the encodings of past trajectories through a masking operation (i.e., element-wise multiplication). One can interpret this as a guided dropout where the guidance is derived from the full context of the individual trajectory during the training phase, while it is randomly drawn from an agnostic prior distribution in the testing phase. Finally, the following RNN decoder (i.e., RNN Decoder1 in Fig. 2) takes the output of the previous step, , and generates number of future prediction samples, i.e., .
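The masking step can be illustrated with a minimal sketch. The names `softmax`, `linear` and `sample_masked_encodings` are hypothetical, the standard normal prior is an assumption, and the activation that follows the linear layer is not legible in this text, so a softmax is assumed here purely for illustration.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def sample_masked_encodings(h_x, weights, bias, num_samples, rng):
    """Draw latent samples from an assumed N(0, I) prior and use each one
    to mask the past-trajectory encoding h_x by element-wise multiplication."""
    latent_dim = len(weights[0])
    masked = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        beta = softmax(linear(weights, bias, z))  # assumed activation
        masked.append([b * h for b, h in zip(beta, h_x)])
    return masked

rng = random.Random(0)
h_x = [1.0, 2.0, 4.0]                             # toy encoding of a past trajectory
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]  # maps a 2-D latent to 3-D
bias = [0.0, 0.0, 0.0]
decoder_inputs = sample_masked_encodings(h_x, weights, bias, num_samples=5, rng=rng)
```

Each entry of `decoder_inputs` would be fed to the RNN decoder to produce one prediction hypothesis; different latent samples emphasize different parts of the same encoding, which is what yields diverse outputs.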
There are two loss terms in training the CVAE-based RNN encoder-decoder.

Reconstruction Loss: . This loss measures how far the generated samples are from the actual ground truth.

Loss: . This regularization loss measures how close the sampling distribution at test time is to the distribution of latent variable that we learn during training.
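As a minimal sketch of the reparameterization trick and the regularization term, the snippet below assumes that the divergence is the usual KL divergence of a diagonal Gaussian against a standard normal prior, as is common for CVAEs; the function names are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).
    The randomness lives in eps, so z stays differentiable w.r.t. mu and sigma."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    Assumes a standard normal prior."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [1.0, 0.5], rng)
kl = kl_to_standard_normal([0.0, 1.0], [1.0, 0.5])
```

The divergence is zero exactly when the recognition distribution equals the prior, which is the case the regularizer pushes toward so that sampling from the prior at test time remains meaningful.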
Test phase: At test time, the encodings of future trajectories are not available, thus the encodings of past trajectories are combined with multiple random samples of latent variable drawn from the prior . Similar to the training phase, is passed to the following RNN decoder (i.e., RNN Decoder1 in Fig. 2) to generate a diverse set of prediction hypotheses.
Further details: For both train and test phases, we pass trajectories through a temporal convolution layer before encoding to encourage the network to learn the concept of velocity from adjacent frames before they are passed into the RNN encoders. Also, RNNs are implemented using gated recurrent units (GRU) [7] to learn long-term dependencies, yet they can be easily replaced with other popular RNNs like long short-term memory units (LSTM) [15]. In summary, this sample generation module produces a set of diverse hypotheses critical to capturing the multimodality of the prediction task, through an effective combination of CVAE and RNN encoder-decoder. Unlike [46], where CVAE is used to predict short-term visual motion from a single image, our CVAE module generates a diverse set of future trajectories based on a past trajectory.
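To see why a temporal convolution can expose velocity, consider the following sketch. The fixed kernel `[-1, 1]` is an illustrative assumption: in the model the kernel is learned, and its size is not stated here. With this kernel the convolution reduces to a finite difference between adjacent frames.

```python
def temporal_conv1d(trajectory, kernel):
    """Valid 1-D convolution (cross-correlation) over time,
    applied independently to each coordinate of the trajectory."""
    k = len(kernel)
    dims = len(trajectory[0])
    return [
        tuple(sum(kernel[j] * trajectory[t + j][d] for j in range(k)) for d in range(dims))
        for t in range(len(trajectory) - k + 1)
    ]

# A toy 2-D trajectory with three time steps.
past = [(0, 0), (1, 2), (3, 3)]
velocities = temporal_conv1d(past, kernel=[-1, 1])
print(velocities)  # [(1, 2), (2, 1)]
```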
3.2 IOC-based Ranking and Refinement
Predicting a distant future can be far more challenging than predicting one close by. In order to tackle this, we adopt the concept of the decision-making process in reinforcement learning (RL), where an agent is trained to choose actions that maximize long-term rewards to achieve its goal [43]. Instead of designing a reward function manually, however, IOC [50, 11] learns an unknown reward function. Inspired by this, we design an RNN model that assigns rewards to each prediction hypothesis and measures their goodness based on the accumulated long-term rewards. Thereafter, we also directly refine prediction hypotheses by learning displacements to the actual prediction through another layer. Lastly, the module receives iterative feedback from regressed predictions and keeps adjusting so that it produces precise predictions at the end. The model is illustrated on the right side of Fig. 2. During the process, we combine 1) past motion history through the embedding vector , 2) semantic scene context through a CNN with parameters , and 3) interaction among multiple agents by using interaction features (Sec. 3.3). Notice that unlike typical robotics applications [50, 11], we do not assume that the goal (final destination) is known or that the dynamics of the agents are given. Our model learns the agents' dynamics as well as the scene context in a coherent framework.
Learning to score: For an agent , there are number of samples (i.e., ) that are generated by our CVAE sampler. Let the score of individual prediction hypothesis for the agent be defined as follows,
(1) 
where is the prediction samples of other agents (i.e., , where ), is the prediction sample of an agent at time , is all the prediction samples until a time-step , is the maximum prediction length, and is the reward function that assigns a reward value at each time-step. is implemented as an layer that is connected to the hidden vector of RNN cell at each time step. We share the parameters of the layer over all the time steps (each RNN cell outputs the hidden state of the same dimension). Therefore, the score is the reward accumulated over time, accounting for the entire future rewards being assigned to each hypothesis. This enables our model to make a strategic decision by allowing us to rank samples as in other sampling-based IOC frameworks [11]. In addition, the reward function incorporates both scene context as well as the interaction between agents (see Sec. 3.3).
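The scoring rule described above can be sketched as follows, assuming the per-step reward is a single linear head with shared parameters `w` and `b` applied to each RNN hidden state. The hidden states here are toy values rather than outputs of a trained network, and all names are hypothetical.

```python
def score_hypothesis(hidden_states, w, b):
    """Accumulate per-step rewards, where each reward is a shared
    linear head applied to the hidden state at that time step."""
    return sum(sum(wi * hi for wi, hi in zip(w, h)) + b for h in hidden_states)

def rank_hypotheses(all_hidden_states, w, b):
    """Return hypothesis indices sorted by descending score, plus the scores."""
    scores = [score_hypothesis(hs, w, b) for hs in all_hidden_states]
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return order, scores

# Two hypotheses, each with a sequence of 2-D hidden states.
hidden_a = [[1.0, 0.0], [1.0, 0.0]]
hidden_b = [[0.0, 1.0], [0.0, 1.0]]
order, scores = rank_hypotheses([hidden_a, hidden_b], w=[1.0, -1.0], b=0.0)
print(order, scores)  # [0, 1] [2.0, -2.0]
```

Because the score sums rewards over the whole horizon, a hypothesis that looks slightly worse at the first step can still rank first if it accumulates more reward later.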
Learning to refine: Alongside the scores, our model also estimates a regression vector that refines each prediction sample . The regression vector for each agent is obtained with the regression function defined as follows,
(2) 
Represented as parameters of a neural network, the regression function accumulates both scene contexts and all other agents' dynamics from the past to entire future frames, and estimates the best displacement vector over the entire time horizon . Similarly to the score , it accounts for what happens in the future both in terms of scene context and interactions among dynamic agents to produce the output. We implement as another layer that is connected to the last hidden vector of the RNN which outputs dimensional vector. (or ) is the dimension of the location state.
Iterative feedback: Using the displacement vector , we iteratively refine the prediction hypothesis . After each cycle, is updated by , and fed into the IOC module. This process is similar to the gradient descent optimization of over the score function , but it does not require computing the gradient through the RNN, which can be very unstable due to the recurrent structure (i.e., vanishing or exploding gradients). We observe that iterative refinement indeed improves the quality of prediction samples in the experiments (see Fig. 4 and Fig. 5).
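The feedback loop can be sketched as below, assuming the update adds the regressed displacement to the current hypothesis. The regressor `toy_regress` is a stand-in invented for this example; in the model the displacement comes from the learned regression function.

```python
def iterative_refinement(sample, regress, num_iters):
    """Repeatedly add the predicted displacement to the hypothesis
    and feed the refined hypothesis back to the regressor."""
    for _ in range(num_iters):
        delta = regress(sample)
        sample = [tuple(c + d for c, d in zip(point, dp)) for point, dp in zip(sample, delta)]
    return sample

# Hypothetical regressor: moves each point halfway toward a fixed target.
target = [(4.0, 4.0), (8.0, 0.0)]

def toy_regress(sample):
    return [tuple(0.5 * (t - c) for c, t in zip(point, tp)) for point, tp in zip(sample, target)]

refined = iterative_refinement([(0.0, 0.0), (0.0, 0.0)], toy_regress, num_iters=2)
print(refined)  # [(3.0, 3.0), (6.0, 0.0)]
```

No gradient of the score with respect to the trajectory is needed at any point: each cycle is only a forward pass followed by an addition.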
Losses: There are two loss terms in training the IOC ranking and refinement module.

Cross-entropy Loss: of which the target distribution is obtained by , where .

Regression Loss:
Finally, the total loss of the entire network is defined as a multi-task loss as follows, where is the number of agents in one batch.
(3) 
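The ranking loss can be illustrated under stated assumptions. The exact target distribution is not legible in this text, so the sketch assumes that the target is a softmax over negative distances between each sample and the ground truth, with the maximum per-step L2 error as the distance, and that the predicted distribution is a softmax over the scores. All names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def max_l2_distance(sample, ground_truth):
    """Largest per-step L2 error between a sample and the ground truth."""
    return max(math.dist(p, g) for p, g in zip(sample, ground_truth))

def ranking_cross_entropy(scores, samples, ground_truth):
    """Cross-entropy between softmax(scores) and an assumed target
    distribution softmax(-distance): closer samples get more target mass."""
    p = softmax(scores)
    q = softmax([-max_l2_distance(s, ground_truth) for s in samples])
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p))

gt = [(0.0, 0.0), (1.0, 0.0)]
samples = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, -1.0)]]
loss = ranking_cross_entropy([0.0, 0.0], samples, gt)
```

In this toy case both samples are equally far from the ground truth and receive equal scores, so the loss equals the entropy of a uniform distribution over two outcomes.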
3.3 Scene Context Fusion
As discussed in the previous section, our ranking and refinement module relies on the hidden representation of the shared RNN module. Thus, it is important that the RNN must contain the information about 1) individual past motion context, 2) semantic scene context and 3) the interaction between multiple agents, in order to provide proper hidden representations that can score and refine a prediction .
We achieve the goal by having an RNN that takes the following input at each time step:
(4) 
where is a velocity of at , is a layer with a ReLU activation that maps the velocity to a high dimensional representation space, is a pooling operation that pools the CNN feature at the location , is the interaction feature computed by a fusion layer that spatially aggregates other agents' hidden vectors, similar to the Social Pooling (SP) layer [3]. The embedding vector (the output of the RNN Encoder1 in Fig. 2) is shared as the initial hidden state of the RNN, in order to provide the individual past motion context. We share this embedding with the CVAE module since both require the same information to be embedded in the vector.
Interaction Feature: We implement a spatial-grid-based pooling layer similar to the SP layer [3]. For each sample of an agent at , we define spatial grid cells centered at . Over each grid cell , we pool the hidden representation of all the other agents’ samples that are within the spatial cell, . Instead of using the max pooling operation with rectangular grids, we adopt log-polar grids with an average pooling. Combined with CNN features, the SCF module provides the RNN decoder with both static and dynamic scene information. It learns consistency between semantics of agents and scenes for reliable prediction.
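A minimal sketch of average pooling over a log-polar grid is given below. The radii, the number of rings and wedges, and the function names are all assumptions chosen for illustration; the text does not specify them.

```python
import math

def log_polar_cell(dx, dy, r_min, r_max, n_radial, n_angular):
    """Map a relative offset to a (ring, wedge) index, or None if out of range.
    Ring boundaries are spaced uniformly in log-radius."""
    r = math.hypot(dx, dy)
    if r < r_min or r >= r_max:
        return None
    ring = int(n_radial * math.log(r / r_min) / math.log(r_max / r_min))
    wedge = int(n_angular * (math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi))
    return min(ring, n_radial - 1), min(wedge, n_angular - 1)

def average_pool(center, neighbors, hidden_dim, r_min=0.5, r_max=8.0, n_radial=3, n_angular=4):
    """Average the hidden vectors of neighbors falling in each log-polar cell
    and concatenate the per-cell averages (zeros for empty cells)."""
    sums, counts = {}, {}
    for (x, y), h in neighbors:
        cell = log_polar_cell(x - center[0], y - center[1], r_min, r_max, n_radial, n_angular)
        if cell is None:
            continue
        acc = sums.setdefault(cell, [0.0] * hidden_dim)
        for d in range(hidden_dim):
            acc[d] += h[d]
        counts[cell] = counts.get(cell, 0) + 1
    pooled = []
    for ring in range(n_radial):
        for wedge in range(n_angular):
            n = counts.get((ring, wedge), 0)
            acc = sums.get((ring, wedge), [0.0] * hidden_dim)
            pooled.extend(v / n if n else 0.0 for v in acc)
    return pooled

neighbors = [((1.0, 0.0), [2.0, 4.0]), ((1.0, 0.1), [4.0, 8.0]), ((100.0, 0.0), [9.0, 9.0])]
feature = average_pool((0.0, 0.0), neighbors, hidden_dim=2)
```

Log-spaced rings give fine resolution to nearby agents and coarse resolution to distant ones, while averaging keeps the feature magnitude independent of how many agents share a cell.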
3.4 Characteristics of DESIRE
This section highlights particularly distinctive features of DESIRE that naturally enable higher accuracy and reliability.

The framework is based on a deep neural network and is trainable end-to-end, rather than relying on hand-crafted parametric representations and interaction terms. Trajectories of each agent are represented using RNN encoders and are combined together through a fusion layer within the architecture. Scene context is represented through a CNN and is not solely restricted to images (i.e., it can handle non-visual sensors too). Overall, the algorithm is scalable and flexible.

CVAE is combined with RNN encodings to generate stochastic prediction hypotheses, which handles ambiguities and multimodalities inherent in future prediction.

A novel RNN module coherently integrates multiple cues that have critical influence on behavior prediction such as dynamics of all neighboring agents and scene semantics.

An IOC framework is used to train the trajectory ranking objective by measuring potential long-term future rewards. This makes the model less reactive, and enables more accurate predictions further into the future.

A regression vector is learned to refine trajectories and an iterative feedback mechanism sequentially adjusts the predicted behavior, resulting in more accurate predictions.
4 Experiments
4.1 Datasets
KITTI Raw Data [12]: The dataset provides images of driving scenes and Velodyne 3D laser scans along with calibration information between cameras and sensors. To prepare data examples (i.e., ), we performed the following: As the dataset does not provide semantic labels for 3D points (which we need for scene context), we first perform semantic segmentation of images and project Velodyne laser scans onto the image plane using the provided camera matrix to label 3D points. The semantically labeled 3D points are then registered into the world coordinates using GPS-IMU tags. Finally, we create top-down view feature maps of size (: size of crop and : number of classes for scene elements, e.g., road, sidewalk, and vegetation shown as red, blue and green color in Fig. 6). is cropped with respect to the view point of the camera to simulate actual driving scenario ( and the size of pixel is . The camera is located at the left-center). Since laser scans on dynamic objects generate traces during registration, we remove moving objects and only use static scene elements. The trajectories are generated by extracting the center locations of the 3D tracklets and registering them in the world coordinates. We use all annotated videos from Road and City scenes for our experiments and generate approximately 2,500 training examples.
Stanford Drone Dataset [36]: The dataset contains a large volume of aerial videos captured on a university campus using a drone. There are various classes of dynamic objects interacting with each other, often in the form of high density crowds. Except for videos with less stabilized cameras and lost labels, we used all videos to create examples to train/test our model, yielding approximately examples. Note that we directly use raw images to extract visual features, rather than semantically labeled feature maps. We resize the images by in the following experiments to avoid memory overhead.
4.2 Evaluation Metrics and Baselines
The following metrics are used to measure the performance of the future prediction task in various aspects: (i) L2 distance between the prediction and ground truth at multiple time steps, (ii) miss-rate with a threshold in terms of L2 distance at multiple time steps, (iii) maximum L2 distance over entire time frames, (iv) maximum miss-rate over entire time frames, and (v) oracle error over top K number of samples (i.e., ) to account for the uncertainty in the future prediction (similar to MEE in [46]). We set K to be throughout the main experiments.
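The metrics above can be sketched as follows, assuming that a miss is an error strictly above the threshold and that the oracle error takes the best of the top K ranked samples at a given time step. The function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def l2_errors(prediction, ground_truth):
    """Per-time-step L2 distance between a prediction and the ground truth."""
    return [math.dist(p, g) for p, g in zip(prediction, ground_truth)]

def miss_rate(errors, threshold):
    """Fraction of examples whose error exceeds the threshold."""
    return sum(1 for e in errors if e > threshold) / len(errors)

def oracle_error(ranked_samples, ground_truth, top_k, step):
    """Smallest L2 error at a time step among the top-k ranked samples."""
    return min(math.dist(s[step], ground_truth[step]) for s in ranked_samples[:top_k])

gt = [(0.0, 0.0), (1.0, 0.0)]
ranked = [
    [(0.0, 0.0), (4.0, 0.0)],  # ranked first, but far off at the last step
    [(0.0, 0.0), (1.0, 1.0)],  # ranked second, closer at the last step
]
print(oracle_error(ranked, gt, top_k=1, step=1))  # 3.0
print(oracle_error(ranked, gt, top_k=2, step=1))  # 1.0
```

The toy case shows why the oracle error can only decrease as K grows: a larger K never removes a candidate from the minimum.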
We compare our method with the following baselines:

Linear: A linear regressor that estimates linear parameters by minimizing the least square error.

RNN ED: An RNN encoderdecoder model that directly regresses the prediction only using the past trajectories.

RNN ED-SI: An RNN ED augmented with our SCF unit in the decoder, similar to [17]. The model combines the scene and interaction features while making a prediction and uses the same information as ours, but makes a prediction at solely based on the past information up to .

DESIRE: The proposed method. We denote our model with only semantic scene context in the SCF module as DESIRE-S and our model with both scene context and interaction as DESIRE-SI. We also evaluate DESIRE-X-IT{N}, where N is the number of iterative feedbacks.
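For concreteness, the Linear baseline can be sketched as below. The exact parameterization of the baseline is not given in the text, so the sketch assumes an independent least-squares line fit per coordinate over the observed frames, followed by extrapolation; the function names are hypothetical.

```python
def fit_line(values):
    """Closed-form least-squares fit of value = a + b * t for t = 0, 1, ..., n - 1."""
    n = len(values)
    t_mean = (n - 1) / 2.0
    v_mean = sum(values) / n
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values)) / denom
    return v_mean - slope * t_mean, slope

def linear_forecast(past, horizon):
    """Fit one line per coordinate on the past trajectory and extrapolate."""
    n = len(past)
    fits = [fit_line([p[d] for p in past]) for d in range(len(past[0]))]
    return [tuple(a + b * t for a, b in fits) for t in range(n, n + horizon)]

past = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
future = linear_forecast(past, horizon=2)
print(future)  # [(3.0, 6.0), (4.0, 8.0)]
```

Such a model cannot represent turns or stops, which is consistent with the nonlinear baselines performing better on both datasets.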
4.3 Learning Details
We train the model with the Adam optimizer [20] with the initial learning rate of . The learning rate is halved at every quarter of the total epochs, although we do not observe a clear improvement from this. All the models including encoder-decoder baselines are trained for epochs for KITTI and epochs for SDD (about K iterations with a batch size ). The full details on the architecture are discussed in the supplementary materials. In order to avoid exploding gradients in RNNs, we apply gradient clipping with L2 norm of . During the training procedure, we randomly rotate the scene and trajectories to augment data and reduce overfitting. For all experiments, we run randomized 5-fold cross-validation without overlapping videos in different splits. All models observe a maximum of seconds for past trajectories and make a prediction up to seconds into the future. All models are implemented using TensorFlow and trained end-to-end with an NVIDIA Tesla K80 GPU. Training takes approximately one to two days per model.
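The schedule and the clipping rule can be sketched in a framework-independent way. The initial learning rate, epoch count and clipping norm used below are placeholders, since the actual values are not legible in this text, and the function names are hypothetical.

```python
import math

def learning_rate(epoch, total_epochs, initial_lr):
    """Halve the learning rate at every quarter of the total epochs."""
    quarter = max(1, total_epochs // 4)
    return initial_lr * 0.5 ** (epoch // quarter)

def clip_by_l2_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

# Placeholder values chosen only to make the behavior easy to follow.
schedule = [learning_rate(e, total_epochs=100, initial_lr=1.0) for e in (0, 25, 50, 75)]
clipped = clip_by_l2_norm([3.0, 4.0], max_norm=1.0)
print(schedule)  # [1.0, 0.5, 0.25, 0.125]
```

Clipping rescales the whole gradient vector rather than truncating individual components, so the update direction is preserved.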
4.4 Analysis
(Figure panels: Iteration 0, Iteration 1, Iteration 3.)
Table 1 and Fig. 4 compare the oracle prediction errors of various methods (the maximum error in Table 1 might differ from Fig. 4 due to the test examples without ground truth labels at 4 seconds in the future). We present L2 distance error for both datasets and miss-rate with a 1 m threshold for KITTI only, as trajectories in SDD are defined in image pixel space. Note that Linear, RNN ED, and RNN ED-SI output a single prediction, thus their results are shown as horizontal lines. CVAE samples are sorted randomly without confidence values.
Baselines: RNN ED performs significantly better than Linear since it can learn non-linear motion. We observe that RNN ED-SI performs worse than RNN ED on KITTI since the model learns to behave reactively (see Fig. 6). This might be due to the small size of the dataset, which makes it hard to learn predictive CNN/interaction features (i.e., features need to have high capacity to encode long-term information). On the contrary, RNN ED-SI significantly outperforms RNN ED on the SDD dataset since SDD is much bigger and has a large number of interactions among agents.
Proposed models: With a single random sample (CVAE 1 in Table 1), CVAE performs worse than RNN ED since RNN ED directly optimizes for L2 distance during training. Given more than a few samples (e.g., CVAE in Table 1), CVAE quickly outperforms RNN ED on both datasets, which confirms the multimodal nature of the prediction problem. DESIRE-X-IT0 without iterative regression properly ranks the random CVAE samples, achieving lower error with few samples. Note that DESIRE-X-IT0 only ranks the samples without regression, and thus achieves the same error as CVAE when all samples are used, i.e., at Top K ratio of in Fig. 4. As we iterate, the outputs get refined and achieve smaller oracle error (i.e., DESIRE-X-10%-IT0 vs. DESIRE-X-10%-IT4). Fig. 5 shows an example of the iterative feedback. Finally, we observe that considering the interaction between agents further helps to achieve lower error. The difference between DESIRE-S and DESIRE-SI is smaller in the KITTI experiment, since KITTI has only a few interactions between cars. However, we observe clear improvement on the SDD dataset since there is a rich set of scenes with interactions between agents. Although our model with the top 1 sample (DESIRE Best) achieves higher error compared to the direct regression baselines, using a few more samples yields much better prediction accuracy (i.e., DESIRE ). Note that direct regression models with lower error are not necessarily better if they average various futures (e.g., going straight). We believe that in some applications, probabilistic prediction over a variety of outcomes is more desirable than a single MAP prediction. For both datasets, DESIRE achieves error on par with the best baselines using as few as top samples of DESIRE-SI-IT4 predictions (see Fig. 4). Qualitative results are presented in Fig. 6 and in the supplementary material.
Method  1.0 (sec)  2.0 (sec)  3.0 (sec)  4.0 (sec)
KITTI (error in meters / miss rate with 1 m threshold)
Linear  0.89 / 0.31  2.07 / 0.49  3.67 / 0.59  5.62 / 0.64
RNN ED  0.45 / 0.13  1.21 / 0.39  2.35 / 0.54  3.86 / 0.62
RNN ED-SI  0.56 / 0.16  1.40 / 0.44  2.65 / 0.58  4.29 / 0.65
CVAE 1  0.61 / 0.22  1.81 / 0.50  3.68 / 0.60  6.16 / 0.65
CVAE 10%  0.35 / 0.06  0.93 / 0.30  1.81 / 0.49  3.07 / 0.59
DESIRE-S-IT0 Best  0.53 / 0.17  1.52 / 0.45  3.02 / 0.58  4.98 / 0.64
DESIRE-S-IT0 10%  0.32 / 0.05  0.84 / 0.26  1.67 / 0.43  2.82 / 0.54
DESIRE-S-IT4 Best  0.51 / 0.15  1.46 / 0.42  2.89 / 0.56  4.71 / 0.63
DESIRE-S-IT4 10%  0.27 / 0.04  0.64 / 0.18  1.21 / 0.30  2.07 / 0.42
DESIRE-SI-IT0 Best  0.52 / 0.16  1.50 / 0.44  2.95 / 0.57  4.80 / 0.63
DESIRE-SI-IT0 10%  0.33 / 0.06  0.86 / 0.25  1.66 / 0.42  2.72 / 0.53
DESIRE-SI-IT4 Best  0.51 / 0.15  1.44 / 0.42  2.76 / 0.54  4.45 / 0.62
DESIRE-SI-IT4 10%  0.28 / 0.04  0.67 / 0.17  1.22 / 0.29  2.06 / 0.41
SDD (pixel error at resolution)
Linear  2.58  5.37  8.74  12.54
RNN ED  1.53  3.74  6.47  9.54
RNN ED-SI  1.51  3.56  6.04  8.80
CVAE 1  2.51  6.01  10.28  14.82
CVAE 10%  1.84  3.93  6.47  9.65
DESIRE-S-IT0 Best  2.02  4.47  7.25  10.29
DESIRE-S-IT0 10%  1.59  3.31  5.27  7.75
DESIRE-S-IT4 Best  2.11  4.69  7.58  10.66
DESIRE-S-IT4 10%  1.30  2.41  3.67  5.62
DESIRE-SI-IT0 Best  2.00  4.41  7.18  10.23
DESIRE-SI-IT0 10%  1.55  3.24  5.18  7.61
DESIRE-SI-IT4 Best  2.12  4.69  7.55  10.65
DESIRE-SI-IT4 10%  1.29  2.35  3.47  5.33
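The quantities reported in the table above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code: the oracle error is taken as the smallest L2 error among the top-scoring fraction of the samples, the miss rate as the fraction of test cases whose error exceeds the threshold, and the function names, scores and trajectories are hypothetical.

```python
import math

# Illustrative sketch only: oracle error over the top-ranked samples and
# miss rate at a distance threshold, under the assumptions stated above.


def l2(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def oracle_error(samples, scores, truth, top_ratio, t):
    """Smallest L2 error at time index t among the top-scoring samples."""
    k = max(1, math.ceil(top_ratio * len(samples)))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return min(l2(samples[i][t], truth[t]) for i in ranked[:k])


def miss_rate(errors, threshold=1.0):
    """Fraction of test cases whose error is larger than the threshold."""
    return sum(e > threshold for e in errors) / len(errors)


# Hypothetical ground truth and four scored prediction samples.
truth = [(0.0, 1.0), (0.0, 2.0)]
samples = [
    [(0.0, 1.0), (0.0, 2.5)],    # straight, slightly too far
    [(1.0, 1.0), (2.0, 2.0)],    # right turn
    [(0.0, 1.0), (0.0, 2.0)],    # matches the ground truth
    [(-1.0, 1.0), (-2.0, 2.0)],  # left turn
]
scores = [0.9, 0.7, 0.4, 0.1]

best = oracle_error(samples, scores, truth, top_ratio=0.25, t=1)    # top 1 of 4
top_half = oracle_error(samples, scores, truth, top_ratio=0.5, t=1)
all_used = oracle_error(samples, scores, truth, top_ratio=1.0, t=1)
rate = miss_rate([0.5, 2.0, 0.0, 2.0], threshold=1.0)
print(best, top_half, all_used, rate)
```

Enlarging the retained fraction can only keep or lower the oracle error, which is why the 10% rows are never worse than the corresponding Best rows under this definition.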
K (the number of prediction samples)
Method  25  50  100  200
DESIRE-S-IT4 Best  4.87  4.71  4.81  4.70
DESIRE-S-IT4 top  2.03  2.04  1.99  1.96

Time length for past (sec)
Method  1.0  2.0  4.0
DESIRE-S-IT4 Best  4.94  4.71  4.78
DESIRE-S-IT4 10%  2.11  2.07  2.05
5 Conclusion
We introduce DESIRE, a novel framework for distant future prediction of multiple agents in complex scenes. The model incorporates both static and dynamic scene contexts with a deep IOC framework and produces stochastic, continuous, and time-profiled long-term predictions that can effectively account for the uncertainty in the future prediction task. Our empirical evaluations on driving and surveillance scenarios demonstrate clear improvement over other baselines. For future work, we believe that our model can be further improved with larger datasets and applied to various robotics applications that directly use perspective images.
Acknowledgement
This work was part of N. Lee’s summer internship at NEC Labs America and was also supported by the EPSRC, ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1.
References
 [1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
 [2] H. Akaike. Fitting autoregressive models for prediction. Annals of the institute of Statistical Mathematics, 21(1):243–247, 1969.
 [3] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social lstm: Human trajectory prediction in crowded spaces. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–971, 2016.
 [4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [5] L. Ballan, F. Castaldo, A. Alahi, F. Palmieri, and S. Savarese. Knowledge transfer for scene-specific motion prediction. arXiv preprint arXiv:1603.06987, 2016.
 [6] L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 2008.
 [7] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 [9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
 [10] A. D. Dragan, N. D. Ratliff, and S. S. Srinivasa. Manipulation planning with goal sets using constrained trajectory optimization. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 4582–4588. IEEE, 2011.
 [11] C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. arXiv preprint arXiv:1603.00448, 2016.
 [12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013.
 [13] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE, 2013.
 [14] D. Helbing and P. Molnar. Social force model for pedestrian dynamics. Physical review E, 51(5):4282, 1995.
 [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [16] J. Hu and M. P. Wellman. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.
 [17] A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In International Conference on Robotics and Automation (ICRA), 2016.
 [18] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
 [19] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
 [20] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [21] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
 [22] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [23] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In European Conference on Computer Vision, pages 201–214. Springer, 2012.
 [24] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila. Context-based pedestrian path prediction. In European Conference on Computer Vision, pages 618–633. Springer, 2014.
 [25] H. Kretzschmar, M. Kuderer, and W. Burgard. Learning to predict trajectories of cooperatively navigating agents. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 4015–4020. IEEE, 2014.
 [26] N. Lee and K. M. Kitani. Predicting wide receiver trajectories in american football. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
 [27] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning.
 [28] J. Mainprice, R. Hayne, and D. Berenson. Goal set inverse optimal control and iterative replanning for predicting human reaching motions in shared workspaces. arXiv preprint arXiv:1606.02111, 2016.
 [29] P. McCullagh and J. A. Nelder. Generalized linear models, volume 37. CRC press, 1989.
 [30] H. S. Park, J.-J. Hwang, Y. Niu, and J. Shi. Egocentric future localization. 2016.
 [31] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You’ll never walk alone: Modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision, pages 261–268. IEEE, 2009.
 [32] M. B. Priestley. Spectral analysis and time series. 1981.
 [33] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
 [34] C. E. Rasmussen. Gaussian processes for machine learning. 2006.
 [35] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [36] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In European Conference on Computer Vision, pages 549–565. Springer, 2016.
 [37] R. Salakhutdinov and G. E. Hinton. Deep boltzmann machines. In AISTATS, volume 1, page 3, 2009.
 [38] S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, and A. Shashua. Long-term planning by short-term prediction. arXiv preprint arXiv:1602.01580, 2016.
 [39] Y. Shoham and K. Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008.
 [40] Y. Shoham, R. Powers, and T. Grenager. If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7):365–377, 2007.
 [41] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
 [42] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 [43] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 [44] E. Thibodeau-Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. 2014.
 [45] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. 2016.
 [46] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pages 835–851. Springer, 2016.
 [47] J. Walker, A. Gupta, and M. Hebert. Patch to the future: Unsupervised visual prediction. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 3302–3309. IEEE, 2014.
 [48] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE transactions on pattern analysis and machine intelligence, 30(2):283–298, 2008.
 [49] C. K. Williams. Prediction with gaussian processes: From linear regression to linear prediction and beyond. In Learning in graphical models, pages 599–621. Springer, 1998.
 [50] M. Wulfmeier, D. Z. Wang, and I. Posner. Watch this: Scalable cost-function learning for path planning in urban environments. arXiv preprint arXiv:1607.02329, 2016.
 [51] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2(3):5, 2015.
 [52] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.