GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction

Abstract

Predicting the future paths of an agent's neighbors accurately and in a timely manner is central to autonomous applications for collision avoidance. Conventional approaches, e.g., LSTM-based models, incur considerable computational cost in prediction, especially for long sequences. To support more efficient and accurate trajectory prediction, we propose GraphTCN, a novel CNN-based spatial-temporal graph framework, which models the spatial interactions as social graphs and captures the spatio-temporal interactions with a modified temporal convolutional network. Different from conventional models, both the spatial and temporal modeling of our model are computed within each local time window; the model can therefore be executed in parallel for much higher efficiency, while achieving accuracy comparable to best-performing approaches. Experimental results confirm that our model achieves better performance in terms of both efficiency and accuracy compared with state-of-the-art models on various trajectory prediction benchmark datasets.


1 Introduction

Trajectory prediction is a fundamental and challenging task that forecasts the future paths of agents in autonomous applications, such as autonomous vehicles, socially compliant robots, and agents in simulators, navigating in a shared environment. With multi-agent interaction in these applications, agents are required to respond timely and precisely to the environment to avoid collisions. The ability of agents to predict the future paths of their neighbors efficiently and accurately is therefore much needed. Although recent works [22, 30, 16, 25] have achieved great improvements in modeling complex social interactions among agents to generate accurate future paths, trajectory prediction remains challenging, as the deployment of prediction models in real-world applications is mostly restricted by high computational cost and long inference time. For example, some small robots are only equipped with limited computing devices that cannot afford the high inference cost of existing solutions.

Figure 1: Illustration of trajectory prediction in the crowd. The solid blue lines are the observed trajectories, while the dashed blue lines represent plausible future paths. The influence levels between two agents differ based on their relative movement trends, e.g., the influence of human $i$ on human $j$ and that of human $j$ on human $i$ are different.

In particular, trajectory prediction is typically modeled in two dimensions, i.e., the temporal dimension and the spatial dimension, as illustrated in Fig. 1. The temporal dimension models the historical movement dynamics of each agent. Most state-of-the-art approaches [1, 11, 22, 16, 18, 31] rely on Recurrent Neural Networks (RNNs), e.g., Long Short-Term Memory (LSTM) networks [14], to capture such sequence dynamics, since RNNs are designed for sequence modeling. However, besides the training difficulties of gradient vanishing and exploding [27] in modeling sequential data, both training and inference of RNN models are notoriously slow compared with their feed-forward counterparts, e.g., Convolutional Neural Networks (CNNs), because each hidden state of an RNN depends on the previous inputs and hidden states. As a consequence, RNN predictions are produced sequentially and are thus not parallelizable.

The spatial dimension models the human-human interaction, i.e., interactions between the agent and its neighbors. There are mainly three kinds of approaches for capturing the spatial interaction: distance-based [1, 11, 22], attention-based [30, 7, 39, 18], and graph-based [16, 20, 47, 25]. Distance-based approaches introduce a social pooling layer to summarize crowd interactions, while attention-based approaches instead dynamically generate the importance of neighbors using soft attention. Graph-based approaches represent the agents as a graph and utilize graph neural networks, e.g., GCN [19], variants of GCN [47, 25], or GAT [38], to obtain the spatial interaction features of agents, which is more intuitive and effective for modeling complex social interactions. However, existing graph-based approaches depend overly either on attention or on the geometric distance between agents and ignore their relative relationship.

To improve effectiveness and efficiency, we propose a novel CNN-based spatial-temporal graph network (STGNN), i.e., GraphTCN, to capture the spatial and temporal interactions for trajectory prediction. In the temporal dimension, different from RNN-based methods, we adopt a modified gated temporal convolutional network (TCN) to capture the temporal dynamics of each agent. The gated highway mechanism introduced to the CNN dynamically regulates the information flow by focusing on more salient features, and the feed-forward nature of CNNs makes the model more tractable in training and parallelizable for much higher efficiency in both training and inference. In the spatial dimension, we propose an edge feature graph attention network (EFGAT) with skip connections and a gating mechanism at each time instant to model the spatial interactions between agents. Specifically, nodes in the graph represent agents, and the edge between two agents denotes their relative geometric relation. EFGAT then learns the adjacency matrix, i.e., the spatial interaction, of the graph adaptively. Together, the spatial and temporal modules of GraphTCN support more effective and efficient modeling of the interactions between agents within each time step and across time steps for each and every agent. We summarize our main contributions as follows:

  • We propose an edge feature graph attention neural network (EFGAT), which integrates the relative spatial locations as prior knowledge, to capture the spatial interaction adaptively with a self-attention mechanism.

  • We propose to model the spatial-temporal interactions with a gated temporal convolutional network (TCN), which empirically proves to be more efficient and effective.

  • Our spatial-temporal framework achieves better performance than state-of-the-art approaches. Specifically, it reduces the average displacement error by 19.4% and the final displacement error by 13.6% with 5 times fewer generated samples, and achieves up to 5.22x wall-clock time speedup over existing solutions.

We organize this paper as follows: Section 2 introduces the background and discusses related work in detail; Section 3 presents our GraphTCN framework; Section 4 compares GraphTCN with state-of-the-art approaches in terms of both accuracy and efficiency; and Section 5 concludes the paper.

2 Related Work

2.1 Human-Human Interactions

Research in crowd interaction modeling can be traced back to the Social Force model [13], which adopts nonlinearly coupled Langevin equations to represent the attractive and repulsive forces of human movement in crowd scenarios. Similar hand-crafted approaches [36, 2, 41] have proved successful in crowd simulation [15, 29], crowd behavior detection [24], and trajectory prediction [43]. However, these approaches model social behavior based only on psychological or physical realization, which alone is insufficient to capture complex crowd interaction. Recent works have investigated deep learning techniques to capture the interaction between the agent and its neighbors. The distance-based approaches [1, 11, 23] either adopt grid-based pooling or a symmetric function to aggregate the hidden states from neighbors, or encode the geometric relation between agents. Different from the distance-based methods, attention-based approaches [30, 39, 7, 47] provide better crowd understanding since they differentiate the importance of neighbors via soft attention or gating mechanisms. More recent works [16, 20, 25] adopt graph-based networks to learn the social interaction by aggregating neighborhood features adaptively with the adjacency matrix, which provides a more intuitive way to represent the pedestrians' topology in a shared space. Social-STGCNN [25] captures the spatial relation by introducing a kernel function on the weighted adjacency matrix; STGAT [16, 20] adopts GAT directly on the LSTM hidden states to capture the spatial interaction between pedestrians; however, the former approach focuses on distance features of the agent while the latter depends fully on attention. EGNN [9] incorporates edge features into the graph attention mechanism to exploit richer graph information. However, treating the edge features as real-valued matrices may cause the social model to lose important relative features between pedestrians. We propose to model the pedestrian interactions with a novel edge feature graph neural network, which integrates the relative distance feature into the graph attention operation to learn an adaptive adjacency matrix capturing the most salient interaction information.

2.2 Pattern-based Sequence Prediction

Sequence prediction refers to the problem of predicting a future sequence from historical sequence information. Recently, pattern-based methods have become mainstream for many sequence prediction tasks, e.g., speech recognition [37, 4, 10], activity recognition [6, 17], and natural language processing [3, 34, 8]. In particular, trajectory prediction can be formulated as a sequence prediction task, which uses the historical movement patterns of the agent to generate the future path. Most trajectory prediction methods adopt recurrent neural networks (RNNs), e.g., Long Short-Term Memory (LSTM) networks [14], to capture the temporal movement in the sequence. However, RNN-based models suffer from gradient vanishing and exploding during training and overly focus on more recent inputs during prediction, especially for long input sequences. Many sequence prediction works [37, 42] instead adopt convolutional neural networks (CNNs) and have achieved great success. Convolutional networks can better capture long-term dependency and greatly improve prediction efficiency. The superiority of CNN-based methods can be largely attributed to the convolutional operation, which is independent of preceding time steps and can thus be processed in parallel. The recent work [26] proposes a compact CNN model to capture the temporal information and an MLP layer to generate the future sequence simultaneously; their results confirm that CNN-based models can yield competitive performance in trajectory prediction. However, it does not model the spatial interaction between pedestrians. In this work, we propose to capture the spatial interaction with EFGAT and introduce gated convolutional networks to capture the temporal dynamics of each pedestrian.

2.3 Spatial-temporal Graph Networks for Trajectory Prediction

Recently, many studies have adopted spatial-temporal graph neural networks (STGNNs) for sequence prediction tasks, such as action recognition [44, 32], taxi demand prediction [46], and traffic prediction [45]. Specifically, the sequence can be formulated as a sequence of graphs of nodes and edges, where nodes correspond to the agents and edges to their interactions, and can thus be effectively modeled with a spatial-temporal graph network. Trajectory prediction fits this formulation naturally: the task can be modeled in two dimensions, i.e., the spatial dimension and the temporal dimension, where the spatial dimension models the interaction between the agent and its neighbors, and the temporal dimension models the historical trajectory of each agent. Therefore, in STGNNs, each node in the graph represents one pedestrian in a scene, and each edge between two nodes captures the interaction between the two corresponding pedestrians. For example, social attention [39] models each node with the location of the agent and each edge with the distance between pedestrians, where the spatial relation is modeled with an attention module and the temporal relation with RNNs. Similarly, [40] constructs the STGNN with an Edge RNN and a Node RNN based on the location; STGAT [16] uses GAT to capture the spatial interaction by assigning different importance to neighbors and adopts extra LSTMs to capture the temporal information of each agent. The major limitation of these methods is the difficulty of capturing the spatial interaction along the temporal dimension. Notably, the future path of an agent depends not only on its own current position but also on its neighbors'. However, the details of such spatial interaction may be lost during the aggregation of node features along the temporal dimension using RNN-based models. In contrast to the RNN-based methods, Social-STGCNN [25] and Graph WaveNet [42] benefit from CNNs to alleviate parameter inefficiency and have demonstrated the capability of CNNs for temporal modeling of long sequences. In this paper, we propose an enhanced temporal convolutional network to integrate both the temporal dynamics of the agent and its social information, capturing the spatial and temporal correlation of the interactions. Besides, we further incorporate and investigate the variational generative model, i.e., VAE [33, 23], for multimodal trajectory prediction.

3 GraphTCN

The goal of trajectory prediction is to predict the future paths of all agents present in the scene simultaneously. Naturally, the future path of an agent depends on its historical trajectory, i.e., the temporal interaction, and is influenced by the trajectories of neighboring agents, i.e., the spatial interaction. Consequently, a trajectory prediction model should take both into consideration when modeling the spatial and temporal interactions for the prediction.

Problem Formulation. Formally, trajectory prediction can be defined as follows: given $N$ pedestrians observed in a scene over $T_{obs}$ time steps, the position of pedestrian $i$ at time step $t$ is denoted as $p_i^t = (x_i^t, y_i^t)$. The observed positions of pedestrian $i$ can thereby be represented as $X_i = \{p_i^1, p_i^2, \ldots, p_i^{T_{obs}}\}$. The goal of trajectory prediction is then to predict all the future positions $\hat{Y}_i = \{\hat{p}_i^{T_{obs}+1}, \ldots, \hat{p}_i^{T_{pred}}\}$ for all pedestrians simultaneously.
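To make the formulation concrete, the following is a minimal sketch of the input and output tensors in PyTorch; the shape convention (pedestrians first, then time) is our assumption, not part of the paper.

```python
import torch

N, T_obs, T_pred = 5, 8, 12  # pedestrians, observed steps, predicted steps (example values)

# Observed absolute positions: X[i, t] = p_i^t = (x_i^t, y_i^t).
X = torch.randn(N, T_obs, 2)
# Relative displacements between consecutive steps, a common model input.
dX = X[:, 1:] - X[:, :-1]            # shape (N, T_obs - 1, 2)
# The model must produce future positions for all N pedestrians at once.
Y_hat = torch.zeros(N, T_pred, 2)    # Y_hat[i, t] = predicted p_i^{T_obs + t + 1}
```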

3.1 Overall Framework

(a) GraphTCN Overview
(b) EFGAT
Figure 2: (a) An overview of GraphTCN: EFGAT captures the spatial interaction between agents at each time step, and, based on the historical trajectory embedding, TCN further captures the temporal interaction across time steps. The decoder module then produces multiple socially acceptable trajectories for all agents simultaneously. (b) EFGAT captures the spatially salient information with graph attentional layers (GAL) and skip connections.

As illustrated in Fig. 2(a), GraphTCN comprises three key modules: the edge feature graph attention (EFGAT) module, the temporal convolutional (TCN) module, and a decoder. First, we embed the absolute and relative positions of each pedestrian into a fixed-length hidden space and feed the trajectory features into the EFGAT module. The residual learning mechanism and skip connections [12] are incorporated into the network to facilitate gradient backpropagation and to forward intermediate features. The TCN module is a feed-forward one-dimensional convolutional network with a gated activation unit [37] for capturing the most salient features. Finally, the decoder module produces the future trajectories of all pedestrians simultaneously. We elaborate on the details of each module in the following sections.
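Since the paper does not include reference code, the sketch below illustrates how the three modules could compose in PyTorch; the module boundaries, dimensions, and placeholder layers are our assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn

class GraphTCNSketch(nn.Module):
    """Structural sketch only: EFGAT and TCN are stubbed with nn.Identity."""
    def __init__(self, embed_dim=64, noise_dim=4, t_pred=12):
        super().__init__()
        self.embed = nn.Linear(4, embed_dim)   # absolute (x, y) + relative (dx, dy) per step
        self.efgat = nn.Identity()             # placeholder: spatial module, per time step
        self.tcn = nn.Identity()               # placeholder: temporal module, across steps
        self.decoder = nn.Linear(embed_dim + noise_dim, t_pred * 2)
        self.noise_dim, self.t_pred = noise_dim, t_pred

    def forward(self, pos, rel):               # pos, rel: (N, T_obs, 2)
        h = self.embed(torch.cat([pos, rel], dim=-1))  # (N, T_obs, embed_dim)
        h = self.efgat(h)                      # spatial interaction at each time step
        h = self.tcn(h)                        # spatial-temporal interaction across steps
        z = torch.randn(h.size(0), self.noise_dim)     # noise for multimodal outputs
        out = self.decoder(torch.cat([h[:, -1], z], dim=-1))
        return out.view(-1, self.t_pred, 2)    # relative future positions, all agents at once
```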

3.2 EFGAT Module for Spatial Interaction

The EFGAT module shown in Fig. 2(b) is designed to encode the spatial interaction between pedestrians with graph attentional layers and graph residual terms. Formally, the pedestrians within the same time step $t$ can be formulated as a directed graph $G^t = (V^t, E^t)$, where each node $v_i^t \in V^t$ corresponds to the $i$-th pedestrian, and the weighted edge $e_{ij}^t \in E^t$ represents the human-human interaction from pedestrian $i$ to pedestrian $j$. The adjacency matrix $A^t$ of $G^t$ thus represents the spatial relationships between pedestrians.

Figure 3: An illustration of the graph attentional layer with 5 nodes employed by our EFGAT module. The attention between node $i$ and its neighbors is learned from their embedding features and the relative spatial prior knowledge $e_{ij}^t$.

We represent the spatial relation of the nodes as an asymmetric, non-negative matrix in this task, since the influence between agents should differ based on their relative movement behavior. Therefore, instead of constructing graphs with undirected spatial distance, we introduce the relative spatial location as prior edge feature knowledge of the adjacency matrix:

$$ e_{ij}^t = \phi\big( p_j^t - p_i^t;\ W_e \big) \qquad (1) $$

where $\phi(\cdot)$ is an embedding function and $W_e$ is the embedding weight. We feed the learnable edge weights and node features into the graph attentional layers shown in Fig. 3 to capture the spatial interaction:

$$ \alpha_{ij}^t = \frac{ \exp\big( \mathrm{LeakyReLU}\big( \mathbf{a}^\top \big[ W h_i^t \,\Vert\, W h_j^t \,\Vert\, e_{ij}^t \big] \big) \big) }{ \sum_{k \in \mathcal{N}_i} \exp\big( \mathrm{LeakyReLU}\big( \mathbf{a}^\top \big[ W h_i^t \,\Vert\, W h_k^t \,\Vert\, e_{ik}^t \big] \big) \big) } \qquad (2) $$

where $h_i^t \in \mathbb{R}^F$ is the node feature of pedestrian $i$ at time step $t$, $F$ is the number of node features, and $\mathrm{LeakyReLU}$ is the activation. Therefore, $\alpha_{ij}^t$ gives the importance weight of neighbor $j$ to pedestrian $i$, dynamically calculated via the self-attention mechanism. The gating function has empirically proven to be powerful for controlling the bypass signals [37, 5]. We therefore adopt a similar gated activation unit to dynamically regulate the information flow and select more salient features:

$$ g_i^t = \tanh\Big( W_a \sum_{j \in \mathcal{N}_i} \alpha_{ij}^t W h_j^t \Big) \odot \sigma\Big( W_b \sum_{j \in \mathcal{N}_i} \alpha_{ij}^t W h_j^t \Big) \qquad (3) $$

where $\tanh(\cdot)$ is the tanh activation function, $W_a$ and $W_b$ are the learnable parameters, and $\odot$ denotes element-wise multiplication. This can be understood as a multiplicative skip connection which facilitates gradient flow through layers [5]. To stabilize the self-attention process [38, 42], we adopt the multi-head attention mechanism:

$$ \tilde{h}_i^t = \Big( \big\Vert_{m=1}^{M}\, g_i^{t,(m)} \Big) W_o + h_{res}\big( h_i^t \big) \qquad (4) $$

where $W_o$ is the learnable parameter, $\Vert$ denotes concatenation, and $M$ is the number of attention heads. $h_{res}(\cdot)$ denotes the graph residual term [37, 38, 42]. Thereby, we obtain the final node features $\tilde{h}^t = \{\tilde{h}_1^t, \ldots, \tilde{h}_N^t\}$, where $\tilde{h}_i^t$ captures the aggregated spatial interaction between pedestrian $i$ and its neighbors at each time step. The EFGAT module therefore learns a self-adaptive adjacency matrix that captures the relative importance of different pedestrians.
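A minimal sketch of one EFGAT graph attentional layer in PyTorch, assuming a dense (fully connected) pedestrian graph at one time step; the exact parameterization of Eqs. (1)-(3) is our reconstruction, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeFeatureGAL(nn.Module):
    """One graph attentional layer with relative-position edge features,
    sketching Eqs. (1)-(3); the exact parameterization is assumed."""
    def __init__(self, in_dim, out_dim, edge_dim=8):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.edge_embed = nn.Linear(2, edge_dim)          # Eq. (1): embed relative location
        self.attn = nn.Linear(2 * out_dim + edge_dim, 1)  # Eq. (2): node pair + edge prior
        self.gate_a = nn.Linear(out_dim, out_dim)         # Eq. (3): gated activation unit
        self.gate_b = nn.Linear(out_dim, out_dim)

    def forward(self, h, pos):
        # h: (N, in_dim) node features, pos: (N, 2) positions, at one time step.
        N = h.size(0)
        hw = self.W(h)                                    # (N, out_dim)
        rel = pos.unsqueeze(0) - pos.unsqueeze(1)         # (N, N, 2); rel[i, j] = p_j - p_i
        e = self.edge_embed(rel)                          # asymmetric edge prior, (N, N, E)
        pair = torch.cat([hw.unsqueeze(1).expand(N, N, -1),   # target i features
                          hw.unsqueeze(0).expand(N, N, -1),   # neighbor j features
                          e], dim=-1)
        logits = F.leaky_relu(self.attn(pair).squeeze(-1), 0.2)
        alpha = torch.softmax(logits, dim=-1)             # learned adaptive adjacency matrix
        agg = alpha @ hw                                  # aggregate neighbor features
        return torch.tanh(self.gate_a(agg)) * torch.sigmoid(self.gate_b(agg))
```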

3.3 TCN-based Spatial and Temporal Interaction Representation

The movement pattern of a pedestrian is greatly influenced by its historical trajectory and the moving patterns of neighboring pedestrians. We therefore propose to capture the spatial and temporal interactions between pedestrians using a modified temporal convolutional network (TCN), illustrated in Fig. 4.

(a)
(b)
Figure 4: (a) An illustration of TCN with a stack of 3 convolution layers of kernel size 3. The input contains the spatial information captured by the preceding modules. The output of TCN is collected by concatenating the per-step features across time. (b) The gating function in each of the TCN layers that controls the bypass signals.

The network shown in Fig. 4(a) can be regarded as a short-term and long-term encoder, where lower convolution layers focus on local short-term interactions, while higher layers capture long-term interactions with a larger receptive field. For example, if the kernel size of the TCN is $k$, the receptive field size in the $l$-th layer is $(k-1)\,l + 1$, which increases linearly with ascending layers. Therefore, the top layer of TCN captures interactions within a longer time span. Since the order of the input is important in the sequence prediction task, we adopt left padding of size $k-1$ instead of symmetric padding for the convolution, so that each convolution output convolves over the input of the corresponding time step and the preceding time steps. Thereby, the output size of each convolution remains the same as the input. In each layer of TCN (Fig. 4(b)), the gated activation unit utilizes two non-linear functions to dynamically regulate the information flow, formed as:

$$ z^{(l+1)} = \tanh\big( W_f^{(l)} * z^{(l)} \big) \odot \sigma\big( W_g^{(l)} * z^{(l)} \big) \qquad (5) $$

where $z^{(l)}$ is the input to the $l$-th layer, $*$ denotes the convolution operation, $\sigma(\cdot)$ denotes the sigmoid function, and $W_f^{(l)}$ and $W_g^{(l)}$ are the learnable 1D-convolution parameters, respectively. The final output of the TCN module is then obtained by concatenating the outputs across the time dimension, denoted as $c_i$ for pedestrian $i$. In this way, the embedding vector $c_i$ captures the spatial-temporal interaction between the $i$-th pedestrian and its neighbors. We note that TCN can handle much longer input sequences with dilated convolution [37], which is more efficient than RNN-based methods.
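A sketch of one gated causal convolution layer of the TCN, showing both the left padding and the gated activation of Eq. (5); the kernel size follows Fig. 4 (3 layers, kernel size 3), and the rest is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCausalConv(nn.Module):
    """One TCN layer: left padding keeps outputs causal and length-preserving,
    and the two branches realize the gated activation of Eq. (5)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                                 # left padding only
        self.conv_f = nn.Conv1d(channels, channels, kernel_size)   # tanh (filter) branch
        self.conv_g = nn.Conv1d(channels, channels, kernel_size)   # sigmoid (gate) branch

    def forward(self, x):                    # x: (N, channels, T)
        x = F.pad(x, (self.pad, 0))          # pad the past side only, never the future
        return torch.tanh(self.conv_f(x)) * torch.sigmoid(self.conv_g(x))

# With kernel size k = 3, stacking l layers yields a receptive field of (k - 1) * l + 1,
# so the 3-layer stack of Fig. 4(a) sees 7 time steps of the 8-step observation.
```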

3.4 Future Trajectory Prediction

In real-world applications, given the historical trajectory, there are multiple plausible future movements. We therefore model such uncertainty of the final movement in our decoder module for trajectory prediction.

Following STGAT [16], the decoder module produces multiple socially acceptable trajectories by introducing random noise as part of the input, in addition to the spatial-temporal embedding $c_i$. The predicted relative location is the position relative to the origin for each pedestrian. We then convert relative positions to absolute positions and adopt the variety loss as the training loss, which computes the minimum ADE loss among the plausible trajectories:

$$ \mathrm{ADE} = \frac{ \sum_{i=1}^{N} \sum_{t=T_{obs}+1}^{T_{pred}} \big\Vert \hat{p}_i^t - p_i^t \big\Vert_2 }{ N \, (T_{pred} - T_{obs}) } \qquad (6) $$
$$ L_{variety} = \min_{k = 1, \ldots, K} \ \mathrm{ADE}\big( \hat{Y}^{(k)}, Y \big) \qquad (7) $$

where $Y$ is the ground truth and $\hat{Y}^{(k)}$ are the $K$ plausible trajectories predicted. Although this loss function may lead to a diluted probability density function [35], we empirically find that it facilitates better predictions of multiple future trajectories.
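A compact sketch of the variety loss of Eqs. (6)-(7): only the best of the $K$ sampled trajectories contributes to the loss for each pedestrian. Tensor shapes are our assumptions.

```python
import torch

def variety_loss(y_true, y_pred):
    """Minimum-ADE variety loss, Eqs. (6)-(7).
    y_true: (N, T_pred, 2); y_pred: (K, N, T_pred, 2) sampled trajectories."""
    # Per-step Euclidean error, averaged over time: ADE per sample and pedestrian.
    ade = (y_pred - y_true.unsqueeze(0)).norm(dim=-1).mean(dim=-1)  # (K, N)
    # Only the best of the K samples contributes to the loss for each pedestrian.
    return ade.min(dim=0).values.mean()
```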

We further investigate the generative trajectory distribution strategy widely adopted in multimodal prediction. We concatenate the encoded future trajectories and the spatial-temporal embedding, and encode the features with Conv1D layers to produce the parameters, i.e., $\mu$ and $\sigma$, of the latent distribution in a CVAE. Note that the latent variable $z$ is sampled from the unconditioned distribution during testing. We then further introduce the KL divergence term to our GraphTCN:

$$ L_{KL} = D_{KL}\big( q(z \mid X, Y) \,\Vert\, \mathcal{N}(0, I) \big) \qquad (8) $$

which is named GraphTCN-G in the following experiments.
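A sketch of the CVAE pieces described above: the reparameterized sampling used during training, and the KL term of Eq. (8) against a standard normal prior, which matches sampling $z$ unconditionally at test time. Function names and the log-variance parameterization are our assumptions.

```python
import torch

def reparameterize(mu, logvar):
    """Training-time sampling z = mu + sigma * eps; at test time z ~ N(0, I),
    matching the unconditioned sampling described above."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl_term(mu, logvar):
    """Closed-form KL divergence of N(mu, sigma^2) from N(0, I), i.e., Eq. (8)."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
```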

4 Experiments

In this section, we evaluate our GraphTCN on two benchmark trajectory prediction datasets in world coordinates, i.e., ETH [28] and UCY [21], and compare the performance of GraphTCN with state-of-the-art approaches.

4.1 Datasets and Evaluation Metrics

The ETH and UCY datasets comprise five unique outdoor environments recorded from a fixed top-down view. The ETH dataset includes the ETH and Hotel scenes, and the UCY dataset consists of UNIV, ZARA1, and ZARA2. In these datasets, pedestrians exhibit complex behaviors, including nonlinear trajectories, moving from different directions, walking together, walking unpredictably, avoiding collisions, and standing. The crowd density of a single scene differs across environments, varying from 0 to 51 pedestrians per frame. All videos are recorded at 25 frames per second (FPS), and pedestrian trajectories are extracted at 2.5 FPS.

We use two metrics to evaluate model performance: Average Displacement Error (ADE), defined in Equation 6, which is the average Euclidean distance between the predicted trajectory and the ground truth over all prediction time steps, and Final Displacement Error (FDE), which is the Euclidean distance between the predicted position and the ground truth position at the final time step $T_{pred}$.
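Both metrics can be computed directly from the predicted and ground truth trajectories; a short sketch, with shapes assumed as (N, T_pred, 2):

```python
import torch

def ade_fde(y_true, y_pred):
    """ADE: mean Euclidean error over all prediction steps (Eq. (6));
    FDE: Euclidean error at the final step T_pred. Units: meters."""
    dist = (y_pred - y_true).norm(dim=-1)        # (N, T_pred) per-step distances
    return dist.mean().item(), dist[:, -1].mean().item()
```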

The model is trained with the leave-one-out policy [1, 11, 39, 16]. We produce 4 samples for the next 4.8 seconds (12 time steps) based on 3.2 seconds (8 time steps) of observations.

4.2 Implementation Details

We train with the Adam optimizer at a learning rate of 0.0001 for 50 epochs. The feature embedding size is set to 64. The EFGAT module comprises two graph attention layers with M = 2 and M = 1 attention heads and output dimensions of 16 and 32 in the first and second layers, respectively. The edge feature embedding has a dimension of eight, and the noise has a dimension of four. For the VAE module, the ground truth is encoded by three Conv1D layers with ReLU non-linearity, each with 32 output dimensions and kernel sizes of 5, 5, and 4, respectively. The dimension of the latent variable is set to 64. K is set to 4 and 20 for predicting 4 and 20 samples, respectively. All LeakyReLU activations in our model have a negative slope of 0.2. The weight of the variety loss is set to 1, and the weight of the KL divergence term is set to 0.5 for the first 15 epochs and 0.2 for the remaining epochs.

4.3 Baselines

We compare our framework with the following baselines and state-of-the-art approaches: LSTM adopts a vanilla LSTM encoder-decoder model to predict the sequence of each pedestrian individually. Social LSTM [1] builds on top of LSTM and introduces a social pooling layer to capture the spatial interaction between pedestrians. CNN [26] adopts CNNs to predict the sequence. SR-LSTM [47] obtains the spatial influence by iteratively refining the LSTM hidden states through gate and attention mechanisms. Social GAN [11] improves over Social LSTM with a generative adversarial network to generate multiple plausible trajectories. Trajectron [18] utilizes LSTMs to capture the spatial and temporal relations and incorporates a CVAE to generate the distributions of future paths. Social-STGCNN [25] is one of the SOTA approaches, utilizing CNNs to extract spatio-temporal features. STGAT [16] is another SOTA approach, adopting vanilla GAT to model the spatial interactions and LSTMs to capture the temporal interaction.

4.4 Quantitative Results

The results in Table 1 show that GraphTCN achieves consistently better performance than existing models on these benchmark datasets. Our model generates four trajectories at once to account for unexpected movements. GraphTCN achieves better prediction performance than the other baselines with fewer predictions, with an average ADE of 0.36 and FDE of 0.72. These results confirm that GraphTCN yields competitive accuracy even with fewer generated samples compared with previous approaches, especially on the more complex datasets UNIV, ZARA1, and ZARA2.

Method ETH HOTEL UNIV* ZARA1 ZARA2 AVG
LSTM [1] 1.09 / 2.41 0.86 / 1.91 0.61 / 1.31 0.41 / 0.88 0.52 / 1.11 0.70 / 1.52
Social-LSTM [1] 1.09 / 2.35 0.79 / 1.76 0.67 / 1.40 0.47 / 1.00 0.56 / 1.17 0.72 / 1.54
CNN [26] 1.04 / 2.07 0.59 / 1.27 0.57 / 1.21 0.43 / 0.90 0.34 / 0.75 0.59 / 1.22
SR-LSTM [47] 0.63 / 1.25 0.37 / 0.74 0.51 / 1.10 0.41 / 0.90 0.32 / 0.70 0.45 / 0.94
Social-GAN [11] 0.81 / 1.52 0.72 / 1.61 0.60 / 1.26 0.34 / 0.69 0.42 / 0.84 0.58 / 1.18
Trajectron [18] 0.59 / 1.14 0.35 / 0.66 0.54 / 1.13 0.43 / 0.83 0.43 / 0.85 0.56 / 1.14
Social-STGCNN [25] 0.64 / 1.11 0.49 / 0.85 0.44 / 0.79 0.34 / 0.53 0.30 / 0.48 0.44 / 0.75
STGAT [16] 0.65 / 1.12 0.35 / 0.66 0.52 / 1.10 0.34 / 0.69 0.29 / 0.60 0.43 / 0.83
GraphTCN (M = 4) 0.59 / 1.12 0.27 / 0.52 0.42 / 0.87 0.30 / 0.62 0.23 / 0.48 0.36 / 0.72
GraphTCN-G (M = 4) 0.60 / 1.21 0.27 / 0.52 0.41 / 0.84 0.28 / 0.58 0.22 / 0.47 0.36 / 0.72
Table 1: Quantitative results of our GraphTCN compared with baseline approaches. Evaluation metrics are reported as ADE / FDE in meters (lower is better). The mark * denotes a deterministic model; the rest of the baseline approaches are stochastic models with M = 20 prediction samples.

Ablation Study. We evaluate each module of GraphTCN through a systematic ablation experiment in Table 2. w/o EGNN refers to the model without the spatial module. vanilla GAT refers to the model with GAT as the spatial module, which ignores the relative relationship between pedestrians. GraphTCN-G refers to the model that incorporates a VAE for multi-modal future path inference. The results demonstrate that introducing graph neural networks (GNNs) into the framework reduces ADE and FDE, and adding the edge prior to the GNN leads to further improvement. However, these spatial interactions only improve performance mildly. After further inspecting the dataset, we attribute this finding to the fact that pedestrians seldom change their path suddenly to avoid their neighbors. As a consequence, the temporal features already contain part of the spatial interactions for trajectory prediction. Compared with relatively weak temporal models, e.g., RNN-based approaches, our model attends to the whole observed sequence without losing important temporal information. Therefore, spatial information is less important in the trajectory modeling.

Method ADE / FDE (M = 4) ADE / FDE (M = 20)
w/o EGNN 0.38 / 0.78 0.28 / 0.54
vanilla GAT 0.37 / 0.74 0.27 / 0.54
GraphTCN 0.36 / 0.72 0.26 / 0.51
GraphTCN-G 0.36 / 0.72 0.25 / 0.48

Table 2: Ablation studies of GraphTCN.

Speed Comparison. We compare the inference speed of GraphTCN with state-of-the-art methods, including Social GAN [11], SR-LSTM [47], Social-STGCNN [25], and STGAT [16]. The results¹ in Table 3 report the model inference time in wall-clock seconds and the speedup factor relative to Social GAN on the same dataset. As can be observed, GraphTCN achieves much faster inference than these baseline approaches. In particular, GraphTCN takes 0.00067 seconds of inference time to generate 4 samples, which is 42.82 times and 5.22 times faster than Social GAN and the most similar prior approach STGAT, respectively.

Inference Time Speed-up
Social-GAN [11] 0.02869 1
Social-STGCNN [25] 0.00861 3.33
STGAT [16] 0.00350 8.20
Trajectron [18] 0.00081 35.42
GraphTCN (M=4) 0.00066 43.47
GraphTCN-G (M=4) 0.00067 42.82
GraphTCN-G (M=20) 0.00075 38.25
Table 3: The inference time and speedup of GraphTCN compared with baseline methods. The inference time is averaged over all inference steps per pedestrian for M = 4 or M = 20 samples. The results are reported on an Intel Core i9-9880H processor.

STGAT

OURS

(a) (b) (c)
Figure 5: Comparison of our GraphTCN (M=4) and STGAT predictions against ground truth trajectories. To better illustrate the results, only part of the pedestrian trajectories is presented. The solid red line, solid blue line, and dashed yellow line represent the observed trajectory, the ground truth future trajectory, and the predicted trajectory, respectively.

4.5 Qualitative Analysis

We investigate the prediction results of GraphTCN by visualizing and comparing the predicted trajectories with those of the best-performing approach STGAT in Fig. 5. We choose three unique scenarios in which complex interactions take place, including pedestrians standing, merging, following, and avoiding one another.

In Fig. 5, we observe that GraphTCN achieves better performance on: Directions and speeds - from Fig. 5(a)(b), we find that trajectories generated by GraphTCN follow the same direction as the ground truth, while predictions from STGAT deviate noticeably from the path. In Fig. 5(a), one pedestrian moves in an unexpected direction, and GraphTCN generates an acceptable prediction for it. Besides, GraphTCN generates plausible short trajectories for stationary or slowly moving pedestrians. Collision-free future paths - Fig. 5(b)(c) show that STGAT may fail to make satisfactory predictions when pedestrians come from different groups, while GraphTCN gives better predictions in scenarios where one pedestrian meets another group. In Fig. 5(b), GraphTCN successfully produces predictions that avoid future collisions when pedestrians move in the same direction from an angle. GraphTCN produces socially acceptable predictions even in the more complex scenario in Fig. 5(c), where pedestrians depart in opposite directions or walk in the same direction.

(a)

(b)

(c)
Figure 6: Illustration of EFGAT attention weights. The solid brown line is the trajectory and the arrow indicates the trajectory direction. The circle color shows the attention at each time step, and the circle size corresponds to the attention weight. The trajectory without circles is the target pedestrian trajectory.

Social attention - In Fig. 6, we illustrate the attention weights learned by the EFGAT module. The results show that our model can capture the relative importance of the target's neighbors: the attention weights between two pedestrians differ. For instance, the attention weight from pedestrian A to pedestrian B differs from that from pedestrian B to pedestrian A, which reflects social conventions. Smaller attention weights: in Fig. 6(a), the stationary pedestrian has less importance to its moving neighbors, and the model assigns small importance to pedestrians far away from the target. More significant influence: our model assigns a higher attention weight to the pedestrian moving toward the target in Fig. 6(a), moving ahead of or behind the target with a higher velocity in Fig. 6(b), and moving from the opposite direction before meeting the target in Fig. 6(c). These cases demonstrate that reasonable attention weights are successfully assigned to the target pedestrian's neighbors according to the movement patterns of all pedestrians in the scene.

Diverse predictions - Fig. 7 visualizes the diverse predictions. The results show that GraphTCN generates predictions closer to the ground truth even with a smaller number of samples and makes good predictions for pedestrians with relatively unexpected behaviors. In this scenario, one pedestrian shows the intention to change direction during the observation, and GraphTCN generates both normal and unexpected predictions for it. For the other pedestrians, whose observations are more consistent, the model produces future paths with normal behaviors. Further, from Fig. 7(b) and (c), the prediction area of GraphTCN is much smaller and more precise than that of STGAT with 20 samples.

(a) GraphTCN

(b) GraphTCN-G

(c) STGAT
Figure 7: Visualizations of diverse predictions. The scenario is the same as in Fig. 5(a). (a) and (b) show four diverse samples generated by GraphTCN and GraphTCN-G, respectively, while (c) shows the 20 samples produced by STGAT.

5 Conclusion

In this paper, we proposed GraphTCN for trajectory prediction, which captures the interactions between pedestrians effectively by adopting EFGAT to model their spatial interactions and TCN to model both the spatial and temporal interactions. GraphTCN is based entirely on feed-forward networks, which makes it more tractable during training and, more importantly, achieves better prediction accuracy and higher inference speed than existing solutions. Experimental results confirm that GraphTCN outperforms state-of-the-art approaches on all the adopted benchmark datasets.

Footnotes

  1. For a fair comparison, the reported time includes the data processing time, since some approaches require extra time to construct the graph during inference. Note that we use the corresponding official implementations and settings for each model, and a batch size of one for the inference speed evaluation.

References

  1. A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei and S. Savarese (2016) Social LSTM: human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971.
  2. G. Antonini, M. Bierlaire and M. Weber (2006) Discrete choice models of pedestrian walking behavior. Transportation Research Part B: Methodological 40 (8), pp. 667–687.
  3. K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734.
  4. J. Chorowski, D. Bahdanau, K. Cho and Y. Bengio (2014) End-to-end continuous speech recognition using attention-based recurrent NN: first results. arXiv preprint arXiv:1412.1602.
  5. Y. N. Dauphin, A. Fan, M. Auli and D. Grangier (2017) Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 933–941.
  6. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634.
  7. T. Fernando, S. Denman, S. Sridharan and C. Fookes (2018) Soft+hardwired attention: an LSTM framework for human trajectory prediction and abnormal event detection. Neural Networks 108, pp. 466–478.
  8. J. Gehring, M. Auli, D. Grangier, D. Yarats and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 1243–1252.
  9. L. Gong and Q. Cheng (2019) Exploiting edge features for graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9211–9219.
  10. A. Graves and N. Jaitly (2014) Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pp. 1764–1772.
  11. A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese and A. Alahi (2018) Social GAN: socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2255–2264.
  12. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  13. D. Helbing and P. Molnar (1995) Social force model for pedestrian dynamics. Physical Review E 51 (5), pp. 4282.
  14. S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  15. L. Hou, J. Liu, X. Pan and B. Wang (2014) A social force evacuation model with the leadership effect. Physica A: Statistical Mechanics and its Applications 400, pp. 93–99.
  16. Y. Huang, H. Bi, Z. Li, T. Mao and Z. Wang (2019) STGAT: modeling spatial-temporal interactions for human trajectory prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6272–6281.
  17. M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat and G. Mori (2016) A hierarchical deep temporal model for group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971–1980.
  18. B. Ivanovic and M. Pavone (2019) The Trajectron: probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2375–2384.
  19. T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
  20. V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, H. Rezatofighi and S. Savarese (2019) Social-BiGAT: multimodal trajectory forecasting using Bicycle-GAN and graph attention networks. In Advances in Neural Information Processing Systems, pp. 137–146.
  21. A. Lerner, Y. Chrysanthou and D. Lischinski (2007) Crowds by example. In Computer Graphics Forum, pp. 655–664.
  22. J. Liang, L. Jiang, J. C. Niebles, A. G. Hauptmann and L. Fei-Fei (2019) Peeking into the future: predicting future person activities and locations in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5725–5734.
  23. K. Mangalam, H. Girase, S. Agarwal, K. Lee, E. Adeli, J. Malik and A. Gaidon (2020) It is not the journey but the destination: endpoint conditioned trajectory prediction. arXiv preprint arXiv:2004.02025.
  24. R. Mehran, A. Oyama and M. Shah (2009) Abnormal crowd behavior detection using social force model. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 935–942.
  25. A. Mohamed, K. Qian, M. Elhoseiny and C. Claudel (2020) Social-STGCNN: a social spatio-temporal graph convolutional neural network for human trajectory prediction. arXiv preprint arXiv:2002.11927.
  26. N. Nikhil and B. Tran Morris (2018) Convolutional neural network for trajectory prediction. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
  27. R. Pascanu, T. Mikolov and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, pp. 1310–1318.
  28. S. Pellegrini, A. Ess and L. Van Gool (2010) Improving data association by joint modeling of pedestrian trajectories and groupings. In European Conference on Computer Vision, pp. 452–465.
  29. P. Saboia and S. Goldenstein (2012) Crowd simulation: applying mobile grids to the social force model. The Visual Computer 28 (10), pp. 1039–1048.
  30. A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi and S. Savarese (2019) SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1349–1358.
  31. T. Salzmann, B. Ivanovic, P. Chakravarty and M. Pavone (2020) Trajectron++: multi-agent generative trajectory forecasting with heterogeneous data for control. arXiv preprint arXiv:2001.03093.
  32. C. Si, W. Chen, W. Wang, L. Wang and T. Tan (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1227–1236.
  33. K. Sohn, H. Lee and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
  34. I. Sutskever, O. Vinyals and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
  35. L. A. Thiede and P. P. Brahma (2019) Analyzing the variety loss in the context of probabilistic trajectory prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9954–9963.
  36. A. Treuille, S. Cooper and Z. Popović (2006) Continuum crowds. ACM Transactions on Graphics (TOG) 25 (3), pp. 1160–1168.
  37. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior and K. Kavukcuoglu (2016) WaveNet: a generative model for raw audio. In The 9th ISCA Speech Synthesis Workshop, pp. 125.
  38. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations.
  39. A. Vemula, K. Muelling and J. Oh (2018) Social attention: modeling attention in human crowds. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–7.
  40. A. Wang, Z. Wang and W. Yuan (2019) Pedestrian trajectory prediction with graph neural networks. Semantic Scholar.
  41. J. M. Wang, D. J. Fleet and A. Hertzmann (2007) Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 283–298.
  42. Z. Wu, S. Pan, G. Long, J. Jiang and C. Zhang (2019) Graph WaveNet for deep spatial-temporal graph modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1907–1913.
  43. K. Yamaguchi, A. C. Berg, L. E. Ortiz and T. L. Berg (2011) Who are you with and where are you going?. In CVPR 2011, pp. 1345–1352.
  44. S. Yan, Y. Xiong and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
  45. H. Yao, X. Tang, H. Wei, G. Zheng, Y. Yu and Z. Li (2018) Modeling spatial-temporal dynamics for traffic prediction. arXiv preprint arXiv:1803.01254.
  46. H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye and Z. Li (2018) Deep multi-view spatial-temporal network for taxi demand prediction. In Thirty-Second AAAI Conference on Artificial Intelligence.
  47. P. Zhang, W. Ouyang, P. Zhang, J. Xue and N. Zheng (2019) SR-LSTM: state refinement for LSTM towards pedestrian trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12085–12094.