Learning 3D-aware Egocentric Spatial-Temporal Interaction via Graph Convolutional Networks

Abstract

To enable intelligent automated driving systems, a promising strategy is to understand how humans drive and interact with road users in complicated driving situations. In this paper, we propose a 3D-aware egocentric spatial-temporal interaction framework for automated driving applications. Graph convolutional networks (GCN) are devised for interaction modeling. We introduce three novel concepts into GCN. First, we decompose egocentric interactions into ego-thing and ego-stuff interactions, each modeled by its own GCN. In both GCNs, ego nodes are introduced to encode the interaction between the ego vehicle and thing objects (e.g., cars and pedestrians) and between the ego vehicle and stuff objects (e.g., lane markings and traffic lights). Second, objects' 3D locations are explicitly incorporated into the GCN to better model egocentric interactions. Third, to implement ego-stuff interaction in the GCN, we propose a MaskAlign operation to extract features for irregular objects.

We validate the proposed framework on tactical driver behavior recognition. Extensive experiments are conducted on the Honda Research Institute Driving Dataset, the largest dataset with diverse tactical driver behavior annotations. Our framework demonstrates a substantial performance boost over the baselines, by 3.9% and 6.0% on the two experimental settings, respectively. Furthermore, we visualize the learned affinity matrices, which encode ego-thing and ego-stuff interactions, to show that the proposed framework captures interactions effectively.

I Introduction

Automated driving in highly interactive scenarios is challenging as it involves different levels of 3D scene analysis [3, 35], situation understanding [30, 19], intention prediction [18, 5], and decision making and planning [44, 25]. Understanding how humans drive and interact with road users is essential to building an intelligent automated driving system. The first step toward this goal is to develop a computational model that can capture the complicated spatial-temporal interactions between the ego vehicle and road users.

(a) Goal-oriented prediction: Left turn and Cause prediction: crossing vehicle.
(b) Learned affinity matrix
(c) Visualization of the learned affinity matrix using a top-view scene layout
Fig. 1: In a complicated traffic situation at an intersection, the ego vehicle intends to take a left turn while yielding to an oncoming vehicle. Our model learns a graph structure in (b) whose edge connections represent the interactions among road users and the ego vehicle. The top-view scene representation in (c) is derived from (a) and (b) by overlaying the learned relations on a scene layout for better illustration.

Over the past decade, there have been significant advances in modeling spatial-temporal interactions [24, 21, 11, 8, 1, 31, 5]. However, most of the existing work still cannot effectively model complex interactions because it relies on "hand-crafted interaction models" [1]. Data-driven approaches are better options as they can learn subtle and complex interactions [1, 34, 6, 5]. However, existing approaches remain insufficient for three reasons.

First, the input used by several existing methods [1, 34, 6] is the human's 2D location on bird's-eye-view (BEV) images. However, it is more desirable to use ego-perspective sensing devices, e.g., cameras, much as human drivers rely on their own visual perception. This calls for a specific design for egocentric interaction models. Second, using 2D pixel coordinates to model 3D interactions (as in [5]) is insufficient because of perspective projection. BEV images do not suffer from this problem since depth and spatial positions are both embedded in them. Third, the existing approaches only consider human-human or human-robot interactions, ignoring environmental factors such as lane markings, crosswalks, and traffic lights. Modeling these objects is nontrivial because they have irregular shapes.

In this paper, we propose a 3D-aware egocentric spatial-temporal interaction framework for automated driving applications. Our method is the first to model such interactions from egocentric images and addresses the aforementioned problems. Specifically, we design two graph convolutional networks (GCN) [14] to model egocentric interactions. We define two graphs, the Ego-Thing Graph and the Ego-Stuff Graph, to encode how the ego vehicle interacts with thing objects (e.g., cars and pedestrians) and stuff objects (e.g., lane markings and traffic lights). The ego-thing graph improves upon Wu et al. [39] with two new concepts: we add an ego node (i.e., the ego vehicle) for egocentric interaction modeling, and we incorporate the objects' 3D locations (recovered from image-based depth estimation). The ego-stuff graph is designed similarly; however, to extract features from irregular stuff objects, we introduce a new operation called MaskAlign.

We validate the proposed framework on tactical driver behavior recognition using the Honda Research Institute Driving Dataset (HDD) [26], the largest dataset in the field, which provides 104 hours of egocentric videos with frame-level annotations of tactical driver behavior. We evaluate our method in two settings: 1) the ego vehicle interacts with stuff objects (e.g., lane change, lane branch, and merge), and 2) the ego vehicle interacts with thing objects (e.g., stop for crossing pedestrian and deviate for parked car). Our approach offers a substantial performance boost (in terms of mAP; see the Experiments section for definitions) over the baselines on the two settings, by 3.9% and 6.0%, respectively.

II Related Works

II-A Tactical Driver Behavior Recognition

Significant efforts have been made in tactical driver behavior recognition [24, 15, 21, 38, 11, 40, 36, 26, 41]. Hidden Markov models (HMMs) have been leveraged to recognize driver behaviors [24, 15, 21, 38, 11]. A single node in an HMM encodes the states of the ego vehicle, roads, and traffic participants [24] into a state vector. In the proposed framework, we explicitly model these three states using different nodes, each of which encodes its own representation according to its semantic context. Recently, convolutional and recurrent neural network based algorithms [40, 26, 41] have been proposed. They implicitly encode the states of the ego vehicle and road users using 2D convolutions, with state transitions handled by recurrent units. Our method explicitly models the states using graph convolutional networks (GCN) and uses temporal convolution networks for the state transition.

Wang et al. [36] designed an object-level attention layer to capture the impact of objects on driving policies. Instead of simply weighting and concatenating object features, our framework preserves more complicated forms of interactions thanks to GCN. Additionally, interactions between the ego vehicle and road infrastructure are included in our system.

II-B Graph Neural Networks for Driving Scenes

Recently, graph neural networks (GNN) [20, 14] have made significant progress in situation recognition [19], action recognition [42, 37], group activity recognition [39], and scene graph generation [43]. However, considerably less attention has been paid to driving scene applications.

Herzig et al. [10] proposed a Spatio-Temporal Action Graph (STAG) network to detect driving collisions. While STAG is similar to the proposed ego-thing graph, our model explicitly incorporates the 3D locations of objects and the ego vehicle into the design of nodes and edges. The 3D cue is essential for understanding scenes from an egocentric perspective. This design is motivated by [39]; note that 2D locations are used in [39], whereas we use 3D locations extracted via [16]. Moreover, we consider interactions between the ego vehicle and road infrastructure, which enables the proposed framework to be applied to diverse driving scene applications, e.g., learning driving models from images [22]. The details of our graph design can be found in Section III-B.

III Egocentric Spatial-Temporal Interaction Modeling

Fig. 2: An overview of our framework. Given a video segment, our model applies 3D convolutions to extract visual features, followed by two branches: RoIAlign is employed to extract object features from object bounding boxes, and MaskAlign is designed to extract features of irregularly shaped objects from semantic masks. Then, frame-wise Ego-Thing Graphs and Ego-Stuff Graphs are constructed to propagate interactive information among objects via graph convolutional networks. The outputs of the two graphs are fused and fed into a temporal fusion module to form the interactive representation. Finally, the global video representation from the I3D head and the interactive features are aggregated as input to the tactical driver behavior recognizer.

III-A Overall Architecture

An overview of the proposed framework is depicted in Fig. 2. Given video frames, we apply the instance segmentation and semantic segmentation from [9] to obtain thing objects and stuff objects, respectively. Object features are extracted from intermediate I3D [4] features via RoIAlign [9] and MaskAlign (Section III-C). Afterwards, we construct frame-wise Ego-Thing Graphs and Ego-Stuff Graphs and apply graph convolutional networks (GCN) [14] for message passing. The updated ego features from the two graphs are fused and processed by a temporal fusion module. The temporally fused ego features are then concatenated with the I3D head feature, which serves as a global video embedding, to form the egocentric representation. Finally, this egocentric feature is passed through a fully connected layer to obtain the final classification.
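To make the data flow concrete, the following minimal NumPy sketch traces this forward pass with random tensors standing in for I3D features and detections; the shapes, the uniform dummy affinity matrices, and all function names are illustrative assumptions of ours, not the authors' TensorFlow implementation.

```python
import numpy as np

T, K, S, D = 20, 20, 8, 512     # frames, thing nodes, stuff nodes, feature dim (illustrative)

ego_feat    = np.random.rand(T, 1, D)   # per-frame ego node feature (RoIAlign on full frame)
thing_feats = np.random.rand(T, K, D)   # RoIAlign-pooled thing node features
stuff_feats = np.random.rand(T, S, D)   # MaskAlign-pooled stuff node features

def gcn_update(nodes, affinity):
    """Single linear message-passing step; stand-in for the real GCN layers."""
    return affinity @ nodes

ego_per_frame = []
for t in range(T):
    et_nodes = np.concatenate([ego_feat[t], thing_feats[t]])      # (K+1, D) ego-thing graph
    es_nodes = np.concatenate([ego_feat[t], stuff_feats[t]])      # (S+1, D) ego-stuff graph
    G_et = np.full((K + 1, K + 1), 1.0 / (K + 1))                 # dummy affinity matrices
    G_es = np.full((S + 1, S + 1), 1.0 / (S + 1))
    ego_et = gcn_update(et_nodes, G_et)[0]                        # updated ego node (ego-thing)
    ego_es = gcn_update(es_nodes, G_es)[0]                        # updated ego node (ego-stuff)
    ego_per_frame.append(ego_et + ego_es)                         # fuse the two graphs

gcn_ego  = np.max(np.stack(ego_per_frame), axis=0)                # temporal fusion (max pooling)
i3d_head = np.random.rand(D)                                      # global video embedding stand-in
egocentric_repr = np.concatenate([i3d_head, gcn_ego])             # fed to the final classifier
```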

III-B Ego-Thing Graph

The ego-thing graph is designed to model interactions among the ego vehicle and movable traffic participants, such as cars, pedestrians, and cyclists.

Node feature extraction. In our design, the thing objects are car, person, bicycle, motorcycle, bus, train, and truck. Given the bounding boxes generated by Mask R-CNN [9], we keep the top K detections on each frame across all of the classes above, and set K to 20. RoIAlign [9] followed by a max pooling layer is then applied to obtain a fixed-dimensional appearance feature for each Thing Node in the ego-thing graph. The Ego Node feature is obtained by the same procedure from a frame-sized bounding box.
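The sketch below illustrates this step under simplifying assumptions: instead of RoIAlign's bilinear sampling, it crops the integer-rounded box from the feature map and max-pools it; the function name, shapes, and score-based top-K selection are our own illustrative choices.

```python
import numpy as np

def thing_node_features(feature_map, boxes, scores, k=20):
    """Approximate RoIAlign + max pooling for thing nodes.

    feature_map: (H, W, D) intermediate I3D feature map of one frame.
    boxes:       (N, 4) detections as (x1, y1, x2, y2) in feature-map coordinates.
    scores:      (N,) detection confidences used to keep the top-K boxes.
    """
    keep = np.argsort(scores)[::-1][:k]                  # top-K detections by confidence
    feats = []
    for x1, y1, x2, y2 in boxes[keep].astype(int):
        crop = feature_map[y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
        feats.append(crop.reshape(-1, feature_map.shape[-1]).max(axis=0))
    return np.stack(feats)                               # (min(N, K), D) thing node features

# The ego node uses the same procedure with a frame-sized bounding box.
fmap = np.random.rand(14, 28, 512)                       # illustrative feature-map shape
ego_node = thing_node_features(fmap, np.array([[0, 0, 28, 14]]), np.array([1.0]), k=1)
```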

Graph definition. We denote the sequence of frame-wise ego-thing graphs as $\{\mathbf{G}^{ET}_t \mid t = 1, \dots, T\}$, where $T$ is the number of frames and $\mathbf{G}^{ET}_t \in \mathbb{R}^{(K+1)\times(K+1)}$ is the ego-thing affinity matrix at frame $t$, representing the pair-wise interactions among the thing objects and the ego vehicle. Specifically, $\mathbf{G}^{ET}_t(i, j)$ denotes the influence of object $j$ on object $i$. The nodes of the graph correspond to a set of objects $\mathbf{X}_t = \{\mathbf{x}_{i,t}\}_{i=1}^{K+1}$, where each $\mathbf{x}_{i,t}$ consists of the $i$-th object's appearance feature $\mathbf{x}^a_{i,t}$ and its 3D location $\mathbf{x}^{3D}_{i,t}$ in the world frame. The index $i = 1$ corresponds to the ego vehicle, and $i = 2, \dots, K+1$ correspond to thing objects.

Interaction modeling. Ego-thing interactions are defined as second-order interactions: not only the original state of a thing object but also its changing state, caused by other objects, influences the ego state. To model these interactions, we consider both appearance features and a distance constraint, inspired by [39]. We compute the edge value as:

$$\mathbf{G}^{ET}_t(i, j) = \frac{f_s(\mathbf{x}_{i,t}, \mathbf{x}_{j,t})\, \exp\big(f_a(\mathbf{x}_{i,t}, \mathbf{x}_{j,t})\big)}{\sum_{j=1}^{K+1} f_s(\mathbf{x}_{i,t}, \mathbf{x}_{j,t})\, \exp\big(f_a(\mathbf{x}_{i,t}, \mathbf{x}_{j,t})\big)}, \tag{1}$$

where $f_a(\mathbf{x}_{i,t}, \mathbf{x}_{j,t})$ indicates the appearance relation between two objects, and a distance constraint is imposed via the spatial relation $f_s(\mathbf{x}_{i,t}, \mathbf{x}_{j,t})$. A softmax-style normalization is used to normalize the influence on object $i$ from the other objects.

The appearance relation is calculated as:

$$f_a(\mathbf{x}_{i,t}, \mathbf{x}_{j,t}) = \frac{\phi(\mathbf{x}^a_{i,t})^{\top}\, \phi'(\mathbf{x}^a_{j,t})}{\sqrt{D}}, \tag{2}$$

where $\phi(\mathbf{x}^a_{i,t}) = \mathbf{W}\mathbf{x}^a_{i,t}$ and $\phi'(\mathbf{x}^a_{j,t}) = \mathbf{W}'\mathbf{x}^a_{j,t}$. Both $\mathbf{W}$ and $\mathbf{W}'$ are learnable parameters that map the appearance features to a subspace and enable learning the correlation between two objects. $\sqrt{D}$ is a normalization factor, where $D$ is the dimension of the projected features.
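A minimal NumPy sketch of Eqs. (1)-(2) under the notation reconstructed above; the projection matrices are random stand-ins for the learnable parameters, and the binary spatial relation matrix is assumed to come from Eq. (4) below.

```python
import numpy as np

def ego_thing_affinity(app_feats, spatial_rel, W, W_prime):
    """Eqs. (1)-(2): spatially gated, softmax-normalized appearance relations.

    app_feats:   (K+1, D) appearance features; row 0 is the ego node.
    spatial_rel: (K+1, K+1) binary spatial relation f_s from Eq. (4).
    W, W_prime:  (D, D') projection matrices (learnable in the real model).
    """
    phi, phi_prime = app_feats @ W, app_feats @ W_prime
    f_a = (phi @ phi_prime.T) / np.sqrt(phi.shape[-1])        # Eq. (2), scaled dot product
    scores = spatial_rel * np.exp(f_a)                        # distant pairs contribute zero
    return scores / np.clip(scores.sum(axis=1, keepdims=True), 1e-8, None)   # Eq. (1)

K, D, Dp = 20, 512, 128
G_et = ego_thing_affinity(np.random.rand(K + 1, D), np.ones((K + 1, K + 1)),
                          0.01 * np.random.randn(D, Dp), 0.01 * np.random.randn(D, Dp))
```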

The necessity of defining a spatial relation arises from the fact that interactions between two distant objects are usually negligible. To calculate this relation, we first unproject objects from the 2D image plane to 3D space in the world frame [16]:

$$\mathbf{x}^{3D}_{i,t} = d_{i,t}\, \mathbf{K}^{-1}\, \mathbf{x}^{2D}_{i,t}, \tag{3}$$

where $\mathbf{x}^{2D}_{i,t}$ and $\mathbf{x}^{3D}_{i,t}$ are homogeneous representations in the 2D and 3D coordinate systems, $\mathbf{K}$ is the camera intrinsic matrix, and $d_{i,t}$ is the relative depth at $\mathbf{x}^{2D}_{i,t}$ obtained by depth estimation [17]. In the 2D plane, we use the centers of the bounding boxes to locate thing objects. The location of the ego vehicle is fixed at the middle-bottom pixel of the frame. The spatial relation function is then formulated as:

$$f_s(\mathbf{x}_{i,t}, \mathbf{x}_{j,t}) = \mathbb{1}\big( d(\mathbf{x}^{3D}_{i,t}, \mathbf{x}^{3D}_{j,t}) \le \mu \big), \tag{4}$$

where $\mathbb{1}(\cdot)$ is the indicator function, $d(\mathbf{x}^{3D}_{i,t}, \mathbf{x}^{3D}_{j,t})$ computes the Euclidean distance between object $i$ and object $j$ in 3D space, and $\mu$ is the distance threshold, which sets the spatial relation to zero if the distance exceeds this upper bound. In our implementation, $\mu$ is set to 3.0.
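The following NumPy sketch implements Eqs. (3)-(4) as reconstructed above; the intrinsic matrix, pixel centers, and relative depths are placeholder values, not calibration data from the paper.

```python
import numpy as np

def unproject(pixel, rel_depth, K_intr):
    """Eq. (3): lift a 2D pixel (u, v) to 3D using relative depth and camera intrinsics."""
    uv1 = np.array([pixel[0], pixel[1], 1.0])                 # homogeneous 2D point
    return rel_depth * (np.linalg.inv(K_intr) @ uv1)          # 3D point (up to relative scale)

def spatial_relation(points_3d, mu=3.0):
    """Eq. (4): indicator that two objects lie within distance mu of each other in 3D."""
    dist = np.linalg.norm(points_3d[:, None, :] - points_3d[None, :, :], axis=-1)
    return (dist <= mu).astype(float)

K_intr = np.array([[1000.0, 0.0, 640.0],          # placeholder intrinsics for a 1280x720 image
                   [0.0, 1000.0, 360.0],
                   [0.0, 0.0, 1.0]])
centers = [(640, 719), (500, 400), (900, 380)]    # ego fixed at the middle-bottom pixel; two cars
depths  = [1.0, 0.6, 2.5]                         # relative depths from monocular estimation
points = np.stack([unproject(c, d, K_intr) for c, d in zip(centers, depths)])
f_s = spatial_relation(points, mu=3.0)            # feeds the gating term in Eq. (1)
```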

III-C Ego-Stuff Graph

The ego-stuff graph is constructed in a similar manner to the ego-thing graph in Eq. (1), except for the following aspects:

Node feature extraction. We include the following classes as stuff objects: Crosswalk, Lane Markings, Lane Separator, Road, Service Lane, Traffic Island, Traffic Light, and Traffic Sign. The criterion we use to distinguish stuff objects from thing objects is whether an object's change of state can be caused by other objects: for example, cars stop and yield to pedestrians, but a traffic light turns from red to green by itself. Another distinction is that the contours of most stuff objects cannot be well captured by rectangular bounding boxes. It is therefore difficult either to detect them with algorithms such as Faster R-CNN [28] and YOLO [27], or to extract their features with RoIAlign [9] without enclosing irrelevant information. To address this, we propose a feature extraction approach named MaskAlign, which extracts features from the binary mask $\mathbf{M}_{i,t}$ of the $i$-th stuff object at time $t$. $\mathbf{M}_{i,t}$ is downsampled to $\mathbf{M}'_{i,t}$ with the same spatial dimensions as the intermediate I3D feature map $\mathbf{I}_t$. We compute the stuff object feature $\mathbf{x}^a_{i,t}$ by MaskAlign as follows:

$$\mathbf{x}^a_{i,t} = \frac{\sum_{p} \mathbf{I}_t(p)\, \mathbf{M}'_{i,t}(p)}{\sum_{p} \mathbf{M}'_{i,t}(p)}, \tag{5}$$

where $\mathbf{I}_t(p)$ is the $D$-dimensional feature at pixel $p$ at time $t$, and $\mathbf{M}'_{i,t}(p)$ is a binary scalar indicating whether object $i$ exists at pixel $p$.
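A minimal sketch of MaskAlign, assuming Eq. (5) amounts to averaging the feature map over the pixels covered by the downsampled binary mask; shapes are illustrative.

```python
import numpy as np

def mask_align(feature_map, mask):
    """Eq. (5): pool the D-dimensional features over the pixels where the mask is 1.

    feature_map: (H, W, D) intermediate I3D feature map at one time step.
    mask:        (H, W) binary mask of one stuff object, already downsampled to (H, W).
    """
    weights = mask.astype(float)
    denom = max(weights.sum(), 1.0)                                      # guard against empty masks
    return (feature_map * weights[..., None]).sum(axis=(0, 1)) / denom   # (D,) stuff node feature

fmap = np.random.rand(14, 28, 512)                 # illustrative feature-map shape
mask = np.zeros((14, 28)); mask[10:, 12:16] = 1    # e.g. a lane-marking region
stuff_feat = mask_align(fmap, mask)
```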

Interaction Modeling. In the ego-stuff graph, we ignore interactions among stuff objects since their states are not affected by other objects. Hence, we set $\mathbf{G}^{ES}_t(i, j)$ to zero for every pair of stuff objects and only consider the influence that stuff objects exert on the ego vehicle. We refer to this as first-order interaction. To better model the spatial relations, instead of unprojecting bounding box centers, we map every pixel inside the downsampled binary mask to 3D space and calculate the Euclidean distance between every pixel and the ego vehicle; the distance between a stuff object and the ego vehicle is then taken as the minimum of these pixel-wise distances. The distance threshold $\mu$ in the ego-stuff graph is set to 0.8.
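The sketch below captures our reading of this first-order design: only edges into the ego node are kept, and the spatial gate for each stuff object uses the minimum 3D distance between its mask pixels and the ego vehicle. Function names and shapes are illustrative assumptions.

```python
import numpy as np

def ego_stuff_affinity(app_feats, stuff_points_3d, ego_point_3d, W, W_prime, mu=0.8):
    """First-order ego-stuff affinity: stuff-to-stuff entries are kept at zero.

    app_feats:       (S+1, D) features; row 0 is the ego node, rows 1..S are stuff nodes.
    stuff_points_3d: list of (P_i, 3) arrays of unprojected mask pixels per stuff object.
    ego_point_3d:    (3,) unprojected ego location.
    """
    n = app_feats.shape[0]
    f_s = np.zeros((n, n))
    f_s[0, 0] = 1.0                                            # ego self-loop
    for i, pts in enumerate(stuff_points_3d, start=1):
        d_min = np.linalg.norm(pts - ego_point_3d, axis=1).min()
        f_s[0, i] = float(d_min <= mu)                         # min pixel-to-ego distance gate
    phi, phi_prime = app_feats @ W, app_feats @ W_prime
    scores = f_s * np.exp((phi @ phi_prime.T) / np.sqrt(phi.shape[-1]))
    return scores / np.clip(scores.sum(axis=1, keepdims=True), 1e-8, None)

S, D, Dp = 8, 512, 128
G_es = ego_stuff_affinity(np.random.rand(S + 1, D),
                          [np.random.rand(50, 3) for _ in range(S)], np.zeros(3),
                          0.01 * np.random.randn(D, Dp), 0.01 * np.random.randn(D, Dp))
```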

III-D Reasoning on Graphs

To perform reasoning on the graphs, we employ the graph convolutional network (GCN) proposed in [14]. A GCN takes a graph as input, passes information through the learned edges, and outputs refreshed node features. Specifically, one layer of graph convolution can be expressed as:

$$\mathbf{Z}^{(l+1)} = \mathbf{G}\,\mathbf{Z}^{(l)}\,\mathbf{W}^{(l)} + \mathbf{Z}^{(l)}, \tag{6}$$

where $\mathbf{G}$ is the affinity matrix of the graph. Taking the ego-thing graph as an example, $\mathbf{Z}^{(l)}$ is the appearance feature matrix of the nodes in the $l$-th layer, and $\mathbf{W}^{(l)}$ is a learnable weight matrix. We also build a residual connection by adding $\mathbf{Z}^{(l)}$. At the end of each layer, we apply Layer Normalization [2] and ReLU before $\mathbf{Z}^{(l+1)}$ is fed to the next layer. As second-order interactions are considered in the ego-thing graph but not in the ego-stuff graph, we use a one-layer GCN for the ego-stuff graph and a two-layer GCN for the ego-thing graph.
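A NumPy sketch of one graph-convolution layer as in Eq. (6), with the residual connection, a parameter-free layer normalization, and the ReLU described in the text; weight initialization and shapes are illustrative.

```python
import numpy as np

def gcn_layer(Z, G, W, eps=1e-6):
    """Eq. (6): message passing with a residual connection, then LayerNorm and ReLU."""
    H = G @ Z @ W + Z                                  # graph convolution + residual
    H = (H - H.mean(axis=-1, keepdims=True)) / (H.std(axis=-1, keepdims=True) + eps)
    return np.maximum(H, 0.0)                          # ReLU

N, D = 21, 512
Z0 = np.random.rand(N, D)
G  = np.full((N, N), 1.0 / N)                          # stand-in affinity matrix
W  = 0.01 * np.random.randn(D, D)                      # residual requires a square weight here
Z1 = gcn_layer(Z0, G, W)                               # ego-stuff graph: one layer
Z2 = gcn_layer(Z1, G, W)                               # ego-thing graph: two layers
```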

III-E Temporal Modeling

Fig. 3: Architecture of our temporal modeling module.

The GCN interactive features of each frame are processed independently, without considering temporal context. Therefore, we append a temporal fusion module at a late stage of our framework, as illustrated in Fig. 3. Unlike prior works [37, 39, 42], which fuse the features of every node across graphs, we focus only on the ego node. Ego features from the two types of graphs are aggregated by element-wise summation. These time-specific ego features are then fed into a temporal fusion module, which applies element-wise max pooling to obtain a single feature vector, namely the GCN egocentric feature. We also consider two alternative designs for temporal fusion: (a) inspired by the Temporal Relation Network [45], which uses multi-layer perceptrons (MLP) for temporal modeling, we follow a similar approach to capture the temporal ordering of patterns; (b) the max pooling can also be replaced by element-wise average pooling. In Section IV-E, we conduct experiments to compare all three temporal modeling approaches, as sketched below.
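A short sketch of the three temporal fusion variants applied to per-frame ego features (random stand-ins here); the MLP variant is only indicated in a comment since its exact architecture is not specified in the text.

```python
import numpy as np

T, D = 20, 512
ego_thing = np.random.rand(T, D)        # per-frame ego features from the ego-thing graph
ego_stuff = np.random.rand(T, D)        # per-frame ego features from the ego-stuff graph

fused = ego_thing + ego_stuff           # element-wise summation across the two graphs

gcn_egocentric = fused.max(axis=0)      # default: element-wise max pooling over frames
avg_variant    = fused.mean(axis=0)     # alternative (b): element-wise average pooling
# Alternative (a): flatten the T x D sequence and pass it through a small multi-layer
# perceptron, in the spirit of Temporal Relation Networks [45].
```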

IV Experiments

IV-A Dataset

We evaluate the proposed framework on the HDD dataset [26], the largest dataset of its kind, which provides 104 hours of egocentric videos with frame-level annotations of tactical driver behavior. It covers a diverse set of scenarios in which complicated interactions happen between the ego vehicle and road users. The data was collected in the San Francisco Bay Area, including urban, suburban, and highway scenes. We follow the same train/test split as [26].

The videos are labeled with a 4-layer representation of tactical driver behaviors. Among these 4 layers, the Goal-oriented action layer (e.g., left turn and right lane change) and the Cause layer (e.g., stop for crossing vehicle) consist of the actions that involve interactions. We leverage these labels and analyze the effectiveness of the proposed interaction modeling framework in Section IV-C.

| Method | Online/Offline | intersection passing | L turn | R turn | L lane change | R lane change | L lane branch | R lane branch | crosswalk passing | railroad passing | merge | u-turn | Overall mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN [26] | Online | 53.4 | 47.3 | 39.4 | 23.8 | 17.9 | 25.2 | 2.9 | 4.8 | 1.6 | 4.3 | 7.2 | 20.7 |
| CNN-LSTM [26] | Online | 65.7 | 57.7 | 54.4 | 27.8 | 26.1 | 25.7 | 1.7 | 16.0 | 2.5 | 4.8 | 13.6 | 26.9 |
| ED [41] | Online | 63.1 | 54.2 | 55.1 | 28.3 | 35.9 | 27.6 | 8.5 | 7.1 | 0.3 | 4.2 | 14.6 | 27.2 |
| TRN [41] | Online | 63.5 | 57.0 | 57.3 | 28.4 | 37.8 | 31.8 | 10.5 | 11.0 | 0.5 | 3.5 | 25.4 | 29.7 |
| DEPSEG-LSTM [23] | Online | 70.9 | 63.4 | 63.6 | 48.0 | 40.9 | 39.7 | 4.4 | 16.1 | 0.5 | 6.3 | 16.7 | 33.7 |
| C3D [33] | Online | 72.8 | 64.8 | 71.7 | 53.4 | 44.7 | 52.2 | 3.1 | 14.6 | 2.9 | 10.6 | 15.8 | 37.0 |
| C3D [33] | Offline | 82.4 | 77.4 | 80.7 | 67.9 | 56.9 | 59.7 | 5.2 | 17.4 | 3.9 | 20.1 | 29.5 | 45.5 |
| I3D [4] | Offline | 85.6 | 79.1 | 78.9 | 74.0 | 62.4 | 59.0 | 14.3 | 29.8 | 0.1 | 20.1 | 41.4 | 49.5 |
| Ours | Offline | 85.5 | 77.9 | 79.1 | 76.0 | 62.0 | 64.0 | 19.8 | 29.6 | 1.0 | 27.7 | 39.0 | 51.1 |
TABLE I: Results of Goal-oriented driver behavior recognition (per-frame AP for individual actions) on HDD. The unit is %.

IV-B Implementation Details

We implemented our framework in TensorFlow. All experiments are performed on a server with 4 NVIDIA TITAN Xp GPUs. The input to the framework is a 20-frame clip sampled at 3 fps, covering approximately 6.67 s. We adopt Inception-v3 [32] pre-trained on ImageNet [29] as the backbone, following [4] to inflate the 2D convolutions into a 3D ConvNet, and fine-tune it on the Kinetics action recognition dataset [12]. The intermediate feature map used in RoIAlign and MaskAlign is extracted from the Mixed_3c layer. The global I3D feature is generated by a convolution on the Mixed_5c layer feature, which reduces the number of output channels from 1024 to 512. The binary masks are downsampled to the same spatial resolution as the intermediate feature map. The model is trained in a two-stage scheme with a batch size of 32: (1) we fine-tune the Kinetics pre-trained model on the HDD dataset for 50K iterations without GCN, and refer to this model as the baseline in our experiments; (2) we load the weights trained in Stage 1 and further train the network together with the GCN for 20K iterations.

| Method | Stop for Congestion | Stop for Sign | Stop for Red Light | Stop for Crossing Vehicle | Deviate for Parked Vehicle | Stop for Crossing Pedestrian | Overall mAP |
|---|---|---|---|---|---|---|---|
| I3D [4] | 64.8 | 71.7 | 63.6 | 21.5 | 15.8 | 26.2 | 43.9 |
| Ours | 74.1 | 72.4 | 76.3 | 26.9 | 20.4 | 29.0 | 49.9 |
TABLE II: Results of driver behavior recognition (per-frame AP for individual actions) in the Cause layer on HDD. The unit is %.

We use the Adam [13] optimizer with default parameters. The learning rate is set to 0.001 for the first training stage and 0.0002 for the second.

IV-C Analysis on Interactions

To understand the benefits of modeling interactions, we perform analysis on the following two aspects.

Goal-oriented Action Layer. Table I presents the Goal-oriented action recognition results. We use per-frame mean average precision (mAP) as the evaluation metric in all experiments. We focus on the 5 'lane-related' classes: Left Lane Change, Right Lane Change, Left Lane Branch, Right Lane Branch, and Merge. Our model obtains 49.9% mAP over these 5 classes, surpassing the I3D baseline (46.0% mAP) by 3.9%. This improvement demonstrates the effectiveness of modeling interactions between the ego vehicle and traffic lanes, which is further supported by the visualizations in Section IV-F.

Cause Layer. The 6 classes in the Cause layer explain the reasons for stop and deviate actions; Deviate for Parked Vehicle, for instance, is an example of ego-thing interaction. We extend our framework with multi-head classifiers to simultaneously predict Goal-oriented actions and Causes, and train a multi-head I3D as the baseline for this experiment. Our design achieves a steady improvement in recognizing Goal-oriented actions, raising the baseline from 48.5% to 50.2%. Meanwhile, the Cause layer results in Table II show a significant gain of 6.0% in overall mAP. We further demonstrate the strength of the proposed interaction modeling on a Deviate for Parked Vehicle scenario in Fig. 5 in Section IV-F.

IV-D Comparison with the State of the Art

We compare our approach with the state of the art in Table I. We categorize the existing methods evaluated on HDD into online and offline. The online approaches aim to detect driver actions as soon as a frame arrives, without considering future context. The offline approaches take future frames into account; since future information is processed, they exhibit a clear advantage over the online approaches. Among the offline methods, our model outperforms C3D [33] and I3D [4] by 5.6% and 1.6% mAP, respectively.

(a) Left Turn
(b) Right Turn
(c) Crosswalk Passing
(d) Intersection Passing
(e) Left Turn
(f) Left Lane Change
(g) Left Lane Branch
(h) Merge
Fig. 4: Attention visualization from the egocentric view. The first and second rows show examples from the Ego-Thing Graph and Ego-Stuff Graph, respectively. In (a)-(c), pedestrians intending to cross the street have a significant influence on the ego behavior when turning left, turning right, and passing the crosswalk. In (d), the ego vehicle passes an intersection while paying attention to the moving car and bicycle in front of it. In (e), a left-turn case, the heat map shows high attention around the traffic light, which is green. In (f)-(h), lane markings show a strong influence on the ego vehicle's lane-related behaviors.
(a) Goal-oriented prediction: Background and Cause prediction: Parked Vehicle
(b) Learned affinity matrix
(c) Visualization of the learned affinity matrix using a top-view scene representation
Fig. 5: Attention visualization from the top view.
| Component | Method | Overall mAP |
|---|---|---|
| Different Graphs | I3D [4] | 49.5 |
| Different Graphs | Ego-Stuff Graph | 50.6 |
| Different Graphs | Ego-Thing Graph | 50.8 |
| Different Graphs | Ego-Thing Graph + Ego-Stuff Graph | 51.1 |
| Spatial Modeling | Appearance Relation | 50.9 |
| Spatial Modeling | Appearance + Spatial Relation | 51.1 |
| Temporal Modeling | Average | 50.0 |
| Temporal Modeling | MLP | 50.9 |
| Temporal Modeling | Max | 51.1 |
TABLE III: Ablation studies. The unit is %.

IV-E Ablation Studies

To provide a comprehensive understanding of the contribution of each module, we decompose our model into three components and conduct ablation studies on Goal-oriented action recognition, as shown in Table III.

Comparison of Different Graphs. The first section of Table III analyzes the contribution of each graph to tactical driver behavior recognition. The baseline is I3D. When the ego-stuff graph or the ego-thing graph is added, the result improves from 49.5% to 50.6% and 50.8%, respectively. When both graphs are trained jointly with the baseline model, we achieve the best performance of 51.1% on Goal-oriented action recognition.

Importance of Spatial Relation. To investigate the effectiveness of the spatial relation function in Eq. (4), we conduct two experimental settings: using only the appearance relation, and adding the 3D spatial relation as an additional constraint. Without the proposed 3D spatial relation, the performance decreases by 0.2%, indicating the advantage of encoding spatial context.

Variations of Temporal Modeling. We analyze the impact of different temporal modeling approaches. The best mAP, 51.1%, is obtained by element-wise max pooling. If we instead use element-wise averaging of the features from each time step, the model reaches 50.0% mAP. Our conjecture is that, in a 20-frame video clip, the key change takes place within a short duration. For example, in a Left Lane Change behavior, the most noticeable moment is when the ego vehicle crosses the traffic lanes within a few frames. Averaging the features over time potentially dilutes these distinguishing features, which inevitably results in information loss. A multi-layer perceptron (MLP), which takes temporal ordering into account, exceeds average pooling by 0.9% but is 0.2% lower than the best performance. Our hypothesis is that significant changes of interactive relations play a more important role in recognizing tactical driver behavior than their ordering in time.

IV-F Visualization

Apart from the quantitative evaluation, we demonstrate the interpretability of our method through the following two visualization strategies.

Attention Visualization from the Egocentric View. Given the learned affinity matrices of the ego-thing graph and ego-stuff graph, we highlight the objects with strong connections to the ego node in Fig. 4. The visualization results provide strong evidence that the proposed model captures the underlying interactions, which is essential for tactical driving behavior understanding. Note that in the example shown in Fig. 4(e), the model captures the relation between the ego vehicle (turning left) and the traffic light (green).

Attention Visualization from the Top View. In addition to interactions with the ego vehicle, we can also represent the complicated traffic scene as a graph. Fig. 5(b) shows the learned ego-thing graph from the multi-head model for a scenario in which the ego vehicle deviates for a parked truck. Each circle in the graph corresponds to a thing object in the frame, and the ego vehicle is represented by a star. An edge linking two nodes represents the interactive relation between them. We manually draw a top-view map in Fig. 5(c) to better illustrate the interactions based on spatial context.

V Conclusion

In this paper, we propose a framework to model complicated interactions between the driver and road users using graph convolutional networks. The proposed framework demonstrates favorable quantitative performance on the HDD dataset. Qualitatively, we show that the model captures interactions between the ego vehicle and stuff objects, as well as between the ego vehicle and thing objects. For future work, we plan to incorporate temporal modeling of thing objects into the proposed framework, as it is an important cue for interaction modeling. With that, the framework will enable tactical behavior anticipation and behavior modeling [7], and potentially trajectory prediction of thing objects [5].

References

  1. A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei and S. Savarese (2016) Social LSTM: Human Trajectory Prediction in Crowded Spaces. In CVPR, Cited by: §I, §I.
  2. J. L. Ba, J. R. Kiros and G. E. Hinton (2016) Layer Normalization. In arXiv preprint arXiv:1607.06450, Cited by: §III-D.
  3. I. A. Barsan, P. Liu, M. Pollefeys and A. Geiger (2018) Robust Dense Mapping for Large-Scale Dynamic Environments. In ICRA, Cited by: §I.
  4. J. Carreira and A. Zisserman (2017) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, Cited by: §III-A, §IV-B, §IV-D, TABLE I, TABLE II, TABLE III.
  5. R. Chandra, U. Bhattacharya, A. Bera and D. Manocha (2019) TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic using Weighted Interactions. In CVPR, Cited by: §I, §I, §I, §V.
  6. C. Chen, Y. Liu, S. Kreiss and A. Alahi (2019) Crowd-Robot Interaction: Crowd-aware Robot Navigation with Attention-based Deep Reinforcement Learning. In ICRA, Cited by: §I, §I.
  7. A. Doshi and M. M. Trivedi (2011) Tactical Driver Behavior Prediction and Intent Inference: A Review. In ITSC, Cited by: §V.
  8. T. Gindele, S. Brechtel and R. Dillmann (2015) Learning Driver Behavior Models from Traffic Observations for Decision Making and Planning. IEEE Intelligent Transportation Systems Magazine 7 (1), pp. 69–79. Cited by: §I.
  9. K. He, G. Gkioxari, P. Dollar and R. Girshick (2017) Mask R-CNN. In ICCV, Cited by: §III-A, §III-B, §III-C.
  10. R. Herzig, E. Levi, H. Xu, H. Gao, E. Brosh, X. Wang, A. Globerson and T. Darrell (2019) Spatio-Temporal Action Graph Networks. In ICCVW, Cited by: §II-B.
  11. A. Jain, H. Koppula, B. Raghavan, S. Soh and A. Saxena (2015) Car that Knows before You Do: Anticipating Maneuvers via Learning Temporal Driving Models. In ICCV, Cited by: §I, §II-A.
  12. W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman and A. Zisserman (2017) The Kinetics Human Action Video Dataset. In arXiv preprint arXiv:1705.06950, Cited by: §IV-B.
  13. D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. In arXiv preprint arXiv:1412.6980, Cited by: §IV-B.
  14. T. N. Kipf and M. Welling (2017) Semi-supervised Classification with Graph Convolutional Networks. In ICLR, Cited by: §I, §II-B, §III-A, §III-D.
  15. N. Kuge, T. Yamamura, O. Shimoyama and A. Liu (2000) A driver behavior recognition method based on a driver model framework. In SAE 2000 World Congress, Cited by: §II-A.
  16. K. Lasinger, R. Ranftl, K. Schindler and V. Koltun (2019) Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. In arXiv preprint arXiv:1907.01341, Cited by: §II-B, §III-B.
  17. K. Lasinger, R. Ranftl, K. Schindler and V. Koltun (2019) Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. In arXiv preprint arXiv:1907.01341, Cited by: §III-B.
  18. N. Lee, W. Choi, P. Vernaza, C. Choy, P. Torr and M. Chandraker (2017) DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents. In CVPR, Cited by: §I.
  19. R. Li, M. Tapaswi, R. Liao, J. Jia, R. Urtasun and S. Fidler (2017) Situation Recognition with Graph Neural Networks. In ICCV, Cited by: §I, §II-B.
  20. Y. Li, D. Tarlow, M. Brockschmidt and R. Zemel (2016) Gated Graph Sequence Neural Networks. In ICLR, Cited by: §II-B.
  21. D. Mitrovic (2005) Reliable Method for Driving Events Recognition. IEEE Transactions on Intelligent Transportation Systems 6 (2), pp. 198–205. Cited by: §I, §II-A.
  22. M. Müller, A. Dosovitskiy, B. Ghanem and V. Koltun (2018) Driving Policy Transfer via Modularity and Abstraction. In CoRL, Cited by: §II-B.
  23. A. Narayanan, Y. Chen and S. Malla (2018) Semi-supervised Learning: Fusion of Self-supervised, Supervised Learning, and Multimodal Cues for Tactical Driver Behavior Detection. In CVPRW, Cited by: TABLE I.
  24. N. Oliver and A. Pentland (2000) Graphical Models for Driver Behavior Recognition in a SmartCar. In IV, Cited by: §I, §II-A.
  25. S. Qi and S. Zhu (2018) Intent-aware Multi-agent Reinforcement Learning. In ICRA, Cited by: §I.
  26. V. Ramanishka, Y. Chen, T. Misu and K. Saenko (2018) Toward Driving Scene Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning. In CVPR, Cited by: §I, §II-A, §IV-A, TABLE I.
  27. J. Redmon, S. Divvala, R. Girshick and A. Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection. In CVPR, Cited by: §III-C.
  28. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, Cited by: §III-C.
  29. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. In IJCV, Cited by: §IV-B.
  30. M. Schmidt, U. Hofmann and M. Bouzouraa (2014) A Novel Goal Oriented Concept for Situation Representation for ADAS and Automated Driving. In ITSC, Cited by: §I.
  31. J. Schulz, C. Hubmann, N. Morin, J. Löchner and D. Burschka (2019) Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. In IV, Cited by: §I.
  32. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna (2016) Rethinking the Inception Architecture for Computer Vision. In CVPR, Cited by: §IV-B.
  33. D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri (2015) Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV, Cited by: §IV-D, TABLE I.
  34. A. Vemula, K. Muelling and J. Oh (2018) Social Attention: Modeling Attention in Human Crowds. In ICRA, Cited by: §I, §I.
  35. C. Wang, A. Narayanan, A. Patil and Y. Chen (2018) A 3D Dynamic Scene Analysis Framework for Development of Intelligent Transportation Systems. In IV, Cited by: §I.
  36. D. Wang, C. Devin, Q. Cai, F. Yu and T. Darrell (2019) Deep Object Centric Policies for Autonomous Driving. In ICRA, Cited by: §II-A, §II-A.
  37. X. Wang and A. Gupta (2018) Videos as Space-Time Region Graphs. In ECCV, Cited by: §II-B, §III-E.
  38. B. Wu, Y. Chen, C. Yeh and Y. Li (2013) Reasoning Based Framework for Driving Safety Monitoring using Driving Event Recognition. IEEE Transactions on Intelligent Transportation Systems 14 (3), pp. 1231–1241. Cited by: §II-A.
  39. J. Wu, L. Wang, L. Wang, J. Guo and G. Wu (2019) Learning Actor Relation Graphs for Group Activity Recognition. In CVPR, Cited by: §I, §II-B, §II-B, §III-B, §III-E.
  40. H. Xu, Y. Gao, F. Yu and T. Darrell (2016) End-To-End Learning of Driving Models From Large-Scale Video Datasets. In CVPR, Cited by: §II-A.
  41. M. Xu, M. Gao, Y. Chen, L. Davis and D. Crandall (2019) Temporal Recurrent Networks for Online Action Detection. In ICCV, Cited by: §II-A, TABLE I.
  42. S. Yan, Y. Xiong and D. Lin (2018) Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI, Cited by: §II-B, §III-E.
  43. J. Yang, J. Lu, S. Lee, D. Batra and D. Parikh (2018) Graph R-CNN for Scene Graph Generation. In ECCV, Cited by: §II-B.
  44. W. Zhan, C. Liu, C. Chan and M. Tomizuka (2016) A Non-conservatively Defensive Strategy for Urban Autonomous Driving. In ITSC, Cited by: §I.
  45. B. Zhou, A. Andonian, A. Oliva and A. Torralba (2018) Temporal relational reasoning in videos. In ECCV, Cited by: §III-E.
