Learning 3D-aware Egocentric Spatial-Temporal Interaction via Graph Convolutional Networks
To enable intelligent automated driving systems, a promising strategy is to understand how humans drive and interact with road users in complicated driving situations. In this paper, we propose a 3D-aware egocentric spatial-temporal interaction framework for automated driving applications. Graph convolutional networks (GCN) are devised for interaction modeling. We introduce three novel concepts into GCN. First, we decompose egocentric interactions into ego-thing and ego-stuff interactions, modeled by two GCNs. In both GCNs, an ego node is introduced to encode the interactions between the ego vehicle and thing objects (e.g., cars and pedestrians) and between the ego vehicle and stuff objects (e.g., lane markings and traffic lights). Second, objects' 3D locations are explicitly incorporated into GCN to better model egocentric interactions. Third, to implement ego-stuff interaction in GCN, we propose a MaskAlign operation to extract features for irregular objects.
We validate the proposed framework on tactical driver behavior recognition. Extensive experiments are conducted using the Honda Research Institute Driving Dataset, the largest dataset with diverse tactical driver behavior annotations. Our framework demonstrates a substantial performance boost over baselines on two experimental settings, by 3.9% and 6.0%, respectively. Furthermore, we visualize the learned affinity matrices, which encode ego-thing and ego-stuff interactions, to show that the proposed framework captures interactions effectively.
Automated driving in highly interactive scenarios is challenging as it involves different levels of 3D scene analysis [3, 35], situation understanding [30, 19], intention prediction [18, 5], and decision making and planning [44, 25]. Understanding how humans drive and interact with road users is essential toward building an intelligent automated driving system. The first step is to develop a computational model that can capture the complicated spatial-temporal interactions between the ego vehicle and road users.
Over the past decade there has been significant progress in modeling spatial-temporal interactions [24, 21, 11, 8, 1, 31, 5]. However, most existing work still cannot effectively model complex interactions because many of these methods rely on hand-crafted interaction models. Data-driven approaches are better options as they can learn subtle and complex interactions [1, 34, 6, 5]. However, existing approaches are still insufficient for three reasons.
First, the input used by several existing methods [1, 34, 6] is the human's 2D location on bird's-eye-view (BEV) images. However, it is more desirable to use ego-perspective sensing devices, e.g., cameras, as humans use two eyes to sense. This calls for a specific design for egocentric interaction models. Second, using 2D pixel coordinates to model 3D interactions is insufficient because of perspective projection; BEV images avoid this problem, since depth and spatial positions are both embedded in them, but egocentric images do not. Third, the existing approaches only consider human-human or human-robot interactions, ignoring environmental factors such as lane markings, crosswalks, and traffic lights. Modeling these objects is nontrivial because they have irregular shapes.
In this paper, we propose a 3D-aware egocentric spatial-temporal interaction framework for automated driving applications. Our method is the first based on egocentric images that addresses the aforementioned problems. The specific approach we take is to design two graph convolutional networks (GCNs) to model the egocentric interactions. We define two graphs, the Ego-Thing Graph and the Ego-Stuff Graph, to encode how the ego vehicle interacts with thing objects (e.g., cars and pedestrians) and stuff objects (e.g., lane markings and traffic lights). The ego-thing graph improves upon the design of Wu et al. with two new concepts: we add an ego node (i.e., the ego vehicle) for egocentric interaction modeling, and we incorporate the objects' 3D locations (recovered from image-based depth estimation). The ego-stuff graph is designed similarly; however, to extract features from irregular stuff objects, we introduce a new operation, MaskAlign.
We validate the proposed framework on tactical driver behavior recognition using the Honda Research Institute Driving Dataset (HDD). The HDD is the largest dataset in the field, providing 104 hours of egocentric videos with frame-level annotations of tactical driver behavior. We validate our method under two types of settings: 1) the ego vehicle interacts with stuff objects (e.g., lane change, lane branch, and merge), and 2) the ego vehicle interacts with thing objects (e.g., stop for crossing pedestrian and deviate for parked car). Our approach offers a substantial performance boost (in terms of mAP; see the Experiments section for definitions) over baselines on the two settings, by 3.9% and 6.0%, respectively.
II Related Works
II-A Tactical Driver Behavior Recognition
Significant efforts have been made in tactical driver behavior recognition [24, 15, 21, 38, 11, 40, 36, 26, 41]. Hidden Markov models (HMM) were leveraged to recognize driver behaviors [24, 15, 21, 38, 11]. A single node in an HMM encodes the states of the ego vehicle, the road, and traffic participants into one state vector. In the proposed framework, we explicitly model these three states using different nodes, each of which encodes its own representation according to the semantic context. Recently, convolutional and recurrent neural network based algorithms [40, 26, 41] have been proposed. They implicitly encode the states of the ego vehicle and road users using 2D convolutions, with state transitions handled by recurrent units. Our method explicitly models the states using graph convolutional networks (GCN) and uses temporal convolution networks for the state transition.
Wang et al. designed an object-level attention layer to capture the impact of objects on driving policies. Instead of simply weighting and concatenating objects' features, our framework preserves more complicated forms of interaction, benefiting from GCN. Additionally, interactions between the ego vehicle and road infrastructure are included in our system.
II-B Graph Neural Networks for Driving Scenes
Recently, graph neural networks (GNN) [20, 14] have made significant progress in situation recognition, action recognition [42, 37], group activity recognition, and scene graph generation. However, considerably less attention has been paid to driving scene applications.
Herzig et al. proposed a Spatio-Temporal Action Graph (STAG) network to detect driving collisions. While STAG is similar to the proposed ego-thing graph, our model explicitly incorporates the 3D locations of objects and the ego vehicle into the design of nodes and edges; this 3D cue is essential for understanding scenes from an egocentric perspective. Note that prior work uses 2D locations, while we use 3D locations recovered from monocular depth estimation. Moreover, we consider interactions between the ego vehicle and road infrastructure, which enables the proposed framework to be applied to diverse driving scene applications, e.g., learning driving models from images. The details of our graph design can be found in Section III-B.
III Egocentric Spatial-Temporal Interaction Modeling
III-A Overall Architecture
An overview of the proposed framework is depicted in Fig. 2. Given video frames, we apply instance segmentation and semantic segmentation to obtain thing objects and stuff objects, respectively. Object features are extracted from intermediate I3D features via RoIAlign and MaskAlign (Section III-C). Afterwards, we construct ego-thing graphs and ego-stuff graphs frame by frame and apply graph convolutional networks (GCN) for message passing. The updated ego features from the two graphs are fused and processed by a temporal fusion module. The temporally fused ego features are then concatenated with the I3D head feature, which serves as a global video embedding, to form the egocentric representation. Finally, this egocentric feature is passed through a fully connected layer to obtain the final classification.
III-B Ego-Thing Graph
The ego-thing graph is designed to model interactions among the ego vehicle and movable traffic participants, such as cars, pedestrians, and cyclists.
Node feature extraction. In our design, thing objects are cars, persons, bicycles, motorcycles, buses, trains, and trucks. Given bounding boxes generated by Mask R-CNN, we keep the top K detections per frame over all the classes above, with K set to 20. Then RoIAlign and a max pooling layer are applied to obtain appearance features that serve as Thing Node features in an ego-thing graph. The Ego Node feature is obtained by the same procedure from a frame-sized bounding box.
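To make the node feature extraction step concrete, the following plain-Python sketch performs per-channel max pooling over the feature-map cells inside a box. This is a crude stand-in for RoIAlign followed by max pooling (RoIAlign additionally uses bilinear sampling at sub-cell points); the function name and the integer-box interface are our own illustration, not the paper's implementation.

```python
def box_max_pool(feature_map, box):
    """Crude stand-in for RoIAlign + max pooling: per-channel max over
    feature-map cells inside an integer box (x0, y0, x1, y1).
    feature_map is an H x W x D nested list."""
    x0, y0, x1, y1 = box
    D = len(feature_map[0][0])
    pooled = [float("-inf")] * D
    for y in range(y0, y1):
        for x in range(x0, x1):
            for d in range(D):
                pooled[d] = max(pooled[d], feature_map[y][x][d])
    return pooled
```

The Ego Node feature corresponds to calling this with a box covering the whole feature map.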
Graph definition. We denote the sequence of frame-wise ego-thing graphs as $\{G^{ET}_t\}_{t=1}^{T}$, where $T$ is the number of frames and $G^{ET}_t \in \mathbb{R}^{(K+1)\times(K+1)}$ is the ego-thing affinity matrix at frame $t$, representing the pair-wise interactions among thing objects and the ego vehicle. Specifically, $G^{ET}_t(i, j)$ denotes the influence of object $j$ on object $i$. Nodes in the graph correspond to a set of objects $O_t = \{o^i_t = (x^i_t, p^i_t)\}_{i=0}^{K}$, where $x^i_t$ is the $i$-th object's appearance feature and $p^i_t$ is the 3D location of the object in the world frame. Note that index $i = 0$ corresponds to the ego object and $i = 1, \dots, K$ correspond to thing objects.
Interaction modeling. Ego-thing interactions are defined as second-order interactions: not only the original state but also the changing state of a thing object, caused by other objects, influences the ego state. To model these interactions, we consider both appearance features and distance constraints. We compute the edge value as:

$$G^{ET}_t(i, j) = \frac{f_s(o^i_t, o^j_t)\,\exp\!\big(f_a(x^i_t, x^j_t)\big)}{\sum_{j=0}^{K} f_s(o^i_t, o^j_t)\,\exp\!\big(f_a(x^i_t, x^j_t)\big)},$$

where $f_a(x^i_t, x^j_t)$ indicates the appearance relation between two objects, and the spatial relation $f_s(o^i_t, o^j_t)$ sets up a distance constraint. The softmax-style normalization scales the influence on object $i$ from all other objects.
The appearance relation is calculated as:

$$f_a(x^i_t, x^j_t) = \frac{\theta(x^i_t)^{\top}\,\phi(x^j_t)}{\sqrt{D}},$$

where $\theta(\cdot)$ and $\phi(\cdot)$ are learnable linear transformations that map appearance features to a subspace and enable learning the correlation of two objects, and $\sqrt{D}$ is a normalization factor with $D$ the dimension of the transformed features.
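The edge computation, an appearance relation masked by a binary spatial relation and normalized per row, can be sketched in plain Python. This is a minimal illustrative sketch, not the paper's implementation: the learnable maps are shown as fixed weight matrices, and all names are our own.

```python
import math

def appearance_relation(x_i, x_j, W_theta, W_phi):
    """Scaled dot product between two linearly projected appearance features."""
    theta = [sum(w * x for w, x in zip(row, x_i)) for row in W_theta]
    phi = [sum(w * x for w, x in zip(row, x_j)) for row in W_phi]
    return sum(t * p for t, p in zip(theta, phi)) / math.sqrt(len(theta))

def edge_values(features, spatial, W_theta, W_phi):
    """Row-normalized affinity matrix: softmax of the appearance relation,
    masked by the binary spatial relation."""
    n = len(features)
    G = []
    for i in range(n):
        scores = [spatial[i][j] *
                  math.exp(appearance_relation(features[i], features[j],
                                               W_theta, W_phi))
                  for j in range(n)]
        z = sum(scores) or 1.0  # guard against a fully masked row
        G.append([s / z for s in scores])
    return G
```

Each row of the resulting matrix sums to one (unless fully masked), so it can be read as a distribution of influence over the other objects.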
The necessity of defining a spatial relation arises from the fact that interactions between two distant objects are usually scarce. To calculate this relation, we first unproject objects from the 2D image plane to 3D space in the world frame:

$$\tilde{P}^i_t = z^i_t\, K_c^{-1}\, \tilde{p}^i_t,$$

where $\tilde{p}^i_t$ and $\tilde{P}^i_t$ are homogeneous representations in the 2D and 3D coordinate systems, $K_c$ is the camera intrinsic matrix, and $z^i_t$ is the relative depth at $\tilde{p}^i_t$ obtained by depth estimation. In the 2D plane, we choose the centers of bounding boxes to locate thing objects; the location of the ego vehicle is fixed at the middle-bottom pixel of the frame. The spatial relation function is then formulated as:

$$f_s(o^i_t, o^j_t) = \mathbb{1}\big(d(p^i_t, p^j_t) \le \mu\big),$$

where $\mathbb{1}(\cdot)$ is the indicator function, $d(p^i_t, p^j_t)$ computes the Euclidean distance between objects $i$ and $j$ in 3D space, and $\mu$ is the distance threshold, which sets the spatial relation to zero if the distance exceeds this upper bound. In our implementation, $\mu$ is set to 3.0.
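The unprojection and the indicator-style spatial relation can be sketched as follows. This is an illustrative sketch under the standard pinhole model; the function names are ours, and the intrinsics are passed as a plain 3x3 nested list.

```python
import math

def unproject(u, v, z, K_c):
    """Back-project pixel (u, v) with relative depth z through the pinhole
    intrinsics K_c: P = z * K_c^{-1} * [u, v, 1]^T."""
    fx, cx = K_c[0][0], K_c[0][2]
    fy, cy = K_c[1][1], K_c[1][2]
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def spatial_relation(p_i, p_j, mu=3.0):
    """Indicator that two 3D points lie within the distance threshold mu."""
    return 1.0 if math.dist(p_i, p_j) <= mu else 0.0
```

With identity intrinsics and unit depth, a pixel maps to itself in 3D, which is a convenient sanity check.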
III-C Ego-Stuff Graph
The ego-stuff graph is constructed in a similar manner to the ego-thing graph in Eq. 1, except for the following aspects:
Node feature extraction. We include the following classes as stuff objects: Crosswalk, Lane Markings, Lane Separator, Road, Service Lane, Traffic Island, Traffic Light, and Traffic Sign. The criterion we use to distinguish stuff objects from thing objects is whether a change of state can be caused by other objects: for example, cars stop and yield to pedestrians, but a traffic light turns from red to green by itself. Another distinction is that the contours of most stuff objects cannot be well depicted by rectangular bounding boxes. Thus, it is difficult either to detect them with algorithms like Faster R-CNN or YOLO, or to extract their features with RoIAlign without enclosing irrelevant information. For this reason, we propose a feature extraction approach named MaskAlign, which extracts features for a binary mask $M^i_t$ of the $i$-th stuff object at time $t$. The mask is downsampled to the same spatial dimensions as the intermediate I3D feature map $F_t$. We compute the stuff object feature by MaskAlign as follows:

$$x^i_t = \frac{\sum_{p} F_t(p)\, M^i_t(p)}{\sum_{p} M^i_t(p)},$$

where $F_t(p)$ is the $D$-dimensional feature at pixel $p$ for time $t$, and $M^i_t(p)$ is a binary scalar indicating whether object $i$ exists at pixel $p$.
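MaskAlign amounts to averaging the feature vectors over the pixels covered by the binary mask. The sketch below is a plain-Python illustration under that reading (the normalization over mask pixels is our reconstruction); names and data layout are our own.

```python
def mask_align(feature_map, mask):
    """Average the D-dim features over pixels where the binary mask is 1.
    feature_map: H x W x D nested lists; mask: H x W of 0/1."""
    H, W = len(mask), len(mask[0])
    D = len(feature_map[0][0])
    acc = [0.0] * D
    count = 0
    for y in range(H):
        for x in range(W):
            if mask[y][x]:
                count += 1
                for d in range(D):
                    acc[d] += feature_map[y][x][d]
    if count == 0:
        return acc  # empty mask: zero feature
    return [a / count for a in acc]
```

Unlike box pooling, only pixels inside the irregular mask contribute, so no irrelevant background features are enclosed.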
Interaction Modeling. In the ego-stuff graph, we ignore interactions among stuff objects since they are unaffected by other objects. Hence, we set the affinity to zero for every pair of stuff objects and only consider the influence that stuff objects exert on the ego vehicle; we call this the first-order interaction. To better model the spatial relations, instead of unprojecting bounding box centers, we map every pixel inside the downsampled binary mask to 3D space and compute the Euclidean distance between each pixel and the ego vehicle; the object's distance is the minimum over all of them. The distance threshold in the ego-stuff graph is set to 0.8.
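The minimum-distance rule for stuff objects can be sketched directly; this is an illustrative fragment with hypothetical names, assuming the mask pixels have already been unprojected to 3D points.

```python
import math

def stuff_ego_distance(mask_points_3d, ego_3d):
    """Ego-stuff distance: the minimum Euclidean distance between the ego
    position and any unprojected pixel of the stuff object's mask."""
    return min(math.dist(p, ego_3d) for p in mask_points_3d)

def first_order_affinity(distance, mu=0.8):
    """Stuff-to-ego spatial relation: 1 if within the threshold, else 0."""
    return 1.0 if distance <= mu else 0.0
```

Taking the minimum over all mask pixels lets an elongated object such as a lane marking count as near whenever any part of it is near.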
III-D Reasoning on Graphs
To perform reasoning on graphs, we employ graph convolutional networks (GCN). A GCN takes a graph as input, passes information through the learned edges, and outputs refreshed node features. Specifically, one layer of graph convolution can be expressed as:

$$Z^{(l+1)} = \sigma\big(\mathrm{LN}\big(G Z^{(l)} W^{(l)} + Z^{(l)}\big)\big),$$

where $G$ is the affinity matrix of the graph (taking the ego-thing graph as an example), $Z^{(l)}$ is the appearance feature matrix of the nodes in the $l$-th layer, and $W^{(l)}$ is a learnable weight matrix. The added $Z^{(l)}$ forms a residual connection. At the end of each layer, we adopt Layer Normalization ($\mathrm{LN}$) and ReLU ($\sigma$) before $Z^{(l+1)}$ is fed to the next layer. As second-order interaction is considered in the ego-thing graph but not in the ego-stuff graph, we use a one-layer GCN for the ego-stuff graph and a two-layer GCN for the ego-thing graph.
III-E Temporal Modeling
GCN interaction features at each frame are processed independently, without considering temporal context. Therefore, we append a temporal fusion module at a late stage of our framework, as illustrated in Fig. 3. Unlike prior works [37, 39, 42], which fuse the features of every node across graphs, we focus only on the ego node. Ego features from the two types of graphs are aggregated by element-wise summation. These time-specific ego features are then fed into a temporal fusion module, which applies element-wise max pooling to obtain a feature vector, namely the GCN egocentric feature. We also propose two alternative designs for temporal fusion: (a) inspired by the Temporal Relation Network, which utilizes multi-layer perceptrons (MLP) for temporal modeling, we follow a similar approach to capture the temporal ordering of patterns; (b) the max pooling can also be replaced by element-wise average pooling. In Section IV-E, we conduct experiments to investigate all three temporal modeling approaches.
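The pooling variants of the temporal fusion are one-liners; the sketch below shows both (names are ours), operating on a list of per-frame ego feature vectors.

```python
def temporal_max_pool(ego_features):
    """Element-wise max over T per-frame ego feature vectors (T x D -> D)."""
    return [max(channel) for channel in zip(*ego_features)]

def temporal_avg_pool(ego_features):
    """Element-wise average, the weaker alternative compared in Section IV-E."""
    T = len(ego_features)
    return [sum(channel) / T for channel in zip(*ego_features)]
```

Max pooling keeps the single strongest response per channel across the clip, which matches the intuition that the key moment of a behavior is brief.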
IV Experiments
IV-A Dataset
We evaluate the proposed framework on the HDD dataset, the largest dataset providing 104 hours of egocentric videos with frame-level annotations of tactical driver behavior. It covers a diverse set of scenarios in which complicated interactions happen between the ego vehicle and road users. The data was collected in the San Francisco Bay Area, including urban, suburban, and highway driving. We follow the same Train/Test data split as prior work.
The videos are labeled with a 4-layer representation to describe tactical driver behaviors. Among these 4 layers, the Goal-oriented action layer (e.g., left turn and right lane change) and the Cause layer (e.g., stop for crossing vehicle) consist of the actions that involve interactions. We leverage those labels and analyze the effectiveness of the proposed interaction modeling framework in Section IV-C.
IV-B Implementation Details
We implemented our framework in TensorFlow. All experiments are performed on a server with 4 NVIDIA TITAN Xp GPUs. The input to the framework is a 20-frame clip sampled at 3 fps (approximately 6.67 s). We adopt Inception-v3 pre-trained on ImageNet as the backbone, following the I3D approach to inflate the 2D convolutions into a 3D ConvNet, and fine-tune it on the Kinetics action recognition dataset. The intermediate feature map used in RoIAlign and MaskAlign is extracted from the Mixed_3c layer. The global I3D feature is generated by a convolution on the Mixed_5c layer feature, which reduces the number of output channels from 1024 to 512. The binary mask is downsampled to the spatial size of the intermediate feature map. The model is trained in a two-stage scheme with batch size 32: (1) we fine-tune the Kinetics pre-trained model on the HDD dataset for 50K iterations without GCN; we refer to this model as the baseline in our experiments. (2) We load the weights trained in Stage 1 and further train the network together with the GCNs for 20K iterations.
We use the Adam optimizer with default parameters. The learning rate is set to 0.001 and 0.0002 for the first and second training stages, respectively.
IV-C Analysis on Interactions
To understand the benefits of modeling interactions, we perform analysis on the following two aspects.
Goal-oriented Action Layer. Table I presents the Goal-oriented action recognition results. We use per-frame mean average precision (mAP) as the evaluation metric in all experiments. We focus on the 5 lane-related classes: Left Lane Change, Right Lane Change, Left Lane Branch, Right Lane Branch, and Merge. Our model obtains 49.9% mAP over these 5 classes, surpassing the I3D baseline's 46.0% mAP by a gain of 3.9%. This improvement showcases the effectiveness of modeling interactions between the ego vehicle and traffic lanes, which is also validated by the visualizations in Section IV-F.
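Per-frame mAP treats every frame as a sample: for each class, frames are ranked by the predicted score and average precision is computed against the per-frame labels, then APs are averaged over classes. A minimal sketch of this metric (our own helper names, standard ranking-based AP definition):

```python
def average_precision(scores, labels):
    """Per-class AP over frames: average of precision@k at each positive
    frame, with frames ranked by predicted score (descending)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(1, sum(labels))

def mean_average_precision(score_matrix, label_matrix):
    """mAP: mean of per-class APs (one row of scores/labels per class)."""
    aps = [average_precision(s, l)
           for s, l in zip(score_matrix, label_matrix)]
    return sum(aps) / len(aps)
```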
Cause Layer. 6 classes from the Cause layer are designed to explain the reason for stop and deviate actions, e.g., Deviate for Parked Vehicle, an example of ego-thing interaction. We extend our framework with multi-head classifiers to simultaneously predict Goal-oriented actions and Causes. Note that we train a multi-head I3D as the baseline for this experiment. Our design achieves a steady increase in recognizing Goal-oriented actions, improving the baseline from 48.5% to 50.2%. Meanwhile, the Cause layer results in Table II show a significant gain of 6.0% in overall mAP. We further demonstrate the strength of the proposed interaction modeling using a Deviate for Parked Vehicle scenario in Fig. 5 in Section IV-F.
IV-D Comparison with the State of the Art
We compare our approach with the state of the art in Table I. We categorize the existing methods tested on HDD into online and offline. The online approaches aim to detect driver actions as soon as a frame arrives, without considering future context. The offline approaches take future frames into consideration; since future information is processed, they hold a clear advantage over the online approaches. Among the offline methods, our model outperforms C3D and I3D by 5.6% and 1.6% mAP, respectively.
TABLE III (excerpt): ablation results on Goal-oriented action recognition (mAP, %).

| Ablation | Setting | mAP |
| Different Graphs | I3D | 49.5 |
| Different Graphs | Ego-Thing Graph + Ego-Stuff Graph | 51.1 |
| Spatial Modeling | Appearance Relation | 50.9 |
| Spatial Modeling | Appearance + Spatial Relation | 51.1 |
IV-E Ablation Studies
To provide a comprehensive understanding of the contribution of each module, we decompose our model into three components and conduct ablation studies on Goal-oriented action recognition, shown in Table III.
Comparison of Different Graphs. The first section of Table III analyzes the influence of each graph on tactical driver behavior recognition. The baseline is I3D. When the ego-stuff graph or the ego-thing graph is included, the result is boosted from 49.5% to 50.6% or 50.7%, respectively. When both graphs are trained jointly with the baseline model, we achieve the best performance of 51.1% on Goal-oriented action recognition.
Importance of Spatial Relation. To investigate the effectiveness of spatial relation function in Eq. 4, we conduct two experimental settings: using only the appearance relations, and embedding 3D spatial relation as an additional constraint. Without using the proposed 3D spatial relation, the performance decreases by 0.2%, indicating the advantage of encoding spatial context.
Variations of Temporal Modeling. We analyze the impact of different temporal modeling approaches. The best mAP, 51.1%, is obtained by element-wise max pooling. If we instead average the features from each time step element-wise, the model reaches 50.0% mAP. Our conjecture is that, within a 20-frame clip, the key change takes place over a short duration; for example, in a Left Lane Change behavior, the most noticeable moment is when the ego vehicle crosses the traffic lanes within a few frames. Averaging features over time potentially dilutes these distinguishable features, which unavoidably results in information loss. A multi-layer perceptron (MLP), which takes temporal ordering patterns into account, exceeds average pooling by 0.9% but is 0.2% lower than the best performance. Our hypothesis is that significant changes in interactive relations play a more important role in recognizing tactical driver behavior than the temporal ordering.
IV-F Visualization
Apart from the quantitative evaluation, we demonstrate the interpretability of our method with the following two visualization strategies.
Attention Visualization from Egocentric View. Given the learned affinity matrices in the ego-thing and ego-stuff graphs, we highlight the objects with strong connections to the ego node in Fig. 4. The visualization results provide strong evidence that the proposed model captures the underlying interactions, which is essential for tactical driving behavior understanding. Note that in the example shown in Fig. 4(e), the model captures the relation between the ego vehicle (turning left) and the traffic light (green).
Attention Visualization from Top-view. In addition to the interactions with the ego vehicle, we can represent the complicated traffic scene as a graph structure as well. Fig. 5(b) shows the visualized ego-thing graph from the multi-head model for a scenario where the ego vehicle deviates for a parked truck. Each circle in the graph corresponds to a thing object in the frame, and the ego vehicle is represented by a star. The edge linking two nodes represents the interactive relation between them. We manually draw a top-view map in Fig. 5(c) to better represent the interactions based on spatial context.
V Conclusion
In this paper, we proposed a framework to model complicated interactions between the driver and road users using graph convolutional networks. The proposed framework demonstrates favorable quantitative performance on the HDD dataset. Qualitatively, we show that the model captures interactions between the ego vehicle and stuff objects, and between the ego vehicle and thing objects. For future work, we plan to incorporate temporal modeling of thing objects into the proposed framework, as it provides important cues for interaction modeling. With that, the framework will enable tactical behavior anticipation and behavior modeling, and potentially trajectory prediction of thing objects.
References

- (2016) Social LSTM: Human Trajectory Prediction in Crowded Spaces. In CVPR, Cited by: §I, §I.
- (2016) Layer Normalization. In arXiv preprint arXiv:1607.06450, Cited by: §III-D.
- (2018) Robust Dense Mapping for Large-Scale Dynamic Environments. In ICRA, Cited by: §I.
- (2017) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, Cited by: §III-A, §IV-B, §IV-D, TABLE I, TABLE II, TABLE III.
- (2019) TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic using Weighted Interactions. In CVPR, Cited by: §I, §I, §I, §V.
- (2019) Crowd-Robot Interaction: Crowd-aware Robot Navigation with Attention-based Deep Reinforcement Learning. In ICRA, Cited by: §I, §I.
- (2011) Tactical Driver Behavior Prediction and Intent Inference: A Review. In ITSC, Cited by: §V.
- (2015) Learning Driver Behavior Models from Traffic Observations for Decision Making and Planning. IEEE Intelligent Transportation Systems Magazine 7 (1), pp. 69–79. Cited by: §I.
- (2017) Mask R-CNN. In CVPR, Cited by: §III-A, §III-B, §III-C.
- (2019) Spatio-Temporal Action Graph Networks. In ICCVW, Cited by: §II-B.
- (2015) Car that Knows before You Do: Anticipating Maneuvers via Learning Temporal Driving Models. In ICCV, Cited by: §I, §II-A.
- (2017) The Kinetics Human Action Video Dataset. In arXiv preprint arXiv:1705.06950, Cited by: §IV-B.
- (2014) Adam: A Method for Stochastic Optimization. In arXiv preprint arXiv:1412.6980, Cited by: §IV-B.
- (2017) Semi-supervised Classification with Graph Convolutional Networks. In ICLR, Cited by: §I, §II-B, §III-A, §III-D.
- (2000) A driver behavior recognition method based on a driver model framework. In SAE 2000 World Congress, Cited by: §II-A.
- (2019) Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. In arXiv preprint arXiv:1907.01341, Cited by: §II-B, §III-B.
- (2017) DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents. In CVPR, Cited by: §I.
- (2017) Situation Recognition with Graph Neural Networks. In ICCV, Cited by: §I, §II-B.
- (2016) Gated Graph Sequence Neural Networks. In ICLR, Cited by: §II-B.
- (2005) Reliable Method for Driving Events Recognition. IEEE Transactions on Intelligent Transportation Systems 6 (2), pp. 198–205. Cited by: §I, §II-A.
- (2018) Driving Policy Transfer via Modularity and Abstraction. In CoRL, Cited by: §II-B.
- (2018) Semi-supervised Learning: Fusion of Self-supervised, Supervised Learning, and Multimodal Cues for Tactical Driver Behavior Detection. In CVPRW, Cited by: TABLE I.
- (2000) Graphical Models for Driver Behavior Recognition in a SmartCar. In IV, Cited by: §I, §II-A.
- (2018) Intent-aware Multi-agent Reinforcement Learning. In ICRA, Cited by: §I.
- (2018) Toward Driving Scene Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning. In CVPR, Cited by: §I, §II-A, §IV-A, TABLE I.
- (2016) You Only Look Once: Unified, Real-Time Object Detection. In CVPR, Cited by: §III-C.
- (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, Cited by: §III-C.
- (2015) ImageNet Large Scale Visual Recognition Challenge. In IJCV, Cited by: §IV-B.
- (2014) A Novel Goal Oriented Concept for Situation Representation for ADAS and Automated Driving. In ITSC, Cited by: §I.
- (2019) Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. In IV, Cited by: §I.
- (2016) Rethinking the Inception Architecture for Computer Vision. In CVPR, Cited by: §IV-B.
- (2015) Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV, Cited by: §IV-D, TABLE I.
- (2018) Social Attention: Modeling Attention in Human Crowds. In ICRA, Cited by: §I, §I.
- (2018) A 3D Dynamic Scene Analysis Framework for Development of Intelligent Transportation Systems. In IV, Cited by: §I.
- (2019) Deep Object Centric Policies for Autonomous Driving. In ICRA, Cited by: §II-A, §II-A.
- (2018) Videos as Space-Time Region Graphs. In ECCV, Cited by: §II-B, §III-E.
- (2013) Reasoning Based Framework for Driving Safety Monitoring using Driving Event Recognition. IEEE Transactions on Intelligent Transportation Systems 14 (3), pp. 1231–1241. Cited by: §II-A.
- (2019) Learning Actor Relation Graphs for Group Activity Recognition. In CVPR, Cited by: §I, §II-B, §II-B, §III-B, §III-E.
- (2016) End-To-End Learning of Driving Models From Large-Scale Video Datasets. In CVPR, Cited by: §II-A.
- (2019) Temporal Recurrent Networks for Online Action Detection. In ICCV, Cited by: §II-A, TABLE I.
- (2018) Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI, Cited by: §II-B, §III-E.
- (2018) Graph R-CNN for Scene Graph Generation. In ECCV, Cited by: §II-B.
- (2016) A Non-conservatively Defensive Strategy for Urban Autonomous Driving. In ITSC, Cited by: §I.
- (2018) Temporal relational reasoning in videos. In ECCV, Cited by: §III-E.