Dynamic Traffic Scene Classification with Space-Time Coherence
This paper examines the problem of dynamic traffic scene classification under space-time variations in viewpoint that arise from video captured on-board a moving vehicle. Solutions to this problem are important for realization of effective driving assistance technologies required to interpret or predict road user behavior. Currently, dynamic traffic scene classification has not been adequately addressed due to a lack of benchmark datasets that consider spatiotemporal evolution of traffic scenes resulting from a vehicle’s ego-motion. This paper has three main contributions. First, an annotated dataset is released to enable dynamic scene classification that includes 80 hours of diverse high quality driving video data clips collected in the San Francisco Bay area. The dataset includes temporal annotations for road places, road types, weather, and road surface conditions. Second, we introduce novel and baseline algorithms that utilize semantic context and temporal nature of the dataset for dynamic classification of road scenes. Finally, we showcase algorithms and experimental results that highlight how extracted features from scene classification serve as strong priors and help with tactical driver behavior understanding. The results show significant improvement from previously reported driving behavior detection baselines in the literature.
Semantic description and understanding of dynamic road scenes from egocentric video is a central problem in the realization of effective driving assistance technologies required to interpret and predict road user behavior. In the driving context, scene refers to the place where such behaviors occur and includes attributes such as environment (road type), weather, road surface, traffic, and lighting. Scene context features serve as important priors for downstream tasks such as recognition of objects, behavior, action, and intention, as well as robust navigation and localization. For example, crosswalks at intersections are likely places to find pedestrians crossing or waiting to cross. Likewise, knowing that the ego-vehicle is approaching an intersection helps auxiliary modules look for traffic lights and slow down. Effective solutions to the traffic scene classification problem thus provide contextual cues that promise to help driving assistance technologies reach human-level visual understanding and reasoning.
The vast majority of research in scene classification has addressed the problem of single image classification. Recently, dynamic scene classification datasets and associated algorithms have emerged that exploit spatiotemporal features. However, the majority of previous work considers a stationary camera and studies image motion (i.e., spatial displacement of image features) induced by the projected movement of scene elements over time. Typical examples include a rushing river, a waterfall, or the motion of cars on a highway viewed from a surveillance camera. For driving tasks, scene understanding requires a dynamic representation characterized by image motion from moving traffic participants as well as variations in image formation that emerge from the vehicle’s ego-motion. The latter is an important and challenging problem that has not been addressed, primarily due to a lack of related datasets for driving scenes.
To address this gap, this paper examines the dynamic traffic scene classification problem under space-time variations in viewpoint (and therefore scene appearance) that arise from the egocentric formation of images collected from a moving vehicle. In particular, a novel driving scene video dataset is introduced to enable dynamic traffic scene classification. The dataset includes temporal annotations on place, environment (road type), and weather/surface conditions, and explicitly labels the viewpoint variations using multiple levels. Specifically, the place categories are annotated temporally with fine-grained labels such as Approaching (A), Entering (E), and Passing (P), depending on the ego-car’s position relative to the place of interest. An example of this multi-level temporal annotation is depicted in Figures 1 and 2. The example illustrates the view variations (caused by changing distance to the intersection) as a vehicle approaches the scene of interest (i.e., an intersection). The video clip is labeled using the three layers (A, E, P) to highlight the distinct appearance changes and to showcase the proposed fine-grained annotation strategy, which is important for vehicle navigation and localization.
The main contributions of this work are as follows. First, a dataset is released that includes 80 hours of diverse high quality driving video data clips collected in the San Francisco Bay Area (https://usa.honda-ri.com/hsd). The dataset includes temporal annotations for road places, road environment, weather, and road surface conditions. This dataset is intended to promote research in fine-grained dynamic scene classification for driving scenes. The second contribution is the development of machine learning algorithms that utilize the semantic context and temporal nature of the dataset to improve classification results. Finally, we present algorithms and experimental results that showcase how extracted features can serve as strong priors and help with tactical driver behavior understanding.
II Related Work
II-A Driving Datasets
Large-scale public datasets geared towards automated or advanced driver assistance systems [5, 6, 7, 8, 9, 10, 11] and scene understanding [12, 13, 14, 15] have helped the development of algorithms that better understand the scene layout and the behavior of traffic participants. These datasets have limitations: they either do not adequately support dynamic scene classification or provide only a non-exhaustive list of driving scene classes. Several datasets provide pixel-wise annotations for semantic segmentation [14, 12, 15, 13]. While semantic segmentation models are useful for parsing the scene, they do not reliably yield the type of scene, so separate models would need to be developed to aggregate the segmentation outputs and infer the scene type. Other datasets support understanding of ego and participant behaviors in driving scenes [6, 9], but they do not have an exhaustive list of scene classes. The datasets most similar to ours are described in [11, 16]. While they provide labels for scene and weather classification, they are more focused on image retrieval and localization problems.
II-B Scene and Weather Classification
The MIT Places dataset and the Large-scale Scene Understanding (LSUN) dataset were introduced to benchmark several deep learning based classification algorithms. Our dataset serves a similar purpose for traffic scenes, and it additionally supports temporal classification to benchmark algorithms that are robust to spatio-temporal variations of the same scene. Moreover, driving scenes have an unbalanced class distribution along with less inter-class variation, making classification much more challenging. For example, fine-grained classification of 3-way and 4-way intersections from a single frontal camera view is very challenging due to the small scene variations between these two classes.
Existing frame-based scene and weather classification methods can be grouped as follows: adding semantic segmentation and contextual information [17, 18, 19], using hand-crafted features [20, 4, 21], using multi-resolution features [22, 23], or using multi-sensor fusion [24, 25]. Given the success and superior performance of deep learning classification methods, we elected to use a learning-based approach and to experiment with adding semantic segmentation and temporal feature aggregation to improve the results.
| BDD-Nexar | N | Sem. Segmentation | U, H | Y |
| DR(eye)VE | Y | Driver Behavior | U | Y |
II-C Temporal Classification
Video classification and human activity recognition tasks [27, 28, 29] have helped develop various state-of-the-art deep learning methods for temporal aggregation. These methods aggregate spatio-temporal features through Long Short-Term Memory (LSTM) modules or Temporal Convolution Networks (TCN). While such methods help activity recognition tasks by understanding object-level motion primitives, they do not translate directly to temporal scene classification. In fact, frame-based result averaging might be more suitable for our problem. Moreover, the entire scene is the focus of our task, not just the human actor.
Recently, work has been done on region proposal generation [32, 33], where two-stream architectures are used to generate the start and end time of the event as well as the class activity. Our work is inspired by these methods. Our best model is a two-stream architecture that decouples the region proposal and classification tasks. Specifically, we use the proposal generator to trim the untrimmed video and aggregate the features to classify the entire trimmed segment. This method outperforms simple frame-based averaging techniques. For example, it is better to decide whether the class is a 4-way intersection by looking at the segment (approaching, entering, and passing) in its entirety rather than on a per-frame basis. This helps the model parse the same intersection from various viewpoints. Details of the proposed method are provided in Section IV.
III Overview of the Honda Scenes Dataset
III-A Data Collection Platform
Our collection platform involves two instrumented vehicles, each with different camera hardware and setup. The first vehicle carries three NVIDIA RCCB (60° FOV) cameras and one NVIDIA RCCB (100° FOV) camera. The second vehicle carries two Point Grey Grasshopper (80° FOV) cameras and one Grasshopper (100° FOV) camera. This varied setup enables development of algorithms that generalize better across camera hardware and positioning. The cameras cover an approximately 160-degree frontal view.
Data was collected around the San Francisco Bay Area over the course of six months under different weather and lighting conditions. Urban, Residential (Local), Highway, Ramp, and Rural areas are covered in this dataset. Different routes were taken for each recording session to avoid overlap in the scenes. Moreover, targeted data collection was done to reduce the impact of unbalanced class distribution for rare classes such as railway crossing or rural scenes. The total size of the post-processed dataset is approximately 60 GB and 80 video hours. The videos are converted to a resolution of at fps.
III-B Dataset Statistics and Comparison
Table I shows the overall comparison with other datasets. Our dataset is the only large-scale driving dataset for the purpose of driving scene understanding. The dataset was annotated with an exhaustive list of classes typical of driving scenarios. Three annotators label each task and cross-check results to ensure quality. Intermediate models are trained to check the annotations and to scale the dataset with a human in the loop. The ELAN annotation tool is used to annotate videos at multiple levels. The levels include Road Places, Road Environment (Road Types), Weather, and Road Surface condition. The data is split into training and validation sets so as to avoid any geographical overlap. This enforces generalization of models to new unseen areas, changes in lighting condition, and changes in viewpoint orientation. Further details about the class distribution are described below.
Road Places and Environment: The 80 hours of video clips are annotated with Road Place and Road Environment labels in a hierarchical and causal manner. There are three levels in the hierarchy. Road Environment is annotated at the top level, followed by the Road Place classes at the middle level and the fine-grained annotations (approaching, entering, passing) at the bottom level. This forms a descriptive dataset that allows our algorithms to learn the interdependencies between the levels. The Road Environment labels include urban, local, highway, and ramps. The local label covers residential scenes, which typically have less traffic and more driveways than urban scenes. The ramps class generally appears at highway exits and covers connectors between two highways or between a highway and other road types.
Each fine-grained annotation is clearly defined based on the view from the ego-vehicle. The three-way, four-way, and five-way intersections each have approaching, entering, and passing labels based on the ego-vehicle’s position relative to the stop line, traffic signal, and/or stop sign. Similarly, construction zones, rail crossings, overhead bridges, and tunnels are labeled based on the ego-vehicle’s position relative to the construction event, railway tracks, overhead bridge, and tunnel, respectively. Since the notion of entering does not exist or is too abrupt for the lane merge, lane branch, and zebra crossing classes, these categories are annotated with only the approaching and passing fine-grained labels. The overall class distribution is illustrated in Figure 3.
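The three-level annotation scheme described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the identifier strings, the `Annotation` class, and the `flat_label` helper are hypothetical names chosen here and are not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import Tuple

# Top level of the hierarchy: Road Environment labels listed in the text.
ENVIRONMENTS = ("urban", "local", "highway", "ramp")

# Middle level: Road Places that carry all three fine-grained stages.
THREE_STAGE_PLACES = (
    "three_way_intersection", "four_way_intersection", "five_way_intersection",
    "construction_zone", "rail_crossing", "overhead_bridge", "tunnel",
)
# Places for which "entering" does not exist or is too abrupt.
TWO_STAGE_PLACES = ("lane_merge", "lane_branch", "zebra_crossing")


def valid_stages(place: str) -> Tuple[str, ...]:
    """Return the fine-grained stages allowed for a given Road Place."""
    if place in THREE_STAGE_PLACES:
        return ("approaching", "entering", "passing")
    if place in TWO_STAGE_PLACES:
        return ("approaching", "passing")
    raise ValueError(f"unknown place: {place!r}")


@dataclass(frozen=True)
class Annotation:
    """One temporal segment with its three hierarchy levels (hypothetical schema)."""
    start_s: float
    end_s: float
    environment: str
    place: str
    stage: str

    def __post_init__(self) -> None:
        if self.end_s <= self.start_s:
            raise ValueError("end_s must be greater than start_s")
        if self.environment not in ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment!r}")
        if self.stage not in valid_stages(self.place):
            raise ValueError(f"stage {self.stage!r} is not defined for {self.place!r}")

    @property
    def flat_label(self) -> str:
        """Stage and place combined into a single class name."""
        return f"{self.stage}_{self.place}"


segment = Annotation(12.0, 15.5, "urban", "four_way_intersection", "entering")
```

Rejecting an "entering" stage for lane merge, lane branch, and zebra crossing at construction time mirrors the annotation rule stated in the text.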
Weather and Road Surface Condition: Due to the lack of snow weather conditions in the San Francisco Bay Area, a separate targeted data collection was performed in Japan specifically for snow weather and snow surface conditions. This also helps the weather and surface prediction models generalize well to different places and road types. Video sequences are semi-automatically labeled using weather data and GPS information before further processing and quality checking by human annotators.
The temporal annotations for weather contain the classes clear, overcast, snowing, raining, and foggy. The road surface has dry, wet, and snow labels. Only frames with sufficient snow coverage on the road (more than 50%) are labeled with the snow surface condition. This keeps the road surface labels and the weather condition labels mutually exclusive. While we do provide temporal annotations, only sampled frames are used for our experiments. When predicting the conditions on untrimmed test videos, the results are averaged over a temporal window, as these conditions do not change drastically from frame to frame. Figure 3 shows the distribution of images over classes for weather and road surface conditions.
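The temporal averaging of frame-level predictions mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `smooth_predictions`, the centered window, and the window length are assumptions, since the text does not state how the window is defined.

```python
from typing import List, Sequence


def smooth_predictions(frame_probs: Sequence[Sequence[float]], window: int = 5) -> List[int]:
    """Average per-frame class probabilities over a centered temporal window
    and return the arg-max class index for every frame.

    `window` is a hypothetical setting; its value is not reported in the text.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    n = len(frame_probs)
    half = window // 2
    labels = []
    for t in range(n):
        # Clip the window at the start and end of the video.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        num_classes = len(frame_probs[t])
        mean = [
            sum(frame_probs[i][c] for i in range(lo, hi)) / (hi - lo)
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=mean.__getitem__))
    return labels


# Toy example with two classes (0 = dry, 1 = wet): the isolated "wet" spike
# at frame 2 is outvoted by its neighbours once a 3-frame window is used.
probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.85, 0.15]]
smoothed = smooth_predictions(probs, window=3)
```

With `window=1` the function reduces to plain per-frame arg-max, which makes the effect of the averaging easy to compare.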
This section describes the proposed methods for dynamic road scene classification for holistic scene understanding with respect to an ego vehicle driving on a road. Our proposed methods are able to predict road weather, road surface condition, road environment and road places.
All experiments are based on the resnet50 model. It should be noted that the proposed methods can be applied using any base Convolutional Neural Network (CNN). Any performance improvement in the base CNN could transfer directly to a performance improvement in our methods. These models run on the NVIDIA P100 at 10 fps.
Road Weather and Road Surface Condition: To classify weather and road surface, we chose to train a frame-based resnet50 model. For weather and road surface, approximately images of each class were used to fine-tune models pre-trained on the places365 dataset. Since foggy weather is a rare class, it was not used in our current set of experiments to avoid an unbalanced class distribution. As a first experiment, we fine-tuned a resnet50 pre-trained on the places365 dataset. The weather category is independent of the traffic participants in the scene. Therefore, the base resnet50 model was also fine-tuned on images in which traffic participants were masked out, as shown in Figure 4. A semantic segmentation model based on Deeplab was used to segment and mask the traffic participants, allowing the model to focus on the scene. The results are presented in Table II and illustrate that semantic masking improves performance.
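The masking step can be illustrated with a short sketch that blanks out every pixel whose segmentation label belongs to a traffic participant. The class ids in `PARTICIPANT_IDS`, the black fill value, and the function name are assumptions made for illustration; the actual ids depend on the label set of the segmentation model, and the text does not specify the fill value.

```python
from typing import FrozenSet, List, Sequence, Tuple

Pixel = Tuple[int, int, int]

# Hypothetical segmentation class ids for traffic participants (for example
# person, car, truck); real ids depend on the segmentation model's label set.
PARTICIPANT_IDS: FrozenSet[int] = frozenset({11, 13, 14})


def mask_participants(
    image: Sequence[Sequence[Pixel]],
    labels: Sequence[Sequence[int]],
    participant_ids: FrozenSet[int] = PARTICIPANT_IDS,
    fill: Pixel = (0, 0, 0),
) -> List[List[Pixel]]:
    """Return a copy of `image` with traffic-participant pixels replaced by `fill`."""
    if len(image) != len(labels):
        raise ValueError("image and label map must have the same height")
    masked = []
    for img_row, lab_row in zip(image, labels):
        if len(img_row) != len(lab_row):
            raise ValueError("image and label map must have the same width")
        masked.append(
            [fill if lab in participant_ids else tuple(px) for px, lab in zip(img_row, lab_row)]
        )
    return masked


# 2x2 toy image: the top-right and bottom-left pixels belong to participants.
image = [[(10, 20, 30), (40, 50, 60)], [(70, 80, 90), (100, 110, 120)]]
labels = [[0, 13], [11, 0]]
masked = mask_participants(image, labels)
```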
| RGBS (4 channel) | 0.89 | 0.81 | 0.34 | 0.13 | 0.54 |
| S (1 channel) | 0.90 | 0.81 | 0.24 | 0.25 | 0.55 |
Road Environment: For Road Environment, experiments were performed with resnet50 pre-trained on the places365 dataset. As in the weather and road surface experiments, the input to the model was progressively changed with no change to the training protocol. More specifically, experiments were conducted on the original input images (RGB), images concatenated with semantic segmentation (RGBS), images with traffic participants masked using semantic segmentation (RGB-masked), and finally a one-channel semantic segmentation image alone (S). The class-wise results are shown in Table III.
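The RGBS and S input variants can be sketched as simple channel operations on an image and its segmentation label map. How the segmentation channel is encoded is not stated in the text; using the raw class id as the channel value, and the helper names `to_rgbs` and `to_s`, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def to_rgbs(
    image: Sequence[Sequence[Tuple[int, int, int]]],
    labels: Sequence[Sequence[int]],
) -> List[List[Tuple[int, int, int, int]]]:
    """Concatenate the segmentation label as a fourth channel: (R, G, B, S)."""
    if len(image) != len(labels) or any(len(a) != len(b) for a, b in zip(image, labels)):
        raise ValueError("image and label map must have the same shape")
    return [
        [tuple(px) + (lab,) for px, lab in zip(img_row, lab_row)]
        for img_row, lab_row in zip(image, labels)
    ]


def to_s(labels: Sequence[Sequence[int]]) -> List[List[Tuple[int]]]:
    """Keep only the segmentation map as a one-channel image."""
    return [[(lab,) for lab in row] for row in labels]


image = [[(10, 20, 30), (40, 50, 60)]]
labels = [[0, 13]]
rgbs = to_rgbs(image, labels)   # 4-channel input
s_only = to_s(labels)           # 1-channel input
```

A network pre-trained on three-channel images would need its first convolution adapted to accept four or one input channels; how that was done is not described in the text.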
Interestingly, while RGB-masked images show the best overall performance, semantic segmentation alone outperforms plain RGB images, especially for the ramp class. This might be because scene structure is sufficient to capture the curved and uncluttered nature of highway ramps. However, while decomposing the images into scene semantics allows the model to learn valuable structural information, it loses important texture cues about the type of buildings and driveways. Hence there is considerable confusion between the local and urban classes, resulting in lower performance on the local class.
Road Places: Similar to the road environment experiments, RGB, RGBS, RGB-masked, and S inputs were used to fine-tune a resnet50 model pre-trained on places365 for road places. In these frame-based experiments, the approaching, entering, and passing sub-classes are treated as separate classes, i.e., approaching 3-way intersection, entering 3-way intersection, and passing 3-way intersection are treated as three different classes. In addition to the frame-based experiments, standard LSTM architectures were added to our best frame-based models. Such models can capture the temporal aspect of our labels (approaching, entering, passing). While the LSTM and Bi-LSTM models improve our results, we hypothesize that decoupling the rough locality (approaching, entering, passing) from the event class might help our models better understand the scene.
Hence we propose a two-stream architecture for event proposal and prediction, as depicted in Figure 5. The event proposal network proposes candidate frames for the start and end of each event. It uses a classification network that predicts approaching, entering, and passing as class labels, which allows the model to learn temporal cues such as the fact that approaching is always followed by entering and then passing. These candidate frames are then sent as an event window to the prediction network. The prediction module aggregates all frames in this window through global average pooling and produces a single class label for the entire event. The prediction module is similar to the R2D model. During testing, we first segment out the event windows as proposals and then perform the final event classification using the event prediction module.
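The test-time flow described above (propose event windows, pool features over each window, classify once per window) can be sketched in a few lines. This is a simplified illustration under stated assumptions: per-frame stage labels and feature vectors are taken as given, windows are formed from runs of consecutive non-background frames, and the `BACKGROUND` label, function names, and toy classifier are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

BACKGROUND = "background"  # hypothetical label for frames outside any event
Window = Tuple[int, int]   # [start, end) frame indices


def propose_windows(frame_labels: Sequence[str]) -> List[Window]:
    """Group consecutive non-background frames into event windows."""
    windows: List[Window] = []
    start = None
    for t, label in enumerate(frame_labels):
        if label != BACKGROUND and start is None:
            start = t
        elif label == BACKGROUND and start is not None:
            windows.append((start, t))
            start = None
    if start is not None:
        windows.append((start, len(frame_labels)))
    return windows


def average_pool(features: Sequence[Sequence[float]], window: Window) -> List[float]:
    """Global average pooling of per-frame feature vectors over one window."""
    start, end = window
    dim = len(features[start])
    return [sum(features[t][d] for t in range(start, end)) / (end - start) for d in range(dim)]


def classify_events(
    frame_labels: Sequence[str],
    features: Sequence[Sequence[float]],
    classifier: Callable[[List[float]], str],
) -> List[Tuple[Window, str]]:
    """Produce a single class label for each proposed event window."""
    return [(w, classifier(average_pool(features, w))) for w in propose_windows(frame_labels)]


# Toy run: one-dimensional features and a stand-in threshold classifier.
stages = ["background", "approaching", "approaching", "entering", "passing",
          "background", "approaching", "passing"]
feats = [[0.0], [1.0], [1.0], [3.0], [3.0], [0.0], [-1.0], [-3.0]]
toy_classifier = lambda v: "4-way intersection" if v[0] > 0 else "lane merge"
events = classify_events(stages, feats, toy_classifier)
```

Because the label is decided from the pooled window rather than from individual frames, a single noisy frame inside an event cannot flip the event class on its own.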
A summary of our results is shown in Table IV. Note that for the temporal experiments, while different input data were used, only our best results (with RGB-masked input) are displayed. The performance of our model is worse than that of the BiLSTM model for the Branch and Merge classes, possibly because these events are very short and the feature averaging done by the prediction module does not help.
| Event Proposal (ours) | 0.285 |
IV-B Implementation Details
All resnet50 models fine-tuned in this paper were pre-trained on the places365 dataset. Data augmentation (random flips, random resize, and random crop) was performed to reduce over-fitting. All experiments were performed on an NVIDIA P100. All videos were sampled at 3 Hz to obtain the frames used in the experiments. The SGD optimizer was used for the frame-based experiments and the Adam optimizer was used for the LSTM-based experiments.
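Sampling frames at 3 Hz from a video recorded at a higher frame rate amounts to picking evenly spaced frame indices. The sketch below shows one way to do this; the function name and the 30 fps source rate used in the example are assumptions, since the native frame rate of the videos is not given here.

```python
from typing import List


def sample_frame_indices(num_frames: int, video_fps: float, target_hz: float = 3.0) -> List[int]:
    """Return indices of frames to keep so that roughly `target_hz` frames
    per second are retained from a video recorded at `video_fps`."""
    if video_fps <= 0 or target_hz <= 0:
        raise ValueError("frame rates must be positive")
    if target_hz > video_fps:
        raise ValueError("target_hz cannot exceed video_fps")
    step = video_fps / target_hz  # source frames per sampled frame
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices


# Two seconds of a hypothetical 30 fps video yield six frames at 3 Hz.
kept = sample_frame_indices(num_frames=60, video_fps=30.0)
```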
IV-C Visualization of Learned Representations
It has been shown that even CNNs trained on just image labels have localization ability [38, 39, 40]. Here, we use one such method, class activation maps, to show the localization ability of our models. Figure 6 shows some localizations produced by our place, weather, and surface classification CNNs.
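For a network that ends in global average pooling followed by a linear layer, as resnet50 does, a class activation map is the weighted sum of the last convolutional feature maps, using the linear-layer weights of the class of interest. The sketch below illustrates that computation on nested lists; the function names and the min-max normalization used for display are choices made here for illustration.

```python
from typing import List, Sequence

Map2D = List[List[float]]


def class_activation_map(
    feature_maps: Sequence[Sequence[Sequence[float]]],  # K x H x W
    class_weights: Sequence[float],                      # K weights of one class
) -> Map2D:
    """Compute CAM(y, x) = sum_k w_k * f_k(y, x) for a single class."""
    if len(feature_maps) != len(class_weights):
        raise ValueError("need one weight per feature map")
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]


def normalize(cam: Map2D) -> Map2D:
    """Rescale a map to [0, 1] for visualization (all zeros if the map is flat)."""
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Two 2x2 feature maps: the first fires top-left, the second bottom-right.
fmaps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
cam = class_activation_map(fmaps, class_weights=[1.0, 0.5])
heat = normalize(cam)
```

In practice the low-resolution map is upsampled to the input image size before being overlaid on the frame.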
V Behavior Understanding
The Honda Research Institute Driving Dataset (HDD) was released to enable research on naturalistic driver behavior understanding. The dataset includes 104 hours of driving using a highly instrumented vehicle. Ego-vehicle driving behaviors such as left turn, right turn, and lane merge are annotated. A CNN + LSTM architecture, as shown in Figure 7, was proposed to classify the behavior temporally. Owing to the inclusion of the vehicle’s Controller Area Network (CAN) bus signal, which records various signals such as steering angle and speed, the results for left turn and right turn were significantly higher than those for difficult actions such as lane change, lane merge, and branch. Hence the current architecture, which infers actions directly from the image, fails to capture important cues.
We propose that using an intermediate scene context feature representation helps the model attend to important cues in the scene, as evidenced by our attention maps in Figure 6. For a fair comparison, we use the same RGB images to extract intermediate representations from a frame-based resnet50 model trained on our dataset. This corresponds to the first row of Table IV. While replacing the input with our scene context features, we keep the rest of the architecture and the training protocol exactly the same. Since we only replace the model weights, the number of parameters does not change.
As shown in Table V, scene context features improve the overall mean average precision (mAP), especially for rare and difficult classes. Although our model is trained on a different dataset, scene context features embed a better representation than direct image features. Since the model is able to describe the scene and attend to different regions, it associates actions with scenes more reliably. For example, a lane branch action occurs only where a lane branch is possible, and a U-turn generally occurs at intersections; a detection elsewhere is likely a false positive. Hence our dataset and pre-trained models can serve as priors (false positive removers) and descriptive scene cues (soft attention to the scene) for other driving-related tasks.
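For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision per class and average it over classes. The exact AP variant used for the reported numbers is not specified here, so this ranking-based definition, the function names, and the toy scores are assumptions.

```python
from typing import Dict, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of the precision values at the rank of each positive sample,
    with samples ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    per_class: Dict[str, Tuple[Sequence[float], Sequence[int]]]
) -> float:
    """Average the per-class AP values over all classes."""
    aps = [average_precision(s, l) for s, l in per_class.values()]
    return sum(aps) / len(aps)


# Toy scores for two behavior classes (values are made up for illustration).
toy = {
    "left lane change": ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    "right lane branch": ([0.2, 0.9], [0, 1]),
}
ap_left = average_precision(*toy["left lane change"])
m_ap = mean_average_precision(toy)
```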
| Left lane change | 0.42 | 0.55 |
| Right lane change | 0.23 | 0.52 |
| Left lane branch | 0.25 | 0.47 |
| Right lane branch | 0.01 | 0.17 |
In this paper, we introduced a novel traffic scene dataset and proposed algorithms that utilize the spatio-temporal nature of the dataset for various classification tasks. We demonstrated experimentally that hard attention through semantic segmentation helps scene classification. For the various scene classes studied in this paper, we showed that the motion of scene elements captured by our temporal models provides a more descriptive representation of traffic scene video than static appearance alone.
Our models perform better than the conventional CNN + LSTM architectures used for temporal activity recognition. Furthermore, we have shown the importance of weak object localization for various classification tasks. Experimental observations based on trained models show that dynamic classification of road scenes provides important priors for higher-level understanding of driving behavior.
In future work, we plan to explore annotation of object boundaries for rare classes to achieve better supervision for attention. Finally, work is ongoing on developing models that are causal and can support multiple outputs to address issues with place classes which are not mutually exclusive (e.g. construction zone at an intersection).
We would like to thank our colleagues Yi-Ting Chen, Haiming Gang, Kalyani Polagani, and Kenji Nakai for their support and input.
References
[1] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[2] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
[3] N. Shroff, P. Turaga, and R. Chellappa, “Moving vistas: Exploiting motion for describing scenes,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 1911–1918, IEEE, 2010.
[4] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Dynamic scene recognition with complementary spatiotemporal features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2389–2401, 2016.
[5] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3354–3361, IEEE, 2012.
[6] A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, “Car that knows before you do: Anticipating maneuvers via learning temporal driving models,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3182–3190, 2015.
[7] Y. Chen, J. Wang, J. Li, C. Lu, Z. Luo, H. Xue, and C. Wang, “LiDAR-Video Driving Dataset: Learning Driving Policies Effectively,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5870–5878, 2018.
[8] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video datasets,” arXiv preprint, 2017.
[9] V. Ramanishka, Y.-T. Chen, T. Misu, and K. Saenko, “Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7699–7707, 2018.
[10] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, “BDD100K: A diverse driving video database with scalable annotation tooling,” arXiv preprint arXiv:1805.04687, 2018.
[11] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 year, 1000 km: The Oxford RobotCar dataset,” The International Journal of Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.
[12] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, “The ApolloScape dataset for autonomous driving,” arXiv preprint arXiv:1803.06184, 2018.
[13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223, 2016.
[14] V. Madhavan and T. Darrell, The BDD-Nexar Collective: A Large-Scale, Crowdsourced, Dataset of Driving Scenes. Master’s thesis, EECS Department, University of California, Berkeley, 2017.
[15] G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes,” in ICCV, pp. 5000–5009, 2017.
[16] S. Garg, N. Suenderhauf, and M. Milford, “Don’t Look Back: Robustifying Place Categorization for Viewpoint- and Condition-Invariant Place Recognition,” in IEEE International Conference on Robotics and Automation (ICRA), 2018.
[17] J. Yao, S. Fidler, and R. Urtasun, “Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 702–709, IEEE, 2012.
[18] X. Li, Z. Wang, and X. Lu, “A Multi-Task Framework for Weather Recognition,” in Proceedings of the 2017 ACM on Multimedia Conference, pp. 1318–1326, ACM, 2017.
[19] D. Lin, C. Lu, H. Huang, and J. Jia, “RSCM: Region Selection and Concurrency Model for Multi-Class Weather Recognition,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4154–4167, 2017.
[20] A. Bolovinou, C. Kotsiourou, and A. Amditis, “Dynamic road scene classification: Combining motion with a visual vocabulary model,” in Information Fusion (FUSION), 2013 16th International Conference on, pp. 1151–1158, IEEE, 2013.
[21] I. Sikirić, K. Brkić, J. Krapac, and S. Šegvić, “Image representations on a budget: Traffic scene classification in a restricted bandwidth scenario,” in 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp. 845–852, IEEE, 2014.
[22] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, “Knowledge guided disambiguation for large-scale scene classification with multi-resolution CNNs,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2055–2068, 2017.
[23] F.-Y. Wu, S.-Y. Yan, J. S. Smith, and B.-L. Zhang, “Traffic scene recognition based on deep CNN and VLAD spatial pyramids,” in Machine Learning and Cybernetics (ICMLC), 2017 International Conference on, vol. 1, pp. 156–161, IEEE, 2017.
[24] P. Jonsson, “Road condition discrimination using weather data and camera images,” in Intelligent Transportation Systems (ITSC), 2011 14th International IEEE Conference on, pp. 1616–1621, IEEE, 2011.
[25] P. Jonsson, J. Casselgren, and B. Thörnberg, “Road surface status classification using spectral analysis of NIR camera images,” IEEE Sensors Journal, vol. 15, no. 3, pp. 1641–1656, 2015.
[26] A. Palazzi, D. Abati, S. Calderara, F. Solera, and R. Cucchiara, “Predicting the Driver’s Focus of Attention: the DR(eye)VE Project,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[27] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
[28] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497, 2015.
[29] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941, 2016.
[30] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694–4702, 2015.
[31] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in CVPR, 2018.
[32] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar, “Rethinking the Faster R-CNN Architecture for Temporal Action Localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139, 2018.
[33] H. Xu, A. Das, and K. Saenko, “R-C3D: Region convolutional 3D network for temporal activity detection,” in IEEE Int. Conf. on Computer Vision (ICCV), pp. 5794–5803, 2017.
[34] H. Brugman, A. Russel, and X. Nijmegen, “Annotating Multi-media/Multi-modal Resources with ELAN,” in LREC, 2004.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[36] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[37] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
[39] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in ICCV, pp. 618–626, 2017.
[40] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down neural attention by excitation backprop,” International Journal of Computer Vision, vol. 126, no. 10, pp. 1084–1102, 2018.