Video Relation Detection with Trajectory-aware Multi-modal Features
Video relation detection refers to detecting the relationships between different objects in videos, such as spatial relationships and action relationships. In this paper, we present video relation detection with trajectory-aware multi-modal features to solve this task. Considering the complexity of visual relation detection in videos, we decompose the task into three sub-tasks: object detection, trajectory proposal and relation prediction. We use a state-of-the-art object detection method to ensure the accuracy of object trajectory detection, and a multi-modal feature representation to help predict the relations between objects. Our method won first place on the video relation detection task of the Video Relation Understanding Grand Challenge in ACM Multimedia 2020 with 11.74% mAP, surpassing other methods by a large margin.
Video relation detection aims to find all object trajectories in a video and the relations between them, expressed as triplets ⟨subject, predicate, object⟩. It bridges visual and linguistic information, enabling cross-modal information transformation. Compared to other computer vision tasks such as object detection and semantic segmentation, visual relation detection requires not only localizing and categorizing single objects but also understanding the interactions between different objects. To capture the relations between objects, more information about the video content needs to be utilized.
Visual relation detection in videos is much more difficult than in static images. On the one hand, spatio-temporal localization of objects is needed instead of just spatial localization, which requires tracking the same object across frames along the temporal axis. On the other hand, relations between objects become more variable: relations between the same object pair can change over time, and temporally dependent relations are introduced, which makes the relations harder to predict. Thus, it is hard to directly apply existing methods for visual relation detection (Liao et al., 2020) or scene graph generation (Ren et al., 2020) to this task.
Several methods have been proposed to solve the problem. (Shang et al., 2017) first introduced a baseline for video relation detection. First, it divides the video into several segments of the same temporal length using a sliding window. Second, it performs video relation detection in each segment with an object trajectory proposal module and a relation prediction module. Finally, it generates the video-level relation detection results by greedily merging the results of those segments. (Tsai et al., 2019) introduced a Gated Spatio-Temporal Energy Graph model as the relation prediction module of the baseline proposed by (Shang et al., 2017). By constructing a Conditional Random Field on a fully-connected spatio-temporal graph, the statistical dependencies between relational entities, both spatial and temporal, can be better exploited. (Qian et al., 2019) introduced a graph convolutional network into the relation predictor, which takes better advantage of the spatio-temporal context.
In this paper, we propose a method for the video relation detection problem. We follow the scheme of (Shang et al., 2017) and build our system with an object trajectory detector module and a relation predictor module. For the object trajectory detector, we first perform object detection on each video frame with the state-of-the-art detector Cascade R-CNN (Cai and Vasconcelos, 2018) with ResNeSt101 (Zhang et al., 2020) as the backbone. Then we use a Dynamic Programming algorithm improved from seq-NMS (Han et al., 2016) to associate the object detection results of all frames and generate a trajectory for each object. For the relation predictor, we combine the motion feature, visual feature, language feature and location mask feature of each trajectory pair to predict the relation between them. The use of multi-modal features helps to increase the accuracy of relation prediction. The framework of our method is shown in Figure 1. Our method achieved first place on the video relation detection task of the Video Relation Understanding Grand Challenge (Shang et al., 2019b) in ACM Multimedia 2020.
2. Object Trajectory Detection
2.1. Object Detection
We choose Cascade R-CNN as our object detection model and ResNeSt101 as the network backbone. To train the object detector, we extract frames from each video to build the training set and validation set. Because frames in the same video are highly similar, using all frames is unnecessary. Thus, we sample at most 15 key frames, whose bounding boxes are annotated by humans, from each video in the training set of the VidOR dataset (Shang et al., 2019a). The detection training set consists of 97221 images extracted in this way. The validation set consists of 31220 human-labeled frames extracted from the validation set of the VidOR dataset.
During training, we notice a class imbalance issue. Classes with more annotations (adult, child, baby, etc.) reach a high AP of up to 0.7, while classes with fewer annotations (crocodile, frisbee, etc.) get a low AP close to zero. To mitigate this imbalance, we extend our training set with part of the images from the MS COCO dataset, which is more balanced than our training set.
During testing, we perform object detection on all video frames and keep bounding boxes with a confidence score higher than 0.01 as our final detection results.
2.2. Trajectory Generation
We adopt the tracking-by-detection strategy to generate object trajectories. Based on the object detection results of all video frames, we use a Dynamic Programming algorithm improved from seq-NMS to associate bounding boxes that belong to the same object and generate the trajectory. This algorithm consists of two parts: Graph Building and Trajectory Selection. Regarding each bounding box as a node of a graph, we link bounding boxes from consecutive frames that are likely to belong to the same object. Paths in the graph then represent trajectories, and we run the Dynamic Programming algorithm to pick the paths that are most likely to be object trajectories.
Graph Building: First, we regard each bounding box as a node and build the initial graph with no edges. Let $c_i^t$, $s_i^t$, $b_i^t$, $P_i^t$ and $S_i^t$ represent the object category, confidence score, bounding box, set of precursor nodes and set of successor nodes of the $i$-th bounding box in frame $t$. We initially set $P_i^t$ and $S_i^t$ to empty for all nodes. Then, for each pair of consecutive frames $t$ and $t+1$ among the $T$ frames of the video, and for all possible $i$, $j$ such that $b_i^t$ and $b_j^{t+1}$ exist, if $c_i^t$ and $c_j^{t+1}$ are the same and the IoU of $b_i^t$ and $b_j^{t+1}$ is higher than a threshold, we add $b_j^{t+1}$ to $S_i^t$ and add $b_i^t$ to $P_j^{t+1}$. In this way, we link pairs of bounding boxes from consecutive frames whose IoU is higher than the threshold. We set the threshold to 0.2 in our experiments.
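The linking rule above can be illustrated with a minimal Python sketch. The `Node` class, the `iou` helper and the `build_graph` function are hypothetical names introduced here for illustration, and boxes are assumed to be `(x1, y1, x2, y2)` tuples; only the same-category test and the 0.2 IoU threshold come from the text.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """One detected bounding box, used as a graph node (illustrative layout)."""
    frame: int
    category: str
    score: float
    box: tuple  # assumed format: (x1, y1, x2, y2)
    precursors: list = field(default_factory=list, repr=False)
    successors: list = field(default_factory=list, repr=False)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_graph(frames, iou_threshold=0.2):
    """Link same-category boxes in consecutive frames whose IoU exceeds the threshold.

    `frames` is a list with one list of Node objects per video frame.
    """
    for current, following in zip(frames, frames[1:]):
        for u in current:
            for v in following:
                if u.category == v.category and iou(u.box, v.box) > iou_threshold:
                    u.successors.append(v)
                    v.precursors.append(u)
    return frames
```

The nested loop is quadratic in the number of boxes per frame, which is usually acceptable because each frame holds only a limited number of detections.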
We notice that when the camera or an object in the video moves violently, the IoU of bounding boxes that belong to the same object in consecutive frames can be very low. In this case, the original seq-NMS algorithm will not link them, causing the loss of tracking. To solve this problem, we introduce a new linking mechanism. First, for two bounding boxes in different frames, we define scale_ratio and area_ratio as:
Then, for each pair of bounding boxes that satisfies
we create a path from the earlier bounding box to the later one by interpolating a node at each intermediate time step, as shown in Fig. 2. The bounding box of each interpolated node is obtained by linear interpolation of the two linked bounding boxes. The confidence score of each interpolated node is set to 0. With this linking mechanism, the trajectory generation module is more robust to violent movement of the camera and objects. In our experiments, we limit the frame gap between linked bounding boxes to be less than 8 as a trade-off between performance and complexity.
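The interpolation step can be sketched as follows. This is only an illustration: the function name `interpolate_nodes` and the `(x1, y1, x2, y2)` box format are assumptions, and the conditions on scale_ratio and area_ratio that decide whether two boxes are linked are not reproduced here.

```python
def interpolate_nodes(box_start, box_end, t_start, t_end):
    """Create nodes for every frame strictly between t_start and t_end.

    Each interpolated node gets a linearly interpolated box and a confidence
    score of 0, so it connects a path without adding to the path score.
    Returns a list of (frame, box, score) tuples.
    """
    gap = t_end - t_start
    nodes = []
    for t in range(t_start + 1, t_end):
        w = (t - t_start) / gap  # 0 at the start box, 1 at the end box
        box = tuple((1.0 - w) * s + w * e for s, e in zip(box_start, box_end))
        nodes.append((t, box, 0.0))
    return nodes
```

For two boxes in adjacent frames the function returns an empty list, since there is no intermediate frame to fill.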
Trajectory Selection: After building the graph, we regard each full path (a path that cannot be extended) as an object trajectory and take the sum of the confidence scores of its nodes as the score of the path. Then, we repeatedly select the path with the highest score and remove its nodes from the graph. We achieve this with the Dynamic Programming algorithm used in (Han et al., 2016). The trajectories selected by the algorithm are returned as the trajectory detection result.
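A minimal sketch of this select-and-remove loop is given below. It is written in the spirit of seq-NMS rather than as a copy of the implementation in (Han et al., 2016): the function name, the dictionary-based graph representation and the assumption that every successor lies in a later frame are choices made for this illustration.

```python
def select_trajectories(frames, score, successors):
    """Repeatedly extract the highest-scoring path from a frame-ordered graph.

    frames:     list of lists of node ids, in temporal order
    score:      dict mapping node id -> confidence score
    successors: dict mapping node id -> list of node ids in later frames
    Returns a list of (path_score, [node ids]) in selection order.
    """
    alive = {n for frame in frames for n in frame}
    trajectories = []
    while alive:
        best, next_node = {}, {}
        # Process frames backwards so successors are scored before their precursors.
        for frame in reversed(frames):
            for n in frame:
                if n not in alive:
                    continue
                candidates = [s for s in successors.get(n, []) if s in alive]
                if candidates:
                    top = max(candidates, key=best.__getitem__)
                    best[n], next_node[n] = score[n] + best[top], top
                else:
                    best[n], next_node[n] = score[n], None
        # Start from the node whose best path has the highest total score.
        node = max(best, key=best.__getitem__)
        total, path = best[node], []
        while node is not None:
            path.append(node)
            node = next_node[node]
        alive.difference_update(path)
        trajectories.append((total, path))
    return trajectories
```

Each pass recomputes the path scores over the remaining nodes, so the loop terminates once every node has been assigned to a trajectory.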
3. Relation Prediction
Following the scheme of (Shang et al., 2017), we first divide the video into overlapping segments of the same length and perform object trajectory detection in all segments. We set the segment length to 32 frames and the overlap length to 16 frames in our experiments. After that, we predict the relation between all possible object pairs in the same segment.
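The segment layout can be made concrete with a short sketch. The function name `split_segments` is hypothetical, and the handling of trailing frames that do not fill a complete segment is an assumption of this sketch (they are simply dropped), since the text does not specify it.

```python
def split_segments(num_frames, length=32, overlap=16):
    """Return (start, end) frame ranges, end exclusive, for overlapping segments.

    Assumption: frames at the tail that do not fill a whole segment are dropped.
    """
    stride = length - overlap
    return [(start, start + length)
            for start in range(0, num_frames - length + 1, stride)]
```

With the settings from the text, an 80-frame video yields the segments `(0, 32)`, `(16, 48)`, `(32, 64)` and `(48, 80)`.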
To fully capture the video context and temporal movement, we use multi-modal features, including motion feature, visual feature, language feature and location mask feature, to help the relation prediction.
Motion Feature: For a trajectory pair in a 32-frame segment, we first calculate the location feature following the method used in (Sun et al., 2019) for frames 0, 8, 16, 24 and 31. Let $f_t$ be the feature calculated for frame $t$. To capture the relative location of the pair in static frames, we generate the static feature by concatenating $f_0$, $f_8$, $f_{16}$, $f_{24}$ and $f_{31}$. To capture the dynamic movement of the pair, we generate the dynamic feature by concatenating $f_8 - f_0$, $f_{16} - f_8$, $f_{24} - f_{16}$ and $f_{31} - f_{24}$.
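The assembly of the two motion features can be sketched as below. The per-frame location feature itself follows (Sun et al., 2019) and is treated here as a given input. The sketch assumes that the dynamic feature is built from differences between consecutive sampled frames, and `motion_feature` is a hypothetical name.

```python
SAMPLED_FRAMES = (0, 8, 16, 24, 31)


def motion_feature(per_frame_feature):
    """Build static and dynamic motion features for one trajectory pair.

    per_frame_feature maps a frame index to a list of floats (the location
    feature of the pair in that frame, computed elsewhere).
    Assumption: the dynamic feature concatenates differences between
    consecutive sampled frames.
    """
    feats = [per_frame_feature[t] for t in SAMPLED_FRAMES]
    static = [x for f in feats for x in f]
    dynamic = [cur_x - prev_x
               for prev, cur in zip(feats, feats[1:])
               for prev_x, cur_x in zip(prev, cur)]
    return static, dynamic
```

For a per-frame feature of dimension d, the static feature has 5d entries and the dynamic feature has 4d entries.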
Visual Feature: Due to the high cost of extracting features from video with networks like I3D (Carreira and Zisserman, 2017), we choose to extract visual features only from static frames using a 2-D network. Most previous work used the object detection model to extract features for relation prediction. However, a detection model focuses on the category of a single object in the image and cannot properly capture relation information. Thus, we use a scene graph generation model (Tang et al., 2020) pre-trained on the Visual Genome dataset (Krishna et al., 2017) to extract features for relation prediction, which better captures the interaction between objects. We use only the middle frame of the segment to extract features. For each pair, we extract a 4096-d feature for the bounding box of the subject, the bounding box of the object and the union of their bounding boxes, respectively.
Language Feature: For language context, we follow (Sun et al., 2019) to generate a 300-d feature for subject and object category respectively and concatenate them as the final language feature.
Location Mask Feature: Since coordinates have only a limited ability to represent location, we further introduce binary masks of the bounding boxes to better capture the relative location of subject and object. We follow the method of (Zellers et al., 2018) to generate a mask based on the bounding boxes of the subject and object in the middle frame of the segment as an input of the relation predictor.
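One way to picture such a mask is the sketch below, which rasterizes each box onto a small grid and stacks the subject and object masks as two channels. The grid resolution, the two-channel layout and the function names are assumptions made for illustration and are not taken from the text or from (Zellers et al., 2018).

```python
import math


def box_to_mask(box, frame_w, frame_h, size=32):
    """Rasterize an (x1, y1, x2, y2) pixel box onto a size x size binary grid."""
    x1, y1, x2, y2 = box
    c1 = min(size - 1, max(0, int(x1 / frame_w * size)))
    r1 = min(size - 1, max(0, int(y1 / frame_h * size)))
    # Every box covers at least one cell, and never extends past the grid.
    c2 = min(size, max(c1 + 1, math.ceil(x2 / frame_w * size)))
    r2 = min(size, max(r1 + 1, math.ceil(y2 / frame_h * size)))
    return [[1 if r1 <= r < r2 and c1 <= c < c2 else 0 for c in range(size)]
            for r in range(size)]


def pair_location_mask(subject_box, object_box, frame_w, frame_h, size=32):
    """Stack the subject mask and object mask as two channels (assumed layout)."""
    return [box_to_mask(subject_box, frame_w, frame_h, size),
            box_to_mask(object_box, frame_w, frame_h, size)]
```

Unlike four coordinates, the two channels make overlap and relative placement of the boxes directly visible to a convolutional layer.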
3.2. Network Design
Using the features mentioned above as input, we design a simple neural network to predict the relation. The structure of the network is shown in Figure 3.
After analysing the dataset, we find that about 99% of object pairs in the training set have no more than one spatial relation and one action relation. Thus, we convert the multi-label classification problem posed by the VidOR dataset (Shang et al., 2019a) into two single-label classification problems. We use focal loss (Lin et al., 2017) to supervise the spatial label and the action label separately, which deals with the severe class imbalance.
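The supervision scheme can be illustrated with a plain-Python sketch of a single-label focal loss applied separately to the two heads. The softmax formulation, the omission of the class-weighting term, the value `gamma=2.0` and the function names are assumptions of this sketch; the hyperparameters actually used are not stated in the text.

```python
import math


def focal_loss(logits, target, gamma=2.0):
    """Single-label focal loss: -(1 - p_t)^gamma * log(p_t), with softmax p_t."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def relation_losses(spatial_logits, spatial_label, action_logits, action_label):
    """Supervise the spatial and action heads separately (illustrative)."""
    return (focal_loss(spatial_logits, spatial_label),
            focal_loss(action_logits, action_label))
```

With `gamma=0` the expression reduces to ordinary cross-entropy; larger values down-weight examples that are already classified confidently, which is why the loss suits heavily imbalanced label distributions.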
In this section, we present experimental results on the VidOR dataset. We use the official evaluation code of the grand challenge to evaluate our results. More details can be found at https://videorelation.nextcenter.org/mm20-gdc/task1.html.
|Method||mAP||R@50||R@100||tagging P@1||tagging P@5||tagging P@10|
4.1. Component Analysis
Object Trajectory Detection: We adopt Cascade R-CNN with ResNeSt101 as our object detector and the Dynamic Programming algorithm with the cross-frame linking mechanism as the trajectory generator. To demonstrate the effectiveness of our trajectory generation algorithm, we first evaluate it on the optional task, Video Object Detection. Since we did not submit our result for the optional task, we can only compare our result on the validation set of the VidOR dataset with the result of the 2020 first-place entry on the test set. Table 2 shows that we surpass the first place of the optional task in 2020 by a large margin. Second, we compare the video visual relation detection results using dynamic programming with and without the cross-frame linking mechanism (CFLM). The results are shown in Table 3. The cross-frame linking mechanism increases the mAP from 8.84% to 9.93%.
|DeepBlueAI (on test set)||9.66|
|Ours (on validation set)||14.59|
|Ours w/o CFLM||66.59||8.30||8.84|
Relation Prediction: We use multi-modal features to perform relation prediction. We run four ablation experiments, removing the language feature, motion feature, visual feature and location mask feature in turn. As shown in Table 4, our method using all features outperforms all of these variants, which shows the effectiveness of our multi-modal features. We also find that removing a feature does not decrease the mAP as much as removing the cross-frame linking mechanism. This suggests that, for current datasets and methods in video relation detection, robust trajectory detection matters more.
|Ours w/o language||66.70||8.98||9.66|
|Ours w/o motion||67.18||9.01||9.86|
|Ours w/o visual||65.99||8.83||9.50|
|Ours w/o mask||65.75||9.08||9.74|
4.2. Comparison with state-of-the-art
We compare our results with other methods on the VidOR validation set. As shown in Table 5, our method outperforms other methods by a large margin, which demonstrates its effectiveness.
|RELAbuilder (Zheng et al., 2019)||33.05||1.58||1.47|
|MAGUS.Gamma (Sun et al., 2019)||51.20||6.89||6.56|
|VRD-STGC (Liu et al., 2020)||48.92||8.21||6.85|
We use a model ensemble strategy to further improve our result for the challenge task. Table 6 compares our method with other methods on the VidOR test set. We again outperform all other methods by a large margin.
|RELAbuilder (Zheng et al., 2019)||0.55|
|MAGUS.Gamma (Sun et al., 2019)||6.31|
The detailed evaluation scores of our method on the VidOR test set are shown in Table 1.
In this paper, we propose trajectory-aware multi-modal features for video relation detection. Our method achieved 11.74% mAP, ranking first on the video relation detection task of the Video Relation Understanding Grand Challenge in ACM Multimedia 2020.
This work was partially supported by the National Natural Science Foundation of China (Grant 61876177), Beijing Natural Science Foundation (Grant 4202034), Fundamental Research Funds for the Central Universities and Zhejiang Lab (No. 2019KD0AB04).
- conference: Proceedings of the 28th ACM International Conference on Multimedia; October 12–16, 2020; Seattle, WA, USA
- doi: 10.1145/3394171.3416284
- isbn: 978-1-4503-7988-5/20/10
- Zhaowei Cai and Nuno Vasconcelos. 2018. Cascade R-CNN: Delving Into High Quality Object Detection. (2018), 6154–6162.
- Joao Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. (2017), 4724–4733.
- Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S Huang. 2016. Seq-NMS for Video Object Detection. arXiv: Computer Vision and Pattern Recognition (2016).
- Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Lijia Li, David A Shamma, et al. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123, 1 (2017), 32–73.
- Yue Liao, Si Liu, Fei Wang, Yanjie Chen, Chen Qian, and Jiashi Feng. 2020. PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 482–490.
- Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision. 2980–2988.
- Chenchen Liu, Yang Jin, Kehan Xu, Guoqiang Gong, and Yadong Mu. 2020. Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Xufeng Qian, Yueting Zhuang, Yimeng Li, Shaoning Xiao, Shiliang Pu, and Jun Xiao. 2019. Video Relation Detection with Spatio-Temporal Graph. (2019), 84–93.
- Guanghui Ren, Lejian Ren, Yue Liao, Si Liu, Bo Li, Jizhong Han, and Shuicheng Yan. 2020. Scene Graph Generation With Hierarchical Context. IEEE Transactions on Neural Networks and Learning Systems (2020).
- Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. 2019a. Annotating Objects and Relations in User-Generated Videos. (2019), 279–287.
- Xindi Shang, Tongwei Ren, Jingfan Guo, Hanwang Zhang, and Tat-Seng Chua. 2017. Video Visual Relation Detection. (2017), 1300–1308.
- Xindi Shang, Junbin Xiao, Donglin Di, and Tat-Seng Chua. 2019b. Relation Understanding in Videos: A Grand Challenge Overview. In Proceedings of the 27th ACM International Conference on Multimedia. 2652–2656.
- Xu Sun, Tongwei Ren, Yuan Zi, and Gangshan Wu. 2019. Video Visual Relation Detection via Multi-modal Feature Fusion. (2019), 2657–2661.
- Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. 2020. Unbiased Scene Graph Generation from Biased Training. arXiv: Computer Vision and Pattern Recognition (2020).
- Yaohung Hubert Tsai, Santosh K Divvala, Louis-Philippe Morency, Ruslan Salakhutdinov, and Ali Farhadi. 2019. Video Relationship Reasoning Using Gated Spatio-Temporal Energy Graph. (2019), 10424–10433.
- Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural Motifs: Scene Graph Parsing with Global Context. (2018), 5831–5840.
- Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi-Li Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mengnan Li, and Alexander J. Smola. 2020. ResNeSt: Split-Attention Networks. ArXiv abs/2004.08955 (2020).
- Sipeng Zheng, Xiangyu Chen, Shizhe Chen, and Qin Jin. 2019. Relation Understanding in Videos. (2019), 2662–2666.