Progressive Relation Learning for Group Activity Recognition
Group activities usually involve spatio-temporal dynamics among many interactive individuals, while only a few participants at several key frames essentially define the activity. Therefore, effectively modeling the group-relevant actions (and interactions) and suppressing the irrelevant ones are vital for group activity recognition. In this paper, we propose a novel method based on deep reinforcement learning to progressively refine the low-level features and high-level relations of group activities. Firstly, we construct a semantic relation graph (SRG) to explicitly model the relations among persons. Then, two agents following policies derived from two Markov decision processes are applied to progressively refine the SRG. Specifically, a feature-distilling (FD) agent in a discrete action space refines the low-level spatio-temporal features by distilling the most informative frames. A relation-gating (RG) agent in a continuous action space adjusts the high-level semantic graph to pay more attention to group-relevant relations. The SRG, FD agent, and RG agent are optimized alternately to mutually boost each other's performance. Extensive experiments on two widely used benchmarks demonstrate the effectiveness and superiority of the proposed approach.
Group activity recognition, which refers to discerning activities that involve a large number of interactive individuals (\eg, up to 12 persons in a volleyball game), has attracted growing interest in the computer vision community [6, 27, 26, 30, 18]. Unlike conventional video action recognition, which only concentrates on the spatio-temporal dynamics of one or two persons, group activity recognition further requires understanding the group-relevant interactions among many individuals.
In the past few years, a series of approaches combined hand-crafted features with probabilistic graphical models [4, 13, 21]. More recently, LSTMs, structural RNNs, and message passing neural networks (MPNNs) have been applied to model the interactions among persons, sub-groups, and groups [18, 27, 2]. The interaction relations in these methods are implicitly contained in the ordered RNNs or in the passing messages of the MPNN. Moreover, not all existing relations are relevant to the group activity, and the pairwise relations may contain many edges coupled with spurious noise, such as cluttered background, inaccurate human detection, and interactions between outlier persons (\eg, the “Waiting” person in Fig. 1). Because the relations in previous methods are modeled implicitly, it is impossible to determine how relevant a specific relation is to the group activity.
In addition, although a large number of persons may be involved in a group activity, usually only a few actions or interactions in several key frames essentially define it. Yan \etal heuristically defined the key participants as the ones with “long motion” and “flash motion”. Qi \etal applied a “self-attention” mechanism to attend to important persons and key frames. Nevertheless, these methods are limited to the coarse person level and have not dug into the fine-grained relation level to consider which relations are vital (\eg, regulating 15 pairwise relations is more fine-grained than attending to 6 persons).
To move beyond such limitations, we propose a progressive relation learning approach for group activity recognition. Firstly, the semantic relations are explicitly modeled in a graph. Then, as illustrated in Fig. 1, two agents progressively refine the low-level spatio-temporal features and the high-level semantic relations of group activities. Specifically, at the feature level, a feature-distilling agent explores a policy to distill the most informative frames of the low-level spatio-temporal features. At the relation level, a relation-gating agent further refines the high-level semantic relation graph to attend to the group-relevant relations.
In summary, the contributions of this paper are: 1) We propose a novel progressive relation learning method for group activity recognition, which is the first attempt to apply deep reinforcement learning to this task. 2) We explicitly model group activities through a semantic relation graph and propose an RG agent to refine the relations in the high-level semantic graph. 3) An FD agent is also proposed to further filter the frames of the low-level spatio-temporal features used for constructing the relation graph. 4) The proposed approach outperforms the state-of-the-art methods on two widely used benchmarks.
2 Related Work
Reinforcement Learning. Reinforcement learning (RL) has benefited many fields of computer vision, such as image cropping  and visual semantic navigation . Regarding the optimization policy, RL methods can be categorized into value-based methods, policy-based methods, and their hybrids. The value-based methods (\eg, deep Q-learning ) are good at solving problems in low-dimensional discrete action spaces, but they fail in high-dimensional continuous spaces. Although the policy-based methods (\eg, policy gradient ) can deal with problems in continuous spaces, they suffer from high variance in gradient estimation. The hybrid methods, such as Actor-Critic algorithms , combine their advantages and are applicable to both discrete and continuous action spaces. Moreover, by exploiting asynchronous updating, the Asynchronous Advantage Actor-Critic (A3C) algorithm  largely improves training efficiency. Therefore, we adopt the A3C algorithm to optimize both our RG agent in a continuous action space and our FD agent in a discrete action space.
Graph Neural Network. Owing to its advantages in representing and reasoning over structured data, the graph neural network (GNN) has attracted increasing attention [29, 28, 9]. The graph convolutional network (GCN) generalizes the CNN to graphs and can therefore handle non-Euclidean data . It has been widely applied in computer vision, \eg, for point cloud classification  and action recognition . Another class of GNN combines graphs with RNNs, in which each node captures the semantic relations and structured information of its neighbors through multiple iterations of message passing and updating, \eg, the message-passing neural network  and the graph network block . In the former class (\ie, GCN), each relation is represented by a scalar in the adjacency matrix, which is not adequate for modeling the complex context information in group activities. Therefore, our semantic relation graph is built under the umbrella of the latter class, in which each relation is explicitly represented by a learnable vector.
3.1 Individual Feature Extraction
Following , the person bounding boxes are first obtained with the object tracker in the Dlib library . As shown in Fig. 2, the visual feature (\eg, appearance and pose) of each person is extracted with a convolutional neural network (called Person-CNN). Then, the spatial visual features are fed into a long short-term memory network (called Person-LSTM) to model the individual temporal dynamics . Finally, we concatenate the stacked visual features and temporal dynamics of all persons as the basic spatio-temporal features. These basic representations contain no context information, such as person-to-person, person-to-group, and group-to-group interactions. Besides, the spatial distance vectors and direction vectors between each pair of persons are concatenated as the original interaction features, whose components are the displacements along the horizontal and vertical axes, respectively.
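As an illustration of the pairwise interaction features, the sketch below encodes each ordered pair of person box centers by its displacement and Euclidean distance; this encoding is our simplified stand-in for the paper's distance and direction vectors, whose exact form is not fully specified here.

```python
def interaction_features(centers):
    """Pairwise interaction features from person box centers.

    For each ordered pair (i, j), keep the displacement (dx, dy) along
    the horizontal/vertical axes plus the Euclidean distance; a toy
    stand-in for the paper's distance and direction vectors.
    """
    feats = {}
    for i, (xi, yi) in enumerate(centers):
        for j, (xj, yj) in enumerate(centers):
            if i != j:
                dx, dy = xj - xi, yj - yi
                feats[(i, j)] = (dx, dy, (dx * dx + dy * dy) ** 0.5)
    return feats
```

Note that both directed pairs (i, j) and (j, i) are kept, mirroring the fact that the edge set of the graph is initialized for every pair of persons.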
3.2 Semantic Relation Graph
Inferring semantic relations over the inherent structure of a scene helps to suppress noise, such as inaccurate human detection, mistaken action recognition, and outlier people not involved in a particular group activity. To achieve this, we explicitly model the structured relations through a graph network . Let us put aside the two agents in Fig. 2 and first explain how the baseline semantic relation graph is built. We define a graph whose global attribute is the activity scores and whose nodes and edges are, respectively, the person nodes and the relation edges among them. The attributes of the person nodes and of the relation edges are respectively initialized with the embeddings of the low-level spatio-temporal features and of the original interaction features.
During graph passing, each node collects the contextual information from each of its neighbors via a collecting function , and aggregates all collected information via an aggregating function , \ie,
where the collecting function is implemented by a neural network , and  denotes concatenation. Then, the aggregated contextual information updates the node attributes via a node updating function (network ),
After that, each edge incorporates messages from its sender and receiver to update its edge attributes via an edge updating function (network ),
To simplify the problem, we consider the graph to be acyclic (\ie, self-connections are zero) and undirected. Finally, the global attribute is updated based on the semantic relations in the whole semantic relation graph, \ie,
where a learnable parameter matrix and bias are applied. The collecting, node-updating, and edge-updating functions are implemented with LSTM networks.
Since propagating information over the graph once captures at most pairwise relations, we update the graph for several iterations to encode high-order interactions. After these propagations, the graph automatically learns the high-level semantic relations from the low-level individual features in the scene. Finally, the activity probabilities are obtained by appending a softmax layer to the global attribute after the last iteration.
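The collect-aggregate-update cycle can be illustrated with a toy, scalar-valued graph pass in plain Python. The averaging update rules below stand in for the learned LSTM networks and are purely illustrative; only the message-passing structure matches the description above.

```python
def propagate(nodes, edges, steps=1):
    """Toy rounds of graph passing on an undirected graph.

    Each node sums messages from its neighbours (collect + aggregate),
    all nodes update, then every edge is refreshed from its endpoints.
    Simple averages stand in for the paper's learned networks.
    """
    for _ in range(steps):
        # collect + aggregate: message from j to i mixes neighbour and edge
        agg = {i: sum(nodes[j] + edges[key]
                      for key in edges if i in key
                      for j in key if j != i)
               for i in nodes}
        # node update: blend the old attribute with the aggregated context
        nodes = {i: 0.5 * (nodes[i] + agg[i]) for i in nodes}
        # edge update: refresh each undirected edge from its two endpoints
        edges = {(i, j): 0.5 * (nodes[i] + nodes[j]) for (i, j) in edges}
    return nodes, edges
```

Running several steps lets information travel beyond immediate neighbours, which is the "high-order interactions" effect of iterating the graph update.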
3.3 Progressively Relation Gating
Although the above fully-connected semantic graph is capable of explicitly modeling any type of relation, it also contains many group-irrelevant relations. Therefore, we introduce a relation-gating agent that learns to select the group-relevant relations in the graph. The decision process is formulated as a Markov decision process .
States. The state consists of three parts. The first is the whole semantic graph, represented by the stack of all relation triplets (“sender”, “relation”, “receiver”), which provides global information about the current scene. The second is the relation triplet corresponding to the specific relation being refined, which provides local information for the agent; its size is determined by the attribute dimensions of the nodes and relations. The third is the global attribute of the relation graph at the current state, \ie, the activity scores.
Action. Inspired by the information gates in LSTMs, we introduce a gate for each relation edge. The action of the agent is to generate the gate, which is then applied to scale the corresponding relation at each reinforcement step. Since the semantic relation graph is undirected, we normalize the values of the paired gates before the gating operation.
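As a minimal sketch of the gating step, the snippet below averages the two directed gates of each undirected edge before scaling the relation. The averaging rule is our assumption for the unspecified normalization; the paper only states that paired gates are normalized.

```python
def apply_gates(gates, relations):
    """Symmetrise the gates of an undirected graph and scale relations.

    gates maps each directed pair (i, j) to a gate in [0, 1];
    relations maps one undirected edge key (i, j) to its attribute.
    The averaging rule g = (g_ij + g_ji) / 2 is an assumption, not the
    paper's (unspecified) normalisation.
    """
    gated = {}
    for (i, j), r in relations.items():
        g = 0.5 * (gates[(i, j)] + gates[(j, i)])
        gated[(i, j)] = g * r
    return gated
```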
Reward. The reward, reflecting the efficacy of an action w.r.t. the state, consists of three parts. 1) To encourage the relation gates to select group-relevant relations, we propose a structured sparsity reward. We define the structured sparsity as the $\ell_{2,1}$ norm of the gate matrix, \ie,
where the rows of the gate matrix are treated as groups. As illustrated in Fig. 2(a), unlike the $\ell_1$ norm, which tends to uniformly make all gating elements sparse, the $\ell_{2,1}$ norm encourages the rows of the gate matrix to be sparse. Thus, the structured sparsity is very helpful for attending to a few key participants who have wide influence on others. The structured sparsity reward at each reinforcement step is defined to encourage the agent to gradually attend to a few key participants and relations, \ie,
where the sign function is applied. 2) To encourage the posterior probability to evolve along an ascending trajectory, we introduce an ascending reward with respect to the probability of the groundtruth activity label, \ie,
where is the predicted probability of the groundtruth label at the current step, and the reward reflects the probability improvement of the groundtruth. 3) To ensure that the model tends to predict correct classes, inspired by , a strong stimulation is enforced when the predicted class shifts from wrong to correct after a step, and a strong punishment is applied if the shift goes the other way, \ie,
Finally, the total reward for the RG agent is the combination of the three terms above.
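The three reward terms can be sketched in plain Python. The l2,1 structured sparsity, the probability-increment ascending reward, and the class-shift stimulation/punishment follow the descriptions above; the unit-valued sparsity reward, the shift magnitude `omega`, and the plain sum used to combine the terms are our assumptions, since the exact formulas and weights are the paper's hyper-parameters.

```python
def l21_norm(gate_matrix):
    """Structured (l2,1) sparsity: the sum of the L2 norms of the rows,
    encouraging whole rows (persons) of the gate matrix to vanish."""
    return sum(sum(v * v for v in row) ** 0.5 for row in gate_matrix)

def rg_reward(prev_gates, gates, prev_prob, prob,
              prev_correct, correct, omega=15.0):
    """Toy composition of the three RG-agent rewards (sum is assumed)."""
    # 1) sparsity reward: positive when structured sparsity decreases
    r_sparse = 1.0 if l21_norm(gates) < l21_norm(prev_gates) else -1.0
    # 2) ascending reward: improvement of the ground-truth probability
    r_ascend = prob - prev_prob
    # 3) class-shift reward: strong stimulation / punishment on a flip
    r_shift = 0.0
    if correct and not prev_correct:
        r_shift = omega
    elif prev_correct and not correct:
        r_shift = -omega
    return r_sparse + r_ascend + r_shift
```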
Relation-gating Agent. Since searching a high-dimensional continuous action space is challenging for reinforcement learning, we compromise by letting the agent output one gating value at a time and cycle through all edges within each reinforcement step. The architecture of the RG agent is shown in Fig. 2(b); it follows an Actor-Critic framework . Inspired by human decision making, in which historical experience assists the current decision, an LSTM block is used to memorize the information of past states. The agent maintains both a policy (the Actor) to generate actions (gates) and an estimate of the value function (the Critic) to assess the values of the corresponding states. Specifically, the Actor outputs the mean and standard deviation of the action distribution. The action is sampled from this Gaussian distribution during training and is set directly to the mean during testing.
Optimization. The agent is optimized with the classical A3C algorithm  for reinforcement learning. The policy and the value function of the agent are updated after every few (updating-interval) steps or when a terminal state is reached. The accumulated reward at each step is computed recursively from the per-step rewards with a discount factor. The advantage function is the difference between the accumulated reward and the estimated state value, and the entropy of the policy serves as a regularizer. Eventually, the gradients are accumulated via Eq. 11 and Eq. 12 to respectively update the value function and the policy of the agent.
where the entropy factor controls the strength of the entropy regularization.
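The A3C bookkeeping above can be illustrated with a toy computation of the n-step returns and advantages; this is a sketch of the standard recursion R_t = r_t + gamma * R_{t+1}, seeded with the critic's bootstrap value, rather than the authors' exact implementation.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Accumulated n-step returns, computed backwards from the critic's
    value estimate at the last state (A3C style)."""
    returns, ret = [], bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    return list(reversed(returns))

def advantages(returns, values):
    """Advantage A_t = R_t - V(s_t), used to weight the policy gradient."""
    return [ret - v for ret, v in zip(returns, values)]
```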
3.4 Progressively Feature Distilling
To further refine the low-level spatio-temporal features used for constructing the graph, we introduce another agent, the feature-distilling agent. It aims to distill the most informative frames of the features, which is also formulated as a Markov decision process .
State. The state of the FD agent consists of three components. The first is the whole feature tensor of an activity, which provides global information about the activity clip; its dimensions are the number of persons, the number of frames, and the feature dimension. The second is the local feature of the kept frames, which carries implicit information about the distilled frames. Third, to make the agent explicitly aware of the distilled frames, the state also contains the binary mask of the distilled frames.
Action. As shown in Fig. 3(a), the FD agent outputs one of two discrete actions for each selected frame: “stay distilled”, indicating the frame is informative and the agent keeps it, and “shift to alternate”, indicating the agent discards the frame and takes in an alternate. The shifting may be frequent at the beginning but gradually becomes stable after some exploration (Fig. 3(a)). In order to give all alternates an equal chance to be enrolled, the latest discarded frames are appended to the end of a queue and have the lowest priority to be enrolled again.
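The queue-based frame replacement can be sketched with a plain Python deque; the function below is a toy illustration of the “stay distilled” / “shift to alternate” bookkeeping described above, not the authors' code.

```python
from collections import deque

def distill_step(kept, pool, actions):
    """Apply one round of FD-agent actions to the kept frame indices.

    Frames flagged for shifting are pushed to the back of the alternate
    queue and replaced by the frame waiting at the front, so recently
    discarded frames get the lowest priority to re-enter.
    """
    pool = deque(pool)
    kept = list(kept)
    for idx, shift in enumerate(actions):
        if shift:
            pool.append(kept[idx])      # discarded frame goes to the back
            kept[idx] = pool.popleft()  # oldest alternate is enrolled
    return kept, list(pool)
```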
Feature-distilling Agent. The FD agent in Fig. 3(b) is also constructed under the Actor-Critic  framework. The agent takes in global knowledge from the whole feature tensor, implicit local knowledge from the distilled features, and explicit local knowledge from the binary frame mask. Finally, the agent outputs an action vector for the distilled feature frames and a value for the current state. The action vector is sampled from the policy distribution during training and is set directly to the actions with maximum probability during testing.
Optimization and Rewards. The optimization algorithm (A3C) and the objective function are the same as for the RG agent. The reward only contains the trajectory-ascending and class-shifting components introduced above, \ie,
3.5 Training Procedure
In the proposed approach, the agents and the graph need to be updated on the CPU (to exploit numerous CPU cores/threads for the asynchronous updating workers of the A3C algorithm ) and on the GPU, respectively. In addition, the graph is updated after each video batch, whereas the agents are updated many times within each video, whenever the number of reinforcement steps reaches the updating interval or a terminal state is reached. Since the graph and the agents are updated on different devices with different updating periods, they cannot be optimized with conventional end-to-end training. Therefore, we adopt alternate training.
Individual Feature Preparation. Following , we finetune the Person-CNN (VGG16), pretrained on ImageNet, with individual action labels to extract the visual features, and then train the Person-LSTM with individual action labels to extract the temporal features. To lower the computational burden, the extracted individual features are saved to disk and only need to be reloaded after this procedure.
Alternate Training. There are 9 separate training stages in total. At each stage, only one of the three components (SRG, trained for 15 epochs; FD or RG agent, trained for 2 hours) is trained and the remaining two are frozen (or removed). In the first stage, the SRG (without agents) is trained with the extracted features to capture the context information within activities. In the second stage, the SRG is frozen, and the FD agent is introduced and trained with the rewards provided by the frozen SRG. In the third stage, the SRG and FD agent are frozen, and the RG agent is introduced and trained with the rewards provided by the frozen SRG and FD agent. After that, in the following 6 stages, one of the SRG, FD agent, and RG agent is trained in turn with the remaining two frozen.
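The 9-stage schedule can be sketched as a simple round-robin over the three components; this is only an illustration of the stage ordering described above.

```python
def training_schedule(n_stages=9):
    """Enumerate the alternate-training stages: SRG first, then the FD
    agent, then the RG agent, cycling with the other two frozen."""
    order = ["SRG", "FD", "RG"]
    return [order[i % 3] for i in range(n_stages)]
```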
Volleyball Dataset . The Volleyball dataset is currently the largest dataset for group activity recognition. It contains 4830 clips from 55 volleyball videos. Each clip is annotated with one of 8 group activity categories (\ie, right set, right spike, right pass, right winpoint, left winpoint, left pass, left spike, and left set), and its middle frame is annotated with 9 individual action labels (\ie, waiting, setting, digging, falling, spiking, blocking, jumping, moving, and standing). We employ the metrics of Multi-class Classification Accuracy (MCA) and Mean Per Class Accuracy (MPCA) to evaluate the performance, following .
Collective Activity Dataset (CAD) . The CAD contains 2481 activity clips from 44 videos. The middle frame of each clip is annotated with 6 individual action classes (\ie, NA, crossing, walking, waiting, talking, and queueing), and the group activity label is assigned as the majority action label of the individuals in the scene. We split the training and testing sets according to the protocol in . Following , we merge the classes “Walking” and “Crossing” into “Moving” and report the MPCA to evaluate the performance.
4.2 Implementation Details
The Person-CNN outputs 4096-d features, and the Person-LSTM, equipped with 3000 hidden neurons, takes in all the features over T (T=10) time steps. In the SRG, the embedding sizes of nodes and edges are 1000 and 100, respectively, and the graph is propagated for 3 iterations each time. Accordingly, the numbers of hidden neurons in the updating functions are 1000, 1000, and 100, respectively. In the RG agent, the layers FC1, FC2, …, FC7 contain 512, 256, 512, 256, 256, 64, and 256 neurons, respectively, and its LSTM network contains 128 hidden nodes. In the FD agent, the number of feature frames to be kept is set to 5. In Fig. 3(b), the neuron numbers of the two FC layers are, from left to right, 64 and 256, the channels of Conv1, Conv2, Conv3, and Conv4 are 1024, 1024, 256, and 256, respectively, and the LSTM network contains 128 neurons.
During training, we use the RMSprop/Adam (SRG/agents) optimizer with an initial learning rate of 0.00001/0.0001 (SRG/agents) and a weight decay of 0.0001. The batch size is 8/16 (CAD/Volleyball) for SRG training. The discount factor, entropy factor, and number of asynchronous workers in A3C are set to 0.99, 0.01, and 16, respectively, for both agents. In practice, the updating interval and the shift-reward magnitude (in Eq. 9) are set to 5/5 and 15/20 (RG/FD agent), respectively. On the Volleyball dataset, following , the 12 players are split into two subgroups (\ie, the left team and the right team) according to their positions; the RG agent is shared by the two subgroups in our framework, and the outputs of the two subgroups are finally averaged. On the CAD dataset, since the number of individuals varies from 1 to 12, we select 5 effective persons for each frame and fill zeros for frames containing fewer than 5 persons, following .
4.3 Baseline and Variants for Ablation Studies
To examine the effectiveness of each component of the proposed method, we conduct ablation studies with the following baseline and variants. StagNet w/o Atten. : this baseline constructs a message passing graph network with low-level features similar to those of our SRG. It implicitly represents the interactions by the passing messages, while our SRG explicitly models relations in a full graph network. Ours-SRG: this variant only contains the SRG of the proposed method. Ours-SRG+FD: this variant contains both the SRG and the FD agent, which are trained alternately to boost each other. Ours-SRG+RG: this variant contains both the SRG and the RG agent, which are trained alternately. Ours-SRG+FD+RG (PRL): this is the proposed progressive relation learning (PRL) approach, which contains all three components (\ie, the SRG, FD agent, and RG agent).
4.4 Results on the Volleyball Dataset
To examine the effectiveness of each component, we compare the proposed PRL against the above baseline and variants. As Table 1 shows, although built on similar low-level features, our semantic relation graph is superior to the baseline (StagNet w/o Atten. ) because our semantic relations are explicitly modeled, while the baseline only implicitly contains them in the passing messages. Our SRG+FD boosts the SRG by over 1.2% (MCA) and 0.7% (MPCA) by applying the FD agent to filter out ambiguous frames of features, and our SRG+RG also improves on the SRG by over 1.6% (MCA) and 1.6% (MPCA) by exploiting the RG agent to refine the relations. Finally, our PRL achieves the best performance because it combines the advantages of the two agents. Note that the PRL eventually improves by 3.1% (MCA) over the original SRG, which is even larger than the sum of the increments from the two agents, 2.8% (MCA), indicating that the two agents can boost each other through the alternate training procedure.
Then, we compare our PRL with other state-of-the-art methods in Table 1. Even though the methods SBGAR , PC-TDM , and SPA+KD+OF  construct an additional stream to exploit extra optical flow input, which makes a direct comparison unfair, our PRL still consistently outperforms them on all metrics. Specifically, the proposed PRL outperforms the state-of-the-art method SPA+KD+OF  by 1.8% and 0.7% w.r.t. MPCA and MCA, respectively. More fairly, our PRL outperforms SPA+KD (SPA+KD+OF without optical flow) by larger margins of 2.8% and 2.1% with respect to MPCA and MCA, respectively.
In addition, the confusion matrix of the proposed PRL is shown in Fig. 5. As we can see, our PRL achieves promising recognition accuracies on most of the activities. The main failure cases are between “set” and “pass” within the left and right subgroups, which is probably due to the very similar actions and positions of the key participants. We also visualize several refined semantic relation graphs in Fig. 6, where the relations with the top-5 gate values are shown and the importance degree of each person is indirectly computed by summing the connected relation gates (normalized over all persons). In Fig. 6a, benefiting from the structured sparsity rewards, our RG agent successfully discovers that the subset of relations related to the “digging” person is the key to determining the activity “left pass”. In Fig. 6b, the model predicts “right winpoint” mainly based on two relation clusters, including the cluster characterized by the two “falling” persons in the left team and the cheering cluster in the right team.
4.5 Results on the Collective Activity Dataset
Table 2 shows the comparison with different methods on the CAD dataset. Following [30, 26], the MPCA results of several methods are calculated from the confusion matrices reported in [8, 32, 10, 20, 15]. Since the MPCA of the baseline (StagNet w/o Atten. ) is unavailable, we list its MCA for an approximate comparison. Our PRL outperforms the state-of-the-art method without extra optical flow input (SPA+KD ) by a margin of 1.3%. It even exceeds most of the methods with extra optical flow input, which indicates the superiority of the proposed method. Although SPA+KD+OF  performs better than our PRL, its main improvement (3.2%) is attributable to the extra optical flow information (cf. Table 2).
In addition, the confusion matrix of our PRL on the CAD dataset is shown in Fig. 7. The results clearly show that our PRL has solved the recognition of “Queueing” and “Talking”, which demonstrates the effectiveness of our framework. However, “Waiting” is seriously confused with “Moving”, probably because some “Moving” activities in street-crossing scenes co-occur with “Waiting” activities in the CAD dataset. Data cleaning or more training data would be required to distinguish these two categories. Furthermore, we analyze the results by visualizing the final SRGs. For the “Moving” activity in Fig. 6c, our method concentrates on the relations among the three moving persons to suppress the noisy relations caused by the “Waiting” person. Similarly, in Fig. 6d, our method successfully attends to the relations connected to the “Talking” person and weakens the relations among the three audience members.
In this work, we have proposed a novel progressive relation learning framework for group activity recognition, which is the first attempt to apply deep reinforcement learning to this task. To better model the structured semantic relations within group activities, a feature-distilling agent progressively distills the most informative frames of the low-level features, and a relation-gating agent further refines the high-level relations in a semantic relation graph. Furthermore, the proposed PRL achieves state-of-the-art results on two widely used benchmarks, which indicates the effectiveness and superiority of our method.
- Timur M. Bagautdinov, Alexandre Alahi, François Fleuret, Pascal Fua, and Silvio Savarese. Social scene understanding: End-to-end multi-person action localization and collective activity recognition. In CVPR, pages 3425–3434, 2017.
- Sovan Biswas and Juergen Gall. Structural recurrent neural network (SRNN) for group activity analysis. In WACV, pages 1625–1632, 2018.
- Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Process. Mag., 34(4):18–42, 2017.
- Wongun Choi and Silvio Savarese. A unified framework for multi-target tracking and collective activity recognition. In ECCV, pages 215–230. Springer, 2012.
- Wongun Choi, Khuram Shahid, and Silvio Savarese. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In ICCV Workshops, pages 1282–1289. IEEE, 2009.
- Zhiwei Deng, Arash Vahdat, Hexiang Hu, and Greg Mori. Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In CVPR, pages 4772–4781, 2016.
- Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263–1272, 2017.
- Hossein Hajimirsadeghi, Wang Yan, Arash Vahdat, and Greg Mori. Visual recognition by counting instances: A multi-instance cardinality potential kernel. In CVPR, pages 2596–2605, 2015.
- Guyue Hu, Bo Cui, and Shan Yu. Skeleton-based action recognition with synchronous local and non-local spatio-temporal learning and frequency attention. In IEEE International Conference on Multimedia and Expo, 2019.
- Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori. A hierarchical deep temporal model for group activity recognition. In CVPR, pages 1971–1980, 2016.
- Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
- Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In NIPS, pages 1008–1014, 1999.
- Tian Lan, Yang Wang, Weilong Yang, Stephen N. Robinovitch, and Greg Mori. Discriminative latent models for recognizing contextual group activities. IEEE Trans. Pattern Anal. Mach. Intell., 34(8):1549–1562, 2012.
- Debang Li, Huikai Wu, Junge Zhang, and Kaiqi Huang. A2-RL: aesthetics aware reinforcement learning for image cropping. In CVPR, pages 8193–8201, 2018.
- Xin Li and Mooi Choo Chuah. SBGAR: semantics based group activity recognition. In ICCV, pages 2895–2904, 2017.
- Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Mengshi Qi, Jie Qin, Annan Li, Yunhong Wang, Jiebo Luo, and Luc Van Gool. stagnet: An attentive semantic RNN for group activity recognition. In ECCV, pages 104–120, 2018.
- Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin A. Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. In ICML, pages 4467–4476, 2018.
- Tianmin Shu, Sinisa Todorovic, and Song-Chun Zhu. CERN: confidence-energy recurrent network for group activity recognition. In CVPR, pages 4255–4263, 2017.
- Tianmin Shu, Dan Xie, Brandon Rothrock, Sinisa Todorovic, and Song-Chun Zhu. Joint inference of groups, events and human roles in aerial videos. In CVPR, pages 4576–4584, 2015.
- Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, and Tieniu Tan. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In CVPR, pages 1227–1236, 2019.
- Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, pages 29–38, 2017.
- Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 1999.
- Yansong Tang, Yi Tian, Jiwen Lu, Peiyang Li, and Jie Zhou. Deep progressive reinforcement learning for skeleton-based action recognition. In CVPR, pages 5323–5332, 2018.
- Yansong Tang, Zian Wang, Peiyang Li, Jiwen Lu, Ming Yang, and Jie Zhou. Mining semantics-preserving attention for group activity recognition. In ACM MM, pages 1283–1291. ACM, 2018.
- Minsi Wang, Bingbing Ni, and Xiaokang Yang. Recurrent modeling of interaction context for collective activity recognition. In CVPR, 2017.
- Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
- Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
- Rui Yan, Jinhui Tang, Xiangbo Shu, Zechao Li, and Qi Tian. Participation-contributed temporal dynamic model for group activity recognition. In ACM MM, pages 1292–1300. ACM, 2018.
- Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543, 2018.
- Zheng Zhou, Kan Li, Xiangjian He, and Mengmeng Li. A generative model for recognizing mixed group activities in still images. In IJCAI, pages 3654–3661, 2016.