Who did What at Where and When: Simultaneous Multi-Person Tracking and Activity Recognition
We present a bootstrapping framework to simultaneously improve multi-person tracking and activity recognition at the individual, interaction, and social group activity levels. The inference consists of identifying the trajectories of all pedestrian actors, individual activities, pairwise interactions, and collective activities, given the observed pedestrian detections. Our method uses a graphical model to represent and solve the joint tracking and recognition problems in multiple stages: (1) activity-aware tracking, (2) joint interaction recognition and occlusion recovery, and (3) collective activity recognition. We solve the where and when problem with visual tracking, as well as the who and what problem with recognition. High-order correlations among the visible and occluded individuals, pairwise interactions, groups, and activities are then solved using a hypergraph formulation within the Bayesian framework. Experiments on several benchmarks show the advantages of our approach over state-of-the-art methods.
Multi-person activity recognition is a major component of many applications, e.g., video surveillance and traffic control. The problem entails the inference of the activities, the actors and their motion trajectories, as well as the temporal dynamics of the events. This task is challenging, since the activities must be analyzed from both the spontaneous individual actions and the complex social dynamics involving groups and crowds. We aim to address not only the where and when problem by visual tracking, but also the who and what problem by activity recognition.
While advanced methods for person detection are becoming more reliable [2, 3], most existing activity recognition approaches rely on visual tracking following a tracking-by-detection paradigm. These methods either fail to consider social interactions while inferring activities [4, 5, 6] or have difficulties recognizing the structural correlations of actions and interactions [7, 8, 9]. In particular, there are two major challenges: (i) ineffective tracking due to frequent occlusions in groups and crowds, and (ii) the lack of a suitable methodology to infer the complex but salient structures involving social dynamics and groups.
In this paper, we address both challenges using a bootstrapping framework to simultaneously improve the two tasks of multi-person tracking and social group activity recognition. We take state-of-the-art person detections [2, 3] as input to perform initial multi-person tracking. We then recognize stable group structures, including the temporally cohesive individual activities (such as walking) and pairwise interactions (such as walking side-by-side, see Fig. 1), to robustly infer collective social activities (such as street crossing in a group) in multiple stages. Auxiliary inputs such as body orientation detections can be considered within the stages as well. The recognized activities and salient grouping structures are used as priors to recover occluded detections and false associations to improve performance.
We explicitly explore the correlations of pairwise interactions (of two individuals) and their group activities (within the group of more individuals) during the optimization. Observe in Fig.1 that group activities generally consist of correlated pairwise interactions, which we have exploited in the multi-stage inference steps. In our method, the multi-target tracking and the recognition of individual/group activities are jointly optimized, such that consistent activity labels characterizing the dynamics of the individuals and groups can be obtained. The individual and group activities are formulated using a dynamic graphical model, and high-order correlations are represented using hypergraphs. The simultaneous pedestrian tracking and multi-person activity recognition problems are then to be solved using an efficient cohesive cluster search in the hypergraphs.
The main contribution of this work is two-fold. First, we propose a new framework that can jointly solve the two tasks of real-time simultaneous tracking and activity recognition. Explicit modeling of the correlations among the individual activities, pairwise interactions, and collective activities leads to a consistent solution. Second, we propose a hypergraph formulation to infer the high-order correlations among social dynamics, occlusions, groups, and activities in multiple stages. Simultaneous tracking and activity recognition are formulated as a bootstrapping framework, which can be solved efficiently by searching for cohesive clusters in the hypergraphs. This hypergraph solution is general in that it can be extended to include additional scenarios or constraints in new applications.
Experiments on several benchmarks show the advantages of our method, with improvements in both activity recognition and multi-person tracking. Our method is easily deployable to real-world applications, since: (i) camera calibration is not required; (ii) online video streams can be processed one time window per round; (iii) the computation can be performed in real-time (about 20 FPS, not including the input detection steps).
II Related Work
A tremendous amount of work exists on multi-person tracking and activity recognition; see [10, 11] for surveys. Our work is most closely related to collective activity recognition, which we organize into the following three categories: recognition based on (i) detection, (ii) tracking, and (iii) simultaneous tracking and recognition.
II-A Collective Activity Recognition Based on Detection
A hierarchical model is used in  to recognize collective activities by considering the person-person and group-person contextual information. The work of  uses hierarchical deep neural networks with a graphical model to recognize collective activities based on the dependencies of individual activities. This work is further extended in , where the individual and collective activities are iteratively recognized using an RNN with refinements. Multi-instance learning is used in  to recognize collective activities by inferring the cardinality relations of individual activities. A recurrent CNN is used in  for joint target detection and activity recognition.
II-B Collective Activity Recognition Based on Tracking
In this category, individual target trajectories are used as the input to recognize collective activities. Collective activities are recognized in  using random forests for spatio-temporal volume classification. A two-stage deep temporal neural network is used in , where the first stage recognizes individual activities, and the second stage aggregates the individual observations to recognize collective activities. In , the key constituents of activities and their relationships are used to recognize collective activities. A graphical model is developed in  to capture high-order temporal dependencies of video features. The and-or graph  is applied for video parsing and activity querying, where the detectors and trackers are launched upon receiving queries. An RNN architecture is designed in  to model high-order social group interaction contexts.
II-C Simultaneous Tracking and Activity Recognition
Only a few works deal with the problem of simultaneous multi-person tracking and activity recognition. In , per-frame and per-track cues extracted from an appearance-based tracker are combined to capture the regularity of individual actions. A network flow-based model is used in  to link detections while inferring collective activities. However, these two methods do not consider pairwise interactions for activity recognition. In [7, 8], tracking and activity recognition are formulated as a joint energy maximization problem, which is solved by belief propagation with branch-and-bound. However, high-order correlations among individual and pairwise activities are not considered, which limits the activity recognition performance.
We first define the notation used in our method. Given an input video sequence, we consider the most recent time window in an online fashion, and denote previous time frames as . Let represent a set of target detections obtained using person detectors, e.g., [2, 3]. Let represent the set of existing target trajectories. Let , , and represent the sets of recognized individual activities, pairwise interactions, and collective activities, respectively. Given , our approach aims to simultaneously solve the multi-person tracking and activity recognition problems by inferring the following four terms within : (i) target trajectories , where is the number of observed targets, (ii) individual activity labels , (iii) pairwise interaction labels , and (iv) collective activity labels , where represents the collective activity with the most involved targets in the -th frame. After a time window is processed, the method extends target tracklets, updates activity labels, and moves on to the next time window: , , , and . To simplify notation, we omit the temporal indices to represent the variables within as , and represent previous variables as , i.e., , , .
|x||a target trajectory|
|a||an individual activity label, e.g.|
|i||a pairwise interaction label, e.g. approaching (AP), facing-each-other (FE), standing-in-a-row (SR), …|
|c||a collective activity label, e.g., crossing, walking, gathering, …|
|number of observed targets (tracklets)|
|video time window of length prior to time , i.e.,|
|person detections (bounding boxes)|
|individual activity classes,|
|pairwise interaction classes,|
|collective activity classes,|
|existing entities prior to time window , , , ,|
|number of individual activity classes, in the CAD and Augmented-CAD datasets, in the New-CAD dataset|
|number of interaction classes, which is also the number of sub-hypergraphs used in our method, in the CAD and Augmented-CAD datasets, in the New-CAD dataset|
|number of collective activity classes, in CAD, in Augmented-CAD, in New-CAD datasets|
|a joint distribution|
|confidence terms from the decomposition of|
|clique potential functions in the Markov random field|
|updated terms of after an optimization stage, respectively|
|updated terms of after an optimization stage, respectively|
|the distance likelihood term for estimating the interaction between two targets|
|the group connectivity term for estimating the interaction between two targets|
|the individual activity agreement term for estimating the interaction between two targets|
|the distance change type likelihood term for estimating the interaction between two targets|
|the facing direction likelihood term for estimating the interaction between two targets|
|the frontness/sidedness likelihood term for estimating the interaction between two targets|
|a candidate tracklet|
|the set of all candidate tracklets|
|a (putative) individual activity of a candidate tracklet|
|the set of (putative) individual activities for all candidate tracklets|
|the appearance similarity for tracklet linking|
|time threshold for appearance-based tracklet linking|
|operator represents the association of two tracklets|
|the number of hypothetical tracklets to generate from an existing tracklet ,|
|activity recognition hypergraph|
|the vertex set of a hypergraph|
|the hyperedge set of a hypergraph|
|the hyperedge weights of a hypergraph|
|the appearance hyperedge weight, working with control parameter|
|the facing-direction hyperedge weight, working with control parameter|
|the geometric similarity hyperedge weight, working with control parameter|
|the hyperedge degree, i.e., the number of incident vertices of the hyperedge|
|a -degree hyperedge,|
|a hyperedge cluster, which is a vertex set with interconnected hyperedges|
|number of vertices in a hypergraph cluster ,|
|the set of all incident hyperedges of a cluster|
|weighting function operated on a hypergraph cluster|
|the indicator vector to denote the vertex selection from to be included in|
|used in weight normalization|
|used in weight normalization|
|image coordinate vector between two positions at and|
|a sub-hypergraph indexed by , i.e.,|
|the hyperedges of the sub-hypergraph corresponding to the -th interaction class|
|the hyperedge weights of the sub-hypergraph corresponding to the -th interaction class|
|the vertex set of a graph; is associated with in this paper|
|the edge set of a graph|
|the edge weights of a graph|
|a graph edge connecting two vertices and|
|the correlation between the activities of two targets and used to calculate weight|
|a function to calculate the correlation between the activities of two targets|
|Euclidean distance between two targets in image coordinates|
|the angle between the facing direction of and the relative vector from to .|
|sparse graph by discarding edges with small weights from|
|, ,||video frame indices|
|, , ,||target tracklet indices|
|, ,||hypergraph vertex indices|
|the index for hypergraph clusters e.g. , from|
|the index for interaction classes e.g. ; is also the index for sub-hypergraphs e.g.|
|the index for collective activity classes|
III-A Problem Formulation
We aim to infer accurate trajectories of all targets () and their individual activities (), pairwise interactions (), and collective activities () from the observed detections (). The relationship between these variables can be expressed as a joint distribution over the dependency graph in Fig. 1. Based on the conditional independence assumptions encoded in the graphical model, can be decomposed into three terms:
(i) is the confidence of target tracking, whose calculation is given in III-C. (ii) models the inter-dependencies among the target trajectories, individual activity labels, and pairwise interaction labels, and is further expressed as a Markov random field (red cycle in Fig. 1):
where , and are three clique potential functions capturing the inter-correlations between each variable pair. The derivation of these clique potentials is given in III-C and III-D. (iii) reflects an important assumption that collective activities can be effectively modeled by robust inference of target trajectories, individual activities, and pairwise interactions; details are provided in III-E.
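The equation symbols were lost in extraction. As a hedged reconstruction, writing $\mathcal{X}, \mathcal{A}, \mathcal{I}, \mathcal{C}, \mathcal{D}$ for the trajectories, individual activities, pairwise interactions, collective activities, and detections (these symbol names are ours, not necessarily the paper's), the three-term decomposition and the pairwise MRF factorization described above take the form:

```latex
P(\mathcal{X},\mathcal{A},\mathcal{I},\mathcal{C}\mid\mathcal{D})
 = \underbrace{P(\mathcal{X}\mid\mathcal{D})}_{\text{(i) tracking}}\,
   \underbrace{P(\mathcal{A},\mathcal{I}\mid\mathcal{X},\mathcal{D})}_{\text{(ii) MRF}}\,
   \underbrace{P(\mathcal{C}\mid\mathcal{X},\mathcal{A},\mathcal{I})}_{\text{(iii) collective}},
\qquad
P(\mathcal{A},\mathcal{I}\mid\mathcal{X},\mathcal{D}) \propto
   \psi_1(\mathcal{X},\mathcal{A})\,\psi_2(\mathcal{A},\mathcal{I})\,\psi_3(\mathcal{X},\mathcal{I}).
```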
The inference of the joint tracking and recognition is then formulated as seeking
However, standard iterative optimization such as block coordinate descent is not practical because: (i) the coupling of the variables is complicated; (ii) each of these variables represents a superset of time-dependent variables, so their joint optimization would be very inefficient; and (iii) real-time processing is desired. We adopt a heuristic approximate solution using a multi-stage updating scheme, which first jointly updates , , and then updates , followed by the update of . Our strategy is based on the hypothesis that inferring pairwise interactions is crucial to resolving the entire optimization, because is the knob governing the representations between and .
Our updating scheme shares the spirit of the standard Gibbs sampling and MH-MCMC methods for inference in probabilistic graphical models. It consists of the following three stages:
Stage 1 activity-aware tracking ( III-C), where individual target trajectories and activity labels are updated using:
Stage 2 joint interaction recognition and occlusion recovery ( III-D), where the interaction labels together with the target trajectories and activities are updated using:
Stage 3 collective activity recognition ( III-E), where the collective activity labels are updated using:
We will show in III-C and III-D that we model high-order correlations among , and using two respective hypergraphs. The clique potentials in Stage 1 and Stage 2 can be derived as the optimization of maximal weight search over the two hypergraphs, in order to infer . Stage 3 infers using a probabilistic formulation based on the inferred .
III-B Cohesive Cluster Search on the Hypergraph
We define an undirected hypergraph , where denotes the vertex set of and denotes a vertex index. An undirected hyperedge with -incident vertices is defined as , where is the degree of the hyperedge. The set of all -degree hyperedges is denoted as . The weights of the hyperedges are denoted as , i.e., each hyperedge is associated with a weight.
We use hypergraphs to represent both (1) the detection-tracklet association for tracking , and (2) the correlations among individual activities and pairwise interactions . The joint problem of multi-person tracking (with possible refinements) and group activity recognition can then be solved using a standard cohesive cluster search on the hypergraph . A cluster within a hypergraph is a vertex set with interconnected hyperedges. We use to denote the number of vertices in , and to denote the set of all incident hyperedges of . A cluster is cohesive if its vertices are interconnected by a large number of hyperedges with dense weights. Let denote the weighting function that measures the weight of a cluster. For a vertex , the cohesive cluster search optimization determines a large cluster with dense weights:
We use indicator vector , to denote the selection of vertices from to be included in : for , and otherwise. The selection is constrained such that up to vertices including are enclosed in , such that , and .
The design of affects the resulting cluster from the search. A typical choice of is the total weight of all incident hyperedges. However, direct maximization of the total weight leads to a large cluster that is not necessarily cohesive. Instead, we maximize a normalized weight, i.e., the total weight divided by the cardinality of all incident hyperedges. This normalization also enables continuous optimization. For with vertices and -degree hyperedges, this normalizer is . Our weighting function is:
where denote vertex indices.
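The cohesive cluster search can be illustrated with a greedy sketch. The data structures and the greedy growth strategy below are our assumptions; the paper's actual optimizer works with a continuous relaxation over the indicator vector. A cluster around a seed vertex grows as long as the normalized weight (total hyperedge weight divided by the number of enclosed hyperedges) keeps improving:

```python
def cluster_weight(vertices, hyperedges):
    """Normalized cluster weight: total weight of hyperedges fully inside
    the cluster, divided by their count (the cardinality normalizer)."""
    inside = [w for e, w in hyperedges.items() if e <= vertices]
    return sum(inside) / len(inside) if inside else 0.0

def cohesive_cluster(seed, all_vertices, hyperedges, max_size=5):
    """Greedy search for a cohesive cluster containing the seed vertex:
    repeatedly add the vertex that most increases the normalized weight.

    hyperedges: dict mapping frozenset-of-vertices -> weight."""
    cluster = {seed}
    while len(cluster) < max_size:
        best, best_w = None, cluster_weight(cluster, hyperedges)
        for v in all_vertices - cluster:
            w = cluster_weight(cluster | {v}, hyperedges)
            if w > best_w:
                best, best_w = v, w
        if best is None:  # no vertex improves the normalized weight
            break
        cluster.add(best)
    return cluster
```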
III-C Activity-Aware Tracking
Stage 1 of our method simultaneously recognizes individual activities and links tracklets in the following two steps (see Fig.2 for a schematic overview):
(T1) Generate candidate tracklets . For each existing target , we generate a set of candidate tracklets from the observed detections using the tracking method in . (All candidate tracklets and their labels are denoted with a bar.) We employ a gating strategy to restrict the number of candidate tracklets to consider. The appearance similarity between and each tracklet is calculated using the POI features  and the Euclidean metric. If this similarity is above a threshold , is added into . Targets with no associated detection within time are discarded to reduce unnecessary computation. We use and to include a rich set of candidate tracklets for linking. If any tracklet in ends up not linked with any target (e.g., in Fig. 2), a new target is created. If any target ends up with no linked tracklet for a status update, it is considered occluded. (We use trajectory prediction based on motion extrapolation in step (R1) of III-D to determine whether the target is still within the scene.)
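The gating step can be sketched as follows. Turning the Euclidean distance between (hypothetical POI-style) appearance feature vectors into a similarity via exp(-d), and the threshold value, are our assumptions; the paper does not specify them here:

```python
import math

def gate_candidates(target_feat, tracklet_feats, tau=0.5):
    """Appearance gating: keep only candidate tracklets whose appearance
    similarity to the target exceeds the threshold tau."""
    kept = []
    for tid, feat in tracklet_feats.items():
        d = math.dist(target_feat, feat)  # Euclidean distance in feature space
        if math.exp(-d) > tau:            # distance -> similarity (assumed form)
            kept.append(tid)
    return kept
```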
(T2) Link tracklets with . After the candidate tracklets are generated, we determine the individual activity label of each candidate tracklet for the purpose of activity-aware tracking. We consider = individual activity labels regarding the motion pattern: standing, walking, and running, by calculating the velocity of each and modeling the posteriors using a sigmoid similar to : . We consider social contextual cues and the correlations between individual activities in finding the best tracklet linking combinations, which also enables robust occlusion recovery for tracking. Our solution is to represent all terms using a tracking hypergraph . The clique potential function in Eq. (3) can then be inferred as , where represents a cohesive cluster obtained from , and denotes a cluster index.
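The velocity-based soft assignment of the three motion-pattern labels can be sketched with sigmoid decision boundaries. The speed thresholds and sharpness parameter below are hypothetical, since the paper's sigmoid parameters are not recoverable here:

```python
import math

# Hypothetical speed thresholds (units depend on the tracker, e.g. px/frame)
T_STAND, T_RUN = 0.2, 2.0

def activity_posteriors(speed, k=8.0):
    """Soft-assign {standing, walking, running} from tracklet speed using
    sigmoid boundaries; k controls the boundary sharpness."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    p_stand = 1.0 - sig(k * (speed - T_STAND))   # slow -> standing
    p_run = sig(k * (speed - T_RUN))             # fast -> running
    p_walk = max(0.0, 1.0 - p_stand - p_run)     # in-between -> walking
    z = p_stand + p_walk + p_run
    return {"standing": p_stand / z, "walking": p_walk / z, "running": p_run / z}
```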
The activity-aware tracking by linking tracklets with is performed in three sub-steps: (T2.1) Estimate social group structure using correlations between individual activities in a graph representation. (T2.2) Construct hypergraph . (T2.3) Optimize tracking based on .
(T2.1) Estimate social group structure. We represent the social group structure of the tracked targets and the correlations between individual activities using an undirected complete graph with . Edge weight reflects the correlation between the activities and of and , respectively. We define to reflect the correlation between the activities of two targets, similar to :
where represents the Euclidean distance between the targets. As shown in Fig. 4a, represents the angle between the facing direction of and the relative vector from to , and represents the velocity of . For a target , if is recognized as “standing”, we use the classifier in  to calculate the body orientation out of 8 quantizations; otherwise, the motion direction is estimated from the trajectory. Edge weights of are calculated according to Eq. (9) and refined using further grouping cues as in . Fig. 4b visualizes the correlation defined by Eq. (9). The probability is higher on the side of a person than in the front or back, which implements Hall’s proxemics social norms. We discard edges with weights lower than to obtain a sparse graph, denoted , for computational speedup.
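A hedged stand-in for the correlation of Eq. (9) can be sketched as follows: the score decays with inter-person distance and is higher when one person lies beside the other than directly in front of or behind them, matching the proxemics behavior described above. The exact functional form and the distance scale `sigma_d` are our assumptions:

```python
import math

def pair_correlation(pos_i, pos_j, dir_i, sigma_d=2.0):
    """Correlation between two targets: Gaussian decay with distance,
    modulated so that side-by-side configurations score higher than
    front/back ones (Hall's proxemics). sigma_d is a hypothetical scale."""
    dx, dy = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    dist = math.hypot(dx, dy)
    # angle between i's facing direction and the relative vector i -> j
    ang = math.atan2(dy, dx) - math.atan2(dir_i[1], dir_i[0])
    spatial = math.exp(-(dist ** 2) / (2.0 * sigma_d ** 2))
    sideness = abs(math.sin(ang))  # 1 when j is beside i, 0 in front/behind
    return spatial * (0.5 + 0.5 * sideness)
```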
(T2.2) Construct hypergraph using to capture the high-order correlations between activities within a group. A vertex represents a hypothesis of linking a tracked target with its candidate tracklet, i.e., where “” represents the association of two tracklets. A -degree hyperedge represents the combination of tracklet linking hypotheses in an assignment.
The linking of tracklets with can be considered as an assignment problem with the following two tracklet assignment constraints: (i) a target cannot be linked with two or more candidate tracklets, and (ii) a candidate tracklet cannot be linked with two or more targets. We enforce these constraints in the construction of hyperedges in . Specifically, , where and , if and only if , and can co-exist in a hyperedge in .
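The two assignment constraints can be checked with a simple helper. Representing each vertex as a (target, tracklet) pair is our assumption; the check accepts a combination of linking hypotheses only if no target and no candidate tracklet appears more than once:

```python
def can_coexist(hypotheses):
    """Tracklet-assignment constraints: each target and each candidate
    tracklet may appear in at most one linking hypothesis.

    hypotheses: iterable of (target_id, tracklet_id) pairs (vertices)."""
    targets = [t for t, _ in hypotheses]
    tracklets = [c for _, c in hypotheses]
    return (len(set(targets)) == len(targets)
            and len(set(tracklets)) == len(tracklets))
```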
We further consider motion and behavior consistencies and their correlations (via ) in determining the hyperedge weights. Specifically, we consider three affinities that determine the hyperedge weights: the appearance () of each tracklet, the facing-direction () and the geometric similarity () between tracked targets.
The appearance affinity between a target and a candidate tracklet is computed using the appearance features of tracklets as :
We assume that activity states (such as the walking direction) do not change abruptly between small linked tracklets. In other words, the difference between the facing directions of two targets should be small for linked tracklets:
Our method aims to run on surveillance videos without calibration. To ensure smooth tracking, we use a geometric affinity term to ensure that the relative angles between two targets do not change abruptly:
where and represent the relative image coordinate vectors between the tracked targets and candidate tracklets. The final affinity value of a hyperedge is computed by , where , , are set as , , .
(T2.3) Optimize tracking based on . This step determines the optimal tracklet linking among the candidates represented in the hypergraph . The optimization is performed by the cohesive cluster search on described in III-B. For each vertex , such a search yields a cluster with a score. Since a vertex may appear in multiple clusters, any resulting cluster that violates the tracklet assignment constraints in (T2.2) is removed from further consideration. We ensure that the resulting cohesive clusters represent valid tracklet linking hypotheses that are sound and redundancy-free. (Hypergraph clusters are processed sequentially in descending order of their scores; if a new cluster violates the constraints, it is discarded, and any duplication is removed.) In case a target ends up not linked with any candidate tracklet (e.g., in Fig. 3), such a target is either outside the scene or under occlusion. We store all discovered occlusions and try to recover them at Stage 2 in III-D. Finally, target trajectories are updated with the newly linked tracklets in to be , and activity labels are augmented with the respective ones in to be .
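The sequential cluster post-processing can be sketched as follows; the score representation, the (target, tracklet) vertex encoding, and the greedy acceptance rule are our assumptions:

```python
def filter_clusters(clusters):
    """Process clusters in descending score order; greedily accept a cluster
    only if it shares no target and no candidate tracklet with an already
    accepted one (the assignment constraints of step T2.2).

    clusters: list of (score, vertices), vertices being a set of
    (target_id, tracklet_id) pairs."""
    accepted = []
    used_targets, used_tracklets = set(), set()
    for score, verts in sorted(clusters, key=lambda c: -c[0]):
        ts = {t for t, _ in verts}
        cs = {c for _, c in verts}
        if ts & used_targets or cs & used_tracklets:
            continue  # violates constraints w.r.t. accepted clusters
        accepted.append(verts)
        used_targets |= ts
        used_tracklets |= cs
    return accepted
```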
III-D Joint Interaction Recognition and Occlusion Recovery
Our approach is motivated by the observation that pairwise interactions within a group can provide rich contextual cues to recognize activities (as in Fig. 1) and to recover possible occlusions. Stage 2 of our method jointly resolves the two problems of (1) recognizing pairwise interactions and (2) occlusion recovery to improve tracking. We again use a hypergraph representation to explore the high-order correlations among the interactions , such that a similar cluster search scheme can be applied for optimization. Specifically, we construct the (activity) recognition hypergraph () based on the inferred target locations and individual activities . The optimization over maximizes the clique potential function in Eq. (4), as , where represents a cohesive cluster obtained from .
Stage 2 of our method jointly recognizes interactions and recovers occlusions in the following three main steps (see Fig. 3 for a schematic overview):
(R1) Generate hypothetical tracklets for occlusion recovery from given existing and .
(R2) Construct hypergraph based on to infer high-order correlations among their pairwise interactions .
(R3) Optimize recognition and recovery over to simultaneously recognize interaction and link occluded targets with suitable hypothetical tracklets.
(R1) Generate hypothetical tracklets . For each possibly occluded target , we generate a few hypothetical tracklets based on trajectory predictions, where is empirically set to . (All hypothetical tracklets are denoted with a hat throughout the paper.) For a moving target with , we generate via motion extrapolation. For a stationary target with , we add a small perturbation to .
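Step (R1) can be sketched as follows; the number of hypotheses, the prediction horizon, and the noise scale are hypothetical parameters of this sketch:

```python
import math
import random

def hypothetical_tracklets(track, n=4, horizon=5, noise=0.3):
    """Generate n hypothetical continuations of a possibly occluded target.
    Moving targets are extrapolated along their last velocity; stationary
    ones get small random perturbations around the last position."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    vx, vy = x1 - x0, y1 - y0
    moving = math.hypot(vx, vy) > 1e-6
    hyps = []
    for _ in range(n):
        pts, x, y = [], x1, y1
        for _ in range(horizon):
            if moving:  # motion extrapolation with mild jitter
                x += vx + random.gauss(0, noise)
                y += vy + random.gauss(0, noise)
            else:       # stationary: perturbation around the last point
                x = x1 + random.gauss(0, noise)
                y = y1 + random.gauss(0, noise)
            pts.append((x, y))
        hyps.append(pts)
    return hyps
```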
(R2) Construct hypergraph , such that the high-order correlations among the interactions between and are captured for the purposes of simultaneous activity recognition and occlusion recovery. Thus, . Each hyperedge in characterizes the likelihood of a pairwise interaction . For example, in Fig. 3, are connected by 3 hyperedges, which correspond to the interactions “WS”, “RS”, and “SS”, respectively. See IV for a complete list of the interaction classes defined in the public datasets [24, 7]. We let denote the number of interaction classes.
The inference of each interaction class can be optimized independently. We can thus decompose into sub-hypergraphs , with for the -th interaction class. For each hyperedge , the weight reflects how likely the interactions among the targets are cohesive as a whole (e.g., all walking side-by-side).
We calculate the hyperedge weights in in two steps: (R2.1) evaluates each pairwise interaction with a confidence score. (R2.2) constructs hyperedges in using the average score from all involved targets.
(R2.1) Recognize pairwise interaction activities. We calculate a confidence score for each possible pairwise interaction between the targets , using a simple yet effective rule-based probabilistic approach as in . Specifically, the confidence score of belonging to the -th class is calculated by multiplying the following six component probabilities: distance (ds), group connectivity (gc) calculated in (9), individual activity agreement (aa), distance change type (dc), facing direction (dr), and frontness/sidedness (fs):
|Distance||, where denotes normal distribution|
|Group connectivity||, where|
|, where is defined in Eq.(9)|
|Individual activity agreement||, where|
|Distance-change type||, where|
|if , ; if , ; otherwise,|
|Facing direction||, where|
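The scoring of a pairwise interaction class can be sketched as a product of component probabilities. The Gaussian form of the distance component below is a hedged guess at the table's (not fully recoverable) formula, with hypothetical per-class parameters mu and sigma:

```python
import math

def p_distance(d, mu, sigma):
    """Distance component: unnormalized Gaussian preference for the typical
    inter-person distance mu of an interaction class (mu, sigma are
    hypothetical per-class parameters)."""
    return math.exp(-((d - mu) ** 2) / (2.0 * sigma ** 2))

def interaction_confidence(components):
    """Confidence that a target pair exhibits a given interaction class:
    the product of the six component probabilities (ds, gc, aa, dc, dr, fs)."""
    score = 1.0
    for p in components:
        score *= p
    return score
```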
|Pairwise Interaction||Associated Collective Activity (C)||Probabilistic Formulation|