Who did What at Where and When: Simultaneous Multi-Person Tracking and Activity Recognition
Abstract
We present a bootstrapping framework to simultaneously improve multi-person tracking and activity recognition at the individual, interaction, and social group activity levels. The inference consists of identifying the trajectories of all pedestrian actors, individual activities, pairwise interactions, and collective activities, given the observed pedestrian detections. Our method uses a graphical model to represent and solve the joint tracking and recognition problems in multiple stages: (1) activity-aware tracking, (2) joint interaction recognition and occlusion recovery, and (3) collective activity recognition. We solve the where and when problem with visual tracking, as well as the who and what problem with recognition. High-order correlations among the visible and occluded individuals, pairwise interactions, groups, and activities are then solved using a hypergraph formulation within the Bayesian framework. Experiments on several benchmarks show the advantages of our approach over state-of-the-art methods.
I Introduction
Multi-person activity recognition is a major component of many applications, e.g., video surveillance and traffic control. The problem entails the inference of the activities, the actors and their motion trajectories, as well as the time dynamics of the events. This task is challenging, since the activities must be analyzed from both the spontaneous individual actions and the complex social dynamics involving groups and crowds [1]. We aim to address not only the where and when problem by visual tracking, but also the who and what problem by activity recognition.
While advanced methods for person detection are becoming more reliable [2, 3], most existing activity recognition approaches rely on visual tracking following a tracking-by-detection paradigm. These methods either fail to consider social interactions while inferring activities [4, 5, 6] or have difficulties recognizing the structural correlations of actions and interactions [7, 8, 9]. In particular, there are two major challenges: (i) ineffective tracking due to frequent occlusions in groups and crowds, and (ii) the lack of a suitable methodology to infer the complex but salient structures involving social dynamics and groups.
In this paper, we address both challenges using a bootstrapping framework to simultaneously improve the two tasks of multi-person tracking and social group activity recognition. We take state-of-the-art person detections [2, 3] as input to perform initial multi-person tracking. We then recognize stable group structures, including temporally cohesive individual activities (such as walking) and pairwise interactions (such as walking side-by-side; see Fig. 1), to robustly infer collective social activities (such as street crossing in a group) in multiple stages. Auxiliary inputs such as body orientation detections can be incorporated within the stages as well. The recognized activities and salient grouping structures are used as priors to recover occluded detections and false associations to improve performance.
We explicitly explore the correlations of pairwise interactions (between two individuals) and their group activities (within a group of more individuals) during the optimization. Observe in Fig. 1 that group activities generally consist of correlated pairwise interactions, which we exploit in the multi-stage inference steps. In our method, the multi-target tracking and the recognition of individual/group activities are jointly optimized, such that consistent activity labels characterizing the dynamics of the individuals and groups can be obtained. The individual and group activities are formulated using a dynamic graphical model, and high-order correlations are represented using hypergraphs. The simultaneous pedestrian tracking and multi-person activity recognition problems are then solved using an efficient cohesive cluster search in the hypergraphs.
The main contribution of this work is twofold. First, we propose a new framework that jointly solves the two tasks of real-time simultaneous tracking and activity recognition. Explicit modeling of the correlations among the individual activities, pairwise interactions, and collective activities leads to a consistent solution. Second, we propose a hypergraph formulation to infer the high-order correlations among social dynamics, occlusions, groups, and activities in multiple stages. Simultaneous tracking and activity recognition are formulated as a bootstrapping framework, which can be solved efficiently by searching for cohesive clusters in the hypergraphs. This hypergraph solution is general in that it can be extended to include additional scenarios or constraints in new applications.
Experiments on several benchmarks show the advantages of our method, with improvements in both activity recognition and multi-person tracking. Our method is easily deployable to real-world applications, since: (i) camera calibration is not required; (ii) online video streams can be processed by considering one time window per round; (iii) the computation can be performed in real-time (about 20 FPS, not including the input detection steps).
II Related Work
There exists a large body of work on multi-person tracking and activity recognition; see [10, 11] for surveys. Our work is most related to collective activity recognition, which we organize into the following three categories: recognition based on (i) detection, (ii) tracking, and (iii) simultaneous tracking and recognition.
II-A Collective Activity Recognition Based on Detection
A hierarchical model is used in [12] to recognize collective activities by considering person-person and group-person contextual information. The work of [13] uses hierarchical deep neural networks with a graphical model to recognize collective activities based on the dependencies of individual activities. This work is further extended in [9], where the individual and collective activities are iteratively recognized and refined using an RNN. Multi-instance learning is used in [14] to recognize collective activities by inferring the cardinality relations of individual activities. A recurrent CNN is used in [15] for joint target detection and activity recognition.
II-B Collective Activity Recognition Based on Tracking
In this category, individual target trajectories are used as the input to recognize collective activities. Collective activities are recognized in [16] using random forests for spatio-temporal volume classification. A two-stage deep temporal neural network is used in [4], where the first stage recognizes individual activities, and the second stage aggregates individual observations to recognize collective activities. In [17], the key constituents of activities and their relationships are used to recognize collective activities. A graphical model is developed in [18] to capture high-order temporal dependencies of video features. The and-or graph [19] is applied for video parsing and activity querying, where the detectors and trackers are launched upon receiving queries. An RNN architecture is designed in [20] to model high-order social group interaction contexts.
II-C Simultaneous Tracking and Activity Recognition
Only very few works deal with the problem of simultaneous multi-person tracking and activity recognition. In [5], per-frame and per-track cues extracted from an appearance-based tracker are combined to capture the regularity of individual actions. A network flow-based model is used in [6] to link detections while inferring collective activities. However, these two methods do not consider pairwise interactions for activity recognition. In [7, 8], tracking and activity recognition are formulated as a joint energy maximization problem, which is solved by belief propagation with branch-and-bound. However, high-order correlations among individual and pairwise activities are not considered, which limits the activity recognition performance.
III Method
We start by defining the notation used in our method. Given an input video sequence, we consider the most recent time window in an online fashion, and denote previous time frames as . Let represent the set of target detections obtained using person detectors, e.g., [2, 3]. Let represent the set of existing target trajectories. Let , , and represent the sets of recognized individual activities, pairwise interactions, and collective activities, respectively. Given , our approach aims to simultaneously solve the multi-person tracking and activity recognition problems by inferring the following four terms within : (i) target trajectories , where is the number of observed targets, (ii) individual activity labels , (iii) pairwise interaction labels , and (iv) collective activity labels , where represents the collective activity with the most involved targets in the th frame. After a time window is processed, the method extends target tracklets, updates activity labels, and moves on to the next time window: , , , and . To simplify notation, we omit the temporal indices to represent the variables within as , and represent previous variables as , i.e., , , .
symbol  description  
Video Activities 
x  a target trajectory 
a  an individual activity label, e.g.  
i  a pairwise interaction label, e.g. approaching (AP), facing-each-other (FE), standing-in-a-row (SR), …  
c  a collective activity label, e.g., crossing, walking, gathering, …  
number of observed targets (tracklets)  
video time window of length prior to time , i.e.,  
person detections (bounding boxes)  
target trajectories,  
individual activity classes,  
pairwise interaction classes,  
collective activity classes,  
existing entities prior to time window , , , ,  
number of individual activity classes, in the CAD and AugmentedCAD datasets, in the NewCAD dataset  
number of interaction classes, which is also the number of sub-hypergraphs used in our method; in the CAD and AugmentedCAD datasets, in the NewCAD dataset  
number of collective activity classes, in CAD, in AugmentedCAD, in NewCAD datasets  
Problem Formulation 
a joint distribution  
confidence terms from the decomposition of  
clique potential functions in the Markov random field  
updated terms of after an optimization stage, respectively  
updated terms of after an optimization stage, respectively  
the distance likelihood term for estimating the interaction between two targets  
the group connectivity term for estimating the interaction between two targets  
the individual activity agreement term for estimating the interaction between two targets  
the distance change type likelihood term for estimating the interaction between two targets  
the facing direction likelihood term for estimating the interaction between two targets  
the frontness/sideness likelihood term for estimating the interaction between two targets  
Tracking 
a candidate tracklet  
the set of all candidate tracklets  
a (putative) individual activity of a candidate tracklet  
the set of (putative) individual activities for all candidate tracklets  
the appearance similarity for tracklet linking  
time threshold for appearancebased tracklet linking  
operator represents the association of two tracklets  
the number of hypothetical tracklets to generate from an existing tracklet ,  
Hypergraph 
hypergraph  
tracking hypergraph  
activity recognition hypergraph  
the vertex set of a hypergraph  
the hyperedge set of a hypergraph  
the hyperedge weights of a hypergraph  
the appearance hyperedge weight, working with control parameter  
the facingdirection hyperedge weight, working with control parameter  
the geometric similarity hyperedge weight, working with control parameter  
the hyperedge degree, i.e., the number of incident vertices of the hyperedge  
a degree hyperedge,  
a hyperedge cluster, which is a vertex set with interconnected hyperedges  
number of vertices in a hypergraph cluster ,  
the set of all incident hyperedges of a cluster  
weighting function operated on a hypergraph cluster  
the indicator vector to denote the vertex selection from to be included in  
used in weight normalization  
used in weight normalization  
image coordinate vector between two positions at and  
a sub-hypergraph indexed by , i.e.,  
the hyperedges of the sub-hypergraph corresponding to the th interaction class  
the hyperedge weights of the sub-hypergraph corresponding to the th interaction class  
Graph 

graph  
the vertex set of a graph; is associated with in this paper  
the edge set of a graph  
the edge weights of a graph  
a graph edge connecting two vertices and  
the correlation between the activities of two targets and used to calculate weight  
a function to calculate the correlation between the activities of two targets  
Euclidean distance between two targets in image coordinates  
the angle between the facing direction of and the relative vector from to .  
sparse graph by discarding edges with small weights from  
Indices 
, ,  video frame indices 
, , ,  target tracklet indices  
, ,  hypergraph vertex indices  
the index for hypergraph clusters e.g. , from  
the index for interaction classes e.g. ; is also the index for subhypergraphs e.g.  
the index for collective activity classes  
III-A Problem Formulation
We aim to infer accurate trajectories of all targets (), their individual activities (), pairwise interactions (), and collective activities () from the observed detections (). The relationship between these variables can be expressed as a joint distribution over the dependency graph in Fig. 1. Based on the conditional independence assumptions encoded in the graphical model, can be decomposed into three terms:
(1) 
(i) is the confidence of target tracking, whose calculation will be given in III-C. (ii) models the interdependencies among target trajectories, individual activity and pairwise interaction labels, which is further expressed as a Markov random field (red cycle in Fig. 1):
(2) 
where , and are three clique potential functions capturing the inter-correlations between each variable pair. The derivation of these clique potentials will be given in III-C and III-D. (iii) reflects an important assumption that collective activities can be effectively modeled by robust inference of target trajectories, individual activities, and pairwise interactions; III-E provides the details.
The inference of the joint tracking and recognition is then formulated as seeking
However, standard iterative optimization such as block coordinate descent is not practical because: (i) the coupling of the variables is complicated; (ii) each of these variables represents a superset of time-dependent variables, so their joint optimization would be very inefficient; and (iii) a real-time processing method is desired. We adopt a heuristic approximate solution using a multi-stage updating scheme, which first jointly updates , , then updates , followed by the update of . Our strategy is based on the hypothesis that inferring pairwise interactions is crucial in resolving the entire optimization, because is the knob governing the representations in-between and .
Our updating scheme shares the spirit of the standard Gibbs sampling and Metropolis-Hastings MCMC methods for inference in probabilistic graphical models. The updating scheme takes the following three stages:
Stage 1: activity-aware tracking (III-C), where individual target trajectories and activity labels are updated using:
(3) 
Stage 2: joint interaction recognition and occlusion recovery (III-D), where the interaction labels together with the target trajectories and activities are updated using:
(4) 
Stage 3: collective activity recognition (III-E), where the collective activity labels are updated using:
(5) 
We will show in III-C and III-D that we model high-order correlations among , and using two respective hypergraphs. The clique potentials in Stage 1 and Stage 2 can then be derived as the optimization of a maximal weight search over the two hypergraphs, in order to infer . Stage 3 infers using a probabilistic formulation based on the inferred .
III-B Cohesive Cluster Search on the Hypergraph
We define an undirected hypergraph , where denotes the vertex set of and denotes a vertex index. An undirected hyperedge with incident vertices is defined as , where is the degree of the hyperedge. The set of all degree hyperedges is denoted as . The hyperedge weights are denoted as , i.e., each hyperedge is associated with a weight.
We use hypergraphs to represent both (1) the detection-tracklet association for tracking , and (2) the correlations among individual activities and pairwise interactions . The joint problem of multi-person tracking (with possible refinements) and group activity recognition can then be solved using a standard cohesive cluster search on the hypergraph [21]. A cluster within a hypergraph is a vertex set with interconnected hyperedges. We use to denote the number of vertices in , and to denote the set of all incident hyperedges of . A cluster is cohesive if its vertices are interconnected by a large number of hyperedges with dense weights. Denote by the weighting function that measures the weight of a cluster. For a vertex , the cohesive cluster search optimization determines a large cluster with dense weights:
(6) 
We use an indicator vector to denote the selection of vertices from to be included in : for , and otherwise. The selection is constrained such that up to vertices, including , are enclosed in , i.e., , and .
The design of affects the resulting cluster from the search. A typical choice of is the total weight of all incident hyperedges. However, direct maximization of the total weight leads to a large cluster that is not necessarily cohesive. Instead, we maximize a normalized weight, which is the total weight divided by the cardinality of the set of all incident hyperedges. This normalization also enables continuous optimization. For a cluster with vertices and degree hyperedges, this normalizer is . Our weighting function is:
(7) 
where denote vertex indices.
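As a concrete illustration, the normalized cluster weight described above can be sketched as follows, assuming a fixed hyperedge degree and a weight table indexed by vertex sets (the function and variable names are ours, not from the paper):

```python
from itertools import combinations
from math import comb

def cluster_weight(cluster, hyperedge_weights, degree=3):
    """Normalized cohesive-cluster weight: the total weight of the
    degree-`degree` hyperedges inside `cluster`, divided by the number
    of such hyperedges, C(|cluster|, degree)."""
    n = len(cluster)
    if n < degree:
        return 0.0  # no hyperedge fits inside the cluster
    total = sum(hyperedge_weights.get(frozenset(edge), 0.0)
                for edge in combinations(sorted(cluster), degree))
    return total / comb(n, degree)
```

Dividing by C(n, degree) prevents the search from favoring large but loosely connected clusters, which matches the normalization rationale above.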
III-C Activity-Aware Tracking
Stage 1 of our method simultaneously recognizes individual activities and links tracklets in the following two steps (see Fig.2 for a schematic overview):
(T1) Generate candidate tracklets . For each existing target , we generate a set of candidate tracklets from the observed detections using the tracking method in [22]. (All candidate tracklets and their labels are denoted with a bar.) We employ a gating strategy to restrict the number of candidate tracklets to consider. The appearance similarity between and each tracklet is calculated using the POI features [3] and the Euclidean metric. If this similarity is above a threshold , is added into . Targets with no associated detection within time are discarded to reduce unnecessary computation. We use and to include a rich set of candidate tracklets for linking. If any tracklet in ends up not linked with any target (e.g., in Fig. 2), a new target is created. If any target ends up with no linked tracklet for a status update, it is considered occluded. (We use trajectory prediction based on motion extrapolation in step (R1) of III-D to determine whether the target is still within the scene.)
(T2) Link tracklets with . After the candidate tracklets are generated, we determine the individual activity label of each candidate tracklet for the purpose of activity-aware tracking. We consider three individual activity labels regarding the motion pattern: standing, walking, and running, by calculating the velocity of each and modeling the posteriors using sigmoids, similar to [23]: . We consider social contextual cues and the correlations between individual activities in finding the best tracklet linking combinations. This also enables robust occlusion recovery for tracking. Our solution is to represent all terms using a tracking hypergraph . The clique potential function in Eq. (3) can then be inferred as , where represents a cohesive cluster obtained from , and denotes the cluster index.
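A minimal sketch of the velocity-based activity posteriors: the thresholds, the sigmoid steepness, and all names here are illustrative assumptions, not values from the paper:

```python
import math

def activity_posteriors(speed, walk_thresh=0.5, run_thresh=2.0, k=4.0):
    """Soft sigmoid assignment of a tracklet's speed to
    {standing, walking, running}. Thresholds and steepness k are
    hypothetical; units (e.g. m/s) are illustrative only."""
    s_walk = 1.0 / (1.0 + math.exp(-k * (speed - walk_thresh)))
    s_run = 1.0 / (1.0 + math.exp(-k * (speed - run_thresh)))
    p_standing = 1.0 - s_walk          # below the walking threshold
    p_running = s_run                  # above the running threshold
    p_walking = s_walk - s_run         # mass between the two thresholds
    z = p_standing + p_walking + p_running
    return {"standing": p_standing / z,
            "walking": p_walking / z,
            "running": p_running / z}
```

Because the two sigmoids share the same steepness, the three posteriors already sum to one; the explicit normalization just guards against numerical drift.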
The activity-aware tracking by linking tracklets with is performed in three sub-steps: (T2.1) Estimate the social group structure using correlations between individual activities in a graph representation. (T2.2) Construct the hypergraph . (T2.3) Optimize tracking based on .
(T2.1) Estimate social group structure. We represent the social group structure of tracked targets and the correlations between individual activities using an undirected complete graph with . The edge weight reflects the correlation between the activities and of and , respectively. We define it to reflect the correlation between the activities of two targets, similar to [23]:
(9) 
where represents the Euclidean distance between the targets. As shown in Fig. 4a, represents the angle between the facing direction of and the relative vector from to , and represents the velocity of . For a target , if is recognized as “standing”, we use the classifier in [24] to calculate the body orientation out of 8 quantizations; otherwise, the motion direction is estimated from the trajectory. Edge weights of are calculated according to Eq. (9) and refined using further grouping cues as in [23]. Fig. 4b visualizes the correlation defined by Eq. (9): the probability is higher at the side of a person than in the front or back, which implements Hall's proxemics social norms [25]. We discard edges with weights lower than to obtain a sparse graph, denoted , for computation speedup.
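The sparsification step can be sketched as follows, assuming the pairwise correlation weights have already been computed (the threshold value and all names are our assumptions); social groups then fall out as connected components of the sparse graph:

```python
from collections import defaultdict

def social_groups(edge_weights, tau=0.3):
    """Discard weak edges (weight < tau) from the complete correlation
    graph and return the connected components of the remaining sparse
    graph as the social groups. `edge_weights` maps unordered target
    pairs (u, v) to correlation scores."""
    adj = defaultdict(set)
    nodes = set()
    for (u, v), w in edge_weights.items():
        nodes.update((u, v))
        if w >= tau:
            adj[u].add(v)
            adj[v].add(u)
    groups, seen = [], set()
    for n in nodes:
        if n in seen:
            continue
        comp, stack = set(), [n]
        while stack:  # depth-first traversal of one component
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        groups.append(comp)
    return groups
```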
(T2.2) Construct hypergraph using to capture the high-order correlations between activities within a group. A vertex represents a hypothesis of linking a tracked target with its candidate tracklet, i.e., , where “” represents the association of two tracklets. A degree hyperedge represents the combination of tracklet linking hypotheses in an assignment.
The linking of tracklets with can be considered an assignment problem with the following two tracklet assignment constraints: (i) a target cannot be linked with two or more candidate tracklets, and (ii) a candidate tracklet cannot be linked with two or more targets. We enforce these constraints in the construction of the hyperedges in . Specifically, , where and , if and only if , and can coexist in a hyperedge in .
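Assuming each hypergraph vertex is encoded as a (target, candidate tracklet) pair, a hedged sketch of the constraint check applied when constructing hyperedges might look like:

```python
def valid_hyperedge(vertices):
    """Each vertex is a (target_id, tracklet_id) linking hypothesis.
    A hyperedge is valid iff no target and no candidate tracklet
    appears more than once among its incident vertices, enforcing the
    two assignment constraints (i) and (ii)."""
    targets = [t for t, _ in vertices]
    tracklets = [c for _, c in vertices]
    return (len(set(targets)) == len(targets)
            and len(set(tracklets)) == len(tracklets))
```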
We further consider motion and behavior consistencies and their correlations (via ) in determining the hyperedge weights. Specifically, we consider three affinities that determine the hyperedge weights: the appearance () of each tracklet, the facingdirection () and the geometric similarity () between tracked targets.
The appearance affinity between a target and a candidate tracklet is computed using the appearance features of the tracklets, as in [3]:
(10) 
We assume that activity states (such as walking direction) do not change abruptly between consecutive short linked tracklets. In other words, the difference between the facing directions should be small for linked tracklets:
(11) 
Our method aims to run on surveillance videos without calibration. To ensure smooth tracking, we use a geometric affinity term to ensure that the relative angles between two targets do not change abruptly:
(12) 
where and represent the relative image coordinate vectors between tracked targets and candidate tracklets. The final affinity value of a hyperedge is computed by , where , , are set as , , .
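Since the exact combination rule and the control parameters are not reproduced here, the sketch below assumes a weighted product of the three affinities, each already normalized to [0, 1]:

```python
def hyperedge_weight(app, direction, geom,
                     w_app=1.0, w_dir=1.0, w_geo=1.0):
    """Combine the appearance, facing-direction, and geometric
    affinities into one hyperedge weight. A weighted product is an
    assumption; with inputs in [0, 1], the combined weight also
    stays in [0, 1]."""
    return (app ** w_app) * (direction ** w_dir) * (geom ** w_geo)
```

A product has the useful property that a single near-zero affinity (e.g., a drastic appearance mismatch) suppresses the whole hyperedge, whereas a weighted sum would not.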
(T2.3) Optimize tracking based on . This step determines the optimal tracklet linking among the candidates represented in the hypergraph . The optimization is performed by the cohesive cluster search on described in III-B. For each vertex , the search yields a cluster with a score. Since a vertex may appear in multiple clusters, any resulting cluster that violates the tracklet assignment constraints in (T2.2) is removed from further consideration. We thereby ensure that the resulting cohesive clusters represent valid tracklet linking hypotheses that are sound and redundancy-free. (Hypergraph clusters are processed sequentially in descending order of their scores; if a new cluster violates the constraints, it is discarded, and any duplication is removed.) In case a target ends up not linked with any candidate tracklet (e.g., in Fig. 3), such a target is either outside the scene or under occlusion. We store all discovered occlusions and try to recover them in Stage 2 (III-D). Finally, the target trajectories are updated with the newly linked tracklets in to be , and the activity labels are augmented with the respective ones in to be .
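The sequential, score-ordered conflict resolution described above can be approximated by a greedy filter; representing each cluster as a scored list of (target, tracklet) links is our assumption:

```python
def select_clusters(clusters):
    """Greedy conflict resolution: visit candidate clusters in
    descending score order and accept a cluster only if none of its
    (target, tracklet) hypotheses conflicts with an already accepted
    one (each target and each tracklet may be linked at most once).

    `clusters` is a list of (score, links) tuples, where `links` is a
    list of (target_id, tracklet_id) pairs."""
    accepted, used_targets, used_tracklets = [], set(), set()
    for score, links in sorted(clusters, key=lambda c: -c[0]):
        t_ids = {t for t, _ in links}
        c_ids = {c for _, c in links}
        if t_ids & used_targets or c_ids & used_tracklets:
            continue  # violates an assignment constraint: discard
        accepted.append(links)
        used_targets |= t_ids
        used_tracklets |= c_ids
    return accepted
```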
III-D Joint Interaction Recognition and Occlusion Recovery
Our approach is motivated by the observation that pairwise interactions within a group provide rich contextual cues to recognize activities (as in Fig. 1) and recover possible occlusions. Stage 2 of our method jointly resolves the two problems of (1) recognizing pairwise interactions and (2) occlusion recovery to improve tracking. We again use a hypergraph representation to explore the high-order correlations among the interactions , such that a similar cluster search scheme can be applied for optimization. Specifically, we construct the (activity) recognition hypergraph () based on the inferred target locations and individual activities . The optimization over maximizes the clique potential function in Eq. (4), as , where represents a cohesive cluster obtained from .
Stage 2 of our method jointly recognizes interactions and recovers occlusions in the following three main steps (see Fig. 3 for a schematic overview):

(R1) Generate hypothetical tracklets for occlusion recovery from the given existing and .

(R2) Construct hypergraph based on to infer high-order correlations among their pairwise interactions .

(R3) Optimize recognition and recovery over to simultaneously recognize interactions and link occluded targets with suitable hypothetical tracklets.
(R1) Generate hypothetical tracklets . For each possibly occluded target , we generate a few hypothetical tracklets based on trajectory predictions, where is empirically set to . (All hypothetical tracklets are denoted with a hat throughout the paper.) For a moving target with , we generate via motion extrapolation. For a stationary target with , we add a small perturbation to .
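A possible sketch of the hypothetical tracklet generation, with illustrative values for the number of tracklets, the prediction horizon, and the perturbation scale (the paper's empirical settings are not reproduced):

```python
import random

def hypothetical_tracklets(last_pos, velocity, n=4, steps=5, sigma=2.0):
    """Generate n hypothetical tracklets for an occluded target.
    Moving target: extrapolate along its velocity with small noise.
    Stationary target (velocity ~ 0): perturb the last known position.
    Positions are (x, y) image coordinates; n, steps, and sigma are
    illustrative settings."""
    moving = (velocity[0] ** 2 + velocity[1] ** 2) ** 0.5 > 1e-6
    tracklets = []
    for _ in range(n):
        x, y = last_pos
        track = []
        for _ in range(steps):
            if moving:  # motion extrapolation plus noise
                x += velocity[0] + random.gauss(0, sigma)
                y += velocity[1] + random.gauss(0, sigma)
            else:       # small perturbation around the last position
                x = last_pos[0] + random.gauss(0, sigma)
                y = last_pos[1] + random.gauss(0, sigma)
            track.append((x, y))
        tracklets.append(track)
    return tracklets
```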
(R2) Construct hypergraph , such that the high-order correlations among the interactions between and are captured for the purposes of simultaneous activity recognition and occlusion recovery. Thus, . Each hyperedge in characterizes the likelihood of a pairwise interaction . For example, in Fig. 3, are connected by 3 hyperedges, which correspond to the interactions “WS”, “RS”, and “SS”, respectively. See IV for the complete list of interaction classes defined in the public datasets [24, 7]. We denote the number of interaction classes as .
The inference of each interaction class can be optimized independently. We can thus decompose into sub-hypergraphs , with for the th interaction class. For each hyperedge , the weight reflects how likely the interaction between the targets is cohesive as a whole (e.g., all walking-side-by-side).
We calculate the hyperedge weights in in two steps: (R2.1) evaluate each pairwise interaction with a confidence score; (R2.2) construct the hyperedges in using the average score over all involved targets.
(R2.1) Recognize pairwise interaction activities. We calculate a confidence score for each possible pairwise interaction between the targets , using a simple but effective rule-based probabilistic approach as in [23]. Specifically, the confidence score of belonging to the th class is calculated by multiplying the following six component probabilities: distance (ds), group connectivity (gc) calculated in Eq. (9), individual activity agreement (aa), distance change type (dc), facing direction (dr), and frontness/sideness (fs):
(13) 
Detailed formulations of the above component probabilities are provided in Tables III and IV.
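Given the six component probabilities, the confidence score of Eq. (13) is simply their product; a direct sketch:

```python
from math import prod

def interaction_confidence(p_ds, p_gc, p_aa, p_dc, p_dr, p_fs):
    """Confidence that a target pair exhibits a given interaction
    class: the product of the six component probabilities (distance,
    group connectivity, activity agreement, distance-change type,
    facing direction, frontness/sideness), as in Eq. (13)."""
    return prod((p_ds, p_gc, p_aa, p_dc, p_dr, p_fs))
```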
Component  Probability 

Distance  , where denotes normal distribution 
Group connectivity  , where 
, where is defined in Eq.(9)  
Individual activity agreement  , where 
Distance-change type  , where 
, where  
;  
if , ; if , ; otherwise,  
Facing direction  , where 
Frontness/sideness  , where 
Pairwise Interaction  Associated Collective Activity (C)  Probabilistic Formulation 

facing-each-other  talking  
(FE)  
standing-in-a-row  queuing  
(SR)  
standing-side-by-side  waiting  
(SS)  
dancing-together  dancing