Who did What at Where and When: Simultaneous Multi-Person Tracking and Activity Recognition

Wenbo Li, Ming-Ching Chang, Siwei Lyu
   Computer Science Department, University at Albany, State University of New York
Ming-Ching Chang is the corresponding author.

We present a bootstrapping framework to simultaneously improve multi-person tracking and activity recognition at the individual, interaction, and social group activity levels. The inference consists of identifying the trajectories of all pedestrian actors, individual activities, pairwise interactions, and collective activities, given the observed pedestrian detections. Our method uses a graphical model to represent and solve the joint tracking and recognition problems in multiple stages: (1) activity-aware tracking, (2) joint interaction recognition and occlusion recovery, and (3) collective activity recognition. We solve the where and when problem with visual tracking, and the who and what problem with recognition. High-order correlations among the visible and occluded individuals, pairwise interactions, groups, and activities are then solved using a hypergraph formulation within the Bayesian framework. Experiments on several benchmarks show the advantages of our approach over state-of-the-art methods.

Group activity, collective activity recognition, pairwise interaction, multi-person tracking, hypergraph, high-order correlation.

I Introduction

Multi-person activity recognition is a major component of many applications, e.g., video surveillance and traffic control. The problem entails the inference of the activities, the actors and their motion trajectories, as well as the time dynamics of the events. This task is challenging, since the activities are analyzed from both the spontaneous individual actions and the complex social dynamics involving groups and crowds [1]. We aim to address not only the where and when problem by visual tracking, but also the who and what problem by activity recognition.

While advanced methods for person detection are becoming more reliable [2, 3], most existing activity recognition approaches rely on visual tracking following a tracking-by-detection paradigm. These methods either fail to consider social interactions while inferring activities [4, 5, 6] or have difficulties recognizing the structural correlations of actions and interactions [7, 8, 9]. In particular, there are two major challenges: (i) ineffective tracking due to frequent occlusions in groups and crowds, and (ii) the lack of a suitable methodology to infer the complex but salient structures involving social dynamics and groups.

In this paper, we address both challenges using a bootstrapping framework to simultaneously improve the two tasks of multi-person tracking and social group activity recognition. We take state-of-the-art person detections [2, 3] as input to perform initial multi-person tracking. We then recognize stable group structures, including the temporally cohesive individual activities (such as walking) and pairwise interactions (such as walking side-by-side, see Fig.1), to robustly infer collective social activities (such as street crossing in a group) in multiple stages. Auxiliary inputs such as body orientation detections can be considered within the stages as well. The recognized activities and salient grouping structures are used as priors to recover occluded detections and correct false associations, which improves performance.

Fig. 1: (Top) This work is based on two main hypotheses that: (i) Multi-person tracking and activity recognition can be jointly solved using an improved, unified framework. (ii) Group collective activities (crossing, talking, waiting, chasing, etc.) can be characterized by the cohesive pairwise interactions (walking side by side, facing each other, standing side by side, running after, etc., see Table IV) within the group. See III-A. (Bottom) The dependency graph that can jointly infer the target tracking (), individual activities (), pairwise interactions (), and collective activities (), all from the input detections (). Numbers on the edges indicate the inference stages in the multi-stage updating scheme.

We explicitly explore the correlations between pairwise interactions (of two individuals) and their group activities (within a group of more individuals) during the optimization. Observe in Fig.1 that group activities generally consist of correlated pairwise interactions, which we exploit in the multi-stage inference steps. In our method, the multi-target tracking and the recognition of individual/group activities are jointly optimized, such that consistent activity labels characterizing the dynamics of the individuals and groups can be obtained. The individual and group activities are formulated using a dynamic graphical model, and high-order correlations are represented using hypergraphs. The simultaneous pedestrian tracking and multi-person activity recognition problems are then solved using an efficient cohesive cluster search in the hypergraphs.

The main contribution of this work is two-fold. First, we propose a new framework that can jointly solve the two tasks of real-time simultaneous tracking and activity recognition. Explicit modeling of the correlations among the individual activities, pairwise interactions, and collective activities leads to a consistent solution. Second, we propose a hypergraph formulation to infer the high-order correlations among social dynamics, occlusions, groups, and activities in multiple stages. Simultaneous tracking and activity recognition are formulated as a bootstrapping framework, which can be solved efficiently by searching for cohesive clusters in the hypergraphs. This hypergraph solution is general in that it can be extended to include additional scenarios or constraints in new applications.

Experiments on several benchmarks show the advantages of our method, with improvements in both activity recognition and multi-person tracking. Our method is easily deployable to real-world applications, since: (i) camera calibration is not required; (ii) online video streams can be processed by handling one time window per round; (iii) the computation can be performed in real-time (about 20 FPS, not including the input detection steps).

II Related Works

There is a large body of work on multi-person tracking and activity recognition; see [10, 11] for surveys. Our work is most related to collective activity recognition, which we organize into the following three categories: recognition based on (i) detection, (ii) tracking, and (iii) simultaneous tracking and recognition.

II-A Collective Activity Recognition based on Detection

A hierarchical model is used in [12] to recognize collective activities by considering the person-person and group-person contextual information. The work of [13] uses hierarchical deep neural networks with a graphical model to recognize collective activities based on the dependencies of individual activities. This work is further extended in [9], where the individual and collective activities are iteratively recognized using an RNN with refinements. Multi-instance learning is used in [14] to recognize collective activities by inferring the cardinality relations of individual activities. A recurrent CNN is used in [15] for joint target detection and activity recognition.

II-B Collective Activity Recognition based on Tracking

In this category, individual target trajectories are used as the input to recognize collective activities. Collective activities are recognized in [16] using random forests for the spatio-temporal volume classification. A two-stage deep temporal neural network is used in [4], where the first stage recognizes individual activities, and the second stage aggregates individual observations to recognize collective activities. In [17], the key constituents of activities and their relationships are used to recognize collective activities. A graphical model is developed in [18] to capture high-order temporal dependencies of video features. The and-or graph [19] is applied for video parsing and activity querying, where the detectors and trackers are launched upon receiving queries. An RNN architecture is designed in [20] to model high-order social group interaction contexts.

II-C Simultaneous Tracking and Activity Recognition

Only a few works deal with the problem of simultaneous multi-person tracking and activity recognition. In [5], per-frame and per-track cues extracted from an appearance-based tracker are combined to capture the regularity of individual actions. A network flow-based model is used in [6] to link detections while inferring collective activities. However, these two methods do not consider pairwise interactions for activity recognition. In [7, 8], the tracking and activity recognition are formulated as a joint energy maximization problem, which is solved by belief propagation with branch-and-bound. However, high-order correlations among individual and pairwise activities are not considered, which limits the activity recognition performance.

III Method

We start by defining the notation used in our method. Given an input video sequence, consider the most recent time window in an online fashion, and denote previous time frames as . Let represent a set of target detections obtained using person detectors, e.g., [2, 3]. Let represent the set of existing target trajectories. Let , , and represent the set of recognized individual activities, pairwise interactions, and collective activities, respectively. Given , our approach aims to simultaneously solve the multi-person tracking and activity recognition problems, by inferring the following four terms within : (i) target trajectories , where is the number of observed targets, (ii) individual activity labels , (iii) pairwise interaction labels , and (iv) collective activity labels , where represents the collective activity with the most involved targets in the -th frame. After a time window is processed, the method extends target tracklets, updates activity labels, and moves on to the next time window: , , , and . To simplify notation, we omit the temporal indices to represent the variables within as , and represent previous variables as , i.e., , , .

symbol description

Video Activities

x a target trajectory
a an individual activity label, e.g.
i a pairwise interaction label, e.g. approaching (AP), facing-each-other (FE), standing-in-a-row (SR), …
c a collective activity label, e.g., crossing, walking, gathering, …
number of observed targets (tracklets)
video time window of length prior to time , i.e.,
person detections (bounding boxes)
target trajectories,
individual activity classes,
pairwise interaction classes,
collective activity classes,
existing entities prior to time window , , , ,
number of individual activity classes, in the CAD and Augmented-CAD datasets, in the New-CAD dataset
number of interaction classes, which is also the number of sub-hypergraphs used in our method, in the CAD and Augmented-CAD datasets, in the New-CAD dataset
number of collective activity classes, in CAD, in Augmented-CAD, in New-CAD datasets

Problem Formulation

a joint distribution
confidence terms from the decomposition of
clique potential functions in the Markov random field
updated terms of after an optimization stage, respectively
updated terms of after an optimization stage, respectively
the distance likelihood term for estimating the interaction between two targets
the group connectivity term for estimating the interaction between two targets
the individual activity agreement term for estimating the interaction between two targets
the distance change type likelihood term for estimating the interaction between two targets
the facing direction likelihood term for estimating the interaction between two targets
the frontness/sideness likelihood term for estimating the interaction between two targets


a candidate tracklet
the set of all candidate tracklets
a (putative) individual activity of a candidate tracklet
the set of (putative) individual activities for all candidate tracklets
the appearance similarity for tracklet linking
time threshold for appearance-based tracklet linking
operator represents the association of two tracklets
the number of hypothetical tracklets to generate from an existing tracklet ,
TABLE I: Notations for video activities, problem formulation and visual tracking.
symbol description


tracking hypergraph
activity recognition hypergraph
the vertex set of a hypergraph
the hyperedge set of a hypergraph
the hyperedge weights of a hypergraph
the appearance hyperedge weight, working with control parameter
the facing-direction hyperedge weight, working with control parameter
the geometric similarity hyperedge weight, working with control parameter
the hyperedge degree, i.e., the number of incident vertices of the hyperedge
a -degree hyperedge,
a hyperedge cluster, which is a vertex set with interconnected hyperedges
number of vertices in a hypergraph cluster ,
the set of all incident hyperedges of a cluster
weighting function operated on a hypergraph cluster
the indicator vector to denote the vertex selection from to be included in
used in weight normalization
used in weight normalization
image coordinate vector between two positions at and
a sub-hypergraph indexed by , i.e.,
the hyperedges of the sub-hypergraph corresponding to the -th interaction class
the hyperedge weights of the sub-hypergraph corresponding to the -th interaction class


the vertex set of a graph; is associated with in this paper
the edge set of a graph
the edge weights of a graph
a graph edge connecting two vertices and
the correlation between the activities of two targets and used to calculate weight
a function to calculate the correlation between the activities of two targets
Euclidean distance between two targets in the image coordinate
the angle between the facing direction of and the relative vector from to
sparse graph by discarding edges with small weights from


, , video frame indices
, , , target tracklet indices
, , hypergraph vertex indices
the index for hypergraph clusters e.g. , from
the index for interaction classes e.g. ; is also the index for sub-hypergraphs e.g.
the index for collective activity classes
TABLE II: Graph and hypergraph notations.

III-A Problem Formulation

We aim to infer accurate trajectories of all targets () and their individual activities (), pairwise interactions () and collective activities () from the observed detections (). The relationships among these variables can be expressed as the joint distribution , depicted as a dependency graph in Fig.1. Based on the conditional independence assumption of in the graphical model, can be decomposed into three terms:


(i) is the confidence of target tracking, whose calculation is given in § III-C. (ii) models the inter-dependencies among target trajectories, individual activity labels, and pairwise interaction labels, which is further expressed as a Markov random field (red cycle in Fig. 1):


where , and are three clique potential functions capturing the inter-correlations between each variable pair. The derivation of these clique potentials is given in § III-C and § III-D. (iii) reflects an important assumption that collective activities can be effectively modeled by robust inference of target trajectories, individual activities and pairwise interactions; § III-E provides details.

The inference of the joint tracking and recognition is then formulated as seeking

However, standard iterative optimization such as block coordinate descent is impractical because: (i) the coupling of variables is still complicated; (ii) each of these variables represents a superset of time-dependent variables, so their joint optimization would be very inefficient; (iii) a real-time processing method is desired. We adopt a heuristic approximate solution using a multi-stage updating scheme, which first jointly updates , , and then updates , followed by the update of . Our strategy is based on an important hypothesis that inferring pairwise interactions is crucial in resolving the entire optimization, because is the knob governing the representations in-between and .

Our updating scheme is similar in spirit to standard Gibbs sampling and MH-MCMC methods for inference in probabilistic graphical models. The updating scheme takes the following three stages:

Stage 1: activity-aware tracking (§ III-C), where individual target trajectories and activity labels are updated using:


Stage 2: joint interaction recognition and occlusion recovery (§ III-D), where the interaction labels together with the target trajectories and activities are updated using:


Stage 3: collective activity recognition (§ III-E), where the collective activity labels are updated using:


We will show in § III-C and § III-D that we model high-order correlations among , and using two respective hypergraphs. The clique potentials in Stage 1 and Stage 2 can be derived as the optimization of maximal weight search over the two hypergraphs, in order to infer . Stage 3 infers using a probabilistic formulation based on the inferred .
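The order of the three update stages can be illustrated with a minimal sketch. This is not the authors' implementation: the function `process_window`, the stage callables, and the dictionary keys `X`, `A`, `I`, `C` (standing for trajectories, individual activities, pairwise interactions, and collective activities) are hypothetical names introduced here only to show which variables each stage reads and writes.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def process_window(
    state: State,
    detections: Any,
    stage1: Callable[[Any, Any, Any], tuple],
    stage2: Callable[[Any, Any, Any], tuple],
    stage3: Callable[[Any, Any, Any], Any],
) -> State:
    """Run the three update stages once for a single time window (illustrative only)."""
    # Stage 1: activity-aware tracking updates trajectories (X) and individual activities (A).
    X, A = stage1(state["X"], state["A"], detections)
    # Stage 2: interaction recognition and occlusion recovery updates I and refines X and A.
    X, A, I = stage2(X, A, state["I"])
    # Stage 3: collective activities (C) are inferred from the outputs of Stages 1 and 2.
    C = stage3(X, A, I)
    return {"X": X, "A": A, "I": I, "C": C}
```

Passing the stages as callables keeps the sketch independent of how each stage is actually solved; in the method described here, Stages 1 and 2 would be hypergraph cluster searches and Stage 3 a probabilistic inference.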

Notations for video activities, problem formulation and tracking are summarized in Table I, while graph and hypergraph related notations are summarized in Table II.

III-B Cohesive Cluster Search on the Hypergraph

We define an undirected hypergraph , where denotes the vertex set of , and denotes the vertex index. An undirected hyperedge with -incident vertices is defined as , where is the degree of the hyperedge. The set of all -degree hyperedges is denoted as . The weights of hyperedges are denoted as , i.e., each hyperedge is associated with a weight.

Fig. 2: Stage 1 activity-aware tracking. Given five targets , and new tracklets , step (T1) optimizes the association of candidate tracklets with existing target tracks. Step (T2) determines the best candidate assignments for tracklet linking in three steps. (T2.1) estimates the group structure using graph , where the edges represent the correlations of activities between individuals. (T2.2) constructs hypergraph with hyperedges based on the estimated group structure. (T2.3) solves the candidate tracklet linking and infers the possible occlusions in an optimization over .

We use the hypergraphs to represent both (1) the detection-tracklet association for tracking , and (2) the correlations among individual activities and pairwise interactions . The joint problem of multi-person tracking (with possible refinements) and group activity recognition can be solved using a standard cohesive cluster search on the hypergraph [21]. A cluster within a hypergraph is a vertex set with interconnected hyperedges. We use to denote the number of vertices in , and to denote the set of all incident hyperedges of . A cluster is cohesive if its vertices are interconnected by a large number of hyperedges with dense weights. Denote the weighting function that measures the weight of a cluster. For a vertex , the cohesive cluster search optimization is to determine a large cluster with dense weights:


We use indicator vector , to denote the selection of vertices from to be included in : for , and otherwise. The selection is constrained such that up to vertices including are enclosed in , such that , and .

The design of affects the resulting cluster from the search. Typical can be the total weight of all incident hyperedges. However, direct maximization of the total weight leads to a large cluster that is not necessarily cohesive. Instead, we maximize a normalized weight, which is the total weight divided by the cardinality of all incident hyperedges. This normalization also enables continuous optimization. For with vertices and -degree hyperedges, this normalizer is . Our weighting function is:


where denote vertex indices.

It is intuitive to enforce that must contain at least one hyperedge, thus must . Let and . The condition is then . We relax the constraint of to be , so is a continuous variable for optimization. Eq.(6) is re-written as:


We solve Eq.(8) using the pairwise update algorithm in [21].
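To make the normalized weighting concrete, the following sketch scores candidate clusters on a toy hypergraph by exhaustive enumeration. It is an illustration under stated assumptions, not the pairwise update algorithm of [21]: the brute-force search is a stand-in that is only feasible for tiny graphs, the normalizer is assumed to be the cluster size raised to the hyperedge degree (the exact expression is not legible in this text), and the names `normalized_weight` and `best_cluster` are hypothetical.

```python
from itertools import combinations


def normalized_weight(cluster, hyperedges, d):
    """Total weight of hyperedges lying inside `cluster`, divided by |cluster|**d.

    `hyperedges` maps a frozenset of d vertices to a weight. The normalizer
    |cluster|**d is an assumption made for this sketch.
    """
    members = set(cluster)
    total = sum(w for e, w in hyperedges.items() if e <= members)
    return total / (len(members) ** d)


def best_cluster(seed, vertices, hyperedges, d, max_size):
    """Exhaustively find the cluster containing `seed` with the largest normalized weight."""
    others = [v for v in vertices if v != seed]
    best, best_score = None, float("-inf")
    # A cluster needs at least d vertices to contain one d-degree hyperedge.
    for n in range(d, max_size + 1):
        for rest in combinations(others, n - 1):
            cluster = (seed,) + rest
            score = normalized_weight(cluster, hyperedges, d)
            if score > best_score:
                best, best_score = cluster, score
    return best, best_score
```

With this normalization, adding a vertex only pays off when it brings enough hyperedge weight to offset the larger denominator, which is the intuition behind preferring cohesive clusters over merely large ones.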

III-C Activity-Aware Tracking

Stage 1 of our method simultaneously recognizes individual activities and links tracklets in the following two steps (see Fig.2 for a schematic overview):

  • (T1) Generate candidate tracklets from new detections that maximize in Eq.(3).

  • (T2) Link tracklets with by maximizing the appearance, motion, and geometric consistencies, which maximizes in Eq.(3).

(T1) Generate candidate tracklets . For each existing target , we generate a set of candidate tracklets from observed detections using the tracking method in [22] (all candidate tracklets and their labels are denoted with a bar ). We employ a gating strategy to restrict the number of candidate tracklets to consider. The appearance similarity between and each tracklet is calculated using the POI features [3] and the Euclidean metric. If this similarity is above a threshold , is added into . Targets with no associated detection within time are discarded to reduce unnecessary computation. We use and to include a rich set of candidate tracklets for linking. If any tracklet in ends up not linked with any target (e.g., in Fig.2), a new target is created. If any target ends up with no linked tracklet for status update, it is considered occluded. (We use trajectory prediction based on motion extrapolation in step (R1) of § III-D to determine if the target is still within the scene.)

(T2) Link tracklets with . After candidate tracklets are generated, for each candidate tracklet , we determine its individual activity label for the purpose of activity-aware tracking. We consider = individual activity labels regarding the motion pattern: standing, walking, and running, by calculating the velocity of each and modeling the posteriors using a sigmoid function similar to [23]: . We consider social contextual cues and the correlations between individual activities in finding the best tracklet linking combinations. This also enables robust occlusion recovery for tracking. Our solution is to represent all terms using a tracking hypergraph . The clique potential function in Eq.(3) can then be inferred as , where represents a cohesive cluster obtained from , and denotes the cluster index.
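One way to turn a tracklet's speed into soft labels for standing, walking, and running is sketched below. The exact sigmoid form used in the method is not legible in this text, so everything here is an assumption: the two-threshold construction, the function name `activity_posteriors`, and the parameter values `walk_speed`, `run_speed`, and `sharpness` are hypothetical and chosen only so that the three outputs are non-negative and sum to one.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def activity_posteriors(speed, walk_speed=1.0, run_speed=4.0, sharpness=4.0):
    """Soft labels over {standing, walking, running} from a scalar speed.

    Assumes walk_speed < run_speed; the thresholds and units are hypothetical.
    """
    above_walk = sigmoid(sharpness * (speed - walk_speed))  # soft test: faster than standing
    above_run = sigmoid(sharpness * (speed - run_speed))    # soft test: faster than walking
    return {
        "standing": 1.0 - above_walk,
        "walking": above_walk - above_run,  # non-negative because walk_speed < run_speed
        "running": above_run,
    }
```

For example, `activity_posteriors(2.5)` puts nearly all mass on "walking" with the default thresholds, while speeds near a threshold yield a graded mixture of the two neighboring labels.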

The activity-aware tracking by linking tracklets with is performed in three sub-steps: (T2.1) Estimate social group structure using correlations between individual activities in a graph representation. (T2.2) Construct hypergraph . (T2.3) Optimize tracking based on .

Fig. 3: Stage 2 joint interaction recognition and occlusion recovery in a road-crossing scenario, where , , and are walking side-by-side across a road, while , , are standing side-by-side waiting. Step (R1) considers the linking of the occluded target to three hypothetical tracklets . Step (R2) constructs hypergraph for the inference in two steps. (R2.1) evaluates each pairwise interaction by calculating a confidence score, where wrongly assigned labels are depicted in red. (R2.2) constructs hyperedges based on the recognized pairwise interactions, where each hyperedge characterizes the likelihood of a pairwise interaction. Step (R3) optimizes the inference over to jointly recognize interaction labels and resolve the tracklet linking and occlusion recovery.

(T2.1) Estimate social group structure. We represent the social group structure of tracked targets and the correlations between individual activities using an undirected complete graph with . . Edge weight reflects the correlation between activities and of and , respectively. We define to reflect the correlation between activities of two targets similar to [23]:


where represents Euclidean distance between the targets. As shown in Fig.4a, represents the angle between the facing direction of and the relative vector from to , and represents the velocity of . For a target , if is recognized as “standing”, we use the classifier in [24] to calculate the body orientation out of 8 quantizations. Otherwise, estimates motion direction from the trajectory. Edge weights of are calculated according to Eq.(9) and refined using further grouping cues as in [23]. Fig.4b visualizes the correlation defined by Eq.(9). The probability is higher on the side of a person than in the front or back, which is an implementation of Hall’s proxemics social norms [25]. We discard edges with weights lower than to obtain a sparse graph denoted as for computation speedup.
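The geometric inputs to the correlation and the edge-pruning step can be sketched as follows. This is only an illustration: the helper names `distance_and_angle` and `sparsify` are hypothetical, positions are assumed to be 2D image coordinates, and the correlation function itself is not reproduced because its formula is not legible in this text.

```python
import math


def distance_and_angle(pos_i, facing_i, pos_j):
    """Return (Euclidean distance, angle in [0, pi]) for target i relative to target j.

    The angle is measured between the facing direction of i and the
    relative vector pointing from i to j.
    """
    rx, ry = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    dist = math.hypot(rx, ry)
    if dist == 0.0 or math.hypot(facing_i[0], facing_i[1]) == 0.0:
        return dist, 0.0  # angle is undefined for degenerate vectors; 0 is an arbitrary choice
    diff = math.atan2(ry, rx) - math.atan2(facing_i[1], facing_i[0])
    # Wrap the heading difference to [-pi, pi) and take its magnitude.
    angle = abs((diff + math.pi) % (2.0 * math.pi) - math.pi)
    return dist, angle


def sparsify(edge_weights, threshold):
    """Keep only edges whose weight is at least `threshold`."""
    return {edge: w for edge, w in edge_weights.items() if w >= threshold}
```

Pruning low-weight edges before building the hypergraph limits the number of candidate hyperedges, which is where the computational saving comes from.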

Fig. 4: Social group affinity between a pair of individuals is calculated based on: (a) distance, angle, and motion (velocity magnitude & direction). (b) visualizes such a measure at (0, 0) with direction vector (1, 1) arrow in a color map depicting the probability kernel between 0 (blue) and 1 (red).

(T2.2) Construct hypergraph using to capture the high-order correlations between activities within a group. A vertex represents a hypothesis of linking a tracked target with its candidate tracklet, i.e., where “” represents the association of two tracklets. A -degree hyperedge represents the combination of tracklet linking hypotheses in an assignment.

The linking of tracklets with can be considered as an assignment problem with the following two tracklet assignment constraints: (i) a target cannot be linked with two or more candidate tracklets, and (ii) a candidate tracklet cannot be linked with two or more targets. We enforce these constraints in the construction of hyperedges in . Specifically, , where and , if and only if , and can co-exist in a hyperedge in .
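The two assignment constraints translate directly into a compatibility test between linking hypotheses. The sketch below enumerates the valid hyperedges of a given degree; representing a vertex as a `(target, tracklet)` pair and the names `compatible` and `valid_hyperedges` are conventions assumed here for illustration.

```python
from itertools import combinations


def compatible(v1, v2):
    """Two linking hypotheses may co-exist only if they differ in both target and tracklet."""
    (target1, tracklet1), (target2, tracklet2) = v1, v2
    return target1 != target2 and tracklet1 != tracklet2


def valid_hyperedges(vertices, d):
    """Enumerate all d-vertex combinations whose members are pairwise compatible."""
    return [
        combo
        for combo in combinations(vertices, d)
        if all(compatible(a, b) for a, b in combinations(combo, 2))
    ]
```

For two targets and two candidate tracklets, only the two one-to-one assignments survive as degree-2 hyperedges; every combination that reuses a target or a tracklet is rejected.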

We further consider motion and behavior consistencies and their correlations (via ) in determining the hyperedge weights. Specifically, we consider three affinities that determine the hyperedge weights: the appearance () of each tracklet, the facing-direction () and the geometric similarity () between tracked targets.

The appearance affinity between a target and a candidate tracklet is computed using the appearance features of tracklets as in [3]:


We assume that activity states (such as walking direction) do not change abruptly in-between small linked tracklets. In other words, difference between facing directions of two targets should be small for linked tracklets:


Our method aims to run on surveillance videos without calibration. To ensure smooth tracking, we use a geometric affinity term to ensure that relative angles between two targets do not change abruptly:


where and represent the relative image coordinate vectors between tracked targets and candidate tracklets. Final affinity value of a hyperedge is computed by , where , , are set as , , .

(T2.3) Optimize tracking based on . This step aims to determine the optimal tracklet linking among candidates represented in the hypergraph . The optimization is performed by the cohesive cluster search on described in § III-B. For each vertex , such a search yields a cluster with a score. Since a vertex may appear in multiple clusters, any resulting cluster that violates the tracklet assignment constraints in (T2.2) is removed from further consideration. We ensure that the resulting cohesive clusters represent valid tracklet linking hypotheses that are sound and redundancy-free. (Hypergraph clusters are processed sequentially in descending order of their scores; a new cluster that violates the constraints is discarded, and any duplication is removed.) In case a target ends up not linked with any candidate tracklets (e.g., in Fig.3), such a target should be either outside the scene or under occlusion. We store all discovered occlusions and try to recover them at Stage 2 in § III-D. Finally, target trajectories are updated with the newly linked tracklets in to be , and activity labels are augmented with the respective ones in to be .
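The sequential processing of clusters described above can be sketched as a greedy pass over the scored clusters. This is a minimal reading of that procedure, not the authors' code: the function name `select_links` and the representation of a cluster as a list of `(target, tracklet)` pairs are assumptions.

```python
def select_links(scored_clusters):
    """Greedily accept clusters in descending score order.

    `scored_clusters` is a list of (score, cluster) pairs, where each cluster is
    a list of (target, tracklet) linking hypotheses. A cluster is discarded if it
    would link a target to two tracklets or a tracklet to two targets; links that
    repeat an already accepted link are treated as harmless duplicates.
    """
    tracklet_of, target_of = {}, {}
    for _, cluster in sorted(scored_clusters, key=lambda item: item[0], reverse=True):
        new_tracklet_of, new_target_of = dict(tracklet_of), dict(target_of)
        valid = True
        for target, tracklet in cluster:
            if (new_tracklet_of.get(target, tracklet) != tracklet
                    or new_target_of.get(tracklet, target) != target):
                valid = False  # conflicts with an accepted link or within the cluster
                break
            new_tracklet_of[target] = tracklet
            new_target_of[tracklet] = target
        if valid:
            tracklet_of, target_of = new_tracklet_of, new_target_of
    return tracklet_of
```

Targets that do not appear in the returned mapping are the ones left unlinked, which the text treats as either out of the scene or occluded.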

III-D Joint Interaction Recognition and Occlusion Recovery

Our approach is motivated by the observation that pairwise interactions within a group can provide rich contextual cues to recognize the activities (as in Fig.1) and recover possible occlusions. Stage 2 of our method jointly resolves the two problems of (1) recognizing pairwise interactions and (2) occlusion recovery to improve tracking. We again use a hypergraph representation to explore the high-order correlations among the interactions , such that a similar cluster search scheme can be applied for optimization. Specifically, we construct the (activity) recognition hypergraph () based on the inferred target locations and individual activities . The optimization over maximizes the clique potential function in Eq.(4), as , where represents a cohesive cluster obtained from .

Stage 2 of our method jointly recognizes pairwise interactions and recovers occlusions in the following three main steps (see Fig.3 for a schematic overview):

  • (R1) Generate hypothetical tracklets for occlusion recovery from given existing and .

  • (R2) Construct hypergraph based on to infer high-order correlations among their pairwise interactions .

  • (R3) Optimize recognition and recovery over to simultaneously recognize interaction and link occluded targets with suitable hypothetical tracklets.

(R1) Generate hypothetical tracklets . For each possibly occluded target , we generate a few hypothetical tracklets based on trajectory predictions, where is empirically set to (all hypothetical tracklets are denoted with a hat throughout the paper). For a moving target with , we generate via motion extrapolation. For a stationary target with , we add a small perturbation to .
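A minimal sketch of hypothesis generation for an occluded target is given below. The text only states that moving targets are extrapolated and stationary targets are perturbed, so the specifics here are assumptions: constant-velocity extrapolation from the last two positions, uniform noise of magnitude `jitter`, the speed cutoff `speed_eps`, and the function name `hypothetical_positions` are all hypothetical, and each hypothesis is reduced to a single predicted position rather than a full tracklet.

```python
import random


def hypothetical_positions(track, k, jitter=2.0, speed_eps=0.5, seed=0):
    """Generate k hypothetical next positions for a possibly occluded target.

    `track` is a list of (x, y) image positions with the most recent last and
    must contain at least two points. All parameter values are illustrative.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    (x0, y0), (x1, y1) = track[-2], track[-1]
    vx, vy = x1 - x0, y1 - y0
    moving = (vx * vx + vy * vy) ** 0.5 > speed_eps
    # Moving targets: constant-velocity extrapolation. Stationary targets: stay in place.
    base_x, base_y = (x1 + vx, y1 + vy) if moving else (x1, y1)
    return [
        (base_x + rng.uniform(-jitter, jitter), base_y + rng.uniform(-jitter, jitter))
        for _ in range(k)
    ]
```

Generating several perturbed hypotheses per target gives the later hypergraph optimization a small set of alternatives to choose from, instead of committing to a single prediction.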

(R2) Construct hypergraph , such that high-order correlations of the interactions among and are captured for simultaneous activity recognition and occlusion recovery. Thus, . Each hyperedge in characterizes the likelihood of a pairwise interaction . For example, in Fig.3, are connected by 3 hyperedges, which correspond to the interactions “WS”, “RS”, and “SS”, respectively. See Table IV for a complete list of the interaction classes defined in public datasets [24, 7]. We denote the number of interaction classes.

The inference of each interaction class can be optimized independently. We can thus decompose into sub-hypergraphs , with for the -th interaction class. For each hyperedge , the weight reflects how likely the interactions among the targets are cohesive as a whole (e.g., all walking-side-by-side).

We calculate the hyperedge weights in in two steps: (R2.1) evaluates each pairwise interaction with a confidence score. (R2.2) constructs hyperedges in using the average score from all involved targets.

(R2.1) Recognize pairwise interaction activities. We calculate a confidence score for each possible pairwise interaction between the targets , using a simple yet effective rule-based probabilistic approach as in [23]. Specifically, the confidence score of belonging to the -th class is calculated by multiplying the following six component probabilities: distance (ds), group connectivity (gc) calculated in Eq.(9), individual activity agreement (aa), distance change type (dc), facing direction (dr), and frontness/sideness (fs):


Detailed formulations of the above component probabilities are provided in Tables III and IV.
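Steps (R2.1) and (R2.2) can be sketched as a product of component probabilities followed by an average over the pairs inside a hyperedge. The component keys follow the abbreviations in the text (ds, gc, aa, dc, dr, fs); the function names, the dictionary layout, and the reading of "average score from all involved targets" as an average over all target pairs in the hyperedge are assumptions made for illustration.

```python
import math
from itertools import combinations

COMPONENTS = ("ds", "gc", "aa", "dc", "dr", "fs")


def interaction_confidence(probs):
    """Confidence of one pairwise interaction class: product of the six component probabilities."""
    return math.prod(probs[name] for name in COMPONENTS)


def hyperedge_weight(targets, pair_scores):
    """Average pairwise confidence over all target pairs in a hyperedge.

    `pair_scores` maps a sorted (target_a, target_b) tuple to its confidence
    for the interaction class of the sub-hypergraph being built.
    """
    pairs = list(combinations(sorted(targets), 2))
    return sum(pair_scores[p] for p in pairs) / len(pairs)
```

Because the score is a product, a single near-zero component (for instance, two targets that are far apart) suppresses the whole interaction hypothesis, regardless of how well the other cues agree.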

Component Probability
Distance , where denotes normal distribution
Group connectivity , where
, where is defined in Eq.(9)
Individual activity agreement , where
Distance-change type , where
, where
if , ; if , ; otherwise,
Facing direction , where
Frontness/sideness , where
TABLE III: Component probabilities for the pairwise interaction activities. The parameters used in these component probabilities, e.g., the means and standard deviations, are calculated from the training dataset.
Pairwise Interaction Associated Collective Activity (C) Probabilistic Formulation
facing-each-other talking
standing-in-a-row queuing
standing-side-by-side waiting
dancing-together dancing