Spatio-temporal interaction model for crowd video analysis


Neha Bhargava
Indian Institute of Technology Bombay
India
neha@ee.iitb.ac.in
   Subhasis Chaudhuri
Indian Institute of Technology Bombay
India
sc@ee.iitb.ac.in
Abstract

We present an unsupervised approach to analyze crowds at various levels of granularity: individual, group and collective. We also propose a motion model to represent the collective motion of the crowd. The model captures the spatio-temporal interaction pattern of the crowd from trajectory data captured over a period of time. Furthermore, we propose an effective group detection algorithm that utilizes the eigenvectors of the interaction matrix of the model. We also show that the eigenvalues of the interaction matrix characterize various group activities such as being stationary, walking, splitting and approaching. The algorithm extends trivially to recognizing individual activity. Finally, we discover the overall crowd behavior by classifying a crowd video into one of eight categories. Since crowd behavior is determined by its constituent groups, we demonstrate the usefulness of group level features during classification. Extensive experimentation on various datasets demonstrates the superior performance of our algorithms over state-of-the-art methods.

Understanding human behavior at the individual, group and crowd levels in different scenarios has always attracted researchers. The variability and complexity of the behavior make it a highly challenging task. However, this decade is witnessing huge interest in the area of crowd motion analysis due to its various applications in surveillance, safety, public place management, hazard prevention, and virtual environments. This interest has resulted in many interesting papers in the area. We are aware of at least four survey papers on the subject of crowd analysis, which indicates the amount of attention it has drawn in this and the previous decade [1], [2], [3], [4]. The latest survey [1] by Li et al. encapsulates recent works published after 2009, covering motion pattern segmentation, crowd behavior and anomaly detection. Thida et al. [2] provide a review of macroscopic and microscopic modeling methods; they also present a critical survey on crowd event detection. Julio et al. cover various vision techniques applicable to crowd analysis, such as tracking, density estimation, and computer simulation [3]. Zhan et al. discuss various vision based techniques used in crowd analysis; they also discuss crowd analysis from the perspective of different disciplines: psychology, sociology and computer graphics [4]. At the top level, the techniques used in crowd motion analysis can be divided into two major classes: holistic and particle based. Holistic methods consider the crowd as a single entity and analyze the overall behavior; these methods fail to provide much insight at the individual or intermediate level. On the other hand, particle based methods consider the crowd as a collection of individuals, but their performance degrades with increasing crowd density due to occlusion and tracking problems. Analysis at the intermediate level, i.e., the group level, may provide more insight at both the individual and overall levels.

(a) Uniform crowd
(b) Mixed crowd
(c) Stationary group
(d) Walking
(e) Approaching
(f) Splitting
Figure 1: (a) and (b) give examples of structured and unstructured crowd. Output of the proposed algorithm: (c)-(f) show groups with different activities: Stationary (St), Walking (W), Splitting (Sp) and Approaching (A). Tracklets for some of the agents over the past few frames are also shown. Each color represents a detected group (best viewed in color). The videos are from the BEHAVE [5] and CUHK [6] datasets.

We believe that a moderately dense crowd consists of various groups, which form the primary entities of the crowd [6, 7], whereas a highly dense crowd can be considered to form a single group and a highly sparse crowd might have groups with a cardinality of one. Together, the groups guide the overall behavior of the crowd and individually influence the actions of their members. Therefore, group level analysis, and hence group detection, becomes important in crowd analysis. We define a group as a set of individuals (agents) having some sort of interaction, e.g., group members walking together. Spatial proximity is also necessary to form a group; if agents share a similar motion pattern but are far away from each other, they do not form a group as per our definition. Each group has its own set of goals, which leads to various interaction patterns among the members of the group. The collective behavior of these constituent groups identifies the global crowd behavior, which can vary from a highly structured to a completely unstructured pattern. In the case of a structured crowd, for example marching soldiers, all groups are in coordination and share the same goal (see Fig. 1(a)); whereas in an unstructured crowd, for example at a railway station or a shopping complex, there are multiple groups with different goals (see Fig. 1(b)). We are interested in understanding these different types of crowd behaviors at various levels by exploiting the motion information of individuals. The paper makes the following contributions:

  1. A framework is proposed to model the collective motion of the crowd by a first order dynamical system. The model captures the interaction patterns among the individuals. Although the proposed model does not capture any possible non-linear relations, its usefulness for short-term analysis has been verified experimentally. We also provide an optimization formulation for the estimation of the interaction matrix under the constraints of spatial proximity, temporal continuity and sparsity of inter-agent relationships.

  2. Since the interaction matrix is learned from the trajectory data, it captures the spatio-temporal patterns present among the agents. We observe that the eigenvectors of the interaction matrix reflect these spatio-temporal patterns. Thus, we propose a spectral-clustering-based algorithm [8] to identify the groups present in the scene. Extensive experimentation on various datasets demonstrates the effectiveness of the algorithm.

  3. We also demonstrate how activities can be classified at three different levels: atomic (individual), group and crowd. The eigenvalues of the interaction matrix characterize various group and individual activities; Fig. 1(c)-(f) show examples of activities at the group level. At the crowd level, we employ group level features to identify the behavior of the crowd. We classify crowd videos into one of the 8 categories defined by [6] and demonstrate the performance in terms of classification accuracy.

The remainder of the paper is organized as follows. The next section reviews the related literature. Section 2 explains the proposed mathematical formulation, followed by the group detection algorithm in Section 3. Detection of group and atomic activities is discussed in Section 4. We describe crowd video classification in Section 5. Experimental results are presented in Section 6, followed by conclusions in Section 7.

1 Related Work

There are numerous research papers in the challenging and interesting area of crowd behavior analysis. There are several holistic approaches (e.g., [9], [10], [11], [12], [13]) as well as particle based algorithms (e.g., [14], [15], [7], [16], [17]) in the literature. Holistic methods analyze the crowd as a single entity and ignore individuals or groups. In many papers, a dense crowd is considered analogous to a fluid and hence concepts from fluid mechanics are applied for analysis. Mehran et al. [9] present a streakline representation of crowd flow for behavior analysis. Solmaz et al. recognize crowd behaviors such as bottlenecks, fountainheads, lanes, arches and blocks through stability analysis of a dynamical system [10]. Benabbas et al. detect motion patterns and events in crowded scenes by modeling motion and velocity at each spatial location [11]. In [12], Lin et al. find coherent motion regions in the video by generating a thermal energy field.

The agent based approaches analyze each individual or group to discover the global behavior. Solera et al. [17] propose correlation clustering based group detection using socially constrained features. Shao et al. introduce a collective transition prior in [6] and represent each group by a Markov chain; they define interesting group descriptors which prove useful in group state analysis and crowd classification. In [15], Sethi and Roy-Chowdhury propose a phase space algorithm to identify pairwise correlation between motion patterns. Ge et al. find groups by hierarchical clustering based on pairwise velocities and distances [18], [7]. Zhou et al. find groups by using coherent filtering [16]; they propose a coherent neighbor invariance property which characterizes coherently moving individuals. Šochman and Hogg [19] infer groups based on the social force model [14]; they define a pairwise group activity confidence to identify groups. Srikrishnan and Chaudhuri [20] define a linear cyclic pursuit based framework for collective motion modelling with the goal of short-term prediction, but they do not explore group detection and there is no analysis of crowd behavior. In [17], group detection is cast as a clustering problem and a socially meaningful pairwise affinity is learned under a Structural SVM framework.

Most of the particle based algorithms compute pairwise velocity and spatial cues to find the groups hierarchically. They do not model the spatio-temporal patterns of the agents collectively, which might capture more complex interactions. Additionally, most of the methods assume a constant velocity motion model, which is not valid for many scenarios. To address these limitations, in this paper we propose to model motion trajectories collectively instead of individually or pairwise. Also, instead of relying directly on spatio-temporal information (which is prone to noise) for group detection, we use spectral clustering to identify the groups.

2 Mathematical Formulation

We define a group as a set of agents having spatial proximity and some sort of interaction. In general, such interactions are complex and non-linear in nature. We approximate these interactions locally in time by a first order dynamical model. Note that by an agent we refer to an individual entity (represented by a point to be tracked) in the crowd.

2.1 Proposed Interaction Model

We model the collective relationship among the agents by a first order affine system. Our hypothesis is based on the intuition that each agent, while taking the next step, takes into consideration (i) the movement of other agents present nearby and (ii) her/his desired goal. To capture these two intuitions, our model relates the next position of each agent to the current positions of all the agents, including herself/himself. Let $\mathbf{x}(t) = [x_1(t), x_2(t), \dots, x_N(t)]^T$; then

$$\mathbf{x}(t+1) = A\,\mathbf{x}(t) + \mathbf{b} \qquad (1)$$

where $N$ is the total number of agents, $A \in \mathbb{R}^{N \times N}$, $\mathbf{b} \in \mathbb{R}^{N}$ is the bias, and $x_i(t)$ is the location of the $i$-th agent at time instant $t$ along the $x$-axis. We call $A$ the interaction matrix; it captures the evolution of an agent as a function of all agents present in the scene. Note that no assumption is made on the form or entries of $A$. It need not be symmetric, i.e., agent $i$ may not depend on agent $j$ in the same way as agent $j$ depends on agent $i$. For example, consider a case where agent $i$ is stationary and agent $j$ approaches him/her. Since their behaviors are not symmetric with respect to each other, we assume that this implies $a_{ij} \neq a_{ji}$.

In this paper, it is assumed that the motions along the $x$ and $y$ directions are independent and hence can be analyzed independently. The corresponding model along the $y$ direction is $\mathbf{y}(t+1) = A_y\,\mathbf{y}(t) + \mathbf{b}_y$. In the rest of the paper, we discuss the solution for the matrix $A$ (i.e., $A_x$), noting that the same process is also carried out for $A_y$. In the end, the outputs from both models are combined appropriately to get the final output. We expect the matrices $A_x$ and $A_y$ to be dependent on the crowd motion. Since crowd behavior might change with time, the interaction matrix is time varying in nature, which we represent as $A(t)$ where $t$ is a time instant. Assuming $A$ has $N$ independent eigenvectors, the general solution to Eq. (1) is given as

$$\mathbf{x}(t) = \sum_{i=1}^{N} \left(c_i\,\lambda_i^t + d_i\right)\mathbf{u}_i \qquad (2)$$

where $\lambda_i$ is the $i$-th eigenvalue, $\mathbf{u}_i$ is the corresponding normalized eigenvector, and $c_i$ and $d_i$ are the corresponding constant coefficients that depend on the initial condition $\mathbf{x}(0)$ and the bias $\mathbf{b}$, respectively. Different values of $\lambda_i$, $c_i$ and $d_i$ generate various motion patterns for an agent. These patterns can be associated with different motion tracks generated by an agent while walking, approaching, splitting or being stationary. For example, an agent is stationary if $\lambda_i = 0$, staying at the location given by $d_i$, and is moving with a constant speed if $\lambda_i = 1$ with a non-zero bias. Hence, this more generalized model is appropriate for modeling temporally localized complex motions.
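As an illustration of these modes, the following minimal sketch (Python/NumPy; the matrix and values are our own toy example, not taken from the paper) iterates Eq. (1) for two decoupled agents and reproduces the stationary and constant-velocity patterns:

```python
import numpy as np

def simulate(A, b, x0, T):
    """Roll the first-order affine model x(t+1) = A x(t) + b forward T steps."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(T):
        xs.append(A @ xs[-1] + b)
    return np.stack(xs)            # shape (T+1, N)

# Two decoupled agents: agent 0 is pinned at location 5 (lambda = 0),
# agent 1 moves with constant speed 0.5 (lambda = 1, non-zero bias).
A = np.array([[0.0, 0.0],
              [0.0, 1.0]])
b = np.array([5.0, 0.5])
print(simulate(A, b, x0=[5.0, 0.0], T=4))
# agent 0 stays at 5; agent 1 advances by 0.5 per frame
```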

Figure 2: (a) Illustration of the suitability of the proposed model: average $n$-step prediction error for sample videos from the BEHAVE and CUHK datasets; each curve corresponds to a different video. (b) Neighborhood criterion: the spatial neighborhoods around agents p and r are represented as circles around them. There are a total of 20 agents in the scene, out of which only 8 are neighbors of p. Estimating the elements of the row of $A$ corresponding to agent p while considering all agents in the scene requires 50 previous video frames, whereas the use of the neighborhood constraint reduces this to 23 frames (see Section 2.3).

2.2 Validation of the Model

We use the average $n$-step prediction error as a measure to test the validity of the proposed model on real videos. Fig. 2(a) shows the average error for different prediction step sizes on videos from the BEHAVE and CUHK datasets, each curve corresponding to a different video. The $n$-step prediction error at any time instant $t$ is calculated as follows:

$$e_n(t) = \frac{1}{N}\,\big\|\hat{\mathbf{x}}(t+n) - \mathbf{x}(t+n)\big\|_2 \qquad (3)$$

It may be noted that the matrix $A$ is estimated from the latest $T$ video frames up to time $t$, and Eq. (1) is then iterated to obtain the prediction $\hat{\mathbf{x}}(t+n)$. The $n$-step prediction error for a video is obtained by averaging $e_n(t)$ over all frames of the video. As expected, the error increases with $n$, but only marginally. We observe that, for both datasets, the prediction remains quite valid up to 1-1.5 seconds (about 30 frames). Since the model assumes that the interaction remains the same over $T$ frames, Fig. 2(a) suggests that one can select $T$ up to 30 frames without introducing much error. These error plots show that the proposed model is suitable for short-term analysis, which is the underlying theme of the proposed algorithm.
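A direct implementation of this measure might look as follows; the per-agent normalization in Eq. (3) is our assumption, and `A`, `b` are the parameters estimated from the frames preceding `t`:

```python
import numpy as np

def n_step_prediction_error(A, b, X, t, n):
    """Average n-step prediction error at frame t (cf. Eq. 3).
    X holds observed positions, one frame per row: shape (frames, N)."""
    x_hat = X[t].astype(float).copy()
    for _ in range(n):
        x_hat = A @ x_hat + b        # iterate the learned model (Eq. 1)
    # per-agent normalization is our reading of Eq. (3)
    return np.linalg.norm(x_hat - X[t + n]) / X.shape[1]
```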

2.3 Estimation of Interaction Model Parameters

The matrix $A$ and vector $\mathbf{b}$ at any time instant are learned from the immediate past trajectory data of all the agents in a least squares framework. We update $A$ and $\mathbf{b}$ with each incoming frame, as interaction patterns may change over time. At the same time, sudden changes in these interactions are unlikely. Therefore, it is desired that the entries of $A$ and $\mathbf{b}$ do not change drastically between consecutive time instants; we assume them to vary smoothly over time. We incorporate this constraint by minimizing the $\ell_2$ norm of the difference between the current matrix $A(t)$ and the previous estimate $A(t-1)$. Furthermore, for crowded scenes, it is unlikely that an agent's motion depends on all the agents present in the neighborhood. We capture this sparse relationship in $A$ by minimizing the $\ell_1$ norm of $A$.

Adding these constraints to the cost function, the final formulation at time instant $t$ becomes:

$$\{\hat{A}, \hat{\mathbf{b}}\} = \arg\min_{A,\,\mathbf{b}} \big\|\mathbf{Y} - [A \;\; \mathbf{b}]\,\mathbf{X}\big\|_F^2 + \lambda_1 \big\|A - \hat{A}(t-1)\big\|_F^2 + \lambda_2 \|A\|_1 \qquad (4)$$

where $\mathbf{X}$ contains the positions of all agents from the $(t-T)$-th to the $(t-1)$-th frames concatenated together, with an appended row of ones to account for the bias, $\mathbf{Y}$ contains the corresponding positions from the $(t-T+1)$-th to the $t$-th frames, $\hat{A}(t-1)$ is the estimate at the previous frame, and $\lambda_1$ and $\lambda_2$ are appropriate regularization parameters. Note that we write $A$ instead of $A(t)$ for notational convenience.

One requires at least $N+1$ past positions to solve Eq. (4). Therefore, the interaction pattern is assumed to remain constant over $T \geq N+1$ frames. Hence we want $N$ to be small enough to capture the short-term linear relationship among the agents. A large $N$ (in crowded scenes) leads to two major problems: (i) longer trajectories (i.e., higher $T$) are required to learn the interaction matrix, as $T \geq N+1$, and these may not be available, and (ii) the interaction may not remain constant over $T$ past positions for high values of $T$; as discussed in the previous section, we would like to keep $T \leq 30$. To address these problems, we identify the spatial neighbors of each agent separately and learn only the corresponding entries in the matrix $A$ (one row at a time). The neighborhood is defined as follows: agent p is a neighbor of agent q if $\|\mathbf{l}_p - \mathbf{l}_q\| \leq R$, where $\mathbf{l}_p$ denotes the location of agent p. The value of $R$ is chosen so that the resulting number of neighbors satisfies the constraint on $T$. The intuition for enforcing the neighborhood criterion is that it is unlikely that far-away agents influence the motion of an agent. The advantage is that shorter trajectories are now sufficient, as fewer entries of $A$ need to be learned. Note that we estimate the matrix $A$ in a row-wise manner, where the $i$-th row has a number of entries to be estimated equal to one more than the number of neighbors of agent $i$ (the extra entry accounts for the bias). Further, there could be an agent within the spatial proximity of another agent without any interaction between them; in that case the corresponding entry in the matrix $A$ should be zero. This is enforced by the sparsity constraint in Eq. (4). We use the L1General package developed by Schmidt [21] for solving the L1-regularization problem.

For an illustration, see Fig. 2(b). There are a total of 20 agents present in the scene. Estimation of the row of the matrix $A$ corresponding to agent p requires 50 previous frames when all agents are considered, whereas the neighborhood-based estimation reduces this to 23. Also consider a case where agents p and r interact with each other but are not within spatial proximity owing to the neighborhood constraint. The interaction is still captured when the intersection of the neighborhoods of p and r contains at least one interacting agent; in this case it is q, who is in the spatial proximity of both.
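The paper solves Eq. (4) with the L1General package [21]; purely as an illustration, the sketch below solves one row of $[A\;\mathbf{b}]$ with proximal gradient descent (ISTA). Here `Z` stacks the past positions of the agent's neighbors plus a ones column for the bias, and `y` holds the agent's next positions; whether the bias entry is exempted from the $\ell_1$ penalty is a detail we do not resolve here.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def estimate_row(Z, y, a_prev, lam1, lam2, iters=500):
    """ISTA solver for one row a of [A b]:
       min_a ||y - Z a||^2 + lam1 ||a - a_prev||^2 + lam2 ||a||_1
    Z: (T, k+1) neighbor positions plus a ones column,
    y: (T,) the agent's next positions, a_prev: previous row estimate."""
    L = 2.0 * (np.linalg.norm(Z, 2) ** 2 + lam1)  # Lipschitz constant of smooth part
    a = np.asarray(a_prev, dtype=float).copy()
    for _ in range(iters):
        grad = 2.0 * Z.T @ (Z @ a - y) + 2.0 * lam1 * (a - a_prev)
        a = soft_threshold(a - grad / L, lam2 / L)  # gradient step + shrinkage
    return a
```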

3 Group Detection

In this section, we discuss an algorithm for identifying the groups present in the scene. As seen from Eq. (2), the general solution is a linear combination of the eigenvectors at any time instant $t$. Notice that if the corresponding entries of any two rows of the eigenvector matrix are similar, the corresponding agents form a group. This group information is not available from the position vector alone at a particular time instant, because temporal evolution is also an important factor in deciding the groups. Since the eigenvectors are learned from the trajectories collectively, they encapsulate the spatio-temporal evolution of the agents and hence can be exploited for group detection.

Let the eigenvector matrix $U$ contain all the eigenvectors column-wise. We define a mapping for agent $i$ as $f(i) = (u_1(i), u_2(i), \dots, u_K(i))$, where $u_j(i)$ is the $i$-th entry of the $j$-th eigenvector of the interaction matrix and $K$ is the number of significant eigenvalues. A clustering algorithm is applied on the points $\{f(i)\}_{i=1}^{N}$ to identify the groups. Since the clustering runs on the components of the eigenvectors, the algorithm falls in the category of spectral clustering [8]. Since the number of groups is unknown, we apply a threshold-based clustering. The adaptive threshold used for the point $f(i)$ is proportional to its magnitude $\|f(i)\|$, with a proportionality factor found empirically. For example, all the agents within this threshold distance of $f(i)$ will form a group with agent $i$. In this way, all the groups are obtained. We consider only the significant eigenvectors of $A$ (those with sufficiently large $|\lambda_j|$) for group detection, since the response from the remaining eigenvectors dies down to an insignificant level within a period of $T$ frames.
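A greedy form of this threshold-based grouping is sketched below; the scan order, the threshold factor `gamma`, and the significance cut-off `lam_min` are illustrative choices (the paper sets them empirically):

```python
import numpy as np

def detect_groups(U, lam, gamma=0.1, lam_min=0.5):
    """Threshold-based spectral clustering on rows of the eigenvector
    matrix U (columns are eigenvectors of A, lam their eigenvalues)."""
    keep = np.abs(lam) > lam_min          # significant eigenvectors only
    F = np.real(U[:, keep])               # embedding: one row f(i) per agent
    n = F.shape[0]
    labels = -np.ones(n, dtype=int)
    g = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        radius = gamma * np.linalg.norm(F[i])        # adaptive threshold
        dists = np.linalg.norm(F - F[i], axis=1)
        members = (dists <= radius) & (labels < 0)   # unassigned close agents
        labels[members] = g
        g += 1
    return labels
```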

It may be noted that this group detection algorithm remains the same in the case where $A$ does not have $N$ independent eigenvectors. In such a case, the clustering algorithm runs on the generalized eigenvectors.

4 Group Activity Identification

While the eigenvectors identify the groups, the eigenvalues can be used to determine the activity of a group. We employ the same model as in Eq. (1) for a group g to estimate its interaction matrix $A_g$ and bias $\mathbf{b}_g$. We do not reuse the submatrix of the previously learned $A$ formed by the agents of group g to get $A_g$; this is to obtain a refined matrix for the group and to avoid any possible interference from outside agents in the estimation. Let $\mathbf{x}_g(t) = [x_1(t), \dots, x_{n_g}(t)]^T$, where $n_g$ is the cardinality of group g and $x_i(t)$ is the position of the $i$-th agent of the group at time instant $t$. To learn the matrix $A_g$ at time instant $t$, we define a similar optimization framework as follows, where the second term enforces temporal continuity in the activity but, unlike Eq. (4), there is no need for a sparsity constraint since, by definition, all agents in a group interact. Therefore,

$$\{\hat{A}_g, \hat{\mathbf{b}}_g\} = \arg\min_{A_g,\,\mathbf{b}_g} \big\|\mathbf{Y}_g - [A_g \;\; \mathbf{b}_g]\,\mathbf{X}_g\big\|_F^2 + \lambda_1 \big\|A_g - \hat{A}_g(t-1)\big\|_F^2 \qquad (5)$$

where $\mathbf{X}_g$ and $\mathbf{Y}_g$ are constructed from the trajectories of the group members as in Eq. (4).

Figure 3: Illustration of the group activities Stationary, Approaching, Walking and Splitting, respectively, from the estimated model parameters for a group consisting of two members. The eigenvalue $\lambda_m$ (defined below) is the activity-deciding eigenvalue. See the text for details.

Assuming $A_g$ to be again diagonalizable, the general solution is similar to the one given in Eq. (2). The velocity vector for group g can be written as

$$\mathbf{v}_g(t) = \mathbf{x}_g(t+1) - \mathbf{x}_g(t) = \sum_{i} c_i\,(\lambda_i - 1)\,\lambda_i^t\,\mathbf{u}_i \qquad (6)$$

where $\lambda_i$ are the eigenvalues of $A_g$. Since some of the coefficients $c_i$ and $d_i$ could be zero, let $\lambda_m$ be the largest eigenvalue for which at least one of the coefficients $c_m$ or $d_m$ is non-zero. Now we state how different values of $\lambda_m$ characterize various activities:

  1. Stationary: A group is stationary if $\lambda_m = 0$, indicating that all the eigenvalues (with at least one non-zero coefficient) are zero. That corresponds to a zero velocity vector and hence the agents are stationary. In the illustration shown in Fig. 3(a), the deciding eigenvalue is $\lambda_m = 0$. The two agents are stationary at locations 140 and 120, respectively.

  2. Approaching: A group has approaching members if $0 < \lambda_m < 1$, since then $\mathbf{v}_g(t) \to 0$ as $t \to \infty$. In the example shown in Fig. 3(b), $0 < \lambda_m < 1$. One agent is stationary at 120 while the other starts from location 100 and approaches the first one.

  3. Walking: If $\lambda_m = 1$ then the group is walking with a constant velocity. In Fig. 3(c), both agents walk together and the deciding eigenvalue is $\lambda_m = 1$. Note that we do not discriminate between walking and running in this work.

  4. Splitting: A group has a tendency for divergence if $\lambda_m > 1$, since then $\|\mathbf{v}_g(t)\|$ grows with $t$. In Fig. 3(d), this corresponds to $\lambda_m > 1$. Initially the two agents stand together; then the second agent starts moving away from the first one, leading to a split of the group.

This group activity detection method depends on the eigenvalues and is hence sensitive to perturbations in the measurements. To address this, we define threshold bands around the crucial eigenvalues: for a small, empirically chosen $\epsilon$, if $|\lambda_m - 1| \leq \epsilon$ we consider $\lambda_m$ to be 1, and if $|\lambda_m| \leq \epsilon$ it is considered to be 0.
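The resulting decision rule, including the threshold bands, can be summarized in a few lines; the tolerance below is an illustrative value:

```python
def group_activity(lam_m, eps=0.05):
    """Map the deciding eigenvalue to an activity label using
    threshold bands (eps is an illustrative tolerance)."""
    if abs(lam_m) <= eps:
        return "Stationary"          # lambda_m ~ 0: zero velocity
    if abs(lam_m - 1.0) <= eps:
        return "Walking"             # lambda_m ~ 1: constant velocity
    if lam_m < 1.0:
        return "Approaching"         # velocity decays to zero
    return "Splitting"               # velocity grows with time

print(group_activity(0.97))   # Walking (falls inside the band around 1)
```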

4.1 Atomic Activity Detection

This algorithm is now extended to the identification of an individual's activity as follows. Let $x(t)$ denote the position of an agent at time $t$; then

$$x(t+1) = a\,x(t) + b \qquad (7)$$

The velocity is as follows:

$$v(t) = x(t+1) - x(t) = c\,(a - 1)\,a^t \qquad (8)$$

where $c$ is a constant that depends on the initial position.

Note that there is no longer an activity called splitting, as one needs at least two agents to define it. We identify the following activities based on the value of $a$:

  1. Stationary: An agent is stationary if $a = 0$, staying at the location given by $b$. It is also stationary when $a = 1$ and $b = 0$.

  2. Stopping: $0 < a < 1$ indicates that the agent is stopping soon.

  3. Walking: An agent is walking if $a \geq 1$ (excluding the stationary case $a = 1$, $b = 0$). Further, an agent is walking with a constant velocity if $a = 1$ and $b \neq 0$.

Note that the group detection and activity recognition algorithms run in the $x$ and $y$ directions independently, and the results need to be combined. For group detection, a group is formed only if it is formed in both directions. For example, let $[1, 1, 2, 1]$ and $[1, 2, 1, 1]$ be the label vectors (indicating the assigned group number for each of the four agents) obtained along the $x$ and $y$ directions, respectively. This says that agents {1,2,4} form a group along the $x$ direction while {1,3,4} form a group along the $y$ axis. Combining both labelings results in the final label vector $[1, 2, 3, 1]$, i.e., out of the 4 agents, 1 and 4 are grouped together while agents 2 and 3 are singleton groups. To identify the final group activity from the two separate estimates along the $x$ and $y$ directions, we merge the two decisions according to the following priority sequence: Splitting > Walking > Approaching > Stationary. For example, if a group has splitting and approaching activities in the $x$ and $y$ directions respectively, the final group activity is splitting. A sketch of both combination steps is given below.
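A small sketch of the two combination rules; the example label vectors are the ones from the text:

```python
def combine_group_labels(lx, ly):
    """Agents are grouped only if grouped along BOTH axes: intersect the
    two labelings by treating each (x-label, y-label) pair as one group."""
    pairs, labels = {}, []
    for pair in zip(lx, ly):
        labels.append(pairs.setdefault(pair, len(pairs)))
    return labels

# Priority for merging per-axis activity decisions (highest first).
PRIORITY = ["Splitting", "Walking", "Approaching", "Stationary"]

def combine_activities(ax, ay):
    return min(ax, ay, key=PRIORITY.index)

print(combine_group_labels([1, 1, 2, 1], [1, 2, 1, 1]))  # [0, 1, 2, 0]
print(combine_activities("Approaching", "Splitting"))    # Splitting
```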

5 Crowd Video Classification

Having group level information in hand, we can use it to identify the overall crowd behavior. The ability to identify crowd behavior enables crowd management systems to design and manage public places effectively, ensuring safety and smooth operation. The overall crowd behavior is determined by how each group behaves. Depending on the synchronization among the groups, the behavior of a crowd varies from structured to unstructured. In this section, we define group level features that are useful for crowd video classification. We classify crowd videos into the 8 classes defined by [6]. The dataset, containing 474 video clips, covers a wide variety of videos. The eight classes are as follows:

C1: Mixed crowd

C2: Well organized crowd following a mainstream

C3: Not well organized crowd following a mainstream

C4: Crowd merge

C5: Crowd split

C6: Crowd crossing in opposite directions

C7: Intervened escalator traffic

C8: Smooth escalator traffic

We employ group level features that range from low-level details such as motion information to high-level information such as group activities. The features are described as follows:

  1. Group density ($\rho$): the ratio of the number of groups to the total number of agents in the scene. A low value of $\rho$ indicates a highly structured crowd; for example, $\rho$ for a group of marching soldiers is small, whereas a mixed crowd has a higher group density.

  2. Histogram of $\lambda_m$: This histogram has three bins, $\lambda_m < 1$, $\lambda_m = 1$ and $\lambda_m > 1$, where $\lambda_m$ is the deciding eigenvalue of the interaction matrix of a group (from the last section). The value of a particular bin is the number of groups in the scene whose $\lambda_m$ falls in that bin. A histogram skewed towards the $\lambda_m \geq 1$ bins indicates a moving crowd, whereas one skewed towards $\lambda_m < 1$ suggests a more or less stationary crowd.

  3. Histogram of motion direction: The motion direction of each member of a group is calculated from its trajectory data, and the mean direction is assigned to the group. This histogram has eight bins covering $0^\circ$ to $360^\circ$ with a bin size of $45^\circ$. The bin value is the number of groups falling in that particular bin. A uniform histogram indicates a mixed crowd, whereas a skewed histogram indicates directionality in the crowd movement.

Since the analysis is conducted independently in the $x$ and $y$ directions, we get two histograms of $\lambda_m$, leading to a final feature vector of length 15 (one group density value, two three-bin eigenvalue histograms, and one eight-bin direction histogram). We use a random forest (RF) as the classifier [22]. It consists of a multitude of decision trees trained on randomly sampled subsets of the training dataset (bootstrap aggregating). This bootstrapping improves performance by reducing the variance of the classifier. Also, the split at each node of a tree is decided by $m$ features selected randomly out of the $M$ available features, where $m \ll M$. We train the RF on the above-mentioned features to classify a crowd video. The classification results are discussed in the next section.
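A sketch of the feature construction and classifier using scikit-learn; the bin boundaries reuse the threshold bands of Section 4, and the tree count is a placeholder rather than the paper's setting:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def crowd_features(group_sizes, lam_x, lam_y, directions, eps=0.05):
    """15-dim feature vector: group density, two 3-bin eigenvalue
    histograms (x and y directions) and an 8-bin direction histogram."""
    density = len(group_sizes) / float(sum(group_sizes))
    def lam_hist(lams):
        h = [0, 0, 0]
        for lam in lams:
            if abs(lam - 1.0) <= eps: h[1] += 1   # lambda_m = 1
            elif lam < 1.0:           h[0] += 1   # lambda_m < 1
            else:                     h[2] += 1   # lambda_m > 1
        return h
    dir_hist, _ = np.histogram(directions, bins=8, range=(0.0, 360.0))
    return np.array([density] + lam_hist(lam_x) + lam_hist(lam_y)
                    + dir_hist.tolist())

feat = crowd_features([3, 2, 5], [1.0, 0.4, 1.3], [1.02, 0.0, 2.0],
                      [10.0, 95.0, 180.0])
clf = RandomForestClassifier(n_estimators=100)   # placeholder setting
# clf.fit(training_features, training_labels)
```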

6 Experiments and Results

In this section, we discuss the performance of the proposed algorithms for group detection, group activity recognition and crowd video classification. We have tested our algorithms on various publicly available datasets containing real videos. We first discuss these datasets followed by performance evaluation of the proposed algorithms.

6.1 Datasets

We tested our algorithms on different videos from various datasets contributed by several researchers, namely CUHK [6], BEHAVE [5], BIWI Walking Pedestrians [23], Crowds-By-Example (CBE) [24] and Vittorio Emanuele II Gallery (VEIIG) [25]. The CUHK dataset is a comprehensive crowd video dataset containing 474 video clips covering various crowd behaviors with varying crowd density. The BEHAVE dataset has video clips with low crowd density covering various group activities. The BIWI dataset contains two low density crowd videos (namely eth and hotel). CBE has a medium density crowd video (student003) recorded outside a university. These datasets collectively cover a large variety of crowd videos.

6.2 Group Detection

We tested the group detection algorithm on all 474 videos from the CUHK dataset and on 3 video clips (with a total duration of more than 10 minutes) from the BEHAVE dataset. For the CUHK videos, we restricted our algorithm to run only on those agents that have sufficiently long tracks, since some of the clips are too short to allow analysis of a large number of agents. We compared the proposed algorithm with the other methods on these selected agents. The ground truth for the CUHK dataset was obtained manually.

         CF [16]   CT [6]   Proposed
NMI      0.66      0.69     0.86
Purity   0.71      0.72     0.90
RI       0.67      0.69     0.85

Table 1: Performance comparison of different group detection algorithms on the CUHK dataset.

Table 2: Performance comparison of the proposed group detection with [17] on the BIWI eth, CBE student003 and VEIIG videos, in terms of G-MITRE precision (P) and recall (R) for the baseline, the method of [17], and the proposed method.

We compare the proposed group detection algorithm with the state-of-the-art methods of Shao et al. [6] and Zhou et al. [16]. For quantitative analysis on the CUHK videos, we randomly select two time instants per video at which we compare the proposed algorithm with the other methods and the ground truth, instead of manually deciding the instants at which performance is evaluated. We use Normalized Mutual Information (NMI) [26], Purity [27] and Rand Index (RI) [28], which are widely used for the evaluation of clustering algorithms. Table 1 shows the comparison on these measures. It is quite evident from the table that the performance of the proposed algorithm far surpasses those of [6] and [16].
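For reference, these measures can be computed as follows (Purity written out by hand; NMI and Rand Index via scikit-learn, assuming a recent version that provides `rand_score`). The toy labelings are illustrative:

```python
from sklearn.metrics import normalized_mutual_info_score, rand_score

def purity(y_true, y_pred):
    """Fraction of agents assigned to the majority ground-truth group
    of their predicted cluster."""
    total = 0
    for c in set(y_pred):
        members = [t for t, p in zip(y_true, y_pred) if p == c]
        total += max(members.count(g) for g in set(members))
    return total / len(y_true)

y_true = [0, 0, 1, 1, 2]   # ground-truth group labels
y_pred = [0, 0, 1, 2, 2]   # detected group labels
print(normalized_mutual_info_score(y_true, y_pred))
print(purity(y_true, y_pred))
print(rand_score(y_true, y_pred))
```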

Fig. 4 demonstrates a visual comparison for different scenarios. Since Zhou et al. [16] find coherent motion patterns at one time instant and then update them over time, their method is sensitive to tracking errors and may accumulate errors whenever a frame has a tracking error. Shao et al. [6] assign every agent to a collective transition prior. They impose a spatial proximity constraint only at the initial time instant, which might not remain effective as time progresses: their algorithm groups all agents moving in the same direction, giving less importance to their spatial relationships. This can be observed from the output figures in column (b) of Fig. 4. Further, in one of the rows, a person with a red hat is moving faster than the group behind him, but CT and CF fail to capture this difference in velocity while the proposed algorithm captures it. The groups in the last row show small changes in their directions of movement, which again are not captured by these two methods, while the proposed method detects such small changes.

Figure 4: Comparison of group detection results from Coherent Filtering [16] in column (a), Collective Transition [6] in column (b), our proposed method in column (c) with the ground truth in column (d) for different types of scenes. Each group is represented by a different color. Best viewed in color and when zoomed.

We also compare the proposed group detection algorithm with the method of [17] on the VEIIG, student003 and eth videos. For comparison with [17], we also use the G-MITRE precision P and recall R proposed by them. Table 2 shows the quantitative results, which indicate improved performance by the proposed method.

The proposed algorithm outperforms these state-of-the-art methods because it is more robust to tracking errors, since we extract groups from the eigenvectors rather than using the tracklets directly. It is quite evident from Fig. 4, where the tracklets of various agents are marked with different colors to indicate the group they belong to, that the proposed algorithm detects agents in a group much better than the other existing methods. The proposed algorithm also yields high NMI, Purity and RI values on the video clips from the BEHAVE dataset, whereas the corresponding measures for [6] and [16] are very low (the Purity for CF, for instance, is particularly poor). This shows that these methods do not perform well on videos of a sparse crowd, whereas the proposed method handles a sparse crowd effectively.

(a) stationary
(b) approaching
(c) walking, stationary
(d) splitting
Figure 5: Group activity results on BEHAVE dataset. Notation - St: Stationary, A: Approaching, W: Walking and Sp: Splitting. Best viewed in color and when zoomed.
Figure 6: Group activity results on CUHK dataset. Same notation as in Fig. 5. Best viewed in color and when zoomed.

6.3 Group Activity Recognition

We use the BEHAVE and CUHK datasets to test the group activity identification algorithm. Here, we have excluded clips containing other activities such as fighting. We compared the activity results with the ground truth at regular intervals. Fig. 8 shows the confusion matrix for the proposed algorithm on the BEHAVE dataset. The algorithm gives an accuracy of 70% for the Walking and Stationary activities, whereas it is lower for the other two activities. We observed that the algorithm gets confused between these two activities. We suspect that the confusion is due to the fact that Splitting and Approaching are more abrupt in their motion dynamics than Walking and Stationary, which results in a poorer estimate of the eigenvalues over the window of $T$ frames. On the CUHK dataset, since the groups in most of the videos are walking, we obtain a high accuracy. Some qualitative results on videos from the BEHAVE and CUHK datasets are given in Fig. 5 and Fig. 6, respectively.

Figure 7: (a) Confusion matrix with categories represented as C1 to C8 (true classes along the column and predicted classes along the row), (b) Out of bag (OOB) error, (c) Importance plot for the features
Figure 8: Confusion matrix for group activity on BEHAVE dataset. The true classes are along the column and the predicted classes are along the row.

6.4 Crowd Video Classification

Since we update the interaction model with each incoming frame, as explained in Section 2.3, we collect group level features at regular intervals. From each class, we randomly pick 70% of the feature vectors to train the classifier and use the remainder for testing. As discussed before, we use a random forest classifier. We run the classifier 100 times with random splits of the dataset for training and testing. To avoid over-fitting, the training and testing data do not contain features from the same video. The average accuracy obtained is around 74%, an improvement over [6], where the reported accuracy is 70%. The confusion matrix is shown in Fig. 7(a). From this figure, it is seen that classification of the crowd for Class 4 (Crowd Merge) is difficult, while the rest of the classes are categorized quite easily using the proposed method. The out-of-bag (OOB) error, which indicates the generalization error, converges to a value of about 30%, as shown in Fig. 7(b). The importance plot, which shows the significance of each group level feature in the classification, is shown in Fig. 7(c). It shows that the group density and the histograms of eigenvalues are important for classification.
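The video-level separation of training and test data can be realized, for instance, with a grouped split; everything below (data shapes, tree count) is a placeholder rather than the paper's exact setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data: 15-dim feature vectors, class labels (8 classes),
# and the id of the video each feature vector came from.
X = np.random.rand(200, 15)
y = np.random.randint(0, 8, size=200)
video_ids = np.random.randint(0, 40, size=200)

# 70/30 splits that never put features from one video on both sides.
gss = GroupShuffleSplit(n_splits=100, train_size=0.7, random_state=0)
accs = []
for tr, te in gss.split(X, y, groups=video_ids):
    clf = RandomForestClassifier(n_estimators=100).fit(X[tr], y[tr])
    accs.append(clf.score(X[te], y[te]))
print(np.mean(accs))   # average accuracy over the random splits
```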

7 Conclusions

In this work, we presented a framework for the analysis of moderately dense crowd videos at various levels. We proposed a first order dynamical system to model agent trajectories collectively and subsequently demonstrated the effectiveness of this interaction model for group detection. We also showed how the eigenvalues of the model characterize group activities. We then showed the effectiveness of group level features in crowd video classification.

Our algorithm assumes the availability of tracks, which is itself a challenge in many crowded videos due to occlusion and other tracking problems. As a next goal, we aspire to define a unified framework in which the proposed model and a tracker work together, incorporating group interaction cues to improve each other's performance in crowded videos.

References

  • [1] T. Li, H. Chang, M. Wang, B. Ni, R. Hong, S. Yan, Crowded scene analysis: A survey, IEEE Transactions on Circuits and Systems for Video Technology 25 (3) (2015) 367–386.
  • [2] M. Thida, Y. L. Yong, P. Climent-Pérez, H.-l. Eng, P. Remagnino, A literature review on video analytics of crowded scenes, in: Intelligent Multimedia Surveillance, Springer, 2013, pp. 17–36.
  • [3] J. S. J. Junior, S. Musse, C. Jung, Crowd analysis using computer vision techniques, IEEE Signal Processing Magazine 5 (27) (2010) 66–77.
  • [4] B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin, L.-Q. Xu, Crowd analysis: a survey, Machine Vision and Applications 19 (5-6) (2008) 345–357.
  • [5] S. Blunsden, R. Fisher, The behave video dataset: ground truthed video for multi-person behavior classification, Annals of the BMVA 4 (1-12) (2010) 4.
  • [6] J. Shao, C. C. Loy, X. Wang, Scene-independent group profiling in crowd, in: CVPR, 2014, IEEE, 2014, pp. 2227–2234.
  • [7] W. Ge, R. T. Collins, R. B. Ruback, Vision-based analysis of small groups in pedestrian crowds, IEEE Trans. PAMI 34 (5) (2012) 1003–1016.
  • [8] A. Y. Ng, M. I. Jordan, Y. Weiss, et al., On spectral clustering: Analysis and an algorithm, Advances in neural information processing systems 2 (2002) 849–856.
  • [9] R. Mehran, B. E. Moore, M. Shah, A streakline representation of flow in crowded scenes, in: Computer Vision–ECCV 2010, Springer, 2010, pp. 439–452.
  • [10] B. Solmaz, B. E. Moore, M. Shah, Identifying behaviors in crowd scenes using stability analysis for dynamical systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (10) (2012) 2064–2070.
  • [11] Y. Benabbas, N. Ihaddadene, C. Djeraba, Motion pattern extraction and event detection for automatic visual surveillance, Journal on Image and Video Processing 2011 (2011) 7.
  • [12] W. Lin, Y. Mi, W. Wang, J. Wu, J. Wang, T. Mei, A diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes, IEEE Transactions on Image Processing 25 (4) (2016) 1674–1687.
  • [13] A. Pennisi, D. D. Bloisi, L. Iocchi, Online real-time crowd behavior detection in video sequences, Computer Vision and Image Understanding 144 (2016) 166–176.
  • [14] D. Helbing, P. Molnar, Social force model for pedestrian dynamics, Physical review E 51 (5) (1995) 4282.
  • [15] R. J. Sethi, A. K. Roy-Chowdhury, Individuals, groups, and crowds: Modelling complex, multi-object behaviour in phase space, in: IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011, IEEE, 2011, pp. 1502–1509.
  • [16] B. Zhou, X. Tang, X. Wang, Coherent filtering: detecting coherent motions from crowd clutters, in: Computer Vision–ECCV 2012, Springer, 2012, pp. 857–871.
  • [17] F. Solera, S. Calderara, R. Cucchiara, Socially constrained structural learning for groups detection in crowd, IEEE transactions on pattern analysis and machine intelligence 38 (5) (2016) 995–1008.
  • [18] W. Ge, R. T. Collins, B. Ruback, Automatically detecting the small group structure of a crowd, in: Workshop on Applications of Computer Vision (WACV), 2009, IEEE, 2009, pp. 1–8.
  • [19] J. Šochman, D. C. Hogg, Who knows who-inverting the social force model for finding groups, in: IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011, IEEE, 2011, pp. 830–837.
  • [20] V. Srikrishnan, S. Chaudhuri, Crowd motion analysis using linear cyclic pursuit, in: International Conference on Pattern Recognition (ICPR), 2010, IEEE, 2010, pp. 3340–3343.
  • [21] M. Schmidt, G. Fung, R. Rosales, Optimization methods for l1-regularization, University of British Columbia, Technical Report TR-2009 19.
  • [22] L. Breiman, Random forests, Machine learning 45 (1) (2001) 5–32.
  • [23] S. Pellegrini, A. Ess, K. Schindler, L. Van Gool, You’ll never walk alone: Modeling social behavior for multi-target tracking, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 261–268.
  • [24] A. Lerner, Y. Chrysanthou, D. Lischinski, Crowds by example, in: Computer Graphics Forum, Vol. 26, Wiley Online Library, 2007, pp. 655–664.
  • [25] S. Bandini, A. Gorrini, G. Vizzari, Towards an integrated approach to crowd analysis and crowd synthesis: A case study and first results, Pattern Recognition Letters 44 (2014) 16–29.
  • [26] M. Wu, B. Schölkopf, A local learning approach for clustering, in: Advances in neural information processing systems, 2006, pp. 1529–1536.
  • [27] C. C. Aggarwal, A human-computer interactive method for projected clustering, IEEE Transactions on Knowledge and Data Engineering 16 (4) (2004) 448–460.
  • [28] W. M. Rand, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical association 66 (336) (1971) 846–850.