Long-Range Trajectories from Global and Local Motion Representations
Motion is a fundamental cue for scene analysis and human activity understanding in videos. It can be encoded in trajectories for tracking objects and for action recognition, or in the form of flow to address behaviour analysis in crowded scenes. Each approach can only be applied to a limited set of scenarios. We propose a motion-based system that represents the spatial and temporal features of the flow in terms of long-range trajectories. The novelty resides in the system formulation, its generic approach to handling scene variability and motion variations, the integration of motion from local and global representations, and the resulting long-range trajectories, which overcome the problems of trajectory-based approaches. We report results and conclusions that establish its pertinence to different scenarios by comparing and correlating the extracted trajectories with manually annotated trajectories of individual pedestrians. We also propose an evaluation framework and highlight the system characteristics that can be used for human activity tasks, notably motion segmentation.
Nowadays, almost any public space has a CCTV (Closed Circuit Television) system installed. This has fostered the implementation of (semi-)automatic systems to interpret real-world scenes by monitoring individuals and their activities, detecting common motion patterns, and identifying unusual behaviours. The key insight for this type of system is to exploit spatiotemporal relationships among subjects and motion patterns, while retaining structural information of the scene. Underlying motion representations such as trajectories are, by nature, intuitive and useful to build up solutions for such problems. To avoid ambiguity with related work and clarify our aim, throughout the paper the term global motion trajectories refers to trajectories that translate typical paths of pedestrians, normally used for video scene analysis [?].
Long-duration trajectories offer several advantages [?] over short-range tracklets [?] for visual analysis tasks such as activity discovery and learning of semantic region models for event recognition. However, extracting them requires overcoming occlusions, camera motion, and non-rigid deformations. In this paper, we provide a system for computing long-range global motion trajectories based on the combination of a rich global flow representation, which accurately captures the spatiotemporal continuity and transitions of motion, with a local flow energy field expressed in terms of entropy, which measures the degree of motion variability and reveals salient regions for flow topology analysis.
The scenarios addressed in this work are those typical of public spaces monitored by CCTV systems. We aim to show that the proposed system can approximate global motion trajectories for different pedestrian scenarios such as crowds, with structured and unstructured movement, and multi-tracking, with sparse and dense groups. Our contributions can be summarised as:
The paper’s outline is as follows. In Section 2, we survey the related work. Next, in Section 3, we briefly present the relevant theoretical concepts behind our system. Section 4 gives an overview of the system’s foundations. The main system steps are described in Section 5. The experimental setup and results are reported in Section 6. Finally, we present conclusions and future work in Section 7.
Considering various types of scenes, from dense crowd contexts to multi-tracking with sparse groups, the literature can be subdivided according to scene density and object size. Indeed, human motion can be described at different levels, from micro to macro scale. Each level requires a specific motion analysis, since the underlying relationships among individuals and their space-context behaviour differ. Normally, for high-density scenes with low object resolution, motion is modelled at a global level and patterns are inferred [?]. On the other hand, for scenes with a small number of objects, multi-tracking approaches [?] are preferred, since they track objects individually and describe motion by their spatial position. The computer vision community has been addressing the research problems related to each scenario independently.
Crowded scenes fall into two categories, structured and unstructured, depending on whether the movement of objects is defined by physical constraints or whether they move freely in any direction, respectively. Related work focuses on modelling scene structures and on recognising the co-occurrences of crowd behaviours. For instance, the authors of [?] proposed a framework that uses Lagrangian particle dynamics to advect a grid of particles and uses them for motion interpretation in the form of physically and dynamically distinguishable motion segments. This type of approach overcomes the inability of optical flow to capture long-range temporal dependencies, and does not suffer from the problems faced by object-tracking-based approaches. However, such approaches do not consider spatial changes, cause time delays, and imply high computational effort.
For structured scenes, motion patterns are the most salient features that help in understanding the scene [?]. Some approaches [?] divide the video into spatiotemporal cuboids to identify prototypical motion pattern representations and the variations within each one. Motion trajectories are not usually extracted under these conditions. To the best of our knowledge, some works [?] try to approximate the extraction of motion patterns in terms of super tracks, but they did not measure their similarity to traditional object tracks, nor did they test their approach on low-density scenes.
For unstructured scenes, the concept of coherent motion emerges. It describes the free collective movement of individuals in groups and is used to infer collective behaviours. This type of approach models the crowd dynamics, focusing on individuals and the interactions among them [?]. These approaches can follow two types of taxonomy:
Macro models have statistical rather than physical meaning, while micro models have physical validation but are difficult to scale up to the macro scale.
Scenarios with sparse and dense groups follow single- or multi-tracking approaches. Both present difficulties related to target size, the number of similar objects, and occlusions [?]. For crowded scenes, tracking-based models disregard the correlation between pedestrians in close vicinity. Motion trajectories can also be extracted at feature level [?], instead of object level, by tracking interest points. However, such methods must address several critical issues: selecting good features to track, correctly mapping the selected features to the actions of interest, and handling trajectory discontinuity due to inconsistent point correspondence, among others.
Our work borrows concepts from fluid dynamics to integrate macroscopic behaviour, in terms of global flow dynamics, with a microscopic approach, in terms of local information theory concepts. From a technical point of view, the proposed system is novel in that it integrates different concepts into a single pipeline to extract long-range trajectories that approximate the global motion trajectories. From a practical perspective, it is applicable to a wider range of scenarios than most common methods, which can only be applied to limited scenarios.
Motion can be described by Lagrangian and Eulerian flow descriptions, which are formulated on different frames of reference and describe coherent structures of temporal dynamics in terms of trajectories. The Lagrangian coordinate system implies the advection and tracking of particles injected into the flow, and permits the observation of how the flow deforms and rotates the fluid. The Eulerian approach extracts a dense flow coverage since particles are computed at fixed positions, providing an overview over the entire flow at a specific time-step.
In a time-dependent vector field there are four types of characteristic curves: streamlines, pathlines, streaklines, and timelines. Streamlines and pathlines are described as curves tangent to the vector field. Streaklines can be computed from the spatial and temporal gradients of the flow map. For unsteady flows, the direction of flow depends on time as well as on position; therefore the streamline, pathline, and streakline representations differ [?]. In this work, we explore the complementary streakline and streamline representations.
Vector Field Representation and Advection
A grid of particles is overlaid on the flow field. The scene’s motion is quantified by particles’ movement driven by dense optical flow. This advection process considers a video represented by a 3-dimensional array , where is the number of frames, frame’s width, and frame’s height, and by the corresponding optical flow , where , , and . The particle position at grid point at time is achieved by solving
The repetition of this process at each frame yields a family of curves that represents the particle trajectory set. Since human motion creates unsteady flow, each point can be represented by a set of pathlines, streaklines, and streamlines.
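As an illustration of the advection step, the following minimal Python sketch moves a set of particles through a sequence of flow maps. It uses a forward Euler update with nearest-pixel lookup. The function names, the nested-list flow layout, and the clamping at the frame border are assumptions made for this example and are not taken from the described system.

```python
def advect_particles(particles, flow_u, flow_v):
    """Move each particle by the flow sampled at its nearest pixel.

    particles: list of (x, y) positions.
    flow_u, flow_v: 2-D lists indexed as [row][col], i.e. [y][x].
    """
    height = len(flow_u)
    width = len(flow_u[0])
    moved = []
    for x, y in particles:
        # Nearest-pixel lookup, clamped to the frame (an assumption of this sketch).
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        moved.append((x + flow_u[row][col], y + flow_v[row][col]))
    return moved


def trace_pathlines(seeds, flows):
    """Advect the seeds through a sequence of (flow_u, flow_v) maps.

    Returns one trajectory (list of positions) per seed.
    """
    trajectories = [[seed] for seed in seeds]
    current = list(seeds)
    for flow_u, flow_v in flows:
        current = advect_particles(current, flow_u, flow_v)
        for trajectory, position in zip(trajectories, current):
            trajectory.append(position)
    return trajectories
```

Repeating `advect_particles` once per time step yields the pathline of each seed; a sub-pixel variant would replace the nearest-pixel lookup with an interpolated one.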
Streaklines: Computation and Derived Information
A streakline is the locus, at a given time, of all the particles that originated from the same initial point in the past. Streaklines should not get too long, owing to shape inconsistency with the flow and instability of the numerical integration. The authors of [?] showed that streaklines are the most informative flow representation when compared with optical flow and particle flow. Streak flow can be obtained by time integration of the velocity field. This representation fills the gaps of optical flow and captures immediate dynamic flow changes faster than the traditional particle flow representation [?].
Streamlines: Computation and Derived Information
Streamlines can be obtained by bidirectional numerical integration of the vector field using an autonomous ODE (Ordinary Differential Equation) system. They can be described as curves tangent to the vector field at every point in the flow [?]. The integration starts from a seed point and ends when it reaches the neighborhood of another streamline or a critical point, hits the domain boundary, or forms a closed path. Several streamline placement algorithms have been proposed in the literature, including flow-topology-based methods [?], the evenly-spaced streamline placement method [?], and a hybrid flow-topology and evenly-spaced streamline algorithm [?]. All of them share three common stages:
Traditional approaches for motion analysis that consist of detecting moving objects or features, tracking them, and analysing their tracks, miss the estimation of long-term motion representations that bring important cues for scene understanding [?]. Our approach is able to extract long-range motion trajectories that encode spatial and temporal changes in the scene, as well as local motion statistics around each trajectory point in the form of discrete distributions. It follows a Lagrangian perspective to integrate motion through temporal domain under an Eulerian view, similar to the extended particle technique defined in [?]. Figure 1 shows the overall system workflow.
The system temporally subdivides the video into a series of non-overlapping mini-batches, and spatially subdivides each frame into a static grid. In this way, spatiotemporal cells are created (see Figure ?). On each frame, the flow vectors are formed by combining a sparse sampling strategy with the flow map obtained from an optical flow algorithm. The locations of the sampled key points are used as starting points, and median filtering is applied to the current flow map to obtain approximated and smoothed ending points for each flow vector. All flow vectors are collected by the corresponding enclosing cell for further quantisation and clustering, based on orientation and spatial information, respectively. This step is performed on each cell at the end of each mini-batch. This dual operation aims to reduce the number of flow vectors and leads to a fine-to-coarse representation, which defines three levels of granularity:
Flow information of local neighborhood around each cell is acquired and continuously updated. This information permits a richer distribution to accurately infer entropy and energy measures, which help to obtain vector field’s characteristics such as topology.
From the advection scheme we can extract streaklines initiated at each particle position. This process should not run for too long, to avoid propagating flow changes that are inconsistent with the actual flow; such changes can also be caused by noisy flow vectors introduced by the optical flow algorithm. Human motion does not exhibit properties as well defined as those of a fluid, such as vorticity. Therefore, to bring human-motion scenarios closer to fluid motion and to reduce propagation artifacts, the advection is executed between mini-batches and not between individual frames. This is one of the novel features that distinguish our system from the approach presented in [?].
Since pedestrian flow is unsteady, streaklines and streamlines differ in direction and shape. At the end of a set of consecutive mini-batches, the so-called memory cell size, an averaged streak flow map is combined with one of the fine-to-coarse flow representations to obtain a dense vector field, which is used as input for the streamline diffusion process. However, such discretisation may be insufficient to avoid broken streamlines. Instead of investigating and improving streamline diffusion techniques, we globally optimise the local short-range streamlines to obtain global long-range trajectories, which accurately correlate features related to flow decomposition and to the motion model, and can handle local ambiguities through spatiotemporal regularisation.
Owing to its novel formulation, our system is able to extract meaningful and correct long-range trajectories that represent either object tracks or motion patterns for different scenarios such as crowds, with structured and unstructured movement, and multi-tracking, with sparse and dense groups. The correct initialisation and parametrisation of its steps for these scenarios are studied and evaluated in this work.
The input of the system is a monocular video sequence of frames, where its volume is (pixels) (frames). The volume is subdivided into spatiotemporal cells of size , without overlap, where is the width of the cell in pixels, is the height of the cell in pixels, and the number of frames in the cell, which is equal to the temporal duration of each mini-batch (Figure ?). The video is composed by a set of mini-batches, . The output primitive is a trajectory , where is the trajectory descriptor, and are the trajectory spatial coordinates at mini-batch . The temporal coordinate, , is integral (correspond to mini-batches) and the spatial coordinates, , are in sub-pixel accuracy. The set of detected trajectories is denoted by . The Algorithm ? summarizes the relevant steps of the system.
This stage begins with the sampling strategy and the motion estimation, which is represented by flow maps of instantaneous velocities. Both sources of information are combined to create the flow vectors. These then undergo a filtering step to remove noise and outliers, and are distributed among their enclosing cells, where they are locally quantised and grouped. These steps form the instantiation block of the system and are executed at each frame, except for the quantisation and clustering operation. Its output is:
All the operations taken in this stage are executed within each mini-batch.
Sampling and Motion Estimation
The sampling extracts a set of key points from one of two possible distributions: dense or sparse. Dense sampling is computationally demanding, and this cost propagates through the subsequent steps. It also introduces noisy points that add no discriminative value; therefore sparse sampling is preferred. The motion flow is estimated using optical flow algorithms, which differ in whether they target frame-to-frame analysis or larger spatiotemporal displacements. We tested two algorithms: the classical short-range Farnebäck method [?], and large displacement optical flow (LDOF), which integrates descriptor matching into a variational model [?]. Both give good results, but we adopted Farnebäck’s method for its trade-off between computational effort and robustness. See [?] for a more detailed description of these results.
Each flow vector is represented by , where is the sampling point, and are the motion field components in and directions, respectively. The filtering step builds each flow vector by taking the key point location as the vector’s initial position, , and the flow vector’s endpoint, , as the component-wise median of the flow field, , in a neighborhood of size . Therefore, the number of flow vectors is equal to the number of key points. The adopted method is a median filtering kernel, which performs better than bilinear interpolation [?].
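The construction of a flow vector from a key point and the component-wise median of the surrounding flow can be sketched as below. This is a minimal illustration rather than the system's implementation: the function name, the integer key point coordinates, the nested-list flow maps, and the default kernel size of 3 are all assumptions of the example.

```python
from statistics import median


def median_flow_vector(keypoint, flow_u, flow_v, kernel=3):
    """Return (start, end) of the flow vector anchored at an integer key point.

    The endpoint is the key point displaced by the component-wise median of
    the flow inside a kernel x kernel window, truncated at the frame border.
    flow_u, flow_v: 2-D lists indexed as [row][col], i.e. [y][x].
    """
    x, y = keypoint
    half = kernel // 2
    height, width = len(flow_u), len(flow_u[0])
    us, vs = [], []
    for row in range(max(0, y - half), min(height, y + half + 1)):
        for col in range(max(0, x - half), min(width, x + half + 1)):
            us.append(flow_u[row][col])
            vs.append(flow_v[row][col])
    return (x, y), (x + median(us), y + median(vs))
```

Because the median is taken per component, a single spurious flow value inside the window does not move the endpoint, which is the smoothing effect the filtering step relies on.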
A dual threshold on flow magnitude is applied to remove flow vectors that carry little motion information, as well as those with extremely high magnitudes. However, we verified empirically that this operation is not enough to remove the flow vectors resulting from background noise. For this reason, a novel outlier removal technique is proposed. It consists of a rough approximation that fits the magnitude of the flow vectors to a unimodal Gaussian distribution and estimates both lower and upper bounds, inspired by Chebyshev’s theorem and by the skewness measure.
Since the magnitude of the flow vectors depends on the instantaneous velocities and on the kernel size, we expect most flow vector magnitudes to be smaller than the mean magnitude. The distribution will therefore not be symmetric and will normally be skewed to the right, as illustrated by Figure ?.
Chebyshev’s theorem can be applied to any data set regardless of its distribution. In this case, owing to the sharply skewed distribution, Chebyshev’s inequality is not good enough to estimate the non-symmetric lower and upper bounds, . Our technique estimates two parameters to obtain those bounds: a shift factor, , and a scale factor, . The intuition is that the nonparametric skew measure gives a shift factor that accounts for the difference between the median and the mean, according to the distribution’s tendency (positive or negative skew), while the ratio between the number of observations represented by the median and by the mode provides a factor to scale the shift amount (the greater the ratio, the smaller the scale factor and, consequently, the larger the acceptance gap between the threshold limits).
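The nonparametric skew measure is given by $S = (\mu - \nu)/\sigma$, where $\mu$, $\nu$, and $\sigma$ are the mean, median, and standard deviation of the distribution, which assumes values in $[-1, 1]$. Considering these boundaries, the shift factor is decomposed in the following limits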
The scale factor, , results from the relation between the number of observations represented by the median, , and the mode, , of the log-transformed flow vector’s magnitude distribution. We employ the log-transformation to bring the data closer to a symmetric distribution before measuring the amplitude relation. We adopt the Freedman-Diaconis rule to obtain the optimal bin width, $h = 2\,\mathrm{IQR}\,n^{-1/3}$, where $\mathrm{IQR}$ is the interquartile range and $n$ is the number of observations in the distribution. The scale factors are given by , which produces the final non-symmetric bounds, , which are used to estimate the final threshold limits for outlier removal, , where and are the minimum and maximum values of the flow vector’s magnitude distribution, respectively.
In experiments on different datasets, this technique gave consistent results and removed global outliers from the flow vector data better than the state-of-the-art methods we compared against (see Section 6.2).
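The two ingredients of the outlier removal technique that have standard definitions, the nonparametric skew and the Freedman-Diaconis bin width, can be computed as in the sketch below. It also counts the observations in the median's bin and in the modal bin, whose ratio drives the scale factor. The function names are hypothetical, and the sketch deliberately stops before the shift factor, scale factor, and final bounds, whose exact expressions are not reproduced here.

```python
import math
from statistics import mean, median, pstdev, quantiles


def nonparametric_skew(values):
    """(mean - median) / standard deviation; always lies in [-1, 1]."""
    sigma = pstdev(values)
    if sigma == 0.0:
        return 0.0
    return (mean(values) - median(values)) / sigma


def freedman_diaconis_width(values):
    """Bin width h = 2 * IQR * n^(-1/3)."""
    q1, _, q3 = quantiles(values, n=4)
    return 2.0 * (q3 - q1) / len(values) ** (1.0 / 3.0)


def median_and_mode_counts(values):
    """Histogram the values with Freedman-Diaconis bins.

    Returns (count in the bin holding the median, count in the modal bin).
    """
    width = freedman_diaconis_width(values)
    low, high = min(values), max(values)
    if width <= 0.0 or high == low:
        return len(values), len(values)  # degenerate case: a single bin
    n_bins = max(1, math.ceil((high - low) / width))

    def bin_of(value):
        return min(int((value - low) / width), n_bins - 1)

    counts = [0] * n_bins
    for value in values:
        counts[bin_of(value)] += 1
    return counts[bin_of(median(values))], max(counts)
```

In the setting described above, `values` would be the log-transformed flow magnitudes, for example `[math.log(m) for m in magnitudes if m > 0]`.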
The video volume has a regular spatiotemporal distribution. Spatially, each frame is divided by a grid, whose resolution depends on the frame size and is set at the beginning. Temporally, the video duration is evenly divided. Each spatiotemporal region is called a cell, , and contains the flow vectors whose initial positions lie inside it. Each flow vector is encoded by , where is the sampling point, is the flow magnitude, is the flow angle relative to the positive x-axis, and is the frame. This step, like the previous ones, is executed at every frame.
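A minimal sketch of this encoding step is given below: it converts a flow vector to its polar form and finds the spatiotemporal cell that encloses its starting point. The function name, the tuple layout, and the parameters `cell_w`, `cell_h`, and `batch_len` are assumptions made for the example.

```python
import math


def encode_flow_vector(start, end, frame_index, cell_w, cell_h, batch_len):
    """Return (cell, encoded vector) for one flow vector.

    cell is (column, row, mini-batch index) of the enclosing cell.
    The encoded vector is (x, y, magnitude, angle in degrees, frame index),
    with the angle measured from the positive x-axis in [0, 360).
    """
    x0, y0 = start
    dx, dy = end[0] - x0, end[1] - y0
    magnitude = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    cell = (int(x0 // cell_w), int(y0 // cell_h), frame_index // batch_len)
    return cell, (x0, y0, magnitude, angle, frame_index)
```

Note that in image coordinates the y-axis points downwards, so angles computed this way increase clockwise on screen.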
Quantisation and Clustering
To obtain a fine-to-coarse flow vector representation that allows different levels of patterns to be modelled, a two-step quantisation and clustering approach is applied to each cell at the end of each mini-batch. This operation considers all the flow vectors collected over the duration of the mini-batch; therefore the number of key points is much greater than the number of cells. The aim is to reduce the number of flow vectors while maintaining the geometric structure of the flow field, and to obtain different representations of local dominant motion flows on a fine-to-coarse scale. Figure ? illustrates this process.
The first quantisation step uses the flow vector angle and builds a histogram over the full 360° range with 8 bins to represent orientation groups. Only the major groups, with weight above the histogram’s median value, are passed to the subsequent clustering step. This eliminates noisy flow vectors whose orientation falls outside the expected local distribution. The second step uses the flow vector position and applies spatial clustering to each valid orientation group. A k-means approach with center initialisation [?] is adopted. To select a robust value, we compute k-means with increasing until the compactness measure, , satisfies the condition
with , where is the input sample , is the index of the cluster center to which sample belongs, and is the compactness ratio threshold, which is normally chosen to be very low (); we set .
Several clusters per orientation group are obtained, the so-called dominant groups, which are weighted by the number of flow vectors that belong to them and sorted in descending order. These groups represent the local dominant flows, which are described by , where is the average position, is the total number of flow vectors, and is the average orientation angle. A fine-to-coarse representation, with three levels of flow vector granularity, is obtained:
These levels help to meet computational requirements and to investigate the impact of their length and time scales on the dynamics of the motion advection system. This three-level global flow vector representation is obtained per mini-batch. However, it can be accumulated during the last mini-batches, the so-called memory cell, in order to create dense flow field representations for different discriminative levels. In this paper, we explore these representations for streamline diffusion.
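The first quantisation step, an 8-bin orientation histogram in which only the bins whose weight exceeds the histogram median are retained, can be sketched as follows. The function name and the return format are assumptions of this example; the subsequent k-means stage and its compactness criterion are not reproduced.

```python
from statistics import median


def dominant_orientation_groups(angles_deg, n_bins=8):
    """Group flow vectors by orientation and keep the major groups.

    angles_deg: one angle in degrees per flow vector.
    Returns {bin index: [indices of the vectors in that bin]} for the bins
    whose count is strictly above the median count over all bins.
    """
    bin_width = 360.0 / n_bins
    groups = {b: [] for b in range(n_bins)}
    for index, angle in enumerate(angles_deg):
        bin_index = int((angle % 360.0) // bin_width) % n_bins
        groups[bin_index].append(index)
    threshold = median([len(members) for members in groups.values()])
    return {b: members for b, members in groups.items()
            if len(members) > threshold}
```

Each retained group would then be clustered spatially, so that one orientation bin can yield several dominant flows inside the same cell.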
At this stage, each cell is able to estimate local region parameters using information from its neighborhood. One of them is the entropy, whose computation might not be reliable if only a small number of samples is available. To overcome this problem, for each cell we collect as samples its own flow vectors and the flow groups of the cells that belong to its neighborhood of size , and we replicate its flow vectors at boundary cells. The entropy calculation relies on the probability of each angular bin $i$, $p_i = n_i / \sum_j n_j$, where $n_i$ corresponds to the number of vectors in bin $i$. Entropy is then measured and used in this work as an indicator of:
5.2 Flow Model Advection
This stage captures long-range temporal dependencies to represent spatial and temporal features of the flow. It advects the motion using an average flow map over the entire mini-batch and a dense grid of particles. It uses the global fine-to-coarse flow vector representation in two ways:
This stage is executed at each mini-batch and its output is a set of motion representations such as streaklines, potentials, streak flow, and interpolated flow vector maps, among others (see Figure ?).
Motion Advection and Spatiotemporal Interpolation
A dense grid of particles is considered. Each particle has fluid properties, and the initial particle positions correspond to the pixels of the image. This follows the assumption that computing the streakline vector field requires dense pathline integration [?]. All particles are integrated over time according to an average of the optical flow maps over the current mini-batch, which is restarted at the beginning of the next mini-batch. At each time step, a new particle is created at position , and all other particles previously initialised at the same position follow the flow field. This process is expressed by equation and is repeated over the memory cell size to obtain the streaklines. The streak flow is extracted by temporal integration of the velocity field. We use the Runge-Kutta-Fehlberg method (RKF45) for this purpose.
Considering that each streakline is a collection of particles, we get a set of 3D data points for and flow directions. To compute the streak flow, , with sub-pixel accuracy, we adopt a multi-resolution method based on B-spline refinement to approximate scattered data by error minimisation in both dimensions. The 3D data points are given as input, the tensor product B-spline surfaces are produced, and the result is a least-squares B-spline approximation to the scattered data for each flow direction, which represents the streak flow in that direction.
Due to the ending conditions of the streamline diffusion process, described in Section ?, and since we are interested in extracting long-range streamlines to represent global motion trajectories, we include a post-processing step that links short streamlines. This stage uses the flow information collected over a sequence of mini-batches, of memory cell size (see Figure ?).
Streamline diffusion requires as input a vector field obtained from a dense flow field. The more of the temporal and local changes over time this field represents, the longer the streamlines, and the better they emphasise the global temporal coherency of the field and describe the topology of the flow. To this end, we use a combination of:
The resulting flow field is formed by B-spline interpolation, as explained in Section ?, taking as input the set of flow vectors from the fine-to-coarse representation superimposed on the averaged streak flow. This flow field is converted into a vector field using a grid-based discretisation and the filtering step explained in Section ?. We adopted the state-of-the-art farthest point seeding method [?] as the streamline diffusion technique.
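The streakline construction described above (inject a new particle at the seed at every step, and let all earlier particles from that seed follow the flow) can be sketched as below. For brevity the example uses a forward Euler step instead of RKF45, and it represents each time step's flow as a callable; both choices, together with the function names, are assumptions of this sketch.

```python
def advance_streakline(streakline, seed, flow):
    """Advance one streakline by one time step.

    streakline: positions of the particles released so far, oldest first.
    flow: callable (x, y) -> (u, v) giving the flow for this time step.
    Every existing particle follows the flow, then a new one is released
    at the seed.
    """
    moved = []
    for x, y in streakline:
        u, v = flow(x, y)
        moved.append((x + u, y + v))
    moved.append(seed)
    return moved


def compute_streakline(seed, flows):
    """Build the streakline of a seed over a sequence of per-step flows."""
    streakline = [seed]
    for flow in flows:
        streakline = advance_streakline(streakline, seed, flow)
    return streakline
```

For an unsteady flow that first moves right and then down, the oldest particle ends at the diagonal position while the streakline itself connects it back to the seed through the younger particles, which is why streaklines and pathlines differ in such flows.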
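To make the integration of a single streamline concrete, the sketch below traces one streamline in both directions from a seed point and stops at a critical point, at the domain boundary, or after a maximum number of steps. It is not the farthest point seeding method: seed placement, the proximity test against other streamlines, and closed-path detection are omitted, and the unit-speed Euler step, the callable field, and all names are assumptions of this example.

```python
import math


def trace_streamline(seed, field, bounds, step=0.5, max_steps=200):
    """Trace a streamline through a steady vector field.

    field: callable (x, y) -> (u, v).
    bounds: (width, height) of the domain; points outside it end the trace.
    Returns the ordered list of points, from the backward end to the
    forward end, including the seed.
    """
    def integrate(direction):
        points = []
        x, y = seed
        for _ in range(max_steps):
            u, v = field(x, y)
            speed = math.hypot(u, v)
            if speed < 1e-9:  # critical point: the field vanishes
                break
            # Unit-speed Euler step along (or against) the field direction.
            x += direction * step * u / speed
            y += direction * step * v / speed
            if not (0 <= x <= bounds[0] and 0 <= y <= bounds[1]):
                break  # left the domain
            points.append((x, y))
        return points

    backward = integrate(-1.0)
    forward = integrate(1.0)
    return backward[::-1] + [seed] + forward
```

Streamlines traced this way end wherever the field vanishes or the domain ends, which is one reason short, broken streamlines appear and a later linking step is useful.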
The streamline linking process yields long-range streamlines. It is formulated as a combinatorial matching problem that considers compatibility in terms of flow appearance, motion, and spatiotemporal regularisation among all short streamlines. We adopted a discrete Markov Random Field (MRF) to encode association constraints between query and candidate streamlines, re-correlate the set of short streamlines, and extract an optimal linkage between them. This undirected model was inspired by [?]. Our formulation in terms of the probability of linkage, , between streamlines is defined by
where are the unary potentials that model the compatibility between a query streamline, , and a candidate streamline, ; are the pairwise potentials for link regularisation, in case of tracking ambiguities, between a pair of query streamlines, and , considering the candidate streamlines that lie in their spatiotemporal neighborhood, . The global optimisation problem, given by Equation 4, is solved using tree-reweighted belief propagation. In this context, the streamlines used in the MRF process are called tracks.
The compatibility term is divided into three components:
Appearance and motion similarity terms consider a symmetrically weighted average comparison of features along the last elements of the query track, , and the first elements of the candidate track, . The weight is an exponentially decaying factor, , that works as a confidence parameter.
Instead of computing a single average (of motion or appearance) for each track and then taking the difference in the similarity term, as in [?], we adopt point-to-point operations. We formulate the appearance term between tracks and based on the cosine similarity of the streak flow’s angle at each track’s position, given by
where is the track’s, , cosine of the streak flow angle at time , is an outlier weight that measures how well fits the appearance characteristics of the entire track, , which is modelled by a Gaussian distribution of the track’s streak flow angles. The same is defined for track . The normalisation factor is expressed by
and the appearance similarity is defined by
The motion term considers the velocity variation. Similarly, the velocity difference is taken by a point-to-point track relation stated by
where is the track’s, , velocity at time . The same is defined for track . The motion similarity is expressed by
The prior on motion model that predicts track’s movement on discontinuities considers linear kinematic equations to estimate the closest point of the query track, , to the initial point of the candidate track, . The motion integration is done until the distance travelled equals the length between the last point of and the first point of , and is governed by
where is the flow vector velocity at position . The next velocity, , is randomly chosen from a Gaussian distribution of the velocities of the track . After this, a weighted distance, that includes spatial and angular values between the last predicted point of the query track and the initial point of the candidate track, is used to obtain the motion discontinuity similarity term
where is the last segment of the query track, is the first segment of the candidate track, is the angle between both segments, and is a weighted factor (in this case ). For each set of mini-batches, the system extracts a set of streamlines, which are connected with the streamlines obtained from the subsequent set of mini-batches.
Every node in the graph, i.e. every streamline, has an additional state with a predefined cost to represent the terminal state and to avoid a forced linking. The graph just defines unary potentials among streamlines that present a compatibility term, , below a predetermined threshold. Candidate track’s formation is evaluated under geometrical constraints:
Only the ones that satisfy pre-defined thresholds are included in the graph. In the same way, the compatibility term between query tracks, , follows a geometrical pruning with the same constraints, excluding the constraint. In terms of temporal neighboring, only the tracks which belong to the same set of mini-batches are considered in the link regularisation step. This pruning process effectively reduces the computational effort without affecting the final results.
We took further advantage of the system’s characteristics and used the cell’s entropy to detect areas with high entropy values. Query tracks whose end points fall in these regions, and candidate tracks whose starting points fall in these regions, are excluded from the linking process. This considerably improves efficiency.
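The entropy-based pruning can be illustrated with the sketch below, which computes the Shannon entropy of a cell's orientation histogram and flags the cells whose entropy exceeds a threshold. The use of 8 bins matches the orientation histogram described earlier; the base-2 logarithm, the threshold value, and the function names are assumptions of this example.

```python
import math


def orientation_entropy(angles_deg, n_bins=8):
    """Shannon entropy (in bits) of the orientation histogram of a cell."""
    counts = [0] * n_bins
    bin_width = 360.0 / n_bins
    for angle in angles_deg:
        counts[int((angle % 360.0) // bin_width) % n_bins] += 1
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total  # probability of this angular bin
            entropy -= p * math.log2(p)
    return entropy


def high_entropy_cells(cell_angles, threshold):
    """Return the cells whose orientation entropy exceeds the threshold.

    cell_angles: {cell id: list of flow angles in degrees}.
    """
    return {cell for cell, angles in cell_angles.items()
            if orientation_entropy(angles) > threshold}
```

A cell in which all flow vectors share one direction has zero entropy, while a cell with vectors spread evenly over the 8 bins reaches the maximum of 3 bits; track end points falling in the flagged cells would be skipped during linking.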
We present results on several datasets according to the following criteria:
|Name|Category|Frame Size|Fps|N frames|
|---|---|---|---|---|
|UCF 913-36l|Crowd Structured (C.S.)|480x360|25|467|
|UMN seq3|Crowd Unstructured (C.U.)|320x240|30|658|
|PETS2013 S2 L1 Time12-34 View001|Dense Multi-Tracking (D.MT.)|768x576|–|794|
|PETS2013 S2 L3 Time14-41 View001|Sparse Multi-Tracking (S.MT.)|768x576|–|240|
In our previous work [?], we tested the instantiation stage with satisfactory results. A baseline was reached, namely we adopted the FAST sampling, the median filtering kernel size of , the LDOF’s optical flow algorithm [?], and a spatial cell size of . In this work, considering empirical research, we fixed the neighborhood size to .
We identified four system parameters:
After some experiments, we verified that the most relevant are the last two. For further analysis, we vary the minibatch size while fixing the memory cell size, and vice versa, according to Table ?.
The outlier removal technique plays a predominant role in system accuracy. Figure ? provides a qualitative confirmation. A quantitative comparison of the proposed technique with several outlier removal methods is provided in Table 2, reported under three metrics: the true positive (TP) rate, the true negative (TN) rate, and the mean of both, here called the true balance (TB) rate. Results were computed on the C.S. and C.U. scenarios. For the former, nearly 20 frames were masked, since it is a large video sequence with many persons per frame, while for the latter all frames were masked. Pedestrians were manually annotated with bounding boxes, and the masks are the ellipses inscribed in each box. The flow vectors that lie outside the masks are considered background, i.e. outliers, and are treated as negative samples.
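The three rates can be made concrete with a short sketch. It assumes that each flow vector carries two boolean flags, one stating whether it lies inside a pedestrian mask (positive sample) and one stating whether it survived outlier removal; the function and argument names are hypothetical.

```python
def balance_rates(inside_mask, kept):
    """Return (TP rate, TN rate, TB rate) for one frame.

    inside_mask[i] is True when flow vector i lies inside a pedestrian mask.
    kept[i] is True when flow vector i survived outlier removal.
    """
    positives = sum(inside_mask)
    negatives = len(inside_mask) - positives
    # True positives: foreground vectors that were kept.
    tp = sum(1 for m, k in zip(inside_mask, kept) if m and k)
    # True negatives: background vectors that were removed.
    tn = sum(1 for m, k in zip(inside_mask, kept) if not m and not k)
    tp_rate = tp / positives if positives else 0.0
    tn_rate = tn / negatives if negatives else 0.0
    return tp_rate, tn_rate, (tp_rate + tn_rate) / 2
```

Under this definition, a method that keeps almost every vector scores a high TP rate but a TN rate near zero, which is why the TB rate is the more informative summary.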
|TP||97.9 (1.3)||18.9 (5.5)||27.4 (22.6)||98.7 (0.9)||58.9 (10.7)||68.0 (4.6)||92.2 (15.9)||79.7 (6.5)|
|TN||1.7 (1.2)||19.2 (8.1)||16.5 (6.7)||1.2 (0.8)||12.7 (5.5)||21.0 (4.6)||3.6 (4.4)||82.7 (6.0)|
|TP||96.6 (5.7)||13.8 (13.6)||29.3 (33.7)||97.9 (3.6)||58.6 (21.7)||77.3 (12.1)||76.8 (31.6)||79.9 (14.2)|
|TN||2.3 (1.8)||24.8 (21.9)||17.8 (10.7)||1.7 (1.3)||21.3 (21.1)||16.5 (7.2)||7.8 (6.2)||89.7 (6.4)|
Comparison results with and without the proposed outlier removal technique. By row: from left to right, with outliers and after outlier removal. By column: from top to bottom, C.S., C.U., D.MT., and S.MT.
We tested both simple and more complex techniques (for most of them, we used the available source code). For instance, the most naïve approach, the Std method, considers as outliers the samples whose distance from the mean exceeds , while more robust approaches follow statistical models that deal with skewed data. All techniques are applied to the log-normalised magnitude distribution of the flow vectors. Results show that our technique keeps a high TP rate and, more importantly, achieves a higher TN rate, which demonstrates its effectiveness in removing background flow vectors. None of the other techniques achieves a satisfactory TN rate, which supports our decision to propose a new outlier removal technique for this problem. The combined TB rate clearly shows the overall superiority of our technique. We conclude that combining Chebyshev’s theorem with the skewness metric brings a stabilising factor to the initial rough approximation of a unimodal normal distribution on which our outlier removal technique is formulated.
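For reference, the naïve Std baseline can be sketched as below. This is only the baseline, not the proposed technique. The multiplier `k` is a hypothetical parameter, and taking the natural logarithm of the magnitude is an assumption about what log-normalisation means here. As background, Chebyshev's inequality guarantees that, for any distribution, at most 1/k² of the samples lie more than k standard deviations from the mean.

```python
import math
import statistics


def std_outlier_mask(flows, k):
    """Flag flow vectors whose log-magnitude is more than k standard deviations from the mean.

    `flows` is a list of (u, v) vectors; `k` is a hypothetical multiplier.
    Returns a list of booleans, True meaning "outlier".
    """
    # A small epsilon keeps the logarithm defined for zero-length vectors.
    logs = [math.log(math.hypot(u, v) + 1e-12) for u, v in flows]
    mean = statistics.fmean(logs)
    std = statistics.pstdev(logs)
    return [abs(x - mean) > k * std for x in logs]
```

Because the rule is symmetric around the mean, it behaves poorly on skewed magnitude distributions, which is consistent with the low TN rates reported for the simpler methods.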
This step is also crucial for the subsequent motion advection step. Figure ? confirms the strong perturbation that undesired flow vectors introduce in streamline formation. The great advantage of the proposed outlier removal technique is that it is data-driven; it therefore adjusts automatically to different scene contexts, as well as to large motion variations within the same scenario. We verified that the percentage of flow vectors removed is normally between , but it can vary around , depending on the scene category and the video characteristics.
6.3 Global Motion Trajectories
Our aim in this paper is to show that our system is able to extract long-range global motion trajectories that approximate the ones obtained from multi-tracking approaches, with the added advantage of being useful and effective in different scenarios.
The final trajectories are obtained by the linking process. The parameters of the selected farthest point seeding method were fixed to the default values, and , both in sub-pixel accuracy. The former controls the spacing distance of the density field, and the latter expresses the saturation ratio that triggers the seeding of a new streamline. A pruning step is necessary to remove redundant and short streamlines. This process removes trajectories whose number of points is below the mean value over the whole set of trajectories extracted across all sets of mini-batches of the entire sequence. We use the number of points in this step because, after the diffusion process, the points are very close to each other (nearly one pixel apart), so the point count is approximately equal to the trajectory length. Table ? shows the significant reduction in the number of trajectories achieved by the pruning and linking processes.
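The pruning rule can be sketched in a few lines. The sketch assumes that trajectories are plain lists of points gathered over the whole sequence; the function name is hypothetical, and keeping trajectories whose point count equals the mean is an interpretation of "below the mean".

```python
def prune_short(trajectories):
    """Keep only trajectories with at least the mean number of points."""
    if not trajectories:
        return []
    mean_points = sum(len(t) for t in trajectories) / len(trajectories)
    return [t for t in trajectories if len(t) >= mean_points]
```

Since the threshold is computed from the data, it adapts to each sequence without requiring a fixed minimum length.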
The geometrical constraints for the MRF graph were fixed to , , and . For the remaining parameters used in the similarity terms, we follow [?], with the exception of and , which are the number of points taken for the appearance and velocity similarity terms, respectively. They were fixed heuristically as a percentage of the trajectory with the minimum number of points (), and . The value for the velocity term is lower because a higher variation is expected, so fewer point-to-point measures are considered.
To the best of our knowledge, there is no evaluation framework that deals with the comparison between manual trajectories and automatic long-range motion trajectories of pedestrians. We extend our previous work [?] to propose an evaluation methodology that supports clustering of trajectories by similarity to obtain the most representative ones, and that measures the correspondence between extracted and annotated trajectories. We follow a similar approach to [?] and use a one-to-one distance function between each annotated trajectory and each auto-generated one to obtain a distance matrix, and solve the assignment problem with the Hungarian algorithm [?]. We report the quality of the matching process with the miss detection (FN) and false positive (FP) rates.
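To illustrate the assignment step, the following is a minimal, dependency-free sketch of the Hungarian algorithm in its textbook potentials formulation. It is not the implementation used in this work. It assumes that rows are annotated trajectories, columns are auto-generated trajectories, and that there are at least as many columns as rows; the function name is hypothetical.

```python
def hungarian(cost):
    """Minimum-cost assignment of every row to a distinct column.

    `cost` is a rectangular list of lists with len(cost) <= len(cost[0]).
    Returns a dict {row_index: column_index}.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j]: row currently matched to column j (0 = none)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so that one more column becomes reachable.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j]}
```

The returned assignment minimises the total distance over all pairs, so an individual pair is not guaranteed to be the closest one available for that annotated trajectory.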
The distance matrix passes through a regularisation process before applying the Hungarian algorithm. This step favours configurations with small residuals and down-weights large errors, that could otherwise dominate the matching process. We evaluate four alternatives to regularise the distance matrix:
An annotated trajectory is considered correctly matched if its distance to an auto-generated one is below a certain threshold. The threshold is also used to evaluate the relationship between the accumulated error, i.e. the sum of the distances between correctly matched trajectories, and the false positive rate. This threshold is taken to be the of the cluster with the lowest values after applying the same K-means process detailed in Section ?. Before computing the distance matrix, all trajectories are resampled using a cubic spline interpolation scheme, where the resampling value is global and equal to the length of the trajectory with the minimum number of points.
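The thresholding step can be sketched as follows. The sketch assumes one assigned distance per annotated trajectory and reports the fraction of assignments rejected by the threshold; how the FN and FP rates are normalised in the reported results is not stated in this section, so `rejected_rate` is a hypothetical stand-in rather than the exact measure used.

```python
def evaluate_matches(distances, threshold):
    """Return (accumulated error, rejected rate) for a list of assigned distances.

    distances[i] is the distance between annotated trajectory i and the
    auto-generated trajectory assigned to it.
    """
    correct = [d for d in distances if d < threshold]
    accumulated_error = sum(correct)
    rejected_rate = 1 - len(correct) / len(distances)
    return accumulated_error, rejected_rate
```

Sweeping `threshold` over a range of values yields the kind of error-versus-rate curve used in the analysis of each scenario.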
Next, we present some results for each dataset that will help us to answer the following questions:
We drew conclusions based on the relation between the false positive rate and the accumulated error, varying the threshold for the incorrectly classified matches. The following results are presented according to the regularisation step and the distance function. The chosen distance metrics between trajectories are Euclidean, Hausdorff, Dynamic Time Warping (DTW) and Longest Common Subsequence (LCS), and the regularisation that most often leads to the lowest error is selected. The distance matrix is normalised by its minimum and maximum, so the accumulated error can be compared across metrics. The distance functions are calculated considering four features per trajectory point, , which are the point coordinates and the normalised direction vector between subsequent trajectory segments. Further individual analyses will be based on the results reported in Figures ?, ?, ?, ?.
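As an example of one of these metrics, the sketch below builds the four features per point, computes a DTW distance over them, and min-max normalises a distance matrix. It is a minimal sketch under assumptions: the direction feature is taken to be the unit vector towards the next point (the last point reuses the previous direction), the local DTW cost is the Euclidean distance between feature vectors, and every trajectory has at least two points. All names are hypothetical.

```python
import math


def features(traj):
    """Return (x, y, dx, dy) per point, with (dx, dy) a unit direction vector."""
    feats = []
    for i, (x, y) in enumerate(traj):
        j = min(i, len(traj) - 2)  # the last point reuses the final segment
        dx = traj[j + 1][0] - traj[j][0]
        dy = traj[j + 1][1] - traj[j][1]
        norm = math.hypot(dx, dy) or 1.0  # avoid division by zero
        feats.append((x, y, dx / norm, dy / norm))
    return feats


def dtw(a, b):
    """Dynamic Time Warping distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def min_max(matrix):
    """Rescale all entries of a distance matrix to the interval [0, 1]."""
    flat = [value for row in matrix for value in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0
    return [[(value - low) / span for value in row] for row in matrix]
```

Normalising each metric's matrix to the same interval is what makes the accumulated errors of different metrics comparable.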
Crowd Structured (C.S.) Scenario
For this scenario a manual annotation of pedestrians was performed. Since the motion is structured, i.e. it is represented by common motion patterns, we selected and annotated, for each pattern, several persons that follow it. In the end, we obtained representative trajectories for each pattern, for a total of 25 trajectories.
Memory cell size and minibatch size behave similarly: lower values are associated with lower error and a lower FP rate, so for this type of scenario both should be kept low. We also notice that the system is less sensitive to memory size than to minibatch size, so memory size could be higher than minibatch size. The chosen regularisation was the local scaling RLS, since, in general, it presents lower error. The DTW measure presents a steeper monotonically decreasing behaviour, which shows a better compromise between the error and the FP rate. It also produces fewer false positives. The Euclidean distance also performs well, while the LCS measure presents the worst results, since its behaviour is almost linear (see Figure ?).
Figure ? presents some matching results, where manual trajectories are drawn in green, and auto-generated trajectories are drawn in red if considered false positives, or in white if assigned as correct matches. This colour convention is kept for subsequent figures. Figures ? and ? show the influence of the threshold when deciding whether a match is a false positive: the former is a false negative assignment, and the latter is a correct match. Figure ? presents a wrong match, probably due to the global minimisation performed by the Hungarian algorithm. However, Figure ? shows that our system is capable of generating a plausible assignment for this mismatch. We observe both valid assignments and false positive detections. It is important to highlight that only some trajectories were annotated for this scenario, which contains hundreds of trajectories that could be more similar to the automatic ones.
Crowd Unstructured (C.U.) Scenario
This is the most challenging scenario for our system. It is a low-resolution video of a crowded scene where pedestrians move randomly in various directions, causing constant occlusions. We manually annotated all the pedestrians, which led to a total of 15 trajectories.
In this case, memory cell size shows a less stable behaviour regarding the relation between the error and the FP rate. However, lower values generally present better results. Minibatch size has even more variability, and we verify that medium and large values perform better than lower values. The selected regularisation was the median RLS. In terms of distance function, we reached the same conclusion as in the previous scenario, noting only the improvement of the Hausdorff metric, which shows a steeper curve at low thresholds but maintains a large FP rate at higher thresholds (see Figure ?).
In general, Figure ? supports the same analysis as in the previous scenario. We visually verified that our system extracts trajectories that are very similar to the manual ones, although our evaluation framework was unable to include them in the final matching results. This leads us to conclude that, despite the larger matching differences, the system performs well in such a demanding scenario.
Dense Multi-Tracking (D.MT.) Scenario
This scenario is also very challenging. Normally, multi-tracking approaches are applied to it, and flow-based approaches lead to poor results. There are 19 manual trajectories to match.
An analysis shows that low values for memory cell and minibatch size decrease performance, while medium values perform better. The selected regularisation was also the median RLS. Regarding the distance functions, previous conclusions still hold. The Hausdorff metric reaches the best performance, presenting a lower and steeper error curve at low false positive rates (see Figure ?).
Figure ? shows the most difficult trajectories to match in this dataset. In particular, Figure ? presents a complex manual trajectory, which is very long and has several direction changes. It could be divided into two cross sections, one of which is automatically captured by the system; nevertheless, it could not be identified as a correct match.
Sparse Multi-Tracking (S.MT.) Scenario
This is another scenario where multi-tracking approaches are normally applied. This dataset has an additional difficulty related to illumination variations that could lead to erroneous flow calculation. There are 44 manual trajectories.
The dependence of performance on memory cell and minibatch sizes is the same as in the C.S. scene. The regularisation of the distance matrix was based on the clustering threshold alternative, and DTW was, again, the metric that presented the best results (see Figure ?).
This dataset presents the best results (see Figure ?), which is encouraging, since it supports the versatility of the system. It correctly captures short and long trajectories, as well as false positives. However, we notice that the matches presented in Figures ? and ? could be swapped to obtain a better matching.
As a general evaluation, we conclude that low values should be used for memory cell size, while minibatch size can vary from low to medium values depending on the scene category. Minibatch size therefore has a larger functional interval, while still avoiding the error propagation derived from the numerical calculations of the motion advection scheme. The preference for lower memory cell sizes could be explained by the intuition that they allow less linear trajectories to be captured, reducing the risk of missing trajectory segments with high curvature, since the linking process could favour direction continuity. In terms of regularisation, the results indicate that a non-linear technique improves the matching. In terms of the distance function, we verify that metrics that consider point-to-point geometrical information and shape benefit the matching process.
For qualitative validation, Figure ? presents the spatiotemporal comparison between the extracted trajectories and the manual ones. The density obtained clearly shows the robustness and efficiency of our system. It is important to mention that the assignment process always finds a correspondence between a manual trajectory and an auto-generated one, which corroborates the density evaluation.
This section reports the usefulness of our system for human activity tasks. In this case, we evaluate our motion segmentation performance against two state-of-the-art works [?]. We follow both the qualitative and quantitative analysis approach of [?] to present and compare the dynamic segmentations of different behaviour states through a video sequence. For the qualitative analysis, we used the same datasets as in [?], namely Argentina and Boston.
Qualitative comparison of segmentation results in Argentina and Boston datasets. By row: frame 115 (1st row), and frame 213 (2nd row) of Argentina; frame 40 (3rd row), frame 433 (4th row), and frame 2042 (5th row) of Boston. By column: from left to right, pathlines, streaklines, and our approach.
For the experiments, we use the following parameters for each approach:
In order for the comparison to be as fair as possible we use the same optical flow result per pair of frames for each approach, namely the Classic+NL from [?].
Frames 115 and 213 illustrate two behavioural phases in which the traffic lights change: a north/south flow of pedestrians emerges, and west/east and north/south bound vehicle flows develop. The remaining frames refer to the Boston dataset and also show different behaviours: when the traffic light changes, an east/west flow of pedestrians emerges simultaneously with a north/south bound vehicle flow. Figure ? demonstrates that streaklines are spatially and temporally more pronounced, and more accurate at capturing dynamic objects, than pathlines, which show fragmented segments of movement. However, our approach clearly shows longer motions and robustly segments different flows, even in cluttered conditions. For instance, our approach is able to detect the flow of pedestrians on each sidewalk and distinguish north-bound from south-bound flows (frame 115), and to detect and separate standalone motions (frames 40, 433 and 2042). None of the other approaches is able to do this, even considering the results reported in [?]. We clarify that our segmentation is computed per pixel, considering the cosine similarity in an 8-connected neighborhood, i.e. it follows the same process as the streak flow similarity of [?]. In our case, the flow is derived from the long-range trajectories: each trajectory is resampled and its segments are taken as the flow vectors.
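A per-pixel segmentation of this kind can be sketched as region growing over the flow field. This is a minimal sketch under stated assumptions, not the procedure of [?]: the flow is a dense grid of (u, v) vectors, neighbouring pixels are merged when the cosine of the angle between their vectors reaches a hypothetical `min_cos` threshold (assumed positive), and pixels without motion stay unlabelled.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine of the angle between two 2-D vectors (0.0 if either is null)."""
    na, nb = math.hypot(*a), math.hypot(*b)
    if na == 0 or nb == 0:
        return 0.0
    return (a[0] * b[0] + a[1] * b[1]) / (na * nb)


def segment(flow, min_cos):
    """Label 8-connected regions of similar flow direction.

    `flow` is a 2-D list of (u, v) tuples; label 0 marks pixels without motion.
    """
    height, width = len(flow), len(flow[0])
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for r in range(height):
        for c in range(width):
            if labels[r][c] or flow[r][c] == (0, 0):
                continue
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:  # breadth-first region growing
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if ((dy or dx) and 0 <= ny < height and 0 <= nx < width
                                and not labels[ny][nx]
                                and cosine(flow[y][x], flow[ny][nx]) >= min_cos):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels
```

Because similarity is tested between neighbours rather than against a region mean, direction may drift gradually within one segment; a stricter `min_cos` limits that drift at the cost of more fragments.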
For the quantitative comparison we also follow the same criterion stated in [?]. Our approach outperforms both state-of-the-art approaches by a large margin, more than a factor of two, in the number of correctly segmented objects, even considering the results reported in [?]. The number of non-segmented objects is lower than that reported by the remaining approaches, by a factor of roughly three, which corroborates the qualitative analysis. However, our approach also presents a higher number of incorrectly segmented objects, probably because of over-segmentation in cluttered regions (see Figure ?).
We present a complete and novel system based on global dense flow and local motion information that extracts meaningful long-range trajectories. Its main advantages are its efficiency in different scenarios and the correlation between the extracted trajectories and the movements of individual pedestrians. The system combines a macro and a micro analysis of motion, which to the best of our knowledge has not been explored before. This work presents important contributions at different stages, namely the system formulation, the motion extraction and propagation scheme, a novel technique for the removal of flow vector outliers, a fine-to-coarse flow representation, and the integration of local motion information into a global re-correlation algorithm at the tracklet level to form long-range trajectories. We also show that the system provides robust spatiotemporal motion information that can be used for human activity tasks such as segmentation, where it outperforms state-of-the-art algorithms.
An evaluation framework is proposed to deal with the matching problem between global and individual trajectories. We demonstrate very good results on different datasets. However, two aspects should be improved: the false positive rate, since false positives are sometimes wrongly detected; and the assignment process, since in some cases the system produces trajectories that are more similar to the annotated ones than those assigned by the matching process.
More sequences should be used to optimise the system’s parameters for each scene context. In that way, we believe that the system could achieve even better results. For future work, we will conduct a deeper analysis of the system components, explore the fine-to-coarse representation to model different levels of human-activity-related knowledge, and use the trajectories extracted from other scene contexts, such as duo interaction and individual tracking, to verify their impact on action classification. We believe that detection and classification of human activity could benefit from a system like ours.
The first author would like to thank FCT - Fundação para a Ciência e Tecnologia (Portuguese Foundation for Science and Technology) for the financial support for the PhD grant with reference SFRH/BD/51430/2011.
- We used the Least Squares Approximation of Scattered Data with B-splines Library (http://www.sintef.no/Projectweb/Geometry-Toolkits/LSMG/).
- The algorithm is implemented in the Computational Geometry Algorithms Library (CGAL), http://www.cgal.org/.
- We acknowledge and thank the efforts of the first author of [?] to try to overcome such discrepancy, although without success.