KPCA Spatiotemporal trajectory point cloud classifier for recognizing human actions in a CBVR system
Abstract
We describe a content-based video retrieval (CBVR) software system for identifying specific locations of a human action within a full-length film and retrieving similar video shots from a query. For this, we introduce the concept of a trajectory point cloud for classifying unique actions, encoded in a spatiotemporal covariant eigenspace, where each point is characterized by its spatial location, local Frenet-Serret vector basis, time-averaged curvature and torsion, and the mean osculating hyperplane. Since each action can be distinguished by its unique trajectory within this space, the trajectory point cloud is used to define an adaptive distance metric for classifying queries against stored actions. Depending upon the distance to other trajectories, the distance metric uses either large-scale structure of the trajectory point cloud, such as the mean distance between cloud centroids or the difference in hyperplane orientation, or small-scale structure, such as the time-averaged curvature and torsion, to classify individual points in a fuzzy-KNN. Our system can function in real time and has an accuracy greater than 93% for multiple action recognition within video repositories. We demonstrate the use of our CBVR system in two situations: locating specific frame positions of trained actions in two full-length feature films, and video shot retrieval from a database with a web search application.
keywords:
Human Motion Recognition, Content-Based Video Retrieval, Spatio-Temporal Templates, Kernel-PCA, Frenet-Serret Formulas, differential curvature, fuzzy-KNN

1 Introduction
Recognizing specific human activities from real-time or recorded videos is a challenging practical problem in computer vision research. If implemented as an efficient search engine, such algorithms would be valuable for managing and querying large multimedia database repositories, where keyword searches on human actions are practically meaningless. Thus, a content-based video retrieval (CBVR) paradigm, where queries are resolved by comparing the actual multimedia content (as distinguished from approaches that compare only semantic tags), can provide more powerful indexing/annotation and retrieval methods and produce richer query results. Within this paradigm, computer vision algorithms are used to automatically index videos along the entire timeline with semantics and feature vectors. Queries compare a similar reduction of the input video to all the feature vectors in the video repository. To be practically useful as a search engine, the CBVR algorithms must be fast, robust, and accurate.
We describe a CBVR system and algorithms for recognizing human actions in stored or real-time streaming videos by using a velocity-encoded spatiotemporal representation of the movement. In our method, each image frame of the original video shot is replaced by a simplified image, called an MVFI (motion vector flow instance) motion template, that extracts the direction and strength of the velocity flow field (Olivieri et al., 2012), found by frame-to-frame differencing. Each template image can be further projected as a point into a reduced-dimensional eigenspace through a principal component analysis (PCA) or kernel-PCA (KPCA) transformation, so that the frames of an entire video sequence, when projected into this space, trace out a unique curve we call the spatiotemporal trajectory. These trajectories provide an efficient technique for distinguishing actions, since similar actions have similar trajectories, while different actions can have radically different trajectories.
For comparing trajectories, we describe a novel classification scheme that uses local differential geometric properties of these curves. We refer to our algorithm as the trajectory point cloud classifier method. In this method, each n-dimensional point contains information about the local neighborhood of the trajectory, while the collection of such points defines a macroscopic object, or cloud. In this way, the algorithm works on two scales for determining the distance between different trajectories (or actions). For large trajectory separations, the distance is dominated by the difference of centroids between clouds. For partially overlapping trajectories, the distance is dominated by a mean hyperplane that defines the unique orientation of each trajectory, and when there is significant overlap, differences between the orientations of local patches of the trajectories dominate the distance metric. This description is valid since different types of actions lie on completely different osculating hyperplanes, while similar actions lie on the same plane, or one that is close by. At a small scale, individual points in the trajectory point cloud are endowed with the local geometric properties of the part of the curve in which they are embedded. This local information can be used to infer class membership. We show that our distance metric is more effective than traditional classifiers, such as KNN, that do not incorporate information about the connectedness of points. Moreover, with this technique we remediate one of the traditional drawbacks of spatiotemporal methods: that they are limited to describing global properties of the motion. By utilizing this local information, slight differences in the movement can be distinguished.
Figure 1 shows a block diagram of our CBVR recognition system, consisting of two separate but interconnected branches: the indexing path and the querying path. In the indexing step, a set of videos is processed using computer vision algorithms with the purpose of annotating human actions (e.g., walking, jogging, jumping, etc.) along the timeline of each video. In our implementation, indexing consists of obtaining KPCA spatiotemporal trajectories from an encoding of the velocity field at evenly distributed points along the timeline of the video, from a moving window of overlapping video segments. As will be described in this paper, because spatiotemporal points are connected, the trajectory point cloud allows us to obtain information about the mean hyperplane on which the trajectory lives. This metadata associated with a particular video shot is stored in the database. In the querying phase, an input video is processed with the same steps as the indexed videos, and comparisons are made between the metadata of the query and the metadata of all videos in the repository.
This paper is organized as follows. First, we briefly review previous work on human action recognition. Relevant mathematical details of constructing the linear-PCA and kernel-PCA covariance eigenspaces are provided, and comparisons are carried out with two human action databases, the KTH (Schuldt et al., 2004) and the MILE (Olivieri et al., 2012). Next, we describe our new trajectory point cloud classifier method, which is capable of resolving ambiguities that can arise when recognizing different types of action classes. Finally, we describe two examples of our CBVR system: locating specific positions along the timeline of a set of full-length feature films that contain particular human actions, and a search engine for retrieving similar videos from a video shot repository.
2 Background
Characterizing human motion without markers is a difficult computer vision problem that has generated a large amount of research. The many different approaches in this field cover a broad spectrum of techniques, ranging from full-body tracking in 3D with multiple cameras (Rius et al., 2009) to Bayesian inference models (Meeds et al., 2008). A recent review (Poppe, 2010) and a new textbook (Szeliski, 2010) provide a taxonomic overview of the most salient algorithms that have been developed to characterize human motion. Methods also vary greatly in computational requirements, so that a solution which is more precise may be practically unusable for real-time applications or for a search engine in a CBVR system (Hu et al., 2011; Hosseini and Eftekhari-Moghadam, 2013).
Several recent reviews of content-based video indexing and retrieval are available (Beecks et al., 2010; Bhatt and Kankanhalli, 2011; Hu et al., 2011). Specific CBVR systems for retrieving video shots from a query with human actions have been described by Jones and Shao (2013) and, using a full movie repository, by Laptev et al. (2008). A large-scale data-mining method that uses unsupervised clustering of human action videos was described by Liao et al. (2013). Another large-scale study, treating more than 100 human actions, has been reported by Nga and Yanai (2014); it automatically extracts video shots from semantic queries by using video examples that have been previously tagged. Many other specific studies exist in narrower domains, such as that by Küçük and Yazıcı (2011), who described a system for indexing and querying news videos. Nonetheless, all CBVR methods to date grapple with the spatiotemporal problem, which does not exist in content-based image retrieval. Another common theme in all studies is the necessity for CBVR systems to treat the large variation of actions in videos. For this reason, universally applicable CBVR systems are still in their infancy.
2.1 Computationally intensive methods
Amongst the most computationally demanding solutions are those that capture the full human body-part motion over time, such as the work described in (Rius et al., 2009), where 3D tracking was accomplished with particle filters, or (Samy Sadek and Michaelis, 2013), where affine-invariant features are derived from 3D spatiotemporal action shapes. Another example is the work by Ugolotti et al. (2013), where a particle swarm model is used for detecting people and performing 3D reconstruction. In another approach, full-body motion is inferred from a probabilistic graphical model that determines connected stick figures (Meeds et al., 2008). Similarly, Felzenszwalb et al. (2010) describe a multiscale deformable-parts model based upon segmenting human parts from each image frame. While many of these methods are able to capture fine details of body motion, they require excessive computation, rendering them unusable for real-time information retrieval. Also, the low-level information requires another layer of processing to distinguish actions. One example where this low-level parts-movement information is converted into higher-level information is provided in (Ikizler and Forsyth, 2008), who used Hidden Markov Models (HMMs) to infer composite human motions/actions.
2.2 Spatio-Temporal and Real-Time Methods
For real-time recognition of human actions, spatiotemporal methods can provide accurate performance. Such methods sacrifice fine details of the movement in order to provide a more computationally efficient solution. There are many spatiotemporal methods, and the term is used as an umbrella phrase for a wide class of implementations. Nonetheless, such methods share a common theme: they capture global spatiotemporal characteristics from optical flow. Several spatiotemporal approaches have been explored that are specific to the human motion recognition problem. Some authors compared surfaces traced out in time (Blank et al., 2005), while others sought representations based upon moments (Achard et al., 2008; Bobick and Davis, 2001).
The basic idea is that different actions can be distinguished by their unique spatiotemporal flow fields. By capturing the information from these flow fields, highly discriminatory feature vectors for each time in the video can be constructed. Comparing different actions is tantamount to comparing these feature vectors. Because such comparisons are efficient, these methods are attractive for multimedia annotation and querying (Ren et al., 2009). The feature vectors can be inserted directly as metadata at each point within a video shot file, and/or as information in the database, to be used in future queries. It is in this way that the video search is performed by content and not only with semantic keywords. Therefore, a video shot query consists of comparing its set of spatiotemporal vectors, each representing a segment of the video along the timeline, with the corresponding feature vector sets stored as metadata within all videos of the database.
An elegant way to capture the spatiotemporal characteristics of some motion in a video is to transform the original image sequence into a simplified set of images, called motion templates. These templates provide a quantized representation of the original image frames determined from a background/foreground segmentation technique, such as frame differencing. For example, given a video shot of some human motion, one set of motion templates may be the binarized (i.e., only black/white) human silhouette obtained from pairwise frame differencing during the action. Classic work using spatiotemporal templates for video shots was first described in (Bobick and Davis, 1996). In that work, the authors developed motion templates based upon frame difference information. In particular, they introduced the concept of the MHI (motion history instance) and the MEI (motion energy instance) and demonstrated their ability to distinguish people from their gait. Later, (Venkatesh Babu and Ramakrishnan, 2004) used similar motion templates to distinguish different types of human actions in video sequences.
For motion templates based upon frame differencing, robust algorithms that eliminate background noise are critical. Several frame-differencing algorithms have been proposed that perform interpolation and smoothing by using strong features, such as SIFT, and by reducing uncorrelated differences between images. The resulting difference vectors, when represented on a grid, are referred to as dense optical flow. One implementation, available in the popular library OpenCV, uses a polynomial expansion technique (Farnebäck, 2003) to optimize the input parameters for obtaining the best foreground optical flow in situations with complex scenes and potentially moving backgrounds. Another useful implementation in OpenCV is the multiscale pyramid Lucas-Kanade algorithm (Lucas and Kanade, 1981) for selecting the scale of the optical flow.
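To make the idea concrete, the following is a minimal NumPy sketch of dense flow on a regular grid via per-block Lucas-Kanade least squares. It is an illustration only: the function name and block-wise formulation are our own simplification, and our system instead relies on OpenCV's dense-flow routines such as the Farnebäck method mentioned above.

```python
import numpy as np

def dense_flow_lk(prev, curr, block=8):
    """Coarse dense optical flow on a regular grid: solve the Lucas-Kanade
    least-squares system [Ix Iy] [u v]^T = -It independently in each block.
    (A simplified stand-in for OpenCV's dense-flow implementations.)"""
    Iy, Ix = np.gradient(prev.astype(float))      # spatial gradients
    It = curr.astype(float) - prev.astype(float)  # temporal difference
    h, w = prev.shape
    gh, gw = h // block, w // block
    flow = np.zeros((gh, gw, 2))
    for i in range(gh):
        for j in range(gw):
            sl = np.s_[i*block:(i+1)*block, j*block:(j+1)*block]
            A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
            b = -It[sl].ravel()
            (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
            flow[i, j] = (u, v)
    return flow
```

For a pure horizontal translation of a smooth texture, the recovered grid flow is close to (1, 0) pixels, as expected from the brightness-constancy equation.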
We recently described a new template, the Motion Vector Flow Instance (MVFI) (Olivieri et al., 2012; Díaz-Pereira et al., 2014), that utilizes a dense optical flow algorithm to encode both the magnitude and direction of the foreground motion. This encoding scheme improves the discrimination of human movements over previously employed motion templates because it contains both first and second derivatives of the velocity field. As described below, a dimensionality-reduction transformation is applied to the image sequence after subtracting the mean motion. The insight behind this step can also be found in face recognition, with the concept of eigenfaces (Etemad and Chellappa, 1997), where better discrimination and separability of different face classes are achieved by projecting along principal components derived from the differences of all images from the mean, encoded in the covariance matrix. Thus, just as the essential features of all faces are broadly the same, while what distinguishes two faces are the slight deviations of facial features from the mean (the covariance), the same is true of human motion.
The covariance eigenspace transformation for human movement was first described in (Huang et al., 1999) to distinguish the way people walk (their gait). By using supervised learning with PCA and Fisher Linear Discriminant Analysis (LDA), they classified the gait of different people by preassigning the projections of the images of a video shot into the training eigenspace. The Fisher LDA simultaneously minimizes the in-class variance (same actions are closer) while maximizing the out-of-class variance, thereby separating the different classes in the space. Other studies, such as (Lam et al., 2007), applied this technique to identify general human actions in different environments. More recently, Cho et al. (2009) used this PCA+LDA technique to analyze the gait of a set of subjects in order to establish a quantitative grading that could be useful for diagnosing the severity of Parkinson's disease.
The PCA method finds the linear eigenspace transformation for a given dataset that has the maximum projection of the data along the new basis vectors. However, PCA is a linear transformation, meaning that the orthogonal space is obtained through a combination of translations and rotations of the original space. The KPCA extends this idea to include nonlinear transformations through the use of the kernel trick (Bishop, 2006; Mohri et al., 2012). The choice of the kernel function provides the ability to fine-tune the solution space for a given input. As should be expected, the linear PCA solution can be recovered from the KPCA method by choosing a linear kernel function. Recently, Ekinci and Aykut (2007) used the KPCA approach for gait recognition, and others have applied this technique to improving face recognition (Luh and Lin, 2011; Xie and Lam, 2006).
3 Spatio-Temporal Trajectories
In this section, we provide the technical details behind our spatiotemporal classification method summarized in Figure 1. We implemented our system and algorithms in Python, making use of several libraries including SciPy/NumPy and OpenCV (version 2.4), a well-known open-source library for computer vision. We also developed a graphical interface in PyQt (Qt4 bindings for Python) and produce real-time 3-dimensional plots of the spatiotemporal trajectories with MayaVi.
3.1 The MVFI spatiotemporal template
In (Olivieri et al., 2012), we described the MVFI (Motion Vector Flow Instance) spatiotemporal template that encodes the velocity field of different human movements. These templates are formed by obtaining a representation of the optical flow field of the foreground motion on an evenly spaced grid mapped onto each image frame. From this flow field, box sizes encode the direction of motion while the pixel color encodes the velocity magnitude. For an input video, this procedure produces a corresponding video sequence with one template per frame. A summary of the steps is illustrated in Figure 2.
Figure 2(b1) illustrates the basic idea of how the MVFI is constructed with a boxing video shot. For a particular frame in the video sequence, the optical flow vectors are superimposed on the image. The algorithm uses this information to create a template consisting of boxes whose size and shape represent the direction of each vector, and whose pixel intensity indicates the relative strength of the vector. The construction proceeds as follows: an empty storage list is created as a temporary container for manipulating the vectors at each time step. For each optical flow grid point, information about the vector is used to form a box that is pushed onto the list. Next, this list is sorted by box size so that the largest box is on top. To construct the final image template at each time, the boxes are popped off the sorted list and drawn onto an empty image frame. In this way, the template accentuates the largest velocity components by placing these vectors on top, making them visible in the template sequence. This same procedure is repeated for all subsequent image frames in the video shot.
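The box-drawing step above can be sketched as follows. This is a simplified, hypothetical rendering of the idea (the function name, cell size, and the exact mapping from flow vector to box geometry are illustrative assumptions, not the published MVFI specification):

```python
import numpy as np

def mvfi_template(flow, frame_shape, cell=16):
    """Render a simplified MVFI-style template: one rectangle per flow
    vector, stretched along the dominant motion axis, with pixel intensity
    proportional to speed. Boxes are drawn in ascending speed order so the
    fastest components end up on top of the final image."""
    h, w = frame_shape
    tpl = np.zeros((h, w), dtype=np.uint8)
    gh, gw, _ = flow.shape
    vmax = max(np.hypot(flow[..., 0], flow[..., 1]).max(), 1e-6)
    boxes = []
    for i in range(gh):
        for j in range(gw):
            u, v = flow[i, j]
            speed = np.hypot(u, v)
            bw = int(cell // 2 * (1 + abs(u) / vmax))  # wider for x-motion
            bh = int(cell // 2 * (1 + abs(v) / vmax))  # taller for y-motion
            cy, cx = i * cell + cell // 2, j * cell + cell // 2
            boxes.append((speed, cy, cx, bh, bw))
    boxes.sort(key=lambda b: b[0])                     # slowest drawn first
    for speed, cy, cx, bh, bw in boxes:
        inten = int(255 * speed / vmax)
        tpl[max(cy - bh, 0):cy + bh, max(cx - bw, 0):cx + bw] = inten
    return tpl
```

A single fast vector in an otherwise static grid thus produces one bright, elongated box against a dark background.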
We showed in (Olivieri et al., 2012) that velocity information improves the recognition performance for human actions over previous methods, since these templates capture an instantaneous snapshot of the entire velocity field. Rapid velocity changes, relative to the mean velocity, produce trajectories that are very far from the origin of the canonical KPCA space. In this way, such trajectories are easily distinguished from human movements with small velocity components. Because most human actions are well differentiated by the velocities of body parts and of the full body, these templates are particularly effective for discriminating different types of such actions.
3.2 Mathematics of the PCA and KPCA space
We refer to the spatiotemporal template sequence of a human action as $S$, where there is one template image for each frame in the original video shot. A particular template image in the sequence is denoted $s_t$, and there are $N$ such image templates in the sequence $S$.
For the purpose of supervised learning, there are several video shots for each human action class. For training, we combine all image templates from all the video shots, reshaping each into a column vector of pixels. Thus, an element of the training set is an image template pertaining to a given class and a given frame within its sequence. The total number of images in the training set, $M$, is given by the sum over all classes and shots. The training set is then the matrix $X = [x_1, \dots, x_M]$, where each column $x_i$ is the vector of all $P$ pixels of one image frame.
Linear PCA
This space is constructed from the orthogonal vectors that possess the most variance between all the images in $X$. A reduced-dimensional PCA space is found by first obtaining the mean image, $\bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i$, and then obtaining the covariance, representing pixels that deviate from the mean.
The covariance matrix is found by calculating the contribution from all pixels relative to this mean, $\phi_i = x_i - \bar{x}$, so that

$\Sigma = \frac{1}{M} \sum_{i=1}^{M} \phi_i \phi_i^{\top} = \frac{1}{M} A A^{\top}, \qquad A = [\phi_1, \dots, \phi_M].$
The orthogonal directions with the most variance are found from the eigenvectors $u_k$ and eigenvalues $\lambda_k$ of $\Sigma$:

$\Sigma\, u_k = \lambda_k u_k,$
assuming that $\Sigma$ can be diagonalized. However, $\Sigma$ is a very large matrix ($P \times P$, where $P$ is the total number of pixels in an image frame).
In practice, this excessively large matrix is simplified (Fukunaga, 1990) with the relation $A^{\top} A\, v_k = \lambda_k v_k$, which involves a much smaller matrix (only $M \times M$) amenable to diagonalization. From this modified eigenvalue equation, the set of eigenvectors $u_k = A v_k$ (suitably normalized) and eigenvalues $\lambda_k$ that span the space of $\Sigma$ are equivalent to those of the original matrix for all nonzero eigenvalues, thereby justifying the truncation of the problem.
A further approximation reduces the solution spectrum to a small number of eigenvectors. Such an approximation is justified since the eigenvalues decrease rapidly and monotonically with the eigenvector index $k$, so that $\lambda_k \approx 0$ beyond modest values of $k$. Thus, the eigenspace is truncated so that only the $d$ largest eigenvalues are kept. The partial set of eigenvectors $\{u_1, \dots, u_d\}$ spans a reduced space, and the projections of the original images are

$\omega_k = u_k^{\top}(x - \bar{x}), \qquad k = 1, \dots, d.$
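The small-matrix trick and the truncated projection can be sketched in a few lines of NumPy (function names and the choice of k are illustrative; this is a generic eigenfaces-style PCA, not our exact production code):

```python
import numpy as np

def pca_eigenspace(X, k=12):
    """Covariance eigenspace via the small-matrix trick: for M images of P
    pixels (P >> M), diagonalize the M x M Gram matrix A^T A instead of the
    P x P covariance A A^T, then map its eigenvectors back to pixel space.
    (Eigenvalues here are unnormalized: scaled by M relative to Sigma.)"""
    mu = X.mean(axis=1, keepdims=True)           # mean image, shape (P, 1)
    A = X - mu                                   # deviations from the mean
    G = A.T @ A                                  # small (M, M) Gram matrix
    lam, V = np.linalg.eigh(G)                   # ascending eigenvalues
    idx = np.argsort(lam)[::-1][:k]              # keep the k largest
    lam, V = lam[idx], V[:, idx]
    U = A @ V / np.sqrt(np.maximum(lam, 1e-12))  # unit eigenvectors of A A^T
    return mu, U, lam

def project(mu, U, x):
    """Project one vectorized image onto the truncated eigenspace."""
    return U.T @ (x - mu.ravel())
```

Projecting every template frame of a shot with `project` yields the sequence of points that traces out the spatiotemporal trajectory.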
The above mathematical procedure describes the precise manner by which the image sequence is converted to a spatiotemporal trajectory; each point representing one template in this reduced dimensional eigenspace.
The KPCA
The PCA is a linear rotation of the original basis into one having maximum variance for the given data set. Intuitively, if the data formed a general ellipsoid at some angle with respect to the original axes, the PCA transformation would discover the rotation coincident with the principal axes of the ellipsoid. Such linear transformations may not be optimal, and a more general nonlinear transformation could provide a better solution. The KPCA method (Scholkopf et al., 1999) retains the concept of PCA but can be nonlinear. The method uses the kernel trick, which states that only the form of the inner product needs to be specified, not the basis functions, making it a practical method to implement. In practice, an appropriate kernel is chosen with model parameters adjusted to maximize the out-of-class separation while minimizing the in-class separation.
The detailed mathematics for constructing the kernel-PCA method can be found elsewhere (Bishop, 2006); however, we briefly describe its use for obtaining spatiotemporal trajectories. As before, we construct the training set from all the template images of the video shots, with $M$ column vectors, and subtract the mean movement. A nonlinear transformation $\phi$ maps each point into a feature space; a reduced $d$-dimensional representation is found by postulating basis vectors $u_k$ in that space, so that each point is projected onto the directions $u_k^{\top}\phi(x)$, where $k = 1, \dots, d$.
The covariance matrix in the feature space is given by:

$C = \frac{1}{M} \sum_{i=1}^{M} \phi(x_i)\, \phi(x_i)^{\top}.$
The corresponding eigenvalue problem, $C u_k = \lambda_k u_k$, is solved by diagonalizing the $M \times M$ kernel matrix $K$, with entries $K_{ij} = k(x_i, x_j)$, through $K a_k = \lambda_k M\, a_k$.
After algebraic manipulations, the kernel trick consists in finding a transformation where only the form of the inner product is needed to project the original vectors into this newly postulated space with basis vectors $u_k$. In this way, the explicit form of the eigenvectors does not need to be calculated in order to find projections. Instead, we write the transformation in terms of the inner product, here called the kernel function, given by $k(x, y) = \phi(x)^{\top}\phi(y)$.
A projection of an original point $x$ into this space along the $k$-th component is written as:

$y_k(x) = u_k^{\top}\phi(x) = \sum_{i=1}^{M} a_{ki}\, k(x, x_i),$
where the $a_{ki}$ are the coefficients of each eigenvector, obtained from the eigenproblem of the kernel matrix together with the normalization condition $u_k^{\top} u_k = 1$.
We use a polynomial kernel, $k(x, y) = (x^{\top} y + 1)^{p}$, with the order $p$ of the polynomial optimized for the analyzed data.
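The procedure can be sketched with a minimal NumPy implementation of polynomial-kernel KPCA. This is a generic textbook formulation under the stated kernel (the function names, the explicit kernel centering, and the degree/k values are illustrative assumptions, not our system's exact code):

```python
import numpy as np

def kpca_fit(X, degree=3, k=3):
    """Kernel PCA with the polynomial kernel k(x, y) = (x.y + 1)^degree.
    Returns everything needed to project new points."""
    n = X.shape[0]
    K = (X @ X.T + 1.0) ** degree
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    lam, A = np.linalg.eigh(Kc)
    idx = np.argsort(lam)[::-1][:k]              # k leading components
    lam, A = lam[idx], A[:, idx]
    A = A / np.sqrt(np.maximum(lam, 1e-12))      # unit-norm feature-space axes
    return X, K, A, degree

def kpca_project(model, x):
    """Project a new point x using only kernel evaluations (the kernel trick)."""
    X, K, A, degree = model
    kx = (X @ x + 1.0) ** degree                 # kernel against training set
    kx_c = kx - kx.mean() - K.mean(axis=1) + K.mean()
    return A.T @ kx_c
```

Projections of the training points onto different components are mutually orthogonal, as expected of a PCA-style decomposition.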
3.3 Recognition of new actions from a KNN distance of trajectories
With the kernel-PCA transformation from a training set, we can classify a query video by projecting it into this newly formed space and comparing it to the trajectories corresponding to the training set. The exact procedure is as follows: a query video containing a human action is processed with low-level image processing algorithms to create the set of MVFI templates. These templates are then projected into the newly formed space through the KPCA transformation. A distance-based rule, such as KNN, can then be used to calculate the proximity of the constituent points along the trajectory to each of the defined classes. Depending upon a pre-established threshold, the query video shot is classified according to the percentage of points pertaining to each class.
We used the public KTH database (Schuldt et al., 2004) for performing training and validation of our algorithm. In particular, we performed tests with the following six actions: walking, jogging, running, boxing, clapping and waving. From our own human action database, we also studied four actions: jogging, boxing, playing tennis and greeting.
Figure 3(a) shows the spatiotemporal trajectories, or projections, in the polynomial KPCA eigenspace constructed from four different human actions. The set of kernel parameters was selected to provide maximum separation of the four classes. Query video shots containing one of the trained actions were transformed and projected into the space for classification, as shown in Figure 3b. As can be seen, the query trajectory is closest to the trajectories corresponding to the same action. By calculating the KNN distance between the query shot and the trajectories stored for each video in the database along its timeline, similarity scores were obtained.
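The baseline point-wise vote described above can be sketched as follows (a hypothetical helper, shown to fix ideas; the thresholding policy and k are illustrative):

```python
import numpy as np
from collections import Counter

def classify_trajectory(query_pts, train_pts, train_labels, k=5):
    """Classify each projected query frame by k-NN among all training
    trajectory points, then label the whole shot by majority vote.
    Returns the winning class and the fraction of points that agree,
    which can be compared against a pre-established threshold."""
    votes = []
    for q in query_pts:
        d = np.linalg.norm(train_pts - q, axis=1)   # Euclidean distances
        nn = np.argsort(d)[:k]                      # k nearest training points
        votes.append(Counter(train_labels[nn]).most_common(1)[0][0])
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)
```

This is precisely the scheme whose ambiguities near the origin motivate the point-cloud classifier of the next section.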
4 Using local differential curvature for distinguishing action classes
One of the problems with the traditional KNN distance metric described in the previous section, when distinguishing action classes from the spatiotemporal trajectories, is the ambiguity for points along the curve that cross into different class boundaries, especially near the origin. Recall that points near the origin of the covariance space represent parts of the motion having small velocity components. Many actions have at least some parts of their motion with small velocity, so overlap with another action class in this region of the space is common. Thus, a distance metric based solely on the Euclidean separation between points or groups loses information about the connectedness and spatial orientation of the full trajectory curve.
Instead, we introduce a new construct, the trajectory point cloud, which allows us to define a new distance metric based upon the local differential geometry of the curve. This new method uses different scales of the human action spatiotemporal trajectories. Viewed from far away, the spatiotemporal curves lie within unique mean (osculating) hyperplanes. By determining the hyperplane of different trajectories, we can distinguish the corresponding actions. On a finer scale, each point has local geometric characteristics, such as the curvature and torsion, providing information about how it is connected in time. We can use this local information to provide better KNN class discrimination at a finer scale. Thus, we define a distance metric that combines the knowledge from different scales to classify trajectories. We call this classifier the trajectory point cloud classifier.
We can find the mean hyperplane from local properties of the curve. A qualitative description of our method is as follows. The spatiotemporal trajectory is parameterized by a constant-speed arc length, simplifying the differential geometry. We divide this trajectory into sequential segments that overlap in a way similar to a moving window. We use these segments to determine the local properties of the curve: the curvature, the torsion, and the co-moving orthogonal basis along the arc length, from the generalized Frenet-Serret (FS) equations. For each segment, we obtain its so-called binormal vector, which defines the osculating plane traced out by that part of the curve. By summing the weighted contributions of all such binormal vectors, we obtain the mean osculating hyperplane for the entire trajectory. Each binormal vector is weighted by a term proportional to the radius of curvature. Recall that the curvature is a measure of how much the curve deviates from a straight line, while the torsion is a measure of how much the curve moves out of the plane. Thus, segments with large radii of curvature contribute the most in defining the mean hyperplane, while those that are tightly curled, having high curvature, contribute less. The resulting hyperplane can be used in a distance metric to distinguish different trajectories based upon the angles between the trajectory planes.
The trajectory point cloud is a way of describing the different scales associated with the trajectory. Locally, each trajectory point contains not only its spatial position, but also how it is connected to other points. At a larger scale, the entire trajectory can be treated as a cloud of points, having a centroid and a mean radius. Therefore, this multiscale information is used to distinguish trajectories in three situations related to the separation between cloud centroids, namely when it is (a) approximately zero (overlapping clouds), (b) approximately the radius of a cloud, or (c) larger than several cloud radii. The first and last cases (a and c) are classified well with clustering methods, such as the easily implemented KNN. For the case when trajectories overlap, however, we use additional information from the local geometric properties to distinguish points. With our geometric formalism of trajectories, we treat this in two ways: with mean osculating-hyperplane orientations and, to distinguish finer details, with a fuzzy-KNN-like method.
4.1 Definitions of the trajectory point cloud
To aid in the definitions and concepts, Figure 4A shows the trajectories from two different human actions and their associated trajectory point clouds. The points along the trajectories represent the MVFI image templates transformed into the KPCA space. Two characteristics are evident upon visual inspection: the curves appear to lie in separate planes, and they are partially overlapping. The figure shows the cloud surfaces and the mean cloud radii, which are used in the distance metric, along with the resultant weighted normal vectors of the time-averaged hyperplanes.
Figure 4B shows two isolated regions along the trajectory, with all other details and parts of the trajectory removed for visual clarity. In these isolated regions, particular discrete curve segments have been selected for illustration. In the algorithm, these sequential overlapping curve segments form a set, as described above. The figure shows how each curve segment can be used to calculate the local FS frame, the curvature, and the torsion. While each segment defines a slightly different plane and has a different curvature, the aggregate defines an average plane for the entire trajectory. In Figure 4, the segment shown in green has a binormal vector that lies slightly out of the mean plane, and a neighboring segment, also out of plane, has its own binormal vector. A segment with higher curvature than the others contributes less to the resultant vector, since the resultant is calculated with each binormal weighted by its radius of curvature.
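The curvature-weighted averaging of segment binormals can be sketched numerically. The following is a minimal NumPy illustration under stated assumptions (the function name, segment length, overlap stride, and use of finite differences instead of spline derivatives are all simplifications of our actual procedure):

```python
import numpy as np

def mean_hyperplane_normal(traj, seg_len=8):
    """Mean osculating-plane normal of a sampled 3-D trajectory: per-segment
    binormals b = r' x r'' / |r' x r''|, each weighted by the radius of
    curvature 1/kappa so that flat (low-curvature) segments dominate."""
    d1 = np.gradient(traj, axis=0)               # first derivative
    d2 = np.gradient(d1, axis=0)                 # second derivative
    acc = np.zeros(3)
    for s in range(0, len(traj) - seg_len, seg_len // 2):  # overlapping segs
        c = np.cross(d1[s:s+seg_len], d2[s:s+seg_len]).mean(axis=0)
        norm_c = np.linalg.norm(c)
        if norm_c < 1e-12:                       # straight segment: skip
            continue
        speed = np.linalg.norm(d1[s:s+seg_len], axis=1).mean()
        kappa = norm_c / max(speed**3, 1e-12)    # mean segment curvature
        acc += (c / norm_c) / max(kappa, 1e-12)  # binormal weighted by 1/kappa
    return acc / max(np.linalg.norm(acc), 1e-12)
```

For a planar curve, the recovered normal aligns with the plane's true normal; the angle between two such normals can then serve as the hyperplane term of the distance metric.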
These concepts are illustrated further in Figure 5A, which also serves to define the variables involved. Two segments of a single trajectory are represented, each lying within its own osculating plane. For each, we can make the following definitions.
4.2 Local Differential properties
The Trajectory
We now formalize the ideas described previously. A trajectory curve, $\mathbf{C}(s)$, is parameterized by the arc length $s$. In practice, we represent the spatiotemporal trajectory in terms of B-splines, which are smooth functions parameterized by the arc length. In this way, they can be used to calculate local differential properties of the trajectories in a practical and numerically efficient way.
Formally, we can write the $p$th degree B-spline curve and its first derivative as:

(1) $\mathbf{C}(s) = \sum_{i=0}^{n} N_{i,p}(s)\,\mathbf{P}_i$

(2) $\mathbf{C}'(s) = \sum_{i=0}^{n} N'_{i,p}(s)\,\mathbf{P}_i$

where the $N_{i,p}$ are piecewise polynomial basis functions of the arc length $s$; the parameter values at which the polynomial pieces join are called knots, and the $\mathbf{P}_i$ are the control points along the arc length of the curve. Higher-order derivatives can be obtained in a similar way. These equations are used to obtain polynomial expressions for the curvature $\kappa$, the torsion $\tau$, and the Frenet-Serret basis vectors.
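The practical representation described above can be sketched with scipy.interpolate, which the implementation section notes is the library used; the trajectory here is a synthetic helix standing in for a KPCA eigenspace trajectory (illustrative data, not from the paper):

```python
import numpy as np
from scipy.interpolate import splprep, splev

# Synthetic 3-D eigenspace trajectory: a helix standing in for a
# KPCA-projected MVFI template sequence (illustrative data only).
t = np.linspace(0, 4 * np.pi, 200)
x, y, z = np.cos(t), np.sin(t), 0.1 * t
points = np.vstack([x, y, z])

# Fit an interpolating cubic B-spline through the trajectory points.
tck, u = splprep([x, y, z], s=0.0, k=3)

# Evaluate the curve and its derivatives on a dense parameter grid;
# these are the ingredients for curvature, torsion and the FS frame.
uu = np.linspace(0, 1, 500)
C = np.array(splev(uu, tck))           # C(u), shape (3, 500)
dC = np.array(splev(uu, tck, der=1))   # C'(u)
ddC = np.array(splev(uu, tck, der=2))  # C''(u)
```

With the smoothing factor set to zero the spline interpolates the input points exactly, which is appropriate when the eigenspace points are already low-noise.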
Arc segment and local frame
We define discrete arc segments along the trajectory, $\sigma_k$, each with length $\ell_k$. Thus, the trajectory consists of a collection of such arc segments: $S = \{\sigma_k\}$, for $k = 1, \dots, M$. In practice we take the arc lengths to be equal, so that $\ell_k = \ell$ for all $k$.
For each segment $\sigma_k$, we can calculate the average curvature $\bar\kappa_k$ and torsion $\bar\tau_k$, centered within the interval at $s_k$, by integrating over the arc segment. In terms of the arc-length parameterized trajectory $\mathbf{C}(s)$, the curvature and torsion along the entire curve, and their mean values for each segment, are given by:

$\kappa(s) = \|\mathbf{C}''(s)\|, \qquad \tau(s) = \frac{\left(\mathbf{C}'(s) \times \mathbf{C}''(s)\right)\cdot \mathbf{C}'''(s)}{\|\mathbf{C}''(s)\|^{2}}$

$\bar\kappa_k = \frac{1}{\ell}\int_{\sigma_k} \kappa(s)\,ds, \qquad \bar\tau_k = \frac{1}{\ell}\int_{\sigma_k} \tau(s)\,ds$
With these quantities, we can obtain the local basis frame from the general $n$-dimensional Frenet-Serret (FS) equations, given in terms of the vectors $(\mathbf{t}, \mathbf{n}, \mathbf{b})$, well known from the theory of curves. The tangential vector is the derivative of the trajectory with respect to the arc length $s$, $\mathbf{t} = \mathbf{C}'(s)$ (in Figure 5A, $\mathbf{t}$ is tangent to the curve). The normal vector is found by taking the derivative of $\mathbf{t}$ with respect to $s$, scaled inversely by the curvature: $\mathbf{n} = \mathbf{t}'/\kappa$. The binormal vector is found by taking the cross product of the tangential and normal vectors, $\mathbf{b} = \mathbf{t} \times \mathbf{n}$, and is related to the torsion through $\mathbf{b}' = -\tau\,\mathbf{n}$. We now have the exact equation of the binormal vector that is used to define the plane of the curve, the osculating plane.
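As a sketch of how the frame and the differential invariants follow from the spline derivatives, the general-parameter formulas below (equivalent to the arc-length forms in the text) can be implemented in a few lines of Numpy; the function name is ours, not the paper's:

```python
import numpy as np

def frenet_serret(dC, ddC, dddC):
    """Curvature kappa, torsion tau, and the (t, n, b) frame from the
    first three derivatives of a 3-D curve (columns are sample points).
    General-parameter formulas, valid for any regular parameterization."""
    cross = np.cross(dC, ddC, axis=0)
    speed = np.linalg.norm(dC, axis=0)
    ncross = np.linalg.norm(cross, axis=0)
    kappa = ncross / speed**3                               # curvature
    tau = np.einsum('ij,ij->j', cross, dddC) / ncross**2    # torsion
    t_hat = dC / speed                                      # tangent
    b_hat = cross / ncross                                  # binormal
    n_hat = np.cross(b_hat, t_hat, axis=0)                  # normal
    return kappa, tau, t_hat, n_hat, b_hat
```

For a circular helix of unit radius and pitch parameter c, this recovers the constant analytic values kappa = 1/(1+c^2) and tau = c/(1+c^2), a useful sanity check for any implementation.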
Given an entire trajectory $T_j$, we can find the FS frame, mean curvature, and mean torsion for each segment, so that $\sigma_k = (\mathbf{t}_k, \mathbf{n}_k, \mathbf{b}_k, \bar\kappa_k, \bar\tau_k)$. The mean osculating plane can be found by summing the weighted contributions from all arc segments; the resulting vector $\mathbf{B}_j$ defines the plane for the $j$th trajectory. We can see in Figure 5A how the weighted contribution of each segment depends upon its curvature. In particular, small tightly curved loops (large $\bar\kappa$) should contribute less in defining the mean plane than large-radius segments. Making the connection to the temporal dependence of the trajectory as points in a video sequence, the resultant binormal vector is really a time-averaged osculating plane. The equation is given by:

$\mathbf{B}_j = \frac{1}{Z} \sum_{k=1}^{M} \frac{1}{\bar\kappa_k}\,\mathbf{b}_k$

where $Z$ is a normalization constant.
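A minimal sketch of the weighted sum, assuming (as stated above) that each segment's binormal is weighted by its radius of curvature and the result is normalized to unit length:

```python
import numpy as np

def mean_binormal(binormals, kappas):
    """Resultant time-averaged binormal: each segment binormal b_k is
    weighted by its radius of curvature 1/kappa_k, so tightly curved
    loops contribute less; the sum is normalized to unit length."""
    weights = 1.0 / np.asarray(kappas)
    B = (np.asarray(binormals) * weights[:, None]).sum(axis=0)
    return B / np.linalg.norm(B)
```

A segment with curvature 100 times larger than another contributes 100 times less to the resultant direction, which is the behavior illustrated in Figure 5A.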
Alternative descriptions of planes
Many alternative techniques exist for obtaining the mean hyperplane that cuts through a set of points, which need not rely upon the differential properties of curves. Nonetheless, the method we developed has the advantage of providing local geometric information that can be used on several scales. Figure 5B illustrates two alternative methods for obtaining a mean plane through a set of points in 3 dimensions. If no knowledge is available for how the points are connected, Singular Value Decomposition (SVD) provides a simple projection procedure for finding the best-fit plane through the points in a least-squares sense. This method will often fail to coincide with the plane defined by connected points, as shown in Figure 5B. A method for obtaining a mean plane from connected points is to construct successive polygon segments, also shown in Figure 5B (and later in Figure 10). This method yields the same plane as that defined by the binormal vector; in this method, however, all other quantities must still be calculated for other steps in our classification algorithm.
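For comparison, the SVD best-fit plane mentioned above can be sketched as follows; the plane normal is the singular direction of smallest variance of the centered point cloud:

```python
import numpy as np

def svd_plane_normal(points):
    """Least-squares plane through a point cloud of shape (N, 3):
    the normal is the right singular vector associated with the
    smallest singular value of the centered data matrix."""
    X = points - points.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[-1]   # unit normal of the fitted plane
```

Because this fit ignores how the points are connected, it can disagree with the binormal plane for closed connected curves, which is exactly the failure case shown in Figure 10a.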
4.3 Steps in the trajectory point cloud algorithm
The steps of the trajectory point cloud classifier algorithm are shown in Figure 6. In the previous section, we described steps 1 and 2, where we defined the concept of the trajectory point cloud with the collection of segments and the time-averaged osculating plane from the resultant binormal vector. We now use this information to develop a distance metric that classifies an unknown video into one of a set of trained classes.
In step 3 of Figure 6, we use each trajectory to calculate macroscopic quantities: the cloud centroid $\mathbf{c}_j$ and the average cloud radius $r_j$. From these definitions, we can express each trajectory cloud as the tuple:

$T_j = \left(\mathbf{c}_j,\; r_j,\; \mathbf{B}_j,\; S_j\right)$

where the set $S_j = \{\sigma_k\}$ collects the local properties of each trajectory point in the cloud.
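The macroscopic part of the tuple can be computed directly; here the mean cloud radius is taken as the mean distance of the points to the centroid, which is our reading of the definition above:

```python
import numpy as np

def cloud_summary(points):
    """Macroscopic descriptors of a trajectory point cloud of shape
    (N, dim): the centroid and the mean cloud radius (mean distance
    of the points to the centroid)."""
    centroid = points.mean(axis=0)
    radius = np.linalg.norm(points - centroid, axis=1).mean()
    return centroid, radius
```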
How does this trajectory information help to distinguish between different action classes? Figure 7 illustrates different configuration scenarios that can occur with respect to the trajectory point clouds. The configurations define three separate regions to which our distance metric is selectively sensitive:

Region 1 (top left): when the trajectories overlap. This is the case where the trajectories correspond to the same action. For this situation, we want the distance to depend only upon the centroid separation, which is close to zero. Thus, we want to eliminate the contributions to the distance metric that correspond to the orientation of the mean hyperplanes. If we wish to distinguish fine details between actions of the same class, we use a specialized KNN, which we call the fuzzy cloud KNN, briefly described below.

Region 2 (bottom left): when the trajectories are separated by approximately a mean cloud radius. In this case, the trajectories can be partially overlapping. This is precisely the region where ambiguities arise in other metrics, and here we see the power of the hyperplane method: we want the contribution to the distance metric from the hyperplane normal vectors to be maximal.

Region 3 (top right): when the trajectories are separated by more than a few cloud radii. In this case, the cloud centroid is sufficient to resolve different classes, so the contribution from the hyperplane orientation should become increasingly small as the separation between the cloud centroids increases.
These ideas are captured in the modulation function shown in Figure 7 (bottom right) as a function of the trajectory point cloud separation. The function treats the three regions above in a different manner: (a) it is zero when the separation is approximately zero, (b) it is maximum when the separation is approximately a mean radius, and (c) it decreases exponentially for separations greater than a mean radius.
The function that modulates the hyperplane orientation in the distance metric between trajectory point clouds is shown in Figure 7 (bottom right) and is given by:

(3) $\Omega(d_{ij}) = A\, d_{ij}^{\,a}\, e^{-b\, d_{ij}}$

where $d_{ij} = \|\mathbf{c}_i - \mathbf{c}_j\|$ is the separation between the centroids of the two trajectory cloud tuples $T_i$ and $T_j$ defined previously, and the free parameters $A$, $a$, and $b$ are chosen as a function of the cloud radius: $A$ is a scaling constant, $a$ controls how steep the function is close to the origin (that is, how quickly the function cuts off), while $b$ controls the long exponential tail, so that larger values go to zero faster. Different values of these parameters are shown in Figure 7 in order to illustrate the effect of each free parameter. Values of these parameters for real trajectories of our study are given below in the experimental results section.
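One functional form with the three required properties (zero at the origin, a peak near the mean cloud radius, an exponential tail) is a gamma-like curve; this is an illustrative choice and the parameter values are placeholders, not necessarily the paper's exact expression:

```python
import numpy as np

def omega(d, r, A=1.0, a=2.0):
    """Gamma-like modulation of the hyperplane-orientation term:
    ~0 at zero separation, peaked near the mean cloud radius r, and
    exponentially decaying beyond it.  Choosing b = a / r places the
    maximum exactly at d = r.  (Illustrative form and parameters.)"""
    b = a / r
    return A * d**a * np.exp(-b * d)
```

The tail parameter plays the role described in the text: a larger decay constant drives the orientation contribution to zero faster for well-separated clouds.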
4.4 The Fuzzy Cloud KNN
Our trajectory cloud classifier was designed so that when trajectories overlap, the hyperplane orientation can be used to distinguish different actions. However, in some situations, two different actions could have similar hyperplanes. In other situations, we may wish to distinguish the difference between two executions of the same action, as in our recent work studying the quality of Olympic gymnastics movements (Díaz-Pereira et al., 2014). For these situations, we can use the set of local trajectory segments to obtain a distance measure. We developed a specialized KNN algorithm, called the fuzzy cloud KNN, that classifies a query trajectory into a set of classes using the local information of the trajectory.
Although the details are beyond the scope of this paper and shall be described elsewhere, Figure 8 illustrates the general idea of the algorithm. Different possible overlap configurations are shown in Figures 8A and B. The points pertaining to different trajectories are given in different colors and labeled with their trajectory tuples. The situations illustrated in the figure provide the logic for assigning membership rules. In configurations of type A, when clouds overlap at some angle, the normal vector orientations are opposite and the curvatures are large and small, respectively. In configuration B, when clouds are nearly coincident, the local trajectory points will have vectors and curvatures that coincide on average.
Figure 8C illustrates the idea of the fuzzy cloud KNN using trajectory points, represented as wedges to accentuate the orientation. In the example, a test wedge (shown at the center in blue) is to be classified into one of two groups (indicated by red and green). Analogous to the classic KNN algorithm, a value $k$ is chosen that determines the maximum number of nearest neighbors to be considered for the classification of the test point. As in the original fuzzy-KNN algorithm described by Keller et al. (1985), these neighboring points are weighted by a set of fuzzy membership functions that are inversely proportional to the separation. In our algorithm, such functions are parameterized by the relative difference between the test point and a neighboring point (pertaining to one of the classes). Rather than assigning crisp class membership for the test point, this procedure produces a set of vectors whose components are membership values between 0 and 1. These vectors are used in an aggregate function, which defines a set of rules for class inference.
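The classic fuzzy-KNN weighting of Keller et al. (1985) that this builds on can be sketched as follows; the aggregation rules specific to the fuzzy cloud KNN are omitted here, as they are in the text:

```python
import numpy as np

def fuzzy_knn_memberships(train_pts, train_labels, query, k=3, m=2.0):
    """Classic fuzzy-KNN memberships (Keller et al., 1985): the k
    nearest training points vote with weights inversely proportional
    to a power of their distance, yielding soft class memberships in
    [0, 1] rather than a crisp label.  m > 1 is the fuzzifier."""
    d = np.linalg.norm(train_pts - query, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[nn], 1e-12) ** (2.0 / (m - 1.0))
    classes = np.unique(train_labels)
    u = np.array([w[train_labels[nn] == c].sum() for c in classes])
    return classes, u / u.sum()
```

The soft membership vector, rather than a hard vote, is what the aggregate inference rules of the fuzzy cloud KNN operate on.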
4.5 The Distance Metric
Given the above definitions, we can now define the full distance metric between trajectory point clouds, which consists of three terms: one that depends on the centroid distance, another that depends upon the orientation of the hyperplanes, and a third that can provide fine-structure details from a fuzzy-KNN-like inference:

(4) $D(T_i, T_j) = \|\mathbf{c}_i - \mathbf{c}_j\| + \Omega(d_{ij})\,\theta(\mathbf{B}_i, \mathbf{B}_j) + \lambda\, F(T_i, T_j)$

where $\Omega(d_{ij})$ modulates the strength of the hyperplane orientation term $\theta(\mathbf{B}_i, \mathbf{B}_j)$, the angle between the resultant binormal vectors (as shown in Figure 7), and $\lambda$ modulates the strength of the fuzzy cloud KNN penalty function $F$, so that it contributes when the trajectory clouds partially or fully overlap. Since the function $F$ produces solutions that depend on the class type, this function contributes differently to $D$ for each class.
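A schematic of how the three terms combine, with placeholder callables standing in for the modulation function and the fuzzy penalty (all names here are illustrative, not the paper's API):

```python
import numpy as np

def cloud_distance(c1, B1, c2, B2, omega_fn, fuzzy_pen=0.0, lam=0.0):
    """Three-term distance between two trajectory point clouds, given
    their centroids (c1, c2) and unit resultant binormals (B1, B2).
    omega_fn modulates the hyperplane term; fuzzy_pen/lam stand in for
    the fuzzy cloud KNN penalty and its weight."""
    d = np.linalg.norm(c1 - c2)                              # centroid term
    angle = np.arccos(np.clip(abs(np.dot(B1, B2)), -1.0, 1.0))  # plane angle
    return d + omega_fn(d) * angle + lam * fuzzy_pen
```

For identical clouds all three terms vanish; for overlapping clouds on differently oriented planes, the second term dominates, which is the behavior described for Region 2.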
4.6 Implementation and Results of the TPC Classifier
We implemented the formalism for the trajectory point cloud (TPC) in a set of Python classes that depend only upon the Numpy/Scipy/Matplotlib libraries for numerical operations and plotting. The B-spline routines from scipy.interpolate were used to represent the curves and their higher-order derivatives. All other functions were implemented following the descriptions provided in the previous sections.
Figure 9 shows the results of calculating the binormal vector for a particular spatiotemporal trajectory. In particular, Figure 9a (left) shows the plane obtained with the binormal vector for an individual segment and its correspondence with the polygon plane for the same segment. Figure 9a (right) shows the same trajectory with many other segments and the corresponding planes defined by their binormal vectors. The values of the arc-length averaged radii of curvature, the normal vector to the polygon, and the FS vectors are given in the table inset. As seen, the binormal vectors are coincident with the polygon normal vectors. Figure 9b shows the successive solutions obtained by summing each segment contribution along the trajectory; the resultant binormal vector is indicated in the figure by the darkest plane and by the arrow. Figure 9c shows the convergence of the resultant vector with the successive addition of each segment, for different values of the segment length.
Figure 10a shows the planes for the trajectories of two actions. For comparison, planes were calculated with the SVD method and with the resultant binormal vector method described above. For the trajectory on the right, both methods yield similar planes. For the other trajectory, however, the SVD fails to properly recover the plane of the closed connected curve, while the mean binormal plane is correct.
Figure 10b shows two separate action comparisons that can suffer from ambiguities with the classical KNN: (top) jogging/walking, and (bottom) falling/fast-walking. Since the covariance space is different depending upon the actions trained, we normalized all quantities with respect to the maximum extent of the two clouds. In the modulation function (Equation 3), we set the parameters empirically so that the function peaks close to the mean cloud radius and has a long tail, guaranteeing a contribution from the hyperplane orientation term for trajectories that are relatively close, while remaining moderate for those further away. As can be seen from the values, the distance metric for the jogging/walking case (top) is dominated by the first term of Eq. 4, while the falling/fast-walking case (bottom) is dominated by the second term of Eq. 4, which depends on the angle between hyperplanes.
5 Experimental Results of CBVR
From the spatiotemporal analysis with a KPCA and our new trajectory point cloud classifier described in the previous sections, we validated the recognition performance of our CBVR system using two public video datasets: the MILE database (Olivieri et al., 2012) and the KTH database (Schuldt et al., 2004). Another objective of these tests was to show that well-chosen parameters in a KPCA can outperform the recognition rates of a linear PCA, while still retaining computational performance. For this, we fine-tuned the polynomial kernel function of the KPCA in order to maximize the class separation of the human activities in the study.
5.1 Experiments in MILE video database
The specifics of our database (Olivieri et al., 2012) are as follows. It consists of 240 video shot sequences representing 4 human actions (boxing, greeting, playing tennis and jogging) recorded with 12 different people. The video shots were obtained under normal lighting conditions using a commodity Sony (DCR-HC15) MiniDV camera, with a sampling rate of 25 frames/s. All actions were recorded using the same focal distance and no special backlighting preparations were implemented. The videos were saved in AVI MPEG encoding format. Together with the raw footage, we processed each video shot with an adaptive resizing algorithm to create resized image sequences for later use in our CBVR system. Figure 3 shows a sampling of different MVFI templates (b) in BGR color space that result from the different human actions (a). Finally, the frames are converted to grayscale for vector quantization of the spatiotemporal templates.
We carried out experimental tests with a training set consisting of 64 video shots (8 people, 4 human actions, and 2 video shots for each person): boxing, greeting, jogging and playing tennis. For controls in our analysis, we also considered two cases: (1) a null action, defined as a scene without a human action, and (2) a non-defined action, that is, any other action not considered in the training set. In the case of a null action, the resulting trajectories in the PCA eigenspace are concentrated close to the origin.
Figure 11 shows a comparison of the linear and polynomial kernel PCA applied to the four-class training set discussed above. The example shows the spaces formed with two, three and four separate human action classes, each represented by a single video shot and a single person. The results demonstrate that we can achieve a better separation between the different classes with the KPCA than can be obtained with the linear PCA. Indeed, by fine-tuning the kernel function parameters, we can control the class separation, which ultimately leads to improved classification performance.
The polynomial kernel takes the form:

$k(\mathbf{x}_i, \mathbf{x}_j) = \left(\mathbf{x}_i \cdot \mathbf{x}_j + 1\right)^{d}$

where the value of the degree $d$ is selected to maximize the class separation. The dependence of the class separation on this parameter is shown in Figure 12, which shows action classification results for different values of $d$. Just as in the Fisher criterion, the objective function seeks a constrained maximization: maximizing the average distance between points of eigenspace trajectories belonging to different classes (out-class) while minimizing the average distance among points belonging to the same class (in-class). These relations are shown in Figure 12 as plots of the ratio of out-class to in-class distances, corresponding to the training data previously shown in Figure 11. These studies indicate that the optimal value of the tuning parameter is independent of the human action type as well as the total number of classes in the training set.
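A minimal kernel-PCA with a polynomial kernel can be written directly in Numpy; the exact kernel offset and degree used here are a common convention and an assumption on our part:

```python
import numpy as np

def kernel_pca(X, degree=3, n_components=2):
    """Minimal polynomial-kernel PCA: build the Gram matrix with
    k(x, y) = (x . y + 1)^degree, double-center it in feature space,
    and project the training points onto the leading eigenvectors."""
    K = (X @ X.T + 1.0) ** degree
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                          # center in feature space
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    return Kc @ alphas                      # projected training points
```

Sweeping the degree and measuring the out-class/in-class distance ratio of the projections reproduces the tuning procedure described above.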
5.2 Results of trajectory point cloud classifier
Figure 13 shows the results of the trajectory point cloud classifier.
In order to quantify the distance between classes, we used a simple Euclidean metric. Thus, given two points $\mathbf{p} \in A$ and $\mathbf{q} \in B$, in separate classes within the $n$-dimensional space, the distance is $d(\mathbf{p}, \mathbf{q}) = \|\mathbf{p} - \mathbf{q}\|$. A metric for the total distance between classes $A$ and $B$ is the sum over all pairwise distances $d(\mathbf{p}, \mathbf{q})$.
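The summed pairwise distance between two classes can be computed by broadcasting:

```python
import numpy as np

def total_class_distance(A, B):
    """Sum of all pairwise Euclidean distances between two classes of
    points, A of shape (nA, dim) and B of shape (nB, dim), used as the
    between-class separation measure."""
    diff = A[:, None, :] - B[None, :, :]     # shape (nA, nB, dim)
    return np.linalg.norm(diff, axis=2).sum()
```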
We compared the PCA and kernel-PCA methods by normalizing the distance vectors obtained in the respective spaces, dividing each by the largest distance along the principal axes of its space; the ratio of these maximum values defines the normalization factor between the two spaces.
The result of this normalization procedure is shown in Figure 13, consisting of the results obtained from the distances between the classes indicated in Figure 11. In all cases, the polynomial kernelPCA provided superior class separation and recognition results, even when the linear PCA is combined with linear discriminant analysis (LDA).
5.3 Experiments in KTH database
The KTH video database (Schuldt et al., 2004) is a widely used public database for testing and comparing human motion recognition algorithms. This database contains six action classes (boxing, hand clapping, hand waving, jogging, running and walking). These actions were recorded with 25 people in four different scenarios (Figure 14a): (s1) outdoors with the camera parallel to the subject's trajectory; (s2) outdoors with an angle between the camera and the trajectory, or with scale changes; (s3) outdoors with different clothes or a pack on the back; and (s4) indoors with various degrees of shadows. Figure 14b illustrates example MVFI templates for a sampling of video frames from this database.
From the KTH database, the training set we selected consists of six human actions performed by eight different people. All the other videos in the database were used as the test set. The results of the recognition performance are given in the confusion matrices of Table 1, which provide a comparison between the results obtained with the linear PCA, PCA+LDA, and KPCA. The lowest recognition rate corresponds to the running action, given its similarity with jogging in the database. As in previous comparisons, the polynomial KPCA provided better discrimination amongst the different actions when compared with the linear PCA.
PCA  Box  Clap  Wave  Jog  Run  Walk  
Box  91.2 (89.5)  6.9 (7.6)  1.9 (2.9)  0  0  0  
Clap  9.6 (12.3)  84.3 (79.8)  6.1 (7.9)  0  0  0  
Wave  4.3 (5.2)  10.1 (11.3)  85.6 (83.5)  0  0  0  
Jog  0  0  0  91.8 (89.6)  6.7 (8.8)  1.5 (1.6)  
Run  0  0  0  14.1 (17.9)  83.8 (77.3)  2.1 (4.8)  
Walk  0  0  0  5.2 (4.7)  1.2 (2.1)  93.6 (93.2)  
PCA+LDA  Box  Clap  Wave  Jog  Run  Walk  
Box  93.7 (90.2)  5.4 (8.3)  0.9 (1.5)  0  0  0  
Clap  6.4 (9.6)  91.1 (87.3)  2.5 (3.1)  0  0  0  
Wave  3.9 (3.6)  5.7 (5.8)  90.4 (90.6)  0  0  0  
Jog  0  0  0  93.6 (91.4)  5.0 (6.5)  1.4 (2.1)  
Run  0  0  0  8.4 (9.5)  89.7 (85.2)  1.9 (2.1)  
Walk  0  0  0  7.8 (7.6)  0.2 (0.3)  92 (92.1)  
Pol. kernelPCA  Box  Clap  Wave  Jog  Run  Walk  
Box  94.6 (92.4)  3.8 (5.1)  1.6 (2.5)  0  0  0  
Clap  5.7 (7.8)  93.1 (89.2)  1.2 (3.0)  0  0  0  
Wave  1.1 (2.4)  5.2 (6.8)  93.7 (90.8)  0  0  0  
Jog  0  0  0  94.8 (92.6)  3.7 (5.1)  1.5 (2.3)  
Run  0  0  0  6.4 (8.7)  92.3 (88.8)  1.3 (2.5)  
Walk  0  0  0  3.9 (5.1)  1.0 (0.8)  95.1 (94.1) 
The average recognition rate is a useful metric for comparing the performance of different classifiers for human actions. Table 2 shows the average recognition rate from our results compared with results published previously by other researchers. Our results, using the MVFI templates with either the PCA or KPCA, are competitive with the best reported techniques, and our system achieves real-time recognition with an accuracy greater than 93%. From the details provided in the other published results, we could not determine whether those techniques function in real time.
Methods  Recognition accuracy (%)  
Pol. kernelPCA + MVFI (this paper)  93.9 (91.3) 
PCA + LDA + MVFI (this paper)  91.8 (89.5) 
PCA + MVFI (this paper)  88.4 (85.5) 
Liu and Shah (2008)  94.2 
Mikolajczyk and Uemura (2008)  93.2 
Schindler and Van Gool (2008)  92.7 
Laptev et al. (2008)  91.8 
Jhuang et al. (2007)  91.7 
5.4 Experiments as a CBVR: Video indexing and Annotation
Once an action is identified, a full video sequence can be annotated, marking those parts of the video containing relevant human actions and possibly storing this information as metadata. As our algorithm marches through the video, it must decide whether a trained event is present or not. In particular, the routine identifies human actions in the training set as well as non-actions, or null frames. The algorithm processes a window of frames of the video timeline at a time, performing the KPCA transformation and calculating the distance metric to the trained classes. The algorithm proceeds with an overlapping moving window, thereby producing a classification at every window step. The essential steps of the algorithm are thus: window the timeline, transform the window of frames with the KPCA, and classify the resulting trajectory with the distance metric against the trained classes.
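The marching-window bookkeeping can be sketched as follows; the window size and overlap are left as parameters, since the text does not fix their values here:

```python
def sliding_windows(n_frames, window, overlap):
    """Start/end frame indices for the overlapping marching window over
    a video timeline; consecutive windows share `overlap` frames, so a
    new classification is produced every (window - overlap) frames."""
    step = window - overlap
    return [(s, s + window) for s in range(0, n_frames - window + 1, step)]
```

Each returned interval would be transformed with the KPCA and scored against the trained classes, with a null class absorbing windows containing no trained action.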
We used our system to annotate sections of videos and return the time intervals during which trained actions take place. We tested our algorithm with several films to identify 5 human actions: picking up the phone, drinking, sitting, walking and running. We include the null case to classify any other action, such as "a car moving" or "a dog playing". Figure 15 illustrates indexing the timeline with the trajectory point cloud and its associated feature tuple. A query shot makes comparisons against each of these vectors.
We performed ground-truth validation tests of our algorithm by applying it to two feature-length movies that we annotated manually. From these tests, we determined the recognition rate of our algorithm for detecting the location of actions similar to those in our training set. Figure 16 shows the results for detecting two actions, "walking" and "drinking", in two open source films ("Route 66 - an american bad dream" and "Valkaama"). To determine the location of these frames we used a marching moving window with a fixed window size and overlap. The training set was taken from the MILE database by selecting five actions performed by eight separate people. Figure 16 shows false positives (FP) and false negatives (FN) for the classifications (walk/other actions and drink/other actions). Many FPs were due to different shot angles and body clipping that were not considered in the training set, but produced MVFI spatiotemporal trajectories similar to those of running and walking. These can be eliminated with more stringent requirements and by increasing the training set to include more shot angles and body clipping scenes similar to those found in the movie.
We compared our results with a ground-truth manual annotation of both full feature films in order to obtain quantitative performance information about our algorithm, such as the sensitivity and specificity. For each of the films in Figure 15, the manual annotation of the 5 types of trained human actions is shown, with the result of the automatic annotation produced by our system overlaid on top. In the case of the first film, "Route 66", the figure shows a scene in which our system correctly detected a "walking" action shot, while from the movie "Valkaama", we show particular results for the "drinking" and "picking up" actions. For each of the actions defined in the study, the true positive rate (TPR) and true negative rate (TNR) are shown in Table 3. The analyses were made by dividing the actions into groups, in the same way as explained previously for the experiments with the MILE/KTH human movement datasets. Each analysis consists of two groups: (1) the action in question, and (2) any other action not considered in the study.
Actions  Real shots  CBVR results  

TP  TN  FP  FN  TPR  TNR  
Walk  42  29  54  13  8  0.78  0.81 
Run  28  21  67  8  7  0.75  0.89 
PickUp Phone  4  3  90  9  1  0.75  0.91 
Drink  18  13  73  12  5  0.72  0.86 
Sit  11  9  78  14  2  0.82  0.85 
5.5 Web application for CBVR with short video shots
As an interface for our CBVR algorithms, we developed a lightweight web application (available at http://fideo.milegroup.net) that can query the database from saved videos using a drag-and-drop search box, or from a live webcam capture of a human action. An example screenshot of the query-by-video web application is shown in Figure 17, where results from a query with a boxing action are shown. Briefly, the web interface works as follows. For an existing query video, the shot is moved into the drag-and-drop search box, uploading the video to the server. For the live stream option, a web application records the video from the webcam and subsequently uploads the result to the server. Once uploaded, the video shot is processed by a server-side application to produce the corresponding spatiotemporal trajectory, which is used in a similarity search against all videos in the database. This search is carried out using search windows, so that if a video is longer than the search window, the entire duration of the video is searched to determine the location of the action within the database video.
As described previously, the database server contains the fine-grained spatiotemporal trajectories for all points across the timeline, obtained with the KPCA transformation. For a query shot, similar videos and/or locations of video segments in longer videos are found by calculating the pairwise accumulated distance between the targets and the query. In our example of Figure 17(b), the recorded video shot correctly produces a higher similarity to all videos with the same action (boxing), as seen through a higher similarity percentage. For null actions, or for actions that are not contemplated in the training set, the query should yield negligible hit-rate values.
6 Conclusions
The spatiotemporal template method allows complex motions to be processed and classified in real time by using a supervised learning procedure. We showed that with a KPCA transformation, better out-of-class separation can be obtained by fine-tuning the kernel parameter depending upon the nature of the data. As we postulated, the KPCA provides more flexibility through a nonlinear transformation as compared with the linear PCA.
Nonetheless, there is a limit to the extent that different action classes can be separated, even with highly tuned kernel engineering of the KPCA space. This is especially true as the number of different action classes increases in a multiclass classification analysis. Scaling to larger numbers of classes is accompanied by a commensurate increase in class boundary overlap. As these class boundaries become softer, traditional classifiers such as KNN or SVM will be unable to crisply distinguish the membership of certain points along the trajectory, and therefore the recognition rate will suffer.
Thus, the most profound contribution of this paper is a new classifier for spatiotemporal trajectories, which we call the trajectory point cloud classifier. As described, this classifier specifically treats the complicated but common case in which trajectories partially overlap, namely when they belong to different action classes but their class boundary is not crisp. Our method considers local differential geometric properties of the trajectories in order to identify the average n-dimensional osculating hyperplane in which these trajectories live. Different actions will lie on hyperplanes that are oriented at different angles, and the centroids of the trajectory point clouds allow us to control the extent to which this orientation is incorporated into the distance calculation between different clouds. Thus, we say that the distance metric for our classifier is orientation-dependent, and that the direction is determined by the weighted binormal vector to the mean osculating hyperplane, obtained from the independent contributions of a collection of sequentially overlapping curve segments along the trajectory.
Our method resolves the problem of overlapping trajectories, which arises more commonly in multiclass analysis. This is in contrast to traditional methods such as the classical KNN, where the trajectory is treated as a set of independent points, thereby ignoring essential information about the connectedness of the points. Thus, we demonstrated that our new trajectory point cloud classifier is superior to the KNN (or other point-centric methods) for detecting human actions with a spatiotemporal methodology. Nonetheless, even though we described this technique in the context of human motion recognition, the classification technique is general and can be extended to other cases where the points are correlated, as they are here for time-sequenced video frames.
Finally, we provided a proof of principle demonstration of how our spatiotemporal MVFI and classification method could be used as a CBVR system to annotate/index and query videos from a multimedia database. Due to the nearly infinite variety of shot angles and partial body shots, online learning combined with probabilistic inference could cover a wider range of motion variations and contexts.
References
Achard et al. (2008) Achard, C., Qu, X., Mokhber, A., Milgram, M., 2008. A novel approach for recognition of human actions with semi-global features. Machine Vision and Applications 19 (1), 27–34.
Beecks et al. (2010) Beecks, C., Uysal, M., Seidl, T., July 2010. A comparative study of similarity measures for content-based multimedia retrieval. In: Multimedia and Expo (ICME), 2010 IEEE International Conference on. pp. 1552–1557.
Bhatt and Kankanhalli (2011) Bhatt, C. A., Kankanhalli, M. S., 2011. Multimedia data mining: State of the art and challenges. Multimedia Tools Appl. 51 (1), 35–76.
Bishop (2006) Bishop, C. M., 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Blank et al. (2005) Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R., 2005. Actions as space-time shapes. In: Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on. Vol. 2. pp. 1395–1402.
Bobick and Davis (1996) Bobick, A. F., Davis, J. W., 1996. An appearance-based representation of action. In: Proceedings of the 13th Int. Conf. on Pattern Recognition (ICPR). pp. 307–312.
Bobick and Davis (2001) Bobick, A. F., Davis, J. W., 2001. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3), 257–267.
Cho et al. (2009) Cho, C. W., Chao, W. H., Lin, S. H., Chen, Y. Y., 2009. A vision-based analysis system for gait recognition in patients with Parkinson's disease. Expert Systems with Applications 36 (3), 7033–7039.
Díaz-Pereira et al. (2014) Díaz-Pereira, M. P., Gómez-Conde, I., Escalona, M., Olivieri, D. N., 2014. Automatic recognition and scoring of Olympic rhythmic gymnastic movements. Human Movement Science, in press.
Ekinci and Aykut (2007) Ekinci, M., Aykut, M., 2007. Human gait recognition based on kernel PCA using projections. Journal of Computer Science and Technology 22, 867–876.
Etemad and Chellappa (1997) Etemad, K., Chellappa, R., 1997. Discriminant analysis for recognition of human face images. In: Audio- and Video-based Biometric Person Authentication. Vol. 14 (8) of Lecture Notes in Computer Science. pp. 1724–1733.
Farnebäck (2003) Farnebäck, G., 2003. Two-frame motion estimation based on polynomial expansion. In: Proceedings of the 13th Scandinavian Conf. on Image Analysis. pp. 363–370.
Felzenszwalb et al. (2010) Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D., 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), 1627–1645.
Fukunaga (1990) Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition (2nd ed.). Academic Press Professional, Inc., San Diego, CA, USA.
Hosseini and Eftekhari-Moghadam (2013) Hosseini, M.-S., Eftekhari-Moghadam, A.-M., 2013. Fuzzy rule-based reasoning approach for event detection and annotation of broadcast soccer video. Applied Soft Computing 13 (2), 846–866.
Hu et al. (2011) Hu, W., Xie, N., Li, L., Zeng, X., Maybank, S., 2011. A survey on visual content-based video indexing and retrieval. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 41 (6), 797–819.
 Huang et al. (1999) Huang, P. S., Harris, C. J., Nixon, M. S., aug. 1999. Human gait recognition in canonical space using temporal templates. In: IEE Proceedings of Vision, Image and Signal Processing. Vol. 146(2). pp. 93–100.
 Ikizler and Forsyth (2008) Ikizler, N., Forsyth, D., 2008. Searching for complex human activities with no visual examples. Int. J. Computer Vision 80(3), 337–357.
 Jhuang et al. (2007) Jhuang, H., Serre, T., Wolf, L., Poggio, T., 2007. A biologically inspired system for action recognition. In: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. pp. 1–8.
 Jones and Shao (2013) Jones, S., Shao, L., 2013. Contentbased retrieval of human actions from realistic video databases. Information Sciences 236, 56–65.
 Keller et al. (1985) Keller, J., Gray, M., Givens, J., July 1985. A fuzzy knearest neighbor algorithm. Systems, Man and Cybernetics, IEEE Transactions on SMC15 (4), 580–585.
 Küçük and Yazıcı (2011) Küçük, D., Yazıcı, A., 2011. Exploiting information extraction techniques for automatic semantic video indexing with an application to turkish news videos. KnowledgeBased Systems 24 (6), 844 – 857.
 Lam et al. (2007) Lam, T. H., Lee, R. S., Zhang, D., 2007. Human gait recognition by the fusion of motion and static spatiotemporal templates. Pattern Recognition 40 (9), 2563 – 2573.
 Laptev et al. (2008) Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B., June 2008. Learning realistic human actions from movies. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. pp. 1–8.
 Liao et al. (2013) Liao, K., Liu, G., Xiao, L., Liu, C., 2013. A samplebased hierarchical adaptive kmeans clustering method for largescale video retrieval. KnowledgeBased Systems 49, 123 – 133.
 Liu and Shah (2008) Liu, J., Shah, M., 2008. Learning human actions via information maximization. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. pp. 1–8.
 Lucas and Kanade (1981) Lucas, B. D., Kanade, T., 1981. An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence  Volume 2. IJCAI’81. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 674–679.
 Luh and Lin (2011) Luh, G.C., Lin, C.Y., 2011. {PCA} based immune networks for human face recognition. Applied Soft Computing 11 (2), 1743 – 1752, the Impact of Soft Computing for the Progress of Artificial Intelligence.
 Meeds et al. (2008) Meeds, E., Ross, D., Zemel, R., Roweis, S., 2008. Learning stickfigure models using nonparametric bayesian priors over trees. In: Proceedings of the EEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1–8.
 Mikolajczyk and Uemura (2008) Mikolajczyk, K., Uemura, H., 2008. Action recognition with motionappearance vocabulary forest. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. pp. 1–8.
 Mohri et al. (2012) Mohri, M., Rostamizadeh, A., Talwalkar, A., 2012. Foundations of Machine Learning. The MIT Press.
 Nga and Yanai (2014) Nga, D. H., Yanai, K., 2014. Automatic extraction of relevant video shots of specific actions exploiting web data. Computer Vision and Image Understanding 118, 2–15.
 Olivieri et al. (2012) Olivieri, D. N., Gómez Conde, I., Vila Sobrino, X. A., 2012. Eigenspacebased fall detection and activity recognition from motion templates and machine learning. Expert Syst. Appl. 39 (5), 5935–5945.
 Poppe (2010) Poppe, R., 2010. A survey on visionbased human action recognition. Image & Vision Computing 28 (6), 976–990.
 Ren et al. (2009) Ren, W., Singh, S., Singh, M., Zhu, Y., 2009. Stateoftheart on spatiotemporal informationbased video retrieval. Pattern Recognition 42 (2), 267 – 282.
 Rius et al. (2009) Rius, I., Gonzàlez, J., Varona, J., Roca, F. X., 2009. Actionspecific motion prior for efficient bayesian 3d human body tracking. Pattern Recognition 42 (11), 2907 – 2921.
 Samy Sadek and Michaelis2 (2013) Samy Sadek, Ayoub AlHamadi, G. K., Michaelis2, B., 6 2013. Affineinvariant feature extraction for activity recognition. ISRN Machine Vision 2013.
 Schindler and Van Gool (2008) Schindler, K., Van Gool, L., 2008. Action snippets: How many frames does human action recognition require? In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. pp. 1–8.
 Scholkopf et al. (1999) Scholkopf, B., Smola, A., Müller, K.R., 1999. Kernel principal component analysis. In: Advances in kernel methods  support vector learning. MIT Press, pp. 327–352.
 Schuldt et al. (2004) Schuldt, C., Laptev, I., Caputo, B., 2004. Recognizing human actions: A local svm approach. In: Proceedings of the Pattern Recognition, 17th International Conference on (ICPR’04). IEEE Computer Society, Washington, DC, USA, pp. 32–36.
 Szeliski (2010) Szeliski, R., 2010. Computer Vision: Algorithms and Applications, 1st Edition. SpringerVerlag New York, Inc., New York, NY, USA.
 Ugolotti et al. (2013) Ugolotti, R., Nashed, Y. S., Mesejo, P., Špela Ivekovič, Mussi, L., Cagnoni, S., 2013. Particle swarm optimization and differential evolution for modelbased object detection. Applied Soft Computing 13 (6), 3092 – 3105, swarm intelligence in image and video processing.
 Venkatesh Babu and Ramakrishnan (2004) Venkatesh Babu, R., Ramakrishnan, K. R., 2004. Recognition of human actions using motion history information extracted from the compressed video. Image and Vision Computing 22(8) (8), 597–607.
 Xie and Lam (2006) Xie, X., Lam, K.M., Sep. 2006. Gaborbased kernel pca with doubly nonlinear mapping for face recognition with a single face image. Trans. Img. Proc. 15 (9), 2481–2492.