Human action recognition using local two-stream convolutional neural network features and support vector machines


This paper proposes a simple yet effective method for human action recognition in video. The proposed method separately extracts local appearance and motion features from sampled snippets of a video using state-of-the-art three-dimensional convolutional neural networks. These local features are then concatenated to form global representations, which are used to train a linear SVM that performs the action classification using the full context of the video, as opposed to the partial context used in previous works. The videos undergo two simple proposed preprocessing techniques: optical flow scaling and crop filling. We perform an extensive evaluation on three common benchmark datasets to empirically show the benefit of the SVM and of the two preprocessing steps.

1 Introduction

Human action recognition is a very active area of study in the computer vision research community. This is likely because a viable, reliable solution to such a problem would have a vast impact on society, in domains such as healthcare, surveillance, and entertainment. In the surveillance space, detection of anomalous or illegal events, such as shoplifting; robbery; and fighting, would be possible. In entertainment applications, human-computer interaction could reach new levels of effectiveness, since reliable detection of behaviour and engagement would be possible. Lastly, in the healthcare industry, a solution could assist in the rehabilitation of patients [1].

In video, actions take place in many different settings, and are captured by cameras with different sensors. Finding a general, reliable, and robust solution to (video) human action recognition is still an open problem. Both motion and appearance information need to be accounted for when modelling the problem, and thus extracting a reliable set of features that generalise to novel settings is difficult. The features, or representations of actions, should be discriminative enough to tell the difference between temporally-similar, and spatially-similar, actions.

Typically, there are two approaches to human action recognition. The first, which dominated early action recognition research, involves extracting hand-engineered features [2, 3, 4], henceforth termed engineered features. The most common, high-level approach to extracting engineered features starts with an interest point detector - typically either space-time interest points [5] (STIPs) or dense sampling. Then, feature descriptors are extracted at each of these locations. Such descriptors include, but are not limited to, histograms of oriented gradients [6] (HOG), histograms of optical flow [7] (HOF), N-jets [2, 3], and motion boundary histograms [8] (MBH). These features are typically computed locally - in a 2D or 3D window around the interest points. These local features are then encoded and quantised to form global, fixed-length representations for the video. The two most common approaches to doing this are the bag-of-words (BoW) framework and Fisher vectors. These then serve as the final action representation, which is commonly fed into an SVM to perform the classification.

The second set of approaches, unsurprisingly, leverages deep learning. Simple inputs, such as RGB video or optical flow videos, are fed into 2D or 3D convolutional neural networks (CNNs) to learn salient spatio-temporal features about the actions, and perform the classification. We term these features automated features. Until recently, very few large, labelled action recognition datasets existed, and thus the benefit of deep learning could not be realised in this domain. As such, the approaches extracting engineered features typically dominated the research landscape. However, with the introduction of large-scale action recognition datasets, such as Kinetics and YouTube-8M, deep learning has reaped the rewards. Approaches based on deep learning are now achieving state-of-the-art results on all common benchmark datasets. The issue with such approaches is their massive computational resource demands - some state-of-the-art techniques are trained using upwards of 60 GPUs [9].

This work serves to introduce a novel approach to human action recognition, and demonstrates two key benefits to aid recognition performance. Firstly, performing two simple pre-processing techniques can assist the networks with learning better action representations. Secondly, using a linear SVM trained on a set of local, crop-level action representations improves performance greatly over simply taking a consensus vote of the network’s crop-level predictions.

The remainder of the work is structured as follows. Section 2 provides a background of the relevant literature, and Section 3 introduces the proposed approach. Section 4 provides a comprehensive evaluation of the proposed method on the three benchmark datasets. Lastly, Section 5 concludes the work and outlines potential future work.

2 Background and Related Work

Previous research into action recognition can largely be split into two groups - those hand-crafting features, and those learning features automatically using deep neural networks. We split the literature review in accordance with this.

2.1 Engineering Features

[2] detect interest points in a video, at various spatial and temporal scales, using a 3D extension of the Harris operator, known as space-time interest points [5] (STIPs). At each of these locations, fourth-order local jets are computed. A BoW approach is taken, clustering a subset of these jets using K-Means. The final video representation is then the resultant histogram of visual word occurrences. An SVM was trained on these histograms. This approach resulted in a best average accuracy of 71.7% on the KTH dataset.

[10] take a similar approach, starting with STIPs. Then, instead of computing N-jets at these locations, HOG and HOF vectors are computed in space-time volumes around the detected points. These HOG and HOF vectors are concatenated and normalised for each subvolume within this volume. A BoW framework is employed. An SVM with a kernel is used for classification. This approach resulted in an average accuracy of 91.8% on the KTH dataset. This method was state-of-the-art at the time, due to the features being able to capture more of the pertinent motion and appearance information than those of [2], for example.

[3] evaluated various interest point and descriptor combinations on a host of datasets. Similar to the approaches above, video sequences are represented as a bag-of-words. The descriptors are quantised into visual words using K-Means clustering. These visual words are then used to compute histograms of visual word occurrences, and these serve as the final representations for the video. For classification, an SVM with a kernel is used. One key finding is that in realistic video settings, dense sampling outperforms other interest point detection methods, while increasing the number of features by a factor of 15. Another key finding is that the HOG/HOF descriptors seem to, in general, perform well.

[11] introduce Action Bank - an extension of the Object Bank [12] approach to image representation into the video domain. This method is the current state-of-the-art on the KTH dataset. Action Bank represents a video as the collected output of many action detectors that each produce a correlation volume. Results obtained were 98.2% on KTH, 95.0% on UCF Sports, 57.9% on UCF50, and 26.9% on HMDB51. The two main downsides of this approach are its massive computational inefficiency (e.g. a video from UCF50 took anywhere from 0.4 to 34 hours), which is likely the reason the method has not been researched further, and the fact that it performs comparatively poorly on realistic datasets.
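The bag-of-words encoding shared by [2], [3] and [10] can be illustrated with a minimal sketch. It assumes a codebook of cluster centres has already been learned (for example with K-Means); the function name `bow_histogram` and the choice to normalise the histogram to sum to one are assumptions made for illustration, not details taken from the cited works.

```python
import numpy as np


def bow_histogram(descriptors, codebook):
    """Encode local descriptors as a normalised histogram of visual words.

    descriptors: (n, d) array of local descriptors from one video.
    codebook:    (k, d) array of cluster centres (hypothetical, e.g. from K-Means).
    """
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every descriptor to every centre.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Assign each descriptor to its nearest visual word.
    words = np.argmin(dists, axis=1)
    # Count word occurrences; normalising to sum to one is an assumption.
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)
```

The resulting fixed-length histogram is what would be passed to the SVM, regardless of how many descriptors the video produced.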

2.2 Automated Features

A method that spawned much further research in action recognition was introduced by [13]. This method is based on the intuition that video can naturally be broken down into spatial and temporal components. The spatial component, defined by the individual frames of the videos, carries information about scenes and objects in the video. The temporal component, defined by motion between frames, carries information about the movement in the video (e.g. that of the camera and the objects). Both of these components, or streams, are implemented as CNNs. The spatial stream CNN operates on individual RGB frames of the video. Thus, this network makes action predictions for each frame. The static appearance of a frame is hypothesised to be a useful cue for action recognition since actions are often strongly associated with particular objects. The temporal stream CNN is fed some form of pre-computed optical flow input. This is done so that the temporal network does not have to estimate the motion itself. Three different types of inputs are considered for the temporal CNN. To fuse the output of both the spatial and temporal networks, two methods are considered: averaging, and training a linear SVM on stacked L2-normalised softmax vectors as features. This approach resulted in 88.0% and 59.4% accuracies on the UCF-101 and HMDB51 datasets, respectively. Due to the impressive performance of this approach, especially for deep learning-based approaches to action recognition, it has seen much interest in recent action recognition research. This novel two-stream approach, as well as the optical flow volume inputs, have been widely studied and adapted to try to improve the recognition performance [9, 14, 15].
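As a rough illustration of the two fusion options described for [13], the sketch below averages the per-class softmax scores of the two streams, and separately builds a stacked feature vector from the two score vectors. The function names are hypothetical, and normalising each stream before stacking is one possible reading of the description rather than a confirmed detail of [13].

```python
import numpy as np


def fuse_by_averaging(spatial_scores, temporal_scores):
    """Average the per-class softmax scores of both streams and pick a class."""
    spatial_scores = np.asarray(spatial_scores, dtype=float)
    temporal_scores = np.asarray(temporal_scores, dtype=float)
    fused = (spatial_scores + temporal_scores) / 2.0
    return int(np.argmax(fused)), fused


def stacked_l2_features(spatial_scores, temporal_scores):
    """Stack the two score vectors after L2-normalising each (assumed order)."""
    parts = []
    for scores in (spatial_scores, temporal_scores):
        scores = np.asarray(scores, dtype=float)
        parts.append(scores / np.linalg.norm(scores))
    # The stacked vector would serve as the input feature for a linear SVM.
    return np.concatenate(parts)
```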

3 Methodology

The main intuition behind the proposed method is that actions in video can naturally be decomposed into appearance and motion information, and thus these two components are separately modelled. This is along the same lines as recent research [9, 13, 14]. These components are modelled using the state-of-the-art 3D CNN architecture known as I3D, employing the RGB network for the spatial modelling, and the flow network for the temporal modelling. One key factor in modelling actions in this way is the spatial and temporal resolutions of the samples used to train the networks. In order to capture as much of the full temporal evolution of the actions as possible, both the spatial and temporal resolutions need to be as high as possible, and state-of-the-art techniques use large temporal resolutions [9, 15]. However, [16] showed that temporal resolution is somewhat more important than spatial resolution for action recognition. We take cognisance of these findings in our modelling process by using a temporal resolution as large as possible, within computational resource limitations.

Formally, consider a video whose pixel values lie in the gray scale for 8-bit images, with a given number of frames and a given spatial resolution. We first compute the optical flow video at this original resolution. This is computed by applying dense optical flow to each pair of consecutive frames in the video. This flow procedure can be defined as a mapping that takes two grayscale frames as inputs, and outputs a 2-channel ‘image’. The two channels of this image correspond to the horizontal and vertical components of the flow vector fields, respectively. This results in an optical flow video. The frames of the RGB video are normalised by dividing all values by a fixed constant.

The temporal resolution is then limited to a fixed number of frames. This can result in one of three cases. If the video has fewer frames than this limit, we perform crop filling, shown in Figure 1. We posit that this duplication process will retain the natural flow and progression of the video or action better than the common alternative in the literature of repeatedly appending the last frame to fill the deficit. If the video has more frames than the limit, we sample that many equally-spaced frames from it. Lastly, if the video has exactly that many frames, we sample the full video. We apply this same temporal sampling to the flow video. This results in temporally sampled RGB and flow volumes. Next, all frames of both volumes are resized such that the smaller side is equal to a fixed target length, preserving aspect ratio in the process. This yields the resized RGB and flow volumes, respectively.

[Figure 1: diagram of frame sequences built from Frame 1, Frame 2 and Frame 3; see caption below.]

Figure 1: Visualisation of the crop filling case, for a video that has fewer frames than required and needs its temporal resolution increased. This is to ensure that the logical flow of the action still holds when we need to increase the temporal resolution of a video. This approach is in contrast to the method of simply duplicating the last frame repeatedly to fill the deficit. We posit that the proposed duplication approach is better for such cases.
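The three temporal sampling cases can be sketched as a single index computation. This is a minimal sketch under an assumption: the exact duplication pattern of the crop filling figure is not recoverable here, so the code uses one order-preserving scheme in which frames are repeated in place. The function names `temporal_indices` and `last_frame_padding` are hypothetical.

```python
def temporal_indices(num_frames, target_frames):
    """Return the indices of the frames used to build a clip of target_frames frames.

    Shorter videos: frames are duplicated in place, so temporal order is kept.
    Longer videos:  equally-spaced frames are sampled.
    Equal length:   every frame is used exactly once.
    """
    if num_frames <= 0 or target_frames <= 0:
        raise ValueError("frame counts must be positive")
    # Integer arithmetic avoids floating-point rounding at index boundaries.
    return [(i * num_frames) // target_frames for i in range(target_frames)]


def last_frame_padding(num_frames, target_frames):
    """The common alternative: repeat the final frame to fill the deficit."""
    if num_frames <= 0 or target_frames <= 0:
        raise ValueError("frame counts must be positive")
    return [min(i, num_frames - 1) for i in range(target_frames)]
```

For a three-frame video stretched to five frames, `temporal_indices(3, 5)` spreads the duplicates through the clip, whereas `last_frame_padding(3, 5)` freezes the action on the final frame.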

A rescaling process is then performed on the resized flow video, similar to the process performed in [16]. The intuition behind the rescaling is that, if the flow for a particular pixel had a given magnitude at a particular resolution, then the flow at, say, half the resolution should be half that magnitude (i.e. no longer the original value). Previous works typically do not perform this rescaling, and thus we posit the flow values are not truly representative of the flow at that scale / resolution. This rescaling proved useful for discriminating between temporally-similar actions, such as walk and run. The rescaling is defined by two factors, representing the scaling factors for the horizontal and vertical flow components, respectively. Formally, given an original spatial resolution and a new resized resolution, the horizontal scaling factor is the ratio of the new width to the original width.
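The flow rescaling can be sketched as follows, assuming each scaling factor is the ratio of the resized dimension to the original dimension, which is what the half-resolution intuition suggests. The function name `rescale_flow`, the channel order (horizontal first, vertical second) and the array layout are assumptions for illustration.

```python
import numpy as np


def rescale_flow(flow, orig_hw, new_hw):
    """Scale flow vectors to match a spatial resize of the flow field.

    flow:    (..., H, W, 2) array that has already been resized to new_hw;
             channel 0 holds horizontal flow, channel 1 holds vertical flow.
    orig_hw: (height, width) at which the flow was originally computed.
    new_hw:  (height, width) after resizing.
    """
    orig_h, orig_w = orig_hw
    new_h, new_w = new_hw
    scale_x = new_w / orig_w  # factor applied to horizontal displacements
    scale_y = new_h / orig_h  # factor applied to vertical displacements
    scaled = np.asarray(flow, dtype=np.float32).copy()
    scaled[..., 0] *= scale_x
    scaled[..., 1] *= scale_y
    return scaled
```

Halving both dimensions therefore halves every displacement, so a motion of four pixels at the original resolution becomes two pixels in the resized flow field.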


The scale factor for the other flow component is computed in a similar manner. The flow is rescaled by multiplying all elements of the horizontal vector fields by the horizontal factor, and all elements of the vertical vector fields by the vertical factor. Finally, we crop five volumes from each of the resized RGB and flow volumes: the four corners and the centre crop. The crop size coefficient ensures that the crops have good spatial coverage of the videos / actions.

Next, we use the RGB and flow I3D networks pre-trained on the ImageNet and Kinetics datasets. We fine-tune the RGB network on the RGB crops, and fine-tune the flow network on the flow crops. Our baseline approach is taking a consensus vote over the network’s predictions for the crops of a video (the RGB crops and the flow crops). However, since this consensus voting scheme is naive, and only operates on a crop level, we propose an alternative approach in which the classifier has context about all crops available to it in order to make its prediction. This context, we posit, will prove beneficial, and aid recognition performance.

The penultimate layer of both I3D networks can be used as generic action features. To this end, we extract one feature vector per crop for both the RGB and flow streams. We then concatenate these vectors to form one feature vector. This feature vector is then power-normalised, with the same exponent used in all experiments. We use power normalisation in place of regular L2-normalisation since it has been shown to give better performance [17]. Since the concatenated vector may be in the order of tens of thousands of dimensions, we first reduce the dimensionality of these vectors using PCA (PCA will be applied by default unless stated otherwise). This reduced-dimension vector is used as the final ‘action’ representation. A linear SVM is then trained on these vectors, and learns to classify actions from this representation.
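The cropping and feature construction steps above can be sketched as below. Several details are assumptions: the crop size is passed in directly because the coefficient is not given here, the exponent default of 0.5 is a common choice for power normalisation rather than a value confirmed by this work, and all function names are hypothetical. The per-crop feature vectors stand in for the penultimate-layer outputs of the two networks.

```python
import numpy as np


def five_crop_offsets(height, width, crop_h, crop_w):
    """Top-left (row, col) offsets of the four corner crops and the centre crop."""
    if crop_h > height or crop_w > width:
        raise ValueError("crop must fit inside the frame")
    bottom, right = height - crop_h, width - crop_w
    return [(0, 0), (0, right), (bottom, 0), (bottom, right),
            (bottom // 2, right // 2)]


def power_normalise(x, alpha):
    """Signed power normalisation: sign(x) * |x| ** alpha."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.abs(x) ** alpha


def video_representation(rgb_features, flow_features, alpha=0.5):
    """Concatenate crop-level feature vectors and power-normalise the result.

    alpha=0.5 is an assumed default, not a value taken from the text.
    """
    vectors = list(rgb_features) + list(flow_features)
    stacked = np.concatenate([np.ravel(v) for v in vectors])
    return power_normalise(stacked, alpha)
```

The concatenated vector gives the classifier access to every crop of both streams at once, which is the context the crop-level consensus vote lacks.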
We posit that this will achieve better performance than making predictions directly from the network for three reasons. Firstly, the majority voting scheme described above is simple. Secondly, the layers of the network up to, but excluding, the final layer can be seen to learn a rich representation of the input, which the final layer classifies using a softmax classifier. We posit that an SVM is more effective for classifying from these representations than the softmax classifier. Lastly, the SVM will have the information about all RGB and flow crops from the representation, whereas the network makes predictions on a crop-by-crop basis, and thus does not have the information from other crops to aid its prediction.
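For contrast with the SVM, the baseline consensus vote can be sketched in a few lines. The function name `consensus_vote` is hypothetical, and the sketch assumes each crop contributes a single predicted class label.

```python
from collections import Counter


def consensus_vote(crop_predictions):
    """Return the class predicted most often across the crops of one video."""
    if not crop_predictions:
        raise ValueError("at least one crop prediction is required")
    counts = Counter(crop_predictions)
    # most_common(1) returns a list holding the (label, count) pair with the top count.
    return counts.most_common(1)[0][0]
```

Each prediction entering the vote was made from a single crop in isolation, so a crop that misses the discriminative part of the action can only add a wrong vote; it cannot be corrected by evidence from the other crops.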

3.1 Hyperparameter Optimisation

For the linear SVMs, the only parameter we optimise for is the penalty parameter C. We use a grid search approach, and choose the value of C that maximises five-fold cross-validation accuracy. The parameters for the I3D networks, such as number of layers and number of filters / neurons per layer, are left as default as per the recommendation of the authors in [9]. The weights for the network are set to the values that minimise the cross-entropy loss on a validation dataset.
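A grid search of this kind could be set up as in the sketch below. It assumes scikit-learn, which the text does not name; the candidate grid for the penalty parameter, the function name `fit_linear_svm`, and the decision to fit PCA inside the cross-validation pipeline are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def fit_linear_svm(features, labels, c_grid=(0.01, 0.1, 1.0, 10.0),
                   n_components=None):
    """Select the linear-SVM penalty C by five-fold cross-validated accuracy.

    c_grid and n_components are hypothetical settings chosen for illustration.
    """
    steps = []
    if n_components is not None:
        # Optional PCA step, refitted on the training part of every fold.
        steps.append(("pca", PCA(n_components=n_components)))
    steps.append(("svm", LinearSVC(max_iter=10000)))
    search = GridSearchCV(Pipeline(steps), {"svm__C": list(c_grid)},
                          scoring="accuracy", cv=5)
    search.fit(features, labels)
    return search
```

After fitting, `search.best_params_["svm__C"]` holds the selected penalty, and `search.predict` uses the model refitted on all training data with that value.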

4 Experimental Results

4.1 Action Recognition Datasets

In this research, we study and apply our proposed technique to the KTH, HMDB51, and UCF101 datasets. This provides a good framework to test our method on a simple dataset, KTH (as a sort of regression test), and on more difficult, complex datasets, HMDB51 and UCF101.

The KTH dataset consists of fairly-static backgrounds, with actions performed by 25 different actors. It consists of 599 videos of 25 people repeatedly performing one of the six actions. Altogether, there are 2391 subsequences of actions being performed. There are four different scenarios spread across the videos, namely s1: outdoors; s2: outdoors with scale variation; s3: outdoors with different clothes; and s4: indoors. Further, the dataset consists of the following six action classes: Walking; Jogging; Running; Boxing; Handclapping; and Handwaving.

HMDB51 [18] is a large human action recognition dataset, often used to thoroughly test a particular approach to action recognition, as it is widely considered one of the more difficult benchmark datasets in this domain. It consists of videos from a variety of sources such as films, public databases, YouTube, and Google videos. There are 6676 clips divided into 51 different categories/classes. Each class contains a minimum of 101 videos. Rather than enumerate all 51 action classes, below is a description of the 5 general classes that the 51 categories are grouped into [18]: General facial actions; Facial actions with body manipulations; General body movements; Body movements with object interactions; and Body movements for human interaction.

UCF101 [19] is a large-scale human action recognition dataset, consisting of 13320 realistic videos collected from YouTube. There are 101 action classes in the dataset, which can be broadly grouped into the following 5 categories: Human-Object Interaction; Body-Motion Only; Human-Human Interaction; Playing Musical Instruments; and Sports.

The videos contain large variations in illumination, scale, viewpoint, appearance, and pose, and the backgrounds are often cluttered. There exists some consistency between videos (such as a common background or viewpoint), even between those which are in separate classes.

4.2 Performance Evaluation

For all datasets, we use the average accuracy over the classes as the main metric for evaluating performance. This is the standard evaluation metric from previous literature. For the KTH dataset, the data is split according to the standard split of 8 people for training, 8 for validation, and 9 for testing. We then average the results over a pre-defined number of trials of this train/test split framework. Performance is then measured by average accuracy over these splits. For the HMDB51 dataset, there are three pre-defined train/test splits that come with the dataset, as defined by the introducing authors. Performance is then measured by average accuracy over these three splits. We also provide performance measures in the form of classification performance tables (showing class-by-class precision, recall, and F1-score), and confusion matrices.
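One way to read "average accuracy over the classes" is as the mean of the per-class accuracies, which weights every class equally regardless of how many test videos it has. The sketch below follows that reading; it is an assumption about the metric, and the function name `mean_class_accuracy` is hypothetical.

```python
import numpy as np


def mean_class_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies, so each class counts equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for label in np.unique(y_true):
        mask = y_true == label
        # Fraction of this class's samples that were predicted correctly.
        per_class.append(np.mean(y_pred[mask] == label))
    return float(np.mean(per_class))
```

With four samples of one class (three correct) and two of another (one correct), the result is (0.75 + 0.5) / 2 = 0.625, whereas the plain fraction of correct predictions would be 4/6.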

4.3 Experimental Setup

The experiments were implemented in Python with various machine learning, computer vision, and deep learning libraries. Experiments were run on a 32-core machine with 128GB of RAM and an RTX 2080Ti GPU, as well as a separate 8-core machine with 32GB of RAM and a GTX 1080 GPU.

4.4 Results on KTH Dataset

To test on the KTH dataset we set parameter and parameter . Such a setting, we posit, should be able to adequately model the comparatively-simple 2391 snippets of the dataset. We split the dataset in the recommended manner of 8 people for training, 8 for validation, and 9 for testing. We fine-tune the I3D networks for a certain number of epochs, and use the model that minimises the cross-entropy loss on the validation set. Due to the KTH dataset’s relative simplicity, we would expect performance to be fairly high. We can see a comparison of our proposed approaches against previous methods in Table 1. It is clear that our method is competitive with state-of-the-art approaches, achieving a highest accuracy of 96.6%, which is second only to the state-of-the-art approach by [11] with 98.2%. Even with the baseline majority voting approach, performance is still competitive. We can most likely attribute this performance to the KTH’s simplicity. The backgrounds are mostly static, and homogeneous. There is also not much clutter and occlusion to contend with. The I3D networks are, therefore, able to model the actions from such videos well. The ActionBank method proposed by [11] performs better most likely due to the amount of domain knowledge incorporated into the hand-crafted features, however, it should be noted that this method does not generalise well to complex settings / datasets such as HMDB51. Our method is, instead, simply tasked with learning the actions directly from RGB pixel values and optical flow vector fields, and generalises better than ActionBank to such settings. The baseline approach is less powerful simply because it makes predictions on a crop-by-crop basis and performs a majority vote. There are no steps taken to use information from the other crops (e.g. in the form of pooling or the SVM approach) to make the predictions. 
However, its accuracy is still relatively high, since much of the context about an action can be determined from a single video crop (as a result of the simple background, minimal noise, and limited clutter / occlusion).

Our approach performs better than the other methods because our method is structured in such a way that it makes no stringent assumptions about the actions. Instead, we feed raw RGB and optical flow inputs and task the networks with learning everything. The majority of the previous approaches hand-craft features that may work well on some simpler datasets, but struggle to generalise to more complex settings. Our method has no such limitation, and as such has a higher accuracy. Some of the previous works do compete with our method, such as [20] and [4]. The dense trajectory representation [20] is strong since it extracts both zero-order (HOF) and first-order (MBH) motion information, as well as trajectory and appearance (HOG) information. It then encodes and quantises these descriptors using the state-of-the-art Fisher vector representation. This means the representation is powerful enough to perform well on the KTH dataset since it takes many different cues into account to represent an action (and similarly for the hierarchical representation employed in [4], where the features are able to encode enough pertinent information about the actions to perform well).

Method Accuracy
N-Jets + BoW [2] 71.7
STIP + HOG/HOF [10] 91.8
Dense Sampling + HOG/HOF [3] 92.1
Action Bank [11] 98.2
Action MACH [21] 88.66
Hough Forest [22] 92.0
3D HOG [23] 91.4
Space-Time Hierarchy [4] 95.53
ISA [24] 93.9
Dense Trajectories [20] 94.53
Ours - baseline 94.5
Ours - SVM 96.6
Table 1: Results on the KTH dataset.

We ran an additional experiment to investigate the effect of PCA on performance. Surprisingly, reducing the dimension from the original 20400 dimensions to approximately 1500 dimensions resulted in a marginally higher accuracy (i.e. the system’s performance is slightly worse without PCA). This is likely due to the curse of dimensionality, although the difference is so small that we are unable to draw any real conclusions. These very large dimensions also discourage the use of non-linear kernels in the SVM, as they typically do not perform well in such contexts. Moreover, linear SVMs have been shown to perform well in a high-dimensional action recognition context [20, 17].

As part of our method, we perform what we term ‘crop filling’, which is appending frames to a crop of a video if the number of frames is less than the desired number for the input into the networks. This is typically done by repeatedly appending the last frame until the deficit is filled. We instead propose the scheme described in the previous section, which can be seen in Figure 1. Additionally, we perform optical flow scaling, as opposed to leaving the flow values untouched as is often done in previous research. We investigate leaving out these two steps - that is, appending the last frame and performing no flow scaling. This resulted in a baseline accuracy of 91.5%, which is noticeably lower than the 95.0% baseline accuracy obtained when the two proposed methods are employed. This suggests that performing these two steps aids performance. Interestingly, the performance of the SVM method drops only very slightly to 95.7%. Although it is difficult to tell how much these two techniques help on the KTH dataset, we will see that they are indeed beneficial in more complex settings, such as on the HMDB51 dataset. These results are shown in Table 2.

Configuration                              Baseline (%)   SVM (%)
Without flow scaling and crop filling      91.5           95.7
With flow scaling and crop filling         95.0           95.8
Table 2: Results of the proposed method with and without the flow scaling and crop filling procedures on the KTH dataset.

4.5 Results on HMDB51 Dataset

To test on the HMDB51 dataset, we set parameter and . This setting is partially inspired by the findings of [16], in which it is shown that temporal resolution is somewhat more important in action recognition than spatial resolution. Thus, we attempt to make the temporal resolution as large as possible, within the constraints of computational resource (and time) limitations (40 being the limit in the circumstances). A large temporal resolution is needed for complex, varied datasets such as HMDB51 since the videos are fairly long, and, more importantly, the majority of the full temporal evolution of the action needs to be modelled to ensure enough of the variation in the data is being captured. The latter is because the dataset has some actions that are very temporally (and spatially) similar - talk vs laugh, draw sword vs sword exercise, and kick vs kick ball. This is in tandem with the other existing difficulties related to the dataset, such as very complicated, non-static backgrounds, clutter, and occlusion. Performance for this dataset is represented as accuracy. We fine-tune the I3D networks for a certain number of epochs, and use the model that minimises the cross-entropy loss on the validation set, where the validation set is given by 20% of the training data.

Results on the HMDB51 dataset can be seen in Table 3. It is clear that the method is not competitive with state-of-the-art approaches [9] and [14]. We attribute this to computational resource limitations, as we were not able to train for temporal (and spatial) resolutions as large as those approaches used (i.e. 64 frames). This makes a large difference since this is a 60% increase in temporal resolution that the networks can learn from, which is vital for complex datasets such as HMDB51 (and less so for a dataset such as KTH). The authors of these approaches had more than 60 GPUs to train their networks, whereas we had only 1 available. Time available for the research was also a prohibitive factor. Further, these state-of-the-art approaches employ the more accurate TV-L1 optical flow algorithm to compute the optical flow videos, whereas we instead opt for Farneback’s optical flow algorithm. We do this since the TV-L1 algorithm is very computationally intensive and prohibits such systems’ real-time capabilities (even though the resulting flow estimates are more accurate). However, it should be noted that the goal of this work is not necessarily to achieve state-of-the-art performance on these datasets, but rather to thoroughly investigate using RGB (spatial) and flow (temporal) information to model actions in video, and introduce potential improvements to the baseline, such as optical flow scaling and using another class of model on the extracted features of the networks (in our case, a linear SVM). These improvements may prove useful in competing with state-of-the-art when these networks can be trained on larger batches of crops with higher temporal and spatial resolutions. This, however, is left as future work.

Method Accuracy
Action Bank [11] 26.9
iDT + FV [17] 57.2
MPEG Flow [25] 46.7
iDT + Stack Fisher vectors [26] 66.7
Two-Stream CNN [13] 59.4
Two-Stream CNN + Fusion [27] 65.4
Long-Term Temporal Convolution [16] 64.8
I3D [9] 80.9
PoTion + I3D [14] 80.9
Ours - baseline 50.7
Ours - SVM 62.8
Table 3: Results on the HMDB51 dataset.

We ran an additional experiment to investigate the effect of not applying PCA to the features before normalising and feeding them into the SVM. The difference in results without performing PCA is negligible. With PCA, the dimension is reduced from 20400 to 3500. This suggests that there is a low-dimensional basis governing these local CNN features, as we can reduce the number of dimensions by over 80% and see no difference in performance. Also, we can conclude that, for the HMDB51 dataset, applying PCA is a better idea than not applying it, as we are left with fewer dimensions to deal with, and there will be a resultant reduction in training and testing time (with no effect on performance) for the linear SVM.

Similarly to the KTH dataset, we investigate the effect of not performing the ‘crop filling’ and flow scaling procedures. This resulted in a baseline accuracy of 48.4%, which is lower than the 50.6% baseline accuracy obtained when the two proposed methods are employed. The accuracy of the SVM-based approach also drops from 64.1% to 59.9%. This reinforces the suggestion that performing these two steps aids performance, and the benefits of doing so are clearer on the HMDB51 dataset than on the KTH dataset. We posit that the scaling procedure results in flow values that are more representative of what the flow values should be at the resolution of the input. The ‘filling’ procedure, we posit, represents a more natural progression of the video, and by extension, the action, than simply repeatedly appending the last frame would. These results are tabulated in Table 4.

Configuration                              Baseline (%)   SVM (%)
Without flow scaling and crop filling      48.4           59.9
With flow scaling and crop filling         50.6           64.1
Table 4: Results of the proposed method with and without the flow scaling and crop filling procedures on the HMDB51 dataset (split 1).

4.6 Results on the UCF101 Dataset

For testing on the UCF101 dataset, we set parameter and . This setting is the maximum possible for our time and computational resource limitations. It should be noted, however, that previous works used significantly larger spatial and temporal resolutions for this dataset. Large temporal and spatial resolutions are very important for relatively complex datasets such as UCF101. It should be noted that our quoted accuracy figures for the UCF101 dataset are for split 1 only.

It is clear in Table 5 that the proposed SVM on the full video context, as opposed to the baseline softmax classifier on a portion of the context, significantly improves performance. This is the same trend as seen in the other two studied benchmark datasets. Overall performance is lacking compared to the state-of-the-art approaches for the same reasons discussed for the HMDB51 dataset. Computational resource limitations prevented us from obtaining maximum performance from the I3D network, which inherently limits the representational power of the extracted features for RGB and flow. As shown in [16], greater spatial and temporal resolutions significantly improve performance of action recognition systems. However, for our relatively small spatial and temporal resolutions, we empirically show that the SVM approach does indeed prove beneficial.

Method | Accuracy (%)
C3D [28] | 90.4
Two-Stream CNN [13] | 88.0
Two-Stream CNN + Fusion [27] | 93.5
Long-Term Temporal Convolution [16] | 92.7
I3D [9] | 98.0
PoTion + I3D [14] | 98.2
Ours - baseline | 80.5
Ours - SVM | 86.5
Table 5: Results on the UCF101 dataset.
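The difference between the two decision rules compared in Table 5 can be sketched as follows. The baseline aggregates crop-level predictions, whereas the SVM approach consumes a single video-level vector built from all crops. The function names and the per-crop ordering of RGB and flow features are assumptions made for illustration; the resulting vector would then be passed to a linear SVM (for instance scikit-learn's `LinearSVC`), which is omitted here.

```python
from collections import Counter


def majority_vote(crop_predictions):
    """Baseline-style decision: the most frequent crop-level label wins."""
    return Counter(crop_predictions).most_common(1)[0][0]


def global_representation(rgb_feats, flow_feats):
    """Concatenate per-crop RGB and flow features into one video-level vector.

    rgb_feats and flow_feats each hold one feature vector per crop.
    The interleaved ordering used here is an assumption.
    """
    if len(rgb_feats) != len(flow_feats):
        raise ValueError("expected one RGB and one flow vector per crop")
    vector = []
    for rgb, flow in zip(rgb_feats, flow_feats):
        vector.extend(rgb)
        vector.extend(flow)
    return vector


# Toy values for three crops of one video (not real network outputs).
print(majority_vote(["run", "walk", "run"]))
print(global_representation([[0.1, 0.2], [0.3, 0.4]], [[0.5], [0.6]]))
```

The vote only sees one label per crop, while the concatenated vector preserves the features of every crop, so a classifier trained on it can use the full context of the video.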

Table 6 shows an interesting result that does not follow the trend of the HMDB51 and KTH datasets. For UCF101, flow scaling and crop filling do not aid performance: accuracy stays roughly the same for both the baseline (80.5% without, 79.8% with) and SVM (86.4% without, 86.3% with) approaches. A likely cause is that, firstly, the case that requires crop filling is rare in a dataset with relatively long videos such as UCF101. Additionally, flow scaling typically helps most when distinguishing between actions that are very temporally similar. UCF101 has fewer of these than HMDB51 and KTH; most of the actions in the dataset are temporally dissimilar.

Configuration | Baseline (%) | SVM (%)
Without flow scaling and crop filling | 80.5 | 86.4
With flow scaling and crop filling | 79.8 | 86.3
Table 6: Results of the proposed method with and without the flow scaling and crop filling procedures on the UCF101 dataset (split 1).
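To make the contrast between the two ablations concrete, the short script below computes the change in accuracy, in percentage points, from enabling both preprocessing steps. The figures are copied from Tables 4 and 6; the dictionary layout and the function name are illustrative.

```python
# Accuracies (%) from Tables 4 and 6 (split 1), as (without, with) preprocessing.
results = {
    "HMDB51": {"baseline": (48.4, 50.6), "svm": (59.9, 64.1)},
    "UCF101": {"baseline": (80.5, 79.8), "svm": (86.4, 86.3)},
}


def preprocessing_gain(table):
    """Return the accuracy change (percentage points) from adding both steps."""
    return {
        dataset: {
            name: round(with_steps - without_steps, 1)
            for name, (without_steps, with_steps) in rows.items()
        }
        for dataset, rows in table.items()
    }


gains = preprocessing_gain(results)
for dataset, row in gains.items():
    print(dataset, row)
```

On HMDB51 the steps add 2.2 points for the baseline and 4.2 points for the SVM, whereas on UCF101 the changes are -0.7 and -0.1 points.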

5 Conclusion and Future Work

Given the limited temporal resolution we were able to employ, the method performs admirably. Unsurprisingly, it competes with the state-of-the-art on the KTH dataset, achieving a highest average accuracy of %. The simple backgrounds and the lack of clutter and occlusion in the videos mean performance on the dataset is high. We outperform all previous methods except that of [11]. However, we significantly outperform [11] on the realistic, complex HMDB51 dataset, suggesting our method generalises to these settings more effectively. Due to the limited temporal (and spatial) resolution, we are not able to compete with the state-of-the-art on the HMDB51 and UCF101 datasets, achieving a highest average accuracy of %, and %, respectively. However, we still compete with other deep learning-based methods, and outperform many engineered-feature approaches.

More importantly, we are able to effectively demonstrate the two main goals of the research. Firstly, using an SVM trained on context from all crops of the video is significantly more effective than taking a majority vote of crop-level network predictions. The benefit of the SVM is more apparent on the HMDB51 and UCF101 datasets since, for the KTH dataset, much of the action’s context can be determined from a single crop (whereas this does not apply to the more complex HMDB51 and UCF101 datasets). Secondly, performing the two simple preprocessing steps of crop filling and optical flow scaling results in higher recognition performance on the KTH and HMDB51 datasets, since these datasets contain more temporally similar actions than UCF101.

The most important extension to this work is to increase the temporal (and spatial) resolution to see how this affects performance, and at what point performance plateaus, that is, when an increase in resolution no longer aids performance. Furthermore, different, more accurate optical flow algorithms, such as TV-L1 [29] or Brox [30], could be used in place of Farneback’s method. However, these methods are generally much slower to compute, and thus to make them feasible for a real-time human action recognition system one would likely need to use their GPU implementations. Another possible extension is to train the networks on more crops of the videos (by sampling more crops, or by not limiting the temporal resolution and then sampling from the full video).
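One simple way to operationalise the plateau question is sketched below. The helper returns the first resolution after which the next increase yields less than a chosen gain. The function name, the `min_gain` threshold, and the numbers in the usage example are hypothetical placeholders, not measurements from this work.

```python
def plateau_point(resolutions, accuracies, min_gain=0.5):
    """Return the first resolution after which the next step adds < min_gain points.

    resolutions must be sorted in increasing order, with one accuracy (%) each.
    Returns None if every step still improves accuracy by at least min_gain.
    """
    if len(resolutions) != len(accuracies):
        raise ValueError("expected one accuracy per resolution")
    for i in range(len(accuracies) - 1):
        if accuracies[i + 1] - accuracies[i] < min_gain:
            return resolutions[i]
    return None


# Placeholder values for illustration only (not experimental results).
toy_resolutions = [16, 32, 64, 100]
toy_accuracies = [60.0, 66.0, 68.0, 68.2]
print(plateau_point(toy_resolutions, toy_accuracies))
```

The threshold expresses how small an improvement must be before further increases in resolution are no longer considered worthwhile given their computational cost.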


References

  1. Shian-Ru Ke, Hoang Thuc, Yong-Jin Lee, Jenq-Neng Hwang, Jang-Hee Yoo, and Kyoung-Ho Choi. A review on video-based human activity recognition. Computers, 2(2):88–131, Jun 2013.
  2. C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, pages 32–36. IEEE, 2004.
  3. Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, and Cordelia Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, pages 124.1–124.11. BMVA Press, 2009.
  4. Adriana Kovashka and Kristen Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR, pages 2046–2053. IEEE, Jun 2010.
  5. Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, Sep 2005.
  6. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893. IEEE, 2005.
  7. Rizwan Chaudhry, Avinash Ravichandran, Gregory Hager, and Rene Vidal. Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In CVPR, pages 1932–1939. IEEE, Jun 2009.
  8. Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, pages 428–441. Springer Berlin Heidelberg, 2006.
  9. Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 4724–4733. IEEE, Jul 2017.
  10. Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In CVPR, pages 1–8. IEEE, Jun 2008.
  11. S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, pages 1234–1241. IEEE, Jun 2012.
  12. Li-Jia Li, Hao Su, Yongwhan Lim, and Li Fei-Fei. Object bank: An object-level image representation for high-level visual recognition. International Journal of Computer Vision, 107(1):20–39, Sep 2013.
  13. Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
  14. Vasileios Choutas, Philippe Weinzaepfel, Jerome Revaud, and Cordelia Schmid. PoTion: Pose MoTion representation for action recognition. In CVPR, pages 7024–7033. IEEE, Jun 2018.
  15. Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36. Springer International Publishing, 2016.
  16. Gul Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1510–1517, Jun 2018.
  17. Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, pages 3551–3558. IEEE, 2013.
  18. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, pages 2556–2563. IEEE, Nov 2011.
  19. Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012.
  20. Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In CVPR, pages 3169–3176. IEEE, Jun 2011.
  21. Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. Action mach: a spatio-temporal maximum average correlation height filter for action recognition. In ICCV, pages 1–8. IEEE, 2008.
  22. Angela Yao, Juergen Gall, and Luc Van Gool. A hough transform-based voting framework for action recognition. In CVPR, pages 2061–2068. IEEE, Jun 2010.
  23. A. Klaeser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, pages 99.1–99.10. BMVA Press, 2008.
  24. Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, pages 3361–3368. IEEE, 2011.
  25. Vadim Kantorov and Ivan Laptev. Efficient feature extraction, encoding, and classification for action recognition. In CVPR, pages 2593–2600. IEEE, Jun 2014.
  26. Xiaojiang Peng, Changqing Zou, Yu Qiao, and Qiang Peng. Action recognition with stacked fisher vectors. In ECCV, pages 581–595. Springer International Publishing, 2014.
  27. Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941. IEEE, Jun 2016.
  28. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497. IEEE, Dec 2015.
  29. C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-l1 optical flow. In Lecture Notes in Computer Science, pages 214–223. Springer Berlin Heidelberg, 2007.
  30. Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In Lecture Notes in Computer Science, pages 25–36. Springer Berlin Heidelberg, 2004.