Discriminative convolutional Fisher vector network for action recognition

Petar Palasek, Ioannis Patras
School of Electrical Engineering and Computer Science
Queen Mary University of London
London E1 4NS, United Kingdom
p.palasek@qmul.ac.uk, i.patras@qmul.ac.uk

In this work we propose a novel neural network architecture for the problem of human action recognition in videos. The proposed architecture expresses the processing steps of classical Fisher vector approaches, that is, dimensionality reduction by principal component analysis (PCA) projection, Gaussian mixture model (GMM) and Fisher vector descriptor extraction, as network layers. In contrast to other methods where these steps are performed consecutively and the corresponding parameters are learned in an unsupervised manner, having them defined as a single neural network allows us to refine the whole model discriminatively in an end-to-end fashion. Furthermore, we show that the proposed architecture can be used as a replacement for the fully connected layers in popular convolutional networks, achieving a comparable classification performance, or even significantly surpassing the performance of similar architectures, while reducing the total number of trainable parameters by a factor of 5. We show that our method achieves significant improvements in comparison to the classical chain.

1 Introduction

With the amount of video data available online growing at a high rate, the need for automatic video analysis is becoming more and more pressing. Being able to automatically recognize the content of a given video or, more narrowly, the actions that are depicted in it is not only useful for organizing huge video datasets, but could also help improve systems for video surveillance, human-computer interaction and assistance.

Due to their strong performance on the problem of image recognition [17], Fisher vector (FV) descriptors [18] have also been applied to the problem of action recognition [27, 29], where they achieved state-of-the-art results and have remained one of the dominant approaches to this day.

The main idea behind FVs is to encode a set of local descriptors extracted from a sample (i.e. an image or a video) as a vector of deviations from the parameters of a generative model (usually a Gaussian mixture model) fitted to the descriptors extracted on the training set. Although FVs are good global descriptors on their own, there are shortcomings in the way they are extracted. Namely, the GMM used for encoding is learnt in an unsupervised way without receiving any additional information about the task at hand. This results in descriptors that are not tailored for a discriminative task, as the GMM also learns to model the intra-class variations of the training data, something that is not relevant for classification problems.

Recently, methods that are based on multi-layered convolutional [11] neural networks (CNNs) surpassed the Fisher vector descriptors' performance in recognition problems and reached the new state of the art in image classification. The seminal work described in [9] has shown the power of models that can be trained end to end in a supervised way on large amounts of labeled data using backpropagation. This is one of the works that helped neural networks regain popularity and started the deep learning revolution. Many works on deep networks have been published since then, not only for the task of image recognition [1, 21], but for the problem of action recognition [4, 6, 7, 20, 26] as well.

However, the state of the art deep learning architectures usually contain a large number of layers with a huge number of trainable parameters making them difficult to optimize without suitable hardware infrastructure (i.e. big clusters or machines with multiple GPUs have become a necessity nowadays).

Recently several works that try to combine the power of Fisher vector representations and neural network approaches have been published. This includes deep Fisher networks from [19] used for large-scale image classification, stacked Fisher vectors [14] used for action recognition, deep Fisher kernels [24] and the hybrid classification architecture [16] also applied on image classification problems. Even though the works listed above explore adding supervision at different steps of the standard Fisher vector descriptor pipeline, none of them have tried to refine the local feature extraction, Fisher vector encoding and the classification steps jointly.

In this paper we describe a method that expresses all the steps of FV-based action recognition as layers in a neural network. The layers are initialized by unsupervised training in a layer-by-layer manner, and are subsequently refined by end-to-end training. The proposed architecture results in spatio-temporal descriptors at intermediate levels of the architecture, calculated by local aggregation in a spatio-temporal structure of frame-level descriptors. More specifically, the main contributions of this work are the following:

  • We describe a novel neural network architecture for action recognition which includes two new types of layers: the Gaussian mixture model layer and the Fisher vector descriptor layer. Combining these layers with other standard ones into a single deep neural network gives us a way of jointly finetuning the parameters of the whole architecture with respect to a chosen discriminative cost, using the standard backpropagation algorithm. We show that adding supervision at every stage of the network improves the discriminative power of the extracted Fisher vector descriptor compared to the standard version of the descriptor, which is extracted in an unsupervised manner.

  • Analogous to convolutional neural networks, where the same operation is applied at different locations of the input tensor, our network offers a natural way of extracting the Fisher vector descriptors densely from a given input video, both in space and in time. This also allows us to easily extract the descriptor only from selected parts of the video, providing a straightforward way of implementing other architectures, such as spatial pyramids.

  • We show that the proposed architecture can be used as a replacement for the fully connected layers in popular convolutional networks such as the VGG-16 network, achieving a comparable classification performance while reducing the total number of trainable parameters by a factor of 5.

2 Related work

Fisher vector descriptors were first introduced for the problem of image classification in [15]. Essentially, the idea is to represent an image using a global descriptor which describes how the parameters of a generative model should change in order to better model the distribution of local features in images, based on a set of local features extracted from a given image. The theory and practice of using Fisher vectors for the task of image classification is described in [18].

The first work that applied Fisher vector descriptors for the problem of action recognition in videos used HOG, HOF and MBH features [30] extracted along dense trajectories as local features [27]. The trajectories are extracted by defining a dense grid of points which are then tracked using optical flow that was estimated offline, this way including motion information in the pipeline. By encoding the extracted trajectory features with the Fisher vector descriptor, this approach and the improved version of [29] achieved state of the art results for the action recognition problem.

Following the growing popularity of deep neural networks, several works were published on using neural network based approaches for the problem of action recognition. In [6] a 3D extension of the standard 2D convolutional neural networks (CNNs) was introduced, where information from both the space and the time dimension is included by performing 3D convolution. A more recent work described in [26] also applied 3D CNNs, but with a much deeper network architecture. The work of [7] examines different ways of extending CNNs into the time domain by fusing features extracted from stacks of frames in order to include motion information. Transfer learning is also applied in order to prevent overfitting to small video datasets. Motion information was included in the work of [20] in an explicit way by providing dense optical flow at the input of the network. More specifically, two streams of a network are employed: one performing classification based on static video frames and the other based on the optical flow. Different ways of fusing the spatial and the temporal streams of such networks are studied in [4]. The work of [31] considers different ways of aggregating strong CNN image features over long periods of time, including feature pooling and using recurrent neural networks. In [13] CNN features are extracted from random subvolumes of a video and encoded using the FV descriptor in order to arrive at a representation suitable for video classification.

The work of [19] combines ideas from the area of neural networks with the Fisher vector descriptor by forming a deep Fisher network in which two Fisher vector layers are stacked. The network is discriminatively trained for the problem of image classification; however, the features at the input layer are fixed, manually-designed features. Stacked Fisher vectors are also applied for action recognition in [14], where the first layer encodes the improved dense trajectories from [29]. After discriminative dimensionality reduction a second Fisher vector encoding is done. The combination of the FV and the stacked FV was shown to be beneficial. The work in [16] treats the Fisher vector descriptor as an unsupervised layer followed by a number of fully connected layers that can be trained with backpropagation. End-to-end training of a Fisher kernel SVM viewed as a deep network is done in [24] for the problem of image classification. This method uses manually-designed features at the input layer and requires retraining of the SVM on the whole training set at each step of the training.

3 Proposed architecture

Figure 1: An illustration of the proposed architecture. The input to the network is a stack of static frames (marked in green in the left of the figure) from a video which are passed through the feature extraction layers resulting in feature maps of size where is the number of channels. The feature maps are then pooled temporally and spatially which results in new feature maps of size whose dimensionality is reduced in the dimensionality reduction layer to . The Fisher vector descriptor layer passes these feature maps to the GMM layer which gives a tensor of posteriors at its output. Using the posteriors and the input feature maps, the FV layer outputs the FV descriptor of the frames of the video. Note that we can use different crops of the posterior tensor in order to calculate the FV descriptor of only a part of the network input (e.g. using the subtensor marked in blue in the posterior tensor would correspond to calculating the FV descriptor for only the bottom right corner of the input video). Finally the FV descriptors are normalized and fed into the classification layer where their scores are averaged and used to predict the label for the given input. In order to predict a label for the whole video of length , we slide the network along the time axis with a stride of frames, updating the FV descriptor/s on the way. The classification step is the same as when predicting the label of a stack of frames.

In this section we describe the proposed architecture; an illustration is given in Figure 1. We start by describing each of the used layers in detail and give the final overview of the whole architecture in Subsection 3.7.

The architecture can be divided into six parts: the local feature extraction layers, the spatio-temporal pooling layer, the dimensionality reduction layer, the Gaussian mixture layer, the Fisher vector descriptor layer and the classification layer. Their descriptions follow in Subsections 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6.

3.1 Local feature extraction layers

For the first step of local feature extraction in our architecture, we tried two different networks. First, we fed static video frames into a small network consisting of a single convolutional layer followed by a pooling layer. This part of the network was pretrained on static video frames using a convolutional restricted Boltzmann machine [12] as described in [13], with the difference that we used local contrast normalization as a preprocessing step. Second, to show that the feature extraction layers can be replaced by any larger and more complex network, we also used the VGG-16 [21] network pretrained on the ImageNet dataset. Given $N$ consecutive images, the output of the feature extraction layers is $N$ feature maps of size $H \times W \times C$, where $H$ denotes the height, $W$ the width and $C$ the number of channels of a feature map.

3.2 Spatio-temporal pooling layer

Figure 2: An illustration of how the spatio-temporal layer performs pooling on the extracted feature maps. Subtensors of size are pooled into tensors of size and resized into vectors of dimensionality . This is repeated for all subtensors in the extracted feature maps, sliding horizontally and vertically with a stride of .

In order to include motion information from the input video, we want to combine feature maps extracted from multiple static frames into a more powerful representation. To do so, we follow the work of [27] where a spatio-temporal volume of features extracted from frames is divided into subvolumes of size and then pooled temporally and spatially using mean pooling. The resulting representation is then resized into a vector of dimensionality , where is the dimensionality of the local features extracted from the previous layer. This vector is then used as input to the following layer. In the case that the tensor at the input of this layer has bigger spatial dimensions than , the described pooling procedure is repeated for each subtensor, moving through the tensor with a spatial stride as shown in Figure 2. Later we will describe how we deal with longer time periods. Given feature maps of size at input, the output of this layer is a tensor of size , where and .
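The pooling step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the window size, stride, number of pooling cells and tensor sizes below are hypothetical values chosen only so that the example runs, and the function names are made up for illustration.

```python
import numpy as np

def pool_subvolume(vol, cells=(3, 2, 2)):
    """Mean-pool a (T, h, w, C) subvolume into a grid of cells and flatten it.

    cells = (ct, cy, cx) is a hypothetical number of cells along time,
    height and width; T, h and w must be divisible by them.
    """
    T, h, w, C = vol.shape
    ct, cy, cx = cells
    v = vol.reshape(ct, T // ct, cy, h // cy, cx, w // cx, C)
    # Average inside every cell, then flatten to a vector of ct*cy*cx*C values.
    return v.mean(axis=(1, 3, 5)).reshape(-1)

def spatio_temporal_pool(maps, win=4, stride=2, cells=(3, 2, 2)):
    """Slide a win x win spatial window over (T, H, W, C) feature maps."""
    T, H, W, C = maps.shape
    rows = range(0, H - win + 1, stride)
    cols = range(0, W - win + 1, stride)
    out = [[pool_subvolume(maps[:, r:r + win, c:c + win], cells) for c in cols]
           for r in rows]
    return np.array(out)

rng = np.random.default_rng(0)
maps = rng.normal(size=(6, 8, 10, 5))   # 6 frames of 8 x 10 maps, 5 channels
pooled = spatio_temporal_pool(maps)
print(pooled.shape)                     # (3, 4, 60)
```

Each spatial position of the output holds one flattened, mean-pooled subvolume, so the result can be passed to the following layer as an ordinary feature map with more channels.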

3.3 Dimensionality reduction layer

In the standard Fisher vector pipeline the locally extracted features are decorrelated and their dimensionality is reduced by performing PCA. Assuming that the mean of the data $\bar{\mathbf{x}}$ and the principal axes $\mathbf{U}$ were found offline, the mapping from the original data $\mathbf{x}$ to a lower-dimensional space can be written as:

$$\mathbf{y} = \mathbf{U}(\mathbf{x} - \bar{\mathbf{x}}) \tag{1}$$

The dimensionality of matrix $\mathbf{U}$ is $M \times D$, where $M$ is the number of components and $D$ is the dimensionality of the original data. Note that we do not put any constraints on the matrix $\mathbf{U}$, so after backpropagating through the layer and updating its parameters the projection applied on the input data is not guaranteed to be orthogonal. Given a tensor of size $H' \times W' \times D$ at input, the output of this layer is a tensor of size $H' \times W' \times M$.
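As a rough illustration of this layer, the sketch below applies a mean subtraction followed by a linear projection to every vector of a feature map. The sizes are hypothetical and the initialization via SVD is only one common way of obtaining principal axes; the names `pca_layer`, `mean` and `axes` are assumptions made for this example.

```python
import numpy as np

def pca_layer(x, mean, axes):
    """Project descriptors onto learned axes: y = axes @ (x - mean).

    x:    (..., D) tensor, the last axis holds the descriptor channels
    mean: (D,) mean vector, initialized from offline PCA
    axes: (M, D) projection matrix, initialized with the principal axes
    """
    return (x - mean) @ axes.T

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 8))          # hypothetical training descriptors
mean = data.mean(axis=0)
# Rows of vt are the principal axes of the centered data.
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)
axes = vt[:3]                             # keep M = 3 components

feature_map = rng.normal(size=(4, 5, 8))  # hypothetical H' x W' x D input
reduced = pca_layer(feature_map, mean, axes)
print(reduced.shape)                      # (4, 5, 3)
```

Because `mean` and `axes` are plain arrays, a framework with automatic differentiation can treat both as trainable parameters, which is what removes the orthogonality guarantee after finetuning.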

3.4 Gaussian mixture model layer

A Gaussian mixture model is defined as a weighted sum of $K$ components [18]:

$$u_\lambda(\mathbf{x}) = \sum_{k=1}^{K} w_k\, u_k(\mathbf{x}) \tag{2}$$

where $w_k$ is a component weight and $u_k$ is a probability density function of the Gaussian distribution:

$$u_k(\mathbf{x}) = \frac{1}{(2\pi)^{M/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right) \tag{3}$$

Every GMM can be described by the parameter set $\lambda = \{w_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$, where $w_k$ is the $k$-th component weight, $\boldsymbol{\mu}_k$ is its mean vector and $\boldsymbol{\Sigma}_k$ its covariance matrix. The mixture coefficients are constrained to be positive and to sum to one, that is $w_k > 0$ and $\sum_{k=1}^{K} w_k = 1$, which can be easily enforced by using internal weights $\alpha_k$ and defining:

$$w_k = \frac{\exp(\alpha_k)}{\sum_{j=1}^{K} \exp(\alpha_j)} \tag{4}$$

as it was done in [8]. For each sample $\mathbf{x}$, posteriors describing the responsibility of each component for generating the sample can be found as:

$$\gamma_k(\mathbf{x}) = \frac{w_k\, u_k(\mathbf{x})}{\sum_{j=1}^{K} w_j\, u_j(\mathbf{x})} \tag{5}$$

When viewing the GMM as a neural network layer, we treat the sample $\mathbf{x}$ as the input to the layer and the posteriors $\gamma_k(\mathbf{x})$ as its output. In the case when $\mathbf{x}$ is an $M$-dimensional vector, we can see that $\gamma_k(\mathbf{x})$ can be calculated by using subtraction, addition, multiplication, division and exponentiation, which are all standard operations found in neural networks.

In a more general case, $\mathbf{x}$ can be treated as a 3-dimensional tensor that can correspond to a feature map consisting of $M$ channels of height $H$ and width $W$. This tensor could, for example, be an image with 3 RGB channels, or any feature map outputted by a preceding layer. Following the same idea of convolution in convolutional layers, we can calculate both $u_k(\mathbf{x})$ and $\gamma_k(\mathbf{x})$ by performing the same operations we did in the case when $\mathbf{x}$ was a vector, repeating them for each of the $M$-dimensional vectors in the tensor. This procedure will also result in a 3-dimensional tensor, of size $H \times W \times K$. This is easily extended to the case of 4-dimensional tensors that usually appear in deep learning frameworks, with the first dimension corresponding to the number of samples in a minibatch.
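A minimal NumPy sketch of such a layer is given below, assuming diagonal covariance matrices and softmax-parameterized mixture weights. Working in the log domain is a numerical-stability choice made for this example and is not taken from the text; the function name and all sizes are hypothetical.

```python
import numpy as np

def gmm_posteriors(x, alpha, mu, var):
    """Posteriors of a diagonal-covariance GMM for every vector in x.

    x:     (..., M) tensor, the last axis holds the descriptors
    alpha: (K,) internal weights; the mixture weights are softmax(alpha)
    mu:    (K, M) component means
    var:   (K, M) diagonals of the covariance matrices
    Returns a (..., K) tensor of posteriors.
    """
    log_w = alpha - np.logaddexp.reduce(alpha)            # log of softmax
    diff = x[..., None, :] - mu                           # (..., K, M)
    log_pdf = -0.5 * (np.sum(diff ** 2 / var, axis=-1)
                      + np.sum(np.log(2.0 * np.pi * var), axis=-1))
    log_joint = log_w + log_pdf                           # (..., K)
    log_norm = np.logaddexp.reduce(log_joint, axis=-1, keepdims=True)
    return np.exp(log_joint - log_norm)

rng = np.random.default_rng(0)
K, M = 6, 3
alpha = np.zeros(K)                    # equal mixture weights
mu = rng.normal(size=(K, M))
var = np.ones((K, M))
x = rng.normal(size=(2, 4, 5, M))      # minibatch of two 4 x 5 feature maps
post = gmm_posteriors(x, alpha, mu, var)
print(post.shape)                      # (2, 4, 5, 6)
```

The same function handles a single vector, a feature map or a minibatch, since the computation is applied independently along the last axis.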

3.5 Fisher vector descriptor layer

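The Fisher vector descriptor is a global descriptor used to represent data by describing how the parameters of a generative model fitted on a distribution of local features should change in order to better model the local features extracted from the given data sample. An introduction to Fisher vectors and the underlying theory can be found in [18]. In this subsection we show how the Fisher vector descriptors are calculated. If we define the statistics of the GMM as:

$$S_k^0 = \sum_{t=1}^{T} \gamma_k(\mathbf{x}_t) \tag{6}$$

$$\mathbf{S}_k^1 = \sum_{t=1}^{T} \gamma_k(\mathbf{x}_t)\, \mathbf{x}_t \tag{7}$$

$$\mathbf{S}_k^2 = \sum_{t=1}^{T} \gamma_k(\mathbf{x}_t)\, \mathbf{x}_t^2 \tag{8}$$

where $\gamma_k$ is the $k$-th posterior from Equation 5, the parts of the Fisher vector corresponding to each of the parameters of the GMM can be calculated as follows:

$$\mathcal{G}_{\alpha_k} = \frac{S_k^0 - T w_k}{\sqrt{w_k}} \tag{9}$$

where $T$ represents the number of local descriptors,

$$\mathcal{G}_{\mu_k} = \frac{\mathbf{S}_k^1 - \boldsymbol{\mu}_k S_k^0}{\sqrt{w_k}\, \boldsymbol{\sigma}_k} \tag{10}$$

$$\mathcal{G}_{\sigma_k} = \frac{\mathbf{S}_k^2 - 2 \boldsymbol{\mu}_k \mathbf{S}_k^1 + (\boldsymbol{\mu}_k^2 - \boldsymbol{\sigma}_k^2) S_k^0}{\sqrt{2 w_k}\, \boldsymbol{\sigma}_k^2} \tag{11}$$

Here $w_k$ and $\boldsymbol{\mu}_k$ are the weight and the mean of the $k$-th GMM component, $\boldsymbol{\sigma}_k^2$ denotes the diagonal of its covariance matrix, and squares, products and divisions of vectors are taken elementwise. The resulting vectors are concatenated into a large vector:

$$\mathcal{G} = \left(\mathcal{G}_{\alpha_1}, \ldots, \mathcal{G}_{\alpha_K}, \mathcal{G}_{\mu_1}^\top, \ldots, \mathcal{G}_{\mu_K}^\top, \mathcal{G}_{\sigma_1}^\top, \ldots, \mathcal{G}_{\sigma_K}^\top\right)^\top \tag{12}$$

which is the unnormalized version of the Fisher vector. The FV is normalized by applying two kinds of normalization; power normalization:

$$[\mathcal{G}]_i \leftarrow \mathrm{sign}([\mathcal{G}]_i) \sqrt{|[\mathcal{G}]_i|} \tag{13}$$

and L2 normalization:

$$\mathcal{G} \leftarrow \frac{\mathcal{G}}{\sqrt{\mathcal{G}^\top \mathcal{G}}} \tag{14}$$

We can view the Fisher vector descriptor encoding as a network layer which contains an internal GMM layer and receives data, in the simplest case an $M$-dimensional vector, as input. The input data is passed to the internal GMM layer which gives the posteriors at its output. The input data and the posteriors are then used to calculate the statistics from Equations 6, 7 and 8. We can see that the operations required in order to calculate both the unnormalized (Equation 12) and normalized (Equation 14) versions of the Fisher vector descriptor are all standard operations typically found in neural networks, so the Fisher vector descriptor encoding can be easily expressed as a network layer. The dimensionality of the calculated FV is $K(2M + 1)$.

Following the same reasoning as with the GMM layer when dealing with tensor data at input, the Fisher vector layer can also receive a tensor of size $H \times W \times M$ as its input. This tensor can be seen as a set of $HW$ $M$-dimensional vectors which we can then encode using the FV descriptor by the procedure described above. Note that the FV encoding is performed by aggregating, across the first two modes of the tensor, the differential representations that are extracted along the fibers of the input tensor in the third mode. It is therefore trivial to extract the FV descriptor only from a selected subtensor of the input tensor. This allows us to use multiple crops of the input video during both training and testing in order to prevent overfitting and help improve generalization. This also allows us to create other architectures, such as the spatial pyramid [10].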
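The encoding and the crop mechanism can be sketched in NumPy as follows. The formulas follow the standard Fisher vector computation of [18] with diagonal covariances, which is assumed here to be what the layer implements; the function names, sizes and the randomly generated stand-in posteriors are hypothetical.

```python
import numpy as np

def fisher_vector(x, post, w, mu, var, normalize=True):
    """FV of T descriptors. x: (T, M), post: (T, K), w: (K,), mu, var: (K, M)."""
    T = x.shape[0]
    s0 = post.sum(axis=0)                  # zeroth-order statistics, (K,)
    s1 = post.T @ x                        # first-order statistics, (K, M)
    s2 = post.T @ (x ** 2)                 # second-order statistics, (K, M)
    sw = np.sqrt(w)[:, None]
    g_alpha = (s0 - T * w) / np.sqrt(w)
    g_mu = (s1 - mu * s0[:, None]) / (sw * np.sqrt(var))
    g_sigma = (s2 - 2 * mu * s1 + (mu ** 2 - var) * s0[:, None]) / (np.sqrt(2) * sw * var)
    fv = np.concatenate([g_alpha, g_mu.ravel(), g_sigma.ravel()])
    if normalize:
        fv = np.sign(fv) * np.sqrt(np.abs(fv))   # power normalization
        fv = fv / np.linalg.norm(fv)             # L2 normalization
    return fv

def fisher_vector_of_crop(feats, post, rows, cols, w, mu, var):
    """FV of a spatial crop; rows and cols are slices into (H, W, .) tensors."""
    M, K = feats.shape[-1], post.shape[-1]
    x = feats[rows, cols].reshape(-1, M)
    p = post[rows, cols].reshape(-1, K)
    return fisher_vector(x, p, w, mu, var)

rng = np.random.default_rng(0)
H, W, M, K = 6, 8, 4, 3
feats = rng.normal(size=(H, W, M))
w = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, M))
var = np.ones((K, M))
# Stand-in posteriors whose rows sum to one; a GMM layer would supply these.
logits = rng.normal(size=(H, W, K))
post = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

full = fisher_vector_of_crop(feats, post, slice(None), slice(None), w, mu, var)
corner = fisher_vector_of_crop(feats, post, slice(3, 6), slice(4, 8), w, mu, var)
print(full.shape, corner.shape)            # (27,) (27,)
```

Selecting a crop only changes which rows enter the sums, so the descriptor length stays at K(2M + 1) regardless of the size of the selected region.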

3.6 Classification layer

Given a Fisher vector descriptor of a video or a part of a video we design the final layer of our network to output a prediction of the input video’s class. To this end, we train binary one-vs-all support vector machines as a classifier, where is the number of classes. The cost that we use for optimizing the whole network is the squared SVM hinge loss defined as:


where are the SVM weights, with being a regularization constant, is the label encoded as a vector where all elements are -1 except for one that is 1, marking which class the input belongs to, and with being the SVM score, . We use to denote the -th element of the vector .
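For illustration, one common form of a regularized squared hinge loss for one-vs-all linear SVMs is sketched below. The exact placement of the regularization constant in the cost used here is not recoverable from the text, so the form `0.5 * ||W||^2 + reg * sum(max(0, 1 - y * s)^2)` is an assumption, as are the function and parameter names.

```python
import numpy as np

def squared_hinge_loss(fv, label, W, b, reg=1.0):
    """Squared hinge loss of one sample for one-vs-all linear SVMs.

    fv:    (F,) normalized Fisher vector
    label: index of the correct class
    W, b:  (n_classes, F) weights and (n_classes,) biases
    reg:   assumed weighting of the data term against the weight penalty
    """
    scores = W @ fv + b
    y = -np.ones_like(scores)              # all elements are -1 ...
    y[label] = 1.0                         # ... except for the correct class
    margins = np.maximum(0.0, 1.0 - y * scores)
    loss = 0.5 * np.sum(W ** 2) + reg * np.sum(margins ** 2)
    return loss, scores

fv = np.array([1.0, 0.0, 0.0, 0.0])
W = np.zeros((3, 4))
W[:, 0] = [2.0, -2.0, -2.0]                # class 0 wins with margin 2
b = np.zeros(3)
loss, scores = squared_hinge_loss(fv, 0, W, b)
print(loss, int(np.argmax(scores)))        # 6.0 0
```

In this toy case every margin is satisfied, so the whole loss comes from the weight penalty; the predicted label is the class with the highest score.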

3.7 Fisher vector network for action recognition

The input to our network is a video represented as a stack of consecutive static frames. In order to explain the pipeline of our architecture we will first limit the length of the input video to , with . Each of the frames is passed through the local feature extraction layers which output feature maps of size , where is the number of channels. These feature maps are then sent through the spatio-temporal pooling layer where they are pooled temporally and spatially as described in Subsection 3.2, resulting in a representation of size . After passing this through the dimensionality reduction layer described in Subsection 3.3 the new representation is of a lower dimensionality, . This is then passed into the Fisher vector descriptor layer described in Subsection 3.5 which internally uses the same input in the Gaussian mixture model layer from Subsection 3.4 to get a tensor of posteriors of size . We can treat the input tensor as a set of descriptors with dimensions which also have corresponding posteriors for each of the components of the GMM layer. These are then used to calculate the needed statistics and get the unnormalized version of the Fisher vector. The FV can then be normalized and fed into the classification layer which gives the predicted label for the frames of the original input.

For the unconstrained case when the whole video of frames is to be classified, it is enough to notice that the unnormalized Fisher vector descriptor of a sequence of frames is equal to the sum of the unnormalized FVs of the first frames and the second frames. Therefore, we can calculate the FV representation of the whole video by sliding our network along the time axis with a temporal stride of frames and summing the unnormalized FVs for each of the time segments. After normalizing the FV we can feed it into the classification layer to get the predicted label for the whole video. When several crops are used, the resulting FVs are fed to the classification layer and the corresponding outputs are averaged.
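The additivity that this sliding scheme relies on can be checked numerically. The sketch below assumes the standard unnormalized FV formulas of [18] with diagonal covariances; the segment lengths, sizes and stand-in posteriors are hypothetical.

```python
import numpy as np

def unnormalized_fv(x, post, w, mu, var):
    """Unnormalized FV. x: (T, M), post: (T, K), w: (K,), mu, var: (K, M)."""
    T = x.shape[0]
    s0, s1, s2 = post.sum(axis=0), post.T @ x, post.T @ (x ** 2)
    sw = np.sqrt(w)[:, None]
    g_alpha = (s0 - T * w) / np.sqrt(w)
    g_mu = (s1 - mu * s0[:, None]) / (sw * np.sqrt(var))
    g_sigma = (s2 - 2 * mu * s1 + (mu ** 2 - var) * s0[:, None]) / (np.sqrt(2) * sw * var)
    return np.concatenate([g_alpha, g_mu.ravel(), g_sigma.ravel()])

def normalize_fv(fv):
    """Power normalization followed by L2 normalization."""
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / np.linalg.norm(fv)

rng = np.random.default_rng(0)
M, K = 4, 3
w = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, M))
var = np.ones((K, M))
x = rng.normal(size=(40, M))               # descriptors of the whole video
logits = rng.normal(size=(40, K))          # stand-in per-descriptor posteriors
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Encode two consecutive time segments separately and sum the results.
fv_segments = (unnormalized_fv(x[:25], post[:25], w, mu, var)
               + unnormalized_fv(x[25:], post[25:], w, mu, var))
fv_whole = unnormalized_fv(x, post, w, mu, var)
print(np.allclose(fv_segments, fv_whole))  # True
video_descriptor = normalize_fv(fv_segments)
```

The equality holds because every part of the unnormalized FV is linear in the accumulated statistics and in the descriptor count; normalization is not linear, which is why it is applied once, after the summation.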

The proposed neural network could also be viewed as a 3D "filter" with an internal representation which changes as the filter is "convolved" through a given video represented as a spatio-temporal volume. Once the filter passes through the whole video, the classification layer of the network uses the internal video representation, i.e. the Fisher vector descriptor, to give a prediction about the video's label.

3.8 Number of trainable parameters

As our proposed network is a fully convolutional network, the number of its trainable parameters does not depend on the input’s dimensions. Here we will summarize the total number of parameters learnt in each of our architecture’s layers, excluding the local feature extraction layers as these can be replaced by an arbitrary network.

The PCA layer consists of two trainable parameters: a $D$-dimensional mean vector $\bar{\mathbf{x}}$ and an $M \times D$-dimensional matrix of principal axes $\mathbf{U}$, where $M$ denotes the number of components kept after applying the projection and $D$ is the number of channels of the tensor returned from the spatio-temporal pooling layer. The FV layer consists of a GMM layer that contains $K$ trainable sets of parameters $\alpha_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$, where $\alpha_k$ is a scalar, $\boldsymbol{\mu}_k$ is an $M$-dimensional vector and $\boldsymbol{\Sigma}_k$ is a diagonal matrix containing $M$ trainable parameters. The classification layer consists of an $N_c \times F$-dimensional matrix and an $N_c$-dimensional vector, where $N_c$ is the number of classes and $F$ is the dimensionality of the FV layer output, $F = K(2M + 1)$. In total, the top layers of our architecture contain $D(M + 1) + (N_c + 1) K (2M + 1) + N_c$ trainable parameters. As a concrete example, the top layers of the architecture we finetuned on the UCF-101 dataset (Table 3) with , , and contained 5 869 157 trainable parameters.

The dimensionality of the last pooling layer in the VGG-16 [21] architecture for a single input is (512, 7, 7). The pooling layer is fully connected to 4096 units, followed by two more fully connected layers containing 4096 and 1000 units respectively. Including the biases, this corresponds to having 123 642 856 trainable parameters after the convolution and pooling layers. In case of the last fully connected layer having only 101 units (when applied to the UCF-101 dataset), the parameter count is 119 959 653. Similarly, the fully connected layers in the CNN-M-2048 [1, 20] network contain 4096, 2048 and 1000 units each, amounting to 85 941 224 trainable parameters in the top layers. For the case when there are 101 classes, the fully connected layers contain 84 099 173 trainable parameters.

By replacing the fully connected layers at the end of the network with the layers we propose, the number of trainable parameters drops to under 5% of the original number in case of the VGG-16 network, and under 7% in case of using the CNN-M-2048 network.
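The counts in this subsection can be reproduced with a few lines of arithmetic. In the sketch below, the settings `D=6144`, `M=100`, `K=256` and the 512 x 6 x 6 input to the CNN-M-2048 fully connected layers are assumptions: they are not stated in the text above, but they are one configuration that reproduces the reported totals exactly. The figure of 14 714 688 parameters for the VGG-16 convolutional layers is likewise supplied here rather than taken from the text.

```python
def fc_params(sizes):
    """Weights plus biases of a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def fv_top_params(D, M, K, n_classes):
    """PCA layer + GMM layer + linear classifier on a K(2M+1)-dim FV."""
    pca = D + M * D                       # mean vector and projection matrix
    gmm = K * (1 + 2 * M)                 # weight, mean, diagonal covariance
    svm = n_classes * K * (2 * M + 1) + n_classes
    return pca + gmm + svm

vgg_fc_1000 = fc_params([512 * 7 * 7, 4096, 4096, 1000])
vgg_fc_101 = fc_params([512 * 7 * 7, 4096, 4096, 101])
cnn_m_fc_1000 = fc_params([512 * 6 * 6, 4096, 2048, 1000])   # assumed input size
cnn_m_fc_101 = fc_params([512 * 6 * 6, 4096, 2048, 101])
ours = fv_top_params(D=6144, M=100, K=256, n_classes=101)    # assumed settings

vgg_conv = 14_714_688                     # VGG-16 convolutional layers
print(vgg_fc_1000, vgg_fc_101)            # 123642856 119959653
print(cnn_m_fc_1000, cnn_m_fc_101)        # 85941224 84099173
print(ours, ours + vgg_conv)              # 5869157 20583845
print(ours / vgg_fc_101 < 0.05, ours / cnn_m_fc_101 < 0.07)  # True True
```

Under these assumptions the classifier matrix dominates the count of the proposed top layers, since it grows with the product of the number of classes and the FV dimensionality.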

4 Experiments and results

The UCF-101 dataset introduced in [22] consists of 13 320 video clips from 101 different classes, divided into three pairs of train and test sets. To evaluate the performance of a method on this dataset, the average accuracy over the three splits is reported. We first run our experiments only on the first split and only evaluate the most promising approach on all three splits.

We start by implementing the method described in [13] in the Lasagne/Theano framework [2, 25] and treat it as the baseline for our experiments which we perform on the UCF-101 dataset. Training the architecture included training a single layer convolutional restricted Boltzmann machine [12] containing 64 filters of size px, learning a PCA projection (), training a GMM () using the expectation-maximization algorithm and training a multi-class SVM classifier (). All these steps, except for training of the SVM are done in an unsupervised manner. After initializing the parameters of our architecture with the ones we got by the unsupervised training steps mentioned above, we did one epoch of finetuning of the whole network using the AdaGrad adaptive gradient algorithm [3].

The size of the temporal window, i.e. the number of frames needed to calculate a single FV descriptor, is set to in all of our experiments. The size of the spatial window in the spatio-temporal pooling layer is set to correspond to a window of pixels in the input video (, when the single convolutional RBM was used). These are the values used in other similar works, e.g. [28]. We can control how densely we sample the features from the given video by setting the spatial stride parameter and the temporal stride parameter . In order to decrease the time needed to do a single finetuning pass through the training set, we use "pixels" (corresponding to 16 pixels in the input video) and frames in most of our finetuning experiments. One epoch of finetuning using these parameters on the whole training set takes around 10 hours on a Titan X GPU.

As can be seen from the results reported in Table 1, our method using features extracted from a single layer convolutional RBM performs better than the other state of the art methods shown in Table 4 that suffered from overfitting when trained only on the UCF-101 dataset. However, when more complex models are pretrained on datasets that provide larger amounts of data than the UCF-101 dataset, the performance of the simple single layer network is easily surpassed. This is not surprising as the simple network is too shallow to learn more discriminative features needed for action classification. To show how our proposed method works when the simple network is replaced with a more complex one, we choose the VGG-16 network from [21] pretrained on the ImageNet dataset, which was also used in the two-stream network of [4].

The VGG-16 network consists of 13 convolutional layers, followed by 3 fully connected layers. We first use the outputs of the conv4_3 layer as the input to the layers proposed in this work. Similar to the previously described experiment, we randomly extract subvolumes from the conv4_3 layer's feature maps, corresponding to px and spatio-temporal subvolumes in the original video. A subset of subvolumes per video is then used to learn a PCA mapping lowering their dimensionality to . These are then used to train a GMM with components, which we use to extract Fisher vector descriptors and finally train an SVM with . We report the results of this experiment on all three splits of UCF-101 in Table 2. We repeat the same procedure replacing the conv4_3 layer by conv5_3 and report the results in Table 3. As the features extracted from the conv5_3 layer performed better than the ones from layer conv4_3, we pick this layer for our finetuning experiments.

The larger network is more prone to overfitting, so we regularize the finetuning using dropout [23] () on the output of the local feature extraction layers. To maximize the amount of information available in the network during both training and testing we set the spatial stride , but we keep the temporal stride fixed to as in the previous experiments. We finetune the network using stochastic gradient descent with momentum (set to 0.9), showing the network one video at a time. The initial learning rate was set to 0.0001 and it was multiplied by a factor of 0.95 after each epoch. One epoch of finetuning took around 17 hours (approximately 30 FPS). Testing ran at a speed of around 40 FPS.
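The learning rate schedule described above can be written down directly. The sketch below also shows one classical form of the momentum update, which is an assumption, since the text does not state which momentum variant was used; the function names are hypothetical.

```python
def learning_rate(epoch, initial=1e-4, decay=0.95):
    """Learning rate used during the given (0-based) epoch."""
    return initial * decay ** epoch

def sgd_momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One parameter update with classical momentum (assumed variant)."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

# Learning rates at the epochs that appear in Table 3.
for epoch in (0, 5, 11, 33):
    print(epoch, learning_rate(epoch))
```

With a decay factor of 0.95 the step size shrinks slowly, so even after a few dozen epochs it remains within one order of magnitude of its initial value.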

| Method | Split 1 |
|---|---|
| Single CRBM + FV [13] | 55.06% |
| Ours, single CRBM, random sampling | 59.95% |
| Ours, single CRBM, dense sampling | 60.37% |
| Our finetuned network (after 1 epoch) | 61.67% |

Table 1: UCF-101 split 1 classification accuracy, using features from a single convolutional RBM trained only on UCF-101.

| Method | Split 1 | Split 2 | Split 3 | Average |
|---|---|---|---|---|
| Random sampling | 70.84% | 70.30% | 70.81% | 70.65% |

Table 2: UCF-101 classification accuracy, using the conv4_3 layer features from the VGG-16 network pretrained on ImageNet.

| Method | Split 1 | Split 2 | Split 3 | Average |
|---|---|---|---|---|
| Random sampling | 75.65% | 76.06% | 74.89% | 75.54% |
| Dense sampling | 75.55% | 76.33% | 74.35% | 75.41% |
| Finetuned network (after 5 epochs) | 76.39% | 77.29% | 77.30% | 76.99% |
| Finetuned network (after 11 epochs) | 79.12% | 78.63% | 76.35% | 78.03% |
| Finetuned network (after 33 epochs) | 81.84% | - | - | - |

Table 3: UCF-101 classification accuracy, using the conv5_3 layer features from the VGG-16 network pretrained on ImageNet. Finetuning was done using SGD with initial learning rate = 0.0001, momentum = 0.9 and extraction layer dropout p = 0.9.

| Method | Split | Accuracy | Total parameters |
|---|---|---|---|
| Slow fusion network [7] | all | 41.3% | - |
| Spatial CNN-M-2048 [20] | 1 | 52.3% | 90.63 M |
| Single CRBM + FV [13] | 1 | 55.06% | 5.33 M |
| Ours, single CRBM | 1 | 61.67% | 5.33 M |

Table 4: State of the art methods using only static features, trained and evaluated on UCF-101.

| Method | Pretrained on | Split | Accuracy | Top layers parameters | Total parameters |
|---|---|---|---|---|---|
| Slow fusion network [7] | Sports 1M | all | 65.4% | - | - |
| Encoding objects [5] | ImageNet | all | 65.6% | - | - |
| Spatial CNN-M-2048 [20] | ImageNet | 1 | 72.8% | 84.1 M | 90.63 M |
| Ours, VGG-16 | ImageNet | 1 | 81.84% | 5.87 M | 20.58 M |
| Spatial VGG-16 [4, 21] | ImageNet | 1 | 82.61% | 119.96 M | 134.67 M |

Table 5: State of the art methods using only static features, pretrained on a larger dataset, finetuned and evaluated on UCF-101.

5 Discussion

Compared to the work of [13], where the same kind of features and encoding were used and the extraction was performed on randomly selected subvolumes, our network is naturally capable of performing dense sampling, thus increasing the available information from the underlying video and improving the final classification performance. By performing dense sampling and finetuning the whole network, the accuracy improved from 55.06% to 61.67% (Table 1). This is explained by the fact that including more training data helps prevent overfitting.

Let us note that the lowest level of our network extracts features at frame level from intensity information alone, and therefore is not directly comparable to the full two stream network from [20], one stream of which is trained on optical flow that was extracted offline at the input. However, we do compare favorably with the spatial stream of the network when trained directly on static frames of the UCF-101 dataset, where it overfits and results in a classification accuracy of 52.3% - compared to 61.67% that we obtain when using a simple, single-layer convolutional RBM. In order to prevent overfitting, the two stream network is pretrained on a different, larger dataset - this alone improves the accuracy of [20] to 72.8%.

Our approach, which includes the time dimension by pooling feature maps extracted from subvolumes of the video, achieves an accuracy of 61.67% without including optical flow explicitly and using only a single convolutional RBM at the lowest layers of the network. The work of [7] also tried to include motion features implicitly by learning them from stacks of static frames. Their approach of slowly fusing the feature maps resulted in an accuracy of 65.4% on the UCF-101 dataset when pretrained on a larger (Sports 1M) dataset, whereas training directly on UCF-101 resulted in overfitting, with an accuracy of 41.3%. This should again be compared with the 61.67% that we obtain with the proposed approach when using the simple single-layer convolutional RBM features.

By simply replacing the single convolutional RBM layer at the lowest level of our architecture with VGG-16, a deep network containing 13 convolutional layers pretrained on ImageNet, we boost the classification accuracy to 75.41%. The main contribution of our proposed method becomes apparent after finetuning the network as a whole, which further boosts the classification accuracy on UCF-101 to 81.84%. While this is lower than the 82.61% achieved by the VGG-16 network as the spatial stream of [4], we point out that our architecture contains 20.58 million trainable parameters in total, compared to the 134.67 million parameters contained in VGG-16. If we only look at the top layers of the two architectures, the 3 fully connected layers containing 119.96 million parameters in VGG-16 can be replaced by our proposed layers, which contain only 5.87 million parameters, that is, less than 5% of the parameter count, at the cost of reducing the classification accuracy by less than one percentage point (81.84% versus 82.61%). Longer finetuning and a finer choice of the finetuning hyperparameters should narrow this performance gap. On the other hand, our method compares favorably to the CNN-M-2048 spatial stream of [20], achieving 81.84% versus its 72.8%, while requiring less than 23% of its total trainable parameter count.
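
The ratios quoted in this comparison follow directly from the Table 5 entries. A minimal sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Sketch: parameter ratios and accuracy gaps derived from the Table 5 values.
# Parameter counts are in millions, accuracies in percent.

ours = {"top": 5.87, "total": 20.58, "acc": 81.84}
vgg16 = {"top": 119.96, "total": 134.67, "acc": 82.61}
cnn_m_2048 = {"top": 84.1, "total": 90.63, "acc": 72.8}

top_ratio_vs_vgg = ours["top"] / vgg16["top"]             # share of VGG-16 top layers
total_ratio_vs_cnn_m = ours["total"] / cnn_m_2048["total"]
acc_gap_vs_vgg = vgg16["acc"] - ours["acc"]               # percentage points
acc_gain_vs_cnn_m = ours["acc"] - cnn_m_2048["acc"]       # percentage points

print(f"top layers vs VGG-16:       {100 * top_ratio_vs_vgg:.1f}% of the parameters")
print(f"total vs CNN-M-2048:        {100 * total_ratio_vs_cnn_m:.1f}% of the parameters")
print(f"accuracy gap to VGG-16:     {acc_gap_vs_vgg:.2f} points")
print(f"accuracy gain over CNN-M:   {acc_gain_vs_cnn_m:.2f} points")
```

The top layers thus use roughly 4.9% of the parameters of the VGG-16 fully connected layers, and the whole model uses roughly 22.7% of the parameters of CNN-M-2048.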

6 Conclusion

In this paper we have proposed a convolutional architecture that expresses the various steps of Fisher vector based action recognition as layers in a convolutional neural network that can be trained or refined end to end in a supervised manner. Our model significantly outperforms both the baseline architecture, where the various levels are trained layer by layer in an unsupervised manner, and state of the art CNN architectures when trained on the same amount of data. We show that replacing the top fully connected layers in popular convolutional network architectures with our proposed layers significantly reduces the number of trainable parameters, while achieving comparable performance, or even significantly surpassing the performance of similar architectures.

References


  • [1] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
  • [2] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, D. Maturana, M. Thoma, E. Battenberg, J. Kelly, J. D. Fauw, M. Heilman, D. M. de Almeida, B. McFee, H. Weideman, G. Takács, P. de Rivaz, J. Crall, G. Sanders, K. Rasul, C. Liu, G. French, and J. Degrave. Lasagne: First release, Aug. 2015.
  • [3] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • [4] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pages 1933–1941, 2016.
  • [5] M. Jain, J. C. van Gemert, and C. G. Snoek. What do 15,000 object categories tell us about classifying and localizing actions? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 46–55, 2015.
  • [6] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 221–231, 2013.
  • [7] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014.
  • [8] J. Krapac, J. Verbeek, and F. Jurie. Modeling spatial layout with Fisher vectors for image categorization. In 2011 International Conference on Computer Vision, pages 1487–1494. IEEE, 2011.
  • [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
  • [10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, pages 2169–2178. IEEE, 2006.
  • [11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [12] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.
  • [13] P. Palasek and I. Patras. Action recognition using convolutional restricted Boltzmann machines. In Proceedings of the 1st International Workshop on Multimedia Analysis and Retrieval for Multimodal Interaction, MARMI ’16, pages 3–8, New York, NY, USA, 2016. ACM.
  • [14] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked Fisher vectors. In European Conference on Computer Vision, pages 581–595. Springer, 2014.
  • [15] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.
  • [16] F. Perronnin and D. Larlus. Fisher vectors meet neural networks: A hybrid classification architecture. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3743–3752, 2015.
  • [17] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In Computer Vision–ECCV 2010, pages 143–156. Springer, 2010.
  • [18] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
  • [19] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Fisher networks for large-scale image classification. In Advances in Neural Information Processing Systems, pages 163–171, 2013.
  • [20] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
  • [21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [22] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [23] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [24] V. Sydorov, M. Sakurada, and C. H. Lampert. Deep Fisher kernels - end to end learning of the Fisher kernel GMM parameters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1402–1409, 2014.
  • [25] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
  • [26] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.
  • [27] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3169–3176. IEEE, 2011.
  • [28] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, pages 1–20, 2013.
  • [29] H. Wang and C. Schmid. Action recognition with improved trajectories. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 3551–3558. IEEE, 2013.
  • [30] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, C. Schmid, et al. Evaluation of local spatio-temporal features for action recognition. In BMVC 2009-British Machine Vision Conference, 2009.
  • [31] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.