# High Order Neural Networks for Video Classification

Jie Shao, Kai Hu, Yixin Bao, Yining Lin, Xiangyang Xue
Fudan University, Carnegie Mellon University, Qiniu Inc., ByteDance AI Lab
{shaojie, xyxue}@fudan.edu.cn, {kaihu}@andrew.cmu.edu, {baoyixin, linyining}@qiniu.com
Work done during internship at Qiniu Inc.
###### Abstract

Capturing spatiotemporal correlations is an essential topic in video classification. In this paper, we present high order operations as a generic family of building blocks for capturing high order correlations from the high dimensional input video space. We prove that several successful architectures for visual classification tasks are in the family of high order neural networks; theoretical and experimental analysis demonstrates that their underlying mechanism is high order. We also propose a new LEarnable hiGh Order (LEGO) block, whose goal is to capture spatiotemporal correlations in a feedforward manner. Specifically, LEGO blocks implicitly learn the relation expressions for spatiotemporal features and use the learned relations to weight input features. This building block can be plugged into many neural network architectures, achieving evident improvement without introducing much overhead. On the task of video classification, even using RGB only and without fine-tuning on other video datasets, our high order models achieve results on par with or better than the existing state-of-the-art methods on both the Something-Something (V1 and V2) and Charades datasets.

$\dagger$$\dagger$footnotetext: Indicates equal contribution.

## 1 Introduction

The last decade has witnessed great success of deep learning in computer vision. In particular, deep neural networks have been demonstrated to be effective models for tackling multiple visual tasks, such as image recognition [14, 8, 32, 31], object detection [25, 17, 24, 5] and video classification [13, 27, 30, 3].

Videos are high-dimensional and highly-redundant data. As they can be viewed as a sequence of continuously changing images, the differences between adjacent frames are subtle. For example, in Fig. 1, is this an action about plugging cable into charger or pulling cable out of charger? The appearance in each frame is quite similar, only by looking in a consecutive view, we can figure out what this action is. Therefore, an effective video classification model should focus on capturing spatial and temporal movements. Existing state-of-the-art methods include ConvNets with temporal models [21], two-stream ConvNets [30, 38] and 3D ConvNets [33, 41, 3, 23]. They cascade a group of first order units to learn video representations.

Looking back at first order units, e.g. convolutional operations, they have the form $y=\sum_i w_i x_i$, which means the output of one convolutional layer is a linear combination of the input. Although activation functions can be added to introduce non-linearity into the network, they are not learnable. Moreover, information beyond the weighted sum, e.g. spatiotemporal correlations, may be lost.

In this paper, we introduce the high order operation as an efficient and generic component for capturing spatiotemporal correlations. The high order operation is a family of expressions that include terms computed as high order products of input features. For instance, $y=\sum_i w_i x_i + \sum_{i,j} w_{ij}\, x_i x_j$ is an example of a second order representation, which consists of a first order term and a second order term.

There are several advantages of using high order operations: (a) Mathematically, high order expressions are more capable of fitting non-linear functions than linear combinations. (b) The high order expression can learn attention from the video; this becomes clear when the above-mentioned second order equation is rewritten as $y=\sum_j \big(w_j + \sum_i w_{ij}\, x_i\big)\, x_j$. (c) Multiplicative connections in the high order term allow one unit to gate another (see Fig. 2). Specifically, if one unit of a multiplicative pair is zero, the other member of the pair can have no effect, no matter how strong its output is. On the other hand, if one unit of a pair is one, the output of the other is passed unchanged. Such pairwise interactions can encode prior knowledge in high order networks, which makes it easier to solve video classification problems. As in the plugging into / out of charger case, the frames are highly similar with only minor differences, so a feature selection mechanism, i.e. gating, needs to be considered.
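The gating behaviour in (c) can be illustrated with a toy numeric sketch (the values are illustrative, not from the paper's model): a multiplicative pair passes, blocks, or attenuates a unit's output depending on its partner.

```python
import numpy as np

# Toy multiplicative pair g * x: a gate of 0 blocks x regardless of magnitude,
# a gate of 1 passes x unchanged, and intermediate gates attenuate it.
x = np.array([5.0, -3.0, 100.0])   # unit outputs, arbitrary values
gate = np.array([0.0, 1.0, 0.5])   # the paired gating units

y = gate * x                        # pairwise multiplicative connection
print(y)                            # -> [0., -3., 50.]
```

Even a very strong response (here 100.0) is halved or fully suppressed by its gate, which is the feature selection mechanism discussed above.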

We point out that several successful architectures or frameworks for visual classification tasks are in the family of high order neural networks: (a) Squeeze-and-Excitation Networks [11], (b) Appearance-and-Relation Networks [37], and (c) Non-local Neural Networks [39]. They use different priors to define the form of spatiotemporal relations, which leads to high order formulations. A limitation of these works is that they use pre-defined patterns, e.g. square operation in ARTNet, which may limit the block representation ability. In this paper, we take a step towards addressing this problem. We propose a LEarnable hiGh Order (LEGO) block to implicitly learn the relation expressions for spatiotemporal features, and use the learned relations to weight input features. Intuitively, LEGO learns a kernel with context for each position in the feature map. Different positions in the feature map have different contexts and therefore have different kernels that are learned. This block can be plugged into many neural network architectures, achieving significant improvement without introducing much overhead.

We test the high order neural networks on the Something-Something (V1 and V2) and Charades datasets. Both datasets are challenging for action recognition in that appearance alone provides insufficient information for classification. Using RGB only, without fine-tuning on other video datasets and without any bells and whistles (e.g., optical flow, multi-scale testing), our method achieves results on par with or better than the existing state-of-the-art methods.

Our main contributions are summarized as follows. Firstly, we are among the first who attempt to introduce high order neural networks to the task of video classification. Secondly, we provide interpretations of some state-of-the-art architectures from high order points of view, explaining their underlying mechanism. Thirdly, we introduce LEarnable hiGh Order (LEGO) block, which obtains competitive results on both Something-Something and Charades datasets.

## 2 Related Work

Video classification architectures. The high performance of 2D ConvNets on image classification tasks [14] makes it appealing to reuse them for video classification. Building on image classification models, several works have tried to design effective architectures [21, 30, 38, 33, 3, 23, 34, 41]. Ng et al. [21] combined a 2D ConvNet with an LSTM for video classification, where the 2D ConvNet acts as a spatial feature extractor and the LSTM is responsible for capturing temporal dependencies. Simonyan et al. [30] designed a two-stream architecture to capture appearance and motion information separately: the spatial stream uses RGB frames as inputs, while the temporal stream learns from optical flow. Wang et al. [38] further generalized this framework to learn long-range dependencies via temporal segments. While the two-stream framework turns out to be effective for video classification, it is time-consuming to train two networks and calculate optical flow in advance. Tran et al. [33] investigated 3D ConvNets to learn spatiotemporal features end-to-end. Similar to 2D ConvNets, 3D ConvNets learn linear combinations of the input, but with spatiotemporal filters. Because of the additional kernel dimension, 3D ConvNets are computationally expensive and harder to train. To overcome this limitation, some researchers tried to save computation by replacing 3D convolutions with separable convolutions [23, 34] or mixed convolutions [34, 41]. Meanwhile, Carreira et al. [3] introduced an inflation operation that allows converting pre-trained 2D models into 3D, which also diminishes the difficulty of training 3D ConvNets.

Attention mechanism. The mechanism of attention stems from the study of human vision. Because of the bottleneck of information processing [22], humans selectively focus on a subset of the available sensory information [35]. Itti et al. [12] proposed a neurally inspired visual attention system that combines image features at multiple scales. Larochelle [15] described a model based on a Boltzmann machine that can learn how to accumulate information about a shape over several fixations. Mnih [20] trained task-specific policies using reinforcement learning methods; the model selects a sequence of processing regions adaptively. In [42], the authors applied attention mechanisms to the problem of generating image descriptions. The attention mechanism is also widely used in natural language processing: [2] used an attention matrix linking the expressions learned for each word in the source language to the words currently being predicted in the translation. To address the deficiencies of global attention, [18] proposed a local attention mechanism that focuses only on a small subset of the source positions per target word. [36] proposed a new architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art results on several translation tasks.

High order neural networks. High order neural networks are known to provide inherently more powerful mapping abilities than their first order brethren. The Sigma-Pi network [26] is a generalization of the multi-layer perceptron (MLP), where the main difference is the use of multiplicative connection units instead of simple additive units. Other research involving high order correlations includes high order conjunctive connections [10] and associative memories [4, 16]. General high order networks suffer from the problem that the number of weights increases combinatorially. Some researchers use prior knowledge to select the product terms: Yang et al. [43] investigated the parabolic neuron, and Heywood et al. [9] proposed a dynamic weight pruning algorithm for identifying and removing redundant weights. Many recent works on video classification involve high order building blocks. Wang et al. [37] investigated a square-pooling architecture for relation modeling. Non-local neural networks [39] formulate the neuron response at a position as a weighted sum of feature embeddings at all positions in the input feature maps; both the weights and the feature embeddings are functions of the input features, which leads to a high order representation.

## 3 Methodology

We first review the mathematical formulation of convolutional layers (the first order neural network) and give a definition of high order neural networks. Next, we prove that several successful architectures for visual classification tasks are in the family of high order neural networks, demonstrating that their underlying mechanism is high order. At the end of this section, we propose a new learnable high order block.

### 3.1 Notation convention

In this section, we make an agreement on the notations to be used. Given video data (or features) with T frames, where the size of each frame is H×W and the number of channels is C, the data is represented as:

$$X \in \mathbb{R}^{C\times T\times H\times W}=\{x_{thw}\},\qquad x_{thw}\in\mathbb{R}^{C}. \tag{1}$$

In Equation 1, $x_{thw}$ is the feature at position $(t,h,w)$ on the feature map, where $1\le t\le T$, $1\le h\le H$, and $1\le w\le W$.

The neighbourhood of position $(t,h,w)$ is defined as:

$$\mathcal{N}(t,h,w)=\big\{(i,j,k)\,\big|\,|i-t|\le K_t,\ |j-h|\le K_h,\ |k-w|\le K_w\big\}. \tag{2}$$
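Equation 2 can be sketched directly in code; this minimal version enumerates the neighbourhood, with positions outside the feature map clipped (the function name and clipping choice are illustrative, not from the paper):

```python
# Sketch of Equation 2: the neighbourhood N(t,h,w) under kernel extents
# (Kt, Kh, Kw), clipped to the feature map bounds for clarity.
def neighbourhood(t, h, w, Kt, Kh, Kw, T, H, W):
    return [(i, j, k)
            for i in range(max(0, t - Kt), min(T, t + Kt + 1))
            for j in range(max(0, h - Kh), min(H, h + Kh + 1))
            for k in range(max(0, w - Kw), min(W, w + Kw + 1))]

# A 1x3x3 extent (Kt=0, Kh=Kw=1) around an interior position of a 4x8x8 map:
nbrs = neighbourhood(2, 4, 4, 0, 1, 1, T=4, H=8, W=8)
print(len(nbrs))  # 9 positions: one frame, a 3x3 spatial window
```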

The convolutional layers we refer to have D filters, and the kernel size of each filter is $(2K_t+1)\times(2K_h+1)\times(2K_w+1)$. The stride is 1 and the padding is zero padding. Such a convolutional layer has $D\times C\times(2K_t+1)(2K_h+1)(2K_w+1)$ parameters, and we use $w_{ijkdc}$ to represent a certain parameter. The output of $X$ after the convolutional layer is:

$$Y \in \mathbb{R}^{D\times T\times H\times W}=\{y_{thw}\},\qquad y_{thw}\in\mathbb{R}^{D}. \tag{3}$$

### 3.2 First order neural network

Convolutional layers are the most commonly used operations in visual tasks and they are first order operations.

The $d$-th element of $y_{thw}$ is computed as (bias terms ignored):

$$y^{(d)}_{thw}=\sum_{i=t-K_t}^{t+K_t}\sum_{j=h-K_h}^{h+K_h}\sum_{k=w-K_w}^{w+K_w}\sum_{c=1}^{C} w_{ijkdc}\, x^{(c)}_{ijk}=\sum_{i=t-K_t}^{t+K_t}\sum_{j=h-K_h}^{h+K_h}\sum_{k=w-K_w}^{w+K_w} w^{T}_{ijkd}\, x_{ijk}=\sum_{p\in\mathcal{N}(t,h,w)} w^{T}_{pd}\, x_{p}. \tag{4}$$

In Equation 4, $w_{ijkd}\in\mathbb{R}^{C}$. Given Equation 4, the output at position $p$ is computed as:

$$y_p=\sum_{q\in\mathcal{N}(p)} W_q\, x_q. \tag{5}$$

Equation 5 gives the general formulation of convolutional layers. It is a first order expression. The kernels commonly used in I3D [3] and P3D [23] architectures take this form, so the output features are highly dependent on the weighted sum of the input features within one frame. Consider the case where two input features are very similar, which is common in videos because differences between adjacent frames are subtle: such a convolutional layer will lose the action information that lies outside the weighted sum.

### 3.3 High order Formulation

Since most problems of interest are not linearly separable, a single layer of a first order neural network cannot deal with them. One alternative is to cascade a group of first order units, but this risks losing temporal information in the multi-layer transformation. We suggest adding high order expressions. Following [6], a generic high order operation in deep neural networks is defined as:

$$y_p=\sum_{k=0}^{n} T_k(p). \tag{6}$$

In Equation 6, $T_k(p)$ is the $k$-th order expression for position $p$. $T_0(p)$ is the bias term, mostly ignored with batch normalization, and $T_1(p)$ takes the formulation in Equation 5. We will focus on second and third order expressions, which is intuitive since the mean and the variance are the most commonly used moments in statistics.

A common form of second order expression is dot product:

$$T_2(p)=\sum_{i,j} w^{T}_{ijp}\, x_i x_j. \tag{7}$$

The Hadamard product is also a second order expression. Since the Hadamard product can be transformed into matrix multiplication by diagonalizing vectors into matrices, we include this case in Equation 7.

We can rewrite Equation 7 in a new formulation:

$$T_2(p)=\sum_{i,j} w^{T}_{ijp}\, x_i x_j=\sum_j\Big(\sum_i w^{T}_{ijp}\, x_i\Big)\, x_j=\sum_j T_1(j;p)\, x_j. \tag{8}$$

Equation 8 shows that the second order expressions use first order expressions to learn contextual semantic information as the kernel weights.
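The identity in Equation 8 can be verified numerically with a toy scalar-feature example (random weights, not the paper's block): the second order term equals a first order expression whose kernel is itself computed from the input.

```python
import numpy as np

# Sketch of Equation 8: sum_{i,j} w_ij x_i x_j rewritten as sum_j T1(j) x_j,
# where the "kernel" T1(j) = sum_i w_ij x_i depends on the input itself.
rng = np.random.default_rng(1)
n = 5
x = rng.standard_normal(n)        # scalar features x_i
W = rng.standard_normal((n, n))   # weights w_ij

t2_direct = sum(W[i, j] * x[i] * x[j] for i in range(n) for j in range(n))
kernel = W.T @ x                  # T1(j) = sum_i w_ij x_i: input-dependent kernel
t2_as_first_order = kernel @ x    # sum_j T1(j) x_j

print(np.isclose(t2_direct, t2_as_first_order))  # -> True
```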

### 3.4 Instantiations

Squeeze-and-Excitation Networks proposed the SE block. The SE block seeks to strengthen the representation ability of CNN layers through channel-wise relations, explicitly modeling the relations between channels. As Figure 3 (b) shows, the SE block uses a squeeze-and-excitation sequence to learn a set of channel-specific weights, which are multiplied channel-wise with the original inputs. The sequence is a first order operation, so the outputs after the scale operation are produced by a high order expression:

$$y_i=T_1(\{j \mid \forall x_j\in X\})\otimes x_i. \tag{9}$$

The key point is that the channel-wise weights are learned from first order operations, and thus the block is a high order operation. If they were initialized as zero-order parameters, the SE-Inception block would turn into a trivial Inception block.
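A minimal numpy sketch of an SE-style block follows, assuming the usual squeeze (global average pool), two fully-connected layers with a ReLU, a sigmoid, and a channel-wise scale; the weights `W1`, `W2` and the reduction ratio are random stand-ins, not trained parameters:

```python
import numpy as np

# SE-style block (Equation 9): first order channel weights gate the input.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
x = rng.standard_normal((C, H, W))

r = 2                                   # reduction ratio (illustrative)
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))

z = x.mean(axis=(1, 2))                 # squeeze: per-channel statistic, shape (C,)
s = 1 / (1 + np.exp(-(W2 @ np.maximum(W1 @ z, 0))))   # excitation: FC-ReLU-FC-sigmoid
y = s[:, None, None] * x                # scale: channel-wise multiplication

print(y.shape)  # -> (8, 4, 4)
```

Since `s` is itself a function of `x`, the product `s * x` is a high order expression in the input, which is the point made above.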

Appearance-and-Relation Networks proposed the SMART block for video classification. The SMART block seeks to model appearance and relation from RGB input in a separate and explicit manner. As Figure 3 (c) shows, there are two paths in the block for modeling appearance and relation separately. The novel part of the block is the square operation that turns each element $x$ into $x^2$. Without the square operation, the outputs before the concat layer would be a purely linear operation, as there is no additional activation layer; in that case the SMART block turns into a trivial Conv3d block. So the high order operation is the key point. The high order formulation of the SMART block is:

$$y_i=\sum_j T^{2d}_1(j;i)+\Big(\sum_j T^{3d}_1(j;i)\Big)^{2}. \tag{10}$$

The general formulation of SMART blocks is the energy model [1], which is an evidently high order operation. Specifically, a hidden unit in the energy model is calculated as follows:

$$z_k=\sum_f w^{z}_{kf}\big(w^{xT}_{f} x + w^{yT}_{f} y\big)^{2}. \tag{11}$$

In Equation 11, $x$ and $y$ are patches from consecutive frames, and $z$ is the transformation (relation) between them.
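Equation 11 can be sketched with random filters on two toy patches (all sizes and weights here are illustrative, not the model of [1]); squaring the summed filter responses is what makes the unit sensitive to the transformation between the patches:

```python
import numpy as np

# Toy energy-model unit (Equation 11): z_k = sum_f wz_kf (wx_f^T x + wy_f^T y)^2.
rng = np.random.default_rng(2)
d, F, K = 16, 6, 3                    # patch size, number of filters, hidden units
x = rng.standard_normal(d)            # patch from frame t
y = rng.standard_normal(d)            # patch from frame t+1

wx = rng.standard_normal((F, d))      # filters applied to x
wy = rng.standard_normal((F, d))      # filters applied to y
wz = rng.standard_normal((K, F))      # pooling weights over squared responses

z = wz @ (wx @ x + wy @ y) ** 2       # second order in the inputs
print(z.shape)  # -> (3,)
```

The square makes the unit second order: scaling both patches by 2 scales every `z_k` by 4, unlike a first order unit.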

Non-local Neural Networks proposed the non-local operation to capture long-range dependencies with deep neural networks. The generic non-local operation in deep neural networks is:

$$y_i=\frac{1}{\mathcal{C}(x)}\sum_{\forall j} f(x_i,x_j)\, g(x_j). \tag{12}$$

In Equation 12, $f$ is a pair-wise function representing a relationship, such as affinity, between $x_i$ and $x_j$. The function $g$ computes a representation of the input at position $j$. $\mathcal{C}(x)$ is a normalization factor, mostly set as the number of positions in $x$. Both $f$ and $g$ are functions of the input features, making the non-local operation a high order representation. In [39], some choices of the function $f$ are discussed:

1. $f(x_i,x_j)=\theta(x_i)^{T}\phi(x_j)$, where $\theta$ and $\phi$ are convolutional layers (see Figure 3 (d)). In this case, the non-local operation is a third order operation:

$$y_i=\frac{W_g}{N}\sum_{\forall j} x_i^{T} W_{\theta}^{T} W_{\phi}\, x_j\, x_j=\sum_{\forall j} T_2(j;i)\, x_j. \tag{13}$$
2. $f(x_i,x_j)=\mathrm{ReLU}\big(w_f^{T}\,[\theta(x_i),\phi(x_j)]\big)$, where $[\cdot,\cdot]$ denotes concatenation. In this case, the non-local operation is a second order operation.
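The dot-product case of Equations 12-13 can be sketched in numpy, with $\theta$, $\phi$, $g$ as plain linear maps standing in for 1×1 convolutions and $\mathcal{C}(x)=N$, the number of positions (all weights are random stand-ins):

```python
import numpy as np

# Dot-product non-local operation over a flattened feature map:
# y_i = (1/N) sum_j theta(x_i)^T phi(x_j) g(x_j).
rng = np.random.default_rng(3)
C, N = 4, 6                            # channels, number of positions
x = rng.standard_normal((N, C))        # one row per position

Wt = rng.standard_normal((C, C))       # theta
Wp = rng.standard_normal((C, C))       # phi
Wg = rng.standard_normal((C, C))       # g

f = (x @ Wt.T) @ (x @ Wp.T).T          # pairwise relations, shape (N, N)
y = (f @ (x @ Wg.T)) / N               # aggregate g(x_j) weighted by f

print(y.shape)  # -> (6, 4)
```

Both the weights `f` and the values `g(x_j)` are functions of `x`, so the output is cubic in the input, matching the third order claim above.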

To illustrate the role of high order operations in such blocks, we introduce a masked non-local operation:

$$y_i=\frac{1}{\mathcal{C}(x)}\sum_{j\in\mathcal{N}(i)} f(x_i,x_j)\, g(x_j). \tag{14}$$

In Equation 14, $\mathcal{N}(i)$ is the neighbourhood of position $i$ defined in Equation 2, which means only relations between adjacent positions are computed, since relations between far-away positions tend to be weaker. In Section 4, we conduct experiments showing that the masked non-local model matches the performance of the non-local model even though only a fraction of the positions in the feature map are covered in the pair-wise relation computation, which reduces the computational cost by a large amount.
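The masking idea can be illustrated along a single dimension (the window radius and size below are illustrative, not the paper's configuration): relations outside a local window are simply dropped, shrinking the $O(N^2)$ relation count.

```python
import numpy as np

# Sketch of the masked non-local idea (Equation 14): keep only pairwise
# relations within |i - j| <= radius along one dimension.
N, radius = 10, 1                       # positions, neighbourhood radius
mask = np.zeros((N, N), dtype=bool)
for i in range(N):
    lo, hi = max(0, i - radius), min(N, i + radius + 1)
    mask[i, lo:hi] = True               # neighbours of position i

print(mask.sum(), "of", N * N, "pairwise relations kept")
```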

### 3.5 LEGO block

To verify the effectiveness of high order operations, we propose a simple network block with no bells and whistles: the LEarnable hiGh Order operation, termed LEGO.

General high order networks suffer from the problem of combinatorially increasing parameters. Several related works use prior knowledge (e.g., the square operation in ARTNet) to select a subset of the product terms in Equation 7. Unlike these explicit constraints, which may limit the block's representation ability, we use a more general learnable operation:

$$y_i=\sum_{j\in\mathcal{N}(i)} f_j[M(i)]\, x_j. \tag{15}$$

As shown in Equation 15, different positions on the feature map are given different kernels, which are functions of the input features. The LEGO block differs from the local convolution operation, which also does not share kernels: as a first order operation, local convolution lacks input context. We assume that the knowledge needed to generate the kernels is the same at all locations. Given a relatively large receptive field, LEGO can learn such knowledge to generate different kernels at different positions.

The implementation details are shown in Figure 4. Conv1 denotes convolutions for embedding and conv4 denotes convolutions for keeping the output channel number. The orange boxes denote convolutions for learning the knowledge for generating a kernel. The tensor shape after the orange boxes is $T\times H\times W\times S$, where $S$ is the size of the kernel. In Figure 4, we set the number of filters in conv3 to 27, so each position $(t,h,w)$ gets one kernel of size $3\times 3\times 3$. The $\oplus$ operation denotes element-wise sum and the $\otimes$ operation applies the kernels to the corresponding positions. Given two tensors $X$ and $Y$ as above, the $\otimes$ operation is defined as:

$$Z=X\otimes Y=\{z_{thw}\}\in\mathbb{R}^{T\times H\times W\times C},\qquad z_{thw}=\sum_{i=t-1}^{t+1}\sum_{j=h-1}^{h+1}\sum_{k=w-1}^{w+1} y^{(p)}_{thw}\, x_{ijk}. \tag{16}$$

In Equation 16, $y^{(p)}_{thw}$ is the $p$-th element of $y_{thw}$. The index $p$ is given by the position $(i,j,k)$ in a fixed order.
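A naive numpy sketch of the $\otimes$ operation in Equation 16 follows (shapes are small and weights random for illustration; zero padding is assumed at the boundaries): each position applies its own learned 27-value kernel to its 3×3×3 input neighbourhood.

```python
import numpy as np

# Position-wise kernel application (Equation 16), naive loops for clarity.
rng = np.random.default_rng(4)
T, H, W, C = 3, 4, 4, 2
X = rng.standard_normal((T, H, W, C))
Y = rng.standard_normal((T, H, W, 27))   # one 3x3x3 kernel per position

Z = np.zeros((T, H, W, C))
for t in range(T):
    for h in range(H):
        for w in range(W):
            p = 0
            for i in range(t - 1, t + 2):
                for j in range(h - 1, h + 2):
                    for k in range(w - 1, w + 2):
                        if 0 <= i < T and 0 <= j < H and 0 <= k < W:
                            Z[t, h, w] += Y[t, h, w, p] * X[i, j, k]
                        p += 1           # index p follows a fixed kernel order

print(Z.shape)  # -> (3, 4, 4, 2)
```

Because the kernel values in `Y` would themselves be produced from `X` by the conv layers of Figure 4, the full block is a high order function of the input.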

## 4 Experiments on Video Classification

In this section we describe the experimental results of our method. First, we introduce the action recognition datasets and the evaluation settings. Next, we study different aspects of our proposed LEGO on Charades and Something-Something (V1 and V2) datasets and compare with the state-of-the-art methods.

### 4.1 Datasets

We evaluate the performance of high order neural networks on two action recognition benchmarks: Charades [29] and Something-Something [7, 19].

The Something-Something dataset is a recent video dataset for human-object interaction recognition. It has two releases. The V1 dataset has 86K training videos, around 12K validation videos and 11K testing videos. The V2 dataset has 220K videos, more than twice as many as V1: 169K training videos, around 25K validation videos and 27K testing videos. The dataset has 174 classes, and some of the ambiguous activity categories are challenging, such as ‘Pushing something from left to right’ versus ‘Pushing something from right to left’, or ‘Poking something so lightly that it doesn’t or almost doesn’t move’ versus ‘Pushing something so that it slightly moves’. The temporal relations and transformations of the objects, rather than their appearance, characterize the activities in this dataset.

The Charades dataset is a dataset of daily indoors activities, which consists of 8K training videos and 1.8K validation videos. The average video duration is 30 seconds. There are 157 action classes in this dataset and multiple actions can happen at the same time.

### 4.2 Implementation Details

Our backbone model is based on the ResNet-50 I3D [3] architecture, as shown in Table 1.

The model is initialized with the pre-trained ImageNet ResNet-50 model from the PyTorch torchvision library. Following [3], a 3D kernel with dimension $t\times k\times k$ can be inflated from a 2D $k\times k$ kernel by repeating the weights $t$ times along the time dimension and rescaling them by dividing by $t$. We do not use any other video datasets for pre-training.
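The inflation step can be sketched in a few lines (sizes are illustrative): dividing by $t$ ensures that a 3D convolution over $t$ identical frames reproduces the 2D convolution's response.

```python
import numpy as np

# I3D-style kernel inflation: repeat a 2D kxk kernel t times along the time
# axis and divide by t, so boring (repeated-frame) videos give the 2D response.
rng = np.random.default_rng(5)
k, t = 3, 4
w2d = rng.standard_normal((k, k))
w3d = np.repeat(w2d[None, :, :], t, axis=0) / t   # shape (t, k, k)

frame = rng.standard_normal((k, k))               # one kxk input patch
video = np.repeat(frame[None, :, :], t, axis=0)   # the same patch over t frames

resp2d = (w2d * frame).sum()
resp3d = (w3d * video).sum()
print(np.isclose(resp2d, resp3d))  # -> True
```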

The training consists of three stages; we take training on the Something-Something V1 dataset as an example. At all stages, the video frames are first resized to [256,320] and then randomly cropped to a spatial dimension of 224×224.

At stage 1, we train the I3D baseline model on the datasets with 8-frame inputs. Since the videos in the datasets are relatively short, the 8 frames are sampled at a frame rate of 6 fps. The stage 1 model is trained on a 4-GPU machine where each GPU holds 20 video clips in a mini-batch. The learning rate is kept at 0.01 while training for 20 epochs.

At stage 2, we double the input length and add 5 LEGO blocks to the I3D model. The LEGO blocks are added following [39] (3 to res4 and 2 to res3, to every other residual block). The stage 2 model takes 16 video frames as inputs, sampled at a frame rate of 12 fps. It is fine-tuned from the stage 1 model on a 4-GPU machine where each GPU holds 6 video clips in a mini-batch. The learning rate is 1e-2 for the first 5 epochs and 1e-3 for the next 10 epochs.

At stage 3, we double the input length again and continue fine-tuning from the stage 2 model. The stage 3 model takes 32 video frames sampled at a frame rate of 12 fps as inputs and is trained on a 4-GPU machine where each GPU holds 2 video clips in a mini-batch. The learning rate is 1e-3 for the first 5 epochs and 1e-4 for the next 10 epochs.

The purpose of the three-stage training is to reduce the loss caused by small-batch training (batch size 16): the stage 1 and 2 models learn relatively stable batch normalization statistics for the final model. The loss functions we use are cross entropy loss for the Something-Something datasets and BCEWithLogitsLoss for the Charades dataset (multi-class and multi-label).

At the test stage, we use fully-convolutional testing in the [256,320] dimension space and sample 30 clips per video. The final predictions are based on the averaged softmax scores of all clips.
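The clip-level aggregation described above can be sketched as follows (random logits stand in for network outputs): softmax each clip's logits, then average the probabilities over all sampled clips.

```python
import numpy as np

# Average the per-clip softmax scores into one video-level prediction.
rng = np.random.default_rng(6)
num_clips, num_classes = 30, 174
logits = rng.standard_normal((num_clips, num_classes))

e = np.exp(logits - logits.max(axis=1, keepdims=True))   # numerically stable softmax
probs = e / e.sum(axis=1, keepdims=True)                 # shape (30, 174)
video_pred = probs.mean(axis=0)                          # average over clips

print(video_pred.argmax())                               # predicted class index
```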

### 4.3 Results on Something-Something V1 datasets

We first study different aspects of our proposed LEGO on Something-Something V1 dataset, and then compare with the state-of-the-art methods on Something-Something (V1 and V2) and Charades dataset.

LEGO at different stages. We study the network performance when the LEGO blocks are added to different stages of the network. The baseline model is ResNet-50 I3D. We add two LEGO blocks after the first and third bottlenecks of (1) res2, (2) res3, (3) res4 and (4) res5, and conduct the three-stage training discussed in Section 4.2. As shown in Table 3, the improvement from adding LEGO blocks on res3 and res4 is relatively large, and the latter is the best result of adding LEGO blocks to a single stage. We also find that the improvement is smaller when adding LEGO to deeper stages of the network. One possible explanation is that spatiotemporal correlation weakens as the network goes deeper, since high-level features are more linearly separable and high order information is therefore less important. One possible reason that high order blocks on res2 cannot obtain the maximum improvement is that the output size of res2 is 8 times larger than that of res3, so the receptive field is relatively small for learning the knowledge to extract features. Evidence for this can be found in the following study.

LEGO with different receptive fields. We continue the experiments on res2 and study how the size of the receptive field influences the improvement. In Section 3.5, we use two convolution layers, conv2 and conv3, to learn the knowledge, which determines the size of the receptive field. We replace conv3 with convolution layers of different kernel sizes in order to obtain different receptive fields and test the network performance. As shown in Table 3, larger receptive fields in the temporal dimension can improve accuracy. However, considering the trade-off between computational complexity and improvement, we keep the conv3 layer of Figure 4 for the building block.

Comparison to the state of the art. We compare the performance with the state-of-the-art approaches on the Something-Something V1 dataset. The results are summarized in Table 2. Our model achieves superior performance without pre-training on video datasets. We first compare with non-local neural networks, which add five non-local blocks to res3 and res4; we replace these blocks with our LEGO blocks and obtain a 1.3% improvement. We then add 2 more blocks on res2. Due to GPU memory limits, we do not use the three-stage training here but fine-tune from the LEGO-on-res3,4 model with a learning rate of 1e-4; this model obtains a further 0.2% improvement. We expect that with pre-training on the Kinetics dataset and the full three-stage training, our model could achieve better performance.

We also conduct an experiment comparing the performance of the non-local network and the masked non-local model mentioned in Section 3.4, in order to illustrate the role of high order operations in the non-local block. Given an input of size $T\times H\times W$, our masking strategy is to compute the pair-wise relation between each position and only the other positions in its neighbourhood, as in Equation 14.

Table 4 shows the number of pair-wise relations computed in the two models. In the masked non-local model, zero-padding is applied when computing the pair-wise relations. As Table 2 shows, the masked non-local model with only local spatial information performs similarly to the original non-local model with global information. Thus the high order property is the main contributor to the good performance of non-local neural networks.

Table 5 compares FLOPs and Top-1 accuracy relative to the baseline. Our high order model (the best model, with LEGO on res2,3,4) is more accurate than its I3D counterpart (45.9 vs. 41.6) while adding only a small amount of FLOPs (139G vs. 111G). This comparison shows that our method can be more effective than 3D convolutions used alone.

### 4.4 Results on Something-Something V2 dataset

We also investigate our models on the Something-Something V2 dataset. Table 6 compares with the previous results on this dataset. When adding high order blocks to the res3 and res4 stages, our high order ResNet-50 achieves 59.2% Top-1 accuracy. When adding high order blocks to the res2, res3 and res4 stages, it likewise achieves 59.2% Top-1 accuracy.

### 4.5 Results on Charades dataset

In this subsection we study the performance of high order neural networks on the Charades dataset. It is worth noting that our model is initialized with the ImageNet pre-trained ResNet-50 and trained directly on Charades, without fine-tuning on other video datasets.

We report our results in Table 7. The baseline I3D ResNet-50 achieves 31.8% mAP. Other state-of-the-art methods include the non-local network (33.5% mAP) and the joint GCN (36.2% mAP); combining these two methods achieves the best result on the leaderboard, with an mAP of 37.5%. By adding LEGO blocks to the res3 and res4 stages of the baseline I3D, our method achieves a 5.1% improvement in mAP, and we achieve another 0.2% gain by further adding LEGO blocks to the res2 stage. This improvement indicates the effectiveness of LEGO. Our model is expected to perform even better if pre-trained on the Kinetics dataset, as the NL+GCN model was.

## 5 Conclusion

In this paper, we applied high order neural networks to the task of video classification. We analyzed some state-of-the-art architectures from a high order point of view, explaining their underlying mechanism. As demonstrated on the Something-Something (V1 and V2) and Charades datasets, the proposed LEGO neural network achieves state-of-the-art results, even using RGB only and without fine-tuning on other video datasets. This performance improvement may be ascribed to the fact that high order expressions are more capable of capturing spatiotemporal correlations. In future work, we plan to investigate our method's ability in more visual tasks.

## References

• [1] E. H. Adelson and J. R. Bergen. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America, 2(2):284–299, 1985.
• [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
• [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Computer Vision and Pattern Recognition (CVPR), 2017.
• [4] H. Chen, Y. Lee, G. Sun, H. Lee, T. Maxwell, and C. L. Giles. High order correlation model for associative memory. In AIP Conference Proceedings, volume 151, pages 86–99. AIP, 1986.
• [5] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409, 2016.
• [6] C. L. Giles and T. Maxwell. Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26(23):4972–4978, 1987.
• [7] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Frusnd, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The ”something something” video database for learning and evaluating visual common sense. arXiv preprint arXiv:1706.04261, 2017.
• [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (CVPR), 2016.
• [9] M. Heywood and P. Noakes. A framework for improved training of sigma-pi networks. IEEE Transactions on Neural Networks, 6(4):893–903, 1995.
• [10] G. F. Hinton. A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the 7th international joint conference on Artificial intelligence-Volume 2, pages 683–685. Morgan Kaufmann Publishers Inc., 1981.
• [11] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. Computer Vision and Pattern Recognition (CVPR), 2018.
• [12] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, Nov 1998.
• [13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li. Large-scale video classification with convolutional neural networks. Computer Vision and Pattern Recognition (CVPR), 2014.
• [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Neural Information Processing Systems (NIPS), 2012.
• [15] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order boltzmann machine. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1243–1251. Curran Associates, Inc., 2010.
• [16] Y. Lee, G. Doolen, H. Chen, G. Sun, T. Maxwell, and H. Lee. Machine learning using a higher order correlation network. Technical report, Los Alamos National Lab., NM (USA); Maryland Univ., College Park (USA), 1986.
• [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. European Conference on Computer Vision (ECCV), 2016.
• [18] T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics, 2015.
• [19] F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic. Fine-grained video classification and captioning. arXiv preprint arXiv:1804.09235, 2018.
• [20] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. Neural Information Processing Systems (NIPS), 2014.
• [21] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. Computer Vision and Pattern Recognition (CVPR), 2015.
• [22] E. Niebur and C. Koch. Computational architectures for attention. In R. Parasuraman, editor, The Attentive Brain, chapter 9, pages 163–186. MIT Press, Cambridge, MA, 1998.
• [23] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. International Conference on Computer Vision (ICCV), 2017.
• [24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. Computer Vision and Pattern Recognition (CVPR), 2016.
• [25] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Neural Information Processing Systems (NIPS), 2015.
• [26] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press, 1986.
• [27] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
• [28] G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. Computer Vision and Pattern Recognition (CVPR), 2017.
• [29] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. European Conference on Computer Vision (ECCV), 2016.
• [30] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems (NIPS), 2014.
• [31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
• [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. Computer Vision and Pattern Recognition (CVPR), 2015.
• [33] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. International Conference on Computer Vision (ICCV), 2015.
• [34] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. Computer Vision and Pattern Recognition (CVPR), 2018.
• [35] J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modeling visual attention via selective tuning. Artif. Intell., 78(1-2):507–545, Oct. 1995.
• [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
• [37] L. Wang, W. Li, W. Li, and L. Van Gool. Appearance-and-relation networks for video classification. Computer Vision and Pattern Recognition (CVPR), 2018.
• [38] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. European Conference on Computer Vision (ECCV), 2016.
• [39] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
• [40] X. Wang and A. Gupta. Videos as space-time region graphs. European Conference on Computer Vision (ECCV), 2018.
• [41] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. European Conference on Computer Vision (ECCV), 2018.
• [42] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048–2057, Lille, France, 07–09 Jul 2015. PMLR.
• [43] H. Yang and C. C. Guest. High order neural networks with reduced numbers of interconnection weights. In Neural Networks, 1990., 1990 IJCNN International Joint Conference on, pages 281–286. IEEE, 1990.
• [44] B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. European Conference on Computer Vision (ECCV), 2018.
• [45] M. Zolfaghari, K. Singh, and T. Brox. Eco: Efficient convolutional network for online video understanding. In European Conference on Computer Vision (ECCV), 2018.