High Order Neural Networks for Video Classification
Abstract
Capturing spatiotemporal correlations is an essential topic in video classification. In this paper, we present high order operations as a generic family of building blocks for capturing high order correlations from the high dimensional input video space. We prove that several successful architectures for visual classification tasks belong to the family of high order neural networks; theoretical and experimental analysis demonstrates that their underlying mechanism is high order. We also propose a new LEarnable hiGh Order (LEGO) block, whose goal is to capture spatiotemporal correlations in a feedforward manner. Specifically, LEGO blocks implicitly learn the relation expressions for spatiotemporal features and use the learned relations to weight input features. This building block can be plugged into many neural network architectures, achieving evident improvement without introducing much overhead. On the task of video classification, even using RGB only without fine-tuning on other video datasets, our high order models can achieve results on par with or better than the existing state-of-the-art methods on both the Something-Something (V1 and V2) and Charades datasets.
1 Introduction
The last decade has witnessed great success of deep learning in computer vision. In particular, deep neural networks have been demonstrated to be effective models for tackling multiple visual tasks, such as image recognition [14, 8, 32, 31], object detection [25, 17, 24, 5] and video classification [13, 27, 30, 3].
Videos are high-dimensional and highly redundant data. As they can be viewed as a sequence of continuously changing images, the differences between adjacent frames are subtle. For example, in Fig. 1, is this an action of plugging a cable into a charger or pulling a cable out of a charger? The appearance in each frame is quite similar; only by looking at consecutive frames together can we figure out what this action is. Therefore, an effective video classification model should focus on capturing spatial and temporal movements. Existing state-of-the-art methods include ConvNets with temporal models [21], two-stream ConvNets [30, 38] and 3D ConvNets [33, 41, 3, 23]. They cascade a group of first order units to learn video representations.
Looking back at first order units, e.g. convolutional operations: they have the form $y = \sum_i w_i x_i$, which means the output of one convolutional layer is a linear combination of the input. Although activation functions can be added to introduce nonlinearity into the network, they are not learnable. Information beyond the weighted sum, e.g. spatiotemporal correlations, can be lost.
In this paper, we introduce high order operations as an efficient and generic component for capturing spatiotemporal correlations. A high order operation is a family of expressions that include terms computed by high order products of input features. For instance, $y = w_1 x_1 + w_2 x_2 + w_{12}\, x_1 x_2$ is an example of a second order representation, which consists of first order terms and a second order term.
There are several advantages of using high order operations: (a) Mathematically, high order expressions are more capable of fitting nonlinear functions than a linear combination. (b) A high order expression can learn attention from the video. This becomes clear when a second order term such as $w_{12}\, x_1 x_2$ is rewritten as $(w_{12}\, x_2)\, x_1$: one input acts as a data-dependent weight on the other. (c) Multiplicative connections in high order terms allow one unit to gate another, see Fig. 2. Specifically, if one unit of a multiplicative pair is zero, the other member of the pair can have no effect, no matter how strong its output is. On the other hand, if one unit of a pair is one, the output of the other is passed unchanged. Such pairwise interactions can encode prior knowledge in high order networks, which makes it easier to solve video classification problems. As in the plugging into / out of charger case, where the frames are highly similar with only minor differences, a feature selection mechanism, i.e. gating, needs to be considered.
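The gating behaviour in (c) can be sketched directly. The following toy numpy snippet is our own illustration, not part of any model in the paper; it shows how a zero gate blocks its partner while a unit gate passes it unchanged:

```python
import numpy as np

def gate(features, gates):
    """Element-wise multiplicative interaction between two groups of units."""
    return features * gates

features = np.array([3.0, -2.0, 5.0])
gates = np.array([0.0, 1.0, 0.5])

# A zero gate suppresses its partner no matter how strong it is;
# a gate of one passes its partner through unchanged.
out = gate(features, gates)
```

With these values, the first output is zero regardless of the strength of the first feature, and the second output equals its input feature exactly.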
We point out that several successful architectures or frameworks for visual classification tasks are in the family of high order neural networks: (a) Squeeze-and-Excitation Networks [11], (b) Appearance-and-Relation Networks [37], and (c) Non-local Neural Networks [39]. They use different priors to define the form of spatiotemporal relations, which leads to high order formulations. A limitation of these works is that they use predefined patterns, e.g. the square operation in ARTNet, which may limit the representation ability of the block. In this paper, we take a step towards addressing this problem. We propose a LEarnable hiGh Order (LEGO) block to implicitly learn the relation expressions for spatiotemporal features, and use the learned relations to weight input features. Intuitively, LEGO learns a kernel with context for each position in the feature map. Different positions in the feature map have different contexts and therefore have different learned kernels. This block can be plugged into many neural network architectures, achieving significant improvement without introducing much overhead.
We test the high order neural networks on the Something-Something (V1 and V2) and Charades datasets. Both datasets are challenging for action recognition in that appearance alone provides insufficient information for classification. Using RGB only, without fine-tuning on other video datasets and without any bells and whistles (e.g., optical flow, multi-scale testing), our method achieves results on par with or better than the existing state-of-the-art methods.
Our main contributions are summarized as follows. Firstly, we are among the first to introduce high order neural networks to the task of video classification. Secondly, we provide interpretations of some state-of-the-art architectures from a high order point of view, explaining their underlying mechanism. Thirdly, we introduce the LEarnable hiGh Order (LEGO) block, which obtains competitive results on both the Something-Something and Charades datasets.
2 Related Work
Video classification architectures. The high performance of 2D ConvNets in image classification tasks [14] makes it appealing to reuse them for video classification. Building on image classification models, several works have tried to design effective architectures [21, 30, 38, 33, 3, 23, 34, 41]. Ng et al. [21] combined a 2D ConvNet with an LSTM for video classification, where the 2D ConvNet acts as a spatial feature extractor and the LSTM is responsible for capturing temporal dependencies. Simonyan et al. [30] designed a two-stream architecture to capture appearance and motion information separately. The spatial stream uses RGB frames as inputs, while the temporal stream learns from optical flow. Wang et al. [38] further generalized this framework to learn long-range dependencies via temporal segments. While the two-stream framework turns out to be effective for video classification, it is time-consuming to train two networks and calculate optical flow in advance. Tran et al. [33] investigated 3D ConvNets to learn spatiotemporal features end-to-end. Similar to 2D ConvNets, 3D ConvNets learn a linear combination of the input, but with spatiotemporal filters. Because of the additional kernel dimension, 3D ConvNets are computationally expensive and harder to train. To overcome this limitation, some researchers tried to save computation by replacing 3D convolutions with separable convolutions [23, 34] or mixed convolutions [34, 41]. Meanwhile, Carreira et al. [3] introduced an inflation operation. It allows converting pretrained 2D models into 3D, which also diminishes the difficulty of training 3D ConvNets.
Attention mechanism. The mechanism of attention stems from the study of human vision. Because of the bottleneck of information processing [22], humans selectively focus on a subset of the available sensory information [35]. Itti et al. [12] proposed a neuronally inspired visual attention system, which combined multi-scale image features. Larochelle [15] described a model based on a Boltzmann machine that can learn how to accumulate information about a shape over several fixations. Mnih [20] trained task-specific policies using reinforcement learning methods; the model selects a sequence of processing regions adaptively. In [42] the authors applied attention mechanisms to the problem of generating image descriptions. The attention mechanism is also widely used in natural language processing. [2] used an attention matrix linking the expressions learned for each word in the source language to the words currently predicted to be translated. To address the deficiencies of global attention, [18] proposed a local attention mechanism that chooses to focus only on a small subset of the source positions per target word. [36] proposed a new architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art results on several translation tasks.
High order neural networks. High order neural networks are known to provide inherently more powerful mapping abilities than their first order brethren. The Sigma-Pi network [26] is a generalization of the multilayer perceptron (MLP), the main difference being the use of multiplicative units instead of simple additive units. Other research involving high order correlations includes high order conjunctive connections [10] and associative memories [4, 16]. General high order networks suffer from the problem that the number of weights increases combinatorially. Some researchers use prior knowledge to select the product terms. Yang et al. [43] investigated the parabolic neuron. Heywood et al. [9] proposed a dynamic weight pruning algorithm for identifying and removing redundant weights. Many recent works on video classification involve high order building blocks. Wang et al. [37] investigated a square-pooling architecture for relation modeling. Non-local neural networks [39] formulate the neuron response at a position as a weighted sum of feature embeddings at all positions in the input feature maps; both the weights and the feature embeddings are functions of the input features, which leads to a high order representation.
3 Methodology
We first review the mathematical formulation of convolutional layers (the first order neural network) and give a definition of high order neural networks. Next, we prove that several successful architectures for visual classification tasks are in the family of high order neural networks, demonstrating their underlying mechanism. At the end of this section, we propose a new learnable high order block.
3.1 Notation convention
In this section, we make an agreement on the notation to be used. Given video data (or features) with $T$ frames, where the size of each frame is $H \times W$ and the number of channels is $C$, the data is represented as:

(1)  $X = \{\, x_{(t,i,j)} \mid 1 \le t \le T,\; 1 \le i \le H,\; 1 \le j \le W \,\}, \quad x_{(t,i,j)} \in \mathbb{R}^{C}$

In Equation 1, $x_{(t,i,j)}$ is the feature at position $(t,i,j)$ on the feature map, with $1 \le t \le T$, $1 \le i \le H$, and $1 \le j \le W$.
The neighbourhood of position $(t,i,j)$ is defined as:

(2)  $\mathcal{N}(t,i,j) = \{\, (t+\delta_t,\; i+\delta_i,\; j+\delta_j) \mid |\delta_t| \le \lfloor k_t/2 \rfloor,\; |\delta_i| \le \lfloor k_h/2 \rfloor,\; |\delta_j| \le \lfloor k_w/2 \rfloor \,\}$
The convolutional layers we refer to have $D$ filters, and the kernel size of each filter is $k_t \times k_h \times k_w$. The stride is 1 and the padding is zero padding. Such a convolutional layer has $D \cdot C \cdot k_t k_h k_w$ parameters, and we use $w_{(d,c,\delta_t,\delta_i,\delta_j)}$ to represent a certain parameter. The output of $X$ after the convolutional layer is:

(3)  $Y = \{\, y_{(t,i,j)} \,\}, \quad y_{(t,i,j)} \in \mathbb{R}^{D}$
3.2 First order neural network
Convolutional layers are the most commonly used operations in visual tasks and they are first order operations.
The $d$-th element of $y_{(t,i,j)}$ is computed as (bias terms ignored):

(4)  $y_{(t,i,j,d)} = \sum_{c=1}^{C} \;\sum_{(t',i',j') \in \mathcal{N}(t,i,j)} w_{(d,c,\,t'-t,\,i'-i,\,j'-j)}\; x_{(t',i',j',c)}$

In Equation 4, $1 \le d \le D$. Given Equation 4, the outputs at position $(t,i,j)$ are computed as:

(5)  $y_{(t,i,j)} = \sum_{(t',i',j') \in \mathcal{N}(t,i,j)} W_{(t'-t,\,i'-i,\,j'-j)}\; x_{(t',i',j')}$

where each $W_{(\delta_t,\delta_i,\delta_j)} \in \mathbb{R}^{D \times C}$ collects the corresponding filter weights.
Equation 5 gives the general formulation of convolutional layers. It is a first order expression. The kernels commonly used in I3D [3] and P3D [23] architectures are in this form; thus the output features depend only on a weighted sum of the input features. Consider the case where two input features are very similar, which is common in videos because the differences between adjacent frames are subtle: such a convolutional layer will lose the action information that lies outside the weighted sum.
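The first order (linear) nature of Equation 5 can be checked numerically: scaling the input scales the output by the same factor. A minimal 1-D sketch, our own illustration rather than the paper's code:

```python
import numpy as np

def conv1d(x, w):
    """Toy 'valid' 1-D convolution: each output is a weighted sum of inputs."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 0.5])

y = conv1d(x, w)
# Linearity (homogeneity): convolving a scaled input scales the output.
y_scaled = conv1d(2.0 * x, w)
```

Any purely first order layer obeys this homogeneity; the high order terms introduced next deliberately break it.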
3.3 High order Formulation
Since most problems of interest are not linearly separable, a single layer of a first order neural network cannot deal with them. One alternative is to cascade a group of first order units, but this runs the risk of losing temporal information in the multi-layer transformation. We suggest adding high order expressions. Following [6], a generic high order operation in deep neural networks is defined as:
(6)  $y_{(t,i,j)} = b + \sum_{k=1}^{K} H_k(t,i,j)$

In Equation 6, $H_k(t,i,j)$ is the $k$-th order expression for position $(t,i,j)$; $b$ is the bias term, which is mostly ignored with batch normalization; and $H_1$ takes the formulation in Equation 5. We will focus on second and third order expressions. This is intuitive, as the mean and the variance are the most commonly used moments in statistics.
A common form of second order expression is the dot product:

(7)  $H_2(t,i,j) = \theta(X)_{(t,i,j)}^{\top}\; \phi(X)_{(t,i,j)}$

where $\theta$ and $\phi$ are first order (linear) transforms of the input. The Hadamard product is also a second order expression. Note that the Hadamard product can be transformed into a matrix multiplication by diagonalizing vectors into matrices, so we include this case in Equation 7.
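The diagonalization remark can be verified in one line of numpy; `a` and `b` here are arbitrary example vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

hadamard = a * b             # element-wise (Hadamard) product
via_matmul = np.diag(a) @ b  # same values via matrix multiplication
```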
3.4 Instantiations
Squeeze-and-Excitation Networks proposed the SE block. The SE block seeks to strengthen the representation ability of CNN layers through channel-wise relations, explicitly modeling the relations between channels. As Figure 3 (b) shows, the SE block uses a squeeze-and-excitation sequence to learn a set of channel-specific weights. The weights are multiplied channel-wise with the original inputs. The sequence is a first order operation, and the outputs after the scale step are produced by a high order expression:
(9)  $\tilde{x}_{(t,i,j,c)} = s_c(X)\; x_{(t,i,j,c)}, \quad s(X) = \sigma\big(W_2\, \delta(W_1\, z(X))\big)$

where $z(X)$ is the globally average pooled (squeezed) feature, $\delta$ is the ReLU function and $\sigma$ is the sigmoid function.
The key point is that the channel-wise weights are learned from first order operations, and thus the block is a high order operation. If they were instead plain zero order parameters, independent of the input, the SE-Inception block would turn into a trivial Inception block.
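A minimal numpy sketch of this channel-wise gating; the bottleneck weights `w1`, `w2` and all shapes are illustrative assumptions, not the SE paper's exact configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_gate(x, w1, w2):
    """SE-style gating sketch. x: (C, H, W) feature map."""
    z = x.mean(axis=(1, 2))                  # squeeze: global average pool, (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0))  # excitation: bottleneck FC layers, (C,)
    return x * s[:, None, None]              # scale: channel-wise re-weighting

rng = np.random.default_rng(0)
C, H, W, R = 8, 4, 4, 2                      # R: bottleneck (reduction) width
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((R, C))
w2 = rng.standard_normal((C, R))
y = se_gate(x, w1, w2)
```

Because the scale `s` is itself computed from `x` by first order operations, the product `x * s` is a high order function of the input.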
Appearance-and-Relation Networks proposed the SMART block for video classification. The SMART block seeks to model appearance and relation from RGB input in a separate and explicit manner. As Figure 3 (c) shows, there are two paths in the block for modeling appearance and relation separately. The novel part of the block is the square operation that turns each element $x$ into $x^2$. Without the square operation, the outputs before the concat layer would be a linear operation, as there is no activation layer in between; in that case the SMART block turns into a trivial Conv3d block. So the high order operation is the key point. The high order formulation of the SMART block is:
(10)  $H_2(t,i,j) = \sum_{k} v_k \big( w_k^{\top}\, x_{\mathcal{N}(t,i,j)} \big)^{2}$

where the $w_k$ are linear filters over the neighbourhood and the $v_k$ are cross-channel pooling weights.
The general formulation of SMART blocks is the energy model [1], which is an evident high order operation. Specifically, a hidden unit in the energy model is calculated as follows:

(11)  $h = \big( w^{\top} x + v^{\top} y \big)^{2} = x^{\top} w w^{\top} x + y^{\top} v v^{\top} y + 2\, x^{\top} L\, y, \quad L = w v^{\top}$

In Equation 11, $x$ and $y$ are patches from consecutive frames, and $L$ is the transformation (relation) between them.
Non-local Neural Networks proposed the non-local operation to capture long-range dependencies with deep neural networks. The generic non-local operation in deep neural networks is:

(12)  $y_{(t,i,j)} = \frac{1}{\mathcal{C}(X)} \sum_{(t',i',j')} f\big( x_{(t,i,j)},\, x_{(t',i',j')} \big)\; g\big( x_{(t',i',j')} \big)$

In Equation 12, $f$ is a pairwise function representing a relationship, such as affinity, between $x_{(t,i,j)}$ and $x_{(t',i',j')}$. The function $g$ computes a representation of the input at position $(t',i',j')$. $\mathcal{C}(X)$ is a normalization factor. Both $f$ and $g$ are functions of the input features, leading the non-local operation to a high order representation. In [39], some choices of the function $f$ are discussed:

$f\big( x_{(t,i,j)},\, x_{(t',i',j')} \big) = e^{\theta(x_{(t,i,j)})^{\top} \phi(x_{(t',i',j')})}$

where $\theta$ and $\phi$ are convolutional layers (see Figure 3 (d)). In this case, the non-local operation is a third order operation:

(13)  $y_{(t,i,j)} = \frac{1}{\mathcal{C}(X)} \sum_{(t',i',j')} e^{\theta(x_{(t,i,j)})^{\top} \phi(x_{(t',i',j')})}\; g\big( x_{(t',i',j')} \big)$

Another choice is $f\big( x_{(t,i,j)},\, x_{(t',i',j')} \big) = \mathrm{ReLU}\big( w_f^{\top}\, [\theta(x_{(t,i,j)}),\, \phi(x_{(t',i',j')})] \big)$, where $[\cdot\,,\cdot]$ denotes concatenation. In this case, the non-local operation is a second order operation.
To illustrate the role of high order operations in such blocks, we introduce a mask non-local operation:

(14)  $y_{(t,i,j)} = \frac{1}{\mathcal{C}(X)} \sum_{(t',i',j') \in \mathcal{N}(t,i,j)} f\big( x_{(t,i,j)},\, x_{(t',i',j')} \big)\; g\big( x_{(t',i',j')} \big)$

In Equation 14, $\mathcal{N}(t,i,j)$ is the neighbourhood of position $(t,i,j)$ defined in Equation 2, which means only relations between adjacent positions are computed, since relations between faraway positions tend to be relatively weaker. In Section 4, we conduct experiments showing that the mask non-local model performs on par with the non-local model when only neighbouring positions in the feature map are covered in the pairwise relation computation, which reduces the computational cost by a large amount.
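The non-local operation and its masked variant can be sketched together in numpy. We assume identity embeddings for θ, φ and g and a boolean mask argument, purely for illustration; this is not the paper's implementation:

```python
import numpy as np

def nonlocal_op(x, mask=None):
    """Non-local operation sketch with theta = phi = g = identity.

    x: (N, C) features at N positions. mask, if given, is an (N, N) boolean
    array restricting which pairwise relations are computed (each row should
    keep at least one True entry, e.g. the position itself).
    """
    logits = x @ x.T                           # pairwise affinities f(x_i, x_j)
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax normalization C(x)
    return weights @ x                         # weighted sum of g(x_j)

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 3))

full = nonlocal_op(x)
masked = nonlocal_op(x, mask=np.ones((6, 6), dtype=bool))  # all-True mask == no mask
```

Passing a neighbourhood-shaped mask instead of the all-True one restricts the softmax to adjacent positions, which is the saving Equation 14 exploits.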
3.5 LEGO block
In order to verify the effectiveness of high order operations, we propose a simple network block with no bells and whistles: the LEarnable hiGh Order operation, termed LEGO.
General high order networks suffer from a combinatorial growth in the number of parameters. Several related works use prior knowledge (e.g., the square operation in ARTNet) to select a subset of the product terms in Equation 7. Unlike these explicit constraints, which may limit the representation ability of the block, we use a more general learnable operation:
(15)  $y_{(t,i,j)} = \sum_{(t',i',j') \in \mathcal{N}(t,i,j)} w_{(t,i,j)}(X)_{(t',i',j')}\; x_{(t',i',j')}$

where the kernel $w_{(t,i,j)}(X)$ is itself generated from the input features.
As shown in Equation 15, different positions on the feature map are given different kernels, which are functions of the input features. The LEGO block is different from the local convolution operation, which also does not share kernels across positions: as a first order operation, local convolution lacks input context. We assume that the knowledge used to generate the kernels is the same at all locations. Given a relatively large receptive field, LEGO can learn such knowledge to generate different kernels at different positions.
The implementation details are shown in Figure 4. Conv1 denotes a convolution for embedding, and conv4 denotes a convolution that restores the output channel number. The orange boxes denote convolutions for learning the knowledge for generating a kernel. The tensor shape after the orange boxes is $T \times H \times W \times k$, where $k$ is the size of the kernel. In Figure 4, we set the number of filters in conv3 to 27, so each position gets one $3 \times 3 \times 3$ kernel. The operation $\oplus$ denotes element-wise sum, and the operation $\otimes$ applies the kernels to the corresponding positions. Given two tensors $K$ and $X$, where $K \in \mathbb{R}^{T \times H \times W \times k}$, the operation $\otimes$ is defined as:
(16)  $(K \otimes X)_{(t,i,j)} = \sum_{p=1}^{k} K_{(t,i,j,p)}\; x_{\mathcal{N}(t,i,j)[p]}$

In Equation 16, $K_{(t,i,j,p)}$ is the $p$-th element of the kernel at position $(t,i,j)$, and the index $p$ enumerates the positions of $\mathcal{N}(t,i,j)$ in a fixed order.
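The kernel-application step can be sketched in a simplified 1-D setting. The function name, the 1-D layout and the zero padding are our illustrative assumptions; the actual block applies 3×3×3 kernels over T×H×W positions:

```python
import numpy as np

def apply_position_kernels(x, kernels):
    """Apply a distinct kernel at every position (no weight sharing).

    x: 1-D signal of length N; kernels: (N, k) array holding one kernel per
    position, as would be produced by the kernel-generating branch.
    """
    n, k = kernels.shape
    r = k // 2
    xp = np.pad(x, r)                 # zero padding at the borders
    out = np.empty(n)
    for i in range(n):
        # unlike shared-weight convolution, position i uses its own kernel
        out[i] = np.dot(kernels[i], xp[i:i + k])
    return out

rng = np.random.default_rng(2)
x = rng.standard_normal(5)
kernels = rng.standard_normal((5, 3))
y = apply_position_kernels(x, kernels)
```

As a quick sanity check, the identity kernel [0, 1, 0] at every position reduces the operation to the identity map.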
4 Experiments on Video Classification
In this section we describe the experimental results of our method. First, we introduce the action recognition datasets and the evaluation settings. Next, we study different aspects of our proposed LEGO on the Charades and Something-Something (V1 and V2) datasets and compare with the state-of-the-art methods.
4.1 Datasets
We evaluate the performance of high order neural networks on two action recognition benchmarks: Charades [29] and Something-Something [7, 19].
The Something-Something dataset is a recent video dataset for human-object interaction recognition. It has two releases. The V1 dataset has 86K training videos, around 12K validation videos and 11K testing videos. The V2 dataset has 220K videos, more than twice as many as V1: 169K training videos, around 25K validation videos and 27K testing videos. The number of classes in this dataset is 174, and some of the ambiguous activity categories are challenging, such as 'Pushing something from left to right' versus 'Pushing something from right to left', or 'Poking something so lightly that it doesn't or almost doesn't move' versus 'Pushing something so that it slightly moves'. We can see that the temporal relations and transformations of the objects, rather than the appearance of the objects, characterize the activities in this dataset.
The Charades dataset is a dataset of daily indoor activities, consisting of 8K training videos and 1.8K validation videos. The average video duration is 30 seconds. There are 157 action classes in this dataset, and multiple actions can happen at the same time.
4.2 Implementation Details
layer  output size

conv1  64, stride 1,2,2
pool1  max, stride 1,2,2
res2
pool2  max, stride 2,1,1
res3
res4
res5
global average pool and fc  1×1×1
The model is initialized with the pretrained ImageNet ResNet-50 model from the PyTorch torchvision library. Following [3], a 3D kernel with t×k×k dimensions can be inflated from a 2D k×k kernel by repeating the weights t times along the time dimension and rescaling them by dividing by t. We do not use any other video datasets for pretraining.
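The inflation recipe can be written down and sanity-checked in a few lines of numpy (a sketch with a toy kernel, not the torchvision loading code):

```python
import numpy as np

def inflate_kernel(w2d, t):
    """Inflate a 2D k x k kernel into a t x k x k kernel: repeat the weights
    t times along the time axis and rescale by dividing by t."""
    return np.repeat(w2d[None, :, :], t, axis=0) / t

w2d = np.arange(9, dtype=float).reshape(3, 3)
w3d = inflate_kernel(w2d, t=4)

# Sanity check: on a "boring video" of t identical frames, the inflated
# kernel produces the same response as the 2D kernel on a single frame.
frame = np.ones((3, 3))
video = np.repeat(frame[None, :, :], 4, axis=0)

resp_2d = np.sum(w2d * frame)
resp_3d = np.sum(w3d * video)
```

This preservation of the 2D response on temporally constant inputs is what makes the ImageNet initialization meaningful for the 3D model.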
The training consists of three stages; take the training for the Something-Something V1 dataset as an example. At all stages, the video frames are first resized to the [256, 320] dimension range and then randomly cropped to a spatial 224×224 dimension.
At stage 1, we train the I3D baseline model on the datasets with 8 input frames per clip. Since the videos in the datasets are relatively short, the 8 frames are sampled at a frame rate of 6 fps. The stage 1 model is trained on a 4-GPU machine where each GPU holds 20 video clips in a mini-batch. The learning rate is kept at 0.01 while training for 20 epochs.
At stage 2, we add 5 LEGO blocks to the I3D model. The LEGO blocks are added following [39] (3 to res4 and 2 to res3, to every other residual block). The stage 2 model takes 16 video frames as inputs, sampled at a frame rate of 12 fps. The stage 2 model is fine-tuned from the stage 1 model on a 4-GPU machine where each GPU holds 6 video clips in a mini-batch. The learning rate is 1e-2 for the first 5 epochs and 1e-3 for the next 10 epochs.
At stage 3, we continue fine-tuning from the stage 2 model. The stage 3 model takes 32 video frames sampled at a frame rate of 12 fps as inputs and is trained on a 4-GPU machine where each GPU holds 2 video clips in a mini-batch. The learning rate is 1e-3 for the first 5 epochs and 1e-4 for the next 10 epochs.
Method  Pretrain dataset  Input size  Backbone  Modality  Top-1 Acc. (%)

Multi-Scale TRN [44]  ImageNet  -  Inception  RGB  33.6
ECO [45]  -  multi-input ensemble  Inception + 3D ResNet-18  RGB+Flow  43.9
I3D [3]  ImageNet, Kinetics  -  ResNet-50  RGB  41.6
NL I3D [39]  ImageNet, Kinetics  -  ResNet-50  RGB  44.6
NL I3D + GCN [40]  ImageNet, Kinetics  -  ResNet-50  RGB  46.1
Mask NL  ImageNet  -  ResNet-50  RGB  44.5
LEGO on res3,4  ImageNet  -  ResNet-50  RGB  45.7
LEGO on res2,3,4  ImageNet  -  ResNet-50  RGB  45.9
The purpose of the three-stage training is to reduce the loss caused by small-batch training (batch size 16). The stage 1 and 2 models can learn relatively stable parameters for the batch normalization layers of the final model. The loss functions we use are the cross entropy loss for the Something-Something datasets and BCEWithLogitsLoss for the Charades dataset (multi-class and multi-label).
At test time, we use fully-convolutional testing in the [256, 320] dimension space and sample 30 clips per video. The final predictions are based on the averaged softmax scores of all clips.
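The clip-score averaging can be sketched as follows; the logits below are toy values for three clips and four classes, whereas the paper averages over 30 clips:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Per-clip logits: (num_clips, num_classes). Softmax each clip, then average
# the probabilities to obtain the video-level prediction.
clip_logits = np.array([[2.0, 1.0, 0.1, -1.0],
                        [1.5, 1.2, 0.0, -0.5],
                        [2.2, 0.8, 0.3, -1.2]])
video_scores = softmax(clip_logits).mean(axis=0)
prediction = int(np.argmax(video_scores))
```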
4.3 Results on SomethingSomething V1 datasets
We first study different aspects of our proposed LEGO on the Something-Something V1 dataset, and then compare with the state-of-the-art methods on the Something-Something (V1 and V2) and Charades datasets.
LEGO at different stages. We study the network performance when the LEGO blocks are added at different stages of the network. The baseline model is ResNet-50 I3D. We add two LEGO blocks after the first and third bottleneck of each of the four residual stages in turn and conduct the three-step training discussed in Section 4.2. As shown in Table 3, the improvement from adding LEGO blocks to the first two stages is relatively large, and the latter gives the best single-stage result. We also find that the improvement shrinks when adding LEGO to deeper stages of the network. One possible explanation is that spatiotemporal correlation weakens as the network goes deeper: high level features are more linearly separable, so high order information is less important. One possible reason why high order at the shallowest stage cannot get the maximum improvement is that its output size is 8 times larger, so the receptive field is relatively small for learning the knowledge for extracting features. Evidence can be found in the following study.
model  Top-1 Acc. (%)

baseline  41.6
stage to add  res2  43.5
  res3  43.7
  res4  43.2
  res5  42.9
receptive field  43.3
  43.5
  43.6
LEGO with different receptive fields. We continue the experiments on this stage and study how the size of the receptive field influences the improvement. In Section 3.4, we use two single convolution layers (the orange boxes in Figure 4) to learn the knowledge. We replace the first of them with convolution layers of different kernel sizes in order to obtain different receptive fields, and test the network performance. As shown in Table 3, larger receptive fields in the temporal dimension improve the accuracy. But considering the trade-off between computational complexity and the improvement, we keep the original convolution layer in Figure 4 for the building block.
Comparison to the state of the art. We compare the performance with the state-of-the-art approaches on the Something-Something V1 dataset. The results are summarized in Table 2. Our model achieves superior performance without pretraining on video datasets. We first compare with non-local neural networks. Non-local neural networks add five non-local blocks on res3 and res4; we replace those blocks with our LEGO blocks and get a 1.3% improvement. We then add 2 more blocks on res2. Due to GPU memory limits, we do not use the three-step training here but fine-tune from the LEGO-on-res3,4 model with a learning rate of 1e-4. This model gains a further 0.2%. We expect that with pretraining on the Kinetics dataset and the three-step training applied, our model would perform better still.
We also conduct an experiment comparing the performance of the non-local network and the mask non-local model, as mentioned in Section 3.4, in order to illustrate the role of high order operations in the non-local block. Given an input of size T×H×W, our mask strategy is to compute the pairwise relations between a position and only the other positions in its neighbourhood. Table 4 shows the number of pairwise relations computed in the two models. In the mask non-local model, zero padding is applied when computing the pairwise relations. As Table 2 shows, the mask non-local model with local spatial information performs similarly to the original non-local model with global information. Thus the high order property is the main contributor to the good performance of non-local neural networks.
model, ResNet-50  res3  res4

non-local  16×28×28  16×14×14
mask non-local  16×13×13  16×7×7
Table 5 compares FLOPs and Top-1 accuracy relative to the baseline. Our high order model (the best model, with LEGO on res2,3,4) is more accurate than its I3D counterpart (45.9% vs. 41.6%), while adding only a small amount of FLOPs (139G vs. 111G). This comparison shows that our method can be more effective than 3D convolutions used alone.
model, ResNet-50  FLOPs  Top-1 Acc. (%)

I3D  111G  41.6
NL I3D  335G  44.6
Ours  139G  45.9
4.4 Results on SomethingSomething V2 dataset
We also investigate our models on the Something-Something V2 dataset. Table 6 shows comparisons with previous results on this dataset. When adding high order blocks to the res3 and res4 stages, our high order ResNet-50 achieves 59.2% Top-1 accuracy. When adding high order blocks to the res2, res3 and res4 stages, it achieves 59.6% Top-1 accuracy.
model  backbone  Top-1 Acc. (%)

Multi-Scale TRN [44]  Inception  56.2
LEGO on res3,4  ResNet-50  59.2
LEGO on res2,3,4  ResNet-50  59.6
4.5 Results on Charades dataset
In this subsection we study the performance of high order neural networks on the Charades dataset. It is worth noting that our model is initialized with the pretrained ImageNet ResNet-50 and trained directly on Charades, without fine-tuning on other video datasets.
model  backbone  mAP

Two-Stream [28]  VGG-16  18.6
Multi-Scale TRN [44]  Inception  25.2
I3D [3]  ResNet-50  31.8
I3D [3]  Inception  32.9
I3D [39]  ResNet-101  35.5
NL I3D [39]  ResNet-50  33.5
GCN [40]  ResNet-50  36.2
NL I3D + GCN [40]  ResNet-50  37.5
LEGO on res3,4  ResNet-50  36.9
LEGO on res2,3,4  ResNet-50  37.1
We report our results in Table 7. The baseline I3D ResNet-50 approach achieves 31.8% mAP. Other state-of-the-art methods include non-local networks (with an mAP of 33.5%) and joint GCN (36.2% mAP); combining the two achieves the best result on the leaderboard, with an mAP of 37.5%. By adding LEGO blocks to the res3 and res4 stages of the baseline I3D, our method achieves a 5.1% improvement in mAP, and we gain another 0.2% by additionally adding LEGO blocks to the res2 stage. The improvement indicates the effectiveness of LEGO. Our model is expected to perform better still if pretrained on the Kinetics dataset, as the NL+GCN model was.
5 Conclusion
In this paper, we applied high order neural networks to the task of video classification. We analyzed some state-of-the-art architectures from a high order point of view, explaining their underlying mechanism. As demonstrated on the Something-Something (V1 and V2) and Charades datasets, the proposed LEGO neural network is able to achieve state-of-the-art results, even using RGB only without fine-tuning on other video datasets. This performance improvement may be ascribed to the fact that high order expressions are more capable of capturing spatiotemporal correlations. In future work, we plan to investigate our method's ability on more visual tasks.
References
 [1] E. H. Adelson and J. R. Bergen. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America, 2(2):284–299, 1985.
 [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
 [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Computer Vision and Pattern Recognition (CVPR), 2017.
 [4] H. Chen, Y. Lee, G. Sun, H. Lee, T. Maxwell, and C. L. Giles. High order correlation model for associative memory. In AIP Conference Proceedings, volume 151, pages 86–99. AIP, 1986.
 [5] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409, 2016.
 [6] C. L. Giles and T. Maxwell. Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26(23):4972–4978, 1987.
 [7] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The "something something" video database for learning and evaluating visual common sense. arXiv preprint arXiv:1706.04261, 2017.
 [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (CVPR), 2016.
 [9] M. Heywood and P. Noakes. A framework for improved training of sigma-pi networks. IEEE Transactions on Neural Networks, 6(4):893–903, 1995.
 [10] G. F. Hinton. A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, pages 683–685. Morgan Kaufmann Publishers Inc., 1981.
 [11] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. Computer Vision and Pattern Recognition (CVPR), 2018.
 [12] L. Itti, C. Koch, and E. Niebur. A model of saliencybased visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, Nov 1998.
 [13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li. Large-scale video classification with convolutional neural networks. Computer Vision and Pattern Recognition (CVPR), 2014.
 [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Neural Information Processing Systems (NIPS), 2012.
 [15] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1243–1251. Curran Associates, Inc., 2010.
 [16] Y. Lee, G. Doolen, H. Chen, G. Sun, T. Maxwell, and H. Lee. Machine learning using a higher order correlation network. Technical report, Los Alamos National Lab., NM (USA); Maryland Univ., College Park (USA), 1986.
 [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. European Conference on Computer Vision (ECCV), 2016.
 [18] T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics, 2015.
 [19] F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic. Fine-grained video classification and captioning. arXiv preprint arXiv:1804.09235, 2018.
 [20] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems 27, pages 2204–2212, 2014.
 [21] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. Computer Vision and Pattern Recognition (CVPR), 2015.
 [22] E. Niebur and C. Koch. Computational architectures for attention. In R. Parasuraman, editor, The Attentive Brain, chapter 9, pages 163–186. MIT Press, Cambridge, MA, 1998.
 [23] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. International Conference on Computer Vision (ICCV), 2017.
 [24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. Computer Vision and Pattern Recognition (CVPR), 2016.
 [25] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Neural Information Processing Systems (NIPS), 2015.
 [26] D. E. Rumelhart and J. L. McClelland. Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations. MIT Press, 1986.
 [27] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
 [28] G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. Computer Vision and Pattern Recognition (CVPR), 2017.
 [29] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. European Conference on Computer Vision (ECCV), 2016.
 [30] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems (NIPS), 2014.
 [31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. Computer Vision and Pattern Recognition (CVPR), 2015.
 [33] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. International Conference on Computer Vision (ICCV), 2015.
 [34] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. Computer Vision and Pattern Recognition (CVPR), 2018.
 [35] J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modeling visual attention via selective tuning. Artificial Intelligence, 78(1–2):507–545, Oct. 1995.
 [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
 [37] L. Wang, W. Li, W. Li, and L. Van Gool. Appearance-and-relation networks for video classification. Computer Vision and Pattern Recognition (CVPR), 2018.
 [38] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. European Conference on Computer Vision (ECCV), 2016.
 [39] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. Computer Vision and Pattern Recognition (CVPR), 2018.
 [40] X. Wang and A. Gupta. Videos as space-time region graphs. European Conference on Computer Vision (ECCV), 2018.
 [41] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. European Conference on Computer Vision (ECCV), 2018.
 [42] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048–2057, Lille, France, 07–09 Jul 2015. PMLR.
 [43] H. Yang and C. C. Guest. High order neural networks with reduced numbers of interconnection weights. In 1990 IJCNN International Joint Conference on Neural Networks, pages 281–286. IEEE, 1990.
 [44] B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. European Conference on Computer Vision (ECCV), 2018.
 [45] M. Zolfaghari, K. Singh, and T. Brox. Eco: Efficient convolutional network for online video understanding. In European Conference on Computer Vision (ECCV), 2018.