Global Second-order Pooling Convolutional Networks
Abstract
Deep Convolutional Networks (ConvNets) are fundamental not only to large-scale visual recognition but also to many other vision tasks. As the primary goal of ConvNets is to characterize the complex boundaries of thousands of classes in a high-dimensional space, it is critical to learn higher-order representations for enhancing non-linear modeling capability. Recently, Global Second-order Pooling (GSoP), plugged at the end of networks, has attracted increasing attention, achieving much better performance than classical, first-order networks in a variety of vision tasks. However, how to effectively introduce higher-order representations into earlier layers for improving the non-linear capability of ConvNets remains an open problem. In this paper, we propose a novel network model that introduces GSoP from lower to higher layers for exploiting holistic image information throughout a network. Given an input 3D tensor output by some previous convolutional layer, we perform GSoP to obtain a covariance matrix which, after non-linear transformation, is used for tensor scaling along the channel dimension. Similarly, we can perform GSoP along the spatial dimension for tensor scaling as well. In this way, we can make full use of the second-order statistics of the holistic image throughout a network. The proposed networks are thoroughly evaluated on large-scale ImageNet-1K, and experiments have shown that they non-trivially outperform their counterparts while achieving state-of-the-art results.
1 Introduction
Deep Convolutional Networks (ConvNets) are fundamental to the field of computer vision, since they are not only paramount for high accuracy in large-scale object recognition, but also play central roles, by means of pre-trained models, in substantially advancing many other computer vision tasks, e.g., object detection [28], semantic segmentation [26] and video classification [34]. Given color images as inputs, ConvNets progressively learn low-level, mid-level and high-level features [41], finally producing global image representations connected to a softmax layer for classification. To better characterize the complex boundaries of thousands of classes in a very high-dimensional space, one possible solution is to learn higher-order representations for enhancing the non-linear modeling capability of ConvNets.
Recently, modeling of higher-order statistics for more discriminative image representations has attracted great interest in deep ConvNets. Global second-order pooling (GSoP), producing covariance matrices as image representations, has achieved state-of-the-art results in a variety of vision tasks [21, 2, 32, 35] such as object recognition, fine-grained visual categorization, object detection and video classification. The pioneering works, i.e., DeepO2P [17] and bilinear CNN (B-CNN) [25], performed global second-order pooling, rather than the commonly used global average (i.e., first-order) pooling (GAvP) [24], after the last convolutional layers in an end-to-end manner. However, most variants of GSoP [6, 1] focused only on small-scale scenarios. In large-scale visual recognition, MPN-COV [22, 21] has shown that matrix power normalized GSoP can significantly outperform global average pooling.
Though GSoP plugged at the end of a network has proven successful, how to effectively introduce higher-order representations into earlier layers for improving the non-linear capability of ConvNets is still an open problem. Several works [23, 36, 42] have attempted to enhance non-linear modeling capability using quadratic transformations to model feature interactions, instead of only using the linear transformations of convolutions. However, the performance gains of these methods are limited in large-scale visual recognition. Motivated by Squeeze-and-Excitation (SE) networks [14], we introduce GSoP from lower to higher layers of deep ConvNets, aiming to learn more discriminative representations by exploiting the second-order statistics of the holistic image throughout a deep ConvNet.
At the heart of our global second-order networks is the GSoP block, which can be conveniently plugged into any location of a deep ConvNet. Given a 3D tensor output by some previous convolutional layer, we first perform GSoP to model pairwise channel correlations of the holistic tensor. We then accomplish embedding of the resulting covariance matrix by convolutions and non-linear activations, which is finally used for scaling the 3D tensor along the channel dimension. The diagram of our GSoP convolutional network (GSoP-Net) is presented in Figure 1(a) and the proposed second-order block is illustrated in Figure 1(b). The primary differences of the proposed GSoP-Net from existing networks are compared in Table 1, which will be detailed in the next section. Our main contributions are threefold. (1) Distinct from existing methods which exploit second-order statistics only at the network end, we are among the first to introduce this modeling into intermediate layers for making use of holistic image information in earlier stages of deep ConvNets. By modeling the correlations of the holistic tensor, the proposed blocks can capture long-range statistical dependency [34], making full use of the contextual information in the image. (2) We design a simple yet effective GSoP block, which is highly modular with low memory and computational complexity. The GSoP block, which is able to capture global second-order statistics along the channel dimension or position dimension, can be conveniently plugged into existing network architectures, further improving their performance with small overhead. (3) On the ImageNet benchmark, we perform a thorough ablation study of the proposed networks, analyzing the characteristics and behaviors of the proposed GSoP block. Extensive comparison with the counterparts has shown the competitiveness of our networks.
2 Related Works
GAvP (1st-order) In-between Network.
Global average pooling plugged at the end of a network [24], which summarizes first-order statistics (i.e., the mean vector) as image representations, has been widely used in most deep ConvNets such as ResNet [10], Inception [30] and DenseNet [16]. For the first time, SENet [14] introduced GAvP in-between the network for making use of holistic image context at earlier stages, reporting significant improvement over its network-end counterpart. SENet consists of two modules: a squeeze module accomplishing global average pooling followed by convolution and non-linear activations for capturing channel dependency, and an excitation module scaling channels for recalibration. Besides GAvP along the channel dimension, CBAM [37] extends the idea of SENet, combining GAvP along the channel dimension as well as the spatial dimension for accomplishing self-attention. Compared to SENet and CBAM, which use only first-order statistics (means) of the holistic image, our GSoP-Net exploits second-order statistics (correlations), having stronger modeling capability.
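As a point of reference for the second-order blocks discussed later, SENet's first-order squeeze-and-excitation can be sketched as follows (a minimal numpy sketch; the weight shapes and reduction ratio are illustrative assumptions, not the exact SENet configuration):

```python
import numpy as np

def se_block(x, w1, w2):
    """First-order squeeze-and-excitation on a tensor x of shape (h, w, c).

    w1: (c, c // r) reduction weights; w2: (c // r, c) expansion weights.
    """
    # Squeeze: global average pooling yields one mean per channel (1st-order statistics).
    z = x.mean(axis=(0, 1))                       # (c,)
    # Excitation: bottleneck MLP with ReLU, then sigmoid to get per-channel weights.
    s = np.maximum(z @ w1, 0.0) @ w2              # (c,)
    s = 1.0 / (1.0 + np.exp(-s))                  # sigmoid, values in (0, 1)
    # Recalibration: scale each channel of x by its weight.
    return x * s                                  # broadcasts over (h, w, c)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 4)) * 0.1           # reduction ratio r = 4 (assumed)
w2 = rng.standard_normal((4, 16)) * 0.1
y = se_block(x, w1, w2)
```

Because the squeeze step keeps only one mean per channel, all higher-order channel interactions are discarded before the excitation MLP, which is exactly the limitation the second-order blocks below address.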
GSoP (2nd-order) at Network End.
Global second-order pooling, plugged at the network end and trainable in an end-to-end manner, has received great interest, achieving significant performance improvement [2, 22, 21]. Several researchers [6, 2, 1] have shown close connections between higher-order pooling and kernel machines, based on which they proposed explicit mapping functions as kernel approximations for compactness of covariance representations. Wang et al. [33] proposed a global Gaussian distribution embedding network (G2DeNet), where one multivariate Gaussian, identified as a symmetric positive definite matrix of the covariance matrix and mean vector [20], is plugged at the network end. MoNet [38] proposed a sub-matrix square-root layer, making G2DeNet compact. In [3], first-order information is combined with second-order information, achieving consistent improvements over standard bilinear networks on texture recognition. In all the aforementioned works, second-order modeling is exploited only at the end of deep networks.
Table 1: Global pooling used in existing networks and ours.

network                                 | in-between network | end of network
classical ConvNets (e.g., AlexNet/VGG)  | NA                 | NA
ResNet / Inception / DenseNet           | NA                 | 1st-order
SENet                                   | 1st-order          | 1st-order
MPN-COV                                 | NA                 | 2nd-order
GSoP-Net (ours)                         | 2nd-order          | 2nd-order
Quadratic Transformation Networks.
Conventional networks depend heavily on linear convolution operations. Several researchers take a step further to explore higher-order transformations for enhancing the non-linear modeling capability of deep networks. The Second-order Response Transform (SORT) [36] develops a two-branch network module combining the responses of two convolutional blocks with the multiplication of the responses, performing element-wise square root to normalize the second-order term. In [23], a factorized bilinear network (FBN) is proposed to model pairwise feature interactions. By constraining the rank of the quadratic transformation matrix, FBN can introduce bilinear pooling into intermediate layers. Zoumpourlis et al. [42] introduce Volterra kernel-based convolutions, which can model first-, second- or higher-order interactions of data, serving as approximations of non-linear functionals. All the works above are concerned with non-linear filters, applied only to a local neighborhood, just like linear convolution. In contrast, our GSoP networks collect the second-order statistics of the holistic image for enhancing the non-linear capability of deep networks.
3 Global Second-order Pooling Network
We illustrate the proposed GSoP-Net in Figure 1(a). Note that the second-order pooling block we designed can be conveniently inserted after any convolutional layer. By introducing this block in intermediate layers, we can model higher-order statistics of the holistic image at early stages, enhancing the non-linear modeling capability of deep ConvNets.
In practice, we build two network architectures. In the first, besides the GSoP blocks in-between the network, one more GSoP block is used at the network end, followed by common global average pooling that produces a mean vector as a compact image representation; we call this GSoP-Net1. Alternatively, at the network end we can adopt matrix power normalized covariance matrices as image representations [22], called GSoP-Net2, which are more discriminative yet high-dimensional.
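The matrix power normalization used at the end of GSoP-Net2 amounts to a matrix square root of the covariance matrix; following the general recipe of iSQRT-COV [21], it can be approximated with GPU-friendly Newton-Schulz iterations. A minimal numpy sketch (the trace pre-normalization and iteration count follow the general scheme; the exact hyper-parameters in [21] may differ):

```python
import numpy as np

def matrix_sqrt_ns(a, n_iter=20):
    """Approximate the square root of an SPD matrix via Newton-Schulz iteration.

    Pre-normalizing by the trace puts the spectrum in (0, 1), which
    guarantees convergence; the scale is compensated at the end.
    """
    n = a.shape[0]
    tr = np.trace(a)
    y = a / tr                          # spectrum now in (0, 1)
    z = np.eye(n)
    for _ in range(n_iter):
        t = 0.5 * (3.0 * np.eye(n) - z @ y)
        y = y @ t                       # y -> (a / tr)^{1/2}
        z = t @ z                       # z -> (a / tr)^{-1/2}
    return np.sqrt(tr) * y              # undo the trace normalization

rng = np.random.default_rng(0)
b = rng.standard_normal((4, 4))
a = b @ b.T + np.eye(4)                 # a well-conditioned SPD matrix
s = matrix_sqrt_ns(a)
# s @ s recovers a up to iteration error
```

The iteration uses only matrix multiplications, which is why it is preferred over SVD- or EIG-based square roots on GPU.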
3.1 Global Second-order Pooling Block
Figure 1(b) shows the diagram of the key module of our network, i.e., the GSoP block. Similar to [14], the block consists of two modules, i.e., a squeeze module and an excitation module. The squeeze module aims to model the second-order statistics along the channel dimension of the input tensor. We are given a 3D tensor of size h × w × c as input, where h and w are the spatial height and width and c is the number of channels. First, we use a 1×1 convolution to reduce the number of channels from c to c′ (c′ < c), decreasing the computational cost of the following operations. For the tensor of reduced dimensionality, we compute pairwise channel correlations, obtaining one c′ × c′ covariance matrix. The resulting covariance matrix has a clear physical meaning: its i-th row indicates the statistical dependency of channel i with all channels. As the quadratic operations involved change the order of the data, we perform row-wise normalization for the covariance matrix, respecting its inherent structural information. In contrast, SENet uses global first-order pooling, which can only summarize the mean of individual channels, having limited statistical modeling capability.
In the excitation module, prior to channel scaling, we perform two consecutive operations of convolution plus non-linear activation for covariance matrix embedding. To maintain the structural information, the covariance matrix is subject to a row-wise convolution, which is followed by a Leaky Rectified Linear Unit (LReLU). Then we perform a second convolution, this time with the sigmoid function as the non-linear activation, outputting a weight vector. We finally perform a dot product between the weight vector and the channels. Individual channels are thus emphasized or suppressed in a soft manner according to the weights.
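Putting the squeeze and excitation steps together, the channel-wise GSoP block can be sketched as below (numpy, with small illustrative sizes; the grouped row-wise convolution is written as an einsum, and the weight shapes and the 4-output-per-row choice are assumptions for illustration rather than the released configuration):

```python
import numpy as np

def channelwise_gsop(x, w_reduce, w_row, w_expand, slope=0.1):
    """Channel-wise GSoP block on x of shape (h, w, c).

    w_reduce: (c, c2)      1x1 conv reducing channels c -> c2
    w_row:    (c2, c2, 4)  row-wise grouped conv: each covariance row
                           is convolved with its own 4 filters
    w_expand: (4 * c2, c)  second conv producing one weight per channel
    """
    h, w, c = x.shape
    # 1x1 convolution = per-position matrix multiply over channels.
    xr = x.reshape(h * w, c) @ w_reduce                 # (hw, c2)
    # Covariance of the reduced channels over all spatial positions.
    xc = xr - xr.mean(axis=0)
    cov = xc.T @ xc / (h * w)                           # (c2, c2)
    # Row-wise grouped convolution: row i of cov uses its own filters.
    z = np.einsum('ij,ijk->ik', cov, w_row).ravel()     # (4 * c2,)
    z = np.where(z > 0, z, slope * z)                   # LReLU(0.1)
    # Second conv + sigmoid -> one scaling weight per original channel.
    s = 1.0 / (1.0 + np.exp(-(z @ w_expand)))           # (c,)
    return x * s                                        # channel scaling

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 8))
c, c2 = 8, 4
wr = rng.standard_normal((c, c2)) * 0.1
wrow = rng.standard_normal((c2, c2, 4)) * 0.1
we = rng.standard_normal((4 * c2, c)) * 0.1
y = channelwise_gsop(x, wr, wrow, we)
```

Since every weight in s depends on the full c2 × c2 covariance, each output channel is modulated by second-order statistics of all channels, unlike the per-channel means of SENet.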
3.2 Extension to Spatial Position
In the previous section, we described global second-order pooling along the channel dimension, which we call channel-wise GSoP. We can extend it to spatial positions, called position-wise GSoP, capturing pairwise feature correlations of the holistic tensor for position-wise feature scaling. The design philosophy of the position-wise GSoP block is very similar to that of the channel-wise one. We also use a 1×1 convolution for reducing the number of channels. Furthermore, as we are to compute pairwise correlations of features at all spatial positions, we adopt downsampling, decreasing the spatial size to a fixed h′ × w′. We thus obtain a position-wise covariance matrix of size h′w′ × h′w′. Row j of the covariance matrix, where j enumerates all spatial positions, indicates the statistical correlation of feature j with all features. The position-wise covariance matrix is also fed to two consecutive operations, i.e., row-wise convolution–LReLU and convolution–sigmoid. After appropriate reshaping, we can obtain a weight matrix which encodes non-linear pairwise dependency among features at all positions. At last, the weight matrix is upsampled to h × w and then multiplied element-wise with the spatial features.
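The position-wise variant can be sketched analogously (numpy, tiny illustrative sizes; block-average downsampling and nearest-neighbor upsampling are stand-in assumptions, as the paper does not specify the interpolation, and the weight shapes are likewise illustrative):

```python
import numpy as np

def positionwise_gsop(x, w_reduce, w_row, w_expand, hp=2):
    """Position-wise GSoP on x of shape (h, w, c), with h and w divisible by hp.

    The tensor is reduced to c2 channels, downsampled to hp x hp positions,
    an n x n position covariance (n = hp * hp) is computed, embedded into one
    weight per position, and the upsampled weight map rescales x spatially.
    """
    h, w, c = x.shape
    xr = (x.reshape(h * w, c) @ w_reduce).reshape(h, w, -1)   # 1x1 conv
    # Block-average downsampling to hp x hp positions (assumed pooling).
    xd = xr.reshape(hp, h // hp, hp, w // hp, -1).mean(axis=(1, 3))
    f = xd.reshape(hp * hp, -1)                               # (n, c2)
    fc = f - f.mean(axis=0)
    cov = fc @ fc.T / f.shape[1]                              # (n, n) position covariance
    z = np.einsum('ij,ijk->ik', cov, w_row).ravel()           # row-wise grouped conv
    z = np.where(z > 0, z, 0.1 * z)                           # LReLU(0.1)
    s = 1.0 / (1.0 + np.exp(-(z @ w_expand)))                 # (n,) one weight per position
    # Nearest-neighbor upsampling of the weight map back to h x w (assumed).
    s = np.repeat(np.repeat(s.reshape(hp, hp), h // hp, 0), w // hp, 1)
    return x * s[:, :, None]

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 4, 6))
n, c2 = 4, 3
wr = rng.standard_normal((6, c2)) * 0.1
wrow = rng.standard_normal((n, n, 4)) * 0.1
we = rng.standard_normal((4 * n, n)) * 0.1
y = positionwise_gsop(x, wr, wrow, we, hp=2)
```

Each spatial weight thus encodes the correlation of its position's feature with the features at all other positions, which is what gives the block its self-attention-like behavior.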
3.3 Mechanism of GSoP Block
In classical deep ConvNets, restricted by limited receptive field size, convolution operations can only process a local neighborhood of the 3D tensor. Data at distant positions cannot interact, e.g., the small blue tensor and the small yellow one shown in Figure 2. Long-range dependencies can only be captured by larger receptive fields produced by deep stacking of convolutional operations. This leads to several downsides, such as optimization difficulty and difficulty in modeling multi-hop dependency [34].
By computing all pairwise feature correlations (or inner products), the non-local operation can capture dependencies between features at distant positions. As a result, the non-local operation can excite significant features, which is consistent with the self-attention mechanism [31]. Our position-wise GSoP multiplies each feature with one weight, which encodes non-linear correlations of this feature with the features at all positions. As such, our position-wise GSoP can also model long-range dependencies between features, functioning as a kind of spatial self-attention. Beyond that, our channel-wise GSoP can capture long-range dependency along the channel dimension, steering self-attention to significant channels. Note that SENet can capture long-range channel dependency as well; however, it models only first-order statistical dependency, having limited representation capability.
3.4 Block Implementation
Our blocks can be conveniently inserted into the ResNet architecture. ResNet contains 4 residual stages, i.e., conv2_x, …, conv5_x, each containing stacks of bottleneck blocks. The exception is the first stage (i.e., conv1), which contains only a single convolutional layer, without bottleneck structure. To simplify block design and to trade off between computational complexity and classification accuracy, we adopt fixed-size covariance matrices for all residual stages. In practice, we reduce the number of channels to c′ = 128 for both channel-wise and position-wise GSoP; in addition, we set the size of the spatial covariance matrix to 64 × 64 (i.e., h′ = w′ = 8). The impact of covariance matrix size is evaluated in Section 4.1.
Table 2: Implementation of the GSoP block for conv4_x (input tensor 14×14×1024; G denotes the number of groups in grouped convolution).

layers                   | channel-wise GSoP              | position-wise GSoP
                         | 3D filter      | output tensor | 3D filter     | output tensor
conv + BN + ReLU         | 1×1×1024, G=1  | 14×14×128     | 1×1×1024, G=1 | 14×14×128
down-sampling            | –              | –             | –             | 8×8×128
COV pool + BN            | –              | 128×128 (1×128×128) | –       | 64×64 (1×64×64)
conv + BN + LReLU (0.1)  | 1×128×1, G=128 | 1×1×512       | 1×64×1, G=64  | 1×1×256
conv + sigmoid           | 1×1×512, G=1   | 1×1×1024      | 1×1×256, G=1  | 1×1×64 (8×8×1)
up-sampling              | –              | –             | –             | 14×14×1
dot product              | –              | 14×14×1024    | –             | 14×14×1024
parameters (M)           | 0.72                           | 0.16
MFLOPs                   | 28.1                           | 26.2
After the 1×1 convolution for dimensionality reduction of channels, we perform downsampling for position-wise GSoP to obtain feature maps of fixed size (i.e., 8×8). By reshaping it to a 3D tensor with the first dimension being singleton, the covariance matrix can be seen as a feature map of spatial size 1 × c′ with c′ channels (similarly, 1 × h′w′ with h′w′ channels for position-wise pooling), and so row-wise BN and row-wise grouped convolutions [19] can be easily accomplished. The channel number after the row convolution is raised to 4c′ for channel-wise pooling and 4h′w′ for position-wise pooling. The size of the weight vector for channel-wise pooling, or of the weight matrix for position-wise pooling, should match the input tensor size. We mention that after the proposed blocks, we also use a shortcut connection, adding the input tensor to the scaled output one. In Table 2, we present the implementation of the GSoP block for conv4_x.
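The parameter counts in Table 2 can be checked by summing the three convolutions of each block (counting weights only; biases and BN parameters would add a negligible amount):

```python
# Parameter count of the GSoP block at conv4_x, per Table 2
# (c = 1024, c' = 128, h'w' = 64; weights only, no biases).

# Channel-wise GSoP:
reduce_conv = 1024 * 128           # 1x1 conv, c -> c'
row_conv    = 128 * 128 * 4        # row conv: 128 groups, filter length 128, 4 outputs each
expand_conv = 512 * 1024           # 1x1 conv, 4c' = 512 -> c = 1024
channelwise_params = reduce_conv + row_conv + expand_conv
print(channelwise_params)          # 720896, i.e. ~0.72M as in Table 2

# Position-wise GSoP:
reduce_conv_p = 1024 * 128         # same 1x1 reduction
row_conv_p    = 64 * 64 * 4        # row conv: 64 groups, filter length 64, 4 outputs each
expand_conv_p = 256 * 64           # 1x1 conv, 4 * 64 = 256 -> 64 position weights
positionwise_params = reduce_conv_p + row_conv_p + expand_conv_p
print(positionwise_params)         # 163840, i.e. ~0.16M as in Table 2
```

In both cases the 1×1 reduction and expansion convolutions dominate; the grouped row-wise convolution is cheap precisely because each covariance row has its own small filter set.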
4 Experiments
In this section, we first conduct ablation analysis of the proposed GSoP-Nets. We then make comparison with competing methods as well as the state-of-the-art on ImageNet. We finally evaluate the generalization capability of our network to small-scale classification. All of our programs are implemented under the PyTorch framework and run on four workstations, each equipped with 2 GTX 1080Ti GPUs and an Intel i7-4790K@4GHz CPU.
Datasets
Our experiments are mainly conducted on the ImageNet-1K [4] benchmark. ImageNet-1K contains 1.28M training images and 50K validation images from 1,000 classes. In Section 4.1, for the purpose of faster ablation study, we build a small subset of ImageNet-1K by randomly selecting 250 classes, including 320K/12.5K images for training/validation. For comparison with state-of-the-art networks, we adopt standard ImageNet-1K in Section 4.2. To evaluate the generalization capability of our network, we also experiment on the CIFAR-100 benchmark [18], which contains 60K color images of 32×32 pixels from 100 categories, with 50K images for training and 10K images for testing.
Experimental Setting
During training from scratch with the ResNet architecture on ImageNet, we follow [10] for data augmentation involving scale, color and flip jittering. The weights are initialized as in [9]. We randomly crop images from the rescaled images with per-channel mean subtraction. The networks are optimized using stochastic gradient descent (SGD) with a momentum of 0.9 and a mini-batch size of 160. The initial learning rate is set to 0.1 and divided by 10 every 30 epochs until 100 epochs, unless specified otherwise. During the testing stage, we evaluate the error on a single center crop from an image whose shorter side is 256.
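The step schedule stated above can be written as a trivial helper (a sketch restating the policy: start at 0.1, divide by 10 every 30 epochs over 100 epochs):

```python
def learning_rate(epoch, base_lr=0.1, step=30, factor=0.1):
    """Step-decay schedule: base_lr multiplied by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# Epochs 0-29 train at 0.1, 30-59 at 0.01, 60-89 at 0.001, 90-99 at 0.0001.
```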
For training from scratch on CIFAR-100, following [11, 14], we use standard data augmentation, including horizontal flip and random translation. The networks are trained within 110 epochs with an initial learning rate of 0.25, which is reduced by 10× twice during training, to 0.025 and then 0.0025. The mini-batch size is 128 and the weight decay is 1e-4.
4.1 Ablation analysis on GSoP-Nets
We develop a lightweight residual network of 26 layers (i.e., ResNet-26) as our baseline architecture, where every residual stage contains two bottlenecks. Following [22], we do not perform downsampling in the last residual stage, as a small number of features is harmful for robust covariance estimation. For conv2_x–conv4_x, we insert a per-stage GSoP block after the last bottleneck structure of each residual stage. For GSoP-Net1 we insert one more GSoP block followed by common global average pooling, outputting a 2K-dimensional image representation fully connected to the softmax layer, while for GSoP-Net2 we use matrix power normalized covariance pooling, producing a 32K-dimensional image representation. Table 3 presents the architecture of our GSoP-Nets.
         | output size | layers
conv1    | 112×112     | conv, 7×7, 64, stride 2
pool1    | 56×56       | max pool, 3×3, stride 2
conv2_x  | 56×56       | bottleneck × 2; GSoP block
conv3_x  | 28×28       | bottleneck × 2; GSoP block
conv4_x  | 14×14       | bottleneck × 2; GSoP block
conv5_x  | 14×14       | bottleneck × 2 (no downsampling)
         | 1×1         | global pooling
         | 1×1         | FC + softmax
(a) Impact of covariance size (top-1 err / top-5 err).

                        | GSoP-Net1    | GSoP-Net2
channel-wise cov size
  64×64                 | 18.00 / 4.99 | 16.84 / 4.58
  128×128               | 17.42 / 4.53 | 16.68 / 4.36
  256×256               | 17.61 / 4.64 | 16.67 / 4.18
position-wise cov size
  36×36                 | 19.21 / 5.46 | 17.34 / 4.80
  64×64                 | 18.37 / 5.05 | 17.18 / 4.80
  144×144               | 18.41 / 5.08 | 17.51 / 4.63
vanilla network         | 19.18 / 5.62
(b) Fusion of channel-wise and position-wise pooling (top-1 err / top-5 err).

                        | GSoP-Net1    | GSoP-Net2
channel-wise pool       | 17.42 / 4.53 | 16.68 / 4.36
position-wise pool      | 18.37 / 5.05 | 17.18 / 4.80
fusion: average         | 17.90 / 4.73 | 16.77 / 4.36
fusion: maximum         | 17.48 / 4.52 | 16.80 / 4.39
fusion: concatenation   | 17.58 / 4.61 | 16.49 / 4.35
(c) Single second-order block at different residual stages (top-1 err / top-5 err).

vanilla network               | 19.18 / 5.62
single C block (1st stage)    | 18.45 / 5.22
single C block (2nd stage)    | 18.72 / 5.33
single C block (3rd stage)    | 18.85 / 5.24
single C block (4th stage)    | 18.33 / 5.12
GSoP-Net1                     | 17.42 / 4.53
iSQRT-COV [22]                | 17.43 / 4.71
GSoP-Net2                     | 16.68 / 4.36
Impact of Covariance Size.
The covariance matrices produced by the second-order pooling blocks encode the statistical correlations of the holistic tensors, playing a central role in our networks. So we first evaluate the impact of covariance matrix size on the proposed networks. Table 3(a) summarizes the results, in which the top and middle panels show the impact using channel-wise (cov size: c′ × c′) and position-wise pooling (cov size: h′w′ × h′w′), respectively. We first observe that, with either kind of second-order pooling, the proposed networks improve over the vanilla ResNet-26, demonstrating that our holistic modeling methods in earlier stages are beneficial in enhancing the network's discriminative capability. For channel-wise second-order pooling, among the varying values of c′, GSoP-Net1 achieves the best results with c′ = 128. The errors of GSoP-Net2 consistently decline as c′ gets larger and the lowest error is obtained with c′ = 256. For position-wise second-order pooling, GSoP-Net1 with h′w′ = 64 produces the lowest errors. Notably, for either channel-wise or position-wise pooling, it is clear that GSoP-Net2 performs much better than GSoP-Net1, which suggests that the covariance matrix image representation is superior to the mean vector produced by average pooling.
Fusion of Channel-wise and Position-wise Pooling.
Channel-wise and position-wise second-order pooling capture statistical correlations along different dimensions of the 3D tensor. They can be combined for holistic image modeling. Given an input tensor, we independently perform second-order pooling along the channel dimension and the spatial dimension, producing two output tensors. We can fuse the two output tensors by the commonly used operations of average/maximum and concatenation. As the concatenation operation increases the tensor size, we use one convolutional layer for restoring the original tensor size.
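The three fusion schemes can be sketched as follows (numpy; the convolution restoring the channel count after concatenation is written as a 1×1-conv-style matrix multiply, with an assumed weight shape):

```python
import numpy as np

def fuse(yc, yp, mode, w_proj=None):
    """Fuse channel-wise (yc) and position-wise (yp) GSoP outputs, both (h, w, c)."""
    if mode == 'average':
        return 0.5 * (yc + yp)
    if mode == 'maximum':
        return np.maximum(yc, yp)
    if mode == 'concatenation':
        h, w, c = yc.shape
        z = np.concatenate([yc, yp], axis=-1)            # (h, w, 2c)
        # 1x1 conv (w_proj: (2c, c)) restores the original channel count.
        return (z.reshape(h * w, 2 * c) @ w_proj).reshape(h, w, c)
    raise ValueError(mode)

rng = np.random.default_rng(0)
yc = rng.standard_normal((4, 4, 8))
yp = rng.standard_normal((4, 4, 8))
w_proj = rng.standard_normal((16, 8)) * 0.1
out = fuse(yc, yp, 'concatenation', w_proj)
```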
The results of the fusion methods are presented in Table 3(b). For GSoP-Net1, the average scheme performs worse than the other two, while the maximum scheme is slightly better than the concatenation one. For GSoP-Net2, the concatenation scheme is a little superior to the other two schemes. However, compared to channel-wise pooling alone, combination with position-wise pooling brings little improvement under any fusion scheme. These results suggest that the two kinds of second-order pooling are not complementary, though each of them individually brings obvious improvement over the vanilla network.
Performance of a Single Second-order Block.
In this part, we conduct experiments to analyze the performance of a single channel-wise block separately added to different residual stages. We neglect analysis of a single position-wise block as it is not promising. Table 3(c) presents the results, where S2 denotes residual stage 2, and so on; −, C and I denote no second-order block, one channel-wise block and the iSQRT-COV meta-layer [22] inserted at the corresponding residual stage, respectively. It can be seen that insertion of a single block into any residual stage brings comparable improvement over the vanilla network. This indicates that channel-wise second-order pooling at different stages makes similar contributions to the overall performance of channel-wise GSoP-Net1. The iSQRT-COV, which inserts a matrix normalized covariance matrix at residual stage 4 as the final image representation, is a strong baseline, achieving a comparable result with GSoP-Net1. GSoP-Net2, which also inserts global second-order pooling at intermediate stages, outperforms iSQRT-COV by a non-trivial margin. This suggests the benefit of introducing second-order statistics in earlier layers of networks.
4.2 Results on ImageNet-1K
In this subsection, we further evaluate our proposed GSoP-Nets on standard ImageNet-1K under the ResNet-50 architecture. We insert a GSoP block for residual stages 2, 3 and 4, respectively. For GSoP-Net1, we insert one GSoP block for residual stage 5, followed by the commonly used global average pooling; for GSoP-Net2, instead of the GSoP block, the meta-layer of iSQRT-COV [21] is inserted.
4.2.1 Convergence and Network Complexity
Convergence.
Figure 3 illustrates the convergence curves of our GSoP-Nets. For GSoP-Net1, though second-order statistical modeling is exploited, it is used only for tensor scaling, while the convolutional filters and the image representation are both linear, just like the original ResNet-50. As shown in the top figure, the convergence behavior of GSoP-Net1 is similar to that of ResNet-50, but it consistently has lower validation error throughout the training process. Different from iSQRT-COV, GSoP-Net2 introduces second-order blocks for residual stages 2, 3 and 4. From the bottom figure, we can see that GSoP-Net2 inherits the fast convergence property of iSQRT-COV, while steadily performing better. We attribute the improvement of our networks over their counterparts to the holistic modeling of second-order statistics introduced in earlier stages.
Network Complexity.
Table 5 shows the comparison of parameters and computation. The number of parameters of GSoP-Net1 is comparable to that of the vanilla ResNet-50, while GSoP-Net2 has nearly double the number of parameters. The increased parameters of GSoP-Net2 are mainly due to the FC layer, in which the dimensionality of the image representation is 32K, accounting for most of the increase in total parameters, just like MPN-COV [22] and iSQRT-COV [21]. We argue that advances in model compression, e.g., [5, 27, 8], have the potential to significantly reduce the number of parameters, particularly in the FC layer, while maintaining performance; in practice, we can exploit such techniques for reducing parameters. Analogous to [22, 21], the GFLOPs of our networks are 1.58× those of the vanilla ResNet. The increased computation is attributed to removal of downsampling in the last residual stage, so that the spatial size of feature maps doubles. This operation is helpful for robust covariance estimation by alleviating the problem of small sample size and high dimensionality [22]. This somewhat slows down training, while making little difference for inference. With a single GTX 1080Ti GPU with CUDA 9.0 and CuDNN 7.1, the inference times (ms) per image are 2.52 vs 2.68/2.84 (vanilla ResNet-50 vs GSoP-Net1/GSoP-Net2).
Table 5: Comparison with competing networks on ImageNet-1K (error rates in %; params/GFLOPs of non-baseline entries are multiples of the baseline's 25.5M/3.86).

network            | description                        | top-1 | top-5 | params/GFLOPs
He et al. [10]     | baseline network                   | 23.85 | 7.13  | 25.5M/3.86
FBN [23]           | quadratic transformations          | 24.0  | 7.1   | –
SORT [36]          |                                    | 23.82 | 6.72  | –
MPN-COV [22]       | GSoP at network end                | 22.74 | 6.54  | 2.2×/1.6×
iSQRT-COV [21]     |                                    | 22.14 | 6.22  | 2.2×/1.6×
SENet [14]         | GAvP across network                | 23.29 | 6.62  | 1.1×/1.0×
GENet [12]         |                                    | 21.88 | 5.80  | 1.3×/1.0×
CBAM [37]          |                                    | 22.66 | 6.31  | 1.1×/1.0×
GSoP-Net1 (ours)   | GSoP across network                | 22.32 | 6.02  | 1.1×/1.6×
GSoP-Net2 (ours)   |                                    | 21.19 | 5.64  | 2.3×/1.7×
ResNeXt [39]       | modified architectures upon ResNet | 22.11 | 5.90  | 1.0×/1.0×
DropBlock [7]      |                                    | 21.87 | 5.98  | 1.0×/1.0×
DRN-A-50 [40]      |                                    | 22.94 | 6.57  | 1.0×/4.9×
4.2.2 Comparison with Competing Networks.
Table 5 compares the classification errors of our GSoP-Nets and the competing networks on ImageNet-1K.
Comparison with FBN and SORT.
The two works [23, 36] are among the first to introduce quadratic transformations, instead of just linear convolutions, throughout a network. However, compared to the vanilla network, their performance gains are not significant. In contrast, our networks perform much better, achieving over 2.8% and 2.6% higher accuracy than FBN and SORT, respectively. This comparison demonstrates that, by making favorable use of higher-order information, we can greatly improve network performance.
Comparison with Global Cov Pool at Network End.
Here we compare our GSoP-Net2 with several methods where global second-order pooling is inserted only at the end of the network. All of them estimate covariance matrices of the last convolutional features as image representations. DeepO2P computes the matrix logarithm of the covariance matrix, while B-CNN performs element-wise power normalization plus ℓ2 normalization. As DeepO2P and B-CNN are not competitive for large-scale visual recognition [22], here we do not compare with them. MPN-COV uses structured normalization by matrix square root, and iSQRT-COV is a faster version of MPN-COV, in which the matrix square root is based on an iterative algorithm rather than GPU-unfriendly SVD. Our GSoP-Net2 outperforms MPN-COV by 1.55% in top-1 error (0.90% in top-5 error). Compared to iSQRT-COV, GSoP-Net2 achieves 0.95%/0.58% lower top-1/top-5 error rates, while incurring negligible overhead. We note that iSQRT-COV is a strong baseline and our improvement is non-trivial. The comparison between our GSoP-Net2 and MPN-COV/iSQRT-COV indicates that introducing higher-order statistics in earlier stages can enhance the representation learning capability of deep ConvNets.
Comparison with Global Avg Pool across Network.
From Table 5, we can see that our GSoP-Net1 performs better than SENet in both top-1 and top-5 errors. As an extension of SENet, CBAM combines global average and max pooling along both the channel dimension and the spatial dimension. Nevertheless, the error rates of GSoP-Net1 are lower than those of CBAM. Building upon SENet, GENet [12] proposes gather and excite operations for exploiting context information. Our GSoP-Net2 outperforms GENet by a non-trivial margin. These comparisons between our networks and SENet and its variants show that higher-order modeling is able to capture richer statistics than first-order modeling, leading to more discriminative representations. Notably, we do not insert a GSoP block after each bottleneck structure; instead, we only insert one GSoP block per residual stage. As a result, we add no more than 4 GSoP blocks, and more GSoP blocks may further improve the performance of our network.
Comparison with State-of-the-arts.
Finally, we compare with several state-of-the-art networks which modify the ResNet-50 architecture. Compared to ResNet, ResNeXt [39] considerably increases network width while keeping parameters and computation almost unchanged through extensive use of grouped convolutions [19]. DRN-A-50 [40] removes downsampling in residual stages 3 and 4, meanwhile using dilated convolution to maintain the receptive field size. DropBlock [7] extends the dropout technique to convolution; by dropping blocks of feature maps randomly, it maintains context integrity during training. As shown in Table 5, these modified networks perform much better than ResNet-50. Nevertheless, our GSoP-Net2 outperforms all of them by a non-trivial margin. It is worth mentioning that, if built upon the modified networks above, the performance of our network may improve further.
4.3 Results on CIFAR-100
This section conducts experiments on CIFAR-100 [18] to evaluate the generalization capability of the proposed GSoP-Nets. The backbone network is pre-activation ResNet-164 [11], containing 3 residual stages each with 18 bottlenecks; the final image representation is 256-D. In GSoP-Net1, we insert 18 GSoP blocks into the backbone network uniformly, and in GSoP-Net2 the last GSoP block is replaced by the meta-layer of iSQRT-COV. Downsampling is not performed in the last residual stage. The final dimension of the image representation in GSoP-Net2 is 8K and a dropout layer (dropout rate = 0.5) is used for the FC layer. The same covariance size is used in both GSoP-Net1 and GSoP-Net2.
The experimental results on CIFAR-100 are presented in Table 6. Compared with the vanilla network, GSoP-Net1 and GSoP-Net2 obtain gains of 3.47% and 5.75%, respectively, improving the performance by a large margin. CMPE [15] implements a channel-wise excitation operation by establishing correlations of the channel-wise representations between two nearby bottlenecks, which can be considered a cross-block version of SENet. GSoP-Net1 performs better than SENet and CMPE by 0.45% and 1.49%, respectively. iSQRT-COV is very competitive, clearly outperforming SENet. By introducing second-order statistics in earlier stages, our GSoP-Net2 makes a further improvement (~1.37%) over iSQRT-COV.
5 Conclusion
We presented a simple yet effective deep convolutional network model for capturing holistic statistical correlations across all stages of a network. By exploiting holistic higher-order information at earlier stages, the proposed model can learn more discriminative representations. As far as we know, our work is among the first to introduce global second-order pooling into the lower layers of deep networks. Our proposed networks perform better than SENet [14], i.e., the first-order counterpart, while nontrivially improving on the state-of-the-art iSQRT-COV [21], which plugs in global covariance pooling as the image representation only at the network end. The proposed GSoP blocks are highly modular and can be conveniently plugged into other deep architectures, e.g., Inception [30] and DenseNet [16].
References
 [1] S. Cai, W. Zuo, and L. Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In ICCV, 2017.
 [2] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In CVPR, 2017.
 [3] X. Dai, J. Yue-Hei Ng, and L. S. Davis. FASON: First and second order information fusion network for texture recognition. In CVPR, July 2017.
 [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
 [6] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In CVPR, 2016.
 [7] G. Ghiasi, T.-Y. Lin, and Q. V. Le. DropBlock: A regularization method for convolutional networks. In NIPS, 2018.
 [8] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
 [12] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi. Gather-Excite: Exploiting feature context in convolutional neural networks. In NIPS, 2018.
 [13] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu. Squeeze-and-excitation networks. arXiv:1709.01507v3, 2018.
 [14] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.
 [15] Y. Hu, G. Wen, M. Luo, D. Dai, and J. Ma. Competitive inner-imaging squeeze and excitation for residual network. arXiv:1807.08920, 2018.
 [16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [17] C. Ionescu, O. Vantzos, and C. Sminchisescu. Matrix backpropagation for deep networks with structured layers. In ICCV, 2015.
 [18] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech. Rep., 2009.
 [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [20] P. Li, Q. Wang, H. Zeng, and L. Zhang. Local Log-Euclidean multivariate Gaussian descriptor and its application to image classification. IEEE TPAMI, 2017.
 [21] P. Li, J. Xie, Q. Wang, and Z. Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR, 2018.
 [22] P. Li, J. Xie, Q. Wang, and W. Zuo. Is second-order information helpful for large-scale visual recognition? In ICCV, Oct 2017.
 [23] Y. Li, N. Wang, J. Liu, and X. Hou. Factorized bilinear models for image recognition. In ICCV, 2017.
 [24] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
 [25] T.-Y. Lin, A. Roy-Chowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.
 [26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 [28] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
 [29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
 [32] H. Wang, Q. Wang, M. Gao, P. Li, and W. Zuo. Multi-scale location-aware kernel representation for object detection. In CVPR, 2018.
 [33] Q. Wang, P. Li, and L. Zhang. G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition. In CVPR, 2017.
 [34] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
 [35] Y. Wang, M. Long, J. Wang, and P. S. Yu. Spatiotemporal pyramid network for video action recognition. In CVPR, 2017.
 [36] Y. Wang, L. Xie, C. Liu, S. Qiao, Y. Zhang, W. Zhang, Q. Tian, and A. Yuille. SORT: Second-order response transform for visual recognition. In ICCV, 2017.
 [37] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon. CBAM: Convolutional block attention module. In ECCV, 2018.
 [38] H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang. MoNet: Deep motion exploitation for video object segmentation. In CVPR, 2018.
 [39] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
 [40] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In CVPR, 2017.
 [41] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
 [42] G. Zoumpourlis, A. Doumanoglou, N. Vretos, and P. Daras. Non-linear convolution filters for CNN-based learning. In ICCV, 2017.