Global Second-order Pooling Convolutional Networks


Zilin Gao, Jiangtao Xie, Qilong Wang, Peihua Li
Dalian University of Technology, Tianjin University
peihuali@dlut.edu.cn
Abstract

Deep Convolutional Networks (ConvNets) are fundamental to large-scale visual recognition and to many other vision tasks. As the primary goal of ConvNets is to characterize complex boundaries of thousands of classes in a high-dimensional space, it is critical to learn higher-order representations for enhancing non-linear modeling capability. Recently, Global Second-order Pooling (GSoP), plugged at the end of networks, has attracted increasing attention, achieving much better performance than classical, first-order networks in a variety of vision tasks. However, how to effectively introduce higher-order representations in earlier layers to improve the non-linear capability of ConvNets is still an open problem. In this paper, we propose a novel network model introducing GSoP from lower to higher layers for exploiting holistic image information throughout a network. Given an input 3D tensor output by some previous convolutional layer, we perform GSoP to obtain a covariance matrix which, after non-linear transformation, is used for tensor scaling along the channel dimension. Similarly, we can perform GSoP along the spatial dimension for tensor scaling as well. In this way, we can make full use of the second-order statistics of the holistic image throughout a network. The proposed networks are thoroughly evaluated on large-scale ImageNet-1K, and experiments show that they outperform their counterparts by a non-trivial margin while achieving state-of-the-art results.

1 Introduction

Deep Convolutional Networks (ConvNets) are fundamental to the computer vision field: they are not only paramount for high accuracy in large-scale object recognition but also, by means of pretrained models, play central roles in substantially advancing many other computer vision tasks, e.g., object detection [28], semantic segmentation [26] and video classification [34]. Given color images as inputs, ConvNets progressively learn low-level, mid-level and high-level features [41], finally producing global image representations connected to a softmax layer for classification. To better characterize complex boundaries of thousands of classes in a very high-dimensional space, one possible solution is to learn higher-order representations for enhancing the non-linear modeling capability of ConvNets.

Recently, modeling of higher-order statistics for more discriminative image representations has attracted great interest in deep ConvNets. The global second-order pooling (GSoP), producing covariance matrices as image representations, has achieved state-of-the-art results in a variety of vision tasks [21, 2, 32, 35] such as object recognition, fine-grained visual categorization, object detection and video classification. The pioneering works, i.e., DeepO2P [17] and bilinear CNN (B-CNN) [25], performed global second-order pooling, rather than the commonly used global average (i.e., first-order) pooling (GAvP) [24], after the last convolutional layers in an end-to-end manner. However, most of the variants of GSoP [6, 1] only focused on small-scale scenarios. In large-scale visual recognition, MPN-COV [22, 21] has shown that matrix power normalized GSoP can significantly outperform global average pooling.

Though GSoP plugged at the end of network has proven successful, how to effectively introduce higher-order representation in earlier layers for improving non-linear capability of ConvNets is still an open problem. Several works [23, 36, 42] have made attempts to enhance non-linear modeling capability using quadratic transformation to model feature interactions, instead of only using linear transformation of convolutions. However, performance gains of these methods are limited in large-scale visual recognition. Motivated by Squeeze-and-Excitation (SE) networks [14], we introduce GSoP across from lower to higher layers of deep ConvNets, aiming to learn more discriminative representations by exploiting the second-order statistics of holistic image throughout a deep ConvNet.

At the heart of our global second-order networks is the GSoP block, which can be conveniently plugged into any location of a deep ConvNet. Given a 3D tensor outputted by some previous convolutional layer, we first perform GSoP to model pairwise channel correlations of the holistic tensor. We then accomplish embedding of the resulting covariance matrix by convolutions and non-linear activations, which is finally used for scaling the 3D tensor along channel dimension. The diagram of our GSoP convolutional network (GSoP-Net) is presented in Figure 1(a) and the proposed second-order block is illustrated in Figure 1(b). The primary differences of the proposed GSoP-Net from existing networks are compared in Table 1, which will be detailed in next section. Our main contributions are threefold. (1) Distinct from the existing methods which can only exploit second-order statistics at network end, we are among the first who introduce this modeling into intermediate layers for making use of holistic image information in earlier stages of deep ConvNets. By modeling the correlations of the holistic tensor, the proposed blocks can capture long-range statistical dependency [34], making full use of the contextual information in the image. (2) We design a simple yet effective GSoP block, which is highly modular with low memory and computational complexity. The GSoP block, which is able to capture global second-order statistics along channel dimension or position dimension, can be conveniently plugged into existing network architectures, further improving their performance with small overhead. (3) On ImageNet benchmark, we perform a thorough ablation study of the proposed networks, analyzing the characteristics and behaviors of the proposed GSoP block. Extensive comparison with the counterparts has shown the competitiveness of our networks.

2 Related Works

GAvP (1st-order) In-between Network.

Global average pooling plugged at the end of network [24], which summarizes the first-order statistics (i.e., mean vector) as image representations, has been widely used in most deep ConvNets such as ResNet [10], Inception [30] and DenseNet [16]. For the first time, SE-Net [14] introduced GAvP in-between network for making use of holistic image context at earlier stages, reporting significant improvement over its network-end counterparts. The SE-Net consists of two modules: a squeeze module accomplishing global average pooling followed by convolution and non-linear activations for capturing channel dependency, and an excitation module scaling channels for data recalibration. Besides GAvP along the channel dimension, CBAM [37] extends the idea of SE-Net, combining GAvP along the channel dimension as well as the spatial dimension for accomplishing self-attention. Compared to SE-Net and CBAM, which use only first-order statistics (mean) of the holistic image, our GSoP-Net exploits second-order statistics (correlations), having stronger modeling capability.

GSoP (2nd-order) at Network End.

The global second-order pooling, plugged at network end and trainable in an end-to-end manner, has received great interest, achieving significant performance improvement [2, 22, 21]. Several researchers [6, 2, 1] have shown close connections between higher-order pooling and kernel machines, based on which they proposed explicit mapping functions as kernel approximation for compactness of covariance representations. Wang et al. [33] proposed a global Gaussian distribution embedding network (GDeNet), where one multivariate Gaussian, identified as a symmetric positive definite matrix of covariance matrix and mean vector [20], is plugged at network end. MoNet [38] proposed a sub-matrix square-root layer, enabling GDeNet to have a compact representation. In [3], the first-order information is combined with the second-order one, which achieves consistent improvements over the standard bilinear networks on texture recognition. In all the aforementioned works, second-order modeling is only exploited at the end of deep networks.

Model | In-between network (global pool means) | End of network (global pool means)
AlexNet [19], VGG [29] | NA | NA
ResNet [10], Inception [30], DenseNet [16] | NA | 1st-order
SE-Net [14], CBAM [37] | 1st-order | 1st-order
DeepO2P [17], B-CNN [25], MPN-COV [22], GDeNet [33] | NA | 2nd-order
GSoP-Net (ours) | 2nd-order | 2nd-order
Table 1: Summary of ConvNet models in terms of global statistical pooling. Different from existing networks, we introduce global second-order pooling into intermediate layers of deep ConvNets. So we can make full use of second-order statistics to effectively capture holistic image information throughout a network.
(a) Overview of GSoP-Net. The proposed global second-order pooling (GSoP) block can be conveniently inserted after any convolutional layer in-between network. We propose to use, at the network end, GSoP block followed by common global average pooling producing compact image representations (GSoP-Net1), or matrix power normalized covariance [22] outputting covariance matrices as image representations (GSoP-Net2).
(b) GSoP block. Given an input tensor, after dimension reduction, the GSoP block starts with covariance matrix computation, followed by two consecutive operations of a linear convolution and non-linear activation, producing the output tensor which is scaling (multiplication) of the input one along the channel dimension.
Figure 1: Our global second-order pooling network (GSoP-Net). Figure 1(a) gives an overview of GSoP-Net and the proposed GSoP block is presented in Figure 1(b). We introduce global second-order pooling into intermediate layers of deep ConvNets, which goes beyond the existing works where GSoP can only be used at network end. By modeling higher-order statistics of holistic images at earlier stages, our network can enhance capability of non-linear representation learning of deep networks.
Quadratic Transformation Network.

The conventional network depends heavily on linear convolution operations. Several researchers take a step further to explore higher-order transformation for enhancing non-linear modeling capability of deep networks. The second-order Response Transform (SORT) [36] develops a two-branch network module to combine responses of two convolutional blocks and multiplication of the responses. They perform element-wise square root for normalizing the second-order term. In [23], a factorized bilinear network (FBN) is proposed to model the pairwise feature interaction. By constraining the rank of quadratic transformation matrix, FBN can introduce bilinear pooling into intermediate layers. Zoumpourlis et al. [42] introduce Volterra kernel-based convolutions, which can model first-, second- or higher-order interactions of data, serving as approximations of non-linear functionals. All the works above are concerned with non-linear filters, applied only to local neighborhood, just like linear convolution. In contrast, our GSoP networks collect the second-order statistics of the holistic image for enhancing non-linear capability of deep networks.

3 Global Second-order Pooling Network

We illustrate the proposed GSoP-Net in Figure 1(a). Note that the second-order pooling block we designed can be conveniently inserted after any convolutional layer. By introducing this block in intermediate layers, we can model high-order statistics of the holistic image at early stages, which enhances the non-linear modeling capability of deep ConvNets.

In practice, we build two network architectures, both with GSoP blocks in-between network. In the first, which we call GSoP-Net1, a GSoP block is also used at the end of network, followed by the common global average pooling, producing the mean vector as a compact image representation. In the second, called GSoP-Net2, we adopt matrix power normalized covariance matrices as image representations at the end of network [22], which are more discriminative yet high-dimensional.

3.1 Global Second-order Pooling Block

Figure 1(b) shows the diagram of the key module of our network, i.e., the GSoP block. Similar to [14], the block consists of two modules, i.e., a squeeze module and an excitation module. The squeeze module aims to model the second-order statistics along the channel dimension of the input tensor. We are given a 3D tensor of size $h' \times w' \times c'$ as an input, where $h'$ and $w'$ are the spatial height and width and $c'$ is the number of channels. First, we use $1 \times 1$ convolution to reduce the number of channels from $c'$ to $c$ ($c < c'$), decreasing the computational cost of the following operations. For the $h' \times w' \times c$ tensor of reduced dimensionality, we compute pairwise channel correlations, obtaining one $c \times c$ covariance matrix. The resulting covariance matrix has clear physical meaning, i.e., its $i$-th row indicates the statistical dependency of channel $i$ on all channels. As the quadratic operations involved change the order of data, we perform row-wise normalization for the covariance matrix, respecting the inherent structural information. In contrast, the SE-Net uses global first-order pooling, which can only summarize the mean of individual channels, having limited statistical modeling capability.
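As a rough illustration of the squeeze step, the following NumPy sketch computes a channel covariance matrix from a tensor that has already been reduced in channels. The function name `channel_covariance`, the biased $1/n$ estimate and the random input are assumptions made for illustration; the $1 \times 1$ reduction and the row-wise normalization described above are left out.

```python
import numpy as np

def channel_covariance(x):
    """Covariance of channels over all spatial positions.

    x: array of shape (h, w, c), e.g. the tensor after channel reduction.
    Returns a (c, c) matrix whose i-th row holds the covariances of
    channel i with every channel.
    """
    h, w, c = x.shape
    feats = x.reshape(h * w, c)             # one c-dimensional sample per position
    centered = feats - feats.mean(axis=0)   # subtract the per-channel mean
    return centered.T @ centered / (h * w)  # biased (1/n) estimate, an assumption

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 128))      # sizes borrowed from Table 2
cov = channel_covariance(x)
print(cov.shape)                            # (128, 128)
```

Each spatial position contributes one sample, so a larger feature map gives a better-conditioned estimate for a fixed number of channels.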

In the excitation module, prior to channel scaling, we perform two consecutive operations of convolution plus non-linear activation for covariance matrix embedding. To maintain the structural information, the covariance matrix is subject to row-wise convolution, which is followed by a Leaky Rectified Linear Unit (LReLU). Then we perform the second convolution and this time we use the sigmoid function as a nonlinear activation, outputting a weight vector. We finally perform dot product between the weight vector and channels. Individual channels are thus emphasized or suppressed in a soft manner in terms of the weights.
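The excitation step can be sketched in NumPy as below, under several assumptions: batch normalization is omitted, the weights are random placeholders, and the names `channel_excitation`, `w_row` and `w_fc` are hypothetical. The row-wise convolution is written as one small filter bank per row of the covariance matrix, which is what a grouped convolution with one group per row amounts to.

```python
import numpy as np

def leaky_relu(z, slope=0.1):
    return np.where(z > 0, z, slope * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_excitation(x, cov, w_row, w_fc):
    """Scale the channels of x using weights derived from cov.

    x:     (h, w, c_in) tensor to be rescaled
    cov:   (c, c) covariance matrix from the squeeze module
    w_row: (c, k, c) bank of k length-c filters for each row of cov
    w_fc:  (c_in, c * k) weights of the second convolution
    """
    # Row-wise convolution: row i of cov only meets the filters of row i.
    hidden = np.einsum('ikj,ij->ik', w_row, cov).reshape(-1)
    hidden = leaky_relu(hidden)
    weights = sigmoid(w_fc @ hidden)        # one weight in (0, 1) per channel
    return x * weights, weights             # broadcast over spatial positions

rng = np.random.default_rng(0)
c_in, c, k = 16, 4, 4                       # toy sizes for illustration
x = rng.standard_normal((5, 5, c_in))
a = rng.standard_normal((25, c))
cov = a.T @ a / 25                          # stand-in covariance matrix
w_row = 0.1 * rng.standard_normal((c, k, c))
w_fc = 0.1 * rng.standard_normal((c_in, c * k))
scaled, weights = channel_excitation(x, cov, w_row, w_fc)
print(scaled.shape, weights.shape)          # (5, 5, 16) (16,)
```

Because the sigmoid keeps every weight strictly between 0 and 1, channels are attenuated to different degrees rather than switched on or off.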

3.2 Extension to Spatial Position

In the previous section, we described global second-order pooling along the channel dimension, which we call channel-wise GSoP. We can extend it to spatial position, called position-wise GSoP, capturing pairwise feature correlations of the holistic tensor for position-wise feature scaling. The design philosophy of the position-wise GSoP block is very similar to that of the channel-wise one. We also use $1 \times 1$ convolution for reducing the number of channels. Furthermore, as we are to compute pairwise correlations of features at all spatial positions, we adopt downsampling, decreasing the spatial size to a fixed $h \times w$. So we obtain a position-wise covariance matrix of size $hw \times hw$. Row $i$ of the covariance matrix, where $i$ enumerates all spatial positions, indicates the statistical correlation of feature $i$ with all features. The position-wise covariance matrix is also fed to two consecutive operations, i.e., row-wise convolution + LReLU and convolution + sigmoid. After appropriate reshaping, we obtain an $h \times w$ weight matrix which encodes non-linear pairwise dependency among features at all positions. At last, the weight matrix is upsampled to the spatial size of the input tensor and then multiplied element-wise with the spatial features.
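The position-wise variant can be sketched along the same lines. The text does not specify the downsampling or upsampling operators, so adaptive average pooling and nearest-neighbour upsampling are assumptions here, as are the centering choice, the omission of batch normalization, the random weights and all function names.

```python
import numpy as np

def adaptive_avg_pool(x, out_h, out_w):
    """Average-pool x of shape (H, W, C) down to (out_h, out_w, C)."""
    H, W, C = x.shape
    out = np.empty((out_h, out_w, C))
    for i in range(out_h):
        r0, r1 = (i * H) // out_h, -((-(i + 1) * H) // out_h)      # floor, ceil
        for j in range(out_w):
            c0, c1 = (j * W) // out_w, -((-(j + 1) * W) // out_w)
            out[i, j] = x[r0:r1, c0:c1].mean(axis=(0, 1))
    return out

def position_covariance(x):
    """x: (h, w, c) -> (h*w, h*w) covariance between spatial positions."""
    h, w, c = x.shape
    feats = x.reshape(h * w, c)
    centered = feats - feats.mean(axis=1, keepdims=True)  # center each position
    return centered @ centered.T / c

def nearest_upsample(m, out_h, out_w):
    """Nearest-neighbour upsampling of a 2D map to (out_h, out_w)."""
    h, w = m.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return m[np.ix_(rows, cols)]

def position_gsop(x_in, x_red, w_row, w_fc, h=8, w=8):
    """Scale every spatial position of x_in by a weight derived from x_red."""
    cov = position_covariance(adaptive_avg_pool(x_red, h, w))
    hidden = np.einsum('ikj,ij->ik', w_row, cov).reshape(-1)  # row-wise conv
    hidden = np.where(hidden > 0, hidden, 0.1 * hidden)       # LReLU (0.1)
    weights = 1.0 / (1.0 + np.exp(-(w_fc @ hidden)))          # sigmoid
    mask = nearest_upsample(weights.reshape(h, w), x_in.shape[0], x_in.shape[1])
    return x_in * mask[:, :, None], mask

rng = np.random.default_rng(0)
x_in = rng.standard_normal((14, 14, 32))    # tensor to be rescaled (toy width)
x_red = rng.standard_normal((14, 14, 16))   # stand-in for its channel-reduced version
h = w = 8
k = 4
w_row = 0.1 * rng.standard_normal((h * w, k, h * w))
w_fc = 0.1 * rng.standard_normal((h * w, h * w * k))
out, mask = position_gsop(x_in, x_red, w_row, w_fc, h, w)
print(out.shape, mask.shape)                # (14, 14, 32) (14, 14)
```

Fixing the downsampled size keeps the covariance matrix at $64 \times 64$ regardless of the input resolution, which is what lets one block design serve every residual stage.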

Figure 2: Classical convolutional operations fail to capture holistic dependency of a 3D tensor due to limited receptive field size. For example, the data in the small blue tensor cannot interact with that of the yellow tensor at a distant position due to limited receptive field size. Our GSoP-Net addresses this by modeling pairwise correlations of the holistic tensor.

3.3 Mechanism of GSoP Block

In classical deep ConvNets, restricted by limited receptive field size, the convolution operations can only process a local neighborhood of 3D tensor. The data at distant position cannot interact, e.g., the small blue tensor and the small yellow one as shown in Figure 2. The long-range dependencies can only be captured by larger receptive fields produced by deep stacking of convolutional operations. This leads to several downsides such as optimization difficulty and modeling difficulty of multi-hop dependency [34].

By computing all pairwise feature correlations (or inner product), the non-local operation can capture dependency of features at distant positions. As a result, the non-local operation can excite significant features, which is consistent with self-attention machinery [31]. Our position-wise GSoP multiplies each feature with one weight, which encodes nonlinear correlations of this feature with features at all positions. As such, our position-wise GSoP can also model long-range dependency of features, functioning as a kind of spatial self-attention. Beyond that, our channel-wise GSoP can capture long-range dependency along channel dimension, steering self-attention to significant channels. Note that SE-Net can capture long-range channel dependency as well, which, however, can model only the first-order statistical dependency, having limited representation capability.

3.4 Block Implementation

Our blocks can be conveniently inserted into the ResNet architecture. ResNet contains 4 residual stages, i.e., conv2_x, ..., conv5_x, each containing stacks of bottleneck blocks. The exception is the first stage (i.e., conv1), which contains only a single convolutional layer, without bottleneck structure. To simplify block design and to trade off between computational complexity and classification accuracy, we adopt fixed-size covariance matrices for all residual stages. In practice, we reduce the number of channels to 128 for both channel-wise and position-wise GSoP; in addition, we set the size of the spatial covariance matrix to $64 \times 64$ (i.e., a downsampled spatial size of $h = w = 8$). We note that the covariance matrix size is evaluated in Section 4.1.

layers | channel-wise GSoP: 3D filter | channel-wise GSoP: output tensor | position-wise GSoP: 3D filter | position-wise GSoP: output tensor
conv + BN + ReLU | 1×1×1024, G=1 | 14×14×128 | 1×1×1024, G=1 | 14×14×128
down sampling | - | - | - | 8×8×128
COV pool + BN | - | 128×128, reshaped to 1×128×128 | - | 64×64, reshaped to 1×64×64
conv + BN + LReLU (0.1) | 1×128×1, G=128 | 1×1×512 | 1×64×1, G=64 | 1×1×256
conv + sigmoid | 1×1×512, G=1 | 1×1×1024 | 1×1×256, G=1 | 1×1×64, reshaped to 8×8×1
up sampling | - | - | - | 14×14×1
dot product | - | 14×14×1024 | - | 14×14×1024
parameters (M) | 0.72 | | 0.16 |
MFLOPs | 28.1 | | 26.2 |
Table 2: GSoP blocks for conv4_x. ‘G’ indicates the number of groups in grouped convolutions [19], where G=1 denotes a common convolution (no grouping); "reshaped to" marks a reshape operation. Shortcut connections are added after GSoP blocks.

After the $1 \times 1$ convolution for dimensionality reduction of channels, we perform downsampling for position-wise GSoP to obtain feature maps of fixed size (i.e., $8 \times 8$). Reshaped to a 3D tensor whose first dimension is a singleton, the covariance matrix can be seen as a $1 \times 128$ (or $1 \times 64$) feature map with 128 (or 64) channels, so row-wise BN and row-wise grouped convolutions [19] can be easily accomplished. The channel number after the row-wise convolution is raised to 512 and 256 for channel-wise pooling and position-wise pooling, respectively. The size of the weight vector for channel-wise pooling, or of the weight matrix for position-wise pooling, should match the input tensor size. We mention that after the proposed blocks, we also use a shortcut connection, adding the input tensor to the scaled output one. In Table 2, we present the implementation of the GSoP block for conv4_x.
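The parameter counts reported in Table 2 can be reproduced with a short calculation. This sketch counts only convolution weights; ignoring biases and BN parameters is an assumption, and the helper name `grouped_conv_params` is hypothetical.

```python
def grouped_conv_params(in_ch, out_ch, kernel_elems, groups=1):
    """Number of weights in a bias-free (grouped) convolution."""
    return (in_ch // groups) * out_ch * kernel_elems

# Channel-wise GSoP block for conv4_x (filter sizes taken from Table 2).
ch_reduce = grouped_conv_params(1024, 128, 1)             # 1x1x1024 conv, G=1
ch_row = grouped_conv_params(128, 512, 128, groups=128)   # 1x128 kernel per row
ch_out = grouped_conv_params(512, 1024, 1)                # 1x1x512 conv, G=1
channel_total = ch_reduce + ch_row + ch_out

# Position-wise GSoP block for conv4_x.
pos_reduce = grouped_conv_params(1024, 128, 1)            # 1x1x1024 conv, G=1
pos_row = grouped_conv_params(64, 256, 64, groups=64)     # 1x64 kernel per row
pos_out = grouped_conv_params(256, 64, 1)                 # 1x1x256 conv, G=1
position_total = pos_reduce + pos_row + pos_out

print(f"{channel_total / 1e6:.2f}M, {position_total / 1e6:.2f}M")  # 0.72M, 0.16M
```

Most of the channel-wise cost sits in the last convolution, which maps 512 channels back to the 1024 channels of the input tensor.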

4 Experiments

In this section, we first conduct an ablation analysis of the proposed GSoP-Nets. We then compare with competing methods as well as state-of-the-art networks on ImageNet. We finally evaluate the generalization capability of our network to small-scale classification. All of our programs are implemented under the PyTorch framework and run on four workstations, each of which is equipped with 2 GTX 1080Ti GPUs and an Intel i7-4790K@4GHz CPU.

Datasets

Our experiments are mainly conducted on the ImageNet-1K [4] benchmark. ImageNet-1K contains 1.28M training images and 50K validation images from 1,000 classes. In Section 4.1, for the purpose of faster ablation study, we build a small subset of ImageNet-1K by randomly selecting 250 classes, including 320K/12.5K images for training/validation, which we call ImageNet-K. For comparison with state-of-the-art networks, we adopt standard ImageNet-1K in Section 4.2. To evaluate the generalization capability of our network, we also conduct experiments on the CIFAR-100 benchmark [18], which contains 60K color images of 32x32 pixels from 100 categories, with 50K images for training and 10K images for testing.

Experimental Setting

During training from scratch with the ResNet architecture on ImageNet, we follow [10] for data augmentation involving scale, color and flip jittering. The weights are initialized as in [9]. We randomly crop $224 \times 224$ images from the rescaled images with per-channel mean subtraction. The networks are optimized using stochastic gradient descent (SGD) with a momentum of 0.9 and a mini-batch size of 160. The initial learning rate is set to 0.1 and divided by 10 every 30 epochs until 100 epochs, unless specified otherwise. During the testing stage, we evaluate the error on the single center crop from an image whose shorter side is 256.
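The ImageNet learning-rate schedule described above is a plain step decay, which can be written as a one-line function. The function name and defaults are illustrative; only the values 0.1, 30 and the factor of 10 come from the text.

```python
def learning_rate(epoch, base_lr=0.1, step=30, gamma=0.1):
    """Step decay: multiply the rate by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Rates at the first epoch of each plateau within the 100-epoch budget.
for epoch in (0, 30, 60, 90):
    print(epoch, f"{learning_rate(epoch):.0e}")
```

With a 100-epoch budget this gives four plateaus, the last of which (epochs 90 to 99) runs at a rate of 1e-4.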

For training from scratch on CIFAR-100, following [11, 14], we use standard data augmentation for training, including horizontal flip and random translation. The networks are trained within 110 epochs with an initial learning rate of 0.25, which is reduced to 0.025 and then to 0.0025 at two later epochs. The mini-batch size is 128 and the weight decay is 1e-4.

4.1 Ablation analysis on GSoP-Nets

We develop a lightweight residual network of 26 layers (i.e., ResNet-26) as our baseline architecture, where every residual stage contains two bottlenecks. Following [22], we do not perform downsampling in the last residual stage (conv5_x), as a small number of features is harmful for robust covariance estimation. For conv2_x to conv4_x, we insert a per-stage GSoP block after the last bottleneck structure of each residual stage. For GSoP-Net1 we insert one GSoP block followed by common global average pooling, outputting a 2K-dimensional image representation fully connected to the softmax layer, while for GSoP-Net2 we use matrix power normalized covariance pooling, producing a 32K-dimensional image representation. Table 3 presents the architecture of our GSoP-Nets.
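The two representation sizes can be checked with simple arithmetic. The sketch below assumes, as is common for covariance pooling of this kind but not stated in this section, that features are reduced to 256 channels before pooling and that only the upper triangle of the symmetric covariance matrix is kept; the function name is hypothetical.

```python
def covariance_repr_dim(d):
    """Number of distinct entries in a symmetric d x d matrix."""
    return d * (d + 1) // 2

gavp_dim = 2048                       # channels of conv5_x in a bottleneck ResNet
cov_dim = covariance_repr_dim(256)    # assumes 256 channels before pooling
print(gavp_dim, cov_dim)              # 2048 32896
```

Under these assumptions the covariance representation is roughly 16 times larger than the mean vector, which is why the final FC layer dominates the parameter count of GSoP-Net2.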

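stage | output | layer
conv1 | 112×112 | conv, 7×7, 64, stride 2
pool1 | 56×56 | max pool, 3×3, stride 2
conv2_x | 56×56 | 2 bottlenecks, GSoP block
conv3_x | 28×28 | 2 bottlenecks, GSoP block
conv4_x | 14×14 | 2 bottlenecks, GSoP block
conv5_x | 14×14 | 2 bottlenecks
- | 1×1 | GSoP block + GAvP, 2K; or iSQRT-COV [21], 32K
- | 1×1 | FC + softmax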
Table 3: GSoP-Net with ResNet-26 architecture.
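pooling | cov size | GSoP-Net1 (top-1 err/top-5 err) | GSoP-Net2 (top-1 err/top-5 err)
channel-wise | 64×64 | 18.00/4.99 | 16.84/4.58
channel-wise | 128×128 | 17.42/4.53 | 16.68/4.36
channel-wise | 256×256 | 17.61/4.64 | 16.67/4.18
position-wise | 36×36 | 19.21/5.46 | 17.34/4.80
position-wise | 64×64 | 18.37/5.05 | 17.18/4.80
position-wise | 144×144 | 18.41/5.08 | 17.51/4.63
vanilla network: 19.18/5.62
(a) Impact of covariance matrix size.

scheme | GSoP-Net1 (top-1 err/top-5 err) | GSoP-Net2 (top-1 err/top-5 err)
channel-wise pool | 17.42/4.53 | 16.68/4.36
position-wise pool | 18.37/5.05 | 17.18/4.80
fusion: average | 17.90/4.73 | 16.77/4.36
fusion: maximum | 17.48/4.52 | 16.80/4.39
fusion: concatenation | 17.58/4.61 | 16.49/4.35
(b) Comparison of fusion schemes.

configuration | top-1 err | top-5 err
no second-order block (vanilla) | 19.18 | 5.62
one channel-wise block at a single stage | 18.45 | 5.22
one channel-wise block at a single stage | 18.72 | 5.33
one channel-wise block at a single stage | 18.85 | 5.24
one channel-wise block at a single stage | 18.33 | 5.12
channel-wise blocks at all stages (GSoP-Net1) | 17.42 | 4.53
iSQRT-COV at the last stage only | 17.43 | 4.71
channel-wise blocks + iSQRT-COV at the last stage (GSoP-Net2) | 16.68 | 4.36
(c) Single block performance.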
Table 4: Ablation results of our GSoP-Nets with ResNet-26 architecture on ImageNet-K.
Impact of Covariance Size.

The covariance matrices, produced by the second-order pooling blocks, encode the statistical correlation of the holistic tensors, playing a central role in our networks. So we first evaluate the impact of covariance matrix size on the proposed networks. Table 4(a) summarizes the results, in which the top and middle panels show the impact using channel-wise (cov size: $c \times c$) and position-wise pooling (cov size: $hw \times hw$), respectively. We first observe that, with either kind of second-order pooling, the proposed networks generally improve over vanilla ResNet-26 (the only exception being the top-1 error of position-wise GSoP-Net1 with the $36 \times 36$ covariance), demonstrating that our holistic modeling methods in earlier stages are beneficial in enhancing the network's discriminative capability. For channel-wise second-order pooling, among the values of $c$ tested, GSoP-Net1 achieves the best results with $c = 128$. The top-1 error of GSoP-Net2 consistently declines as $c$ gets larger and the lowest error is obtained with $c = 256$. For position-wise second-order pooling, GSoP-Net1 with $hw = 64$ produces the lowest errors. Notably, for either channel-wise or position-wise pooling, it is clear that GSoP-Net2 performs much better than GSoP-Net1, which suggests that the image representation of a covariance matrix is superior to that of a mean vector obtained by average pooling.

Fusion of Channel- and Position-wise Pooling.

The channel-wise and position-wise second-order pooling capture statistical correlations from different dimensions of the 3D tensor. They can be combined for holistic image modeling. Given an input tensor, we independently perform second-order pooling along the channel dimension and the spatial dimension, producing two output tensors. We can fuse the two output tensors by the commonly used operations of average, maximum and concatenation. As the concatenation operation increases tensor size, we use one convolutional layer for maintaining the original tensor size.
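The three fusion schemes can be sketched in NumPy as follows. The function name `fuse`, the random projection weights and the choice of a $1 \times 1$ convolution (written as a per-position matrix product) for restoring the tensor size after concatenation are assumptions; the text only states that one convolutional layer is used.

```python
import numpy as np

def fuse(a, b, scheme, w_proj=None):
    """Fuse two (h, w, c) tensors into one (h, w, c) tensor."""
    if scheme == "average":
        return 0.5 * (a + b)
    if scheme == "maximum":
        return np.maximum(a, b)
    if scheme == "concatenation":
        cat = np.concatenate([a, b], axis=-1)   # (h, w, 2c)
        return cat @ w_proj                     # 1x1 conv as a (2c, c) linear map
    raise ValueError(f"unknown fusion scheme: {scheme}")

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4, 8))              # stand-in for channel-wise output
b = rng.standard_normal((4, 4, 8))              # stand-in for position-wise output
w_proj = 0.1 * rng.standard_normal((16, 8))     # placeholder projection weights
for scheme in ("average", "maximum", "concatenation"):
    print(scheme, fuse(a, b, scheme, w_proj).shape)
```

Only the concatenation scheme introduces extra learnable parameters, since the projection back to the original channel count has to be trained.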

The results of the fusion methods are presented in Table 4(b). For GSoP-Net1, the average scheme performs worse than the other two, while the maximum scheme is slightly better than the concatenation one. For GSoP-Net2, the concatenation scheme is a little superior to the other two schemes. However, compared to separate channel-wise pooling, combining it with position-wise pooling brings little improvement under any fusion scheme. These results suggest that the two kinds of second-order pooling methods are not complementary, though the two proposed networks individually show obvious improvement over the vanilla network.

Performance of Single Second-order Block.

In this part, we conduct experiments to analyze the performance of a single channel-wise block separately added to different residual stages. We omit the analysis of a single position-wise block as it is not promising. Table 4(c) presents the results, where S2 denotes residual stage 2, and so on, and each residual stage is marked as containing no second-order block, one channel-wise block (C), or the iSQRT-COV meta-layer [22]. It can be seen that insertion of a single block into any residual stage brings comparable improvement over the vanilla network. This indicates that channel-wise second-order pooling at different stages makes a similar contribution to the overall performance of channel-wise GSoP-Net1. The iSQRT-COV, which inserts a matrix normalized covariance matrix at residual stage 4 as the final image representation, is a strong baseline, achieving results comparable with GSoP-Net1. The GSoP-Net2, which additionally inserts global second-order pooling at intermediate stages, outperforms iSQRT-COV by a non-trivial margin. This suggests the benefits of introducing second-order statistics in earlier layers of networks.

4.2 Results on ImageNet-1K

Figure 3: Convergence curves of our GSoP-Nets under ResNet-50 architecture. Top: GSoP-Net1 vs vanilla network; bottom: GSoP-Net2 vs iSQRT-COV.

In this subsection, we further evaluate our proposed GSoP-Nets on standard ImageNet-1K under the ResNet-50 architecture. We insert a GSoP block at the end of each of conv2_x, conv3_x and conv4_x. For GSoP-Net1, we insert one more GSoP block at the end of conv5_x, followed by the commonly used global average pooling; for GSoP-Net2, instead of this GSoP block, the meta-layer of iSQRT-COV [21] is inserted.

4.2.1 Convergence and Network Complexity

Convergence.

Figure 3 illustrates the convergence curves of our GSoP-Nets. For GSoP-Net1, though second-order statistical modeling is exploited, it is used for tensor scaling, while the convolutional filters and image representation are both linear, just like in the original ResNet-50. As shown in the top figure, the convergence behavior of GSoP-Net1 is similar to that of ResNet-50, but it consistently has lower validation error throughout the training process. Different from iSQRT-COV, for GSoP-Net2 we introduce second-order blocks in the first three residual stages (conv2_x to conv4_x). From the bottom figure, we can see that GSoP-Net2 inherits the fast convergence property of iSQRT-COV, while steadily performing better. We attribute the improvement of our networks over their counterparts to the holistic modeling of second-order statistics introduced in earlier stages.

Network Complexity.

Table 5 compares the number of parameters and the computation. The number of parameters of GSoP-Net1 is comparable to that of the vanilla ResNet-50, while GSoP-Net2 roughly doubles the number of parameters. The increased parameters in GSoP-Net2 are mainly due to the FC layer, in which the dimensionality of the image representation is 32K, accounting for most of the increase in total parameters, just like MPN-COV [22] and iSQRT-COV [21]. We argue that advances in model compression, e.g., [5, 27, 8], have the potential to significantly reduce the number of parameters, particularly in the FC layer, while maintaining the performance; in practice, we can exploit such techniques for reducing parameters. Analogous to [22, 21], the GFLOPs of our networks are 1.58x those of the vanilla ResNet. The increased computation is attributed to the removal of downsampling in the last residual stage, so that the feature map size doubles. This operation is helpful for robust covariance estimation by alleviating the problem of small sample size and high dimensionality [22]. This somewhat slows down training, while making little difference for inference. With a single GTX 1080Ti GPU with CUDA 9.0 and cuDNN 7.1, the inference times (ms) per image are 2.52 vs 2.68/2.84 (vanilla ResNet-50 vs GSoP-Net1/GSoP-Net2).

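method | description | top-1 | top-5 | params/GFLOPs
He et al. [10] | Baseline network | 23.85 | 7.13 | 25.5M/3.86
FBN [23] | Quadratic transformations | 24.0 | 7.1 | -
SORT [36] | Quadratic transformations | 23.82 | 6.72 | -
MPN-COV [22] | GSoP at network end | 22.74 | 6.54 | 2.2×/1.6×
iSQRT-COV [21] | GSoP at network end | 22.14 | 6.22 | 2.2×/1.6×
SE-Net [14] | GAvP across network | 23.29 | 6.62 | 1.1×/1.0×
GENet [12] | GAvP across network | 21.88 | 5.80 | 1.3×/1.0×
CBAM [37] | GAvP across network | 22.66 | 6.31 | 1.1×/1.0×
GSoP-Net1 (ours) | GSoP across network | 22.32 | 6.02 | 1.1×/1.6×
GSoP-Net2 (ours) | GSoP across network | 21.19 | 5.64 | 2.3×/1.7×
ResNeXt [39] | Modified architectures upon ResNet | 22.11 | 5.90 | 1.0×/1.0×
DropBlock [7] | Modified architectures upon ResNet | 21.87 | 5.98 | 1.0×/1.0×
DRN-A-50 [40] | Modified architectures upon ResNet | 22.94 | 6.57 | 1.0×/4.9×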
Table 5: Comparison (%) of different methods with ResNet-50 architecture on ImageNet-1K.

4.2.2 Comparison with Competing Networks.

Table 5 compares classification errors between our GSoP-Nets and the competing networks on ImageNet-1K.

Comparison with FBN and SORT

The two works [23, 36] are among the first which introduce quadratic transformation, instead of just linear convolutions, throughout a network. However, compared to the vanilla network, their performance gains are not significant. In contrast, our networks are much better, achieving over 2.8% and 2.6% higher accuracies than FBN and SORT. This comparison demonstrates that, by making favorable use of higher-order information, we can greatly improve the network performance.

Comparison with Global Cov Pool at Network End.

Here we compare our GSoP-Net2 with several methods where global second-order pooling is inserted only at the end of network. All of them estimate covariance matrices of the last convolutional features as image representations. DeepO2P computes the matrix logarithm of the covariance matrix, while B-CNN performs element-wise power normalization plus $\ell_2$ normalization. As DeepO2P and B-CNN are not competitive for large-scale visual recognition [22], we do not compare with them here. MPN-COV uses structured normalization by matrix square root, and iSQRT-COV is a faster version of MPN-COV, in which the matrix square root is computed by an iterative algorithm rather than GPU-unfriendly SVD. Our GSoP-Net2 outperforms MPN-COV by 1.55% in top-1 error (0.90% in top-5 error). Compared to iSQRT-COV, GSoP-Net2 achieves 0.95%/0.58% lower top-1/top-5 error rates, while incurring negligible overhead. We note that iSQRT-COV is a strong baseline and our improvement is non-trivial. The comparison between our GSoP-Net2 and MPN-COV/iSQRT-COV indicates that introducing higher-order statistics in earlier stages can enhance the representation learning capability of deep ConvNets.
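To illustrate what an SVD-free, iterative matrix square root can look like, the sketch below uses the coupled Newton-Schulz iteration with trace pre-normalization. This is offered as one plausible instance of such an algorithm, not as the exact procedure of iSQRT-COV; the function name, iteration count and test matrix are assumptions.

```python
import numpy as np

def sqrt_newton_schulz(cov, num_iters=10):
    """Approximate the square root of a symmetric positive definite matrix.

    Uses only matrix products, which map well to GPUs.
    """
    n = cov.shape[0]
    eye = np.eye(n)
    trace = np.trace(cov)
    y = cov / trace                 # pre-normalize so the iteration converges
    z = eye.copy()
    for _ in range(num_iters):
        t = 0.5 * (3.0 * eye - z @ y)
        y = y @ t                   # y converges to the root of cov / trace
        z = t @ z                   # z converges to its inverse
    return np.sqrt(trace) * y       # undo the pre-normalization

rng = np.random.default_rng(0)
a = rng.standard_normal((50, 4))
cov = a.T @ a / 50                  # small, well-conditioned test matrix
root = sqrt_newton_schulz(cov, num_iters=20)
print(np.allclose(root @ root, cov, atol=1e-6))
```

Dividing by the trace bounds the eigenvalues of the starting matrix by 1, which is the condition this iteration needs in order to converge.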

Comparison with Global Avg Pool across Network.

From Table 5, we can see that our GSoP-Net1 performs better than SE-Net in top-1/top-5 errors. As an extension of SE-Net, CBAM combines global average and max pooling along both the channel dimension and the spatial dimension. Nevertheless, the error rates of GSoP-Net1 are lower than those of CBAM. Building upon SE-Net, GENet [12] proposes gather and excitation operations for exploiting context information. Our GSoP-Net2 outperforms GENet by a non-trivial margin. These comparisons between our networks and SE-Net and its variants show that higher-order modeling is able to capture richer statistics than first-order modeling, leading to more discriminative representations. Notably, we do not insert a GSoP block after each bottleneck structure; instead, we insert GSoP blocks only per residual stage. As a result, we add no more than 4 GSoP blocks, and more GSoP blocks may further improve the performance of our network.
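The claim that second-order pooling captures statistics that first-order pooling misses can be illustrated with a toy example. The sketch below is an illustration only: the two tiny feature maps and the helper names are hypothetical and not taken from the paper. Both maps have identical per-channel means, so global average pooling cannot tell them apart, while their channel covariance matrices differ.

```python
import numpy as np

def global_avg_pool(features):
    """First-order statistic: per-channel mean over positions, shape (C,)."""
    return features.mean(axis=1)

def global_cov_pool(features):
    """Second-order statistic: C x C covariance between channels."""
    centered = features - features.mean(axis=1, keepdims=True)
    return centered @ centered.T / features.shape[1]

# Two toy feature maps with 2 channels and 4 spatial positions each.
# In the first map the channels move together, in the second they oppose.
correlated = np.array([[1.0, -1.0, 1.0, -1.0],
                       [1.0, -1.0, 1.0, -1.0]])
anticorrelated = np.array([[1.0, -1.0, 1.0, -1.0],
                           [-1.0, 1.0, -1.0, 1.0]])

same_mean = np.allclose(global_avg_pool(correlated),
                        global_avg_pool(anticorrelated))
same_cov = np.allclose(global_cov_pool(correlated),
                       global_cov_pool(anticorrelated))
print(same_mean, same_cov)  # prints: True False
```

The off-diagonal covariance entries are +1 for the first map and -1 for the second, which is exactly the inter-channel dependency that a mean-based descriptor discards.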

Comparison with State-of-the-arts.

Finally, we compare with several state-of-the-art networks that modify the ResNet-50 architecture. Compared to ResNet, ResNeXt [39] considerably increases network width while keeping parameters and computation almost unchanged through extensive use of grouped convolutions [19]. DRN-A-50 [40] removes downsampling in residual stages 3 and 4, and uses dilated convolution to maintain the receptive field size. DropBlock [7] extends the dropout technique to convolution; by randomly dropping blocks of the feature map, it maintains context integrity during training. As shown in Table 5, these modified networks perform much better than ResNet-50. Nevertheless, our GSoP-Net2 outperforms all of them by a non-trivial margin. It is worth noting that, if built upon the modified networks above, the performance of our network may improve further.

4.3 Results on CIFAR-100

This section reports experiments on CIFAR-100 [18] to evaluate the generalization capability of the proposed GSoP-Net. The backbone network is pre-activation ResNet-164 [11], containing 3 residual stages each with 18 bottlenecks; the final image representation is 256-D. In GSoP-Net1, we insert 18 GSoP blocks into the backbone network uniformly, and in GSoP-Net2 the last GSoP block is replaced by a meta-layer of iSQRT-COV. Downsampling is not performed in the last residual stage. The final dimension of the image representation in GSoP-Net2 is 8K, and a dropout layer (dropout rate = 0.5) is used for the FC layer. The covariance size is in both GSoP-Net1 and GSoP-Net2.

The experimental results on CIFAR-100 are presented in Table 6. Compared with the vanilla network, GSoP-Net1 and GSoP-Net2 obtain gains of 3.47% and 5.75%, respectively, improving the performance by a large margin. CMPE [15] implements a channel-wise excitation operation by establishing the correlation of the channel-wise representations between two nearby bottlenecks, and can be considered a cross-block version of SE-Net. GSoP-Net1 performs better than SE-Net and CMPE by 0.45% and 1.49%, respectively. iSQRT-COV is very competitive, outperforming SE-Net by 1.36%. By introducing second-order statistics in earlier stages, our GSoP-Net2 makes a further improvement (1.37%) over iSQRT-COV.

| model | top-1 err | params | GFLOPs |
| --- | --- | --- | --- |
| He et al. [11] | 24.33 | 1.7M | 0.25 |
| SE-Net [13] | 21.31 | 1.9M | 0.29 |
| CMPE [15] | 22.35 | 2.0M | NA |
| iSQRT-COV [21] | 19.95 | 2.5M | 0.52 |
| GSoP-Net1 (ours) | 20.86 | 2.9M | 0.55 |
| GSoP-Net2 (ours) | 18.58 | 3.6M | 0.58 |
Table 6: Error comparison (%) of our networks with the counterparts on CIFAR-100.
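The gains quoted in the text follow directly from the top-1 errors in Table 6. The short sketch below recomputes them. The dictionary keys and the `gain` helper are illustrative names, and the 19.95 entry is the row citing [21], which the text discusses as iSQRT-COV.

```python
# Top-1 errors (%) copied from Table 6
top1_err = {
    "ResNet-164": 24.33,   # He et al. [11], the vanilla backbone
    "SE-Net": 21.31,
    "CMPE": 22.35,
    "iSQRT-COV": 19.95,    # the row citing [21]
    "GSoP-Net1": 20.86,
    "GSoP-Net2": 18.58,
}

def gain(baseline, model):
    """Absolute top-1 error reduction (percentage points) over a baseline."""
    return round(top1_err[baseline] - top1_err[model], 2)

print(gain("ResNet-164", "GSoP-Net1"))  # 3.47
print(gain("ResNet-164", "GSoP-Net2"))  # 5.75
print(gain("SE-Net", "GSoP-Net1"))      # 0.45
print(gain("CMPE", "GSoP-Net1"))        # 1.49
print(gain("iSQRT-COV", "GSoP-Net2"))   # 1.37
```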

5 Conclusion

We presented a simple yet effective deep convolutional network model for capturing holistic statistical correlations across all stages of a network. By exploiting holistic higher-order information at earlier stages, the proposed model can learn more discriminative representations. As far as we know, our work is among the first to introduce global second-order pooling into lower layers of deep networks. Our proposed networks perform better than SE-Net [14], i.e., the first-order counterpart, and improve non-trivially on the state-of-the-art iSQRT-COV [21], which plugs in global covariance pooling as the image representation only at the network end. The proposed GSoP blocks are highly modular and can be conveniently plugged into other deep architectures, e.g., Inception [30] and DenseNet [16].

References

  • [1] S. Cai, W. Zuo, and L. Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In ICCV, 2017.
  • [2] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In CVPR, 2017.
  • [3] X. Dai, J. Yue-Hei Ng, and L. S. Davis. FASON: First and second order information fusion network for texture recognition. In CVPR, 2017.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
  • [6] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In CVPR, 2016.
  • [7] G. Ghiasi, T.-Y. Lin, and Q. V. Le. DropBlock: A regularization method for convolutional networks. In NIPS, 2018.
  • [8] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [12] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In NIPS, 2018.
  • [13] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu. Squeeze-and-excitation networks. arXiv:1709.01507v3, 2018.
  • [14] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.
  • [15] Y. Hu, G. Wen, M. Luo, D. Dai, and J. Ma. Competitive inner-imaging squeeze and excitation for residual network. arXiv:1807.08920, 2018.
  • [16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [17] C. Ionescu, O. Vantzos, and C. Sminchisescu. Matrix backpropagation for deep networks with structured layers. In ICCV, 2015.
  • [18] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech. Rep., 2009.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [20] P. Li, Q. Wang, H. Zeng, and L. Zhang. Local Log-Euclidean multivariate Gaussian descriptor and its application to image classification. IEEE TPAMI, 2017.
  • [21] P. Li, J. Xie, Q. Wang, and Z. Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR, 2018.
  • [22] P. Li, J. Xie, Q. Wang, and W. Zuo. Is second-order information helpful for large-scale visual recognition? In ICCV, 2017.
  • [23] Y. Li, N. Wang, J. Liu, and X. Hou. Factorized bilinear models for image recognition. In ICCV, 2017.
  • [24] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
  • [25] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.
  • [26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
  • [28] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
  • [29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
  • [32] H. Wang, Q. Wang, M. Gao, P. Li, and W. Zuo. Multi-scale location-aware kernel representation for object detection. In CVPR, 2018.
  • [33] Q. Wang, P. Li, and L. Zhang. GDeNet: Global Gaussian distribution embedding network and its application to visual recognition. In CVPR, 2017.
  • [34] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
  • [35] Y. Wang, M. Long, J. Wang, and P. S. Yu. Spatiotemporal pyramid network for video action recognition. In CVPR, 2017.
  • [36] Y. Wang, L. Xie, C. Liu, S. Qiao, Y. Zhang, W. Zhang, Q. Tian, and A. Yuille. SORT: Second-order response transform for visual recognition. In ICCV, 2017.
  • [37] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon. CBAM: Convolutional block attention module. In ECCV, 2018.
  • [38] H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang. MoNet: Deep motion exploitation for video object segmentation. In CVPR, 2018.
  • [39] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [40] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In CVPR, 2017.
  • [41] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
  • [42] G. Zoumpourlis, A. Doumanoglou, N. Vretos, and P. Daras. Non-linear convolution filters for CNN-based learning. In ICCV, 2017.