Res2Net: A New Multi-scale Backbone Architecture
Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods. The source code and trained models are available at https://mmcheng.net/res2net/.
Visual patterns occur at multi-scales in natural scenes as shown in Fig. 1. First, objects may appear with different sizes in a single image, e.g., the sofa and cup are of different sizes. Second, the essential contextual information of an object may occupy a much larger area than the object itself. For instance, we need to rely on the big table as context to better tell whether the small black blob placed on it is a cup or a pen holder. Third, perceiving information from different scales is essential for understanding parts as well as objects for tasks such as fine-grained classification and semantic segmentation. Thus, it is of critical importance to design good features for multi-scale stimuli for visual cognition tasks, including image classification [krizhevsky2012imagenet], object detection [ren2015faster], attention prediction [selvaraju2017grad], target tracking [zhang2017multi], action recognition [simonyan2014two], semantic segmentation [chen2018deeplab], salient object detection [hou2017deeply, BorjiCVM2019], object proposal [ren2015faster, BingObjCheng2018], skeleton extraction [zhao2018hifi], stereo matching [nie2019multi], and edge detection [xie2015holistically, liu2017richer].
Unsurprisingly, multi-scale features have been widely used in both conventional feature design [belongie2002shape, lowe2004distinctive] and deep learning [szegedy2015going, chen2017dual]. Obtaining multi-scale representations in vision tasks requires feature extractors to use a large range of receptive fields to describe objects/parts/context at different scales. Convolutional neural networks (CNNs) naturally learn coarse-to-fine multi-scale features through a stack of convolutional operators. Such inherent multi-scale feature extraction ability of CNNs leads to effective representations for solving numerous vision tasks. How to design a more efficient network architecture is the key to further improving the performance of CNNs.
In the past few years, several backbone networks, e.g., [krizhevsky2012imagenet, simonyan2014very, szegedy2015going, he2016deep, huang2017densely, Chollet_2017_CVPR, xie2017aggregated, chen2017dual, yu2018deep, hu2018senet], have made significant advances in numerous vision tasks with state-of-the-art performance. Earlier architectures such as AlexNet [krizhevsky2012imagenet] and VGGNet [simonyan2014very] stack convolutional operators, making the data-driven learning of multi-scale features feasible. The efficiency of multi-scale ability was subsequently improved by using conv layers with different kernel size (e.g., InceptionNets [szegedy2015going, szegedy2016rethinking, szegedy2017inception]), residual modules (e.g., ResNet [he2016deep]), shortcut connections (e.g., DenseNet [huang2017densely]), and hierarchical layer aggregation (e.g., DLA [yu2018deep]). The advances in backbone CNN architectures have demonstrated a trend towards more effective and efficient multi-scale representations.
In this work, we propose a simple yet efficient multi-scale processing approach. Unlike most existing methods that enhance the layer-wise multi-scale representation strength of CNNs, we improve the multi-scale representation ability at a more granular level. Different from some concurrent works [chen2019drop, chen2018biglittle, cheng2019high] that improve the multi-scale ability by utilizing features with different resolutions, the multi-scale of our proposed method refers to the multiple available receptive fields at a more granular level. To achieve this goal, we replace the $3\times3$ filters (convolutional operators and filters are used interchangeably) of $n$ channels with a set of smaller filter groups, each with $w$ channels (without loss of generality we use $n = s \times w$). As shown in Fig. 2, these smaller filter groups are connected in a hierarchical residual-like style to increase the number of scales that the output features can represent. Specifically, we divide input feature maps into several groups. A group of filters first extracts features from a group of input feature maps. Output features of the previous group are then sent to the next group of filters along with another group of input feature maps. This process repeats several times until all input feature maps are processed. Finally, feature maps from all groups are concatenated and sent to another group of filters to fuse information altogether. Along any possible path in which input features are transformed to output features, the equivalent receptive field increases whenever it passes a filter, resulting in many equivalent feature scales due to combination effects.
The Res2Net strategy exposes a new dimension, namely scale (the number of feature groups in the Res2Net block), as an essential factor in addition to existing dimensions of depth [simonyan2014very], width (the number of channels in a layer, as in [Zagoruyko2016WRN]), and cardinality [xie2017aggregated]. We show in Sec. 4.4 that increasing scale is more effective than increasing other dimensions.
Note that the proposed approach exploits the multi-scale potential at a more granular level, which is orthogonal to existing methods that utilize layer-wise operations. Thus, the proposed building block, namely Res2Net module, can be easily plugged into many existing CNN architectures. Extensive experimental results show that the Res2Net module can further improve the performance of state-of-the-art CNNs, e.g., ResNet [he2016deep], ResNeXt [xie2017aggregated], and DLA [yu2018deep].
2 Related Work
2.1 Backbone Networks
Recent years have witnessed numerous backbone networks [krizhevsky2012imagenet, simonyan2014very, szegedy2015going, he2016deep, huang2017densely, Chollet_2017_CVPR, xie2017aggregated, yu2018deep], achieving state-of-the-art performance in various vision tasks with stronger multi-scale representations. As designed, CNNs are equipped with basic multi-scale feature representation ability since the input information follows a fine-to-coarse fashion. The AlexNet [krizhevsky2012imagenet] stacks filters sequentially and achieves significant performance gain over traditional methods for visual recognition. However, due to the limited network depth and kernel size of filters, the AlexNet has only a relatively small receptive field. The VGGNet [simonyan2014very] increases the network depth and uses filters with smaller kernel size. A deeper structure can expand the receptive fields, which is useful for extracting features from a larger scale. It is more efficient to enlarge the receptive field by stacking more layers than using large kernels. As such, the VGGNet provides a stronger multi-scale representation model than AlexNet, with fewer parameters. However, both AlexNet and VGGNet stack filters directly, which means each feature layer has a relatively fixed receptive field.
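The efficiency argument for stacking small kernels can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions, equal input and output channel counts, and no biases; the function names and the channel count of 64 are illustrative choices, not values taken from the paper.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the field by (k - 1)
    return rf


def stacked_conv_weights(kernel_sizes, channels):
    """Weight count of the stack, assuming `channels` in and out per layer, no bias."""
    return sum(k * k * channels * channels for k in kernel_sizes)


if __name__ == "__main__":
    channels = 64  # assumed channel count, for illustration only
    for name, kernels in [("three 3x3 layers", [3, 3, 3]), ("one 7x7 layer", [7])]:
        print(name,
              "receptive field:", stacked_receptive_field(kernels),
              "weights:", stacked_conv_weights(kernels, channels))
```

Under these assumptions both configurations cover a 7x7 receptive field, but the stacked version needs 27 weights per channel pair instead of 49.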
Network in Network (NIN) [lin2013network] inserts multi-layer perceptrons as micro-networks into the large network to enhance model discriminability for local patches within the receptive field. The $1\times1$ convolution introduced in NIN has been a popular module to fuse features. The GoogLeNet [szegedy2015going] utilizes parallel filters with different kernel sizes to enhance the multi-scale representation capability. However, such capability is often limited by the computational constraints due to its limited parameter efficiency. The Inception Nets [szegedy2016rethinking, szegedy2017inception] stack more filters in each path of the parallel paths in the GoogLeNet to further expand the receptive field. On the other hand, the ResNet [he2016deep] introduces short connections to neural networks, thereby alleviating the gradient vanishing problem while obtaining much deeper network structures. During the feature extraction procedure, short connections allow different combinations of convolutional operators, resulting in a large number of equivalent feature scales. Similarly, densely connected layers in the DenseNet [huang2017densely] enable the network to process objects in a very wide range of scales. DPN [chen2017dual] combines the ResNet with DenseNet to enable feature re-usage ability of ResNet and the feature exploration ability of DenseNet. The recently proposed DLA [yu2018deep] method combines layers in a tree structure. The hierarchical tree structure enables the network to obtain even stronger layer-wise multi-scale representation capability.
2.2 Multi-scale Representations for Vision Tasks
Multi-scale feature representations of CNNs are of great importance to a number of vision tasks including object detection [ren2015faster], face analysis [bulat2017far, najibi2017ssh], edge detection [liu2017richer], semantic segmentation [chen2018deeplab], salient object detection [Liu2019PoolSal, Zhao2019RgbdSal], and skeleton detection [zhao2018hifi], boosting the model performance of those fields.
2.2.1 Object detection.
Effective CNN models need to locate objects of different scales in a scene. Earlier works such as the R-CNN [girshick2014rich] mainly rely on the backbone network, i.e., VGGNet [simonyan2014very], to extract features of multiple scales. He et al. propose an SPP-Net approach [he2015spatial] that utilizes spatial pyramid pooling after the backbone network to enhance the multi-scale ability. The Faster R-CNN method [ren2015faster] further proposes the region proposal networks to generate bounding boxes with various scales. Based on the Faster R-CNN, the FPN [lin2017feature] approach introduces feature pyramid to extract features with different scales from a single image. The SSD method [liu2016ssd] utilizes feature maps from different stages to process visual information at different scales.
2.2.2 Semantic segmentation.
Extracting essential contextual information of objects requires CNN models to process features at various scales for effective semantic segmentation. Long et al. [long2015fully] propose one of the earliest methods that enables multi-scale representations of the fully convolutional network (FCN) for the semantic segmentation task. In DeepLab, Chen et al. [chen2018deeplab, chen2017rethinking] introduce a cascaded atrous convolutional module to expand the receptive field further while preserving spatial resolutions. More recently, global context information is aggregated from region-based features via the pyramid pooling scheme in the PSPNet [Zhao2017PSP].
2.2.3 Salient object detection.
Precisely locating the salient object regions in an image requires an understanding of both large-scale context information for the determination of object saliency, and small-scale features to localize object boundaries accurately [zhao2019optimizing]. Early approaches [borji2015salient] utilize handcrafted representations of global contrast [cheng2015global] or multi-scale region features [WangDRFI2017]. Li et al.[li2015visual] propose one of the earliest methods that enables multi-scale deep features for salient object detection. Later, multi-context deep learning [zhao2015saliency] and multi-level convolutional features [zhang2017amulet] are proposed for improving salient object detection. More recently, Hou et al.[hou2017deeply] introduce dense short connections among stages to provide rich multi-scale feature maps at each layer for salient object detection.
2.3 Concurrent Works
Recently, there are some concurrent works aiming at improving the performance by utilizing the multi-scale features [chen2019drop, chen2018biglittle, cheng2019high, SunZJCXLMWLW19]. Big-Little Net [chen2018biglittle] is a multi-branch network composed of branches with different computational complexity. Octave Conv [chen2019drop] decomposes the standard convolution into two resolutions to process features at different frequencies. MSNet [cheng2019high] utilizes a high-resolution network to learn high-frequency residuals by using the up-sampled low-resolution features learned by a low-resolution network. Unlike the low-resolution representations used in these works, the HRNet [SunXLW19, SunZJCXLMWLW19] introduces high-resolution representations in the network and repeatedly performs multi-scale fusions to strengthen high-resolution representations. One common operation in [chen2019drop, chen2018biglittle, cheng2019high, SunXLW19, SunZJCXLMWLW19] is that they all use pooling or up-sampling to re-size the feature maps to a multiple of the original scale, saving computational budget while maintaining or even improving performance. In the Res2Net block, by contrast, the hierarchical residual-like connections within a single residual block enable the variation of receptive fields at a more granular level to capture both details and global features. Experimental results show that the Res2Net module can be integrated with those novel network designs to further boost the performance.
3.1 Res2Net Module
The bottleneck structure shown in Fig. 2(a) is a basic building block in many modern backbone CNNs architectures, e.g., ResNet [he2016deep], ResNeXt [xie2017aggregated], and DLA [yu2018deep]. Instead of extracting features using a group of filters as in the bottleneck block, we seek alternative architectures with stronger multi-scale feature extraction ability, while maintaining a similar computational load. Specifically, we replace a group of filters with smaller groups of filters, while connecting different filter groups in a hierarchical residual-like style. Since our proposed neural network module involves residual-like connections within a single residual block, we name it Res2Net.
Fig. 2 shows the differences between the bottleneck block and the proposed Res2Net module. After the $1\times1$ convolution, we evenly split the feature maps into $s$ feature map subsets, denoted by $x_i$, where $i \in \{1, 2, \dots, s\}$. Each feature subset $x_i$ has the same spatial size but $1/s$ number of channels compared with the input feature map. Except for $x_1$, each $x_i$ has a corresponding $3\times3$ convolution, denoted by $K_i()$. We denote by $y_i$ the output of $K_i()$. The feature subset $x_i$ is added with the output of $K_{i-1}()$, and then fed into $K_i()$. To reduce parameters while increasing $s$, we omit the $3\times3$ convolution for $x_1$. Thus, $y_i$ can be written as:
$$y_i = \begin{cases} x_i & i = 1; \\ K_i(x_i) & i = 2; \\ K_i(x_i + y_{i-1}) & 2 < i \leq s. \end{cases} \tag{1}$$
Notice that each $3\times3$ convolutional operator $K_i()$ could potentially receive feature information from all feature splits $\{x_j, j \leq i\}$. Each time a feature split $x_j$ goes through a $3\times3$ convolutional operator, the output result can have a larger receptive field than $x_j$. Due to the combinatorial explosion effect, the output of the Res2Net module contains a different number and different combination of receptive field sizes/scales.
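The hierarchical rule can be sketched in a few lines: the first split is passed through unchanged, the second split goes through its own operator, and every later split is first added to the previous output and then passed through its operator. The code below is a minimal illustration, not the authors' implementation. It is generic over the feature type, and the helpers `conv3x3_rf` and `receptive_fields` are hypothetical names that track only receptive-field sizes, assuming stride-1 $3\times3$ convolutions that each enlarge the receptive field by 2.

```python
def hierarchical_outputs(splits, convs, add):
    """Apply y_1 = x_1, y_2 = K_2(x_2), y_i = K_i(x_i + y_{i-1}) for i > 2.

    splits: the s feature splits x_1..x_s
    convs:  the s - 1 operators K_2..K_s (any callables)
    add:    how two features are combined before an operator is applied
    """
    if len(convs) != len(splits) - 1:
        raise ValueError("expected one operator per split except the first")
    outputs = [splits[0]]  # the first split is passed through unchanged
    for i in range(1, len(splits)):
        # The second split has no predecessor output to add.
        inp = splits[i] if i == 1 else add(splits[i], outputs[i - 1])
        outputs.append(convs[i - 1](inp))
    return outputs


def conv3x3_rf(rf_sizes):
    """Assumed stride-1 3x3 convolution: every receptive field grows by 2."""
    return frozenset(r + 2 for r in rf_sizes)


def receptive_fields(scale):
    """Receptive-field sizes reachable in each output split, relative to the split input."""
    splits = [frozenset({1})] * scale
    convs = [conv3x3_rf] * (scale - 1)
    return hierarchical_outputs(splits, convs, lambda a, b: a | b)


if __name__ == "__main__":
    for i, sizes in enumerate(receptive_fields(4), start=1):
        print(f"y_{i}:", sorted(sizes))
```

For a scale of 4, the concatenated output in this toy model mixes receptive fields of size 1, 3, 5, and 7, which is the combination effect described in the text.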
In the Res2Net module, splits are processed in a multi-scale fashion, which is conducive to the extraction of both global and local information. To better fuse information at different scales, we concatenate all splits and pass them through a $1\times1$ convolution. The split and concatenation strategy can enforce convolutions to process features more effectively. To reduce the number of parameters, we omit the convolution for the first split, which can also be regarded as a form of feature reuse.
In this work, we use $s$ as a control parameter of the scale dimension. Larger $s$ potentially allows features with richer receptive field sizes to be learnt, with negligible computational/memory overheads introduced by concatenation.
3.2 Integration with Modern Modules
Numerous neural network modules have been proposed in recent years, including cardinality dimension introduced by Xie et al.[xie2017aggregated], as well as squeeze and excitation (SE) block presented by Hu et al.[hu2018senet]. The proposed Res2Net module introduces the scale dimension that is orthogonal to these improvements. As shown in Fig. 3, we can easily integrate the cardinality dimension [xie2017aggregated] and the SE block [hu2018senet] with the proposed Res2Net module.
3.2.1 Dimension cardinality.
The dimension cardinality indicates the number of groups within a filter [xie2017aggregated]. This dimension changes filters from single-branch to multi-branch and improves the representation ability of a CNN model. In our design, we can replace the $3\times3$ convolution with the $3\times3$ group convolution, where $c$ indicates the number of groups. Experimental comparisons between the scale dimension and cardinality are presented in Sec. 4.2 and Sec. 4.4.
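As a rough illustration of what a group convolution changes, the sketch below counts the weights of a convolution whose input channels are divided into groups, each connected only to its own share of the output channels. The function name and the example sizes (64 channels, 8 groups) are assumptions for illustration; biases are ignored.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Weights of a (grouped) convolution; each group sees in_channels / groups inputs."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


if __name__ == "__main__":
    print("standard 3x3:", conv_weight_count(64, 64, 3))
    print("grouped 3x3 :", conv_weight_count(64, 64, 3, groups=8))
```

With the same channel count, using more groups reduces the weight count proportionally, which is why cardinality and width are usually adjusted together when complexity is held fixed.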
3.2.2 SE block.
An SE block adaptively re-calibrates channel-wise feature responses by explicitly modelling inter-dependencies among channels [hu2018senet]. Similar to [hu2018senet], we add the SE block right before the residual connections of the Res2Net module. Our Res2Net module can benefit from the integration of the SE block, which we have experimentally demonstrated in Sec. 4.2 and Sec. 4.3.
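The channel re-calibration performed by an SE block can be sketched in plain Python: global average pooling per channel (squeeze), two fully connected layers with a ReLU and a sigmoid (excitation), and a per-channel rescaling. This is a simplified illustration with biases omitted; the weight matrices `w1` and `w2` and the nested-list feature layout are hypothetical, and no trained values are implied.

```python
import math


def se_recalibrate(feature, w1, w2):
    """Rescale each channel of `feature` by a learned gate in (0, 1).

    feature: list of C channels, each an H x W nested list of floats
    w1:      (C / r) x C weights of the first fully connected layer
    w2:      C x (C / r) weights of the second fully connected layer
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value of a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feature, gates)]


if __name__ == "__main__":
    feature = [[[2.0, 4.0]], [[6.0, 8.0]]]  # two channels of size 1 x 2
    w1 = [[1.0, 1.0]]                       # reduce two channels to one hidden unit
    w2 = [[0.0], [0.0]]                     # zero weights give a gate of 0.5
    print(se_recalibrate(feature, w1, w2))
```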
3.3 Integrated Models
Since the proposed Res2Net module does not have specific requirements of the overall network structure and the multi-scale representation ability of the Res2Net module is orthogonal to the layer-wise feature aggregation models of CNNs, we can easily integrate the proposed Res2Net module into the state-of-the-art models, such as ResNet [he2016deep], ResNeXt [xie2017aggregated], DLA [yu2018deep] and Big-Little Net [chen2018biglittle]. The corresponding models are referred to as Res2Net, Res2NeXt, Res2Net-DLA, and bLRes2Net-50, respectively.
The proposed scale dimension is orthogonal to the cardinality [xie2017aggregated] dimension and width [he2016deep] dimension of prior work. Thus, after the scale is set, we adjust the value of cardinality and width to maintain the overall model complexity similar to its counterparts. We do not focus on reducing the model size in this work since it requires more meticulous designs such as depth-wise separable convolution [Ma_2018_ECCV], model pruning [han2015learning], and model compression [cheng2017survey].
For experiments on the ImageNet [russakovsky2015imagenet] dataset, we mainly use the ResNet-50 [he2016deep], ResNeXt-50 [xie2017aggregated], DLA-60 [yu2018deep], and bLResNet-50 [chen2018biglittle] as our baseline models. For 50-layer networks, the complexity of the proposed models is approximately equal to that of the baseline models, in terms of both the number of parameters and the number of FLOPs for an image of $224\times224$ pixels. For experiments on the CIFAR [krizhevsky2009learning] dataset, we use the ResNeXt-29, 8c64w [xie2017aggregated] as our baseline model. Empirical evaluations and discussions of the proposed models with respect to model complexity are presented in Sec. 4.4.
4.1 Implementation Details
We implement the proposed models using the PyTorch framework. For fair comparisons, we use the PyTorch implementation of ResNet [he2016deep], ResNeXt [xie2017aggregated], DLA [yu2018deep] as well as bLResNet-50 [chen2018biglittle], and only replace the original bottleneck block with the proposed Res2Net module. Similar to prior work, on the ImageNet dataset [russakovsky2015imagenet], each image is of $224\times224$ pixels randomly cropped from a re-sized image. We use the same data augmentation strategy as [he2016deep, szegedy2016rethinking]. Similar to [he2016deep], we train the network using SGD with weight decay 0.0001, momentum 0.9, and a mini-batch of 256 on 4 Titan Xp GPUs. The learning rate is initially set to 0.1 and divided by 10 every 30 epochs.
All ImageNet models, including the baseline and proposed models, are trained for 100 epochs with the same training and data augmentation strategy. For testing, we use the same image cropping method as [he2016deep]. On the CIFAR dataset, we use the implementation of ResNeXt-29 [xie2017aggregated]. For all tasks, we use the original implementations of baselines and only replace the backbone model with the proposed Res2Net.
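The step schedule described above (start at 0.1, divide by 10 every 30 epochs) can be written as a small function. This is a sketch of the stated rule only; the function name and keyword arguments are illustrative and are not taken from the authors' training code.

```python
def step_learning_rate(epoch, base_lr=0.1, step=30, factor=10.0):
    """Learning rate at a zero-based epoch under a step-decay schedule."""
    return base_lr / (factor ** (epoch // step))


if __name__ == "__main__":
    for epoch in (0, 29, 30, 60, 90):
        print(epoch, step_learning_rate(epoch))
```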
Table I: Top-1 and top-5 test error (%) on the ImageNet dataset.
We conduct experiments on the ImageNet dataset [russakovsky2015imagenet], which contains 1.28 million training images and 50k validation images from 1000 classes. We construct the models with approximate 50 layers for performance evaluation against the state-of-the-art methods. More ablation studies are conducted on the CIFAR dataset.
4.2.1 Performance gain.
Table I shows the top-1 and top-5 test error on the ImageNet dataset. For simplicity, all Res2Net models in Table I have the scale $s = 4$. The Res2Net-50 has an improvement of 1.84% on top-1 error over the ResNet-50. The Res2NeXt-50 achieves a 0.85% improvement in terms of top-1 error over the ResNeXt-50. Also, the Res2Net-DLA-60 outperforms the DLA-60 by 1.27% in terms of top-1 error. The Res2NeXt-DLA-60 outperforms the DLA-X-60 by 0.64% in terms of top-1 error. The SE-Res2Net-50 has an improvement of 1.68% over the SENet-50. The bLRes2Net-50 has an improvement of 0.73% in terms of top-1 error over the bLResNet-50. The Res2Net module further enhances the multi-scale ability of bLResNet at a granular level, even though bLResNet is designed to utilize features with different scales as discussed in Sec. 2.3. Note that the ResNet [he2016deep], ResNeXt [xie2017aggregated], SE-Net [hu2018senet], bLResNet [chen2018biglittle], and DLA [yu2018deep] are the state-of-the-art CNN models. Compared with these strong baselines, models integrated with the Res2Net module still have consistent performance gains.
We also compare our method against the InceptionV3 [szegedy2016rethinking] model, which utilizes parallel filters with different kernel combinations. For fair comparisons, we use the ResNet-50 [he2016deep] as the baseline model and train our model with the input image size of $299\times299$ pixels, as used in the InceptionV3 model. The proposed Res2Net-50-299 outperforms InceptionV3 by 1.14% on top-1 error. We conclude that the hierarchical residual-like connection of the Res2Net module is more effective than the parallel filters of InceptionV3 when processing multi-scale information. While the combination pattern of filters in InceptionV3 is carefully hand-designed, the Res2Net module presents a simple but effective combination pattern.
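For readers unfamiliar with the metric, top-1 and top-5 error can be computed as the fraction of samples whose true label is not among the k highest-scoring classes. The sketch below uses hypothetical toy scores with three classes and is not tied to any particular model or to the reported numbers.

```python
def topk_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


if __name__ == "__main__":
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]  # toy class scores
    labels = [1, 1, 0]
    print("top-1 error:", topk_error(scores, labels, 1))
    print("top-2 error:", topk_error(scores, labels, 2))
```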
4.2.2 Going deeper with Res2Net.
Deeper networks have been shown to have stronger representation capability [he2016deep, xie2017aggregated] for vision tasks. To validate our model with greater depth, we compare the classification performance of the Res2Net and the ResNet, both with 101 layers. As shown in Table II, the Res2Net-101 achieves a significant performance gain of 1.82% over the ResNet-101 in terms of top-1 error. Note that the Res2Net-50 has a performance gain of 1.84% in terms of top-1 error over the ResNet-50. These results show that the proposed module with the additional scale dimension can be integrated with deeper models to achieve better performance. We also compare our method with the DenseNet [huang2017densely]. Compared with the DenseNet-161, the best performing model of the officially provided DenseNet family, the Res2Net-101 has an improvement of 1.54% in terms of top-1 error.
Table II: Top-1 and top-5 test error (%) of deeper networks on the ImageNet dataset.
4.2.3 Effectiveness of scale dimension.
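Table III: Top-1 and top-5 test error (%) of Res2Net-50 with different scales on the ImageNet dataset. Columns: Setting, FLOPs, Runtime, top-1 err., top-5 err.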
To validate our proposed scale dimension, we experimentally analyze the effect of different scales. As shown in Table III, the performance increases with the increase of scale. With the increase of scale, the Res2Net-50 with 14w8s achieves a performance gain of 1.99% over the ResNet-50 in terms of top-1 error. Note that with the preserved complexity, the width of each filter group decreases with the increase of scale. We further evaluate the performance gain of increasing scale with increased model complexity. The Res2Net-50 with 26w8s achieves a significant performance gain of 3.05% over the ResNet-50 in terms of top-1 error. A Res2Net-50 with 18w4s also outperforms the ResNet-50 by 0.93% in terms of top-1 error with only 69% of the FLOPs. Table III also reports the runtime under different scales, which is the average time to infer the ImageNet validation set with images of size $224\times224$. Although the feature splits need to be computed sequentially due to hierarchical connections, the extra run-time introduced by the Res2Net module can often be ignored. Since the number of available tensors in a GPU is limited, there are typically sufficient parallel computations within a single GPU clock period for the typical setting of Res2Net, i.e., $s = 4$.
We also conduct some experiments on the CIFAR-100 dataset [krizhevsky2009learning], which contains 50k training images and 10k testing images from 100 classes. The ResNeXt-29, 8c64w [xie2017aggregated] is used as the baseline model. We only replace the original basic block with our proposed Res2Net module while keeping other configurations unchanged. Table IV shows the top-1 test error and model size on the CIFAR-100 dataset. Experimental results show that our method surpasses the baseline and other methods with fewer parameters. Our proposed Res2NeXt-29, 6c24w6s outperforms the baseline by 1.11%. Res2NeXt-29, 6c24w4s even outperforms the ResNeXt-29, 16c64w with only 35% of the parameters. We also achieve better performance with fewer parameters, compared with DenseNet-BC ($k = 40$). Compared with Res2NeXt-29, 6c24w4s, Res2NeXt-29, 8c25w4s achieves a better result with more width and cardinality, indicating that the scale dimension is orthogonal to the width and cardinality dimensions. We also integrate the recently proposed SE block into our structure. With fewer parameters, our method still outperforms the ResNeXt-29, 8c64w-SE baseline.
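To see why width must shrink as scale grows under a fixed budget, one can count the convolution weights of a single block. The sketch below is a back-of-the-envelope estimate, not the authors' complexity measurement: it reads a setting such as 26w4s as width 26 and scale 4, assumes a block with 256 input and output channels (as in the first stage of a standard 50-layer bottleneck network with 64 middle channels), and ignores biases, normalization layers, and all other parts of the network. The function names are hypothetical.

```python
def bottleneck_weights(in_channels, mid_channels):
    """1x1 reduce + one 3x3 + 1x1 expand, no biases."""
    return (in_channels * mid_channels
            + 9 * mid_channels * mid_channels
            + mid_channels * in_channels)


def res2net_weights(in_channels, width, scale):
    """1x1 reduce to width * scale channels, (scale - 1) 3x3 convs of `width`
    channels each, then 1x1 expand. Intended for scale >= 2."""
    mid_channels = width * scale
    return (in_channels * mid_channels
            + (scale - 1) * 9 * width * width
            + mid_channels * in_channels)


if __name__ == "__main__":
    in_channels = 256  # assumed block input/output channel count
    baseline = bottleneck_weights(in_channels, 64)
    print("bottleneck, 64 middle channels:", baseline)
    for width, scale in [(48, 2), (26, 4), (14, 8), (18, 4), (26, 8)]:
        count = res2net_weights(in_channels, width, scale)
        print(f"{width}w{scale}s: {count} weights ({count / baseline:.2f}x baseline)")
```

Under these assumptions, 48w2s, 26w4s, and 14w8s all stay within about 3% of the plain bottleneck's weight count, while 18w4s is clearly smaller and 26w8s clearly larger, matching the qualitative grouping of settings discussed in the text.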
Table IV: Top-1 test error (%) and model size on the CIFAR-100 dataset.
|Method||Params||top-1 err. (%)|
|Wide ResNet [Zagoruyko2016WRN]||36.5M||20.50|
|ResNeXt-29, 8c64w [xie2017aggregated] (base)||34.4M||17.90|
|ResNeXt-29, 16c64w [xie2017aggregated]||68.1M||17.31|
|DenseNet-BC (k = 40) [huang2017densely]||25.6M||17.18|
|ResNeXt-29, 8c64w-SE [hu2018senet]||35.1M||16.77|
Fig. 4: Class activation mapping visualizations comparing ResNet and Res2Net. Panels: Baseball, Penguin, Ice cream, Bulbul, Mountain dog, Ballpoint, Mosque.
4.4 Scale Variation
Similar to Xie et al.[xie2017aggregated], we evaluate the test performance of the baseline model by increasing different CNN dimensions, including scale (Equation (1)), cardinality [xie2017aggregated], and depth [simonyan2014very]. While increasing model capacity using one dimension, we fix all other dimensions. A series of networks are trained and evaluated under these changes. Since [xie2017aggregated] has already shown that increasing cardinality is more effective than increasing width, we only compare the proposed dimension scale with cardinality and depth.
Fig. 5 shows the test precision on the CIFAR-100 dataset with regard to the model size. The depth, cardinality, and scale of the baseline model are 29, 6, and 1, respectively. Experimental results suggest that scale is an effective dimension to improve model performance, which is consistent with what we have observed on the ImageNet dataset in Sec. 4.2. Moreover, increasing scale is more effective than other dimensions, resulting in quicker performance gains. As described in Equation (1) and Fig. 2, for the case of scale $s = 2$, we only increase the model capacity by adding more parameters of $1\times1$ filters. Thus, the model performance of $s = 2$ is slightly worse than that of increasing cardinality. For $s = 3, 4$, the combination effects of our hierarchical residual-like structure produce a rich set of equivalent scales, resulting in significant performance gains. However, the models with scale 5 and 6 have limited performance gains; we assume this is because the images in the CIFAR dataset are too small ($32\times32$) to contain many scales.
4.5 Class Activation Mapping
To understand the multi-scale ability of the Res2Net, we visualize the class activation mapping (CAM) using Grad-CAM [selvaraju2017grad], which is commonly used to localize the discriminative regions for image classification. In the visualization examples shown in Fig. 4, stronger CAM areas are covered with lighter colors. Compared with ResNet, the Res2Net based CAM results have more concentrated activation maps on small objects such as ‘baseball’ and ‘penguin’. Both methods have similar activation maps on the middle size objects, such as ‘ice cream’. Due to stronger multi-scale ability, the Res2Net has activation maps that tend to cover the whole object on big objects such as ‘bulbul’, ‘mountain dog’, ‘ballpoint’, and ‘mosque’, while activation maps of ResNet only cover parts of objects. Such ability of precisely localizing CAM region makes the Res2Net potentially valuable for object region mining in weakly supervised semantic segmentation tasks [AdversErasingCVPR2017].
4.6 Object Detection
For the object detection task, we validate the Res2Net on the PASCAL VOC07 [everingham2010pascal] and MS COCO [lin2014microsoft] datasets, using Faster R-CNN [ren2015faster] as the baseline method. We use the backbone network of ResNet-50 vs. Res2Net-50, and follow all other implementation details of [ren2015faster] for a fair comparison. Table V shows the object detection results. On the PASCAL VOC07 dataset, the Res2Net-50 based model outperforms its counterpart on average precision (AP). On the COCO dataset, the Res2Net-50 based model outperforms its counterpart on both AP and AP@IoU=0.5.
We further test the AP and average recall (AR) scores for objects of different sizes as shown in Table VI. Objects are divided into three categories based on the size, according to [lin2014microsoft]. The Res2Net based model improves over its counterpart by a large margin on AP for small, medium, and large objects. AR likewise improves for small, medium, and large objects. Due to the strong multi-scale ability, the Res2Net based models can cover a large range of receptive fields, boosting the performance on objects of different sizes.
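The AP@IoU=0.5 metric depends on the intersection-over-union (IoU) between a detected box and a ground-truth box. The sketch below shows only that building block, for axis-aligned boxes given as (x1, y1, x2, y2). The helper names are illustrative, and treating a detection as correct when its IoU is at least the threshold is a common convention assumed here; the full AP computation (matching, ranking by confidence, precision-recall integration) is not reproduced.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(detection, ground_truth, threshold=0.5):
    """Assumed convention: a detection matches when IoU >= threshold."""
    return box_iou(detection, ground_truth) >= threshold


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(is_true_positive((0, 0, 2, 2), (0, 0, 2, 2)))
```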
4.7 Semantic Segmentation
Table VII: Semantic segmentation results on the PASCAL VOC12 val set. Columns: Backbone, Setting, Mean IoU (%).
Semantic segmentation requires a strong multi-scale ability of CNNs to extract essential contextual information of objects. We thus evaluate the multi-scale ability of Res2Net on the semantic segmentation task using the PASCAL VOC12 dataset [everingham2015pascal]. We follow the previous work to use the augmented PASCAL VOC12 dataset [hariharan2011semantic], which contains 10582 training images and 1449 val images. We use the Deeplab v3+ [Chen_2018_ECCV] as our segmentation method. All implementations remain the same as in Deeplab v3+ [Chen_2018_ECCV] except that the backbone network is replaced with ResNet and our proposed Res2Net. The output strides used in training and evaluation are both 16. As shown in Table VII, the Res2Net-50 based method outperforms its counterpart by 1.5% on mean IoU, and the Res2Net-101 based method outperforms its counterpart by 1.2% on mean IoU. Visual comparisons of semantic segmentation results on challenging examples are illustrated in Fig. 6. The Res2Net based method tends to segment all parts of objects regardless of object size.
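Mean IoU averages, over classes, the ratio between correctly labelled pixels and the union of predicted and ground-truth pixels for that class. The sketch below works on flat lists of per-pixel class indices and skips classes that appear in neither list. It is a simplified illustration with a hypothetical function name; benchmark implementations typically accumulate these counts over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1]    # toy per-pixel predictions
    target = [0, 1, 1, 1]  # toy per-pixel ground truth
    print(mean_iou(pred, target, num_classes=2))
```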
4.8 Instance Segmentation
Instance segmentation is the combination of object detection and semantic segmentation. It requires not only the correct detection of objects with various sizes in an image but also the precise segmentation of each object. As mentioned in Sec. 4.6 and Sec. 4.7, both object detection and semantic segmentation require a strong multi-scale ability of CNNs. Thus, the multi-scale representation is quite beneficial to instance segmentation. We use the Mask R-CNN [he2017mask] as the instance segmentation method, and replace the backbone network of ResNet-50 with our proposed Res2Net-50. The performance of instance segmentation on the MS COCO [lin2014microsoft] dataset is shown in Table VIII. The Res2Net-26w4s based method outperforms its counterpart by 1.7% on $AP$ and 2.4% on $AP_{50}$. The performance gains on objects with different sizes are also demonstrated. The improvements of $AP$ for small, medium, and large objects are 0.9%, 1.9%, and 2.8%, respectively. Table VIII also shows the performance comparisons of Res2Net under the same complexity with different scales. The performance shows an overall upward trend with the increase of scale. Note that compared with the Res2Net-50-48w2s, the Res2Net-50-26w4s has an improvement of 2.8% on $AP_L$, while the Res2Net-50-48w2s has the same $AP_L$ as the ResNet-50. We assume that the performance gain on large objects benefits from the extra scales. When the scale is relatively larger, the performance gain is not obvious. The Res2Net module is capable of learning a suitable range of receptive fields. The performance gain is limited when the scale of objects in the image is already covered by the available receptive fields in the Res2Net module. With fixed complexity, the increased scale results in fewer channels for each receptive field, which may reduce the ability to process features of a particular scale.
4.9 Salient Object Detection
Pixel-level tasks such as salient object detection also require a strong multi-scale ability of CNNs to locate both holistic objects and their region details. Here we use the latest method, DSS [hou2017deeply], as our baseline. For a fair comparison, we only replace the backbone with ResNet-50 and our proposed Res2Net-50, while keeping other configurations unchanged. Following [hou2017deeply], we train the two models on the MSRA-B dataset [liu2011learning], and evaluate them on the ECSSD [yan2013hierarchical], PASCAL-S [li2014secrets], HKU-IS [li2015visual], and DUT-OMRON [yang2013saliency] datasets. The F-measure and Mean Absolute Error (MAE) are used for evaluation. As shown in Table IX, the Res2Net based model consistently improves over its counterpart on all datasets. On the DUT-OMRON dataset (containing 5168 images), the Res2Net based model improves both F-measure and MAE compared with the ResNet based model. The Res2Net based approach achieves its greatest performance gain on the DUT-OMRON dataset, since this dataset contains the most significant object size variation among the four datasets. Some visual comparisons of salient object detection results on challenging examples are illustrated in Fig. 7.
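For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for a single saliency map. It is an illustration, not the evaluation protocol of the paper: the function names, the fixed binarization threshold of 0.5, and the weighting `beta_sq=0.3` (a value widely used in the salient object detection literature) are assumptions. Published results often report the maximum F-measure over all thresholds instead of a single fixed one.

```python
def mae(saliency, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    return sum(abs(s - g) for s, g in zip(saliency, gt)) / len(gt)


def f_measure(saliency, gt, threshold=0.5, beta_sq=0.3):
    """Weighted F-measure after binarizing the saliency map at `threshold`.

    Both `threshold` and `beta_sq` are assumed values for illustration.
    """
    tp = sum(1 for s, g in zip(saliency, gt) if s >= threshold and g == 1)
    pred_pos = sum(1 for s in saliency if s >= threshold)
    gt_pos = sum(1 for g in gt if g == 1)
    if tp == 0:  # no correct salient pixels: precision and recall are zero
        return 0.0
    precision = tp / pred_pos
    recall = tp / gt_pos
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Flattened toy saliency map and ground-truth mask with four pixels.
    saliency = [0.9, 0.8, 0.2, 0.6]
    gt = [1, 0, 0, 1]
    print("MAE:", mae(saliency, gt))
    print("F-measure:", f_measure(saliency, gt))
```

A higher F-measure and a lower MAE both indicate better saliency maps, so an improvement on MAE means a reduction in its value.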
4.10 Key-points Estimation
Human body parts are of different sizes, which requires the key-points estimation method to locate human key-points at different scales. To verify whether the multi-scale representation ability of Res2Net can benefit the task of key-points estimation, we use SimpleBaseline [Xiao_2018_ECCV] as the key-points estimation method and only replace the backbone with the proposed Res2Net. All implementation details, including the training and testing strategies, remain the same as in SimpleBaseline [Xiao_2018_ECCV]. We train the model using the COCO key-point detection dataset [lin2014microsoft], and evaluate the model on the COCO validation set. Following common settings, we use the same person detectors as SimpleBaseline [Xiao_2018_ECCV] for evaluation. Table X shows the performance of key-points estimation on the COCO validation set using Res2Net. The Res2Net-50 and Res2Net-101 based models outperform the baselines on $AP$ by 3.3 and 3.0, respectively. Res2Net based models also show considerable performance gains over the baselines on humans at different scales.
5 Conclusion and Future Work
We present a simple yet efficient block, namely Res2Net, to further explore the multi-scale ability of CNNs at a more granular level. The Res2Net exposes a new dimension, namely “scale”, which is an essential and more effective factor in addition to the existing dimensions of depth, width, and cardinality. Our Res2Net module can be integrated into existing state-of-the-art methods with little effort. Image classification results on the CIFAR-100 and ImageNet benchmarks suggest that our new backbone network consistently performs favourably against its state-of-the-art competitors, including ResNet, ResNeXt, and DLA.
Although the superiority of the proposed backbone model has been demonstrated in the context of several representative computer vision tasks, including class activation mapping, object detection, and salient object detection, we believe multi-scale representation is essential for a much wider range of application areas. To encourage future works to leverage the strong multi-scale ability of the Res2Net, the source code is available on https://mmcheng.net/res2net/.
This research was supported by NSFC (NO. 61620106008, 61572264), the national youth talent support program, and Tianjin Natural Science Foundation (17JCJQJC43700, 18ZXZNGX00110).
Shang-Hua Gao is a master's student in the Media Computing Lab at Nankai University. He is supervised by Prof. Ming-Ming Cheng. His research interests include computer vision, machine learning, and radio vortex wireless communications.
Ming-Ming Cheng received his PhD degree from Tsinghua University in 2012, and then worked with Prof. Philip Torr in Oxford for 2 years. He is now a professor at Nankai University, leading the Media Computing Lab. His research interests include computer vision and computer graphics. He has received awards including the ACM China Rising Star Award and the IBM Global SUR Award. He is a senior member of the IEEE and on the editorial board of IEEE TIP.
Kai Zhao is currently a Ph.D. candidate with the College of Computer Science, Nankai University, under the supervision of Prof. Ming-Ming Cheng. His research interests mainly focus on statistical learning and computer vision.
Xin-Yu Zhang is an undergraduate student in the School of Mathematical Sciences at Nankai University. His research interests include computer vision and deep learning.
Ming-Hsuan Yang is a professor in Electrical Engineering and Computer Science at University of California, Merced. He received the PhD degree in Computer Science from the University of Illinois at Urbana-Champaign in 2000. Yang has served as an associate editor of the IEEE TPAMI, IJCV, CVIU, etc. He received the NSF CAREER award in 2012 and the Google Faculty Award in 2009.
Philip Torr received the PhD degree from Oxford University. After working for another three years at Oxford, he worked for six years for Microsoft Research, first in Redmond, then in Cambridge, founding the vision side of the Machine Learning and Perception Group. He is now a professor at Oxford University. He has won awards from top vision conferences, including ICCV, CVPR, ECCV, NIPS and BMVC. He is a senior member of the IEEE and a Royal Society Wolfson Research Merit Award holder.