Global Weighted Average Pooling Bridges Pixel-level Localization and Image-level Classification


In this work, we first tackle the problem of simultaneous pixel-level localization and image-level classification using only image-level labels to train a fully convolutional network. We investigate the global pooling method, which plays a vital role in this task. Classical global max pooling and global average pooling struggle to indicate the precise regions of objects. We therefore revisit the global weighted average pooling (GWAP) method for this task and propose a class-agnostic GWAP module and a class-specific GWAP module. We evaluate classification and pixel-level localization ability on the ILSVRC benchmark dataset. Experimental results show that the proposed GWAP module captures the regions of foreground objects better. We further explore knowledge transfer between the image classification task and the region-based object detection task, and propose a multi-task framework that combines our class-specific GWAP module with R-FCN. The framework is trained with a few ground truth bounding boxes and large-scale image-level labels. We evaluate this framework on the PASCAL VOC dataset. Experimental results show that this framework can use data with only image-level labels to improve the generalization of the object detection model.

I Introduction

Over the last few years, supervised convolutional neural network (CNN) methods have been extremely successful for whole-image classification [1], region-based object detection [2] and semantic image segmentation [3]. Many works focus on carefully designing models to improve the performance of these tasks independently. However, the drawback of such separate supervised learning systems is data hunger. In the fully supervised setting, the performance of a learned model depends heavily on the amount of training data. To achieve satisfactory generalization performance, a large number of precise training pairs, where each sample is associated with a label or target, are often required. Annotating training samples is often time-consuming and costly. Especially for region-based object detection and semantic image segmentation, it is tedious to specify exactly where the objects are and to annotate their precise contours. In this paper, we explore methods to reduce annotation costs for object localization and detection.

One solution for reducing annotation costs is weakly supervised learning (WSL). In the WSL setting, only image-level labels indicating the presence or absence of objects are required. WSL leaves out the effort of annotating the specific position of an object and thus reduces the costs. WSL is a valuable setup for many practical applications (e.g. object localization/detection) since weak supervision (i.e. image-level labels) is easier to obtain than full supervision (i.e. bounding boxes). WSL is also feasible because of the inherent correlation between whole-image classification and object localization/detection. For example, an image is annotated as "bird" because there is a bird in it. Even though it is not known where the bird is, information from the region of the bird is more discriminative for classifying the whole image than the uninformative background. Various WSL methods [4, 5, 6] demonstrate the potential of auto-localization to reduce the annotation effort.

One branch of WSL for object detection is the region-based methods [7, 8, 9, 10]. These methods first use object proposal methods (e.g. selective search [11]) to generate candidate detections, then pick the ones that contribute most to the image classification task as the confident detection results. The CNN models for these methods usually contain a region-based pooling module that generates a feature vector for each region, and then classify these feature vectors. The other branch of WSL for object localization is the pixel-level localization methods [5, 6]. Different from the former, the CNN models for these methods do not use object proposals as candidates, but make a prediction at each position on the feature map. These methods are built upon the fully convolutional architecture (see footnote 1), which is also an important class of network structures in CNNs. In this paper, we mainly explore the second branch.

Fig. 1: Illustration of the weakly supervised learning framework for simultaneous pixel-level localization and image-level classification with fully convolutional networks.

There are two classical WSL methods for object localization, namely global max pooling (GMP) [5] and global average pooling (GAP) [6]. They both solve the WSL problem by a multiple instance learning (MIL) process, where pixel-level predictions are grouped into the image-level prediction and a label is attached to the whole image rather than to every pixel-level prediction. In such frameworks, GMP and GAP correspond to different grouping strategies. The two strategies show great potential for the WSL task, but both of them are hardwired. As illustrated in Fig. 2, GMP only selects the largest value as the final result. This matches the assumption that at least one candidate prediction is the true prediction, but it discards the other information that is useful for accurate localization. GAP is the opposite. It aggregates all input values equally as the final output, which favors salient inputs but cannot directly indicate which inputs are the true predictions. Thus, both GMP and GAP can only obtain approximate pixel-level localization results. To this end, we revisit the global weighted average pooling (GWAP) operation for the WSL localization task.

(a) GMP
(b) GAP
(c) GWAP
Fig. 2: Illustration of the global pooling methods: (a) global max pooling (GMP), (b) global average pooling (GAP), (c) global weighted average pooling (GWAP).

GWAP computes the weighted mean of the input values with their corresponding weights. These weights are non-negative and indicate the relative importance of the input values. In GWAP, values with a higher weight contribute more to the final result than values with a lower weight. For WSL of object localization, the weights in GWAP naturally provide the pixel-level localization, because the regions of objects are usually more discriminative than the background and thus receive higher weights. This property provides a way to specify the precise regions of the objects.
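
The three grouping strategies can be compared on a toy example. The following is a minimal sketch in plain Python, not the implementation used in our experiments; the flattened score map and the three weight vectors are hypothetical values chosen for illustration. It shows that GWAP reduces to GAP with uniform weights and to GMP with a one-hot weight on the largest score.

```python
def gmp(scores):
    """Global max pooling: keep only the largest score."""
    return max(scores)


def gap(scores):
    """Global average pooling: every location counts equally."""
    return sum(scores) / len(scores)


def gwap(scores, weights):
    """Global weighted average pooling with non-negative weights."""
    total = sum(weights)
    if any(w < 0 for w in weights) or total <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    return sum(w * s for w, s in zip(weights, scores)) / total


# Hypothetical flattened 2x3 score map: the object covers the first three cells.
scores = [0.9, 0.8, 0.7, 0.1, 0.0, 0.2]

uniform = [1.0] * len(scores)                                   # GWAP reduces to GAP
one_hot = [1.0 if s == max(scores) else 0.0 for s in scores]    # GWAP reduces to GMP
object_mask = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]                    # weights on the object only

print("GMP:", gmp(scores), "GAP:", gap(scores))
print("GWAP uniform:", gwap(scores, uniform))
print("GWAP one-hot:", gwap(scores, one_hot))
print("GWAP object mask:", gwap(scores, object_mask))
```

With weights concentrated on the object cells, the pooled value reflects the object region only, and the weights themselves mark where the object is.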

To implement this idea, we propose the pixel-level prediction module to generate weights for GWAP. Fig. 1 illustrates a basic framework for WSL with GWAP. Specifically, the pixel-level prediction module consists of a prediction unit and a normalization unit. In this work, we design both a class-agnostic prediction module and a class-specific prediction module. Different from GMP and GAP, our GWAP method is learnable. This learnable weighted average operation generalizes the max and average operations and also provides the capacity for more accurate localization. The proposed method is fully differentiable and easy to train by standard backpropagation. We empirically show that this pixel-level prediction module, trained only with image-level labels, can effectively highlight the foreground regions and suppress the background (Fig. 8). Based on this observation, we further explore the transferability of this framework. We combine the popular fully convolutional object detection framework R-FCN [12] with our method to obtain a novel semi and weakly supervised detection method. This forms a multi-task learning framework. When the number of training samples for the object detection task is small, WSL with GWAP can provide useful regularization to avoid overfitting and thus improve generalization. Experimental results on the ILSVRC [13] and PASCAL VOC [14] datasets show the effectiveness of the proposed methods.

II Related work

In this section, we briefly review the related work on weakly supervised object detection/localization and knowledge transfer for object detection with minimal supervision.

II-A Weakly supervised object detection

Weakly supervised object detection methods [15, 16, 17, 7, 8, 9, 10] evolved from the region-based object detection frameworks [18, 19, 20, 21]. They generate proposals as candidate detections and alternately learn to find the confident proposals. Early works are formulated with shallow models and mainly focus on solving the non-convex optimization problem by using good initialization strategies [16, 15], learning strategies [17, 9] and smoothed models [7, 8]. They do not benefit from deep representation learning, and their performance is still far from that of fully supervised methods. The first deep learning framework [10] for this task aggregates proposal classification scores and localization scores into image-level predictions. Based on this framework, a context-aware network [22], an expectation-maximization algorithm [23], an online instance classifier refinement process [24] and a collaborative learning method [25] have been proposed to improve the performance. These methods mainly focus on developing strategies to increase the number of proposals that contain complete objects. Our work differs from these methods in that we focus on the weakly supervised pixel-level localization problem. We do not use proposals for WSL, and our work builds upon fully convolutional networks.

II-B Weakly supervised object localization

Different from these weakly supervised object detection methods, weakly supervised object localization methods build upon fully convolutional networks and output coarse regions of objects instead of tight bounding boxes. Some works [26, 27] heuristically search for regions that have an important impact on the classification. Some works [28, 29] use visualization techniques to find the salient regions for classification. Oquab et al. [5] first unified the image-level classification and the pixel-level localization tasks into an end-to-end learning framework by using the global max pooling (GMP) method. Sun et al. [30] further extend the method into a multi-scale cascaded neural network and use the log-sum-exp (LSE) pooling method to improve localization accuracy. Zhou et al. [6] revisited the global average pooling (GAP) method and proposed class activation mapping for localizing discriminative regions. These methods have shown that the convolutional units of a CNN behave as meaningful pattern detectors, so the location information of target objects emerges in the activation maps after convolution. These works [5, 6, 30] can predict approximate locations of objects and output accurate image-level labels. Even though they yield promising results, the GMP and GAP operations are hardwired. LSE uses a hyperparameter to adjust the pooling scale but is still not flexible enough. Durand et al. [31, 32] proposed methods to select multiple high and low scoring regions instead of a single highest scoring region. However, their methods also need the selection hyperparameters to be tuned and are therefore also restricted. These works demonstrate that the regions of target objects have a larger effect on whole-image classification and inspire us to explore the localization ability of fully convolutional networks.
To obtain more precise localization results, we revisit the global weighted average pooling (GWAP) operation and propose the class-agnostic and class-specific GWAP modules for this task. Different from prior works, the proposed modules are learnable and flexible enough to capture the various appearances of objects. In addition, we are the first to explore the combination of the weakly supervised module and the region-based detection framework.

II-C Knowledge transfer for object detection with minimal supervision

Besides using image-level labels, transferring knowledge from an auxiliary dataset can also improve learning performance when supervised data for object detection is lacking. Hoffman et al. [33] use labeled data to train a transfer model between image classification and object detection, then use this transfer model for unlabeled data. Shi et al. [34] use a ranking model to enhance the selection of the detection results. Guillaumin et al. [35] transfer knowledge about the plausible location, appearance, and context of the target objects from similar classes. Chen et al. [36] design a flexible deep architecture, a background depression regularization and a transfer-knowledge regularization to alleviate transfer difficulties. Different from these works, our work does not use extra datasets. Instead, we transfer knowledge between the weakly supervised object localization task and the supervised object detection task.

III Proposed Method

III-A Global Weighted Average Pooling for WSL

Simultaneous pixel-level localization and image-level classification with only image-level labels is a weakly supervised learning (WSL) problem. In this subsection, we discuss how the global weighted average pooling method can be used for this task.

In this work, we build our method on the fully convolutional network (FCN). In convolutional neural networks, the convolution operation can be seen as a pattern detection process applied in a sliding window manner. Each activation after the convolution along the spatial dimension can be used to specify whether the pattern is located at this position or not. These activations reflect the pixel-level localizations of objects in an image and also indicate the importance of the corresponding features for recognizing the objects. The pixel-level information and the image-level recognition are correlated. For example, if the image-level result says that there is a cat, some pixel-level predictions must indicate the cat too. If the image-level result says that there is no cat, no pixel-level prediction should indicate a cat. Thus, the image-level recognition can be obtained by aggregating the pixel-level results.

In this paper, we use the global weighted average pooling (GWAP) method to bridge pixel-level localization and image-level classification in fully convolutional networks. GWAP can be used in two ways: aggregating features or aggregating scores. At the spatial location $(x, y)$, let $s_c(x, y)$ represent the prediction score of class $c$ in the classification convolutional layer, and let $f_k(x, y)$ represent the activation of unit $k$ in the last convolutional layer before the classification layer. For aggregating scores, the image-level prediction is

$$S_c = \sum_{x,y} w(x, y)\, s_c(x, y), \qquad (1)$$

where $w(x, y)$ is the weight at the spatial location $(x, y)$. For aggregating features, the image-level prediction is

$$S_c = \sum_{k} \theta_k^c \sum_{x,y} w(x, y)\, f_k(x, y), \qquad (2)$$

where $\theta_k^c$ is the classification weight corresponding to class $c$ for unit $k$. When $s_c(x, y) = \sum_k \theta_k^c f_k(x, y)$, Equ. (1) and Equ. (2) are equivalent.
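
The equivalence of the two aggregation orders can be checked numerically. The sketch below is illustrative only and assumes that the per-location class score is a bias-free linear function of the features at that location; the sizes `N`, `K`, `C` and the random values are hypothetical.

```python
import random

random.seed(0)
N, K, C = 6, 4, 3  # spatial locations, feature units, classes (hypothetical sizes)

# f[n][k]: activation of unit k at location n; theta[c][k]: classification weights.
f = [[random.gauss(0, 1) for _ in range(K)] for _ in range(N)]
theta = [[random.gauss(0, 1) for _ in range(K)] for _ in range(C)]

# Non-negative weights that sum to 1 over the locations.
raw = [random.random() for _ in range(N)]
w = [r / sum(raw) for r in raw]

# Per-location class scores, assumed linear in the features (no bias).
s = [[sum(theta[c][k] * f[n][k] for k in range(K)) for c in range(C)]
     for n in range(N)]

# Equ. (1): aggregate the per-location scores.
by_scores = [sum(w[n] * s[n][c] for n in range(N)) for c in range(C)]

# Equ. (2): aggregate the features first, then classify the pooled vector.
pooled = [sum(w[n] * f[n][k] for n in range(N)) for k in range(K)]
by_features = [sum(theta[c][k] * pooled[k] for k in range(K)) for c in range(C)]

max_gap = max(abs(a - b) for a, b in zip(by_scores, by_features))
print("largest difference between the two orders:", max_gap)
```

The two results agree up to floating-point rounding because both the classifier and the weighted average are linear, so the order of the two sums can be exchanged.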

In our task, the weight $w(x, y)$ at each spatial location $(x, y)$ should satisfy the following principles. First, the weights should be non-negative, and some may be zero, but not all of them. That is, $w(x, y) \ge 0$ for every location and $w(x, y) > 0$ for at least one location. Naturally, a higher weight means a larger contribution to the aggregated result than a lower one. Second, the weights should be normalized so that they sum to $1$. That is, $\sum_{x,y} w(x, y) = 1$. With these properties, GWAP acts as a gating function: the model passes scores/features with high weights and suppresses scores/features with low weights. Therefore, the weights should be able to indicate the object regions that are informative for the image-level classification.

In this paper, we design the pixel-level prediction module to generate the weights $w(x, y)$. Following the above analysis, the pixel-level prediction module contains a pixel-level classification unit and a spatial normalization unit. The pixel-level classification unit outputs predictions for the pixel-level locations. The spatial normalization unit normalizes these predictions for aggregation. In this work, we design both the class-agnostic (Section III-B) and the class-specific (Section III-C) pixel-level prediction modules.

III-B Class-agnostic Pixel-level Prediction

Fig. 3: The network architecture of the proposed class-agnostic GWAP method for image classification. ($\sigma$: Equ. 3, norm: Equ. 4, GWAP: Equ. 5, fc: fully connected layer)

In subsection III-A, we introduce the intuition of using global weighted average pooling for simultaneous pixel-level localization and image-level classification. In this section, we introduce the class-agnostic implementation for GWAP.

In the class-agnostic implementation, there is only one weight map for the classification. We use this shared weight map to aggregate features and then classify the aggregated feature vector. Fig. 3 illustrates the network architecture of the proposed method. Concretely, the FCN calculates the features of the input image. We obtain the feature maps $F$ from the last convolutional layer. A bypass network is added after $F$ to generate the weight map $w$. The weight map is computed by the following functions:

$$p(x, y) = \sigma\big(U^{\top} F(x, y) + b\big), \qquad (3)$$

$$w(x, y) = \frac{\exp\big(p(x, y)\big)}{\sum_{x', y'} \exp\big(p(x', y')\big)}, \qquad (4)$$

where $F \in \mathbb{R}^{K \times H \times W}$ and $w \in \mathbb{R}^{H \times W}$. Here $K$, $H$ and $W$ denote the number of channels, the height of the feature maps and the width of the feature maps, respectively, and $F(x, y) \in \mathbb{R}^{K}$ is the feature vector at location $(x, y)$. $U$ is the parameter matrix and $b$ is the corresponding bias. $\sigma$ is the sigmoid function. $\exp$ is the exponentiate operation. Once the weight map is computed, the aggregated feature vector of the whole image is computed by

$$\bar{F} = \sum_{x, y} w(x, y)\, F(x, y). \qquad (5)$$

Then, one fully connected layer is used to classify the aggregated feature vector, and the task-specific loss function is used to train the whole network. After training with this framework, the class-agnostic pixel-level prediction module can distinguish foreground objects from the background. The proposed network outputs accurate image-level labels and predicts precise regions of objects simultaneously in a single forward pass.
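
The forward pass of the class-agnostic module can be sketched in a few lines. This is a minimal plain-Python illustration, assuming the weight map is obtained by a linear bypass prediction followed by a sigmoid, an exponentiation and a normalization over all locations; the function name, the parameters `u` and `b`, and the toy feature values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_agnostic_gwap(features, u, b):
    """features: list of N per-location feature vectors (each of length K).
    u, b: parameters of a hypothetical 1x1 bypass layer with one output channel.
    Returns (weights, pooled_feature)."""
    # Prediction unit: bounded foreground score at every location.
    p = [sigmoid(sum(ui * fi for ui, fi in zip(u, f)) + b) for f in features]
    # Normalization unit: exponentiate, then normalize over the locations.
    e = [math.exp(pi) for pi in p]
    z = sum(e)
    weights = [ei / z for ei in e]
    # GWAP: weighted average of the feature vectors.
    dim = len(features[0])
    pooled = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
    return weights, pooled


# Toy 2x2 feature map flattened to 4 locations with 2 channels.
# Channel 0 fires on the object (first two cells), channel 1 on the background.
features = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.0, 0.9]]
u, b = [4.0, -4.0], 0.0

weights, pooled = class_agnostic_gwap(features, u, b)
print("weights:", [round(w, 3) for w in weights])
print("pooled feature:", [round(v, 3) for v in pooled])
```

In this toy case the object cells receive the larger weights, so the pooled feature is dominated by the object channel; a fully connected layer would then classify `pooled`.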

Comparison with the soft attention model: The soft attention model (SAM) was proposed by Bahdanau et al. [37] for neural machine translation and has been successfully used for computer vision tasks [38, 39, 40, 41, 42, 43, 44]. It searches for parts of the input that are relevant to the final prediction without any advance annotations. This is in the same spirit as our task. However, SAM did not work well for our task in our early experiments. The score function of SAM is just an exponentiate operation ($\exp$) followed by normalization, which is an approximation of the $\max$ operation. Thus, SAM also has difficulty obtaining precise object regions. Different from SAM, we use the sigmoid function to provide a non-linear bound for the prediction. This makes it easier to capture the various appearances of an object and thus gives a better localization property. We verify this experimentally in Section IV.
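
The difference between the two score functions can be illustrated on a toy example. The sketch below is not taken from our experiments; the raw predictions are hypothetical, and it assumes that SAM normalizes the exponentiated raw predictions while our module exponentiates the sigmoid of the same predictions before normalizing.

```python
import math


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bounded_weights(z):
    """Sigmoid-bounded scores, then exponentiate and normalize."""
    return softmax([sigmoid(v) for v in z])


# Hypothetical raw predictions: three object cells followed by three background cells.
logits = [6.0, 5.0, 4.0, -3.0, -4.0, -5.0]

sam = softmax(logits)           # soft attention style weights
ours = bounded_weights(logits)  # weights with the sigmoid bound

print("SAM:    ", [round(w, 3) for w in sam])
print("bounded:", [round(w, 3) for w in ours])
```

Under these assumptions the unbounded variant puts most of its mass on the single strongest cell, whereas the bounded variant gives the three object cells nearly equal weight. Because the sigmoid output lies in (0, 1), the ratio between any two bounded weights stays below $e$.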

III-C Class-specific Pixel-level Prediction

In some applications, we might need to know the class of a highlighted region in the pixel-level prediction module. Thus, we further propose the class-specific pixel-level prediction module. In Section III-D, we use this module for semi and weakly supervised detection.

Fig. 4: The network architecture of the proposed class-specific GWAP module for image classification. (bg: background class, $\odot$: element-wise product, norm: Equ. 8, $w_c$: class-specific weight map, GWAP: Equ. 9.)

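Fig. 4 illustrates the design of the proposed class-specific GWAP module. First, we add a convolution layer to obtain the class score map $s \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of object classes, and $H$ and $W$ are the height and the width of the score maps. We also add one more channel to encode backgrounds, $s_{bg} \in \mathbb{R}^{1 \times H \times W}$. We concatenate $s$ and $s_{bg}$ and obtain $\hat{s} \in \mathbb{R}^{(C+1) \times H \times W}$. To constrain every position on the map to belong to a unique class, we use the softmax operation, defined as follows:

$$P_c(x, y) = \frac{\exp\big(\hat{s}_c(x, y)\big)}{\sum_{c'=1}^{C+1} \exp\big(\hat{s}_{c'}(x, y)\big)}, \qquad (6)$$

where $c \in \{1, \dots, C+1\}$ and $(x, y)$ is a spatial location. After obtaining $P$, the final score map is

$$\tilde{s}_c(x, y) = \sigma\big(s_c(x, y)\big) \odot P_c(x, y), \quad c \in \{1, \dots, C\}, \qquad (7)$$

where $\sigma$ is the sigmoid function, and $\odot$ is the element-wise product. Then the class-specific weight maps are

$$w_c(x, y) = \frac{\exp\big(\tilde{s}_c(x, y)\big)}{\sum_{x', y'} \exp\big(\tilde{s}_c(x', y')\big)}. \qquad (8)$$

Finally, we obtain the aggregated score vector for the whole image. That is

$$S_c = \sum_{x, y} w_c(x, y)\, s_c(x, y). \qquad (9)$$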
The class-specific module outputs a localization for each class, as shown in Fig. III-D. The class-specific and the class-agnostic GWAP modules follow the same design ideas and thus have the same properties. Both are learnable and easy to train, and both can predict precise regions of objects.
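
The class-specific pipeline can be sketched as follows. This is only one possible reading of the module, written in plain Python under explicit assumptions: a softmax over the object classes plus background at each location, a sigmoid gate on the class scores, an exponential normalization over locations for each class, and score aggregation as in Equ. (1). The function name and the toy scores are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_specific_gwap(scores, bg):
    """scores: C lists of N per-location class scores; bg: N background scores.
    Returns (weights, pooled): per-class weight maps and image-level scores."""
    num_classes, num_loc = len(scores), len(bg)
    # Softmax over the object classes plus background at every location.
    probs = [[0.0] * num_loc for _ in range(num_classes)]
    for n in range(num_loc):
        column = [scores[c][n] for c in range(num_classes)] + [bg[n]]
        m = max(column)
        e = [math.exp(v - m) for v in column]
        total = sum(e)
        for c in range(num_classes):
            probs[c][n] = e[c] / total
    weights, pooled = [], []
    for c in range(num_classes):
        # Assumed final score map: sigmoid-gated score times class probability.
        final = [sigmoid(scores[c][n]) * probs[c][n] for n in range(num_loc)]
        # Assumed normalization: exponentiate and normalize over locations.
        e = [math.exp(v) for v in final]
        total = sum(e)
        w_c = [v / total for v in e]
        weights.append(w_c)
        # Aggregate the class scores with the class-specific weights.
        pooled.append(sum(w * s for w, s in zip(w_c, scores[c])))
    return weights, pooled


# Toy example: class 0 occupies the first two cells, class 1 the last two.
scores = [[4.0, 3.0, -2.0, -3.0],
          [-3.0, -2.0, 3.0, 4.0]]
bg = [0.0, 0.0, 0.0, 0.0]

weights, pooled = class_specific_gwap(scores, bg)
for c, w_c in enumerate(weights):
    print("class", c, "weights:", [round(w, 3) for w in w_c])
print("image-level scores:", [round(v, 3) for v in pooled])
```

Each class obtains its own weight map that peaks on the cells where that class is present, which is the property used for the class-specific localization.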

III-D Semi and Weakly Supervised Detection

In order to further explore whether the GWAP module can share features with the object detection task, we propose a multi-task learning framework. As illustrated in Fig. 5, this framework contains two branches. One is the standard object detection pipeline. In this work, we choose the R-FCN [12] model as the detection branch, because R-FCN is also fully convolutional and is therefore compatible with our GWAP module. The R-FCN branch can only be trained with images that have ground truth bounding boxes. The other branch is the image classification branch with the GWAP module. Here we use the class-specific GWAP module described in Section III-C for the image classification branch. This branch can be trained with only image-level labels. For better performance, we further add a local object region regularization for the class-specific score map (illustrated in Fig. 6). With the ground truth bounding boxes, we constrain the average score of the object region to be larger than that of the negative samples. In this work, we train the object detection task and the image classification task jointly. The total loss is $L = L_{det} + \lambda L_{wsl}$, where $L_{det}$ is the loss of the detection branch, $L_{wsl}$ is the loss of the weakly supervised branch, and $\lambda$ is a fixed weight. $\lambda$ balances the two tasks to avoid overfitting to one of them. Experimental results show that this multi-task framework improves generalization when supervised training data for detection is lacking.

Fig. 5: Illustration of the multi-task framework for semi and weakly supervised detection.
Fig. 6: Illustration of the local region regularization for the class-specific score map.

IV Ablation experiments

In this section, we conduct ablation experiments to illustrate the effectiveness of the design of our class-agnostic GWAP method (Section III-B). Without loss of generality, we only perform experiments with the standard end-to-end classification process, which is much simpler than a specialized model for the task. The conclusions drawn here also apply to the class-specific GWAP method because the class-agnostic and class-specific GWAP methods follow the same design ideas.

IV-A Setup

We conduct a set of basic experiments to understand how GWAP compares to GAP [6] and to evaluate the design of the class-agnostic GWAP method. For a fair comparison, we implement these methods on the same basic network architectures, namely CaffeNet (essentially AlexNet [45]) and GoogLeNet [46]. We call the CaffeNet model C and the GoogLeNet model G, respectively. We use the pre-trained ImageNet models that are available online (see footnotes 2 and 3). For CaffeNet, we remove the layers after and replace them with the GWAP module or GAP followed by a classification layer. For GoogLeNet, we remove the layers after (i.e., pool4 to prob). For multi-label classification, we use the sigmoid cross entropy loss in this section. Then we fine-tune these networks on the trainval set of PASCAL VOC2007 [14] and evaluate them on the test set. All experiments use single-scale training and testing ( input image). During training, each sampled image is horizontally flipped. No other data augmentation is used. The mini-batch size is set to 4 images, which is relatively small. We set the initial learning rate to , and decrease it every iterations. We run a total of iterations.

IV-B Results

method               net   mAP
GAP                  C     68.6
GWAP w/o $\sigma$    C     57.7
GWAP w/o $\exp$      C     68.4
GWAP                 C     70.4
GWAP-gt              C     73.7
GAP                  G     84.7
GWAP w/o $\sigma$    G     81.5
GWAP w/o $\exp$      G     83.7
GWAP                 G     85.1
GWAP-gt              G     86.5
TABLE I: VOC 2007 test classification average precision (%) of the ablation experiments (C: CaffeNet, G: GoogLeNet, w/o: without, $\sigma$: sigmoid function, $\exp$: exponentiate operation, -gt: using ground truth attention map).

method               net   F-measure
GAP                  C     0.3517
GWAP w/o $\sigma$    C     0.1344
GWAP w/o $\exp$      C     0.3611
GWAP                 C     0.5060
GAP                  G     0.4164
GWAP w/o $\sigma$    G     0.1905
GWAP w/o $\exp$      G     0.4577
GWAP                 G     0.5450
TABLE II: Attention effectiveness on the VOC 2007 test set (C: CaffeNet, G: GoogLeNet, w/o: without, $\sigma$: sigmoid function, $\exp$: exponentiate operation).

(a) Original images
(b) GoogLeNet-GAP
(c) GoogLeNet-GWAP without sigmoid function
(d) GoogLeNet-GWAP without exponentiate operation
(e) GoogLeNet-GWAP
Fig. 7: Illustration of attention maps from different models.

Classification: We provide classification results on the PASCAL VOC2007 test set (Table I) to demonstrate the effectiveness of the proposed method. First, we verify the intuition behind GWAP (Sec. III-A). We provide the classification results of GWAP with the ground truth weight map (referred to as GWAP-gt). Given the ground truth bounding boxes of target objects, we set the values that belong to the regions of objects to 1 and the rest to 0. We normalize these values (so that they sum to 1) to construct the ground truth weight map. As shown in Table I, GWAP-gt provides an improvement over GAP in mAP (CaffeNet: from 68.6 to 73.7, GoogLeNet: from 84.7 to 86.5). This demonstrates that knowing the regions of objects is helpful for whole-image classification. Our method GWAP also yields an improvement over GAP (CaffeNet: from 68.6 to 70.4, GoogLeNet: from 84.7 to 85.1). We also provide the results of the class-agnostic GWAP without the sigmoid function (referred to as GWAP w/o $\sigma$) and without the exponentiate operation (referred to as GWAP w/o $\exp$). Results show that performance drops considerably without the sigmoid function (CaffeNet: 57.7, GoogLeNet: 81.5). Without the exponentiate operation, performance also drops (CaffeNet: 68.4, GoogLeNet: 83.7), but less than for GWAP w/o $\sigma$. This demonstrates the importance of using the non-linear sigmoid function. In addition, using only the sigmoid function is not sufficient to outperform GAP; the exponentiate operation also improves the result to some extent.
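
The construction of the ground truth weight map for GWAP-gt can be sketched as follows. This is a minimal illustration, assuming that boxes are already expressed in feature-map cells with inclusive corners; the function name and the example boxes are hypothetical.

```python
def gt_weight_map(boxes, height, width):
    """Build a normalized ground truth weight map.

    boxes: list of (x_min, y_min, x_max, y_max) in feature-map cells, inclusive.
    Cells covered by any box get value 1, all others 0; the map is then
    normalized so that it sums to 1."""
    mask = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height - 1, y1) + 1):
            for x in range(max(0, x0), min(width - 1, x1) + 1):
                mask[y][x] = 1.0
    total = sum(sum(row) for row in mask)
    if total == 0:
        raise ValueError("no box overlaps the map")
    return [[v / total for v in row] for row in mask]


# Two overlapping boxes on a 4x4 map; their union covers 7 cells.
weight_map = gt_weight_map([(0, 0, 1, 1), (1, 1, 2, 2)], height=4, width=4)
for row in weight_map:
    print([round(v, 3) for v in row])
```

Overlapping boxes are merged into a single region, so every covered cell receives the same weight regardless of how many boxes contain it.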

Object localization effectiveness: We use the F-measure, borrowed from salient object detection [47], to quantitatively evaluate the object localization performance of the weight map $w$. Concretely, we convert a weight map to a heat map $H$ normalized to [0,1]. Given a heat map $H$, we then convert it to a binary mask $M$ and compute Precision and Recall by comparing $M$ with the ground truth $G$: $Precision = |M \cap G| / |M|$, $Recall = |M \cap G| / |G|$. The F-measure is obtained by $F_\beta = \frac{(1 + \beta^2)\, Precision \times Recall}{\beta^2\, Precision + Recall}$, where $\beta^2$ is a fixed weighting constant. The F-measure measures the ability to indicate the full extent of objects with minimal noise. Setting the whole region to foreground can easily lead to 100% recall, but is meaningless. In this part, the ground truth regions are formed by the given bounding boxes of target objects. For an image, we merge the regions of multiple boxes into one region, for compatibility with our class-agnostic weight map. We use the adaptive threshold method of Otsu [48] for binarizing $H$. Quantitative results are shown in Table II and some visualization results are given in Fig. 7. GWAP achieves the highest score, which demonstrates the localization ability of the proposed method. We also find that there is a positive correlation between object localization effectiveness and classification performance. This is consistent with our intuition.
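
The evaluation procedure can be sketched in plain Python. This is an illustrative sketch rather than the evaluation code used for Table II: the histogram-based Otsu implementation, the function names and the toy maps are assumptions, and the default `beta2=0.3` is a value commonly used in salient object detection, not a value confirmed by the text above.

```python
def otsu_threshold(values, bins=256):
    """Otsu's method for values in [0, 1]: choose the histogram split that
    maximizes the between-class variance. Values >= the returned threshold
    are treated as foreground."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0
    sum0 = 0
    for t in range(bins - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return (best_t + 1) / bins


def f_measure(weight_map, gt_mask, beta2=0.3):
    """weight_map: flat list of weights; gt_mask: flat list of 0/1 labels.
    beta2 is an assumed default, see the note above."""
    lo, hi = min(weight_map), max(weight_map)
    if hi == lo:
        return 0.0
    heat = [(w - lo) / (hi - lo) for w in weight_map]  # normalize to [0, 1]
    thr = otsu_threshold(heat)
    pred = [h >= thr for h in heat]
    tp = sum(1 for p, g in zip(pred, gt_mask) if p and g)
    n_pred = sum(pred)
    n_gt = sum(1 for g in gt_mask if g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0


# Toy weight map: high weights on the first three cells.
weight_map = [0.30, 0.28, 0.25, 0.06, 0.06, 0.05]
score_exact = f_measure(weight_map, [1, 1, 1, 0, 0, 0])    # mask matches the object
score_partial = f_measure(weight_map, [1, 1, 1, 1, 1, 1])  # object is twice as large
print(score_exact, score_partial)
```

In the second call the predicted mask covers only half of the ground truth, so precision stays at 1 while recall drops to 0.5 and the F-measure decreases accordingly.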

V Results on ILSVRC

Networks             top-1 val. error   top-5 val. error
VGGnet-GAP [6]       33.4               12.2
GoogLeNet-GAP [6]    35.0               13.2
AlexNet*-GAP [6]     44.9               20.9
AlexNet-GAP [6]      51.1               26.3
GoogLeNet [6]        31.9               11.3
VGGnet [6]           31.2               11.4
AlexNet [6]          42.6               19.5
NIN [6]              41.9               19.6
GoogLeNet-GMP [6]    35.6               13.9
GoogLeNet-GWAP       31.8               11.3
TABLE III: Classification error (%) on the ILSVRC validation set.

Networks                       top-1 val. error
GoogLeNet-GAP [6]              56.40
VGGnet-GAP [6]                 57.20
GoogLeNet [6]                  60.09
AlexNet*-GAP [6]               63.75
AlexNet-GAP [6]                67.19
NIN [6]                        65.47
Backprop on GoogLeNet [6]      61.31
Backprop on VGGnet [6]         61.12
Backprop on AlexNet [6]        65.17
GoogLeNet-GMP [6]              57.78
GoogLeNet-GWAP                 54.99
GoogLeNet-GWAP (multi-scale)   54.09
TABLE IV: Localization error (%) on the ILSVRC validation set.

Fig. 8: Illustration of example outputs. The first column shows the original images with ground truth bounding boxes in green. The second, third and fourth columns show weight maps from our GoogLeNet-GWAP model with input images of size 224, 448 and 672, respectively. The average of the three scaled weight maps is shown in the fifth column. The last column shows the top-1 predicted class activation maps from the GoogLeNet-GAP model.

In this section, we evaluate the effectiveness of the class-agnostic GWAP method on the large-scale ILSVRC 2014 benchmark dataset [13]. We provide both classification and localization results. We use the same experimental setup as in GAP [6]. We fine-tune GoogLeNet [46] with our class-agnostic GWAP module, starting from the pre-trained ImageNet model, for 1000-way object classification.

Classification: We use the same error metrics (top-1, top-5) as ILSVRC for classification. Table III shows the classification performance of the original networks, the GAP networks, and our GoogLeNet-GWAP network. We find that GoogLeNet-GWAP provides a large improvement over GoogLeNet-GAP (top-1: 3.2 points, top-5: 1.9 points), and is also comparable to the original GoogLeNet. These results demonstrate the effectiveness of the proposed method. Compared to GAP, our method does not suffer from a drop in classification performance relative to the original network.

Localization: As in [6], in order to generate the bounding box of a predicted object from the weight map $w$, we use a simple thresholding technique to segment the heat map, and we take the bounding box that covers the largest connected component in the segmentation map. We only do this for the top-1 predicted class. We provide the top-1 results on the ILSVRC validation set in Table IV. We observe that our method GoogLeNet-GWAP yields the lowest top-1 localization error, 54.99%. Furthermore, we examine the outputs of our GoogLeNet-GWAP model with multi-scale input images. We have tried three scales in this paper: 224, 448 and 672. We also provide the localization result obtained by simply averaging these multi-scale weight maps. With this multi-scale strategy, there is a further improvement of 0.90 points (from 54.99% to 54.09%). As shown in Fig. 8, the proposed method is able to successfully indicate target objects. Different from GoogLeNet-GAP (see footnote 4), our method tends to localize the full extent of objects and to provide shape and contour information about them. The larger the resolution of the weight maps, the more precise the object regions that emerge. These results are promising since we did not use any annotated bounding boxes.
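
The box extraction step can be sketched as follows. This is a minimal plain-Python illustration, assuming 4-connectivity and a user-chosen threshold on a heat map already normalized to [0, 1]; the function name, the threshold value and the toy heat map are hypothetical.

```python
from collections import deque


def largest_component_bbox(heat, threshold):
    """Return (x_min, y_min, x_max, y_max) of the largest 4-connected region
    whose values are >= threshold, or None if no cell passes the threshold."""
    h, w = len(heat), len(heat[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or heat[y][x] < threshold:
                continue
            # Breadth-first search over one connected component.
            component, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and heat[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    if not best:
        return None
    ys = [p[0] for p in best]
    xs = [p[1] for p in best]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy heat map with an isolated peak (top-left) and a larger blob (right).
heat = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.0, 0.0, 0.7, 0.8],
]
box = largest_component_bbox(heat, threshold=0.5)
print(box)
```

The isolated peak is ignored because its component is smaller than the blob, so the returned box covers only the larger region. For multi-scale inference, the weight maps from the different input sizes would first be resized to a common resolution and averaged before this step.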

VI Semi and Weakly Supervised Detection

In this section, we evaluate our method on the PASCAL VOC benchmark. As in [12], we train the model proposed in Section III-D on the union of VOC 2007 trainval and VOC 2012 trainval, and evaluate on the VOC 2007 test set. Object detection accuracy is measured by mean Average Precision (mAP). For each category in the training set, some of the images have both image-level labels and ground truth bounding boxes, while the other images only have image-level labels. For comparison, we use the same training settings as in [12]. We use ResNet-50 as the basic model. The weakly supervised task and the supervised detection task are trained jointly end to end. The weight for the loss of the weakly supervised task is kept at the same value in all these experiments.

Results are shown in Table V. The baseline model is the standard R-FCN [12] detection model, which can only be trained with images that have ground truth bounding boxes. We combine the GWAP and GAP methods with R-FCN, so that the images with only image-level labels are also used for training. We show that training with a few labeled bounding boxes and large-scale image-level labels improves performance, most clearly when few boxes are available (e.g. from 29.89 to 39.86 mAP with R-FCN + GWAP at a proportion of 0.01). This demonstrates that the multi-task framework effectively regularizes learning when supervised training data for detection is lacking. We also show that the proposed GWAP method performs better than GAP for this task in most settings. We report some visualizations of the class-specific weight maps in Fig. 9.

Method \ Proportion            0.01     0.02     0.05     0.1
R-FCN                          29.89    40.25    53.24    61.05
R-FCN + GAP                    39.10    45.85    57.36    62.32
R-FCN + GWAP (ours)            39.86    47.53    57.34    63.17
R-FCN + GWAP + GAP (ours)      40.24    47.20    57.48    62.55
TABLE V: Comparisons on the PASCAL VOC 2007 test set with different proportions of images that have ground truth bounding boxes (mAP, %).
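To make the size of the improvement concrete, the gains over the R-FCN baseline can be computed directly from the Table V values. The snippet below is only a worked arithmetic example; it reads the entry printed as "53,24" as 53.24, and the function and variable names are illustrative.

```python
PROPORTIONS = [0.01, 0.02, 0.05, 0.1]

# mAP (%) values copied from Table V, one entry per proportion.
MAP = {
    "R-FCN": [29.89, 40.25, 53.24, 61.05],
    "R-FCN + GAP": [39.10, 45.85, 57.36, 62.32],
    "R-FCN + GWAP": [39.86, 47.53, 57.34, 63.17],
    "R-FCN + GWAP + GAP": [40.24, 47.20, 57.48, 62.55],
}


def gains_over_baseline(method, baseline="R-FCN"):
    """Return the mAP gain (percentage points) at each proportion."""
    return [round(m - b, 2) for m, b in zip(MAP[method], MAP[baseline])]


for name in ("R-FCN + GAP", "R-FCN + GWAP", "R-FCN + GWAP + GAP"):
    print(name, dict(zip(PROPORTIONS, gains_over_baseline(name))))
```

For R-FCN + GWAP the gains are 9.97, 7.28, 4.10 and 2.12 points at proportions 0.01, 0.02, 0.05 and 0.1, so the benefit of the image-level data shrinks as more box annotations become available.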
Fig. 9: Visualization of some detection results and the corresponding class-specific weight maps. Panels: (a) Horse&Person, (b) Horse, (c) Person, (d) Cat&Dog, (e) Cat, (f) Dog, (g) Bus&Car&Person, (h) Bus, (i) Car, (j) Person.

VII Conclusion

In this paper, we revisit the global weighted average pooling (GWAP) method and develop class-agnostic and class-specific GWAP modules for simultaneous pixel-level localization and image-level classification, using only image-level labels for training. We show that the proposed methods can recover precise object regions without any localization annotations. We further propose a multi-task framework that combines our class-specific GWAP module with R-FCN, and show that this framework can use data with only image-level labels to significantly improve the generalization of the object detection model. We hope that these results will encourage further exploration of weakly supervised learning and object detection with convolutional neural networks. We also expect the GWAP module to prove useful in other applications. In the future, we plan to combine our method with other weakly supervised detection methods.



  1. Only the last layer is fully-connected.
  4. We download the pre-trained model from


  1. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, arXiv preprint arXiv:1512.03385.
  2. S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, arXiv preprint arXiv:1506.01497.
  3. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  4. J. Wu, Y. Yu, C. Huang, K. Yu, Deep multiple instance learning for image classification and auto-annotation, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015, pp. 3460–3469.
  5. M. Oquab, L. Bottou, I. Laptev, J. Sivic, Is object localization for free? – weakly-supervised learning with convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  6. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  7. H. O. Song, R. B. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, T. Darrell, et al., On learning to localize objects with minimal supervision., in: ICML, 2014, pp. 1611–1619.
  8. H. Bilen, M. Pedersoli, T. Tuytelaars, Weakly supervised object detection with convex clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1081–1089.
  9. C. Wang, K. Huang, W. Ren, J. Zhang, S. Maybank, Large-scale weakly supervised object localization via latent category learning, IEEE Transactions on Image Processing 24 (4) (2015) 1371–1385.
  10. H. Bilen, A. Vedaldi, Weakly supervised deep detection networks, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  11. J. R. Uijlings, K. E. van de Sande, T. Gevers, A. W. Smeulders, Selective search for object recognition, International journal of computer vision 104 (2) (2013) 154–171.
  12. J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully convolutional networks, in: Advances in neural information processing systems, 2016, pp. 379–387.
  13. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (3) (2015) 211–252. doi:10.1007/s11263-015-0816-y.
  14. M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,
  15. M. P. Kumar, B. Packer, D. Koller, Self-paced learning for latent variable models, in: Advances in Neural Information Processing Systems, 2010, pp. 1189–1197.
  16. T. Deselaers, B. Alexe, V. Ferrari, Weakly supervised localization and learning with generic knowledge, International journal of computer vision 100 (3) (2012) 275–293.
  17. R. G. Cinbis, J. Verbeek, C. Schmid, Weakly supervised object localization with multi-fold multiple instance learning, IEEE Transactions on Pattern Analysis and Machine Intelligence PP (99) (2016) 1–1. doi:10.1109/TPAMI.2016.2535231.
  18. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE transactions on pattern analysis and machine intelligence 32 (9) (2010) 1627–1645.
  19. R. Girshick, J. Donahue, T. Darrell, J. Malik, Region-based convolutional networks for accurate object detection and segmentation, IEEE transactions on pattern analysis and machine intelligence 38 (1) (2016) 142–158.
  20. K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition., IEEE Transactions on Pattern Analysis Machine Intelligence 37 (9) (2014) 1904–1916.
  21. R. Girshick, Fast r-cnn, in: IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
  22. V. Kantorov, M. Oquab, M. Cho, I. Laptev, Contextlocnet: Context-aware deep network models for weakly supervised localization, in: European Conference on Computer Vision, Springer, 2016, pp. 350–365.
  23. Z. Yan, J. Liang, W. Pan, J. Li, C. Zhang, Weakly- and semi-supervised object detection with expectation-maximization algorithm.
  24. P. Tang, X. Wang, X. Bai, W. Liu, Multiple instance detection network with online instance classifier refinement (2017) 3059–3067.
  25. J. Wang, J. Yao, Y. Zhang, R. Zhang, Collaborative learning for weakly supervised object detection.
  26. L. Bazzani, A. Bergamo, D. Anguelov, L. Torresani, Self-taught object localization with deep networks, in: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2016, pp. 1–9.
  27. A. J. Bency, H. Kwon, H. Lee, S. Karthikeyan, B. Manjunath, Weakly supervised localization using deep feature maps, arXiv preprint arXiv:1603.00489.
  28. M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conference on Computer Vision, Springer, 2014, pp. 818–833.
  29. K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, arXiv preprint arXiv:1312.6034.
  30. C. Sun, M. Paluri, R. Collobert, R. Nevatia, L. Bourdev, Pronet: Learning to propose object-specific boxes for cascaded neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3485–3493.
  31. T. Durand, N. Thome, M. Cord, Weldon: Weakly supervised learning of deep convolutional neural networks, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  32. T. Durand, T. Mordan, N. Thome, M. Cord, Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017.
  33. J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, K. Saenko, Lsda: Large scale detection through adaptation, in: Advances in Neural Information Processing Systems, 2014, pp. 3536–3544.
  34. Z. Shi, P. Siva, T. Xiang, Transfer learning by ranking for weakly supervised object annotation, arXiv preprint arXiv:1705.00873.
  35. M. Guillaumin, V. Ferrari, Large-scale knowledge transfer for object localization in imagenet, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 3202–3209.
  36. H. Chen, Y. Wang, G. Wang, Y. Qiao, Lstd: A low-shot transfer detector for object detection, arXiv preprint arXiv:1803.01529.
  37. D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473.
  38. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 2048–2057.
  39. S. Sharma, R. Kiros, R. Salakhutdinov, Action recognition using visual attention, arXiv preprint arXiv:1511.04119.
  40. L.-C. Chen, Y. Yang, J. Wang, W. Xu, A. L. Yuille, Attention to scale: Scale-aware semantic image segmentation, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  41. Z. Yang, X. He, J. Gao, L. Deng, A. Smola, Stacked attention networks for image question answering, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  42. Q. You, H. Jin, Z. Wang, C. Fang, J. Luo, Image captioning with semantic attention, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  43. C.-Y. Lee, S. Osindero, Recursive recurrent nets with attention modeling for ocr in the wild, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  44. J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, S. Yan, Attentive contexts for object detection, arXiv preprint arXiv:1603.07415.
  45. A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
  46. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
  47. A. Borji, M.-M. Cheng, H. Jiang, J. Li, Salient object detection: A benchmark, IEEE Transactions on Image Processing 24 (12) (2015) 5706–5722.
  48. N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics 9 (1) (1979) 62–66.