Adaptively Denoising Proposal Collection for Weakly Supervised Object Localization

Wenju Xu, Yuanwei Wu, Wenchi Ma, Guanghui Wang
School of Engineering, University of Kansas, Lawrence, KS, USA 66045

In this paper, we address the problem of weakly supervised object localization (WSL), which trains a detection network on a dataset with only image-level annotations. The proposed approach is built on the observation that the proposal set from the training dataset is a collection of background, object parts, and objects. Several strategies are taken to adaptively eliminate the noisy proposals and generate pseudo object-level annotations for the weakly labeled dataset. A multiple instance learning (MIL) algorithm enhanced by a mask-out strategy is adopted to collect the class-specific object proposals, which are then utilized to adapt a pre-trained classification network to a detection network. In addition, the detection results from the detection network are re-weighted by jointly considering the detection scores and the overlap ratios of proposals in a proposal subset optimization framework. The optimal proposals work as object-level labels that enable a pseudo-strongly supervised dataset for training the detection network. Consequently, we establish a fully adaptive detection network. Extensive evaluations on the PASCAL VOC 2007 and 2012 datasets demonstrate a significant improvement compared with the state-of-the-art methods.

1 Introduction

Object detection, which attempts to place a tight bounding box around every object in a given image, is an important problem for image understanding. This problem has been extensively studied in recent years [1, 2, 3, 20, 26, 29], and the state-of-the-art detection performance promotes a variety of applications, including human pose estimation [32] and crowd counting [41]. One key step in object detection is to learn a distinctive representation of the objects from a large quantity of labeled data. Most existing methods rely on object-level labeled datasets [8] so that their models learn visual features from the specified regions. However, data annotation is exhausting and error-prone work. In order to reduce the annotation cost, a common strategy is to learn the detector in a weakly supervised manner, where only binary image-level labels representing the overall presence or absence of an object category are provided for training.

Multiple instance learning (MIL) [6, 9] is a widely used strategy for weakly supervised object localization (WSL). It selects object regions of interest (proposals) from the positive images that contain the object, and learns an appearance model of the object from the features in the selected regions. This method tends to get stuck in local optima. Therefore, a re-localization and re-training strategy is typically taken to push the solution closer to the global optimum. Pentina et al. [23] form a curriculum learning strategy that feeds the training process from easy images with big objects to hard images with many small objects. Shi et al. [26] propose a strategy that re-weights the proposals' scores according to the consistency between the proposal size and the estimated object size. Even though these strategies attempt to improve MIL, finding positive image bags containing objects of a certain class for the MIL classifier depends, in some sense, on guessing, and it is possible to take a negative bag as a positive one. It is also difficult to obtain tight bounding boxes that exactly contain the objects. These drawbacks call for strategies that adaptively refine the estimated bounding boxes so that they tightly contain the objects.

Another line of research is based on convolutional neural networks (CNNs) [16, 31], which are capable of learning generic visual features that generalize to many tasks. Methods in this category are inspired by the fact that, without location annotation, a pre-trained image classification CNN learns representative information about objects and object parts. Many efforts leverage a CNN to extract discriminative appearance features, and then train a MIL appearance model for object detection [33]. Recent efforts [3, 20] achieve significant performance improvements with end-to-end methods, which adopt a pre-trained classification network to mine location information and transfer the problem from weakly supervised object localization to pseudo-strongly supervised object detection. However, generating instance-level labels from the image-level labels is nontrivial, since objects from the same category may appear with different shapes and backgrounds. A pre-trained classifier makes predictions based on salient features. The extracted appearance features represent object parts, which lack information on the instance as a whole. Moreover, it is difficult to determine the size of bounding boxes that exactly contain the objects in the feature-level search. As a result, the obtained instance-level labels are inexact.

Figure 1: Overview of our method. We use a mask-out strategy to collect the generic region proposals and apply MIL to generate a pseudo-labeled training set. This dataset is then fed to a WSL loop, so that the object detector is re-trained progressively. We also perform the re-localization [43, 44] step by re-weighting object proposals according to the detection scores and the overlap of the proposals. Bounding boxes in yellow represent the confident proposals, while the bounding box in another color in each block represents the most confident proposal.

In this paper, we propose a new framework based on two observations: (i) the proposals are a collection of background, object parts, and objects; and (ii) it is hard to train object detectors directly on a weakly labeled dataset due to the substantial amount of noise in the object proposal collection and the size variation of the objects. Our method integrates several strategies to adaptively eliminate the noise in the object proposal collection. We adopt an enhanced MIL algorithm, preceded by a mask-out strategy, to mine the proposal collection. We then fine-tune a pre-trained classification network through re-weighting and re-training, exploiting proposal subset optimization [42] to further re-weight the detection results.

Our re-weighting and re-training strategy aims at determining the optimal proposals automatically. To this end, we take a subset optimization method to select object proposals. It is based on both the detection scores from the pre-trained detection network and the overlaps between the candidate bounding boxes. This strategy puts higher weights on proposals that have a large overlap area with others. Specifically, we re-weight the object proposals with high detection scores according to how much each bounding box overlaps with other bounding boxes. We apply this subset optimization method iteratively to improve the re-localization step.

This re-weighting scheme reduces the uncertainty in the proposal distribution, making the re-weighting step more likely to pick a proposal correctly covering the object. Fig. 2 shows an example of how the subset optimization changes the proposal score induced by the current object detector, leading to a more accurate localization.

Our contributions are as follows: (i) We propose a novel workflow to collect confident proposals, which integrates the mask-out strategy, MIL, and proposal subset optimization. The MIL model is trained on the proposals selected by the mask-out strategy and mines confident proposals to reduce background clutter and potential confusion between similar objects. The proposal subset optimization further refines the proposals by re-scoring the bounding boxes; (ii) following the idea of re-localization and re-training, the candidate proposals are refined based on both the detection scores and the overlap ratios between the proposals. We then iteratively adapt a pre-trained classification network to a detection network with these quality-enhanced proposals. This is a new pipeline for improving object proposals; and (iii) detailed evaluations on the PASCAL VOC 2007 and 2012 datasets [10] demonstrate that our weakly supervised object detection with adaptively denoised proposal collection performs favorably against the state-of-the-art methods. The proposed model and trained parameters will be available on the authors' website.

2 Related work

Extracting meaningful information from the environment is a challenging task [34, 35, zhou2014smart]. In recent years, deep neural networks have become increasingly popular for knowledge discovery in many computer vision tasks, such as object recognition [xu2019towards, zhang2018bpgrad], object detection [21, kong2017cancer], visual question answering [39], pose estimation [15], image synthesis [37, 36, 38], face recognition [5], and depth estimation [13]. Object detection is the task of recognizing and localizing the objects in images with a deep model trained on labeled ground truth [22]. However, labeling the images with a bounding box for each object is a nontrivial task. In the scenario of weakly supervised localization, the training images are known to contain instances of a certain object class, but the locations of these instances are unknown: no ground-truth bounding box is available for any object in the training dataset. The task is both to localize the objects (estimate the bounding boxes tightly containing the instances) and to classify them. What we have are the image-level annotations, which are weak supervision for localizing the objects. To train a detection network with image-level supervision, we first need to localize objects in all the images of the training dataset based on the image-level annotations, and then use the localization results to train a detector for the test set. The WSL problem is often handled with multiple instance learning (MIL) [2, 3, 7, 29], where the images are treated as bags of object proposals [45] (bounding boxes estimated to localize the objects). A negative image does not contain any instance of a certain category. A positive image contains at least one positive instance, mixed in with a majority of negative ones. The goal is to find the true positive instances from which to learn a classifier for proposal classification.

Previous works achieve significant improvement by exploring ways to enhance the MIL. Siva et al. [28] propose an effective negative mining approach combined with discriminative saliency measures. Song et al. [29] formulate an initialization strategy for WSL as a discriminative submodular cover problem in a graph-based framework, and develop a negative mining technique to increase robustness against incorrectly localized boxes [30]. Bilen et al. [2, 3] propose a relaxed version of MIL that softly labels object instances instead of choosing the highest scoring ones. They also propose a discriminative convex clustering algorithm to jointly learn a discriminative object model and enforce the similarity of the localized object regions.

Figure 2: Detection results from NMS (red line, left) and subset optimization (center). Bounding boxes (BB) on the right represent the most confident proposals obtained from different steps (blue BB: CNN, green BB: mask-out, pink BB: re-train, cyan BB: MIL). We compare the detection results using bounding boxes in different colors, which shows that our re-training strategy is able to obtain denoised proposals by re-weighting object proposals according to the detection scores and the overlap ratios of the proposals.

As CNNs have turned out to be surprisingly effective in many vision tasks including classification and detection, recent state-of-the-art WSL approaches also build on CNN architectures [3] or CNN features [7]. Bilen et al. [3] modify a region-based CNN architecture [11] and propose a CNN with two streams, one focusing on recognition and the other on localization, which simultaneously performs region selection and classification. Similarly, Li et al. [20] use MIL to obtain the initial detection results and propose a domain adaptation method [14] to fine-tune a classification network into a detection network with the initial detection results. The results show an improvement in detection accuracy. Shi et al. [26] score the proposals by size and re-train the detection network with the re-weighted proposals in an easy-to-hard order, based on the assumption that larger proposals provide more information for training the network than smaller ones. Our work is related to these CNN-based MIL approaches that perform WSL by end-to-end training from image-level labels. In contrast to the above methods, however, we focus on a CNN architecture that is progressively re-trained with denoised proposals to improve detection accuracy.

The concept of adaptive learning in an order has also been studied in computer vision [19, 23]. These works focus on a key question: how to re-weight the proposals? Sharmanska et al. [24] employ privileged information to distinguish between easy and hard examples in an image classification task. The privileged information consists of additional cues available at training time, but not at test time. They employ several additional cues, such as object bounding boxes [12], image tags, and rationales, to define their concept of easiness [19]. Lai et al. [18] select highly confident object proposals under the guidance of class-specific saliency maps. Pentina et al. [23] consider learning the visual attributes of objects. Shi et al. [26] propose a size estimator to re-weight the proposals based on the size of the instances in the image. They use curriculum learning in a WSL setting and propose object size as an "easiness" measure. Shi et al. [25] consider the task of discovering object classes in an unordered image collection. Their model is initialized with regions of "stuff" categories, and is then used to support discovering "thing" categories in unlabeled images with the help of a fully supervised segmentation model. Bodla et al. [4] propose a soft method to select the bounding boxes. They decay the classification score of a box that has a high overlap with top-scoring boxes, rather than suppressing it. Jie et al. [16] explore the Fast R-CNN model [11] and propose a self-taught learning method for proposal selection. The work most related to ours is the very recent study [31], which designs an on-line classifier refinement pipeline to progressively locate the most discriminative region of an image. By contrast, we propose a novel workflow to adaptively refine the proposals, i.e., to iteratively collect a more confident subset of proposals. In addition, we take the re-training strategy to fine-tune the model with the denoised proposal subset. By integrating several novel proposal mining strategies, the proposed workflow is adaptable to a variety of weakly supervised object detection tasks.

3 Adaptively Denoised Proposal Collection

The proposed weakly supervised object detection method is illustrated in Fig. 1. The model consists of three major components, namely confident proposal mining, object detector learning, and proposal subset optimization. They are successively employed to adaptively refine the proposal collection. The remainder of this section discusses these three components in detail.

3.1 Confident Proposal Mining

We consider the weakly supervised object localization problem as an adaptive proposal denoising procedure that gradually refines the proposal collection. In the end, we transfer the problem from weakly supervised object localization to pseudo-strongly supervised object detection. Based on a pre-trained CNN classification network and a MIL model, our workflow adaptively selects confident proposals, rather than those comprised of background or object parts, from the candidate proposals generated by EdgeBoxes [45].

Assisted by the classification network, we first utilize the mask-out strategy to collect object proposals. The idea of masking out the input of a CNN has been previously explored in [40], which replaces the pixel values of the proposals with fixed mean pixel values and compares the classification scores obtained by feeding the real image and its mask-out images into the classification network. Intuitively, if the mask-out image introduces a notable drop in the classification score for the class, the region can be considered as containing an object belonging to the class. Inspired by [40], we apply the mask-out strategy to select the proposals containing a certain object. We denote the classification network as $f_{\mathrm{cls}}(I)$, which maps an image $I$ to a confidence vector of classes. The confident proposals are selected by investigating the difference of classification score between the selected image $I$ and its mask-out image $I_{\mathrm{mask}}(z)$. This is formulated as

$$ d_c(z) = f_{\mathrm{cls}}^{c}(I) - f_{\mathrm{cls}}^{c}\big(I_{\mathrm{mask}}(z)\big), \tag{1} $$

where $z$ represents the masked-out region and $f_{\mathrm{cls}}^{c}$ is the confidence for class $c$. To select confident proposals, we first set a threshold on the classification score. The region $z$ is considered discriminative for class $c$ based on two aspects: the score of classifying the image $I$ to the class $c$ is beyond the threshold, and the classification score drop $d_c(z)$ between the image and the corresponding mask-out images is maximum.
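
The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `classify` is a hypothetical callable standing in for the classification network, the per-channel image mean stands in for the fixed mean pixel value, and the names `rank_by_score_drop`, `score_thresh`, and `top_k` are our own.

```python
import numpy as np

def mask_out(image, box, fill_value):
    """Return a copy of `image` with the pixels inside `box` replaced by `fill_value`."""
    x1, y1, x2, y2 = box
    masked = image.copy()
    masked[y1:y2, x1:x2] = fill_value
    return masked

def rank_by_score_drop(image, boxes, classify, class_idx, score_thresh=0.5, top_k=50):
    """Rank proposals by the class-score drop caused by masking them out.

    `classify` is a hypothetical callable mapping an HxWx3 array to a vector of
    class confidences; `score_thresh` and `top_k` are illustrative values.
    """
    full_score = float(classify(image)[class_idx])
    if full_score < score_thresh:
        return []  # the image is not confidently classified as this class
    # Assumption: the per-channel mean of the image stands in for the fixed mean pixel value.
    fill_value = image.mean(axis=(0, 1))
    drops = np.array([
        full_score - float(classify(mask_out(image, b, fill_value))[class_idx])
        for b in boxes
    ])
    order = np.argsort(-drops)[:top_k]  # largest score drop first
    return [(boxes[i], float(drops[i])) for i in order]
```

With a toy classifier whose score depends on the bright pixels of a synthetic image, the box covering the bright region is ranked first, because masking it removes most of the evidence for the class, while masking a background box leaves the score unchanged.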

Once the proposals are obtained by applying the mask-out strategy, we separately learn one MIL model for each category. Taking the purified proposals selected by the mask-out strategy as the training dataset lets the basic MIL start from a higher baseline, which not only stabilizes the training process, but also reduces the training time [20]. In the MIL model, each instance is described by a feature vector. More specifically, each feature vector is regarded as an instance and each image is represented by a bag of instances. For example, a training image is considered as a bag of proposals with a pseudo-strong label indicating whether the bag contains an instance of the specific category. A bag is considered negative if it contains no instance of that category, and positive if at least one of its instances is in that category. Given the feature representation $\Phi(I, z)$, we iteratively train the MIL model with the objective written as

$$ F(I; w_{\mathrm{mil}}) = \max_{z \in Z(I)} w_{\mathrm{mil}} \cdot \Phi(I, z), \tag{2} $$

where $w_{\mathrm{mil}}$ represents the parameters of the MIL model and $z$ is called the "latent variable", chosen from the set $Z(I)$, which is typically a set of bounding boxes. The top-scoring proposals given by the mask-out strategy are taken as positive samples for each category and are used to train the MIL model. Among the initial bounding boxes, the set $Z(I)$ contains all possible candidate instances. Maximizing the objective function over $z$ amounts to choosing a bounding box containing the whole object. In this work, each proposal is represented by a 4096-dimensional feature vector from the second-to-last layer of the classification network.
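
The alternation between fitting the MIL parameters and re-selecting the latent box can be illustrated with a small NumPy sketch. It rests on several assumptions that are not taken from the paper: a plain linear scorer without bias, fitted by sub-gradient descent on a hinge loss, stands in for the actual MIL solver, and the function names and hyper-parameters (`lr`, `reg`, `epochs`, `rounds`) are hypothetical.

```python
import numpy as np

def bag_score(w, bag):
    """Score of a bag: the maximum of w . phi over its instances, plus the arg max."""
    scores = bag @ w
    z = int(np.argmax(scores))
    return float(scores[z]), z

def fit_linear_hinge(X, y, w, lr=0.1, reg=1e-3, epochs=200):
    """Full-batch sub-gradient descent on an L2-regularized hinge loss (stand-in solver)."""
    for _ in range(epochs):
        active = y * (X @ w) < 1  # examples violating the margin
        grad = reg * w - (y[active][:, None] * X[active]).sum(axis=0) / len(y)
        w = w - lr * grad
    return w

def train_mil(pos_bags, neg_bags, init_idx, rounds=5):
    """Alternate between re-fitting w and re-selecting the latent instance of each positive bag."""
    w = np.zeros(pos_bags[0].shape[1])
    latent = list(init_idx)   # e.g. the top mask-out proposal of each positive image
    neg = np.vstack(neg_bags)  # every instance of a negative bag is a negative sample
    for _ in range(rounds):
        pos = np.stack([bag[z] for bag, z in zip(pos_bags, latent)])
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
        w = fit_linear_hinge(X, y, w)
        latent = [bag_score(w, bag)[1] for bag in pos_bags]
    return w, latent
```

In this sketch a wrong initial selection in one positive bag can be corrected in a later round, because the scorer fitted on the remaining bags ranks the true positive instance of that bag highest. This is the behavior that the re-localization step relies on.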

The top row in Fig. 1 illustrates the idea of confident proposal mining, which starts from the mask-out strategy and ends with the highly confident output of MIL.

3.2 Proposal Subset Optimization

Object detection methods based on proposal selection suffer from one severe issue: multiple overlapping bounding boxes correspond to the same object. To select the best bounding box for each object, greedy non-maximum suppression (NMS) is widely employed as the final step, which selects the top-scoring bounding box $b_m$ and discards the other bounding boxes $b_i$ whose overlap with the chosen one is larger than a threshold $N_t$. For simplicity, NMS mainly focuses on the detection score $s_i$. By taking the Intersection over Union (IoU) as the measure of overlap, this non-maximum suppression process can be described as

$$ s_i = \begin{cases} s_i, & \mathrm{IoU}(b_m, b_i) < N_t, \\ 0, & \mathrm{IoU}(b_m, b_i) \geq N_t. \end{cases} \tag{3} $$

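However, since no instance-level labels are available for network training in the weakly supervised localization task, even the bounding boxes with top scores tend to be noisy. To overcome this issue, we propose a subset optimization scheme. It re-weights the detection scores among the bounding boxes with high but noisy initial scores, a situation in which greedy NMS is not able to adjust the estimated bounding box accordingly. The proposed approach is similar to that described in [42]; however, we employ the method to solve the weakly supervised learning problem. The confident proposals with high detection scores are grouped into clusters by jointly considering the scores and the spatial overlaps between the proposals. The bounding box set is represented by $B = \{b_i\}_{i=1}^{n}$, and a binary vector $O = (O_i)_{i=1}^{n}$ indicates which boxes are selected as outputs. We denote the group membership as $X = (x_i)_{i=1}^{n}$, where $x_i = j$ if $b_i$ belongs to a cluster represented by $b_j$, and $x_i = 0$ if $b_i$ is assigned to the background. Then one exemplary bounding box is selected from each cluster as the final output. This is formulated as finding the maximum a posteriori (MAP) solution of the joint distribution $P(O, X \mid I; B, S)$, which tends to assign large values to the bounding boxes that have large overlap with more confident proposals. After taking the log of the posterior, the objective function becomes:

$$ \max_{O, X} \; \sum_{i=1}^{n} w_i(x_i) - \phi \lVert O \rVert_0 - \frac{\gamma}{2} \sum_{i \neq j} O_i O_j K(b_i, b_j), \quad \text{s.t. } x_i \in \{ j : O_j = 1 \} \cup \{0\}, \tag{4} $$

$$ w_i(x_i = j) = \begin{cases} \log\big(Z_i \lambda\big), & j = 0, \\ \log\big(Z_i K(b_i, b_j)\, s_i\big), & \text{otherwise}, \end{cases} \tag{5} $$

where $K(b_i, b_j)$ is the window IoU used to measure the spatial overlap between $b_i$ and $b_j$, $S = \{s_i\}_{i=1}^{n}$ is the score set containing the detection scores of all the bounding boxes, $Z_i$ is the normalization constant, and $\lambda$ is the constant likelihood of assigning a box to the background, as in [42]. Parameters $\phi$ and $\gamma$ control the penalty level: $\phi$ penalizes the number of selected boxes and $\gamma$ penalizes the overlap among them. Note that our proposal subset optimization method takes both the scores and the overlaps into consideration, since the detection scores in the weakly supervised task are not always reliable.

The proposal subset optimization problem is defined as:

$$ \max_{O} \; h(O) = \sum_{i=1}^{n} \max_{x_i \in \{ j : O_j = 1 \} \cup \{0\}} w_i(x_i) - \phi \lVert O \rVert_0 - \frac{\gamma}{2} \sum_{i \neq j} O_i O_j K(b_i, b_j). \tag{6} $$

In this setting, we first maximize the objective function over $X$ according to Eq. (4), which will select the cluster centers. Then, a greedy algorithm is used to choose a minimal number of bounding boxes as the outputs based on Eq. (6). More details of the method can be found in [42].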
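
To make the contrast with greedy NMS concrete, the following Python sketch implements both selection rules on a handful of boxes. It is a simplified reading of the formulation in [42], not the authors' code: each box is assigned either to a selected center $b_j$, with fit $\log(K(b_i, b_j)\, s_i)$, or to the background, with fit $\log \lambda$; a count penalty and a pairwise overlap penalty are subtracted; and centers are added greedily while the objective improves. The normalization constants are dropped because they do not change the maximizer, and the values of `lam`, `phi`, and `gamma` are illustrative assumptions.

```python
import numpy as np

def iou_matrix(boxes):
    """Pairwise IoU for boxes given as rows of (x1, y1, x2, y2)."""
    b = np.asarray(boxes, dtype=float)
    area = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    x1 = np.maximum(b[:, None, 0], b[None, :, 0])
    y1 = np.maximum(b[:, None, 1], b[None, :, 1])
    x2 = np.minimum(b[:, None, 2], b[None, :, 2])
    y2 = np.minimum(b[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    return inter / (area[:, None] + area[None, :] - inter)

def greedy_nms(boxes, scores, thresh=0.5):
    """Baseline: keep the top-scoring box, drop boxes overlapping it by >= thresh, repeat."""
    K = iou_matrix(boxes)
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        m = order.pop(0)
        keep.append(int(m))
        order = [i for i in order if K[m, i] < thresh]
    return keep

def subset_objective(selected, W, K, log_lam, phi, gamma):
    """Best assignment of every box to a selected center or to background, minus penalties."""
    sel = list(selected)
    fit = np.full(W.shape[0], log_lam)  # background assignment for every box
    overlap = 0.0
    if sel:
        fit = np.maximum(fit, W[:, sel].max(axis=1))
        overlap = K[np.ix_(sel, sel)].sum() - len(sel)  # pairs with i != j (diagonal IoU is 1)
    return fit.sum() - phi * len(sel) - 0.5 * gamma * overlap

def subset_optimization(boxes, scores, lam=0.1, phi=1.0, gamma=1.0):
    """Greedily add the center that increases the objective most; stop when nothing helps."""
    K = iou_matrix(boxes)
    s = np.asarray(scores, dtype=float)
    W = np.log(np.maximum(K * s[:, None], 1e-12))  # W[i, j] = log(K(b_i, b_j) * s_i)
    log_lam = np.log(lam)
    selected = []
    best = subset_objective(selected, W, K, log_lam, phi, gamma)
    while True:
        gains = [(subset_objective(selected + [j], W, K, log_lam, phi, gamma), j)
                 for j in range(len(s)) if j not in selected]
        if not gains or max(gains)[0] <= best:
            return selected
        best, j = max(gains)
        selected.append(j)
```

For three mutually overlapping boxes with scores 0.9, 0.85, and 0.8 plus one isolated low-score box, greedy NMS with a threshold of 0.5 keeps the top-scoring box, the third box (its IoU with the first is below the threshold), and the isolated box. The subset optimization sketch instead returns only the middle box, which overlaps most with its confident neighbors, and assigns the isolated low-score box to the background.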

3.3 Object Detector Learning

In this step, we adapt the pre-trained classification network to an object detection network. This neural network is trained with the pseudo-labeled proposals obtained from the proposal subset optimization strategy. We employ the re-weighting and re-training strategy for network adaptation. The network parameters are fine-tuned for object localization, as illustrated at the bottom of Fig. 1. We organize this process as an adaptive refinement of the proposal subset, which is similar to curriculum learning. However, we do not separate the training dataset into easy and hard parts. We start by running MIL, which is initialized with the results from the mask-out strategy. This leads to a reasonable first detection model $f_1$. We move forward by running proposal subset optimization on the proposal subset with high detection scores, which produces a re-weighted proposal subset. The process then moves on to the second training iteration, where the training dataset consists of re-weighted proposals with more confident pseudo labels. As a result, the refined model $f_2$ will localize the objects better than $f_1$, as it is trained with better supervision in the re-training step. The process iteratively moves on to the next round, starting from the detection model $f_k$ and yielding a better one, $f_{k+1}$. The whole training procedure is described in Algorithm 1.

The selected results from each strategy are shown in Fig. 2. It is demonstrated that the bounding boxes selected by the fully adapted detection network exactly contain the objects, while the bounding boxes selected by mask-out strategy and MIL contain the object but with a large margin. By re-weighting the confident proposals according to the detection scores and the overlap of the proposals, the re-training strategy is able to generate more confident proposals.

Algorithm 1: The training pipeline of the proposed algorithm.
Input: Images $I$; candidate boxes $B$; the corresponding scores $S$; $K$, the refinement times; $M$, the network iteration times; $\theta$, network parameters.
Output: A fully adaptive detection network.
Classify the real images and the mask-out images with the classification network; select the top M proposals by Eq. (1) as the initial proposal set $P_0$.
For each category, construct positive and negative bags within $P_0$; train the MIL model and collect the detection results from the trained MIL model as proposal set $P_1$.
for $k = 1, \dots, K$ do
    Set the training set to the current proposal set $P_k$.
    for $m = 1, \dots, M$ do
        Sample a minibatch from $P_k$.
        Network forward propagation and get loss $L(\theta)$.
        Network backward propagation, $\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta)$, with learning rate $\eta$.
    end for
    Collect the detection results from the trained detection network, $D_k$.
    Choose the proposals with subset optimization by Eq. (6); update the proposal set $P_{k+1}$.
end for
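
The control flow of Algorithm 1 can be summarized by the following Python skeleton. It is only a sketch of the loop structure: every callable (`mine_with_maskout`, `refine_with_mil`, `train_detector`, `run_detector`, `optimize_subset`) is a hypothetical placeholder supplied by the user, and the default number of refinement rounds is an assumption rather than a value reported in the paper.

```python
def train_adaptive_detector(images, mine_with_maskout, refine_with_mil,
                            train_detector, run_detector, optimize_subset,
                            refinement_rounds=3):
    """Re-weighting and re-training loop; all callables are user-supplied placeholders."""
    # Confident proposal mining: mask-out scores first, then per-class MIL.
    proposals = mine_with_maskout(images)
    proposals = refine_with_mil(images, proposals)

    detector, history = None, []
    for _ in range(refinement_rounds):
        # Re-train: fine-tune from the previous model using the current pseudo labels.
        detector = train_detector(images, proposals, detector)
        # Re-localize: collect detections, then re-weight them by subset optimization.
        detections = run_detector(detector, images)
        proposals = optimize_subset(detections)
        history.append(proposals)
    return detector, history
```

Passing the previous detector into `train_detector` mirrors the description that each round starts from the current detection model and yields a better one; `history` keeps the proposal set of every round so that the denoising progress can be inspected.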

4 Experimental Evaluation

Dataset and settings: The proposed approach is extensively evaluated on two publicly available datasets: PASCAL VOC 2007 and PASCAL VOC 2012. Both contain images from 20 object classes. We employ both AlexNet [17] and VGGNet [27] as our base CNN models, initialized with parameters transferred from the classification network, which is pre-trained on the ImageNet dataset. As an initialization step for class-specific proposal mining, we use Edge Boxes [45] to generate 2,000 object proposals for each image. The mask-out strategy is first utilized to remove most of the noisy proposals and return the top 50 confident proposals. These selected proposals serve as the input for multiple instance learning. At the re-training stage, the network is trained with the SGD solver at a learning rate of 0.0001 for 40k iterations.

Figure 3: Comparison of detection mean average precision (mAP) between our method (AlexNet) and other methods on the PASCAL VOC 2007 dataset. Our method, with an mAP of 36.1%, significantly outperforms the other methods for most of the categories.
| Dataset | Network | Method | CorLoc | mAP |
|---|---|---|---|---|
| VOC 2007 | AlexNet | Li et al. [20] | 49.8 | 31.0 |
| VOC 2007 | AlexNet | Shi et al. [26] | 60.9 | 36.0 |
| VOC 2007 | AlexNet | Our scheme | 53.4 | 37.2 |
| VOC 2007 | VGGNet | Li et al. [20] | 52.4 | 39.5 |
| VOC 2007 | VGGNet | Shi et al. [26] | 64.7 | 37.2 |
| VOC 2007 | VGGNet | Bilen et al. [3] | 53.5 | 34.8 |
| VOC 2007 | VGGNet | Jie et al. [16] | 56.1 | 40.8 |
| VOC 2007 | VGGNet | Tang et al. [31] | 60.6 | 41.2 |
| VOC 2007 | VGGNet | Our scheme | 55.9 | 40.9 |
| VOC 2012 | AlexNet | Li et al. [20] | - | 22.4 |
| VOC 2012 | AlexNet | Our scheme | - | 25.3 |
| VOC 2012 | VGGNet | Li et al. [20] | - | 29.1 |
| VOC 2012 | VGGNet | Jie et al. [16] | 54.8 | 38.5 |
| VOC 2012 | VGGNet | Tang et al. [31] | 62.1 | 37.9 |
| VOC 2012 | VGGNet | Our scheme | 55.2 | 35.2 |

Table 1: Quantitative comparison in terms of detection mean average precision (mAP, %) on the PASCAL VOC 2007 test set and correct localization (CorLoc, %) on the PASCAL VOC 2007 trainval set using AlexNet or VGGNet. The last rows show the mAP on the PASCAL VOC 2012 val set.

Evaluation metrics: To quantitatively evaluate the performance of the proposed method, we use two metrics, which are applied at the training and testing stages, respectively. On the training dataset, we compute the percentage of images in which we obtain a correct localization (CorLoc) [9]. On the test dataset, we evaluate the performance of the object detector using mean average precision (mAP), a standard metric used in PASCAL VOC. For both metrics, we consider a bounding box correct if it has an IoU ratio of at least 50% with the ground-truth object annotation.
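
As an illustration of the CorLoc criterion, the following sketch computes it for one class. The data layout is an assumption made for the example: `top_boxes` maps an image identifier to the single top-scoring box predicted for the class, and `gt_boxes` maps every positive image to its list of ground-truth boxes of that class; the function names are our own.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def corloc(top_boxes, gt_boxes, thresh=0.5):
    """Fraction of positive images whose top box overlaps a ground-truth box by >= thresh."""
    hits = 0
    for image_id, gts in gt_boxes.items():
        box = top_boxes.get(image_id)  # an image without a prediction counts as a miss
        if box is not None and any(iou(box, g) >= thresh for g in gts):
            hits += 1
    return hits / len(gt_boxes)
```

For example, a predicted box (1, 1, 11, 11) against a ground-truth box (0, 0, 10, 10) has an IoU of 81/119, which is about 0.68, so that image counts as correctly localized at the 50% threshold.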

Comparison with the state-of-the-art algorithms: We compared the proposed algorithm with the state-of-the-art methods dealing with the weakly supervised object localization problem [29]. None of them use strong labels for training.

Fig. 3 shows the performance comparison between our proposed method, developed with AlexNet as the baseline, and the state-of-the-art WSL works [2, 20, 29] on the VOC 2007 dataset. The models from Song et al. [29] and Bilen et al. [2] are MIL-based approaches with advanced model initialization. Our method builds on that of Li et al. [20]. Moreover, Tang et al. [31] propose an on-line instance classifier refinement, which classifies a fixed-size feature produced by several convolutional (conv) layers followed by a spatial pyramid pooling (SPP) layer. As the classifier is trained with the features from the SPP net, this model takes advantage of a better initialization. In an entirely different way, we progressively adapt a classification network to an object detection network with denoised proposals as the pseudo-strong labels. Such domain adaptation helps to learn a better object detector from image-level annotated data. Unlike previous works that rely on noisy proposals to localize the object candidates, we mine finer, class-specific proposals with the proposed workflow, which integrates the mask-out strategy, MIL, and proposal subset optimization. In addition, the re-training and re-weighting strategy allows the model to be fully adapted.

Figure 4: Performance of the VGG16 version over different IoU thresholds on PASCAL VOC 2007.
Re-train (mAP)
AlexNet 31.0 34.1 36.8 37.2 37.2
VGGNet 38.5 39.9 40.3 40.9 40.9
Table 2: Quantitative comparison in terms of detection mean average precision (mAP) on the PASCAL VOC 2007 test set for different re-training steps with AlexNet or VGGNet.

By incorporating the proposal subset optimization, the proposed model significantly outperforms other methods in terms of mAP for most of the categories. In Table 1, we make comparisons in terms of both CorLoc and mAP on the training and testing sets of the VOC 2007 dataset, respectively. In addition, we present the mAP on the val set of VOC 2012. For the other baseline methods, we list the best performances of the AlexNet and VGGNet models, as reported in the respective papers. Based on VGGNet, our method achieves 40.9% mAP on the VOC 2007 test set and 35.2% mAP on the VOC 2012 val set. It is also evident from Table 1 that the detection performance is significantly improved by using a deeper network. Note that the method introduced by Jie et al. [16] is a regional CNN detector (Fast R-CNN [11]). This model, trained on seed samples, is sufficiently powerful for selecting the most confident tight positives and is able to further train itself with the optimized proposals. We compare our method against this Fast R-CNN based method by listing the results in Table 1. On VOC 2007, our model obtains performance similar to that of this method.

In addition to the standard IoU threshold used for evaluation, we analyze the influence of different IoU thresholds in Fig. 4. Setting the IoU threshold to 0.5 achieves the best performance, and the results are not very sensitive to nearby values: when the threshold is changed from 0.5 to 0.6, the performance drops only slightly.

| Dataset | BBox-initialization: Mask-out | BBox-initialization: MIL | Re-weighting: Subset Optimization | Re-training: AlexNet | Re-training: VGG16 | Re-training: VGG19 |
|---|---|---|---|---|---|---|
| VOC 2007 | 3 | 24 | 2 | 4 | 7 | 9 |
| VOC 2012 | 7 | 36 | 3 | 7 | 14 | 17 |

Table 3: Quantitative comparison in terms of computational time (hours) on the PASCAL VOC 2007 and 2012 training sets for different strategies.

Impact of re-training strategy: The re-training strategy we have used so far is straightforward. The process establishes an order that adaptively optimizes the refined proposals, and then fine-tunes the detection network with the confident proposals. We notice that the proposals used to fine-tune the network are critical for training the detection baseline. It is therefore promising to improve the annotation in an adaptive way.

We use the same settings during the re-training stage as we adapt the classification network to a detection network. After training the detection network, we select the top 30 detection results and optimize them with the proposal subset optimization. Consequently, the training dataset is adaptively denoised and we obtain a better detection network. Table 2 demonstrates that the mAP is increased from 31.0% to 37.2% for the AlexNet and from 38.5% to 40.9% for the VGGNet.

Figure 5: Sample detection results. Green boxes indicate ground-truth annotations. Red boxes indicate correct detections (with IoU ≥ 0.5). The sample images show correct detections from different classes.
Figure 6: Sample images showing wrong detections due to imprecise localization. Green boxes indicate ground-truth annotations. Red boxes indicate imprecise detections (with IoU < 0.5).

Computational time analysis: We report the evaluation results on PASCAL VOC 2007 and PASCAL VOC 2012 in the paper. The re-training is conducted with AlexNet, VGG16, and VGG19. The training time of the experiments largely depends on the hardware resources. We train and evaluate the proposed method on a cluster using an Intel Xeon(R) CPU E5-1607 v2 3.00GHz × 4 and four K80 GPUs with 12 GB memory. To reduce the training time of MIL, we employ 12 CPUs to train the MIL separately for each category of the 21 classes. The training time of the experiments is shown in Table 3.

Error analysis: Fig. 5 shows some samples with accurate detections and Fig. 6 shows several examples with wrong detections. Our model often detects the correct objects in the image, since we train the detector with a proposal subset optimization that reduces localization inaccuracy. Most models for the WSL task may fail to predict a sufficiently tight bounding box [20]. The adaptive denoising part of Fig. 1 illustrates how the proposals are adaptively selected so that they gradually converge to the ground-truth annotations. Nonetheless, the proposed model still has limitations, as shown by the wrong detections in Fig. 6. This is because our proposal subset optimization still depends on the detection scores, even though it also incorporates the overlaps of the proposals.

5 Conclusion

We have proposed a novel model that integrates adaptive proposal denoising strategies to handle the weakly supervised object localization problem. The approach first selects confident proposals from the output of the MIL framework as the starting point for training the detection network. At the training stage, we first adapt a pre-trained classification network to a detection network with highly confident proposals, and then re-weight the detection results with the proposal subset optimization method. The re-weighted proposals are used to re-train the network, resulting in a detection network that achieves competitive performance on the PASCAL VOC datasets. As a follow-up study, it is desirable to adopt a new feature extraction method for the weakly supervised localization task. It would be interesting to add an attention mechanism that helps obtain attended features. We would like to introduce a module that effectively and efficiently extracts purified features.


  • [1] A. Bergamo, L. Bazzani, D. Anguelov, and L. Torresani (2014) Self-taught object localization with deep networks. arXiv. Cited by: §1.
  • [2] H. Bilen, M. Pedersoli, and T. Tuytelaars (2015) Weakly supervised object detection with convex clustering. In CVPR, Cited by: §1, §2, §2, §4.
  • [3] H. Bilen and A. Vedaldi (2015) Weakly supervised deep detection networks. arXiv preprint arXiv:1511.02853. Cited by: §1, §1, §2, §2, §2, Table 1.
  • [4] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis (2017) Improving object detection with one line of code. arXiv preprint arXiv:1704.04503. Cited by: §2.
  • [5] F. Cen and G. Wang (2019) Dictionary representation of deep features for occlusion-robust face recognition. IEEE Access. Cited by: §2.
  • [6] R. G. Cinbis, J. Verbeek, and C. Schmid (2014) Multi-fold mil training for weakly supervised object localization. In CVPR, Cited by: §1.
  • [7] R. G. Cinbis, J. Verbeek, and C. Schmid (2017) Weakly supervised object localization with multi-fold multiple instance learning. IEEE transactions on pattern analysis and machine intelligence 39 (1). Cited by: §2, §2.
  • [8] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §1.
  • [9] T. Deselaers, B. Alexe, and V. Ferrari (2012) Weakly supervised localization and learning with generic knowledge. IJCV 100 (3). Cited by: §1, §4.
  • [10] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. IJCV 111 (1). Cited by: §1.
  • [11] R. Girshick (2015) Fast r-cnn. In ICCV, Cited by: §2, §2, §4.
  • [12] M. Guillaumin and V. Ferrari (2012) Large-scale knowledge transfer for object localization in imagenet. In CVPR, Cited by: §2.
  • [13] L. He, G. Wang, and Z. Hu (2018) Learning depth from single images with deep neural network embedding focal length. IEEE Transactions on Image Processing 27 (9), pp. 4676–4689. Cited by: §2.
  • [14] J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko (2014) LSDA: large scale detection through adaptation. In NIPS, Cited by: §2.
  • [15] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang (2015) Multimodal deep autoencoder for human pose recovery. IEEE Transactions on Image Processing 24 (12), pp. 5659–5670. Cited by: §2.
  • [16] Z. Jie, Y. Wei, X. Jin, J. Feng, and W. Liu (2017) Deep self-taught learning for weakly supervised object localization. In CVPR, Cited by: §1, §2, Table 1, §4.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, Cited by: §4.
  • [18] B. Lai and X. Gong (2017) Saliency guided end-to-end learning for weakly supervised object detection. arXiv preprint arXiv:1706.06768. Cited by: §2.
  • [19] M. Lapin, M. Hein, and B. Schiele (2014) Learning using privileged information: svm+ and weighted svm. Neural Networks 53. Cited by: §2.
  • [20] D. Li, J. Huang, Y. Li, S. Wang, and M. Yang (2016) Weakly supervised object localization with progressive domain adaptation. In CVPR, Cited by: §1, §1, §2, §3.1, Table 1, §4, §4.
  • [21] X. Liu, M. Song, L. Zhang, D. Tao, J. Bu, and C. Chen (2012) Pedestrian detection using a mixture mask model. In Proceedings of 2012 9th IEEE International Conference on Networking, Sensing and Control, pp. 271–276. Cited by: §2.
  • [22] W. Ma, Y. Wu, Z. Wang, and G. Wang (2018) Mdcn: multi-scale, deep inception convolutional neural networks for efficient object detection. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2510–2515. Cited by: §2.
  • [23] A. Pentina, V. Sharmanska, and C. H. Lampert (2015) Curriculum learning of multiple tasks. In CVPR, Cited by: §1, §2.
  • [24] V. Sharmanska, N. Quadrianto, and C. H. Lampert (2013) Learning to rank using privileged information. In ICCV, Cited by: §2.
  • [25] M. Shi, H. Caesar, and V. Ferrari (2017) Weakly supervised object localization using things and stuff transfer. arXiv preprint arXiv:1703.08000. Cited by: §2.
  • [26] M. Shi and V. Ferrari (2016) Weakly supervised object localization using size estimates. In ECCV, Cited by: §1, §1, §2, §2, Table 1.
  • [27] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.
  • [28] P. Siva, C. Russell, and T. Xiang (2012) In defence of negative mining for annotating weakly labelled data. In ECCV, Cited by: §2.
  • [29] H. O. Song, R. B. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, T. Darrell, et al. (2014) On learning to localize objects with minimal supervision. In ICML, Cited by: §1, §2, §2, §4, §4.
  • [30] H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell (2014) Weakly-supervised discovery of visual pattern configurations. In NIPS, Cited by: §2.
  • [31] P. Tang, X. Wang, X. Bai, and W. Liu (2017) Multiple instance detection network with online instance classifier refinement. In CVPR, Cited by: §1, §2, Table 1, §4.
  • [32] A. Toshev and C. Szegedy (2014) Deeppose: human pose estimation via deep neural networks. In CVPR, Cited by: §1.
  • [33] J. Wu, Y. Yu, C. Huang, and K. Yu (2015) Deep multiple instance learning for image classification and auto-annotation. In CVPR, Cited by: §1.
  • [34] W. Xu, D. Choi, and G. Wang (2018) Direct visual-inertial odometry with semi-dense mapping. Computers & Electrical Engineering 67, pp. 761–775. Cited by: §2.
  • [35] W. Xu and D. Choi (2016) Direct visual-inertial odometry and mapping for unmanned vehicle. In International Symposium on Visual Computing, pp. 595–604. Cited by: §2.
  • [36] W. Xu, S. Keshmiri, and G. Wang (2019) Toward learning a unified many-to-many mapping for diverse image translation. Pattern Recognition 93, pp. 570–580. Cited by: §2.
  • [37] W. Xu, K. Shawn, and G. Wang (2019) Adversarially approximated autoencoder for image generation and manipulation. IEEE Transactions on Multimedia. Cited by: §2.
  • [38] W. Xu, K. Shawn, and G. Wang (2019) Stacked wasserstein autoencoder. Neurocomputing 363, pp. 195–204. Cited by: §2.
  • [39] Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao (2018) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE transactions on neural networks and learning systems (99), pp. 1–13. Cited by: §2.
  • [40] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In ECCV, Cited by: §3.1.
  • [41] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In CVPR, Cited by: §1.
  • [42] J. m. Zhang, S. Sclaroff, Z. Lin, X. H. Shen, B. Price, and R. Mech (2016) Unconstrained salient object detection via proposal subset optimization. In CVPR, Cited by: §1, §3.2, §3.2.
  • [43] Y. Zhang, K. Sohn, R. Villegas, G. Pan, and H. Lee (2015) Improving object detection with deep convolutional networks via bayesian optimization and structured prediction. In CVPR, Cited by: Figure 1.
  • [44] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler (2015) Segdeepm: exploiting segmentation and context in deep neural networks for object detection. In CVPR, Cited by: Figure 1.
  • [45] C. L. Zitnick and P. Dollár (2014) Edge boxes: locating object proposals from edges. In ECCV, Cited by: §2, §3.1, §4.