Weakly Supervised Localization Using Min-Max Entropy: An Interpretable Framework
Abstract
Weakly supervised object localization (WSOL) models aim to locate objects of interest in an image after being trained only on data with coarse image-level labels. Deep learning models for WSOL typically rely on convolutional attention maps with no constraints on the regions of interest, which allows these models to select any region, making them vulnerable to false positive regions and inconsistent predictions. This issue matters in many application domains, e.g., medical image analysis, where interpretability is central to the prediction process. In order to improve localization reliability, we propose a deep learning framework for WSOL with pixel-level localization. Our framework is composed of two sequential sub-networks: a localizer that localizes regions of interest, followed by a classifier that classifies these regions. Within its end-to-end training, we incorporate the prior knowledge that, in a class-agnostic setup, an image is likely to contain relevant regions (i.e., the object of interest) and irrelevant regions (i.e., noise, background). Based on the conditional entropy measured at the classifier level, the localizer is driven to spot relevant regions, identified by low conditional entropy, and irrelevant regions, identified by high conditional entropy. Our framework is able to recover large, and even complete, discriminative regions in an image using a recursive erasing algorithm that we incorporate within the backpropagation during training. Moreover, the framework intrinsically handles multiple instances. Experimental results on public datasets of medical images (GlaS colon cancer) and natural images (Caltech-UCSD Birds-200-2011, Oxford flower 102) show that, compared to state-of-the-art WSOL methods, the proposed approach can provide significant improvements in terms of image-level classification and pixel-level localization. Our framework also shows robustness to overfitting when dealing with few training samples.
These performance improvements are due in large part to our framework's effectiveness at disregarding irrelevant regions. A public reproducible PyTorch implementation is provided at https://github.com/sbelharbi/wsol-min-max-entropy-interpretability.
1 Introduction
Object localization (isolating an object of interest by providing the coordinates of its surrounding bounding box; in this work, it is also understood as providing a pixel-level segmentation of the object, which is a more accurate localization; to avoid confusion when presenting the literature, we specify the case being considered) can be considered one of the most fundamental tasks in image understanding, as it provides crucial clues for challenging visual problems such as object detection and semantic segmentation. Deep learning methods, and particularly convolutional neural networks (CNNs), are driving recent progress in these tasks. Nevertheless, despite their remarkable performance, a downside of these methods is the large amount of labeled data required for training, which is time consuming to collect and prone to observer variability. To overcome this limitation, weakly supervised learning (WSL) has recently emerged as a surrogate for extensive annotation of training data zhou2017brief . WSL involves scenarios where training is performed with inexact or uncertain supervision. In the context of object localization or semantic segmentation, weak supervision typically comes in the form of image-level tags KERVADEC201988 ; kim2017two ; pathak2015constrained ; teh2016attention ; wei2017object , scribbles Lin2016 ; ncloss:cvpr18 , or bounding boxes Khoreva2017 .
In WSOL, current state-of-the-art methods for object localization and semantic segmentation rely heavily on classification activation maps produced by convolutional networks in order to localize regions of interest zhou2016learning , which can also be used as an interpretation of the model's decision Zhang2018VisualInterp . Different lines of work in the WSOL field aim to alleviate the need for pixel-level annotation. Bottom-up methods rely on the input signal to locate the object of interest; such methods include spatial pooling techniques over activation maps durand2017wildcat ; oquab2015object ; sun2016pronet ; zhang2018adversarial ; zhou2016learning , multi-instance learning ilse2018attention , and attend-and-erase based methods kim2017two ; LiWPE018CVPR ; pathak2015constrained ; SinghL17 ; wei2017object . While such methods provide pixel-level localization, other methods, named weakly supervised object detectors, have been introduced to predict a bounding box instead bilen2016weakly ; kantorov2016contextlocnet ; shen2018generative ; tang2017multiple ; wan2018min . Inspired by human visual attention, top-down methods, which rely on the input signal and a selective backward signal to determine the corresponding object, have also been proposed, including special feedback layers cao2015look , backpropagation error zhang2018top , and Grad-CAM ChattopadhyaySH18wacv ; selvaraju2017grad , which uses the gradient of the object class with respect to the activation maps.
Within a class-agnostic setup, an input image often contains the object of interest among other parts, such as noise, background, and other irrelevant subjects. Most of the aforementioned methods do not consider such a prior, and feed the entire image to the model. When the object of interest has a common shape/texture/color across images, a model ignoring this prior may still be able to localize its most discriminative part easily oquab2015object ; this is the case for natural images, for instance. However, when the object can appear with a different and random shape/structure, or may have a texture/color relatively similar to the irrelevant parts, the model can easily confuse the object with the irrelevant parts. This is mainly due to the fact that the network is free to select any area of the image as a region of interest, as long as the selected region reduces the classification loss. Such free selection can lead to many false positive regions and inconsistent localization. This issue can be further understood from the point of view of feature selection and sparsity tibshirani2015statistical : instead of selecting relevant features, the model is required to select a set of pixels (i.e., raw features) representing the object of interest. Since the only constraint optimized during such selection is the minimization of the classification loss, and without other priors or pixel-level supervision, the optimization may converge to a model that selects any random subset of pixels as long as the loss is minimized. This does not guarantee that the selected pixels represent an object, nor the correct object, nor that they even make sense to us (wan2018min argue that there is an inconsistency between the classification loss and the task of WSOL, and that the typical optimization may reach suboptimal solutions with considerable randomness in them).
From an optimization perspective, it does not matter which set of pixels is selected (with respect to interpretability); what matters is to obtain a minimal loss. In practice, in deep WSOL, this often results in localizing the smallest (i.e., sparsest) common discriminative region of the object, such as a dog's face for the object 'dog' kim2017two ; SinghL17 ; zhou2016learning , which makes sense since localizing the dog's face can be statistically sufficient to discriminate the 'dog' class from other objects. Once such a region is located, the classification loss may reach its minimum, and the model stops learning.
False positive regions can be problematic in critical domains such as medical applications, where interpretability plays a central role in trusting and understanding an algorithm's prediction. To address this important issue, and motivated by the importance of using prior knowledge in learning to alleviate overfitting when training with few samples sbelharbiarxivsep2017 ; krupka2007incorporating ; mitchell1980need ; yu2007incorporating , we propose to use the aforementioned prior (i.e., an image is likely to contain relevant and irrelevant regions) in order to favor models that behave as such. To this end, we constrain the model to learn to localize both relevant and irrelevant image regions simultaneously, in an end-to-end manner, within a weakly supervised scenario where only image-level labels are used for training. We model the relevant (i.e., discriminative) regions as the complement of the irrelevant (i.e., non-discriminative) regions (Fig.1). Our model is composed of two sub-models: (1) a localizer that aims at localizing regions of interest by predicting a latent mask, and (2) a classifier that aims at classifying the visible content of the input image through the latent mask. The localizer is trained, by employing the conditional entropy coverentropy2006 , to simultaneously identify (1) relevant regions where the classifier has high confidence with respect to the image label, and (2) irrelevant regions where the classifier is unable to decide which image label to assign. This modeling allows the discriminative regions to pop out and be used to assign the corresponding image label, while suppressing non-discriminative areas, leading to more reliable predictions. In order to localize complete discriminative regions, we extend our proposal by training the localizer to recursively erase discriminative parts during training. To this end, we propose a recursive erasing algorithm that we incorporate within the backpropagation.
At each recursion, and within the backpropagation, the algorithm localizes the most discriminative region, stores it, then erases it from the input image. At the end of the final recursion, the model has gathered a large extent of the object of interest, which is then fed to the classifier. Thus, our model is driven to localize complete relevant regions while discarding irrelevant ones, resulting in more reliable object localization. Moreover, since the discriminative parts are allowed to extend over different instances, the proposed model handles multiple instances natively.
The main interest of predicting a mask is the higher (pixel-level) localization precision, as opposed to localization at the bounding-box level, which gives a coarse localization that still contains false positive pixels, even in ground truth annotations. In some applications, such as medical imaging, object localization may require a higher level of precision, e.g., localizing cells, boundaries, and organs, which may have unstructured shapes and different scales that a bounding box may be unable to localize precisely. In such cases, a pixel-level localization, as in our proposal, can be more useful than a bounding box. The illustrative example shown in Fig.1 (bottom row) shows a situation where using a bounding box to localize the glands is clearly problematic. Similar examples can be found with natural images. Therefore, segmentation metrics (such as the Dice index) are considered for evaluation in our experiments, instead of standard object localization metrics such as mAP (mean Average Precision). As a consequence, our choice of datasets is constrained by the availability of both image-level and pixel-level labels.
The main contribution of this paper is a new deep learning framework for weakly supervised object localization at the pixel level. Our framework is composed of two sequential sub-networks, where the first one localizes regions of interest and the second one classifies them. Based on conditional entropy, the end-to-end training of this framework allows incorporating the prior knowledge that, in a class-agnostic setup, an image is likely to contain relevant regions (object of interest) and irrelevant regions (noise, background). Given the conditional entropy measured at the classifier level, the localizer is driven to localize relevant regions (with low conditional entropy) and irrelevant regions (with high conditional entropy). Such localization is achieved with the main goal of providing more interpretable and reliable regions of interest. This paper also contributes a recursive erasing algorithm that is incorporated within backpropagation, along with a practical implementation, in order to obtain complete discriminative regions. Finally, we conduct an extensive series of experiments on public image datasets of medical and natural scenes, where the results show the effectiveness of the proposed approach in terms of pixel-level localization while maintaining competitive accuracy for image-level classification.
2 Background on WSOL
In this section, we briefly review state-of-the-art WSOL methods that aim at localizing objects of interest using only image-level labels as supervision.
Fully convolutional networks with spatial pooling have been shown to be effective at localizing discriminative regions durand2017wildcat ; oquab2015object ; sun2016pronet ; zhang2018adversarial ; zhou2016learning . Multi-instance learning based methods have been used within an attention framework to localize regions of interest ilse2018attention . Since neural networks often provide only small, maximally discriminative regions of the object of interest kim2017two ; SinghL17 ; zhou2016learning , SinghL17 propose to randomly hide large patches in training images in order to force the network to seek other discriminative regions and recover a larger part of the object of interest. wei2017object use the attention map of a trained network to erase the most discriminative part of the original image. kim2017two use a two-phase learning strategy where they combine the attention maps of two networks to obtain a complete region of the object. LiWPE018CVPR propose a two-stage approach where the first network classifies the image and provides an attention map of the most discriminative parts; such attention is used to erase the corresponding parts of the input image, and the resulting erased image is then fed to a second network to make sure that no discriminative parts are left.
Weakly supervised object detection methods have emerged as an approach for localizing regions of interest using bounding boxes instead of pixels. Such approaches rely on region proposals such as EdgeBoxes zitnick2014edge and selective search uijlings2013selective ; van2011segmentation . In teh2016attention , the content of each proposed region is passed through an attention module, then a scoring module, to obtain an average image. bilen2016weakly propose an approach to address multi-class object localization; many improvements of this work have been proposed since then kantorov2016contextlocnet ; tang2017multiple . Other approaches rely on multi-stage training, where in the first stage a network is trained to localize and is then refined in later stages for object detection diba2017weakly ; ge2018multi ; sun2016pronet . In order to reduce the variance of the localization of the boxes, wan2018min propose to minimize an entropy defined on the position of such boxes. shen2018generative propose to use generative adversarial networks to generate the proposals in order to speed up inference, since most region proposal techniques are time consuming.
Inspired by human visual attention, top-down methods have been proposed. In Simonyan14a ; DB15a ; zeiler2014ECCV , the backpropagation error is used in order to visualize saliency maps over the image for the predicted class. In cao2015look , an attention map is built to identify the class-relevant regions using a feedback layer. zhang2018top propose Excitation Backprop, which passes top-down signals downwards through the network hierarchy within a probabilistic framework. Grad-CAM selvaraju2017grad generalizes CAM zhou2016learning using the derivative of the class scores with respect to each location of the feature maps, and has been further generalized in ChattopadhyaySH18wacv . In practice, top-down methods are considered visual explanatory tools, and they can be overwhelming in terms of computation and memory usage, even during inference.
While the aforementioned approaches have shown great success, mostly with natural images, they still lack a mechanism for modeling what is relevant and irrelevant within an image, which is crucial for determining the reliability of the regions of interest. Erase-based methods kim2017two ; LiWPE018CVPR ; pathak2015constrained ; SinghL17 ; wei2017object follow such a concept, where the non-discriminative parts are suppressed through constraints, allowing only the discriminative ones to emerge. Explicitly modeling negative evidence within the model has been shown to be effective in WSOL Azizpour2015SpotlightTN ; durand2017wildcat ; durand2016weldon ; PariziVZF14 .
The technique we propose is related to Behpour2019 ; wan2018min in the sense that both use entropy to explore the input image. However, while wan2018min define an entropy measure over the bounding boxes' positions to minimize its variance, we define an entropic measure over the classifier that is low over discriminative regions and high over non-discriminative ones. Our recursive erasing algorithm follows general erasing and mining techniques kim2017two ; LiWPE018CVPR ; SinghL17 ; wan2018min ; wei2017object , but places more emphasis on mining consistent regions, and is performed on the fly during backpropagation. For instance, compared to wan2018min , our algorithm attempts to expand regions of interest, accumulates consistent regions while erasing, and provides an automatic mechanism to stop erasing for each sample independently, whereas wan2018min aim to locate multiple instances without erasing and use a manual/empirical threshold for assigning confidence to boxes.
Our proposal can also be seen as a supervised dropout srivastava14a . While dropout is applied over a given input image and randomly zeroes out pixels, our proposed approach seeks to zero out irrelevant pixels and preserve only the discriminative ones that support the image label. From that perspective, our proposal mimics a discriminative gate that inhibits irrelevant and noisy regions while allowing only informative and discriminative regions to pass through.
3 The min-max entropy framework for WSOL
3.1 Notations and definitions
Let us consider a training set D = {(X, y)} of N samples, where X is an input image with depth d, height h, and width w, a realization of the discrete random variable 𝐗 with support set 𝒳; y is the image-level label (i.e., the image class), a realization of the discrete random variable 𝐘 with support set 𝒴 = {1, ..., K}. We define a decidable region (in this context, the notion of region indicates one pixel) of an image as any informative part of the image that allows predicting the image label. An undecidable region is any noisy, uninformative, or irrelevant part of the image that provides no indication of, nor support for, the image class. To model such definitions, we consider a binary mask M ∈ {0, 1}^{h×w}, where a location r with value M(r) = 1 indicates a decidable region; otherwise, it is an undecidable region. We model the decidability of a given location r with a binary random variable Z_r; its realization is z_r, and its conditional probability given the input image defines the mask,

M(r) = Pr(Z_r = 1 | X) .   (1)
We denote by M- = 1 - M the binary mask indicating the undecidable regions; we consider the undecidable region as the complement of the decidable one, so we can write ||M||_0 + ||M-||_0 = h·w, where ||·||_0 is the l_0 norm. Following these definitions, an input image can be decomposed into two images as X = X+ + X-, where X+ = X ⊙ M, X- = X ⊙ M-, and ⊙ is the Hadamard product. X+ inherits the image-level label of X; we can thus write the pair (X+, y) in the same way as (X, y). We denote by M̂, M̂-, X̂+, and X̂- the respective estimates of M, M-, X+, and X- (Sec.3.3). We are interested in modeling the true conditional distribution Pr(y|X), where y ∈ 𝒴, and P̂(y|X) is its estimate. Following the previous discussion, predicting the image label depends only on the decidable region, i.e., Pr(y|X) = Pr(y|X+). Thus, knowing X- does not add any knowledge to the prediction, since X- does not contain any information about the image label. This leads to Pr(y|X+, X-) = Pr(y|X+). As a consequence, the image label is conditionally independent of the undecidable region given the decidable region Kollergraphical2009 : 𝐘 ⟂ 𝐗- | 𝐗+, where 𝐗+ and 𝐗- are the random variables modeling the decidable and the undecidable regions, respectively. In the following, we provide more details on how to exploit this conditional independence property in order to estimate Pr(y|X+) and Pr(y|X-).
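The decomposition into decidable and undecidable views can be sketched numerically. The following is a minimal NumPy illustration (the function name and shapes are ours, not from the released code) of splitting an image with a binary mask so that the two views recompose the original image:

```python
import numpy as np

def decompose(x, m):
    """Split an image into decidable and undecidable views.

    x: (d, h, w) input image; m: (h, w) binary mask (1 = decidable).
    Returns (x_plus, x_minus) such that x = x_plus + x_minus.
    """
    m_minus = 1 - m                    # complement mask M- = 1 - M
    x_plus = x * m[None, ...]          # Hadamard product, broadcast over depth
    x_minus = x * m_minus[None, ...]
    return x_plus, x_minus

x = np.random.rand(3, 4, 4)
m = (np.random.rand(4, 4) > 0.5).astype(x.dtype)
xp, xm = decompose(x, m)
assert np.allclose(xp + xm, x)         # the two views recompose the image
```

Since the mask is binary, the l_0 norms of M and its complement always sum to the number of pixels, matching the identity above.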
3.2 Minmax entropy
We consider modeling the uncertainty of the model's prediction over decidable or undecidable regions using conditional entropy (CE). Let us first consider the CE of 𝐘 given the decidable view X+, denoted H(Y|X+) and computed as coverentropy2006 ,

H(Y|X+) = - Σ_{y∈𝒴} Pr(y|X+) log Pr(y|X+) .   (2)
Since the model is required to be certain about its prediction over X+, we constrain it to have low entropy over X+. Eq.2 reaches its minimum when the probability of one of the classes is certain, i.e., equal to 1 coverentropy2006 . Instead of directly minimizing Eq.2, and in order to ensure that the model predicts the correct image label, we cast a supervised learning problem using the cross-entropy between the image-level label y of X and the estimate P̂(y|X+),

L+(X+, y) = - log P̂(y|X+) .   (3)
Eq.3 reaches its minimum under the same conditions as Eq.2, with the true image label as the prediction. We note that Eq.3 is the negative log-likelihood of the sample (X+, y). In the case of X-, we consider the CE of 𝐘 given X-, denoted H(Y|X-) and computed as,

H(Y|X-) = - Σ_{y∈𝒴} Pr(y|X-) log Pr(y|X-) .   (4)
Over irrelevant regions, the model is required to be unable to decide which image class to predict, since there is no evidence to support any class. This can be seen as high uncertainty in the model's decision. Therefore, we consider maximizing the entropy of Eq.4, which reaches its maximum at the uniform distribution coverentropy2006 ; the inability of the model to decide is then reached since each class is equiprobable. An alternative to maximizing Eq.4 is to use a supervised target distribution, since the target is already known (i.e., the uniform distribution). To this end, we define q as the uniform distribution,

q(y) = 1/K , ∀ y ∈ 𝒴 ,   (5)

and cast a supervised learning setup using the cross-entropy between q and P̂(y|X-) over X-,

L-(X-) = - Σ_{y∈𝒴} q(y) log P̂(y|X-) .   (6)
The minimum of Eq.6 is reached when P̂(y|X-) is uniform; thus, Eq.4 reaches its maximum. We can now write the total per-sample training loss to be minimized as,

L(X, y) = L+(X+, y) + L-(X-) .   (7)
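Assuming the classifier outputs class scores passed through a softmax, the two loss terms can be sketched as follows (a minimal NumPy illustration; the function names and toy scores are ours, not from the released implementation):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_plus(scores_plus, y):
    """Eq.(3): cross-entropy with the true label y over the decidable view X+."""
    p = softmax(scores_plus)
    return -np.log(p[y])

def loss_minus(scores_minus):
    """Eq.(6): cross-entropy with the uniform target q(y) = 1/K over X-.
    It is minimized when the classifier is maximally uncertain (max entropy)."""
    p = softmax(scores_minus)
    k = p.size
    return -(1.0 / k) * np.log(p).sum()

# Total per-sample loss, Eq.(7): certain over X+, indecisive over X-.
scores_p = np.array([4.0, 0.5, 0.2])   # confident on class 0 over X+
scores_m = np.array([0.1, 0.1, 0.1])   # near-uniform scores over X-
total = loss_plus(scores_p, y=0) + loss_minus(scores_m)
```

Note that `loss_minus` attains its minimum, log K, exactly at the uniform prediction, so minimizing it is equivalent to maximizing the conditional entropy of Eq.4.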
The posterior probability P̂(y|X+) is modeled using a classifier C with a set of parameters θ_C; it can operate either on X+ or X-. The binary mask M (and M-) is learned using another model, the localizer L, with a set of parameters θ_L. In this work, both models are based on neural networks (fully convolutional networks LongSDcvpr15 in particular). The networks L and C can be seen as two parts of a single network that localizes regions of interest using a binary mask, then classifies their content. Fig.2 illustrates the entire model.
Due to the depth of C, the localizer L receives its supervised gradient based only on the error made by C. In order to boost the supervised gradient at L, and to provide it with more hints for selecting the most discriminative regions with respect to the image class, we use a secondary classification task at the output of L that classifies the input X, following lee15apmlr . L computes the posterior probability P̂_L(y|X), which is another estimate of Pr(y|X). To this end, L is trained to minimize the cross-entropy between y and P̂_L(y|X),

L_loc(X, y) = - log P̂_L(y|X) .   (8)
The total training loss to minimize is formulated as,

L_total(X, y) = L+(X+, y) + L-(X-) + L_loc(X, y) .   (9)
3.3 Mask computation
The mask is computed using the last feature maps of the localizer L, which contain high-level, abstract, discriminative activations. We denote these feature maps by a tensor A that contains a spatial map A_k for each class k. A continuous, downsampled version M' of the mask is computed by aggregating the spatial activations of all the classes as,

M'(r) = Σ_{k=1}^{K} P̂_L(k|X) A_k(r) ,   (10)
where M' is the continuous downsampled version of M, and A_k is the feature map of class k for the input X. At convergence, the posterior probability of the winning class is pushed toward 1, while the rest are pushed down to 0; this leaves only the feature map of the winning class. M' is upscaled using interpolation (in most neural network libraries (PyTorch (pytorch.org), Chainer (chainer.org)), upscaling operations based on interpolation/upsampling have a non-deterministic backward pass; this makes training unstable due to the non-deterministic gradient, and makes reproducibility impossible; to avoid such issues, we detach the upscaling operation from the training graph and consider its output as input data for C) to M'', which has the same size as the input X, and then pseudo-thresholded using a sigmoid to obtain a pseudo-binary mask M̂,

M̂(r) = 1 / (1 + exp(-ω (M''(r) - σ))) ,   (11)

where ω is a constant scalar that ensures the sigmoid approximately equals 1 when M''(r) is larger than the threshold σ, and approximately equals 0 otherwise.
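A minimal NumPy sketch of this mask computation follows; the aggregation weights, the nearest-neighbour upscaling (a stand-in for the detached interpolation), and the constants `omega` and `sigma` are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def compute_mask(feat_maps, class_scores, out_hw, omega=8.0, sigma=0.5):
    """Sketch of Eqs.(10)-(11): aggregate per-class feature maps into a
    continuous map M', upscale it, then pseudo-threshold with a sigmoid.

    feat_maps: (K, h', w') per-class maps A_k; class_scores: (K,) logits of
    the localizer's classification head; out_hw: (H, W) input-image size.
    """
    post = softmax(class_scores)                     # posterior over classes
    m_prime = np.tensordot(post, feat_maps, axes=1)  # Eq.(10): sum_k p(k|X) A_k
    # nearest-neighbour upscaling as a stand-in for interpolation
    H, W = out_hw
    h, w = m_prime.shape
    up = m_prime[np.repeat(np.arange(h), H // h)][:, np.repeat(np.arange(w), W // w)]
    # Eq.(11): pseudo-binary mask in (0, 1)
    return 1.0 / (1.0 + np.exp(-omega * (up - sigma)))

m_hat = compute_mask(np.random.rand(3, 7, 7), np.array([2.0, 0.1, 0.1]), (14, 14))
```

The sigmoid slope `omega` controls how close the mask gets to a hard {0, 1} threshold at `sigma` while keeping the operation differentiable.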
3.4 Object completeness using incremental recursive erasing and trust coefficients
Object classification methods tend to rely on small discriminative regions kim2017two ; SinghL17 ; zhou2016learning . Thus, X̂- may still contain discriminative parts. Following kim2017two ; LiWPE018CVPR ; pathak2015constrained ; SinghL17 , and in particular wei2017object , we propose an incremental and recursive erasing approach, applied during learning, that drives L to seek complete discriminative regions. However, in contrast to wei2017object , where such mining is done offline, we incorporate the erasing within the backpropagation using an efficient and practical implementation. This allows L to learn to seek discriminative parts; therefore, erasing during inference is unnecessary. Our approach consists in applying L recursively before applying C within the same forward pass. The aim of the recursion, with maximum depth u, is to mine more discriminative parts within the non-discriminative regions of the image masked by M̂-. We accumulate all discriminative parts in a temporary mask T. At each recursion, we mine the most discriminative part that has been correctly classified by C, and accumulate it in T. However, as the recursion deepens, the image may run out of discriminative parts; L is then forced, unintentionally, to treat non-discriminative parts as discriminative. To alleviate this risk, we introduce trust coefficients that control how much we trust the discriminative region mined at each step t of the recursion for each sample X,

T_t = T_{t-1} + α_t(X) M̂_t ,   (12)
where α_t(X) computes the trust of the current mask of sample X at step t, decomposed as follows,

α_t(X) = g(t) · c_t(X) ,   (13)
where g(t) encodes the overall trust with respect to the current step of the recursion; such trust is expected to decrease with the depth of the recursion bel16 , for instance as g(t) = exp(-t/λ), where λ controls the slope of the trust function. The second part of Eq.13, c_t(X), is computed with respect to each sample; it quantifies how much we trust the estimated mask for the current sample X,

c_t(X) = P̂(y|X̂_t+) if argmax_k P̂(k|X̂_t+) = y and L+(X̂_t+, y) ≤ L+(X̂_0+, y) ; c_t(X) = 0 otherwise,   (14)
where in Eq.14 the posterior P̂ is computed over the erased image. Eq.14 ensures that at a step t, for a sample X, the current mask is trusted only if C correctly classifies the erased image and does not increase the loss. The first condition ensures that the accumulated discriminative regions belong to the same class, and more importantly, the true class; moreover, it ensures that C does not change its class prediction through the erasing process. This introduces consistency between the regions mined across steps and avoids mixing discriminative regions of different classes. The second condition ensures maintaining, at least, the same confidence in the predicted class as in the first forward pass without erasing (t = 0). The trust given in this case is equal to the probability of the true class. The region accumulator T is initialized to zero at each forward pass; T is not maintained across epochs, but starts over each time the sample X is processed. This prevents accumulating incorrect regions, which may occur at the beginning of training. In order to automate when to stop erasing, we consider a maximum recursion depth u. For a mini-batch, we keep erasing as long as we have not reached u steps of erasing and there is at least one sample with a non-zero trust coefficient (Eq.14). Once a sample is assigned a zero trust coefficient, it is kept at zero throughout the erasing (Eq.12) (Fig.4). A direct implementation of Eq.12 is not practical, since performing a recursive computation on a large model requires memory that grows with the depth u. To avoid this issue, we propose a practical implementation using gradient accumulation at L through the loss of Eq.8; this implementation requires the same memory size as the case without erasing (Alg.1). We provide more details in the supplementary material (Sec.A.1).
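The erasing loop can be sketched as follows, under assumed forms for the trust terms (an exponential step decay, and a per-sample trust equal to the true-class probability when the prediction stays correct and confident); the stub networks and constants are placeholders, not the paper's models:

```python
import numpy as np

def recursive_erase(x, y_true, localize, classify, max_steps=3, lam=2.0):
    """Sketch of the recursive erasing of Sec.3.4 (Eqs.12-14).

    localize(x) -> (h, w) pseudo-binary mask; classify(x) -> class probabilities.
    Returns the accumulator of mined regions, reset for every sample.
    """
    acc = np.zeros(x.shape[-2:])              # accumulator T, starts at zero
    p0 = classify(x)[y_true]                  # confidence before any erasing (t = 0)
    for t in range(max_steps):
        m = localize(x)
        p = classify(x * m[None, ...])        # classify the currently masked view
        correct = p.argmax() == y_true        # first condition of Eq.(14)
        no_drop = p[y_true] >= p0             # second condition of Eq.(14)
        trust = np.exp(-t / lam) * (p[y_true] if (correct and no_drop) else 0.0)
        if trust == 0.0:                      # zero trust: stop erasing this sample
            break
        acc = np.maximum(acc, trust * m)      # accumulate the mined region
        x = x * (1.0 - m)[None, ...]          # erase it before the next step
    return acc

# toy sanity check with stub networks (hypothetical, input-independent):
x = np.random.rand(1, 4, 4)
localize = lambda im: np.ones((4, 4))
classify = lambda im: np.array([0.7, 0.3])
acc = recursive_erase(x, 0, localize, classify)
```

In the paper this loop runs inside the forward/backward pass with gradient accumulation, so memory does not grow with the recursion depth; the sketch above only illustrates the trust-gated mining logic.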
4 Results and analysis
Our experiments focus simultaneously on classification and object localization tasks. Thus, we consider datasets that provide both image-level and pixel-level labels for evaluation. In particular, the following three datasets were considered: GlaS in the medical domain, and CUB-200-2011 and Oxford flower 102 for natural scene images. (1) The GlaS dataset was provided in the 2015 Gland Segmentation in Colon Histology Images Challenge Contest (warwick.ac.uk/fac/sci/dcs/research/tia/glascontest) sirinukunwattana2017gland . The main task of the challenge is gland segmentation of microscopic images; however, image-level labels were provided as well. The dataset is composed of 165 images derived from 16 Hematoxylin and Eosin (H&E) histology sections of two grades (classes): benign and malignant. It is divided into 84 samples for training and 80 samples for testing. Images have a high variation in terms of gland shape/size and overall H&E stain. In this dataset, the glands are the regions of interest that pathologists use to grade the image as benign or malignant. (2) The CUB-200-2011 dataset (www.vision.caltech.edu/visipedia/CUB-200-2011.html) WahCUB2002011 is a dataset of bird species images. Preliminary experiments were conducted on a smaller version of this dataset, where we randomly selected 5 species to build a small training and test set, referred to in this work as CUB5. The entire dataset is referred to as CUB. In this dataset, the objects of interest are the birds. (3) The Oxford flower 102 dataset (http://www.robots.ox.ac.uk/~vgg/data/flowers/102/) nilsback2007delving is a collection of 102 species (classes) of flowers commonly occurring in the United Kingdom, referred to here as OxF. We used the provided splits for the training, validation, and test sets.
The regions of interest are the flowers, which were automatically segmented. For the GlaS, CUB5, and CUB datasets, we randomly select a portion of the training samples for effective training, and the rest for validation to perform early stopping. We provide in our public code the splits used and the deterministic code that generated them for the different datasets.
In all the experiments, image-level labels are used during training/evaluation, while pixel-level labels are used exclusively during evaluation. The evaluation is conducted at two levels: at the image level, where the classification error is reported, and at the pixel level, where we report the F1 score (Dice index) over the foreground (object of interest), referred to as F1+. When dealing with binary data, the F1 score is equivalent to the Dice index. We also report the F1 score over the background, referred to as F1-, in order to measure how well the model identifies irrelevant regions. We compare our method to different WSOL methods. These methods use a similar pretrained backbone (ResNet-18 heZRS16 ) for feature extraction and differ mainly in the final pooling layer: CAM-Avg uses average pooling zhou2016learning , CAM-Max uses max-pooling oquab2015object , CAM-LSE uses an approximation to the maximum PinheiroC15cvpr ; sun2016pronet , Wildcat uses the pooling in durand2017wildcat , Grad-CAM selvaraju2017grad , and Deep MIL is the work of ilse2018attention with an adaptation to the multi-class setting. We use supervised segmentation with U-Net Ronnebergerunet2015 as an upper bound of the performance for pixel-level evaluation (Full sup.). As a basic baseline, we use a mask full of ones, with the same size as the image, as a constant prediction of the objects of interest, to show that F1+ alone is not a sufficient metric to evaluate pixel-level localization, particularly over the GlaS set (All-ones, see Tab.2). In our method, L and C share the same pretrained backbone (ResNet-101 heZRS16 ) to avoid overfitting, while using durand2017wildcat as a pooling function. All methods are trained using stochastic gradient descent with momentum. In our approach, we used the same hyperparameters over all datasets, while the other methods required adaptation to each dataset.
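For reference, the F1+/F1- (Dice) metrics used for pixel-level evaluation can be computed as follows (a minimal NumPy sketch over binary masks; the function names are ours):

```python
import numpy as np

def f1_score(pred, gt):
    """Dice index / F1 between two binary masks: 2|P ∩ G| / (|P| + |G|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def f1_plus_minus(pred, gt):
    """F1+ over the foreground, and F1- over the complemented (background) masks."""
    return f1_score(pred, gt), f1_score(1 - pred, 1 - gt)

# toy example: one false positive foreground pixel
pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
f1p, f1m = f1_plus_minus(pred, gt)
```

Computing F1- on the complemented masks is what penalizes the All-ones baseline: its background prediction is empty, so F1- collapses even when F1+ is high.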
We provide reproducible code (https://github.com/sbelharbi/wsolminmaxentropyinterpretability), the dataset splits, more experimental details, and visual results in the supplementary material (Sec.B).
A comparison of the results obtained by the different methods, over all datasets, is presented in Tab.1 and Tab.2, with visual results illustrated in Fig.3. In Tab.2, compared to other WSOL methods, our method obtains a relatively similar F1+ score, while it obtains a large F1- over GlaS, where it may be easy to obtain a high F1+ by predicting a mask full of 1s (Fig.3). However, a model needs to be very selective in order to obtain a high F1- score and localize tissues (irrelevant regions), which is where our model seems to excel. The CUB5 set seems to be more challenging due to the variable size (from small to big) of the birds, their viewpoints, the context/surrounding environment, and the few training samples. Our model outperforms all the WSOL methods in both F1+ and F1- with a large gap, due mainly to its ability to discard non-discriminative regions, which leaves it only with the region of interest, in this case the bird. While our model shows improvements in localization, it is still far behind full supervision.
Image level

Method | Error (%): GlaS | CUB5 | CUB | OxF
CAM-Avg zhou2016learning
CAM-Max oquab2015object
CAM-LSE PinheiroC15cvpr; sun2016pronet
Wildcat durand2017wildcat
Deep MIL ilse2018attention
Grad-CAM selvaraju2017grad
Ours ()

Pixel level

Method | F1+ (%): GlaS | CUB5 | CUB | OxF | F1- (%): GlaS | CUB5 | CUB | OxF
All-ones
CAM-Avg zhou2016learning
CAM-Max oquab2015object
CAM-LSE PinheiroC15cvpr; sun2016pronet
Wildcat durand2017wildcat
Deep MIL ilse2018attention
Grad-CAM selvaraju2017grad
Ours ()
Full sup.: U-Net Ronnebergerunet2015
Similar improvements are observed on the CUB data. In the case of the OxF dataset, our approach provides lower F1+ values compared to the other WSOL methods. However, the latter are not far from the performance of All-ones, which predicts a constant mask full of ones. Given the large size of the flowers, predicting a mask that is active over the entire image easily leads to a high F1+. The best WSOL methods on OxF are only marginally better than All-ones, suggesting that such methods predicted a full mask in many cases. In terms of F1-, our approach is better than all the WSOL techniques. All methods achieve a low error rate on GlaS, which implies that it represents an easy classification problem. Surprisingly, the other methods seem to overfit on CUB5, while our model shows robustness. The other methods outperform our approach on CUB and OxF, although it remains competitive with half of the WSOL methods. The results obtained on both datasets indicate that, compared to WSOL methods, our approach is effective in terms of image classification and object localization, with more reliability in terms of object localization.
The visual quality of our approach (Fig.3) shows that the predicted regions of interest on GlaS agree with the methodology doctors follow for colon cancer diagnosis, where the glands are used as a diagnostic tool. The results also illustrate the model's ability to deal with multiple instances when there are several glands within the image. On CUB5/CUB, our model succeeds in locating the bird in order to predict its category, as one would do in such a task. We notice that the head, chest, tail, or particular spots on the body are often the parts used by our model to decide a bird's species, which seems a reasonable strategy as well. On the OxF dataset, we observed that our approach mainly locates the central part of the pistil. When this is not enough, the model relies on the petals or on unique discriminative parts of the flower. In terms of time complexity, the inference time of our model is the same as that of a standard fully convolutional network, since the recursive algorithm is disabled during inference. However, one may expect a moderate increase in training time that depends mainly on the depth of the recursion (see Sec.B.3.2).
5 Conclusion
In this work, we have presented a novel approach for WSOL in which we constrain the model to learn relevant and irrelevant regions. Evaluated on three datasets, and compared to state-of-the-art WSOL methods, our approach showed its effectiveness in correctly localizing objects of interest with small false positive regions while maintaining a competitive classification error. This makes our approach more reliable in terms of interpretability. As future work, we consider extending our approach to handle multiple classes within the image. Different constraints can be applied over the predicted mask, such as texture properties, shape, or other region constraints. However, this requires the mask to be differentiable with respect to the model's parameters in order to train the network using such constraints. Predicting bounding boxes instead of heat maps is considered as well, since they can be more suitable in applications where pixel-level accuracy is not required.
We discussed in Sec.B.3 a fundamental issue in erasing-based algorithms that we noticed from applying our approach to the CUB5 dataset. We arrived at the conclusion that such algorithms lack the ability to remember the location of the already mined regions of interest, which can be problematic when there is only one instance in the image and only a small discriminative region. This can easily prevent recovering the complete discriminative region, since the remaining regions may not be discriminative enough to be spotted, as is the case for birds once the head has been erased. Assisting erasing algorithms with a memory-like mechanism, or with spatial information about the previously mined discriminative regions, may drive the network to seek discriminative regions around the previously spotted ones, since the parts of an object of interest are often closely located. Potentially, this may allow the model to spot a large portion of the object of interest in this case.
Acknowledgments
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research.
References
 [1] H. Azizpour, M. Arefiyan, S. Naderi Parizi, and S. Carlsson. Spotlight the negatives: A generalized discriminative latent model. In BMVC, 2015.
 [2] S. Behpour, K. Kitani, and B. Ziebart. Ada: Adversarial data augmentation for object detection. In WACV, 2019.
 [3] S. Belharbi, C. Chatelain, R. Hérault, and S. Adam. Neural networks regularization through classwise invariant representation learning. arXiv preprint arXiv:1709.01867, 2017.
 [4] S. Belharbi, R.Hérault, C. Chatelain, and S. Adam. Deep multitask learning with evolving weights. In ESANN, 2016.
 [5] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
 [6] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, et al. Look and think twice: Capturing topdown visual attention with feedback convolutional neural networks. In ICCV, 2015.
 [7] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. GradCAM++: Generalized gradientbased visual explanations for deep convolutional networks. In WACV, 2018.
 [8] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
 [9] T. M. Cover and J. A. Thomas. Elements of Information Theory. WileyInterscience, 2006.
 [10] A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. Van Gool. Weakly supervised cascaded convolutional networks. In CVPR, 2017.
 [11] T. Durand, T. Mordan, N. Thome, and M. Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In CVPR, 2017.
 [12] Thibaut Durand, Nicolas Thome, and Matthieu Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In CVPR, 2016.
 [13] W. Ge, S. Yang, and Y. Yu. Multievidence filtering and fusion for multilabel classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR, 2018.
 [14] G. Ghiasi, T.Y. Lin, and Q. V. Le. Dropblock: A regularization method for convolutional networks. In NIPS. 2018.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [16] M. Ilse, J. M. Tomczak, and M. Welling. Attentionbased deep multiple instance learning. arXiv preprint arXiv:1802.04712, 2018.
 [17] V. Kantorov, M. Oquab, M. Cho, and I. Laptev. Contextlocnet: Contextaware deep network models for weakly supervised localization. In ECCV, 2016.
 [18] H. Kervadec, J. Dolz, M. Tang, E. Granger, Y. Boykov, and I. Ben Ayed. ConstrainedCNN losses for weakly supervised segmentation. MedIA, 2019.
 [19] A. Khoreva, R. Benenson, J.H. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, 2017.
 [20] D. Kim, D. Cho, D. Yoo, and I. So Kweon. Twophase learning for weakly supervised object localization. In ICCV, 2017.
 [21] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques  Adaptive Computation and Machine Learning. The MIT Press, 2009.
 [22] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
 [23] E. Krupka and N. Tishby. Incorporating prior knowledge on features into learning. In Artificial Intelligence and Statistics, 2007.
 [24] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. DeeplySupervised Nets. In ICAIS, 2015.
 [25] K. Li, Z. Wu, K.C. Peng, J. Ernst, and Y. Fu. Tell me where to look: Guided attention inference network. In CVPR, 2018.
 [26] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribblesupervised convolutional networks for semantic segmentation. In CVPR, 2016.
 [27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [28] T.M. Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, 1980.
 [29] M.E. Nilsback and A. Zisserman. Delving into the whorl of flower segmentation. In BMVC, 2007.
 [30] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, 2015.
 [31] S. Naderi Parizi, A. Vedaldi, A. Zisserman, and P. F. Felzenszwalb. Automatic discovery and optimization of parts for image classification. In ICLR, 2015.
 [32] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
 [33] P. H. O. Pinheiro and R. Collobert. From imagelevel to pixellevel labeling with convolutional networks. In CVPR, 2015.
 [34] O. Ronneberger, P. Fischer, and T. Brox. Unet: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
 [35] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. GradCAM: Visual explanations from deep networks via gradientbased localization. In ICCV, 2017.
 [36] Y. Shen, R. Ji, S. Zhang, W. Zuo, Y. Wang, and F. Huang. Generative adversarial learning towards fast weakly supervised detection. In CVPR, 2018.
 [37] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLRw, 2014.
 [38] K. K. Singh and Y. J. Lee. Hideandseek: Forcing a network to be meticulous for weaklysupervised object and action localization. In ICCV, 2017.
 [39] K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez, et al. Gland segmentation in colon histology images: The glas challenge contest. MedIA, 2017.
 [40] J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In ICLRw, 2015.
 [41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
 [42] C. Sun, M. Paluri, R. Collobert, R. Nevatia, and L. Bourdev. Pronet: Learning to propose objectspecific boxes for cascaded neural networks. In CVPR, 2016.
 [43] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers. Normalized Cut Loss for Weaklysupervised CNN Segmentation. In CVPR, 2018.
 [44] P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In CVPR, 2017.
 [45] E. W. Teh, M. Rochan, and Y. Wang. Attention networks for weakly supervised object localization. In BMVC, 2016.
 [46] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.
 [47] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
 [48] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
 [49] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The CaltechUCSD Birds2002011 Dataset. Technical report, California Institute of Technology, 2011.
 [50] F. Wan, P. Wei, J. Jiao, Z. Han, and Q. Ye. Minentropy latent model for weakly supervised object detection. In CVPR, 2018.
 [51] Y. Wei, J. Feng, X. Liang, M.M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In CVPR, 2017.
 [52] T. Yu, T. Jan, S. Simoff, and J. Debenham. Incorporating prior domain knowledge into inductive machine learning. Unpublished doctoral dissertation Computer Sciences, 2007.
 [53] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
 [54] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Topdown neural attention by excitation backprop. IJCV, 2018.
 [55] Q.-S. Zhang and S.-C. Zhu. Visual interpretability for deep learning: a survey. FITEE, 2018.
 [56] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang. Adversarial complementary learning for weakly supervised object localization. In CVPR, 2018.
 [57] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
 [58] Z.H. Zhou. A brief introduction to weakly supervised learning. NSR, 2017.
 [59] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
Appendix A The min-max entropy framework for WSOL
A.1 Object completeness using incremental recursive erasing and trust coefficients
Appendix B Results and analysis
In this section, we provide more details on our experiments and analysis, and discuss some of the drawbacks of our approach. We took many precautions to make our model's code reproducible, up to PyTorch's terms of reproducibility. Please see the relevant section of the README.md file of the code (https://github.com/sbelharbi/wsolminmaxentropyinterpretability). We checked reproducibility up to a precision of . All our experiments were conducted using the seed . We ran all our experiments on one GPU with 12 GB of memory (our code supports multi-GPU training and batch-norm synchronization, with our own support for reproducibility), in an environment with 10 to 64 GB of RAM. Finally, this section presents more visual results, analysis, training times, and drawbacks.
B.1 Datasets
We provide in Fig.5 samples from each dataset's test set, along with their masks indicating the object of interest.
As mentioned in Sec.4, we consider a subset of the original CUB-200-2011 dataset for preliminary experiments, which we refer to as CUB5. To build it, we randomly select 5 classes from the original dataset, then pick all the corresponding samples of each class in the provided train and test sets to build the train and test sets of CUB5. We then build the effective train set and the validation set by randomly taking , and the remaining , from the CUB5 train set, respectively. We provide the splits, and the code used to generate them. Our code generates the following classes:
- 019.Gray_Catbird
- 099.Ovenbird
- 108.White_necked_Raven
- 171.Myrtle_Warbler
- 178.Swainson_Warbler
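The split construction above can be sketched as follows. This is a hypothetical, minimal reimplementation: the function name, the validation fraction, and the seed are placeholders, not the exact values used by the authors' split-generation code.

```python
import random

def build_cub5_splits(all_classes, samples_per_class, n_classes=5,
                      val_frac=0.2, seed=0):
    """Randomly pick n_classes classes, then split their training
    samples into an effective-train set and a validation set.
    val_frac and seed are illustrative placeholders."""
    rng = random.Random(seed)  # fixed seed -> deterministic splits
    classes = rng.sample(sorted(all_classes), n_classes)
    train, valid = [], []
    for c in classes:
        samples = list(samples_per_class[c])
        rng.shuffle(samples)
        k = int(round(len(samples) * val_frac))
        valid.extend((c, s) for s in samples[:k])
        train.extend((c, s) for s in samples[k:])
    return classes, train, valid
```

Seeding a dedicated `random.Random` instance (rather than the global RNG) keeps the splits deterministic regardless of any other randomness in the pipeline.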
B.2 Experimental setup
The following is the configuration we used for our model over all the datasets:
 Data

1. Patch size (h×w): . (For training, we sample patches; for evaluation, we use the entire input image.) 2. Patches are augmented using random rotation and horizontal/vertical flipping (for CUB5, only horizontal flipping is performed). 3. Channels are normalized using the mean and standard deviation. 4. For GlaS: patches are jittered using brightness=, contrast=, saturation=, hue=.
 Model

Pretrained resnet101 [15] as a backbone, with [11] as a pooling score function with our adaptation, using modalities per class. We use dropout [41] (with rate over GlaS and over CUB5, CUB, and OxF) over the final map of the pooling function, right before computing the score. The high dropout rate is motivated by [14, 38]: it allows dropping the most discriminative parts at the level of the most abstract feature representation. The dropout is not performed over the final mask, but only over the internal mask of the pooling function. As for the parameters of [11], we consider their setting since most negative evidence is dropped, and use . For evaluation, our predicted mask is binarized using a threshold to obtain a strictly binary mask. All the masks presented in this work follow this thresholding. Our F1+ and F1- are computed over this binary mask.
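The final binarization step can be sketched as below; since the exact threshold is not reproduced here, the 0.5 in `binarize_mask` is an illustrative placeholder:

```python
def binarize_mask(mask, threshold=0.5):
    """Turn a continuous [0, 1] mask (list of rows) into a hard
    0/1 mask. threshold=0.5 is a placeholder value."""
    return [[1 if v >= threshold else 0 for v in row] for row in mask]

soft = [[0.9, 0.2], [0.4, 0.7]]
hard = binarize_mask(soft)  # [[1, 0], [0, 1]]
```

Both F1+ and F1- are then computed against the ground-truth mask using this hard mask.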
 Optimization

1. Stochastic gradient descent with momentum and Nesterov acceleration. 2. Weight decay of over the weights. 3. Learning rate of , decayed by every epochs, with a minimum value of . 4. A maximum of 400 epochs. 5. Batch size of . 6. Early stopping over the validation set using the classification error as the stopping criterion.
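The step-decayed learning-rate schedule with a floor described above can be sketched as follows; all numeric values are placeholders, since the exact hyperparameters are not reproduced here:

```python
def stepped_lr(epoch, base_lr=1e-3, gamma=0.1, step=40, min_lr=1e-7):
    """Learning rate decayed by `gamma` every `step` epochs,
    clipped below by `min_lr`. All values are illustrative."""
    return max(base_lr * (gamma ** (epoch // step)), min_lr)
```

With these placeholder values, the rate stays at 1e-3 for the first 40 epochs, drops by a factor of 10 at each step boundary, and never falls below 1e-7.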
Other WSOL methods use the following setup with respect to each dataset:
GlaS:
 Data

1. Patch size (h×w): . 2. Patches are augmented using random horizontal flips. 3. Random rotation by one of: (degrees). 4. Patches are jittered using brightness=, contrast=, saturation=, hue=.
 Model

1. Pretrained resnet18 [15] as a backbone.
 Optimization

1. Stochastic gradient descent with momentum and Nesterov acceleration. 2. Weight decay of over the weights. 3. 160 epochs. 4. Learning rate of for the first epochs, and of for the last epochs. 5. Batch size of . 6. Early stopping over the validation set using the classification error/loss as the stopping criterion.
CUB5:
 Data

1. Patch size (h×w): . (Resized while maintaining the aspect ratio.) 2. Patches are augmented using random horizontal flips. 3. Random rotation by one of: (degrees). 4. Random affine transformation with degrees , shear , and scale .
 Model

Pretrained resnet18 [15] as a backbone.
 Optimization

1. Stochastic gradient descent with momentum and Nesterov acceleration. 2. Weight decay of over the weights. 3. epochs. 4. Learning rate of , decayed every with . 5. Batch size of . 6. Early stopping over the validation set using the classification error/loss as the stopping criterion.
CUB/OxF:
 Data

1. Patch size (h×w): . (Resized while maintaining the aspect ratio.) 2. Patches are augmented using random horizontal flips. 3. Random rotation by one of: (degrees). 4. Random affine transformation with degrees , shear , and scale .
 Model

Pretrained resnet18 [15] as a backbone.
 Optimization

1. Stochastic gradient descent with momentum and Nesterov acceleration. 2. Weight decay of over the weights. 3. epochs. 4. Learning rate of , decayed every with . 5. Batch size of . 6. Early stopping over the validation set using the classification error/loss as the stopping criterion.
B.3 Results
In this section, we provide more visual results over the test set of each dataset.
Over the GlaS dataset (Fig.7, 8), the visual results clearly show how our model, with and without erasing, can handle multiple instances. Adding the erasing feature allows recovering more discriminative regions. The results over CUB5 (Fig.9, 10, 11, 12, 13), while interesting, show a fundamental limitation of the erasing concept in the single-instance case. In the multi-instance case, if the model spots one instance and then erases it, it is likely to seek another instance, which is the expected behavior. However, in the single-instance case, where the discriminative parts are small, the first forward pass mainly spots such a small part and erases it. The leftover may then not be sufficient for discrimination. For instance, in CUB5, the model often spots the head; once it is hidden, the model is unable to find other discriminative parts. A clear illustration of this issue is in Fig.9, row 5: the model correctly spots the head, but is unable to spot the body, even though the body has a similar texture and is located right next to the found head. We believe that the main cause of this issue is that the erasing concept forgets where the discriminative parts are located. Erasing algorithms seem to be missing this feature, which could help localize the entire object of interest by seeking around the found discriminative regions. In our erasing algorithm, once a region is erased, the model forgets its location. Adding a memory-like mechanism, or constraints over the spatial distribution of the mined discriminative regions, may potentially alleviate this issue.
It is interesting to notice the strategy used by our model to localize some types of birds. In the case of 099.Ovenbird, it relies on the texture of the chest (white dotted with black), while it localizes the white spot on the bird's neck in the case of 108.White_necked_Raven. One can also notice that our model seems robust to small/occluded objects. In many cases, it was able to spot small birds in a difficult context where the bird is not salient.
B.3.1 Impact of our recursive erasing algorithm on the performance
Tab.3 and Tab.4 show the boost provided by our recursive erasing algorithm in both classification and localization. From Tab.4, we can observe that using our recursive algorithm brings a large improvement in F1+ without degrading F1-. This means that the recursion allows the model to correctly localize larger portions of the object of interest without including false positive regions. The observed improvement in localization in turn yields a lower classification error, as seen in Tab.3. The improvement can be seen as well in the precision-recall curves in Fig.6.
Image level

Ours | Error (%): GlaS | CUB5 | CUB | OxF

Pixel level

Ours | F1+ (%): GlaS | CUB5 | CUB | OxF | F1- (%): GlaS | CUB5 | CUB | OxF
B.3.2 Running time of our recursive erasing algorithm
Adding recursive computation to the backpropagation loop is expected to add extra computation time. Tab.5 shows the training time (of one run) of our model with and without recursion, on identical computational resources. The observed extra computation time is mainly due to gradient accumulation (line 12, Alg.1), which takes about the same amount of time as the parameter update (which is expensive to compute). The forward and backward passes are practically fast, and take less time than the gradient update. We do not compare running times across datasets, since they have different numbers/sizes of samples and different preprocessing, which is included in the reported time. Moreover, the size of the samples has an impact on the total time spent evaluating the validation set during training.
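The extra cost comes from summing gradients over the recursion steps before applying a single parameter update. A schematic sketch (scalar "gradients" stand in for real tensors; the function name and values are purely illustrative, not the authors' Alg.1):

```python
def train_step_with_recursion(params, grads_per_step, lr=0.1):
    """One training step: the gradients produced by each
    erasing-recursion step are accumulated, then a single
    (expensive) parameter update is applied at the end."""
    accumulated = [0.0] * len(params)
    for step_grads in grads_per_step:      # one entry per recursion depth
        for i, g in enumerate(step_grads):
            accumulated[i] += g            # cheap accumulation
    # single update after the recursion, instead of one per step
    return [p - lr * g for p, g in zip(params, accumulated)]
```

Since accumulation is cheap relative to the update, training time grows with recursion depth far more slowly than one full update per step would.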
Model | GlaS | CUB5
Ours () | 49 min | 65 min
Ours () | 90 min () | 141 min ()
B.3.3 Postprocessing using a conditional random field (CRF)
Postprocessing the output of fully convolutional networks using a CRF often leads to smoother masks, better aligned with the object of interest [8]. To this end, we use the CRF implementation of [22] (https://github.com/lucasbeyer/pydensecrf). The results are presented in Tab.6. Following the notation in [22], we set . Over all the methods, we run iterations over GlaS, and iterations over CUB5, CUB, and OxF. Tab.6 shows a slight improvement in terms of F1+ and a slight degradation in terms of F1-. When investigating the processed masks, we found that the CRF improves a mask only when it already covers a large part of the object of interest precisely; in this case, the CRF helps spread the mask over the object. When there are many false positives, or the mask largely misses the object, the CRF does not help. We can also see that the CRF slightly increases the false positives by spreading the mask outside the object. Since our method produces few false positives, i.e., the produced mask mostly covers the object and avoids regions outside it, using the CRF improves both F1+ and F1- in most cases.
Pixel level

Method | F1+ (%): GlaS | CUB5 | CUB | OxF | F1- (%): GlaS | CUB5 | CUB | OxF
Avg [57]
Max [30]
LSE [33, 42]
Wildcat [11]
GC [35]
Ours ()