# Min-max Entropy for Weakly Supervised Pointwise Localization

Soufiane Belharbi¹, Jérôme Rony¹, Jose Dolz¹, Ismail Ben Ayed¹, Luke McCaffrey² & Eric Granger¹
¹ École de technologie supérieure, Université du Québec, Montreal, Canada
² Rosalind and Morris Goodman Cancer Research Centre, Dept. of Oncology, McGill University, Montreal, Canada
{soufiane.belharbi.1,jerome.rony.1}@etsmtl.net
{jose.dolz,ismail.benayed,eric.granger}@etsmtl.ca
luke.mccaffrey@mcgill.ca
###### Abstract

Pointwise localization allows more precise localization and more accurate interpretability than bounding boxes in applications where objects are highly unstructured, such as in the medical domain. In this work, we focus on weakly supervised localization (WSL), where a model is trained to classify an image and localize regions of interest at pixel level using only global image annotation. Typical convolutional attention maps are prone to high false positive regions. To alleviate this issue, we propose a new deep learning method for WSL, composed of a localizer and a classifier, where the localizer is constrained to determine relevant and irrelevant regions using conditional entropy (CE) with the aim of reducing false positive regions. Experimental results on a public medical dataset and two natural datasets, using the Dice index, show that, compared to state-of-the-art WSL methods, our proposal can provide significant improvements in terms of image-level classification and pixel-level localization (low false positives) with robustness to overfitting. A public reproducible PyTorch implementation is provided.

## 1 Introduction

Pointwise localization is an important task for image understanding, as it provides crucial clues to challenging visual recognition problems, such as semantic segmentation, besides being an essential and precise visual interpretability tool. Deep learning methods, and particularly convolutional neural networks (CNNs), are driving recent progress in these tasks. Nevertheless, despite their remarkable performance, their training requires large amounts of labeled data, whose annotation is time consuming and prone to observer variability. To overcome this limitation, weakly supervised learning (WSL) has emerged recently as a surrogate for extensive annotations of training data (Zhou, 2017). WSL involves scenarios where training is performed with inexact or uncertain supervision. In the context of pointwise localization or semantic segmentation, weak supervision typically comes in the form of image-level tags (Kervadec et al., 2019; Kim et al., 2017; Pathak et al., 2015; Teh et al., 2016; Wei et al., 2017), scribbles (Lin et al., 2016; Tang et al., 2018) or bounding boxes (Khoreva et al., 2017).

Current state-of-the-art WSL methods rely heavily on pixelwise activation maps produced by a CNN classifier at the image level, thereby localizing regions of interest (Zhou et al., 2016). Furthermore, these maps can be used as an interpretation of the model’s decision (Zhang & Zhu, 2018). The recent literature abounds with WSL works that relax the need for dense and prohibitively time-consuming pixel-level annotations (Rony et al., 2019). Bottom-up methods rely on the input signal to locate regions of interest, including spatial pooling techniques over activation maps (Durand et al., 2017; Oquab et al., 2015; Sun et al., 2016; Zhang et al., 2018b; Zhou et al., 2016), multi-instance learning (Ilse et al., 2018) and attend-and-erase based methods (Kim et al., 2017; Li et al., 2018; Pathak et al., 2015; Singh & Lee, 2017; Wei et al., 2017). While these methods provide pointwise localization, the models in (Bilen & Vedaldi, 2016; Kantorov et al., 2016; Shen et al., 2018; Tang et al., 2017; Wan et al., 2018) predict a bounding box instead, i.e., they perform weakly supervised object detection. Inspired by human visual attention, top-down methods rely on the input signal and a selective backward signal to determine the corresponding region of interest. This includes special feedback layers (Cao et al., 2015), backpropagation error (Zhang et al., 2018a) and Grad-CAM (Chattopadhyay et al., 2018; Selvaraju et al., 2017).

In many applications, such as medical imaging, region localization may require high precision, for instance when localizing cells, boundaries, and organs: regions that have an unstructured shape and a varying scale that a bounding box may not be able to localize precisely. In such cases, pointwise localization can be more suitable. The illustrative example in Fig.1 (bottom row) shows a typical case where using a bounding box to localize the glands is clearly problematic. This motivates us to consider predicting a mask instead of a bounding box. Consequently, our later choice of evaluation datasets is constrained by the availability of both global image annotation for training and pixel-level annotation for evaluation. In this work, we focus on the case where there is one object of interest in the image.

Often, within a class-agnostic setup, the input image contains the object of interest among other irrelevant parts (noise, background). Most of the aforementioned WSL methods do not consider such a prior, and feed the entire image to the model. In this scenario, (Wan et al., 2018) argue that there is an inconsistency between the classification loss and the task of WSL, and that the optimization may typically reach sub-optimal solutions with considerable randomness in them, leading to high false positive localization. False positive localization is aggravated when a class appears in different and random shapes/structures, or has a texture/color relatively similar to that of the irrelevant parts, driving the model to confuse both parts. False positive regions can be problematic in critical domains such as medical applications, where interpretability plays a central role in trusting and understanding an algorithm’s prediction. To address this important issue, and motivated by the importance of using prior knowledge in learning to alleviate overfitting when training with few samples (Belharbi et al., 2017; Krupka & Tishby, 2007; Mitchell, 1980; Yu et al., 2007), we propose to use the aforementioned prior in order to favor models with low false positive localization. To this end, we constrain the model to learn to localize both relevant and irrelevant regions simultaneously in an end-to-end manner within a WSL scenario, where only image-level labels are used for training. We model the relevant (discriminative) regions as the complement of the irrelevant (non-discriminative) regions (Fig.1). Our model is composed of two sub-models: (1) a localizer that aims to localize both types of regions by predicting a latent mask, and (2) a classifier that aims to classify the visible content of the input image through the latent mask.
The localizer is driven through CE (Cover & Thomas, 2006) to simultaneously identify (1) relevant regions where the classifier has high confidence with respect to the image label, and (2) irrelevant regions where the classifier is unable to decide which image label to assign. This modeling allows the discriminative regions to pop out and be used to assign the corresponding image label, while suppressing non-discriminative areas, leading to more reliable predictions. In order to localize complete discriminative regions, we extend our proposal by training the localizer to recursively erase discriminative parts during training only. To this end, we propose a consistent recursive erasing algorithm that we incorporate within the backpropagation. At each recursion, and within the backpropagation, the algorithm localizes the most discriminative region, stores it, then erases it from the input image. At the end of the final recursion, the model has gathered a large extent of the object of interest, which is then fed to the classifier. Thus, our model is driven to localize complete relevant regions while discarding irrelevant regions, resulting in more reliable region localization. Moreover, since the discriminative parts are allowed to extend over different instances, our proposal handles multiple instances intrinsically.

The main contribution of this paper is a new deep learning framework for WSL at pixel level. The framework is composed of two sequential sub-networks, where the first one localizes regions of interest and the second classifies them. Based on CE, the end-to-end training of the framework makes it possible to incorporate the prior knowledge that an image is likely to contain both relevant and irrelevant regions. Through the CE measured at the classifier level, the localizer is driven to localize relevant regions (with low CE) and irrelevant regions (with high CE). Such localization is achieved with the main goal of providing more interpretable and reliable regions of interest with low false positive localization. This paper also contributes a consistent recursive erasing algorithm that is incorporated within backpropagation, along with a practical implementation, in order to obtain complete discriminative regions. Finally, we conduct an extensive series of experiments on three public image datasets (medical and natural), where the results show the effectiveness of the proposed approach in terms of pointwise localization (measured with the Dice index) while maintaining competitive accuracy for image-level classification.

## 2 Background on WSL

In this section, we briefly review state-of-the-art WSL methods, divided into two main categories, that aim at pointwise localization of regions of interest using only image-level labels as supervision. (1) Fully convolutional networks with spatial pooling have been shown to be effective for localizing discriminative regions (Durand et al., 2017; Oquab et al., 2015; Sun et al., 2016; Zhang et al., 2018b; Zhou et al., 2016). Multi-instance learning methods have been used within an attention framework to localize regions of interest (Ilse et al., 2018). (Singh & Lee, 2017) propose to randomly hide large patches in the training image in order to force the network to seek other discriminative regions and recover a large part of the object of interest, since neural networks often localize only the small, most discriminative regions of the object of interest (Kim et al., 2017; Singh & Lee, 2017; Zhou et al., 2016). (Wei et al., 2017) use the attention map of a trained network to erase the most discriminative part of the original image. (Kim et al., 2017) use a two-phase learning stage where the attention maps of two networks are combined to obtain a complete region of the object. (Li et al., 2018) propose a two-stage approach where the first network classifies the image and provides an attention map of the most discriminative parts. This attention is used to erase the corresponding parts of the input image; the resulting erased image is then fed to a second network to make sure that no discriminative parts are left. (2) Inspired by human visual attention, top-down methods were proposed. In (Simonyan et al., 2014; Springenberg et al., 2015; Zeiler & Fergus, 2014), backpropagation error is used to visualize saliency maps over the image for the predicted class. In (Cao et al., 2015), an attention map is built to identify the class-relevant regions using a feedback layer.
(Zhang et al., 2018a) propose Excitation backprop, which passes top-down signals downwards in the network hierarchy through a probabilistic framework. Grad-CAM (Selvaraju et al., 2017) generalizes CAM (Zhou et al., 2016) using the derivative of the class scores with respect to each location on the feature maps; it has been further generalized in (Chattopadhyay et al., 2018). In practice, top-down methods are considered visual explanatory tools, and they can be overwhelming in terms of computation and memory usage, even during inference.

While the aforementioned approaches have shown great success, mostly with natural images, they still lack a mechanism for modeling what is relevant and irrelevant within an image, which is important for reducing false positive localization. This is crucial for determining the reliability of the regions of interest. Erase-based methods (Kim et al., 2017; Li et al., 2018; Pathak et al., 2015; Singh & Lee, 2017; Wei et al., 2017) follow this concept, where the non-discriminative parts are suppressed through constraints, allowing only the discriminative ones to emerge. Explicitly modeling negative evidence within the model has been shown to be effective in WSL (Azizpour et al., 2015; Durand et al., 2017, 2016; Parizi et al., 2015).

Our proposal is related to (Behpour et al., 2019; Wan et al., 2018) in using an entropy measure to explore the input image. However, while (Wan et al., 2018) define an entropy over the bounding boxes’ position to minimize its variance, we define a CE over the classifier that is low over discriminative regions and high over non-discriminative ones. Our recursive erasing algorithm follows general erasing and mining techniques (Kim et al., 2017; Li et al., 2018; Singh & Lee, 2017; Wan et al., 2018; Wei et al., 2017), but places more emphasis on mining consistent regions, and is performed on the fly during backpropagation. For instance, compared to (Wan et al., 2018), our algorithm attempts to expand regions of interest, accumulates consistent regions while erasing, and provides an automatic mechanism to stop erasing for each sample independently of the others. In contrast, (Wan et al., 2018) aim to locate multiple instances without erasing, and use a manual/empirical threshold for assigning confidence to boxes. Our proposal can be seen as a guided dropout (Srivastava et al., 2014). While standard dropout is applied over a given input image to randomly zero out pixels, our proposed approach seeks to zero out irrelevant pixels and keep only the discriminative ones that support the image label. From this perspective, our proposal mimics a discriminative gate that inhibits irrelevant and noisy regions while allowing only informative and discriminative regions to pass through the gate.

## 3 The min-max entropy framework for WSL

Notations and definitions: Let us consider a set of training samples $\mathbb{D} = \{(X_i, y_i)\}_{i=1}^{n}$, where $X_i$ is an input image with depth $d$, height $h$, and width $w$, a realization of the discrete random variable $X$ with support set $\mathcal{X}$; $y_i$ is the image-level label (i.e., image class), a realization of the discrete random variable $Y$ with support set $\mathcal{Y} = \{1, \dots, c\}$. We define a decidable region (in this context, a region denotes a single pixel) of an image as any informative part of the image that allows predicting the image label. An undecidable region is any noisy, uninformative, and irrelevant part of the image that does not provide any indication of, or support for, the image class. To model these definitions, we consider a binary mask $M^+ \in \{0, 1\}^{h \times w}$ where a location $(r, z)$ with value $1$ indicates a decidable region; otherwise it is an undecidable region. We model the decidability of a given location $(r, z)$ with a binary random variable $M$. Its realization is $m$, and its conditional probability $p_M$ over the input image is defined as follows,

$$p_M(m = 1 \mid X, (r, z)) = \begin{cases} 1 & \text{if } X(r, z) \text{ is a decidable region}, \\ 0 & \text{otherwise}. \end{cases} \tag{1}$$

We denote by $M^- = U - M^+$ the binary mask indicating the undecidable region, where $M^+$ is the binary mask of the decidable region and $U = \{1\}^{h \times w}$ is the all-ones mask. We consider the undecidable region as the complement of the decidable one. We can write: $\|M^+\|_0 + \|M^-\|_0 = h \times w$, where $\|\cdot\|_0$ is the $L_0$ norm. Following these definitions, an input image $X$ can be decomposed into two images as $X = X \odot M^+ + X \odot M^-$, where $\odot$ is the Hadamard product. We note $X^+ = X \odot M^+$ and $X^- = X \odot M^-$. $X^+$ inherits the image-level label of $X$. We can write the pair $(X_i^+, y_i)$ in the same way as $(X_i, y_i)$. We denote by $\hat{M}^+$ and $\hat{M}^-$ the respective approximations of $M^+$ and $M^-$. We are interested in modeling the true conditional distribution $p(Y \mid X)$, where $p(Y = y_i \mid X = X_i) = 1$; $\hat{p}(Y \mid X)$ is its estimate. Following the previous discussion, predicting the image label depends only on the decidable region, i.e., $X^+$. Thus, knowing $X^-$ does not add any knowledge to the prediction, since $X^-$ does not contain any information about the image label. This leads to: $p(Y \mid X^+, X^-) = p(Y \mid X^+)$. As a consequence, the image label is conditionally independent of $X^-$ given $X^+$ (Koller & Friedman, 2009): $Y \perp X^- \mid X^+$, where $X^+$ and $X^-$ are the random variables modeling the decidable and the undecidable regions, respectively. In the following, we provide more details on how to exploit this conditional independence property in order to estimate $M^+$ and $M^-$.
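The decomposition of an image into a decidable part and an undecidable part by a binary mask and its complement can be illustrated with a minimal sketch. The function name `split_by_mask`, the single-channel nested-list image, and the toy values are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
def split_by_mask(image, mask):
    """Split an h x w image into decidable and undecidable parts.

    image: list of rows of pixel values (single channel for brevity).
    mask:  list of rows with entries in {0, 1}; 1 marks a decidable pixel.
    """
    # Element-wise (Hadamard) product with the mask keeps decidable pixels.
    x_pos = [[p * m for p, m in zip(row_p, row_m)]
             for row_p, row_m in zip(image, mask)]
    # Element-wise product with the complement keeps undecidable pixels.
    x_neg = [[p * (1 - m) for p, m in zip(row_p, row_m)]
             for row_p, row_m in zip(image, mask)]
    return x_pos, x_neg


image = [[3, 5, 7],
         [2, 4, 6]]
mask = [[1, 0, 0],
        [1, 1, 0]]

x_pos, x_neg = split_by_mask(image, mask)

# The two parts add back up to the original image, pixel by pixel.
reconstructed = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(x_pos, x_neg)]
assert reconstructed == image

# The non-zero counts (L0 norms) of the mask and its complement cover every pixel.
n_pos = sum(m for row in mask for m in row)
n_neg = sum(1 - m for row in mask for m in row)
assert n_pos + n_neg == 2 * 3
```

Because the two masks are complementary, every pixel is assigned to exactly one of the two images, which is what allows the label to be attached to the decidable part alone.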

Min-max entropy: We consider modeling the uncertainty of the model prediction over decidable or undecidable regions using conditional entropy (CE). Let us consider the CE of $Y$ given $X = X^+$, denoted $H(Y \mid X = X^+)$ and computed as (Cover & Thomas, 2006),

$$H(Y \mid X = X^+) = -\sum_{y \in \mathcal{Y}} \hat{p}(Y = y \mid X = X^+) \log \hat{p}(Y = y \mid X = X^+). \tag{2}$$

Since the model is required to be certain about its prediction over $X^+$, we constrain the model to have low entropy over $X^+$. Eq.2 reaches its minimum when the probability of one of the classes is certain, i.e., $\hat{p}(Y = y \mid X = X^+) = 1$ for some $y \in \mathcal{Y}$ (Cover & Thomas, 2006). Instead of directly minimizing Eq.2, and in order to ensure that the model predicts the correct image label, we cast a supervised learning problem using the cross-entropy between $p$ and $\hat{p}$, with the image-level label $y_i$ of $X_i$ as supervision,

$$H(p_i, \hat{p}_i)^+ = -\sum_{y \in \mathcal{Y}} p(Y = y \mid X = X_i^+) \log \hat{p}(Y = y \mid X = X_i^+) = -\log \hat{p}(y_i \mid X_i^+). \tag{3}$$

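Eq.3 reaches its minimum under the same conditions as Eq.2, with the true image label as the prediction. We note that Eq.3 is the negative log-likelihood of the sample $(X_i^+, y_i)$. In the case of $X^-$, we consider the CE of $Y$ given $X = X^-$, denoted $H(Y \mid X = X^-)$ and computed as,

$$H(Y \mid X = X^-) = -\sum_{y \in \mathcal{Y}} \hat{p}(Y = y \mid X = X^-) \log \hat{p}(Y = y \mid X = X^-). \tag{4}$$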

Over irrelevant regions, the model is required to be unable to decide which image class to predict, since there is no evidence to support any class. This can be seen as high uncertainty in the model decision. Therefore, we consider maximizing the entropy of Eq.4. The latter reaches its maximum at the uniform distribution (Cover & Thomas, 2006). Thus, the inability of the model to decide is reached when each class is equiprobable. An alternative to maximizing Eq.4 is to use a supervised target distribution, since it is already known (i.e., the uniform distribution). To this end, we consider $q$ as a uniform distribution, and cast a supervised learning setup using the cross-entropy between $q$ and $\hat{p}$ over $X^-$,

$$H(q_i, \hat{p}_i)^- = -\sum_{y \in \mathcal{Y}} q(Y = y \mid X = X_i^-) \log \hat{p}(Y = y \mid X = X_i^-) = -\frac{1}{c} \sum_{y \in \mathcal{Y}} \log \hat{p}(y \mid X_i^-). \tag{5}$$
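The link between the cross-entropy of Eq.5 and the entropy of Eq.4 can be made explicit with the standard decomposition of a cross-entropy into an entropy and a Kullback-Leibler divergence. This is a textbook identity, added here only as a reading aid, with $\mathrm{KL}$ denoting the Kullback-Leibler divergence and $q_i$ the uniform distribution over the $c$ classes:

$$H(q_i, \hat{p}_i)^- = \underbrace{-\sum_{y \in \mathcal{Y}} \frac{1}{c} \log \frac{1}{c}}_{= \log c} + \underbrace{\sum_{y \in \mathcal{Y}} \frac{1}{c} \log \frac{1/c}{\hat{p}(y \mid X_i^-)}}_{= \mathrm{KL}(q_i \,\|\, \hat{p}_i) \,\geq\, 0} \;\geq\; \log c.$$

Equality holds only when $\hat{p}(y \mid X_i^-) = 1/c$ for every class, in which case the entropy of Eq.4 also equals $\log c$, its largest possible value over $c$ classes.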

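The minimum of Eq.5 is reached when $\hat{p}(Y \mid X = X_i^-)$ is uniform; thus, Eq.4 reaches its maximum. Now, we can write the total training loss to be minimized as,

$$\min \; \mathbb{E}_{(X_i, y_i) \in \mathbb{D}} \left[ H(p_i, \hat{p}_i)^+ + H(q_i, \hat{p}_i)^- \right]. \tag{6}$$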
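The two terms of the objective above can be sketched in a few lines of plain Python. This is a minimal illustration on toy class scores, not the authors' PyTorch code; the function names and the example scores are assumptions chosen for the sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss_decidable(probs, label):
    """Eq.3: negative log-likelihood of the true label on the decidable part."""
    return -math.log(probs[label])


def loss_undecidable(probs):
    """Eq.5: cross-entropy between a uniform target and the prediction."""
    c = len(probs)
    return -sum(math.log(p) for p in probs) / c


def min_max_entropy_loss(scores_pos, scores_neg, label):
    """Per-sample objective of Eq.6 from classifier scores on X+ and X-."""
    return (loss_decidable(softmax(scores_pos), label)
            + loss_undecidable(softmax(scores_neg)))


# Toy check with c = 3 classes and true label 0.
confident = [8.0, 0.0, 0.0]   # scores that strongly favour class 0
flat = [0.0, 0.0, 0.0]        # scores that give a uniform posterior

# Desired behaviour: confident on the decidable part, undecided on the rest.
good = min_max_entropy_loss(confident, flat, label=0)
# Opposite behaviour: undecided on the decidable part, confident on the rest.
bad = min_max_entropy_loss(flat, confident, label=0)
assert good < bad

# On a uniform posterior, Eq.5 equals log(c), its smallest possible value.
assert abs(loss_undecidable(softmax(flat)) - math.log(3)) < 1e-12
```

The loss is lowest when the classifier is confident and correct on the decidable image and spreads its probability mass evenly on the undecidable one, which is the behaviour the localizer is trained to induce.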

The posterior probability $\hat{p}(Y \mid X)$ is modeled using a classifier $\mathcal{C}$ with a set of parameters $\theta_C$; it can operate either on $X^+$ or $X^-$. The binary mask $M^+$ (and $M^-$) is learned using another model $\mathcal{M}$ with a set of parameters $\theta_M$. In this work, both models are based on neural networks (fully convolutional networks (Long et al., 2015) in particular). The networks $\mathcal{M}$ and $\mathcal{C}$ can be seen as two parts of one single network that localizes regions of interest using a binary mask, then classifies their content. Fig.2 illustrates the entire model.

Due to the depth of the network, $\mathcal{M}$ receives its supervised gradient based only on the error made by $\mathcal{C}$. In order to boost the supervised gradient at $\mathcal{M}$, and provide it with more hints to select the most discriminative regions with respect to the image class, we consider using a secondary classification task at the output of $\mathcal{M}$ to classify the input $X$, following (Lee et al., 2015). $\mathcal{M}$ computes the posterior probability $\hat{p}^s(Y \mid X)$, which is another estimate of $p(Y \mid X)$. To this end, $\mathcal{M}$ is trained to minimize the cross-entropy between $p$ and $\hat{p}^s$,

$$H(p_i, \hat{p}_i^s) = -\log \hat{p}^s(Y = y_i \mid X = X_i). \tag{7}$$

The total training loss to minimize is formulated as,

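$$\min_{\{\theta_M, \theta_C\}} \; \mathbb{E}_{(X_i, y_i) \in \mathbb{D}} \left[ H(p_i, \hat{p}_i)^+ + H(q_i, \hat{p}_i)^- + H(p_i, \hat{p}_i^s) \right]. \tag{8}$$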

Mask computation and recursive erasing: The mask $M^+$ is computed using the last feature maps of $\mathcal{M}$, which contain highly abstract discriminative activations. We denote these feature maps by a tensor whose slice $T_i^k$ is the spatial map of class $k$. $M_i^+$ is computed by aggregating the spatial activations of all the classes as $T_i = \sum_{k=1}^{c} \hat{p}^s(Y = k \mid X = X_i) \, T_i^k$, where $T_i$ is the continuous downsampled version of $M_i^+$, and $T_i^k$ is the feature map of class $k$ for the input $X_i$. At convergence, the posterior probability of the winning class is pushed toward $1$ while the rest is pushed down to $0$. This leaves only the feature map of the winning class. $T_i$ is upscaled using interpolation (Sec.A.2) to $T_i^{\uparrow}$, which has the same size as the input $X_i$, then pseudo-thresholded using a sigmoid function to obtain a pseudo-binary $M_i^+$,

$$p_M(m = 1 \mid X_i, (r, z)) = \frac{1}{1 + \exp\left(-\omega \left(T_i^{\uparrow}(r, z) - \sigma'\right)\right)}, \tag{9}$$

where $\omega$ is a constant scalar that ensures that the sigmoid is approximately equal to $1$ when $T_i^{\uparrow}(r, z)$ is larger than $\sigma'$, and approximately equal to $0$ otherwise. At this point, $X^-$ may still contain discriminative regions. To alleviate this issue, we propose an incremental and recursive erasing approach, applied during learning, that drives $\mathcal{M}$ to mine complete discriminative regions. The mining algorithm is consistent and sample dependent; it has a maximum recursion depth, associates a trust coefficient with each recursion, is integrated within the backpropagation, operates only during training, and has a practical implementation. Due to space limitations, we describe it in the supplementary material (Sec.A.1).
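The mask computation described above can be sketched end to end on a toy example. Several choices below are assumptions made for illustration: the per-class maps are aggregated with posterior weights (our reading of the aggregation step), nearest-neighbour upscaling stands in for the interpolation of Sec.A.2, and the values `omega=50.0` and `sigma=0.5` are arbitrary rather than the authors' settings.

```python
import math


def aggregate_maps(class_maps, posteriors):
    """Weight each class map by its posterior and sum them (assumed aggregation)."""
    h, w = len(class_maps[0]), len(class_maps[0][0])
    return [[sum(p * m[r][z] for p, m in zip(posteriors, class_maps))
             for z in range(w)] for r in range(h)]


def upscale_nearest(fmap, factor):
    """Nearest-neighbour upscaling, a stand-in for the interpolation of Sec.A.2."""
    return [[fmap[r // factor][z // factor]
             for z in range(len(fmap[0]) * factor)]
            for r in range(len(fmap) * factor)]


def pseudo_binary_mask(upscaled, omega, sigma):
    """Eq.9: steep sigmoid centred at sigma, applied at every location (r, z)."""
    return [[1.0 / (1.0 + math.exp(-omega * (v - sigma))) for v in row]
            for row in upscaled]


# Two classes, 1 x 2 feature maps; the posterior fully favours class 0.
class_maps = [[[0.9, 0.1]], [[0.2, 0.8]]]
posteriors = [1.0, 0.0]

t = aggregate_maps(class_maps, posteriors)      # only the class-0 map survives
t_up = upscale_nearest(t, factor=2)             # 2 x 4 map
mask = pseudo_binary_mask(t_up, omega=50.0, sigma=0.5)

assert all(m > 0.99 for m in (mask[0][0], mask[0][1]))   # activations above sigma
assert all(m < 0.01 for m in (mask[0][2], mask[0][3]))   # activations below sigma
```

A large `omega` makes the sigmoid behave like a hard threshold while remaining differentiable, which is what lets the mask be learned by backpropagation.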

## 4 Results and analysis

Our experiments focus simultaneously on classification and pointwise localization tasks. Thus, we consider datasets that provide both image and pixel-level labels for evaluation. In particular, the following three datasets are considered: GlaS in the medical domain, and CUB-200-2011 and Oxford flower 102 for natural scene images. (1) The GlaS dataset, one of the rare medical datasets that fit our scenario (Rony et al., 2019), was provided in the 2015 Gland Segmentation in Colon Histology Images Challenge Contest (Sirinukunwattana et al., 2017). The main task of the challenge is gland segmentation of microscopic images. However, image-level labels were provided as well. The dataset is composed of 165 images derived from 16 Hematoxylin and Eosin (H&E) histology sections of two grades (classes): benign and malignant. It is divided into 84 samples for training, and 80 samples for test. Images have a high variation in terms of gland shape/size and overall H&E stain. In this dataset, the glands are the regions of interest that pathologists use to grade the image as benign or malignant. (2) The CUB-200-2011 dataset (www.vision.caltech.edu/visipedia/CUB-200-2011.html) (Wah et al., 2011) is a dataset of bird images covering 200 species. Preliminary experiments were conducted on a small version of this dataset, where we randomly selected 5 species to build a small dataset with separate training and test samples, referred to in this work as CUB5. The entire dataset is referred to as CUB. In this dataset, the regions of interest are the birds. (3) The Oxford flower 102 dataset (http://www.robots.ox.ac.uk/~vgg/data/flowers/102/) (Nilsback & Zisserman, 2007) is a collection of 102 species (classes) of flowers commonly occurring in the United Kingdom, referred to here as OxF. We used the provided splits for the training, validation and test sets. Regions of interest are the flowers, which were segmented automatically. In the GlaS, CUB5 and CUB datasets, we randomly split the training samples into a subset used for effective training and a subset used for validation to perform early stopping. We provide in our public code the splits used and the deterministic code that generated them for the different datasets.

In all the experiments, image-level labels are used during training/evaluation, while pixel-level labels are used exclusively during evaluation. The evaluation is conducted at two levels: at the image level, where the classification error is reported, and at the pixel level, where we report the F1 score (Dice index) over the foreground (region of interest), referred to as F1$^+$. When dealing with binary data, the F1 score is equivalent to the Dice index. We also report the F1 score over the background, referred to as F1$^-$, in order to measure how well the model is able to identify irrelevant regions. We compare our method to different WSL methods. These methods use a similar pre-trained backbone (resnet18 (He et al., 2016)) for feature extraction and differ mainly in the final pooling layer: CAM-Avg uses average pooling (Zhou et al., 2016), CAM-Max uses max-pooling (Oquab et al., 2015), CAM-LSE uses an approximation to the maximum (Pinheiro & Collobert, 2015; Sun et al., 2016), Wildcat uses the pooling in (Durand et al., 2017), Grad-CAM (Selvaraju et al., 2017), and Deep MIL is the work of (Ilse et al., 2018) with adaptation to multi-class. We use supervised segmentation with U-Net (Ronneberger et al., 2015) as an upper bound on the performance for pixel-level evaluation (Full sup.). As a simple baseline, we use a mask full of ones with the same size as the image as a constant prediction of the regions of interest, to show that F1$^+$ alone is not a sufficient metric to evaluate pixel-level localization, particularly over the GlaS set (All-ones, Tab.2). In our method, $\mathcal{M}$ and $\mathcal{C}$ share the same pre-trained backbone (resnet101 (He et al., 2016)) to avoid overfitting, while using (Durand et al., 2017) as a pooling function. All methods are trained using stochastic gradient descent with momentum. In our approach, we use the same hyper-parameters over all datasets, while other methods require adaptation to each dataset. We provide the dataset splits, more experimental details, and visual results in the supplementary material (Sec.B). Our reproducible code is publicly available.
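The foreground and background Dice scores, and the reason the All-ones baseline is informative, can be illustrated with a short sketch. The helper `dice`, the toy masks, and the convention of returning 1.0 when the class is absent from both masks are assumptions made for this example, not the evaluation code of the paper.

```python
def dice(pred, target, positive=1):
    """Dice index (F1) between two binary masks for the chosen class value."""
    tp = fp = fn = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if p == positive and t == positive:
                tp += 1
            elif p == positive:
                fp += 1
            elif t == positive:
                fn += 1
    denom = 2 * tp + fp + fn
    # Assumed convention: a class absent from both masks counts as a perfect match.
    return 2 * tp / denom if denom else 1.0


# Ground truth where the region of interest covers most of the image.
target = [[1, 1, 1, 0],
          [1, 1, 1, 0]]
all_ones = [[1, 1, 1, 1],
            [1, 1, 1, 1]]

f1_fg = dice(all_ones, target, positive=1)   # foreground Dice
f1_bg = dice(all_ones, target, positive=0)   # background Dice

assert abs(f1_fg - 12 / 14) < 1e-12   # high score from a constant prediction
assert f1_bg == 0.0                   # the background is never identified
```

When the object fills most of the image, a constant full mask already scores well on the foreground, so the background score is needed to reveal whether a model is actually selective.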

A comparison of the results obtained by the different methods, over all datasets, is presented in Tab.1 and Tab.2, with visual results illustrated in Fig.3. In Tab.2, compared to other WSL methods, our method obtains a relatively similar F1$^+$ score, while it obtains a large F1$^-$ over GlaS, where it may be easy to obtain a high F1$^+$ by predicting a mask full of ones (Fig.3). However, a model needs to be very selective in order to obtain a high F1$^-$ score, i.e., to localize tissues (irrelevant regions), which is where our model seems to excel. The CUB5 set seems to be more challenging due to the variable size (from small to big) of the birds, their view, the context/surrounding environment, and the few training samples. Our model outperforms all the WSL methods in both F1$^+$ and F1$^-$ by a large gap, due mainly to its ability to discard non-discriminative regions, which leaves it only with the region of interest, the bird in this case. While our model shows improvements in pointwise localization, it is still far behind full supervision.

Similar improvements are observed on the CUB data. In the case of the OxF dataset, our approach provides low F1$^+$ values compared to other WSL methods. However, the latter are not far from the performance of All-ones, which predicts a constant mask. Given the large size of the flowers, predicting a mask that is active over the whole image will easily lead to a high F1$^+$. The best WSL methods for OxF are only slightly better than All-ones, suggesting that such methods have predicted a full mask in many cases. In terms of F1$^-$, our approach is better than all the WSL techniques. All methods achieve a low classification error on GlaS, which implies that it represents an easy classification problem. Surprisingly, the other methods seem to overfit on CUB5, while our model shows robustness. The other methods outperform our approach on CUB and OxF, although ours is still within a competitive range of half of the WSL methods. Results obtained on these datasets indicate that, compared to WSL methods, our approach is effective in terms of image classification and pointwise localization, with more reliability in the latter.

The visual results of our approach (Fig.3) show that the predicted regions of interest on GlaS agree with the methodology pathologists follow for colon cancer diagnosis, where the glands are used as a diagnostic tool. Additionally, it deals well with multiple instances when there are several glands within the image. On CUB5/CUB, our model succeeds in locating the bird in order to predict its category, as a person would do in such a task. We notice that the head, chest, tail, or particular spots on the body are often the parts used by our model to decide a bird’s species, which seems a reasonable strategy as well. On the OxF dataset, we observe that our approach mainly locates the central part of the pistil. When this is not enough, the model relies on the petals or on unique discriminative parts of the flower. In terms of time complexity, the inference time of our model is the same as that of a standard fully convolutional network, since the recursive algorithm is disabled during inference. However, one may expect a moderate increase in training time that depends mainly on the depth of the recursion (see Sec.B.3.2).

## 5 Conclusion

In this work, we present a novel approach for WSL at pixel level where we impose learning relevant and irrelevant regions within the model with the aim of reducing false positive localization. Evaluated on three datasets, and compared to state-of-the-art WSL methods, our approach shows its effectiveness in accurately localizing regions of interest with low false positives while maintaining a competitive classification error. This makes our approach more reliable in terms of interpretability. As future work, we consider extending our approach to handle multiple classes within the image. Different constraints can be applied over the predicted mask, such as texture properties, shape, or other region constraints. Predicting bounding boxes instead of heat maps is considered as well, since they can be more suitable in some applications where pixel-level accuracy is not required. Our recursive erasing algorithm can be further improved by using a memory-like mechanism that provides spatial information to prevent forgetting the previously spotted regions and promote localizing the entire region (Sec.B.3).

#### Acknowledgments

This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research.

## References

• Azizpour et al. (2015) H. Azizpour, M. Arefiyan, S. Naderi Parizi, and S. Carlsson. Spotlight the negatives: A generalized discriminative latent model. In BMVC, 2015.
• Behpour et al. (2019) S. Behpour, K. Kitani, and B. Ziebart. Ada: Adversarial data augmentation for object detection. In WACV, 2019.
• Belharbi et al. (2016) S. Belharbi, R. Hérault, C. Chatelain, and S. Adam. Deep multi-task learning with evolving weights. In ESANN, 2016.
• Belharbi et al. (2017) S. Belharbi, C. Chatelain, R. Hérault, and S. Adam. Neural networks regularization through class-wise invariant representation learning. arXiv preprint arXiv:1709.01867, 2017.
• Bilen & Vedaldi (2016) H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
• Cao et al. (2015) C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV, 2015.
• Chen et al. (2015) L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
• Cover & Thomas (2006) T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.
• Durand et al. (2017) T. Durand, T. Mordan, N. Thome, and M. Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In CVPR, 2017.
• Durand et al. (2016) T. Durand, N. Thome, and M. Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In CVPR, 2016.
• Ghiasi et al. (2018) G. Ghiasi, T.-Y. Lin, and Q. V. Le. Dropblock: A regularization method for convolutional networks. In NIPS, 2018.
• He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
• Ilse et al. (2018) M. Ilse, J. M. Tomczak, and M. Welling. Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712, 2018.
• Kantorov et al. (2016) V. Kantorov, M. Oquab, M. Cho, and I. Laptev. Contextlocnet: Context-aware deep network models for weakly supervised localization. In ECCV, 2016.
• Kervadec et al. (2019) H. Kervadec, J. Dolz, M. Tang, E. Granger, Y. Boykov, and I. Ben Ayed. Constrained-CNN losses for weakly supervised segmentation. MedIA, 2019.
• Khoreva et al. (2017) A. Khoreva, R. Benenson, J. H. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, 2017.
• Kim et al. (2017) D. Kim, D. Cho, D. Yoo, and I. So Kweon. Two-phase learning for weakly supervised object localization. In ICCV, 2017.
• Koller & Friedman (2009) D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.
• Krähenbühl & Koltun (2011) P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
• Krupka & Tishby (2007) E. Krupka and N. Tishby. Incorporating prior knowledge on features into learning. In Artificial Intelligence and Statistics, 2007.
• Lee et al. (2015) C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-Supervised Nets. In ICAIS, 2015.
• Li et al. (2018) K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu. Tell me where to look: Guided attention inference network. In CVPR, 2018.
• Lin et al. (2016) D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.
• Long et al. (2015) J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
• Mitchell (1980) T.M. Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, 1980.
• Nilsback & Zisserman (2007) M.-E. Nilsback and A. Zisserman. Delving into the whorl of flower segmentation. In BMVC, 2007.
• Oquab et al. (2015) M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In CVPR, 2015.
• Parizi et al. (2015) S. Naderi Parizi, A. Vedaldi, A. Zisserman, and P. F. Felzenszwalb. Automatic discovery and optimization of parts for image classification. In ICLR, 2015.
• Pathak et al. (2015) D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
• Pinheiro & Collobert (2015) P. H. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
• Ronneberger et al. (2015) O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
• Rony et al. (2019) J. Rony, S. Belharbi, J. Dolz, I. Ben Ayed, L. McCaffrey, and E. Granger. Deep weakly-supervised learning methods for classification and localization in histology images: a survey. CoRR, abs/1909.03354, 2019.
• Selvaraju et al. (2017) R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
• Shen et al. (2018) Y. Shen, R. Ji, S. Zhang, W. Zuo, Y. Wang, and F. Huang. Generative adversarial learning towards fast weakly supervised detection. In CVPR, 2018.
• Simonyan et al. (2014) K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLRw, 2014.
• Singh & Lee (2017) K. K. Singh and Y. J. Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV, 2017.
• Sirinukunwattana et al. (2017) K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez, et al. Gland segmentation in colon histology images: The glas challenge contest. MIA, 2017.
• Springenberg et al. (2015) J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In ICLRw, 2015.
• Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
• Sun et al. (2016) C. Sun, M. Paluri, R. Collobert, R. Nevatia, and L. Bourdev. Pronet: Learning to propose object-specific boxes for cascaded neural networks. In CVPR, 2016.
• Tang et al. (2018) M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers. Normalized Cut Loss for Weakly-supervised CNN Segmentation. In CVPR, 2018.
• Tang et al. (2017) P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In CVPR, 2017.
• Teh et al. (2016) E. W. Teh, M. Rochan, and Y. Wang. Attention networks for weakly supervised object localization. In BMVC, 2016.
• Wah et al. (2011) C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, California Institute of Technology, 2011.
• Wan et al. (2018) F. Wan, P. Wei, J. Jiao, Z. Han, and Q. Ye. Min-entropy latent model for weakly supervised object detection. In CVPR, 2018.
• Wei et al. (2017) Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In CVPR, 2017.
• Yu et al. (2007) T. Yu, T. Jan, S. Simoff, and J. Debenham. Incorporating prior domain knowledge into inductive machine learning. Unpublished doctoral dissertation Computer Sciences, 2007.
• Zeiler & Fergus (2014) M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
• Zhang et al. (2018a) J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. IJCV, 2018a.
• Zhang & Zhu (2018) Q.-s. Zhang and S.-c. Zhu. Visual interpretability for deep learning: a survey. FITEE, 2018.
• Zhang et al. (2018b) X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang. Adversarial complementary learning for weakly supervised object localization. In CVPR, 2018b.
• Zhou et al. (2016) B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
• Zhou (2017) Z.-H. Zhou. A brief introduction to weakly supervised learning. NSR, 2017.

## Appendix A The min-max entropy framework for WSL

### A.1 Region completeness using incremental recursive erasing and trust coefficients

Deep classification models tend to rely on small discriminative regions (Kim et al., 2017; Singh & Lee, 2017; Zhou et al., 2016). Thus, in our proposal, the regions marked as non-discriminative may still contain discriminative parts. Following (Kim et al., 2017; Li et al., 2018; Pathak et al., 2015; Singh & Lee, 2017), and in particular (Wei et al., 2017), we propose an incremental and recursive erasing approach that drives the model to seek complete discriminative regions. However, in contrast to (Wei et al., 2017), where such mining is done offline, we incorporate the erasing within the backpropagation using an efficient and practical implementation. This allows the model to learn to seek discriminative parts, so erasing during inference is unnecessary. Our approach consists in applying the localizer recursively, up to a maximum depth, before applying the classifier within the same forward pass. The aim of the recursion is to mine more discriminative parts within the regions of the image currently masked as non-discriminative. We accumulate all discriminative parts in a temporal mask $R^{+,\star}_i$. At each recursion step, we mine the most discriminative part that has been correctly classified and accumulate it in $R^{+,\star}_i$. However, as the depth of the recursion increases, the image may run out of discriminative parts, and the model is then forced, unintentionally, to consider non-discriminative parts as discriminative. To alleviate this risk, we introduce trust coefficients that control how much we trust a mined discriminative region at each step $t$ of the recursion for each sample $i$, as follows,

$$R^{+,\star}_i \coloneqq \max\left(R^{+,\star}_i,\ \Psi(t,i)\, R^{+,t}_i\right), \tag{10}$$

where $\Psi(t,i)$ computes the trust of the current mask of the sample $i$ at the step $t$ as follows,

$$\forall t \geq 0, \quad \Psi(t,i) = \exp\left(\frac{-t}{\sigma}\right) \Gamma(t,i), \tag{11}$$

where the exponential term encodes the overall trust with respect to the current step $t$ of the recursion. Such trust is expected to decrease with the depth of the recursion (Belharbi et al., 2016). $\sigma$ controls the slope of the trust function. The second part of Eq.11, $\Gamma(t,i)$, is computed with respect to each sample. It quantifies how much we trust the estimated mask for the current sample $i$,

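$$\Gamma(t,i) = \begin{cases} \hat{p}^s(Y = y_i \mid X = X_i \odot R^{-,\star}_i) & \text{if } \hat{y}_i = y_i \text{ and } H(p_i, \hat{p}^s_i)_t \leq H(p_i, \hat{p}^s_i)_0, \\ 0 & \text{otherwise.} \end{cases} \tag{12}$$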

In Eq.12, $H(p_i, \hat{p}^s_i)_t$ is computed over the erased image $X_i \odot R^{-,\star}_i$. Eq.12 ensures that at a step $t$, for a sample $i$, the current mask is trusted only if the model correctly classifies the erased image and the loss does not increase. The first condition ensures that the accumulated discriminative regions belong to the same class and, more importantly, to the true class. Moreover, it ensures that the model does not change its class prediction through the erasing process. This introduces consistency between the mined regions across the steps and avoids mixing discriminative regions of different classes. The second condition ensures maintaining at least the same confidence in the predicted class as in the first forward pass without erasing ($t = 0$). The trust given in this case is equal to the probability of the true class.

The regions accumulator $R^{+,\star}_i$ is initialized to zero at $t = 0$ at each forward pass. It is not maintained through epochs; it starts over each time the sample $i$ is processed. This prevents accumulating incorrect regions that may occur at the beginning of the training. In order to automate when to stop erasing, we consider a maximum depth of the recursion. For a mini-batch, we keep erasing as long as we have not reached the maximum number of erasing steps and there is at least one sample with a non-zero trust coefficient (Eq.12). Once a sample is assigned a zero trust coefficient, it is kept at zero throughout the erasing (Eq.10) (Fig.4). A direct implementation of Eq.10 is not practical, since performing a recursive computation on a large model requires a large amount of memory that increases with the depth. To avoid this issue, we propose a practical implementation using gradient accumulation in the localizer through the loss in Eq.7; such an implementation requires the same memory size as in the case without erasing. An illustration of our proposed recursive erasing algorithm is provided in Fig.4. Alg.1 illustrates our implementation using accumulated gradients through the backpropagation within the localizer. We note that this erasing algorithm is performed only during training.
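The trust-weighted accumulation of Eq.10 to Eq.12 can be sketched in a few lines of plain Python. This is only an illustrative sketch and not the released implementation: it assumes the per-step outputs of the model (mined mask, true-class probability, predicted class, loss) are already available as a hypothetical list `steps`, and the values of `sigma` and `max_depth` are placeholders rather than the settings used in the experiments.

```python
import math


def gamma(prob_true_class, predicted, label, loss_t, loss_0):
    """Per-sample trust (Eq.12): probability of the true class on the erased
    image if the prediction is correct and the loss did not increase, else 0."""
    if predicted == label and loss_t <= loss_0:
        return prob_true_class
    return 0.0


def psi(t, sigma, gamma_value):
    """Trust coefficient (Eq.11): step-dependent decay times per-sample trust."""
    return math.exp(-t / sigma) * gamma_value


def accumulate(acc_mask, step_mask, trust):
    """Eq.10: element-wise max of the accumulator and the trust-weighted mask."""
    return [max(a, trust * m) for a, m in zip(acc_mask, step_mask)]


def mine_regions(steps, sigma=10.0, max_depth=4):
    """Accumulate mined regions for ONE sample over the recursion steps.

    `steps` is a hypothetical list of dicts, one per recursion step, with keys
    'mask' (flattened mined mask), 'prob', 'pred', 'label' and 'loss'.
    Step 0 is the forward pass without erasing.
    """
    acc = [0.0] * len(steps[0]["mask"])  # accumulator starts at zero
    loss_0 = steps[0]["loss"]
    for t, s in enumerate(steps[:max_depth]):
        trust = psi(t, sigma, gamma(s["prob"], s["pred"], s["label"], s["loss"], loss_0))
        if trust == 0.0:
            break  # a zero trust stays zero for the rest of the erasing
        acc = accumulate(acc, s["mask"], trust)
    return acc
```

In this sketch the masks are given up front; in the method described above, the mask at each step is produced by the model from the image erased at the previous step.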

### A.2 Note on interpolation (Eq.9)

In most neural network libraries (PyTorch (pytorch.org), Chainer (chainer.org)), the upscaling operations using interpolation/upsampling have a non-deterministic backward pass. This makes training unstable due to the non-deterministic gradient, and makes reproducibility impossible as well. To avoid such issues, we detach the upscaling operation in Eq.9 from the training graph and consider its output as input data.

## Appendix B Results and analysis

In this section, we provide more details on our experiments and analysis, and discuss some of the drawbacks of our approach. We took many precautions to make the code reproducible for our model up to PyTorch's terms of reproducibility. Please see the README.md file for the concerned section in the code. We checked reproducibility up to a precision of . All our experiments were conducted using the seed . We ran all our experiments over one GPU with 12GB of memory (our code supports multi-GPU and Batchnorm synchronization with our own support for reproducibility), and an environment with 10 to 64 GB of RAM (depending on the size of the dataset). Finally, this section shows more visual results, analysis, training time, and drawbacks.

### B.1 Datasets

We provide in Fig.5 some samples from each dataset’s test set along with their mask that indicates the region of interest.

As mentioned in Sec.4, we consider a subset of the original CUB-200-2011 dataset for preliminary experiments, which we refer to as CUB5. To build it, we randomly select 5 classes from the original dataset. Then, we pick all the corresponding samples of each class in the provided train and test sets to build our train and test sets (CUB5). Then, we build the effective train set and validation set by taking randomly , and the left from the train set of CUB5, respectively. We provide the splits, and the code used to generate them. Our code generates the following classes:

1. 019.Gray_Catbird

2. 099.Ovenbird

3. 108.White_necked_Raven

4. 171.Myrtle_Warbler

5. 178.Swainson_Warbler
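The construction of CUB5 described above can be sketched as follows. This is a minimal illustration and not the authors' split script: the function name `build_cub5`, the `(path, class_name)` sample format, the seed, and the `valid_fraction` value are all assumptions, since the actual split proportions and seed are not stated in this section.

```python
import random


def build_cub5(all_classes, train_samples, n_classes=5, valid_fraction=0.2, seed=0):
    """Pick `n_classes` classes at random, then split their train samples.

    `train_samples` is a list of (path, class_name) pairs. `valid_fraction`
    and `seed` are placeholder values, not the ones used for the paper.
    """
    rng = random.Random(seed)  # local generator, so the split is repeatable
    chosen = sorted(rng.sample(sorted(all_classes), n_classes))
    subset = [s for s in train_samples if s[1] in chosen]
    rng.shuffle(subset)
    n_valid = int(round(valid_fraction * len(subset)))
    valid, train = subset[:n_valid], subset[n_valid:]
    return chosen, train, valid
```

The same `chosen` list would then be used to filter the provided test set, which is left untouched otherwise.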

### B.2 Experiments setup

The following is the configuration we used for our model over all the datasets:

Data

1. Patch size (hxw): . (patches are sampled for training; for evaluation, the entire input image is used).
2. Augment patch using random rotation, horizontal/vertical flipping. (for CUB5 only horizontal flipping is performed).
3. Channels are normalized using mean and standard deviation.
4. For GlaS: patches are jittered using brightness=, contrast=, saturation=, hue=.

Model

Pretrained resnet101 (He et al., 2016) as a backbone with (Durand et al., 2017) as a pooling score with our adaptation, using modalities per class. We use dropout (Srivastava et al., 2014) (with value over GlaS and over CUB5, CUB, OxF), applied over the final map of the pooling function right before computing the score. High dropout is motivated by (Ghiasi et al., 2018; Singh & Lee, 2017). This allows dropping the most discriminative parts at the features with the most abstract representation. The dropout is not performed over the final mask, but only on the internal mask of the pooling function. As for the parameters of (Durand et al., 2017), we consider their since most negative evidence is dropped, and use . . For evaluation, our predicted mask is binarized using a threshold to obtain exactly a binary mask. All the masks presented in this work follow this thresholding. Our $F1^+$ and $F1^-$ are computed over this binary mask.
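The evaluation step described above (threshold the predicted mask, then score the binary mask) can be sketched as follows. This is an illustrative sketch under two assumptions that are not spelled out in this section: the two F1 scores are read as Dice indices over foreground and background pixels respectively, and the threshold value `0.5` and the `>=` comparison are placeholders.

```python
def binarize(mask, threshold=0.5):
    """Turn a soft mask (values in [0, 1]) into an exact binary mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in mask]


def _flatten(mask):
    return [v for row in mask for v in row]


def dice(pred, target, label):
    """Dice index (F1) of the pixels equal to `label` in two binary masks."""
    p = [v == label for v in _flatten(pred)]
    t = [v == label for v in _flatten(target)]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    return 1.0 if total == 0 else 2.0 * intersection / total


def dice_foreground(pred, target):
    return dice(pred, target, 1)


def dice_background(pred, target):
    return dice(pred, target, 0)
```

Scoring foreground and background separately makes false positives visible: a mask that covers the whole image gets a high foreground recall but a background score of zero.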

Optimization

1. Stochastic gradient descent, with momentum , with Nesterov.
2. Weight decay of over the weights.
3. Learning rate of decayed by each epochs with minimum value of .
4. Maximum epochs of 400.
5. Batch size of .
6. Early stopping over validation set using classification error as a stopping criterion.
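The learning-rate rule in item 3 reads as a step decay with a floor, which can be sketched as below. The function name and all numeric values in the usage lines are hypothetical; the actual base rate, decay factor, step size, and minimum value are not given in this section.

```python
def step_lr(epoch, base_lr, decay, step_size, min_lr):
    """Multiply `base_lr` by `decay` every `step_size` epochs, never going
    below `min_lr`. All arguments are placeholders for the paper's settings."""
    return max(min_lr, base_lr * decay ** (epoch // step_size))


# Example with made-up values: start at 0.1, divide by 10 every 10 epochs,
# and stop decaying once the rate reaches 1e-4.
schedule = [step_lr(e, 0.1, 0.1, 10, 1e-4) for e in (0, 10, 20, 100)]
```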

Other WSL methods use the following setup with respect to each dataset:

GlaS:

Data

1. Patch size (hxw): .
2. Augment patch using random horizontal flip.
3. Random rotation of one of: (degrees).
4. Patches are jittered using brightness=, contrast=, saturation=, hue=.

Model

1. Pretrained resnet18 (He et al., 2016) as a backbone.

Optimization

1. Stochastic gradient descent, with momentum , with Nesterov.
2. Weight decay of over the weights.
3. 160 epochs.
4. Learning rate of for the first , and of for the last epochs.
5. Batch size of .
6. Early stopping over validation set using classification error/loss as a stopping criterion.

CUB5:

Data

1. Patch size (hxw): . (resized while maintaining the ratio).
2. Augment patch using random horizontal flip.
3. Random rotation of one of: (degrees).
4. Random affine transformation with degrees , shear , scale .

Model

Pretrained resnet18 (He et al., 2016) as a backbone.

Optimization

1. Stochastic gradient descent, with momentum , with Nesterov.
2. Weight decay of over the weights.
3. epochs.
4. Learning rate of decayed every with .
5. Batch size of .
6. Early stopping over validation set using classification error/loss as a stopping criterion.

CUB/OxF:

Data

1. Patch size (hxw): . (resized while maintaining the ratio).
2. Augment patch using random horizontal flip.
3. Random rotation of one of: (degrees).
4. Random affine transformation with degrees , shear , scale .

Model

Pretrained resnet18 (He et al., 2016) as a backbone.

Optimization

1. Stochastic gradient descent, with momentum , with Nesterov.
2. Weight decay of over the weights.
3. epochs.
4. Learning rate of decayed every with .
5. Batch size of .
6. Early stopping over validation set using classification error/loss as a stopping criterion.
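All the setups above rely on early stopping over the validation set. One common way to realize it is sketched below, as an assumption rather than a description of the actual training code: the `EarlyStopping` class and its `patience` parameter are hypothetical, and the section does not say whether a patience window is used or whether the best validation model is simply selected after training.

```python
class EarlyStopping:
    """Track the best validation error and signal when to stop training."""

    def __init__(self, patience):
        self.patience = patience      # hypothetical: epochs without improvement
        self.best = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch, valid_error):
        """Record this epoch's validation error; return True to stop."""
        if valid_error < self.best:
            self.best, self.best_epoch, self.bad_epochs = valid_error, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The model checkpoint saved at `best_epoch` is the one that would be evaluated on the test set.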

### B.3 Results

In this section, we provide more visual results over the test set of each dataset.

Over the GlaS dataset (Fig.7, 8), the visual results clearly show how our model, with and without erasing, can handle multiple instances. Adding the erasing feature allows recovering more discriminative regions. The results over CUB5 (Fig.9, 10, 11, 12, 13), while interesting, show a fundamental limitation of the concept of erasing in the single-instance case. In the multi-instance case, when the model spots one instance and then erases it, it is likely to seek another instance, which is the expected behavior. However, in the single-instance case where the discriminative parts are small, the first forward pass mainly spots such a small part and erases it. The leftover may then not be sufficient to discriminate. For instance, in CUB5, in many cases, the model spots only the head. Once it is hidden, the model is unable to find other discriminative parts. A clear illustration of this issue is in Fig.9, row 13. The model correctly spots the head, but is unable to spot the body, although the body has a similar texture and is located right next to the found head.

We believe that the main cause of this issue is that the erasing concept forgets where discriminative parts are located, since the mining iterations are done independently from each other, in the sense that the next mining iteration is unaware of what was already mined. Erasing algorithms seem to be missing this feature, which could help localize the entire region of interest by seeking around all the previously mined discriminative regions. In our erasing algorithm, once a region is erased, the model forgets about its location. Adding a memory-like mechanism, or constraints over the spatial distribution of the mined discriminative regions, may potentially alleviate this issue. Another, parallel issue of erasing algorithms is that once the most discriminative regions are erased, it may not be possible to discriminate using the leftover regions. This may explain why our model was unable to spot other parts of the bird once its head is erased. Using soft-erasing (blurring the pixels, for example) may be more helpful than hard-erasing (setting the pixels to zero).
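The difference between hard and soft erasing mentioned above can be illustrated on a tiny single-channel image. This is only a sketch of the idea, not something evaluated in this work: the 3x3 box blur is one arbitrary choice of soft erasing, and the function names are hypothetical.

```python
def hard_erase(image, mask):
    """Hard erasing: set every pixel selected by `mask` to zero."""
    return [[0.0 if m else v for v, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def box_blur(image):
    """3x3 mean filter; border pixels average over the neighbors that exist."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            vals = [image[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def soft_erase(image, mask):
    """Soft erasing: replace selected pixels by their blurred value."""
    blurred = box_blur(image)
    return [[b if m else v for v, b, m in zip(row, brow, mrow)]
            for row, brow, mrow in zip(image, blurred, mask)]
```

With soft erasing, the selected region keeps a smoothed trace of its content instead of a hard zero patch, so some coarse context remains available to the next mining step.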

It is interesting to notice the strategy used by our model to localize some types of birds. In the case of the 099.Ovenbird, it relies on the texture of the chest (white dotted with black), while it localizes the white spot on the bird's neck in the case of 108.White_necked_Raven. One can notice as well that our model seems to be robust to small/occluded regions. In many cases, it was able to spot small birds in a difficult context where the bird is not salient.

Visual results over CUB and OxF are presented in Fig.14, and Fig.15, respectively.

#### B.3.1 Impact of our recursive erasing algorithm on the performance

Tab.3 and Tab.4 show the positive impact of our recursive erasing algorithm on both classification and pointwise localization performance. From Tab.4, we can observe that using our recursive algorithm adds a large improvement in $F1^+$ without degrading $F1^-$. This means that the recursion allows the model to correctly localize larger portions of the region of interest without including false positive parts. The observed improvement in localization leads to a lower classification error, as observed in Tab.3. The localization improvement can be seen as well in the precision-recall curves in Fig.6.

#### B.3.2 Running time of our recursive erasing algorithm

Adding recursive computation in the backpropagation loop is expected to add extra computation time. Tab.5 shows the training time (of 1 run) of our model with and without recursion over identical computation resources. The observed extra computation time is mainly due to gradient accumulation (line 12, Alg.1), which takes the same amount of time as the parameters' update (which is expensive to compute). The forward and backward passes are fast in practice, and take less time than the gradient update. We do not compare the running time between the datasets since they have different numbers/sizes of samples, and different pre-processing that is included in the reported time. Moreover, the size of the samples has an impact on the total time spent evaluating over the validation set during training.

#### B.3.3 Post-processing using conditional random field (CRF)
