Weakly-Supervised Semantic Segmentation by Iterative Affinity Learning

Abstract

Weakly-supervised semantic segmentation is a challenging task, as no pixel-wise label information is provided for training. Recent methods have exploited classification networks to localize objects by selecting regions with strong response. While such response maps provide only sparse information, there exist strong pairwise relations between pixels in natural images, which can be utilized to propagate the sparse map to a much denser one. In this paper, we propose an iterative algorithm to learn such pairwise relations, which consists of two branches: a unary segmentation network that learns the label probabilities for each pixel, and a pairwise affinity network that learns the affinity matrix and refines the probability map generated by the unary network. The refined results from the pairwise network are then used as supervision to train the unary network, and the procedure is conducted iteratively to obtain better segmentation progressively. To learn reliable pixel affinities without accurate annotation, we also propose to mine confident regions. We show that iteratively training this framework is equivalent to optimizing an energy function with convergence to a local minimum. Experimental results on the PASCAL VOC 2012 and COCO datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.

Keywords:
Weakly-supervised learning · Semantic segmentation · Affinity

1 Introduction

Semantic segmentation aims to predict a label for each pixel from a set of pre-defined object classes. With the advances of Deep Neural Networks (DNNs), significant progress has been made in semantic segmentation (Long et al. (2015); Zhao et al. (2017); Chen et al. (2018, 2017); Zhou et al. (2019)). However, fully-supervised methods require a large amount of pixel-wise annotation, which is time-consuming and expensive to collect. To make semantic segmentation more practical, a number of weakly-supervised methods have been proposed in recent years based on partial information about each image, such as bounding boxes (Dai et al. (2015); Khoreva et al. (2017)), scribbles (Lin et al. (2016)), points (Bearman et al. (2016)), and even class labels (Pathak et al. (2015); Wang et al. (2018b); Ahn and Kwak (2018); Huang et al. (2018); Wei et al. (2018)). In this paper, we present a weakly-supervised semantic segmentation algorithm based only on the class labels of an image.

Figure 1: Top row: Given training images and their class labels, our framework generates accurate segmentation results. Bottom row: By iteratively learning affinity, our framework progressively generates better segments for supervising the segmentation network. The seed regions generated by the CAM method (Zhou et al. (2016)) are shown in (d) where white color pixels denote image locations with unknown labels.

Weakly-supervised semantic segmentation based on class labels is challenging as no pixel in an image is annotated (i.e., an image is only annotated with class labels as shown in Figure 1(a)). Recently, the Class Activation Map (CAM) method (Zhou et al. (2016)) has been developed to generate discriminative object seed regions with classification networks. Since coarse response maps are generated (Figure 1(d)), these regions cannot be directly used to train an accurate segmentation network. As data redundancy often exists in natural images (Kersten (1987)), significant statistical dependencies among pixels in images can be exploited. We can learn similarities or affinities from images, and propagate sparse and noisy labels of object regions to generate dense and accurate annotations. With weak supervision, this is challenging as there are no accurate pixel-wise annotations and the region labels from the CAM method are noisy and sometimes inaccurate. To address these issues, we mine confident regions from the coarse pixel labels and then learn pixel affinities from them to refine the coarse labels. Iteratively, we mine confident regions from the refined results and learn more robust affinities until convergence.

In this paper, we propose an iterative affinity learning framework, which consists of two major branches (see Figure 2): a unary segmentation network which learns the pixel-wise probability of semantic categories from produced labels, and a pairwise network which refines the current labels by learning the affinity matrix and propagating the labels. The refined results by the pairwise network provide better “ground truth” to retrain the unary segmentation network in the next iteration. The above procedures are conducted iteratively until convergence to obtain better segmentation progressively. Figure 1 shows one example. Given training images and the class labels, the proposed framework can generate accurate semantic segmentation results. This is achieved by the iterative optimization strategy which learns reliable affinity and generates better masks for supervising the segmentation network.

The key ingredient of our framework is learning affinities between pixels, which determines the amount of improvement achieved at each iteration. However, under weak supervision, we do not have accurate annotations to learn pixel affinities. To address this issue, we propose to mine confident regions from the output of the unary network, and then use them to supervise the pairwise affinity network. Our motivation is that, to learn the affinity, we only need some pixel samples indicating which pixels belong to the same class (their affinity should be high) or to different classes (their affinity should be low). Even with a small number of pixel samples, we are able to learn segmentation by propagating and mining more labels via the learned affinity. We also show that iteratively training the proposed framework is equivalent to optimizing an energy function with an EM-like approach. Furthermore, we show that this process always converges to a local minimum because the energy function is differentiable with respect to both the output labels and the network parameters.

The main contributions of this work are summarized as follows:

  • We present an iterative affinity learning framework to progressively generate better segmentation, and show that it is equivalent to optimizing an energy loss function, with convergence to a local minimum.

  • We propose a method to learn reliable affinity from inaccurate annotations by mining confident regions.

  • We demonstrate that the proposed weakly-supervised semantic segmentation algorithm performs favorably against the state-of-the-art methods on the PASCAL VOC 2012 and COCO datasets.

Figure 2: Illustration of the proposed framework. The framework consists of two branches: a (b) unary network which predicts a (c) probability map of the input image, and a (d) pairwise network, which learns the (e) affinity matrix from the (g) mined confident regions. The learned affinities are then applied to the probability map from the unary network to generate the (f) refined segmentation. The refined results are then used as supervision signals to retrain the unary network. These procedures are conducted iteratively to learn more robust affinity progressively and produce more accurate segmentation.

2 Related Work

In this section, we discuss related methods for weakly-supervised semantic segmentation and learning affinity for segmentation.

2.1 Weakly-Supervised Semantic Segmentation

Weakly-supervised semantic segmentation based on class labels has drawn much attention in recent years due to its low annotation cost. Early methods (Pathak et al. (2014, 2015); Pinheiro and Collobert (2015)) mainly formulate this problem as a multi-instance learning (MIL) problem. Pathak et al. (2014) propose to add a max-pooling layer on top of FCN (Long et al. (2015)) and design a multi-class MIL loss for training the network. Based on this framework, several methods have been developed (Pathak et al. (2015); Pinheiro and Collobert (2015)). Pathak et al. (2015) introduce constraints, such as foreground coverage and suppression schemes, into the MIL framework for weakly-supervised semantic segmentation. Pinheiro and Collobert (2015) replace the max-pooling layer of the MIL framework with a Log-Sum-Exp layer that takes more information in the feature maps into account.

Recent methods (Kolesnikov and Lampert (2016); Wei et al. (2017a); Wang et al. (2018b); Ahn and Kwak (2018); Huang et al. (2018); Wei et al. (2018)) tackle weakly-supervised semantic segmentation with a two-stage procedure, which first generates initial object labels with class activation maps (Zhou et al. (2016)), and then trains segmentation networks based on the response maps. Kolesnikov and Lampert (2016) present an end-to-end framework with three modules (seed, expand and constrain) as loss functions, where the class activation maps are used as supervisory signals. A number of methods expand object regions based on the class activation maps. Wei et al. (2017a) propose to progressively erase the most discriminative regions in the activation maps, thereby forcing the network to discover new object regions; these regions are then used as ground truth to train a segmentation network. Wang et al. (2018b) develop a bottom-up and top-down framework which iteratively mines common object features to expand initial object regions from the class activation maps. Ahn and Kwak (2018) learn the pixel affinity from the activation maps and then apply the random walk method to refine them. Huang et al. (2018) design a deep seeded region growing algorithm which improves the seed regions to supervise the network.

Among the above-mentioned approaches, Wei et al. (2017a) and Wang et al. (2018b) also use iterative strategies to refine segmentation results. However, the method of Wei et al. (2017a) relies heavily on the CAM network to progressively produce the most significant regions from the remaining image content. Consequently, less discriminative object regions are usually missed, and this method cannot suppress noisy regions well. Wang et al. (2018b) expand object regions by mining common object features; however, common features are only learned from each superpixel region and the pixel-wise context information is not exploited. In contrast, our method learns and propagates pixel-wise affinities to achieve better segmentation results. We note that Ahn and Kwak (2018) also use pixel affinities to refine segmentation results, but their affinities are only learned from the coarse response map of the CAM method. In the proposed framework, the pixel affinities are iteratively optimized, and are thus more reliable and lead to better segmentation results.

2.2 Learning Pixel Affinity for Segmentation

An affinity matrix measures the similarities between pixels and has been widely used in object segmentation. Some early methods directly define similarity functions to compute affinity matrices. Hagen and Kahng (1992) propose a spectral method for ratio cut (Wei and Cheng (1989)) which captures both min-cut and equipartition to locate natural clusters. Shi and Malik (2000) formulate image segmentation as a graph partitioning problem and present the normalized cut algorithm, which considers both the dissimilarity between different groups and the similarity within the same group.

In recent years, with the advances of DNNs, numerous algorithms have been proposed to learn the affinity end-to-end with deep networks (Liu et al. (2017); Maire et al. (2016); Bertasius et al. (2017)). Maire et al. (2016) present the affinity CNN, which directly learns an affinity matrix to model pairwise relations for figure/ground embedding. Liu et al. (2017) design the spatial propagation network (SPN), which learns pixel affinities with a spatial linear propagation module; the SPN takes images and coarse masks as input and learns pixel affinities end-to-end to refine the coarse masks. Bertasius et al. (2017) develop a random walk layer on top of the semantic segmentation network to learn the pixel affinities.

These methods all learn the pixel affinities under full supervision to refine segmentation results. In contrast, our method aims to learn pixel affinities to refine object regions from coarse and inaccurate labels without pixel-wise annotations. To address this challenging problem, we propose an iterative optimization framework which progressively mines confident regions for learning reliable affinity and generates better segmentation results.

3 Proposed Algorithm

We solve the weakly-supervised semantic segmentation problem with an iterative optimization algorithm that progressively learns robust pixel affinities and propagates label information for accurate results. We present an EM approach that alternately learns the network parameters of both the unary segmentation and pairwise affinity networks, and maximizes the likelihood of the “ground-truth” labels. This is different from fully-supervised approaches, where supervision comes from the ground-truth labels.

3.1 Formulation

Let $I$ denote an image. The proposed framework consists of two major branches (Figure 2): (a) a unary network parameterized by $\theta_u$ that learns the label probability $P$ with respect to each pixel in $I$, and (b) a pairwise network that learns the pixel affinities $w_{ij}$, where $i, j \in \{1, \dots, n\}$, $n$ is the number of pixels, and $\theta_p$ is the parameter of the pairwise network. In addition, we denote $Y$ as the hidden state of the output labels. We use the subscript $t$ to denote the step in the iterative process.

We represent each image as an undirected weighted graph $G = (V, E)$ with the vertex set $V = \{v_1, \dots, v_n\}$, where each edge $e_{ij}$ between $v_i$ and $v_j$ has a weight $w_{ij}$. The adjacency matrix is $W = [w_{ij}]_{n \times n}$. The degree matrix $D$ is a diagonal matrix with the degrees $d_i = \sum_j w_{ij}$ as elements. The semantic segmentation problem is then to minimize the following energy loss function:

$$E(Y) = Y^{\top} L\, Y \qquad (1)$$

where $L = D - W$ is the Laplacian matrix, and

$$Y^{\top} L\, Y = \frac{1}{2}\sum_{i,j} w_{ij}\,(y_i - y_j)^2 \qquad (2)$$

That is, minimizing the loss function (2) enforces pixels with high similarities (high affinity $w_{ij}$) to have similar labels. This allows us to propagate label information for accurate results.
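To make the energy concrete, the following minimal numpy sketch (with toy, hypothetical affinity values) checks that the matrix form (1) and the pairwise form (2) agree:

```python
import numpy as np

# Toy example with n = 4 pixels and a symmetric, hypothetical affinity matrix.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])
D = np.diag(W.sum(axis=1))       # degree matrix
L = D - W                        # graph Laplacian

Y = np.array([1.0, 1.0, 0.0, 0.0])   # candidate labels, one per pixel

energy_matrix = Y @ L @ Y        # matrix form (1): Y^T L Y
energy_pairwise = 0.5 * sum(     # pairwise form (2)
    W[i, j] * (Y[i] - Y[j]) ** 2 for i in range(4) for j in range(4)
)
assert np.isclose(energy_matrix, energy_pairwise)
# The energy is low here because strongly-connected pixels share labels.
```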

Instead of designing a similarity metric and solving the labels as an optimization problem (Levin et al. (2008)), we propose an iterative learning method that refines the probability map and the networks via an EM formulation. We denote $A$ as the affinity transformation matrix (Liu et al. (2017)), which is learnable by the pairwise network, and $P$, $Y$ as the outputs of the unary network and the pairwise network, respectively. The EM procedures are as follows:

  • Initialization: We train the unary network and the pairwise network with object seeds from class activation maps (Zhou et al. (2016)) to obtain the initial parameters $\theta_u^0$ and $\theta_p^0$, and the unary response map $P_0$ (Figure 2(c)).

  • E-step: We refine the unary probability $P_t$ by minimizing the following energy w.r.t. $Y$ given $P_t$ (see the sketch after this list), where:

    $$E_1(Y) = \|Y - P_t\|^2 + \lambda\, Y^{\top} L\, Y \qquad (3)$$

    and compute the refined map as $Y_t = A P_t$ (i.e., $Y_t$ is the output of the pairwise network in step $t$). From (3) we have $A = (I + \lambda L)^{-1}$, where the corresponding network implementation is described in Section 3.2.2.

  • M-step: In this step, we minimize the energy by learning both $\theta_u$ and $\theta_p$, i.e., training the networks $f$ and $g$ with the supervision signal extracted from $Y_t$ (Sections 3.2.1, 3.2.2, 3.3).
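Given the quadratic form of (3), the refined map has a closed-form solution, which the following numpy sketch illustrates (the dense matrix inverse is for exposition only; the pairwise network realizes the same propagation with learned, local recurrences):

```python
import numpy as np

def refine_closed_form(P, W, lam=1.0):
    """E-step sketch: argmin_Y ||Y - P||^2 + lam * Y^T L Y.

    P: (n, c) per-pixel class probabilities from the unary network.
    W: (n, n) symmetric affinity matrix; lam is a hypothetical weight.
    """
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian L = D - W
    A = np.linalg.inv(np.eye(W.shape[0]) + lam * L)  # A = (I + lam*L)^-1
    return A @ P                                     # refined probabilities Y = A P
```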

It is straightforward to show that the above procedures always converge to a local minimum, because the energy function is differentiable with respect to both $Y$ and the network parameters. However, to validate the M-step, we need to establish the link between the E-step and the M-step, i.e., how we use the network response $Y_t$ from step $t$ to train $f$ and $g$ so as to minimize the energy function (1).

For the unary network, in the $(t{+}1)$-th step, it uses the segmentation results $Y_t$ as supervision to generate the label probability $P_{t+1}$. For training the pairwise network in the $(t{+}1)$-th step, we consider the softmax cross-entropy loss function with $Y_t$ as supervision:

$$\ell(Y_{t+1}, Y_t) = -\sum_{i} y_t^{i} \log y_{t+1}^{i} \qquad (4)$$

Since $-\log(\cdot)$ is a monotonic and convex function, minimizing $\ell(Y_{t+1}, Y_t)$ is equivalent to minimizing $E_1(Y_{t+1}) = \|Y_{t+1} - P_{t+1}\|^2 + \lambda\, Y_{t+1}^{\top} L\, Y_{t+1}$. As the first term is a constant (training the unary network drives $P_{t+1}$ toward the supervision $Y_t$), to optimize $\ell$ is to minimize the second term:

$$\lambda\, Y_{t+1}^{\top} L\, Y_{t+1} \qquad (5)$$

which is consistent with (1). By using $Y_t$ as supervision to train the unary network and using $Y_t$ to supervise the pairwise network with the softmax cross-entropy loss, we equivalently minimize the original energy loss function (1). Namely, the objective of the M-step is to minimize the energy loss function.

However, in the stage of training the pairwise network, if we use $Y_t$ to supervise the pairwise network to learn the affinity and then refine $Y_t$ itself, there is no information gain over iterations and the optimization converges to relatively low performance within very few steps (Section 4.5.2). To obtain more accurate supervision at each step, we propose to mine confident regions from the output of the unary network. These confident regions contain pixels belonging to object regions with high precision, from which we can learn reliable affinity matrices (Section 3.3). We denote the mined regions as $\hat{Y}_t$ and expect them to have lower energy (Section 4.5.2):

$$E(\hat{Y}_t) \le E(Y_t) \qquad (6)$$

With mining confident regions, our algorithm converges to a lower energy and obtains better segmentation results.

The proposed EM procedures are summarized in Algorithm 1.

Input:
     Training images $\{I\}$.
     Object seeds generated from CAM, set as $Y_0$.
Initialize:
     Train networks $f$ and $g$ with $Y_0$ to obtain the parameters $\theta_u^0$ and $\theta_p^0$, the affinity matrix $A_0$, and the output of the unary network $P_0$.
for $t = 0, 1, 2, \dots$ until convergence do:
  1: Propagate $P_t$ with $A_t$: $Y_t = A_t P_t$ (Section 3.2.2).
  2: Train $f$ with $Y_t$ as supervision to obtain $P_{t+1}$ (Section 3.2.1).
  3: Mine confident regions $\hat{Y}_{t+1}$ from the output of $f$ (Section 3.3).
  4: Train $g$ with $\hat{Y}_{t+1}$ as supervision to obtain $A_{t+1}$ (Section 3.2.2).
end for
Algorithm 1: Procedures of the proposed approach
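The loop structure of Algorithm 1 can be summarized in a short Python skeleton; every helper here (train_unary, train_pairwise, mine_confident_regions, and the predict/propagate methods) is a hypothetical placeholder for the routines of Sections 3.2 and 3.3, not actual code from the paper:

```python
def iterative_affinity_learning(images, seeds, train_unary, train_pairwise,
                                mine_confident_regions, num_steps=5):
    """Sketch of Algorithm 1; helper callables are supplied by the caller."""
    unary = train_unary(images, seeds)          # initialize f with Y_0
    pairwise = train_pairwise(images, seeds)    # initialize g with Y_0
    for t in range(num_steps):
        Y = pairwise.propagate(unary.predict(images))  # 1: Y_t = A_t P_t
        unary = train_unary(images, Y)                 # 2: retrain f on Y_t
        Y_hat = mine_confident_regions(unary.predict(images))  # 3: mine
        pairwise = train_pairwise(images, Y_hat)       # 4: retrain g on Y_hat
    return unary, pairwise
```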

3.2 Network Architecture and Training

Figure 2 shows the architecture of the proposed framework. The framework consists of two major branches, a unary network that learns the label probability of each pixel, and a pairwise network that learns the affinity. The learned affinities are applied to the output probability map of the unary network to refine it and obtain better segmentation results.

3.2.1 Unary Network

The unary network aims to generate a probability map given a coarse segmentation mask. In this work, we use the DeepLab (Chen et al. (2018)) model as the unary segmentation network. To initialize the framework, we first generate object seed regions using the CAM method (Zhou et al. (2016)) in a way similar to Ahn and Kwak (2018). The CAM method generates object regions for all classes, including background, and pixels with weak response are labelled as unknown, as shown in Figure 1(d). We then use these regions as pseudo ground truth to train the unary segmentation network. The training process is the same as in fully-supervised methods, with a softmax loss as the objective function. With this segmentation network, probability maps are generated for all classes.
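A minimal PyTorch-style sketch of one supervised update of the unary network with such pseudo ground truth (the model, optimizer, and the ignore label id are assumptions):

```python
import torch
import torch.nn.functional as F

def unary_training_step(model, optimizer, images, pseudo_gt, ignore_id=255):
    """One supervised step of the unary network on pseudo ground truth.

    pseudo_gt: (N, H, W) long tensor from CAM seeds (step 0) or from the
    refined pairwise output (later steps); ignore_id marks unknown pixels
    (the specific id is an assumption).
    """
    optimizer.zero_grad()
    logits = model(images)                                   # (N, C, H, W)
    loss = F.cross_entropy(logits, pseudo_gt, ignore_index=ignore_id)
    loss.backward()
    optimizer.step()
    return loss.item()
```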

3.2.2 Pairwise Network

The pairwise network aims to learn the pixel affinities from object regions and then apply them to the probability maps to refine the segmentation results. In this work, we use the Spatial Propagation Network (SPN) (Liu et al. (2017)) to learn pairwise affinities. The SPN learns the affinity transformation matrix $A$ from an image to refine the coarse probability map $P$ and generate a better segmentation $Y$. It is an end-to-end framework which simultaneously learns the affinity transformation matrix and outputs the refined segmentation. When learning the affinity, we raster scan the pixels in four directions: left-to-right, top-to-bottom, and their reverses. Since we use all three RGB image channels, we learn 12 affinity matrices. More details regarding the spatial propagation network can be found in Liu et al. (2017).
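As a rough illustration of one directional scan, the sketch below propagates a left-to-right pass with a single learned gate per pixel; the actual SPN uses three-way local connections and learns the gates from the image (all names and shapes here are assumptions):

```python
import numpy as np

def propagate_left_to_right(x, gates):
    """One of the four directional scans, sketched with a single neighbor
    per step (the actual SPN uses three-way connections; Liu et al. 2017).

    x: (H, W) coarse probability map for one class.
    gates: (H, W) learned propagation weights in [0, 1].
    """
    y = x.copy()
    for j in range(1, x.shape[1]):
        # Blend each pixel's unary value with the already-refined value of
        # its left neighbor, weighted by the learned affinity gate.
        y[:, j] = (1 - gates[:, j]) * x[:, j] + gates[:, j] * y[:, j - 1]
    return y
```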

Figure 3: Some examples of mining confident regions from segmentation results of the unary network: (a) images, (b) segmentation results of the unary network, (c) mined confident regions, (d) refined results of the pairwise affinity network, (e) ground truth. White color pixels denote image locations with unknown labels.
Figure 4: Illustration of mining confident regions. Given an (a) input image, we segment it into superpixel regions, and apply the learned (b) region confidence network to predict object classes, and generate (c) confidence scores for all regions. By selecting regions with high confidence score, we can obtain the (d) mined confident regions. White color pixels denote image regions with unknown labels. Compared with the (e) unary segmentation result, the noisy regions are mostly removed and some regions are corrected. Thus, regions with high precision are extracted and used to better supervise the pairwise network.

The spatial propagation network has been shown to perform well in pixel labelling under full supervision (Liu et al. (2017)). However, under weak supervision, it is challenging to train the pairwise affinity network as no pixel-wise annotations are provided. A straightforward approach is to use the segmentation result at step $t$ as ground truth to supervise the pairwise affinity network. However, as the segmentation results are not accurate, the affinity matrix cannot be learned well. To address this issue, we first mine confident regions from the segmentation results and then learn the affinity matrix from the mined confident regions. If we can obtain confident regions that identify the class of each region with high precision, we know that pixels within the same class should have high affinities and pixels of different classes should have low affinities. Thus, we can still learn reliable affinity matrices. For regions with low confidence scores, we mark them with unknown labels when training the pairwise affinity network; namely, when computing the softmax loss function, these regions are ignored. The details of mining confident regions are introduced in Section 3.3. We denote the mined confident regions as $\hat{Y}$. To train the pairwise affinity network, we utilize the softmax loss:

$$\mathcal{L}_{ce} = -\sum_{i \in \Omega} \hat{y}^{i} \log y^{i} \qquad (7)$$

where $\Omega$ is the set of pixels with known labels in $\hat{Y}$, $\hat{y}^{i}$ is the mined label, and $y^{i}$ is the refined prediction of the pairwise network.

To learn accurate affinity matrices, we also introduce a region smoothness loss. Our motivation is that a good affinity matrix should have similar values for pixels within the same object region, such that the refined results are smooth and have clear boundaries. To achieve this, we average the learned affinity matrix within each superpixel region and denote the result as $\bar{A}$. The objective function is then to minimize the difference between $A$ and $\bar{A}$:

$$\mathcal{L}_{sm} = \|A - \bar{A}\|^2 \qquad (8)$$
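A minimal PyTorch-style sketch of these two training signals, (7) with unknown pixels ignored and (8) as a mean-squared penalty (the ignore id, tensor shapes, and the balancing weight beta are assumptions):

```python
import torch
import torch.nn.functional as F

IGNORE_ID = 255  # hypothetical label id for unknown (low-confidence) pixels

def pairwise_losses(logits, mined_labels, affinity, affinity_sp_mean, beta=1.0):
    """Sketch of the two training signals for the pairwise network.

    logits: (N, C, H, W) refined predictions; mined_labels: (N, H, W) long
    tensor of mined confident regions with unknown pixels set to IGNORE_ID;
    affinity / affinity_sp_mean: learned affinity maps and their per-
    superpixel means; beta is a hypothetical balancing weight.
    """
    # (7): softmax cross-entropy, ignoring unknown regions.
    ce = F.cross_entropy(logits, mined_labels, ignore_index=IGNORE_ID)
    # (8): region smoothness -- pull each affinity value toward the mean of
    # its superpixel so that refined maps are smooth inside regions.
    smooth = F.mse_loss(affinity, affinity_sp_mean)
    return ce + beta * smooth
```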

3.3 Mining Confident Regions

The key issue in our framework is how to learn reliable affinity matrices without accurate annotations. Our solution is to mine confident regions. We expect these confident regions to contain pixels belonging to object regions with high precision, from which we can learn reliable affinity matrices.

Our method is based on statistical learning. The object regions generated by the unary network contain noisy results; for example, some background pixels may be recognized as parts of an object, as shown in Figure 3(b). We can learn a confidence score for each region by using the segmentation results of the unary network as training samples. For a region with a certain class label, its initial confidence score is set to 1 for this object and 0 for the other objects. By training a multi-class classification network with these regions, each region is assigned a new confidence score. Regions that are highly similar to an object class receive high confidence scores, while regions that differ from it (i.e., noisy regions) receive low scores. With this procedure, we can remove noisy regions and select confident regions with high confidence scores. In this paper, we set the threshold empirically. Some examples are shown in Figure 3(c). With these confident regions, we can learn reliable affinity matrices and thus generate more accurate segmentation results (Figure 3(d)).

We first segment images into superpixel regions (Felzenszwalb and Huttenlocher (2004)) $\{r_{ij}\}$, where $r_{ij}$ denotes the $j$-th superpixel in the $i$-th image. For each region, its class label is obtained from the segmentation results. If more than 80% of the pixels of a superpixel are marked with a certain class $c$ in the segmentation results, this superpixel is considered as a sample of class $c$. This scheme is formulated with one-hot encoding, namely, $l_{ij} = [l_{ij}^1, \dots, l_{ij}^C]$, where $l_{ij}^c \in \{0, 1\}$, $\sum_c l_{ij}^c = 1$, and $C$ is the number of classes. With the superpixel regions $r_{ij}$ and corresponding labels $l_{ij}$, we can train a region classification network $h$ parameterized by $\theta_r$ to obtain a confidence score for each region with the cross-entropy loss function:

$$\mathcal{L}(\theta_r) = -\sum_{i,j} l_{ij}^{\top} \log h(r_{ij}; \theta_r) \qquad (9)$$
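As an illustration of the 80% rule above, a small numpy sketch (the helper and its arguments are hypothetical; class ids are assumed to index the one-hot vector):

```python
import numpy as np

def superpixel_onehot(seg_mask, sp_mask, num_classes, purity=0.8):
    """Assign a one-hot label to one superpixel from the unary segmentation.

    seg_mask: (H, W) predicted class ids; sp_mask: (H, W) boolean mask of
    one superpixel. Returns a one-hot vector, or None when no class covers
    more than `purity` of the superpixel (the region is then not a sample).
    """
    classes, counts = np.unique(seg_mask[sp_mask], return_counts=True)
    frac = counts / counts.sum()
    if frac.max() <= purity:
        return None
    onehot = np.zeros(num_classes)
    onehot[classes[np.argmax(frac)]] = 1.0   # class ids assumed in [0, C)
    return onehot
```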

We train the region confidence network with the architecture proposed by Wang et al. (2018a), which is a variant of the fast R-CNN model with a mask pooling scheme. Similar to recent weakly-supervised learning methods (Pathak et al. (2015); Kolesnikov and Lampert (2016); Ahn and Kwak (2018); Huang et al. (2018)), we initialize this network with the weights of a model pre-trained on ImageNet. The model is trained using (9) as the loss function, where the superpixel region is the input and the corresponding class label is the supervisory signal. With this region confidence network, we extract features of all superpixel regions of an image in one forward pass, and then recognize their classes. Figure 4 shows the process of mining confident regions. With the trained region confidence network, we re-predict each superpixel region and obtain confidence scores for all regions (Figure 4(c)). To extract regions with high precision for learning reliable affinities, we select regions whose confidence scores exceed the threshold and leave the others as unknown (Figure 4(d)). Namely, we do not use unknown regions for training the pairwise affinity network.
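Putting the selection step together, a minimal sketch (the data structures and the threshold tau are assumptions; the paper sets the threshold empirically):

```python
import numpy as np

UNKNOWN_ID = 255  # hypothetical id for pixels left unlabeled

def mine_confident_regions(superpixels, region_scores, region_labels, tau):
    """Keep only superpixels whose confidence score exceeds tau.

    superpixels: (H, W) array of superpixel ids; region_scores: dict mapping
    id -> max class probability from the confidence network; region_labels:
    dict mapping id -> predicted class. tau is the empirical threshold.
    """
    mined = np.full(superpixels.shape, UNKNOWN_ID, dtype=np.int64)
    for sp_id, score in region_scores.items():
        if score >= tau:                       # confident region: keep label
            mined[superpixels == sp_id] = region_labels[sp_id]
    return mined                               # everything else stays unknown
```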

Figure 5: Visual comparisons with the state-of-the-art methods on the PASCAL VOC 2012 val set (columns: Image, CCNN, SEC, MCOF, Ours, GT).

4 Experimental Results

4.1 Settings

We evaluate the proposed method on the PASCAL VOC 2012 (Everingham et al. (2010)) and COCO (Lin et al. (2014)) datasets. The PASCAL VOC 2012 dataset contains 20 object classes and one background class, with 1464 training images, 1449 validation images, and 1456 testing images. Following recent work (Wei et al. (2017a); Ahn and Kwak (2018); Huang et al. (2018); Wei et al. (2018); Wang et al. (2018b)), we use the augmented set with 10582 images from Hariharan et al. (2011) for training. The COCO dataset contains more complex scenes and more classes (80 object classes plus one background class), with 80k images for training and 40k images for validation. We iteratively train our framework on the training set using only class labels. For inference, we forward the input images through the networks trained in the last iteration to obtain segmentation results, so the process remains efficient. We evaluate the proposed algorithm against the state-of-the-art methods using the mean intersection-over-union (mIoU) metric.

4.2 Training Process

The CAM network is trained with the PyTorch framework and the other models are trained with the Caffe package (Jia et al. (2014)). Similar to recent weakly-supervised learning methods (Pathak et al. (2015); Kolesnikov and Lampert (2016); Ahn and Kwak (2018); Huang et al. (2018)), all networks are initialized with the weights of models pre-trained on ImageNet. All the source code and trained models will be made available online.

CAM Model.

The CAM model is used to generate object seed regions from images, based on the implementation by Ahn and Kwak (2018). To train this CAM model, the inputs are the training images and the supervisory signals are the corresponding class labels. Similar to the CAM model by Zhou et al. (2016), we use random cropping to augment the data. For each crop, we take the class labels of the original image before cropping as supervision, and no additional supervisory signals are required.

Unary Network.

We use the polynomial decay policy for the learning rate to train the model (Chen et al. (2018)). The learning rate of the $i$-th iteration, $lr_i$, is:

$$lr_i = lr_{base} \times \left(1 - \frac{i}{i_{\max}}\right)^{p} \qquad (10)$$

where $lr_{base}$ is the base learning rate, $p$ is the decay power, and $i_{\max}$ is the maximal number of iterations. The momentum parameter is kept fixed during training.

Pairwise Network.

We use the polynomial decay policy for the learning rate as described in (10) to train the model, with its own base learning rate, decay power, and maximal number of iterations; the momentum parameter is likewise kept fixed.

Mining Confident Regions Network.

We use the step learning rate decay policy. For the $i$-th iteration, the learning rate is:

$$lr_i = lr_{base} \times \gamma^{\lfloor i/s \rfloor} \qquad (11)$$

where $lr_{base}$ is the base learning rate, $\gamma$ is the decay factor, $s$ is the step size, and $\lfloor \cdot \rfloor$ is the floor function. The momentum parameter is also kept fixed.
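Both schedules are simple to implement; a sketch (the hyperparameter values below are placeholders, as the specific settings are omitted above):

```python
def poly_lr(i, base_lr, max_iter, power):
    # Polynomial decay from (10): base_lr * (1 - i / max_iter) ** power.
    return base_lr * (1 - i / max_iter) ** power

def step_lr(i, base_lr, gamma, step_size):
    # Step decay from (11): multiply by gamma every step_size iterations.
    return base_lr * gamma ** (i // step_size)

# Hypothetical usage with placeholder values:
print(poly_lr(500, base_lr=1e-3, max_iter=20000, power=0.9))
print(step_lr(500, base_lr=1e-3, gamma=0.1, step_size=2000))
```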

4.3 Performance Evaluation

We evaluate the proposed algorithm on the PASCAL VOC 2012 dataset against the state-of-the-art weakly-supervised segmentation methods, including MIL-FCN (Pathak et al. (2014)), CCNN (Pathak et al. (2015)), MIL-sppxl (Pinheiro and Collobert (2015)), EM-Adapt (Papandreou et al. (2015)), BFBP (Saleh et al. (2016)), DCSM (Shimoda and Yanai (2016)), AF-SS (Qi et al. (2016)), AF-MCG (Qi et al. (2016)), SEC (Kolesnikov and Lampert (2016)), STC (Wei et al. (2017b)), CBTS (Roy and Todorovic (2017)), AE-PSL (Wei et al. (2017a)), MCOF (Wang et al. (2018b)), PSA (Ahn and Kwak (2018)), DSRG (Huang et al. (2018)), MDC (Wei et al. (2018)) and AISI (Fan et al. (2018)). Table 1 shows the results of all the evaluated methods using the VGG16 (Simonyan and Zisserman (2014)) model as the backbone network. The proposed algorithm achieves 62.0% and 62.4% on the val and test sets, respectively, a performance gain of 1.6% over the MDC (Wei et al. (2018)) method. We note that the PSA (Ahn and Kwak (2018)) model also uses affinities to refine object regions. However, as this method only learns affinities from the coarse masks generated by CAM, the improvement from affinity propagation is limited. The proposed algorithm performs favorably against the PSA method by 3.6% and 1.9% on the val and test sets, respectively. We also note that the AISI (Fan et al. (2018)) model recently achieves performance similar to the proposed algorithm (61.3% on val, 62.1% on test). However, this method uses S4Net (Fan et al. (2019)) to generate salient instances, which is trained with full supervision using pixel-wise annotations. Table 2 shows the results when the ResNet (He et al. (2016)) is used as the backbone model. The proposed algorithm achieves performance gains over PSA (Ahn and Kwak (2018)) by 2.6% and 1.7%, and over AISI (Fan et al. (2018)) by 0.7% and 0.9%, on the val and test sets, respectively. Figure 5 shows some segmentation results. Overall, the segmentation results of the proposed algorithm contain fewer noisy segments.

 

Methods Training Images val test
MIL-FCN (ICLR’15) 10K 25.7 24.9
CCNN (ICCV’15) 10K 35.3 35.6
MIL-sppxl (CVPR’15) 700K 36.6 35.8
EM-Adapt (ICCV’15) 10K 38.2 39.6
BFBP (ECCV’16) 10K 46.6 48.0
DCSM (ECCV’16) 10K 44.1 45.1
AF-SS (ECCV’16) 10K 52.6 52.7
AF-MCG (ECCV’16) 10K 54.3 55.5
SEC (ECCV’16) 10K 50.7 51.7
STC (PAMI’17) 50K 49.8 51.2
CBTS (CVPR’17) 10K 52.8 53.7
AE-PSL (CVPR’17) 10K 55.0 55.7
MCOF (CVPR’18) 10K 56.2 57.6
PSA (CVPR’18) 10K 58.4 60.5
DSRG (CVPR’18) 10K 59.0 60.4
MDC (CVPR’18) 10K 60.4 60.8
AISI† (ECCV’18) 10K 61.3 62.1
Ours 10K 62.0 62.4

 

Table 1: Comparisons with the state-of-the-art weakly-supervised semantic segmentation methods on the PASCAL VOC 2012 val and test sets. All methods use the VGG16 model as the backbone network († indicates methods that implicitly use full supervision).

 

Methods Training Images val test
MCOF (CVPR’18) 10K 60.3 61.2
PSA (CVPR’18) 10K 61.7 63.7
DSRG (CVPR’18) 10K 61.4 63.2
AISI† (ECCV’18) 10K 63.6 64.5
Ours 10K 64.3 65.4

 

Table 2: Evaluation results using the ResNet as the backbone model on the PASCAL VOC 2012 dataset († indicates methods that implicitly use full supervision).
Figure 6: Visual segmentation results at each iteration of our framework on the PASCAL VOC 2012 training set. The initial object seeds are very coarse; by iteratively learning affinity, the segmentation results improve from coarse to fine. (a) images, (b) initial object seeds, (c)-(g) segmentation results of iterations 1-5, (h) ground truth.

4.4 Comparison with Iterative PSA

We note that the PSA method (Ahn and Kwak (2018)) also refines the confident regions from the CAM model based on affinities for semantic segmentation. However, this approach differs significantly from our method in finding confident regions and learning affinities. Different from the PSA model, the proposed method is optimized iteratively. To analyze the performance of the proposed method, we design an alternative PSA approach for evaluation. In this alternative method, confident regions are mined from the PSA model and the affinities are iteratively learned. We show the evaluation results on the PASCAL VOC 2012 training set in Table 3 (see footnote 5). The segmentation results of the alternative PSA model are not further refined as the number of iterations increases. We analyze the confident regions mined by both approaches in terms of the precision metric. Table 4 shows that the precision of the confident regions from the alternative PSA does not increase with the number of iterations. This can be attributed to the fact that the PSA method determines confident object and background regions by strengthening foreground and weakening background activation maps. We note that this approach is effective for the coarse CAM results as it can remove noisy regions. However, this operation also removes numerous object regions when the results are dense (e.g., large objects and complex scenes). Consequently, such object regions cannot be recovered with more iterations. With the affinities learned by the spatial propagation network, our method mines confident regions with a confidence network, which can remove ambiguous regions and correct noisy regions to obtain regions with higher precision. As shown in Section 3.1 and (6), confident regions with higher precision help the framework converge to lower energy and obtain better segmentation results, so our approach gradually improves with more iterations until convergence.

 

step 1 step 2 step 3 step 4 step 5
PSA 55.6 54.9 52.1 49.8 48.1
Ours 55.2 59.5 61.4 62.7 63.1

 

Table 3: Comparisons with PSA when it is also refined iteratively. The results show the mIoU on the PASCAL VOC 2012 training set.

 

step 1 step 2 step 3 step 4 step 5
PSA 76.2 73.7 71.7 70.2 68.7
Ours 73.4 78.1 81.0 81.2 81.2

 

Table 4: Precision of the confident regions mined by the iterative PSA and by our method on the PASCAL VOC 2012 training set.

 

                           train mIoU   train Precision   val mIoU
step 0  Seeds                 46.2          62.2             -
step 1  Unary Network         51.5          72.7
        Mined Conf Regions    49.8          73.4
        Pairwise Network      55.2          73.1            51.3
step 2  Unary Network         56.6          76.3
        Mined Conf Regions    54.1          78.1
        Pairwise Network      59.5          80.5            57.2
step 3  Unary Network         59.3          79.2
        Mined Conf Regions    56.4          81.0
        Pairwise Network      61.4          79.7            59.9
step 4  Unary Network         60.8          80.7
        Mined Conf Regions    56.9          81.2
        Pairwise Network      62.7          80.9            61.6
step 5  Unary Network         61.7          79.7
        Mined Conf Regions    57.5          81.2
        Pairwise Network      63.1          80.8            62.0
step 6  Unary Network         61.6          77.7
        Mined Conf Regions    57.6          81.0
        Pairwise Network      62.8          80.6            61.8
step 7  Unary Network         61.8          78.9
        Mined Conf Regions    57.7          81.3
        Pairwise Network      63.2          80.9            62.0

Table 5: Intermediate results of the proposed framework on the PASCAL VOC 2012 training and val sets.

4.5 Ablation Studies

We conduct ablation studies to analyze the contribution of each module in the proposed framework. All experiments are carried out on the PASCAL VOC 2012 dataset with the VGG16 model as the backbone network.

4.5.1 Iterative Affinity Learning

To demonstrate the effectiveness of the proposed iterative affinity learning method, we show the intermediate results on the PASCAL VOC 2012 training and validation sets in Table 5. We analyze the segmentation results of the training process using the mIoU and precision metrics. As the performance of the proposed method reaches a plateau after 5 iterations, we use the networks trained at the 5-th iteration for inference. With the affinity matrix being optimized over the first 5 iterations, the performance of the unary network increases gradually from 51.5% to 61.7%, and that of the pairwise network increases from 55.2% to 63.1%. At each step, the mIoU of the pairwise network is higher than that of the unary network, which demonstrates that the learned affinity matrix is effective in refining the results of the unary segmentation network. The main reason the performance increases with more iterations is that the proposed method learns robust affinities from the mined confident regions. Under weak supervision, we do not have complete and accurate object masks. As mentioned in Section 3.2.2, to learn robust affinities we only need some confident regions that identify the class of each region with high precision. As shown in Table 5, at each step, as ambiguous regions are removed, the mined confident regions are less complete than the results of the unary network (i.e., lower mIoU), but their precision is higher, which provides more accurate supervision for learning affinities robustly. We also show segmentation results at each iteration in Figure 6. With more iterations, our framework gradually generates more accurate segmentation results.

 

step 1 step 2 step 3 step 4 step 5
without mining conf. 52.8 56.6 56.5 56.6 56.4
with mining conf. 55.2 59.5 61.4 62.7 63.1

 

Table 6: Comparison with the method that eliminates the procedure of mining confident regions. The results show the mIoU on the PASCAL VOC 2012 training set.

 

step 1 step 2 step 3 step 4 step 5
without mining conf. 0.092 0.065 0.061 0.053 0.048
with mining conf. 0.061 0.044 0.042 0.034 0.029

 

Table 7: Energy of each iteration without and with the procedure of mining confident regions.
Figure 7: Visualization of the learned affinity without and with the region smoothness constraint when training the pairwise affinity network. For each image, the first row (a) shows results without the region smoothness constraint, and the second row (b) shows results with it. With the region smoothness constraint, the learned affinity values inside object regions are smoother, with clearer object boundaries. Best viewed in color.

4.5.2 Mining Regions with High Confidence Scores

To validate the proposed mining method for confident regions, we compare with the alternative without this procedure. We show the output of the pairwise network at each iteration in Table 6. Without mining confident regions, the framework converges after 2 iterations and achieves lower segmentation performance. With the mined confident regions for learning reliable affinity matrices, the proposed method performs better over the iterations. The results demonstrate the importance of the proposed mining method.

As stated in Section 3.1, by mining regions with high confidence scores, we expect them to have lower energy than the original results (see (6)). To validate this claim, we compare the energy before and after mining the confident regions. We randomly select 500 images as samples and compute their average energy with (1). Table 7 shows the intermediate results at each step. With the proposed mining of confident regions, the energy is decreased, which indicates that (6) is satisfied.

4.5.3 Pairwise Affinity Learning

The pairwise affinity network aims to learn the pixel affinities to refine the segmentation results with spatial propagation. To validate the effectiveness and necessity of learning the pairwise network, we remove it from our framework. Table 8 shows the segmentation results with and without using pixel affinities. Without learning the affinity network, the performance at each iteration is much lower than that of the proposed method; the final mIoU is lower than our method by 8.8% on the training set and 9.2% on the val set. These results demonstrate the importance of learning the affinity network. Although the proposed algorithm is able to obtain regions with high precision by mining confident regions, it misses some object regions, as shown in Figure 3(c). If we directly use the mined confident regions to supervise the unary segmentation network, some segments are likely to be missing, which affects the performance. By learning the pairwise affinity network, we can propagate the pixel labels from confident regions to regions with unknown labels, and thus achieve better object segmentation results.

 

step 1 step 2 step 3 step 4 step 5
without learning affinity 49.6 51.7 53.5 54.2 54.3
with learning affinity 55.2 59.5 61.4 62.7 63.1

 

Table 8: Comparisons with the alternative method without the pairwise affinity network. The segmentation results on the PASCAL VOC 2012 training set are presented using the mIoU.

4.5.4 Region Smoothness Constraint on Affinity

To validate the proposed region smoothness loss for training the pairwise affinity network, we show the learned pixel affinities in Figure 7. As mentioned in Section 3.2.2, we learn 12 affinity matrices (4 directions for 3 image channels); each affinity matrix has the same number of channels as the input probability maps. For presentation clarity, we show some channels of the first learned affinity matrix here; the results are similar for the other matrices. With the region smoothness constraint, the learned affinity values inside object regions are smoother, with clearer object boundaries. For segmentation, the region smoothness constraint improves the final results from 61.2% to 62.0% on the PASCAL VOC 2012 val set.

 

Methods mIoU
SEC (ECCV’16) (Kolesnikov and Lampert (2016)) 22.4
BFBP (ECCV’16) (Saleh et al. (2016)) 20.4
DSRG (CVPR’18) (Huang et al. (2018)) 26.0
Ours 27.7

 

Table 9: Evaluation results on the COCO dataset.

4.6 Results on the COCO Dataset

We conduct experiments on the more challenging COCO dataset, and compare with some recent methods including SEC (Kolesnikov and Lampert (2016)), BFBP (Saleh et al. (2016)), and DSRG (Huang et al. (2018)). Table 9 shows the results on the val set, where all methods use the VGG16 network as the backbone model. The proposed algorithm achieves 27.7% on mIoU and performs favorably against the state-of-the-art methods.

5 Conclusions

In this paper, we propose a weakly-supervised semantic segmentation algorithm based on an iterative affinity learning framework. Starting from the coarse annotations generated by class activation maps, we exploit data redundancies in natural images to learn pixel affinities and propagate labels iteratively. Our framework consists of a unary segmentation network that predicts the class probability map, and a pairwise affinity network that learns affinities and refines the results of the unary network. We propose to mine confident regions for learning reliable affinities. The refined results are then used as supervisory signals to retrain the unary network. These procedures are conducted iteratively to learn more robust affinities and progressively generate better segmentation. Experimental results on both the PASCAL VOC 2012 and COCO datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.

Acknowledgments

This work is supported by National Key Basic Research Program of China (No. 2016YFB0100900), Beijing Science and Technology Planning Project (No. Z191100007419001), National Natural Science Foundation of China (No. 61773231), and National Science Foundation (CAREER No. 1149783).

Footnotes

  1. email: andyxwang@tencent.com
  2. email: sifeil@nvidia.com
  3. email: mhmpub@ustb.edu.cn
  4. email: mhyang@ucmerced.edu
  5. We use the code provided by the authors. The authors report results on the original training set (1464 images) of the PASCAL VOC 2012 dataset. Here we present results on the augmented training set (10582 images) as all models are trained on the augmented training set.

References

  1. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4981–4990.
  2. What’s the point: semantic segmentation with point supervision. In Proceedings of European Conference on Computer Vision (ECCV), pp. 549–565.
  3. Convolutional random walk networks for semantic image segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 858–866.
  4. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 40 (4), pp. 834–848.
  5. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
  6. BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1635–1643.
  7. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV) 88 (2), pp. 303–338.
  8. S4Net: single stage salient-instance segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6103–6112.
  9. Associating inter-image salient instances for weakly supervised semantic segmentation. In Proceedings of European Conference on Computer Vision (ECCV), pp. 367–383.
  10. Efficient graph-based image segmentation. International Journal of Computer Vision (IJCV) 59 (2), pp. 167–181.
  11. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1074–1085.
  12. Semantic contours from inverse detectors. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 991–998.
  13. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  14. Weakly-supervised semantic segmentation network with deep seeded region growing. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7014–7023.
  15. Caffe: convolutional architecture for fast feature embedding. In Proceedings of ACM International Conference on Multimedia (ACM MM), pp. 675–678.
  16. Predictability and redundancy of natural images. JOSA A 4 (12), pp. 2395–2400.
  17. Simple does it: weakly supervised instance and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 876–885.
  18. Seed, expand and constrain: three principles for weakly-supervised image segmentation. In Proceedings of European Conference on Computer Vision (ECCV), pp. 695–711.
  19. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 30, pp. 228–242.
  20. ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3159–3167.
  21. Microsoft COCO: common objects in context. In Proceedings of European Conference on Computer Vision (ECCV), pp. 740–755.
  22. Learning affinity via spatial propagation networks. In Proceedings of Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 1520–1530.
  23. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
  24. Affinity CNN: learning pixel-centric pairwise relations for figure/ground embedding. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 174–182.
  25. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1742–1750.
  26. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1796–1804.
  27. Fully convolutional multi-class multiple instance learning. arXiv preprint arXiv:1412.7144.
  28. From image-level to pixel-level labeling with convolutional networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1713–1721.
  29. Augmented feedback in semantic segmentation under image level supervision. In Proceedings of European Conference on Computer Vision (ECCV), pp. 90–105.
  30. Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3529–3538.
  31. Built-in foreground/background prior for weakly-supervised semantic segmentation. In Proceedings of European Conference on Computer Vision (ECCV), pp. 413–432.
  32. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22 (8), pp. 888–905.
  33. Distinct class-specific saliency maps for weakly supervised semantic segmentation. In Proceedings of European Conference on Computer Vision (ECCV), pp. 218–234.
  34. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  35. Edge preserving and multi-scale contextual neural network for salient object detection. IEEE Transactions on Image Processing (TIP) 27 (1), pp. 121–134.
  36. Weakly-supervised semantic segmentation by iteratively mining common object features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1354–1362.
  37. Towards efficient hierarchical designs by ratio cut partitioning. In IEEE International Conference on Computer-Aided Design, pp. 298–301.
  38. Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1568–1576.
  39. STC: a simple to complex framework for weakly-supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 39 (11), pp. 2314–2320.
  40. Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7268–7277.
  41. Pyramid scene parsing network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890.
  42. Learning deep features for discriminative localization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929.
  43. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision (IJCV) 127 (3), pp. 302–321.