SG-One: Similarity Guidance Network for One-Shot Semantic Segmentation


One-shot semantic segmentation poses the challenging task of recognizing object regions from unseen categories with only one annotated example as supervision. In this paper, we propose a simple yet effective Similarity Guidance network to tackle the One-shot (SG-One) segmentation problem. We aim to predict the segmentation mask of a query image with the reference to one densely labeled support image. To obtain a robust representative feature of the support image, we first propose a masked average pooling strategy for producing the guidance features using only the pixels belonging to the object in the support image. We then leverage the cosine similarity to build the relationship between the guidance features and the features of pixels from the query image. In this way, the probabilities embedded in the produced similarity maps can be adopted to guide the process of segmenting objects. Furthermore, SG-One is a unified framework that can efficiently process both support and query images within one network and be learned in an end-to-end manner. We conduct extensive experiments on Pascal VOC 2012. In particular, SG-One achieves an mIoU score of 46.3%, which outperforms the state-of-the-art.


1 Introduction

Object Semantic Segmentation (OSS) aims at predicting the class label of each pixel. Deep neural networks have achieved tremendous success on OSS tasks, such as U-net [18], FCN [16] and Mask R-CNN [12]. However, these algorithms are trained with full annotations, which require a large investment in expensive labeling. To reduce the budget, a promising alternative is to learn from weak annotations, e.g. image-level labels [25, 23, 24], scribbles [15, 21], bounding boxes [13, 6] and points [2]. However, a main disadvantage of weakly supervised methods is their inability to generalize to unseen classes. For example, if a network is trained to segment dogs using thousands of images containing various breeds of dogs, it will not be able to segment bikes without finetuning on many images containing bikes.

In contrast, humans are very good at recognizing things with only a little guidance. For instance, it is very easy for a child to recognize various breeds of dogs with the reference to only one picture of a dog. Inspired by this, one-shot learning is dedicated to imitating this powerful ability of human beings. In other words, one-shot learning aims to recognize new objects according to only one annotated example. This is a great challenge for the standard learning methodology. Instead of using tremendous numbers of annotated instances to learn the characteristic patterns of a specific category, our target is to learn one-shot networks that generalize to unseen classes with only one densely annotated example.

Figure 1: An overview of the proposed SG-One approach for testing a new class.

Concretely, one-shot segmentation is to discover the object regions of a query image with the reference to only one support image, where both the query and support images are sampled from the same unseen category. Caelles et al. [3] proposed a method for video object segmentation. However, it needs to finetune and change the parameters during testing, which costs extra computational resources. Shaban et al. [20] were the first to study the image semantic segmentation problem in the one-shot learning setting. They propose a model named OSLSM that adapts the Siamese network [14] to support semantic segmentation. Mainly, a pair of parallel networks is trained for extracting the features of labeled support images and query images, respectively. These features are then fused to generate the probability maps of the target objects, and segmentation masks are finally obtained by thresholding the probability maps. Rakelly et al. [17] and Dong et al. [7] inherit the same framework as OSLSM and apply some slight changes to obtain better segmentation performance. These methods have the advantage that the parameters trained on observed classes can be directly used for testing unseen classes without finetuning. Nevertheless, they have some weaknesses: 1) the parameters of the two parallel networks are redundant, which is prone to overfitting and wastes computational resources; 2) combining the features of support and query images by mere multiplication is inadequate for guiding the query images to learn high-quality segmentation masks.

To tackle the above weaknesses, we propose a Similarity Guidance Network for One-Shot Semantic Segmentation (SG-One). The fundamental idea of SG-One is to guide the segmentation process by effectively incorporating the pixel-wise similarities between the features of support objects and query images. We propose to apply a masked average pooling operation to extract the representative vector of the support image. Guidance maps are then produced by calculating the cosine similarity between the representative vector and the feature of the query image at each pixel. The segmentation masks of query images are predicted by injecting the guidance maps into the segmentation process: in detail, the position-wise feature vectors of query images are multiplied by the corresponding similarity values. This strategy effectively helps activate the target object regions of query images following the guidance of support images and their masks. Furthermore, we adopt a unified network for producing similarity guidance and predicting segmentation masks of query images simultaneously. Such a network is more capable of generalizing to unseen classes.

Our approach offers multiple appealing advantages over the previous state-of-the-art, e.g. OSLSM [20] and co-FCN [17]. First, OSLSM and co-FCN incorporate the segmentation masks of support images by changing the input structure of the network or the statistical distribution of the input images. In contrast, we extract the representative vector from the intermediate feature maps with the masked average pooling operation rather than changing the inputs; our approach harms neither the input structure of the network nor the statistics of the input data. Second, OSLSM and co-FCN directly multiply the representative vector with the feature maps of query images to predict the segmentation masks. SG-One instead calculates the similarity between the representative vector and the feature at each pixel of the query image, and the resulting similarity maps are employed to guide the segmentation branch in finding the target object regions. Our method is superior in leading the segmentation process of query images. Third, OSLSM and co-FCN adopt a pair of VGGnet-16 networks for processing support and query images, respectively. We employ a unified network to process them simultaneously with far fewer parameters, which reduces the computational burden and increases the ability to generalize to new classes in testing.

The overview of SG-One is illustrated in Figure 1. After the training phase, the SG-One network can predict the segmentation masks of a new class without changing its parameters. For example, a query image of an unseen class, e.g. cow, is processed to discover the target object regions of the cow with only one annotated support image provided. We apply two branches, a similarity guidance branch and a segmentation branch, to produce the guidance maps and segmentation masks, respectively. The guidance maps are generated by calculating cosine similarities between the representative vector of the support object, e.g. the cow, and the features of the query image. We multiply the similarity maps with the feature maps of the query image from the segmentation branch to segment the target regions of the cow. The two branches are connected by concatenating their feature maps, so they can exchange information during the forward and backward stages.

To sum up, our main contributions are three-fold:

  • We calculate the representative vector of support images using a masked average pooling operation, without requiring any change to the input structure of the network.

  • We produce guidance for segmenting query images using the cosine similarity between the representative vector of support images and the features of query images.

  • Our approach achieves a cross-validated mIoU of 46.3% on the PASCAL-5i dataset in the one-shot segmentation setting, surpassing the current state-of-the-art.

Figure 2: The network of the proposed SG-One approach. A query image and a labelled support image are fed into the same network. The Guidance Branch extracts the representative vector of the target object in the support image. The Segmentation Branch produces the features of the query image. We use cosine similarity to guide the network to find the target regions of the query image.

2 Related Work

Object semantic segmentation (OSS) aims at classifying every pixel in a given image. OSS with dense annotations has achieved great success in precisely identifying various kinds of objects. FCN [16] and U-Net [18] abandon fully connected layers and propose to use only convolutional layers, preserving the relative positions of pixels. Building on the advantages of FCN, DeepLab, proposed by Chen et al. [4, 5], is one of the best segmentation algorithms. It employs dilated convolutions to increase the receptive field while saving parameters in comparison with large-kernel methods. He et al. [12] show that segmentation masks and detection bounding boxes can be predicted simultaneously using a unified network.

Weakly supervised object segmentation seeks an alternative approach for segmentation in order to reduce the expense of labeling segmentation masks. Zhou et al. [28] and Zhang et al. [26, 27] propose to discover precise object regions using a classification network with only image-level labels. Wei et al. [25, 23] apply a two-stage mechanism to predict segmentation masks: confident regions of objects and background are first extracted via the methods of Zhou et al. or Zhang et al., and a segmentation network, such as DeepLab, is then trained to segment the target regions. An alternative weakly supervised approach is to use scribble lines to indicate the rough positions of objects and background. Lin et al. [15] and Tang et al. [21] adopt spectral clustering to distinguish object pixels according to the similarity of adjacent pixels and the ground-truth scribble lines.

Few-shot learning algorithms are dedicated to distinguishing the patterns of classes or objects with only a few labeled samples. Networks should generalize to recognize new objects from few images based on the parameters of base models, which are trained on classes entirely disjoint from the testing classes. Finn et al. [10] try to learn internal transferable representations that are broadly applicable to various tasks. Vinyals et al. [22] and Annadani et al. [1] propose to learn embedding vectors such that vectors of the same category are close while vectors of different categories are far apart.

3 Methodology

3.1 Problem Definition

Suppose we have three datasets: a training set $\mathcal{D}_{train} = \{(I_i, Y_i)\}_{i=1}^{N_{train}}$, a support set $\mathcal{D}_{support} = \{(I_i, Y_i)\}_{i=1}^{N_{support}}$ and a testing set $\mathcal{D}_{test} = \{I_i\}_{i=1}^{N_{test}}$, where $I_i$ is an image, $Y_i$ is the corresponding segmentation mask and $N$ is the number of images in each set. Both the support set and the training set have annotated segmentation masks. The support set and the testing set share the same object classes, which are disjoint from those of the training set. We denote by $l_i$ the semantic class of the mask $Y_i$. Therefore, the label-sets satisfy $L_{support} = L_{test}$ and $L_{train} \cap L_{test} = \varnothing$. If there are $K$ annotated images for each of $C$ new classes, the target few-shot problem is named $C$-way $K$-shot. Our purpose is to train a network on the training set $\mathcal{D}_{train}$ which can precisely predict segmentation masks on the testing set $\mathcal{D}_{test}$ according to the reference of the support set $\mathcal{D}_{support}$.

In order to better learn the connection between the support and testing sets, we mimic this mechanism in the training process. For a query image $I_q$, we construct a pair by randomly selecting a support image $I_s$ whose mask $Y_s$ has the same semantic class as $I_q$. We estimate the segmentation mask with a function $\hat{Y}_q = f(I_q, I_s, Y_s; \theta)$, where $\theta$ denotes the parameters of the function. In testing, $(I_s, Y_s)$ is picked from the support set $\mathcal{D}_{support}$ and $I_q$ is an image from the testing set $\mathcal{D}_{test}$.
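The episodic pair construction described above can be sketched in a few lines of Python (the `dataset_by_class` mapping and the helper name are illustrative, not from the paper):

```python
import random

def sample_episode(dataset_by_class):
    """Sample one training episode: a query image and a support pair
    that share the same semantic class.

    dataset_by_class: dict mapping a class label to its list of
                      (image, mask) pairs (illustrative structure).
    """
    cls = random.choice(list(dataset_by_class))
    # random.sample guarantees two distinct entries of the same class.
    query, support = random.sample(dataset_by_class[cls], 2)
    q_img, q_mask = query      # ground-truth mask used only in the loss
    s_img, s_mask = support    # densely labelled support pair
    return (q_img, q_mask), (s_img, s_mask), cls
```

At test time the same interface applies, except the support pair comes from the support set of an unseen class.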

3.2 The Proposed Model

In this section, we first present the masked average pooling operation for extracting the object representative vector from annotated support images. Next, similarity guidance maps are introduced for combining the representative vectors with the features of query images. Finally, we describe the details of the proposed unified one-shot framework of SG-One for predicting segmentation masks of unseen classes.

Masked Average Pooling Support pairs of images and masks are usually utilized to encode the support objects into representative vectors. OSLSM [20] proposes to erase the background pixels from the support images by multiplying the support images with their binary masks. co-FCN [17] proposes to construct a five-channel input block by concatenating the support images with their positive and negative masks. However, there are two empirical disadvantages to these methods. First, erasing the background pixels to zero may change the statistical distribution of the support image set; if we apply a unified network to process both the query images and the erased support images, the variance of the input data greatly increases. Second, concatenating the support images with their masks breaks the input structure of the network, which also prevents the implementation of a unified network.

We propose to employ masked average pooling for extracting the representative vectors of support objects. Suppose we have a support RGB image $I' \in \mathbb{R}^{3 \times w \times h}$ and its binary segmentation mask $Y' \in \{0, 1\}^{w \times h}$, where $w$ and $h$ are the width and height of the image. Let the output feature maps of $I'$ be $F' \in \mathbb{R}^{c \times w' \times h'}$, where $c$ is the number of channels and $w'$ and $h'$ are the width and height of the feature maps. We first resize the feature maps to the same size as the mask via bilinear interpolation, and denote the resized feature maps by $\tilde{F}' \in \mathbb{R}^{c \times w \times h}$. Then, the $i$-th element of the representative vector $v$ is computed by averaging the pixels within the object regions on the $i$-th feature map:

$$v_i = \frac{\sum_{x=1}^{w} \sum_{y=1}^{h} Y'_{x,y} \, \tilde{F}'_{i,x,y}}{\sum_{x=1}^{w} \sum_{y=1}^{h} Y'_{x,y}} \qquad (1)$$
As discussed in FCN [16], fully convolutional networks are able to preserve the relative positions of input pixels. Therefore, through masked average pooling, we expect to extract the features of the object regions while disregarding the background contents. Also, we argue that the input of contextual regions is helpful for learning better object features; this has been discussed in DeepLab [4], which incorporates contextual information using dilated convolutions. Masked average pooling keeps the input structure of the network unchanged, which enables us to process both the support and query images within one network.
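A minimal NumPy sketch of masked average pooling, assuming the support feature maps have already been resized to the mask resolution (function and variable names are illustrative):

```python
import numpy as np

def masked_average_pooling(features, mask):
    """Average the per-pixel feature vectors over the object region only.

    features: (C, H, W) support feature maps, already resized to the
              mask resolution via bilinear interpolation.
    mask:     (H, W) binary segmentation mask of the support object.
    Returns the (C,) representative vector of the support object.
    """
    mask = mask.astype(features.dtype)                 # (H, W)
    area = mask.sum()                                  # number of object pixels
    # Zero out background pixels, then average over the object area.
    v = (features * mask[None, :, :]).sum(axis=(1, 2)) / np.maximum(area, 1e-8)
    return v
```

In the actual network this would operate on the intermediate feature maps of the guidance branch, one support image at a time.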

Similarity Guidance One-shot semantic segmentation aims to segment the target object within query images given a support image of the reference object. As discussed above, the masked average pooling method is employed to extract the representative vector $v \in \mathbb{R}^c$ of the reference object, where $c$ is the number of channels. Suppose the feature maps of a query image are $F^q \in \mathbb{R}^{c \times w' \times h'}$. We employ the cosine distance to measure the similarity between the representative vector and each pixel within $F^q$ following Eq. (2):

$$s_{x,y} = \frac{v \cdot F^q_{x,y}}{\|v\|_2 \, \|F^q_{x,y}\|_2} \qquad (2)$$

where $s_{x,y}$ is the similarity value at the pixel $(x, y)$ and $F^q_{x,y}$ is the feature vector of the query image at the pixel $(x, y)$. As a result, the similarity map $S = \{s_{x,y}\}$ integrates the support object feature with the query features. We use this map as guidance to teach the segmentation branch how to discover the desired object regions. We do not explicitly optimize the cosine similarity; instead, we element-wise multiply the similarity guidance map with the feature maps of the query image from the segmentation branch, and then optimize the guided feature maps to fit the corresponding ground-truth mask.
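The pixel-wise cosine similarity guidance can be sketched as follows (a NumPy illustration of the computation, not the paper's implementation):

```python
import numpy as np

def cosine_similarity_map(query_features, v, eps=1e-8):
    """Cosine similarity between the support vector and every pixel feature.

    query_features: (C, H, W) feature maps of the query image.
    v:              (C,) representative vector from masked average pooling.
    Returns an (H, W) similarity map in [-1, 1] used to guide segmentation.
    """
    dot = np.tensordot(v, query_features, axes=([0], [0]))   # (H, W)
    q_norm = np.linalg.norm(query_features, axis=0)          # (H, W)
    v_norm = np.linalg.norm(v)
    return dot / (q_norm * v_norm + eps)

# The guided features are then the element-wise product:
# guided = query_features * cosine_similarity_map(query_features, v)[None]
```

The `eps` term only prevents division by zero; the similarity itself is never optimized directly, matching the description above.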

3.3 The Similarity Guidance Method

Figure 2 depicts the structure of our proposed model. SG-One includes three components: Stem, Similarity Guidance Branch and Segmentation Branch, each with a different structure and functionality. Stem is a fully convolutional network for extracting intermediate features of both support and query images.

Similarity Guidance Branch takes as input the features of both query and support images. We apply this branch to produce the similarity guidance maps by combining the features of reference objects with the features of query images. For the features of support images, we implement three convolutional blocks to extract highly abstract, semantic features, followed by a masked average pooling layer to obtain the representative vectors. The extracted representative vectors of support images are expected to contain the high-level semantic features of a specific object. For the features of query images, we reuse the three blocks and employ a cosine similarity layer to calculate the closeness between the representative vector and the features at each pixel of the query images.

Segmentation Branch discovers the target object regions of query images with the guidance of the generated similarity maps. We employ three convolutional layers with a kernel size of 3×3 to obtain the features for segmentation. The inputs of the last two convolutional layers are concatenated with the parallel feature maps from Similarity Guidance Branch. Through this concatenation, Segmentation Branch can borrow features from the parallel branch, and the two branches are able to exchange information during the forward and backward stages. We fuse the generated features with the similarity guidance maps by multiplication at each pixel. Finally, the fused features are processed by two convolutional layers with kernel sizes of 3×3 and 1×1, followed by a bilinear interpolation layer. We apply the cross-entropy loss function to classify each pixel as belonging to the same class as the support image or to the background.

One-Shot Testing One annotated support image is provided to guide the segmentation of semantic objects in query images of unseen classes. We do not need to finetune or change the parameters of the network; we only need to forward the query and support images through it to generate the expected segmentation masks.

K-Shot Testing Suppose there are $K$ support images $\{(I_s^k, Y_s^k)\}_{k=1}^{K}$; we propose to segment the query image using two approaches. The first is to ensemble the $K$ segmentation masks corresponding to the support images, following OSLSM [20] and co-FCN [17], based on Eq. (3):

$$\hat{y}_{x,y} = \max_k \hat{y}_{x,y}^k \qquad (3)$$

where $\hat{y}_{x,y}^k$ is the predicted semantic label of the pixel at $(x, y)$ corresponding to the support image $I_s^k$. The other approach is to average the $K$ representative vectors, and then use the averaged vector to guide the segmentation process. Notably, we do not need to retrain the network using $K$-shot support images: the network trained in the one-shot manner is used directly to test segmentation performance with $K$-shot support images.
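Both K-shot fusion strategies can be sketched as follows (a NumPy illustration; the function names are ours):

```python
import numpy as np

def kshot_fuse_masks(masks):
    """Strategy 1: ensemble the K binary predictions by a pixel-wise
    maximum, i.e. a pixel is foreground if any support image marks it so."""
    return np.max(np.stack(masks, axis=0), axis=0)

def kshot_average_vector(vectors):
    """Strategy 2: average the K representative vectors, then guide the
    segmentation once with the averaged vector."""
    return np.mean(np.stack(vectors, axis=0), axis=0)
```

Strategy 2 requires only a single forward pass of the segmentation branch, which is consistent with the averaged-vector results reported in Table 3.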

4 Experiments

4.1 Dataset and Metric

Following the evaluation protocol of the previous works OSLSM [20] and co-FCN [17], we create PASCAL-5i using the PASCAL VOC 2012 dataset [8] and the extended SDS dataset [11]. For the 20 object categories in PASCAL VOC, we use cross-validation to evaluate the proposed model: five classes are sampled as the test categories $L_{test} = \{5i+1, \ldots, 5i+5\}$ listed in Table 1, where $i$ is the fold number, while the remaining 15 classes form the training label-set $L_{train}$. We build the training set by excluding the images which contain object classes in the testing label-set, and follow the same procedure as the baseline methods to form the testing set. We use the same test set as OSLSM [20], which has 1,000 support-query tuples for each fold.

Dataset Test classes
PASCAL-50 aeroplane,bicycle,bird,boat,bottle
PASCAL-51 bus,car,cat,chair,cow
PASCAL-52 diningtable,dog,horse,motorbike,person
PASCAL-53 potted plant,sheep,sofa,train,tv/monitor
Table 1: Testing classes for 4-fold cross-validation test.

Suppose the predicted segmentation mask is $\hat{Y}$ and the corresponding ground-truth annotation is $Y$, given a specific class $l$. We define the Intersection over Union of class $l$ as $IoU_l = \frac{TP_l}{TP_l + FP_l + FN_l}$, where $TP_l$, $FP_l$ and $FN_l$ are the numbers of true positives, false positives and false negatives of the predicted masks. The mIoU is the average of the IoUs over the different classes, i.e. $mIoU = \frac{1}{n_l} \sum_l IoU_l$, where $n_l$ is the number of testing classes. We report the mIoU averaged over the four cross-validation folds.
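The metric can be written compactly for binary per-class masks (a NumPy sketch):

```python
import numpy as np

def iou(pred, gt):
    """IoU of one class: TP / (TP + FP + FN) for binary masks pred and gt."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return tp / float(tp + fp + fn)

def miou(preds, gts):
    """Mean IoU over the testing classes."""
    return np.mean([iou(p, g) for p, g in zip(preds, gts)])
```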

4.2 Implementation details

We implement the proposed approach based on the VGG-16 network, following the previous works [20, 17]. Stem takes RGB images as input to extract middle-level features, and downsamples the images by a factor of 8; we use the first three blocks of VGG-16 as Stem. For the first two convolutional blocks of Similarity Guidance Branch, we adopt the structure of conv4 and conv5 of VGGnet-16 and remove the max-pooling layers to maintain the resolution of the feature maps. One conv3×3 layer of 512 channels is added on top, without ReLU after this layer. The following module is masked average pooling, which extracts the representative vector of support images. In Segmentation Branch, all of the convolutional layers with kernel size 3×3 have 128 channels. The last conv1×1 layer has two channels, corresponding to object and background pixels, respectively. All of the convolutional layers except the third and the last one are followed by a ReLU layer.

Following the baseline methods [20, 17], we use weights pre-trained on ILSVRC [19]. All input images keep their original size without any data augmentation. We implement the network using PyTorch. We adopt the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0005. All networks are trained and tested on NVIDIA TITAN X GPUs with 12 GB memory. Our source code is available online.

Methods PASCAL-50 PASCAL-51 PASCAL-52 PASCAL-53 Mean
1-NN 25.3 44.9 41.7 18.4 32.6
LogReg 26.9 42.9 37.1 18.4 31.4
Siamese 28.1 39.9 31.8 25.8 31.4
OSVOS [3] 24.9 38.8 36.5 30.1 32.6
OSLSM [20] 33.6 55.3 40.9 33.5 40.8
co-FCN [17] 36.7 50.6 44.9 32.4 41.1
Ours 40.2 58.4 48.4 38.4 46.3
Table 2: Mean IoU results of 1-shot segmentation on the PASCAL-5i dataset. The best results are in bold.
Methods PASCAL-50 PASCAL-51 PASCAL-52 PASCAL-53 Mean
1-NN 34.5 53.0 46.9 25.6 40.0
LogReg 35.9 51.6 44.5 25.6 39.3
Co-segmentation [9] 25.1 28.9 27.7 26.3 27.1
OSLSM [20] 35.9 58.1 42.7 39.1 43.9
co-FCN [17] 37.5 50.0 44.1 33.9 41.4
Ours-max 40.8 57.2 46.0 38.5 46.2
Ours-avg 41.9 58.6 48.6 39.4 47.1
Table 3: Mean IoU results of 5-shot segmentation on the PASCAL-5i dataset. The best results are in bold.
Methods 1-shot 5-shot
FG-BG [17] 55.1 55.6
OSLSM [20] 55.2 -
co-FCN [17] 60.1 60.8
PL+SEG [7] 61.2 62.3
Ours 63.1 65.9
Table 4: Mean IoU results under the evaluation method of co-FCN [17] and PL+SEG [7].
Figure 3: Segmentation results on unseen classes with the guidance of support images. For the failure pairs, the ground truth is on the left and the prediction on the right.

4.3 Comparison

One-shot Table 2 compares the proposed SG-One approach with the baseline methods in one-shot semantic segmentation. Our method outperforms all baseline models: the mIoU of our approach over the four folds reaches 46.3%, which is significantly better than co-FCN by 5.2% and OSLSM by 5.5%. Compared to the baselines, SG-One obtains the largest gain of 4.9% on PASCAL-53, where the testing classes are potted plant, sheep, sofa, train and tv/monitor. co-FCN [17] constructs the input of the support network by concatenating the support images with their positive and negative masks, and obtains 41.1%. OSLSM [20] feeds only the object pixels as input by masking out the background regions, and obtains 40.8%. OSVOS [3] finetunes the network using the support samples at test time, yet achieves only 32.6%. To summarize, SG-One can effectively predict segmentation masks for new classes without changing the parameters, and our similarity guidance method is better than the baselines at incorporating the support objects for segmenting unseen objects.

Figure 3 shows the one-shot segmentation results of SG-One on unseen classes. We observe that SG-One can precisely distinguish the object regions from the background with the guidance of the support images, even when the support and query images do not share much appearance similarity. We also show some failure cases to inform future research. We ascribe the failures to two causes: 1) the target object regions are too similar to the background, e.g. the side of the bus and the sky; 2) the target regions have features that are very uncommon among the discovered discriminative regions, e.g. the vest on the dog, and may thus be far distant from the representative feature of the support objects.

Five-shot Table 3 reports the 5-shot segmentation results on the four folds. As discussed, we apply two approaches for 5-shot semantic segmentation. Averaging the representative vectors of the five support images achieves 47.1%, which significantly outperforms the current state-of-the-art, co-FCN, by 5.7%, and is also better than the corresponding 1-shot mIoU of 46.3%. Therefore, the averaged support vector provides a better representation of the features for guiding the segmentation process. The other approach fuses the final segmentation results by combining all of the detected object pixels; we do not observe any improvement from this approach compared to the 1-shot result. Notably, we do not train a new network specifically for 5-shot segmentation: the network trained in the one-shot way is directly applied to predict the 5-shot segmentation results.

We also evaluate the proposed model using the evaluation method of co-FCN [17] and PL+SEG [7]. This metric first computes the IoU of the foreground and of the background separately, and then takes the mean of the two; we again report the mIoU averaged over the four cross-validation folds. Table 4 compares SG-One with the baseline methods under this metric for both 1-shot and 5-shot semantic segmentation. The proposed approach outperforms all previous baselines: SG-One achieves 63.1% for 1-shot and 65.9% for 5-shot segmentation, while the most competitive baseline, PL+SEG, obtains only 61.2% and 62.3%. PL+SEG applies a CRF after obtaining the predicted segmentation masks, and a CRF usually brings around 3% of improvement on the same backbone according to [4]. In contrast, the proposed network is trained end-to-end, and our results do not require any pre-processing or post-processing steps.

4.4 Multi-Class Segmentation

We conduct experiments to verify the ability of SG-One to segment images containing multiple classes. We randomly select 1,000 entries of query and support images; query images may contain objects of multiple classes. For each entry, we sample five annotated images from the five testing classes as support images. For every query image, we predict a segmentation mask with respect to each support class, and fuse the masks of the five classes by comparing the classification scores. The mIoU over the four folds is 29.4%. Hence, the proposed SG-One approach is able to segment objects of different unseen classes with the guidance of support images.
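A plausible sketch of the score-based fusion described above (the 0.5 background threshold is our assumption; the paper does not specify one):

```python
import numpy as np

def fuse_multiclass(score_maps):
    """Fuse per-class foreground score maps into one multi-class mask.

    score_maps: list of (H, W) foreground score maps, one per support class.
    A pixel is assigned to the class with the highest score, or to
    background (label 0) if no class scores above 0.5 (assumed threshold).
    """
    scores = np.stack(score_maps, axis=0)            # (N_cls, H, W)
    labels = np.argmax(scores, axis=0) + 1           # classes are 1-indexed
    labels[np.max(scores, axis=0) < 0.5] = 0         # background
    return labels
```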

4.5 Ablation Study

Masked Average Pooling

The masked average pooling method employed in the proposed SG-One network is superior in incorporating the guidance masks of support images. Shaban et al. [20] proposed to multiply the binary masks with the input support RGB images, so that the network extracts features of the target objects only. co-FCN [17], proposed by Rakelly et al., concatenates the support RGB images with the corresponding positive masks (object pixels are 1 and background pixels are 0) and negative masks (object pixels are 0 and background pixels are 1), constituting an input of 5 channels. We follow these two methods and compare them with our masked average pooling approach. Concretely, we first replace the masked average pooling layer with a global average pooling layer. Then, we implement two networks: 1) SG-One-masking adopts the method of OSLSM [20], in which support images are multiplied by the binary masks to keep only the object regions; 2) SG-One-concat adopts the method of co-FCN [17], in which we concatenate the positive and negative masks with the support images, forming an input of 5 channels. We add an extra input block (VGGnet-16) with 5 input channels to accommodate the concatenated inputs, while the rest of the network is exactly the same as in the compared networks.

Table 5 compares the performance of the different methods of processing support images and masks. Our masked average pooling approach achieves the best results on every fold, with an mIoU over the four folds of 46.3%. The masking method (SG-One-masking) proposed in OSLSM [20] obtains 45.0% mIoU. The approach of co-FCN (SG-One-concat) obtains only 41.75%, which we ascribe to the modification of the input structure of the network: the modified input block cannot benefit from the pre-trained weights for processing low-level information. We also implement a network that does not use the binary masks of the support images at all; it achieves an mIoU of 42.2%. In total, we conclude that: 1) a qualified method of using support masks is crucial for extracting object features that guide the network in learning the segmentation masks of query images; 2) the proposed masked average pooling method provides an effective way to reuse the structure of a well-designed classification network for extracting object features from support pairs; 3) networks with a 5-channel input cannot benefit from the pre-trained weights, and the extra input block cannot be jointly trained with the query images; 4) the masked average pooling layer has superior generalization ability in segmenting unseen classes.

Methods PASCAL-50 PASCAL-51 PASCAL-52 PASCAL-53 Mean
SG-One-concat 38.4 51.4 44.8 32.5 41.75
SG-One-masking 41.9 55.3 47.4 35.5 45.0
SG-One-ours 40.2 58.4 48.4 38.4 46.3
Table 5: Comparison of different methods of processing support images and masks on the PASCAL-5i dataset.

Guidance Similarity Generating Methods We adopt the cosine similarity to calculate the distance between the object feature vector and the feature maps of query images. The cosine distance measures the angle between two vectors, and its range is [-1, 1]. Correspondingly, we abandon the ReLU layers after the third convolutional layers of both the guidance and segmentation branches. By doing so, we increase the variance of the cosine measurement: the similarity is no longer bounded in [0, 1], but ranges over [-1, 1]. For comparison, we add ReLU layers after the third convolutional layers; the mIoU over the four folds drops to 45.5%, compared with 46.3% for the non-ReLU approach.

We also train a network using the ℓ2-norm distance as the guidance, and obtain 30.7% over the four folds. This result is far poorer than that of the proposed cosine similarity method. Hence, the ℓ2-norm distance is not a good choice for guiding the query images to discover target object regions.

The Unified Structure We adopt the proposed unified structure connecting the guidance and segmentation branches, which lets the two branches benefit from each other during the forward and backward stages. We implement two networks to illustrate the effectiveness of this structure. First, we remove the first three convolutional layers of Segmentation Branch and multiply the guidance similarity maps directly with the feature maps from Similarity Guidance Branch; the final mIoU over the four folds decreases to 43.1%. Second, we cut the connections between the two branches by removing the first and second concatenation operations; the final mIoU is 45.7%. Therefore, Segmentation Branch in our unified network is necessary for obtaining high-quality segmentation masks, and it borrows useful information via the concatenation operations between the two branches.

We also verify the proposed unified network in terms of its demand for computational resources and its generalization ability. In Table 6, we observe that our SG-One model uses only 19.0M parameters, while achieving the best segmentation performance. Following the methods in OSLSM [20] and co-FCN [17], we use a separate network (SG-One-separate) to process support images. This network uses slightly more parameters (36.1M) than co-FCN (34.2M). SG-One-separate obtains an mIoU of 44.8%, far better than the 41.1% of co-FCN. This shows that our approach for incorporating the guidance information from support image pairs is superior to those of OSLSM and co-FCN in segmenting unseen classes. Surprisingly, the proposed unified network achieves an even higher performance of 46.3%. We attribute the gain of 1.5% to the reuse of the network in extracting both support and query features. The reuse strategy not only reduces the demand for computational resources and decreases the risk of over-fitting, but also gives the network the opportunity to see more training samples. OSLSM requires the most parameters (272.6M), yet it has the lowest score.

Methods           Parameters  Mean
OSLSM [20]        272.6M      40.8
co-FCN [17]       34.2M       41.1
SG-One-separate   36.1M       44.8
SG-One-unified    19.0M       46.3
Table 6: Comparison of different methods in terms of parameter count and mean mIoU on the PASCAL-5i dataset.

5 Conclusion

We present the SG-One approach for semantic segmentation of unseen classes using only one annotated example. Our method offers several advantages. First, we abandon the masking strategy in OSLSM [20] and propose an effective masked average pooling approach to produce more robust object-related feature representations. Extensive experiments show that masked average pooling is more convenient and better at incorporating contextual information than the method in OSLSM, because the entire image is fed into the network to learn features of the desired regions, so contextual information is exploited thanks to the large receptive field. Second, we reduce the risk of over-fitting by avoiding extra parameters through the cosine similarity. Measuring the similarity between features in this way is both effective and robust: a model trained on single-class images can be directly applied to segment multi-class images. Third, our work presents a pure end-to-end network that does not require any pre-processing steps. Most importantly, the proposed approach boosts the performance of one-shot semantic segmentation to a new state-of-the-art.


  1. For details of the baseline methods, refer to OSLSM [20]. The results for co-FCN [17] are from our re-implementation.


  1. Y. Annadani and S. Biswas. Preserving semantic relations for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7603–7612, 2018.
  2. A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What’s the point: Semantic segmentation with point supervision. In European Conference on Computer Vision, pages 549–565, 2016.
  3. S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In IEEE CVPR, 2017.
  4. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014.
  5. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.
  6. J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635–1643, 2015.
  7. N. Dong and E. P. Xing. Few-shot semantic segmentation with prototype learning. BMVC, 2018.
  8. M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2014.
  9. A. Faktor and M. Irani. Co-segmentation by composition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1297–1304, 2013.
  10. C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
  11. B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In IEEE ICCV, pages 991–998, 2011.
  12. K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
  13. A. Khoreva, R. Benenson, J. H. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, volume 1, page 3, 2017.
  14. G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  15. D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. IEEE CVPR, 2016.
  16. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE CVPR, 2015.
  17. K. Rakelly, E. Shelhamer, T. Darrell, A. Efros, and S. Levine. Conditional networks for few-shot semantic segmentation. ICLR workshop, 2018.
  18. O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  19. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  20. A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots. One-shot learning for semantic segmentation. BMVC, 2017.
  21. M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers. Normalized cut loss for weakly-supervised cnn segmentation. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, 2018.
  22. O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  23. Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE CVPR, 2017.
  24. Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, J. Feng, Y. Zhao, and S. Yan. Stc: A simple to complex framework for weakly-supervised semantic segmentation. IEEE TPAMI, 2016.
  25. Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang. Revisiting dilated convolution: A simple approach for weakly-and semi-supervised semantic segmentation. In IEEE CVPR, pages 7268–7277, 2018.
  26. X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang. Adversarial complementary learning for weakly supervised object localization. In IEEE CVPR, 2018.
  27. X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang. Self-produced guidance for weakly-supervised object localization. In European Conference on Computer Vision. Springer, 2018.
  28. B. Zhou, A. Khosla, L. A., A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. IEEE CVPR, 2016.