Image Co-segmentation via Multi-scale Local Shape Transfer


Wei Teng, Yu Zhang, Xiaowu Chen, Jia Li, and Zhiqiang He

W. Teng, Y. Zhang, X. Chen and J. Li are with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China. J. Li is also with the International Research Institute for Multidisciplinary Science, Beihang University, Beijing 100191, China. Z. He is with Lenovo Research, China. Corresponding author: Xiaowu Chen. An earlier version of this work has been published in BMVC 2016 [1].

Image co-segmentation is a challenging task in computer vision that aims to segment all pixels of the objects from a predefined semantic category. In real-world cases, however, common foreground objects often vary greatly in appearance, making their global shapes highly inconsistent across images and difficult to segment. To address this problem, this paper proposes a novel co-segmentation approach that transfers patch-level local object shapes, which appear more consistent across different images. In our framework, a multi-scale patch neighbourhood system is first generated using proposal flow on arbitrary image pairs, and is then refined by Locally Linear Embedding. Based on the patch relationships, we propose an efficient algorithm to jointly segment the objects in each image while transferring their local shapes across different images. Extensive experiments demonstrate that the proposed method can robustly and effectively segment common objects from an image set. On the iCoseg, MSRC and Coseg-Rep datasets, the proposed approach performs comparably to or better than the state of the art, while on the more challenging Fashionista benchmark, our method achieves significant improvements.

Image co-segmentation, Shape transfer, Locally linear embedding.

I Introduction

Image co-segmentation (e.g. [2, 3, 4, 5, 6]) is an important problem in the field of computer vision and multimedia. It aims to simultaneously segment all the pixels of the common foreground object(s) from the same category in a set of images. With its rapid development, image co-segmentation has supported various vision applications, such as fine-grained object recognition [7], visual concept discovery [8], and image retrieval [9], which has greatly facilitated the automatic analysis and utilization of large-scale multimedia data.

Fig. 1: Illustration of the local shape consistency. The holistic shape of the object in exemplar images varies due to pose differences. However, the object shape in local patches (displayed in bounding boxes with the same color) remains similar across different images.

In the past decade, various visual cues have been investigated for extracting objects from a predefined semantic image set. Among them, appearance cues such as color and texture have proven effective. Based on appearance descriptors, the common objects can be discovered by either learning a shared distribution [2] or constructing correspondences among them [10]. However, appearance cues are limited in distinguishing objects from visually similar backgrounds and in dealing with complex objects.

To address these problems, some studies have turned to shape cues to reduce foreground/background ambiguities [11]. One popular approach is to learn a shared shape model across different images. Exemplar works include shape priors [12, 13] and deformable shape templates [14]. However, these approaches often have difficulty segmenting the common objects in scenarios with large variance in viewpoint, scale and object pose. Moreover, they often require learning a uniform object model, which may be unobtainable when the global shapes of the common objects are inconsistent. As a result, these template-based approaches may not work well.

By observing the common objects at patch level, we find that no matter how the holistic shape varies, local shapes are stable and thus transferable (see Fig. 1). Inspired by this observation, we propose a novel framework for image co-segmentation by introducing patch-level shape transfer. Specifically, we first generate patches via multi-scale sliding windows. For each patch, we search for its transferable neighbours in the other images of the set. To prune unreliable matches, we further learn a sparser neighbourhood system for the image patch set using Locally Linear Embedding [15]. Then a novel image co-segmentation approach is proposed by introducing patch-level consistencies into a graph-cut based segmentation framework. Extensive experiments on the iCoseg [16], MSRC [17] and Coseg-Rep [14] datasets show that our approach performs comparably to or better than the state of the art. Moreover, on the challenging Fashionista benchmark [18] with complex object appearances and poses, the proposed approach achieves remarkable improvements.

Our main contributions are summarized as follows:

1) We propose a novel image co-segmentation framework by introducing multi-scale local shape transfer.

2) We present a strategy to refine the patch correspondences in an image set through Locally Linear Embedding.

3) We propose an efficient co-segmentation algorithm by embedding patch consistencies into a graph-cut based energy.

This work extends our previous study [1] in two aspects:

1) Inspired by the success of the multi-scale strategy in many vision and multimedia tasks (e.g. [19, 20, 21, 22, 23]), we associate local patches across images at different resolutions via multi-scale sliding windows.

2) We conduct more comparisons between our approach and state-of-the-art methods on the iCoseg [16], MSRC [17], Coseg-Rep [14] and Fashionista [18] datasets. Quantitative results show that the proposed algorithm achieves significant improvements in both accuracy and speed.

In the rest of this paper, Section II briefly reviews previous research on image co-segmentation and Section III describes the proposed image co-segmentation framework in detail. Experimental results are presented in Section IV. Finally, the paper is concluded in Section V.

II Related Works

Existing image co-segmentation algorithms can be roughly divided into two categories: template-based and matching-based approaches, which are introduced in Section II-A and Section II-B, respectively. Relevant work on shape transfer is presented in Section II-C.

II-A Template-based Group

Most template-based approaches assume that there exists a single model which can be generalized to represent all the objects in a specific image set. Following this idea, some works proposed to learn shared distributions of appearance features. For example, Joulin et al. [2] combined spectral clustering and kernel methods into a discriminative clustering framework, which learned linear models jointly for foreground and background based on color and texture features. Kim et al. [24] modeled co-segmentation as a temperature maximization problem of anisotropic heat diffusion, in which foreground objects were represented with a diffusion process. However, these models are often insufficient for objects with complex appearances. To address this issue, several works such as [12, 13, 14, 25] advocated using shape models to facilitate co-segmentation. A common practice is to learn shape prior maps, which indicate the likelihoods of the common objects appearing at different image locations. As object shapes are actually unknown in co-segmentation, the shape priors were iteratively refined using the current segmentations [13, 12]. In [14], a sophisticated model was designed to jointly segment the common objects and learn their deformable shape templates. In [25], a common shape pattern was discovered through Affinity Propagation (AP) clustering to refine the segmentations of image sets. More recently, Quan et al. [6] established closed-loop graphs to represent the foreground and background separately, applying both low-level appearance and high-level semantic features.

To sum up, template-based approaches are powerful since they output not only the segmented objects but also the learned foreground/background/shape models. However, the proposed models are often kept simple so that learning and inference remain tractable, and thus may not adequately capture real-world objects with various appearances and structures.

Fig. 2: Framework of our approach. The initial foreground/background segmentation for each input image is estimated using [26]. Meanwhile, we construct a weighted graph among the patches sampled from different images using [27], where weights are learned by Locally Linear Embedding [15]. Finally, we optimize intra-image object segmentation and inter-image local shape transfer jointly while preserving the patch weights in label space.

II-B Matching-based Group

Matching-based approaches focus on building region correspondences among different images. Some works mainly examine the matching constraint at object level, and assume that the foreground feature histograms aggregated on different images are similar [28, 3, 29, 30]. However, this strategy may be difficult to apply to object categories with large variation. Another idea is to conduct image co-segmentation through local region correspondences among images. For example, Wang et al. [10] proposed to match regions in the functional space. Rubio et al. [31] developed an MRF formulation to jointly address object co-segmentation and region matching. Yu et al. [32] explored the inter-image similarity using a simple superpixel matching algorithm. After obtaining the correspondences among those superpixels, they transferred the foreground/background labels among the matched pixels/superpixels. Faktor and Irani [33] adopted structured matching to detect the common object parts in different images, through which “co-occurring” maps were generated to guide segmentation in each image. Lee et al. [34] suggested a multiple random walkers (MRW) clustering approach to extract the common objects from an image set. More recently, with the rise of visual saliency (e.g. [35, 36, 37, 26, 38, 39]) in computer vision, some studies adopted visual saliency in image co-segmentation. For example, Jerripothula et al. [5] proposed a saliency co-fusion-based co-segmentation method. Liu et al. [40] employed co-saliency maps of an image set to guide the clustering of image elements (i.e., superpixels) into two classes. In addition, Wang et al. [41] calculated co-occurrence maps of the common objects for image co-segmentation.

Our approach is also based on local region matching. Different from Rubio et al. [31] and Wang et al. [10], we transfer labels at patch-level rather than point-level. In this manner, structured consistency is imposed to preserve the local object shapes during transfer. Compared with [5], our approach does not assume the “co-saliency” of the common objects in the whole image set. In contrast, we assume only the co-occurrence of the parts of a common object in a sparse set of neighboring image patches. As a result, the proposed approach can robustly and effectively identify the whole foreground objects, as confirmed by extensive experiments.

II-C Shape Transfer

Shape transfer is a young yet widely adopted approach for data-driven foreground/background segmentation. Most existing works transfer the masks of pre-segmented objects to the test images, e.g. [42] and [43]. Beyond global object shapes, recent studies [44, 11] proposed to adopt local shape masks for image segmentation. Xia et al. [44] proposed to infer foreground object masks through sparse representations over global object masks and local patch-level masks of the training set. Yang et al. [11] adopted dense correspondences among images to find candidate local shape masks for each patch of a test image in an example image set, and then investigated an object segmentation scheme based on patch-level local shape transfer. Notably, the local patch strategy helps overcome local deformations, as demonstrated by [44, 11]. Our work is inspired by the success of local shape transfer, but we perform image co-segmentation in an unsupervised manner, without assuming pre-segmented images at hand.

III Co-segmentation Framework

III-A Overview

The procedure of our approach is shown in Fig. 2. Given a set of images, our framework first estimates coarse initial foreground segmentations by thresholding saliency maps [26], which provide the initial cues for learning Gaussian Mixture Models (GMMs) in the following optimization. Meanwhile, we generate a number of patches on each image by multi-scale sliding windows and then construct a weighted patch graph to implement the transfer of patch-level local shapes across images. Then, to refine the segmentations in all images jointly, we integrate the patch graph and the coarse initial segmentations into a unified framework. In the rest of this section, we first explain how local shape transfer helps image co-segmentation and how the patch graph is built, and then we describe our co-segmentation algorithm in detail.
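The initial segmentation step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the paper: the saliency model of [26] is replaced by an arbitrary array, and the threshold rule (a multiple of the mean saliency, controlled by the hypothetical `scale` parameter) is assumed, since the text does not state the threshold.

```python
import numpy as np

def initial_mask(saliency, scale=2.0):
    """Coarse foreground mask obtained by thresholding a saliency map.

    The threshold `scale * mean(saliency)` is an assumption made for this
    illustration; it is not a value taken from the paper.
    """
    s = np.asarray(saliency, dtype=float)
    threshold = min(scale * s.mean(), s.max())  # cap so the mask is never empty
    return s >= threshold

# Toy saliency map: a bright 2x2 blob on a dark 6x6 background.
saliency = np.zeros((6, 6))
saliency[2:4, 2:4] = 1.0
mask = initial_mask(saliency)
```

The resulting binary mask would then seed the foreground/background color models that are refined in later iterations.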

Fig. 3: Illustration of local shape transfer. For the patches sampled from an image (bottom-left), we illustrate the neighbouring patches (bottom-middle) on different images. Different colors represent different patches and their neighbours. The segmentation masks transferred using average pooling (AVE.) and Locally Linear Embedding (LLE) are shown on the bottom-right. The learned weights at the bottom-left of each neighbouring patch show that Locally Linear Embedding is effective in suppressing incorrect matches, which leads to more reliable shape transfer results.

III-B Local Shape Transfer

From the machine learning perspective, the holistic shapes of common objects in the real world lie in a high-dimensional space. In existing studies, object shape spaces are often learned with sophisticated non-linear models (e.g., random forests [45]). However, we observe that local object shapes can be well represented by their sparse neighbours in a linear and low-dimensional space. To this end, we first sample a number of patches from each image by multi-scale sliding windows. Then, to connect the patches among different images, we construct a neighbourhood system by finding patch-level correspondences across images [27]. Specifically, let $\mathcal{P}^m$ represent the collection of patches from image $I^m$ and $\mathcal{P}^n$ denote the patch set from image $I^n$. For each patch in $\mathcal{P}^m$, the algorithm in [27] computes a matching patch in $\mathcal{P}^n$, and vice versa. Throughout the whole image set, this matching is performed between each image $I^m$ and the other images. In this manner, for the $i$-th patch in a set of $M$ images, we readily obtain its neighbours from different images, and denote the index set of these neighbours by $\mathcal{N}_i$.

To represent the $i$-th patch with its neighbouring patches, we normalize all patches to a uniform size. One straightforward approach then assumes that the neighbouring patches are all good surrogates of the original image patch, and formulates the segmentation of the $i$-th patch as the average of its neighbours, that is, $\mathbf{y}_i = \frac{1}{|\mathcal{N}_i|}\sum_{j\in\mathcal{N}_i}\mathbf{y}_j$. In this formula, $\mathbf{y}_i$ denotes the concatenation of binary segmentation labels in the $i$-th patch and $|\mathcal{N}_i|$ represents the number of neighbours in $\mathcal{N}_i$. However, reconstructing a patch with this formulation is unreliable on real-world images. As shown in the right column of Fig. 3, aggregating neighbouring patches with inconsistent structures may confuse the shape transfer. To address this issue, we apply Locally Linear Embedding [15] to learn sparser but more reliable neighbouring relationships for each patch:

$$\min_{\{w_{ij}\}} \sum_{i=1}^{N} \Big\| \mathbf{f}_i - \sum_{j\in\mathcal{N}_i} w_{ij}\,\mathbf{f}_j \Big\|_2^2, \quad \text{s.t.}\ \sum_{j\in\mathcal{N}_i} w_{ij} = 1,\ \ w_{ij}\ge 0, \tag{1}$$

where $N$ is the total number of patches sampled from the image set, and $\mathbf{f}_i$ is the Histogram of Oriented Gradients (HOG) feature [46] extracted from the $i$-th patch. The simplex constraint imposes sparsity on neighbour selection. Given a set of weights $\{w_{ij}\}$ and neighbours $\mathcal{N}_i$, the local shapes are thus transferred by $\mathbf{y}_i = \sum_{j\in\mathcal{N}_i} w_{ij}\,\mathbf{y}_j$. Fig. 3 illustrates that this strategy leads to more consistent shape transfer results.
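To make the difference between average pooling and LLE-weighted transfer concrete, the sketch below fits simplex-constrained reconstruction weights for a single patch and transfers its neighbours' masks. It is an illustration under assumptions, not the solver used in the paper: the weights are fitted by projected gradient descent (a swapped-in generic method), toy 3-dimensional vectors stand in for HOG features, and the names `project_to_simplex`, `lle_weights`, `F_nbrs` and `Y_nbrs` are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def lle_weights(f_i, F_nbrs, n_iter=500):
    """Minimise ||f_i - F_nbrs.T @ w||^2 over the simplex by projected gradient.

    F_nbrs has shape (K, d): one feature row per neighbouring patch.
    """
    K = F_nbrs.shape[0]
    w = np.full(K, 1.0 / K)                # start from average pooling
    G = F_nbrs @ F_nbrs.T                  # (K, K) Gram matrix
    b = F_nbrs @ f_i                       # (K,)
    step = 1.0 / (2.0 * np.linalg.norm(G, 2) + 1e-12)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = 2.0 * (G @ w - b)
        w = project_to_simplex(w - step * grad)
    return w

# Toy features (stand-ins for HOG): the query resembles neighbours 0 and 1 only.
F_nbrs = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
f_i = np.array([0.9, 0.1, 0.0])

# Flattened binary masks of the three neighbours; neighbour 2 is a bad match.
Y_nbrs = np.array([[1.0, 1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])

w = lle_weights(f_i, F_nbrs)
y_avg = Y_nbrs.mean(axis=0)    # average pooling mixes in the bad match
y_lle = w @ Y_nbrs             # LLE weights give the bad match zero weight
```

In this toy case the mismatched neighbour receives zero weight, so its mask no longer leaks into the transferred shape, which mirrors the behaviour shown in Fig. 3.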

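III-C Image Co-segmentation by Local Shape Transfer

Given the patch graph, the initial segmentation in each image can be refined by transferring the multi-scale local shapes from other images in the set. During the transfer, the weights learned in the patch feature space are preserved when optimizing in the label space. The objective of our algorithm is written formally as follows:

$$\min_{\mathbf{y}} \sum_{m=1}^{M} E(\mathbf{y}^m) + \lambda \sum_{i=1}^{N} \Big\| \mathbf{y}_i - \sum_{j\in\mathcal{N}_i} w_{ij}\,\mathbf{y}_j \Big\|_2^2, \quad \text{s.t.}\ \mathbf{y}\ \text{binary}, \tag{2}$$

where $\mathbf{y}$ concatenates the foreground/background labels of all pixels in the image set, $\mathbf{y}^m$ is the binary segmentation of the $m$-th image and $\mathbf{y}_i$ denotes the foreground/background labels of the $i$-th patch. Here $w_{ij}$ are the learned weights over the neighbour set $\mathcal{N}_i$ of the $i$-th patch, and $\lambda$ weights the shape transfer term. The energy term $E(\cdot)$ implements intra-image foreground/background segmentation, for which we use the popular Markov Random Field (MRF) energy (see [47] for details). Problem (2) is NP-hard and usually large-scale, as it operates on pixels. To solve (2) approximately, we propose an efficient algorithm based on half quadratic splitting [48].

The core idea of our optimization algorithm is to introduce an auxiliary variable $\mathbf{z}$ with $\mathbf{y} = \mathbf{z}$ as a constraint. Then Equation (2) can be relaxed to:

$$\min_{\mathbf{y},\mathbf{z}} \sum_{m=1}^{M} E(\mathbf{y}^m) + \lambda \sum_{i=1}^{N} \Big\| \mathbf{z}_i - \sum_{j\in\mathcal{N}_i} w_{ij}\,\mathbf{z}_j \Big\|_2^2 + \beta \sum_{i=1}^{N} \big\| \mathbf{y}_i - \mathbf{z}_i \big\|_2^2. \tag{3}$$

To find the optimal $\mathbf{y}$ and $\mathbf{z}$, we adopt an iterative process in which $\mathbf{y}$ and $\mathbf{z}$ are optimized alternately while keeping the other fixed. In this manner, the original problem is decoupled into two simple sub-problems:

$$\min_{\mathbf{y}} \sum_{m=1}^{M} E(\mathbf{y}^m) + \beta \sum_{i=1}^{N} \big\| \mathbf{y}_i - \mathbf{z}_i \big\|_2^2, \tag{4}$$

$$\min_{\mathbf{z}}\ \lambda \sum_{i=1}^{N} \Big\| \mathbf{z}_i - \sum_{j\in\mathcal{N}_i} w_{ij}\,\mathbf{z}_j \Big\|_2^2 + \beta \sum_{i=1}^{N} \big\| \mathbf{y}_i - \mathbf{z}_i \big\|_2^2. \tag{5}$$

In formula (4), the patch label $\mathbf{y}_i$ is binary. When $\mathbf{z}$ is fixed, we rewrite the second term of (4) in a form that is linear w.r.t. $\mathbf{y}$: for a binary label $y$ we have $y^2 = y$, so $(y - z)^2 = (1 - 2z)\,y + z^2$. This linear term is then merged directly into the unary potentials of the MRF energy. Consequently, (4) can be solved efficiently by performing graph-cut [49] in each image. Formula (5) optimizes $\mathbf{z}$ while keeping $\mathbf{y}$ fixed. To solve such a large-scale quadratic program, we first discard the binary constraint on $\mathbf{z}$. The quadratic program (5) then admits a closed-form solution, obtained by solving a linear system. More specifically, we approximate this solution by a sequence of label diffusions. In a diffusion step, the labels $\mathbf{z}_i$ in the $i$-th patch are optimized while fixing all other labels. By setting the derivative w.r.t. $\mathbf{z}_i$ to zero, we obtain the following update rule:

$$\mathbf{z}_i \leftarrow \frac{\beta\,\mathbf{y}_i + \lambda \sum_{j\in\mathcal{N}_i} w_{ij}\,\mathbf{z}_j + \lambda \sum_{k:\, i\in\mathcal{N}_k} w_{ki} \Big( \mathbf{z}_k - \sum_{j\in\mathcal{N}_k,\, j\neq i} w_{kj}\,\mathbf{z}_j \Big)}{\beta + \lambda \Big( 1 + \sum_{k:\, i\in\mathcal{N}_k} w_{ki}^2 \Big)}. \tag{6}$$

For ease of presentation, this formula can be written in a compact form: $\mathbf{Z}^{t+1} = (\mathbf{Z}^{t}\mathbf{A} + \beta\,\mathbf{Y})\,\mathbf{B}$. Here $\mathbf{A}$ and $\mathbf{B}$ are defined as:

$$\mathbf{A} = -\lambda \big( \mathbf{L} - \mathrm{diag}(\mathbf{L}) \big), \qquad \mathbf{B} = \big( \lambda\,\mathrm{diag}(\mathbf{L}) + \beta\,\mathbf{I} \big)^{-1}, \qquad \mathbf{L} = (\mathbf{I} - \mathbf{W})^{\top} (\mathbf{I} - \mathbf{W}), \tag{7}$$

where the matrices $\mathbf{Z}^{t+1}$, $\mathbf{Z}^{t}$ and $\mathbf{Y}$ concatenate in a row the column vectors $\mathbf{z}_i^{t+1}$, $\mathbf{z}_i^{t}$ and $\mathbf{y}_i$, respectively, $\mathbf{W}$ is the pairwise matrix of patch-wise neighbouring weights (with $W_{ij} = w_{ij}$ for $j \in \mathcal{N}_i$ and zero otherwise, so that $W_{ii} = 0$), and $\mathbf{I}$ is the identity matrix. The operator $\mathrm{diag}(\cdot)$ creates a diagonal matrix by picking out the diagonal elements of the input matrix. In practice, after a diffusion step, we normalize the soft labels into $[0, 1]$ for each image. Finally, we terminate the diffusion when the difference between the previous and current $\mathbf{Z}$ falls below a manually set threshold.

The two steps are repeated until near-convergence. Empirically, we terminate the optimization after 10 iterations and take the last discrete labels $\mathbf{y}$ as the final segmentations.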
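The label diffusion used for the second sub-problem (the relaxed quadratic program over the auxiliary soft labels) can be sketched as follows. This is an illustration under assumptions: the objective is taken to be `lam * ||Z - Z W^T||_F^2 + beta * ||Y - Z||_F^2`, where column `i` of `Y` and `Z` holds the labels of patch `i`; the names `lam`, `beta`, `diffuse_labels` and `closed_form` are hypothetical; and the per-image normalization of soft labels mentioned above is omitted. Plain Jacobi-style sweeps are only guaranteed to converge when the system matrix is, for instance, strictly diagonally dominant, which holds for the toy weights chosen here.

```python
import numpy as np

def diffuse_labels(Y, W, lam, beta, n_sweeps=500, tol=1e-12):
    """Jacobi-style diffusion: Z <- (Z A + beta Y) B until Z stops changing."""
    N = W.shape[0]
    I = np.eye(N)
    L = (I - W).T @ (I - W)                  # quadratic form of the transfer term
    D = np.diag(np.diag(L))                  # diagonal part of L
    A = -lam * (L - D)                       # off-diagonal coupling between patches
    B = np.linalg.inv(lam * D + beta * I)    # diagonal matrix, cheap to invert
    Z = Y.astype(float).copy()
    for _ in range(n_sweeps):
        Z_new = (Z @ A + beta * Y) @ B
        converged = np.max(np.abs(Z_new - Z)) < tol
        Z = Z_new
        if converged:
            break
    return Z

def closed_form(Y, W, lam, beta):
    """Exact minimiser, from Z (lam L + beta I) = beta Y."""
    N = W.shape[0]
    I = np.eye(N)
    L = (I - W).T @ (I - W)
    return beta * Y @ np.linalg.inv(lam * L + beta * I)

# Toy patch graph with 4 patches; each row of W sums to one and W[i, i] = 0.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])
# Current binary labels: 3 pixels per patch, one column per patch.
Y = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0]])

Z_diffused = diffuse_labels(Y, W, lam=1.0, beta=1.0)
Z_exact = closed_form(Y, W, lam=1.0, beta=1.0)
```

On this small graph the diffusion reaches the same soft labels as the direct linear solve; on a full image set the diffusion avoids forming and inverting the large system matrix.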

III-D Comparison with Our Previous Method [1]

The main difference is that this work incorporates a multi-scale strategy into image co-segmentation. In particular, we first use multi-scale windows to sample images instead of a fixed patch size as in our previous work [1]. Then we find the neighbouring patches directly by patch-level matching rather than through pixel-level dense correspondences, which keeps our approach computationally efficient.

| iCoseg | Ours | DLLE [1] | Lee [34] | Fu [50] | Wang [10] | Liu [40] | Rubio [31] | Vicente [51] | Mukherjee [52] | Joulin [2] |
|---|---|---|---|---|---|---|---|---|---|---|
| Alaska Bear | 0.926 | 0.861 | 0.873 | 0.935 | 0.904 | 0.872 | 0.864 | 0.900 | - | 0.748 |
| Red Sox Players | 0.964 | 0.972 | 0.971 | 0.965 | 0.942 | 0.927 | 0.905 | 0.909 | 0.957 | 0.730 |
| Stonehenge1 | 0.938 | 0.936 | 0.959 | 0.930 | 0.925 | 0.820 | 0.873 | 0.633 | 0.927 | 0.566 |
| Stonehenge2 | 0.946 | 0.844 | 0.907 | 0.835 | 0.872 | 0.800 | 0.884 | 0.888 | 0.849 | 0.860 |
| Liverpool | 0.902 | 0.905 | 0.885 | 0.921 | 0.894 | 0.911 | 0.826 | 0.875 | - | 0.764 |
| Ferrari | 0.874 | 0.892 | 0.919 | 0.917 | 0.956 | 0.900 | 0.843 | 0.899 | 0.900 | 0.850 |
| Taj Mahal | 0.810 | 0.878 | 0.952 | 0.887 | 0.926 | 0.832 | 0.887 | 0.911 | 0.941 | 0.737 |
| Elephants | 0.974 | 0.961 | 0.931 | 0.904 | 0.867 | 0.900 | 0.750 | 0.431 | 0.877 | 0.701 |
| Pandas | 0.927 | 0.835 | 0.848 | 0.812 | 0.886 | 0.800 | 0.600 | 0.927 | 0.928 | 0.840 |
| Kite | 0.984 | 0.980 | 0.957 | 0.966 | 0.939 | 0.978 | 0.898 | 0.903 | 0.946 | 0.870 |
| Kite panda | 0.944 | 0.905 | 0.960 | 0.838 | 0.931 | 0.812 | 0.783 | 0.902 | 0.934 | 0.732 |
| Gymnastics | 0.992 | 0.984 | 0.961 | 0.954 | 0.904 | 0.969 | 0.871 | 0.917 | 0.922 | 0.909 |
| Skating | 0.924 | 0.893 | 0.916 | 0.817 | 0.787 | 0.822 | 0.768 | 0.775 | 0.966 | 0.821 |
| Hot Balloons | 0.991 | 0.969 | 0.977 | 0.965 | 0.904 | 0.938 | 0.890 | 0.901 | 0.952 | 0.852 |
| Liberty Statue | 0.993 | 0.966 | 0.945 | 0.927 | 0.968 | 0.957 | 0.916 | 0.938 | 0.966 | 0.906 |
| Brown bear | 0.962 | 0.938 | 0.937 | 0.948 | 0.881 | 0.823 | 0.804 | 0.953 | 0.885 | 0.740 |
| Average | 0.941 | 0.920 | 0.931 | 0.907 | 0.905 | 0.879 | 0.839 | 0.853 | - | 0.789 |

TABLE I: Comparison with leading co-segmentation approaches on the iCoseg dataset in terms of the accuracy of correctly classified pixels (ACC).

IV Experiments

IV-A Experimental Settings

We first sample patches from each image using sliding windows at four scales. At each scale, patches do not overlap except at the image boundary, which ensures that the patches cover the whole image. The unary term of the MRF energy is the negative log-likelihood under the foreground/background color models, which are Gaussian Mixture Models (GMMs) with 12 components. Initially, the GMMs are learned on saliency-based segmentations. In each iteration, we update them using the latest segmentations. We follow [53] to define the pairwise term, which models color contrast between adjacent pixels. The two weighting parameters of the objective are set empirically. We evaluate the proposed approach on four public benchmarks: the iCoseg dataset [16], the MSRC dataset [17], the Coseg-Rep dataset [14] and the Fashionista dataset [18]. For ease of presentation, we use MCO to denote the proposed approach with multi-scale local shape transfer and SCO to denote its single-scale variant with a fixed patch size. We also refer to the previous algorithm [1], which utilizes pixel-level dense correspondences among different images, as DLLE.
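The sampling rule described above (no overlap within a scale, except for a final window shifted back to the image boundary) can be sketched as follows. The image size and the two window sizes are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
import numpy as np

def window_starts(length, size):
    """Start offsets of windows of width `size` tiling [0, length).

    Windows do not overlap, except that a final window is shifted back to end
    exactly at the boundary when `size` does not divide `length`.
    """
    if size >= length:
        return [0]
    starts = list(range(0, length - size + 1, size))
    if starts[-1] + size < length:
        starts.append(length - size)
    return starts

def multiscale_windows(height, width, sizes):
    """Return (scale, top, left, bottom, right) boxes for every scale in `sizes`."""
    boxes = []
    for s in sizes:
        for top in window_starts(height, s):
            for left in window_starts(width, s):
                boxes.append((s, top, left, min(top + s, height), min(left + s, width)))
    return boxes

def coverage(height, width, boxes, scale):
    """Count how many boxes of the given scale cover each pixel."""
    count = np.zeros((height, width), dtype=int)
    for s, top, left, bottom, right in boxes:
        if s == scale:
            count[top:bottom, left:right] += 1
    return count

# Hypothetical 100x150 image and two window sizes (not the paper's four scales).
boxes = multiscale_windows(100, 150, sizes=(32, 64))
```

Checking that `coverage(...)` is at least one everywhere confirms that each scale covers the whole image, with overlap confined to the strips next to the right and bottom boundaries.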

The iCoseg dataset [16] contains 643 images of 38 object classes with pixel-level annotations. In each class, common objects have similar colors but various locations and scales. We test a subset of 16 classes which are widely used by the leading co-segmentation approaches, and we also compare with state-of-the-art methods on the whole dataset of 38 classes. For each class, all images are used in our co-segmentation framework. The MSRC dataset [17] contains 418 images from 14 categories. The objects in each class vary in color, pose and scale, which makes co-segmentation more difficult. The Coseg-Rep dataset [14] has 23 categories with 572 images, where the categories are different species of flowers and animals. Interestingly, there is a special category named “Repetitives”, in which each image contains several objects with similar shape patterns. In our experiments, images from the same class are segmented at once. The Fashionista dataset [18] contains 685 street photographs of fashion models. In contrast to conventional co-segmentation datasets, the Fashionista dataset is extremely challenging, with various human poses, background clutter and complex appearances. As existing co-segmentation approaches may have difficulty handling large numbers of images, we randomly partition the dataset into 23 groups with nearly 30 images per group and resize the images to a fixed resolution. Evaluations are averaged over 10 random partitions.
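The grouping protocol for Fashionista can be sketched as below. This is a minimal illustration assuming a uniform random permutation split into nearly equal groups; the function name `random_partition` and the use of seeds 0 to 9 for the ten partitions are assumptions, not details given in the text.

```python
import numpy as np

def random_partition(n_images, n_groups, seed):
    """Shuffle image indices and split them into nearly equal groups."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_images)
    return np.array_split(perm, n_groups)

# 685 images into 23 groups, repeated for 10 random partitions.
partitions = [random_partition(685, 23, seed=s) for s in range(10)]
group_sizes = sorted({len(g) for g in partitions[0]})
```

With 685 images and 23 groups, every group holds 29 or 30 images, which matches the "nearly 30 images per group" described above; a score would be computed per group and then averaged over groups and over the ten partitions.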

We use two evaluation metrics: the accuracy of correctly classified pixels (denoted by ACC) and the intersection-over-union score (denoted by IOU). Both metrics are chosen for a thorough comparison with previous approaches, and the latter is preferred as it has been shown to be unbiased with respect to object size [54]. Note that higher values of both metrics mean better co-segmentation results.
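The two metrics can be written down in a few lines. The sketch below uses the standard definitions for binary masks; the convention of returning 1.0 when both masks are empty is an assumption, and the function names are illustrative.

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """ACC: fraction of pixels whose predicted label matches the ground truth."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    return float(np.mean(pred == gt))

def intersection_over_union(pred, gt):
    """IOU of the foreground regions of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return float(np.logical_and(pred, gt).sum() / union)

# A small 2x2 object in a 10x10 image, and a prediction that misses it entirely.
gt = np.zeros((10, 10), dtype=bool)
gt[4:6, 4:6] = True
empty = np.zeros_like(gt)

acc_empty = pixel_accuracy(empty, gt)            # 0.96: background dominates
iou_empty = intersection_over_union(empty, gt)   # 0.0: the object is missed
```

The toy case shows why IOU is the preferred metric: an empty prediction still scores 0.96 in ACC because the object is small, while its IOU is zero.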

IV-B Comparison with the State of the Art

| Method | sub-iCoseg (16) ACC | sub-iCoseg (16) IOU | iCoseg (38) ACC | iCoseg (38) IOU |
|---|---|---|---|---|
| Ours | 0.941 | 0.79 | 0.925 | 0.73 |
| Quan [6] | 0.948 | 0.82 | 0.933 | 0.76 |
| Faktor [33] | 0.944 | 0.79 | 0.928 | 0.73 |
| Jerripothula [5] | - | - | 0.919 | 0.72 |
| Kuettel [13] | - | - | 0.914 | - |
| Dai [14] | - | - | 0.895 | - |
| Meng [55] | - | - | - | 0.71 |

TABLE II: Comparison results of the proposed method and the state-of-the-art co-segmentation methods on the iCoseg dataset in terms of average ACC and IOU.

Comparisons on iCoseg Dataset. As common objects in each class of iCoseg are known to have similar color properties, we follow the suggestion of ‘Joint-Grab-Cut’ [13]. Namely, we update the color models jointly over all images rather than running GrabCut on each image separately. We summarize the co-segmentation accuracies in Table I and Table II. In Table I, our approach outperforms other approaches based on local region matching [31, 10]. We believe that it is the patch-level structured consistency that makes the difference. We also obtain better results than [51, 50], although they used external training data. Note that [52] performs quite well on the 14 classes it reports, achieving 92.49% average accuracy, while our approach obtains 94.5% on the same classes. Moreover, they rely on training images to learn dictionaries, while our approach is unsupervised. More specifically, our approach achieves the best overall performance, with leading accuracies on 7 of the 16 categories. We obtain remarkable results on challenging categories, such as Elephants, Stonehenge2 and Gymnastics. Most previous approaches work less well on these categories because of their large pose variance, whereas our multi-scale local shape transfer strategy handles them better. Interestingly, our multi-scale method outperforms our previous work [1] in 12 out of 16 categories. In Fig. 4, we illustrate some visual results of our co-segmentation approach. In Table II, we compare our method with state-of-the-art co-segmentation algorithms on a subset of the iCoseg dataset (listed in Table I) as well as on the whole iCoseg dataset. In terms of ACC in Table II, our approach performs better than [13, 5] and comparably with [33, 6]. Although [6] reports the best performance so far on the iCoseg dataset as a result of using high-level semantic features, [6], [33] and our approach can all effectively locate the common objects on this dataset. The differences among these approaches are mainly due to finer localization of object boundaries.

Fig. 4: Representative segmentations of our approach on the iCoseg dataset [16] and the MSRC dataset [17]. Foreground objects are outlined in green.

| Method | ACC | IOU |
|---|---|---|
| Ours | 0.884 | 0.70 |
| Wang [41] | 0.909 | 0.73 |
| Faktor [33] | 0.892 | 0.73 |
| Jerripothula [5] | 0.887 | 0.71 |
| Jerripothula [56] | 0.884 | 0.70 |
| Rubinstein [57] | 0.877 | 0.68 |

(a) MSRC

| Method | ACC | IOU |
|---|---|---|
| Ours | 0.949 | 0.794 |
| DLLE [1] | 0.937 | 0.755 |
| Faktor [33] | 0.869 | 0.501 |
| Dai [14] | 0.862 | 0.576 |
| Joulin [2] | 0.724 | 0.358 |
| Rother [49] | - | 0.642 |

(b) Fashionista

TABLE III: Comparison results of the proposed method and the state-of-the-art co-segmentation methods on the MSRC dataset and the Fashionista dataset in terms of average ACC and IOU.

Comparisons on MSRC Dataset. Unlike in the iCoseg dataset, common objects in the MSRC dataset vary more in appearance. Unfortunately, for several images the simple thresholding strategy described above fails to produce initial coarse masks. Therefore, for each category in the MSRC dataset, we apply saliency-cut [58] to threshold the saliency maps. With these initial coarse masks, we run our co-segmentation approach on each category and summarize the results in Table III(a) and Fig. 4. Our average performance is competitive with [5] as well as [56]. We also find that our method performs worse than [33, 41] on the MSRC dataset. The main reason can be attributed to the dependence on saliency maps: if the saliency detection method fails to mark the common object as salient, or marks background pixels as salient, our method cannot segment the object correctly. Inspired by recent deep learning techniques, we will try to integrate convolutional neural networks into co-segmentation in future research.

| | Ours | Dai [14] | Jerripothula [56] | Jerripothula [5] |
|---|---|---|---|---|
| ACC | 0.932 | 0.902 | 0.922 | 0.934 |
| IOU | 0.78 | 0.67 | 0.73 | 0.77 |

TABLE IV: Comparison results of the proposed method and the state-of-the-art co-segmentation methods on the Coseg-Rep dataset in terms of average ACC and IOU.

Comparisons on Coseg-Rep Dataset. This dataset has 23 object categories and 572 images in total. Among them, a special category called “repetitive” includes a variety of animals and flowers. To apply our approach to this special category, we first divide it into two subcategories (one containing only animals, the other only flowers). Then, since simple thresholding of the saliency maps fails to produce coarse masks for several images, we apply saliency-cut [58] to threshold the saliency maps. After that, we run our co-segmentation method on each category or subcategory, and summarize the results in Table IV. In terms of the intersection-over-union (IOU) score, Table IV shows that our approach outperforms the state-of-the-art methods on the Coseg-Rep dataset. Moreover, even though [5] tuned its parameters over categories, our method achieves a 1% improvement in IOU over the best results reported in [5].

Comparisons on Fashionista Dataset. To further demonstrate the effectiveness of our approach and clarify our contributions, we apply the state-of-the-art methods [33, 14, 2] to the Fashionista dataset using their released code. We also compare with a GrabCut [49] baseline using a bounding box with an 8-pixel margin from the image borders. We summarize the evaluations in Table III(b), where the values of [33, 14, 2] and our approach are averaged over all groups, while the value of the GrabCut baseline is taken directly from [11]. According to Table III(b), the leading co-segmentation approaches have difficulty generalizing to this dataset, whereas our approach obtains desirable performance. Notably, we obtain 58%, 38% and 122% relative improvements over [33], [14] and [2], respectively, in terms of IOU score on this dataset. Due to the complexity and large variance of object appearance and pose, the template-based approaches [14, 2] may have difficulty learning a proper template to represent the object category, while [33] often detects incomplete object shapes and misses important object details. In contrast, the proposed multi-scale local shape transfer can successfully deal with the appearance and pose variances on the Fashionista dataset. In Fig. 5, we give some visual comparisons between our algorithm and these well-known co-segmentation methods [14, 2, 33].
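The relative improvements quoted above follow directly from the IOU column of Table III(b). The short check below reproduces the arithmetic; the variable names are illustrative only.

```python
# IOU scores on Fashionista, copied from Table III(b).
ours = 0.794
baselines = {"Faktor [33]": 0.501, "Dai [14]": 0.576, "Joulin [2]": 0.358}

# Relative improvement in percent: 100 * (ours - baseline) / baseline.
relative_gain = {
    name: round(100.0 * (ours - iou) / iou) for name, iou in baselines.items()
}
# relative_gain == {"Faktor [33]": 58, "Dai [14]": 38, "Joulin [2]": 122}
```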

Fig. 5: Visual comparisons between our method and state-of-the-art methods. The segmentation results of [2], [14] and [33] are obtained with their released code. Foreground objects are outlined in green.

Comparisons with Previous Work. Compared with our previous work DLLE [1], the strategy of multi-scale local shape transfer is 2.1% higher in ACC on the sub-iCoseg dataset and 3.9% higher in IOU on the Fashionista dataset. Significant improvements are observed on several categories (e.g., Stonehenge2, Kite Panda, Pandas), since the proposed multi-scale local shape transfer can better handle large variations in object scale. To clearly understand the contributions of the newly proposed components, we turn the proposed approach (MCO) into a single-scale variant (SCO) with a fixed patch size and conduct additional experiments with MCO, SCO and DLLE [1]. Specifically, we randomly select 30 images from the Fashionista dataset [18] and resize them to a fixed resolution. Without loss of generality, we average the IOU scores over ten such groups for MCO, SCO and DLLE [1], respectively. Table V summarizes the average IOU results.

       MCO        SCO        DLLE [1]
images 30 30 30
IOU 0.800 0.777 0.753

Note: MCO is our proposed approach with multi-scale local shape transfer, SCO is the single-scale variant of the proposed approach with patch size of , and DLLE refers to our previous algorithm [1].

TABLE V: Intersection-over-Union (IOU) scores on the selected subset of the Fashionista dataset.
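For reference, the IOU score reported in these tables is the ratio between the overlap and the union of the predicted and ground-truth foreground regions. A minimal sketch, assuming binary masks stored as nested lists of 0/1 (the function name, the toy masks and the empty-mask convention are illustrative assumptions, not taken from the paper):

```python
def iou(pred, gt):
    """Intersection-over-Union of two equally sized binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks are treated as a perfect match; this convention is an assumption.
    return 1.0 if union == 0 else inter / union


# Toy 3x3 masks: 3 pixels overlap, 5 pixels are foreground in at least one mask.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
gt = [[0, 1, 1],
      [0, 1, 0],
      [0, 1, 0]]
print(iou(pred, gt))
```

A per-dataset figure such as those in Table V would then be an average of this quantity over images; whether the averaging is done per image or per group is not specified here.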

Table V shows that SCO improves the IOU over the previous algorithm DLLE [1], which confirms the effectiveness of adopting proposal flow and HOG features to build the patch neighbourhood system. Moreover, the proposed approach (MCO) outperforms the single-scale variant SCO, which confirms the effectiveness of adding multiple scales to local shape transfer. With both multi-scale local shape transfer and the new patch neighbourhood system, MCO raises the average IOU from 0.753 to 0.800 over our previous approach [1]. Fig. 5 provides some visual comparisons between MCO and our previous work DLLE [1]. MCO segments common objects more precisely than DLLE, for example around legs and arms.

dataset       images   ’w/o lst’   ’ours’
Fashionista   685      0.688       0.794
sub-iCoseg    122      0.548       0.790

Note: ’w/o lst’ and ’ours’ denote the co-segmentation results without local shape transfer and with our complete framework, respectively.

TABLE VI: Intersection-over-Union (IOU) scores with and without multi-scale local shape transfer on the Fashionista dataset and the sub-iCoseg dataset (including the 16 classes listed in Table I).

Running Time. Our approach takes around 5 minutes to process 30 images with resolution . Saliency estimation takes a few seconds. Building correspondences, learning graph weights and optimization take around 2.7, 0.02 and 2 minutes, respectively. Empirical comparisons show that the current implementation runs much faster than several state-of-the-art methods [33, 14]: using their released code, Faktor et al. [33] takes about 48 minutes and Dai et al. [14] takes about 58 minutes.
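The timing figures above can be cross-checked with a little arithmetic. The sketch below sums the itemized stages and derives rough slowdown factors for the two baselines relative to the reported total of about 5 minutes; the variable names are illustrative, and the factors are only as precise as the rounded timings quoted in the text.

```python
# Stage timings in minutes, as quoted in the text for 30 images.
stage_minutes = {
    "building correspondences": 2.7,
    "learning graph weights": 0.02,
    "optimization": 2.0,
}
# Saliency estimation ("a few seconds") is not itemized, so it is left out here.
itemised_total = sum(stage_minutes.values())

reported_total = 5.0  # "around 5 minutes"
baseline_minutes = {"[33]": 48.0, "[14]": 58.0}
slowdown = {ref: t / reported_total for ref, t in baseline_minutes.items()}

print(f"itemized stages: {itemised_total:.2f} min")
for ref, factor in slowdown.items():
    print(f"{ref}: about {factor:.1f}x the running time of our approach")
```

The itemized stages account for roughly 4.7 of the 5 minutes, with correspondence building and optimization dominating, and the baselines come out at roughly 10x and 12x the reported total.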

IV-C Performance Analysis

In this section, we aim to study how the proposed approach works and further demonstrate its effectiveness. To this end, we conduct two additional experiments.

The first experiment is conducted on the Fashionista dataset [18] and the sub-iCoseg dataset [16] (including the 16 classes listed in Table I). It measures the performance gain that multi-scale local shape transfer brings to our algorithm, and the results are summarized in Table VI. Note that when no local shape transfer (’w/o lst’) is employed, the variant reduces to GrabCut driven directly by saliency cues. Our complete framework with multi-scale local shape transfer improves the IOU from 0.688 to 0.794 on Fashionista and from 0.548 to 0.790 on sub-iCoseg, confirming the effectiveness of multi-scale local shape transfer.
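To express the Table VI gains in relative terms, the sketch below applies the usual definition of relative improvement, (ours - baseline) / baseline, to the two rows of the table. The paper does not spell out the formula behind its relative-improvement figures, so this definition and the helper name are assumptions.

```python
def relative_improvement(baseline, ours):
    """Relative gain of `ours` over `baseline`, as a fraction of the baseline."""
    return (ours - baseline) / baseline


# IOU scores from Table VI: (without local shape transfer, complete framework).
table_vi = {
    "Fashionista": (0.688, 0.794),
    "sub-iCoseg": (0.548, 0.790),
}

for dataset, (without_lst, ours) in table_vi.items():
    gain = relative_improvement(without_lst, ours)
    print(f"{dataset}: {100 * gain:.1f}% relative improvement")
```

Under this definition the gain is about 15.4% on Fashionista and about 44.2% on sub-iCoseg, so the transfer step matters most where the saliency-only variant is weakest.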

multi-scale patch size   IOU
1                        0.786
2                        0.797
3                        0.794

TABLE VII: Intersection-over-Union (IOU) scores on the Fashionista dataset.

In the second experiment, we investigate the segmentation accuracy as a function of the multi-scale patch sizes. To this end, we apply three different multi-scale strategies on the Fashionista dataset [18] and summarize the performance in Table VII. We repeat the experiment 10 times, each time randomly partitioning the Fashionista dataset [18] into 23 groups with nearly 30 images per group. Table VII shows the average IOU scores over the 10 runs. Although the second strategy () obtains the best IOU score, the third strategy () achieves a comparable one. The IOU scores of the three strategies are close (0.786 to 0.797), which suggests that the multi-scale strategy is, to a certain degree, insensitive to the choice of patch sizes. We adopt the third strategy () in our experiments. The extensive comparisons in Tables I-III show that our multi-scale strategy performs well on various benchmark datasets.
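The repeated random-partition protocol described above can be sketched as follows. The sketch assumes that the 685 listed for Fashionista in Table VI is the image count, and it uses a hypothetical `score_fn` that stands in for running co-segmentation on one group and returning its average IOU. Averaging group scores within a run and then averaging the runs is also an assumption, since the text does not state how the average is formed.

```python
import random


def partition_into_groups(n_images, n_groups, rng):
    """Shuffle image indices and split them into n_groups nearly equal groups."""
    indices = list(range(n_images))
    rng.shuffle(indices)
    base, extra = divmod(n_images, n_groups)
    groups = []
    start = 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)  # the first `extra` groups get one more image
        groups.append(indices[start:start + size])
        start += size
    return groups


def repeated_average(score_fn, n_images=685, n_groups=23, repeats=10, seed=0):
    """Average score_fn over all groups, then over `repeats` random partitions."""
    rng = random.Random(seed)
    run_means = []
    for _ in range(repeats):
        groups = partition_into_groups(n_images, n_groups, rng)
        group_scores = [score_fn(group) for group in groups]
        run_means.append(sum(group_scores) / len(group_scores))
    return sum(run_means) / len(run_means)


groups = partition_into_groups(685, 23, random.Random(0))
print(sorted({len(g) for g in groups}))
```

With 685 images and 23 groups, every group holds 29 or 30 images, which matches the "nearly 30 images per group" in the text.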

Failure cases. In Fig. 6, we show several typical failure cases of our approach. In the first row, our method localizes the bicycles but fails to segment thin rods and tires. In the second row, the proposed approach fails to segment the target objects when the images contain multiple common salient objects (e.g., tree and car). In the third row, our approach cannot separate shoulders from the background because shoulders are not salient in the face category. These are common and challenging problems for image co-segmentation (e.g., [5]).

Fig. 6: Failure cases. (a) The input images. (b) Co-segmentation results by our approach. (c) The ground truth masks.

V Conclusion

This study proposes an unsupervised multi-scale local shape transfer approach for image co-segmentation. It starts by generating saliency maps and multi-scale patches on each image. We then construct a reliable patch neighbourhood system and enforce label consistency among neighbouring patches in different images. Finally, the common objects are segmented by a graph-cut based algorithm that generates a binary mask for each image. Extensive experiments demonstrate that the proposed multi-scale local shape transfer significantly boosts co-segmentation performance. Compared with state-of-the-art methods, our approach performs comparably on the iCoseg and MSRC datasets, and substantially better on the Coseg-Rep dataset and the challenging Fashionista dataset.

Our results also reveal that local shape transfer among images is valuable for distinguishing the common foreground objects from complex backgrounds. We believe that precise local shape correspondences are a reliable way to handle image co-segmentation. In the future, we will further explore the use of local shape transfer in image co-segmentation. In particular, we will try other weight-learning methods (such as popular deep learning architectures) to build a more reliable patch neighbourhood system. Moreover, we will attempt to distinguish and extract the common semantic objects in a multi-class image set by combining object shape cues and semantic label cues. We believe that multi-class object co-segmentation will become an interesting and meaningful research direction for image co-segmentation.


  • [1] W. Teng, Y. Zhang, X. Chen, J. Li, and Z. He, “Local shape transfer for image co-segmentation,” in Proc. British Machine Vision Conference, York, UK, Sep. 19–22, 2016.
  • [2] A. Joulin, F. Bach, and J. Ponce, “Discriminative clustering for image co-segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, California, USA, Jun. 13–18, 2010, pp. 1943–1950.
  • [3] F. M. Meng, H. L. Li, G. H. Liu, and K. N. Ngan, “Object co-segmentation based on shortest path algorithm and saliency model,” IEEE Transactions on Multimedia, vol. 14, pp. 1429–1441, 2012.
  • [4] W. G. Wang and J. B. Shen, “Higher-order image co-segmentation,” IEEE Transactions on Multimedia, vol. 18, pp. 1011–1021, 2016.
  • [5] K. R. Jerripothula, J. Cai, and J. Yuan, “Image co-segmentation via saliency co-fusion,” IEEE Transactions on Multimedia, vol. 18, pp. 1896–1909, 2016.
  • [6] R. Quan, J. Han, D. Zhang, and F. Nie, “Object co-segmentation via graph optimized-flexible manifold ranking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Nevada, 2016.
  • [7] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without part annotations,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, Jun. 7–12, 2015, pp. 5546–5555.
  • [8] X. Chen, A. Shrivastava, and A. Gupta, “Enriching visual knowledge bases via object discovery and segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, Jun. 24–27, 2014, pp. 2035–2042.
  • [9] X. D. Liang, L. Lin, W. Yang, P. Luo, J. S. Huang, and S. C. Yan, “Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval,” IEEE Transactions on Multimedia, vol. 18, pp. 1175–1186, 2016.
  • [10] F. Wang, Q. Huang, and L. J. Guibas, “Image co-segmentation via consistent functional maps,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Portland, Oregon, Jun. 25–27, 2013, pp. 849–856.
  • [11] J. Yang, B. Price, S. Cohen, Z. Lin, and M. H. Yang, “PatchCut: Data-driven object segmentation via local shape transfer,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, Jun. 7–12, 2015, pp. 1770–1778.
  • [12] B. Alexe, T. Deselaers, and V. Ferrari, “ClassCut for unsupervised class segmentation,” in Proc. European Conference on Computer Vision, Crete, Greece, Sep. 5–11, 2010, pp. 380–393.
  • [13] D. Kuettel, M. Guillaumin, and V. Ferrari, “Segmentation propagation in ImageNet,” in Proc. European Conference on Computer Vision, Firenze, Italy, Oct. 7–13, 2012, pp. 459–473.
  • [14] J. Dai, Y. N. Wu, J. Zhou, and S. C. Zhu, “Cosegmentation and cosketch by unsupervised learning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Portland, Oregon, Jun. 25–27, 2013, pp. 1305–1312.
  • [15] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, 2000.
  • [16] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, “iCoseg: Interactive cosegmentation with intelligent scribble guidance,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, California, USA, Jun. 13–18, 2010, pp. 3169–3176.
  • [17] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in Proc. European Conference on Computer Vision, Graz, Austria, May 7–13, 2006.
  • [18] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg, “Parsing clothing in fashion photographs,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Providence, Rhode Island, Jun. 16–21, 2012, pp. 3570–3577.
  • [19] Y. H. Tian, T. J. Huang, M. L. Jiang, and W. Gao, “Video copy-detection and localization with a scalable cascading framework,” IEEE Multimedia, vol. 20, pp. 72–86, 2013.
  • [20] Z. Lee and T. Q. Nguyen, “Multi-resolution disparity processing and fusion for large high-resolution stereo image,” IEEE Transactions on Multimedia, vol. 17, pp. 792–803, 2015.
  • [21] S. Liu, X. J. Qi, J. P. Shi, H. Zhang, and J. Y. Jia, “Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Nevada, Jun. 27–30, 2016, pp. 3141–3149.
  • [22] L. Zheng, S. J. Wang, J. D. Wang, and Q. Tian, “Accurate image search with multi-scale contextual evidences,” International Journal of Computer Vision, vol. 120, pp. 1–13, 2016.
  • [23] R. C. Hong, Z. Z. Hu, R. X. Wang, M. Wang, and D. C. Tao, “Multi-view object retrieval via multi-scale topic models,” IEEE Transactions on Image Processing, vol. 25, pp. 5814–5827, 2016.
  • [24] G. Kim, E. P. Xing, L. Fei-Fei, and T. Kanade, “Distributed cosegmentation via submodular optimization on anisotropic diffusion,” in Proc. IEEE International Conference on Computer Vision, Barcelona, Spain, Nov. 6–13, 2011, pp. 169–176.
  • [25] W. Tao, K. Li, and K. Sun, “SaCoseg: object cosegmentation by shape conformability,” IEEE Transactions on Image Processing, vol. 24, pp. 943–955, 2015.
  • [26] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, and B. Price, “Minimum barrier salient object detection at 80 fps,” in Proc. IEEE International Conference on Computer Vision, Chile, Dec. 7–13, 2015, pp. 1404–1412.
  • [27] B. Han, M. Cho, C. Schmid, and J. Ponce, “Proposal flow,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Nevada, Jun. 27–30, 2016, pp. 3475–3484.
  • [28] C. Rother, T. Minka, A. Blake, and V. Kolmogorov, “Cosegmentation of image pairs by histogram matching - incorporating a global constraint into mrfs,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, New York, USA, Jun. 17–22, 2006, pp. 993–1000.
  • [29] H. L. Li, F. M. Meng, and K. N. Ngan, “Co-salient object detection from multiple images,” IEEE Transactions on Multimedia, vol. 15, pp. 1896–1909, 2013.
  • [30] Z. Wang and R. Liu, “Semi-supervised learning for large scale image cosegmentation,” in Proc. IEEE International Conference on Computer Vision, Sydney, NSW, Australia, Dec. 3–6, 2013, pp. 393–400.
  • [31] J. C. Rubio, J. Serrat, A. López, and N. Paragios, “Unsupervised co-segmentation through region matching,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Providence, Rhode Island, Jun. 16–21, 2012, pp. 749–756.
  • [32] H. Yu and X. Qi, “Unsupervised cosegmentation based on superpixel matching and fastgrabcut,” in Proc. IEEE International Conference on Multimedia and Expo, Chengdu, China, Jul. 14–18, 2014.
  • [33] A. Faktor and M. Irani, “Co-segmentation by composition,” in Proc. IEEE International Conference on Computer Vision, Sydney, NSW, Australia, Dec. 3–6, 2013, pp. 1297–1304.
  • [34] C. Lee, W. D. Jang, J. Y. Sim, and C. S. Kim, “Multiple random walkers and their application to image cosegmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, Jun. 7–12, 2015, pp. 3837–3845.
  • [35] J. M. Shuai, L. Y. Qing, J. Miao, Z. G. Ma, and X. L. Chen, “Salient region detection via texture-suppressed background contrast,” in Proc. IEEE International Conference on Image Processing, 15–18, 2013, pp. 2470–2474.
  • [36] T. C. Ye, D. M. Zhang, K. Gao, G. Q. Jin, and Y. D. Z. Q. S. Yuan, “Salient region detection : Integrate both global and local cues,” in Proc. IEEE International Conference on Multimedia and Expo, Jul. 14–18, 2014, pp. 1–6.
  • [37] R. S. Liu, J. J. Cao, Z. C. Lin, and S. G. Shan, “Adaptive partial differential equation learning for visual saliency detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, Jun. 24–27, 2014, pp. 3866–3873.
  • [38] J. Li, Y. H. Tan, X. W. Chen, and T. J. Huang, “Measuring visual surprise jointly from intrinsic and extrinsic contexts for image saliency estimation,” International Journal of Computer Vision, vol. 120, pp. 44–60, 2016.
  • [39] W. C. Tu, S. F. He, Q. X. Yang, and S. Y. Chen, “Real-time salient object detection with a minimum spanning tree,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Nevada, Jun. 27–30, 2016, pp. 2334–2342.
  • [40] H. F. Liu, Z. Q. Tao, and Y. Fu, “Partition level constrained clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, pp. 1–1, 2017.
  • [41] C. Wang, H. Zhang, L. Yang, X. C. Cao, and H. K. Xiong, “Multiple semantic matching on augmented n-partite graph for object co-segmentation,” IEEE Transactions on Image Processing, vol. 26, pp. 5825–5839, 2017.
  • [42] J. Tighe and S. Lazebnik, “Finding Things: Image parsing with regions and per-exemplar detectors,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Portland, Oregon, Jun. 25–27, 2013, pp. 3001–3008.
  • [43] D. Kuettel and V. Ferrari, “Figure-ground segmentation by transferring window masks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Providence, Rhode Island, Jun. 16–21, 2012, pp. 558–565.
  • [44] W. Xia, C. Domokos, J. Xiong, L. F. Cheong, and S. Yan, “Segmentation over detection via optimal sparse reconstructions,” IEEE Transactions on Circuits & Systems for Video Technology, vol. 25, pp. 1295–1308, 2015.
  • [45] X. Liu, M. Song, D. Tao, J. Bu, and C. Chen, “Random geometric prior forest for multiclass object segmentation,” IEEE Transactions on Image Processing, vol. 24, pp. 3060–3070, 2015.
  • [46] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Diego, USA, Jun. 20–26, 2005, pp. 886–893.
  • [47] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 1222–1239, 2001.
  • [48] D. Geman and G. Reynolds, “Constrained restoration and the recovery of discontinuities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, pp. 367–383, 1992.
  • [49] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics, vol. 23, pp. 309–314, 2004.
  • [50] H. Fu, D. Xu, S. Lin, and J. Liu, “Object-based RGBD image co-segmentation with mutex constraint,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, Jun. 7–12, 2015, pp. 4428–4436.
  • [51] S. Vicente, C. Rother, and V. Kolmogorov, “Object cosegmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, Jun. 20–25, 2011, pp. 2217–2224.
  • [52] L. Mukherjee, V. Singh, J. Xu, and M. D. Collins, “Analyzing the subspace structure of related images: Concurrent segmentation of image sets,” in Proc. European Conference on Computer Vision, Firenze, Italy, Oct. 7–13, 2012, pp. 128–142.
  • [53] W. Casaca, L. G. Nonato, and G. Taubin, “Laplacian coordinates for seeded image segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, Jun. 24–27, 2014, pp. 384–391.
  • [54] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg, “Video segmentation by tracking many figure-ground segments,” in Proc. IEEE International Conference on Computer Vision, Sydney, NSW, Australia, Dec. 3–6, 2013, pp. 2192–2199.
  • [55] F. N. Meng, J. F. Cai, and H. L. Li, “Cosegmentation of multiple image groups,” Computer Vision and Image Understanding, vol. 146, pp. 67–76, 2016.
  • [56] K. R. Jerripothula, J. Cai, F. Meng, and J. Yuan, “Automatic image co-segmentation using geometric mean saliency,” in Proc. IEEE International Conference on Image Processing, Quebec, Canada, Sep. 27–30, 2015.
  • [57] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, “Unsupervised joint object discovery and segmentation in internet images,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Portland, Oregon, Jun. 25–27, 2013, pp. 1939–1946.
  • [58] M. M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S. M. Hu, “Global contrast based salient region detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, pp. 569–582, 2015.