Co-segmentation for Space-Time Co-located Collections
We present a co-segmentation technique for space-time co-located image collections. These prevalent collections capture various dynamic events, usually by multiple photographers, and may contain multiple co-occurring objects which are not necessarily part of the intended foreground object, resulting in ambiguities for traditional co-segmentation techniques. Thus, to disambiguate what the common foreground object is, we introduce a weakly-supervised technique, where we assume only a small seed, given in the form of a single segmented image. We take a distributed approach, where local belief models are propagated and reinforced with similar images. Our technique progressively expands the foreground and background belief models across the entire collection. The technique exploits the power of the entire set of images without building a global model, and thus successfully overcomes large variability in appearance of the common foreground object. We demonstrate that our method outperforms previous co-segmentation techniques on challenging space-time co-located collections, including dense benchmark datasets which were adapted for our novel problem setting.
Nowadays, Crowdcam photography is both abundant and prevalent [2, 1]. A crowd of people capturing various events forms collections with great variety in content, which nevertheless normally share a common theme. We refer to a collection of images captured at about the same time and place as "space-time co-located" images, and we assume that such a co-located collection contains a significant subset of images that share a common foreground object, although other objects may also co-occur throughout the collection. See Figure 1 for such an example, where the Duchess of Cambridge is photographed at her wedding, and some of the images also contain, for instance, her husband, the Duke of Cambridge.
Foreground extraction is one of the most fundamental problems in computer vision, receiving ongoing attention for several decades now. Technically, the problem of cutting out the common foreground object from a collection of images is known and has been referred to as co-segmentation [7, 21, 6]. A traditional co-segmentation problem assumes that objects which are both co-occurring and salient necessarily belong to the foreground regions. However, the space-time co-location of the images leads to a more challenging setting, where the premise of common co-segmentation techniques is no longer valid, as the foreground object is not well-defined. Therefore, we ask the user to provide a segmented template image to specify what the intended foreground object is.
The foreground object varies considerably in appearance across the entire space-time co-located collection. Thus, we do not use a single global model to represent it, but instead take a distributed local approach. We decompose each image into parts at multiple scales. Parts store local beliefs about the foreground and background models. These beliefs are iteratively propagated to similar parts within and among images. In each iteration, one image is selected as the current seed. See Figure 2, which illustrates the progression of beliefs in the network of images.
The propagation of beliefs from a given seed is formulated as a convex belief propagation (CBP) optimization. Foreground and background likelihood maps of neighboring images are first inferred independently (see Section 4.2). These beliefs are then reinforced across images to consolidate local models and thus allow for more refined likelihood estimates (see Section 4.3). To allow for a joint image inference, we extend the CBP algorithm to include quadratic terms.
We show that, when starting from a reliable seed model, we can progressively expand the foreground belief model across the entire collection. This gradual progression succeeds in co-segmenting the collection, outperforming state-of-the-art co-segmentation techniques on rich benchmark datasets which were adapted for our problem setting. We also provide an extensive evaluation on various space-time co-located collections which contain repeated elements that do not necessarily belong to the semantic foreground region. Our analysis demonstrates the advantages of our technique over previous methods, and in particular illustrates its robustness against significantly cluttered backgrounds.
The main contributions of our work are (i) the introduction of the novel co-segmentation problem for space-time co-located image collections, (ii) a distributed approach that can handle the great variability in appearance of the foreground object, and (iii) an extended variational scheme for propagating information within and across images.
2 Related work
Segmenting and extracting the foreground object from an image is a fundamental and challenging problem, which has received significant ongoing attention. Extracting the foreground object requires some guidance or supervision since in most cases it is unclear what the semantic intent is. When several images that share a common foreground are given, the problem is referred to as co-segmentation. Many solutions have been proposed to the co-segmentation problem, which can be applied to image collections of varying sizes and characteristics [6, 5, 7, 21, 14]. Co-segmentation techniques learn the appearance commonalities in the collection to infer the common foreground object or objects. To initialize the learning process, unsupervised techniques are usually based on objectness or visual saliency [6, 22] cues to estimate the target object.
State-of-the-art co-segmentation methods are based on recent advancements in feature matching and correspondence techniques [22, 21, 7]. Rubinstein et al. proposed to combine saliency and dense correspondences to co-segment large and noisy internet collections. Faktor and Irani also use dense correspondences; however, they compute the statistical significance of the shared regions, rather than computing saliency separately per image. These techniques are unsupervised, and they assume that recurrent and unique segments necessarily belong to the object of interest. However, in many collections this is not the case, and some minimal semantic supervision is then required. Batra et al., for example, aimed at topically related images, and their supervision was given in the form of multiple user scribbles. In our work, we deal with images that depict the same instance, rather than a general class, and that exhibit great variability in appearance. We use the segmentation of a single image in the collection to guide the process and target the intended object.
The work of Kim and Xing is most closely related to ours. In their work they address the problem of multiple foreground co-segmentation, where objects of interest repeatedly occur over an entire image collection. They show promising results when roughly percent of their images are manually annotated. In our work, we target a single object of interest, while space-time co-located collections often contain several repeated elements that clutter the scene and distract common means of distinguishing the foreground. Unlike their global optimization framework, which solves for all the segmentations at once, our technique progresses gradually, and each image in turn guides the segmentation of its adjacent images. In this sense of progression, the method of Kuettel et al. is similar to ours. However, their method is strongly based on the semantic hierarchy of ImageNet, while we aim at segmenting an unstructured space-time co-located image collection.
There are other space-time co-located settings where images share a common foreground. One setting is a video sequence [19, 8], where the coherence among the frames is very high. It is worth noting that even then, the extraction of the foreground object is surprisingly difficult. Kim and Xing presented an approach to co-segment multiple photo streams, which are captured by various users at varying places and times. Similarly to our work, they also iteratively use a belief propagation model over the image graph. Another setting is multi-view object segmentation, e.g., [9, 4], where the common object is captured from a calibrated set of cameras. These techniques commonly employ 3D reconstruction of the scene to co-segment the object of interest. In our setting, the images are rather sparse in scene-space and not necessarily captured all at once, which makes any attempt to reconstruct the target object highly improbable.
3 Co-segmentation using iterative propagation
We describe the foreground and background models, denoted by F and B, using local beliefs that are propagated within and across images. To define a topology over which we propagate beliefs, we construct a parts-level graph, where nodes are image parts from all images, and edges connect corresponding parts in different images or spatially neighboring parts within an image. Furthermore, we define an associated image-level graph, where the nodes correspond to the images, and two images are connected by an edge if there exists at least one edge in the parts-level graph that connects the pair of images. The F/B likelihoods are iteratively propagated throughout the parts-level graph, while the propagation flow is determined according to the image-level graph. The graph topology is illustrated in Figure 3.
In what follows, we first explicitly define the graph topology. We then describe how these beliefs are gradually spread across the entire image collection, starting from the user-segmented template image.
3.1 Propagation graph topology
Image parts form the basis for the propagation. To obtain the parts, we use the hierarchical image segmentation method of Arbeláez et al. We threshold the ultrametric contour map, which defines the hierarchy of image regions, at a relatively fine level (). See Figure 6 (on the left) for an illustration of the parts obtained at a number of different levels. The level we use for the image parts is illustrated in the left-most image. Although a fine level yields a large number of perhaps less meaningful parts, it should be noted that a coarser level often merges foreground and background parts.
We construct a parts-level graph , where edges connect corresponding parts or spatially neighboring parts within an image. To compute reliable correspondences between image parts, we use the non-rigid dense correspondence technique (NRDC) , which outputs a confidence measure (with values between and ) along with each displacement value. We consider corresponding pixels to be those with a confidence which exceeds a certain threshold, which we set empirically to .
Two images are connected by an edge in the associated image-level graph if there exists at least one edge in the parts-level graph that connects the pair of images.
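The relation between the two graphs can be illustrated with a minimal sketch. The function name `build_image_graph` and the dictionary-based data layout are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
from collections import defaultdict


def build_image_graph(part_to_image, part_edges):
    """Derive image-level adjacency from part-level edges.

    part_to_image: dict mapping a part id to the id of the image it belongs to
    part_edges:    iterable of (part_a, part_b) pairs from the parts-level graph
    Returns a dict mapping each image id to the set of neighbouring image ids.
    """
    image_graph = defaultdict(set)
    for image in part_to_image.values():
        image_graph[image]  # touch the key so isolated images appear as nodes
    for part_a, part_b in part_edges:
        image_a, image_b = part_to_image[part_a], part_to_image[part_b]
        if image_a != image_b:
            # Only correspondence edges link two images; spatial edges
            # inside one image contribute nothing at the image level.
            image_graph[image_a].add(image_b)
            image_graph[image_b].add(image_a)
    return dict(image_graph)
```

A single cross-image part edge is enough to connect two images, which matches the "at least one edge" condition stated above.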
3.2 Iterative likelihood propagation
We assign each part in the parts-level graph a foreground likelihood. Initially, all parts are equally likely to be foreground or background (except the parts in the user-segmented template image, whose F-likelihood is either exactly 0 or 1).
The likelihoods are iteratively propagated throughout the graphs. In each iteration, a seed image is selected and its likelihoods are propagated to its adjacent neighbors in the image-level graph. In the first iteration, the seed image is always the user-segmented template image. In subsequent iterations, the seed image is randomly picked from the neighbors of the current seed. Within an iteration, the seed image likelihoods are considered fixed. Note that the template image likelihoods remain fixed throughout the whole algorithm.
The details of this propagation are described in the next section. The new estimates are first derived separately, according to Section 4.2, and are then jointly refined, according to Section 4.3. These new likelihood estimates are combined with previous estimates, where estimates are amortized along their propagation, and get exponentially lower weights over time, as we have more confidence in beliefs that are closer to our template image.
After propagating the likelihoods, we update the foreground-background segmentation for the next seed using a modified implementation of graph-cuts, where the unary terms are initialized according to the obtained likelihoods.
The algorithm above is terminated once all images have been propagated to at least once. To avoid error accumulation, we execute the full pipeline multiple times (five in our implementation). The final results are obtained by averaging all the likelihood estimates, followed by a graph-cut segmentation.
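The order in which seeds are visited can be sketched as a random walk over the image-level graph. This is a hedged illustration of the schedule only: the inference and graph-cut steps are omitted, the function name `propagation_schedule` and the fixed random seed are assumptions, and images that are not connected to the template are simply never reached.

```python
import random


def propagation_schedule(image_graph, template, rng=None):
    """Return the sequence of seed images, starting at the template, until
    every image reachable from the template has been propagated to once.

    image_graph: dict mapping an image id to the set of its neighbours
    """
    rng = rng or random.Random(0)

    # Collect the images that can receive beliefs at all.
    reachable, frontier = set(), [template]
    while frontier:
        node = frontier.pop()
        if node not in reachable:
            reachable.add(node)
            frontier.extend(image_graph[node])

    reached = {template}  # the template holds the user-given segmentation
    seeds, seed = [], template
    while True:
        seeds.append(seed)
        reached |= image_graph[seed]  # the seed propagates to its neighbours
        if reached >= reachable:
            return seeds
        # The next seed is picked at random among the current seed's neighbours.
        seed = rng.choice(sorted(image_graph[seed]))
```

Running the schedule several times with different random generators and averaging the resulting likelihood maps corresponds to the repeated executions described above.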
4 Likelihood inference propagation
Our algorithm uses convex belief propagation and further extends the variational approximate inference program to include quadratic terms. Therefore, in Section 4.1, we briefly introduce notations used in later sections. In Section 4.2, we present an approach to infer an object likelihood map of a single target image from a seed image. Finally, in Section 4.3, we introduce a technique to propagate the likelihood maps across similar images to improve the accuracy and reliability of these inferred maps.
4.1 Convex belief propagation
Markov random fields (MRFs) consider joint distributions over discrete product spaces . The joint probability is defined by combining potential functions over subsets of variables. Throughout this work we consider two types of potential functions: single variable functions, , which correspond to the vertices in a graph, , and functions over pairs of variables that correspond to the graph edges, . The joint distribution is then given by Gibbs probability model:
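For reference, the standard pairwise form of this model and of its single-variable marginals is given below. The symbols $x_i$, $\theta_i$, $\theta_{i,j}$, $V$ and $E$ are assumed notation chosen for this sketch and may differ from the authors' own.

```latex
% Standard pairwise Gibbs model over a graph (V, E); notation assumed.
p(x_1, \ldots, x_n) \;\propto\;
  \exp\Big( \sum_{i \in V} \theta_i(x_i)
          + \sum_{(i,j) \in E} \theta_{i,j}(x_i, x_j) \Big)

% Marginal probability of a single variable x_i.
p(x_i) \;=\; \sum_{x \setminus x_i} p(x_1, \ldots, x_n)
```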
Many computer vision tasks require inferring various quantities from the Gibbs distribution, e.g., the marginal probabilities.
Convex belief propagation [24, 11] is a message-passing algorithm that computes the optimal beliefs which approximate the Gibbs marginal probabilities. Furthermore, under certain conditions, these beliefs are precisely the Gibbs marginal probabilities. For completeness, in the Supplementary Material, we define these conditions and explicitly describe the optimization program.
4.2 Single target image inference
In the following we present the basic component of our method, which infers an object likelihood map of a target image from an image seed. We construct a Markov random field (MRF) on the parts of the target image and use a convex belief propagation to infer the likelihood of these parts to be labeled as foreground.
Each part can be labeled as either foreground or background. First, we describe the local potentials of each part, which describe the likelihood of the part to belong to the foreground or the background. Then, we describe the pairwise potentials, which account for the spatial relations between adjacent parts. We infer the foreground-background beliefs of the parts in the target image by executing the standard convex belief propagation algorithm [24, 11].
4.2.1 Local potentials
The local potentials express the extent of agreement of a part with the foreground or background models. To define parts in the seed image, we use the technique of Arbeláez et al. at multiple levels to obtain a large bag of candidate parts of different scales. Consider a part in the target image and a part in the source seed image. For each such source part, we compute its compatibility with the target part.
To construct the foreground/background likelihood of each part in the target image , we consider the F/B parts of the source seed, and set
where and are the two labels that can be assigned to .
We define our compatibility measure as follows:
where is a balancing coefficient that controls the amount of enrichment of the available set of correspondences. The term measures the fraction of pixels that are matched between parts and . This is measured according to
where is the number of corresponding pixels, and is the number of pixels in part . As mentioned before, the matching is based on NRDC.
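The matched-pixel fraction can be sketched as follows. Several details here are assumptions made for illustration: the count is normalised by the size of the source part, the correspondence field is given as per-pixel `(dy, dx)` displacements with a confidence map, and the confidence threshold is passed in as a parameter. The function name `matched_fraction` is hypothetical.

```python
import numpy as np


def matched_fraction(source_labels, target_labels, flow, confidence,
                     source_part, target_part, threshold):
    """Fraction of the pixels of `source_part` whose confident
    correspondence lands inside `target_part`.

    source_labels, target_labels: integer part-label maps of shape (H, W)
    flow:       displacements of shape (H, W, 2), stored as (dy, dx)
    confidence: per-pixel correspondence confidence of shape (H, W)
    """
    ys, xs = np.nonzero(source_labels == source_part)
    if ys.size == 0:
        return 0.0
    keep = confidence[ys, xs] > threshold  # confident matches only
    ys_k, xs_k = ys[keep], xs[keep]
    ty = ys_k + np.rint(flow[ys_k, xs_k, 0]).astype(int)
    tx = xs_k + np.rint(flow[ys_k, xs_k, 1]).astype(int)
    h, w = target_labels.shape
    inside = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)
    hits = target_labels[ty[inside], tx[inside]] == target_part
    return float(hits.sum()) / ys.size
```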
We identified that highly compatible parts are rather sparse, and thereby is almost always zero in many source-target pairs. Nonetheless, we can exploit these sparse correspondences to discover new compatible parts with the term . See Figure 4 for an illustration of the compatible target parts in the foreground regions without (top row) and with (bottom row) our enrichment term . In practice, since the background does not necessarily appear in both source and target image, only for regions .
In these foreground regions, the term aims at revealing a similarity between parts whose appearance and spatial displacement highly agree. Similarity in appearance, in our context, is measured according to the Bhattacharyya coefficient of the RGB histograms, following the method of . In order for parts and to highly agree in appearance, we further require that , where the number of nearest neighbors is set to three. To recognize parts whose spatial displacement agree, we utilize the set of corresponding pixels in the foreground regions. We approximate the pixel values of the part corresponding to according to the known correspondences. Formally, for each , let be the estimated corresponding region in the target. Thus, for a similarity between parts and , we require that .
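The appearance similarity mentioned above, the Bhattacharyya coefficient between two RGB histograms, can be sketched as below. The number of bins per channel (16) and the helper names are assumptions; the nearest-neighbor test and the spatial-displacement test are not covered by this sketch.

```python
import numpy as np


def rgb_histogram(pixels, bins=16):
    """Normalised joint RGB histogram of an (N, 3) array of 8-bit colours."""
    pixels = np.asarray(pixels).reshape(-1, 3)
    q = (pixels.astype(np.int64) * bins) // 256  # quantise each channel
    idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    total = hist.sum()
    return hist / total if total else hist


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalised histograms, in [0, 1]."""
    return float(np.sum(np.sqrt(p * q)))
```

The coefficient equals 1 for identical histograms and 0 for histograms with no common non-empty bin.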
To simplify computations, we assume to be a circle within the target image, which we compute according to the closest and farthest foreground correspondences. These two corresponding points define a relative scale between the two images. To compute the circle center, we compute the relative offset from the closest corresponding point (using the relative scale). The radius is determined according to the distance to the nearest corresponding point in the target. See Figure 5(a) for an illustration of the estimated corresponding region .
Putting it all together, is measured according to
In our experiments, we set for all the foreground regions. See Figure 5(b) for an illustration of the multi-scale source parts that obtained maximum compatibility with parts in the target image.
4.2.2 Pairwise potentials
The pairwise potential function induces spatial consistency from the part generation process within the image. As previously mentioned, we obtain parts at multiple scales by thresholding at varying levels in the ultrametric contour map (see Figure 6 for an illustration).
To measure spatial consistencies between adjacent parts in the target, we can compute how quickly these two parts merge into one by examining the level where the two parts become one. Hence, we define a pairwise relation between adjacent parts in each target image according to:
where the parameter was set empirically, and the finest level we examine to measure the spatial consistencies is (a merge there would induce the strongest proximity between the parts). See the heat-maps in Figure 6 for an illustration of these pairwise potentials on a few randomly-chosen target parts. See Figure 6 for an illustration of .
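The level at which two parts merge can be found with a short sketch like the one below. It assumes the hierarchy is available as a list of label maps ordered from fine to coarse, with each coarser region being a union of finer ones. The mapping from the merge level to the value of the pairwise potential is not reproduced here, and the name `merge_level` is hypothetical.

```python
import numpy as np


def merge_level(hierarchy, part_a, part_b):
    """Index of the first level at which two finest-level parts share a region.

    hierarchy: list of integer label maps of equal shape, ordered fine to
               coarse; part_a and part_b are labels in hierarchy[0].
    Returns None if the parts never merge within the given levels.
    """
    finest = hierarchy[0]
    # Because the hierarchy is nested, one representative pixel per part
    # is enough to track which region the part belongs to at each level.
    pixel_a = tuple(np.argwhere(finest == part_a)[0])
    pixel_b = tuple(np.argwhere(finest == part_b)[0])
    for level, labels in enumerate(hierarchy):
        if labels[pixel_a] == labels[pixel_b]:
            return level
    return None
```

An earlier merge indicates stronger spatial consistency between the two parts.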
4.3 Joint multi-target inference
In Section 4.2, we presented our approach to infer an object likelihood map from a seed image. In our setting, similar regions may co-occur across multiple images. Therefore, to improve the accuracy and reliability of the likelihood maps obtained by a single inference step, we propagate the inferred maps onto adjacent images in the image graph. The output beliefs of each inferred target image are sent to its neighbors as a heat-map (i.e., per-part foreground-background probability). Thus, our likelihood maps are complemented with joint inference across images.
To differentiate the different types of edges on , we denote the edges that connect parts across images by . A joint inference is encouraged by a pairwise potential function between matched parts in . Since the labels satisfy , this can be done with the potential function
Simply stated, the local potentials propagate the output beliefs of one target as input potentials of its neighboring images. Our intuition is that the output beliefs of a target image, obtained by running convex belief propagation within that image, serve as a source seed signal to the neighboring image. This introduces novel non-linearities to the propagation algorithm. To control these non-linearities, we extend the variational approximate inference program to include quadratic terms.
We also determine the conditions for which these quadratic terms still define a concave inference program and prove that repeated convex belief propagation iterations across images achieve its global optimum. For more details on our extended variational approximate inference program, together with an empirical evaluation of our inference technique, please refer to the Supplementary Material, which can be found on the project website.
5 Results and evaluation
We empirically tested our algorithm on various datasets and compared our segmentation results to two state-of-the-art unsupervised co-segmentation techniques [7, 21] and to another semi-supervised co-segmentation technique. The merit of comparing our weakly-supervised technique with unsupervised ones is twofold: first, it serves as a qualitative calibration of the results on the new co-located setting, and second, it clearly demonstrates the necessity of some minimal supervision to define the semantic foreground region.
For all three methods we used the original implementations by the authors which are publicly available online. It should be noted that although  discuss an unsupervised approach as well, they only provide implementations for the semi-supervised approach. To be compatible to our input, we provide their method with one input template mask. We measure the performance on different datasets, including benchmark datasets that were adapted for our novel problem setting. The full implementation of our method, along with the datasets that were used in the experiments, is available at our project website at: https://cs.tau.ac.il/~averbuch1/coseg/.
We evaluated our technique on various challenging space-time co-located image collections depicting various dynamic events. Some of them (Bride, Singer, and Broadway) were downloaded from the internet, while others (Toddler, Baby, Singer with Guitarist, and Peru) were casually captured by multiple photographers. These images contain repeated elements that do not necessarily belong to the semantic foreground region, and the appearance of the foreground varies greatly throughout the collections. We provide thumbnails for all seven sets, together with results and comparisons, on our project website. Please refer to these results to assess the quality of our segmentations.
For a quantitative analysis, we manually annotated the foreground regions of three of our collections (Bride, Toddler, and Singer), and report the precision (percentage of correctly labeled pixels) and Jaccard similarity (intersection over union of result and ground truth segmentations) as in previous works (see Table 1(b)). It should be noted that, to strengthen the evaluation, we perform three independent runs for the semi-supervised techniques, starting from different random seeds, and report the average scores. Figure 9 shows a sample of results, where the left-most image is provided as template for the semi-supervised techniques. As can be observed from our results, the unsupervised co-segmentation techniques fail almost completely on our co-located collections. Regarding the semi-supervised technique, as Figure 9 demonstrates, when both the foreground and background regions highly resemble those of their counterparts in the given template, then the results of  are somewhat comparable to ours. As soon as the backgrounds differ or there are additional models that were not in the template, their method includes many outliers, as can be seen in Figure 9. Unlike their method, we avoid defining strict global models that hold for all the images in the collection, and thus allow flexibility that is required to deal with the variability across the collection.
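The two scores defined above can be computed from a pair of binary masks as in the following minimal sketch. The function name and the convention of returning a Jaccard score of 1 when both masks are empty are assumptions made here.

```python
import numpy as np


def precision_and_jaccard(pred, gt):
    """Score a binary segmentation against a ground-truth mask.

    precision: fraction of pixels (foreground and background) labelled correctly
    jaccard:   intersection over union of the two foreground regions
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    precision = float(np.mean(pred == gt))
    union = np.logical_or(pred, gt).sum()
    intersection = np.logical_and(pred, gt).sum()
    jaccard = float(intersection / union) if union else 1.0
    return precision, jaccard
```

Note that precision counts correctly labelled background pixels as well, so it can stay high even when a small foreground object is missed entirely, whereas the Jaccard score drops to zero in that case.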
Multiple foreground objects
We also compared our performance to  using their data. We use their main example, which also corresponds to our problem setting. The results are displayed in Figure 7, where we mark the multiple foreground objects in different colors. We execute our method multiple times with different seeds to match their input. As can be seen here and in general, our method has fewer false positives and is more resistant to cluttered backgrounds. If we are able to spread our beliefs towards the target image, then we succeed in capturing the object rather well. Quantitatively, our technique cuts the precision error by more than half (from down to ). However, if not enough confidence reaches the target image, then the object remains undetected, as can be observed in the uncolored basket of apples in the rightmost image.
Sampled video collections
The DAVIS dataset  is a recent benchmark for video segmentation techniques, containing 50 sequences that exhibit various challenges including occlusions and appearance changes. The dataset comes with per-frame, per-pixel ground-truth annotations. We sparsified these sequences (taking every 10th frame) to construct a large number of datasets that are somewhat related to our problem setting. Table 8 shows the intersection-over-union (IoU) scores on a representative subset and the average over all 50 collections. Similar to the input provided to video segmentation techniques in the mask propagation task, we also provide the semi-supervised techniques with a manual segmentation of the first frame. However, in our sparsified collections, subsequent frames are quite different, as illustrated in Figure 8.
Our extensive evaluation on the adapted DAVIS benchmark clearly illustrates, first of all, the difficulty of the problem setting, as the image structure is not temporally-coherent, and unlike dense video techniques, we cannot benefit from any temporal priors. Furthermore, it demonstrates the robustness of our technique, as it achieves the highest scores on most of the datasets, as well as the highest average score on all 50 collections.
6 Conclusions and future work
In this work, we have presented a co-segmentation method that takes a distributed approach. Common co-segmentation methods gather information from all the images in the collection, analyze it globally, build a common model, and then infer the common foreground objects in all, or part of, the images. Here, there is no global model. The beliefs are propagated across the collection without forming a global model of the foreground object. Each image independently collects the beliefs from its neighbors, and consequently infers its own model for the foreground object. Although our method is distributed, it currently relies on a seed model, which does not fully accord with the claim of having a distributed method. However, some supervision is necessarily required to define the semantic target model. Currently, it is provided as a single segmented image, but the seed model can possibly be provided in other forms.
We have shown that our approach outperforms state-of-the-art co-segmentation methods. However, as our results demonstrate, there are limitations, as the object cut-outs are imperfect. The entire object is not always inferred, and portions of the background may contaminate the extracted object. To alleviate these limitations, there are two possible avenues for future research: (i) at a high level, to better learn the semantics of the object, perhaps using data-driven approaches, e.g., convolutional networks, and (ii) at a low level, seeking better alternatives to graph-cuts and its inherent limitations.
In the future, we hope to explore our approach on massive collections, which may include thousands of photographs capturing interesting dynamic events, for example, a collection of images of a parade, where a 3D reconstruction is not applicable. The larger number of images is not just a quantitative difference, but a qualitative one as well, as the collection can become dense with stronger local connections. For such massive collections, the foreground object does not have to be only a single object. We can propagate multi-target beliefs over the image network, as we demonstrated in our comparison to Kim and Xing. Finally, the distributed nature of our method lends itself to parallel computation, which can be effective for large-scale collections.
References
[1] A. Arpa, L. Ballan, R. Sukthankar, G. Taubin, M. Pollefeys, and R. Raskar. Crowdcam: Instantaneous navigation of crowd images using angled graph. In 3D Vision-3DV 2013, 2013 International Conference on, pages 422–429. IEEE, 2013.
[2] T. Basha, Y. Moses, and S. Avidan. Photo sequencing. In Computer Vision–ECCV 2012, pages 654–667. Springer, 2012.
[3] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. icoseg: Interactive co-segmentation with intelligent scribble guidance. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3169–3176. IEEE, 2010.
[4] N. D. Campbell, G. Vogiatzis, C. Hernández, and R. Cipolla. Automatic 3d object segmentation in multiple views using volumetric graph-cuts. Image and Vision Computing, 28(1):14–25, 2010.
[5] K.-Y. Chang, T.-L. Liu, and S.-H. Lai. From co-saliency to co-segmentation: An efficient and fully unsupervised energy minimization model. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2129–2136. IEEE, 2011.
[6] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):569–582, 2015.
[7] A. Faktor and M. Irani. Co-segmentation by composition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1297–1304, 2013.
[8] Q. Fan, F. Zhong, D. Lischinski, D. Cohen-Or, and B. Chen. Jumpcut: non-successive mask transfer and interpolation for video cutout. ACM Transactions on Graphics (TOG), 34(6):195, 2015.
[9] Z. Gang and Q. Long. Silhouette extraction from multiple images of an unknown background. In Proceedings of the Asian Conference of Computer Vision. Citeseer, 2004.
[10] Y. HaCohen, E. Shechtman, D. B. Goldman, and D. Lischinski. Non-rigid dense correspondence with applications for image enhancement. ACM Transactions on Graphics (TOG), 30(4):70, 2011.
[11] T. Heskes. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research, 26(1):153–190, 2006.
[12] G. Kim and E. P. Xing. On multiple foreground cosegmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 837–844. IEEE, 2012.
[13] G. Kim and E. P. Xing. Jointly aligning and segmenting multiple web photo streams for the inference of collective photo storylines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 620–627, 2013.
[14] G. Kim, E. P. Xing, L. Fei-Fei, and T. Kanade. Distributed cosegmentation via submodular optimization on anisotropic diffusion. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 169–176. IEEE, 2011.
[15] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in imagenet. In Computer Vision–ECCV 2012, pages 459–473. Springer, 2012.
[16] J. Ning, L. Zhang, D. Zhang, and C. Wu. Interactive image segmentation by maximal similarity based region merging. Pattern Recognition, 43(2):445–456, 2010.
[17] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016.
[18] J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, 2017.
[19] S. A. Ramakanth and R. V. Babu. Seamseg: Video object segmentation using patch seams. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 376–383. IEEE, 2014.
[20] C. Rother, T. Minka, A. Blake, and V. Kolmogorov. Cosegmentation of image pairs by histogram matching-incorporating a global constraint into mrfs. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 993–1000. IEEE, 2006.
[21] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu. Unsupervised joint object discovery and segmentation in internet images. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1939–1946. IEEE, 2013.
[22] J. C. Rubio, J. Serrat, A. López, and N. Paragios. Unsupervised co-segmentation through region matching. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 749–756. IEEE, 2012.
[23] S. Vicente, C. Rother, and V. Kolmogorov. Object cosegmentation. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2217–2224. IEEE, 2011.
[24] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.