Multiresolution hierarchy co-clustering for semantic segmentation
in sequences with small variations
This paper presents a co-clustering technique that, given a collection of images and their hierarchies, clusters nodes from these hierarchies to obtain a coherent multiresolution representation of the image collection. We formalize the co-clustering as a Quadratic Semi-Assignment Problem and solve it with a linear programming relaxation approach that makes effective use of information from hierarchies. Initially, we address the problem of generating an optimal, coherent partition per image and, afterwards, we extend this method to a multiresolution framework. Finally, we particularize this framework to an iterative multiresolution video segmentation algorithm in sequences with small variations. We evaluate the algorithm on the Video Occlusion/Object Boundary Detection Dataset, showing that it produces state-of-the-art results in these scenarios.
The goal of co-clustering is to robustly segment a reference image (or various reference images) within a collection of closely related images (for instance, multiple views of a given scene or a video sequence with small variations) without any prior knowledge of the number of clusters. This is closely related to the correlation clustering problem [Bansal04].
Co-clustering approaches that model the problem as a Quadratic Semi-Assignment Problem [Charikar2005] have been reported to outperform other co-clustering strategies [Glasner2011]. However, such solutions present inconsistencies in the propagation of clusters among images, which prevents obtaining a coherent labeling throughout the collection of video images.
The goal of unsupervised video segmentation is to efficiently extract coherent groups of voxels from sequences to represent the video information with far fewer primitives. Video segmentation techniques can be classified into three categories [Corso2012]: (a) frame-by-frame processing, which leads to results with low temporal coherence [Brendel2009]; (b) iterative processing, which improves the temporal coherence while requiring reasonable algorithm complexity [Paris2008]; and (c) 3D volume processing, which leads to the best results but implies high algorithm complexity and memory requirements [Grundmann2010].
Regardless of the previous classification, it is nowadays widely accepted that multiresolution descriptions provide a richer framework for subsequent analysis, both in the image case [Malik2011] and in the video case [Grundmann2010]. Current techniques mainly rely on motion information to build a set of coherent partition sequences that describe the video at different resolutions. However, video sequences with global motion or little variation in the scene pose problems for motion-based segmentation approaches: in these cases, relying heavily on motion information does not help to infer the semantics of the scene. Figure 1 presents an example of this behavior.
To handle this kind of sequences, we propose a video segmentation method based on the co-clustering of a sequence of region-based hierarchical image representations. Moreover, we extend this co-clustering to produce a multiresolution representation of the video sequence. Our main contributions are:
An optimization on hierarchies that fully exploits the tree information avoiding inconsistencies of previous co-clustering approaches by coding the partitions with boundary variables and efficiently representing the hierarchical constraints (Section 4).
An iterative approach for video segmentation based on the previous optimization process (Section 5), that combines the information at different resolutions.
We conduct experiments on the Video Occlusion/Object Boundary Detection Dataset, comparing with the techniques in ([Grundmann2010], [Corso2012], [Galasso2012], [gunhee2012], [Joulin2012]). Comparisons are made using the respective authors' implementations. We report an improvement in accuracy over state-of-the-art techniques. The fourth row of Figure 1 shows an example of our results.
2 Related work
In [Grundmann2010], a hierarchical graph-based method is presented in which appearance and motion are used to group voxels. This technique builds a coherent region-based representation of the entire video, processing it as a single stream. In our approach, we also propose a multiresolution representation of the video sequence. Nevertheless, we avoid jointly processing the entire video and exploit the information provided by independent hierarchical segmentations.
The concept of hierarchical graph-based video segmentation is also used in [Corso2012]. In this work, sequences are processed relying on motion information and using bursts of frames in order to reduce the complexity of the algorithm. The information of these bursts is combined to create a supervoxel hierarchy of the entire video. Sequence partitions are then obtained using the uniform entropy slice in [Corso2013]. In our work, we also process groups of images instead of the whole collection. Moreover, we iteratively propagate contour information at different resolutions.
The work in [Galasso2012] extends the hierarchical image segmentation of [Malik2011] to the case of video, including motion information. To make the approach tractable, [Galasso2014] proposes a spectral graph reduction which allows defining an iterative segmentation process for video streaming. In our work, although we present a global framework, we also propose an iterative segmentation process to make the problem tractable.
The performance of the previous techniques degrades in scenarios with small variations (Figure 1) because motion does not help to describe the semantics of the scene. To overcome this situation, we tackle the problem with a co-clustering approach.
In the context of biomedical imaging, [Vitaladevuni2010] stated a co-clustering problem as a Quadratic Semi-Assignment Problem (QSAP) and, as in [Charikar2005], tackled its solution with a Linear Programming (LP) relaxation approach. In [Charikar2005], the optimization function is computed from distances between regions and linear constraints are imposed on these distances. This relaxation creates a number of inequalities that grows as , where is the number of regions.
In [Vitaladevuni2010], these constraints are only imposed over cliques in an adjacency graph on the regions. This approach bounds the number of constraints to . Moreover, a regularization parameter was introduced in [Glasner2011] to avoid trivial solutions in the optimization process. Although these approaches reduce the complexity of the problem, the solution of the optimization presents inconsistencies. These inconsistencies appear because the proposed constraints do not force the solution of the problem to be a partition.
In our approach, we also define the co-clustering problem as a QSAP, but partitions are defined in terms of boundaries between regions. This allows us to reduce the complexity of the problem. Moreover, we substitute the previous constraints by imposing the structure of the hierarchies; this way, in addition to preventing inconsistencies, resulting partitions are closer to the semantic level.
Closely related to co-clustering between image partitions is the problem of co-segmentation, first introduced by [Rother2006]. These methods take as input two or more images containing a common foreground object with varying backgrounds and attempt to segment the foreground object from the background. [gunhee2012] extends the previous concept to the multiple foreground segmentation case. In it, the user has to define the number of background objects in the image collection and sets of adjacent regions (candidates) are selected from an initial segmentation. To obtain a tractable problem, every set of regions is represented as a tree. In our case, we do not require any parameter and, for each image, a single hierarchy is computed.
Co-segmentation has also been applied to image sequences in a single resolution framework ([Rubio2012], [Wei2013]) or using hierarchies [Kim2012]. Note that co-segmentation algorithms would generally fail when tackling the case of scenes with small variations, since background in consecutive frames may also maintain its appearance. The work in [Kim2012] proposes an optimization process over the nodes of the hierarchy. The use of nodes to define the inter image relations for all levels of the hierarchies would lead to an unfeasible number of variables and constraints. This problem is tackled in [Kim2012] by restricting the inter relations to the highest level of the hierarchies. We solve that problem by defining the optimization process over boundary segments, which makes the problem tractable.
In this work, we propose a method to generate a multiresolution collection of coherent segmentations along a sequence with small variations. These segmentations are created clustering nodes from a set of non-coherent hierarchies associated with the video. This allows our technique to efficiently keep semantic contours at different resolutions and to eliminate random boundaries.
3 Working with hierarchies
Each node of the hierarchy represents a region in the image, and the parent node of a set of regions represents their merging. For simplicity, let us assume that this hierarchy is binary (regions are merged by pairs). This structure is referred to as Binary Partition Tree in [Salembier00]. Note that this assumption can be made without loss of generality, as any hierarchy can be transformed into a binary one.
Commonly, such hierarchies are created using a greedy region merging algorithm that, starting from an initial leave partition , iteratively merges the most similar pair of neighboring regions. The concept of region similarity is what makes the difference among the various approaches.
The merging process ends when the whole image is represented by a single region, which is the root of the tree. The set of mergings that creates the tree, from the leaves to the root, is referred to as merging sequence.
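The greedy construction described above can be sketched as follows. This is a minimal illustration and not the similarity criterion of any cited method: regions are described only by a mean intensity and a size, and the dissimilarity (absolute difference of means) and the function name `build_bpt` are assumptions made for the example.

```python
def build_bpt(means, sizes, edges):
    """Greedy region merging on a connected region adjacency graph.

    means, sizes: dict leaf_id -> mean intensity / number of pixels (toy model).
    edges: iterable of (a, b) pairs of adjacent leaves.
    Returns the merging sequence as a list of (child_a, child_b, parent).
    """
    means, sizes = dict(means), dict(sizes)
    neighbors = {r: set() for r in means}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    next_id = max(means) + 1
    merging_sequence = []
    while len(neighbors) > 1:
        # Pick the most similar pair of neighbouring regions
        # (hypothetical criterion: smallest difference of mean intensity).
        a, b = min(
            ((a, b) for a in neighbors for b in neighbors[a] if a < b),
            key=lambda p: (abs(means[p[0]] - means[p[1]]), p),
        )
        parent = next_id
        next_id += 1
        sizes[parent] = sizes[a] + sizes[b]
        means[parent] = (means[a] * sizes[a] + means[b] * sizes[b]) / sizes[parent]
        merged_neighbors = (neighbors[a] | neighbors[b]) - {a, b}
        # Remove both children from the graph and insert their parent.
        for child in (a, b):
            for n in neighbors.pop(child):
                neighbors[n].discard(child)
        neighbors[parent] = merged_neighbors
        for n in merged_neighbors:
            neighbors[n].add(parent)
        merging_sequence.append((a, b, parent))
    return merging_sequence


sequence = build_bpt(
    means={0: 10, 1: 12, 2: 50, 3: 55},
    sizes={0: 1, 1: 1, 2: 1, 3: 1},
    edges=[(0, 1), (1, 2), (2, 3)],
)
print(sequence)
```

The last parent created is the root of the tree, and the returned list is the merging sequence from the leaves to the root.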
Given the previous example, let us define a vector that encodes the boundaries between leaves. Using this notation, the partition generated after the first merging is represented by the sequence , where represents an active boundary.
In a binary hierarchy, a merging sequence contains partitions, where is the number of leaves (regions in ). This is the set of partitions that is usually analyzed when working with hierarchies. Still, we generate partitions which may not be included in the merging sequence. For instance, in Figure 2, the partition formed by would be generated and coded by the boundary combination . This is done by analyzing all possible configurations of nodes in the hierarchy leading to a partition. Thus, we explore a larger number of contour combinations which allows us to use different resolutions at different parts of the image depending on its semantics.
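To make the difference between the merging sequence and the full set of partitions concrete, the sketch below enumerates every partition that can be formed from nodes of a small binary tree and codes each one as a boundary vector over adjacent leaves. The toy tree, the leaf adjacency and the helper names (`all_cuts`, `boundary_vector`) are assumptions for illustration only.

```python
def leaves(tree, node):
    """Set of leaf ids below a node; tree maps parent -> (left, right)."""
    if node not in tree:
        return {node}
    left, right = tree[node]
    return leaves(tree, left) | leaves(tree, right)


def all_cuts(tree, node):
    """Yield every partition (tuple of node ids) representable below node."""
    yield (node,)
    if node in tree:
        left, right = tree[node]
        for cut_left in all_cuts(tree, left):
            for cut_right in all_cuts(tree, right):
                yield cut_left + cut_right


def boundary_vector(tree, cut, leaf_edges):
    """1 where the boundary between two adjacent leaves is active, else 0."""
    label = {leaf: node for node in cut for leaf in leaves(tree, node)}
    return [int(label[i] != label[j]) for i, j in leaf_edges]


# Toy hierarchy: leaves 0..3, with 4 = {0, 1}, 5 = {2, 3}, 6 = root.
tree = {4: (0, 1), 5: (2, 3), 6: (4, 5)}
leaf_edges = [(0, 1), (1, 2), (2, 3)]

cuts = list(all_cuts(tree, 6))
for cut in cuts:
    print(cut, boundary_vector(tree, cut, leaf_edges))
```

In this toy tree, a merging sequence that merges (0, 1) before (2, 3) visits four partitions, while the enumeration yields five: the partition (0, 1, 5) keeps fine regions on one side of the image and a coarse region on the other, and it is not part of that merging sequence.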
4 Co-clustering of hierarchies
Let us assume that we have a collection of images representing the same scene, which share a set of common contours but present a large number of random boundaries (e.g., a video sequence with small variations or a multiple-view scene representation). In this section, we first present a global framework that, given such a collection of images and their associated non-coherent hierarchies, obtains a partition collection by clustering nodes from these hierarchies. This partition collection aims at keeping only the common contours and at producing coherent regions through the collection; that is, the various instances of the same object (or part) receive the same label in all the partitions of the collection (Figure 3).
This is achieved by coding in the boundary matrix the whole set of possible boundaries between adjacent regions in the collection. This matrix contains information about both the intra boundaries (between adjacent regions in the same image) and the inter boundaries (between adjacent regions in different images). The optimal boundary configuration (the co-clustering result) is achieved through an optimization problem that combines the boundary matrix information and the information about similarity between regions, which is coded in the similarity matrix. Like the boundary matrix, the similarity matrix contains the information about intra and inter similarities between regions. Intra similarities are computed using global region descriptors, while inter similarities rely on descriptors computed over all contour elements. To avoid inconsistencies in the result, some constraints are imposed on the optimization process. In our approach, intra constraints are obtained from the hierarchies, whereas the common triangular equations are adopted as inter constraints. In addition, we extend the previous hierarchical co-clustering to a multiresolution framework.
4.1 Co-clustering problem definition
Formally, let us consider that we have a collection of M images and their associated hierarchies . The merging sequence of a given hierarchy defines a set of partitions , where is the leave partition on which the hierarchy is built and is the number of regions in . The -th partition of hierarchy () is formed by a set of regions , where and .
To encode all possible partitions ( ) represented by a given hierarchy , let us define its intra boundary matrix, . This is a binary matrix whose components are variables that relate all regions in . This way, if, for the partition being coded, the boundary between leaves and is active; that is, if regions and have not been merged.
Note that, by correctly zeroing some elements of this matrix, the whole set of partitions in () can be unequivocally described. This allows the co-clustering to fully exploit the richness of the hierarchical representation.
Boundaries between leaves of different partitions are coded in the inter boundary matrices, . Regions and from partitions and respectively belong to the same cluster if .
Then, a co-clustering between nodes from a collection of hierarchies is defined by a binary matrix, the boundary matrix, where . It encodes the intra and inter boundary information between leaves of the M images in the collection.
Note that only encodes the information of the leaves. The hierarchical information is introduced in the optimization process through the intra constraints (Section 4.2.1).
In practice, not all the variables represented in this matrix are useful, as boundaries between non-adjacent leave regions are not considered in the process. Thus, in contrast to previous partition-based approaches in which the number of constraints was bounded by ([Vitaladevuni2010], [Glasner2011]), our maximum number of intra constraints is proportional to .
Our objective is to find the optimal boundary configuration that defines a collection of partitions using nodes from hierarchies that are put in correspondence to form clusters. As proposed in [Charikar2005], the co-clustering can be stated as an optimization problem. To compact notation, let us define :
where is a complex-valued Hermitian affinity matrix that measures the co-clustering quality.
4.2 Optimization Constraints
As commented in Section 2, we constrain the optimization process using the information in the hierarchy to avoid the inconsistencies of previous approaches. Previous co-clustering techniques ([Vitaladevuni2010], [Glasner2011]) use constraints that rely on the triangular equation for this purpose. That is, for each three-clique of adjacent regions, the labelling of these three regions to a single or to multiple clusters should be consistent. The main drawback of this approach is that label inconsistencies are only avoided in a reduced neighbourhood of each region. This information is expected to be propagated using the region adjacency, but inconsistencies are not specifically avoided outside this neighbourhood.
In this work, as we perform co-clustering between hierarchies, we exploit the tree information to both encourage semantic fusions between regions and to reduce the number of constraints involved in the optimization.
4.2.1 Intra Constraints
Each hierarchy contributes in two aspects to the optimization process. First, it defines the mergings between regions of its leave partition to form clusters. Second, it also includes the order in which these regions should be merged to represent each node of the tree. Note that this order is not conditioned by the merging sequence. These two contributions of the hierarchy information lead to a large number of constraints among the regions forming the subtree below a given node. Nevertheless, in this work, all these original constraints have been encoded with only two coupled constraints per node.
First, for a given parent node and in order to merge its two siblings, all the leaves that form the boundaries between these two siblings should be merged. This is imposed by:
where is the total number of common region boundaries from the leave partition that represents the union of both siblings, is a region from the first sibling and , are regions from the second sibling. This condition imposes that all the variables representing boundaries between two siblings should have the same value.
Second, for a given parent node and in order to merge its two siblings, the leaves that form their respective subtrees must also be merged:
where is the total number of inner region boundaries from the leaves partition of both siblings, and are regions from the first sibling and , are regions from the second sibling. This condition imposes that for a given node, a variable representing a boundary between two siblings can only impose a merging if all the leaves associated with the node are merged.
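The two conditions described above can be illustrated with a small feasibility check on binary boundary values. This is a sketch that follows the prose only, not the linear constraints of the paper: the tree layout, the dictionary-based boundary coding and the function name `satisfies_intra_constraints` are assumptions for the example.

```python
def leaves(tree, node):
    """Set of leaf ids below a node; tree maps parent -> (left, right)."""
    if node not in tree:
        return {node}
    left, right = tree[node]
    return leaves(tree, left) | leaves(tree, right)


def satisfies_intra_constraints(tree, boundaries):
    """boundaries: dict (i, j) -> 0/1 for each pair of adjacent leaves."""
    for parent, (first, second) in tree.items():
        l1, l2 = leaves(tree, first), leaves(tree, second)
        # Boundaries that separate the two siblings of this parent node.
        between = [
            v for (i, j), v in boundaries.items()
            if (i in l1 and j in l2) or (i in l2 and j in l1)
        ]
        # Boundaries that lie inside the subtree of either sibling.
        inner = [
            v for (i, j), v in boundaries.items()
            if (i in l1 and j in l1) or (i in l2 and j in l2)
        ]
        # First condition: all sibling boundaries share the same value.
        if len(set(between)) > 1:
            return False
        # Second condition: the siblings can only be merged if every
        # leaf below the parent node is merged as well.
        if between and between[0] == 0 and any(inner):
            return False
    return True


# Toy hierarchy: leaves 0..3, with 4 = {0, 1}, 5 = {2, 3}, 6 = root.
tree = {4: (0, 1), 5: (2, 3), 6: (4, 5)}
valid = {(0, 1): 0, (1, 2): 1, (2, 3): 0}    # partition {4, 5}
invalid = {(0, 1): 1, (1, 2): 0, (2, 3): 0}  # merges leaves 1 and 2 across siblings
print(satisfies_intra_constraints(tree, valid))
print(satisfies_intra_constraints(tree, invalid))
```

The second configuration is rejected because it merges leaves that only meet at the root while keeping an active boundary inside one of the root's siblings, so it does not correspond to any set of tree nodes.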
4.2.2 Inter Constraints
These constraints control the correspondences between nodes from different hierarchies. In this case, as we do not have any hierarchical relation for these nodes, the triangular equation is used to create the inter constraints:
where is the edge between leaves and of the region adjacency graph computed from the leave partitions.
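As an illustration of what a triangular constraint rules out, the sketch below checks the form commonly used in correlation clustering, b_ij <= b_ik + b_kj, on every three-clique of a small adjacency graph. That the inter constraints of this paper take exactly this form is an assumption, as are the data layout and the function name `triangle_violations`.

```python
from itertools import combinations


def triangle_violations(nodes, b):
    """Return the three-cliques whose boundary values are inconsistent.

    b: dict (i, j) -> 0/1 with i < j, defined only for adjacent regions.
    """
    violations = []
    for i, j, k in combinations(sorted(nodes), 3):
        edges = [(i, j), (i, k), (j, k)]
        if not all(e in b for e in edges):
            continue  # not a three-clique of the adjacency graph
        b_ij, b_ik, b_jk = (b[e] for e in edges)
        consistent = (
            b_ij <= b_ik + b_jk
            and b_ik <= b_ij + b_jk
            and b_jk <= b_ij + b_ik
        )
        if not consistent:
            violations.append((i, j, k))
    return violations


# Regions 0 and 1 are merged, 1 and 2 are merged, but 0 and 2 are split:
inconsistent = {(0, 1): 0, (1, 2): 0, (0, 2): 1}
# Region 2 is separated from the merged pair (0, 1):
consistent = {(0, 1): 0, (1, 2): 1, (0, 2): 1}
print(triangle_violations([0, 1, 2], inconsistent))
print(triangle_violations([0, 1, 2], consistent))
```

For binary values, the three inequalities together simply forbid a clique with exactly one active boundary, which is the only labelling of three mutually adjacent regions that cannot come from a partition.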
Our co-clustering technique exploits the randomness of those partition contours that do not belong to semantic objects. In this process, the computation of region similarities is crucial to correctly match regions from different partitions. Two types of similarities are computed: intra similarities (between leaves from the same hierarchy) and inter similarities (between leaves from different hierarchies).
Previous clustering works in segmentation and cosegmentation frameworks ([Glasner2011], [Kim2012]), use the color information to compute intra similarities. We propose to compute these similarities as:
where is the length of the common boundary between leaves , and is the Bhattacharyya distance [Bhattacharyya1943] of the 8-bin RGB color histograms of regions , .
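The color term can be sketched as follows. Two choices here are assumptions rather than details taken from the paper: the histogram concatenates 8 bins per RGB channel, and the distance uses the bounded form sqrt(1 - BC), where BC is the Bhattacharyya coefficient (another common form is -ln BC). How the distance and the boundary length are combined into the similarity is not reproduced.

```python
import math


def rgb_histogram(pixels, bins=8):
    """Normalized histogram with `bins` bins per channel, concatenated.

    pixels: non-empty iterable of (r, g, b) tuples with values in 0..255.
    """
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            bin_index = min(value * bins // 256, bins - 1)
            hist[channel * bins + bin_index] += 1
    total = sum(hist)
    return [h / total for h in hist]


def bhattacharyya_distance(p, q):
    """sqrt(1 - BC) for two normalized histograms of equal length."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - coefficient))


reddish = rgb_histogram([(250, 10, 10), (240, 20, 5), (255, 0, 0)])
bluish = rgb_histogram([(10, 10, 250), (5, 20, 240), (0, 0, 255)])
print(bhattacharyya_distance(reddish, reddish))
print(bhattacharyya_distance(reddish, bluish))
```

With this form the distance stays in [0, 1], which is convenient when it has to be weighted by a boundary length.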
Inter similarities are used to create clusters combining nodes from different hierarchies. In [Glasner2011], inter similarities are computed using a HOG-based descriptor. Although this gradient information may be enough in some cases, additional descriptors able to robustly match region contours are required. However, only those descriptors that can be efficiently computed should be taken into account.
We propose to combine three simple yet effective descriptors, which are computed over the contour elements of each partition. These descriptors are combined in a feature vector associated with each contour element, which allows us to keep the additivity property that is the key to formulating our problem as a linear optimization.
Inter image similarity between regions and from partitions and respectively should be proportional to their joint probability . We consider three types of information to model differences between regions from different partitions: changes of color/illumination, deformations and small changes of position. In terms of probability, we consider these processes to be independent:
The color information is obtained from a histogram of pixels in a neighborhood of the boundary elements. Two histograms are computed in the direction of the normal to the contour element (one in the analyzed region and the other in the adjacent region) and they are averaged. To handle possible deformations, shape information around each contour element is captured with a HOG descriptor. In our work, HOGs are computed using the gPb [Maire2008] information. Finally, position changes are captured with the Euclidean distance between elements.
Similarity between contour elements is computed as , where is the feature vector of contour element that belongs to . This vector is formed as the concatenation of the three types of descriptors previously described. We allow matchings between contour elements that are closer than 20 pixels. Otherwise, .
Once both inter and intra similarities are computed for all contour elements of the leave partitions, a similarity matrix between regions is built for each pair of hierarchies.
where , are complex matrices that describe the edges orientations (computed using the gPb [Maire2008] information) of all contour elements from partitions and , and encodes the inter similarities between these elements.
Finally, the similarity matrix that measures the quality of the co-clustering is built using the information of all the inter and intra similarity matrices as in Equation 1.
4.4 Optimization process
Using the similarity matrix and the constraints presented in this section, the optimization process of Equation 2 can be formulated as:
where represents any parent node in the collection of hierarchies. The result of this optimization is a binary matrix that describes the collection of optimal partitions . Thus, nodes from the collection of hierarchies have been clustered with the same label and semantic contours are preserved through the collection.
Nowadays, it is commonly accepted that multiresolution region-based descriptions provide a rich framework for image and video analysis [Arbelaez2014], [Grundmann2010]. In this section, we extend the previous hierarchical co-clustering to a multiresolution framework as it is illustrated in Figure 4.
That is, for each hierarchy involved in the optimization process (), we cluster nodes to obtain partitions, forming a new optimal hierarchy () that represents the image at different resolution levels (). Moreover, the collection of optimal partitions generated for each resolution should keep their inter correspondences.
Let us consider a clustering problem as presented in Equation 9, from which a boundary matrix is obtained for each generated partition. The number of active boundaries in has a direct relation with the resolution of the resulting partitions and, in particular, that of intra boundaries. When imposing in the optimization process a low (high) number of intra contours, coarser (finer) resolutions are obtained. We have observed that parameterizing the search in the solution space with respect to the number of intra contours allows the algorithm to produce a set of well distributed resolutions.
Formally, given a collection of hierarchies (), their nodes are clustered to form a collection of partitions of a given resolution () by constraining the optimization problem presented in Equation 9 with an additional condition for each hierarchy:
where is the number of active boundaries to encode the leave contours, is the maximum fraction of these contours to describe the -th coarse level and represents the maximum difference in number of boundaries between consecutive levels.
This approach allows two search strategies. When , a complete set of consecutive, equal sized subspaces is analyzed. On the contrary, when a coarser sampling of the solution space is performed.
5 Multi-resolution video co-clustering
In this section we propose to particularize the technique presented in Section 4 to a multiresolution video segmentation algorithm for sequences with small variations. Note that the previous co-clustering technique could be adapted to a 3D volume approach, as in [Grundmann2010]. However, such an approach would require high memory resources (Section 1). Thus, we adopt an iterative approach as in [Galasso2014] (Figure 5).
We propose to propagate clusters along sequences at various resolutions, taking into account the information in previously processed frames. As in [Corso2012], we use pieces of video and propagate the result through the sequence. In our case, we propagate semantic contours using information from different granularities in the optimization process. Processing is forward-only and online, which keeps the algorithm efficient in terms of time and complexity.
In particular, for each image () in the sequence and for a given resolution (), we perform a joint hierarchical co-clustering with the clustering result of the two previous frames at two different scales: the resolution level under analysis and the leave partition scale (see Figure 5). Precisely, we construct the boundary matrix using the optimal partition in at level () and the leave partitions in and ( and ).
where are intra or inter boundary variables from and that encode the boundaries between clusters of and , and is the cardinality of these variables.
In turn, regions in must be merged to form and inter correspondences between clusters must be kept:
where are intra or inter boundary variables from and that encode the unions of inter and intra clusters of and .
Leave partitions ( and ) are used to allow computing fine boundary similarities, whereas boundaries from and are included to enforce previous semantic contours. With this iterative process, clusters are robustly propagated through hierarchies in an efficient manner.
6 Experimental Evaluation
In this section, we present both qualitative and quantitative evaluations of our multiresolution hierarchical co-clustering (MRHC). As our technique aims to segment sequences with small variations, we use the Video Occlusion/Object Boundary Detection Dataset [Stein2009] for evaluation and comparison with state-of-the-art methods in the fields of video segmentation ([Grundmann2010], [Corso2012], [Galasso2012]) and co-segmentation ([Joulin2012], [gunhee2012]). Comparisons have been made using the respective authors' implementations. In order to assess the contribution of the multiresolution framework (Section 5), we also evaluate the performance of our algorithm at the single level with the best overall results (OURS-SL). Moreover, based on the baseline in [Galasso2013], we consider a system that propagates labels from regions obtained with [Malik2011] using [Brox2009] (UCM-P). A random hierarchy created from the leave partitions of [Malik2011] is used as baseline technique.
The dataset includes 30 short sequences (42 objects) with indoor and outdoor scenes, noise and compression artifacts, unconstrained handheld camera motions and moving objects. For each sequence, the annotation of a single frame is provided as ground truth for segmentation assessment (Section 6.1). To assess temporal consistency (Section 6.2), we have manually annotated the remaining frames by merging regions from the leave partitions of [Malik2011].
The evaluation is performed using two types of measures. First, we use the measures presented in [Galasso2013]: boundary precision-recall (BPR) from [Malik2011] and a volume precision-recall metric (VPR). Second, as in ([Stein2009], [Glasner2011]), we use Consistency as the Jaccard index computed between a set of regions of a partition and the ground-truth and Efficiency as the minimum number of regions requested to obtain a given consistency.
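The Consistency and Efficiency measures can be illustrated with the sketch below, in which regions and the ground truth are sets of pixel indices. The exhaustive search over region subsets is an assumption made to keep the example short and is only practical for small partitions; the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index between two sets of pixel indices."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_consistency(regions, ground_truth, max_regions):
    """Best Jaccard index reachable with the union of at most max_regions regions."""
    best = 0.0
    for size in range(1, max_regions + 1):
        for combo in combinations(regions, size):
            best = max(best, jaccard(set().union(*combo), ground_truth))
    return best


def efficiency(regions, ground_truth, target):
    """Minimum number of regions needed to reach the target consistency."""
    for count in range(1, len(regions) + 1):
        if best_consistency(regions, ground_truth, count) >= target:
            return count
    return None  # target not reachable with this partition


partition = [{0, 1}, {2, 3}, {4, 5, 6, 7}]
ground_truth = {0, 1, 2, 3}
print(best_consistency(partition, ground_truth, 1))
print(efficiency(partition, ground_truth, 0.9))
```

In this toy case, one region covers at most half of the object, while two regions reproduce it exactly, so the object is represented with consistency 1.0 at an efficiency of two regions.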
In order to qualitatively assess our technique and to explore its limitations, we also analyze a subset of sequences from the SegTrack v2 Dataset [Fuxin2013], some of them containing strong deformations and rapid variations. In all the experiments, hierarchies have been obtained using [Malik2011] and resolution levels have been created per sequence ranging between the number of leaf contours ().
6.1 Segmentation assessment
In this experiment, we assess the segmentation quality of a given frame. The set of optimal partitions of this frame for all the resolution levels is considered. Then, for each efficiency value, the maximum consistency over this set of levels is selected; that is, fixing the number of regions, we select through the various resolutions the best Jaccard object representation. Moreover, the BPR curve is considered to assess the quality of segmentation boundaries (Figure 6).
Co-segmentation results have been obtained fixing the number of clusters with respect to the number of objects in the scene, as proposed by the authors ([Joulin2012], [gunhee2012]). We report the best results for up to a given number of clusters, since consistency does not improve when increasing the number of clusters. These algorithms are competitive when the object is represented with one region. Still, our technique obtains better consistency for all efficiency levels, owing to the use of hierarchies and of similarities among frames to describe objects.
Regarding video segmentation algorithms, our technique outperforms the three assessed state-of-the-art methods ([Grundmann2010], [Corso2012], [Galasso2012]). In [Corso2012], colour similarities are used to propagate supervoxel information. In contrast, our description of contours using colour, texture and distance measures obtains better segmentation accuracy and BPR for all precision levels. Although the optical flow used in [Grundmann2010] is a powerful descriptor, it is not enough to accurately segment objects in this type of sequence, especially with a low number of regions. As can be observed in Figure 6, in terms of boundaries, their recall is close to our results for large precision values. However, in terms of object area, regions selected by our algorithm represent the object with higher accuracy.
6.2 Temporal coherence assessment
In this section, we extend the previous "efficiency versus consistency" analysis to the temporal domain, in order to assess the stability of partitions along video sequences.
The sequence consistency of a label (temporal cluster) is computed averaging the consistency values obtained at each frame by the region associated to this label. Results of the best sequence consistency achieved for all the resolutions, using the number of labels represented by each efficiency level, are plotted in Figure 6. In order to complete the analysis, we also present the VPR curve as computed in [Galasso2013].
As it can be observed, sequence consistency results are very similar to segmentation consistency ones (Figure 6). This stability shows that all methods correctly maintain the coherence of the partitions along the sequence. These results validate the iterative strategies used in [Corso2012] and in our approach (see Section 5). In both volume precision-recall and consistency-efficiency values, our method outperforms the analyzed state-of-the-art approaches and only the propagation method based on [Galasso2013] obtains better volume recall for low precision values and better efficiency. This confirms the results that were reported in previous works ([Galasso2013], [Galasso2014]).
A more detailed comparison of the presented algorithms for the objects in the database can be found in Table 1.
6.3 Qualitative assessment
In this section, we present results on two sequences from the SegTrack v2 database [Fuxin2013] to qualitatively evaluate our algorithm. This database allows analyzing the limits of our technique, since video objects in it may undergo strong deformations and rapid movements.
Figure 7 shows two images of the sequence Parachute. In it, the parachute is correctly segmented along the sequence at a given resolution. Moreover, its coloured stripes are coherently segmented through the sequence. As the object shape gradually changes, our method is able to coherently segment it at several resolution levels along the video.
Figure 7 shows two images of the sequence Girl. In this sequence, a girl runs and her shape undergoes strong deformations due to rapid arm and leg movements. Although the shape of the girl is correctly identified in both partitions as the union of a few regions (high consistency at medium efficiency for segmentation), not all its parts have been coherently matched (worse efficiency for temporal coherence).
In this work, we have presented a co-clustering framework that creates a coherent region-based multiresolution representation of an image collection by clustering nodes from a collection of independent hierarchies. The co-clustering problem is formulated as a QSAP. Inconsistencies commonly derived from such optimization problems are avoided by modelling the problem through boundary variables and effectively using hierarchical constraints. This way, our method robustly creates inter and intra relations between regions from the image collection.
This co-clustering framework has been particularized to obtain a video segmentation technique that coherently segments scenes with small variations. We have adopted an iterative strategy that reduces the algorithm complexity and memory requirements while achieving high temporal coherence. We have assessed the results over the Video Occlusion/Object Boundary Detection Dataset against five state-of-the-art (SoA) techniques and three baseline ones. In all cases, our technique outperforms the SoA methods in video segmentation and co-segmentation for this type of sequence across the whole range of efficiencies. In order to promote reproducible research, all the resources of this project (code, results and evaluation protocols) are publicly available.