Co-Sparse Textural Similarity for Image Segmentation
Abstract
We propose an algorithm for segmenting natural images based on texture and color information, which leverages the cosparse analysis model for image segmentation within a convex multilabel optimization framework. As a key ingredient of this method, we introduce a novel textural similarity measure, which builds upon the cosparse representation of image patches. We propose a Bayesian approach to merge textural similarity with information about color and location. Combined with recently developed convex multilabel optimization methods, this leads to an efficient algorithm for both supervised and unsupervised segmentation, which is easily parallelized on graphics hardware. The approach provides competitive results in unsupervised segmentation and outperforms state-of-the-art interactive segmentation methods on the Graz benchmark.
1 Introduction
The segmentation of natural images is a fundamental problem in computer vision. It forms the basis of many high-level algorithms such as object recognition, image annotation, semantic scene analysis, motion estimation, and 3D object reconstruction. Despite its importance, the task of unsupervised segmentation is highly ill-posed and admittedly hard to evaluate, since the quality of a segmentation result depends on the subsequent application. The hand-drawn ground truth of the Berkeley benchmark [15] well demonstrates that different users have very different understandings of the same scene: while some people see the windows of a skyscraper as a single texture of the building, others see them as separate objects.
In this paper, we focus on two scenarios, namely unsupervised segmentation where no prior knowledge is given and hence quantitative validation is questionable, and supervised segmentation where ambiguities are solved by additional user input (scribbles or bounding boxes) and a clearly defined ground truth for performance evaluation is available – see Figure 1. In both cases, one can compute data likelihoods using color [2, 10, 17, 24, 27], texture [1, 25, 26], or location [3, 20]. While all features carry relevant information, for natural images texture features are particularly relevant. To model the dependence of texture on image location and color we extend the approach in [20] to a spatially varying color and texture model.
To extract textural information from images, methods based on sparse representations are quite successful [14]. Commonly, sparsity is exploited via the synthesis model, which assumes that every image patch can be approximated by a linear combination of a few atoms from a suitable dictionary, weighted by a very sparse coefficient vector or code. With this, the textural information is encoded in the set of active dictionary atoms, i.e. the support of the sparse code. Finding this set, however, requires solving a costly optimization problem.
In this paper, we propose a more efficient way to obtain textural information by employing the cosparse analysis model [8, 19]. In this model, the sparse image representation is determined efficiently by a simple matrix-vector product. Based on the theory developed in [8, 19], we derive a novel textural similarity measure for image patches and demonstrate how it can be applied to image segmentation. Figure 1 shows that the proposed measure, combined with an efficient convex multilabel approach, generates convincing results for both supervised and unsupervised segmentation problems.
In particular, we make the following contributions:
- The proposed approach combines the cosparse analysis model, recent convex relaxation techniques, and label cost priors within a single optimization problem.
- The cosparse analysis model is leveraged for image segmentation through a novel texture similarity measure.
- The method extends the purely color-based approach in [20] by explicitly modeling the dependence between texture, color, and location, leading to a spatially varying color and texture model.
- The approach does not rely on superpixels and thus avoids potentially incorrect pixel aggregations.
2 Co-Sparse Textural Similarity
The cosparse analysis model [8, 19] is based on the assumption that if $s \in \mathbb{R}^n$ denotes a vectorized image patch, there exists an analysis operator $O \in \mathbb{R}^{k \times n}$ with $k > n$ such that $Os$ is sparse. We refer to $Os$ as the analyzed version of $s$. The two major differences to the more commonly known synthesis model are: (i) the sparse code $Os$ is found via a simple matrix-vector multiplication, and (ii) the zero entries of $Os$ are the informative coefficients describing the underlying signal. Concretely, the textural structure of $s$ is encoded in its cosupport
$\mathcal{C}(s) := \{\, j \in \{1, \dots, k\} \mid (Os)_j = 0 \,\} \qquad (1)$
where $(Os)_j$ denotes the $j$-th entry of $Os$. Geometrically, $s$ is orthogonal to all rows of $O$ indexed by the cosupport and thus lies in the intersection of the corresponding hyperplanes.
Note that the better a signal fits the cosparse analysis model, the sparser is its analyzed version, i.e. the larger is its cosupport. Since our ultimate goal is to discriminate between distinctive textures in natural images, a measure of textural similarity should distinguish well between representative patches, i.e. patches that fit the cosparse analysis model of natural image patches, while discriminating only moderately between "outlier" patches, i.e. patches that seldom occur in natural images. This motivates us to measure the textural similarity between two patches via
$TS(s_1, s_2) := \sum_{j=1}^{k} \mathbb{1}_{\mathcal{C}(s_1)}(j)\; \mathbb{1}_{\mathcal{C}(s_2)}(j) \;=\; |\mathcal{C}(s_1) \cap \mathcal{C}(s_2)| \qquad (2)$
where $\mathbb{1}_A$ is the indicator function of a set $A$, i.e. $\mathbb{1}_A(j) = 1$ if $j \in A$ and zero otherwise. This measure has two desirable properties: 1) it distinguishes sensibly between patches that fit the model well, i.e. patches with a large cosupport; 2) it does not heavily discriminate between patches that fit the model less well.
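As an illustration, the cosupport and the intersection-based similarity measure can be sketched in a few lines of NumPy. This is a hypothetical sketch, not the paper's implementation; `Op` stands for the analysis operator (a plain matrix), and all names are illustrative:

```python
import numpy as np

def cosupport(Op, s, tol=0.0):
    """Indices where the analyzed version Op @ s vanishes (up to tol)."""
    return set(np.flatnonzero(np.abs(Op @ s) <= tol))

def textural_similarity(Op, s1, s2, tol=0.0):
    """Size of the intersection of the two cosupports, i.e. the number
    of analysis rows to which both patches are orthogonal."""
    return len(cosupport(Op, s1, tol) & cosupport(Op, s2, tol))
```

With a toy operator, two parallel patches share their cosupport and hence score high, while a patch from a different subspace scores zero.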
In our experiments, we employ an analysis operator learned according to [12] from patches extracted from natural images. As we only want to gather textural information independent of varying illumination conditions, we follow the simple bias and gain model and use patches that have been normalized to zero mean and unit norm, i.e. $\sum_{l} s_l = 0$ and $\|s\|_2 = 1$. We exclusively consider such patches for the remainder of the paper.
To identify an "average" textural structure of a set $S$ of patches that serves as their textural representative, we provide the following definition. Let $r$ be a patch with analyzed version $Or$; then we say that $r$ is a textural representative of $S$ if
$j \in \mathcal{C}(r) \;\Longleftrightarrow\; \sum_{s \in S} \mathbb{1}_{\mathcal{C}(s)}(j) \;\ge\; \tfrac{1}{2}\, |S| \qquad (3)$
In other words, the cosupport of the analyzed version of a textural representative is determined by the majority of the analyzed versions of the elements in $S$. Note that this definition is neither constructive, in the sense that a representative could easily be deduced from $S$, nor does it determine the representative uniquely. Nonetheless, it proves useful for clustering image patches according to their texture and can be efficiently approximated.
Up to now, we considered truly cosparse image patches, i.e. patches whose analyzed versions contain many coefficients that are exactly zero. However, this is an idealized assumption; in practice those patches are not truly cosparse but rather contain many coefficients that are close to zero. To account for this, we introduce the mapping $g_\nu\colon \mathbb{R}^k \to [0,1]^k$ as a smooth approximation of the indicator function of the cosupport, which is defined componentwise with a free parameter $\nu > 0$ as
$[g_\nu(z)]_j := \exp\!\left( -\, z_j^2 / \nu \right) \qquad (4)$
In fact, it is easily seen that $[g_\nu(z)]_j = 1$ if $z_j = 0$ and $[g_\nu(z)]_j \to 0$ as $|z_j| \to \infty$.
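The smooth surrogate of the cosupport indicator is a one-liner. The Gaussian-type bump below is an assumption for illustration; any smooth function that equals one at zero and decays to zero for large coefficients has the stated properties:

```python
import numpy as np

def smooth_cosupport_indicator(z, nu=0.01):
    """Componentwise smooth surrogate for the indicator 1[z_j == 0]:
    equals 1 where a coefficient is exactly zero and decays toward 0
    as |z_j| grows. The Gaussian form and the value of nu are assumed."""
    return np.exp(-np.asarray(z, dtype=float) ** 2 / nu)
```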
Using this approximation of the cosupport, our textural similarity measure in (2) for two patches $s_1$ and $s_2$ with analyzed versions $Os_1$ and $Os_2$ is approximated by
$TS(s_1, s_2) \;\approx\; k \,-\, \big\| g_\nu(Os_1) - g_\nu(Os_2) \big\|_1 \qquad (5)$
with $\|\cdot\|_1$ denoting the $\ell_1$ norm. Using this, we approximate the cosupport indicator of a textural representative of a set $S$ by the componentwise median of the set $\{ g_\nu(Os) \mid s \in S \}$, i.e.
$\hat c_S := \operatorname{median}\{\, g_\nu(Os) \mid s \in S \,\} \qquad (6)$
Note that this is in accordance with the well-known fact that the centroid of a cluster with respect to the $\ell_1$ distance is given by the median of all corresponding cluster points.
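The approximate similarity and the median representative can be sketched together. Again a hypothetical NumPy sketch under the same assumptions as above (Gaussian-type smooth indicator, operator `Op` as a plain matrix, illustrative parameter values):

```python
import numpy as np

def indicator(Op, s, nu=0.01):
    # smoothed cosupport indicator of the analyzed version Op @ s
    return np.exp(-(Op @ s) ** 2 / nu)

def approx_similarity(Op, s1, s2, nu=0.01):
    """k minus the l1 distance of the smoothed indicators: maximal
    (equal to k) for patches with identical approximate cosupports."""
    k = Op.shape[0]
    return k - np.abs(indicator(Op, s1, nu) - indicator(Op, s2, nu)).sum()

def representative_indicator(Op, patches, nu=0.01):
    """Componentwise median of the smoothed indicators: the l1 centroid
    of the cluster, approximating the representative's cosupport."""
    return np.median(np.stack([indicator(Op, s, nu) for s in patches]), axis=0)
```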
3 Variational Cosparse Image Segmentation
In this section, we derive a Bayesian inference formulation for both supervised and unsupervised image segmentation based on the proposed textural similarity measure. We explicitly model the dependence of texture and color on the location in the image to account for texture variations within regions, e.g. a sky which is partially covered by clouds.
3.1 A Space Variant Texture and Color Distribution
For an image domain $\Omega \subset \mathbb{R}^2$, let $I\colon \Omega \to \mathbb{R}^3$ denote the input color (or gray-scale) image. The segmentation problem can be solved by computing a labeling $u\colon \Omega \to \{1, \dots, n\}$ that indicates which of the $n$ regions each pixel belongs to, i.e. $\Omega_i = \{x \in \Omega \mid u(x) = i\}$. In a Bayesian framework the labeling can be computed by maximizing the conditional probability
$\arg\max_{u}\; P(u \mid I) \;=\; \arg\max_{u}\; P(I \mid u)\, P(u) \qquad (7)$
In the following, we will model the dependence of color and texture on the image location. Let $T_x$ denote a small gray-value texture patch centered at pixel $x$. With the assumption that a pixel's color and texture depend on its location $x$ and its label $u(x)$, but are independent of the labels of other pixels, we obtain
$P(I \mid u) \;=\; \prod_{i=1}^{n} \; \prod_{x :\, u(x) = i} P\big(I(x), T_x, x \mid u(x) = i\big) \qquad (8)$
In the following, we derive the required probability $P(I(x), T_x, x \mid u(x) = i)$ that a pixel $x$ with color $I(x)$ and texture patch $T_x$ belongs to segment $i$.
Supervised Segmentation
Given the set of scribble samples $S_i$ consisting of locations, colors, and texture patches for each segment $i$, i.e.
$S_i := \{\, (x_{ij}, I_{ij}, T_{ij}) \mid j = 1, \dots, m_i \,\} \qquad (9)$
we can, assuming independence of color and texture, compute the likelihood of a pixel $x$ belonging to region $i$ as the joint probability
$P\big(I(x), T_x, x \mid u(x) = i\big) \;=\; P\big(I(x), x \mid u(x) = i\big)\; P\big(T_x \mid u(x) = i\big) \qquad (10)$
Following [20], we compute a space-dependent color likelihood based on the Parzen kernel density estimator [21]
$\hat P\big(I(x), x \mid u(x) = i\big) \;=\; \frac{1}{m_i} \sum_{j=1}^{m_i} k_{\sigma_i(x)}(x - x_{ij})\; k_{\sigma}\big(I(x) - I_{ij}\big) \qquad (11)$
where $k_\sigma$ denotes a kernel function; we use a Gaussian with variance $\sigma^2$. Parzen density estimators come with the advantage that they can represent arbitrary kinds of color distributions and provably converge to the true density for infinitely many samples. The estimated color distributions have been shown to yield accurate results for interactive segmentation [20]. Just like the authors, we adapt the variance of the spatial kernel to the distance of the current pixel from the nearest user scribble of this class, $\sigma_i(x) = \alpha \min_{j} \|x - x_{ij}\|$, where $x_{ij}$ ranges over the scribble locations of segment $i$ and $\alpha$ is a fixed scaling factor. The idea is that each color kernel is weighted by a spatial kernel in order to account for the uncertainty of the estimator at locations far away from the scribble points and for the certainty very close to them.
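A minimal sketch of such a Parzen estimate with a scribble-distance-adaptive spatial bandwidth is given below. All names and parameter values are illustrative, the kernels are unnormalized, and the exact bandwidth rule is an assumption in the spirit of [20], not the authors' code:

```python
import numpy as np

def color_location_likelihood(x, color, scribble_xy, scribble_rgb,
                              sigma_color=0.05, alpha=1.0):
    """Parzen-style estimate of P(I(x), x | class) from one class's
    scribbles. The spatial bandwidth grows with the distance from x
    to the nearest scribble of that class (adaptive kernel width)."""
    d = np.linalg.norm(scribble_xy - x, axis=1)
    sigma_x = alpha * d.min() + 1e-6           # assumed adaptive bandwidth
    w_space = np.exp(-d ** 2 / (2.0 * sigma_x ** 2))
    w_color = np.exp(-np.sum((scribble_rgb - color) ** 2, axis=1)
                     / (2.0 * sigma_color ** 2))
    return float(np.mean(w_space * w_color))
```

A pixel at a scribble location with a matching color should then score much higher than a distant pixel with a different color.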
We will now extend the spatially varying color model of [20] to a spatially varying texture distribution. To that end, we need to derive the probability $P(T_x \mid u(x) = i)$ that a patch $T_x$ belongs to segment $i$. As our goal is to extract local textural information in the vicinity of a pixel $x$, we multiply each patch elementwise with a Gaussian mask to assign more weight to the central pixels prior to normalization according to Section 2. From these patches, we compute the approximated cosupport indicator of a textural representative of each set of scribble points according to Equation (6), i.e.
$\hat c_i := \operatorname{median}\{\, g_\nu(O\, T_{ij}) \mid j = 1, \dots, m_i \,\} \qquad (12)$
Using this, we assign to each pixel $x$ the a posteriori probability of belonging to class $i$ depending on the corresponding patch $T_x$ as
$P\big(T_x \mid u(x) = i\big) \;\propto\; \exp\!\left( -\, \frac{\big\| g_\nu(O\, T_x) - \hat c_i \big\|_1}{\tau_i(x)} \right) \qquad (13)$
where the parameters $\tau_i(x)$ control the variance of the distribution. They can be interpreted as a measure of how much we trust the distance $\| g_\nu(O\, T_x) - \hat c_i \|_1$ for deciding to which class $T_x$ belongs. Large values of $\tau_i$ assign a pixel to each of the classes with approximately equal probability, whereas small values of $\tau_i$ assign $T_x$ to the most similar class with very high probability. By setting $\tau_i(x)$ proportional to $\sigma_i(x)$, we obtain a spatially varying texture distribution, which favors spatially close texture classes.
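The class probabilities above amount to a softmax-like weighting of l1 distances between a patch's smoothed cosupport indicator and the class representatives. A hypothetical sketch (names and normalization over classes are assumptions for illustration):

```python
import numpy as np

def texture_posterior(t_ind, class_reps, tau):
    """Normalized class probabilities from l1 distances between a patch's
    smoothed cosupport indicator t_ind and each class representative;
    per-class temperatures tau control how sharply the nearest class wins."""
    d = np.array([np.abs(t_ind - c).sum() for c in class_reps])
    p = np.exp(-d / np.asarray(tau, dtype=float))
    return p / p.sum()
```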
Unsupervised Segmentation
The approach introduced above can also be used for unsupervised image segmentation. In this case no color, texture, or location samples are given a priori. Location likelihoods cannot be obtained without sample data or user input, so we rely on color and texture only. To find segment classes, we first apply k-means to the pixel colors to obtain color classes, and k-medians to the approximated cosupport indicators of the image patches to obtain texture classes. With that, the initial segments are given as all combinations of texture classes with color classes (because the same texture with a different color should be treated as a different segment). The segment class representatives are given as the centroids of the corresponding texture and color clusters. With these representatives, we assign each pixel a texture and a color likelihood as in (13). The number of segments is automatically reduced by a minimum description length (MDL) prior, which is added to the variational approach introduced in the next section.
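The clustering-based initialization can be sketched as follows. This is a plain k-medians (l1 distances, componentwise median centroids) plus the cross-product label construction; the initialization scheme, iteration count, and helper names are assumptions, not the authors' implementation:

```python
import numpy as np

def kmedians(X, k, iters=20, seed=0):
    """Plain k-medians: assign by l1 distance, update centers as the
    componentwise median of each cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.abs(X[:, None, :] - centers[None]).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = np.median(X[labels == j], axis=0)
    return labels, centers

def combine_labels(tex_labels, col_labels, n_col):
    """Initial segment classes: every (texture class, color class) pair."""
    return tex_labels * n_col + col_labels
```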
3.2 Variational Formulation
Based on the segment probabilities given in (10), we now define an energy optimization problem for the tasks of supervised and unsupervised segmentation. To this end, we specify the prior in (7) to favor segmentation regions of shorter boundary
$P(u) \;\propto\; \exp\!\left( -\,\lambda \sum_{i=1}^{n} \operatorname{Per}_g(\Omega_i) \right) \qquad (14)$
where $\operatorname{Per}_g(\Omega_i)$ denotes the perimeter of each region $\Omega_i$, measured in the metric $g$. For example, the commonly used choice $g(x) = \exp(-\beta\, \|\nabla I(x)\|)$ with a fixed parameter $\beta > 0$ favors boundaries coinciding with strong intensity gradients and, thus, prevents oversmoothed boundaries. The weighting parameter $\lambda$ balances the impact of the data term and the boundary length.
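An edge-dependent boundary metric of this kind is straightforward to compute from image gradients. The exponential form and the parameter value below are assumptions for illustration; the weight is close to one in flat regions and drops at strong intensity edges:

```python
import numpy as np

def edge_weight(gray, beta=5.0):
    """Edge indicator that is small where the intensity gradient is
    large, so that region boundaries are cheap along image edges."""
    gy, gx = np.gradient(gray.astype(float))   # derivatives along rows, cols
    return np.exp(-beta * np.sqrt(gx ** 2 + gy ** 2))
```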
For supervised segmentation, the number of segments is known; for unsupervised segmentation, it must be determined automatically by means of an MDL prior [29]. Let us define the function $\delta_i$,
$\delta_i(u) := \begin{cases} 1, & \text{if } \exists\, x \in \Omega:\; u(x) = i, \\ 0, & \text{otherwise,} \end{cases} \qquad (15)$
which indicates if the label $i$ occurs in the segmentation. By multiplying (14) with the MDL prior
$P_{\mathrm{MDL}}(u) \;\propto\; \exp\!\left( -\,\gamma \sum_{i=1}^{n} \delta_i(u) \right) \qquad (16)$
we penalize the occurrence of each label by a cost $\gamma$. With increasing $\gamma$, the number of labels in the segmentation result is reduced and the number of segments is automatically optimized. For supervised segmentation, $\gamma$ can be set to zero. Taking the negative logarithm of the posterior (7) with the priors (14) and (16) yields the energy
$E(u) \;=\; \sum_{i=1}^{n} \int_{\Omega_i} f_i(x)\, dx \;+\; \lambda \sum_{i=1}^{n} \operatorname{Per}_g(\Omega_i) \;+\; \gamma \sum_{i=1}^{n} \delta_i(u) \qquad (17)$
with $f_i(x) := -\log P\big(I(x), T_x, x \mid u(x) = i\big)$, which we minimize in the following section.
4 Minimization via Convex Relaxation
Problem (17) is the continuous equivalent of the Potts model, whose solution is known to be NP-hard. However, computationally tractable convex relaxations of this functional have been proposed in [4, 5, 13, 23, 30]. Due to the convexity of the problem, the resulting solutions have the following properties: firstly, the segmentation is independent of the initialization; secondly, we obtain globally optimal segmentations for the case of two regions and near-optimal, in practice often globally optimal, solutions for the multi-region case. In addition, the algorithm can be parallelized, leading to average computation times of a few seconds per image on standard GPUs.
4.1 Conversion to a Convex Differentiable Problem
To apply convex relaxation techniques, we first represent the regions $\Omega_i$ by indicator functions $u_i \in \mathrm{BV}(\Omega; \{0,1\})$, where
$u_i(x) = \begin{cases} 1, & \text{if } u(x) = i, \\ 0, & \text{otherwise.} \end{cases} \qquad (18)$
Here $\mathrm{BV}$ denotes the space of functions of bounded variation, i.e. functions with finite total variation. For a valid segmentation we require that the indicator functions sum to one at each location, so that each pixel is assigned to exactly one label. Hence,
$\mathcal{B} := \Big\{ (u_1, \dots, u_n) \in \mathrm{BV}(\Omega; \{0,1\})^n \;\Big|\; \sum_{i=1}^{n} u_i(x) = 1 \;\; \forall\, x \in \Omega \Big\} \qquad (19)$
denotes the set of valid segmentations.
To rewrite energy (17) in terms of the indicator functions $u_i$, we have to rewrite the label priors. We start with the boundary length prior in (14). The boundary of the set indicated by $u_i$ can be written by means of the total variation. Let $Du_i$ denote the distributional derivative of $u_i$ (which is $\nabla u_i\, dx$ for differentiable $u_i$), and let $\xi_i \in C_c^1(\Omega; \mathbb{R}^2)$ with $|\xi_i(x)| \le g(x)$ denote the dual variables, where $C_c^1$ is the space of smooth functions with compact support.
Then, following the coarea formula [9], the weighted perimeter of $\Omega_i$ is equivalent to the weighted total variation
$\operatorname{Per}_g(\Omega_i) \;=\; \int_\Omega g(x)\, |D u_i| \;=\; \sup_{\xi_i} \int_\Omega u_i \operatorname{div} \xi_i \; dx \qquad (20)$
[30]. The latter transformation in (20) follows from integration by parts and the compact support of the dual variables $\xi_i$.
Following [29], let us rewrite the minimum description length prior in (16) in terms of the region indicator functions $u_i$. To this end, we define maximum value variables $m_i$, which yields
$\delta_i(u) \;=\; \max_{x \in \Omega}\, u_i(x) \;=:\; m_i \qquad (21)$
To obtain a convex optimization problem we relax the set
$\tilde{\mathcal{B}} := \Big\{ (u_1, \dots, u_n) \in \mathrm{BV}(\Omega; [0,1])^n \;\Big|\; \sum_{i=1}^{n} u_i(x) = 1 \;\; \forall\, x \in \Omega \Big\} \qquad (22)$
Furthermore, the constraints on the maximum value variables are relaxed to the convex constraints $u_i(x) \le m_i$ for all $x \in \Omega$, and the $m_i$ are minimized. These constraints are introduced into the optimization problem using Lagrangian multipliers $\beta_i\colon \Omega \to \mathbb{R}$ with $\beta_i \ge 0$. With this, we finally obtain the convex optimization problem
$\min_{(u,\, m)\, \in\, \tilde{\mathcal{B}} \times [0,1]^n} \;\; \max_{\xi,\; \beta \ge 0} \;\; \sum_{i=1}^{n} \left( \int_\Omega u_i f_i \, dx \;+\; \lambda \int_\Omega u_i \operatorname{div} \xi_i \, dx \;+\; \gamma\, m_i \;+\; \int_\Omega \beta_i\, (u_i - m_i)\, dx \right) \qquad (23)$
with $|\xi_i(x)| \le g(x)$.
4.2 Implementation
To solve the relaxed convex optimization problem, we employ a primal-dual algorithm proposed in [23]. Essentially, it alternates projected gradient descent in the primal variables $u$ and $m$ with projected gradient ascent in the dual variables $\xi$ and $\beta$. An over-relaxation step in the primal variables gives rise to auxiliary variables $\bar u$ and $\bar m$:
$\begin{aligned} \xi_i^{t+1} &= \Pi_{K_g}\big( \xi_i^{t} - \sigma_\xi\, \lambda\, \nabla \bar u_i^{t} \big), \\ \beta_i^{t+1} &= \Pi_{\ge 0}\big( \beta_i^{t} + \sigma_\beta\, (\bar u_i^{t} - \bar m_i^{t}) \big), \\ u^{t+1} &= \Pi_{\tilde{\mathcal{B}}}\big( u^{t} - \tau_u\, (f + \lambda \operatorname{div} \xi^{t+1} + \beta^{t+1}) \big), \\ m_i^{t+1} &= \Pi_{[0,1]}\Big( m_i^{t} - \tau_m\, \big( \gamma - \textstyle\int_\Omega \beta_i^{t+1}\, dx \big) \Big), \\ \bar u^{t+1} &= 2 u^{t+1} - u^{t}, \qquad \bar m^{t+1} = 2 m^{t+1} - m^{t}, \end{aligned} \qquad (24)$
with $K_g := \{ \xi \mid |\xi_i(x)| \le g(x) \}$.
where $\Pi$ denotes the projection onto the respective convex set and $\sigma, \tau$ denote the step sizes for the dual and primal variables, which are optimized based on [22]. The projections onto $K_g$ and the nonnegative orthant are straightforward; the projection onto the simplex $\tilde{\mathcal{B}}$ is given in [16]. As shown in [23], the algorithm (24) provably converges to a minimizer of the relaxed problem.
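The per-pixel simplex projection is the only nontrivial projection. The sketch below uses the standard sorting-based projection onto the probability simplex rather than Michelot's finite algorithm [16]; both compute the same Euclidean projection, and this variant is shorter to state:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {u : u >= 0, sum(u) = 1},
    via the sorting-based threshold computation."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1] # last index with positive gap
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)
```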
Due to the relaxation we may end up with non-binary solutions $u_i(x) \in [0,1]$. To obtain binary solutions in the set $\mathcal{B}$, we assign each pixel to the label with maximum value, i.e. $u(x) = \arg\max_i u_i(x)$. This operation is known to preserve optimality in the case of two regions [5]. In the multi-region case, optimality bounds can be computed from the energy difference between the minimizer of the relaxed problem and its reprojected version. Typically the projected solution deviates less than 1% from the optimal energy, i.e. the results are very close to global optimality [20].
5 Experiments and Results
To evaluate the proposed algorithm we apply it to the interactive Graz benchmark [25] for supervised segmentation and to images from the well-known Berkeley segmentation database [15] for unsupervised segmentation. We compare our approach against state-of-the-art segmentation algorithms. For all experiments we use a fixed patch size and a two-times overcomplete analysis operator, i.e. $k = 2n$. We do not require any training of the operator on the benchmark training sets but use it as is, avoiding overfitting to specific benchmarks. The parameter $\nu$ in (4), required to measure the textural similarity, was fixed across all experiments. The textural similarity analysis is based only on highly parallelizable filter operations. Due to the inherently parallel structure of the optimization problem in (24), the algorithm can be efficiently implemented on graphics hardware. The experiments were carried out on an Intel Core i7-3770 3.4 GHz CPU with an NVIDIA GeForce GTX 580 GPU.
The average computation time per image on the Berkeley database (image size 321x481) was six seconds. For comparison, the following runtimes per image were reported: 3 minutes for Yang et al. [28], 1 minute for Mobahi et al. [18], 2 seconds for Santner et al. [25], and 1.5 seconds for Nieuwenhuis et al. [20].
5.1 Results on the Graz Benchmark
We employ the Graz benchmark [25] for evaluating our supervised segmentation algorithm. It consists of 262 scribble-ground-truth pairs from 158 natural images containing between 2 and 13 user-labeled segments. Here, we used a brush size of 13 pixels in diameter for scribbling, as done by Santner et al. [25], and a fixed regularization weight $\lambda$. To rank our method, we compare our results with the Random Walker algorithm by Grady [11], a learning-based approach that combines color and texture features by Santner et al. [25], and the approach based on spatially varying color distributions by Nieuwenhuis and Cremers [20], which combines color and spatial information. Table 1 shows the average Dice score [7] for all methods. This score compares the overlap of each region $\Omega_i$ with its ground truth $\bar\Omega_i$:
$\operatorname{dice}(\Omega_i, \bar\Omega_i) \;=\; \frac{2\, |\Omega_i \cap \bar\Omega_i|}{|\Omega_i| + |\bar\Omega_i|} \qquad (25)$
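For label maps, the per-region Dice overlap is a two-liner (a sketch; the benchmark's own evaluation code may differ in how it averages over regions):

```python
import numpy as np

def dice_score(seg, gt, label):
    """Dice overlap of one region: 2|A ∩ B| / (|A| + |B|), where A and B
    are the pixel sets carrying `label` in segmentation and ground truth."""
    a = (np.asarray(seg) == label)
    b = (np.asarray(gt) == label)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())
```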
The results show that the proposed method outperforms state-of-the-art supervised segmentation approaches. Figure 2 shows qualitative comparisons for some images of the Graz benchmark where texture is important to separate the regions: the sign on the wall can only be distinguished from the background by texture, the ground beneath the walking men changes color due to lighting and can only be recognized by texture, and the towels on the table are also segmented correctly only by using texture information. The average computation time on the Graz benchmark is 2 seconds, which is comparable to Santner et al. [25] with 2 seconds and Nieuwenhuis et al. [20] with 1.5 seconds.
Method  Score 

Grady [11], Random Walker  0.855 
Santner et al. [25], RGB, no texture  0.877 
Nieuwenhuis & Cremers [20], space-constant  0.889 
Santner [25], CIELab plus texture  0.927 
Nieuwenhuis & Cremers [20], space-varying  0.931 
Proposed (spatially varying cosparse texture)  0.935 
5.2 Results on the Berkeley Segmentation Database
The Berkeley segmentation database [15] contains 300 natural images, 200 for training and 100 for testing. It serves for the evaluation of contour detectors and unsupervised segmentation algorithms. The ground truth is given by the overlay of hand-drawn human segmentations. We fix $\lambda$, $\gamma$, and the initial number of regions for all images. Benchmark results with respect to the boundary displacement error (BE), the probabilistic Rand index (PRI), the global consistency error (GCE), and the variation of information (VOI) are given in Table 2. While the performance of the proposed method in terms of similarity to human annotations is only mediocre, we observed that in many cases the segmentations are superior to those of state-of-the-art methods.
Figure 3 shows comparisons of computed segmentations with the stateoftheart segmentation methods, which do not compute edges but closed object segments. We compare against the methods by Yang et al. [28], Mignotte [17] and Mobahi et al. [18]. The results show that the proposed method visually outperforms the other algorithms on a number of images where it better captures the objects in the scene. The resulting segments are more coherent (see for example the first image of the brown animal) and the segment boundaries follow more closely the actual object boundaries instead of texture edges (see for example the legs of the wolf, which are preserved, or the skiing person). We show more results in the supplemental material.
The average computation time per image of our algorithm on the Berkeley database with sizes of 321x481 was six seconds, which is due to the larger number of labels in this benchmark. This is a strong improvement in efficiency compared to other methods for unsupervised segmentation; e.g., Yang et al. [28] reported 3 minutes and Mobahi et al. [18] 1 minute per image.
Method  BE  PRI  VOI  GCE 

NCut [26]  17.15  0.7242  2.9061  0.2232 
FH [10]  16.67  0.7139  3.3949  0.1746 
Meanshift [6]  14.41  0.7958  1.9725  0.1888 
Proposed  13.66  0.7244  2.6156  0.3412 
Mobahi et al.[18]  12.681  0.807  1.705   
Yang et al. [28]  9.8962  0.7627  2.0236  0.1877 
Mignotte [17]  8.9951  0.7882  2.3035  0.2114 
6 Conclusion
We introduced a framework for the segmentation of natural images which is applicable to both supervised and unsupervised image segmentation. We proposed a new measure of textural similarity based on the cosparse representation of image patches. From this measure, a data likelihood is derived and integrated into a Bayesian maximum a posteriori estimation scheme in order to combine color, texture, and location information. The arising cost functional is minimized by means of convex relaxation techniques. With our efficient GPU implementation of the convex relaxation, the overall algorithm for multi-region segmentation converges within about two to six seconds for typical images. Moreover, the approach outperforms state-of-the-art methods on the Graz segmentation benchmark.
References
 [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Trans. on Patt. Anal. and Mach. Intell., 33(5), 2011.
 [2] Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, 2001.
 [3] T. Brox and D. Cremers. On local region models and a statistical interpretation of the piecewise smooth Mumford-Shah functional. Intern. J. of Comp. Vision, 84:184–193, 2009.
 [4] A. Chambolle, D. Cremers, and T. Pock. A convex approach for computing minimal partitions. Technical report, TR200805, University of Bonn, Germany, 2008.
 [5] T. Chan, S. Esedoḡlu, and M. Nikolova. Algorithms for finding global minimizers of image segmentation and denoising models. SIAM Journal on Applied Mathematics, 66(5):1632–1648, 2006.
 [6] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:603–619, 2002.
 [7] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297–302, 1945.
 [8] M. Elad, P. Milanfar, and R. Rubinstein. Analysis versus synthesis in signal priors. Inverse Problems, 3(3):947–968, 2007.
 [9] H. Federer. Geometric Measure Theory. Springer, 1996.
 [10] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graphbased image segmentation. Int. J. Comput. Vision, 59(2):167–181, 2004.
 [11] L. Grady. Random walks for image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(11):1768–1783, 2006.
 [12] S. Hawe, M. Kleinsteuber, and K. Diepold. Analysis Operator Learning and Its Application to Image Reconstruction. IEEE Trans. on Image Process., 22(6):2138–2150, 2013.
 [13] J. Lellmann, J. Kappes, J. Yuan, F. Becker, and C. Schnörr. Convex multiclass image labeling by simplexconstrained total variation. Technical report, HCI, IWR, University of Heidelberg, 2008.
 [14] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In CVPR, pages 1–8, 2008.
 [15] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
 [16] C. Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of $\mathbb{R}^n$. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.
 [17] M. Mignotte. Mdsbased segmentation model for the fusion of contour and texture cues in natural images. Computer Vision and Image Understanding, 2012.
 [18] H. Mobahi, S. Rao, A. Yang, S. Sastry, and Y. Ma. Segmentation of natural images by texture and boundary compression. International Journal of Computer Vision, 95, 2011.
 [19] S. Nam, M. E. Davies, M. Elad, and R. Gribonval. The Cosparse Analysis Model and Algorithms. Applied and Comput. Harmonic Analysis, 34(1):30–56, 2013.
 [20] C. Nieuwenhuis and D. Cremers. Spatially varying color distributions for interactive multilabel segmentation. IEEE Trans. on Patt. Anal. and Mach. Intell., 2012.
 [21] E. Parzen. On the estimation of a probability density function and the mode. Annals of Mathematical Statistics, 1962.
 [22] T. Pock and A. Chambolle. Diagonal preconditioning for first order primaldual algorithms in convex optimization. In ICCV, pages 1762–1769, 2011.
 [23] T. Pock, D. Cremers, H. Bischof, and A. Chambolle. An algorithm for minimizing the piecewise smooth MumfordShah functional. In ICCV, 2009.
 [24] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph., 23(3):309–314, 2004.
 [25] J. Santner, T. Pock, and H. Bischof. Interactive multilabel segmentation. In ACCV, 2010.
 [26] J. Shi and J. Malik. Normalized cuts and image segmentation. In CVPR, pages 731–737, 1997.
 [27] M. Unger, T. Pock, D. Cremers, and H. Bischof. TVSeg - interactive total variation based image segmentation. In BMVC, September 2008.
 [28] A. Yang, J. Wright, Y. Ma, and S. Sastry. Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding, 2008.
 [29] J. Yuan and Y. Boykov. TVbased multilabel image segmentation with label cost prior. In BMVC, 2010.
 [30] C. Zach, D. Gallup, J.M. Frahm, and M. Niethammer. Fast global labeling for realtime stereo using multiple plane sweeps. In VMV, 2008.