Image Segmentation using Sparse Subset Selection
Abstract
In this paper, we present a new image segmentation method based on the concept of sparse subset selection. Starting with an oversegmentation, we adopt local spectral histogram features to encode the visual information of the small segments into high-dimensional vectors, called superpixel features. The superpixel features are then fed into a novel convex model which efficiently leverages them to group the superpixels into a proper number of coherent regions. Our model automatically determines the optimal number of coherent regions and the assignment of superpixels that shapes the final segments. To solve our model, we propose a numerical algorithm based on the alternating direction method of multipliers (ADMM), whose iterations consist of two highly parallelizable subproblems. We show that each subproblem enjoys a closed-form solution, which makes the ADMM iterations computationally very efficient. Extensive experiments on benchmark image segmentation datasets demonstrate that our proposed method, in combination with an oversegmentation, can provide high-quality results that are competitive with the existing state-of-the-art methods.
1 Introduction
Image segmentation is a fundamental and challenging task in computer vision with diverse applications in various areas, such as video segmentation [26, 27], object segmentation [11, 15, 20, 25], and semantic segmentation [9, 31]. The primary challenges of image segmentation are rooted in the diversity and ambiguity of the visual textures encountered in input images. The solution to these challenges has been the subject of numerous research studies in recent years [1, 17, 54].
One of the major challenges in image segmentation is determining the optimal number of coherent regions. Depending on the segmentation methodology, this parameter can be calculated from the distribution of image features [54], given as prior knowledge [29, 32, 56], or set to a constant value [1]. The performance of segmentation methods depends heavily on the right choice of this parameter. Figure 1 illustrates the segmentation results obtained for various choices of it. When the number of regions is overestimated (shown in Figure 1(b)), each coherent region may be divided into many separate segments. Merging these segments back requires many time-consuming and complicated steps, which in turn increase the computational complexity and decrease the performance of the algorithm. On the other hand, when the number of regions is underestimated (shown in Figure 1(c)), some coherent regions are forced to merge, which leads to poor segmentation quality. Figure 1(d) shows that our method achieves a high-quality segmentation by properly determining this parameter.
Generally, determining the number of coherent regions in image segmentation is closely related to the problem of finding the optimal number of clusters in other areas [3, 46, 49]. These problems are similar in the sense that both seek the optimal number of groups. However, they may differ depending on the nature and intrinsic properties of the input data. For instance, in image segmentation, the spatial relationship among image features is an important property of the data. Features may also have more specific properties depending on the feature extraction procedure. These properties need to be taken into account when determining the optimal number of coherent regions.
In this paper, we adopt local spectral histogram (LSH) features [34] to model the input image. These features are computed by averaging the distributions of visual properties (such as color, texture, etc.) over a local patch centered at each pixel. Therefore, they can be considered powerful tools for encoding local texture information. As LSH features are computed by averaging distributions over a local neighborhood, they are always nonnegative, and features belonging to the same coherent region are linearly dependent on each other. Our method leverages these properties to develop a convex model based on the concept of sparse subset selection. The main contributions of this work can be summarized as follows:
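As a rough illustration of how such features behave, the sketch below averages a plain intensity histogram over a square patch around each pixel. The actual LSH features of [34] aggregate filter-bank responses over several channels, so this function and its parameters are a simplified stand-in rather than the paper's feature extractor:

```python
import numpy as np

def local_histogram_features(image, patch_radius=5, n_bins=8):
    """Toy stand-in for local spectral histogram (LSH) features:
    for each pixel, average a normalized intensity histogram over the
    patch centered at that pixel. Pixel values are assumed in [0, 1]."""
    h, w = image.shape
    # quantize each pixel into one of n_bins intensity bins
    bins = np.minimum((image * n_bins).astype(int), n_bins - 1)
    feats = np.zeros((h, w, n_bins))
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - patch_radius), min(h, y + patch_radius + 1)
            x0, x1 = max(0, x - patch_radius), min(w, x + patch_radius + 1)
            hist = np.bincount(bins[y0:y1, x0:x1].ravel(), minlength=n_bins)
            feats[y, x] = hist / hist.sum()  # nonnegative, sums to one
    return feats
```

Because every feature is an averaged, normalized histogram, the resulting vectors are nonnegative, and pixels drawn from the same homogeneous texture produce approximately linearly dependent features, which is the property our model exploits.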
I: We design an effective convex model based on the properties of LSH features which automatically determines the optimal number of coherent regions and the assignment of pixels.
II: We develop a parallel numerical algorithm based on the alternating direction method of multipliers [4, 18] whose iterations consist of two subproblems with closed-form solutions. We show the proposed algorithm can solve our model significantly faster than standard convex solvers [40, 48, 50] while maintaining high accuracy.
III: We conduct extensive experiments on three commonly used datasets, BSD300 [38], BSD500 [1], and MSRC [45], to show our results are competitive with those of state-of-the-art methods.
The remainder of this paper is structured as follows: Section 2 briefly reviews related work; Section 3 explains our method in detail; Section 4 provides experimental results; Section 5 concludes the paper.
Notation: Throughout this paper, matrices, vectors, and scalars are denoted by boldface uppercase, boldface lowercase, and italic lowercase letters, respectively. For a given matrix $\mathbf{A}$, $a_{ij}$ denotes the element at row $i$ and column $j$, $\|\mathbf{A}\|_F$ indicates the Frobenius norm, and $\|\mathbf{A}\|_{1,\infty}$ is the $\ell_1$ norm of the $\ell_\infty$ norms of the rows of $\mathbf{A}$. For a given vector $\mathbf{a}$, symbols $\|\mathbf{a}\|_2$, $\operatorname{diag}(\mathbf{a})$, and $a_i$ denote the standard Euclidean norm, a diagonal matrix formed by the elements of $\mathbf{a}$, and the $i$-th element of $\mathbf{a}$, respectively. Symbol $\operatorname{tr}(\cdot)$ stands for the trace operator, $\mathbb{R}_+$ indicates the set of positive real numbers, and $\mathbf{1}$ is a column vector of all ones of appropriate dimension.
2 Related works
Oversegmentation is obtained by partitioning the input image into multiple small homogeneous regions, called superpixels. Recent segmentation algorithms usually start from an oversegmentation and merge similar superpixels to shape the final segments [17, 32, 56]. Fu et al. [17] proposed a pipeline of three effective post-processing steps applied to an oversegmentation to shape the final segments. Li et al. [32] suggested constructing a bipartite graph over multiple oversegmentations provided by [7] and [16]; spectral clustering is then applied to the graph to form the final segments. Ren et al. [43] presented a method which constructs a cascade of boundary classifiers and iteratively merges the superpixels of an oversegmentation to build the final segments.
One notable segmentation method is presented by Arbelaez et al. [1], which reduces the problem of image segmentation to contour detection. The method combines multiple contour cues from different image layers to create a contour map, called gPb. The contour map and its corresponding hierarchical segmentation are further utilized by several algorithms as an initial oversegmentation [10, 19, 28, 33]. Liu et al. [33] trained a classifier over the gPb results to construct a hierarchical region merging tree; the classifier then iteratively merges the most similar regions to shape the final segments. Gao et al. [19] proposed constructing a graph over the gPb results based on the spatial and visual information of superpixels, along with a model that partitions the graph into multiple components, each corresponding to a final segment. Recently, a widely used extension of gPb, called Multiscale Combinatorial Grouping (MCG), was proposed by Arbelaez et al. [2]. The method combines information obtained from multiple image scales to generate a hierarchical segmentation. Yu et al. [53] proposed a nonlinear embedding method based on a regularized objective, integrated into the MCG framework to provide better local distances among the superpixels. Chen et al. [6] realigned the MCG results by modifying the depth of segments in its hierarchical structure.
There are some recently developed methods based on deep learning for many related tasks, such as contour detection [30, 37] and semantic image segmentation [24, 35, 41, 55]. Although these methods are able to exploit more sophisticated and complex representative features, they are often highly demanding in terms of training data and training time. Therefore, they may not be the most appropriate choice for some applications. To illustrate, consider natural image segmentation, in which many unknown and diverse patterns are likely to be present in each single image. This implies that we may have an insufficient number of training samples per pattern. Motivated by this, we propose a new image segmentation method based on the concept of sparse subset selection [12, 14]. The method starts with an oversegmentation (e.g., MCG) and uses an effective convex model to group the superpixels into a proper number of coherent regions. Our work is roughly similar to the factorization-based segmentation (Fact) algorithm [54] in the sense that both use local spectral histogram (LSH) features to model the input image and seek to estimate the optimal number of coherent regions. However, our method differs from Fact in two major ways: (1) Fact determines the optimal number of coherent regions and the assignment of pixels in two consecutive steps, which may lead to error propagation, while our model determines both simultaneously in an effective manner. (2) Fact does not take advantage of the spatial information among pixels, whereas we incorporate this information as a Laplacian regularization term in our convex model. Moreover, we propose a parallel numerical algorithm based on the alternating direction method of multipliers [4, 18] to solve our model and obtain the final segments. Note that the model can be easily utilized in other applications such as video summarization [12, 13] and dimensionality reduction [14].
3 Proposed method
This section describes our segmentation method in two phases: problem formulation and numerical algorithm. The first phase formulates a convex model based on the properties of local spectral histogram (LSH) features, and the second phase presents our solution to the model in detail.
3.1 Problem formulation
Given an input image, we start with an oversegmentation consisting of $n$ superpixels. We form the feature matrix $\mathbf{Y} \in \mathbb{R}_+^{m \times n}$ by averaging the LSH features of the pixels within each superpixel. Hence, each column $\mathbf{y}_j$ is an $m$-dimensional feature corresponding to the $j$-th superpixel. Under the assumption of linear dependence among the LSH features, we model the feature matrix as,
$\mathbf{Y} = \mathbf{W}\mathbf{Z} + \mathbf{E}, \qquad (1)$
where $\mathbf{W} \in \mathbb{R}_+^{m \times q}$ is a dictionary of $q$ words inferred from the superpixel features, $\mathbf{Z} \in \mathbb{R}_+^{q \times n}$ denotes a coefficient matrix whose rows indicate the contribution of each word in reconstructing $\mathbf{Y}$, and $\mathbf{E}$ indicates the model error. The goal is to design a model which takes into account the linear dependence and spatial relationships among the features to compute an optimal matrix $\mathbf{Z}$.
In order to incorporate the linear dependence among the features into the model, we adopt a nonnegative matrix factorization framework to construct $\mathbf{W}$ over the feature matrix $\mathbf{Y}$. Consider the dissimilarity matrix $\mathbf{D} \in \mathbb{R}^{q \times n}$, where $d_{ij}$ indicates the dissimilarity between word $\mathbf{w}_i$ and feature $\mathbf{y}_j$. We use $d_{ij} = \|\mathbf{y}_j - \mathbf{w}_i\|_2^2$ to compute $\mathbf{D}$. In the case of normalized features and visual words, $d_{ij}$ depends only on the inner product between $\mathbf{w}_i$ and $\mathbf{y}_j$. More precisely, it shows how well $\mathbf{y}_j$ is expressible by $\mathbf{w}_i$, which is a reasonable dissimilarity measure given the linear dependence among the features. Since the superpixels are not necessarily of the same size, we define a diagonal regularization matrix $\mathbf{Q}$ whose diagonal elements are proportional to the number of pixels in each superpixel. The elements of $\mathbf{Q}$ scale each dissimilarity by the size of the corresponding superpixel.
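A minimal sketch of these quantities, under the assumption (suggested but not pinned down by the text) that the dissimilarity is the squared Euclidean distance between a word and a feature; the names `Y`, `W`, `D`, and `Q` follow the reconstruction used here:

```python
import numpy as np

def dissimilarities(Y, W):
    """d_ij = ||y_j - w_i||_2^2 between word w_i and feature y_j
    (an assumed choice of dissimilarity). For unit-norm columns this
    equals 2 - 2 * w_i^T y_j, i.e. it depends only on the inner
    product. Y is m-by-n (features), W is m-by-q (dictionary words)."""
    return (np.sum(Y ** 2, axis=0)[None, :]
            + np.sum(W ** 2, axis=0)[:, None]
            - 2.0 * W.T @ Y)

def size_weights(superpixel_sizes):
    """Diagonal regularization matrix whose entries are proportional
    to the number of pixels in each superpixel."""
    s = np.asarray(superpixel_sizes, dtype=float)
    return np.diag(s / s.sum())
```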
In order to embed the spatial relationships among the superpixels into the model, we construct a graph over the initial oversegmentation. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{A})$ be the graph whose nodes are superpixels and whose edges connect every pair of adjacent superpixels with a weight specified by $\mathbf{A}$. The edge weight $a_{ij}$ between the adjacent superpixels $i$ and $j$ indicates their similarity and is defined as:
$a_{ij} = \bar{b}_{ij}\, \exp\!\left(-\|\mathbf{y}_i - \mathbf{y}_j\|_2^2 / \sigma\right), \qquad (2)$
where $\bar{b}_{ij}$ is the average strength of their common boundary and $\sigma$ controls the effect of feature distances on the similarity weight. Given such a graph $\mathcal{G}$, we define the Laplacian matrix as $\mathbf{L} = \operatorname{diag}(\mathbf{A}\mathbf{1}) - \mathbf{A}$.
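The graph construction can be sketched as follows, assuming the Gaussian-of-feature-distance weight form suggested by the surrounding text (the exact expression in (2) may differ):

```python
import numpy as np

def adjacency_and_laplacian(features, edges, boundary_strength, sigma=1.0):
    """Superpixel graph: adjacent superpixels i, j get the weight
    b_ij * exp(-||y_i - y_j||_2^2 / sigma), where b_ij is the average
    strength of their common boundary. Returns the symmetric weight
    matrix A and the unnormalized Laplacian diag(A @ 1) - A."""
    n = features.shape[1]
    A = np.zeros((n, n))
    for (i, j), b in zip(edges, boundary_strength):
        d2 = np.sum((features[:, i] - features[:, j]) ** 2)
        A[i, j] = A[j, i] = b * np.exp(-d2 / sigma)
    L = np.diag(A.sum(axis=1)) - A
    return A, L
```

By construction the Laplacian is symmetric positive semidefinite, which is what makes the smoothness term in the model convex.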
Once the Laplacian matrix $\mathbf{L}$, the dissimilarity matrix $\mathbf{D}$, and the regularization matrix $\mathbf{Q}$ are computed, we seek a small subset of the dictionary words that represents the feature matrix $\mathbf{Y}$ well. To do so, we require a model which satisfies the following requirements:

1. It minimizes the number of selected words. In the ideal case, we are interested in having a single word corresponding to each coherent region.
2. It ensures each feature in $\mathbf{Y}$ is well expressible as a nonnegative linear combination of the selected words. The coefficients of such a linear combination indicate the contribution of each selected word in reconstructing the feature.
3. It ensures each feature in $\mathbf{Y}$ is expressed by at least one selected word. To do so, we impose a constraint on the sum of the linear-combination coefficients.
4. It takes advantage of the spatial relationships and linear dependence of the features.
Motivated by [12], we formulate the following convex model, which fulfills these requirements:
$\min_{\mathbf{Z}} \; \operatorname{tr}(\mathbf{D}^\top \mathbf{Z}\mathbf{Q}) + \beta \operatorname{tr}(\mathbf{Z}\mathbf{L}\mathbf{Z}^\top) + \lambda \sum_{i=1}^{q} \|\mathbf{z}^i\|_\infty \qquad$ (3a)
subject to $\quad \mathbf{1}^\top \mathbf{Z} = \mathbf{1}^\top, \qquad$ (3b)
$\quad \mathbf{Z} \ge 0, \qquad$ (3c)
where $\beta$ and $\lambda$ are regularization parameters. The first term in (3) corresponds to the cost of representing the feature matrix $\mathbf{Y}$ using dictionary $\mathbf{W}$, weighted by the size of the superpixels. The Laplacian regularization term incorporates the spatial relations of superpixels into the objective, and the last term is a row-sparsity regularization which penalizes the objective in proportion to the number of selected words. Note that although $\mathbf{W}$ does not directly appear in (3), the rows of $\mathbf{Z}$ are constructed based on the contribution of the dictionary words in reconstructing $\mathbf{Y}$.
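To make the role of each term concrete, the following sketch evaluates one plausible reading of objective (3) and its constraints for a given coefficient matrix; the exact norms and weighting in the paper's model may differ:

```python
import numpy as np

def objective(Z, D, Q, L, beta, lam):
    """One plausible reading of objective (3): a size-weighted
    representation cost tr(D^T Z Q), a Laplacian smoothness term
    beta * tr(Z L Z^T), and a row-sparsity penalty that charges
    lam * ||z^i||_inf for every dictionary word that is used at all."""
    rep = np.trace(D.T @ Z @ Q)
    smooth = beta * np.trace(Z @ L @ Z.T)
    row_sparsity = lam * np.abs(Z).max(axis=1).sum()
    return rep + smooth + row_sparsity

def feasible(Z, tol=1e-9):
    """Constraints of (3): nonnegative entries, each column sums to
    one (so every superpixel is fully expressed by selected words)."""
    return bool(np.all(Z >= -tol) and np.allclose(Z.sum(axis=0), 1.0, atol=tol))
```

The row-sparsity term is what drives whole rows of the coefficient matrix to zero, so the number of surviving rows, i.e., selected words, is determined by the optimization itself rather than fixed in advance.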
The optimal solution of problem (3) is $\mathbf{Z}^\star$, whose nonzero rows correspond to the selected words. Note that $\mathbf{Z}^\star$ not only determines the selected words but also shows their contribution in reconstructing the superpixel features. Hence, the elements of $\mathbf{Z}^\star$ can be interpreted as a soft assignment of the superpixels to the selected words. In this case, each superpixel is assigned to the selected word which has the largest contribution in the reconstruction of its feature. The final segmentation is obtained by merging neighboring superpixels that are assigned to the same selected word. Figure 2 illustrates our segmentation pipeline in detail.
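The soft-to-hard assignment step described above can be sketched as:

```python
import numpy as np

def assign_superpixels(Z, tol=1e-8):
    """Soft-to-hard assignment: the selected words are the nonzero
    rows of the optimal Z; superpixel j is assigned to the selected
    word with the largest coefficient in column j."""
    selected = np.where(np.abs(Z).max(axis=1) > tol)[0]
    labels = selected[np.argmax(np.abs(Z[selected, :]), axis=0)]
    return selected, labels
```

Neighboring superpixels that receive the same label are then merged to produce the final segments.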
3.2 Numerical algorithm
To solve the model, we propose an efficient numerical algorithm based on the alternating direction method of multipliers (ADMM), a powerful technique for solving convex optimization problems. The basic idea of ADMM is to introduce auxiliary variables that break a complicated convex problem down into smaller subproblems, each efficiently solvable via an explicit formula. The ADMM iteratively solves the subproblems until convergence. To formulate the ADMM, we introduce an auxiliary variable and reformulate (3) as follows:
(4a)  
subject to  (4b)  
(4c)  
(4d) 
where (4d) is imposed to guarantee the equivalence of (3) and (4). Using a slack variable, (4) can be written in a more convenient form as,
(5a)  
subject to  (5b)  
(5c)  
(5d)  
(5e) 
The third term of (5a) can be rewritten equivalently, so (5) can be reformulated independent of the slack variable as:
(6a)  
subject to  (6b)  
(6c)  
(6d)  
(6e) 
where (6d) is obtained by removing the slack variable from (5d). In order to derive an ADMM formulation with subproblems possessing explicit formulas, we introduce auxiliary matrices and reformulate (6) as:
(7a)  
subject to  (7b)  
(7c)  
(7d)  
(7e)  
(7f)  
(7g) 
where the augmented Lagrangian parameters can be set as suggested in [4]. Note that (7) is equivalent to (6), because the additional terms in (7a) vanish for any feasible solution. To solve (7), the augmented Lagrangian function is formed as:
(8) 
where the Lagrange multipliers are associated with the equality constraints (7f) and (7g).
Given initial values for the primal variables and the Lagrange multipliers, the ADMM iterations to solve (7) are summarized as follows:
(9) 
(10) 
(11) 
Note that the variables in each of the above iterations can be stacked together to form a single matrix variable. Therefore, the numerical algorithm is not considered a multi-block ADMM. To solve (9), we concatenate the rows of the variables involved; then (9) can be divided into equality-constrained quadratic programs as follows:
(12a)  
subject to  (12b) 
where the Hessian is a block-diagonal positive semidefinite matrix, the constraint matrix is sparse and corresponds to constraint (7d), and the right-hand side is a vector of all zeros.
Problem (10) can be split into two separate subproblems with closed-form solutions as follows:
(13a)  
subject to  (13b) 
(14a)  
subject to  (14b) 
where each one consists of computationally cheap parallel programs. Subproblem (13) can be divided into parallel programs over the columns of its variable, where each one is a Euclidean-norm projection onto the probability simplex. These programs enjoy closed-form solutions, as presented in [51]. Subproblem (14) consists of small parallel programs over the columns of its variable, where each program is a Euclidean-norm projection onto the nonnegative orthant and admits a closed-form solution.
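Both projections are standard. The simplex projection has a well-known sort-based closed form (of the kind [51] derives), and the orthant projection is elementwise clipping; a sketch of both:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, 1^T x = 1}, via the
    classic sort-based closed form used for subproblems like (13).
    Costs O(d log d) for a d-dimensional vector."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)    # shift that makes the sum one
    return np.maximum(v + theta, 0.0)

def project_nonneg(v):
    """Euclidean projection onto the nonnegative orthant, the
    closed-form solution of each small program in subproblem (14)."""
    return np.maximum(v, 0.0)
```

Because each column is projected independently, both steps parallelize trivially across the columns of the corresponding matrices.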
Problem (11) can be split into two subproblems, each consisting of parallel updates over the columns of the corresponding matrix.
Our numerical algorithm consists of two subproblems with closed-form solutions, which makes the iterations computationally efficient. To evaluate the convergence behavior of the proposed algorithm, we adopt the combined residual presented in [21] as:
(15)  
Figure 3 demonstrates the convergence behavior of our algorithm in terms of the combined residual and the cost function. We solve (3) for three choices of the regularization parameters to show the sensitivity of our numerical algorithm with respect to them. As can be clearly seen, the numerical algorithm converges in a reasonable number of iterations for a wide range of parameter values.
4 Experiments
We perform multiple experiments on benchmark image segmentation datasets to evaluate the performance of our method (termed IS4). The first part of this section gives information about the benchmark datasets, evaluation measures, and parameter settings. The second part compares our results with state-of-the-art methods to demonstrate the effectiveness of IS4.
4.1 Settings
Datasets: We adopt three commonly used datasets in image segmentation: (1) BSD300 [38], containing 300 images (200 training and 100 validation), where each image has on average five ground truths manually drawn by humans; (2) BSD500 [1], an extension of BSD300 with 200 new testing images; (3) MSRC [45], containing 591 images, where each image has a single ground truth. It should be mentioned that we use the cleaned version of MSRC [36] in our experiments.
Measures: We adopt three widely accepted measures in image segmentation: (1) Segmentation Covering (Cov) [36], which measures the overlap between two segmentations; (2) Probabilistic Rand Index (PRI) [42], which measures the probability that a pair of pixels is consistently labeled in two segmentations; (3) Variation of Information (VoI) [39], which measures the distance between two segmentations as the average of their conditional entropies.
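Of the three measures, VoI is the easiest to state compactly. A reference sketch for integer label maps (not the benchmark code used in the experiments) is:

```python
import numpy as np

def variation_of_information(a, b):
    """Variation of Information between two segmentations given as
    integer label maps: VoI = H(A|B) + H(B|A). Lower is better, and it
    is zero iff the segmentations agree up to a relabeling."""
    a, b = np.ravel(a), np.ravel(b)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)             # joint label histogram
    p = joint / a.size
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    h = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))
    return 2.0 * h(p) - h(pa) - h(pb)         # = H(A|B) + H(B|A)
```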
Parameters: Given an oversegmentation, IS4 computes the superpixel features by averaging the local spectral histogram (LSH) [34] features of the pixels within each superpixel. We use the algorithm and parameters presented in [54] to extract features and build a dictionary of size $q$. Parameter $q$ should be chosen sufficiently large (larger than the number of coherent regions) to ensure each superpixel feature is well expressible as a nonnegative linear combination of the dictionary words; our setting is normally much larger than the number of coherent regions in BSD and MSRC images. Our proposed model in (3) has two parameters, $\beta$ and $\lambda$, where $\beta$ controls the effect of spatial relationships among superpixels and $\lambda$ controls the number of selected words. As $\beta$ increases, neighboring regions are more likely to be merged together, and as $\lambda$ increases, the number of selected words reduces. We optimize $\beta$ on the training set of BSD by applying a grid search and use the optimized value in our experiments on the BSD300, BSD500, and MSRC datasets. Parameter $\lambda$ is set as a fraction of a constant $\lambda_{\max}$ computed based on the model inputs using [12]. If $\lambda$ is greater than $\lambda_{\max}$, only a single word is selected to represent the whole feature set. We follow [1, 2, 19, 43] and present our results as a family of segmentations which share the same parameter settings except for this fraction, which varies between zero and one. The evaluation measures are reported at the Optimal Dataset Scale (ODS) and the Optimal Image Scale (OIS).
4.2 Results
Segmentation quality: We run IS4 on the benchmark datasets and report the results in Tables 1, 2, and 3 to make a comparison with recent methods: Normalized cut (Ncut) [44], Multiscale Normalized cut (MNcut) [8], gPb-Ultrametric Contour Map (gPb-UCM) [1], Image Segmentation by Cascaded Region Agglomeration (ISCRA) [43], Reverse Image Segmentation with High/Low-level pairwise potentials (RIS+HL) [52], Multiscale Combinatorial Grouping (MCG) [2], Contour-guided Color Palettes (CCP2) [17], Piecewise Flat Embedding (PFE) [53], Discrete-Continuous Gradient Orientation Estimation for Segmentation (DCSeg) [10], Graph Without Cut (GWC) [19], and Aligned hierarchical segmentation (MCG-Aligned) [6]. All scores are collected from [1, 2, 6, 10, 19], except MCG on MSRC and CCP2 on BSD500, which are obtained by running the implementations provided by the respective authors.
Table 1: Segmentation results on BSD300 in terms of Cov (higher is better), PRI (higher is better), and VoI (lower is better) at ODS and OIS, for MNcut [8], gPb-UCM [1], ISCRA [43], RIS+HL [52], MCG [2], GWC [19], and IS4(MCG).
Table 2: Segmentation results on BSD500 in terms of Cov, PRI, and VoI at ODS and OIS, for Ncut [44], gPb-UCM [1], DCSeg [10], ISCRA [43], RIS+HL [52], MCG [2], PFE+MCG [53], MCG-Aligned [6], GWC [19], and IS4(MCG).
Table 3: Segmentation results on MSRC in terms of Cov, PRI, and VoI at ODS and OIS, for gPb-UCM [1], ISCRA [43], GWC [19], MCG [2], and IS4(MCG). IS4(MCG) achieves Cov 0.69/0.77, PRI 0.80/0.86, and VoI 1.15/0.91 at ODS/OIS.
Table 4: Results of IS4 in combination with different initial oversegmentations (ISCRA [43], CCP2 [17], and MCG [2]) in terms of Cov, PRI, and VoI at ODS and OIS.
Table 5: Sensitivity of IS4 to its regularization parameters, reported in terms of Cov, PRI, and VoI at ODS and OIS.
Parameter sensitivity: To evaluate the role played by the initial oversegmentation, we run IS4 in combination with three segmentation methods: CCP2, ISCRA, and MCG. In CCP2, we use the parameter settings suggested by the respective authors. In MCG and ISCRA, we adopt the segmentations at fixed scales of their hierarchies as the oversegmentations. On average, the oversegmentation layers provided by CCP2, ISCRA, and MCG have 120, 45, and 37 superpixels, respectively. Figure 3(a) and Table 4 respectively show the qualitative and quantitative results of these combinations. Moreover, we run IS4 for different values of the spatial regularization parameter to assess the robustness of our method with respect to its variations. The results are reported in Table 5 in terms of the segmentation measures. As Tables 4 and 5 indicate, IS4 not only achieves satisfactory results for a wide range of parameter values but also improves the quality of the initial oversegmentations on most of the segmentation measures. It is worth pointing out that IS4 can be applied to the result of any segmentation method. The result may either be directly generated by a segmentation algorithm (e.g., CCP2) or obtained from a specific level of a hierarchical segmentation (e.g., MCG).
Tables 1, 2, and 3 show that IS4 in combination with MCG generates high-quality segmentations. The scores indicate that IS4(MCG) outperforms all competitor methods on BSD300 (ODS: VoI) and MSRC (ODS: Cov, PRI, VoI and OIS: Cov, PRI, VoI). The other scores obtained by IS4(MCG) are also on par with or in close proximity to the best competitors, except for BSD300 (OIS: Cov, PRI) and BSD500 (OIS: PRI). In comparison with MCG, IS4(MCG) achieves better results on BSD300 (ODS: VoI), BSD500 (ODS: Cov, VoI and OIS: VoI), and MSRC (ODS: Cov, PRI, VoI and OIS: Cov, PRI, VoI). Our method is also on par with MCG on BSD300 (ODS: Cov, PRI) and BSD500 (ODS: PRI). Moreover, the tables indicate that IS4(MCG) achieves lower scores than MCG on BSD300 (OIS: Cov, PRI, VoI) and BSD500 (OIS: PRI). It may seem reasonable to expect that IS4(MCG) should consistently improve the MCG measures; however, this expectation is not entirely accurate. The reason is that IS4 does not directly operate on the MCG hierarchical segmentation to improve or degrade the MCG results. It simply adopts an oversegmentation by thresholding the MCG hierarchical segmentation and generates a family of segmentations. These segmentations differ from the ones which may be obtained at different thresholds of MCG.
MCG usually provides a high-quality hierarchical segmentation, but its performance is sometimes unsatisfactory, especially on textured images. Our method in combination with MCG improves the segmentation quality of these images (shown in Figure 3(b)) by adopting superpixel features as an informative representation of small regions. Despite the advantages of IS4, it may fail to correctly segment the pixels belonging to an elongated coherent region. The reason is that the local neighborhood around these pixels contains visual information from neighboring coherent regions. Hence, their LSH features are inaccurate, which may cause a wrong assignment to the neighboring coherent regions.


Algorithm complexity: We decomposed model (3) into three subproblems, (9), (10), and (11), which constitute the steps of the alternating direction method of multipliers (ADMM). To investigate the complexity of solving these subproblems, we discuss each one separately. Subproblem (9) is cast as parallel equality-constrained quadratic programs of the form (12), each of which can be efficiently solved via a system of linear equations [5, 47]. Since the Hessian and the constraint matrix are the same in all programs, the left-hand sides of these systems are identical. Hence, only one program requires a full factorization, after which back-solves can be carried out for the right-hand sides of the other programs. It is worth mentioning that in practice the complexity is much lower than the worst case, because the involved matrices are highly sparse and structured. Subproblem (10) is split into two parallel problems, (13) and (14). Problem (13) consists of parallel programs, each solvable by a closed-form simplex projection [51]. Problem (14) consists of small parallel programs, each solved by a projection onto the nonnegative orthant [4]. Subproblem (11) consists of parallel updates over the columns of the corresponding matrices. In total, the complexity of our numerical solution is
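The factorize-once, back-solve-many strategy described for subproblem (9) can be sketched as follows (a generic KKT-based equality-constrained QP solver, not the paper's exact implementation; SciPy is assumed available, and the names are illustrative):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_shared_eq_qps(P, A, b, C):
    """Solve many equality-constrained QPs
        min_x 0.5 * x^T P x + c^T x   s.t.  A x = b
    that share the same P and A, as in subproblem (12): factor the
    KKT matrix once, then run one cheap back-solve per right-hand
    side. C stacks the c vectors column-wise."""
    d, m = P.shape[0], A.shape[0]
    kkt = np.block([[P, A.T], [A, np.zeros((m, m))]])
    factor = lu_factor(kkt)                      # cubic cost, paid once
    rhs = np.vstack([-C, np.tile(b[:, None], (1, C.shape[1]))])
    sol = lu_solve(factor, rhs)                  # quadratic cost per column
    return sol[:d, :]                            # primal solutions
```

For sparse, structured KKT matrices such as the ones arising here, a sparse factorization would reduce the one-time cost further, which is why the practical complexity is much lower than the dense worst case.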