Image Segmentation using Sparse Subset Selection

Fariba Zohrizadeh             Mohsen Kheirandishfard             Farhad Kamangar
Department of Computer Science and Engineering,
University of Texas at Arlington, USA
{fariba.zohrizadeh,mohsen.kheirandishfard,farhad.kamangar}@uta.edu
Abstract

In this paper, we present a new image segmentation method based on the concept of sparse subset selection. Starting with an over-segmentation, we adopt local spectral histogram features to encode the visual information of the small segments into high-dimensional vectors, called superpixel features. The superpixel features are then fed into a novel convex model which efficiently leverages them to group the superpixels into a proper number of coherent regions. Our model automatically determines the optimal number of coherent regions and the assignment of superpixels to shape the final segments. To solve our model, we propose a numerical algorithm based on the alternating direction method of multipliers (ADMM), whose iterations consist of two highly parallelizable sub-problems. We show that each sub-problem enjoys a closed-form solution, which makes the ADMM iterations computationally very efficient. Extensive experiments on benchmark image segmentation datasets demonstrate that our proposed method, in combination with an over-segmentation, provides high-quality results that are competitive with existing state-of-the-art methods.

1 Introduction

Image segmentation is a fundamental and challenging task in computer vision with diverse applications in various areas, such as video segmentation [26, 27], object segmentation [11, 15, 20, 25], and semantic segmentation [9, 31]. The primary challenges of image segmentation are rooted in the diversity and ambiguity of the visual textures encountered in input images. These challenges have been the subject of numerous research studies in recent years [1, 17, 54].

One of the major challenges in image segmentation is to determine the optimal number of coherent regions. Depending on the segmentation methodology, this parameter can be calculated from the distribution of image features [54], given as prior knowledge [29, 32, 56], or set to a constant value [1]. The performance of segmentation methods heavily depends on the right choice of this parameter, denoted here by $k$. Figure 1 illustrates the segmentation results obtained for various choices of $k$. When $k$ is overestimated (Figure 1(b)), each coherent region may be divided into many separate segments; merging these segments back requires time-consuming and complicated steps, which in turn increase the computational complexity and decrease the performance of the algorithm. On the other hand, when $k$ is underestimated (Figure 1(c)), distinct coherent regions are forced to merge, which leads to poor segmentation quality. Figure 1(d) shows that our method achieves a high-quality segmentation by properly determining $k$.

Figure 1: Segmenting image pixels into multiple coherent regions. (a) Input image. (b-d) Segmentation results when the number of coherent regions is overestimated (b), underestimated (c), and properly determined (our method) (d).

Generally, determining the number of coherent regions in image segmentation is closely related to the problem of finding the optimal number of clusters in other areas [3, 46, 49]. These problems are similar in the sense that they both seek the optimal number of groups. However, they may differ depending on the nature and intrinsic properties of their input data. For instance, in image segmentation, the spatial relationship among the image features is an important property of the data. Features may also have more specific properties depending on the feature extraction procedure. These properties need to be taken into account in determining the optimal number of coherent regions.

In this paper, we adopt local spectral histogram (LSH) features [34] to model the input image. These features are computed by averaging the distributions of visual properties (such as color and texture) over a local patch centered at each pixel; therefore, they are a powerful tool for encoding local texture information. As LSH features are computed by averaging distributions in a local neighborhood, they are always nonnegative, and features belonging to the same coherent region are linearly dependent on each other. Our method leverages these properties to develop a convex model based on the concept of sparse subset selection. The main contributions of this work can be summarized as follows:

I: We design an effective convex model based on the properties of LSH features which automatically determines the optimal number of coherent regions and the assignment of pixels to them.

II: We develop a parallel numerical algorithm based on the alternating direction method of multipliers [4, 18], whose iterations consist of two sub-problems with closed-form solutions. We show that the proposed algorithm solves our model significantly faster than standard convex solvers [40, 48, 50] while maintaining high accuracy.

III: We conduct extensive experiments on three commonly used datasets, BSD300 [38], BSD500 [1], and MSRC [45], to show that our results are competitive with those of state-of-the-art methods.

The remainder of this paper is structured as follows: Section 2 briefly reviews related work; Section 3 explains our method in detail; Section 4 provides experimental results; Section 5 concludes the paper.

Notation: Throughout this paper, matrices, vectors, and scalars are denoted by boldface uppercase, boldface lowercase, and italic lowercase letters, respectively. For a given matrix $\mathbf{A}$, $a_{ij}$ denotes the element at row $i$ and column $j$, $\|\mathbf{A}\|_F$ indicates the Frobenius norm, and $\|\mathbf{A}\|_{1,\infty}$ is the $\ell_1$ norm of the $\ell_\infty$ norms of the rows of $\mathbf{A}$. For a given vector $\mathbf{a}$, the symbols $\|\mathbf{a}\|_p$, $\operatorname{diag}(\mathbf{a})$, and $a_i$ denote the standard $\ell_p$ norm, a diagonal matrix formed by the elements of $\mathbf{a}$, and the $i$-th element of $\mathbf{a}$, respectively. The symbol $\operatorname{tr}(\cdot)$ stands for the trace operator, $\mathbb{R}_+$ indicates the set of positive real numbers, and $\mathbf{1}$ is a column vector of all ones of appropriate dimension.

2 Related works

An over-segmentation is obtained by partitioning the input image into multiple small homogeneous regions, called superpixels. Recent segmentation algorithms usually start from an over-segmentation and merge similar superpixels to shape the final segments [17, 32, 56]. Fu et al. [17] proposed a pipeline of three effective post-processing steps which are applied to an over-segmentation to shape the final segments. Li et al. [32] suggested constructing a bipartite graph over multiple over-segmentations provided by [7] and [16]; spectral clustering is then applied to the graph to form the final segments. Ren et al. [43] presented a method which constructs a cascade of boundary classifiers and iteratively merges the superpixels of an over-segmentation to build the final segments.

One notable segmentation method is presented by Arbelaez et al. [1], which reduces the problem of image segmentation to contour detection. The method combines multiple contour cues from different image layers to create a contour map, called gPb. The contour map and its corresponding hierarchical segmentation were further utilized by several algorithms as an initial over-segmentation [10, 19, 28, 33]. Liu et al. [33] trained a classifier over the gPb results to construct a hierarchical region-merging tree; the classifier then iteratively merges the most similar regions to shape the final segments. Gao et al. [19] proposed constructing a graph over the gPb results based on the spatial and visual information of superpixels, together with a model that partitions the graph into multiple components, each corresponding to a final segment. Recently, a widely used extension of gPb was proposed by Arbelaez et al. [2], called Multiscale Combinatorial Grouping (MCG). The method combines the information obtained from multiple image scales to generate a hierarchical segmentation. Yu et al. [53] proposed a nonlinear embedding method based on an $\ell_1$-regularized objective, which is integrated into the MCG framework to provide better local distances among the superpixels. Chen et al. [6] realigned the MCG results by modifying the depth of segments in its hierarchical structure.

Figure 2: The pipeline of our proposed algorithm. Given an input image, we adopt an algorithm to generate a superpixel segmentation layer. Then, we compute the superpixel features and learn a dictionary of words over all superpixels. Our convex model efficiently selects a small subset of informative words and softly assigns superpixels to the selected words. The neighboring superpixels which are assigned to the same selected word are merged to shape the final segmentation.

Several recently developed methods apply deep learning to related tasks such as contour detection [30, 37] and semantic image segmentation [24, 35, 41, 55]. Although these methods are able to exploit more sophisticated and complex representative features, they are often highly demanding in terms of training data and training time. Therefore, they may not be the most appropriate choice in some applications. To illustrate, consider natural image segmentation, in which many unknown and diverse patterns are likely to be present in each single image; this implies that the number of training samples per pattern may be insufficient. Motivated by this, we propose a new image segmentation method based on the concept of sparse subset selection [12, 14]. The method starts with an over-segmentation (e.g., MCG) and uses an effective convex model to group the superpixels into a proper number of coherent regions. Our work is broadly similar to the factorization-based segmentation (Fact) algorithm [54] in the sense that both use local spectral histogram (LSH) features to model the input image and seek to estimate the optimal number of coherent regions. However, our method differs from Fact in two major ways: (1) Fact determines the optimal number of coherent regions and the pixel assignments in two consecutive steps, which may lead to error propagation, whereas our model determines both simultaneously in an effective manner. (2) Fact does not take advantage of the spatial information among pixels, whereas we incorporate this information as a Laplacian regularization term in our convex model. Moreover, we propose a parallel numerical algorithm based on the alternating direction method of multipliers [4, 18] to solve our model and obtain the final segments. Note that the model can be easily utilized in other applications such as video summarization [12, 13] and dimensionality reduction [14].

3 Proposed method

This section describes our segmentation method in two phases: problem formulation and numerical algorithm. The first phase formulates a convex model based on the properties of local spectral histogram (LSH) features, and the second phase presents our solution to the model in detail.

3.1 Problem formulation

Given an input image, we start with an over-segmentation consisting of $n$ superpixels. We form the feature matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ by averaging the LSH features of pixels within each superpixel. Hence, each column $\mathbf{x}_i$ is an $m$-dimensional feature corresponding to the $i$-th superpixel. Under the assumption of linear dependence among the LSH features, we model the feature matrix as

$\mathbf{X} = \mathbf{D}\mathbf{Z} + \mathbf{E}$,    (1)

where $\mathbf{D} \in \mathbb{R}^{m \times q}$ is a dictionary of $q$ words inferred from the superpixel features, $\mathbf{Z} \in \mathbb{R}^{q \times n}$ denotes a coefficient matrix whose rows indicate the contribution of each word in reconstructing $\mathbf{X}$, and $\mathbf{E}$ indicates the model error. The goal is to design a model which takes into account the linear dependence and spatial relationship among the features to compute an optimal matrix $\mathbf{Z}$.
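To make the construction of $\mathbf{X}$ concrete, the following sketch averages per-pixel LSH features over an over-segmentation. The per-pixel feature array and label map are assumed inputs (the features themselves would be extracted as in [34, 54]); the function name and layout are illustrative, not the authors' implementation.

```python
import numpy as np

def superpixel_features(pixel_feats, labels):
    """Build X (m x n) by averaging per-pixel LSH features over superpixels.

    pixel_feats : (H, W, m) array of per-pixel LSH features.
    labels      : (H, W) integer map assigning each pixel to one of the
                  n superpixels of the over-segmentation.
    """
    m = pixel_feats.shape[2]
    flat = pixel_feats.reshape(-1, m)              # (H*W, m)
    lab = labels.ravel()
    n = int(lab.max()) + 1
    counts = np.bincount(lab, minlength=n)         # superpixel sizes
    X = np.empty((m, n))
    for d in range(m):                             # per-dimension sums
        X[d] = np.bincount(lab, weights=flat[:, d], minlength=n)
    return X / np.maximum(counts, 1)               # column j: mean feature of superpixel j
```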

In order to incorporate the linear dependence among the features into the model, we adopt a non-negative matrix factorization framework to construct $\mathbf{D}$ over the feature matrix $\mathbf{X}$. Consider the dissimilarity matrix $\mathbf{C} \in \mathbb{R}^{q \times n}$, where $c_{ij}$ indicates the dissimilarity between word $\mathbf{d}_i$ and feature $\mathbf{x}_j$. We use $c_{ij} = \|\mathbf{x}_j - \mathbf{d}_i\|_2^2$ to compute $\mathbf{C}$. In the case of normalized features and visual words, $c_{ij}$ depends only on the inner product between $\mathbf{x}_j$ and $\mathbf{d}_i$. More precisely, it shows how well $\mathbf{x}_j$ is expressible by $\mathbf{d}_i$, which is a reasonable dissimilarity measure given the linear dependence among the features. Since the superpixels are not necessarily of the same size, we define a diagonal regularization matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ whose diagonal elements are proportional to the number of pixels in each superpixel. The elements of $\mathbf{Q}$ scale each $c_{ij}$ by the size of the $j$-th superpixel.
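A minimal sketch of these two quantities, assuming the squared Euclidean dissimilarity described above (for unit-norm columns it reduces to an affine function of the inner product between $\mathbf{x}_j$ and $\mathbf{d}_i$):

```python
import numpy as np

def dissimilarity_matrix(X, D):
    """C (q x n) with c_ij = ||x_j - d_i||^2; for unit-norm columns this
    equals 2 - 2 * D.T @ X, i.e., it depends only on inner products."""
    x2 = np.sum(X * X, axis=0)                     # (n,)
    d2 = np.sum(D * D, axis=0)                     # (q,)
    return d2[:, None] + x2[None, :] - 2.0 * (D.T @ X)

def size_regularizer(counts):
    """Diagonal Q whose entries are proportional to superpixel sizes."""
    counts = np.asarray(counts, dtype=float)
    return np.diag(counts / counts.sum())
```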

In order to embed the spatial relationship among the superpixels into the model, we construct a graph over the initial over-segmentation. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{W})$ be the graph whose nodes are superpixels and whose edges connect every pair of adjacent superpixels with a weight specified by $\mathbf{W}$. The edge weight $w_{ij}$ between the adjacent superpixels $i$ and $j$ indicates their similarity and is defined as:

(2)

where $b_{ij}$ is the average strength of their common boundary and $\sigma$ controls the effect of feature distances on their similarity weight. Given such a graph $\mathcal{G}$, we define the Laplacian matrix as $\mathbf{L} = \operatorname{diag}(\mathbf{W}\mathbf{1}) - \mathbf{W}$.
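The sketch below assembles $\mathbf{W}$ and $\mathbf{L} = \operatorname{diag}(\mathbf{W}\mathbf{1}) - \mathbf{W}$. Since (2) is not reproduced above, the weight used here, a Gaussian of the feature distance attenuated by the boundary strength $b_{ij}$, is only one plausible reading of the description and should be treated as an assumption:

```python
import numpy as np

def graph_laplacian(X, adj_pairs, boundary_strength, sigma=1.0):
    """Unnormalized Laplacian over the superpixel adjacency graph.

    X                 : (m, n) superpixel features.
    adj_pairs         : iterable of (i, j) pairs of adjacent superpixels.
    boundary_strength : dict (i, j) -> b_ij, average strength of the
                        common boundary between superpixels i and j.
    sigma             : controls the effect of feature distances (cf. (2)).
    """
    n = X.shape[1]
    W = np.zeros((n, n))
    for i, j in adj_pairs:
        d2 = float(np.sum((X[:, i] - X[:, j]) ** 2))
        # illustrative weight: similar features and a weak boundary -> high w_ij
        w = np.exp(-d2 / sigma) * (1.0 - boundary_strength[(i, j)])
        W[i, j] = W[j, i] = w
    return np.diag(W.sum(axis=1)) - W              # L = diag(W 1) - W
```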

Once the Laplacian matrix $\mathbf{L}$, the dissimilarity matrix $\mathbf{C}$, and the regularization matrix $\mathbf{Q}$ are computed, we seek to find a small subset of the dictionary words that well represents the feature matrix $\mathbf{X}$. To do so, a model is required which satisfies the following requirements:

  • minimizes the number of selected words. In the ideal case, we are interested in having a single word corresponding to each coherent region.

  • ensures each feature in is well expressible as a nonnegative linear combination of the selected words. The coefficients of such linear combination indicate the contribution of each selected word in reconstructing the feature.

  • ensures each feature in is expressed by at least one selected word. To do so, we impose a constraint on the sum of the linear combination coefficients.

  • takes advantage of the spatial relationship and linear dependence of the features.

Motivated by [12], we formulate the following convex model, which fulfills these requirements:

$\min_{\mathbf{Z}} \;\; \operatorname{tr}(\mathbf{C}\mathbf{Q}\mathbf{Z}^{\top}) + \lambda_1 \operatorname{tr}(\mathbf{Z}\mathbf{L}\mathbf{Z}^{\top}) + \lambda_2 \|\mathbf{Z}\|_{1,\infty}$    (3a)
subject to $\mathbf{Z}^{\top}\mathbf{1} = \mathbf{1}$,    (3b)
$\mathbf{Z} \geq 0$,    (3c)

where $\lambda_1$ and $\lambda_2$ are regularization parameters. The first term in (3) corresponds to the cost of representing the feature matrix $\mathbf{X}$ using the dictionary $\mathbf{D}$, weighted in proportion to the size of the superpixels. The Laplacian regularization term incorporates the spatial relationship of the superpixels into the objective, and the last term is a row-sparsity regularization which penalizes the objective in proportion to the number of selected words. Note that although $\mathbf{D}$ does not directly appear in (3), the rows of $\mathbf{Z}$ are constructed based on the contribution of the dictionary words in reconstructing $\mathbf{X}$.

The optimal solution of problem (3) is $\mathbf{Z}^{\star}$, whose nonzero rows correspond to the selected words. Note that $\mathbf{Z}^{\star}$ not only determines the selected words but also shows the contribution of the selected words in reconstructing the superpixel features $\mathbf{X}$. Hence, the elements of $\mathbf{Z}^{\star}$ can be interpreted as a soft assignment of the superpixels to the selected words. In this case, the $j$-th superpixel is assigned to the selected word which has the largest contribution in the reconstruction of $\mathbf{x}_j$. The final segmentation is obtained by merging the neighboring superpixels which are assigned to the same selected word. Figure 2 illustrates our segmentation pipeline in detail.
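Reading the final segmentation off the optimal $\mathbf{Z}^{\star}$ is then simple bookkeeping, as the following sketch shows (the union-find merging is an implementation choice, not prescribed by the paper):

```python
import numpy as np

def merge_superpixels(Z, adj_pairs):
    """Soft assignments -> final segment labels.

    Z : (q, n) optimal coefficient matrix; superpixel j is assigned to the
    word with the largest contribution in reconstructing x_j (argmax of
    column j), and adjacent superpixels with the same word are merged.
    """
    assign = np.argmax(Z, axis=0)                  # selected word per superpixel
    parent = list(range(Z.shape[1]))               # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]          # path halving
            a = parent[a]
        return a

    for i, j in adj_pairs:
        if assign[i] == assign[j]:                 # same word and adjacent
            parent[find(i)] = find(j)              # -> same final segment
    return np.array([find(i) for i in range(Z.shape[1])])
```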

3.2 Numerical algorithm

To solve the model, we propose an efficient numerical algorithm based on the alternating direction method of multipliers (ADMM), a powerful technique for solving convex optimization problems. The basic idea of ADMM is to introduce auxiliary variables that break a complicated convex problem down into smaller sub-problems, each of which is efficiently solvable via an explicit formula. The ADMM iteratively solves the sub-problems until convergence. To formulate the ADMM, we define an auxiliary vector $\mathbf{u} \in \mathbb{R}^{q}_{+}$ such that $u_i = \|\mathbf{z}^i\|_{\infty}$ and reformulate (3) as follows:

(4a)
subject to (4b)
(4c)
(4d)

where (4d) is imposed to guarantee the equivalence of (3) and (4). Using a slack variable $\mathbf{V}$, (4) can be written in a more convenient form as:

(5a)
subject to (5b)
(5c)
(5d)
(5e)

Since the constraints couple $\mathbf{u}$ with $\mathbf{Z}$ and $\mathbf{V}$, the third term of (5a) can be equivalently written in terms of $\mathbf{Z}$ and $\mathbf{V}$ alone. Hence, (5) can be reformulated independent of $\mathbf{u}$ as:

(6a)
subject to (6b)
(6c)
(6d)
(6e)

where (6d) is obtained by removing $\mathbf{u}$ from (5d). In order to derive an ADMM formulation with sub-problems possessing explicit formulas, we introduce auxiliary matrices and reformulate (6) as:

(7a)
subject to (7b)
(7c)
(7d)
(7e)
(7f)
(7g)

where $\rho_1$ and $\rho_2$ are the augmented Lagrangian parameters. As suggested in [4], we can set both to a common value $\rho$. Note that (7) is equivalent to (6), because the additional terms in (7a) vanish for any feasible solution. To solve (7), the augmented Lagrangian function is formed as:

(8)

where $\boldsymbol{\Lambda}_1$ and $\boldsymbol{\Lambda}_2$ are the Lagrange multipliers associated with the equality constraints (7f) and (7g), respectively.

Given initial values for the primal variables, the auxiliary matrices, and the Lagrange multipliers, the ADMM iterations to solve (7) are summarized as follows:

(9)
(10)
(11)
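Taken together, iterations (9)-(11) form a short loop. The sketch below shows only the control flow; the three closed-form updates and the stopping rule (15) are passed in as callables, since their exact formulas are given by the equations above:

```python
def admm(update_z, update_uv, update_duals, combined_residual,
         state, max_iter=500, eps=1e-4):
    """Generic ADMM loop for problem (7).

    update_z, update_uv : closed-form primal updates, steps (9) and (10).
    update_duals        : multiplier (dual ascent) update, step (11).
    combined_residual   : stopping criterion of [21], eq. (15).
    state               : dict holding Z, the auxiliary matrices, and duals.
    """
    for _ in range(max_iter):
        state = update_z(state)          # step (9): equality-constrained QPs
        state = update_uv(state)         # step (10): simplex/orthant projections
        state = update_duals(state)      # step (11): dual variable updates
        if combined_residual(state) < eps:
            break
    return state
```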

Note that the variables in each of the above iterations can be stacked together to form a single matrix variable; therefore, the numerical algorithm is not considered a multi-block ADMM. To solve (9), we form a vector $\mathbf{p}$ by concatenating the corresponding rows of the primal variable matrices. Then, (9) can be divided into a number of independent equality-constrained quadratic programs of the following form:

$\min_{\mathbf{p}} \;\; \tfrac{1}{2}\,\mathbf{p}^{\top}\mathbf{H}\,\mathbf{p} + \mathbf{c}^{\top}\mathbf{p}$    (12a)
subject to $\mathbf{A}\,\mathbf{p} = \mathbf{b}$,    (12b)

where $\mathbf{H}$ is a block-diagonal positive semi-definite matrix, $\mathbf{A}$ is a sparse matrix corresponding to the constraint (7d), and $\mathbf{b}$ is a vector of all zeros; the linear term $\mathbf{c}$ collects the contributions of the remaining fixed variables.

Problem (10) can be split into two separate sub-problems with closed-form solutions as follows:

(13a)
subject to (13b)
(14a)
subject to (14b)

where each one consists of computationally cheap parallel programs. Sub-problem (13) can be divided into parallel programs over the columns of its matrix variable, where each is a Euclidean norm projection onto the probability simplex; these programs enjoy the closed-form solution presented in [51]. Sub-problem (14) consists of small parallel programs over the columns of the slack matrix $\mathbf{V}$, where each program is a Euclidean norm projection onto the nonnegative orthant and admits a closed-form solution.
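For completeness, the simplex projection admits the well-known sort-based closed form (cf. [51]); a vectorized sketch applying it column-wise, together with the orthant projection used in (14):

```python
import numpy as np

def project_columns_to_simplex(V):
    """Project each column of V onto {z : z >= 0, 1^T z = 1} (cf. [51])."""
    q, n = V.shape
    U = -np.sort(-V, axis=0)                       # columns sorted descending
    css = np.cumsum(U, axis=0) - 1.0
    ks = np.arange(1, q + 1)[:, None]
    cond = U - css / ks > 0                        # active-set condition
    rho = q - 1 - np.argmax(cond[::-1], axis=0)    # last index where cond holds
    theta = css[rho, np.arange(n)] / (rho + 1.0)   # per-column threshold
    return np.maximum(V - theta, 0.0)

def project_to_nonneg_orthant(V):
    """Euclidean projection onto the nonnegative orthant, used in (14)."""
    return np.maximum(V, 0.0)
```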

Problem (11) can be split into two sub-problems over the Lagrange multipliers $\boldsymbol{\Lambda}_1$ and $\boldsymbol{\Lambda}_2$, where each sub-problem consists of parallel updates over the columns of the corresponding matrix.

Our numerical algorithm consists of two sub-problems with closed-form solutions, which makes the iterations computationally efficient. To evaluate the convergence behavior of the proposed algorithm, we adopt the combined residual presented in [21]:

(15)

Figure 3 demonstrates the convergence behavior of our algorithm in terms of the combined residual and the cost function. We solve (3) for three choices of $\rho$ to show the sensitivity of our numerical algorithm with respect to this parameter. As can be clearly seen, the numerical algorithm converges in a reasonable number of iterations for a wide range of $\rho$.

Figure 3: Convergence behavior of our algorithm for different choices of $\rho$. Left: combined residual. Right: cost function.

4 Experiments

We perform multiple experiments on benchmark image segmentation datasets to evaluate the performance of our method (termed IS4). The first part of this section gives information about the benchmark datasets, evaluation measures, and parameter settings. The second part compares our results with those of state-of-the-art methods to demonstrate the effectiveness of IS4.

4.1 Settings

Datasets: We adopt three commonly used datasets in image segmentation: (1) BSD300 [38], containing 300 images (200 training and 100 validation) of size 481×321, where each image has on average 5 ground-truths manually drawn by humans; (2) BSD500 [1], an extension of BSD300 with 200 new test images; (3) MSRC [45], containing 591 images of size 320×213, where each image has a single ground-truth. It should be mentioned that we use the cleaned version of MSRC [36] in our experiments.

Measures: We adopt three widely accepted measures in image segmentation: (1) segmentation covering (Cov) [36], which measures the overlap between two segmentations; (2) probabilistic Rand index (PRI) [42], which measures the probability that a pair of pixels is consistently labeled in two segmentations; (3) variation of information (VoI) [39], which measures the distance between two segmentations as the average of their conditional entropies.
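As an illustration, the covering measure can be computed from the contingency table of region overlaps; the sketch below follows the usual definition of covering a ground-truth segmentation by a machine segmentation, and the other two measures can be derived from the same table:

```python
import numpy as np

def segmentation_covering(seg, gt):
    """Covering of ground-truth gt by segmentation seg.

    seg, gt : (H, W) integer label maps. For each ground-truth region R,
    take the best IoU with any segment R', weight it by |R|, and average.
    """
    n_gt, n_seg = int(gt.max()) + 1, int(seg.max()) + 1
    # contingency table of co-occurring (gt, seg) labels
    table = np.bincount(gt.ravel() * n_seg + seg.ravel(),
                        minlength=n_gt * n_seg).reshape(n_gt, n_seg).astype(float)
    gt_sizes = table.sum(axis=1)                   # |R|
    seg_sizes = table.sum(axis=0)                  # |R'|
    union = gt_sizes[:, None] + seg_sizes[None, :] - table
    iou = np.divide(table, union, out=np.zeros_like(table), where=union > 0)
    return float(np.sum(gt_sizes * iou.max(axis=1)) / seg.size)
```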

Parameters: Given an over-segmentation, IS4 computes the superpixel features by averaging the local spectral histogram (LSH) [34] features of pixels within each superpixel. We use the algorithm and parameters presented in [54] to extract features and build a dictionary of $q$ words. Parameter $q$ should be chosen sufficiently large (larger than the number of coherent regions) to ensure each superpixel feature is well expressible as a nonnegative linear combination of the dictionary words; we set it to a value that is normally much larger than the number of coherent regions in BSD and MSRC images. Our proposed model in (3) has two regularization parameters, $\lambda_1$ and $\lambda_2$, where $\lambda_1$ controls the effect of the spatial relationship among superpixels and $\lambda_2$ controls the number of selected words. As $\lambda_1$ increases, neighboring regions are more likely to be merged together, and as $\lambda_2$ increases, the number of selected words reduces. We optimize $\lambda_1$ on the training set of BSD by applying grid search and use the optimized value in our experiments on the BSD300, BSD500, and MSRC datasets. Parameter $\lambda_2$ is set to $\alpha\lambda_{\max}$, where $\alpha \in (0, 1]$ and $\lambda_{\max}$ is a constant computed from the model inputs using [12]. If $\lambda_2$ is greater than $\lambda_{\max}$ (i.e., $\alpha > 1$), only a single word is selected to represent the whole feature set. We follow [1, 2, 19, 43] to present our results as a family of segmentations which share the same parameter settings except for $\alpha$, which varies over its range. The evaluation measures are also reported at Optimal Dataset Scale (ODS) and Optimal Image Scale (OIS).

4.2 Results

Segmentation quality: We run IS4 on the benchmark datasets and report the results in Tables 1, 2, and 3 to compare with recent methods such as Normalized Cuts (Ncut) [44], Multiscale Normalized Cuts (MNcut) [8], gPb-Ultrametric Contour Map (gPb-UCM) [1], Image Segmentation by Cascaded Region Agglomeration (ISCRA) [43], Reverse Image Segmentation with High/Low-level Pairwise Potentials (RIS+HL) [52], Multiscale Combinatorial Grouping (MCG) [2], Contour-guided Color Palettes (CCP-2) [17], Piecewise Flat Embedding (PFE) [53], Discrete-Continuous Gradient Orientation Estimation for Segmentation (DC-Seg) [10], Graph Without Cut (GWC) [19], and aligned hierarchical segmentation (MCG-Aligned) [6]. All scores are collected from [1, 2, 6, 10, 19] except those of MCG on MSRC and CCP-2 on BSD500, which are obtained by running the implementations provided by the respective authors.

Cov (↑) PRI (↑) VoI (↓)
Methods ODS OIS ODS OIS ODS OIS
MNcut[8]
gPb-UCM [1]
ISCRA [43] 0.86
RIS+HL[52] 0.82 0.86
MCG [2] 0.61 0.86 1.37
GWC [19] 0.61 0.68 0.82 0.86
IS4(MCG) 0.61 1.54
Table 1: Quantitative comparisons on BSD300 val set.
Cov (↑) PRI (↑) VoI (↓)
Methods ODS OIS ODS OIS ODS OIS
Ncut [44]
gPb-UCM [1]
DC-Seg [10]
ISCRA [43]
RIS+HL[52] 0.84
MCG [2]
PFE+MCG [53] 0.68 0.84 0.87
MCG-Aligned [6] 0.63 0.68 1.53
GWC [19] 0.87
IS4(MCG) 0.63 1.35
Table 2: Quantitative comparisons on BSD500 test set.
Cov (↑) PRI (↑) VoI (↓)
Methods ODS OIS ODS OIS ODS OIS
gPb-UCM [1]
ISCRA [43]
GWC [19]
MCG [2]
IS4(MCG) 0.69 0.77 0.80 0.86 1.15 0.91
Table 3: Quantitative comparisons on MSRC dataset.
Cov (↑) PRI (↑) VoI (↓)
Methods ODS OIS ODS OIS ODS OIS
ISCRA [43]
IS4 (ISCRA)
CCP-2 [17]
IS4 (CCP-2)
MCG [2]
IS4(MCG)
Table 4: Sensitivity of our method with respect to the different initial over-segmentations on BSD500 test set.
Cov (↑) PRI (↑) VoI (↓)
ODS OIS ODS OIS ODS OIS
Table 5: Sensitivity of our method with respect to the parameter variations on BSD500 test set.

Parameter sensitivity: To evaluate the role played by the initial over-segmentation, we run IS4 in combination with three segmentation methods: CCP-2, ISCRA, and MCG. For CCP-2, we use the parameter settings suggested by the respective authors; for MCG and ISCRA, we adopt the segmentations at a fixed scale as the over-segmentations. On average, the over-segmentation layers provided by CCP-2, ISCRA, and MCG have 120, 45, and 37 superpixels, respectively. Figure 4(a) and Table 4 respectively show the qualitative and quantitative results of these combinations. Moreover, we run IS4 for different parameter values to assess its robustness with respect to these variations; the results are reported in Table 5 in terms of the segmentation measures. As Tables 4 and 5 indicate, IS4 not only achieves satisfactory results for a wide range of parameter values but also improves the quality of the initial over-segmentations on most of the segmentation measures. It is worth pointing out that IS4 can be applied to the result of any segmentation method, whether the result is directly generated by a segmentation algorithm (e.g., CCP-2) or obtained from a specific level of a hierarchical segmentation (e.g., MCG).

Tables 1, 2, and 3 show that IS4 in combination with MCG generates high-quality segmentations. The scores indicate that IS4(MCG) outperforms all competitor methods on BSD300 (ODS: VoI) and MSRC (ODS: Cov, PRI, VoI; OIS: Cov, PRI, VoI). The other scores obtained by IS4(MCG) are also on par with or in close proximity to the best competitors, except for BSD300 (OIS: Cov, PRI) and BSD500 (OIS: PRI). In comparison with MCG, IS4(MCG) achieves better results on BSD300 (ODS: VoI), BSD500 (ODS: Cov, VoI; OIS: VoI), and MSRC (ODS: Cov, PRI, VoI; OIS: Cov, PRI, VoI). Our method is also on par with MCG on BSD300 (ODS: Cov, PRI) and BSD500 (ODS: PRI). Moreover, the tables indicate that IS4(MCG) achieves lower scores than MCG on BSD300 (OIS: Cov, PRI, VoI) and BSD500 (OIS: PRI). It may seem reasonable to expect that IS4(MCG) should consistently improve the MCG measures; however, this is not an entirely accurate expectation. The reason is that IS4 does not directly operate on the MCG hierarchical segmentation to improve or degrade the MCG results. It simply adopts an over-segmentation obtained by thresholding the MCG hierarchical segmentation and generates a family of segmentations; these segmentations differ from the ones that may be obtained at different thresholds of MCG.

MCG usually provides a high-quality hierarchical segmentation, but its performance is sometimes unsatisfactory, especially on textured images. Our method in combination with MCG improves the segmentation quality of such images (shown in Figure 4(b)) by adopting superpixel features as an informative representation of small regions. Despite the advantages of IS4, it may fail to correctly segment the pixels belonging to an elongated coherent region. The reason is that the local neighborhood around these pixels contains visual information from neighboring coherent regions; hence, their LSH features are inaccurate, which may cause a wrong assignment to the neighboring coherent regions.

Figure 4: Qualitative comparison of segmentations. (a) From top to bottom, left column: image, CCP-2 [17], ISCRA [43], and MCG [2]; right column: ground-truth, IS4(CCP-2), IS4(ISCRA), and IS4(MCG). The left column shows the initial over-segmentations provided by these methods and the right column shows our final segmentations based on them. (b) All segmentation results are shown at Optimal Dataset Scale (ODS).

Algorithm complexity: We decomposed model (3) into three sub-problems, (9), (10), and (11), which constitute the steps of the alternating direction method of multipliers (ADMM). To investigate the complexity of solving these sub-problems, we discuss each one separately. Sub-problem (9) is cast as parallel equality-constrained quadratic programs of the form (12), where each can be efficiently solved via a system of linear equations [5, 47]. Since $\mathbf{H}$ and $\mathbf{A}$ are the same in all programs, the left-hand sides of these systems are identical; hence, only one system needs to be factorized, after which back-solves can be carried out for the right-hand sides of the other programs. It is worth mentioning that in practice the cost is much lower than the worst case, because $\mathbf{H}$ and $\mathbf{A}$ are highly sparse and structured. Sub-problem (10) is split into the two parallel problems (13) and (14). Problem (13) consists of parallel programs, each solvable in closed form [51]; problem (14) consists of parallel programs, each solvable in closed form [4]. Therefore, (10) can be solved very efficiently. Sub-problem (11) consists of parallel updates over the columns of $\boldsymbol{\Lambda}_1$ and $\boldsymbol{\Lambda}_2$. In total, the complexity of our numerical solution is dominated by the single factorization in (9) and the subsequent highly parallel back-solves and projections.
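The factor-once, back-solve-many pattern that this analysis relies on can be realized by caching a single factorization of the shared left-hand side. A minimal SciPy sketch with a stand-in matrix (the actual systems come from the KKT conditions of (12)):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
k, n_rhs = 50, 200
A = rng.standard_normal((k, k)) + k * np.eye(k)    # stand-in shared left-hand side
B = rng.standard_normal((k, n_rhs))                # one right-hand side per program

lu, piv = lu_factor(A)                             # factorize once: O(k^3)
Xsol = lu_solve((lu, piv), B)                      # back-solve per column: O(k^2)
assert np.allclose(A @ Xsol, B)
```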