Bayesian nonparametric image segmentation using a generalized Swendsen-Wang algorithm
Unsupervised image segmentation aims at clustering the set of pixels of an image into spatially homogeneous regions. We introduce here a class of Bayesian nonparametric models to address this problem. These models are based on a combination of a Potts-like spatial smoothness component and a prior on partitions which is used to control both the number and size of clusters. This class of models is flexible enough to include the standard Potts model and the more recent Potts-Dirichlet Process model . More importantly, any prior on partitions can be introduced to control the global clustering structure so that it is possible to penalize small or large clusters if necessary. Bayesian computation is carried out using an original generalized Swendsen-Wang algorithm. Experiments demonstrate that our method is competitive in terms of RAND index compared to popular image segmentation methods, such as mean-shift, and recent alternative Bayesian nonparametric models.
Sophisticated statistical models were introduced early on to address unsupervised image segmentation tasks. The seminal 1984 paper of  popularized Ising-Potts models and more generally Markov Random Fields (MRF) as well as Markov chain Monte Carlo (MCMC) methods in this area. There has been an increasing interest in such approaches ever since [14, 27, 1, 2]. A key problem of MRF type approach is that they typically require specifying the number of clusters beforehand. It is easy conceptually to assign a prior to this number but then Bayesian inference becomes computationally demanding as the partition function of the MRF is analytically intractable . It is additionally difficult to design efficient reversible jump MCMC algorithms in this context . Recently, a few Bayesian nonparametric (BNP) models for image segmentation have been proposed which bypass these problems [6, 21, 25, 5, 11].
In this paper we propose an original class of BNP models which generalizes the approach pioneered by . Our model combines a spatial smoothness component to ensure that each data point is more likely to have the same label as other spatially close data and a partition prior to control the overall number and size of clusters. This model is flexible enough to encompass the Potts model, the Potts-Dirichlet Process (Potts-DP) model  but additionally it allows us to introduce easily prior information which prevents the creation of small and/or large clusters.
Bayesian inference in this context is not analytically tractable and requires the use of MCMC techniques. It is possible to derive a simple single-site Gibbs sampler to perform Bayesian computation as in [9, 21, 11] but the mixing properties of such samplers are poor. A popular alternative to single-site Gibbs sampling for Potts models is the Swendsen-Wang (SW) algorithm  which originates from . In an image segmentation context where the Potts model is unobserved, SW can also mix poorly but a generalized version of it has been developed to overcome this shortcoming [7, 14, 1, 2]. We develop here an original Generalized SW (GSW) algorithm that is reminiscent of split-merge samplers for Dirichlet process mixtures   . For a particular setting of the BNP model parameters, our GSW is actually an original split-merge sampler for Dirichlet process mixtures.
We demonstrate our BNP model and the associated GSW sampler on a standard database of different natural scene types. We focus here on a truncated version of the Potts-DP model penalizing low-size clusters. Experimentally this model provides visually better segmentation results than its non-truncated version and performs similarly to some recently proposed BNP alternatives. From a computational point of view, the GSW allows us to better explore high posterior probability regions compared to single-site Gibbs.
2 Statistical model
2.1 Likelihood Model
We model an observed image not as a collection of pixels but as a collection of super-pixels, which correspond to small blocks of contiguous textually-alike pixels [23, 11]. Unlike normal image pixels that enjoy regular lattice with their neighbours, the super-pixels on the other hand form irregular lattices with their neighbouring super-pixels.
The super-pixels, called sites, constituting an image are assumed conditionally independent given some latent variables with
where can take a number of different values called cluster locations denoted . These cluster locations are assumed to be statistically independent; i.e. for where is a probability measure with no atomic component.
The choice of and is application dependent. In the experiments discussed in section 4, is summarized by a histogram, is a multinomial distribution, the associated multinomial parameters and a finite Dirichlet distribution.
We associate to each site an allocation variable satisfying and we denote . Let be the random partition of defined by equivalence classes for the equivalence relation . The partition is an unordered collection of disjoint nonempty subsets of , , where and is the number of subsets for partition . Given the partition , the marginal likelihood of the observations , integrating out cluster locations, is given by
We assume further on that is known analytically; e.g. is a conjugate prior for .
2.2 Potts-Partition Model
Our model combines a Potts-type spatial smoothness component and a partition model. We review briefly the Potts model and partition models before discussing how they can be combined in a simple way. We then present examples of special interest.
2.2.1 Potts model
A standard approach to statistical image segmentation consists of assigning a Potts prior distribution on which introduces some spatial smoothness in the clustering [14, 27]. In this case, the allocation variables can only take a prespecified number of different values and we set
where we write ‘’ if the super-pixels and are neighbours on a prespecified neighbouring structure, if and otherwise. We set to enforce that two neighbours are more likely to have the same label. To simplify notation, we will write
and set if is not a neighbour of . The Potts model induces the following distribution over the partition
2.2.2 Partition model
We review a general class of partition models where the prior distribution on partitions can be expressed in terms of an exchangeable probability function (EPF) ; that is
where denotes the size of the cluster and is a symmetric function of its arguments, i.e.
for any permutation of . The EPF implicitly tunes the prior distribution on the overall number of clusters and sizes of the clusters. A good overview of EPF for clustering can be found in .
Let us denote then, if we only allow for a maximum number of clusters, we can select
which favours large values of but does not penalize cluster sizes or
where which is the finite Dirichlet partition model. If we do not limit the number of clusters, a very popular partition model is the Dirichlet process partition model where for
with the standard gamma function. The properties of this distribution over partitions are well understood, see e.g. . The parameter tunes the number of clusters in the partition as displayed in Figure 1, the mean number of clusters being approximately .
The Dirichlet process partition model can be further generalized to the two parameter Poisson-Dirichlet partition model  given by
where and verify either and or and for some . For , we obtain the Dirichlet process partition model.
All the previous models have been used extensively to address general clustering tasks. In the specific context of image segmentation, it can be of interest to exclude low size clusters. This is easily possible by restricting the support of to clusters of minimum size so that for the Dirichlet process prior, we have
2.2.3 Combining Potts and partition models
Our proposed model combines a spatial smoothness component of the form
akin to the Potts model with a EPF-type model (6) through
Clearly if we set for all then one recovers the classical priors for clustering allowing us to control the number and size of clusters whereas ensures that spatially close super-pixels are more likely to be in the same cluster.
If we select as (7), we are back to the standard Potts model given in (5) whereas if we select as the Dirichlet partition model (9) then the proposed partition model (13) corresponds to the Potts-DP model of .
In Figure 2, we display the expected number of clusters for the Potts-DP model for different values of the Potts parameter for neighbours and a Dirichlet process parameter ; the neighbouring structure is described in Section 4.
Image segmentation now relies on the posterior over partitions
where is given in (2).
3 Generalized Swendsen-Wang algorithm for Bayesian computation
3.1 Single-site Gibbs sampler
A standard strategy to sample approximately from is to successively update the cluster assignment of each site given the cluster assignments of the other sites. This sampler proceeds as follows. Let be the partition obtained by removing site from , and be the size of .
The site will be assigned to cluster with probability proportional to
and be assigned to a new cluster with probability proportional to
This strategy, used by  for the Potts-DP model, is simple to implement, but exhibits poor mixing properties as the cluster assignment of a given site is highly correlated with the cluster assignments of its neighbours due to the spatial smoothness component.
3.2 Generalized Swendsen-Wang sampler
We propose here a GSW sampler that allows us to update simultaneously cluster labels of groups of sites and hence improve the exploration of the posterior. This algorithm can be interpreted as a generalization of the technique proposed by  for standard Potts models to our generalized Potts-partition model. It includes as special cases the single-site Gibbs sampler presented previously and the classical Swendsen-Wang algorithm.
The GSW relies on the introduction of auxiliary binary bond variables where if sites and are bonded and otherwise. We write and the augmented model is defined by
with, for ,
where ; that is two neighbouring sites () sites with the same cluster assignment are bonded with probability . The parameters are hyperparameters of the GSW sampler whose choice will be discussed later on.
The introduction of this augmented probabilistic model allows us to sample from the resulting posterior distribution using a block Gibbs strategy which iteratively and successively samples and
The bonds are independent of the data given the partition . Therefore we have and the bond variables , are updated independently using (18). The other conditional distribution can be expressed as
The bond variables induce groups of sites which have the same cluster label, as the term implies that the conditional distribution only assigns positive probability mass to partitions where bonded sites are in the same cluster. Let be the groups of sites, or spin-clusters, induced by the bonds. We denote by the partition obtained by removing sites from and the size of . Note that this notation differs slightly from the one introduced in the previous section on single-site Gibbs sampling.
We then successively update the cluster assignment of each spin-cluster given the cluster assignments of the other spin-clusters. The spin-cluster is assigned to cluster with probability proportional to
and to a new cluster with probability proportional to
The difference between the original SW and this generalized version is the term
which only depends on the cluster assignments of the sites which are neighbours of the group . The algorithm reduces to single site Gibbs sampling if and to the classical SW algorithm if .
The overall GSW sampler, which is summarized in Figure 3, proceeds as follows at each iteration.
For each such that , sample the bond variables
where Ber is the Bernoulli distribution of parameter Let denote the corresponding spin-clusters.
For each spin-cluster
Let be the partition obtained by removing sites from , and be the size of . Then all sites in the spin-cluster will be associated to cluster with probability proportional to
or be associated to a new cluster with probability proportional to
Whatever being the hyperparameters , the Markov transition kernel of the GSW sampler admits as invariant distribution. A first simple choice, that is made in this article, is to set for all such that . Another choice is to set based on the observations. For example, we can set
where is some distance measure and are some positive tuning parameters. We also tried this setting, but this did not improve the results significantly. More sophisticated choices have been proposed by [1, 2] in the context of the standard Potts model. The authors report significant improvements but we have not pursued this approach here.
4.1 Dataset, preprocessing and likelihood
The dataset and preprocessing steps adopted here follow essentially  . We used two different datasets. The first dataset is from labelme toolbox’  “8 scene categories”, which is comprised of eight categories of natural outdoor images. All images used contain 256x256 pixels. The second dataset is the BSB300  which contain images of variable sizes. Sampling-based segmentation methods could be prohibitively slow if we were associated to each single pixel location a different site. Therefore, mimicking the steps taken in the dependent Chinese Restaurant Process (dd-CRP) and its Hierarchical (regional) variant (rdd-CRP) , we first group image pixels into so-called, super-pixels, in which around 60, colour/textually-alike pixels are grouped to form a single super-pixel. The super-pixel representation is a frequently used techniques to pre-group pixels in image processing literatures. The computation is relatively fast. It reduces the amount of sites one has to perform from a full image size to merely around sites. In our work, we used a standard super-pixel toolbox , Although it’s not a central theme of our research, but we anticipate that by using a more state-of-the-art super pixel generation algorithm, it should improve our segmentation result even further.
The observation is then constructed by forming a histogram of the pixels within each super pixel. Instead of using a simple 8-bin histogram as in , we followed the technique of , in which colour information was used to construct a 120-bin representation via a clustering procedure. In order to provide a fair comparison, the same histogram pre-processing is used for all the sampling-based methods described in this paper, namely, dd-CRP, rdd-CRP and our algorithm. The difference, however, between our work and the work of  is that we do not use texton histograms. This is done deliberately to perform a fair comparison with the mean-shift method, which uses purely luminance information.
We optimized the parameters of both mean-shift and rdd-CRP so as to maximize the rand index  by using a training set of images from the LabelMe dataset. The optimal rdd-CRP parameters we obtained were and . For mean-shift, we used a spatial bandwidth of 20, a range bandwidth of 15 and a minimum region size of 1500.
To compute , we use for a multinomial distribution and a 120-dimensional Dirichlet distribution with concentration vector , where is the normalised sum of all data histograms, and . We set for the concentration parameter.
4.2 Evaluation of the generalized Swendsen-Wang algorithm
We display the performance of the GSW algorithms for the Potts-DP model with and and different values of in Figure 4. Increasing the value of allows us to better explore the posterior distribution, see Figure 4(a). When using GSW as a stochastic search algorithm to get a Maximum A Posteriori (MAP) estimate, better MAP estimates are obtained on average as increases until about 10-20, as shown in Figure 4(b). Using too high a value of is inefficient as it slows down the convergence of the Markov chain to its stationary distribution. We found experimentally that provides on average good and stable results.
Similar conclusions were reached when using different values of and and for the truncated Potts-DP model discussed in (11).
4.3 Comparison to other methods
As we expect the number of clusters to be around ten, we evaluate the Potts-DP 111Experiments were also performed with the two parameter Poisson-Dirichlet, but without observed improvement on the performances. and the truncated Potts-DP models with and in agreement with Figure 2. We compare our results to mean-shift  and rdd-CRP . We tested all these methods by randomly selecting 50 images from each of 8 categories in the LabelMe dataset (i.e. on a total of 400 images) and 200 images from the Berkeley dataset. We display the obtained results on six particular images in Figure 5. As expected the number of clusters decreases as increases.
|Med. Nb clust.||12||6||9||6||4||3|
|Med. Nb clust.||23||9||12||8||4||3|
To assess the quality of the image segmentation results, we use the rand index  computed using the “ground-truth” which is obtained through a manual labelling . This comparison was also performed in  and the results are presented in Table 1. To evaluate the statistical significance of the results, we performed a Wilcoxon signed-rank test between the method with highest rand index (rdd-CRP for LabelMe and Mean-Shift for Berkeley dataset) and the other methods. We found no statistically significant difference (at the 1% level) between the performances of rdd-CRP, Potts-DP and truncated Potts-DP on the LabelMe dataset, and no statistical difference between Mean-Shift and Potts-DP on the Berkeley dataset.
Despite the significant differences observed visually when increasing the truncation threshold, e.g. see Figure 5, this does not translate in any improvement from the rand index point of view. However the manual labelling appears fairly subjective, so the rand index and the associated results have to be interpreted carefully.
This paper has introduced an original BNP image segmentation model that allows us to easily introduce prior information so as to penalize the overall number and size of clusters while preserving a spatial smoothing component. Computationally we have shown that Bayesian inference can be carried out using a GSW sampler which explores the posterior distribution of interest by splitting and merging clusters. Experimentally, the image segmentation results we obtained using a truncated Potts-DP model are competitive to mean-shift and rdd-CRP. We believe that it is a promising approach that deserves further investigation even if the model has limits inherent to the use of a spatial Potts prior: it penalizes the overall number and size of clusters but not connected components, and may end up with some isolated components. Nonetheless, there is always a tradeoff between goodness of fit of the model and computational tractability: the proposed BNP model has the ability to control the overall clustering structure, while the associated GSW sampler is easy to put in practice and allows experimentally a good exploration of the posterior. Furthermore, in the context of the standard Potts model, various improvements over the algorithm of  have been proposed by [1, 2]. In particular  propose a careful selection of the auxiliary parameters (24), various sophisticated reversible jump MCMC moves to swap labels and multi-level approaches. They report visually impressive segmentation results and it is likely that developing similar type ideas for the BNP segmentation model proposed here would further improve performance.
-  A. Barbu and S.C. Zhu. Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1239–1253, 2005.
-  A. Barbu and S.C. Zhu. Generalizing Swendsen-Wang for image analysis. Journal of Computational and Graphical Statistics, 16(4):877–900, 2007.
-  D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
-  David B Dahl. Sequentially-allocated merge-split sampler for conjugate and nonconjugate Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 11, 2005.
-  L. Du, L. Ren, D. Dunson, and L. Carin. A Bayesian model for simultaneous image clustering, annotation and object segmentation. In Advances in Neural Information Processing Systems, 2009.
-  J.A. Duan, M. Guindani, and A.E. Gelfand. Generalized spatial Dirichlet process models. Biometrika, 94(4):809–825, 2007.
-  R. E. Edwards and A. D. Sokal. Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm. Physical review D, 38(6):2009–2012, 1988.
-  C.M. Fortuin and P.W. Kasteleyn. On the random-cluster model:: I. introduction and relation to other models. Physica, 57(4):536–564, 1972.
-  S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
-  S. Ghosh and E.B. Sudderth. Nonparametric learning for layered segmentation of natural images. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2272–2279. IEEE, 2012.
-  S. Ghosh, A.B. Ungureanu, E.B. Sudderth, and D.M. Blei. Spatial distance dependent chinese restaurant processes for image segmentation. In Advances in Neural Information Processing Systems, 2011.
-  P.J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732, 1995.
-  P.J. Green and S. Richardson. Hidden Markov models and disease mapping. Journal of the American Statistical Association, 97:1055–1070, 2002.
-  D.M. Higdon. Auxiliary variable methods for markov chain monte carlo with applications. Journal of the American Statistical Association, pages 585–595, 1998.
-  L. Hubert and P. Arabie. Comparing partitions. Journal of classification, 2(1):193–218, 1985.
-  Michael Hughes, Emily Fox, and Erik Sudderth. Effective split-merge monte carlo methods for nonparametric models of sequential data. In Advances in Neural Information Processing Systems, pages 1304–1312, 2012.
-  S. Jain and R. Neal. A split-merge markov chain monte carlo procedure for the dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1), 2004.
-  J. Lau and P.J. Green. Bayesian model based clustering procedures. Journal of Computational and Graphical Statistics, 16:526–558, 2007.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
-  Greg Mori, Xiaofeng Ren, Alexei A Efros, and Jitendra Malik. Recovering human body configurations: Combining segmentation and recognition. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–326. IEEE, 2004.
-  P. Orbanz and J. M. Buhmann. Nonparametric Bayesian image segmentation. International Journal of Computer Vision, 77:25–45, 2008.
-  J. Pitman. Exchangeable and partially exchangeable random partitions. Probability theory and related fields, 102:145–158, 1995.
-  X. Ren and J. Malik. Learning a classification model for segmentation. In IEEE International Conference on Computer Vision, 2003, 2003.
-  B.C. Russell, A. Torralba, K.P. Murphy, and W.T. Freeman. Labelme: a database and web-based tool for image annotation. International journal of computer vision, 77(1):157–173, 2008.
-  E.B. Sudderth and M.I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Advances in Neural Information Processing Systems, volume 21, pages 1585–1592, 2009.
-  R.H. Swendsen and J.-S. Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58:86–88, 1987.
-  G. Winkler. Image analysis, random fields and Markov chain Monte Carlo methods: a mathematical introduction, volume 27. Springer Verlag, 2003.