Bayesian nonparametric image segmentation using a generalized SwendsenWang algorithm
Abstract
Unsupervised image segmentation aims at clustering the set of pixels of an image into spatially homogeneous regions. We introduce here a class of Bayesian nonparametric models to address this problem. These models are based on a combination of a Pottslike spatial smoothness component and a prior on partitions which is used to control both the number and size of clusters. This class of models is flexible enough to include the standard Potts model and the more recent PottsDirichlet Process model [21]. More importantly, any prior on partitions can be introduced to control the global clustering structure so that it is possible to penalize small or large clusters if necessary. Bayesian computation is carried out using an original generalized SwendsenWang algorithm. Experiments demonstrate that our method is competitive in terms of RAND index compared to popular image segmentation methods, such as meanshift, and recent alternative Bayesian nonparametric models.
1 Introduction
Sophisticated statistical models were introduced early on to address unsupervised image segmentation tasks. The seminal 1984 paper of [9] popularized IsingPotts models and more generally Markov Random Fields (MRF) as well as Markov chain Monte Carlo (MCMC) methods in this area. There has been an increasing interest in such approaches ever since [14, 27, 1, 2]. A key problem of MRF type approach is that they typically require specifying the number of clusters beforehand. It is easy conceptually to assign a prior to this number but then Bayesian inference becomes computationally demanding as the partition function of the MRF is analytically intractable [13]. It is additionally difficult to design efficient reversible jump MCMC algorithms in this context [12]. Recently, a few Bayesian nonparametric (BNP) models for image segmentation have been proposed which bypass these problems [6, 21, 25, 5, 11].
In this paper we propose an original class of BNP models which generalizes the approach pioneered by [21]. Our model combines a spatial smoothness component to ensure that each data point is more likely to have the same label as other spatially close data and a partition prior to control the overall number and size of clusters. This model is flexible enough to encompass the Potts model, the PottsDirichlet Process (PottsDP) model [21] but additionally it allows us to introduce easily prior information which prevents the creation of small and/or large clusters.
Bayesian inference in this context is not analytically tractable and requires the use of MCMC techniques. It is possible to derive a simple singlesite Gibbs sampler to perform Bayesian computation as in [9, 21, 11] but the mixing properties of such samplers are poor. A popular alternative to singlesite Gibbs sampling for Potts models is the SwendsenWang (SW) algorithm [26] which originates from [8]. In an image segmentation context where the Potts model is unobserved, SW can also mix poorly but a generalized version of it has been developed to overcome this shortcoming [7, 14, 1, 2]. We develop here an original Generalized SW (GSW) algorithm that is reminiscent of splitmerge samplers for Dirichlet process mixtures [4] [17] [16]. For a particular setting of the BNP model parameters, our GSW is actually an original splitmerge sampler for Dirichlet process mixtures.
We demonstrate our BNP model and the associated GSW sampler on a standard database of different natural scene types. We focus here on a truncated version of the PottsDP model penalizing lowsize clusters. Experimentally this model provides visually better segmentation results than its nontruncated version and performs similarly to some recently proposed BNP alternatives. From a computational point of view, the GSW allows us to better explore high posterior probability regions compared to singlesite Gibbs.
2 Statistical model
2.1 Likelihood Model
We model an observed image not as a collection of pixels but as a collection of superpixels, which correspond to small blocks of contiguous textuallyalike pixels [23, 11]. Unlike normal image pixels that enjoy regular lattice with their neighbours, the superpixels on the other hand form irregular lattices with their neighbouring superpixels.
The superpixels, called sites, constituting an image are assumed conditionally independent given some latent variables with
(1) 
where can take a number of different values called cluster locations denoted . These cluster locations are assumed to be statistically independent; i.e. for where is a probability measure with no atomic component.
The choice of and is application dependent. In the experiments discussed in section 4, is summarized by a histogram, is a multinomial distribution, the associated multinomial parameters and a finite Dirichlet distribution.
We associate to each site an allocation variable satisfying and we denote . Let be the random partition of defined by equivalence classes for the equivalence relation . The partition is an unordered collection of disjoint nonempty subsets of , , where and is the number of subsets for partition . Given the partition , the marginal likelihood of the observations , integrating out cluster locations, is given by
(2) 
where and
(3) 
We assume further on that is known analytically; e.g. is a conjugate prior for .
2.2 PottsPartition Model
Our model combines a Pottstype spatial smoothness component and a partition model. We review briefly the Potts model and partition models before discussing how they can be combined in a simple way. We then present examples of special interest.
2.2.1 Potts model
A standard approach to statistical image segmentation consists of assigning a Potts prior distribution on which introduces some spatial smoothness in the clustering [14, 27]. In this case, the allocation variables can only take a prespecified number of different values and we set
(4) 
where we write ‘’ if the superpixels and are neighbours on a prespecified neighbouring structure, if and otherwise. We set to enforce that two neighbours are more likely to have the same label. To simplify notation, we will write
and set if is not a neighbour of . The Potts model induces the following distribution over the partition
(5) 
2.2.2 Partition model
We review a general class of partition models where the prior distribution on partitions can be expressed in terms of an exchangeable probability function (EPF) [22]; that is
(6) 
where denotes the size of the cluster and is a symmetric function of its arguments, i.e.
for any permutation of . The EPF implicitly tunes the prior distribution on the overall number of clusters and sizes of the clusters. A good overview of EPF for clustering can be found in [18].
Let us denote then, if we only allow for a maximum number of clusters, we can select
(7) 
which favours large values of but does not penalize cluster sizes or
(8) 
where which is the finite Dirichlet partition model. If we do not limit the number of clusters, a very popular partition model is the Dirichlet process partition model where for
(9) 
with the standard gamma function. The properties of this distribution over partitions are well understood, see e.g. [18]. The parameter tunes the number of clusters in the partition as displayed in Figure 1, the mean number of clusters being approximately .
The Dirichlet process partition model can be further generalized to the two parameter PoissonDirichlet partition model [22] given by
(10) 
where and verify either and or and for some . For , we obtain the Dirichlet process partition model.
All the previous models have been used extensively to address general clustering tasks. In the specific context of image segmentation, it can be of interest to exclude low size clusters. This is easily possible by restricting the support of to clusters of minimum size so that for the Dirichlet process prior, we have
(11) 
2.2.3 Combining Potts and partition models
Our proposed model combines a spatial smoothness component of the form
(12) 
akin to the Potts model with a EPFtype model (6) through
(13) 
Clearly if we set for all then one recovers the classical priors for clustering allowing us to control the number and size of clusters whereas ensures that spatially close superpixels are more likely to be in the same cluster.
If we select as (7), we are back to the standard Potts model given in (5) whereas if we select as the Dirichlet partition model (9) then the proposed partition model (13) corresponds to the PottsDP model of [21].
3 Generalized SwendsenWang algorithm for Bayesian computation
3.1 Singlesite Gibbs sampler
A standard strategy to sample approximately from is to successively update the cluster assignment of each site given the cluster assignments of the other sites. This sampler proceeds as follows. Let be the partition obtained by removing site from , and be the size of .
The site will be assigned to cluster with probability proportional to
(15) 
and be assigned to a new cluster with probability proportional to
(16) 
This strategy, used by [21] for the PottsDP model, is simple to implement, but exhibits poor mixing properties as the cluster assignment of a given site is highly correlated with the cluster assignments of its neighbours due to the spatial smoothness component.
3.2 Generalized SwendsenWang sampler
We propose here a GSW sampler that allows us to update simultaneously cluster labels of groups of sites and hence improve the exploration of the posterior. This algorithm can be interpreted as a generalization of the technique proposed by [14] for standard Potts models to our generalized Pottspartition model. It includes as special cases the singlesite Gibbs sampler presented previously and the classical SwendsenWang algorithm.
The GSW relies on the introduction of auxiliary binary bond variables where if sites and are bonded and otherwise. We write and the augmented model is defined by
(17) 
where
with, for ,
(18) 
where ; that is two neighbouring sites () sites with the same cluster assignment are bonded with probability . The parameters are hyperparameters of the GSW sampler whose choice will be discussed later on.
The introduction of this augmented probabilistic model allows us to sample from the resulting posterior distribution using a block Gibbs strategy which iteratively and successively samples and
The bonds are independent of the data given the partition . Therefore we have and the bond variables , are updated independently using (18). The other conditional distribution can be expressed as
The bond variables induce groups of sites which have the same cluster label, as the term implies that the conditional distribution only assigns positive probability mass to partitions where bonded sites are in the same cluster. Let be the groups of sites, or spinclusters, induced by the bonds. We denote by the partition obtained by removing sites from and the size of . Note that this notation differs slightly from the one introduced in the previous section on singlesite Gibbs sampling.
We then successively update the cluster assignment of each spincluster given the cluster assignments of the other spinclusters. The spincluster is assigned to cluster with probability proportional to
(19) 
and to a new cluster with probability proportional to
(20) 
The difference between the original SW and this generalized version is the term
which only depends on the cluster assignments of the sites which are neighbours of the group . The algorithm reduces to single site Gibbs sampling if and to the classical SW algorithm if .
The overall GSW sampler, which is summarized in Figure 3, proceeds as follows at each iteration.

For each such that , sample the bond variables
where Ber is the Bernoulli distribution of parameter Let denote the corresponding spinclusters.

For each spincluster

Let be the partition obtained by removing sites from , and be the size of . Then all sites in the spincluster will be associated to cluster with probability proportional to
(22) or be associated to a new cluster with probability proportional to
(23)

Whatever being the hyperparameters , the Markov transition kernel of the GSW sampler admits as invariant distribution. A first simple choice, that is made in this article, is to set for all such that . Another choice is to set based on the observations. For example, we can set
(24) 
where is some distance measure and are some positive tuning parameters. We also tried this setting, but this did not improve the results significantly. More sophisticated choices have been proposed by [1, 2] in the context of the standard Potts model. The authors report significant improvements but we have not pursued this approach here.
4 Experiments
4.1 Dataset, preprocessing and likelihood
The dataset and preprocessing steps adopted here follow essentially [11] [10]. We used two different datasets. The first dataset is from labelme toolbox’ [24] “8 scene categories”, which is comprised of eight categories of natural outdoor images. All images used contain 256x256 pixels. The second dataset is the BSB300 [19] which contain images of variable sizes. Samplingbased segmentation methods could be prohibitively slow if we were associated to each single pixel location a different site. Therefore, mimicking the steps taken in the dependent Chinese Restaurant Process (ddCRP) and its Hierarchical (regional) variant (rddCRP) [11], we first group image pixels into socalled, superpixels, in which around 60, colour/textuallyalike pixels are grouped to form a single superpixel. The superpixel representation is a frequently used techniques to pregroup pixels in image processing literatures. The computation is relatively fast. It reduces the amount of sites one has to perform from a full image size to merely around sites. In our work, we used a standard superpixel toolbox [20], Although it’s not a central theme of our research, but we anticipate that by using a more stateoftheart super pixel generation algorithm, it should improve our segmentation result even further.
The observation is then constructed by forming a histogram of the pixels within each super pixel. Instead of using a simple 8bin histogram as in [21], we followed the technique of [11], in which colour information was used to construct a 120bin representation via a clustering procedure. In order to provide a fair comparison, the same histogram preprocessing is used for all the samplingbased methods described in this paper, namely, ddCRP, rddCRP and our algorithm. The difference, however, between our work and the work of [11] is that we do not use texton histograms. This is done deliberately to perform a fair comparison with the meanshift method, which uses purely luminance information.
We optimized the parameters of both meanshift and rddCRP so as to maximize the rand index [15] by using a training set of images from the LabelMe dataset. The optimal rddCRP parameters we obtained were and . For meanshift, we used a spatial bandwidth of 20, a range bandwidth of 15 and a minimum region size of 1500.
To compute , we use for a multinomial distribution and a 120dimensional Dirichlet distribution with concentration vector , where is the normalised sum of all data histograms, and . We set for the concentration parameter.
4.2 Evaluation of the generalized SwendsenWang algorithm
We display the performance of the GSW algorithms for the PottsDP model with and and different values of in Figure 4. Increasing the value of allows us to better explore the posterior distribution, see Figure 4(a). When using GSW as a stochastic search algorithm to get a Maximum A Posteriori (MAP) estimate, better MAP estimates are obtained on average as increases until about 1020, as shown in Figure 4(b). Using too high a value of is inefficient as it slows down the convergence of the Markov chain to its stationary distribution. We found experimentally that provides on average good and stable results.
Similar conclusions were reached when using different values of and and for the truncated PottsDP model discussed in (11).
4.3 Comparison to other methods
As we expect the number of clusters to be around ten, we evaluate the PottsDP ^{1}^{1}1Experiments were also performed with the two parameter PoissonDirichlet, but without observed improvement on the performances. and the truncated PottsDP models with and in agreement with Figure 2. We compare our results to meanshift [3] and rddCRP [11]. We tested all these methods by randomly selecting 50 images from each of 8 categories in the LabelMe dataset (i.e. on a total of 400 images) and 200 images from the Berkeley dataset. We display the obtained results on six particular images in Figure 5. As expected the number of clusters decreases as increases.
MS  rddCRP  (Truncated) PottsDP  

LabelMe  Med. RI  0.7623  0.7759  0.7712  0.7692  0.7483  0.7235 
pvalue  .0011  –  .0108  .0965  
Med. Nb clust.  12  6  9  6  4  3  
Berkeley  Med. RI  0.7988  0.7748  0.7882  0.7797  0.7291  0.6881 
pvalue  –  .0066  .0566  .0052  
Med. Nb clust.  23  9  12  8  4  3 
To assess the quality of the image segmentation results, we use the rand index [15] computed using the “groundtruth” which is obtained through a manual labelling [24]. This comparison was also performed in [11] and the results are presented in Table 1. To evaluate the statistical significance of the results, we performed a Wilcoxon signedrank test between the method with highest rand index (rddCRP for LabelMe and MeanShift for Berkeley dataset) and the other methods. We found no statistically significant difference (at the 1% level) between the performances of rddCRP, PottsDP and truncated PottsDP on the LabelMe dataset, and no statistical difference between MeanShift and PottsDP on the Berkeley dataset.
Despite the significant differences observed visually when increasing the truncation threshold, e.g. see Figure 5, this does not translate in any improvement from the rand index point of view. However the manual labelling appears fairly subjective, so the rand index and the associated results have to be interpreted carefully.
5 Discussion
This paper has introduced an original BNP image segmentation model that allows us to easily introduce prior information so as to penalize the overall number and size of clusters while preserving a spatial smoothing component. Computationally we have shown that Bayesian inference can be carried out using a GSW sampler which explores the posterior distribution of interest by splitting and merging clusters. Experimentally, the image segmentation results we obtained using a truncated PottsDP model are competitive to meanshift and rddCRP. We believe that it is a promising approach that deserves further investigation even if the model has limits inherent to the use of a spatial Potts prior: it penalizes the overall number and size of clusters but not connected components, and may end up with some isolated components. Nonetheless, there is always a tradeoff between goodness of fit of the model and computational tractability: the proposed BNP model has the ability to control the overall clustering structure, while the associated GSW sampler is easy to put in practice and allows experimentally a good exploration of the posterior. Furthermore, in the context of the standard Potts model, various improvements over the algorithm of [14] have been proposed by [1, 2]. In particular [2] propose a careful selection of the auxiliary parameters (24), various sophisticated reversible jump MCMC moves to swap labels and multilevel approaches. They report visually impressive segmentation results and it is likely that developing similar type ideas for the BNP segmentation model proposed here would further improve performance.
References
 [1] A. Barbu and S.C. Zhu. Generalizing SwendsenWang to sampling arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1239–1253, 2005.
 [2] A. Barbu and S.C. Zhu. Generalizing SwendsenWang for image analysis. Journal of Computational and Graphical Statistics, 16(4):877–900, 2007.
 [3] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
 [4] David B Dahl. Sequentiallyallocated mergesplit sampler for conjugate and nonconjugate Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 11, 2005.
 [5] L. Du, L. Ren, D. Dunson, and L. Carin. A Bayesian model for simultaneous image clustering, annotation and object segmentation. In Advances in Neural Information Processing Systems, 2009.
 [6] J.A. Duan, M. Guindani, and A.E. Gelfand. Generalized spatial Dirichlet process models. Biometrika, 94(4):809–825, 2007.
 [7] R. E. Edwards and A. D. Sokal. Generalization of the FortuinKasteleynSwendsenWang representation and Monte Carlo algorithm. Physical review D, 38(6):2009–2012, 1988.
 [8] C.M. Fortuin and P.W. Kasteleyn. On the randomcluster model:: I. introduction and relation to other models. Physica, 57(4):536–564, 1972.
 [9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
 [10] S. Ghosh and E.B. Sudderth. Nonparametric learning for layered segmentation of natural images. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2272–2279. IEEE, 2012.
 [11] S. Ghosh, A.B. Ungureanu, E.B. Sudderth, and D.M. Blei. Spatial distance dependent chinese restaurant processes for image segmentation. In Advances in Neural Information Processing Systems, 2011.
 [12] P.J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732, 1995.
 [13] P.J. Green and S. Richardson. Hidden Markov models and disease mapping. Journal of the American Statistical Association, 97:1055–1070, 2002.
 [14] D.M. Higdon. Auxiliary variable methods for markov chain monte carlo with applications. Journal of the American Statistical Association, pages 585–595, 1998.
 [15] L. Hubert and P. Arabie. Comparing partitions. Journal of classification, 2(1):193–218, 1985.
 [16] Michael Hughes, Emily Fox, and Erik Sudderth. Effective splitmerge monte carlo methods for nonparametric models of sequential data. In Advances in Neural Information Processing Systems, pages 1304–1312, 2012.
 [17] S. Jain and R. Neal. A splitmerge markov chain monte carlo procedure for the dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1), 2004.
 [18] J. Lau and P.J. Green. Bayesian model based clustering procedures. Journal of Computational and Graphical Statistics, 16:526–558, 2007.
 [19] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
 [20] Greg Mori, Xiaofeng Ren, Alexei A Efros, and Jitendra Malik. Recovering human body configurations: Combining segmentation and recognition. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–326. IEEE, 2004.
 [21] P. Orbanz and J. M. Buhmann. Nonparametric Bayesian image segmentation. International Journal of Computer Vision, 77:25–45, 2008.
 [22] J. Pitman. Exchangeable and partially exchangeable random partitions. Probability theory and related fields, 102:145–158, 1995.
 [23] X. Ren and J. Malik. Learning a classification model for segmentation. In IEEE International Conference on Computer Vision, 2003, 2003.
 [24] B.C. Russell, A. Torralba, K.P. Murphy, and W.T. Freeman. Labelme: a database and webbased tool for image annotation. International journal of computer vision, 77(1):157–173, 2008.
 [25] E.B. Sudderth and M.I. Jordan. Shared segmentation of natural scenes using dependent PitmanYor processes. In Advances in Neural Information Processing Systems, volume 21, pages 1585–1592, 2009.
 [26] R.H. Swendsen and J.S. Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58:86–88, 1987.
 [27] G. Winkler. Image analysis, random fields and Markov chain Monte Carlo methods: a mathematical introduction, volume 27. Springer Verlag, 2003.