Recurrent Pixel Embedding for Instance Grouping
Abstract
We introduce a differentiable, end-to-end trainable framework for solving pixel-level grouping problems such as instance segmentation, consisting of two novel components. First, we regress pixels into a hyper-spherical embedding space so that pixels from the same group have high cosine similarity while those from different groups have similarity below a specified margin. We analyze the choice of embedding dimension and margin, relating them to theoretical results on the problem of distributing points uniformly on the sphere. Second, to group instances, we utilize a variant of mean-shift clustering, implemented as a recurrent neural network parameterized by kernel bandwidth. This recurrent grouping module is differentiable and enjoys convergent dynamics and probabilistic interpretability. Back-propagating the group-weighted loss through this module allows learning to focus on correcting only those embedding errors that won't be resolved during subsequent clustering. Our framework, while conceptually simple and amenable to theoretical analysis, is also practically effective and computationally efficient. We demonstrate substantial improvements over state-of-the-art instance segmentation for object proposal generation, as well as the benefits of the grouping loss for classification tasks such as boundary detection and semantic segmentation.
1 Introduction
The success of deep convolutional neural nets (CNNs) at image classification has spawned a flurry of work in computer vision on adapting these models to pixel-level image understanding tasks, such as boundary detection [1, 90, 64], semantic segmentation [60, 10, 46], optical flow [87, 20], and pose estimation [85, 7]. The key ideas that have enabled this adaptation thus far are: (1) deconvolution schemes that allow for upsampling coarse, pooled feature maps to make detailed predictions at the spatial resolution of individual pixels [90, 28]; (2) skip connections and hypercolumns which concatenate representations across multi-resolution feature maps [32, 10]; (3) atrous convolution which allows efficient computation with large receptive fields while maintaining spatial resolution [10, 46]; and (4) fully convolutional operation which handles variable-sized input images.
In contrast, there has been less innovation in the development of specialized loss functions for training. Pixel-level labeling tasks fall into the category of structured output prediction [4], where the model outputs a structured object (e.g., a whole image parse) rather than a scalar or categorical variable. However, most CNN pixel-labeling architectures are simply trained with loss functions that decompose into a (weighted) sum of classification or regression losses over individual pixel labels.
The need to address the output space structure is more apparent when considering problems where the set of output labels isn't fixed. Our motivating example is object instance segmentation, where the model generates a collection of segments corresponding to object instances. This problem can't be treated as k-way classification since the number of objects isn't known in advance. Further, the loss should be invariant to permutations of the instance labels within the same semantic category.
As a result, most recent successful approaches to instance segmentation have adopted more heuristic strategies that first use an object detector to enumerate candidate instances and then perform pixel-level segmentation of each instance [57, 17, 55, 56, 2]. Alternately, one can generate generic proposal segments and then label each one with a semantic detector [31, 12, 32, 16, 82, 34]. In either case, the detection and segmentation steps can both be mapped to standard binary classification losses. While effective, these approaches are somewhat unsatisfying since: (1) they rely on the object detector and non-maximum suppression heuristics to accurately "count" the number of instances; (2) they are difficult to train in an end-to-end manner since the interface between instance segmentation and detection is non-differentiable; and (3) they underperform in cluttered scenes, as the assignment of pixels to detections is carried out independently for each detection. (This is less of a problem for object proposals that are jointly estimated by bottom-up segmentation, e.g., MCG [71] and COB [64]; however, such generic proposal generation is not informed by top-down semantics.)
Here we propose to directly tackle the instance grouping problem in a unified architecture by training a model that labels pixels with unit-length vectors living in a fixed-dimension embedding space (Fig. 1). Unlike k-way classification, where the target vectors for each pixel are specified in advance (i.e., one-hot vectors at the vertices of a (k-1)-dimensional simplex), we allow each instance to be labeled with an arbitrary embedding vector on the sphere. Our loss function simply enforces the constraint that the embedding vectors used to label different instances are far apart. Since neither the number of labels nor the target label vectors are specified in advance, we can't use standard softmax thresholding to produce a discrete labeling. Instead, we utilize a variant of mean-shift clustering, which can be viewed as a recurrent network whose fixed point identifies a small, discrete set of instance label vectors and concurrently labels each pixel with one of the vectors from this set.
This framework is largely agnostic to the underlying CNN architecture and can be applied to a range of low-, mid- and high-level visual tasks. Specifically, we carry out experiments showing how this method can be used for boundary detection, object proposal generation and semantic instance segmentation. Even when a task can be modeled by a binary pixel classification loss (e.g., boundary detection), we find that the grouping loss guides the model towards higher-quality feature representations that yield superior performance to a classification loss alone. The model really shines for instance segmentation, where we demonstrate a substantial boost in object proposal generation (improving the state-of-the-art average recall for 10 proposals per image from 0.56 to 0.77). To summarize our contributions: (1) we introduce a simple, easily interpreted end-to-end model for pixel-level instance labeling which is widely applicable and highly effective; (2) we provide theoretical analysis that offers guidelines on setting hyperparameters; and (3) benchmark results show substantial improvements over existing approaches.
2 Related Work
Common approaches to instance segmentation first generate region proposals or class-agnostic bounding boxes, segment the foreground object within each proposal, and classify the object in the bounding box [92, 53, 31, 12, 17, 56, 34]. [55] introduce a fully convolutional approach that includes bounding box proposal generation in end-to-end training. Recently, "box-free" methods [69, 70, 57, 37] avoid some limitations of box proposals (e.g., for wiry or articulated objects). They commonly use Faster-RCNN [74] to produce a "centeredness" score for each pixel and then predict binary instance masks and class labels. Other approaches model segmentation and instance labeling jointly in a combinatorial framework (e.g., [41]) but typically do not address end-to-end learning. Alternately, recurrent models that sequentially produce a list of instances [76, 73] offer another way to handle variable-sized output structures in a unified manner.
Most closely related to ours is the associative embedding work of [67], which demonstrated strong results for grouping multi-person keypoints, and the unpublished work of [23] on metric learning for instance segmentation. Our approach extends these ideas substantially by integrating recurrent mean shift to directly generate the final instances (rather than heuristic decoding or thresholding distance to seed proposals). There is also an important and interesting connection to work that uses embeddings to separate instances where the embedding is learned directly with a supervised regression loss rather than a pairwise associative loss. [80] train a regressor that predicts the distance to the contour centerline for boundary detection, while [3] predict the distance transform of the instance masks, which is then post-processed with a watershed transform to generate segments. [82] predict an embedding based on scene depth and direction towards the instance center (akin to Hough voting).
Finally, we note that these ideas are related to work on using embeddings to solve pairwise clustering problems. For example, normalized cuts clusters embedding vectors given by the eigenvectors of the normalized graph Laplacian [78], and the spatial gradient of these embedding vectors was used in [1] as a feature for boundary detection. Rather than learning pairwise similarity from data and then embedding prior to clustering (e.g., [63]), we use a pairwise loss but learn the embedding directly. Our recurrent mean-shift grouping is reminiscent of other efforts that use unrolled implementations of iterative algorithms such as CRF inference [94] or bilateral filtering [40, 27]. Unlike general RNNs [6, 68], which are often difficult to train, our recurrent model has fixed parameters that assure interpretable convergent dynamics and meaningful gradients during learning.
3 Pairwise Loss for Pixel Embeddings
In this section we introduce and analyze the loss we use for learning pixel embeddings. This problem is broadly related to supervised distance metric learning [86, 48, 50] and clustering [49] but adapted to the specifics of instance labeling where the embedding vectors are treated as labels for a variable number of objects in each image.
Our goal is to learn a mapping from an input image to a set of $d$-dimensional embedding vectors (one for each pixel). Let $x_i, x_j \in \mathbb{R}^d$ be the embeddings of pixels $i$ and $j$, respectively, with corresponding labels $y_i$ and $y_j$ that denote ground-truth instance-level semantic labels (e.g., car.1 and car.2). We measure the similarity of the embedding vectors using the cosine similarity, scaled and offset to lie in the interval $[0, 1]$ for notational convenience:
$$s_{ij} = \frac{1}{2}\left(1 + \frac{x_i^{\top} x_j}{\|x_i\|\,\|x_j\|}\right) \qquad (1)$$
In the discussion that follows, we think of the similarity in terms of the inner product between the projected embedding vectors (e.g., $x_i / \|x_i\|$), which live on the surface of a $(d-1)$-dimensional sphere. Other common similarity metrics utilize Euclidean distance with a squared exponential kernel or sigmoid function [67, 23]. We prefer the cosine metric since it is invariant to the scale of the embedding vectors, decoupling the loss from model design choices such as weight decay or regularization that limit the dynamic range of Euclidean distances.
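As a concrete illustration (a minimal NumPy sketch, not the paper's MatConvNet implementation; the function name is our own), the calibrated similarity of Eq. 1 maps antipodal vectors to 0, orthogonal vectors to 0.5, and identical vectors to 1:

```python
import numpy as np

def calibrated_cosine(xi, xj, eps=1e-12):
    """Cosine similarity scaled and offset to lie in [0, 1] (Eq. 1)."""
    s = np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj) + eps)
    return 0.5 * (1.0 + s)
```

Because the measure is scale-invariant, multiplying either embedding by a positive constant leaves the similarity unchanged.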
Our goal is to learn an embedding so that pixels with the same label (positive pairs with $y_i = y_j$) have the same embedding (i.e., $s_{ij} = 1$). To avoid a trivial solution where all the embedding vectors are the same, we impose the additional constraint that pairs from different instances (negative pairs with $y_i \neq y_j$) are placed far apart. To provide additional flexibility, we include a weight $w_i$ in the definition of the loss which specifies the importance of a given pixel. The total loss over all pairs and training images is:
$$\ell = \sum_{m=1}^{M} \sum_{i,j=1}^{N_m} w_{mi}\, w_{mj} \left[\, \mathbf{1}_{\{y_i = y_j\}} \left(1 - s_{ij}\right) + \mathbf{1}_{\{y_i \neq y_j\}} \max\!\left(0,\, s_{ij} - \alpha\right) \right] \qquad (2)$$
where $N_m$ is the number of pixels in the $m$-th image ($M$ images in total), and $w_{mi}$ is the pair weight associated with pixel $i$ in image $m$. The hyperparameter $\alpha$ controls the maximum margin for negative pairs of pixels, incurring a penalty if the embeddings for pixels belonging to different groups have a similarity greater than $\alpha$ (equivalently, an angular separation of less than $\arccos(2\alpha - 1)$). Positive pairs pay a penalty if they have a similarity less than 1. Fig. 2 shows a graph of the loss function. [88] argue that the constant slope of the margin loss is more robust than, e.g., a squared loss.
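The loss for a single image can be sketched in NumPy, assuming the indicator form of Eq. 2 and the calibrated similarity defined above (the function and argument names are our own):

```python
import numpy as np

def pair_loss(x, y, w, alpha=0.5):
    """Pairwise embedding loss (Eq. 2) over all pixel pairs in one image.

    x: (n, d) embeddings, y: (n,) instance labels, w: (n,) pixel weights.
    """
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)  # project to sphere
    s = 0.5 * (1.0 + xn @ xn.T)                        # calibrated similarity
    same = (y[:, None] == y[None, :]).astype(float)
    pos = 1.0 - s                                      # pull positives to s = 1
    neg = np.maximum(0.0, s - alpha)                   # push negatives below margin
    return np.sum(np.outer(w, w) * (same * pos + (1.0 - same) * neg))
```

Note that including the diagonal pairs $(i, i)$ is harmless: they are positive pairs with $s_{ii} = 1$ and so contribute zero penalty.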
We carry out a simple theoretical analysis which provides a guide for setting the weights $w$ and the margin hyperparameter $\alpha$ in the loss function. Proofs can be found in the appendix.
3.1 Instance-aware Pixel Weighting
We first examine the role of embedding dimension and instance size on the training loss.
Proposition 1
For vectors $x_1, \dots, x_n \in \mathbb{R}^d$, the total pairwise similarity is bounded as $\sum_{i \neq j} x_i^{\top} x_j \ge -\sum_{i} \|x_i\|^2$. In particular, for vectors on the hypersphere, where $\|x_i\| = 1$, we have $\sum_{i \neq j} x_i^{\top} x_j \ge -n$.
This proposition indicates that the total cosine similarity (and hence the loss) for a set of embedding vectors has a constant lower bound that does not depend on the dimension of the embedding space (a feature lacking in Euclidean embeddings). This type of analysis also suggests a natural choice of pixel weighting. Suppose a training example contains $K$ instances and $S_k$ denotes the set of pixels belonging to a particular ground-truth instance $k$. We can write

$$\sum_{i,j} w_i w_j\, x_i^{\top} x_j = \sum_{k=1}^{K} \sum_{i,j \in S_k} w_i w_j\, x_i^{\top} x_j + \sum_{k \neq k'} \sum_{i \in S_k} \sum_{j \in S_{k'}} w_i w_j\, x_i^{\top} x_j$$

where the first term on the r.h.s. corresponds to contributions to the loss function from positive pairs while the second corresponds to contributions from negative pairs. Setting $w_i = \frac{1}{K |S_k|}$ for pixels $i \in S_k$ assures that each instance contributes equally to the loss, independent of its size. Furthermore, when the embedding dimension $d \ge K$, we can simply embed the data so that the instance means lie along orthogonal axes on the sphere. This zeros out the second term on the r.h.s., leaving only the first term, which is bounded by Proposition 1; this translates into corresponding upper and lower bounds on the loss that are independent of the number of pixels and the embedding dimension (so long as $d \ge K$).
Pairwise weighting schemes have been shown to be important empirically [23], and class imbalance can have a substantial effect on the performance of different architectures (see, e.g., [58]). While other work has advocated online bootstrapping methods for hard-pixel mining or mini-batch selection [61, 47, 79, 89], our approach is much simpler. Guided by this result, we simply use uniform random sampling of pixels during training, appropriately weighted by instance size, in order to estimate the loss.
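The instance-balanced weighting $w_i = \frac{1}{K|S_k|}$ can be sketched as follows (a hypothetical helper with names of our own; in the paper it is paired with uniform random pixel sampling):

```python
import numpy as np

def instance_weights(labels):
    """w_i = 1 / (K * |S_k|) for pixel i in instance k.

    Each of the K instances contributes equal total weight 1/K,
    so the weights over the whole image sum to 1.
    """
    ids, counts = np.unique(labels, return_counts=True)
    K = len(ids)
    size = dict(zip(ids.tolist(), counts.tolist()))
    w = np.array([1.0 / (K * size[l]) for l in labels.ravel()])
    return w.reshape(labels.shape)
```

With these weights, a large background instance and a tiny object instance contribute equally to the estimated loss.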
3.2 Margin Selection
To analyze the appropriate margin, let's first consider the problem of distributing labels for $n$ different instances as far apart as possible on a 3D sphere, sometimes referred to as Tammes's problem, or the hard-spheres problem [77]. This can be formalized as maximizing the smallest distance among $n$ points on the sphere: $\max_{x_1, \dots, x_n \in S^2} \min_{i \neq j} \|x_i - x_j\|$. Asymptotic results in [29] provide the following proposition (see proof in the appendix):
Proposition 2
Given $n$ vectors on a 2-sphere, i.e. $x_i \in \mathbb{R}^3$ with $\|x_i\| = 1$, choosing $\alpha < 1 - \frac{2\pi}{\sqrt{3}\,n}$ guarantees that $s_{ij} > \alpha$ for some pair $i \neq j$. Choosing $\alpha \ge 1 - \frac{2\pi}{\sqrt{3}\,n}$ guarantees (asymptotically in $n$) the existence of an embedding with $s_{ij} \le \alpha$ for all pairs $i \neq j$.
Proposition 2 gives the maximum margin for separating $n$ groups of pixels in a three-dimensional embedding space (the sphere). For example, if an image has at most $n$ instances, $\alpha$ can be set as small as $1 - \frac{2\pi}{\sqrt{3}\,n}$ (roughly $0.55$ for $n = 8$).
For points in a higher-dimensional embedding space, it is a non-trivial problem to establish a tight analytic bound for the margin $\alpha$. Despite its simple description, distributing points on a $d$-dimensional hypersphere is considered a serious mathematical challenge for which there is no general solution [77, 62]. We therefore adopt a safe (trivial) strategy. For $n$ instances embedded in $d \ge \lceil n/2 \rceil$ dimensions, one can use the value $\alpha = \frac{1}{2}$, which allows for zero loss by placing a pair of groups antipodally along each of the orthogonal axes. We adopt this setting for the majority of experiments in the paper, where the embedding dimension is set to $d = 64$.
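The trivial strategy can be checked directly: placing one antipodal pair of group labels on each orthogonal axis ensures every pair of distinct labels has calibrated similarity at most $\frac{1}{2}$ (a small NumPy verification; the function name is our own):

```python
import numpy as np

def antipodal_axes_embedding(d):
    """Place 2d group labels at +/- each of the d orthogonal axes.

    Any two distinct labels are either antipodal (s = 0) or
    orthogonal (s = 0.5), so the margin alpha = 0.5 incurs zero loss.
    """
    eye = np.eye(d)
    return np.vstack([eye, -eye])  # (2d, d) unit vectors

mu = antipodal_axes_embedding(4)
s = 0.5 * (1.0 + mu @ mu.T)                 # calibrated similarities
off = s[~np.eye(len(mu), dtype=bool)]       # distinct pairs only
```

In $d = 64$ dimensions this construction accommodates up to 128 instances with zero loss at $\alpha = \frac{1}{2}$.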
4 Recurrent Mean-Shift Grouping
While we can directly train a model to predict embeddings as described in the previous section, it is not clear how to generate the final instance segmentation from the resulting (imperfect) embeddings. One can utilize heuristic post-processing [18] or clustering algorithms that estimate the number of instances [57], but these are not differentiable and thus unsatisfying. Instead, we introduce a mean-shift grouping model (Fig. 3) which operates recurrently on the embedding space in order to congeal the embedding vectors into a small number of instance labels.
Mean shift and closely related algorithms [26, 13, 14, 15] use kernel density estimation to approximate the probability density from a set of samples and then perform clustering on the input data by assigning or moving each sample to the nearest mode (local maximum). From our perspective, the advantages of this approach are: (1) the final instance labels (modes) live in the same embedding space as the initial data; (2) the recurrent dynamics of the clustering process depend smoothly on the input, allowing for easy back-propagation; (3) the behavior depends on a single parameter, the kernel bandwidth, which is easily interpretable and can be related to the margin used for the embedding loss.
4.1 Mean Shift Clustering
A common choice for non-parametric density estimation is to use the isotropic multivariate normal kernel and approximate the data density as $p(x) \propto \sum_i \exp\!\left(-\frac{\|x - x_i\|^2}{2 b^2}\right)$. Since our embedding vectors are unit norm, we instead use the von Mises-Fisher distribution, which is the natural extension of the multivariate normal to the hypersphere [25, 5, 65, 42] and is given by $p(x; \mu, \kappa) \propto \exp(\kappa\, \mu^{\top} x)$. The kernel bandwidth, governed by the concentration $\kappa$, determines the smoothness of the kernel density estimate and is closely related to the margin $\alpha$ used for learning the embedding space. While it is straightforward to learn the bandwidth during training, we instead fix it throughout our experiments so that the cluster separation (margin) in the learned embedding space corresponds to three standard deviations.
We formulate the mean shift algorithm in matrix form. Let $X \in \mathbb{R}^{d \times n}$ denote the stacked pixel embedding vectors of an image. The kernel matrix is given by $K = \exp(\kappa\, X^{\top} X) \in \mathbb{R}^{n \times n}$. Let $D = \mathrm{diag}(K^{\top} \mathbf{1})$ denote the diagonal matrix of total affinities, referred to as the degree when $K$ is viewed as a weighted graph adjacency matrix. At each iteration, we compute the mean shift $M = X K D^{-1} - X$, which is the difference vector between $X$ and the kernel-weighted average of $X$. We then modify the embedding vectors by moving them in the mean-shift direction with step size $\eta$:
$$X \leftarrow X + \eta\, \left(X K D^{-1} - X\right) \qquad (3)$$
Note that unlike standard mean-shift mode finding, we recompute $K$ at each iteration. These update dynamics are termed the explicit method and were analyzed by [9]. When $\eta = 1$ and the kernel is Gaussian, this is also referred to as Gaussian Blurring Mean Shift (GBMS) and has been shown to have cubic convergence under appropriate conditions [9]. Unlike deep RNNs, the parameters of our recurrent module are not learned and the forward dynamics are convergent under general conditions. In practice, we do not observe issues with exploding or vanishing gradients during back-propagation through a finite number of iterations. (Some intuition about stability may be gained by noting that the eigenvalues of $K D^{-1}$ lie in the interval $[0, 1]$, but we have not been able to prove useful corresponding bounds on the spectrum of the Jacobian.)
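One recurrent grouping iteration (Eq. 3) with a von Mises-Fisher kernel can be sketched in NumPy as follows; re-projecting onto the sphere after each step is our assumption for this sketch, and the function name and default $\kappa$ are our own:

```python
import numpy as np

def mean_shift_step(X, kappa=10.0, eta=1.0):
    """One blurring-mean-shift-style update in matrix form (Eq. 3).

    X: (d, n) stacked unit-length pixel embeddings.
    """
    K = np.exp(kappa * (X.T @ X))        # (n, n) vMF affinities
    d = K.sum(axis=0)                    # total affinity (degree) per point
    X = X + eta * (X @ (K / d) - X)      # move toward kernel-weighted mean
    return X / np.linalg.norm(X, axis=0, keepdims=True)
```

Run for a few iterations on noisy embeddings drawn around two well-separated directions, the points within each cluster congeal onto a common label vector while the two clusters stay apart.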
Fig. 4 demonstrates a toy example of applying the method to digit instance segmentation on synthetic images derived from MNIST [54]. We learn a 3-dimensional embedding in order to visualize the results before and after the mean-shift grouping module. From the figure, we can see that mean-shift grouping transforms the initial embedding vectors to yield a small set of instance labels which are distinct (for negative pairs) and compact (for positive pairs).
4.2 End-to-end Training
It is straightforward to compute the derivatives of the recurrent mean-shift grouping module w.r.t. the input embeddings using the chain rule, so our whole system is end-to-end trainable through back-propagation. Details of the derivative computation can be found in the appendix. To understand the benefit of end-to-end training, we visualize the embedding gradient with and without the grouping module (Fig. 5). Interestingly, we observe that the gradient back-propagated through mean shift focuses on fixing the embedding in uncertain regions, e.g., instance boundaries, while suggesting small-magnitude updates for those errors which will be easily fixed by the mean-shift iteration.
While we could simply apply the pairwise embedding loss to the final output of the mean-shift grouping, in practice we accumulate the loss over all iterations (including the initial embedding regression). We unroll the recurrent grouping module for $T$ iterations and accumulate the same loss function at each unrolled iteration: $\ell_{\text{total}} = \sum_{t=0}^{T} \ell\!\left(X^{(t)}\right)$, where $X^{(0)}$ denotes the initial embedding.
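The unrolled accumulation can be sketched generically (the loss and step functions are placeholders for the pairwise loss and mean-shift update described above; the names are our own):

```python
import numpy as np

def unrolled_loss(X0, y, w, steps, loss_fn, step_fn):
    """Accumulate the same loss at every unrolled mean-shift iteration,
    including the initial embedding X0, as in end-to-end training."""
    X = X0
    total = loss_fn(X, y, w)           # t = 0: initial embedding regression
    for _ in range(steps):
        X = step_fn(X)                 # one recurrent grouping iteration
        total += loss_fn(X, y, w)      # t = 1 .. T
    return total, X
```

Because every intermediate $X^{(t)}$ receives a loss term, gradients reach the initial embedding through every unrolled iteration rather than only through the final one.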
5 Experiments
We now describe experiments in training our framework to handle a variety of pixel-labeling problems, including boundary detection, object proposal detection, semantic segmentation, and instance-level semantic segmentation.
5.1 Tasks, Datasets and Implementation
We illustrate the advantages of the proposed modules on several large-scale datasets. First, to illustrate the ability of the instance-aware weighting and uniform sampling mechanism to handle imbalanced data and a low embedding dimension, we use the BSDS500 dataset [1] to train a boundary detector (the vast majority of pixels are non-boundary). We train with the standard split [1, 90], using the 300 trainval images to train our model based on ResNet50 [35] and evaluating on the remaining 200 test images. Second, to explore instance segmentation and object proposal generation, we use the PASCAL VOC 2012 dataset [22] with additional instance mask annotations provided by [30]. This provides 10,582 and 1,449 images for training and evaluation, respectively.
We implement our approach using the MatConvNet toolbox [84] and train using SGD on a single Titan X GPU. (The code and trained models can be found at https://github.com/aimerykong/RecurrentPixelEmbeddingforInstanceGrouping.) To compute the calibrated cosine similarity, we utilize an L2-normalization layer before matrix multiplication [45], which also performs random sampling with a hyperparameter to control the ratio of pixels sampled per image. In practice, we observe that performance does not depend strongly on this ratio and hence set it based on available (GPU) memory.
While our modules are architecture agnostic, we use ResNet50 and ResNet101 models [35] pre-trained on ImageNet [19] as the backbone. Similar to [10], we increase the output resolution of ResNet by removing the top global pooling layer and the last two pooling layers, replacing them with atrous convolution with dilation rates 2 and 4, respectively, to maintain the spatial sampling rate. Our model thus outputs predictions at a reduced resolution, which are upsampled to the input resolution for benchmarking.
We augment the training set using random scaling, in-plane rotation, random left-right flips, random crops (with a 20-pixel margin and of size divisible by 8), and color jittering. When training the model, we fix the batch normalization in the ResNet backbone, using the same constant global moments in both training and testing. Throughout training, we set the batch size to one, where the batch is a single input image. We use the "poly" learning rate policy [10], decaying a base learning rate as a function of the training iteration.
5.2 Boundary Detection
For boundary detection, we first train a model to group pixels into boundary and non-boundary groups. Similar to COB [64] and HED [90], we include multiple branches over the ResBlocks for training. Since the number of instance labels is two, we learn a simple 3-dimensional embedding space, which has the advantage of easy visualization as an RGB image. Fig. 7 shows the resulting embeddings in the first row of each panel. Note that even though we did not utilize mean-shift grouping, the trained embedding already produces compact clusters. To compare quantitatively to the state of the art, we learn a fusion layer that combines predictions from multiple levels of the feature hierarchy, fine-tuned with a logistic loss to match the binary output. Fig. 7 shows the results in the second row. Interestingly, we can see that the fine-tuned model embeddings encode not only boundary presence/absence but also the orientation and signed distance to nearby boundaries.
Quantitatively, we compare our model to COB [64], HED [90], CEDN [91], LEP [66], UCM [1], ISCRA [75], NCuts [78], EGB [24], and the original mean shift (MShift) segmentation algorithm [15]. Fig. 6 shows the standard benchmark precision-recall for all methods, demonstrating that our model achieves state-of-the-art performance. Note that our model has the same architecture as COB [64], except with a different loss function and no explicit branches to compute boundary orientation. Our embedding loss naturally pushes boundary pixel embeddings to be similar, which is also a desirable property for detecting boundaries using a logistic loss. While it is possible to surpass human performance with several sophisticated techniques [44], we do not pursue this as it is outside the scope of this paper.
5.3 Object Proposal Detection
Object proposals are an integral part of current object detection and semantic segmentation pipelines [74, 34], as they provide a reduced search space of locations, scales, and shapes for subsequent recognition. State-of-the-art methods usually involve training models that output large numbers of proposals, particularly those based on bounding boxes. Here we demonstrate that by training our framework with a 64-dimensional embedding space on object instance-level annotations, we are able to produce very high-quality object proposals by grouping pixels into instances. It is worth noting that, due to the nature of our grouping module, far fewer proposals are produced, at much higher quality. We compare against the most recent techniques, including POISE [39], LPO [52], CPMC [8], GOP [51], SeSe [83], GLS [72], and RIGOR [38].
Fig. 8 shows the Average Recall (AR) [36] with respect to the number of object proposals. (Our basic model produces a small number of proposals per image; in order to plot a curve for larger numbers of proposals, we run the mean-shift grouping with multiple smaller bandwidth parameters, pool the results, and remove redundant proposals.) Our model performs remarkably well compared to other methods, achieving high average recall of ground-truth objects with two orders of magnitude fewer proposals. We also plot the curves for SharpMask [69] and DeepMask [70] using the proposals released by the authors. Despite only training on PASCAL, we outperform these models, which were trained on the much larger COCO dataset [59]. In Table 1, we report the average recall for some recently proposed proposal detection methods, including the unpublished instDML [23], which is similar in spirit to our model but learns a Euclidean-distance-based metric to group pixels. We can clearly see that our method achieves significantly better results than existing methods.
Table 1: Average recall at 10 and 60 proposals per image.

#prop. | SCG [71] | MCG [71] | COB [64] | instDML [23] | Ours
10     |   --     |   --     |   --     |    0.558     | 0.769
60     |  0.624   |  0.652   |  0.738   |    0.667     | 0.814
Table 2: Per-class average precision (%) for semantic instance detection on PASCAL VOC 2012 validation ("--" indicates per-class results were not reported).

Method | plane | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | motor | person | plant | sheep | sofa | train | tv | mean
SDS [31] | 58.8 | 0.5 | 60.1 | 34.4 | 29.5 | 60.6 | 40.0 | 73.6 | 6.5 | 52.4 | 31.7 | 62.0 | 49.1 | 45.6 | 47.9 | 22.6 | 43.5 | 26.9 | 66.2 | 66.1 | 43.8
Chen et al. [12] | 63.6 | 0.3 | 61.5 | 43.9 | 33.8 | 67.3 | 46.9 | 74.4 | 8.6 | 52.3 | 31.3 | 63.5 | 48.8 | 47.9 | 48.3 | 26.3 | 40.1 | 33.5 | 66.7 | 67.8 | 46.3
PFN [57] | 76.4 | 15.6 | 74.2 | 54.1 | 26.3 | 73.8 | 31.4 | 92.1 | 17.4 | 73.7 | 48.1 | 82.2 | 81.7 | 72.0 | 48.4 | 23.7 | 57.7 | 64.4 | 88.9 | 72.3 | 58.7
MNC [17] | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 63.5
Li et al. [55] | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 65.7
R2IOS [56] | 87.0 | 6.1 | 90.3 | 67.9 | 48.4 | 86.2 | 68.3 | 90.3 | 24.5 | 84.2 | 29.6 | 91.0 | 71.2 | 79.9 | 60.4 | 42.4 | 67.4 | 61.7 | 94.3 | 82.1 | 66.7
Assoc. Embed. [67] | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 35.1
instDML [23] | 69.7 | 1.2 | 78.2 | 53.8 | 42.2 | 80.1 | 57.4 | 88.8 | 16.0 | 73.2 | 57.9 | 88.4 | 78.9 | 80.0 | 68.0 | 28.0 | 61.5 | 61.3 | 87.5 | 70.4 | 62.1
Ours | 85.9 | 10.0 | 74.3 | 54.6 | 43.7 | 81.3 | 64.1 | 86.1 | 17.5 | 77.5 | 57.0 | 89.2 | 77.8 | 83.7 | 67.9 | 31.2 | 62.5 | 63.3 | 88.6 | 74.2 | 64.5
5.4 Semantic Instance Detection
As a final test of our method, we also train it to produce semantic labels, which are combined with our instance proposal method to recognize the detected instances.
For semantic segmentation, which is a k-way classification problem, we train a model using a cross-entropy loss alongside our embedding loss. Similar to our proposal detection model, we use a 64-dimensional embedding space on top of DeepLabv3 [11] as our base model. While there are more complex methods in the literature, such as PSPNet [93], and methods which augment training with additional data (e.g., COCO [59] or the JFT-300M dataset [81]) or utilize ensembles and post-processing, we focus on a simple experiment: training the base model with and without the proposed pixel-pair embedding loss to demonstrate its effectiveness.
In addition to reporting the mean intersection over union (mIoU) over all classes, we also compute the mIoU restricted to a narrow band of pixels around the ground-truth boundaries. This partition into figure/boundary/background is sometimes referred to as a trimap in the matting literature and has previously been used in analyzing semantic segmentation performance [43, 10, 28]. Fig. 9 shows the mIoU as a function of the width of the trimap boundary zone. This demonstrates that the embedding loss yields performance gains over cross-entropy primarily far from ground-truth boundaries, where it successfully fills in holes in the segment output (see also the qualitative results in Fig. 10). This is similar in spirit to the model in [33], which considers local consistency to improve spatial precision; however, our uniform sampling allows for long-range interactions between pixels.
To label detected instances with semantic labels, we use the semantic segmentation model described above to generate per-pixel labels and then use a simple voting strategy to transfer these predictions to the instance proposals. In order to produce a final confidence score associated with each proposed object, we train a linear regressor to score each object instance based on its morphology (e.g., size, connectedness) and its consistency with the semantic segmentation prediction. We note this is substantially simpler than approaches based on, e.g., Faster-RCNN [74], which use much richer convolutional features to re-score segmented instances [34].
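The voting step can be sketched as a majority vote of the per-pixel semantic predictions inside each proposal mask (the function name and the background-id convention are our assumptions):

```python
import numpy as np

def vote_instance_labels(instance_map, semantic_map, background=0):
    """Transfer per-pixel semantic predictions to instance proposals
    by majority vote inside each proposal mask."""
    labels = {}
    for inst in np.unique(instance_map):
        if inst == background:
            continue  # skip the background region
        votes = semantic_map[instance_map == inst]
        vals, counts = np.unique(votes, return_counts=True)
        labels[int(inst)] = int(vals[np.argmax(counts)])
    return labels
```

Each proposal thus inherits the semantic class that the segmentation network predicts most often within its mask; the separate linear regressor then scores the labeled proposal.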
A comparison of instance detection performance is displayed in Table 2. We use a standard IoU threshold of 0.5 to identify true positives, unless a ground-truth instance has already been detected by a higher scoring proposal, in which case the detection counts as a false positive. We report the average precision per class as well as the average over all classes (as in [30]). Our approach yields competitive performance on VOC validation despite our simple re-scoring. Among the competing methods, the one closest to ours is instDML [23], which learns a Euclidean-distance-based metric with a logistic loss. The instDML approach relies on generating pixel seeds to derive instance masks; the pixel seeds may fail to correctly detect thin structures, which perhaps explains why this method performs 10x worse than ours on the bike category. In contrast, our mean-shift grouping approach does not make strong assumptions about object shape or topology.
For visualization purposes, we generate three random projections of the 64-dimensional embedding and display them in the spatial domain as an RGB image. Fig. 11 shows this embedding visualization, as well as the predicted semantic segmentation and instance-level segmentation. From the visualization, we can see that the instance-level semantic segmentation outputs complete object instances even when the semantic segmentation results are noisy, such as for the bike in the first image in Fig. 11. The instance embedding provides important details that resolve both inter- and intra-class instance overlap, which are not emphasized by the semantic segmentation loss.
6 Conclusion and Future Work
We have presented an end-to-end trainable framework for solving pixel-labeling vision problems based on two novel contributions: a pixel-pairwise loss based on spherical max-margin embedding, and a variant of mean-shift grouping embedded in a recurrent architecture. These two components mesh closely to provide a framework for robustly recognizing variable numbers of instances without requiring heuristic post-processing or hyperparameter tuning to account for widely varying instance size or class imbalance. The approach is simple and amenable to theoretical analysis, and when coupled with standard architectures yields instance proposal generation which substantially outperforms the state of the art. Our experiments demonstrate the potential of instance embedding and open many opportunities for future work, including learnable variants of mean-shift grouping, extension to other pixel-level domains such as encoding surface shape, depth and figure-ground, and multi-task embeddings.
Acknowledgement
This project is supported by NSF grants IIS-1618806, IIS-1253538, DBI-1262547 and a hardware donation from NVIDIA. Shu Kong personally thanks Mr. Kevis-Kokitsi Maninis, Dr. Alireza Fathi, Dr. Kevin Murphy and Dr. Rahul Sukthankar for helpful discussions, advice and encouragement.
References
 [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5):898–916, 2011.
 [2] A. Arnab and P. H. Torr. Pixelwise instance segmentation with a dynamically instantiated network. arXiv preprint arXiv:1704.02386, 2017.
 [3] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. arXiv preprint arXiv:1611.08303, 2016.
 [4] G. BakIr, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. Vishwanathan. Predicting structured data. MIT press, 2007.
 [5] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep):1345–1382, 2005.
 [6] Y. Bengio, P. Simard, and P. Frasconi. Learning longterm dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
 [7] Z. Cao, T. Simon, S.E. Wei, and Y. Sheikh. Realtime multiperson 2d pose estimation using part affinity fields. In CVPR, 2017.
 [8] J. Carreira and C. Sminchisescu. Cpmc: Automatic object segmentation using constrained parametric mincuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1312–1328, 2012.
 [9] M. A. CarreiraPerpinán. Generalised blurring meanshift algorithms for nonparametric clustering. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
 [10] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
 [11] L.C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
 [12] Y.T. Chen, X. Liu, and M.H. Yang. Multiinstance object segmentation with occlusion handling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3470–3478, 2015.
 [13] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine intelligence, 17(8):790–799, 1995.
 [14] D. Comaniciu and P. Meer. Mean shift analysis and applications. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1197–1203. IEEE, 1999.
 [15] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on pattern analysis and machine intelligence, 24(5):603–619, 2002.
 [16] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3992–4000, 2015.
 [17] J. Dai, K. He, and J. Sun. Instanceaware semantic segmentation via multitask network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
 [18] B. De Brabandere, D. Neven, and L. Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551, 2017.
 [19] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. Imagenet: A largescale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [20] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
 [21] V. A. Epanechnikov. Nonparametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 14(1):153–158, 1969.
 [22] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
 [23] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277, 2017.
 [24] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graphbased image segmentation. International journal of computer vision, 59(2):167–181, 2004.
 [25] R. Fisher. Dispersion on a sphere. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 217, pages 295–305. The Royal Society, 1953.
 [26] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on information theory, 21(1):32–40, 1975.
 [27] R. Gadde, V. Jampani, M. Kiefel, D. Kappler, and P. V. Gehler. Superpixel convolutional networks using bilateral inceptions. arXiv preprint arXiv:1511.06739, 2015.
 [28] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In European Conference on Computer Vision, pages 519–534. Springer, 2016.
 [29] W. Habicht and B. Van der Waerden. Lagerung von punkten auf der kugel. Mathematische Annalen, 123(1):223–234, 1951.
 [30] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 991–998. IEEE, 2011.
 [31] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision, pages 297–312. Springer, 2014.
 [32] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and finegrained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.
 [33] A. W. Harley, K. G. Derpanis, and I. Kokkinos. Segmentationaware convolutional networks using local attention masks. arXiv preprint arXiv:1708.04607, 2017.
 [34] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask rcnn. arXiv preprint arXiv:1703.06870, 2017.
 [35] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [36] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? IEEE transactions on pattern analysis and machine intelligence, 38(4):814–830, 2016.
 [37] H. Hu, S. Lan, Y. Jiang, Z. Cao, and F. Sha. Fastmask: Segment multiscale object candidates in one shot. arXiv preprint arXiv:1612.08843, 2016.
 [38] A. Humayun, F. Li, and J. M. Rehg. Rigor: Reusing inference in graph cuts for generating object regions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–343, 2014.
 [39] A. Humayun, F. Li, and J. M. Rehg. The middle child problem: Revisiting parametric mincut and seeds for object proposals. In Proceedings of the IEEE International Conference on Computer Vision, pages 1600–1608, 2015.
 [40] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4452–4461, 2016.
 [41] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. Instancecut: from edges to instances with multicut. arXiv preprint arXiv:1611.08272, 2016.
 [42] T. Kobayashi and N. Otsu. Von Mises-Fisher mean shift for clustering on a hypersphere. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 2130–2133. IEEE, 2010.
 [43] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009.
 [44] I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. arXiv preprint arXiv:1511.07386, 2015.
 [45] S. Kong and C. Fowlkes. Lowrank bilinear pooling for finegrained classification. In CVPR, 2017.
 [46] S. Kong and C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. arXiv preprint, 2017.
 [47] S. Kong, X. Shen, Z. Lin, R. Mech, and C. Fowlkes. Photo aesthetics ranking network with attributes and content adaptation. In European Conference on Computer Vision, pages 662–679. Springer, 2016.
 [48] S. Kong and D. Wang. A dictionary learning approach for classification: Separating the particularity and the commonality. Computer Vision–ECCV 2012, pages 186–199, 2012.
 [49] S. Kong and D. Wang. A multitask learning strategy for unsupervised clustering via explicitly separating the commonality. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 771–774. IEEE, 2012.
 [50] S. Kong and D. Wang. Learning exemplarrepresented manifolds in latent space for classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 240–255. Springer, 2013.
 [51] P. Krähenbühl and V. Koltun. Geodesic object proposals. In European Conference on Computer Vision, pages 725–739. Springer, 2014.
 [52] P. Krahenbuhl and V. Koltun. Learning to propose objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1574–1582, 2015.
 [53] L. Ladickỳ, P. Sturgess, K. Alahari, C. Russell, and P. H. Torr. What, where and how many? combining object detectors and crfs. In European conference on computer vision, pages 424–437. Springer, 2010.
 [54] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [55] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instanceaware semantic segmentation. arXiv preprint arXiv:1611.07709, 2016.
 [56] X. Liang, Y. Wei, X. Shen, Z. Jie, J. Feng, L. Lin, and S. Yan. Reversible recursive instancelevel object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2016.
 [57] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposalfree network for instancelevel object segmentation. arXiv preprint arXiv:1509.02636, 2015.
 [58] T.Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
 [59] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
 [60] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
 [61] I. Loshchilov and F. Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
 [62] L. Lovisolo and E. Da Silva. Uniform distribution of points on a hypersphere with applications to vector bitplane encoding. IEE ProceedingsVision, Image and Signal Processing, 148(3):187–193, 2001.
 [63] M. Maire, T. Narihira, and S. X. Yu. Affinity cnn: Learning pixelcentric pairwise relations for figure/ground embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 174–182, 2016.
 [64] K.K. Maninis, J. PontTuset, P. Arbelaez, and L. Van Gool. Convolutional oriented boundaries: From image segmentation to highlevel tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [65] K. V. Mardia and P. E. Jupp. Directional statistics, volume 494. John Wiley & Sons, 2009.
 [66] L. Najman and M. Schmitt. Geodesic saliency of watershed contours and hierarchical segmentation. IEEE Transactions on pattern analysis and machine intelligence, 18(12):1163–1173, 1996.
 [67] A. Newell and J. Deng. Associative embedding: Endtoend learning for joint detection and grouping. arXiv preprint arXiv:1611.05424, 2016.
 [68] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
 [69] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, pages 1990–1998, 2015.
 [70] P. O. Pinheiro, T.Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016.
 [71] J. PontTuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE transactions on pattern analysis and machine intelligence, 39(1):128–140, 2017.
 [72] P. Rantalankila, J. Kannala, and E. Rahtu. Generating object segmentation proposals using global and local search. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2417–2424, 2014.
 [73] M. Ren and R. S. Zemel. Endtoend instance segmentation with recurrent attention. In CVPR, 2017.
 [74] S. Ren, K. He, R. Girshick, and J. Sun. Faster rcnn: Towards realtime object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 [75] Z. Ren and G. Shakhnarovich. Image segmentation by cascaded region agglomeration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2011–2018, 2013.
 [76] B. RomeraParedes and P. H. S. Torr. Recurrent instance segmentation. In European Conference on Computer Vision, pages 312–329. Springer, 2016.
 [77] E. B. Saff and A. B. Kuijlaars. Distributing many points on a sphere. The mathematical intelligencer, 19(1):5–11, 1997.
 [78] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
 [79] A. Shrivastava, A. Gupta, and R. Girshick. Training regionbased object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
 [80] A. Sironi, V. Lepetit, and P. Fua. Multiscale centerline detection by learning a scalespace distance transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2697–2704, 2014.
 [81] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.
 [82] J. Uhrig, M. Cordts, U. Franke, and T. Brox. Pixellevel encoding and depth layering for instancelevel semantic labeling. In German Conference on Pattern Recognition, pages 14–25. Springer, 2016.
 [83] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
 [84] A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia, pages 689–692. ACM, 2015.
 [85] S.E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
 [86] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
 [87] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. Deepflow: Large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 1385–1392, 2013.
 [88] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning. arXiv preprint arXiv:1706.07567, 2017.
 [89] Z. Wu, C. Shen, and A. v. d. Hengel. Bridging categorylevel and instancelevel semantic image segmentation. arXiv preprint arXiv:1605.06885, 2016.
 [90] S. Xie and Z. Tu. Holisticallynested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
 [91] J. Yang, B. Price, S. Cohen, H. Lee, and M.H. Yang. Object contour detection with a fully convolutional encoderdecoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 193–202, 2016.
 [92] Y. Yang, S. Hallman, D. Ramanan, and C. C. Fowlkes. Layered object models for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1731–1743, 2012.
 [93] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. arXiv preprint arXiv:1612.01105, 2016.
 [94] S. Zheng, S. Jayasumana, B. RomeraParedes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.
Appendix
In this appendix, we provide proofs of the propositions introduced in the main paper for understanding our objective function and grouping mechanism. We then provide details of the mean-shift algorithm, the computation of gradients, and how it is adapted for recurrent grouping. We illustrate how the gradients are backpropagated to the input embedding using a toy example. Finally, we include more qualitative results on boundary detection and instance segmentation.
Appendix A Analysis of Pairwise Loss for Spherical Embedding
In this section, we provide proofs of the propositions presented in the paper, which give some analytical understanding of our proposed objective function and the subsequent pixel grouping mechanism.
Proposition 1
For $n$ vectors $\{x_1, \ldots, x_n\}$, the total pairwise similarity is bounded as $\sum_{i \neq j} x_i^\top x_j \geq -\sum_i \|x_i\|^2$. In particular, for vectors on the hypersphere where $\|x_i\| = 1$, we have $\sum_{i \neq j} x_i^\top x_j \geq -n$.
Proof 1
First note that $\|\sum_i x_i\|^2 \geq 0$. We expand the square and collect all the cross terms, so we have $\sum_i \|x_i\|^2 + \sum_{i \neq j} x_i^\top x_j \geq 0$. Therefore, $\sum_{i \neq j} x_i^\top x_j \geq -\sum_i \|x_i\|^2$. When all the vectors are on the hypersphere, i.e. $\|x_i\| = 1$, then $\sum_{i \neq j} x_i^\top x_j \geq -n$.
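A quick numeric sanity check of this bound for random unit vectors (our own sketch, not part of the paper's code):

```python
import math
import random

def random_unit_vector(d, rng):
    # Sample from an isotropic Gaussian and normalize onto the sphere.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def total_cross_similarity(vectors):
    # sum_{i != j} x_i . x_j  =  ||sum_i x_i||^2 - sum_i ||x_i||^2
    d = len(vectors[0])
    s = [sum(v[k] for v in vectors) for k in range(d)]
    sq_sum = sum(x * x for x in s)
    self_sim = sum(sum(x * x for x in v) for v in vectors)
    return sq_sum - self_sim

rng = random.Random(0)
n, d = 50, 8
vs = [random_unit_vector(d, rng) for _ in range(n)]
# Proposition 1: for unit vectors the total cross similarity is at least -n.
assert total_cross_similarity(vs) >= -n
```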
Proposition 2
If $n$ vectors $\{x_1, \ldots, x_n\}$ are distributed on a 2-sphere (i.e. $x_i \in \mathbb{R}^3$ with $\|x_i\|_2 = 1$), then the similarity of the closest pair is lower-bounded by $1 - \frac{4\pi}{\sqrt{3}} n^{-1}$. Therefore, choosing the parameter $\alpha$ in the maximum-margin term of the objective function to be less than $1 - \frac{4\pi}{\sqrt{3}} n^{-1}$ results in positive loss even for a perfect embedding of $n$ instances.
We treat the $n$ vectors as representatives of $n$ different instances in the image and seek to minimize the maximum pairwise similarity, or equivalently maximize the minimum pairwise distance (a case of Tammes's problem, also known as the hard-spheres problem [77]).
Proof 2
Let $d_n$ be the distance between the closest pair among $n$ optimally distributed points on the 2-sphere. Asymptotic results in [29] show that, for some constant $C > 0$,

$\sqrt{\frac{8\pi}{\sqrt{3}}}\, n^{-1/2} - C n^{-2/3} \;\leq\; d_n \;\leq\; \sqrt{\frac{8\pi}{\sqrt{3}}}\, n^{-1/2}.$   (4)

Since $\|x_i - x_j\|^2 = 2 - 2\, x_i^\top x_j$, we can rewrite this bound in terms of the similarity, so that for any arrangement of $n$ points some pair $i \neq j$ satisfies:

$x_i^\top x_j \;\geq\; 1 - \tfrac{1}{2} d_n^2 \;\geq\; 1 - \frac{4\pi}{\sqrt{3}}\, n^{-1}.$   (5)

Therefore, choosing $\alpha < 1 - \frac{4\pi}{\sqrt{3}} n^{-1}$ guarantees that $x_i^\top x_j > \alpha$ for some pair $i \neq j$. Choosing $\alpha \geq 1 - \frac{1}{2}\big(\sqrt{\frac{8\pi}{\sqrt{3}}}\, n^{-1/2} - C n^{-2/3}\big)^2$ guarantees the existence of an embedding with $x_i^\top x_j \leq \alpha$ for all pairs.
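To make this concrete, the sketch below evaluates the threshold $1 - (4\pi/\sqrt{3})\, n^{-1}$ for a few instance counts $n$ (our own illustration, assuming the asymptotic constant above):

```python
import math

def min_achievable_margin(n):
    """Asymptotic lower bound on the closest-pair similarity when n
    points are spread optimally on the 2-sphere: 1 - (4*pi/sqrt(3))/n."""
    return 1.0 - (4.0 * math.pi / math.sqrt(3.0)) / n

for n in (8, 16, 64, 256):
    print(n, round(min_achievable_margin(n), 4))

# The bound approaches 1 as n grows: with many instances, even an
# optimal embedding must place some pair of instances close together,
# so the margin alpha must be chosen increasingly close to 1.
assert min_achievable_margin(64) < min_achievable_margin(256) < 1.0
```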
Appendix B Details of Recurrent Mean Shift Grouping
There are two commonly used multivariate kernels in the mean-shift algorithm. The first, the Epanechnikov kernel [21, 13], has the profile

$K_E(x) = \begin{cases} \frac{1}{2} c_d^{-1}(d+2)\,(1 - \|x\|^2) & \text{if } \|x\| \leq 1 \\ 0 & \text{otherwise} \end{cases}$   (6)

where $c_d$ is the volume of the unit $d$-dimensional sphere. The standard mean-shift algorithm computes the gradient of the kernel density estimate

$p(x) = \frac{1}{n} \sum_i K\Big(\frac{x - x_i}{\delta}\Big)$

and identifies modes (local maxima) where $\nabla p(x) = 0$. The scale parameter $\delta$ is known as the kernel bandwidth and determines the smoothness of the estimator. For the Epanechnikov profile, the gradient of $p$ can be elegantly computed as the difference between $x$ and the mean of all data points $x_i$ with $\|x - x_i\| \leq \delta$, hence the name "mean shift" for performing gradient ascent.
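The flat/Epanechnikov mean-shift update thus amounts to moving a point to the mean of its neighbors within the bandwidth. A minimal 1-D sketch of this iteration (our own illustration):

```python
def mean_shift_step(x, data, bandwidth):
    """One mean-shift update with a flat kernel: move x to the mean of
    all data points within `bandwidth` of x (1-D for simplicity)."""
    neighbors = [p for p in data if abs(p - x) <= bandwidth]
    return sum(neighbors) / len(neighbors) if neighbors else x

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
x = 1.3
for _ in range(10):
    x = mean_shift_step(x, data, bandwidth=1.0)
# x converges to the mode of the nearby cluster.
assert abs(x - 1.0) < 1e-6
```

Iterating this step performs gradient ascent on the kernel density estimate, and the point stops moving once it reaches a mode.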
Since the Epanechnikov profile is not differentiable at its boundary, we instead use the squared-exponential kernel adapted to vectors on the sphere:

$K(x, x_i) = \exp\Big(\frac{x^\top x_i}{\delta^2}\Big)$   (7)

which can be viewed as a natural extension of the Gaussian to spherical data (known as the von Mises-Fisher (vMF) distribution [25, 5, 65, 42]). In our experiments we set the bandwidth $\delta$ based on the margin $\alpha$.
Our proposed algorithm also differs from standard mean-shift clustering (e.g., [14]) in that, rather than performing gradient ascent on a fixed kernel density estimate $p(x)$, at every iteration we alternate between updating the embedding vectors using gradient ascent on $p$ and re-estimating the density $p$ for the updated vectors. This approach is termed Gaussian Blurring Mean Shift (GBMS) in [9] and has convergence-rate guarantees for data that starts in compact clusters.
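One GBMS iteration on the sphere with the vMF-style kernel might look as follows. This is a pure-Python sketch under our own notation; the paper's implementation is a recurrent network layer, not this loop:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def gbms_step(X, delta):
    """One Gaussian Blurring Mean Shift step for unit vectors X (list of
    lists). Every point moves to a kernel-weighted mean of ALL points,
    then is re-projected onto the sphere."""
    new_X = []
    for xi in X:
        weights = [math.exp(sum(a * b for a, b in zip(xi, xj)) / delta ** 2)
                   for xj in X]
        total = sum(weights)
        mean = [sum(w * xj[k] for w, xj in zip(weights, X)) / total
                for k in range(len(xi))]
        new_X.append(normalize(mean))
    return new_X

# Two noisy clusters on the circle collapse toward their modes.
X = [normalize(v) for v in ([1.0, 0.1], [1.0, -0.1], [-1.0, 0.1], [-1.0, -0.1])]
for _ in range(5):
    X = gbms_step(X, delta=0.3)
within_cluster_sim = sum(a * b for a, b in zip(X[0], X[1]))
assert within_cluster_sim > 0.999
```

Because every point moves at each iteration (blurring), the density itself sharpens over time, which is what drives the rapid collapse of each cluster to a single point.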
In the paper we visualized embedding vectors after GBMS for specific examples. Figure 12 shows aggregate statistics over a collection of images (from the instance segmentation experiments). We plot the distribution of pairwise similarities for positive and negative pairs during forward propagation through 10 iterations. We observe that the mean-shift module produces sharper distributions, driving the similarity between positive pairs towards 1 and making it trivial to identify instances.
B.1 Gradient Calculation for Recurrent Mean Shift
To backpropagate gradients through one iteration of GBMS, we break the calculation into the sequence of steps below, where we assume the columns of the $d \times n$ data matrix $X$ have already been normalized to unit length:

$S = X^\top X, \qquad K = \exp(S/\delta^2), \qquad q = (K^\top \mathbf{1})^{-1}, \qquad P = K\,\mathrm{diag}(q), \qquad Y = XP$   (8)

where the exponential and inverse are applied elementwise, and $Y$ is the updated data after one iteration, which is subsequently renormalized to project back onto the sphere. Let $\ell$ denote the loss and $\odot$ denote the elementwise product. Backpropagation gradients are then given by:

$\frac{\partial \ell}{\partial P} = X^\top \frac{\partial \ell}{\partial Y}, \qquad \frac{\partial \ell}{\partial q} = \Big(K \odot \frac{\partial \ell}{\partial P}\Big)^\top \mathbf{1}, \qquad \frac{\partial \ell}{\partial K} = \frac{\partial \ell}{\partial P}\,\mathrm{diag}(q) - \mathbf{1}\Big(q \odot q \odot \frac{\partial \ell}{\partial q}\Big)^\top,$

$\frac{\partial \ell}{\partial S} = \frac{1}{\delta^2}\, K \odot \frac{\partial \ell}{\partial K}, \qquad \frac{\partial \ell}{\partial X} = \frac{\partial \ell}{\partial Y} P^\top + X\Big(\frac{\partial \ell}{\partial S} + \frac{\partial \ell}{\partial S}^\top\Big).$   (9)
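As a small consistency check on one link of this chain rule, the sketch below verifies by finite differences that the elementwise derivative of the kernel $K = \exp(S/\delta^2)$ with respect to $S$ is $K/\delta^2$ (our own illustration):

```python
import math

delta = 0.5
s = 0.3  # one entry of the similarity matrix S

def kernel(s):
    # Elementwise vMF-style kernel entry: K = exp(S / delta^2).
    return math.exp(s / delta ** 2)

# Analytic derivative used in backprop: dK/dS = K / delta^2.
analytic = kernel(s) / delta ** 2

# Central finite-difference estimate.
eps = 1e-6
numeric = (kernel(s + eps) - kernel(s - eps)) / (2 * eps)

assert abs(analytic - numeric) / abs(analytic) < 1e-6
```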
B.2 Toy Example of Mean Shift Backpropagation
In the paper we show examples of the gradient vectors backpropagated through recurrent mean shift to the initial embedding space. Backpropagation through this fixed model modulates the loss on the learned embedding, increasing the gradient for initial embedding vectors whose instance membership is ambiguous and decreasing the gradient for embedding vectors that will be correctly resolved by the recurrent grouping phase.
Figure 13 shows a toy example highlighting the difference between supervised and unsupervised clustering. We generate a set of 1-D data points drawn from three Gaussian distributions with different means and standard deviations, as shown in Figure 13 (a). We use mean squared error for the loss with a fixed linear regressor and fixed target labels, so the optimal embedding maps each point exactly onto its target label. We perform 30 gradient updates of the embedding vectors with a step size of 0.1, and analyze the behavior of Gaussian Blurring Mean Shift (GBMS) with a bandwidth of 0.2.
If we run GBMS for unsupervised clustering on these data with the default setting (bandwidth 0.2), the points are grouped into three piles, as shown in Figure 13 (b). If we update the data using gradient descent without GBMS inserted, we end up with three visible clusters even though the data move towards the ideal embedding in terms of classification. Figure 13 (c) and (d) depict the trajectories of 100 random data points during the 30 updates and the final result, respectively.
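The unsupervised 1-D GBMS behavior can be reproduced in a few lines. This sketch uses Gaussian weights and an arbitrary cluster layout of our own choosing, rather than the exact parameters behind Figure 13:

```python
import math
import random

def gbms_1d(points, delta):
    """One 1-D Gaussian Blurring Mean Shift iteration: each point moves
    to the Gaussian-weighted mean of all points."""
    new_points = []
    for x in points:
        w = [math.exp(-(x - y) ** 2 / (2 * delta ** 2)) for y in points]
        total = sum(w)
        new_points.append(sum(wi * y for wi, y in zip(w, points)) / total)
    return new_points

rng = random.Random(0)
# Three well-separated Gaussian piles (illustrative parameters).
pts = ([rng.gauss(0.0, 0.1) for _ in range(30)]
       + [rng.gauss(2.0, 0.1) for _ in range(30)]
       + [rng.gauss(4.0, 0.1) for _ in range(30)])
for _ in range(10):
    pts = gbms_1d(pts, delta=0.2)
# Each pile collapses: points from the first cluster nearly coincide.
spread = max(pts[:30]) - min(pts[:30])
assert spread < 1e-3
```

With a bandwidth much smaller than the cluster separation, cross-cluster weights are negligible and each pile rapidly contracts to a single mode.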
Now we insert the GBMS module, update the data with different numbers of GBMS loops, and compare how this affects the result. Columns (e) and (f) of Figure 13 show the updated data distributions before and after five loops of GBMS grouping, respectively. We notice that, with GBMS, all the data are grouped into two clusters; after GBMS grouping they become more compact and sit exactly on the "ideal spots" for mapping into label space (i.e. 3 and 5), achieving zero loss. On the other hand, even though these settings incorporate different numbers of GBMS loops, they achieve similar visual results in terms of clustering the data. To examine the subtle differences, we randomly select 100 data points and depict their trajectories in columns (g) and (h) of Figure 13, using a single loss on top of the last GBMS loop or multiple losses over every GBMS loop, respectively. We make the following observations:

Compared with Figure 13 (c), which depicts update trajectories without GBMS, the GBMS module provides larger gradients to the data points further from their "ideal spots", under both scenarios.

From (g), we can see that the final data are not updated into tight groups. This is because the updating mechanism only sees the data after (some loops of) GBMS, and knows that these data will be clustered into tight groups by GBMS anyway.

A single loss over more loops of GBMS provides larger gradients to update the data than one over fewer loops, as seen in (g).

With losses over every loop of GBMS, the gradients become so large that the data are grouped more tightly and more quickly. This is because the updating mechanism also incorporates the gradients from the loss over the original data, along with those backpropagated through the loops of GBMS.
To summarize, our GBMS-based recurrent grouping module indeed provides meaningful gradients during training with backpropagation. With the convergent dynamics of GBMS, our grouping module becomes especially powerful in learning to group data under suitable supervision.
Appendix C Additional Boundary Detection Results
We show additional boundary detection results on the BSDS500 dataset [1] in Figures 15, 16, 17 and 18 (a version of the paper with high-resolution figures can be found at the project page). Besides the boundary detection result, we also show the 3-dimensional pixel embeddings as RGB images before and after fine-tuning with the logistic loss. From the consistent colors, we can see that (1) our model essentially carries out binary classification even when using the pixel-pair embedding loss; and (2) after fine-tuning with the logistic loss, our model also captures boundary orientation and signed distance to the boundary. Figure 14 highlights this observation for an example image containing round objects. Zooming in on one plate, we observe a "colorful Möbius ring", indicating that the embedding features at the boundary also capture boundary orientation and the signed distance to the boundary.
Appendix D Additional Results on Instance-Level Semantic Segmentation
We show more instance-level semantic segmentation results on the PASCAL VOC 2012 dataset [22] in Figures 19, 20 and 21. As we learn a 64-dimensional embedding (hypersphere) space, we randomly generate three projection matrices to map the embeddings to 3-dimensional vectors that can be displayed as RGB images. Besides the randomly projected embeddings, we also visualize the semantic segmentation results used to produce the instance-level segmentation. From these figures, we observe that the embeddings of background pixels are consistent, as the backgrounds have almost the same color. Moreover, we can see that the embeddings (e.g. in Figure 19, the horses in rows 7 and 13, and the motorbike in row 14) are able to connect disconnected regions belonging to the same instance. Dealing with the disconnected regions of one instance is an unsolved problem for many methods, e.g. [3, 41], yet our approach handles this situation without difficulty.