PolyTransform: Deep Polygon Transformer for Instance Segmentation


In this paper, we propose PolyTransform, a novel instance segmentation algorithm that produces precise, geometry-preserving masks by combining the strengths of prevailing segmentation approaches and modern polygon-based methods. In particular, we first exploit a segmentation network to generate instance masks. We then convert the masks into a set of polygons that are then fed to a deforming network that transforms the polygons such that they better fit the object boundaries. Our experiments on the challenging Cityscapes dataset show that our PolyTransform significantly improves the performance of the backbone instance segmentation network and ranks 1st on the Cityscapes test-set leaderboard. We also show impressive gains in the interactive annotation setting.


1 Introduction

The goal of instance segmentation methods is to identify all countable objects in the scene, and produce a mask for each of them. With the help of instance segmentation, we can have a better understanding of the scene [67], design robotics systems that are capable of complex manipulation tasks [17], and improve perception systems of self-driving cars [44]. The task is, however, extremely challenging. In comparison to the traditional semantic segmentation task that infers the category of each pixel in the image, instance segmentation also requires the system to have the extra notion of individual objects in order to associate each pixel with one of them. The wide variability in the scale and appearance of objects, as well as occlusions and motion blur, makes this problem particularly difficult.

Figure 1: Overview of our PolyTransform model.

To address these issues, most modern instance segmentation methods employ a two-stage process [21, 62, 42], where object proposals are first created and then foreground-background segmentation within each bounding box is performed. With the help of the box, they can better handle situations (e.g., occlusions) where other methods often fail [4]. While these approaches have produced impressive results and achieved state-of-the-art performance on multiple benchmarks (e.g., COCO [38], Cityscapes [11]), their output is often over-smoothed, failing to capture fine-grained details.

An alternative line of work tackles the problem of interactive annotation [5, 2, 61, 39]. These techniques have been developed in the context of having an annotator in the loop, where a ground truth bounding box is provided. The goal of these works is to speed up annotation work by providing an initial polygon for annotators to correct as annotating from scratch is a very expensive process. In this line of work, methods exploit polygons to better capture the geometry of the object [5, 2, 39], instead of treating the problem as a pixel-wise labeling task. This results in more precise masks and potentially faster annotation speed as annotators are able to simply correct the polygons by moving the vertices. However, these approaches suffer in the presence of large occlusions or when the object is split into multiple disconnected components.

With these problems in mind, in this paper we develop a novel model, which we call PolyTransform, and tackle both the instance segmentation and interactive annotation problems. The idea behind our approach is that the segmentation masks generated by common segmentation approaches can be viewed as a starting point to compute a set of polygons, which can then be refined. We perform this refinement via a deforming network that predicts for each polygon the displacement of each vertex, taking into account the location of all vertices. By deforming each polygon, our model is able to better capture the local geometry of the object. Unlike [5, 2, 39], our model has no restriction on the number of polygons utilized to describe each object. This allows us to naturally handle cases where the objects are split into parts due to occlusion.

We first demonstrate the effectiveness of our approach on the Cityscapes dataset [11]. On the task of instance segmentation, our model improves the initialization by 3.0 AP and 10.3 in the boundary metric on the validation set. Importantly, we achieve 1st place on the test set leaderboard, beating the current state of the art by 3.7 AP. We further evaluate our model on a new self-driving dataset. Our model improves the initialization by 2.1 AP and 5.6 in the boundary metric. In the context of interactive annotation, we outperform the previous state of the art [61] by 2.0% in the boundary metric. Finally, we conduct an experiment where the crowd-sourced labelers annotate the object instances using the polygon output from our model. We show that this can speed up the annotation time by 35%!

2 Related Work

In this section, we briefly review the relevant literature on instance segmentation and annotation in the loop.

Proposal-based Instance Segmentation:

Most modern instance segmentation models adopt a two-stage pipeline. First, an over-complete set of segment proposals is identified, and then a voting process is exploited to determine which one to keep [7, 14]. As the explicit feature extraction process [53] is time-consuming [19, 20], Dai et al. [13, 12] integrated feature pooling into the neural network to improve efficiency. While the speed is drastically boosted compared to previous approaches, it is still relatively slow as these approaches are limited by the traditional detection-based pipeline. With this problem in mind, researchers have looked into directly generating instance masks in the network and treating them as proposals [51, 52]. Based on this idea, Mask R-CNN [21] introduced a joint approach to do both mask prediction and recognition. It builds upon Faster R-CNN [54] by adding an extra parallel head to predict the object's mask, in addition to the existing branch for bounding box recognition. Liu et al. [42] propose a path aggregation network to improve the information flow in Mask R-CNN and further improve performance. More recently, Chen et al. [6] interleave bounding box regression, mask regression and semantic segmentation to boost instance segmentation performance. Xu et al. [63] fit Chebyshev polynomials to instances by having a network learn the coefficients, which allows for real-time instance segmentation. Huang et al. [25] optimize the scoring of the bounding boxes by predicting IoU for each mask rather than only a classification score. Kuo et al. [34] start with bounding boxes and refine them using shape priors. Xiong et al. [62] and Kirillov et al. [31] extended Mask R-CNN to the task of panoptic segmentation. Yang et al. [64] extended Mask R-CNN to the task of video instance segmentation.

Proposal-free Instance Segmentation:

This line of research aims at segmenting the instances in the scene without an explicit object proposal. Zhang et al. [66, 65] first predict instance labels within the extracted multi-scale patches and then exploit a dense Conditional Random Field [33] to obtain a consistent labeling of the full image. While achieving impressive results, their approach is computationally intensive. Bai and Urtasun [4] exploited a deep network to predict the energy of the watershed transform such that each basin corresponds to an object instance. With one simple cut, they can obtain the instance masks of the whole image without any post-processing. Similarly, [32] exploits boundary prediction to separate the instances within the same semantic category. Despite being much faster, they suffer when dealing with far or small objects whose boundaries are ambiguous. To address this issue, Liu et al. [41] present a sequential grouping approach that employs neural networks to gradually compose objects from simpler elements. It can robustly handle situations where a single instance is split into multiple parts. Newell and Deng [49] implicitly encode the grouping concept into neural networks by having the model predict both a semantic class and a tag for each pixel. The tags are one-dimensional embeddings which associate each pixel with one another. Kendall et al. [28] propose a method to assign pixels to objects by having each pixel point to its object's center so that it can be grouped. Sofiiuk et al. [58] use a point proposal network to generate points where the instances can be, which are then processed by a CNN to output instance masks for each location. Neven et al. [48] propose a new clustering loss that pulls the spatial embedding of pixels belonging to the same instance together to achieve real-time instance segmentation with high accuracy. Gao et al. [18] propose a single-shot instance segmentation network that outputs a pixel-pair affinity pyramid to compute whether two pixels belong to the same instance; they then combine this with a predicted semantic segmentation to output a single instance segmentation map.

Interactive Annotation:

The task of interactive annotation can also be posed as finding the polygons or curves that best fit the object boundaries. In fact, the concept of deforming a curve to fit the object contour can be dated back to the 80s, when the active contour model (ACM) was first introduced [27]. Since then, variants of ACM [10, 47, 9] have been proposed to better capture the shape. Recently, the idea of exploiting polygons to represent an instance has been explored in the context of human-in-the-loop segmentation [5, 2]. Castrejón et al. [5] adopted an RNN to sequentially predict the vertices of the polygon. Acuna et al. [2] extended [5] by incorporating graph neural networks and increasing image resolution. While these methods demonstrated promising results on public benchmarks [11], they require a ground truth bounding box as input. Ling et al. [39] and Dong et al. [16] exploited splines as an alternative parameterization. Instead of drawing the whole polygon or curve from scratch, they start with a circle and deform it. Wang et al. [61] tackled this problem with implicit curves using level sets; however, because the outputs are not polygons, an annotator cannot easily correct them. Maninis et al. [46] use extreme boundary points as inputs rather than bounding boxes, and Majumder et al. [45] use user clicks to generate content-aware guidance maps; all of these help the networks learn stronger cues to generate more accurate segmentations. However, because the outputs are pixel-wise masks, they are not easily amendable by an annotator. Acuna et al. [1] develop an approach that can be used to refine noisy annotations by jointly reasoning about the object boundaries with a CNN and a level set formulation. In the domain of offline mapping, several papers from Homayounfar et al. and Liang et al. [23, 35, 24, 36] have tackled the problem of automatically annotating crosswalks, road boundaries and lanes by predicting structured outputs such as a polyline.

Figure 2: Our feature extraction network.

3 PolyTransform

Our aim is to design a robust segmentation model that is capable of producing precise, geometry-preserving masks for each individual object. Towards this goal, we develop PolyTransform, a novel deep architecture that combines prevailing segmentation approaches [21, 62] with modern polygon-based methods [5, 2]. By exploiting the best of both worlds, we are able to generate high quality segmentation masks under various challenging scenarios.

In this section, we start by describing the backbone architecture for feature extraction and polygon initialization. Next, we present a novel deforming network that warps the initial polygon to better capture the local geometry of the object. An overview of our approach is shown in Figure 1.

3.1 Instance Initialization

The goal of our instance initialization module is to provide a good polygon initialization for each individual object. To this end, we first exploit a model to generate a mask for each instance in the scene. Our experiments show that our approach can significantly improve performance for a wide variety of segmentation models. If the segmentation model outputs proposal boxes, we use them to crop the image, otherwise, we fit a bounding box to the mask. The cropped image is later resized to a square and fed into a feature network (described in Sec. 3.2) to obtain a set of reliable deep features. In practice, we resize the cropped image to . To initialize the polygon, we use the border following algorithm of [59] to extract the contours from the predicted mask. We get the initial set of vertices by placing a vertex at every px distance in the contour. Empirically, we find such dense vertex interpolation provides a good balance between performance and memory consumption.
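The vertex placement step can be illustrated with a minimal sketch. It assumes the contour has already been extracted from the mask as an ordered list of (x, y) points; the function name `resample_closed_contour` and the `spacing` parameter are hypothetical, and the spacing value is left to the caller rather than taken from our experiments.

```python
import math


def resample_closed_contour(points, spacing):
    """Place a vertex every `spacing` units of arc length along a closed contour.

    `points` is an ordered list of (x, y) tuples; the last point is implicitly
    connected back to the first one.
    """
    n = len(points)
    if n < 2 or spacing <= 0:
        raise ValueError("need at least two points and a positive spacing")

    vertices = [points[0]]
    dist_to_next = spacing  # arc length left before the next vertex is due
    for i in range(n):
        (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
        seg_len = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already covered on this segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len
            vertices.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = spacing
        dist_to_next -= seg_len - pos

    # If the walk ends exactly on the starting point, avoid duplicating it.
    if len(vertices) > 1 and math.dist(vertices[-1], vertices[0]) < 1e-9:
        vertices.pop()
    return vertices


square = [(0.0, 0.0), (40.0, 0.0), (40.0, 40.0), (0.0, 40.0)]
initial_polygon = resample_closed_contour(square, spacing=10.0)
```

A smaller spacing gives a denser polygon that can follow finer detail, at the cost of more vertices for the deforming network to process.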

3.2 Feature Extraction Network

The goal of our feature extraction network is to learn strong object boundary features. This is essential as we want our polygons to capture high curvature and complex shapes. As such, we employ a feature pyramid network (FPN) [37] to learn and make use of multi-scale features. This network takes as input the crop obtained from the instance initialization stage and outputs a set of features at different pyramid levels. Our backbone can be seen in Figure 2.

3.3 Deforming Network

We have so far computed a polygon initialization and deep features of the FPN from the image crop. The next step is to build a feature embedding for all vertices and learn a deforming model that can effectively predict the offset for each vertex so that the polygon snaps better to the object boundaries.

Vertex embedding:

We build our vertex representation upon the multi-scale feature extracted from the backbone FPN network of the previous section. In particular, we take the , , , and feature maps and apply two lateral convolutional layers to each of them in order to reduce the number of feature channels from to each. Since the feature maps are , , , and of the original scale, we bilinearly upsample them back to the original size and concatenate them to form a feature tensor. To provide the network a notion of where each vertex is, we further append a 2 channel CoordConv layer [40]. The channels represent and coordinates with respect to the frame of the crop. Finally, we exploit the bilinear interpolation operation of the spatial transformer network [26] to sample features at the vertex coordinates of the initial polygon from the feature tensor. We denote such embedding as .
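To make the sampling step concrete, the following is a minimal sketch of bilinear interpolation at vertex coordinates. It is a simplification under stated assumptions: feature channels are plain nested lists rather than tensors, the helper names `bilinear_sample` and `embed_vertices` are hypothetical, and the two coordinate channels are appended directly as coordinates normalized to [0, 1], which is an assumed normalization. Sampling a linear coordinate channel bilinearly returns the coordinate itself, so appending it directly is equivalent for vertices inside the crop.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample a single-channel H x W grid (list of rows) at a continuous (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so that vertices on or slightly outside the border stay valid.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom


def embed_vertices(channels, vertices):
    """Build one embedding per vertex: sampled features plus (x, y) coordinates."""
    height, width = len(channels[0]), len(channels[0][0])
    embeddings = []
    for x, y in vertices:
        feats = [bilinear_sample(channel, x, y) for channel in channels]
        # Assumed normalization of the crop coordinates to [0, 1].
        feats += [x / max(width - 1, 1), y / max(height - 1, 1)]
        embeddings.append(feats)
    return embeddings


toy_channel = [[0.0, 1.0], [2.0, 3.0]]
toy_embeddings = embed_vertices([toy_channel], [(0.5, 0.5), (1.0, 1.0)])
```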

Deforming network:

When moving a vertex in a polygon, the two attached edges are subsequently moved as well. The movement of these edges depends on the position of the neighboring vertices. Each vertex thus must be aware of its neighbors and needs a way to communicate with one another in order to reduce unstable and overlapping behavior. In this work, we exploit the self-attending Transformer network [60] to model such intricate dependencies. We leverage the attention mechanism to propagate the information across vertices and improve the predicted offsets.

More formally, given the vertex embeddings, we first employ three feed-forward neural networks to transform them into $Q$, $K$ and $V$, which stand for Query, Key and Value. We then compute the weightings between vertices by taking a softmax over the scaled dot product $QK^{\top}$. Finally, the weightings are multiplied with the values $V$ to propagate these dependencies across all vertices. Such an attention mechanism can be written as:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \]

where $d_k$ is the dimension of the queries and keys, serving as a scaling factor to prevent extremely small gradients. We repeat the same operation a fixed number of times, 6 in our experiments. After the last Transformer layer, we feed the output to another feed-forward network which predicts offsets for the vertices. We add the offsets to the polygon initialization to transform the shape of the polygon.
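A single attention step over the vertices can be sketched as follows. This is an illustration and not our implementation: it uses one linear map per projection in place of the feed-forward networks, operates on nested lists, and the names `self_attention`, `w_q`, `w_k` and `w_v` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(z, w_q, w_k, w_v):
    """Scaled dot-product self-attention over vertex embeddings `z` (N x D).

    Returns the attended features (N x D_v) and the N x N attention weights.
    """
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d_k = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # entry (i, j): query i against key j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
vertex_embeddings = [[1.0, 0.0], [0.0, 1.0]]
attended, attention = self_attention(vertex_embeddings, identity, identity, identity)
```

Each row of the weight matrix sums to one, so every vertex receives a convex combination of the value vectors of all vertices, which is how information about neighbors reaches the offset prediction.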

3.4 Learning

We train the deforming network and the feature extraction network in an end-to-end manner. Specifically, we minimize the weighted sum of two losses. The first penalizes the model when the vertices deviate from the ground truth. The second regularizes the edges of the polygon to prevent overlap and unstable movement of the vertices.

Polygon Transforming Loss:

We make use of the Chamfer Distance loss to move the vertices of our predicted polygon closer to the ground truth polygon . The Chamfer Distance loss is defined as:

where and are the rasterized edge pixels of the polygons and . The first term of the loss penalizes the model when is far from and the second term penalizes the model when is far from .
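The two-term structure described above can be illustrated with a small sketch of a symmetric Chamfer distance between two point sets. The per-set averaging and the Euclidean norm are assumptions of this sketch, the function name `chamfer_distance` is hypothetical, and the inputs are plain lists of (x, y) points standing in for the rasterized edge pixels.

```python
import math


def chamfer_distance(pred_points, gt_points):
    """Symmetric Chamfer distance between two non-empty sets of (x, y) points."""

    def mean_nearest(source, target):
        # Average distance from each source point to its closest target point.
        return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

    # First term: prediction far from ground truth.
    # Second term: ground truth far from prediction.
    return mean_nearest(pred_points, gt_points) + mean_nearest(gt_points, pred_points)


example_loss = chamfer_distance([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0)])
```

The brute-force nearest-neighbor search here is quadratic in the number of points; it is meant for clarity, not for training at scale.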

In order to prevent unstable movement of the vertices, we add a standard deviation loss on the lengths of the edges between the vertices. Empirically, we found that without this term the vertices can suddenly shift a large distance, incurring a large loss and causing the gradients to blow up. We define the standard deviation loss as: where denotes the mean length of the edges.
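As a hedged sketch of this regularizer, the function below computes the standard deviation of the edge lengths of a closed polygon. Using the population standard deviation (dividing by the number of edges) is an assumption of this sketch, and the name `edge_length_std` is hypothetical.

```python
import math


def edge_length_std(vertices):
    """Standard deviation of the edge lengths of a closed polygon.

    `vertices` is an ordered list of (x, y) tuples; the last vertex connects
    back to the first one.
    """
    n = len(vertices)
    lengths = [math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n)]
    mean_length = sum(lengths) / n
    return math.sqrt(sum((length - mean_length) ** 2 for length in lengths) / n)


square_penalty = edge_length_std([(0, 0), (1, 0), (1, 1), (0, 1)])
rectangle_penalty = edge_length_std([(0, 0), (3, 0), (3, 1), (0, 1)])
```

The penalty is zero when all edges have equal length and grows when a single vertex jumps far from its neighbors, which is the behavior the term is meant to discourage.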

4 Experiments

We evaluate our model in the context of both instance segmentation and interactive annotation settings.

Experimental Setup:

We train our model on 8 Titan 1080 Ti GPUs using the distributed training framework Horovod [56] for 1 day. We use a batch size of 1, ADAM [30], 1e-4 learning rate and a 1e-4 weight decay. We augment our data by randomly flipping the images horizontally. During training, we only train with instances whose proposed box has an Intersection over Union (IoU) overlap of over 0.5 with the ground truth (GT) boxes. We train with both instances produced using proposed boxes and GT boxes to further augment the data. For our instance segmentation experiments, we augment the box sizes by to during training and test with a box expansion. For our interactive annotation experiments, we train and test on boxes with an expansion of 5px on each side; we only compute a chamfer loss if the predicted vertex is at least 2px from the ground truth polygon. When placing weights on the losses, we found ensuring the loss values were approximately balanced produced the best result. For our PolyTransform FPN, we use ResNet50 [22] as the backbone and use the pretrained weights from UPSNet [62] on Cityscapes. For our deforming network we do not use pretrained weights.
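The instance filtering and box expansion described above can be sketched as follows. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, and the helper names `box_iou`, `keep_for_training` and `expand_box` are hypothetical.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_for_training(proposal, gt_boxes, min_iou=0.5):
    """Keep a proposal only if it overlaps some GT box with IoU above `min_iou`."""
    return any(box_iou(proposal, gt) > min_iou for gt in gt_boxes)


def expand_box(box, pad):
    """Grow a box by `pad` pixels on each side."""
    return (box[0] - pad, box[1] - pad, box[2] + pad, box[3] + pad)


gt = [(0.0, 0.0, 10.0, 10.0)]
good_proposal = keep_for_training((1.0, 0.0, 11.0, 10.0), gt)
bad_proposal = keep_for_training((5.0, 0.0, 15.0, 10.0), gt)
annotation_box = expand_box((10, 10, 20, 20), pad=5)
```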

training data AP AP AP person rider car truck bus train mcycle bcycle
DWT [4] fine
Kendall et al. [28] fine
Arnab et al. [3] fine
SGN [41] fine+coarse
PolygonRNN++ [2] fine
Mask R-CNN [21] fine
BShapeNet+ [29] fine
GMIS [43] fine+coarse
Neven et al. [48] fine
PANet [42] fine
Mask R-CNN [21] fine+COCO
AdaptIS [58] fine 31.4 29.1 50.0 31.6 41.7 39.4 12.1
SSAP [18] fine
BShapeNet+ [29] fine+COCO
UPSNet [62] fine+COCO
PANet [42] fine+COCO
Ours fine+COCO

Table 1: Instance segmentation on Cityscapes val and test set: This table shows our instance segmentation results on the Cityscapes test set. We report models trained on fine and fine+COCO. We report AP and AP$_{50}$.
fine COCO AP AP car truck bus train person rider bcycle+r bcycle mcycle+r mcycle
Mask RCNN [21] -
PANet [42] -
UPSNet [62] -
PANet [42]
UPSNet [62]

Table 2: Instance segmentation on Our Dataset test set: This table shows our instance segmentation results on the Our Dataset test set. We report models trained on fine and fine+COCO. We report AP and AP$_{50}$. +r is short for with rider.
Init Backbone COCO AP AP AF AF
DWT Res101 -
UPSNet Res50 -
UPSNet Res50
UPSNet WRes38+PANet
Table 3: Improvement on Cityscapes val instance segmentation initializations: We report the AP, AF of the initialization and gain in AP, AF from the initialization instances when running our PolyTransform model for Cityscapes val.
Init Backbone COCO AP AP AF AF
M-RCNN Res50 -
UPSNet Res101 -
UPSNet Res101
Table 4: Improvement on Our Dataset val instance segmentation initializations: We report the AP, AF of the initialization and gain in AP, AF from the initialization instances when running our PolyTransform model for Our Dataset val.
Figure 3: We showcase qualitative instance segmentation results of our model on the Cityscapes validation set. Columns: input image, our instance segmentation, GT instance segmentation.

4.1 Instance Segmentation


We use Cityscapes [11], which has high-quality pixel-level instance segmentation annotations. The images were collected in 27 cities, and they are split into 2975, 500 and 1525 images for train/val/test. There are 8 instance classes: bicycle, bus, person, train, truck, motorcycle, car and rider. We also report results on a new dataset we collected. It consists of 10235/1139/1186 images for the train/val/test split, annotated with 10 classes: car, truck, bus, train, person, rider, bicycle with rider, bicycle, motorcycle with rider and motorcycle. Each image is of size .


For our instance segmentation results, we report the average precision (AP and AP$_{50}$) for the predicted mask. Here, the AP is computed at 10 IoU overlap thresholds ranging from 0.5 to 0.95 in steps of 0.05 following [11]. AP$_{50}$ is the AP at an overlap of 50%. We also introduce a new metric that focuses on boundaries. In particular, we use a metric similar to [61, 50] where precision, recall and F1 score are computed for each mask, and a prediction is considered correct if it lies within a certain distance threshold of the ground truth. We use a threshold of 1px, and only compute the metric for true positives. We use the same 10 IoU overlap thresholds ranging from 0.5 to 0.95 in steps of 0.05 to determine the true positives. Once we compute the F1 score for all classes and thresholds, we take the average over all examples to get AF.
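The boundary metric for a single matched instance can be sketched as below. This is an illustration under assumptions: boundaries are given as lists of (x, y) pixel coordinates, distances are Euclidean, and the names `boundary_f_score` and `IOU_THRESHOLDS` are hypothetical. Averaging over classes, thresholds and examples is not shown.

```python
import math

# The 10 IoU thresholds from 0.5 to 0.95 in steps of 0.05.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def boundary_f_score(pred_boundary, gt_boundary, threshold=1.0):
    """F1 score between two boundaries given as lists of (x, y) pixels.

    A point counts as matched if the other boundary has a point within
    `threshold` pixels of it.
    """
    if not pred_boundary or not gt_boundary:
        return 0.0

    def fraction_within(source, target):
        hits = sum(
            1 for p in source if min(math.dist(p, q) for q in target) <= threshold
        )
        return hits / len(source)

    precision = fraction_within(pred_boundary, gt_boundary)
    recall = fraction_within(gt_boundary, pred_boundary)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


example_f = boundary_f_score(
    [(0, 0), (1, 0), (5, 5)],  # predicted boundary with one stray pixel
    [(0, 0), (1, 0)],          # ground truth boundary
)
```

In this toy case the stray pixel lowers precision to 2/3 while recall stays at 1, so a mask can overlap the ground truth well in area and still lose boundary score.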

Instance Initialization:

Since our method improves upon a baseline initialization, we want to use a strong instance initialization to show we can still improve the results even when they are very strong. To do this, we take the publicly available UPSNet [62], and replace its backbone with WideResNet38 [55] and add all the elements of PANet [42] except for the synchronized batch normalization (we use group normalization instead). We further pretrain on COCO and use deformable convolution (DCN) [15] in the backbone.

Figure 4: We showcase qualitative instance segmentation results of our model on Our Dataset validation set. Columns: input image, our instance segmentation, GT instance segmentation.
Mean bicycle bus person train truck mcycle car rider F F
DEXTR* [46]
Deep Level Sets [61]
Table 5: Interactive Annotation (Cityscapes Stretch): This table shows our IoU % performance in the setting of annotation where we are given the ground truth boxes. DEXTR* represents DEXTR without extreme points.
Mean bicycle bus person train truck mcycle car rider F F
Polygon-RNN [5]
Polygon-RNN++ [2]
Curve GCN [39]
Deep Level Sets [61]
Table 6: Interactive Annotation (Cityscapes Hard): This table shows our IoU % performance in the setting of annotation where we are given the ground truth boxes.

Comparison to SOTA:

We compare our model with the state of the art on the val and test sets of Cityscapes in Table 1. We see that we outperform all baselines in every metric. Our model achieves a new state-of-the-art test result of 40.1 AP. This significantly outperforms PANet by 3.7 and 2.8 points in AP and AP$_{50}$, respectively. It also ranks number 1 on the official Cityscapes leaderboard. We report the results on our new dataset in Table 2. We achieve the strongest test AP result on this dataset. We see that we improve over PANet by 6.2 points in AP and UPSNet by 3.8 points in AP.

Robustness to Initialization:

We also report the improvement over different instance segmentation networks used as initialization in Table 3. We see significant and consistent improvements in val AP across all models. When we train our model on top of the DWT [4] instances we see an improvement of , points in AP and AF. We also train on top of the UPSNet results from the original paper along with UPSNet with WRes38+PANet as a way to reproduce the current SOTA val AP of PANet. We show an improvement of , points in AP and AF. Finally we improve on our best initialization by AP points in AP and AF. As we can see, our boundary metric sees a very consistent gain in AF across all models. This suggests that our model is significantly improving the instances at the boundary. We notice that a large gain in AP (WRes38+PANet to WRes38+PANet+DCN) does not necessarily translate to a large gain in AF, however, our model will always provide a significant increase in this metric. We also report the validation AP improvement over different instance segmentation networks in Table 4 for our new dataset. We see that we can improve on Mask R-CNN [21] by , points in AP, AF. For the different UPSNet models, we improve upon it between 1.4-2.2 AP points. Once again, our model shows a consistent and strong improvement over the instance segmentation initializations. We also see a very consistent gain in AF across all the models.

Annotation Efficiency:

We conduct an experiment where we ask crowd-sourced labelers to annotate 150 images from our new dataset with instances larger than 24x24px for vehicles and 12x14px for pedestrians/riders. We performed a control experiment where the instances are annotated completely from scratch (without our method) and a parallel experiment where we use our model to output the instances for them to fix to produce the final annotations. In the fully manual experiment, it took on average 60.3 minutes to annotate each image. When the annotators were given the PolyTransform output to annotate on top of, it took on average 39.4 minutes to annotate each image, reducing the annotation time by about 35% ((60.3 - 39.4) / 60.3 ≈ 0.35). This resulted in significant cost savings.

Qualitative Results:

We show qualitative results of our model on the validation set in Figure 3. In our instance segmentation outputs we see that in many cases our model is able to handle occlusion. For example, in row 3, we see that our model is able to capture the feet of the purple and blue pedestrians despite their feet being occluded from the body. We also show qualitative results on our new dataset in Figure 4. We see that our model is able to capture precise boundaries, allowing it to capture difficult shapes such as car mirrors and pedestrians.

FCN R50 -
FCN R101 -
FCN R101
DeepLabV3 R50 -
DeepLabV3 R101 -
DeepLabV3+ R101
Table 7: Improvement on Cityscapes Stretch segmentation initializations: We report the metric improvements when running our PolyTransform model on different models. We report our model results trained on FCN [57] and DeepLabV3 [8]. DeepLabV3+ uses the class balancing loss from [46]. We report on models with various backbones (Res50 vs Res101) and also with and without pretraining on COCO [38].

Failure Modes:

Our model can fail when the initialization is poor (left image in Figure 5). Despite being able to handle occlusion, our model can still fail when the occlusion is complex or ambiguous as seen in the right of Figure 5. Here there is a semi-transparent fence blocking the car.

4.2 Interactive Annotation

The goal is to annotate an object with a polygon given its ground truth bounding box. The idea is that the annotator provides a ground truth box and our model works on top of it to output a polygon representation of the object instance.


We follow [5] and split the Cityscapes dataset such that the original validation set is the test set and two cities from the training set (Weimar and Zurich) form the validation set. In [61], the authors further split this dataset into two settings. The first is Cityscapes Hard, where the ground truth bounding box is enlarged to form a square and then the image is cropped. The second is Cityscapes Stretch, where the ground truth bounding box along with the image is stretched to a square and then cropped.


To evaluate our model for this task, we report the intersection over union (IoU) on a per-instance basis and average for each class. Then, following [5] this is averaged across all classes. We also report the boundary metric reported in [61, 50], which computes the F measure along the contour for a given threshold. The thresholds used are 1 and 2 pixels as Cityscapes contains a lot of small instances.
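The IoU averaging described above (per instance, then per class, then across classes) can be sketched as follows. Masks are assumed to be nested lists of 0/1 values of equal size, and the names `mask_iou` and `mean_class_iou` are hypothetical.

```python
from collections import defaultdict


def mask_iou(pred_mask, gt_mask):
    """IoU of two binary masks given as equally sized lists of rows."""
    inter = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def mean_class_iou(instances):
    """Average IoU per class first, then average the class means.

    `instances` is an iterable of (class_name, pred_mask, gt_mask) triples.
    Returns the overall mean and the per-class means.
    """
    per_class = defaultdict(list)
    for class_name, pred_mask, gt_mask in instances:
        per_class[class_name].append(mask_iou(pred_mask, gt_mask))
    class_means = {name: sum(v) / len(v) for name, v in per_class.items()}
    return sum(class_means.values()) / len(class_means), class_means


gt_mask = [[1, 0], [0, 0]]
loose_mask = [[1, 1], [0, 0]]
overall, by_class = mean_class_iou(
    [("car", loose_mask, gt_mask), ("car", gt_mask, gt_mask), ("bus", gt_mask, gt_mask)]
)
```

Averaging per class before averaging across classes keeps frequent classes such as car from dominating the reported mean.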

Instance Initialization:

For our best model we use a variation of DeepLabV3 [8], which we call DeepLabV3+ as the instance initialization network. The difference is that we train DeepLabV3 with the class balancing loss used in [46].

Comparison to SOTA:

Tables 5 and 6 show the results on the test set for Cityscapes Stretch and Hard, respectively. For Cityscapes Stretch, we see that our model significantly outperforms the SOTA in the boundary metric, improving it by up to 2%. Unlike the Deep Level Sets [61] method, which outputs a pixel-wise mask, our method outputs a polygon that an annotator in the loop can correct by simply moving the vertices. For Cityscapes Hard, our model outperforms the SOTA by 4.9%, 8.3% and 7.2% in mean IoU, F at 1px and F at 2px, respectively.

Robustness to Initialization:

We also report improvements over different segmentation initializations in Table 7; the results are on the test set. Our models are trained on various backbone initialization models (FCN [57] and DeepLabV3 [8], with and without pretraining on COCO [38]). Our model is able to consistently and significantly improve the boundary metrics at 1 and 2 pixels by up to 1.5%, and it improves the IoU by 0.1-0.2%. We also note that the difference in mean IoU between FCN and DeepLabV3 is very small (at most 0.5%) despite DeepLabV3 being a much stronger segmentation model. We argue that the margin for mean IoU improvement is very small for this dataset.


Our model runs on average at 21 ms per object instance. This is 14x faster than Polygon-RNN++ [2] and 1.4x faster than Curve GCN [39], which are the state of the art.

Figure 5: Failure modes: We show failure modes of our model. On the left, our model fails because the initialization is poor. On the right, the model fails because of complex occlusion. Yellow: Init, Blue: Ours.

5 Conclusion

In this paper, we present PolyTransform, a novel deep architecture that combines the strengths of both prevailing segmentation approaches and modern polygon-based methods. We first exploit a segmentation network to generate a mask for each individual object. The instance mask is then converted into a set of polygons that serve as our initialization. Finally, a deforming network is applied to warp the polygons to better fit the object boundaries. We evaluate the effectiveness of our model on the Cityscapes dataset as well as a novel dataset that we collected. Experiments show that our approach is able to produce precise, geometry-preserving instance segmentation that significantly outperforms the backbone model. Compared to the instance segmentation initialization, we increase the validation AP and boundary metric by up to 3.0 and 10.3 points, allowing us to achieve 1st place on the Cityscapes leaderboard. We also show that our model speeds up annotation by 35%. Compared to previous work on annotation-in-the-loop [2], we improve the boundary metric by 2.0%. Importantly, our PolyTransform generalizes across various instance segmentation networks.




  1. D. Acuna, A. Kar and S. Fidler (2019) Devil is in the edges: learning semantic boundaries from noisy annotations. In CVPR, Cited by: §2.
  2. D. Acuna, H. Ling, A. Kar and S. Fidler (2018) Efficient interactive annotation of segmentation datasets with polygon-rnn++. Cited by: §1, §1, §2, §3, §4.2, Table 1, Table 6, §5.
  3. A. Arnab and P. H. S. Torr (2017) Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, Cited by: Table 1.
  4. M. Bai and R. Urtasun (2017) Deep watershed transform for instance segmentation. In CVPR, Cited by: §1, §2, §4.1, Table 1.
  5. L. Castrejón, K. Kundu, R. Urtasun and S. Fidler (2017) Annotating object instances with a polygon-rnn. In CVPR, Cited by: §1, §1, §2, §3, §4.2, §4.2, Table 6.
  6. K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy and D. Lin (2019) Hybrid task cascade for instance segmentation. In CVPR, Cited by: §2.
  7. L. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang and H. Adam (2018) MaskLab: instance segmentation by refining object detection with semantic and direction features. In CVPR, Cited by: §2.
  8. L. Chen, G. Papandreou, F. Schroff and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. CoRR. Cited by: §4.2, §4.2, Table 7.
  9. D. Cheng, R. Liao, S. Fidler and R. Urtasun (2019) DARNet: deep active ray network for building segmentation. In CVPR, Cited by: §2.
  10. L. D. Cohen (1991) On active contour models and balloons. CVGIP. Cited by: §2.
  11. M. Cordts, M. Omran, S. Ramos, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele () The cityscapes dataset. Cited by: §1, §1, §2, §4.1, §4.1.
  12. J. Dai, K. He, Y. Li, S. Ren and J. Sun (2016) Instance-sensitive fully convolutional networks. In ECCV, Cited by: §2.
  13. J. Dai, K. He and J. Sun (2015) Convolutional feature masking for joint object and stuff segmentation. In CVPR, Cited by: §2.
  14. J. Dai, K. He and J. Sun (2016) Instance-aware semantic segmentation via multi-task network cascades. In CVPR, Cited by: §2.
  15. J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu and Y. Wei (2017) Deformable convolutional networks. ICCV. Cited by: §4.1.
  16. Z. Dong, R. Zhang and X. Shao (2019) Automatic annotation and segmentation of object instances with deep active curve network. IEEE Access. Cited by: §2.
  17. N. Fazeli, M. Oller, J. Wu, Z. Wu, J. B. Tenenbaum and A. Rodriguez (2019) See, feel, act: hierarchical learning for complex manipulation skills with multisensory fusion. Science Robotics. Cited by: §1.
  18. N. Gao, Y. Shan, Y. Wang, X. Zhao, Y. Yu, M. Yang and K. Huang (2019) SSAP: single-shot instance segmentation with affinity pyramid. In ICCV, Cited by: §2, Table 1.
  19. B. Hariharan, P. Arbeláez, R. Girshick and J. Malik (2014) Simultaneous detection and segmentation. In ECCV, Cited by: §2.
  20. B. Hariharan, P. Arbeláez, R. Girshick and J. Malik (2015) Hypercolumns for object segmentation and fine-grained localization. In CVPR, Cited by: §2.
  21. K. He, G. Gkioxari, P. Dollár and R. B. Girshick (2017) Mask R-CNN. CoRR. Cited by: §1, §2, §3, §4.1, Table 1, Table 2.
  22. K. He, X. Zhang, S. Ren and J. Sun (2015) Deep residual learning for image recognition. CoRR. Cited by: §4.
  23. N. Homayounfar, W. Ma, S. K. Lakshmikanth and R. Urtasun (2018) Hierarchical recurrent attention networks for structured online maps. In CVPR, Cited by: §2.
  24. N. Homayounfar, W. Ma, J. Liang, X. Wu, J. Fan and R. Urtasun (2019) DAGMapper: learning to map by discovering lane topology. In ICCV, Cited by: §2.
  25. Z. Huang, L. Huang, Y. Gong, C. Huang and X. Wang (2019) Mask Scoring R-CNN. In CVPR, Cited by: §2.
  26. M. Jaderberg, K. Simonyan and A. Zisserman (2015) Spatial transformer networks. In NIPS, Cited by: §3.3.
  27. M. Kass, A. Witkin and D. Terzopoulos (1988) Snakes: active contour models. IJCV. Cited by: §2.
  28. A. Kendall, Y. Gal and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, Cited by: §2, Table 1.
  29. H. Y. Kim and B. R. Kang (2018) Instance segmentation and object detection with bounding shape masks. CoRR. Cited by: Table 1.
  30. D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR. Cited by: §4.
  31. A. Kirillov, K. He, R. Girshick, C. Rother and P. Dollar (2019) Panoptic segmentation. In CVPR, Cited by: §2.
  32. A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy and C. Rother (2017) Instancecut: from edges to instances with multicut. In CVPR, Cited by: §2.
  33. P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, Cited by: §2.
  34. W. Kuo, A. Angelova, J. Malik and T. Lin (2019) ShapeMask: learning to segment novel objects by refining shape priors. In ICCV, Cited by: §2.
  35. J. Liang, N. Homayounfar, W. Ma, S. Wang and R. Urtasun (2019) Convolutional recurrent network for road boundary extraction. In CVPR, Cited by: §2.
  36. J. Liang and R. Urtasun (2018) End-to-end deep structured models for drawing crosswalks. In ECCV, Cited by: §2.
  37. T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan and S. J. Belongie (2016) Feature pyramid networks for object detection. CoRR. Cited by: §3.2.
  38. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §1, §4.2, Table 7.
  39. H. Ling, J. Gao, A. Kar, W. Chen and S. Fidler (2019) Fast interactive object annotation with curve-gcn. In CVPR, Cited by: §1, §1, §2, §4.2, Table 6.
  40. R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev and J. Yosinski (2018) An intriguing failing of convolutional neural networks and the coordconv solution. CoRR. Cited by: §3.3.
  41. S. Liu, J. Jia, S. Fidler and R. Urtasun (2017) Sgn: sequential grouping networks for instance segmentation. In ICCV, Cited by: §2, Table 1.
  42. S. Liu, L. Qi, H. Qin, J. Shi and J. Jia (2018) Path aggregation network for instance segmentation. In CVPR, Cited by: §1, §2, §4.1, Table 1, Table 2.
  43. Y. Liu, S. Yang, B. Li, W. Zhou, J. Xu, H. Li and Y. Lu (2018) Affinity derivation and graph merge for instance segmentation. In ECCV, Cited by: Table 1.
  44. W. Ma, S. Wang, R. Hu, Y. Xiong and R. Urtasun (2019) Deep rigid instance scene flow. In CVPR, Cited by: §1.
  45. S. Majumder and A. Yao (2019) Content-aware multi-level guidance for interactive instance segmentation. In CVPR, Cited by: §2.
  46. K. Maninis, S. Caelles, J. Pont-Tuset and L. V. Gool (2017) Deep extreme cut: from extreme points to object segmentation. CVPR. Cited by: §2, §4.2, Table 5, Table 7.
  47. D. Marcos, D. Tuia, B. Kellenberger, L. Zhang, M. Bai, R. Liao and R. Urtasun (2018) Learning deep structured active contours end-to-end. In CVPR, Cited by: §2.
  48. D. Neven, B. D. Brabandere, M. Proesmans and L. V. Gool (2019) Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, Cited by: §2, Table 1.
  49. A. Newell, Z. Huang and J. Deng (2017) Associative embedding: end-to-end learning for joint detection and grouping. In NIPS, Cited by: §2.
  50. F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, M. H. Gross and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. CVPR. Cited by: §4.1, §4.2.
  51. P. O. Pinheiro, R. Collobert and P. Dollár (2015) Learning to segment object candidates. In NIPS, Cited by: §2.
  52. P. O. Pinheiro, T. Lin, R. Collobert and P. Dollár (2016) Learning to refine object segments. In ECCV, Cited by: §2.
  53. J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques and J. Malik (2017) Multiscale combinatorial grouping for image segmentation and object proposal generation. PAMI. Cited by: §2.
  54. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, Cited by: §2.
  55. S. Rota Bulò, L. Porzi and P. Kontschieder (2018) In-place activated batchnorm for memory-optimized training of dnns. In CVPR, Cited by: §4.1.
  56. A. Sergeev and M. D. Balso (2018) Horovod: fast and easy distributed deep learning in tensorflow. CoRR. Cited by: §4.
  57. E. Shelhamer, J. Long and T. Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §4.2, Table 7.
  58. K. Sofiiuk, O. Barinova and A. Konushin (2019) AdaptIS: adaptive instance selection network. In ICCV, Cited by: §2, Table 1.
  59. S. Suzuki and K. Abe (1985) Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing. Cited by: §3.1.
  60. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin (2017) Attention is all you need. CoRR. Cited by: §3.3.
  61. Z. Wang, H. Ling, D. Acuna, A. Kar and S. Fidler (2019) Object instance annotation with deep extreme level set evolution. In CVPR, Cited by: §1, §1, §2, §4.1, §4.2, §4.2, §4.2, Table 5, Table 6.
  62. Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer and R. Urtasun (2019) UPSNet: A unified panoptic segmentation network. CoRR. Cited by: §1, §2, §3, §4.1, §4, Table 1, Table 2.
  63. W. Xu, H. Wang, F. Qi and C. Lu (2019) Explicit shape encoding for real-time instance segmentation. In ICCV, Cited by: §2.
  64. L. Yang, Y. Fan and N. Xu (2019) Video instance segmentation. In ICCV, Cited by: §2.
  65. Z. Zhang, S. Fidler and R. Urtasun (2016) Instance-level segmentation for autonomous driving with deep densely connected mrfs. In CVPR, Cited by: §2.
  66. Z. Zhang, A. G. Schwing, S. Fidler and R. Urtasun (2015) Monocular object instance segmentation and depth ordering with cnns. In ICCV, Cited by: §2.
  67. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba (2017) Scene parsing through ade20k dataset. In CVPR, Cited by: §1.