PolyTransform: Deep Polygon Transformer for Instance Segmentation
In this paper, we propose PolyTransform, a novel instance segmentation algorithm that produces precise, geometry-preserving masks by combining the strengths of prevailing segmentation approaches and modern polygon-based methods.
In particular, we first exploit a segmentation network to generate instance masks. We then convert the masks into a set of polygons that are then fed to a deforming network that transforms the polygons such that they better fit the object boundaries.
Our experiments on the challenging Cityscapes dataset show that our PolyTransform significantly improves the performance of the backbone instance segmentation network and ranks 1st on the Cityscapes test-set leaderboard.
We also show impressive gains in the interactive annotation setting.
The goal of instance segmentation methods is to identify all countable objects in the scene and produce a mask for each of them. With the help of instance segmentation, we can better understand the scene, design robotic systems capable of complex manipulation tasks, and improve the perception systems of self-driving cars. The task is, however, extremely challenging. In comparison to the traditional semantic segmentation task, which infers the category of each pixel in the image, instance segmentation also requires the system to have the extra notion of individual objects in order to associate each pixel with one of them. Dealing with the wide variability in the scale and appearance of objects, as well as occlusions and motion blur, makes this problem extremely difficult.
To address these issues, most modern instance segmentation methods employ a two-stage process [21, 62, 42], where object proposals are first created and then foreground-background segmentation is performed within each bounding box. With the help of the box, they can better handle situations (e.g., occlusions) where other methods often fail. While these approaches have produced impressive results and achieved state-of-the-art performance on multiple benchmarks (e.g., COCO, Cityscapes), their output is often over-smoothed, failing to capture fine-grained details.
An alternative line of work tackles the problem of interactive annotation [5, 2, 61, 39]. These techniques have been developed in the context of having an annotator in the loop, where a ground truth bounding box is provided. The goal of these works is to speed up annotation work by providing an initial polygon for annotators to correct as annotating from scratch is a very expensive process. In this line of work, methods exploit polygons to better capture the geometry of the object [5, 2, 39], instead of treating the problem as a pixel-wise labeling task. This results in more precise masks and potentially faster annotation speed as annotators are able to simply correct the polygons by moving the vertices. However, these approaches suffer in the presence of large occlusions or when the object is split into multiple disconnected components.
With these problems in mind, in this paper we develop a novel model, which we call PolyTransform, that tackles both the instance segmentation and interactive annotation problems. The idea behind our approach is that the segmentation masks generated by common segmentation approaches can be viewed as a starting point from which to compute a set of polygons, which can then be refined. We perform this refinement via a deforming network that predicts, for each polygon, the displacement of each vertex while taking into account the location of all vertices. By deforming each polygon, our model is able to better capture the local geometry of the object. Unlike [5, 2, 39], our model places no restriction on the number of polygons used to describe each object. This allows us to naturally handle cases where objects are split into parts due to occlusion.
We first demonstrate the effectiveness of our approach on the Cityscapes dataset. On the task of instance segmentation, our model improves the initialization by 3.0 AP and 10.3 points in the boundary metric on the validation set. Importantly, we achieve 1st place on the test set leaderboard, beating the current state of the art by 3.7 AP. We further evaluate our model on a new self-driving dataset, where our model improves the initialization by 2.1 AP and 5.6 points in the boundary metric. In the context of interactive annotation, we outperform the previous state of the art by 2.0% in the boundary metric. Finally, we conduct an experiment where crowd-sourced labelers annotate object instances using the polygon output of our model. We show that this speeds up annotation time by 35%!
2 Related Work
In this section, we briefly review the relevant literature on instance segmentation and annotation in the loop.
Proposal-based Instance Segmentation:
Most modern instance segmentation models adopt a two-stage pipeline. First, an over-complete set of segment proposals is identified, and then a voting process is exploited to determine which ones to keep [7, 14]. As the explicit feature extraction process is time-consuming [19, 20], Dai \etal [13, 12] integrated feature pooling into the neural network to improve efficiency. While the speed is drastically boosted compared to previous approaches, it is still relatively slow, as these approaches are limited by the traditional detection-based pipeline. With this problem in mind, researchers have looked into directly generating instance masks in the network and treating them as proposals [51, 52]. Based on this idea, Mask R-CNN  introduced a joint approach to both mask prediction and recognition. It builds upon Faster R-CNN  by adding an extra parallel header to predict the object's mask, in addition to the existing branch for bounding box recognition. Liu \etal propose a path aggregation network  to improve the information flow in Mask R-CNN and further improve performance. More recently, Chen \etal interleave bounding box regression, mask regression and semantic segmentation to boost instance segmentation performance. Xu \etal fit Chebyshev polynomials to instances by having a network learn the coefficients, which allows for real-time instance segmentation. Huang \etal optimize the scoring of the bounding boxes by predicting the IoU for each mask rather than only a classification score. Kuo \etal start with bounding boxes and refine them using shape priors. Xiong \etal and Kirillov \etal extend Mask R-CNN to the task of panoptic segmentation. Yang \etal extend Mask R-CNN to the task of video instance segmentation.
Proposal-free Instance Segmentation:
This line of research aims at segmenting the instances in the scene without an explicit object proposal. Zhang \etal [66, 65] first predict instance labels within extracted multi-scale patches and then exploit a dense Conditional Random Field  to obtain a consistent labeling of the full image. While achieving impressive results, their approach is computationally intensive. Bai and Urtasun  exploited a deep network to predict the energy of the watershed transform such that each basin corresponds to an object instance. With one simple cut, they can obtain the instance masks of the whole image without any post-processing. Similarly,  exploits boundary prediction to separate the instances within the same semantic category. Despite being much faster, these methods suffer when dealing with far or small objects whose boundaries are ambiguous. To address this issue, Liu \etal present a sequential grouping approach that employs neural networks to gradually compose objects from simpler elements. It can robustly handle situations where a single instance is split into multiple parts. Newell and Deng  implicitly encode the grouping concept into neural networks by having the model predict both a semantic class and a tag for each pixel. The tags are one-dimensional embeddings which associate pixels with one another. Kendall \etal propose a method to assign pixels to objects by having each pixel point to its object's center, so that it can be grouped. Sofiiuk \etal use a point proposal network to generate points where instances may be; these are then processed by a CNN that outputs an instance mask for each location. Neven \etal propose a new clustering loss that pulls the spatial embeddings of pixels belonging to the same instance together, achieving real-time instance segmentation with high accuracy.
Gao \etal propose a single-shot instance segmentation network that outputs a pixel-pair affinity pyramid to compute whether two pixels belong to the same instance; they then combine this with a predicted semantic segmentation to output a single instance segmentation map.
Annotation in the Loop:
The task of interactive annotation can also be posed as finding the polygons or curves that best fit the object boundaries. In fact, the concept of deforming a curve to fit the object contour dates back to the 80s, when the active contour model was first introduced. Since then, variants of ACM [10, 47, 9] have been proposed to better capture the shape. Recently, the idea of exploiting polygons to represent an instance has been explored in the context of human-in-the-loop segmentation [5, 2]. Castrejón \etal adopted an RNN to sequentially predict the vertices of the polygon. Acuna \etal extended this by incorporating graph neural networks and increasing the image resolution. While these methods demonstrated promising results on public benchmarks, they require a ground truth bounding box as input. Ling \etal and Dong \etal exploited splines as an alternative parameterization. Instead of drawing the whole polygon/curve from scratch, they start with a circle and deform it. Wang \etal tackled this problem with implicit curves using level sets; however, because the outputs are not polygons, an annotator cannot easily correct them. Maninis \etal use extreme boundary points as inputs rather than bounding boxes, and Majumder \etal use user clicks to generate content-aware guidance maps; all of these help the networks learn stronger cues to generate more accurate segmentations. However, because the outputs are pixel-wise masks, they cannot easily be edited by an annotator. Acuna \etal develop an approach that can be used to refine noisy annotations by jointly reasoning about the object boundaries with a CNN and a level set formulation. In the domain of offline mapping, several papers from Homayounfar \etal and Liang \etal [23, 35, 24, 36] have tackled the problem of automatically annotating crosswalks, road boundaries and lanes by predicting structured outputs such as polylines.
3 PolyTransform
Our aim is to design a robust segmentation model that is capable of producing precise, geometry-preserving masks for each individual object. Towards this goal, we develop PolyTransform, a novel deep architecture that combines prevailing segmentation approaches [21, 62] with modern polygon-based methods [5, 2]. By exploiting the best of both worlds, we are able to generate high quality segmentation masks under various challenging scenarios.
In this section, we start by describing the backbone architecture for feature extraction and polygon initialization. Next, we present a novel deforming network that warps the initial polygon to better capture the local geometry of the object. An overview of our approach is shown in Figure 1.
3.1 Instance Initialization
The goal of our instance initialization module is to provide a good polygon initialization for each individual object. To this end, we first exploit a model to generate a mask for each instance in the scene. Our experiments show that our approach can significantly improve performance for a wide variety of segmentation models. If the segmentation model outputs proposal boxes, we use them to crop the image; otherwise, we fit a bounding box to the mask. The cropped image is then resized to a square and fed into a feature network (described in Sec. 3.2) to obtain a set of reliable deep features. In practice, we resize the cropped image to . To initialize the polygon, we use the border following algorithm of  to extract the contours from the predicted mask. We obtain the initial set of vertices by placing a vertex at every px distance along the contour. Empirically, we find such dense vertex interpolation provides a good balance between performance and memory consumption.
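To make the vertex-placement step concrete, here is a minimal pure-Python sketch of resampling a closed contour at a fixed pixel spacing. `resample_contour` is a hypothetical helper name; in practice the ordered contour pixels would come from a border-following routine (e.g., OpenCV's `findContours`), and the spacing would match the paper's fixed px distance.

```python
import math

def resample_contour(points, spacing):
    """Resample a closed contour at (approximately) fixed arc-length spacing.

    `points` is an ordered list of (x, y) contour pixels; `spacing` is the
    target distance in pixels between consecutive polygon vertices.
    """
    # Cumulative arc length along the closed contour (including the
    # closing segment back to the first point).
    n = len(points)
    dists = [0.0]
    for i in range(1, n + 1):
        (x0, y0), (x1, y1) = points[i - 1], points[i % n]
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]
    num_vertices = max(3, int(total // spacing))  # keep at least a triangle
    targets = [k * total / num_vertices for k in range(num_vertices)]
    vertices, j = [], 0
    for t in targets:
        while dists[j + 1] < t:
            j += 1
        # Linear interpolation between contour points j and j+1.
        seg = dists[j + 1] - dists[j]
        a = 0.0 if seg == 0 else (t - dists[j]) / seg
        (x0, y0), (x1, y1) = points[j], points[(j + 1) % n]
        vertices.append((x0 + a * (x1 - x0), y0 + a * (y1 - y0)))
    return vertices
```

For example, resampling a 10x10 square contour at a spacing of 5px yields 8 evenly spaced vertices, two per side.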
3.2 Feature Extraction Network
The goal of our feature extraction network is to learn strong object boundary features. This is essential as we want our polygons to capture high curvature and complex shapes. As such, we employ a feature pyramid network (FPN)  to learn and make use of multi-scale features. This network takes as input the crop obtained from the instance initialization stage and outputs a set of features at different pyramid levels. Our backbone can be seen in Figure 2.
3.3 Deforming Network
We have so far computed a polygon initialization and deep features of the FPN from the image crop. The next step is to build a feature embedding for all vertices and learn a deforming model that can effectively predict the offset for each vertex so that the polygon snaps better to the object boundaries.
We build our vertex representation upon the multi-scale features extracted from the backbone FPN network of the previous section. In particular, we take the , , , and feature maps and apply two lateral convolutional layers to each of them in order to reduce the number of feature channels from to each. Since the feature maps are , , , and of the original scale, we bilinearly upsample them back to the original size and concatenate them to form a feature tensor. To provide the network with a notion of where each vertex is, we further append a 2-channel CoordConv layer . The channels represent the x and y coordinates with respect to the frame of the crop. Finally, we exploit the bilinear interpolation operation of the spatial transformer network  to sample features at the vertex coordinates of the initial polygon from the feature tensor. We denote this embedding as .
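The embedding construction above can be illustrated with a small pure-Python sketch: bilinearly sample a feature map at each (continuous) vertex coordinate and append normalized CoordConv-style coordinate channels. The helper names are hypothetical; a real implementation would operate on multi-channel tensors and use the spatial transformer's batched bilinear sampler.

```python
import math

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (list of rows) at a continuous (x, y)."""
    h, w = len(feature), len(feature[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bot = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bot

def vertex_embeddings(feature_maps, vertices, crop_size):
    """One embedding per vertex: sampled multi-scale features plus
    normalized (x, y) CoordConv-style coordinate channels."""
    embeddings = []
    for (x, y) in vertices:
        feats = [bilinear_sample(f, x, y) for f in feature_maps]
        feats += [x / crop_size, y / crop_size]  # coordinate channels
        embeddings.append(feats)
    return embeddings
```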
When moving a vertex in a polygon, the two attached edges move with it. The movement of these edges depends on the position of the neighboring vertices. Each vertex must therefore be aware of its neighbors and needs a way to communicate with them in order to reduce unstable and overlapping behavior. In this work, we exploit the self-attending Transformer network  to model such intricate dependencies. We leverage the attention mechanism to propagate information across vertices and improve the predicted offsets.
More formally, given the vertex embeddings, we first employ three feed-forward neural networks to transform them into Q, K, V, which stand for Query, Key and Value. We then compute the weightings between vertices by taking a softmax over the scaled dot product QK^T. Finally, the weightings are multiplied with the values to propagate these dependencies across all vertices. Such an attention mechanism can be written as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$
where $d_k$ is the dimension of the queries and keys, serving as a scaling factor to prevent extremely small gradients. We repeat the same operation a fixed number of times, 6 in our experiments. After the last Transformer layer, we feed the output to another feed-forward network which predicts offsets for the vertices. We add the offsets to the polygon initialization to transform the shape of the polygon.
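The attention operation can be sketched as a small self-contained routine over N×d lists, with a numerically stable softmax; a real implementation would be batched tensor code, but the logic is the same.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V,
    where Q, K, V are N x d lists of per-vertex features."""
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Similarity of query i to every key, scaled by sqrt(d).
        scores = [sum(Q[i][c] * K[j][c] for c in range(d)) / math.sqrt(d)
                  for j in range(n)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of values propagates information across vertices.
        out.append([sum(weights[j] * V[j][c] for j in range(n))
                    for c in range(d)])
    return out
```

With all-zero queries, every vertex attends uniformly, so each output row is the mean of the value rows, which is a convenient sanity check.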
We train the deforming network and the feature extraction network in an end-to-end manner. Specifically, we minimize a weighted sum of two losses. The first penalizes the model when the vertices deviate from the ground truth. The second regularizes the edges of the polygon to prevent overlap and unstable movement of the vertices.
Polygon Transforming Loss:
We make use of the Chamfer distance loss to move the vertices of our predicted polygon $P$ closer to the ground truth polygon $Q$. The Chamfer distance loss is defined as:

$$L_c(P, Q) = \sum_{p \in P} \min_{q \in Q} \lVert p - q \rVert_2 + \sum_{q \in Q} \min_{p \in P} \lVert p - q \rVert_2$$

where $p$ and $q$ range over the rasterized edge pixels of the polygons $P$ and $Q$. The first term of the loss penalizes the model when $P$ is far from $Q$, and the second term penalizes the model when $Q$ is far from $P$.
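The Chamfer loss can be sketched directly from its definition, here as pure Python over small point sets (in training this would be computed on rasterized edge pixels with tensor operations):

```python
import math

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between two point sets: each point is
    matched to its nearest neighbour in the other set and the distances
    are summed in both directions."""
    def nearest(p, S):
        return min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in S)
    return sum(nearest(p, Q) for p in P) + sum(nearest(q, P) for q in Q)
```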
In order to prevent unstable movement of the vertices, we add a standard deviation loss on the lengths of the edges between the vertices. Empirically, we found that without this term the vertices can suddenly shift a large distance, incurring a large loss and causing the gradients to blow up. We define the standard deviation loss as:

$$L_s = \sqrt{\frac{1}{|E|} \sum_{e \in E} \left( \lVert e \rVert - \bar{e} \right)^2}$$

where $E$ is the set of polygon edges and $\bar{e}$ denotes the mean length of the edges.
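Under the interpretation above (population standard deviation of the closed polygon's edge lengths), the regularizer can be sketched as:

```python
import math

def edge_std_loss(vertices):
    """Standard deviation of the edge lengths of a closed polygon;
    zero when all edges are equally long."""
    n = len(vertices)
    lengths = [math.hypot(vertices[(i + 1) % n][0] - vertices[i][0],
                          vertices[(i + 1) % n][1] - vertices[i][1])
               for i in range(n)]
    mean = sum(lengths) / n
    return math.sqrt(sum((l - mean) ** 2 for l in lengths) / n)
```

A regular polygon incurs zero penalty, while a polygon whose edges differ in length is penalized, which discourages any single vertex from suddenly shifting a large distance.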
4 Experiments
We evaluate our model in both the instance segmentation and interactive annotation settings.
We train our model on 8 Titan 1080 Ti GPUs using the distributed training framework Horovod  for 1 day. We use a batch size of 1, ADAM  with a 1e-4 learning rate and 1e-4 weight decay. We augment our data by randomly flipping the images horizontally. During training, we only train with instances whose proposed box has an Intersection over Union (IoU) overlap of over 0.5 with the ground truth (GT) boxes. We train with instances produced from both proposed boxes and GT boxes to further augment the data. For our instance segmentation experiments, we augment the box sizes by to during training and test with a box expansion. For our interactive annotation experiments, we train and test on boxes with an expansion of 5px on each side; we only compute a Chamfer loss if the predicted vertex is at least 2px from the ground truth polygon. When weighting the losses, we found that keeping the loss values approximately balanced produced the best results. For our PolyTransform FPN, we use ResNet50  as the backbone with pretrained weights from UPSNet  on Cityscapes. For our deforming network we do not use pretrained weights.
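The IoU-based filtering of training instances can be sketched as follows. Boxes are taken as (x1, y1, x2, y2) tuples, and `keep_for_training` is a hypothetical helper name for the "over 0.5 IoU with a GT box" rule described above.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_for_training(proposals, gt_boxes, thresh=0.5):
    """Keep proposals whose best IoU with any GT box exceeds the threshold."""
    return [p for p in proposals
            if any(box_iou(p, g) > thresh for g in gt_boxes)]
```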
[Table 1: comparison to the state of the art on Cityscapes (rows include Kendall et al., Arnab et al., Mask R-CNN with fine and fine+COCO training, and Neven et al.); numeric columns were lost in extraction.]
[Figure 3: input image, our instance segmentation, and GT instance segmentation.]
4.1 Instance Segmentation
We use Cityscapes , which has high quality pixel-level instance segmentation annotations. The images were collected in 27 cities and are split into 2975, 500 and 1525 images for train/val/test. There are 8 instance classes: bicycle, bus, person, train, truck, motorcycle, car and rider. We also report results on a new dataset we collected, consisting of 10235/1139/1186 images for the train/val/test split, annotated with 10 classes: car, truck, bus, train, person, rider, bicycle with rider, bicycle, motorcycle with rider and motorcycle. Each image is of size .
For our instance segmentation results, we report the average precision (AP and AP50) for the predicted mask. Here, AP is computed at 10 IoU overlap thresholds ranging from 0.5 to 0.95 in steps of 0.05, following . AP50 is the AP at an overlap of 50%. We also introduce a new metric that focuses on boundaries. In particular, we use a metric similar to [61, 50], where a precision, recall and F1 score is computed for each mask; a predicted boundary pixel is counted as correct if it is within a certain distance threshold of the ground truth. We use a threshold of 1px and only compute the metric for true positives. We use the same 10 IoU overlap thresholds, ranging from 0.5 to 0.95 in steps of 0.05, to determine the true positives. Once we compute the F1 score for all classes and thresholds, we average over all examples to get AF.
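The per-mask boundary F1 described above can be sketched in pure Python over sets of boundary pixels; `thresh` matches the 1px threshold in the text, and a real implementation would use distance transforms rather than brute-force nearest-neighbour search.

```python
import math

def boundary_f1(pred, gt, thresh=1.0):
    """Boundary F1 between predicted and ground-truth boundary pixels:
    a pixel is matched if it lies within `thresh` px of the other set."""
    def frac_matched(A, B):
        hits = sum(1 for a in A
                   if min(math.hypot(a[0] - b[0], a[1] - b[1])
                          for b in B) <= thresh)
        return hits / len(A)
    precision = frac_matched(pred, gt)   # predicted pixels near the GT
    recall = frac_matched(gt, pred)      # GT pixels near the prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```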
Since our method improves upon a baseline initialization, we use a strong instance initialization to show that we can improve results even when they are already very strong. To do this, we take the publicly available UPSNet , replace its backbone with WideResNet38  and add all the elements of PANet  except for synchronized batch normalization (we use group normalization instead). We further pretrain on COCO and use deformable convolutions (DCN)  in the backbone.
Comparison to SOTA:
We compare our model with the state of the art on the val and test sets of Cityscapes in Table 1. We outperform all baselines in every metric. Our model achieves a new state-of-the-art test result of 40.1 AP, significantly outperforming PANet by 3.7 and 2.8 points in AP and AP50 respectively. It also ranks 1st on the official Cityscapes leaderboard. We report the results on our new dataset in Table 2, where we achieve the strongest test AP. We improve over PANet by 6.2 points in AP and over UPSNet by 3.8 points in AP.
Robustness to Initialization:
We also report the improvement over different instance segmentation networks used as initialization in Table 3. We see significant and consistent improvements in val AP across all models. When we train our model on top of the DWT  instances, we see an improvement of , points in AP and AF. We also train on top of the UPSNet results from the original paper, along with UPSNet with WRes38+PANet as a way to reproduce the current SOTA val AP of PANet. We show an improvement of , points in AP and AF. Finally, we improve on our best initialization by , points in AP and AF. As we can see, our boundary metric shows a very consistent gain in AF across all models, suggesting that our model significantly improves the instances at the boundary. We notice that a large gain in AP (WRes38+PANet to WRes38+PANet+DCN) does not necessarily translate to a large gain in AF; however, our model always provides a significant increase in this metric. We also report the validation AP improvement over different instance segmentation networks in Table 4 for our new dataset. We improve on Mask R-CNN  by , points in AP and AF. For the different UPSNet models, we improve by 1.4-2.2 AP points. Once again, our model shows a consistent and strong improvement over the instance segmentation initializations, with a very consistent gain in AF across all models.
We conduct an experiment where we ask crowd-sourced labelers to annotate 150 images from our new dataset with instances larger than 24x24px for vehicles and 12x14px for pedestrians/riders. We performed a control experiment in which the instances were annotated completely from scratch (without our method) and a parallel experiment in which we used our model to output the instances for the annotators to fix into the final annotations. In the fully manual experiment, it took on average 60.3 minutes to annotate each image. When the annotators were given the PolyTransform output to annotate on top of, it took on average 39.4 minutes per image, reducing the time required to annotate the images by 35%. This resulted in significant cost savings.
We show qualitative results of our model on the validation set in Figure 3. In our instance segmentation outputs we see that in many cases our model is able to handle occlusion. For example, in row 3, we see that our model is able to capture the feet of the purple and blue pedestrians despite their feet being occluded from the body. We also show qualitative results on our new dataset in Figure 4. We see that our model is able to capture precise boundaries, allowing it to capture difficult shapes such as car mirrors and pedestrians.
4.2 Interactive Annotation
The goal is to annotate an object with a polygon given its ground truth bounding box. The idea is that the annotator provides a ground truth box and our model works on top of it to output a polygon representation of the object instance.
We follow  and split the Cityscapes dataset such that the original validation set is the test set and two cities from the training (Weimar and Zurich) form the validation set. In , the authors further split this dataset into two settings. The first is Cityscapes Hard, here the ground truth bounding box is enlarged to form a square and then the image is cropped. The second is Cityscapes Stretch, where the ground truth bounding box along with the image is stretched to a square and then cropped.
To evaluate our model for this task, we report the intersection over union (IoU) on a per-instance basis and average for each class. Then, following , this is averaged across all classes. We also report the boundary metric of [61, 50], which computes the F measure along the contour for a given threshold. The thresholds used are 1 and 2 pixels, as Cityscapes contains many small instances.
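The per-instance IoU can be sketched over binary masks represented as sets of foreground pixel coordinates (a minimal illustration; real evaluation code operates on dense mask arrays):

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks, each given as a
    collection of (x, y) foreground pixel coordinates."""
    a, b = set(mask_a), set(mask_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0
```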
Comparison to SOTA:
Tables 5 and 6 show the results on the test set for Cityscapes Stretch and Hard respectively. For Cityscapes Stretch, we see that our model significantly outperforms the SOTA in the boundary metric, improving it by up to 2%. Unlike the Deep Level Sets  method, which outputs a pixel-wise mask, our method outputs a polygon, making it amenable to the annotator-in-the-loop setting: the annotator can simply move the vertices. For Cityscapes Hard, our model outperforms the SOTA by 4.9%, 8.3% and 7.2% in mean IoU, F at 1px and F at 2px respectively.
Robustness to Initialization:
We also report improvements over different segmentation initializations in Table 7; the results are on the test set. Our models are trained on various backbone initialization models (FCN  and DeepLabV3  with and without pretraining on COCO ). Our model consistently and significantly improves the boundary metrics at 1 and 2 pixels by up to 1.5%, and improves the IoU by 0.1-0.2%. We also note that the difference in mean IoU between FCN and DeepLabV3 is very small (at most 0.5%) despite DeepLabV3 being a much stronger segmentation model. We argue that the margin for mean IoU improvement is very small on this dataset.
5 Conclusion
In this paper, we presented PolyTransform, a novel deep architecture that combines the strengths of prevailing segmentation approaches and modern polygon-based methods. We first exploit a segmentation network to generate a mask for each individual object. Each instance mask is then converted into a set of polygons that serve as our initialization. Finally, a deforming network is applied to warp the polygons to better fit the object boundaries. We evaluate the effectiveness of our model on the Cityscapes dataset as well as a novel dataset that we collected. Experiments show that our approach produces precise, geometry-preserving instance segmentations that significantly outperform the backbone model. Compared to the instance segmentation initialization, we increase the validation AP and boundary metric by up to 3.0 and 10.3 points, allowing us to achieve 1st place on the Cityscapes leaderboard. We also show that our model speeds up annotation by 35%. Compared to previous work on annotation-in-the-loop , we outperform the boundary metric by 2.0%. Importantly, our PolyTransform generalizes across various instance segmentation networks.
- (2019) Devil is in the edges: learning semantic boundaries from noisy annotations. In CVPR.
- (2018) Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR.
- (2017) Pixelwise instance segmentation with a dynamically instantiated network. In CVPR.
- (2017) Deep watershed transform for instance segmentation. In CVPR.
- (2017) Annotating object instances with a polygon-rnn. In CVPR.
- (2019) Hybrid task cascade for instance segmentation. In CVPR.
- (2018) MaskLab: instance segmentation by refining object detection with semantic and direction features. In CVPR.
- (2017) Rethinking atrous convolution for semantic image segmentation. CoRR.
- (2019) DARNet: deep active ray network for building segmentation. In CVPR.
- (1991) On active contour models and balloons. CVGIP.
- () The cityscapes dataset.
- (2016) Instance-sensitive fully convolutional networks. In ECCV.
- (2015) Convolutional feature masking for joint object and stuff segmentation. In CVPR.
- (2016) Instance-aware semantic segmentation via multi-task network cascades. In CVPR.
- (2017) Deformable convolutional networks. In ICCV.
- (2019) Automatic annotation and segmentation of object instances with deep active curve network. IEEE Access.
- (2019) See, feel, act: hierarchical learning for complex manipulation skills with multisensory fusion. Science Robotics.
- (2019) SSAP: single-shot instance segmentation with affinity pyramid. In ICCV.
- (2014) Simultaneous detection and segmentation. In ECCV.
- (2015) Hypercolumns for object segmentation and fine-grained localization. In CVPR.
- (2017) Mask R-CNN. CoRR.
- (2015) Deep residual learning for image recognition. CoRR.
- (2018) Hierarchical recurrent attention networks for structured online maps. In CVPR.
- (2019) DAGMapper: learning to map by discovering lane topology. In ICCV.
- (2019) Mask Scoring R-CNN. In CVPR.
- (2015) Spatial transformer networks. In NIPS.
- (1988) Snakes: active contour models. IJCV.
- (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR.
- (2018) Instance segmentation and object detection with bounding shape masks. CoRR.
- (2014) Adam: a method for stochastic optimization. CoRR.
- (2019) Panoptic segmentation. In CVPR.
- (2017) InstanceCut: from edges to instances with multicut. In CVPR.
- (2011) Efficient inference in fully connected CRFs with gaussian edge potentials. In NIPS.
- (2019) ShapeMask: learning to segment novel objects by refining shape priors. In ICCV.
- (2019) Convolutional recurrent network for road boundary extraction. In CVPR.
- (2018) End-to-end deep structured models for drawing crosswalks. In ECCV.
- (2016) Feature pyramid networks for object detection. CoRR.
- (2014) Microsoft COCO: common objects in context. In ECCV.
- (2019) Fast interactive object annotation with curve-gcn. In CVPR.
- (2018) An intriguing failing of convolutional neural networks and the CoordConv solution. CoRR.
- (2017) SGN: sequential grouping networks for instance segmentation. In ICCV.
- (2018) Path aggregation network for instance segmentation. In CVPR.
- (2018) Affinity derivation and graph merge for instance segmentation. In ECCV.
- (2019) Deep rigid instance scene flow. In CVPR.
- (2019) Content-aware multi-level guidance for interactive instance segmentation. In CVPR.
- (2017) Deep extreme cut: from extreme points to object segmentation. In CVPR.
- (2018) Learning deep structured active contours end-to-end. In CVPR.
- (2019) Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR.
- (2017) Associative embedding: end-to-end learning for joint detection and grouping. In NIPS.
- (2016) A benchmark dataset and evaluation methodology for video object segmentation. In CVPR.
- (2015) Learning to segment object candidates. In NIPS.
- (2016) Learning to refine object segments. In ECCV.
- (2017) Multiscale combinatorial grouping for image segmentation and object proposal generation. PAMI.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
- (2018) In-place activated batchnorm for memory-optimized training of DNNs. In CVPR.
- (2018) Horovod: fast and easy distributed deep learning in TensorFlow. CoRR.
- (2017) Fully convolutional networks for semantic segmentation. PAMI.
- (2019) AdaptIS: adaptive instance selection network. In ICCV.
- (1985) Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing.
- (2017) Attention is all you need. CoRR.
- (2019) Object instance annotation with deep extreme level set evolution. In CVPR.
- (2019) UPSNet: a unified panoptic segmentation network. CoRR.
- (2019) Explicit shape encoding for real-time instance segmentation. In ICCV.
- (2019) Video instance segmentation. In ICCV.
- (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In CVPR.
- (2015) Monocular object instance segmentation and depth ordering with CNNs. In ICCV.
- (2017) Scene parsing through ADE20K dataset. In CVPR.