Panoptic Segmentation

Alexander Kirillov  Kaiming He  Ross Girshick  Carsten Rother  Piotr Dollár
Facebook AI Research (FAIR)   HCI/IWR, Heidelberg University, Germany
Abstract

We propose and study a novel ‘Panoptic Segmentation’ (PS) task. Panoptic segmentation unifies the traditionally distinct tasks of instance segmentation (detect and segment each object instance) and semantic segmentation (assign a class label to each pixel). The unification is natural and presents novel algorithmic challenges not present in either instance or semantic segmentation when studied in isolation. To measure performance on the task, we introduce a panoptic quality (PQ) measure, and show that it is simple and interpretable. Using PQ, we study human performance on three existing datasets that have the necessary annotations for PS, which helps us better understand the task and metric. We also propose a basic algorithmic approach to combine instance and semantic segmentation outputs into panoptic outputs and compare this to human performance. PS can serve as a foundation for future challenges in segmentation and visual recognition. Our goal is to drive research in novel directions by inviting the community to explore the proposed panoptic segmentation task.

1 Introduction

Figure 1: For a given (a) image, we show ground truth for the tasks of: (b) semantic segmentation (per-pixel class labels), (c) instance segmentation (per-object mask and class label), and (d) the proposed Panoptic Segmentation (PS) task (per-pixel class+instance labels). PS generalizes both semantic and instance segmentation and requires identifying and delineating every visible object and region in an image. We expect this unified segmentation task to present novel challenges and enable innovative new methods.

In the early days of computer vision, things – countable objects such as people, animals, tools – received the dominant share of attention. Questioning the wisdom of this trend, Adelson [1] elevated the importance of studying systems that recognize stuff – amorphous regions of similar texture or material such as grass, sky, road. This dichotomy between stuff and things persists to this day, reflected in both the division of visual recognition tasks and in the specialized algorithms developed for stuff and thing tasks.

Studying stuff is most commonly formulated as a task known as semantic segmentation, see Figure 1. As stuff is amorphous and uncountable, this task is defined as simply assigning a category label to each pixel in an image (note that semantic segmentation treats thing categories as stuff). In contrast, studying things is typically formulated as the task of object detection or instance segmentation, where the goal is to detect each object and delineate it with a bounding box or segmentation mask, respectively, see Figure 1. While seemingly related, the datasets, details, and metrics for these two visual recognition tasks vary substantially.

The schism between semantic and instance segmentation has led to a parallel rift in the methods for these tasks. Stuff classifiers are usually built on fully convolutional nets [26] with dilations [41, 5] while object detectors often use object proposals [15] and are region-based [33, 14]. Overall algorithmic progress on these tasks has been incredible in the past decade, yet, something important may be overlooked by focussing on these tasks in isolation.

In this work we ask: Can there be a reconciliation between stuff and things? Is there a simple problem formulation that gracefully encompasses both tasks? And what would a unified visual recognition system look like?

With these questions in mind, we propose a new task that encompasses both stuff and things. We refer to the resulting task as Panoptic Segmentation (PS). The definition of panoptic is “including everything visible in one view”; in our context, panoptic refers to a unified, global view of segmentation. The task formulation of PS is deceptively simple: each pixel of an image must be assigned a semantic label and an instance id. Pixels with the same label and id belong to the same object; for stuff labels the instance id is ignored. Both the ground truth and machine prediction must have this form. See Figure 1 for a visualization.

Panoptic segmentation is a generalization of both semantic and instance segmentation but introduces new algorithmic challenges. Unlike semantic segmentation, PS requires differentiating individual object instances; this poses a challenge for fully convolutional nets. Unlike instance segmentation, in PS object segments must be non-overlapping; this presents a challenge for region-based methods that operate on each object independently. Moreover, this task requires simultaneously recognizing both stuff and things. Designing a clean, end-to-end system for PS is an open problem that will require exploring innovative algorithmic ideas.

Our new panoptic segmentation task requires a new metric. We strive to make our metric complete, interpretable, and simple. Perhaps somewhat surprisingly, given the seeming complexity of our task, a natural metric that satisfies these properties exists. We define the Panoptic Quality (PQ) metric in §4 and show it can be decomposed into two interpretable terms: segmentation quality (SQ) and detection quality (DQ), allowing for a further breakdown of accuracy.

As both the ground truth and algorithm output for PS must take on the same form, we can perform a detailed study of human performance on panoptic segmentation. This allows us to understand the PQ metric in more detail, including detailed breakdowns of detection vs. segmentation and stuff vs. things performance. Moreover, measuring human PQ helps ground our understanding of machine performance. This is important as it will allow us to monitor performance saturations on various datasets for PS.

Finally we perform an initial study of machine performance for PS. To do so, we define a simple and likely suboptimal heuristic that combines the output of two independent systems for semantic and instance segmentation via a series of post-processing steps that merges their outputs (in essence, a sophisticated form of non-maximum suppression). Our heuristic establishes a baseline for PS and gives us insights as to the main algorithmic challenges it presents.

We study both human and machine performance on three popular segmentation datasets that have both stuff and things annotations. This includes the Cityscapes [6], ADE20k [44], and Mapillary Vistas [31] datasets. For each of these datasets, we obtained results of state-of-the-art methods directly from the challenge organizers. In the future we will extend our analysis to COCO [22] on which stuff is being annotated [4]. Together our results on these datasets form a solid foundation for the study of both human and machine performance on panoptic segmentation.

2 Related Work

Novel datasets and tasks have played a key role throughout the history of computer vision. They help catalyze progress and enable breakthroughs in our field, and just as importantly, they help us measure and recognize the progress our community is making. For example, ImageNet [34] helped drive the recent popularization of deep learning techniques for visual recognition [20] and exemplifies the potential transformational power that datasets and tasks can have. Our goals for introducing the panoptic segmentation task are similar: to challenge our community, to drive research in novel directions, and to enable both expected and unexpected innovation. We review related tasks next.

Object detection tasks.

Early work on face detection using ad-hoc datasets (e.g., [36, 38]) helped popularize bounding-box object detection. Later, pedestrian detection datasets [8] helped drive progress in the field. The PASCAL VOC dataset [9] upgraded the task to a more diverse set of general object categories on more challenging images. More recently, the COCO dataset [22] pushed detection towards the task of instance segmentation. By framing this task and providing a high-quality dataset, COCO helped define a new and exciting research direction and led to many recent breakthroughs in instance segmentation [32, 21, 14]. Our general goals for panoptic segmentation are similar.

Semantic segmentation tasks.

Semantic segmentation datasets have a rich history [35, 23, 9] and helped drive key innovations (e.g., fully convolutional nets [26] were developed using [23, 9]). These datasets contain both stuff and thing categories, but don’t distinguish individual object instances. Recently the field has seen numerous new segmentation datasets including Cityscapes [6], ADE20k [44], and Mapillary Vistas [31]. These datasets actually support both semantic and instance segmentation, and each has opted to have a separate track for the two tasks. Importantly, they contain all of the information necessary for PS. In other words, the panoptic segmentation task can be bootstrapped on these datasets without any new data collection.

Amodal segmentation task.

In [45] objects are annotated amodally: the full extent of each region is marked, not just the visible. Our work focuses on segmentation of all visible regions, but amodal panoptic segmentation is also possible.

Multitask learning.

With the success of deep learning for many visual recognition tasks, there has been substantial interest in multitask learning approaches that have broad competence and can solve multiple diverse vision problems in a single framework [19, 28, 30]. E.g., UberNet [19] solves multiple low to high-level visual tasks, including object detection and semantic segmentation, using a single network. While there is significant interest in this area, we emphasize that panoptic segmentation is not a multitask problem but rather a single, unified view of image segmentation.

3 Panoptic Segmentation

Definition.

The panoptic segmentation (PS) task is simple to define. Given a predetermined set of $L$ semantic categories encoded by $\mathcal{L} := \{0, \ldots, L-1\}$, the task requires a panoptic segmentation algorithm to map each pixel $i$ of an image to a pair $(l_i, z_i) \in \mathcal{L} \times \mathbb{N}$, where $l_i$ represents the semantic class of pixel $i$ and $z_i$ represents its instance id. Instances, not pixels, are the atomic units of output produced by the algorithm that will be used in a matching process for evaluation (described later). Ground truth annotations for an image are encoded in an identical manner.

Stuff and thing labels.

The semantic label set $\mathcal{L}$ consists of subsets $\mathcal{L}^{St}$ and $\mathcal{L}^{Th}$, such that $\mathcal{L} = \mathcal{L}^{St} \cup \mathcal{L}^{Th}$ and $\mathcal{L}^{St} \cap \mathcal{L}^{Th} = \emptyset$. These subsets correspond to stuff labels and thing labels, respectively. When a pixel is labeled with $l_i \in \mathcal{L}^{St}$, its corresponding instance id $z_i$ is irrelevant. That is, for stuff categories all pixels belong to the same instance (e.g., the same sky). Otherwise, all pixels with the same $(l_i, z_i)$ assignment, where $l_i \in \mathcal{L}^{Th}$, belong to the same instance (e.g., the same car), and conversely, all pixels belonging to a single instance must have the same $(l_i, z_i)$. The selection of which categories are stuff vs. things is a design choice left to the creator of the dataset, just as in previous datasets.
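The encoding above can be illustrated with a minimal sketch. The label ids, the `STUFF`/`THINGS` split, the dictionary-based image, and the helper name `group_segments` are all assumptions made for illustration; they are not part of the task definition.

```python
from collections import defaultdict

# Hypothetical label set: ids 0-1 are stuff, ids 2-3 are things.
STUFF = {0, 1}   # e.g. sky, road
THINGS = {2, 3}  # e.g. car, person


def group_segments(pixels):
    """Group per-pixel (l_i, z_i) pairs into segments.

    `pixels` maps a pixel index i to its pair (l_i, z_i). The result maps a
    segment key to the set of pixel indices it covers. For stuff labels the
    instance id is ignored, so the key is (l_i, None).
    """
    segments = defaultdict(set)
    for i, (label, instance_id) in pixels.items():
        key = (label, None) if label in STUFF else (label, instance_id)
        segments[key].add(i)
    return dict(segments)


# A 6-pixel toy "image": sky pixels with inconsistent ids, and two distinct cars.
toy = {0: (0, 0), 1: (0, 7), 2: (2, 1), 3: (2, 1), 4: (2, 2), 5: (1, 0)}
segments = group_segments(toy)
```

Here the two sky pixels fall into one segment despite different instance ids, while the two cars remain separate segments.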

Relationship to semantic segmentation.

Panoptic segmentation is a strict generalization of the classic semantic segmentation task. Indeed, both tasks require each pixel in an image to be assigned a semantic label. If the ground truth does not specify instances, or all categories are stuff, then the two tasks are identical (although their metrics differ). However, inclusion of thing categories, which may have multiple instances per image, differentiates the tasks.

Relationship to instance segmentation.

The instance segmentation task requires a method to segment each object instance in an image. However, it allows overlapping segments, whereas the panoptic segmentation task permits only one semantic label and one instance id to be assigned to each pixel. Hence, for PS, no overlaps are possible by construction. In the next section we show that this difference plays an important role in performance evaluation.

Confidence scores.

Like semantic segmentation, but unlike instance segmentation, we do not require confidence scores associated with each segment for PS. This makes the panoptic task symmetric with respect to humans and machines: both must generate the same type of image annotation. It also makes evaluating human performance for PS simple. This is in contrast to instance segmentation, which is not easily amenable to such a study as human annotators do not provide explicit confidence scores (though a single precision/recall point may be measured). We note that confidence scores give downstream systems more information, which can be useful, so it may still be desirable to have a PS algorithm generate confidence scores in certain settings.

4 Panoptic Quality

Panoptic segmentation needs a suitable evaluation metric. We start by identifying several desiderata:

Completeness. The metric should measure the key performance characteristics of panoptic segmentation, including segmentation quality and detector precision and recall.

Interpretability. We seek a metric with identifiable meaning that facilitates communication and understanding.

Simplicity. In addition, the metric should be simple to define and implement. This improves transparency and allows for easy reimplementation. Related to this, the metric should be efficient to compute to enable rapid evaluation.

Guided by these principles, we propose a new Panoptic Quality (PQ) metric. PQ measures the quality of a predicted panoptic segmentation relative to the ground truth. It involves two steps: (1) instance matching and (2) PQ computation given the matches.

4.1 Instance Matching

We specify that a predicted segment and a ground truth segment can match only if their intersection over union (IoU) is strictly greater than 0.5. This requirement, together with the non-overlapping property of a panoptic segmentation, gives a unique matching: there can be at most one predicted segment matched with each ground truth segment.

Theorem 1.

Given a predicted and ground truth panoptic segmentation of an image, each ground truth segment can have at most one corresponding predicted segment with IoU strictly greater than 0.5 and vice versa.

Proof.

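Let $g$ be a ground truth segment and $p_1$ and $p_2$ be two predicted segments. By definition, $p_1 \cap p_2 = \emptyset$ (they do not overlap). Since $|p_i \cup g| \geq |g|$, we get the following:

$$\text{IoU}(p_i, g) = \frac{|p_i \cap g|}{|p_i \cup g|} \leq \frac{|p_i \cap g|}{|g|} \quad \text{for } i \in \{1, 2\}.$$

Summing over $i$, and since $|p_1 \cap g| + |p_2 \cap g| \leq |g|$ due to the fact that $p_1 \cap p_2 = \emptyset$, we get:

$$\text{IoU}(p_1, g) + \text{IoU}(p_2, g) \leq \frac{|p_1 \cap g| + |p_2 \cap g|}{|g|} \leq 1.$$

Therefore, if $\text{IoU}(p_1, g) > 0.5$, then $\text{IoU}(p_2, g)$ has to be smaller than 0.5. Reversing the role of $p$ and $g$ can be used to prove that only one ground truth segment can have IoU with a predicted segment strictly greater than 0.5. ∎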
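As a small numeric sanity check of the argument, the sketch below represents segments as sets of pixel indices (an assumption made for brevity; real segments would be image masks) and confirms that two disjoint predictions cannot both exceed 0.5 IoU with the same ground truth segment. The concrete pixel ranges are made up.

```python
def iou(a, b):
    """Intersection over union of two segments given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


g = set(range(0, 10))    # ground truth segment: pixels 0..9
p1 = set(range(0, 6))    # predicted segment covering pixels 0..5
p2 = set(range(6, 14))   # disjoint predicted segment covering pixels 6..13

iou1 = iou(p1, g)  # 6 shared pixels over a union of 10
iou2 = iou(p2, g)  # 4 shared pixels over a union of 14
```

Because `p1` and `p2` share no pixels, their intersections with `g` together cover at most `|g|` pixels, so the two IoU values sum to at most 1.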

The requirement that matches must have IoU greater than 0.5, which in turn yields the unique matching theorem, achieves two of our desired properties. First, it is simple and efficient as correspondences are unique and trivial to obtain. Second, it is interpretable and easy to understand (and does not require solving a complex matching problem as is commonly the case for these types of metrics [13, 40]).

4.2 PQ Computation

Figure 2: Toy illustration of ground truth and predicted panoptic segmentations of an image. Pairs of segments of the same color have IoU larger than 0.5 and are therefore matched. We show how the segments for the person class are partitioned into true positives $TP$, false negatives $FN$, and false positives $FP$.

We calculate PQ for each class independently and average over classes. This makes PQ insensitive to class imbalance. For each class, the unique matching splits the predicted and ground truth segments into three sets: true positives ($TP$), false positives ($FP$), and false negatives ($FN$), representing matched pairs of segments, unmatched predicted segments, and unmatched ground truth segments, respectively. An example is illustrated in Figure 2. Given these three sets, PQ is defined as:

$$\text{PQ} = \frac{\sum_{(p, g) \in TP} \text{IoU}(p, g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|} \tag{1}$$

PQ is intuitive after inspection: $\frac{1}{|TP|} \sum_{(p, g) \in TP} \text{IoU}(p, g)$ is simply the average IoU of matched segments, while $\frac{1}{2}|FP| + \frac{1}{2}|FN|$ is added to the denominator to penalize instances without matches. Note that all segments receive equal importance regardless of their area. Furthermore, if we multiply and divide PQ by the size of the $TP$ set, then PQ can be seen as the multiplication of a Segmentation Quality (SQ) term and a Detection Quality (DQ) term:

$$\text{PQ} = \underbrace{\frac{\sum_{(p, g) \in TP} \text{IoU}(p, g)}{|TP|}}_{\text{segmentation quality (SQ)}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{detection quality (DQ)}} \tag{2}$$

Written this way, DQ is the familiar $F_1$ score [37] widely used for quality estimation in detection problems [29]. SQ is simply the average IoU of matched objects. We find this split of PQ = SQ × DQ to provide insight, since it helps decompose the performance of a panoptic segmentation algorithm. We note, however, that the two values are not independent since SQ is measured only over matched objects.

Our definition of PQ achieves our desiderata. PQ is complete as it measures segmentation quality, precision, and recall, all in a fairly simple and interpretable formula. Its computation is straightforward and efficient, as are extensions to handle void regions and groups of instances [22].
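The two steps of the metric (matching, then PQ computation) can be sketched for a single class as follows. This is a minimal illustration, not a reference implementation: segments are assumed to be sets of pixel indices, the function names are hypothetical, void regions and groups of instances are not handled, and returning zeros when a class has no segments at all is an assumption.

```python
def iou(a, b):
    """Intersection over union of two segments given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_segments(pred, gt, threshold=0.5):
    """Match non-overlapping predicted and ground truth segments of one class.

    Returns the IoUs of matched pairs (TP), the number of unmatched
    predictions (FP), and the number of unmatched ground truths (FN).
    """
    matched_ious = []
    matched_pred = set()
    for g in gt:
        for index, p in enumerate(pred):
            overlap = iou(p, g)
            if overlap > threshold:
                matched_ious.append(overlap)
                matched_pred.add(index)
                break  # at most one match can exist above 0.5 IoU
    num_fp = len(pred) - len(matched_pred)
    num_fn = len(gt) - len(matched_ious)
    return matched_ious, num_fp, num_fn


def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, DQ) for one class."""
    num_tp = len(matched_ious)
    denominator = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denominator == 0:
        return 0.0, 0.0, 0.0  # assumption: class absent from both inputs
    sq = sum(matched_ious) / num_tp if num_tp else 0.0
    dq = num_tp / denominator
    return sq * dq, sq, dq


# Toy class with one good match, one spurious prediction, one missed object.
gt = [set(range(0, 10)), set(range(10, 20))]
pred = [set(range(0, 8)), set(range(30, 35))]
matched_ious, num_fp, num_fn = match_segments(pred, gt)
pq, sq, dq = panoptic_quality(matched_ious, num_fp, num_fn)
```

In this toy case there is one true positive with IoU 0.8, one false positive, and one false negative, so SQ = 0.8, DQ = 0.5, and PQ = 0.4. The dataset-level score would average the per-class PQ values.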

5 Panoptic Segmentation Datasets

To our knowledge only three public datasets have both dense semantic and instance segmentation annotations: Cityscapes [6], ADE20k [44], and Mapillary Vistas [31]. We use all three datasets for panoptic segmentation. In addition, in the future we will extend our analysis to COCO [22] on which stuff is currently being annotated [4]. (Footnote: In addition to stuff annotations being incomplete, COCO instance segmentations contain overlaps. We plan on collecting depth ordering for all pairs of overlapping instances in COCO to resolve these overlaps.)

Cityscapes [6] has 5000 images (2975 train, 500 val, and 1525 test) of ego-centric driving scenarios in urban settings. It has dense pixel annotations (97% coverage) of 19 classes among which 8 have instance-level segmentations.

ADE20k [44] has over 25k images (20k train, 2k val, 3k test) that are densely annotated with an open-dictionary label set. For the 2017 Places Challenge (http://placeschallenge.csail.mit.edu), 100 thing and 50 stuff classes that cover 89% of all pixels are selected. We use this closed vocabulary in our study.

Mapillary Vistas [31] has 25k street-view images (18k train, 2k val, 5k test) in a wide range of resolutions. The ‘research edition’ of the dataset is densely annotated (98% pixel coverage) with 28 stuff and 37 thing classes.

We use all three datasets (statistics in Table 1, right) for the study of both human and machine performance for PS.

| Dataset | PQ | SQ | DQ | images | classes | coverage |
|---|---|---|---|---|---|---|
| Cityscapes | 69.7 | 84.2 | 82.1 | 5k | 19 | 97% |
| ADE20k | 67.1 | 85.8 | 78.0 | 25k | 150 | 89% |
| Vistas | 57.5 | 79.5 | 71.4 | 25k | 65 | 98% |

Table 1: Human performance on panoptic segmentation on three datasets. Panoptic, segmentation, and detection quality (PQ, SQ, DQ) averaged over classes (PQ = SQ × DQ per class) are reported as percentages, establishing a measure of annotator consistency for each dataset. Dataset characteristics such as the number of classes, percent pixel coverage, and scene complexity vary widely, each of which impacts annotation difficulty, so the reported numbers should not be used to compare dataset annotation quality.

6 Human Performance Study

One advantage of panoptic segmentation and the panoptic quality metric is that it enables measuring human performance. Aside from this being interesting in its own right, human performance studies allow us to understand the task and metrics in detail, including breakdowns of performance along various axes. This gives us insight into intrinsic challenges posed by the task without biasing our analysis by algorithmic choices. Furthermore, human studies help ground machine performance (discussed in §7) and allow us to calibrate our understanding of the task and metrics.

Figure 3: Segmentation flaws. Images are zoomed and cropped. Top row (Vistas image): both annotators identify the object as a car, however, one splits the car into two cars. Bottom row (Cityscapes image): the segmentation is genuinely ambiguous.

| Dataset | PQ | $\text{PQ}^{\text{St}}$ | $\text{PQ}^{\text{Th}}$ | SQ | $\text{SQ}^{\text{St}}$ | $\text{SQ}^{\text{Th}}$ | DQ | $\text{DQ}^{\text{St}}$ | $\text{DQ}^{\text{Th}}$ |
|---|---|---|---|---|---|---|---|---|---|
| Cityscapes | 69.7 | 71.3 | 67.4 | 84.2 | 84.4 | 83.9 | 82.1 | 83.4 | 80.2 |
| ADE20k | 67.1 | 70.3 | 65.9 | 85.8 | 85.5 | 85.9 | 78.0 | 82.4 | 76.4 |
| Vistas | 57.5 | 62.6 | 53.4 | 79.5 | 81.6 | 77.9 | 71.4 | 76.0 | 67.7 |

Table 2: Human performance for stuff vs. things. Perhaps surprisingly, we find that human performance on each dataset is relatively similar for both stuff and things. In particular $\text{SQ}^{\text{St}}$ and $\text{SQ}^{\text{Th}}$ are close, while there is a slight gap between $\text{DQ}^{\text{St}}$ and $\text{DQ}^{\text{Th}}$. This leads to a small overall gap in PQ between stuff and things.

Human annotations.

To enable human performance analysis, dataset creators graciously supplied us with 30 doubly annotated images for Cityscapes, 64 for ADE20k, and 46 for Vistas. For Cityscapes and Vistas, the images are annotated independently by different annotators. ADE20k is annotated by a single well-trained annotator who labeled the same set of images with a gap of six months. To measure Panoptic Quality (PQ) for human annotators, we treat one annotation for each image as ground truth and the other as the prediction. Note that the PQ is symmetric w.r.t. the ground truth and prediction, so order is unimportant.

Human performance.

Table 1 shows human performance on each dataset, along with the decomposition of PQ into segmentation quality (SQ) and detection quality (DQ). As expected, humans are not perfect at this task, which is consistent with studies of annotation quality performed in [6, 44, 31]. Visualizations of human segmentation and classification errors are shown in Figures 3 and 4, respectively.

We note that Table 1 establishes a measure of annotator consistency on each dataset, not an upper bound on human performance. We further emphasize that numbers are not comparable across datasets and should not be used to assess dataset quality. The number of classes, percent of annotated pixels, and scene complexity vary across datasets, each of which significantly impacts annotation difficulty.

Stuff vs. things.

PS requires segmentation of both stuff and things. In Table 2 we show $\text{PQ}^{\text{St}}$ and $\text{PQ}^{\text{Th}}$, which are the PQ averaged over stuff classes and thing classes, respectively. For Cityscapes and ADE20k, human performance on stuff and things is close; on Vistas the gap is a bit larger. Overall, this implies stuff and things have similar difficulty, although thing classes are somewhat harder. In Figure 5 we show PQ for every class in each dataset, sorted by PQ. Observe that stuff and thing classes are distributed fairly evenly. This implies that the proposed metric strikes a good balance and, indeed, is successful at unifying the stuff and things segmentation tasks without either dominating the error.

Figure 4: Classification flaws. Images are zoomed and cropped. Top row (ADE20k image): simple misclassification. Bottom row (Cityscapes image): the scene is extremely difficult, tram is the correct class for the segment. Many errors are difficult to resolve.

| Dataset | $\text{PQ}^{\text{S}}$ | $\text{PQ}^{\text{M}}$ | $\text{PQ}^{\text{L}}$ | $\text{SQ}^{\text{S}}$ | $\text{SQ}^{\text{M}}$ | $\text{SQ}^{\text{L}}$ | $\text{DQ}^{\text{S}}$ | $\text{DQ}^{\text{M}}$ | $\text{DQ}^{\text{L}}$ |
|---|---|---|---|---|---|---|---|---|---|
| Cityscapes | 35.5 | 63.5 | 86.2 | 67.6 | 80.2 | 89.7 | 52.2 | 78.7 | 95.9 |
| ADE20k | 53.7 | 68.5 | 79.5 | 78.0 | 84.3 | 88.4 | 69.0 | 81.2 | 89.6 |
| Vistas | 37.1 | 47.9 | 69.9 | 70.2 | 76.6 | 83.0 | 53.7 | 62.7 | 83.4 |

Table 3: Human performance vs. scale, for small (S), medium (M) and large (L) objects. Scale plays a large role in determining human accuracy for panoptic segmentation. On large objects both SQ and DQ are above 80 on all datasets, while for small objects DQ drops precipitously. SQ for small objects is quite reasonable.

Small vs. large objects.

To analyze how PQ varies with object size we partition the datasets into small (S), medium (M), and large (L) objects by considering the smallest 25%, middle 50%, and largest 25% of objects in each dataset, respectively. In Table 3, we see that for large objects human performance for all datasets is quite good. For small objects, DQ drops significantly, implying human annotators often have a hard time finding small objects. However, if a small object is found, it is segmented relatively well.

IoU threshold.

By enforcing an overlap greater than 0.5 IoU, we are given a unique matching by Theorem  1. However, is the 0.5 threshold reasonable? An alternate strategy is to use no threshold and perform the matching by solving a Maximum Weighted Bipartite Matching problem [39]. The optimization will return a matching that maximizes the sum of IoUs of the matched segments. We perform the matching using this optimization and plot the cumulative density functions of the match overlaps in Figure 6. Less than 16% of the matches have IoU overlap less than 0.5, indicating that relaxing the threshold should have minor effect.

To verify this intuition, in Figure 7 we show PQ computed for different IoU thresholds. Notably, the difference in PQ for IoU of 0.25 and 0.5 is relatively small, especially compared to the gap between IoU of 0.5 and 0.75, where the change in PQ is larger. Furthermore, many matches at lower IoU are false matches. Therefore, given that the matching procedure for IoU of 0.5 is simple and intuitive, we believe that the default choice of 0.5 is reasonable.
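To make the threshold-free alternative concrete, the sketch below finds the matching that maximizes the sum of IoUs by brute force over permutations. This is an illustration only: the function name and the IoU values are made up, the brute-force search is a stand-in for a proper assignment solver (for example `scipy.optimize.linear_sum_assignment` with `maximize=True`), and the sketch assumes there are no more ground truth segments than predicted ones.

```python
from itertools import permutations


def max_weight_matching(iou_matrix):
    """Brute-force maximum weighted bipartite matching.

    `iou_matrix[r][c]` is the IoU between ground truth segment r and
    predicted segment c. Returns the best (row, column) pairs and their
    total IoU. Only practical for a handful of segments.
    """
    num_gt = len(iou_matrix)
    num_pred = len(iou_matrix[0]) if num_gt else 0
    if num_gt > num_pred:
        raise ValueError("sketch assumes at most as many ground truth as predicted segments")
    best_total, best_pairs = -1.0, []
    for columns in permutations(range(num_pred), num_gt):
        total = sum(iou_matrix[row][col] for row, col in enumerate(columns))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(columns))
    return best_pairs, best_total


# Hypothetical IoUs between two ground truth and two predicted segments.
iou_matrix = [[0.6, 0.3],
              [0.45, 0.4]]
pairs, total = max_weight_matching(iou_matrix)
below_threshold = [(r, c) for r, c in pairs if iou_matrix[r][c] <= 0.5]
```

In this toy case the optimal matching pairs the second ground truth segment with a prediction at only 0.4 IoU, a match that the 0.5 threshold would reject.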

Figure 5: Per-class human performance, sorted by PQ. Thing classes are shown in red, stuff classes in orange (for ADE20k every other class is shown, classes without matches in the dual-annotated test sets are omitted). Things and stuff are distributed fairly evenly, implying PQ balances their performance.
Figure 6: Cumulative density functions of overlaps for matched segments in three datasets when matches are computed by solving a Maximum Weighted Bipartite Matching problem [39]. After matching, less than 16% of matched objects have IoU below 0.5.
Figure 7: Human performance for different IoU thresholds. The difference in PQ using a matching threshold of 0.25 vs. 0.5 is relatively small. For IoU of 0.25 matching is obtained by solving a Maximum Weighted Bipartite Matching problem. For a threshold greater than 0.5 the matching is unique and much easier to obtain.

SQ vs. DQ balance.

Our DQ definition is equivalent to the $F_1$ score. However, other choices are possible. Inspired by the generalized $F_\beta$ score [37], we can introduce a parameter $\alpha$ that enables tuning the penalty for detection errors:

$$\text{DQ}_\alpha = \frac{|TP|}{|TP| + \alpha|FP| + \alpha|FN|} \tag{3}$$

By default $\alpha$ is 0.5. Lowering $\alpha$ reduces the penalty of unmatched segments and thus increases DQ (SQ is not affected). Since PQ = SQ × DQ, this changes the relative effect of SQ vs. DQ on the final PQ metric. In Figure 8 we show SQ and DQ for various $\alpha$. The default $\alpha$ strikes a good balance between SQ and DQ. In principle, altering $\alpha$ can be used to balance the influence of segmentation and detection errors on the final metric. In a similar spirit, one could also add a parameter to balance the influence of FP vs. FN errors.
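The effect of the penalty parameter can be seen in a few lines. The function name and the counts below are hypothetical; returning 0 when there are no segments at all is an assumption.

```python
def detection_quality(num_tp, num_fp, num_fn, alpha=0.5):
    """Detection quality with a tunable penalty alpha on unmatched segments."""
    denominator = num_tp + alpha * num_fp + alpha * num_fn
    return num_tp / denominator if denominator else 0.0


# Hypothetical counts: 6 matched pairs, 2 false positives, 2 false negatives.
dq_default = detection_quality(6, 2, 2)             # alpha = 0.5
dq_lenient = detection_quality(6, 2, 2, alpha=0.25)
dq_strict = detection_quality(6, 2, 2, alpha=1.0)
f1 = 2 * 6 / (2 * 6 + 2 + 2)                        # standard F1 from the same counts
```

With these counts the default gives 0.75, which coincides with the F1 score, while the lower penalty raises DQ to about 0.857 and the higher penalty lowers it to 0.6.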

Figure 8: SQ vs. DQ for different $\alpha$, see (3). Lowering $\alpha$ reduces the penalty of unmatched segments and thus increases the reported DQ (SQ is not affected). We use $\alpha$ of 0.5 throughout but by tuning $\alpha$ one can balance the influence of SQ and DQ in the final metric.

7 Machine Performance Baselines

We now present simple machine baselines for panoptic segmentation. We are interested in two questions: (1) How do heuristic combinations of existing top-performing instance and semantic segmentation systems perform on panoptic segmentation? (2) How do the machine results compare to the human results that we presented previously?

Algorithms and data.

We want to understand panoptic segmentation in terms of existing well-established methods. Therefore, we create a basic PS system by applying reasonable heuristics (described shortly) to the output of existing top instance and semantic segmentation systems.

We obtained algorithm output for three datasets. For Cityscapes, we use the val set output generated by the current leading algorithms (PSPNet [43] and Mask R-CNN [14] for semantic and instance segmentation, respectively). For ADE20k, we received output for the winners of both the semantic [12, 11] and instance [27, 10] segmentation tracks on a 1k subset of test images from the 2017 Places Challenge. For Vistas, which is used for the LSUN’17 Segmentation Challenge, the organizers provide us with 1k test images and results from the winning entries for the instance and semantic segmentation tracks [25, 42].

Using this data, we start by analyzing PQ for the instance and semantic segmentation tasks separately, and then examine the full panoptic segmentation task. Note that our ‘baselines’ are very powerful and that simpler baselines may be more reasonable for fair comparison in papers on PS.

| Dataset | Method | AP | $\text{AP}^{\text{NO}}$ | $\text{PQ}^{\text{Th}}$ | $\text{SQ}^{\text{Th}}$ | $\text{DQ}^{\text{Th}}$ |
|---|---|---|---|---|---|---|
| Cityscapes | Mask R-CNN+COCO [14] | 36.4 | 33.1 | 54.1 | 79.4 | 67.9 |
| Cityscapes | Mask R-CNN [14] | 31.5 | 28.0 | 49.6 | 78.7 | 63.0 |
| ADE20k | Megvii [27] | 30.1 | 24.8 | 41.1 | 81.6 | 49.6 |
| ADE20k | G-RMI [10] | 24.6 | 20.6 | 35.3 | 79.3 | 43.2 |

Table 4: Machine results on instance segmentation (stuff classes ignored). Non-overlapping predictions are obtained using the proposed heuristic. $\text{AP}^{\text{NO}}$ is AP of the non-overlapping predictions. As expected, removing overlaps harms AP as detectors benefit from predicting multiple overlapping hypotheses. Methods with better AP also have better $\text{AP}^{\text{NO}}$ and likewise improved PQ.

Instance segmentation.

Instance segmentation algorithms produce overlapping segments. To measure PQ, we must first resolve these overlaps. To do so we develop a simple non-maximum suppression (NMS)-like procedure. We first sort the predicted segments by their confidence scores and remove instances with low scores. Then, we iterate over sorted instances, starting from the most confident. For each instance we first remove pixels which have been assigned to previous segments, then, if a sufficient fraction of the segment remains, we accept the non-overlapping portion, otherwise we discard the entire segment. All thresholds are selected by grid search to optimize PQ. Results on Cityscapes and ADE20k are shown in Table 4 (Vistas is omitted as it only had one entry to the 2017 instance challenge). Most importantly, AP and PQ track closely, and we expect improvements in a detector’s AP will also improve its PQ.
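The NMS-like procedure described above can be sketched as follows. Segments are assumed to be sets of pixel indices, and the function name and both threshold values are placeholders; the text selects the actual thresholds by grid search and does not report them here.

```python
def resolve_overlaps(instances, score_threshold=0.5, keep_fraction=0.5):
    """Turn scored, possibly overlapping segments into non-overlapping ones.

    `instances` is a list of (score, pixels) pairs, where `pixels` is a set of
    pixel indices. Both thresholds are hypothetical placeholder values.
    """
    confident = [inst for inst in instances if inst[0] >= score_threshold]
    occupied = set()  # pixels already claimed by more confident segments
    kept = []
    for score, pixels in sorted(confident, key=lambda inst: inst[0], reverse=True):
        remaining = pixels - occupied
        # Accept the non-overlapping portion only if enough of the segment survives.
        if pixels and len(remaining) / len(pixels) >= keep_fraction:
            kept.append((score, remaining))
            occupied |= remaining
    return kept


instances = [
    (0.9, set(range(0, 10))),   # most confident, kept whole
    (0.8, set(range(8, 16))),   # loses pixels 8-9, keeps 6 of 8 pixels
    (0.7, set(range(0, 6))),    # fully covered by the first segment, discarded
    (0.2, set(range(20, 26))),  # below the score threshold, removed
]
kept = resolve_overlaps(instances)
```

The result contains two disjoint segments, so every pixel belongs to at most one instance as the panoptic format requires.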

Semantic segmentation.

Semantic segmentations have no overlapping segments by design, and therefore we can directly compute PQ. In Table 5 we compare mean IoU, a standard metric for this task, to PQ. For Cityscapes, the PQ gap between methods corresponds to the IoU gap. For ADE20k, the gap is much larger. This is because whereas IoU counts correctly predicted pixels, PQ operates at the level of instances. See the Table 5 caption for details.

| Dataset | Method | IoU | $\text{PQ}^{\text{St}}$ | $\text{SQ}^{\text{St}}$ | $\text{DQ}^{\text{St}}$ |
|---|---|---|---|---|---|
| Cityscapes | PSPNet multi-scale [43] | 80.6 | 66.6 | 82.2 | 79.3 |
| Cityscapes | PSPNet single-scale [43] | 79.6 | 65.2 | 81.6 | 78.0 |
| ADE20k | CASIA_IVA_JD [12] | 32.3 | 27.4 | 61.9 | 33.7 |
| ADE20k | G-RMI [11] | 30.6 | 19.3 | 58.7 | 24.3 |

Table 5: Machine results on semantic segmentation (thing classes ignored). Methods with better mean IoU also show better PQ results. Note that G-RMI has quite low PQ. We found this is because it hallucinates many small patches of classes not present in an image. While this only slightly affects IoU, which counts pixel errors, it severely degrades PQ, which counts instance errors.

Panoptic segmentation.

To produce algorithm outputs for PS, we start from the non-overlapping instance segments from the NMS-like procedure described previously. Then, we combine those segments with semantic segmentation results by resolving any overlap between thing and stuff classes in favor of the thing class (i.e., a pixel with a thing and stuff label is assigned the thing label and its instance id). This heuristic is imperfect but sufficient as a baseline.
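A minimal sketch of this merging step is given below, assuming the instance segments are already non-overlapping. The function name, the dictionary-based image, and the label ids are hypothetical. The sketch also assumes that pixels whose semantic prediction is a thing class but which are covered by no instance are left unlabeled, which the text above does not specify.

```python
def merge_panoptic(semantic, thing_segments, stuff_labels):
    """Merge semantic and instance predictions, favoring things on overlap.

    `semantic` maps pixel index -> predicted label; `thing_segments` is a list
    of (label, pixels) with disjoint pixel sets. Returns a dict mapping pixel
    index -> (label, instance_id), using instance id 0 for stuff.
    """
    # Start from stuff predictions only (assumption: leftover thing-class
    # pixels without an instance stay unlabeled).
    panoptic = {i: (label, 0) for i, label in semantic.items() if label in stuff_labels}
    # Thing segments overwrite any stuff label on the pixels they cover.
    for instance_id, (label, pixels) in enumerate(thing_segments, start=1):
        for i in pixels:
            panoptic[i] = (label, instance_id)
    return panoptic


stuff_labels = {0, 1}                         # hypothetical: sky, road
semantic = {0: 0, 1: 0, 2: 1, 3: 2, 4: 1}     # label 2 is a thing class
thing_segments = [(2, {1, 2})]                # one instance overlapping sky and road
panoptic = merge_panoptic(semantic, thing_segments, stuff_labels)
```

Pixels 1 and 2 take the thing label and its instance id, which is why the thing scores are unchanged by merging while the stuff scores can only drop.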

Table 6 compares $\text{PQ}^{\text{St}}$ and $\text{PQ}^{\text{Th}}$ computed on the combined (‘panoptic’) results to the performance achieved from the separate predictions discussed above. For these results we use the winning entries from each respective competition for both the instance and semantic tasks. Since overlaps are resolved in favor of things, $\text{PQ}^{\text{Th}}$ is constant while $\text{PQ}^{\text{St}}$ is slightly lower for the panoptic predictions. Visualizations of panoptic outputs are shown in Figure 9.

Human vs. machine panoptic segmentation.

To compare human vs. machine PQ, we use the machine panoptic predictions described above. For human results, we use the dual-annotated images described in §6 and use bootstrapping to obtain confidence intervals since these image sets are small. These comparisons are imperfect as they use different test images and are averaged over different classes (some classes without matches in the dual-annotated test sets are omitted), but they can still give some useful signal.
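For readers unfamiliar with bootstrapping, the idea is to resample the small image set with replacement many times, recompute the metric on each resample, and report percentiles of the resulting distribution. The sketch below is a generic percentile bootstrap and makes several assumptions: the function name `bootstrap_interval`, the number of resamples, the fixed seed, and the nearest-rank percentile rule are illustrative choices, and `metric` stands in for whatever computes PQ (or SQ, DQ) from a list of per-image records. The 5th and 95th percentile defaults follow the error ranges mentioned in the Table 7 caption.

```python
import random


def bootstrap_interval(items, metric, n_resamples=1000,
                       lower=5, upper=95, seed=0):
    """Percentile bootstrap interval for metric(items).

    items: list of per-image records (any objects `metric` understands).
    metric: callable mapping a list of records to a single number.
    n_resamples, seed: hypothetical defaults chosen for this sketch.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    # Each resample draws len(items) records with replacement.
    stats = sorted(
        metric([rng.choice(items) for _ in items])
        for _ in range(n_resamples)
    )

    def percentile(q):
        # Nearest-rank style lookup into the sorted bootstrap statistics.
        return stats[round(q / 100 * (len(stats) - 1))]

    return percentile(lower), percentile(upper)


if __name__ == "__main__":
    # Toy per-image scores standing in for real annotations.
    scores = [0.55, 0.61, 0.70, 0.48, 0.66, 0.59]
    mean = lambda xs: sum(xs) / len(xs)
    print(bootstrap_interval(scores, mean))
```

Note that PQ is defined over matched segments pooled across images rather than as a mean of per-image scores, so a faithful `metric` would recompute the pooled statistic on each resampled image list; the mean used in the usage example is only a stand-in.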

We present the comparison in Table 7. For SQ, machines trail humans only slightly. On the other hand, machine DQ is dramatically lower than human DQ, especially on ADE20k and Vistas. This implies that recognition, i.e., object detection, is the main challenge for current methods. Overall, there is a significant gap between human and machine performance. We hope that this gap will inspire future research for the proposed panoptic segmentation task.

Figure 9: Panoptic Segmentation results on Cityscapes (left two) and ADE20k (right three). Predictions are based on the merged outputs of state-of-the-art instance and semantic segmentation algorithms (see Tables 4 and 5). Colors for matched segments (IoU > 0.5) match (crosshatch pattern indicates unmatched regions and black indicates unlabeled regions). Best viewed in color and with zoom.
Cityscapes         PQ     PQSt   PQTh
machine-separate   n/a    66.6   54.1
machine-panoptic   61.2   66.4   54.1
ADE20k             PQ     PQSt   PQTh
machine-separate   n/a    27.4   41.1
machine-panoptic   35.6   24.5   41.1
Vistas             PQ     PQSt   PQTh
machine-separate   n/a    43.7   35.7
machine-panoptic   38.3   41.8   35.7
Table 6: Panoptic vs. independent predictions. The ‘machine-separate’ rows show PQ of semantic and instance segmentation methods computed independently (see also Tables 4 and 5). For ‘machine-panoptic’, we merge the non-overlapping thing and stuff predictions obtained from state-of-the-art methods into a true panoptic segmentation of the image. Due to the merging heuristic used, PQTh stays the same while PQSt is slightly degraded.

8 Future of Panoptic Segmentation

Our goal is to drive research in novel directions by inviting the community to explore the new Panoptic Segmentation task. We believe that the proposed task can lead to expected and unexpected innovations. We conclude by discussing some of these possibilities and our future plans.

Motivated by simplicity, the PS ‘algorithm’ in this paper is based on the heuristic combination of outputs from top-performing instance and semantic segmentation systems. This approach is a basic first step, but we expect more interesting algorithms to be introduced. Specifically, we hope to see PS drive innovation in at least two areas: (1) Deeply integrated end-to-end models that simultaneously address the dual stuff-and-thing nature of PS. A number of instance segmentation approaches including [24, 2, 3, 18] are designed to produce non-overlapping instance predictions and could serve as the foundation of such a system. (2) Since a PS cannot have overlapping segments, some form of higher-level ‘reasoning’ may be beneficial, for example, based on extending learnable NMS [7, 16, 17] to PS. We hope that the panoptic segmentation task will invigorate research in these areas leading to exciting new breakthroughs in vision.

Finally, we aim to work with competition organizers to extend current segmentation datasets to include a Panoptic Segmentation track. Candidate datasets are the ones explored in this work (Cityscapes [6], ADE20k [44], Mapillary Vistas [31]), as well as COCO [22]. We will be looking to run panoptic segmentation challenges in 2018.

Cityscapes   PQ     SQ     DQ     PQSt   PQTh
human        69.6   84.1   82.0   71.2   67.4
machine      61.2   81.0   74.4   66.4   54.1
ADE20k       PQ     SQ     DQ     PQSt   PQTh
human        67.6   85.7   78.6   71.0   66.4
machine      35.6   74.4   43.2   24.5   41.1
Vistas       PQ     SQ     DQ     PQSt   PQTh
human        57.7   79.7   71.6   62.7   53.6
machine      38.3   73.6   47.7   41.8   35.7
Table 7: Human vs. machine performance. On each of the considered datasets human performance is much higher than machine performance (approximate comparison, see text for details). This is especially true for DQ, while SQ is closer. The gap is largest on ADE20k and smallest on Cityscapes. Note that as only a small set of human annotations is available, we use bootstrapping and show the 5th and 95th percentile error ranges for human results.

References

  • [1] E. H. Adelson. On seeing stuff: the perception of materials by humans and machines. In Human Vision and Electronic Imaging, 2001.
  • [2] A. Arnab and P. H. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
  • [3] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
  • [4] H. Caesar, J. Uijlings, and V. Ferrari. COCO-Stuff: Thing and stuff classes in context. arXiv:1612.03716, 2016.
  • [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915, 2016.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [7] C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. IJCV, 2011.
  • [8] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 2012.
  • [9] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2015.
  • [10] A. Fathi, N. Kanazawa, and K. Murphy. Places challenge 2017: instance segmentation, G-RMI team. 2017.
  • [11] A. Fathi, K. Yang, and K. Murphy. Places challenge 2017: scene parsing, G-RMI team. 2017.
  • [12] J. Fu, J. Liu, L. Guo, H. Tian, F. Liu, H. Lu, Y. Li, Y. Bao, and W. Yan. Places challenge 2017: scene parsing, CASIA_IVA_JD team. 2017.
  • [13] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
  • [15] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? PAMI, 2015.
  • [16] J. Hosang, R. Benenson, and B. Schiele. Learning non-maximum suppression. PAMI, 2017.
  • [17] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. arXiv:1711.11575, 2017.
  • [18] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. InstanceCut: from edges to instances with multicut. In CVPR, 2017.
  • [19] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
  • [20] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [21] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
  • [22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [23] C. Liu, J. Yuen, and A. Torralba. SIFT flow: Dense correspondence across scenes and its applications. PAMI, 2011.
  • [24] S. Liu, J. Jia, S. Fidler, and R. Urtasun. SGN: Sequential grouping networks for instance segmentation. In CVPR, 2017.
  • [25] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. LSUN’17: instance segmentation task, UCenter winner team. 2017.
  • [26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [27] R. Luo, B. Jiang, T. Xiao, C. Peng, Y. Jiang, Z. Li, X. Zhang, G. Yu, Y. Mu, and J. Sun. Places challenge 2017: instance segmentation, Megvii (Face++) team. 2017.
  • [28] J. Malik, P. Arbeláez, J. Carreira, K. Fragkiadaki, R. Girshick, G. Gkioxari, S. Gupta, B. Hariharan, A. Kar, and S. Tulsiani. The three R’s of computer vision: Recognition, reconstruction and reorganization. PRL, 2016.
  • [29] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 2004.
  • [30] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.
  • [31] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In CVPR, 2017.
  • [32] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  • [35] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost: Joint appearance, shape and context modeling for multi-class object recog. and segm. In ECCV, 2006.
  • [36] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994.
  • [37] C. Van Rijsbergen. Information retrieval. London: Butterworths, 1979.
  • [38] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
  • [39] D. B. West. Introduction to graph theory, volume 2. Prentice hall Upper Saddle River, 2001.
  • [40] Y. Yang, S. Hallman, D. Ramanan, and C. C. Fowlkes. Layered object models for image segmentation. PAMI, 2012.
  • [41] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
  • [42] Y. Zhang, H. Zhao, and J. Shi. LSUN’17: semantic segmentation task, PSPNet winner team. 2017.
  • [43] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
  • [44] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
  • [45] Y. Zhu, Y. Tian, D. Metaxas, and P. Dollár. Semantic amodal segmentation. In CVPR, 2017.