Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing

Abstract

We introduce Synscapes – a synthetic dataset for street scene parsing created using photorealistic rendering techniques, and show state-of-the-art results for training and validation as well as new types of analysis.

We study the behavior of networks trained on real data when performing inference on synthetic data: a key factor in determining the equivalence of simulation environments. We also compare the behavior of networks trained on synthetic data and evaluated on real-world data.

Additionally, by analyzing pre-trained, existing segmentation and detection models, we illustrate how uncorrelated images along with a detailed set of annotations open up new avenues for analysis of computer vision systems, providing fine-grain information about how a model’s performance changes according to factors such as distance, occlusion and relative object orientation.

1 Introduction

Figure 1: The scenario for each image in Synscapes is unique, and the dataset encompasses a wide variety of street scene configurations and environmental factors.

There is a range of data that can be considered synthetic in the machine learning context. For example, the common technique of augmenting data to create variations during training is a lightweight form of synthesis. At the other end of the scale are approaches that create data entirely by artificial means. Synscapes, along with several preceding synthetic datasets for computer vision tasks (and for street scene parsing in particular), is an example of the latter.

Synthetic datasets for computer vision tasks generally use computer graphics in order to create images that can be used for training and validation of machine learning systems. Most [11, 12] tend to use game engines to render the final images, but some [1] have also used offline, physically based rendering, commonly employed in visual effects production and animation, to produce their datasets.

Synthetic data generation methods generally highlight the low cost of producing the data as the main benefit, but this focus obscures some of the perhaps more important properties of synthetic data: namely, the ability to produce arbitrary amounts of data from arbitrary probability distributions, with arbitrarily detailed annotations. In this sense, not all approaches to generating synthetic data are equal.

2 Previous synthetic street scene datasets

Virtual KITTI [4] closely recreates parts of the KITTI [6] dataset at a high level: buildings and individual actors are placed identically and the field of view matches. But the complexity and realism are both low.

Synthia [12] provides images from a virtual world created within the Unity framework. It was one of the earliest works showing that synthetic data could be assembled by leveraging off-the-shelf assets, which were recombined to create relevant street scenarios.

Richter et al. [11] used Grand Theft Auto V to create a much more complex environment than Synthia, but still relied on simplistic geometry and non-photorealistic real-time rendering. Although existing games appear to be an easy-to-access source of data, the legality of using them is questionable, and this approach obscures the amount of technical and artistic work required to create such virtual worlds, which often runs into the hundreds of person-years.

Recently, Richter et al. [10] released the Playing for Benchmarks dataset, which extended the use of GTA V with a larger set of images and a wider range of annotations.

3 Synscapes overview

Figure 2: Overview of the procedural approach for scenario and image generation. Each image is defined by a scenario created from a specific sampling of the generating parameters. The benefit of the procedural approach is that it enables full control over the parameter sampling, scenario generation and rendering without manual work.

Synscapes is created with an end-to-end approach to realism, accurately capturing the effects of everything from illumination by sun and sky, to the scene’s geometric and material composition, to the optics, sensor and processing of the camera system. The images in the dataset do not follow a driven path through a single virtual world. Instead, an entirely unique scene is procedurally generated for each of the twenty-five thousand images. As a result, the dataset contains a wide range of variations and unique combinations of features.

The procedural engine, illustrated in Figure 2 and described in more detail in a previous paper [14], parameterizes all aspects of the 3D world generation and the image synthesis, and enables fully automated production of datasets. A scenario parameter controls, e.g., the number of cars or pedestrians, the width of the road, the road surface material, the time of day, or the weather conditions. A scenario is an instantiation of a 3D world (the ego-vehicle, agents, etc.) defined by the sampling of a point in the high-dimensional space of scenario parameters. Each parameter in our system is coupled with a distribution such that each sample can be drawn in a statistically meaningful way. The scenario defines the input to the rendering engine responsible for simulation of sensors and optics. A scenario can be used for rendering both animations and single images. Since Synscapes is meant for experiments in training and validation, it consists of only single images, generated from parameter samples that ensure a wide coverage and variation of classes and features.
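As a concrete illustration of the scenario sampling described above, the sketch below draws one hypothetical scenario per image. Parameter names follow the metadata keys listed in the appendix where possible (num_*, sidewalk_width, curb_height, sun_height, sky_contrast, ego_speed); the distributions and any remaining names are illustrative assumptions, not the ones used to generate Synscapes.

  # Hypothetical scenario sampler: each parameter is drawn independently from its
  # own distribution, and one sampled point defines one unique 3D world.
  # All distributions below are illustrative assumptions.
  import random

  def sample_scenario(rng: random.Random) -> dict:
      return {
          "num_cars": rng.randint(0, 40),            # number of car actors
          "num_pedestrians": rng.randint(0, 30),     # assumed num_* parameter
          "sidewalk_width": rng.uniform(1.0, 4.0),   # meters
          "curb_height": rng.uniform(0.05, 0.25),    # meters
          "sun_height": rng.uniform(0.0, 1.0),       # 0 = horizon, 1 = zenith
          "sky_contrast": rng.uniform(2.0, 6.0),     # overcast .. sunny
          "ego_speed": rng.uniform(0.0, 20.0),       # m/s, drives motion blur
      }

  rng = random.Random(42)
  scenarios = [sample_scenario(rng) for _ in range(5)]  # one scenario per image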

The images were rendered using unbiased path tracing [7]: the same physically based rendering technique that powers high-end visual effects in the film industry. Light transport is calculated using radiometric properties from the sun and the sky, modeling the light’s interaction with surfaces using physically based reflectance models, ensuring that each image is representative of the real world. Additionally, the effects of light scattering in the camera optics are modeled using a long-tail point spread function (PSF), and effects related to the imaging sensor such as readout noise, camera response function (CRF) and color characteristics are also simulated.
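For reference, the unbiased path tracing referred to above estimates solutions to the rendering equation of Kajiya [7]; in its standard hemispherical form (stated here for context, not quoted from this paper) it reads

  L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o) + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o) \, L_i(\mathbf{x}, \omega_i) \, (\omega_i \cdot \mathbf{n}) \, \mathrm{d}\omega_i

where L_o is the outgoing radiance, L_e the emitted radiance, f_r the surface reflectance (BRDF), L_i the incident radiance, and the integral runs over the hemisphere Ω around the surface normal n.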

As the name implies, Synscapes was designed to be similar in structure and content to the Cityscapes dataset [3], and it includes all 19 of its training classes for semantic segmentation.

3.1 RGB camera images

Synscapes consists of 25,000 RGB images in PNG format at 1440×720 resolution (matching the camera intrinsics listed in the appendix), stored in the img/rgb subdirectory. To facilitate training on architectures configured for the Cityscapes dataset, we also provide the same images at a higher "2k" resolution in the img/rgb-2k subdirectory. For the latter, the sensor simulation was executed at the higher resolution, so that individual pixels carry the appropriate noise profile rather than up-sampled noise. The images are sequentially numbered with no padding, ranging from 1.png to 25000.png.
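Given this flat, unpadded numbering, locating an image and its companion files is straightforward. The sketch below assumes the subdirectory names quoted in this section (img/rgb, img/rgb-2k and meta) and that each metadata file shares its frame number with the corresponding PNG; the latter naming pattern is an assumption.

  # Build paths for one Synscapes frame; frame ids run 1..25000 with no zero padding.
  from pathlib import Path

  def frame_paths(root: str, frame_id: int) -> dict:
      base = Path(root)
      return {
          "rgb": base / "img" / "rgb" / f"{frame_id}.png",
          "rgb_2k": base / "img" / "rgb-2k" / f"{frame_id}.png",  # Cityscapes-friendly resolution
          "meta": base / "meta" / f"{frame_id}.json",             # per-image metadata (Section 3.3)
      }

  paths = frame_paths("/data/synscapes", 17)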

3.2 Annotation images

Each image is annotated with class, instance and depth information. The class annotation is stored as a single-channel PNG and uses the Cityscapes class id convention1. The instance images encode the instance id across the RGB channels such that the original id can be recovered by recombining the three channel values.

It should be noted that actors that are more than 99% occluded may be removed from the metadata file, but can still have small numbers of visible pixels in the RGB images.

The per-pixel depth values are stored in the floating point OpenEXR format, recording the planar depth (i.e. the z-depth component) of each pixel in meters, not the physical distance.
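The exact channel packing of the instance id is not reproduced here, so the decoding below is only a sketch under the assumption that the id is packed little-endian across R, G and B; the depth reader additionally assumes an OpenCV build with OpenEXR support.

  # Sketch: decode Synscapes annotation images under the stated assumptions.
  import os
  os.environ.setdefault("OPENCV_IO_ENABLE_OPENEXR", "1")  # must be set before importing cv2
  import cv2
  import numpy as np

  def load_instance_ids(path: str) -> np.ndarray:
      img = cv2.imread(path, cv2.IMREAD_UNCHANGED)          # OpenCV loads channels as B, G, R
      b, g, r = (img[..., c].astype(np.uint32) for c in range(3))
      return r + (g << 8) + (b << 16)                       # assumed packing: id = R + 256*G + 65536*B

  def load_depth(path: str) -> np.ndarray:
      depth = cv2.imread(path, cv2.IMREAD_UNCHANGED)        # planar z-depth in meters
      return depth[..., 0] if depth.ndim == 3 else depth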

3.3 Metadata

Metadata associated with each image is stored in the meta subdirectory, with a single JSON file corresponding to each RGB image. Three types of metadata are provided: scene metadata, which describes properties of the scene as a whole; camera metadata, describing the intrinsics and extrinsics of the sensor; and instance metadata, which provides details on the individual actors in each image. These are described in more detail in the paper’s appendix. As a result of a particular quirk in how Cityscapes and other datasets classify bicyclists and motorcyclists, the instance annotation considers the union of the rider and their mount, with the class determined by the vehicle type. The class segmentation image, however, distinguishes between the rider and the vehicle, as expected.
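To illustrate how the metadata can be consumed, the snippet below reads one JSON file and keeps only mostly visible actors. The "scene" and "camera" keys are documented in the appendix; the name of the key holding the per-instance records ("instance" below) is an assumption.

  # Sketch: read per-image metadata and filter instances by occlusion.
  import json

  def load_meta(meta_path: str, max_occlusion: float = 0.5) -> dict:
      with open(meta_path) as f:
          meta = json.load(f)
      visible = {iid: inst for iid, inst in meta.get("instance", {}).items()
                 if inst["occluded"] <= max_occlusion}      # per-instance occlusion fraction
      return {
          "ego_speed": meta["scene"]["ego_speed"],          # scenario parameter (Appendix A.2)
          "fx": meta["camera"]["intrinsic"]["fx"],          # camera intrinsics (Appendix A.1)
          "visible_instances": visible,
      }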

3.4 Distribution of metadata parameters

Figure 3: Synscapes’ broad distribution over scenario parameters allows for selection of subsets along several scenario parameters at once. Left column: overcast, right column: sunny. Top to bottom shows increasing number of cars.
Figure 4: Top row: Input images. Bottom row: predictions made by original authors’ reference model of DeepLab v3+, trained on Cityscapes. Left to right: Synscapes, Richter (GTA) and Synthia. Table 1 shows corresponding results.

Validation Road Sidewalk Building Wall Fence Pole Tr.Light Tr.Sign Vegetation Terrain Sky Person Rider Car Truck Bus Train Motorcycle Bicycle Mean IoU
FRRN Cityscapes 97.11 77.78 90.2 47.22 44.54 60.67 59.72 70.44 91.08 58.06 93.02 74.85 51.01 92.23 52.51 72.32 53.49 41.61 69.26 68.27
Synscapes 93.69 76.07 84.60 40.98 24.24 47.86 43.54 61.13 88.57 77.62 96.62 65.86 40.60 78.73 20.38 12.45 24.22 38.88 46.41 55.92
Richter 82.44 31.36 71.99 28.63 12.45 0 34.71 8.2 66.12 22.91 83.58 39.01 25.7 67.97 33.53 9.54 0.09 28.75 0 34.05
Synthia 40.84 18.53 62.47 4.52 0.51 18.21 0.24 1.15 48.46 0 80.55 21.19 4.02 27.68 0 0.99 0 10.3 2.96 18.03
DeepLab Cityscapes 98.15 84.85 92.71 57.34 62.14 65.20 68.62 78.87 92.68 63.46 95.33 82.26 62.84 95.37 85.30 89.10 80.92 64.55 77.34 78.79
Synscapes 94.83 77.76 88.02 49.66 35.70 59.78 53.90 80.73 89.86 79.73 95.40 78.74 63.19 72.68 23.64 15.76 22.94 62.41 64.49 63.64
Richter 86.67 43.09 82.67 39.74 24.03 0.00 46.27 30.80 67.28 23.83 90.81 71.83 54.65 80.34 59.58 16.62 0.16 52.01 0.53 45.84
Synthia 43.29 13.51 53.26 8.37 1.82 22.82 8.10 11.58 50.69 0.00 76.05 54.27 19.57 51.79 0.00 5.41 0.00 21.05 12.28 23.89
Table 1: Validation on synthetic data for reference versions of FRRN and DeepLab v3+ architectures.

Synscapes was constructed in such a way that each scenario parameter is varied independently, providing a broad distribution across all dimensions of variation. In particular, care was taken to ensure that all scenario parameters are de-correlated. For example, if we want to study the difference between images taken near sunrise and those taken with the sun at zenith, we still want the broadest possible distribution across all other scenario parameters. By using 25,000 unique scene variations for the 25,000 images we avoid unwanted correlations and make possible the sort of analysis studied in Section 6. Figure 3 shows selected images across both the sky_contrast and num_cars dimensions.

3.5 Visualization scripts

In order to visualize the instance metadata and view the dataset images sorted by scenario parameters, a set of scripts is provided2. These also serve as a reference for how to extract and utilize the metadata, for example the projection of 2D and 3D bounding boxes into image space (a minimal projection sketch follows the list below).

  • view-synscapes.py allows viewing of the dataset in sorted order according to any metadata parameter, e.g. according to sun height, or the number of visible cars.

  • visualize-synscape-metadata.py can visualize the class and instance images as overlays on the RGB images, and also display 2D and 3D bounding boxes along with their respective class types. For clearer visualization, instances may be culled based on their occlusion level.
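The core of the bounding-box projection mentioned above is a standard pinhole model; the minimal sketch below projects a point already expressed in the camera frame using the intrinsics from Appendix A.1. Converting from the ego-vehicle frame (x forward, y left, z up) to the camera frame via the extrinsics is omitted, and the assumed camera-frame convention (x right, y down, z forward) should be checked against the reference scripts.

  # Sketch: pinhole projection into pixel coordinates using Appendix A.1 intrinsics.
  import numpy as np

  FX, FY = 1590.83437, 1592.79032   # focal lengths in pixels
  U0, V0 = 771.31406, 360.79945     # principal point

  def project(point_cam: np.ndarray) -> tuple:
      """point_cam: (x_right, y_down, z_forward) in meters, assumed camera-frame convention."""
      x, y, z = point_cam
      return FX * x / z + U0, FY * y / z + V0

  u, v = project(np.array([1.0, 0.2, 12.0]))  # a point 12 m ahead, slightly right and below center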

4 Synthetic data in testing and validation

Figure 5: Examples from object detection validation using Faster R-CNN trained on KITTI and evaluated on images from the GTA test set (top) and Synscapes test set (bottom).

One of the most common uses of synthetic data is in simulation. Whereas simulation of a planner can be done on a symbolic level, simulation of a perception system requires generation of sensor data that is both in the same format (quantitatively equivalent) and of the same character (qualitatively equivalent) as real-world inputs. Simulating the data stream is generally straightforward, but simulating the sensor’s behavior as it reacts to (and potentially interacts with) its surroundings, e.g. optically or electromagnetically, is both complicated and difficult to achieve. Still, it is a crucial puzzle to solve. If the data created by a virtual sensor doesn’t correspond well to what its real-world counterpart would output, the simulation is not representative. For simulation of perception systems this is a significant problem, as it makes it difficult or impossible to determine whether e.g. the failure to detect a pedestrian in the simulation is due to domain shift (and the identical real-world situation would be handled correctly), or whether there is a true deficiency in the model. In the end, this raises questions as to whether, and to what degree, sensor simulations can be trusted.

In order to quantify this effect we use publicly available networks that have been pre-trained on real-world datasets, and we then run inference on synthetic datasets. Although one might expect synthetic datasets to be less complex than real-world data and therefore easier to predict, the domain shift currently seems to obscure most such effects, with consistently lower scores on the synthetic datasets than on the organic, real-world counterparts that the models were trained on. Still, the effect likely plays a role, and finding methods that decouple domain shift from the relative difficulty of a given image would be a worthwhile research topic.

4.1 Semantic segmentation

We use the FRRN [8] and DeepLab v3+ [2] architectures to evaluate performance on the semantic segmentation task. For each architecture we use the Cityscapes pre-trained models provided by the original authors and perform inference on the validation set of each synthetic dataset, along with Cityscapes itself for reference.

As shown in Table 1, DeepLab achieves higher scores than FRRN (as expected), both overall and for individual classes. Furthermore, looking at the scores for each of the synthetic datasets, we see that they are consistently ordered for both network architectures. This suggests that their relative performance is a result of differences in the synthetic data rather than of the network architecture itself. The Cityscapes-trained networks both achieve the best overall performance on Synscapes. This again suggests (but isn’t by itself proof) that the domain shift is smaller for Synscapes than for the other two datasets, and that visual realism has a significant impact on synthetic data’s applicability as a testing and validation tool. Figure 4 shows examples of predictions for the DeepLab architecture on Synscapes, Richter and Synthia.

4.2 Object detection

Training Validation mAP mAP@0.50 mAP@0.75
KITTI KITTI 0.456 0.716 0.484
KITTI GTA 0.061 0.115 0.059
KITTI Synscapes 0.206 0.400 0.187
KITTI + Synscapes Synscapes 0.570 0.813 0.634
Table 2: COCO metric results for object detection using Faster-RCNN with Resnet101 from Google’s Tensorflow Object Detection API. The mean AP (mAP) is computed for IoU [0.5 : 0.95], and mAP@0.50 and mAP@0.75 are computed for IoU 0.50 and 0.75 respectively.
Training Validation Car Pedestrian
SynScapes SynScapes 0.752 0.802
KITTI KITTI 0.534 0.847
KITTI GTA 0.021 0.096
KITTI Synscapes 0.355 0.438
KITTI + Synscapes Synscapes 0.843 0.782
Table 3: Individual scores for classes Car and Pedestrian from Faster-RCNN. Average precision (AP) is computed at IoU = 0.50 using the Pascal VOC metric.

To investigate the use of synthetic data for testing or validation of object detection, we use the Faster R-CNN architecture with Resnet101 [9], with pre-trained weights from the Google model zoo3, as a reference model. Detection is performed for the two KITTI classes car and pedestrian (pedestrian = person in Synscapes’ labeling).

Figure 6: Examples of predictions from DeepLab networks trained on synthetic data. Top row: Example image and ground truth annotations. Bottom row, left to right: Synscapes, Richter (GTA) and Synthia as training set.

The results in Table 2 show how object detection performs when trained on Synscapes in comparison to the Playing for Benchmarks dataset (denoted GTA) [10]. GTA represents the current state-of-the-art for game engine-based approaches. For both datasets, the test set used consists of 1,000 images. For the GTA dataset, we maximized scenario/feature coverage by randomly selecting 1,000 images from the over 130,000 images in the database. The higher detection scores on Synscapes (mAP = 0.206) compared to GTA (mAP = 0.061) can most likely be attributed to the more accurate sensor simulation and better designed feature variation. The individual class scores for car and pedestrian are displayed in Table 3. Figure 5 displays example images with detections from the two test sets.

The more accurate imaging sensor simulation and variation in the Synscapes dataset make the images more suitable for testing and analysis. As mentioned in Section 3, Synscapes was designed with the Cityscapes dataset in mind, and there is a significant domain shift between KITTI and Cityscapes, which can be seen e.g. in the differences in dynamic range and local contrast in the images, as well as in the density of car instances. Fine-tuning the KITTI-trained reference model improves the results drastically: pre-training on KITTI and fine-tuning on Synscapes (KITTI + Synscapes) yields an increase in performance compared to training and validating only on Synscapes. While this is familiar for synthetic-to-real transfer learning, the reverse direction is not well explored.

5 Synthetic data for training

Training Road Sidewalk Building Wall Fence Pole Tr.Light Tr.Sign Vegetation Terrain Sky Person Rider Car Truck Bus Train Motorcycle Bicycle Mean IoU
FRRN Cityscapes 97.11 77.78 90.2 47.22 44.54 60.67 59.72 70.44 91.08 58.06 93.02 74.85 51.01 92.23 52.51 72.32 53.49 41.61 69.26 68.27
SynScapes 90.71 47.54 76.46 23.16 25.12 37.19 35.4 38.95 79.11 20.87 86.4 58.56 39.72 82.09 14.48 27.84 14.7 12.91 47.53 45.20
Richter 40.27 21.15 62.52 7.16 6.85 8.41 11.03 1.52 75.42 12.55 60.09 31.71 0 27.43 14.91 7.47 7.98 0.23 0.02 20.88
Synthia 60.8 28.05 59.86 0.07 0.07 25.52 2.34 2.69 74.62 0 75.04 38.33 3.84 35.81 0 2.09 0 1.92 2.74 21.78
SynScapes + CS 97.70 81.64 91.27 51.34 49.37 65.35 66.87 75.59 91.78 61.09 94.15 78.65 58.22 93.94 70.51 82.40 79.09 54.30 72.63 74.52
Richter + CS 96.90 77.17 90.71 49.20 48.62 62.42 61.58 72.34 91.25 60.93 93.84 75.53 53.77 93.64 64.19 73.13 61.44 46.80 70.96 70.76
Synthia + CS 97.58 81.04 90.81 47.58 50.49 62.48 63.05 73.45 91.47 60.39 93.80 77.11 53.05 93.19 57.04 73.21 52.64 38.07 71.51 69.89
DeepLab Cityscapes 97.98 83.88 92.18 59.36 59.14 61.69 65.60 75.70 92.11 60.74 94.44 80.53 58.41 94.57 81.31 85.87 78.42 58.88 73.87 76.56
Synscapes 85.38 47.75 73.87 27.75 31.83 46.89 50.14 58.65 85.92 41.12 83.87 66.38 26.98 84.01 25.95 18.99 5.17 35.04 61.00 50.35
Richter 54.87 24.74 50.15 15.64 9.24 39.21 35.55 13.54 81.12 27.75 38.07 58.48 17.34 78.63 23.51 31.07 0.28 12.69 0.00 32.20
Synthia 71.04 29.91 69.55 3.24 0.15 33.09 28.89 12.46 76.22 0.00 71.95 65.98 23.21 75.92 0.00 23.82 0.00 13.13 23.25 32.73
Synscapes + CS 98.14 84.82 92.95 57.66 62.52 66.23 70.06 78.76 92.72 61.32 95.05 82.86 62.92 95.46 84.59 88.98 79.21 65.95 77.94 78.85
Richter + CS 97.45 81.46 92.66 58.45 63.31 65.23 69.37 78.69 92.49 62.60 94.93 81.79 60.96 95.20 80.99 85.54 72.18 63.71 76.80 77.57
Synthia + CS 97.98 84.12 92.52 54.33 56.56 64.80 68.84 78.01 92.51 61.41 95.00 82.05 62.60 95.14 81.58 86.28 77.32 63.42 76.99 77.45
Table 4: Base training and fine tuning results for FRRN and Deeplab v3+ architectures.

Training Validation Road Sidewalk Building Wall Fence Pole Tr.Light Tr.Sign Vegetation Terrain Sky Person Rider Car Truck Bus Train Motorcycle Bicycle Mean IoU
DeepLab v3+ Cityscapes Cityscapes 98.15 84.85 92.71 57.34 62.14 65.20 68.62 78.87 92.68 63.46 95.33 82.26 62.84 95.37 85.30 89.10 80.92 64.55 77.34 78.79
Synscapes SynScapes 98.58 93.80 95.20 92.75 87.48 72.50 74.28 86.78 93.20 90.38 97.24 84.52 77.87 93.99 85.10 88.48 89.87 76.46 74.57 87.00
Richter Richter 97.14 85.71 90.42 63.70 45.01 0.00 65.14 61.56 86.17 74.79 96.68 76.36 39.97 90.05 84.15 82.35 0.02 58.75 0.00 63.05
Synthia Synthia 93.14 91.27 92.02 69.33 57.60 54.43 17.90 31.78 79.51 0.00 93.96 74.64 53.05 89.28 0.00 89.85 0.00 69.77 29.74 57.22
Table 5: Self-validation results for Cityscapes (reference model), Synscapes, Richter and Synthia using DeepLab v3+.

Besides their use as testing and validation platforms, simulators often also promise the possibility of acting as a source for training data. In an ideal world, synthetic data would be superior to organic data, as detailed and accurate annotations could be generated at large volumes, and an arbitrarily large set of scenarios could be tested in parallel. In reality, however, there are several factors limiting this. Domain shift is perhaps an even larger problem in training than in testing, as is the challenge of creating virtual worlds with enough variation to provide meaningful learning material at scale.

In order to analyze how different datasets perform as sources of training material, we train multiple state-of-the-art network architectures for semantic segmentation and object detection and compare how they perform when evaluated on organic, real-world data. In order to avoid chasing performance numbers as a goal in itself, we use the same hyperparameters for each training session, with the aim of giving an uncolored view into each dataset’s strengths and weaknesses.

5.1 Semantic segmentation

We train the same two networks as in Section 4 on Synscapes as well as the Richter and Synthia datasets. FRRN was trained with the default augmentation settings provided with the reference implementation, and was run for 100,000 iterations at a learning rate of with a batch size of 3 on a single GPU. For fine tuning, the best synthetic-only training iteration was chosen, and a second set of 100,000 iterations was then run on Cityscapes. For DeepLab, training was run on 4 GPUs for 100,000 iterations with a crop size of 513, learning rate and batch size 20.

Table 4 shows the per-class and mean IoU scores for the Cityscapes validation set. As with the validation tests in Section 4, the results from both network architectures are consistent, with Synscapes producing the highest overall and per-class performance for all classes but one, with both FRRN and DeepLab. If the higher validation score had been a function of Synscapes being ”easier” than the other datasets, we would have expected the inverse effect during training; instead, the opposite seems to hold. The fine tuning results show a tighter race, but when looking at the relative improvement from the baseline (76.56%), Synscapes provides more than twice the gain compared to both Richter and Synthia. Figure 6 shows predictions on Cityscapes for DeepLab networks trained on the three datasets.

Finally, we study the performance for self-validation on the DeepLab networks. In Table 5 we note that Synscapes achieves a significantly higher score when evaluated on itself (87% versus 63% and 57% for Richter and Synthia, respectively). Several factors are likely involved: Synthia and Richter’s reduced geometric and rendering complexity likely results in a smaller feature space. Polygonal edges are clearly visible, and it is conceivable that the network confuses different classes due to these inadvertent characteristics. The distribution of objects and class imbalance is also likely at play. In contrast, Synscapes’ higher realism avoids the first problem, and its broad distribution of scenario parameters improves on the second. As a measurement of whether all classes in a dataset are equally well recognized, we compute the standard deviation for classes that are present in the dataset and find for Synscapes, and 17.51 and 24.55 for Richter and Synthia.

5.2 Object Detection

Training Validation mAP mAP@0.50 mAP@0.75
KITTI KITTI 0.456 0.716 0.484
Synscapes KITTI 0.092 0.237 0.069
Synscapes + KITTI KITTI 0.519 0.791 0.589
Synscapes Synscapes 0.488 0.777 0.518
Table 6: COCO metric results for object detection using Faster-RCNN with Resnet101 from Google’s Tensorflow Object Detection API. The mean AP (mAP) is computed for IoU [0.5 : 0.95], and mAP@0.50 and mAP@0.75 are computed for IoU 0.50 and 0.75 respectively.
Training Validation Car Pedestrian
KITTI KITTI 0.534 0.847
SynScapes KITTI 0.344 0.131
SynScapes + KITTI KITTI 0.902 0.679
Table 7: Individual results for the classes ’car’ and ’pedestrian’; average precision (AP) is computed at IoU = 0.50 using the Pascal VOC metric.
Training Easy Medium Hard
KITTI 97.4% 89.0% 75.1%
SynScapes 68.6% 53.4% 44.5%
SynScapes + KITTI 98.7% 90.0% 77.0%
Table 8: Validation results from FastBox [13], using the KittiBox implementation. The training was performed from scratch and evaluated on 500 test images extracted from the KITTI dataset. The result scores are given as the average precision (AP) for the ’car’ and ’pedestrian’ classes.

We train two architectures for object detection: Faster R-CNN with Resnet101 [9], using the Tensorflow object detection API from Google4, and the KittiBox implementation5 of FastBox [13]. With Faster R-CNN, we evaluate the training performance using the KITTI benchmark dataset [5] as baseline. Tables 6 and 7 show results from training on the two classes car and pedestrian, evaluated on test sets consisting of 500 images extracted from the KITTI dataset and Synscapes, respectively. The first column indicates which dataset was used during training and/or fine tuning, and the second which test set was used. For all training and fine tuning we used 300k iterations and a learning rate of for iterations up to 200k and for iterations over 200k. Although training on Synscapes alone performs poorly on KITTI, due to the domain shift emanating from the fact that Synscapes was designed with the Cityscapes dataset in mind, the COCO results obtained when applying fine tuning show a significant increase in performance compared to the baseline in the first row. Interestingly, the fine-tuned results, both when training on Synscapes and validating on KITTI (Table 6) and when training on KITTI and validating on Synscapes (Table 2), not only bridge the domain gap in both directions, but also increase performance with respect to both baselines (KITTI - KITTI and Synscapes - Synscapes).

Table 8 shows an evaluation using the KittiBox implementation of FastBox [13]. Training was performed with a learning rate of and 250k iterations for both training on Synscapes and fine tuning. Also in this case, the use of synthetic data improves the performance significantly.

6 Analysis using Synscapes

The procedural approach used to generate the imagery makes Synscapes particularly suitable for algorithm and dataset analysis. The carefully controlled distribution of the generating parameters used in the scenario generation enables us to make cuts in, or bin, the resulting images and labels along the parameter dimensions. Another enabling factor is the wealth of metadata generated for each rendered image.

Given the statistics of the feature variation in Synscapes, slicing or binning along one dimension leaves even distributions along all other dimensions. Figure 3 illustrates how the weather conditions are binned into sunny and overcast skies, while the density of cars is quantized into bins ranging from a few to many cars in the image. Similarly, the per-instance metadata makes it possible to slice individual object/class instances along dimensions such as occlusion, heading, or distance from the ego vehicle, which is useful in analyzing how an existing ML model reacts to varying inputs.

6.1 Semantic segmentation

In order to explore statistically how DeepLab’s pre-trained Cityscapes model behaves, we run prediction on the Synscapes training set (24,000 images). By binning either the images or the individual instances in each image according to one or more of the scenario and instance parameters’ values, we can average the performance of the network for all the images in a given bin, knowing that they represent a wide distribution of parameter values in all dimensions except the binned one.
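The binning itself is straightforward; the sketch below groups frames by one scenario parameter and averages a precomputed per-frame score. Here per_frame_iou is a hypothetical mapping from frame id to an IoU value computed beforehand from the network's predictions, not something shipped with the dataset.

  # Sketch: bin frames by a scenario parameter (e.g. sun_height) and average a score per bin.
  import json
  import numpy as np

  def mean_score_per_bin(meta_dir, frame_ids, per_frame_iou, param="sun_height", n_bins=8):
      values, scores = [], []
      for fid in frame_ids:
          with open(f"{meta_dir}/{fid}.json") as f:
              values.append(json.load(f)["scene"][param])
          scores.append(per_frame_iou[fid])
      values, scores = np.asarray(values), np.asarray(scores)
      edges = np.linspace(values.min(), values.max(), n_bins + 1)
      which = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
      return [float(scores[which == b].mean()) if np.any(which == b) else float("nan")
              for b in range(n_bins)]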

Effect of object orientation

Figure 7: Analyzing the effect of each instance’s orientation. IoU score for the four basic orientations over depth. (Cityscapes-trained Deeplab v3+ on Synscapes.)

First, we use the oriented 3D bounding box to determine the relative orientation of each instance and study its effect on the segmentation performance. The pixels for each actor instance are categorized as belonging to one of the four cardinal directions (forward, backward, left or right relative to the ego vehicle), and a separate IoU score is computed for each subset. Additionally, we divide the predicted pixels of each direction category into 16 segments along depth to show how the performance also varies as a function of depth, as seen in Figure 7.
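A sketch of that binning is shown below: the bbox3d metadata provides the instance's forward (x) vector in the ego frame, whose yaw can be quantized into four cardinal directions, while distance is quantized into depth segments. The field names inside bbox3d ("origin", "x") are assumed from the appendix description, and the exact angle thresholds are illustrative.

  # Sketch: heading and depth bins for one instance, under the stated assumptions.
  import math
  import numpy as np

  def heading_bin(bbox3d: dict) -> str:
      fx, fy = bbox3d["x"][0], bbox3d["x"][1]     # instance forward vector in the ego frame
      yaw = math.degrees(math.atan2(fy, fx)) % 360.0
      if yaw < 45 or yaw >= 315:
          return "forward"    # heading the same way as the ego vehicle
      if yaw < 135:
          return "left"
      if yaw < 225:
          return "backward"   # oncoming
      return "right"

  def depth_bin(bbox3d: dict, max_depth: float = 120.0, n_segments: int = 16) -> int:
      dist = float(np.linalg.norm(bbox3d["origin"]))  # distance of the box origin from the ego vehicle
      return min(int(dist / max_depth * n_segments), n_segments - 1)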

In this visualization, we see that the Person and Rider classes are well recognized independently of view direction, but for Car, Motorcycle and Bicycle the network performs worse for instances in the forward/backward directions than left/right.

Truck, Bus and Train all show differences between oncoming and same-side instances, which could be due both to bias in the training set distribution and to the fact that e.g. a truck is easier to distinguish from a bus when seen from the rear than when seen from the front.

Effect of object occlusion

Figure 8: Analyzing the effect of instance occlusion. The heat map shows relative IoU score across occlusion and depth. (Cityscapes-trained Deeplab v3+ on Synscapes.)
Figure 9: Linear regression between meta-parameters and per-class IoU score for semantic segmentation. Line width indicates correlation coefficient, and only correlations with p-value 0.05 or less are included for clarity. Scores for the 64 individual subsets are overlaid to show variance. (Cityscapes-trained Deeplab v3+ on Synscapes.)

Synscapes also provides a per-instance metric of the fraction of each object that is visible to the camera. We first divide the predicted pixels into four subsets according to the amount of occlusion; each subset is then further divided according to depth. The IoU score for each subset is shown in Figure 8 as a heat map, which visualizes some inherent properties of the Cityscapes dataset.

We can see that Person and Rider score highest when unoccluded, whereas Car and Bus both have highest scores in areas with partial occlusion. This is likely due to differing exemplar balances in the training data, as Cityscapes is quite busy overall, with entirely unoccluded vehicles the exception rather than the rule.

We also note that Rider, Motorcycle and Bicycle perform very similarly, as is expected given their correlated placement in the training data.

Correlating performance to scenario parameters

Clearly, a network will not achieve the same performance on all input images. Certain configurations are more likely to be analyzed correctly, others may be harder. This effect is due to many factors: whether a given feature is present in the dataset, whether the exemplars are observed in varied enough conditions, etc. Knowing which configurations are problematic is highly desirable but also difficult to achieve with organic data; annotating bounding boxes can be done precisely, but labeling abstract descriptions of an image, e.g. degree of fogginess or amount of wear and tear to a road surface, is difficult to do consistently. Finally, capturing sufficient amounts of data to make reliable analyses, especially for uncommon situations, is expensive.

In order to illustrate how Synscapes’ detailed metadata and broad variation can be used to address this type of analysis, we explore whether there are correlations between the segmentation network’s performance and the scenario parameters used to generate the synthetic data. For a given selection of parameters, we divide the predictions on the 24,000 images in the training set into 64 subsets, and compute an IoU score for each class. We then perform a standard linear regression, and record the correlation factor and p-value for each class/parameter pair. Pairs with a p-value over 0.05 are discarded, with Figure 9 showing the resulting samples.
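A sketch of that regression step: for each (parameter, class) pair, fit a simple linear model over the 64 subset scores and keep only the statistically significant slopes. The inputs subset_params and subset_scores are hypothetical dictionaries of length-64 arrays produced by the binning described above.

  # Sketch: keep only (parameter, class) pairs with p-value <= alpha.
  from scipy import stats

  def significant_correlations(subset_params, subset_scores, alpha=0.05):
      kept = []
      for pname, x in subset_params.items():
          for cname, y in subset_scores.items():
              fit = stats.linregress(x, y)
              if fit.pvalue <= alpha:
                  kept.append((pname, cname, fit.rvalue, fit.pvalue))
      return kept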

We find that motion blur (ego_speed) and time of day (sun_height) are the scenario parameters with the strongest effect on the network’s predictive performance. As motion blur increases in an image, features are smeared, and although the result is easy for a human to interpret, the resulting features appear very different to a convolution-based neural network. In particular, classes with strong vertical features, such as Pole, Wall and Fence, are most strongly affected. Although detection of the road surface degrades, it does so the least of the measured classes, likely because the road is (in general) the closest class to the ego vehicle and therefore contains the most motion-blurred features in the training set.

Similarly, as the sun approaches the horizon, overall contrast in the image is reduced. Because of auto-exposure, the image isn’t necessarily darker, but without strong shadows features are generally more difficult to distinguish. We note the difference between overcast conditions (which illuminate the scene more evenly) and sunset conditions, which may have high contrast. Finally, we also see an expected correlation between the curb height and the score for Sidewalk, as a higher curb makes the edge more distinguishable.

6.2 Object detection

In order to do a similar study for object detection, we use a reference model of a Faster R-CNN network trained on KITTI and evaluate performance on 20,000 Synscapes images.

Effect of object orientation

Compared to the Cityscapes-based DeepLab model, we see some similarities and also some differences. First of all, as shown in Figure 10, the Pedestrian class is detected consistently for all four directions, although there is a sharp drop-off in performance at a distance of around 50 meters, whereas the segmentation model’s performance degraded more linearly with distance.

Similarly to the segmentation model, the Car class has a distinct directional dependence. However, where the Cityscapes-trained model performs significantly better on cars seen from the side, the difference for the detection network is more subtle: it performs best for cars facing the ego vehicle (heading backwards), followed by cars facing the same direction. Interestingly, beyond 80 meters, cars are detected somewhat more reliably when seen from the side.

Effect of object occlusion

Figure 10: Analyzing the effect of instance orientation. IoU score for the four basic orientations over depth. (Faster R-CNN KITTI-trained model evaluated on Synscapes.)

Figure 11 shows a heat map of the Person and Car classes. The falloff in performance with distance is again clear, and we can see that e.g. an 80% occluded person at 10 meters is as difficult to detect as an unoccluded person at more than 50 meters. For distances up to 40 meters, performance on the Person class is strongly affected by occlusion, but beyond this range occluded instances are detected as well as unoccluded ones.

Cars are more evenly affected by occlusion, and differ from the segmentation model in that unoccluded cars are more consistently detected than partially occluded ones. This is consistent with inspection of the training datasets: KITTI contains many more instances of solitary cars than Cityscapes, where cars more often appear in clusters.

The low score for nearby, occluded cars (along the bottom of the graph) is largely due to under-representation in Synscapes; there aren’t many objects that can occlude a car in the 0-5 meter distance range.

Figure 11: Analyzing the effect of instance occlusion. The heat map shows relative IoU score across occlusion and depth. (Faster R-CNN KITTI-trained model evaluated on Synscapes.)

7 Conclusions and future work

The goal of this paper is to examine the role of realism in synthetic data. We compared Synscapes to two previous publicly accessible datasets: Synthia, which is purpose-built for the street parsing context, and one generated from Grand Theft Auto (Richter/GTA in the text), which, as a commercial computer game, isn’t inherently targeted towards machine learning, but is entirely street scene-based. The GTA-based dataset is the most varied in terms of the number of pedestrian and car models, and contains a large number of different types of buildings and situations. Synscapes has a smaller set of archetypes, but uses much higher fidelity geometry, textures and image synthesis techniques, and also provides entirely unique images.

In order to investigate the role that realism plays, we evaluated several different uses of synthetic data in the computer vision-based machine learning context. First, we looked at the validation case, which represents the use of virtual simulation to test networks that have been trained on organic, real-world data. Next, we looked at the use of synthetic data as a source of training material and evaluated the resulting networks on organic data. Finally, we studied each synthetic dataset to see how well networks trained on it can predict on images from their own training domain.

In each case, and for both semantic segmentation and object detection, Synscapes performs significantly better than the other datasets. As the performance advantages hold consistently across each evaluation method, we see strong evidence for the importance of realism in synthetic data. There are no indications that neural networks are able to naturally abstract away domain shift, and as such it is important that software aiming for accurate sensor simulation attempt to achieve the highest possible realism.

Finally, we also leveraged Synscapes’ unique images, annotations and metadata to yield insights into existing organically-trained neural networks. By using the synthetic data as a richly annotated testing proxy, we were able to gain insight into the Cityscapes and KITTI datasets, and discover correlations and biases in networks trained on them.

Future work could explore in finer detail the difference that individual choices in realism and sensor simulation fidelity make, from geometric and textural variation and accuracy, to material physicality and illumination complexity. But until such metrics are established, the fundamental indications point to realism playing a large part in making simulation a core of the machine learning and computer vision playbook.

Appendix A Appendix

a.1 Camera metadata

The camera intrinsics and extrinsics are constant throughout the dataset, but for purposes of completeness, they can be found in the "camera" key of the metadata files.

  "extrinsic": {
    "pitch": 0.038,
    "roll": -0.0,
    "x": 1.7,
    "y": 0.1,
    "yaw": -0.0195,
    "z": 1.22
  },
  "intrinsic": {
    "fx": 1590.83437,
    "fy": 1592.79032,
    "resx": 1440,
    "resy": 720,
    "u0": 771.31406,
    "v0": 360.79945
  }

a.2 Scenario metadata

The variables used to drive the configuration of each image can be found under the scene key.

  • altitude_variation specifies the height difference in meters. Note that the extrema may be outside the camera’s view.

  • curb_height in meters.

  • ego_speed in meters per second. This implicitly indicates the amount of overall motion blur in each image.

  • fence_height in meters.

  • fence_presence specifies whether fences are present in the scene. Note that they may be hidden from the camera.

  • median_presence indicates whether there is a median in the middle of the road.

  • num_* determines the number of actors of a given class that are visible in the image.

  • parking_angle defines the angle at which cars are parked; either 0 (parallel), 45 or 90 degrees. Some images do not include a parking lane, as specified by parking_presence.

  • rel_dist_to_isect contains the distance from the center of the ego vehicle to the center of the next street intersection.

  • sidewalk_width in meters.

  • sky_contrast is expressed as the natural logarithm of the ratio between the 99th-percentile pixel value and the mean pixel value. Values around 2.0 indicate overcast conditions, while sunny mid-day images lie around 6.0 (a small sketch of this measure follows this list).

  • sun_height Normalized angular height of the sun with 0.0 at the horizon and 1.0 at zenith. As the sun’s height is simulated for a point away from the Earth’s equator, the sun height never reaches full zenith.

  • wall_height in meters, with wall_presence indicating whether the scene contains walls.
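The sky_contrast measure referred to in the list above can be reproduced directly from an image; whether the original measure uses all RGB values or a luminance channel is not stated here, so treating the raw pixel values jointly is an assumption of this sketch.

  # Sketch: natural log of the ratio between the 99th-percentile and mean pixel value.
  import numpy as np

  def sky_contrast(image: np.ndarray) -> float:
      pixels = image.astype(np.float64).ravel()
      return float(np.log(np.percentile(pixels, 99) / pixels.mean()))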

a.3 Instance metadata

Each instance of the non-static classes (person, rider, car, truck, bus, train, motorcycle and bicycle) is annotated with an instance id (see above), and for each instance id the metadata contains information about the following:

  • bbox2d specifies the image bounding box extents horizontally and vertically in normalized image coordinates. The depth extents are also provided, specified in meters.

  • bbox3d provides an oriented 3D bounding box, defined by the origin (at the rear lower right corner), the x vector facing forward, the y vector to the left, and the z vector facing up. The lengths of the orientation vectors define the extents in meters. All vectors are defined relative to the ego-vehicle reference frame (see the sketch after this list).

  • class indicates the class of the instance for convenience. The same information can be inferred from the class and instance images jointly.

  • occluded specifies the fractional occlusion of each instance, defined as the ratio between actual visible pixels and the number of pixels the instance would occupy if it were completely unoccluded. This is generally accurate to within 1%.

  • truncated provides the portion of the instance’s surface area that lies outside the image view. The accuracy is similarly within 1%.
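As referenced in the bbox3d entry above, the oriented box can be expanded into its eight corners in the ego-vehicle frame; the field names ("origin", "x", "y", "z") are assumed from the description above.

  # Sketch: corners of a Synscapes-style oriented 3D box, under the stated assumptions.
  from itertools import product
  import numpy as np

  def box_corners(bbox3d: dict) -> np.ndarray:
      origin = np.asarray(bbox3d["origin"], dtype=float)  # rear lower right corner
      x = np.asarray(bbox3d["x"], dtype=float)            # forward edge vector (length = box length)
      y = np.asarray(bbox3d["y"], dtype=float)            # leftward edge vector (length = box width)
      z = np.asarray(bbox3d["z"], dtype=float)            # upward edge vector (length = box height)
      return np.array([origin + a * x + b * y + c * z
                       for a, b, c in product((0.0, 1.0), repeat=3)])  # shape (8, 3)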

Footnotes

  1. https://github.com/mcordts/cityscapesScripts
  2. https://github.com/7dlabs/synscapes-utils
  3. https://github.com/tensorflow/models/tree/master/research/object_detection
  4. https://github.com/tensorflow/models/tree/master/research/object_detection
  5. https://github.com/MarvinTeichmann/KittiBox

References

  1. Blasinski, H., Farrell, J., Lian, T., Liu, Z., and Wandell, B. Optimizing image acquisition systems for autonomous driving. Electronic Imaging 2018, 5 (2018), 1–7.
  2. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018).
  3. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
  4. Gaidon, A., Wang, Q., Cabon, Y., and Vig, E. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 4340–4349.
  5. Geiger, A. Are we ready for autonomous driving? the KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Washington, DC, USA, 2012), CVPR ’12, IEEE Computer Society, pp. 3354–3361.
  6. Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research 32, 11 (2013), 1231–1237.
  7. Kajiya, J. T. The rendering equation. SIGGRAPH Comput. Graph. 20, 4 (Aug. 1986), 143–150.
  8. Pohlen, T., Hermans, A., Mathias, M., and Leibe, B. Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on (2017).
  9. Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 91–99.
  10. Richter, S. R., Hayder, Z., and Koltun, V. Playing for benchmarks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017 (2017), pp. 2232–2241.
  11. Richter, S. R., Vineet, V., Roth, S., and Koltun, V. Playing for Data: Ground Truth from Computer Games. Springer International Publishing, Cham, 2016, pp. 102–118.
  12. Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A. M. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016).
  13. Teichmann, M., Weber, M., Zöllner, J. M., Cipolla, R., and Urtasun, R. Multinet: Real-time joint semantic reasoning for autonomous driving. CoRR abs/1612.07695 (2016).
  14. Tsirikoglou, A., Kronander, J., Wrenninge, M., and Unger, J. Procedural modeling and physically based rendering for synthetic data generation in automotive applications. arXiv preprint arXiv:1710.06270 (2017).