Semantic Road Layout Understanding by Generative Adversarial Inpainting
Autonomous driving is becoming a reality, yet vehicles still need to rely on complex sensor fusion techniques to fully understand the scene they act in. Being able to discern the static environment from the dynamic entities that populate it will improve scene comprehension algorithms and will constrain the reasoning process about moving objects. With this in mind, we propose a scene comprehension task aimed at providing a complete understanding of the static surroundings, with particular attention to road layout. To keep the on-board sensor requirements to a bare minimum, we rely solely on semantic segmentation masks, which can be reliably obtained with computer vision algorithms from any RGB source. We cast this problem as an inpainting task based on a Generative Adversarial Network model that removes all dynamic objects from the scene and focuses on understanding its static components, such as streets, sidewalks and buildings. We evaluate this task on a synthetically generated dataset obtained with the CARLA simulator, demonstrating the effectiveness of our method.
Federico Becattini MICC - University of Florence email@example.com Lorenzo Berlincioni MICC - University of Florence firstname.lastname@example.org Leonardo Galteri MICC - University of Florence email@example.com Lorenzo Seidenari MICC - University of Florence firstname.lastname@example.org Alberto Del Bimbo MICC - University of Florence email@example.com
Preprint. Work in progress.
1 Introduction
Image inpainting refers to the task of predicting missing or damaged parts of an image by inferring them from contextual information. A large variety of image inpainting techniques have been developed over the years for different applications, spanning from image editing and restoration to more complex tasks such as semantic inpainting (Pathak et al., 2016). In this scenario, rather than cleaning images from noise or recovering small damaged parts, large portions of the image are reconstructed, bringing high-level semantics of scene and objects into the loop. A recent trend has seen Generative Adversarial Networks (GANs) as the main protagonists of image inpainting techniques. Existing methods for image inpainting focus on completing natural scene images and are limited to RGB images.
We propose to tackle the problem of image inpainting in a domain other than RGB images, namely semantic segmentation data. A first attempt in this direction has recently been made by Song et al. (2018), who use segmentations to guide the generation of the RGB reconstruction and obtain better-quality images that are more pleasing to the human eye. In contrast, we completely discard the RGB content and focus only on semantic information, since we want to reconstruct the signal hidden in the image itself, and not its texture.
This motivation arises from the need to precisely comprehend the semantics of what is occluded in applications where appearance information is of relatively low importance. This is of particular interest in the field of autonomous driving, where clutter and occlusion occur frequently, limiting the understanding of a scene and therefore posing a threat to safety. The method we propose tackles this problem by attempting to fully understand the static layout of the scene through the removal of dynamic objects. We believe that being able to understand the environment in which driving agents act would translate directly into a considerable advantage for constraining any reasoning about their possible behaviors.
To the best of our knowledge, we are the first to propose a semantic segmentation inpainting method focused on reconstructing the hidden semantics of a scene using GANs. The advantages of using semantic segmentations rather than RGB data are twofold: on the one hand, it allows us to precisely identify dynamic objects at the pixel level and use this information to know where the occlusion is; on the other hand, it directly yields a complete understanding of the image, without the need to apply further computer vision algorithms to the output of our method. Moreover, inpainting methods for RGB images, despite the huge steps forward seen in recent years, still produce images that may be imprecise and difficult to interpret. Our method instead is capable of directly inpainting the categorical class of the restored pixels, so the reconstruction needs no further interpretation.
2 Semantic Road Layout Understanding
In this paper we propose the novel task of Semantic Road Layout Understanding, in which we provide a pixel-wise estimate of the static components of an urban scenario. The input and output of this task are categorical mappings of the objects in the scene, where each pixel is assigned to a semantic class. We formalize the task as inpainting in ego-vehicle images, where static components have to be predicted by removing the dynamic objects occluding the scene. More formally, let $\mathcal{C}$ be a set of visual categories, composed of a subset of static classes $\mathcal{S}$ and a subset of dynamic classes $\mathcal{D}$. Given an image, the aim is to convert its semantic mapping with values in $\mathcal{C}$ into a semantic mapping with values in $\mathcal{S}$. In Fig. 1 we show an example of the expected input and output for the Semantic Road Layout Understanding task.
In order to obtain pairs of images with and without dynamic objects, we generate synthetic frames with CARLA (Dosovitskiy et al., 2017). CARLA is an open-source urban driving simulator built on Unreal Engine. Apart from providing a sandbox for autonomous driving algorithms, it offers functionalities for recording sequences while varying the number of dynamic objects such as cars, pedestrians and bicycles. The sequences can be acquired as photo-realistic RGB videos or converted on the fly into depth or semantic segmentation mappings. Thanks to this functionality we are able to programmatically generate perfectly aligned pairs of pixel-wise semantic maps with and without dynamic objects. These pairs can be used to train and evaluate the task.
3 Semantic Segmentation Inpainting
Our proposed model for Semantic Road Layout Understanding follows a Generative Adversarial Network paradigm: a generator is trained to produce plausible inpaintings and fool a discriminator, which in turn is trained to recognize whether an image comes from the real or from the reconstructed data distribution.
We extend the Generative Inpainting Network of Yu et al. (2018), where a coarse-to-fine approach is followed to generate RGB images. The authors introduced an attention layer to transform contextual regions into convolutional filters and estimate the correlation between background and foreground patches. This contextual attention is used to learn where to borrow image content from and to guide the inpainting process. Since the reconstruction process involves a certain degree of uncertainty, the model is trained with a spatially discounted loss, which avoids heavily penalizing pixels far from the boundaries of the region to inpaint.
In this paper we modify this architecture to work with N-dimensional data instead of just RGB images. The input segmentation masks are fed to the network as a $W \times H \times N$ one-hot encoded tensor of the class labels in the semantic map, where $W$ is the width of the image, $H$ its height and $N$ the number of available semantic categories. In the same way, we train the network to output a new tensor with the same width and height but containing only categories belonging to the static subset.
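As an illustration of the encoding described above, the following is a minimal NumPy sketch and not the authors' implementation. The function names `one_hot_encode` and `decode`, the toy label map and the choice of four classes are assumptions made for the example. Note that NumPy arrays are laid out as (height, width, classes).

```python
import numpy as np

def one_hot_encode(label_map: np.ndarray, num_classes: int) -> np.ndarray:
    """Turn an (H, W) map of integer class ids into an (H, W, N) one-hot tensor."""
    if label_map.min() < 0 or label_map.max() >= num_classes:
        raise ValueError("class ids must lie in [0, num_classes)")
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype=np.float32)[label_map]

def decode(one_hot: np.ndarray) -> np.ndarray:
    """Recover the (H, W) label map by taking the arg-max over the class axis."""
    return one_hot.argmax(axis=-1)

# Toy 2x3 semantic map with N = 4 hypothetical classes.
labels = np.array([[0, 1, 1],
                   [3, 2, 0]])
encoded = one_hot_encode(labels, num_classes=4)
print(encoded.shape)  # (2, 3, 4)
```

Each pixel carries exactly one active channel, so the arg-max over the last axis recovers the original categorical map.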
To adapt the network to a segmentation task instead of an RGB inpainting task, we changed the reconstruction loss from an $\ell_1$ loss to a softmax cross-entropy loss. This corresponds to casting the problem as a classification task instead of a regression one. We do this because we do not want to impose a metric over the mutually exclusive categories, and because our goal is to obtain a hard assignment of each pixel to a specific category, as opposed to classical inpainting scenarios where a perceptually close RGB value is acceptable.
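The classification view of the reconstruction loss can be sketched as follows. This is a plain NumPy illustration under assumed shapes, with logits and one-hot targets both of shape (H, W, N). The function name `softmax_cross_entropy` and the toy inputs are hypothetical and do not come from the paper's code.

```python
import numpy as np

def softmax_cross_entropy(logits: np.ndarray, target_one_hot: np.ndarray) -> np.ndarray:
    """Per-pixel softmax cross-entropy; returns an (H, W) loss map."""
    # Subtract the per-pixel maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(target_one_hot * log_probs).sum(axis=-1)

# Toy 2x2 target with N = 4 hypothetical classes.
target_labels = np.array([[0, 1], [2, 3]])
target = np.eye(4)[target_labels]

# Uninformative prediction: every class equally likely at every pixel.
loss_uniform = softmax_cross_entropy(np.zeros((2, 2, 4)), target)

# Confident and correct prediction: large logit on the true class.
logits_confident = 20.0 * target
loss_confident = softmax_cross_entropy(logits_confident, target)

# A hard per-pixel assignment is obtained with an arg-max over classes.
hard_labels = logits_confident.argmax(axis=-1)
```

Unlike a distance between class indices, this loss treats every wrong category the same way, so no ordering among categories is implied.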
To train the model, for each image we consider a crop and randomly sample within it a rectangular binary mask of bounded size. The portion of the input covered by the mask is then blacked out and reconstructed by the generator. The discriminator is fed with both original and reconstructed patches and trained to discriminate between them.
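The mask sampling step can be sketched as below. The crop size (64) and maximum mask size (32) are placeholder values chosen for the example, not the values used in the paper, and the helper names `sample_rect_mask` and `black_out` are hypothetical.

```python
import numpy as np

def sample_rect_mask(crop_h, crop_w, max_h, max_w, rng):
    """Binary mask (1 = region to inpaint) holding one random rectangle inside the crop."""
    h = int(rng.integers(1, max_h + 1))          # rectangle height in [1, max_h]
    w = int(rng.integers(1, max_w + 1))          # rectangle width in [1, max_w]
    top = int(rng.integers(0, crop_h - h + 1))   # keep the rectangle inside the crop
    left = int(rng.integers(0, crop_w - w + 1))
    mask = np.zeros((crop_h, crop_w), dtype=np.float32)
    mask[top:top + h, left:left + w] = 1.0
    return mask

def black_out(one_hot_crop, mask):
    """Zero every class channel inside the masked region."""
    return one_hot_crop * (1.0 - mask[..., None])

rng = np.random.default_rng(0)
# Random one-hot crop with 3 hypothetical classes, standing in for a real semantic map.
crop = np.eye(3, dtype=np.float32)[rng.integers(0, 3, size=(64, 64))]
mask = sample_rect_mask(64, 64, max_h=32, max_w=32, rng=rng)
generator_input = black_out(crop, mask)
```

In a one-hot tensor, a blacked-out pixel has all channels at zero, which distinguishes it from every valid class.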
Since urban settings have a predominant geometrical component, we added to the generator a spatially aware loss that makes the learning process stricter at the boundaries between class regions. To model this geometrical attention we apply a Sobel filter to the input tensor within the region to inpaint. This provides us with a response map that, once binarized, localizes the semantic boundaries of objects. We add this mask to the spatially discounted loss, obtaining a penalization both on pixels close to the borders of the overall region and on semantic boundary pixels. This helps the model to follow the geometry of the scene and to accurately join together the occluded parts of region boundaries.
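One possible reading of this weighting scheme is sketched below in NumPy. Several details are assumptions rather than facts from the paper: the discount factor `gamma = 0.99`, the use of the distance to the nearest border of a rectangular region as the discount exponent, the summation of Sobel responses over class channels, and the thresholding at zero. All function names are hypothetical.

```python
import numpy as np

def sobel_magnitude(channel):
    """Gradient magnitude of a 2-D array using 3x3 Sobel kernels (edge-padded)."""
    p = np.pad(channel, 1, mode="edge")
    # Horizontal derivative: right neighbours minus left neighbours, weighted 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    # Vertical derivative: lower neighbours minus upper neighbours, weighted 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)

def semantic_boundaries(one_hot):
    """Binary (H, W) map that is 1 wherever any class channel has a Sobel response."""
    response = sum(sobel_magnitude(one_hot[..., k]) for k in range(one_hot.shape[-1]))
    return (response > 0).astype(np.float32)

def spatial_discount(h, w, gamma=0.99):
    """Weights gamma**d, with d the distance in pixels to the nearest region border."""
    rows = np.minimum(np.arange(h), np.arange(h)[::-1])
    cols = np.minimum(np.arange(w), np.arange(w)[::-1])
    return gamma ** np.minimum(rows[:, None], cols[None, :])

def loss_weights(one_hot_region, gamma=0.99):
    """Per-pixel loss weights: spatial discount plus the binarized boundary mask."""
    h, w = one_hot_region.shape[:2]
    return spatial_discount(h, w, gamma) + semantic_boundaries(one_hot_region)

# Toy 6x6 region: left half is class 0, right half is class 1.
region_labels = np.zeros((6, 6), dtype=int)
region_labels[:, 3:] = 1
region = np.eye(2, dtype=np.float32)[region_labels]
boundaries = semantic_boundaries(region)
weights = loss_weights(region)
```

In this toy case the boundary mask lights up the two columns on either side of the class transition, so those pixels receive a larger weight than interior pixels of the same region.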
4 Experimental Evaluation
We trained the model on Cityscapes (Cordts et al., 2016), an urban driving dataset with 30 different visual classes. We chose the Cityscapes dataset because it contains a high variability of both static and dynamic categories and can therefore also be adapted to datasets comprising fewer categories. In our experiments we divided the classes into dynamic and static subcategories, clustering together similar ones. The resulting 12 categories are the following.
At test time, we mask out all pixels belonging to dynamic categories and process the whole image in order to remove these objects. A qualitative evaluation on the Cityscapes dataset is shown in Fig. 2.
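The test-time masking can be sketched as follows. The class ids and their grouping into static and dynamic sets are hypothetical placeholders, since the actual category indices are not given here, and `dynamic_mask` is an illustrative helper name.

```python
import numpy as np

# Hypothetical class ids, used only for this example.
STATIC_IDS = (0, 1, 2)   # e.g. road, sidewalk, building
DYNAMIC_IDS = (3, 4)     # e.g. vehicle, pedestrian

def dynamic_mask(label_map, dynamic_ids=DYNAMIC_IDS):
    """Binary (H, W) mask that is 1 on every pixel labelled with a dynamic class."""
    return np.isin(label_map, dynamic_ids).astype(np.float32)

labels = np.array([[0, 3, 1],
                   [4, 2, 3]])
mask = dynamic_mask(labels)
print(mask)
```

The resulting mask plays the role of the random rectangle used during training: it marks the pixels that the generator has to fill with static categories.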
Given the lack of a supervision signal at test time to quantitatively evaluate the method, we generated an auxiliary test set using the CARLA simulator (Dosovitskiy et al., 2017). Thanks to the scripting functionalities of the simulator we generated pairs of perfectly aligned frames with and without dynamic objects. This allows us to produce a ground truth reconstruction that would be impossible to obtain from real-world images. We generated 50 synthetic images as a benchmark. Qualitative results are shown in Fig. 3.
To evaluate the reconstruction capabilities of the method we measured the pixel-wise accuracy between the reconstructed images and the ground truth. This is calculated as the fraction of correctly reconstructed pixels within the inpainted regions. On the synthetic dataset we collected, we obtain a pixel-wise accuracy of , proving that our method is capable of generating high quality reconstructions.
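The metric described above can be written as a short NumPy sketch. The function name `masked_pixel_accuracy` and the toy arrays are assumptions for illustration and do not reproduce any result from the paper.

```python
import numpy as np

def masked_pixel_accuracy(pred, gt, mask):
    """Fraction of pixels inside the inpainted region whose predicted class matches the ground truth."""
    region = mask.astype(bool)
    if not region.any():
        raise ValueError("the mask selects no pixels")
    return float((pred[region] == gt[region]).mean())

pred = np.array([[1, 1, 2],
                 [2, 0, 1]])
gt = np.array([[0, 1, 1],
               [2, 0, 1]])
# 1 marks inpainted pixels; the wrong prediction at (0, 0) lies outside and is ignored.
mask = np.array([[0, 1, 1],
                 [1, 1, 0]])
accuracy = masked_pixel_accuracy(pred, gt, mask)
print(accuracy)  # 0.75
```

Restricting the count to the inpainted region avoids inflating the score with pixels that were visible in the input and simply copied through.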
5 Conclusions
In this paper we presented the novel task of Semantic Road Layout Understanding, which aims at obtaining a full comprehension of the scene in an urban driving scenario. We formulated this task as an inpainting problem in the semantic segmentation domain, a setting which has yet to be explored despite its usefulness for a wide variety of scene understanding tasks. To solve this problem we proposed a Generative Adversarial Network architecture capable of removing dynamic objects from the scene and reconstructing occluded views. Moreover, we showed that the model is capable of providing high-quality results both on frames from the Cityscapes dataset and on a novel synthetically generated dataset.
- Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
- Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio López, and Vladlen Koltun. CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938, 2017.
- Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
- Yuhang Song, Chao Yang, Yeji Shen, Peng Wang, Qin Huang, and C.-C. Jay Kuo. SPG-Net: Segmentation prediction and guidance network for image inpainting. arXiv preprint arXiv:1805.03356, 2018.
- Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892, 2018.