Points2Pix: 3D Point-Cloud to Image Translation using conditional GANs
We present the first approach for 3D point-cloud to image translation based on conditional Generative Adversarial Networks (cGAN). The model handles multi-modal information sources from different domains, i.e. raw point-sets and images. The generator processes three conditions: the point-cloud is encoded both as a raw point-set and as a camera projection, and an image background patch is used as a constraint to bias environmental texturing. A global approximation function within the generator is applied directly to the point-cloud (PointNet). Hence, the representation learning model incorporates global 3D characteristics directly in the latent feature space. Conditions bias the background and the viewpoint of the generated image, which opens up new ways of augmenting or texturing 3D data with the aim of generating fully individual images. We successfully evaluate our method on the KITTI and SunRGBD datasets with an outstanding object-detection-based inception score.
Domain translation is a well-known and widely studied problem, typically treated in computer graphics or computer vision. Most research focuses on image-to-image translation [7, 34, 27]. Examples are semantic-labels-to-image tasks (e.g. labels to street scene, labels to facades) or image conversions (e.g. day to night, black-and-white to color). Those techniques deal with real domain translation problems, since they convert semantic, sensor-independent context into realistic RGB image data or vice versa. However, the translation is performed on top of images: both domains encode the information as RGB values in pictures with a spatial dependency. We call that single-mode domain translation:
Here, $f: I_A \rightarrow I_B$ and $f^{-1}: I_B \rightarrow I_A$ describe the translation functions between both image domains $I_A, I_B \in \mathbb{R}^{h \times w \times 3}$, with fixed image sizes: $h$ (height), $w$ (width).
We propose a novel multi-modal domain translation model using the example of 3D point-cloud to image translation. The treated problem can formally be described as a mapping $g: P \rightarrow I$ with $P \in \mathbb{R}^{n \times 3}$ and $I \in \mathbb{R}^{h \times w \times 3}$. Here, $n$ describes the number of points within the point-set. Our work is limited to the direction $g$ (not its inverse $g^{-1}: I \rightarrow P$). Therefore, an extensive new architecture is presented that combines a typical encoder-decoder for image segmentation (U-Net) with a point-set network. More importantly, as its second input, the architecture incorporates the real point-set to add 3D characteristics to the global feature space for constraint-based individual image generation. We use conditions in the form of a viewpoint-dependent projection and background image patches for fully individual image generation in compliance with 3D specifications (conditions: background, shape, distance, viewpoint).
2 Related Work
2.1 Image generation
2.1.1 Handcrafted Losses
Since image generation can be reduced to per-pixel classification/regression with a wide application area, it has a long tradition [23, 30, 31, 6]. Those approaches assume a conditionally unstructured loss applied on the output space, i.e. pixel independence in terms of semantic relationships is assumed. Their performance strongly depends on the loss design, e.g. for semantic segmentation.
2.1.2 Conditional GANs
Conditional GANs (cGAN) instead learn structured losses that affect the overall output in the form of a joint improvement. Commonly, the cGAN is applied in a conditional setting. For image generation, researchers have used various conditions: e.g. discrete labels [14, 4], text, and images [7, 34, 27].
In general, the cGAN learns a mapping function $G$, called the generator, based on a condition $c$ and a random noise vector $z$ to generate an image $y$: $G: \{c, z\} \rightarrow y$.
For image-to-image translation, a U-Net-like structure has been proposed for $G$. To create realistic images at higher resolutions (e.g. 1024 x 2048), a pyramidal approach for $G$ similar to a PSPNet is recommended.
In general, the cGAN is composed of $G$ and a competing discriminator $D$, which distinguishes between real images and created fake ones. A well-established discriminator network is the PatchGAN. Derived from that, the competing objective of the cGAN can be described by its loss $\mathcal{L}_{cGAN}$:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{c,y}[\log D(c, y)] + \mathbb{E}_{c,z}[\log(1 - D(c, G(c, z)))]$$
2.2 Point-cloud processing
High requirements of perception tasks in robotic applications have enforced the usage of 3D sensors, e.g. RGB-D cameras and Lidar (Valeo SCALA). Research in the field of 3D point-cloud processing has received a boost in recent years. In principle, point-clouds have specific properties that clearly distinguish them from images; hence, specific processing models are needed. Points usually are not ordered: there is no grid that encodes the 3D position as an image does. The overall category of a point-set is influenced by the interaction of points among each other; only the global sum of the points forms a shape with a meaning. Last, point-sets should be treated invariantly with respect to basic transformations like translation or rotation. Therefore, specialized machine-learning models for 3D point-clouds are indispensable. The processing types can be categorized into the following three classes.
2.2.1 Real 3D Point-cloud processing
PointNet was the first neural network architecture that handles natural point-sets for classification and segmentation tasks, with outperforming segmentation results on ShapeNet (mIoU 83.7). The model does not use convolutional layers but fully connected ones, and directly processes the coordinates of the point-set of size $n \times 3$. A chain of local transformations on the point-set, followed by a global max-pooling layer, is used to create an overall feature space, i.e. a global approximation function:
$$f(\{x_1, \ldots, x_n\}) \approx \gamma\left(\max_{i=1,\ldots,n} h(x_i)\right)$$
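The global approximation above can be sketched in a few lines. Note that the layer sizes and the random, untrained weights below are illustrative assumptions; the point is the shared per-point lifting followed by a symmetric max-pooling, which makes the aggregated feature independent of point order.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 64))     # shared per-point weights (toy, untrained)
W2 = rng.standard_normal((64, 1024))

def h(points):
    """Shared per-point feature lifting: (n, 3) -> (n, 1024)."""
    return np.maximum(points @ W1, 0.0) @ W2  # tiny 2-layer MLP with ReLU

def global_feature(points):
    """Symmetric aggregation over the set: max over points -> (1024,)."""
    return h(points).max(axis=0)

pts = rng.standard_normal((700, 3))   # a point-set with n = 700 points
feat = global_feature(pts)

# The aggregation is invariant to the ordering of the points:
assert np.allclose(feat, global_feature(pts[::-1]))
```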
That is, the overall meaning (e.g. the object class) of a point-set is approximated by $f$. The advantage of the architecture is that it is robust against unordered point-clouds and basic transformations. The independence from viewpoint variance helps to train with fewer training samples. Due to its disadvantages in learning fine-grained features of large point-sets, the authors developed a second version, PointNet++.
2.2.2 Voxelization
Voxelization approaches make use of the findings of performing CNNs on images. Therefore, 3D data is converted into voxels or grid cells. After this pre-processing, standard machine learning architectures are applied and unordered point-sets are avoided. Famous applications are 3D object detectors like [24, 3, 13, 9].
2.2.3 Combined models
Combined models have often shown the most robust results (e.g. for 3D object detection) and mostly make use of different sensor types. One method is based on many local PointNets followed by a global 3D CNN. Frustum-based architectures work the other way around: with the aid of a camera frustum, points are filtered using a camera object detector, and the filtered points are processed for 3D object detection with a single PointNet up to the last global max-pooling layer, ending in a general feature space. An 8-bit depth projection can additionally be obtained using the given camera projection matrix.
2.2.4 Generative models
Point Cloud GAN is a famous approach for point-cloud generation. The authors do not perform any translation task, but they show that the common discriminator is not suitable for raw point-clouds. Other work performs label to point-cloud translation by using representation learning and introduces several 3D GAN derivatives; a similar study with a focus on latent-space analysis has also been published. However, learning 3D representations to generate viewpoint-based images is missing within the research community. Therefore, we propose our novel technique Points2Pix.
We propose a novel cGAN architecture for generating photo-realistic images from pure point-clouds. Additionally, we describe conditions to bias the viewpoint, distance, shape and background within the latent space. In the following, we introduce the network architecture, consisting of a generator (converting points to images), a discriminator and the specific loss.
The objective of our generator $G$ is to translate point-clouds into realistic-looking images while using three conditions $c_1, c_2, c_3$. The whole architecture is shown in Figure 1. The design is inspired by an image-to-image translation network, which serves as the basis.
3.1.1 Condition one
First, the raw point-cloud $c_1$ is processed by PointNet. The model samples $n$ points as input, applies an input transformation and aggregates global point features using fully connected layers and a generic max-pooling (see equation 5):
However, in contrast to the basic PointNet pipeline, the proposed model incorporates the global 3D feature space using concatenation at the innermost part of the image encoder-decoder (U-Net). Hence, the per-point transformations and the max-pooling are applied by the PointNet part, while the final aggregation is implicitly performed with the aid of the U-Net decoder (see Fig. 1).
3.1.2 Condition two
The second condition, denoted as $c_2$, is an image projection of the point-cloud using a perspective projection matrix, with a scaling according to the horizontal field of view (in degrees), the near clipping plane, and the far clipping plane. We encode the radial depth with a normalized depth (green channel) and the intensities of the measured reflectance (blue channel) for each point falling into the projection image. Before applying the projection, all points are transformed into the camera coordinate system using the extrinsic calibration. In this way, we ensure a consistent viewpoint during training compared to the raw ground-truth RGB image.
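The projection condition can be sketched as follows. The intrinsics, image size and clipping planes below are illustrative assumptions, not the paper's actual calibration; the sketch shows how points in the camera frame land in a two-channel image of normalized radial depth and reflectance.

```python
import numpy as np

H, W = 128, 128
K = np.array([[120.0, 0.0, W / 2],   # toy pinhole intrinsics (assumed)
              [0.0, 120.0, H / 2],
              [0.0,   0.0,   1.0]])
z_near, z_far = 1.0, 80.0            # assumed clipping planes

def project(points):
    """Points (x, y, z, reflectance) -> (H, W, 2) depth/intensity image."""
    img = np.zeros((H, W, 2))
    for x, y, z, refl in points:
        if not (z_near < z < z_far):
            continue                  # clip points outside the frustum
        u, v, w = K @ np.array([x, y, z])
        u, v = int(u / w), int(v / w)
        if 0 <= u < W and 0 <= v < H:
            depth = np.sqrt(x * x + y * y + z * z)
            img[v, u, 0] = depth / z_far   # normalized radial depth
            img[v, u, 1] = refl            # measured reflectance
    return img

pts = np.array([[0.0, 0.0, 10.0, 0.5],
                [1.0, -0.5, 20.0, 0.9]])
cond2 = project(pts)
```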
3.1.3 Condition three
Finally, the third condition $c_3$ is an arbitrary image background patch constraining the environmental texturing. A surrounding image patch of the object, cropped from the data set and centered at the object origin, is extracted up to a fixed size. During training, the image background patch is compliant with the ground truth. In test mode, background patches can be randomly mixed with point-clouds.
Both the projection and the background patch are combined into an input image, which is fed into a U-Net with skip connections. At the innermost part, the down-sampled input features are concatenated with the global 3D feature space from PointNet. After up-sampling, the output is a generated image of fixed size. Since we use a cGAN for training, there is no need for an unstructured loss; the assessment of the output is performed by the discriminator. Note that we do not use a random noise vector (3); noise is only incorporated as dropout.
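At the shape level, the bottleneck fusion described above reduces to a concatenation of two feature tensors. All sizes below are assumptions for illustration, not the paper's actual layer dimensions:

```python
import numpy as np

# c2 (2 depth/intensity channels) stacked with c3 (3 RGB channels):
img_cond = np.zeros((128, 128, 5))

# U-Net encoder output at the innermost layer (assumed 512 channels):
bottleneck = np.zeros((1, 1, 512))

# PointNet global feature from c1 (assumed 1024-dim), reshaped to match:
point_feat = np.zeros((1, 1, 1024))

# Fusion: concatenate along the channel axis before the decoder up-samples.
fused = np.concatenate([bottleneck, point_feat], axis=-1)
```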
We use the Markovian discriminator PatchGAN, which tries to distinguish between fake and real images at the scale of patches. In contrast to the original formulation, we do not take the condition into account: the output depends only on the generated image. The discriminator is applied convolutionally across the image, averaging all responses, and consists only of convolutional layers with batch and instance normalization. Together with an L1 term that enforces low-frequency correctness, it effectively solves the problem of modeling high- and low-frequency structures at once.
The objective of a basic GAN can be explained as an additive combination of the generative network loss and the discriminative network loss. In order to iteratively improve results during training, the generator loss should be reduced while the discriminator loss ideally grows. Consequently, the basic cGAN loss can be described as follows, assuming the three input conditions $c_1, c_2, c_3$:
Random noise (3) is only realized using dropout. Compared to the typical cGAN loss (4), the model does not involve all conditions in the discriminator. However, we implicitly force the conditions to be compliant within the output by using a weighted L1 term in the overall loss, which describes the difference between the output and the ground truth. The final loss can be written as:
$$\mathcal{L} = \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)$$
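A minimal numerical sketch of this combined objective (the weight `lam` is an assumption; the paper's value is not reproduced here):

```python
import numpy as np

def cgan_loss(d_real, d_fake):
    """Adversarial term: log D(y) + log(1 - D(G(c))), batch-averaged."""
    return np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def l1_loss(fake, real):
    """Weighted reconstruction term tying the output to the ground truth."""
    return np.mean(np.abs(fake - real))

def total_loss(d_real, d_fake, fake, real, lam=100.0):  # lam is assumed
    return cgan_loss(d_real, d_fake) + lam * l1_loss(fake, real)
```

When the generated image matches the ground truth exactly, only the adversarial term remains.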
We conduct experiments on KITTI for outdoor and SunRGBD for indoor scenarios to explore the general validity of the method. Additionally, we show that the approach works for both Lidar-generated point-clouds and point-clouds from RGB-D sensors. Following common recommendations, the quality of the synthesized images is evaluated using an object-based inception score. Furthermore, classification and diversity scores are added as additional assessments. Finally, we present some insights into our architecture decisions with additional ablation experiments.
To assess the realism of the produced images, YOLOv3 is used for validation. It is an off-the-shelf, state-of-the-art 2D object detector pre-trained on ImageNet and fine-tuned on the MS-COCO data-set. This model includes classes overlapping with our experiments, e.g. car (for KITTI) and chair (for SunRGBD). For the quantitative metrics, we follow established recommendations.
4.1.1 Classification Score
With the aid of YOLOv3, the number of correctly detected classes is measured. This is possible due to the object-centered image patches in our experiments. The classification score is then given by the detection ratio of fake images to ground truth (true positives). The score can be directly affected by adjusting the confidence threshold of the 2D object detector.
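The ratio described above reduces to a simple count over detector outputs. The detector results below are hypothetical placeholders, not actual YOLOv3 outputs:

```python
def classification_score(gt_detections, fake_detections, target_class):
    """Ratio of fake-image detections of the target class to GT detections."""
    tp_gt = sum(target_class in dets for dets in gt_detections)
    tp_fake = sum(target_class in dets for dets in fake_detections)
    return tp_fake / tp_gt if tp_gt else 0.0

# Hypothetical per-image class sets returned by the detector:
gt = [{"car"}, {"car"}, {"car", "person"}, {"truck"}]
fake = [{"car"}, set(), {"car"}, {"truck"}]
score = classification_score(gt, fake, "car")
```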
4.1.2 Object-based Inception Score
(We call it an inception score because it is similar to the original proposal; however, we do not use an inception model.)
For positive classification results, we measure the intersection over union (IoU) of the bounding boxes predicted by YOLOv3 for the ground truth and the accompanying fake image.
4.1.3 Diversity Score
We measure the ability of our cGAN to produce a wide spread of different output features using a diversity score. Our objective is to bias the shape, distance and 3D characteristics of the object. We randomly collect ten different background image patches while keeping the point-cloud constant. This leads to different output images that should all contain the same 3D object. We therefore compare the ground-truth YOLOv3 results against those of all the fake images by calculating the mean classification score and the mean IoU.
4.2 Training Details
We train the network on both data sets separately from scratch for a fixed number of epochs each, using the ADAM optimizer with a fixed learning rate and momentum parameters. For our background condition, we use image patches with a fixed border width. We found objects containing at least 700 points in their point-cloud to be a good trade-off between minimum point density and object size.
KITTI: In a pre-processing step, we split the training examples of the 3D object detection benchmark into samples for training and for evaluation. We thereby generate a large number of training images for the class car alone. Each camera image is cropped centered at one labeled object; strongly occluded or truncated objects are skipped.
SunRGBD: We extract 3267 images from the SunRGBD data-set containing the following classes: chair, table, desk, pillow, sofa and garbage bin. The split between training and validation is a 90/10 ratio. Image patches of a fixed size are extracted at the object center from the camera's point of view. The depth information comes from either a Microsoft Kinect v1 or v2 or an Intel RealSense. Since those sensors do not measure reflectance, we only encode the radial depth inside the projection, so the projection image contains one channel only.
In Fig. 3 we show qualitative results for both data-sets and four different classes. Widely distributed output images are produced by alternating the background while keeping the point-cloud constant. An interesting point is that our model learns 3D characteristics. This can be demonstrated with different outputs (backgrounds) in which the object's geometry stays constant. Note that even the object's color stays the same, apart from slight differences in reflections and illumination. This means the model associates a color with a specific 3D shape represented within the 3D latent feature space. Hence, alternating backgrounds do not affect the object's representation (geometry, color).
Tables 1 and 2, as well as Fig. 4, show quantitative results based on the metrics described in 4.1. We achieve strongly positive results for KITTI and sufficient values for SunRGBD. SunRGBD includes a higher number of occlusions, which drastically affects the scores; additionally, there are far fewer samples per class compared to cars in KITTI. Qualitative results of the inception score are shown in Fig. 5.
4.3.1 Ablation study
For completeness, we test two derivative architectures of our full pipeline (Fig. 6). In this way, we successfully show a point-cloud to image translation based only on the point-cloud itself (PointNet only). The whole training procedure then runs much faster due to far fewer parameters to optimize. Nevertheless, a repeating high-contrast noise similar to Moiré effects sometimes appears, which indicates instabilities and uncertainties. Generated objects comply with their 3D specifications, but the projection and background conditions are required to enlarge the variance of the outputs and to control the background. We found that the first part of the U-Net and the viewpoint-dependent projection especially help to reduce the mentioned noise effects: they provide additional information in 2D space and stabilize the network. As a fallback, we additionally test a U-Net-only version (Fig. 6). However, our full pipeline significantly outperforms the derivative architectures in terms of classification (Fig. 4).
To further emphasize the influence of the projection condition and to show our model's ability to constrain object viewpoints, we rotate all input points. We test this for KITTI with a rotation of 20 degrees around the y-axis and for SunRGBD with a rotation of 180 degrees around the x-axis (see Fig. 7). Note that our point-cloud condition stays unmodified, because PointNet approximates a symmetric function that is invariant to such transformations. The test shows that rotations can be implicitly learned, which offers many opportunities in generating 3D data.
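The viewpoint test amounts to applying a standard rotation matrix to the points before projection. A sketch for the 20-degree y-axis case (the camera-frame axis convention is an assumption):

```python
import numpy as np

def rotate_y(points, deg):
    """Rotate an (n, 3) point array by `deg` degrees around the y-axis."""
    t = np.deg2rad(deg)
    R = np.array([[ np.cos(t), 0.0, np.sin(t)],
                  [ 0.0,       1.0, 0.0      ],
                  [-np.sin(t), 0.0, np.cos(t)]])
    return points @ R.T

# A point straight ahead of the camera, rotated by 20 degrees:
pts = np.array([[0.0, 0.0, 1.0]])
rot = rotate_y(pts, 20.0)
```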
In this work, we propose a novel approach for 3D point-cloud to image translation based on conditional GANs. Our network handles multi-modal sources from different domains and is capable of translating unordered point-clouds into regular image grids. We use three conditions to generate high diversity while being flexible and keeping 3D characteristics. We show that the model learns 3D characteristics, which even makes it possible to sample images from different viewpoints. Such networks are applicable in a wide variety of applications, especially 3D texturing.
- (2017) Representation learning and adversarial generation of 3d point clouds. CoRR abs/1707.02392.
- (2015) ShapeNet: an information-rich 3d model repository. CoRR abs/1512.03012.
- (2016) Multi-view 3d object detection network for autonomous driving. CoRR abs/1611.07759.
- (2015) Deep generative image models using a laplacian pyramid of adversarial networks. CoRR abs/1506.05751.
- (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, pp. 3354–3361.
- (2016) Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (Proc. of SIGGRAPH 2016) 35 (4), pp. 110:1–110:11.
- (2016) Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004.
- (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
- (2018) Joint 3d proposal generation and object detection from view aggregation. In IROS.
- (2016) 3D fully convolutional network for vehicle detection in point cloud. CoRR abs/1611.08069.
- (2018) Point cloud GAN. CoRR abs/1810.05795.
- (2014) Microsoft COCO: common objects in context. CoRR abs/1405.0312.
- (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2014) Conditional generative adversarial nets. CoRR abs/1411.1784.
- (2016) ENet: a deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147.
- (2016) PointNet: deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593.
- (2017) Frustum pointnets for 3d object detection from RGB-D data. CoRR abs/1711.08488.
- (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. CoRR abs/1706.02413.
- (2018) YOLOv3: an incremental improvement. CoRR abs/1804.02767.
- (2016) Generative adversarial text to image synthesis. CoRR abs/1605.05396.
- (2015) U-Net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597.
- (2016) Improved techniques for training GANs. CoRR abs/1606.03498.
- (2017) Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (4), pp. 640–651.
- (2018) Complex-YOLO: real-time 3d object detection on point clouds. CoRR abs/1803.06199.
- (2015) SUN RGB-D: a RGB-D scene understanding benchmark suite. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 567–576.
- (2015) Deep sliding shapes for amodal 3d object detection in RGB-D images. CoRR abs/1511.02300.
- (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- (2016) Generative image modeling using style and structure adversarial networks. CoRR abs/1603.05631.
- (2017) MarrNet: 3D shape reconstruction via 2.5D sketches. In Advances in Neural Information Processing Systems.
- (2015) Holistically-nested edge detection. CoRR abs/1504.06375.
- (2016) Pixel-level domain transfer. CoRR abs/1603.07442.
- (2016) Pyramid scene parsing network. CoRR abs/1612.01105.
- (2017) VoxelNet: end-to-end learning for point cloud based 3d object detection. CoRR abs/1711.06396.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR abs/1703.10593.