Points2Pix: 3D Point-Cloud to Image Translation using conditional GANs


Stefan Milz*, Martin Simon*, Kai Fischer*, Maximillian Pöpperl, Horst-Michael Gross
Valeo Schalter und Sensoren GmbH, Germany
Ilmenau University of Technology, Germany
* Equal contribution
Abstract

We present the first approach for 3D point-cloud to image translation based on conditional Generative Adversarial Networks (cGAN). The model handles multi-modal information sources from different domains, i.e. raw point-sets and images. The generator is capable of processing three conditions, whereas the point-cloud is encoded as a raw point-set and a camera projection. An image background patch is used as a constraint to bias environmental texturing. A global approximation function within the generator is applied directly on the point-cloud (PointNet). Hence, the representative learning model incorporates global 3D characteristics directly in the latent feature space. Conditions are used to bias the background and the viewpoint of the generated image. This opens up new ways of augmenting or texturing 3D data, aiming at the generation of fully individual images. We successfully evaluated our method on the KITTI and SunRGBD datasets with an outstanding object-detection-based inception score.

1 Introduction

Domain translation is a well-known and widely applied problem, typically treated in computer graphics or computer vision. Most research focuses on image-to-image translation [7, 34, 27]. Examples are semantic-labels-to-image translation (e.g. labels to street-scene, labels to facades) or image conversions (e.g. day to night, black-and-white to color). Those techniques deal with real domain translation problems, since they convert semantic, sensor-independent context into realistic RGB image data or vice versa. However, the domain translation is performed on top of images: both domains encode the information as RGB values in pictures with a spatial dependency. We call that single-mode domain translation:

$f_{1 \to 2}: I_1 \mapsto I_2, \quad f_{2 \to 1}: I_2 \mapsto I_1, \quad I_1, I_2 \in \mathbb{R}^{h \times w \times 3}$   (1)

Here, $f_{1 \to 2}$ and $f_{2 \to 1}$ describe the translation functions between both image domains $I_1$ and $I_2$, with fixed image sizes: $h$ (height), $w$ (width).

We propose a novel multi-modal domain translation model using the example of 3D point-cloud to image translation. The treated problem can be formalized as:

$g: \mathbb{P} \mapsto I, \quad \mathbb{P} \in \mathbb{R}^{n \times 3}, \quad I \in \mathbb{R}^{h \times w \times 3}$   (2)

Here, $n$ describes the number of points within the point-set $\mathbb{P}$. Our work is limited to the direction $\mathbb{P} \to I$ (not $I \to \mathbb{P}$). Therefore, an extensive new architecture is presented that builds on a typical encoder-decoder for image segmentation (UNet [21]) as proposed by [7]. More important is the model's second input, where the architecture incorporates the real point-set to add 3D characteristics to the global feature space for constraint-based individual image generation. We use conditions in the form of a viewpoint-dependent projection and background image patches for fully individual image generation in compliance with 3D specifications (conditions: background, shape, distance, viewpoint).

2 Related Work

2.1 Image generation

2.1.1 Handcrafted Losses

Since image generation can be reduced to per-pixel classification or regression with a wide application area, it has a long tradition [23, 30, 31, 6]. Those approaches assume a conditionally unstructured loss applied on the output space, i.e. pixel independence in terms of semantic relationships is assumed. The performance of those approaches strongly depends on the loss design, e.g. for semantic segmentation [15].

2.1.2 Conditional GANs

Conditional GANs (cGAN) instead learn structured losses that affect the overall output in the form of a joint improvement [7]. Commonly, the GAN is applied in a conditional setting. For image generation, researchers have used various conditions, e.g. discrete labels [14, 4], text [20] and images [7, 34, 27].

In general, the cGAN learns a mapping function $G$, called generator, based on a condition $x$ and a random noise vector $z$ to generate an image $y$:

$G: \{x, z\} \to y$   (3)

For image-to-image translation, [7] proposes a U-Net-like structure for $G$. To create realistic images at higher resolutions (e.g. 1024 x 2048), [27] recommends a pyramidal approach for $G$ similar to a PSPNet [32].

In general, the cGAN is composed of $G$ and a competing discriminator $D$, which distinguishes between real images and created fake ones. A well-established discriminator network is the PatchGAN [10] proposed by [7]. Derived from that, the competing objective of the cGAN can be described by its loss $\mathcal{L}_{cGAN}$:

$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$   (4)
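As a rough illustration of Eq. 4, the following PyTorch-style sketch shows how the two expectation terms are commonly realized with binary cross-entropy on real and generated pairs. The names `G`, `D`, `x`, `y`, `z` follow Eq. 3 and are placeholders; this is not a specific implementation.

```python
import torch
import torch.nn.functional as F

def cgan_losses(G, D, x, y, z):
    """Sketch of the conditional GAN objective from Eq. 4.

    x: condition, y: real image, z: noise vector.
    G and D are arbitrary generator / discriminator modules.
    """
    fake = G(x, z)

    # Discriminator: maximize log D(x, y) + log(1 - D(x, G(x, z)))
    d_real = D(x, y)
    d_fake = D(x, fake.detach())
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator: fool D, i.e. push D(x, G(x, z)) towards "real"
    d_fake_for_g = D(x, fake)
    loss_g = F.binary_cross_entropy_with_logits(d_fake_for_g, torch.ones_like(d_fake_for_g))
    return loss_g, loss_d
```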

2.2 Point-cloud processing

High requirements for perception tasks in robotic applications have driven the usage of 3D sensors, e.g. RGBD-cameras [26] and Lidar (Valeo SCALA). Research progress in the field of 3D point-cloud processing has received a boost in recent years. In principle, point-clouds have specific properties that clearly distinguish them from images, hence specific processing models are needed. Points usually are not ordered; there is no grid that encodes the 3D position as an image does. The overall category of a point-set is influenced by the interaction of points among each other: only the global sum of the points forms a shape with a meaning. Last, the meaning of a point-set is invariant to basic transformations like translation or rotation. Therefore, dedicated combinations of 3D point-clouds and machine learning are indispensable. The processing types can be categorized into the following three classes.

2.2.1 Real 3D Point-cloud processing

[16] proposed PointNet, the first neural network architecture that handles raw point-sets for classification and segmentation tasks, with outperforming segmentation results on ShapeNet [2] (mIoU 83.7). The model does not use convolutional layers, but fully connected ones, and directly processes the coordinates of the point-set $\{x_1, \dots, x_n\}$ with $x_i \in \mathbb{R}^3$. A chain of local transformations on the point-set followed by a global max-pooling layer is used to create an overall feature space, i.e. a global approximation function:

$f(\{x_1, \dots, x_n\}) \approx g\big(\max_{i=1,\dots,n} h(x_i)\big)$   (5)

I.e., the overall meaning (e.g. object class) of a point-set is approximated by $f$. The advantage of the architecture is that it is robust against unordered point-clouds and transformations. The independence from viewpoint variance helps to train with fewer training samples. Due to disadvantages in capturing fine-grained local structures of large point-sets, the authors developed a second version, PointNet++ [18].
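A minimal PyTorch-style sketch of such a symmetric global approximation (a shared per-point MLP $h$ followed by a max over all points, cf. Eq. 5). Layer widths are illustrative assumptions, not the exact PointNet configuration.

```python
import torch.nn as nn

class GlobalPointFeatures(nn.Module):
    """Shared per-point MLP h followed by a permutation-invariant max pool."""

    def __init__(self, feat_dim=1024):
        super().__init__()
        # 1x1 convolutions act as a shared MLP applied to every point
        self.h = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1), nn.ReLU(),
        )

    def forward(self, points):                 # points: (B, n, 3)
        x = self.h(points.transpose(1, 2))     # (B, feat_dim, n)
        return x.max(dim=2).values             # (B, feat_dim), the global feature
```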

2.2.2 Voxelization

Voxelization approaches make use of findings from applying CNNs to images. Therefore, 3D data is converted to voxels or grid cells. After this pre-processing, standard machine learning architectures are applied and unordered point-sets are avoided. Famous applications are 3D object detectors like [24, 3, 13, 9].

2.2.3 Combined models

Combined models have often shown the most robust results (e.g. for 3D object detection) and mostly make use of different sensor types. [33] investigated a method based on many local PointNets followed by a global 3D CNN. The architecture of [17] works the other way around: with the aid of a camera frustum, points are filtered using a camera-based object detector. The filtered points are processed for 3D object detection with only one PointNet [16] up to the last global max-pooling layer, ending in a general feature space. An 8-bit depth projection using the given camera projection matrix is used in a similar fashion for our second condition (see Section 3.1.2).

Figure 1: Points2Pix generator architecture. The figure outlines the overall pipeline of the Points2Pix generator. In general, we split the design into three areas. Top: the PointNet for raw point-cloud processing; Bottom: UNet with skip connections for image generation; Middle: global feature space concatenation from the point-set (top) and image processing (bottom) pipelines. The model needs only a raw point-set as input, which acts as condition $c_1$. The point-set's coordinates are directly processed by PointNet. A projection into the image plane using the camera projection matrix is used as input for the UNet and works as condition $c_2$. Additionally, an arbitrary background patch is used as condition $c_3$ for background generation.

2.2.4 Generative models

Point Cloud GAN by [11] is a well-known approach for point-cloud generation. They do not perform any translation task, but they show that the common discriminator is not suitable for raw point-clouds. [1] performs label to point-cloud translation using representation learning and introduces several 3D GAN derivatives. A similar study is published by [29] with a focus on latent space analysis. However, learning 3D representations to generate viewpoint-based images is missing within the research community. Therefore, we propose our novel technique Points2Pix.

3 Points2Pix

We propose a novel cGAN architecture for generating photo-realistic images from pure point-clouds. Additionally, we describe conditions to bias the viewpoint, distance, shape and the background within the latent space. Therefore, we introduce the network architecture consisting of a generator (converting points to images), a discriminator and the specific loss.

3.1 Generator

The objective of our generator is to translate point-clouds into realistic-looking images, while using three conditions $c_1$, $c_2$ and $c_3$. The whole architecture is shown in Figure 1. The design is inspired by [7], which serves as base.

3.1.1 Condition one

First, condition $c_1$, the raw point-cloud, is processed by PointNet [16]. The model samples a fixed number of points as input, applies an input transformation and aggregates global point features using fully connected layers and a generic max pooling (see Equation 5):

$F_{3D} = \max_{i=1,\dots,n} h(x_i)$   (6)

However, in contrast to the basic PointNet pipeline, the proposed model incorporates the global 3D feature space $F_{3D}$ by concatenation at the innermost part of the image encoder-decoder (UNet). Hence, $h$ and the max pooling are applied by the PointNet part, while $g$ is implicitly performed with the aid of the UNet decoder (see Fig. 1).
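The fusion at the innermost UNet layer can be sketched as a channel-wise concatenation of the tiled global 3D feature with the bottleneck activation. Tensor shapes and the tiling strategy are illustrative assumptions.

```python
import torch

def fuse_bottleneck(unet_bottleneck, global_3d):
    """Concatenate the PointNet global feature with the UNet bottleneck.

    unet_bottleneck: (B, C, H', W') innermost encoder activation.
    global_3d:       (B, F) global 3D feature from the PointNet branch.
    Shapes are illustrative; the exact sizes are not specified here.
    """
    B, feat_dim = global_3d.shape
    _, _, H, W = unet_bottleneck.shape
    # Broadcast the 3D feature over the spatial bottleneck resolution
    tiled = global_3d.view(B, feat_dim, 1, 1).expand(B, feat_dim, H, W)
    return torch.cat([unet_bottleneck, tiled], dim=1)   # (B, C + F, H', W')
```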

3.1.2 Condition two

The second condition, denoted as $c_2$, is an image projection of the point-cloud using a perspective projection matrix $P$:

$P = \begin{pmatrix} s & 0 & 0 & 0 \\ 0 & s & 0 & 0 \\ 0 & 0 & -\frac{f_c + n_c}{f_c - n_c} & -\frac{2 f_c n_c}{f_c - n_c} \\ 0 & 0 & -1 & 0 \end{pmatrix}$   (7)

$s = \frac{1}{\tan\!\big(\frac{\theta}{2} \cdot \frac{\pi}{180}\big)}$   (8)

with a scaling $s$ according to the horizontal field of view $\theta$ in degrees, near clipping plane $n_c$ and far clipping plane $f_c$. We encode the radial depth as a normalized depth (green channel) and the intensities of the measured reflectance (blue channel) for each point falling into the projection image. Before applying $P$, all points are transformed into the camera coordinate system using the extrinsic calibration. In this way we ensure a consistent viewpoint during training compared to the raw ground-truth RGB image.
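A simplified NumPy sketch of rendering such a projection image. It assumes points already transformed into camera coordinates and uses a generic 3x3 pinhole matrix `K` instead of the fov-based matrix $P$; image size, clipping values and the depth normalization are illustrative assumptions, while the channel assignment (depth in green, reflectance in blue) follows the text.

```python
import numpy as np

def project_points(points_cam, reflectance, K, h, w, near=1.0, far=80.0):
    """Render a c2-style depth / reflectance projection of a point-cloud.

    points_cam:  (n, 3) points already in camera coordinates.
    reflectance: (n,)   measured intensities, assumed normalized to [0, 1].
    K:           (3, 3) camera projection matrix (generic pinhole assumption).
    """
    img = np.zeros((h, w, 3), dtype=np.float32)
    in_front = (points_cam[:, 2] > near) & (points_cam[:, 2] < far)
    pts, refl = points_cam[in_front], reflectance[in_front]

    uvw = (K @ pts.T).T                                   # homogeneous pixel coords
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    depth = np.linalg.norm(pts, axis=1)                   # radial depth per point
    img[v[ok], u[ok], 1] = (depth[ok] - near) / (far - near)   # green: normalized depth
    img[v[ok], u[ok], 2] = refl[ok]                            # blue: reflectance
    return img
```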

3.1.3 Condition three

Finally, the third condition $c_3$ is an arbitrary image background patch constraining the environmental texturing. A surrounding image patch of the object, cropped from the data-set and centered at the object origin up to a fixed size, is extracted. During training, the image background patch is compliant with the ground truth. In test mode, background patches can be randomly mixed with point-clouds.

Both $c_2$ and $c_3$ are combined into an input image, which is fed into a UNet with skip connections. At the innermost part, the down-sampled input features are concatenated with the global 3D feature space from $c_1$. After up-sampling, the output is a generated image. Since we use a cGAN for training, there is no need for an unstructured loss; the assessment of the output is performed by the discriminator. As a note, we do not use a random noise vector $z$ (Eq. 3). Noise is only incorporated as dropout, similar to [7].
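One possible way to combine the projection $c_2$ and the background patch $c_3$ into a single UNet input is channel-wise concatenation; whether the channels are stacked or the images are composited is not fully specified, so the sketch below is an assumption.

```python
import torch

def build_unet_input(projection_c2, background_c3):
    """Combine condition c2 (depth / reflectance projection) and c3 (background patch).

    projection_c2: (B, 3, H, W) projection image.
    background_c3: (B, 3, H, W) arbitrary background patch.
    Channel-wise concatenation is an assumption about how both are
    "combined to an input image" for the UNet.
    """
    return torch.cat([background_c3, projection_c2], dim=1)   # (B, 6, H, W)
```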

3.2 Discriminator

We use the Markovian discriminator PatchGAN [7], which tries to distinguish between fake and real images at the scale of patches as well as possible. In contrast to [7], we do not take the conditions into account: the output depends only on the generated image. In addition, the generator objective contains an L1 term to force low-frequency correctness [34], while the discriminator is applied convolutionally across the image, averaging all responses. We only use convolutional layers with batch and instance normalization. In this way, the model is effectively able to capture high- and low-frequency structures at once.
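A minimal PatchGAN-style convolutional stack in the spirit of [7]. The channel counts and the choice of normalization below are illustrative assumptions, not the exact configuration used here.

```python
import torch.nn as nn

def patchgan_discriminator(in_channels=3):
    """Convolution-only discriminator that classifies overlapping image patches.

    Outputs a (B, 1, H', W') map of real/fake logits; the responses are
    averaged to obtain the final decision. Channel counts are illustrative.
    """
    def block(c_in, c_out, norm=True):
        layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
        if norm:
            layers.append(nn.InstanceNorm2d(c_out))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        return layers

    return nn.Sequential(
        *block(in_channels, 64, norm=False),
        *block(64, 128),
        *block(128, 256),
        *block(256, 512),
        nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # patch-wise logits
    )
```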

Figure 2: Training Points2Pix. The figure outlines the competing training structure. The generator's ($G$) output is a fake image based on its three conditions $c_1$, $c_2$, $c_3$. The discriminator $D$ has to distinguish between fake and real images.

3.3 Loss

The objective of a basic GAN can be explained as an additive combination of the generative network loss and the discriminative network loss. In order to iteratively improve results during training, the generator loss should be reduced while the discriminator loss ideally grows. Consequently, the basic cGAN loss can be described as follows, assuming the three input conditions $c_1$, $c_2$, $c_3$:

$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{y}[\log D(y)] + \mathbb{E}_{c_1, c_2, c_3}[\log(1 - D(G(c_1, c_2, c_3)))]$   (9)

Random noise $z$ (Eq. 3) is only realized using dropout. Compared to the typical cGAN loss (Eq. 4), the model does not involve the conditions in the discriminator. However, we implicitly force the conditions to be compliant in the output by using a weighted L1 term [27] in the overall loss, which describes the difference between the output and the ground truth. The final loss can be written as:

$\mathcal{L} = \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathbb{E}_{c_1, c_2, c_3, y}\big[\| y - G(c_1, c_2, c_3) \|_1\big]$   (10)
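Putting Eqs. 9 and 10 together, a training-step sketch in PyTorch. The weight `lam` is a placeholder; that $D$ only sees the generated or real image (no conditions) follows Section 3.2, the rest is illustrative.

```python
import torch
import torch.nn.functional as F

def points2pix_step(G, D, c1, c2, c3, real_img, lam=100.0):
    """Sketch of one training step for Eq. 10: cGAN loss plus weighted L1 term.

    D only receives images (no conditions), as described in Sec. 3.2.
    lam is a placeholder weight, not a reported value.
    """
    fake = G(c1, c2, c3)

    # Discriminator loss (Eq. 9): real vs. generated images
    d_real, d_fake = D(real_img), D(fake.detach())
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator loss: fool D and stay close to the ground truth (L1 term)
    d_fake_g = D(fake)
    loss_g = F.binary_cross_entropy_with_logits(d_fake_g, torch.ones_like(d_fake_g)) \
             + lam * F.l1_loss(fake, real_img)
    return loss_g, loss_d
```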

4 Experiments

We conduct experiments on KITTI [5] for outdoor and SunRGBD [25] for indoor scenarios to explore the general validity of the method. Additionally, we show that the approach works for both Lidar-generated point-clouds and point-clouds coming from RGB-D sensors. Following the recommendations of [7], the quality of the synthesized images is evaluated using an object-based inception score. Furthermore, classification and diversity scores are added as additional assessments. Finally, we present some insights into our architectural decisions with additional ablation experiments.

4.1 Metrics

To assess the realism of the produced images, YOLOv3 [19] is used for validation. It is an off-the-shelf state-of-the-art 2D object detector pre-trained on ImageNet and fine-tuned on the MS-COCO [12] data-set. This model includes classes overlapping with our experiments, e.g. car (for KITTI) and chair (for SunRGBD). For the quantitative metrics, we follow the recommendations of [28].

4.1.1 Classification Score

With the aid of YOLOv3, the number of correctly detected classes is measured. This is possible because the images in our experiments are object-centered patches. The classification score is then given by the ratio of true-positive detections on fake images to those on the ground truth. The score can be directly affected by adjusting the confidence threshold of the 2D object detector.
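A small sketch of this ratio, under the assumption that the detector output is available as one list of detected class labels (already filtered by the confidence threshold) per image patch; the representation is hypothetical.

```python
def classification_score(fake_detections, gt_detections, target_class):
    """Ratio of fake-image detections to ground-truth detections of the target class.

    fake_detections / gt_detections: lists (one entry per image patch) of
    detected class labels above the chosen YOLOv3 confidence threshold.
    """
    tp_fake = sum(target_class in dets for dets in fake_detections)
    tp_real = sum(target_class in dets for dets in gt_detections)
    return tp_fake / max(tp_real, 1)
```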

4.1.2 Object based Inception Score

(We call it inception score because it is similar to the proposal of [22]; we do not use an inception model.)

For positive classification results, we measure the intersection over union (IoU) between the bounding box predicted by YOLOv3 on the ground-truth image and the one predicted on the accompanying fake image:

$IoU = \frac{|B_{real} \cap B_{fake}|}{|B_{real} \cup B_{fake}|}$   (11)
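Eq. 11 is the standard intersection over union; a sketch assuming boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```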

4.1.3 Diversity Score

We measure the ability of our cGAN to produce a wide spread of different output features using a diversity score. Our objective is to bias the shape, distance and 3D characteristics of the object. We randomly collect ten different background image patches, while keeping the point-cloud, i.e. conditions $c_1$ and $c_2$, constant. This leads to different output images that should all contain the same 3D object. Therefore, we compare the YOLOv3 results on the ground truth and on all fake images by calculating the mean classification score and the mean IoU.
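A sketch of this measurement: generate fakes with varying background patches while keeping $c_1$ and $c_2$ fixed, then average detection rate and IoU against the ground-truth box. `detect` stands in for running YOLOv3 and is a placeholder, and `iou` refers to the helper sketched above.

```python
def diversity_score(G, c1, c2, background_patches, gt_box, detect):
    """Mean detection rate and IoU over images generated with varying backgrounds.

    detect(image) is a placeholder for YOLOv3 returning the best box or None;
    iou() is the helper defined in the previous sketch.
    """
    ious, hits = [], 0
    for c3 in background_patches:          # e.g. ten random patches
        fake = G(c1, c2, c3)
        box = detect(fake)
        if box is not None:
            hits += 1
            ious.append(iou(box, gt_box))
    mean_iou = sum(ious) / len(ious) if ious else 0.0
    return hits / len(background_patches), mean_iou
```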

4.2 Training Details

We train the network on both data-sets separately from scratch, using the ADAM optimizer [8] with a fixed learning rate and momentum parameters. For our background condition $c_3$, we use image patches with a fixed border width. We found that using objects containing at least 700 points in their point-cloud is a good trade-off between minimum point density and object size.

KITTI: In a pre-processing step, we split the training examples of the 3D object detection benchmark into a training and an evaluation set. From the training split we generate object-centered training images for the class car only: each camera image is cropped centered at one labeled object, while strongly occluded or truncated objects are skipped.

SunRGBD: We extract 3267 images from the SunRGBD data-set containing the following classes: chair, table, desk, pillow, sofa and garbage bin. The split for training and validation is a 90/10 ratio. Image patches are extracted at the object center from the camera's point of view. The depth information comes from either Microsoft Kinect v1 or v2 or the Intel RealSense. Since those sensors do not measure a reflectance, we only encode the radial depth inside the projection for $c_2$; hence, the projection image contains one channel only.

Figure 3: Qualitative results of Points2Pix. The figure shows four different classes (cars, top: 3 samples of KITTI; table, chair, pillow, bottom: one sample each of SunRGBD). The results are taken from the test-set and were never seen during training. The left column shows the ground-truth image, with the corresponding point-cloud in the second column. Fake images are generated based on a constant point-cloud and ten alternating background patches (columns 3-10). The model retains the 3D characteristics of the objects.

4.3 Results

In Fig. 3 we show qualitative results for both data-sets and four different classes. Widely varying output images are produced by alternating the background while keeping the point-cloud constant. An interesting point is that our model learns 3D characteristics. This can be seen from the different outputs (backgrounds) in which the object's geometry stays constant. Note that even the object's color stays the same, apart from slight differences in reflections and illumination. This means the model associates a color with a specific 3D shape represented within the 3D latent feature space. Hence, alternating backgrounds do not affect the object's representation (geometry, color).

Tables 1 and 2, as well as Fig. 4, show quantitative results based on our metrics described in Section 4.1. We achieve strongly positive results for KITTI and sufficient values for SunRGBD. SunRGBD includes a higher number of occlusions, which drastically affects the scores. Additionally, there are far fewer samples per class compared to cars in KITTI. Qualitative results of the inception score are shown in Fig. 5.

Figure 4: Points2Pix classification score. The plot shows classification scores for KITTI and SunRGBD of our full Points2Pix architecture as well as of two derivative architectures (see Fig. 6) over the confidence thresholds used for object detection with YOLOv3. The full architecture outperforms the derivatives for KITTI as well as for SunRGBD.
dataset   class   score per confidence threshold
KITTI     car     0.3: 0.76   0.5: 0.77   0.7: 0.77
SunRGBD   sofa    0.1: 0.52   0.2: 0.77   0.3: 0.77
SunRGBD   table   0.1: 0.70   0.2: -      0.3: -
SunRGBD   chair   0.1: 0.60   0.2: 0.58   0.3: 0.58
Table 1: The object-based inception score is calculated on the test set for both data-sets. We show results for varying confidence thresholds, i.e. 0.3, 0.5, 0.7 for KITTI and 0.1, 0.2, 0.3 for SunRGBD.
dataset   class   score per confidence threshold
KITTI     car     0.3: 0.71   0.5: 0.70   0.7: 0.68
SunRGBD   sofa    0.1: 0.16   0.2: -      0.3: -
SunRGBD   table   0.1: 0.24   0.2: 0.22   0.3: -
SunRGBD   chair   0.1: 0.45   0.2: 0.37   0.3: 0.33
Table 2: The diversity score is calculated on the test set for both data-sets. Each sample is recomputed ten times with random image background patches. A minus indicates no detections for the associated class.
Figure 5: Qualitative object-based inception results. The figure shows several generated cars and chairs (left) together with their accompanying real images (right). Green bounding boxes indicate detections on the real RGB image patches and red boxes visualize the corresponding ones on the fake images. The blue value gives the IoU of both.
Figure 6: Architectural review. Two derivatives of the basic Points2Pix generator (left) are tested regarding their classification score (see Fig. 4): on the one hand a UNet-only version (middle), on the other hand a PointNet-only version (right). The full model outperforms the others.
Figure 7: Learning 3D representations. The full Points2Pix architecture learns 3D representations. The model offers high flexibility in generating different viewpoints by adjusting condition $c_2$. The left part shows two examples of KITTI where the point-cloud is rotated slightly by 20 degrees around the y-axis. The right part (SunRGBD) shows the results when flipping the projection by 180 degrees around the x-axis.

4.3.1 Ablation study

Architectural Review

For completeness, we test two derivative architectures of our full pipeline (Fig. 6). In this way, we successfully show a point-cloud to image translation based only on the point-cloud itself (PointNet only). The whole training procedure then runs much faster due to far fewer parameters to optimize. Nevertheless, a repeating high-contrast noise similar to Moire effects sometimes appears, which indicates instabilities and uncertainties. Generated objects are in compliance with their 3D specifications, but in order to enlarge the variance of the outputs and to control the background, conditions $c_2$ and $c_3$ are required. We found that the first part of the UNet and the viewpoint-dependent projection especially help to reduce the mentioned noise effects: they provide additional information in 2D space and stabilize the network. As a fallback, we additionally test a UNet-only version (Fig. 6). However, our full pipeline significantly outperforms the derivative architectures in terms of classification (Fig. 4).

Rotations

To further emphasize the influence of $c_2$ and to show our model's ability to constrain object viewpoints, we rotate all input points before the projection for $c_2$. We test this for KITTI with a rotation of 20 degrees around the y-axis and for SunRGBD with a rotation of 180 degrees around the x-axis (see Fig. 7). Note that our point-cloud condition $c_1$ stays unmodified, because PointNet approximates a symmetric function that is invariant to such transformations. The test shows that rotations can be implicitly learned, which offers many opportunities in generating 3D data.
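The rotation experiment only requires a rigid transform of the points before the projection for $c_2$; a small NumPy sketch for the KITTI case (20 degrees around the y-axis):

```python
import numpy as np

def rotate_y(points, degrees=20.0):
    """Rotate an (n, 3) point-cloud around the y-axis before projecting it for c2."""
    a = np.deg2rad(degrees)
    R = np.array([[ np.cos(a), 0.0, np.sin(a)],
                  [ 0.0,       1.0, 0.0      ],
                  [-np.sin(a), 0.0, np.cos(a)]])
    return points @ R.T
```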

5 Conclusion

In this work, we propose a novel approach for 3D point-cloud to image translation based on conditional GANs. Our network handles multi-modal sources from different domains and is capable of translating unordered point-clouds to regular image grids. We use three conditions to generate a high diversity, while being flexible and keeping 3D characteristics. We show that the model learns 3D characteristics, which even makes it possible to sample images from different viewpoints. Such networks are applicable in a wide variety of applications, especially 3D texturing.

References

  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas (2017) Representation learning and adversarial generation of 3d point clouds. CoRR abs/1707.02392. External Links: Link, 1707.02392 Cited by: §2.2.4.
  • [2] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: an information-rich 3d model repository. CoRR abs/1512.03012. External Links: Link, 1512.03012 Cited by: §2.2.1.
  • [3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2016) Multi-view 3d object detection network for autonomous driving. CoRR abs/1611.07759. External Links: Link, 1611.07759 Cited by: §2.2.2.
  • [4] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus (2015) Deep generative image models using a laplacian pyramid of adversarial networks. CoRR abs/1506.05751. External Links: Link, 1506.05751 Cited by: §2.1.2.
  • [5] A. Geiger (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR ’12, Washington, DC, USA, pp. 3354–3361. External Links: ISBN 978-1-4673-1226-4, Link Cited by: §4.
  • [6] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2016) Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (Proc. of SIGGRAPH 2016) 35 (4), pp. 110:1–110:11. Cited by: §2.1.1.
  • [7] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004. External Links: Link, 1611.07004 Cited by: §1, §1, §2.1.2, §2.1.2, §2.1.2, §3.1.3, §3.1, §3.2, §4.
  • [8] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. External Links: Link, 1412.6980 Cited by: §4.2.
  • [9] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. IROS. Cited by: §2.2.2.
  • [10] B. Li (2016) 3D fully convolutional network for vehicle detection in point cloud. CoRR abs/1611.08069. External Links: Link, 1611.08069 Cited by: §2.1.2.
  • [11] C. Li, M. Zaheer, Y. Zhang, B. Póczos, and R. Salakhutdinov (2018) Point cloud GAN. CoRR abs/1810.05795. External Links: Link, 1810.05795 Cited by: §2.2.4.
  • [12] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. CoRR abs/1405.0312. External Links: Link, 1405.0312 Cited by: §4.1.
  • [13] W. Luo, B. Yang, and R. Urtasun (2018-06) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.2.
  • [14] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. CoRR abs/1411.1784. External Links: Link, 1411.1784 Cited by: §2.1.2.
  • [15] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) ENet: A deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147. External Links: Link, 1606.02147 Cited by: §2.1.1.
  • [16] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2016) PointNet: deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593. Cited by: §2.2.1, §2.2.3, §3.1.1.
  • [17] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2017) Frustum pointnets for 3d object detection from RGB-D data. CoRR abs/1711.08488. External Links: Link, 1711.08488 Cited by: §2.2.3.
  • [18] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. CoRR abs/1706.02413. External Links: Link, 1706.02413 Cited by: §2.2.1.
  • [19] J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. CoRR abs/1804.02767. External Links: Link, 1804.02767 Cited by: §4.1.
  • [20] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative adversarial text to image synthesis. CoRR abs/1605.05396. External Links: Link, 1605.05396 Cited by: §2.1.2.
  • [21] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597. External Links: Link, 1505.04597 Cited by: §1.
  • [22] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. CoRR abs/1606.03498. External Links: Link, 1606.03498 Cited by: footnote 1.
  • [23] E. Shelhamer, J. Long, and T. Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (4), pp. 640–651. External Links: Link, Document Cited by: §2.1.1.
  • [24] M. Simon, S. Milz, K. Amende, and H. Gross (2018) Complex-yolo: real-time 3d object detection on point clouds. CoRR abs/1803.06199. External Links: Link, 1803.06199 Cited by: §2.2.2.
  • [25] S. Song, S. P. Lichtenberg, and J. Xiao (2015) SUN RGB-D: A RGB-D scene understanding benchmark suite. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 07-12-June-2015, pp. 567–576. External Links: Document, ISBN 9781467369640, ISSN 10636919 Cited by: §4.
  • [26] S. Song and J. Xiao (2015) Deep sliding shapes for amodal 3d object detection in RGB-D images. CoRR abs/1511.02300. External Links: Link, 1511.02300 Cited by: §2.2.
  • [27] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1.2, §2.1.2, §3.3.
  • [28] X. Wang and A. Gupta (2016) Generative image modeling using style and structure adversarial networks. CoRR abs/1603.05631. External Links: Link, 1603.05631 Cited by: §4.1.
  • [29] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. Tenenbaum (2017) MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In Advances In Neural Information Processing Systems, Cited by: §2.2.4.
  • [30] S. Xie and Z. Tu (2015) Holistically-nested edge detection. CoRR abs/1504.06375. External Links: Link, 1504.06375 Cited by: §2.1.1.
  • [31] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. Kweon (2016) Pixel-level domain transfer. CoRR abs/1603.07442. External Links: Link, 1603.07442 Cited by: §2.1.1.
  • [32] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2016) Pyramid scene parsing network. CoRR abs/1612.01105. External Links: Link, 1612.01105 Cited by: §2.1.2.
  • [33] Y. Zhou and O. Tuzel (2017) VoxelNet: end-to-end learning for point cloud based 3d object detection. CoRR abs/1711.06396. External Links: Link, 1711.06396 Cited by: §2.2.3.
  • [34] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR abs/1703.10593. External Links: Link, 1703.10593 Cited by: §1, §2.1.2, §3.2.