Multi-View Silhouette and Depth Decomposition for High Resolution 3D Object Representation
We consider the problem of scaling deep generative shape models to high-resolution. Drawing motivation from the canonical view representation of objects, we introduce a novel method for the fast up-sampling of 3D objects in voxel space through networks that perform super-resolution on the six orthographic depth projections. This allows us to generate high-resolution objects with more efficient scaling than methods which work directly in 3D. We decompose the problem of 2D depth super-resolution into silhouette and depth prediction to capture both structure and fine detail. This allows our method to generate sharp edges more easily than an individual network. We evaluate our work on multiple experiments concerning high-resolution 3D objects, and show our system is capable of accurately producing objects at resolutions as large as 512×512×512 – the highest resolution reported for this task, to our knowledge. We achieve state-of-the-art performance on 3D object reconstruction from RGB images on the ShapeNet dataset, and further demonstrate the first effective 3D super-resolution method.
The 3D shape of an object is a combination of countless physical elements that range in scale from gross structure and topology to minute textures endowed by the material of each surface. Intelligent systems require representations capable of modeling this complex shape efficiently, in order to perceive and interact with the physical world in detail (e.g., object grasping, 3D perception, motion prediction and path planning).
Deep generative models have recently achieved strong performance in hallucinating diverse 3D object shapes, capturing their overall structure and rough texture Choy et al. (2016); Sharma et al. (2016); Wu et al. (2016). The first generation of these models utilized voxel representations, which scale cubically with resolution, limiting training to only low-resolution shapes on typical hardware. Numerous recent papers have begun to propose high-resolution 3D shape representations with better scaling, such as those based on meshes, point clouds or octrees, but these often require more difficult training procedures and customized network architectures.
Our 3D shape model is motivated by a foundational concept in 3D perception: that of canonical views. The shape of a 3D object can be completely captured by a set of 2D images from multiple viewpoints (see Luong and Viéville (1996); Denton et al. (2004) for an analysis of selecting the location and number of viewpoints).
Deep learning approaches for 2D image recognition and generation Simonyan and Zisserman (2014); He et al. (2016); Goodfellow et al. (2014); Karras et al. (2018) scale easily to high resolutions. This motivates the primary question in this paper: can a multi-view representation be used efficiently with modern deep learning methods?
We propose a novel approach for deep shape interpretation which captures the structure of an object via modeling of its canonical views in 2D, as depth maps. By utilizing many 2D orthographic projections to capture shape, a model represented in this fashion can be up-scaled to high resolution by performing semantic super-resolution in 2D space, which leverages efficient, well-studied network structures and training procedures.
The higher resolution depth maps are finally merged into a detailed 3D object using model carving.
Our method has several key components that allow effective and efficient training. We leverage two synergistic deep networks that decompose the task of representing an object’s depth: one that outputs the silhouette – capturing the gross structure; and a second that produces the local variations in depth – capturing the fine detail. This decomposition addresses the blurred images that often occur when minimizing reconstruction error by allowing the silhouette prediction to form sharp edges. Our method utilizes the low-resolution input shape as a rough template which simply needs carving and refinement to form the high resolution product. Learning the residual errors between this template and the desired high resolution shape simplifies the generation task and allows for constrained output scaling, which leads to significant performance improvements.
We evaluate our method’s ability to perform 3D object reconstruction on the ShapeNet dataset Chang et al. (2015). This standard evaluation task requires generating high resolution 3D objects from single 2D RGB images. Furthermore, due to the nature of our pipeline, we present the first results for 3D object super-resolution – generating high resolution 3D objects directly from low resolution 3D objects.
Our method achieves state-of-the-art quantitative performance when compared to a variety of other 3D representations such as octrees, mesh models and point clouds. Furthermore, our system is the first to produce 3D objects at 512×512×512 resolution, which are visually impressive both in isolation and when compared to the ground truth objects. We additionally demonstrate that objects reconstructed from images can be placed in scenes to create realistic environments, as shown in figure 1. Code for all of our systems will be publicly available on a GitHub repository, in order to ensure reproducible experimental comparison (https://github.com/EdwardSmith1884/3D-Object-Super-Resolution). Given the efficiency of our method, each experiment was run on a single NVIDIA Titan X GPU in the order of hours.
2 Related Work
Deep Learning with 3D Data Recent advances with 3D data have leveraged deep learning, beginning with architectures such as 3D convolutions Maturana and Scherer (2015); Li et al. (2016) for object classification. For 3D generation, these methods typically use an autoencoder network, with a decoder composed of 3D deconvolutional layers Choy et al. (2016); Wu et al. (2016). This decoder receives a latent representation of the 3D shape and produces a probability for occupancy at each discrete position in 3D voxel space. This approach has been combined with generative adversarial approaches Goodfellow et al. (2014) to generate novel 3D objects Wu et al. (2016); Smith and Meger (2017); Liu et al. (2017), but only at a limited resolution.
2D Super-Resolution Super-resolution of 2D images is a well-studied problem Park et al. (2003). Traditionally, image super-resolution has used dictionary-style methods Freeman et al. (2002); Yang et al. (2010), matching patches of images to higher-resolution counterparts. This research also extends to depth map super-resolution Mac Aodha et al. (2012); Park et al. (2011); Hui et al. (2016). Modern approaches to super-resolution are built on deep convolutional networks Dong et al. (2016); Wang et al. (2015); Osendorfer et al. (2014) as well as generative adversarial networks Ledig et al. (2016); Karras et al. (2018) which use an adversarial loss to imagine high-resolution details in RGB images.
Multi-View Representation Our work connects to multi-view representations which capture the characteristics of a 3D object from multiple viewpoints in 2D Koenderink and Van Doorn (1976); Murase and Nayar (1995); Su et al. (2015); Qi et al. (2016); Kar et al. (2017), such as decomposing image silhouettes Macrini et al. (2002), Light Field Descriptors Chen et al. (2003), and 2D panoramic mapping Shi et al. (2015). Other representations aim to use orientation Saxena et al. (2009), rotational invariance Kazhdan et al. (2003) or 3D-SURF features Knopp et al. (2010). While many of these representations are effective for 3D classification, they have not previously been utilized to recover 3D shape in high resolution.
Efficient 3D Representations Given that naïve representations of 3D data require cubic computational costs with respect to resolution, many alternate representations have been proposed. Octree methods Tatarchenko et al. (2017); Häne et al. (2017) use non-uniform discretization of the voxel space to efficiently capture 3D objects by adapting the discretization level locally based on shape. Hierarchical surface prediction (HSP) Häne et al. (2017) is an octree-style method which divides the voxel space into free, occupied and boundary space. The object is generated at different scales of resolution, where occupied space is generated at a very coarse resolution and the boundary space is generated at a very fine resolution. Octree generating networks (OGN) Tatarchenko et al. (2017) use a convolutional network that operates directly on octrees, rather than in voxel space. These methods have only shown results up to resolution. Our method achieves higher accuracy at this resolution and can efficiently produce objects as large as .
A recent trend is the use of unstructured representations such as mesh models Pontes et al. (2017); Kato et al. (2017); Wang et al. (2018) and point clouds Qi et al. (2017); Fan et al. (2017) which represent the data by an unordered set with a fixed number of points. MarrNet Wu et al. (2017), which resembles our work, models 3D objects through the use of 2.5D sketches, which capture depth maps from a single viewpoint. This approach requires working in voxel space when translating 2.5D sketches to high resolution, while our method can work directly in 2D space, leveraging 2D super-resolution technology within the 3D pipeline.
In this section we describe our methodology for representing high resolution 3D objects. Our algorithm is a novel approach which uses the six axis-aligned orthographic depth maps (ODM) to efficiently scale 3D objects to high resolution without directly interacting with the voxels. To achieve this, a pair of networks is used for each view, decomposing the super-resolution task into predicting the silhouette and relative depth from the low resolution ODM. This approach is able to recover fine object details and scales better to higher resolutions than previous methods, due to the simplified learning problem faced by each network, and scalable computations that occur primarily in 2D image space.
3.1 Orthographic Depth Map Super-Resolution
Our method begins by obtaining the orthographic depth maps of the six primary views of the low-resolution 3D object. In an ODM, each pixel holds a value equal to the surface depth of the object along the viewing direction at the corresponding coordinate. This projection can be computed quickly and easily from an axis-aligned 3D array via z-clipping, a well-known graphics operation. Super-resolution is then performed directly on these ODMs, before being mapped onto the low resolution object to produce a high resolution object.
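As an illustration of the projection described above, the following is a minimal NumPy sketch of extracting six axis-aligned ODMs from a cubic occupancy grid. The function name `orthographic_depth_maps`, the view ordering, and the depth encoding (a pixel stores N minus the index of the first occupied voxel along the ray, and 0 when the ray hits nothing) are assumptions made for this example, not the convention of the released code.

```python
import numpy as np

def orthographic_depth_maps(voxels):
    """Return six ODMs of shape (6, N, N) for a cubic boolean occupancy grid.

    Assumed encoding: pixel = N - k, where k is the index of the first
    occupied voxel along the viewing ray; pixel = 0 if the ray is empty.
    Views are ordered (axis 0, axis 0 flipped, axis 1, axis 1 flipped, ...).
    """
    voxels = np.asarray(voxels, dtype=bool)
    n = voxels.shape[0]
    odms = []
    for axis in range(3):
        for flip in (False, True):
            # Flipping the grid lets us look from the opposite face.
            v = np.flip(voxels, axis=axis) if flip else voxels
            first = np.argmax(v, axis=axis)   # first occupied index (0 if none)
            hit = v.any(axis=axis)            # does the ray hit the object?
            odms.append(np.where(hit, n - first, 0))
    return np.stack(odms)

# A single occupied voxel seen from all six faces of a 4x4x4 grid.
grid = np.zeros((4, 4, 4), dtype=bool)
grid[1, 2, 3] = True
odms = orthographic_depth_maps(grid)
print(odms.shape)                  # (6, 4, 4)
print(odms[0, 2, 3], odms[1, 2, 3])
```

With this encoding an empty pixel is exactly zero, so a silhouette can be read off as `odm > 0`.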
Representing an object by a set of depth maps, however, introduces a challenging learning problem, which requires both local and global consistency in depth. Furthermore, it is known that minimizing the mean squared error results in blurry images without sharp edges Mathieu et al. (2015); Pathak et al. (2016). This is particularly problematic as a depth map is required to be bimodal, with large variations in depth to create structure and small variations in depth to create texture and fine detail.
To address this concern, we propose decomposing the learning problem into two – predicting the silhouette and depth map separately. Separating the challenge of predicting gross shape from fine detail regularizes and reduces the complexity of the learning problem, leading to improved results when compared with directly estimating new surface depths.
Our full method, Multi-View Decomposition Networks (MVD), uses a pair of twin deep convolutional models to separately predict the silhouette and the variations in depth of the higher resolution ODM. We depict our system in figure 3.
The deep convolutional network for predicting the high-resolution silhouette is passed the low resolution ODM extracted from the input 3D object. The network outputs a probability that each pixel is occupied. It is trained by minimizing the mean squared error between the predicted silhouette and the true silhouette of the high resolution ODM, where the true silhouette is given by an indicator function marking the occupied pixels.
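Written out, the silhouette objective described above could take the following form. The notation is an assumption introduced for this sketch: $f_{\mathrm{SIL}}$ is the silhouette network with parameters $\theta$, $D_L^{(i)}$ and $D_H^{(i)}$ are the low- and high-resolution ODMs of the $i$-th training example, $\mathbb{1}[\cdot]$ is the indicator function applied per pixel, and empty pixels are assumed to be encoded as zero depth. The normalization constant of the mean is omitted.

```latex
\mathcal{L}_{\mathrm{SIL}}(\theta)
  = \sum_{i} \left\lVert
      f_{\mathrm{SIL}}\!\left(D_L^{(i)};\, \theta\right)
      - \mathbb{1}\!\left[ D_H^{(i)} > 0 \right]
    \right\rVert_2^2
```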
The same low-resolution ODM is passed through the second deep convolutional neural network, whose final output is passed through a sigmoid to produce an estimate for the variation of the ODM within a fixed range. This output is added to the low-resolution depth map, up-sampled using nearest neighbor interpolation, to produce our prediction for a constrained high-resolution depth map.
We train this network by minimizing the mean squared error between our prediction and the ground truth high-resolution depth map. During training only, we mask the output with the ground truth silhouette, using the Hadamard product, to allow the network to focus effectively on fine detail. We further add a smoothing regularizer which penalizes the total variation Rudin et al. (1992) within the predicted ODM. Our loss function is a simple combination of these terms. The total variation penalty is used as an edge-preserving denoiser which smooths out irregularities in the output.
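The constrained depth prediction and its training loss can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: `raw_output` stands in for the pre-sigmoid output of the depth network, `depth_range` and `tv_weight` are hypothetical hyperparameters, the low-resolution ODM is assumed to be expressed in high-resolution depth units already, and the total variation term uses a common anisotropic (absolute difference) discretization that may differ from the one used in the paper.

```python
import numpy as np

def upsample_nearest(odm, factor):
    """Nearest neighbor up-sampling of a 2D map by an integer factor."""
    return np.repeat(np.repeat(odm, factor, axis=0), factor, axis=1)

def constrained_depth(raw_output, low_res_odm, factor, depth_range):
    """Sigmoid-bounded variation added to the up-sampled low-resolution ODM."""
    sigmoid = 1.0 / (1.0 + np.exp(-raw_output))
    return depth_range * sigmoid + upsample_nearest(low_res_odm, factor)

def total_variation(img):
    """Anisotropic total variation: sum of absolute neighbor differences."""
    return np.abs(np.diff(img, axis=0)).sum() + np.abs(np.diff(img, axis=1)).sum()

def depth_loss(pred, target, tv_weight):
    """Masked mean squared error plus a total variation penalty."""
    mask = (target > 0).astype(pred.dtype)   # ground truth silhouette
    mse = np.mean((pred * mask - target) ** 2)
    return mse + tv_weight * total_variation(pred)

# One low-resolution pixel of depth 1, up-sampled by 2 with a range of 4.
pred = constrained_depth(np.zeros((2, 2)), np.array([[1.0]]), factor=2, depth_range=4.0)
target = np.array([[3.0, 3.0], [3.0, 0.0]])
print(pred)
print(depth_loss(pred, target, tv_weight=0.1))
```

Because the sigmoid output is scaled by a fixed range, the prediction can never move further than that range away from the low-resolution template, which is the constraint the text refers to.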
The outputs of the constrained depth map and silhouette networks are then combined to produce a complete prediction for the high-resolution ODM. This is accomplished by masking the constrained high-resolution depth map by the predicted silhouette.
The resulting predicted high resolution ODM can then be mapped back onto the original low resolution object by model carving to produce a high resolution object.
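The masking step can be illustrated with a short NumPy sketch. The function name `combine_odm` and the 0.5 threshold applied to the silhouette probabilities are assumptions for this example; the text only states that the depth map is masked by the predicted silhouette.

```python
import numpy as np

def combine_odm(constrained_depth, silhouette_prob, threshold=0.5):
    """Keep the predicted depth inside the silhouette and zero it elsewhere.

    `threshold` is an assumed cut-off for turning occupancy probabilities
    into a binary mask.
    """
    constrained_depth = np.asarray(constrained_depth, dtype=float)
    mask = np.asarray(silhouette_prob) >= threshold
    return np.where(mask, constrained_depth, 0.0)

depth = np.array([[5.0, 6.0], [7.0, 8.0]])
prob = np.array([[0.9, 0.2], [0.5, 0.1]])
print(combine_odm(depth, prob))
```

Since the silhouette decides which pixels survive, the edges of the final ODM stay sharp even when the depth network itself produces smooth output.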
3.2 3D Model Carving
To complete our super-resolution procedure, the six ODMs are combined with the low-resolution object to form a high-resolution object. This begins by further smoothing the up-sampled ODM with an adaptive averaging filter, which considers pixels beyond the adjacent neighbors. To preserve edges, only neighboring pixels within a threshold of the value of the center pixel are included. This smoothing, along with the total variation regularization in our loss function, is added to enforce smooth changes in local depth regions.
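A filter of the kind described above might look like the following sketch. The window `radius` and the depth `threshold` are hypothetical values chosen for illustration, and the plain double loop favors readability over speed.

```python
import numpy as np

def adaptive_smooth(odm, radius=2, threshold=1.5):
    """Edge-preserving average over a (2*radius+1)^2 window.

    Only window pixels whose depth is within `threshold` of the center
    pixel contribute, so large depth discontinuities are left untouched.
    """
    odm = np.asarray(odm, dtype=float)
    h, w = odm.shape
    out = np.empty_like(odm)
    for i in range(h):
        for j in range(w):
            window = odm[max(i - radius, 0):i + radius + 1,
                         max(j - radius, 0):j + radius + 1]
            close = np.abs(window - odm[i, j]) <= threshold
            out[i, j] = window[close].mean()   # center is always included
    return out

# A sharp step edge survives, while a small bump is averaged down.
step = np.array([[0.0, 0.0, 10.0, 10.0]] * 4)
print(np.array_equal(adaptive_smooth(step), step))
bump = np.full((3, 3), 5.0)
bump[1, 1] = 6.0
print(adaptive_smooth(bump, radius=1)[1, 1])
```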
Model carving begins by first up-sampling the low-resolution model to the desired resolution, using nearest neighbor interpolation. We then use the predicted ODMs to determine the surface of the new object. The carving procedure is separated into structure carving, corresponding to the silhouette prediction, and detail carving, corresponding to the constrained depth prediction. For the structure carving, for each predicted ODM, if a coordinate is predicted unoccupied, then all voxels perpendicular to the coordinate are highlighted to be removed. The removal actually occurs if there is agreement of at least two ODMs for the removal of a voxel. As there is a large amount of overlap in the surface area that the six ODMs observe, this silhouette agreement is enforced to maintain the structure of the object. However, we do not require agreement within the constrained depth map predictions. This is because, unlike the silhouettes, a depth map can cause or deepen concavities in the surface of the object which may not be visible from any other face. Requiring agreement among depth maps would eliminate their ability to influence these concavities. Thus, performing detail carving simply involves removing all voxels perpendicular to each coordinate of each ODM, up to the predicted depth.
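The two carving stages can be sketched as follows. This is an illustrative implementation under stated assumptions, not the released code: ODM pixels store N minus the index of the first occupied voxel along the ray (0 for an empty ray), views are ordered per axis as (unflipped, flipped), and the helper `odms_from_voxels` exists only to build test inputs. The input `voxels` is assumed to be the low-resolution object already up-sampled to the target resolution.

```python
import numpy as np

def odms_from_voxels(voxels):
    """Helper for the example: six ODMs using the assumed encoding."""
    voxels = np.asarray(voxels, dtype=bool)
    n = voxels.shape[0]
    odms = []
    for axis in range(3):
        for flip in (False, True):
            v = np.flip(voxels, axis=axis) if flip else voxels
            odms.append(np.where(v.any(axis=axis), n - np.argmax(v, axis=axis), 0))
    return np.stack(odms)

def carve(voxels, odms, min_votes=2):
    """Structure carving with agreement, then detail carving without it."""
    voxels = np.asarray(voxels, dtype=bool)
    n = voxels.shape[0]
    votes = np.zeros(voxels.shape, dtype=int)     # silhouette removal votes
    detail = np.zeros(voxels.shape, dtype=bool)   # voxels in front of a surface
    view = 0
    for axis in range(3):
        shape = [1, 1, 1]
        shape[axis] = n
        for flip in (False, True):
            odm = np.expand_dims(odms[view], axis)   # broadcast along the ray
            k = np.arange(n).reshape(shape)
            if flip:
                k = n - 1 - k                        # distance from the far face
            votes += (odm == 0)                      # empty pixel: vote on whole ray
            detail |= (odm > 0) & (k < n - odm)      # carve up to predicted depth
            view += 1
    return voxels & (votes < min_votes) & ~detail

# Carve a solid cube into a cube with a one-voxel dent on one face.
full = np.ones((4, 4, 4), dtype=bool)
dent = full.copy()
dent[0, 1, 1] = False
print(np.array_equal(carve(full, odms_from_voxels(dent)), dent))
```

The dent in this example is visible from only one face, which shows why detail carving cannot require agreement between views: no other ODM would ever vote to remove that voxel.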
In this section we present our results for both 3D object super-resolution and 3D object reconstruction from single RGB images. Our results are evaluated across 13 classes of the ShapeNet Chang et al. (2015) dataset. 3D super-resolution is the task of generating a high resolution 3D object conditioned on a low resolution input, while 3D object reconstruction is the task of re-creating high resolution 3D objects from a single RGB image of the object.
4.1 3D Object Super-Resolution
Dataset The dataset consists of low resolution voxelized objects and their high resolution counterparts. These objects were produced by converting CAD models found in the ShapeNetCore dataset Chang et al. (2015) into voxel format. We work with the three commonly used object classes from this dataset: Car, Chair and Plane, with around 8000, 7000, 4000 objects respectively.
For training, we pre-process this dataset, to extract the six ODMs from each object at high and low-resolution. CAD models converted at this resolution do not remain watertight in many cases, making it difficult to fill the inner volume of the object. We describe an efficient method for obtaining high resolution voxelized objects in the supplementary material. Data is split into training, validation, and test set using a ratio of 70:10:20 respectively.
Evaluation We evaluate our method quantitatively using the intersection over union metric (IoU) against a simple baseline and the prediction of the individual networks on the test set. The baseline method corresponds to the low resolution ground truth, up-scaled to the high resolution using nearest neighbor up-sampling. While our full method uses a combination of networks, we present an ablation study to evaluate the contribution of each separate network.
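For concreteness, the IoU metric and the nearest neighbor baseline can be sketched in NumPy as below. The function names are chosen for this example, and returning 1.0 when both grids are empty is an assumed convention.

```python
import numpy as np

def upsample_voxels(voxels, factor):
    """Nearest neighbor up-sampling of a 3D occupancy grid."""
    v = np.asarray(voxels, dtype=bool)
    for axis in range(3):
        v = np.repeat(v, factor, axis=axis)
    return v

def iou(pred, target):
    """Intersection over union of two boolean occupancy grids."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0   # assumed convention: two empty grids match perfectly
    return float(np.logical_and(pred, target).sum() / union)

# Baseline: up-sample one low-resolution voxel and compare to a thinner target.
low = np.zeros((2, 2, 2), dtype=bool)
low[0, 0, 0] = True
target = np.zeros((4, 4, 4), dtype=bool)
target[0:2, 0:2, 0:1] = True
print(iou(upsample_voxels(low, 2), target))
```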
Implementation The super-resolution task requires a pair of networks, one for the silhouette and one for the depth variation, which share the same architecture. This architecture is derived from the generator of SRGAN Ledig et al. (2016), a state of the art 2D super-resolution network. Exact network architectures and training regime are provided in the supplementary material.
Results The super-resolution IoU scores are presented in table 1. Our method greatly outperforms the naïve nearest neighbor up-sampling baseline in every class. While we find that the silhouette prediction contributes far more to the IoU score, the addition of the depth variation network further increases the IoU score. This is due to the silhouette capturing the gross structure of the object from multiple viewpoints, while the depth variation captures the fine-grained details, which contributes less to the total IoU score. To qualitatively demonstrate the results of our super-resolution system we render objects from the test set at both resolution in figure 5 and resolution in figure 4.
The predicted high-resolution objects are all of high quality and accurately mimic the shapes of the ground truth objects. Additional renderings as well as multiple objects from each class at resolution can be found in our supplementary material.
4.2 3D Object Reconstruction from RGB Images
Table 1: Super-resolution IoU scores. Columns: Category, Baseline, Depth Variation, Silhouette, MVD (Both).
Dataset To match the datasets used by prior work, two datasets are used for 3D object reconstruction, both derived from the ShapeNet dataset. The first consists of only the Car, Chair and Plane classes from the ShapeNet dataset, and we re-use the low and high resolution voxel objects produced for these classes in the previous section. The CAD models for each of these objects were rendered into RGB images capturing random viewpoints of the objects over a range of elevations and all possible azimuth rotations. The voxelized objects and corresponding images were split into a training, validation and test set, with a ratio of 70:10:20 respectively.
The second dataset is that provided by Choy et al. (2016). It consists of images and objects produced from the 3 classes in the ShapeNet dataset used in the previous section, as well as 10 additional classes, for a total of around 50000 objects. From each object, RGB images are rendered at random viewpoints, and we again compute their low and high resolution voxelized models and ODMs. The data is split into a training, validation and test set with a ratio of 70:10:20.
Evaluation We evaluate our method quantitatively with two evaluation schemes. In the first, we use IoU scores when reconstructing objects at high resolution. We compare against HSP Häne et al. (2017) using the first dataset, and against OGN Tatarchenko et al. (2017) using the second dataset. To study the effectiveness of our super-resolution pipeline, we also compute the IoU scores using the low resolution objects predicted by our autoencoder (AE) with nearest neighbor up-sampling to produce predictions at the same high resolution.
Our second evaluation is performed only on the second dataset, by comparing the accuracy of the surfaces of predicted objects to those of the ground truth meshes. Following the evaluation procedure defined by Wang et al. (2018), we first convert the voxel models into meshes by defining square polygons on all exposed faces on the surface of the voxel models. We then uniformly sample points from the two mesh surfaces and compute F1 scores. Precision and recall are calculated as the percentage of points whose nearest neighbor in the other sampling set lies within a fixed squared distance threshold. We compare to state of the art mesh model methods, N3MR Kato et al. (2017) and Pixel2Mesh Wang et al. (2018), a point cloud method, PSG Fan et al. (2017), and a voxel baseline, 3D-R2N2 Choy et al. (2016), using the values reported by Wang et al. (2018).
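The F1 computation on sampled surface points can be sketched as follows. This is a brute-force illustration: `f1_score` and `sq_threshold` are names chosen for this example, the threshold value is left to the caller, and the pairwise distance matrix would need to be replaced by a spatial index for large point sets.

```python
import numpy as np

def f1_score(pred_points, gt_points, sq_threshold):
    """F1 score between two point sets of shape (P, 3) and (G, 3).

    Precision: fraction of predicted points whose nearest ground truth
    point is closer than the squared distance threshold.
    Recall: the same with the roles of the two sets swapped.
    """
    pred = np.asarray(pred_points, dtype=float)
    gt = np.asarray(gt_points, dtype=float)
    # Pairwise squared distances, shape (P, G).
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(axis=-1)
    precision = np.mean(d2.min(axis=1) < sq_threshold)
    recall = np.mean(d2.min(axis=0) < sq_threshold)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

pred = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
gt = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
print(f1_score(pred, gt, sq_threshold=0.01))
```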
Implementation For 3D object reconstruction, we first trained a standard autoencoder, similar to prior work Choy et al. (2016); Smith and Meger (2017), to produce objects at low resolution. These low resolution objects are then used with our 3D super-resolution method to generate 3D object reconstructions at a high resolution. This process is described in figure 2. The exact network architecture and training regime are provided in the supplementary material.
Table 3: Surface accuracy (F1 scores). Columns: Category, 3D-R2N2 Choy et al. (2016), PSG Fan et al. (2017), N3MR Kato et al. (2017), Pixel2Mesh Wang et al. (2018), MVD (Ours).
Results The results of our IoU evaluation compared to the octree methods Tatarchenko et al. (2017); Häne et al. (2017) can be seen in table 2. We achieve state-of-the-art performance on every object class in both datasets. Our surface accuracy results can be seen in table 3 compared to Wang et al. (2018); Fan et al. (2017); Kato et al. (2017); Choy et al. (2016). Our method achieves state of the art results on all 13 classes. We show significant improvements for many object classes and demonstrate a large improvement on the mean over all classes when compared against the methods presented. To qualitatively evaluate our performance, we rendered our reconstructions for each class, which can be seen in figure 6. Additional renderings can be found in the supplementary material.
In this paper we argue for the application of multi-view representations when predicting the structure of objects at high resolution. We outline a novel system for learning to represent 3D objects and demonstrate its affinity for capturing category-specific shape details at a high resolution by operating over the six orthographic projections of the object. In the task of super-resolution, our method outperforms baseline methods by a large margin, and we show its ability to produce objects as large as 512×512×512, with a 16 times increase in size from the input objects. The results produced are visually impressive, even when compared against the ground-truth. When applied to the reconstruction of high-resolution 3D objects from single RGB images, we outperform several state of the art methods with a variety of representation types, across two evaluation metrics.
All of our visualizations demonstrate the effectiveness of our method at capturing fine-grained detail, which is not present in the low resolution input but must be captured in our network’s weights during learning. Furthermore, given that the deep aspect of our method works entirely in 2D space, our method scales naturally to high resolutions. This paper demonstrates that multi-view representations, along with 2D super-resolution through decomposed networks, are indeed capable of modeling complex shapes.
- Chang et al.  Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
- Chen et al.  Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity based 3d model retrieval. In Computer graphics forum, volume 22, pages 223–232. Wiley Online Library, 2003.
- Choy et al.  Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision, pages 628–644. Springer, 2016.
- Denton et al.  Trip Denton, M Fatih Demirci, Jeff Abrahamson, Ali Shokoufandeh, and Sven Dickinson. Selecting canonical views for view-based 3-d object recognition. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 273–276. IEEE, 2004.
- Dong et al.  Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
- Fan et al.  Haoqiang Fan, Hao Su, and Leonidas Guibas. A point set generation network for 3d object reconstruction from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), volume 38, 2017.
- Freeman et al.  William T Freeman, Thouis R Jones, and Egon C Pasztor. Example-based super-resolution. IEEE Computer graphics and Applications, 22(2):56–65, 2002.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680. 2014.
- Häne et al.  Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d object reconstruction. arXiv preprint arXiv:1704.00710, 2017.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Hui et al.  Tak-Wai Hui, Chen Change Loy, and Xiaoou Tang. Depth map super-resolution by deep multi-scale guidance. pages 353–369, 2016.
- Kar et al.  Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Advances in Neural Information Processing Systems, pages 364–375, 2017.
- Karras et al.  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations, 2018.
- Kato et al.  Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. arXiv preprint arXiv:1711.07566, 2017.
- Kazhdan et al.  Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3 d shape descriptors. In Symposium on geometry processing, volume 6, pages 156–164, 2003.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Knopp et al.  Jan Knopp, Mukta Prasad, Geert Willems, Radu Timofte, and Luc Van Gool. Hough transform and 3d surf for robust three dimensional classification. In European Conference on Computer Vision, pages 589–602. Springer, 2010.
- Koenderink and Van Doorn  Jan J Koenderink and Andrea J Van Doorn. The singularities of the visual mapping. Biological cybernetics, 24(1):51–59, 1976.
- Ledig et al.  Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
- Li et al.  Yangyan Li, Soeren Pirk, Hao Su, Charles R Qi, and Leonidas J Guibas. Fpnn: Field probing neural networks for 3d data. In Advances in Neural Information Processing Systems, pages 307–315, 2016.
- Liu et al.  Jerry Liu, Fisher Yu, and Thomas Funkhouser. Interactive 3d modeling with a generative adversarial network. arXiv preprint arXiv:1706.05170, 2017.
- Luong and Viéville  Q-T Luong and Thierry Viéville. Canonical representations for the geometries of multiple projective views. Computer vision and image understanding, 64(2):193–229, 1996.
- Mac Aodha et al.  Oisin Mac Aodha, Neill DF Campbell, Arun Nair, and Gabriel J Brostow. Patch based synthesis for single depth image super-resolution. In European Conference on Computer Vision, pages 71–84. Springer, 2012.
- Macrini et al.  Diego Macrini, Ali Shokoufandeh, Sven Dickinson, Kaleem Siddiqi, and Steven Zucker. View-based 3-d object recognition using shock graphs. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 3, pages 24–28. IEEE, 2002.
- Mathieu et al.  Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
- Maturana and Scherer  Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
- Murase and Nayar  Hiroshi Murase and Shree K Nayar. Visual learning and recognition of 3-d objects from appearance. International journal of computer vision, 14(1):5–24, 1995.
- Osendorfer et al.  Christian Osendorfer, Hubert Soyer, and Patrick Van Der Smagt. Image super-resolution with fast approximate convolutional sparse coding. In International Conference on Neural Information Processing, pages 250–257. Springer, 2014.
- Park et al.  Jaesik Park, Hyeongwoo Kim, Yu-Wing Tai, Michael S Brown, and Inso Kweon. High quality depth map upsampling for 3d-tof cameras. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1623–1630. IEEE, 2011.
- Park et al.  Sung Cheol Park, Min Kyu Park, and Moon Gi Kang. Super-resolution image reconstruction: a technical overview. IEEE signal processing magazine, 20(3):21–36, 2003.
- Pathak et al.  Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
- Pontes et al.  Jhony K Pontes, Chen Kong, Sridha Sridharan, Simon Lucey, Anders Eriksson, and Clinton Fookes. Image2mesh: A learning framework for single image 3d reconstruction. arXiv preprint arXiv:1711.10669, 2017.
- Qi et al.  Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
- Qi et al.  Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
- Rudin et al.  Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
- Saxena et al.  Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence, 31(5):824–840, 2009.
- Sharma et al.  Abhishek Sharma, Oliver Grau, and Mario Fritz. Vconv-dae: Deep volumetric shape learning without object labels. In European Conference on Computer Vision, pages 236–250. Springer, 2016.
- Shi et al.  Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. Deeppano: Deep panoramic representation for 3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
- Shi et al.  Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. pages 1874–1883, 2016.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Smith and Meger  Edward J Smith and David Meger. Improved adversarial systems for 3d object generation and reconstruction. In Conference on Robot Learning, pages 87–96, 2017.
- Su et al.  Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
- Tatarchenko et al.  Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2088–2096, 2017.
- Wang et al.  Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018.
- Wang et al.  Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. Deep networks for image super-resolution with sparse prior. In Proceedings of the IEEE International Conference on Computer Vision, pages 370–378, 2015.
- Wu et al.  Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
- Wu et al.  Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. Marrnet: 3d shape reconstruction via 2.5 d sketches. In Advances In Neural Information Processing Systems, pages 540–550, 2017.
- Yang et al.  Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE transactions on image processing, 19(11):2861–2873, 2010.
Appendix A Super-Resolution Network Architecture
Both the silhouette and depth super-resolution networks share the same architecture, which is derived from the generator of SRGAN Ledig et al., a state-of-the-art 2D super-resolution network.
The architecture begins with a single convolutional layer followed by 16 identical residual blocks He et al. with batch normalization, ReLU activations, and skip connections attaching each subsequent layer. This is followed by a convolutional layer with batch normalization, a ReLU activation function, and a skip connection attached to the output of the first convolutional layer. The final layers are a series of sub-pixel convolution layers that increase the resolution of the images Shi et al., followed by a final convolutional layer that reduces the number of channels to 1, with a sigmoid activation to limit the output range. The number of sub-pixel convolution layers is equal to of the upscaling factor. The kernel size is and stride length is 1 for all layers, and all convolutional layers have kernel depth 128 except for the last, with depth 1. The kernel depth begins at size 256 for the sub-pixel convolutional layers and decreases by a factor of 2 for each subsequent layer, to offset the increase in kernel height and width.
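To make the sub-pixel convolution step of Shi et al. concrete, the following is a minimal NumPy sketch of the channel-to-space rearrangement that such a layer applies after its convolution. The function name `pixel_shuffle`, the channel-first layout, and the per-layer factor `r = 2` used in the example are illustrative assumptions rather than details of our implementation.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C * r^2, H, W) feature map into (C, H * r, W * r).

    Output pixel (c, h * r + i, w * r + j) is taken from input channel
    c * r^2 + i * r + j at position (h, w).
    """
    c_r2, h, w = x.shape
    if c_r2 % (r * r) != 0:
        raise ValueError("channel count must be divisible by r^2")
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Hypothetical example: 4 channels of size 2x2 become 1 channel of size 4x4.
features = np.arange(16).reshape(4, 2, 2)
upscaled = pixel_shuffle(features, 2)
print(upscaled.shape)  # (1, 4, 4)
```

Each rearrangement trades a factor of r² in channel count for an r-fold increase in both height and width, which is why the channel depth of the convolution feeding it has to be chosen with the target resolution in mind.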
Each network is trained using Adam Kingma and Ba  with default hyper-parameters and a learning rate of , trained over mini-batches of size 32, until convergence. We use a adaptive averaging filter with a threshold of 10 for objects and 20 for objects. The output of is constrained to a maximum of for objects and for objects.
Appendix B Low Resolution Object Reconstruction Network Architecture
The network used for predicting the low-resolution 3D objects is a deep convolutional autoencoder. The encoder takes an RGB image as input and passes it through five convolutional layers with batch normalization, leaky-ReLU activations, and stride length 2, followed by a fully connected layer to produce a vector of length 128. The decoder begins with a fully connected layer that increases the vector to length 1024, followed by an alternation of nine 3D deconvolutional and convolutional layers that morph the up-sampled vector into a complete 3D shape. It outputs a matrix of voxel probabilities. Training was performed using the Adam optimizer Kingma and Ba, with a mean squared error loss, and was halted when IoU scores on the validation set stopped improving.
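The stopping criterion above relies on the intersection over union (IoU) between a predicted voxel grid and its ground truth. A minimal sketch of that metric follows; the function name `voxel_iou` and the 0.5 binarization threshold are assumptions made for illustration.

```python
import numpy as np

def voxel_iou(pred_probs, target, threshold=0.5):
    """IoU between a grid of voxel probabilities and a binary target grid."""
    pred = pred_probs > threshold          # binarize predictions (assumed threshold)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0                         # both grids empty: treat as a perfect match
    return np.logical_and(pred, target).sum() / union

# Hypothetical example on a 2x2x2 grid.
pred = np.zeros((2, 2, 2))
pred[0] = 0.9                              # 4 voxels predicted as occupied
target = np.zeros((2, 2, 2))
target[0, 0] = 1                           # 2 of them are truly occupied
target[1, 0] = 1                           # 2 occupied voxels are missed
print(voxel_iou(pred, target))             # intersection 2, union 6
```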
Appendix C Dataset Details
A main problem with voxelizing mesh models at high resolution is that meshes may not be watertight. This makes it difficult to produce solid objects without any unintended holes or unfilled areas. A method to fix this problem, suggested by Häne et al., involves eroding one voxel across the entire surface of the lower resolution model, applying a nearest neighbor up-sampling to high resolution, occupying all voxels that intersect with the mesh, and then applying a graph-cut based regularization with a small smoothness term to decide the remaining voxels. While this does rectify the issue of non-watertight meshes, it may not reproduce the original surface perfectly and may lead to an overly smooth model.
We suggest a new method to produce accurate, high-resolution voxel models from non-watertight CAD models. We first convert the CAD model to voxels at resolution , and determine their orthographic depth maps. The high-resolution models are then down-sampled to resolution (wherein they are guaranteed to be watertight), all internal voxels are filled, and the result is up-sampled back to the original resolution using nearest neighbor interpolation. Finally, the six depth faces are used to carve away the surface voxels of the reproduced high-resolution object. The only situation in which this does not produce a complete model is when the CAD model is missing one or more large faces at some point on its surface. Such objects are automatically discarded, as no true voxel object can be extracted from them. This occurrence is rare and is absent from almost all object classes.
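The procedure can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not our exact pipeline: the function names are hypothetical, down-sampling is done by marking a coarse cell occupied if any of its voxels is occupied, and interior filling is done by a flood fill of the exterior from a padded corner.

```python
import numpy as np
from collections import deque

def depth_maps(vox):
    """Six orthographic depth maps: empty voxels before the first hit (n if none)."""
    n = vox.shape[0]
    maps = {}
    for axis in range(3):
        for flip in (False, True):
            v = np.flip(vox, axis) if flip else vox
            hit = v.any(axis=axis)
            maps[(axis, flip)] = np.where(hit, v.argmax(axis=axis), n)
    return maps

def fill_interior(vox):
    """Mark every voxel that the exterior cannot reach (6-connectivity) as solid."""
    padded = np.pad(vox, 1)                    # empty border so the exterior is connected
    outside = np.zeros_like(padded)
    outside[0, 0, 0] = True
    queue = deque([(0, 0, 0)])
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in steps:
            p = (x + dx, y + dy, z + dz)
            inside_grid = all(0 <= c < s for c, s in zip(p, padded.shape))
            if inside_grid and not padded[p] and not outside[p]:
                outside[p] = True
                queue.append(p)
    return ~outside[1:-1, 1:-1, 1:-1]

def carve(vox, maps):
    """Remove voxels lying in front of the observed surface in each of the six views."""
    n = vox.shape[0]
    out = vox.copy()
    for (axis, flip), d in maps.items():
        shape = [1, 1, 1]
        shape[axis] = n
        pos = np.arange(n).reshape(shape)      # voxel index along the viewing axis
        depth = np.expand_dims(d, axis)
        out &= (pos <= n - 1 - depth) if flip else (pos >= depth)
    return out

def solid_from_surface(surface, factor):
    """Depth maps -> down-sample -> fill -> nearest neighbor up-sample -> carve."""
    maps = depth_maps(surface)
    m = surface.shape[0] // factor
    low = surface.reshape(m, factor, m, factor, m, factor).any(axis=(1, 3, 5))
    solid_low = fill_interior(low)
    up = solid_low.repeat(factor, 0).repeat(factor, 1).repeat(factor, 2)
    return carve(up, maps)

# Hypothetical example: a hollow 6x6x6 shell inside an 8x8x8 grid becomes a solid cube.
shell = np.zeros((8, 8, 8), dtype=bool)
shell[1:7, 1:7, 1:7] = True
shell[2:6, 2:6, 2:6] = False
print(solid_from_surface(shell, 2).sum())      # 216
```

The coarse fill deliberately over-estimates the object; the carving step is what restores the high-resolution surface, since every voxel in front of the surface seen in any of the six views is removed.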
Appendix D Analysis of Super Resolution for ODMs
Several state-of-the-art super-resolution techniques were tested alongside our own architecture. The first was a slight variant of SRGAN Ledig et al., a state-of-the-art adversarial generation system for image super-resolution, adept at producing photo-realistic RGB images at up to a 4 times resolution increase. The SRGAN system uses two deep convolutional neural networks, acting as the generator and discriminator, and applies the generic GAN loss formulation Goodfellow et al. alongside a VGG loss (based on the difference of layer activations from a pre-trained VGG network Simonyan and Zisserman) to upscale images. In our variant, the VGG loss term was removed from the generator loss function and replaced by an MSE loss, as our dataset is far more constrained.
The second super-resolution algorithm compared was the SRGAN algorithm without adversarial loss. This corresponds to the generator of SRGAN directly predicting the higher resolution image, trained with an MSE loss. We include this variant because the adversarial loss is employed to achieve photorealism rather than reconstruction accuracy.
The third super-resolution scheme tested for our task was MS-Net Hui et al., the state of the art for depth map super-resolution. This passes depth maps through a CNN consisting of a convolutional layer followed by 3 deconvolution networks to increase the image dimensionality, culminating in a final convolutional layer to output the high-resolution image. The novelty of the scheme is that instead of passing the image directly, only the high-frequency details are passed through the network, and the result is then added to the original image's low-frequency information, which is up-sampled to the higher resolution using bicubic interpolation.
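The frequency split used by this scheme can be illustrated with a short NumPy sketch. It is only a schematic under stated assumptions: a 3×3 box filter stands in for the low-pass filter, nearest neighbor up-sampling is swapped in for bicubic interpolation to keep the example dependency-free, and `predict_high` is a hypothetical placeholder for the trained CNN.

```python
import numpy as np

def box_blur(img, k=3):
    """Low-pass filter: k x k box average with edge replication (assumed filter)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def upsample_nearest(img, r):
    """Nearest neighbor up-sampling by an integer factor r (stand-in for bicubic)."""
    return img.repeat(r, axis=0).repeat(r, axis=1)

def residual_super_resolve(depth_lr, r, predict_high):
    """Send only high-frequency detail through the predictor, then add back the
    up-sampled low-frequency component."""
    low = box_blur(depth_lr)
    high = depth_lr - low
    return upsample_nearest(low, r) + predict_high(high, r)

# Hypothetical example: with plain up-sampling as the "network", the result
# reduces to up-sampling the input directly.
depth = np.arange(16, dtype=float).reshape(4, 4)
out = residual_super_resolve(depth, 2, upsample_nearest)
print(out.shape)  # (8, 8)
```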
We compare the accuracy of these algorithms to our own by testing their performance at recovering high-resolution ODMs from low-resolution ODMs from the chair object class. We also test the performance of our algorithm when omitting smoothing, when omitting information from the occupancy maps, and when omitting information from the depth maps. We train, validate, and test on the same 70:10:20 split as for the image reconstruction task. We trained all networks using the Adam optimizer Kingma and Ba with a learning rate of , and halted training when performance on the validation set, evaluated every epoch, stopped improving. The MSE for each algorithm on our held-out test set is shown in Table 4. As can be seen, our algorithm achieves far lower error when recovering ODMs. The results demonstrate that smoothing and depth map information all play a role in improving the accuracy of our algorithm.
Appendix E Super-Resolution Visualizations
Super-resolution renderings for the 13 classes of ShapeNet are presented on the following pages. Images are presented in the following order: low-resolution input (left), super-resolution output (center), and ground truth (right).
Appendix F Object Reconstruction Visualizations
3D object reconstruction renderings for the 13 classes of ShapeNet are presented on the following pages. The image inputs are presented on the left, and the high-resolution outputs are presented on the right.