SurfaceNet: An Endtoend 3D Neural Network for Multiview Stereopsis
Abstract
This paper proposes an endtoend learning framework for multiview stereopsis. We term the network SurfaceNet. It takes a set of images and their corresponding camera parameters as input and directly infers the 3D model. The key advantage of the framework is that both photoconsistency as well geometric relations of the surface structure can be directly learned for the purpose of multiview stereopsis in an endtoend fashion. SurfaceNet is a fully 3D convolutional network which is achieved by encoding the camera parameters together with the images in a 3D voxel representation. We evaluate SurfaceNet on the largescale DTU benchmark.
1 Introduction
In multiview stereopsis (MVS), a dense model of a 3D object is reconstructed from a set of images with known camera parameters. This classic computer vision problem has been extensively studied and the standard pipeline involves a number of separate steps [4, 27]. In available multiview stereo pipelines, sparse features are first detected and then propagated to a dense point cloud for covering the whole surface [8, 11], or multiview depth maps are first computed followed with a depth map fusion step to obtain the 3D reconstruction of the object [15, 17]. A great variety of approaches have been proposed to improve different steps in the standard pipeline. For instance, the works [27, 4, 9] have focused on improving the depth map generation using MRF optimization, photoconsistency enforcement, or other depth map postprocessing operations like denoising or interpolation. Other approaches have focused on more advanced depth fusion algorithms [13].
Recent advances in deep learning have been only partially integrated. For instance, [32] uses a CNN instead of handcrafted features for finding correspondences among image pairs and [10] predicts normals for depth maps using a CNN, which improves the depth map fusion. The full potential of deep learning for multiview stereopsis, however, can only be explored if the entire pipeline is replaced by an endtoend learning framework that takes the images with camera parameters as input and infers the surface of the 3D object. Apparently, such endtoend scheme has the advantage that photoconsistency and geometric context for dense reconstruction can be directly learned from data without the need of manually engineering separate processing steps.
Instead of improving the individual steps in the pipeline, we therefore propose the first endtoend learning framework for multiview stereopsis, named SurfaceNet. Specifically, SurfaceNet is a 3D convolutional neural network that can process two or more views and the loss function is directly computed based on the predicted surface from all available views. In order to obtain a fully convolutional network, a novel representation for each available viewpoint, named colored voxel cube (CVC), is proposed to implicitly encode the camera parameters via a straightforward perspective projection operation outside the network. Since the network predicts surface probabilities, we obtain a reconstructed surface by an additional binarization step. In addition, an optional thinning operation can be applied to reduce the thickness of the surface. Besides of binarization and thinning, SurfaceNet does not require any additional postprocessing or depth fusion to obtain an accurate and complete reconstruction.
2 Related Works
Works in the multiview stereopsis (MVS) field can be roughly categorised into volumetric methods and depth maps fusion algorithms. While earlier works like space carving [14, 22] mainly use a volumetric representation, current stateoftheart MVS methods focus on depth map fusion algorithms [27, 4, 9], which have been shown to be more competitive in handling large datasets in practice. As our method is more related to the second category, our survey mainly covers depth map fusion algorithms. A more comprehensive overview of MVS approaches is given in the tutorial article [7].
The depth map fusion algorithms first recover depth maps [30] from view pairs by matching similarity patches [2, 18, 33] along epipolar line and then fuse the depth maps to obtain a 3D reconstruction of the object [27, 4, 9]. In order to improve the fusion accuracy, [4] mainly learns several sources of the depth map outliers. [27] is designed for ultra highresolution image sets and uses a robust decriptor for efficient matching purposes. The Gipuma algorithm proposed in [9] is a massively parallel method for multiview matching built on the idea of patchmatch stereo [3]. Aggregating image similarity across multiple views, [9] can obtain more accurate depth maps. The depth fusion methods usually contain several manually engineered steps, such as point matching, depth map denoising, and view pair selection. Compared with the mentioned depth fusion methods, the proposed SurfaceNet infers the 3D surface with thin structure directly from multiview images without the need of manually engineering separate processing steps.
The proposed method in [8] describes a patch model that consists of a quasidense set of rectangular patches covering the surface. The approach starts from a sparse set of matched keypoints that are repeatedly expanded to nearby pixel correspondences before filtering the false matches using visibility constraints. However, the reconstructed rectangular patches cannot contain enough surface detail, which results in small holes around the curved model surface. In contrast, our datadriven method predicts the surface with fine geometric detail and has less holes around the curved surface.
Convolutional neural networks have been also used in the context of MVS. In [10], a CNN is trained to predict the normals of a given depth map based on image appearance. The estimated normals are then used to improve the depth map fusion. Compared to its previous work [9], the approach increases the completeness of the reconstruction at the cost of a slight decrease in accuracy. Deep learning has also been successfully applied to other 3D applications like volumetric shape retrieval or object classification [28, 25, 16, 19]. In order to simplify the retrieval and representation of 3D shapes by CNNs, [24] introduces a representation, termed geometry image, which is a 2D representation that approximates a 3D shape. Using the 2D representation, standard 2D CNN architectures can be applied. While the 3D reconstruction of an object from a single or multiple views using convolutional neural networks has been studied in [26, 5], the approaches focus on a very coarse reconstruction without geometric details when only one or very few images are available. In other words, these methods are not suitable for MVS. The proposed SurfaceNet is able reconstruct large 3D surface models with detailed surface geometry.
3 Overview
We propose an endtoend learning framework that takes a set of images and their corresponding camera parameters as input and infers the 3D model. To this end, we voxelize the solution space and propose a convolutional neural network (CNN) that predicts for each voxel a binary attribute depending on whether the voxel is on the surface or not. We call the network, which reconstructs a 2D surface from a 3D voxel space, SurfaceNet. It can be considered as an analogy to object boundary detection [29], which predicts a 1D boundary from 2D image input.
4 SurfaceNet
Given two images and for two views and of a scene with known camera parameters and a voxelization of the scene denoted by a 3D tensor , our goal is to reconstruct the 2D surface in by estimating for each voxel if it is on the surface or not, \ie.
To this end, we propose an endtoend learning framework, which automatically learns both photoconsistency and geometric relations of the surface structure. Intuitively, for accurate reconstruction, the network requires the images and as well as the camera parameters. However, direct usage of , and their corresponding camera parameters as input would unnecessarily increase the complexity of the network since such network needs to learn the relation between the camera parameters and the projection of a voxel onto the image plane in addition to the reconstruction. Instead, we propose a 3D voxel representation that encodes the camera parameters implicitly such that our network can be fully convolutional.
We denote our representation as colored voxel cube (CVC), which is computed for each view and illustrated in Fig. 2. For a given view , we convert the image into a 3D colored cube by projecting each voxel onto the image and storing the RGB values for each voxel respectively. For the color values, we subtract the mean color [23]. Since this representation is computed for all voxels , the voxels that are on the same projection ray have the same color . In other words, the camera parameters are encoded with CVC. As a result, we obtain for each view a projectionspecific stripe pattern as illustrated in Fig. 2.
4.1 SurfaceNet Architecture
The architecture of the proposed SurfaceNet is shown in Fig. 3. It takes two colored voxel cubes from two different viewpoints as input and predicts for each voxel the confidence , which indicates if a voxel is on the surface. While the conversion of the confidences into a surface is discussed in Section 4.3, we first describe the network architecture.
The detailed network configuration is summarized in Table 1. The network input is a pair of colored voxel cubes, where each voxel stores three RGB color values. For cubes with voxels, the input is a tensor of size . The basic building blocks of our model are 3D convolutional layers , 3D pooling layers and 3D upconvolutional layers , where represents the th layer in the th group of convolutional layers. Additionally, a rectified linear unit (ReLU) is appended to each convolutional layer and the sigmoid function is applied to the layers and . In order to decrease the training time and increase the robustness of training, batch normalization [12] is utilized in front of each layer. The layers in are dilated convolutions [31] with dilation factor of 2. They are designed to exponentially increase the receptive field without loss of feature map resolution. The layer increases the performance by aggregating multiscale contextual information from the side layers to consider multiscale geometric features. Since the network is fully convolutional, the size of the CVC cubes can be adaptive. The output is always the same size as the CVC cubes.
layer name  type  output size  kernel size 

input  CVC    
conv  
upconv  
pooling  
conv  
upconv  
pooling  
conv  
upconv  
dilconv  
upconv  
conv  
conv 
4.2 Training
As the SurfaceNet is a dense prediction network, \ie, the network predicts the surface confidence for each voxel, we compare the prediction per voxel with the groundtruth . For training, we use a subset of the scenes from the DTU dataset [1] which provides images, camera parameters, and reference reconstructions obtained by a structured light system. A single training sample consists of a cube cropped from a 3D model and two CVC cubes and from two randomly selected views and . Since most of the voxels do not contain the surface, \ie, we weight the surface voxels by
(1) 
where denotes the set of sampled training samples, and the nonsurface voxels by . We use a classbalanced crossentropy function as loss for training, \iefor a single training sample we have:
(2)  
For updating the weights of the model, we use stochastic gradient descent with Nesterov momentum update.
Due to the relatively small number of 3D models in the dataset, we perform data augmentation in order to reduce overfitting and improve the generalization. Each cube is randomly translated and rotated and the color is varied by changing illumination and introducing Gaussian noise.
4.3 Inference
For inference, we process the scene not at once due to limitations of the GPU memory but divide the volume into cubes. For each cube, we first compute for both camera views the colored voxel cubes and , which encode the camera parameters, and infer the surface probability for each voxel by the SurfaceNet. Since some of the cubes might not contain any surface voxels, we discuss in Section 6 how these empty cubes can be rejected in an early stage.
In order to convert the probabilities into a surface, we perform two operations. The first operation is a simple thresholding operation that converts all voxels with into surface voxels and all other voxels are set to zero. In Section 6, we discuss how the threshold can be adaptively set when the 3D surface is recovered. The second operation is optional and it performs a thinning procedure of the surface since the surface might be several voxels thick after the binarization. To obtain a thin surface, we perform a pooling operation, which we call ray pooling. For each view, we vote for a surface voxel if , where denotes the voxels that are projected onto the same pixel. If both operations are used, a voxel is converted into a surface voxel if both views vote for it during ray pooling and .
5 MultiView Stereopsis
So far we have described training and inference with SurfaceNet if only two views are available. We now describe how it can be trained and used for multiview stereopsis.
5.1 Inference
If multiple views are available, we select a subset of view pairs and compute for a cube and each selected view the CVC cube . We will discuss at the end of the section how the view pairs are selected.
For each view pair , SurfaceNet predicts , \iethe confidence that a voxel is on the surface. The predictions of all view pairs can be combined by taking the average of the predictions for each voxel. However, the view pairs should not be treated equally since the reconstruction accuracy varies among the view pairs. In general, the accuracy depends on the viewpoint difference of the two views and the presence of occlusions.
To further identify occlusions between two views and , we crop a patch around the projected center voxel of for each image and . To compare the similarity of the two patches, we train a triplet network [20] that learns a mapping from images to a compact 128D Euclidean space where distances directly correspond to a measure of image similarity. The dissimilarity of the two patches is then given by
(3) 
where denotes the feature embedding provided by the triplet network for the patch from image . This measurement can be combined by the relation of the two viewpoints and , which is measured by the angle between the projection rays of the center voxel of , which is denoted by . We use a 2layer fully connected neural network , that has 100 hidden neurons with sigmoid activation function and one linear output neuron followed by a softmax layer. The relative weights for each view pair are then given by
(4) 
and the weighted average of the predicted surface probabilities by
(5) 
where denotes the set of selected view pairs. Since it is unnecessary to take all view pairs into account, we select only view pairs, which have the highest weight . In Section 7, we evaluate the impact of .
5.2 Training
The SurfaceNet can be trained together with the averaging of multiple view pairs (5). We select for each cube , random view pairs and the loss is computed after the averaging of all view pairs. We use as a tradeoff since larger values increase the memory for each sampled cube and thus require to reduce the batch size for training due to limited GPU memory. Note that for inference, can be larger or smaller than .
In order to train the triplet network for the dissimilarity measurement (3), we sample cubes and three random views where the surface is not occluded. The corresponding patches obtained by projecting the center of the cube onto the first two views serve as a positive pair. The negative patch is obtained by randomly shifting the patch of the third view by at least a quarter of the patch size. While the first two views are different, the third view can be the same as one of the other two views. At the same time, we use data augmentation by varying illumination or adding noise, rotation, scale, and translation. After SurfaceNet and the triplet network are trained, we finally learn the shallow network (4).
6 Implementation Details
As described in Section 4.3, the scene volume needs to be divided into cubes due to limitations of the GPU memory. However, it is only necessary to process cubes that are very likely to contain surface voxels. As an approximate measure, we apply logistic regression to the distance vector (3) for each view pair, which predicts if the patches of both views are similar. If for less than of the view pairs the predicted similarity probability is greater than or equal to , we reject the cube.
Instead of using a single threshold for binarization, one can also adapt the threshold for each cube based on its neighboring cubes . The approach is iterated and we initialize by for each cube . We optimize by minimizing the energy
(6) 
where denotes the estimated surface in cube after binarization with threshold .
For the binary term , we use
(7) 
The first two terms penalize if and disagree in the overlapping region, which can be easily achieved by setting the threshold very high such that the overlapping region contains only very few surface voxels. The last term therefore aims to maximize the surface voxels that are shared among the cubes.
We used the Lasagne Library [6] to implement the network structure. The code and the trained model are publically available.^{1}^{1}1https://github.com/mjiUST/SurfaceNet
7 Experiments
7.1 Dataset
The DTU multiview stereo dataset [1] is a large scale MVS benchmark. It features a variety of objects and materials, and contains 80 different scenes seen from 49 or 64 camera positions under seven different lighting conditions. The provided reference models are acquired by accurate structured light scans. The large selection of shapes and materials is wellsuited to train and test our method under realistic conditions and the scenes are complex enough even for the stateoftheart methods, such as furu [8], camp [4], tola [27] and Gipuma [9]. For evaluation, we use three subsets of the objects from the DTU dataset for training, validation and evaluation. ^{2}^{2}2Training: 2, 6, 7, 8, 14, 16, 18, 19, 20, 22, 30, 31, 36, 39, 41, 42, 44, 45, 46, 47, 50, 51, 52, 53, 55, 57, 58, 60, 61, 63, 64, 65, 68, 69, 70, 71, 72, 74, 76, 83, 84, 85, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 107, 108, 109, 111, 112, 113, 115, 116, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128. Validation: 3, 5, 17, 21, 28, 35, 37, 38, 40, 43, 56, 59, 66, 67, 82, 86, 106, 117. Evaluation: 1, 4, 9, 10, 11, 12, 13, 15, 23, 24, 29, 32, 33, 34, 48, 49, 62, 75, 77, 110, 114, 118
The evaluation is based on accuracy and completeness. Accuracy is measured as the distance from the inferred model to the reference model, and the completeness is calculated the other way around. Although both are actually error measures and a lower value is better, we use the terms accuracy and completeness as in [1]. Since the reference models of the DTU dataset are downsampled to a resolution of for evaluation, we set the voxel resolution to . In order to train SurfaceNet with a reasonable batch size, we use cubes with voxels.
7.2 Impact of Parameters
We first evaluate the impact of the parameters for binarization. In order to provide a quantitative and qualitative analysis, we use model 9 of the dataset, which was randomly chosen from the test set. By default, we use cubes with voxels.
As discussed in Section 5.1, the binarization depends on the threshold and the parameter for thinning. If thinning is not performed. Fig. 4 shows the impact of and . For each parameter setting, the front view and the intersection with a horizontal plane (red) is shown from top view. The top view shows the thickness and consistency of the reconstruction. We observe that is a tradeoff between accuracy and completeness. While a large value discards large parts of the surface, results in a noisy and inaccurate reconstruction. The thinning, i.e., , improves the accuracy for any threshold and slightly impacts the completeness.
While the threshold seems to provide a good tradeoff between accuracy and completeness, we also evaluate the approach described in Section 6 where the constant threshold is replaced by an adaptive threshold that is estimated for each cube by minimizing (6) iteratively. The energy, however, also provides a tradeoff parameter (7). If is large, we prefer completeness and when is small we prefer accuracy. This is reflected in Fig. 5 where we show the impact of . Fig. 6 shows how the reconstruction improves for with the number of iterations from the initialization with . The method quickly converges after a few iterations. By default, we use and 8 iterations for the adaptive binarization approach.
methods (mm) 






0.574  0.284  0.627  0.202  
0.448  0.234  0.706  0.242  
adaptive threshold  
0.434  0.249  0.792  0.229  
adaptive threshold  
w/o weighted average  0.448  0.251  0.798  0.228 
We compare the adaptive binarization with the constant thresholding in Table 2 where we report mean and median accuracy and completeness for model 9. The configuration and provides a good tradeoff between accuracy and completeness. Using adaptive thresholding with as described in Section 6 achieves a better mean accuracy but the mean completeness is slightly worse. We also report in the last row the result when the probabilities are not weighted in (5), \ie, . This slightly deteriorates the accuracy.
We finally evaluate the impact of fusing the probabilities of the best view pairs in (5). By default, we used so far. The impact of is shown in Fig. 7. While taking only the best view pair results in a very noisy inaccurate reconstruction, the accuracy is substantially improved for three view pairs. After five view pairs the accuracy slightly improves at the cost of a slight deterioration of the completeness. We therefore keep .
7.3 Comparison with others methods
methods (mm) 






camp [4]  0.834  0.335  0.987  0.108  
furu [8]  0.504  0.215  1.288  0.246  
tola [27]  0.318  0.190  1.533  0.268  
Gipuma [9]  0.268  0.184  1.402  0.165  
, ,  1.327  0.259  1.346  0.145  
, ,  0.779  0.204  1.407  0.172  
, adapt ,  0.546  0.209  1.944  0.167  
, ,  0.625  0.219  1.293  0.141  
, ,  0.454  0.191  1.354  0.164  
, adapt ,  0.307  0.183  2.650  0.342 
We finally evaluate our approach on the test set consisting of 22 models and compare it with the methods camp [4], furu [8], tola [27], and Gipuma [9]. While we previously used cubes with voxels, we also include the results for cubes with voxels in Table 3. Using larger cubes, \ie, improves accuracy and completeness using a constant threshold with or . When adaptive thresholding is used, the accuracy is also improved but the completeness deteriorates.
If we compare our approach using the setting , , and with the other methods, we observe that our approach achieves a better accuracy than camp [4] and furu [8] and a better completeness than tola [27] and Gipuma [9]. Overall, the reconstruction quality is comparable to the other methods. A qualitative comparison is shown in Fig. 8.
7.4 Runtime
The training process takes about 5 days using an Nvidia Titan X GPU. The inference is linear in the number of cubes and view pairs. For one view pair and a cube with voxels, inference takes 50ms. If the cube size is increased to voxels, the runtime increases to 400ms. However, the larger the cubes are the less cubes need to be processed. In case of , model 9 is divided into 375k cubes, but most of them are rejected as described in Section 6 and only 95k cubes are processed. In case of , model 9 is divided into 48k cubes and only 12k cubes are processed.
7.5 Generalization
methods (mm) 






49 views  
, adapt ,  0.197  0.149  1.762  0.237  
, ,  0.256  0.122  0.756  0.193  
15 views  
, adapt ,  0.191  0.135  2.517  0.387  
, ,  0.339  0.117  0.971  0.229 
In order to demonstrate the generalization abilities of the model, we apply it to a camera setting that is not part of the training set and an object from another dataset.
The DTU dataset [1] provides for some models additional 15 views which have a larger distance to the object. For training, we only use the 49 views, which are available for all objects, but for testing we compare the results if the approach is applied to the same 49 views used for training or to the 15 views that have not been used for training. In our test set, the additional 15 views are available for the models 110, 114, and 118. The results in Table 4 show that the method performs very well even if the camera setting for training and testing differs. While the accuracy remains relatively stable, an increase of the completeness error is expected due to the reduced number of camera views.
8 Conclusion
In this work, we have presented the first endtoend learning framework for multiview stereopsis. To efficiently encode the camera parameters, we have also introduced a novel representation for each available viewpoint. The socalled colored voxel cubes combine the image and camera information and can be processed by standard 3D convolutions. Our qualitative and quantitative evaluation on a largescale MVS benchmark demonstrated that our network can accurately reconstruct the surface of 3D objects. While our method is currently comparable to the stateoftheart, the accuracy can be further improved by more advanced postprocessing methods. We think that the proposed network can also be used for a variety of other 3D applications.
Acknowledgments
The work has been supported in part by Natural Science Foundation of China (NSFC) under contract No. 61331015, 6152211, 61571259 and 61531014, in part by the National key foundation for exploring scientific instrument No.2013YQ140517, in part by Shenzhen Fundamental Research fund (JCYJ20170307153051701) and in part by the DFG project GA 1927/22 as part of the DFG Research Unit FOR 1505 Mapping on Demand (MoD).
References
 [1] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl. Largescale data for multipleview stereopsis. International Journal of Computer Vision, pages 1–16, 2016.
 [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph, 28(3):24–1, 2009.
 [3] M. Bleyer, C. Rhemann, and C. Rother. Patchmatch stereostereo matching with slanted support windows. In British Machine Vision Conference, volume 11, pages 1–11, 2011.
 [4] N. D. Campbell, G. Vogiatzis, C. Hernández, and R. Cipolla. Using multiple hypotheses to improve depthmaps for multiview stereo. In European Conference on Computer Vision, pages 766–779, 2008.
 [5] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3dr2n2: A unified approach for single and multiview 3d object reconstruction. In European Conference on Computer Vision, 2016.
 [6] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, et al. Lasagne: First release, Aug. 2015.
 [7] Y. Furukawa and C. Hernández. Multiview stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 9(12):1–148, 2015.
 [8] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
 [9] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In IEEE International Conference on Computer Vision, pages 873–881, 2015.
 [10] S. Galliani and K. Schindler. Just look at the image: Viewpointspecific surface normal prediction for improved multiview reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5479–5487, 2016.
 [11] M. Habbecke and L. Kobbelt. A surfacegrowing approach to multiview stereo reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
 [12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
 [13] M. Jancosek and T. Pajdla. Multiview reconstruction preserving weaklysupported surfaces. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3121–3128, 2011.
 [14] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. In IEEE International Conference on Computer Vision, volume 1, pages 307–314, 1999.
 [15] Y. Liu, X. Cao, Q. Dai, and W. Xu. Continuous depth estimation for multiview stereo. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2121–2128, 2009.
 [16] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for realtime object recognition. In Intelligent Robots and Systems, pages 922–928, 2015.
 [17] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J. Frahm, R. Yang, D. Nistér, and M. Pollefeys. Realtime visibilitybased fusion of depth maps. In IEEE International Conference on Computer Vision, pages 1–8, 2007.
 [18] J. Pang, O. C. Au, Y. Yamashita, Y. Ling, Y. Guo, and J. Zeng. Selfsimilaritybased image colorization. In IEEE International Conference on Image Processing, pages 4687–4691, 2014.
 [19] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multiview CNNs for object classification on 3d data. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2016.
 [20] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
 [21] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multiview stereo reconstruction algorithms. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 519–528, 2006.
 [22] S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173, 1999.
 [23] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014.
 [24] A. Sinha, J. Bai, and K. Ramani. Deep learning 3d shape surfaces using geometry images. In European Conference on Computer Vision, pages 223–240, 2016.
 [25] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in RGBD images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
 [26] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multiview 3d models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337, 2016.
 [27] E. Tola, C. Strecha, and P. Fua. Efficient largescale multiview stereo for ultra highresolution image sets. Machine Vision and Applications, pages 1–18, 2012.
 [28] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
 [29] S. Xie and Z. Tu. Holisticallynested edge detection. In IEEE International Conference on Computer Vision, pages 1395–1403, 2015.
 [30] L. Xu, L. Hou, O. C. Au, W. Sun, X. Zhang, and Y. Guo. A novel rayspace based view generation algorithm via radon transform. 3D Research, 4(2):1, 2013.
 [31] F. Yu and V. Koltun. Multiscale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2016.
 [32] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1592–1599, 2015.
 [33] A. Zheng, Y. Yuan, S. P. Jaiswal, and O. C. Au. Motion estimation via hierarchical block matching and graph cut. In IEEE International Conference on Image Processing, pages 4371–4375, 2015.