Abstract
We propose a cost-volume-based neural network for depth inference from multi-view images. We demonstrate that building a cost volume pyramid in a coarse-to-fine manner, instead of constructing a cost volume at a fixed resolution, leads to a compact, lightweight network and allows us to infer high-resolution depth maps and achieve better reconstruction results. To this end, we first build a cost volume at the coarsest image resolution by uniformly sampling fronto-parallel planes across the entire depth range. Given the current depth estimate, we then iteratively construct new cost volumes over the pixelwise depth residual to refine the depth map. While we share with PointMVSNet the insight of predicting and refining depth iteratively, we show that working on a cost volume pyramid leads to a more compact yet efficient network structure than PointMVSNet, which operates on 3D points. We further provide a detailed analysis of the relation between (residual) depth sampling and image resolution, which serves as a principle for building a compact cost volume pyramid. Experimental results on benchmark datasets show that our model runs 6x faster than state-of-the-art methods with similar performance.
1 Introduction
Multi-view stereo (MVS) aims to reconstruct a 3D model of a scene from a set of images captured by a camera from multiple viewpoints. It is a fundamental problem for the computer vision community and has been studied extensively for decades [32]. While traditional methods from before the deep learning era have achieved strong results on scenes with Lambertian surfaces, they still suffer under illumination changes, low-texture regions, and reflections, which yield unreliable matching correspondences for subsequent reconstruction.
Recent learning-based approaches [4, 39, 40] adopt deep CNNs to infer the depth map for each view, followed by a separate multi-view fusion process for building 3D models. These methods allow the network to extract discriminative features encoding global and local information of a scene to obtain robust feature matching for MVS. In particular, Yao et al. propose MVSNet [39] to infer a depth map for each view. An essential step in [39] is to build a cost volume based on a plane sweep process, followed by multi-scale 3D CNNs for regularization. While effective in depth inference accuracy, its memory requirement grows cubically with the image resolution. To handle high-resolution images, they then adopt a recurrent cost volume regularization process [40]. However, the reduction in memory requirements comes at the cost of a longer runtime.
In order to achieve a computationally efficient network, Chen et al. [4] work on 3D point clouds to iteratively predict the depth residual along visual rays using edge convolutions operating on the nearest neighbors of each 3D point. While this approach is more efficient, its runtime increases almost linearly with the number of iteration levels.
In this work, we propose a cost volume pyramid based Multi-View Stereo Network (CVPMVSNet) for depth inference. In our approach, we first build an image pyramid for each input image. Then, for the coarsest resolution of the reference image, we build a compact cost volume by sampling the depth across the entire depth range of a scene. Then, at the next pyramid level, we perform a residual depth search in the neighborhood of the current depth estimate to construct a partial cost volume, using multi-scale 3D CNNs for regularization. As we build these cost volumes iteratively with a short search range at each level, the resulting network is small and compact. As a result, our network is able to run 6x faster than current state-of-the-art networks on benchmark datasets.
While we share with [4] the insight of predicting and refining the depth map in a coarse-to-fine manner, our work differs from theirs in four main aspects. First, the approach in [4] performs convolutions on a 3D point cloud. Instead, we construct cost volumes on a regular grid defined on the image coordinates, which is shown to be faster in runtime. Second, we provide a principle for building a compact cost volume pyramid based on the correlation between depth sampling and image resolution. Third, we use a multi-scale 3D-CNN regularization to cover a large receptive field and encourage local smoothness on residual depth estimates, which, as our results show, leads to better accuracy. Finally, in contrast to [4] and other related works, our approach can produce a low-resolution depth map from a low-resolution input image.
In summary, our main contributions are:

We propose a cost-volume-based, compact, and computationally efficient depth inference network for MVS.

We build a cost volume pyramid in a coarse-to-fine manner based on a detailed analysis of the relation between the depth residual search range and the image resolution.

Our framework can handle high-resolution images with lower memory requirements, is 6x faster than the current state-of-the-art framework, i.e. PointMVSNet [4], and achieves better accuracy on benchmark datasets.
2 Related Work
Traditional Multi-View Stereo. Multi-view stereo has been extensively studied for decades. We refer to the algorithms from before the deep learning era as traditional MVS methods, which represent the 3D geometry of objects or scenes using voxels [6, 22], level sets [8, 28], polygon meshes [7, 9] or depth maps [17, 21]. In the following, we mainly focus on volumetric and depth-based MVS methods, which have recently been integrated into learning-based frameworks.
Volumetric representations can model most objects or scenes. Given a fixed volume of an object or scene, volumetric methods first divide the whole volume into small voxels and then use a photometric consistency metric to decide whether each voxel belongs to the surface or not. These methods do not impose constraints on the shape of the objects. However, the space discretization is memory intensive. By contrast, depth-map-based MVS methods have shown more flexibility in modeling the 3D geometry of a scene [26]. Readers are referred to [32] for detailed discussions. Similar to other recent learning-based approaches, we adopt the depth map representation in our framework.
Deep learning-based MVS. Deep CNNs have significantly advanced high-level vision tasks such as image recognition [13, 33], object detection [12, 29], and semantic segmentation [23, 3]. As for 3D vision tasks, learning-based approaches have been widely adopted to solve stereo matching problems and have achieved very promising results [41, 25]. However, these learning-based approaches cannot be easily generalized to MVS problems, as the multi-view scenario requires rectification, which may cause a loss of information [39].
More recently, a few approaches have been proposed to directly solve MVS problems. For instance, Ji et al. [16] introduce the first learning-based pipeline for MVS. This approach learns the probability of voxels lying on the surface. Concurrently, Kar et al. [18] present a learnable system to up-project pixel features to the 3D volume and classify whether a voxel is occupied by the surface or not. These systems provide promising results. However, they use volumetric representations that are memory expensive, and therefore these algorithms cannot handle large-scale scenes.
Large-scale scene reconstruction has been approached by Yao et al. in [39]. The authors propose to learn the depth map for each view by constructing a cost volume followed by 3D CNN regularization. Then, they obtain the 3D geometry by fusing the estimated depth maps from multiple views. The algorithm uses a cost volume whose memory requirements are cubic in the image resolution. Thus, it cannot leverage all the information available in high-resolution images. To circumvent this problem, the algorithm adopts GRUs [18] to regularize the cost volume in a sequential manner. As a result, the algorithm reduces the memory requirement at the cost of increased runtime.
The work most closely related to ours is PointMVSNet [4]. PointMVSNet is a framework that predicts the depth in a coarse-to-fine manner, working directly on a point cloud. It allows each point to aggregate information from its nearest neighbors in 3D space. Our approach shares the insight of predicting and refining depth maps iteratively. However, we differ from PointMVSNet in a few key aspects. Instead of working in 3D, we build the cost volume on the regular image grid. Inspired by the idea of the partial cost volume used in PWCNet [34] for optical flow estimation, we build a partial cost volume to predict depth residuals. We compare the memory and computational efficiency with [4]. The comparison shows that our cost volume pyramid based network leads to more compact and accurate models that run much faster for a given depth map resolution.
Cost volume. Cost volumes are widely used in traditional methods for dense depth estimation from unrectified images [5, 38]. However, most recent learning-based works build the cost volume at a fixed resolution [14, 39], which leads to high memory requirements when handling high-resolution images. Recently, Sun et al. [34] introduced the idea of a partial cost volume for optical flow estimation. In short, given an estimate of the current optical flow, the partial cost volume is constructed by searching for correspondences locally, within a rectangle around the estimated position in the warped source view image. Inspired by this strategy, in this paper we propose the cost volume pyramid as an algorithm to progressively estimate the depth residual for each pixel along its visual ray. As we will demonstrate in our experiments, constructing cost volumes at multiple levels leads to a more effective and efficient framework.
3 Method
Let us now introduce our approach to depth inference for MVS. The overall system is depicted in Fig. 1. As in existing works, we denote the reference image as $\mathbf{I}_0 \in \mathbb{R}^{H \times W}$, where $H$ and $W$ define its dimensions. Let $\{\mathbf{I}_i\}_{i=1}^{N}$ be its $N$ neighboring source images. Assume $\{\mathbf{K}_i, \mathbf{R}_i, \mathbf{t}_i\}_{i=0}^{N}$ are the corresponding camera intrinsics, rotation matrices, and translation vectors for all views. Our goal is to infer the depth map $\mathbf{D}$ for $\mathbf{I}_0$ from $\{\mathbf{I}_i\}_{i=0}^{N}$. The key novelty of our approach is the use of a feed-forward deep network on a cost volume pyramid constructed in a coarse-to-fine manner. Below, we introduce our feature pyramid, the cost volume pyramid, and depth map inference, and finally provide details of the loss function.
3.1 Feature Pyramid
As raw images vary with illumination changes, we adopt learnable features, which have been shown to be a crucial step for extracting dense feature correspondences [39, 35]. The general practice in existing works is to use high-resolution images to extract multi-scale image features, even when the output is a low-resolution depth map. By contrast, we show that a low-resolution image contains enough information for estimating a low-resolution depth map.
Our feature extraction pipeline consists of two steps, see Fig. 1. First, we build the $(L+1)$-level image pyramid $\{\mathbf{I}_i^j\}_{j=0}^{L}$ for each input image, $i \in \{0, 1, \dots, N\}$, where the bottom level of the pyramid corresponds to the input image, $\mathbf{I}_i^0 = \mathbf{I}_i$. Second, we obtain feature representations at the $l$th level using a CNN, namely the feature extraction network. Specifically, it consists of 9 convolutional layers, each of which is followed by a leaky rectified linear unit (LeakyReLU). We use the same CNN to extract features for all levels in all the images. We denote the feature maps for a given level $l$ by $\{\mathbf{f}_i^l\}_{i=0}^{N}$, $\mathbf{f}_i^l \in \mathbb{R}^{H/2^l \times W/2^l \times F}$, where $F$ is the number of feature channels used in our experiments. We will show that, compared to existing works, our feature extraction pipeline leads to a significant reduction in memory requirements and, at the same time, improves performance.
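As an illustration of the first step, the following minimal sketch builds an image pyramid in which level 0 is the input image and every further level halves the resolution. The text does not state which downsampling filter is used, so the 2x2 average pooling here is an assumption, and the function names `downsample_2x` and `build_pyramid` are hypothetical.

```python
def downsample_2x(image):
    """Halve a 2-D grayscale image (list of rows) by averaging 2x2 blocks.

    Assumption: average pooling; the paper does not specify the filter.
    """
    height, width = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(width)
        ]
        for r in range(height)
    ]


def build_pyramid(image, num_levels):
    """Return [level 0 (input), level 1, ..., coarsest level]."""
    pyramid = [image]
    for _ in range(num_levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


# Toy 8x8 image and a 3-level pyramid (8x8, 4x4, 2x2).
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = build_pyramid(image, num_levels=3)
sizes = [(len(level), len(level[0])) for level in pyramid]
```

In the network, the same feature extractor would then be applied to every level of every view's pyramid.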
3.2 Cost Volume Pyramid
Given the extracted features, the next step is to construct cost volumes for depth inference in the reference view. Common approaches usually build a single cost volume at a fixed resolution [39, 40, 14], which incurs large memory requirements and thus limits the use of high-resolution images. Instead, we propose to build a cost volume pyramid, a process that iteratively estimates and refines depth maps to achieve high-resolution depth inference. More precisely, we first build a cost volume for coarse depth map estimation based on the coarsest-resolution images in the image pyramids and uniform sampling of the fronto-parallel planes in the scene. Then, we iteratively construct cost volumes from this coarse estimate and depth residual hypotheses to achieve a depth map with high resolution and accuracy. We provide details about these two steps below.
Cost Volume for Coarse Depth Map Inference. We start by building a cost volume at the $L$th level, corresponding to the lowest image resolution $(H/2^L, W/2^L)$. Assume the depth measured at the reference view of a scene ranges from $d_{\min}$ to $d_{\max}$. We construct the cost volume for the reference view by sampling $M$ fronto-parallel planes uniformly across the entire depth range. A sampled depth $d = d_{\min} + m(d_{\max} - d_{\min})/M$, $m \in \{0, 1, \dots, M-1\}$, represents a plane whose normal $\mathbf{n}_0$ is the principal axis of the reference camera.
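The uniform sampling of fronto-parallel planes can be sketched as follows. This is a minimal example that assumes the planes are placed at `d_min + m * (d_max - d_min) / num_planes` for `m = 0, ..., num_planes - 1`; the function name and the toy depth range are hypothetical.

```python
def sample_depth_planes(d_min, d_max, num_planes):
    """Return num_planes depths spaced uniformly over [d_min, d_max).

    Assumption: the first plane sits at d_min and the last one step below d_max.
    """
    step = (d_max - d_min) / num_planes
    return [d_min + m * step for m in range(num_planes)]


# Toy depth range, not taken from any dataset.
depths = sample_depth_planes(d_min=2.0, d_max=10.0, num_planes=4)
```

Each returned depth defines one plane of the coarse cost volume.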
Similar to [39], we define the differentiable homography $\mathbf{H}_i(d)$ between the $i$th source view and the reference view at depth $d$ as
(1) 
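where $\mathbf{K}_i^L$ and $\mathbf{K}_0^L$ are the scaled intrinsic matrices of $\mathbf{K}_i$ and $\mathbf{K}_0$ at level $L$.
Each homography $\mathbf{H}_i(d)$ suggests a possible pixel correspondence between $\tilde{\mathbf{x}}_i$ in source view $i$ and a pixel $\mathbf{x}$ in the reference view. This correspondence is defined as $\lambda_i \tilde{\mathbf{x}}_i = \mathbf{H}_i(d)\mathbf{x}$, where $\lambda_i$ represents the depth of $\tilde{\mathbf{x}}_i$ in the source view $i$.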
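For a fronto-parallel plane at a given depth, the pixel correspondence induced by the homography can equivalently be computed by back-projecting the reference pixel to 3D and re-projecting it into the source view. The sketch below does exactly that for a single pixel. It assumes a pinhole model with the camera convention x_cam = R X_world + t; the dictionary layout, the function names, and the toy cameras are hypothetical and are not the authors' implementation.

```python
def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(matrix[r][c] * vector[c] for c in range(3)) for r in range(3)]


def transpose(matrix):
    """Transpose a 3x3 matrix; for a rotation this equals its inverse."""
    return [[matrix[c][r] for c in range(3)] for r in range(3)]


def warp_pixel(u, v, depth, cam_ref, cam_src):
    """Map reference pixel (u, v) at the given depth into the source view.

    Each camera is a dict with fx, fy, cx, cy, R (3x3), t (3,),
    under the assumed convention x_cam = R @ X_world + t.
    Returns (u_src, v_src, depth_in_source_view).
    """
    # Back-project the pixel to a 3D point in the reference camera frame.
    x_ref = [(u - cam_ref["cx"]) / cam_ref["fx"] * depth,
             (v - cam_ref["cy"]) / cam_ref["fy"] * depth,
             depth]
    # Reference camera frame -> world frame: X = R0^T (x_ref - t0).
    shifted = [x_ref[k] - cam_ref["t"][k] for k in range(3)]
    x_world = mat_vec(transpose(cam_ref["R"]), shifted)
    # World frame -> source camera frame: x_src = Ri X + ti.
    rotated = mat_vec(cam_src["R"], x_world)
    x_src = [rotated[k] + cam_src["t"][k] for k in range(3)]
    # Perspective projection with the source intrinsics.
    lam = x_src[2]
    u_src = cam_src["fx"] * x_src[0] / lam + cam_src["cx"]
    v_src = cam_src["fy"] * x_src[1] / lam + cam_src["cy"]
    return u_src, v_src, lam


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam_ref = {"fx": 100.0, "fy": 100.0, "cx": 50.0, "cy": 50.0,
           "R": identity, "t": [0.0, 0.0, 0.0]}
# Source camera translated along x (toy values).
cam_src = {"fx": 100.0, "fy": 100.0, "cx": 50.0, "cy": 50.0,
           "R": identity, "t": [-0.1, 0.0, 0.0]}

near = warp_pixel(50.0, 50.0, 2.0, cam_ref, cam_src)
far = warp_pixel(50.0, 50.0, 4.0, cam_ref, cam_src)
```

Doubling the hypothesised depth halves the horizontal shift in the source view, which is the behaviour a plane sweep relies on when comparing features across depth hypotheses.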
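Given $\tilde{\mathbf{x}}_i$ and the source feature maps $\{\mathbf{f}_i^L\}_{i=1}^{N}$, we use differentiable linear interpolation [15] to reconstruct the feature maps for the reference view as $\{\tilde{\mathbf{f}}_{i,d}^L\}_{i=1}^{N}$. The cost for all pixels at depth $d$ is defined as the variance of the features from the $N+1$ views,
$$\mathbf{C}_d^L = \frac{1}{N+1} \sum_{i=0}^{N} \left( \tilde{\mathbf{f}}_{i,d}^L - \bar{\mathbf{f}}_d^L \right)^2, \tag{2}$$
where $\tilde{\mathbf{f}}_{0,d}^L = \mathbf{f}_0^L$ is the feature map of the reference image and $\bar{\mathbf{f}}_d^L$ is the average of the feature volumes across all views ($\{\tilde{\mathbf{f}}_{i,d}^L\}_{i=1}^{N} \cup \{\mathbf{f}_0^L\}$) for each pixel. This metric encourages the correct depth for each pixel to have the smallest feature variance, which corresponds to the photometric consistency constraint. We compute the cost map for each depth hypothesis and concatenate those cost maps into a single cost volume $\mathbf{C}^L \in \mathbb{R}^{H/2^L \times W/2^L \times M \times F}$.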
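To make the variance metric concrete, here is a small sketch that computes the per-channel cost for one pixel at one depth hypothesis, given the reference feature and the warped source features. The function name and the toy feature vectors are hypothetical; a real implementation would evaluate this for all pixels and hypotheses at once on the GPU.

```python
def variance_cost(features_per_view):
    """Per-channel variance of feature vectors across views.

    features_per_view: list of N+1 feature vectors (reference first, then the
    warped source features), all for the same pixel and depth hypothesis.
    """
    num_views = len(features_per_view)
    num_channels = len(features_per_view[0])
    mean = [sum(f[c] for f in features_per_view) / num_views
            for c in range(num_channels)]
    return [sum((f[c] - mean[c]) ** 2 for f in features_per_view) / num_views
            for c in range(num_channels)]


# Consistent features across views (correct depth) give zero cost.
cost_match = variance_cost([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
# Inconsistent features (wrong depth) give a positive cost in channel 0.
cost_mismatch = variance_cost([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
```

The variance accepts any number of views, which is why the network can be trained with one view count and evaluated with another.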
A key parameter for obtaining good depth estimation accuracy is the depth sampling resolution $M$. We will show in Section 3.3 how to determine the interval for depth sampling and coarse depth estimation.
Cost Volume for Multi-scale Depth Residual Inference. Recall that our ultimate goal is to obtain $\mathbf{D} = \mathbf{D}^0$ for $\mathbf{I}_0$. We iterate starting from $\mathbf{D}^{l+1}$, a given depth estimate for the $(l+1)$th level, to obtain a refined depth map $\mathbf{D}^l$ for the next level, until reaching the bottom level. More precisely, we first upsample $\mathbf{D}^{l+1}$ to the next level via bicubic interpolation, giving $\mathbf{D}^{l+1}_{\uparrow}$, and then build the partial cost volume to regress the residual depth map $\Delta\mathbf{D}^l$, which yields the refined depth map $\mathbf{D}^l = \mathbf{D}^{l+1}_{\uparrow} + \Delta\mathbf{D}^l$ at the $l$th level.
While we share with [4] the insight of iteratively predicting the depth residual, we argue that instead of performing convolutions on a point cloud [4], building a regular 3D cost volume on the depth residual followed by multi-scale 3D convolution leads to more compact, faster, and more accurate depth inference. Our motivation is that depth displacements for neighboring pixels are correlated, which indicates that regular multi-scale 3D convolution provides useful contextual information for depth residual estimation. We therefore arrange the depth displacement hypotheses in a regular 3D space and compute the cost volume as follows.
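Given camera parameters $\{\mathbf{K}_i^l, \mathbf{R}_i, \mathbf{t}_i\}_{i=0}^{N}$ for all camera views and the upsampled depth estimate $\mathbf{D}^{l+1}_{\uparrow}$, we consider one pixel $\mathbf{p} = (u, v)$. The current depth estimate for $\mathbf{p}$ is defined as $d_{\mathbf{p}} = \mathbf{D}^{l+1}_{\uparrow}(u, v)$. Assume each depth residual hypothesis interval is $\Delta d_{\mathbf{p}} = s_{\mathbf{p}}/M$, where $s_{\mathbf{p}}$ represents the depth search range at $\mathbf{p}$ and $M$ denotes the number of sampled depth residuals. We consider the projection of the corresponding hypothesised 3D point with depth $(d_{\mathbf{p}} + m\Delta d_{\mathbf{p}})$ into view $i$ as
$$\lambda_i \mathbf{x}_i' = \mathbf{K}_i^l \left( \mathbf{R}_i \mathbf{R}_0^{-1} \left( (\mathbf{K}_0^l)^{-1} (u, v, 1)^T (d_{\mathbf{p}} + m\Delta d_{\mathbf{p}}) - \mathbf{t}_0 \right) + \mathbf{t}_i \right), \tag{3}$$
where $\lambda_i$ denotes the depth of the corresponding pixel in view $i$, and $m \in \{-M/2, \dots, M/2 - 1\}$ (see Figure 2). Then, the cost for that pixel at each depth residual hypothesis is defined as in Eq. 2, which leads to a cost volume $\mathbf{C}^l \in \mathbb{R}^{H/2^l \times W/2^l \times M \times F}$.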
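The per-pixel residual hypotheses can be generated as in the sketch below. It assumes that the search range is split into equal intervals and that the hypothesis indices are centred on the current estimate (from -M/2 to M/2 - 1 for an even number M of hypotheses); the function name and the toy numbers are hypothetical.

```python
def residual_depth_hypotheses(current_depth, search_range, num_hypotheses):
    """Depth hypotheses around the current estimate for one pixel.

    Assumptions: interval = search_range / num_hypotheses, and indices run
    from -num_hypotheses // 2 upward, so the current depth is one of them.
    """
    interval = search_range / num_hypotheses
    half = num_hypotheses // 2
    return [current_depth + m * interval
            for m in range(-half, num_hypotheses - half)]


# Toy values: current estimate 10.0, search range 4.0, 8 hypotheses.
hypotheses = residual_depth_hypotheses(10.0, search_range=4.0, num_hypotheses=8)
```

Each hypothesised depth would then be projected into the source views to gather features, and the cost is computed with the same variance metric as at the coarsest level.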
In the next section, we introduce our solution for determining the depth search intervals and range for all pixels, which is essential for obtaining accurate depth estimates.
3.3 Depth Map Inference
In this section, we first provide the details to perform depth sampling at the coarsest image resolution and discretisation of the local depth search range at higher image resolution for building the cost volume. Then, we introduce depth map estimators on cost volumes to achieve the depth map inference.
Depth Sampling for Cost Volume Pyramid. We observe that the depth sampling for virtual depth planes is related to the image resolution. As shown in Fig. 3, it is not necessary to sample depth planes densely, as the projections of the sampled 3D points in the image are then too close to provide extra information for depth inference. In our experiments, to determine the number of virtual planes, we compute the mean depth sampling interval that corresponds to a distance of 0.5 pixel in the image.
To determine the local search range for the depth residual around the current depth estimate of each pixel, we first project its 3D point into a source view, find the points that are two pixels away from its projection along the epipolar line in both directions, and back-project those two points into 3D. The intersection of these two rays with the visual ray in the reference view determines the search range for depth refinement at the next level (see Fig. 4).
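The link between a pixel offset in the source view and a depth range can be illustrated in a simplified setting. The sketch below assumes a rectified two-view pair with a purely horizontal baseline, so that disparity equals focal length times baseline divided by depth. This is a deliberate simplification of the general epipolar construction described above, not the procedure used in the paper; the function name and all numbers are hypothetical.

```python
def depth_range_for_pixel_offset(depth, focal_px, baseline, offset_px=2.0):
    """Depths whose source-view projections lie offset_px pixels on either
    side of the projection of the current depth estimate.

    Assumption: rectified stereo pair, disparity = focal_px * baseline / depth.
    """
    disparity = focal_px * baseline / depth
    if disparity <= offset_px:
        raise ValueError("offset exceeds disparity; far bound is unbounded")
    near = focal_px * baseline / (disparity + offset_px)
    far = focal_px * baseline / (disparity - offset_px)
    return near, far


# Toy setup: focal length 1000 px, baseline 0.1, current depth 5.0.
near_depth, far_depth = depth_range_for_pixel_offset(5.0, 1000.0, 0.1)
# The same relation gives the depth interval for a 0.5-pixel shift.
half_px_near, half_px_far = depth_range_for_pixel_offset(5.0, 1000.0, 0.1, 0.5)
```

The range is not symmetric around the current depth, and it shrinks as the image resolution (and hence the focal length in pixels) grows, which is why a short search range suffices at the finer pyramid levels.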
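Depth Map Estimator. Similar to MVSNet [39], we apply 3D convolution to the constructed cost volume pyramid $\{\mathbf{C}^l\}_{l=0}^{L}$ to aggregate context information and output probability volumes $\{\mathbf{P}^l\}_{l=0}^{L}$, where $\mathbf{P}^l \in \mathbb{R}^{H/2^l \times W/2^l \times M}$. The detailed 3D convolution network design is given in the Supp. Mat. Note that $\mathbf{P}^L$ and $\{\mathbf{P}^l\}_{l=0}^{L-1}$ are generated on absolute and residual depth, respectively. We therefore first apply soft-argmax to $\mathbf{P}^L$ and obtain the coarse depth map. Then, we iteratively refine the obtained depth map by applying soft-argmax to $\{\mathbf{P}^l\}_{l=0}^{L-1}$ to obtain the depth residuals for higher resolutions.
Recall that the sampled depth is $d = d_{\min} + m(d_{\max} - d_{\min})/M$, $m \in \{0, 1, \dots, M-1\}$, at level $L$. Therefore, the depth estimate for each pixel $\mathbf{p}$ is computed as
$$\mathbf{D}^L(\mathbf{p}) = \sum_{m=0}^{M-1} d \, \mathbf{P}^L_{\mathbf{p}}(d). \tag{4}$$
To further refine the current estimate, which is either the coarse depth map or a refined depth map at the $(l+1)$th level, we estimate the residual depth. Assume $r_{\mathbf{p}} = m \cdot \Delta d_{\mathbf{p}}$ denotes the depth residual hypothesis. We compute the updated depth at the next level as
$$\mathbf{D}^l(\mathbf{p}) = \mathbf{D}^{l+1}_{\uparrow}(\mathbf{p}) + \sum_{m=-M/2}^{M/2-1} r_{\mathbf{p}} \, \mathbf{P}^l_{\mathbf{p}}(r_{\mathbf{p}}), \tag{5}$$
where $l \in \{L-1, L-2, \dots, 0\}$. In our experiments, we observe that no further depth map refinement is required after our pyramidal depth estimation to obtain good results.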
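The two soft-argmax steps amount to taking an expectation over the hypotheses: over absolute depths at the coarsest level, and over depth residuals at the finer levels. The sketch below shows this for a single pixel. It assumes the probabilities are already normalised along the hypothesis dimension; the function name and the toy values are hypothetical.

```python
def soft_argmax(hypotheses, probabilities):
    """Expected value of the hypotheses under the given probabilities.

    Assumption: probabilities are non-negative and sum to one.
    """
    return sum(h * p for h, p in zip(hypotheses, probabilities))


# Coarsest level: expectation over absolute depth hypotheses.
coarse_depths = [2.0, 4.0, 6.0, 8.0]
coarse_probs = [0.1, 0.6, 0.2, 0.1]
coarse_estimate = soft_argmax(coarse_depths, coarse_probs)

# Finer level: expectation over residuals, added to the upsampled estimate
# (for a single pixel, upsampling leaves the value unchanged).
residuals = [-1.0, -0.5, 0.0, 0.5]
residual_probs = [0.0, 0.1, 0.5, 0.4]
refined_estimate = coarse_estimate + soft_argmax(residuals, residual_probs)
```

Because the expectation is differentiable with respect to the probabilities, the depth estimate can be supervised directly with a regression loss.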
3.4 Loss Function
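We adopt a supervised learning strategy and construct a pyramid of ground-truth depth maps $\{\mathbf{D}^l_{GT}\}_{l=0}^{L}$ as the supervisory signal. Similar to the existing MVSNet framework [39], we make use of the $\ell_1$ norm measuring the absolute difference between the ground truth and the estimated depth. For each training sample, our loss is
$$Loss = \sum_{l=0}^{L} \sum_{\mathbf{p} \in \Omega} \left\| \mathbf{D}^l_{GT}(\mathbf{p}) - \mathbf{D}^l(\mathbf{p}) \right\|_1, \tag{6}$$
where $\Omega$ is the set of valid pixels with ground truth measurements.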
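A minimal sketch of such a pyramid loss is given below. It sums the absolute depth error over the valid pixels of every pyramid level; whether the per-level terms are summed or averaged, and whether levels are weighted, is an assumption here, and the function name and toy data are hypothetical.

```python
def pyramid_l1_loss(predicted_levels, ground_truth_levels, valid_masks):
    """Sum of absolute depth errors over valid pixels of all pyramid levels.

    Each argument is a list with one 2-D list (rows of values) per level.
    Assumption: plain unweighted sum over levels and pixels.
    """
    total = 0.0
    for pred, truth, mask in zip(predicted_levels, ground_truth_levels, valid_masks):
        for pred_row, truth_row, mask_row in zip(pred, truth, mask):
            for p, g, valid in zip(pred_row, truth_row, mask_row):
                if valid:  # skip pixels without a ground-truth measurement
                    total += abs(g - p)
    return total


# Toy two-level pyramid: a 2x2 fine level and a 1x1 coarse level.
predicted = [[[1.0, 2.0], [3.0, 4.0]], [[2.0]]]
ground_truth = [[[1.5, 2.0], [0.0, 3.0]], [[2.25]]]
masks = [[[True, True], [False, True]], [[True]]]
loss = pyramid_l1_loss(predicted, ground_truth, masks)
```

The masked pixel at the fine level (ground truth 0.0) contributes nothing, mirroring the restriction of the sum to valid pixels.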
4 Experiments
In this section, we demonstrate the performance of our framework for MVS with a comprehensive set of experiments on standard benchmarks. Below, we first describe the datasets and benchmarks and then analyze our results.
4.1 Datasets
DTU Dataset [1] is a large-scale MVS dataset with 124 scenes scanned from 49 or 64 views under 7 different lighting conditions. DTU provides a 3D point cloud acquired using structured-light sensors. Each view consists of an image and the calibrated camera parameters. To train our model, we generate a depth map for each view using the method provided by MVSNet [39]. We use the same training, validation and evaluation sets as defined in [39, 40, 4].
Tanks and Temples [20] contains both indoor and outdoor scenes under realistic lighting conditions with large scale variations. For comparison with other approaches, we evaluate our results on the intermediate set.
4.2 Implementation
Training. We train our CVPMVSNet on the DTU training set. Unlike previous methods [39, 4], which take a high-resolution image as input but estimate a depth map of smaller size, our method produces a depth map of the same size as the input image. For training, we match the ground-truth depth map by downsampling the high-resolution image to a lower resolution. Then, we build the image and ground-truth depth pyramids with 2 levels. To construct the cost volume pyramid, we uniformly sample depth hypotheses across the entire depth range at the coarsest (2nd) level. Then, each pixel has depth residual hypotheses at the next level for refining the depth estimate. Following MVSNet [39], we adopt 3 views for training. We implemented our network in PyTorch [27] and used Adam [19] to train our model. The batch size is set to 16 and the network is trained end-to-end on an NVIDIA TITAN RTX graphics card for 27 epochs. The learning rate is initially set to 0.001 and is divided by 2 at several scheduled epochs.
Metrics. We follow the standard evaluation protocol as in [1, 39]. We report the accuracy, completeness and overall score of the reconstructed point clouds. Accuracy is measured as the distance from the estimated point clouds to the ground-truth ones in millimeters, and completeness is defined as the distance from the ground-truth point clouds to the estimated ones [1]. The overall score is the average of accuracy and completeness [39].
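The three metrics can be sketched with a brute-force nearest-neighbour search, as below. This is a simplified illustration: the official DTU evaluation applies additional steps (such as masking and outlier handling) that are not reproduced here, and the function names and toy point clouds are hypothetical.

```python
import math


def mean_nearest_distance(source_points, target_points):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(s, t) for t in target_points)
               for s in source_points) / len(source_points)


def overall_score(accuracy, completeness):
    """Average of accuracy and completeness."""
    return (accuracy + completeness) / 2.0


estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (3.0, 0.0, 0.0)]

accuracy = mean_nearest_distance(estimated, ground_truth)      # estimate -> GT
completeness = mean_nearest_distance(ground_truth, estimated)  # GT -> estimate
overall = overall_score(accuracy, completeness)

# Sanity check against the "Ours" row of Table 1 (Acc. 0.296, Comp. 0.406).
table_overall = overall_score(0.296, 0.406)
```

With the reported accuracy and completeness of our method, the average is 0.351, which matches the overall score listed in Table 1.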
Evaluation. As the parameters are shared across the cost volume pyramid, we can evaluate our model with different numbers of cost volumes and input views. For the evaluation, we use the same number of depth samples for the coarsest depth estimation as [4] (results for another setting are provided in the Supp. Mat.) and a separate setting for the subsequent depth residual inference levels. Similar to previous methods [39, 4], we use 5 views and apply the same depth map fusion method to obtain the point clouds. We evaluate our model with images of different sizes and set the number of pyramid levels accordingly, so that the image size at the coarsest level stays similar. For instance, the pyramid has 5 levels for the largest input size and 4 levels for the two smaller ones.
Table 1: Quantitative results on the DTU dataset (all values in mm).
Method              Acc.    Comp.   Overall
Geometric
Furu [10]           0.613   0.941   0.777
Tola [36]           0.342   1.190   0.766
Camp [2]            0.835   0.554   0.695
Gipuma [11]         0.283   0.873   0.578
Colmap [30, 31]     0.400   0.664   0.532
Learning
SurfaceNet [16]     0.450   1.040   0.745
MVSNet [39]         0.396   0.527   0.462
PMVSNet [24]        0.406   0.434   0.420
RMVSNet [40]        0.383   0.452   0.417
MVSCRF [37]         0.371   0.426   0.398
PointMVSNet [4]     0.342   0.411   0.376
Ours                0.296   0.406   0.351
4.3 Results on DTU dataset
We first compare our results to those reported by traditional geometry-based methods and other learning-based baseline methods. As summarized in Table 1, our method outperforms all current learning-based methods [39, 40, 4] in terms of accuracy, completeness and overall score. Among the geometry-based approaches, only the method proposed by Galliani et al. [11] provides slightly better results in terms of mean accuracy.
We now compare our results to related learning-based methods in terms of GPU memory usage and runtime for different input resolutions. These results are summarized in Table 2. As shown, with a similar memory usage (bottom row), our network is able to produce better point clouds with a lower runtime. In addition, compared to PointMVSNet [4] at the same depth map output size (top rows), our approach is about six times faster and consumes about six times less memory with similar accuracy. We can output a high-resolution depth map with better accuracy, less memory usage and a shorter runtime than PointMVSNet [4].
[Figure panels, left to right: RMVSNet [40], PointMVSNet [4], Ours, Ground truth]
Table 2: Accuracy, GPU memory usage, and runtime on DTU for different input resolutions.
Method            Input Size  Depth Map Size  Acc. (mm)  Comp. (mm)  Overall (mm)  f-score (0.5mm)  GPU Mem (MB)  Runtime (s)
PointMVSNet [4]   1280x960    640x480         0.361      0.421       0.391         84.27            8989          2.03
Ours640           640x480     640x480         0.372      0.434       0.403         82.44            1416          0.37
PointMVSNet [4]   1600x1152   800x576         0.342      0.411       0.376         -                13081         3.04
Ours800           800x576     800x576         0.340      0.418       0.379         86.82            2207          0.49
MVSNet [39]       1600x1152   400x288         0.396      0.527       0.462         78.10            22511         2.76
RMVSNet [40]      1600x1152   400x288         0.383      0.452       0.417         83.96            6915          5.09
PointMVSNet [4]   1600x1152   800x576         0.342      0.411       0.376         -                13081         3.04
Ours              1600x1152   1600x1152       0.296      0.406       0.351         88.61            8795          1.72
Table 3: f-score on the Tanks and Temples intermediate set.
Method            Rank   Mean   Family  Francis  Horse  Lighthouse  M60    Panther  Playground  Train
PMVSNet [24]      11.72  55.62  70.04   44.64    40.22  65.20       55.08  55.17    60.37       54.29
Ours              12.75  54.03  76.5    47.74    36.34  55.12       57.28  54.28    57.43       47.54
PointMVSNet [4]   29.25  48.27  61.79   41.15    34.20  50.79       51.97  50.85    52.38       43.06
RMVSNet [40]      31.75  48.40  69.96   46.65    32.59  42.95       51.88  48.80    52.00       42.38
MVSNet [39]       42.75  43.48  55.99   28.55    25.07  50.79       53.96  50.86    47.90       34.69
Figures 5 and 6 show some qualitative results. As shown, our method is able to reconstruct more details than PointMVSNet [4]; see, for instance, the details of the roof behind the front building highlighted in the blue box. Compared to RMVSNet [40] and PointMVSNet [4], as the normal maps show, our results are smoother on the surfaces while capturing more high-frequency details around edges.
[Figure panels, left to right: RMVSNet [40], PointMVSNet [4], Ours, Ground truth]
[Figure panels (Tanks and Temples scenes): (a) Train, (b) Panther, (c) Lighthouse, (d) Family, (e) Horse]
4.4 Results on Tanks and Temples
We now evaluate the generalization ability of our method. To this end, we use the model trained on DTU without any fine-tuning to reconstruct point clouds for scenes in the Tanks and Temples dataset. For a fair comparison, we use the same camera parameters, depth range and view selection as MVSNet [39]. We consider four baselines [39, 40, 4, 24] and evaluate the f-score on Tanks and Temples. Table 3 summarizes these results. As shown, our method yields a mean f-score of 54.03, which is 5.76 points higher than PointMVSNet [4] (48.27), the best baseline on the DTU dataset, and only 1.59 points lower than PMVSNet [24] (55.62). Note that PMVSNet [24] applies more depth filtering during point cloud fusion than ours, which simply follows the fusion process provided by MVSNet [39]. Qualitative results of our point cloud reconstructions are shown in Fig. 7.



4.5 Ablation study
Training pyramid levels. We first analyse the effect of the number of pyramid levels on the quality of the reconstruction. To this end, we downsample the images to form pyramids with four different numbers of levels. Results of this analysis are summarized in Table 4a. As shown, the proposed 2-level pyramid performs best. As the number of pyramid levels increases, the image resolution of the coarsest level decreases. For more than 2 levels, this resolution is too small to produce a good initial depth map to be refined.
Evaluation pixel interval settings. We now analyse the effect of varying the pixel interval setting for depth refinement. As discussed in Section 3.3, the depth sampling is determined by the corresponding pixel offset in the source views; hence, it is important to set a suitable pixel interval. Table 4b summarizes the effect of varying the interval during evaluation, from depth ranges corresponding to 0.25 pixel up to 2 pixels. As shown, the performance drops when the interval is too small (0.25 pixel) or too large (2 pixels).
5 Conclusion
In this paper, we proposed CVPMVSNet, a cost volume pyramid based depth inference framework for MVS. CVPMVSNet is compact, lightweight, and fast at runtime, and it can handle high-resolution images to obtain high-quality depth maps for 3D reconstruction. Extensive evaluation on benchmark datasets shows that our model achieves better performance than state-of-the-art methods. In the future, we want to explore the integration of our approach into a learning-based structure-from-motion framework to further reduce the memory requirements for different applications.
Acknowledgements
This research is supported in part by the Australian Research Council DECRA Fellowship (DE180100628).
References
[1] (2016) Large-scale data for multiple-view stereopsis. International Journal of Computer Vision 120 (2), pp. 153–168.
[2] (2008) Using multiple hypotheses to improve depth-maps for multi-view stereo. In Computer Vision – ECCV 2008, D. Forsyth, P. Torr and A. Zisserman (Eds.), Berlin, Heidelberg, pp. 766–779.
[3] (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
[4] (2019) Point-based multi-view stereo network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1538–1547.
[5] (1996) A space-sweep approach to true multi-image matching. In Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 358–363.
[6] (1999) Poxels: probabilistic voxelized volume reconstruction. In Proceedings of International Conference on Computer Vision (ICCV), pp. 418–425.
[7] (2004) Silhouette and stereo fusion for 3d object modeling. Computer Vision and Image Understanding 96 (3), pp. 367–392.
[8] (2002) Variational principles, surface evolution, pde's, level set methods and the stereo problem. IEEE.
[9] (1995) Object-centered surface reconstruction: combining multi-image stereo and shading. International Journal of Computer Vision 16 (1), pp. 35–56.
[10] (2010) Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (8), pp. 1362–1376.
[11] (2016) Gipuma: massively parallel multi-view stereo reconstruction. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation e. V 25, pp. 361–369.
[12] (2015) Fast rcnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
[13] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[14] (2019) DPSNet: end-to-end deep plane sweep stereo. In International Conference on Learning Representations.
[15] (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
[16] (2017) Surfacenet: an end-to-end 3d neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2307–2315.
[17] (2001) Handling occlusions in dense multi-view stereo. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Vol. 1, pp. I–I.
[18] (2017) Learning a multi-view stereo machine. In Advances in Neural Information Processing Systems, pp. 365–376.
[19] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[20] (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG) 36 (4), pp. 78.
[21] (2002) Multi-camera scene reconstruction via graph cuts. In European Conference on Computer Vision, pp. 82–96.
[22] (2000) A theory of shape by space carving. International Journal of Computer Vision 38 (3), pp. 199–218.
[23] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[24] (2019) Pmvsnet: learning patch-wise matching confidence aggregation for multi-view stereo. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10452–10461.
[25] (2016) Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703.
[26] (2007) Real-time visibility-based fusion of depth maps. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8.
[27] (2017) Automatic differentiation in pytorch. In NIPSW.
[28] (2003) Variational stereovision and 3d scene flow estimation with statistical similarity measures. In Proceedings Ninth IEEE International Conference on Computer Vision, pp. 597.
[29] (2015) Faster rcnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[30] (2016) Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR).
[31] (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
[32] (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 1, pp. 519–528.
[33] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[34] (2018) PWCnet: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943.
[35] (2019) BAnet: dense bundle adjustment networks. In International Conference on Learning Representations.
[36] (2012) Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications 23 (5), pp. 903–920.
[37] (2019) MVSCRF: learning multi-view stereo with conditional random fields. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4312–4321.
[38] (2003) Multi-resolution real-time stereo on commodity graphics hardware. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., Vol. 1, pp. I–I.
[39] (2018) MVSnet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783.
[40] (2019) Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5534.
[41] (2016) Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research 17 (132), pp. 2.