Abstract


We propose a cost volume-based neural network for depth inference from multi-view images. We demonstrate that building a cost volume pyramid in a coarse-to-fine manner, instead of constructing a cost volume at a fixed resolution, leads to a compact, lightweight network and allows us to infer high-resolution depth maps and achieve better reconstruction results. To this end, a cost volume based on uniform sampling of fronto-parallel planes across the entire depth range is first built at the coarsest resolution of an image. Given the current depth estimate, new cost volumes are then constructed iteratively on the pixelwise depth residual to perform depth map refinement. While we share with Point-MVSNet the insight of predicting and refining depth iteratively, we show that working on a cost volume pyramid leads to a more compact yet efficient network structure than Point-MVSNet, which operates on 3D points. We further provide a detailed analysis of the relation between (residual) depth sampling and image resolution, which serves as a principle for building a compact cost volume pyramid. Experimental results on benchmark datasets show that our model runs 6x faster than state-of-the-art methods with similar performance.


1 Introduction

Multi-view stereo (MVS) aims to reconstruct the 3D model of a scene from a set of images captured by a camera from multiple viewpoints. It is a fundamental problem for the computer vision community and has been studied extensively for decades [32]. While traditional methods from before the deep learning era have achieved great results on the reconstruction of scenes with Lambertian surfaces, they still suffer from illumination changes, low-texture regions, and reflections, which result in unreliable matching correspondences for further reconstruction.

Recent learning-based approaches [4, 39, 40] adopt deep CNNs to infer the depth map for each view, followed by a separate multiple-view fusion process for building 3D models. These methods allow the network to extract discriminative features encoding global and local information of a scene to obtain robust feature matching for MVS. In particular, Yao et al. propose MVSNet [39] to infer a depth map for each view. An essential step in [39] is to build a cost volume based on a plane sweep process, followed by multi-scale 3D CNNs for regularization. While effective in terms of depth inference accuracy, its memory requirement grows cubically with the image resolution. To handle high-resolution images, they then adopt a recurrent cost volume regularisation process [40]. However, the reduction in memory requirements comes at the cost of a longer run-time.

In order to achieve a computationally efficient network, Chen et al. [4] work on 3D point clouds to iteratively predict the depth residual along visual rays using edge convolutions operating on the nearest neighbors of each 3D point. While this approach is more efficient, its run-time increases almost linearly with the number of iteration levels.

In this work, we propose a cost volume pyramid based Multi-View Stereo Network (CVP-MVSNet) for depth inference. In our approach, we first build an image pyramid for each input image. Then, for the coarsest resolution of the reference image, we build a compact cost volume by sampling the depth across the entire depth range of a scene. Then, at the next pyramid level, we perform a residual depth search in the neighborhood of the current depth estimate to construct a partial cost volume, using multi-scale 3D CNNs for regularisation. As we build these cost volumes iteratively with a short search range at each level, the resulting network is small and compact. As a result, our network is able to run 6x faster than current state-of-the-art networks on benchmark datasets.

While we share with [4] the insight of predicting and refining the depth map in a coarse-to-fine manner, our work differs from theirs in four main aspects. First, the approach in [4] performs convolutions on a 3D point cloud. Instead, we construct cost volumes on a regular grid defined on the image coordinates, which is shown to be faster in run-time. Second, we provide a principle for building a compact cost volume pyramid based on the correlation between depth sampling and image resolution. Third, we use a multi-scale 3D-CNN regularization to cover a large receptive field and encourage local smoothness on the residual depth estimates, which leads to better accuracy. Finally, in contrast to [4] and other related works, our approach can output a low-resolution depth map from a low-resolution image.

In summary, our main contributions are:

  • We propose a cost-volume based, compact, and computationally efficient depth inference network for MVS.

  • We build a cost volume pyramid in a coarse-to-fine manner based on a detailed analysis of the relation between the depth residual search range and the image resolution.

  • Our framework can handle high-resolution images with lower memory requirements, is 6x faster than the current state-of-the-art framework, i.e. Point-MVSNet [4], and achieves better accuracy on benchmark datasets.

Figure 1: Network Structure. Reference and source images are first downsampled to form an image pyramid. We apply the feature extraction network to all levels and images to extract feature maps. We build the cost volume pyramid in a coarse-to-fine manner. We start with the construction of a cost volume corresponding to the coarsest image resolution, followed by building partial cost volumes iteratively for depth residual estimation in order to obtain the depth map for the reference image. Please refer to Figure 2 for details about the reprojection and feature fetching module.

2 Related Work

Traditional Multi-View Stereo. Multi-view stereo has been extensively studied for decades. We refer to the algorithms from before the deep learning era as traditional MVS methods, which represent the 3D geometry of objects or scenes using voxels [6, 22], level-sets [8, 28], polygon meshes [7, 9] or depth maps [17, 21]. In the following, we mainly focus on volumetric and depth-based MVS methods, which have recently been integrated into learning-based frameworks.

Volumetric representations can model most objects or scenes. Given a fixed volume of an object or scene, volumetric methods first divide the whole volume into small voxels and then use a photometric consistency metric to decide whether each voxel belongs to the surface or not. These methods do not impose constraints on the shape of the objects. However, the space discretisation is memory intensive. By contrast, depth-map based MVS methods have shown more flexibility in modeling the 3D geometry of a scene [26]. Readers are referred to [32] for detailed discussions. Similar to other recent learning-based approaches, we adopt the depth map representation in our framework.

Deep learning-based MVS. Deep CNNs have significantly advanced the progress of high-level vision tasks, such as image recognition [13, 33], object detection [12, 29], and semantic segmentation [23, 3]. As for 3D vision tasks, learning-based approaches have been widely adopted to solve stereo matching problems and have achieved very promising results [41, 25]. However, these learning-based approaches cannot be easily generalized to solve MVS problems as rectifications are required for the multiple view scenario which may cause the loss of information [39].

More recently, a few approaches have been proposed to directly solve MVS problems. For instance, Ji et al. [16] introduce the first learning-based pipeline for MVS. This approach learns the probability of voxels lying on the surface. Concurrently, Kar et al. [18] present a learnable system to up-project pixel features to the 3D volume and classify whether a voxel is occupied by the surface or not. These systems provide promising results. However, they use volumetric representations that are memory expensive, and therefore these algorithms cannot handle large-scale scenes.

Large-scale scene reconstruction has been approached by Yao et al. in [39]. The authors propose to learn the depth map for each view by constructing a cost volume followed by 3D CNN regularisation. Then, they obtain the 3D geometry by fusing the estimated depth maps from multiple views. The algorithm uses a cost volume whose memory requirement grows cubically with the image resolution. Thus, it cannot leverage all the information available in high-resolution images. To circumvent this problem, the algorithm adopts GRUs [18] to regularise the cost volume in a sequential manner. As a result, the algorithm reduces the memory requirement but incurs an increased run-time.

The work most closely related to ours is Point-MVSNet [4]. Point-MVSNet is a framework that predicts the depth in a coarse-to-fine manner, working directly on a point cloud. It allows each point to aggregate information from its nearest neighbors in 3D space. Our approach shares the insight of predicting and refining depth maps iteratively. However, we differ from Point-MVSNet in a few key aspects. Instead of working in 3D, we build the cost volume on the regular image grid. Inspired by the idea of the partial cost volume used in PWC-Net [34] for optical flow estimation, we build a partial cost volume to predict depth residuals. We compare the memory and computational efficiency of our approach with [4]. The comparison shows that our cost-volume pyramid based network leads to more compact and accurate models that run much faster for a given depth-map resolution.

Cost volume. Cost volumes are widely used in traditional methods for dense depth estimation from unrectified images [5, 38]. However, most recent learning-based works build the cost volume at a fixed resolution [14, 39], which leads to a high memory requirement when handling high-resolution images. Recently, Sun et al. [34] introduced the idea of a partial cost volume for optical flow estimation. In short, given an estimate of the current optical flow, the partial cost volume is constructed by searching for correspondences locally, within a rectangle around the estimated position in the warped source view image. Inspired by this strategy, in this paper we propose the cost volume pyramid as an algorithm to progressively estimate the depth residual for each pixel along its visual ray. As we will demonstrate in our experiments, constructing cost volumes at multiple levels leads to a more effective and efficient framework.

3 Method

Let us now introduce our approach to depth inference for MVS. The overall system is depicted in Fig. 1. As in existing works, we denote the reference image as $I_0 \in \mathbb{R}^{H \times W}$, where $H$ and $W$ define its dimensions. Let $\{I_i\}_{i=1}^{N}$ be its $N$ neighboring source images. Assume $\{K_i, R_i, t_i\}_{i=0}^{N}$ are the corresponding camera intrinsics, rotation matrices, and translation vectors for all views. Our goal is to infer the depth map $D$ for $I_0$ from $\{I_i\}_{i=0}^{N}$. The key novelty of our approach is the use of a feed-forward deep network on a cost volume pyramid constructed in a coarse-to-fine manner. Below, we introduce our feature pyramid, the cost volume pyramid, and depth map inference, and finally provide details of the loss function.

3.1 Feature Pyramid

As raw images vary with illumination changes, we adopt learnable features, which have been demonstrated to be a crucial step for extracting dense feature correspondences [39, 35]. The general practice in existing works is to use high-resolution images to extract multi-scale image features, even when the output is a low-resolution depth map. By contrast, we show that a low-resolution image contains enough information for estimating a low-resolution depth map.

Our feature extraction pipeline consists of two steps, see Fig. 1. First, we build the $(L+1)$-level image pyramid $\{I_i^j\}_{j=0}^{L}$ for each input image, $i \in \{0, 1, \dots, N\}$, where the bottom level of the pyramid corresponds to the input image, $I_i^0 = I_i$. Second, we obtain feature representations at the $l$-th level using a CNN, namely the feature extraction network. Specifically, it consists of 9 convolutional layers, each of which is followed by a leaky rectified linear unit (Leaky-ReLU). We use the same CNN to extract features for all levels in all the images. We denote the feature maps for a given level $l$ by $\{f_i^l\}_{i=0}^{N}$, $f_i^l \in \mathbb{R}^{H/2^l \times W/2^l \times F}$, where $F$ is the number of feature channels used in our experiments. We will show that, compared to existing works, our feature extraction pipeline leads to a significant reduction in memory requirements and, at the same time, improves performance.
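The image pyramid step can be illustrated with a minimal sketch. The text does not state which downsampling filter is used, so the 2x2 box filter below is an assumption, and the function names `downsample_2x` and `build_pyramid` are hypothetical.

```python
def downsample_2x(img):
    """Halve a grayscale image (list of rows) with a 2x2 box filter.

    The box filter is an assumption; the paper does not name its filter.
    """
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)]
            for r in range(h)]


def build_pyramid(img, num_levels):
    """Return [level 0 (the input image), level 1, ..., coarsest level]."""
    pyramid = [img]
    for _ in range(num_levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


# Toy 4x8 image; a 3-level pyramid halves each side twice.
image = [[float(r * 8 + c) for c in range(8)] for r in range(4)]
levels = build_pyramid(image, 3)
sizes = [(len(p), len(p[0])) for p in levels]
```

In the actual pipeline the same feature extraction CNN would then be applied to every level of every view; that part is omitted here.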

3.2 Cost Volume Pyramid

Given the extracted features, the next step is to construct cost volumes for depth inference in the reference view. Common approaches usually build a single cost volume at a fixed resolution [39, 40, 14], which incurs large memory requirements and thus limits the use of high-resolution images. Instead, we propose to build a cost volume pyramid, a process that iteratively estimates and refines depth maps to achieve high-resolution depth inference. More precisely, we first build a cost volume for coarse depth map estimation based on the coarsest-resolution images in the image pyramids and uniform sampling of the fronto-parallel planes in the scene. Then, we iteratively construct cost volumes from this coarse estimate and depth residual hypotheses to obtain a depth map with high resolution and accuracy. We provide details about these two steps below.

Figure 2: Reprojection and feature fetching. The depth displacement ($\Delta d$) hypotheses form a regular 3D space based on the upsampled depth map from the previous level ($D^{l+1}_{\uparrow}$). The image feature of each hypothesis is obtained by projecting the corresponding 3D point to the source views and reconstructing a feature by interpolating at that location.

Cost Volume for Coarse Depth Map Inference. We start by building a cost volume at the $L$th level, corresponding to the lowest image resolution $(H/2^L, W/2^L)$. Assume the depth measured at the reference view of a scene ranges from $d_{\min}$ to $d_{\max}$. We construct the cost volume for the reference view by sampling $M$ fronto-parallel planes uniformly across the entire depth range. A sampled depth $d = d_{\min} + m (d_{\max} - d_{\min})/M$, $m \in \{0, 1, 2, \dots, M-1\}$, represents a plane whose normal $n_0$ is the principal axis of the reference camera.
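As a small illustration of the uniform plane sampling described above, the following sketch lists the depths of the fronto-parallel planes for a given range and number of planes. The function name `sample_depths` and the numeric range are hypothetical and not taken from the experiments.

```python
def sample_depths(d_min, d_max, num_planes):
    """Depths of num_planes fronto-parallel planes, uniformly spaced
    from d_min (inclusive) towards d_max (exclusive)."""
    step = (d_max - d_min) / num_planes
    return [d_min + m * step for m in range(num_planes)]


# Hypothetical depth range, chosen only to make the spacing easy to read.
planes = sample_depths(2.0, 10.0, 4)
```

Each value in `planes` is the depth of one plane whose normal is the principal axis of the reference camera.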

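Similar to [39], we define the differentiable homography $H_i(d)$ between the $i$th source view and the reference view at depth $d$ as

$$H_i(d) = K_i^L R_i \left( I - \frac{(t_0 - t_i)\, n_0^T}{d} \right) R_0^{-1} (K_0^L)^{-1}, \tag{1}$$

where $K_i^L$ and $K_0^L$ are the scaled intrinsic matrices of $K_i$ and $K_0$ at level $L$, and $n_0$ is the principal axis of the reference camera.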

Each homography $H_i(d)$ suggests a possible pixel correspondence between $\tilde{x}_i$ in source view $i$ and a pixel $x$ in the reference view. This correspondence is defined as $\lambda_i \tilde{x}_i = H_i(d)\, x$, where $\lambda_i$ represents the depth of $\tilde{x}_i$ in the source view $i$.
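To make the plane-induced correspondence concrete, the sketch below warps one reference pixel into a source view for a single fronto-parallel plane. It uses the standard relative-pose form of the plane homography under the assumed camera convention x_cam = R x_world + t; both the convention and the helper names are assumptions, and the arrangement of terms may differ from the formula in the text.

```python
def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def transpose(a):
    return [[a[c][r] for c in range(3)] for r in range(3)]


def matvec(a, v):
    return [sum(a[r][k] * v[k] for k in range(3)) for r in range(3)]


def intrinsics_inverse(k):
    """Closed-form inverse of a zero-skew pinhole intrinsic matrix."""
    fx, fy, cx, cy = k[0][0], k[1][1], k[0][2], k[1][2]
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def plane_homography(k_ref, r_ref, t_ref, k_src, r_src, t_src, depth):
    """Homography from reference pixels to source pixels for the plane
    z = depth in the reference camera frame (x_cam = R x_world + t)."""
    r_rel = matmul(r_src, transpose(r_ref))        # reference -> source rotation
    rotated_t = matvec(r_rel, t_ref)
    t_rel = [t_src[i] - rotated_t[i] for i in range(3)]
    # R_rel + t_rel n^T / depth with n = (0, 0, 1): only column 2 changes.
    m = [[r_rel[r][c] + (t_rel[r] / depth if c == 2 else 0.0)
          for c in range(3)] for r in range(3)]
    return matmul(matmul(k_src, m), intrinsics_inverse(k_ref))


def warp_pixel(h, u, v):
    x, y, w = matvec(h, [u, v, 1.0])
    return x / w, y / w


# Toy setup: identical intrinsics, no rotation, source camera shifted
# 0.5 units along +x (so t_src = -R c = (-0.5, 0, 0)).
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H = plane_homography(K, I3, [0.0, 0.0, 0.0], K, I3, [-0.5, 0.0, 0.0], 10.0)
u_src, v_src = warp_pixel(H, 60.0, 40.0)
```

In this rectified toy case the pixel moves by the disparity focal * baseline / depth = 100 * 0.5 / 10 = 5 pixels, which is a convenient sanity check for the warp.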

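Given $\tilde{x}_i$ and $\{f_i^L\}_{i=1}^{N}$, we use differentiable linear interpolation [15] to reconstruct the feature maps for the reference view as $\{\tilde{f}_{i,d}^L\}_{i=1}^{N}$. The cost for all pixels at depth $d$ is defined as the variance of the features from the $N+1$ views,

$$C_d^L = \frac{1}{N+1} \sum_{i=0}^{N} \left( \tilde{f}_{i,d}^L - \bar{f}_d^L \right)^2, \tag{2}$$

where $\tilde{f}_{0,d}^L = f_0^L$ is the feature map of the reference image and $\bar{f}_d^L$ is the average of the feature volumes across all views ($\{\tilde{f}_{i,d}^L\}_{i=1}^{N} \cup f_0^L$) for each pixel. This metric encourages the correct depth for each pixel to have the smallest feature variance, which corresponds to the photometric consistency constraint. We compute the cost map for each depth hypothesis and concatenate those cost maps into a single cost volume $C^L \in \mathbb{R}^{H/2^L \times W/2^L \times M \times F}$.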
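The variance-based cost can be sketched for a single pixel and a single depth hypothesis as follows. The function name `variance_cost` and the toy feature vectors are hypothetical; in the network the same computation is applied to whole warped feature maps.

```python
def variance_cost(features):
    """Per-channel variance across views.

    features: one feature vector per view (reference view first, then the
    warped source views), all for the same pixel and depth hypothesis.
    """
    n_views = len(features)
    n_channels = len(features[0])
    mean = [sum(f[c] for f in features) / n_views for c in range(n_channels)]
    return [sum((f[c] - mean[c]) ** 2 for f in features) / n_views
            for c in range(n_channels)]


# Views that agree give zero cost; disagreement raises the cost.
consistent = variance_cost([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
inconsistent = variance_cost([[1.0, 0.0], [3.0, 0.0]])
```

A depth hypothesis at which all views see the same surface point yields matching features and therefore a low cost, which is the photometric consistency argument made above.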

A key parameter for obtaining good depth estimation accuracy is the depth sampling resolution $M$. We will show in Section 3.3 how to determine the interval for depth sampling and coarse depth estimation.

Cost Volume for Multi-scale Depth Residual Inference. Recall that our ultimate goal is to obtain $D = D^0$ for $I_0$. We iterate starting from $D^{l+1}$, a given depth estimate for the $(l+1)$th level, to obtain a refined depth map $D^l$ for the next level, until reaching the bottom level. More precisely, we first upsample $D^{l+1}$ to the next level, $D^{l+1}_{\uparrow}$, via bicubic interpolation, and then build the partial cost volume to regress the residual depth map $\Delta D^l$ and obtain a refined depth map $D^l = D^{l+1}_{\uparrow} + \Delta D^l$ at the $l$th level.

While we share with [4] the insight of iteratively predicting the depth residual, we argue that, instead of performing convolutions on a point cloud [4], building a regular 3D cost volume on the depth residual followed by multi-scale 3D convolution leads to more compact, faster, and more accurate depth inference. Our motivation is that depth displacements for neighboring pixels are correlated, which indicates that regular multi-scale 3D convolution can provide useful contextual information for depth residual estimation. We therefore arrange the depth displacement hypotheses in a regular 3D space and compute the cost volume as follows.

Figure 3: (a) Densely sampled depths project to very close (sub-pixel) locations (pink points) in the source view, which have similar features. (b) Points projected using an appropriate depth sampling carry distinguishable information.

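Given the camera parameters $\{K_i^l, R_i, t_i\}_{i=0}^{N}$ for all camera views and the upsampled depth estimate $D^{l+1}_{\uparrow}$, we consider one pixel $p = (u, v)$. The current depth estimate for $p$ is defined as $d_p = D^{l+1}_{\uparrow}(u, v)$. Assume each depth residual hypothesis interval is $\Delta d_p = s_p / M$, where $s_p$ represents the depth search range at $p$ and $M$ denotes the number of sampled depth residuals. We consider the projection of the corresponding hypothesised 3D point with depth $(d_p + m \Delta d_p)$ into view $i$ as

$$\lambda_i \tilde{x}_i = K_i^l \left( R_i R_0^{-1} \left( (K_0^l)^{-1} (u, v, 1)^T (d_p + m \Delta d_p) - t_0 \right) + t_i \right), \tag{3}$$

where $\lambda_i$ denotes the depth of the corresponding pixel in view $i$, and $m \in \{-M/2, \dots, M/2 - 1\}$ (see Figure 2). Then, the cost for that pixel at each depth residual hypothesis is similarly defined based on Eq. 2, which leads to a cost volume $C^l \in \mathbb{R}^{H/2^l \times W/2^l \times M \times F}$.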
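The residual hypotheses and their reprojection can be sketched as below. For brevity the sketch takes the relative pose between the reference and source cameras directly (an assumption made to keep the example short, equivalent to composing the per-view extrinsics), and all function names and numbers are hypothetical.

```python
def matvec(a, v):
    return [sum(a[r][k] * v[k] for k in range(3)) for r in range(3)]


def residual_depths(depth, interval, num_hyp):
    """Depth hypotheses depth + m * interval for m = -M/2, ..., M/2 - 1."""
    half = num_hyp // 2
    return [depth + m * interval for m in range(-half, num_hyp - half)]


def project_to_source(u, v, depth, k_ref_inv, k_src, r_rel, t_rel):
    """Back-project pixel (u, v) at the given depth and project the 3D
    point into the source view. Returns (u_src, v_src, depth_in_source)."""
    ray = matvec(k_ref_inv, [u, v, 1.0])
    point_ref = [depth * c for c in ray]
    rotated = matvec(r_rel, point_ref)
    point_src = [rotated[i] + t_rel[i] for i in range(3)]
    x, y, z = matvec(k_src, point_src)
    return x / z, y / z, z


# Toy cameras: focal length 100, principal point (50, 40), pure x-shift.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
K_INV = [[0.01, 0.0, -0.5], [0.0, 0.01, -0.4], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_REL = [-0.5, 0.0, 0.0]

hypotheses = residual_depths(10.0, 0.5, 4)
projections = [project_to_source(60.0, 40.0, d, K_INV, K, I3, T_REL)
               for d in hypotheses]
```

Features would then be fetched from the source feature map at each projected location by interpolation and compared across views with the variance cost.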

In the next section, we introduce our solution for determining the depth search interval and range, $s_p$, for every pixel, which is essential to obtain accurate depth estimates.

3.3 Depth Map Inference

In this section, we first provide the details to perform depth sampling at the coarsest image resolution and discretisation of the local depth search range at higher image resolution for building the cost volume. Then, we introduce depth map estimators on cost volumes to achieve the depth map inference.

Depth Sampling for Cost Volume Pyramid. We observe that the depth sampling for virtual depth planes is related to the image resolution. As shown in Fig. 3, it is not necessary to sample depth planes densely, as the projections of those sampled 3D points in the image are too close to provide extra information for depth inference. In our experiments, to determine the number of virtual planes, we compute the mean depth sampling interval corresponding to a 0.5 pixel distance in the image.

Figure 4: Depth Search Range. We search along the visual ray within a range such that the projection of the 3D point in the source view is at most 2 pixels away from the projection at the current depth.

To determine the local search range for the depth residual around the current depth estimate of each pixel, we first project its 3D point into a source view, find the points that are two pixels away from its projection along the epipolar line in both directions, and back-project those two points into 3D. The intersection of these two rays with the visual ray in the reference view determines the search range for depth refinement at the next level (see Fig. 4).
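The search-range rule can be illustrated in a simplified setting. The sketch below assumes a rectified camera pair, where the projection moves along a horizontal epipolar line and disparity equals focal * baseline / depth; the general case in the text uses the full epipolar geometry. Function names and numbers are hypothetical.

```python
import math


def depth_search_range(depth, focal, baseline, pixel_radius=2.0):
    """Depths on the reference ray whose source-view projections lie
    pixel_radius pixels on either side of the current projection.

    Assumes a rectified pair: disparity = focal * baseline / depth.
    """
    fb = focal * baseline
    disparity = fb / depth
    d_near = fb / (disparity + pixel_radius)
    d_far = (fb / (disparity - pixel_radius)
             if disparity > pixel_radius else math.inf)
    return d_near, d_far


def residual_interval(d_near, d_far, num_hypotheses):
    """Depth spacing when the range is split into num_hypotheses steps."""
    return (d_far - d_near) / num_hypotheses


near, far = depth_search_range(depth=10.0, focal=100.0, baseline=1.0)
step = residual_interval(near, far, 8)
```

With a 2-pixel radius on each side the range spans 4 pixels in the source view, so splitting it into 8 hypotheses (a hypothetical count) corresponds to roughly 0.5 pixel per hypothesis. Note that the range is not symmetric in depth: equal pixel steps cover more depth behind the current estimate than in front of it.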

Depth Map Estimator. Similar to MVSNet [39], we apply 3D convolution to the constructed cost volume pyramid $\{C^l\}_{l=0}^{L}$ to aggregate context information and output probability volumes $\{P^l\}_{l=0}^{L}$, where $P^l \in \mathbb{R}^{H/2^l \times W/2^l \times M}$. The detailed 3D convolution network design is given in the Supp. Mat. Note that $P^L$ and $\{P^l\}_{l=0}^{L-1}$ are generated on absolute and residual depth, respectively. We therefore first apply soft-argmax to $P^L$ and obtain the coarse depth map. Then, we iteratively refine the obtained depth map by applying soft-argmax to $\{P^l\}_{l=0}^{L-1}$ to obtain the depth residuals for higher resolutions.

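Recall that the sampled depth is $d = d_{\min} + m (d_{\max} - d_{\min})/M$, $m \in \{0, 1, 2, \dots, M-1\}$, at level $L$. Therefore, the depth estimate for each pixel $p$ is computed as

$$D^L(p) = \sum_{m=0}^{M-1} d\, P_p^L(d). \tag{4}$$

To further refine the current estimate, which is either the coarse depth map or a refined depth at the $(l+1)$th level, we estimate the residual depth. Assume $r_p = m \cdot \Delta d_p^l$ denotes the depth residual hypothesis. We compute the updated depth at the next level as

$$D^l(p) = D^{l+1}_{\uparrow}(p) + \sum_{m=-M/2}^{(M-2)/2} r_p\, P_p^l(r_p), \tag{5}$$

where $l \in \{L-1, L-2, \dots, 0\}$. In our experiments, we observe that no further depth map refinement is required after our pyramidal depth estimation to obtain good results.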
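The two soft-argmax steps can be sketched for a single pixel as follows: an expectation over absolute depths at the coarsest level, and an expectation over residuals added to the upsampled estimate at finer levels. The function names and the probability values are hypothetical; the probabilities are assumed to be already normalised.

```python
def expected_value(values, probabilities):
    """Soft-argmax: probability-weighted sum of the hypotheses."""
    return sum(v * p for v, p in zip(values, probabilities))


def coarse_depth(d_min, d_max, probabilities):
    """Expected absolute depth over M uniformly sampled planes."""
    count = len(probabilities)
    depths = [d_min + m * (d_max - d_min) / count for m in range(count)]
    return expected_value(depths, probabilities)


def refined_depth(upsampled_depth, interval, probabilities):
    """Upsampled depth plus the expected residual, with residuals
    m * interval for m = -M/2, ..., M/2 - 1."""
    count = len(probabilities)
    half = count // 2
    residuals = [m * interval for m in range(-half, count - half)]
    return upsampled_depth + expected_value(residuals, probabilities)


d_coarse = coarse_depth(2.0, 10.0, [0.1, 0.2, 0.3, 0.4])
d_refined = refined_depth(5.0, 0.5, [0.25, 0.25, 0.25, 0.25])
```

Because the result is an expectation rather than a hard argmax, the estimate can fall between sampled hypotheses and remains differentiable with respect to the probabilities.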

3.4 Loss Function

We adopt a supervised learning strategy and construct the pyramid of ground truth depth maps $\{D_{GT}^l\}_{l=0}^{L}$ as the supervisory signal. Similar to the existing MVSNet framework [39], we make use of the $\ell_1$ norm measuring the absolute difference between the ground truth and the estimated depth. For each training sample, our loss is

$$Loss = \sum_{l=0}^{L} \sum_{p \in \Omega} \left\| D_{GT}^l(p) - D^l(p) \right\|_1, \tag{6}$$

where $\Omega$ is the set of valid pixels with ground truth measurements.
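A minimal sketch of the pyramid loss is given below: absolute depth differences are summed over all pyramid levels, restricted to pixels that have a ground truth measurement. Marking invalid pixels with a ground truth value of 0 is an assumption of this sketch, as is the function name.

```python
def pyramid_l1_loss(pred_pyramid, gt_pyramid):
    """Sum of absolute depth errors over all levels and valid pixels.

    Each pyramid is a list of depth maps (lists of rows), one per level.
    A ground truth value <= 0 marks a pixel without a measurement
    (an assumption of this sketch).
    """
    total = 0.0
    for pred, gt in zip(pred_pyramid, gt_pyramid):
        for pred_row, gt_row in zip(pred, gt):
            for d_pred, d_gt in zip(pred_row, gt_row):
                if d_gt > 0:
                    total += abs(d_gt - d_pred)
    return total


# Two-level toy example; the pixel with ground truth 0.0 is ignored.
prediction = [[[1.0, 2.0], [3.0, 4.0]], [[2.0]]]
ground_truth = [[[1.5, 0.0], [3.0, 5.0]], [[2.5]]]
loss = pyramid_l1_loss(prediction, ground_truth)
```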

4 Experiments

In this section, we demonstrate the performance of our framework for MVS with a comprehensive set of experiments in standard benchmarks. Below, we first describe the datasets and benchmarks and then analyze our results.

4.1 Datasets

DTU Dataset [1] is a large-scale MVS dataset with 124 scenes scanned from 49 or 64 views under 7 different lighting conditions. DTU provides a 3D point cloud acquired using structured-light sensors. Each view consists of an image and the calibrated camera parameters. To train our model, we generate a depth map for each view by using the method provided by MVSNet [39]. We use the same training, validation and evaluation sets as defined in [39, 40, 4].

Tanks and Temples [20] contains both indoor and outdoor scenes under realistic lighting conditions with large scale variations. For comparison with other approaches, we evaluate our results on the intermediate set.

4.2 Implementation

Training. We train our CVP-MVSNet on the DTU training set. Unlike previous methods [39, 4], which take a high-resolution image as input but estimate a depth map of smaller size, our method produces a depth map of the same size as the input image. For training, we match the ground-truth depth map by downsampling the high-resolution image to a smaller one of size $160 \times 128$. Then, we build the image and ground-truth depth pyramids with 2 levels. To construct the cost volume pyramid, we uniformly sample $M$ depth hypotheses across the entire depth range at the coarsest (2nd) level. Then, each pixel is assigned a set of depth residual hypotheses at the next level for the refinement of the depth estimate. Following MVSNet [39], we adopt 3 views for training. We implemented our network using PyTorch [27], and we used ADAM [19] to train our model. The batch size is set to 16 and the network is trained end-to-end on an NVIDIA TITAN RTX graphics card for 27 epochs. The learning rate is initially set to 0.001 and is halved at several preset epochs.

Metrics. We follow the standard evaluation protocol as in [1, 39]. We report the accuracy, completeness and overall score of the reconstructed point clouds. Accuracy is measured as the distance from the estimated point clouds to the ground truth ones in millimeters, and completeness is defined as the distance from the ground truth point clouds to the estimated ones [1]. The overall score is the average of accuracy and completeness [39].
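The three scores can be illustrated with a simplified sketch that uses brute-force nearest-neighbour distances on tiny point sets. The official DTU protocol involves additional steps (such as observability masks and outlier handling) that are not reproduced here, and the function names are hypothetical.

```python
import math


def mean_nearest_distance(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target)
               for p in source) / len(source)


def evaluate(estimated, ground_truth):
    """Return (accuracy, completeness, overall) for two point sets."""
    accuracy = mean_nearest_distance(estimated, ground_truth)
    completeness = mean_nearest_distance(ground_truth, estimated)
    return accuracy, completeness, (accuracy + completeness) / 2.0


# The estimate is exact where it exists but misses one ground truth point.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
acc, comp, overall = evaluate(est, gt)
```

The toy example shows why both directions are needed: the estimate is perfectly accurate, yet the missing point is only penalised by the completeness term. The same averaging reproduces the reported overall score of our method, (0.296 + 0.406) / 2 = 0.351.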

Evaluation. As the parameters are shared across the cost volume pyramid, we can evaluate our model with different numbers of cost volumes and input views. For the evaluation, we set the number of depth samples $M$ for the coarsest depth estimation to the same value as [4] (results for an alternative setting are provided in the Supp. Mat.) and use a separate value of $M$ for the subsequent depth residual inference levels. Similar to previous methods [39, 4], we use 5 views and apply the same depth map fusion method to obtain the point clouds. We evaluate our model with images of different sizes and set the number of pyramid levels accordingly, so that the image size at the coarsest level stays similar to the one used in training ($80 \times 64$ for the 2-level pyramid, see Table 4a). For instance, the pyramid has 5 levels for the largest input size and 4 levels for the two smaller input sizes (see Table 2 for the input sizes).
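The relation between input size, number of pyramid levels, and coarsest image size can be sketched as follows. The selection rule in `levels_for_width` (pick the number of halvings that brings the width closest, in log scale, to a target width of 80) is an assumption made for illustration, not a rule stated in the text, and both function names are hypothetical.

```python
import math


def coarsest_size(width, height, num_levels):
    """Image size at the coarsest level when each level halves both sides."""
    scale = 2 ** (num_levels - 1)
    return width // scale, height // scale


def levels_for_width(width, target_width=80):
    """Number of levels whose coarsest width is closest (in log scale)
    to target_width. Assumed rule, for illustration only."""
    halvings = max(0, round(math.log2(width / target_width)))
    return halvings + 1


# Input widths as listed in Table 2, plus the training width.
chosen = {w: levels_for_width(w) for w in (160, 640, 800, 1600)}
```

Under this assumed rule, widths of 1600, 800 and 640 give 5, 4 and 4 levels, which matches the level counts quoted above if those correspond to the Table 2 input sizes; `coarsest_size(160, 128, 2)` gives the 80x64 entry of Table 4a.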

Method               Acc. (mm)  Comp. (mm)  Overall (mm)

Geometric methods
Furu [10]            0.613      0.941       0.777
Tola [36]            0.342      1.190       0.766
Camp [2]             0.835      0.554       0.695
Gipuma [11]          0.283      0.873       0.578
Colmap [30, 31]      0.400      0.664       0.532

Learning-based methods
SurfaceNet [16]      0.450      1.040       0.745
MVSNet [39]          0.396      0.527       0.462
P-MVSNet [24]        0.406      0.434       0.420
R-MVSNet [40]        0.383      0.452       0.417
MVSCRF [37]          0.371      0.426       0.398
Point-MVSNet [4]     0.342      0.411       0.376
Ours                 0.296      0.406       0.351

Table 1: Quantitative results of reconstruction quality on the DTU dataset (lower is better). Our method outperforms all methods on mean completeness and overall reconstruction quality, and achieves the second best mean accuracy.

4.3 Results on DTU dataset

We first compare our results to those reported by traditional geometric-based methods and other learning-based baseline methods. As summarized in Table 1, our method outperforms all current learning-based methods [39, 40, 4] in terms of accuracy, completeness and overall score. Compared to geometric-based approaches, only the method proposed by Galliani et al. [11] provides slightly better results in terms of mean accuracy.

We now compare our results to related learning-based methods in terms of GPU memory usage and runtime for different input resolutions. These results are summarized in Table 2. As shown, with a similar memory usage (bottom row), our network is able to produce better point clouds with a lower runtime. In addition, compared to Point-MVSNet [4] at the same depth map output size (top rows), our approach is about six times faster (0.37 s vs. 2.03 s and 0.49 s vs. 3.04 s) and consumes about six times less GPU memory (1416 MB vs. 8989 MB and 2207 MB vs. 13081 MB) with similar accuracy. We can therefore output a high-resolution depth map with better accuracy, lower memory usage and shorter runtime than Point-MVSNet [4].

Figure 5: Qualitative results of scan 9 of the DTU dataset (panels, left to right: R-MVSNet [40], Point-MVSNet [4], Ours, Ground truth). The upper row shows the point clouds and the bottom row shows the normal map corresponding to the orange rectangle. As highlighted in the blue rectangle, the completeness of our results is better than that of Point-MVSNet [4]. The normal map (orange rectangle) further shows that our results are smoother on surfaces while maintaining more high-frequency details.

Method             Input size  Depth map size  Acc. (mm)  Comp. (mm)  Overall (mm)  f-score (0.5mm)  GPU mem. (MB)  Runtime (s)
Point-MVSNet [4]   1280x960    640x480         0.361      0.421       0.391         84.27            8989           2.03
Ours-640           640x480     640x480         0.372      0.434       0.403         82.44            1416           0.37

Point-MVSNet [4]   1600x1152   800x576         0.342      0.411       0.376         -                13081          3.04
Ours-800           800x576     800x576         0.340      0.418       0.379         86.82            2207           0.49

MVSNet [39]        1600x1152   400x288         0.396      0.527       0.462         78.10            22511          2.76
R-MVSNet [40]      1600x1152   400x288         0.383      0.452       0.417         83.96            6915           5.09
Point-MVSNet [4]   1600x1152   800x576         0.342      0.411       0.376         -                13081          3.04
Ours               1600x1152   1600x1152       0.296      0.406       0.351         88.61            8795           1.72

Table 2: Comparison of reconstruction quality, GPU memory usage and runtime on the DTU dataset for different input sizes. GPU memory usage and runtime are obtained by running the official evaluation code of the baselines on the same machine with an NVIDIA TITAN RTX graphics card. For the same depth map size (Ours-640, Ours-800) and a performance similar to Point-MVSNet [4], our method is about 6 times faster and consumes about 6 times less GPU memory. For the same input image size (Ours), our method achieves the best reconstruction with the shortest runtime and a reasonable GPU memory usage.

Method             Rank   Mean   Family  Francis  Horse  Lighthouse  M60    Panther  Playground  Train
P-MVSNet [24]      11.72  55.62  70.04   44.64    40.22  65.20       55.08  55.17    60.37       54.29
Ours               12.75  54.03  76.5    47.74    36.34  55.12       57.28  54.28    57.43       47.54
Point-MVSNet [4]   29.25  48.27  61.79   41.15    34.20  50.79       51.97  50.85    52.38       43.06
R-MVSNet [40]      31.75  48.40  69.96   46.65    32.59  42.95       51.88  48.80    52.00       42.38
MVSNet [39]        42.75  43.48  55.99   28.55    25.07  50.79       53.96  50.86    47.90       34.69

Table 3: Performance on Tanks and Temples [20] as of November 12, 2019. Our results outperform Point-MVSNet [4], which is the strongest baseline on the DTU dataset, and are competitive with P-MVSNet [24].

Figures 5 and 6 show some qualitative results. As shown, our method is able to reconstruct more details than Point-MVSNet [4]; see, for instance, the details of the roof behind the front building highlighted in the blue box. Compared to R-MVSNet [40] and Point-MVSNet [4], as can be seen in the normal maps, our results are smoother on the surfaces while capturing more high-frequency details in edge regions.

Figure 6: Additional results from the DTU dataset (panels, left to right: R-MVSNet [40], Point-MVSNet [4], Ours, Ground truth). Best viewed on screen.

Figure 7: Point cloud reconstruction of the Tanks and Temples dataset [20]: (a) Train, (b) Panther, (c) Lighthouse, (d) Family, (e) Horse. Best viewed on screen.

4.4 Results on Tanks and Temples

We now evaluate the generalization ability of our method. To this end, we use the model trained on DTU without any fine-tuning to reconstruct point clouds for scenes in the Tanks and Temples dataset. For a fair comparison, we use the same camera parameters, depth range and view selection as MVSNet [39]. We consider four baselines [39, 40, 4, 24] and evaluate the f-score on Tanks and Temples. Table 3 summarizes these results. As shown, our method yields a mean f-score of 54.03, about 5.8 points higher than Point-MVSNet [4] (48.27), which is the best baseline on the DTU dataset, and about 1.6 points lower than P-MVSNet [24] (55.62). Note that P-MVSNet [24] applies more depth filtering during point cloud fusion than our method, which simply follows the fusion process provided by MVSNet [39]. Qualitative results of our point cloud reconstructions are shown in Fig. 7.

(a) Number of pyramid levels
Levels  Coarsest img. size  Acc. (mm)  Comp. (mm)  Overall (mm)
2       80x64               0.296      0.406       0.351
3       40x32               0.326      0.407       0.366
4       20x16               0.339      0.411       0.375
5       10x8                0.341      0.412       0.376

(b) Pixel interval
Pixel interval  Acc. (mm)  Comp. (mm)  Overall (mm)
2               0.299      0.413       0.356
1               0.299      0.403       0.351
0.5             0.296      0.406       0.351
0.25            0.313      0.482       0.397

Table 4: Parameter sensitivity on the DTU dataset. (a) Accuracy as a function of the number of pyramid levels. (b) Accuracy as a function of the pixel interval setting.

4.5 Ablation study

Training pyramid levels. We first analyse the effect of the number of pyramid levels on the quality of the reconstruction. To this end, we downsample the images to form pyramids with four different numbers of levels. The results of this analysis are summarized in Table 4a. As shown, the proposed 2-level pyramid performs best. As the number of pyramid levels increases, the image resolution at the coarsest level decreases. For more than 2 levels, this resolution is too small to produce a good initial depth map to be refined.

Evaluation pixel interval settings. We now analyse the effect of varying the pixel interval setting for depth refinement. As discussed in Section 3.3, the depth sampling is determined by the corresponding pixel offset in the source views; hence, it is important to set a suitable pixel interval. Table 4b summarizes the effect of varying the interval, during evaluation, over depth ranges corresponding to 0.25 pixel up to 2 pixels. As shown, the overall score degrades when the interval is too small (0.25 pixel) or too large (2 pixels).

5 Conclusion

In this paper, we proposed CVP-MVSNet, a cost volume pyramid based depth inference framework for MVS. CVP-MVSNet is compact, lightweight and fast in runtime, and can handle high-resolution images to obtain high-quality depth maps for 3D reconstruction. As shown by extensive evaluation on benchmark datasets, our model achieves better performance than state-of-the-art methods. In the future, we want to explore the integration of our approach into a learning-based structure-from-motion framework to further reduce the memory requirements for different applications.


This research is supported in part by the Australian Research Council DECRA Fellowship (DE180100628).


  1. H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola and A. B. Dahl (2016) Large-scale data for multiple-view stereopsis. International Journal of Computer Vision 120 (2), pp. 153–168. Cited by: §4.1, §4.2.
  2. N. D. F. Campbell, G. Vogiatzis, C. Hernández and R. Cipolla (2008) Using multiple hypotheses to improve depth-maps for multi-view stereo. In Computer Vision – ECCV 2008, D. Forsyth, P. Torr and A. Zisserman (Eds.), Berlin, Heidelberg, pp. 766–779. Cited by: Table 1.
  3. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2.
  4. R. Chen, S. Han, J. Xu and H. Su (2019) Point-based multi-view stereo network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1538–1547. Cited by: 3rd item, §1, §1, §1, §2, §3.2, Figure 5, Figure 6, §4.1, §4.2, §4.2, §4.3, §4.3, §4.3, §4.4, Table 1, Table 2, Table 3.
  5. R. T. Collins (1996) A space-sweep approach to true multi-image matching. In Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 358–363. Cited by: §2.
  6. J. S. De Bonet and P. Viola (1999) Poxels: probabilistic voxelized volume reconstruction. In Proceedings of International Conference on Computer Vision (ICCV), pp. 418–425. Cited by: §2.
  7. C. H. Esteban and F. Schmitt (2004) Silhouette and stereo fusion for 3d object modeling. Computer Vision and Image Understanding 96 (3), pp. 367–392. Cited by: §2.
  8. O. Faugeras and R. Keriven (2002) Variational principles, surface evolution, pde’s, level set methods and the stereo problem. IEEE. Cited by: §2.
  9. P. Fua and Y. G. Leclerc (1995) Object-centered surface reconstruction: combining multi-image stereo and shading. International Journal of Computer Vision 16 (1), pp. 35–56. Cited by: §2.
  10. Y. Furukawa and J. Ponce (2010-08) Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (8), pp. 1362–1376. External Links: Document, ISSN Cited by: Table 1.
  11. S. Galliani, K. Lasinger and K. Schindler (2016) Gipuma: massively parallel multi-view stereo reconstruction. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation e. V 25, pp. 361–369. Cited by: §4.3, Table 1.
  12. R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §2.
  13. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.
  14. S. Im, H. Jeon, S. Lin and I. S. Kweon (2019) DPSNet: end-to-end deep plane sweep stereo. In International Conference on Learning Representations, Cited by: §2, §3.2.
  15. M. Jaderberg, K. Simonyan and A. Zisserman (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §3.2.
  16. M. Ji, J. Gall, H. Zheng, Y. Liu and L. Fang (2017) Surfacenet: an end-to-end 3d neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2307–2315. Cited by: §2, Table 1.
  17. S. B. Kang, R. Szeliski and J. Chai (2001) Handling occlusions in dense multi-view stereo. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Vol. 1, pp. I–I. Cited by: §2.
  18. A. Kar, C. Häne and J. Malik (2017) Learning a multi-view stereo machine. In Advances in neural information processing systems, pp. 365–376. Cited by: §2, §2.
  19. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  20. A. Knapitsch, J. Park, Q. Zhou and V. Koltun (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG) 36 (4), pp. 78. Cited by: Figure 7, §4.1, Table 3.
  21. V. Kolmogorov and R. Zabih (2002) Multi-camera scene reconstruction via graph cuts. In European conference on computer vision, pp. 82–96. Cited by: §2.
  22. K. N. Kutulakos and S. M. Seitz (2000) A theory of shape by space carving. International journal of computer vision 38 (3), pp. 199–218. Cited by: §2.
  23. J. Long, E. Shelhamer and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.
  24. K. Luo, T. Guan, L. Ju, H. Huang and Y. Luo (2019) P-mvsnet: learning patch-wise matching confidence aggregation for multi-view stereo. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10452–10461. Cited by: §4.4, Table 3.
  25. W. Luo, A. G. Schwing and R. Urtasun (2016) Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703. Cited by: §2.
  26. P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J. Frahm, R. Yang, D. Nistér and M. Pollefeys (2007) Real-time visibility-based fusion of depth maps. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8. Cited by: §2.
  27. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §4.2.
  28. J. Pons, R. Keriven, O. Faugeras and G. Hermosillo (2003) Variational stereovision and 3d scene flow estimation with statistical similarity measures. In Proceedings Ninth IEEE International Conference on Computer Vision, pp. 597. Cited by: §2.
  29. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2.
  30. J. L. Schönberger and J. Frahm (2016) Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1.
  31. J. L. Schönberger, E. Zheng, M. Pollefeys and J. Frahm (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), Cited by: Table 1.
  32. S. M. Seitz, B. Curless, J. Diebel, D. Scharstein and R. Szeliski (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 1, pp. 519–528. Cited by: §1, §2.
  33. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.
  34. D. Sun, X. Yang, M. Liu and J. Kautz (2018) PWC-net: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §2, §2.
  35. C. Tang and P. Tan (2019) BA-net: dense bundle adjustment networks. In International Conference on Learning Representations, External Links: Link Cited by: §3.1.
  36. E. Tola, C. Strecha and P. Fua (2012-09-01) Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications 23 (5), pp. 903–920. Cited by: Table 1.
  37. Y. Xue, J. Chen, W. Wan, Y. Huang, C. Yu, T. Li and J. Bao (2019) MVSCRF: learning multi-view stereo with conditional random fields. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4312–4321. Cited by: Table 1.
  38. R. Yang and M. Pollefeys (2003) Multi-resolution real-time stereo on commodity graphics hardware. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., Vol. 1, pp. I–I. Cited by: §2.
  39. Y. Yao, Z. Luo, S. Li, T. Fang and L. Quan (2018) MVSnet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783. Cited by: §1, §2, §2, §2, §3.1, §3.2, §3.2, §3.3, §3.4, §4.1, §4.2, §4.2, §4.2, §4.3, §4.4, Table 1, Table 2, Table 3.
  40. Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang and L. Quan (2019) Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5534. Cited by: §1, §3.2, Figure 5, Figure 6, §4.1, §4.3, §4.3, §4.4, Table 1, Table 2, Table 3.
  41. J. Zbontar and Y. LeCun (2016) Stereo matching by training a convolutional neural network to compare image patches.. Journal of Machine Learning Research 17 (1-32), pp. 2. Cited by: §2.