Learned Semantic Multi-Sensor Depth Map Fusion
Volumetric depth map fusion based on truncated signed distance functions has become a standard method and is used in many 3D reconstruction pipelines. In this paper, we generalize this classic method in multiple ways: 1) Semantics: Semantic information enriches the scene representation and is incorporated into the fusion process. 2) Multi-Sensor: Depth information can originate from different sensors or algorithms with very different noise and outlier statistics, which are considered during data fusion. 3) Scene denoising and completion: Sensors can fail to recover depth for certain materials and light conditions, or data is missing due to occlusions. Our method denoises the geometry, closes holes and computes a watertight surface for every semantic class. 4) Learning: We propose a neural network reconstruction method that unifies all these properties within a single powerful framework. Our method learns sensor or algorithm properties jointly with semantic depth fusion and scene completion, and can also be used as an expert system, e.g. to unify the strengths of various photometric stereo algorithms. Our approach is the first to unify all these properties. Experimental evaluations on both synthetic and real datasets demonstrate clear improvements.
1 Introduction
Holistic 3D scene understanding is one of the central goals of computer vision research. Tremendous progress has been made within the last decades to recover accurate 3D scene geometry with a variety of sensors [8, 23, 35] and image-based 3D reconstruction methods [19, 49, 39]. With the breakthrough in machine learning, algorithms that recover 3D geometry increasingly include semantic information [26, 20, 21, 1, 5, 31, 10, 14, 13, 6, 43] in order to improve algorithm robustness and reconstruction accuracy, and to provide a richer scene representation. Many consumer products like smartphones, game consoles, augmented and virtual reality devices, as well as cars and household robots, are equipped with an increasing number of cameras and depth sensors. Computer vision systems can greatly benefit from this trend by leveraging multiple data sources and providing richer and more accurate results. In this paper, we address the problem of multi-sensor depth map fusion for semantic 3D reconstruction.
Nowadays, depth can be estimated very robustly from multiple and even single RGB images . Nevertheless, depending on the camera, scene lighting, as well as the object and material properties, the noise statistics of computed depth maps can vary largely. Moreover, popular depth sensors like the Kinect have varying noise statistics  depending on the depth value and the pixel distance to the image center. They also have trouble recovering depth on object edges as well as on light reflecting or absorbing surfaces, but perform well on low-textured surfaces and within short depth ranges. In contrast, image-based stereo methods usually perform well on object edges and across a wide depth range, but fail on low-textured surfaces and have comparably high noise and outlier rates.
Figure 1: Inputs, standard TSDF fusion, and our learned fusion.
While traditional methods have tried to model these effects, they usually impose strong assumptions about the noise distribution, or they require tedious calibration to estimate all parameters. In contrast, we leverage the strength of machine learning techniques to extract sensor properties and scene parameters automatically from training data and use them in the form of confidence values for a more accurate semantic depth map fusion. Fig. 1 shows an example output of our method. In sum, we make the following contributions:
We propose the first method to unify semantic 3D reconstruction, scene completion and multi-sensor data fusion into a single machine-learning-based framework. Our approach uses only a few model parameters and thus needs only small amounts of training data to generalize well.
Our method analyses the sensor output and learns depth sensor-specific noise and outlier statistics, which are considered when estimating confidence values for the TSDF fusion. When the depth source is an algorithm, we feed in information about both the depth output and the input patches, so that our network can better learn when the algorithm typically fails.
Besides the multi-sensor data fusion, our approach can also be used as an expert system for multi-algorithm depth fusion in which the outputs of various stereo methods are fused to reach a better reconstruction accuracy.
2 Related Work
Volumetric Depth Fusion. In their pioneering work, Curless and Levoy  proposed a simple and effective method to fuse depth maps from multiple views by averaging truncated signed distance functions (TSDFs) within a regular voxel grid. With the broad availability of low-cost depth sensors like the MS Kinect, this method became very popular with influential works like KinectFusion  and its numerous extensions, like voxel hashing  or voxel octrees . This depth fusion method has become standard for SLAM frameworks like InfiniTAM  and was further generalized to account for drift and calibration errors, e.g. ElasticFusion , BundleFusion , but also for 3D reconstruction frameworks [53, 29, 20, 21, 13, 6].
All these methods have in common that TSDF fusion is performed via simple uniformly weighted averaging. Hence these methods do not account for the fact that depth measurements may exhibit different noise and outlier rates. This has been tackled by probabilistic fusion methods.
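The uniformly weighted averaging shared by these methods can be sketched as follows. This is a minimal NumPy sketch under our own assumptions about array layout and the truncation value; it is not any of the cited implementations:

```python
import numpy as np

def integrate_tsdf(tsdf, weight, sdf_obs, trunc=0.1):
    """One Curless-Levoy-style integration step with uniform weights.

    tsdf, weight: running TSDF and weight volumes (same shape).
    sdf_obs: signed distances induced by the new depth map, resampled
             per voxel (NaN where the voxel is unobserved).
    """
    d = np.clip(sdf_obs, -trunc, trunc)   # truncate the signed distance
    mask = ~np.isnan(d)                   # only update observed voxels
    w_new = weight + mask                 # every observation counts equally
    tsdf_out = tsdf.copy()
    tsdf_out[mask] = (weight[mask] * tsdf[mask] + d[mask]) / w_new[mask]
    return tsdf_out, w_new
```

Because every observation receives the same weight, a noisy sensor pulls the average just as strongly as an accurate one, which is exactly the limitation addressed below.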
Probabilistic Depth Fusion. Probabilistic approaches explicitly model sensor noise, typically with a Gaussian distribution. A very simple approach with only 2.5D output and a Gaussian noise assumption can be found in . A point-based fusion approach is proposed in : instead of a voxel grid, the fusion updates are performed directly on a point cloud. This has been extended to anisotropic point-based fusion in  to account for different noise levels when a surface is observed from different viewing angles. For a fixed topology, the mesh-based fusion approach of  fuses depth information over various mesh resolutions. A more complex probabilistic fusion method is proposed in , which includes long-range visibility constraints in its online fusion. A similar model with long-range ray-based visibility constraints was used in [47, 46], although these methods are not real-time capable. Recently, PSDF Fusion  demonstrated a combination of probabilistic modeling and a TSDF scene representation; however, it also assumes a Gaussian error distribution of the input depth values. Overall, probabilistic approaches handle noise and outliers better than traditional TSDF fusion methods. Nevertheless, the majority of these methods impose strong assumptions about the sensor error distributions to define the prior model. The first method that implicitly learns an unknown error distribution during fusion is OctNetFusion by Riegler et al. They jointly learn the splitting of the octree scene representation, but neither multiple sensors nor semantic information are considered.
Multi-Sensor Data Fusion. Early approaches like Zhu et al. fuse time-of-flight depth and stereo, but only into a 2.5D depth map. Kim et al.  fuse the same sensor combination in 3D via a probabilistic framework on a voxel grid. Work by  strives for low-level data fusion to improve the Kinect output with stereo correspondences. As an extension of , Duan et al. use a probabilistic approach for the fusion of Kinect and stereo in real-time. None of the current multi-sensor depth fusion networks can incorporate semantic information, and generalizing them is usually non-trivial.
3D Reconstruction with Confidences. A wide range of 3D reconstruction approaches estimate confidence values for depth hypotheses which are later used for adaptive fusion. These approaches typically either use handcrafted confidence weights [18, 48, 30] rather than learning them from data, or they learn only 2D score maps without learning their 3D fusion [37, 45, 44, 50].
Semantic 3D Reconstruction and Scene Completion. Joint estimation of semantic labels and 3D geometry has been proposed with traditional energy-based methods, both for depth maps  and for dense volumetric 3D [26, 20, 21, 1, 5, 31]. Machine learning-based approaches have pushed the state of the art in reconstructing and completing 3D scenes [10, 14, 13, 6]. These methods are not real-time capable, but real-time fusion of CNN-based single-image depth and semantics has recently been presented in CNN-SLAM .
So far none of the semantic 3D reconstruction approaches is able to properly handle multiple sensors with different noise characteristics and their extension is not straightforward. Our goal is a general framework which unifies all the previously discussed properties within a learning-based method.
3 Method
For performing semantic 3D reconstruction, our method requires as input a set of RGB-D images and their corresponding 2D semantic segmentations, as shown in Fig. 1. The semantic segmentations are fused into the TSDF representation of the scene following standard TSDF fusion. In the following, we describe how to robustly produce an accurate TSDF by fusing measurements from multiple depth sensors.
Key idea. We consider multiple depth sensors which produce a set of depth maps by scanning a scene. The most common approach to data fusion consists of fusing all the depth maps, regardless of the sensor that produced them, into a TSDF representation of the scene. However, this does not reflect the specific noise and outlier statistics of each measurement. We propose to overcome this issue by learning a confidence estimator for every sensor that weights the measurements before fusing them. For each sensor, we produce a TSDF representation of the scene by fusing the corresponding depth maps. Our method learns to estimate confidence values for every voxel in the TSDF, such that the accuracy of the semantic 3D reconstruction is maximized.
We propose an end-to-end trainable neural network architecture which can be roughly separated into two parts: a sensor confidence network which predicts a confidence value for each sensor measurement, and a semantic 3D reconstruction network which takes all aggregated noisy measurements and corresponding confidences and performs semantic 3D reconstruction.
The overall network structure is depicted in Fig. 2 and the individual network parts are detailed in the following subsections.
3.1 Sensor Confidence Network
Weighted TSDF Fusion. A sensor produces a set of depth maps that can be fused into a TSDF following standard TSDF fusion. We learn to estimate corresponding confidence maps that assign, to every voxel, a confidence for that sensor's measurement. The fusion of all the sensor measurements is then computed via a point-wise weighted average:
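The weighted average can be written as follows. This is a hedged reconstruction of the elided equation from the surrounding text; the symbols $u$, $d_s$ and $c_s$ are our own notation, not necessarily the paper's:

```latex
% Point-wise weighted TSDF fusion over sensors s at voxel v,
% with per-sensor TSDF d_s and learned confidence c_s:
u(v) = \frac{\sum_{s} c_s(v)\, d_s(v)}{\sum_{s} c_s(v)}
```

With all confidences equal, this reduces to the uniform averaging of standard TSDF fusion.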
Goal. The purpose of learning confidence weights for multi-sensor TSDF fusion is twofold: 1) Intra-Sensor Weighting: The network captures the noise and outlier statistics among measurements, producing a spatially varying confidence map; e.g. points that are mostly observed from a far distance can get a lower confidence than those mainly observed from a closer distance. 2) Inter-Sensor Weighting: The network analyses the noise and outlier statistics among different sensors in order to weight them against each other. In this regard, the network also accounts for normalization, which is important if different amounts of data are available from different sensors. This avoids, for instance, a bias towards a sensor with a higher frame rate.
Feature extraction. We aggregate features from the input data which we believe will help the network estimate a reliable confidence value. Ideally, we could feed all input data into our confidence network and let it identify important features for the confidence estimation on its own, but the amount of input data for the scenes considered in this paper renders this infeasible. Therefore, our selected feature set is certainly not exhaustive and there may be other useful features or better feature combinations. However, we found that all of them improve the reconstruction results. For each sensor and each voxel, we extract the following features:
Average 3×3 patches in the depth image (9 values): analyzing neighboring depth values helps to identify outliers in the depth map (Fig. 3).
Mean and standard deviation of the image gradient norm on 3×3 patches (2 values): especially for stereo methods, the average gradient norm of a patch indicates how much gradient information the patch contains. Homogeneously colored patches should lead to low confidence values.
Mean and standard deviation of the normalized cross-correlation (NCC) of 5×5 stereo patches (for stereo algorithms; 2 values): NCC is an established measure of patch similarity for stereo methods. If the patches do not match well, or there is a high variance of NCC values among patches voting for the same point, the confidence value should be reduced.
This set of features is then processed for each voxel individually by a small neural network which estimates a confidence weight for a single voxel (magenta in Fig. 2).
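Assembling the per-voxel feature vector described above can be sketched as follows. This is an illustrative sketch; the 13-dimensional layout and the zero-padding for non-stereo sensors are our own assumptions:

```python
import numpy as np

def voxel_features(depth_patch, grad_norm_patch, ncc_values):
    """Assemble a per-voxel feature vector for the confidence network.

    depth_patch:      3x3 neighborhood of the depth map (9 values).
    grad_norm_patch:  3x3 image-gradient norms around the pixel.
    ncc_values:       NCC scores of the 5x5 stereo patches voting for
                      this point (empty for non-stereo sensors).
    """
    feats = list(np.asarray(depth_patch, float).ravel())          # 9 values
    feats += [np.mean(grad_norm_patch), np.std(grad_norm_patch)]  # 2 values
    if len(ncc_values):                                           # stereo only
        feats += [np.mean(ncc_values), np.std(ncc_values)]        # 2 values
    else:
        feats += [0.0, 0.0]  # assumption: zero-pad for non-stereo sensors
    return np.array(feats)
```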
Confidence Network Architecture. The small confidence estimation networks have an identical structure for each sensor and identical weights for each voxel of a sensor. They consist of 5 fully connected layers with ReLU activations and a decreasing number of neurons. The last layer is initialized with biases equal to one, such that the initial confidence values are equal for each sensor; the remaining weights are initialized randomly. The outputs of the confidence networks are then aggregated into a single TSDF volume which serves as input for the semantic 3D reconstruction network.
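The per-voxel confidence network can be sketched in plain NumPy as below. The layer widths are an assumption (the paper's exact neuron counts were lost in extraction); only the 5-layer ReLU structure and the bias-one initialization of the last layer follow the text:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ConfidenceNet:
    """Per-voxel confidence MLP: 5 fully connected ReLU layers with
    decreasing widths (the widths below are illustrative assumptions).
    The last layer is initialized with bias 1 so that all sensors start
    with equal confidence; remaining weights are random."""

    def __init__(self, in_dim=13, widths=(32, 16, 8, 4, 1), seed=0):
        rng = np.random.default_rng(seed)
        dims = (in_dim,) + widths
        self.W = [rng.normal(0, 0.01, (a, b)) for a, b in zip(dims, dims[1:])]
        self.b = [np.zeros(w) for w in widths]
        self.b[-1][:] = 1.0  # bias-one init => initial confidence ~ 1

    def __call__(self, feats):
        x = feats
        for W, b in zip(self.W, self.b):
            x = relu(x @ W + b)
        return x  # one confidence value per voxel
```

The same weights are shared across all voxels of a sensor, so the network can be applied to a whole batch of feature vectors at once.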
3.2 Semantic 3D Reconstruction Network
Our approach learns in an end-to-end fashion how to jointly perform data fusion and semantic 3D reconstruction. The data fusion should facilitate the semantic 3D reconstruction by providing additional and more complete information about the scene. To perform the reconstruction, we use the architecture introduced in , which leverages the benefits of neural networks and variational methods. The fundamental principle of the method is to compute a consistent voxel labeling from noisy and incomplete depth such that semantic voxel transitions are statistically similar to the transitions previously seen in the training data. For instance, a bed should be standing on the ground, with vertical transitions to the ground below and the free space above, while a wall should have horizontal transitions to free space.
The motivations are the following:
The architecture, which relies on the principles of total variation segmentation and inpainting, contains very few parameters to learn due to weight sharing. Because of the few parameters, the network does not need much training data, which is beneficial since only a few small real datasets are available for training.
The compact architecture makes it easy to extend the network to estimate further parameters for the data fusion, and still allows processing larger scenes with more than 15M voxels.
The energy formulation allows us to incorporate an arbitrary number of sensors into the 3D reconstruction method, which is more difficult with standard feed-forward architectures.
Figure 4: Semantic accuracy on SUNCG. Stereo (S) full: 0.71; Kinect (K) full: 0.77; S half + K half: 0.76; S half + K half (d): 0.77; S half + K half (d, g): 0.78; S full + K full (d, g): 0.786; perfect sensor full: 0.794; ground truth for reference.
Variational method. We briefly describe the working principles of the reconstruction network; more details can be found in . At its core, the network minimizes an energy such that the solution corresponds to a scene whose label-transition statistics match the training data. Over the voxel grid, the energy is written as:
In Eq. (2), the unknown is the voxel labeling we optimize for, defined such that each entry gives the probability that a particular label is assigned to a particular voxel. An element-wise (Hadamard) product couples the labeling with the data costs, and a regularization operator enforces the labeling to respect certain conditions on the semantic transitions (e.g. the bed stands on the ground). During training, this operator is learned to capture typical scene statistics. It can be implemented as a convolution which locally compares voxels to their neighborhood, thus verifying the semantic transitions.
The energy (2) is numerically minimized with a first-order primal-dual algorithm . To this end, dual variables are introduced to account for the non-differentiability and the simplex constraint in Eq. (2), leading to the following equivalent discretized saddle-point energy:
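Since the energies themselves were lost in extraction, one standard total-variation form they may take can be sketched as follows. This is our own hedged reconstruction, not the paper's exact formulation; $u$ denotes the label probabilities, $f$ the data costs derived from the fused TSDF, $W$ the learned convolutional regularization operator, and $\xi$ the dual variables:

```latex
% Primal energy (sketch of Eq. (2)), with simplex constraint on u:
\min_{u \in \Delta}\; \langle u, f \rangle + R(Wu),
\qquad
\Delta = \Big\{ u \ge 0 \;\Big|\; \textstyle\sum_i u_i(v) = 1 \;\; \forall v \Big\}
% Equivalent saddle-point form (sketch of Eq. (3)),
% using R(Wu) = \max_{\|\xi\|_\infty \le 1} \langle Wu, \xi \rangle:
\min_{u \in \Delta}\; \max_{\|\xi\|_\infty \le 1}\;
  \langle u, f \rangle + \langle Wu, \xi \rangle
```

The dual variables absorb the non-smooth regularizer, so both $u$ and $\xi$ can be updated with simple projected gradient steps in the unrolled iterations described next.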
The numerical minimization iterations are unrolled and each layer of our network (blue cylinders in Fig. 2) performs the following updates to minimize energy (3). The inputs and outputs of each layer are shown on the left.
For better readability, these steps show the single-resolution variant. For the multi-grid version, the update steps for the primal and dual variables change slightly (please see  for more details).
Figure 5: Semantic accuracy of the stereo expert system. A) SGBM: 0.71; B) BM: 0.71; C) PSMNet: 0.69; D) FCRN monocular: 0.44; A+B+C (d): 0.72; A+B+C (d, g): 0.725; A+B+C (d, g, n): 0.735; A+B+C+D (d, g): 0.73.
4 Experiments
Setup and Implementation. The entire framework was implemented in Python/TensorFlow and runs on a computer with an E5-2630 processor and an NVIDIA GTX 1080 Ti GPU under a recent Linux distribution. The network was trained with the ADAM optimizer . All training samples were random crops of the input data; every crop was additionally rotated randomly around the vertical axis and randomly flipped along the two horizontal axes. The network was trained for 1000 epochs, which was enough to converge on all datasets; one epoch iterates once over all scenes. The number of hierarchical levels was set to 3 and the number of unrolled optimization iterations to 50, as in . On average, training took about 3 hours for 1000 epochs. Inference on one scene takes 3 to 5 minutes on the GPU.
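The augmentation described above can be sketched as follows. The crop size and the restriction to 90-degree rotations are our own assumptions, since the exact values were lost in extraction:

```python
import numpy as np

def augment(volume, rng):
    """Random training augmentation: crop, rotate about the vertical
    axis, and flip along the two horizontal axes (a sketch; the crop
    size 32 and the 90-degree rotation steps are assumptions).

    volume: voxel grid with axes (x, y, z).
    """
    s = 32  # illustrative crop size
    x, y, z = (rng.integers(0, volume.shape[i] - s + 1) for i in range(3))
    crop = volume[x:x + s, y:y + s, z:z + s]
    # random rotation around the vertical axis (multiples of 90 degrees)
    crop = np.rot90(crop, k=rng.integers(0, 4), axes=(0, 1))
    # random flips along the two horizontal axes
    for ax in (0, 1):
        if rng.integers(0, 2):
            crop = np.flip(crop, axis=ax)
    return crop
```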
Datasets. The experiments were done on three datasets: SUNCG , ScanNet  and ETH3D . For every dataset and experiment we measure semantic and free-space accuracy. Semantic accuracy (SA) is the ratio of occupied voxels (i.e. non-free-space voxels) for which the semantic label was estimated correctly, divided by the total number of occupied voxels. Similarly, free-space accuracy (FA) is the ratio of voxels for which the free-space label was estimated correctly, divided by the number of free-space voxels. Splitting the accuracy into two parts accounts for the domination of free-space voxels in all scenes. The loss function is defined as categorical cross entropy, computed separately for semantic voxels and free-space voxels and then combined in a weighted sum to give the total loss; the semantic term is weighted higher to achieve better semantic reconstructions.
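The metrics and loss described above can be sketched as follows. The weighting constant `sem_weight` is a placeholder, since the paper's value was lost in extraction; it only states that the semantic term is weighted higher:

```python
import numpy as np

def accuracies(pred, gt, free_label=0):
    """Semantic accuracy (SA) over occupied voxels and free-space
    accuracy (FA) over free-space voxels, as defined above.

    pred, gt: integer label volumes; `free_label` marks free space.
    """
    occ = gt != free_label
    free = gt == free_label
    sa = np.mean(pred[occ] == gt[occ]) if occ.any() else 1.0
    fa = np.mean(pred[free] == gt[free]) if free.any() else 1.0
    return sa, fa

def total_loss(probs, gt, free_label=0, sem_weight=2.0):
    """Categorical cross entropy split into semantic and free-space
    parts and combined in a weighted sum (sem_weight is an assumption).

    probs: per-voxel label probabilities, shape (..., n_labels).
    """
    eps = 1e-12
    nll = -np.log(np.take_along_axis(probs, gt[..., None], -1)[..., 0] + eps)
    l_sem = nll[gt != free_label].mean()
    l_free = nll[gt == free_label].mean()
    return sem_weight * l_sem + l_free
```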
4.1 SUNCG
The synthetic origin of the SUNCG dataset, with its 38 semantic labels, enables full control over the data fusion. All components of the method are examined on this dataset in an ablation study. We simulate several different depth sensors, such as a perfect sensor, a Kinect and different stereo algorithms.
The baseline is recent work by Cherabier et al. , which uses simple averaging of the input TSDF volumes, trained with the network without the confidence estimation module. We gradually add the following input measurements: average 3×3 depth patches, mean and standard deviation of gradients, and mean and standard deviation of the normalized cross-correlation between stereo patches (for stereo algorithms).
Training and validation sets were created by randomly selecting 100 and 30 scenes respectively. Qualitative results for a selected scene and quantitative results on the whole dataset are shown in Fig. 4. Every input brings an increase in performance, measured by semantic and free-space accuracy. Quantitative results contain only semantic accuracy. The free-space accuracy was close to 0.95 with small deviations in all settings. The increase in accuracy is small, but the values approach the upper limit given by the perfect sensor and the reconstructions look better visually.
Figure 6: Input images with learned confidences for the Kinect and the noisy Kinect (top); standard TSDF fusion, ScanComplete, our learned fusion, and ground truth (bottom).
4.2 Stereo Expert System
The proposed method was applied to create an expert system for stereo algorithms. We used the following four methods for stereo depth estimation:
Pyramid Stereo Matching Network (PSMNet)  – 3D CNN architecture with spatial pyramid pooling module for depth map estimation from a stereo pair.
Depth Prediction with Fully Convolutional Residual Networks (FCRN)  – fully convolutional architecture with residual learning which is trained to estimate depth map from a single RGB image.
Semi-Global Block Matching (SGBM)  – classical method (by H. Hirschmuller), which matches blocks of a given size in a pair of images using mutual information.
Block Matching (BM)  – a version of block matching algorithm provided by K. Konolige.
At first, we trained a network without confidence values on each of the stereo algorithms separately. Then, a fused combination of these methods with learned confidence values was trained. Fig. 5 shows that the learned fusion performs better than any of the stereo methods on its own. More importantly, the learned fusion results are less noisy, more accurate and complete. The stereo system results can be compared to results of other sensor fusion models in Fig. 4.
Figure: ground truth, standard fusion (geometry and error), learned fusion (geometry and error).
Table 1 (ScanNet, excerpt): standard TSDF fusion baseline scores: 0.837, 1.606, 0.79, 0.96.
Table 2 (ETH3D, excerpt): standard TSDF fusion baseline scores: 0.50, 0.96.
4.3 ScanNet
The previous two experiments were done on a synthetic dataset. The next evaluation, on the ScanNet  dataset, shows that the method also performs well on real data. However, the dataset contains measurements from only one sensor, a Kinect. In order to create an additional sensor, we simulated an artificial noisy Kinect whose outliers are modeled by zero-mean Gaussian noise with a standard deviation of 2 meters, applied with a probability of 1%. We used 7 training scenes and 5 validation scenes from the hotel bedroom category, which have 9 semantic labels. Ground truth was obtained by running total variation fusion on all views, whereas only every 10th view was used for the actual fusion. The proposed fusion method was compared to two state-of-the-art baselines: Cherabier et al.  with simple averaging of TSDF volumes, and ScanComplete . ScanComplete also optimizes geometry, but is not designed for sensor fusion; hence, it performs worse when fed uniformly averaged multi-sensor data. ScanComplete is trained on SUNCG, and fine-tuning on ScanNet is difficult due to incompleteness, as already stated by its authors, and thus omitted.
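The artificial noisy Kinect described above can be simulated as follows. A minimal sketch of the stated noise model; clamping the perturbed depth at zero is our own assumption:

```python
import numpy as np

def simulate_noisy_kinect(depth, rng, outlier_prob=0.01, sigma=2.0):
    """Add artificial outliers to a depth map: with 1% probability a
    pixel is perturbed by zero-mean Gaussian noise with a 2 m standard
    deviation, matching the noise model described in the text.
    """
    noisy = depth.copy()
    mask = rng.random(depth.shape) < outlier_prob  # ~1% of the pixels
    noisy[mask] += rng.normal(0.0, sigma, size=mask.sum())
    return np.maximum(noisy, 0.0)  # depth cannot be negative (assumption)
```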
Tab. 1 shows various performance scores of the input geometry in comparison to the completed results. The proposed learned fusion improves semantic accuracy by 11%. Volumes with estimated confidence values are visualized in Fig. 6, together with two selected reconstructed validation scenes. The learned confidence values in the top row show that the network learns different weights for the artificially created noisy Kinect sensor and down-weights occasional noisy pixels. Voxels outside the walls are neither down-weighted nor penalized, because they belong to the unknown label, which is not included in the loss function. For the non-noisy Kinect, the confidence values decrease for voxels further away from the center. The Kinect sensor is known to produce less precise measurements with increasing depth , and this was learned by the network.
4.4 ETH3D
The last experiment, on the ETH3D dataset , confirms that the proposed method works not only on real data, but also with several real sensors. The ETH3D dataset comprises multi-view images from a high-resolution camera as well as from low-resolution camera rigs. The ground truth is given by a laser scan.
We again verified that the joint learned fusion performs better than  with simple TSDF averaging. The training set contained two scenes, delivery area and terrains, while the validation set consisted of a single scene, playground. Only these three scenes contain measurements from both sensors, which explains the small set sizes. The resolution was set to 8 cm, which yields scenes large enough for training. Since the number of parameters to learn is low, the results show that a few scenes suffice to train the model. The label set consists of only two labels, free space and occupied space, as no semantic ground truth is available.
Tab. 2 contains quantitative results on the ETH3D dataset, where the increase in semantic accuracy is 9%. Fig. 7 shows the reconstructions of all scenes with one close-up view per scene. The learned fusion provides more complete reconstructions and contains fewer isolated outlier semantic voxels in ground-truth free space. The error is measured as the distance to the ground truth, with gray regions representing correct reconstruction (error below 5 voxels).
5 Conclusion
We proposed a novel machine learning-based depth fusion method that unifies semantic depth integration, multi-sensor or multi-algorithm data fusion, as well as geometry denoising and completion. We substantially generalize the recent semantic 3D reconstruction method  to incorporate an arbitrary number of depth sensors. To balance the contribution of each sensor according to its noise statistics, we extract features from the sensor data and train the network to predict suitable confidence weights for each sensor and each point in space. Our approach is generic and can also learn reliability statistics of different stereo algorithms. This allows us to use the method as an expert system that weights and fuses the outputs of all algorithms, producing a result that is better than that of any individual algorithm.
Acknowledgements. Denys Rozumnyi was supported by Czech Science Foundation grant GA18-05360S, CTU student grant SGS17/185/OHK3/3T/13 and ETH SSRF. Further support was received by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/ Interior Business Center (DOI/IBC) contract number D17PC00280. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.
-  (2016) Large-scale semantic 3d reconstruction: an adaptive multi-resolution model for multi-class volumetric labeling. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Cited by: §1, §2.
-  (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: Figure 5, 4th item.
-  (2011-05) A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40 (1), pp. 120–145. External Links: Cited by: §3.2.
-  (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418. Cited by: Figure 5, 1st item.
-  (2016) Multi-label semantic 3d reconstruction using voxel blocks. In International Conference on 3D Vision (3DV), Cited by: §1, §2.
-  (2018-09) Learning priors for semantic 3d reconstruction. In European Conference on Computer Vision (ECCV), Cited by: Figure 1, §1, §2, §2, Figure 2, §3.2, §3.2, §3.2, Figure 6, Figure 7, §4.1, §4.3, §4.3, §4.4, Table 1, Table 2, §4, §5.
-  (2011) Improving the kinect by cross-modal stereo. In Proc. of the British Machine and Vision Conference (BMVC), pp. 1–10. External Links: Cited by: §2.
-  (2010) 3D shape scanning with a time-of-flight camera. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1173–1180. External Links: Cited by: §1.
-  (1996) A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1996, New Orleans, LA, USA, August 4-9, 1996, pp. 303–312. External Links: Cited by: §2, §3.1.
-  (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
-  (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, Cited by: §4.3, Table 1, §4.
-  (2017) BundleFusion: real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Trans. Graph. 36 (3), pp. 24:1–24:18. External Links: Cited by: §2.
-  (2018) 3DMV: joint 3d-multi-view prediction for 3d semantic scene segmentation. In Proc. European Conference on Computer Vision (ECCV), pp. 458–474. External Links: Cited by: §1, §2, §2.
-  (2018-06) ScanComplete: large-scale scene completion and semantic segmentation for 3d scans. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Figure 6, §4.3, Table 1.
-  (2018-09) PSDF fusion: probabilistic signed distance function for on-the-fly 3d data fusion and scene reconstruction. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2012) Probabilistic depth map fusion for real-time multi-view stereo. In Proceedings of the 21st International Conference on Pattern Recognition, ICPR 2012, Tsukuba, Japan, November 11-15, 2012, pp. 368–371. External Links: Cited by: §2, §2.
-  (2012) Probabilistic depth map fusion of kinect and stereo in real-time. In 2012 IEEE International Conference on Robotics and Biomimetics, ROBIO 2012, Guangzhou, China, December 11-14, 2012, pp. 2317–2322. External Links: Cited by: §2.
-  (2014) Floating scale surface reconstruction. ACM Trans. Graph. 33 (4), pp. 46:1–46:11. External Links: Cited by: §2.
-  (2010) Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. 32 (8), pp. 1362–1376. External Links: Cited by: §1.
-  (2013) Joint 3d scene reconstruction and class segmentation. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 97–104. External Links: Cited by: §1, §2, §2, §3.
-  (2017) Dense semantic 3d reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (9), pp. 1730–1743. External Links: Cited by: §1, §2, §2.
-  (2008-02) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 328–341. External Links: Cited by: Figure 5, 3rd item.
-  (2011) KinectFusion: real-time dynamic 3d surface reconstruction and interaction. In International Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2011, Vancouver, BC, Canada, August 7-11, 2011, Talks Proceedings, pp. 23. External Links: Cited by: §1, §2.
-  (2015) Very high frame rate volumetric integration of depth images on mobile devices. IEEE Trans. Vis. Comput. Graph. 21 (11), pp. 1241–1250. External Links: Cited by: §2.
-  (2013) Real-time 3d reconstruction in dynamic scenes using point-based fusion. In 2013 International Conference on 3D Vision, 3DV 2013, Seattle, Washington, USA, June 29 - July 1, 2013, pp. 1–8. External Links: Cited by: §2.
-  (2013) 3D scene understanding by Voxel-CRF. In Proc. International Conference on Computer Vision (ICCV), pp. 1425–1432. External Links: Cited by: §1, §2.
-  (2009) Multi-view image and tof sensor fusion for dense 3d reconstruction. In IEEE Workshop on 3-D Digital Imaging and Modeling (3DIM) at the International Conference on Computer Vision (ICCV), Cited by: §2.
-  (2014) Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR). arXiv:1412.6980. Cited by: §4.
-  (2009) Continuous global optimization in multiview 3d reconstruction. International Journal of Computer Vision 84 (1), pp. 80–96. Cited by: §2.
-  (2017) A TV prior for high-quality scalable multi-view stereo reconstruction. International Journal of Computer Vision 124 (1), pp. 2–17. Cited by: §2.
-  (2014) Joint semantic segmentation and 3d reconstruction from monocular video. In Proc. European Conference on Computer Vision (ECCV), pp. 703–718. Cited by: §1, §2.
-  (2012) Joint optimization for object class segmentation and dense stereo reconstruction. International Journal of Computer Vision 100 (2), pp. 122–133. Cited by: §2.
-  (2016) Deeper depth prediction with fully convolutional residual networks. In Fourth International Conference on 3D Vision (3DV), pp. 239–248. Cited by: Figure 5, 2nd item.
-  (2015) Anisotropic point-based fusion. In 18th International Conference on Information Fusion (FUSION), Washington, DC, USA, pp. 2121–2128. Cited by: §2.
-  (2018) Optimal structured light à la carte. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
-  (2013) Real-time 3d reconstruction at scale using voxel hashing. ACM Trans. Graph. 32 (6), pp. 169:1–169:11. Cited by: §2.
-  (2016) Learning from scratch a confidence measure. In Proc. of the British Machine Vision Conference (BMVC). Cited by: §2.
-  (2017) OctNetFusion: learning depth fusion from data. In International Conference on 3D Vision (3DV). Cited by: §2.
-  (2016) Pixelwise view selection for unstructured multi-view stereo. In Proc. European Conference on Computer Vision (ECCV). Cited by: §1.
-  (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.4, Table 2, §4.
-  (2017) Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.
-  (2013) Large-scale multi-resolution surface reconstruction from RGB-D sequences. In IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, pp. 3264–3271. Cited by: §2.
-  (2017) CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6565–6574. Cited by: §1, §1, §2.
-  (2018) Beyond local reasoning for stereo confidence estimation with deep learning. In Proc. European Conference on Computer Vision (ECCV), pp. 323–338. Cited by: §2.
-  (2017) Learning confidence measures in the wild. In Proc. of the British Machine Vision Conference (BMVC). Cited by: §2.
-  (2016) Patches, planes and probabilities: a non-local prior for volumetric 3d reconstruction. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3280–3289. Cited by: §2.
-  (2015) Towards probabilistic volumetric reconstruction using ray potentials. In 2015 International Conference on 3D Vision (3DV), Lyon, France, pp. 10–18. Cited by: §2.
-  (2017) Global, dense multiscale reconstruction for a billion points. International Journal of Computer Vision 125 (1-3), pp. 82–94. Cited by: §2.
-  (2012) High accuracy and visibility-consistent dense multiview stereo. IEEE Trans. Pattern Anal. Mach. Intell. 34 (5), pp. 889–901. Cited by: §1.
-  (2018) Just-in-time reconstruction: inpainting sparse maps using single view depth predictors as priors. In 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, pp. 1–9. Cited by: §2.
-  (2016) ElasticFusion: real-time dense SLAM and light source estimation. International Journal of Robotics Research 35 (14), pp. 1697–1716. Cited by: §2.
-  (2012) A generative model for online depth fusion. In Proc. European Conference on Computer Vision (ECCV), pp. 144–157. Cited by: §2.
-  (2007) A globally optimal algorithm for robust TV-L1 range image integration. In Proc. International Conference on Computer Vision (ICCV), pp. 1–8. Cited by: §2.
-  (2016) Structure-based auto-calibration of RGB-D sensors. In 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, pp. 5076–5083. Cited by: §1, §1, §4.3.
-  (2008) Fusion of time-of-flight depth and stereo for high accuracy depth maps. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
-  (2016) Monocular, real-time surface reconstruction using dynamic level of detail. In Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, pp. 37–46. Cited by: §2.