Learned Semantic Multi-Sensor Depth Map Fusion
Abstract
Volumetric depth map fusion based on truncated signed distance functions has become a standard method and is used in many 3D reconstruction pipelines. In this paper, we generalize this classic method in multiple ways: 1) Semantics: Semantic information enriches the scene representation and is incorporated into the fusion process. 2) Multi-sensor: Depth information can originate from different sensors or algorithms with very different noise and outlier statistics, which are considered during data fusion. 3) Scene denoising and completion: Sensors can fail to recover depth for certain materials and light conditions, or data is missing due to occlusions. Our method denoises the geometry, closes holes and computes a watertight surface for every semantic class. 4) Learning: We propose a neural network reconstruction method that unifies all these properties within a single powerful framework. Our method learns sensor or algorithm properties jointly with semantic depth fusion and scene completion and can also be used as an expert system, e.g. to unify the strengths of various photometric stereo algorithms. Our approach is the first to unify all these properties. Experimental evaluations on both synthetic and real datasets demonstrate clear improvements.
1 Introduction
Holistic 3D scene understanding is one of the central goals of computer vision research. Tremendous progress has been made within the last decades to recover accurate 3D scene geometry with a variety of sensors [8, 23, 35] and image-based 3D reconstruction methods [19, 49, 39]. With the breakthrough in machine learning, algorithms that recover 3D geometry increasingly include semantic information [26, 20, 21, 1, 5, 31, 10, 14, 13, 6, 43] in order to improve algorithm robustness and reconstruction accuracy, and to provide a richer scene representation. Many consumer products like smartphones, game consoles, augmented and virtual reality devices, as well as cars and household robots are equipped with an increasing number of cameras and depth sensors. Computer vision systems can benefit greatly from this trend by leveraging multiple data sources and providing richer and more accurate results. In this paper, we address the problem of multi-sensor depth map fusion for semantic 3D reconstruction.
Nowadays, depth can be estimated very robustly from multiple and even single RGB images [43]. Nevertheless, depending on the camera, scene lighting, as well as the object and material properties, the noise statistics of computed depth maps can vary greatly. Moreover, popular depth sensors like the Kinect have varying noise statistics [54] depending on the depth value and the pixel distance to the image center. They also have trouble recovering depth on object edges as well as on light-reflecting or light-absorbing surfaces, but perform well on low-textured surfaces and within short depth ranges. In contrast, image-based stereo methods usually perform well on object edges and across a wide depth range, but fail on low-textured surfaces and have comparably high noise and outlier rates.
Figure 1: Inputs; standard TSDF fusion [6]; learned fusion (ours).
While traditional methods have tried to model these effects, they usually impose strong assumptions about the noise distribution, or they require tedious calibration to estimate all parameters [54]. In contrast, we leverage the strength of machine learning techniques to extract sensor properties and scene parameters automatically from training data and use them in the form of confidence values for more accurate semantic depth map fusion. Fig. 1 shows example output of our method. In sum, we make the following contributions:


We propose the first method to unify semantic 3D reconstruction, scene completion and multi-sensor data fusion into a single machine-learning-based framework. Our approach uses only a few model parameters and thus needs only a small amount of training data to generalize well.

Our method analyses the sensor output and learns depth sensor-specific noise and outlier statistics, which are considered when estimating confidence values for the TSDF fusion. If the depth source is an algorithm, we feed in information about both the depth output and the input patches, such that our network can better learn when the algorithm typically fails.

Besides multi-sensor data fusion, our approach can also be used as an expert system for multi-algorithm depth fusion, in which the outputs of various stereo methods are fused to reach a better reconstruction accuracy.
2 Related Work
Volumetric Depth Fusion. In their pioneering work, Curless and Levoy [9] proposed a simple and effective method to fuse depth maps from multiple views by averaging truncated signed distance functions (TSDFs) within a regular voxel grid. With the broad availability of low-cost depth sensors like the MS Kinect, this method became very popular with influential works like KinectFusion [23] and its numerous extensions, like voxel hashing [36] or voxel octrees [42]. This depth fusion method has become standard for SLAM frameworks like InfiniTAM [24] and was further generalized to account for drift and calibration errors, e.g. ElasticFusion [51], BundleFusion [12], but also for 3D reconstruction frameworks [53, 29, 20, 21, 13, 6].
All these methods have in common that TSDF fusion is performed via simple uniformly weighted averaging. Hence these methods do not account for the fact that depth measurements may exhibit different noise and outlier rates. This has been tackled by probabilistic fusion methods.
Probabilistic Depth Fusion. Probabilistic approaches explicitly model sensor noise, typically with a Gaussian distribution. A very simple approach with only 2.5D output and a Gaussian noise assumption can be found in [16]. A point-based fusion approach is proposed in [25]: instead of a voxel grid, the fusion updates are performed directly on a point cloud. This has been extended to anisotropic point-based fusion in [34] to account for different noise levels when a surface is observed from different viewing angles. For a fixed topology, the mesh-based fusion approach of [56] fuses depth information over various mesh resolutions. A more complex probabilistic fusion method is proposed in [52], which includes long-range visibility constraints in an online fusion method. A similar model with long-range ray-based visibility constraints was used in [47, 46], although these methods are not real-time capable. Recently, PSDF Fusion [15] demonstrated a combination of probabilistic modeling and a TSDF scene representation. However, they also assume a Gaussian error distribution of the input depth values. Overall, probabilistic approaches handle noise and outliers better than traditional TSDF fusion methods. Nevertheless, the majority of these methods impose strong assumptions about the sensor error distributions to define the prior model. The first method that implicitly learns an unknown error distribution during fusion is OctNetFusion by Riegler et al. [38]. They jointly learn the splitting of the octree scene representation, but multiple sensors and semantic information are not considered.
Multi-Sensor Data Fusion. Early approaches like Zhu et al. [55] fuse time-of-flight depth and stereo, but only into a 2.5D depth map. Kim et al. [27] fuse the same sensor combination in 3D via a probabilistic framework on a voxel grid. The work of [7] strives for low-level data fusion to improve the Kinect output with stereo correspondences. As an extension of [16], Duan et al. [17] use a probabilistic approach for the fusion of Kinect and stereo in real-time. None of the current multi-sensor depth fusion networks is able to incorporate semantic information, and their generalization is usually non-trivial.
3D Reconstruction with Confidences. A wide range of 3D reconstruction approaches estimate confidence values for depth hypotheses which are later used for adaptive fusion. These approaches typically either use hand-crafted confidence weights [18, 48, 30] rather than learning them intrinsically from data, or they learn only 2D score maps without learning their 3D fusion [37, 45, 44, 50].
Semantic 3D Reconstruction and Scene Completion. Joint estimation of semantic labels and 3D geometry has been proposed with traditional energy-based methods to estimate depth maps [32] or dense volumetric 3D [26, 20, 21, 1, 5, 31]. Machine learning-based approaches have pushed the state of the art in reconstructing and completing 3D scenes [10, 14, 13, 6]. These methods are not real-time capable, but real-time fusion of CNN-based single-image depth and semantics has recently been presented in CNN-SLAM [43].
So far, none of the semantic 3D reconstruction approaches is able to properly handle multiple sensors with different noise characteristics, and their extension is not straightforward. Our goal is a general framework which unifies all the previously discussed properties within a learning-based method.
3 Method
For semantic 3D reconstruction, our method requires as input a set of RGB-D images and their corresponding 2D semantic segmentations, as shown in Fig. 1. The semantic segmentations can be fused into the TSDF representation of the scene using [20]. In the following, we describe how we robustly produce an accurate TSDF by fusing measurements from multiple depth sensors.
Key idea. We consider multiple depth sensors which produce a set of depth maps by scanning a scene. The most common approach to data fusion consists in fusing all the depth maps, regardless of the sensor that produced them, into a TSDF representation of the scene. However, this does not reflect the specific noise and outlier statistics of each measurement. We propose to overcome this issue by learning a confidence estimator for every sensor that weights the measurements before fusing them. For each sensor, we can produce a TSDF representation of the scene by fusing the corresponding depth maps. Our method learns to estimate confidence values for every voxel in the TSDF, such that the accuracy of the semantic 3D reconstruction is maximized.
We propose an end-to-end trainable neural network architecture which can be roughly separated into two parts: a sensor confidence network which predicts a confidence value for each sensor measurement, and a semantic 3D reconstruction network which takes all aggregated noisy measurements and corresponding confidences and performs semantic 3D reconstruction.
The overall network structure is depicted in Fig. 2 and the individual network parts are detailed in the following subsections.
3.1 Sensor Confidence Network
Weighted TSDF Fusion. A sensor $s$ produces a set of depth maps that can be fused into a TSDF $u_s$, following [9]. We learn to estimate corresponding confidence maps $c_s$, where for every voxel $v$, $c_s(v)$ is the confidence for the measurement $u_s(v)$. The fusion of all the sensor measurements is then computed via a point-wise weighted average:

$$u(v) = \frac{\sum_s c_s(v)\, u_s(v)}{\sum_s c_s(v)} \qquad (1)$$
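As a concrete illustration, the weighted average of Eq. (1) amounts to a few lines of array code. The sketch below (NumPy, with illustrative toy volumes; not the paper's implementation) fuses two per-sensor TSDFs with per-voxel confidences:

```python
import numpy as np

def fuse_tsdf(tsdfs, confidences, eps=1e-8):
    """Point-wise confidence-weighted average of per-sensor TSDF volumes,
    as in Eq. (1): u(v) = sum_s c_s(v) u_s(v) / sum_s c_s(v)."""
    num = np.zeros_like(tsdfs[0], dtype=np.float64)
    den = np.full_like(num, eps)  # guards unobserved voxels with zero confidence
    for u_s, c_s in zip(tsdfs, confidences):
        num += c_s * u_s
        den += c_s
    return num / den

# Toy example: two sensors observing the same 2x2x2 volume.
u1 = np.full((2, 2, 2), 0.4)   # biased sensor
u2 = np.full((2, 2, 2), 0.1)   # more accurate sensor
c1 = np.full((2, 2, 2), 0.25)  # low confidence
c2 = np.full((2, 2, 2), 0.75)  # high confidence
fused = fuse_tsdf([u1, u2], [c1, c2])
```

With these confidences, the fused value leans towards the accurate sensor (0.25·0.4 + 0.75·0.1 = 0.175) instead of the unweighted mean 0.25.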
Goal. The purpose of the confidence weight learning for multi-sensor TSDF fusion is twofold: 1) Intra-sensor weighting: the network captures the noise and outlier statistics among measurements, thus producing a spatially varying confidence map; e.g., points that are mostly observed from a far distance can get a lower confidence than those mainly observed from a closer distance. 2) Inter-sensor weighting: the network analyses the noise and outlier statistics among different sensors in order to weight them against each other. In this regard, the network also accounts for normalization, which is important if different amounts of data are available from different sensors. This avoids, for instance, a bias towards a sensor with a higher frame rate.
Feature extraction. We aggregate features from the input data which we believe will help the network to estimate a reliable confidence value. Ideally, we could feed all input data into our confidence network and the network could identify important features for the confidence estimation on its own, but the amount of input data for the scenes we consider in this paper renders this infeasible. Therefore, our selected feature set is certainly not exhaustive and there might be other useful features or better feature combinations. However, we found that all of them improve the reconstruction results. For each sensor and each voxel, we extract the following features:


Average 3×3 patch in the depth image (9 values): analyzing neighboring depth values helps to identify outliers in the depth map (Fig. 3).

Mean and standard deviation of the image gradient norm on 3×3 patches (2 values): especially for stereo methods, the average gradient norm of a patch indicates how much gradient information the patch contains. Homogeneously colored patches should lead to low confidence values.

Mean and standard deviation of the normalized cross-correlation (NCC) of 5×5 stereo patches (2 values, only for stereo algorithms): NCC is an established measure for estimating patch similarity in stereo methods. If the patches do not match well, or there is a high variance of NCC values among patches voting for the same point, the confidence value should be reduced.
This set of features is then processed for each voxel individually by a small neural network which estimates a confidence weight for a single voxel (magenta in Fig. 2).
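To make the feature set concrete, the following sketch computes the per-pixel quantities described above for a single pixel. The function names, the use of `np.gradient` for image gradients, and the demo arrays are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pixel_features(depth, gray, y, x):
    """Feature vector for pixel (y, x): the 3x3 depth patch (9 values) plus
    mean and std of the image-gradient norm over the same 3x3 patch (2 values)."""
    depth_patch = depth[y - 1:y + 2, x - 1:x + 2]
    gy, gx = np.gradient(gray.astype(np.float64))      # image gradients
    grad_norm = np.sqrt(gx ** 2 + gy ** 2)[y - 1:y + 2, x - 1:x + 2]
    return np.concatenate([depth_patch.ravel(),
                           [grad_norm.mean(), grad_norm.std()]])

def ncc(p, q):
    """Normalized cross-correlation of two equally sized patches."""
    p = p - p.mean()
    q = q - q.mean()
    return float((p * q).sum() / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

# Demo on a synthetic 5x5 depth/intensity image.
depth = np.arange(25, dtype=np.float64).reshape(5, 5)
gray = depth.copy()
feat = pixel_features(depth, gray, 2, 2)   # 11-dimensional feature vector
sim = ncc(depth[:3, :3], depth[:3, :3])    # a patch matches itself perfectly
```

For the stereo-algorithm case, `ncc` would be evaluated between the left-image patch and the matched right-image patch, and its mean/std aggregated over the patches voting for the same 3D point.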
Confidence Network Architecture. The small confidence estimation networks have identical structure for each sensor and identical weights for each voxel of a sensor. They consist of 5 fully connected layers with ReLU activations and a decreasing number of neurons. The last layer is initialized with biases equal to one such that the initial confidence values are equal for each sensor. The remaining weights are initialized randomly. The outputs of the confidence networks are then aggregated into a single TSDF volume which serves as input for the semantic 3D reconstruction network.
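A minimal sketch of such a per-voxel confidence network follows (NumPy). The hidden-layer widths are assumptions, since the exact neuron counts are not stated here; only the five-layer ReLU structure and the last-layer bias initialization to one are taken from the text:

```python
import numpy as np

def init_confidence_net(feat_dim, widths=(32, 16, 8, 4, 1), seed=0):
    """5 fully connected layers with a decreasing number of neurons.
    The last layer's bias starts at 1 so every sensor begins with equal
    confidence (widths are illustrative, not the paper's exact sizes)."""
    rng = np.random.default_rng(seed)
    params, d = [], feat_dim
    for i, w in enumerate(widths):
        W = rng.normal(0.0, 0.1, size=(d, w))
        b = np.ones(w) if i == len(widths) - 1 else np.zeros(w)
        params.append((W, b))
        d = w
    return params

def confidence(params, features):
    """Shared-weight forward pass; ReLU everywhere keeps the output
    confidence non-negative."""
    h = features
    for W, b in params:
        h = np.maximum(h @ W + b, 0.0)
    return h

# With zero input features, the initialization yields confidence 1.
params = init_confidence_net(feat_dim=13)
c = confidence(params, np.zeros(13))
```

Because the weights are shared across all voxels of a sensor, the same tiny network is evaluated once per voxel, which keeps the parameter count low.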
3.2 Semantic 3D Reconstruction Network
Our approach learns in an end-to-end fashion how to jointly perform data fusion and semantic 3D reconstruction. The data fusion should facilitate the semantic 3D reconstruction by providing additional and more complete information about the scene. To perform the reconstruction, we use the architecture introduced in [6], which leverages the benefits of neural networks and variational methods. The fundamental principle of the method is to compute a consistent voxel labeling from noisy and incomplete depth such that semantic voxel transitions are statistically similar to the transitions previously seen in the training data. For instance, a bed should stand on the ground, with vertical transitions to the ground below and the free space above, while a wall should have horizontal transitions to free space.
The motivations are the following:


The architecture, which relies on the principles of total variation segmentation and inpainting, contains very few parameters to learn due to weight sharing. With few parameters, the network does not need much training data, which is beneficial since only a few small real datasets are available for training.

The compact architecture makes it easy to extend the network to estimate further parameters for the data fusion and still allows processing larger scenes with more than 15M voxels.

The energy formulation allows us to incorporate an arbitrary number of sensors into the 3D reconstruction method, which is more difficult with standard feedforward architectures.
Figure 4: Semantic accuracy on SUNCG. Stereo (S) full: 0.71; Kinect (K) full: 0.77; S half + K half: 0.76; S half + K half (d): 0.77; S half + K half (d, g): 0.78; S full + K full (d, g): 0.786; perfect sensor full: 0.794; ground truth.
Variational method. We briefly describe the working principles of the reconstruction network; more details can be found in [6]. At its core, the network minimizes an energy such that the solution corresponds to a scene with label transition statistics that match the training data. With $\Omega$ denoting the voxel grid, we write the energy as:
$$\min_{u \in \mathcal{C}} \; \sum_{v \in \Omega} \mathbf{1}^{\top} (u \odot f)(v) \; + \; \sum_{v \in \Omega} \big\| (Wu)(v) \big\| , \qquad \mathcal{C} = \big\{\, u : u_l(v) \ge 0,\ \textstyle\sum_l u_l(v) = 1 \ \ \forall v \in \Omega \,\big\} \qquad (2)$$
In Eq. (2), $u$ is the voxel labeling we optimize for, defined such that $u_l(v)$ is the probability that label $l$ is given to voxel $v$, and $f$ is the data term obtained from the fused TSDF. The operator $\odot$ denotes elementwise multiplication (Hadamard product). The operator $W$ is a regularizer that enforces the labeling to respect certain conditions on the semantic transitions (e.g., the bed stands on the ground). During training, $W$ is learned to capture typical scene statistics. This can be implemented as a convolution which locally compares voxels to their neighborhood, thus verifying the semantic transitions.
The energy (2) is numerically minimized with a first-order algorithm [3]. To this end, dual variables $\xi$ are introduced to account for the non-differentiability and the constraints in Eq. (2), leading to the following equivalent discretized saddle-point energy
$$\min_{u \in \mathcal{C}} \; \max_{\xi : \|\xi(v)\| \le 1} \; \sum_{v \in \Omega} \mathbf{1}^{\top} (u \odot f)(v) \; + \; \langle Wu, \xi \rangle \qquad (3)$$
The numerical minimization iterations are unrolled and each layer of our network (blue cylinders in Fig. 2) performs the following updates to minimize energy (3). The inputs and outputs of each layer are shown on the left.
$$\begin{aligned} \xi^{t+1} &= \Pi_{\|\cdot\| \le 1}\big( \xi^{t} + \sigma\, W \bar{u}^{t} \big) \\ u^{t+1} &= \Pi_{\mathcal{C}}\big( u^{t} - \tau\, ( f + W^{\top} \xi^{t+1} ) \big) \\ \bar{u}^{t+1} &= 2\, u^{t+1} - u^{t} \end{aligned} \qquad (4)$$

where $\Pi$ denotes the respective projections and $\sigma, \tau$ are step sizes.
For better readability, these steps show the single-resolution variant. For the multigrid version, the update steps for $u$ and $\xi$ change slightly (please see [6] for more details).
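For intuition, the unrolled single-resolution updates can be sketched as a toy implementation (NumPy). The clip-based dual projection assumes an anisotropic norm, and the fixed difference operator stands in for the learned regularizer $W$; this is a simplified stand-in, not the paper's network:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of vector v onto {x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def unrolled_primal_dual(f, W, n_iter=100, tau=0.25, sigma=0.25):
    """Unrolled first-order primal-dual iterations for
        min_u <u, f> + |W u|_1  s.t. every u(v) lies in the simplex.
    f: (L, V) per-voxel label costs; W: (E, V) linear difference operator
    applied to each label channel."""
    L, V = f.shape
    u = np.full((L, V), 1.0 / L)          # uniform initial labeling
    u_bar = u.copy()
    xi = np.zeros((L, W.shape[0]))        # dual variables
    for _ in range(n_iter):
        xi = np.clip(xi + sigma * u_bar @ W.T, -1.0, 1.0)   # dual step + projection
        u_new = u - tau * (f + xi @ W)                       # primal gradient step
        u_new = np.apply_along_axis(project_simplex, 0, u_new)
        u_bar = 2.0 * u_new - u                              # over-relaxation
        u = u_new
    return u

# Demo: 3 voxels, 2 labels, no regularization (W = 0), so the labeling
# should converge to the per-voxel minimum-cost label.
W = np.zeros((1, 3))
f = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
u = unrolled_primal_dual(f, W)
```

In the actual network, these update steps are unrolled into layers and the operator playing the role of `W` is learned end-to-end.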
Figure 5: Semantic accuracy for the stereo expert system. A) SGBM [22]: 0.71; B) BM [2]: 0.71; C) PSMNet [4]: 0.69; D) FCRN monocular [33]: 0.44; A+B+C (d): 0.72; A+B+C (d, g): 0.725; A+B+C (d, g, n): 0.735; A+B+C+D (d, g): 0.73.
4 Experiments
Setup and Implementation. The entire framework has been implemented in Python/TensorFlow and runs on a computer with an Intel Xeon E5-2630 processor and an NVIDIA GTX 1080 Ti GPU running a recent Linux distribution. The network was trained with the ADAM optimizer [28] with a fixed learning rate and batch size. All training samples were random crops of the input data; every crop was additionally randomly rotated and flipped for data augmentation. The network was trained for 1000 epochs, which was enough to converge for all datasets; one epoch iterates once over all scenes. The number of hierarchical levels was set to 3 and the number of unrolled optimization iterations to 50, as in [6]. On average, training required about 3 hours for 1000 epochs. Inference for one scene takes 3 to 5 minutes on the GPU.
Datasets. The experiments were done on three datasets: SUNCG [41], ScanNet [11] and ETH3D [40]. For every dataset and experiment we measure semantic and free-space accuracy. Semantic accuracy (SA) is defined as the number of occupied voxels (i.e., non-free-space) for which the particular semantic label was estimated correctly, divided by the total number of occupied voxels. Similarly, free-space accuracy (FA) is the number of voxels for which the unique free-space label was estimated correctly, divided by the number of free-space voxels. Splitting the accuracy into two parts accounts for the domination of free-space voxels in all scenes. The loss function is defined as categorical cross-entropy, computed separately for the semantic voxels and the free-space voxels; the two terms are then added together to compute the total loss, with a weighting set to achieve better semantic reconstructions.
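The two accuracy measures and the split cross-entropy loss can be sketched as follows (NumPy). The equal weighting `lam=0.5` is an assumption, since the exact weight between the two loss terms is not given here:

```python
import numpy as np

def accuracies(pred, gt, free_label=0):
    """Semantic accuracy (SA) over occupied voxels and free-space accuracy
    (FA) over free-space voxels, evaluated separately so the abundant free
    space cannot dominate the score."""
    occ = gt != free_label
    sa = float((pred[occ] == gt[occ]).mean())
    fa = float((pred[~occ] == gt[~occ]).mean())
    return sa, fa

def total_loss(prob, gt, free_label=0, lam=0.5, eps=1e-12):
    """Categorical cross-entropy split into semantic and free-space parts:
    L = lam * L_sem + (1 - lam) * L_free. prob: (L, V) label probabilities,
    gt: (V,) ground-truth labels. The 0.5 weighting is an assumption."""
    ce = -np.log(prob[gt, np.arange(gt.size)] + eps)
    occ = gt != free_label
    return lam * ce[occ].mean() + (1.0 - lam) * ce[~occ].mean()

# Demo: a perfect prediction gives SA = FA = 1 and (near-)zero loss.
gt = np.array([0, 1, 1, 2])           # label 0 = free space
pred = gt.copy()
sa, fa = accuracies(pred, gt)
prob = np.zeros((3, 4))
prob[gt, np.arange(4)] = 1.0          # one-hot, correct everywhere
loss = total_loss(prob, gt)
```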
4.1 SUNCG
The synthetic origin of the SUNCG dataset, with its 38 semantic labels, enables full control over the data fusion. All components of the method are examined on this dataset with an ablation study. We simulate several different depth sensors, such as a perfect sensor, a Kinect, and different stereo algorithms.
The baseline is the recent work by Cherabier et al. [6], which uses simple averaging of the input TSDF volumes and is trained with the network without the confidence estimation module. We gradually add the following input measurements: average 3×3 depth patches, mean and standard deviation of gradients, and, for stereo algorithms, mean and standard deviation of the normalized cross-correlation between stereo patches.
Training and validation sets were created by randomly selecting 100 and 30 scenes, respectively. Qualitative results for a selected scene and quantitative results on the whole dataset are shown in Fig. 4. Every added input brings an increase in performance, measured by semantic and free-space accuracy. The quantitative results contain only semantic accuracy; the free-space accuracy was close to 0.95 with small deviations in all settings. The increase in accuracy is small, but the values approach the upper limit given by the perfect sensor and the reconstructions look better visually.
Figure 6: Input images; Kinect confidences; noisy Kinect confidences; standard TSDF in [6]; ScanComplete [14]; learned fusion (ours); ground truth.
4.2 Stereo Expert System
The proposed method was applied to create an expert system for stereo algorithms. We used the following four methods for stereo depth estimation:


Pyramid Stereo Matching Network (PSMNet) [4] – a 3D CNN architecture with a spatial pyramid pooling module for depth map estimation from a stereo pair.

Depth Prediction with Fully Convolutional Residual Networks (FCRN) [33] – a fully convolutional architecture with residual learning, trained to estimate a depth map from a single RGB image.

Semi-Global Block Matching (SGBM) [22] – a classical method by H. Hirschmüller which matches blocks of a given size in a pair of images using mutual information.

Block Matching (BM) [2] – a version of the block matching algorithm provided by K. Konolige.
First, we trained a network without confidence values on each of the stereo algorithms separately. Then, a fused combination of these methods with learned confidence values was trained. Fig. 5 shows that the learned fusion performs better than any of the stereo methods on its own. More importantly, the learned fusion results are less noisy, more accurate and more complete. The stereo expert system results can be compared to the results of the other sensor fusion models in Fig. 4.
Figure 7: Ground truth; standard [6] geometry and error; learned geometry and error.
Table 1:
Method | TP rate | Distance | SA | FA
Input | 0.507 | 3.376 | 0.55 | 0.79
ScanComplete [14] | 0.588 | 2.527 | 0.47 | 0.90
Standard TSDF in [6] | 0.837 | 1.606 | 0.79 | 0.96
Proposed | 0.953 | 1.410 | 0.90 | 0.97
Table 2:
Dataset | Fusion Method | SA | FA
ETH3D | Standard TSDF in [6] | 0.50 | 0.96
ETH3D | Learned (proposed) | 0.59 | 0.97
4.3 ScanNet
The previous two experiments were done on a synthetic dataset. The next evaluation, on the ScanNet [11] dataset, shows that the method also performs well on real data. However, the dataset contains only measurements from one sensor, the Kinect. In order to create an additional sensor, we simulated an artificial noisy Kinect with outliers modeled by Gaussian noise with zero mean and a standard deviation of 2 meters, applied with a probability of 1%. We used 7 training scenes and 5 validation scenes from the hotel bedroom category, which have 9 semantic labels. Ground truth was obtained by running total variation on all views, whereas only every 10th view was used for further fusion. The proposed fusion method was compared to two state-of-the-art baselines: Cherabier et al. [6] with simple averaging of TSDF volumes, and ScanComplete [14]. ScanComplete also optimizes geometry, but is not designed for sensor fusion; hence, this method performs worse when we input uniformly averaged multi-sensor data. ScanComplete is trained on SUNCG, and fine-tuning on ScanNet is difficult due to incompleteness, as already stated by its authors, and was thus omitted.
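The artificial noisy Kinect described above can be reproduced in a few lines; `simulate_noisy_kinect` is a hypothetical helper matching the stated noise model (1% outliers, zero-mean Gaussian with 2 m standard deviation):

```python
import numpy as np

def simulate_noisy_kinect(depth, p_outlier=0.01, sigma=2.0, seed=0):
    """Corrupt a clean depth map: each pixel independently receives additive
    zero-mean Gaussian noise of std sigma (meters) with probability p_outlier;
    all other pixels are left untouched."""
    rng = np.random.default_rng(seed)
    mask = rng.random(depth.shape) < p_outlier          # outlier locations
    noise = rng.normal(0.0, sigma, size=depth.shape)
    return np.where(mask, depth + noise, depth), mask

# Demo: a flat 3 m wall seen by the simulated noisy sensor.
clean = np.full((120, 160), 3.0)
noisy, mask = simulate_noisy_kinect(clean)
```

Feeding the clean and the corrupted stream as two separate "sensors" lets the confidence network learn to downweight exactly these outlier pixels.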
Tab. 1 shows various performance scores of the input geometry in comparison to the completed results. The proposed learned fusion improves the semantic accuracy of [6] by 11%. Volumes with estimated confidence values are visualized in Fig. 6, together with two selected reconstructed validation scenes. The learned confidence values in the top row show that the network learns different weights for the artificially created noisy Kinect sensor and downweights occasional noisy pixels. Voxels outside the walls are not downweighted and not penalized, because they belong to the unknown label, which is not included in the loss function. For the non-noisy Kinect, the confidence values decrease for voxels further away from the center. The Kinect sensor is known to deliver less precise measurements with increasing depth [54], and this was learned by the network.
4.4 ETH3D
The last experiment, on the ETH3D dataset, confirms that the proposed method works not only on real data, but also with several real sensors. The ETH3D dataset [40] comprises multi-view images from a high-resolution camera as well as from low-resolution camera rigs. The ground truth is given by a laser scan.
We again verified that the jointly learned fusion performs better than [6] with simple TSDF averaging. The training set contained two scenes, delivery area and terrains, whereas the validation set consisted of a single scene, playground. Only these three scenes contain measurements from both sensors, which explains the small set size. The resolution was set to 8 cm, which yields scenes large enough for training. The number of parameters to learn is low, and the results show that a few scenes are enough to train the model. The label set consists of only two labels, free space and occupied space, as no semantic ground truth is available.
Tab. 2 contains quantitative results on the ETH3D dataset, where the increase in semantic accuracy is 9%. Fig. 7 shows visualized reconstructions of all scenes with one close-up view for each scene. The learned fusion provides more complete reconstructions and contains fewer isolated outlier semantic voxels in ground-truth free space. The error is measured as the distance to the ground truth, with gray regions representing correct reconstruction with an error of less than 5 voxels.
5 Conclusion
We proposed a novel machine-learning-based depth fusion method that unifies semantic depth integration, multi-sensor or multi-algorithm data fusion, as well as geometry denoising and completion. We substantially generalize the recent semantic 3D reconstruction method [6] to incorporate an arbitrary number of depth sensors. To balance the contribution of each sensor according to its noise statistics, we extract features from the sensor data and train the network to predict suitable confidence weights for each sensor and each point in space. Our approach is generic and can also learn reliability statistics of different stereo algorithms. This allows us to use the method as an expert system that weights and fuses the outputs of all algorithms, providing a result that is better than that of any individual algorithm.
Acknowledgements. Denys Rozumnyi was supported by Czech Science Foundation grant GA18-05360S, CTU student grant SGS17/185/OHK3/3T/13 and ETH SSRF. Further support was received from the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior / Interior Business Center (DOI/IBC) contract number D17PC00280. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.
References
 [1] (2016) Largescale semantic 3d reconstruction: an adaptive multiresolution model for multiclass volumetric labeling. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §1, §2.
 [2] (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: Figure 5, 4th item.
 [3] (201105) A firstorder primaldual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40 (1), pp. 120–145. External Links: ISSN 09249907 Cited by: §3.2.
 [4] (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418. Cited by: Figure 5, 1st item.
 [5] (2016) Multilabel semantic 3d reconstruction using voxel blocks. In International Conference on 3D Vision (3DV), Cited by: §1, §2.
 [6] (201809) Learning priors for semantic 3d reconstruction. In European Conference on Computer Vision (ECCV), Cited by: Figure 1, §1, §2, §2, Figure 2, §3.2, §3.2, §3.2, Figure 6, Figure 7, §4.1, §4.3, §4.3, §4.4, Table 1, Table 2, §4, §5.
 [7] (2011) Improving the kinect by crossmodal stereo. In Proc. of the British Machine and Vision Conference (BMVC), pp. 1–10. External Links: Link, Document Cited by: §2.
 [8] (2010) 3D shape scanning with a timeofflight camera. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1173–1180. External Links: Link, Document Cited by: §1.
 [9] (1996) A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1996, New Orleans, LA, USA, August 49, 1996, pp. 303–312. External Links: Link, Document Cited by: §2, §3.1.
 [10] (2017) ScanNet: richlyannotated 3d reconstructions of indoor scenes. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
 [11] (2017) ScanNet: richlyannotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, Cited by: §4.3, Table 1, §4.
 [12] (2017) BundleFusion: realtime globally consistent 3d reconstruction using onthefly surface reintegration. ACM Trans. Graph. 36 (3), pp. 24:1–24:18. External Links: Link, Document Cited by: §2.
 [13] (2018) 3DMV: joint 3dmultiview prediction for 3d semantic scene segmentation. In Proc. European Conference on Computer Vision (ECCV), pp. 458–474. External Links: Link, Document Cited by: §1, §2, §2.
 [14] (201806) ScanComplete: largescale scene completion and semantic segmentation for 3d scans. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Figure 6, §4.3, Table 1.
 [15] (201809) PSDF fusion: probabilistic signed distance function for onthefly 3d data fusion and scene reconstruction. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2.
 [16] (2012) Probabilistic depth map fusion for realtime multiview stereo. In Proceedings of the 21st International Conference on Pattern Recognition, ICPR 2012, Tsukuba, Japan, November 1115, 2012, pp. 368–371. External Links: Link Cited by: §2, §2.
 [17] (2012) Probabilistic depth map fusion of kinect and stereo in realtime. In 2012 IEEE International Conference on Robotics and Biomimetics, ROBIO 2012, Guangzhou, China, December 1114, 2012, pp. 2317–2322. External Links: Link, Document Cited by: §2.
 [18] (2014) Floating scale surface reconstruction. ACM Trans. Graph. 33 (4), pp. 46:1–46:11. External Links: Link, Document Cited by: §2.
 [19] (2010) Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. 32 (8), pp. 1362–1376. External Links: Link, Document Cited by: §1.
 [20] (2013) Joint 3d scene reconstruction and class segmentation. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 97–104. External Links: Document Cited by: §1, §2, §2, §3.
 [21] (2017) Dense semantic 3d reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (9), pp. 1730–1743. External Links: Link, Document Cited by: §1, §2, §2.
 [22] (200802) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 328–341. External Links: Document, ISSN 01628828 Cited by: Figure 5, 3rd item.
 [23] (2011) KinectFusion: realtime dynamic 3d surface reconstruction and interaction. In International Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2011, Vancouver, BC, Canada, August 711, 2011, Talks Proceedings, pp. 23. External Links: Link, Document Cited by: §1, §2.
 [24] (2015) Very high frame rate volumetric integration of depth images on mobile devices. IEEE Trans. Vis. Comput. Graph. 21 (11), pp. 1241–1250. External Links: Link, Document Cited by: §2.
 [25] (2013) Real-time 3d reconstruction in dynamic scenes using point-based fusion. In 2013 International Conference on 3D Vision, 3DV 2013, Seattle, Washington, USA, June 29 – July 1, 2013, pp. 1–8. External Links: Link, Document Cited by: §2.
 [26] (2013) 3D scene understanding by VoxelCRF. In Proc. International Conference on Computer Vision (ICCV), pp. 1425–1432. External Links: Link Cited by: §1, §2.
 [27] (2009) Multi-view image and ToF sensor fusion for dense 3d reconstruction. In IEEE Workshop on 3D Digital Imaging and Modeling (3DIM) at the International Conference on Computer Vision (ICCV), Cited by: §2.
 [28] (2014) Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR) abs/1412.6980. External Links: Link, 1412.6980 Cited by: §4.
 [29] (2009) Continuous global optimization in multi-view 3d reconstruction. International Journal of Computer Vision 84 (1), pp. 80–96. Cited by: §2.
 [30] (2017) A TV prior for high-quality scalable multi-view stereo reconstruction. International Journal of Computer Vision 124 (1), pp. 2–17. External Links: Link, Document Cited by: §2.
 [31] (2014) Joint semantic segmentation and 3d reconstruction from monocular video. In Proc. European Conference on Computer Vision (ECCV), pp. 703–718. External Links: Link Cited by: §1, §2.
 [32] (2012-11) Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction. International Journal of Computer Vision 100 (2), pp. 122–133 (en). External Links: ISSN 0920-5691, 1573-1405, Link, Document Cited by: §2.
 [33] (2016) Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pp. 239–248. Cited by: Figure 5, 2nd item.
 [34] (2015) Anisotropic point-based fusion. In 18th International Conference on Information Fusion, FUSION 2015, Washington, DC, USA, July 6-9, 2015, pp. 2121–2128. External Links: Link Cited by: §2.
 [35] (2018-06) Optimal structured light à la carte. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
 [36] (2013) Real-time 3d reconstruction at scale using voxel hashing. ACM Trans. Graph. 32 (6), pp. 169:1–169:11. External Links: Link, Document Cited by: §2.
 [37] (2016) Learning from scratch a confidence measure. In Proc. of the British Machine and Vision Conference (BMVC), External Links: Link Cited by: §2.
 [38] (2017) OctNetFusion: learning depth fusion from data. In International Conference on 3D Vision (3DV), Cited by: §2.
 [39] (2016) Pixelwise view selection for unstructured multi-view stereo. In Proc. European Conference on Computer Vision (ECCV), Cited by: §1.
 [40] (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.4, Table 2, §4.
 [41] (2017) Semantic scene completion from a single depth image. IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §4.
 [42] (2013) Large-scale multi-resolution surface reconstruction from RGB-D sequences. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pp. 3264–3271. External Links: Link, Document Cited by: §2.
 [43] (2017) CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6565–6574. External Links: Link, Document Cited by: §1, §1, §2.
 [44] (2018) Beyond local reasoning for stereo confidence estimation with deep learning. In Proc. European Conference on Computer Vision (ECCV), pp. 323–338. External Links: Link, Document Cited by: §2.
 [45] (2017) Learning confidence measures in the wild. In Proc. of the British Machine and Vision Conference (BMVC), Cited by: §2.
 [46] (2016) Patches, planes and probabilities: A non-local prior for volumetric 3d reconstruction. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3280–3289. External Links: Link, Document Cited by: §2.
 [47] (2015) Towards probabilistic volumetric reconstruction using ray potentials. In 2015 International Conference on 3D Vision, 3DV 2015, Lyon, France, October 19-22, 2015, pp. 10–18. External Links: Link, Document Cited by: §2.
 [48] (2017) Global, dense multi-scale reconstruction for a billion points. International Journal of Computer Vision 125 (1-3), pp. 82–94. External Links: Link, Document Cited by: §2.
 [49] (2012) High accuracy and visibility-consistent dense multi-view stereo. IEEE Trans. Pattern Anal. Mach. Intell. 34 (5), pp. 889–901. External Links: Link, Document Cited by: §1.
 [50] (2018) Just-in-time reconstruction: inpainting sparse maps using single view depth predictors as priors. In 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018, pp. 1–9. External Links: Link, Document Cited by: §2.
 [51] (2016) ElasticFusion: real-time dense SLAM and light source estimation. I. J. Robotics Res. 35 (1-4), pp. 1697–1716. External Links: Link, Document Cited by: §2.
 [52] (2012) A generative model for online depth fusion. In Proc. European Conference on Computer Vision (ECCV), pp. 144–157. External Links: Link, Document Cited by: §2.
 [53] (2007) A globally optimal algorithm for robust TV-L1 range image integration. In Proc. International Conference on Computer Vision (ICCV), pp. 1–8. Cited by: §2.
 [54] (2016) Structure-based auto-calibration of RGB-D sensors. In 2016 IEEE International Conference on Robotics and Automation, ICRA 2016, Stockholm, Sweden, May 16-21, 2016, pp. 5076–5083. External Links: Link, Document Cited by: §1, §1, §4.3.
 [55] (2008) Fusion of time-of-flight depth and stereo for high accuracy depth maps. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link, Document Cited by: §2.
 [56] (2016) Monocular, real-time surface reconstruction using dynamic level of detail. In Fourth International Conference on 3D Vision, 3DV 2016, Stanford, CA, USA, October 25-28, 2016, pp. 37–46. External Links: Link, Document Cited by: §2.