Atlas: End-to-End 3D Scene Reconstruction from Posed Images

Abstract

We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images. Traditional approaches to 3D reconstruction rely on an intermediate representation of depth maps prior to estimating a full 3D model of a scene. We hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts features from each image independently, which are then back projected and accumulated into a voxel volume using the camera intrinsics and extrinsics. After accumulation, a 3D CNN refines the accumulated features and predicts the TSDF values. Additionally, semantic segmentation of the 3D model is obtained without significant computation. This approach is evaluated on the ScanNet dataset, where we significantly outperform state-of-the-art baselines (deep multiview stereo followed by traditional TSDF fusion) both quantitatively and qualitatively. We compare our 3D semantic segmentation to prior methods that use a depth sensor, since no previous work attempts the problem with only RGB input.

Keywords:
Multiview Stereo; TSDF; 3D Reconstruction

1 Introduction

Reconstructing the world around us is a long-standing goal of computer vision. Recently many applications have emerged, such as autonomous driving and augmented reality, which rely heavily upon accurate 3D reconstructions of the surrounding environment. These reconstructions are often estimated by fusing depth measurements from special sensors, such as structured light, time of flight, or LIDAR, into 3D models. While these sensors can be extremely effective, they require special hardware, making them more cumbersome and expensive than systems that rely solely on RGB cameras. Furthermore, they often suffer from noise and missing measurements due to low albedo and glossy surfaces as well as occlusion.

Another approach to 3D reconstruction is to use monocular [27, 28, 15], binocular [4, 2], or multiview [24, 23, 44, 19] stereo methods, which take RGB images (one, two, or multiple respectively) and predict depth maps for the images. Despite the plethora of recent research, these methods are still much less accurate than depth sensors, and do not produce satisfactory results when fused into a 3D model.

Figure 1: Example results from our method on a ScanNet scene. Top: our predicted 3D mesh and the ground truth. Notice how our method produces high quality 3D reconstructions despite not using a depth sensor. The network learns to complete geometry in a plausible way, despite it not being present in the ground truth. Bottom: a sample input image, a depth map rendered from our mesh, and the ground truth depth.

In this work, we observe that depth maps are often just intermediate representations that are then fused with other depth maps into a full 3D model. As such, we propose a method that takes a sequence of RGB images and directly predicts a full 3D model in an end-to-end trainable manner. This allows the network to fuse more information and learn better geometric priors about the world, producing much better reconstructions. Furthermore, it reduces the complexity of the system by eliminating steps like frame selection, as well as reducing the required compute by amortizing the cost over the entire sequence.

Our method is inspired by two main lines of work: cost volume based multi view stereo [24, 49] and Truncated Signed Distance Function (TSDF) refinement [10, 13]. Cost volume based multi view stereo methods construct a cost volume using a plane sweep. Here, a reference image is warped onto the target image for each of a fixed set of depth planes and stacked into a 3D cost volume. For the correct depth plane, the reference and target images will match while for other depth planes they will not. As such, the depth is computed by taking the argmin over the planes. This is made more robust by warping image features extracted by a CNN instead of the raw pixel measurements, and by filtering the cost volume with another CNN prior to taking the argmin.
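
To make the plane-sweep construction concrete, below is a minimal PyTorch sketch that warps reference features onto the target view for each hypothesized fronto-parallel depth plane and stacks a simple matching cost. The tensor names, the homography-based warp, and the absolute-difference cost are illustrative assumptions, not the exact recipe of any particular method cited above.

```python
import torch
import torch.nn.functional as F

def plane_sweep_cost_volume(feat_tgt, feat_ref, K, R, t, depths):
    """Warp reference features onto the target view at each hypothesized
    fronto-parallel depth plane and stack the matching costs.

    feat_tgt, feat_ref: (C, H, W) feature maps from a shared 2D CNN.
    K: (3, 3) intrinsics; R: (3, 3), t: (3,) relative pose (ref <- tgt).
    depths: 1D tensor of depth hypotheses.
    Returns a (D, H, W) cost volume (lower cost = better match).
    """
    C, H, W = feat_tgt.shape
    device = feat_tgt.device
    # Pixel grid of the target image in homogeneous coordinates.
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(u)
    pix = torch.stack([u, v, ones], dim=0).float().reshape(3, -1)  # (3, H*W)

    K_inv = torch.inverse(K)
    n = torch.tensor([0.0, 0.0, 1.0], device=device)  # plane normal
    costs = []
    for d in depths:
        # Homography induced by the fronto-parallel plane at depth d.
        Hmg = K @ (R + torch.outer(t, n) / d) @ K_inv
        warped = Hmg @ pix
        warped = warped[:2] / warped[2:].clamp(min=1e-6)  # (2, H*W)
        # Normalize coordinates to [-1, 1] for grid_sample.
        gx = 2.0 * warped[0] / (W - 1) - 1.0
        gy = 2.0 * warped[1] / (H - 1) - 1.0
        grid = torch.stack([gx, gy], dim=-1).reshape(1, H, W, 2)
        ref_warped = F.grid_sample(feat_ref[None], grid, align_corners=True)[0]
        # Simple matching cost: mean absolute feature difference.
        costs.append((ref_warped - feat_tgt).abs().mean(dim=0))
    return torch.stack(costs, dim=0)  # (D, H, W)
```

The depth map then follows from an argmin over the plane dimension, typically after filtering the cost volume with another CNN as described above.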

TSDF refinement starts by fusing depth maps from a depth sensor into an initial voxel volume using TSDF fusion [8], in which each voxel stores the truncated signed distance to the nearest surface. Note that a triangulated mesh can then be extracted from this implicit representation by finding the zero crossing surface using marching cubes [30]. TSDF refinement methods [10, 13] take this noisy, incomplete TSDF as input and refine it by passing it through a 3D convolutional encoder-decoder network.
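
For reference, a minimal sketch of the per-voxel update used by classical TSDF fusion [8]: each depth map contributes a truncated signed distance along the camera ray, accumulated as a weighted running average. The grid layout, unit observation weights, and variable names are our own simplifications.

```python
import numpy as np

def tsdf_fusion_step(tsdf, weight, depth, K, cam_to_world, origin, voxel_size, trunc):
    """Integrate one depth map into a TSDF volume (simplified, no color).

    tsdf, weight: (X, Y, Z) running TSDF values and integration weights (contiguous).
    depth: (H, W) depth map; K: (3, 3) intrinsics; cam_to_world: (4, 4) pose.
    origin: world coordinates of voxel (0, 0, 0); voxel_size, trunc in meters.
    """
    X, Y, Z = tsdf.shape
    H, W = depth.shape
    # World coordinates of all voxel centers.
    ii, jj, kk = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij")
    pts_w = origin + voxel_size * np.stack([ii, jj, kk], -1).reshape(-1, 3)
    # Transform to the camera frame and project with a pinhole model.
    world_to_cam = np.linalg.inv(cam_to_world)
    pts_c = (world_to_cam[:3, :3] @ pts_w.T + world_to_cam[:3, 3:4]).T
    z = pts_c[:, 2]
    uv = (K @ pts_c.T).T
    u = np.round(uv[:, 0] / np.maximum(z, 1e-6)).astype(int)
    v = np.round(uv[:, 1] / np.maximum(z, 1e-6)).astype(int)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    # Signed distance along the ray, truncated and normalized to [-1, 1].
    sdf = (d - z) / trunc
    update = valid & (d > 0) & (sdf > -1)
    sdf = np.clip(sdf, -1, 1)
    # Weighted running average with weight 1 per observation.
    t_flat, w_flat = tsdf.reshape(-1), weight.reshape(-1)
    t_flat[update] = (t_flat[update] * w_flat[update] + sdf[update]) / (w_flat[update] + 1)
    w_flat[update] += 1
    return tsdf, weight
```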

Similar to cost volume multi view stereo approaches, we start by using a 2D CNN to extract features from a sequence of RGB images. These features are then back projected into a 3D volume using the known camera intrinsics and extrinsics. However, unlike cost volume approaches which back project the features into a target view frustum using image warping, we back project into a canonical voxel volume, where each pixel gets mapped to a ray in the volume (similar to [40]). This avoids the need to choose a target image and allows us to fuse an entire sequence of frames into a single volume. We fuse all the frames into the volume using a simple running average. Next, as in both cost volume and TSDF refinement, we pass our voxel volume through a 3D convolutional encoder-decoder to refine the features. Finally, as in TSDF refinement, our feature volume is used to regress the TSDF values at each voxel.

We train and evaluate our network on real scans of indoor rooms from the ScanNet [9] dataset. Our method significantly outperforms state-of-the-art multi-view stereo baselines [24, 44], producing accurate and complete meshes.

As an additional bonus, for minimal extra compute, we can add an additional head to our 3D CNN and perform 3D semantic segmentation. While the problems of 3D semantic and instance segmentation have received a lot of attention recently [20, 17], all previous methods assume the depth was acquired using a depth sensor. Although our 3D segmentations are not competitive with the top performers on the ScanNet benchmark leaderboard, we establish a strong baseline for the new task of 3D semantic segmentation from multi-view RGB.

2 Related Work

2.1 3D reconstruction

Reconstructing a 3D model of a scene usually involves acquiring depth for a sequence of images and fusing the depth maps using a 3D data structure. The most common 3D structure for depth accumulation is the voxel volume used by TSDF fusion[8]. However, surfels (oriented point clouds) are starting to gain popularity [48, 38]. These methods are usually used with a depth sensor, but can also be applied to depth maps predicted from monocular or stereo images.

With the rise of deep learning, monocular depth estimation has seen huge improvements [27, 28, 15]; however, its accuracy is still far below that of state-of-the-art stereo methods. A popular classical approach to stereo [19] uses mutual information and semi-global matching to compute the disparity between two images. More recently, several end-to-end plane sweep algorithms have been proposed. DeepMVS [23] uses a patch matching network. MVDepthNet [44] constructs the cost volume from raw pixel measurements and performs 2D convolutions, treating the planes as feature channels. GPMVS [22] builds upon this and aggregates information into the cost volume over long sequences using a Gaussian process. MVSNet [49] and DPSNet [24] construct the cost volume from features extracted from the images using a 2D CNN. They then filter the cost volume using 3D convolutions on the 4D tensor. All of these methods require choosing a target image to predict depth for and then finding suitable neighboring reference images. Recent binocular stereo methods [4, 2] use a similar cost volume approach, but avoid frame selection by using a fixed baseline stereo pair. Depth maps over a sequence are computed independently (or weakly coupled in the case of [22]). In contrast to these approaches, our method constructs a single coherent 3D model from a sequence of input images directly.

While TSDF fusion is simple and effective, it cannot reconstruct partially occluded geometry and requires averaging many measurements to reduce noise. As such, learned methods have been proposed to improve the fusion. OctNetFusion [36] uses a 3D encoder-decoder to aggregate multiple depth maps into a TSDF and shows results on single objects and portions of scans. ScanComplete [13] builds upon this and shows results for entire rooms. SG-NN [10] improves upon ScanComplete by increasing the resolution using sparse convolutions [17] and training with a novel self-supervised training scheme. 3D-SIC [21] focuses on 3D instance segmentation using region proposals and adds a per-instance completion head. RoutedFusion [47] uses 2D filtering and 3D convolutions in view frustums to improve aggregation of depth maps.

More similar in spirit to ours are networks that take one or more images and directly predict a 3D representation. 3D-R2N2 [6] encodes images to a latent space and then decodes a voxel occupancy volume. Octree generating networks [43] increase the resolution by using an octree data structure to improve the efficiency of 3D voxel volumes. DeepSDF [34] chooses to learn a generative model that can output an SDF value for any input position instead of discretizing the volume. SurfNet [39] learns a 3D offset from a template UV map of a surface. The point set generation network [14] learns to generate point clouds with a fixed number of points. Pixel2Mesh [45] uses a graph convolutional network to directly predict a triangulated mesh. These methods encode the input to a small latent code and report results on single objects, mostly from ShapeNet [3]. As such, it is not clear how to extend them to work on full scene reconstructions. Mesh R-CNN [16] builds upon 2D object detection [18] and adds an additional head to predict a voxel occupancy grid for each instance, which is then refined using a graph convolutional network on a mesh.

Back projecting image features into a voxel volume and then refining them using a 3D CNN has also been used for human pose estimation [50, 25]. These works regress 3D heat maps that are used to localize joint locations.

DeepVoxels [40] and the follow-up work on scene representation networks [41] accumulate features into a 3D volume, forming an unsupervised representation of the world which can then be used to render novel views without the need to form explicit geometric intermediate representations.

2.2 3D Semantic Segmentation

In addition to reconstructing geometry, many applications require semantic labeling of the reconstruction to provide a richer representation. Broadly speaking, there are two approaches to solving this problem: 1) predict semantics on the 2D input images using a 2D segmentation network [1, 18, 5] and back project the labels to 3D [32, 33, 31], or 2) directly predict the semantic labels in 3D. All of these methods assume depth is provided by a depth sensor. A notable exception is Kimera [37], which uses multi-view stereo [19] to predict depth; however, they only show results on synthetic data and ground truth 2D segmentations.

SGPN [46] formulates instance segmentation as a 3D point cloud clustering problem, predicting a similarity matrix and clustering the point cloud to derive semantic and instance labels. 3D-SIS [20] improves upon these approaches by fusing 2D features in a 3D representation. RGB images are encoded using a 2D CNN and back projected onto the 3D geometry reconstructed from depth maps. A 3D CNN is then used to predict 3D object bounding boxes and semantic labels. SSCN [17] predicts semantics on a high resolution voxel volume, enabled by sparse convolutions.

In contrast to these approaches, we propose a strong baseline to the relatively untouched problem of 3D semantic segmentation without a depth sensor.

3 Method

Our method takes as input an arbitrary length sequence of RGB images, each with known intrinsics and pose. These images are passed through a 2D CNN backbone to extract features. The features are then back projected into a 3D voxel volume and accumulated using a running average. Once the image features have been fused into 3D, we regress a TSDF directly using a 3D CNN (See Fig. 2). We also experiment with adding an additional head to predict semantic segmentation.

Figure 2: Schematic of our method. Features are extracted from a sequence of images using a 2D CNN and then back projected into a 3D volume. These volumes are accumulated and then passed through a 3D CNN to directly regress a TSDF reconstruction of the scene. We can also jointly predict the 3D semantic segmentation of the scene.

3.1 Feature Volume Construction

Let $I_t \in \mathbb{R}^{3 \times h \times w}$ be an image in a sequence of $T$ RGB images. We extract features $F_t = F(I_t) \in \mathbb{R}^{c \times h \times w}$ using a standard 2D CNN, where $c$ is the feature dimension. These 2D features are then back projected into a 3D voxel volume using the known camera intrinsics and extrinsics, assuming a pinhole camera model. Consider a voxel volume $V_t \in \mathbb{R}^{c \times H \times W \times D}$:

$$V_t(:, i, j, k) = F_t(:, \hat{i}, \hat{j}) \qquad (1)$$
$$\begin{bmatrix} \hat{i} \\ \hat{j} \end{bmatrix} = \pi \left( K_t P_t \begin{bmatrix} i \\ j \\ k \\ 1 \end{bmatrix} \right) \qquad (2)$$

where $P_t$ and $K_t$ are the extrinsics and intrinsics matrices for image $t$ respectively, $\pi$ is the perspective mapping and $:$ is the slice operator. Here $(i, j, k)$ are the voxel coordinates in world space and $(\hat{i}, \hat{j})$ are the pixel coordinates in image space. Note that this means that all voxels along a camera ray are filled with the same features corresponding to that pixel (see Fig. 3).
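
A sketch of this back projection (Eqs. 1-2) in PyTorch; the tensor layout and the nearest-neighbor pixel lookup are assumptions, and note that no visibility reasoning is performed, so every voxel along a ray receives that pixel's features.

```python
import torch

def backproject_features(feat, K, cam_from_world, origin, voxel_dim, voxel_size):
    """Back project 2D features into a voxel volume (Eqs. 1-2).

    feat: (C, h, w) 2D CNN features for one image.
    K: (3, 3) intrinsics; cam_from_world: (3, 4) extrinsics [R | t].
    origin: (3,) world position of voxel (0, 0, 0); voxel_dim = (H, W, D).
    Returns V: (C, H, W, D) features and M: (H, W, D) in-frustum mask.
    """
    C, h, w = feat.shape
    H, W, D = voxel_dim
    device = feat.device
    # World coordinates of every voxel center, in homogeneous form.
    grid = torch.stack(torch.meshgrid(
        torch.arange(H, device=device), torch.arange(W, device=device),
        torch.arange(D, device=device), indexing="ij"), dim=-1).float()
    xyz = origin + voxel_size * grid                       # (H, W, D, 3)
    xyz1 = torch.cat([xyz, torch.ones_like(xyz[..., :1])], dim=-1)
    # Project: pixel = pi(K P X).
    cam = xyz1.reshape(-1, 4) @ cam_from_world.T           # (N, 3)
    pix = cam @ K.T
    z = pix[:, 2]
    u = torch.round(pix[:, 0] / z.clamp(min=1e-6)).long()
    v = torch.round(pix[:, 1] / z.clamp(min=1e-6)).long()
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Every voxel along a ray gets the features of the pixel it projects to.
    V = torch.zeros(C, H * W * D, device=device)
    V[:, valid] = feat[:, v[valid], u[valid]]
    M = valid.reshape(H, W, D).float()
    return V.reshape(C, H, W, D), M
```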

These feature volumes are accumulated over the entire sequence using a weighted running average similar to TSDF fusion as follows:

$$\bar{V}_t = \frac{\bar{V}_{t-1} \bar{W}_{t-1} + V_t W_t}{\bar{W}_{t-1} + W_t} \qquad (3)$$
$$\bar{W}_t = \bar{W}_{t-1} + W_t \qquad (4)$$

For the weights $W_t$ we use a binary mask $W_t(i, j, k) \in \{0, 1\}$ which stores whether or not voxel $(i, j, k)$ is inside the view frustum of camera $t$.
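
A minimal sketch of the running average of Eqs. 3-4; the class and method names are illustrative. Accumulating in place like this is what later allows inference over arbitrarily long sequences without storing per-frame volumes (Sec. 4).

```python
import torch

class FeatureVolumeAccumulator:
    """Running weighted average of back-projected feature volumes (Eqs. 3-4)."""

    def __init__(self, C, voxel_dim, device="cpu"):
        H, W, D = voxel_dim
        self.volume = torch.zeros(C, H, W, D, device=device)   # accumulated V
        self.weight = torch.zeros(H, W, D, device=device)      # accumulated W

    def integrate(self, V_t, W_t):
        """V_t: (C, H, W, D) features; W_t: (H, W, D) binary frustum mask."""
        total = self.weight + W_t
        self.volume = (self.volume * self.weight + V_t * W_t) / total.clamp(min=1)
        self.weight = total

    def get(self):
        return self.volume, self.weight
```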

Figure 3: A) Illustration of the back projection of features into 3D. All the voxels along a ray are filled with the features from the corresponding pixel. The features of all the cameras are accumulated into a single volume. B) Visualizes a top down cross section of the accumulated feature volume (we display the first channel). As can be seen (highlighted in orange) a region of voxels was not viewed by any camera (hole in the floor in the ground truth) and thus contains all zero features. C) With naive skip connections in the 3D CNN the empty voxels lead to significant artifacts. D) Using our masked skip connections reduces the artifacts and allows the network to better complete the geometry.

3.2 3D Encoder-Decoder

Once the features are accumulated into the voxel volume, we use a 3D convolutional encoder-decoder network to refine the features and regress the output TSDF (Fig. 4). Each layer of the encoder and decoder uses a set of 3x3x3 residual blocks. Downsampling is implemented with 3x3x3 stride 2 convolution, while upsampling uses trilinear interpolation followed by a 1x1x1 convolution to change the feature dimension. The feature dimension is doubled with each downsampling and halved with each upsampling. All convolution layers are followed by batchnorm and relu.
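
A compact sketch of the building blocks described above (residual blocks, stride-2 downsampling that doubles the channels, and trilinear upsampling followed by a 1x1x1 convolution that halves them). Block counts and channel widths are placeholders; see Sec. 4 for the actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out, kernel_size=3, stride=1):
    pad = kernel_size // 2
    return nn.Sequential(nn.Conv3d(c_in, c_out, kernel_size, stride, pad, bias=False),
                         nn.BatchNorm3d(c_out), nn.ReLU(inplace=True))

class ResBlock3d(nn.Module):
    """3x3x3 residual block used throughout the encoder and decoder."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = conv_bn_relu(channels, channels)
        self.conv2 = nn.Sequential(nn.Conv3d(channels, channels, 3, 1, 1, bias=False),
                                   nn.BatchNorm3d(channels))
    def forward(self, x):
        return F.relu(x + self.conv2(self.conv1(x)))

class Downsample3d(nn.Module):
    """Stride-2 3x3x3 convolution; the feature dimension is doubled."""
    def __init__(self, channels):
        super().__init__()
        self.conv = conv_bn_relu(channels, 2 * channels, kernel_size=3, stride=2)
    def forward(self, x):
        return self.conv(x)

class Upsample3d(nn.Module):
    """Trilinear upsampling followed by a 1x1x1 convolution halving the channels."""
    def __init__(self, channels):
        super().__init__()
        self.conv = conv_bn_relu(channels, channels // 2, kernel_size=1)
    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="trilinear", align_corners=False)
        return self.conv(x)
```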

We also include additive skip connections from the encoder to the decoder. The encoder features are passed through a 1x1x1 convolution followed by a batchnorm and relu. However, there may be voxels that were never observed during the sequence and thus do not have any features back projected into them. The large receptive field of the coarser resolution layers in the network is able to smooth over and infill these areas, but adding zero values from the early layers of the encoder undoes this, bringing the zeros back. This significantly reduces the network's ability to complete the geometry in unobserved regions. As such, for these voxels we do not use a naive skip connection from the encoder. Instead, we pass the decoder features through the same batchnorm and relu to match the scale of the standard skip connections and add them.

Figure 4: Our 3D encoder-decoder architecture. Blue boxes denote residual blocks, green boxes are stride 2 convolutions and red boxes are trilinear upsampling. The arrows from the encoder to the decoder indicate masked skip connections. Our network predicts TSDFs in a coarse to fine manner with the previous layer being used to sparsify the next resolution.

Let $x$ be the features from the decoder, $y$ be the features that are being passed from the encoder via the skip connection, $C$ be the $1 \times 1 \times 1$ convolution, and $B$ be the batchnorm and relu. Our masked skip connection is then:

$$z = x + \mathbb{1}_{[y \neq 0]}\, B(C(y)) + \mathbb{1}_{[y = 0]}\, B(x) \qquad (5)$$
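
A sketch of the masked skip connection of Eq. 5 in PyTorch. Detecting unobserved voxels by an all-zero encoder feature vector is our reading of the description above; the mask could equivalently be derived from the accumulated weight volume downsampled to this resolution.

```python
import torch
import torch.nn as nn

class MaskedSkip(nn.Module):
    """Additive skip connection that falls back to the decoder features
    wherever the encoder saw no back-projected features (Eq. 5)."""

    def __init__(self, enc_channels, dec_channels):
        super().__init__()
        self.conv = nn.Conv3d(enc_channels, dec_channels, kernel_size=1, bias=False)
        self.bn_relu = nn.Sequential(nn.BatchNorm3d(dec_channels), nn.ReLU(inplace=True))

    def forward(self, x_dec, y_enc):
        # Voxels whose encoder features are identically zero were never observed.
        observed = (y_enc.abs().sum(dim=1, keepdim=True) > 0).float()
        skip = self.bn_relu(self.conv(y_enc))        # standard skip path
        fallback = self.bn_relu(x_dec)               # decoder features, matched in scale
        return x_dec + observed * skip + (1 - observed) * fallback
```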

At the final resolution of the encoder-decoder, we use a 1x1x1 convolution followed by a tanh activation to regress the final TSDF values. In addition, we include intermediate output heads at each decoder resolution prior to upsampling. This serves as intermediate supervision to help the network train faster, and it guides the later resolutions to focus on refining predictions near surfaces while ignoring large empty regions that the coarser resolutions are already confident about. For our semantic segmentation models we also include an additional 1x1x1 convolution to predict the segmentation logits (only at the final resolution).

Since our features are back projected along entire rays, the voxel volume is filled densely and thus we cannot take advantage of sparse convolutions [17] in the encoder. However, by applying a hard threshold to the intermediate output TSDFs, the decoder can be sparsified, allowing for the use of sparse convolutions similar to [10]. In practice, we found that we were able to train our models at our full voxel resolution without the need for sparse convolutions. While we do not sparsify the feature volumes, we do use the multi-resolution outputs to sparsify the final predicted TSDF: any voxel predicted to be beyond a fixed distance threshold is truncated in the following resolution.
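
A sketch of the coarse-to-fine sparsification described above: the upsampled prediction from the previous resolution overrides the finer prediction wherever the coarser level is already confident the space is empty. The threshold value and the use of the sign as the truncation value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sparsify_next_resolution(tsdf_coarse, tsdf_fine, threshold=0.99):
    """Truncate fine-resolution TSDF predictions wherever the coarser
    resolution is already confident the voxel is far from any surface.

    tsdf_coarse: (1, 1, H, W, D) prediction from the previous resolution.
    tsdf_fine:   (1, 1, 2H, 2W, 2D) prediction at the current resolution.
    """
    up = F.interpolate(tsdf_coarse, scale_factor=2, mode="trilinear", align_corners=False)
    empty = up.abs() >= threshold          # confidently empty / unobserved space
    return torch.where(empty, torch.sign(up), tsdf_fine)
```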

4 Implementation Details

We use a ResNet50-FPN [29] followed by the merging method of [26] with 32 output feature channels as our 2D backbone. The features are back projected into a voxel grid. Our 3D CNN consists of a four-scale resolution pyramid where we double the number of channels each time we halve the resolution. The encoder consists of (1, 2, 3, 4) residual blocks at each scale respectively, and the decoder consists of (3, 2, 1) residual blocks.
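
A sketch of the FPN merging step in the spirit of [26]: each pyramid level is reduced and upsampled to the finest scale, and the levels are summed into a single 32-channel map that is then back projected. The convolution choices and the commented torchvision usage are assumptions, not the exact backbone used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """Merge multi-scale FPN features into a single 32-channel map,
    loosely following the Panoptic FPN merging of [26]."""

    def __init__(self, fpn_channels=256, out_channels=32):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(fpn_channels, out_channels, kernel_size=3, padding=1)
            for _ in range(4))

    def forward(self, feats):
        """feats: list of FPN maps at strides 4, 8, 16, 32 (finest first)."""
        target = feats[0].shape[-2:]
        merged = 0
        for conv, f in zip(self.reduce, feats):
            x = conv(f)
            if x.shape[-2:] != target:
                x = F.interpolate(x, size=target, mode="bilinear", align_corners=False)
            merged = merged + x
        return merged  # (N, 32, H/4, W/4) image features for back projection

# Hypothetical usage with a torchvision ResNet50-FPN feature extractor:
# from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
# fpn = resnet_fpn_backbone('resnet50', pretrained=False)
# feats = list(fpn(torch.rand(1, 3, 480, 640)).values())[:4]
# features_2d = FPNMerge()(feats)
```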

Initially, we train the network end-to-end using short sequences covering portions of rooms, since all frames need to be kept in memory for back propagation. We train with ten-frame sequences and a 96x96x56 voxel grid. After 35k iterations, we freeze the 2D network and fine tune the 3D network. This removes the need to keep all the activations from the 2D CNN in memory and allows for in-place accumulation of the feature volumes, breaking the memory dependence on the number of frames. We then fine tune with 100-frame sequences at a reduced learning rate.

At test time, similar to the fine tuning stage, we accumulate the feature volumes in place, allowing us to operate on arbitrary length sequences (often thousands of frames for ScanNet), and we use a 400x400x104 voxel grid. Training the network to completion takes around 36 hours on 8 Titan RTX GPUs with a batch size of 16 and synchronized batchnorm.

4.1 Ground Truth Preparation and Loss

Figure 5: A) Ground truth mesh fused from the entire sequence. B) Corresponding TSDF. Notice the solid red regions corresponding to voxels that were unobserved. The area in front of the green chair (green arrow) is missing due to occlusion and thus we do not penalize these voxels. On the other hand, the area behind the wall (yellow arrow) is outside the room and thus loss is computed there. C) Mesh from a training batch. D) Corresponding TSDF. E) Shift augmentations to prevent hallucination behind the wall.

We supervise the multi-scale TSDF reconstructions using an $\ell_1$ loss to the ground truth TSDF values. Following [12], we log-transform the predicted and target values before applying the loss, and only backpropagate loss for voxels that were observed in the ground truth (i.e. have TSDF values strictly less than 1 in magnitude). However, to prevent the network from hallucinating artifacts behind walls, outside the room, we also mark all the voxels whose entire vertical column is equal to 1 and penalize in these areas too. The intuition for this is that if the entire vertical column was not observed, it was probably not within the room.

Furthermore, to force the finer resolution layers to learn more detail, we only compute the loss for voxels that were not beyond a fraction (0.97) of the truncation distance in the previous resolution. Without this, the loss at the later layers is dominated by the large number of voxels that are far from the surface and easily classified as empty, preventing them from learning effectively.
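
A sketch of this multi-scale TSDF loss as we read the two paragraphs above; the exact log transform, the choice of the last tensor dimension as the vertical axis, and the masking details are assumptions.

```python
import torch
import torch.nn.functional as F

def log_transform(x, shift=1.0):
    """Log-scale a TSDF as in [12]; the exact shift is an assumption."""
    return x.sign() * torch.log(x.abs() + shift)

def tsdf_loss(pred, target, prev_pred_up=None, sparsity_frac=0.97):
    """l1 loss between log-transformed predicted and ground-truth TSDFs.

    pred, target: (B, 1, H, W, D) TSDFs normalized to [-1, 1]; unobserved
    ground-truth voxels are stored as 1. The last dimension is assumed vertical.
    prev_pred_up: previous-resolution prediction upsampled to this size, used
    to drop voxels the coarser level already marked as empty.
    """
    # Observed voxels (|tsdf| < 1) ...
    valid = target.abs() < 1
    # ... plus columns that are entirely 1: treated as outside the room, so
    # the network is penalized for hallucinating geometry there.
    column_empty = (target == 1).all(dim=-1, keepdim=True)
    valid = valid | column_empty
    if prev_pred_up is not None:
        valid = valid & (prev_pred_up.abs() < sparsity_frac)
    return F.l1_loss(log_transform(pred[valid]), log_transform(target[valid]))
```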

To construct the ground truth TSDFs, we run TSDF fusion at each resolution on the full sequences prior to training (see Fig. 5). This results in less noisy and more complete ground truth than simply fusing the short training batch sequences on the fly. However, this adds the complication that we now have to find the appropriate region of the TSDF for each training batch. First, we center the batch geometry in our reconstruction volume (and crop to our training size bounds) by back projecting all the depth points into 3D and centering their centroid. However, we found that by always centering the visible geometry in our volume at training time, the network does not have a chance to learn to avoid hallucinating geometry far beyond the walls (the network takes advantage of the fact that the bounds of the volume are fit to the visible area). This causes the network to not know what to do when the volume is much larger at test time. As such, after centering, we apply a random shift along the viewing direction of the camera to provide more voxels to compute loss on beyond the visible scene. We also apply a random rotation about the vertical axis for data augmentation.

Next, we clip the ground truth to the visible frustum so the network is not forced to hallucinate geometry far beyond what has been observed. We fuse another TSDF from the batch of images and construct a mask marking the voxels that were observed during the batch sequence. This mask is dilated by 3 voxels to force the network to complete geometry near observations. We then apply this mask to the TSDF from the full sequence to construct our final training target.
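
A sketch of this final target construction, assuming both TSDFs are already resampled into the training volume; implementing the dilation with max pooling is our own choice.

```python
import torch
import torch.nn.functional as F

def build_training_target(tsdf_full, tsdf_batch, dilate=3):
    """Clip the full-sequence ground-truth TSDF to what the training batch
    actually observed, so the network is not asked to hallucinate far beyond
    its input views. Unobserved voxels are marked as 1 (empty/unknown).

    tsdf_full:  (1, 1, H, W, D) TSDF fused offline from the whole scan.
    tsdf_batch: (1, 1, H, W, D) TSDF fused from the short training batch only.
    """
    observed = (tsdf_batch.abs() < 1).float()
    # Dilate the observed mask by `dilate` voxels so the network is still
    # asked to complete geometry close to its observations.
    k = 2 * dilate + 1
    observed = F.max_pool3d(observed, kernel_size=k, stride=1, padding=dilate)
    return torch.where(observed.bool(), tsdf_full, torch.ones_like(tsdf_full))
```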

5 Results

We evaluate our method on ScanNet [9], which consists of 2.5M images across 707 distinct spaces. Standard train/validation/test splits are adopted. The 3D reconstructions are benchmarked using standard 2D depth metrics [24, 22], as shown in Table 1, as well as in 3D, quantitatively in Table 2 and qualitatively in Figure 6. The 2D metrics used are: absolute relative error (AbsRel), absolute difference error (AbsDiff), square relative error (SqRel), root mean square error (RMSE), and the inlier ratios ($\delta < 1.25^i$ for $i \in \{1, 2, 3\}$). The 3D metrics used are the L1 distance on the TSDF, IoU on the TSDF, and Chamfer distance on the mesh.
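
For completeness, the 2D depth metrics can be computed as in the sketch below (masking to pixels with valid ground-truth depth is an assumption).

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard 2D depth metrics [24, 22], computed on pixels with gt > 0."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    return {
        "AbsRel": np.mean(np.abs(p - g) / g),
        "AbsDiff": np.mean(np.abs(p - g)),
        "SqRel": np.mean((p - g) ** 2 / g),
        "RMSE": np.sqrt(np.mean((p - g) ** 2)),
        "delta<1.25": np.mean(ratio < 1.25),
        "delta<1.25^2": np.mean(ratio < 1.25 ** 2),
        "delta<1.25^3": np.mean(ratio < 1.25 ** 3),
    }
```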

We compare our method to 3 state-of-the-art baselines: MVDepthNet [44], DPSNet [24], and GPMVS [22]. For each method we compare to both the models provided by the authors as well as after we fine tuned (FT) them on ScanNet. We evaluated the methods using 2 and 6 reference images (for each target image), with frames selected temporally with a stride of 20 and 10 respectively. From Table 1 we can see that fine tuning the models on ScanNet helps considerably. However, surprisingly, adding more reference images does not help much. We believe this is because, with temporal frame selection, the added frames do not contain much extra information: both 2 neighbors at stride 20 and 6 neighbors at stride 10 cover the same time span of ±20 frames, the latter is just sampled more densely. While better frame selection could improve this, our method avoids the issue entirely by fusing information from all the frames available.

To evaluate these in 3D we take their outputs and fuse them into TSDFs using TSDF fusion. Our plain model, before finetuning on longer sequences, is comparable with prior state-of-the-art on the 2D metrics despite the fact that our method is designed for full 3D reconstruction (Table 1). After finetuning the 3D CNN on longer sequences, the proposed method surpasses state-of-the-art on some of the 2D metrics, and is well within standard deviation of the baselines on other metrics.

Method | AbsRel | AbsDiff | SqRel | RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³
MVDepthNet [44]-2 | .119 | .209 | .091 | .304 | .869 | .959 | .985
MVDepthNet FT-2 | .105 | .191 | .097 | .305 | .895 | .970 | .989
MVDepthNet FT-6 | .108 | .195 | .074 | .305 | .886 | .968 | .989
DPSNet [24]-2 | .147 | .224 | .103 | .346 | .848 | .947 | .976
DPSNet FT-2 | .102 | .167 | .057 | .267 | .910 | .970 | .987
DPSNet FT-6 | .102 | .167 | .057 | .268 | .910 | .971 | .987
GPMVS [22]-2 | .130 | .260 | .094 | .345 | .848 | .946 | .975
GPMVS FT-2 | .107 | .225 | .096 | .465 | .890 | .959 | .978
GPMVS FT-6 | .109 | .230 | .116 | .571 | .893 | .956 | .975
Ours (plain) | .094 | .160 | .080 | .165 | .902 | .950 | .972
Ours (semseg) | .106 | .187 | .092 | .204 | .884 | .939 | .966
Ours (finetuned) | .089 | .157 | .075 | .174 | .906 | .952 | .974
Table 1: 2D depth metrics. Despite being optimized for full 3D reconstruction, our method is on par with or better than prior depth estimation methods.

However, our method really stands out when evaluated using the 3D metrics, presented in Table 2. Here, it significantly outperforms prior work before fine tuning, and fine tuning just widens the gap. The 2D metrics do not reflect the true performance when evaluating full 3D reconstructions. If occluding boundaries are off by a few pixels, the 2D metrics will strongly penalize this. However, when evaluated in 3D this is a minor error since the surface is still very close to the ground truth. Note that 3D reconstruction is a strictly harder problem than depth estimation as it requires not only accurate depth estimation but also consistent depth across various views.

Method | L1 | IoU (0.5) | Chamfer
MVDepthNet | .524 | .854 | .722
DPSNet | .427 | .861 | .597
GPMVS | .480 | .856 | 8.529
Ours (plain) | .260 | .867 | .624
Ours (semseg) | .276 | .859 | .515
Ours (finetuned) | .211 | .894 | .372
Table 2: 3D geometry metrics. Our method outperforms previous methods by a large margin.

As mentioned previously, we augment the existing 3D CNN with a semantic segmentation head, requiring only a single convolution, to not only reconstruct the 3D structure of the scene but also provide semantic labels for the surfaces. Since no prior work attempts 3D semantic segmentation from only RGB images, and there are no established benchmarks, we propose a new evaluation procedure. The semantic labels from the predicted mesh are transferred onto the ground truth mesh using nearest-neighbor lookup on the vertices, and then the standard IoU metric can be used. We report our results in Table 3 and Fig. 7, and note that this is not an apples-to-apples comparison since all prior methods include depth as input.
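
A sketch of the proposed label-transfer evaluation, using a k-d tree for the nearest-neighbor lookup; the use of SciPy here is an implementation choice, not part of the evaluation protocol itself.

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_labels(pred_vertices, pred_labels, gt_vertices):
    """Transfer per-vertex semantic labels from the predicted mesh to the
    ground-truth mesh by nearest-neighbor lookup, so the standard per-vertex
    IoU metric can be computed on the ground-truth geometry."""
    tree = cKDTree(pred_vertices)            # (Np, 3) predicted mesh vertices
    _, idx = tree.query(gt_vertices, k=1)    # nearest predicted vertex per GT vertex
    return pred_labels[idx]                  # (Ng,) labels on the GT mesh
```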

From the results in Table 3 we see that our approach, Atlas, is surprisingly competitive with (and even beats some of the) prior methods that include depth as input. Having depth as an input makes the problem significantly easier because the only source of error is the semantic predictions. In our case, in order to correctly label a vertex we must both predict the geometry correctly and predict the semantic label. From Fig. 7 we can see that mistakes in geometry compound with mistakes in semantics, which leads to lower IoUs.

Method | mIoU
ScanNet [9] | 30.6
PointNet++ [35] | 33.9
SPLATNet [42] | 39.3
3DMV [11] | 48.4
3DMV-FTSDF | 50.1
PointNet++SW | 52.3
SparseConvNet [17] | 72.5
MinkowskiNet [7] | 73.4
Ours | 41.0
Table 3: 3D Semantic Label Benchmark on ScanNet [9]. We transfer our labels from the predicted mesh to the ground truth mesh using nearest neighbors.

5.1 Inference Time

Since our method only requires running a small 2D CNN on each frame, the cost of running the large 3D CNN is amortized over a sequence of images. On the other hand, MVS methods must run all their compute on every frame. They must also run TSDF fusion to accumulate the depth maps into a mesh, but we assume this is negligible compared to the network inference time (our method bypasses this step by directly regressing the TSDF). We report inference times using 2 reference images.

We evaluate our model with a volume size of 400x400x104 voxels. At 4cm resolution that corresponds to a volume of 16x16x4.16 meters. All models are run on a single Nvidia Titan RTX GPU. From Table 4 we can see that after approximately 4 frames, ours becomes faster than DPSNet (note that most ScanNet scenes are a few thousand frames long).
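
Using the numbers in Table 4 (and neglecting the baselines' fusion cost, as above), the break-even point against DPSNet for an $N$-frame sequence follows directly:

$$0.840 + 0.071\,N \;\le\; 0.322\,N \;\;\Longrightarrow\;\; N \;\ge\; \frac{0.840}{0.322 - 0.071} \approx 3.3,$$

so from roughly the fourth frame onward the amortized per-frame cost of our method is lower, consistent with the break-even point of about 4 frames noted above.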

Method | Per Frame Time (sec) | Per Sequence Time (sec)
MVDepthNet-2 | 0.048 | 0
GPMVS-2 | 0.051 | 0
DPSNet-2 | 0.322 | 0
Ours | 0.071 | 0.840
Table 4: Inference time.

6 Conclusions

In this work, we present a novel approach to 3D scene reconstruction. Notably, our approach does not require depth inputs; is unbounded temporally, allowing the integration of long frame sequences; completes unobserved geometry; and supports the efficient prediction of other quantities such as semantics. We have experimentally verified that the classical approach to 3D reconstruction via per-view depth estimation is inferior to direct regression to a 3D model from an input RGB sequence. We have also demonstrated that, without significant additional compute, a semantic segmentation objective can be added to the model to accurately label the resulting surfaces. In future work, we aim to improve the back projection and accumulation process. One approach is to allow the network to learn where along a ray to place the features (instead of uniformly). This will improve the model's ability to handle occlusions and large multi-room scenes. We also plan to add additional tasks such as instance segmentation and intrinsic image decomposition. Our method is particularly well suited for intrinsic image decomposition because the network has the ability to reason with information from multiple views in 3D.

Figure 6: Qualitative 3D reconstruction results. Left to right: DPSNet, ours, ground truth. We can see that our method is much less noisy than DPSNet producing clean and accurate meshes. Furthermore we often complete geometry that is missing from the ground truth.
Figure 7: Qualitative 3D semantic segmentations. Left to right: Ours, our labels transferred to the ground truth mesh, ground truth labels. We are able to accurately segment the 3D scene despite not using a depth sensor.

Footnotes

  1. email: {zmurez,tvanas,jbartolozzi,asinha,vbadrinarayanan,arabinovich}@magicleap.com

References

  1. V. Badrinarayanan, A. Kendall and R. Cipolla (2015) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561.
  2. R. Chabra, J. Straub, C. Sweeney, R. Newcombe and H. Fuchs (2019) StereoDRNet: dilated residual stereonet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11786–11795.
  3. A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song and H. Su (2015) ShapeNet: an information-rich 3D model repository. arXiv:1512.03012.
  4. J. Chang and Y. Chen (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418.
  5. B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam and L. Chen (2019) Panoptic-DeepLab. arXiv:1910.04751.
  6. C. B. Choy, D. Xu, J. Gwak, K. Chen and S. Savarese (2016) 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision, pp. 628–644.
  7. C. Choy, J. Gwak and S. Savarese (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. arXiv:1904.08755.
  8. B. Curless and M. Levoy (1996) A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 303–312.
  9. A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser and M. Nießner (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  10. A. Dai, C. Diller and M. Nießner (2019) SG-NN: sparse generative neural networks for self-supervised scene completion of RGB-D scans. arXiv:1912.00036.
  11. A. Dai and M. Nießner (2018) 3DMV: joint 3D-multi-view prediction for 3D semantic scene segmentation. arXiv:1803.10409.
  12. A. Dai, C. R. Qi and M. Nießner (2016) Shape completion using 3D-encoder-predictor CNNs and shape synthesis. arXiv:1612.00101.
  13. A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm and M. Nießner (2018) ScanComplete: large-scale scene completion and semantic segmentation for 3D scans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4578–4587.
  14. H. Fan, H. Su and L. J. Guibas (2017) A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 605–613.
  15. H. Fu, M. Gong, C. Wang, K. Batmanghelich and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011.
  16. G. Gkioxari, J. Malik and J. Johnson (2019) Mesh R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9785–9795.
  17. B. Graham, M. Engelcke and L. van der Maaten (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9224–9232.
  18. K. He, G. Gkioxari, P. Dollár and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  19. H. Hirschmuller (2007) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 328–341.
  20. J. Hou, A. Dai and M. Nießner (2018) 3D-SIS: 3D semantic instance segmentation of RGB-D scans. arXiv:1812.07003.
  21. J. Hou, A. Dai and M. Nießner (2019) 3D-SIC: 3D semantic instance completion for RGB-D scans. arXiv:1904.12012.
  22. Y. Hou, J. Kannala and A. Solin (2019) Multi-view stereo by temporal nonparametric fusion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2651–2660.
  23. P. Huang, K. Matzen, J. Kopf, N. Ahuja and J. Huang (2018) DeepMVS: learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830.
  24. S. Im, H. Jeon, S. Lin and I. Kweon (2019) DPSNet: end-to-end deep plane sweep stereo. In 7th International Conference on Learning Representations (ICLR).
  25. K. Iskakov, E. Burkov, V. Lempitsky and Y. Malkov (2019) Learnable triangulation of human pose. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7718–7727.
  26. A. Kirillov, R. Girshick, K. He and P. Dollár (2019) Panoptic feature pyramid networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6399–6408.
  27. K. Lasinger, R. Ranftl, K. Schindler and V. Koltun (2019) Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. arXiv:1907.01341.
  28. J. H. Lee, M. Han, D. W. Ko and I. H. Suh (2019) From big to small: multi-scale local planar guidance for monocular depth estimation. arXiv:1907.10326.
  29. T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
  30. W. E. Lorensen and H. E. Cline (1987) Marching cubes: a high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics 21 (4), pp. 163–169.
  31. J. McCormac, R. Clark, M. Bloesch, A. Davison and S. Leutenegger (2018) Fusion++: volumetric object-level SLAM. In 2018 International Conference on 3D Vision (3DV), pp. 32–41.
  32. J. McCormac, A. Handa, A. Davison and S. Leutenegger (2017) SemanticFusion: dense 3D semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4628–4635.
  33. G. Narita, T. Seno, T. Ishikawa and Y. Kaji (2019) PanopticFusion: online volumetric semantic mapping at the level of stuff and things. arXiv:1903.01177.
  34. J. J. Park, P. Florence, J. Straub, R. Newcombe and S. Lovegrove (2019) DeepSDF: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 165–174.
  35. C. R. Qi, L. Yi, H. Su and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. arXiv:1706.02413.
  36. G. Riegler, A. Osman Ulusoy and A. Geiger (2017) OctNet: learning deep 3D representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3577–3586.
  37. A. Rosinol, M. Abate, Y. Chang and L. Carlone (2020) Kimera: an open-source library for real-time metric-semantic localization and mapping. In IEEE International Conference on Robotics and Automation (ICRA).
  38. T. Schöps, T. Sattler and M. Pollefeys (2019) SurfelMeshing: online surfel-based mesh reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  39. A. Sinha, A. Unmesh, Q. Huang and K. Ramani (2017) SurfNet: generating 3D shape surfaces using deep residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6040–6049.
  40. V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein and M. Zollhofer (2019) DeepVoxels: learning persistent 3D feature embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2437–2446.
  41. V. Sitzmann, M. Zollhöfer and G. Wetzstein (2019) Scene representation networks: continuous 3D-structure-aware neural scene representations. In Advances in Neural Information Processing Systems.
  42. H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M. Yang and J. Kautz (2018) SPLATNet: sparse lattice networks for point cloud processing. arXiv:1802.08275.
  43. M. Tatarchenko, A. Dosovitskiy and T. Brox (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2088–2096.
  44. K. Wang and S. Shen (2018) MVDepthNet: real-time multiview depth estimation neural network. In 2018 International Conference on 3D Vision (3DV), pp. 248–257.
  45. N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu and Y. Jiang (2018) Pixel2Mesh: generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–67.
  46. W. Wang, R. Yu, Q. Huang and U. Neumann (2018) SGPN: similarity group proposal network for 3D point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2569–2578.
  47. S. Weder, J. Schönberger, M. Pollefeys and M. R. Oswald (2020) RoutedFusion: learning real-time depth map fusion. arXiv:2001.04388.
  48. T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker and A. Davison (2015) ElasticFusion: dense SLAM without a pose graph. In Robotics: Science and Systems.
  49. Y. Yao, Z. Luo, S. Li, T. Fang and L. Quan (2018) MVSNet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783.
  50. C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus and T. Brox (2019) FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 813–822.