3D Manhattan Room Layout Reconstruction from a Single 360 Image
Recent approaches for predicting layouts from 360 panoramas produce excellent results. These approaches build on a common framework consisting of three steps: a pre-processing step based on edge-based alignment, prediction of layout elements, and a post-processing step by fitting a 3D layout to the layout elements. Until now, it has been difficult to compare the methods due to multiple different design decisions, such as the encoding network (e.g., SegNet or ResNet), type of elements predicted (e.g., corners, wall/floor boundaries, or semantic segmentation), or method of fitting the 3D layout. To address this challenge, we summarize and describe the common framework, the variants, and the impact of the design decisions. For a complete evaluation, we also propose extended annotations for the Matterport3D dataset chang2017matterport3d, and introduce two depth-based evaluation metrics.
Keywords: 3D Room Layout, Deep Learning, Single Image 3D, Manhattan World
Estimating the 3D room layout of an indoor environment is an important step toward holistic scene understanding and would benefit many applications such as robotics and virtual/augmented reality. The room layout specifies the positions, orientations, and heights of the walls, relative to the camera center. The layout can be represented as a set of projected corner positions or boundaries or as a 3D mesh. Existing works apply to special cases of the problem such as predicting cuboid-shaped layouts from perspective images or from panoramic images.
Recently, various approaches zou2018layoutnet; yang2019dula; sun2019horizonnet for 3D room layout reconstruction from a single panoramic image have been proposed, which all produce excellent results. These methods are not only able to reconstruct cuboid room shapes, but also estimate non-cuboid general Manhattan layouts as shown in Fig. 1. Different from previous work zhang2014panocontext that estimates 3D layouts by decomposing a panorama into perspective images, these approaches operate directly on the panoramic image in equirectangular view, which effectively reduces the inference time. These methods all follow a common framework: (1) a pre-processing edge-based alignment step, ensuring that wall-wall boundaries are vertical lines and substantially reducing prediction error; (2) a deep neural network that predicts the layout elements, such as layout boundaries and corner positions (LayoutNet zou2018layoutnet and HorizonNet sun2019horizonnet), or a semantic 2D floor plan in the ceiling view (DuLa-Net yang2019dula); and (3) a post-processing step that fits the (Manhattan) 3D layout to the predicted elements.
However, until now, it has been difficult to compare these methods due to multiple different design decisions. For example, LayoutNet uses SegNet as encoder while DuLa-Net and HorizonNet use ResNet; HorizonNet applies random stretching data augmentation sun2019horizonnet in training, while LayoutNet and DuLa-Net do not. Direct comparison of the three methods may conflate the impact of contributions and design decisions. We therefore want to isolate the effects of the contributions by comparing performance with the same encoding architectures and other settings. Moreover, given the same network prediction, we want to compare the performance of different post-processing steps (under equirectangular view or ceiling view). Therefore, in this paper, the authors of LayoutNet and DuLa-Net work together to better describe the common framework, the variants, and the impact of the design decisions for 3D layout estimation from a single panoramic image. For a detailed comparison, we evaluate performance using a unified encoder (i.e., ResNet he2016deep) and consistent training details such as random stretching data augmentation, and discuss the effects of different post-processing steps. Based on the modifications to LayoutNet and DuLa-Net listed above, we propose improved versions called LayoutNet v2 (code is available at https://github.com/zouchuhang/LayoutNetv2) and DuLa-Net v2 (code is available at https://github.com/SunDaDenny/DuLa-Net), which achieve state-of-the-art performance for cuboid layout reconstruction.
To compare performance for reconstructing different types of 3D room shapes, we extend the annotations of the Matterport3D dataset with ground truth 3D layouts. Unlike existing public datasets, such as the PanoContext dataset zhang2014panocontext, which provides mostly cuboid layouts of only two scene types, i.e., bedroom and living room, the Matterport3D dataset contains 2,295 real indoor RGB-D panoramas of more than 12 scene types, e.g., kitchen, office, and corridor, and 10 general 3D layouts, e.g., “L”-shape and “T”-shape. Moreover, we leverage the depth channel of the dataset images and introduce two depth-based evaluation metrics for comparing general Manhattan layout reconstruction performance.
The experimental results demonstrate that: (1) LayoutNet’s decoder can better capture global room shape context, performing the best for cuboid layout reconstruction and being robust to foreground occlusions. (2) For non-cuboid layout estimation, DuLa-Net and HorizonNet’s decoders can better capture detailed layout shapes like pipes, showing better generalization to various complex layout types. Their simplified network output representations also take much less time in the post-processing step. (3) At the component level, a pre-trained and denser ResNet encoder and random stretching data augmentation can help boost performance for all methods. For LayoutNet, the post-processing method that works under the equirectangular view performs better. For DuLa-Net and HorizonNet, the post-processing step under ceiling view is more suitable. We hope our analysis and discoveries can inspire researchers to build more robust and efficient 3D layout reconstruction methods from a single panoramic image in the future.
Our contributions are:
We introduce two frameworks, LayoutNet v2 and DuLa-Net v2, which extend the corresponding state-of-the-art approaches of 3D Manhattan layout reconstruction from an RGB panoramic image. Our approaches compare well in terms of speed and accuracy and achieve the best results for cuboid layout reconstruction.
We conduct extensive experiments for LayoutNet v2, DuLa-Net v2 and another state-of-the-art approach, HorizonNet. We discuss the effects of encoders, post-processing steps, performance for different room shapes, and time consumption. Our investigations can help inspire researchers to build up more robust and efficient approaches for single panoramic image 3D layout reconstruction.
We extend the Matterport3D dataset with general Manhattan layout annotations. The annotations contain panoramas depicting room shapes of various complexity. The dataset will be made publicly available. In addition, two depth-based evaluation metrics are introduced for measuring the performance of general Manhattan layout reconstruction.
2 Related Work
There are numerous papers that propose solutions for estimating a 3D room layout from a single image. The solutions differ in the layout shapes (i.e., cuboid layout vs. general Manhattan layout), inputs (i.e., perspective vs. panoramic image), and methods to predict geometric features and fit model parameters.
In terms of room layout assumptions, a popular choice is the “Manhattan world” assumption coughlan1999manhattan, meaning that all walls are aligned with a canonical coordinate system coughlan1999manhattan; ramalingam2013manhattan. To make the problem easier, a more restrictive assumption is that the room is a cuboid hedau2009recovering; dasgupta2016delay; lee2017roomnet, i.e., there are exactly four room corners in the top-down view. Recent state-of-the-art methods zou2018layoutnet; yang2019dula; sun2019horizonnet adopt the Manhattan world assumption but allow for room layout with arbitrary complexity.
In terms of the type of input images, the images may differ in the FoV (field of view), ranging from monocular (i.e., taken from a standard camera) to 360 panoramas, and in whether depth information is provided. The methods then largely depend on the input image type. The problem is probably most difficult when only a monocular RGB image is given. Typically, geometric (e.g., lines and corners) lee2009geometric; hedau2009recovering; ramalingam2013lifting and/or semantic (e.g., segmentation into different regions hoiem2005geometric; hoiem2007recovering and volumetric reasoning gupta2010estimating) “cues” are extracted from the input image, a set of room layout hypotheses is generated, and then an optimization or voting process is used to rank and select the best one among the hypotheses.
Traditional methods treat the task as an optimization problem. An early work by Delage et al. delage2006dynamic fit floor/wall boundaries in a perspective image taken by a level camera to create a 3D model under the Manhattan world assumption using dynamic Bayesian networks. Most methods are based on finding best-fitting hypotheses among detected line segments lee2009geometric, vanishing points hedau2009recovering, or geometric contexts hoiem2005geometric. Subsequent works follow a similar approach, with improvements to layout generation schwing2012efficient; schwing2012efficient_eccv; ramalingam2013manhattan, features for scoring layouts schwing2012efficient_eccv; ramalingam2013manhattan, and incorporation of object hypotheses hedau2010thinking; gupta2010estimating; del2012bayesian; del2013understanding; zhao2013scene or other context.
Recently, neural network-based methods have made great strides in tackling this problem. There exist methods that train deep networks to classify pixels into layout surfaces (e.g., walls, floor, ceiling) dasgupta2016delay; izadinia2017im2cad, boundaries mallya2015learning, corners lee2017roomnet, or a combination ren2016coarse. A trend is that the neural networks generate increasingly higher levels of information, starting from line segments mallya2015learning; stpio and surface labels dasgupta2016delay, to room types lee2017roomnet and room boundaries and corners zou2018layoutnet, to facilitate the final layout generation process. Recent methods push further by using neural networks to directly predict a 2D floor plan yang2019dula or three 1D vectors that concisely encode the room layout sun2019horizonnet. In both cases, the final room layouts are reconstructed by a simple post-processing step.
Another line of work aims to leverage extra depth information for room model reconstruction, including utilizing a single depth image for 3D scene reconstruction zhang2013estimating; zou2019complete; liu2016layered, and scene reconstruction from point clouds newcombe2011kinectfusion; monszpart2015rapter; liu2018floornet; cabral2014piecewise. Liu et al. liu2015rent3d present Rent3D, which takes advantage of a known floor plan. Note that neither estimated depths nor reconstructed 3D scenes necessarily equate to a clean room layout, as such inputs may contain clutter.
The seminal work by Zhang et al. zhang2014panocontext advocates the use of 360 panoramas for indoor scene understanding, for the reason that the FoV of 360 panoramas is much more expansive. Work in this direction flourished, including methods based on optimization approaches over geometric Fukano2016RoomRF; pintore2016omnidirectional; yang2016efficient; xu2017pano2cad and/or semantic cues xu2017pano2cad; automatic and later based on neural networks lee2017roomnet; zou2018layoutnet. Most methods rely on leveraging existing techniques for single perspective images on samples taken from the input panorama. The LayoutNet introduced by Zou et al. zou2018layoutnet was the first approach to predict room layout directly on the panorama, which led to better performance. Yang et al. yang2019dula and Pintore et al. pintore2016omnidirectional follow a similar idea and propose to predict directly in the top-down view converted from the input panorama. In this manner, the vertical lines in the panorama become radial lines emanating from the image center. An advantage of this representation is that the room layout becomes a closed loop in 2D that can be extracted more easily. As mentioned in yang2019dula, the ceiling view is arguably better as it provides a clutter-free view of the room layout.
Datasets with detailed ground truth 3D room layouts play a crucial role in both network training and performance validation. In this work, we use three public datasets for the evaluation: PanoContext zhang2014panocontext, Stanford 2D-3D stfd2d3d, and Matterport3D chang2017matterport3d. All three datasets are composed of RGB(D) panoramic images of various indoor scene types and differ from each other in the following intrinsic properties: (1) the complexity of room layout; (2) the diversity of scene types; and (3) the scale of the dataset. For the datasets that lack ground truth 3D layouts, we further extend their annotations with detailed 3D layouts using an interactive annotator, PanoAnnotator yang2018panoannotator. A few sample panoramic images from the chosen datasets are shown in Fig. 2. We briefly describe each dataset and discuss the differences as follows.
3.1 PanoContext Dataset
The PanoContext zhang2014panocontext dataset contains RGB panoramic images of two indoor environments, i.e., bedrooms and living rooms, and all the images are annotated as cuboid layouts. For the evaluation, we follow the official train-test split and further hold out 10% of the training samples as validation images, chosen carefully such that similar rooms do not appear in the training split.
3.2 Stanford 2D-3D Dataset
Stanford 2D-3D stfd2d3d dataset contains RGB panoramic images collected from large-scale indoor environments, including offices, classrooms, and other open spaces like corridors. Since the original dataset does not provide ground truth layout annotations, we manually labeled the cuboid layouts using the PanoAnnotator. The Stanford 2D-3D dataset is more challenging than PanoContext as the images have smaller vertical FOV and more occlusions on the wall-floor boundaries. We follow the official train-val-test split for evaluation.
3.3 Our Labeled MatterportLayout Dataset
We carefully selected RGBD panoramic images from Matterport3D chang2017matterport3d dataset and extended the annotations with ground truth 3D layouts. We call our collected and relabeled subset the MatterportLayout dataset.
Matterport3D chang2017matterport3d dataset is a large-scale RGB-D dataset containing over ten thousand RGB-D panoramic images collected from 90 building-scale scenes. Matterport3D has the following advantages over the other datasets:
covers a larger variety of room layouts (e.g., cuboid, “L”-shape, “T”-shape rooms, etc) and over 12 indoor environments (e.g., bedroom, office, bathroom and hallway, etc);
has aligned ground truth depth for each image, allowing quantitative evaluations for layout depth estimation; and
is three times larger in scale than PanoContext and Stanford 2D-3D, providing rich data for training and evaluating our approaches.
Note that there also exists the Realtor360 dataset introduced in yang2019dula, which contains over indoor panoramas and annotated 3D room layouts. However, Realtor360 cannot currently be made publicly available due to legal privacy issues.
The detailed annotation procedure and dataset statistics are elaborated as follows.
Annotation Process of MatterportLayout.
First, we collected from the Matterport3D a subset of images that has closed 3D space and Manhattan 3D layout, and excluded images with artifacts resulting from stitching perspective views. Then we use the PanoAnnotator to annotate ground truth 3D layouts and obtain a dataset of RGB-D panoramas with detailed layout annotations. To evaluate the quality of estimated 3D layouts using the depth measurements, we further process the aligned ground truth depth maps to remove pixels that belong to foreground objects (e.g., furniture). Specifically, we align the ground truth depth map to the rendered depth map of the annotated 3D layout and mask out inconsistent pixels between two depth maps. For the alignment, we scale the rendered depth by normalizing camera height to 1.6m. We then mask out pixels in the ground truth depth map that are more than 0.15m away from their counterparts in the rendered depth map. In our experiment, we use the unmasked pixels for evaluation. See Fig. 3 for some examples.
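The depth masking step above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `evaluation_mask`, its parameter names, and the convention that a non-positive ground truth depth marks a missing measurement are hypothetical and not taken from the released annotation code.

```python
import numpy as np

def evaluation_mask(gt_depth, rendered_depth, rendered_cam_height,
                    target_cam_height=1.6, max_diff=0.15):
    """Boolean mask of ground-truth depth pixels kept for evaluation.

    gt_depth:            (H, W) measured depth in meters.
    rendered_depth:      (H, W) depth rendered from the annotated layout,
                         in the layout's own (arbitrary) scale.
    rendered_cam_height: camera height above the floor in that same scale.
    """
    # Rescale the rendered layout depth so its camera height equals 1.6 m.
    scale = target_cam_height / rendered_cam_height
    rendered_m = rendered_depth * scale
    # Assumption: non-positive ground-truth depth marks a missing measurement.
    valid = gt_depth > 0
    # Keep pixels whose measured depth agrees with the layout within 0.15 m;
    # everything else is treated as foreground (e.g., furniture) and dropped.
    return valid & (np.abs(gt_depth - rendered_m) <= max_diff)
```

Pixels where the mask is true are the ones that would enter a depth-based metric; pixels on furniture or other clutter fail the 0.15 m test and are excluded.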
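Table 1: Numbers of images annotated for different 3D room shapes, grouped by the number of corners in ceiling view (4, 6, 8, 10, 12, 14, 16, 18, 20, 22).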
MatterportLayout Dataset Statistics.
We use approximately 70%, 10%, and 20% of the data for training, validation, and testing. Images from the same room do not appear in different sets. Moreover, we ensure that the proportions of rooms with the same 3D shape in each set are similar. We show in Table 1 the total numbers of images annotated for different 3D room shapes. The 3D room shape is classified according to the number of corners in ceiling view: a cuboid shape has four corners, an “L”-shape room has six corners, a “T”-shape room has eight corners, etc. The MatterportLayout dataset covers a large variety of 3D room shapes, with approximately 52% cuboid rooms, 22% “L”-shape rooms, 13% “T”-shape rooms, and 13% more complex room shapes. The train, validation, and test sets have similar distributions of room shapes, so that results on the test set reflect the same mix of shapes seen during training.
4 Methods Overview
In this section, we introduce the common framework, the variants, and the impact of the design decisions of recently proposed approaches for 3D Manhattan layout reconstruction from a single panoramic image. Table 2 summarizes the key design choices that LayoutNet (Fig. 4), DuLa-Net (Fig. 5) and HorizonNet originally proposed in their papers respectively. Though all three methods follow the same general framework, they differ in the details. We unify some of the designs and training details and propose our modified LayoutNet and DuLa-Net methods as follows, which show better performance compared with the original ones.
4.1 General framework
The general framework can be decomposed into three parts. First, we discuss in Sec. 4.1.1 the input and pre-processing step. Second, we introduce the network design of encoder in Sec. 4.1.2, the decoder of layout pixel predictions in Sec. 4.1.3 and the training loss for each method in Sec. 4.1.4. Finally, we discuss the structured layout fitting in Sec. 4.1.5.
4.1.1 Input and Pre-processing
Given the input as a panorama that covers a 360° horizontal field of view, the first step for all methods is to align the image so that the floor plane is horizontal. The alignment, which was first proposed by LayoutNet and then inherited by DuLa-Net and HorizonNet, ensures that wall-wall boundaries are vertical lines and substantially reduces error. We estimate the floor plane direction under spherical projection using Zhang et al.’s approach zhang2014panocontext: select long line segments using the Line Segment Detector (LSD) von2008lsd in each overlapping perspective view, then vote for three mutually orthogonal vanishing directions using the Hough Transform. We then rotate the scene and re-project it to the 2D equirectangular projection. The aligned panoramic image is used as input for all three methods. To better predict layout elements, LayoutNet and DuLa-Net utilize additional input as follows.
LayoutNet additionally concatenates a Manhattan line feature map lying on three orthogonal vanishing directions using the alignment method as described in the previous paragraph.
DuLa-Net uses a two-branch design. In parallel with the input panoramic image, there is another ceiling-view perspective image projected from the panorama image by using the E2P module described as follows. The perspective image is assumed to be square with dimension $w \times w$. For every pixel in the perspective image at position $(p_x, p_y)$, the position of the corresponding pixel in the equirectangular panorama, $(p'_x, p'_y)$, $-1 \le p'_x \le 1$, $-1 \le p'_y \le 1$, is derived as follows. First, the field of view of the pinhole camera of the perspective image is defined as $FoV$. Then, the focal length can be derived as:
$$f = 0.5 \, w \cot(0.5 \, FoV)$$
$(p_x, p_y, f)$ is the 3D position of the pixel in the perspective image in the camera space. It is then rotated by 90° or -90° along the x-axis (counter-clockwise) if the camera is looking upward (e.g., looking at the ceiling) or downward (e.g., looking at the floor), respectively. Next, the rotated 3D position is projected to the equirectangular space. To do so, the rotated 3D position is first projected onto a unit sphere by vector normalization, the resulting 3D position on the unit sphere is denoted as $(s_x, s_y, s_z)$, and then the following formula is applied to project $(s_x, s_y, s_z)$ back to $(p'_x, p'_y)$, which is the corresponding 2D position in the equirectangular panorama:
$$(p'_x, p'_y) = \left( \frac{\operatorname{arctan2}(s_x, s_z)}{\pi}, \; \frac{\arcsin(s_y)}{0.5\pi} \right)$$
Finally, $(p'_x, p'_y)$ is used to interpolate a pixel value from the panorama. Note that this process is differentiable so it can be used in conjunction with back-propagation. DuLa-Net performs E2P with a $FoV$ of 160° and produces a square perspective image.
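The E2P mapping described above can be sketched as follows. This is a minimal NumPy illustration, not the DuLa-Net implementation: the function name `e2p_coords`, the choice of measuring pixel positions from the image center, and the axis convention (image y pointing down, so that the normalized vertical coordinate -1 corresponds to the top row of the panorama) are all assumptions made for the example.

```python
import numpy as np

def e2p_coords(w, fov_deg, look_up=True):
    """Normalized equirectangular coordinates (u, v) in [-1, 1] for every
    pixel of a w x w perspective view looking at the ceiling or the floor."""
    # Pinhole focal length from the field of view: f = 0.5 * w * cot(0.5 * FoV).
    f = 0.5 * w / np.tan(0.5 * np.radians(fov_deg))
    # Pixel positions measured from the image center (assumed convention).
    axis = np.arange(w) - (w - 1) / 2.0
    px, py = np.meshgrid(axis, axis)
    pts = np.stack([px, py, np.full_like(px, f)], axis=-1)  # (w, w, 3)
    # Rotate by +90 or -90 degrees about the x-axis (ceiling or floor view).
    angle = np.pi / 2 if look_up else -np.pi / 2
    c, s = np.cos(angle), np.sin(angle)
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    pts = pts @ rot_x.T
    # Project onto the unit sphere by vector normalization.
    pts = pts / np.linalg.norm(pts, axis=-1, keepdims=True)
    # Longitude and latitude, each normalized to [-1, 1].
    u = np.arctan2(pts[..., 0], pts[..., 2]) / np.pi
    v = np.arcsin(np.clip(pts[..., 1], -1.0, 1.0)) / (0.5 * np.pi)
    return u, v
```

The returned coordinate grids would then be passed to a bilinear sampler to read pixel values (or features) from the panorama; because every step is a smooth function of the inputs, the sampling can sit inside a network trained with back-propagation.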
4.1.2 Encoder
The three originally proposed methods use different encoders. LayoutNet uses SegNet as encoder while both DuLa-Net and HorizonNet use ResNet. Here we unify the encoder of the three approaches by using ResNet, which shows better performance in capturing layout features than SegNet in experiments (Sec. 5.2.2, Sec. 6). The ResNet encoder receives an RGB panoramic image under equirectangular view as input.
For LayoutNet, the last fully connected layer and the average pooling layer of the ResNet encoder are removed.
DuLa-Net uses a separate ResNet encoder for both panorama-branch and ceiling-branch. The panorama-branch has an output dimension of . For the ceiling-branch , the output dimension is .
HorizonNet performs a separate convolution for each of the feature maps produced by each block of the ResNet encoder. The convolution down samples each map by 8 in height and 16 in width, with feature size up-sampled to 256. The feature maps are then reshaped to 256 x 1 x 256 and concatenated based on the first dimension, producing the final bottleneck feature.
4.1.3 Layout Pixel Predictions
The layout pixel predictions can be corner and boundary positions under equirectangular view, or semantic floor map under ceiling view. We describe each type of prediction as follows.
Both LayoutNet and HorizonNet predict layout corners and boundaries under equirectangular projection. For LayoutNet, the decoder consists of two branches. The top branch, the layout boundary map predictor, decodes the bottleneck feature into a 2D feature map with the same resolution as the input. The boundary map is a 3-channel probability prediction of wall-wall, ceiling-wall and wall-floor boundaries on the panorama, for both visible and occluded boundaries. The boundary predictor contains layers of nearest neighbor up-sampling operation, each followed by a convolution layer with kernel size of , and the feature size is halved through layers from . The final layer is a Sigmoid operation. Skip connections are added to each convolution layer following the spirit of the U-Net structure ronneberger2015u, in order to prevent shifting of prediction results from the up-sampling step. The lower branch, the 2D layout corner map predictor, follows the same structure as the boundary map predictor and additionally receives skip connections from the top branch for each convolution layer. This stems from the intuition that layout boundaries imply corner positions, especially when a corner is occluded. It is shown in zou2018layoutnet that the joint prediction helps improve the accuracy of both maps, leading to a better 3D reconstruction result. We exclude the 3D regressor proposed in zou2018layoutnet as the regressor is shown to be ineffective in the original paper.
HorizonNet simplifies LayoutNet’s prediction by predicting three 1-D vectors with 1024 dimensions instead of two 512x1024 probability maps. The three vectors represent the ceiling-wall boundary position, the floor-wall boundary position, and the existence of a wall-wall boundary (or corner) in each image column. HorizonNet further applies an RNN block to refine the vector predictions, which considerably helps boost performance as reported in sun2019horizonnet.
DuLa-Net’s panorama-branch predicts a floor-ceiling probability map under equirectangular view. The map has the same resolution as the input. A pixel with a higher value in this map has a higher probability of being ceiling or floor. The decoder of the panorama-branch consists of 6 layers. Each of the first 5 layers contains a nearest neighbor up-sampling operation followed by a convolution layer and a ReLU activation function, and the channel number is halved from 512 (if using ResNet18 as an encoder). The final layer of the decoder replaces the ReLU with a Sigmoid to ensure the data range is in [0, 1]. The second branch of DuLa-Net predicts a 2D probability map under ceiling view, which is introduced in the next paragraph.
The decoder of DuLa-Net’s ceiling-branch has the same architecture as the panorama-branch (see Fig. 5). The decoder outputs a probability map. DuLa-Net then fuses the feature map from the panorama-branch to the ceiling-branch through the E2P projection module as described in Sec. 4.1.1. Applying fusion techniques increases the prediction accuracy. It is conjectured that, in a ceiling-view image, the areas near the image boundary (where some useful visual clues such as shadows and furniture arrangements exist) are more distorted, which can have a detrimental effect on the ability of the ceiling-branch to infer room structures. By fusing features from the panorama-branch (in which distortion is less severe), performance of the ceiling-branch can be improved.
DuLa-Net applies fusions before each of the first five layers of the decoders. For each fusion connection, an E2P conversion with the field of view set to 160° is applied to project the features under the equirectangular view to the perspective ceiling view. Each fusion works as follows:
where is the feature from ceiling-branch and is the feature from panorama-branch after applying the E2P conversion. and are the decay coefficients. is the index of the layer. After each fusion, the merged feature, , is sent into the next layer of ceiling-view decoder.
Note that DuLa-Net’s 2D floor plan prediction cannot predict the 3D layout height, which is an important parameter for 3D layout reconstruction. To infer the layout height, three fully connected layers are added to the middlemost feature of the panorama-branch. The dimensions of the three layers are 256, 64, and 1. To make the regression of the layout height more robust, dropout layers are added after the first two layers. To take the middlemost feature as input, DuLa-Net first averages the feature within each channel, which produces a 1-D feature with 512 dimensions, and takes it as the input of the fully connected layers.
4.1.4 Loss Function
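For LayoutNet, the overall loss function is:
$$L(m_E, m_C) = -\sum_{p \in m_E} \big( \hat{p} \log p + (1 - \hat{p}) \log(1 - p) \big) - \sum_{q \in m_C} \big( \hat{q} \log q + (1 - \hat{q}) \log(1 - q) \big)$$
Here $m_E$ is the probability that each image pixel is on the boundary between two walls; $m_C$ is the probability that each image pixel is on a corner; $p$ and $q$ are pixel probabilities of edge and corner with ground truth values of $\hat{p}$ and $\hat{q}$, respectively. The loss is the summation over the binary cross entropy error of the predicted pixel probability in $m_E$ and $m_C$ compared with ground truth.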
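For DuLa-Net, the overall loss function is:
$$L = E_b(M_{FC}, M^*_{FC}) + E_b(M_{FP}, M^*_{FP}) + E_{L1}(H, H^*)$$
Here for $M_{FC}$ (the floor-ceiling probability map) and $M_{FP}$ (the floor plan probability map), we apply binary cross entropy loss:
$$E_b(x, x^*) = -\sum_i \big( x^*_i \log x_i + (1 - x^*_i) \log(1 - x_i) \big)$$
For $H$ (layout height), we use L1-loss:
$$E_{L1}(x, x^*) = \sum_i |x_i - x^*_i|$$
where $M^*_{FC}$, $M^*_{FP}$ and $H^*$ are the ground truth of $M_{FC}$, $M_{FP}$, and $H$.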
For the three channel 1-D prediction, HorizonNet applies L1-Loss for regressing the ceiling-wall boundary and floor-wall boundary position, and uses binary Cross Entropy loss for the wall-wall corner existence prediction.
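The loss terms named in this section are standard, and a small NumPy sketch may help make them concrete. The helper names below, the mean reduction over pixels, and the unweighted sums of terms are assumptions for illustration; the actual implementations may weight the terms differently.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and 0/1 targets."""
    # Clip so that log() never sees exactly 0 or 1.
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def l1_loss(pred, target):
    """Mean absolute difference between prediction and ground truth."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(np.abs(diff)))

def layoutnet_style_loss(boundary_pred, boundary_gt, corner_pred, corner_gt):
    # Cross entropy on the boundary map plus cross entropy on the corner map
    # (equal weights are an assumption of this sketch).
    return (binary_cross_entropy(boundary_pred, boundary_gt)
            + binary_cross_entropy(corner_pred, corner_gt))

def horizonnet_style_loss(bound_pred, bound_gt, corner_prob, corner_gt):
    # L1 on the per-column boundary positions plus cross entropy on the
    # per-column wall-wall corner existence (equal weights assumed).
    return l1_loss(bound_pred, bound_gt) + binary_cross_entropy(corner_prob, corner_gt)
```

A DuLa-Net-style loss would combine the same two building blocks: cross entropy on the two probability maps and L1 on the scalar layout height.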
4.1.5 Structured Layout Fitting
Given the 2D predictions (corners, boundaries and ceiling-view floor plan), the camera position and 3D layout can be directly recovered, up to a scale and translation, by assuming that the bottom corners are on the same ground plane and that the top corners are directly above the bottom ones. The layout shape is then further constrained to be Manhattan, so that intersecting walls are perpendicular, e.g., like a cuboid or “L”-shape in a ceiling view. The final output is a sparse and compact planar 3D Manhattan layout. The optimization can be performed under equirectangular view or ceiling view, as described below.
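The direct recovery step can be illustrated with basic spherical geometry. The sketch below assumes an aligned equirectangular image whose columns map linearly to longitude and rows to latitude, a y-up coordinate frame centered at the camera, and a camera height of 1.6 m to fix the otherwise free scale; the function names are hypothetical.

```python
import numpy as np

def pixel_to_angles(x, y, width, height):
    """Longitude in [-pi, pi) and latitude in [-pi/2, pi/2] of an image pixel."""
    lon = (x / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - y / height) * np.pi  # positive latitude is above the horizon
    return lon, lat

def floor_corner_3d(x, y_floor, width, height, cam_height=1.6):
    """3D position of a floor corner, assuming it lies on the ground plane."""
    lon, lat = pixel_to_angles(x, y_floor, width, height)
    if lat >= 0:
        raise ValueError("a floor corner must lie below the horizon")
    # Horizontal distance from the camera to the corner.
    dist = cam_height / np.tan(-lat)
    return np.array([dist * np.sin(lon), -cam_height, dist * np.cos(lon)])

def ceiling_height_above_camera(floor_xyz, y_ceiling, height):
    """Height of the ceiling corner directly above a recovered floor corner."""
    dist = np.hypot(floor_xyz[0], floor_xyz[2])
    lat = (0.5 - y_ceiling / height) * np.pi
    return dist * np.tan(lat)
```

Because the top corner is assumed to sit directly above the bottom one, the pair shares a horizontal distance, which is why a single floor pixel and a single ceiling pixel in the same column are enough to recover the wall height at that corner.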
Since LayoutNet’s network outputs, i.e., 2D corner and boundary probability maps, are under equirectangular view, the 3D layout parameters are optimized to fit the predicted 2D maps. The initial 2D corner predictions are obtained from the corner probability map that our network outputs as follows. First, the responses are summed across rows, to get a summed response for each column. Then, local maxima are found in the column responses, with distance between local maxima of at least 20 pixels. Finally, the two largest peaks are found along the selected columns. These 2D corners might not satisfy Manhattan constraints, so we perform optimization to refine the estimates.
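The three-step corner initialization above can be sketched as follows. This is an illustrative NumPy version, not the released code: the function names, the greedy strongest-first suppression, the circular treatment of the panorama's left and right edges, and the minimum row gap used to separate the two peaks within a column are all assumptions.

```python
import numpy as np

def select_corner_columns(corner_map, num_columns, min_dist=20):
    """Pick up to num_columns image columns with the strongest corner response."""
    col = corner_map.sum(axis=0)  # sum responses across rows
    width = col.size
    # Local maxima, treating the panorama as circular in the horizontal direction.
    is_peak = (col >= np.roll(col, 1)) & (col >= np.roll(col, -1)) & (col > 0)
    candidates = np.flatnonzero(is_peak)
    order = candidates[np.argsort(-col[candidates], kind="stable")]
    chosen = []
    for c in order:  # strongest first; skip peaks too close to a chosen one
        gaps = [min(abs(c - k), width - abs(c - k)) for k in chosen]
        if all(g >= min_dist for g in gaps):
            chosen.append(int(c))
        if len(chosen) == num_columns:
            break
    return sorted(chosen)

def two_peaks_in_column(corner_map, column, min_row_gap=20):
    """Rows of the two largest responses in a column (ceiling and floor corner)."""
    profile = corner_map[:, column].astype(float)
    first = int(np.argmax(profile))
    # Suppress the neighborhood of the first peak (assumed separation rule).
    lo, hi = max(0, first - min_row_gap + 1), first + min_row_gap
    profile[lo:hi] = -np.inf
    second = int(np.argmax(profile))
    return tuple(sorted((first, second)))
```

For a cuboid room one would call `select_corner_columns` with `num_columns=4` and then `two_peaks_in_column` once per selected column, giving eight initial 2D corners to refine.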
The ceiling level is initialized as the average (mean) of the 3D upper-corner heights. We then optimize for a better-fitting room layout, relying on both corner and boundary information to evaluate 3D layout candidate :
where denotes the 2D projected corner positions of . Cardinality of is #walls 2. The nearby corners are connected on the image to obtain which is the set of projected wall-ceiling boundaries, and which is the set of projected wall-floor boundaries (each with cardinality of #walls). denotes the pixel-wise probability value on the predicted . and denote the probability on . LayoutNet finds that adding wall-wall boundaries in the scoring function helps less, since the vertical pairs of predicted corners already reveals the wall-wall boundaries information.
Note that the cost function in Eqn. 7 is slightly different from the cost function originally proposed in LayoutNet: we revise the cost function to compute the average response across layout lines instead of the maximum response. In this way, we are able to produce a relatively smooth space for the gradient ascent based optimization introduced below. The originally proposed LayoutNet uses sampling to find the best ranked layout based on the cost function, which is time consuming and is constrained to the pre-defined sampling space. We instead use stochastic gradient ascent robbins1951stochastic to search for a local optimum of the cost function (we revised the SGD-based optimization implemented by Sun, with different loss term weights: https://github.com/sunset1995/pytorch-layoutnet). We demonstrate the performance boost from using gradient ascent in experiments (Sec. 5.2.2).
We extend the equirectangular view optimization for general Manhattan layout. Since LayoutNet’s network prediction might miss occluded corners, which are important for the post-processing step that relies on Manhattan assumption, we adopt HorizonNet’s post-processing step to find occluded corners for initialization before performing the fitting refinement in the equirectangular view.
DuLa-Net’s network outputs 2D floor plan predictions under ceiling view. Given the probability maps ( and ) and the layout height () predicted by the network, DuLa-Net reconstructs the final 3D layout in the following two steps:
Estimating a 2D Manhattan floor plan shape using the probability maps.
Extruding the floor plan shape along its normal according to the layout height.
For step 1, two intermediate maps, denoted as and , are derived from ceiling pixels and floor pixels of the floor-ceiling probability map using the E2P conversion. DuLa-Net further uses a scaling factor, , to register the with , where the constant is the distance between the camera and the ceiling.
Finally, a fused floor plan probability map is computed as follows:
Fig. 6 (a) illustrates the above process. The probability map is binarized using a threshold of . A bounding rectangle of the largest connected component is computed for later use. Next, the binary image is converted to a densely sampled piece-wise linear closed loop, which is then simplified using the Douglas-Peucker algorithm (see Fig. 6 (b)). A regression analysis is run on the edges, and the edges are clustered into sets of axis-aligned horizontal and vertical lines. These lines divide the bounding rectangle into several disjoint grid cells (see Fig. 6 (c)). The shape of the 2D floor plan is defined as the union of grid cells where the ratio of floor plan area is greater than (see Fig. 6 (d)). Note that this post-processing step does not impose an implicit constraint on layout shapes (cuboid or non-cuboid). To evaluate on cuboid room layouts, we directly use the bounding rectangle of the largest connected component as the predicted 2D floor plan for DuLa-Net.
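The final grid-cell selection step can be sketched as below. The sketch assumes that the binarized floor plan mask and the axis-aligned line positions (in pixels) have already been computed by the earlier steps; the function name `floor_plan_from_grid` and the default `min_ratio=0.5` are placeholders chosen for illustration, not values taken from DuLa-Net.

```python
import numpy as np

def floor_plan_from_grid(mask, x_lines, y_lines, min_ratio=0.5):
    """Union of grid cells whose floor plan coverage exceeds min_ratio.

    mask:    (H, W) boolean floor plan mask after binarization.
    x_lines: pixel columns of the vertical lines, including the bounding
             rectangle's left and right edges.
    y_lines: pixel rows of the horizontal lines, including the bounding
             rectangle's top and bottom edges.
    """
    out = np.zeros(mask.shape, dtype=bool)
    xs, ys = sorted(set(x_lines)), sorted(set(y_lines))
    for y0, y1 in zip(ys[:-1], ys[1:]):
        for x0, x1 in zip(xs[:-1], xs[1:]):
            cell = mask[y0:y1, x0:x1]
            # Ratio of floor plan pixels inside this grid cell.
            if cell.size and cell.mean() > min_ratio:
                out[y0:y1, x0:x1] = True
    return out
```

Voting per cell makes the result robust to small holes and stray pixels in the mask, and the output is axis-aligned by construction since every kept cell is a rectangle bounded by the clustered lines.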
For HorizonNet, although the prediction is done under equirectangular view, the post-processing step is done under ceiling view. First, the layout height is estimated by averaging over the predicted floor and ceiling positions in each column. Second, the scaled ceiling boundary and floor boundary are projected to the ceiling view, in the same way as DuLa-Net. Following LayoutNet’s approach, HorizonNet then initializes the corner positions by finding the most prominent wall-wall corner points and projecting them to the ceiling view. The orientations of the walls are retrieved by computing the first PCA component along the projected lines between two nearby corners. The best-fit layout scale is computed by voting under ceiling view. Finally, the 3D layout is reconstructed.
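The orientation step can be sketched as follows. This is a minimal illustration, assuming each wall is given as a set of 2D boundary points in the ceiling view; the function name and the sample points are hypothetical and not taken from the HorizonNet source code.

```python
import numpy as np

def wall_orientation(points):
    """Return 'x' or 'y': the axis a wall runs along, from its first PCA component.

    points: (N, 2) array-like of ceiling-view boundary points between two
    nearby corners (hypothetical input format).
    """
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    return "x" if abs(direction[0]) >= abs(direction[1]) else "y"

# Two noisy wall segments: one roughly horizontal, one roughly vertical.
wall_a = [(0.0, 0.02), (1.0, -0.01), (2.0, 0.03), (3.0, 0.0)]
wall_b = [(3.01, 0.0), (2.98, 1.0), (3.0, 2.0)]
orient_a = wall_orientation(wall_a)
orient_b = wall_orientation(wall_b)
```

Snapping each wall to the dominant axis of its first principal component is what makes the recovered floor plan Manhattan even when the projected boundary is noisy.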
For non-cuboid Manhattan layouts, some of the walls can be occluded from the camera position. LayoutNet finds the best-fit layout shape based on the 2D predictions, which might not be able to recover the occluded layout corners and boundaries. DuLa-Net fits a polygon to the predicted 2D floor plan, which explicitly enforces that neighboring walls are orthogonal to each other. HorizonNet detects occlusions by checking the orientation of the first PCA component for nearby layout walls. If two neighboring walls are parallel to each other, HorizonNet hallucinates the occluded walls. We conjecture that the difference in handling occlusions is the main reason why LayoutNet performs better than DuLa-Net and HorizonNet for cuboid layouts (no occlusions) but slightly worse for non-cuboid layouts.
4.2 Implementation Details
We implement LayoutNet and DuLa-Net using PyTorch. For HorizonNet, we directly use their PyTorch source code available online for comparison. For implementation details, we summarize the data augmentation methods in Sec. 4.2.1 and the training scheme and hyper-parameters in Sec. 4.2.2.
4.2.1 Data augmentation
We show in Table 3 the summary of the different data augmentations originally proposed in each method. All three methods use horizontal rotation, left-right flipping and luminance change to augment the training samples. We unify the data augmentation by adding random stretching (introduced below) to our modified LayoutNet and DuLa-Net methods.
Random stretching is introduced by HorizonNet. The augmentation exploits the geometry of panoramic images: it projects the pixels into 3D space, stretches them along the 3D axes, and then re-projects and interpolates them back onto the equirectangular image to augment the training data. The effectiveness of this approach has been demonstrated in sun2019horizonnet.
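The coordinate mapping behind this augmentation can be sketched as follows. This is an illustration only, not HorizonNet's code: it transforms angular coordinates (for example, annotated corner positions) rather than resampling an image, and the function name, the axis convention and the factors `kx` and `ky` are assumptions.

```python
import numpy as np

def stretch_equirect_coords(lon, lat, kx=1.0, ky=1.0):
    """Map equirectangular angles through a stretch of the two horizontal axes.

    lon, lat : longitude in [-pi, pi] and latitude in [-pi/2, pi/2], in radians.
    kx, ky   : hypothetical stretching factors along the horizontal 3D axes.
    """
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    # Unit viewing direction for each (longitude, latitude) pair.
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    # Stretch the horizontal axes; the vertical axis is left unchanged,
    # so points on the floor and ceiling planes stay on those planes.
    x, y = kx * x, ky * y
    # Re-project the stretched direction back to angular coordinates.
    new_lon = np.arctan2(y, x)
    new_lat = np.arctan2(z, np.hypot(x, y))
    return new_lon, new_lat

# A floor point straight ahead moves toward the horizon when the room is
# stretched along the viewing direction.
lon2, lat2 = stretch_equirect_coords(0.0, -np.pi / 4, kx=2.0)
```

Applying the same mapping to every pixel, followed by interpolation, would give the stretched panorama; applying it to the corner annotations keeps the labels consistent with the augmented image.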
Ground truth smoothing.
For LayoutNet, the target 2D boundary and corner maps are both binary maps that consist of thin curves (boundary map) or points (corner map) on the images, respectively. This makes training more difficult. For example, if the network predicts a corner position slightly off the ground truth, a huge penalty is incurred. Instead, LayoutNet dilates the ground truth boundary and corner maps with a factor of 3 and then smooths the image with a Gaussian kernel. Note that even after smoothing, the target image still contains zero values, so the back-propagated gradients of the background pixels are re-weighted by a constant factor. HorizonNet also uses this smoothing strategy for its wall-wall corner existence prediction. Smoothing does not suit DuLa-Net, since DuLa-Net predicts the complete floor map with clear boundaries.
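A minimal sketch of such a target construction is given below. The dilation radius, the Gaussian sigma and the background weight are placeholder assumptions rather than the settings used by LayoutNet, and the helper names are hypothetical.

```python
import numpy as np

def dilate(binary, radius):
    """Binary dilation with a (2*radius+1) square window, via shifted maxima."""
    h, w = binary.shape
    padded = np.pad(binary, radius)
    out = np.zeros_like(binary)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out = np.maximum(out, padded[dy:dy + h, dx:dx + w])
    return out

def gaussian_smooth(img, sigma):
    """Separable Gaussian blur with a kernel truncated at 3 sigma.

    Assumes the image is larger than the kernel in both dimensions.
    """
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    rows = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 0, rows)

def smooth_target(binary_map, radius=1, sigma=2.0, background_weight=0.2):
    """Return (soft target, per-pixel gradient weights).

    All three parameters are illustrative assumptions.
    """
    target = gaussian_smooth(dilate(binary_map.astype(float), radius), sigma)
    # Pixels that are still exactly zero are background: down-weight them.
    weights = np.where(target > 0, 1.0, background_weight)
    return target, weights

corner_map = np.zeros((32, 32))
corner_map[16, 16] = 1.0  # a single ground truth corner
target, weights = smooth_target(corner_map)
```

With a soft target, a prediction one pixel away from the ground truth corner still matches a high target value, so the penalty grows gradually with distance instead of jumping at the first pixel of error.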
4.2.2 Training Scheme and Parameters
We use pre-trained weights on ImageNet to initialize the ResNet encoders. We perform random stretching during training. For each method, we use the same hyper-parameters when evaluating on the different datasets.
Table 4: Performance of LayoutNet v2, DuLa-Net v2 and HorizonNet on the PanoContext dataset with different encoders. Columns: Encoder; 3D IoU (%), Corner Error (%) and Pixel Error (%), each reported for LayoutNet v2, DuLa-Net v2 and HorizonNet.
Table 5: Performance of LayoutNet v2, DuLa-Net v2 and HorizonNet on the Stanford 2D-3D dataset with different encoders. Columns: Encoder; 3D IoU (%), Corner Error (%) and Pixel Error (%), each reported for LayoutNet v2, DuLa-Net v2 and HorizonNet.
LayoutNet v2 uses the Adam kingma2014adam optimizer to update the network parameters. To train the network, we first train the layout boundary prediction branch, then fix the weights of the boundary branch and train the corner prediction branch, and finally train the whole network end-to-end. To avoid unstable learning of the batch normalization (bn) layers in the ResNet encoder caused by the small batch size, we freeze the parameters of the bn layers when training end-to-end. The batch size is 4 for the ResNet-18 and ResNet-34 encoders and 2 for the ResNet-50 encoder. The latter is too small for stable training of the bn layers and leads to a performance drop compared with LayoutNet v2 using the ResNet-18 or ResNet-34 encoder, as shown in Table 4 and Table 5 in the experiments.
4.3 Summarization of Modifications
As introduced in Sec. 4.1 and Sec. 4.2, we unify some of the designs and training details and propose modified LayoutNet and DuLa-Net methods. In this section, we summarize our modifications to LayoutNet (denoted as “LayoutNet v2”) and DuLa-Net (denoted as “DuLa-Net v2”) as follows.
For LayoutNet v2, we use a pre-trained ResNet encoder instead of a SegNet encoder trained from scratch. We add random stretching data augmentation. We perform 3D layout fitting using gradient ascent optimization instead of a sampling-based search scheme. We extend the equirectangular view optimization to general Manhattan layouts.
For DuLa-Net v2, we use deeper ResNet encoders instead of ResNet-18 and add random stretching data augmentation.
5 Experiments and Discussions
In this section, we evaluate the performance of LayoutNet v2, DuLa-Net v2 and HorizonNet introduced in Sec. 4. We describe the evaluation metrics in Sec. 5.1 and compare the methods on PanoContext dataset and Stanford 2D-3D dataset for cuboid layout reconstruction in Sec. 5.2. We evaluate performance on MatterportLayout for general Manhattan layout estimation in Sec. 5.3. Finally, based on the experiment results, we discuss the advantages and disadvantages of each method in Sec. 5.4.
5.1 Evaluation Setup
We use the following six evaluation metrics:
Corner error, which is the distance between the predicted layout corners and the ground truth under equirectangular view. The error is normalized by the image diagonal length and averaged across all images.
Pixel error, which is the pixel-wise error of the semantic layout prediction (wall, ceiling, and floor) compared to the ground truth. The error is averaged across all images.
3D IoU, defined as the volumetric intersection over union between the predicted 3D layout and the ground truth. The result is averaged over all the images.
2D IoU, defined as the pixel-wise intersection over union between predicted layout under ceiling view and the ground truth. The result is averaged over all the images.
rmse, defined as the root mean squared error between the predicted layout depth $d_p$ and the ground truth $d_p^*$ over the set of pixels $N$: $\sqrt{\frac{1}{|N|}\sum_{p \in N} (d_p - d_p^*)^2}$. We use the true camera height, which is 1.6 for each image, to generate the predicted depth map. The result is averaged over all the images.
$\delta_1$, defined as the percentage of pixels where the ratio (or its reciprocal) between the prediction and the label is within a threshold of 1.25: $\max(d_p / d_p^*, d_p^* / d_p) < 1.25$.
We use corner error, pixel error, and 3D IoU to evaluate the performance of cuboid layout reconstruction. For general Manhattan layout reconstruction, since the predicted layout shape can be different from the ground truth shape, we use 3D IoU, 2D IoU and the depth measurements (i.e., rmse and $\delta_1$) for evaluation.
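The depth-based metrics and the 2D IoU are simple to state in code. The sketch below assumes the predicted and ground truth depth maps are NumPy arrays of positive values and that the ceiling-view layouts are boolean masks; the function names and the toy data are illustrative.

```python
import numpy as np

def depth_rmse(pred, gt):
    """Root mean squared error between predicted and ground truth depth."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def delta_1(pred, gt, thresh=1.25):
    """Fraction of pixels whose depth ratio (or its reciprocal) is below thresh."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thresh))

def iou_2d(pred_mask, gt_mask):
    """Pixel-wise intersection over union of two boolean ceiling-view masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union)

# Toy data: half of the pixels are exact, half are off by a factor of two.
gt_depth = np.full(4, 2.0)
pred_depth = np.array([2.0, 2.0, 4.0, 4.0])

gt_mask = np.zeros((4, 4), dtype=bool)
gt_mask[0:2, :] = True
pred_mask = np.zeros((4, 4), dtype=bool)
pred_mask[1:3, :] = True
```

Unlike the corner error, none of these three quantities requires a one-to-one correspondence between predicted and ground truth corners, which is why they remain well defined when the predicted shape has a different number of corners.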
5.2 Performance on PanoContext and Stanford 2D-3D
In this experiment, we evaluate the performance of LayoutNet v2, DuLa-Net v2, and HorizonNet on the PanoContext dataset and the Stanford 2D-3D dataset, both of which consist of cuboid layouts. For all three methods, we use a unified (ResNet) encoder and analyze the performance of different post-processing steps.
For the evaluation on PanoContext dataset, we use both the training split of PanoContext dataset and the whole Stanford 2D-3D dataset for training and vice versa for the evaluation on Stanford 2D-3D dataset. The split for validation and testing of each dataset is reported in Sec. 3. We use the same dataset setting for all three methods.
We show in Fig. 7 the qualitative results of the experiments on PanoContext dataset and Stanford 2D-3D dataset. All methods offer similar accuracy. LayoutNet v2 slightly outperforms on PanoContext and offers more robustness to occlusion, while DuLa-Net v2 outperforms in two of three metrics for Stanford 2D-3D.
5.2.1 Evaluation on Unified Encoder
Table 4 and Table 5 show the performance of LayoutNet v2, DuLa-Net v2 and HorizonNet on the PanoContext dataset and the Stanford 2D-3D dataset, respectively. In each row, we report the performance using the ResNet-18, ResNet-34, and ResNet-50 encoders, respectively. For both DuLa-Net v2 and HorizonNet, ResNet-50 obtains the best performance, indicating that a deeper encoder can better capture layout features. For LayoutNet v2, we observe a performance drop with ResNet-50. This is mainly due to the smaller batch size (we use 2 in the experiment, which is the maximum that fits on a single GPU with 12GB of memory), which leads to unstable training of the batch normalization layers in the ResNet encoder. We expect better performance from LayoutNet v2 with ResNet-50 when training on a GPU with more memory, but we consider this an unfair comparison with the other two methods since the hardware setup would be different. In general, LayoutNet v2 with ResNet-34 outperforms all other methods on the PanoContext dataset and obtains the lowest pixel error on the Stanford 2D-3D dataset. DuLa-Net v2, on the other hand, shows the best 3D IoU and corner error on the Stanford 2D-3D dataset. Note that the reported number for HorizonNet with ResNet-50 is slightly lower than that reported in the original paper. This is attributed to the difference in the training dataset, i.e., the authors used both the training split of the PanoContext dataset and the Stanford 2D-3D dataset for training. We thus retrain HorizonNet using our training dataset setting for a fair comparison.
5.2.2 Ablation Study
We show in Table 6 the ablation study for LayoutNet v2 on PanoContext, the dataset on which it performs best. The first row shows the performance reported in zou2018layoutnet. The proposed LayoutNet v2 with ResNet encoder, modified data augmentation and post-processing step boosts the overall performance by a large margin in 3D IoU. A large performance drop is observed when training the model from scratch (w/o ImageNet pre-training). Using gradient ascent for post-processing contributes the most to the performance boost (w/o gradient ascent), while adding random stretching data augmentation contributes less (w/o random stretching). Freezing the batch normalization layers when training end-to-end avoids unstable training of these layers when the batch size is small (w/o freeze bn layer). Including all modifications together achieves the best performance.
We show in Table 7 the ablation study for DuLa-Net v2 on the Stanford 2D-3D dataset. We obtain a performance boost in 3D IoU over the original model yang2019dula by using a deeper ResNet encoder (ResNet-50 vs. ResNet-18). As with LayoutNet v2, the random stretching data augmentation improves the performance only marginally (w/o random stretching).
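Table 6: Ablation study of LayoutNet v2 on the PanoContext dataset.
|Method||3D IoU (%)||Corner Error (%)||Pixel Error (%)|
|w/o ImageNet pre-train||78.71||0.89||2.57|
|w/o gradient ascent||83.60||0.73||2.12|
|w/o freeze bn layer||83.98||0.70||2.01|
|w/o random stretching||83.97||0.65||1.92|
|w/ DuLa-Net opt||81.45||0.90||2.73|
|w/ HorizonNet opt||82.70||0.77||2.15|
|w/ Semantic opt||84.35||0.65||1.96|
Table 7: Ablation study of DuLa-Net v2 on the Stanford 2D-3D dataset.
|Method||3D IoU (%)||Corner Error (%)||Pixel Error (%)|
|w/o random stretching||85.03||0.94||2.85|
Table 8: Timing statistics.
|Method||Optimization avg. CPU time (ms)||Network avg. GPU time (ms)|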
Comparison with different post-processing steps.
In this experiment, we compare the performance of LayoutNet v2 when using the post-processing steps of DuLa-Net v2 and HorizonNet, and when combining its own optimization step with an additional semantic loss. The post-processing step of HorizonNet uses the predicted layout boundaries and corner positions in each image column, which can be easily converted from the output of LayoutNet v2. To adapt DuLa-Net v2’s post-processing step, we train LayoutNet v2 to predict the semantic segmentation (i.e., a wall probability map) under equirectangular view as an additional channel in the boundary prediction branch. Then, we use the predicted floor-ceiling probability map as input to the post-processing step of DuLa-Net v2. Alternatively, we can also incorporate the predicted wall probability map into the layout optimization of LayoutNet v2. We add an additional loss term to Eqn. 7 for the average per-pixel value enclosed in the wall region of the predicted probability map, with a threshold of 0.5. We set the semantic term weight to 0.3, chosen by grid search on the validation set. As reported in Table 6 (rows 6-8), together with LayoutNet v2’s neural network, post-processing under equirectangular view performs better than post-processing under ceiling view. We found that the additional semantic optimization did not improve the post-processing step under equirectangular view. This is because the jointly predicted semantic segmentation is not accurate enough, achieving only 2.59% pixel error compared with the 1.79% pixel error of our proposed LayoutNet v2.
Another interesting study is to see whether the performance of DuLa-Net v2 will be affected by using the post-processing step that works on the equirectangular view. However, it is not clear how to convert from its output probability maps to layout boundaries and corner positions.
5.2.3 Timing Statistics
We show in Table 8 the timing performance of LayoutNet v2 with the ResNet-34 encoder, DuLa-Net v2 with the ResNet-50 encoder, and HorizonNet with the ResNet-50 encoder. We report the computation time of HorizonNet with the RNN refinement branch. Note that HorizonNet without the RNN takes only 8 ms for network prediction but produces less accurate results than the other approaches. We report the average time for a single forward pass of the network and for the post-processing step.
5.3 Performance on MatterportLayout
In this experiment, we compare the performance of the three methods on estimating general Manhattan layouts using the MatterportLayout dataset. For a detailed evaluation, we report the performance for layouts of different complexity. We categorize each layout shape according to the number of floor plan corners in the ceiling view, e.g., a cuboid has 4 corners, an “L”-shape has 6 corners, and a “T”-shape has 8 corners. The dataset split used for training/validation/testing is reported in Sec. 3.
Fig. 8 shows the qualitative comparisons of the three methods. All three methods have similar performance when the room shape is simpler, such as cuboid and “L”-shape rooms. For more complex room shapes, HorizonNet is capable of estimating thin structures like the walls shown in Fig. 8 (6th row, 1st column), but can also be confused by room boundaries reflected in a mirror, as shown in Fig. 8 (6th row, 2nd column). LayoutNet v2 tends to ignore thin layout structures like the bumped-out wall shown in Fig. 8 (7th row, 1st column). DuLa-Net v2 is able to estimate the occluded portion of the scene by using cues from the 2D ceiling view, as shown in Fig. 8 (8th row, 2nd column), but can also be confused by ceiling edges, as shown in Fig. 8 (8th row, last column).
Table 9: Quantitative comparison of the three methods on the MatterportLayout dataset. Metrics: 3D IoU, 2D IoU, rmse and $\delta_1$. For each metric, the columns are Method, Overall, 4 corners, 6 corners, 8 corners and 10 corners.
Table 9 shows the quantitative comparison of the three methods on estimating general Manhattan layouts using the MatterportLayout dataset. We consider the 3D IoU, 2D IoU and two depth accuracy measurements (i.e., rmse and $\delta_1$) for the performance evaluation. Overall, among the three methods, HorizonNet shows the best performance, while LayoutNet v2 has similar performance in 2D IoU and 3D IoU for cuboid room shapes. DuLa-Net v2 performs better than LayoutNet v2 for non-cuboid shapes, while being slightly worse than HorizonNet.
5.4.1 Why do LayoutNet v2 and HorizonNet perform differently on different datasets?
On the PanoContext dataset and the Stanford 2D-3D dataset, LayoutNet v2 outperforms the other two methods. However, on the MatterportLayout dataset, HorizonNet is the clear winner. We believe this is due to the different designs of the network decoders and the different representations of the networks’ outputs, which make each method perform differently for cuboid and non-cuboid layouts, as discussed below.
LayoutNet v2 relies more on global room shape context, i.e., it can predict one wall given the prediction of the other three walls. This stems from the two-branch network prediction of room boundaries and corners, in which the corner prediction is guided by the room boundaries: the boundary branch also receives gradients from erroneously predicted corners during training. HorizonNet, in contrast, puts more emphasis on local edge and corner responses, e.g., predicting whether a given column contains a corner and the positions of the floor and ceiling in that column. Direct evidence is that, when trained on the Stanford 2D-3D dataset, which contains only cuboid shapes, LayoutNet v2 predicts cuboid shapes only, while HorizonNet produces non-cuboid outputs. These characteristics are also reflected in the qualitative results shown in Fig. 8. As discussed in Sec. 5.3, LayoutNet v2 often misses thin layout structures such as pipes, while HorizonNet can be more sensitive to those thin structures. We also show in Fig. 9 the confusion matrix for correctly estimating the number of corners of the 3D layouts for each method. For the cuboid layout (4 corners), LayoutNet v2 shows the highest recall rate. However, LayoutNet v2 also tends to predict some non-cuboid layouts (e.g., 6 corners, 8 corners, 10 corners) as cuboid. On the other hand, DuLa-Net v2 and HorizonNet show better and comparable performance for estimating the non-cuboid room layouts. Therefore, the error in layout type prediction is the major cause of error for LayoutNet v2 in 3D reconstruction on the MatterportLayout dataset.
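A confusion matrix over corner counts of the kind discussed above can be computed as in the following sketch. The class list, function names and toy labels are assumptions made for illustration and are not the data behind Fig. 9.

```python
import numpy as np

def corner_confusion(gt_counts, pred_counts, classes=(4, 6, 8, 10)):
    """Rows index the ground truth corner count, columns the predicted one.

    Every count is assumed to be one of `classes`.
    """
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for g, p in zip(gt_counts, pred_counts):
        cm[index[g], index[p]] += 1
    return cm

def recall_per_class(cm):
    """Diagonal over row sums; classes with no ground truth samples get 0."""
    totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), totals,
                     out=np.zeros(len(cm)), where=totals > 0)

# Made-up labels: a method biased toward simpler shapes.
gt = [4, 4, 6, 6, 8]
pred = [4, 4, 4, 6, 6]
cm = corner_confusion(gt, pred)
recall = recall_per_class(cm)
```

A bias toward simpler shapes shows up as mass below the diagonal: non-cuboid rooms land in columns for smaller corner counts, while the recall for the 4-corner class stays high.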
5.4.2 Analysis and Future Improvements for DuLa-Net v2
DuLa-Net v2 is sensitive to the FOV parameter (introduced in the E2P projection in Sec. 4.1.1). A smaller FOV can lead to higher-quality predictions for most of the rooms, but some larger rooms could be clipped by the image plane after projection. A larger FOV produces fewer clipped rooms after projection, but the prediction quality for some rooms may decrease because the ground truth 2D floor plan is down-scaled in the ceiling view. In this paper, we use a single FOV setting, but we suggest that future work improve the prediction quality by combining the predictions of multiple networks trained with different FOVs. To give an idea of the potential improvement, we report the numbers for the MatterportLayout dataset after removing the rooms that are so large that they are clipped by the boundary of the projection under our FOV setting. In this case, the 3D IoU improves from 74.53 to 76.82.
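The trade-off can be made concrete with a little geometry. The sketch below assumes an idealized pinhole projection looking straight up at a horizontal ceiling plane with a square output image; the function names and all numeric values are illustrative assumptions, not DuLa-Net's settings.

```python
import math

def ceiling_view_half_extent(fov_deg, plane_distance):
    """Half-width of the ceiling region covered by the perspective view."""
    return plane_distance * math.tan(math.radians(fov_deg) / 2.0)

def is_clipped(max_wall_distance, fov_deg, plane_distance):
    """True if the farthest wall (measured along an image axis) falls outside the view."""
    return max_wall_distance > ceiling_view_half_extent(fov_deg, plane_distance)

def pixels_per_unit(image_size, fov_deg, plane_distance):
    """Resolution of the ceiling view: pixels per unit length on the ceiling plane."""
    return (image_size / 2.0) / ceiling_view_half_extent(fov_deg, plane_distance)

# Illustrative values: ceiling 1 unit above the camera, farthest wall 1.5 units away.
small_fov_clipped = is_clipped(1.5, fov_deg=90, plane_distance=1.0)
large_fov_clipped = is_clipped(1.5, fov_deg=140, plane_distance=1.0)
res_small = pixels_per_unit(512, 90, 1.0)
res_large = pixels_per_unit(512, 140, 1.0)
```

Under these assumptions the covered extent grows with tan(FOV/2), so widening the FOV avoids clipping large rooms but leaves fewer pixels per unit of floor plan, which matches the quality drop described above.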
6 Conclusions and Future Work
In this paper, we provide a thorough analysis of three state-of-the-art methods for 3D Manhattan layout reconstruction from a single RGB indoor panoramic image, namely LayoutNet, DuLa-Net, and HorizonNet. We further propose improved versions, called LayoutNet v2 and DuLa-Net v2, which incorporate certain advantageous components from HorizonNet. LayoutNet v2 performs the best on the PanoContext dataset and offers more robustness to occlusion. DuLa-Net v2 outperforms the other methods in two of three metrics on Stanford 2D-3D. To evaluate the performance on reconstructing general Manhattan layout shapes, we extend the Matterport3D dataset with general Manhattan layout annotations and introduce the MatterportLayout dataset. The annotations cover panoramas of both simple (e.g., cuboid) and complex room shapes. We introduce two depth-based evaluation metrics for evaluating the quality of reconstruction.
Future work can proceed in three directions: (1) Relax the Manhattan constraint to handle general layouts. In real cases, indoor layouts are more complex and can have non-Manhattan elements such as arches. One research direction is to study approaches that generalize across Manhattan layouts and non-Manhattan ones with curved ceilings or walls. (2) Use additional depth and normal information. Our approach is based on a single RGB image only, and rich geometric information can be acquired either from a depth map predicted from a single image or from depth maps captured by sensors. Incorporating depth features into both the network predictions and the post-processing step could lead to more accurate 3D layout reconstruction. (3) Extend to multi-view 3D layout reconstruction. Reconstruction from a single image is difficult due to occlusions. We can extend our approach to layout reconstruction from multiple images. Using multiple images can recover a more complete floor plan and scene layout, which has various applications such as virtual 3D room walk-throughs for real estate.