DuLa-Net: A Dual-Projection Network for Estimating Room Layouts from a Single RGB Panorama
Abstract
We present a deep learning framework, called DuLa-Net, to predict Manhattan-world 3D room layouts from a single RGB panorama. To achieve better prediction accuracy, our method leverages two projections of the panorama at once, namely the equirectangular panorama-view and the perspective ceiling-view, each of which contains different clues about the room layout. Our network architecture consists of two encoder-decoder branches for analyzing each of the two views. In addition, a novel feature fusion structure is proposed to connect the two branches, which are jointly trained to predict the 2D floor plans and layout heights. To learn more complex room layouts, we introduce the Realtor360 dataset, which contains panoramas of Manhattan-world room layouts with different numbers of corners. Experimental results show that our work outperforms recent state-of-the-art methods in prediction accuracy and runtime, especially in rooms with non-cuboid layouts.
1 Introduction
Inferring high-quality 3D room layouts from indoor panoramic images plays a crucial role in indoor scene understanding and can benefit various applications, including virtual/augmented reality and robotics. To that end, recent methods recover 3D room layouts by using deep learning to predict the room corners and boundaries on the input panorama. For example, LayoutNet [32] achieved impressive reconstruction accuracy for Manhattan-world-constrained rooms. However, clutter in the room, e.g., furniture, makes it challenging to extract critical corners and edges that are occluded in the input panorama. In addition, estimating a 3D layout from 2D corner and edge maps is an ill-posed problem, so extra constraints must be imposed in the optimization. Therefore, it remains challenging to process complex room layouts.
In this work, we present a novel end-to-end framework to estimate a 3D room layout from a single RGB panorama. Based on the intuition that a neural network may extract different kinds of features from the same panorama under different projections, we propose to predict the room layout from two distinct views of the panorama, namely the equirectangular panorama-view and the perspective ceiling-view. The network architecture follows the encoder-decoder scheme and consists of two branches, the panorama-branch and the ceiling-branch, for respectively analyzing images of the panorama-view and the ceiling-view. The outputs of the panorama-branch include a floor-ceiling probability map and a layout height, while the ceiling-branch outputs a floor plan probability map. To share information between the branches, we employ a feature fusion scheme that connects the first few layers of the decoders through an E2P conversion, which transforms intermediate feature maps from the equirectangular projection to the perspective ceiling-view. We find that better prediction performance is achieved by jointly training the two connected branches. The final 2D floor plan is then obtained by fitting an axis-aligned polygon to a fused floor plan probability map (see Figure 3 for details) and extruding it by the estimated layout height.
To learn from panoramas with complex layouts, we need a proper dataset for network training and testing. However, existing public datasets, such as the PanoContext [29] dataset, provide mostly labeled 3D layouts with simple cuboid shapes. To learn more complex layouts, we introduce a new dataset, Realtor360, which includes a subset of the SUN360 [23] dataset (593 living rooms and bedrooms) and 1980 panoramas collected from a real estate database. We annotated the whole dataset with a custom-made interactive tool to obtain the ground-truth 3D layouts.
A key feature of our dataset is that it contains rooms with more complex shapes in terms of the number of corners. The experimental results demonstrate that our method outperforms the current state-of-the-art method [32] in prediction accuracy, especially on rooms with more than four corners. Our method also takes much less time to compute the final room layouts. Fig. 1 shows some room layouts estimated by our method. Our contributions are summarized as follows:

We propose a novel network architecture that contains two encoder-decoder branches to analyze the input panorama in two different projections. These two branches are further connected through a feature fusion scheme. This dual-projection architecture can infer room layouts with more complex shapes beyond cuboids and L-shapes.

Our neural network is an important step towards a fully end-to-end architecture. It directly outputs a probability map of the 2D floor plan, which requires significantly less post-processing to obtain the final 3D room layout than the output of the current state of the art.

We introduce a new dataset, called Realtor360, that contains 2573 panoramas depicting rooms with 4 to 12 corners. To the best of our knowledge, this is the largest dataset of indoor images with room layout annotations currently available.
2 Related Work
Multiple papers propose solutions for estimating room layouts from a single image taken in an indoor environment. They mainly differ in three aspects: 1) the assumptions about the room layouts, 2) the types of input images, and 3) the methods. In terms of room layout assumptions, a popular choice is the "Manhattan world" assumption [3], meaning that all walls are aligned with a global coordinate system [3, 22]. To make the problem easier to solve, a more restrictive assumption is that the room is a cuboid [7, 4, 12], i.e., there exist exactly four room corners. Our method adopts the Manhattan world assumption but allows for arbitrary numbers of corners.
In terms of input images, the images may differ in the FoV (field of view), ranging from monocular (i.e., taken from a standard camera) to 360° panoramas, and in whether depth information is provided. The methods then largely depend on the input image types. The problem is probably most difficult when only a monocular RGB image is given. Typically, geometric (e.g., lines and corners) [13, 7, 21] and/or semantic (e.g., segmentation into different regions [8, 9] and volumetric reasoning [6]) "cues" are extracted from the input image, a set of room layout hypotheses is generated, and an optimization or voting process ranks and selects one of the hypotheses. Recently, neural network-based methods have made strides in tackling this problem. A trend is that the neural networks generate higher and higher levels of information, starting from line segments [16, 30] and surface labels [4], to room types [12] and room boundaries and corners [32], to make the final layout generation process increasingly easier to solve. Our method pushes this trend one step further by using neural networks to directly predict a 2D floor plan probability map that requires only a 2D polygon fitting process to produce the final 2D room layout.
If depth information is provided, there exist methods that estimate scene annotations including room layouts [27, 14, 28]. A deeper discussion is beyond the scope of this paper.
Closely related problems include depth estimation from a given image [31, 20] and scene reconstruction from point clouds [18, 17, 15]. Note that neither estimated depths nor reconstructed 3D scenes necessarily equate to a clean room layout, as such inputs may contain clutter.
360° panorama: The seminal work by Zhang et al. [29] advocates the use of 360° panoramas for indoor scene understanding because their FoV is much more expansive. Work in this direction flourished, including methods based on optimization over geometric [5, 20, 25] and/or semantic cues [24, 26], and later methods based on neural networks [12, 32]. Except for LayoutNet [32], most methods rely on applying existing techniques for single perspective images to samples taken from the input panorama. We believe this is a major reason for LayoutNet's superior performance, since it performs predictions on the panorama as a whole, thus extracting more of the global information that the input panorama might contain. A further step in this direction can be found in [20], in which the input panorama is projected to a 2D "floor" view, where the camera position maps to the center of the image and the vertical lines in the panorama become radial lines emanating from the image center. An advantage of this approach is that the room layout becomes a 2D closed loop that can be extracted more easily. We derived our "ceiling" view idea from this observation; instead of looking downward toward the floor, where all the clutter in the room is visible, we look upward toward the ceiling and get a more clutter-free view of the room layout.
3 Overview
Fig. 2 illustrates the overview of our framework. Given an equirectangular panoramic image as input, we follow the same pre-processing step used in PanoContext [29] to align the panoramic image with a global coordinate system, i.e., we make a Manhattan world assumption. Then, we transform the panoramic image into a perspective ceiling-view image through an equirectangular-to-perspective (E2P) conversion (Sec. 4). The panorama-view and ceiling-view images are then fed to a network consisting of two encoder-decoder branches. These two branches are connected via an E2P-based feature fusion scheme and jointly trained to predict a floor plan probability map, a floor-ceiling probability map, and the layout height (Sec. 5). Two intermediate probability maps are derived from the floor-ceiling probability map using the E2P conversion and combined with the floor plan probability map to obtain a fused floor plan probability map. The final 3D Manhattan layout is determined by extruding a 2D Manhattan floor plan, estimated from the fused floor plan probability map, by the predicted layout height (Sec. 6).
4 E2P conversion
In this section, we explain the formulation of the E2P conversion that transforms an equirectangular panorama to a perspective image. We assume the perspective image is square with dimension $w \times w$. For every pixel in the perspective image at position $(x, y)$, we derive the position of the corresponding pixel in the equirectangular panorama, $p_e = (p_x, p_y)$, as follows. First, we define the field of view of the pinhole camera of the perspective image as $FoV$. Then, the focal length can be derived as:

$$f = \frac{w}{2\tan(FoV/2)}.$$
$p_c = (x, y, f)$, the 3D position of the pixel of the perspective image in camera space, is then rotated by $90^\circ$ or $-90^\circ$ along the x-axis (counterclockwise) if the camera is looking upward (at the ceiling) or downward (at the floor), respectively.
Next, we project the rotated 3D position to the equirectangular space. To do so, we first project it onto the unit sphere by vector normalization, $p_s = p_c' / \lVert p_c' \rVert$, and apply the following formula:

$$p_e = \Big(\tfrac{1}{2} + \tfrac{1}{2\pi}\operatorname{atan2}(x_s, z_s),\;\; \tfrac{1}{2} - \tfrac{1}{\pi}\arcsin(y_s)\Big) \quad (1)$$

to project $p_s = (x_s, y_s, z_s)$, the 3D position on the unit sphere, back to $p_e$, the corresponding 2D position in the equirectangular panorama (in normalized image coordinates). Finally, we use $p_e$ to interpolate a pixel value from the panorama. We note that this process is differentiable, so it can be used in conjunction with back-propagation.
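The conversion above can be sketched in a few lines. The following is a minimal NumPy implementation of the sampling-grid computation under the conventions stated here (y-up camera, normalized panorama coordinates); the variable names are ours, and a real implementation would follow this with bilinear interpolation (e.g., `grid_sample` in PyTorch) to stay differentiable.

```python
import numpy as np

def e2p_grid(out_size, fov_deg, look_up=True):
    """For each pixel of a square perspective image, return the normalized
    (u, v) coordinates of the corresponding pixel in the equirectangular
    panorama, with u, v in [0, 1]."""
    w = out_size
    f = (w / 2.0) / np.tan(np.radians(fov_deg) / 2.0)  # focal length from FoV
    # Pixel coordinates centered at the principal point (image up is +y).
    xs, ys = np.meshgrid(np.arange(w) - w / 2.0 + 0.5,
                         np.arange(w) - w / 2.0 + 0.5)
    pts = np.stack([xs, -ys, np.full_like(xs, f)], axis=-1)  # camera space
    # Rotate about the x-axis so the view axis points at the zenith
    # (ceiling) or the nadir (floor).
    s = -1.0 if look_up else 1.0
    rot = np.array([[1, 0, 0],
                    [0, 0, -s],
                    [0, s, 0]], dtype=np.float64)
    pts = pts @ rot.T
    pts /= np.linalg.norm(pts, axis=-1, keepdims=True)       # unit sphere
    lon = np.arctan2(pts[..., 0], pts[..., 2])               # [-pi, pi]
    lat = np.arcsin(np.clip(pts[..., 1], -1.0, 1.0))         # [-pi/2, pi/2]
    u = 0.5 + lon / (2.0 * np.pi)
    v = 0.5 - lat / np.pi
    return u, v
```

Sampling the panorama at `(u, v)` with bilinear interpolation then produces the perspective ceiling-view (or floor-view) image.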
5 Network architecture
Our network architecture is illustrated in Fig. 2. It consists of two encoder-decoder branches for the panorama-view and the ceiling-view input images, which we refer to as the panorama branch and the ceiling branch. A key concept is that our network predicts both the floor plan and the layout height. With these two predictions, we can reconstruct a 3D room layout in a post-process (Sec. 6).
5.1 Encoder
We use ResNet-18 as the architecture for both encoders. The input of the panorama branch is the input panorama itself, while the input of the ceiling branch is a perspective ceiling-view image generated by applying the E2P conversion to the input panorama, with the $FoV$ set to $160^\circ$ and the output dimension $w$ set to 512. We also tried other, more computationally expensive architectures such as ResNet-50 for the encoders. However, we found no improvement in accuracy, so we chose ResNet-18 for simplicity.
5.2 Decoder
Both decoders consist of six convolutional layers. The first five layers are resize convolutions [1] with ReLU activations. The last layer is a regular convolution with a sigmoid activation. The numbers of output channels of the six layers are 256, 128, 64, 32, 16, and 1. To infer the layout height, we add three fully connected layers on top of the middlemost feature of the panorama branch. The dimensions of the three layers are 256, 64, and 1. To make the regression of the layout height more robust, we add dropout layers after the first two fully connected layers. To take the middlemost feature as input, we first apply global average pooling along both the x and y dimensions, which produces a 1D feature with 512 dimensions, and take it as the input to the fully connected layers.
The output of the panorama branch is a probability map of the floor and the ceiling in the equirectangular projection, denoted as the floor-ceiling probability map ($M_{fc}$). The output of the ceiling branch is a probability map of the floor plan in the ceiling view, denoted as the floor plan probability map ($M_{fp}$). Note that the panorama branch also outputs a predicted layout height ($H$).
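The decoder and the height regressor described above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the authors' released code: the 3×3 kernel size, the nearest-neighbor upsampling inside the resize convolutions, and the dropout probability are our assumptions; the channel widths and layer counts follow the text.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """One decoder branch: five resize convolutions (upsample + conv + ReLU,
    which avoid checkerboard artifacts [1]) followed by a 1-channel output
    layer with a sigmoid, producing a probability map."""
    def __init__(self, in_ch=512):
        super().__init__()
        chans = [in_ch, 256, 128, 64, 32, 16]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Upsample(scale_factor=2, mode='nearest'),
                       nn.Conv2d(cin, cout, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class HeightHead(nn.Module):
    """Layout-height regressor on the middlemost (encoder output) feature:
    global average pooling to a 512-d vector, then FC 256 -> 64 -> 1 with
    dropout after the first two layers (p is an assumed value)."""
    def __init__(self, in_dim=512, p=0.5):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(inplace=True), nn.Dropout(p),
            nn.Linear(256, 64), nn.ReLU(inplace=True), nn.Dropout(p),
            nn.Linear(64, 1))

    def forward(self, feat):          # feat: (B, C, H, W)
        v = feat.mean(dim=(2, 3))     # global average pool -> (B, C)
        return self.fc(v)
```

Five doublings of resolution mean a 4×4 encoder feature map comes out as a 128×128 probability map.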
5.3 Feature fusion
We find that applying fusion techniques to merge the features of the two branches increases the prediction accuracy. We conjecture the reason as follows. In a ceiling-view image, the areas near the image boundary (where useful visual clues such as shadows and furniture arrangements exist) are more distorted, which can have a detrimental effect on the ceiling-view branch's ability to infer room structures. By fusing in features from the panorama-view branch (in which this distortion is less severe), the performance of the ceiling-view branch can be improved.
We apply fusion before each of the first five decoder layers. For each fusion connection, an E2P conversion (Sec. 4) with the $FoV$ set to $160^\circ$ projects the features of the panorama branch, which are originally in the equirectangular view, to the perspective ceiling view. Each fusion works as follows:

$$f_{merge}^{(i)} = \alpha^{i} f_{c}^{(i)} + \beta^{i} \hat{f}_{p}^{(i)}, \quad (2)$$

where $f_{c}^{(i)}$ is the feature from the ceiling branch and $\hat{f}_{p}^{(i)}$ is the feature from the panorama branch after applying the E2P conversion. $\alpha$ and $\beta$ are the decay coefficients, and $i$ is the index of the layer. After each fusion, the merged feature $f_{merge}^{(i)}$ is sent into the next layer of the ceiling-branch decoder. The performance improvement of this technique is discussed in Sec. 8.
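One fusion step can be sketched as below, assuming the E2P conversion is realized as differentiable bilinear resampling with a precomputed sampling grid and that the decay weights scale the two features exponentially in the layer index. This is our reading of the fusion scheme, not the authors' code.

```python
import torch
import torch.nn.functional as F

def fuse(f_c, f_p, grid, alpha, beta, i):
    """One fusion step before ceiling-decoder layer i.
    f_c : ceiling-branch feature, (B, C, H, H), perspective ceiling view
    f_p : panorama-branch feature, (B, C, Hp, Wp), equirectangular view
    grid: (B, H, H, 2) E2P sampling grid with coordinates in [-1, 1]
    alpha, beta: decay coefficients (assumed to act as alpha**i, beta**i)
    """
    # Differentiable E2P: bilinearly resample the panorama feature onto
    # the perspective ceiling view.
    f_p2c = F.grid_sample(f_p, grid, mode='bilinear', align_corners=False)
    return (alpha ** i) * f_c + (beta ** i) * f_p2c
```

Because `grid_sample` is differentiable with respect to its input, gradients flow back into the panorama branch through every fusion connection.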
5.4 Loss function
For the floor-ceiling probability map $M_{fc}$ and the floor plan probability map $M_{fp}$, we apply a binary cross entropy loss:

$$L_{map}(M, \hat{M}) = -\frac{1}{N}\sum_{i=1}^{N}\big[\hat{M}_i \log M_i + (1 - \hat{M}_i)\log(1 - M_i)\big], \quad (3)$$

where $N$ is the number of pixels. For the layout height $H$, we use an L1 loss:

$$L_{height}(H, \hat{H}) = \lvert H - \hat{H} \rvert. \quad (4)$$

The overall loss function is:

$$L = L_{map}(M_{fc}, \hat{M}_{fc}) + L_{map}(M_{fp}, \hat{M}_{fp}) + \lambda\, L_{height}(H, \hat{H}), \quad (5)$$

where $\hat{M}_{fc}$, $\hat{M}_{fp}$, and $\hat{H}$ are the ground truths of $M_{fc}$, $M_{fp}$, and $H$.
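Putting the three terms together, the training objective can be sketched as follows; the weight `lam` on the height term is a placeholder for the value set in Sec. 5.5.

```python
import torch
import torch.nn.functional as F

def dula_loss(m_fc, m_fp, h, m_fc_gt, m_fp_gt, h_gt, lam=1.0):
    """Overall objective: binary cross entropy on the two probability maps
    plus an L1 term on the layout height. `lam` is an assumed placeholder."""
    l_fc = F.binary_cross_entropy(m_fc, m_fc_gt)   # floor-ceiling map term
    l_fp = F.binary_cross_entropy(m_fp, m_fp_gt)   # floor plan map term
    l_h = F.l1_loss(h, h_gt)                       # layout height term
    return l_fc + l_fp + lam * l_h
```

Both map predictions come out of a sigmoid, so they already lie in [0, 1] as `binary_cross_entropy` requires.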
5.5 Training details
We implement our method with PyTorch [19] and use the Adam [10] optimizer. For each training iteration, we augment the input panorama with random flipping and horizontal rotations by $0^\circ$, $90^\circ$, $180^\circ$, and $270^\circ$. Because we estimate the floor plan probability map in the ceiling view, we assume the distance between the camera and the ceiling to be 1.6 meters, and use this constant to normalize the ground truth.
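The horizontal-rotation augmentation is particularly cheap for equirectangular images, since a rotation about the vertical axis is just a circular shift of the image columns. A sketch of this step (ours, not the authors' code):

```python
import numpy as np

def augment_pano(pano, deg, flip):
    """Rotate an equirectangular panorama about the vertical axis by `deg`
    degrees (a circular shift of columns) and optionally mirror it."""
    h, w = pano.shape[:2]
    shift = int(round(deg / 360.0 * w))
    out = np.roll(pano, shift, axis=1)   # yaw rotation = column shift
    if flip:
        out = out[:, ::-1]               # horizontal flip
    return out
```

Restricting the rotations to multiples of 90° keeps the layout aligned with the Manhattan axes established in pre-processing.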
6 3D layout estimation
Given the probability maps ($M_{fc}$ and $M_{fp}$) and the layout height ($H$) predicted by the network, we reconstruct the final 3D layout in the following two steps:

1. Estimating a 2D Manhattan floor plan shape using the probability maps.

2. Extruding the floor plan shape along its normal according to the layout height.
For step 1, two intermediate maps, denoted as $M_{ceil}$ and $M_{floor}$, are derived from the ceiling pixels and the floor pixels of the floor-ceiling probability map using the E2P conversion. We further use a scaling factor, $(H - 1.6)/1.6$, to register $M_{floor}$ with $M_{ceil}$, where the constant 1.6 is the assumed distance between the camera and the ceiling (Sec. 5.5). Finally, a fused floor plan probability map is computed as follows:

$$M_{fuse} = \frac{1}{3}\big(M_{fp} + M_{ceil} + M_{floor}\big). \quad (6)$$
Fig. 3 (a) illustrates the above process. The probability map is first binarized with a fixed threshold. A bounding rectangle of the largest connected component is computed for later use. Next, we convert the binary image into a densely sampled piecewise-linear closed loop and simplify it using the Douglas-Peucker algorithm (see Fig. 3 (b)). We then run a regression analysis on the edges and cluster them into sets of axis-aligned horizontal and vertical lines. These lines divide the bounding rectangle into several disjoint grid cells (see Fig. 3 (c)). We define the shape of the 2D floor plan as the union of the grid cells whose ratio of floor plan area exceeds a fixed threshold (see Fig. 3 (d)).
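The grid-cell step above can be sketched as follows: the clustered axis-aligned lines (here given directly as pixel coordinates) partition the bounding region into cells, and a cell is kept if the fraction of floor-plan pixels inside it exceeds a threshold. The function name and the two threshold values are illustrative placeholders.

```python
import numpy as np

def fit_manhattan_cells(prob, xs, ys, thresh=0.5, ratio=0.5):
    """Grid-cell union step of the 2D floor plan fitting.
    prob: floor plan probability map (H, W)
    xs, ys: sorted pixel coordinates of the vertical / horizontal lines
    Returns a boolean mask over the grid cells that belong to the floor plan."""
    binar = prob >= thresh
    keep = np.zeros((len(ys) - 1, len(xs) - 1), dtype=bool)
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            cell = binar[ys[r]:ys[r + 1], xs[c]:xs[c + 1]]
            # keep the cell if enough of it is covered by the floor plan
            if cell.size and cell.mean() > ratio:
                keep[r, c] = True
    return keep
```

The union of the kept cells is by construction an axis-aligned (Manhattan) polygon, so no further regularization of the outline is needed.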
Table 1: Numbers of rooms in the Realtor360 dataset by layout complexity.

4 corners   6 corners   8 corners   10+ corners   Total
1246        950         316         61            2573
7 Realtor360 dataset
A dataset that contains a sufficient number of 3D room layouts with different numbers of corners is crucial for training as well as testing our network. Unfortunately, existing public-domain datasets, such as the PanoContext [29] dataset and the Stanford 2D-3D dataset labeled by Zou et al. [32], contain mostly layouts with a simple cuboid shape. To demonstrate that our framework is flexible enough to deal with rooms with an arbitrary number of corners, we introduce a new dataset, named Realtor360, that contains over 2500 indoor panoramas with annotated 3D room layouts. We classify each room according to its layout complexity, measured by the number of corners in the floor plan. Table 1 shows the statistics of the dataset, and a few visual examples can be found in Fig. 4. The panoramic images in Realtor360 come from two sources. The first is a subset of the SUN360 dataset [23], which contains 593 living-room and bedroom panoramas. The other is a real estate database with 1980 indoor panoramas acquired from a real-estate company. We annotate the 3D layouts of these indoor panoramas using a custom-made interactive tool, as explained below.
Table 2: Quantitative comparison and ablation study on the Realtor360 dataset. Each cell reports 2D IoU (%) / 3D IoU (%).

Method             Average        4 corners      6 corners      8 corners      10+ corners
LayoutNet [32]     65.84 / 62.77  80.41 / 76.60  60.50 / 57.87  41.16 / 41.16  22.35 / 22.35
ours (fc-only)     75.20 / 72.02  76.75 / 73.27  76.04 / 73.06  70.80 / 67.89  56.42 / 54.20
ours (fp-only)     75.75 / 72.18  79.66 / 75.54  75.42 / 72.23  70.51 / 67.39  51.03 / 48.57
ours (w/o fusion)  78.52 / 74.80  81.77 / 77.57  78.50 / 75.10  73.61 / 70.37  57.01 / 54.12
ours (full)        80.53 / 77.20  82.63 / 78.91  80.72 / 77.79  78.12 / 74.86  63.10 / 59.72
Table 3: Quantitative comparison on the Realtor360 dataset when both networks are trained only on cuboid (4-corner) layouts. Each cell reports 2D IoU (%) / 3D IoU (%).

Method          Average        4 corners      6 corners      8 corners      10+ corners
LayoutNet [32]  71.31 / 67.91  80.69 / 76.82  68.95 / 65.83  50.31 / 47.23  44.53 / 42.51
Ours (full)     77.87 / 74.16  82.42 / 78.30  77.19 / 73.74  70.81 / 67.55  54.05 / 50.96
Annotation tool.
To annotate the 2D indoor panoramas with high-quality 3D room layouts, we developed an interactive tool to facilitate the labeling process. The tool first leverages existing automatic methods to extract a depth map [11] and line segments [29] from the input panorama. Then, an initial 3D Manhattan-world layout is created by sampling the depth along the horizontal line in the middle of the panorama. The tool allows users to refine the initial 3D layout through a set of intuitive operations, including (i) pushing/pulling a wall, (ii) merging multiple walls, and (iii) splitting a wall. It also offers a handy function that snaps the layout edges to the estimated line segments during interactive editing to improve accuracy. We plan to release the dataset along with the annotation tool for academic use after publication of this work.
8 Experiments
We compare our method to LayoutNet [32], a state-of-the-art method in room layout estimation, through a series of quantitative and qualitative experiments on our Realtor360 dataset and the PanoContext [29] dataset. We also conduct an ablation study with several alternative configurations of our method. We adopt 2D and 3D Intersection over Union (IoU), a standard metric in similar tasks [2], to evaluate the accuracy of the estimated 2D floor plans and 3D layouts. All experiments use the same hyperparameters discussed in Sec. 5.5. Fig. 5 shows a few 3D room layouts with different numbers of corners estimated using our method. Please refer to the supplementary materials for more results of the following experiments.
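For reference, the 2D and 3D IoU metrics can be computed as follows when both the predicted and ground-truth layouts are represented as binary floor-plan masks extruded from a common floor level. This is our sketch of the standard metric, not the paper's evaluation code.

```python
import numpy as np

def iou_2d(mask_a, mask_b):
    """2D IoU between two binary floor-plan masks on the same grid."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def iou_3d(mask_a, h_a, mask_b, h_b):
    """3D IoU for two layouts extruded from the same floor level: the
    footprint intersection times the overlapping height, over the union
    of the two extruded volumes."""
    inter2d = np.logical_and(mask_a, mask_b).sum()
    vol_inter = inter2d * min(h_a, h_b)
    vol_union = mask_a.sum() * h_a + mask_b.sum() * h_b - vol_inter
    return vol_inter / vol_union if vol_union else 0.0
```

With equal predicted and ground-truth heights, the 3D IoU reduces to the 2D IoU of the footprints, which is why the two metrics in Tables 2 and 3 track each other closely.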
Evaluation on the Realtor360 dataset.
To train both LayoutNet [32] and our DuLa-Net on the Realtor360 dataset, we randomly selected 2169 panoramas for training and kept the remaining 404 panoramas for testing. We further classify the testing panoramas according to their numbers of corners. We run LayoutNet using the code and default hyperparameters released by the authors. The quantitative comparison with LayoutNet is shown in Table 2. We observe that LayoutNet delivers good performance on cuboid-shaped rooms (4 corners), similar to the numbers reported in their paper. However, its accuracy drops significantly as the number of corners increases. In comparison, our DuLa-Net not only outperforms LayoutNet on cuboid-shaped rooms by a small margin, but also performs well on rooms with larger numbers of corners. This leads to a significant overall performance gain in both 2D and 3D metrics compared to LayoutNet.
Since the 3D layout optimization and the hyperparameters of LayoutNet were tuned on a dataset that contains mostly cuboid-shaped rooms, we conducted another experiment in which both networks were trained on a revised training set that excludes rooms with non-cuboid layouts, while keeping the testing set untouched. Table 3 shows the quantitative results. Note that while the performance of LayoutNet improves, our method still outperforms it on all kinds of rooms.
From the qualitative comparison shown in Fig. 6, we observe a strong tendency of LayoutNet to predict rooms as cuboid-shaped, possibly due to the constraints imposed in its 3D layout optimization. In comparison, our method simplifies the problem by directly predicting a Manhattan-world floor plan without any assumption about the number of corners. We conjecture that this is a main reason why our method outperforms LayoutNet, especially on rooms with more than four corners.
We also conducted an ablation study that evaluates the performance of our method in the following configurations: 1) ours (fc-only): only the panorama-view branch, 2) ours (fp-only): only the ceiling-view branch, and 3) ours (w/o fusion): our full model but without feature fusion. The quantitative results in Table 2 show that jointly training both branches leads to better performance than training only one of them. In addition, adding feature fusion between the two branches further improves the performance.
Evaluation on the PanoContext dataset.
LayoutNet provided quantitative results on the PanoContext [29] dataset, with 414 panoramas for training and 53 panoramas for testing. All rooms are labeled as cuboid-shaped. For comparison, we trained our network on the same dataset. The quantitative comparison is shown in Table 4. Our model outperforms LayoutNet by a small margin.
Timing.
An end-to-end computation takes three main steps: 1) an alignment process that aligns the input panorama with a global coordinate system, 2) floor plan probability map prediction by our neural network, and 3) 2D floor plan fitting. Step 1 is the most time-consuming, taking about 13.37 s measured on a machine with a single NVIDIA 1080 Ti GPU and an Intel i7-7700 3.6 GHz CPU. Step 2 takes only 34.68 ms and step 3 only 21.71 ms.
LayoutNet carries out the same alignment process, and its neural network prediction is also very fast (39 ms). However, it requires a very time-consuming 3D layout optimization step at the end, which takes 30.5 s. In summary, an end-to-end computation by LayoutNet takes about 43.9 s while our method takes about 13.4 s, a speedup of 3.28×.
9 Conclusion
We present an end-to-end deep learning framework, called DuLa-Net, for estimating 3D room layouts from a single RGB panorama. We propose a new network architecture that consists of two encoder-decoder branches for analyzing features from two distinct views of the input panorama, namely the equirectangular panorama-view and the perspective ceiling-view. The two branches are connected through a novel feature fusion scheme and jointly trained to achieve the best accuracy in predicting the 2D floor plan and the layout height. To learn from complex layouts, we introduce a new dataset, Realtor360, which contains 2573 indoor panoramas of Manhattan-world room layouts of varying complexity. Both the quantitative and qualitative results demonstrate that our method outperforms the current state-of-the-art in prediction accuracy, especially on rooms with more than four corners, and takes much less time to compute the final 3D room layouts.
Limitations and future work.
Our method has the following limitations: i) without knowing object semantics, our network may get confused by rooms that contain mirrors or large occluding objects, as shown in Fig. 7; and ii) our 3D layout estimation involves heuristics and assumptions that might over- or under-estimate the underlying floor plan probability map and also restrict the results to the Manhattan world. We propose to explore the following directions in the near future. First, introducing object semantics, i.e., segmentation and labels, into the network architecture could potentially improve accuracy by excluding distracting and occluding objects from the floor plan prediction. Second, designing a principled algorithm for more robust 3D layout estimation, e.g., one without the Manhattan-world assumption that supports rooms with curved shapes. Last but not least, we believe that even better results can be achieved by experimenting with a larger range of encoders for our network architecture.
References
 [1] A. P. Aitken, C. Ledig, L. Theis, J. Caballero, Z. Wang, and W. Shi. Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize. CoRR, abs/1707.02937, 2017.
 [2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
 [3] J. M. Coughlan and A. L. Yuille. Manhattan world: Compass direction from a single image by Bayesian inference. pages 941–, 1999.
 [4] S. Dasgupta, K. Fang, K. Chen, and S. Savarese. Delay: Robust spatial layout estimation for cluttered indoor scenes. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 616–624, June 2016.
 [5] K. Fukano, Y. Mochizuki, S. Iizuka, E. Simo-Serra, A. Sugimoto, and H. Ishikawa. Room reconstruction from a single spherical image by higher-order energy minimization. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 1768–1773, 2016.
 [6] A. Gupta, M. Hebert, T. Kanade, and D. M. Blei. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1288–1296. Curran Associates, Inc., 2010.
 [7] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In 2009 IEEE 12th International Conference on Computer Vision, pages 1849–1856, Sept 2009.
 [8] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In Tenth IEEE International Conference on Computer Vision (ICCV'05), volume 1, pages 654–661, Oct 2005.
 [9] D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, Oct 2007.
 [10] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [11] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, pages 239–248. IEEE Computer Society, 2016.
 [12] C. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. RoomNet: End-to-end room layout estimation. CoRR, abs/1703.06241, 2017.
 [13] D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2136–2143, June 2009.
 [14] C. Liu, P. Kohli, and Y. Furukawa. Layered scene decomposition via the occlusion-CRF. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 165–173, June 2016.
 [15] C. Liu, J. Wu, and Y. Furukawa. FloorNet: A unified framework for floorplan reconstruction from 3D scans. In European Conference on Computer Vision (ECCV), 2018.
 [16] A. Mallya and S. Lazebnik. Learning informative edge maps for indoor scene layout prediction. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 936–944, Washington, DC, USA, 2015. IEEE Computer Society.
 [17] A. Monszpart, N. Mellado, G. J. Brostow, and N. J. Mitra. RAPter: Rebuilding man-made scenes with regular arrangements of planes. ACM Trans. Graph., 34(4):103:1–103:12, July 2015.
 [18] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136, Oct 2011.
 [19] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
 [20] G. Pintore, V. Garro, F. Ganovelli, E. Gobbetti, and M. Agus. Omnidirectional image capture on mobile devices for fast automatic generation of 2.5D indoor maps. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, March 2016.
 [21] S. Ramalingam and M. Brand. Lifting 3D Manhattan lines from a single image. In 2013 IEEE International Conference on Computer Vision, pages 497–504, 2013.
 [22] S. Ramalingam, J. K. Pillai, A. Jain, and Y. Taguchi. Manhattan junction catalogue for spatial reasoning of indoor scenes. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3065–3072, June 2013.
 [23] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recognizing scene viewpoint using panoramic place representation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2695–2702, June 2012.
 [24] J. Xu, B. Stenger, T. Kerola, and T. Tung. Pano2CAD: Room layout from a single panorama image. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 354–362, March 2017.
 [25] H. Yang and H. Zhang. Efficient 3d room shape recovery from a single panorama. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5422–5430, June 2016.
 [26] Y. Yang, S. Jin, R. Liu, S. Bing Kang, and J. Yu. Automatic 3D indoor scene modeling from a single panorama. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [27] J. Zhang, C. Kan, A. G. Schwing, and R. Urtasun. Estimating the 3d layout of indoor scenes and its clutter from depth sensors. In 2013 IEEE International Conference on Computer Vision, pages 1273–1280, Dec 2013.
 [28] Y. Zhang, M. Bai, P. Kohli, S. Izadi, and J. Xiao. DeepContext: Context-encoding neural pathways for 3D holistic scene understanding. In International Conference on Computer Vision (ICCV), 2017.
 [29] Y. Zhang, S. Song, P. Tan, and J. Xiao. PanoContext: A whole-room 3D context model for panoramic scene understanding. In Computer Vision – ECCV 2014, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part VI, pages 668–686, 2014.
 [30] H. Zhao, M. Lu, A. Yao, Y. Guo, Y. Chen, and L. Zhang. Physics inspired optimization on semantic transfer features: An alternative method for room layout estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [31] N. Zioulis, A. Karakottas, D. Zarpalas, and P. Daras. OmniDepth: Dense depth estimation for indoor spherical panoramas. In The European Conference on Computer Vision (ECCV), September 2018.
 [32] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. LayoutNet: Reconstructing the 3D room layout from a single RGB image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Appendix A Comparisons to LayoutNet [32]
A.1 Retraining LayoutNet
For the comparisons with LayoutNet, we used the neural network and the 3D optimization module provided by the authors of LayoutNet. According to our analysis, the part of the code concerning the 3D optimization module is specially designed for cuboid-shaped rooms and works very well. However, the part designed for non-cuboid rooms has multiple issues and is not sufficiently robust. Our analysis indicates that the algorithm performs best when it only tries to fit rooms with 4 corners. To provide the best possible version of LayoutNet, we experimented with three different settings for network training and report the results here: 1) LayoutNet with the original weights pre-trained by the authors; 2) LayoutNet retrained on the subset of our Realtor360 dataset that contains only cuboid layouts (4 corners); 3) LayoutNet retrained on our complete Realtor360 training set. The performance comparison is shown in Table 5. We observe that LayoutNet's performance improves significantly when it is retrained on our Realtor360 dataset, and improves further when it is retrained on cuboid-shaped rooms only, since the 3D optimization module works correctly only for such rooms. In the paper, we therefore report results for the best version of LayoutNet that we could create, i.e., LayoutNet retrained on the cuboid-layout (4-corner) subset of our Realtor360 dataset, followed by its 3D optimization module for cuboid-shaped rooms.
Table 5: Performance of LayoutNet under different training settings on the Realtor360 test set.

Metric      Training set         Average
3D IoU (%)  pre-trained          57.96
3D IoU (%)  Realtor360 (4-only)  67.91
3D IoU (%)  Realtor360 (all)     62.77