DuLa-Net: A Dual-Projection Network for Estimating Room Layouts from a Single RGB Panorama
We present a deep learning framework, called DuLa-Net, to predict Manhattan-world 3D room layouts from a single RGB panorama. To achieve better prediction accuracy, our method leverages two projections of the panorama at once, namely the equirectangular panorama-view and the perspective ceiling-view, each of which contains different clues about the room layout. Our network architecture consists of two encoder-decoder branches, one for each of the two views. In addition, a novel feature fusion structure is proposed to connect the two branches, which are then jointly trained to predict the 2D floor plans and layout heights. To learn more complex room layouts, we introduce the Realtor360 dataset, which contains panoramas of Manhattan-world room layouts with different numbers of corners. Experimental results show that our work outperforms the recent state of the art in prediction accuracy and runtime, especially for rooms with non-cuboid layouts.
Inferring high-quality 3D room layouts from indoor panoramic images plays a crucial role in indoor scene understanding and can be beneficial to various applications, including virtual/augmented reality and robotics. To that end, recent methods recover 3D room layouts by using deep learning to predict the room corners and boundaries on the input panorama. For example, LayoutNet achieved impressive reconstruction accuracy for Manhattan world-constrained rooms. However, clutter in the room, e.g., furniture, makes it difficult to extract critical corners and edges that are occluded in the input panorama. In addition, estimating 3D layouts from 2D corner and edge maps is an ill-posed problem and thus requires imposing extra constraints in the optimization. Therefore, it remains challenging to process complex room layouts.
In this work, we present a novel end-to-end framework to estimate a 3D room layout from a single RGB panorama. Based on the intuition that a neural network may extract different kinds of features from the same panorama in different projections, we propose to predict the room layouts from two distinct views of the panorama, namely the equirectangular panorama-view and the perspective ceiling-view. The network architecture follows the encoder-decoder scheme and consists of two branches, the panorama-branch and the ceiling-branch, which analyze images of the panorama-view and the ceiling-view, respectively. The outputs of the panorama-branch include a floor-ceiling probability map and a layout height, while the ceiling-branch outputs a floor plan probability map. To share information between the branches, we employ a feature fusion scheme that connects the first few layers of the decoders through an E2P conversion, which transforms intermediate feature maps from the equirectangular projection to the perspective ceiling-view. We find that better prediction performance is achieved by jointly training the two connected branches. The final 2D floor plan is then obtained by fitting an axis-aligned polygon to a fused floor plan probability map (see Figure 3 for details) and is then extruded by the estimated layout height.
To learn from panoramas with complex layouts, we need a proper dataset for network training and testing. However, existing public datasets, such as the PanoContext dataset, mostly provide labeled 3D layouts with simple cuboid shapes. To learn more complex layouts, we introduce a new dataset, Realtor360, which includes a subset of the SUN360 dataset (593 living rooms and bedrooms) and 1980 panoramas collected from a real estate database. We annotated the whole dataset with a custom-made interactive tool to obtain the ground-truth 3D layouts.
A key feature of our dataset is that it contains rooms with more complex shapes in terms of the number of corners. The experimental results demonstrate that our method outperforms the current state-of-the-art method (LayoutNet) in prediction accuracy, especially for rooms with more than four corners. Our method also takes much less time to compute the final room layouts. Fig. 1 shows some room layouts estimated by our method. Our contributions are summarized as follows:
- We propose a novel network architecture that contains two encoder-decoder branches to analyze the input panorama in two different projections. These two branches are further connected through a feature fusion scheme. This dual-projection architecture can infer room layouts with more complex shapes beyond cuboids and L-shapes.
- Our neural network is an important step towards building an end-to-end architecture. Our network directly outputs a probability map of the 2D floor plan. This output requires significantly less post-processing to obtain the final 3D room layout than the output of the current state of the art.
- We introduce a new dataset, called Realtor360, that contains 2573 panoramas depicting rooms with 4 to 12 corners. To the best of our knowledge, this is the largest dataset of indoor images with room layout annotations currently available.
2 Related Work
Multiple papers propose solutions for estimating room layouts from a single image taken in an indoor environment. They mainly differ in three aspects: 1) the assumptions about the room layouts, 2) the types of the input images, and 3) the methods. In terms of room layout assumptions, a popular choice is the "Manhattan world" assumption, meaning that all walls are aligned with a global coordinate system [3, 22]. To make the problem easier to solve, a more restrictive assumption is that the room is a cuboid [7, 4, 12], i.e., there exist exactly four room corners. Our method adopts the Manhattan world assumption but allows for arbitrary numbers of corners.
In terms of types of input images, the images may differ in the FoV (field of view), ranging from monocular images (i.e., taken with a standard camera) to 360° panoramas, and in whether depth information is provided. The methods then largely depend on the input image types. The problem is probably most difficult to solve when only a monocular RGB image is given. Typically, geometric (e.g., lines and corners) [13, 7, 21] and/or semantic (e.g., segmentation into different regions [8, 9] and volumetric reasoning) "cues" are extracted from the input image, a set of room layout hypotheses is generated, and then an optimization or voting process is used to rank the hypotheses and select one. Recently, neural network-based methods have made strides in tackling this problem. A trend is that the neural networks generate higher and higher levels of information, starting from line segments [16, 30] and surface labels and moving on to room types and room boundaries and corners, which makes the final layout generation process increasingly easier to solve. Our method pushes this trend one step further by using neural networks to directly predict a 2D floor plan probability map that requires only a 2D polygon fitting process to produce the final 2D room layout.
Closely related problems include depth estimation from a given image [31, 20] and scene reconstruction from point clouds [18, 17, 15]. Note that neither estimated depths nor reconstructed 3D scenes necessarily equate to a clean room layout, as such inputs may contain clutter.
360° panorama: The seminal work by Zhang et al. advocates the use of 360° panoramas for indoor scene understanding because the FoV of 360° panoramas is much more expansive. Work in this direction flourished, including methods based on optimization over geometric [5, 20, 25] and/or semantic cues [24, 26] and, later, methods based on neural networks [12, 32]. Except for LayoutNet, most methods rely on applying existing techniques for single perspective images to samples taken from the input panorama. We believe that this is a major reason for LayoutNet's superior performance, since it performs predictions on the panorama as a whole and thus extracts more of the global information that the input panorama might contain. A further step in this direction projects the input panorama to a 2D "floor" view in which the camera position is mapped to the center of the image and the vertical lines in the panorama become radial lines emanating from the image center. An advantage of this approach is that the room layout becomes a 2D closed loop that can be extracted more easily. We derived our "ceiling" view idea from this: instead of looking downward toward the floor, where all the clutter in the room is included, we look upward toward the ceiling and get a more clutter-free view of the room layout.
Fig. 2 gives an overview of our framework. Given an equirectangular panoramic image as input, we follow the same pre-processing step used in PanoContext to align the panoramic image with a global coordinate system, i.e., we make a Manhattan world assumption. Then, we transform the panoramic image into a perspective ceiling-view image through an equirectangular to perspective (E2P) conversion (Sec. 4). The panorama-view and ceiling-view images are then fed to a network consisting of two encoder-decoder branches. These two branches are connected via an E2P-based feature fusion scheme and jointly trained to predict a floor plan probability map, a floor-ceiling probability map, and the layout height (Sec. 5). Two intermediate probability maps are derived from the floor-ceiling probability map using the E2P conversion and combined with the floor plan probability map to obtain a fused floor plan probability map. The final 3D Manhattan layout is determined by extruding a 2D Manhattan floor plan, estimated on the fused floor plan probability map, using the predicted layout height (Sec. 6).
4 E2P conversion
In this section, we explain the formulation of the E2P conversion that transforms an equirectangular panorama to a perspective image. We assume the perspective image is square with dimension $w \times w$. For every pixel in the perspective image at position $(p_x, p_y)$, we derive the position of the corresponding pixel in the equirectangular panorama, $(p'_x, p'_y)$, with $-1 \le p'_x \le 1$ and $-1 \le p'_y \le 1$, as follows. First, we define the field of view of the pinhole camera of the perspective image as $FoV$. Then, the focal length can be derived as:
$$f = 0.5 \, w \cot(0.5 \, FoV)$$
$(p_x, p_y, f)$, the 3D position of the pixel in the perspective image in the camera space, is then rotated by $90^\circ$ or $-90^\circ$ along the x-axis (counter-clockwise) if the camera is looking upward (looking at the ceiling) or downward (looking at the floor), respectively.
Next, we project the rotated 3D position to the equirectangular space. To do so, we first project it onto a unit sphere by vector normalization, $(s_x, s_y, s_z)$, and apply the following formula:
$$(p'_x, p'_y) = \left( \frac{\operatorname{arctan2}(s_x, s_z)}{\pi}, \; \frac{\arcsin(s_y)}{0.5 \pi} \right)$$
to project $(s_x, s_y, s_z)$, the 3D position on the unit sphere, back to $(p'_x, p'_y)$, the corresponding 2D position in the equirectangular panorama. Finally, we use $(p'_x, p'_y)$ to interpolate a pixel value from the panorama. We note that this process is differentiable, so it can be used in conjunction with backpropagation.
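The conversion above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `e2p_lookup` is hypothetical, the pixel position is assumed to be measured from the image center, and the y-axis is assumed to grow downward as in image coordinates, so that the ceiling direction is the negative y-axis and the normalized coordinate -1 corresponds to the top row of the panorama.

```python
import math

def e2p_lookup(px, py, w, fov_deg, look_up=True):
    """Map a perspective pixel to normalized equirectangular coordinates in [-1, 1].

    Assumptions (not stated in the text): (px, py) is measured from the image
    center and the y-axis grows downward, as in image coordinates.
    """
    # Focal length of a pinhole camera with a w x w image: f = 0.5 * w * cot(0.5 * FoV).
    f = 0.5 * w / math.tan(math.radians(0.5 * fov_deg))
    # Rotate (px, py, f) about the x-axis by +90 (ceiling) or -90 (floor) degrees.
    theta = math.radians(90.0 if look_up else -90.0)
    x = px
    y = py * math.cos(theta) - f * math.sin(theta)
    z = py * math.sin(theta) + f * math.cos(theta)
    # Project onto the unit sphere by vector normalization.
    norm = math.sqrt(x * x + y * y + z * z)
    sx, sy, sz = x / norm, y / norm, z / norm
    sy = max(-1.0, min(1.0, sy))  # guard asin against rounding error
    # Longitude and latitude, each scaled to [-1, 1].
    return math.atan2(sx, sz) / math.pi, math.asin(sy) / (0.5 * math.pi)
```

With a 160-degree field of view, a pixel on the border of the ceiling-view image lands 10 degrees away from the horizon of the panorama, which is why the ceiling view still covers a good part of the walls.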
5 Network architecture
Our network architecture is illustrated in Fig. 2. It consists of two encoder-decoder branches, for the panorama-view and the ceiling-view input images. We denote the panorama-view branch as $B_P$ and the ceiling-view branch as $B_C$. The encoder and decoder of $B_P$ are denoted as $E_P$ and $D_P$, and for $B_C$ they are denoted as $E_C$ and $D_C$. A key concept is that our network predicts the floor plan and the layout height. With these two predictions, we can reconstruct a 3D room layout in a post-process (Sec. 6).
We use ResNet-18 as the architecture for both encoders, $E_P$ (panorama-view) and $E_C$ (ceiling-view). The input dimension of $E_P$ is $512 \times 1024 \times 3$ (the dimension of the input panorama) and the output dimension is $16 \times 32 \times 512$. For $E_C$, the input and output dimensions are $512 \times 512 \times 3$ and $16 \times 16 \times 512$. Note that the input of $E_C$ is a perspective ceiling-view image generated by applying the E2P conversion to the input panorama with $FoV$ set to $160^\circ$ and $w$ set to 512. We also tried other, more computationally expensive network architectures such as ResNet-50 for the encoders. However, we found no improvement in accuracy, so we chose to work with ResNet-18 for simplicity.
Both decoders, $D_P$ (panorama-view) and $D_C$ (ceiling-view), consist of six convolutional layers. The first five layers are resize convolutions with ReLU activations. The last layer is a regular convolution with sigmoid activation. The numbers of channels of the six layers are 256, 128, 64, 32, 16, and 1. To infer the layout height, we add three fully connected layers to the middlemost feature of the panorama-view branch $B_P$. The dimensions of the three layers are 256, 64, and 1. To make the regression of the layout height more robust, we add dropout layers after the first two layers. To take the middlemost feature as input, we first apply global average pooling along both the x and y dimensions, which produces a 1-D feature with 512 dimensions, and take it as the input of the fully connected layers.
The output of the panorama-view branch $B_P$ is a probability map of the floor and the ceiling in the equirectangular projection, denoted as the floor-ceiling probability map ($M_{FC}$). For the ceiling-view branch $B_C$, the output is a probability map of the floor plan in the ceiling view, denoted as the floor plan probability map ($M_{FP}$). Note that $B_P$ also outputs a predicted layout height ($H$).
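To make the encoder and decoder shapes concrete, the following sketch traces feature-map sizes through one branch. It relies on two assumptions that the text does not spell out: that the ResNet-18 backbone is used up to its last convolutional stage (total stride 32, 512 channels), and that each of the five resize convolutions doubles the spatial resolution. The function names are hypothetical.

```python
def encoder_output_shape(height, width, stride=32, channels=512):
    """Feature-map shape after a ResNet-18-style backbone (assumed total stride 32)."""
    return (height // stride, width // stride, channels)

def decoder_shapes(height, width, channel_plan=(256, 128, 64, 32, 16, 1)):
    """Trace shapes through the six decoder layers.

    Assumption: the first five (resize-convolution) layers each double the
    resolution; the final regular convolution keeps it unchanged.
    """
    shapes = []
    for index, channels in enumerate(channel_plan):
        if index < len(channel_plan) - 1:
            height, width = height * 2, width * 2
        shapes.append((height, width, channels))
    return shapes

# Ceiling-view branch with a 512 x 512 input image.
bottleneck = encoder_output_shape(512, 512)
for shape in decoder_shapes(bottleneck[0], bottleneck[1]):
    print(shape)
```

Under these assumptions, five doublings undo the encoder's 32-fold downsampling, so the single-channel probability map comes out at the resolution of the input image.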
5.3 Feature fusion
We find that applying fusion techniques to merge the features of the panorama-view branch $B_P$ and the ceiling-view branch $B_C$ increases the prediction accuracy. We conjecture the following reason. In a ceiling-view image, the areas near the image boundary (where some useful visual clues such as shadows and furniture arrangements exist) are more distorted, which can hamper the ceiling-view branch in inferring room structures. By fusing in features from the panorama-view branch (in which distortion is less severe), the performance of the ceiling-view branch can be improved.
We apply fusions before each of the first five layers of the decoders $D_P$ and $D_C$. For each fusion connection, an E2P conversion (Sec. 4) with $FoV$ set to $160^\circ$ is used to project the features in $D_P$, which are originally in the equirectangular view, to the perspective ceiling view. Each fusion works as follows:
$$f^*_{B_C} = f_{B_C} + \frac{\alpha}{\beta^i} \times f^*_{B_P}, \quad i \in \{0, 1, 2, 3, 4\} \qquad (2)$$
where $f_{B_C}$ is the feature from the ceiling-view branch $B_C$ and $f^*_{B_P}$ is the feature from the panorama-view branch $B_P$ after applying the E2P conversion. $\alpha$ and $\beta$ are the decay coefficients. $i$ is the index of the layer. After each fusion, the merged feature, $f^*_{B_C}$, is sent into the next layer of $B_C$. The performance improvement of this technique is discussed in Sec. 8.
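A minimal sketch of one fusion step follows. It assumes that the fusion is additive and that the projected panorama-branch feature is weighted by a factor that shrinks geometrically with the layer index (first coefficient divided by the second coefficient raised to the layer index). The function name `fuse` and the coefficient values in the usage line are placeholders, not the paper's settings, and features are flattened to plain lists for brevity.

```python
def fuse(ceiling_feat, pano_feat_in_ceiling_view, layer_index, alpha, beta):
    """Add a decayed copy of the E2P-projected panorama feature to the ceiling feature.

    Assumed form: merged = ceiling + (alpha / beta ** layer_index) * projected_pano.
    Both features must already share the same (ceiling-view) layout.
    """
    weight = alpha / (beta ** layer_index)
    return [c + weight * p for c, p in zip(ceiling_feat, pano_feat_in_ceiling_view)]

# Placeholder coefficients: the contribution shrinks as the layer index grows.
for i in range(5):
    print(i, fuse([1.0, 2.0], [10.0, 20.0], i, alpha=0.5, beta=2.0))
```

With a second coefficient larger than one, the panorama branch influences the early decoder layers most and later layers progressively less.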
5.4 Loss function
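For the floor-ceiling probability map $M_{FC}$ and the floor plan probability map $M_{FP}$, we apply binary cross entropy loss:
$$E_b(x, x^*) = -\sum_i \left[ x^*_i \log(x_i) + (1 - x^*_i) \log(1 - x_i) \right] \qquad (3)$$
For $H$ (layout height), we use L1-loss:
$$E_{L1}(x, x^*) = \sum_i \left| x_i - x^*_i \right| \qquad (4)$$
The overall loss function is:
$$L = E_b(M_{FC}, M^*_{FC}) + E_b(M_{FP}, M^*_{FP}) + \lambda \, E_{L1}(H, H^*) \qquad (5)$$
where $M^*_{FC}$, $M^*_{FP}$ and $H^*$ are the ground truth of $M_{FC}$, $M_{FP}$, and $H$.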
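The loss terms can be illustrated with a small plain-Python sketch. It assumes that the binary cross entropy is summed over pixels (a mean would only rescale it), that a single weight multiplies the height term, and that probabilities are clamped away from 0 and 1 for numerical safety. The names `binary_cross_entropy`, `l1_loss`, `total_loss` and `lam` are hypothetical, and maps are flattened to lists.

```python
import math

def binary_cross_entropy(pred, target, eps=1e-7):
    """Summed binary cross entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

def l1_loss(pred, target):
    """Summed absolute difference."""
    return sum(abs(p - t) for p, t in zip(pred, target))

def total_loss(m_fc, m_fc_gt, m_fp, m_fp_gt, height, height_gt, lam):
    """Two cross-entropy terms for the probability maps plus a weighted height term."""
    return (binary_cross_entropy(m_fc, m_fc_gt)
            + binary_cross_entropy(m_fp, m_fp_gt)
            + lam * l1_loss([height], [height_gt]))
```

A usage example with toy values: `total_loss([0.9, 0.2], [1.0, 0.0], [0.7], [1.0], 2.8, 3.0, lam=1.0)` combines the three terms into one scalar that a training loop would minimize.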
5.5 Training details
We implement our method with PyTorch. We use the Adam optimizer with and . The learning rate is and batch size is . Our training loss converges after about epochs. For each training iteration we augment the input panorama with random flipping and horizontal rotations by 0°, 90°, 180°, and 270°. For fusion, we set and in Eqn. 2 to be and . We set the in Eqn. 5 to be . Because we estimate the floor plan probability map in the ceiling view, we assume the distance between the camera and the ceiling to be 1.6 meters, and use this constant to normalize the ground truth.
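The panorama augmentation is cheap because a horizontal rotation of an equirectangular image is just a circular shift of its columns. The sketch below illustrates this on an image stored as a list of rows; the function names are hypothetical, the shift direction is an arbitrary choice, and the image width is assumed to be divisible by four. In a real pipeline the ground-truth maps would need the same transform.

```python
def rotate_panorama(rows, degrees):
    """Rotate an equirectangular image about the vertical axis by a multiple of 90 degrees.

    A 360-degree turn spans the full width, so 90 degrees is a quarter-width
    circular shift of every row.
    """
    width = len(rows[0])
    shift = ((degrees // 90) * (width // 4)) % width
    if shift == 0:
        return [list(row) for row in rows]
    return [row[-shift:] + row[:-shift] for row in rows]

def flip_panorama(rows):
    """Mirror the panorama left to right."""
    return [row[::-1] for row in rows]

panorama = [[0, 1, 2, 3, 4, 5, 6, 7]]
print(rotate_panorama(panorama, 90))
print(flip_panorama(panorama))
```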
6 3D layout estimation
Given the floor-ceiling and floor plan probability maps ($M_{FC}$ and $M_{FP}$) and the layout height ($H$) predicted by the network, we reconstruct the final 3D layout in the following two steps:
1. Estimating a 2D Manhattan floor plan shape using the probability maps.
2. Extruding the floor plan shape along its normal according to the layout height.
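For step 1, two intermediate maps, denoted as $M^C_{FC}$ and $M^F_{FC}$, are derived from the ceiling pixels and floor pixels of the floor-ceiling probability map $M_{FC}$ using the E2P conversion. We further use a scaling factor, $1.6 / (H - 1.6)$, to register $M^F_{FC}$ with $M^C_{FC}$, where $H$ is the predicted layout height and the constant 1.6 is the distance between the camera and the ceiling. Finally, a fused floor plan probability map is computed as follows:
$$M^{fuse}_{FP} = 0.5 \, M_{FP} + 0.25 \, M^C_{FC} + 0.25 \, M^F_{FC}$$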
Fig. 3 (a) illustrates the above process. The probability map is binarized using a threshold of . A bounding rectangle of the largest connected component is computed for later use. Next, we convert the binary image to a densely sampled piece-wise linear closed loop and simplify it using the Douglas-Peucker algorithm (see Fig. 3 (b)). We run a regression analysis on the edges and cluster them into sets of axis-aligned horizontal and vertical lines. These lines divide the bounding rectangle into several disjoint grid cells (see Fig. 3 (c)). We define the shape of the 2D floor plan as the union of grid cells where the ratio of floor plan area is greater than (see Fig. 3 (d)).
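The last step of the fitting procedure, choosing which grid cells belong to the floor plan, can be sketched as follows. This is an illustration only: the function name `fit_floor_plan` and the parameter `min_ratio` are hypothetical, the axis-aligned lines are assumed to be given as sorted pixel coordinates that include the borders of the bounding rectangle, and the ratio value in the example is a placeholder rather than the paper's setting.

```python
def fit_floor_plan(mask, x_lines, y_lines, min_ratio):
    """Return the grid cells whose floor-plan coverage exceeds min_ratio.

    mask: 2D list of 0/1 values (the binarized floor plan probability map).
    x_lines, y_lines: sorted coordinates of the vertical and horizontal lines.
    Each cell is (x0, y0, x1, y1) with half-open pixel ranges.
    """
    cells = []
    for y0, y1 in zip(y_lines, y_lines[1:]):
        for x0, x1 in zip(x_lines, x_lines[1:]):
            area = (x1 - x0) * (y1 - y0)
            if area == 0:
                continue  # degenerate cell between coincident lines
            covered = sum(mask[y][x] for y in range(y0, y1) for x in range(x0, x1))
            if covered / area > min_ratio:
                cells.append((x0, y0, x1, y1))
    return cells

# An L-shaped room: the lower-right quadrant is not part of the floor plan.
mask = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
print(fit_floor_plan(mask, [0, 2, 4], [0, 2, 4], min_ratio=0.5))
```

The union of the returned cells is an axis-aligned polygon by construction, which is what restricts the result to Manhattan-world floor plans.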
| 4 corners | 6 corners | 8 corners | 10+ corners | Total |
7 Realtor360 dataset
A dataset that contains a sufficient number of 3D room layouts with different numbers of corners is crucial for training as well as testing our network. Unfortunately, existing public domain datasets, such as the PanoContext dataset and the Stanford 2D-3D dataset labeled by Zou et al., contain mostly layouts with a simple cuboid shape. To demonstrate that our framework is flexible enough to deal with rooms with an arbitrary number of corners, we introduce a new dataset, named Realtor360, that contains over 2500 indoor panoramas and annotated 3D room layouts. We classify each room according to its layout complexity, measured by the number of corners in the floor plan. Table 1 shows the statistics of the dataset and a few visual examples can be found in Fig. 4. The panoramic images in the Realtor360 dataset are collected from two sources. The first is a subset of the SUN360 dataset, which contains 593 living room and bedroom panoramas. The other is a real estate database with 1980 indoor panoramas acquired from a real-estate company. We annotate the 3D layouts of these indoor panoramas using a custom-made interactive tool as explained below.
Table 2: Quantitative comparison on the Realtor360 dataset. Each cell lists 2D IoU (%) / 3D IoU (%).
| Method | Average | 4 corners | 6 corners | 8 corners | 10+ corners |
|---|---|---|---|---|---|
| LayoutNet | 65.84 / 62.77 | 80.41 / 76.6 | 60.5 / 57.87 | 41.16 / 41.16 | 22.35 / 22.35 |
| ours (w/o fusion) | 78.52 / 74.8 | 81.77 / 77.57 | 78.5 / 75.1 | 73.61 / 70.37 | 57.01 / 54.12 |
| ours (full) | 80.53 / 77.2 | 82.63 / 78.91 | 80.72 / 77.79 | 78.12 / 74.86 | 63.1 / 59.72 |
Table 3: Quantitative comparison when training on cuboid-shaped rooms only. Each cell lists 2D IoU (%) / 3D IoU (%).
| Method | Average | 4 corners | 6 corners | 8 corners | 10+ corners |
|---|---|---|---|---|---|
| LayoutNet | 71.31 / 67.91 | 80.69 / 76.82 | 68.95 / 65.83 | 50.31 / 47.23 | 44.53 / 42.51 |
| Ours (full) | 77.87 / 74.16 | 82.42 / 78.3 | 77.19 / 73.74 | 70.81 / 67.55 | 54.05 / 50.96 |
To annotate the 2D indoor panoramas with high-quality 3D room layouts, we developed an interactive tool to facilitate the labeling process. The tool first leverages existing automatic methods to extract a depth map and line segments from the input panorama. Then, an initial 3D Manhattan-world layout is created by sampling the depth along the horizontal line in the middle of the panorama. The tool allows users to refine the initial 3D layout through a set of intuitive operations, including (i) pushing/pulling a wall; (ii) merging multiple walls; and (iii) splitting a wall. It also offers a function that snaps the layout edges to the estimated line segments during interactive editing to improve accuracy. We plan to release the dataset along with the annotation tool for academic use after publication of this work.
We compare our method to LayoutNet, a state-of-the-art method in room layout estimation, through a series of quantitative and qualitative experiments on our Realtor360 dataset and the PanoContext dataset. We also conduct an ablation study with several alternative configurations of our method. We adopt 2D and 3D Intersection over Union (IoU), a standard metric in similar tasks, to evaluate the accuracy of the estimated 2D floor plans and 3D layouts. All the experiments used the same hyper-parameters discussed in Sec. 5.5. Fig. 5 shows a few 3D room layouts with different numbers of corners estimated using our method. Please refer to the supplementary materials for more results from the following experiments.
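For reference, the two metrics can be sketched as follows for layouts represented as rasterized floor masks plus a height. This is an illustrative simplification rather than the evaluation code used in the experiments: it assumes both layouts are rasterized on the same grid and that the two extruded volumes are aligned at one horizontal plane, so that their vertical overlap equals the smaller height. The function names are hypothetical.

```python
def iou_2d(mask_a, mask_b):
    """Intersection over union of two binary floor-plan masks of equal size."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0

def iou_3d(mask_a, height_a, mask_b, height_b):
    """IoU of two prisms extruded from the masks (assumed to share one horizontal plane)."""
    area_a = sum(1 for row in mask_a for a in row if a)
    area_b = sum(1 for row in mask_b for b in row if b)
    inter_area = sum(1 for ra, rb in zip(mask_a, mask_b)
                     for a, b in zip(ra, rb) if a and b)
    inter_vol = inter_area * min(height_a, height_b)
    union_vol = area_a * height_a + area_b * height_b - inter_vol
    return inter_vol / union_vol if union_vol else 0.0
```

Because the 3D score multiplies the floor-plan overlap by the height overlap, it can never exceed the 2D score under these assumptions.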
Evaluation on the Realtor360 dataset.
To train both LayoutNet and our DuLa-Net on the Realtor360 dataset, we randomly selected 2169 panoramas for training and took the remaining 404 panoramas for testing. We further classify the testing panoramas according to their numbers of corners. We run LayoutNet using the code and default hyper-parameters released by the authors. The quantitative comparison with LayoutNet is shown in Table 2. We observe that LayoutNet delivers good performance on cuboid-shaped rooms (4 corners), similar to the numbers reported in their paper. However, the accuracy drops significantly as the number of corners increases. In comparison, our DuLa-Net not only outperforms LayoutNet on cuboid-shaped rooms by a small margin (around 2 percentage points), but also performs well on rooms with larger numbers of corners. This leads to an overall performance gain of roughly 14 to 15 percentage points in both 2D and 3D metrics when compared to LayoutNet.
Since the 3D layout optimization and the hyper-parameters of LayoutNet were tuned on a dataset that contains mostly cuboid-shaped rooms, we conducted another experiment by training both networks on a revised training set that excludes rooms with non-cuboid layouts, while keeping the testing set untouched. Table 3 shows the quantitative results. Note that while the performance of LayoutNet improves, our method still outperforms it on all kinds of rooms.
From the qualitative comparison shown in Fig. 6, we can observe a strong tendency of LayoutNet to predict the rooms to be cuboid-shaped, possibly due to the constraints imposed in their 3D layout optimization. In comparison, our method simplifies the problem by directly predicting a Manhattan-world floor plan without any assumptions about the numbers of corners. We conjecture that this is a main reason why our method outperforms LayoutNet, especially with rooms with more than four corners.
We also conducted an ablation study that evaluates the performance of our method in different configurations as follows: 1) ours (fc-only): only the panorama-view branch, 2) ours (fp-only): only the ceiling-view branch, and 3) ours (w/o fusion): our full model but without feature fusion. The quantitative results in Table 2 show that jointly training both branches leads to better performance than training only one of them. In addition, adding feature fusion between the two branches further improves the performance.
Evaluation on the PanoContext dataset.
LayoutNet provided quantitative results on the PanoContext dataset with 414 panoramas for training and 53 panoramas for testing. All rooms are labeled as cuboid-shaped. To compare, we trained our network on the same dataset. The quantitative comparison is shown in Table 4. Our model outperforms LayoutNet by a small margin.
An end-to-end computation takes three main steps: 1) an alignment process to align the input panorama with a global coordinate system, 2) floor plan probability map prediction by our neural network, and 3) 2D floor plan fitting. Step 1) is the most time-consuming, taking about 13.37 s measured on a machine with a single NVIDIA 1080 Ti GPU and an Intel i7-7700 3.6 GHz CPU. Step 2) takes only 34.68 ms and step 3) takes only 21.71 ms.
LayoutNet carries out the same alignment process, and its neural network prediction is also very fast (39 ms). However, it needs another very time-consuming 3D layout optimization step at the end, which takes 30.5 s. In summary, an end-to-end computation by LayoutNet takes about 43.9 s while our method takes about 13.4 s, a speed-up of 3.28X.
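The totals follow directly from the per-step timings quoted above, as the short calculation below shows. The variable names are hypothetical. Note that the unrounded ratio comes out marginally below the quoted 3.28X, which corresponds to dividing the rounded totals of 43.9 s and 13.4 s.

```python
# Per-step timings in seconds, taken from the text.
ALIGNMENT_S = 13.37             # shared panorama alignment step
OURS_NETWORK_S = 34.68e-3       # floor plan probability map prediction
OURS_FITTING_S = 21.71e-3       # 2D floor plan fitting
LAYOUTNET_NETWORK_S = 39e-3     # LayoutNet network prediction
LAYOUTNET_OPTIMIZATION_S = 30.5 # LayoutNet 3D layout optimization

ours_total = ALIGNMENT_S + OURS_NETWORK_S + OURS_FITTING_S
layoutnet_total = ALIGNMENT_S + LAYOUTNET_NETWORK_S + LAYOUTNET_OPTIMIZATION_S
speedup = layoutnet_total / ours_total

print(f"ours: {ours_total:.2f} s")
print(f"LayoutNet: {layoutnet_total:.2f} s")
print(f"speed-up: {speedup:.2f}x")
```

The calculation also makes clear that the shared alignment step accounts for almost all of our method's runtime.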
We present an end-to-end deep learning framework, called DuLa-Net, for estimating 3D room layouts from a single RGB panorama. We propose a new network architecture that consists of two encoder-decoder branches for analyzing features from two distinct views of the input panorama, namely the equirectangular panorama-view and the perspective ceiling-view. The two branches are connected through a novel feature fusion scheme and jointly trained to achieve the best accuracy in the prediction of the 2D floor plan and layout height. To learn from complex layouts, we introduce a new dataset, Realtor360, which contains 2573 indoor panoramas of Manhattan-world room layouts of varying complexity. Both the quantitative and qualitative results demonstrate that our method outperforms the current state of the art in prediction accuracy, especially for rooms with more than four corners, and takes much less time to compute the final 3D room layouts.
Limitations and future work.
Our method has the following limitations: i) without knowing the object semantics, our network might get confused by rooms that contain mirrors or large occluding objects, as shown in Fig. 7; and ii) our approach to 3D layout estimation involves heuristics and assumptions that might over- or under-estimate the underlying floor plan probability map and that also restrict the results to Manhattan-world layouts. We propose to explore the following directions in the near future. First, introducing object semantics, i.e., segmentation and labels, to the network architecture could potentially improve the accuracy by excluding distracting and occluding objects from the floor plan prediction. Second, a more principled algorithm could make 3D layout estimation more robust, e.g., by dropping the Manhattan-world assumption and supporting rooms with curved shapes. Last but not least, we believe that even better results can be achieved by experimenting with a larger range of encoders for our network architecture.
-  A. P. Aitken, C. Ledig, L. Theis, J. Caballero, Z. Wang, and W. Shi. Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize. CoRR, abs/1707.02937, 2017.
-  L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
-  J. M. Coughlan and A. L. Yuille. Manhattan world: Compass direction from a single image by bayesian inference. pages 941–, 1999.
-  S. Dasgupta, K. Fang, K. Chen, and S. Savarese. Delay: Robust spatial layout estimation for cluttered indoor scenes. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 616–624, June 2016.
-  K. Fukano, Y. Mochizuki, S. Iizuka, E. Simo-Serra, A. Sugimoto, and H. Ishikawa. Room reconstruction from a single spherical image by higher-order energy minimization. 2016 23rd International Conference on Pattern Recognition (ICPR), pages 1768–1773, 2016.
-  A. Gupta, M. Hebert, T. Kanade, and D. M. Blei. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1288–1296. Curran Associates, Inc., 2010.
-  V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In 2009 IEEE 12th International Conference on Computer Vision, pages 1849–1856, Sept 2009.
-  D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 1, pages 654–661 Vol. 1, Oct 2005.
-  D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, Oct 2007.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, pages 239–248. IEEE Computer Society, 2016.
-  C. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. Roomnet: End-to-end room layout estimation. CoRR, abs/1703.06241, 2017.
-  D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2136–2143, June 2009.
-  C. Liu, P. Kohli, and Y. Furukawa. Layered scene decomposition via the occlusion-crf. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 165–173, June 2016.
-  C. Liu, J. Wu, and Y. Furukawa. Floornet: A unified framework for floorplan reconstruction from 3d scans. In European Conference on Computer Vision (ECCV), 2018.
-  A. Mallya and S. Lazebnik. Learning informative edge maps for indoor scene layout prediction. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 936–944, Washington, DC, USA, 2015. IEEE Computer Society.
-  A. Monszpart, N. Mellado, G. J. Brostow, and N. J. Mitra. Rapter: Rebuilding man-made scenes with regular arrangements of planes. ACM Trans. Graph., 34(4):103:1–103:12, July 2015.
-  R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136, Oct 2011.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
-  G. Pintore, V. Garro, F. Ganovelli, E. Gobbetti, and M. Agus. Omnidirectional image capture on mobile devices for fast automatic generation of 2.5d indoor maps. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, March 2016.
-  S. Ramalingam and M. Brand. Lifting 3d manhattan lines from a single image. 2013 IEEE International Conference on Computer Vision, pages 497–504, 2013.
-  S. Ramalingam, J. K. Pillai, A. Jain, and Y. Taguchi. Manhattan junction catalogue for spatial reasoning of indoor scenes. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3065–3072, June 2013.
-  J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recognizing scene viewpoint using panoramic place representation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2695–2702, June 2012.
-  J. Xu, B. Stenger, T. Kerola, and T. Tung. Pano2cad: Room layout from a single panorama image. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 354–362, March 2017.
-  H. Yang and H. Zhang. Efficient 3d room shape recovery from a single panorama. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5422–5430, June 2016.
-  Y. Yang, S. Jin, R. Liu, S. Bing Kang, and J. Yu. Automatic 3d indoor scene modeling from single panorama. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  J. Zhang, C. Kan, A. G. Schwing, and R. Urtasun. Estimating the 3d layout of indoor scenes and its clutter from depth sensors. In 2013 IEEE International Conference on Computer Vision, pages 1273–1280, Dec 2013.
-  Y. Zhang, M. Bai, P. Kohli, S. Izadi, and J. Xiao. Deepcontext: Context-encoding neural pathways for 3d holistic scene understanding. International Conference on Computer Vision (ICCV 2017), 2017.
-  Y. Zhang, S. Song, P. Tan, and J. Xiao. Panocontext: A whole-room 3d context model for panoramic scene understanding. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI, pages 668–686, 2014.
-  H. Zhao, M. Lu, A. Yao, Y. Guo, Y. Chen, and L. Zhang. Physics inspired optimization on semantic transfer features: An alternative method for room layout estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  N. Zioulis, A. Karakottas, D. Zarpalas, and P. Daras. Omnidepth: Dense depth estimation for indoors spherical panoramas. In The European Conference on Computer Vision (ECCV), September 2018.
-  C. Zou, A. Colburn, Q. Shan, and D. Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Appendix A Comparisons to LayoutNet 
A.1 Re-training LayoutNet
For the comparisons with LayoutNet, we used the neural network and the 3D optimization module provided by the authors of LayoutNet. According to our analysis, the part of the 3D optimization code that is specially designed for cuboid-shaped rooms works very well. However, the part designed for non-cuboid-shaped rooms has multiple issues and is not sufficiently robust. Our analysis indicates that the algorithm performs best when it only tries to fit rooms with 4 corners. To provide the best possible version of LayoutNet, we experimented with three different settings in network training and report the results here: 1) LayoutNet with the original weights pre-trained by the authors; 2) LayoutNet retrained with the subset of our Realtor360 dataset that contains only cuboid layouts (4 corners); 3) LayoutNet retrained with our complete Realtor360 training dataset. The performance comparison is shown in Table 5. We observe that LayoutNet's performance improves significantly when it is re-trained on our Realtor360 dataset. It improves further when it is re-trained with cuboid-shaped rooms only, since the 3D optimization module works correctly only for such rooms. In the paper, we therefore reported results for the best version of LayoutNet that we could create, i.e., LayoutNet re-trained on our Realtor360 dataset with rooms of cuboid layouts (4 corners) only, followed by their 3D optimization module for cuboid-shaped rooms.
| Training set | Average 3D IoU (%) |
|---|---|
| pre-trained | 57.96 |
| Realtor360 (4-only) | 67.91 |
| Realtor360 (all) | 62.77 |