Pano Popups: Indoor 3D Reconstruction with a Plane-Aware Network
Abstract
In this work we present a method to train a plane-aware convolutional neural network for dense depth and surface normal estimation as well as plane boundaries from a single indoor image. Using our proposed loss function, our network outperforms existing methods for single-view, indoor, omnidirectional depth estimation and provides an initial benchmark for surface normal prediction from omnidirectional images. Our improvements are due to the use of a novel plane-aware loss that leverages principal curvature as an indicator of planar boundaries. We also show that including geodesic coordinate maps as network priors provides a significant boost in surface normal prediction accuracy. Finally, we demonstrate how we can combine our network’s outputs to generate high-quality 3D “pop-up” models of indoor scenes.
1 Introduction
Omnidirectional imaging is currently experiencing a surge in popularity, thanks to the advent of interactive panorama photo sharing on social media platforms, the rise of small, affordable cameras like the Ricoh Theta and Samsung Gear360, and the host of potential applications that arise from capturing wide field of view (FoV) in a single frame. At the same time, deep learning has never been a more useful tool for solving computer vision tasks from object recognition to 3D reconstruction. In order to fully utilize this rising form of media, we must extend existing deep learning methods to the omnidirectional domain. Unfortunately, this is not necessarily a trivial task.
Due to the radically different camera models, deep networks trained on perspective images do not transfer well to omnidirectional images. Omnidirectional images replace the concept of the image plane with that of the image sphere. Yet because we require a 2D planar representation of the image, omnidirectional cameras typically provide outputs as equirectangular projections of the full FoV. This representation of the spherical image, while compact, suffers from significant horizontal distortion, especially near the poles.
While there have been a number of efforts to handle the difficulties of equirectangular projections [1, 2, 3, 5, 25, 26], we are interested in exploring their possible uses. There is excitement over the range of applications of omnidirectional imaging from head-mounted displays to medical scopes to autonomous vehicles. In this paper, we target indoor scene modeling.
Perspective image methods are impeded by a small FoV that is more likely to be limited by featureless, homogeneous regions in an indoor scene. With the larger FoV in omnidirectional images, these homogeneous regions can be reasoned about in the larger context of the scene. Our goal is to predict the dense depth and surface normals for a piecewise-planar reconstruction of the scene. This objective differs from much of the existing work that uses omnidirectional images for indoor 3D modeling. Those methods, such as RoomNet [14] and LayoutNet [30], aim to generate a simple model of the scene by leveraging a Manhattan World constraint to estimate the dominant planes. That type of model is useful for determining the shapes of rooms and floorplans of buildings, but not for modeling the objects that comprise the captured scene. While we, too, are essentially estimating planes in the scene, we aim for a more fine-grained model in order to better capture these important details. To this end, we relax the Manhattan constraint to a simple planar one. That is, we assume only that our scene is piecewise-planar.
We use a convolutional neural network (CNN) to predict depth and surface normal estimates per pixel as well as a map of the plane boundaries in the image. We enforce the planar assumption by using a plane-aware loss function that modifies each pixel’s contribution to the learning based on its principal curvature. Using our network outputs, we then generate high-quality 3D planar models of the scene as seen in Figure 1.
We summarize our contributions in this paper as follows:

We propose a plane-aware cost function to estimate depth, surface normals, and plane boundaries from a single image.

We demonstrate that the inclusion of geodesic coordinate maps as extra inputs to the network improves surface normal prediction from omnidirectional images.

We qualitatively show that our network can be used to generate a 3D planar model from a single image.
2 Related Work
2.1 Single-view estimation
There is a significant body of existing research on the task of monocular depth estimation from perspective images. One of the first papers to report success in this task was from Saxena et al. [21], who use a Markov Random Field to infer depth from a blend of local and global image features. With the advent of practical deep learning, more recent methods have focused on applying CNNs to estimate depth. Eigen et al. [7] present a CNN for depth estimation that uses multi-scale predictions to provide coarse and fine supervision for the depth predictions. Eigen and Fergus [6] built on that work to simultaneously generate surface normal predictions and semantic labels as well. Dharmasiri et al. [4] follow a similar network design but replace semantic label prediction with principal curvature prediction. Our network architecture has some commonalities with the aforementioned, primarily in our use of multi-scale predictions and similar prediction modalities. However, our goal is more aligned with that of Qi et al. [19], who propose a method for enforcing geometric consistency in the network outputs. In that work, the authors use the depth predictions to refine normal predictions and vice versa. In our case, we use a plane-aware loss to make our network predictions geometrically consistent. Our objective is also somewhat similar to that of Liu et al. [16], who predict a planar segmentation of the scene. However, they rely on a separate plane classification branch in their network and are limited to a fixed number of planes. We use a parametric definition of a plane derived from the principal curvature map and are thus unlimited in the number of planes we can predict.
There have been other recent works in monocular depth estimation that, while interesting and useful, are not currently feasible for our task. Godard et al. [10] use stereo image pairs to train a model for monocular depth estimation using an image reconstruction loss. In our case, we only have access to monocular images. Li and Snavely [15] train a network on a dataset built from large-scale, unordered image collections. Alas, there is not yet such a repository for omnidirectional images.
2.2 Omnidirectional images
The primary distinction between our work and those presented above is the mode of our input data. Most research in monocular depth estimation has relied thus far on perspective image projections. We instead operate on equirectangular image projections, which image a spherical capture onto a plane. This representation carries high levels of distortion. There is an active branch of research in developing solutions to account for these factors. Su and Grauman [25] propose a transfer learning approach to train networks to operate on equirectangular projections. Using an existing perspective-projection-trained network as the target, they train an equirectangular network with a learnable adaptive convolutional kernel to match the outputs. Tateno et al. [26] present a distortion-aware convolutional kernel that convolves over the sampling grid transformed by a distortion function. In this way, the network can be trained on perspective images and still perform effectively on spherical projections. Coors et al. [3] independently derive the same operation and show that it can be highly effective for object detection on omnidirectional images. Both methods train on perspective images and evaluate on spherical projections. Another promising method is the spherical convolution derived by Cohen et al. [1, 2]. Spherical convolutions address the nuances of spherical projections by filtering rotations of the feature maps rather than translations. Most recently, Eder and Frahm [5] demonstrate that resampling spherical images to a subdivided icosahedron substantially improves the performance of CNNs trained on spherical data. In our work we do not directly address the problem of specialized convolutions. Rather, we explore the application of omnidirectional image inference for the task of indoor 3D modeling. Our work is most similar to that of Zioulis et al. [29], who estimate depth directly from omnidirectional images.
There is also a growing body of work using panorama images to generate indoor scene layouts. Xu et al. [28] fuse object detection and 3D geometry estimation using Bayesian inference to generate 3D room layouts from a single image. Rather than dividing the problem into subtasks, Lee et al. [14] use an end-to-end CNN to generate a 3D room layout from a single perspective image. Zou et al. [30] improve this technique by incorporating vanishing point alignment and predicting additional layout elements in their model. All of the aforementioned layout generation models assume a Manhattan World in their predictions. While this may be useful for common room shapes, it is too simple a prior for general indoor scene modeling. Our work focuses on a more complete indoor 3D model, so we relax this Manhattan constraint to a planar one.
3 Plane-Aware Estimation
We present a CNN that estimates dense depth and surface normal predictions as well as a planar boundary map from a single image. To learn depth and normal prediction, we supervise training with ground truth values. Observing that a nonzero principal curvature indicates the presence of a planar boundary, we supervise training for the planar boundary map using the norm of the principal curvature.
3.1 Network architectures
We analyze our plane-aware loss function using a network based on the RectNet architecture used by Zioulis et al. [29]. Our network uses the same encoder-decoder structure with rectangular filter banks on the input layers, but with two decoder branches: one for depth predictions and one for joint surface normal and plane boundary map prediction. We also include skip connections from encoder to decoder layers as in U-Net from Ronneberger et al. [20], as we observe that they improve performance. Our network takes a five-channel input: an RGB equirectangular projection and the associated geodesic map containing latitude and longitude coordinates for each pixel. This design is based on the observation that distortion in equirectangular projections is location-dependent. Given that these images are indexed by their geodesic coordinates, given in latitude and longitude, we provide the network with location information in the form of a geodesic coordinate map of the image. We find that this provides a significant boost in performance for surface normal prediction in particular and discuss it in more detail in Section 4.4. Figure 2 provides a detailed overview of our network.
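To make the five-channel input concrete, the following is a minimal NumPy sketch of how a geodesic map could be built and stacked with the RGB channels. The function names, the channel order, the pixel-center sampling, and the angle conventions (latitude from +π/2 at the top row to -π/2 at the bottom, longitude from -π to +π, both in radians) are assumptions made for illustration and are not specified in the text.

```python
import numpy as np

def geodesic_map(height, width):
    """Return a (2, H, W) array holding per-pixel latitude and longitude (radians)."""
    # Assumed convention: sample at pixel centers; latitude decreases down the
    # rows, longitude increases along the columns.
    lat = np.pi / 2 - (np.arange(height) + 0.5) * np.pi / height
    lon = (np.arange(width) + 0.5) * 2 * np.pi / width - np.pi
    lon_grid, lat_grid = np.meshgrid(lon, lat)  # both have shape (H, W)
    return np.stack([lat_grid, lon_grid], axis=0)

def make_network_input(rgb):
    """Stack a (3, H, W) RGB image with its geodesic map into a (5, H, W) array."""
    _, height, width = rgb.shape
    return np.concatenate([rgb, geodesic_map(height, width)], axis=0)

# Example: a tiny 4 x 8 equirectangular image.
x = make_network_input(np.zeros((3, 4, 8)))
```

Because the two extra channels depend only on the image size, they can be computed once and reused for every image of the same resolution.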
3.2 Training
Recall our premise that each scene is piecewise-planar. This assumption provides a few constraints. First, each scene should be segmented by some web of edges that define the boundaries between each plane. Second, each planar region should have a constant depth gradient and all pixels within should have the same surface normal. Furthermore, the principal curvature, which is effectively the second derivative of depth, should be zero. Lastly, the depth and normal predictions within a planar region should satisfy the plane equation $\mathbf{n}^\top \mathbf{X} = p$, where $\mathbf{n}$ is the normal, $\mathbf{X}$ is the 3D point, and $p$ is the plane’s distance from the origin.
We enforce these constraints through a multi-scale, multi-task loss function. We compute individual losses over the depth, surface normal, and plane boundary map predictions, as well as a loss over the plane distance prediction for each pixel. This last term is computed as a function of both the depth and normal predictions, which encourages planar consistency. Each of the losses is also weighted using a plane-aware function. For the depth, curvature, and plane distance losses, we use the reverse Huber, or BerHu, loss proposed by Laina et al. [13]. This loss is given as
$$\mathcal{B}(x) = \begin{cases} |x|, & |x| \le c \\ \dfrac{x^2 + c^2}{2c}, & |x| > c \end{cases} \qquad (1)$$
where we adjust the threshold $c$ on a per-batch basis to be 20% of the max per-batch error as in [13]. Our plane-aware function weights the impact of each pixel to the loss by the norm of its ground truth principal curvature:
(2) 
As curvature is zero on a planar surface, this term gives full weight to all pixels that lie on planes. However, pixels that fall along sharp plane boundaries and thus have higher curvatures will have their contribution to the loss down-weighted. This is similar to the texture-edge-aware loss weighting used by Godard et al. [10], except that we use the curvature values instead of intensity gradients. Our formulation makes more sense for our task, given that we are interested in planar boundaries rather than texture ones.
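The interaction between the BerHu loss and the curvature-based weighting can be sketched in NumPy as follows. The BerHu definition is the standard one from [13]. The exponential form of the weight, the `(2, H, W)` layout of the curvature map, and the mean reduction are assumptions chosen only because they match the behavior described above (full weight on planes, reduced weight at high curvature); they are not taken from the source.

```python
import numpy as np

def berhu(residual, c):
    """Reverse Huber: absolute error up to c, scaled squared error beyond it."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= c, abs_r, (abs_r ** 2 + c ** 2) / (2 * c))

def plane_aware_weight(curvature_gt):
    """ASSUMED weight exp(-||kappa*||): 1 on planes, decaying at sharp edges.

    curvature_gt is assumed to hold the two principal curvatures per pixel,
    with shape (2, H, W).
    """
    return np.exp(-np.linalg.norm(curvature_gt, axis=0))

def weighted_berhu_loss(pred, target, curvature_gt):
    """Plane-aware BerHu loss over one batch (mean reduction is an assumption)."""
    residual = pred - target
    # Threshold set to 20% of the maximum error in the batch, as in [13];
    # the small floor avoids a division by zero when the error is exactly 0.
    c = max(0.2 * np.max(np.abs(residual)), 1e-12)
    return np.mean(plane_aware_weight(curvature_gt) * berhu(residual, c))
```

With this form, a pixel with zero ground truth curvature contributes its full BerHu error, while a pixel on a sharp crease contributes only a small fraction of it.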
Each component of the loss is given below. The subscript denotes the th pixel in the image; is depth, is normal, and is curvature.
(3) 
(4) 
(5) 
(6) 
where is the relevant output map and the asterisks denote ground truth values. In Equation (6), where is the directional unit vector from the camera center to pixel on the sphere, i.e. is the backprojected 3D point.
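The plane distance used in the last loss term combines the depth and normal predictions: each pixel is back-projected along its unit ray and the resulting 3D point is projected onto the predicted normal. A minimal NumPy sketch is given below. The axis convention of the rays (y up, z forward at zero longitude), the pixel-center sampling, and the function names are assumptions for illustration only.

```python
import numpy as np

def unit_rays(height, width):
    """(3, H, W) unit vectors from the camera center through each pixel."""
    # Assumed equirectangular convention: latitude from +pi/2 (top) to
    # -pi/2 (bottom), longitude from -pi (left) to +pi (right).
    lat = np.pi / 2 - (np.arange(height) + 0.5) * np.pi / height
    lon = (np.arange(width) + 0.5) * 2 * np.pi / width - np.pi
    lon, lat = np.meshgrid(lon, lat)
    return np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=0)

def plane_distance(depth, normals):
    """Per-pixel n . X, where X = depth * ray is the back-projected point.

    depth has shape (H, W) and normals has shape (3, H, W).
    """
    points = depth[None] * unit_rays(*depth.shape)
    return np.sum(normals * points, axis=0)
```

Within a single planar region, a consistent pair of depth and normal maps yields the same plane distance at every pixel, which is the property the plane distance loss encourages.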
It is worth noting that other single-view depth estimation papers typically include a penalty on the gradient of the depth or disparity prediction to account for homogeneous regions where depth may be ambiguous [9, 29]. However, this term is known to lead to over-smoothing, especially for surfaces that are not fronto-planar to the camera. In the case of omnidirectional images, where depth is defined as the distance from a 3D point to the camera center (rather than to the image plane), this gradient penalty would encourage the prediction of a circular scene wherein each point is locally fronto-planar to the camera. Thus, we do not penalize the depth gradient at all. In the planar boundary map prediction, however, we do include an $L_1$ penalty to encourage sparsity in the edge predictions.
Our total loss is thus the sum of all of these terms at two scales, with each term weighted by its own hyperparameter:
(7) 
We empirically set the hyperparameters to balance the contribution of each component loss. In our reported results, , , , , , , , and . The penalty coefficient in Equation (5) is always . Nonetheless, we observed that small changes to these hyperparameters have negligible effects on the network training. Note that we do not use any loss for planar boundary map prediction for the downscaled prediction () as we observed that it made no impact in the final plane boundary map. We train the network for epochs with a batch size of and use the Adam optimizer [12] with an initial learning rate of decayed by half every epochs.
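For readers who want the structure of the total loss in symbols, one way to write a two-scale weighted sum of four terms is sketched below. All notation here is an assumption of this sketch: $s$ indexes the two prediction scales, $\mathcal{L}_d$, $\mathcal{L}_n$, $\mathcal{L}_c$, and $\mathcal{L}_p$ stand for the depth, normal, plane boundary (curvature), and plane distance losses, and each $\lambda$ is the corresponding weight.

```latex
\mathcal{L}_{\text{total}} = \sum_{s=1}^{2} \left(
    \lambda_d^{(s)} \mathcal{L}_d^{(s)}
  + \lambda_n^{(s)} \mathcal{L}_n^{(s)}
  + \lambda_c^{(s)} \mathcal{L}_c^{(s)}
  + \lambda_p^{(s)} \mathcal{L}_p^{(s)}
\right)
```

Under this notation, omitting the plane boundary loss for the downscaled prediction corresponds to setting $\lambda_c^{(s)} = 0$ at that scale.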
Loss  AbsRel  SqRel  RMSLin  RMSLog  

L2 + smoothing [29]  
Plane-aware (ours)  
Ablation  
L2 instead of BerHu  
No curvature penalty  
No plane loss 
4 Evaluation
In this section we evaluate our proposed plane-aware depth and normal estimation. First, we demonstrate the benefit of our plane-aware loss through comparison to a baseline, the loss used by Zioulis et al. [29], as well as in a series of ablation experiments. Second, we demonstrate the importance of predicting surface normals rather than relying on derived normals from predicted depth. We then examine the effect of including coordinate priors as inputs to the network. Finally, we qualitatively show how we can leverage the predicted plane boundary map to create 3D reconstructions in Section 5.
4.1 Dataset
We train and evaluate our method using the Scene Understanding and Modeling (SUMO) dataset [27], a collection of 58,631 computer-generated omnidirectional images of indoor scenes derived from SunCG [23]. As released, the SUMO dataset contains RGB-D cube map images with a cube face dimension of pixels. To prepare this data for our experiments, we resample the cube maps to pixel equirectangular images using bilinear interpolation for color information and nearest-neighbor interpolation for depth. For the purposes of surface normal and principal curvature prediction, we augment the dataset with normal and curvature maps for each image as well. We derive the ground truth normal maps from the provided images by first resampling them to the vertices of an icosahedral triangular mesh as in [5], scaling each vertex by the ground truth depth, computing the surface normal for each face, and rendering the normal maps back into an equirectangular projection. For the ground truth planar boundary maps, we use the norm of the principal curvature. The curvature maps are derived as in [24] using the eigenvalues of the matrix given by:
where and are vectors that, with the surface normal , form an orthonormal basis at a given point . , , and are defined by the derivatives of the surface normal at that point:
Method  Avg. Ang.  

Plane-aware + Lat./Lon.  
Derived from depth  
No curvature penalty  
No plane loss  
No coordinates  
Lat. only  
Lon. only 
4.2 Depth estimation
We evaluate the depth estimation task using the standard set of metrics defined in Eigen et al. [7], shown in Table 1. Because depth estimates are subject to the arbitrary scale of the training distribution, we use the median scaling technique given by [29] to normalize the depth distributions during evaluation. The numbers we report are based on pixels whose ground truth depth falls within a valid range. We set the upper bound of this range to be 4.375 standard deviations above the mean of the training set, deriving this value from an analysis of the evaluation threshold used by Zioulis et al. [29]. To evaluate our proposed loss, we compare to network training under the loss used by Zioulis et al. [29] as a baseline. This loss is simply an $L_2$ minimization with a gradient penalty at two scales, as given by Equation (8):
(8) 
The results in Table 1 show that our loss formulation outperforms the baseline. We note that training on synthetic images leads to high performance for the baseline as well, so we also look to a qualitative analysis to reinforce the effect of our plane-aware formulation. Figure 4 shows a selection of network outputs comparing our loss to the baseline. Observe the finer-grained depth estimate of the lounge chair in the center of row (1) and the shelving and counters in rows (2) and (3). We find that training with our proposed plane-aware loss results in sharper details in the resulting depth maps. We posit that this effect is due to extra supervision provided by the ground truth curvature penalty, which limits smoothing on geometric edges.
We perform an ablation study on elements of our loss function, also listed in Table 1. Among other things, these results demonstrate that our improvement is not simply due to the use of the BerHu loss. We see a moderate impact from both the planar-consistency regularizer and the curvature penalty. Interestingly, we found that removing the associated curvature prediction task altogether affected neither the depth nor the normal prediction accuracy. However, we keep it in the network as it plays a key role in generating the 3D reconstructions, discussed in Section 5.
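The evaluation protocol described in this section (valid-range masking, median scaling, then the error metrics of [7]) can be sketched in NumPy as below. The function name, the dictionary keys, and the use of zero as the lower bound of the valid range are assumptions for illustration; the upper bound `max_depth` is left as a parameter.

```python
import numpy as np

def depth_metrics(pred, gt, max_depth):
    """Error metrics after median scaling, over pixels with valid ground truth."""
    # Assumed valid range: strictly positive depth up to max_depth.
    valid = (gt > 0) & (gt <= max_depth)
    pred, gt = pred[valid], gt[valid]
    # Median scaling: align the median of the prediction with the ground truth.
    pred = pred * np.median(gt) / np.median(pred)
    diff = pred - gt
    return {
        "abs_rel": np.mean(np.abs(diff) / gt),
        "sq_rel": np.mean(diff ** 2 / gt),
        "rms_lin": np.sqrt(np.mean(diff ** 2)),
        "rms_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
    }
```

Because of the median scaling step, a prediction that differs from the ground truth only by a global scale factor scores zero error on every metric.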
4.3 Surface normal estimation
For surface normal estimates, we examine pixels that fall within the same valid ground truth depth range. We evaluate the average angular error per pixel as well as the percentage of pixels whose angular error falls within a threshold of the ground truth. Table 2 shows that our loss formulation is useful for improving surface normal prediction. As a baseline we use the surface normals derived from the depth predictions. These results indicate that derived normals are no replacement for an independent surface normal prediction. Our predicted normals are much less susceptible to noisy depth values than their derived counterparts. Figure 5 shows a qualitative comparison of our predicted results compared to the derived normals. When the depth estimation is fairly accurate, the derived normals are only slightly noisier than the prediction, as in row (1). However, in cases where the depth predictions are not as high quality, the predicted normals are often still very good, while the derived normals degrade significantly, as in rows (2) and (3). This effect is why we rely on the independent surface normal prediction branch when generating a 3D reconstruction.
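The two surface normal metrics described above can be computed as in the following NumPy sketch. The function names and the `(3, H, W)` layout are assumptions, and the default thresholds of 11.25, 22.5, and 30 degrees are the ones conventionally used in the surface normal literature rather than values stated in this section. Both normal maps are assumed to contain no zero-length vectors.

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Per-pixel angle in degrees between two (3, H, W) normal maps."""
    pred = pred / np.linalg.norm(pred, axis=0, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=0, keepdims=True)
    # Clip guards against dot products marginally outside [-1, 1].
    cos = np.clip(np.sum(pred * gt, axis=0), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def normal_metrics(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    """Mean angular error and the fraction of pixels under each threshold."""
    err = angular_error_deg(pred, gt)
    return err.mean(), {t: np.mean(err < t) for t in thresholds}
```

In practice the error map would first be masked to the valid ground truth depth range before averaging.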
4.4 Geodesic map inputs
We also delve deeper into the impact of the latitude and longitude map priors in the network. Fixing all other aspects of the network, we evaluate the performance of our network on the SUMO dataset with and without the geodesic map channels. Consistent with our expectations, the results in the bottom block of Table 2 suggest that the geodesic map inputs have a positive impact on surface normal estimation. We surmise that the geodesic map helps the network disambiguate the orientation of the surface normal. It is notable that without the geodesic map, we see errors occur at the peak point of barreling on planes in the equirectangular projection as in the top-left image in Figure 6. Interestingly, longitude provides the most important information, which aligns with what we observe in Figure 6: predictions changing abruptly along the rows.
Because the equirectangular grid is indexed by spherical coordinates rather than a Cartesian grid, the distance between adjacent pixels is row-dependent as well. Adjacent pixels nearer to the top and bottom of the image actually lie closer together on the sphere than adjacent pixels near the middle of the image do. This sampling scheme is problematic for CNNs because the convolution operation’s translation equivariance inherently assumes an even sampling. Somehow the network needs to learn to map the geodesic sampling to a Cartesian one. Our experiments suggest that including the geodesic maps as extra input channels is a useful way to pass this information to the network. These findings line up with the results of Liu et al. [17], who show that incorporating pixel location information can help a network learn some degree of translation dependence, which is what we also need to achieve.
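The row dependence of the sampling can be made explicit with a short calculation. On a unit sphere, horizontally adjacent pixels in a row at latitude φ are separated by an arc of length (2π / W) cos φ along that circle of latitude, so the spacing shrinks toward the poles. The sketch below evaluates this per row; the function name and the pixel-center latitude convention are assumptions for illustration.

```python
import numpy as np

def horizontal_spacing(height, width):
    """Arc length between horizontally adjacent pixels, per row, on a unit sphere."""
    # Latitude of each row's pixel centers, from near +pi/2 (top) to near -pi/2.
    lat = np.pi / 2 - (np.arange(height) + 0.5) * np.pi / height
    return (2 * np.pi / width) * np.cos(lat)

# Rows near the top and bottom have smaller spacing than rows near the middle.
spacing = horizontal_spacing(4, 8)
```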
5 3D Planar Model Reconstruction
An important consequence of our planarity assumption is that the network provides all of the information necessary to detect and segment planes in the input images. By defining these planes, we can generate “pop-up” models from a single image, as proposed by Hoiem et al. [11]. Indoor omnidirectional images are uniquely suited to produce these types of reconstructions as they are capable of capturing entire rooms in a single image.
To generate these reconstructions, we first isolate the sharpest edges in the planar boundary map using Otsu thresholding [18] and then identify each connected component in the resulting segmentation. An example of the result of this plane segmentation is shown in Figure 7. Thanks to the quality of our plane boundary predictions, this segmentation process requires no threshold tuning. To turn this segmentation into a 3D planar model, we first compute the median normal within each segmented plane. Then, we estimate the distance parameter of the plane equation in each segment using a 1-parameter RANSAC [8] with a final least-squares refinement over the inliers. Lastly, we project each pixel onto its associated plane. The model is finally “popped-up” in 3D by backprojecting the point cloud according to these new depths. We mesh the points by resampling to the vertices of an icosahedral triangular grid and scaling the vertices according to the adjusted depths, resulting in the models shown in Figure 8.
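The plane fitting and projection steps for a single segment can be sketched as follows; the thresholding and connected-component steps are omitted. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the inlier tolerance `tol`, the iteration count, and the fixed random seed are all hypothetical. The normal is held fixed (for example, the segment's median normal), so a single sampled point determines a candidate plane distance, and the least-squares refinement over the inliers reduces to their mean.

```python
import numpy as np

def fit_plane_distance(points, normal, tol=0.05, iters=100, seed=0):
    """1-parameter RANSAC for p in n . X = p, with the unit normal n held fixed.

    points has shape (N, 3); returns the refined p and the inlier mask.
    """
    rng = np.random.default_rng(seed)
    proj = points @ normal  # signed offset of every point along the normal
    best = np.zeros(len(proj), dtype=bool)
    for _ in range(iters):
        candidate = proj[rng.integers(len(proj))]  # one sample fixes p
        inliers = np.abs(proj - candidate) < tol
        if inliers.sum() > best.sum():
            best = inliers
    # With n fixed, the least-squares estimate of p is the inlier mean.
    return proj[best].mean(), best

def project_depth(rays, normal, p):
    """Depth along each unit ray (N, 3) at which it meets the plane n . X = p."""
    return p / (rays @ normal)

# Example: four points on the plane z = 2 plus one outlier.
pts = np.array([[0, 0, 2], [1, 0, 2], [0, 1, 2], [1, 1, 2], [0, 0, 5]], dtype=float)
p_hat, mask = fit_plane_distance(pts, np.array([0.0, 0.0, 1.0]))
```

The adjusted depths returned by `project_depth` are what would then be used to back-project each pixel of the segment onto its plane.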
Reiterating the importance of surface normal prediction, we found incorporating normal information to be vital to our RANSAC routine. Estimating planes solely from the depth estimates gives a much noisier reconstruction. Furthermore, we observe that having plane information allows us to produce higher quality 3D models than those generated from depth estimates alone. Figure 9 compares our method, which leverages depth, normals, and boundary information, to the baseline network, which only estimates depth. Where the latter model suffers from smoothed edges, ours is able to produce sharp plane boundaries.
The significant drawback of monocular depth estimation is that the lack of any regularization over the estimates leads to fairly noisy predictions. This stands in contrast to stereo methods (and even pseudo-stereo methods like Godard et al. [10]) in which a second image can be used to ensure consistency in the depth map. However, with our planar assumption, we can resolve some of the depth ambiguity while staying purely monocular. Moreover, the planar constraint removes the dependence on texture to recover depth. Although making assumptions about the scene may be impractical for specific tasks like autonomous vehicle depth estimation [22], Figure 9 demonstrates that a simple planarity assumption can be leveraged with great effect for indoor 3D modeling.
6 Conclusion
We have presented a CNN capable of predicting depth, surface normals, and planar boundaries from a single indoor image. Using a novel plane-aware loss function, we have achieved state-of-the-art results for these tasks. We have also demonstrated that the inclusion of a geodesic map can improve surface normal estimates for omnidirectional images. Lastly, we have shown that our network provides all the information necessary to produce a 3D planar model of the scene. Looking ahead, we see an emerging opportunity to utilize this type of all-in-one prediction from omnidirectional images to bootstrap indoor 3D reconstruction.
Appendix A Extended Results
In this section, we provide a further qualitative review of our work. Figure 10 shows more examples of our network’s depth estimates compared to our baseline. Figure 11 provides more cases to justify the prediction of normals independently from depth. Finally, Figure 12 shows more comparisons of pop-up reconstructions along with examples of the plane boundary predictions and segmentations.
References
[1] T. Cohen, M. Geiger, J. Köhler, and M. Welling. Convolutional networks for spherical signals. arXiv preprint arXiv:1709.04893, 2017.
[2] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.
[3] B. Coors, A. P. Condurache, and A. Geiger. SphereNet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 518–533, 2018.
[4] T. Dharmasiri, A. Spek, and T. Drummond. Joint prediction of depths, normals and surface curvature from RGB images using CNNs. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 1505–1512. IEEE, 2017.
[5] M. Eder and J.-M. Frahm. Convolutions on spherical images. arXiv preprint arXiv:1905.08409, 2019.
[6] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2366–2374. Curran Associates, Inc., 2014.
[8] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[9] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
[10] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
[11] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. In ACM Transactions on Graphics (TOG), volume 24, pages 577–584. ACM, 2005.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
[14] C.-Y. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. RoomNet: End-to-end room layout estimation. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 4875–4884. IEEE, 2017.
[15] Z. Li and N. Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[16] C. Liu, J. Yang, D. Ceylan, E. Yumer, and Y. Furukawa. PlaneNet: Piece-wise planar reconstruction from a single RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2579–2588, 2018.
[17] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution. arXiv preprint arXiv:1807.03247, 2018.
[18] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.
[19] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. GeoNet: Geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 283–291, 2018.
[20] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[21] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In Advances in Neural Information Processing Systems, pages 1161–1168, 2006.
[22] N. Smolyanskiy, A. Kamenev, and S. Birchfield. On the importance of stereo for accurate depth estimation: An efficient semi-supervised deep neural network approach. arXiv preprint arXiv:1803.09719, 2018.
[23] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[24] A. Spek, W. H. Li, and T. Drummond. A fast method for computing principal curvatures from range images. arXiv preprint arXiv:1707.00385, 2017.
[25] Y.-C. Su and K. Grauman. Learning spherical convolution for fast features from 360 imagery. In Advances in Neural Information Processing Systems, pages 529–539, 2017.
[26] K. Tateno, N. Navab, and F. Tombari. Distortion-aware convolutional filters for dense prediction in panoramic images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 707–722, 2018.
[27] L. Tchapmi and D. Huber. The SUMO challenge.
[28] J. Xu, B. Stenger, T. Kerola, and T. Tung. Pano2CAD: Room layout from a single panorama image. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 354–362. IEEE, 2017.
[29] N. Zioulis, A. Karakottas, D. Zarpalas, and P. Daras. OmniDepth: Dense depth estimation for indoors spherical panoramas. arXiv preprint arXiv:1807.09620, 2018.
[30] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. LayoutNet: Reconstructing the 3D room layout from a single RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2051–2059, 2018.