Tangent Images for Mitigating Spherical Distortion
In this work, we propose “tangent images,” a spherical image representation that facilitates transferable and scalable computer vision. Inspired by techniques in cartography and computer graphics, we render a spherical image to a set of distortion-mitigated, locally-planar image grids tangent to a subdivided icosahedron. By varying the resolution of these grids independently of the subdivision level, we can effectively represent high resolution spherical images while still benefiting from the low-distortion icosahedral spherical approximation. We show that training standard convolutional neural networks on tangent images compares favorably to the many specialized spherical convolutional kernels that have been developed, while also allowing us to scale training to significantly higher spherical resolutions. Furthermore, because we do not require specialized kernels, we show that we can transfer networks trained on perspective images to spherical data without fine-tuning and with limited performance drop-off. Finally, we demonstrate that tangent images can be used to improve the quality of sparse feature detection on spherical images, illustrating their usefulness for traditional computer vision tasks like structure-from-motion and SLAM.
1 Introduction
A number of methods have been proposed to address convolutions on spherical images. These techniques vary in design, encompassing learnable transformations [26, 27], generalizations and modifications of the convolution operation [9, 10, 12, 28], and specialized kernels for spherical representations [8, 17, 30]. In general, these spherical convolutions fall into two classes: those that operate on equirectangular projections and those that operate on a subdivided icosahedral representation of the sphere. The latter has been shown to significantly mitigate spherical distortion, which leads to significant improvements for dense prediction tasks [11, 12, 20]. It also has the useful property that the icosahedron’s faces and vertices scale roughly by a factor of 4 at each subdivision, permitting a simple analogy to upsampling and downsampling operations in standard convolutional neural networks (CNNs). Because of the performance improvements provided by the subdivided icosahedron, we focus expressly on this representation.
Despite a growing body of work on these icosahedral convolutions, there are two significant impediments to further development: (1) the transferability of standard CNNs to spherical data on the icosahedron, and (2) the difficulty in scaling the proposed spherical convolution operations to high resolution spherical images. Prior work has implied [8, 12] or demonstrated [10, 28, 30] the transferability of networks trained on perspective images to different spherical representations. However, those who report results see a noticeable decrease in accuracy compared to CNN performance on perspective images and specialized networks that are trained natively on spherical data, leaving this important and desired behavior an unresolved question. Additionally, the proposed specialized convolutional kernels either require subsequent network tuning [8, 30] or are incompatible with the standard convolution.
Nearly all prior work on icosahedral convolutions has been built on the analogy between pixels and faces [8, 20] or pixels and vertices [12, 17, 30]. While elegant on the surface, this parallel has led to difficulties in scaling to higher resolution spherical images. Figure 2 depicts spherical image resolutions evaluated in the prior work. Notice that the highest resolution obtained so far is a level 8 subdivision, which is comparable to a equirectangular image. Superficially, this pixel resolution seems reasonably high, but the angular resolution per pixel is still quite low. A equirectangular image has an angular resolution of . For comparison, a VGA resolution (640 × 480) perspective image with field of view (FOV) has an angular resolution of . This is most similar to a equirectangular image, which has an angular resolution of and corresponds to a level 10 subdivided icosahedron. This is a significantly higher resolution than prior work has been capable of demonstrating, and this is the resolution on which we demonstrate our proposed approach in this paper.
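The kind of resolution comparison made above can be reproduced with a few lines of arithmetic. The sketch below assumes the standard closed-form element counts for an icosahedron subdivided `level` times (20·4^level faces and 10·4^level + 2 vertices) and takes "angular resolution" to mean field of view divided by image width in pixels. The function names and the example image widths are illustrative assumptions, not values taken from the text.

```python
def icosahedron_counts(level):
    """Return (faces, vertices) of an icosahedron subdivided `level` times."""
    faces = 20 * 4 ** level
    vertices = 10 * 4 ** level + 2
    return faces, vertices


def angular_resolution_deg(fov_deg, width_px):
    """Degrees of field of view covered by one pixel along the image width."""
    return fov_deg / width_px


def equirect_angular_resolution_deg(width_px):
    """An equirectangular image spans 360 degrees of longitude across its width."""
    return angular_resolution_deg(360.0, width_px)


if __name__ == "__main__":
    for level in (5, 8, 10):
        faces, vertices = icosahedron_counts(level)
        print(f"level {level}: {faces} faces, {vertices} vertices")
    # Illustrative equirectangular widths (assumed; height = width / 2).
    for width in (1024, 4096):
        res = equirect_angular_resolution_deg(width)
        print(f"{width // 2}x{width}: {res:.3f} deg/pixel")
```

Comparing the vertex count at a given level against the pixel count of a candidate equirectangular size is one simple way to judge which image resolution a subdivision level is "comparable" to.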
In this work, we aim to address both transferability and scalability while leveraging efficient implementations of existing network architectures and operations. To this end, we propose a solution that decouples resolution from subdivision level using oriented, distortion-mitigated images that can be filtered with the standard grid convolution operation. Using these tangent images, standard CNN performance is competitive with specialized networks, yet they efficiently scale to high resolution spherical data and open the door to performance-preserving network transfer between perspective and spherical data. Furthermore, use of the standard convolution operation allows us to leverage highly-optimized convolution implementations, such as those from the cuDNN library, to train our networks. Additionally, the benefits of tangent images are not restricted to deep learning, as they address distortion through the data representation rather than the data processing tools. This means that our approach can be used for traditional vision applications like structure-from-motion and SLAM as well.
We summarize our contributions as follows:
We propose the tangent image spherical representation: a set of oriented, low-distortion images rendered tangent to faces of the icosahedron.
We show that standard CNNs trained on tangent images perform competitively with specialized spherical convolutional kernels while also scaling effectively to high resolution spherical data.
We demonstrate that, using tangent images, CNNs trained on perspective images can be applied to spherical data with no fine-tuning and minimal performance drop-off.
We illustrate that tangent images have utility for traditional computer vision tasks by using them to improve sparse keypoint matching on spherical data.
2 Related Work
Recently, there have been a number of efforts to close the gap between CNN performance on perspective images and spherical images. These efforts can be naturally divided based on the spherical image representation used.
2.1 Equirectangular images
Equirectangular images are a popular spherical image representation thanks to their simple relation between rectangular and spherical coordinates. However, they demonstrate severe image distortion as a result. A number of methods have been proposed to address this issue. Su and Grauman develop a learnable, adaptive kernel to train a CNN to transfer models trained on perspective images to the equirectangular domain. Su et al. extend this idea by developing a kernel that learns to transform a feature map according to local distortion properties. Cohen et al. [9, 7] develop spherical convolutions, which provide the rotational equivariance necessary for convolutions on the sphere. This method requires a specialized kernel, however, making it difficult to transfer the insights developed from years of research into traditional CNNs. Works from Coors et al. and Tateno et al. address equirectangular image distortion by warping the planar convolution kernel in a location-dependent manner. Because the equirectangular representation is so highly distorted, most recent work on this topic has looked to leverage the distortion-reducing properties of the icosahedral spherical approximation.
2.2 Icosahedral representations
Representing the spherical image as a subdivided icosahedron mitigates spherical distortion, thus improving CNN accuracy compared to techniques that operate on equirectangular images. Eder and Frahm motivate this representation using analysis from the field of cartography. Further research on this representation has primarily focused on the development of novel kernel designs to handle discretization and orientation challenges on the icosahedral manifold. Lee et al. convolve on this representation by defining new, orientation-dependent kernels to sample from triangular faces of the icosahedron. Jiang et al. reparameterize the convolutional kernel as a linear combination of differential operators on the surface of an icosahedral mesh. Zhang et al. present a method that applies a special hexagonal convolution on the icosahedral net. Cohen et al. precompute an atlas of charts at different orientations that cover the icosahedral grid and use masked kernels along with a feature-orienting transform to convolve on these planar representations. Eder et al. define the “mapped convolution” that allows the custom specification of convolution sampling patterns through a type of graph convolution. In this way, they specify the filters’ orientation and sample from the icosahedral surface. Our tangent image representation addresses data orientation by ensuring all tangent images are consistently oriented when rendering and circumvents the discretization issue by rendering to image pixel grids.
3 Mitigating Spherical Distortion
Image distortion is the reason that we cannot simply apply many state-of-the-art CNNs to spherical data. Distortion changes the representation of the image, resulting in local content deformation that violates translational equivariance, the key property of a signal required for convolution functionality. The graph in Figure 3 shows just how little distortion is required to produce a significant drop-off in CNN performance.
Distortion in the most popular spherical image representations, equirectangular images and cube maps, is quite significant, and hence results in even worse performance. Although we can typically remove most lens distortion in perspective images using tools like the Brown-Conrady distortion model, spherical distortion is inescapable. This follows from Gauss’s Theorema Egregium, a consequence of which is that a spherical surface is not isometric to a plane. As such, any effort to represent a spherical image as a planar one will result in some degree of distortion. Thus, our objective, and one shared by cartographers for thousands of years, is limited to finding an optimal planar representation of the sphere for our use case.
3.1 The icosahedral sphere
Consider the classical method of exhaustion of approximating a circle with inscribed regular polygons. It follows that, in three dimensions, we can approximate a sphere in the same way. Thus, the choice of planar spherical approximation ought to be the convex Platonic solid with the most faces: the icosahedron. The icosahedron has been used by cartographers to represent Earth at least as early as Buckminster Fuller’s Dymaxion map, which projects the globe onto the icosahedral net. Recent work in computer vision [8, 11, 12, 20, 17, 30] has demonstrated the shape’s utility for resolving the distortion problem for CNNs on spherical images as well.
While an improvement over single-plane image projections and its Platonic solid cousin, the cube, the 20-face icosahedron on its own is still limited in its distortion-mitigating properties. It can be improved by repeatedly applying Loop subdivision to subdivide the faces and interpolate the vertices, producing increasingly close spherical approximations with decreasing amounts of local distortion on each face. Figure 4 demonstrates how distortion decreases at each subdivision level. Not all prior work takes advantage of this extra distortion reduction, though. There has largely been a trade-off between efficiency and representation. The charts used by Cohen et al. and the net used by Zhang et al. are efficient thanks to their planar image representations, but they are limited to the distortion properties of a level 0 icosahedron. On the other hand, the mapped convolution proposed by Eder et al. operates on the mesh itself and thus can benefit from higher level subdivision, but it does not scale well to higher level meshes due to cache coherence problems when computing intermediate features on the mesh. Jiang et al. provide efficient performance on the mesh, but do so by approximating convolution with a differential operator, which means existing networks cannot be transferred. It is also interesting to note that the current top-performing method for many deep learning tasks uses the net of the level 0 icosahedron. This suggests that extensive subdivisions may not be necessary for all use cases.
Practical methods for processing spherical images must address the efficient scalability problem, but also should permit the transfer of well-researched, high-performance methods designed for perspective images. They should also provide the opportunity to modulate the level of acceptable distortion depending on the application. To address these constraints, we propose to break the coupling of subdivision level and spherical image resolution by representing a spherical image as a collection of images with tunable resolution and distortion characteristics.
3.2 Tangent images
Subdividing the icosahedron provides diminishing returns rather quickly from a distortion-reduction perspective, as indicated by the red vertical line in Figure 4. Nonetheless, existing methods must continue to subdivide in order to match the spherical image resolution to the number of mesh elements. We untether these considerations by fixing a base level of subdivision, $b$, to define an acceptable degree of distortion, and then rendering the spherical image to square, oriented, planar pixel grids tangent to each face at that base level. The resolution of these tangent images is subsequently determined by the resolution of the spherical input. Given a subdivision level, $s$, corresponding to the spherical input resolution, the dimension of the tangent image, $d$, is given by the relation:

$$d = 2^{s-b} \tag{1}$$
This design preserves the same resolution scaling that would occur through further subdivisions by instead increasing the resolution of the tangent image. This relationship is illustrated in Figure 5.
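The decoupling of base level and input resolution can be illustrated with a short sketch. It assumes that the tangent image side length doubles for every subdivision level between the base level and the input level (i.e. a side of 2^(s-b) pixels) and that there is one tangent image per face of the base-level icosahedron (20·4^b). The function name `tangent_image_layout` is hypothetical.

```python
def tangent_image_layout(input_level, base_level):
    """Return (num_images, side_px, total_px) for a tangent image set.

    Assumes one square image per face of the base-level icosahedron and a
    side length that doubles with every level between base and input.
    """
    if base_level < 0 or input_level < base_level:
        raise ValueError("require 0 <= base_level <= input_level")
    num_images = 20 * 4 ** base_level
    side_px = 2 ** (input_level - base_level)
    total_px = num_images * side_px * side_px
    return num_images, side_px, total_px


if __name__ == "__main__":
    for base in (0, 1, 2):
        n, d, total = tangent_image_layout(10, base)
        print(f"base level {base}: {n} images of {d}x{d} pixels ({total} pixels)")
```

Under these assumptions the total pixel count equals the number of faces of the input-level icosahedron regardless of the base level, which is one way to see that the representation keeps the resolution scaling of further subdivision while fixing the distortion level.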
Our tangent images are motivated by existing techniques in related fields. The approximation of sections of the sphere by low-distortion planar regions is similar to the Universal Transverse Mercator (UTM) geodetic coordinate system, which divides the Earth into a number of nearly-Euclidean zones. Additionally, as tangent images can be thought of as rendering a spherical mesh to a set of quad textures, the high resolution benefits are similar to Ptex , a computer graphics technique that enables efficient high-resolution texturing by providing every quad of a 3D mesh with its own texture map. A visualization of the tangent image concept is provided in Figure 1.
Computing tangent images Tangent images are the gnomonic projection of the spherical data onto oriented, square planes centered at each face of a level $b$ subdivided icosahedron. The number of tangent images, $N$, is determined by the faces of the base level icosahedron: $N = 20 \cdot 4^{b}$, while their spatial extent is a function of the vertex resolution, $r_{b-1}$, of the level $b-1$ icosahedron and the resolution of the image grid, given by Equation (1). Let $(\phi_c, \lambda_c)$ be the barycenter of a triangular face of the icosahedron in spherical coordinates. We then compute the bounds of the plane in spherical coordinates as the inverse gnomonic projection at central latitude and longitude $(\phi_c, \lambda_c)$ of the points:
The vertex resolution, $r_s$, of a level $s$ icosahedron, $\mathcal{S}_s$, is computed as the mean angle between all vertices, $v_i$, and their neighbors, $v_j$:

$$r_s = \frac{1}{|\mathcal{S}_s|} \sum_{v_i \in \mathcal{S}_s} \frac{1}{|\mathcal{N}(v_i)|} \sum_{v_j \in \mathcal{N}(v_i)} \arccos\left(\frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}\right) \tag{3}$$

where $\mathcal{N}(v_i)$ denotes the set of vertices adjacent to $v_i$.
Using $r_{b-1}$ ensures that the tangent images completely cover their associated triangular faces. Because vertex resolution roughly halves at each subsequent subdivision level, we define $r_{-1} = 2 r_0$.
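As a concrete instance of the vertex resolution described above (the mean angle between each vertex and its neighbors), the following sketch evaluates it for the level 0 icosahedron using only the standard library. Finding neighbors as the five nearest vertices is a shortcut that is valid only for the unsubdivided icosahedron; for subdivided meshes the mesh adjacency should be used instead. The function names are hypothetical.

```python
import itertools
import math


def icosahedron_vertices():
    """Return the 12 vertices of a regular icosahedron on the unit sphere."""
    phi = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio
    verts = []
    for a, b in itertools.product((-1.0, 1.0), repeat=2):
        # Cyclic permutations of (0, +-1, +-phi).
        verts += [(0.0, a, b * phi), (a, b * phi, 0.0), (b * phi, 0.0, a)]
    norm = math.sqrt(1.0 + phi * phi)
    return [tuple(c / norm for c in v) for v in verts]


def angle_between(u, v):
    """Angle in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def vertex_resolution(verts, num_neighbors=5):
    """Mean angle between every vertex and its `num_neighbors` nearest vertices."""
    total = 0.0
    for i, v in enumerate(verts):
        angles = sorted(angle_between(v, w) for j, w in enumerate(verts) if j != i)
        total += sum(angles[:num_neighbors]) / num_neighbors
    return total / len(verts)


if __name__ == "__main__":
    r0 = vertex_resolution(icosahedron_vertices())
    print(f"level 0 vertex resolution: {math.degrees(r0):.2f} degrees")
```

For the regular icosahedron every edge subtends the same angle, arccos(1/√5), so the mean reduces to that single value.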
Using tangent images Tangent images require rendering from and to the sphere only once each. First, we create the tangent image set by rendering to the planes defined by Equation (2). Then, we apply the desired perspective image algorithm (e.g. a CNN or keypoint detector). Finally, we compute the regions on each plane visible to a spherical camera at the center of the icosahedron and render the algorithm output back to the sphere.
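The rendering step relies on the gnomonic projection and its inverse. The sketch below implements the standard textbook formulas for a unit sphere (angles in radians) and uses the inverse to find the spherical coordinates of every pixel center of a square grid on the tangent plane. The grid extent is expressed directly in tangent-plane units, which is an assumption made for illustration; the function names, including `tangent_grid`, are hypothetical and this is not the released implementation.

```python
import math


def gnomonic_forward(lat, lon, lat0, lon0):
    """Project (lat, lon) onto the plane tangent to the unit sphere at (lat0, lon0)."""
    cos_c = (math.sin(lat0) * math.sin(lat)
             + math.cos(lat0) * math.cos(lat) * math.cos(lon - lon0))
    if cos_c <= 0.0:
        raise ValueError("point is not on the hemisphere facing the tangent plane")
    x = math.cos(lat) * math.sin(lon - lon0) / cos_c
    y = (math.cos(lat0) * math.sin(lat)
         - math.sin(lat0) * math.cos(lat) * math.cos(lon - lon0)) / cos_c
    return x, y


def gnomonic_inverse(x, y, lat0, lon0):
    """Map tangent-plane coordinates (x, y) back to (lat, lon) on the unit sphere."""
    rho = math.hypot(x, y)
    if rho == 0.0:
        return lat0, lon0
    c = math.atan(rho)
    sin_lat = math.cos(c) * math.sin(lat0) + y * math.sin(c) * math.cos(lat0) / rho
    lat = math.asin(max(-1.0, min(1.0, sin_lat)))
    lon = lon0 + math.atan2(
        x * math.sin(c),
        rho * math.cos(lat0) * math.cos(c) - y * math.sin(lat0) * math.sin(c))
    return lat, lon


def tangent_grid(lat0, lon0, extent, dim):
    """Spherical coordinates of the pixel centers of a dim x dim tangent image.

    The grid spans [-extent/2, extent/2] in both plane axes (assumed units),
    with row 0 at the top (largest y).
    """
    step = extent / dim
    grid = []
    for row in range(dim):
        y = extent / 2.0 - (row + 0.5) * step
        grid.append([
            gnomonic_inverse(-extent / 2.0 + (col + 0.5) * step, y, lat0, lon0)
            for col in range(dim)
        ])
    return grid
```

Sampling the spherical image at the coordinates returned by `tangent_grid` would produce one tangent image; the forward projection is what maps results on the plane back to the sphere.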
We have released our tangent image rendering code and associated experiments as a PyTorch extension.
4 Experiments
Prior research has established a common suite of experiments that have become the test bed for new research on spherical convolutions. This set typically includes some combination of spherical MNIST classification [9, 8, 17, 20, 30], shape classification [9, 13, 17], climate pattern segmentation [8, 17, 30], and semantic segmentation [8, 17, 20, 28, 30]. In order to benchmark against these prior works, we evaluate our method on the shape classification and semantic segmentation tasks. Additionally, we demonstrate our method’s fairly seamless transfer of CNNs trained on perspective images to spherical data. Finally, to show the versatility of the tangent image representation, we introduce a new benchmark, sparse keypoint detection on spherical images, and compare our representation to an equirectangular image baseline.
4.1 Classification
We first evaluate our proposed method on the shape classification task. As with prior work, we use the ModelNet40 dataset rendered using the method described by Cohen et al. Because the data densely encompasses the entire sphere, unlike spherical MNIST, which is sparse and projected only on one hemisphere, we believe this task is more indicative of general classification performance.
Experimental setup We use the network architecture from Jiang et al., but we replace the specialized kernels with simple 2D convolutions. A forward pass involves running the convolutional blocks on each patch separately and subsequently aggregating the patch features with average pooling. We train and test on level 5 resolution data as with the prior work.
| Method | Filter | Accuracy |
|---|---|---|
| Cohen et al. | Spherical Correlation | 85.0% |
| Esteves et al. | Spectral Parameterization | 88.9% |
Results and analysis Results of our experiments are shown in Table 1. Without any specialized convolutional kernels, we outperform most of the prior work on this task. The best performing method from Jiang et al. leverages a specialized convolution approximation on the mesh, which inhibits the ability to fine-tune existing CNN models for the task. Our method can be thought of as using a traditional CNN in a multi-view approach to spherical images. This means that, for global inference tasks like classification, we could select our favorite pre-trained network and transfer it to spherical data. In this case, it is likely that some fine-tuning may be necessary to address the final patch aggregation step in our network design.
4.2 Semantic segmentation
We next consider the task of semantic segmentation in order to demonstrate dense prediction capabilities. To compare to prior work, we perform a baseline evaluation of our method at low icosahedron resolutions (5 and 7), but we also evaluate the performance of our method at a level 10 input resolution in order to demonstrate the usefulness of the tangent image representation for processing high resolution spherical data. No prior work has operated at this resolution. We hope that our work can serve as a benchmark for further research on high resolution spherical images.
Experimental setup We train and test our method on the Stanford 2D3DS dataset, as with prior work [9, 8, 17, 30]. We evaluate RGB-D inputs at levels 5, 7, and 10, the maximum resolution provided by the dataset. At level 10 we also evaluate using only RGB inputs to demonstrate the benefit of high resolution capabilities. For the level 5 and 7 experiments, we use the residual UNet-style architecture as in [17, 30], but we again replace the specialized kernels with convolutions. The higher resolution of the level 10 inputs requires the larger receptive field of a deeper network, so we use a fully-convolutional ResNet 101 model pre-trained on COCO for those experiments. For level 5 data, we train on the entire set of tangent images, while for the higher resolution experiments, we randomly sample a subset of tangent images from each spherical input to expedite training. We found this sampling method to be useful without loss of accuracy. We liken it to training on multiple perspective views of a scene.
Results and analysis We report the global results of our experiments in Table 2. Results on the Stanford2D3DS dataset are averaged over the 3 folds. Individual class results can be found in the supplementary material (Section C). As expected, our method does not perform as well as prior work at the level 5 resolution. Recall that a level 5 resolution spherical image is equivalent to a perspective image with FOV. Our method takes that already low angular resolution image and separates it into a set of low pixel resolution images. Although it had limited impact on classification, these dual low resolutions are problematic for dense prediction tasks. We expound on the low-resolution limitation further in the supplementary material (Section A).
Where our tangent image representation excels is when scaling to high resolution images. What we sacrifice in low-resolution performance, we make up for by efficiently scaling to high resolution inputs. The graph in Figure 6 shows this effect. By scaling to the full resolution of the dataset, we are able to report the highest performing results ever on this spherical dataset by a wide margin using RGB inputs only. Adding the extra depth channel, we are able to increase the performance further in both mAcc. and mIOU. At input level 10, we find that base level 1 delivers the best trade-off between the lower FOV at higher base levels and the increased distortion present in lower ones. We elaborate on this trade-off in the supplementary material (Section A).
4.3 Network transfer
Because our method inherently converts a spherical image into a collection of perspective ones, we can transfer dense prediction networks trained on perspective images to process spherical inputs without fine-tuning on the spherical data and with limited performance drop-off. We apply the perspective CNN as-is to our tangent image grids and then render the predictions back to the sphere.
Experimental setup We perform two network transfer experiments to highlight both the generalizability and benefit of tangent images. First, we fine-tune the pre-trained, fully-convolutional ResNet101 model provided by the PyTorch model zoo on the RGB perspective image training set of the Stanford2D3DS dataset. This dataset is useful for this task, as it contains corresponding perspective and spherical images for every scene. We then evaluate semantic segmentation performance on the RGB spherical test set at level 8 resolution. During the dataset fine-tuning, we make sure to consider the desired angular pixel resolution of the spherical test data. A network trained on perspective images with an angular pixel resolution of has learned filters accordingly. Should we apply those filters to an image captured at the identical position, at the same image resolution, but with a narrower field-of-view, the difference in angular pixel resolution is effectively conformal scale distortion. To match the angular resolution of our spherical evaluation set, we normalize the camera matrices for all perspective images during training such that they have the same angular pixel resolution as the test images. Because this is effectively a center-crop of the data, we also randomly shift our new camera center in order to capture all parts of the image. Details of this pre-processing are given in the supplementary material (Section B). We do not fine-tune on the spherical data.
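One possible way to realize this kind of angular resolution matching is sketched below. It is not the pre-processing from the supplementary material, which is not reproduced here. The sketch assumes an ideal pinhole camera with a known horizontal FOV, measures angular pixel resolution as FOV divided by width, and picks a square crop which, once resized to `out_size` pixels, has the target resolution. The crop is shifted at random to cover different parts of the image, which ignores the small off-axis change in per-pixel angle. All names are hypothetical.

```python
import math
import random


def focal_from_fov(width_px, fov_deg):
    """Focal length in pixels of a pinhole camera with the given horizontal FOV."""
    return width_px / (2.0 * math.tan(math.radians(fov_deg) / 2.0))


def normalized_crop(width_px, height_px, fov_deg, target_res_deg, out_size, rng=random):
    """Choose a square crop that has `target_res_deg` per pixel after resizing.

    Returns (left, top, crop_px, scale), where `scale` is the resize factor
    that maps the crop to an out_size x out_size image.
    """
    focal = focal_from_fov(width_px, fov_deg)
    # FOV that the output image must cover to reach the target resolution.
    crop_fov_deg = target_res_deg * out_size
    if crop_fov_deg >= 180.0:
        raise ValueError("requested FOV is not representable by a pinhole camera")
    # Width in source pixels that subtends that FOV around the optical axis.
    crop_px = int(round(2.0 * focal * math.tan(math.radians(crop_fov_deg) / 2.0)))
    if crop_px < 1 or crop_px > min(width_px, height_px):
        raise ValueError("requested FOV does not fit inside the source image")
    # Random shift so that, over many samples, all image regions are seen.
    left = rng.randint(0, width_px - crop_px)
    top = rng.randint(0, height_px - crop_px)
    return left, top, crop_px, out_size / crop_px
```

A training loader could call `normalized_crop` once per sample and then crop and resize both the image and its label map with the returned values.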
Additionally, we compare our transfer results to both the transfer results and natively-trained results reported by Zhang et al. on the OmniSYNTHIA dataset. We use the network architecture from Zhang et al. and train on the perspective images from the SYNTHIA dataset that correspond to the OmniSYNTHIA training set using the camera normalization procedure laid out above. We report spherical evaluation results at input levels 6, 7, and 8 using base level 1.
| Input level | Method | mAcc | mIOU |
|---|---|---|---|
| 6 | Zhang et al. (transfer) | 44.8 | 36.7 |
| 6 | Zhang et al. (native) | 52.2 | 43.6 |
| 7 | Zhang et al. (transfer) | 47.2 | 38.0 |
| 7 | Zhang et al. (native) | 57.1 | 48.3 |
| 8 | Zhang et al. (transfer) | 52.8 | 45.3 |
| 8 | Zhang et al. (native) | 55.1 | 47.1 |
Results and analysis Results for the Stanford2D3DS dataset and OmniSYNTHIA dataset experiments are given in Tables 3 and 4, respectively. In the Stanford2D3DS experiment, recall that both results are achieved using the network trained only on perspective data. Using tangent images, we are able to preserve of the accuracy and of the IOU of the perspective evaluation without any subsequent network tuning. This is because the tangent images on which the spherical data is rendered are, in fact, quite similar to perspective images. They have limited distortion, and we have matched the angular pixel resolution. Additionally, in the OmniSYNTHIA dataset experiment, we show that our tangent images significantly outperform the transfer method from Zhang et al. without specialized kernels or subsequent fine-tuning (results from Zhang et al. are after 10 epochs of fine-tuning on spherical data). We also highlight that, at higher resolutions, our transfer results are actually better than the Zhang et al. results that were trained natively on spherical data. In this work, we have been limited by the maximum resolution of current spherical image datasets, but this outcome suggests that network transfer with tangent images allows us to effectively process even higher resolution spherical data by training with high resolution perspective images.
4.4 Sparse keypoint correspondences
Recent research on spherical images has focused on deep learning tasks, primarily because many of those works have focused on the convolution operation. As our contribution relates to the representation of spherical data, not specifically convolution, we aim to show that our approach has applications beyond deep learning. To this end, we evaluate the use of tangent images for sparse keypoint detection, a critical step of structure-from-motion, SLAM, and a variety of other traditional computer vision applications.
Data As there is no existing benchmark for this task, we create a dataset using a subset of the spherical images provided by the Stanford2D3DS dataset. To create this dataset, we first cluster Area 1 images according to the room information provided by the dataset. Then, for each location, we compute SIFT features in the equirectangular images and identify image pairs with FOV overlap using the spherical structure-from-motion pipeline provided by the OpenMVG library. Next, we compute the average volumetric FOV overlap for each overlapping image pair. Because we are dealing with spherical images, all 3D points are technically visible to every camera. That is, there are no image bounds to constrain “visible” regions. Instead, we use the ground truth depth maps and pose information to back-project each image pair into a canonical pose. We then compute the percentage of right image points that are visible to the left camera using the left image depth map to remove occluded points, and vice versa. We average the two values to provide an FOV overlap score for the image pair. This overlap is visualized in Figure 7. We define our keypoints dataset as the top 60 image pairs according to this overlap metric. Finally, we split the resulting dataset into an “Easy” set and “Hard” set, again based on FOV overlap. The resulting dataset statistics are shown in Table 5. All images are evaluated at their full, level 10 resolution. We provide the dataset details in the supplementary material (Section D) to enable further research.
Experimental setup To evaluate our proposed representation, we detect and describe keypoints on the tangent image grids and then render those keypoints back to the spherical image. This rendering step ensures only keypoints visible to a spherical camera at the center of the icosahedron are rendered, as the tangent images have overlapping content. We then use OpenMVG to compute putative correspondences and geometrically-consistent inlier matches.
Results and analysis We evaluate the quality of correspondence matching at 3 different base levels using the equirectangular image format as a baseline. We compute the putative matching ratio (PMR), matching score (MS), and precision (P) metrics defined by Heinly et al. For an image set of $n$ image pairs, where pair $i$ has $p_i$ putative correspondences, $m_i$ inlier matches, and $f_i$ detected keypoints visible to both images, the metrics over the image pairs are defined as follows:

$$\mathrm{PMR} = \frac{1}{n} \sum_{i=1}^{n} \frac{p_i}{f_i}, \qquad \mathrm{MS} = \frac{1}{n} \sum_{i=1}^{n} \frac{m_i}{f_i}, \qquad \mathrm{P} = \frac{1}{n} \sum_{i=1}^{n} \frac{m_i}{p_i}$$
In the same way that we compute the FOV overlap, we use the ground truth pose and depth information provided by the dataset to determine which keypoints in the left image should be visible to the right image and vice versa, accounting for occlusion.
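The three matching metrics can be computed as in the sketch below. It assumes the usual per-pair ratios (putative matches over visible keypoints for PMR, inliers over visible keypoints for MS, and inliers over putative matches for P), averaged over all image pairs. How the keypoints visible to both images are counted per pair is left to the caller; the input tuple layout and the function names are hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe ratio that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def matching_metrics(pairs):
    """Average PMR, MS and precision over image pairs.

    `pairs` is an iterable of (putative, inliers, visible_keypoints) counts,
    one tuple per image pair.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one image pair is required")
    n = len(pairs)
    pmr = sum(_ratio(p, f) for p, m, f in pairs) / n
    ms = sum(_ratio(m, f) for p, m, f in pairs) / n
    precision = sum(_ratio(m, p) for p, m, f in pairs) / n
    return pmr, ms, precision


if __name__ == "__main__":
    # Made-up counts for two image pairs, for illustration only.
    pmr, ms, precision = matching_metrics([(50, 25, 100), (20, 10, 40)])
    print(f"PMR={pmr:.2f} MS={ms:.2f} P={precision:.2f}")
```

Averaging per-pair ratios, rather than pooling counts across pairs, keeps image pairs with many keypoints from dominating the score.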
Results are given in Table 6. Our use of tangent images has a strong impact on the resulting correspondences, particularly on the hard split. Recall that this split has a lower FOV overlap and fewer inlier matches at the baseline equirectangular representation. Improved performance in this case is thus especially useful. We observe a significant improvement in PMR in both splits. We attribute this improvement to the computation of the SIFT feature vector on our less distorted representation. Like the convolution operation, SIFT descriptors also require translational equivariance in the detection domain. Tangent images restore this property with their low-distortion representation, which results in repeatable descriptors. The better ratio of higher quality putative matches as well as the better localization of the keypoints affects the inlier matches as well, resulting in a better MS score. We attribute the leveling off in performance beyond level 1 to the reduced FOV of higher level subdivisions, which affects the detector’s ability to find keypoints at larger scales.
| Split | # Pairs | Mean FOV Overlap | # Corr. |
5 Conclusion
We have presented tangent images, a spherical image representation that renders the image onto oriented pixel grids tangent to a subdivided icosahedron. We have shown that these tangent images do not require specialized convolutional kernels for training CNNs and efficiently scale to represent high resolution data. We have also shown that they facilitate the transfer of networks trained on perspective images to spherical data with limited performance loss. These results further suggest that network transfer using tangent images can open the door to processing even higher resolution spherical images. Lastly, we have demonstrated the utility of tangent images for traditional computer vision tasks in addition to deep learning. Our results indicate that tangent images can be a very useful spherical representation for a wide variety of computer vision applications.
Acknowledgements We would like to thank David Luebke, Pierre Moulon, Li Guan, and Jared Heinly for their efforts and consultation in support of this work.
In this supplementary material we provide the following additional information:
Expanded discussion of some of the current limitations of tangent images (Section A)
Further description and analysis of our transfer learning experiments, including class-level results (Section B)
Class-level and qualitative results for the semantic segmentation experiments at different input resolutions (Section C)
Details of our 2D3DS Keypoints dataset along with individual image pair results and a qualitative comparison of select image pairs (Section D)
Training and architecture details for all CNN experiments detailed in this paper (Section E)
An example of a spherical image represented as tangent images (Figure 9)
Appendix A Limitations
We have demonstrated the usefulness of our proposed tangent images, but we have also exposed some limitations of the formulation and opportunities for further research.
Resolution When using tangent images, low angular resolution spherical data is processed on low pixel resolution images. This can severely limit the receptive field of the networks, to which we attribute our poor performance on the level 5 semantic segmentation task, for example. However, this limitation is only notable in the context of the existing literature, because prior work has been restricted to low resolution spherical data, as shown in Figure 2. A viable solution is to incorporate the rendering into the convolution operation. In this way, we can regenerate the tangent images at every operation and effectively resolve the receptive field issue. However, as this is an issue for low resolution images, and we are interested in high resolution spherical data, we leave this modification for future study.
FOV The base subdivision level provides a constraint on the FOV of the tangent images. Table 7 shows the FOV of the tangent images at different base subdivision levels. As the FOV decreases, algorithms that rely on some sense of context or neighborhood break down. We observe this effect for both the CNN and keypoint experiments. While this is certainly a trade-off with tangent images, we have demonstrated that base levels and resolutions can be modulated to fit the required application. Another important point to observe regarding tangent image FOV is that the relationship between FOV and subdivision level does not hold perfectly at lower subdivision levels due to the outsize influence of faces near the 12 singular points on the icosahedron. This effect largely disappears after base level 2, but when normalizing camera matrices to match spherical angular resolution at base levels 0 and 1, it is necessary to choose the right base level for the data. We use a base level of 1 in our transfer learning experiments on the OmniSYNTHIA dataset for this reason.
Appendix B Network Transfer
In this section, we first address the goals of the network transfer experiments. Then, we provide ablation results demonstrating the importance of the camera normalization pre-processing step. We also include class-level results for our experiments. Finally, we give further explanation of the training data camera normalization process.
With our contribution, we aim to address equivalent network performance regardless of the input data format. That is, for a given network, we strive to achieve equal performance on both perspective and spherical data. This objective is motivated by the limited number of spherical image datasets and the difficulty of collecting large scale spherical training data. If we can achieve high transferability of perspective image networks, we reduce the need for large amounts of spherical training data to train specialized spherical networks. Our results suggest that investigating transferable representations, like our tangent images, might be an avenue to achieve this goal.
B.2 Transferring without camera normalization
As an ablation study, we apply the ResNet 101 semantic segmentation model from Section 4.3 of the main paper to spherical inputs at various resolutions. Recall that our model has been trained on perspective images normalized to have a per-pixel angular resolution equivalent to that of a level 8 icosahedron. The bar chart in Figure 8 shows what happens when we apply this model to inputs at other angular resolutions, motivating the need for the camera normalization step. Observe how performance deteriorates as the angular resolution of the spherical input moves further from the per-pixel angular resolution of the training data.
B.3 Camera normalization
In order to ensure angular resolution conformity between perspective training data and spherical test data, we normalize the intrinsic camera matrices of our training images to match the angular resolution of the spherical inputs. To do this, we resample all perspective image input to a common camera with the desired angular resolution. The angular resolution in radians-per-pixel of an image, $\alpha_x$ and $\alpha_y$, is defined as:
$$\alpha_x = \frac{2\arctan\left(\frac{W}{2 f_x}\right)}{W}, \qquad \alpha_y = \frac{2\arctan\left(\frac{H}{2 f_y}\right)}{H}$$
where the numerator computes the FOV of the image as a function of the image dimensions, $W$ and $H$, and the focal lengths, $f_x$ and $f_y$. Because spherical input has uniform angular resolution in every direction, we resample our perspective inputs likewise: $\alpha_x = \alpha_y$.
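As a minimal sketch of this definition, the following Python snippet computes the per-pixel angular resolution of a pinhole camera along one image axis. The function name `angular_resolution` and the example camera (a 512-pixel-wide image with a 90° field of view) are illustrative assumptions, not values from our experiments.

```python
import math

def angular_resolution(size_px: float, focal_px: float) -> float:
    """Radians per pixel along one image axis.

    The numerator is the field of view of a pinhole camera,
    2 * atan(size / (2 * focal)); dividing by the pixel extent
    gives the average angle subtended by one pixel.
    """
    fov = 2.0 * math.atan(size_px / (2.0 * focal_px))
    return fov / size_px

# Hypothetical camera: 512 px wide with a 90-degree horizontal FOV,
# which corresponds to a focal length of 256 px.
alpha_x = angular_resolution(512, 256.0)
```

Applying the same function to the vertical axis (image height and vertical focal length) gives the second angular resolution, which should match the first after resampling.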
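Choosing camera properties Our choices of $f'_x$, $f'_y$, and new image dimensions, $W'$ and $H'$, for the perspective images depend on the angular resolution of the spherical input. Given a spherical input with angular resolution, $\alpha_s$, we select values for $f'_x$, $f'_y$, $W'$, and $H'$ that satisfy:
$$\alpha_s = \frac{2\arctan\left(\frac{W'}{2 f'_x}\right)}{W'} = \frac{2\arctan\left(\frac{H'}{2 f'_y}\right)}{H'}$$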
While there are a variety of options that we could use, we choose to set because () is a reasonable field of view for a perspective image. We select and accordingly. For a level 8 input, this results in .
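Normalizing intrinsics Recall the definition of the intrinsic matrix, $K$, given focal lengths $f_x$ and $f_y$ and principal point $(c_x, c_y)$:
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$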
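Given our choices for field of view and image dimension explained above, we compute a new, common intrinsic matrix, $K'$. The new focal lengths $f'_x$ and $f'_y$, for new image dimensions $W' \times H'$ and spherical angular resolution $\alpha_s$, can be computed as:
$$f'_x = \frac{W'}{2\tan\left(\alpha_s W' / 2\right)}, \qquad f'_y = \frac{H'}{2\tan\left(\alpha_s H' / 2\right)}$$
The new principal point is given as:
$$(c'_x, c'_y) = \left(\frac{W'}{2}, \frac{H'}{2}\right)$$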
The camera intrinsics can then be normalized using the relation:
$$x' = K' K^{-1} x \tag{12}$$
where $x$ and $x'$ are homogeneous pixel coordinates in the original and resampled images, respectively, $K'$ is the new, common intrinsic matrix, and $K^{-1}$ is the inverse of the intrinsic matrix associated with the original image.
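The normalization can be sketched in a few lines of Python. The helper names (`common_intrinsics`, `normalize_pixel`), the zero-skew pinhole model, the square output image, the placement of the new principal point at the image center, and all numeric values are assumptions made for illustration only.

```python
import math

def common_intrinsics(alpha: float, fov: float):
    """Return (size, (fx, fy, cx, cy)) for a square camera with per-pixel
    angular resolution `alpha` (rad/px) and field of view `fov` (rad)."""
    size = round(fov / alpha)                 # pixels needed to span the FOV
    f = size / (2.0 * math.tan(fov / 2.0))    # focal length in pixels
    c = size / 2.0                            # principal point at the center (assumed)
    return size, (f, f, c, c)

def normalize_pixel(u: float, v: float, K_orig, K_new):
    """Map pixel (u, v) of the original image into the common camera,
    i.e. apply K_new * inverse(K_orig) to the homogeneous pixel."""
    fx, fy, cx, cy = K_orig
    fx_n, fy_n, cx_n, cy_n = K_new
    # inverse(K_orig) back-projects the pixel to a normalized ray (x, y, 1)
    x = (u - cx) / fx
    y = (v - cy) / fy
    # K_new projects that ray into the resampled image
    return fx_n * x + cx_n, fy_n * y + cy_n

# Hypothetical target: 60-degree FOV sampled by 256 pixels.
fov = math.radians(60.0)
alpha = fov / 256
size, K_new = common_intrinsics(alpha, fov)

# Hypothetical original camera (fx, fy, cx, cy).
K_orig = (1000.0, 1000.0, 640.0, 480.0)
center = normalize_pixel(640.0, 480.0, K_orig, K_new)
```

The original principal point lands on the center of the resampled image. In practice, resampling is usually implemented with the inverse mapping, looking up the original image at the location corresponding to each output pixel.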
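Random shifts If we were to simply resample the image using Equation (12), we would end up with center crops of the original perspective images. In order to ensure that we do not discard useful information, we randomly shift the principal point of the original camera by some $(\delta_x, \delta_y)$ before normalizing. This produces variations in where we are sampling from the original image. Including this shift, we arrive at the formula we use for resampling the perspective training data:
$$x' = K' \begin{bmatrix} f_x & 0 & c_x + \delta_x \\ 0 & f_y & c_y + \delta_y \\ 0 & 0 & 1 \end{bmatrix}^{-1} x$$
where $K'$ is the common intrinsic matrix and the bracketed matrix is the original intrinsic matrix with its principal point shifted.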
To ensure our crops are distributed across the entire image, we set:
B.4 Per-class results
Table 8 gives the per-class results for our semantic segmentation transfer experiment. While perspective image performance should be considered an upper bound on spherical performance, note that in some cases, we appear to actually perform better on the spherical image. This is because the spherical evaluation is done on equirectangular images in order to be commensurate across representations. This means that certain labels are duplicated and others are reduced due to distortion, which can skew the per-class results.
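To give a rough sense of how evaluating on equirectangular images can skew per-class statistics, the sketch below compares the share of equirectangular pixels that fall in the polar regions with the share of the sphere's surface that those regions actually cover. The 60° latitude threshold and the function names are arbitrary choices for illustration.

```python
import math

def pixel_fraction_above(lat_deg: float) -> float:
    """Fraction of equirectangular pixels with |latitude| > lat_deg.
    Rows are uniformly spaced in latitude over [-90, 90] degrees."""
    return (90.0 - lat_deg) / 90.0

def area_fraction_above(lat_deg: float) -> float:
    """Fraction of the sphere's surface with |latitude| > lat_deg.
    The two polar caps cover 4*pi*(1 - sin(lat)) out of 4*pi steradians."""
    return 1.0 - math.sin(math.radians(lat_deg))

pixels = pixel_fraction_above(60.0)  # one third of the image rows
area = area_fraction_above(60.0)     # roughly 13% of the sphere
```

A class concentrated near the poles is therefore counted over far more pixels than its true angular extent would suggest, while the evaluation of classes near the equator is comparatively unaffected.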
Appendix C Semantic Segmentation
We provide the per-class quantitative results for our semantic segmentation experiments from Section 4.2 in the main paper. Additionally, we qualitatively analyze the benefits of training higher resolution networks, made possible by the tangent image representation.
C.1 Class Results
Per-class results are given for semantic segmentation in Table 9. Nearly every class benefits from the high-resolution inputs facilitated by tangent images. This is especially noticeable for classes with fine-grained detail and smaller objects, like chair, window, and bookcase. The table class is an interesting example of the benefit of our method. While prior work has higher accuracy, our high resolution classification has significantly better IOU. In other words, our high resolution inputs may not result in correct classifications of every table pixel, but the classifications that are correct are much more precise. This increased precision is reflected almost across the board by mean IOU performance.
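The distinction between per-class accuracy and IOU can be made concrete with a small sketch. We assume here that per-class accuracy means the fraction of ground-truth pixels of a class that are labeled correctly. The pixel counts below are made up purely for illustration and are not taken from Table 9.

```python
def class_accuracy(tp: int, fn: int) -> float:
    """Fraction of ground-truth pixels of the class that are predicted correctly."""
    return tp / (tp + fn)

def class_iou(tp: int, fp: int, fn: int) -> float:
    """Intersection over union: false positives also count against the score."""
    return tp / (tp + fp + fn)

# Hypothetical method A: labels most table pixels correctly but also
# spills the table label onto many non-table pixels.
acc_a = class_accuracy(tp=80, fn=20)
iou_a = class_iou(tp=80, fp=120, fn=20)

# Hypothetical method B: misses more table pixels but rarely over-predicts.
acc_b = class_accuracy(tp=70, fn=30)
iou_b = class_iou(tp=70, fp=10, fn=30)
```

In this toy case method A has the higher accuracy (0.80 versus 0.70), yet method B has the higher IOU because it produces far fewer false positives.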
C.2 Qualitative Results
Figure 10 gives 3 examples of semantic segmentation results at each resolution. The most obvious benefits of the higher resolution results are visible in the granularity of the output segmentation. Notice the fine detail preserved in the chairs in the level 10 output in the bottom row and even the doorway and whiteboard in the middle row. However, recall that our level 10 network uses a base level of 1. The effects of the smaller FOV of the tangent images are visible in the misclassifications of the wall on the right of the level 10 output in the middle row. The level 5 network has no such problems classifying that surface, having been trained at a lower input resolution and using base level 0. Nevertheless, it is worth noting that large, homogeneous regions are going to be problematic for high resolution images, regardless of method, due to receptive field limitations of the network. If the region in question is larger than the receptive field of the network, there is no way for the network to infer context. As such, we are less concerned by errors on these large regions.
Appendix D Stanford 2D3DS Keypoints Dataset
D.2 Qualitative Examples
We provide a qualitative comparison of keypoint detections in Figure 11. These images illustrate two interesting effects, in particular. First, in highly distorted regions of the image that have repeatable texture, like the floor in both images, detecting on the equirectangular format produces a number of redundant keypoints distributed over the distorted region. With tangent images, we see fewer redundant points as the base level increases, and the ones that are detected are more accurate and robust, as indicated by the higher MS score. Additionally, the equirectangular representation results in more keypoint detections at larger scales. These outsize scales are an effect of distortion. Rotating the camera so that the corresponding keypoints are detected at different pixel locations with different distortion characteristics will produce a different scale, and consequently a different descriptor. This demonstrates the need for translational equivariance in keypoint detection, which requires the lower distortion provided by our tangent images. This is reflected quantitatively by the higher PMR scores.
Figure 12 shows an example of inlier correspondences computed on the equirectangular images and at different base levels for an image pair from the hard split. Even though we detect fewer keypoints using tangent images, we still have the same quality or better inlier correspondences. Distortion in the equirectangular format results in keypoint over-detection, which can potentially strain the subsequent inlier model fitting. Using tangent images, we detect fewer, but higher quality, samples. This results in more efficient and reliable RANSAC [14] model fitting. This is why tangent images perform noticeably better on the hard set, where there are fewer correspondences to be found.
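The effect of sample quality on model fitting can be illustrated with the standard RANSAC iteration bound, which gives the number of random samples needed to draw at least one all-inlier minimal set with a desired confidence. The inlier ratios, the minimal sample size of 8, and the 99% confidence below are hypothetical values chosen for illustration, not measurements from our experiments.

```python
import math

def ransac_iterations(inlier_ratio: float, sample_size: int,
                      confidence: float = 0.99) -> int:
    """Number of iterations N such that, with probability `confidence`,
    at least one sampled minimal set contains only inliers:
        N = log(1 - confidence) / log(1 - inlier_ratio ** sample_size)
    """
    p_good_sample = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))

n_low = ransac_iterations(0.3, 8)   # many spurious matches
n_high = ransac_iterations(0.6, 8)  # fewer, cleaner matches
```

Doubling the inlier ratio in this toy setting reduces the required number of iterations by more than two orders of magnitude, which is why a smaller set of higher quality matches can be preferable to a larger, noisier one.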
Appendix E Network Training Details
We detail the training parameters and network architectures used in our experiments to encourage reproducible research. All deep learning experiments were performed using the PyTorch library.
E.1 Shape classification
For the shape classification experiment, we use the network architecture from [17], replacing their MeshConv layers with 2D convolutions with unit padding. For downsampling operations, we bilinearly interpolate the tangent images. We first render the ModelNet40 [29] shapes to equirectangular images to be compatible with our tangent image generation implementation. The equirectangular image dimensions are , which is equivalent to the level 5 icosahedron. We train the network with a batch size of and learning rate of using the Adam optimizer [19].
E.2 Semantic segmentation
We use the residual U-Net-style architecture from [17, 30] for our semantic segmentation results at levels 5 and 7. Instead of MeshConv [17] and HexConv [30], we swap in 2D convolutions with unit padding. For the level 10 results, we use the fully-convolutional ResNet101 [15] pre-trained on COCO [21] provided in the PyTorch model zoo. We train and test on each of the standard folds of the Stanford 2D3DS dataset [2]. For all spherical data, evaluation metrics are computed on the re-rendered spherical image, not on the tangent images.
Level 5, 7 parameters For the level 5 and 7 experiments, our tangent images were base level 0 RGB-D images, and we use all 20 tangent images from each image in a batch of 8 spherical images, resulting in an effective batch size of 160 tangent images. We use Adam optimization [19] with an initial learning rate of and decay by every 20 epochs, as in .
Level 10 parameters For the level 10 experiments, we use RGB-D images at base level 1 and randomly sample 4 tangent images from each image in a batch of 4 spherical inputs, resulting in an effective batch size of 16 tangent images. Because the pre-trained network does not have a depth channel, we initialize the depth channel filter with zero weights. This has the effect of slowly adding the depth information to the model. Similarly, the last layer is randomly initialized, as Stanford 2D3DS has a different number of classes than COCO. We use Adam optimization [19] with a learning rate of .
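The zero-initialized depth filter can be illustrated without any deep learning framework. The sketch below appends an all-zero input-channel kernel to each filter of a toy "pre-trained" RGB convolution and checks that the filter response at a single location does not depend on the depth value. The nested-list weight layout and the helper names are assumptions made for illustration; this is not our training code.

```python
def add_zero_depth_channel(rgb_weights):
    """Extend conv weights laid out as [out][in][kh][kw] from 3 to 4 input
    channels by appending an all-zero kernel for the depth channel."""
    expanded = []
    for filt in rgb_weights:
        kh, kw = len(filt[0]), len(filt[0][0])
        zero_kernel = [[0.0] * kw for _ in range(kh)]
        expanded.append([*filt, zero_kernel])
    return expanded

def filter_response(filt, patch):
    """Response of one filter [in][kh][kw] on one input patch [in][kh][kw]."""
    return sum(
        w * x
        for w_ch, x_ch in zip(filt, patch)
        for w_row, x_row in zip(w_ch, x_ch)
        for w, x in zip(w_row, x_row)
    )

# Toy "pre-trained" layer: one filter, three input channels, 1x1 kernel.
rgb_weights = [[[[0.5]], [[-1.0]], [[2.0]]]]
rgbd_weights = add_zero_depth_channel(rgb_weights)

rgb_patch = [[[1.0]], [[2.0]], [[3.0]]]
rgbd_patch = rgb_patch + [[[7.0]]]  # arbitrary depth value

before = filter_response(rgb_weights[0], rgb_patch)
after = filter_response(rgbd_weights[0], rgbd_patch)
```

Because the depth kernel starts at zero, the layer initially reproduces the pre-trained RGB response, and depth only begins to contribute once gradient updates move those weights away from zero.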
E.3 Transfer learning
We again use the fully-convolutional ResNet101 [15] architecture pre-trained on COCO [21]. We fine-tune for 10 epochs on the perspective images of the Stanford 2D3DS dataset [2]. We use a batch size of and a learning rate of .
- Planet texture maps. http://planetpixelemporium.com/earth8081.html. Accessed: 2019-04-16.
- Iro Armeni, Sasha Sax, Amir R Zamir, and Silvio Savarese. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
- Duane C Brown. Decentering distortion of lenses. Photogrammetric Engineering and Remote Sensing, 1966.
- Richard Buckminster Fuller. Cartography, Jan. 29 1946. US Patent 2,393,676.
- Brent Burley and Dylan Lacewell. Ptex: Per-face texture mapping for production rendering. In Computer Graphics Forum, volume 27, pages 1155–1164. Wiley Online Library, 2008.
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
- Taco Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Convolutional networks for spherical signals. arXiv preprint arXiv:1709.04893, 2017.
- Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral cnn. In International Conference on Machine Learning, pages 1321–1330, 2019.
- Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In International Conference on Learning Representations, 2018.
- Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In European Conference on Computer Vision, pages 525–541. Springer, 2018.
- Marc Eder and Jan-Michael Frahm. Convolutions on spherical images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–5, 2019.
- Marc Eder, True Price, Thanh Vu, Akash Bapat, and Jan-Michael Frahm. Mapped convolutions. arXiv preprint arXiv:1906.11096, 2019.
- Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical cnns. In ECCV, pages 52–68, 2018.
- Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Jared Heinly, Enrique Dunn, and Jan-Michael Frahm. Comparative evaluation of binary features. In ECCV, pages 759–773. Springer, 2012.
- Chiyu Max Jiang, Jingwei Huang, Karthik Kashinath, Prabhat, Philip Marcus, and Matthias Niessner. Spherical CNNs on unstructured grids. In International Conference on Learning Representations, 2019.
- Jon A Kimerling, Kevin Sahr, Denis White, and Lian Song. Comparing geometrical properties of global grids. Cartography and Geographic Information Science, 26(4):271–288, 1999.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Yeonkun Lee, Jaeseok Jeong, Jongseob Yun, Wonjune Cho, and Kuk-Jin Yoon. Spherephd: Applying cnns on a spherical polyhedron representation of 360° images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9181–9189, 2019.
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- Charles Loop. Smooth subdivision surfaces based on triangles. Master’s thesis, University of Utah, Department of Mathematics, 1987.
- David G Lowe et al. Object recognition from local scale-invariant features. In ICCV, volume 99, pages 1150–1157, 1999.
- Pierre Moulon, Pascal Monasse, Romuald Perrot, and Renaud Marlet. Openmvg: Open multiple view geometry. In International Workshop on Reproducible Research in Pattern Recognition, pages 60–74. Springer, 2016.
- German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360 imagery. In Advances in Neural Information Processing Systems, pages 529–539, 2017.
- Yu-Chuan Su and Kristen Grauman. Kernel transformer networks for compact spherical convolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- Keisuke Tateno, Nassir Navab, and Federico Tombari. Distortion-aware convolutional filters for dense prediction in panoramic images. In European Conference on Computer Vision, pages 732–750. Springer, 2018.
- Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
- Chao Zhang, Stephan Liwicki, William Smith, and Roberto Cipolla. Orientation-aware semantic segmentation on icosahedron spheres. arXiv preprint arXiv:1907.12849, 2019.