Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery

Abstract

While 360° cameras offer tremendous new possibilities in vision, graphics, and augmented reality, the spherical images they produce make core feature extraction non-trivial. Convolutional neural networks (CNNs) trained on images from perspective cameras yield “flat” filters, yet 360° images cannot be projected to a single plane without significant distortion. A naive solution that repeatedly projects the viewing sphere to all tangent planes is accurate, but much too computationally intensive for real problems. We propose to learn a spherical convolutional network that translates a planar CNN to process 360° imagery directly in its equirectangular projection. Our approach learns to reproduce the flat filter outputs on 360° data, sensitive to the varying distortion effects across the viewing sphere. The key benefits are 1) efficient feature extraction for 360° images and video, and 2) the ability to leverage powerful pre-trained networks researchers have carefully honed (together with massive labeled image training sets) for perspective images. We validate our approach compared to several alternative methods in terms of both raw CNN output accuracy as well as applying a state-of-the-art “flat” object detector to 360° data. Our method yields the most accurate results while saving orders of magnitude in computation versus the existing exact reprojection solution.

1 Introduction

Unlike a traditional perspective camera, which samples a limited field of view of the 3D scene projected onto a 2D plane, a 360° camera captures the entire viewing sphere surrounding its optical center, providing a complete picture of the visual world—an omnidirectional field of view. As such, viewing 360° imagery provides a more immersive experience of the visual content compared to traditional media.

360° cameras are gaining popularity as part of the rising trend of virtual reality (VR) and augmented reality (AR) technologies, and will also be increasingly influential for wearable cameras, autonomous mobile robots, and video-based security applications. Consumer-level 360° cameras are now common on the market, and media sharing sites such as Facebook and YouTube have enabled support for 360° content. For consumers and artists, 360° cameras free the photographer from making real-time composition decisions. For VR/AR, 360° data is essential to content creation. As a result of this great potential, computer vision problems targeting 360° content are capturing the attention of both the research community and application developers.

Immediately, this raises the question: how to compute features from 360° images and videos? Arguably the most powerful tools in computer vision today are convolutional neural networks (CNNs). CNNs are responsible for state-of-the-art results across a wide range of vision problems, including image recognition zhou2014scenerecog (); he2016resnet (), object detection girshick2014rcnn (); ren2015fasterRCNN (), image and video segmentation long2015fcn (); he2017mask (); fusionseg2017 (), and action detection twostream-actions (); feichtenhofer2016convaction (). Furthermore, significant research effort over the last five years (and really decades lecun ()) has led to well-honed CNN architectures that, when trained with massive labeled image datasets imagenet (), produce “pre-trained” networks broadly useful as feature extractors for new problems. Indeed such networks are widely adopted as off-the-shelf feature extractors for other algorithms and applications (c.f., VGG simonyan2014vgg (), ResNet he2016resnet (), and AlexNet alexnet () for images; C3D c3d () for video).

However, thus far, powerful CNN features are awkward if not off limits in practice for 360° imagery. The problem is that the underlying projection models of current CNNs and 360° data are different. Both the existing CNN filters and the expensive training data that produced them are “flat”, i.e., the product of perspective projection to a plane. In contrast, a 360° image is projected onto the unit sphere surrounding the camera’s optical center.

To address this discrepancy, there are two common, though flawed, approaches. In the first, the spherical image is projected to a planar one,1 then the CNN is applied to the resulting 2D image lai2017semantic (); hu2017deeppilot () (see Fig. 1, top). However, any sphere-to-plane projection introduces distortion, making the resulting convolutions inaccurate. In the second existing strategy, the 360° image is repeatedly projected to tangent planes around the sphere, each of which is then fed to the CNN/filters xiao2012sun360 (); zhang2014panocontext (); su2016accv (); su2017cvpr () (Fig. 1, bottom). In the extreme of sampling every tangent plane, this solution is exact and therefore accurate. However, it suffers from very high computational cost. Not only does it incur the cost of rendering each planar view, but it also prevents amortization of convolutions: the intermediate representation cannot be shared across perspective images because they are projected to different planes.

Figure 1: Two existing strategies for applying CNNs to 360° images. Top: The first strategy unwraps the 360° input into a single planar image using a global projection (most commonly equirectangular projection), then applies the CNN on the distorted planar image. Bottom: The second strategy samples multiple tangent planar projections to obtain multiple perspective images, to which the CNN is applied independently to obtain local results for the original 360° image. Strategy I is fast but inaccurate; Strategy II is accurate but slow. The proposed approach learns to replicate flat filters on spherical imagery, offering both speed and accuracy.

We propose a learning-based solution that, unlike the existing strategies, sacrifices neither accuracy nor efficiency. The main idea is to learn a CNN that processes a 360° image in its equirectangular projection (fast) but mimics the “flat” filter responses that an existing network would produce on all tangent plane projections for the original spherical image (accurate). Because convolutions are indexed by spherical coordinates, we refer to our method as spherical convolution (SphConv). We develop a systematic procedure to adjust the network structure in order to account for distortions. Furthermore, we propose a kernel-wise pre-training procedure that significantly accelerates the training process.

In addition to providing fast general feature extraction for 360° imagery, our approach provides a bridge from 360° content to existing heavily supervised datasets dedicated to perspective images. In particular, training requires no new annotations—only the target CNN model (e.g., VGG simonyan2014vgg () pre-trained on millions of labeled images) and an arbitrary collection of unlabeled 360° images.

We evaluate SphConv on the Pano2Vid su2016accv () and PASCAL VOC pascal () datasets, both for raw convolution accuracy as well as impact on an object detection task. We show that it produces more precise outputs than baseline methods requiring similar computational cost, and similarly precise outputs as the exact solution while using orders of magnitude less computation. Furthermore, we demonstrate that SphConv can successfully replicate the widely used Faster-RCNN ren2015fasterRCNN () detector on 360° data when trained with only 1,000 unlabeled 360° images containing unrelated objects. For a similar cost as the baselines, SphConv generates better object proposals and recognition rates.

2 Related Work

360° vision

Vision for 360° data is quickly gaining interest. The SUN360 project samples multiple perspective images to perform scene viewpoint recognition xiao2012sun360 (). PanoContext zhang2014panocontext () parses 360° images using 3D bounding boxes, applying algorithms like line detection on perspective images then backprojecting results to the sphere. Motivated by the limitations of existing interfaces for viewing 360° video, several methods study how to automate field-of-view (FOV) control for display su2016accv (); su2017cvpr (); lai2017semantic (); hu2017deeppilot (), adopting one of the two existing strategies for convolutions (Fig. 1). In these methods, a noted bottleneck is the cost of feature extraction, which is hampered by repeated sampling of perspective images/frames, e.g., to represent the space-time “glimpses” of su2017cvpr (); su2016accv (). This is exactly where our work can have positive impact.

Knowledge distillation

Our approach relates to knowledge distillation ba2014distilling (); hinton2015distilling (); romero2014fitnets (); parisotto2015actormimic (); gupta2016suptransfer (); wang2016modelregression (); bucilua2006compression (), though we explore it in an entirely novel setting. Distillation aims to learn a new model given existing model(s). Rather than optimize an objective function on annotated data, it learns the new model that can reproduce the behavior of the existing model, by minimizing the difference between their outputs. Most prior work explores distillation for model compression bucilua2006compression (); ba2014distilling (); romero2014fitnets (); hinton2015distilling (). For example, a deep network can be distilled into a shallower ba2014distilling () or thinner romero2014fitnets () one, or an ensemble can be compressed to a single model hinton2015distilling (). Rather than compress a model in the same domain, our goal is to learn across domains, namely to link networks on images with different projection models. Limited work considers distillation for transfer parisotto2015actormimic (); gupta2016suptransfer (). In particular, unlabeled target-source paired data can help learn a CNN for a domain lacking labeled instances (e.g., RGB vs. depth images) gupta2016suptransfer (), and multi-task policies can be learned to simulate action value distributions of expert policies parisotto2015actormimic (). Our problem can also be seen as a form of transfer, though for a novel task motivated strongly by image processing complexity as well as supervision costs. Different from any of the above, we show how to adapt the network structure to account for geometric transformations caused by different projections. Also, whereas most prior work uses only the final output for supervision, we use the intermediate representation of the target network as both input and target output to enable kernel-wise pre-training.

Spherical image projection

Projecting a spherical image onto a planar one is a long-studied problem. There exist many projection approaches (e.g., equirectangular, Mercator, etc.) barre1987curvilinear (). None is perfect; every projection must introduce some form of distortion. The properties of different projections are analyzed in the context of displaying panoramic images lihi-squaring (). In this work, we unwrap the spherical images using equirectangular projection because 1) it is a very common format used by 360° camera vendors and researchers xiao2012sun360 (); su2016accv (); fb2015meta (), and 2) it is equidistant along each row and column, so the convolution kernel does not depend on the azimuthal angle. Our method could in principle be applied to other projections; their effect on the convolution operation remains to be studied.

3 Approach

We describe how to learn spherical convolutions in equirectangular projection given a target network trained on perspective images. We define the objective in Sec. 3.1. Next, we introduce how to adapt the network structure from the target network in Sec. 3.2. Finally, Sec. 3.3 presents our training process.

3.1 Problem Definition

Let $I_s$ be the input spherical image defined on spherical coordinates $(\theta, \phi)$, and let $I_e$ be the corresponding flat RGB image in equirectangular projection. $I_e$ is defined by pixels on the image coordinates $(x, y)$, where each $(x, y)$ is linearly mapped to a unique $(\theta, \phi)$. We define the perspective projection operator $P$, which projects an $S$-degree field of view (FOV) from $I_s$ to $W \times W$ pixels on the tangent plane $\hat{n} = (\theta, \phi)$. That is, $P(I_s, \hat{n}) = I_p$, where $I_p$ denotes the resulting perspective image, and the projection operator is characterized by the pixel size $\Delta_p$ in $I_p$. Note that we assume square pixels, following common digital imagery.
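To make the coordinate conventions and the operator $P$ concrete, here is a minimal NumPy sketch of the linear pixel-to-angle mapping and a gnomonic tangent-plane sampler. The function names, angle conventions, and nearest-neighbor sampling are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    def pixel_to_sphere(x, y, W_e, H_e):
        """Linearly map an equirectangular pixel (x, y) to spherical (theta, phi).
        Assumed convention: phi in [-pi, pi) along image width, theta in (0, pi)
        along image height (theta = 0 at the top of the image)."""
        phi = (x + 0.5) / W_e * 2 * np.pi - np.pi
        theta = (y + 0.5) / H_e * np.pi
        return theta, phi

    def gnomonic_project(I_e, theta0, phi0, fov_deg, W):
        """Sample a W x W perspective (tangent-plane) view of the equirectangular
        image I_e centered at (theta0, phi0), i.e., one realization of P(I_s, n)."""
        H_e, W_e = I_e.shape[:2]
        # Tangent-plane pixel grid, scaled so the full grid spans `fov_deg`.
        half = np.tan(np.radians(fov_deg) / 2.0)
        u, v = np.meshgrid(np.linspace(-half, half, W), np.linspace(-half, half, W))
        # Inverse gnomonic projection: tangent-plane coords -> sphere coords.
        rho = np.sqrt(u ** 2 + v ** 2)
        c = np.arctan(rho)
        lat0 = np.pi / 2 - theta0                    # latitude of the tangent point
        sin_arg = np.cos(c) * np.sin(lat0) + v * np.sin(c) * np.cos(lat0) / np.maximum(rho, 1e-12)
        lat = np.arcsin(np.clip(sin_arg, -1.0, 1.0))
        lon = phi0 + np.arctan2(u * np.sin(c),
                                rho * np.cos(lat0) * np.cos(c) - v * np.sin(lat0) * np.sin(c))
        lat = np.where(rho < 1e-12, lat0, lat)       # handle the center pixel exactly
        lon = np.where(rho < 1e-12, phi0, lon)
        # Sphere coords -> equirectangular pixels (nearest neighbor for brevity).
        xs = np.clip(((lon + np.pi) % (2 * np.pi)) / (2 * np.pi) * W_e, 0, W_e - 1).astype(int)
        ys = np.clip((np.pi / 2 - lat) / np.pi * H_e, 0, H_e - 1).astype(int)
        return I_e[ys, xs]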

Given a target network² $N_p$ trained on perspective images $I_p$ with receptive field (Rf) $R \times R$, we define the output on spherical image $I_s$ at $\hat{n} = (\theta, \phi)$ as

$$N_p(I_s)[\theta, \phi] \triangleq N_p\big(P(I_s, (\theta, \phi))\big), \qquad (1)$$

where w.l.o.g. we assume $W = R$ for simplicity. Our goal is to learn a spherical convolution network $N_e$ that takes an equirectangular map $I_e$ as input and, for every image position $(x, y)$, produces as output the results of applying the perspective projection network to the corresponding tangent plane for spherical image $I_s$:

$$N_e(I_e)[x, y] \approx N_p(I_s)[\theta, \phi], \qquad (2)$$

where $(\theta, \phi)$ is the spherical coordinate corresponding to $(x, y)$.

This can be seen as a domain adaptation problem where we want to transfer the model $N_p$ from the domain of $I_p$ to that of $I_e$. However, unlike typical domain adaptation problems, the difference between $I_p$ and $I_e$ is characterized by a geometric projection transformation rather than a shift in data distribution. Note that the training data to learn $N_e$ requires no manual annotations: it consists of arbitrary 360° images coupled with the “true” $N_p$ outputs computed by exhaustive planar reprojections. Furthermore, at test time, only a single equirectangular projection of the entire 360° input will be computed to obtain its corresponding dense (inferred) $N_p$ outputs via $N_e$.
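With $P$ in hand, the “true” outputs on the right-hand side of Eq. 1 can be precomputed offline by projecting tangent planes at a grid of sphere locations and running the target network on each crop. The sketch below illustrates that data-generation loop; `perspective_cnn` and `project` are stand-ins for the pre-trained network $N_p$ and a projection routine such as the one sketched above, and the grid density, FOV, and crop size are placeholders rather than the paper's settings.

    import numpy as np

    def make_training_targets(I_e, perspective_cnn, project, fov_deg, W, rows=20, cols=40):
        """Precompute N_p(I_s)[theta, phi] on a sparse grid of sphere locations.

        perspective_cnn: callable mapping a W x W crop to a feature array
                         (stand-in for the pre-trained target network N_p).
        project:         callable (I_e, theta, phi, fov_deg, W) -> perspective crop.
        Returns a dict keyed by (row, column) pixel locations in I_e."""
        H_e, W_e = I_e.shape[:2]
        targets = {}
        for yi in np.linspace(0, H_e - 1, rows).astype(int):
            theta = (yi + 0.5) / H_e * np.pi
            for xi in np.linspace(0, W_e - 1, cols, endpoint=False).astype(int):
                phi = (xi + 0.5) / W_e * 2 * np.pi - np.pi
                crop = project(I_e, theta, phi, fov_deg, W)
                targets[(yi, xi)] = perspective_cnn(crop)   # "true" output for this tangent plane
        return targets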

3.2 Network Structure

(a) Spherical convolution.
(b) Inverse perspective projection.
Figure 2: (a) The kernel weights in spherical convolution are tied only along each row, and each kernel convolves along its row to generate a 1D output. Note that the kernel size differs at different rows and layers, and it expands near the top and bottom of the image. (b) Inverse perspective projections to equirectangular projection at different polar angles $\theta$. The same square image will distort to different sizes and shapes depending on $\theta$.

The main challenge in transferring $N_p$ to $N_e$ is the distortion introduced by equirectangular projection. The distortion is location dependent—a square in perspective projection will not be a square in the equirectangular projection, and its shape and size will depend on the polar angle $\theta$. See Fig. 2(b). The convolution kernel should transform accordingly. Our approach 1) adjusts the shape of the convolution kernel to account for the distortion, in particular the content expansion, and 2) reduces the number of max-pooling layers to match the pixel sizes in $N_e$ and $N_p$, as we detail next.

We adapt the architecture of $N_e$ from $N_p$ as follows. First, we untie the weights of the convolution kernels at different $\theta$ by learning a separate kernel for each output row $y$. Next, we adjust the shape of each spherical kernel such that it covers the Rf of the original kernel. We consider the spherical kernel to cover the original kernel if more than a fixed fraction of the pixels in the Rf of the original kernel are also in the Rf of the spherical kernel in $I_e$. The Rf of the original kernel in $I_e$ is obtained by backprojecting its grid of pixels onto $I_e$ using $P^{-1}$, where the center of the grid aligns on $\hat{n}$. The spherical kernel should be large enough to cover the original kernel, but it should also be as small as possible to avoid overfitting. Therefore, we optimize the shape of each kernel using the following procedure, where $l$ denotes the layer of the network. The kernel shape is initialized to that of the original kernel. We first adjust the height $k_h$, increasing it by 2 until the height of the Rf is larger than that of the original kernel in $I_e$. We then adjust the width $k_w$ in the same manner. Furthermore, we restrict the kernel size to be smaller than an upper bound. See Fig. 3. Because the Rf of a kernel depends on the kernels in the layers below it, we search for the kernel sizes starting from the bottom layer.
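The kernel-shape search just described can be summarized as a small greedy routine. In the sketch below, `rf_height` and `rf_width` are assumed helpers that report the angular extent of a candidate kernel's receptive field at the current row (in the same units as the target extents, e.g., degrees); the 3x3 initialization and the upper bounds are placeholders rather than the paper's exact values.

    def select_kernel_shape(target_rf_h, target_rf_w, rf_height, rf_width,
                            init_h=3, init_w=3, max_h=65, max_w=65):
        """Greedy search for the spherical kernel shape at one row and layer.

        target_rf_h / target_rf_w: angular height / width of the original kernel's
            receptive field, backprojected onto the equirectangular image at this row.
        rf_height(k_h) / rf_width(k_w): assumed helpers returning the angular extent
            of a candidate kernel's receptive field at this row.
        The initialization and upper bounds are placeholders."""
        k_h, k_w = init_h, init_w
        # Grow the height by 2 until the receptive field is at least as tall as the
        # target kernel's receptive field (or the bound is reached).
        while rf_height(k_h) < target_rf_h and k_h + 2 <= max_h:
            k_h += 2
        # Then grow the width in the same way.
        while rf_width(k_w) < target_rf_w and k_w + 2 <= max_w:
            k_w += 2
        return k_h, k_w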

It is important to relax the kernel from being square to being rectangular, because equirectangular projection expands content horizontally near the poles of the sphere (see Fig. 2(b), top). If we restrict the kernel to be square, the Rf of the spherical kernel can easily become taller but narrower than that of the original kernel, which leads to overfitting. It is also important to restrict the kernel size; otherwise the kernel can grow wide rapidly near the poles and eventually cover the entire row. Note that cutting off the kernel size does not discard significant information, because pixels in equirectangular projection are not distributed uniformly on the unit sphere but instead are denser near the poles. Therefore, the pixels are redundant in the regions where the kernel size expands dramatically.

Besides adjusting the kernel sizes, we also adjust the number of pooling layers to match the pixel sizes in $N_e$ and $N_p$, i.e., the angular size $\Delta_e$ of a pixel in $I_e$ versus $\Delta_p$ in $I_p$. Because max-pooling introduces shift invariance of up to a few pixels in the image, which corresponds to a multiple of the pixel size in degrees on the unit sphere, the physical meaning of max-pooling depends on the pixel size. Since the pixel size is usually larger in $I_e$, and each max-pooling layer increases the pixel size by a factor equal to its stride, we remove a pooling layer in $N_e$ whenever its pixel size would otherwise exceed that of the corresponding layer in $N_p$.

Fig. 2(a) illustrates how spherical convolution differs from an ordinary CNN. Note that we approximate one layer in $N_p$ by one layer in $N_e$, so the number of layers and the number of output channels in each layer are exactly the same as in the target network. However, this does not have to be the case.

Figure 3: Method to select the kernel height $k_h$. We project the receptive field of the target kernel to equirectangular projection and increase $k_h$ until the kernel's receptive field is taller than that of the target kernel in $I_e$. The kernel width $k_w$ is determined using the same procedure after $k_h$ is set. We restrict the kernel size by an upper bound.

3.3 Training Process

Given the goal in Eq. 2 and the architecture described in Sec. 3.2, we would like to learn the network $N_e$ by minimizing the L2 loss between $N_e(I_e)$ and the target outputs $N_p(I_s)$. However, the network converges slowly, possibly due to the large number of parameters. Instead, we propose a kernel-wise pre-training process that disassembles the network and initially learns each kernel independently.

To perform kernel-wise pre-training, we further require $N_e$ to generate the same intermediate representation as $N_p$ in all layers $l$:

$$N_e^l(I_e)[x, y] \approx N_p^l(I_s)[\theta, \phi] \quad \forall l. \qquad (3)$$

Given Eq. 3, every layer becomes independent of the others. In fact, every kernel is independent and can be learned separately. We learn each kernel by taking the “ground truth” value of the previous layer as input and minimizing the L2 loss with respect to the target kernel's output, except for the first layer. Note that $N_p^l$ here refers to the convolution output of layer $l$ before applying any non-linear operation, e.g., ReLU, max-pooling, etc. It is important to learn the target value before applying ReLU because it provides more information. We combine the non-linear operation with the kernels of the next layer during kernel-wise pre-training, and we replace max-pooling with dilated convolution yu2015dilated ().
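Because the pre-activation output of a layer is linear in its kernel weights once the inputs are fixed to the projected “ground truth” activations, each kernel's pre-training objective is a plain regression. The sketch below makes this explicit by solving it in closed form with least squares; this is only an illustration of the per-kernel objective, as the actual training in the paper uses ADAM with mini-batches (see Supp.).

    import numpy as np

    def pretrain_row_kernel(prev_gt_patches, target_pre_relu):
        """Fit one spherical-convolution kernel for a single row and layer.

        prev_gt_patches: (N, k_h * k_w * C_in) receptive-field patches extracted
            from the *exact* previous-layer activations (the projected N_p outputs).
        target_pre_relu: (N, C_out) target convolution outputs *before* ReLU/pooling
            at the same locations.

        The pre-activation output is linear in the kernel weights, so each kernel
        can be fit by least squares; this closed form only illustrates the objective."""
        N = prev_gt_patches.shape[0]
        X = np.hstack([prev_gt_patches, np.ones((N, 1))])       # append a bias column
        W, *_ = np.linalg.lstsq(X, target_pre_relu, rcond=None)  # minimizes ||XW - Y||^2
        weights, bias = W[:-1], W[-1]
        return weights, bias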

For the first convolution layer, we derive the analytic solution directly. The projection operator $P$ is linear in the pixels of the equirectangular projection: $P(I_s, \hat{n})[k] = \sum_{j} c_{kj}\, I_e[j]$, for interpolation coefficients $c_{kj}$ from, e.g., bilinear interpolation (here $k$ indexes pixels of the projected patch and $j$ indexes pixels of $I_e$). Because convolution is a weighted sum of the input pixels, with first-layer weights $w_k$, we can combine the weights and interpolation coefficients into a single convolution operator:

$$N_e^1(I_e)[x, y] = \sum_{k} w_k\, P(I_s, \hat{n})[k] = \sum_{j} \Big( \sum_{k} w_k\, c_{kj} \Big) I_e[j]. \qquad (4)$$

The output value of the first layer will therefore be exact and requires no learning. Of course, the same is not possible for deeper layers because of the non-linear operations between layers.
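To make Eq. 4 concrete, the sketch below combines a first-layer kernel of the target network with an (assumed) interpolation matrix describing the projection at one output location; the variable names and shapes are illustrative.

    import numpy as np

    def analytic_first_layer(conv_w, conv_b, interp_C):
        """Combine a first-layer perspective kernel with the projection.

        conv_w:   (K,) flattened first-layer kernel (one output channel), defined
                  over the K pixels of its receptive field in the perspective image.
        conv_b:   scalar bias.
        interp_C: (K, M) interpolation matrix such that the K projected pixels are
                  P(I_s, n)[k] = sum_j interp_C[k, j] * I_e_patch[j], with M the
                  number of equirectangular pixels touched (e.g., by bilinear
                  interpolation).
        Returns equivalent weights over the M equirectangular pixels, so the first
        layer of the spherical network needs no learning."""
        combined_w = interp_C.T @ conv_w      # (M,) weights acting directly on I_e pixels
        return combined_w, conv_b

    # The first-layer response at this location is then
    #   combined_w @ I_e_patch + conv_b  ==  conv_w @ (interp_C @ I_e_patch) + conv_b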

After kernel-wise pre-training, we can further fine-tune the network jointly across layers and kernels using Eq. 2. Because the pre-trained kernels cannot fully recover the intermediate representations, fine-tuning helps adjust the weights to account for residual errors. We ignore the constraint introduced in Eq. 3 when performing fine-tuning. Although Eq. 3 is necessary for kernel-wise pre-training, it restricts the expressive power of $N_e$ and degrades the performance if we only care about the final output. Nevertheless, the weights learned by kernel-wise pre-training are a very good initialization in practice, and we typically only need to fine-tune the network for a few epochs.

4 Experiments

To evaluate our approach, we use the VGG architecture³ and the Faster-RCNN ren2015fasterRCNN () model as our target network $N_p$. We learn a network $N_e$ to produce the topmost (conv5_3) convolution output.

Datasets

We use two datasets: Pano2Vid for training, and Pano2Vid and PASCAL for testing.

Pano2Vid: We sample frames from the 360° videos in the Pano2Vid dataset su2016accv () for both training and testing. The dataset consists of 86 videos crawled from YouTube using four keywords: “Hiking,” “Mountain Climbing,” “Parade,” and “Soccer.” We sample frames at 0.05fps to obtain 1,056 frames for training and 168 frames for testing. We use “Mountain Climbing” for testing and the other categories for training, so the training and testing frames come from disjoint videos. See Supp. for the sampling process. Because the supervision is on a per-pixel basis, this corresponds to a much larger number of (non-i.i.d.) training samples. Note that most object categories targeted by the Faster-RCNN detector do not appear in Pano2Vid, meaning that our experiments test the content-independence of our approach.

PASCAL VOC: Because the target model was originally trained and evaluated on PASCAL VOC 2007, we “360-ify” it to evaluate the object detector application. We test with the 4,952 PASCAL test images, which contain 12,032 bounding boxes. We transform them to equirectangular images as if they originated from a 360° camera. In particular, each object bounding box is backprojected to 3 different scales, defined relative to the resolution of the target network’s Rf, and 5 different polar angles on the 360° image sphere using the inverse perspective projection $P^{-1}$. See Supp. for details. Backprojection allows us to evaluate the performance at different levels of distortion in the equirectangular projection.

Metrics

We generate the convolution output widely used in the literature (conv5_3) and evaluate it with the following metrics.

Network output error measures the difference between $N_e(I_e)$ and $N_p(I_s)$. In particular, we report the root-mean-square error (RMSE) over all pixels and channels. For PASCAL, we measure the error over the Rf of the detector network.

Detector network performance measures the performance of the detector network in Faster-RCNN using multi-class classification accuracy. We replace the ROI-pooling in Faster-RCNN by pooling over the bounding box in $I_e$. Note that the bounding box is backprojected to equirectangular projection and is no longer a square region.

Proposal network performance evaluates the proposal network in Faster-RCNN using average Intersection-over-Union (IoU). For each bounding box centered at $\hat{n}$, we project the conv5_3 output onto the tangent plane using $P$ and apply the proposal network at the center of the bounding box on the tangent plane. Given the predicted proposals, we compute the IoUs between foreground proposals and the bounding box and take the maximum. The IoU is set to 0 if there is no foreground proposal. Finally, we average the IoU over bounding boxes.
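A minimal sketch of this metric is given below, assuming axis-aligned boxes in (x1, y1, x2, y2) format on the tangent plane; the helper names are ours, not part of the released evaluation code.

    import numpy as np

    def box_iou(a, b):
        """IoU of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-12)

    def proposal_accuracy(gt_boxes, proposals_per_box):
        """Average best-IoU metric: for each ground-truth box, take the maximum IoU
        over its foreground proposals (0 if there are none), then average."""
        best = []
        for gt, props in zip(gt_boxes, proposals_per_box):
            best.append(max((box_iou(gt, p) for p in props), default=0.0))
        return float(np.mean(best)) if best else 0.0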

We stress that our goal is not to build a new object detector; rather, we aim to reproduce the behavior of existing 2D models on 360° data with lower computational cost. Thus, the metrics capture how accurately and how quickly we can replicate the exact solution.

Baselines

We compare our method with the following baselines.

  • Exact — Compute the true target value $N_p(I_s)[\theta, \phi]$ for every pixel. This serves as an upper bound in performance and does not consider the computational cost.

  • Direct — Apply $N_p$ on $I_e$ directly. We replace max-pooling with dilated convolution to produce a full-resolution output. This is Strategy I in Fig. 1 and is used in 360° video analysis lai2017semantic (); hu2017deeppilot ().

  • Interp — Compute $N_p(I_s)$ only on a regularly spaced subset of pixels and interpolate the values for the others (see the sketch after this list). We set the spacing such that the computational cost is roughly the same as our SphConv. This is a more efficient variant of Strategy II in Fig. 1.

  • Perspective — Project $I_s$ onto a cube map fb2015cubemap () and then apply $N_p$ on each face of the cube, which is a perspective image with a 90° FOV. The result is backprojected to equirectangular projection to obtain the feature on $I_e$. We choose the cube map resolution so that the pixel size is roughly the same as in $I_p$. This is a second variant of Strategy II in Fig. 1, used in PanoContext zhang2014panocontext ().
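For reference, the Interp baseline mentioned above can be sketched as follows: evaluate the exact output only on a coarse grid of locations and bilinearly interpolate everywhere else. The stride, the interpolation scheme, and the `exact_fn` stand-in are illustrative assumptions.

    import numpy as np

    def interp_baseline(exact_fn, stride, H_out, W_out, C_out):
        """Interp baseline: evaluate the exact per-pixel output `exact_fn(y, x)` only
        every `stride` pixels, then bilinearly interpolate the remaining locations.

        exact_fn: callable returning a length-C_out feature at output location (y, x);
        it stands in for the exact tangent-plane reprojection followed by N_p."""
        ys = np.arange(0, H_out, stride)
        xs = np.arange(0, W_out, stride)
        grid = np.stack([[exact_fn(y, x) for x in xs] for y in ys])   # (Gy, Gx, C_out)

        out = np.zeros((H_out, W_out, C_out), dtype=grid.dtype)
        for y in range(H_out):
            fy = min(y / stride, len(ys) - 1)
            y0 = int(np.floor(fy)); ty = fy - y0; y1 = min(y0 + 1, len(ys) - 1)
            for x in range(W_out):
                fx = min(x / stride, len(xs) - 1)
                x0 = int(np.floor(fx)); tx = fx - x0; x1 = min(x0 + 1, len(xs) - 1)
                top = (1 - tx) * grid[y0, x0] + tx * grid[y0, x1]
                bot = (1 - tx) * grid[y1, x0] + tx * grid[y1, x1]
                out[y, x] = (1 - ty) * top + ty * bot
        return out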

SphConv variants

We evaluate three variants of our approach:

  • OptSphConv — To compute the output for layer $l$, OptSphConv computes the exact output for layer $l-1$ using $N_p(I_s)$, then applies spherical convolution for layer $l$. OptSphConv serves as an upper bound for our approach, since it avoids accumulating error across layers.

  • SphConv-Pre — Uses the weights from kernel-wise pre-training directly without fine-tuning.

  • SphConv — The full spherical convolution with joint fine-tuning of all layers.

Implementation details

We set the resolution of $I_e$ and, for the projection operator $P$, map the FOV to the perspective image resolution following SUN360 xiao2012sun360 (). The resulting pixel size $\Delta_e$ in $I_e$ is larger than $\Delta_p$ in $I_p$. Accordingly, we remove the first three max-pooling layers, so $N_e$ has only one max-pooling layer, following conv4_3. The kernel size upper bound follows the maximum kernel size in VGG. We insert batch normalization for conv4_1 to conv5_3. See Supp. for details.

4.1 Network output accuracy and computational cost

Fig. 4(a) shows the output error of layers conv3_3 and conv5_3 on the Pano2Vid su2016accv () dataset (see Supp. for similar results on other layers). The error is normalized by that of the mean predictor. We evaluate the error at 5 polar angles $\theta$ uniformly sampled from the northern hemisphere, since the error is roughly symmetric about the equator.

First we discuss the three variants of our method. OptSphConv performs the best in all layers and at all polar angles, validating our main idea of spherical convolution. It performs particularly well in the lower layers, because the Rf is larger in higher layers and the distortion becomes more significant. Overall, SphConv-Pre performs second best, but, as to be expected, the gap with OptSphConv grows in higher layers because of error propagation. SphConv outperforms SphConv-Pre in conv5_3 at the cost of larger error in lower layers (as seen here for conv3_3). It also has larger error near the pole, for two possible reasons. First, the learning curve indicates that the network learns more slowly near the pole, possibly because the Rf is larger and the pixels degenerate there. Second, we optimize the joint loss, which may trade error near the pole for error at the center.

Comparing to the baselines, we see that ours achieves the lowest errors. Direct performs the worst among all methods, underscoring that convolutions on the flattened sphere—though fast—are inadequate. Interp performs better than Direct, and its error decreases in higher layers. This is because the Rf is larger in the higher layers, so the fixed pixel offset between samples in $I_e$ causes relatively smaller changes in the Rf and therefore in the network output. Perspective performs similarly across layers and outperforms Interp in lower layers. The error of Perspective is particularly large at polar angles that fall close to the boundary of the perspective images, where the perspective distortion is larger.

(a) Network output errors vs. polar angle
(b) Cost vs. accuracy
Figure 4: (a) Network output error on Pano2Vid; lower is better. Note the error of Exact is 0 by definition. Our method’s convolutions are much closer to the exact solution than the baselines’. (b) Computational cost vs. accuracy on PASCAL. Our approach yields accuracy closest to the exact solution while requiring orders of magnitude less computation time (left plot). Our cost is similar to the other approximations tested (right plot). Plot titles indicate the y-labels, and error is measured by root-mean-square-error (RMSE).

Fig. 4(b) shows the accuracy vs. cost tradeoff. We measure computational cost by the number of Multiply-Accumulate (MAC) operations. The leftmost plot shows cost on a log scale. Here we see that Exact—whose outputs we wish to replicate—is about 400 times slower than SphConv, and SphConv approaches Exact’s detector accuracy much better than all baselines. The second plot shows that SphConv is moderately faster than Interp (while performing better in all metrics). Perspective is the fastest among all methods, followed by Direct; both are faster than SphConv but noticeably inferior in accuracy.
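Cost here is counted in multiply-accumulate (MAC) operations; for a standard convolution layer the count is simply output size times kernel size times input channels, as in the generic helper below (a textbook formula rather than the paper's exact accounting, which must also handle the row-dependent kernel sizes).

    def conv_macs(out_h, out_w, out_c, in_c, k_h, k_w):
        """Multiply-accumulate operations for one standard convolution layer."""
        return out_h * out_w * out_c * in_c * k_h * k_w

    # Example: a 3x3 conv with 512 input and 512 output channels on a 40x80 map
    # costs 40 * 80 * 512 * 512 * 3 * 3 = 7,549,747,200 MACs.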

Figure 5: Three AlexNet conv1 kernels (left squares) and their corresponding four SphConv-Pre kernels at different polar angles $\theta$ (left to right).

To visualize what our approach has learned, we learn the first layer of the AlexNet alexnet () model provided by the Caffe package jia2014caffe () and examine the resulting kernels. Fig. 5 shows the original kernels and the corresponding learned kernels at different polar angles $\theta$. The learned kernel is usually a re-scaled version of the original, but the weights are often amplified because multiple pixels in $I_p$ fall onto the same pixel in $I_e$, as in the second example. We also observe situations where the high-frequency signal in the kernel is reduced, as in the third example, possibly because the kernel is smaller. Note that we learn the first convolution layer for visualization purposes only, since (only) the first layer has an analytic solution (cf. Sec. 3.3). See Supp. for the complete set of kernels.

4.2 Object detection and proposal accuracy

(a) Detector network performance.
(b) Proposal network accuracy (IoU).
Figure 6: Faster-RCNN object detection accuracy on a 360° version of PASCAL across polar angles $\theta$, for both the (a) detector network and (b) proposal network. $R$ refers to the Rf of $N_p$. Best viewed in color.

Having established that our approach provides accurate yet efficient convolutions, we now examine how important that accuracy is for object detection on 360° inputs. Fig. 6(a) shows the result of the Faster-RCNN detector network on PASCAL in 360° format. OptSphConv performs almost as well as Exact. The performance degrades for SphConv-Pre because of error accumulation, but it still significantly outperforms Direct and is better than Interp and Perspective in most regions. Although joint training (SphConv) improves the output error near the equator, the error is larger near the pole, which degrades the detector performance. Note that the Rf of the detector network spans multiple rows, so the error is a weighted sum of the errors at different rows. This result, together with Fig. 4(a), suggests that SphConv reduces the conv5_3 error in parts of the Rf but increases it in other parts. The detector network needs accurate conv5_3 features throughout the Rf in order to generate good predictions.

Direct again performs the worst. In particular, its performance drops significantly near the pole, showing that it is sensitive to the distortion. In contrast, Interp performs better near the pole because the samples are denser on the unit sphere; in fact, Interp should converge to Exact at the pole. Perspective outperforms Interp near the equator but is worse in other regions. Note that points near the pole fall on the top face of the cube map, while intermediate polar angles lie near the border of a face. The result suggests that Perspective is still sensitive to the polar angle, and it performs best when the object is near the center of a face, where the perspective distortion is small. We use the inherent orientation of the 360° image for the cube map.

Fig. 6(b) shows the performance of the object proposal network for two object scales (see Supp. for more). Interestingly, the result differs from the detector network. OptSphConv still performs almost the same as Exact, and SphConv-Pre performs better than the baselines. However, Direct now outperforms the other baselines, suggesting that the proposal network is not as sensitive as the detector network to the distortion introduced by equirectangular projection. The performance of the methods is similar when the object is larger (right plot), even though the output error is significantly different. The only exception is Perspective, which performs poorly at certain polar angles regardless of the object scale. It again suggests that objectness is sensitive to which perspective image is sampled.

Figure 7: Object detection examples on PASCAL test images. Images show the top 40% of equirectangular projection; black regions are undefined pixels. Text gives predicted label, multi-class probability, and IoU, resp.

Fig. 7 shows examples of objects successfully detected by our approach in spite of severe distortions. See Supp. for more examples.

5 Conclusion

We propose to learn spherical convolutions for 360° images. Our solution entails a new form of distillation across camera projection models. Compared to current practices for feature extraction on 360° images/video, spherical convolution benefits efficiency by avoiding repeated perspective projections, and it benefits accuracy by adapting kernels to the distortions in equirectangular projection. Results on two datasets demonstrate how it successfully transfers state-of-the-art vision models from the realm of limited-FOV 2D imagery into the realm of omnidirectional data.

Future work will explore SphConv in the context of other dense prediction problems like segmentation, as well as the impact of different projection models within our basic framework.

In the appendix, we provide additional details to supplement the main paper submission. In particular, this document contains:

  1. Figure illustration of the spherical convolution network structure

  2. Implementation details, in particular the learning process

  3. Data preparation process of each dataset

  4. Complete experiment results

  5. Additional object detection results on PASCAL, including both success and failure cases

  6. Complete visualization of the AlexNet conv1 kernel in spherical convolution

Appendix A Spherical Convolution Network Structure

Fig. 8 shows how the proposed spherical convolutional network differs from an ordinary convolutional neural network (CNN). In a CNN, each kernel convolves over the entire 2D map to generate a 2D output. Alternatively, it can be considered as a neural network with a tied weight constraint, where the weights are shared across all rows and columns. In contrast, spherical convolution only ties the weights along each row. It learns a kernel for each row, and the kernel only convolves along the row to generate 1D output. Also, the kernel size may differ at different rows and layers, and it expands near the top and bottom of the image.
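To make the row-wise weight untying concrete, the sketch below implements a single spherical-convolution layer in plain NumPy: one rectangular kernel per output row, each convolved only along its row. The zero/wrap padding policy and the per-row kernel list are illustrative assumptions; an efficient implementation would batch the rows on a GPU.

    import numpy as np

    def spherical_conv_layer(feat, kernels, biases):
        """feat:    input feature map of shape (H, W, C_in).
        kernels: list of length H; kernels[y] has shape (k_h, k_w, C_in, C_out)
                 and may differ in (k_h, k_w) from row to row.
        biases:  array of shape (C_out,).
        Returns an (H, W, C_out) output where row y is produced only by kernels[y]."""
        H, W, _ = feat.shape
        C_out = kernels[0].shape[-1]
        out = np.zeros((H, W, C_out), dtype=feat.dtype)
        for y in range(H):
            k_h, k_w = kernels[y].shape[:2]
            # Pad vertically with zeros and horizontally by wrapping around,
            # since the equirectangular image is periodic in the azimuth.
            pad_h, pad_w = k_h // 2, k_w // 2
            padded = np.pad(feat, ((pad_h, pad_h), (0, 0), (0, 0)), mode="constant")
            padded = np.pad(padded, ((0, 0), (pad_w, pad_w), (0, 0)), mode="wrap")
            rows = padded[y:y + k_h]                      # (k_h, W + 2*pad_w, C_in)
            for x in range(W):
                patch = rows[:, x:x + k_w]                # (k_h, k_w, C_in)
                out[y, x] = np.tensordot(patch, kernels[y], axes=3) + biases
        return out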

Figure 8: Spherical convolution illustration. The kernel weights at different rows of the image are untied, and each kernel convolves over one row to generate 1D output. The kernel size also differs at different rows and layers.

Appendix B Additional Implementation Details

We train the network using ADAM kingma2014adam (). For pre-training, we use a batch size of 256 and initialize the learning rate to 0.01. For layers without batch normalization, we train each kernel for 16,000 iterations and divide the learning rate by 10 every 4,000 iterations. For layers with batch normalization, we train for 4,000 iterations and decrease the learning rate every 1,000 iterations. For fine-tuning, we first fine-tune the network on conv3_3 for 12,000 iterations with a batch size of 1. The learning rate is set to 1e-5 and is divided by 10 after 6,000 iterations. We then fine-tune the network on conv5_3 for 2,048 iterations. The learning rate is initialized to 1e-4 and is divided by 10 after 1,024 iterations. We do not insert batch normalization in conv1_2 to conv3_3 because we empirically find that it increases the training error.
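The pre-training schedule for layers without batch normalization can be written as a small step-decay helper; the values below (initial rate 0.01, divided by 10 every 4,000 iterations) are taken from the text above, while the surrounding optimizer setup is omitted.

    def pretrain_lr(iteration, base_lr=0.01, step=4000, gamma=0.1):
        """Step-decay learning rate used for kernel-wise pre-training of layers
        without batch normalization: divide by 10 every 4,000 iterations."""
        return base_lr * (gamma ** (iteration // step))

    # e.g. iterations 0-3999 -> 0.01, 4000-7999 -> 0.001, 8000-11999 -> 1e-4, ...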

Appendix C Data Preparation

This section provides more details about the dataset splits and sampling procedures.

Pano2Vid

For the Pano2Vid dataset, we discard videos with non-standard resolution and sample frames at 0.05fps. We use “Mountain Climbing” for testing because it contains the smallest number of frames. Note that the training data contains no instances of “Mountain Climbing,” so our network is forced to generalize across semantic content. We sample at a low frame rate in order to reduce temporal redundancy in both the training and testing splits. For kernel-wise pre-training and testing, we sample the output on 40 pixels per row uniformly to reduce spatial redundancy. Our preliminary experiments show that a denser sample for training does not improve the performance.

PASCAL VOC 2007

As discussed in the main paper, we transform the 2D PASCAL images into equirectangular projected data in order to test object detection in omnidirectional data while still relying on an existing ground-truthed dataset. For each bounding box, we resize the image so the short side of the bounding box matches the target scale. The image is backprojected to the unit sphere using $P^{-1}$, where the center of the bounding box lies on $\hat{n}$. The unit sphere is then unwrapped into equirectangular projection as the test data. We resize the bounding box to three target scales defined relative to $R$, where $R$ is the Rf of $N_p$. Each bounding box is projected to 5 tangent planes at different polar angles $\theta$. By sampling the boxes across a range of scales and tangent plane angles, we systematically test the approach under these varying conditions.
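For concreteness, the sketch below pastes a planar image onto an equirectangular canvas at a chosen polar angle using the forward gnomonic projection, which is one way to realize this backprojection; the resolutions, FOV parameterization, orientation convention, and nearest-neighbor sampling are illustrative assumptions rather than the paper's exact procedure.

    import numpy as np

    def backproject_to_equirect(img, theta0, phi0, fov_deg, H_e, W_e):
        """Paste a perspective image onto an (H_e, W_e) equirectangular canvas as if
        it were captured on the tangent plane at (theta0, phi0); pixels outside the
        image's FOV stay black (undefined), mirroring the black regions in Fig. 11-13."""
        h, w = img.shape[:2]
        canvas = np.zeros((H_e, W_e) + img.shape[2:], dtype=img.dtype)
        ys, xs = np.meshgrid(np.arange(H_e), np.arange(W_e), indexing="ij")
        lat = np.pi / 2 - (ys + 0.5) / H_e * np.pi
        lon = (xs + 0.5) / W_e * 2 * np.pi - np.pi
        lat0, lon0 = np.pi / 2 - theta0, phi0
        # Forward gnomonic projection of every equirectangular pixel onto the plane.
        cos_c = np.sin(lat0) * np.sin(lat) + np.cos(lat0) * np.cos(lat) * np.cos(lon - lon0)
        safe = np.where(cos_c > 0, cos_c, 1.0)            # avoid dividing by <= 0
        u = np.cos(lat) * np.sin(lon - lon0) / safe
        v = (np.cos(lat0) * np.sin(lat) - np.sin(lat0) * np.cos(lat) * np.cos(lon - lon0)) / safe
        half = np.tan(np.radians(fov_deg) / 2.0)          # tangent-plane extent of the FOV
        px = (u / half + 1) / 2 * (w - 1)
        py = (v / half + 1) / 2 * (h - 1)
        valid = (cos_c > 0) & (px >= 0) & (px <= w - 1) & (py >= 0) & (py <= h - 1)
        canvas[ys[valid], xs[valid]] = img[py[valid].round().astype(int),
                                           px[valid].round().astype(int)]
        return canvas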

Appendix D Complete Experimental Results

This section contains additional experimental results that do not fit in the main paper.

Figure 9: Network output error.

Fig. 9 shows the error of each meta layer in the VGG architecture. This is the complete version of Fig. 4a in the main paper. It shows more clearly how the error of SphConv increases as we go deeper in the network, as well as how the error of Interp decreases.

Figure 10: Proposal network accuracy (IoU).

Fig. 10 shows the proposal network accuracy for all three object scales. This is the complete version of Fig. 6b in the main paper. The performance of all methods improves at larger object scales, but Perspective still performs poorly near the equator.

Appendix E Additional Object Detection Examples

Figures 11, 12 and 13 show example detection results for SphConv-Pre on the 360° version of PASCAL VOC 2007. Note that the large black areas are undefined pixels; they exist because the original PASCAL test images are not 360° data, and the content occupies only a portion of the viewing sphere.

Figure 11: Object detection results on PASCAL VOC 2007 test images transformed to equirectangular projected inputs at different polar angles $\theta$ (the polar angle varies from top to bottom). Black areas indicate regions outside the narrow field of view (FOV) of the PASCAL images, i.e., undefined pixels. Our approach successfully learns to translate a 2D object detector trained on perspective images to 360° inputs.
Figure 12: Additional object detection results on PASCAL VOC 2007 test images transformed to equirectangular projected inputs at another polar angle $\theta$.
Figure 13: Additional object detection results on PASCAL VOC 2007 test images transformed to equirectangular projected inputs at another polar angle $\theta$.

Fig. 14 shows examples where the proposal network generates a tight bounding box while the detector network fails to predict the correct object category. While the distortion is not as severe as in some of the success cases, it makes the already confusing cases more difficult. Fig. 15 shows examples where the proposal network fails to generate a tight bounding box. The bounding box shown is the one with the best intersection-over-union (IoU), which is less than 0.5 in both examples.

Figure 14: Failure cases of the detector network.
Figure 15: Failure cases of the proposal network.

Appendix F Visualizing Kernels in Spherical Convolution

Fig. 16 shows the target kernels in the AlexNet alexnet () model and the corresponding kernels learned by our approach at different polar angles $\theta$. This is the complete list for Fig. 5 in the main paper. Here we see how each kernel stretches according to the polar angle, and it is clear that some of the kernels in spherical convolution have larger weights than the original kernels. As discussed in the main paper, these examples are for visualization only. As we show, the first layer is amenable to an analytic solution, and only the subsequent layers are learned by our method.

Figure 16: Learned conv1 kernels in AlexNet (full). Each square patch is an AlexNet kernel in perspective projection. The four rectangular kernels beside it are the kernels learned in our network to achieve the same features when applied to an equirectangular projection of the viewing sphere.

Footnotes

  1. e.g., with equirectangular projection, where latitudes are mapped to horizontal lines of uniform spacing
  2. e.g., could be AlexNet alexnet () or VGG simonyan2014vgg () pre-trained for a large-scale recognition task.
  3. https://github.com/rbgirshick/py-faster-rcnn

References

  1. https://facebook360.fb.com/editing-360-photos-injecting-metadata/.
  2. https://code.facebook.com/posts/1638767863078802/under-the-hood-building-360-video/.
  3. J. Ba and R. Caruana. Do deep nets really need to be deep? In NIPS, 2014.
  4. A. Barre, A. Flocon, and R. Hansen. Curvilinear perspective, 1987.
  5. C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In ACM SIGKDD, 2006.
  6. J. Deng, W. Dong, R. Socher, L. Li, and L. Fei-Fei. Imagenet: a large-scale hierarchical image database. In CVPR, 2009.
  7. M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.
  8. C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  9. R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  10. S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
  11. K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017.
  12. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  13. G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  14. H.-N. Hu, Y.-C. Lin, M.-Y. Liu, H.-T. Cheng, Y.-J. Chang, and M. Sun. Deep 360 pilot: Learning a deep agent for piloting through 360 deg sports video. In CVPR, 2017.
  15. S. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in video. In CVPR, 2017.
  16. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
  17. D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  18. A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  19. W.-S. Lai, Y. Huang, N. Joshi, C. Buehler, M.-H. Yang, and S. B. Kang. Semantic-driven generation of hyperlapse from 360° video. arXiv preprint arXiv:1703.10798, 2017.
  20. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proc. of the IEEE, 1998.
  21. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  22. E. Parisotto, J. L. Ba, and R. Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
  23. S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  24. A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  25. K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  26. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  27. Y.-C. Su and K. Grauman. Making 360° video watchable in 2d: Learning videography for click free viewing. In CVPR, 2017.
  28. Y.-C. Su, D. Jayaraman, and K. Grauman. Pano2vid: Automatic cinematography for watching 360° videos. In ACCV, 2016.
  29. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  30. Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
  31. J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recognizing scene viewpoint using panoramic place representation. In CVPR, 2012.
  32. F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  33. L. Zelnik-Manor, G. Peters, and P. Perona. Squaring the circle in panoramas. In ICCV, 2005.
  34. Y. Zhang, S. Song, P. Tan, and J. Xiao. Panocontext: A whole-room 3d context model for panoramic scene understanding. In ECCV, 2014.
  35. B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.