Extreme View Synthesis
We present Extreme View Synthesis, a solution for novel view extrapolation that works even when the number of input images is small—as few as two. In this context, occlusions and depth uncertainty are two of the most pressing issues, and both worsen as the degree of extrapolation increases. We follow the traditional paradigm of depth-based warping and refinement, with a few key improvements. First, we estimate a depth probability volume, rather than a single depth value for each pixel of the novel view. This allows us to leverage depth uncertainty in challenging regions, such as depth discontinuities. After using it to get an initial estimate of the novel view, we explicitly combine learned image priors and the depth uncertainty to synthesize a refined image with fewer artifacts. Our method is the first to show visually pleasing results for such large baseline magnifications. The code is available at https://github.com/NVlabs/extreme-view-synth
Figure: We propose a novel view synthesis method that can generate extreme views, \ie, images synthesized from a small number of cameras (two in this example) and from significantly different viewpoints. In this comparison with the method by Zhou \etal [zhou2018stereo], we show the left view from the camera setup depicted above. Even at a large baseline magnification our method produces sharper results.
The ability to capture visual content and render it from a different perspective, usually referred to as novel view synthesis, is a long-standing problem in computer graphics. When appropriately solved, it enables telepresence applications such as head-mounted virtual and mixed reality, and navigation of remote environments on a 2D screen—an experience popularized by Google Street View. The increasing amount of content that is uploaded daily to sharing services offers a rich source of data for novel view synthesis. Nevertheless, a seamless navigation of the virtual world requires a denser sampling than these sparse observations offer. Synthesis from sparse views is challenging, in particular when the novel view causes disocclusions, a common situation when the viewpoint is extrapolated, rather than interpolated, from the input cameras.
Early novel view synthesis methods can generate new images by interpolation either in pixel space [chen1993view], or in ray space [levoy1996light]. Novel views can also be synthesized with methods that use 3D information explicitly. A typical approach would use it to warp the input views to the virtual camera and merge them based on a measure of quality [buehler2001unstructured]. The advantage of such methods is that they explicitly leverage geometric constraints. Depth, however, does not come without disadvantages. First and foremost is the problem of occlusions. Second, depth estimation is always subject to a degree of uncertainty. Both of these issues are further exacerbated when the novel view is pushed farther from the input camera, as shown in Figure 5. Existing methods deal with uncertainty by propagating reliable depth values to similar pixels [CDSD13], or by modeling it explicitly [penner2017soft]. But these approaches cannot leverage depth to refine the synthesized images, nor do they use image priors to deal with the unavoidable issues of occlusions and artifacts.
More recent approaches use large data collections and learn the new views directly [flynn2016deepstereo, zhou2018stereo]. The power of learning-based approaches lies in their ability to leverage image priors to fill missing regions, or correct for poorly reconstructed ones. However, they still cause artifacts when the position of the virtual camera differs significantly from that of the inputs, in particular when the inputs are few.
In their Stereo Magnification work, Zhou \etal cleverly extract a layered representation of the scene [zhou2018stereo]. The layers, which they learn to combine into the novel view, offer a regularization that allows for an impressive stereo baseline extrapolation. Our goal is similar, in that we want to use as few as two input cameras and extrapolate a novel view. Moreover, we want to push the baseline extrapolation much further, as shown in the teaser figure. In addition, we allow the virtual camera to move and rotate freely, instead of limiting it to translations along the baseline.
At a high level, we follow the depth-warp-refine paradigm, but we leverage two key insights to achieve such large extrapolation. First, depth estimation is not always reliable: instead of exact depth estimates, we use depth probability volumes. Second, while image refinement networks are great at learning generic image priors, we also use explicit information about the scene by sampling patches according to the depth probability volumes. By combining these two concepts, our method works for both view interpolation and extreme extrapolation. We show results on a large number of examples in which the virtual camera significantly departs from the original views, even when only two input images are given. To the best of our knowledge, ours is the first method to produce visually pleasing results for such extreme view synthesis from unstructured cameras.
2 Related Work
Early methods for novel view synthesis date back several decades [greene1986environment]. Image interpolation methods, among the first approaches to appear, work by interpolating between corresponding pixels from the input images [chen1993view], or between rays in space [levoy1996light]. The novel view can also be synthesized as a weighted combination of the input cameras, when information about the scene geometry is available [buehler2001unstructured, debevec1996modeling]. All of these methods generally assume additional information—correspondences, depth, or geometry—to be given.
Recent methods produce excellent results taking only images as input. This can be done, for instance, by using an appropriate representation of the scene, such as plane sweep volumes, and by learning weights to merge them down into a single image [flynn2016deepstereo]. Building on the concept of layered depth images [he1998layered], Zitnick \etal developed a high-quality video-based rendering system for dynamic scenes that can interpolate between views [Zitnick2004HighqualityVV]. Zhou \etal propose a learned layer-based representation of the scene, dubbed MPI [zhou2018stereo]. Their results are impressive, but quickly degrade beyond limited translations of the novel view. The works of Mildenhall \etal [mildenhall2019local] and Srinivasan \etal [srinivasan2019pushing] build on the MPI representation, further improving the quality of the synthesized view even for larger camera translations. (These works were published after the submission of this paper and are included here for a more complete coverage of the state of the art.)
A different approach is to explicitly use depth information, which can be estimated from the input images directly and used to warp the input images into the novel view. Kalantari \etal, for instance, learn to estimate both disparity and the novel view from the sub-aperture images of a lightfield camera [kalantari2016learning]. For larger displacements of the virtual camera, however, depth uncertainty results in noticeable artifacts. Chaurasia \etal take accurate but sparse depth and propagate it using super-pixels based on their similarity in image space [CDSD13]. Penner and Zhang explicitly model the confidence that a voxel corresponds to empty space or to a physical surface, and use it while performing back-to-front synthesis of the novel view [penner2017soft].
The ability of deep learning techniques to learn priors has also paved the way to single-image methods. Srinivasan \etal learn a light field and depth along each ray from a single image [srinivasan2017learning]. Zhou \etal cast this problem as a prediction of appearance flows, which allows them to synthesize novel views of a 3D object or scene from a single observation [zhou2016view]. From a single image, Xie \etal produce stereoscopic images [xie2016deep3d], while Tulsiani \etal infer a layered representation of the scene [tulsiani2018layer].
Our approach differs from published works in its ability to generate extrapolated images under large viewpoint changes and from as few as two cameras.
3 Method Overview
Our goal is to synthesize a novel view $I_v$ from $N$ input views $I_1, \dots, I_N$. A common solution to this problem is to estimate depth and use it to warp and fuse the inputs into the novel view. However, depth estimation algorithms struggle in difficult situations, such as regions around depth discontinuities; this causes warping errors and, in turn, artifacts in the final image. These issues further worsen when $N$ is small, or when the view is extrapolated, \ie, when the virtual camera is not on the line connecting the centers of any two input cameras. Rather than using a single depth estimate for a given pixel, our method accounts for the depth’s probability distribution, which is similar in spirit to the work of Liu \etal [liu2019neural]. We first estimate one depth probability volume per input view, and combine them to estimate the volume for the virtual camera (Section 4). Based on the combined volume, we render the novel view back to front (Section 5). Finally, we refine the result at the patch level, informed by relevant patches from the input views, which we select based on the depth distribution and its uncertainty (Section 6). Figure 6 shows an overview of the method.
4 Estimating the Depth Probability Volume
Several methods exist that estimate depth from multiple images [Kar2017LearningAM, galliani2015massively], stereo pairs [kendall17deepstereo, khamis18stereonet], and even single images [MegaDepthLi18, saxena2006learning]. Inspired by the work of Huang \etal, we treat depth estimation as a learning-based, multi-class classification problem [huang2018deepmvs]. Specifically, depth can be discretized into a fixed number of values, and each depth value can be treated as a class. Depth estimation then becomes a classification problem: each pixel $u$ in view $I_i$ can be associated with a probability distribution over the depth values along $r_u$, the ray leaving the camera at $u$ and traversing the scene. We refer to the collection of all the rays for camera $i$ as a depth probability volume, $\mathcal{D}_i$, whose spatial dimensions match the resolution of $I_i$. The network that estimates the $\mathcal{D}_i$’s can be trained with a cross-entropy loss against ground-truth one-hot vectors that are 1 for the correct class and 0 elsewhere, as in Huang \etal [huang2018deepmvs]. We follow the common practice of uniformly sampling disparity instead of depth to improve the estimation accuracy of closer objects. (Technically, “disparity” is only defined in the case of a stereo pair; here we use the term loosely to indicate a variable that is inversely proportional to depth.)
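As a minimal illustration of the depth-as-classification idea (a sketch, not the paper's actual network or training code; names, shapes, and the number of classes are assumptions), the per-pixel softmax over disparity classes and the cross-entropy loss can be written as:

```python
import numpy as np

def depth_probability_volume(logits):
    """Per-pixel softmax over D disparity classes; logits has shape (H, W, D)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(prob_volume, gt_class):
    """Mean cross-entropy against one-hot ground truth given as integer labels.

    prob_volume: (H, W, D) probabilities; gt_class: (H, W) correct class index.
    """
    h, w = gt_class.shape
    p_true = prob_volume[np.arange(h)[:, None], np.arange(w)[None, :], gt_class]
    return float(-np.log(p_true + 1e-12).mean())
```

A confident, correct prediction yields a much lower loss than a uniform distribution, which is what pushes the network towards peaked, one-hot-like outputs.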
Empirically, we observe that the resulting depth volumes exhibit desirable behaviors. For most regions, the method is fairly certain about disparity and the probability along presents a single, strong peak around the correct value. Around depth discontinuities, where the point-spread-function of the lens causes pixels to effectively belong to both foreground and background, the method tends to produce a multi-modal distribution, with each peak corresponding to the disparity levels of the background and foreground, see for instance Figure 7. This is particularly important because depth discontinuities are the most challenging regions when it comes to view synthesis.
Solving for the depth probability volumes requires knowing the location and intrinsic parameters of each input camera. We estimate these using Colmap [colmap]. For a given scene, we set the closest and farthest disparity levels from the bottom and top percentiles of the reconstructed depths, and use uniformly spaced disparity steps between them. Similarly to the method of Huang \etal, we also cross-bilateral filter the depth probability volume guided by the input RGB image [kraehenbuehl2011crf]. However, we find different filter parameters, and more iterations of the filter, to work better for our case. We refer the reader to Krähenbühl and Koltun for the role of each parameter [kraehenbuehl2011crf].
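For concreteness, the uniform disparity sampling described above might look like the following sketch (the percentile values and all names are placeholders, not the ones used in the paper):

```python
import numpy as np

def disparity_levels(sparse_depths, num_levels, lo_pct=2.0, hi_pct=98.0):
    """Uniformly spaced disparity (inverse depth) steps for a scene.

    sparse_depths: depths of the sparse SfM points (e.g., from Colmap).
    The near/far planes come from robust percentiles of those depths.
    """
    d_near = np.percentile(sparse_depths, lo_pct)   # closest reliable depth
    d_far = np.percentile(sparse_depths, hi_pct)    # farthest reliable depth
    # Disparity decreases from near (large 1/d) to far (small 1/d).
    return np.linspace(1.0 / d_near, 1.0 / d_far, num_levels)
```

Sampling uniformly in disparity rather than depth concentrates the levels near the camera, where a fixed depth error produces a larger reprojection error.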
Finally, we can estimate the probability volume $\mathcal{D}_v$ for the novel view by resampling the input volumes $\mathcal{D}_i$. Conceptually, the probability of each disparity at each pixel of the novel view can be estimated by finding the intersecting rays from the input cameras and averaging their probabilities. This, however, is computationally demanding. We note that it can be done efficiently by resampling the $\mathcal{D}_i$’s with respect to the novel camera, accumulating each of the volumes into the novel view volume, and normalizing by the number of contributing views. This accumulation is sensible because the probability along a ray is a proper distribution. This is in contrast with traditional cost volumes [hosni2012fast], for which costs are not comparable across views: the same cost value in two different views may not indicate that the corresponding disparities are equally likely to be correct. Depth probability volumes also resemble the soft visibility volumes by Penner and Zhang [penner2017soft]. However, their representation is geared towards identifying empty space in front of the first surface. Therefore, it behaves differently in regions of uncertainty, such as depth discontinuities, where depth probability volumes carry information even beyond the closest surface.
Figure 8 shows an example of the resampling procedure, where we consider only a planar slice of the volumes and assume, for simplicity, that the probability along the input rays is binary. We use nearest-neighbor sampling, which, based on our experiments, yields quality comparable to tri-linear interpolation at a fraction of the cost. After merging all views, we normalize the values along each ray of the novel view’s volume to enforce a probability distribution.
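The accumulate-and-normalize step can be sketched as follows (a simplified illustration assuming the per-view volumes have already been resampled into the novel camera's frustum; names and shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def fuse_volumes(warped_volumes, counts):
    """Average resampled depth probability volumes into the novel view.

    warped_volumes: list of (H, W, D) volumes, one per input view, already
    resampled into the novel camera. counts: (H, W) number of views whose
    frustum covers each ray (zero where no view contributes).
    """
    acc = np.sum(warped_volumes, axis=0) / np.maximum(counts[..., None], 1)
    # Renormalize along depth so each ray is a proper distribution again.
    s = acc.sum(axis=-1, keepdims=True)
    return np.divide(acc, s, out=np.zeros_like(acc), where=s > 0)
```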
5 Synthesis of a Novel View
Using the depth probability volume $\mathcal{D}_v$ of the novel view, we backward warp pixels from the inputs and render an initial estimate of the novel view, $\tilde{I}_v$, in a back-to-front fashion. Specifically, we start from the farthest plane, $d = 0$, and compute a pixel $u$ in the novel view as

$\tilde{I}_v(u) = \mathbb{1}\left[\mathcal{D}_v(u, d) > \theta\right] \cdot R\left(\{I_i(u_i)\}_i\right),$

where $\mathbb{1}$ is the indicator function, and the $u_i$’s are the coordinates in the input views $I_i$ that correspond to $u$. Note that these are completely defined by the cameras’ centers and the plane at $d$. $R$ is a function that merges pixels from the input views, weighting them based on the distance between the cameras’ centers and the angles between the cameras’ principal axes. Details about the threshold $\theta$ and the weights are in the Supplementary. As we sweep the depth towards a larger disparity, \ie, closer to the camera, we overwrite those pixels for which $\mathcal{D}_v(u, d)$ is above the threshold. (An alternative to overwriting the pixels is to weigh their RGB values with the corresponding depth probabilities. However, in our experiments, this resulted in softer edges or ghosting that were harder to fix for the refinement network (Section 6.1). We speculate that the reason is that such artifacts are more “plausible” to the refinement network than abrupt and incoherent RGB changes.)
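The back-to-front sweep with overwriting can be sketched as below (the threshold value, all names, and the assumption that the merged colors from R have been precomputed per plane are illustrative, not the paper's implementation):

```python
import numpy as np

def render_back_to_front(prob_volume, colors_per_plane, threshold=0.1):
    """Sweep depth planes from far to near, overwriting confident pixels.

    prob_volume: (H, W, D) novel-view depth probabilities, plane 0 farthest.
    colors_per_plane: (D, H, W, 3) merged input colors for each plane.
    Returns the rendered image and a mask of pixels that were written.
    """
    h, w, d_levels = prob_volume.shape
    out = np.zeros((h, w, 3))
    filled = np.zeros((h, w), dtype=bool)
    for d in range(d_levels):                      # far -> near
        mask = prob_volume[:, :, d] > threshold
        out[mask] = colors_per_plane[d][mask]      # nearer planes overwrite
        filled |= mask
    return out, filled
```

Pixels whose depth probability never exceeds the threshold remain unfilled, which is the source of the holes that the refinement stage later addresses.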
The resulting image will, in general, present artifacts and holes, see Figure 9(a). This is expected, since we are rejecting depth estimates that are too uncertain, and we overwrite pixels as we sweep the depth plane from back to front. However, at this stage we are only concerned with generating an initial estimate of the novel view that obeys the geometric constraints captured by the depth probability volumes.
6 Image Refinement
The image synthesized as described in Section 5 is generally affected by apparent artifacts, as shown in Figures 9(a) and (c). Most notably, these include regions that are not rendered, either because of occlusions or missing depth information, and the typical “fattening” of edges at depth discontinuities. Moreover, since we render each pixel independently, structures may be locally deformed. We address these artifacts by training a refinement network that works at the patch level. For a pixel in the synthesized view, we first extract a patch around it (for clarity of notation, we omit its dependence on the pixel). The goal of the refinement network is to produce a higher-quality patch with fewer artifacts. One could consider the refinement operation akin to denoising, and train a network to take a patch and output the refined patch, using a dataset of synthesized and ground-truth patches and an appropriate loss function [johnson2016perceptual, zhao2017loss]. However, at inference time, this approach would only leverage generic image priors and disregard the valuable information the input images carry. Instead, we turn to the depth probability volume. Consider the case of a ray traveling close to a depth discontinuity, which is likely to generate artifacts. The probability distribution along this ray generally shows a peak corresponding to the foreground and one to the background, see Figure 7. Then, rather than fixing the artifacts only based on generic image priors, we can guide the refinement network with patches extracted from the input views at the locations reprojected from these depths. Away from depth discontinuities, the distribution usually has a single, strong peak, and the synthesized images are generally correct. Still, since we warp the pixels independently, a slight depth inaccuracy may cause local deformation. Once again, patches from the input views can inform the refinement network about the underlying structure even if the depth is slightly off.
To minimize view-dependent differences in the patches without causing local deformations, we warp them with the homography induced by the depth plane. For a given disparity $d$, we compute the warped patch

$P_i^d = \mathcal{W}\left(P_i, H_i^d\right),$

where $\mathcal{W}$ is an operator that warps a patch based on a homography, and $H_i^d$ is the homography induced by the plane at disparity $d$. This patch selection strategy can be seen as an educated selection from a plane sweep volume [collins1996space], where only the few patches that are useful are fed into the refinement network, while the large number of irrelevant patches, which could only confuse it, is disregarded. In the next section we describe our refinement network, as well as details about its training.
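The plane-induced homography is the standard construction from multi-view geometry; a sketch with assumed variable names, for a fronto-parallel plane:

```python
import numpy as np

def plane_induced_homography(K_src, K_dst, R, t, depth):
    """Homography mapping source pixels to destination pixels for the plane
    at the given depth (disparity = 1/depth), with normal n = [0, 0, 1] in
    the source camera frame: H = K_dst (R - t n^T / depth) K_src^{-1}.
    """
    n = np.array([0.0, 0.0, 1.0])
    H = K_dst @ (R - np.outer(t, n) / depth) @ np.linalg.inv(K_src)
    return H / H[2, 2]  # normalize so the bottom-right entry is 1

def warp_point(H, u, v):
    """Apply H to pixel (u, v); warping a whole patch resamples every pixel
    this way (or uses a library warp such as OpenCV's warpPerspective)."""
    p = H @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]
```

When the two cameras coincide (R = I, t = 0) the homography reduces to the identity for any depth, as expected.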
6.1 Refinement Network
Our refinement strategy, shown in Figure 10, takes a synthesized patch and the warped patches from each input view. The number of patches contributed by each view can change: because of occlusions, an input image may not “see” a particular patch, or the patch could be outside of its field of view. Moreover, the depth distribution along a ray traveling close to a depth discontinuity may have one peak, or several. As a result, we need to design our refinement network to work with a variable number of patches.
We use a UNet architecture for its proven performance on a large number of vision applications. Rather than training it on a stack of concatenated patches, which would lock us into a specific number of input patches, we apply the encoder to each of the available patches independently. We then perform max-pooling over the features generated from all the available patches, and concatenate the result with the features of the synthesized patch, see Figure 10. The encoder has seven convolutional layers, four of which downsample the data by means of strided convolutions. We also use skip connections from the four downsampling layers of the encoder to the decoder. Each skip connection is a concatenation of the features of the synthesized patch for that layer and a max-pooling operation on the features of the candidate patches at the same layer.
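The order-invariant pooling over a variable number of patches can be sketched as follows (the encoder here is a stand-in linear map for illustration, not the actual convolutional encoder; all names are assumptions):

```python
import numpy as np

def encode(patch, weights):
    """Stand-in for the shared encoder: the same fixed function is applied
    independently to every patch (a linear map + tanh for illustration)."""
    return np.tanh(weights @ patch.ravel())

def pool_patch_features(synth_patch, candidate_patches, weights):
    """Max-pool the features of however many candidate patches are available,
    then concatenate with the synthesized patch's features. The elementwise
    max is invariant to both the number and the order of the candidates."""
    f_synth = encode(synth_patch, weights)
    f_cands = np.stack([encode(p, weights) for p in candidate_patches])
    return np.concatenate([f_synth, f_cands.max(axis=0)])
```

Because the pooled feature vector has a fixed size regardless of how many candidate patches are supplied, the decoder never needs to know the patch count.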
We train the refinement network using the MVS-Synth dataset [huang2018deepmvs]. We use a perceptual loss [johnson2016perceptual] as done by Zhou \etal [zhou2018stereo], and train with ADAM [kingma2014adam]. More details about the network and the training are in the Supplementary.
7 Evaluation and Results
In this section we offer a numerical evaluation of our method and present several visual results. We recommend zooming into the images in the electronic version of the paper to better inspect them, and using a media-enabled PDF viewer to play the animated figures.
Using two views as input, computing the depth probability volumes takes s, view synthesis (estimating the depth volume in the novel view and rendering) takes 30s, and the refinement network takes s (all timings are averages).
Non-blind image quality metrics such as SSIM [SSIM] and PSNR require ground-truth images. For a quantitative evaluation of our proposed method we use the MVS-Synth dataset [huang2018deepmvs], which provides a set of high-quality renderings obtained from the game GTA-V, broken up into a hundred sequences. For each sequence, color images, depth maps, and camera parameters are provided. The location of the cameras in each sequence is unstructured. In our evaluation, we select two adjacent cameras as the input views to our method and generate a number of nearby views that are also in the sequence. We then compute the PSNR and SSIM metrics between the synthesized and ground-truth images.
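As a reference for the evaluation protocol, PSNR can be computed as in this short sketch (SSIM is more involved; library implementations such as scikit-image's are typically used instead):

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a synthesized image and its
    ground truth; higher is better, identical images give infinity."""
    mse = np.mean((np.asarray(img, float) - np.asarray(ref, float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```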
In addition, we can use the same protocol to compare against Stereo Magnification (SM) by Zhou \etal[zhou2018stereo]. Although SM is tailored towards magnifying the baseline of a stereo pair, it can also render arbitrary views that are not in the baseline between the two cameras. We chose to quantitatively compare against SM because it also addresses the problem of generating extreme views, although in a more constrained setting.
Table 1 shows PSNR and SSIM values for our method before and after refinement, and for SM. The results show that the refinement network does indeed improve the quality of the final result. In addition, the metrics measured on our method output are higher than those of SM.
Metric | Ours Warped | Ours Refined | SM
While sequences of real images cannot be used to evaluate our algorithm numerically, we can at least use them for visual comparisons of the results.
We perform a qualitative evaluation and compare against SM on their own data. In their paper, Zhou \etal show results when magnifying a stereo baseline by a moderate factor. While their results are impressive at that magnification, in this paper we push the envelope to extreme magnifications of the input baseline.
The teaser figure and Figure 14 show large magnifications on stereo pairs of scenes with complicated structure and occlusions. At this magnification level, the results of Zhou \etal are affected by strong artifacts. Even in the areas that appear to be correctly reconstructed, such as the head of Mark Twain’s statue in Figure 14 (left), a closer inspection reveals a significant amount of blur. Our method generates results that are sharper and present fewer artifacts. We also compare against their method at the magnification level they show, and observe similar results; see the Supplementary.
The method by Penner and Zhang arguably produces state-of-the-art results for novel view synthesis. However, their code is not available and their problem setting is quite different in that they focus on interpolation and rely on a larger number of input cameras than our method. For completeness, however, we show a comparison against their method in Figure 17. Our reconstruction, despite using many fewer inputs, shows a quality that is comparable to theirs, though it degrades for larger extrapolation.
To validate our method more extensively, inspired by the collection strategy implemented by Zhou \etal[zhou2018stereo], we capture a number of frame sequences from YouTube videos.
A few of the results are shown in Figure 13. The leftmost column shows the camera locations for the images shown on the right. The color of the cameras matches the color of the frame around the corresponding image, and gray indicates input cameras. We present results for a number of different camera displacements and scenes, showcasing the strength of our solution. In particular, the first three rows show results using only two cameras as inputs, with the virtual cameras being displaced by several times the baseline between the input cameras. The third row shows a dolly-in trajectory (\ie, the camera moves towards the scene), which is a particularly difficult case. Unfortunately, it may be challenging to appreciate the level of extrapolation when comparing images side by side, even when zooming in. However, we also show an animated sequence in Figure 12. To play the sequence, click on the image using a media-enabled reader, such as Adobe Reader. In the Supplementary we show additional video sequences and an animation that highlights the extent of parallax in one of the scenes.
Furthermore, our method can take any number of input images. The last two rows of Figure 13 show two scenes for which we used four input cameras.
We also conduct an evaluation to show that the use of patches as input to the refinement network does indeed guide the network to produce a better output. Figure 11 shows a comparison between our network and a network with the exact same number of parameters—the architecture differs only in that it does not take the additional patches as input. It can be observed that the proposed architecture (Figure 11(c) and Figure 11(f)) can reconstruct local structure even when the single-patch network (Figure 11(b) and Figure 11(e)) cannot. Indeed, the refinement network guided by patches can synthesize pixels in areas that had previously been occluded.
While the refinement network can fix artifacts and fill in holes at disocclusion boundaries, it cannot hallucinate pixels in areas that were outside of the frusta of the input cameras—that is a different problem requiring a different solution, such as GAN-based synthesis [wang2018high]. The refinement network also struggles to fix artifacts that look natural, such as an entire region reconstructed in the wrong location.
Finally, because the depth values are discrete, certain novel views may be affected by depth quantization artifacts. A straightforward solution is to increase the number of disparity levels (at the cost of a larger memory footprint and execution time) or adjust the range of disparities to better fit the specific scene.
We presented a method to synthesize novel views from a set of input cameras. We specifically target extreme cases, which are characterized by two factors: small numbers of input cameras, as few as two, and large extrapolations of the stereo baseline. To achieve this, we combine traditional geometric constraints with learned priors. We show results on several real scenes and camera motions, and for different numbers of input cameras.
The authors would like to thank Abhishek Badki for his help with Figure 13, and the anonymous reviewers for their thoughtful feedback.