Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance
In this work we address the challenging problem of multiview 3D surface reconstruction. We introduce a neural network architecture that simultaneously learns the unknown geometry, camera parameters, and a neural renderer that approximates the light reflected from the surface towards the camera. The geometry is represented as a zero level set of a neural network, while the neural renderer, derived from the rendering equation, is capable of (implicitly) modeling a wide set of lighting conditions and materials. We trained our network on real-world 2D images of objects with different material properties, lighting conditions, and noisy camera initializations from the DTU MVS dataset. We found our model to produce state-of-the-art 3D surface reconstructions with high fidelity, resolution and detail.
Learning 3D shapes from 2D images is a fundamental computer vision problem. A recent successful neural network approach to this problem combines a (neural) differentiable rendering system with a choice of (neural) 3D geometry representation. Differentiable rendering systems are mostly based on ray casting/tracing Sitzmann et al. (2019); Niemeyer et al. (2019); Li et al. (2018); Liu et al. (2019c); Saito et al. (2019); Liu et al. (2019a), or rasterization Loper and Black (2014); Kato et al. (2018); Genova et al. (2018); Liu et al. (2019b); Chen et al. (2019), while popular models to represent 3D geometry include point clouds Yifan et al. (2019), triangle meshes Chen et al. (2019), implicit representations defined over volumetric grids Jiang et al. (2019), and, recently, neural implicit representations, namely, zero level sets of neural networks Liu et al. (2019c); Niemeyer et al. (2019).
The main advantage of implicit neural representations is their flexibility in representing surfaces with arbitrary shapes and topologies, as well as being mesh-free (i.e., no fixed a-priori discretization such as a volumetric grid or a triangular mesh). Thus far, differentiable rendering systems with implicit neural representations Liu et al. (2019c, a); Niemeyer et al. (2019) did not incorporate lighting and reflectance properties required for producing faithful appearance of 3D geometry in images, nor did they deal with trainable camera locations and orientations.
The goal of this paper is to devise an end-to-end neural architecture that can learn 3D geometries from masked 2D images and rough camera estimates, requiring no additional supervision; see Figure 1. Towards that end, we represent the color of a pixel as a differentiable function in the three unknowns of a scene: the geometry, its appearance, and the cameras. Here, appearance means collectively all the factors that define the surface light field, excluding the geometry, i.e., the surface bidirectional reflectance distribution function (BRDF) and the scene's lighting conditions. We call this architecture the Implicit Differentiable Renderer (IDR). We show that IDR is able to approximate the light reflected from a 3D shape represented as the zero level set of a neural network. The approach can handle surface appearances from a certain restricted family, namely, all surface light fields that can be represented as continuous functions of the point on the surface, its normal, and the viewing direction. Furthermore, incorporating a global shape feature vector into IDR increases its ability to handle more complex appearances (e.g., indirect lighting effects).
Most related to our paper is Niemeyer et al. (2019), which was the first to introduce a fully differentiable renderer for implicit neural occupancy functions Mescheder et al. (2019), a particular instance of the implicit neural representations defined above. Although their model can represent arbitrary color and texture, it cannot handle general appearance models, nor can it handle unknown, noisy camera locations. For example, we show that the model in Niemeyer et al. (2019), as well as several other baselines, fails to generate the Phong reflection model Foley et al. (1996). Moreover, we show experimentally that IDR produces more accurate 3D reconstructions of shapes from 2D images, along with accurate camera parameters. Notably, while the baselines often produce shape artifacts in specular scenes, IDR is robust to such lighting effects. Our code and data are available at https://github.com/lioryariv/idr.
To summarize, the key contributions of our approach are:
End-to-end architecture that handles unknown geometry, appearance, and cameras.
Expressing the dependence of a neural implicit surface on camera parameters.
Producing state of the art 3D surface reconstructions of different objects with a wide range of appearances, from real-life 2D images, with both exact and noisy camera information.
2 Previous work
Differentiable rendering systems for learning geometry come (mostly) in two flavors: differentiable rasterization Loper and Black (2014); Kato et al. (2018); Genova et al. (2018); Liu et al. (2019b); Chen et al. (2019), and differentiable ray casting. Since the current work falls into the second category, we first concentrate on that branch of work. Then, we describe related works on multi-view surface reconstruction and neural view synthesis.
Implicit surface differentiable ray casting. Differentiable ray casting is mostly used with implicit shape representations such as an implicit function defined over a volumetric grid or an implicit neural representation, where the implicit function can be an occupancy function Mescheder et al. (2019); Chen and Zhang (2019), a signed distance function (SDF) Park et al. (2019), or any other signed implicit Atzmon and Lipman (2019). In a related work, Jiang et al. (2019) use a volumetric grid to represent an SDF and implement a ray casting differentiable renderer; they approximate the SDF value and the surface normals in each volumetric cell. Liu et al. (2019a) use sphere tracing of a pre-trained DeepSDF model Park et al. (2019) and approximate the depth gradients w.r.t. the latent code of the DeepSDF network by differentiating the individual steps of the sphere tracing algorithm; Liu et al. (2019c) use field probing to facilitate differentiable ray casting. In contrast to these works, IDR utilizes the exact, differentiable surface point and normal of the implicit surface, considers a more general appearance model, and handles noisy cameras.
Multi-view surface reconstruction. During the capturing process of an image, the depth information is lost. Assuming known cameras, classic Multi-View Stereo (MVS) methods Furukawa and Ponce (2009); Schönberger et al. (2016); Campbell et al. (2008); Tola et al. (2012) try to recover the depth information by matching feature points across views. However, post-processing steps of depth fusion Curless and Levoy (1996); Merrell et al. (2007), followed by the Poisson Surface Reconstruction algorithm Kazhdan et al. (2006), are required for producing a valid watertight 3D surface reconstruction. Recent methods use a collection of scenes to train deep neural models, either for sub-tasks of the MVS pipeline, e.g., feature matching Leroy et al. (2018) or depth fusion Donne and Geiger (2019); Riegler et al. (2017), or for an end-to-end MVS pipeline Huang et al. (2018); Yao et al. (2018, 2019). When the camera parameters are unavailable, Structure From Motion (SFM) methods Snavely et al. (2006); Schonberger and Frahm (2016); Kasten et al. (2019); Jiang et al. (2013) are applied to a set of images from a specific scene to recover the cameras and a sparse 3D reconstruction. Tang and Tan (2019) use a deep neural architecture with an integrated differentiable Bundle Adjustment Triggs et al. (1999) layer to extract a linear basis for the depth of a reference frame and features from nearby images, and to optimize for the depth and the camera parameters in each forward pass. In contrast to these works, IDR is trained with images from a single target scene and produces an accurate watertight 3D surface reconstruction.
Neural representation for view synthesis. Recent works train neural networks to predict novel views, and some geometric representation of 3D scenes or objects, from a limited set of images with known cameras. Sitzmann et al. (2019) encode the scene geometry using an LSTM that simulates the ray marching process. Mildenhall et al. (2020) use a neural network to predict volume density and view-dependent emitted radiance, synthesizing new views from a set of images with known cameras. Oechsle et al. (2020) use a neural network to learn the surface light field from an input image and geometry, predicting unknown views and/or scene lighting. Differently from IDR, these methods do not produce a 3D surface reconstruction of the scene's geometry, nor do they handle unknown cameras.
3 Method
Our goal is to reconstruct the geometry of an object from masked 2D images with possibly rough or noisy camera information. We have three unknowns: (i) the geometry, represented by parameters θ; (ii) the appearance, represented by γ; and (iii) the cameras, represented by τ. Notations and setup are depicted in Figure 2.
We represent the geometry as the zero level set of a neural network (MLP) f(x; θ),
S_θ = { x ∈ R^3 | f(x; θ) = 0 }, (1)
with learnable parameters θ. To avoid the trivial everywhere-zero solution, f is usually regularized Mescheder et al. (2019); Chen and Zhang (2019). We opt for f to model a signed distance function (SDF) to its zero level set S_θ Park et al. (2019). We enforce the SDF constraint using implicit geometric regularization (IGR) Gropp et al. (2020), detailed later. An SDF has two benefits in our context: first, it allows efficient ray casting with the sphere tracing algorithm Hart (1996); Jiang et al. (2019); and second, IGR enjoys implicit regularization favoring smooth and realistic surfaces.
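To illustrate why the SDF property enables efficient ray casting, the following Python/numpy sketch implements basic sphere tracing Hart (1996) against a toy analytic sphere SDF standing in for the geometry MLP f; the function names, step budget, and termination thresholds are illustrative assumptions, not the paper's implementation (which also traces rays from both directions):

```python
import numpy as np

def sdf_sphere(x, radius=1.0):
    """Toy analytic SDF of a sphere; stands in for the geometry MLP f(x; theta)."""
    return np.linalg.norm(x) - radius

def sphere_trace(sdf, origin, direction, max_steps=64, eps=1e-5, t_max=10.0):
    """March along the ray origin + t * direction, stepping by the SDF value.

    Because an SDF lower-bounds the distance to the surface, a step of size
    f(x) is guaranteed not to cross the surface. Returns the ray parameter t
    of the first intersection, or None if the ray misses."""
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:          # close enough to the zero level set
            return t
        t += d               # safe step: cannot overshoot the surface
        if t > t_max:        # left the region of interest
            return None
    return None
```

For example, a ray from (0, 0, -3) towards the origin reaches the unit sphere at t = 2, while a parallel ray offset by 2 units misses it.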
IDR forward model. Given a pixel, indexed by p, associated with some input image, let R_p(τ) = { c_p + t v_p | t ≥ 0 } denote the ray through pixel p, where c_p = c_p(τ) denotes the unknown center of the respective camera and v_p = v_p(τ) the direction of the ray (i.e., the vector pointing from c_p towards pixel p). Let x̂_p = x̂_p(θ, τ) denote the first intersection of the ray R_p and the surface S_θ. The incoming radiance along R_p, which determines the rendered color of the pixel, is a function of the surface properties at x̂_p, the incoming radiance at x̂_p, and the viewing direction v_p. In turn, we make the assumption that the surface properties and incoming radiance are functions of the surface point x̂_p, its corresponding surface normal n̂_p(θ), the viewing direction v_p, and a global geometry feature vector ẑ_p = ẑ_p(x̂_p; θ). The IDR forward model is therefore:
L_p(θ, γ, τ) = M(x̂_p, n̂_p, ẑ_p, v_p; γ), (2)
where M is a second neural network (MLP). We utilize L_p in a loss comparing L_p and the pixel input color I_p to simultaneously train the model's parameters θ, γ, τ. We next provide more details on the different components of the model in equation 2.
3.1 Differentiable intersection of viewing direction and geometry
Henceforth (up until section 3.4), we assume a fixed pixel p and drop the subscript p to simplify notation. The first step is to represent the intersection point x̂(θ, τ) as a neural network with parameters θ, τ. This can be done with a slight modification to the geometry network f.
Let x̂(θ, τ) = c + t(θ, c, v) v denote the intersection point. As we are aiming to use x̂ in a gradient descent-like algorithm, all we need to make sure is that our derivations are correct in value and first derivatives at the current parameters, denoted by θ_0, τ_0; accordingly we denote c_0 = c(τ_0), v_0 = v(τ_0), t_0 = t(θ_0, c_0, v_0), and x_0 = x̂(θ_0, τ_0) = c_0 + t_0 v_0.
Let S_θ be defined as in equation 1. The intersection of the ray R(τ) and the surface S_θ can be represented by the formula
x̂(θ, τ) = c + t_0 v − ( v / ( ∇_x f(x_0; θ_0) · v_0 ) ) f(c + t_0 v; θ), (3)
which is exact in value and first derivatives of θ and τ at θ = θ_0 and τ = τ_0.
To prove this functional dependency of x̂ on its parameters, we use implicit differentiation Atzmon et al. (2019); Niemeyer et al. (2019), that is, we differentiate the equation f(x̂(θ, τ); θ) ≡ 0 w.r.t. the parameters and solve for the derivatives of x̂. Then, it can be checked that the formula in equation 3 possesses the correct derivatives. More details are in the supplementary. We implement equation 3 as a neural network, namely, we add two linear layers (with parameters c, v): one before and one after the MLP f. Equation 3 unifies the sample network formula in Atzmon et al. (2019) and the differentiable depth in Niemeyer et al. (2019), and generalizes them to account for unknown cameras. The normal vector to S_θ at x̂(θ, τ) can be computed by:
n̂(θ, τ) = ∇_x f(x̂(θ, τ); θ) / ‖ ∇_x f(x̂(θ, τ); θ) ‖_2 . (4)
Note that for an SDF the denominator is 1, so it can be omitted.
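The intersection formula of equation 3 only needs the current ray parameter t_0 and the frozen directional derivative ∇_x f(x_0; θ_0) · v_0; gradients then flow through the re-evaluation of f at c + t_0 v. A minimal numpy sketch, with a toy sphere SDF of variable radius r standing in for f(·; θ) (all names are hypothetical; in practice f is the geometry MLP and auto-differentiation supplies the gradients):

```python
import numpy as np

def sample_network_point(c, v, t0, grad_dot_v0, f_val):
    """Differentiable intersection sketch:
    x(theta, tau) = c + t0*v - v * f(c + t0*v; theta) / (grad_x f(x0; theta0) . v0).
    t0 and grad_dot_v0 are treated as constants (current values); gradients
    w.r.t. the geometry/camera parameters flow through f_val, c and v."""
    return c + t0 * v - (f_val / grad_dot_v0) * v

# Toy geometry: sphere of radius r, f(x; r) = |x| - r, with current radius r0 = 1.
c = np.array([0.0, 0.0, -3.0])    # camera center
v = np.array([0.0, 0.0, 1.0])     # ray direction
t0 = 2.0                          # intersection at r0 = 1: x0 = (0, 0, -1)
grad_dot_v0 = np.dot(np.array([0.0, 0.0, -1.0]), v)   # grad f(x0) . v0 = -1
```

Perturbing the radius to r = 1.05 and re-evaluating f_val = |c + t0 v| − r, the formula returns (0, 0, −r), the true intersection of the perturbed sphere, without re-running ray casting.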
3.2 Approximation of the surface light field
The surface light field radiance L is the amount of light reflected from S_θ at x̂ in direction −v, reaching c. It is determined by two functions: the bidirectional reflectance distribution function (BRDF), describing the reflectance and color properties of the surface, and the light emitted in the scene (i.e., light sources).
The BRDF function B(x, n, w^o, w^i) describes the proportion of reflected radiance (i.e., flux of light) at some wavelength (i.e., color) leaving the surface point x with normal n in direction w^o, with respect to the incoming radiance from direction w^i. We let the BRDF depend also on the normal n to the surface at a point. The light sources in the scene are described by a function L^e(x, w^o) measuring the emitted radiance of light at some wavelength at point x in direction w^o. The amount of light reaching c in direction w^o = −v equals the amount of light reflected from x̂ in direction w^o and is described by the so-called rendering equation Kajiya (1986); Immel et al. (1986):
L(x̂, w^o) = L^e(x̂, w^o) + ∫_Ω B(x̂, n̂, w^i, w^o) L^i(x̂, w^i) (n̂ · w^i) dw^i, (5)
where L^i(x̂, w^i) encodes the incoming radiance at x̂ in direction w^i, the term n̂ · w^i compensates for the fact that the light does not hit the surface orthogonally, and Ω is the half sphere centered at n̂. The function M_0(x̂, n̂, v) = L(x̂, −v) represents the surface light field as a function of the local surface geometry x̂, n̂, and the viewing direction v. This rendering equation holds for every light wavelength; as described later, we will use it for the red, green and blue (RGB) wavelengths.
We restrict our attention to light fields that can be represented by a continuous function M_0(x, n, v). We denote the collection of such continuous functions by P (see the supplementary material for more discussion on P). Replacing M_0 with a (sufficiently large) MLP approximation M (the neural renderer) provides the light field approximation:
L(θ, γ, τ) = M(x̂, n̂, v; γ). (6)
Disentanglement of geometry and appearance requires the learnable M to approximate M_0 for all inputs x, n, v, rather than memorizing the radiance values for a particular geometry. Given an arbitrary choice of light field function M_0 ∈ P, there exists a choice of weights γ so that M approximates M_0 for all x, n, v (in some bounded set). This can be proved using a standard universality theorem for MLPs (details in the supplementary). However, the fact that M can learn the correct light field function does not mean it is guaranteed to learn it during optimization. Nevertheless, being able to approximate M_0 for arbitrary M_0 ∈ P is a necessary condition for the disentanglement of geometry (represented with f) and appearance (represented with M). We name this necessary condition P-universality.
Necessity of viewing direction and normal. For M to be able to represent the correct light reflected from a surface point x̂, i.e., to be P-universal, it has to receive also v and n̂ as arguments. The viewing direction v is necessary even if we expect M to work for a fixed geometry, e.g., for modeling specularity. The normal n̂, on the other hand, could be memorized by M as a function of the surface point x̂. However, for disentanglement of geometry, i.e., allowing M to learn appearance independently from the geometry, incorporating the normal direction is also necessary. This can be seen in Figure 3: a renderer without normal information will produce the same light estimation in cases (a) and (b), while a renderer without viewing direction will produce the same light estimation in cases (a) and (c). In the supplementary we provide details on how these renderers fail to generate correct radiance under the Phong reflection model Foley et al. (1996). Previous works, e.g., Niemeyer et al. (2019), have considered rendering functions of implicit neural representations of the form M(x̂; γ). As indicated above, omitting n̂ and/or v from M will result in a non-P-universal renderer. In the experimental section we demonstrate that incorporating n̂ in the renderer indeed leads to a successful disentanglement of geometry and appearance, while omitting it impairs disentanglement.
Accounting for global light effects. P-universality is a necessary condition for learning a neural renderer that can simulate appearance from the collection P. However, P does not include global lighting effects such as secondary lighting and self-shadows. We further increase the expressive power of IDR by introducing a global feature vector ẑ. This feature vector allows the renderer to reason globally about the geometry S_θ. To produce ẑ we extend the geometry network f as follows: F(x; θ) = [ f(x; θ), z(x; θ) ]. In general, z(x; θ) can encode the geometry relative to the surface sample x; ẑ = z(x̂; θ) is fed into the renderer to take into account the surface sample relevant for the current pixel of interest p. We have now completed the description of the IDR model, given in equation 2.
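To make the renderer's interface concrete, here is a small numpy sketch of a forward pass through an MLP that consumes the concatenated surface point, normal, global feature vector, and viewing direction; the layer sizes, feature dimension, activations, and random initialization are illustrative assumptions (the paper's renderer is a 4-layer MLP of width 512):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(sizes):
    """Random (illustrative) weights for a fully connected network."""
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def renderer_forward(x_hat, n_hat, z_hat, v, weights):
    """Neural renderer sketch: RGB = M(x, n, z, v; gamma).

    The input is the concatenation of the surface point, its normal, the
    global geometry feature vector, and the viewing direction."""
    h = np.concatenate([x_hat, n_hat, z_hat, v])
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)   # ReLU hidden layers
    W, b = weights[-1]
    return np.tanh(W @ h + b)            # 3 outputs, one per RGB channel
```

With an (illustrative) feature dimension of 8, the input dimension is 3 + 3 + 8 + 3 = 17; dropping the normal or view-direction slots from this concatenation is exactly the ablation discussed above.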
3.3 Masked rendering
Another useful type of 2D supervision for reconstructing 3D geometry are masks: binary images indicating, for each pixel p, whether the object of interest occupies that pixel. Masks can be provided in the data (as we assume) or computed using, e.g., masking or segmentation algorithms. We would like to consider the following indicator function identifying whether a certain pixel is occupied by the rendered object (remember we assume some fixed pixel p):
S(θ, τ) = 1 if R(τ) ∩ S_θ ≠ ∅, and S(θ, τ) = 0 otherwise.
Since this function is neither differentiable nor continuous in θ, τ, we use an almost everywhere differentiable approximation:
S_α(θ, τ) = sigmoid( −α min_{t ≥ 0} f(c + t v; θ) ), (7)
where α > 0 is a parameter. Since, by convention, f < 0 inside our geometry and f > 0 outside, it can be verified that S_α(θ, τ) → S(θ, τ) as α → ∞. Note that differentiating equation 7 w.r.t. θ can be done using the envelope theorem, namely ∂_θ min_{t ≥ 0} f(c + t v; θ) = ∂_θ f(c + t_* v; θ), where t_* is an argument achieving the minimum, i.e., t_* ∈ argmin_{t ≥ 0} f(c + t v; θ), and similarly for c and v. We therefore implement the approximated indicator as the neural network S_α(θ, τ) = sigmoid( −α f(c + t_* v; θ) ). Note that this neural network has exact value and first derivatives at θ = θ_0, c = c_0, v = v_0.
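The approximation of equation 7 is a sigmoid of the scaled, negated minimum of f along the ray. A numpy sketch, with the minimum taken over a dense grid of ray samples and a toy unit-sphere SDF in place of f (the grid-based minimization is an illustrative assumption; in practice the minimizer t_* can be tracked during ray marching):

```python
import numpy as np

def sdf_sphere(x):
    """Toy unit-sphere SDF standing in for f(x; theta)."""
    return np.linalg.norm(x) - 1.0

def soft_mask(sdf, c, v, alpha, t_grid):
    """Soft occupancy indicator: sigmoid(-alpha * min_t f(c + t v)).

    A ray that hits the object has f < 0 somewhere along it, driving the
    output towards 1; a miss has f > 0 everywhere, driving it towards 0.
    The minimum is taken over a dense sample of ray parameters t."""
    vals = np.array([sdf(c + t * v) for t in t_grid])
    return 1.0 / (1.0 + np.exp(alpha * vals.min()))
```

As alpha grows, the output approaches the binary indicator: a ray through the sphere yields a value near 1, a ray passing beside it a value near 0.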
3.4 Loss
Let I_p ∈ [0, 1]^3 and O_p ∈ {0, 1} be the RGB and mask values (respectively) corresponding to a pixel p in an image taken with camera c_p(τ) and direction v_p(τ), where p ∈ P indexes all pixels in the input collection of images, and τ represents the parameters of all the cameras in the scene. Our loss function has the form:
loss(θ, γ, τ) = loss_RGB(θ, γ, τ) + ρ loss_MASK(θ, τ) + λ loss_E(θ). (8)
We train this loss on mini-batches of pixels in P; to keep notation simple, we denote by P the current mini-batch. For each p ∈ P we use the sphere tracing algorithm Hart (1996); Jiang et al. (2019) to compute the first intersection point, c_p + t_{p,0} v_p, of the ray R_p(τ) and S_θ. Let P^in ⊂ P be the subset of pixels p for which an intersection has been found and O_p = 1. Let L_p = M(x̂_p, n̂_p, ẑ_p, v_p; γ), where x̂_p and n̂_p are defined as in equations 3 and 4, and ẑ_p and M as in section 3.2 and equation 2. The RGB loss is
loss_RGB(θ, γ, τ) = (1 / |P|) Σ_{p ∈ P^in} | I_p − L_p |_1 , (9)
where | · |_1 represents the L1 norm. Let P^out = P \ P^in denote the indices in the mini-batch for which no ray-geometry intersection was found or O_p = 0. The mask loss is
loss_MASK(θ, τ) = (1 / (α |P|)) Σ_{p ∈ P^out} CE(O_p, S_{p,α}(θ, τ)), (10)
where CE is the cross-entropy loss. Lastly, we enforce f to be approximately a signed distance function with Implicit Geometric Regularization (IGR) Gropp et al. (2020), i.e., incorporating the Eikonal regularization:
loss_E(θ) = E_x ( ‖ ∇_x f(x; θ) ‖ − 1 )^2, (11)
where x is distributed uniformly in a bounding box of the scene.
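The expectation above can be estimated by Monte Carlo sampling of the bounding box. A numpy sketch, with central finite differences standing in for the auto-differentiated MLP gradient (an illustrative substitution; the sample count and step size are arbitrary):

```python
import numpy as np

def eikonal_loss(sdf, box_min, box_max, n_samples=1024, h=1e-4, seed=0):
    """Monte Carlo estimate of E_x[(|grad f(x)| - 1)^2], x ~ Uniform(box).

    sdf must accept a batch of points of shape (n, 3). Gradients are
    approximated with central finite differences; a learned MLP would use
    auto-differentiation instead."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(box_min, box_max, size=(n_samples, 3))
    grads = np.stack([(sdf(x + h * e) - sdf(x - h * e)) / (2.0 * h)
                      for e in np.eye(3)], axis=-1)      # shape (n, 3)
    return np.mean((np.linalg.norm(grads, axis=-1) - 1.0) ** 2)
```

A true SDF has unit gradient norm almost everywhere and yields a near-zero loss, while, e.g., f(x) = |x|^2 − 1 does not, even though both share the same zero level set; the regularizer therefore steers f towards the SDF among all functions with the same surface.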
Implementation details. The geometry MLP F consists of 8 layers with hidden layers of width 512, and a single skip connection from the input to the middle layer as in Park et al. (2019). We initialize the weights θ as in Atzmon and Lipman (2019), so that f(x; θ) produces an approximate SDF of a unit sphere. The renderer MLP M consists of 4 layers, with hidden layers of width 512. We use the non-linear maps of Mildenhall et al. (2020) to improve the learning of high frequencies, which are otherwise difficult to train for due to the inherent low-frequency bias of neural networks Ronen et al. (2019). Specifically, for a scalar s we denote by PE_k(s) the vector of real and imaginary parts of exp(i 2^j π s), 0 ≤ j ≤ k − 1, and for a vector x we denote by PE_k(x) the concatenation of PE_k(s) for all the entries s of x. We redefine F to obtain PE_6(x) as input, i.e., F(PE_6(x); θ), and likewise we redefine M to receive PE_4(v), i.e., M(x̂, n̂, ẑ, PE_4(v); γ). For the loss, equation 8, we set ρ = 100 and λ = 0.1. To approximate the indicator function with S_α during training, we gradually increase α and by this constrain the shape boundaries in a coarse-to-fine manner: we start with α = 50 and multiply it by a factor of 2 every 250 epochs (up to a total of 5 multiplications). The gradients in equations (11) and (4) are implemented using auto-differentiation. More details are in the supplementary.
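The non-linear maps of Mildenhall et al. (2020) referenced above are positional encodings: each input coordinate is mapped to sines and cosines at dyadic frequencies before entering the network. A numpy sketch (the default frequency count and the output ordering are illustrative assumptions):

```python
import numpy as np

def positional_encoding(x, num_freqs=6):
    """Map each entry s of x to (sin(2^j * pi * s), cos(2^j * pi * s)) for
    j = 0..num_freqs-1, i.e., the imaginary and real parts of exp(i 2^j pi s).

    x has shape (..., d); the result has shape (..., 2 * num_freqs * d)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (num_freqs,)
    angles = x[..., None] * freqs                        # (..., d, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)
```

A 3D point thus maps to 2 * num_freqs * 3 features, which replace the raw coordinates at the network input and counteract the low-frequency bias.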
4 Experiments
4.1 Multiview 3D reconstruction
We apply our multiview surface reconstruction model to real 2D images from the DTU MVS repository Jensen et al. (2014). Our experiments were run on 15 challenging scans, each including either 49 or 64 high-resolution images of objects with a wide variety of materials and shapes. The dataset also contains ground truth 3D geometries and camera poses. We manually annotated binary masks for all 15 scans, except for scans 65, 106 and 118, for which masks are supplied by Niemeyer et al. (2019).
We used our method to generate 3D reconstructions in two different setups: (1) fixed ground truth cameras, and (2) trainable cameras with noisy initializations obtained with the linear method of Jiang et al. (2013). In both cases we re-normalize the cameras so that the object's visual hull is contained in the unit sphere.
Training on each multi-view image collection proceeded iteratively. In each iteration we randomly sampled 2048 pixels from each image and derived their per-pixel information, including the RGB values I_p and mask values O_p. We then optimized the loss in equation 8 to find the geometry network f and renderer network M. After training, we used the Marching Cubes algorithm Lorensen and Cline (1987) to retrieve the reconstructed surface from f.
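Surface extraction proceeds by evaluating the learned SDF on a dense grid and running Marching Cubes on the resulting volume. A numpy sketch of the grid evaluation (resolution, bounds, and the batched SDF interface are illustrative; a library routine such as skimage.measure.marching_cubes would then consume the volume):

```python
import numpy as np

def sdf_volume(sdf, resolution=64, bound=1.0):
    """Evaluate a batched SDF on a regular grid over [-bound, bound]^3.

    Returns a (resolution, resolution, resolution) array of signed
    distances; its zero level set is what Marching Cubes extracts as a
    triangle mesh."""
    xs = np.linspace(-bound, bound, resolution)
    X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    return sdf(pts).reshape(resolution, resolution, resolution)
```

Since the cameras are normalized so that the scene fits in the unit sphere, a bound slightly larger than 1 suffices to enclose the reconstruction.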
Evaluation. We evaluated the quality of our 3D surface reconstructions using the official surface evaluation script of the DTU dataset, which measures the standard Chamfer-L1 distance between the ground truth and the reconstruction. We also report the PSNR of train-image reconstructions. We note that the ground truth geometry in the dataset has some noise, does not include watertight surfaces, and often suffers from notable missing parts, e.g., Figure 5 and Fig. 7c of Niemeyer et al. (2019). We compare to the following baselines: DVR Niemeyer et al. (2019) (for fixed cameras), Colmap Schönberger et al. (2016) (for fixed and trained cameras) and Furu Furukawa and Ponce (2009) (for fixed cameras). Similarly to Niemeyer et al. (2019), for a fair comparison we cleaned the point clouds of Colmap and Furu using the input masks before running Screened Poisson Surface Reconstruction (sPSR) Kazhdan and Hoppe (2013) to get a watertight surface reconstruction. For completeness we also report their trimmed reconstructions, obtained with the trim configuration of sPSR, which contain large missing parts (see Fig. 5, middle) but perform well in terms of the Chamfer distance.
Quantitative results of the experiment with known fixed cameras are presented in Table 1, and qualitative results in Figure 4 (left). Our model outperforms the baselines in the PSNR metric, and in the Chamfer metric for watertight surface reconstructions. In Table 3 we compare the reconstructions obtained with unknown, trained cameras. Qualitative results for this setup are shown in Figure 4 (right). The relevant baseline here is the Colmap SFM Schonberger and Frahm (2016) + MVS Schönberger et al. (2016) pipeline. In Figure 7 we further show the convergence of our cameras (rotation and translation errors, sorted from small to large) from the initialization of Jiang et al. (2013) during training epochs, along with Colmap's cameras. We note that our method simultaneously improves the camera parameters while reconstructing accurate 3D surfaces, still outperforming the baselines for watertight reconstruction and PSNR in most cases; scan 97 is a failure case of our method. As can be seen in Figure 4, our 3D surface reconstructions are more complete, with a better signal-to-noise ratio than the baselines, while our renderings (right column in each part) are close to realistic.
Small number of cameras.
We further tested our method on the Fountain-P11 image collection Strecha et al. (2008), which provides 11 high-resolution images with associated ground truth camera parameters. In Table 2 we show a comparison to Colmap (trim7-sPSR) in a setup of unknown cameras (our method is roughly initialized with Jiang et al. (2013)). Note the considerable improvement in final camera accuracy over Colmap. Qualitative results are shown in Figure 6.