InverseRenderNet: Learning single image inverse rendering
We show how to train a fully convolutional neural network to perform inverse rendering from a single, uncontrolled image. The network takes an RGB image as input and regresses albedo and normal maps, from which we compute lighting coefficients. Our network is trained using large uncontrolled image collections without ground truth. By incorporating a differentiable renderer, our network can learn from self-supervision. Since the problem is ill-posed, we introduce additional supervision. First, we learn a statistical natural illumination prior. Second, our key insight is to perform offline multiview stereo (MVS) on images containing rich illumination variation. From the MVS pose and depth maps, we can cross project between overlapping views such that Siamese training can be used to ensure consistent estimation of photometric invariants. MVS depth also provides direct coarse supervision for normal map estimation. We believe this is the first attempt to use MVS supervision for learning inverse rendering.
Figure 1. Columns, left to right: input, diffuse albedo, illumination, normal map (NM) prediction, NM from MVS, frontal shading, shading.
Inverse rendering is the problem of estimating one or more of illumination, reflectance properties and shape from observed appearance (i.e. one or more images). In this paper, we tackle the most challenging setting of this problem; we seek to estimate all three quantities from only a single, uncontrolled image. Specifically, we estimate a normal map, diffuse albedo map and spherical harmonic lighting coefficients. This subsumes two classical computer vision problems: (uncalibrated) shape-from-shading and intrinsic image decomposition.
Classical approaches [4, 29] cast these problems in terms of energy minimisation. Here, a data term measures the difference between the input image and the synthesised image that arises from the estimated quantities. We approach the problem as one of image to image translation and solve it using a deep, fully convolutional neural network. However, inverse rendering of uncontrolled, outdoor scenes is itself an unsolved problem and so labels for supervised learning are not available. Instead, we use the data term for self-supervision via a differentiable renderer (see Fig. 2).
Single image inverse rendering is an inherently ambiguous problem. For example, any image can be explained with zero data error by setting the albedo map equal to the image, the normal map to be planar and the illumination arbitrarily such that the shading is unity everywhere. Hence, the data term alone cannot be used to solve this problem. For this reason, classical methods augment the data term with generic or object-class-specific priors. Likewise, we also exploit priors during learning (specifically a statistical prior on lighting and a smoothness prior on diffuse albedo). However, our key insight, which is what enables the CNN to perform well, is to introduce additional supervision provided by an offline multiview reconstruction.
While photometric vision has largely been confined to restrictive lab settings, classical geometric methods are sufficiently robust to provide multiview 3D shape reconstructions from large, unstructured datasets containing very rich illumination variation [17, 14]. This is made possible by local image descriptors that are largely invariant to illumination. However, these methods recover only geometric information and any recovered texture map has illumination “baked in” and so is useless for relighting. We exploit the robustness of geometric methods to varying illumination to supervise our inverse rendering network. We apply a multiview stereo (MVS) pipeline to large sets of images of the same scene. We select pairs of overlapping images with different illumination, use the estimated relative pose and depth maps to cross project photometric invariants between views and use this for supervision via Siamese training. In other words, geometry provides correspondence that allows us to simulate varying illumination from a fixed viewpoint. Finally, the depth maps from MVS provide coarse normal map estimates that can be used for direct supervision of the normal map estimation.
Deep learning has already shown good performance on components of the inverse rendering problem. This includes monocular depth estimation, depth and normal estimation and intrinsic image decomposition. However, these works use supervised learning. For tasks where ground truth does not exist, such approaches must either train on synthetic data (in which case generalisation to the real world is not guaranteed) or generate pseudo ground truth using an existing method (in which case the network is just learning to replicate the performance of the existing method). Inverse rendering of outdoor, complex scenes is itself an unsolved problem, so reliable ground truth is not available and supervised learning cannot be used. In this context, we make the following contributions. First, to the best of our knowledge, we are the first to exploit MVS supervision for learning inverse rendering. Second, we are the first to tackle the most general version of the problem, considering arbitrary outdoor scenes and learning from real data, as opposed to restricting to a single object class or using synthetic training data. Third, we introduce a statistical model of spherical harmonic lighting in natural scenes that we use as a prior. Finally, the resulting network is the first to inverse render all of shape, reflectance and lighting in the wild and we perform the first evaluation in this setting.
2 Related work
Classical methods estimate intrinsic properties by fitting photometric or geometric models. Most methods require multiple images. From multiview images, a structure-from-motion/multiview stereo pipeline enables recovery of dense mesh models [24, 14], though illumination effects are baked into the texture. From images with fixed viewpoint but varying illumination, photometric stereo can be applied. Variants consider statistical BRDF models, the use of outdoor time-lapse images and spatially-varying BRDFs. Attempts to combine geometric and photometric methods are limited. Haber et al. [19] assume known geometry (which can be provided by MVS) and inverse render reflectance and lighting from community photo collections. Kim et al. [26] represent the state of the art and again use an MVS initialisation for joint optimisation of geometry, illumination and albedo. Some methods consider a single image setting. Jeon et al. [22] introduce a local-adaptive reflectance smoothness constraint for intrinsic image decomposition on texture-free input images, which are obtained with a texture separation algorithm. Barron et al. [4] present SIRFS, a classical optimisation-based approach that recovers all of shape, illumination and albedo using a sophisticated combination of generic priors.
Deep depth prediction. Direct estimation of shape alone using deep neural networks has attracted a lot of attention. Eigen et al. [11, 10] were the first to apply deep learning in this context. Subsequently, performance gains were obtained using improved architectures, post-processing with classical CRF-based methods [49, 35, 50] and using ordinal relationships for objects within the scenes [13, 34, 8]. Zheng et al. use synthetic images for training but improve generalisation using a synthetic-to-real transform GAN. However, all of this work requires supervision by ground truth depth. An alternative branch of methods explores self-supervision from augmented data. For example, binocular stereo pairs can provide a supervisory signal through consistency of cross projected images [46, 25, 15, 16]. Alternatively, video data can provide a similar source of supervision [53, 47, 48]. Several more specialised approaches have also been proposed recently. Tulsiani et al. use multiview supervision in a ray tracing network. While all these methods take single image input, Ji et al. [23] tackle the MVS problem itself using deep learning.
Deep intrinsic image decomposition. Intrinsic image decomposition is a partial step towards inverse rendering. It decomposes an image into reflectance (albedo) and shading but does not separate shading into shape and illumination. Even so, the lack of ground truth training data makes this a hard problem to solve with deep learning. Recent work either uses synthetic training data and supervised learning [37, 20, 30, 7, 12] or self-supervision/unsupervised learning. Very recently, Li et al. [33] used uncontrolled time-lapse images, allowing them to combine an image reconstruction loss with reflectance consistency between frames. This work was further extended using photorealistic, synthetic training data [32]. Ma et al. also trained on time-lapse sequences and introduced a new gradient constraint which encourages better explanations for sharp changes caused by shading or reflectance. Baslamisli et al. [5] applied a similar gradient constraint but used supervised training. Shelhamer et al. propose a hybrid approach where a CNN estimates a depth map which is used to constrain a classical optimisation-based intrinsic image estimation.
Deep inverse rendering. To date, this topic has not received much attention. One line of work simplifies the problem by restricting to a single object class, e.g. faces, meaning that a statistical face model can constrain the geometry and reflectance estimates. This enables entirely self-supervised training. Shu et al. extend this idea with an adversarial loss. Sengupta et al., on the other hand, initialise with supervised training on synthetic data and fine-tune their network in an unsupervised fashion on real images. Another line of work restricts geometry to almost planar objects and lighting to a flash in the viewing direction [1, 31], under which assumptions impressive results can be obtained. More general settings have been considered by Kulkarni et al. [27], who show how to learn latent variables that correspond to extrinsic parameters, allowing image manipulation. Janner et al. [21] is the only prior work we are aware of that tackles the full inverse rendering problem. Like us, they use self-supervision but include a trainable shading model. However, the shader requires supervised training on synthetic data, limiting the ability of the network to generalise to real world scenes.
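We assume that a perspective camera observes a scene, such that the projection from 3D world coordinates, $\mathbf{u} = [u, v, w]^T$, to 2D image coordinates, $\mathbf{x} = [x, y]^T$, is given by:

$$\lambda \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} = \mathbf{P} \begin{bmatrix} \mathbf{u} \\ 1 \end{bmatrix}, \qquad \mathbf{P} = \mathbf{K} \begin{bmatrix} \mathbf{R} & \mathbf{t} \end{bmatrix}, \qquad \mathbf{K} = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{1}$$

where $\lambda$ is an arbitrary scale factor, $\mathbf{R}$ a rotation matrix, $\mathbf{t}$ a translation vector, $f$ the focal length and $(c_x, c_y)$ the principal point.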
The inverse rendered shape estimate could be represented in a number of ways. For example, many previous methods estimate a viewer-centred depth map. However, local reflectance, and hence appearance, is determined by surface orientation, i.e. the local surface normal direction. So, to render a depth map for self-supervision, we would need to compute the surface normal. From a perspective depth map $z(x, y)$, the surface normal direction is:

$$\bar{\mathbf{n}} = \begin{bmatrix} -f \, \dfrac{\partial z}{\partial x} \\[4pt] -f \, \dfrac{\partial z}{\partial y} \\[4pt] (x - c_x) \dfrac{\partial z}{\partial x} + (y - c_y) \dfrac{\partial z}{\partial y} + z(x, y) \end{bmatrix}, \tag{2}$$

from which the unit length normal is given by: $\mathbf{n} = \bar{\mathbf{n}} / \lVert \bar{\mathbf{n}} \rVert$. The derivatives of the depth map in the image plane, $\partial z / \partial x$ and $\partial z / \partial y$, can be approximated by finite differences. However, (2) requires knowledge of the intrinsic camera parameters. This would severely restrict the applicability of our method. For this reason, we choose to estimate a surface normal map directly.
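Although the surface normal can be represented by a 3D vector, it has only two degrees of freedom. So, our network estimates the two elements of the surface gradient at each pixel, $\partial z / \partial x$ and $\partial z / \partial y$, and the transformation to a 3D surface normal vector is computed by a fixed layer that calculates: $\mathbf{n} = [-\partial z / \partial x, \; -\partial z / \partial y, \; 1]^T / \lVert [-\partial z / \partial x, \; -\partial z / \partial y, \; 1] \rVert$. Note that we estimate the normal map in a viewer-centred coordinate system.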
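The two conversions discussed above can be sketched in NumPy as follows. This is only an illustrative sketch, not the network's implementation: the function names are hypothetical, `np.gradient` stands in for the finite differences, and the perspective formula is the standard one obtained from the cross product of the back-projected surface tangents.

```python
import numpy as np

def normals_from_depth(z, f, cx, cy):
    """Unit normals from a perspective depth map z (H x W), given intrinsics."""
    h, w = z.shape
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    # np.gradient returns the derivative along rows (y) first, then columns (x).
    dz_dy, dz_dx = np.gradient(z)
    n = np.stack([-f * dz_dx,
                  -f * dz_dy,
                  (x - cx) * dz_dx + (y - cy) * dz_dy + z], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normals_from_gradient(p, q):
    """Fixed conversion from surface gradient (p, q) to a unit normal.

    No camera intrinsics are needed, which is why the network can
    regress the two gradient channels directly.
    """
    n = np.stack([-p, -q, np.ones_like(p)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

The first function needs `f`, `cx` and `cy`, whereas the second needs nothing beyond the two predicted channels, which mirrors the argument for estimating the normal map directly.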
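We assume that appearance can be approximated by a local reflectance model under environment illumination. Specifically we use a Lambertian diffuse model with order 2 spherical harmonic lighting. This means that RGB intensity can be computed as

$$\mathbf{i}_{\text{lin}}(\mathbf{n}, \boldsymbol{\alpha}, \mathbf{L}) = \operatorname{diag}(\boldsymbol{\alpha}) \, \mathbf{L}^T \mathbf{b}(\mathbf{n}), \tag{3}$$

where $\mathbf{L} \in \mathbb{R}^{9 \times 3}$ contains the spherical harmonic colour illumination coefficients, $\boldsymbol{\alpha} = [\alpha_r, \alpha_g, \alpha_b]^T$ is the colour diffuse albedo and the order 2 basis $\mathbf{b}(\mathbf{n}) \in \mathbb{R}^9$ is given by:

$$\mathbf{b}(\mathbf{n}) = [1, \; n_x, \; n_y, \; n_z, \; 3n_z^2 - 1, \; n_x n_y, \; n_x n_z, \; n_y n_z, \; n_x^2 - n_y^2]^T. \tag{4}$$

Our appearance model means that we neglect high frequency illumination effects, cast shadows and interreflections. However, we found that in practice this model works well for typical outdoor scenes. Finally, cameras apply a nonlinear gamma transformation. We simulate this to produce our final predicted intensities: $\mathbf{i}_{\text{pred}} = \mathbf{i}_{\text{lin}}^{1/\gamma}$, where we assume a fixed value of $\gamma$.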
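As a rough illustration of this appearance model, the sketch below renders an image from albedo, normals and lighting. It assumes the common unnormalised order 2 basis $[1, n_x, n_y, n_z, 3n_z^2 - 1, n_x n_y, n_x n_z, n_y n_z, n_x^2 - n_y^2]$ and a default of $\gamma = 2.2$; both the default and the function names are assumptions made for the example.

```python
import numpy as np

def sh_basis(normals):
    """Order-2 spherical harmonic basis for unit normals (K x 3); returns 9 x K."""
    nx, ny, nz = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([np.ones_like(nx), nx, ny, nz,
                     3.0 * nz ** 2 - 1.0,
                     nx * ny, nx * nz, ny * nz,
                     nx ** 2 - ny ** 2])

def render(albedo, normals, L, gamma=2.2):
    """Lambertian rendering under SH lighting followed by a gamma curve.

    albedo: 3 x K, normals: K x 3, L: 9 x 3 lighting coefficients.
    gamma=2.2 is an assumed default, not a value taken from the text.
    """
    shading = L.T @ sh_basis(normals)      # 3 x K colour shading
    linear = albedo * shading              # per-pixel, per-channel product
    # Negative intensities are not physical, so clamp before the power.
    return np.clip(linear, 0.0, None) ** (1.0 / gamma)
```

Because every step is a matrix product, an element-wise product or a power, the whole renderer is differentiable almost everywhere, which is what makes it usable for self-supervision.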
Our inverse rendering network (see Fig. 2) is an image-to-image network that regresses albedo and normal maps from a single image and uses these to estimate lighting. We describe these inference components in more detail here.
4.1 Trainable encoder-decoder
We implement a deep fully-convolutional neural network with skip connections, similar to the hourglass architecture. We use a single encoder and separate deconvolution decoders for albedo and normal prediction. Albedo maps have 3 channel RGB output; normal maps have two channels for the surface gradient, which is converted to a normal map as described above. Both the convolutional and the deconvolutional subnets contain 15 layers and the activation functions are ReLUs. The Adam optimiser is used in training.
4.2 Implicit lighting prediction
In order to estimate illumination parameters, one option would be to use a fully connected branch from the output of our decoder and train our network to predict it directly. However, fully connected layers require very large numbers of parameters and, in fact, lighting can be inferred from the input image and estimated albedo and normal maps, making its explicit prediction redundant. An additional advantage is that the architecture remains fully convolutional and so can process images of any size at inference time.
Consider an input image comprising $K$ pixels. We invert the nonlinear gamma and stack the linearised RGB values to form the matrix $\mathbf{I} \in \mathbb{R}^{3 \times K}$. We similarly stack the estimated albedo map to form $\mathbf{A} \in \mathbb{R}^{3 \times K}$, the estimated surface normals to form $\mathbf{N} \in \mathbb{R}^{3 \times K}$ and define $\mathbf{B}(\mathbf{N}) \in \mathbb{R}^{9 \times K}$ by applying (4) to each normal vector. We can now rewrite (3) for the whole image as:

$$\mathbf{I} = \mathbf{A} \odot \mathbf{L}^T \mathbf{B}(\mathbf{N}), \tag{5}$$

where $\odot$ is the Hadamard (element-wise) product. We can now solve for the spherical harmonic illumination coefficients in a least squares sense, using the whole image. This can be done using any method, so long as the computation is differentiable such that losses dependent on the estimated illumination can have their gradients backpropagated into the inverse rendering network. For example, the solution using the pseudoinverse is given by: $\mathbf{L}^T = (\mathbf{I} \oslash \mathbf{A}) \, \mathbf{B}(\mathbf{N})^+$, where $\oslash$ denotes element-wise division and $\mathbf{B}(\mathbf{N})^+$ is the pseudoinverse of $\mathbf{B}(\mathbf{N})$. Fig. 2 shows the inferred shading, $\mathbf{L}^T \mathbf{B}(\mathbf{N})$, and a visualisation of the estimated lighting.
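The least squares step can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions: the image is already linearised, the basis matrix is precomputed, and the `eps` guard against division by zero albedo is an addition for numerical safety rather than something described in the text.

```python
import numpy as np

def estimate_lighting(I, A, B, eps=1e-6):
    """Least-squares SH lighting from a linear image, albedo and basis.

    I: 3 x K linearised RGB, A: 3 x K albedo, B: 9 x K basis matrix.
    Returns L with shape 9 x 3.
    """
    shading = I / np.maximum(A, eps)         # element-wise division, 3 x K
    L_T = shading @ np.linalg.pinv(B)        # 3 x 9, via the pseudoinverse of B
    return L_T.T

# Usage: synthesise an image from known lighting and recover it.
rng = np.random.default_rng(0)
B_demo = rng.normal(size=(9, 50))
L_demo = rng.normal(size=(9, 3))
A_demo = rng.uniform(0.2, 1.0, size=(3, 50))
I_demo = A_demo * (L_demo.T @ B_demo)
L_hat = estimate_lighting(I_demo, A_demo, B_demo)
```

In a training framework the same computation would be expressed with differentiable tensor operations so that gradients reach the albedo and normal decoders.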
As shown in Fig. 2, we use a data term (the error between predicted and observed appearance) for self-supervision. However, inverse rendering using only a data term is ill-posed (an infinite set of solutions can yield zero data error) and so we use additional sources of supervision, all of which are essential for good performance. We describe all sources of supervision in this section.
5.1 Self-supervision via differentiable rendering
Given estimated normal and albedo maps and spherical harmonic illumination coefficients, we compute a predicted image using (3). This local illumination model is straightforward to differentiate. Self-supervision is provided by the error between the predicted, $\mathbf{i}_{\text{pred}}$, and observed, $\mathbf{i}_{\text{obs}}$, intensities. We compute this error in LAB space as this provides perceptually more convincing results:

$$\ell_{\text{appearance}} = \lVert \text{LAB}(\mathbf{i}_{\text{pred}}) - \text{LAB}(\mathbf{i}_{\text{obs}}) \rVert^2, \tag{6}$$

where LAB performs the colour space transformation.
5.2 Natural illumination model and prior
The spherical harmonic lighting model in (3) enables efficient representation of complex lighting. However, even within this low dimensional space, not all possible illumination environments are natural. The space of natural illumination possesses statistical regularities [9]. We can use this knowledge to constrain the space of possible illumination and enforce a prior on the illumination parameters. To do this, we build a statistical illumination model (see Fig. 3) using a dataset of 79 HDR spherical panoramic images taken outdoors. For each environment, we compute the spherical harmonic coefficients, $\mathbf{L}_i \in \mathbb{R}^{9 \times 3}$. Since the overall intensity scale is arbitrary, we also normalise each lighting matrix to unit norm, $\lVert \mathbf{L}_i \rVert_{\text{Fro}} = 1$, to avoid ambiguity with the albedo scale. Our illumination model in (5) uses surface normals in a viewer-centred coordinate system. So, the dataset must be augmented to account for possible rotations of the environment relative to the viewer. Since the rotation around the vertical axis is arbitrary, we rotate the lighting coefficients about this axis through a full revolution in fixed angular increments. In addition, to account for camera pitch or roll, we augment with rotations about the other two axes within a limited range. This gives us a dataset of 139,356 environments, i.e. 1,764 rotated copies of each of the 79 originals. We then build a statistical model, such that any illumination can be approximated as:

$$\operatorname{vec}(\mathbf{L}) = \mathbf{P} \operatorname{diag}(\sigma_1, \dots, \sigma_D) \, \mathbf{l} + \bar{\mathbf{L}}, \tag{7}$$

where $\mathbf{P} \in \mathbb{R}^{27 \times D}$ contains the principal components, $\sigma_1^2, \dots, \sigma_D^2$ are the corresponding eigenvalues, $\bar{\mathbf{L}} \in \mathbb{R}^{27}$ is the mean lighting coefficients and $\mathbf{l} \in \mathbb{R}^D$ is the parametric representation of $\mathbf{L}$. The model retains a fixed number of dimensions, $D$. Under the assumption that the original data is Gaussian distributed, the parameters are normally distributed: $\mathbf{l} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_D)$. When we compute lighting, we do so within the subspace of the statistical model. In addition, we introduce a prior loss on the estimated lighting vector: $\ell_{\text{prior}} = \lVert \mathbf{l} \rVert^2$.
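A PCA model of this kind can be sketched as below. The sketch assumes each lighting environment has been flattened to a 27-vector and that the prior is the squared norm of the whitened parameters (the negative log of a standard normal density up to constants); the function names and the use of an SVD are illustrative choices, not a description of the authors' code.

```python
import numpy as np

def fit_lighting_model(X, D):
    """PCA on flattened lighting vectors X (M x 27); keeps D components."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:D].T                                # 27 x D principal components
    sigma = s[:D] / np.sqrt(X.shape[0] - 1)     # standard deviation per component
    return mean, P, sigma

def decode(l, mean, P, sigma):
    """Map whitened parameters l (D,) to a flattened lighting vector (27,)."""
    return P @ (sigma * l) + mean

def encode(L_vec, mean, P, sigma):
    """Project a flattened lighting vector onto the model subspace."""
    return (P.T @ (L_vec - mean)) / sigma

def prior_loss(l):
    """Squared norm of the whitened parameters."""
    return float(np.sum(l ** 2))
```

Dividing by `sigma` in `encode` whitens the parameters, so each has unit variance over the training set and a single isotropic penalty treats all components consistently.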
5.3 Multiview stereo supervision
A pipeline comprising structure-from-motion followed by multiview stereo (which we refer to simply as MVS) enables both camera poses and dense 3D scene models to be estimated from large, uncontrolled image sets. Of particular importance to us, these pipelines are relatively insensitive to illumination variation between images in the dataset since they rely on matching local image features that are themselves illumination insensitive. We emphasise that MVS is run offline prior to training and that at inference time our network uses only single images of novel scenes. We use the MVS output for three sources of supervision.
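Cross-projection. We use the MVS poses and depth maps to establish correspondence between views, allowing us to cross-project quantities between overlapping images. Given an estimated depth map, $z(\mathbf{x})$, in view $i$ and camera matrices for views $i$ and $j$, a pixel $\mathbf{x}$ can be cross-projected to location $\mathbf{x}'$ in view $j$ via:

$$\lambda \begin{bmatrix} \mathbf{x}' \\ 1 \end{bmatrix} = \mathbf{P}_j \begin{bmatrix} \mathbf{R}_i^T & -\mathbf{R}_i^T \mathbf{t}_i \\ \mathbf{0}^T & 1 \end{bmatrix} \begin{bmatrix} z(\mathbf{x}) \, \mathbf{K}_i^{-1} \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} \\ 1 \end{bmatrix}. \tag{8}$$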
In practice, we perform the cross-projection in the reverse direction, computing non-integer pixel locations in the source view for each pixel in the target view. We can then use bilinear interpolation of the source image to compute quantities for each pixel in the target image. Since the MVS depth maps contain holes, any pixels that cross project to a missing pixel are not assigned a value. Similarly, any target pixels that project outside the image bounds of the source are not assigned a value.
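The reverse warping just described can be sketched in NumPy as follows. This is an illustrative sketch only: it assumes pinhole cameras with world-to-camera extrinsics `R`, `t`, marks missing depth with NaN, and uses a hypothetical function name. Holes in the source quantity itself are not handled here.

```python
import numpy as np

def cross_project(src, z_tgt, K_tgt, R_tgt, t_tgt, K_src, R_src, t_src, tol=1e-6):
    """Warp src (Hs x Ws x C) into the target view using the target depth map.

    Target pixels with missing depth (NaN), or that project behind or
    outside the source image, are left as NaN.
    """
    h, w = z_tgt.shape
    hs, ws = src.shape[:2]
    out = np.full((h * w, src.shape[2]), np.nan)
    z_flat = z_tgt.ravel()
    idx = np.flatnonzero(np.isfinite(z_flat) & (z_flat > 0))
    ys_t, xs_t = np.unravel_index(idx, (h, w))
    pix = np.stack([xs_t, ys_t, np.ones_like(xs_t)]).astype(np.float64)
    cam = (np.linalg.inv(K_tgt) @ pix) * z_flat[idx]     # target camera coordinates
    world = R_tgt.T @ (cam - t_tgt[:, None])             # world coordinates
    proj = K_src @ (R_src @ world + t_src[:, None])      # homogeneous source pixels
    front = proj[2] > 1e-9
    idx, proj = idx[front], proj[:, front]
    u, v = proj[0] / proj[2], proj[1] / proj[2]
    inside = (u >= -tol) & (u <= ws - 1 + tol) & (v >= -tol) & (v <= hs - 1 + tol)
    idx = idx[inside]
    u = np.clip(u[inside], 0, ws - 1)
    v = np.clip(v[inside], 0, hs - 1)
    # Bilinear interpolation of the source at non-integer locations.
    u0 = np.clip(np.floor(u).astype(int), 0, ws - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, hs - 2)
    au, av = (u - u0)[:, None], (v - v0)[:, None]
    out[idx] = ((1 - au) * (1 - av) * src[v0, u0]
                + au * (1 - av) * src[v0, u0 + 1]
                + (1 - au) * av * src[v0 + 1, u0]
                + au * av * src[v0 + 1, u0 + 1])
    return out.reshape(h, w, -1)
```

Warping in this direction guarantees that every target pixel receives at most one value, which avoids the gaps and collisions that forward splatting would produce.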
Direct normal map supervision. The per-view depth maps provided by MVS can be used to estimate normal maps, although these are typically coarse and incomplete (see Fig. 1, column 5). We compute guide normal maps from the depth maps and intrinsic camera parameters estimated by MVS using (2). The guide normal maps are used for direct supervision by computing a loss that measures the angular difference between the guide, $\mathbf{n}_{\text{guide}}$, and estimated, $\mathbf{n}_{\text{est}}$, surface normals: $\ell_{\text{NM}} = \arccos(\mathbf{n}_{\text{guide}} \cdot \mathbf{n}_{\text{est}})$.
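A minimal sketch of such an angular loss is given below, assuming both normal maps are unit length and that a boolean mask marks pixels where the MVS guide normal is defined. Averaging over valid pixels and the function name are assumptions made for the example.

```python
import numpy as np

def normal_angular_loss(n_guide, n_est, mask):
    """Mean angle in radians between guide and estimated unit normals.

    n_guide, n_est: H x W x 3 unit normals; mask: H x W boolean validity.
    """
    cos = np.sum(n_guide * n_est, axis=-1)
    # Clip so rounding error cannot push the cosine outside arccos's domain.
    cos = np.clip(cos, -1.0, 1.0)
    return float(np.mean(np.arccos(cos)[mask]))
```

When this loss is differentiated, the cosine is often clipped slightly inside the interval because the derivative of arccos is unbounded at the end points.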
Albedo consistency loss. Diffuse albedo is an intrinsic quantity. Hence, we expect that albedo estimates of the same scene point from two overlapping images should be the same, even if the illumination varies between views. We therefore automatically select pairs of images that overlap (defined as having similar camera locations and similar centres of mass of their backprojected depth maps). We discard pairs that do not contain illumination variation (where cross-projected appearance is too similar). Then, we train our network in a Siamese fashion on these pairs and use the cross projection described above to compute an albedo consistency loss: $\ell_{\text{albedo}} = \lVert \text{LAB}(\mathbf{A}_i) - \text{LAB}(c \, \mathbf{A}_j) \rVert^2$, where $\mathbf{A}_i$ and $\mathbf{A}_j$ are the estimated albedo maps in the $i$th and $j$th images respectively, and $\mathbf{A}_j$ has been cross projected to view $i$ for the pixels in which image $i$ has a defined depth value. The scalar $c$ is the value that minimises the loss and accounts for the fact that there is an overall scale ambiguity between images. Again, we compute the albedo consistency loss in LAB space. The albedo consistency loss is visualised by the blue arrows in Fig. 4.
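The role of the optimal scale can be illustrated with a simplified version of this loss. The sketch below works directly in RGB rather than LAB, in which case the minimising scale has a closed form; with a nonlinear colour transform the scale would instead have to be found numerically. The function name and the mean squared error form are assumptions.

```python
import numpy as np

def albedo_consistency(A_i, A_j_proj, valid):
    """Scale-invariant squared error between two albedo maps (RGB simplification).

    A_i: H x W x 3 albedo of view i; A_j_proj: albedo of view j cross projected
    into view i; valid: H x W boolean mask of pixels with a projected value.
    """
    a = A_i[valid]
    b = A_j_proj[valid]
    # Least-squares optimal scale: argmin_c ||a - c b||^2.
    c = np.sum(a * b) / np.sum(b * b)
    loss = np.mean((a - c * b) ** 2)
    return float(loss), float(c)
```

Solving for the scale inside the loss means the network is never penalised for a global brightness difference between the two albedo predictions.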
Figure: qualitative comparison. Columns, left to right: Images, Li (R), Nestmeyer (R), Ours (R), Li (S), Nestmeyer (S), Ours (S).
Cross-rendering loss. For improved stability, we also use a mixed cross-projection/appearance loss, $\ell_{\text{cross-rend}}$. We use the cross-projected albedo above in conjunction with the estimated normals and illumination to render a new image and measure the appearance error in the same way as (6). This loss is visualised by the green arrows in Fig. 4.
5.4 Albedo priors
Finally, we also employ two additional prior losses on the albedo. These help resolve ambiguities between shading and albedo. First, we introduce an albedo smoothness prior, $\ell_{\text{albedo-smooth}}$. Rather than applying a uniform smoothness penalty, we apply a pixel-wise varying weighted penalty according to the chromaticities of the input image, so that stronger smoothness penalties are only enforced on neighbouring pixels with closer chromaticities. The loss itself is the L1 distance between adjacent pixels.
Second, during the self-supervised phase of training, we also introduce a pseudo supervision loss, $\ell_{\text{albedo-pseudoSup}}$, to prevent convergence to trivial solutions. After the pretraining process (see Section 6), our model produces plausible albedo predictions using MVS normals. To prevent subsequent training from diverging too far from this, we encourage albedo predictions to remain close to the pretrained albedo predictions.
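The chromaticity-weighted smoothness prior described above could be sketched as follows. The text does not specify the weighting function, so the exponential fall-off and its rate `k` are purely hypothetical choices for illustration, as is the function name.

```python
import numpy as np

def albedo_smoothness(albedo, image, k=10.0, eps=1e-6):
    """Weighted L1 smoothness of an albedo map (H x W x 3).

    Neighbouring pixels with similar input chromaticity receive a weight
    near 1; the weight decays as exp(-k * chromaticity difference).
    The weighting function and k are assumptions, not taken from the text.
    """
    chroma = image / (image.sum(axis=-1, keepdims=True) + eps)
    loss = 0.0
    for axis in (0, 1):  # vertical then horizontal neighbours
        d_albedo = np.abs(np.diff(albedo, axis=axis)).sum(axis=-1)
        d_chroma = np.linalg.norm(np.diff(chroma, axis=axis), axis=-1)
        loss += np.mean(np.exp(-k * d_chroma) * d_albedo)
    return float(loss)
```

With a weighting of this kind, an albedo edge that coincides with a change of chromaticity in the input is penalised far less than the same edge inside a region of constant chromaticity, where intensity changes are more likely to be shading.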
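We train our network to minimise a weighted sum of the losses described above: $\ell = w_1 \ell_{\text{appearance}} + w_2 \ell_{\text{NM}} + w_3 \ell_{\text{albedo}} + w_4 \ell_{\text{cross-rend}} + w_5 \ell_{\text{albedo-smooth}} + w_6 \ell_{\text{albedo-pseudoSup}} + w_7 \ell_{\text{prior}}$, where the $w_k$ are scalar weights.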
Datasets. We train using the MegaDepth dataset. This contains dense depth maps and camera calibration parameters estimated from crawled Flickr images. The pre-processed images have arbitrary shapes and orientations. For ease of training, we crop square images and resize to a fixed size. We choose our crops to maximise the number of pixels with defined depth values. Where possible, we take multiple crops from each image, achieving augmentation as well as standardisation. We create mini-batches with overlap between all pairs of images in the mini-batch and sufficient illumination variation (correlation coefficient of intensity histograms significantly different from 1). Finally, before inputting an image to our network, we detect and mask the sky region using PSPNet. This is because the albedo and normal maps are meaningless in the sky region, and including it severely degrades illumination estimation.
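The illumination variation test mentioned above can be illustrated with a small NumPy sketch. It assumes greyscale intensities in $[0, 1]$; the number of bins, the threshold `max_corr` and the function name are hypothetical, since the text only states that the correlation coefficient should differ significantly from 1.

```python
import numpy as np

def illumination_differs(img_a, img_b, bins=32, max_corr=0.9):
    """True if the intensity histograms of two images are weakly correlated.

    img_a, img_b: arrays of intensities in [0, 1]. The bin count and the
    threshold are illustrative values, not taken from the text.
    """
    hist_a, _ = np.histogram(img_a, bins=bins, range=(0.0, 1.0))
    hist_b, _ = np.histogram(img_b, bins=bins, range=(0.0, 1.0))
    corr = np.corrcoef(hist_a, hist_b)[0, 1]
    return bool(corr < max_corr)
```

A pair of overlapping images that passes this test is likely to show the same surfaces under noticeably different lighting, which is exactly the situation in which the albedo consistency loss is informative.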
Training strategy. We found that for convergence to a good solution it is important to include a pre-training phase. During this phase, the surface normals used for illumination estimation and for the appearance-based losses are the MVS normal maps. This means that the surface normal prediction decoder is only learning from the direct supervision loss, i.e. it is learning to replicate the MVS normals. After this initial phase, we switch to full self-supervision where the predicted appearance is computed entirely from estimated quantities. Note that the pre-training step does not use the pseudo albedo supervision loss.
There are no existing benchmarks for inverse rendering in the wild. So, we evaluate our method on an intrinsic image benchmark and devise our own benchmark for inverse rendering. Finally, we show a relighting application.
Table 1: Results on the IIW benchmark.
|Method||Training data||WHDR (%)|
|Nestmeyer (CNN)||IIW||19.5|
|Zhou et al.||IIW||19.9|
|Fan et al.||IIW||14.5|
|Shi et al.||ShapeNet||59.4|
|Li et al.||BigTime||20.3|
Evaluation on IIW. The standard benchmark for intrinsic image decomposition is Intrinsic Images in the Wild [6] (IIW), which consists almost exclusively of indoor scenes. Since our training regime requires large multiview image datasets, we are restricted to using scene-tagged images crawled from the web, which are usually outdoors. In addition, our illumination model is learnt on outdoor, natural environments. For these reasons, we cannot perform training or fine-tuning on indoor benchmarks. Moreover, our network is not trained specifically for the task of intrinsic image estimation and our shading predictions are limited by the fact that we use an explicit local illumination model (so cannot predict cast shadows). Nevertheless, we test our network on this benchmark directly without fine-tuning. Following a suggestion from prior work, we rescale the albedo predictions to a fixed range before evaluation. Quantitative results are shown in Tab. 1 and some qualitative visual comparisons in Fig. 5. Despite the limitations described above, we achieve the second best performance of the methods not trained on the IIW data.
Evaluation on MegaDepth. We evaluate inverse rendering using unobserved scenes from the MegaDepth dataset. We evaluate normal estimation performance directly using the MVS geometry. We evaluate albedo estimation using a state-of-the-art multiview inverse rendering algorithm. Given the output from their pipeline, we perform rasterisation to generate albedo ground truth for every input image. Note that both sources of “ground truth” here are themselves only estimates, e.g. the albedo ground truth contains ambient occlusion baked in. The colour balance of the estimated albedo is arbitrary, so we compute per-channel optimal scalings prior to computing errors. We use three metrics (MSE, LMSE and DSSIM), which are commonly used for evaluating albedo predictions. To evaluate normal predictions, we use angular errors. The quality of the illumination predictions can be inferred from the other two quantities, so we do not evaluate illumination explicitly. The quantitative evaluations are shown in Tab. 2. For depth prediction methods, we first compute the optimal scaling onto the ground truth geometry, then differentiate to compute surface normals. These methods can only be evaluated on normal prediction. Intrinsic image methods can only be evaluated on albedo prediction. We can see that our network performs best in normal prediction and also best in MSE and DSSIM. Qualitative example results can be seen in Fig. 6.
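Table 2: Results on MegaDepth. Columns 2 to 4 give albedo errors (MSE, LMSE, DSSIM); the last two columns give surface normal angular errors.
|Li et al.||-||-||-||50.6||50.4|
|Godard et al.||-||-||-||79.2||79.6|
|Nestmeyer et al.||0.0204||0.0735||0.241||-||-|
|Li et al.||0.0171||0.0637||0.208||-||-|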
Figure: relighting results. Columns, left to right: Input, Relit 1, Relit 2.
We have shown for the first time that the task of inverse rendering can be learnt from real world images in uncontrolled conditions. Our results show that “shape-from-shading” in the wild is possible, and they are far superior to those of classical methods. It is interesting to ponder how this feat is achieved. We believe it is possible because of the large range of cues that the deep network can exploit, for example shading, texture, ambient occlusion and perhaps even high level semantic concepts learnt from the diverse data. For example, once a region is recognised as a “window”, the possible shape and configuration is much restricted. Recognising a scene as a man-made building suggests the presence of many parallel and orthogonal planes. These sorts of cues would be extremely difficult to exploit in hand-crafted solutions.
There are many promising ways in which this work can be extended. First, our modelling assumptions could be relaxed, for example using more general reflectance models and estimating global illumination effects such as shadowing. Second, our network could be combined with a depth prediction network. Either the two networks could be applied independently and then the depth and normal maps merged, or a unified network could be trained in which the normals computed from the depth map are used to compute the losses we use in this paper. Third, our network could benefit from losses used in training intrinsic image decomposition networks. For example, if we added the time-lapse dataset of [33] to our training, we could incorporate their reflectance consistency loss to improve our albedo map estimates. Our code, trained model and inverse rendering benchmark data are available at URL removed for review.
-  M. Aittala, T. Aila, and J. Lehtinen. Reflectance modeling by neural texture synthesis. ACM Transactions on Graphics (TOG), 35(4):65, 2016.
-  O. Aldrian and W. Smith. Inverse rendering of faces with a 3d morphable model. IEEE transactions on pattern analysis and machine intelligence, 35(5):1080–1093, 2013.
-  N. Alldrin, T. Zickler, and D. Kriegman. Photometric stereo with non-parametric and spatially-varying reflectance. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, June 2008.
-  J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. TPAMI, 2015.
-  A. S. Baslamisli, H.-A. Le, and T. Gevers. Cnn based learning using reflection and retinex models for intrinsic image decomposition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Trans. on Graphics (SIGGRAPH), 33(4), 2014.
-  S. Bi, N. K. Kalantari, and R. Ramamoorthi. Deep Hybrid Real and Synthetic Training for Intrinsic Decomposition. In W. Jakob and T. Hachisuka, editors, Eurographics Symposium on Rendering - Experimental Ideas & Implementations. The Eurographics Association, 2018.
-  W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In Advances in Neural Information Processing Systems, pages 730–738, 2016.
-  R. O. Dror, T. K. Leung, E. H. Adelson, and A. S. Willsky. Statistics of real-world illumination. In Proc. CVPR, 2001.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
-  D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
-  Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf. Revisiting deep intrinsic image decompositions. In Proceedings of The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8944–8952, 2018.
-  H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.
-  Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell., 32(8):1362–1376, Aug. 2010.
-  R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
-  C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
-  M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz. Multi-view stereo for community photo collections. 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, 2007.
-  D. B. Goldman, B. Curless, A. Hertzmann, and S. M. Seitz. Shape and spatially-varying brdfs from photometric stereo. IEEE Trans. Pattern Analysis and Machine Intelligence, 32(6):1060–1071, 2010.
-  T. Haber, C. Fuchs, P. Bekaer, H. P. Seidel, M. Goesele, and H. P. A. Lensch. Relighting objects from image collections. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 627–634, June 2009.
-  G. Han, X. Xie, J. Lai, and W.-S. Zheng. Learning an intrinsic image decomposer using synthesized rgb-d dataset. IEEE Signal Processing Letters, 25(6):753–757, 2018.
-  M. Janner, J. Wu, T. D. Kulkarni, I. Yildirim, and J. Tenenbaum. Self-supervised intrinsic image decomposition. In Advances in Neural Information Processing Systems, pages 5936–5946, 2017.
-  J. Jeon, S. Cho, X. Tong, and S. Lee. Intrinsic image decomposition using structure-texture separation and surface normals. In European Conference on Computer Vision, pages 218–233. Springer, 2014.
-  M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang. SurfaceNet: an end-to-end 3D neural network for multiview stereopsis. arXiv preprint arXiv:1708.01749, 2017.
-  M. Kazhdan and H. Hoppe. Screened Poisson surface reconstruction. ACM Trans. Graph., 32(3):29:1–29:13, July 2013.
-  A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. CoRR, vol. abs/1703.04309, 2017.
-  K. Kim, A. Torii, and M. Okutomi. Multi-view inverse rendering under arbitrary illumination and albedo. In European Conference on Computer Vision, pages 750–767. Springer, 2016.
-  T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015.
-  I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
-  F. Langguth. Photometric stereo for outdoor webcams. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR ’12, pages 262–269, Washington, DC, USA, 2012. IEEE Computer Society.
-  L. Lettry, K. Vanhoey, and L. Van Gool. Darn: a deep adversarial residual network for intrinsic image decomposition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1359–1367. IEEE, 2018.
-  X. Li, Y. Dong, P. Peers, and X. Tong. Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Transactions on Graphics (TOG), 36(4):45, 2017.
-  Z. Li and N. Snavely. Cgintrinsics: Better intrinsic image decomposition through physically-based rendering. In European Conference on Computer Vision (ECCV), 2018.
-  Z. Li and N. Snavely. Learning intrinsic image decomposition from watching the world. In Computer Vision and Pattern Recognition (CVPR), 2018.
-  Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Computer Vision and Pattern Recognition (CVPR), 2018.
-  F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5162–5170, 2015.
-  W.-C. Ma, H. Chu, B. Zhou, R. Urtasun, and A. Torralba. Single image intrinsic decomposition without a single intrinsic image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–217, 2018.
-  T. Narihira, M. Maire, and S. X. Yu. Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 2992–3000, 2015.
-  T. Nestmeyer and P. V. Gehler. Reflectance adaptive filtering improves intrinsic image estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, page 4, 2017.
-  A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In Proc. ECCV, pages 483–499, 2016.
-  S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs. SfSNet: Learning shape, reflectance and illuminance of faces ‘in the wild’. arXiv preprint arXiv:1712.01261, 2017.
-  E. Shelhamer, J. T. Barron, and T. Darrell. Scene intrinsics and depth from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 37–44, 2015.
-  J. Shi, Y. Dong, H. Su, and S. X. Yu. Learning non-Lambertian object intrinsics across ShapeNet categories. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5844–5853. IEEE, 2017.
-  Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5444–5453. IEEE, 2017.
-  A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. Mofa: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In The IEEE International Conference on Computer Vision (ICCV), volume 2, page 5, 2017.
-  S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, volume 1, page 3, 2017.
-  B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In IEEE Conference on computer vision and pattern recognition (CVPR), volume 5, page 6, 2017.
-  S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
-  C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2022–2030, 2018.
-  P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800–2809, 2015.
-  D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In Proceedings of CVPR, volume 1, 2017.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.
-  C. Zheng, T.-J. Cham, and J. Cai. T2net: Synthetic-to-realistic translation for solving single-image depth estimation tasks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 767–783, 2018.
-  T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, volume 2, page 7, 2017.
-  T. Zhou, P. Krahenbuhl, and A. A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3469–3477, 2015.