Learning to Separate Multiple Illuminants in a Single Image
We present a method to separate a single image captured under two illuminants with different spectra into the two images corresponding to the appearance of the scene under each individual illuminant. We do this by training a deep neural network to predict the per-pixel reflectance chromaticity of the scene, which we use in conjunction with a previous flash/no-flash image-based separation algorithm to produce the final two output images. We design our reflectance chromaticity network and loss functions by incorporating intuitions from the physics of image formation. We show that this leads to significantly better performance than other single-image techniques and even approaches the quality of the two-image separation method.
Natural environments are often lit by multiple light sources with different illuminant spectra. Depending on scene geometry and material properties, each of these lights causes different light transport effects like color casts, shading, shadows, specularities, etc. An image of the scene combines the effects from the different lights present, and is a superposition of the images that would have been captured under each individual light. We seek to invert this superposition, i.e., separate a single image observed under two light sources, with different spectra, into two images, each corresponding to the appearance of the scene under one light source alone. Such a decomposition can give users the ability to edit and relight photographs, as well as provide information useful for photometric analysis.
However, the appearance of a surface depends not only on the properties of the light sources, but also on its spatially-varying geometry and material properties. When all of these quantities are unknown, disentangling them is a significantly ill-posed problem. Thus, past efforts to achieve such separation have relied heavily on extensive manual annotation [8, 7, 9] or access to calibrated scene and lighting information [12, 11]. More recently, Hui et al. demonstrate that the lighting separation problem can be reliably solved if one additionally knows the reflectance chromaticity of all surface points, which they recover by capturing a second image of the same scene under flash lighting. Given that the flash image is used in their processing pipeline only for estimating the reflectance chromaticity, could we computationally estimate the reflectance chromaticity from a single image, thereby avoiding the need to capture a flash photograph altogether? This would greatly enhance the applicability of the method, especially in scenarios where it is challenging to sufficiently illuminate every pixel with the flash (e.g., when the flash is not strong enough, the scene is large, or the ambient light sources are too strong).
[Figure 1: (a) Input image; (b) Output separated images]
Our work is also motivated by the success of deep convolutional neural networks in solving closely related problems like intrinsic decompositions [30, 41] and reflectance estimation [39, 34, 32]; hence, we propose training a deep convolutional neural network to perform this separation. However, we find that standard architectures, trained only with respect to the quality of the final separated images, are unable to learn to perform the separation effectively. Therefore, we guide the design of our network using a physics-based analysis of the task, so that it matches the expected sequence of inference steps and intermediate outputs: reflectance chromaticities, shading chromaticities, separated shading maps, and final separated images. In addition to ensuring that our architecture has the ability to express these required computations, this decomposition also allows us to provide supervision to intermediate layers in our network, which proves crucial to successful training.
We train our network on two existing datasets: the synthetic database of Li et al. , and the set of real flash/no-flash pairs collected by Aksoy et al.  — using a variant of Hui et al.’s algorithm  to compute ground-truth values. Once trained, we find that our approach is able to successfully solve this ill-posed problem and produce high-quality lighting decompositions that, as can be seen in Figure 1, capture complex shading and shadows. In fact, our network is able to match, and in specific instances outperform, the quality of results from Hui et al.’s two-image method , despite needing only a single image as input.
2 Related Work
Estimating illumination and scene geometry from a single image is a highly ill-posed problem. Previous work has focused on specific subsets of this problem; we discuss previous work on illumination analysis as well as prior attempts at intrinsic image decomposition that aim to jointly estimate the illumination and surface reflectance.
Estimating the ambient illumination from a single photograph has been a long-standing goal in computer vision and computer graphics. The majority of past techniques have been extensively studied in the color constancy literature, which addresses the problem of removing the color casts of ambient illumination. One popular solution is to model the scene with a single dominant light source [15, 14, 17]. To deal with mixtures of lights in the scene, previous works [13, 19, 35] typically characterize each local region with a different but single light source. However, these approaches do not generalize well to scenes where multiple light sources mix smoothly. To address this, Boyadzhiev et al. utilize user scribbles to indicate color attributes of the scene such as white surfaces and constant lighting regions. Hsu et al. propose a method to address mixtures of two light sources in the scene; however, they require precise knowledge of the color of each illuminant. Prinet et al. resolve the color chromaticity of two light sources by utilizing the consistency of the reflectance of the scene in a sequence of images. Sunkavalli et al. demonstrate this (and image separation) for time-lapse sequences of outdoor scenes.
In parallel, many techniques have been developed to explicitly model the illumination of the scene, rather than removing the color of the illuminants. Lalonde et al. propose a parametric model to characterize the sky and sun for outdoor photographs. Hold-Geoffroy et al. extend the idea to model outdoor illumination by incorporating a deep neural network. Gardner et al. utilize a data-driven approach to represent indoor illumination from a single LDR photograph. In contrast, our method does not model the illumination in any particular parametric form, but directly regresses the single-illuminant images.
Intrinsic image decomposition.
Intrinsic image decomposition methods seek to separate a single image into a product of reflectance and illumination layers. This problem is commonly solved by assuming that the reflectance of the scene is piece-wise constant while the illumination shading varies smoothly. Several approaches build on this idea by further imposing priors on non-local reflectance [40, 36, 5], or on the consistency of reflectance across image sequences taken with a static camera [20, 28, 21]. A common assumption in intrinsic image methods is that the scene is lit by a single dominant illuminant. This does not generalize to real-world scenes that contain visually complex objects and are illuminated by a mixture of multiple light sources. Recently, many techniques [30, 31] have used deep neural networks trained on large amounts of data to improve the conditioning of the problem. While effective, these techniques also focus on scenes illuminated by a single light source. Barron et al. [2, 3] address this by incorporating a global lighting model to characterize spatially-varying illumination. While this lighting model works well for single objects, it is unable to capture high-frequency spatial information, such as the shadows that are often present in real scenes. In comparison, our technique is well-suited for scenes with a mixture of multiple light sources and works well for scenes with complex geometry. In addition, as opposed to predicting the reflectance of the scene, our method only needs to predict its chromaticity, which is an easier problem to solve.
3 Problem Statement
Our objective is to take as input, a single photograph of a scene lit by a mixture of two illuminants, and estimate the images lit by each single light source. This is a severely ill-posed problem and we propose solving it using deep neural networks. In this section, we set up the image formation model and describe the physical priors we impose to supervise the intermediate results produced by the network.
3.1 Problem setup and image formation
We adopt the image formation model from Hui et al. by assuming that the scene is Lambertian and is imaged by a three-channel color camera. However, instead of modeling infinite-dimensional spectra using subspaces, we assume that the camera color response is narrow-band, allowing us to characterize both the light source and albedo in RGB. That is, the intensity observed at a pixel in each color channel of a single photograph is the three-color albedo at that pixel multiplied by the sum, over the light sources, of each source's shading term times its light chromaticity.
In our work, we focus on scenes that are lit by two light sources, each described by a three-channel light chromaticity. Similar to Hui et al., we assume that the light source chromaticities are unique, i.e., no two sources share the same chromaticity. The shading term of a source is the shading observed at the pixel due to that light source multiplied by the light-source brightness. Because sources with the same color are clustered together, the shading term has a complex dependence on the lighting geometry and does not have a simple analytical form. Our goal is to compute the separated images corresponding to each light source, i.e., the contribution of each individual source to the observed image.
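In symbols, the model described above can be sketched as follows. The notation is assumed here for illustration: $I^c(p)$ is the intensity at pixel $p$ in color channel $c$, $R^c(p)$ the albedo, $\ell_i$ the chromaticity of the $i$-th light, $\lambda_i(p)$ its shading term (including source brightness), and $N$ the number of sources ($N = 2$ in this work). Equation (1) is the image formation model and (2) is the separated image we seek for source $k$.

```latex
\begin{align}
I^c(p) &= R^c(p) \sum_{i=1}^{N} \lambda_i(p)\, \ell_i^c,
  \qquad c \in \{r, g, b\}, \tag{1} \\
I_k^c(p) &= R^c(p)\, \lambda_k(p)\, \ell_k^c,
  \qquad k = 1, \dots, N. \tag{2}
\end{align}
```

Under this form the input is the superposition $I = \sum_k I_k$, which is the property the separation has to invert.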
To solve this, Hui et al. capture an additional image under flash illumination, which is used to compute the reflectance chromaticity of the scene; from it, they are able to isolate the reflectance from the illumination shading. Given the reflectance-invariant space, they solve for the lighting colors of each light source as well as the per-pixel contribution of each illuminant. We provide a quick summary of the key steps of their computational pipeline, adapted to the RGB color model.
Step 1 — Flash to reflectance chromaticity.
The pure flash photograph enables us to estimate the reflectance chromaticity, i.e., the albedo at each pixel normalized across its color channels.
Step 2 — Estimate shading chromaticity.
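Using the reflectance chromaticity, we next derive the shading chromaticity: the input image is divided, per pixel and per channel, by the reflectance chromaticity, and the result is normalized across color channels. The shading chromaticity at a pixel is then a mixture of the light chromaticities, weighted by the relative shading term of each source, i.e., that source's shading as a fraction of the total shading at the pixel. As indicated by Hui et al., the shading chromaticity is key for estimating the relative shading from the illumination shadings, from which we are able to separate the images with respect to the illuminant colors.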
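Steps 1 and 2 can be written out as a short derivation. This is a sketch under assumed notation: $I^c(p)$ is the observed intensity, $R^c(p)$ the albedo, $\lambda_i(p)$ the shading term of source $i$, and $\ell_i$ its chromaticity, with chromaticities assumed to be normalized so that their channels sum to one. The numbering follows the references to (3), (4) and (5) in the text.

```latex
\begin{align}
\alpha^c(p) &= \frac{R^c(p)}{\sum_{c'} R^{c'}(p)}, \tag{3} \\
\frac{I^c(p)}{\alpha^c(p)} &= \Big(\sum_{c'} R^{c'}(p)\Big) \sum_{i} \lambda_i(p)\, \ell_i^c, \notag \\
\beta^c(p) &= \frac{I^c(p)/\alpha^c(p)}{\sum_{c'} I^{c'}(p)/\alpha^{c'}(p)}
            = \sum_{i} z_i(p)\, \ell_i^c, \tag{4} \\
z_i(p) &= \frac{\lambda_i(p)}{\sum_{j} \lambda_j(p)}. \tag{5}
\end{align}
```

The second equality in (4) uses $\sum_c \ell_i^c = 1$: dividing by $\alpha$ cancels the albedo up to a per-pixel scalar, and the normalization removes that scalar, leaving a convex combination of the light chromaticities.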
Step 3 — Estimate relative shading.
From the shading chromaticity, we obtain the illuminant shadings for each light source.
We can then compute the separated images from the input image and the illuminant shadings on a per-pixel basis.
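The three steps can be checked numerically on a synthetic Lambertian scene. The following is a minimal NumPy sketch, not the algorithm of Hui et al.: it assumes the two light chromaticities `l1` and `l2` are already known (their method estimates them), takes the illuminant shading of source $k$ to be $z_k \ell_k$, and recovers each image as the input scaled by that source's share of the total shading. All function and variable names are hypothetical.

```python
import numpy as np

def chromaticity(x, eps=1e-8):
    """Normalize the color axis (last axis) so the channels sum to one."""
    return x / (x.sum(axis=-1, keepdims=True) + eps)

def separate(image, alpha, l1, l2, eps=1e-8):
    """Split `image` (H, W, 3) into two single-illuminant images.

    alpha  : reflectance chromaticity, shape (H, W, 3)
    l1, l2 : light chromaticities, shape (3,), assumed known here
    """
    # Step 2: shading chromaticity, a per-pixel mix of l1 and l2.
    beta = chromaticity(image / (alpha + eps))
    # beta = z * l1 + (1 - z) * l2, so project onto the l1 - l2 direction.
    d = l1 - l2
    z = np.clip(((beta - l2) @ d) / (d @ d), 0.0, 1.0)[..., None]
    # Step 3: illuminant shadings, then per-pixel separation.
    s1, s2 = z * l1, (1.0 - z) * l2
    total = s1 + s2 + eps
    return image * s1 / total, image * s2 / total

# Synthetic scene with two lights (all values are made up for the demo).
rng = np.random.default_rng(0)
R = rng.uniform(0.2, 1.0, size=(4, 5, 3))        # albedo
lam1 = rng.uniform(0.1, 1.0, size=(4, 5, 1))     # shading term of light 1
lam2 = rng.uniform(0.1, 1.0, size=(4, 5, 1))     # shading term of light 2
l1 = chromaticity(np.array([1.0, 0.8, 0.5]))     # warm light
l2 = chromaticity(np.array([0.5, 0.7, 1.0]))     # cool light
I = R * (lam1 * l1 + lam2 * l2)                  # observed mixed image

I1_hat, I2_hat = separate(I, chromaticity(R), l1, l2)
print(np.abs(I1_hat - R * lam1 * l1).max())      # close to zero
```

With exact reflectance chromaticity the recovered images match the per-light renderings up to numerical precision; the difficulty addressed in this paper is that the reflectance chromaticity must be predicted from the single input image.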
In this paper, we design our network by mimicking the steps in the derivation above, but each processing element is replaced with a deep network, as shown in Figure 2. In particular, we utilize three sub-networks (ChromNet, ShadingNet and SeparateNet) to estimate the reflectance chromaticity, illuminant shadings and separated images, respectively. ChromNet predicts the values of the reflectance chromaticity, defined in (3), with its input being the RGB image that we seek to separate. ShadingNet takes as input the output of ChromNet concatenated with the input RGB image, and regresses the illuminant shadings in (6). Finally, SeparateNet gathers the estimated illuminant shadings as well as the input RGB image to estimate the separated images.
3.2 Generating the training dataset
We utilize the databases of CGIntrinsics  and Flash/No-Flash  to produce (approximate) ground truth reflectance chromaticity, illuminant shadings and separated images. Figure 3 shows an example of the training data from each dataset.
The CGIntrinsics dataset consists of rendered scenes from SUNCG and provides the ground truth reflectance, from which we compute the reflectance chromaticity. We then estimate the shading chromaticity using (4).
The Flash/No-Flash dataset consists of image pairs. We estimate the reflectance chromaticity as the color chromaticity of the pure flash image, which is the difference between the flash and the no-flash photograph. We anecdotally observed that the majority of the scenes in this dataset are illuminated by only a single light source, which makes them uninteresting for our application. To resolve this, we add the flash image back to the no-flash image to create photographs illuminated by two light sources. By changing the color of the flash photograph, we can increase the amount of training data; this allows us to generate input-output pairs, where the input is a photo and the output is the reflectance chromaticity, a pair of corresponding illuminant shadings, and the separated images.
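The flash/no-flash construction above can be sketched in a few lines of NumPy. This is an illustrative reading of the description rather than the authors' data pipeline: it assumes linear images in which `flash` contains both ambient and flash light, models the change of flash color as a per-channel tint, and uses hypothetical names (`make_training_sample`, `flash_tint`).

```python
import numpy as np

def make_training_sample(flash, no_flash, flash_tint, eps=1e-6):
    """Build one two-illuminant training sample from a flash/no-flash pair.

    flash, no_flash : linear RGB images, shape (H, W, 3)
    flash_tint      : per-channel color applied to the flash layer (assumption)
    """
    # Pure flash image: what the flash alone contributes to the scene.
    pure_flash = np.clip(flash - no_flash, 0.0, None)
    # Its color chromaticity serves as (approximate) reflectance chromaticity.
    chroma = pure_flash / (pure_flash.sum(axis=-1, keepdims=True) + eps)
    # Recolor the flash layer and add it back to get a two-light photograph.
    tinted_flash = pure_flash * np.asarray(flash_tint)
    mixed = no_flash + tinted_flash
    return mixed, chroma, (no_flash, tinted_flash)

# Toy pair with made-up pixel values.
rng = np.random.default_rng(1)
ambient = rng.uniform(0.0, 0.5, size=(6, 6, 3))
flash_only = rng.uniform(0.1, 0.5, size=(6, 6, 3))
mixed, chroma, separated = make_training_sample(
    ambient + flash_only, ambient, flash_tint=(1.0, 0.6, 0.4))
```

Varying `flash_tint` yields several distinct inputs from the same pair, which is the augmentation effect the paragraph refers to.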
[Figure 3 panels: (a) Input (Top) /; (b) Illuminant; (c) Separated]
4 Learning Illuminant Separation
Now that we have the training data for the intermediate results, i.e. reflectance chromaticity, relative shadings and separated images, we detail our approach for learning the relationship between a single photo and its constituent images lit by each illuminant.
4.1 Network architecture
As shown in Figure 2, we use a deep neural network to match the computation of the separation algorithm in Section 3. Specifically, our network consists of three sub-networks that produce the reflectance chromaticity, illuminant shadings, and the separated images respectively.
We design the first sub-network to explicitly estimate the reflectance chromaticity (3) from the input color image. This essentially requires the network to solve the ill-posed problem of estimating and removing the illumination color cast given only a single photograph. We adopt an architecture similar to that of Johnson et al. to map the input image to a three-channel reflectance chromaticity map. (Footnote 1: A detailed description of the construction of each sub-network is provided in the supplemental material. We will also release our code base, training data and trained models upon acceptance.)
The second sub-network in our framework takes reflectance chromaticity estimates as inputs, and solves for the two illuminant shadings in (6). From Section 3, we expect the first part of this computation to involve deriving the shading chromaticity from the reflectance chromaticities and the original input, on a purely per-pixel basis as per (4). However, we found that computing these values explicitly leads to instability in training, likely because it involves a division. Instead, we produce a general feature map intended to encode this information (note that we do not require it to correspond exactly to the shading chromaticity values): we concatenate the input image with the estimated chromaticities, and apply two convolution layers to produce a multi-channel feature map.
Given this feature map, our second sub-network produces the two separated illuminant shading maps. Since this requires global reasoning, we use an architecture similar to the pixel-to-pixel network of Isola et al. to incorporate a large receptive field. However, since we need to produce two output shading maps from a single input feature map, we retain their architecture for the encoder that maps the feature map to a coarse-resolution bottleneck, and include two copies of the decoder, each of which maps this coarse output to a different three-channel illuminant shading map. Both decoders in this architecture receive skip-connections from intermediate layers of the encoder.
Given the illuminant shadings and previously estimated reflectance chromaticity, the last computation step is to produce the separated images. Here again, we use a series of pixel-wise layers to express the computation in (7). Our third sub-network concatenates the two predicted shading maps and the input RGB photograph into a nine-channel input, and uses three convolution layers to produce a six-channel output corresponding to the two final separated RGB images.
Note that the output of our first sub-network, the reflectance chromaticity, is sufficient to perform separation using the method of Hui et al. However, training this sub-network based directly on the quality of reflectance chromaticity estimates proves insufficient, because the final separated image quality can degrade differently with different kinds of errors in the chromaticity estimates. Thus, our goal is instead to train the reflectance chromaticity estimation sub-network to be optimal with respect to final separation quality. Unfortunately, the separation algorithm of Hui et al. has non-differentiable processing steps, as well as other computations that produce unstable gradients. Hence, we use two additional sub-networks to approximate the processing in Hui et al.’s algorithm. However, once trained, we find it is best to use the reflectance chromaticity estimates directly with the exact algorithm of Hui et al., rather than the output of these sub-networks.
4.2 Loss functions
For the reflectance chromaticity estimation task, we use a scale-invariant loss. We also incorporate a loss in the gradient domain, to enforce that the estimated reflectance chromaticity is piece-wise constant. Our loss function therefore combines a scale-invariant data term with a multi-scale gradient term.
The data term compares the predicted chromaticity, multiplied by a global scale factor, with the ground truth provided; the scale factor compensates for the global scale difference and can be estimated via least squares. We also use a mask to disregard the loss at pixels where we do not have reliable ground truth (e.g., pixels that are close to black or pixels corresponding to the outdoor environment map in the SUNCG dataset), and normalize by the total number of valid pixels in the image. Similar to the approach of Li et al., we include a multi-scale matching term computed over a fixed number of scales, each normalized by its corresponding number of pixels not masked as invalid.
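A loss of this kind can be sketched with NumPy as follows. The exact norms and the number of scales are not recoverable from the text, so this sketch assumes a squared error for the data term, an absolute error on finite-difference gradients, and three scales obtained by striding; the function names are hypothetical.

```python
import numpy as np

def ls_scale(pred, gt, mask, eps=1e-8):
    """Global scale c minimizing sum(mask * (c * pred - gt)**2)."""
    return (mask * pred * gt).sum() / ((mask * pred * pred).sum() + eps)

def scale_invariant_loss(pred, gt, mask, eps=1e-8):
    """Masked squared error after least-squares scale compensation."""
    c = ls_scale(pred, gt, mask)
    n_valid = mask.sum() * pred.shape[-1] + eps
    return (mask * (c * pred - gt) ** 2).sum() / n_valid

def gradient_loss(pred, gt, mask, num_scales=3, eps=1e-8):
    """Masked L1 loss on image gradients, accumulated over several scales."""
    diff, m, total = ls_scale(pred, gt, mask) * pred - gt, mask, 0.0
    for _ in range(num_scales):
        # A gradient is valid only if both neighboring pixels are valid.
        gx = np.abs(diff[:, 1:] - diff[:, :-1]) * m[:, 1:] * m[:, :-1]
        gy = np.abs(diff[1:] - diff[:-1]) * m[1:] * m[:-1]
        total += (gx.sum() + gy.sum()) / (m.sum() * diff.shape[-1] + eps)
        diff, m = diff[::2, ::2], m[::2, ::2]   # next (coarser) scale
    return total

rng = np.random.default_rng(0)
gt = rng.uniform(0.1, 1.0, size=(16, 16, 3))
mask = np.ones((16, 16, 1))
mask[0, 0] = 0.0                       # pixel without reliable ground truth
pred = 2.5 * gt                        # correct up to a global scale
pred[0, 0] = 100.0                     # garbage at the masked pixel
loss_scaled = scale_invariant_loss(pred, gt, mask) + gradient_loss(pred, gt, mask)
noisy = pred + rng.normal(0.0, 0.1, size=pred.shape)
loss_noisy = scale_invariant_loss(noisy, gt, mask) + gradient_loss(noisy, gt, mask)
```

In this example a prediction that is correct up to a global scale incurs essentially no loss, and the masked pixel has no influence on either term.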
We impose a loss on both the absolute value and the gradients of the relative shadings. This encourages spatially smooth shading solutions (as is commonly done in prior intrinsic images work). However, the network outputs two potential relative shadings, and swapping these two predictions should not induce any loss. To address this, we define our loss function over both possible assignments of the two network outputs to the two ground-truth illuminant shadings defined in (6), so that it does not depend on their order.
Each pairwise term compares one illuminant shading prediction with one ground-truth illuminant shading, after applying a global scale that compensates for the illuminant brightness.
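One common way to make a two-output loss independent of output order is to evaluate both assignments and keep the cheaper one. The sketch below illustrates that idea with NumPy; the choice of an L1 norm, the minimum over assignments, and the names `pair_loss` and `permutation_invariant_loss` are assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np

def pair_loss(pred, gt, eps=1e-8):
    """Scale-compensated L1 loss on values and on finite-difference gradients."""
    c = (pred * gt).sum() / ((pred * pred).sum() + eps)   # least-squares scale
    diff = c * pred - gt
    value_term = np.abs(diff).mean()
    grad_term = (np.abs(np.diff(diff, axis=0)).mean()
                 + np.abs(np.diff(diff, axis=1)).mean())
    return value_term + grad_term

def permutation_invariant_loss(preds, gts):
    """Loss for two predictions that is unchanged when the predictions swap."""
    direct = pair_loss(preds[0], gts[0]) + pair_loss(preds[1], gts[1])
    swapped = pair_loss(preds[0], gts[1]) + pair_loss(preds[1], gts[0])
    return min(direct, swapped)

rng = np.random.default_rng(0)
s1 = rng.uniform(0.0, 1.0, size=(8, 8, 3))   # ground-truth shading, light 1
s2 = rng.uniform(0.0, 1.0, size=(8, 8, 3))   # ground-truth shading, light 2
loss_ab = permutation_invariant_loss((s1, s2), (s1, s2))
loss_ba = permutation_invariant_loss((s2, s1), (s1, s2))
```

Both orderings of a perfect prediction give the same near-zero loss, which is the behavior the paragraph asks for.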
Our loss for the two separated images is similar to our ShadingNet loss: it is again computed over both assignments of the two predictions to the ground truth.
Each pairwise term compares one separated image prediction with the ground truth for one light source, after applying a scale factor that accounts for the global intensity difference.
We now present an extensive quantitative and qualitative evaluation of our proposed method. Please refer to our supplementary material for more details and results.
5.1 Test dataset
Synthetic benchmark dataset.
To quantitatively evaluate our method, we utilize the high quality synthetic dataset of . This dataset has approximate scenes, each rendered under several different single illuminants. We first white balance each image of the same scene, and then modulate the white-balanced images with pre-selected light colors; these represent the ground truth separated images. The input images are then created by adding pairs of these separated images, each corresponding to one of the lights in the scene. We produce test samples in the dataset and use both of the ground truth of reflectance chromaticity and separated results to evaluate our method.
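The benchmark construction can be illustrated with a short NumPy sketch. It is a simplified reading of the paragraph above: gray-world normalization stands in for the unspecified white-balancing step, light colors are applied as per-channel multipliers, and the names `white_balance` and `make_test_samples` are hypothetical.

```python
import itertools
import numpy as np

def white_balance(img, eps=1e-8):
    """Gray-world white balance (an assumed stand-in for the paper's step)."""
    channel_mean = img.reshape(-1, 3).mean(axis=0)
    return img * (channel_mean.mean() / (channel_mean + eps))

def make_test_samples(renders, light_colors):
    """Combine single-illuminant renders of one scene into two-light inputs.

    renders      : list of (H, W, 3) images, one per single illuminant
    light_colors : list of RGB multipliers used to tint each light
    Returns a list of (input, ground_truth_1, ground_truth_2) tuples.
    """
    balanced = [white_balance(r) for r in renders]
    samples = []
    for i, j in itertools.combinations(range(len(balanced)), 2):
        for ci, cj in itertools.permutations(light_colors, 2):
            gt1 = balanced[i] * np.asarray(ci)   # image under light i alone
            gt2 = balanced[j] * np.asarray(cj)   # image under light j alone
            samples.append((gt1 + gt2, gt1, gt2))
    return samples

rng = np.random.default_rng(0)
renders = [rng.uniform(0.05, 1.0, size=(4, 4, 3)) for _ in range(3)]
colors = [(1.0, 0.8, 0.5), (0.5, 0.7, 1.0)]
samples = make_test_samples(renders, colors)
```

Because each input is an exact sum of its two ground-truth images, separation error can be measured directly against them.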
We also evaluate the performance of our proposed technique on real images captured for both indoor and outdoor scenes. Specifically, we utilize the dataset of indoor scenes collected by Hui et al. as well as time-lapse videos of outdoor scenes. Hui et al. capture a flash/no-flash pair of images for each scene. We take the no-flash images in the dataset as the input to the network. For the time-lapse videos, each frame serves as a test input, as shown in Figure 1 (a).
We resize our training images to . We use Adam optimizer  to train our network with . The initial learning rate is set to be for all sub-networks. We cut down the learning rate by after epochs. We then train for epochs with the reduced learning rate. We ensure that all our networks have converged with this scheme.
We characterize the performance of our approach on both the reflectance chromaticity and the separated images, using a per-image error to quantitatively measure each. To evaluate the separated results, we compute the error of each separated result against the ground truth.
We use a global scale-invariant loss because we are most interested in capturing relative variations between the two images.
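Table 1 (recovered rows):
| Method | Reflectance chromaticity error | Separated image error |
| Shen et al. | 0.0821 | 0.0791 |
| Bell et al. | 0.0785 | 0.0763 |
| Li et al. | 0.0833 | 0.0821 |
| Hsu et al. (*) | - | 0.0678 |
| Hui et al. (*) | - | 0.0101 |
(*) Use additional information as input.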
[Figure 4 panels: (a) Input; (b) SingleNet; (c) Final-Only; (d) Chrom-Only; (e) Full-Direct; (f) Full+]
5.2 Quantitative results on synthetic benchmark
We next measure performance quantitatively on the synthetic dataset for our approach, compare it to several baselines, and report the results in Table 1. We begin by quantifying the importance of supervision. We train different models for our network: with full supervision, with supervision only on the quality of the final separated images (Final-Only), and training only the first sub-network, i.e., ChromNet, with supervision only on reflectance chromaticities (Chrom-Only). Moreover, for our fully supervised model (Full), we consider using the separated images directly predicted by our full network (Full-Direct), as well as taking only the reflectance chromaticity estimates and using Hui et al.’s algorithm (which includes more complex processing) to perform separation (Full+). For the model with only chromaticity supervision, we also use Hui et al.’s algorithm to perform separation, and for the final-only supervised model (where intermediate chromaticities are not meaningful), we only consider the final output.
We find that our model trained with full supervision has the best performance in terms of the quality of final separated images. Interestingly, the Chrom-Only model is better at predicting chromaticity, but as expected, this does not translate to higher quality image outputs. The Final-Only model also yields worse separation results despite being trained with respect to their quality, highlighting the importance of intermediate supervision. Finally, we find that using our Full model in combination with Hui et al.’s algorithm yields better results than taking the direct final output of the network. Our final sub-networks (ShadingNet and SeparateNet) are therefore able to only approximate Hui et al.’s algorithm, and their main benefit in our framework is in allowing back-propagation to provide supervision for chromaticity estimation in a manner that is optimal for separation.
We also include comparisons to a network with a more traditional architecture (rather than three sub-networks) that performs direct separation (SingleNet). We use the same architecture as the encoder-decoder portion of our ShadingNet, and train it with supervision only on the final separated outputs. We find that this performs significantly worse (even than Final-Only), illustrating the utility of our physically-motivated architecture. Finally, we also include comparisons with baselines where different intrinsic image decomposition methods [36, 5, 30] are used to estimate reflectance chromaticity from a single image, and these estimates are then used for separation with Hui et al.’s algorithm. We find that these methods yield lower accuracy in both reflectance chromaticity estimation and lighting separation, likely because they, like most intrinsic image methods, assume a single light source.
Finally, we evaluate two methods that require additional information beyond a single image: ground truth light colors for Hsu et al., and a flash/no-flash pair, which provides direct access to reflectance chromaticity, for Hui et al. We produce better results than Hsu et al., but as expected, Hui et al. yields the most accurate separation thanks to direct access to chromaticity information, at the cost of capturing an additional flash image.
[Figure 5 panels: (a) Input photographs; (b) Hui et al.; (c) Ours]
[Figure 6 panels: (a) Input; (b) Hui et al.; (c) Ours]
5.3 Qualitative evaluation on real data
Figure 4 shows results on a real image for the different versions of our network (as well as for SingleNet), while Figure 5 compares our results to Hui et al.’s method when using a flash/no-flash pair. These results confirm our conclusions from Table 1: the version of our network trained with full supervision performs best, especially when used in combination with Hui et al.’s algorithm to carry out the separation from predicted chromaticities. Moreover, despite requiring only a single image as input, it comes close to matching the performance of Hui et al.’s method with a flash/no-flash pair. We show an example in Figure 6 where our method affords a distinct advantage even when an image with flash is available, because several regions in the scene are too far from the flash. This leads to artifacts in those regions for Hui et al.’s method, while our approach is able to perform a higher quality separation. The accompanying supplementary material contains additional results and comparisons for time-lapse videos and indoor scenes.
We describe a learning-based approach to separate the lighting effects of two illuminants in an image. Our method relies on a deep neural network based estimator, whose architecture is motivated by a physics-based analysis of the problem and the associated intermediate supervision. Our ablation experiments demonstrate the importance of this supervision. Crucially, we show that we are able to produce high-quality outputs that come close to the performance of previous methods that required a flash/no-flash pair, while being more practical in requiring only a single image.
Hui and Sankaranarayanan acknowledge support via the NSF CAREER grant CCF-1652569, and the NGIA grant HM0476-17-1-2000. Chakrabarti is supported by NSF Grant IIS-1820693 and a gift from Adobe.
References
[1] Y. Aksoy, C. Kim, P. Kellnhofer, S. Paris, M. Elgharib, M. Pollefeys, and W. Matusik. A dataset of flash and ambient illumination pairs from the crowd. In ECCV, 2018.
[2] J. T. Barron and J. Malik. Color constancy, intrinsic images, and shape estimation. In ECCV, 2012.
[3] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. PAMI, 37(8):1670–1687, 2015.
[4] H. Barrow and J. Tenenbaum. Recovering intrinsic scene characteristics from images. Computer Vision Systems, 1978.
[5] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. TOG, 33(4):159, 2014.
[6] N. Bonneel, B. Kovacs, S. Paris, and K. Bala. Intrinsic decompositions for image editing. In Computer Graphics Forum, 2017.
[7] N. Bonneel, K. Sunkavalli, J. Tompkin, D. Sun, S. Paris, and H. Pfister. Interactive intrinsic video editing. TOG, 33(6):197, 2014.
[8] A. Bousseau, S. Paris, and F. Durand. User-assisted intrinsic images. In TOG, volume 28, page 130, 2009.
[9] I. Boyadzhiev, K. Bala, S. Paris, and F. Durand. User-guided white balance for mixed lighting conditions. TOG, 31(6):200, 2012.
[10] I. Boyadzhiev, S. Paris, and K. Bala. User-assisted image compositing for photographic lighting. TOG, 32(4):36–1, 2013.
[11] P. Debevec. Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. In SIGGRAPH 2008 classes, page 32, 2008.
[12] P. Debevec. The Light Stages and Their Applications to Photoreal Digital Actors. In SIGGRAPH Asia, 2012.
[13] M. Ebner. Color constancy using local color shifts. In ECCV, 2004.
[14] G. Finlayson, M. Drew, and B. Funt. Enhancing von Kries adaptation via sensor transformations. 1993.
[15] G. D. Finlayson, M. S. Drew, and B. V. Funt. Diagonal transforms suffice for color constancy. In ICCV, 1993.
[16] M.-A. Gardner, K. Sunkavalli, E. Yumer, X. Shen, E. Gambaretto, C. Gagné, and J.-F. Lalonde. Learning to predict indoor illumination from a single image. TOG, 9(4), 2017.
[17] P. V. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp. Bayesian color constancy revisited. In CVPR, 2008.
[18] A. Gijsenij, T. Gevers, and J. van de Weijer. Computational color constancy: Survey and experiments. TIP, 20(9):2475–2489, 2011.
[19] A. Gijsenij, R. Lu, and T. Gevers. Color constancy for multiple light sources. TIP, 21(2):697–707, 2012.
[20] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In CVPR, 2009.
[21] D. Hauagge, S. Wehrwein, K. Bala, and N. Snavely. Photometric ambient occlusion. In CVPR, 2013.
[22] Y. Hold-Geoffroy, K. Sunkavalli, S. Hadap, E. Gambaretto, and J.-F. Lalonde. Deep outdoor illumination estimation. In CVPR, 2017.
[23] E. Hsu, T. Mertens, S. Paris, S. Avidan, and F. Durand. Light mixture estimation for spatially varying white balance. In TOG, volume 27, page 70, 2008.
[24] Z. Hui, K. Sunkavalli, S. Hadap, and A. C. Sankaranarayanan. Illuminant spectra-based source separation using flash photography. In CVPR, 2018.
[25] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[26] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[28] P.-Y. Laffont and J.-C. Bazin. Intrinsic decomposition of image sequences from local temporal variations. In ICCV, 2015.
[29] J.-F. Lalonde, S. G. Narasimhan, and A. A. Efros. What do the sun and the sky tell us about the camera? IJCV, 88(1):24–51, 2010.
[30] Z. Li and N. Snavely. CGIntrinsics: Better intrinsic image decomposition through physically-based rendering. In ECCV, 2018.
[31] Z. Li and N. Snavely. Learning intrinsic image decomposition from watching the world. In CVPR, 2018.
[32] Z. Li, K. Sunkavalli, and M. Chandraker. Materials for masses: SVBRDF acquisition with a single mobile phone image. In ECCV, 2018.
[33] V. Prinet, D. Lischinski, and M. Werman. Illuminant chromaticity from image sequences. In ICCV, 2013.
[34] K. Rematas, T. Ritschel, M. Fritz, E. Gavves, and T. Tuytelaars. Deep reflectance maps. In CVPR, 2016.
[35] C. Riess, E. Eibenberger, and E. Angelopoulou. Illuminant color estimation for real-world mixed-illuminant scenes. In ICCVW, 2011.
[36] J. Shen, X. Yang, X. Li, and Y. Jia. Intrinsic image decomposition using optimization and user scribbles. IEEE Transactions on Cybernetics, 43(2):425–436, 2013.
[37] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
[38] K. Sunkavalli, F. Romeiro, W. Matusik, T. Zickler, and H. Pfister. What do color changes reveal about an outdoor scene? In CVPR, 2008.
[39] Y. Tang, R. Salakhutdinov, and G. Hinton. Deep Lambertian networks. arXiv preprint arXiv:1206.6445, 2012.
[40] Q. Zhao, P. Tan, Q. Dai, L. Shen, E. Wu, and S. Lin. A closed-form solution to retinex with nonlocal texture constraints. PAMI, 34(7):1437–1444, 2012.
[41] T. Zhou, P. Krahenbuhl, and A. A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. In ICCV, 2015.