Deep Unsupervised Intrinsic Image Decomposition by Siamese Training

Louis Lettry
CVL, ETH Zürich
   Kenneth Vanhoey
CVL, ETH Zürich
Unity Technologies
   Luc van Gool
CVL, ETH Zürich

We harness modern intrinsic decomposition tools based on deep learning to increase their applicability to real-world use cases. Traditional techniques are derived from the Retinex theory: handmade prior assumptions constrain an optimization to yield a unique solution that is qualitatively satisfying on a limited set of examples. Modern techniques based on supervised deep learning leverage large-scale databases that are usually synthetic or sparsely annotated. Decomposition quality on images in the wild is therefore questionable.

We propose an end-to-end deep learning solution that can be trained without any ground truth supervision, as this is hard to obtain. Time-lapses form a ubiquitous source of data that (under a scene staticity assumption) captures a constant albedo under varying shading conditions. We exploit this natural relationship to train in an unsupervised siamese manner on image pairs. Yet, the trained network applies to single images at inference time.

We present a new dataset to demonstrate our siamese training on, and reach results that compete with the state of the art, despite the unsupervised nature of our training scheme. As evaluation is difficult, we rely on extensive experiments to analyze the strengths and weaknesses of our and related methods.

Figure 1: To train our single-image IID network without GT annotations, we use an unsupervised siamese training: image pairs of a time-varying scene are passed through a CNN that optimizes for similarity of the estimated albedo $A$. Additional reciprocity and information losses regularize this ambiguous problem.

1 Introduction

Taking a picture has become a trivial act achievable by anyone, anywhere, any time. Yet pictures result from a complex pipeline starting with light interacting in many ways with the observed world and casting effects like shadows and reflections. Transforming the photons reaching a sensor into an image poses technical challenges solved by complex hardware, and manufacturer-specific standard operations (e.g., demosaicing), followed by heavy post-processing (e.g., white balancing, tone-mapping or compression) [8]. Thus the image is the representation of a unique scene, acquired under unique lighting and by a unique process.

Intrinsic Image Decomposition (IID) proposes to separate albedo (i.e., a lighting and acquisition-independent representation of the scene) from the remainder, termed shading (the term shading refers to light-induced effects, yet we, abusively, also include acquisition-induced effects like tone-mapping in this term). Albedo can be thought of as the base color (or reflectance) of what constitutes the acquired scene. Many applications would benefit from having access to this disentangled representation (e.g., scene understanding, feature and shadow detection, stylization, relighting, or object insertion) [4], yet albedo cannot be observed without light.

We tackle single-image IID (SIID): formally, that is the regression of two image layers representing albedo $A$ and shading $S$ from an input image $I$, such that $I = A \odot S$ [2], where the product is per-pixel. The under-determination of this problem has traditionally been tackled either by adding constraints based on observation of a simplified Mondrian world (e.g., $A$ is piecewise constant or $S$ is smooth [23]), or by using additional input data (e.g., image depth [11]). On real-world use cases however, automatic SIID is unsolved [4]. We conjecture that the real world is too complex for human-made assumptions to be sufficient, and that learning-based solutions are promising. However, this usually requires annotated ground truth (GT) data (i.e., dense per-pixel $\{I, A, S\}$ triplets in our case), which is close to impossible to obtain. Current learning-based methods suffer from overfitting on datasets that are too small or artificial and biased [30], hindering applicability in the real world. We propose three contributions that strive towards making learning-based intrinsic decomposition work for real use cases.

First, we introduce a deep unsupervised learning method for single-image IID (section 4). Specifically, we exploit the natural constraint that arises when looking at time-lapses, in which the scene (thus the albedo) is static, but time (thus shading) varies. This source of training data has the main advantage of being cheap to acquire compared to dense GT annotations. It has been used before for IID algorithms taking a time-lapse as input at inference time [31], but never for SIID. At training time, we optimize using this constraint on unannotated sequences in a siamese training procedure, without reference to any GT. At test time, we infer $A$ and $S$ from a single image, which need not be part of a time-lapse. Despite no GT supervision, we obtain results that compete with state-of-the-art methods on several numerical benchmarks.

Second, we propose a synthetic yet realistic GT dataset that is generated using physically-based rendering (section 3). It is an extension of the recent SUNCG dataset [29], which we augment for unsupervised IID by rendering static scenes under varying lighting conditions and tone mappings. We use it to demonstrate the capabilities of our approach on clean data, as a first step towards tackling noisier real footage from webcams, which suffers from non-staticity.

Third, we propose to complete the set of evaluation metrics used to benchmark SIID algorithms (section 5). IID is an inherently complex field to quantitatively compare as no GT data exist in the wild. Several numerical benchmarks have been presented and used separately or jointly in different papers. Whether these numerical values guarantee sufficient quality for typical Computer Vision (CV) and Graphics (CG) applications that would benefit from IID is quite unclear. We therefore build a set of metrics – two of which are new – and discuss what each one measures before comparing to related works on them.

We will make all data and code publicly available.

2 Related Work

Our interests lie with IID methods (section 2.1), datasets for learning IID algorithms (section 2.2) and evaluation (which is discussed in section 5).

2.1 Intrinsic decomposition methods

Single-image IID is the process of decomposing an image $I$ into a dense albedo map $A$ and shading map $S$. A recent survey by Bonneel et al. [4] reviews and compares related work on its usability for typical CG applications. The problem is severely ambiguous: twice as many unknowns as knowns have to be estimated.

Regularizing priors.

Human-designed priors (e.g., $A$ is globally sparse and piecewise constant and $S$ is smooth [23]) have been used to regularize optimizations. Many derivatives exist [28, 13, 12, 1, 33]. Most generate decent decompositions on Mondrian-like images, but none generalize to the true complexity of photographed everyday scenes. We believe no human-devised priors can better capture the complex reality, hence we prefer learning from data as much as possible.

Other techniques take additional information as input: e.g., user input [6, 5] or depth information [11]. Just like us, some methods exploit multiple images capturing a static scene with temporal variation (timelapses) [31, 25, 21, 32]. Unlike us they use the full sequence at inference time: they are not tailored to process single-image IID. Conversely, we exploit the timelapse logic only to train our model on data with temporal variation: at inference time, our learned model is an automatic SIID method that can be applied to a single image.

Finally, other priors restrict the range of values for $S$ (e.g., grayscale and/or $S \leq 1$). The overwhelming majority of IID works make such choices [4]. Notably, [18] defines $I = A \odot S + R$, where the diffuse $S$ and the specular $R$ are colored. Similarly, we use the most general and harder variant ($S$ colored and unbounded above), which we motivate in section 4.

Learning-based solutions.

Recently, data-driven solutions have been attempted, either in the form of learning priors and feeding them into a classical optimization (e.g., leveraging a CRF) [3, 34] or an end-to-end [30, 24, 18] framework. When training, the question of what source of data to use is raised. Observing GT albedo and shading layers is hardly possible in the real world, and all datasets have strong limitations (see Sec. 2.2). Supervised end-to-end learning solutions rely on synthetic datasets [30, 24, 18] which hardly generalize to real photographs. Others rely on sparse human annotations, followed by applying the classical hand-devised priors [3, 34]: the learned model is by definition human-centric thus may be limited in generalizability. Conversely, we propose the first unsupervised end-to-end deep learning solution that does not rely on any annotation.

2.2 Datasets for Intrinsic Decomposition

Learning-based solutions require data, preferably with dense GT annotations (i.e., $\{I, A, S\}$ triplets), and covering the space of real-world use cases. Such a dataset is expensive to create at best, resulting in a realism versus size tradeoff. We discuss five datasets here, illustrated in Fig. 2(a-d).

Figure 2: One sample per existing dataset: (a) MIT, (b) MPI Sintel, (c) ShapeNet, (d) IIW, (e) Webcams, (f) Light Compositing. (a-c) are available with dense GT annotations (insets), (d) with sparse relative ones (arrows), and (e-f) without any.

Realistic and Scarce.

MIT IID [14] is the only GT dataset on real-world data. It contains 20 single-object scenes lit by 10 different illumination conditions. Dense (i.e., pixelwise) decompositions were defined after a tedious acquisition process involving controlled light and paint-sprayed objects. Its small size and lack of variety make it unusable for training convolutional neural networks (CNN), but it provides a benchmark for evaluating object decompositions.

Intrinsic Images in the Wild [3] (IIW) introduces a dataset of 5,230 real-life indoor images with sparse reflectance annotations by humans, who were asked to compare (similar, greater or smaller than) albedo intensity (i.e., grayscale level) of random point pairs in the images. This is taken as a sparse GT reference to measure the Weighted Human Disagreement Rate (WHDR) of SIID algorithms applied on the IIW images. Training a dense regression CNN is feasible but the sparse annotations provide insufficient cues.

Shading Annotations in the Wild [20] (SAW) extends and complements [3] with partly dense shading annotations by humans, who were asked to classify pixels as belonging to either smooth shadow areas or non-smooth shadow boundaries. It is taken as a semi-dense GT reference to measure the SAW quality of SIID algorithms applied on the SAW dataset.

Synthetic and Dense.

CG rendering algorithms allow for approaching photorealistic image quality while accessing and exporting intrinsic layers, like albedo, thus dense GT. However, creating a dataset that is completely realistic and covers the visual variety of the real world is impossible, since realistic rendering requires substantial expert human effort.

The MPI Sintel dataset [9] contains frames from 48 scenes (along with GT albedo and shading) of the Sintel CG short movie. However, it is biased: non-realistic effects (e.g., fluorescent fluids) and harmful modeling tricks (painting shading in the albedo) have been used, so training on it hardly generalizes to real-world images [30, 24]. The non-Lambertian ShapeNet dataset [18] is closer to photo-realism, but contains only single objects. 25K ShapeNet [10] objects were lit by 98 different HDR environment maps and rendered, together with their intrinsic layers, using Mitsuba [17], for a total of 2.4M training images.

3 Timelapse Datasets

Training CNNs, especially deep ones, requires large-scale GT annotations. However, both synthetic CG rendering and crowdsourcing human annotations on real images are hardly scalable processes: they are not sufficiently realistic or dense and both require extensive and expert human intervention. As an alternative, we propose to work on an abundant data source: timelapses. We define timelapses as a collection of images acquired from a fixed viewpoint of a scene, with time-varying environmental parameters like weather. Under an assumption of staticity, they observe a constant albedo with different illuminations and acquisition processes (e.g., tone-mapping). We wish to exploit this natural relationship for guiding our learning process without GT annotations in Sec. 4.

Typical web cameras form the target training data of our approach, for their ease of acquisition and realism. However, they often violate the staticity assumption. The webcamclipart dataset [22] (Fig. 2(e)) contains 54 webcams that acquired several images per day over a year. It shows that elements may move (including the camera itself), and that weather (e.g., fog, snow) changes the intrinsic albedo of the scene. We leave sanitization of this data for future work.


Instead, we present SUNCG-Intrinsic Images (SUNCG-II), a synthetic dataset that guarantees staticity to train on (cf. Fig. 3). It is an extension of the SUNCG dataset, which is a recent database of modeled apartments and houses introduced in [29]. Geometry, aspect of each surface, light parameters and preset interior viewpoints with full camera calibrations are included. This data serves as a base to render views (i.e., a fixed scene acquired from a fixed viewpoint and intrinsic camera parameters) with physically-based path tracing using the Mitsuba renderer [17]. The primary objective of SUNCG was to obtain realistic GT interiors for different CV applications such as depth or normal estimation, semantic labeling, or scene completion.

We propose to adapt and augment this dataset to model several shading and image acquisition variants for each static scene and viewpoint. For each viewpoint, we randomly sample several variants in light sources, and post-process the images with several variants of tone-mapping. As a result, we created 7,892 views from 817 scenes, multiplied by 5 varying lighting conditions and 5 different tone-mappings, producing a total of 106,609 images (after removing around half of the images because they have too little light, i.e., mean intensity less than 20). It is to be noted that SUNCG comes with 45,622 scenes, thus we currently exploited only 1.8% of the available scenes.
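The counts quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: it assumes that every view is rendered under each of the 5 lighting conditions combined with each of the 5 tone-mappings, which the text suggests but does not state outright.

```python
# Figures taken from the paragraph above.
views = 7_892
lightings = 5
tone_mappings = 5
kept_images = 106_609
scenes_used = 817
scenes_available = 45_622

# Assumption: full cross product of lighting and tone-mapping variants per view.
rendered_images = views * lightings * tone_mappings

# Share of rendered images that survive the "too little light" filter.
kept_fraction = kept_images / rendered_images

# Share of the SUNCG scenes used so far.
scene_fraction = scenes_used / scenes_available

print(rendered_images, round(kept_fraction, 2), round(100 * scene_fraction, 1))
```

Under that assumption, slightly more than half of the rendered images are kept, which is consistent with "removing around half of the images".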

Figure 3: Three variants (varying lighting and tone-mapping) of five example scenes of our dataset.

For each image, we have also rendered the corresponding GT albedo and shading maps. While the dataset we created and publish is composed of time-varying data with annotated GT, we emphasize that our unsupervised method does not need GT data. And while multiple images of the same static scene are required at train time, test-time inference is achieved on a single image.

Technical details.

To determine lighting and tone-mapping variants of each view, we use the following procedure. First, we remove every transparent object in the scene (e.g., windows, vases) as they incur many artifacts. Second, we remove any light source, including the environment maps. Third, we randomly add 1 to 3 point lights in a cuboid of radius 3x1.5x3 scene units (in camera reference) around the camera. Finally, we render the scene and apply the post-processing tonemap operation [26] with parameters and .

The photorealistic path-traced images ($I$) were rendered with 128 samples per pixel. Albedo maps ($A$) were rendered by fetching the material’s diffuse color/texture information only, while shading maps ($S$) were calculated by element-wise division: $S = I \oslash A$. Validity masks were also produced, discarding infinite depth points and black pixels.
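The shading computation and validity mask described above can be sketched as follows. This is a minimal NumPy illustration, not the pipeline used to build the dataset: the function name `shading_from_render`, the `eps` tolerance, and the choice to call a pixel "black" when any albedo channel is (near) zero are assumptions made for the example.

```python
import numpy as np

def shading_from_render(image, albedo, depth, eps=1e-6):
    """Deduce shading by element-wise division of image by albedo.

    image, albedo: float arrays of shape (H, W, 3).
    depth: float array of shape (H, W), with np.inf where no surface is hit.
    Returns (shading, valid) where valid is a boolean (H, W) mask.
    """
    finite_depth = np.isfinite(depth)
    # Assumption: a pixel is "black" if any albedo channel is (near) zero,
    # since the division is undefined there.
    non_black = albedo.min(axis=-1) > eps
    valid = finite_depth & non_black

    # Replace zero albedo by 1 to avoid dividing by zero; those pixels are masked anyway.
    safe_albedo = np.where(albedo > eps, albedo, 1.0)
    shading = image / safe_albedo
    shading[~valid] = 0.0
    return shading, valid

# Tiny synthetic check: constant shading of 1.6 (above 1, i.e. a bright highlight).
albedo = np.full((2, 2, 3), 0.5)
albedo[0, 0] = 0.0                      # black pixel
image = albedo * 1.6
depth = np.array([[1.0, 2.0], [np.inf, 3.0]])  # one background pixel
shading, valid = shading_from_render(image, albedo, depth)
```

Note that the recovered shading is allowed to exceed 1, in line with the unbounded shading model used in this work.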

Benchmark Dataset.

Finally, we will use the “light compositing” (L.C.) dataset [7] consisting of 6 scenes observed from a single viewpoint but different single-flashlight illuminations (Fig. 2(f)). It is too small to be used for training, but forms an interesting dataset for comparing the consistency of albedo decompositions.

4 Method

In this work, we aim at decomposing an image $I$ into an albedo $A$ and shading $S$ using a dense regression CNN $f$ with parameters $\theta$ such that $(A, S) = f_\theta(I)$. For generality, we choose the albedo/shading decomposition following $I = A \odot S$, where $A \in [0, 1]^3$ and $S \in [0, +\infty)^3$ at every pixel. $A$ represents intrinsic colors, which we represent using the usual 3-channel RGB values. Unlike most related work, we allow $S$ to be colored and to grow beyond unit value. This allows representing natural light phenomena, like colored lighting (see the cityscape or the red reflection on the yellow pepper in Fig. 4) and bright highlights. Note that solving this problem is harder than many variants in which shading is grayscale (i.e., single-channel), or which disallow specularities (i.e., $S \leq 1$).

Figure 4: Real-life photographs. Left: two views of the same scene at different times: the color variation resides in the lighting and acquisition process. Right: red light reflected onto the yellow pepper, and a specularity. Capturing these effects in $S$ requires it to be color-valued and without upper bound.

We choose an architecture (see Sec. 4.3) in which $S$ is regressed and $A$ is deduced by element-wise division (we empirically observed the same behavior as [24]: regressing $S$ and deducing $A$ produced better results). This guarantees consistent results. Note that we add a clipping layer so as to force $A \in [0, 1]$, and that division is differentiable, so both $A$ and $S$ can be used in loss functions, allowing for backpropagation of errors on both components simultaneously.
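The output stage described above (regress shading, deduce albedo by division, then clip) can be illustrated with a forward-pass sketch. It is written in plain NumPy for readability, so it carries no gradients; the function name `albedo_from_shading`, the `eps` guard, and the assumption that the clipping range is [0, 1] are choices made for this example.

```python
import numpy as np

def albedo_from_shading(image, shading, eps=1e-6):
    """Deduce albedo from a regressed shading map, then clip it.

    image, shading: float arrays of shape (H, W, 3).
    eps guards against division by zero (an assumption of this sketch).
    """
    albedo = image / np.maximum(shading, eps)
    # Clipping layer: assumed here to restrict albedo to [0, 1].
    return np.clip(albedo, 0.0, 1.0)

# One pixel, three channels. In the first channel the regressed shading is
# too small, so the raw albedo (1.6) is clipped to 1.0.
image = np.array([[[0.8, 0.4, 0.2]]])
shading = np.array([[[0.5, 0.8, 2.0]]])
albedo = albedo_from_shading(image, shading)

# Where clipping was active, albedo * shading no longer equals the input.
reconstruction_gap = np.abs(image - albedo * shading)
```

The non-zero gap in the clipped channel shows why the division alone does not guarantee consistency once clipping is added, which motivates a dedicated reconstruction loss.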

Figure 5: Our architecture is an augmented autoencoder with skip connections (blue arrows), downscaling (magenta arrows) and upscaling (red arrows). Cyan arrows of the decoder part are residual connections. Each green block is a densely connected convolutional block [16].

4.1 Naturally-constrained Siamese Training

Following the assumption of static, time-varying scenes with illumination (and subsequent capture parameters like white balancing) changes, a natural constraint arises: between images of the same view, only the shading is changing, and albedo is fixed. We implement this in a siamese training procedure in which we train a network on pairs of images ($I_i$, $I_j$) taken from a time-varying image sequence $\mathcal{I}$. For each image $I_i$, a forward pass generates a decomposition pair $(A_i, S_i)$ (cf. Fig. 1) and a joint loss is backpropagated. We next present the different components of this loss. Note that while training is siamese (i.e., requires pairs), inference is done on single images.

Albedo similarity.

Our main training target relates the albedo decompositions $A_i$ and $A_j$ of two images $I_i$, $I_j$ of $\mathcal{I}$ with a norm-based loss function:

$$\mathcal{L}_A = \lVert A_i - A_j \rVert \tag{1}$$

This forms the main optimization target. However, it still leaves the problem underdetermined: an infinite number of minima exist. Without regularization, training will inevitably lead to pitfalls: i.e., all solutions of the form $A = c$, $S = I / c$, for any constant $c$.

Cross-information Regularization.

To constrain the remaining degrees of freedom, we make the assumption that most of the color and texture information resides in the albedo. We encode this by adding a weighted loss

$$\mathcal{L}_X = \lambda \, \lVert A_i - I_j \rVert \tag{2}$$

where we make $\lambda$ decrease linearly during the first 30% of the training, then remain fixed. This strongly initializes the model, while loosening it during training in favor of Eqn. (1).

Note that $i \neq j$: we favor proximity between $I_j$ and the albedo $A_i$ estimated from a different view, hence the name “cross-information”. We also experimented with the simpler variant $\lVert A_i - I_i \rVert$, but this explicitly motivates the network to keep some shading coming from $I_i$ in its decomposition $A_i$. The difference in shading between $I_i$ and $I_j$ prevents this undesired effect. Fig. 6 illustrates comparative results: much less shading spills into $A$.

Figure 6: (a) Input, (b) with cross-information, (c) without cross-information. Decompositions using cross- vs. non-cross-information regularization in the regularization term.

Reconstruction Identity.

We wish to obtain consistent decompositions (i.e., multiplying $A$ and $S$ should produce $I$ again), following the argument of [24]. Because of the clipping layer at the end of the network (see Fig. 5), and because we do not use GT supervision as in [24], consistency could be lost during the optimization. Hence, we add a loss term to counter this:

$$\mathcal{L}_R = \lVert I_i - A_i \odot S_i \rVert \tag{3}$$

This has the side-effect of pushing the relative intensity of $A$ and $S$ to make sense w.r.t. the input. In practice, this loss reaches and stays close to 0 after an initial portion of our train time.
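To make the interplay of the three loss components concrete, the following NumPy sketch evaluates them on one image pair. It is an illustration under stated assumptions rather than the training code: a mean squared error stands in for each norm, each term is applied symmetrically to both images, the albedo-similarity and reconstruction terms have unit weight, and the `start` and `end` values passed to `cross_weight` are placeholders.

```python
import numpy as np

def mean_sq(x):
    # Assumption: a mean squared error stands in for the norm used in each term.
    return float(np.mean(x ** 2))

def deduce_albedo(image, shading, eps=1e-6):
    # Albedo is deduced from the regressed shading, then clipped to [0, 1].
    return np.clip(image / np.maximum(shading, eps), 0.0, 1.0)

def cross_weight(step, total_steps, start, end):
    # Linear change from `start` to `end` over the first 30% of training, then fixed.
    ramp = min(step / (0.3 * total_steps), 1.0)
    return start + (end - start) * ramp

def siamese_loss(i1, i2, s1, s2, lam):
    """Joint loss for a pair of images of the same view and their regressed shadings."""
    a1, a2 = deduce_albedo(i1, s1), deduce_albedo(i2, s2)
    terms = {
        # Both albedo estimates should agree.
        "albedo_similarity": mean_sq(a1 - a2),
        # Each albedo is pulled towards the *other* image of the pair.
        "cross_information": mean_sq(a1 - i2) + mean_sq(a2 - i1),
        # Albedo times shading should give the input back.
        "reconstruction": mean_sq(i1 - a1 * s1) + mean_sq(i2 - a2 * s2),
    }
    total = (terms["albedo_similarity"]
             + lam * terms["cross_information"]
             + terms["reconstruction"])
    return total, terms

# Synthetic pair: one albedo under two different (constant) shadings.
rng = np.random.default_rng(0)
albedo = rng.uniform(0.2, 0.8, size=(4, 4, 3))
s1 = np.full_like(albedo, 0.5)
s2 = np.full_like(albedo, 1.2)
i1, i2 = albedo * s1, albedo * s2

lam = cross_weight(step=0, total_steps=20_000, start=1.0, end=0.1)  # placeholder values
total, terms = siamese_loss(i1, i2, s1, s2, lam)
```

With the true shadings as input, the albedo-similarity and reconstruction terms vanish while the cross-information term does not, which illustrates why that regularizer is loosened as training progresses.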

4.2 Variants & Notation

For comparison and evaluation, we trained several variants that differ in their training goals. We present them here along with the notations we will use in the results section. “Our” denotes the standard unsupervised version of this work, trained on SUNCG-II data. “Our supervised” is a supervised variant, trained by summing two norms (w.r.t. the GT from SUNCG-II) on $A$ and $S$ instead of the term. “Our IIW” uses a different dataset (i.e., IIW [3]) and its sparse annotations with augmentations [34]: we train for optimizing the WHDR on the annotated pixels.

4.3 Technical Details

We propose a new architecture relying on the latest state-of-the-art CNNs. This architecture is briefly summarized in Fig. 5. Our network is an autoencoder with skip connections; at every level, the data is downsampled (in the encoder) or upsampled (in the decoder) by a factor of 2. The decoder part has residual connections [15]. Every level of the encoder and decoder is a densely connected convolutional block [16] with 4 convolutions of 20 filters. Finally, we use the element-wise division presented in [24] to enforce consistency of the decomposition. Training was done with 2 siamese images (randomly taken from the same view) in mini-batches of size 6. The Adam [19] optimizer was used with a learning rate that varies over 20k iterations, taking 15h on an NVidia GTX Titan X.

Method                  LMSE MIT   LMSE Bonneel
Garces et al. [12]      8.28       5.31
Zhou et al. [34]        6.12       1.04
Narihira et al. [30]    5.92       1.38
Bell et al. [3]         5.59       1.43
Our                     6.52       1.40
Our superv.             3.27       1.79

Table 1: LMSE of the state-of-the-art methods w.r.t. the GT-annotated datasets MIT [14] and Bonneel [4].
Figure 7: Decompositions on the GT dataset of Bonneel et al. [4]. Columns: Input, GT, Ours, Zhou et al. [34], Garces et al. [12], Zhao et al. [33].

5 Results

There are many applications to IID both in CV (e.g., scene understanding, robust feature detection for structure from motion, optical flow or segmentation, and shadow detection) and CG (e.g., shading-preserving texture editing, shading-less histogram matching, stylization, relighting, object insertion) [4]. Depending on the target application, one may have different qualitative expectations from a decomposition algorithm: e.g., texture should be preserved in $A$, or $A$ and $S$ should be strictly consistent (i.e., $I = A \odot S$).

While several metrics [14, 3, 20] have been suggested to evaluate IID, it has been observed that none give the full picture [20]. Therefore, we now assemble and extend a set of metrics that covers many requirements of IID algorithms (i.e., proximity to dense GT, agreement with human judgments, and consistency of decomposition). Our argument is that they are all necessary to give the full (or at least a wider) picture: no metric taken alone is sufficient to validate an IID algorithm.

Alongside quantitative measures, we present qualitative results so as to link numbers with visual quality on the recent realistic CG scenes by [4] and on real images from [7]. We evaluate and compare using our full set of metrics and show that despite not being supervised, our method competes with state-of-the-art methods on reference-based measures, and surpasses them on consistency of decomposition. More detailed results are provided in an additional document.

5.1 Proximity to Dense Ground Truth

Ideally, decompositions should lean closely to true albedo (represented by dense GT). The most widely-used full-reference metric in SIID is the Local Mean Squared Error (LMSE) [14, 30, 4]. We measure quality w.r.t. two small datasets having GT annotations in Tab. 1: the real-world MIT dataset [14] and the (close to) realistic CG dataset of Bonneel et al. [4]. MIT contains simple objects, with only a handful of albedos, while Bonneel et al.’s dataset contains higher-frequency albedo details, closer to casual images. Fig. 7 shows qualitative results alongside GT decompositions. One can notice that the GT shadings are colored, as are ours. Despite avoiding GT supervision, our method beats classical methods [12], and leans close to those that use data-driven supervision [3, 34, 30] in Tab. 1, especially on Bonneel et al.’s more visually cluttered dataset.

To evaluate how close our unsupervised training leans to the fully supervised variant, we compared “Our” and “Our supervised” tested against the SUNCG-II dataset’s GT. Both use the same random 80/20 scene split so as to minimize view similarity between train and test data. We obtain LMSE errors of 1.14 (supervised) and 1.16 (unsupervised). Drawing conclusions from this is complex (as trainings converge at a different pace and towards different goals), but it hints that timelapses without supervision are able to approximate the GT despite being GT-oblivious at train time.
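For readers unfamiliar with the LMSE used in this section, the sketch below shows its general structure: a scale-invariant squared error accumulated over overlapping local windows and normalized by the energy of the ground truth. It is a simplified illustration, not the evaluation code behind Tab. 1: the window size of 20 with a step of 10 is an assumed default, and object masks are ignored.

```python
import numpy as np

def ssq_error(gt, est):
    """Scale-invariant sum of squared errors: min over alpha of ||gt - alpha * est||^2."""
    denom = np.sum(est ** 2)
    alpha = np.sum(gt * est) / denom if denom > 1e-5 else 0.0
    return np.sum((gt - alpha * est) ** 2)

def lmse(gt, est, window=20, step=10):
    """Local MSE between a ground-truth layer and an estimate of the same shape.

    window and step are assumptions of this sketch. Images smaller than the
    window yield no windows, in which case 0.0 is returned.
    """
    height, width = gt.shape[:2]
    error, energy = 0.0, 0.0
    for y in range(0, height - window + 1, step):
        for x in range(0, width - window + 1, step):
            gt_patch = gt[y:y + window, x:x + window]
            est_patch = est[y:y + window, x:x + window]
            error += ssq_error(gt_patch, est_patch)
            energy += np.sum(gt_patch ** 2)
    return error / energy if energy > 0 else 0.0

rng = np.random.default_rng(0)
gt_albedo = rng.uniform(0.1, 1.0, size=(40, 40, 3))

# A globally rescaled estimate is not penalized: each window fits its own scale.
rescaled_error = lmse(gt_albedo, 2.5 * gt_albedo)
# A constant (texture-free) estimate is penalized.
flat_error = lmse(gt_albedo, np.ones_like(gt_albedo))
```

Because each window is rescaled independently, the metric tolerates the global scale ambiguity between albedo and shading but penalizes missing local structure.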

5.2 Agreement with Human Judgments

Large-scale sparse human judgments on real images regarding albedo intensity and shading type have been collected in the IIW [3] and SAW [20] papers, respectively. The corresponding metrics (i.e., WHDR and SAW, see Sec. 2.2) measure the alignment of IID results with these annotations. The SAW metric evaluates the smoothness of the decomposed shading, while the WHDR is a sparse metric comparing the relationship between pairs of albedo intensity points (around 65% equality and 35% inequality). It is acknowledged [3, 34] that the WHDR possesses weaknesses (induced by grayscaleness and sparsity) and bias coming from the human annotators. Indeed, humans are prone to visual illusion (e.g. the checkerboard or the blue/gold dress illusion) [4]. Moreover, the WHDR is by definition human-centric and sparse, hence it does not contain detailed annotations on texture: most annotations denote albedo equality, which is rare in the real (textured) world. Thus a method that defines albedo as piece-wise constant and grayscale can have an excellent WHDR, while incorrectly handling textures.
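The sparsity of the WHDR is easier to see from a sketch of how it is computed. The code below is an illustration under assumptions, not the official IIW evaluation script: the label encoding (`"1"` for "first point darker", `"2"` for "second point darker", `"E"` for "about equal"), the tuple layout of a judgement, and the relative threshold `delta = 0.10` are choices made for this example.

```python
def whdr(albedo_intensity, judgements, delta=0.10):
    """Weighted Human Disagreement Rate on a grayscale albedo estimate.

    albedo_intensity: 2D sequence indexed as [row][col].
    judgements: iterable of ((row1, col1), (row2, col2), label, weight), with
                label in {"1", "2", "E"} (hypothetical encoding, see lead-in).
    """
    disagreement, total = 0.0, 0.0
    for (y1, x1), (y2, x2), label, weight in judgements:
        r1 = max(albedo_intensity[y1][x1], 1e-10)
        r2 = max(albedo_intensity[y2][x2], 1e-10)
        if r2 / r1 > 1.0 + delta:
            predicted = "1"   # first point is darker
        elif r1 / r2 > 1.0 + delta:
            predicted = "2"   # second point is darker
        else:
            predicted = "E"   # roughly equal
        total += weight
        if predicted != label:
            disagreement += weight
    return disagreement / total if total > 0 else 0.0

intensity = [[0.2, 0.8],
             [0.5, 0.52]]
judgements = [
    ((0, 0), (0, 1), "1", 1.0),  # agrees: 0.2 is darker than 0.8
    ((1, 0), (1, 1), "E", 2.0),  # agrees: 0.5 and 0.52 are within 10%
    ((0, 1), (1, 0), "1", 1.0),  # disagrees: 0.8 is not darker than 0.5
]
score = whdr(intensity, judgements)
```

Only the annotated point pairs enter the score, so texture between those points is never checked; a piecewise-constant grayscale albedo can agree with every "equal" judgement. The function returns a fraction, whereas the WHDR values reported in this section are percentages.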

Figure 8: IID on two different lightings per scene on the Light Compositing dataset [7]. Columns: Input, Ours, Zhou et al. [34], Bell et al. [3].
Method                 WHDR    SAW precision @ recall
                               50%     70%     80%
Constant Albedo        36.5    82.7    82.2    78.7
Shen et al. [27]       24.0    90.0    78.4    70.1
Color Retinex [23]     23.5    93.4    87.5    77.3
Garces et al. [12]     22.6    95.8    84.7    75.4
Zhao et al. [33]       23.2    98.3    90.2    80.4
Bell et al. [3]        19.2    97.8    88.9    79.1
Zhou et al. [34]       20.1    97.8    92.9    80.3
Narih. et al. [30]     40.7    89.2    80.9    73.9
Kovacs et al. [20]     N/A     93.8    84.5    DNC
Our                    30.0    97.5    91.5    82.1
Our IIW                21.1    DNC     70.6    66.0
Our superv.            36.4    91.5    75.7    64.0

Table 2: WHDR [3] and SAW (precision at given recall values) [20] evaluation.

WHDR and SAW results can be seen in Tab. 2. We recalculated all WHDR and SAW values following the fair protocol of [3], taking the best of two runs with and without sRGB to RGB conversion and using the train/test split of [34]. Hence our results differ from those presented in [34, 20]. Our method lags behind on the WHDR but has an excellent SAW score, especially at high recall values. While visual inspection of IID results is difficult, we note that our decompositions preserve more texture in the albedo (cf. the floor in Fig. 7, and the wooden texture behind the basket in Fig. 8). It can be reasoned that methods that define albedo as piece-wise constant (e.g., by shifting high frequencies due to texture to the shading layer) can have an excellent WHDR, but will fail on SAW: compare, e.g., the textures produced by [34] in Fig. 7 (top). In reality though, variations due to texture reside in the albedo. Finally, similarly to other methods, ours struggles with cast shadows.

To prove the capabilities of our architecture, we trained our IIW variant. It exhibited close to state-of-the-art results on the WHDR, but this came at the cost of poor performance on the SAW metrics and bad visual results, due to the lack of constraints when optimizing the WHDR alone.

5.3 Decomposition Consistency

Finally, we introduce two new metrics to measure IID requirements that are unattended to in prior work, but are nonetheless important for several applications.

Reciprocity Error

measures the loss of information occurring during the decomposition by comparing the original image ($I$) and the reconstructed one ($A \odot S$). This measure is of predominant importance for applications like image editing and object insertion. We define the Mean Reconstruction Error

$$\mathrm{MRE}(I, A, S) = \min_{\alpha} \frac{1}{N} \sum_{p} \left| I(p) - \alpha \, A(p) \, S(p) \right|$$

where $N$ is the number of pixels and $\alpha$ is a rescaling parameter. For fairness of comparison with other methods that export 8-bit quantized and rescaled results in image file formats, we proceed similarly. Moreover, we optimize for the rescaling parameter $\alpha$ to virtually undo potential scalings needed for the 8-bit range quantization or for visualization. Tiny errors remain due to quantization: when training is done and has converged, our method should have zero MRE.
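The evaluation procedure just described can be sketched as follows. This is an illustrative approximation rather than the evaluation code: the function name `mre` is hypothetical, the error is taken as a mean absolute deviation on the 8-bit scale, and the rescaling parameter is fitted in closed form by least squares, which is a convenient stand-in for whatever optimizer was actually used.

```python
import numpy as np

def mre(image_8bit, albedo_8bit, shading_8bit):
    """Mean reconstruction error between an image and the rescaled product A * S.

    All inputs are uint8 arrays of identical shape, as exported to image files.
    """
    target = image_8bit.astype(np.float64)
    product = albedo_8bit.astype(np.float64) * shading_8bit.astype(np.float64)
    # Assumption: least-squares fit of the single rescaling parameter alpha.
    denom = np.sum(product ** 2)
    alpha = np.sum(target * product) / denom if denom > 0 else 0.0
    return float(np.mean(np.abs(target - alpha * product)))

rng = np.random.default_rng(0)
albedo_8bit = rng.integers(50, 201, size=(8, 8, 3)).astype(np.uint8)
shading_8bit = rng.integers(50, 201, size=(8, 8, 3)).astype(np.uint8)

# A consistent decomposition: the image is the (quantized) product of the layers.
consistent_image = np.round(
    albedo_8bit.astype(np.float64) * shading_8bit.astype(np.float64) / 255.0
).astype(np.uint8)
consistent_error = mre(consistent_image, albedo_8bit, shading_8bit)

# An unrelated image cannot be explained by the same layers.
unrelated_image = rng.integers(0, 256, size=(8, 8, 3)).astype(np.uint8)
unrelated_error = mre(unrelated_image, albedo_8bit, shading_8bit)
```

For the consistent case only quantization error remains, below one intensity level, while the unrelated case yields a much larger value.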

Temporal Inconsistency

measures how much the albedo decompositions of a set of images capturing the same static scene under different (lighting) conditions differ. This is important for robust shadow detection, relighting and any time-varying (i.e., video) application. Let $\mathcal{A} = \{A_i\}$ be a set of albedo decompositions of such an image sequence $\mathcal{I} = \{I_i\}$. We define the Mean Albedo Consistency Error

$$\mathrm{MACE}(\mathcal{A}) = \frac{1}{N \, P} \sum_{i \neq j} \sum_{p} \sum_{c} \left| A_i^c(p) - A_j^c(p) \right|$$

where $A_i$ is the albedo decomposition of image $I_i$, and the sum is normalized by the number of pixels per image $N$ and the number of ordered pairs $P$. The index $c$ runs over color channels.

Additionally, we define $\mathrm{MACE}_t$: a relaxation of MACE that does not consider dark pixels that have intensity values below a threshold $t$, and normalizes accordingly. That is because dark pixels in an input image give little information on what the valid decomposition should be, and algorithms typically have to guess such pixel colors by extrapolation. Hence, $\mathrm{MACE}_0$ evaluates extrapolation capabilities, but we also use a nonzero threshold to assess on the more feasible pixel locations only. For $t > 0$, we exclude from the calculation any pair whose non-dark pixels have a small overlap area, i.e., below a minimum fraction of the full image.
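A compact sketch of the temporal consistency measure and its dark-pixel relaxation is given below. It follows the verbal definition above under a few assumptions: pixel intensity is the channel mean of the input image, per-pixel differences are absolute values summed over color channels, and the `threshold` and `min_overlap` parameters are placeholders rather than the values used in the experiments.

```python
import numpy as np
from itertools import permutations

def mace(albedos, images, threshold=0.0, min_overlap=0.0):
    """Mean albedo consistency error over all ordered pairs of a sequence.

    albedos, images: lists of (H, W, 3) arrays on the same intensity scale.
    threshold: pixels darker than this in either input image are ignored.
    min_overlap: pairs whose common non-dark area is a smaller fraction of
                 the image are skipped.
    With threshold=0 every pixel is kept, which gives the plain MACE.
    """
    pair_errors = []
    for i, j in permutations(range(len(albedos)), 2):  # ordered pairs, i != j
        bright_i = images[i].mean(axis=-1) >= threshold
        bright_j = images[j].mean(axis=-1) >= threshold
        mask = bright_i & bright_j
        if mask.sum() == 0 or mask.mean() < min_overlap:
            continue
        diff = np.abs(albedos[i].astype(np.float64) - albedos[j].astype(np.float64))
        # Sum over channels and kept pixels, normalized by the number of kept pixels.
        pair_errors.append(diff[mask].sum() / mask.sum())
    return float(np.mean(pair_errors)) if pair_errors else float("nan")

# Two albedo estimates that differ by 10 everywhere, and by 100 at one pixel
# that is dark in the first input image.
a0 = np.full((2, 2, 3), 100.0)
a1 = np.full((2, 2, 3), 110.0)
a1[0, 0] = 200.0
img0 = np.full((2, 2, 3), 128.0)
img0[0, 0] = 5.0
img1 = np.full((2, 2, 3), 128.0)

full_error = mace([a0, a1], [img0, img1])                    # includes the dark pixel
relaxed_error = mace([a0, a1], [img0, img1], threshold=20.0)  # ignores it
```

In this toy case the large disagreement sits on a dark pixel, so the relaxed variant reports a much lower error than the plain one.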

Note that the MPRE [34] is similar, but defined on products of estimated $A$ and $S$ across temporal variation in a sequence. While elegant, we believe it is not adequate because albedo errors are weighed by shading intensity. This attenuates errors in underexposed areas (which we think is legitimate), but emphasizes those made in saturated areas, which is arguable, especially with models where $S$ is not bounded above.


                       MRE                        MACE
Method                 MIT     L.C.    SAW        MIT     L.C.    Webc.
Garces et al. [12]     9.07    15.90   24.86      9.62    DNC     26.37
Bell et al. [3]        2.28    2.84    1.53       35.85   40.87   27.76
Zhou et al. [34]       1.87    2.32    1.57       29.51   40.23   24.21
Narih. et al. [30]     9.32    11.71   18.42      30.61   30.31   25.26
Our                    0.12    0.34    0.38       9.89    18.96   14.34
Our supervised         0.36    2.97    0.68       16.47   35.33   35.62

Table 3: MRE and MACE metrics (i.e., mean pixel deviation in the range $[0, 255]$). [12] did not converge (DNC) on most images of the L.C. sequences.

Tab. 3 shows consistency results over three time-varying real-world datasets, and the SAW images. Our method is lossless (MRE close to 0; only tiny quantization errors remain) and preserves temporal consistency (see Fig. 8) across the datasets observed. Surprisingly, it is more temporally consistent even on the narrow set of albedos of the MIT dataset. We believe this is due to most methods producing an albedo whose average intensity is close to that of the input image, which varies a lot over time on MIT and Light Compositing. Conversely, our method generates consistent intensities, more independent from the lightness. Zhou et al. [34] performs similarly well with a small MRE, but lacks temporal consistency. While Garces et al. [12] performs well when its assumptions (e.g., piecewise constant albedo) are satisfied (e.g., on the MIT dataset), it does not generalize well to real-life situations: it is the lossiest method and does not produce temporally consistent results on the real-life cluttered scenes.


Our proposed metrics fill an important gap by measuring aspects of decomposition unmeasured before, and explain how lossy methods such as [12] can perform well on the WHDR or SAW metrics: neither of them penalizes the loss of high frequencies. Yet, consistency metrics should always be evaluated in the context of reference-based ones or visual results, since trivial solutions (e.g., or ) minimize the consistency metrics.
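The caveat about trivial solutions can be illustrated with a small sketch. It assumes a multiplicative model (image = albedo × shading) and uses a constant albedo as the degenerate decomposition; this particular choice is an illustrative assumption, as are the helper name `trivial_decomposition` and the simple mean-absolute error measures, which only stand in for the reconstruction and consistency metrics.

```python
import numpy as np

def trivial_decomposition(image, constant=1.0):
    """Degenerate decomposition: constant albedo, all image content in the shading."""
    albedo = np.full_like(image, constant)
    shading = image / constant
    return albedo, shading

# Two unrelated random "frames" stand in for a time-lapse pair (hypothetical data).
rng = np.random.default_rng(0)
frame_a = rng.random((8, 8, 3))
frame_b = rng.random((8, 8, 3))

albedo_a, shading_a = trivial_decomposition(frame_a)
albedo_b, shading_b = trivial_decomposition(frame_b)

# Reconstruction is exact and the two albedos are identical, so both
# stand-in errors vanish although the decomposition is useless.
reconstruction_error = float(np.abs(frame_a - albedo_a * shading_a).mean())
consistency_error = float(np.abs(albedo_a - albedo_b).mean())
```

Both errors vanish even though the albedo carries no information about the scene, which is why such scores need to be read alongside reference-based metrics or visual results.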

6 Conclusion

We presented the first unsupervised deep learning solution to the single-image intrinsic image decomposition problem.

Figure 9: Performance on our five metrics (see suppl. mat.).

We trained it without GT knowledge on our new SUNCG-II dataset by looking at pairs of time-lapse images at train time. We evaluate on many SIID metrics, including newly proposed ones, to offer a broad evaluation panel from which users can choose depending on the target application. Results compete qualitatively and quantitatively with state-of-the-art methods, including those that require supervision, and our method outperforms related work on metrics measuring consistency of decomposition. This makes our solution a good choice for applications that require consistency on large datasets, e.g., robust feature detection for 3D reconstruction. The set of metrics we assembled for evaluation captures many aspects of SIID quality, and an overall best method is hard to define. We hope it brings a broader view on the strengths and weaknesses of each method (see Fig. 9), and allows for making an informed choice.

Our method still has limitations. While it is consistent and, unlike most methods, preserves texture in the albedo layers, it has difficulty dealing with hard shadows and may leave too much color in the shading. Also, while we believe human-based priors cannot be general enough to solve the full problem, we had to resort to one (Eqn. 2) to regularize our optimization. We see it as a long-term goal to remove such priors entirely.

We believe our work opens the path for SIID using unsupervised deep learning, which is an exciting avenue of future work.


  • [1] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. TPAMI, 2015.
  • [2] H. Barrow and J. Tenenbaum. Recovering intrinsic scene characteristics. Comput. Vis. Syst., A Hanson & E. Riseman (Eds.), pages 3–26, 1978.
  • [3] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Trans. on Graphics (SIGGRAPH), 33(4), 2014.
  • [4] N. Bonneel, B. Kovacs, S. Paris, and K. Bala. Intrinsic Decompositions for Image Editing. Computer Graphics Forum (Eurographics State of the Art Reports 2017), 36(2), 2017.
  • [5] N. Bonneel, K. Sunkavalli, J. Tompkin, D. Sun, S. Paris, and H. Pfister. Interactive intrinsic video editing. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2014), 33(6), 2014.
  • [6] A. Bousseau, S. Paris, and F. Durand. User assisted intrinsic images. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2009), 28(5), 2009.
  • [7] I. Boyadzhiev, S. Paris, and K. Bala. User-assisted image compositing for photographic lighting. ACM Trans. Graph., 32(4), July 2013.
  • [8] M. S. Brown. Understanding the in-camera image processing pipeline for computer vision, 2016.
  • [9] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), editor, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, Oct. 2012.
  • [10] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
  • [11] Q. Chen and V. Koltun. A simple model for intrinsic image decomposition with depth cues. In 2013 IEEE International Conference on Computer Vision, pages 241–248, Dec 2013.
  • [12] E. Garces, A. Munoz, J. Lopez-Moreno, and D. Gutierrez. Intrinsic images by clustering. Computer Graphics Forum (Proc. EGSR 2012), 31(4), 2012.
  • [13] P. V. Gehler, C. Rother, M. Kiefel, L. Zhang, and B. Schölkopf. Recovering intrinsic images with a global sparsity prior on reflectance. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 765–773, USA, 2011. Curran Associates Inc.
  • [14] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground-truth dataset and baseline evaluations for intrinsic image algorithms. In International Conference on Computer Vision, pages 2335–2342, 2009.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • [16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [17] W. Jakob. Mitsuba renderer, 2010.
  • [18] J. Shi, Y. Dong, H. Su, and S. X. Yu. Learning non-lambertian object intrinsics across shapenet categories. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference for Learning Representations, 2014.
  • [20] B. Kovacs, S. Bell, N. Snavely, and K. Bala. Shading annotations in the wild. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [21] P. Y. Laffont and J. C. Bazin. Intrinsic decomposition of image sequences from local temporal variations. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 433–441, Dec 2015.
  • [22] J.-F. Lalonde, A. A. Efros, and S. G. Narasimhan. Webcam Clip Art: Appearance and illuminant transfer from time-lapse sequences. ACM Transactions on Graphics (SIGGRAPH Asia 2009), 28(5), December 2009.
  • [23] E. H. Land and J. J. McCann. Lightness and retinex theory. JOSA, 61(1):1–11, 1971.
  • [24] L. Lettry, K. Vanhoey, and L. Van Gool. DARN: a deep adversarial residual network for intrinsic image decomposition. CoRR, abs/1612.07899, 2016.
  • [25] Y. Matsushita, S. Lin, S. B. Kang, and H. Shum. Estimating intrinsic images from image sequences with biased illumination. pages 274–286, April 2004.
  • [26] E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda. Photographic tone reproduction for digital images. ACM Trans. Graph., 21(3):267–276, July 2002.
  • [27] J. Shen, X. Yang, Y. Jia, and X. Li. Intrinsic images using optimization. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’11, pages 3481–3487, Washington, DC, USA, 2011. IEEE Computer Society.
  • [28] L. Shen and C. Yeo. Intrinsic images decomposition using a local and global sparse representation of reflectance. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 697–704, June 2011.
  • [29] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [30] T. Narihira, M. Maire, and S. X. Yu. Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In International Conference on Computer Vision (ICCV), 2015.
  • [31] Y. Weiss. Deriving intrinsic images from image sequences. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 68–75 vol.2, 2001.
  • [32] J. Yu. Rank-constrained pca for intrinsic images decomposition. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3578–3582, Sept 2016.
  • [33] Q. Zhao, P. Tan, Q. Dai, L. Shen, E. Wu, and S. Lin. A closed-form solution to retinex with nonlocal texture constraints. IEEE Trans. Pattern Anal. Mach. Intell., 34(7):1437–1444, July 2012.
  • [34] T. Zhou, P. Krähenbühl, and A. A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3469–3477, 2015.