Single and Multiple Illuminant Estimation Using Convolutional Neural Networks
In this paper we present a method for the estimation of the color of the illuminant in RAW images. The method includes a Convolutional Neural Network that has been specially designed to produce multiple local estimates. A multiple illuminant detector determines whether or not the local outputs of the network must be aggregated into a single estimate. We evaluated our method on standard datasets with single and multiple illuminants, obtaining lower estimation errors with respect to those obtained by other general purpose methods in the state of the art.
The observed color of the objects in the scene depends on the intrinsic color of the object (i.e. the surface spectral reflectance), on the illumination, and on their relative positions. Many computer vision problems in both still images and videos can make use of color constancy processing as a pre-processing step to make sure that the recorded color of the objects in the scene does not change under different illumination conditions.
In general there are two methodologies to obtain reliable color description from image data: computational color constancy and color invariance. Computational color constancy is a two-stage operation: the first stage estimates the color of the scene illuminant from the image data, the second corrects the image on the basis of this estimate to generate a new image of the scene as if it were taken under a reference illuminant. Color invariance methods instead represent images by features which remain unchanged with respect to imaging conditions.
In this work we focus on illuminant estimation. Our method is based on supervised learning and includes a Convolutional Neural Network (CNN) specially designed for the local estimation of the illuminant color. Recently, deep neural networks have gained the attention of numerous researchers outperforming state-of-the-art approaches on various computer vision tasks [2, 3]. One of CNN's advantages is that it can take raw images as input and incorporate feature design into the training process. With a deep structure, CNN can learn complicated mappings while requiring minimal domain knowledge. In our method the outputs of the CNN provide a spatially varying estimate of the illuminant that can optionally be aggregated into a single global estimate by a local-to-global regressor based on non-linear Support Vector Regression (SVR). To make a final decision between the local and the global estimates we designed a multiple illuminant detector exploiting a Kernel Density Estimator (KDE). To the best of our knowledge this is the first general purpose work in which both single and multiple illuminants are dealt with in a comprehensive way.
Preliminary findings reported in this paper appeared in , where we presented the basic architecture of the CNN and evaluated its performance in the single illuminant scenario. This paper extends the previous one in several ways:
Since one of the assumptions that is often violated in color constancy is the presence of a uniform illumination in the scene, we have extended the applicability of the proposed algorithm to the case of non-uniform illumination. The method is adaptive: it distinguishes images of scenes taken under uniform illumination from those acquired under non-uniform illumination, and processes them in different ways.
In the case of uniform illumination, the multiple local estimates must be aggregated into a single global estimate. To do so we designed a new local-to-global regression method that replaces the per-channel median operator used in our previous work with a non-linear mapping based on an RBF kernel over local statistics of the CNN estimates. The parameters of the mapping are obtained by applying a regression procedure that minimizes the median angular error on the training set.
The preliminary results reported in our previous work included only images having a single color target in the scene, thus allowing comparisons only with global illuminant estimation methods. We present a much more detailed experimental evaluation using both a multiple illuminant synthetic dataset and a dataset of RAW images containing at least two known color targets for benchmarking.
We show experimentally that the proposed method advances the state-of-the-art on standard datasets of RAW images for both the cases of single and multiple illuminants.
The rest of the paper is organized as follows: Section 2 formalizes the problem of illuminant estimation and reviews the main approaches in the state of the art. Section 3 illustrates in detail the proposed method. Section 4 describes the data and the algorithms used in the experimentation, while Section 5 discusses the results obtained. Section 6 reviews the architecture of the CNN on which our method is based, and gives insights on the learned model from a computational color constancy point of view. Finally, Section 7 summarizes the findings of our experimentation and proposes new directions of research in this field.
2 Problem formulation and related works
The image values for a Lambertian surface located at the pixel with coordinates $(x, y)$ can be seen as a function $\boldsymbol{\rho}(x, y)$, mainly dependent on three physical factors: the illuminant spectral power distribution $I(x, y, \lambda)$, the surface spectral reflectance $S(x, y, \lambda)$ and the sensor spectral sensitivities $\mathbf{C}(\lambda)$. Using this notation $\boldsymbol{\rho}(x, y)$ can be expressed as
$$\boldsymbol{\rho}(x, y) = \int_{\omega} I(x, y, \lambda)\, S(x, y, \lambda)\, \mathbf{C}(\lambda)\, d\lambda, \tag{1}$$
where $\lambda$ is the wavelength, $\boldsymbol{\rho}$ and $\mathbf{C}(\lambda)$ are three-component vectors and the integration is performed over the visible spectrum $\omega$. The goal of color constancy is to estimate the color $\mathbf{I}(x, y)$ of the scene illuminant, i.e. the projection of $I(x, y, \lambda)$ on the sensor spectral sensitivities $\mathbf{C}(\lambda)$:
$$\mathbf{I}(x, y) = \int_{\omega} I(x, y, \lambda)\, \mathbf{C}(\lambda)\, d\lambda. \tag{2}$$
Usually the illuminant color is estimated up to a scale factor as it is more important to estimate the chromaticity of the scene illuminant than its overall intensity . Thus, the error metric usually considered, as suggested by Hordley and Finlayson , is the angle between the RGB triplet of estimated illuminant ($\mathbf{I}_{EST}$) and the RGB triplet of the measured ground truth illuminant ($\mathbf{I}_{GT}$):
$$e_{ANG} = \arccos\left( \frac{\mathbf{I}_{GT}^{T}\, \mathbf{I}_{EST}}{\|\mathbf{I}_{GT}\|\, \|\mathbf{I}_{EST}\|} \right). \tag{3}$$
Since the only information available are the sensor responses across the image, color constancy is an under-determined problem  and thus further assumptions and/or knowledge are needed to solve it. Several computational color constancy algorithms have been proposed, each based on different assumptions. The most common assumption is that the color of the light source is uniform across the scene, i.e. $I(x, y, \lambda) = I(\lambda)$. The next two sections review single and multiple illuminant estimation algorithms in the state of the art.
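The angular error metric described above can be illustrated with a few lines of NumPy. This is a minimal sketch, not the evaluation code used in the paper; the function name `angular_error_deg` is chosen here for illustration only.

```python
import numpy as np

def angular_error_deg(estimate, ground_truth):
    """Angle, in degrees, between two RGB illuminant triplets."""
    e = np.asarray(estimate, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    cos = np.dot(e, g) / (np.linalg.norm(e) * np.linalg.norm(g))
    # Clipping guards against values marginally outside [-1, 1] due to rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

Because the two vectors are normalized inside the function, the error is insensitive to the overall intensity of either triplet, which matches the remark that the illuminant is estimated only up to a scale factor.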
2.1 Single illuminant estimation
Methods for single illuminant estimation can be divided into two main classes: statistic approaches and learning-based approaches. Statistic approaches estimate the scene illumination only on the basis of the content of a single image, making assumptions about the nature of color images and exploiting statistical or physical properties; learning-based approaches require training data in order to build a statistical image model before the estimation of the illumination.
Van de Weijer et al.  have unified a variety of algorithms. These algorithms estimate the illuminant color by implementing instantiations of the following equation:
$$\mathbf{I}(n, p, \sigma) = \frac{1}{k} \left( \iint \left| \nabla^{n} \boldsymbol{\rho}_{\sigma}(x, y) \right|^{p} dx\, dy \right)^{1/p}, \tag{4}$$
where $n$ is the order of the derivative, $p$ is the Minkowski norm, $\boldsymbol{\rho}_{\sigma}(x, y) = \boldsymbol{\rho}(x, y) \otimes G_{\sigma}(x, y)$ is the convolution of the image with a Gaussian filter $G_{\sigma}(x, y)$ with scale parameter $\sigma$, and $k$ is a constant to be chosen such that the illuminant color $\mathbf{I}$ has unit length (using the $L_2$ norm). The integration is performed over all pixel coordinates. Different $(n, p, \sigma)$ combinations correspond to different illuminant estimation algorithms, each based on a different assumption. For example, the Gray World algorithm  (generated by setting $(n, p, \sigma) = (0, 1, 0)$) is based on the assumption that the average color in the image is gray and that the illuminant color can be estimated as the shift from gray of the averages in the image color channels; the White Patch algorithm  (generated by setting $(n, p, \sigma) = (0, \infty, 0)$) is based on the assumption that the maximum response is caused by a perfect reflectance: a surface with perfect reflectance properties will reflect the full range of light that it captures and consequently, the color of this perfect reflectance is exactly the color of the light source. In practice, the assumption of perfect reflectance is alleviated by considering the color channels separately, resulting in the maxRGB algorithm. The Gray Edge algorithm  (generated by setting $(n, p, \sigma) = (1, 1, 6)$ for example) is based on the assumption that the average color of the edges is gray and that the illuminant color can be estimated as the shift from gray of the averages of the edges in the image color channels.
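To make the role of the Minkowski norm concrete, the sketch below implements only the zeroth-order, unsmoothed case of the framework (no derivative and no Gaussian filtering), which covers Gray World, Shades of Gray and White Patch/maxRGB. It is an illustration under these restrictions and is not the reference implementation cited in the text; the function name `minkowski_illuminant` is hypothetical.

```python
import numpy as np

def minkowski_illuminant(image, p=1.0):
    """Illuminant estimate for an H x W x 3 linear RGB image.

    Covers only the case without derivatives and without smoothing:
    p = 1 gives Gray World, p = inf gives White Patch (maxRGB),
    intermediate values give Shades of Gray.
    """
    pixels = np.abs(np.asarray(image, dtype=float)).reshape(-1, 3)
    if np.isinf(p):
        estimate = pixels.max(axis=0)                 # per-channel maximum
    else:
        # The mean differs from the integral only by a constant factor,
        # which is absorbed by the normalization below.
        estimate = np.mean(pixels ** p, axis=0) ** (1.0 / p)
    return estimate / np.linalg.norm(estimate)        # unit Euclidean length
```

For an image of a single flat color every choice of `p` returns the same direction, while on real images the estimates move from the channel averages toward the channel maxima as `p` grows.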
The Gamut Mapping method does not follow (4) and assumes that, for a given illuminant, one observes only a limited gamut of colors . It has a preliminary phase in which a canonical illuminant is chosen and the canonical gamut is computed by observing as many surfaces under the canonical illuminant as possible. Given an input image with an unknown illuminant, its gamut is computed and the illuminant is estimated as the mapping that can be applied to the gamut of the input image, resulting in a gamut that lies completely within the canonical gamut and produces the most colorful scene. If the spectral sensitivity functions of the camera are known, the Color by Correlation approach can also be used .
The learning-based illuminant estimation algorithms, that estimate the scene illuminant using a model that is learned on training data, can be subdivided into two main subcategories: probabilistic methods and fusion/selection based methods.
One of the first learning-based algorithms is , where a Neural Network was trained on binarized chromaticity histograms: input neurons are set either to zero indicating that a chromaticity is not present in the image, or to one indicating that it is present.
Bayesian approaches  model the variability of reflectance and of illuminant as random variables, and then estimate illuminant from the posterior distribution conditioned on image intensity data.
Given a set of illuminant estimation algorithms, in  an image classifier is trained to classify the images as indoor and outdoor, and different experimental frameworks are proposed to exploit this information in order to select the best performing algorithm on each class. In  it has been shown how intrinsic, low level properties of the images can be used to drive the selection of the best algorithm (or the best combination of algorithms) for a given image. The algorithm selection and combination is made by a decision forest composed of several trees on the basis of the values of a set of heterogeneous features. In  the Weibull parametrization has been used to train a maximum likelihood classifier based on mixture of Gaussians to select the best performing illuminant estimation method for a certain image.
In  a statistical model for the spatial distribution of colors in white balanced images is developed, and then used to infer illumination parameters as those being most likely under the model. High level visual information has been used to select the best illuminant out of a set of possible illuminants . This is achieved by restating the problem in terms of semantic interpretability of the image. Several illuminant estimation methods are applied to generate a set of illuminant hypotheses. For each illuminant hypothesis, the image is corrected, the likelihood of the semantic content of the corrected image is evaluated, and the most likely illuminant color is selected. In [19, 20] the use of automatically detected objects having intrinsic color is investigated. In particular, it is shown how illuminant estimation can be performed exploiting the color statistics extracted from the faces automatically detected in the image. When no faces are detected in the image, any other algorithm in the state of the art can be used. In [21, 22] the surfaces in the image are exploited and the illuminant estimation problem is addressed by unsupervised learning of an appropriate model for each training surface in training images. The model for each surface is defined using both texture features and color features. In a test image the nearest neighbor model is found for each surface and its illumination is estimated by comparing the statistics of pixels belonging to nearest neighbor surfaces and the target surface. The final illumination estimation results from combining these estimated illuminants over surfaces to generate a unique estimate.
In  it was shown how simple moment-based algorithms can, with the addition of a simple correction step, deliver much improved illuminant estimation performance. The approach employs first, second and higher moments of color and color derivatives and linearly corrects them to give an illuminant estimate.
In  four simple image features are used for training an ensemble of decision trees. Each of these trees is computed from samples in the training data that are biased to a local region in chromaticity space of the ground truth illuminations. The final estimate is made by finding consensus among the estimates of the different features' trees.
In  two different approaches using CNNs were investigated: in the first one an ad-hoc CNN for the color constancy problem was trained; in the second one a pre-trained one was used by extracting a 4096-dimensional feature vector from each image using the Caffe  implementation of the deep CNN described by Krizhevsky et al. . Features were computed by forward propagation of a mean-subtracted RGB RAW image through five convolutional layers and two fully connected layers. More details about the network architecture can be found in [3, 25]. The CNN was discriminatively trained on a large dataset (ILSVRC 2012) with image-level annotations to classify images into 1000 different classes. Features are obtained by extracting activation values of the last hidden layer. The extracted features were then used as input to a linear Support Vector Regression (SVR)  to estimate the illuminant color for each image.
In  illuminant color is predicted from luminance-to-chromaticity based on a conditional likelihood function for the true chromaticity of a pixel, given its luminance. Two approaches have been proposed to learn this function. The first was based purely on empirical pixel statistics, while the second was based on maximizing accuracy of the final illuminant estimate.
2.2 Multiple illuminant estimation
The great majority of state-of-the-art illuminant estimation methods assumes that a uniform illumination is present in the scene. This assumption is often violated in real-world images. It is not trivial to extend the existing illuminant estimation algorithms to work locally instead of globally, since the spatial support on which they accumulate the statistics is reduced, and the final local estimate could be biased by local image properties. One of the first methods following this strategy is Retinex , which is able to deal with non-uniform illumination assuming that an abrupt change in chromaticity is caused by a change in reflectance properties. This implies that the illuminant smoothly varies across the image and does not change between adjacent or nearby locations. Ebner  proposed a method that assumes that the illuminant transition is smooth. The method uses the local space average color for local estimation of the illuminant by convolving the image with a Gaussian kernel function. Bleier et al.  investigated whether existing color constancy methods, originally developed assuming uniform illumination, can be adapted to local illuminant color estimation using image sub-regions. Multiple independent estimations are then combined through regression to obtain a more robust final estimate. Gijsenij et al.  proposed a method that makes use of local image patches, which can be selected by any sampling method. After sampling of the patches, illuminant estimation techniques are applied to obtain local illuminant estimates, and these estimates are combined into more robust estimations, since it is assumed that the number of different lights is less than the number of patches. This combination of local estimates is done with two different approaches: clustering if the number of lights is known, segmentation otherwise. 
Recently Bianco and Schettini , and Joze and Drew  respectively extended the face-based and exemplar-based color constancy algorithms to deal with multiple illuminations. A different class of algorithms is based on user guidance to deal with the case of two  and multiple lights .
3 The proposed approach
In recent years deep learning techniques have brought significant improvements in the solution of several computer vision problems. Their success often depends on the availability of a large amount of annotated training data. Compared to other image-related problems, in illuminant estimation annotated data is scarce. Therefore, the straightforward procedure of learning the most probable illuminant color directly from the image pixels needs some major adjustments.
We propose a three-stage method: the first stage is patch based, that is, a CNN is trained to predict the illuminant color from a small square portion of the input image. A large training set of patches can be obtained even from a relatively small data set of images, which makes the use of deep learning techniques possible. This first stage yields multiple local estimates of the illuminant across the input image.
The second stage determines whether or not there are multiple illuminants in the scene. This decision is taken on the basis of a statistical analysis of the local estimates produced by the first stage. When multiple illuminants are detected, the local estimates can be directly used as the final output of the whole method.
The optional third stage is applied when the second one determines that the scene has been taken under a single illuminant. In this case it is better to aggregate the local estimates into a single prediction. For this purpose, in our previous work  we experimented with the mean and the per-channel median operators. In this work we propose a local-to-global aggregation procedure based on supervised learning. In more detail, statistical features are extracted from the local estimates, and then fed to a non-linear mapping whose output is the final global estimate of the color of the illuminant. Unlike the first stage, this stage is image based. Therefore, its complexity is limited by the small number of annotated images. For this reason, instead of using a deep learning approach, we adopted a “shallow” non-linear regression scheme. Figure 1 shows a schematic view of the proposed method.
3.1 Local illuminant estimation
In the first stage a convolutional neural network produces local estimates of the illuminant. The network, described in greater detail in Section 6, takes as input non-overlapping patches whose histogram has previously been stretched, so that the output estimate is invariant with respect to the local contrast. The network is composed of the following sequence of layers (see also Figure 2 for a graphical representation):
input RGB patches of size $32 \times 32 \times 3$;
a bank of 240 convolutional filters of size $1 \times 1 \times 3$ producing an output of size $32 \times 32 \times 240$;
downsampling via an $8 \times 8$ max pooling layer to a size of $4 \times 4 \times 240$;
reshaping of the result of pooling into a 3840-dimensional vector;
a linear layer producing a 40-dimensional feature vector;
a ReLU activation function;
a linear layer producing the output RGB estimate.
Taking into account all the linear coefficients and the biases, the network includes a total of 154,723 parameters that have been learned by applying the standard back propagation algorithm to minimize the average squared Euclidean difference between the estimated and the ground truth illuminant colors (we also tried to minimize the cosine loss, without any improvement). Besides its size, compared to the networks used for scene and object recognition we notice two major differences: (i) the $1 \times 1$ convolutional filters, and (ii) the large pooling. These differences can be motivated by considering that, with respect to object/scene recognition, illuminant estimation is a dual problem: instead of trying to identify the content of the image regardless of the illuminant, here we need to estimate the illuminant regardless of the content of the image. A detailed interpretation of the model from a color constancy point of view is given in Section 6.1.
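The layer sequence and the parameter count can be checked with a small NumPy forward pass. This is only a sketch with random weights, not the trained Caffe model. The 240 filters, the 3840-dimensional vector, the 40-dimensional hidden layer and the total of 154,723 parameters come from the text; a 1 x 1 filter size and a 4 x 4 pooled map are the only choices consistent with those numbers, while the 32 x 32 patch size with 8 x 8 pooling and the min-max form of the histogram stretching are assumptions made here.

```python
import numpy as np

PATCH, FILTERS, POOL, HIDDEN = 32, 240, 8, 40   # PATCH and POOL are assumed values

def init_params(rng):
    """Random weights with the shapes implied by the layer list."""
    pooled = (PATCH // POOL) ** 2 * FILTERS      # 4 * 4 * 240 = 3840
    return {
        "conv_w": rng.normal(0, 0.1, (3, FILTERS)), "conv_b": np.zeros(FILTERS),
        "fc1_w": rng.normal(0, 0.01, (pooled, HIDDEN)), "fc1_b": np.zeros(HIDDEN),
        "fc2_w": rng.normal(0, 0.1, (HIDDEN, 3)), "fc2_b": np.zeros(3),
    }

def extract_patches(image):
    """Split an H x W x 3 image into non-overlapping PATCH x PATCH tiles."""
    h, w = image.shape[0] // PATCH, image.shape[1] // PATCH
    tiles = image[:h * PATCH, :w * PATCH].reshape(h, PATCH, w, PATCH, 3)
    return tiles.transpose(0, 2, 1, 3, 4)        # (h, w, PATCH, PATCH, 3)

def stretch(patch):
    """Assumed form of histogram stretching: rescale the patch to [0, 1]."""
    lo, hi = patch.min(), patch.max()
    return (patch - lo) / (hi - lo) if hi > lo else np.zeros_like(patch)

def forward(patch, params):
    """Per-patch RGB illuminant estimate."""
    x = stretch(np.asarray(patch, dtype=float))              # (32, 32, 3)
    x = x @ params["conv_w"] + params["conv_b"]              # 1x1 conv -> (32, 32, 240)
    n = PATCH // POOL
    x = x.reshape(n, POOL, n, POOL, FILTERS).max(axis=(1, 3))  # max pool -> (4, 4, 240)
    x = x.reshape(-1)                                        # 3840-dimensional vector
    x = np.maximum(x @ params["fc1_w"] + params["fc1_b"], 0)   # linear + ReLU -> 40
    return x @ params["fc2_w"] + params["fc2_b"]             # linear -> RGB estimate

params = init_params(np.random.default_rng(0))
n_params = sum(v.size for v in params.values())              # 960 + 153640 + 123
```

With these shapes `n_params` equals 154,723: 960 parameters in the convolutional layer, 153,640 in the first linear layer and 123 in the output layer.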
3.2 Detection of multiple illuminants
Since our CNN is applied to each patch independently, it can be easily used to predict local illuminants. However, local estimates tend to be noisy and sometimes (when there is a single illuminant, or when the color of all the light sources is very similar) it is better to replace them with a single global estimate. What we need is an automatic rule to switch between the two modalities. In order to decide if the image contains single or multiple illuminants, the per-patch illuminant estimates are normalized and projected onto the normalized chromaticity plane. Then, an efficient 2D kernel density estimation (KDE)  is applied. The modes, i.e. the red/blue chromaticities (the green channel is scaled to one) with the highest densities, are identified using scale-space filtering . Only the modes whose density is higher than a fixed fraction of the maximum density are retained.
The angular difference between each pair of the retained modes is computed. If the maximum difference exceeds a set threshold then the scene is considered as taken under multiple illuminants. Otherwise, we proceed by assuming the presence of a single illuminant. Following [35, 20] we set the threshold to 3 degrees, since it has been judged to be a noticeable but acceptable difference.
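The decision rule can be sketched as follows. This is a simplified stand-in and not the detector used in the paper: the KDE is a plain Gaussian kernel sum evaluated on a regular grid, and modes are taken as local maxima of that grid rather than being found by scale-space filtering. The bandwidth, the grid resolution and the retained fraction (`keep_fraction`) are hypothetical values; only the 3 degree threshold comes from the text.

```python
import numpy as np

def angle_deg(a, b):
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def has_multiple_illuminants(estimates, bandwidth=0.03, keep_fraction=0.5,
                             threshold_deg=3.0, grid=64):
    """estimates: N x 3 array of per-patch RGB illuminant estimates."""
    est = np.asarray(estimates, dtype=float)
    rb = est[:, [0, 2]] / est[:, [1]]          # red/blue chromaticities, green scaled to one
    lo = rb.min(axis=0) - 3 * bandwidth
    hi = rb.max(axis=0) + 3 * bandwidth
    r_axis = np.linspace(lo[0], hi[0], grid)
    b_axis = np.linspace(lo[1], hi[1], grid)
    # Gaussian KDE on the grid; the isotropic kernel factorizes over the two axes.
    kr = np.exp(-0.5 * ((r_axis[:, None] - rb[:, 0]) / bandwidth) ** 2)   # grid x N
    kb = np.exp(-0.5 * ((b_axis[:, None] - rb[:, 1]) / bandwidth) ** 2)   # grid x N
    density = kr @ kb.T                                                    # grid x grid
    # Local maxima: cells that are not smaller than any of their 8 neighbours.
    padded = np.pad(density, 1, constant_values=-np.inf)
    neighbours = [padded[1 + di:1 + di + grid, 1 + dj:1 + dj + grid]
                  for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    is_peak = np.all([density >= n for n in neighbours], axis=0)
    # Retain only the modes that are high enough relative to the strongest one.
    is_peak &= density >= keep_fraction * density.max()
    modes = [np.array([r_axis[i], 1.0, b_axis[j]]) for i, j in zip(*np.nonzero(is_peak))]
    worst = max((angle_deg(a, b) for a in modes for b in modes), default=0.0)
    return bool(worst > threshold_deg)
```

When the function returns `False` the local estimates would be passed on to the aggregation stage; when it returns `True` they would be kept as the spatially varying output.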
3.3 Local to global aggregation of the estimates
In our previous work  we generated a single illuminant estimation per image by pooling the predicted illuminants on the image patches. By taking image patches as input, we have a much larger number of training samples compared to using the whole image on a given dataset, which particularly meets the needs of CNNs, but we lose the information that certain patches belong to the same image. Thus, we fine-tuned the learned net by adding knowledge about the way local estimates are pooled to generate a single global estimate for each image.
In this work we extend the per-channel average and median pooling operators used in our previous work with a non-linear mapping based on an RBF kernel over local statistics of the CNN estimates. The parameters of the mapping are obtained by applying a regression procedure that minimizes the median angular error on the training set. Given as input the map of the per-patch illuminant estimates, the first step in this module is smoothing via convolution with a Gaussian filter. The response is then independently pooled in three different ways: average pooling and standard deviation pooling, both on a $3 \times 3$ grid (i.e. on a subdivision into nine rectangular regions), and median pooling on a $1 \times 1$ grid (i.e. on the whole image). These values are reshaped and given as input to an SVR (with RBF kernel) which predicts the global illuminant by minimizing the median angular error over the training set. The architecture of this module is reported in Figure 3.
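The feature extraction that precedes the regressor can be sketched in NumPy as below. The smoothing scale (`sigma`) and the border handling are assumptions, and the regressor itself is not shown; the sketch only illustrates how smoothing followed by the three pooling operators turns a map of per-patch estimates into a fixed-length vector.

```python
import numpy as np

def gaussian_smooth(est_map, sigma=1.0):
    """Separable Gaussian smoothing of an H x W x 3 map (edge-replicated borders)."""
    radius = max(1, int(round(3 * sigma)))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    out = np.asarray(est_map, dtype=float)
    for axis in (0, 1):
        pad = [(radius, radius) if a == axis else (0, 0) for a in range(3)]
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)
    return out

def pooled_features(est_map, sigma=1.0, grid=3):
    """Mean and standard deviation on a grid x grid subdivision, plus the global median."""
    smooth = gaussian_smooth(est_map, sigma)
    h, w, _ = smooth.shape
    row_edges = np.linspace(0, h, grid + 1).astype(int)
    col_edges = np.linspace(0, w, grid + 1).astype(int)
    means, stds = [], []
    for i in range(grid):
        for j in range(grid):
            cell = smooth[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]].reshape(-1, 3)
            means.append(cell.mean(axis=0))
            stds.append(cell.std(axis=0))
    median = np.median(smooth.reshape(-1, 3), axis=0)
    # 9 regions x 3 channels for each of mean and std, plus 3 medians = 57 values.
    return np.concatenate([np.ravel(means), np.ravel(stds), median])
```

The resulting 57-dimensional vector has the same length for every image regardless of the number of patches, which is what allows a single image-level regressor to be trained on it.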
In Figure 4 the output of each stage of the proposed illuminant estimation method is shown in the case of multiple and single illuminants.
[Figure 4. Columns, from left to right: input image, patch subdivision, local illuminant estimate, KDE, final illuminant estimate, corrected image.]
4 Experimental Setup
The aim of this section is to investigate if the proposed algorithm can outperform state-of-the-art algorithms in the single and multiple illuminant estimation on standard datasets of RAW images.
4.1 Image Datasets and Evaluation Procedure
To test the performance of the proposed algorithm for the global illuminant estimation, two standard datasets of RAW camera images having a known color target are used. In the first dataset, images have been captured using high-quality digital SLR cameras in RAW format, and are therefore free of any color correction. The dataset  was originally available in sRGB format, but Shi and Funt  reprocessed the raw data to obtain linear images with a higher dynamic range (14 bits as opposed to standard 8 bits). The dataset has been acquired using Canon 5D and Canon 1D DSLR cameras and consists of a total of 568 images. The Macbeth ColorChecker (MCC) chart is included in every scene, which allows the actual illuminant of each acquired image to be estimated accurately. The second dataset is the NUS dataset . The dataset is similar to the previous one: it has been captured using digital SLR cameras in RAW format with an MCC included in every scene. The differences with the previous dataset are that it has been captured by 9 different cameras (Canon 1Ds Mk III, Canon 600D, Fujifilm X-M1, Nikon D5200, Olympus E-PL6, Panasonic Lumix DMC-GX1, Samsung NX2000, Sony SLT-A57 and Nikon D40) and that there is a larger number of images, i.e. 1853 with around 200 images for each camera.
To test the performance of the proposed algorithm for the multiple illuminant estimation, three different datasets have been used. The first one is synthetically generated from the Gehler-Shi dataset: each image is relighted using two, three and four random illuminants taken from the same dataset. This synthetic dataset thus contains a total of 1704 images. The second dataset used is a subset of the Milan portrait dataset . It has been acquired in RAW format using four different DSLR cameras: Canon 40D, Canon 350D, Canon 400D, and Nikon D700. The dataset is the union of different subsets that have been acquired in three different world locations: Italy, Taiwan, and Japan. The dataset ranges from portraits of a single person with a single MCC up to multiple persons with multiple MCCs. In this work we used the subset containing multiple MCCs, for a total of 197 images. Finally, the third one is the multiple illuminant dataset by Beigpour et al. . It has been acquired using a Sigma SD10 single-lens reflex (SLR) digital camera which uses a Foveon X3 sensor and is available in linear RAW format. The dataset consists of two parts: the first one is taken in a controlled laboratory setting for a total of 10 scenes taken under six distinct illumination conditions; the second one is taken in an uncontrolled setting for a total of 20 indoor and outdoor scenes. The dataset comes with pixel-wise ground truth information.
The network has been trained on the Gehler-Shi dataset and adapted to the other datasets by re-training each time the local-to-global regressor to cope with the different cameras and sensor type used.
Examples of images within the datasets considered are reported in Figure 5.
4.1.1 Relighted Gehler-Shi dataset
We synthetically generated a relighted version of the Gehler-Shi dataset: each image is balanced using the corresponding ground truth illuminant and relighted using two, three and four random illuminants taken from the original dataset. Their positions in the image were set randomly, with the constraint of a minimum mutual distance defined in terms of the image width and height. The ground truth for each image has been generated by nearest-neighbor assignment followed by Gaussian smoothing to simulate illuminant mixing. This synthetic dataset thus contains a total of 1704 images. The average maximum angular distance among the illuminants in each image is 8.6, 12.2 and 14.8 degrees for the subsets relighted with two, three, and four illuminants respectively.
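The relighting procedure can be illustrated with the sketch below. It covers only the white balancing and the nearest-neighbor assignment; the Gaussian smoothing of the illuminant map and the minimum-distance constraint on the light positions are left out, and the function names are chosen here for illustration.

```python
import numpy as np

def nearest_illuminant_map(height, width, centers, illuminants):
    """H x W x 3 map giving each pixel the illuminant of the nearest light position."""
    ys, xs = np.mgrid[0:height, 0:width]
    centers = np.asarray(centers, dtype=float)             # K x 2, as (row, col)
    d2 = (ys[..., None] - centers[:, 0]) ** 2 + (xs[..., None] - centers[:, 1]) ** 2
    return np.asarray(illuminants, dtype=float)[d2.argmin(axis=-1)]

def relight(image, original_illuminant, illum_map):
    """White-balance with the ground truth illuminant, then apply the per-pixel map."""
    balanced = np.asarray(image, dtype=float) / np.asarray(original_illuminant, dtype=float)
    return balanced * illum_map
```

In the full procedure the map returned by `nearest_illuminant_map` would be smoothed before being passed to `relight`, so that the illuminants blend across region borders, and the same smoothed map would serve as the pixel-wise ground truth.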
4.2 Benchmark algorithms
Different benchmarking algorithms for color constancy are considered. Since each image of the dataset contains only one MCC, only global color constancy algorithms based on the assumption of uniform illumination can be compared. Six of them are generated varying the three variables $(n, p, \sigma)$ in Equation 4, and correspond to well known and widely used illuminant estimation algorithms. The values chosen for $(n, p, \sigma)$ are reported in Table I and set as in . The algorithms are used in the original authors’ implementation which is freely available online (http://lear.inrialpes.fr/people/vandeweijer/code/ColorConstancy.zip). The seventh algorithm is the pixel-based Gamut Mapping . The value chosen for $\sigma$ is also reported in Table I. The other algorithms considered are illumination chromaticity estimation via Support Vector Regression (SVR ); the Bayesian (BAY ); the Natural Image Statistics (NIS ); the High Level Visual Information : bottom-up (HLVI BU), top-down (HLVI TD), and their combination (HLVI BU&TD); the Spatio-Spectral statistics : with Maximum Likelihood estimation (SS ML), and with General Priors (SS GP); the Automatic color constancy Algorithm Selection (AAS)  and the Automatic Algorithm Combination (AAC) ; the Exemplar-Based color constancy (EB) ; the Face-Based (FB) color constancy algorithm  using GM or SS ML when no faces are detected; the CNN-based algorithms  and the AlexNet fine-tuned with a linear Support Vector Regression (SVR)  to estimate the illuminant color for each image  (AlexNet+SVR); the ensemble of regression trees applied to simple color features  (SF); the corrected-moment illuminant estimation  (CM); the one predicting chromaticity from pixel luminance (PCL) ; the one exploiting bright pixels (BP)  and the one exploiting both bright and dark pixels (BDP) .
|Algorithm||$n$||$p$||$\sigma$|
|Gray World (GW)||0||1||0|
|White Patch (WP)||0||$\infty$||0|
|Shades of Gray (SoG)||0||4||0|
|general Gray World (gGW)||0||9||9|
|1st-order Gray Edge (GE1)||1||1||6|
|2nd-order Gray Edge (GE2)||2||1||1|
|Gamut Mapping (GM)||0||0||4|
The last algorithm considered is the Do Nothing (DN) algorithm which gives the same estimation for the color of the illuminant ($\mathbf{I} = [1, 1, 1]$) for every image, i.e. it assumes that the image is already correctly balanced.
4.3 Learning of the main modules
We train our CNN on patches randomly taken from training images of the Gehler-Shi dataset in RAW format (patches including portions of the reference MCC are excluded from training). Images have been resized to pixels. The net is learned using a three-fold cross validation on the folds provided with the dataset: for each run one is used for training, one for validation and the remaining one for test. For training, we assign to each patch the ground truth illuminant of the image to which it belongs. At testing time, we generate a single illuminant estimation per image by pooling the predicted patch illuminants. By taking image patches as input, we have a much larger number of training samples compared to using the whole image on a given dataset, which particularly meets the needs of CNNs. Net parameters have been learned using Caffe  with Euclidean loss.
The learned net is then applied to each whole image in the training set, masking the MCC, to obtain an illuminant estimation map. The pooled features computed from these maps are the input to our local-to-global regressor, which gives a single global illuminant estimate for each image. We train our regressor using the same three-fold cross validation as before, using an SVR  with RBF kernel in which we used a modified cost function to minimize the median angular distance between illuminant estimates and ground truths. The regressor is able to give a more accurate global estimate than a simple average or median pooling  for two reasons: (i) it is learning-based and is able to leverage the different local estimates coming from the patches belonging to the same image; (ii) it is trained by explicitly minimizing the error metric used in the evaluation of illuminant estimation methods.
5 Results and Discussion
We evaluated the proposed method in both single and multiple illuminant estimation.
5.1 Global illuminant estimation
In Table II the median, the average, the 90-percentile, and the maximum of the angular errors obtained by the considered state-of-the-art algorithms and the proposed approach on the Gehler-Shi dataset are reported.
|Method||Median||Average||90th percentile||Maximum|
|HLVI BU ||2.54||3.30||6.59||17.51|
|HLVI TD ||2.63||3.65||7.53||25.24|
|HLVI BU&TD ||2.47||3.38||6.97||25.24|
|SS ML ||2.93||3.55||7.23||15.25|
|SS GP ||2.90||3.47||7.00||14.80|
|FB+SS GP ||2.57||3.18||6.67||14.80|
|CNN per patch ||2.69||3.67||7.79||30.93|
|CNN average-pooling ||2.44||3.18||6.37||14.84|
|CNN median-pooling ||2.32||3.07||6.15||19.04|
|CNN fine-tuned ||1.98||2.63||5.54||14.77|
|Proposed single estimate||1.44||2.36||5.72||16.98|
The table is divided into three blocks and for each of them the best result for each statistic is reported in bold. The first block includes statistic-based algorithms, the second one learning-based algorithms, and the third one the different variants of the proposed approach.
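The four statistics reported in Table II can be computed as in the following sketch. It assumes the usual definition of the angular error (the angle between the estimated and the ground-truth illuminant vectors) and NumPy's default linear interpolation for the 90th percentile; the function names are illustrative.

```python
import numpy as np

def angular_error_degrees(estimate, ground_truth):
    """Angle, in degrees, between estimated and ground-truth illuminants."""
    e = np.asarray(estimate, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    cos = np.sum(e * g, axis=-1) / (
        np.linalg.norm(e, axis=-1) * np.linalg.norm(g, axis=-1))
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def summarize_errors(errors):
    """Median, average, 90th percentile and maximum of a set of errors."""
    errors = np.asarray(errors, dtype=float)
    return {
        "median": float(np.median(errors)),
        "average": float(errors.mean()),
        "p90": float(np.percentile(errors, 90)),
        "maximum": float(errors.max()),
    }

summary = summarize_errors([1.0, 2.0, 3.0, 4.0, 5.0])
```

The angular error is invariant to the scale of either vector, so only the color of the illuminant, not its intensity, is evaluated.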
From the results it is possible to see that the deep CNN pre-trained on ILSVRC 2012  coupled with SVR (i.e. AlexNet+SVR) is already able to outperform most statistic-based algorithms and some learning-based ones. The CNN introduced in our previous work , in its various instantiations, made it possible to obtain a median angular error below 2 degrees, which is better than almost all the other methods considered. Even better results have been obtained with the recent method by Cheng et al.  for which the median error is 1.65 degrees. The method proposed here obtained the lowest error (1.44 degrees if we consider the median). The ranking of the algorithms does not change if we consider the mean error instead of the median; the best maximum error, instead, has been obtained by the fine-tuned CNN.
Note that for this experiment we did not apply the multiple illuminant detection module and we always performed the local-to-global aggregation. This last step brings a significant improvement. In fact, without it the median error rises by more than one degree, reaching the 2.69 degrees corresponding to the “CNN per patch” result. It is also a significant improvement with respect to the other aggregation methods considered in our previous work: average pooling, median pooling and fine tuning, which obtained median errors of 2.44, 2.32 and 1.98 degrees, respectively. Figure 6 shows the distribution of the angular errors obtained with and without the local-to-global regressor. It is possible to see how the introduction of the aggregation module pushes the angular error distribution towards zero.
Figure 7 reports some examples of images on which the proposed illuminant estimation method makes the largest errors. Even though the patches overlapping the MCC are ignored during the illuminant estimation phase, they are left unmasked in the figure to better appreciate the results. Once we have an estimate of the global illuminant color, each pixel in the image is color corrected using the von Kries model , i.e., each color channel is scaled by the reciprocal of the corresponding component of the estimated illuminant.
|Input image||Ground truth ()||Proposed (16.98)||AAS (1.48)|
|Input image||Ground truth ()||Proposed (14.77)||GM (0.82)|
|Input image||Ground truth ()||Proposed (14.29)||FB+GM (0.27)|
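The von Kries correction mentioned before Figure 7 amounts to a diagonal (per-channel) scaling. A minimal sketch follows; the brightness-preserving normalization of the gains is an assumption made for this example and is not taken from the text.

```python
import numpy as np

def von_kries_correct(image, illuminant):
    """Diagonal color correction: divide each channel by the illuminant.

    image:      H x W x 3 array
    illuminant: length-3 global estimate, or H x W x 3 per-pixel estimates
    """
    image = np.asarray(image, dtype=float)
    ill = np.asarray(illuminant, dtype=float)
    # Assumption: rescale the gains so that overall brightness is roughly kept.
    gains = ill.mean(axis=-1, keepdims=True) / ill
    return image * gains

# A gray surface (reflectance 0.5) seen under a colored illuminant.
illuminant = np.array([0.8, 1.0, 0.6])
observed = np.full((2, 2, 3), 0.5) * illuminant
corrected = von_kries_correct(observed, illuminant)
```

When the estimate equals the true illuminant, the gray surface becomes achromatic again. Passing a per-pixel map instead of a single vector gives a spatially varying correction with the same code.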
In Table III the median angular errors obtained by the considered state-of-the-art algorithms and the proposed approach on the NUS dataset are reported. As commonly done, results are reported separately for each camera. From the results it is possible to notice that our method outperforms the other algorithms on all cameras, with an average improvement over the best algorithm of 0.35 degrees, corresponding to 15.8%.
|GW||WP||SoG||gGW||BP||GE1||GE2||GM(P)||GM(E)||GM(I)||BAY||SS ML||SS GP||NIS||BDP||Prop.|
5.2 Local illuminant estimation
Our CNN predicts the illumination on small image patches, so it can be easily used to predict local illuminants as well as to give a global illuminant estimate for the entire image. Given the per-patch error in Table II, we expect our CNN to perform well even on local estimation. We perform here a preliminary test by using our learned CNN as-is on the synthetically relighted Gehler-Shi dataset.
Among the algorithms in the state-of-the-art able to deal with non-uniform illumination, e.g. [9, 44, 28, 29, 22, 20] we report as comparison the results of the Multiple Light Sources (MLS)  using White Patch (WP) and Gray World (GW) algorithms, grid based sampling, in the clustering version setting the number of clusters equal to the number of lights in the scene. The numerical results are reported in Table IV, while some examples are given in Figure 8. It is clear that the proposed method obtains significantly better results than all the other methods considered; the second best obtained a median error (5.92 degrees) about twice that of the proposed one (2.96 degrees).
|Proposed multiple estimate||2.96||3.75||6.79||23.87|
|Input image||Local illuminant estimate||Ground truth||Angular error map||Corrected image|
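A spatially varying illuminant map such as the one shown in Figure 8 can be obtained by applying a per-patch estimator on a regular grid. In the sketch below the trained CNN is replaced by a Gray World stand-in, and the patch size and the non-overlapping grid are hypothetical choices made only to keep the example self-contained.

```python
import numpy as np

def local_illuminant_map(image, patch_estimator, patch_size=32):
    """Apply a per-patch estimator on a non-overlapping grid of patches."""
    h, w, _ = image.shape
    rows, cols = h // patch_size, w // patch_size
    est_map = np.zeros((rows, cols, 3))
    for r in range(rows):
        for c in range(cols):
            patch = image[r * patch_size:(r + 1) * patch_size,
                          c * patch_size:(c + 1) * patch_size]
            est = np.asarray(patch_estimator(patch), dtype=float)
            est_map[r, c] = est / np.linalg.norm(est)  # keep only the color
    return est_map

def gray_world_patch(patch):
    """Stand-in for the learned network: average color of the patch."""
    return patch.reshape(-1, 3).mean(axis=0)

# Gray scene lit by a reddish light on the left and a bluish one on the right.
red = np.array([0.9, 0.6, 0.4])
blue = np.array([0.4, 0.6, 0.9])
scene = np.empty((64, 128, 3))
scene[:, :64] = 0.5 * red
scene[:, 64:] = 0.5 * blue
est_map = local_illuminant_map(scene, gray_world_patch)
```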
Note that this comparison has been made by disabling the detection of multiple illuminants and by always taking the local estimates. In a further experiment we evaluated the performance in a mixed single/multi illuminant scenario. The dataset used is the single illuminant version of the Gehler-Shi and one-third of the synthetically relighted version so that the numbers of images having single and multiple illuminants are equal. The numerical results are reported in Table V, where the performance of the four variants of the proposed method is reported: i) single illuminant, which always applies the local-to-global regressor; ii) multi illuminant, which always keeps the local estimates; iii) fully automatic, which uses the multiple illuminant detector to decide whether or not the local-to-global regressor must be applied; iv) oracle, which applies the local-to-global regressor only when, according to the ground truth, the image presents a single illuminant. The results obtained show that the use of the multiple illuminant detector yields better results than adopting a single strategy. Its performance is very close to that which can be obtained by exploiting the ground truth information about the presence of single or multiple illuminants (i.e. the oracle version).
The first experiment on real world data is performed on the subset of the Milan portrait dataset containing multiple MCCs. The numerical results are reported in Table VI, where the performance of the proposed method is reported with the multiple illuminant detector enabled to decide whether or not the local-to-global regressor must be applied. The results obtained show that the proposed method performs better than all the single illuminant estimation algorithms as well as all the general purpose multiple illuminant estimation ones. The only algorithm able to outperform the one proposed here is the face-based , which is specifically designed to leverage skin properties in images containing faces.
An example taken from the Milan portrait dataset is reported in Figure 9. Since ground truth illuminant is available only on the MCCs, pixel-level ground truth is obtained by linear interpolation. As usual, MCCs are ignored during illuminant estimation but are left unmasked in the figure to better understand the results.
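The four variants differ only in how they choose between the local estimates and the aggregated one. The selection logic can be sketched as follows; the aggregation function (a plain mean) and the detector (a threshold on the spread of the local estimates) are simple stand-ins for the SVR regressor and the KDE-based detector, and all names and the threshold value are hypothetical.

```python
import numpy as np

def estimate_illuminant(local_map, aggregate, has_multiple_illuminants,
                        variant="automatic", oracle_is_multi=None):
    """Return a rows x cols x 3 map according to the chosen variant."""
    if variant == "single":
        use_local = False
    elif variant == "multi":
        use_local = True
    elif variant == "automatic":
        use_local = bool(has_multiple_illuminants(local_map))
    elif variant == "oracle":
        if oracle_is_multi is None:
            raise ValueError("the oracle variant needs the ground-truth flag")
        use_local = bool(oracle_is_multi)
    else:
        raise ValueError(f"unknown variant: {variant}")
    if use_local:
        return local_map
    global_estimate = np.asarray(aggregate(local_map), dtype=float)
    # A single illuminant is replicated over the whole map.
    return np.broadcast_to(global_estimate, local_map.shape).copy()

def mean_aggregate(local_map):
    """Stand-in for the local-to-global regressor."""
    return local_map.reshape(-1, 3).mean(axis=0)

def spread_detector(local_map, threshold=0.05):
    """Stand-in for the multiple illuminant detector."""
    return float(local_map.reshape(-1, 3).std(axis=0).max()) > threshold

uniform = np.tile(np.ones(3) / np.sqrt(3), (2, 4, 1))
mixed = uniform.copy()
mixed[:, :2] = np.array([0.9, 0.6, 0.4])
mixed[:, 2:] = np.array([0.4, 0.6, 0.9])
auto_mixed = estimate_illuminant(mixed, mean_aggregate, spread_detector)
auto_uniform = estimate_illuminant(uniform, mean_aggregate, spread_detector)
single_mixed = estimate_illuminant(mixed, mean_aggregate, spread_detector,
                                   variant="single")
```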
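The text does not detail the interpolation scheme used to obtain the pixel-level ground truth. One possible reading, for the case of two MCCs, is sketched below: each pixel is projected onto the segment joining the two chart centers and the two measured illuminants are blended linearly. The function name and the projection step are assumptions.

```python
import numpy as np

def interpolate_ground_truth(shape, pos_a, ill_a, pos_b, ill_b):
    """Per-pixel illuminant obtained by linear interpolation between two MCCs.

    shape:        (H, W) of the image
    pos_a, pos_b: (row, col) centers of the two charts
    ill_a, ill_b: length-3 illuminants measured on the two charts
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    pa = np.asarray(pos_a, dtype=float)
    pb = np.asarray(pos_b, dtype=float)
    d = pb - pa
    # Normalized position of each pixel along the segment from A to B.
    t = ((ys - pa[0]) * d[0] + (xs - pa[1]) * d[1]) / np.dot(d, d)
    t = np.clip(t, 0.0, 1.0)[..., None]
    return (1.0 - t) * np.asarray(ill_a, float) + t * np.asarray(ill_b, float)

gt_map = interpolate_ground_truth(
    (5, 11), (2, 0), [1.0, 0.5, 0.2], (2, 10), [0.2, 0.5, 1.0])
```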
|MLS + WP ||3.21||4.04||7.55||17.19|
|MLS + GW ||3.33||4.18||8.82||17.97|
|Fusion Grad. Tree Boost. ||4.48||5.29||9.95||31.26|
|Fusion Rand. Forest Regr. ||3.23||3.96||7.61||27.76|
|Proposed (fully automatic)||2.75||3.30||6.24||15.22|
|Input image||correction using the ground truth||correction using Face-based ||correct. using the proposed meth.|
|illuminant ground truth||Face-based illuminant estimate||our illuminant estimate||angular error|
The last experiment concerning local illuminant estimation is performed on the multiple illuminant dataset by Beigpour et al. The numerical results are reported in Table VII, where the performance of the proposed method is reported with the multiple illuminant detector enabled to decide whether or not the local-to-global regressor must be applied. The results are reported separately for the laboratory and real-world settings. In both cases the results obtained show that the proposed method performs better than all the algorithms considered, with an average reduction of the median error of almost 14%.
6 Network architecture
In this section we discuss the design of the network, how its performance is affected by the parameters, and how we can relate the behavior of the learned model to that of other methods for computational color constancy.
The architecture of the network has been designed by starting from a deep CNN similar to the LeNet  and by removing layers until no further improvement in performance was possible. The final model is a simplified convolutional neural network with a single convolutional layer, max pooling, and two fully connected layers. Unlike in other computer vision tasks, deepening the network causes slightly worse results. This probably depends on the small variability in content provided by the annotated data sets for computational color constancy. In fact, our training patches come from a few hundred images, while deep networks are often trained on millions of annotated images.
The performance of the network is quite robust with respect to its parameters, as shown in Figure 10, which reports the variation in accuracy as a function of the size of the input patches, of the width and number of convolutional kernels, of the size of the receptive field of the pooling units, and of the number of fully connected units in the second to last layer. The plots are obtained by changing one parameter at a time while setting all the others as in the optimal configuration. In additional tests, not reported here, we also measured the performance obtained by varying multiple parameters without obtaining any surprising result.
The most striking element of the final network is the use of 1×1 “convolutional” units. At first this could be surprising, since in different domains larger kernels are preferred. However, it is not the first time that such small kernels are used, see . In our case, networks built with larger convolutions failed to reproduce the spatial filters (edge detectors etc.) that are usually observed in CNNs trained for image classification. From the color constancy point of view, this choice of kernel size seems to confirm the finding by Cheng et al. , that local spatial information does not provide any additional information that cannot be obtained directly from the color distributions. The number of the convolutional kernels seems less important and we found that the optimal value was around 240.
Another interesting element is represented by the relatively large () receptive fields of the pooling units. As a consequence the max pooling layer strongly reduces the dimensionality of the incoming data, while retaining just some spatial information. Smaller receptive fields resulted in a decrease in the performance of the network. We observe a sort of duality with respect to the parameters used for CNNs for image classification that usually prefer large convolutional kernels and small pooling units.
Concerning the remaining parameters, we found that the optimal number of fully-connected units was intermediate (40) and that the network prefers large patches over smaller ones.
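The forward pass of a network with this structure (one convolutional layer with 240 kernels, max pooling, a fully connected layer with 40 units, and a 3-valued output) can be sketched in NumPy as follows. The 32×32 patch size, the 8×8 pooling window, the 1×1 kernel size, the ReLU non-linearities and the random weights are all assumptions made for illustration; this is not the trained Caffe model.

```python
import numpy as np

def init_params(rng, n_kernels=240, n_hidden=40, patch=32, pool=8):
    """Random weights with the layer sizes described in the text."""
    grid = patch // pool  # pooled spatial resolution (hypothetical)
    return {
        "conv_w": rng.normal(0.0, 0.1, (3, n_kernels)),
        "conv_b": np.zeros(n_kernels),
        "fc1_w": rng.normal(0.0, 0.01, (grid * grid * n_kernels, n_hidden)),
        "fc1_b": np.zeros(n_hidden),
        "fc2_w": rng.normal(0.0, 0.1, (n_hidden, 3)),
        "fc2_b": np.zeros(3),
    }

def forward(patch, p, pool=8):
    """Map an H x W x 3 patch to a 3-component illuminant estimate."""
    h, w, _ = patch.shape
    # A 1x1 convolution is a per-pixel linear map of the RGB values.
    conv = np.maximum(patch @ p["conv_w"] + p["conv_b"], 0.0)  # ReLU assumed
    k = conv.shape[-1]
    # Non-overlapping max pooling over pool x pool windows.
    pooled = conv.reshape(h // pool, pool, w // pool, pool, k).max(axis=(1, 3))
    hidden = np.maximum(pooled.reshape(-1) @ p["fc1_w"] + p["fc1_b"], 0.0)
    return hidden @ p["fc2_w"] + p["fc2_b"]  # affine output layer

rng = np.random.default_rng(0)
params = init_params(rng)
estimate = forward(rng.random((32, 32, 3)), params)
```

With these hypothetical sizes the pooling layer reduces a 32×32×240 response volume to 4×4×240 values, which illustrates how strongly large pooling windows compress the data before the fully connected layers.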
6.1 Model interpretation
After training the network, we analyzed the resulting weights for the three layers with learning capabilities. The last layer maps the 40 intermediate values (“features”, in the following) to the three components of the illuminant estimated for the input patch. The transformation is affine and is represented by a matrix of coefficients and by 3 biases. A layer of this kind has already been shown to perform well by Funt et al. , where it was used to process the responses of indicator functions over a regular quantization of the image chromaticities. It is also similar to combinational methods  where the outputs of different color constancy algorithms are combined to give the final illuminant estimate.
Unlike the work by Funt et al. , our network exploits some spatial information encoded in the 40 features, which are computed as linear combinations of the 240 convolutions after they have been pooled according to a spatial grid. In fact, as noted by Gijsenij et al. , the use of spatial information brings an improvement over the application of color constancy to the entire image. To better understand the role of the 40 features, we report in Figure 11 the ten patches producing their highest values. The patches are taken from the first fold of the Gehler-Shi dataset and are shown after the stretching of the color channels. It can be seen how different neurons are activated by different kinds of patches. Some of them are specialized in finding uniform patches of a given dominant color (blue, red, green…) that often correspond to specific content in the input images (sky, vegetation…). Several neurons are able to identify highlights, an element that has been previously exploited for color constancy . There are also neurons specialized in detecting strong edges (that have also been used in the past ) and patches with complex textures. Figure 12 shows the 40 activations on the patches of a whole image, while those of five selected neurons on six different images are shown in Figure 13. These figures suggest that the network performs a rough analysis of the content of the image by identifying the main elements of the scene or by selecting elements that may be useful for the estimation of the illuminant. For instance, neuron #8 seems to fire on image edges, neuron #17 on highlights, neuron #22 on sky and bluish texture, neuron #27 on skin and orange/reddish texture, neuron #38 on vegetation and greenish texture. The use of semantic concepts shares some similarities with the work of van de Weijer et al.  where the illuminant is estimated by maximizing the likelihood of the colors associated to each semantic class.
|Input image||Neuron #8||Neuron #17||Neuron #22||Neuron #27||Neuron #38|
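Selecting the patches that produce the highest values of each feature, as done for Figure 11, reduces to a column-wise sort of the activation matrix. A small sketch, in which the function name and the toy activations are illustrative:

```python
import numpy as np

def top_patches_per_feature(activations, k=10):
    """Indices of the k patches with the highest activation for each feature.

    activations: N_patches x N_features array
    returns:     N_features x k array of patch indices, strongest first
    """
    order = np.argsort(-np.asarray(activations, dtype=float), axis=0)
    return order[:k].T

# Three patches, two features.
toy_activations = np.array([[0.1, 0.9],
                            [0.5, 0.2],
                            [0.3, 0.4]])
top2 = top_patches_per_feature(toy_activations, k=2)
```

For the network described here the activation matrix would have 40 columns, one per unit of the second to last layer.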
Finally, the first layer is of the convolutional kind, and it consists of 240 units with 1×1 kernels. The activation of each convolutional unit can be seen as the projection over a specific direction in the RGB cube. Note that while 1×1 convolutions do not exploit spatial information, they also preserve it unaltered for the subsequent layers. The combination of the 240 units forms a sort of “soft” quantization of the color space that can be combined by the pooling units to represent the local color distribution. Figure 14 shows how different regions of the RGB color cube activate the 240 convolutional units. Since each unit corresponds to a linear projection, the maximum activation always occurs on a vertex of the RGB cube (to improve visualization, cubes are rotated so that the region of maximum activation is always front-facing). It is possible to see that for all the eight vertices there are several units with high activations. In practice this means that the quantization learned by the network covers the whole color space instead of being focused on specific colors. Several units seem redundant as they activate in presence of very similar colors. We observed, in fact, small differences in performance when we reduced the number of convolutional units (see Figure 10 b).
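The observation that a linear unit always reaches its maximum on a vertex of the RGB cube can be checked numerically. In the sketch below the weights and the bias are arbitrary illustrative values, not learned ones; the maximizing vertex has a 1 in every channel with a positive weight and a 0 elsewhere.

```python
import numpy as np
from itertools import product

def maximizing_vertex(weights):
    """Vertex of the unit RGB cube that maximizes w . rgb + b.

    The bias shifts the value of the maximum but not its location. Channels
    with zero weight do not affect the activation, so they are set to 0 here.
    """
    return (np.asarray(weights, dtype=float) > 0).astype(float)

w = np.array([0.7, -0.2, 0.4])  # illustrative weights of one unit
b = 0.1                         # illustrative bias

# Brute-force search over the eight vertices of the cube.
vertices = np.array(list(product([0.0, 1.0], repeat=3)))
best_vertex = vertices[np.argmax(vertices @ w + b)]

# No color sampled inside the cube exceeds the activation at that vertex.
rng = np.random.default_rng(0)
interior = rng.random((10000, 3))
max_interior = float((interior @ w + b).max())
max_vertex = float(maximizing_vertex(w) @ w + b)
```

The same argument holds after a monotonic non-linearity such as a ReLU, since it does not change where the maximum is attained.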
In this work we have developed a CNN-based color constancy algorithm that combines feature learning and regression as a complete optimization process, which enables us to employ modern training techniques to boost performance. The network has been specially designed to work on image patches in order to estimate the local illuminant color. When our method detects a single illuminant in the image, the local estimates are given as input to a trained local-to-global regressor which is able to predict the global illuminant with a high accuracy. The experimental results showed that our algorithm improves the state-of-the-art performance on images with a single illuminant. Experiments on a synthetically relighted dataset with multiple illuminants showed that our method outperforms all the general purpose local illuminant estimation methods in the state of the art. Results are further confirmed on two real-world datasets with multiple illuminants, where our method is outperformed only by an illuminant estimation method exploiting the presence of faces. The results obtained suggest that a possible future research direction is that of feeding additional semantic information in the form of scene category or detected objects to further improve illuminant estimation performance.
Currently, our method is articulated in three separate steps. In the future we plan to merge them into a single estimation model. In order to allow the end-to-end learning of such a model, we are collecting a large dataset composed of RAW images having both single and multiple illuminants.
-  D. Lee and K. N. Plataniotis, “A taxonomy of color constancy and invariance algorithm,” in Advances in Low-Level Color Image Processing. Springer, 2014, pp. 55–94.
-  K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. L. Cun, “Learning convolutional feature hierarchies for visual recognition,” in Advances in neural information processing systems, 2010, pp. 1090–1098.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  S. Bianco, C. Cusano, and R. Schettini, “Color constancy using cnns,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015.
-  S. Hordley and G. Finlayson, “Re-evaluating color constancy algorithms,” Proc. of the 17th International Conference on Pattern Recognition, pp. 76–79, 2004.
-  B. Funt, K. Barnard, and L. Martin, “Is machine colour constancy good enough?” Proceedings of the 5th European Conference on Computer Vision, pp. 445–459, 1998.
-  J. van de Weijer, T. Gevers, and A. Gijsenij, “Edge-based color constancy,” IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2207–2214, 2007.
-  G. Buchsbaum, “A spatial processor model for object color perception,” J. Franklin Inst., vol. 310, pp. 1–26, 1980.
-  E. H. Land et al., “The retinex theory of color vision,” Scientific American, 1977.
-  D. A. Forsyth, “A novel algorithm for color constancy,” International Journal of Computer Vision, vol. 5, no. 1, pp. 5–36, 1990.
-  G. Finlayson, S. Hordley, and P. Hubel, “Color by correlation: a simple, unifying framework for color constancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1209–1221, 2001.
-  B. Funt, V. Cardei, and K. Barnard, “Learning color constancy,” in Color and Imaging Conference, vol. 1996, no. 1. Society for Imaging Science and Technology, 1996, pp. 58–60.
-  P. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp, “Bayesian color constancy revisited,” Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1–8, 2008.
-  S. Bianco, G. Ciocca, C. Cusano, and R. Schettini, “Improving color constancy using indoor-outdoor image classification,” IEEE Transactions on Image Processing, vol. 17, no. 12, pp. 2381–2392, 2008.
-  ——, “Automatic color constancy algorithm selection and combination,” Pattern recognition, vol. 43, no. 3, pp. 695–705, 2010.
-  A. Gijsenij and T. Gevers, “Color constancy using natural image statistics and scene semantics,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 4, pp. 687–698, 2011.
-  A. Chakrabarti, K. Hirakawa, and T. Zickler, “Color constancy with spatio-spectral statistics,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, (DOI:10.1109/TPAMI.2011.252).
-  J. van de Weijer, C. Schmid, and J. Verbeek, “Using high-level visual information for color constancy,” IEEE International Conference on Computer Vision (ICCV 2007), pp. 1 –8, 2007.
-  S. Bianco and R. Schettini, “Color constancy using faces,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 65–72.
-  ——, “Adaptive color constancy using faces,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, no. 8, pp. 1505–1518, 2014.
-  H. R. V. Joze and M. S. Drew, “Exemplar-based colour constancy.” in BMVC, 2012, pp. 1–12.
-  ——, “Exemplar-based color constancy and multiple illumination,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, no. 5, pp. 860–873, 2014.
-  G. D. Finlayson, “Corrected-moment illuminant estimation,” in Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013, pp. 1904–1911.
-  D. Cheng, B. Price, S. Cohen, and M. S. Brown, “Effective learning-based illuminant estimation using simple features,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
-  H. Drucker, C. J. Burges, L. Kaufman, A. Smola, V. Vapnik et al., “Support vector regression machines,” Advances in neural information processing systems, vol. 9, pp. 155–161, 1997.
-  A. Chakrabarti, “Color constancy by learning to predict chromaticity from luminance,” arXiv preprint arXiv:1506.02167, 2015.
-  M. Ebner, “Color constancy based on local space average color,” Machine Vision and Applications, vol. 20, no. 5, pp. 283–301, 2009.
-  M. Bleier, C. Riess, S. Beigpour, E. Eibenberger, E. Angelopoulou, T. Troger, and A. Kaup, “Color constancy and non-uniform illumination: Can existing algorithms work?” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 774–781.
-  A. Gijsenij, R. Lu, and T. Gevers, “Color constancy for multiple light sources,” Image Processing, IEEE Transactions on, vol. 21, no. 2, pp. 697–707, 2012.
-  E. Hsu, T. Mertens, S. Paris, S. Avidan, and F. Durand, “Light mixture estimation for spatially varying white balance,” in ACM Transactions on Graphics (TOG), vol. 27, no. 3. ACM, 2008, p. 70.
-  I. Boyadzhiev, K. Bala, S. Paris, and F. Durand, “User-guided white balance for mixed lighting conditions.” ACM Trans. Graph., vol. 31, no. 6, p. 200, 2012.
-  Z. I. Botev, J. F. Grotowski, D. P. Kroese et al., “Kernel density estimation via diffusion,” The Annals of Statistics, vol. 38, no. 5, pp. 2916–2957, 2010.
-  A. Witkin, “Scale-space filtering: a new approach to multi-scale description,” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 150–153, 1984.
-  S. D. Hordley, “Scene illuminant estimation: past, present, and future,” Color Research & Application, vol. 31, no. 4, pp. 303–314, 2006.
-  L. Shi and B. V. Funt, “Re-processed version of the gehler color constancy database of 568 images. [online]. available: http://www.cs.sfu.ca/ colour/data/. last access: Nov. 1, 2011.”
-  D. Cheng, D. K. Prasad, and M. S. Brown, “Illuminant estimation for color constancy: why spatial-domain methods work and the role of the color distribution,” JOSA A, vol. 31, no. 5, pp. 1049–1058, 2014.
-  S. Beigpour, C. Riess, J. van de Weijer, and E. Angelopoulou, “Multi-illuminant estimation with conditional random fields,” Image Processing, IEEE Transactions on, vol. 23, no. 1, pp. 83–96, 2014.
-  A. Gijsenij, T. Gevers, and J. van de Weijer, “Computational color constancy: survey and experiments,” IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2475–2489, 2011.
-  ——, “Generalized gamut mapping using image derivative structures for color constancy,” International Journal of Computer Vision, vol. 86, no. 2-3, pp. 127–139, 2010.
-  B. Funt and W. Xiong, “Estimating illumination chromaticity via support vector regression,” in Color and Imaging Conference, vol. 2004, no. 1. Society for Imaging Science and Technology, 2004, pp. 47–52.
-  H. R. V. Joze, M. S. Drew, G. D. Finlayson, and P. A. T. Rey, “The role of bright pixels in illumination estimation,” in Color and Imaging Conference, vol. 2012, no. 1. Society for Imaging Science and Technology, 2012, pp. 41–46.
-  J. von Kries, “Chromatic adaptation,” Festschrift der Albrecht-Ludwigs-Universität, pp. 145–158, 1902.
-  E. Provenzi, C. Gatta, M. Fierro, and A. Rizzi, “A spatially variant white-patch and gray-world method for color image enhancement driven by local contrast,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 10, pp. 1757–1770, 2008.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv preprint arXiv:1409.4842, 2014.
-  B. Li, W. Xiong, W. Hu, and B. Funt, “Evaluating combinational illumination estimation methods on real-world images,” Image Processing, IEEE Transactions on, vol. 23, no. 3, pp. 1194–1209, 2014.
-  A. Gijsenij and T. Gevers, “Color constancy using image regions,” in Image Processing, 2007. ICIP 2007. IEEE International Conference on, vol. 3. IEEE, 2007, pp. III–501.
-  H.-C. Lee, “Method for computing the scene-illuminant chromaticity from specular highlights,” JOSA A, vol. 3, no. 10, pp. 1694–1699, 1986.
Simone Bianco obtained the BSc and the MSc degree in Mathematics from the University of Milano-Bicocca, Italy, respectively in 2003 and 2006. He received the PhD in Computer Science at Department of Informatics, Systems and Communication of the University of Milano-Bicocca, Italy, in 2010, where he currently is a post-doc. His research interests include computer vision, optimization algorithms, machine learning, and color imaging.
Claudio Cusano received the Laurea and PhD degrees from the University of Milano Bicocca in 2002 and 2006, respectively. He has been a researcher with grant at the ITC institute of the Italian National Research Council and then at the Imaging and Vision Laboratory of the University of Milano-Bicocca. Currently, he is assistant professor at the Department of Electrical, Computer and Biomedical Engineering of the University of Pavia. The main topics of his research concern 2D and 3D imaging, with a particular focus on image analysis and classification, and on face recognition.
Raimondo Schettini is a professor at the University of Milano Bicocca (Italy). He is head of the Imaging and Vision Lab and Vice-Director of the Department of Informatics, Systems and Communication. He has been associated with the Italian National Research Council (CNR) since 1987, where he led the Color Imaging lab from 1990 to 2002. He has been team leader in several research projects and has published more than 200 refereed papers and six patents about color reproduction, and image processing, analysis and classification. He has been recently elected Fellow of the International Association of Pattern Recognition (IAPR) for his contributions to pattern recognition research and color image analysis.