ColorNet - Estimating Colorfulness in Natural Images

ColorNet - Estimating Colorfulness in Natural Images


Measuring the colorfulness of a natural or virtual scene is critical for many applications in image processing field ranging from capturing to display. In this paper, we propose the first deep learning-based colorfulness estimation metric. For this purpose, we develop a color rating model which simultaneously learns to extracts the pertinent characteristic color features and the mapping from feature space to the ideal colorfulness scores for a variety of natural colored images. Additionally, we propose to overcome the lack of adequate annotated dataset problem by combining/aligning two publicly available colorfulness databases using the results of a new subjective test which employs a common subset of both databases. Using the obtained subjectively annotated dataset with 180 colored images, we finally demonstrate the efficacy of our proposed model over the traditional methods, both quantitatively and qualitatively.


Emin Zerman, Aakanksha Ranathanks: *These authors contributed equally to this work., Aljosa Smolicthanks: This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under the Grant Number 15/RP/2776. We also gratefully acknowledge the support of NVIDIA Corporation with the donated GPU used for this research. \addressV-SENSE, School of Computer Science, Trinity College Dublin, Dublin, Ireland \ninept {keywords} Colourfulness, CNN, Color metric, Deep learning.

1 Introduction

Color is a crucial factor in human visual perception. It affects human behavior and decision processes both in nature and in society, as color conveys pivotal information about the surroundings. Thus, its accurate acquisition and display are necessary for multimedia, entertainment, and image processing systems.

Within the imaging pipeline, the color or brightness information of the real (or virtual) scene needs to be processed for various reasons such as color grading [pitie2007automated], tone-mapping [mantiuk2008display] for high dynamic range (HDR) scenes, gamut mapping [sikudova2016gamut], color correction [mantiuk2009color]. During the acquisition or the processing, color of the scene can be affected by color cast or colorfulness changes [hasler2003measuring]. Within the scope of this study, only the colorfulness aspect of the natural images is considered.

Colorfulness is generally defined as the amount, intensity, and saturation of colors in an image [fairchild2013color]. Understanding and estimating colorfulness is quintessential for a variety of applications, e.g., HDR tone-mapping [mantiuk2009color, reinhard2012calibrated, rana2019learning, rana-icme17, rana-icip17], aesthetics image analysis [datta2006studying, aydin2015automated], color-reproduction in cameras [Karaimer_CVPR], image/video quality assessment [winkler2012analysis, pinson2013selecting, yendrikhovskij1998optimizing, panetta2013no], virtual reality [rana-icassp] etc. For colorfulness estimation, several methods have been proposed [yendrikhovskij1998optimizing, hasler2003measuring, datta2006studying, panetta2013no, amati2014study] in the literature, in addition to the techniques used in the color appearance models [hunt1995reproduction, nayatani1995revision, fairchild2013color] (see Section 2 for more discussion).

To explore the competence of learning-based approaches and to create a stepping stone for further analysis of human perception with deep learning methods, in this paper, we propose a novel deep learning-based objective metric ‘ColorNet’ for the estimation of colorfulness in natural images. Based on a convolutional neural network (CNN), our proposed ColorNet is a two-stage color rating model, where at stage I, a feature network extracts the characteristics features from the natural images and at stage II, a rating network estimates the colorfulness rating. To design our feature network, we explore the designs of the popular high-level CNN based feature models such as VGG [Simonyan15], ResNet [He2016DeepRL], and MobileNet [MobileNetsEC] architectures which we finally alter and tune for our colorfulness metric problem at hand. We also propose a rating network which is simultaneously learned to estimate the relationship between the characteristic features and ideal colorfulness scores.

In this paper, we additionally overcome the challenge of the absence of a well-annotated dataset for training and validating ColorNet model in a supervised manner. To this end, we combine two publicly available colorfulness databases [hasler2003measuring, amati2014study] using the results of a new subjective test which employs a common subset of both databases. The resulting ‘Combined’ dataset contains 180 color images and corresponding subjective colorfulness scores. Finally, we compare and showcase how our ColorNet model outperforms the state-of-the-art traditional colorfulness estimation models both quantitatively and qualitatively.

(a) Subset
(b) EPFL vs. ‘Anchor’
(c) Subset
(d) UCL vs. ‘Anchor’
(e) ‘Combined’ vs. ‘Anchor’
Figure 1: Subjective data. To be representative, 12 images are selected each from EPFL (a) and UCL (c) datasets where blue circles indicate subjective colorfulness scores for all the images and red cross marks indicate those of the selected images. In (b), (d) and (e), we present the relationship between the colorfulness scores of the ‘Anchor’ experiment conducted and the colorfulness scores of EPFL (b), UCL (d), and ‘Combined’ (e) datasets. The diamonds indicate the scores for the selected subset images and the black dashed line indicates the best linear fit between the considered two scores.

2 Related Work

Several studies have attempted to understand and estimate the colorfulness in visual content. Color appearance models (CAMs) utilize some form of chroma and colorfulness estimation to estimate the local visual perception and to reproduce the colors considering the lighting conditions [nayatani1995revision, hunt1995reproduction, fairchild2013color]. However, such estimations are valid mostly for simple uniform image patches. To estimate the overall colorfulness values of complex natural images, several studies have been conducted [yendrikhovskij1998optimizing, hasler2003measuring, datta2006studying, amati2014study, panetta2013no] in the literature.

Yendrikhovskij et al. [yendrikhovskij1998optimizing] developed a model to estimate the quality of natural color images by computing their naturalness and colorfulness indices. The proposed colorfulness index of an image is given as a summation of the average saturation (saturation as defined in CIE L*u*v* color space) and the standard deviation of the saturation. In another study, Datta et al. [datta2006studying] proposed to first partition the RGB cube and estimate the colorfulness by calculating the Earth Mover’s Distance between the pixel distribution of the tested image and that of an “ideal” colorful image. However, compared to others, the model failed in perceptual validation [amati2014study].

Hasler and Süsstrunk [hasler2003measuring], on the other hand, studied the colorfulness in natural images by conducting a subjective user study over a relatively complex set of images. Their proposed method estimates the colorfulness by computing the first and second order statistics between the opponent color channels i.e., yellow-blue and red-green color channels of the given RGB image. Using the same opponent color space strategy, Panetta et al. [panetta2013no] proposed a colorfulness metric as a part of their no-reference image quality measure. Their proposed model involved statistical computations in logarithmic space assuming that human visual perception works logarithmically.

The performances of these colorfulness methods have been compared in a couple of studies [amati2014study, krasula2014objective]. Amati et al. [amati2014study] analyzed the relationship between the colorfulness and the aesthetics of the images by conducting a subjective study with 100 images. They also proposed a contrast-based colorfulness metric which, however, was not better than Hasler and Süsstrunk [hasler2003measuring]. Hence, it was not considered in this study.

Considering a tone mapping scenario, Krasula et al.[krasula2014objective] compared three different colorfulness methods (namely ,  [panetta2013no], and –a colorfulness metric inspired from Hasler and Süsstrunk [hasler2003measuring]) along with three naturalness and six contrast metrics. To this end, they employed the high dynamic range (HDR) image tone mapping methods database of Čadík et al. [cadik2008evaluation] with 42 images. As a result of this comparison, the metric was found to be the most consistent and most correlated colorfulness metric.

In addition to the traditional image processing methods, several deep learning techniques have been recently explored for designing learning-based image quality metrics [Talebi2018NIMA, Kangcvpr]. An early attempt was made by [Kangcvpr] for no-reference image quality assessment task where the efficacy of utilizing the high-level CNN features was explored. In [Talebi2018NIMA], a neural metric has been proposed to predict the aesthetic and pixel-level quality distributions. However, no work has been done to estimate colorfulness in images. In this study, we, therefore, first gather a 180-image dataset with subjective colorfulness scores and then propose a color quality estimation model by exploring various state-of-the-art high-level CNN features.

3 Subjective Data

In this study, we use the subjective colorfulness scores collected from participants for two different colorfulness databases: EPFL Dataset [hasler2003measuring] with 84 images and UCL Dataset111University College London Colourfulness Dataset - [amati2014study] with 96 images222Although the links are provided for 100 images the UCL Dataset, four out of these 100 images have been removed from Flickr..

Both datasets provide the collected subjective scores and corresponding images. The UCL Dataset provides pairwise comparison scores (also with the user confidence). The numerical colorfulness scores, for this database, are obtained through a Thurstone Case V scaling333pwcmp software – [perezOrtiz2017practical]. The quality scores for the EPFL Dataset are already scaled by the respective authors, using one of the methods proposed by Engeldrum [engeldrum2000psychometric].

Even though a psychometric scaling algorithm has been employed, the EPFL Dataset scores have been collected using a rating methodology. Whereas, UCL Dataset scores have been collected using a ranking (i.e. pairwise comparison) methodology. To bring these scores to the same scale, a third subjective test is conducted as an anchor, using a common subset of these two datasets. The two databases are then combined (i.e. aligned) using the scores from this third subjective experiment [pitrey2011aligning] as explained below.

Selection of Images from Each Dataset: To create a common subset, we first selected 12 representative images from each database. These selected images cover the whole quality scale, starting from not colorful at all to very colorful. The distribution of subjective colorfulness scores for all of the images and the selected images is shown in Fig. 1.(a) and Fig. 1.(c) for EPFL and UCL, respectively. These selected images are used in a new subjective experiment. The new subjective test and the subjective scores collected are referred to as ‘Anchor’ experiment and dataset, respectively.

Subjective Test for Anchoring the Two Datasets: To merge EPFL and UCL datasets on the same scale, we have conducted a third subjective experiment with a common subset of images [pitrey2011aligning]. The pairwise comparisons (PWC) methodology is chosen in order to keep the cognitive load for the participants lighter and the experiment process easier. Although PWC is easier for subjects to decide, the full PWC design (with pairs) is time-consuming. Instead, in this study, we use an adaptive pair selection process, known as adaptive square design (ASD) [li2013boosting, P915]. With this pair selection process (and also test methodology), the pair selection and scaling are done iteratively, making sure that similar stimuli are compared more than different stimuli (for which the difference will be clearer with fewer comparisons). The ASD is implemented in Matlab, and a Thurstone Case V based psychometric scaling method is used for this experiment. For the presentation and interactive voting, a Matlab toolbox called Psychtoolbox is used [kleiner2007whats].

To obtain the subjective scores, the 24 selected images are used in the ‘Anchor’ experiment, to which three expert viewers have attended. Instead of updating the rank matrix of ASD for each observer, a different approach is used. The rank matrix of ASD is randomly initialized for each expert observer, and throughout the 5 loops of the test, this rank matrix is updated, and the new pairs are selected considering this updated rank matrix. This ensured that different images have been compared for each different expert viewer.

The obtained PWC results are then scaled to quality values which are later used to merge the databases. The scaling results are generally arbitrary (i.e., relative to each other without an origin point), and hence, hard to understand. Therefore, these subjective quality scores are mapped to [1 9] scale, 1 being the least colorful and 9 being the most colorful.

Combining the Datasets: The relationship between the new ‘Anchor’ colorfulness scores and those of EPFL and UCL are shown in Fig. 1.(b) and 1.(d), respectively, for the selected images. The relationship between these scores is linear, with very high correlation scores.

To merge the databases linearly with corresponding ’Anchor’ scores, the parameters and are found by solving relationship. Here, is the ‘Anchor’ score and is the source (either EPFL or UCL) database score. These parameters are found as: and for EPFL and and for UCL databases.

Then, the two databases are brought to the same scale as:


Finally, the mapped quality scores are concatenated in a single dataset. In Fig. 1.(e), the merged (denoted as ‘Combined’) subjective quality scores are plotted again vs the ‘Anchor’ scores to validate the merging operation, where these two databases are observed to be on the same scale.

4 Proposed Color Rating Model

The ColorNet architecture is illustrated in Fig. 2. The proposed model has two major building blocks, 1) the feature network and 2) the rating network . We base our feature network on the state-of-the-art deep learning models, namely, VGG [Simonyan15], ResNet [He2016DeepRL] and MobileNet [MobileNetsEC]. For all these models, we specifically removed the last layers originally meant for classification and fine-tuned the remaining feature layers in an end-to-end fashion. These features are fed to the rating network which has an objective of mapping the characteristic features to the colorfulness rating domain. The three variants of our ColorNet model are named as ColorNet-VGG, ColorNet-ResNet, and ColorNet-Mobile. The rating network is the same for all three variants. It consists of a dropout regularization layer, two fully connected layers and with and channels respectively. An added non-linearity is introduced by using the ReLU unit in between the fully connected layers to learn the desired colorfulness mapping. Further details regarding the three variants are as follows:

  1. ColorNet-VGG has a 13 convolutional layered feature network adopted from the VGG16 [Simonyan15] architecture, with small convolutions, resulting in a feature vector of dimension . In our paper, we removed the three fully connected layers of the VGG16 architecture and simply fed the resulting feature vector into the proposed rating network.

  2. ColorNet-ResNet consists of an 18 layer deep residual feature network adopted from the ResNet architecture [He2016DeepRL]. Similar to VGG [Simonyan15], the ResNet architecture has small convolutions, however with an additional concept of residual learning applied to every few stacked layers. In our paper, we removed the last fully connected layer of 1000 channels and fed the resulting feature of size into the rating work.

  3. ColorNet-Mobile consists of a 28 convolution layers feature network adopted from the MobileNet architecture [MobileNetsEC], which comprises of both depth-wise and point-wise convolutions. MobileNet has been a widely adopted network for many mobile and embedded applications. In this paper, we removed its last fully connected and soft-max layer to obtain a feature of size which is finally fed into the rating network.

Note that we choose these feature models mainly due to ease in training and faster convergence over small dataset size. For further details regarding the feature network architectures, we refer the reader to [Simonyan15, He2016DeepRL, MobileNetsEC]. For each ColorNet variant, the receptive input for the first fully-connected is altered as per the resulting feature size of the feature network.

Loss Function: The ultimate goal of ColorNet models is to rate the colorfulness in images. To this end, we train these models in an end-to-end fashion using a fully supervised setting. For training, we define a set of an image and its corresponding rating as a pair, given as , where is an image and is its corresponding colorfulness rating, , = number of training image pairs. To train the ColorNet models, we used the following objective function:


where is the predicted colorfulness rating and vgg, resnet, mobile. We additionally experimented with the L2 norm as a loss function, however, we observe that practically the L1 loss term effectively penalizes the network to learn and converge at a faster rate using the ADAM [KingmaB14adam2] optimizer technique.

Figure 2: The ColorNet Model.

5 Results

Training and Implementation Details: We split the dataset of images into training, validation and test sets in an , and setting. For training, the input image size is fixed at , and random crops of size are applied to the image. Additional data augmentation techniques such as rotation and flipping are applied to scale up the training dataset.

The ColorNet model is implemented using the Pytorch [paszke2017automatic] deep learning library. During training, the batch size is set to 4 and the baseline weights of the feature networks are initialized by training on the ImageNet [imagenet_cvpr09] dataset. The weights of the layers in the rating networks are initialized randomly. In other terms, we fine-tune the feature network and train the rating network layers for our task at hand. A dropout rate of 0.75 is set in the rating network for all the three models. We utilize an ADAM solver [KingmaB14adam2] with an initial learning rate of and for training the feature and rating networks respectively. The learning rates are allowed to decay exponentially with a decay rate of , after every 10 epochs. The momentum rate is fixed at for all epochs. All our models are trained in an end-to-end fashion for epochs. Training is done using a GB NVIDIA Titan-X GPU on an Intel Xeon E7 core i7 machine for epochs which take approximately hours. Inference time is secs for each image.

Traditional Colorfulness Methods: We implemented four different colorfulness metrics: Hasler and Süsstrunk ([hasler2003measuring] , two versions ( and ) of Panetta et al. [panetta2013no], and Yendrikhovskij et al. ([yendrikhovskij1998optimizing] to compare with the proposed ColorNet model.

For the Hasler and Süsstrunk [hasler2003measuring], with a given RGB image, the metric uses the mean , and standard deviation of the opponent color space vectors and where and . Then, is computed as:


Panetta et al. [panetta2013no] use the similar color opponent space, but in logarithmic space. The metrics and are calculated as:


where and are concatenated to form , i.e..

To compute , we use the following formula after the RGB to CIE L*u*v* color space transform, as specified in the paper [yendrikhovskij1998optimizing]:

Colorfulness Metric PCC SROCC
 [hasler2003measuring] 0.841 0.884
 [panetta2013no] 0.895 0.896
 [panetta2013no] 0.312 0.415
 [yendrikhovskij1998optimizing] 0.843 0.834
ColorNet-Mobile 0.841 0.774
ColorNet-ResNet 0.916 0.889
ColorNet-VGG 0.937 0.921
Table 1: Correlation Coefficient Results.

Quantitative Evaluation: Colorfulness metrics are evaluated by computing the Pearson correlation coefficient (PCC) and the Spearman rank-ordered correlation coefficient (SROCC). For testing, we used the 10-fold cross-validation strategy where the whole dataset was divided into 10 non-overlapping pieces , and for each iteration, one piece is used for the test, one piece used for validation, and the remaining pieces are used for training (as also described above). For each iteration , the unique piece is used for the test, where all the pieces (, ) cover the whole dataset. Finally, the PCC and SROCC results for 10 different test cases are averaged. The performance results are presented in Table 1 where PCC and SROCC for the 4 state-of-the-art colorfulness metrics are computed using the aligned subjective scores from ‘Combined’ dataset. Our proposed ColorNet model with VGG and ResNet based feature network outperforms over the other classical models. This is partly due to the model’s ability to learn rich high-level color feature representation.

Qualitative Evaluation: For the qualitative evaluation of ColorNet, we crafted a set of images that are not used during the training of the model, by considering two different scenarios: i) change in the dominant colours and ii) change in the color contrast. In Fig. 3, row I shows 4 different cases of the change in the dominant colors, and row II shows the change in the color contrast. We report the objective results for the proposed deep-learning-based colorfulness estimation method, in Fig. 3, and show also for completeness. The results showcase that in both cases i.e., by increasing color contrast and increasing the number of dominant hues, the proposed metric scores increase, thus, validating the understanding of colorfulness of our model. Overall, our results confirm that learning-based models bring huge potential to cater for various implicit aspects of color perception over a wide variety of natural images.

Figure 3: Qualitative Evaluation. Row I depicts the change in dominant colors. Row II depicts the change in the saturation of the colors.

6 Conclusion

In this study, we propose a CNN based model for the estimation of colorfulness ratings. To prepare a well-annotated colored image dataset, we combine two colorfulness databases with subjective user scores, using the results of an anchor subjective experiment with a common subset of images. We compare the results of the proposed model to those of four other traditional colorfulness metrics quantitatively and qualitatively where we observe that our learning-based model effectively rates the colorfulness by catering for the wide variety of natural images. This study constitutes an initial step towards the exploration of color perception in natural images using the deep learning approach. In future work, we aim to delve deeper in learning-based color perception models and analyze the impacts of various associated factors.


Comments 1
Request Comment
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description