Automatic Content-Aware Color and Tone Stylization

Joon-Young Lee
Adobe Research
   Kalyan Sunkavalli
Adobe Research
   Zhe Lin
Adobe Research
   Xiaohui Shen
Adobe Research
   In So Kweon

We introduce a new technique that automatically generates diverse, visually compelling stylizations for a photograph in an unsupervised manner. We achieve this by learning style ranking for a given input using a large photo collection and selecting a diverse subset of matching styles for final style transfer. We also propose a novel technique that transfers the global color and tone of the chosen exemplars to the input photograph while avoiding the common visual artifacts produced by the existing style transfer methods. Together, our style selection and transfer techniques produce compelling, artifact-free results on a wide range of input photographs, and a user study shows that our results are preferred over other techniques.

Figure 1: Our technique automatically generates a set of different stylistic renditions of an input photograph. We use a combination of semantic and style similarity metrics to learn a style ranking that is specific to the content of the photograph. We sample this ranking to select a subset of diverse styles and robustly transfer their color and tone statistics to the input photograph. This allows us to create stylizations that are diverse, artifact-free, and adapt to content ranging from landscapes to still life to people.

1 Introduction

Photographers often stylize their images by editing their color, contrast and tonal distributions – a process that requires a significant amount of skill with tools like Adobe Photoshop. Instead, casual users use preset style filters provided by apps like Instagram to stylize their photographs. However, these fixed sets of styles do not work well for every photograph and in many cases, produce poor results.

Example-based style transfer techniques [24, 3] can transfer the look of a given stylized exemplar to another photograph. However, the quality of these results is tied to the choice of the exemplar used, and the wrong choices often result in visual artifacts. This can be avoided in some cases by directly learning style transforms from input-stylized image pairs [5, 28, 12, 32]. However, these approaches require large amounts of training data, limiting them to a small set of styles.

Our goal is to make the process of image stylization adaptive by automatically finding the “right” looks for a photograph (from potentially hundreds or thousands of different styles), and robustly applying them to produce a diverse set of stylized outputs. In particular, we consider stylizations that can be represented as global transformations of color and luminance. We would also like to do this in an unsupervised manner, without the need for input-stylized example pairs for different content and looks.

We introduce two datasets to derive our stylization technique. The first is our manually curated target style database, which consists of 1500 stylized exemplar images that capture color and tonal distributions that we consider as good styles. Given an input photograph, we would like to automatically select a subset of these style exemplars that will guarantee good stylization results. We do this by leveraging our second dataset – a large photo collection that contains millions of photographs and spans the range of styles and semantic content that we expect in our input photographs (e.g., indoor photographs, urban scenes, landscapes, portraits, etc.). These datasets cannot be used individually for stylization; the style dataset is small and does not span the full content-style space, and the photo collection is not curated and contains both good and poorly-stylized images. The key idea of our work is that we can use the large photo collection to learn a content-to-style mapping and bridge the gap between the source photograph and the target style database. We do this in a completely unsupervised manner, allowing us to easily scale to a large range of image content and photographic styles.

We segment the large photo collection into content-based clusters using semantic features, and learn a ranking of the style exemplars for each cluster by evaluating their style similarities to the images in the cluster. At run time, we determine the semantic clusters nearest to the input photograph, retrieve their corresponding stylized exemplar rankings, and sample this set to obtain a diverse subset of relevant style exemplars.

We propose a new robust technique to transfer the global color and tone statistics of the chosen exemplars to the input photo. Doing this using previous techniques can produce artifacts, especially when the exemplar and input statistics are very disparate. We use regularized color and tone mapping functions, and a face-specific luminance correction step, to minimize artifacts in the final results. Fig. 1 shows our stylization results on three example images.

We introduce a new benchmark dataset of 55 images with manually stylized results created by an artist. We compare our style selection method with other variants as well as the artist’s results through a blind user study. We also evaluate the performance of a number of current statistics-based style transfer techniques on this dataset, and show that our style transfer technique produces better results than all of them. To the best of our knowledge, this is the first extensive quantitative evaluation of these methods.

The technical contributions of our work include:

  1. A robust style transfer method that captures a wide range of looks while avoiding image artifacts,

  2. An unsupervised method to learn a content-specific style ranking using semantic and style similarity,

  3. A style selection method to sample the ranked styles to ensure both diversity and quality in the results, and

  4. A new benchmark dataset with professional stylizations and a comprehensive user evaluation of various style selection and transfer techniques.

Figure 2: Stylization results with different choices of the exemplar images. All exemplars are shown in insets in the top-left corner.

2 Related Work

Example-based Style Transfer

One popular approach for image stylization is to transfer the style of an exemplar image to the input image. This approach was pioneered by Reinhard et al. [24], who transferred color between images by matching the statistics of their color distributions. Subsequent work has improved on this technique by using soft-segmentation [27], multi-dimensional histogram matching [21], minimal displacement mapping [20], and histogram reshaping [23]. All these techniques are designed to match the input and exemplar color distributions while remaining robust to outliers. Instead of transferring color distributions, correspondence-based methods compute (potentially non-linear) color transfer functions from pixel correspondences between the input and exemplar images that are either automatically estimated [11, 13] or specified by the user [1]. Example-based color transfer techniques have also been used for video grading [4], realistic compositing [15, 30], and transferring attributes like time of day to photographs [26, 18]. Please refer to [29, 10] for a detailed survey of color transfer methods. We base our chrominance transfer function on the work of Pitié et al. [20], but add a regularization term to make it robust to large differences between the color distributions being matched.

Style transfer techniques also match contrast and tone between images. This is typically done by manipulating the luminance of the photograph using histogram matching, or by applying a parametric tone-mapping curve such as a gamma curve or an S-curve [16]. Bae et al. [3] propose a two-scale technique to transfer both global and local contrast. Aubry et al. [2] demonstrate the use of local Laplacian pyramids for contrast and tone transfer. Shih et al. [25] use a multi-scale local contrast transfer technique to stylize portrait photographs. We propose a parametric luminance reshaping curve that is designed to be smooth and to avoid artifacts in the results. In addition, we propose a face luminance correction method specifically designed to avoid artifacts in portrait shots.

Learning-based Stylization and Enhancement

Another approach to image stylization is to use supervised methods that learn style mapping functions from data consisting of input-stylized image pairs. Wang et al. [28] introduce a method to learn piece-wise smooth non-linear color mappings from image pairs. Yan et al. [32] use deep neural networks to learn local non-linear transfer functions for a variety of photographic effects. There are also several automatic learning-based enhancement techniques. Kang et al. [16] present a personalized image enhancement framework using distance metric learning; this was extended by [6] to collaborative personalization. Bychkovsky et al. [5] build a reference dataset of input-output image pairs for learning tonal adjustments. Hwang et al. [12] propose a context-based local image enhancement method, and Yan et al. [31] account for the intermediate decisions a user makes during the editing process. While these learning-based methods show impressive adjustment results, collecting training data and generalizing to a large number of styles is very challenging. In contrast, our technique for learning content-specific style rankings is completely unsupervised and easily generalizes to a large range of content and style classes.

Our technique is similar in spirit to two papers that leverage large image collections to restore/stylize the color and tone of photographs. Dale et al. [7] find visually similar images in a large photo collection, and use their aggregate color and tone statistics to restore the input photograph. This aggregation causes a regression to the mean that is appropriate for image restoration but not stylization. Liu et al. [19] use a user-specified keyword to search for images that are used to stylize the input photo. The final results are highly dependent on the choice of the keyword and it can be challenging to predict the right keywords to stylize a photograph. Our technique automatically predicts the right styles for the input photograph.

Figure 3: The overall framework of our system.

3 Overview

Given an input photograph, our goal is to automatically create a set of diverse stylized outputs. In particular, we focus on stylizations that can be represented as global transformations of the input color and luminance values. The styles we are interested in are captured by a curated set of exemplar images. Using images as style examples makes it intuitive for users to specify the looks they are interested in.

We use an example-based style transfer algorithm to transfer the look of a given exemplar image to the input photograph. While example-based techniques can produce compelling results [10], they often cause visual artifacts when there are strong differences in the input and exemplar images being processed. In this work, we develop regularized global color and tone mapping functions (Sec. 4) that are expressive enough to capture a wide range of effects, but sufficiently constrained to avoid such artifacts.

The quality of the stylized result is also closely tied to the choice of exemplar. Using an outdoor landscape image, for example, to stylize a portrait could lead to poor transfer results (see Fig. 2(b)). It is therefore important to choose the “right” set of exemplar images based on the content of the input photograph. We use a semantic similarity metric – learned using a convolutional neural network (CNN) – to match images with similar content. Given this semantic similarity measure, one approach would be to use it directly to find exemplar images with content similar to the input photograph and use them for stylization. However, the curated exemplar dataset is limited and unlikely to contain style examples for every content class. Using the semantic similarity metric to find the closest stylized exemplar to an input photograph does not guarantee a good match and, as illustrated in Fig. 2(c), can lead to poor stylizations.

In order to learn a content-specific style ranking, we crawl a large collection of Flickr interesting photos that cover a wide range of different content with varying styles and levels of quality. A straightforward way of stylizing an input photograph could be to use the semantic similarity measure to directly find matching images from this large collection and transfer their statistics to the input photograph. However, this large collection of photos is not manually curated, and contains images of both good and bad quality. Performing style transfer using the low-quality photographs in the database can lead to poor stylizations, as shown in Fig. 2(d). While these results can be improved by curating the photo collection, this is an infeasible task given the size of the database.

We leverage the large photo collection to learn a style ranking for each content class in an unsupervised way. We cluster the photo collection into a set of semantic classes using the semantic similarity metric (Sec. 5.1). For each image in a semantic class, we vote for the best matching stylized exemplar using a style similarity metric (Sec. 5.2). We aggregate these votes across all the images in the class to build a content-specific ranking of the stylized exemplars.

At run time, we match an input photograph to its closest semantic classes and use the pre-computed style ranking for these classes to choose the exemplars. We use a greedy sampling technique to ensure a diverse set of examples (Sec. 5.3), and transfer the statistics of the sampled exemplars to the input photograph using our robust example-based transfer technique. As shown in Fig. 2(e), our style selection technique chooses stylized exemplars that are not necessarily semantically similar to the input photograph, yet have the “right” color and tone statistics to transfer, and produces results that are significantly better than the approaches of directly searching for semantically similar images in the style database or the photo collection. Fig. 3 illustrates the overall framework of our stylization system.

Figure 4: Examples of our style transfer results compared with previous statistics-based transfer methods. Exemplars are shown in insets in the top-left corner of input images.

4 Robust Example-based Style Transfer

We stylize an input photograph by applying global transforms that match its color and tonal statistics to those of a style example. This space of transformations encompasses a wide range of stylizations that artists use, including color mixing, hue and saturation shifts, and non-linear tone adjustments. While a very flexible transfer model can capture a wide range of photographic looks, it is also important that it can be estimated robustly and does not cause artifacts; this is particularly important in our case, where the images being mapped may differ significantly in content. With this in mind, we design color and contrast mapping functions that are regularized to avoid artifacts.

To effectively stylize images with global transforms, we first compress the dynamic ranges of the two images using a nonlinear compressive mapping and convert them into the CIELab colorspace (which decorrelates the different channels well). We then stretch the luminance (L channel) to cover the full dynamic range after clipping the bottom and top 0.5 percent of luminance levels, and apply separate transfer functions to the luminance and chrominance components.
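As an illustration of the luminance normalization step, a minimal sketch (the function name and epsilon guard are ours; range compression and Lab conversion are assumed to happen beforehand):

```python
import numpy as np

def normalize_luminance(L, clip_pct=0.5):
    """Clip the bottom/top 0.5% of luminance levels and stretch to [0, 1].

    `L` is a float array of luminance values (e.g., the CIELab L channel
    rescaled to [0, 1]); the 0.5-percent clipping follows the text.
    """
    lo, hi = np.percentile(L, [clip_pct, 100.0 - clip_pct])
    L = np.clip(L, lo, hi)
    return (L - lo) / max(hi - lo, 1e-8)
```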


Our color transfer method maps the statistics of the chrominance channels of the two images. We model the chrominance distribution of an image using a multivariate Gaussian, and find a transfer function that maps the input chrominance statistics (μ_i, Σ_i) to the Gaussian statistics of the style exemplar (μ_s, Σ_s) to create the output image:

O(p) = T (I(p) − μ_i) + μ_s,    (1)

where T is a linear transformation that maps chrominance between the images and I(p) is the chrominance at pixel p. Following Pitié et al. [20], we solve for the color transform T using the following closed-form solution:

T = Σ_i^{−1/2} (Σ_i^{1/2} Σ_s Σ_i^{1/2})^{1/2} Σ_i^{−1/2}.    (2)

This solution is unstable for low input covariance values, leading to color artifacts when the input has low color variation. To avoid this, we regularize the solution by clipping the diagonal elements of Σ_i as:

Σ̃_i = max(Σ_i, λ I),    (3)

and substitute Σ̃_i into Eq. (2). Here I is the identity matrix and the max is applied to the diagonal elements. This formulation has the advantage that it only regularizes color channels with low variation without affecting the others. We use a small fixed regularization weight λ.
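A minimal numpy sketch of this regularized mapping (the function names and the value of `lam` are ours, not from the paper):

```python
import numpy as np

def _sqrtm_spd(A):
    # Matrix square root of a symmetric positive semi-definite matrix.
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.maximum(w, 0.0))) @ V.T

def mk_color_transform(cov_in, cov_style, lam=1e-3):
    """Monge-Kantorovich linear map with diagonal clipping (sketch)."""
    cov_in = np.array(cov_in, dtype=float)
    d = np.diag_indices_from(cov_in)
    cov_in[d] = np.maximum(cov_in[d], lam)  # clip low-variance channels
    s = _sqrtm_spd(cov_in)
    s_inv = np.linalg.inv(s)
    # Closed-form Monge-Kantorovich solution for the linear transform.
    return s_inv @ _sqrtm_spd(s @ cov_style @ s) @ s_inv
```

Applied to zero-mean chrominance, the returned matrix T satisfies T Σ_in Tᵀ = Σ_style, i.e., the transformed input has the exemplar's covariance.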


We match contrast and tone using histogram matching between the luminance channels of the input and style exemplar images. Direct histogram matching typically results in arbitrary transfer functions and may produce artifacts due to non-smooth mappings or excessive stretching/compression of the luminance values. Instead, we design a new parametric model of luminance mapping that is simultaneously expressive and well regularized. Our transfer function is a normalized sigmoid:

f(x) = (s(x) − s(0)) / (s(1) − s(0)),  with  s(x) = 1 / (1 + exp(−a (x − δ))),    (4)

where x and f(x) are the input and output luminance respectively, and δ and a are the two parameters of the mapping function. δ determines the inflection point of the mapping function and a determines the degree of luminance stretching around the inflection point. This parametric function can represent a diverse set of tone-mapping curves, and we can easily control the degree of stretching/compression of tone. Since the derivative of Eq. (4) is always positive and continuous, the mapping is guaranteed to be a smooth, monotonically increasing curve. This ensures that the function generates a proper luminance mapping curve for any choice of parameters.

We extract a luminance feature, f_L, that represents the luminance histogram using uniformly sampled percentiles of the luminance cumulative distribution function (we use 32 samples). We estimate the tone-mapping parameters by minimizing the cost function:

(δ*, a*) = argmin_{δ, a} ‖ f(f_L^i; δ, a) − f̃_L ‖²,    (5)

where f_L^i and f_L^s represent the input and style luminance features, respectively, and f̃_L = (1 − w) f_L^i + w f_L^s is an interpolation of the input and exemplar luminance features; w represents how closely we want to match the exemplar luminance distribution. We fix w and minimize this cost using a parameter sweep in a branch-and-bound scheme.
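For concreteness, a sketch of the curve fitting: `tone_curve` is a normalized logistic with inflection `delta` and stretch `a` (our stand-in satisfying the stated smoothness and monotonicity properties), fitted here by a coarse parameter sweep with hypothetical ranges rather than branch-and-bound:

```python
import numpy as np

def tone_curve(x, delta, a):
    # Smooth, monotone sigmoid-style remap: inflection at `delta`,
    # stretch around it controlled by `a`; normalized so f(0)=0, f(1)=1.
    s = lambda t: 1.0 / (1.0 + np.exp(-a * (t - delta)))
    return (s(x) - s(0.0)) / (s(1.0) - s(0.0))

def fit_tone_curve(f_in, f_target):
    # Coarse grid sweep minimizing the squared error between the
    # remapped input feature and the (interpolated) target feature.
    best, best_cost = (0.5, 1.0), np.inf
    for delta in np.linspace(0.05, 0.95, 19):
        for a in np.linspace(1.0, 15.0, 29):
            cost = np.sum((tone_curve(f_in, delta, a) - f_target) ** 2)
            if cost < best_cost:
                best, best_cost = (delta, a), cost
    return best
```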

Fig. 4 compares the quality of our style transfer method against three recent methods: the N-dimensional histogram matching technique of Pitié et al. [21], the linear Monge-Kantorovich solution of Pitié and Kokaram [20], and the three-band method of Bonneel et al. [4]. While each of these algorithms has its strengths, only our method consistently produces visually compelling results without any artifacts. We further evaluate all these methods via a comprehensive user study in Sec. 6.

Figure 5: Face exposure correction.

Face exposure correction

In the process of transferring tonal distributions, our luminance mapping method can over-darken some regions. When this happens to faces, it detracts from the quality of the result, as humans are sensitive to facial appearance. We fix this using a face-specific luminance correction. We detect face regions in the input image, each given by a center c and radius r, using the OpenCV face detector. If the median luminance in a face region is lower than a threshold, we correct the luminance as:

L̂(p) = w(p) L(p)^γ + (1 − w(p)) L(p).    (6)

This applies a simple γ-correction to the luminance, where γ determines the maximum level of exposure correction. We would like to apply it to the entire face; however, the face region is given by a coarse box, and applying the correction to the entire box would produce artifacts. Instead, we interpolate the corrected luminance with the original luminance using per-pixel weights w(p). We compute these weights from the spatial distance to the face center and the chrominance distance to the median face chrominance value (to capture the color of the skin). σ_s and σ_c are normalization parameters that control the widths of the spatial and chrominance kernels, respectively, and are fixed for all results. Fig. 5 shows an example of our face exposure correction results.
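A sketch of the correction, assuming Gaussian falloff for the spatial and chrominance weights (the values of `gamma`, `sigma_s`, and `sigma_c` are hypothetical; the paper's threshold and parameter values are not reproduced here):

```python
import numpy as np

def correct_face_exposure(L, ab, center, radius, med_chroma,
                          gamma=0.7, sigma_s=1.0, sigma_c=0.2):
    """Blend a gamma-brightened face region back into the luminance map.

    L: HxW luminance in [0, 1]; ab: HxWx2 chrominance; center/radius
    come from a face detector; med_chroma is the median chrominance
    inside the detected face box.
    """
    H, W = L.shape
    ys, xs = np.mgrid[0:H, 0:W]
    d_s = np.hypot(ys - center[0], xs - center[1]) / max(radius, 1)
    d_c = np.linalg.norm(ab - med_chroma, axis=-1)
    # Gaussian falloff in space and in chrominance (skin-color affinity).
    w = np.exp(-(d_s / sigma_s) ** 2) * np.exp(-(d_c / sigma_c) ** 2)
    return w * np.power(L, gamma) + (1.0 - w) * L
```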

5 Content-aware Style Selection

Given the target style database (a curated dataset of 1500 exemplar style images), we can use the method described in Sec. 4 to transfer the photographic style of a style exemplar to an input photograph. However, as noted in Sec. 3 and illustrated in Fig. 2, it is important that we choose the right set of style exemplars. Motivated by the fact that images with different semantic content require different styles, we attempt to learn the set of good styles (or their ranking) for each type of semantic content separately.

To achieve this, we prepare a large photo collection consisting of one million photographs downloaded from Flickr’s daily interesting photograph collection. As noted in Sec. 3, the curated style dataset does not contain examples for all content classes and cannot be directly used to stylize a photograph. However, by leveraging the large photo collection, we can learn style rankings of the curated style dataset even for content classes that are not represented in it.

The large photo collection captures a joint distribution of content and styles. We use a semantic descriptor (Sec. 5.1) to cluster the training collection into content classes. The semantic feature has a degree of invariance to style, and as a result each class contains images of very similar content but with a variety of different styles, both good and bad. This distribution of styles within each content class allows us to learn how compatible a style is with a content class. The style-to-content compatibility is specifically learned via a simple style-based voting scheme (Sec. 5.2) that evaluates how similar each style exemplar is to the images in the content cluster; style exemplars that occur often are deemed to be better suited to that content class, and conversely, those that occur infrequently are not considered compatible.

In the on-line phase, we determine the content class of an input photograph and retrieve its pre-computed style ranking. We sample this style ranking (Sec. 5.3) to obtain a small set of diverse style images and compute the final results using our style transfer technique (Sec. 4).

Figure 6: Examples of semantic clusters.

5.1 Semantic clustering

Inspired by recent breakthroughs in the use of CNNs [17], we represent the semantic information of an image using a CNN feature trained on the ImageNet dataset [8]. We modified CaffeNet [14] to have fewer nodes in the fully-connected layers and fine-tuned the modified network. This results in a compact feature vector for each image. We empirically found that this smaller CNN captures more style diversity in each content cluster than the original CaffeNet or AlexNet [8], which sometimes “oversegment” content into clusters with low style variation.

We perform k-means clustering on the CNN feature vectors of the images in the large photo collection to obtain semantic content clusters. Too few clusters leads to different content classes being grouped in the same cluster, while too many leads to the style variations of a single content class being split across clusters. In our experiments, we found that a moderate number of clusters strikes a good balance between these two aspects.

Fig. 6 shows images from six different semantic clusters. The images in a single cluster share semantically similar content but have diverse appearances (including both good and bad styles). These intra-class style variations allow us to learn the space of relevant styles for each class.
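The clustering step above can be sketched with a plain Lloyd's k-means over the feature vectors; this is a toy illustration (the real system clusters a million features and would need a scalable implementation), with a deterministic farthest-point initialization chosen here for reproducibility:

```python
import numpy as np

def kmeans(feats, k, iters=20):
    """Toy k-means over semantic feature vectors (N x D array)."""
    # Deterministic farthest-point initialization.
    centers = [feats[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(feats - c, axis=1) for c in centers],
                   axis=0)
        centers.append(feats[d.argmax()])
    centers = np.array(centers, dtype=float)
    # Lloyd iterations: assign to nearest center, recompute means.
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None, :] - centers[None], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = feats[assign == j].mean(axis=0)
    return assign, centers
```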

Figure 7: Intermediate steps of style selection. The input (a) can be semantically different from the selected exemplars (c) (second and third example especially). However, the cluster images with the highest votes for these style exemplars (d), are both semantically similar to the input and stylistically similar to the chosen exemplars. This ensures input-exemplar compatibility and leads to artifact-free stylizations (e).

5.2 Style ranking

To choose the best style examples for each semantic cluster, we compute the style similarity between each style example and the images in the cluster, and use this measure to rank the styles for that cluster. As explained in Sec. 4, we represent a photograph’s style using chrominance and luminance statistics. Following this, we define the style similarity between a cluster photograph P and a style image E as:

S(P, E) = exp(−d_H²(P, E) / σ_c²) · exp(−d_L²(P, E) / σ_l²),    (7)

where d_L is the Euclidean distance between the two luminance features, and σ_c and σ_l are normalization parameters, fixed to generate all our results. d_H is the Hellinger distance [22], defined for two Gaussians as:

d_H²(P, E) = 1 − ( |Σ_P|^{1/4} |Σ_E|^{1/4} / |Σ̄|^{1/2} ) exp( −(1/8) (μ_P − μ_E)ᵀ Σ̄^{−1} (μ_P − μ_E) ),  with  Σ̄ = (Σ_P + Σ_E) / 2,    (8)

where (μ, Σ) are the multivariate Gaussian statistics of the chrominance channels of an image. We chose the Hellinger distance to measure the overlap between two distributions because it strongly penalizes large differences in covariance even when the means are close. A small regularization term is added to the difference between the means to additionally penalize images with small covariance.

We measure the compatibility of a stylized exemplar E with a semantic cluster C by aggregating the style similarity measure over all the images in the cluster:

V(E, C) = ∑_{P ∈ C} S(P, E).    (9)

For each semantic cluster, we compute V(E, C) for all the style exemplars and determine the style ranking by sorting these scores in decreasing order. This voting scheme measures how often a particular exemplar’s color and tonal statistics occur in the semantic cluster. Poorly stylized cluster images are implicitly filtered out because they do not vote for any style exemplar. Meanwhile, well-stylized images in the cluster vote for their corresponding exemplars, giving us a “histogram” of the style exemplars for that cluster.
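Putting the similarity and voting steps together, a sketch of the ranking (each image is summarized by its chrominance Gaussian `(mu, cov)` and a luminance feature; the sigma values are hypothetical):

```python
import numpy as np

def hellinger_sq(mu1, cov1, mu2, cov2):
    # Squared Hellinger distance between two Gaussians.
    cbar = 0.5 * (cov1 + cov2)
    num = (np.linalg.det(cov1) * np.linalg.det(cov2)) ** 0.25
    den = np.sqrt(np.linalg.det(cbar))
    diff = mu1 - mu2
    expo = -0.125 * diff @ np.linalg.solve(cbar, diff)
    return 1.0 - (num / den) * np.exp(expo)

def rank_styles(cluster_stats, style_stats, sigma_c=0.5, sigma_l=0.5):
    """Score each style exemplar against one semantic cluster.

    Each stats entry is (mu, cov, lum_feat). Returns exemplar indices
    sorted by aggregated similarity, best first.
    """
    votes = np.zeros(len(style_stats))
    for (mu_p, cov_p, f_p) in cluster_stats:
        for j, (mu_s, cov_s, f_s) in enumerate(style_stats):
            d_h2 = hellinger_sq(mu_p, cov_p, mu_s, cov_s)
            d_l = np.linalg.norm(f_p - f_s)
            votes[j] += np.exp(-d_h2 / sigma_c ** 2) * \
                        np.exp(-(d_l / sigma_l) ** 2)
    return np.argsort(-votes)
```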

Figs. 3 and 7 show the results of each stage of our stylization pipeline. As these figures illustrate, our semantic similarity term is able to find clusters with semantically similar content (see Fig. 7(b)). Our technique does not require the selected style exemplars to be semantically similar to the input image (see Fig. 7(c)). While this might seem counter-intuitive, the final stylized results do not suffer from any artifacts because the highly-ranked styles have the same style characteristics as a large number of “auxiliary exemplars” in the training photo collection that, in turn, share the same content as the input (see Fig. 7(d)). This is an important property of our style selection scheme, and is what allows it to generalize a small style dataset to arbitrary content.

We also experimented with an alternative way of ranking styles based on a weighted combination of style and semantic similarity between the curated dataset and the large photo collection. However, our empirical experiments showed that it was consistently worse than relying solely on the style similarity due to the lack of semantically similar examples with diverse styles in the curated dataset. We evaluate our style selection criteria against other candidate methods via a user study in Sec. 6.

Figure 8: Results according to different style sampling strategies (example images in insets). Directly using the top-ranked style examples from the learned ranking can lead to similar results (b). Our sampling strategy combines styles from multiple semantic clusters and enforces a certain style diversity threshold (c). Increasing the number of clusters and the threshold increases diversity (d).

5.3 Style sampling

Given an input photograph, we can extract its semantic feature and assign it to the nearest semantic cluster. We can retrieve the pre-computed style ranking for this cluster and use the top style images to create a set of stylized renditions of the input photograph. However, this strategy could lead to outputs that are similar to each other. In order to improve the diversity of styles in the final results, we propose the following multi-cluster style sampling scheme.

Adjacent semantic clusters usually share similar high-level semantics but different low-level features such as object scale, color, and tone. Therefore we propose using multiple nearest semantic clusters to capture more diversity. We merge the style lists for the chosen semantic clusters and order them by the aggregate similarity measure (Eq. (9)). To avoid redundant styles, we sample this merged style list in order (starting with the top-ranked one) and discard styles that are within a specified threshold distance from the styles that have already been chosen.

We define a new similarity measure for this sampling process that computes the squared Fréchet distance [9] between the chrominance Gaussians of two styles:

d_F²(P, Q) = ‖μ_P − μ_Q‖² + tr( Σ_P + Σ_Q − 2 (Σ_P^{1/2} Σ_Q Σ_P^{1/2})^{1/2} ).    (10)

We use this distance because it measures the optimal transport between distributions and is closer to being perceptually linear.

The threshold on the squared Fréchet distance controls the diversity of the selected styles. A small threshold leads to little diversity in the results, while a large threshold may cause low-ranked styles to be sampled, resulting in artifact-prone stylizations. Considering this tradeoff, we use the three nearest semantic clusters and a fixed, empirically chosen threshold.
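The greedy sampling can be sketched as follows (the helper names and the threshold `tau` are ours; the paper's threshold value is not reproduced here):

```python
import numpy as np

def frechet_sq(mu1, cov1, mu2, cov2):
    # Squared Frechet (2-Wasserstein) distance between two Gaussians.
    w, V = np.linalg.eigh(cov1)
    s1 = (V * np.sqrt(np.maximum(w, 0.0))) @ V.T  # sqrt of cov1
    w2, _ = np.linalg.eigh(s1 @ cov2 @ s1)
    tr_cross = np.sqrt(np.maximum(w2, 0.0)).sum()
    return np.sum((mu1 - mu2) ** 2) + \
        np.trace(cov1 + cov2) - 2.0 * tr_cross

def sample_diverse(ranked, stats, tau, n=5):
    """Walk the merged ranking best-first, skipping near-duplicates.

    `ranked`: exemplar indices sorted best-first; `stats[i]` = (mu, cov);
    styles within squared-Frechet distance `tau` of an already chosen
    style are discarded.
    """
    chosen = []
    for i in ranked:
        if all(frechet_sq(*stats[i], *stats[j]) > tau for j in chosen):
            chosen.append(i)
        if len(chosen) == n:
            break
    return chosen
```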

Fig. 8(c) shows the stylizations our sampling method produces. In comparison, naively sampling the style ranking without enforcing diversity creates multiple results that are visually similar (Fig. 8(b)). On the other hand, increasing the Fréchet distance threshold leads to more diversity, but can introduce artifacts into the stylizations because styles at the low-rank end get selected (Fig. 8(d)).

6 Results and Discussion

We have implemented our stylization technique as a C++ application in which the style transfer is parallelized on the CPU. To improve performance, we pre-compute and store the semantic cluster centers of the large photo collection, the style features, and the per-class style ranking. At run time, we first extract the CNN feature for the input photograph. The semantic search, style sampling, and style transfer make use of the pre-computed information, and take a total of 150 ms (about 40 ms for CNN feature extraction and 110 ms for style selection and transfer) to create five stylized results from an input image on an Intel i7 3.4 GHz machine. We use the same set of parameters to generate all the results in the paper. Please refer to the accompanying video to see a real-time demo of our technique.

We have tested our automatic stylization on a wide range of input images, and show a subset of our results in Figs. 1, 2, 3, 7, and 9. Please refer to the supplementary material and video for more examples, comparisons, and a real-time demo of our technique. As can be seen from these results, our stylization method can robustly capture fairly aggressive visual styles without creating artifacts, and is able to generate diverse stylization results. Figs. 2, 3, 7, and 9 also show the automatically chosen style examples that were used to stylize the input photographs. As expected, in most cases the chosen style examples have different semantics from the input image, yet the stylizations are still of high quality. This verifies the advantage of our method when given only a limited set of stylized exemplars.

Figure 9: Our stylization results. The left most images are input photographs and the right images are our automatically stylized results.
Figure 10: Summary of our two user studies to evaluate our style selection method (a) and our style transfer method (b). For each study, we plot the histogram of user ratings of each tested variant. We also sort the (average) scores achieved by each tested method on each of the benchmark images of each method and plot these distributions. For both the selection and transfer methods, our algorithms significantly outperform competing methods.

User study

Due to the subjective nature of image stylization, we validated our stylization technique through user studies that evaluate our style selection and style transfer strategies. For the study, we created a benchmark dataset of 55 images: 50 images were randomly chosen from the FiveK dataset [5] and the rest were downloaded from Flickr. We resized all test images to 500 pixels on the long edge and stored them in 8-bit sRGB JPEG format.
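The benchmark preprocessing above (long-edge resize to 500 px) can be sketched as below. This is a minimal nearest-neighbor stand-in, assuming images as `uint8` NumPy arrays; actual preprocessing would use a proper resampling filter and a JPEG encoder.

```python
import numpy as np

def resize_long_edge(img, long_edge=500):
    """Nearest-neighbor resize so the image's long edge is `long_edge` pixels.
    A minimal sketch of the benchmark preprocessing; `img` is an (H, W, C)
    uint8 array. Production code would use area or Lanczos resampling."""
    h, w = img.shape[:2]
    scale = long_edge / max(h, w)
    nh, nw = round(h * scale), round(w * scale)
    # map each output pixel back to its nearest source pixel
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    return img[rows][:, cols]
```

Fixing the long edge (rather than the width) keeps portrait and landscape inputs at a comparable viewing size during the study.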

We asked a professional artist to create five diverse stylizations for every image in our benchmark dataset as a baseline for evaluation. The artist was told to only use tools that globally edit the color and tone; he used the ‘Levels’, ‘Curves’, ‘Exposure’, ‘Color Balance’, ‘Hue/Saturation’, ‘Vibrance’, and ‘Black and White’ tools in Adobe Photoshop. Creating five different looks for every photograph is challenging even for professional artists. Instead, our artist first constructed different looks, each of which evoked a particular theme (like ‘old photo’, ‘sunny’, ‘romantic’, etc.), applied all of them to all the images in the dataset, and picked the five diverse styles that he preferred the most.

We performed two user studies. In Study 1, we evaluated two style selection methods, our style selection and direct semantic search which directly searches for semantically similar images in the style database. We also explored directly searching in the photo collection using semantic similarity, but its results were consistently poor, which led us to drop this selection method in the larger study. To assess the effect of the size of the style database on the selection algorithm, we tested against two style databases: the full database with 1500 style exemplars, and a small database with 50 style exemplars randomly chosen from the full database.

We compared five different groups of stylization results: the reference dataset retouched by a professional (henceforth, Pro), our style selection with the full style database (Ours 1500) and the small style database (Ours 50), and direct semantic search on the full style database (Direct 1500) and the small style database (Direct 50). For both our style selection and direct semantic search, we apply the same style sampling as in Sec. 5.3 to achieve similar levels of style diversity, and create the results using the same style transfer technique (Sec. 4). Please see the supplementary material for all of these results.

For each image in the benchmark dataset, we showed users five groups of five stylized results (one set each from Ours 1500, Ours 50, Direct 1500, Direct 50, and Pro). Users were asked to rate the stylization quality of each group of results on a five-point Likert scale ranging from 1 (worst) to 5 (best). A total of 37 users participated in this study, and a total of 1498 different image groups were rated, giving us an average of 27.24 ratings per group.

Fig. 10(a) shows the results of Study 1. In this study, Ours 1500 outperforms all the other techniques; we report the mean of all user ratings and the standard deviation of the per-image average scores over the 55 benchmark images. Direct 1500 is substantially worse than Ours 1500. When the style database becomes smaller, the performance of direct search drops dramatically (Direct 50), while our style selection stays stable (Ours 50). We believe this is a result of our two-step style ranking algorithm, which is able to learn the mapping between semantic content and style even with very few style examples; direct search, on the other hand, fails to find good semantic matches when the size of the style database is reduced significantly. Interestingly, we found that even when direct search finds a semantically meaningful match, this does not guarantee a good style transfer result. An example is shown in Fig. 11, where the green background of the exemplar image skews the global statistics and causes the girl’s skin to take on an undesirable green tone. Our technique aggregates style similarity across many images, making it robust to such scenarios.
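The two statistics reported for each method (the mean of all user ratings and the standard deviation of per-image average scores) can be computed in a few lines. The data layout here, a mapping from image id to a list of 1-5 Likert ratings, is a hypothetical one chosen for illustration.

```python
import statistics

def summarize_ratings(ratings_by_image):
    """ratings_by_image: dict mapping image id -> list of 1-5 Likert ratings
    for a single method. Returns (mean of all ratings, standard deviation of
    the per-image average scores), the two numbers reported in Study 1."""
    all_ratings = [r for rs in ratings_by_image.values() for r in rs]
    per_image_avg = [statistics.fmean(rs) for rs in ratings_by_image.values()]
    return statistics.fmean(all_ratings), statistics.stdev(per_image_avg)
```

Averaging within each image before taking the spread separates between-image disagreement (does the method work on this content?) from between-rater noise.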

It is also worth noting that Pro received a lower mean score than Ours 1500, Ours 50, and Direct 1500, with the largest standard deviation of scores. We attribute this to two reasons. First, the artist-created filters do not adapt to the content of the image the way our example-based style transfer technique does. Second, image stylization tends to be subjective; some users might be uncomfortable with the aggressive stylizations of a professional, while our style selection is learned from a more ‘natural’ style database and does not stylize as aggressively.

In Study 2, we compare our style transfer technique with four statistics-based style transfer techniques: MK, which computes an affine transform in CIELab [20]; SMH, which combines three different affine transforms in different luminance bands with a non-linear tone curve [4]; PDF, which uses 3-D histogram matching in CIELab [21]; and PHR, which progressively reshapes the histograms to make them match [23]. We used our own implementation of the MK method and the original authors’ code for the other methods. Style exemplars were chosen by our style selection, and these methods were used only for the transfer. We showed users an input photograph, an exemplar, and a randomly arranged set of five stylized images created using the five techniques, and asked them to rate the results in terms of style transfer and visual quality on a five-point Likert scale from 1 (worst) to 5 (best). 27 participants from the same pool as Study 1 took part; they rated 1554 results in total, giving us 5.65 ratings per input-style pair and 28.25 ratings per input.
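For concreteness, the MK baseline [20] admits a closed-form solution: the unique symmetric affine map that transports the source color mean and covariance to the target's. The sketch below is a minimal NumPy version, not the paper's implementation; it works in any 3-channel space (the paper applies it in CIELab).

```python
import numpy as np

def mk_transfer(src, tgt, eps=1e-8):
    """Linear Monge-Kantorovitch color transfer: map the (N, 3) source
    colors `src` so their mean/covariance match the (M, 3) target `tgt`.
    T = S^{-1/2} (S^{1/2} T_cov S^{1/2})^{1/2} S^{-1/2}."""
    mu_s, mu_t = src.mean(0), tgt.mean(0)
    cov_s = np.cov(src.T) + eps * np.eye(3)  # eps regularizes flat channels
    cov_t = np.cov(tgt.T) + eps * np.eye(3)

    def sqrtm(a):  # symmetric PSD square root via eigendecomposition
        w, v = np.linalg.eigh(a)
        return (v * np.sqrt(np.clip(w, 0, None))) @ v.T

    s_half = sqrtm(cov_s)
    s_half_inv = np.linalg.inv(s_half)
    t = s_half_inv @ sqrtm(s_half @ cov_t @ s_half) @ s_half_inv
    # t is symmetric, so right-multiplying row-vector pixels applies it
    return (src - mu_s) @ t + mu_t
```

Being a single affine map, MK is far less expressive than per-band or histogram-reshaping transfers, which is precisely the robustness/expressiveness trade-off examined in this study.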

Fig. 10(b) shows the results of Study 2. In this study, Ours records the best rating, while MK is ranked second. SMH, PDF, and PHR are less favored by users: these three techniques have more expressive color transfer models, which leads to over-fitting and poor results in many cases. This demonstrates the importance of the style transfer technique for high-quality stylization; our technique balances expressiveness and robustness well.

Our evaluation is, to our knowledge, the first extensive evaluation of style transfer techniques. We will release all our benchmark data, including our professionally created dataset, and the results of the different algorithms for other researchers to compare against.

Figure 11: Failure case of direct search.

7 Conclusion

In this work, we have proposed a completely automatic technique to stylize photographs based on their content. Given a set of target photographic styles, we leverage a large collection of photographs to learn a content-specific style ranking in a completely unsupervised manner. At run-time, we use the learned content-specific style ranking to adaptively stylize images based on their content. Our technique produces a diverse set of compelling, high-quality stylized results. We have extensively evaluated both style selection and transfer components of our technique and studies show that users clearly prefer our results over other variations of our pipeline.


  • [1] X. An and F. Pellacini. User-controllable color transfer. In CGF, volume 29, pages 263–271, 2010.
  • [2] M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand. Fast local laplacian filters: Theory and applications. ACM TOG, 33(5):167:1–167:14, Sept. 2014.
  • [3] S. Bae, S. Paris, and F. Durand. Two-scale tone management for photographic look. ACM TOG (Proc. SIGGRAPH), 25(3):637–645, July 2006.
  • [4] N. Bonneel, K. Sunkavalli, S. Paris, and H. Pfister. Example-based video color grading. ACM TOG (Proc. SIGGRAPH), 32(4):39:1–39:12, July 2013.
  • [5] V. Bychkovsky, S. Paris, E. Chan, and F. Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR, pages 97–104, 2011.
  • [6] J. C. Caicedo, A. Kapoor, and S. B. Kang. Collaborative personalization of image enhancement. In CVPR, pages 249–256, 2011.
  • [7] K. Dale, M. K. Johnson, K. Sunkavalli, W. Matusik, and H. Pfister. Image restoration using online photo collections. In ICCV, pages 2217–2224, 2009.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [9] D. Dowson and B. Landau. The fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.
  • [10] H. S. Faridul, T. Pouli, C. Chamaret, J. Stauder, A. Trémeau, E. Reinhard, et al. A survey of color mapping and its applications. In Eurographics, pages 43–67, 2014.
  • [11] Y. HaCohen, E. Shechtman, D. B. Goldman, and D. Lischinski. Non-rigid dense correspondence with applications for image enhancement. In ACM TOG (Proc. SIGGRAPH), volume 30, page 70, 2011.
  • [12] S. J. Hwang, A. Kapoor, and S. B. Kang. Context-based automatic local image enhancement. In ECCV, volume 7572, pages 569–582, 2012.
  • [13] Y. Hwang, J.-Y. Lee, I. S. Kweon, and S. J. Kim. Color transfer using probabilistic moving least squares. In CVPR, 2014.
  • [14] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding, 2013.
  • [15] M. Johnson, K. Dale, S. Avidan, H. Pfister, W. Freeman, and W. Matusik. Cg2real: Improving the realism of computer generated images using a large collection of photographs. TVCG, 17(9):1273–1285, Sept. 2011.
  • [16] S. B. Kang, A. Kapoor, and D. Lischinski. Personalization of image enhancement. In CVPR, pages 1799–1806, 2010.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [18] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM TOG (Proc. SIGGRAPH), 33(4):149:1–149:11, July 2014.
  • [19] Y. Liu, M. Cohen, M. Uyttendaele, and S. Rusinkiewicz. Autostyle: Automatic style transfer from image collections to users’ images. In CGF, volume 33, pages 21–31, 2014.
  • [20] F. Pitié and A. Kokaram. The linear monge-kantorovitch linear colour mapping for example-based colour transfer. In CVMP, 2007.
  • [21] F. Pitié, A. C. Kokaram, and R. Dahyot. N-dimensional probability density function transfer and its application to color transfer. In ICCV, volume 2, pages 1434–1439, 2005.
  • [22] D. Pollard. A user’s guide to measure theoretic probability, volume 8. Cambridge University Press, 2002.
  • [23] T. Pouli and E. Reinhard. Progressive color transfer for images of arbitrary dynamic range. Comp. & Graph., 35(1):67–80, 2011.
  • [24] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. CG&A, 21(5):34–41, 2001.
  • [25] Y. Shih, S. Paris, C. Barnes, W. T. Freeman, and F. Durand. Style transfer for headshot portraits. ACM TOG (Proc. SIGGRAPH), 33(4):148:1–148:14, July 2014.
  • [26] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM TOG (Proc. SIGGRAPH Asia), 32(6):200:1–200:11, Nov. 2013.
  • [27] Y.-W. Tai, J. Jia, and C.-K. Tang. Soft color segmentation and its applications. PAMI, 29(9):1520–1537, 2007.
  • [28] B. Wang, Y. Yu, and Y.-Q. Xu. Example-based image color and tone style enhancement. In ACM TOG (Proc. SIGGRAPH), volume 30, page 64. ACM, 2011.
  • [29] W. Xu and J. Mulligan. Performance evaluation of color correction approaches for automatic multi-view image and video stitching. In CVPR, pages 263–270, 2010.
  • [30] S. Xue, A. Agarwala, J. Dorsey, and H. Rushmeier. Understanding and improving the realism of image composites. ACM TOG (Proc. SIGGRAPH), 31(4):84:1–84:10, July 2012.
  • [31] J. Yan, S. Lin, S. B. Kang, and X. Tang. A learning-to-rank approach for image color enhancement. In CVPR, pages 2987–2994, 2014.
  • [32] Z. Yan, H. Zhang, B. Wang, S. Paris, and Y. Yu. Automatic photo adjustment using deep neural networks. CoRR, abs/1412.7725, 2015.