Stylizing Face Images via Multiple Exemplars
We address the problem of transferring the style of a headshot photo to face images. Existing methods using a single exemplar lead to inaccurate results when the exemplar does not contain sufficient stylized facial components for a given photo. In this work, we propose an algorithm to stylize face images using multiple exemplars containing different subjects in the same style. Patch correspondences between an input photo and multiple exemplars are established using a Markov Random Field (MRF), which enables accurate local energy transfer via Laplacian stacks. As image patches from multiple exemplars are used, the boundaries of facial components on the target image are inevitably inconsistent. The artifacts are removed by a post-processing step using an edge-preserving filter. Experimental results show that the proposed algorithm consistently produces visually pleasing results.
Keywords: Style transfer, Image processing
Transferring the photo styles of professional headshot portraits to ordinary ones is of great importance in photo editing. Traditionally, it requires professional photographers to perform painstaking post-editing using specially designed photo editing systems. Recently, automatic methods have been proposed to ease this problem (1); (2); (3). These methods transfer the styles of photos produced by professional photographers to ordinary photos using exemplar-based learning algorithms.
Although significant advancements have been made in recent years, existing exemplar-based methods involve only a single exemplar for holistic style transfer. They produce erroneous results if the exemplar cannot provide sufficient stylized facial components for the given photo. A straightforward solution is to select the best exemplar among a collection in the same style (3). However, as the subject in the input photo is different from those in the exemplar set, it is difficult to find a single exemplar where all the facial components are similar to those in the input photo. The mismatches between the input photo and the selected exemplar lead to incompatibility issues, which largely degrade the stylization quality. Figure 1 shows the results of different methods using a single exemplar as reference. Since the hair structures of the subjects in the input image and the selected exemplar are different, the methods based on holistic appearance are less effective in transferring the skin tone and the ambient light to the stylized output. Figure 1(b) and (c) show that the stylized images generated by the holistic methods (1); (2) are either unnatural or less stylistic. In contrast, the local method (3) can effectively stylize the input photo around similar facial components (e.g., the nose and mouth shown in Figure 1(d)). However, undesired effects are likely to be produced in the regions where the components differ (e.g., the forehead). To alleviate the problem of finding proper components for stylization, we select local regions from multiple exemplars instead of relying on a single one. As such, we can consistently find correct and similar components across the exemplars even though they belong to different subjects.
Figure 1: (a) Input photo, (b) Photoshop (1), (c) Holistic (2), (d) Local (3), (e) Proposed.
In this paper, we propose a face stylization algorithm using multiple exemplars. Instead of limiting ourselves to a single exemplar for each input photo, we search the whole collection of exemplars with the same style to find the most similar component represented in each local patch of the input photo. Given a photo, we first align all the exemplars using the local affine transformation and SIFT flow methods (4). Then we locally establish the patch correspondences between the input photo and multiple exemplars through a Markov random field. Next, we construct a Laplacian pyramid for every image and remap the local contrast at multiple scales. Finally, we remove the artifacts caused by inconsistent remapping from different exemplars using an edge-preserving filter. As similar components can be consistently selected from the exemplar collection, the proposed algorithm can effectively perform style transfer to an input photo. Qualitative and quantitative experimental results on a benchmark dataset demonstrate the effectiveness of the proposed algorithm with different artistic styles.
The contributions of this work are summarized as follows:
- We propose a style transfer algorithm in which a Markov random field is used to incorporate patches from multiple exemplars. The proposed method enables the use of all stylization information from different exemplars.
- We propose an artifact removal method based on an edge-preserving filter. It removes the artifacts introduced by inconsistent boundaries of local patches stylized from different exemplars.
- In addition to the visual comparison conducted by existing methods, we perform quantitative evaluations using both objective and subjective metrics to demonstrate the effectiveness of the proposed method.
2 Related Work
Image style transfer methods can be broadly categorized into holistic and local approaches.
Holistic methods typically learn a mapping function using one exemplar to adjust the tone and lighting of the input photo. In (5), a transformation function is estimated over the entire image to map one distribution into another for color transfer. A multi-layer style transfer method is proposed in (6), where an input image is decomposed into base and detail layers and the style is transferred to each layer independently. Further improvement is made in (2), where a multi-scale approach is presented to reduce artifacts through an image pyramid. In (7), a color grading approach is developed using color distribution transfer. A graph regularization for color processing is proposed in (8). To reduce time complexity, an efficient method is proposed in (9) based on the generalized PatchMatch algorithm (10). It uses a holistic non-linear parametric color model to address the dense correspondence problem. We note that these algorithms are effective in transferring image styles holistically at the expense of capturing fine details, which are well transferred by the proposed method.
Local methods transfer the color and tone based on the distributions on the exemplars. In (11), a local method is proposed for regional color transfer between two natural images by probabilistic segmentation, and a scheme based on expectation maximization is proposed to impose spatial and color smoothness. An exemplar-based style transfer method is proposed in (12), where a local affine color transformation model is developed to render natural images at different times of the day. In addition to color or tone transfer, numerous face photo decomposition methods based on edge-preserving filters (13); (14); (15) are developed for makeup transfer (16) and relighting (17). From an identity-specific collection of face images, an algorithm is developed to enhance low-quality photos based on high-quality ones by exploiting holistic and face-specific regions (e.g., deblurring, light transfer, and super resolution) (18). The training and input photos used in (18) are from the same subject, and the goal is image enhancement.
Face Style Transfer Approach:
A local method that transfers the face style of an exemplar to the input face image is proposed in (3). It first generates a dense correspondence between an input photo and one selected exemplar. Then it decomposes each image into a Laplacian stack before transferring the local energy in each image frequency subband within each layer. Finally, all the stacks are aggregated to generate the output image. Since the style represented by the local energy is precisely transferred in multiple layers, the method handles detailed facial components well. Compared to the holistic methods, local approaches can better capture the region details and thus facilitate face stylization. However, if the components appearing in the exemplars and the input photos are significantly different, the resulting images are likely to contain undesired effects. In this work, we use multiple exemplars to solve this problem.
The motivation of this work is illustrated with an example in Figure 2. Both the input image and exemplars are in the same resolution and divided into overlapping patches. Given a collection of exemplars in one style, we aim to transfer the local details and contrast to an input photo while maintaining its textures and structures. We describe the details of the proposed algorithm in the following sections.
3.1 Face Alignment and Local Identification
We align each exemplar to the input photo in the same way as in (3). First, we obtain the facial landmarks of each image using the fast landmark detection method (19). Through landmark correspondence, we apply a local affine transformation to generate a dense correspondence field which warps each exemplar to the input photo. We warp each exemplar accordingly and further align each warped exemplar using the SIFT flow method (4), which refines the dense correspondence field locally to achieve pixel-wise precision. After alignment, we uniformly divide both the exemplars and the input image into overlapping patches. The patch size and the center pixel locations are the same for the input and exemplar patches.
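The uniform division into overlapping patches can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `patch_centers` and `extract_patch`, the odd patch size, and the stride are assumptions made for the example, and images are plain nested lists.

```python
def patch_centers(height, width, patch_size, stride):
    """Centers of a uniform grid of square patches (patch_size assumed odd).

    Patches overlap whenever stride < patch_size.
    """
    half = patch_size // 2
    rows = range(half, height - half, stride)
    cols = range(half, width - half, stride)
    return [(r, c) for r in rows for c in cols]


def extract_patch(image, center, patch_size):
    """Crop the patch_size x patch_size window centered at `center`."""
    half = patch_size // 2
    r, c = center
    return [row[c - half:c + half + 1] for row in image[r - half:r + half + 1]]
```

Because the exemplars are already warped to the input photo, the same list of centers can be reused for the input and for every exemplar, so a patch at a given center always refers to the same facial location.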
We construct an MRF model to incorporate all the exemplars for local patch selection. The MRF formulation considers both patch similarity and local smoothness constraints. We denote $T$ as the number of patches extracted from one image, and $x_p$ as one patch centered at pixel $p$ in the input photo. In addition, we denote $y_p$ and $y_q$ as the selected exemplar patches centered on $p$ and its neighboring pixel $q$. The joint probability of patches from an input photo and selected exemplars can be written as:
$$P(x_1, \ldots, x_T, y_1, \ldots, y_T) = \prod_{(p,q)} \Psi(y_p, y_q) \prod_{p} \Phi(x_p, y_p) \tag{1}$$
where $y_p$ has a discrete representation taking values from the number of exemplars. We denote $y_p^l$ as the patch centered on $p$ in the $l$-th exemplar. We compute the similarity between $x_p$ and $y_p^l$ by
$$\Phi(x_p, y_p^l) = \exp\left(-\frac{D(x_p, y_p^l)}{2\sigma_d^2}\right) \tag{2}$$
where $D(x_p, y_p^l)$ is the distance between an input patch $x_p$ and the corresponding exemplar patch $y_p^l$. We define patch distance in terms of normalized cross correlation and absolute difference by
$$D(x_p, y_p^l) = (1-\alpha)\,|x_p - y_p^l| + \alpha\,\bigl(1 - \mathrm{NCC}(x_p, y_p^l)\bigr) \tag{3}$$
where $\alpha$ is a weighting factor, $|x_p - y_p^l|$ is the tone similarity and $\mathrm{NCC}(x_p, y_p^l)$ is the structural similarity. We set $\alpha$ to be 0.8 in all the experiments since we emphasize the structural similarity during local patch selection. Meanwhile, we also set a small weight (i.e., $1-\alpha = 0.2$) on the tone similarity for the case where the structures among exemplar patches are similar. The value of each image pixel is normalized to $[0, 1]$.
The compatibility function $\Psi(y_p, y_q)$ measures the local smoothness between two exemplar patches centered at pixel $p$ and its neighboring pixel $q$, respectively. We define it as
$$\Psi(y_p, y_q) = \exp\left(-\frac{\|\Omega(y_p) - \Omega(y_q)\|^2}{2 n \sigma_s^2}\right) \tag{4}$$
where $n$ is the number of pixels in $\Omega$, which is the overlapping region between $y_p$ and $y_q$. We use the minimum mean-squared error (MMSE) to estimate the optimal candidate patch with
$$\hat{y}_p = \sum_{y_p} y_p \, \Phi(x_p, y_p) \prod_{q} M_{qp}(y_p) \tag{5}$$
where $M_{qp}$ is the message computed from the previous iteration. The probabilities of the patch similarity and local smoothness are updated in each iteration of the belief propagation (20); (21) with the MRF model. After the belief propagation process, we locally select the optimal patches, i.e., those with the maximum probabilities.
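The patch-selection step can be illustrated with a small sketch. It is not the authors' code. The following are assumptions for the example: the function names, the exponential form of the two terms, the default `sigma` of 0.5, the use of `1 - NCC` as the structure term, and the reduction of the MRF to a 1D chain of patches so that sum-product belief propagation is exact. Only the weighting factor of 0.8 on structure comes from the text.

```python
import math


def ncc(a, b):
    """Normalized cross correlation of two equal-length flattened patches."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0  # flat patch: treated as uncorrelated (assumption)
    return sum(u * v for u, v in zip(da, db)) / denom


def patch_distance(x, y, alpha=0.8):
    """alpha weights the structure term, 1 - alpha the tone term."""
    tone = sum(abs(u - v) for u, v in zip(x, y)) / len(x)
    structure = 1.0 - ncc(x, y)
    return alpha * structure + (1.0 - alpha) * tone


def similarity(x, y, sigma=0.5, alpha=0.8):
    """Data term: maps the patch distance to a value in (0, 1]."""
    return math.exp(-patch_distance(x, y, alpha) / (2.0 * sigma ** 2))


def compatibility(overlap_p, overlap_q, sigma=0.5):
    """Smoothness term on the pixels shared by two neighboring patches."""
    n = len(overlap_p)
    sq = sum((u - v) ** 2 for u, v in zip(overlap_p, overlap_q))
    return math.exp(-sq / (2.0 * n * sigma ** 2))


def select_on_chain(phi, psi):
    """Sum-product belief propagation on a 1D chain of patches.

    phi[p][l]    : data term of exemplar l at node p
    psi[p][a][b] : compatibility of label a at node p with label b at p + 1
    Returns the exemplar index with the largest belief at every node.
    """
    T, K = len(phi), len(phi[0])
    fwd = [[1.0] * K for _ in range(T)]  # messages arriving from the left
    bwd = [[1.0] * K for _ in range(T)]  # messages arriving from the right
    for p in range(1, T):
        msg = [sum(phi[p - 1][a] * fwd[p - 1][a] * psi[p - 1][a][b]
                   for a in range(K)) for b in range(K)]
        total = sum(msg)
        fwd[p] = [m / total for m in msg]
    for p in range(T - 2, -1, -1):
        msg = [sum(phi[p + 1][b] * bwd[p + 1][b] * psi[p][a][b]
                   for b in range(K)) for a in range(K)]
        total = sum(msg)
        bwd[p] = [m / total for m in msg]
    labels = []
    for p in range(T):
        belief = [phi[p][l] * fwd[p][l] * bwd[p][l] for l in range(K)]
        labels.append(max(range(K), key=belief.__getitem__))
    return labels
```

On a 2D patch grid the graph contains loops, so the messages would be passed iteratively in all four directions instead of in a single forward and backward sweep; the chain version is used here only to keep the example short.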
Figure 3: (a) Input photo, (b) Remapped, (c) Guided filtered, (d) Output.
3.2 Local Remapping
We decompose the input photo and every exemplar separately into a Laplacian stack formulation. A Laplacian stack consists of multiple layers among which the last one is the residual and the remaining ones are the subtracted result of two Gaussian filtered images with increasing radius. For each layer, a local energy map is generated by locally averaging the squared layer values. These local energy maps from the exemplars represent the style to be transferred to the input photo. The goal of the local contrast transfer is to update all the layers in the Laplacian stacks of the input image such that the energy distributions are similar to those in the exemplars. We transfer local contrast at each pixel location from multiple exemplars using the local patch selection method described in Section 3.1.
We denote $L_k[\cdot](p)$ and $S_k[\cdot](p)$ as the values of pixel $p$ at the $k$-th Laplacian layer and energy map of an image, respectively. The local remapping function at pixel $p$ can be written as:
$$L_k[O](p) = L_k[I](p) \times \sqrt{\frac{S_k[E_p](p)}{S_k[I](p) + \epsilon}} \tag{6}$$
where $O$ is the remapped image patch, $I$ is the input photo, $E_p$ is the patch in the exemplar photo selected at pixel $p$, and $\epsilon$ is a small number to avoid division by zero. We locally remap the input photo in all the layers except the residual, which only contains low frequency components. When we generate the residual layer of an output image, we use the values from the residual layer of the identified exemplars. After this remapping step with local energy maps, we accumulate all the layers in the Laplacian stack of the input photo. Since a Laplacian stack is constructed from the differences of a Gaussian filtered image at different scales, the accumulation of all the transferred layers generates the stylized output.
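The stack construction and the local remapping can be sketched with NumPy as below. This is an illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian radii (`sigma0` doubled per layer), the energy smoothing radius, the value of `eps`, and the square root in the gain (chosen because the energy is built from squared layer values) are all assumptions. The exemplars are assumed to be already aligned to the input photo, and `labels` is a hypothetical per-pixel map holding the index of the exemplar selected at each pixel.

```python
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge replication."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def laplacian_stack(img, n_layers=5, sigma0=2.0):
    """Differences of Gaussian-filtered images, followed by the residual."""
    layers, previous = [], img
    for k in range(n_layers - 1):
        blurred = gaussian_blur(img, sigma0 * 2 ** k)
        layers.append(previous - blurred)
        previous = blurred
    layers.append(previous)  # low-frequency residual
    return layers


def local_energy(layer, sigma):
    """Local average of the squared layer values."""
    return gaussian_blur(layer ** 2, sigma)


def pick(per_exemplar, labels):
    """Per-pixel selection from a (K, H, W) array using (H, W) int labels."""
    return np.take_along_axis(per_exemplar, labels[None], axis=0)[0]


def remap(photo, exemplars, labels, n_layers=5, sigma0=2.0, eps=1e-4):
    """Transfer local energy from the selected exemplars to the photo."""
    stack_in = laplacian_stack(photo, n_layers, sigma0)
    stacks_ex = [laplacian_stack(e, n_layers, sigma0) for e in exemplars]
    output = np.zeros_like(photo)
    for k in range(n_layers - 1):
        sigma_e = sigma0 * 2 ** (k + 1)
        energy_in = local_energy(stack_in[k], sigma_e)
        energy_ex = pick(
            np.stack([local_energy(s[k], sigma_e) for s in stacks_ex]), labels)
        output += stack_in[k] * np.sqrt(energy_ex / (energy_in + eps))
    # The residual layer is taken from the selected exemplars.
    output += pick(np.stack([s[-1] for s in stacks_ex]), labels)
    return output
```

Because every band layer is a difference of consecutive blurred images, summing all layers of an unmodified stack returns the original image, which is why accumulating the remapped layers yields the stylized output directly.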
Figure 4: (a) Input photo, (b) Local (3), (c) Post processing on (b), (d) Proposed.
3.3 Artifact Removal
We aggregate each layer in the Laplacian stack to generate the remapped output image. As local patches from multiple exemplars are selected between neighboring pixels, each remapped output image is likely to contain artifacts around the facial component boundaries. Figure 3(b) shows one example that contains artifacts due to inconsistent local patches. As such, we use an edge-preserving filter (22); (23) to remove artifacts and retain facial details. We use the input photo as guidance to filter the remapped result. The artifacts are removed at the expense of missing local details. Nevertheless, these details can be recovered by reproducing the same blurring on the input photo: we use the input photo as guidance to filter itself. The differences between this filtered result and the input photo are the details missing from the filtered remapped result. We transfer these details back to the filtered remapped result to minimize over-smoothing effects. Consequently, the holistic tone and local contrast are well maintained in the final output while artifacts are effectively removed.
Figure 3 shows the main steps of the artifact removal process. Given an input photo, we use the matting method (24) to substitute its original background with a predefined background. We then use the guided filter (15) to smooth the remapped result with the input photo as guidance, as shown in Figure 3(c). The radius of the guided filter is set relatively large to remove the artifacts on the remapped result. The downside of filtering with a large radius is that the filtered images are likely to be over-smoothed. However, we can alleviate this problem with the help of the input photo. First, we use the guided filter to smooth the input photo using itself as guidance. The filter radius is set the same as in the previous filtering process on the remapped result. The missing details can then be obtained by subtracting the filtered result from the input photo. Finally, we add the missing details back to the smoothed remapped image and generate the final result shown in Figure 3(d).
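The filtering and detail-recovery steps can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: it uses a plain box-filter formulation of the guided filter for grayscale images, and the function names, the radius `r`, and the regularization `eps` are assumptions for the example. The matting step is omitted.

```python
import numpy as np


def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window with edge replication."""
    kernel = np.ones(2 * r + 1) / (2 * r + 1)
    padded = np.pad(img, r, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def guided_filter(guide, src, r, eps):
    """Grayscale guided filter: smooths `src` while following edges of `guide`."""
    mean_g = box_mean(guide, r)
    mean_s = box_mean(src, r)
    var_g = box_mean(guide * guide, r) - mean_g * mean_g
    cov_gs = box_mean(guide * src, r) - mean_g * mean_s
    a = cov_gs / (var_g + eps)   # local linear coefficient
    b = mean_s - a * mean_g      # local offset
    return box_mean(a, r) * guide + box_mean(b, r)


def remove_artifacts(photo, remapped, r=8, eps=1e-3):
    """Smooth the remapped result, then add back the details the filter removed."""
    smoothed = guided_filter(photo, remapped, r, eps)
    # Filtering the photo with itself as guidance uses the same radius, so the
    # difference estimates the details lost from the remapped result.
    details = photo - guided_filter(photo, photo, r, eps)
    return smoothed + details
```

A quick sanity check of this design: if the remapped image equals the input photo, the smoothed term and the subtracted term cancel, and the function returns the photo unchanged.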
We note that the main contribution to the high-quality stylized images is the selection of proper local patches from multiple exemplars rather than the removal of artifacts. We show one example in Figure 4 where the stylized image is obtained by the state-of-the-art method (3) and post-processed by the artifact removal process discussed above. Without correct exemplar selection, the artifacts in the stylized image cannot be removed. On the other hand, the proposed algorithm transfers low frequency components from multiple exemplars while preserving high frequency contents of the input photo. Figure 3(c) and (d) show one example where the guided filter is used to suppress inconsistent artifacts (due to MRF regularization) and maintain high frequency details in the input photo. In contrast, the state-of-the-art methods may fail to transfer high frequency details from exemplars. Another example is shown in Figure 7(c) where undesired textures such as wrinkles or beard are wrongly transferred to the output image.
The main steps of the proposed style transfer algorithm are summarized in Algorithm 1.
Figure 5: (a) Platon, (b) Martin, (c) Kelco.
Comparison figures: (a) Input photo, (b) Holistic (2), (c) Local (3), (d) Proposed.
4 Experimental Results
In all the experiments we set in Equation 2 to be and of Equation 4 in be . The Laplacian stack is set to be 5 (which is the same as (3)). The image resolution of both input photo and exemplars is pixels. The resolution of the local patches used by the MRF is pixels. When smaller patches are used, more artifacts may be introduced due to inconsistency among multiple exemplars. For the artifact removal process, the radius of the guided filter is pixels. Note that we first generate the local energy map for each layer in the Laplacian stack and warp this map using the dense correspondence field. The evaluation is conducted on the benchmark dataset from (3). The numbers of photos from the Platon, Martin and Kelco collections are 34, 54 and 77, respectively. As shown in Figure 5, the photography style of each collection is drastically different from each other. In addition, all the exemplars differ significantly from 98 input photos which are obtained from Flickr (25).
We evaluate the proposed algorithm against the state-of-the-art methods (2); (3). The results of these two methods are generated using the code provided by the authors. For each photo, both methods use the same exemplar from the collections, which is selected by (3). In the following, we present evaluation results on different collections. More experimental results can be found at the authors’ website.
4.1 Qualitative Evaluation
We evaluate all the compared methods on the Platon dataset in Figure 6, where the input photos are acquired under varying lighting conditions. The holistic method (2) does not perform well as there is strong contrast in the images. It generates numerous artifacts on the regions with cast shadows shown on the first row. The local method (3) alleviates dark lighting effects on the left cheek with the guidance of corresponding regions from the exemplar. However, it is less effective in transferring details around the right eye region, mainly because the corresponding region of the exemplar is also dark. The input photos and exemplars on the second and third rows of Figure 6 contain significant differences in facial components (e.g., long and short hair). Neither of these two methods is able to transfer style naturally. In contrast, the proposed algorithm consistently selects similar facial components from multiple exemplars, and effectively transfers local contrast for stylization.
Figure 7 shows the evaluation results using exemplar images from the Martin dataset. As a global transform is used in the holistic method, local details are likely to be lost and the results are unnatural, especially around the nose and mouth regions, as shown on the first row of Figure 7(b). The local method can successfully transfer local contrast when the input photo and exemplar have similar facial components. However, it also transfers the high frequency details of one exemplar to the stylized result, thereby making the image unnatural when the exemplar and input photo have distinct local contents. As shown in Figure 7(c), the wrinkles, beard and hair of the exemplar are transferred to the stylized image. By using multiple exemplars, the proposed algorithm can effectively transfer lighting and low frequency components of exemplars without obvious artifacts. Compared to holistic and local methods, the proposed algorithm is more effective in transferring local contrast and preserving the natural appearance of the input photos.
On the Kelco dataset shown in Figure 8, the local method is less effective in transferring details around the dissimilar regions (e.g., hair). For the holistic method, the difference in the luminance distribution results in an unnatural stylized image. Although a portrait may be acquired under various lighting conditions with different facial components that are not well described or matched by one single exemplar, with a collection of exemplars the proposed algorithm can accurately identify corresponding patches for each photo patch to transfer local details effectively.
4.2 Quantitative Evaluation
In quantitative evaluations we first compare the results generated by different methods with one reference image edited by an artist. A human subject study is then conducted to evaluate the local method (3) and the proposed algorithm.
Evaluation with Reference Image
We evaluate the results generated by three methods. The exemplar is manually selected for holistic and local methods. Instead of relying on automatic exemplar selection as carried out in (3), this manually selected exemplar is used as the most similar one to the input photo. We use the PSNR and FSIM (26) metrics to measure the tone and feature similarities with the reference images.
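For reference, the PSNR part of this comparison can be computed as in the short sketch below. The function name `psnr`, the nested-list image representation, and the peak value of 1.0 (pixel values normalized to the unit range) are assumptions for the example. FSIM is considerably more involved and is not shown.

```python
import math


def psnr(image, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared = [(a - b) ** 2
               for row_a, row_b in zip(image, reference)
               for a, b in zip(row_a, row_b)]
    mse = sum(squared) / len(squared)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher value means the stylized result is closer to the artist-edited reference in terms of mean squared error. For instance, a uniform offset of 0.1 on unit-range images gives a mean squared error of 0.01 and therefore 20 dB.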
Figure 9: (a) Input photo, (b) Reference, (c) Exemplar, (d) Holistic (2), (e) Local (3), (f) Proposed.
Figure 9 shows the evaluation result where the reference image shown in (b) is manually edited by an artist. The proposed algorithm performs favorably against the other methods in terms of PSNR and FSIM. The exemplar shown in (c) shares many similarities with the input photo in facial components (i.e., eyes, nose, mouth and ears). However, it still contains differences around the hair and shoulder regions. The hair region of the exemplar is bright while it is dark in the input photo. On the other hand, the shoulder of the exemplar is not as bright as that in the input photo. Despite significant similarities, these differences affect how the holistic and local methods generate stylized face images based on one exemplar, as shown in Figure 9(d) and (e). The stylized image by the holistic method contains artifacts on the face region, and the result by the local scheme contains regions with unnatural lighting (e.g., bright hair and dark shoulders) when compared with the reference photo. In other words, minor differences are likely to affect existing methods based on a single exemplar, whether holistic or local. Furthermore, we note that in practice it is challenging to find a well suited exemplar for an input photo. The proposed method alleviates this problem by identifying matching patches in a collection of exemplars for effective stylization of facial details.
Human Subject Evaluation
The human subject evaluation on the stylized face images is carried out on three datasets. As shown in Section 4.2.1, the holistic method is not effective for transferring local contrast and thus the evaluation focuses on the comparison between the proposed and local approaches.
There are 65 participants in the experiments (45 are graduate students or faculty members). For each participant, we randomly select 60 photos and split them into three subsets. We assign three styles to the three subsets randomly and generate transferred results using the two evaluated methods. For visual comparison, the input photo is positioned in the middle and the two results are shown on either side in random order on a high resolution display. We show some photo samples in each style to each subject before the experiments. The participants affiliated with the university are asked to select the result with the fewest artifacts (i.e., to choose the image in which local contrast is transferred most effectively). The other participants are asked to subjectively select the result in which the style represented by local contrast is well transferred. We use different criteria because most participants affiliated with the university have a research background and are experienced in picking up minor artifacts in the transferred images, whereas the other participants tend to select images based on personal preference. We tally the votes and show the voting results of each method in Figures 10 and 11, respectively. The evaluation results indicate that the performance is similar between the two groups of participants. In other words, the quality of a stylized face image is mainly affected by artifacts. Overall, human subjects consider that the proposed method performs favorably against the local method on the three styles.
Figure 12 shows some stylized images in this evaluation. The input images are on the first row. The results by the local and proposed algorithms are on the second and third rows, respectively. The photos marked by red rectangles indicate the results preferred by the subjects. The stylized image generated by the local method shown in (a) contains inconsistent local contrast around the hair and ear region. In (b), the result generated by the local method lacks contrast in the hair region. In addition, this stylized image contains artifacts in the forehead region. In contrast, the proposed algorithm is able to effectively transfer the local contrast without generating such artifacts. In (c)-(e), both methods are able to effectively transfer local contrast without introducing artifacts. The user preference for these images is somewhat random and the two methods receive almost the same number of votes. As different subjects appear in the exemplar and input photo in practice, it is challenging to find similar facial components from only one exemplar. The proposed algorithm alleviates this problem by using a collection of exemplars, and performs favorably against the local method on average across three styles as shown in Figures 10, 11, and 12.
5 Concluding Remarks
In this work, we propose a face stylization algorithm using multiple exemplars. As single exemplar-based methods are less effective in finding similar facial components for style transfer, we propose an algorithm that uses a collection of exemplars and performs local patch identification via a Markov Random Field model. The facial components of an input photo can be properly selected from multiple exemplars through the MRF regularization. It enables effective local energy transfer in the Laplacian stacks to construct the stylized output. However, the stylized image is likely to contain artifacts due to inconsistency among multiple exemplars. An effective artifact removal method based on an edge-preserving filter is used to refine the stylized output without losing local details. Experiments on the benchmark datasets containing three styles demonstrate the effectiveness of the proposed algorithm against the state-of-the-art methods in terms of qualitative and quantitative evaluations.
- journal: Computer Vision and Image Understanding
- Adobe, Adobe photoshop cs6 match color, http://www.adobe.com/products/photoshop.html/ (2014).
- K. Sunkavalli, M. K. Johnson, W. Matusik, H. Pfister, Multi-scale image harmonization, ACM Transactions on Graphics (SIGGRAPH).
- Y. Shih, S. Paris, C. Barnes, W. T. Freeman, F. Durand, Style transfer for headshot portraits, ACM Transactions on Graphics (SIGGRAPH).
- C. Liu, J. Yuen, A. Torralba, SIFT flow: Dense correspondence across scenes and its applications, IEEE Transactions on Pattern Analysis and Machine Intelligence.
- F. Pitie, A. C. Kokaram, R. Dahyot, N-dimensional probability density function transfer and its application to color transfer, in: IEEE International Conference on Computer Vision, 2005.
- S. Bae, S. Paris, F. Durand, Two-scale tone management for photographic look, ACM Transactions on Graphics (SIGGRAPH).
- F. Pitié, A. C. Kokaram, R. Dahyot, Automated colour grading using colour distribution transfer, Computer Vision and Image Understanding.
- O. Lezoray, A. Elmoataz, S. Bougleux, Graph regularization for color image processing, Computer Vision and Image Understanding.
- Y. HaCohen, E. Shechtman, D. B. Goldman, D. Lischinski, Non-rigid dense correspondence with applications for image enhancement, ACM Transactions on Graphics (SIGGRAPH).
- C. Barnes, E. Shechtman, D. B. Goldman, A. Finkelstein, The generalized PatchMatch correspondence algorithm, in: European Conference on Computer Vision, 2010.
- Y.-W. Tai, J. Jia, C.-K. Tang, Local color transfer via probabilistic segmentation by expectation-maximization, in: IEEE Conference on Computer Vision and Pattern Recognition, 2005.
- Y. Shih, S. Paris, F. Durand, W. T. Freeman, Data-driven hallucination for different times of day from a single outdoor photo, ACM Transactions on Graphics (SIGGRAPH Asia).
- Z. Farbman, R. Fattal, D. Lischinski, R. Szeliski, Edge-preserving decompositions for multi-scale tone and detail manipulation, ACM Transactions on Graphics (SIGGRAPH).
- Q. Yang, N. Ahuja, K.-H. Tan, Constant time median and bilateral filtering, International Journal of Computer Vision, 2014.
- K. He, J. Sun, X. Tang, Guided image filtering, IEEE Transactions on Pattern Analysis and Machine Intelligence.
- D. Guo, T. Sim, Digital face makeup by example, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009.
- X. Chen, M. Chen, X. Jin, Q. Zhao, Face illumination transfer through edge-preserving filters, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.
- N. Joshi, W. Matusik, E. H. Adelson, D. Kriegman, Personal photo enhancement using example images, ACM Transactions on Graphics.
- V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- W. T. Freeman, E. C. Pasztor, O. T. Carmichael, Learning low-level vision, International Journal of Computer Vision, 2000.
- J. S. Yedidia, W. T. Freeman, Y. Weiss, Understanding belief propagation and its generalizations, in: Exploring Artificial Intelligence in the New Millennium, 2003.
- G. Petschnigg, M. Agrawala, H. Hoppe, R. Szeliski, M. Cohen, K. Toyama, Digital photography with flash and no-flash image pairs, ACM Transactions on Graphics (SIGGRAPH).
- E. Eisemann, F. Durand, Flash photography enhancement via intrinsic relighting, ACM Transactions on Graphics (SIGGRAPH).
- A. Levin, D. Lischinski, Y. Weiss, A closed form solution to natural image matting, IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Flickr, Flickr, https://www.flickr.com/ (2014).
- L. Zhang, L. Zhang, X. Mou, D. Zhang, FSIM: A feature similarity index for image quality assessment, IEEE Transactions on Image Processing.