CartoonRenderer: An Instance-based Multi-Style Cartoon Image Translator
Instance-based photo cartoonization is a challenging image stylization task that aims to transform realistic photos into cartoon-style images while preserving the semantic content of the photos. State-of-the-art deep neural network (DNN) methods still fail to produce satisfactory results on photos in the wild, especially photos that have high contrast and are full of rich textures. The reason is that cartoon-style images tend to have smooth color regions and emphasized edges, which conflicts with the need to keep the clear semantic content of realistic photos, i.e., textures, shapes, etc. Previous methods have difficulty producing cartoon-style textures and preserving semantic content at the same time. In this work, we propose a novel “CartoonRenderer” framework which uses a single trained model to generate multiple cartoon styles. In a nutshell, our method maps a photo into a feature model and renders the feature model back into image space. In particular, cartoonization is achieved by transforming the feature model in feature space with our proposed Soft-AdaIN. Extensive experimental results show that our method produces higher-quality cartoon-style images than prior art, with accurate semantic content preservation. In addition, because the whole generation process is decoupled into “Modeling-Coordinating-Rendering” parts, our method can easily process higher-resolution photos, which is intractable for existing methods.
Keywords: Non-photorealistic rendering, Neural style transfer, Image generation.
Cartoon style is one of the most popular artistic styles in today’s world, especially in online social media. To obtain high-quality cartoon images, artists need to draw every line and paint every color block, which is labor intensive. Therefore, a well-designed approach for automatically rendering realistic photos in cartoon style is of great value. Photo cartoonization is a challenging image stylization task which aims at transforming realistic photos into cartoon-style images while preserving the semantic content of the photos.
Recently, stylized image rendering has been widely studied, and several inspirational methods based on Convolutional Neural Networks (CNNs) have been proposed. Gatys et al. formulate stylized image rendering as an optimization problem that translates the style of an image while preserving its semantic content. This method produces promising results when transforming images into traditional oil painting styles, such as Van Gogh’s style and Monet’s style. However, it suffers from long running times caused by the tremendous amount of computation. Building on Gatys’s pioneering work, some researchers have devoted substantial effort to accelerating training and inference through feed-forward networks. Methods in this line employ a feed-forward network as a generator to produce stylized results and have achieved significant success. Since cartoon style is one of the artistic styles, many existing methods for artistic style transfer can also be used to transform realistic photographs into cartoon style. However, even state-of-the-art methods fail to stably produce acceptable results on content photos in the wild, especially high-resolution photographs that are full of complex texture details.
The main reasons are as follows. First, unlike artworks in other artistic styles (e.g., oil painting), cartoon images tend to have clear edges, smooth color blocks and simple textures. As a consequence, cartoon images have sparse gradient information, which makes it hard for ordinary convolutional networks to extract valuable features that describe cartoon style well. Second, clear semantic content is often difficult to preserve. For example, an apple should still be round and red in cartoon images. However, current instance-based algorithms tend to preserve local and noisy details but fail to capture the global characteristics of cartoon images. This is because such algorithms rely purely on a perceptual loss or Gram matrices to describe image style; this type of loss encourages the transfer of local style textures that can be described as “strokes”, which conflicts with the objective of preserving detailed semantic content at the same time. Third, current GAN-based algorithms cannot handle high-resolution images because they use an end-to-end generator, which carries a large computational burden.
To address the issues mentioned above, we propose CartoonRenderer, a novel learning-based approach that renders realistic photographs in cartoon style. Our method takes a set of cartoon images and a set of realistic photographs for training. No correspondence is required between the two sets of training data. It is worth noting that we also do not require the cartoon images to come from the same artist. Similar to other instance-based methods, our CartoonRenderer receives a photograph and a cartoon image as inputs and models each of them in a high-dimensional feature space to obtain their “feature models”. Inspired by AdaIN, we propose Soft-AdaIN to robustly align the “feature model” of the photograph according to the “feature model” of the cartoon image. Then, we use a Rendering Network to generate the output cartoonized photograph from the aligned “feature model”. Furthermore, we employ a set of well-designed loss functions to train the CartoonRenderer. In addition to using a pre-trained VGG19 model to compute the content loss and style loss, we add an extra reconstruction loss and an adversarial loss to further preserve the detailed semantic content of the photographs to be cartoonized. Our method is capable of producing high-quality cartoonized photographs, which are substantially better than those of state-of-the-art methods. Furthermore, because the whole generation process is decoupled into “Modeling-Coordinating-Rendering” parts, our method is able to process high-resolution photographs (up to 5000×5000 pixels) while maintaining high-quality results, which is infeasible for state-of-the-art methods.
2 Related Works
2.0.1 Non-photorealistic rendering (NPR)
Non-photorealistic rendering is an alternative to conventional, photorealistic computer graphics. It aims to make visual communication more effective and to automatically create aesthetic results resembling a variety of existing art styles. The main application areas of NPR are animation and rendering. Some methods have been developed to create images with flat shading, mimicking cartoon styles. Such methods use either image filtering or formulate the task as an optimization problem. However, applying filtering or optimization uniformly to the entire image does not give the high-level abstraction that an artist would normally produce, such as making object boundaries clear. To improve the results, alternative methods that rely on image segmentation have been proposed, although at the cost of requiring some user interaction. Dedicated methods have also been developed for portraits, where semantic segmentation can be derived automatically by detecting facial components. Nevertheless, such methods cannot cope with general images. Turning photos of various categories into cartoons, which is the problem studied in this paper, is much more challenging.
The task of instance-based photo cartoonization is to generate a cartoon-style version of a given input image according to user-specified style attributes. In our method, the style attributes are provided by a reference cartoon image. Our model learns from a large number of unpaired realistic photographs and cartoon images to capture the common characteristics of cartoon styles, and re-renders the photographs in cartoon style. The photographs and cartoon images are not required to be paired, and the cartoon images are not required to be grouped by artist. This makes training data much easier to collect.
For the sake of discussion, let $\mathcal{P}$ and $\mathcal{C}$ be the photograph domain and the cartoon domain respectively. Inspired by object modeling and rendering techniques with deep neural networks in the field of computer graphics, we formulate photograph cartoonization as a process of Modeling-Coordinating-Rendering:
Modeling: For a photograph $x_p \in \mathcal{P}$ and a cartoon image $x_c \in \mathcal{C}$, we construct their feature models $f(x_p)$ and $f(x_c)$ respectively. A feature model consists of multi-scale feature maps that represent the style characteristics.
Coordinating: Align the feature model $f(x_p)$ of the photograph according to the feature model $f(x_c)$ of the cartoon image to obtain the coordinated feature model $\tilde{f}$, which possesses $x_p$’s content representation with $x_c$’s style representation.
Rendering: Generate the cartoonized photograph $y$ (in the cartoon domain $\mathcal{C}$) from the coordinated feature model $\tilde{f}$, which can be considered as reconstruction.
In a nutshell, we propose a novel instance-based method, “CartoonRenderer”, to render the input photo $x_p$ into a cartoon-style result $y$ according to a reference cartoon image $x_c$, which can be described as $y = G(x_p, x_c)$. Here $G$ represents the non-linear mapping function of the whole “Modeling-Coordinating-Rendering” process. It is worth noting that the style attributes are provided by the input cartoon image $x_c$, so a single trained “CartoonRenderer” can render different cartoon styles when fed different reference images. Results with different reference cartoon images are demonstrated in Figure 1. We present the details of our model architecture in Section 3.1 and propose a series of loss functions for training in Section 3.2.
3.1 Model Architecture
As demonstrated in Figure 2, our CartoonRenderer consists of three parts: Modeling, Coordinating and Rendering. In addition, a discriminator D is employed to produce the adversarial loss. CartoonRenderer follows the auto-encoder architecture. The Modeling network maps input images into feature spaces. Unlike the traditional encoders used in AdaIN and MUNIT, our Modeling network maps the input image into feature spaces at multiple scales instead of a single fixed scale. The Coordinating part of CartoonRenderer consists of four Soft-AdaIN blocks, one for each element of the feature model. Each Soft-AdaIN block aligns the element at the corresponding scale in the photo’s feature model according to the feature model of the cartoon image. Finally, we train a Rendering network to reconstruct an image from the coordinated feature model.
3.1.1 Modeling network.
The Modeling network is used to construct the feature model of the input image. The great success of U-Net based methods in high-precision segmentation has shown that the contextual information embedded in shallow features plays an important role in preserving detailed semantic content. Shallow features have a relatively small receptive field, which makes them sensitive to local and detailed texture information, while deep features are better at describing the characteristics of global and abstract textures. Both local and global texture information are important for generating high-quality images. We therefore use feature responses at multiple scales, instead of a single fixed scale, to represent the images. The collection of multi-scale features can be regarded as a high-dimensional feature model, which contains local and global semantic information and is able to represent the input image well in terms of both content and style. We employ the first few layers of a VGG19 network (up to the relu4_1 layer) as the Modeling network. Following the definition of the content loss and style loss used in AdaIN, we choose the outputs of the relu1_1, relu2_1, relu3_1 and relu4_1 layers to construct the feature model. We define the feature model of input image $x$ as:
$$f(x) = \{f_1(x), f_2(x), f_3(x), f_4(x)\}$$
where $f_i(x)$ is the feature response of the relu$i$_1 layer for input $x$. $f(x)$ can be regarded as a “model” of the input image in the space spanned by the multi-scale feature subspaces. The feature model represents image $x$ at different scales.
3.1.2 Soft-AdaIN for robust feature coordinating.
AdaIN adaptively computes the affine parameters from the style input instead of learning them:
$$\mathrm{AdaIN}(u, v) = \sigma(v)\left(\frac{u - \mu(u)}{\sigma(u)}\right) + \mu(v)$$
AdaIN first scales the normalized content input $u$ with $\sigma(v)$ and then adds $\mu(v)$ as a bias, where $v$ is the style input and $\mu(\cdot)$ and $\sigma(\cdot)$ are the channel-wise mean and standard deviation of their input. As mentioned in Section 1, cartoon images have quite distinctive characteristics that are far from those of realistic photos, which means that there is a large gap between the distributions of photos and cartoon images. AdaIN explicitly replaces the feature statistics of the photo with the corresponding feature statistics of the cartoon image, which breaks the consistency and continuity of the feature map. This inconsistency and discontinuity cause obvious artifacts, as shown in Figure 3(c).
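To make the statistic replacement concrete, the following is a minimal sketch of AdaIN on plain Python lists rather than tensors. The function names `channel_stats` and `adain`, the list-of-channels layout, and the small `eps` added for numerical stability are illustrative assumptions, not part of the original implementation.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one channel given as a flat list of activations."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    return mean, math.sqrt(var + eps)  # eps is an assumed stabilizer


def adain(content, style, eps=1e-5):
    """Channel-wise AdaIN: normalize the content, then apply style statistics.

    content, style: lists of channels; each channel is a flat list of
    activations (spatial positions flattened). Spatial sizes may differ,
    but the number of channels must match.
    """
    out = []
    for c_ch, s_ch in zip(content, style):
        mu_c, sigma_c = channel_stats(c_ch, eps)
        mu_s, sigma_s = channel_stats(s_ch, eps)
        out.append([sigma_s * (v - mu_c) / sigma_c + mu_s for v in c_ch])
    return out


# Toy example: two channels, content and style of different spatial size.
content = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.0, 12.0, 12.0]]
style = [[5.0, 7.0], [-1.0, 1.0, -1.0, 1.0]]
result = adain(content, style)
```

After the call, each channel of `result` carries the mean and (up to `eps`) the standard deviation of the corresponding style channel, which is exactly the hard replacement that the text identifies as a source of artifacts.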
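To circumvent this problem, we design Soft-AdaIN, which aligns the feature model of photographs for cartoon stylization. As shown in Figure 2, the Coordinating part consists of four Soft-AdaIN blocks. Like AdaIN, Soft-AdaIN receives a content input and a style input. In this paper, we denote the content input (which comes from photographs) as $u_p$ and the style input (which comes from cartoon images) as $u_c$. Two mini convolutional networks $g_p$ and $g_c$ are used to further extract features of $u_p$ and $u_c$. Since the spatial shapes of $g_p(u_p)$ and $g_c(u_c)$ may not match, we adopt global average pooling to pool them into $1 \times 1 \times ch$ tensors, where $ch$ is the channel number of $u_p$ and $u_c$. The pooled tensors are concatenated along the channel dimension into a $1 \times 1 \times 2ch$ tensor. We then employ two fully-connected layers to compute the channel-wise weight $\omega$ from the concatenated tensor; $\omega$ is a $1 \times 1 \times ch$ tensor. In fact, $\omega$ can be regarded as channel-wise weights for adaptively blending the feature statistics of the photo input and the cartoon input:
$$\sigma'(u_p, u_c) = \omega \, \sigma(u_c) + (1 - \omega) \, \sigma(u_p)$$
$$\mu'(u_p, u_c) = \omega \, \mu(u_c) + (1 - \omega) \, \mu(u_p)$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ denote the channel-wise mean and standard deviation. Our proposed Soft-AdaIN scales the normalized content input with $\sigma'(u_p, u_c)$ and shifts it with $\mu'(u_p, u_c)$:
$$\mathrm{SoftAdaIN}(u_p, u_c) = \sigma'(u_p, u_c)\left(\frac{u_p - \mu(u_p)}{\sigma(u_p)}\right) + \mu'(u_p, u_c)$$
We employ Soft-AdaIN to coordinate the feature model of the realistic photo according to the feature model of the cartoon image. For a photograph $x_p$ and a cartoon image $x_c$, we have
$$f(x_p) = \{f_i(x_p)\}_{i=1}^{4}, \quad f(x_c) = \{f_i(x_c)\}_{i=1}^{4}$$
To perform feature coordination, we apply Soft-AdaIN to each pair of corresponding elements in $f(x_p)$ and $f(x_c)$:
$$\tilde{f} = \{\mathrm{SoftAdaIN}(f_i(x_p), f_i(x_c))\}_{i=1}^{4}$$
where $\tilde{f}$ is the coordinated feature model used to generate the output cartoonized result $y$.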
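The blending idea behind Soft-AdaIN can be sketched as follows. This is an illustration under stated assumptions: the channel-wise weight `omega` is passed in directly instead of being predicted by the mini convolutional networks and fully-connected layers; the convention that `omega = 1` selects the cartoon statistics and `omega = 0` keeps the photo statistics is assumed; and the names `channel_stats`, `soft_adain` and the `eps` term are hypothetical.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one channel given as a flat list of activations."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    return mean, math.sqrt(var + eps)  # eps is an assumed stabilizer


def soft_adain(photo_feat, cartoon_feat, omega, eps=1e-5):
    """Blend photo and cartoon statistics channel by channel.

    photo_feat, cartoon_feat: lists of channels (flat lists of activations).
    omega: one weight in [0, 1] per channel (assumed convention: 1 means
    use cartoon statistics only, 0 means keep photo statistics).
    """
    out = []
    for p_ch, c_ch, w in zip(photo_feat, cartoon_feat, omega):
        mu_p, sigma_p = channel_stats(p_ch, eps)
        mu_c, sigma_c = channel_stats(c_ch, eps)
        sigma_mix = w * sigma_c + (1.0 - w) * sigma_p  # blended scale
        mu_mix = w * mu_c + (1.0 - w) * mu_p           # blended shift
        out.append([sigma_mix * (v - mu_p) / sigma_p + mu_mix for v in p_ch])
    return out


def coordinate(photo_model, cartoon_model, omegas):
    """Apply soft_adain to every scale of two feature models."""
    return [soft_adain(p, c, w)
            for p, c, w in zip(photo_model, cartoon_model, omegas)]


photo_feat = [[0.0, 1.0, 2.0, 3.0]]
cartoon_feat = [[10.0, 14.0]]
halfway = soft_adain(photo_feat, cartoon_feat, omega=[0.5])
```

With `omega = [0.0]` the photo features pass through unchanged, and with `omega = [1.0]` the function reduces to plain AdaIN, so intermediate weights move the statistics only part of the way toward the cartoon image.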
3.1.3 Rendering Network.
The Rendering Network renders a feature model into image space. Its architecture is similar to the expansive path of U-Net. As shown in Figure 2, the Rendering Network consists of four blocks. The first three blocks have two paths, a concatenation path and an upsampling path, while the last block has only an upsampling path. The concatenation path receives the element of the feature model at the corresponding scale (the third-, second- and first-scale elements for the first, second and third blocks respectively). The upsampling path receives the preceding block’s output (the first block receives the deepest, fourth-scale element as input) and expands the feature responses. We use reflection padding before each convolutional layer to avoid border artifacts. The output of the concatenation path and the output of the upsampling path have the same spatial size, so we can concatenate them along the channel dimension. The concatenated feature map is then fed into another activation layer and becomes the final output of the current block. The last block is a pure upsampling block without a concatenation path, and its upsampling path has the same structure as those of the preceding blocks. All activation functions in the Rendering Network are ReLU. By adopting this multi-scale architecture, the Rendering Network is able to make full use of both local and global texture information when generating the output images.
3.2 Loss Function
The loss function used to train the CartoonRenderer consists of four parts: (1) the style loss $\mathcal{L}_{style}$, which guides the output to have the same cartoon style as the input cartoon image; (2) the content loss $\mathcal{L}_{content}$, which preserves the photographs’ semantic content during cartoon stylization; (3) the adversarial loss $\mathcal{L}_{adv}$, which further drives the generator network to render input photographs in the desired cartoon styles; and (4) the reconstruction loss $\mathcal{L}_{recon}$, which guides the Rendering network to reconstruct the original input images from the corresponding un-coordinated feature models. We formulate the loss function in a simple additive form:
$$\mathcal{L} = \lambda_{style}\mathcal{L}_{style} + \lambda_{content}\mathcal{L}_{content} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{recon}\mathcal{L}_{recon}$$
where $\lambda_{style}$, $\lambda_{content}$, $\lambda_{adv}$ and $\lambda_{recon}$ balance the four losses. In all our experiments, these weights are fixed to achieve a good balance of style and content preservation.
3.2.1 Content loss and style loss.
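Similar to AdaIN, we reuse the elements of the feature model (feature maps extracted by the Modeling network) to compute the style loss and the content loss. The Modeling network is initialized with a pre-trained VGG-19. The content loss is defined as:
$$\mathcal{L}_{content} = \sum_{i=1}^{4} \left\lVert \mathrm{norm}(f_i(x_p)) - \mathrm{norm}(f_i(y)) \right\rVert_1$$
where $f_i(\cdot)$ is the $i$-th element of the feature model, $x_p$ is the input photograph, $y$ is the cartoonized output, and $\mathrm{norm}(\cdot)$ normalizes feature maps with their channel-wise mean and variance.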
Unlike other style transfer methods, we define the semantic content loss as the sum of sparse (L1) regularization terms on normalized VGG feature maps instead of on the original feature maps. This is because the statistics of feature maps affect the image’s style. Directly using the feature maps to compute the content loss would impose a restriction on style, which drives the output image to be very similar to the input photograph. The normalization operation removes the representation of image style from the feature maps, so the sparse regularization of normalized feature maps better describes the differences between the semantic content of the photograph and that of the cartoonized image. To preserve both local and global semantic content, we compute the sparse regularization for every feature map in the feature model and use their sum as the final content loss.
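The effect of normalizing before comparing can be illustrated with a small pure-Python sketch. It assumes each scale of a feature model is a list of channels with flattened activations, averages the absolute differences within a scale (the averaging, the `eps` term and the names `normalize` and `content_loss` are assumptions made for this example), and sums over scales.

```python
import math


def normalize(channel, eps=1e-5):
    """Subtract the channel mean and divide by the channel standard deviation."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(var + eps)  # eps is an assumed stabilizer
    return [(v - mean) / std for v in channel]


def content_loss(photo_model, output_model, eps=1e-5):
    """Sum over scales of the mean L1 distance between normalized feature maps.

    photo_model, output_model: lists of scales; each scale is a list of
    channels; each channel is a flat list of activations.
    """
    total = 0.0
    for photo_feat, out_feat in zip(photo_model, output_model):
        abs_sum, count = 0.0, 0
        for p_ch, o_ch in zip(photo_feat, out_feat):
            p_n, o_n = normalize(p_ch, eps), normalize(o_ch, eps)
            abs_sum += sum(abs(a - b) for a, b in zip(p_n, o_n))
            count += len(p_ch)
        total += abs_sum / count
    return total


photo = [[[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 2.0, 4.0]]]
# Same spatial structure, but different per-channel mean and scale.
restyled = [[[3.0 * v + 5.0 for v in ch] for ch in scale] for scale in photo]
loss_restyled = content_loss(photo, restyled)
```

Because `restyled` differs from `photo` only in channel statistics, `loss_restyled` is close to zero, whereas rearranging the activations spatially would produce a clearly positive loss. This matches the argument above that the normalized loss penalizes changes in content but not changes in style statistics.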
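For the style loss $\mathcal{L}_{style}$, we adopt the same method as used in AdaIN. The style loss is defined as:
$$\mathcal{L}_{style} = \sum_{i=1}^{4} \left\lVert \mu(f_i(y)) - \mu(f_i(x_c)) \right\rVert_2 + \sum_{i=1}^{4} \left\lVert \sigma(f_i(y)) - \sigma(f_i(x_c)) \right\rVert_2$$
where $y$ is the cartoonized output, $x_c$ is the reference cartoon image, $f_i(\cdot)$ is the $i$-th element of the feature model, and $\mu(\cdot)$ and $\sigma(\cdot)$ are the channel-wise mean and standard deviation.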
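An AdaIN-style statistic-matching loss can be sketched in a few lines. The example below assumes the same list-based layout as a toy stand-in for feature maps (scales, then channels, then flattened activations); the names `channel_stats` and `style_loss` and the `eps` term are illustrative assumptions.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one channel given as a flat list of activations."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    return mean, math.sqrt(var + eps)  # eps is an assumed stabilizer


def style_loss(output_model, cartoon_model, eps=1e-5):
    """Sum over scales of the L2 gaps between channel-wise means and stds."""
    total = 0.0
    for out_feat, car_feat in zip(output_model, cartoon_model):
        out_stats = [channel_stats(ch, eps) for ch in out_feat]
        car_stats = [channel_stats(ch, eps) for ch in car_feat]
        mean_gap = math.sqrt(sum((o[0] - c[0]) ** 2
                                 for o, c in zip(out_stats, car_stats)))
        std_gap = math.sqrt(sum((o[1] - c[1]) ** 2
                                for o, c in zip(out_stats, car_stats)))
        total += mean_gap + std_gap
    return total


output_model = [[[0.0, 1.0, 2.0, 3.0]]]
cartoon_model = [[[2.0, 3.0, 4.0, 5.0]]]  # same spread, mean shifted by 2
loss = style_loss(output_model, cartoon_model)
```

Only first- and second-order channel statistics enter this loss, so two feature maps with different spatial layouts but identical means and standard deviations are treated as having the same style.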
3.2.2 Adversarial loss
The content loss and style loss described above are both regularization terms in feature space. Without an explicit restriction in image space, the generated images tend to be inconsistent across different parts and usually contain small line segments. We therefore introduce an adversarial training strategy. We use the multi-scale discriminator proposed in MUNIT to distinguish between real cartoon images and generated cartoonized photographs, and the adversarial loss $\mathcal{L}_{adv}$ is computed from its outputs.
The adversarial loss explicitly adds a restriction on the generated images in image space, which makes the generated images smoother and more consistent across different parts. The ablation study in Section 4.3 shows that the adversarial loss plays an important role in producing high-quality cartoonized images.
3.2.3 Reconstruction loss.
The reconstruction loss consists of two parts: the reconstruction loss for photo images $\mathcal{L}_{recon}^{p}$ and the reconstruction loss for cartoon images $\mathcal{L}_{recon}^{c}$. For input images $x_p$ and $x_c$, we directly render the un-coordinated feature models $f(x_p)$ and $f(x_c)$ to obtain the reconstructed images $\hat{x}_p$ and $\hat{x}_c$. The reconstruction loss is defined as:
$$\mathcal{L}_{recon} = \mathcal{L}_{recon}^{p} + \mathcal{L}_{recon}^{c} = \lVert \hat{x}_p - x_p \rVert + \lVert \hat{x}_c - x_c \rVert$$
The reconstruction loss is used to ensure the Rendering network’s generalization ability. The Rendering network does not take part in the cartoonization itself; all cartoonization is confined to the Coordinating part. By adopting the reconstruction loss, we make the Rendering network focus on reconstructing images from feature models, which enables it to render any type of image, whether it is a photo or a cartoon image.
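As a small illustration of the two-part reconstruction objective, the sketch below compares each rendered reconstruction with its own input. The choice of a mean absolute (L1) distance is an assumption made for this example, since the text here does not fix the norm, and images are represented as flat lists of pixel values; `l1_distance` and `reconstruction_loss` are hypothetical names.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat pixel lists."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def reconstruction_loss(photo, photo_recon, cartoon, cartoon_recon):
    """Photo term plus cartoon term; each compares a reconstruction to its input.

    The L1 distance is an assumed choice for this illustration.
    """
    return l1_distance(photo_recon, photo) + l1_distance(cartoon_recon, cartoon)


# Toy 2-pixel "images": the photo is off by 0.5 on one pixel, the cartoon by 1.0.
loss = reconstruction_loss(
    photo=[0.0, 1.0], photo_recon=[0.5, 1.0],
    cartoon=[1.0, 1.0], cartoon_recon=[1.0, 0.0],
)
```

Both terms bypass the Coordinating part entirely, which is what lets this loss train the Rendering network as a general decoder for photos and cartoon images alike.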
We implement our approach in Python with PyTorch 1.0. All experiments were performed on an NVIDIA Titan X GPU.
CartoonRenderer can generate high-quality cartoonized images using a variety of cartoon images for training, which are easy to obtain since we do not require paired or classified images. Our model is able to efficiently learn different cartoon sub-styles. Some of the results are shown in Figure 1.
To compare CartoonRenderer with state-of-the-art methods, we collected the training and test data as presented in Section 4.1. In Section 4.2, we present the comparison between CartoonRenderer and representative stylization methods.
The training and test data contain realistic photos and cartoon images. All the training images are randomly cropped to 256×256.
Realistic photos. We collect 220,000 photos in all, some of which come from the MSCOCO dataset and others from the Internet. 200,000 photos are used for training and 20,000 for testing. The shorter side of each image is greater than 256 pixels to ensure that random cropping is feasible.
Cartoon images. We collect 80,000 high-quality cartoon images from the Internet for training. Another 10,000 cartoon images sampled from the Danbooru2018 dataset are used for testing.
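The random cropping step and the size requirement on the source images can be sketched as follows. The helper `random_crop_box` and its `(left, top, right, bottom)` return convention are assumptions made for this illustration and are not taken from the original training code.

```python
import random


def random_crop_box(width, height, size=256, rng=random):
    """Pick a random size x size crop window inside a width x height image.

    Returns (left, top, right, bottom) in pixel coordinates. Raises
    ValueError when the image is smaller than the crop, which is why the
    source images need both sides to be at least `size` pixels.
    """
    if width < size or height < size:
        raise ValueError("image is smaller than the crop size")
    left = rng.randint(0, width - size)   # randint is inclusive on both ends
    top = rng.randint(0, height - size)
    return left, top, left + size, top + size


rng = random.Random(0)  # seeded only to make the example reproducible
box = random_crop_box(640, 480, size=256, rng=rng)
```

The returned box can be passed to any image library’s crop routine; photos and cartoon images are cropped independently because the training data are unpaired.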
4.2 Comparison with state of the art
Figure 3 shows qualitative results generated by different methods; none of the test data were observed during the training phase. It is clear that CycleGAN and AdaIN do not work well with cartoon styles. In contrast, CartoonGAN and our CartoonRenderer produce high-quality results. To better preserve the content of the input images, we add the identity loss to CycleGAN, but its stylization results are still far from satisfactory. AdaIN successfully generates images with smooth colors but suffers from serious artifacts. CartoonGAN produces clear images without artifacts, but the generated results are too close to the input photos and the color distribution is very monotonous. In other words, the extent of cartoonization with CartoonGAN is insufficient. In contrast, our method produces higher-quality cartoonized images, which have high contrast between colors and contain very clear edges.
For more details, we show close-up views of one result in Figure 4. Obviously, our method performs much better than others in detail. Even the details of eyelashes and pupils are well preserved and re-rendered in cartoon style.
In this work, we propose a novel “CartoonRenderer” framework which uses a single trained model to generate multiple cartoon styles. In a nutshell, our method maps a photo into a feature model and renders the feature model back into image space. Our method is able to produce higher-quality cartoon-style images than prior art. In addition, because the whole generation process is decoupled into “Modeling-Coordinating-Rendering” parts, our method can easily process higher-resolution photos (up to 5000×5000 pixels).
This work was supported by National Natural Science Foundation of China (61976137, U1611461), 111 Project (B07022 and Sheitc No.150633) and the Shanghai Key Laboratory of Digital Media Processing and Transmissions. This work was also supported by SJTU-BIGO Joint Research Fund, and CCF-Tencent Open Fund.
- (2019-01) Danbooru2018: a large-scale crowdsourced and tagged anime illustration dataset.
- (2016) Fast patch-based style transfer of arbitrary style.
- (2018-06) CartoonGAN: generative adversarial networks for photo cartoonization. pp. 9465–9474.
- (2016) A learned representation for artistic style.
- (2016) Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition.
- (2017) Arbitrary style transfer in real-time with adaptive instance normalization.
- (2018) Multimodal unsupervised image-to-image translation.
- (2018) A style-based generator architecture for generative adversarial networks.
- (2014) Microsoft COCO: common objects in context.
- (2018) RenderNet: a deep convolutional network for differentiable rendering from 3D shapes.
- (2017) U-Net: convolutional networks for biomedical image segmentation.
- (2018) A style-aware content loss for real-time HD style transfer.
- (2014) Very deep convolutional networks for large-scale image recognition. Computer Science.
- (2016) Texture networks: feed-forward synthesis of textures and stylized images.
- (2017) Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis.
- (2019) Attention-aware multi-stroke style transfer.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision.