User-Guided Deep Anime Line Art Colorization with Conditional Adversarial Networks

Yuanzheng Ci (DUT-RU International School of Information Science & Engineering, Dalian University of Technology), Xinzhu Ma (DUT-RU International School of Information Science & Engineering, Dalian University of Technology), Zhihui Wang (Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian University of Technology), Haojie Li (Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian University of Technology), and Zhongxuan Luo (Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian University of Technology)

Scribble-based line art colorization is a challenging computer vision problem, since neither greyscale values nor semantic information is present in line arts, and the lack of authentic illustration–line art training pairs further hinders model generalization. Recently, several methods based on Generative Adversarial Nets (GANs) have achieved great success; they can generate colorized illustrations conditioned on a given line art and color hints. However, these methods fail to capture the authentic illustration distribution and are hence perceptually unsatisfying, in the sense that they often lack accurate shading. To address these challenges, we propose a novel deep conditional adversarial architecture for scribble-based anime line art colorization. Specifically, we integrate the conditional framework with the WGAN-GP criterion as well as a perceptual loss, which enables us to robustly train a deep network that makes the synthesized images more natural and real. We also introduce a local features network that is independent of synthetic data. With GANs conditioned on features from this network, we notably increase the generalization capability over "in the wild" line arts. Furthermore, we collect two datasets that provide high-quality colorful illustrations and authentic line arts for training and benchmarking. With the proposed model trained on our illustration dataset, we demonstrate that images synthesized by the presented approach are considerably more realistic and precise than those of alternative approaches.

Interactive Colorization; GANs; Edit Propagation
copyright: rightsretained; journalyear: 2018; copyright: acmcopyright; conference: 2018 ACM Multimedia Conference, October 22–26, 2018, Seoul, Republic of Korea; price: 15.00; doi: 10.1145/3240508.3240661; isbn: 978-1-4503-5665-7/18/10; ccs: Computing methodologies: Image manipulation; Computer vision; Neural networks
Figure 1. Our proposed method colorizes a line art composed by an artist (left) based on guided stroke colors (top center, best viewed with a grey background) and learned priors. The line art image is from our collected line art dataset.

1. Introduction

Line art colorization plays a critical role in the workflow of artistic work such as the composition of illustrations and animation. Colorizing the in-between frames of animation, as well as simple illustrations, involves a large amount of redundant work. Nevertheless, there is no automatic colorization pipeline for line arts; as a consequence, colorization is performed manually with image editing applications such as Photoshop and PaintMan, and in the animation industry it is known as hard labor. Developing a fast and straightforward way to produce illustration-realistic imagery from line arts is therefore a challenging and valuable task.

Several recent methods have explored approaches for guided image colorization (Sangkloy et al., 2017; Zhang et al., 2017b; Hensman and Aizawa, 2017; Furusawa et al., 2017; Networks, 2017; Zhang et al., 2017a; Liu et al., 2017; Frans, 2017). Some works focus mainly on images containing greyscale information (Zhang et al., 2017b; Furusawa et al., 2017; Hensman and Aizawa, 2017). A user can colorize a greyscale image with color points or color histograms (Zhang et al., 2017b), or colorize a manga based on reference images (Hensman and Aizawa, 2017) or color points (Furusawa et al., 2017). These methods can achieve impressive results, but they can handle neither sparse input like line arts nor color-stroke input, which is easier for users to provide. To address these issues, researchers have recently explored more data-driven colorization methods (Sangkloy et al., 2017; Liu et al., 2017; Frans, 2017). These methods colorize a line art/sketch from scribbled color strokes based on priors learned over synthetic sketches, which makes colorizing a new sketch cheaper and easier. Nevertheless, the results often contain unrealistic colorization and artifacts. More fundamentally, overfitting to synthetic data is not well handled, so these networks can hardly perform well on "in the wild" data. PaintsChainer (Networks, 2017) tries to address these issues with three proposed models (two of which are not open-sourced). However, its eye-pleasing results come at the cost of losing the realistic textures and global shading that authentic illustrations preserve.

In this paper, we propose a novel conditional adversarial illustration synthesis architecture trained entirely on synthetic line arts. Unlike existing approaches using typical conditional networks (Isola et al., 2017), we combine a cGAN with a pretrained local features network: both the generator and the discriminator are conditioned only on the output of the local features network, which increases the generalization ability over authentic line arts. With the proposed networks trained end-to-end on synthetic line arts, we are able to generate illustration-realistic colorization from sparse, authentic line art boundaries and color strokes; that is, we do not suffer from overfitting to synthetic data. In addition, we randomly simulate user interactions during training, allowing the network to propagate the sparse stroke colors to semantic scene elements. Inspired by (Kupyn et al., 2017), which obtained state-of-the-art results in motion deblurring, we fuse the conditional framework (Isola et al., 2017) with WGAN-GP (Gulrajani et al., 2017) and a perceptual loss (Johnson et al., 2016) as the criterion in the GAN training stage. This allows us to robustly train a network with more capacity, and thus makes the synthesized images more natural and real. Moreover, we collected two cleaned datasets with high-quality color illustrations and hand-drawn line arts; they provide a stable training data source as well as a test benchmark for line art colorization.

By training with the proposed illustration dataset and adding minimal augmentation, our model can handle general anime line arts with stroke color hints. As the loss term optimizes the results to resemble the ground truth, the model mimics realistic color allocation and general shading with respect to the color strokes and scene elements, as shown in Figure 1. We trained our model with millions of image pairs and achieve significant improvements in stroke-guided line art colorization. Extensive experimental results demonstrate that the proposed method is far superior to state-of-the-art stroke-based user-guided line art colorization methods in both qualitative and quantitative evaluations.

In summary, the key contributions of this paper are as follows.

  • We propose an illustration synthesis architecture and a loss for stroke-based user-guided anime line art colorization, whose results are significantly better than existing guided line art colorization methods.

  • We introduce a novel local features network in the cGAN architecture to enhance the generalization ability of the networks trained with synthetic data.

  • The colorization network in our cGAN differs from existing GAN generators in that it is much deeper, with several specially designed layers that increase both the receptive field and the network capacity. This makes the synthesized images more natural and real.

  • We collect two datasets that provide quality illustration training data and a line art test benchmark.

2. Related Work

2.1. User-guided colorization

Early interactive colorization methods (Levin et al., 2004; Huang et al., 2005) propagate stroke colors with low-level similarity metrics. These methods are based on the assumption that adjacent pixels with similar luminance in greyscale images should have similar colors, and numerous user interactions are typically required to achieve realistic colorization results. Later studies improved and extended this approach with chrominance blending (Yatziv and Sapiro, 2006), specific schemes for different textures (Qu et al., 2006), better similarity metrics (Luan et al., 2007) and global optimization with all-pair constraints (An and Pellacini, 2008; Xu et al., 2009). Learning methods such as boosting (Li et al., 2008), manifold learning (Chen et al., 2012) and neural networks (Endo et al., 2016; Zhang et al., 2017b) have also been proposed to propagate stroke colors with learned priors. In addition to local control, some approaches colorize images by transferring the color theme (Li et al., 2015; Furusawa et al., 2017; Hensman and Aizawa, 2017) or color palette (Chang et al., 2015; Zhang et al., 2017b) of a reference image. While these methods rely on the greyscale information of the source image, which is not available for line arts/sketches, Scribbler (Sangkloy et al., 2016) developed a system to transform sketches of specific categories into real images using scribbled color strokes. Frans (Frans, 2017) and Liu et al. (Liu et al., 2017) proposed methods for guided line art colorization, but they can hardly produce plausible results on arbitrary man-made line arts. Concurrently, Zhang et al. (Zhang et al., 2017a) colorize man-made anime line arts with a reference image. PaintsChainer (Networks, 2017) first developed an online application that generates pleasing colorization results for man-made anime line arts with stroke colors as hints; it provides three models (named tanpopo, satsuki and canna), one of which (tanpopo) is open-sourced. However, these models fail to capture the authentic illustration distribution and thus lack accurate shading.

2.2. Automatic colorization

Recently, colorization methods that do not require color information have been proposed (Cheng et al., 2015; Deshpande et al., 2015; Iizuka et al., 2016; Zhang et al., 2016). These methods train CNNs (LeCun et al., 1998) on large datasets to learn a direct mapping from greyscale images to colors; they can combine low-level details with high-level semantic information to produce perceptually pleasing, photo-realistic colorization. In addition, Isola et al. (Isola et al., 2017), Zhu et al. (Zhu et al., 2017) and Chen et al. (Chen and Hays, 2018) learn a direct mapping from human-drawn sketches (for a particular category or with category labels) to realistic images with generative adversarial networks. Larsson et al. (Larsson et al., 2016) and Guadarrama et al. (Guadarrama et al., 2017) also address the multi-modal uncertainty of the colorization problem, as their methods can generate multiple results; however, limitations remain, as they can only cover a small subset of the possibilities. Beyond learning a direct mapping, Sketch2Photo (Chen et al., 2009) and PhotoSketcher (Eitz et al., 2011) synthesize realistic images by compositing objects and backgrounds retrieved from a large collection of images based on a given sketch.

2.3. Generative Adversarial Networks

Recent studies of GANs (Goodfellow et al., 2014; Radford et al., 2015) have achieved great success in a wide range of image synthesis applications, including blind motion deblurring (Nah et al., 2017; Kupyn et al., 2017), high-resolution image synthesis (Wang et al., 2017; Karras et al., 2017), photo-realistic super-resolution (Ledig et al., 2016) and image in-painting (Pathak et al., 2016). The GAN training strategy defines a game between two competing networks: the generator attempts to fool a simultaneously trained discriminator that classifies images as real or synthetic. GANs are known for their ability to generate samples of good perceptual quality; however, the vanilla version of GAN suffers from problems such as mode collapse and vanishing gradients, as described in (Salimans et al., 2016). Arjovsky et al. (Arjovsky et al., 2017) discuss the difficulties in GAN training caused by the vanilla loss function and propose using an approximation of the Earth-Mover (also called Wasserstein-1) distance as the critic. Gulrajani et al. (Gulrajani et al., 2017) further improved its stability with a gradient penalty, which enables training more architectures with almost no hyperparameter tuning. The basic GAN framework can also be augmented with side information. One strategy is to supply both the generator and the discriminator with class labels to produce class-conditional samples, known as cGAN (Mirza and Osindero, 2014). Such side information can significantly improve the quality of generated samples (van den Oord et al., 2016), and richer side information such as paired input images (Isola et al., 2017), boundary maps (Wang et al., 2017) and image captions (Reed et al., 2016) can improve sample quality further. However, when the training data has different patterns from the test data (in our case, synthetic versus authentic line arts), existing frameworks do not perform reasonably well.

3. Proposed Method

Figure 2. Overview of our cGAN-based colorization model. The training proceeds with the feature extractors $F_1$ and $F_2$, the generator $G$ and the discriminator $D$, which help $G$ learn to generate a colorized image $\hat{Y}$ based on the line art image $X$ and the color hint $H$. Network $F_1$ extracts semantic feature maps from $X$, while $X$ itself is not fed to $D$ to avoid overfitting on the characteristics of synthetic line arts. Network $D$ learns to give a Wasserstein distance between $\hat{Y}$–$F_1(X)$ pairs and $Y$–$F_1(X)$ pairs.

Given line arts and user inputs, we train a deep network to synthesize illustrations. In Section 3.1, we introduce the objective of our network. We then describe the loss functions of our system in Section 3.2. In Section 3.3, we define our network architecture. Finally, we describe the user interaction mechanism in Section 3.4.

Figure 3. Architecture of Generator and Discriminator Network with corresponding number of feature maps (n) and stride (s) indicated for each convolutional block.

3.1. Learning Framework for Colorization

The first input to our system is a greyscale line art image $X$, a sparse, binary-like image tensor synthesized from a real illustration with the boundary detection filter XDoG (Winnemöller et al., 2012), as shown in Figure 5.

Real-world anime line arts contain a large variety of content and are usually drawn in different styles. It is crucial to identify the boundaries of different objects and further extract semantic labels from the plain line art, as the two play an important role in generating high-quality results in image-to-image translation tasks (Wang et al., 2017). Recent work (Hensman and Aizawa, 2017) adopted trapped-ball segmentation on greyscale manga images and used the segmentation to refine the cGAN colorization output, while (Furusawa et al., 2017) added an extra global features network (trained to predict characters' names) to extract global feature vectors from greyscale manga images for the generator.

By extracting features from an earlier stage of a pretrained network, we introduce a local features network (trained to tag illustrations) that extracts semantic feature maps, containing both semantic and spatial information, directly from the line arts for the generator. We also take the local features as the conditional input of the discriminator, as shown in Figure 2. This relieves the overfitting problem: the characteristics of man-made line arts can be very different from those of algorithmically synthesized line arts, while the local features network is trained separately and is not affected by the synthetic line arts. Moreover, compared with a global features network, a local features network preserves spatial information in the abstracted features and keeps the generator fully convolutional for arbitrary input sizes.

Specifically, we use the ReLU activations of the 6th convolution layer of the Illustration2Vec (Saito and Matsui, 2015) network as the local features; this network is pretrained on 1,287,596 illustrations (including colored images and line arts) to predict 1,539 labels.

The second input to the system is the simulated user hint $H$. We sample random pixels from the 4× downsampled ground truth $Y$ as the color values $U$. The locations of the sampled pixels are given by a binary mask $M$, with the number of sampled points drawn at random for each training image. Together, the tensors $U$ and $M$ form the color hint $H$.

The output of the system is $\hat{Y}$, the estimate of the color channels of the line art. The mapping is learned with a generator $G$, parameterized by $\theta$, with the network architecture specified in Section 3.3 and shown in Figure 3. We train the network to minimize the objective in Equation 1 over $\mathcal{D}$, a dataset of illustrations, line arts, color hints and desired output colorizations, where the loss function $\mathcal{L}$ measures how close the network output is to the ground truth:

$$\theta^{*} = \arg\min_{\theta}\;\mathbb{E}_{(X,H,Y)\sim\mathcal{D}}\big[\mathcal{L}(G(X,H;\theta),\,Y)\big] \tag{1}$$
3.2. Loss Function

We formulate the loss function for the generator as a combination of a content loss and an adversarial loss:

$$\mathcal{L}_G = \mathcal{L}_{cont} + \lambda_1 \mathcal{L}_{adv}$$

where $\lambda_1$ equals 1e-4 in all experiments. Similar to Isola et al. (Isola et al., 2017), our discriminator is also conditional, but with the local features as conditional input and WGAN-GP (Gulrajani et al., 2017) as the critic function that distinguishes between real and fake training pairs. The critic does not output a probability, and the loss is calculated as follows:

$$\mathcal{L}_{adv} = -\,\mathbb{E}_{\hat{Y}\sim\mathbb{P}_g}\big[D(\hat{Y}, F_1(X))\big]$$

where $F_1(X)$ denotes the feature maps obtained by the pretrained local features network described in Section 3.1. To penalize color/structural mismatch between the output of the generator and the ground truth, we adopt the perceptual loss (Johnson et al., 2016) as our content loss. The perceptual loss is a simple L2 loss based on the difference between the CNN feature maps of the generated and target images:

$$\mathcal{L}_{cont} = \frac{1}{chw}\,\big\|F_2(\hat{Y}) - F_2(Y)\big\|_2^2$$

Here $c$, $h$, $w$ denote the number of channels, height and width of the feature maps, and $F_2(\cdot)$ denotes the feature maps obtained from the 4th convolution layer (after activation) of the VGG16 network, pretrained on ImageNet (Deng et al., 2009).

The loss of our discriminator is formulated as a combination of the Wasserstein critic loss and a penalty term:

$$\mathcal{L}_D = \mathcal{L}_{critic} + \mathcal{L}_{penalty}$$

where the critic loss is simply the WGAN loss (Arjovsky et al., 2017) with conditional input:

$$\mathcal{L}_{critic} = \mathbb{E}_{\hat{Y}\sim\mathbb{P}_g}\big[D(\hat{Y}, F_1(X))\big] - \mathbb{E}_{Y\sim\mathbb{P}_r}\big[D(Y, F_1(X))\big]$$

For the penalty term, we combine the gradient penalty (Gulrajani et al., 2017) with an extra constraint term introduced by Karras et al. (Karras et al., 2017):

$$\mathcal{L}_{penalty} = \lambda_2\,\mathbb{E}_{\tilde{Y}\sim\mathbb{P}_i}\Big[\big(\big\|\nabla_{\tilde{Y}} D(\tilde{Y}, F_1(X))\big\|_2 - 1\big)^2\Big] + \epsilon_{drift}\,\mathbb{E}_{Y\sim\mathbb{P}_r}\big[D(Y, F_1(X))^2\big]$$

The drift term keeps the critic output from wandering too far from zero, and this formulation enables us to alternate between updating the generator and the discriminator on a per-minibatch basis, which reduces training time compared to the traditional setup that updates the discriminator five times for every generator update. We set $\lambda_2$ and $\epsilon_{drift}$ to fixed values in all experiments. The distribution $\mathbb{P}_i$ of the interpolated points at which the gradient is penalized is implicitly defined by:

$$\tilde{Y} = t\,Y + (1-t)\,\hat{Y},\qquad t\sim U[0,1]$$

By this we penalize the gradient over straight lines between points in the illustration distribution $\mathbb{P}_r$ and the generator distribution $\mathbb{P}_g$.
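To make the discriminator objective concrete, the following minimal NumPy sketch evaluates the Wasserstein critic loss, the gradient penalty at interpolated points, and the drift term for a toy linear critic whose input gradient is simply its weight vector, so no autograd is needed. The penalty weights used here (10 and 1e-3) are the standard WGAN-GP and drift defaults, assumed for illustration rather than quoted from this paper, and conditioning on local features is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
lam2, eps_drift = 10.0, 1e-3     # assumed defaults (WGAN-GP / Karras et al.)

w = rng.normal(size=8)           # toy linear critic D(y) = w . y

def D(y):
    return y @ w

Y_real = rng.normal(loc=1.0, size=(4, 8))   # stand-in "illustrations"
Y_fake = rng.normal(loc=0.0, size=(4, 8))   # stand-in generator outputs

# Wasserstein critic loss: E[D(fake)] - E[D(real)]
L_critic = D(Y_fake).mean() - D(Y_real).mean()

# interpolated points: Y_tilde = t*Y + (1-t)*Y_hat, with t ~ U[0, 1]
t = rng.uniform(size=(4, 1))
Y_tilde = t * Y_real + (1 - t) * Y_fake

# gradient penalty: for a linear critic, grad_y D(y) = w at every point
grad_norms = np.full(len(Y_tilde), np.linalg.norm(w))
L_gp = lam2 * ((grad_norms - 1.0) ** 2).mean()

# drift term keeps critic outputs from wandering far from zero
L_drift = eps_drift * (D(Y_real) ** 2).mean()

L_D = L_critic + L_gp + L_drift
```

In the real model the gradient at the interpolated points is obtained by backpropagation through the conditional discriminator rather than analytically.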

3.3. Network Architecture

Figure 4. Scribble-color-based colorization of authentic line arts. The first column shows the line art input image. Columns 2–5 show automatic results from the three models of (Networks, 2017) as well as ours, without user input. Column 6 shows the input scribble colors (generated on PaintsChainer (Networks, 2017), best viewed with a grey background). Columns 7–10 show the results from (Networks, 2017) and our model incorporating user inputs. In the selected examples in rows 1, 3 and 5, our system produces higher-quality colorization results given varied inputs. Rows 2 and 4 show some failures of our model: in row 2, the background color given by the user is not propagated smoothly across the image; in row 4, the automatic result produces undesired gridding artifacts where the input is sparse. All results of (Networks, 2017) were obtained in March 2018; images are from our line art dataset.

For the main branch of our generator, shown in Figure 3, we employ a U-Net (Ronneberger et al., 2015) architecture, which has recently been used in a variety of image-to-image translation tasks (Isola et al., 2017; Zhu et al., 2017; Wang et al., 2017; Zhang et al., 2017b). At the front of the network, two convolution blocks and the local features network transform the image/color hint inputs into feature maps. The features are then progressively halved spatially until they reach the same scale as the local features. The second half of our U-Net consists of 4 sub-networks that share a similar structure. Each sub-network contains a convolution block at the front to fuse features from the skip connection (or the local features, for the first sub-network), followed by a stack of ResNeXt blocks (Xie et al., 2017) as the core of the sub-network; we use ResNeXt blocks instead of ResNet blocks because they are more effective at increasing the capacity of the network. The number of stacked ResNeXt blocks is set individually for each sub-network. We also follow the design principles of (Yu et al., 2017) and add dilation in the ResNeXt blocks to further increase the receptive field without increasing the computational cost. Finally, we increase the resolution of the features with sub-pixel convolution layers as proposed by Shi et al. (Shi et al., 2016). Inspired by Nah et al. (Nah et al., 2017) and Lim et al. (Lim et al., 2017), we do not use any normalization layer in our networks, which keeps the range flexibility needed for accurate colorizing; this also reduces memory usage and computational cost, and enables a deeper structure whose receptive field is large enough to "see" the whole patch with limited computational resources. We use LeakyReLU activations with slope 0.2 for every convolution layer except the last one, which uses a tanh activation.
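The sub-pixel convolution layers (Shi et al., 2016) mentioned above rest on a pixel-shuffle rearrangement: a feature map with C·r² channels becomes C channels at r times the spatial resolution. A minimal NumPy sketch of just the rearrangement (in the real layer, a convolution produces the C·r² channels first):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) -> (C, H*r, W*r), as in sub-pixel convolution."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channel dim into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)  # interleave sub-pixels spatially

x = np.arange(2 * 4 * 3 * 3, dtype=float).reshape(8, 3, 3)  # C=2, r=2
y = pixel_shuffle(x, 2)
print(y.shape)  # (2, 6, 6)
```

Output pixel (h·r+i, w·r+j) of channel c comes from input channel c·r²+i·r+j at position (h, w), matching the common sub-pixel convolution convention.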

During the training phase, we define a discriminator network $D$ as shown in Figure 3. The architecture of $D$ is similar to the setup of SRGAN (Ledig et al., 2016) with some modifications: we take the local features from $F_1$ as conditional input to form a cGAN (Mirza and Osindero, 2014) and employ the same basic block as in the generator, without dilation. We additionally stack more layers so that $D$ can process inputs at our training resolution.

3.4. User Interaction

One of the most intuitive ways to control the outcome of colorization is to 'scribble' color strokes indicating the preferred color of a region. To train a network to recognize these control signals at test time, Sangkloy et al. (Sangkloy et al., 2017) synthesize color strokes for the training data, while Zhang et al. (Zhang et al., 2017b) suggest that randomly sampled points are good enough to simulate point-based inputs. We trade off between the two and use randomly sampled points at a downsampled scale to simulate stroke-based inputs, with the intuition that each color stroke tends to have a uniform color value and dense spatial information. The generation of training points is described in Section 3.1. For the user's input strokes, we downsample the stroke image to quarter resolution with max-pooling and remove half of the input pixels by setting the stroke image and binary mask to 0 with an interval of 1. This removes the redundancy of strokes while preserving spatial information and simulating the sparse training input. In this way, we can cover the input space adequately and train an effective model.

4. Experiments

4.1. Dataset

Figure 5. Sample images from our illustration dataset and authentic line art dataset, together with a matching line art generated from an illustration with the XDoG (Winnemöller et al., 2012) algorithm.

Nico-opendata (Ikuta et al., 2016) and Danbooru2017 (Anonymous, 2018) provide large illustration datasets containing illustrations and their associated metadata. However, they are not suitable for our task, as messy scribbles as well as sketches/line arts are mixed into the datasets; this noise is hard to clean and could be harmful to the training process.

Due to the lack of a publicly available high-quality illustration/line art dataset, we propose two quality datasets (both publicly available). We collected 21930 colored anime illustrations and 2779 authentic line arts from the internet for training and benchmarking, and further apply the boundary detection filter XDoG (Winnemöller et al., 2012) to generate synthetic line art–illustration pairs.

To simulate the line drawings sketched by artists, we set the sharpening parameter $\varphi$ of the XDoG algorithm large enough to keep a step transition at the border of sketch lines. We randomly set $\sigma$ to 0.3, 0.4 or 0.5 to obtain different levels of line thickness and thus generalize the network over various line widths. The remaining parameters are kept at their defaults.
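A minimal NumPy sketch of XDoG-style line extraction (a sharpened difference-of-Gaussians followed by a thresholded tanh); the parameter names follow common XDoG notation, the values are illustrative rather than the paper's exact settings, and the truncated separable Gaussian blur is a simplified stand-in for a full implementation.

```python
import numpy as np

def gauss_blur(img, sigma):
    """Separable Gaussian blur with a truncated, edge-padded kernel."""
    r = max(1, int(3 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, r, mode='edge')
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, 'valid'), 1, pad)
    return np.apply_along_axis(lambda col: np.convolve(col, k, 'valid'), 0, tmp)

def xdog(img, sigma=0.5, k=1.6, p=20.0, eps=0.5, phi=1e9):
    """Sharpened difference-of-Gaussians followed by a thresholded tanh;
    a very large phi yields a near-binary step at line borders.
    Parameter values here are illustrative, not the paper's settings."""
    s = (1 + p) * gauss_blur(img, sigma) - p * gauss_blur(img, k * sigma)
    return np.where(s >= eps, 1.0, 1.0 + np.tanh(phi * (s - eps)))

img = np.ones((32, 32))
img[12:20, 12:20] = 0.0       # dark square on a white background
lines = xdog(img)             # near-binary "line art" in [0, 1]
```

Varying `sigma` changes the effective line thickness, which is the knob used above to diversify the synthetic training line arts.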

4.2. Experimental Settings

The PyTorch framework (Paszke et al., 2017) is used to implement our model, and all training was performed on a single NVIDIA GTX 1080 Ti GPU. We use the ADAM (Kingma and Ba, 2014) optimizer with a batch size of 4 due to limited GPU memory. All networks were trained from scratch, with an initial learning rate of 1e-4 for both the generator and the discriminator; after 125k iterations the learning rate is decreased to 1e-5, and training takes 250k iterations in total to converge. As described in Section 3.2, we perform one gradient descent step on the discriminator and simultaneously one step on the generator.

To take non-black sketches into account, every sketch image is randomly scaled by a factor sampled from a uniform distribution. We resize the image pairs so that the shortest side is 512, then randomly crop to 512x512 before random horizontal flipping.
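The preprocessing above can be sketched as follows; the intensity-scaling range is an assumed stand-in for the unspecified uniform distribution, and resizing is omitted (inputs are assumed already resized so the short side is at least 512).

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(sketch, color):
    """sketch: (H, W) line art in [0, 1]; color: (3, H, W) illustration.
    The (0.8, 1.0) intensity range is an assumed stand-in for the paper's
    unspecified uniform distribution."""
    a = rng.uniform(0.8, 1.0)
    sketch = 1.0 - a * (1.0 - sketch)          # lighten lines: non-black sketch
    h, w = sketch.shape
    top = int(rng.integers(0, h - 512 + 1))    # random 512x512 crop
    left = int(rng.integers(0, w - 512 + 1))
    sketch = sketch[top:top + 512, left:left + 512]
    color = color[:, top:top + 512, left:left + 512]
    if rng.random() < 0.5:                     # random horizontal flip
        sketch, color = sketch[:, ::-1], color[:, :, ::-1]
    return sketch, color

sketch = np.zeros((600, 700))                  # pretend-resized inputs
color = np.zeros((3, 600, 700))
sketch_aug, color_aug = augment(sketch, color)
```

The same crop and flip are applied to both images of a pair so the supervision stays pixel-aligned.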

4.3. Quantitative Comparisons

Evaluating the quality of synthesized images is an open and difficult problem (Salimans et al., 2016). Traditional metrics such as PSNR do not assess joint statistics between targets and results. Unlike greyscale image colorization, only a few authentic line arts have corresponding ground truths available for PSNR evaluation. In order to evaluate the visual quality of our results, we employ two metrics. First, we adopt the Fréchet Inception Distance (Heusel et al., 2017) to measure the similarity between colorized line arts and authentic illustrations; it is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the pretrained Inception network (Szegedy et al., 2016). Second, we perform a mean opinion score test to quantify the ability of different approaches to reconstruct perceptually convincing colorization results, i.e., whether the results are plausible to a human observer.

Figure 6. Selected user study results. We show line art images with user inputs (best viewed with a grey background) alongside the outputs of our method. All images were colorized by a novice user and are from our line art dataset.

4.3.1. Fréchet Inception Distance (FID)

Training Configuration | FID
PaintsChainer (canna) | 103.24 ± 0.18
PaintsChainer (tanpopo) | 85.38 ± 0.21
PaintsChainer (satsuki) | 81.91 ± 0.26
Ours (w/o Adversarial Loss) | 70.90 ± 0.13
Ours (w/o Local Features Network) | 60.73 ± 0.22
Ours | 57.06 ± 0.16
Table 1. Quantitative comparison of Fréchet Inception Distance without color hints. The scores are computed between the automatic colorization results of the 2779 authentic line arts and the 21930 illustrations from our proposed datasets.

Fréchet Inception Distance (FID) is adopted as our metric because it can detect intra-class mode dropping (i.e., a model that generates only one image per class will have a bad FID) and can measure both the diversity and quality of generated samples (Lucic et al., 2017; Heusel et al., 2017). Intuitively, a small FID indicates that the distributions of two sets of images are similar. Since two of PaintsChainer's models (Networks, 2017) are not open-sourced (canna, satsuki), it is hard to keep user input identical during testing. Therefore, to quantify the quality of the results under the same conditions, we only synthesize colorized line arts automatically (i.e., without color hints) on our authentic line art dataset for all methods and report FID scores against our proposed illustration dataset. The obtained FID scores are reported in Table 1. As can be seen, our final method achieves a smaller FID than the other methods, and removing either the adversarial loss or the local features network during training leads to a higher FID.
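The Gaussian-fit Fréchet distance underlying FID can be sketched in NumPy; the feature vectors below are random stand-ins for Inception activations, and the trace of the matrix square root is computed from the eigenvalues of the covariance product to stay dependency-free.

```python
import numpy as np

def fid(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets:
    ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    s_a = np.cov(feats_a, rowvar=False)
    s_b = np.cov(feats_b, rowvar=False)
    # Tr((S_a S_b)^{1/2}) equals the sum of sqrt eigenvalues of S_a @ S_b
    eig = np.linalg.eigvals(s_a @ s_b)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return diff @ diff + np.trace(s_a) + np.trace(s_b) - 2.0 * tr_sqrt

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 16))            # stand-in "real" features
b = rng.normal(loc=0.5, size=(500, 16))   # stand-in "generated" features
d_same, d_diff = fid(a, a), fid(a, b)     # identical sets give ~0
```

In the actual evaluation, the feature sets are Inception activations of the 2779 colorized line arts and the 21930 illustrations.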

4.3.2. Mean opinion score (MOS) testing

Figure 7. Color-coded distribution of MOS scores on our line art dataset. For each method, 1504 samples (94 images × 16 raters) were assessed. The number of ratings is annotated in the corresponding blocks.
Training Configuration | MOS
PaintsChainer (tanpopo) | 2.361
PaintsChainer (satsuki) | 2.660
Ours (w/o Adversarial Loss) | 2.818
Ours (w/o Local Features Network) | 2.955
Ours | 3.186
Table 2. Performance of different methods for automatic colorization on our line art dataset. Our method achieves a significantly higher MOS than the other methods.

We performed a MOS test to quantify the ability of different approaches to reconstruct perceptually convincing colorization results. Specifically, we asked 16 raters to assign an integer score from 1 (bad quality) to 5 (excellent quality) to the automatically colorized images. The raters rated 6 versions of 1504 randomly sampled results on our line art dataset: ours, ours w/o adversarial loss, ours w/o local features network, and PaintsChainer's (Networks, 2017) models canna, satsuki and tanpopo. Each rater thus rated 564 instances (6 versions of 94 line arts), presented in randomized order. The results of the conducted MOS tests are summarized in Table 2 and Figure 7. Raters were not calibrated, and statistical tests were performed as one-tailed hypothesis tests of the difference in means, confirming that our final method significantly outperforms all reference methods. We notice that the canna model (Networks, 2017) performs better in MOS testing than in FID; we conclude that it focuses more on generating eye-pleasing results than on matching the authentic illustration distribution.
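The one-tailed test of the difference in means can be sketched as follows; the ratings are simulated stand-ins (real MOS ratings are integers from 1 to 5), and the 5% level with a normal approximation of the critical value is an assumption, not a detail quoted from the paper.

```python
import numpy as np

def one_tailed_welch(x, y, z_crit=1.645):
    """One-tailed test of H1: mean(x) > mean(y) via Welch's t statistic.
    With ~1500 ratings per method the statistic is compared against the
    normal critical value for the 5% level (an assumed significance level)."""
    t = (x.mean() - y.mean()) / np.sqrt(
        x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return t, t > z_crit

rng = np.random.default_rng(0)
ours = rng.normal(3.186, 1.0, 1504)    # simulated ratings (illustrative only)
other = rng.normal(2.955, 1.0, 1504)
t_stat, significant = one_tailed_welch(ours, other)
```

With 1504 ratings per method, even a difference of a few tenths of a point on the 1 to 5 scale yields a large t statistic.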

4.4. Analysis of the Architectures

Figure 8. Comparison of automatic colorization results on synthetic and authentic line arts. Results generated from synthetic line arts are perceptually satisfying for all methods but overfit to XDoG artifacts. Nevertheless, our method, with the local features network and adversarial loss, can still generate illustration-realistic results from man-made line arts.

We investigate the value of our proposed cGAN training methodology and local features network for the colorization task; the results are shown in Figure 8. The method with our adversarial loss and local features network achieves better performance on "in the wild" authentic line arts than those without. All methods generate results with greyscale values matching the ground truth on synthetic line arts, which indicates overfitting to synthetic line art artifacts. On authentic line arts, the method without the adversarial loss generates results that lack saturation, indicating underfitting of the model, while the method without the local features network generates unusually colorized results, showing a lack of generalization to authentic line arts. It is apparent that the adversarial loss leads to sharper, illustration-realistic results, while the local features network helps the network generalize to authentic line arts unseen during training.

4.5. Color Strokes Guided Colorization

A benefit of our system is that the network predicts user-intended actions based on learned priors. Figure 4 shows side-by-side comparisons of our results against the state-of-the-art method of (Networks, 2017), and Figure 6 shows example results generated by a novice user. Our method performs better at hallucinating missing details (such as shading and the color of the eyes) as well as at generating diverse results given color strokes and simple line arts drawn in various styles. It can also be seen that our network is able to propagate the input colors to the relevant regions while respecting object boundaries.

5. Conclusion

In this paper, we propose a conditional adversarial illustration synthesis architecture and a loss for stroke-based user-guided anime line art colorization. We explicitly propose a local features network to bridge the gap between synthetic and authentic data. A novel deep cGAN network is described with specially designed sub-networks that increase the network capacity as well as the receptive fields. Furthermore, we collected two datasets with high-quality anime illustrations and line arts, enabling efficient training and rigorous evaluation. Extensive experiments show that our approach outperforms state-of-the-art methods both qualitatively and quantitatively.

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. Grant #3 and No. Grant #3.


  • An and Pellacini (2008) Xiaobo An and Fabio Pellacini. 2008. AppProp: all-pairs appearance-space edit propagation. In ACM Transactions on Graphics (TOG), Vol. 27. ACM, 40.
  • Anonymous (2018) Anonymous, the Danbooru community, Gwern Branwen, and Aaron Gokaslan. 2018. Danbooru2017: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset. (January 2018).
  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017).
  • Chang et al. (2015) Huiwen Chang, Ohad Fried, Yiming Liu, Stephen DiVerdi, and Adam Finkelstein. 2015. Palette-based photo recoloring. ACM Transactions on Graphics (TOG) 34, 4 (2015), 139.
  • Chen et al. (2009) Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. 2009. Sketch2photo: Internet image montage. In ACM Transactions on Graphics (TOG), Vol. 28. ACM, 124.
  • Chen and Hays (2018) Wengling Chen and James Hays. 2018. SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis. arXiv preprint arXiv:1801.02753 (2018).
  • Chen et al. (2012) Xiaowu Chen, Dongqing Zou, Qinping Zhao, and Ping Tan. 2012. Manifold preserving edit propagation. ACM Transactions on Graphics (TOG) 31, 6 (2012), 132.
  • Cheng et al. (2015) Zezhou Cheng, Qingxiong Yang, and Bin Sheng. 2015. Deep colorization. In Proceedings of the IEEE International Conference on Computer Vision. 415–423.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
  • Deshpande et al. (2015) Aditya Deshpande, Jason Rock, and David Forsyth. 2015. Learning large-scale automatic image colorization. In Proceedings of the IEEE International Conference on Computer Vision. 567–575.
  • Eitz et al. (2011) Mathias Eitz, Ronald Richter, Kristian Hildebrand, Tamy Boubekeur, and Marc Alexa. 2011. Photosketcher: interactive sketch-based image synthesis. IEEE Computer Graphics and Applications 31, 6 (2011), 56–66.
  • Endo et al. (2016) Yuki Endo, Satoshi Iizuka, Yoshihiro Kanamori, and Jun Mitani. 2016. DeepProp: extracting deep features from a single image for edit propagation. In Computer Graphics Forum, Vol. 35. Wiley Online Library, 189–201.
  • Frans (2017) Kevin Frans. 2017. Outline Colorization through Tandem Adversarial Networks. arXiv preprint arXiv:1704.08834 (2017).
  • Furusawa et al. (2017) Chie Furusawa, Kazuyuki Hiroshiba, Keisuke Ogaki, and Yuri Odagiri. 2017. Comicolorization: semi-automatic manga colorization. In SIGGRAPH Asia 2017 Technical Briefs. ACM, 12.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
  • Guadarrama et al. (2017) Sergio Guadarrama, Ryan Dahl, David Bieber, Mohammad Norouzi, Jonathon Shlens, and Kevin Murphy. 2017. Pixcolor: Pixel recursive colorization. arXiv preprint arXiv:1705.07208 (2017).
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems. 5769–5779.
  • Hensman and Aizawa (2017) Paulina Hensman and Kiyoharu Aizawa. 2017. cGAN-based Manga Colorization Using a Single Training Image. arXiv preprint arXiv:1706.06918 (2017).
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems. 6629–6640.
  • Huang et al. (2005) Yi-Chin Huang, Yi-Shin Tung, Jun-Cheng Chen, Sung-Wen Wang, and Ja-Ling Wu. 2005. An adaptive edge detection based colorization algorithm and its applications. In Proceedings of the 13th annual ACM international conference on Multimedia. ACM, 351–354.
  • Iizuka et al. (2016) Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2016. Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG) 35, 4 (2016), 110.
  • Ikuta et al. (2016) Hikaru Ikuta, Keisuke Ogaki, and Yuri Odagiri. 2016. Blending Texture Features from Multiple Reference Images for Style Transfer. In SIGGRAPH Asia Technical Briefs.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. arXiv preprint (2017).
  • Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Springer, 694–711.
  • Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. Computer Science (2014).
  • Kupyn et al. (2017) Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiri Matas. 2017. DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks. arXiv preprint arXiv:1711.07064 (2017).
  • Larsson et al. (2016) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. 2016. Learning representations for automatic colorization. In European Conference on Computer Vision. Springer, 577–593.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • Ledig et al. (2016) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. 2016. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint (2016).
  • Levin et al. (2004) Anat Levin, Dani Lischinski, and Yair Weiss. 2004. Colorization using optimization. In ACM Transactions on Graphics (ToG), Vol. 23. ACM, 689–694.
  • Li et al. (2015) Xujie Li, Hanli Zhao, Guizhi Nie, and Hui Huang. 2015. Image recoloring using geodesic distance based color harmonization. Computational Visual Media 1, 2 (2015), 143–155.
  • Li et al. (2008) Yuanzhen Li, Edward Adelson, and Aseem Agarwala. 2008. ScribbleBoost: Adding Classification to Edge-Aware Interpolation of Local Image and Video Adjustments. In Computer Graphics Forum, Vol. 27. Wiley Online Library, 1255–1264.
  • Liu et al. (2017) Yifan Liu, Zengchang Qin, Zhenbo Luo, and Hua Wang. 2017. Auto-painter: Cartoon Image Generation from Sketch by Using Conditional Generative Adversarial Networks. arXiv preprint arXiv:1705.01908 (2017).
  • Luan et al. (2007) Qing Luan, Fang Wen, Daniel Cohen-Or, Lin Liang, Ying-Qing Xu, and Heung-Yeung Shum. 2007. Natural image colorization. In Proceedings of the 18th Eurographics conference on Rendering Techniques. Eurographics Association, 309–320.
  • Lucic et al. (2017) Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. 2017. Are GANs Created Equal? A Large-Scale Study. arXiv preprint arXiv:1711.10337 (2017).
  • Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
  • Nah et al. (2017) Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. 2017. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2.
  • Networks (2017) Preferred Networks. 2017. paintschainer. (2017).
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
  • Pathak et al. (2016) Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2536–2544.
  • Qu et al. (2006) Yingge Qu, Tien-Tsin Wong, and Pheng-Ann Heng. 2006. Manga colorization. In ACM Transactions on Graphics (TOG), Vol. 25. ACM, 1214–1220.
  • Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
  • Reed et al. (2016) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396 (2016).
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234–241.
  • Saito and Matsui (2015) Masaki Saito and Yusuke Matsui. 2015. Illustration2Vec: a semantic vector representation of illustrations. In SIGGRAPH Asia 2015 Technical Briefs. 380–383.
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. In Advances in Neural Information Processing Systems. 2234–2242.
  • Sangkloy et al. (2016) Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. 2016. Scribbler: Controlling Deep Image Synthesis with Sketch and Color. (2016).
  • Sangkloy et al. (2017) Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. 2017. Scribbler: Controlling deep image synthesis with sketch and color. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2.
  • Shi et al. (2016) Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1874–1883.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
  • van den Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems. 4790–4798.
  • Wang et al. (2017) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2017. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. arXiv preprint arXiv:1711.11585 (2017).
  • Winnemöller et al. (2012) Holger Winnemöller, Jan Eric Kyprianidis, and Sven C. Olsen. 2012. XDoG: An eXtended difference-of-Gaussians compendium including advanced image stylization. Computers & Graphics 36, 6 (2012), 740–753.
  • Xie et al. (2017) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 5987–5995.
  • Xu et al. (2009) Kun Xu, Yong Li, Tao Ju, Shi-Min Hu, and Tian-Qiang Liu. 2009. Efficient affinity-based edit propagation using kd tree. In ACM Transactions on Graphics (TOG), Vol. 28. ACM, 118.
  • Yatziv and Sapiro (2006) Liron Yatziv and Guillermo Sapiro. 2006. Fast image and video colorization using chrominance blending. IEEE transactions on image processing 15, 5 (2006), 1120–1129.
  • Yu et al. (2017) Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. 2017. Dilated residual networks. In Computer Vision and Pattern Recognition, Vol. 1.
  • Zhang et al. (2017a) Lvmin Zhang, Yi Ji, and Xin Lin. 2017a. Style transfer for anime sketches with enhanced residual u-net and auxiliary classifier GAN. arXiv preprint arXiv:1706.03319 (2017).
  • Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In European Conference on Computer Vision. Springer, 649–666.
  • Zhang et al. (2017b) Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. 2017b. Real-time user-guided image colorization with learned deep priors. arXiv preprint arXiv:1705.02999 (2017).
  • Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593 (2017).