Channel Decomposition into Painting Actions
This work presents a method to decompose a convolutional layer of a deep neural network into painting actions. The pre-trained knowledge in the designated operation layer is used to guide the neural painter. To behave like a human painter, the actions are driven by costs that simulate hand movement, paint color changes, stroke shape, and stroking style. To help planning, Mask R-CNN is applied to detect object areas and decide the painting order. Depending on the chosen parameters, the proposed painting system produces a variety of artistic styles. Further experiments evaluate the effect of channel penetration and channel sensitivity on the strokes.
1 Introduction
For years, convolutional neural networks have been digesting the visual world and serving a wide range of applications, and neural style transfer is one of the most popular. Soon after the pioneering work by Gatys et al. (2016), feed-forward networks incorporating convolutional layers were introduced to perform near real-time style transfer Johnson et al. (2016); Ulyanov et al. (2016). More recent work Li et al. (2017); Jing et al. (2018) also considers stroke and attention factors. Although fast, current style transfer methods generate the whole transferred frame in a single feed-forward pass. This leaves the audience wondering in which stroke order a neural painter would paint the artwork that leads to the style transfer output.
On the other hand, the stroke composition process has been studied in the literature. Determining the steps to paint an image (or to write letters) is typically referred to as stroke-based rendering (or inverse graphics). To learn painting behavior without paired stroke-wise training data, recent work such as SPIRAL Ganin et al. (2019), StrokeNet Zheng et al. (2019) and LearningToPaint Huang et al. (2019) applies reinforcement learning to stroke planning. While the strokes are typically arranged in the coarse-to-fine order designed into the painting architecture, the stroke shape and stroke order may differ from those of a human painter.
Decomposing the target into a reasonable number of stroke actions with a reasonable number of stroke shapes remains challenging. Which parts of the target need to be painted, and in which artistic style? How does a human painter plan and compose a painting with strokes? When different painters paint the same object or the same scene, how different are their approaches? We try to find clues from the generator network.
This work presents a strategy to decompose the channel response of the generator networks into stroke actions, called the channel stroke. The channel stroke considers the burden of the human painter in changing paint brushes and changing colors. Leveraging the channel depth of the generator networks, the proposed strategy strokes through the same channels continuously over the regions with high receptive field response.
Experiments are performed on the generator of GANs and on the transformer network of style transfer. The chosen layer in any of these networks defines the stroke-able space for the channel stroke. The cost of actions in the stroke-able space then quantifies the burden of each action and decides whether to continue the stroke, change color, or stop painting. Depending on the knowledge learned in the pre-trained CNN layers, the channel stroking location and style vary. When applied to the style transfer network, the stroke style varies on top of the neural style (Fig. 1).
Mask R-CNN He et al. (2017) is used to help the neural painter plan by understanding what it is painting. With the knowledge of recognized objects, the neural painter then focuses on painting the object regions one by one. The plan thus covers what to paint, whether the background is painted, and to what detail each region is painted, where the stroke detail is already parameterized in the channel stroke above. This tunable planning helps to put appropriate focus on different regions. The whole system is called channel painting in action, CPIA (Fig. 4).
The paper details the method for channel decomposition and rendering over a limited number of channels. Qualitative results are presented along with quantified channel coverage at the operation layer. Built on existing pre-trained networks, the proposed CPIA provides: 1. stroke composition actions, 2. additional tunable artistic outcomes, 3. controllable brush shape and movement. It is unsupervised and requires no additional training data.
2 Related Work
Generative neural networks provide the tensor space for channel decomposition. The back-propagation concept for autoencoders was introduced in the 1980s Rumelhart et al. (1985); Ballard (1987), letting the neurons learn from the errors between the generated target and the real target. Later infrastructural improvements in parallel computing led to two popular generative approaches: variational autoencoders (VAEs Kingma and Welling (2014)) and generative adversarial networks (GANs Goodfellow et al. (2014)). VAEs use a framework of probabilistic graphical models to generate the output by maximizing a lower bound on the likelihood of the data, while GANs leverage a discriminative network to judge and improve the output of the generative network. After the adoption of deep convolutional nets (DCGAN Radford et al. (2016)), task-oriented GANs have been applied to image-to-image translation (Pix2Pix Isola et al. (2017), CycleGAN Zhu et al. (2017), GDWCT Cho et al. (2019)), concept-to-image translation (GAWWN Reed et al. (2016), PG Ma et al. (2017), StyleGAN Karras et al. (2019)) and text-to-image translation (StackGAN Zhang et al. (2017), BigGAN Brock et al. (2019)), among other domain-specific GANs Jin et al. (2017); Chen et al. (2018); Wang et al. (2018). In this paper, BigGAN is used to generate the CPIA painting targets from keywords, as in Fig. 1.
Neural style transfer is one major domain to which CPIA can be applied. Gatys et al. (2016) formulate the style transfer cost as a combination of the content loss and the style loss. The loss is measured over the pre-trained VGGnet Simonyan and Zisserman (2015), from the generated image to both the content image and the style image. Transformer networks with deep convolutional layers are introduced in Johnson et al. (2016); Ulyanov et al. (2016) to speed up style transfer, with the whole transformer trained on one particular style. Later transformers attempt to learn multiple styles in a single network, such as Dumoulin et al. (2017); Zhang and Dana (2018). In the following sections, the transformer of MSG Zhang and Dana (2018) is decomposed into CPIA actions.
Stroke-based rendering, or inverse graphics, is challenging without training stroke sequences. To cope with the lack of such sequences, SPIRAL Ganin et al. (2019) uses a discriminative network to guide distributed reinforcement learners toward meaningful progress. The computation cost is high for deep reinforcement learners with a large and continuous action space. This can be mitigated by creating a differentiable environment, like the ones in WorldModels Ha and Schmidhuber (2018), PlaNet Hafner et al. (2019) and StrokeNet Zheng et al. (2019). Ongoing research has explored various stroking agents that generate very different output styles. For cartoon-like stroking, LearningToPaint Huang et al. (2019) efficiently generates simple strokes to compose a complex image. On the other hand, NeuralPainter Nakano (2019) abstracts and recreates the image as a sketch-like output.
Let denote the layer operation of the layer on its input . The generator network of layers can be expressed as
where is the input of the network and is the output. The operations and corresponding weights in were trained with or without the input .
Extending the forward path of the neural network to allow additional layer operations leads to
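In code, this extended forward path can be sketched as follows. This is a minimal illustration only: the function name `forward` and the `extra_ops` dictionary are hypothetical, and the layers are plain callables standing in for the pre-trained layer operations.

```python
def forward(layers, x, extra_ops=None):
    """Run x through the layers in order.

    extra_ops (hypothetical name) maps a layer index to an additional
    operation applied to that layer's output before the next layer.
    """
    extra_ops = extra_ops or {}
    for index, layer in enumerate(layers):
        x = layer(x)
        if index in extra_ops:
            x = extra_ops[index](x)
    return x


# Toy stand-ins for pre-trained layer operations.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
plain = forward(layers, 1)
# Insert an additional operation after the second layer.
hooked = forward(layers, 1, extra_ops={1: lambda v: 0})
```

With an empty `extra_ops` the original network output is recovered, so the pre-trained weights are left untouched.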
3.1 Channel Flush
Next, we provide implementations of the layer operations . One key finding in our experiments is that, for the intermediate layer images , the decomposed representation preserves the spatial information of the output image , while the detail at each location can be truncated and represented by the high response channel(s). By only showing channels out of the total channels at layer , we force the later layers to respond only on the top channels at each location of . This observation shed light on the following operation
where is the Hadamard product, is the cardinality of a set, and is the channel limit out of the channels at layer per location . The tensor masks the original layer output and picks only the top channels of at each location for rendering into later layers.
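The masking described above can be sketched with NumPy. This is an illustrative sketch, not the original implementation: the names `channel_flush`, `z` (the layer output, assumed to have shape channels x height x width) and `k` (the channel limit) are assumptions introduced here.

```python
import numpy as np


def channel_flush(z, k):
    """Keep the k highest-response channels at each spatial location.

    z: layer output of shape (C, H, W); k: channel limit, 1 <= k <= C.
    Returns the masked response (Hadamard product) and the boolean mask.
    """
    channels = z.shape[0]
    if not 1 <= k <= channels:
        raise ValueError("k must lie between 1 and the number of channels")
    # Indices of the k largest responses along the channel axis.
    top = np.argpartition(z, channels - k, axis=0)[channels - k:]
    mask = np.zeros(z.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=0)
    return z * mask, mask
```

The masked tensor is then passed on to the later layers in place of the original layer output.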
The channel flush provides a way to focus the image rendering on high response channels. At any spatial location of the operating layer, the lower response channels are muted. The number of channels to choose from also matters. In CNNs LeCun et al. (2004); Krizhevsky et al. (2012), depth (channels) is traded against breadth (spatial size). As a consequence, the operation layer of the channel flush is best placed in the middle of the network, avoiding the last layers, which have low channel variety, and the initial layers, which have low spatial resolution.
3.2 Channel Stroke
Sometimes the channel flush incurs unnecessary discontinuities over the output image. To deal with this issue, we extend the in Eq. 3 into an operation set of channel strokes at layer . Let denote the channel depth, height and width of its output . We define as the set of neighborhood pixels near pixel , where for all in . The quantifier is a real number from , which means the sensitivity of the stroke. When is close to one, the stroke sensitivity is high. Thus the channel stroke can only turn on the neighboring pixels with highly similar response as the stroke pixel. The simplest case of is a square box centered at with each side of pixels on channel . In this case, the parameter is the stroke size.
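The square-box neighborhood with a sensitivity gate can be sketched as below. The names `stroke_neighborhood`, `size` and `alpha` are assumptions for illustration, as is the exact gating rule: a neighbor is turned on when its response is at least `alpha` times the response at the stroke pixel, with non-negative responses and an odd box side assumed.

```python
import numpy as np


def stroke_neighborhood(z, c, i, j, size, alpha):
    """Boolean (H, W) map of the pixels a stroke at (c, i, j) may turn on.

    A pixel q inside the square box of side `size` centered at (i, j)
    qualifies when z[c, q] >= alpha * z[c, i, j] (assumed gating rule).
    """
    height, width = z.shape[1:]
    r = size // 2
    i0, i1 = max(0, i - r), min(height, i + r + 1)
    j0, j1 = max(0, j - r), min(width, j + r + 1)
    hood = np.zeros((height, width), dtype=bool)
    hood[i0:i1, j0:j1] = z[c, i0:i1, j0:j1] >= alpha * z[c, i, j]
    return hood
```

Under this rule, a value of `alpha` close to one restricts the stroke to pixels whose response nearly matches the stroke pixel, while a smaller value lets the stroke fill more of the box.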
The channel stroke algorithm (Alg.1) updates the mask tensor in and the cost tensor in , in each of its iterations. The mask then filters the layer to response in . The stopping criterion terminates the procedure when the current response is lower than a fraction of . Other possible choices for include the number of strokes and the fraction of painted locations.
The intuition behind the channel stroke is that, to paint, a human painter needs to select a paint color, a painting brush, and a pattern to paint. Once these items are selected, the painter can stroke on the canvas and then extend the stroke over a certain region. In Alg.1, the paint color and the pattern to paint are controlled by channel , which is chosen in Step 5 according to the current most responsive pixel . The neighborhood decides the stroke shape, which can be related to the painting brush of the human painter.
At the end of each stroke, the human painter can either continue to use the same color to paint other areas on the canvas, or switch to another color. To be efficient, it is desirable to keep using the brush of the current color as much as possible at the same level of painting detail. We factor this behavior into the cost at the stroke pixel . The channel stroke will continue the stroke into a nearby stroke pixel at the same channel. Note that the neighborhood extension in Step 8 is about the shape of the stroke, while the stroke continuation in Step 9 models the behavior of the human painter in switching color.
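The loop below is a simplified sketch of how these pieces may fit together; it is not a transcription of Alg.1. All names and defaults (`channel_stroke`, `k`, `size`, `alpha`, `stop_frac`, `switch_cost`) are assumptions, responses are assumed non-negative, and only the cost of switching channels is modeled here.

```python
import numpy as np


def channel_stroke(z, k, size=3, alpha=0.8, stop_frac=0.2, switch_cost=2.0):
    """Greedy channel-stroke sketch on a response tensor z of shape (C, H, W).

    Returns the boolean stroke mask and the ordered list of stroke pixels.
    """
    channels, height, width = z.shape
    mask = np.zeros((channels, height, width), dtype=bool)
    cost = np.ones((channels, 1, 1))    # no brush in hand before the first stroke
    threshold = stop_frac * z.max()     # assumed stopping criterion
    r = size // 2
    strokes = []
    while True:
        # Paintable entries: still off, responsive enough, location below limit k.
        paintable = ~mask & (z >= threshold) & (mask.sum(axis=0) < k)
        if not paintable.any():
            break
        # Most responsive paintable entry after discounting by the cost.
        score = np.where(paintable, z / cost, -np.inf)
        c, i, j = np.unravel_index(np.argmax(score), score.shape)
        # Neighborhood extension: square box gated by the sensitivity alpha.
        i0, i1 = max(0, i - r), min(height, i + r + 1)
        j0, j1 = max(0, j - r), min(width, j + r + 1)
        similar = z[c, i0:i1, j0:j1] >= alpha * z[c, i, j]
        mask[c, i0:i1, j0:j1] |= similar & paintable[c, i0:i1, j0:j1]
        mask[c, i, j] = True
        strokes.append((int(c), int(i), int(j)))
        # Stroke continuation: staying on channel c is cheaper than switching.
        cost = np.full((channels, 1, 1), switch_cost)
        cost[c] = 1.0
    return mask, strokes
```

Because every iteration turns on at least the stroke pixel itself, the loop terminates once no entry is both responsive enough and below the per-location channel limit.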
3.3 Action Cost
We first consider the channel cost, which reflects the cost of changing the brush color for a human painter. Let be the constant cost accounting for a channel change. The channel cost tensor is
The next cost factor is the stroke movement. Naturally, the stroke continues into its nearby region. Therefore, the movement cost increases as the distance from the current stroke location increases. The movement cost is
where is the standard deviation of the Gaussian kernel centered at the current stroking location .
The overall cost is then the Hadamard product of the individual cost components, . This cost is updated in every channel stroke iteration at Alg.1 Step 9. Further stroke behavior modeling may incorporate a location cost for top-to-bottom and left-to-right handwriting behavior. It is also possible to add a directional cost to make the stroke continue in the same direction and attain a certain artistic feel. Here we focus on quantifying the trade-off between continuing with the current brush and changing color. The cost provides the next stroking location and channel.
In Fig.3, we compare the results from channel flush and channel stroke over different operation layers and different channel limits . We also provide a histogram of channel coverage for each outcome image. For an arbitrary channel, the coverage means the fraction of locations having their mask turned on. The results from the channel flush have more concentrated coverage compared to those from the channel stroke. Because the channel stroke extends the stroke into its neighborhood and continues onto the nearby stroke-able region, some channels tend to have higher coverage than others. This results in a more spread-out channel coverage.
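A minimal sketch of such a combined cost follows. The functional forms are assumptions chosen to match the description: a constant `switch_cost` on every channel except the current one, and a movement term `2 - exp(-d^2 / (2 sigma^2))` that equals one at the current stroke location and grows with the distance `d`. The names `action_cost`, `switch_cost` and `sigma` are hypothetical.

```python
import numpy as np


def action_cost(shape, c, i, j, switch_cost=2.0, sigma=4.0):
    """Cost tensor of shape (C, H, W) after a stroke at channel c, pixel (i, j)."""
    channels, height, width = shape
    # Channel cost: 1 on the current channel, switch_cost elsewhere.
    channel = np.full((channels, 1, 1), switch_cost)
    channel[c] = 1.0
    # Movement cost: grows with distance from the current stroke location.
    rows, cols = np.mgrid[0:height, 0:width]
    gauss = np.exp(-((rows - i) ** 2 + (cols - j) ** 2) / (2.0 * sigma ** 2))
    movement = (2.0 - gauss)[None, :, :]
    # Overall cost: Hadamard product of the components (via broadcasting).
    return channel * movement
```

Dividing the layer response by this tensor before picking the next stroke pixel favors nearby locations on the same channel, which is one way to realize the continuation behavior described above.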
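The coverage statistic can be computed directly from the mask tensor. The sketch below assumes a boolean mask of shape channels x height x width; the function name `channel_coverage` and the choice of ten histogram bins are illustrative.

```python
import numpy as np


def channel_coverage(mask):
    """Fraction of spatial locations turned on, per channel."""
    return mask.reshape(mask.shape[0], -1).mean(axis=1)


# Toy mask: channel 0 fully on, channel 1 on at a single location.
mask = np.zeros((2, 2, 2), dtype=bool)
mask[0] = True
mask[1, 0, 0] = True
coverage = channel_coverage(mask)
# Histogram of coverage values over the channels.
counts, edges = np.histogram(coverage, bins=10, range=(0.0, 1.0))
```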
With neighborhood extension and stroke continuation, the channel stroke overcomes the occasional discontinuity issue in the channel flush, while keeping the artistic look from blocking out the low response channels at each location. When the channel limit becomes closer to the number of channels at the operation layer, the output image becomes closer to the original output without channel operations.
3.4 Channel Painting in Action
Based on the channel stroke, we propose the channel painting-in-action (CPIA) framework. This framework first divides the input image into painting regions and comes up with a painting plan. The painting plan contains a list of step images, which are composed of certain masked regions of the input image. The step images are then sequentially fed into the pre-trained generator network. The operation layer carries out the stroke actions and paints on the output canvas. The framework is presented in Fig. 4. In the implementation of this work, we use Mask R-CNN to mark out the objects as regions of interest (ROIs). The step images are prepared according to these ROIs.
The step images mask out the regions to ensure channel continuity on those regions. When the first step image is fed, the channel stroke algorithm is applied to paint on the masked regions of the current step image, until the stopping criterion is met. The output canvas contains only strokes on the regions that have been exposed to channel stroke so far. Then the next step image is fed. The painting process continues until the list of step images is exhausted.
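A painting plan of this kind can be sketched as follows. The sketch assumes the ROIs are already available as `(score, mask)` pairs, for example from a segmentation model; it does not call Mask R-CNN itself. The name `painting_plan`, the ordering by descending detection score, and the optional final background step are illustrative assumptions.

```python
import numpy as np


def painting_plan(image, rois, paint_background=True):
    """Build step images from an (H, W, 3) image and a list of ROIs.

    rois: list of (score, mask) pairs, mask being a boolean (H, W) array.
    Each step image exposes one region and zeroes out the rest.
    """
    ordered = sorted(rois, key=lambda roi: roi[0], reverse=True)
    covered = np.zeros(image.shape[:2], dtype=bool)
    steps = []
    for _, mask in ordered:
        steps.append(image * mask[..., None])
        covered |= mask
    if paint_background:
        # Everything not claimed by any ROI is painted as the last step.
        steps.append(image * (~covered)[..., None])
    return steps
```

Reordering the list, or dropping the background step, changes which regions receive strokes first and whether the background is painted at all.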
Because of the convolution kernels, the stroked response will propagate into its nearby regions in the next convolutional layer of the CNNs. The cascaded propagation tends to tint the whole canvas at the last layer. Therefore, the stroke maps are used as the post-generator masks to wipe out the response of the non-stroking regions at the last layer.
Note that although there can be as many as channels being stroked for each location, the painting process does not guarantee that every location has exactly channels stroked. In fact, at the operation layer, it is very likely that the locations with more high response channels reach the maximum number of stroked channels, while the less responsive locations have fewer than channels stroked.
Fig. 5 presents the CPIA result over the style transfer network in Zhang and Dana (2018). The original style transfer offers the sharper result in general. On the other hand, the CPIA introduces additional artistic styling components on top of the existing style transfer. Such components are defined by the parameters , , , and . At the same time, the step images in the painting plan can also be seen as a controlling factor that decides the painting priority of each masked region.
4 Discussion and Implementation Details
When working with the stroke penetration parameter, a visually plausible stroke outcome requires a proper stroke mask after the last layer. With low penetration in channels, the response can be very dim and needs to be masked with low opacity over the background. The mask opacity increases as the location collects more stroked channels.
In the current implementation, the ordering of painting areas is based on the detected object and its detection score for each ROI from Mask R-CNN. We can optionally add semantic side information to the planning, such as painting persons last or painting large objects first.
To stop the painting at each step image feed of the CPIA, we leverage the stopping criterion in Alg.1. It can be a global threshold cutting off the response ratio at each ROI, or a stopping condition tailored toward different object types or sizes. The former is adopted in this work.
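One possible realization of such a post-generator mask is sketched below. The linear opacity rule (stroked channel count divided by the channel limit `k`), the nearest-neighbor upsampling, and the names `stroke_opacity` and `composite` are assumptions made for illustration.

```python
import numpy as np


def stroke_opacity(mask, k, out_hw):
    """Opacity map at output resolution from a (C, H, W) stroke mask.

    Opacity grows linearly with the number of stroked channels per location.
    The output size must be an integer multiple of the layer size.
    """
    _, height, width = mask.shape
    out_h, out_w = out_hw
    if out_h % height or out_w % width:
        raise ValueError("output size must be an integer multiple of the layer size")
    opacity = np.clip(mask.sum(axis=0) / float(k), 0.0, 1.0)
    # Nearest-neighbor upsampling to the canvas resolution.
    return np.repeat(np.repeat(opacity, out_h // height, axis=0), out_w // width, axis=1)


def composite(output, opacity, background=1.0):
    """Blend the generator output (H, W, 3) over a uniform background."""
    a = opacity[..., None]
    return a * output + (1.0 - a) * background
```

Locations with no stroked channels get zero opacity, so the tinted response that propagates into non-stroking regions is replaced by the background.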
The proposed channel stroke strategy utilizes the knowledge learned and stored within the deep convolutional generator network. On top of that, the CPIA leverages the learned object and segmentation knowledge in frameworks like Mask R-CNN to plan the painting regions. The CPIA works with the existing generator networks and the existing image segmentation tools, without additional training data on stroking order.
Looking ahead, we seek to drive the stroking factors in a more adaptive way. The stroke size (bundled in the neighborhood ) can vary based on the stroking location. Depending on the stroking channel , the stroke penetration can possibly change in a responsive fashion. In this work, one single layer of the CNN is chosen as the operation layer. A further expansion is to investigate the multi-layer coordination of the channel stroke. The channel decomposition for the generator networks still has much to explore and the application may be beyond the artistic rendering and the step-wise painting.
References
- Ballard (1987) Modular learning in neural networks. In Proceedings of the sixth National conference on Artificial intelligence-Volume 1, pp. 279–284.
- Brock et al. (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR).
- Chen et al. (2018) CartoonGAN: generative adversarial networks for photo cartoonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9465–9474.
- Cho et al. (2019) Image-to-image translation via group-wise deep whitening-and-coloring transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10639–10647.
- Dumoulin et al. (2017) A learned representation for artistic style. In International Conference on Learning Representations (ICLR).
- Ganin et al. (2019) Synthesizing programs for images using reinforced adversarial learning. In International Conference on Learning Representations (ICLR).
- Gatys et al. (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423.
- Goodfellow et al. (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pp. 2672–2680.
- Ha and Schmidhuber (2018) Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems (NIPS), pp. 2450–2462.
- Hafner et al. (2019) Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), pp. 2555–2565.
- He et al. (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969.
- Huang et al. (2019) Learning to paint with model-based deep reinforcement learning. arXiv preprint arXiv:1903.04411.
- Isola et al. (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1125–1134.
- Jin et al. (2017) Towards the automatic anime characters creation with generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS).
- Jing et al. (2018) Stroke controllable fast style transfer with adaptive receptive fields. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 238–254.
- Johnson et al. (2016) Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 694–711.
- Karras et al. (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4401–4410.
- Kingma and Welling (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).
- Krizhevsky et al. (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105.
- LeCun et al. (2004) Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 97–104.
- Li et al. (2017) Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems (NIPS), pp. 386–396.
- Ma et al. (2017) Pose guided person image generation. In Advances in Neural Information Processing Systems (NIPS), pp. 406–416.
- Nakano (2019) Neural painters: a learned differentiable constraint for generating brushstroke paintings. arXiv preprint arXiv:1904.08410.
- Radford et al. (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR).
- Reed et al. (2016) Learning what and where to draw. In Advances in Neural Information Processing Systems (NIPS), pp. 217–225.
- Rumelhart et al. (1985) Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.
- Simonyan and Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
- Ulyanov et al. (2016) Texture networks: feed-forward synthesis of textures and stylized images. In International Conference on Machine Learning (ICML), Vol. 1, pp. 4.
- Wang et al. (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8798–8807.
- Zhang et al. (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5907–5915.
- Zhang and Dana (2018) Multi-style generative network for real-time transfer. In Proceedings of the European Conference on Computer Vision (ECCV).
- Zheng et al. (2019) StrokeNet: a neural painting environment. In International Conference on Learning Representations (ICLR).
- Zhu et al. (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2223–2232.
Appendix A Appendix
In this section, we discuss additional parameters that can further fine-tune the channel stroke and the CPIA result.
A.1 Stroke Sensitivity
In the channel stroke (Alg. 1), the parameter is used to qualify which of the neighboring pixels can be turned on during the current channel stroke. Therefore, both the scope of the neighborhood and the sensitivity decide the shape of the current stroke. In Fig. 6, different ’s are compared over the generated images from the BigGAN network.
Picking a smaller can further extend the current stroke into its neighboring pixels and create a wider stroke shape. For any pixel at the operation layer, being turned on at one channel reduces the chance of being turned on at other channels. The channel coverage histograms also confirm that smaller ’s leave more channels turned off. The channels left off are the less responsive channels across the operation layer. Thus the output image from the smaller ’s focuses more on the globally responsive channels.
A.2 Stroke Penetration
A stroke typically affects multiple channels. For example, the last layer in the generator consists of RGB channels. A stroke in yellow color impacts at least the red and the green channels. Reconsidering the channel strokes, another option is to enable the stroke to penetrate through several channels. The penetration happens when there exist several channels having a similarly high response as the stroking channel at location .
When stroking at the pixel , other top response channels at the same location are also turned on, provided that they were previously muted. The penetrating stroke then follows Alg.1 to complete the stroking process. The set of penetrating channels decides the color and the pattern of the stroke. Depending on the operation layer, there can be more than a thousand paint-able channels. The depth of the layer gives room for the stroke penetration. We evaluate the stroke penetration in Fig. 7, where the operation layer has channels.
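The selection of penetrating channels at a stroke pixel can be sketched as below. Treating the penetration as a fixed number of channels (`depth`) is an assumption made here, as are the function and parameter names; the text leaves open whether a response-similarity test is used instead.

```python
import numpy as np


def penetrating_channels(z, mask, i, j, depth):
    """Channels turned on together by a stroke at location (i, j).

    Returns up to `depth` channel indices with the highest response at (i, j),
    skipping channels that are already on at that location. The first entry
    is the stroking channel itself.
    """
    response = np.where(mask[:, i, j], -np.inf, z[:, i, j])
    order = np.argsort(response)[::-1][:depth]
    return [int(c) for c in order if np.isfinite(response[c])]


z = np.array([0.1, 0.9, 0.5, 0.7]).reshape(4, 1, 1)
mask = np.zeros((4, 1, 1), dtype=bool)
first = penetrating_channels(z, mask, 0, 0, depth=2)
mask[1, 0, 0] = True  # channel 1 already stroked at this location
second = penetrating_channels(z, mask, 0, 0, depth=2)
```

Each returned channel would then be extended over its own neighborhood in the same way as a single-channel stroke.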
The passing channels are less coupled on lower stroke penetration. This induces the variety in channel coverage, and generates high contrast bold results. On the other hand, the higher stroke penetration brings the output closer to the original full channel output. This can be verified with the histograms of channel coverage. While stroke sensitivity maintains the spatial continuity, the channel continuity is controlled by the stroke penetration . Note that in Fig. 6 the penetration is fixed at , and in Fig. 7 the sensitivity is fixed at .