# Interactive Sketch & Fill: Multiclass Sketch-to-Image Translation

Arnab Ghosh         Richard Zhang         Puneet K. Dokania
Oliver Wang         Alexei A. Efros         Philip H. S. Torr         Eli Shechtman

University of Oxford               Adobe Research               UC Berkeley
###### Abstract

We propose an interactive GAN-based sketch-to-image translation method that helps novice users create images of simple objects. As the user starts to draw a sketch of a desired object type, the network interactively recommends plausible completions, and shows a corresponding synthesized image to the user. This enables a feedback loop, where the user can edit their sketch based on the network’s recommendations, visualizing both the completed shape and final rendered image while they draw. In order to use a single trained model across a wide array of object classes, we introduce a gating-based approach for class conditioning, which allows us to generate distinct classes without feature mixing, from a single generator network.

## 1 Introduction

Conditional GAN-based image translation [isola2016image2image, sangkloy2017scribbler, zhu2017unpaired] models have shown remarkable success at taking an abstract input, such as an edge map or a semantic segmentation map, and translating it to a real image. Combining this with a user interface allows a user to quickly create images in the target domain. However, such interfaces for object creation require the entire edge or label map as input, which is a challenging task as users typically create drawings incrementally. Furthermore, completing a line drawing without any feedback may prove difficult for many, as untrained practitioners generally struggle at free-hand drawing of accurate proportions of objects and their parts [cohen1997can], 3D shapes and perspective [schmidt2009expert]. As a result, it is much easier with current interactive image translation methods to obtain realistic looking images by editing existing images [dekel2018sparse, portenier2018faceshop] rather than creating images from scratch.

We propose a new GAN-based interactive image generation system for drawing objects from scratch that: 1) generates full images given partial user strokes (or sketches); 2) serves as a recommender system that offers suggestions during the creative process to help the user generate a desired image; and 3) uses a single conditional GAN model for multiple image classes, via a gating-based conditioning mechanism. Such a system allows for creative input to come from the user, while the challenging task of getting exact object proportions correct is left to the model, which constantly predicts a plausible completion of the user’s sketch (Fig. 1).

Unlike other related work, we use sparse object outlines/sketches/simplified-edges instead of dense edge maps as the user input, as these are closer to the lines that novice users tend to draw [cole2008people]. Our model first completes the user input and then generates an image conditioned on the completed shape. There are several advantages to this two-stage approach. First, we are able to give the artist feedback on the general object shape in our interactive interface (similar to ShadowDraw [lee2011shadowdraw]), allowing them to quickly refine the higher-level shape until it is satisfactory. Second, we found that splitting completion and image generation works better than going directly from partial outlines to images, as the additional intermediate supervision on full outlines/sketches breaks the problem into two easier sub-problems – first recover the geometric properties of the object (shape, proportions) and then fill in the appearance (colors, textures).

For the second stage, we use a multi-class generator that is conditioned on a user supplied class label. This generator applies a gating mechanism that allows the network to focus on the important parts (activations) of the network specific to a given class. Such an approach allows for a clean separation of classes, enabling us to train a single generator and discriminator across multiple object classes, thereby enabling a finite-size deployable model that can be used in multiple different scenarios.

To demonstrate the potential of our method as an interactive tool for stroke-based image generation, we collect a new image dataset of ten simple object classes (pineapple, soccer, basketball, etc.) with white backgrounds. In order to stress test our conditional generation mechanism, six of the object classes have similar round shapes, which requires the network to derive texture information from the class conditioning. Fig. 2 shows a short video of an interactive editing session using our system. Along with these simple objects, we also demonstrate the potential of our method on more complicated shapes, such as faces and shoes. Code and other details are available at our website.

## 2 Related Work

#### Interactive Generation

Interactive interfaces for freehand drawing go all the way back to Ivan Sutherland’s Sketchpad [sutherland64]. The pre-deep-learning work most related to ours, ShadowDraw [lee2011shadowdraw], introduced the concept of generating multiple shadows to help novice users draw sketches. PhotoSketcher [eitz2011photosketcher] introduces a retrieval-based method for obtaining real images from sketches. More recently, deep recurrent networks have been used to generate sketches [ha2017neural, ganin2018synthesizing]. Sketch-RNN [ha2017neural] provides a completion of partial strokes, with the advantage of intermediate stroke information via the Quickdraw dataset at training time. SPIRAL [ganin2018synthesizing] learns to generate digits and faces using a reinforcement learning approach. Zhu et al. [zhu2016generative] train a generative model and build an optimization-based interface to generate possible images, given color or edge constraints. The technique is limited to a single class and does not propose a recommendation for the completion of the shape. SketchyGAN [chen2018sketchygan] also aims at generating multi-class images but lacks interactive capability. In contrast to the above, our method provides interactive prediction of the shape and appearance to the user and supports multiple object classes.

#### Generative Modeling

Parametric modeling of an image distribution is a challenging problem. Classic approaches include autoencoders [hinton2006reducing, vincent2008extracting] and Boltzmann machines [smolensky1986information]. More modern approaches include autoregressive models [efros1999texture, van2016conditional], variational autoencoders (VAEs) [kingma2013auto], and generative adversarial networks (GANs). GANs and VAEs both learn mappings from a low-dimensional “latent” code, sampled stochastically, to a high-dimensional image through a feedforward pass of a network. GANs have been successful recently [denton2015deep, radford2015unsupervised, arjovsky2017wgan], and hybrid models feature both a learned mapping from image to latent space as well as adversarial training [donahue2016adversarial, dumoulin2016adversarially, larsen2016vaegan, chen2016infogan].

#### Conditioned Image Generation

The methods described above can be conditioned, either by a low-dimensional vector (such as an object class, or noise vector), a high-dimensional image, or both. Isola et al. [isola2016image2image] propose “pix2pix”, establishing the general usefulness of conditional GANs for image-to-image translation tasks. However, they discover that obtaining multimodality by injecting a random noise vector is difficult, a result corroborated in [mathieu2015deep, pathak2016context, zhu2017toward]. This is an example of mode collapse [goodfellow2016nips], a phenomenon especially prevalent in image-to-image GANs, as the generator tends to ignore the low-dimensional latent code in favor of the high-dimensional image. Proposed solutions include layers which better condition the optimization, such as Spectral Normalization [zhang2018self, miyato2018spectral], modifications to the loss function, such as WGAN [arjovsky2017wasserstein, gulrajani2017improved], or to the optimization procedure [heusel2017gans], and modeling proposals, such as MAD-GAN [ghosh2017multi] and MUNIT [huang2018multimodal]. One modeling approach is to add a predictor from the output to the conditioner, to discourage the model from ignoring the conditioner. This has been explored in the classification setting in Auxiliary-Classifier GAN (ACGAN) [odena2016conditional] and in the regression setting with InfoGAN [chen2016infogan] and ALI/BiGAN (“latent regressor” model) [dumoulin2016adversarially, donahue2016adversarial], and is one half of the BicycleGAN model [zhu2017toward]. We explore a complementary approach of architectural modification via gating.

#### Gating Mechanisms

Residual networks [he2016deep], first introduced for image classification [krizhevsky2012imagenet], have made extremely deep networks viable to train. Veit et al. [veit2016residual] find that the skip connection in the architecture enables test-time removal of blocks. Follow-up work [veit2018adaptive] builds in block removal during training time, with the goal of subsets of blocks specializing to different categories. Inspired by these results, we propose the use of gating for image generation and provide a systematic analysis of gating mechanisms.

The adaptive instance normalization (AdaIN) layer, used in arbitrary style transfer [huang2017arbitrary] and image-to-image translation [huang2018multimodal], and Feature-wise Linear Modulation (FiLM) [perez2017film] are related conditioning mechanisms. Both methods scale and shift feature distributions, based on a high-dimensional conditioner, such as an image or natural language question. Gating also plays an important role in sequential models for natural language processing, such as LSTMs [hochreiter1997long] and GRUs [cho2014learning]. Similarly, concurrent works [karras2018style, park2019semantic] use an AdaIN-style network to modulate the generator parameters.

## 3 Method

We decouple the problem of interactive image generation into two stages: object shape completion from sparse user sketches, and appearance synthesis from the completed shape. More specifically, as illustrated in Fig. 3, we use the Shape Generator for the automatic shape (outline/sparse-sketch/simplified-edge) generation and the Appearance Generator for generating the final image, each trained against its own adversarial discriminator. Example usage is shown in our user interface in Fig. 2.

### 3.1 Shape completion

The shape completion network should provide the user with a visualization of its completed shape(s), based on the user input, and should keep updating the suggested shape(s) interactively. We take a data-driven approach: to train the network, we simulate partial strokes (or inputs) by removing random square patches from the full outline, full sparse sketch, or full simplified edges. The patches are of three sizes (64×64, 128×128, 192×192) and are placed at a random location in the image of size 256×256 (see Fig. 5 for an example). To extend the technique beyond outlines and generate more human-like sketches, we adopt the multistage procedure depicted in Fig. 6. We refer to these generated sketches as “simplified edges”. We automatically generate data in this manner, creating a dataset where, for a given full outline/sketch or simplified edge map, 75 different inputs are created. The model, shown in Fig. 3, is based on the architecture used for non-image conditional generations in [mescheder2018training]. We modify the architecture such that the conditioning input is provided to the generator and discriminator at multiple scales, as shown in Fig. 4. This makes the conditioning input an active part of the generation process and helps in producing multimodal completions.
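The patch-removal procedure can be illustrated with a minimal NumPy sketch. It assumes a binary sketch with strokes stored as 1 on a 0 background, and it uses 25 random placements per patch size (the split given in the Shape Completion Details section); the function names are hypothetical and not taken from the released code.

```python
import numpy as np

IMAGE_SIZE = 256
PATCH_SIZES = (64, 128, 192)   # occluder sizes from the text
PER_SIZE = 25                  # 3 sizes x 25 placements = 75 partial inputs


def remove_random_patch(sketch, patch_size, rng, background=0):
    """Return a copy of `sketch` with one random square patch erased."""
    h, w = sketch.shape
    top = int(rng.integers(0, h - patch_size + 1))
    left = int(rng.integers(0, w - patch_size + 1))
    partial = sketch.copy()
    partial[top:top + patch_size, left:left + patch_size] = background
    return partial


def make_partial_inputs(sketch, rng):
    """Create all partial versions of one full sketch."""
    return [remove_random_patch(sketch, size, rng)
            for size in PATCH_SIZES
            for _ in range(PER_SIZE)]


# Toy "full outline": a square drawn with 1-pixel strokes.
full = np.zeros((IMAGE_SIZE, IMAGE_SIZE), dtype=np.uint8)
full[32, 32:224] = 1
full[223, 32:224] = 1
full[32:224, 32] = 1
full[32:224, 223] = 1

rng = np.random.default_rng(0)
partials = make_partial_inputs(full, rng)
```

Each partial sketch is paired with its full sketch as the training target, so the network only ever has to add strokes, never remove them.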

### 3.2 Appearance synthesis

An ideal interactive sketch-to-image system should be able to generate multiple different image classes with a single generator. Beside memory and time considerations (avoiding loading/using a separate model per class, reducing overall memory), a single network can share features related to outline recognition and texture generation that are common across classes, which helps training with limited examples per class.

As we later show, class-conditioning by concatenation can fail to properly condition the network about the class information in current image translation networks [isola2016image2image, zhu2017toward]. To address this, we propose an effective soft gating mechanism, shown in Fig. 7. Conceptually, our network consists of a small external gating network that is conditioned on the object class (encoded as a 1-hot vector). The gating network outputs parameters that are used to modify the features of the main generator network. Given an input feature tensor $X_l$, a “vanilla” ResNet [he2016deep] maps it to

$$X_{l+1} = X_l + H_l(X_l). \tag{1}$$

Changes in resolution are obtained by upsampling before or downsampling after the residual block. Note that we omit the subscript $l$ from this point forward to reduce clutter. Our gating network augments this with a scalar $\alpha$ predicted for each layer by a learned network that takes the conditioning vector as input:

$$X + \alpha H(X), \quad \text{where } \alpha \in [0, 1]. \tag{2}$$

If the conditioning vector has no use for a particular block, the gating network can predict $\alpha$ close to zero and effectively switch off the layer. During training, blocks within the main network can transform the image in various ways, and $\alpha$ can modulate them such that the most useful blocks are selected. Unlike previous feature map conditioning methods such as AdaIN [ulyanovinstance], we apply gating to both the generator and discriminator. This enables the discriminator to select blocks which effectively judge whether generations are real or fake, conditioned on the class input. Some blocks can be shared across regions in the conditioning vector, whereas other blocks can specialize for a given class.
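As a minimal sketch of block-wise soft gating (Eq. 2), the following NumPy snippet uses a toy residual branch (a linear map followed by a ReLU) as a stand-in for $H(X)$; the real blocks are convolutional, and the function names here are illustrative assumptions.

```python
import numpy as np


def residual_branch(x, weight):
    """Toy stand-in for H(X): a linear map followed by a ReLU."""
    return np.maximum(weight @ x, 0.0)


def gated_block(x, weight, alpha):
    """Block-wise soft gating: X + alpha * H(X), with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return x + alpha * residual_branch(x, weight)


rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w = rng.standard_normal((4, 4))

off = gated_block(x, w, alpha=0.0)  # block switched off: identity mapping
on = gated_block(x, w, alpha=1.0)   # plain residual block, as in Eq. 1
```

With `alpha=0.0` the block reduces to the skip connection alone, which is the "switched off" case described above.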

A more powerful method is to apply this weighting channel-wise using a vector $\boldsymbol{\alpha}$:

$$X + \boldsymbol{\alpha} \odot H(X), \quad \text{where } \boldsymbol{\alpha} \in [0, 1]^c, \tag{3}$$

where $\odot$ represents channel-wise multiplication and $c$ is the number of channels. This allows specific channels to be switched “on” or “off”, providing additional degrees of freedom. We found that this channel-wise approach to gating provides the strongest results. AdaIN describes the case where an Instance Normalization [ulyanovinstance] (IN) operation is applied before scaling and shifting the feature distribution. In that case, we constrain each element of the scale and shift parameters to lie in $[-1, 1]$. We additionally explored incorporating a bias term after the soft-gating, either block-wise using a scalar per layer, or channel-wise using a vector per layer, but we found that they did not help much, and so we leave them out of our final model. Refer to Fig. 8 for a pictorial representation of the various gating schemes.
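The channel-wise variant (Eq. 3) can be sketched in NumPy as a broadcast multiplication over a channels-first feature map. The `(C, H, W)` layout and the function name are assumptions made for illustration.

```python
import numpy as np


def channel_gated_block(x, h_x, alpha):
    """Channel-wise soft gating: X + alpha (.) H(X).

    x, h_x: feature maps of shape (C, H, W); alpha: shape (C,), values in [0, 1].
    """
    alpha = np.asarray(alpha, dtype=x.dtype)
    if alpha.shape != (x.shape[0],):
        raise ValueError("alpha needs one entry per channel")
    if np.any(alpha < 0.0) or np.any(alpha > 1.0):
        raise ValueError("alpha entries must lie in [0, 1]")
    # Broadcast each channel's gate over the spatial dimensions.
    return x + alpha[:, None, None] * h_x


x = np.ones((3, 2, 2))
h_x = np.full((3, 2, 2), 2.0)
alpha = np.array([0.0, 0.5, 1.0])
out = channel_gated_block(x, h_x, alpha)
```

Here the first channel is fully gated off, the second receives half of the residual, and the third receives all of it.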

Finally, we describe our network architecture, which utilizes the gated residual blocks described above. We base our architecture on the residual Encoder-Decoder model proposed in MUNIT [huang2018multimodal]. This architecture comprises 3 conv layers, 8 residual blocks, and 3 up-conv layers. The residual blocks have 256 channels. First, we deepen the network, based on the principle that deeper networks have more valid disjoint, partially shared paths [veit2016residual], and add 24 residual blocks. To enable the larger number of residual blocks, we drastically reduce the width to 32 channels for every layer. We refer to this network as SkinnyResNet. Additionally, we found that modifying the downsampling and upsampling blocks to be residual connections as well improved results, and it also enables us to apply gating to all blocks. When gating is used, the gate prediction network is also designed using residual blocks. Additional architecture details are in the supplementary material.
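To give a rough sense of the trade-off between depth and width, the following back-of-the-envelope count compares the residual trunks only. It assumes each residual block holds two 3×3 convolutions with equal input and output channels plus biases, and that the skinny network has 24 such blocks in total; neither assumption is confirmed by the text, so the numbers are indicative only.

```python
def residual_block_params(channels, kernel=3, convs_per_block=2):
    """Weights plus biases of one residual block (assumed layout)."""
    per_conv = kernel * kernel * channels * channels + channels
    return convs_per_block * per_conv


# Encoder-Decoder trunk: 8 blocks with 256 channels.
wide_trunk = 8 * residual_block_params(256)
# SkinnyResNet trunk: 24 blocks with 32 channels (assumed total).
skinny_trunk = 24 * residual_block_params(32)

print(wide_trunk, skinny_trunk)
```

Under these assumptions the trunk shrinks from 9,441,280 to 443,904 parameters even though it has three times as many blocks, because the cost of each convolution grows quadratically with the channel count.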

## 4 Experiments

We first compare our 2 step approach for interactive image generation on existing datasets such as the UTZappos Shoes dataset [yu2014fine] and CelebA-HQ [karras2017progressive]. State-of-the-art techniques such as pix2pixHD [Wang_2018_CVPR] are used to generate the final image from the autocompleted sketches. We finally evaluate our approach on a multi-class dataset that we collected to test our proposed gating mechanism.

### 4.1 Single Class Generation

#### Datasets

We use the edges2shoes [isola2016image2image] and CelebA-HQ [karras2017progressive] datasets to test our method on single-class generation. To more closely resemble how humans draw strokes, we simplify the edges by first using the preprocessing code of [li2019im2pencil] and then further reducing the strokes with a sketch simplification network [simo2016learning].

#### Architecture

We use the architecture described in Section 3.1 for shape completion. In this case, each dataset only contains a single class, so we can use an off-the-shelf network, such as pix2pixHD [wang2017high] for rendering.

#### Results

As seen in Fig. 9, our 2 step technique allows us to complete the simplified edge maps from the partial strokes and also generate realistic images from the autocompleted simplified edges. Table 1 also demonstrates, across two datasets (faces and shoes), that using a 2 step procedure produces stronger results than mapping directly from the partial sketch to the completed image.

### 4.2 Multi-Class Generation

#### Datasets

To explore the efficacy of our full pipeline, we introduce a new outline dataset consisting of 200 images (150 train, 50 test) for each of 10 classes – basketball, chicken, cookie, cupcake, moon, orange, soccer, strawberry, watermelon and pineapple. All the images have a white background and were collected using search keywords on popular search engines. In each image, we obtain rough outlines for the image. We find the largest blob in the image after thresholding it into a black and white image. We fill the interior holes of the largest blob and obtain a smooth outline using the Savitzky–Golay filter [savitzky1964smoothing].
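The first steps of this outline extraction (thresholding, keeping the largest blob, filling holes) can be sketched with SciPy's `ndimage` module. The threshold value, the assumption that the object is darker than the white background, and the erosion-based boundary are illustrative choices; the Savitzky–Golay smoothing step is not shown here.

```python
import numpy as np
from scipy import ndimage


def largest_blob_outline(gray, threshold=128):
    """Return (filled_mask, outline_mask) for the largest dark blob."""
    foreground = gray < threshold           # assumed: object darker than background
    labels, count = ndimage.label(foreground)
    if count == 0:
        empty = np.zeros_like(foreground)
        return empty, empty
    sizes = np.bincount(labels.ravel())[1:]  # skip label 0 (background)
    blob = labels == (1 + int(np.argmax(sizes)))
    filled = ndimage.binary_fill_holes(blob)
    # Boundary pixels: inside the filled blob but removed by one erosion step.
    outline = filled & ~ndimage.binary_erosion(filled)
    return filled, outline


# Toy image: a dark square with a white hole, plus a small stray dark blob.
gray = np.full((64, 64), 255, dtype=np.uint8)
gray[10:40, 10:40] = 0
gray[20:30, 20:30] = 255
gray[50:53, 50:53] = 0

filled, outline = largest_blob_outline(gray)
```

The stray blob is discarded and the interior hole is filled, so the outline traces only the outer boundary of the main object.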

#### Architecture

For the shape completion, we use the architecture in Section 3.1. For class-conditioned image generation, we test the gated architectures in Section 3.2.

#### Results

In order to test the fidelity of the automatically completed shapes, we evaluate how accurately a trained classifier labels each generation. We first verify in Table 2 that our two-stage technique is better than one-step generation. We evaluate the multi-class outline-to-image generations on two axes: adherence to conditioning and realism. We first test the conditioning adherence – whether the network generates an image of the correct class. Off-the-shelf networks have been previously used to evaluate colorizations [zhang2016colorful], street scenes [isola2016image2image, wang2017high], and ImageNet generations [salimans2016improved]. We take a similar approach and fine-tune a pretrained InceptionV3 network [szegedy2016rethinking] for our 10 classes. The generations are then tested with this network for classification accuracy. Results are presented in Table 3.

To judge the generation quality, we also perform a “Visual Turing test” using Amazon Mechanical Turk (AMT). Turkers are shown a real image, followed by a generated image, or vice versa, and asked to identify the fake. An algorithm which generates a realistic image will “fool” Turkers into choosing the incorrect image. We use the implementation from [zhang2016colorful]. Results are presented in Table 3, and qualitative examples are shown in Fig. 10.
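The fooling rate can be computed from the raw responses as in the short sketch below. The trial encoding (position of the generated image, position the Turker marked as fake) is a hypothetical format chosen for illustration, not the format of the implementation from [zhang2016colorful].

```python
def fooling_rate(trials):
    """Fraction of trials in which the Turker marked the real image as fake.

    Each trial is a (fake_position, chosen_position) pair with positions 0 or 1.
    """
    trials = list(trials)
    if not trials:
        raise ValueError("need at least one trial")
    fooled = sum(1 for fake_pos, chosen in trials if chosen != fake_pos)
    return fooled / len(trials)


trials = [(0, 0), (1, 0), (0, 1), (1, 1)]
rate = fooling_rate(trials)
```

In this two-alternative setup, a generator whose outputs are indistinguishable from real images would leave Turkers guessing, so a fooling rate near 50% is the natural reference point.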

#### Gating Architectures

We compare our proposed model to the residual Encoder-Decoder model [huang2018multimodal]. In addition, we compare our proposed gating strategy and SkinnyResNet architecture to the following methods for conditional image generation:

• Per-class: a single generator for each category; this is the only test setting with multiple networks, as all others train a single network

• Concat (In): naive concatenation, input layer only

• Concat (All): naive concatenation, all layers

• Concat (In)+Aux-Class: we add an auxiliary classifier, both for input-only and all layers settings

• BlockGate(+Bias), BlockGate: block-wise soft-gating, with and without a bias parameter

• ChannelGate(+Bias), ChannelGate: channel-wise soft-gating, with and without a bias parameter

Does naive concatenation effectively inject conditioning? In Fig. 10, we show a selected example from each of the 10 classes. The per-class baseline trivially adheres to the conditioning, as each class gets to have its own network. However, when a single network is trained to generate all classes, naive concatenation is unable to successfully inject class information, for either network and for either type of concatenation. For the EncoderDecoder network, basketballs, oranges, cupcakes, pineapples, and fried chicken are all confused with each other. For the SkinnyResNet network, oranges are generated instead of basketballs, and pineapples and fried chicken drumsticks are confused. As seen in Table 3, classification accuracy is slightly higher when concatenating at all layers than at only the input layer, but is low for both.

Does gating effectively inject conditioning? Using the proposed soft-gating, on the other hand, leads to successful generations. We test variants of soft-gating on the SkinnyResNet, and accuracy is dramatically improved, becoming comparable to using a single generator per class (Table 3). Among the gating mechanisms, we find that channel-wise multiplication generates the most realistic images, as measured by the AMT fooling rate. Interestingly, its fooling rate is higher than that of the per-class generators. Qualitatively, we notice that per-class generators sometimes exhibit artifacts in the background, as seen in the generation of “moon”. We hypothesize that, with the correct conditioning mechanism, the single generator across multiple classes has the benefit of seeing more training data and finding common elements across classes, such as clean, white backgrounds.

Is gating effective across architectures? As seen in Table 3, using channel-wise gating instead of naive concatenation improves both accuracy and realism across architectures. For example, for the EncoderDecoder architecture, gating enables successful generation of the pineapple. Both quantitatively and qualitatively, results are better for our proposed SkinnyResNet architecture.

Do the generations generalize to unusual outlines? The training images consist of the outlines corresponding to the geometry of each class. However, an interesting test scenario is whether the technique generalizes to unseen shape and class combinations. In Fig. 1, we show that an input circle not only produces circular objects, such as a basketball, watermelon, and cookie, but also noncircular objects such as strawberry, pineapple, and cupcake. Note that both the pineapple crown and bottom are generated, even without any structural indication of these parts in the outline.

## 5 Discussion

We present a two-stage approach for interactive object generation, centered around the idea of a shape completion intermediary. This step both makes training more stable and also allows us to give coarse geometric feedback to the user, which they can choose to integrate as they desire.

## Acknowledgements

AG, PKD, and PHST are supported by the ERC grant ERC-2012-AdG, EPSRC grant Seebibyte EP/M013774/1, EPSRC/MURI grant EP/N019474/1 and would also like to acknowledge the Royal Academy of Engineering and FiveAI. Part of the work was done while AG was an intern at Adobe.

## 6 Insights on Gating Mechanism

We demonstrate the intuition behind our gating mechanism with a toy experiment where a generative network models a 1D mixture of Gaussians composed of five components (Fig. 13). For this test, the generator and discriminator architectures consist of only residual blocks, where each residual block is composed of fully connected layers. The generator is conditioned on a latent vector and is trained to approximate the distribution, as seen in Fig. 13 (left). Removing a single residual block, in the spirit of [veit2016residual], leads to the disappearance of a mode from the predicted distribution. Removing another block removes a further mode, as seen in Fig. 13 (mid, right). This experiment suggests that residual blocks arrange themselves naturally into modeling parts of a distribution, which motivates our use of a gating network where the network learns which blocks (or alternatively, which channels) to attend to for each object class.
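The target distribution of this toy experiment can be sampled as in the sketch below. The component means, the shared standard deviation, and the equal mixing weights are assumptions made for illustration, since the text specifies only that there are five components.

```python
import numpy as np

# Assumed parameters: five equally weighted, well-separated components.
MEANS = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
STD = 0.5


def sample_mixture(n, rng, means=MEANS, std=STD):
    """Draw n samples from a 1D mixture of Gaussians with equal weights."""
    component = rng.integers(0, len(means), size=n)
    samples = rng.normal(loc=means[component], scale=std)
    return samples, component


rng = np.random.default_rng(0)
samples, component = sample_mixture(1000, rng)
```

A histogram of `samples` then shows five modes, and a missing mode in the generator's output would be visible as a missing peak.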

### 6.1 Network Architecture

The architecture was designed to reproduce some of the experiments performed by [veit2016residual] by removing blocks and observing the resulting generated distribution. While our network is deeper (16 layers of residual blocks) than required for similar experiments, e.g., in MAD-GAN [ghosh2017multi], Mode Regularized GAN [che2016mode] and Unrolled GAN [metz2017unrolledGAN], we use only 4 neurons in each residual block of the generator and discriminator (Tables 5 & 6), whereas the fully connected versions connect 256 neurons in the preceding layer to 256 neurons in the current layer. Thus, although the number of parameters is much lower, the network learns the distribution quite accurately. The architecture used in this experiment inspired the design of the SkinnyResNet architecture, as described later.
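The block-removal test can be sketched as a forward pass through a stack of narrow fully connected residual blocks in which selected blocks are skipped. The sketch covers only the residual trunk (16 blocks, width 4) with random, untrained weights; the single-layer block layout and the initialization are assumptions, and the input and output layers are omitted.

```python
import numpy as np

WIDTH = 4        # neurons per residual block, as in the text
NUM_BLOCKS = 16  # depth of the toy network, as in the text


def init_blocks(rng, num_blocks=NUM_BLOCKS, width=WIDTH):
    """Random (weight, bias) pairs; one fully connected layer per block (assumed)."""
    return [(0.1 * rng.standard_normal((width, width)), np.zeros(width))
            for _ in range(num_blocks)]


def generator_forward(z, blocks, removed=()):
    """Apply residual blocks in order, skipping the indices in `removed`."""
    x = z
    for index, (weight, bias) in enumerate(blocks):
        if index in removed:
            continue  # only the skip connection remains for this block
        x = x + np.maximum(x @ weight + bias, 0.0)
    return x


rng = np.random.default_rng(0)
blocks = init_blocks(rng)
z = rng.standard_normal((8, WIDTH))          # batch of latent vectors

full_output = generator_forward(z, blocks)
ablated_output = generator_forward(z, blocks, removed={3})
```

Because every block is residual, removing one leaves a valid network of the same shape, which is what makes this kind of ablation possible at test time.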

## 7 Shape Completion Details

For shape completion, training and testing inputs were created by placing occluders of 3 sizes (64×64, 128×128, 192×192) on top of full sketches or outlines. For each size, 25 partial sketches/outlines were created by random placement of the occluder, thus leading to 75 partial versions to be completed from a single sketch/outline.

The generator architecture for shape completion is shown in Table 7, while the discriminator architecture is shown in Table 8. The architecture is almost the same as [mescheder2018training] except for the sparse Resnet blocks used for injecting conditioning at multiple scales. The sparse Resnet blocks first resize the input conditioning (for example, the partial user strokes), and then convert the feature map into the correct number of channels using a Resnet block, whose output is added to the feature activation. This occurs just prior to the upsampling step in the generator and just prior to the avg pool step in the discriminator.
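The conditioning injection described above can be sketched as follows. This is a simplified NumPy illustration, not the architecture from Tables 7 and 8: it assumes nearest-neighbour resizing, 1×1 (pointwise) convolutions inside the Resnet block, and a channels-first `(C, H, W)` layout, and all function names are hypothetical.

```python
import numpy as np


def resize_nearest(cond, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) array (assumed resize method)."""
    _, h, w = cond.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return cond[:, rows[:, None], cols[None, :]]


def pointwise(x, weight):
    """1x1 convolution: mix channels independently at every pixel."""
    return np.einsum("oc,chw->ohw", weight, x)


def inject_conditioning(features, cond, w_skip, w_in, w_out):
    """Resize `cond`, map it to the feature channel count, and add it."""
    _, h, w = features.shape
    c = resize_nearest(cond, h, w)
    residual = pointwise(np.maximum(pointwise(c, w_in), 0.0), w_out)
    return features + pointwise(c, w_skip) + residual


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 32, 32))   # activation at one scale
strokes = np.zeros((1, 256, 256))             # partial user strokes
strokes[0, 100:110, :] = 1.0
w_skip = rng.standard_normal((8, 1))
w_in = rng.standard_normal((8, 1))
w_out = rng.standard_normal((8, 8))

out = inject_conditioning(features, strokes, w_skip, w_in, w_out)
```

Repeating this at each resolution gives every scale of the generator and discriminator direct access to the user's strokes.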

## 8 Outline→Image Network Architecture

The network architecture is based on our observation that deeper, narrower networks perform better when capturing multi-modal data distributions. The second guiding principle in the design of the architecture is that the different blocks should have a similar number of channels, so that the gating hypernetwork can distribute the modes between the blocks efficiently. Finally, we apply gating to the residual blocks responsible for upsampling and downsampling as well, in order to allow finer-grained control over the generation process. Table 9 shows the Convolution Residual Block, which does not change the spatial resolution of the activation volume; Table 11 shows the Downsampling Residual Block, which reduces the activation volume to half the spatial resolution; and Table 10 shows the Upsampling Residual Block, which increases the activation volume to twice the spatial resolution. In the case of gating (either block-wise or channel-wise), the gating is applied to the residual branch $H(X)$ of each block. The shortcut branch in Table 10 and Table 11 is the branch of the Resnet that is added to the $H(X)$ branch. In these blocks, since the resolution of $X$ changes in $H(X)$, the shortcut also has a similar upsampling/downsampling layer.

### 8.1 Gating Hypernetwork

The gating hypernetwork was also designed using Resnet blocks. We use 1D convolutions in the Resnet block (Table 15) to reduce the number of parameters, and use BatchNormalization to speed up the training of the network responsible for predicting the gating parameters. Class conditioning is first passed through an embedding layer to obtain a representation of the class, which is further processed by the Resnet blocks. The same network is used for the various forms of gating. In the case of block-wise gating, the number of outputs of this network is equal to the number of blocks used in the main network. In the case of an affine transformation, the network also predicts one bias for each block. In the case of channel-wise gating, the number of predicted parameters is equal to the number of blocks multiplied by the number of channels per block, since each residual block has the same number of channels. $\alpha$ was constrained between 0 and 1, corresponding to selecting or rejecting a block, while the bias was restricted between -1 and 1 when used. In the original AdaIN case, parameters are unrestricted, but we found we had to constrain the parameters between -1 and 1 in order for the network to perform well.
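A minimal sketch of such a gate-prediction network is given below. It substitutes dense layers for the 1D convolutions, omits BatchNormalization and the optional bias outputs, and uses a sigmoid to keep the gates in [0, 1]; the sigmoid, the hidden size, and the class name are assumptions, since the text states only the range of the gates.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GateNet:
    """Maps a class id to gates of shape (num_blocks, channels).

    channels=1 corresponds to block-wise gating; channels>1 to channel-wise gating.
    """

    def __init__(self, num_classes, num_blocks, channels=1, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.embedding = rng.standard_normal((num_classes, hidden))
        self.w1 = 0.1 * rng.standard_normal((hidden, hidden))
        self.w2 = 0.1 * rng.standard_normal((hidden, hidden))
        self.w_out = 0.1 * rng.standard_normal((hidden, num_blocks * channels))
        self.out_shape = (num_blocks, channels)

    def __call__(self, class_id):
        h = self.embedding[class_id]                     # class embedding lookup
        h = h + np.maximum(h @ self.w1, 0.0) @ self.w2   # one residual block
        return sigmoid(h @ self.w_out).reshape(self.out_shape)


block_gates = GateNet(num_classes=10, num_blocks=24)(class_id=3)
channel_gates = GateNet(num_classes=10, num_blocks=24, channels=32)(class_id=3)
```

The only difference between the two gating modes is the size of the output layer, which matches the statement that the same network is used for the various forms of gating.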

## 9 Distribution of Alphas

A histogram of the distribution of the various alphas for the block-wise setting and the channel-wise setting is shown in Fig. 16. Even without an explicit sparsity constraint, the alphas are pushed near the extremes.

## 10 Unusual Shapes for Various Classes

As is evident from Fig. 18, the gated generative techniques extend to shapes that were never shown during training.
