Texture Mixer: A Network for Controllable Synthesis and Interpolation of Texture
This paper addresses the problem of interpolating visual textures. We formulate the problem of texture interpolation by requiring (1) by-example controllability and (2) realistic and smooth interpolation among an arbitrary number of texture samples. To solve it we propose a neural network trained simultaneously on a reconstruction task and a generation task, which can project texture examples onto a latent space where they can be linearly interpolated and reprojected back onto the image domain, thus ensuring both intuitive control and realistic results. We show several additional applications including texture brushing and texture dissolve, and show our method outperforms a number of baselines according to a comprehensive suite of metrics as well as a user study.
Many materials, whether naturally occurring or made by humans, exhibit variation in local appearance as well as complex transitions between different materials. For example, if we look closely at pebbles on a sandy beach, we may notice that the size, density, and color of the pebbles change, and that the sand changes color where it is wet or dry and mixes with the pebbles. If a person wants to edit materials in an image, however, it can be highly challenging to create such rich, spatially varying material combinations as we see in the natural world. One general research challenge, then, is to enable these kinds of edits. In this paper, we focus on textures. We define “texture” as an image-space representation of a statistically homogeneous material, captured from a top-down view. We further focus on allowing a user both to control the placement of textures accurately and to create plausible transitions between them.
Because of the complex appearance of textures, creating transitions by interpolating between them in the pixel domain is difficult. Doing so naïvely results in unpleasant artifacts such as ghosting, visible seams, and obvious repetitions. Researchers in texture synthesis have therefore developed sophisticated algorithms to address this problem. These may be divided into two families: non-parametric methods such as patch-based synthesis (e.g. [10, 9, 2]) and parametric methods (e.g. [16, 34]), including neural network synthesis approaches (e.g. [11, 39, 21, 28, 29]). Previously, researchers used sophisticated patch-based interpolation methods [7, 8] with carefully crafted objective functions. However, such approaches are extremely slow. Moreover, due to the hand-crafted nature of their objectives, they cannot learn from a large variety of textures in the natural world, and, as we show in our comparisons, they are often brittle and frequently result in less pleasing transitions. Further, we are not aware of any existing feedforward neural network approach that offers both fine-grained controllable synthesis and interpolation between multiple textures.
In our paper, we develop a neural network approach that we call “Texture Mixer,” which allows for both user control and interpolation of texture. We define the interpolation of texture as a broad term, encompassing any combination of: (1) Either gradual or rapid spatial transitions between two or more different textures, as shown in the palette, the letters, and the background in Figure 1, and (2) Texture dissolve, where we can imagine putting two textures in different layers, and cross-dissolving them according to a user-controlled transparency, as we show in our video. Our feedforward network is trained on a large dataset of textures and runs at interactive rates.
Our approach addresses the difficulty of interpolating between textures in the image domain by projecting these textures onto a latent domain where they may be linearly interpolated, and then decoding them back into the image domain to obtain the desired result. In order to satisfy the two goals of controllability and visual realism, we train our network simultaneously for both tasks. A reconstruction task ensures that when a texture is passed through an encoder and then a decoder (an autoencoder), the result is similar to the input. This allows the user to specify texture at any given point of the output by example. An interpolation task uses a discriminator to ensure that linear interpolations of latent vectors also decode into plausible textures, so that the regions of the output not directly specified by the user are realistic and artifact-free. For this task, we can view our network as a conditional Generative Adversarial Network (GAN). In effect, we thus train an autoencoder and a conditional GAN at the same time, using shared weights and a shared latent space.
To perform the interpolation task, we take the image domain texture samples the user specifies, and project them into latent space using a learned encoder. Given these latent tensors, our network then uses three intuitive latent-space operations: tiling, interpolation, and shuffling. The tiling operation extends a texture spatially to any arbitrary size. The interpolation operation uses weighted combinations of two or more textures in latent domain. The shuffling operation swaps adjacent small squares within the latent tensor to reduce repetitions. These new latent tensors are then decoded to obtain the interpolated result.
Our main contributions are: (1) a state-of-the-art interactive technique that allows both user control and interpolation of texture; and (2) new metrics that are suitable for evaluating texture reconstruction, synthesis, and interpolation. Our method outperforms previous work both on these metrics, when we consider them holistically, and in a user study.
2 Related Work
The problem of texture interpolation has so far been underexplored. It is however closely related to several other problems, most significantly texture synthesis, inpainting, and stylization.
Texture synthesis algorithms can broadly be divided into two families. The first comprises parametric methods, which rely on a generative texture model. These algorithms include older, non-neural methods [16, 34], and also more recent deep learning-based methods that are based on optimization [11, 12, 35, 37] or trained feedforward models [39, 21, 28, 29]. Where the underlying model allows for spatially varying parameters, it may be used to interpolate textures.
The second family of texture synthesis algorithms comprises non-parametric methods, in which the algorithm produces output that is optimized to be as close as possible to the input under some appearance measure [10, 40, 9, 26, 25, 32, 27, 41, 2, 7, 23]. These can be formulated to accept two different inputs and to spatially vary which input the output is compared to, facilitating interpolation [7, 8].
Recently, generative adversarial networks (GANs) [13, 36, 1, 14] have shown improved realism in image synthesis and translation tasks [19, 47, 48]. GANs have also been used directly for texture synthesis [28, 20, 46]; however, they were limited to the single texture they were trained on. A recent approach dubbed PSGAN [3] learns to synthesize a collection of textures present in a single photograph, making it more general and applicable to texture interpolation; it does not, however, allow for user control by specifying which textures are synthesized.
Texture synthesis and image inpainting algorithms are often closely related. A good hole filling algorithm needs to be able to produce some sort of transition between textures on opposite ends of the hole, and so may be used in a texture interpolation task. A few recent deep learning-based methods showed promising results [42, 44, 31, 43].
Finally, some neural image stylization approaches [12, 28, 18, 30], which separate images into content and style layers, have shown that they can effectively synthesize texture by stylizing a noise content image. By spatially varying the style layer, texture interpolation may thus be achieved.
3 Our network: Texture Mixer
In this section, we explain how our network works. We first explain in Section 3.1 how our method is trained. We then show how our training losses are set up in Section 3.2. Finally, we explain in Section 3.3 how our method can be either tested or used by an end user.
3.1 Training setup
Our training process should ideally be able to handle any texture input that the user provides at testing time. However, to limit the amount of data that we need to gather for training, we currently assume that the user’s testing texture comes from a left-out portion of the same dataset that was used for training. For example, if we train on earth textures, then we also test on an earth texture.
As a reminder, we aim to train our network simultaneously for two tasks: reconstruction and interpolation. The reconstruction task ensures that every input texture after being encoded and then decoded results in a similar texture. Meanwhile, the interpolation task ensures that interpolations of latent tensors also decode into plausible textures.
We now explain the components of our network. As we mentioned in the introduction, our method can be viewed as a way of training a network containing both encoders and a generator, such that the generator portion of the network is effectively a GAN. Accordingly, our method has the following components. The network accepts a source texture as input. A global encoder encodes into a latent vector , which can also be viewed as a latent tensor with spatial size . A local encoder encodes the source texture into a latent tensor , which has a spatial size that is a factor smaller than the size of the input texture: we use . The generator can decode these latent tensors back into a texture patch, so that ideally , which encompasses the reconstruction task. Our generator is fully convolutional, so that it can generate output textures of arbitrary size: the output texture size is directly proportional to the size of the local tensor . A discriminator is part of the reconstruction loss. An identical but separately trained discriminator evaluates the realism of interpolation.
Note that in practice, our generator network is implemented as taking a global tensor as input, which has the same spatial size as the local tensor. This is because for some applications of texture interpolation, can actually vary spatially. Thus, when we refer to taking a global latent vector with spatial size as input, what we mean is that this vector is first repeated spatially to match the size of , and the generator is run on the result.
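The spatial repetition of the global latent vector described above can be illustrated with a minimal NumPy sketch. The names `z_g`, `z_l`, and `broadcast_global`, the channel-first `(C, H, W)` layout, and all sizes are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def broadcast_global(z_global, local_shape):
    """Repeat a length-C global latent vector over the spatial grid of a local tensor.

    z_global: array of shape (C,), i.e. a latent with spatial size 1 x 1.
    local_shape: (C_local, H, W) shape of the local latent tensor.
    Returns an array of shape (C, H, W).
    """
    c = z_global.shape[0]
    _, h, w = local_shape
    return np.broadcast_to(z_global.reshape(c, 1, 1), (c, h, w)).copy()

# Hypothetical sizes, chosen only for the example.
z_g = np.arange(3, dtype=np.float32)          # global latent vector, C = 3
z_l = np.zeros((5, 4, 4), dtype=np.float32)   # local latent tensor, (C, H, W)
z_g_map = broadcast_global(z_g, z_l.shape)
print(z_g_map.shape)  # (3, 4, 4)
```

Keeping the global latent as a full spatial map is what allows it to vary from one location to another in applications such as interpolation.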
We show the full training setup in Figure 2. Please see the figure caption for a description of all the components. We will also explain them in terms of formulas here. As is shown in the upper-left of Figure 2, the network is given two real source texture images and from the real texture dataset . Each local encoder encodes () to a local latent tensor . Meanwhile, each global encoder encodes to a global latent vector , denoted as . These latent variables are shown in green and blue boxes in the upper-left of Figure 2.
For the reconstruction task, we then evaluate the reconstructed texture image . These are shown in the upper center of Figure 2. For each reconstructed image , we then impose a weighted sum of three losses against the original texture . We describe these losses in more detail later in Section 3.2.
For the interpolation task, we pose the process of multiple texture interpolation as a problem of simultaneously (1) synthesizing a larger texture, and (2) interpolating between two different textures. In this manner, the network learns to perform well for both single and multiple texture synthesis. For single texture synthesis, we enlarge the generated images by a factor of 3 in each spatial dimension. We do this by tiling the local latent tensor spatially by the same factor. We denote this tiling by , and indicate tiling by a tile icon in the lower-left of Figure 2. We chose the factor 3 because this is the smallest integer that can synthesize transitions over the four edges of . Such a small tiling factor minimizes computational cost. The tiling operation can be beneficial for regular textures. However, in semiregular or stochastic textures, the tiling introduces two artifacts: undesired spatial repetitions, and undesired seams on borders between tiles.
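As a rough illustration of the tiling step, the following sketch tiles a latent tensor by a factor of 3 along both spatial axes. The function name `tile_latent` and the `(C, H, W)` layout are assumptions for this example only.

```python
import numpy as np

def tile_latent(z_local, factor=3):
    """Tile a (C, H, W) latent tensor `factor` times along each spatial axis."""
    return np.tile(z_local, (1, factor, factor))

# Hypothetical 2-channel, 2 x 2 latent tensor.
z = np.arange(2 * 2 * 2, dtype=np.float32).reshape(2, 2, 2)
tiled = tile_latent(z)
print(tiled.shape)  # (2, 6, 6)
```

Because the tiles are exact copies, every tile boundary repeats the same content, which is the source of the repetition and seam artifacts mentioned above.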
We reduce these artifacts by applying a random shuffling to the tiled latent tensors. In Figure 2, this shuffling operation is indicated by a dice icon. Random shuffling in the latent space not only results in more varied decoded image appearance and thus reduces visual repetition, but also softens seams by spatially swapping “pixels” in the latent space across the border of two tensors.
We implement the random shuffling by row and column swapping over several scales from coarse to fine. For this coarse to fine process, we use scales that are powers of two: for . We set the coarsest scale to give a scale that is half the size of the local tensor . For each scale , we define a grid over the tiled latent tensor , where each grid cell has size . For each scale , we then apply a random shuffling on cells of the grid for that scale: we denote this by . This shuffling proceeds through grid rows first in top-down and then bottom-up order: each row is randomly swapped with the succeeding row with probability 0.5. Similarly, this is repeated on grid columns, with column swapping from left to right and right to left. Thus, the entire shuffling operation is:
We visualize this shuffling procedure in the supplementary material. We also want the synthesized texture to be able to transition smoothly between regions where there are user-specified texture constraints and regions where there are none. Thus, at the four corners of the tiled latent tensor, we override the shuffled values with the original, unshuffled ones. We denote such shuffling with corner overriding as .
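The coarse-to-fine shuffling with corner overriding can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation. It assumes a `(C, H, W)` layout, a local tensor whose side is a power of two, and corner regions equal in size to one local tensor; the names `swap_pass` and `shuffle_latent` are hypothetical.

```python
import numpy as np

def swap_pass(z, cell, axis, rng):
    """Sweep forward then backward along `axis`, swapping each band of
    width `cell` with the succeeding band with probability 0.5 (in place)."""
    n = z.shape[axis] // cell
    order = list(range(n - 1)) + list(range(n - 2, -1, -1))
    for i in order:
        if rng.random() < 0.5:
            a = [slice(None)] * z.ndim
            b = [slice(None)] * z.ndim
            a[axis] = slice(i * cell, (i + 1) * cell)
            b[axis] = slice((i + 1) * cell, (i + 2) * cell)
            tmp = z[tuple(a)].copy()
            z[tuple(a)] = z[tuple(b)]
            z[tuple(b)] = tmp
    return z

def shuffle_latent(z_tiled, local_size, rng, override_corners=True):
    """Shuffle a tiled (C, H, W) latent from coarse to fine scales."""
    out = z_tiled.copy()
    cell = local_size // 2            # coarsest scale: half the local tensor
    while cell >= 1:
        swap_pass(out, cell, axis=1, rng=rng)   # grid rows
        swap_pass(out, cell, axis=2, rng=rng)   # grid columns
        cell //= 2
    if override_corners:
        # Assumption: each corner region has the size of one local tensor.
        k = local_size
        for ys in (slice(0, k), slice(-k, None)):
            for xs in (slice(0, k), slice(-k, None)):
                out[:, ys, xs] = z_tiled[:, ys, xs]
    return out

rng = np.random.default_rng(0)
z_local = rng.normal(size=(2, 4, 4))        # hypothetical local latent
z_tiled = np.tile(z_local, (1, 3, 3))       # 3 x 3 tiling
z_shuf = shuffle_latent(z_tiled, local_size=4, rng=rng)
```

Swapping whole rows and columns of grid cells keeps every latent "pixel" intact and only changes where it sits, so the shuffled tensor stays within the statistics of the source texture.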
If we apply the fully convolutional generator to a network trained using a single input texture and the above shuffling process, it will work for single texture synthesis. However, for multiple texture interpolation, we additionally apply interpolation in the latent space before calling , as inspired by [29, 18, 3]. We randomly sample an interpolation parameter , and then interpolate the latent tensors using . This is shown by the circles labelled with in Figure 2. We linearly blend the shuffled local tensors and , which results in the final interpolated latent tensor :
In the same way, we blend and to obtain
Finally, we feed the tiled and blended tensors into the generator to obtain an interpolated texture image , which is shown on the right in Figure 2. From the interpolated texture, we take a random crop of the same size as the input textures. The crop is shown in the red dotted lines in Figure 2. The crop is then compared using appropriately -weighted losses to each of the source textures.
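The blending and cropping steps can be sketched as below. The convention that the interpolation parameter weights the first texture, and the names `blend`, `random_crop`, and `alpha`, are assumptions for illustration. The generator itself is not modeled, so a random array stands in for the decoded image.

```python
import numpy as np

def blend(z1, z2, alpha):
    """Linear blend of two latent tensors; alpha = 1 returns z1 (assumed convention)."""
    return alpha * z1 + (1.0 - alpha) * z2

def random_crop(image, size, rng):
    """Take a random size x size crop from a (C, H, W) image."""
    _, h, w = image.shape
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    return image[:, y:y + size, x:x + size]

rng = np.random.default_rng(0)
z1 = rng.normal(size=(2, 12, 12))    # shuffled, tiled latent of texture 1
z2 = rng.normal(size=(2, 12, 12))    # shuffled, tiled latent of texture 2
alpha = float(rng.uniform(0.0, 1.0))
z_mix = blend(z1, z2, alpha)

# Stand-in for the decoded interpolated image (the generator is not sketched here).
fake_image = rng.normal(size=(3, 48, 48))
crop = random_crop(fake_image, size=16, rng=rng)
print(crop.shape)  # (3, 16, 16)
```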
3.2 Training losses
For the reconstruction task, we use three losses. The first loss is a pixel-wise loss against each input. The second loss is a Gram matrix loss against each input, based on an ImageNet-pretrained VGG-19 model. We define the Gram loss in the same manner as Johnson et al. [21], and use the features for . The third loss is an adversarial loss based on WGAN-GP [14], where the reconstruction discriminator tries to classify whether the reconstructed image is from the real source texture set or generated by the network. The losses are:
The term is defined from WGAN-GP [14] as:
Here and represent a pair of input images, is the adversarially trained discriminator, and is the gradient penalty regularization term.
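For orientation, one form consistent with this description and with the standard WGAN-GP critic objective of Gulrajani et al. [14] is given below. The symbols $A$, $B$, $D$, $\mathrm{GP}$, $\lambda$, and $\epsilon$ are assumed names, and which of $A$ and $B$ plays the role of the real sample is also an assumption; expectations over samples are omitted.

```latex
\mathcal{L}_{\mathrm{adv}}(A, B \mid D) = D(A) - D(B) + \mathrm{GP}(A, B \mid D),
\qquad
\mathrm{GP}(A, B \mid D) = \lambda \left( \left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_2 - 1 \right)^2,
\quad
\hat{x} = \epsilon A + (1 - \epsilon) B, \;\; \epsilon \sim U[0, 1].
```

The gradient penalty pushes the discriminator's gradient norm toward 1 at points sampled on straight lines between paired inputs, which is how WGAN-GP enforces its Lipschitz constraint.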
For the interpolation task, we expect the large interpolated texture image to be similar to some combination of the two input textures. Specifically, if , the interpolated image should be similar to source texture , and if , it should be similar to . However, we do not require pixel-wise similarity, because that would encourage ghosting. We thus impose only a Gram matrix and an adversarial loss. We select a random crop from the interpolated texture image. Then the Gram matrix loss for interpolation is defined as an -weighted loss to each source texture:
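A minimal NumPy sketch of such a weighted Gram matrix loss follows. It uses the Gram normalization of Johnson et al. (division by C·H·W) and a squared Frobenius norm. The random arrays stand in for VGG-19 feature maps, and the function names and the convention that the weight applies to the first texture are assumptions.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map, normalized by C * H * W."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def gram_loss(feat_a, feat_b):
    """Squared Frobenius distance between two Gram matrices."""
    d = gram_matrix(feat_a) - gram_matrix(feat_b)
    return float(np.sum(d * d))

def interp_gram_loss(feat_crop, feat_1, feat_2, alpha):
    """Weighted Gram loss of a crop against both sources (alpha weights source 1)."""
    return (alpha * gram_loss(feat_crop, feat_1)
            + (1.0 - alpha) * gram_loss(feat_crop, feat_2))

rng = np.random.default_rng(0)
f1 = rng.normal(size=(8, 16, 16))    # stand-in features of source texture 1
f2 = rng.normal(size=(8, 16, 16))    # stand-in features of source texture 2
fc = rng.normal(size=(8, 16, 16))    # stand-in features of the random crop
loss = interp_gram_loss(fc, f1, f2, alpha=0.3)
```

Because the Gram matrix discards spatial layout, this loss compares texture statistics rather than pixel positions, which avoids rewarding the ghosting that a pixel-wise loss would encourage.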
Similarly, we adversarially train the interpolation discriminator for the interpolation task to classify whether its input image is from the real source texture set or whether it is a synthetically generated interpolation:
Our final training objective is
where , , and are used to balance the order of magnitude of each loss term, which are not sensitive to dataset.
We provide details related to our training and architecture in the supplementary document, such as how we used progressive growing during training [22].
3.3 Testing and user interactions
At test time, we can use our network in several different ways: we can interpolate sparsely placed textures, we can brush with textures, and we can dissolve between textures.
Interpolation of sparsely placed textures. This option is shown in the palette and background in Figure 1. In this scenario, one or more textures are placed down by the user in the image domain. These textures are each encoded to latent domain.
To make the textures better agree at boundary conditions, we postprocess our images as follows. Suppose that the user places a source textured region as a boundary condition. We first replace the reconstructed regions with the source texture. Then, within the source texture, we use graph cuts [26] to determine an optimal seam where we can cut between the source texture and the reconstruction. Finally, we use standard Poisson blending to minimize visibility of this seam.
Texture brush. We can allow the user to brush with texture as follows. We assume that there is a textured background region, which we have encoded to latent space. The user can select any texture to brush with. For example, in Figure 1 we show an example of selecting a texture from a palette created by interpolating four sparsely created textures. We find the brush texture’s latent domain tensors, and apply them using a Gaussian-weighted brush. Here full weight in the brush causes the background latent tensors to be replaced entirely, and other weights cause a proportionately decreased effect. We show more results in the supplementary material.
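The Gaussian-weighted brush can be sketched in latent space as follows. The function names, the `(C, H, W)` layout, and the brush parameters are assumptions for illustration; the brush latent is taken to have the same shape as the background latent.

```python
import numpy as np

def gaussian_brush(h, w, cy, cx, sigma):
    """Brush weight map in [0, 1] with weight 1 at the brush center (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def apply_brush(z_bg, z_brush, weight):
    """Blend brush latents into background latents; weight 1 replaces the background."""
    w = weight[None, :, :]
    return w * z_brush + (1.0 - w) * z_bg

# Hypothetical latents: a background of zeros and a brush texture of ones.
z_bg = np.zeros((2, 9, 9))
z_brush = np.ones((2, 9, 9))
w = gaussian_brush(9, 9, cy=4, cx=4, sigma=1.5)
z_out = apply_brush(z_bg, z_brush, w)
```

The resulting latent tensor would then be decoded by the generator, so the soft falloff of the brush becomes a texture transition rather than a pixel-level cross-fade.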
Texture dissolve. We can create a cross-dissolve effect between any two textures by encoding them both to latent domain, and then blending between them using blending weights that are spatially uniform. This effect is best visualized in a video, where time controls the dissolve effect. Please see our supplementary video for such results. Figure 3 shows a sequence of video frame samples with gradually varying weights.
Training to interpolate fronto-parallel stationary textures of a particular class requires a dataset with a rich set of examples to represent the intra-variability of the class. Unfortunately, most existing texture datasets such as DTD [6] are intended for texture classification tasks, and do not have enough samples per class (120 in the case of DTD) to cover the texture appearance space with sufficient density.
We collected two datasets of our own: (1) the earth texture dataset contains Creative Commons images from Flickr, which we randomly split into training and testing images; (2) the animal texture dataset contains images from Adobe Stock, randomly split into training and testing images. All textures are real-world RGB photos with arbitrary sizes larger than . Examples from both are shown in our figures throughout the paper.
We further augmented all our training and testing sets by applying: (1) Color histogram matching with a random reference image in the same dataset; (2) Random geometric transformations including horizontal and vertical mirroring, random in-plane rotation and downscaling (up to ); and (3) Randomly cropping a size of . In this way, we augmented samples for each training image and samples for each testing image. We used all three augmentations for the earth textures, and the last two for the animal textures, because changing color often introduces undesired semantic changes (e.g. coloring a zebra’s stripes brown).
In this section, we compare previous work with ours, and also do an ablation study on our own method. In order to fairly compare all methods, we use a horizontal interpolation task. Specifically, we randomly sampled two squares from the test set. We call these the side textures. We placed them as constraints on either end of a canvas. We then used each method to produce the interpolation on the canvas, configuring each method to interpolate linearly where such option is available.
To the best of our knowledge, there is no standard method to quantitatively evaluate texture interpolation. We found existing generation evaluation techniques [36, 17, 4, 22] inadequate for our task. We therefore developed a suite of metrics that evaluate three aspects we consider crucial for our task: (1) user controllability, (2) interpolation smoothness, and (3) interpolation realism. We now discuss these.
User controllability. For interpolation to be considered controllable, it has to closely reproduce the user’s chosen texture at the user’s chosen locations. In our experiment, we measure this as the reconstruction quality for the side textures. We average the LPIPS perceptual similarity measure for the two side textures. We call this Side Perceptual Distance (SPD).
We would also like the center of the interpolation to be similar to both side textures. To measure this, we consider the Gram matrix loss between the central crop of the interpolation and the side textures. We report the sum of distances from the center crop to the two side textures, normalized by the Gram distance between the two. We call this measure the Center Gram Distance (CGD).
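The normalization in CGD can be made concrete with a short sketch. The distance function is passed in as a parameter. The squared Euclidean distance used here is only a stand-in for the Gram matrix loss, and the function names are hypothetical.

```python
import numpy as np

def sq_dist(a, b):
    """Stand-in distance (squared Euclidean); the paper uses a Gram matrix loss."""
    return float(np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def center_gram_distance(left, center, right, dist=sq_dist):
    """Sum of center-to-side distances, normalized by the side-to-side distance."""
    return (dist(center, left) + dist(center, right)) / dist(left, right)

left = np.array([0.0, 0.0])
right = np.array([2.0, 0.0])
print(center_gram_distance(left, left, right))                   # 1.0
print(center_gram_distance(left, np.array([1.0, 0.0]), right))   # 0.5
```

Dividing by the side-to-side distance makes scores comparable across texture pairs that differ in how dissimilar they are to begin with.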
Interpolation smoothness. Ideally, we would like the interpolation to follow a shortest path between the two side textures. To measure this, we construct two difference vectors, one between the left side texture and the center crop and one between the center crop and the right side texture, and measure the cosine distance between the two vectors. We expect this Center Cosine Distance (CCD) to be minimized.
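A small sketch of this cosine distance is given below. The feature space in which the difference vectors are formed is not specified in this section, so the inputs are treated as generic feature vectors; the function name and the `eps` guard are assumptions.

```python
import numpy as np

def center_cosine_distance(left, center, right, eps=1e-12):
    """1 - cosine similarity between (center - left) and (right - center)."""
    d1 = np.ravel(center) - np.ravel(left)
    d2 = np.ravel(right) - np.ravel(center)
    cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2) + eps)
    return 1.0 - float(cos)

left = np.array([0.0, 0.0])
right = np.array([2.0, 2.0])
on_path = np.array([1.0, 1.0])     # lies on the straight line between the sides
off_path = np.array([2.0, 0.0])    # takes a right-angle detour
d_on = center_cosine_distance(left, on_path, right)
d_off = center_cosine_distance(left, off_path, right)
```

A center crop on the straight line between the two sides gives a distance near 0, while a detour at a right angle gives a distance near 1.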
For smoothness, the appearance change should be gradual, without abrupt changes such as seams and cuts. To measure this, we train a seam classifier using real samples from the training set as negative examples and synthetic seams, created by concatenating two random textures, as positive examples. We run this classifier on the center crop. We call this the Center Seam Score (CSS). The architecture and training details of the seam classifier are almost the same as those of our two discriminators, except that (1) we remove the minibatch standard deviation channel, (2) we add a sigmoid activation layer after the output layer for the binary cross-entropy loss computation, and (3) we exclude the progressive growing process. We directly use the sigmoid output of the classifier as the seam score for each input image.
Interpolation realism. The texture should also look realistic, like the training set. To measure this, we use the Inception Score and Sliced Wasserstein Distance (SWD), and apply them on the center crops. This gives the Center Inception Score (CIS) and Center SWD (CSWD), respectively. For CIS, we use the state-of-the-art Inception-ResNet-v2 architecture, finetuned on our two datasets separately.
We also found these metrics do not capture undesired repetitions, a common texture synthesis artifact. We therefore trained a dedicated repetition classifier for this purpose. We call this the Center Repetition Score (CRS). The architecture and training details of repetition classifier are almost the same as those of the seam classifier except the input image size is instead of , where the negative examples are random crops of size from real datasets and the positive examples are horizontally tiled twice from random crops of size from real datasets.
| Method | SPD | CGD | CCD | CSS | CRS | CIS | CSWD | PR | p-value |
|---|---|---|---|---|---|---|---|---|---|
| Image Melding | 0.0111 | 1.289 | 0.865 | 0.0005 | 0.0004 | 29.45 | 47.09 | 0.672 | |
| Ours (no ) | 0.0112 | 1.207 | 0.680 | 0.0078 | 0.0010 | 21.04 | 21.54 | - | - |
| Ours (no blending) | 0.0103 | 1.272 | 0.817 | 0.0125 | 0.0009 | 22.24 | 52.29 | - | - |
| Ours (no shuffling) | 0.0107 | 1.129 | 0.490 | 0.0534 | 0.2386 | 26.78 | 20.99 | - | - |
We compare against several leading methods from different categories on the task of texture interpolation. These include: naïve α-blending, Image Melding [7] as a representative of patch-based techniques, two neural stylization methods, AdaIN [18] and WCT [30], a recent deep hole-filling method called DeepFill, and PSGAN [3], which is the closest to ours. Most of these had to be adapted for our task; see the supplementary material for details. Fig. 4 contains a qualitative comparison between the different methods. Note in this example: (1) the overly sharp interpolation of DeepFill, (2) the undesired ghosting and repetition artifacts of naïve α-blending and ours (no shuffling), (3) the incorrect reconstruction and less relevant interpolation of AdaIN, WCT, and PSGAN, (4) the appearance mismatch between source and interpolation of Image Melding, (5) the lack of smoothness of ours (no ), and (6) the undesired fading of ours (no blending). More qualitative comparisons are shown in the supplementary material. We also report quantitative results, including the user study and the ablation experiments, in Table 1, which contains average values over the two datasets, earth texture and animal texture. Figure 5 summarizes the quantitative comparisons.
4.4 User study
We also conducted a user study on Amazon Mechanical Turk. We presented the users with a binary choice, asking them if they aesthetically prefer our method or one of the baseline methods on a random example from the evaluation horizontal interpolation task. The user study webpage design and sanity check (to guarantee effectiveness of users’ feedback) are shown in the supplementary material. For each method pair, we sampled examples and collected independent user responses per example. Tallying the user votes, we get results per method pair. We assumed a null hypothesis that on average, our method will be preferred by users for a given method pair. We used a one-sample permutation t test to measure p-values, using permutations, and found the p-values for the null hypothesis are all . This indicates that the users do prefer one method over another. To quantify this preference, we count for each method pair all the examples where at least users agree in their preference, and report a preference rate (PR) which shows how many of the preferences were in our method’s favor. Both PR and the p-values for the aggregated earth and animal datasets are reported in Table 1.
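The two statistics of the user study can be sketched as follows. All numbers here (5 users per example, a null mean of 2.5 votes, an agreement threshold of 4, 10,000 permutations, and the vote counts themselves) are hypothetical placeholders, not the values used in the study. The test below flips signs at random and uses the mean as its statistic. Under sign flips the sum of squares is fixed, so ranking by the absolute mean gives the same p-value as ranking by the absolute t statistic.

```python
import numpy as np

def sign_flip_pvalue(votes, null_mean, n_perm, rng):
    """Two-sided one-sample permutation (sign-flip) test on per-example vote counts."""
    d = np.asarray(votes, dtype=float) - null_mean
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm_means = np.abs((signs * d).mean(axis=1))
    return float((np.sum(perm_means >= observed) + 1) / (n_perm + 1))

def preference_rate(votes, n_users, min_agree):
    """Among examples where at least `min_agree` users agree, the share favoring our method."""
    votes = np.asarray(votes)
    ours = votes >= min_agree
    theirs = (n_users - votes) >= min_agree
    decided = int((ours | theirs).sum())
    return float(ours.sum() / decided) if decided > 0 else float("nan")

# Hypothetical data: votes for our method out of 5 users, for 10 examples.
votes = [5, 4, 5, 3, 4, 1, 2, 5, 4, 0]
rng = np.random.default_rng(0)
p = sign_flip_pvalue(votes, null_mean=2.5, n_perm=10000, rng=rng)
pr = preference_rate(votes, n_users=5, min_agree=4)
print(pr)  # 0.75
```

Restricting the preference rate to examples with strong agreement filters out near-ties, where the majority direction is mostly noise.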
4.5 Ablation study
We also compare against simplified versions of our own method. The qualitative results for this comparison are shown in Figure 4. We report quantitative results in Table 1, and visualize them in Figure 5. We ablate the following components:
Remove . The only difference between and is in the tiling and shuffling for . However, if we remove , we find that the texture transitions are less smooth and gradual.
Remove texture blending during training. We modify our method so that the interpolation task during training is performed only upon two identical textures. This leaves the interpolation discriminator unaware of the realism of blended samples, so realism at test time deteriorates.
Remove random shuffling. We skip the shuffling operation in latent space and only perform blending during training. This slightly improves realism and interpolation directness, but causes visually disturbing repetitions.
We presented a method for controllable interpolation of visual textures by projection into an interpolable latent space and re-projection back into the image domain. By training a network simultaneously on a reconstruction task and an interpolation task, we were able to satisfy the criteria of controllability, smoothness, and realism. We have further collected two texture datasets for the purposes of a quantitative evaluation. As we see in Figure 5, although some baseline methods may achieve better results than ours on one of the evaluation criteria, they usually fail on the others. Even when compared to our ablation candidates, our presented method has consistently high marks in all evaluation categories. The user study also shows that users overwhelmingly prefer our method to any of the baselines.
We have demonstrated several applications for this method and hope that it may become a building block of more complex workflows, such as image harmonization. In the future, we would like to collect a texture dataset spanning many categories, and evaluate inter-category interpolation.
This work was partly funded by Adobe. The authors acknowledge the Maryland Advanced Research Computing Center (MARCC) for providing computing resources and acknowledge the photographers for licensing photos under Creative Commons or public domain.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG), 28(3):24, 2009.
-  U. Bergmann, N. Jetchev, and R. Vollgraf. Learning texture manifolds with the periodic spatial GAN. In Proceedings of the 34th International Conference on Machine Learning, pages 469–477, 2017.
-  M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
-  R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
-  M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
-  S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen. Image melding: Combining inconsistent images using patch-based synthesis. ACM Trans. Graph., 31(4):82–1, 2012.
-  O. Diamanti, C. Barnes, S. Paris, E. Shechtman, and O. Sorkine-Hornung. Synthesis of complex image appearance from limited exemplars. ACM Transactions on Graphics (TOG), 34(2):22, 2015.
-  A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 341–346. ACM, 2001.
-  A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In iccv, page 1033. IEEE, 1999.
-  L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 229–238. ACM, 1995.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
-  X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pages 1510–1519, 2017.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  N. Jetchev, U. Bergmann, and R. Vollgraf. Texture synthesis with spatial generative adversarial networks. arXiv preprint arXiv:1611.08207, 2016.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  A. Kaspar, B. Neubert, D. Lischinski, M. Pauly, and J. Kopf. Self tuning texture optimization. In Computer Graphics Forum, volume 34, pages 349–359. Wiley Online Library, 2015.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  V. Kwatra, I. Essa, A. Bobick, and N. Kwatra. Texture optimization for example-based synthesis. In ACM Transactions on Graphics (ToG), volume 24, pages 795–802. ACM, 2005.
-  V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: image and video synthesis using graph cuts. ACM Transactions on Graphics (ToG), 22(3):277–286, 2003.
-  S. Lefebvre and H. Hoppe. Appearance-space texture synthesis. In ACM Transactions on Graphics (TOG), volume 25, pages 541–548. ACM, 2006.
-  C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In Proc. CVPR, 2017.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, pages 386–396, 2017.
-  G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
-  W. Matusik, M. Zwicker, and F. Durand. Texture design using a simplicial complex of morphable textures. In ACM Transactions on Graphics (TOG), volume 24, pages 787–794. ACM, 2005.
-  P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions on graphics (TOG), 22(3):313–318, 2003.
-  J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision, 40(1):49–70, 2000.
-  E. Risser, P. Wilmot, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893, 2017.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
-  O. Sendik and D. Cohen-Or. Deep correlations for texture synthesis. ACM Transactions on Graphics (TOG), 36(5):161, 2017.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, pages 1349–1357, 2016.
-  L.-Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 479–488. ACM Press/Addison-Wesley Publishing Co., 2000.
-  Y. Wexler, E. Shechtman, and M. Irani. Space-time completion of video. IEEE Transactions on Pattern Analysis & Machine Intelligence, (3):463–476, 2007.
-  C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
-  J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018.
-  J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892, 2018.
-  R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
-  Y. Zhou, Z. Zhu, X. Bai, D. Lischinski, D. Cohen-Or, and H. Huang. Non-stationary texture synthesis by adversarial expansion. ACM Trans. Graph., 37(4):49:1–49:13, 2018.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
-  J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, 2017.
Appendix A Supplementary material
A Shuffling procedure visualization
We visualize our shuffling procedure in Figure 6.
B Texture palette and brush examples
In order to increase application diversity, we additionally collect a plant texture dataset from Adobe Stock and randomly split it into training and testing images. We show the texture palette and brush application on the earth texture and plant texture datasets in Figure 7. We additionally show in Figure 8 a camouflage effect of brush painting on the animal texture dataset, where the background patterns are intentionally chosen to be similar to the brush patterns; this illustrates the smooth interpolation across textures. The dynamic processes of drawing such paintings plus the painting of Figure 1 in the main paper are demonstrated in the attached videos named texture_brush_earth.mp4, texture_brush_plant.mp4, texture_brush_animal.mp4, and texture_brush_animal_camouflage.mp4 respectively. The videos are encoded using MP4 libx265 codec at frame rate and M bit rate.
C Texture dissolve examples
We show in Figure 9 additional sequences of video frame samples with gradually varying weights on the earth texture and plant texture datasets. The corresponding videos plus the video for Figure 3 in the main paper are attached with names texture_dissolve_animal.mp4, texture_dissolve_earth.mp4, and texture_dissolve_plant.mp4. The videos are encoded using MP4 libx265 codec at frame rate and M bit rate.
D Network architecture details
We set the texture image size to be throughout our experiments. The proposed , , , and architectures are employed or adapted from the discriminator architecture in , where layers with spatial resolutions higher than are removed. We also adopt their techniques including pixel normalization instead of batch normalization, and leaky ReLU activation. The minibatch standard deviation channel is also preserved for and , but not for and . For , we truncate the architecture so that the output local latent tensor is times smaller than the input texture, where in all our experiments. We tried using deeper architectures but noticed this does not favor reconstruction quality. For , we truncate the architecture at resolution right before the fully-connected layer, because we are doing encoding rather than binary classification.
Our is modified from the fully-convolutional generator architecture from Karras et al. with three changes. First, the architecture is truncated to accept an input spatial resolution that is times smaller than the texture size, and to output the original texture size. Second, the local and global latent tensor inputs are concatenated together along the channel dimension after they are fed into . Third, since our goal is to interpolate a larger texture image output, the receptive field at the bottleneck layer should be large enough to cover the size of the input image. We do this by inserting a chain of five residual blocks in the generator after local and global latent tensor concatenation and before the deconvolution layers from .
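The channel-wise concatenation of local and global latents described above can be sketched as follows. This is only an illustration under stated assumptions: we assume the global latent is a single vector per texture that is replicated over the spatial grid of the local latent tensor before concatenation, and the function name concat_latents and the channel-first layout are hypothetical rather than taken from the original implementation.

```python
import numpy as np


def concat_latents(local_latent, global_latent):
    """Concatenate a local latent tensor with a spatially replicated global latent.

    local_latent:  array of shape (C_local, h, w)
    global_latent: array of shape (C_global,)  (assumed: one vector per texture)
    returns:       array of shape (C_local + C_global, h, w)
    """
    _, h, w = local_latent.shape
    # Replicate the global vector at every spatial location (assumption).
    tiled = np.broadcast_to(
        global_latent[:, None, None], (global_latent.shape[0], h, w)
    )
    # Concatenate along the channel dimension (axis 0 in this layout).
    return np.concatenate([local_latent, tiled], axis=0)
```

In this sketch every spatial position of the result carries the same global channels, while the local channels vary across positions.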
E Training details
Our training procedure again follows the progressive growing training in , where , , , , and simultaneously grow from image spatial resolution at to . We repeatedly alternate between performing one training iteration on and , and then four training iterations on , , and . At each intermediate resolution during growth, the stabilization stage takes 1 epoch of training and the transition stage takes 3 epochs. After the growth is completed, we keep training the model until a total of 20 epochs is reached.
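The epoch budget and the alternating update pattern described above can be sketched in plain Python. This is a hedged illustration, not the original training code: the resolution list passed in is hypothetical, the ordering of stabilization before transition at each growth step is an assumption, and the labels first_group and second_group are placeholders for the two sets of networks that are updated in alternation.

```python
def growth_schedule(resolutions, total_epochs=20,
                    stabilize_epochs=1, transition_epochs=3):
    """Return one (resolution, stage) entry per training epoch.

    Assumption: at each growth step the current resolution is stabilized
    first, then the next resolution is faded in during the transition.
    """
    plan = []
    for res, next_res in zip(resolutions[:-1], resolutions[1:]):
        plan += [(res, "stabilize")] * stabilize_epochs
        plan += [(next_res, "transition")] * transition_epochs
    if len(plan) > total_epochs:
        raise ValueError("growth stages exceed the total epoch budget")
    # After growth completes, keep training at the final resolution.
    plan += [(resolutions[-1], "final")] * (total_epochs - len(plan))
    return plan


def update_order(n_rounds, first_group_iters=1, second_group_iters=4):
    """One iteration on the first group, then four on the second, repeated."""
    order = []
    for _ in range(n_rounds):
        order += ["first_group"] * first_group_iters
        order += ["second_group"] * second_group_iters
    return order
```

For example, with a hypothetical resolution list of three levels, two growth steps consume 8 epochs and the remaining 12 epochs are spent at the final resolution.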
We use Adam  as our optimization approach with no exponential decay rate for the first moment estimates and with the exponential decay rate for the second moment estimates . The learning rate is set to before the model grows to the final resolution and then is set to at . The trainable weights of the autoencoder and discriminator are initialized with the equalized learning rate technique from . We train and test all our models on 8 NVIDIA GeForce GTX 1080 Ti GPUs with 12GB of GPU memory each. Based on the memory available and the training performance, we set the batch size as , and the training lasts for 3 days.
F Baseline method details
Naïve α-blending. We split the output into 8 square tiles, where the end textures are copied as-is, and the intervening tiles (copies of the two boundaries) are linearly per-pixel α-blended.
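A minimal NumPy sketch of this naïve α-blending baseline is given below. It assumes square textures of equal size in height-width-channel layout, and it assumes that the blending weight ramps linearly per pixel column across the whole transition region; the function name naive_alpha_blend and the exact ramp endpoints are illustrative choices, not details confirmed by the text.

```python
import numpy as np


def naive_alpha_blend(left, right, n_tiles=8):
    """Blend two square textures across a horizontal strip of n_tiles tiles.

    left, right: float arrays of shape (S, S, C)
    returns:     float array of shape (S, n_tiles * S, C)
    """
    if left.shape != right.shape or left.shape[0] != left.shape[1]:
        raise ValueError("expected two square textures of identical shape")
    s, _, c = left.shape
    n_mid = n_tiles - 2
    out = np.empty((s, n_tiles * s, c), dtype=np.float64)
    # End tiles are copied as-is.
    out[:, :s] = left
    out[:, -s:] = right
    # Intervening tiles are copies of both end textures, blended per pixel.
    width = n_mid * s
    alpha = (np.arange(width) + 1) / (width + 1)  # ramp strictly inside (0, 1)
    left_rep = np.tile(left, (1, n_mid, 1))
    right_rep = np.tile(right, (1, n_mid, 1))
    a = alpha[None, :, None]
    out[:, s:-s] = (1.0 - a) * left_rep + a * right_rep
    return out
```

Because each output pixel is a weighted average of two unrelated texture pixels, this baseline is expected to show the ghosting artifacts discussed in the main paper.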
Image Melding. We selected Image Melding in its inpainting mode as a representative of patch-based methods. We use the default setting of the official public implementation (https://www.ece.ucsb.edu/~psen/melding).
AdaIN. Style transfer techniques can potentially be leveraged for the interpolation task by using random noise as the content image and texture sample as the style. We interpolate the neural features of the two source textures to vary the style from left to right. We consider AdaIN as one representative of this family of techniques, as it can run with arbitrary content and style images. However, with the default setting of the official implementation (https://github.com/xunhuang1995/AdaIN-style) and their pre-trained model, AdaIN has some systematic artifacts as it over-preserves the noise appearance. Because of this, we only show qualitative results in Figure 4 in the main paper, and in Figure 12 to Figure 16 and did not include this method in the quantitative evaluation.
WCT. WCT is an advancement over AdaIN with whitening and coloring transforms (WCT) as the stylization technique and works better on our data. We use its official public implementation (https://github.com/Yijunmaverick/UniversalStyleTransfer) with default setting and their pre-trained model. By design, this method does not guarantee accurate reconstruction of input samples.
DeepFill. Texture interpolation can be considered an instance of image hole-filling. The training code for the most recent work in this area is not released yet. We therefore tried another recent method called DeepFill with their official code (https://github.com/JiahuiYu/generative_inpainting). We re-trained it for our two texture datasets separately with input image size, hole size, and all the other default settings. The interpolation results suffered from two major problems: (i) the method is not designed for inpainting wide holes (in our experiment ) because of lack of such wide ground truth; (ii) even for a smaller hole with size , as shown in the failure cases in Figure 4 in the main paper and in Figure 12 to Figure 16, this work systematically failed to merge the two source textures gradually. We therefore excluded this method from our quantitative comparisons.
PSGAN. The most closely related work to ours, PSGAN, learns a smooth and complete neural manifold that favors interpolation. However, it only supports constraining the interpolation in latent space, and lacks a mechanism to specify end texture conditions using image examples. To allow for a comparison, we have trained a PSGAN model for each of our datasets separately, using the official code (https://github.com/zalandoresearch/psgan) and default settings. Then, we optimize for the latent code that corresponds to each of the end texture images using backpropagated gradients with L-BFGS-B. We use the gradients of the reconstruction loss and the Gram matrix loss and randomly initialize the latent vectors. We use different initializations and report the best result.
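The latent-code fitting step can be illustrated with a small SciPy sketch. It is not the PSGAN pipeline: the generator is an arbitrary callable supplied by the caller, the Gram matrix is computed directly on the generator output rather than on deep network features, gradients are left to SciPy's finite-difference approximation instead of backpropagation, and the names fit_latent, gram, n_restarts, and gram_weight are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def gram(features):
    """Normalized Gram matrix of a (C, N) feature array."""
    return features @ features.T / features.shape[1]


def fit_latent(generator, target, latent_dim,
               n_restarts=5, gram_weight=1.0, seed=0):
    """Search for a latent vector whose generated output matches `target`.

    generator: callable mapping a (latent_dim,) vector to a (C, N) array
    target:    (C, N) array to reproduce
    returns:   (best_latent, best_loss) over all random restarts
    """
    rng = np.random.default_rng(seed)
    target_gram = gram(target)

    def loss(z):
        out = generator(z)
        rec = np.mean((out - target) ** 2)                 # reconstruction loss
        style = np.mean((gram(out) - target_gram) ** 2)    # Gram matrix loss
        return rec + gram_weight * style

    best = None
    for _ in range(n_restarts):
        z0 = rng.standard_normal(latent_dim)               # random initialization
        res = minimize(loss, z0, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best.x, best.fun
```

Running several random restarts and keeping the lowest loss mirrors the "report the best result" strategy, since the combined objective is generally non-convex in the latent code.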
G More qualitative comparisons
H User study details
Our user study webpage design is shown in Figure 10. To guarantee the accuracy of users’ feedback, we insert a sanity check by additionally comparing our interpolation results with those of another naive baseline in which the transition regions are filled with constant pixel values. The constant value is computed as the mean value of the two end texture pixels, as shown in Figure 11. The preference should be obvious and deterministic without subjective variance. In our statistics, only two users made a mistake once on the sanity check questions. We then manually checked their answers to the other real questions but did not notice any robot-like or lazy response patterns. We therefore trust and accept all users’ feedback.