Learning Texture Manifolds with the Periodic Spatial GAN
This paper introduces a novel approach to texture synthesis based on generative adversarial networks (GANs). We extend the structure of the input noise distribution by constructing tensors with different types of dimensions. We call this technique the Periodic Spatial GAN (PSGAN).
The PSGAN has several novel abilities which surpass the current state of the art in texture synthesis. First, we can learn multiple textures from datasets of one or more complex large images. Second, we show that image generation with PSGANs has the properties of a texture manifold: we can smoothly interpolate between samples in the structured noise space and generate novel samples which lie perceptually between the textures of the original dataset. In addition, we can accurately learn periodic textures. We run multiple experiments which show that PSGANs can flexibly handle diverse texture and image data sources. Our method is highly scalable and can generate output images of arbitrarily large size.
1 Introduction

1.1 Textures and Texture Synthesis
Textures are important perceptual elements, both in the real world and in the visual arts. Many textures have random noise characteristics, formally defined as stationary, ergodic, stochastic processes. There are many natural image examples with such properties, e.g. rice randomly spread on the ground. However, more complex textures also exist in nature, e.g. those that exhibit periodicity, like a honeycomb or fish scales.
The goal of texture synthesis is to learn from a given example image a generating process which allows the creation of many images with similar properties. Classical texture synthesis methods include instance-based approaches, where pixels or patches of the source image are resampled and copied next to similar image regions, so that a seamless bigger texture image is obtained. Such methods have good visual quality and can deal with periodic images, but have a high runtime complexity when generating big images. In addition, since they do not learn an explicit model of images but just copy patches from the original pixels, they cannot be used to generate novel textures from multiple examples.
Parametric methods define an explicit model of a “good” texture by specifying some statistical properties; new texture images that are optimal w.r.t. the specified criteria are synthesized by optimization. The method of Portilla & Simoncelli yields good results in creating various textures, including periodic ones (the parametric statistics include phase variables of pre-specified periodicity). However, the run-time complexity is high, even for small output images. The authors also tried blending of textures, but the results were not satisfactory: patch-wise mixtures were obtained, rather than a new homogeneous texture that perceptually interpolates the originals.
More recently, deep learning methods were shown to be a powerful, fast and data-driven parametric approach to texture synthesis. The work of Gatys et al. is a milestone: they showed that filters from a discriminatively trained deep neural network can be used as effective parametric image descriptors. Texture synthesis is modeled as an optimization problem. Gatys et al. also showed an interesting application, painting a target content photo in the style of a given input image: “neural art style transfer”. Related works speed up texture synthesis and style transfer by approximating the optimization process with feed-forward convolutional networks (Johnson et al.; Ulyanov et al.).
However, the choice of descriptor in all of these related works – the Gram matrix of learned filters – is a specific prior on the textures the method can learn. It generalizes to many, but not all textures – e.g. periodic textures are reproduced inaccurately. Another limitation is that texture synthesis is performed from a single example image only, lacking the ability to represent and morph textures defined by several different images. In a related work, Dumoulin et al. explored the blending of multiple styles by parametrically mixing their statistical descriptors. The results are interesting in terms of image stylization, but the synthesis of novel blended textures has not been shown.
Purely data-driven generative models are an alternative deep learning approach to texture synthesis. Introduced by Goodfellow et al., generative adversarial networks (GANs) train a generator that learns a data distribution from example data, and a discriminator that attempts to distinguish generated from training data. The GAN architecture was further improved by Radford et al. by using deep convolutional layers with (fractional) stride. GANs have successfully created “natural” images of great perceptual quality that can fool even human observers. However, pixel resolution is usually low, and the output image size is pre-specified and fixed at training time.
For the texture synthesis use case, fully convolutional layers, which can scale to any image size, are advantageous. Li & Wand presented an interesting architecture that combines ideas from GANs and the pre-trained descriptor of Gatys et al. in order to generate small patches with the statistics of layer activations from the VGG network. This method allows fast texture synthesis and style transfer.
The Spatial GAN (SGAN) of Jetchev et al. applied, for the first time, fully unsupervised GANs to texture synthesis. SGANs have good scalability w.r.t. speed and memory, and showed excellent results on certain texture classes, surpassing the results of Gatys et al. However, some classes of textures cannot be handled, and no plausible texture morphing is possible.
The current contribution, PSGAN, makes a great step forward with respect to the types of images a neural texture synthesis method can create – both periodic and non-periodic images are learned in an unsupervised way from single images or large datasets of images. Afterwards, flexible sampling in the noise space allows the creation of novel textures of potentially infinite output size, and smooth transitions between them. Figure 1 shows a few example textures generated with a PSGAN. In the next section we describe the architecture of the PSGAN in detail, and then proceed to illustrate its abilities with a number of experiments.
2 Methods: Periodic GAN
In GANs, the generative model G maps a noise vector z ∈ R^d to the input data space. As in SGANs, we generalize the generator G to map a noise tensor Z ∈ R^{L×M×d} to an image X ∈ R^{H×W×3}, see Figure ?. The first two dimensions of Z, L and M, are spatial dimensions, and are blown up by the generator to the respective image spatial dimensions H > L and W > M. The final dimension of Z, d, is the channel dimension.
In analogy to the extension of the generator, we extend the discriminator D to map from an input image X to a two-dimensional field of spatial size L × M. Each position (λ, μ) of the resulting discriminator output, D_{λμ}(X), responds only to a local part X_{λμ} of the input, which we call D_{λμ}'s effective receptive field. The response of D_{λμ}(X) represents the estimated probability that the respective part of X is real instead of being generated by G.
As the discriminator outputs a field, we extend the standard GAN cost function to marginalize spatially:

min_G max_D V(D, G) = (1/LM) Σ_{λ=1..L} Σ_{μ=1..M} E_{Z ~ p_Z(Z)} [ log(1 − D_{λμ}(G(Z))) ]
                    + (1/LM) Σ_{λ=1..L} Σ_{μ=1..M} E_{X' ~ p_data(X')} [ log D_{λμ}(X') ]    (1)
This function is then minimized in G and maximized in D. Maximizing the first line of Equation 1 in D leads the discriminator to return values close to 0 (i.e. “fake”) for generated images – and, vice versa, minimization in G aims at the discriminator taking large output values close to 1 (i.e. “real”). On the other hand, maximizing D in the second line of Equation 1 anchors the discriminator on real data X' to return values close to 1. As we want the model to be able to learn from a single image, the input image data is augmented by selecting patches X' from the image(s) at random positions. To speed up convergence, in particular at the beginning of the learning process, we employ the standard GAN trick and substitute log(1 − D_{λμ}(G(Z))) with −log D_{λμ}(G(Z)).
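The spatially marginalized cost of Equation 1 can be sketched in a few lines of numpy (hypothetical function and argument names; `d_real` and `d_fake` are the discriminator's L × M probability fields for a real patch and a generated patch, and the generator term uses the non-saturating substitution described above):

```python
import numpy as np

def spatial_gan_loss(d_fake, d_real, eps=1e-8):
    """GAN cost marginalized over the discriminator's spatial output field.

    d_fake, d_real: arrays of shape (L, M) holding D's probability outputs
    for a generated and a real patch, respectively.
    """
    # Discriminator objective: push real patches towards 1, fakes towards 0.
    loss_d = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    # Generator objective with the standard non-saturating GAN trick:
    # minimize -log D(G(Z)) instead of log(1 - D(G(Z))).
    loss_g = -np.mean(np.log(d_fake + eps))
    return loss_d, loss_g
```

The mean over the field implements the 1/LM spatial marginalization; in a full implementation these losses would be averaged over the minibatch as well.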
We base the design of the generator network G and the discriminator network D on the DCGAN model of Radford et al. Empirically, choosing G and D to be symmetric in their architecture (i.e. depth and channel dimensions) turned out to stabilize the learning dynamics. In particular, we chose equal spatial sizes for the image patches X' and the generated data G(Z). As a deviation from this symmetry rule, we found that removing batch normalization in the discriminator yields better results, especially when training on single images.
In contrast to the DCGAN architecture, our model contains exclusively convolutional layers. Due to the convolutional weight sharing, a network trained on small image patches can thus be rolled out to synthesize arbitrarily large output images after training. Upon successful training, the sampled images then match the local image statistics of the training data; hence, the model implements a spatial stochastic process. Further, if the components of Z are sampled independently, the limited receptive fields of the generator imply that G implements a stationary, ergodic and strongly mixing stochastic process. This means that sampling of different textures is not possible – that would require a non-ergodic process. With independent sampling, learning from a set of textures results in the generation of textures combining elements of the whole set. Another limitation of independent sampling is the impossibility of aligning far-away regions in the generated image – alignment violates translation invariance, stationarity and mixing. However, periodic textures depend on such long-range correlations.
To overcome these limitations, we extend Z to be composed of three distinct parts: a local independent part Z^l, a spatially global part Z^g, and a periodic part Z^p. Each part has the same spatial dimensions L × M, but they may vary in their respective channel dimensions d_l, d_g and d_p. Let Z = [Z^l, Z^g, Z^p] be their concatenation, with total channel dimension d = d_l + d_g + d_p. We proceed with a discussion of Z's three parts.
2.1 Local Dimensions

Conceptually, the simplest approach is to sample each slice of Z^l at position (λ, μ), i.e. z^l_{λμ} ∈ R^{d_l}, independently from the uniform distribution U(−1, 1). As each z^l_{λμ} only affects a finite region in the image X, we speak of local dimensions. Intuitively, local dimensions allow the generative process to produce spatial variance and diversity by sampling from its statistical model.
2.2 Spatially Global Dimensions

For the global dimensions, a unique vector z^g of dimensionality d_g is sampled from U(−1, 1)^{d_g} and then repeated along both spatial dimensions of Z^g, i.e. Z^g_{λμi} = z^g_i, where λ ∈ {1, …, L}, μ ∈ {1, …, M} and i ∈ {1, …, d_g}. Thus, z^g has global impact on the whole image, and allows for the selection of the type of structure to be generated – employing global dimensions, the generative stochastic process becomes non-ergodic. Consider the task of learning from two texture images: the generator then only needs to “learn” a splitting of R^{d_g} into two half-spaces (e.g. by learning a hyperplane), where vectors from each half-space generate samples in the style of one of the two textures.
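The construction of the local and global noise parts can be sketched as follows (a simplified numpy illustration with hypothetical names; the periodic part is described in Section 2.3):

```python
import numpy as np

def sample_noise(L, M, d_l, d_g, rng=None):
    """Build the local + global parts of the PSGAN noise tensor.

    Local channels are i.i.d. uniform per spatial position; global channels
    are one shared vector broadcast over all L x M positions.
    """
    if rng is None:
        rng = np.random.default_rng()
    z_local = rng.uniform(-1.0, 1.0, size=(L, M, d_l))   # varies per position
    z_g = rng.uniform(-1.0, 1.0, size=(d_g,))            # one global vector
    z_global = np.broadcast_to(z_g, (L, M, d_g))         # repeated spatially
    return np.concatenate([z_local, z_global], axis=-1)  # shape (L, M, d_l+d_g)
```

Sampling a new `z_g` selects a different texture, while re-sampling only the local part yields new variations of the same texture.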
Besides the scenario of learning from a set of texture images, the combination with random patch selection from a larger image (see Section 2) is particularly interesting: here, the converged generator samples textures that are consistent with the local statistics of the image. Notably, the source image does not necessarily have to be a texture – the method will nevertheless extract a texture-generating stochastic process from it (see Figure ?).
After learning, each vector z^g represents a texture from the manifold of learned textures of the PSGAN, where z^g corresponds to the generating stochastic process of a texture, not just a static image. For the purpose of image generation, Z^g does not need to be composed of a single repeated vector, but can be a smooth function of λ and μ. As long as neighboring vectors in Z^g do not vary too rapidly, the local statistics of Z^g are close to the statistics during training. Hence, smoothness in Z^g implies a smooth texture change in X (see Figure 3).
2.3 Spatially Periodic Dimensions
The third part of Z, Z^p, contains spatially periodic functions, or plane waves, in each channel i:

Z^p_{λμi} = sin(k^i_1 λ + k^i_2 μ + φ_i),    (2)

where i ∈ {1, …, d_p}, λ ∈ {1, …, L}, μ ∈ {1, …, M}, and K = (k^1, …, k^{d_p}) ∈ R^{2×d_p} is a matrix which contains the wave vectors k^i = (k^i_1, k^i_2)^T as its column vectors. These vectors parametrize the direction and the number of radians per spatial unit distance in the periodic channel i. φ_i is a random phase offset uniformly sampled from [0, 2π), and mimics the random positional extraction of patches from the real images. Adding this periodic global tensor breaks translation invariance and stationarity of the generating process. However, the process is still cyclostationary.
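The plane-wave channels described above can be generated with a few lines of numpy (a sketch with hypothetical names; `K` holds the wave vectors as columns, `phases` the random offsets):

```python
import numpy as np

def periodic_noise(L, M, K, phases):
    """Plane-wave channels: Z^p[lam, mu, i] = sin(k1_i*lam + k2_i*mu + phi_i).

    K: array of shape (2, d_p) with wave vectors as columns.
    phases: array of shape (d_p,) with random phase offsets.
    Returns an array of shape (L, M, d_p), one plane wave per channel.
    """
    lam, mu = np.meshgrid(np.arange(L), np.arange(M), indexing="ij")
    # Broadcasting turns the (L, M) coordinate grids into (L, M, d_p) waves.
    return np.sin(lam[..., None] * K[0] + mu[..., None] * K[1] + phases)
```

In training, `phases` would be re-sampled from [0, 2π) for every minibatch example, mimicking random patch positions.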
While the wave numbers K could be set to a fixed basis, we note that a specific texture has associated wave vectors, i.e. different textures will have different axes of periodicity and different scales. Hence, when more than one texture is learned, we make K dependent on the global dimensions z^g through a multi-layer perceptron (MLP). When only one texture is learned, i.e. d_g = 0, the wave numbers are direct parameters of the system. In Figure ?, we indicate this alternative dependency on z^g with a dotted arrow between the MLP and Z^p. All parameters of the MLP are learned end-to-end alongside the parameters of the generator G and the discriminator D.
3 Experiments

3.1 Experimental Setup

We base our system on the DCGAN architecture with a stride of 1/2 for the generator and 2 for the discriminator. Local and global noise dimensions are sampled from a uniform distribution. As in DCGAN, the filters have 64 channels at the highest spatial resolution, and the channel count is doubled after every layer, which halves the spatial resolution. E.g. the 4-layer architecture has 512–256–128–64 channels between the noise input and the output RGB image. Training was done with ADAM with the settings of Radford et al. – learning rate 0.0002, minibatch size of 25. The typical image patch size was 160x160 pixels. We usually used 5 layers in G and D (see Table 1), kernels of size 5x5 with zero padding, and batch normalization. Such a generator upsamples the spatial noise by a factor of 2^5 = 32 and has a receptive field size of 125. Receptive field and image patch size can both affect learning. On our hardware (Theano and an Nvidia Tesla K80 GPU), generating a 256x256 pixel image takes a fraction of a second, and runtime scales linearly with the number of output pixels for larger images such as 2048x2048 pixels.
The MLP for the spatially periodic dimensions has one hidden layer of dimensionality d_h:

K = W_2 relu(W_1 z^g + b_1) + b_2,    (3)

where relu is the point-wise rectified-linear unit function, and we have W_1 ∈ R^{d_h×d_g}, b_1 ∈ R^{d_h}, W_2 ∈ R^{2d_p×d_h} and b_2 ∈ R^{2d_p}; the output is reshaped into the 2 × d_p wave-number matrix K. The hidden dimensionality d_h was kept fixed for all experiments. All parameters are initialized from an independent random Gaussian distribution with zero mean, except the bias b_2, which has a non-zero mean: its mean vector is chosen with entries spread over the valid wave-number interval, so that the periodic channels are initialized with diverse frequencies.
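The wave-number MLP can be sketched in numpy as follows (hypothetical names; one hidden ReLU layer mapping the global vector z^g to a 2 × d_p matrix of wave vectors):

```python
import numpy as np

def wave_numbers(z_g, W1, b1, W2, b2):
    """One-hidden-layer MLP producing wave vectors from the global vector.

    z_g: global vector, shape (d_g,)
    W1: (d_h, d_g), b1: (d_h,)      -- hidden layer
    W2: (2*d_p, d_h), b2: (2*d_p,)  -- output layer
    Returns a (2, d_p) matrix whose columns are the wave vectors.
    """
    h = np.maximum(0.0, W1 @ z_g + b1)   # hidden layer with ReLU
    k = W2 @ h + b2                      # 2*d_p raw wave numbers
    return k.reshape(2, -1)              # columns = wave vectors
```

In the full model these parameters are trained end-to-end together with G and D, so the gradient flows from the image back through the plane waves into the MLP.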
The following image sources were used for the experiments in this paper: the Oxford Describable Textures Dataset (DTD), which is composed of 47 categories, each containing 120 images, and the Facades dataset, which contains 500 facades of different houses in Prague. Both datasets comprise objects of different scales and sizes. We also used satellite images of Sydney from Google Maps. The P6 and Merrigum house images are from Wikimedia Commons.
3.2 Learning and Sampling Textures
What are the criteria for good texture synthesis? The way humans perceive a texture is not easily quantifiable with a statistic or metric. Still, one can qualitatively assess whether a texture synthesis method captures the right properties of a source image. In order to illustrate this, we will demonstrate how we can learn complex periodic images and texture manifolds, which allow texture blending.
First, we demonstrate learning a single periodic texture image. Figure 2 illustrates the results of PSGAN compared with SGAN and the methods of Gatys et al., Portilla & Simoncelli, and Efros & Freeman. The text example in the top row has a periodic and a stochastic dimension. The PSGAN learns this and arranges “text” in regular lines while varying their content horizontally. The methods of Portilla & Simoncelli and Efros & Freeman also manage to do this. SGAN (equivalent to a PSGAN without periodic dimensions) and Gatys' method fail to capture the periodic structure.
The second row in Figure 2 demonstrates learning a honeycomb texture – a basic hexagonal pattern – where our method captures both the underlying periodicity and the random coloring effects inside the cells. The patch-based method of Efros & Freeman was inaccurate for that texture – the borders between the copied patches (60x60 pixels large) were inaccurately aligned. The other 3 methods fail to produce a hexagonal structure even locally. The last row of the figure shows the autocorrelation plots of the honeycomb textures, where the periodicity reveals itself as a regular grid superimposed on the background – a feature only PSGAN is able to reproduce.
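Such autocorrelation maps can be computed efficiently via the FFT (Wiener-Khinchin theorem); a minimal numpy sketch, in which a periodic texture shows up as a regular grid of peaks:

```python
import numpy as np

def autocorrelation(img):
    """Normalized circular 2-D autocorrelation of a grayscale image.

    Computed via FFT: ifft(|fft(x)|^2) equals the circular autocorrelation
    of the mean-subtracted image. The map is shifted so the zero-lag peak
    (value 1) sits at the center; periodicity appears as a grid of peaks.
    """
    x = img - img.mean()
    f = np.fft.fft2(x)
    ac = np.fft.ifft2(f * np.conj(f)).real
    return np.fft.fftshift(ac / ac.flat[0])  # normalize by zero-lag energy
```

For a texture with period p along one axis, the map has secondary peaks of height close to 1 at lags that are multiples of p.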
While a small number of periodic dimensions is enough to learn the above patterns, we noticed that training convergence is faster with a larger d_p. However, in that case beating of sinusoids with close wave numbers can occur – which rarely happens even for small d_p, due to sub-Nyquist artefacts (Amidror), i.e. when the texture periodicity is close to an integer fraction of the Nyquist wave number.
Figure ? shows a larger slice of the learned periodic textures. In particular, Figure ?B shows that learning also works for more complex patterns – here a pattern with P6 wallpaper group symmetry.
Next, we extract multiple textures from a single large image, or from a set of images. The chosen images (e.g. landscape photography or satellite images) have a global structure, but also exhibit the characteristics of many textures in a single image (e.g. various vegetation and houses). The structured PSGAN generator noise with global dimensions allows the extraction of textures corresponding to different image regions.
In order to visualize the texture diversity of a model, we define a quilt array Z^g that can generate different textures from a trained PSGAN model: rectangular spatial regions (tiles) of Z^g are each set to a single vector z^g, randomly sampled from the prior. Since the generator is a convolutional network with receptive fields spanning several spatial elements of Z, the borders between tiles look partially aligned. For example, in Figure 1 the borders of the tiles have scaly elements across them, rather than being sharply separated (as the input Z^g is, per construction).
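The quilt construction described above can be sketched as follows (a numpy illustration with hypothetical names):

```python
import numpy as np

def quilt_global(L, M, tile, d_g, rng=None):
    """Quilt array: tile x tile blocks of the L x M global-noise field each
    get their own randomly sampled global vector z^g (one texture per tile)."""
    if rng is None:
        rng = np.random.default_rng()
    n_rows = -(-L // tile)   # ceil division: number of tile rows
    n_cols = -(-M // tile)   # number of tile columns
    tiles = rng.uniform(-1.0, 1.0, size=(n_rows, n_cols, d_g))
    # Expand each tile entry to a tile x tile spatial block, then crop to L x M.
    z = np.repeat(np.repeat(tiles, tile, axis=0), tile, axis=1)
    return z[:L, :M]
```

Feeding this array as the Z^g part of the noise tensor makes the generator render a different texture per tile, with partially aligned borders as discussed above.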
Figure ? shows results when trained on a single large image. PSGAN extracts diverse bricks, grass and leaf textures. In contrast, SGAN forces the output to be a single mixing process, rather than a multitude of different visual textures. Gatys’ method also learns a single texture-like process with statistics from the whole image.
Figure ?A shows texture learning from city satellite images, a challenging image domain due to fine details of the images. Figures ?B and C show results from training on a set of multiple texture-like images from DTD.
In order to show that textures vary smoothly, we sample 4 different z^g values for the four corners of a target image and then interpolate bi-linearly between them to construct the tensor Z^g. Figure 3 shows that all z^g values lying between the original 4 corner points generate proper textures as well. Hence, we speak of a learned texture manifold.
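The bilinear corner interpolation can be sketched as follows (hypothetical names; corners ordered top-left, top-right, bottom-left, bottom-right):

```python
import numpy as np

def corner_interpolation(L, M, corners):
    """Bilinearly interpolate four global vectors over an L x M noise field.

    corners: sequence of four arrays of shape (d_g,), in the order
    top-left, top-right, bottom-left, bottom-right.
    Returns an array of shape (L, M, d_g).
    """
    tl, tr, bl, br = [np.asarray(c, dtype=float) for c in corners]
    a = np.linspace(0.0, 1.0, L)[:, None, None]   # vertical weight
    b = np.linspace(0.0, 1.0, M)[None, :, None]   # horizontal weight
    top = (1 - b) * tl + b * tr                   # interpolate top edge
    bottom = (1 - b) * bl + b * br                # interpolate bottom edge
    return (1 - a) * top + a * bottom             # blend vertically
```

Passing the result as Z^g yields an image whose texture morphs smoothly between the four corner textures.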
3.3 Disentangling Frequencies and Global Dimensions
In this section, we explore how Z^g and Z^p – the global and periodic dimensions – influence the output X generated from the noise tensor Z. Take a Z^g array with quilt structure. We define a second array of the same size as Z^g, where all spatial positions are set to one and the same vector z^g. We then calculate two different periodic tensors: a first tensor Z^p, with wave numbers varying as a function of the different quilt elements of Z^g, and a second tensor with the same wave numbers everywhere.
The PSGAN is trained with minibatches in which the wave numbers of Z^p are always a function of the corresponding global dimensions, but the model is flexible and produces meaningful outputs even when Z^g and Z^p are set to mismatched values. Figure ? shows that the global and periodic dimensions encode complementary aspects of the image generation process: texture identity and periodicity. The Facades dataset has a strong vertical and horizontal periodicity which is easily interpretable – the height of the floors and the window placement depend directly on these frequencies.
This disentangling leads to instructive visualizations. Figure 4 shows the generation from a tensor Z whose global part is constructed as a linear interpolation between two vectors z^g sampled at the left and right borders, while the wave numbers of the periodic dimensions are held fixed, independently of the changing global dimensions. The figure clearly shows a change in the visual appearance of the texture (controlled by the global dimensions), while a consistent periodic structure is preserved (fixed by the constant wave numbers). This disentangling property of the PSGAN is reminiscent of the way InfoGAN (Chen et al.) constructs categorical and continuous noise variables which explain factors of variation such as object identity and spatial transformation.
Texture synthesis from large unlabeled image datasets requires novel data-driven methods, going beyond older techniques that learn from single textures and rely on pre-specified statistical descriptors. Previous methods like SGAN are limited to stationary, ergodic and stochastic textures – even if trained on many images, SGAN fuses them and outputs a single mixing process for them. Our experiments suggest that Gatys’ method exhibits similar limitations. In contrast, PSGAN models non-ergodic cyclostationary processes, and can learn a whole texture manifold from sets of images, or from a single large image.
CGANs (Mirza & Osindero) use additional label information as input to the GAN generator and discriminator, which allows for class-conditional generation. In comparison, the PSGAN also uses additional information in the generator input (the specifically designed periodic dimensions Z^p), but not in the discriminator. Our method remains fully unsupervised and uses only sampled noise, unlike CGANs, which require specific label information.
Concerning the model architecture, the SGAN model of Jetchev et al. is similar – it can be seen as an ablated PSGAN instance with d_g = 0 and d_p = 0. This architecture allows great scalability of the PSGAN when generating outputs: linear memory and runtime complexity w.r.t. the number of output image pixels. High-resolution images can be created by splitting the Z array into parts and rendering them sequentially, thus keeping a constant GPU memory footprint. Another useful property of our architecture is the ability to seamlessly stitch output texture images and obtain tileable textures, potentially increasing the output image size even further.
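The constant-memory chunked rendering can be sketched as follows (a simplified numpy illustration with a stand-in generator; a real fully convolutional generator would additionally need overlapping chunks to account for its receptive field at the seams):

```python
import numpy as np

def render_in_chunks(z, generator, chunk=8):
    """Render a large noise field in column chunks to bound peak memory.

    z: noise field of shape (L, M, d); generator: a fully convolutional
    function mapping a noise slice to an image slice. Chunks are generated
    sequentially and stitched along the horizontal axis.
    """
    parts = []
    for start in range(0, z.shape[1], chunk):
        parts.append(generator(z[:, start:start + chunk]))
    return np.concatenate(parts, axis=1)
```

With a purely local generator (as in this toy setting) the chunked result is identical to rendering the whole field at once; only the peak memory differs.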
To summarize, these are the key abilities of the PSGAN:
learn textures of great variability from large images
learn periodic textures
learn whole manifolds of textures and smoothly blend between their elements, thus creating novel textures
generate images of any desired size with a fast forward pass of a convolutional neural network
linear scalability in memory and speed w.r.t. output image size.
Our method has a few limitations: convergence can sometimes be tricky, as noted for other GAN models; like other GANs, the PSGAN can suffer from “mode dropping” – given a large set of textures, it may learn only some of them, especially if the data varies in scale and periodicity. Finally, PSGANs can represent arbitrary probability distributions that extend spatially up to the largest periods in Z^p, and can generalize to periodic structures beyond that size. However, images that have even larger structures or more general non-periodic features are not representable: e.g. images with a global trend or a perspective projection, or aperiodic images like Penrose tilings.
The PSGAN has great potential to be adapted to further use cases. In-painting is a possible application – our method can fill random missing image regions with fitting textures. Texture style transfer – painting a target image with textures – can be done similarly to the way the quilts in this paper were constructed. Further, explicit modeling with periodic dimensions in the PSGAN could be a great fit for other modalities, in particular time-series and audio data. Here, we would expect the model to extract “sound textures”, which might be useful for synthesizing completely novel sounds by interpolating on the manifold.
On the theoretical side, to capture more symmetries of texture images, one could extend the tensor Z even further by adding dimensions with reflection or rotation symmetries. In terms of model stability and convergence, we will investigate alternative GAN training criteria, e.g. unrolled GANs (Metz et al.) or Wasserstein GANs (Arjovsky et al.), which may alleviate the mode dropping problem.
We would like to thank Christian Bracher for his valuable feedback on the manuscript.
- Our source code is available at https://github.com/zalandoresearch/psgan
- Ideally, the wave numbers k^i_1 and k^i_2, with i ∈ {1, …, d_p}, should lie within the valid interval between the negative and positive Nyquist wave numbers (here ±π). However, wave numbers of single sinusoids are effectively projected back into this interval. Hence, no constraint is necessary.
- en.wikipedia.org/wiki/Wallpaper_group. Note that only translational symmetries are represented in PSGANs, no rotation and reflection symmetries.
- As a technical note, the whole image did not fit in memory for Gatys’ method, so we trained it only on a 1920x1920 clip-out.
References

- Amidror, Isaac. Sub-Nyquist artefacts and sampling moiré effects. Royal Society Open Science.
- Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. 2017.
- Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR.
- Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
- Dumoulin, Vincent, Shlens, Jonathon, and Kudlur, Manjunath. A learned representation for artistic style. CoRR.
- Efros, Alexei A. and Freeman, William T. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, 2001.
- Efros, Alexei A. and Leung, Thomas K. Texture synthesis by non-parametric sampling. In Proceedings of the International Conference on Computer Vision, 1999.
- Gatys, Leon, Ecker, Alexander, and Bethge, Matthias. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, 2015a.
- Gatys, Leon A., Ecker, Alexander S., and Bethge, Matthias. A neural algorithm of artistic style. CoRR.
- Georgiadis, G., Chiuso, A., and Soatto, S. Texture compression. In Data Compression Conference, March 2013.
- Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, 2014.
- Jetchev, Nikolay, Bergmann, Urs, and Vollgraf, Roland. Texture synthesis with spatial generative adversarial networks. CoRR.
- Johnson, Justin, Alahi, Alexandre, and Fei-Fei, Li. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
- Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR.
- Li, Chuan and Wand, Michael. Precomputed real-time texture synthesis with Markovian generative adversarial networks. CoRR.
- Metz, Luke, Poole, Ben, Pfau, David, and Sohl-Dickstein, Jascha. Unrolled generative adversarial networks. CoRR.
- Mirza, Mehdi and Osindero, Simon. Conditional generative adversarial nets. CoRR.
- Portilla, Javier and Simoncelli, Eero P. A parametric texture model based on joint statistics of complex wavelet coefficients. Int. J. Comput. Vision.
- Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR.
- Tyleček, Radim and Šára, Radim. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrücken, Germany, 2013.
- Ulyanov, Dmitry, Lebedev, Vadim, Vedaldi, Andrea, and Lempitsky, Victor. Texture networks: Feed-forward synthesis of textures and stylized images. In International Conference on Machine Learning, 2016.