Texture Synthesis Using Shallow Convolutional Networks with Random Filters
Abstract
Here we demonstrate that the feature space of random shallow convolutional neural networks (CNNs) can serve as a surprisingly good model of natural textures. Patches from the same texture are consistently classified as being more similar than patches from different textures. Samples synthesized from the model capture spatial correlations on scales much larger than the receptive field size, and sometimes even rival or surpass the perceptual quality of state-of-the-art texture models (but show less variability). The current state of the art in parametric texture synthesis relies on the multi-layer feature space of deep CNNs that were trained on natural images (gatys:2015a). Our finding suggests that such optimized multi-layer feature spaces are not imperative for texture modeling. Instead, much simpler shallow convolutional networks can serve as the basis for novel texture synthesis algorithms.
Ivan Ustyuzhaninov^{*,1,2,3}, Wieland Brendel^{*,1,2}, Leon Gatys^{1,2,3}, Matthias Bethge^{1,2,3,4} ^{*}contributed equally ^{1}Centre for Integrative Neuroscience, University of Tübingen, Germany ^{2}Bernstein Center for Computational Neuroscience, Tübingen, Germany ^{3}Graduate School of Neural Information Processing, University of Tübingen, Germany ^{4}Max Planck Institute for Biological Cybernetics, Tübingen, Germany first.last@bethgelab.org
1 Introduction
The aim of visual texture synthesis is to define a generative process that, from a given example texture, can generate arbitrarily many new samples of the same texture. Among the class of such algorithms, parametric texture models aim to uniquely describe each texture by a set of statistical measurements that are taken over the spatial extent of the image. Each image with the same spatial summary statistics should be perceived as the same texture. Consequently, synthesizing a texture corresponds to finding a new image that reproduces the summary statistics of the reference texture. Starting from the $N$th-order joint histograms of the pixels by Julesz (julesz1962), many different statistical measures have been proposed (see e.g. Heeger1995; Portilla:2000). The quality of the synthesized textures is usually determined by human inspection; the synthesis is successful if a human observer cannot tell the reference texture from the synthesized ones.
The current state of the art in parametric texture modeling (gatys:2015a) employs the hierarchical image representation in a deep 19-layer convolutional network (in the following referred to as VGG network) (VGG:2014) that was trained on object recognition in natural images. In this model textures are described by the raw correlations between feature activations in response to the texture image from a collection of network layers (see section 5 for details). Two aspects of the model seemed critical for texture synthesis: the hierarchical multi-layer representation of the textures, and the supervised training of the feature spaces. Here we show that neither aspect is imperative for texture modeling and that in fact a single convolutional layer with random features can often synthesize textures that rival, and sometimes even surpass, the perceptual quality of Gatys et al. (gatys:2015a). This is in contrast to Gatys et al. (gatys:2015a), who reported that networks with random weights fail to generate perceptually interesting images. We suggest that this discrepancy originates from a more elaborate tuning of the optimization procedure (see section 4).
2 Convolutional Neural Network
All our models employ single-layer CNNs with standard rectified linear units (ReLUs) and convolutions with stride one, no bias and padding $(f-1)/2$, where $f$ is the filter size. This choice ensures that the spatial dimension of the output feature maps is the same as that of the input. All networks except the last one employ filters of size $11 \times 11 \times 3$ (filter width $\times$ filter height $\times$ no. of input channels), but the number of feature maps as well as the selection of the filters differ:

Fourier363: Each color channel (R, G, B) is filtered separately by each element $B^{i}$ of the 2D Fourier basis ($11 \times 11 = 121$ feature maps/channel), yielding $3 \cdot 121 = 363$ feature maps in total. More concretely, each filter can be described as the tensor product $B^{i} \otimes e_{k}$, where the elements of the unit-norm vector $e_{k} \in \mathbb{R}^{3}$ are all zero except one.

Fourier3267: All color channels (R, G, B) are filtered simultaneously by each element $B^{i}$ of the 2D Fourier basis but with different weighting terms $w_R, w_G, w_B \in \{1, 0, -1\}$, yielding $3 \cdot 3 \cdot 3 \cdot 121 = 3267$ feature maps in total. More concretely, each filter can be described by the tensor product $B^{i} \otimes [w_R, w_G, w_B]$.

Kmeans363: We randomly sample and whiten 1e7 patches of size $11 \times 11 \times 3$ from the ImageNet dataset ILSVRC15, partition the patches into 363 clusters using k-means (rubinstein:2009), and use the cluster means as convolutional filters.

Kmeans3267: Same as Kmeans363 but with 3267 clusters.

KmeansNonWhite363/3267: Same as Kmeans363/3267 but without whitening of the patches.

KmeansSample363/3267: Same as Kmeans363/3267, but patches are only sampled from the target texture.

PCA363: We randomly sample 1e7 patches of size $11 \times 11 \times 3$ from the ImageNet dataset ILSVRC15, vectorize each patch, perform PCA and use the set of principal axes as convolutional filters.

Random363: Filters are drawn from a uniform distribution according to (glorot:2010), 363 feature maps in total.

Random3267: Same as Random363 but with 3267 feature maps.

RandomMultiscale: Eight different filter sizes, with 128 feature maps each (1024 feature maps in total). Filters are drawn from a uniform distribution according to (glorot:2010).
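As an illustration of how a data-driven filter bank such as PCA363 could be obtained, the following is a minimal NumPy sketch, not the authors' implementation. The helper names `sample_patches` and `pca_filters` are hypothetical, the images are assumed to be arrays of shape (height, width, channels), and the output layout (filters, channels, height, width) is an assumption made for convenience.

```python
import numpy as np

def sample_patches(images, n_patches, f, rng):
    """Draw n_patches random f x f patches (all channels) and vectorize them."""
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        h = rng.integers(img.shape[0] - f + 1)
        w = rng.integers(img.shape[1] - f + 1)
        patches.append(img[h:h + f, w:w + f, :].ravel())
    return np.asarray(patches, dtype=np.float64)

def pca_filters(patches, f, channels):
    """Return the principal axes of the patches as a bank of filters."""
    centered = patches - patches.mean(axis=0)
    # Rows of vt are the principal axes, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Assumed layout: (n_filters, channels, f, f).
    return vt.reshape(-1, f, f, channels).transpose(0, 3, 1, 2)
```

With $11 \times 11 \times 3$ patches and at least 363 samples, this yields 363 filters, one per principal axis.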
The networks were implemented in Lasagne (lasagne; theano:2016). We remove the DC component of the inputs by subtracting the mean intensity in each color channel (estimated over the Imagenet dataset ILSVRC15).
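The single-layer network described in this section can be sketched in plain NumPy as follows. This is a hedged illustration rather than the Lasagne code used in the paper: the function names are hypothetical, the fan-in/fan-out convention for the uniform initialization of (glorot:2010) is an assumption, and the layer computes a cross-correlation, which for random filters is equivalent to a convolution up to a flip of the filters.

```python
import numpy as np

def glorot_uniform_filters(k, channels, f, rng):
    """Sample k filters of size channels x f x f from a zero-mean uniform range."""
    fan_in = channels * f * f   # assumed fan convention for convolutional filters
    fan_out = k * f * f
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(k, channels, f, f))

def conv_relu(x, filters, channel_mean=None):
    """Stride-one, bias-free layer with padding (f - 1) / 2 and ReLU.

    x has shape (C, H, W); the result has shape (K, H, W).
    """
    if channel_mean is not None:
        # Remove the DC component per color channel.
        x = x - np.asarray(channel_mean)[:, None, None]
    k, _, f, _ = filters.shape
    p = (f - 1) // 2
    _, h, w = x.shape
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((k, h, w))
    for i in range(f):
        for j in range(f):
            # (K, C) weights at offset (i, j) applied to the shifted image.
            out += np.tensordot(filters[:, :, i, j],
                                xp[:, i:i + h, j:j + w], axes=(1, 0))
    return np.maximum(out, 0.0)
```

Because the padding is $(f-1)/2$ for odd $f$, each of the $K$ feature maps has the same height and width as the input.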
3 Texture Model
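The texture model closely follows (gatys:2015a). In essence, to characterise a given vectorised texture $\mathbf{x}$ with $M$ pixels, we first pass $\mathbf{x}$ through the convolutional layer and compute the output activations. The output can be understood as a nonlinear filter bank, and thus its activations form a set of filtered images (so-called feature maps). For $K$ distinct feature maps, the rectified output activations can be described by a matrix $F \in \mathbb{R}^{K \times M}$. To capture the stationary structure of the textures, we compute the covariances (or, more precisely, the Gramian matrix) $G \in \mathbb{R}^{K \times K}$ between the feature activations by averaging the outer product of the pointwise feature vectors,
(1) $G_{k\ell} = \frac{1}{M} \sum_{m=1}^{M} F_{km} F_{\ell m}$
We will denote $G(\mathbf{x})$ as the Gram matrix of the feature activations for the input $\mathbf{x}$. To determine the relative distance between two textures $\mathbf{x}$ and $\mathbf{y}$ we compute the Euclidean distance of the normalized Gram matrices,
(2) $d(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k,\ell} \left( G_{k\ell}(\mathbf{x}) - G_{k\ell}(\mathbf{y}) \right)^2}{\sqrt{\sum_{m,n} G_{mn}(\mathbf{x})^2} \, \sqrt{\sum_{m,n} G_{mn}(\mathbf{y})^2}}$
To compare with the distance in the raw pixel values, we compute
(3) $d_p(\mathbf{x}, \mathbf{y}) = \frac{\sum_{m} \left( x_m - y_m \right)^2}{\sqrt{\sum_{m} x_m^2} \, \sqrt{\sum_{m} y_m^2}}$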
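The quantities of this section can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the Gram matrix is the spatial average of outer products of the feature vectors, and both distances are assumed to be squared differences normalized by the product of the two Euclidean norms. The function names are hypothetical.

```python
import numpy as np

def gram_matrix(features):
    """features: (K, M) rectified activations; returns the (K, K) Gram matrix."""
    return features @ features.T / features.shape[1]

def gram_distance(g_x, g_y):
    """Relative squared distance between two Gram matrices (assumed form)."""
    norm = np.sqrt(np.sum(g_x ** 2)) * np.sqrt(np.sum(g_y ** 2))
    return np.sum((g_x - g_y) ** 2) / norm

def pixel_distance(x, y):
    """Same relative squared distance, computed on raw pixel values."""
    norm = np.sqrt(np.sum(x ** 2)) * np.sqrt(np.sum(y ** 2))
    return np.sum((x - y) ** 2) / norm
```

With this normalization the distance does not change when both inputs are rescaled by a common factor, which makes values comparable across models with different activation magnitudes.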
4 Texture Synthesis
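To generate a new texture we start from a uniform noise image (in the range [0, 1]) and iteratively optimize it to match the Gram matrix of the reference texture. More precisely, let $\hat{G}$ be the Gram matrix of the reference texture. The goal is to find a synthesised image $\tilde{\mathbf{x}}$ such that the squared distance between $\hat{G}$ and the Gram matrix $G(\mathbf{x})$ of the synthesized image is minimized, i.e.
(4) $\tilde{\mathbf{x}} = \operatorname{argmin}_{\mathbf{x}} E(\mathbf{x})$
(5) $E(\mathbf{x}) = \frac{1}{\sum_{k,\ell} \hat{G}_{k\ell}^2} \sum_{k,\ell} \left( \hat{G}_{k\ell} - G_{k\ell}(\mathbf{x}) \right)^2$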
The gradient of the reconstruction error with respect to the image can readily be computed using standard backpropagation, which we then use in conjunction with the L-BFGS-B algorithm (scipy:2001) to solve (4). We leave all parameters of the optimization algorithm at their default values except for the maximum number of iterations (2000), and add box constraints with range [0, 1]. In addition, we scale the loss and the gradients by a constant factor in order to avoid early stopping of the optimization algorithm (which might have caused the negative results for random networks in gatys:2015a).
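The synthesis loop can be sketched with SciPy's L-BFGS-B as follows. This is a minimal illustration, not the authors' code: the gradient is derived by hand for a single ReLU layer, the loss is assumed to be the squared Gram difference normalized by the squared norm of the reference Gram matrix, and all function names are hypothetical. The `scale` argument stands in for the scaling factor mentioned above; its value is not given here, so the default of 1.0 is only a placeholder.

```python
import numpy as np
from scipy.optimize import minimize

def im2col(x, f):
    """Collect all f x f patches of the zero-padded image x (C, H, W)."""
    c, h, w = x.shape
    p = (f - 1) // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    cols = np.empty((c, f, f, h, w))
    for i in range(f):
        for j in range(f):
            cols[:, i, j] = xp[:, i:i + h, j:j + w]
    return cols.reshape(c * f * f, h * w)

def col2im(dcols, shape, f):
    """Adjoint of im2col: scatter patch gradients back onto the image."""
    c, h, w = shape
    p = (f - 1) // 2
    dxp = np.zeros((c, h + 2 * p, w + 2 * p))
    dcols = dcols.reshape(c, f, f, h, w)
    for i in range(f):
        for j in range(f):
            dxp[:, i:i + h, j:j + w] += dcols[:, i, j]
    return dxp[:, p:p + h, p:p + w]

def gram(x, w_mat, f):
    """Return Gram matrix, rectified features and pre-activations."""
    a = w_mat @ im2col(x, f)            # (K, M) pre-activations
    feat = np.maximum(a, 0.0)
    return feat @ feat.T / feat.shape[1], feat, a

def loss_and_grad(x_flat, shape, w_mat, f, g_ref, scale):
    """Scaled relative Gram error and its gradient w.r.t. the image."""
    g, feat, a = gram(x_flat.reshape(shape), w_mat, f)
    norm = np.sum(g_ref ** 2)
    diff = g - g_ref
    loss = np.sum(diff ** 2) / norm
    d_feat = 4.0 * (diff @ feat) / (feat.shape[1] * norm)
    d_a = d_feat * (a > 0)              # backpropagate through the ReLU
    dx = col2im(w_mat.T @ d_a, shape, f)
    return scale * loss, scale * dx.ravel()

def synthesize(x_ref, w_mat, f, max_iter=2000, scale=1.0, seed=0):
    """Start from uniform noise and match the reference Gram matrix."""
    g_ref = gram(x_ref, w_mat, f)[0]
    x0 = np.random.default_rng(seed).uniform(0, 1, x_ref.size)
    res = minimize(loss_and_grad, x0,
                   args=(x_ref.shape, w_mat, f, g_ref, scale),
                   jac=True, method="L-BFGS-B",
                   bounds=[(0.0, 1.0)] * x_ref.size,
                   options={"maxiter": max_iter})
    return res.x.reshape(x_ref.shape), res.fun / scale
```

Here `w_mat` is a filter bank flattened to shape (K, C·f·f). Scaling the loss and the gradient by the same factor leaves the minimizer unchanged but shifts the point at which the default tolerance-based stopping criteria of L-BFGS-B trigger.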
5 Texture Evaluation
Evaluating the quality of the synthesized textures is traditionally performed by human inspection. Optimal texture synthesis should generate samples that humans perceive as being the same texture as the reference. The high quality of the synthesized textures by (gatys:2015a) suggests that the summary statistics from multiple layers of VGG can approximate the perceptual metric of humans. Even though the VGG texture representation is clearly not perfect (see Fig. 1), this allows us to utilize these statistics as a more objective quantification of texture quality.
For all details of the VGG-based texture model see (gatys:2015a). Here we use the standard 19-layer VGG network (VGG:2014) with pretrained weights and average instead of max pooling (https://github.com/Lasagne/Recipes/blob/master/modelzoo/vgg19.py, as accessed on 12.05.2016). We compute a Gram matrix on the output of each convolutional layer that follows a pooling layer. Let $G^{\ell}$ be the Gram matrix on the activations of the $\ell$-th layer and
(6) $E_{\ell}(\mathbf{x}) = \frac{\sum_{k,m} \left( \hat{G}^{\ell}_{km} - G^{\ell}_{km}(\mathbf{x}) \right)^2}{\sum_{k,m} \left( \hat{G}^{\ell}_{km} \right)^2}$
the corresponding relative reconstruction cost. The total reconstruction cost is then defined as the average distance between the reference Gram matrices and the synthesized ones, i.e.
(7) $E(\mathbf{x}) = \frac{1}{L} \sum_{\ell=1}^{L} E_{\ell}(\mathbf{x})$, where $L$ is the number of layers on which a Gram matrix is computed.
This cost is reported on top of each synthesised texture in Fig. LABEL:fig:samples. To visually evaluate samples from our single- and multi-scale models against the VGG-based model (gatys:2015a), we additionally synthesize textures from VGG by minimizing (7) using L-BFGS-B as in section 4.
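The evaluation cost can be sketched as follows, assuming the per-layer Gram matrices of the reference and the synthesized image have already been computed (extracting them from VGG is outside the scope of this sketch). The function names are hypothetical, and the per-layer cost is assumed to be the squared Gram difference relative to the squared norm of the reference Gram matrix.

```python
import numpy as np

def relative_gram_error(g_ref, g_syn):
    """Squared Gram difference relative to the reference Gram matrix."""
    return np.sum((g_ref - g_syn) ** 2) / np.sum(g_ref ** 2)

def total_cost(ref_grams, syn_grams):
    """Average the relative error over all evaluated layers."""
    errors = [relative_gram_error(r, s) for r, s in zip(ref_grams, syn_grams)]
    return float(np.mean(errors))
```

Averaging relative rather than absolute errors keeps layers with large activation magnitudes from dominating the total.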
6 Results
In Fig. 1 we quantify the quality of two very simple random single-layer texture models. The single-scale model employs 1024 feature maps with filters of size $11 \times 11 \times 3$ that are drawn from a zero-mean uniform distribution. The multi-scale model employs random filters on multiple scales (see sec. 2). A good texture representation should be similar for patches taken from the same texture, and very distinctive for patches from different textures. To test this, we sample 10 random patches from 10 different textures, compute the model representations (i.e. the Gram matrix) on each patch and evaluate the relative squared distance (2) between them. We then plot the median distance between patches of two textures as a confusion matrix (Fig. 1). For comparison, we also plot the confusion matrix for the raw pixel values as well as for the VGG model (gatys:2015a). The latter shows a clear distinction between within-class and between-class patches, which is completely lacking in the raw pixel space. More surprising, however, is the confusion matrix of the random single-layer models: their distinction of patches is on par with VGG. In other words, the texture parametrization in random shallow networks seems similarly suited to measure the perceptual difference between two patches. This intriguing finding suggests that the astonishing perceptual quality of textures synthesized from the VGG model is not, as has been thought, the result of the very specific multi-layer representation trained with supervision. As a consequence, images synthesized from the single-layer models should perform similarly to Gatys et al. (gatys:2015a).
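The patch comparison described above can be sketched as follows. This is an illustrative outline only: `represent` and `distance` are placeholders for a model representation (e.g. a Gram matrix) and a distance such as (2), textures are assumed to be arrays of shape (height, width, channels), and within-texture entries are computed over distinct patch pairs so that a patch is never compared with itself.

```python
import numpy as np

def random_patches(texture, n, size, rng):
    """Crop n random size x size patches from one texture image."""
    hs = rng.integers(texture.shape[0] - size + 1, size=n)
    ws = rng.integers(texture.shape[1] - size + 1, size=n)
    return [texture[h:h + size, w:w + size] for h, w in zip(hs, ws)]

def confusion_matrix(textures, represent, distance, n, size, rng):
    """Median distance between patch representations of every texture pair."""
    reps = [[represent(p) for p in random_patches(t, n, size, rng)]
            for t in textures]
    t = len(textures)
    conf = np.zeros((t, t))
    for a in range(t):
        for b in range(t):
            dists = [distance(reps[a][i], reps[b][j])
                     for i in range(n) for j in range(n)
                     if a != b or i < j]   # skip self and duplicate pairs
            conf[a, b] = np.median(dists)
    return conf
```

A representation that separates textures well produces small diagonal entries and large off-diagonal entries.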
In Fig. 2 we show textures synthesised from the random single- and multi-scale models, as well as eight other non-random single-layer models for three different source images (top left). For comparison, we also plot samples generated from the VGG model by Gatys et al. (gatys:2015a) (bottom left). There are roughly two groups of models: those with a small number of feature maps (363, top row), and those with a large number of feature maps (3267, bottom row). Only the multi-scale model employs 1024 feature maps. Within each group, we can differentiate models for which the filters are trained without supervision on natural images (e.g. sparse coding filters from k-means), principally devised filter banks (e.g. 2D Fourier basis) and completely random filters (see sec. 2 for all details). All single-layer networks, except for the multi-scale one, feature $11 \times 11$ filters. Remarkably, despite the small spatial size of the filters, all models capture much of the small- and mid-scale structure of the textures, in particular if the number of feature maps is large. Notably, the scale of these structures extends far beyond the receptive fields of the single units (see e.g. the pebble texture). We further observe that a larger number of feature maps generally increases the perceptual quality of the generated textures. Surprisingly, however, completely random filters perform on par with or better than filters that have been trained on the statistics of natural images. This is particularly true for the multi-scale model, which clearly outperforms the single-scale models on all textures. The captured structures in the multi-scale model are generally much larger and often reach the full size of the texture (see e.g. the wall).
The perceptual quality of the textures generated from models with only a single layer and random filters is quite remarkable and surpasses previous state-of-the-art parametric methods like Portilla and Simoncelli (Portilla:2000). The multi-scale model often rivals and sometimes even outperforms the current state of the art (gatys:2015a), as we show in Fig. LABEL:fig:samples, where we compare samples synthesized from 20 different textures for the random single- and multi-scale models, as well as VGG. The multi-scale model generates very competitive samples, in particular for textures with extremely regular structures across the whole image (e.g. for the brick wall, the grids or the scales). In part, this effect can be attributed to the more robust optimization of the single-layer model, which is less prone to local minima than the optimization in deeper models. This is exemplified in the grid structures, for which the VGG-based loss is paradoxically lower for samples from the multi-scale model than for the VGG-based model (which directly optimized the VGG-based loss). This suggests that the naive synthesis performed here favors images that are perceptually similar to the reference texture and thus loses variability (see sec. 7 for further discussion). Nonetheless, samples from the single-layer model still exhibit large perceptual differences, see Fig. 3. The VGG-based loss (7) appears to generally be an acceptable approximation of the perceptual differences between the reference and the synthesized texture. Only for a few textures, especially those with very regular man-made structures (e.g. the wall or the grids), does the VGG-based loss fail to capture the perceptual advantage of the multi-scale synthesis.
7 Discussion
In this paper we demonstrated a new parametric texture model based on a single-layer convolutional network with random filters. We showed that the model is able to qualitatively capture the perceptual differences between natural textures. Samples from the model often rival and sometimes even outperform the current state of the art (gatys:2015a) (Fig. LABEL:fig:samples, third vs fourth row), even though the latter relies on a high-performance deep neural network with features that are tuned to the statistics of natural images. This finding suggests that neither the hierarchical texture representation nor the trained filters are critical for high-quality texture synthesis. Instead, both aspects rather seem to serve as fine-tuning of the texture representation.
Our results clearly demonstrate that Gram matrices computed from the feature maps of convolutional neural networks generically lead to useful summary statistics for texture synthesis. The Gram matrix on the feature maps transforms the representations from the convolutional neural network into a stationary feature space that captures the pairwise correlations between different features. If the number of feature maps is large, then the local structures in the image are well preserved in the projected space and the overlaps of the convolutional filtering add additional constraints. At the same time, averaging out the spatial dimensions yields sufficient flexibility to generate entirely new textures that differ from the reference on a patch-by-patch level, but still share much of the short- and long-range statistics.
The success of shallow convolutional networks with random filters in reproducing the structure of the reference texture is remarkable and indicates that they can be useful for parametric texture synthesis. Besides reproducing the stationary correlation structure of the reference image ("perceptual similarity"), another desideratum of a texture synthesis algorithm is to exhibit a large variety between different samples generated from the same given image ("variability"). Hence, synthesis algorithms need to balance perceptual similarity and variability. This balance is determined by a complex interplay between the choice of summary statistics and the optimization algorithm used. For example, the stopping criterion of the optimization algorithm can be adjusted to trade perceptual similarity for larger variability.
In this paper we focused on maximizing perceptual similarity only, and it is worth pointing out that additional efforts will be necessary to find an optimal trade-off between perceptual similarity and variability. For the synthesis of textures from the random models considered here, the trade-off leans more towards perceptual similarity in comparison to Gatys et al. (gatys:2015a) (due to the simpler optimization), which also explains the superior performance on some samples. In fact, we found some anecdotal evidence (not shown) in deeper multi-layer random CNNs where the reference texture was exactly reconstructed during the synthesis. From a theoretical point of view this is likely a finite-size effect which does not necessarily constitute a failure of the chosen summary statistics: for finite-size images it is quite possible that only the reference image can exactly reproduce all the summary statistics. Therefore, in practice, the Gram matrices are not treated as hard constraints but as soft constraints only. More generally, we do not expect a perceptual distance metric to assign exactly zero to a random pair of patches from the same texture. Instead, we expect it to assign small values for pairs from the same texture, and large values for patches from different textures (Fig. 1). Therefore, the selection of constraints is not sufficient to characterize a texture synthesis model but only determines the exact minima of the objective function (which are sought by the synthesis). If we additionally consider images with small but non-zero distance to the reference statistics, then the set of equivalent textures increases substantially, and the precise composition of this set becomes critically dependent on the perceptual distance metric.
Mathematically, parametric texture synthesis models are described as ergodic random fields that have maximum entropy subject to certain constraints (zhu:97; bruna:2013; zhu:00) (MaxEnt framework). Practical texture synthesis algorithms, however, always deal with finite-size images. As discussed above, two finite-size patches from the same ergodic random field will almost never feature the exact same summary statistics. This additional uncertainty in estimating the constraints on finite-length processes is not thoroughly accounted for by the MaxEnt framework (see the discussion of its “ad hockeries” by Jaynes, jaynes:1982). Thus, a critical difference between practical implementations of texture synthesis algorithms and the conceptual MaxEnt texture modeling framework is that the former genuinely allow a small mismatch in the constraints. Accordingly, specifying the summary statistics is not sufficient; a comprehensive definition of a texture synthesis model should specify:

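A metric $d(\mathbf{x}, \mathbf{y})$ that determines the distance between any two arbitrary textures $\mathbf{x}, \mathbf{y}$.

A bipartition $P_{\hat{\mathbf{x}}}$ of the image space that determines which images are considered perceptually equivalent to a reference texture $\hat{\mathbf{x}}$. A simple example for such a partition is the $\epsilon$-environment $U_{\epsilon}(\hat{\mathbf{x}}) := \{\mathbf{x} : d(\mathbf{x}, \hat{\mathbf{x}}) < \epsilon\}$ and its complement.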
This definition is relevant for both under- and over-constrained models, but its importance becomes particularly obvious for the latter. According to the Minimax entropy principle for texture modeling suggested by Zhu et al. (zhu:97), as many constraints as possible should be used to reduce the (Kullback-Leibler) divergence between the true texture model and its estimate. However, for finite spatial size, the synthetic samples become exactly equivalent to the reference texture (up to shifts) in the limit of sufficiently many independent constraints. In contrast, if we explicitly allow for a small mismatch between the summary statistics of the reference image and the synthesized textures, then the set of possible textures does not constitute a low-dimensional manifold but rather a small volume within the pixel space.
Taken together, we have shown that simple single-layer CNNs with random filters can serve as the basis for excellent texture synthesis models that outperform previous hand-crafted synthesis models and may even rival the current state of the art. This finding challenges previous observations that suggested a critical role for the multi-layer representations in trained deep networks for natural texture generation. On the other hand, it is not enough to just use sufficiently many constraints, as one would predict from the MaxEnt framework. Instead, for the design of good texture synthesis algorithms it will be crucial to find distance measures for which the $\epsilon$-environment around the reference texture leads to perceptually satisfying results. In this way, building better texture synthesis models is inherently related to better quantitative models of human perception.