Amortised MAP Inference for
Image super-resolution (SR) is an underdetermined inverse problem, where a large number of plausible high resolution images can explain the same downsampled image. Most current single image SR methods use empirical risk minimisation, often with a pixel-wise mean squared error (MSE) loss. However, the outputs from such methods tend to be blurry, over-smoothed and generally appear implausible. A more desirable approach would employ Maximum a Posteriori (MAP) inference, preferring solutions that always have a high probability under the image prior, and thus appear more plausible. Direct MAP estimation for SR is non-trivial, as it requires us to build a model for the image prior from samples. Here we introduce new methods for amortised MAP inference whereby we calculate the MAP estimate directly using a convolutional neural network. We first introduce a novel neural network architecture that performs a projection to the affine subspace of valid SR solutions, ensuring that the high resolution output of the network is always consistent with the low resolution input. Using this architecture, the amortised MAP inference problem reduces to minimising the cross-entropy between two distributions, similar to training generative models. We propose three methods to solve this optimisation problem: (1) Generative Adversarial Networks (GAN), (2) denoiser-guided SR, which backpropagates gradient estimates from denoising to train the network, and (3) a baseline method using a maximum-likelihood-trained image prior. Our experiments show that the GAN based approach performs best on real image data. Lastly, we establish a connection between GANs and amortised variational inference as in e.g. variational autoencoders.
|Casper Kaae Sønderby1,2 (work done while CKS was an intern at Twitter), Jose Caballero1, Lucas Theis1, Wenzhe Shi1 & Ferenc Huszár1|
|1Twitter, London, UK|
|2University of Copenhagen, Denmark|
Image super-resolution (SR) is the underdetermined inverse problem of estimating a high resolution (HR) image given the corresponding low resolution (LR) input. This problem has recently attracted significant research interest due to the potential of enhancing the visual experience in many applications while limiting the amount of raw pixel data that needs to be stored or transmitted. While SR has many applications in for example medical diagnostics or forensics (Nasrollahi & Moeslund, 2014, and references therein), here we are primarily motivated to improve the perceptual quality when applied to natural images. Most current single image SR methods use empirical risk minimisation, often with a pixel-wise mean squared error (MSE) loss (Dong et al., 2016; Shi et al., 2016). However, MSE, and convex loss functions in general, are known to have limitations when presented with uncertainty in multimodal and nontrivial distributions such as distributions over natural images. In SR, a large number of plausible images can explain the LR input and the Bayes-optimal behaviour for any MSE trained model is to output the mean of the plausible solutions weighted according to their posterior probability. For natural images this averaging behaviour leads to blurry and over-smoothed outputs that generally appear implausible, i.e. the produced estimates have low probability under the natural image prior.
An idealised method for our applications would use a full-reference perceptual loss function that describes the sensitivity of the human visual perception system to different distortions. However the most widely used loss functions MSE and the related peak-signal-to-noise-ratio (PSNR) metric have been shown to correlate poorly with human perception of image quality (Laparra et al., 2016; Wang et al., 2004). Improved perceptual quality metrics have been proposed, the most popular being structural similarity (SSIM) (Wang et al., 2004) and its multi-scale variants (Wang et al., 2003). Although the correlation of these metrics with human perception has improved, they still do not provide a fully satisfactory alternative to MSE for training of neural networks (NN) for SR.
In lieu of a satisfactory perceptual loss function, we leave the empirical risk minimisation framework and present methods based only on natural image statistics. In this paper we argue that a desirable approach is to employ amortised Maximum a Posteriori (MAP) inference, preferring solutions that have a high posterior probability and thus high probability under the image prior while keeping the computational benefits of amortised inference. To motivate why MAP inference is desirable, consider the toy problem in Figure 1a, where the HR data $y = [y_1, y_2]$ is two-dimensional and distributed according to the Swiss-roll density. The LR observation is defined as the average of the two pixels, $x = \frac{y_1 + y_2}{2}$. Consider observing a LR data point $x$: the set of possible HR solutions is the line $y_1 = 2x - y_2$, more generally an affine subspace, which is shown by the dashed line in Figure 1a. The posterior distribution $p_{Y|X}$ is thus degenerate, and corresponds to a slice of the prior along this line, as shown by the red shading. If one minimises MSE or Mean Absolute Error (MAE), the Bayes-optimal solution will lie at the mean or the median along the line, respectively. This example illustrates that MSE and MAE can produce output with very low probability under the data prior, whereas MAP inference would always find the mode, which by definition is in a high-probability region. See Section 5.6 for a discussion of possible limitations of the MAP inference approach.
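The difference between the mean, median and mode along the line of consistent solutions can be sketched numerically. The example below is only an illustration under stated assumptions: the one-dimensional mixture density `post` is a hypothetical stand-in for a slice of a multimodal prior, not the Swiss-roll density used in Figure 1a, and all names are chosen for this sketch.

```python
import numpy as np

def gauss(t, mu, s):
    """Density of a 1-D Gaussian with mean mu and standard deviation s."""
    return np.exp(-0.5 * ((t - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Hypothetical posterior along the line of consistent HR solutions:
# two sharp modes separated by a broad, low-density bridge.
t = np.linspace(-8.0, 8.0, 16001)
dt = t[1] - t[0]
post = 0.48 * gauss(t, -3.0, 0.3) + 0.10 * gauss(t, 0.0, 1.0) + 0.42 * gauss(t, 3.0, 0.3)
post /= post.sum() * dt  # renormalise on the grid

mean = (t * post).sum() * dt                 # Bayes-optimal under MSE
cdf = np.cumsum(post) * dt
median = t[np.searchsorted(cdf, 0.5)]        # Bayes-optimal under MAE
mode = t[np.argmax(post)]                    # MAP estimate

def density(v):
    """Posterior density at v, by linear interpolation on the grid."""
    return np.interp(v, t, post)

for name, v in [("mean", mean), ("median", median), ("mode", mode)]:
    print(f"{name:6s} at {v:+.2f}, density {density(v):.3f}")
```

In this constructed case both the mean and the median fall in the low-density region between the modes, while the mode sits on the dominant peak. Whether the median is this poor depends on the shape of the slice; the sketch only shows that it can be.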
Our first contribution is a convolutional neural network (CNN) architecture designed to exploit the structure of the SR problem. Image downsampling is a linear transformation, and can be modelled as a strided convolution. As Figure 1a illustrates, the set of HR images that are compatible with any LR image spans an affine subspace. We show that by using specifically chosen linear convolution and deconvolution layers we can implement a projection to this affine subspace. This ensures that our CNNs always output estimates that are consistent with the inputs. The affine projection layer can be added to any CNN, or indeed, any other trainable SR algorithm. Using this architecture we show that training the model for MAP inference reduces to minimising the cross-entropy between the HR data distribution and the implied distribution of the model’s output when evaluated at random LR images. As a result, we no longer need corresponding HR and LR image pairs, and training becomes more akin to training generative models. However, direct minimisation of the cross-entropy is not possible and instead we develop three approaches, all depending on projecting the model output to the affine subspace of valid solutions, to approximate it directly from data:
We present a variant of the Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) which approximately minimises the Kullback–Leibler divergence (KL) and cross-entropy between $q_\theta$, the distribution of the model's output, and $p_Y$, the distribution of HR images. Our analysis provides theoretical grounding for using GANs in image SR (Ledig et al., 2016). We also introduce a trick that we call instance noise that can be generally applied to address the instability of training GANs.
We employ denoising as a way to capture natural image statistics. A Bayes-optimal denoiser approximately learns to take a gradient step along the log-probability of the data distribution (Alain & Bengio, 2014). These gradient estimates from denoising can be directly backpropagated through the network to minimise cross-entropy between $q_\theta$ and $p_Y$ via gradient descent.
We present an approach where the probability density of data is directly modelled via a generative model trained by maximum likelihood. We use a differentiable generative model based on PixelCNNs (Oord et al., 2016) and Mixture of Conditional Gaussian Scale Mixtures (MCGSM, Theis et al., 2012) whose performance we believe is very close to the state of the art in this category.
In Section 5 we empirically demonstrate the behaviour of the proposed methods on both the two-dimensional toy dataset and on real image datasets. Lastly, in Appendix F we show that a stochastic version of AffGAN performs amortised variational inference, which for the first time establishes a connection between GANs and variational inference as in e.g. variational autoencoders (Kingma & Welling, 2014).
2 Related work
The GAN framework was introduced by Goodfellow et al. (2014), who also showed that these models minimise the Jensen-Shannon divergence between $q_\theta$ and $p_Y$ under certain conditions. In Section 3.2, we present an update rule that corresponds to minimising $\mathrm{KL}[q_\theta \| p_Y]$. Recently, Nowozin et al. (2016) presented a more general treatment that connects GANs to $f$-divergence minimisation. In parallel to our contributions, theoretical work by Mohamed & Lakshminarayanan (2016) presented a unifying view on learning in GAN-style algorithms, of which our variant can be regarded a special case. The focus of several recent papers on GANs was algorithmic tricks to improve their stability (Radford et al., 2015; Salimans et al., 2016). In Section 3.2.1 we introduce another such trick we call instance noise. We discuss theoretical motivations for this and compare it to one-sided label smoothing proposed by Salimans et al. (2016). We also refer to parallel work by Arjovsky & Bottou (2017) proposing a similar method. Recently, several attempts have been made to improve perceptual quality of SR using deep representations of natural images. Bruna et al. (2016) and Li & Wand (2016) measure the Euclidean distance in the nonlinear feature space of a deep NN pre-trained to perform object classification. Dosovitskiy & Brox (2016) and Ledig et al. (2016) use a similar approach and also add an adversarial loss term. Unpublished work by Garcia (2016) explored combining GANs with an $\ell_1$ penalty between the LR input and the down-sampled output. We note that the soft $\ell_2$ or $\ell_1$ penalties used in these methods can be interpreted as assuming Gaussian and Laplace observation noise, respectively. In contrast, our approach assumes no observation noise and satisfies the consistency of inputs and outputs exactly by using an affine projection as explained in Section 3.1. In other work, Larsen et al. (2015) proposed to replace the pixel-wise MSE used for training of variational autoencoders with a learned metric from the GAN discriminator.
Our denoiser based method exploits a fundamental connection between probabilistic modelling and learning to denoise (see e.g. Vincent et al., 2008; Alain & Bengio, 2014; Särelä & Valpola, 2005; Rasmus et al., 2015; Greff et al., 2016): a Bayes-optimal denoiser can be used to estimate the gradient of the log probability of data. To our knowledge this work is the first time that the output of a denoiser is explicitly back-propagated to train another network. Lastly, we note that denoising has been used to solve inverse problems in compressed sensing as in approximate message passing (Metzler et al., 2015).
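Consider a function $f_\theta(x)$ parametrised by $\theta$ which maps a LR observation $x$ to a HR estimate $\hat{y}$. Most current SR methods optimise model parameters via empirical risk minimisation:
$$\operatorname*{argmin}_{\theta} \; \mathbb{E}_{y,x}\left[\ell\left(y, f_\theta(x)\right)\right] \tag{1}$$
where $y$ is the true target and $\ell$ is some loss function. The loss function is typically a simple convex function, most often MSE as in (Dong et al., 2016; Shi et al., 2016). Here, we seek to perform MAP inference instead. For a single LR observation $x$ the MAP estimate is
$$\hat{y}(x) = \operatorname*{argmax}_{y} \; \log p_{Y|X}(y \mid x) \tag{2}$$
Instead of calculating $\hat{y}$ for each $x$ separately we perform amortised inference, i.e. we would like to train the SR function $f_\theta(x)$ to calculate the MAP estimate. A natural loss function for learning the parameters $\theta$ is the average log-posterior:
$$\operatorname*{argmax}_{\theta} \; \mathbb{E}_{x}\left[\log p_{Y|X}\left(f_\theta(x) \mid x\right)\right] \tag{3}$$
where the expectation is taken over the distribution of LR observations $x$. This loss depends on the unknown posterior distribution $p_{Y|X}$. We proceed by decomposing the log-posterior using Bayes’ rule as follows.
$$\mathbb{E}_{x}\left[\log p_{Y|X}\left(f_\theta(x) \mid x\right)\right] = \mathbb{E}_{x}\left[\log p_{X|Y}\left(x \mid f_\theta(x)\right)\right] + \mathbb{E}_{x}\left[\log p_{Y}\left(f_\theta(x)\right)\right] - \mathbb{E}_{x}\left[\log p_{X}(x)\right] \tag{4}$$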
3.1 Handling the Likelihood term
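Notice that the last term of Eqn. (4), the marginal likelihood, does not depend on $\theta$, so we only have to deal with the likelihood and image prior. The observation model in SR can be described as follows.
$$x = Ay \tag{5}$$
where $A$ is a linear transformation used for image downsampling. In general, $A$ can be modelled as a strided two-dimensional convolution. Therefore, the likelihood term in Eqn. (4) is degenerate, $p_{X|Y}(x \mid f_\theta(x)) = \delta(x - A f_\theta(x))$, and Eqn. (4) can be rewritten as constrained optimisation:
$$\operatorname*{argmax}_{\theta \,:\, \forall x,\ A f_\theta(x) = x} \; \mathbb{E}_{x}\left[\log p_{Y}\left(f_\theta(x)\right)\right] \tag{6}$$
To satisfy the constraints, we introduce a parametric function class that always guarantees $A f_\theta(x) = x$. Specifically, we propose to use functions of the form
$$f_\theta(x) = \Pi^{A}_{x}\left(g_\theta(x)\right) = (I - A^{+}A)\, g_\theta(x) + A^{+}x \tag{7}$$
where $g_\theta$ is an arbitrary mapping from LR to HR space, $\Pi^{A}_{x}$ a projection to the affine subspace $\{y : Ay = x\}$, and $A^{+}$ is the Moore-Penrose pseudoinverse of $A$, which satisfies $AA^{+}A = A$ and $A^{+}AA^{+} = A^{+}$. Conveniently, if $A$ is a strided two-dimensional convolution, then $A^{+}$ becomes a deconvolution or up-convolution, which is a standard operation used in deep learning (e.g. Shi et al., 2016). It is important to stress that the optimal deconvolution is not simply the transpose of $A$. Figure 2 illustrates the upsampling kernel ($A^{+}$) that corresponds to a Gaussian downsampling kernel ($A$). For any $A$ the deconvolution $A^{+}$ can be easily found; here we used numerical methods as detailed in Appendix B. Intuitively, $A^{+}x$ can be thought of as a baseline SR solution, while $(I - A^{+}A)\, g_\theta(x)$ is the residual. The operation $(I - A^{+}A)$ is a projection to the null-space of $A$, therefore when we downsample the residual $(I - A^{+}A)\, g_\theta(x)$ we are guaranteed to get $0$ no matter what $g_\theta$ is. By using functions of this form we can turn Eqn. (6) into an unconstrained optimisation problem.
$$\operatorname*{argmax}_{\theta} \; \mathbb{E}_{x}\left[\log p_{Y}\left(\Pi^{A}_{x}\, g_\theta(x)\right)\right] \tag{8}$$
Interestingly, the objective above can be expressed in terms of the probability distribution of the model output, $q_\theta$ (the distribution of $f_\theta(x)$ for $x \sim p_X$), as follows.
$$\operatorname*{argmax}_{\theta} \; \mathbb{E}_{x}\left[\log p_{Y}\left(\Pi^{A}_{x}\, g_\theta(x)\right)\right] = \operatorname*{argmax}_{\theta} \; \mathbb{E}_{y \sim q_\theta}\left[\log p_{Y}(y)\right] = \operatorname*{argmin}_{\theta} \; \mathbb{H}\left[q_\theta, p_Y\right] \tag{9}$$
where $\mathbb{H}[q_\theta, p_Y]$ denotes the cross-entropy between $q_\theta$ and $p_Y$ and we used $\mathbb{H}[q_\theta, p_Y] = -\mathbb{E}_{y \sim q_\theta}\left[\log p_Y(y)\right]$. To minimise this objective, we do not need matched input-output pairs as in empirical risk minimisation. Instead we need to match the marginal distribution of reconstructed images to the distribution of HR images. In this respect, the problem becomes more akin to unsupervised learning or generative modelling. In the following sections we present three approaches to finding the optimal $\theta$ utilising the properties of the affine projection.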
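The affine projection described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the convolutional implementation used in the paper: the downsampling operator is a hypothetical one-dimensional block-averaging matrix `A`, the pseudoinverse is computed with `np.linalg.pinv` rather than as a learned or numerically fitted deconvolution kernel, and `g_out` is random noise standing in for the output of an arbitrary network.

```python
import numpy as np

rng = np.random.default_rng(0)

def average_downsampler(n_hr, factor):
    """Matrix that averages non-overlapping blocks of `factor` HR pixels (1-D toy case)."""
    n_lr = n_hr // factor
    A = np.zeros((n_lr, n_hr))
    for i in range(n_lr):
        A[i, i * factor:(i + 1) * factor] = 1.0 / factor
    return A

A = average_downsampler(n_hr=16, factor=4)   # hypothetical downsampling operator
A_pinv = np.linalg.pinv(A)                   # Moore-Penrose pseudoinverse of A

def affine_project(g_out, x):
    """Project g_out onto the affine subspace {y : A y = x}."""
    # (I - A^+ A) g_out removes everything in g_out that A can see;
    # A^+ x adds back the baseline solution consistent with x.
    return g_out - A_pinv @ (A @ g_out) + A_pinv @ x

x = rng.normal(size=4)        # LR observation
g_out = rng.normal(size=16)   # stand-in for the output of an arbitrary network
y_hat = affine_project(g_out, x)

print("max consistency error:", np.abs(A @ y_hat - x).max())
```

For this particular `A` the pseudoinverse equals `4 * A.T`, so even in the simplest case it differs from the plain transpose by a scale factor. For overlapping kernels such as a Gaussian blur the difference is no longer a simple rescaling.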
3.2 Affine projected Generative Adversarial Networks
Generative Adversarial Networks (Goodfellow et al., 2014) consist of a generator $G_\theta$ that turns noise sampled from some distribution $z \sim p_Z$ into images via a parametric mapping, and a discriminator $D_\psi$ that learns to distinguish between real and synthetic images. The generator and discriminator are updated in tandem, resulting in the generative distribution $q_\theta$ moving closer to the distribution of real data $p_Y$. The behaviour of GANs depends on the specifics of how the generator and the discriminator are trained. We use the following objective functions for $D_\psi$ and $G_\theta$:
$$\begin{aligned} \mathcal{L}(\psi; \theta) &= -\mathbb{E}_{y \sim p_Y}\left[\log D_\psi(y)\right] - \mathbb{E}_{z \sim p_Z}\left[\log\left(1 - D_\psi(G_\theta(z))\right)\right], \\ \mathcal{L}(\theta; \psi) &= -\mathbb{E}_{z \sim p_Z}\left[\log \frac{D_\psi(G_\theta(z))}{1 - D_\psi(G_\theta(z))}\right]. \end{aligned} \tag{10}$$
The algorithm iterates two steps: first, it updates $\psi$ by lowering $\mathcal{L}(\psi; \theta)$ keeping $\theta$ fixed, then it updates $\theta$ by lowering $\mathcal{L}(\theta; \psi)$ keeping $\psi$ fixed. It can be shown that this amounts to minimising $\mathrm{KL}[q_\theta \| p_Y]$, where $q_\theta$ is the distribution of samples generated by $G_\theta$. See Appendix A for a proof (first shown in Huszár, 2016). In the context of SR, the affine projected SR function $f_\theta$ takes the role of the generator. Instead of noise $z$, the generator is now fed low-resolution images $x \sim p_X$. Leaving everything else unchanged, we can deploy the GAN algorithm to minimise $\mathrm{KL}[q_\theta \| p_Y]$. We call this algorithm affine projected GAN, or AffGAN for short. Similarly, we use the name SoftGAN to denote the GAN algorithm without the affine projection, which instead uses an additional soft-constraint as in (Garcia, 2016). Note that the difference between the cross-entropy and the KL divergence is the entropy of $q_\theta$: $\mathrm{KL}[q_\theta \| p_Y] = \mathbb{H}[q_\theta, p_Y] - \mathbb{H}[q_\theta]$. Hence, we can expect AffGAN to favour approximate MAP solutions that lead to higher entropy and thus more diverse solutions overall.
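The claim that the generator update targets the KL divergence can be checked numerically on a small discrete example. The sketch below assumes the generator loss has the form $-\mathbb{E}_{q_\theta}[\log (D/(1-D))]$ (the alternative update rule of Huszár, 2016) and that the discriminator is Bayes-optimal, $D^*(y) = p_Y(y) / (p_Y(y) + q_\theta(y))$. The two five-state distributions `p` and `q` are random, hypothetical stand-ins for $p_Y$ and $q_\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete stand-ins for the data and model distributions.
p = rng.dirichlet(np.ones(5))   # plays the role of p_Y
q = rng.dirichlet(np.ones(5))   # plays the role of q_theta

# Bayes-optimal discriminator for distinguishing p from q.
d_opt = p / (p + q)

# Generator loss evaluated at the optimal discriminator.
gen_loss = -np.sum(q * np.log(d_opt / (1.0 - d_opt)))

# Reference quantities.
kl_qp = np.sum(q * np.log(q / p))
cross_entropy = -np.sum(q * np.log(p))
entropy_q = -np.sum(q * np.log(q))

print("generator loss at D*:", gen_loss)
print("KL[q || p]:          ", kl_qp)
print("H[q, p] - H[q]:      ", cross_entropy - entropy_q)
```

Since $D^*/(1 - D^*) = p_Y/q_\theta$, the three printed numbers agree up to floating-point error. In practice the discriminator is only approximately optimal, so the correspondence is approximate.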
3.2.1 Instance Noise
The theory suggests that GANs should be a convergent algorithm. If a unique optimal discriminator exists and it is reached by optimising $\psi$ to perfection at each step, technically the whole algorithm corresponds to gradient descent on an estimate of $\mathrm{KL}[q_\theta \| p_Y]$ with respect to $\theta$. In practice, however, GANs tend to be highly unstable. So where does the theory go wrong? We think the main reason for the instability of GANs stems from $q_\theta$ and $p_Y$ being concentrated distributions whose support does not overlap. The distribution of natural images $p_Y$ is often assumed to concentrate on or around a low-dimensional manifold. In most cases, $q_\theta$ is degenerate and manifold-like by construction, such as in AffGAN. Therefore, odds are that, especially before convergence is reached, $q_\theta$ and $p_Y$ can be perfectly separated by several $D_\psi$s, violating a condition for the convergence proof. We try to remedy this problem by adding instance noise to both SR and true image samples. This amounts to minimising the divergence $d_\sigma(q_\theta, p_Y) = \mathrm{KL}[p_\sigma * q_\theta \| p_\sigma * p_Y]$, where $p_\sigma * q_\theta$ denotes convolution of $q_\theta$ with the noise distribution $p_\sigma$. The noise level $\sigma$ can be annealed during training, and the noise allows us to safely optimise $D_\psi$ until convergence in each iteration. The trick is related to one-sided label smoothing introduced by Salimans et al. (2016), however without introducing a bias in the optimal discriminator, and we believe it is a promising technique for stabilising GAN training in general. For more details please see Appendix C.
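A minimal sketch of how instance noise could be applied to a training batch is given below. The Gaussian noise distribution, the linear annealing schedule and the function names `noise_level` and `add_instance_noise` are assumptions made for illustration; the schedule actually used is described in Appendix C and is not reproduced here.

```python
import numpy as np

def noise_level(step, total_steps, sigma_start=0.1, sigma_end=0.0):
    """Linearly anneal the noise standard deviation (hypothetical schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)

def add_instance_noise(real_batch, fake_batch, sigma, rng):
    """Add independent Gaussian noise of the same level to real and generated samples.

    The discriminator then sees samples from the smoothed distributions
    p_sigma * p_Y and p_sigma * q_theta, whose supports overlap.
    """
    noisy_real = real_batch + sigma * rng.standard_normal(real_batch.shape)
    noisy_fake = fake_batch + sigma * rng.standard_normal(fake_batch.shape)
    return noisy_real, noisy_fake

rng = np.random.default_rng(0)
real = rng.uniform(size=(8, 3, 16, 16))   # stand-in for a batch of HR images
fake = rng.uniform(size=(8, 3, 16, 16))   # stand-in for a batch of SR outputs
sigma = noise_level(step=250, total_steps=1000)
noisy_real, noisy_fake = add_instance_noise(real, fake, sigma, rng)
```

The key design point is that the same noise level is applied to both inputs of the discriminator, so the optimal discriminator for the smoothed problem remains unbiased.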
3.3 Denoiser Guided Super-Resolution
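To optimise the criterion Eqn. (6) via gradient descent we need its gradient with respect to $\theta$:
$$\frac{\partial}{\partial \theta}\, \mathbb{E}_{x}\left[\log p_{Y}\left(f_\theta(x)\right)\right] = \mathbb{E}_{x}\left[\left.\frac{\partial}{\partial y} \log p_{Y}(y)\right|_{y = f_\theta(x)} \cdot \frac{\partial}{\partial \theta} f_\theta(x)\right] \tag{11}$$
Here $\frac{\partial}{\partial \theta} f_\theta(x)$ are the gradients of the SR function, which can be calculated via back-propagation, whereas $\frac{\partial}{\partial y} \log p_Y(y)$ requires estimation since $p_Y$ is unknown. We use results from (Alain & Bengio, 2014; Särelä & Valpola, 2005) showing that in the limit of infinitesimal Gaussian noise, optimal denoising functions can be used to estimate this gradient:
$$f^{*}_{\sigma} = \operatorname*{argmin}_{f} \; \mathbb{E}_{y \sim p_Y,\, \epsilon}\left\| f(y + \sigma\epsilon) - y \right\|^{2} \;\implies\; \frac{\partial}{\partial y} \log p_{Y}(y) \approx \frac{f^{*}_{\sigma}(y) - y}{\sigma^{2}} \tag{12}$$
where $\epsilon$ is Gaussian white noise and $f^{*}_{\sigma}$ is the Bayes-optimal denoising function for noise level $\sigma$. Using these results we can maximise Eqn. (9) by first training a neural network to denoise samples from $p_Y$ and then backpropagating the gradient estimates from Eqn. (12) via the chain rule in Eqn. (11) to update $\theta$. We call this method AffDG, as it uses the affine subspace projection and is guided by the gradient from the denoising autoencoder (DAE). As above, we call the analogous algorithm that soft-enforces Eqn. (5) SoftDG.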
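The relationship between an optimal denoiser and the gradient of the log-density can be verified in closed form for a Gaussian. The sketch below assumes a hypothetical one-dimensional data distribution $p_Y = \mathcal{N}(0, s^2)$, for which the Bayes-optimal denoiser under Gaussian noise of level $\sigma$ is the linear shrinkage $\tilde{y}\, s^2 / (s^2 + \sigma^2)$ and the true score is $-y / s^2$. In the method described here the denoiser is a trained network rather than a closed-form function.

```python
import numpy as np

s = 2.0        # std of the hypothetical Gaussian data distribution
sigma = 0.05   # small denoising noise level

def bayes_denoiser(y_noisy):
    """Posterior mean of y given y_noisy = y + sigma * eps, for y ~ N(0, s^2)."""
    return y_noisy * s**2 / (s**2 + sigma**2)

def score_estimate(y):
    """Denoiser-based estimate of d/dy log p_Y(y): (f*(y) - y) / sigma^2."""
    return (bayes_denoiser(y) - y) / sigma**2

y = np.linspace(-3.0, 3.0, 7)
true_score = -y / s**2
estimate = score_estimate(y)

print("true score:", true_score)
print("estimate:  ", estimate)
```

In this Gaussian case the estimate equals $-y / (s^2 + \sigma^2)$ exactly, i.e. the score of the noise-smoothed density, and it approaches the true score as $\sigma \to 0$.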
3.4 Density Guided Super-Resolution
As a more direct baseline model for amortised MAP inference we fit a tractable, yet powerful density model to $p_Y$ using maximum likelihood, and then use cross-entropy with respect to the generative model to approximate Eqn. (9). We use a deep generative model similar to the PixelCNN (Oord et al., 2016) but with a continuous (and differentiable) MCGSM (Theis et al., 2012) likelihood. This type of model is state-of-the-art in density estimation, is relatively fast to evaluate and produces visually interesting samples (Oord et al., 2016). We call this method AffLL, as it uses the affine projection and is guided by the log-likelihood of a density model.
We designed our experiments to address the following questions:
Does the affine projection layer hurt the performance of CNNs for image SR? (Section 5.2)
We initially illustrate the behaviour of the proposed algorithms on data where exact MAP inference is computationally tractable. Here the HR data is drawn from a two-dimensional noisy Swiss-roll distribution and the one-dimensional LR data is simply the average of the two HR pixels. Next we tested the proposed algorithms in a series of experiments on natural images using downsampling. For the first dataset, we took random crops from HR images containing grass texture. SR of random textures is known to be very hard using MSE or MAE loss functions. Finally, we tested the proposed models on real image data of faces (CelebA) and natural images (ImageNet). All models were convolutional neural networks implemented using Theano (Team et al., 2016) and Lasagne (Dieleman et al., 2015). We refer to Appendix D for full experimental details.
5 Results and Discussion
5.1 2D MAP inference: Swiss-Roll
In this experiment we wanted to demonstrate that AffGAN and AffDG are indeed minimising the MAP objective in Eqn. (9). For this we used the two-dimensional toy problem where can be evaluated using brute-force Monte Carlo. Figure 1b) shows the outputs for models trained with different criteria. The AffGAN and AffDG solutions largely fit the dominant mode, similar to MAP inference. For the MSE and MAE models the output generally falls in regions with low prior density. Table 1 shows the cross-entropy achieved by different methods, averaged over 10 independent trials with random initialisation. The cross-entropy values for the GAN and DAE based models are relatively close to the optimal MAP solution, which in this case we can find in a brute-force way. As expected, the MSE and MAE models perform worse as these models do not minimise $\mathbb{H}[q_\theta, p_Y]$. We also calculated the average MSE between the network input and the downsampled network output. For the affine projected models, this error is exactly $0$. The soft constrained models only approximately satisfy this constraint, even after extensive training (Table 1, second column). Further, we observe that the affine projected models generally found a lower cross-entropy when compared to soft-constrained versions.
5.2 Affine Projected Networks: Proof of Concept using MSE criterion
Adding the affine projection restricts the class of functions that the SR network can model, so it is important to verify that the network is still capable of achieving the same performance in SR as unconstrained CNN architectures. To test this, we trained CNNs with and without affine projections to perform SR on the CelebA dataset using MSE as the objective function. Results are shown in Figure 2. First note that when using affine projections, a randomly initialised network starts learning from a lower initial loss as the low-frequency components of the network output already match those of the target image. We observed that the affine projected networks generally train faster than unconstrained ones. Furthermore, the affine projected networks tend to find a better solution as measured by MSE and SSIM (Figure 2a-b). To investigate which aspects of the network architecture are responsible for the improved performance, we evaluated two further models. In one variant, we initialise the affine projected CNN to implement the correct projection, but then treat $A^{+}$ as a trainable parameter. In the final variant, we keep the architecture the same, but initialise the final deconvolution layer randomly and allow it to be trained. We found that initialising $A^{+}$ to the correct Moore-Penrose inverse is important, and we get similar results irrespective of whether or not it is fixed during training. Figure 2c shows the error between the network input and the downsampled network output. We can see that the exact affine projected network keeps this error at virtually $0$ (up to numerical precision), whereas any other network will violate this consistency. In Figure 2d we show the downsampling kernel $A$ and the corresponding optimal kernel for $A^{+}$.
5.3 Grass Textures
Random textures are known to be hard to model using an MSE loss function. Figure 3 shows SR of grass texture patches using identical affine projected CNNs trained with different loss functions. When randomly initialised, affine projected CNNs always produce an output with the correct low-frequency components, as illustrated by the third panel in Figure 3. The AffGAN model clearly produces the sharpest images, and we found the images to be plausible given the LR inputs. Notice that the reconstruction is not perfect pixel-by-pixel, but it has the correct statistical properties for the human visual system to recognise it as grass texture. The AffDG and AffLL models both produced blurry results which we were unable to improve upon using various optimisation methods. Due to these findings we chose not to perform any further experiments with these models and concentrate on AffGAN instead. We refer to Appendix E for discussion of the results of these models.
5.4 CelebA Faces
Figure 5 shows the SR results for several models trained using different loss functions. The MSE trained models output somewhat generic and over-smoothed images, as expected. For the GAN models the global content is correct for both the affine projected and soft constrained models. Comparing the AffGAN and SoftGAN outputs, the AffGAN model produces slightly sharper images, which however also seem to contain slightly more high-frequency noise. We observed some colour drifting for the soft constrained models. Table 5 shows quantitative results for the same four models where, in terms of PSNR and SSIM, the MSE model achieves the best scores, as expected. The consistency between input and output clearly shows that the models using the affine projections satisfy Eqn. (5) better than the soft constrained versions for both MSE and GAN losses.
5.5 Natural Images
In Figure 6 we show the results for SR from to pixels for AffGAN trained on natural images from ImageNet. For most of the images the results are sharp and correspond well with the LR input. However, in some of the images we still see the high-frequency noise present in most GAN results. Interestingly, the snake depicted in the third column is super-resolved into water, which is obviously wrong but still a very plausible image considering the LR input image. Further, water will likely have a higher density under the image prior than snakes, which suggests that the GAN model dreams up reasonable data.
5.6 Criticism and future directions
One argument against MAP inference is that the mode of a distribution is dependent on the representation: transforming a variable through an invertible transformation and performing MAP inference in the transformed space may lead to different answers depending on the transformation. As an extreme example, consider transforming a continuous random scalar with its cumulative distribution function. The resulting variable is uniformly distributed, so any value in the interval $[0, 1]$ can be the mode. Thus, the MAP estimate is not unique if one allows for alternative representations, and there is no guarantee that the MAP estimate in 24-bit RGB pixel representation which we seek in this paper is in any way special. One may arrive at a different solution when performing MAP estimation in the feature space of a convolutional neural network, or even if merely an alternative colour space is used. Interestingly, AffGAN is more resilient to coordinate transformations: Eqn. (10) includes the extra term $\mathbb{H}[q_\theta]$, which is affected by transformations the same way as $\mathbb{H}[q_\theta, p_Y]$. The second argument relates to the assumption that MAP estimates appear plausible. Although by definition the mode lies in a high-probability region, it does not guarantee that its appearance is anything like that of a random sample. Consider for example data drawn from a $d$-dimensional standard Normal distribution. Due to concentration of measure, as $d$ increases the norm of a typical sample will be approximately $\sqrt{d}$ with very high probability. The mode, however, has a norm of $0$. In this sense, the mode of the distribution is highly atypical. Indeed, human observers can easily tell apart a typical sample from the noise distribution and the mode, but would have a hard time noticing the difference between two random samples. This argument suggests that sampling from the posterior may be a good or even preferable way to obtain plausible reconstructions.
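The concentration-of-measure argument is easy to check empirically. The following sketch draws samples from a standard Normal distribution in a hypothetical dimension `d = 10000` and compares their norms with $\sqrt{d}$ and with the norm of the mode, which is zero. The dimension and sample count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 10_000                                  # hypothetical dimensionality
samples = rng.standard_normal((100, d))     # 100 draws from N(0, I_d)
norms = np.linalg.norm(samples, axis=1)

mode = np.zeros(d)                          # the mode of N(0, I_d) is the origin

print("sqrt(d):             ", np.sqrt(d))
print("mean sample norm:    ", norms.mean())
print("min / max sample norm:", norms.min(), norms.max())
print("norm of the mode:    ", np.linalg.norm(mode))
```

All sample norms cluster tightly around $\sqrt{d} = 100$, so no random sample lies anywhere near the mode, even though the mode has the highest density.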
In Appendix F we establish a connection between variational inference, such as in variational autoencoders (Kingma & Welling, 2014), and a stochastic version of AffGAN, leaving empirical studies as future work.
In this work we developed methods for approximate MAP inference in SR. We first introduced an architectural restriction to neural networks, projecting the model output to the affine subspace of valid solutions. We then proposed three methods, based on GANs, denoising or density models, for amortised MAP inference in SR using this affine projection. In high dimensions we empirically found that the GAN based approach, AffGAN, produced the most visually appealing results. Our work follows successful demonstrations of GAN-based algorithms for image SR (Ledig et al., 2016), and we provide additional theoretical motivation for why this approach makes sense. In future work we plan to focus on a stochastic extension of AffGAN which can be seen as performing amortised variational inference.
- Alain & Bengio (2014) Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.
- Arjovsky & Bottou (2017) Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
- Bruna et al. (2016) Joan Bruna, Pablo Sprechmann, and Yann LeCun. Super-resolution with deep convolutional sufficient statistics. International Conference on Learning Representations, 2016.
- Dieleman et al. (2015) Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Eric Battenberg, et al. Lasagne: First release, 2015. URL http://dx.doi.org/10.5281/zenodo.27878.
- Dong et al. (2016) Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 295–307, 2016.
- Dosovitskiy & Brox (2016) Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.
- Garcia (2016) David Garcia. Open source code. retrieved on 22 Sept 2016, 2016. URL https://github.com/david-gpu/srez.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
- Greff et al. (2016) Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao, Jürgen Schmidhuber, and Harri Valpola. Tagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing Systems, 2016.
- Huang et al. (2016) Gao Huang, Zhuang Liu, and Kilian Q Weinberger. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
- Huszár (2016) Ferenc Huszár. An alternative update rule for generative adversarial networks. Unpublished note (retrieved on 7 Oct 2016), 2016. URL http://www.inference.vc/an-alternative-update-rule-for-generative-adversarial-networks/.
- Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In The International Conference on Learning Representations, 2014.
- Laparra et al. (2016) Valero Laparra, Johannes Ballé, Alexander Berardino, and Eero P Simoncelli. Perceptual image quality assessment using a normalized laplacian pyramid. In Proc. IS&T Int’l Symposium on Electronic Imaging, Conf. on Human Vision and Electronic Imaging, 2016.
- Larsen et al. (2015) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1558–1566, 2015.
- Ledig et al. (2016) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
- Li & Wand (2016) Chuan Li and Michael Wand. Combining markov random fields and convolutional neural networks for image synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Metzler et al. (2015) Christopher A Metzler, Arian Maleki, and Richard G Baraniuk. Optimal recovery from compressive measurements via denoising-based approximate message passing. In International Conference on Sampling Theory and Applications (SampTA), pp. 508–512, 2015.
- Mohamed & Lakshminarayanan (2016) Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
- Nasrollahi & Moeslund (2014) Kamal Nasrollahi and Thomas B. Moeslund. Super-resolution: a comprehensive survey. Machine Vision and Applications, pp. 1423–1468, 2014.
- Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
- Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1747–1756, 2016.
- Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2015.
- Rasmus et al. (2015) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
- Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, 2016.
- Särelä & Valpola (2005) Jaakko Särelä and Harri Valpola. Denoising source separation. Journal of Machine Learning Research, pp. 233–272, 2005.
- Shi et al. (2016) Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883, 2016.
- Team et al. (2016) The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, et al. Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
- Theis & Bethge (2015) Lucas Theis and Matthias Bethge. Generative image modeling using spatial lstms. In Advances in Neural Information Processing Systems, pp. 1927–1935, 2015.
- Theis et al. (2012) Lucas Theis, Reshad Hosseini, and Matthias Bethge. Mixtures of conditional gaussian scale mixtures applied to multiscale image representations. PLoS ONE, 2012.
- Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, 2008.
- Wang et al. (2003) Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In Conference Record of the 27th Asilomar Conference on Signals, Systems and Computers, volume 2, pp. 1398–1402, 2003.
- Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, pp. 600–612, 2004.
Appendix A Generative Adversarial Networks for minimising KL-divergence
First note that for a fixed generator the discriminator maximises:
where is the generative distribution. A function of the form always has maximum at and we find the Bayes-optimal discriminator to be (assuming equal prior class probabilities)
Let’s assume that this Bayes-optimal discriminator is unique and can be approximated closely by our neural network (see Appendix C for more discussion on this assumption).
Using the modified update rule proposed here the combined optimization problem for the discriminator and generator is
Starting from the definition of
Which is equal to the terms affecting the generator in Eqn. (17).
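The argument of this appendix can be summarised in a few lines. The following is a sketch in assumed notation, since the symbols are not reproduced above: $p_Y$ stands for the data distribution, $q_\theta$ for the generative distribution, $D$ for the discriminator, and the generator loss is taken to be the negative log-odds of the discriminator output, which is one common form of the modified update rule.

```latex
\begin{align}
\mathcal{L}(D) &= \mathbb{E}_{y \sim p_Y}\left[\log D(y)\right]
                + \mathbb{E}_{y \sim q_\theta}\left[\log\left(1 - D(y)\right)\right] \\
% a \log t + b \log(1 - t) is maximised over t \in (0, 1) at t = a / (a + b)
D^{*}(y) &= \frac{p_Y(y)}{p_Y(y) + q_\theta(y)} \\
\log \frac{D^{*}(y)}{1 - D^{*}(y)} &= \log \frac{p_Y(y)}{q_\theta(y)} \\
\mathrm{KL}\left[q_\theta \,\|\, p_Y\right]
  &= \mathbb{E}_{y \sim q_\theta}\left[\log \frac{q_\theta(y)}{p_Y(y)}\right]
   = -\,\mathbb{E}_{y \sim q_\theta}\left[\log \frac{D^{*}(y)}{1 - D^{*}(y)}\right]
\end{align}
```

Under these assumptions, a generator that minimises the negative log-odds of a Bayes-optimal discriminator is minimising the KL divergence from the generative distribution to the data distribution.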
Appendix B Affine projection
B.1 Numerical estimation of the pseudo-inverse
In practice we implement the down-sampling projection as a strided convolution with a fixed Gaussian smoothing kernel where the stride corresponds to the down-sampling factor. The pseudo-inverse is implemented as a transposed convolution operation with parameters optimised numerically via stochastic gradient descent on the following objective function:
Where is the -dimensional standard normal distribution, and is the dimensionality of LR data . and can be thought of as a Monte Carlo estimate of the spectral norm of the transformations and , respectively. The Monte Carlo formulation above has the advantage that it can be optimised via stochastic gradient descent. The operation can be thought of as a three-layer fully linear convolutional neural network, where corresponds to a strided convolution with fixed kernels, while is a trainable deconvolution. We note that for certain downsampling kernels the exact would have an infinitely large kernel, although it can always be approximated with a local kernel. At convergence we found to be between and depending on the down-sampling factor, width of the Gaussian kernel used for and the filter sizes of and .
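To make the role of the Monte Carlo objective concrete, the sketch below builds a small 1D analogue of the down-sampling operator as a dense matrix and evaluates two Gaussian-probe residuals derived from the Moore-Penrose conditions. Everything here is an assumption for illustration: the sizes, the kernel width, the helper name `downsampling_matrix`, and the exact form of the two residuals are not taken from the text, and the exact `np.linalg.pinv` is used in place of an SGD-trained transposed convolution.

```python
import numpy as np

def downsampling_matrix(n_hr, stride, kernel):
    """Dense matrix of a 1D strided convolution with zero padding at the borders."""
    k = len(kernel)
    n_lr = n_hr // stride
    A = np.zeros((n_lr, n_hr))
    for i in range(n_lr):
        for j in range(k):
            col = i * stride + j - k // 2
            if 0 <= col < n_hr:
                A[i, col] = kernel[j]
    return A

# Hypothetical setup: 16 HR samples, 4x down-sampling, 7-tap Gaussian kernel.
taps = np.arange(-3, 4)
kernel = np.exp(-0.5 * (taps / 1.5) ** 2)
kernel /= kernel.sum()
A = downsampling_matrix(16, 4, kernel)

# Exact pseudo-inverse, standing in for the learned transposed convolution.
A_pinv = np.linalg.pinv(A)

# Monte Carlo residuals with standard normal probes (assumed form):
#   loss_1 ~ E ||A A+ A y - A y||^2,   loss_2 ~ E ||A+ A A+ x - A+ x||^2
rng = np.random.default_rng(0)
y = rng.standard_normal((16, 1000))  # HR-sized probes
x = rng.standard_normal((4, 1000))   # LR-sized probes
loss_1 = np.mean(np.sum((A @ A_pinv @ A @ y - A @ y) ** 2, axis=0))
loss_2 = np.mean(np.sum((A_pinv @ A @ A_pinv @ x - A_pinv @ x) ** 2, axis=0))
print(loss_1, loss_2)
```

For the exact pseudo-inverse both residuals vanish up to floating-point error; a learned, finite-support approximation would be expected to leave a small positive residual instead.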
The gradients of the affine projected SR models are derived by applying the chain rule
Which is essentially the high-pass filtered version of the gradient of .
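The high-pass interpretation can be checked numerically. The sketch below assumes the projection has the form "network output with its low-resolution content replaced by the observed input", i.e. (I - A⁺A) f + A⁺ x; this form, the average-pooling operator, and all variable names are illustrative assumptions rather than the exact layers used in the experiments.

```python
import numpy as np

# Hypothetical down-sampling operator: 4x average pooling of a 16-sample signal.
A = np.kron(np.eye(4), np.full((1, 4), 0.25))  # shape (4, 16)
A_pinv = np.linalg.pinv(A)
P = np.eye(16) - A_pinv @ A  # projector onto the null space of A

def affine_project(f_out, x_lr):
    """Keep the null-space part of the network output, take the rest from x_lr."""
    return P @ f_out + A_pinv @ x_lr

rng = np.random.default_rng(0)
x_lr = rng.standard_normal(4)    # observed LR input
f_out = rng.standard_normal(16)  # unconstrained network output
y_hat = affine_project(f_out, x_lr)

# Chain rule: the gradient w.r.t. the network output is the projected gradient.
grad_y = rng.standard_normal(16)  # stand-in for dLoss/dy_hat
grad_f = P.T @ grad_y

print(np.allclose(A @ y_hat, x_lr))   # output is consistent with the LR input
print(np.allclose(A @ grad_f, 0.0))   # gradient carries no LR-visible component
```

The second check is the sense in which the gradient is high-pass filtered: any component that the down-sampling operator could see is removed before it reaches the network.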
Appendix C Instance Noise
GANs are notoriously unstable to train, and several papers exist that try to improve their convergence properties (Salimans et al., 2016; Radford et al., 2015) via various tricks. Consider the following idealised GAN algorithm, each iteration consisting of the following steps:
we train the discriminator via logistic regression between vs , until convergence
we extract from an estimate of the logarithmic likelihood ratio
we update by taking a stochastic gradient step with objective function
If and are well-conditioned distributions in a low-dimensional space, this algorithm performs gradient descent on an approximation to the KL divergence, so it should converge. So why is it highly unstable in practical situations?
Crucially, the convergence of this algorithm relies on a few assumptions that don’t always hold: (1) that the log-likelihood-ratio is finite, (2) that the Jensen-Shannon divergence is a well-behaved function of and (3) that the Bayes-optimal solution to the logistic regression problem is unique. We stipulate that in real-world situations none of these holds, mainly because and are concentrated distributions whose support may not overlap. In image modelling, the distribution of natural images is often assumed to be concentrated on or around a lower-dimensional manifold. Similarly, is often degenerate by construction. The odds that the two distributions share support in high-dimensional space, especially early in training, are very small. If and have non-overlapping support, (1) the log-likelihood-ratio and therefore the KL divergence is infinite, (2) the Jensen-Shannon divergence is saturated at its maximum value and is locally constant in , and (3) there may be a large set of near-optimal discriminators whose logistic regression loss is very close to the Bayes optimum, but each of these possibly provides very different gradients to the generator. Thus, training the discriminator might find a different near-optimal solution each time depending on initialisation, even for a fixed and .
The main ways to avoid these pathologies involve making the discriminator’s job harder. For example, in most GAN implementations the discriminator is only partially updated in each iteration, rather than trained until convergence. Another way to cripple the discriminator is adding label noise, or equivalently, one-sided label smoothing as introduced by Salimans et al. (2016). In this technique the labels in the discriminator’s training data are randomly flipped. However, we do not believe these techniques adequately address all of the concerns described above.
In Figure 7a we illustrate two almost perfectly separable distributions. Notice how the large gap between the distributions means that there is a large number of possible classifiers that tell the two distributions apart and achieve similar logistic loss. The Bayes-optimal classifier may not be unique, and the set of near-optimal classifiers is very large and diverse. In Figure 7b we show the effect of one-sided label smoothing or, equivalently, adding label noise. In this technique, the labels of some real data samples are flipped so the discriminator is trained thinking they were samples from . The discriminator indeed has a harder task now, but all classifiers are penalised almost equally. As a result, there is still a large set of discriminators which achieve near-optimal loss; it’s just that the near-optimal loss is now larger. Label smoothing does not help if the Bayes-optimal classifier is not unique.
Instead we propose to add noise to the samples, rather than labels, which we denote instance noise. Using instance noise the support of the two distributions is broadened and they are no longer perfectly separable as illustrated in Figure 7c. Adding noise, the Bayes-optimal discriminator becomes unique, the discriminator is less prone to overfitting because it has a wider training distribution, and the log-likelihood-ratio becomes better behaved. The Jensen-Shannon divergence between the noisy distributions is now a non-constant function of .
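The effect of instance noise on the Jensen-Shannon divergence can be illustrated in one dimension. In the sketch below, two narrow densities play the role of the data and generative distributions, and the offset between them plays the role of the generator parameter; the grid, widths, offsets, and function names are all illustrative assumptions.

```python
import numpy as np

grid = np.linspace(-10.0, 10.0, 2001)

def density(center, sigma):
    """Narrow bump (std 0.05) at `center`, convolved with Gaussian noise of std `sigma`."""
    width = np.sqrt(0.05 ** 2 + sigma ** 2)
    p = np.exp(-0.5 * ((grid - center) / width) ** 2)
    return p / p.sum()

def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

offsets = [1.0, 2.0, 4.0]
# Without noise the supports are effectively disjoint for every offset.
js_clean = [js_divergence(density(0.0, 0.0), density(t, 0.0)) for t in offsets]
# With instance noise of std 1.0 added to both distributions.
js_noisy = [js_divergence(density(0.0, 1.0), density(t, 1.0)) for t in offsets]

print(js_clean)  # all close to log(2): saturated, no signal about the offset
print(js_noisy)  # grows with the offset: informative about the offset
```

Without noise the divergence sits at its maximum regardless of how far apart the two distributions are, so it provides no useful gradient; with noise it varies smoothly with the offset.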
Using instance noise, it is easy to construct an algorithm that minimises the following divergence:
where is the parameter of the noise distribution. Logistic regression on the noisy samples provides an estimate of . When updating the generator we have to minimise the mean of on noisy samples from . We know that, if is Gaussian, is a Bregman-divergence, and that it is if and only if the two distributions are equal. Because of the added noise, is less sensitive to local features of the distribution. We found that in our experiments instance noise helped the convergence of AffGAN. We have not tested the instance noise in the generative modelling application. Because we don’t have to worry about over-training the discriminator, we can train it until convergence, or take more gradient steps between subsequent updates to the generator. One critical hyper-parameter of this method is the noise distribution. We used additive Gaussian noise, whose variance we annealed during training. We propose a heuristic annealing schedule where the noise is adapted so as to keep the optimal discriminator’s loss constant during training. It is possible that other noise distributions such as heavy-tailed or spike-and-slab would work better but we have not investigated these options.
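A minimal sketch of how additive Gaussian instance noise with an annealed standard deviation might be applied to discriminator inputs is given below. The linear schedule, the start and end values, the batch shapes, and the function names are assumptions chosen for illustration; the loss-tracking heuristic schedule mentioned above is not implemented here.

```python
import numpy as np

def noise_std(step, total_steps, std_start, std_end):
    """Linearly interpolate the noise standard deviation over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return std_start + frac * (std_end - std_start)

def add_instance_noise(batch, std, rng):
    """Add independent Gaussian noise to every element of a batch."""
    return batch + std * rng.standard_normal(batch.shape)

rng = np.random.default_rng(0)
real = rng.uniform(size=(8, 3, 32, 32))  # stand-in for real HR images
fake = rng.uniform(size=(8, 3, 32, 32))  # stand-in for generator outputs

# Hypothetical schedule: std goes from 0.5 to 0.0 over 1000 steps.
std = noise_std(step=500, total_steps=1000, std_start=0.5, std_end=0.0)

# The same noise level is applied to both real and generated samples
# before they are shown to the discriminator.
real_noisy = add_instance_noise(real, std, rng)
fake_noisy = add_instance_noise(fake, std, rng)
```

Noise is added to both inputs so that the discriminator compares the two noise-broadened distributions rather than learning to detect the noise itself.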
Appendix D experimental details
For the GAN models the generative and discriminative parameters were updated using Eqn. (10). For the models enforcing Eqn. (5) using a soft-constraint we added an extra MAE loss term to the generative parameters , where i runs over the number of data samples .
The denoiser guided models were trained in a two step procedure. Initially we pre-trained a DAE to denoise samples from the data distribution by minimising
During training we anneal the noise level and continuously save the model parameters of the DAE trained at increasingly smaller noise levels. We then learn the parameters of the generator by following the gradient in Eqn. (11) using the DAE to estimate
Where is the learning rate. During training we continuously load parameters of the DAE trained at increasingly low noise levels. This gives gradients that point in the approximately correct direction over a large region of data space at the beginning of training, and precise gradients close to the data manifold at the end of training.
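The idea of reading a gradient estimate off a denoiser can be checked in a case where everything is available in closed form. The sketch below assumes scalar data drawn from a zero-mean Gaussian and uses the standard relation that the residual of the MSE-optimal denoiser, divided by the noise variance, equals the gradient of the log-density of the noise-smoothed data. The data and noise standard deviations, the step size, and the function names are illustrative assumptions, and the analytic denoiser stands in for a trained DAE.

```python
import numpy as np

s, sigma = 2.0, 0.5  # hypothetical data std and DAE noise std

def optimal_denoiser(y_noisy):
    """MSE-optimal denoiser for data ~ N(0, s^2) corrupted by N(0, sigma^2) noise."""
    return y_noisy * s ** 2 / (s ** 2 + sigma ** 2)

def score_smoothed(y_noisy):
    """Exact gradient of log N(y; 0, s^2 + sigma^2), the noise-smoothed density."""
    return -y_noisy / (s ** 2 + sigma ** 2)

y = np.linspace(-3.0, 3.0, 7)

# Gradient estimate obtained from the denoiser residual.
score_from_dae = (optimal_denoiser(y) - y) / sigma ** 2
print(np.allclose(score_from_dae, score_smoothed(y)))

# One ascent step along the estimated gradient moves samples towards
# higher density (here: towards zero).
alpha = 0.1  # hypothetical learning rate
y_next = y + alpha * score_from_dae
```

Larger noise levels smooth the density more strongly, which is consistent with the observation that high-noise denoisers give informative but less precise gradients far from the data.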
For the density guided models we first pre-train a density model by maximising the tractable log-likelihood
Where the joint density has been decomposed using the chain rule and runs over the pixels. Similar to the DAE, we continuously save the parameters of the density model during training. We then learn the parameters of the generator by directly minimising the negative log-likelihood of the generated samples under the learned density model.
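For reference, the chain-rule decomposition referred to here has the following generic form. The notation is assumed, since the original symbols are not reproduced above: $y_i$ denotes the $i$-th pixel in a fixed ordering, $N$ the number of pixels, and $\phi$ the parameters of the density model.

```latex
\begin{align}
\log p_\phi(y) = \sum_{i=1}^{N} \log p_\phi\left(y_i \mid y_1, \dots, y_{i-1}\right)
\end{align}
```

Each conditional is tractable for an autoregressive model, so the full log-likelihood and its gradient with respect to the input image can be evaluated exactly.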
The 2D target data was sampled from the 2D Swiss-Roll defined as:
Where , , and . The LR input was defined as . The cross-entropy was calculated by estimating the probability density function using a Gaussian kernel density estimator fitted to samples from a noiseless Swiss Roll density i.e. , and setting the bandwidth of each kernel to . All generators and discriminators were 2-layer fully connected NNs with 64 units in each layer. For the AffDG model the DAE was a two-layer NN with 256 units in each layer trained while annealing the standard deviation of the Gaussian noise from to .
For all image experiments we set to a convolution using a Gaussian smoothing kernel of size using a stride of corresponding to down-sampling. The pseudo-inverse was set to a convolution operation with kernels of size followed by a reordering of the pixels, with the output corresponding to up-sampling convolution as described in Shi et al. (2016). The parameters of the pseudo-inverse were optimised numerically as described in Appendix B. All down-sampling was done using the projection. For all image models we used convolutional models using ReLU nonlinearities and batch normalization in all layers except the output. All generators used skip connections similar to Huang et al. (2016) and a final sigmoid non-linearity was applied to the output of the model, which was either used directly or fed through the affine transformation layers parameterised by and . The discriminators were standard convolutional networks followed by a final sigmoid layer.
For the grass texture experiments we used randomly extracted patches of data from high resolution grass texture images. The generators used 6 layers of convolutions with 32, 32, 64, 64, 128 and filter maps and skip connections after every second layer. The discriminators had four layers of strided convolutions with 32, 64, 128 and 256 filter maps. For the AffDG model the DAE was a four-layer convolutional network with 128 filter maps in each layer trained while annealing the standard deviation of the Gaussian noise from to . The density model was implemented as a PixelCNN similar to Oord et al. (2016) with four layers of convolution with 64 filter maps and kernel sizes of 5, except for the first layers which used 7. The original PixelCNN uses a non-differentiable categorical distribution as the likelihood model, so it cannot be used for gradient-based optimization. Instead we used an MCGSM as the likelihood model (Theis & Bethge, 2015), which has been shown to be a good density model for images (Theis et al., 2012), using 32 mixture components and 32 quadratic features to approximate the covariance matrices.
For the CelebA experiments the dataset was split into train, validation and test sets using the standard split. All images were center cropped and resized to before down-sampling to using . All generators were 12-layer convolutional networks with four layers of 128, 256 and 512 filter maps and skip connections between every fourth layer. The discriminators were 8-layer convolutional networks with two layers of 128, 256, 512 and 1024 filter maps using a stride of 2 for every second layer.
For the ImageNet experiments the 2012 dataset was randomly split into train, validation and test sets with samples in the test and validation sets. All images below 20kB were then discarded to remove images with too low a resolution. The images were center cropped and resized to before down-sampling to using . The generator was an 8-layer convolutional network with 4 layers of 128 and 256 filter maps and skip connections between every second layer. The discriminators were 8-layer convolutional networks with two layers of 128, 256, 512 and 1024 filter maps using a stride of 2 for every second layer. To stabilise training we used Gaussian instance noise linearly annealed from an initial standard deviation of to . We were unable to train models stably without this extra regularization.
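The pixel reordering step of the sub-pixel convolution can be sketched in a few lines of NumPy. The function below rearranges groups of r² channels into an r-times larger spatial grid; the channel ordering follows the convention used by common deep-learning libraries, which is an assumption here, and the function name and example sizes are illustrative.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange an array of shape (C * r * r, H, W) into (C, H * r, W * r)."""
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)    # split channels into (c, row offset, col offset)
    x = x.transpose(0, 3, 1, 4, 2)  # reorder to (c, h, row offset, w, col offset)
    return x.reshape(c, h * r, w * r)

# Four feature maps of size 1x1 become one map of size 2x2.
features = np.arange(4, dtype=float).reshape(4, 1, 1)
print(pixel_shuffle(features, 2))
# [[[0. 1.]
#   [2. 3.]]]

# A typical use: 4x up-sampling of 3-channel output needs 3 * 16 input channels.
lr_features = np.zeros((3 * 16, 8, 8))
hr_image = pixel_shuffle(lr_features, 4)  # shape (3, 32, 32)
```

Because the convolution runs at low resolution and only the final reordering produces the high-resolution grid, this layout is cheaper than convolving an already up-sampled image.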
Appendix E Additional results for Denoiser and Density guided Super-resolution
Figure 8 shows the PSNR and SSIM scores during training for the AffDG and AffLL models trained on the grass textures. Note that the models are converging, but as seen in Figure 3 the images are very blurry. For both models we had problems with diverging training. For the DAE models with high noise levels the gradients are only approximately correct but cover a large space around the data manifold, whereas for small noise levels the gradients are more accurate in a small space around the data manifold. For the density model we believe a similar phenomenon is making the training diverge, since for accurate density models the estimated density is likely very peaked around the data manifold, making learning in the beginning of training difficult. To resolve this issue we started training using models with high noise levels or low log-likelihood values and then loaded model parameters during training with continuously smaller noise levels or better log-likelihood values. The effect of this can be clearly seen during training as the step-like behavior of the AffDG in Figure 8. We note that the density model used for training the AffLL achieved a log-likelihood of bits per dimension which is comparable to values obtained in Theis & Bethge (2015) on a texture dataset. Further, the AffLL model achieved high log-likelihood values under this model, suggesting that the density model is simply not providing an accurate enough representation of to provide precise scores for training the AffLL model.
Appendix F Amortised variational inference using AffGAN
Here we’ll show that a stochastic extension of the AffGAN model approximately minimises an amortised variational inference criterion as in e.g. variational autoencoders, which for the first time establishes a connection between adversarial methods of inference and variational inference. We introduce a variant of AffGAN where, in addition to the LR data, the generator function also takes as input some independent noise variables:
Similarly to how we defined in Section 3.1 we introduce the following notation:
Here the affine projection ensures that under , and are always consistent. Therefore, under , the conditional of given is the same as the likelihood by construction and the following equality holds:
Applying Bayes’ rule to and substituting into the above equality we get:
The divergence that the AffGAN objective minimises can now be rewritten as.
Therefore we can conclude that the AffGAN algorithm described in Section 3.2 approximately minimizes the following amortised variational inference criterion:
and in doing so it only requires samples from and .
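The steps of this argument can be written out as follows. This is a sketch in assumed notation, since the original symbols are not reproduced above: $p_X$ and $p_Y$ denote the LR and HR data distributions, $p_{X|Y}$ the likelihood defined by the down-sampling, $p_{Y|X}$ the true posterior, $q_{Y|X;\theta}$ the conditional distribution of the stochastic generator output given the LR input, and $q_{Y;\theta}$ its marginal over $x \sim p_X$.

```latex
\begin{align}
% consistency enforced by the affine projection
q_{X|Y;\theta}(x \mid y) &= p_{X|Y}(x \mid y) \\
% Bayes' rule under q, solved for the marginal
q_{Y;\theta}(y) &= \frac{q_{Y|X;\theta}(y \mid x)\, p_X(x)}{p_{X|Y}(x \mid y)} \\
% substitute into the divergence and use p_{X|Y}(x|y) p_Y(y) = p_{Y|X}(y|x) p_X(x)
\mathrm{KL}\left[q_{Y;\theta} \,\|\, p_Y\right]
  &= \mathbb{E}_{x \sim p_X}\, \mathbb{E}_{y \sim q_{Y|X;\theta}}
     \left[\log \frac{q_{Y|X;\theta}(y \mid x)\, p_X(x)}{p_{X|Y}(x \mid y)\, p_Y(y)}\right] \\
  &= \mathbb{E}_{x \sim p_X}\,
     \mathrm{KL}\left[q_{Y|X;\theta}(\cdot \mid x) \,\|\, p_{Y|X}(\cdot \mid x)\right]
\end{align}
```

Under these assumptions the marginal divergence equals the expected divergence between the approximate and the true posterior, which is the form of an amortised variational inference objective.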