DeFlow: Learning Complex Image Degradations from Unpaired Data with Conditional Flows

Abstract

The difficulty of obtaining paired data remains a major bottleneck for learning image restoration and enhancement models for real-world applications. Current strategies aim to synthesize realistic training data by modeling noise and degradations that appear in real-world settings. We propose DeFlow, a method for learning stochastic image degradations from unpaired data. Our approach is based on a novel unpaired learning formulation for conditional normalizing flows. We model the degradation process in the latent space of a shared flow encoder-decoder network. This allows us to learn the conditional distribution of a noisy image given the clean input by solely minimizing the negative log-likelihood of the marginal distributions. We validate our DeFlow formulation on the task of joint image restoration and super-resolution. The models trained with the synthetic data generated by DeFlow outperform previous learnable approaches on all three evaluated datasets.

1 Introduction

Figure 1: DeFlow is able to learn complex image degradation processes from unpaired training data. Our approach can sample different degraded versions of a clean input image (bottom) that faithfully resemble the noise of the real data (top).

Deep learning based methods have demonstrated astonishing performance for image restoration and enhancement when large quantities of paired training data are available. However, for many real-world applications, the difficulty of obtaining paired data remains a major bottleneck. For instance, in real-world super-resolution [25, 8, 9] and denoising [2, 3], collecting paired data is cumbersome and expensive, requiring careful setups and procedures that are difficult to scale. Moreover, such data is often limited to certain scenes and contains substantial misalignment issues. In many settings, including enhancement of existing image collections or restoration of historic photographs, the collection of paired data is impossible.

To tackle this fundamental problem, one promising direction is to generate paired training data by applying synthesized degradations and noise to high-quality images. The degraded image thus has a high-quality ground-truth. This allows effective supervised learning techniques to be applied directly to the synthesized pairs. However, in most practical applications the degradation process is unknown. It generally constitutes a complex combination of sensor noise, compression, and post-processing artifacts. Modeling the degradation process by hand is therefore a highly challenging problem, calling for learnable alternatives.

Since paired data is unavailable, learning the degradation process requires unpaired or unsupervised techniques. Several approaches resort to hand-crafted strategies tailored to specific types of degradations [17]. Existing learnable solutions mostly adopt generative adversarial networks (GANs) with cycle-consistency constraints [39, 25, 7] or domain-aware adversarial objectives [12, 34, 6] for unpaired training. However, these approaches require careful tuning of several losses. Moreover, cycle-consistency is a weak constraint that easily leads to changes in color and content [10]. Importantly, the aforementioned works rely on fully deterministic mappings, completely ignoring the fundamental stochasticity of natural degradations and noise. In this work, we take a radically different approach.

We propose DeFlow: a novel conditional normalizing flow based method for learning degradations from unpaired data. DeFlow models the conditional distribution $p(y|x)$ of a degraded image $y$ given its clean counterpart $x$. It allows us to sample multiple degraded versions $y$ of any clean image $x$. However, conventional conditional flow models [35, 27, 5, 1] require sample pairs $(x, y)$ for supervised training. We therefore propose a novel formulation for conditional flows, capable of unpaired learning. Specifically, we treat the unpaired setting as the problem of learning the conditional distribution $p(y|x)$ from observations of the marginals $p(x)$ and $p(y)$. By modeling both domains $\mathcal{X}$ and $\mathcal{Y}$ in the latent space of a joint flow network, we ensure sufficient constraints for effective unpaired learning while preserving flexibility for accurate modeling of $p(y|x)$. We additionally introduce a method for conditioning the flow on domain invariant information derived from either $x$ or $y$, which further facilitates the learning problem.

We apply our DeFlow formulation to the problem of joint image restoration and super-resolution in the real-world setting. DeFlow is tasked with learning complex image degradations, which are then used to synthesize training data for a baseline super-resolution model. We perform comprehensive experiments and analysis on the AIM2019 [24] and NTIRE2020 [26] real-world super-resolution challenge datasets. Our approach sets a new state-of-the-art on all datasets among learning-based approaches by outperforming GAN-based alternatives for generating image degradations. Degraded images obtained by DeFlow are shown in Fig. 1.

2 Related Work

Learning degradations from unpaired data  Realistic noise modeling and generation is a long-standing problem in Computer Vision research. The problem of finding learning-based solutions capable of using only unpaired data has received growing interest. One line of research employs generative adversarial networks (GANs) [13]. To learn from unpaired data, either cycle-consistency losses [25, 7] or domain-based adversarial losses [12, 34, 6] are employed. Yet, these approaches suffer from convergence and mode collapse issues, requiring elaborate fine-tuning of their losses. Importantly, such methods learn a deterministic mapping, ignoring the stochasticity of degradations.

Other works [22, 31, 23, 36] learn unsupervised denoising models based on the assumption of spatially uncorrelated (i.e. white) noise. However, this assumption does not apply to more complex degradations, which can have substantial spatial correlation due to e.g. compression or post-processing artifacts. Our approach employs fundamentally different constraints to allow for unpaired learning in this more challenging setting. Recently, Abdelhamed et al. [1] proposed a conditional flow based architecture to learn noise models. Yet, their method relies on the availability of paired data for training. Moreover, the authors employ an architecture that is specifically designed to model low-level sensor noise. In contrast, we aim to model more general degradations with no available paired training data.

Unpaired Learning with Flows  Whilst not for the application of learning image degradations, a few methods have investigated unpaired learning with flows. Grover et al. [14] trained two flow models with a shared latent space to obtain a model that adheres to exact cycle consistency. Their approach then requires an additional adversarial learning strategy based on CyCADA [15] to successfully perform domain translations. Further, Yamaguchi et al. [37] proposed domain-specific normalization layers for anomaly detection. As a byproduct, their approach can perform cross-domain translations on low-resolution images, by decoding an image of one domain with the normalization layer statistics of a different domain. Our proposed unpaired learning approach for flows is, however, fundamentally different from these methods. We do not rely on adversarial training or normalization layers. Instead, we introduce a shared latent space formulation that allows unpaired learning by minimizing the marginal negative log-likelihood.

3 DeFlow

In this paper, we strive to develop a method for learning a mapping from samples of a source domain $\mathcal{X}$ to a target domain $\mathcal{Y}$. While there are standard supervised learning techniques for addressing this problem, paired training datasets are not available in a variety of important real-world applications. Therefore, we tackle the unpaired learning scenario, where only unrelated sets of source samples $x \in \mathcal{X}$ and target samples $y \in \mathcal{Y}$ are available. While we formulate a more general approach for addressing this problem, we focus on the case where $x$ represents non-corrupted observations, while $y$ represents observations affected by an unknown degradation process. In particular, we are interested in image data.

Our aim is to capture stochastic degradation operations, which include noise and other random corruptions. The mapping therefore constitutes an unknown conditional distribution $p(y|x)$. The goal of this work is to learn a generative model of this conditional distribution, without any paired samples $(x, y)$.

3.1 Learning the Joint Distribution from Marginals

The unpaired learning problem defined above corresponds to the task of retrieving the conditional $p(y|x)$, or equivalently, the joint distribution $p(x, y)$, given only observations from the marginals $p(x)$ and $p(y)$. In the social sciences this problem is known as ecological inference [18, 19]. It is, in general, a highly ill-posed problem. However, under certain assumptions solutions can be inferred. As the most trivial case, assuming independence yields the solution $p(x, y) = p(x)p(y)$, which is not relevant since we are interested in finding correlations between $x$ and $y$. Instead, we first present a simple univariate Gaussian model, which serves as an illustrative starting point for our approach. As we will see, this example model forms the simplest special case of our general DeFlow formulation.

Let us assume that $x \sim \mathcal{N}(\mu_x, \sigma_x^2)$ follows a 1D Gaussian distribution with unknown mean $\mu_x$ and variance $\sigma_x^2$. We additionally postulate that $y = x + u$, where $u \sim \mathcal{N}(\mu_u, \sigma_u^2)$ has an unknown Gaussian distribution and is independent of $x$. As a sum of independent Gaussian random variables is again Gaussian, it follows that $y \sim \mathcal{N}(\mu_x + \mu_u, \sigma_x^2 + \sigma_u^2)$. Moreover, it is easy to see that $p(y|x) = \mathcal{N}(y; x + \mu_u, \sigma_u^2)$. Under these assumptions, we can estimate all unknown parameters $\theta = (\mu_x, \sigma_x^2, \mu_u, \sigma_u^2)$ by minimizing the combined negative log-likelihood of the marginal observations $\{x_i\}_{i=1}^{N}$ and $\{y_j\}_{j=1}^{M}$,

$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_x(x_i; \theta) - \sum_{j=1}^{M} \log p_y(y_j; \theta)$ .   (1)

A derivation and the resulting analytic solutions can be found in the appendix (Sec. A). This simple case provides an insightful example that allows us to infer the full joint distribution given only unpaired examples. Next, we generalize this example using normalizing flows to achieve a highly powerful class of models capable of likelihood-based unpaired learning.
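To make the example concrete, the closed-form estimators can be verified numerically. The following NumPy sketch (with illustrative parameter values of our choosing) recovers the joint model from two unrelated sets of samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth parameters, chosen for illustration only.
mu_x, var_x = 1.0, 4.0    # clean distribution: x ~ N(mu_x, var_x)
mu_u, var_u = 0.5, 2.25   # degradation:        u ~ N(mu_u, var_u), y = x + u

# Unpaired observations: the x- and y-samples are independent draws.
xs = rng.normal(mu_x, np.sqrt(var_x), size=100_000)
ys = rng.normal(mu_x + mu_u, np.sqrt(var_x + var_u), size=100_000)

# Closed-form minimizers of the combined marginal NLL (1),
# corresponding to Case 1 of the KKT analysis in Appendix A
# (the max with 0 handles Case 2, where sigma_u^2 = 0).
mu_x_hat = xs.mean()
mu_u_hat = ys.mean() - xs.mean()
var_x_hat = xs.var()
var_u_hat = max(ys.var() - xs.var(), 0.0)

print(mu_x_hat, mu_u_hat, var_x_hat, var_u_hat)
# approximately 1.0, 0.5, 4.0, 2.25
```

Note that $\hat\sigma_u^2$ is simply the excess variance of the $y$-samples over the $x$-samples: under this model, that excess is exactly what unpaired observations can identify.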

3.2 Generalizing with Normalizing Flows

In this section, we introduce a normalizing flow based formulation capable of learning flexible conditional distributions from unpaired data. Its core idea is to model the relation between $x$ and $y$ in a Gaussian latent space. We then use a deep invertible encoder-decoder network to map the latent variables to the output space. Thanks to the formulation of our latent model and the invertibility of the encoder-decoder network, we demonstrate that the resulting model can be trained end-to-end using only the marginal log-likelihoods.

We first detail our latent-space formulation. Our model postulates that the random variables $x$ and $y$ are related through a shared latent space. Let $z_x$ and $z_y$ denote the latent variables corresponding to $x$ and $y$, respectively. In particular, we let $z_x \sim \mathcal{N}(0, I)$ follow a standard Normal distribution. The latent variable $z_y$ of $y$ is modeled to depend on $z_x$, but perturbed by another Gaussian random variable $u$ such that $z_y = z_x + u$. The perturbation $u$ is independent of $z_x$, and therefore also of $x$. Furthermore, the mean $\mu_u$ and covariance $\Sigma_u$ of $u$ are unknown. Note that our latent space model is a multivariate generalization of the 1D example presented in Sec. 3.1. However, real image data cannot be directly modeled as Gaussian distributions. We therefore integrate a powerful deep network, capable of disentangling the complex correlations and relations in the image space to a Gaussian distribution.

We model the relation between the observation and latent space with an invertible neural network $f_\theta$. The observed samples are retrieved as $x = f_\theta^{-1}(z_x)$ and $y = f_\theta^{-1}(z_y)$, respectively. Our complete model is summarized as,

$x = f_\theta^{-1}(z_x)\,, \quad z_x \sim \mathcal{N}(0, I)$   (2a)
$y = f_\theta^{-1}(z_x + u)\,, \quad u \sim \mathcal{N}(\mu_u, \Sigma_u)\,, \; u \perp z_x$ .   (2b)

Here, $\perp$ denotes stochastic independence. Note that we can sample from the joint distribution $p(x, y)$ by directly applying (2). More importantly, we can also easily sample from the conditional distribution $p(y|x)$ by first generating the encoding $z_x = f_\theta(x)$. Since $y$ is only dependent on $x$ through $z_x$, we have $p(y|x) = p(y|z_x)$. From (2), we thus achieve,

$y|x = f_\theta^{-1}\big(f_\theta(x) + u\big)\,, \quad u \sim \mathcal{N}(\mu_u, \Sigma_u)$ .   (3)

In order to learn the model with a likelihood-based objective (1), we require a differentiable expression of the marginal probability densities $p_x(x)$ and $p_y(y)$. The invertible normalizing flow $f_\theta$ allows us to apply the change of variables formula in order to achieve the expressions,

$p_x(x) = \left|\det \tfrac{\partial f_\theta}{\partial x}(x)\right| \, \mathcal{N}\!\big(f_\theta(x); 0, I\big)$   (4a)
$p_y(y) = \left|\det \tfrac{\partial f_\theta}{\partial y}(y)\right| \, \mathcal{N}\!\big(f_\theta(y); \mu_u, I + \Sigma_u\big)$ .   (4b)

In both cases, the first factor is given by the determinant of the Jacobian of the flow network. The second factor stems from the Gaussian latent space distribution of $z_x$ and $z_y$, respectively. Using (3), we can in a similar fashion derive the conditional density as,

$p_{y|x}(y|x) = \left|\det \tfrac{\partial f_\theta}{\partial y}(y)\right| \, \mathcal{N}\!\big(f_\theta(y); f_\theta(x) + \mu_u, \Sigma_u\big)$ .   (5)

Using (4), our model can be learned by minimizing the negative log-likelihood of the marginals (1) in the unpaired setting. Furthermore, the conditional likelihood (5) enables the use of paired samples $(x, y)$, if available. Our approach can thus operate in both the paired and the unpaired setting.

It is worth noting that the 1D Gaussian example presented in Sec. 3.1 is retrieved as a special case of our model by setting the flow to an affine map $f(x) = ax + b$. A deep flow thus generalizes our initial example beyond the Gaussian case. Crucially, a general flow network can capture complex correlations and dependencies in the data. In the case of modeling complex image degradations, our formulation has a particularly intuitive interpretation. In the image space, the degradation operation can follow a complex and signal-dependent distribution. Our approach therefore learns a bijection $f_\theta$ that maps the image data to a space where the degradation can be modeled by additive Gaussian noise $u$. This consequence is most easily seen by studying (3), which implements the stochastic degradation for our model. The clean data $x$ is first mapped to the latent space as $z_x = f_\theta(x)$ and then corrupted by the random Gaussian ‘noise’ $u$. The degraded image $y$ is then reconstructed by inverting the learned mapping, $y = f_\theta^{-1}(z_x + u)$.

Lastly, we also note that the proposed model achieves conditioning through a very different mechanism compared to the conventional conditional flows [35, 27, 5, 1]. The latter works learn a flow network $f_\theta(y; x)$ that is directly conditioned on $x$. Thus, a generative model of $x$ is not learned. Moreover, these methods require paired data for training since the flow network simultaneously inputs both $x$ and $y$ in order to compute the conditional likelihood. In contrast, our approach learns the full joint distribution and relies on an unconditional flow network. The conditioning is instead performed by our latent space model (2). However, next we show that our approach can further benefit from the conventional technique for conditional flows, without sacrificing the ability of unpaired learning.
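To illustrate the mechanics of (3) and (4), the following PyTorch-style sketch implements the unpaired objective and the conditional sampler for the simplified case of a diagonal $\Sigma_u$. Here `flow` is a placeholder for any invertible network that returns $(z, \log|\det J|)$ and exposes an `inverse` method; all names and signatures are our assumptions, not the authors' implementation.

```python
import torch

def unpaired_nll(flow, x, y, mu_u, log_sigma_u):
    """Negative log-likelihood of the marginals (1), using the densities (4).
    Assumes a diagonal Sigma_u = diag(sigma_u^2) for simplicity."""
    z_x, logdet_x = flow(x)  # should match N(0, I)
    z_y, logdet_y = flow(y)  # should match N(mu_u, I + Sigma_u)
    var_y = 1.0 + torch.exp(2.0 * log_sigma_u)
    nll_x = 0.5 * (z_x ** 2).flatten(1).sum(1) - logdet_x
    nll_y = 0.5 * (((z_y - mu_u) ** 2) / var_y
                   + torch.log(var_y)).flatten(1).sum(1) - logdet_y
    return (nll_x + nll_y).mean()  # additive 2*pi constants omitted

def degrade(flow, x, mu_u, log_sigma_u):
    """Sample y ~ p(y|x) via (3): encode, add the Gaussian shift, decode."""
    z_x, _ = flow(x)
    u = mu_u + torch.exp(log_sigma_u) * torch.randn_like(z_x)
    return flow.inverse(z_x + u)
```

Both functions use only the marginal samples or a single clean image, which is precisely what makes the formulation applicable in the unpaired setting.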

3.3 Domain Invariant Conditioning

The formulation presented in Sec. 3.2 requires learning the full marginal distributions $p_x(x)$ and $p_y(y)$. For image data, this is a difficult task, requiring large model capacity and large datasets. In this section, we therefore propose a further generalization of our formulation, which effectively circumvents the need for learning the full marginals. This allows the network to focus on more accurate learning of the conditional distribution $p(y|x)$.

Our approach is based on conditioning our model on auxiliary information $h(x)$ or $h(y)$. Here, $h$ represents a known mapping from the observation space to a conditional variable. We use the conventional technique for creating conditional flows [35, 27, 5] by explicitly inputting $h$ into the individual layers of the flow network (as detailed in Sec. 4.1). The flow $f_\theta(\cdot\,; h)$ is thus a function that is invertible only in the first argument. Instead of the marginal distributions in (4), our approach thus models the conditional densities $p_{x|h}(x|h(x))$ and $p_{y|h}(y|h(y))$. Since $h$ is a known function, we can still learn both densities without any paired data. Importantly, learning $p_{y|h}(y|h(y))$ is an easier problem, requiring the modeling only of the information not present in $h(y)$.

In order to ensure unpaired learning of the conditional distribution $p(y|x)$, the map $h$ must satisfy an important criterion. Namely, that $h$ only extracts domain invariant information about the sample. Formally, this is written as,

$h(x) = h(y)\,, \quad (x, y) \sim p(x, y)$ .   (6)

It is easy to verify the existence of such a function by taking $h(x) = 0$ for all $x$. This choice, where $h$ carries no information about the input sample, retrieves the formulation presented in Sec. 3.2 as a special case. Intuitively, we wish to find a function $h$ that preserves the most information about the input, without violating the domain invariance condition (6). Since the joint distribution is unknown, strictly ensuring (6) is a difficult problem. In practice, however, we only need to satisfy domain invariance to the degree where it cannot be exploited by the flow network $f_\theta$. The conditioning function $h$ can thus be set empirically by gradually reducing its preserved information. We detail strategies for designing the conditioning for the case of image degradation learning in Sec. 4.2.

The formulation in Sec. 3.2 is easily generalized to the case that includes the domain invariant conditioning by simply extending the flow network as $z_x = f_\theta(x; h(x))$ and $z_y = f_\theta(y; h(y))$. To avoid repetition, we include a detailed derivation of the generalized formulation in the appendix (Sec. C). Specifically, the model is learned by minimizing the negative log-likelihood conditioned on $h$,

$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_{x|h}\big(x_i \,|\, h(x_i); \theta\big) - \sum_{j=1}^{M} \log p_{y|h}\big(y_j \,|\, h(y_j); \theta\big)$ .   (7)

During inference, we sample from the conditional distribution $p(y|x)$ using,

$y|x = f_\theta^{-1}\big(f_\theta(x; h(x)) + u;\; h(x)\big)\,, \quad u \sim \mathcal{N}(\mu_u, \Sigma_u)\,,$   (8)

where the domain invariance (6) allows conditioning on $h(x)$ in place of $h(y)$.

We visualize our DeFlow formulation in Figure 2.

Figure 2: Visualization of DeFlow: the shared conditional flow model $f_\theta$ encodes inputs $x$ and $y$, enforcing an additive Gaussian shift relationship in the latent space. The domain invariant conditional feature extractor $h$ and the invertibility of $f_\theta$ allow stochastic domain translation as presented in (8).

4 Learning Image Degradations with DeFlow

In this section we discuss the application of our flow-based unpaired learning formulation to the problem of generating complex image degradations. We detail the model architecture used by DeFlow and explain our approach for obtaining domain invariant conditioning in this setting.

4.1 Model Architecture

Flow models are generally implemented as a composition of invertible layers. Let $f_\theta^{n}$ denote the $n$-th layer. Then the model can be expressed recursively as

$h^{n} = f_\theta^{n}(h^{n-1})\,, \quad n = 1, \ldots, N\,,$   (9)

where $h^{0} = x$, $h^{N} = z_x$, and the remaining $h^{n}$ represent intermediate feature maps. By the chain rule, (4) gives

$\log p_x(x) = \log \mathcal{N}\big(z_x; 0, I\big) + \sum_{n=1}^{N} \log \left|\det \tfrac{\partial f_\theta^{n}}{\partial h^{n-1}}\big(h^{n-1}\big)\right|\,,$   (10)

allowing for efficient log-likelihood optimization.
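A minimal sketch of this composition, assuming each layer is a callable that returns its output together with the log-determinant of its Jacobian (an interface we assume for illustration):

```python
import math
import torch

def flow_forward(layers, x):
    """Compose the invertible layers of (9) while accumulating the
    per-layer log-determinant terms that (10) sums up."""
    h = x
    total_logdet = torch.zeros(x.shape[0], device=x.device)
    for layer in layers:
        h, logdet = layer(h)  # h^{n-1} -> (h^n, log|det Jacobian|)
        total_logdet = total_logdet + logdet
    z = h  # h^N = z_x
    d = z[0].numel()
    log_pz = -0.5 * (z ** 2).flatten(1).sum(1) - 0.5 * d * math.log(2 * math.pi)
    return z, log_pz + total_logdet  # log p_x(x) as in (10)
```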

We parametrize the distribution of $u$ in (2) with a mean $\mu_u$ and a weight matrix $M$, such that $u = \mu_u + M w$, where $w \sim \mathcal{N}(0, I)$ is a standard Gaussian. Consequently, the covariance is given by $\Sigma_u = M M^T$. To ensure spatial invariance, we use the same parameters $\mu_u$ and $M$ at each spatial location in the latent space. We initialize both $\mu_u$ and $M$ to zero, thus ensuring that $z_x$ and $z_y$ initially follow the same distribution.
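A sketch of this parametrization (class and variable names are ours):

```python
import torch
import torch.nn as nn

class LatentShift(nn.Module):
    """u = mu_u + M w with w ~ N(0, I), so that Sigma_u = M M^T.
    One mean vector and one matrix M are shared across all spatial
    locations; both are zero-initialized so z_y = z_x at the start."""
    def __init__(self, channels):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(channels))
        self.M = nn.Parameter(torch.zeros(channels, channels))

    def sample(self, z_x):
        # w has the same (B, C, H, W) shape as the latent z_x
        w = torch.randn_like(z_x)
        # apply M along the channel dimension at every spatial position
        u = torch.einsum('dc,bchw->bdhw', self.M, w)
        return u + self.mu.view(1, -1, 1, 1)
```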

Our DeFlow formulation for unsupervised conditional modeling can in principle be integrated into any conditional flow architecture. We start from the recent SRFlow [27] network architecture, which itself is based on the unconditional Glow [21] and RealNVP [11] models. We use a multi-level network. Each level starts with a squeeze operation that halves the resolution. It is followed by a sequence of flow steps, each consisting of four different layers. The level ends with a split, which removes a fraction of the activations as a latent variable. The number of flow steps per level is kept the same across our experiments, unless specified otherwise. Next, we give a brief description of each layer in the architecture and discuss our modifications. Please see [27, 21] for details.

Conditional Affine Coupling [27]:  extends the affine coupling layer from [11] to the conditional setting. The input feature map is split into two distinct subsets along the channel dimension. From the first subset and the conditional $h$, a scaling and a bias are computed using an arbitrary neural network. These are then applied to the other subset, providing an easily invertible yet flexible learnable transformation.
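A minimal self-contained version of such a coupling layer is sketched below; the hidden width and the sigmoid scale activation are illustrative choices rather than the exact SRFlow design, and the conditional `cond` is assumed to be resized to the spatial size of the feature map.

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """Affine coupling conditioned on an external feature map (sketch)."""
    def __init__(self, channels, cond_channels, hidden=64):
        super().__init__()
        self.half = channels // 2
        self.net = nn.Sequential(
            nn.Conv2d(self.half + cond_channels, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, 2 * (channels - self.half), 3, padding=1))

    def forward(self, x, cond):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        scale, bias = self.net(torch.cat([x1, cond], 1)).chunk(2, dim=1)
        scale = torch.sigmoid(scale + 2.0)  # keep the scaling positive, stable
        y2 = scale * x2 + bias
        logdet = torch.log(scale).flatten(1).sum(1)
        return torch.cat([x1, y2], 1), logdet

    def inverse(self, y, cond):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        scale, bias = self.net(torch.cat([y1, cond], 1)).chunk(2, dim=1)
        scale = torch.sigmoid(scale + 2.0)
        x2 = (y2 - bias) / scale
        return torch.cat([y1, x2], 1), -torch.log(scale).flatten(1).sum(1)
```

Since the first subset passes through unchanged, inversion only requires re-computing the same scale and bias from it, which is what makes the transformation cheap to invert despite the arbitrary network inside.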

Affine Injector [27]:  computes an individual scaling and bias for each entry of the input feature map from the conditional $h$. The function computing the scaling and bias is not required to be invertible, enabling $h$ to have direct influence on all channels.

Invertible 1x1 Convolution [21]:  multiplies each spatial location with an invertible matrix. We found the LU-decomposed parametrization [21] to improve stability and conditioning of the model.

Actnorm [21]:  learns a channel-wise scaling and shift to normalize intermediate feature maps.

Flow Step:  is the block of flow layers that is repeated throughout the network. Each flow step contains the four layers mentioned above. First, an Actnorm is applied, followed by the Invertible 1x1 Convolution, the Conditional Affine Coupling, and the Affine Injector. Note that the last two layers are applied not only in reverse order but also in their inverted form compared to the Flow Step in SRFlow [27].

Feature extraction network:  we encode the domain-invariant conditional information $h$ using the same low-resolution image encoder employed by SRFlow. It consists of a modified Residual-in-Residual Dense Block (RRDB) model [32]. For our experiments, we initialize it with pretrained weights provided by the authors of [32]. Note that although this network was originally intended for super-resolution, it is here employed for an entirely different task, namely to encode domain-invariant information for image degradation learning.

4.2 Domain-Invariant Mapping

The goal of our domain-invariant conditioning is to provide image information to the flow network, while hiding the domain of the input image. In our application, the domain invariance (6) implies that the mapping $h$ needs to remove any information that could reveal whether the input is a clean or a degraded image. On the other hand, we want to preserve information about the underlying image content to simplify learning. We accomplish this by utilizing prior assumptions that are valid for most stochastic degradations. Namely, that they mostly affect the high frequencies in the image, while preserving the low frequencies.

We construct $h$ by down-sampling the image to a sufficient extent to remove the visible impact of the degradations. We found it beneficial to also add a small amount of noise to the resulting image to hide remaining traces of the original degradation. The domain invariant mapping is thus constructed as $h(x) = d(x) + n$, where $d$ denotes bicubic downsampling and $n$ is Gaussian noise. Note that this operation is only performed to extract a domain-invariant representation, and is not related to the degradation learned by DeFlow. The purpose of $h$ is to remove the original degradation, while preserving image content.
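A sketch of this mapping; the downsampling factor and noise level are placeholders, as the paper tunes both per dataset (Sec. 5.4):

```python
import torch
import torch.nn.functional as F

def domain_invariant_h(img, factor=4, noise_std=0.03):
    """h(x) = d(x) + n: bicubic downsampling plus a little Gaussian noise.
    `factor` and `noise_std` are illustrative values, not the paper's."""
    # downsampling removes the (high-frequency) traces of the degradation ...
    lr = F.interpolate(img, scale_factor=1.0 / factor,
                       mode='bicubic', align_corners=False)
    # ... and a small amount of noise hides whatever remains
    return lr + noise_std * torch.randn_like(lr)
```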

Figure 3: Super-resolved images from the AIM-RWSR (top), NTIRE-RWSR (mid) and DPED-RWSR (bottom) datasets. Top-5 methods are shown based on LPIPS score for the synthetic datasets and the visual judgement of the authors for the DPED-RWSR dataset.

5 Experiments and Results

We validate the performance of the degradations learned by DeFlow by applying them to the problem of real-world super-resolution (RWSR). This entails training a model for joint image restoration and super-resolution in the unpaired setting. Our DeFlow is first employed to learn the underlying degradation model. We then use this model to generate paired training data for a supervised super-resolution model. Experiments are performed on three recent benchmark datasets designed for the unpaired setting. Detailed results and more visual examples are available in the appendix (Sec. D).

5.1 Datasets

AIM-RWSR:  The AIM 2019 real-world super-resolution challenge [24] provides a dataset consisting of a source and a target domain. The former contains synthetically degraded images that feature an unknown combination of noise and compression. The latter contains images with a similar corruption (Track 1) or high-quality clean images (Track 2). The task is to super-resolve images from the source domain such that they look like images from the target domain. Since the degradations are generated synthetically, there exists a validation set of 100 paired degraded and ground-truth images, allowing thorough benchmarking through reference-based evaluation metrics. Since we are interested in the restoration task, we solely use the Track 2 setting of a clean target domain.

NTIRE-RWSR:  Track 1 of the NTIRE 2020 super-resolution challenge [26] follows the same setting as Track 2 of the AIM 2019 challenge. However, it features a completely different type of degradation, namely highly correlated high-frequency noise. We refer to this dataset as the NTIRE-RWSR dataset. As before, a validation set exists, enabling a reference-based evaluation.

DPED-RWSR:  Differing from the other two datasets, the source domain of Track 2 of the NTIRE 2020 RWSR challenge consists of real low-quality smartphone photos. The task is to jointly restore and super-resolve these images. A high-quality target domain dataset is also provided. The source data stems from the iPhone 3 images of the DPED dataset [16], while the target data corresponds to the DIV2K [4] training set. Because no reference images exist, evaluation can only be done with no-reference metrics and by visual inspection.

5.2 Evaluation Metrics

For the synthetic datasets, we report the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [33]. In addition, we compute the LPIPS [38] distance, a learned reference-based image quality metric that uses feature distances from deep CNNs and has been shown to correlate well with human-perceived image quality. For the DPED dataset, we instead report the NIQE [29], BRISQUE [28], and PIQE [30] no-reference metrics. While these metrics are often a good indicator of a model's performance, they can still deviate considerably from the human-perceived image quality. Thus, we conduct a user study for the best performing models using Amazon Mechanical Turk, where we ask the participants to rank images by their quality. The ground-truth image for the synthetic datasets and the low-resolution source image for the DPED dataset are given as a visual reference. We report the mean opinion rank (MOR), computed as the mean rank assigned to each method.

5.3 Baselines and other Methods

We compare DeFlow against the challenge winners, namely Impressionism [17], which won the NTIRE 2020 real-world super-resolution challenge [26], and Frequency Separation [12], the winner of the AIM 2019 RWSR challenge [24]. Further, we compare against the very recent DASR [34] and the CycleGAN-based method introduced in [24]. All these methods apply a two-stage approach, where a degradation model is first used to generate synthetic training data, which is then used to train a supervised super-resolution model. All aforementioned approaches employ the popular ESRGAN super-resolution model [32]. We also validate the performance against simple baselines. All baselines employ the ESRGAN super-resolution network with the same training settings. Our No Degradation model is trained without any degradations applied to the low-resolution images. The White Noise model adds white Gaussian noise to the low-resolution image. As the performance of this model depends on the strength of the noise, we first tuned the standard deviation for each dataset and chose the model with the best LPIPS score.

5.4 Training Details

We train all DeFlow models for 100k iterations using the Adam [20] optimizer. The initial learning rate is set separately for the synthetic datasets and for the DPED dataset, and is halved at 50k, 75k, 90k, and 95k iterations. We use a batch size of 8, containing fixed-size random crops, on the AIM-RWSR and NTIRE-RWSR datasets. On DPED-RWSR, we obtained better performance by reducing the patch size while increasing the batch size. Batches are sampled randomly such that images of both domains are drawn equally often. Random flips are used as a data augmentation. The number of hidden channels in the affine injector layers is chosen per dataset. Similar to [21, 27], we apply a 5-bit de-quantization by adding uniform noise to the input of the flow model during training. We train the DeFlow models using the bicubic downsampled clean data as $\mathcal{X}$ and the noisy dataset as $\mathcal{Y}$. To increase the amount of training data, we include bicubic downsampled noisy images in the clean domain for the DPED-RWSR dataset. For DPED-RWSR, we further follow the approach of [17] and estimate the blur kernels of the degraded domain using KernelGAN [6]. We then apply them to any data from the clean domain, i.e. on the clean training data and before degrading images. On the AIM-RWSR dataset, we normalize the noisy data such that it has the same channel-wise mean and standard deviation as the clean domain, and de-normalize degraded images again before employing them as training data. For the conditional $h$, we use a small amount of added noise and a bicubic downsampling factor set per dataset, with one setting shared by NTIRE-RWSR and DPED-RWSR and another for AIM-RWSR.
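The optimization schedule can be realized with standard PyTorch components; the initial learning rate below is a placeholder, since the per-dataset values are set separately as described above.

```python
import torch

model_params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for DeFlow
optimizer = torch.optim.Adam(model_params, lr=1e-4)  # lr: placeholder value
# Halve the learning rate at 50k, 75k, 90k and 95k of the 100k iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50_000, 75_000, 90_000, 95_000], gamma=0.5)

for step in range(100_000):
    # ... one training iteration minimizing the conditional NLL (7) ...
    optimizer.step()
    scheduler.step()
```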

Method                        PSNR↑   SSIM↑   LPIPS↓   MOR↓
No Degradation [32]           21.82   0.56    0.514    -
White Noise                   22.43   0.65    0.406    4.182
Impressionism† [17]           22.54   0.63    0.420    4.237
Frequency Separation† [12]    20.47   0.52    0.394    -
DASR† [34]                    21.16   0.57    0.370    -
DeFlow                        22.25   0.62    0.349    3.317
DASR [34]                     21.79   0.58    0.346    4.285
Frequency Separation [12]     21.00   0.50    0.403    3.855
CycleGAN [24]                 21.19   0.53    0.476    -
Table 1: Comparison with state-of-the-art on AIM-RWSR. † marks degradation pipelines retrained with our super-resolution setup (Sec. 5.5).
Method                        PSNR↑   SSIM↑   LPIPS↓   MOR↓
No Degradation [32]           20.59   0.34    0.659    -
White Noise                   24.39   0.69    0.253    3.643
Frequency Separation† [12]    23.04   0.59    0.332    4.463
Impressionism† [17]           25.03   0.70    0.226    -
DeFlow                        25.87   0.71    0.218    3.247
Impressionism [17]            24.77   0.67    0.227    3.253
CycleGAN [24]                 24.75   0.70    0.417    5.257
Table 2: Comparison with state-of-the-art on NTIRE-RWSR. † marks degradation pipelines retrained with our super-resolution setup (Sec. 5.5).
Method                        NIQE↓   BRISQUE↓   PIQE↓   MOR↓
No Degradation [32]           3.55    24.56      8.01    -
KernelGAN [6]                 6.37    42.74      30.32   -
CycleGAN [24]                 5.47    49.19      86.83   -
Frequency Separation [12]     3.27    22.73      11.88   2.502
Impressionism [17]            4.12    23.24      14.09   1.590
DeFlow                        3.42    21.13      15.84   1.930
Table 3: Comparison with state-of-the-art on DPED-RWSR.

5.5 Super-Resolution Model

To fairly compare with existing approaches, we use the same ESRGAN [32] as the super-resolution model. Specifically, we employ the training code provided by the authors of Impressionism [17]. This trains a standard ESRGAN for 60k iterations. For AIM-RWSR and NTIRE-RWSR the standard VGG discriminator is used, while on DPED-RWSR a patch discriminator is applied. As in [17], we use the down-sampled smartphone images of the DPED-RWSR dataset as clean images and do not use the provided high-quality data. Unlike [17], however, we do not use any downsampled noisy images as additional clean training data. We evaluate the trained models after 10k, 20k, 40k, and 60k iterations on the validation set and use the model with the best LPIPS. For DPED-RWSR we simply choose the final model. To better isolate the impact of the learned degradations, we further report the performance of other methods when using their degradation pipeline with our super-resolution model. We mark these models with the symbol †.

5.6 Comparison with State-of-the-Art

First, we compare the results of our approach on the AIM-RWSR dataset, which features particularly strong and complex degradations. Results are shown in Table 1. The GAN-based Frequency Separation approach [12], the winner of this dataset's challenge, obtains an LPIPS similar to the baseline White Noise model. DASR [34] obtains a highly competitive LPIPS, but the worst result in our user study. This can be explained by overfitting, as DASR directly employs LPIPS as a training objective in the super-resolution model. In fact, as shown in the visual results in Fig. 3, DASR generates strong artifacts. Moreover, we independently evaluate the degradation model of DASR by training it with our super-resolution training pipeline. The resulting model DASR† obtains an LPIPS of 0.370, compared to 0.349 for our DeFlow. Notably, our DeFlow outperforms all previous methods by a large margin in terms of the human perceptual study. Our approach also obtains higher PSNR and SSIM than the GAN-based competitors.

On the NTIRE-RWSR dataset (see Table 2), DeFlow obtains the best scores among all reference metrics, significantly outperforming the second-best method. DeFlow also obtains the best result according to the human study, outperforming the NTIRE 2020 RWSR challenge winner Impressionism [17]. Note that the latter employs a hand-crafted noise generation pipeline, specifically designed for this dataset. It is also worth noting that CycleGAN [24], despite its immense popularity for unpaired learning, does not perform well for this task. This can be partly explained by the weak cycle-consistency constraint and the use of a deterministic generator.

Lastly, we compare the results on the DPED-RWSR dataset in Table 3. We report no-reference metrics in Table 3; however, similar to [26], we find that these do not correlate well with the perceived quality of the images. As shown in Figure 3, DeFlow obtains sharp images with pleasing details, clearly outperforming all other learned approaches. Compared to Impressionism [17], we find that our method produces fewer artifacts and does not over-smooth textures. However, we notice that our images retain more noise and are sometimes less sharp. This is supported by the user study, where DeFlow significantly outperforms the Frequency Separation method [12], while placing second to Impressionism [17]. However, we again want to emphasize that our proposed DeFlow method is entirely learned from unpaired data, while Impressionism employs a handcrafted approach specifically tuned to this dataset. The limitation of their hand-crafted approach becomes apparent on the AIM-RWSR dataset, where it is even outperformed by the Gaussian white noise baseline.

Configuration                       PSNR↑   SSIM↑   LPIPS↓
Flow Steps (reduced)                22.18   0.61    0.362
Flow Steps (reduced)                22.20   0.63    0.355
Flow Steps (final)                  22.25   0.62    0.349
Downsampling in h (smaller factor)  22.44   0.61    0.429
Downsampling in h (final)           22.25   0.62    0.349
Downsampling in h (larger factor)   21.52   0.61    0.352
No Conditional                      18.33   0.52    0.412
Uncorrelated Shift                  22.04   0.61    0.359
Unlearned Shift                     22.04   0.62    0.405
Learned Correlated Shift (final)    22.25   0.62    0.349
Table 4: Ablation study of DeFlow on the AIM-RWSR dataset. The final settings of the DeFlow model for AIM-RWSR are marked with (final).

5.7 Ablation Study

In this section, we analyze DeFlow through an ablation study. For each experiment, we train a separate model and evaluate its downstream performance on the AIM-RWSR dataset using the same settings as in the previous experiments. We scrutinize three core aspects, namely the depth of the model, the choice of conditioning, and the way of learning the domain shift. For each aspect, we show the results of our study in a separate section of Table 4.

Network depth (Tab. 4, top):  Reducing the number of Flow Steps negatively impacts performance, showing that powerful networks are indeed required to learn the complicated degradations of the data.

Conditioning (Tab. 4, middle):  Next, we analyze the impact of our domain invariant conditioning (Sec. 3.3). The smallest downsampling factor in the conditioning yields noticeably worse performance compared to the larger downsampling factors. We thus conclude that this is too little downsampling to remove all domain information from the conditional $h$. Notably, the largest downsampling factor yields only a slight reduction in performance compared to our standard setting. In contrast, no conditional information at all, i.e. $h(x) = 0$, leads to a significantly worse performance. Degraded data from this model exhibits strong color shifts and blur. This highlights the importance of conditional information and shows that even little auxiliary information yields drastic performance improvements.

Learned shift (Tab. 4, bottom):  Here, we analyze our latent space formulation. We first restrict the added noise $u$ to be uncorrelated across the channels, i.e. $\Sigma_u$ is constrained to be a diagonal covariance matrix. We notice a negative impact on performance. This demonstrates the effectiveness of our more general Gaussian latent space model. Further, we validate our choice of using domain dependent base distributions. We train a DeFlow model with a standard normal Gaussian as the base distribution for both domains (i.e. removing the learned shift $u$ in (2)). To infer the domain shift, we then compute the channel-wise mean and covariance matrix over the datasets of each domain in the latent space after training. We observe that the empirical distributions of both domains become very similar and that the inferred shift does not model the degradations faithfully. This results in a substantially worse performance on the downstream super-resolution task and further shows the potential of our novel unpaired learning formulation.

6 Conclusion

In this work, we proposed DeFlow, a novel method for applying conditional flow networks to unpaired learning settings. We apply this method to generate synthetic training data for the downstream task of real-world super-resolution. Despite being radically different from current approaches in RWSR, which mostly rely on adversarial training, the degradations learned by DeFlow provide significant performance improvements and achieve state-of-the-art results.

Acknowledgements  This work was partly supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, a Microsoft Azure grant, and an Nvidia hardware grant.

Appendix

In this appendix, we first derive the closed-form solution of the 1-dimensional Gaussian case from Sec. 3.1. We then show in Sec. B that restricting $z_x$ to a standard normal distribution is absorbed by a single affine layer in the flow. Next, we provide a derivation of the DeFlow method with domain invariant conditioning in Sec. C. We then show in Sec. D that degradations generated by DeFlow are stochastic and can be sampled at varying strengths. Further, we provide a visual comparison of the degradations and the downstream real-world super-resolution (RWSR) performance in Sec. E. Lastly, we give insight into the set-up of the conducted user study in Sec. F.

Appendix A Closed-Form Solution of the 1-Dimensional Gaussian Case

We first present a detailed derivation of the solution to the 1-dimensional Gaussian example from Sec. 3.1. To recall, we are given two datasets $\{x_i\}_{i=1}^{N}$ and $\{y_j\}_{j=1}^{M}$. We know that the $x_i$ are i.i.d. samples from $x \sim \mathcal{N}(\mu_x, \sigma_x^2)$. Further, we know that the $y_j$ are i.i.d. samples from $y = x + u$ with additive independent Gaussian noise $u \sim \mathcal{N}(\mu_u, \sigma_u^2)$.

The task is to find the parameters $\theta = (\mu_x, \sigma_x^2, \mu_u, \sigma_u^2)$ that jointly maximize the marginal likelihoods of $\{x_i\}$ and $\{y_j\}$.

Proceeding as usual, we apply the i.i.d. property and minimize the negative log-likelihood with respect to $\theta$,

$\mathcal{L}(\theta) = \frac{N}{2}\log(2\pi\sigma_x^2) + \sum_{i=1}^{N}\frac{(x_i - \mu_x)^2}{2\sigma_x^2} + \frac{M}{2}\log\big(2\pi(\sigma_x^2 + \sigma_u^2)\big) + \sum_{j=1}^{M}\frac{(y_j - \mu_x - \mu_u)^2}{2(\sigma_x^2 + \sigma_u^2)}$ , subject to $\sigma_x^2 \geq 0\,, \; \sigma_u^2 \geq 0$ .   (11)

To ensure the estimated variances are non-negative, i.e. $\sigma_x^2 \geq 0$ and $\sigma_u^2 \geq 0$, we introduce the Lagrange multipliers $\lambda_x$ and $\lambda_u$ and have

$\Lambda(\theta, \lambda_x, \lambda_u) = \mathcal{L}(\theta) - \lambda_x \sigma_x^2 - \lambda_u \sigma_u^2$ .   (12)

By the Karush–Kuhn–Tucker theorem, $\hat\theta$ is an optimal solution to (11) if $\nabla_\theta \Lambda(\hat\theta, \lambda_x, \lambda_u) = 0$ while $\lambda_x \geq 0$, $\lambda_u \geq 0$, $\lambda_x \hat\sigma_x^2 = 0$, and $\lambda_u \hat\sigma_u^2 = 0$ hold.

Next, we take partial derivatives of $\Lambda$ with respect to the individual parameters and set them to $0$ to obtain the optimal estimates. First, we differentiate with respect to the means $\mu_x$ and $\mu_u$, and obtain

$\frac{\partial \Lambda}{\partial \mu_x} = -\sum_{i=1}^{N}\frac{x_i - \mu_x}{\sigma_x^2} - \sum_{j=1}^{M}\frac{y_j - \mu_x - \mu_u}{\sigma_x^2 + \sigma_u^2} = 0$   (13)
$\frac{\partial \Lambda}{\partial \mu_u} = -\sum_{j=1}^{M}\frac{y_j - \mu_x - \mu_u}{\sigma_x^2 + \sigma_u^2} = 0$   (14)

It directly follows that the optimal estimates of $\mu_x$ and $\mu_u$ can be written in terms of the empirical means $\bar{x} = \frac{1}{N}\sum_{i} x_i$ and $\bar{y} = \frac{1}{M}\sum_{j} y_j$,

$\hat\mu_x = \bar{x}$   (15)
$\hat\mu_u = \bar{y} - \bar{x}$   (16)

Now we turn to the estimation of the variances. We first obtain the following partial derivatives

$\frac{\partial \Lambda}{\partial \sigma_x^2} = \frac{N}{2\sigma_x^2} - \sum_{i=1}^{N}\frac{(x_i - \mu_x)^2}{2\sigma_x^4} + \frac{M}{2(\sigma_x^2 + \sigma_u^2)} - \sum_{j=1}^{M}\frac{(y_j - \mu_x - \mu_u)^2}{2(\sigma_x^2 + \sigma_u^2)^2} - \lambda_x$   (17)
$\frac{\partial \Lambda}{\partial \sigma_u^2} = \frac{M}{2(\sigma_x^2 + \sigma_u^2)} - \sum_{j=1}^{M}\frac{(y_j - \mu_x - \mu_u)^2}{2(\sigma_x^2 + \sigma_u^2)^2} - \lambda_u$   (18)

Setting (17) to $0$, substituting (18) set to $0$ for the $y$-dependent terms, and using the complementary slackness condition $\lambda_x \hat\sigma_x^2 = 0$ that must hold at a minimum, we obtain

$\frac{N}{2\sigma_x^2} - \frac{N\bar\sigma_x^2}{2\sigma_x^4} + \lambda_u - \lambda_x = 0$   (19)
$\sigma_x^2 - \bar\sigma_x^2 + \frac{2\sigma_x^4}{N}(\lambda_u - \lambda_x) = 0$   (20)
$\lambda_x = 0 \quad (\text{since } \sigma_x^2 > 0)$   (21)
$\hat\sigma_x^2 = \bar\sigma_x^2 - \frac{2\hat\sigma_x^4}{N}\lambda_u$   (22)

where $\bar\sigma_x^2 = \frac{1}{N}\sum_{i}(x_i - \bar{x})^2$ is used as short-hand notation for the empirical variance of $\{x_i\}$.

Similarly, we set (18) to $0$. We first define the empirical variance of $\{y_j\}$ as $\bar\sigma_y^2 = \frac{1}{M}\sum_{j}(y_j - \bar{y})^2$. By using the complementary slackness condition $\lambda_u \hat\sigma_u^2 = 0$ and the fact that $\sum_{j}(y_j - \hat\mu_x - \hat\mu_u)^2 = M\bar\sigma_y^2$, we achieve

$\frac{M}{2(\sigma_x^2 + \sigma_u^2)} - \frac{M\bar\sigma_y^2}{2(\sigma_x^2 + \sigma_u^2)^2} - \lambda_u = 0$   (23)
$1 - \frac{\bar\sigma_y^2}{\sigma_x^2 + \sigma_u^2} = \frac{2(\sigma_x^2 + \sigma_u^2)}{M}\lambda_u$   (24)
$(\sigma_x^2 + \sigma_u^2) - \bar\sigma_y^2 = \frac{2(\sigma_x^2 + \sigma_u^2)^2}{M}\lambda_u$   (25)
$\hat\sigma_x^2 + \hat\sigma_u^2 = \bar\sigma_y^2 + \frac{2(\hat\sigma_x^2 + \hat\sigma_u^2)^2}{M}\lambda_u$   (26)

Finally, the complementary slackness condition $\lambda_u \hat\sigma_u^2 = 0$ leaves us with two cases to consider: (1) $\lambda_u = 0$ and (2) $\hat\sigma_u^2 = 0$. In the former case, it directly follows from (22) and then (26) that

Case 1: $\lambda_u = 0$   (27)
$\hat\sigma_x^2 = \bar\sigma_x^2$   (28)
$\hat\sigma_u^2 = \bar\sigma_y^2 - \bar\sigma_x^2$   (29)

In the case of $\hat\sigma_u^2 = 0$, we first obtain from (22) that

$\lambda_u = \frac{N(\bar\sigma_x^2 - \hat\sigma_x^2)}{2\hat\sigma_x^4}$   (30)

Inserting this into (26) gives the desired solution for $\hat\sigma_x^2$ as

Case 2: $\hat\sigma_u^2 = 0$   (31)
$\hat\sigma_x^2 = \bar\sigma_y^2 + \frac{N}{M}\big(\bar\sigma_x^2 - \hat\sigma_x^2\big)$   (32)
$\hat\sigma_x^2 = \frac{N\bar\sigma_x^2 + M\bar\sigma_y^2}{N + M}$   (33)

The second case thus corresponds to the solution where $u = \mu_u$ is an unknown constant. It applies when $\bar\sigma_y^2 < \bar\sigma_x^2$, where (29) would otherwise yield a negative variance.

Appendix B Closed-Form Solution for the 1-Dimensional Gaussian Case using DeFlow with a Single Affine Layer

In our proposed DeFlow method, we restrict the base distribution of $z_x$ to be standard normal, $z_x \sim \mathcal{N}(0, 1)$, while keeping $z_y = z_x + u$ with $u \sim \mathcal{N}(\mu_u, \sigma_u^2)$. We show that a flow consisting of a single affine layer is able to obtain an optimal solution for the 1-dimensional Gaussian setting from the previous section under this restriction. To do so, we simply set

$f(x) = \frac{x - \hat\mu_x}{\hat\sigma_x}$ ,   (34)

where $\hat\mu_x$ and $\hat\sigma_x$ are the optimal estimates obtained in the previous section. Intuitively, we can interpret the single-layer flow as a learned normalization layer that ensures a standard normal distribution in the latent space. To recover the optimal parameters $\mu_u$ and $\sigma_u^2$ of $u$, we need to adjust the optimal values retrieved in the previous section according to this normalization and obtain

$\hat\mu_u' = \frac{\hat\mu_u}{\hat\sigma_x}\,, \quad \hat\sigma_u'^2 = \frac{\hat\sigma_u^2}{\hat\sigma_x^2}$ .   (35)

This shows that the restriction of $z_x$ to be standard normal simply leads to an absorption of the required normalization into an affine layer of the flow model.

Appendix C Derivation of the Domain Invariant Conditional DeFlow Method

To generalize the formulation of DeFlow from Sec. 3.2 to include the domain invariant conditioning $h$, we extend the flow network to $z_x = f_\theta(x; h(x))$ and $z_y = f_\theta(y; h(y))$. By invertibility in the first argument of $f_\theta$, samples can then be retrieved by

$x = f_\theta^{-1}\big(z_x; h(x)\big)\,, \quad z_x \sim \mathcal{N}(0, I)$   (36a)
$y = f_\theta^{-1}\big(z_x + u; h(y)\big)\,, \quad u \sim \mathcal{N}(\mu_u, \Sigma_u)\,, \; u \perp z_x$ .   (36b)

Then, by domain invariance $h(x) = h(y)$, it follows that we can sample from the conditional distribution $p(y|x)$ using

$y|x = f_\theta^{-1}\big(f_\theta(x; h(x)) + u;\; h(x)\big)$ ,   (37)

where $u \sim \mathcal{N}(\mu_u, \Sigma_u)$.

By the change of variables formula, we obtain the differentiable expressions for the conditional marginal densities,

$p_{x|h}\big(x|h(x)\big) = \left|\det \tfrac{\partial f_\theta}{\partial x}(x; h(x))\right| \, \mathcal{N}\!\big(f_\theta(x; h(x)); 0, I\big)$   (38a)
$p_{y|h}\big(y|h(y)\big) = \left|\det \tfrac{\partial f_\theta}{\partial y}(y; h(y))\right| \, \mathcal{N}\!\big(f_\theta(y; h(y)); \mu_u, I + \Sigma_u\big)$ .   (38b)

As in the unconditional case, the first factor is given by the determinant of the Jacobian of the flow network, while the second factor stems from the Gaussian base distributions.

We can then use (38) to allow the optimization of the new negative log-conditional-likelihood objective

$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_{x|h}\big(x_i \,|\, h(x_i); \theta\big) - \sum_{j=1}^{M} \log p_{y|h}\big(y_j \,|\, h(y_j); \theta\big)$ .   (39)

Appendix D DeFlow Degradation Results

Stochasticity of Degradations  Current GAN based approaches [12, 17, 34, 24] model the degradation process as a deterministic mapping, ignoring its inherent stochastic nature. In contrast, DeFlow learns the conditional distribution $p(y|x)$ of a degraded image given a clean image and thereby allows sampling multiple degraded versions of a single clean image. As shown in Figure 4, different degraded samples from DeFlow feature distinct yet realistic noise characteristics without noticeable bias or recurring patterns.



Figure 4: Multiple degraded samples of a clean input image (left column) using DeFlow on the AIM (top two rows) and NTIRE (bottom two rows) RWSR datasets.

Varying Degradation Strength  We further show that DeFlow can be extended to enable sampling degradations at different strengths. To do so, we include a temperature parameter $\tau$ that scales the sampled shift-vector $u$ in the latent space. This extends (8) to

$y|x = f_\theta^{-1}\big(f_\theta(x; h(x)) + \tau \cdot u;\; h(x)\big)\,, \quad u \sim \mathcal{N}(\mu_u, \Sigma_u)$ .   (40)

As shown in Figure 5, setting $\tau < 1$ yields more nuanced degradations, while $\tau > 1$ amplifies the noise.
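A sketch of (40), reusing the assumed conditional flow interface from the earlier examples (`flow` takes the input and the conditioning and returns the latent with its log-determinant):

```python
import torch

def degrade_with_temperature(flow, x, h_x, mu_u, sigma_u, tau=1.0):
    """Temperature-scaled degradation sampling as in (40).
    tau < 1 gives milder degradations, tau > 1 amplifies them."""
    z_x, _ = flow(x, h_x)
    u = mu_u + sigma_u * torch.randn_like(z_x)
    return flow.inverse(z_x + tau * u, h_x)
```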



Figure 5: Sampling degradations from DeFlow with increasing temperature $\tau$ in (40) on the AIM (top row) and NTIRE (bottom row) RWSR datasets.

Appendix E Visual Comparison

While we compared DeFlow to current methods using reference and no-reference based evaluation metrics and a user study, we here provide detailed visual results.

Degradation Results:  We show examples of the synthetic degradations generated by different methods in Figures 7, 9, and 11 for the AIM-, NTIRE-, and DPED-RWSR datasets. As a reference, we further provide examples of real noisy image patches from the respective datasets in Figures 6, 8, and 10. We notice that DeFlow consistently adds more noise compared to the other methods. Yet, on all datasets, the degradations from DeFlow resemble the real noisy data, whereas other learned methods struggle to pick up on the noise characteristics.

Real-World Super-Resolution Performance:  Further, we provide results of the downstream real-world super-resolution task of the different methods on the AIM-, NTIRE-, and DPED-RWSR datasets in Figures 12, 13, and 14, respectively. It is noticeable that our proposed approach introduces fewer artifacts than the other methods across all datasets. Further, DeFlow is able to reconstruct fine details and provides sharper images than the White Noise model, which performs surprisingly well on the synthetic datasets. On DPED-RWSR, the performance of the DeFlow degradations is comparable to the handcrafted approach of Impressionism [17]. While DeFlow retains more noise in smooth patches, Impressionism tends to over-smooth textures.

Appendix F Details of the User Study

In this section, we give insight into how we conducted the user study. On AIM-RWSR and NTIRE-RWSR, we chose the top 5 models by their LPIPS score to compare in the user study. To obtain a larger variety of methods, we did not include any method twice if both the pretrained model provided by the respective authors and our retrained model were among the top 5. In this case, we always used the officially provided pretrained models. On DPED-RWSR, we decided to only compare against Frequency Separation [12] and Impressionism [17], as we found the other methods to perform considerably worse.

For each dataset, we used the following study set-up: participants were shown random crops of a super-resolved image from the validation set from all compared methods at once. As a reference, the ground-truth crop for the synthetic datasets and the low-resolution source image for the DPED dataset were given to the participants. We then asked them to rank the shown images by their quality. To verify that the participants were not giving random answers, we also included the ground-truth image on AIM-RWSR and NTIRE-RWSR and the low-resolution source image on DPED among the candidates to rank. We then only counted answers of participants that achieved a minimum accuracy at choosing the ground-truth image as rank 1 for AIM-RWSR and NTIRE-RWSR, or the up-sampled low-resolution image as the last rank for DPED. After filtering, we obtained 2850, 1350, and 2395 ranked crops for the AIM-, NTIRE-, and DPED-RWSR datasets, respectively. Finally, we report the mean opinion rank (MOR), computed as the mean rank obtained by each method. Note that the reported MOR was not corrected for including the GT image, and can thus be considered as shifted up by nearly one rank on the AIM-RWSR and NTIRE-RWSR datasets.


Figure 6: AIM RWSR: examples of noisy image patches.

Clean Input DASR [34] Frequency Separation [12] Impressionism [17] DeFlow (ours)
Figure 7: AIM RWSR: examples of clean inputs and corresponding synthetically degraded versions from different domain adaption methods.
Figure 8: NTIRE RWSR: examples of noisy image patches.

Clean Input CycleGAN [24] Frequency Separation [12] Impressionism [17] DeFlow (ours)
Figure 9: NTIRE RWSR: examples of clean inputs and corresponding synthetically degraded versions from different domain adaption methods.
Figure 10: DPED RWSR: examples of noisy image patches.

Clean Input Frequency Separation [12] Impressionism [17] DeFlow (ours)
Figure 11: DPED RWSR: examples of clean inputs and corresponding synthetically degraded versions from different domain adaption methods. Note that we did not include CycleGAN [24], as, differing from the other approaches, it is trained to degrade images from DIV2K with DPED noise instead of down-sampled DPED images.

LR White Noise DASR [34] Frequency Separation [12] Impressionism [17] DeFlow (ours)
Figure 12: AIM RWSR: super-resolution results on the validation set. The shown crops were chosen at random for an unbiased comparison.

LR White Noise DASR [34] Frequency Separation [12] Impressionism [17] DeFlow (ours)
Figure 13: NTIRE RWSR: super-resolution results on the validation set. The shown crops were chosen at random for an unbiased comparison.

LR No Degradation [32] CycleGAN [24] Frequency Separation [12] Impressionism [17] DeFlow (ours)
Figure 14: DPED RWSR: super-resolution results on the validation set. The shown crops were chosen at random for an unbiased comparison.

References

  1. A. Abdelhamed, M. A. Brubaker and M. S. Brown (2019) Noise Flow: noise modeling with conditional normalizing flows. In ICCV.
  2. A. Abdelhamed, S. Lin and M. S. Brown (2018) A high-quality denoising dataset for smartphone cameras. In CVPR.
  3. A. Abdelhamed, R. Timofte and M. S. Brown (2019) NTIRE 2019 challenge on real image denoising: methods and results. In CVPR Workshops.
  4. E. Agustsson and R. Timofte (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In CVPR Workshops.
  5. L. Ardizzone, C. Lüth, J. Kruse, C. Rother and U. Köthe (2019) Guided image generation with conditional invertible neural networks. arXiv:1907.02392.
  6. S. Bell-Kligler, A. Shocher and M. Irani (2019) Blind super-resolution kernel estimation using an internal-GAN. In NeurIPS, pp. 284–293.
  7. A. Bulat, J. Yang and G. Tzimiropoulos (2018) To learn image super-resolution, use a GAN to learn how to do image degradation first. arXiv:1807.11458.
  8. J. Cai, S. Gu, R. Timofte and L. Zhang (2019) NTIRE 2019 challenge on real image super-resolution: methods and results. In CVPR Workshops.
  9. J. Cai, H. Zeng, H. Yong, Z. Cao and L. Zhang (2019) Toward real-world single image super-resolution: a new benchmark and a new model. In ICCV.
  10. C. Chu, A. Zhmoginov and M. Sandler (2017) CycleGAN, a master of steganography. arXiv:1712.02950.
  11. L. Dinh, J. Sohl-Dickstein and S. Bengio (2017) Density estimation using Real NVP. In ICLR.
  12. M. Fritsche, S. Gu and R. Timofte (2019) Frequency separation for real-world super-resolution. In ICCV Workshops, pp. 3599–3608.
  13. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, pp. 2672–2680.
  14. A. Grover, C. Chute, R. Shu, Z. Cao and S. Ermon (2019) AlignFlow: cycle consistent learning from multiple domains via normalizing flows. arXiv:1905.12892.
  15. J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros and T. Darrell (2017) CyCADA: cycle-consistent adversarial domain adaptation. arXiv:1711.03213.
  16. A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey and L. Van Gool (2017) DSLR-quality photos on mobile devices with deep convolutional networks. In ICCV.
  17. X. Ji, Y. Cao, Y. Tai, C. Wang, J. Li and F. Huang (2020) Real-world super-resolution via kernel estimation and noise injection. In CVPR Workshops.
  18. G. King, O. Rosen and M. A. Tanner (2004) Ecological inference: new methodological strategies. Cambridge University Press, New York.
  19. G. King (2013) A solution to the ecological inference problem: reconstructing individual behavior from aggregate data. Princeton University Press.
  20. D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  21. D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS, pp. 10236–10245.
  22. A. Krull, T. Buchholz and F. Jug (2018) Noise2Void - learning denoising from single noisy images. arXiv:1811.10980.
  23. S. Laine, T. Karras, J. Lehtinen and T. Aila (2019) High-quality self-supervised deep image denoising. arXiv:1901.10277.
  24. A. Lugmayr, M. Danelljan and R. Timofte (2019) AIM 2019 challenge on real-world image super-resolution: methods and results. In ICCV Workshops.
  25. A. Lugmayr, M. Danelljan and R. Timofte (2019) Unsupervised learning for real-world super-resolution. In ICCV Workshops, pp. 3408–3416.
  26. A. Lugmayr, M. Danelljan and R. Timofte (2020) NTIRE 2020 challenge on real-world image super-resolution: methods and results. In CVPR Workshops.
  27. A. Lugmayr, M. Danelljan, L. Van Gool and R. Timofte (2020) SRFlow: learning the super-resolution space with normalizing flow. In ECCV.
  28. A. Mittal, A. Moorthy and A. Bovik (2011) Referenceless image spatial quality evaluation engine. In 45th Asilomar Conference on Signals, Systems and Computers, pp. 53–54.
  29. A. Mittal, R. Soundararajan and A. C. Bovik (2013) Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters 20(3), pp. 209–212.
  30. V. N., P. D., M. C. Bh., S. S. Channappayya and S. S. Medasani (2015) Blind image quality evaluation using perception based features. In NCC, pp. 1–6.
  31. M. Prakash, M. Lalit, P. Tomancak, A. Krull and F. Jug (2020) Fully unsupervised probabilistic Noise2Void. arXiv:1911.12291.
  32. X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao and X. Tang (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In ECCV Workshops.
  33. Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), pp. 600–612.
  34. Y. Wei, S. Gu, Y. Li and L. Jin (2020) Unsupervised real-world image super resolution via domain-distance aware training. arXiv:2004.01178.
  35. C. Winkler, D. Worrall, E. Hoogeboom and M. Welling (2019) Learning likelihoods with conditional normalizing flows. arXiv:1912.00042.
  36. X. Wu, M. Liu, Y. Cao, D. Ren and W. Zuo (2020) Unpaired learning of deep image denoising. arXiv:2008.13711.
  37. M. Yamaguchi, Y. Koizumi and N. Harada (2019) AdaFlow: domain-adaptive density estimator with application to anomaly detection and unpaired cross-domain translation. arXiv:1812.05796.
  38. R. Zhang, P. Isola, A. A. Efros, E. Shechtman and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.
  39. J. Zhu, T. Park, P. Isola and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.