W-Net: Two-stage U-Net with misaligned data for raw-to-RGB mapping

Kwang-Hyun Uhm, Seung-Wook Kim, Seo-Won Ji, Sung-Jin Cho, Jun-Pyo Hong, Sung-Jea Ko
School of Electrical Engineering, Korea University
Seoul, Korea
{khuhm, swkim, swji, sjcho, jphong}@dali.korea.ac.kr, sjko@korea.ac.kr
Corresponding author

Recent research on learning a mapping between raw Bayer images and RGB images has progressed with the development of deep convolutional neural networks. A challenging data set, namely the Zurich Raw-to-RGB data set (ZRR), has been released in the AIM 2019 raw-to-RGB mapping challenge. In ZRR, input raw and target RGB images are captured by two different cameras and thus are not perfectly aligned. Moreover, camera metadata such as white balance gains and the color correction matrix are not provided, which makes the challenge more difficult. In this paper, we explore an effective network structure and a loss function to address these issues. We exploit a two-stage U-Net architecture and introduce a loss function that is less sensitive to misalignment and more sensitive to color differences. In addition, we show that an ensemble of networks trained with different loss functions can bring a significant performance gain. We demonstrate the superiority of our method by achieving the highest score in terms of both the peak signal-to-noise ratio and the structural similarity and obtaining the second-best mean-opinion-score in the challenge.

1 Introduction

In this paper, we describe our solution for the AIM 2019 challenge on raw-to-RGB mapping [11]. The challenge releases the Zurich Raw-to-RGB (ZRR) data set for the task. The ZRR data set consists of pairs of raw and RGB images which are captured by Huawei P20 and Canon 5D Mark IV cameras, respectively. The challenge aims at learning a mapping between input raw and target RGB images based on the image pairs given in the ZRR data set. However, in this data set, the input and target images are not perfectly aligned as they are taken with different cameras. Moreover, some camera metadata such as white balance gains and the color correction matrix are not provided, which makes the task more difficult.

In general, digital cameras process the raw sensor data through an image processing pipeline to produce the desired RGB images. A traditional camera imaging pipeline includes a sequence of operations such as white balance, demosaicing, denoising, color correction, gamma correction, and tone mapping. Typically, each operation is performed independently and requires handcrafted priors. With the recent advances in deep convolutional neural networks (CNNs), research on implementing the imaging pipeline using CNNs has also progressed. Schwartz et al. [17] proposed a CNN architecture named DeepISP to perform an end-to-end image processing pipeline. DeepISP achieved better visual quality scores than the manufacturer image signal processor. Chen et al. [3] developed a CNN to learn the imaging pipeline for short-exposure low-light raw images. However, these methods trained the network using aligned pairs of input raw and target RGB images obtained by the same camera.

Some studies have attempted to convert the image captured by one camera to the image taken by another camera. Nguyen et al. [14] proposed a calibration method to find a mapping between two raw images from two different cameras. Ignatov et al. [10] proposed a CNN-based method of learning a mapping from mobile camera images into DSLR images using RGB image pairs. However, these methods only handle the mapping between images in the same color space.

In this work, we explore an effective network structure and loss functions to address the challenging issues in ZRR. We exploit a two-stage U-Net architecture with network enhancements. As U-Net utilizes features that are down-sampled several times, these features are relatively invariant to small translations and rotations of the contents in an image. To extract more informative features for our task, we employ a channel attention mechanism. Specifically, we only apply the channel attention module to the expanding path of U-Net. In the expanding path, features are up-sampled and combined with the high-resolution features from the contracting path. Then, the combined features are channel-wise weighted according to the global statistics of the activations to contain more useful information. We also add a long skip connection to U-Net to ease the training of the network. Our experiments demonstrate that the performance on ZRR can be improved by these network enhancements. Though a single enhanced U-Net achieves comparable performance, we exploit a two-stage U-Net architecture to further boost the performance in the challenge. We cascade the same enhanced U-Net to refine the output RGB images of the first stage.

Also, we introduce a loss function that is less sensitive to the misalignment of training data and encourages the network to generate well color-corrected images. We utilize the perceptual loss [12] to handle the misalignment between input raw and target RGB images. We use high-level features from a deep network since they are down-sampled multiple times and thus effective for learning with misaligned data. Since the color correction step is implemented in a camera image processing pipeline, the network needs to inherently learn this step to reconstruct RGB images. To encourage the network to learn an accurate color transformation, we introduce a color loss which is defined by the cosine distance between the RGB vectors of predicted and target images.

Finally, we apply the model ensemble method to improve the quality of output images. Unlike the typical model ensemble method, we trained the networks with different loss functions and averaged the outputs. Our experiment shows that the ensemble of three networks trained with three different loss functions brings a significant performance improvement. We achieved the best performance in terms of the peak signal-to-noise ratio (PSNR) of 22.59 dB and the structural similarity (SSIM) of 0.83 in Track 1 (fidelity) of the AIM 2019 raw-to-RGB mapping challenge, and the second-best performance in Track 2 (perceptual).

2 Related work

Image signal processing pipeline.

There exist various image processing sub-tasks inside the traditional ISP pipeline. The most representative operations include denoising, demosaicing, white balancing, color correction, and tone mapping [1, 15]. The demosaicing operation interpolates the single-channel raw image with repeated mosaic patterns into a multi-channel color image [6]. The denoising operation removes the noise introduced by the sensor and enhances the signal-to-noise ratio [5]. The white balancing step corrects the color shifted by illumination according to human perception [4]. Color correction applies a matrix to convert the color space of the image from raw to RGB for display [1]. Tone mapping compresses the dynamic range of the raw image and enhances the image details [21].

In the traditional image processing pipeline, each step is designed using the handcrafted priors and performed independently. This may cause an error accumulation when processing the raw data through the pipeline [9].

Figure 1: Illustration of the process of the W-Net.

Deep learning on imaging pipeline.

As CNNs have shown significant success in low-level image processing tasks, such as demosaicing [6], denoising [1, 7, 22], and deblurring [20], some studies [2, 3, 17] utilize CNNs to model the camera imaging pipeline. Schwartz et al. proposed a CNN model to perform demosaicing, denoising and image enhancement together [17]. Chen et al. developed a CNN to learn the imaging pipeline for low-light raw images [3, 18].

Converting an image taken by one camera to another camera has also been studied. Nguyen et al. proposed a calibration method to obtain a raw-to-raw mapping between image sensor color responses [14]. Ignatov et al. proposed a method to learn the mapping between images taken by a mobile phone and a DSLR camera. However, they use, as an input, an image already processed by an image signal processor [10].

Figure 2: The architecture of the enhanced U-Net. Best viewed with zoom.

3 Methodology

3.1 Network Architecture

Figure 1 shows our two-stage U-Net based network, called W-Net, for raw-to-RGB mapping. We utilize the U-Net [16] because the structure consisting of multiple pooling and un-pooling layers is effective for learning on the misaligned data. In W-Net, the RGB image is first reconstructed by the single U-Net based network and then refined by the cascaded network.

We enhance the U-Net for our task. Figure 2 shows the architecture of the enhanced U-Net. U-Net consists of a contracting path and an expanding path. In the contracting path, we apply convolutions (Convs) blocks and 2×2 max-pooling operations to extract and down-sample features, where the Convs block consists of three 3×3 convolution layers followed by parametric rectified linear units (PReLU) [8] with a negative slope of 0.2. Note that the number of channels of the features is doubled at each Convs block. In the expanding path, features are up-sampled by bilinear interpolation and concatenated with the features of the same size from the contracting path. To obtain more informative features for our task, we employ a channel attention mechanism. Specifically, the channel attentional-convolutions (CA-Convs) block is applied at each step of the expanding path. In the CA-Convs block, the output features of the Convs block are first global average pooled to capture the global spatial information and then transformed by fully connected (FC) layers and ReLU to represent the channel importance. The weights are obtained by the sigmoid function and multiplied with the outputs of the Convs block. The use of the CA-Convs block largely improves the performance with a negligible parameter increase (see Sec. 4.2). Note that applying the CA-Convs blocks to both the contracting and expanding paths does not increase the performance in our experiment. We also add a long skip connection between the highest-resolution features to ease the training of the network. After the long skip connection, a convolution is performed to produce the RGB image.
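The channel reweighting inside the CA-Convs block can be sketched in a few lines of framework-free Python. This is only an illustration of the pooling, FC, ReLU, sigmoid, and scaling steps described above; the function name `channel_attention`, the use of exactly two bias-free FC layers, and the weight matrices `w1` and `w2` are assumptions made for the example and are not specified in the text.

```python
import math


def channel_attention(features, w1, w2):
    """Reweight feature channels by their global statistics.

    features: list of C channels, each an H x W list of lists of floats.
    w1: hidden x C weights of a first FC layer (hypothetical, no bias).
    w2: C x hidden weights of a second FC layer (hypothetical, no bias).
    Returns the scaled features and the per-channel weights.
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # First FC layer followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Second FC layer followed by a sigmoid: weights lie in (0, 1).
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Multiply every value of a channel by that channel's weight.
    scaled = [
        [[v * a for v in row] for row in ch]
        for ch, a in zip(features, weights)
    ]
    return scaled, weights
```

Because the attention weights depend only on per-channel averages, the added cost is two small matrix products per block, which is consistent with the negligible parameter increase noted above.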

3.2 Loss Function

In this section, we describe our loss function, which consists of three terms. We denote $y$ as the target RGB image and $\hat{y}$ as the predicted image.

Pixel loss.

First, we adopt the pixel-wise $L_1$ loss, which is defined by $\mathcal{L}_{pixel} = \lVert \hat{y} - y \rVert_1$. However, using only the pixel loss leads to blurry results because image pairs are misaligned (see Sec. 4.2).

Feature loss.

To handle data misalignment, we utilize the perceptual loss function [12]. As features that are down-sampled multiple times by pooling layers are less sensitive to mild misalignment of images, we extract high-level features from the pretrained VGG-19 network [19] and calculate the distance between the extracted features of the predicted image $\hat{y}$ and the target image $y$. Therefore, our feature loss can be written as:

$$\mathcal{L}_{feat} = \lVert \phi(\hat{y}) - \phi(y) \rVert$$

where $\phi(\cdot)$ denotes the ‘relu4_1’ or ‘relu5_1’ feature of the VGG-19 network. By using the feature loss, we could obtain less blurry output images with fine details (see Figure 3).
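The claim that pooled features tolerate mild misalignment can be illustrated with a toy experiment. The sketch below is not the VGG-19 feature extractor: it uses a one-dimensional unit impulse and a single max-pooling window of width `k` as a stand-in for several stacked stride-2 pooling stages, ignoring the convolutions in between. The helper names `max_pool` and `changed_fraction` are hypothetical.

```python
def max_pool(signal, k):
    """Non-overlapping max pooling with window and stride k."""
    return [max(signal[i:i + k]) for i in range(0, len(signal), k)]


def changed_fraction(length, k):
    """Fraction of one-sample shifts of a unit impulse that alter
    the k-wide max-pooled output of a signal of the given length."""
    changed = 0
    for pos in range(length - 1):
        original = [0.0] * length
        original[pos] = 1.0
        shifted = [0.0] * length
        shifted[pos + 1] = 1.0  # same impulse, misaligned by one sample
        if max_pool(original, k) != max_pool(shifted, k):
            changed += 1
    return changed / (length - 1)


for k in (1, 2, 4, 8):
    print(k, changed_fraction(16, k))
```

For a length-16 signal, every shift changes the unpooled signal (`k = 1`), while only 7, 3, and 1 of the 15 possible shifts change the output for windows of 2, 4, and 8 samples. A shift is only visible after pooling when it crosses a window boundary, which becomes rarer as the effective window grows.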

Color loss.

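We further define the color loss to learn an accurate color transformation between input raw and target RGB images. We measure the cosine distance between the RGB color vectors of the down-sampled predicted and target images. The color loss can be written as:

$$\mathcal{L}_{color} = \sum_{i} \left( 1 - \frac{\langle \hat{p}_i, p_i \rangle}{\lVert \hat{p}_i \rVert_2 \, \lVert p_i \rVert_2} \right)$$

where $\langle \cdot, \cdot \rangle$ is the inner product operator, $\downarrow$ denotes the down-sampling operator by a factor of 2, and $\hat{p}_i$ and $p_i$ are the RGB pixel values at position $i$ of $\hat{y}\!\downarrow$ and $y\!\downarrow$, respectively, with $\hat{y}$ the predicted image and $y$ the target image.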
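A minimal sketch of such a color loss in plain Python follows. The text fixes only the cosine distance and the down-sampling factor of 2; the 2×2 average used for down-sampling, the summation over pixels, the small `eps` guard against zero-length vectors, and all function names are assumptions made for this example.

```python
import math


def downsample2(img):
    """Average each 2x2 block (assumed down-sampling method).

    img: H x W list of lists of (r, g, b) tuples, with H and W even.
    """
    out = []
    for y in range(0, len(img), 2):
        row = []
        for x in range(0, len(img[0]), 2):
            block = [img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1]]
            row.append(tuple(sum(p[c] for p in block) / 4.0 for c in range(3)))
        out.append(row)
    return out


def cosine_distance(u, v, eps=1e-8):
    """1 - cos(angle between u and v); eps avoids division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm + eps)


def color_loss(pred, target):
    """Sum of per-pixel cosine distances on the 2x down-sampled images."""
    p, t = downsample2(pred), downsample2(target)
    return sum(
        cosine_distance(pp, tp)
        for prow, trow in zip(p, t)
        for pp, tp in zip(prow, trow)
    )
```

Because the cosine distance depends only on the direction of the RGB vector, a uniformly darker prediction of the right hue is barely penalized by this term, whereas a hue error is penalized strongly; brightness errors are left to the pixel loss.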

Finally, we define our loss function by the sum of the aforementioned losses as follows:

$$\mathcal{L} = \mathcal{L}_{pixel} + \mathcal{L}_{feat} + \mathcal{L}_{color}$$

4 Experiments

4.1 Dataset and Training Details


ZRR provides 89,000 pairs of raw and corresponding RGB images of size 224×224, where the raw and RGB images are taken by a Huawei P20 and a Canon 5D Mark IV, respectively. We used 88,000 image pairs for training our model and the remaining 1,000 pairs for validation. We normalized the input images and denormalized the predicted images by the mean and standard deviation of the whole training data. Data augmentations such as flipping and rotation were not applied.
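The normalization step can be sketched as follows. The text states only that the mean and standard deviation of the whole training data are used; treating them as global scalars rather than per-channel values, representing images as flat lists of floats, and the function names are assumptions made for this illustration.

```python
import statistics


def fit_stats(images):
    """Global mean and (population) standard deviation over all values."""
    flat = [v for img in images for v in img]
    return statistics.fmean(flat), statistics.pstdev(flat)


def normalize(img, mean, std):
    """Map an input image to zero mean and unit variance."""
    return [(v - mean) / std for v in img]


def denormalize(img, mean, std):
    """Map a network prediction back to the original value range."""
    return [v * std + mean for v in img]
```

In practice the statistics would be computed once over the training set and then reused unchanged for validation and test images.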

Training details.

We implemented our model using the PyTorch framework on a machine with an Intel i7 CPU, 32GB of RAM, and an NVIDIA Titan XP GPU. Mini-batch size was set to 24. We trained our model using the Adam optimizer [13] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The first U-Net of our model was trained for 100 epochs and then the weights of the first network were frozen. Then, the second U-Net was trained for 25 epochs. Learning rate was initialized to 10 and dropped to 10 at the last one epoch. Approximately 3 days were required to train our model.

4.2 Ablation Study

Network architecture.

First, we demonstrated the effectiveness of our network model. The experimental results are shown in Table 1. We trained models using only the pixel loss described in Sec. 3.2. Our basic network is a single original U-Net. By applying channel attention (CA) modules, an improvement of 0.2 dB was obtained. Also, adding the long skip connection (LSC) increased the PSNR by 0.2 dB. Note that these improvements were achieved with negligible parameter increase. In addition, cascading the same enhanced U-Net brought 0.1 dB performance gain. These results suggest that our network design is effective for learning the raw-to-RGB mapping task.

        U-Net     U-Net + CA    U-Net + CA + LSC    Two-stage (CA + LSC)
PSNR    22.30     22.58         22.72               22.82
SSIM    0.8687    0.8704        0.8700              0.8741
Table 1: Ablation studies on network architectures

Loss function.

Secondly, we verified the effectiveness of our loss function on ZRR. Table 2 and Figure 3 show the experimental results. In this experiment, the two-stage network model described in the previous section was used. As expected, the model trained using only the pixel-wise L1 loss (Model 1) obtained the lowest PSNR and produced blurry results as shown in Figure 3. Combining the pixel loss with the feature loss (Model 2 and Model 3) led to a gain of approximately 0.3 dB and successfully generated sharper images. Note that Model 2 and Model 3 used the ‘relu4_1’ and ‘relu5_1’ layers of the VGG-19 network, respectively, to calculate the perceptual loss. By further incorporating the color loss (Model 4 and Model 5), a PSNR increase of about 0.05 dB was achieved and better color-transformed output images were obtained as shown in Figure 3. The ‘relu4_1’ and ‘relu5_1’ layers were used for Model 4 and Model 5, respectively. To boost the performance in the challenge, we adopted a model ensemble. As shown in Table 2, we observed that averaging the output images of Model 2, Model 4, and Model 5 improves the PSNR by around 0.5 dB.
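The output-averaging ensemble and the PSNR metric can be sketched as follows. The example uses made-up four-pixel "images" purely to show the mechanics; the values are not taken from the experiments, and the function names and the peak value of 1.0 are assumptions.

```python
import math


def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for flat lists of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def ensemble(outputs):
    """Pixel-wise average of the output images of several models."""
    return [sum(px) / len(outputs) for px in zip(*outputs)]


# Toy data: two hypothetical models that err on different pixels.
target = [0.5, 0.5, 0.5, 0.5]
model_a = [0.6, 0.5, 0.5, 0.5]
model_b = [0.5, 0.4, 0.5, 0.5]
averaged = ensemble([model_a, model_b])
print(psnr(model_a, target), psnr(model_b, target), psnr(averaged, target))
```

In this toy case the averaged output scores higher than either model because their errors fall on different pixels. Training the ensemble members with different loss functions plausibly has a similar decorrelating effect, though the sketch does not reproduce the gain reported above.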

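Model       Loss function                         PSNR     SSIM
Model 1     pixel                                 22.82    0.8741
Model 2     pixel + feature (relu4_1)             23.12    0.8755
Model 3     pixel + feature (relu5_1)             23.14    0.8709
Model 4     pixel + feature (relu4_1) + color     23.18    0.8750
Model 5     pixel + feature (relu5_1) + color     23.19    0.8719
Ensemble    -                                     23.70    0.8826
Table 2: Ablation studies on loss functions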
Figure 3: Ablation results on different loss functions. Best viewed with zoom.
        Track 1                      Track 2
Rank    Method    PSNR     SSIM      Method    MOS
1       W-Net     22.59    0.81                1.24
2                 22.24    0.80      W-Net     1.28
3                 21.94    0.79                1.46
4                 21.91    0.79                1.56
5                 20.85    0.77                1.92
6                 19.46    0.53                2.16
Table 3: The result of the AIM raw-to-RGB mapping challenge for the two tracks.

4.3 AIM 2019 raw-to-RGB Mapping Challenge

The AIM 2019 raw-to-RGB mapping challenge [11] consists of two tracks: the fidelity track (Track 1) and the perceptual track (Track 2). In Track 1, the average PSNR and SSIM are calculated. In Track 2, the Mean-Opinion-Score (MOS) is obtained from human subjects. Note that the full-resolution input raw images were provided for Track 2. We submitted our ensemble model described in the previous section for Track 1 and Model 2 for Track 2. As shown in Table 3, our model ranked first in Track 1 and outperformed the second-place method by a large margin (0.35 dB). Our results for Track 2 ranked second, where the MOS difference between our method and the first-place method is 0.04. Qualitative results are shown in Figure 4. It is observed that W-Net produces well color-transformed images with clear details.

Figure 4: Qualitative results of our W-Net on the Track 2 in the AIM 2019 raw-to-RGB mapping challenge. Best viewed with zoom.

5 Conclusion

We described our solution for the AIM 2019 raw-to-RGB mapping challenge. To solve the challenging issues in the released dataset, we developed an effective network architecture and a loss function. We enhanced U-Net and built a two-stage network model for the task. Through the ablation studies, we verified that our loss function can handle the data misalignment and color transformation. Also, we boosted the performance by combining models trained with different loss functions. As a result, we achieved the best quantitative results and the second-best qualitative results in the challenge.

6 Acknowledgement

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2014-3-00077-006, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis).


References

  • [1] T. Brooks, B. Mildenhall, T. Xue, J. Chen, D. Sharlet, and J. T. Barron (2019) Unprocessing images for learned raw denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11036–11045.
  • [2] M. Buckler, S. Jayasuriya, and A. Sampson (2017) Reconfiguring the imaging pipeline for computer vision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 975–984.
  • [3] C. Chen, Q. Chen, J. Xu, and V. Koltun (2018) Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3291–3300.
  • [4] D. Cheng, B. Price, S. Cohen, and M. S. Brown (2015) Beyond white: ground truth colors for color constancy correction. In The IEEE International Conference on Computer Vision (ICCV).
  • [5] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing 16 (8), pp. 2080–2095.
  • [6] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand (2016) Deep joint demosaicking and denoising. ACM Transactions on Graphics (TOG) 35 (6), pp. 191.
  • [7] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang (2019) Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1712–1722.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In The IEEE International Conference on Computer Vision (ICCV).
  • [9] F. Heide, M. Steinberger, Y. Tsai, M. Rouf, D. Pajak, D. Reddy, O. Gallo, J. Liu, W. Heidrich, K. Egiazarian, et al. (2014) FlexISP: a flexible camera image processing framework. ACM Transactions on Graphics (TOG) 33 (6), pp. 231.
  • [10] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool (2017) DSLR-quality photos on mobile devices with deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3277–3285.
  • [11] A. Ignatov, R. Timofte, et al. (2019) AIM 2019 challenge on raw to RGB mapping: methods and results. In ICCV Workshops.
  • [12] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
  • [13] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR. arXiv:1412.6980.
  • [14] R. Nguyen, D. K. Prasad, and M. S. Brown (2014) Raw-to-raw: mapping between image sensor color responses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3398–3405.
  • [15] R. Ramanath, W. E. Snyder, Y. Yoo, and M. S. Drew (2005) Color image processing pipeline. IEEE Signal Processing Magazine 22 (1), pp. 34–43.
  • [16] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241.
  • [17] E. Schwartz, R. Giryes, and A. M. Bronstein (2018) DeepISP: learning end-to-end image processing pipeline. arXiv preprint arXiv:1801.06724.
  • [18] L. Shen, Z. Yue, F. Feng, Q. Chen, S. Liu, and J. Ma (2017) MSR-net: low-light image enhancement using deep convolutional network. arXiv preprint arXiv:1711.02488.
  • [19] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [20] J. Sun, W. Cao, Z. Xu, and J. Ponce (2015) Learning a convolutional neural network for non-uniform motion blur removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 769–777.
  • [21] L. Yuan and J. Sun (2012) Automatic exposure correction of consumer photographs. In European Conference on Computer Vision, pp. 771–785.
  • [22] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155.