FHDR: HDR Image Reconstruction from a Single LDR Image using Feedback Network

Abstract

High dynamic range (HDR) image generation from a single exposure low dynamic range (LDR) image has been made possible due to the recent advances in deep learning. Various feed-forward convolutional neural networks (CNNs) have been proposed for learning LDR to HDR representations. To better utilize the power of CNNs, we exploit the idea of feedback, where the initial low level features are guided by the high level features using a hidden state of a Recurrent Neural Network. Unlike a single forward pass in a conventional feed-forward network, the reconstruction from LDR to HDR in a feedback network is learned over multiple iterations. This enables us to create a coarse-to-fine representation, leading to an improved reconstruction at every iteration. Various advantages over standard feed-forward networks include early reconstruction ability and better reconstruction quality with fewer network parameters. We design a dense feedback block and propose an end-to-end feedback network, FHDR, for HDR image generation from a single exposure LDR image. Qualitative and quantitative evaluations show the superiority of our approach over the state-of-the-art methods.

HDR imaging, Feedback Networks, RNN, Deep Learning.

I Introduction

Common digital cameras cannot capture the wide range of light intensity levels present in a natural scene. This can lead to a loss of pixel information in under-exposed and over-exposed regions of an image, resulting in a low dynamic range (LDR) image. To recover the lost information and represent the wide range of illuminance in an image, high dynamic range (HDR) images need to be generated. There has been active research in the area of deep learning for HDR imaging. The advances in deep learning for image processing tasks have paved the way for various approaches to HDR image reconstruction using feed-forward convolutional neural network (CNN) architectures [1, 2, 3, 4, 5]. The above methods specifically transform a single exposure LDR image into an HDR image. HDRCNN [1] proposed a deep autoencoder for HDR image reconstruction which uses a weighted mask to recover only the over-exposed regions of an LDR image. The authors of DRTMO [3] designed a framework with two networks for generating up-exposure and down-exposure LDR images, which are merged to form an HDR image. The network is not end-to-end trainable and uses a large number of parameters. Unlike the others, the proposed FHDR model provides an end-to-end trainable solution and is able to comprehensively learn the LDR-HDR mapping while outperforming the existing methods.

Deeper networks are known to learn more complex non-linear relationships like the LDR to HDR mapping. The caveat with deeper networks is that they consume a lot of computational resources and tend to over-fit the training data. To overcome this problem, we exploit the power of feedback mechanisms, inspired by [6], for the task of HDR image reconstruction. A feedback block is an RNN whose output is fed back to guide its input via a hidden state. A feedback network can run for many iterations for a single training example. Considering the number of iterative operations on the shared network parameters, a feedback network is virtually deeper than the corresponding feed-forward network with the same physical depth. Here, virtual depth = physical depth × number of iterations.
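To make the notion of virtual depth concrete, the following minimal PyTorch sketch reuses a single small convolutional block for several iterations; the names SharedBlock and unrolled_forward are illustrative and not part of the FHDR architecture.

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """A small convolutional block whose weights are reused at every iteration."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

def unrolled_forward(block, features, num_iterations=4):
    """Running the same block for num_iterations passes gives an effective
    (virtual) depth of physical depth x num_iterations."""
    out = features
    for _ in range(num_iterations):
        out = block(out)
    return out
```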

For improving the reconstruction at every iteration, the loss is calculated on the output of each iteration. By doing so, the network is forced to create a coarse-to-fine representation, reconstructing HDR content right from the first iteration and improving it with every subsequent iteration. We propose a global feedback block which consists of smaller local feedback blocks of densely connected layers, inspired by the DenseNet architecture [7]. Dense connections allow the network to reuse features, which helps in learning robust image representations even with fewer network parameters.

The performance of our framework is evaluated on the standard City Scene dataset [8] and another dataset prepared from the list of HDR image datasets suggested in [1]. Qualitative and quantitative assessments of the network suggest that even with fewer network parameters, the proposed FHDR model outperforms the state-of-the-art methods.

II HDR reconstruction framework

II-A Feedback system

Feedback systems are adopted to influence the input based on the generated output, unlike conventional feed-forward networks, where the information flow is unidirectional and is not directly influenced by the generated output. The authors in [6] exploited the feedback mechanism using an RNN, where the output of the feedback block travels back to guide its input via a hidden state. Their architecture uses ConvLSTM cells as the basic RNN units and is designed for the task of image classification. Even with fewer network parameters compared to other feed-forward networks, such networks are able to learn better representations. Recently, the authors of [9] designed a feedback network specifically for the task of image super-resolution, which achieved state-of-the-art performance.

Inspired by the success of feedback networks, we design a feedback network for learning the LDR to HDR mapping, which is explained in detail in the following sections.

II-B Model architecture

Fig. 1: FHDR Architecture

Our architecture consists of three blocks, similar to [9], as shown in Fig. 1. The first block is the Feature Extraction block (FEB), followed by the Feedback block (FBB) and an HDR reconstruction block (HRB). Inspired by [10], we use a global residual skip connection for bypassing low level LDR features at every iteration to guide the HDR reconstruction block in the final layers. For every training example, the network runs for n iterations, where each iteration t, from 1 to n, is a forward pass in time in an unfolded RNN. The FEB is responsible for extracting the low-level feature information F_LDR from the input LDR image I_LDR.

F_LDR = f_FEB(I_LDR)    (1)

Here, f_FEB represents the operations of the FEB. To achieve the feedback mechanism, F_LDR is fed to the FBB, where it is combined with the output of the FBB from the previous iteration, F_FBB^(t-1), carried by a global hidden state, as below.

F_FBB^t = f_FBB(F_LDR, F_FBB^(t-1))    (2)

Here, F_FBB^t represents the output of the feedback block f_FBB at iteration t. At t = 1, when there is no feedback, the hidden state is initialised with the values of the extracted features F_LDR.

At every iteration, the low level LDR features F_L from the first convolutional layer in the FEB are added to the output of the FBB using a global residual skip connection, as below.

F_G^t = F_FBB^t + F_L    (3)

Here, F_G^t represents the global residual feature map learned at iteration t, and F_L stands for the low level LDR features from the first convolutional layer in the FEB. F_G^t is passed to the HRB to generate an HDR image at every iteration, as below.

I_HDR^t = f_HRB(F_G^t)    (4)

Here, f_HRB represents the operations of the HRB and I_HDR^t represents the HDR image generated at iteration t. For every LDR image, a forward pass runs for n iterations, therefore generating n HDR images. Each generated image is coupled with a loss, resulting in improved reconstruction at each iteration through back-propagation through time.
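The iterative forward pass of Eqs. (1)-(4) could be organised roughly as in the PyTorch sketch below. FHDRSketch is a simplified stand-in for the proposed network: the layer choices inside the feedback body are placeholders, and the actual FBB is detailed in the next subsection.

```python
import torch
import torch.nn as nn

class FHDRSketch(nn.Module):
    """Simplified sketch of the FEB -> FBB -> HRB pipeline of Eqs. (1)-(4)."""
    def __init__(self, channels=64, iterations=4):
        super().__init__()
        self.iterations = iterations
        # FEB: two conv layers; the first provides the low-level skip features F_L
        self.feb_conv1 = nn.Conv2d(3, channels, 3, padding=1)
        self.feb_conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # FBB (placeholder body): fuse current features with the hidden state, then process
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.fbb_body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # HRB: two conv layers mapping the features back to a 3-channel HDR estimate
        self.hrb = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, ldr):
        f_low = torch.relu(self.feb_conv1(ldr))    # F_L, reused by the global skip
        f_ldr = torch.relu(self.feb_conv2(f_low))  # F_LDR, Eq. (1)
        hidden = f_ldr                             # hidden state initialised with F_LDR
        outputs = []
        for _ in range(self.iterations):
            fused = self.fuse(torch.cat([f_ldr, hidden], dim=1))
            hidden = self.fbb_body(fused)          # F_FBB^t, Eq. (2)
            f_g = hidden + f_low                   # global residual skip, Eq. (3)
            outputs.append(self.hrb(f_g))          # I_HDR^t, Eq. (4)
        return outputs                             # one HDR estimate per iteration
```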

II-C Feedback block

Fig. 2: Feedback block

We have designed a novel feedback block for the task of learning LDR-to-HDR representations, as shown in Fig. 2. The basic unit of the feedback block is a Dilated Dense Block (DDB), shown in Fig. 3. It is a modification of the dense block proposed in [7]. Dilated convolutions help in increasing the receptive field of the network [11]. A DDB helps in utilising all the hierarchical features from its input. Apart from the two convolutional layers for feature compression, each DDB houses four dilated convolutional layers, each of which uses the information from all the previous layers through dense skip connections. This reuse of features due to the dense forward connections reduces the number of network parameters and improves the learning ability. Three such DDBs together form the feedback block of the network.
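A DDB of this form might be sketched in PyTorch as follows. The growth rate of 32 and the output depth of 64 follow Sec. II-E, while the 1×1 compression convolutions and the dilation factor are assumptions.

```python
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """Sketch of a DDB: four dilated convolutions with dense skip connections,
    sandwiched between two feature-compression layers."""
    def __init__(self, channels=64, growth=32, num_layers=4, dilation=2):
        super().__init__()
        self.compress_in = nn.Conv2d(channels, channels, 1)
        self.dense_layers = nn.ModuleList()
        in_ch = channels
        for _ in range(num_layers):
            self.dense_layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
            ))
            in_ch += growth                                  # each layer sees all previous outputs
        self.compress_out = nn.Conv2d(in_ch, channels, 1)    # bring the depth back down

    def forward(self, x):
        x = self.compress_in(x)
        features = [x]
        for layer in self.dense_layers:
            features.append(layer(torch.cat(features, dim=1)))
        return self.compress_out(torch.cat(features, dim=1))
```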

We implement global and local feedback mechanisms described as follows.

Global Feedback

The global feedback block, FBB, is considered as an RNN with a global hidden state. High level features are transferred from the output of the feedback block at iteration t-1 to its input at iteration t. The hidden state, carrying the output of the previous iteration F_FBB^(t-1), is concatenated with F_LDR, and a compression convolution layer is applied for high-level and low-level feature fusion, as shown in Fig. 2. The fused features are passed to the dilated dense blocks, followed by a convolution layer for further processing.
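A rough sketch of the global feedback block, reusing the DilatedDenseBlock sketch above; the 1×1 fusion convolution and the final 3×3 convolution are assumptions about layer details.

```python
import torch
import torch.nn as nn

class FeedbackBlockSketch(nn.Module):
    """Global feedback block: fuse F_LDR with the hidden state (the previous
    iteration's output), then pass the result through three DDBs and a final
    convolution."""
    def __init__(self, channels=64, num_ddbs=3):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)   # high/low-level feature fusion
        self.ddbs = nn.Sequential(*[DilatedDenseBlock(channels) for _ in range(num_ddbs)])
        self.conv_out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_ldr, hidden):
        fused = self.fuse(torch.cat([f_ldr, hidden], dim=1))
        return self.conv_out(self.ddbs(fused))
```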

Local Feedback

We argue that a feedback connection is always beneficial as it helps to guide the low-level features which are in some way blind to the higher level features. Hence, we have implemented local feedback connections over each DDB which aim to improve the features generated locally. These connections run parallel to the global feedback connections and increase the overall effectiveness of the network. Each DDB can be considered as an RNN similar to the global feedback block, transferring features from its output to its input via a local hidden state.
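One way such a local feedback connection could be realised is sketched below; how the local hidden state is fused with the DDB input is an assumption.

```python
import torch
import torch.nn as nn

class LocalFeedbackDDB(nn.Module):
    """A DDB wrapped with a local feedback connection: its previous output (the
    local hidden state) is concatenated with the current input before processing."""
    def __init__(self, channels=64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.ddb = DilatedDenseBlock(channels)
        self.hidden = None                 # local hidden state, reset for every new image

    def reset(self):
        self.hidden = None

    def forward(self, x):
        if self.hidden is None:
            self.hidden = x                # first iteration: no feedback yet
        out = self.ddb(self.fuse(torch.cat([x, self.hidden], dim=1)))
        self.hidden = out                  # carried over to the next iteration
        return out
```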

Fig. 3: Dilated Dense Block

II-D Loss function

Loss calculated directly on HDR images is misrepresented due to the dominance of high intensity values in images with a wide dynamic range. Therefore, we tonemap the generated and the ground truth HDR images to compress the wide intensity range before calculating the loss. We use the μ-law for tonemapping, as suggested by [12]. The μ-law is represented as below.

T(H) = log(1 + μH) / log(1 + μ)    (5)

Here, T represents the tonemapping operation and μ defines the amount of compression, which is set to 5000 for the experiments. In addition to the L1 loss suggested by previous feedback networks, we use a perceptual loss [13] for improving the visual quality of the generated image. We calculate the L1 loss and the perceptual loss at every iteration and take an average over all the iterations. The average L1 loss is given below.

L1 = (1/n) Σ_{t=1}^{n} || T(I_HDR^t) − T(I_GT) ||_1    (6)

Here, I_GT represents the ground truth HDR image. The average perceptual loss can be represented as below.

Lp = (1/n) Σ_{t=1}^{n} L_per(T(I_HDR^t), T(I_GT))    (7)

Here, L_per represents the perceptual loss calculated between the tonemapped ground truth and generated images. The final loss function is given below.

L = L1 + λ Lp    (8)

Here, λ is set to 0.1 for all the experiments. We have observed that applying only the L1 distance produces dark artifacts. However, a combination of the L1 loss and the perceptual loss results in improved visual quality of the generated images.
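A sketch of the complete training objective of Eqs. (5)-(8) in PyTorch. The perceptual_fn argument stands in for a VGG-based feature distance [13] whose implementation is omitted, and the HDR images are assumed to be normalised to [0, 1].

```python
import math
import torch
import torch.nn as nn

MU = 5000.0  # compression parameter of Eq. (5)

def mu_law_tonemap(hdr, mu=MU):
    """Mu-law tonemapping of Eq. (5): T(H) = log(1 + mu*H) / log(1 + mu)."""
    return torch.log(1.0 + mu * hdr) / math.log(1.0 + mu)

def fhdr_loss(hdr_outputs, hdr_gt, perceptual_fn, lam=0.1):
    """L1 and perceptual losses on tonemapped images, averaged over all
    iteration outputs (Eqs. (6)-(8))."""
    l1 = nn.L1Loss()
    gt_tm = mu_law_tonemap(hdr_gt)
    l1_total, perc_total = 0.0, 0.0
    for out in hdr_outputs:                           # one output per feedback iteration
        out_tm = mu_law_tonemap(out)
        l1_total = l1_total + l1(out_tm, gt_tm)
        perc_total = perc_total + perceptual_fn(out_tm, gt_tm)
    n = len(hdr_outputs)
    return l1_total / n + lam * (perc_total / n)
```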

II-E Implementation details

All the convolutional layers (conv2d) in the network use the same kernel size and are followed by a ReLU activation, unless mentioned otherwise. There are 2 conv2d layers in the FEB, 20 (3 DDBs × 6 layers + 2) conv2d layers in the FBB, and 2 conv2d layers in the HRB. The spatial size of the feature maps remains the same throughout the network. The depth of the feature maps also remains the same, i.e., 64, except inside the DDBs. A growth rate of 32 has been implemented for the DDBs, meaning that the conv2d layers of each dense block output 32-channel feature maps that get concatenated with the features from all the preceding layers. This accumulated feature map depth is brought back down to 64 using a conv2d layer at the end of the dense block.

We trained our network on a GeForce RTX 2070 GPU with a batch size of 16 for the City Scene dataset and 6 for the curated HDR dataset. The Adam optimizer [14] was adopted with momentum parameters β1 = 0.5 and β2 = 0.999. All the variants of the proposed model were trained for 200 epochs, with the initial learning rate kept fixed for the first 100 epochs and decayed linearly over the next 100 epochs.
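The optimiser and learning rate schedule described above could be set up as in the sketch below; FHDRSketch refers to the earlier architectural sketch, and the base learning rate shown is a placeholder rather than the value used in the paper.

```python
import torch

model = FHDRSketch(iterations=4)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))  # lr is a placeholder

def linear_decay(epoch, total_epochs=200, constant_epochs=100):
    """LR multiplier: 1.0 for the first 100 epochs, then linear decay towards 0."""
    if epoch < constant_epochs:
        return 1.0
    return max(0.0, 1.0 - (epoch - constant_epochs) / float(total_epochs - constant_epochs))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)
```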

Fig. 4: Qualitative evaluation against the five described methods on the curated HDR dataset; panels show (left to right) the input LDR image, AKY [15], KOV [16], DRTMO [3], HDRCNN [1], FHDR/W, FHDR, and the ground truth. (HDR images have been tonemapped using the Reinhard tonemapping algorithm [17] for display.)

III Experiments

III-A Dataset

We have trained and evaluated the performance of our network on the standard City Scene dataset [8] and another HDR dataset curated from various sources shared in [1]. The City Scene dataset is a low resolution HDR dataset that contains pairs of LDR and ground truth HDR images. We trained the network on 39460 image pairs provided in the training set and evaluated it on 1672 randomly selected images from the testing set. The curated HDR dataset consists of 1010 high resolution HDR images. Training and testing pairs were created by producing LDR counterparts of the HDR images through exposure alteration and the application of different camera curves from [18]. Data augmentation was done using random cropping and resizing. The final training set consists of 11262 images of size 256 × 256. The testing set consists of 500 images of size 512 × 512 to evaluate the performance of the network on high resolution images.
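As an illustration, an LDR counterpart could be synthesised from an HDR image along the following lines; a plain gamma curve stands in for the camera response curves of [18], and all numeric choices here are assumptions.

```python
import numpy as np

def hdr_to_ldr(hdr, exposure=1.0, camera_curve=None):
    """Synthesise an 8-bit LDR image from a (linear, float) HDR image by
    altering the exposure, applying a camera response curve, and clipping."""
    scaled = hdr * exposure
    if camera_curve is None:
        camera_curve = lambda x: np.power(np.clip(x, 0.0, 1.0), 1.0 / 2.2)  # stand-in curve
    ldr = camera_curve(scaled)
    return np.round(np.clip(ldr, 0.0, 1.0) * 255.0).astype(np.uint8)
```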

III-B Evaluation metrics

For the quantitative evaluation of the proposed method, we use the HDR-VDP-2 Q-score metric, specifically designed for evaluating the reconstruction of HDR images based on human perception [19]. We also use the commonly adopted PSNR and SSIM image quality metrics. The PSNR score is calculated in dB between the μ-law tonemapped ground truth and generated images.
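The tonemapped PSNR could be computed as in the sketch below, reusing the mu_law_tonemap function from the loss section and assuming images normalised to [0, 1].

```python
import torch

def tonemapped_psnr(hdr_pred, hdr_gt):
    """PSNR in dB between the mu-law tonemapped prediction and ground truth."""
    mse = torch.mean((mu_law_tonemap(hdr_pred) - mu_law_tonemap(hdr_gt)) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```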

III-C Feedback mechanism analysis

To study the influence of the feedback mechanism, we compare the results of four variants of the network based on the number of feedback iterations performed: (i) FHDR/W (feed-forward, without feedback), (ii) FHDR with 2 iterations, (iii) FHDR with 3 iterations, and (iv) FHDR with 4 iterations. To visualise the overall performance, we plot the PSNR values against the number of epochs for the four variants, as shown in Fig. 5. The significant impact of the feedback connections can be easily observed. The PSNR score increases as the number of iterations increases, and early reconstruction can be seen in the networks with feedback connections. Based on this, we decided to implement an iteration count of 4 for the proposed FHDR network.

Fig. 5: Convergence analysis for four variants of FHDR, evaluated on the City Scene dataset.

IV Results

Qualitative and quantitative evaluations are performed against two non-learning based inverse tonemapping methods, AKY [15] and KOV [16], two deep learning based methods, HDRCNN [1] and DRTMO [3], and the feed-forward counterpart of the proposed FHDR method (FHDR/W). We use the HDR Toolbox [20] for evaluating the non-learning based methods. The deep learning based methods were trained and evaluated on the datasets described in the earlier sections.

DRHT [4] is the state-of-the-art deep neural network for the image correction task. It reconstructs HDR content as an intermediate output and transforms it back into an LDR image. Due to the unavailability of an implementation of the DRHT network, its results on the curated dataset are not presented. The performance metrics of its LDR-to-HDR network trained on the City Scene dataset have been sourced from the original paper, since the experimental setup in terms of training and testing datasets is the same.

IV-A Quantitative evaluation

A quantitative comparison of the above mentioned HDR reconstruction methods is presented in Table I. The proposed network outperforms the state-of-the-art methods in evaluations performed on both datasets. Results of DRTMO [3] could not be calculated on the City Scene dataset because the network does not accept low resolution images. Even though the feed-forward counterpart of FHDR outperforms the other methods, the proposed FHDR network (4 iterations) performs far better.

IV-B Qualitative evaluation

As can be seen in Fig. 4, the non-learning based methods are unable to recover highly over-exposed regions. DRTMO brightens the whole image and is able to recover only the under-exposed regions. HDRCNN, on the other hand, is designed to recover only the saturated regions and thus underperforms in recovering information in the under-exposed dark regions of the image. Unlike the others, our FHDR/W pays equal attention to both under-exposed and over-exposed regions of the image. The proposed FHDR network with feedback connections enhances the overall sharpness of the image and performs an even better reconstruction of the input.

Methods        City Scene Dataset            Curated HDR Dataset
               PSNR    SSIM    Q-score       PSNR    SSIM    Q-score
AKY [15]       15.35   0.44    35.40         9.58    0.20    33.47
KOV [16]       16.77   0.59    35.31         12.99   0.41    29.87
HDRCNN [1]     13.21   0.38    54.34         12.13   0.34    55.32
DRTMO [3]      -       -       -             11.40   0.28    58.85
DRHT [4]       -       0.93    61.51         -       -       -
FHDR/W         25.39   0.89    63.21         16.94   0.74    65.27
FHDR           32.54   0.95    67.18         20.30   0.79    70.97
TABLE I: Quantitative comparison against state-of-the-art methods.

V Conclusion

We propose a novel feedback network, FHDR, to reconstruct an HDR image from a single exposure LDR image. The dense connections in the forward pass enable feature reuse, thus learning robust representations with a minimum of parameters. Local and global feedback connections enhance the learning ability by guiding the initial low level features using the high level features. Iterative learning forces the network to create a coarse-to-fine representation, which results in early reconstructions. Extensive experiments demonstrate that the FHDR network successfully recovers the under-exposed and over-exposed regions, outperforming the state-of-the-art methods.

VI Acknowledgement

This research was supported by the SERB Core Research Grant.

References

  1. G. Eilertsen, J. Kronander, G. Denes, R. K. Mantiuk, and J. Unger, “HDR image reconstruction from a single exposure using deep CNNs,” ACM Transactions on Graphics (TOG), vol. 36, no. 6, p. 178, 2017.
  2. D. Marnerides, T. Bashford-Rogers, J. Hatchett, and K. Debattista, “ExpandNet: A deep convolutional neural network for high dynamic range expansion from low dynamic range content,” in Computer Graphics Forum, vol. 37, pp. 37–49, Wiley Online Library, 2018.
  3. Y. Endo, Y. Kanamori, and J. Mitani, “Deep reverse tone mapping,” ACM Trans. Graph., vol. 36, no. 6, p. 177, 2017.
  4. X. Yang, K. Xu, Y. Song, Q. Zhang, X. Wei, and R. W. Lau, “Image correction via deep reciprocating hdr transformation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1798–1807, 2018.
  5. S. Lee, G. H. An, and S.-J. Kang, “Deep chain HDRI: Reconstructing a high dynamic range image from a single low dynamic range image,” IEEE Access, vol. 6, pp. 49913–49924, 2018.
  6. A. R. Zamir, T.-L. Wu, L. Sun, W. B. Shen, B. E. Shi, J. Malik, and S. Savarese, “Feedback networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1308–1317, 2017.
  7. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
  8. J. Zhang and J.-F. Lalonde, “Learning high dynamic range from outdoor panoramas,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4519–4528, 2017.
  9. Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, “Feedback network for image super-resolution,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  10. Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2472–2481, 2018.
  11. F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
  12. N. K. Kalantari and R. Ramamoorthi, “Deep high dynamic range imaging of dynamic scenes,” ACM Trans. Graph., vol. 36, no. 4, p. 144, 2017.
  13. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European conference on computer vision, pp. 694–711, Springer, 2016.
  14. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  15. A. O. Akyüz, R. Fleming, B. E. Riecke, E. Reinhard, and H. H. Bülthoff, “Do HDR displays support LDR content? A psychophysical evaluation,” in ACM Transactions on Graphics (TOG), vol. 26, p. 38, ACM, 2007.
  16. R. P. Kovaleski and M. M. Oliveira, “High-quality reverse tone mapping for a wide range of exposures,” in 2014 27th SIBGRAPI Conference on Graphics, Patterns and Images, pp. 49–56, IEEE, 2014.
  17. E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda, “Photographic tone reproduction for digital images,” ACM Trans. Graph., vol. 21, pp. 267–276, July 2002.
  18. M. D. Grossberg and S. K. Nayar, “What is the space of camera response functions?,” in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., vol. 2, pp. II–602, June 2003.
  19. M. Narwaria, R. Mantiuk, M. P. Da Silva, and P. Le Callet, “HDR-VDP-2.2: A calibrated method for objective quality prediction of high-dynamic range and standard images,” Journal of Electronic Imaging, vol. 24, no. 1, p. 010501, 2015.
  20. F. Banterle, A. Artusi, K. Debattista, and A. Chalmers, Advanced high dynamic range imaging. AK Peters/CRC Press, 2017.