Subjective and Objective De-raining Quality Assessment Towards Authentic Rain Image

Qingbo Wu, Lei Wang, King N. Ngan, Hongliang Li, Fanman Meng, and Linfeng Xu

Q. Wu, L. Wang, K. N. Ngan, H. Li, F. Meng and L. Xu were with the School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, 611731 China.

Images acquired by outdoor vision systems easily suffer from poor visibility and annoying interference in rainy weather, which poses a great challenge for accurately understanding and describing the visual contents. Recent research has devoted great effort to the task of rain removal for improving image visibility. However, there has been very little exploration of the quality assessment of de-rained images, even though it is crucial for accurately measuring the performance of various de-raining algorithms. In this paper, we first create a de-raining quality assessment (DQA) database that collects 206 authentic rain images and their de-rained versions produced by 6 representative single image rain removal algorithms. Then, a subjective study is conducted on our DQA database, which collects the subject-rated scores of all de-rained images. To quantitatively measure the quality of de-rained images with non-uniform artifacts, we propose a bi-directional feature embedding network (B-FEN) which integrates the features of global perception and local difference. Experiments confirm that the proposed method significantly outperforms many existing universal blind image quality assessment models. To help the research towards perceptually preferred de-raining algorithms, we will publicly release our DQA database and B-FEN source code.

Single image de-raining, authentic rain image, de-raining quality assessment.

I Introduction

Rainy weather often causes poor visibility and visual distraction in images captured in outdoor scenes, which may significantly degrade the performance of various computational photography and computer vision tasks [1, 2, 3]. In these real-world applications, image de-raining is highly desirable and has two objectives. First, the rain should be removed as cleanly as possible. Second, the de-rained image should be perceived as naturally as possible. This is a quite challenging problem. Due to the lack of prior information for both rain and background, single image based rain removal is highly ill-posed. Given any rain image, there are multiple alternative de-rained results, whose perceptual quality may vary significantly.

Similar to the classic image restoration framework [4, 5], existing algorithms typically model single image rain removal as a decomposition problem, which aims to separate the contaminated image into rain and background layers. To make it tractable, various techniques have been proposed to describe the prior information for rain and background. In [6, 7, 8], a set of guided filter based methods are proposed to model the background prior from specific rain-free or pre-processed samples. Due to their sensitivity to the guidance image and parameter selection, these methods easily over- or under-smooth the rain image, which either damages the original image structure or leaves too much rain. In [9], Chen et al. utilize low-rank approximation to capture the rain prior. It works efficiently in estimating rain streaks, but also easily mistakes striped background, whose texture is similar to rain streaks, for rain. To better distinguish rain streaks from background, many researchers propose to simultaneously learn the priors for the rain and background layers via dictionary learning, Gaussian mixture models, deep neural networks and so on [10, 11, 12, 13, 14]. These data-driven methods deliver great performance on rain images whose appearance features are covered by the training samples. However, when dealing with types of rain images that are rare in the training set, their performance drops significantly as well.

In real-world applications, rain images present great diversity due to changes in background illumination, lens speed, depth of field, and so on. Although multiple de-raining algorithms have been developed recently, each of them usually captures only partial properties of rain images, so its performance may change significantly from one image to another. To select the optimal de-raining result for each single image and to find further directions towards universal rain removal, it has become crucial to accurately evaluate the de-rained images produced by different algorithms. Surprisingly, however, very little literature explores the perceptual evaluation of de-raining algorithms. Existing methods typically evaluate de-raining performance on a few synthesized rain images, whose ground-truth images (i.e., rain-free versions) are available. Then, two classic full-reference objective metrics, PSNR and SSIM [15], are employed for quantitative quality assessment. In comparison with the diverse authentic rain images, these synthetic data only cover very limited types of raindrops, which are far from sufficient to verify the de-raining capability in reality. Meanwhile, given a de-rained sample produced from an authentic rain image, it is also challenging to accurately evaluate its quality due to the absence of a ground-truth image.

To the best of our knowledge, the first exploration of de-raining quality assessment (DQA) is conducted in our previous work [16], which proposed a no-reference image quality assessment (NR-IQA) model specifically designed for the de-rained image and investigated the performance of many existing general purpose NR-IQA models [17, 18, 19, 20, 21, 22, 23] in the DQA task. In this paper, we extend our previous exploration [16] of subjective and objective DQA tasks towards the authentic rain images. More comprehensive statistical analysis is conducted for the subject-rated data and an enhanced deep feature representation is proposed to improve the performance of DQA. The detailed contributions are summarized in the following:

  1. IVIPC-DQA database: We create a DQA database which collects a variety of authentic rain images and their de-rained versions produced by 6 categories of single image rain removal algorithms. Then, a subjective study is conducted on the DQA database, which presents two important findings. Firstly, although existing de-raining algorithms perform well in removing synthetic rain streaks, they can still hardly balance rain removal and detail preservation on authentic rain images. Secondly, existing general-purpose NR-IQA models perform poorly in evaluating de-rained images, whose artifacts present significantly different characteristics from the traditional uniform distortions, such as white noise or Gaussian blur.

  2. B-FEN DQA model: We propose a bi-directional feature embedding network to accurately assess the de-raining quality. The de-raining artifacts usually present different degradation degrees in different local regions as shown in Fig. 1, which brings great difficulty in identifying the overall quality of a de-rained image. To cope with this issue, we employ a forward branch to suppress the quality irrelevant information at the cost of dimension reduction for the feature maps. Then, a backward branch is developed to embed the low-resolution quality-aware features into the high-resolution feature maps of shallow layers. A gated fusion module is further utilized to integrate the forward and backward features together, which captures both the global perception and local difference.

Fig. 1: Illustration of non-uniform de-raining artifacts. This example is generated by [7], where the red and blue bounding-boxes highlight the de-raining results for different local regions, respectively. It is clear that the red bounding-box presents poor quality due to the annoying holes and streaks. By contrast, the blue bounding-box presents good quality where all leaves retain natural appearance.

Extensive experiments on our IVIPC-DQA database demonstrate that the proposed B-FEN model significantly outperforms many classic general-purpose NR-IQA models and the latest deep learning based quality evaluators in the task of de-raining quality assessment.

The rest of this paper is organized as follows. In Section II, we first introduce the IVIPC-DQA database and our findings from the subjective investigation. Section III describes the proposed B-FEN model in detail, and the experimental results are shown in Section IV. Finally, we conclude this paper in Section V.

II Subjective Study of DQA

II-A Image Collection

In order to cover diverse rain scenes, we first collect 206 authentic rain images from the Internet, which are captured under different illuminations, perspectives, lens speeds, depths of field, and so on. Some sample images are shown in Fig. 2. We then apply six representative single image rain removal algorithms to these authentic rain images, which generates 1236 de-rained samples in total. To avoid the composite distortion caused by compression, both the source and de-rained images are saved in a lossless compression format.

Fig. 2: Authentic rain images collected in the DQA database.

The de-raining algorithms investigated in this paper cover a wide variety of techniques, which include the guided filter, dictionary learning, low-rank approximation, maximum a posteriori estimation, directional regularization and deep neural networks. For short, we denote them by Ding16 [7], Kang12 [10], Luo15 [11], Li16 [12], Deng17 [24], and Fu17 [14]. In our investigation, all codes are provided by the authors and the default parameters are used without any additional fine-tuning. For each authentic rain image, a set of six de-rained results is available in our database. For illustration, an intuitive comparison between different de-raining algorithms is given in Fig. 3. It is clear that different de-rained images present obviously different appearances, as shown in Figs. 3 (a)-(f). In the following, we conduct a subjective study to quantitatively evaluate these de-raining algorithms.

II-B Subjective Testing Method

Fig. 3: Illustration of de-rained images produced by different algorithms: (a) Ding16 [7], (b) Kang12 [10], (c) Luo15 [11], (d) Li16 [12], (e) Deng17 [24], (f) Fu17 [14], (g) input.

The subjective experiment is conducted in the Intelligent Visual Information Processing and Communication (IVIPC) laboratory, UESTC. All images are displayed on a 27-inch true color (32 bits) LCD monitor with a resolution of 1920×1080. The viewing conditions are set by following the recommendation of ITU-R BT.500-13 [25]. In total, 22 naive subjects participated in this experiment, including 11 males and 11 females.

Specific to the rain removal task, we require all participants to rate the de-rained images on 5 levels, which are represented via a continuous scale from 1 to 100. A lower rating score indicates a worse perceptual quality, where the image still retains rain or the original image structure is distorted. By contrast, a higher rating score denotes a better perceptual quality, where the rain is removed and the original image structure is well preserved. The detailed description of the rating criteria is given in Table I.

Fig. 4: The dialogue window of our subjective experiment.
TABLE I: Rating criteria for the rain removal task

Following the recommendation of ITU-T P.910 [26], we employ the simultaneous presentation method to evaluate the de-raining performance. The reference image (i.e., rain image) and its associated de-rained version are simultaneously presented to the subject via a customized dialogue window. The reference image is always placed on the left and all subjects are aware of the relative positions of these two images. For clarity, the dialogue window of our subjective experiment is shown in Fig. 4. To avoid the memory effect in the human rating, each time we randomly show one de-rained version of the given reference image, which is selected from the six de-raining algorithms. For each participant, the rating is repeated 1236 times until every de-rained image is assigned a rating score. Meanwhile, to reduce the influence of the fatigue effect, the duration of each rating session is limited to 30 minutes, which allows the participants to take a break after rating several pairs of images.

The raw scores collected from multiple subjects may contain a few outliers. We first clean the human rating scores via the outlier rejection test of [25]. Then, all raw scores are converted into Z-scores [27, 28], which have been shown to be effective in eliminating individual differences. Let $r_{ij}$ denote the raw score obtained from the $i$th subject in evaluating the $j$th de-rained image. Let $\mu_i$ and $\sigma_i$ denote the mean score and standard deviation of the $i$th subject across all de-rained images, respectively. The Z-score is computed by

$$z_{ij} = \frac{r_{ij} - \mu_i}{\sigma_i}.$$

Similar to [27, 28], we further rescale the Z-scores to the range of [0, 100] via a linear mapping function

$$z'_{ij} = \frac{100\,(z_{ij} + 3)}{6},$$

which assumes that the Z-scores of each subject follow a Gaussian distribution, so that nearly 99% of the Z-scores fall into the range of [-3, 3]. Finally, the mean opinion score (MOS) of the $j$th de-rained image is computed by

$$\mathrm{MOS}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} z'_{ij},$$

where $N_j$ is the number of valid human ratings for the $j$th de-rained image.
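The score processing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mos_from_raw` and the toy ratings are hypothetical, rejected outliers are marked with `None`, and the population standard deviation is used because the text does not say which estimator was applied.

```python
from statistics import mean, pstdev


def mos_from_raw(raw):
    """raw[i][j]: score of subject i for image j, or None if rejected as an outlier."""
    rescaled = []
    for scores in raw:
        valid = [s for s in scores if s is not None]
        mu, sigma = mean(valid), pstdev(valid)  # per-subject statistics
        rescaled.append([
            # Z-score, then linear mapping of [-3, 3] onto [0, 100]
            None if s is None else 100.0 * ((s - mu) / sigma + 3.0) / 6.0
            for s in scores
        ])
    mos = []
    for j in range(len(raw[0])):
        column = [row[j] for row in rescaled if row[j] is not None]
        mos.append(sum(column) / len(column))  # average over valid ratings only
    return mos


# Toy ratings: three subjects, three images, one rejected rating.
raw_scores = [
    [20, 50, 80],
    [40, 50, 60],
    [10, None, 90],
]
mos = mos_from_raw(raw_scores)
```

Although the first two subjects use very different portions of the rating scale, their rescaled scores coincide, which is the individual-difference removal that the Z-score step is meant to provide.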

II-C Analysis of Human Ratings

To investigate the de-raining performance of existing algorithms, we first illustrate the distribution of MOS values for all de-rained images in Fig. 5. It is observed that the collected perceptual qualities span a wide range from low to high scores. Meanwhile, the distribution of the collected perceptual qualities is reasonably uniform. This good separation of perceptual qualities facilitates a more reliable investigation of the perceptual characteristics of de-rained images [27, 29].

Fig. 5: The distribution of MOS values for our DQA database.

Fig. 6: The mean and standard deviation comparison of MOS values between different de-raining algorithms.
TABLE II: Results of the one-sided t-test between the MOS values of different de-raining algorithms. A value of “1”/“0”/“-1” indicates that the row algorithm is statistically superior/equivalent/inferior to the column algorithm

Fig. 7: The MOS value of de-rained image versus its rain image number. The scatters with different shapes and colors indicate the de-rained images produced by different algorithms, which are labeled in the legend.

To quantitatively compare different de-raining algorithms, we also compute the mean and standard deviation of their MOS values. Each algorithm is associated with 206 MOS values, which are collected from its de-rained images. As shown in Fig. 6, we have two interesting findings. Firstly, in coping with authentic rain images, the overall performance of existing de-raining algorithms is hardly distinguishable, since their mean MOS values are very close to each other. To evaluate the statistical significance of this finding, we perform the one-sided t-test [30] on the MOS values of each pair of de-raining algorithms. As shown in Table II, the reported results confirm that most de-raining algorithms are statistically equivalent to the others under 95% confidence, which is denoted by ‘0’. The best performing method, i.e., Fu17, is statistically superior to only three of its five competitors, namely Ding16, Luo15 and Deng17. This shows that no single de-raining algorithm possesses absolute superiority over the others in removing realistic rain. Secondly, the de-raining performance of existing algorithms is unreliable, as their error bars are all quite large. This means that, for any given de-raining algorithm, the perceptual quality of its de-rained versions may change significantly from one image to another. This finding can be verified from Fig. 7, which plots the MOS value of each de-rained image versus its corresponding rain image number. It is seen that the scatters of each de-raining algorithm fluctuate widely across different rain images, which indicates that their performance is highly correlated with the rain type and image content.
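As a rough illustration of how an entry of Table II could be produced, the sketch below compares two lists of MOS values and returns 1, 0 or -1. Several choices here are assumptions rather than details from the text: the test is treated as an unpaired Welch-type comparison, the critical value comes from a normal approximation to Student's t (reasonable for 206 samples per algorithm), and the function name `compare` and the toy data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def compare(row_mos, col_mos, alpha=0.05):
    """Return 1 / 0 / -1 if the row algorithm is superior / equivalent / inferior."""
    # Standard error of the difference of means (unequal variances allowed).
    se = sqrt(variance(row_mos) / len(row_mos) + variance(col_mos) / len(col_mos))
    t = (mean(row_mos) - mean(col_mos)) / se
    # One-sided critical value; normal approximation assumed for large samples.
    crit = NormalDist().inv_cdf(1.0 - alpha)
    if t > crit:
        return 1
    if t < -crit:
        return -1
    return 0


# Toy MOS lists with 206 entries each, mimicking one value per rain image.
mos_a = [50 + (i % 7) for i in range(206)]
mos_b = [m + 5 for m in mos_a]
```

With these toy lists, `compare(mos_b, mos_a)` reports superiority of the second list, while comparing a list with itself yields statistical equivalence.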

Fig. 8: Illustration of the de-raining performance variation across different images. The MOS value of each de-rained image is labeled in its top-left corner.

To illustrate this problem, in Fig. 8, we show the de-raining results produced from two rain images. In the first row, the deep learning based method, i.e., Fu17 [14], performs well in removing the directional rain streaks and preserving the contours of the window and balcony, and its MOS reaches 97.97. By contrast, the dictionary learning based method, i.e., Kang12 [10], clearly over-smooths the original image structure, and its MOS is only 37.63. When we move to the rain image in the second row, Fu17 [14] does almost nothing to the dot-like raindrops, and its MOS drops to 40.95. Meanwhile, Kang12 [10] removes these small raindrops almost perfectly without obvious damage to the contour of the player, and its MOS rises to 99.68. This observation shows the pressing demand for an efficient NR-IQA model, which is crucial for selecting the optimal de-raining algorithm for different rain images.
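The selection step motivated above amounts to a per-image argmax over quality scores. The short sketch below uses the four MOS values quoted for Fig. 8; in practice the scores would come from an NR-IQA predictor instead of human ratings, and the names `best_algorithm` and `fig8_scores` are hypothetical.

```python
def best_algorithm(scores):
    """Return the de-raining algorithm with the highest quality score."""
    return max(scores, key=scores.get)


# MOS values quoted in the text for the two rain images of Fig. 8.
fig8_scores = {
    "first row": {"Fu17": 97.97, "Kang12": 37.63},
    "second row": {"Fu17": 40.95, "Kang12": 99.68},
}
choice = {image: best_algorithm(s) for image, s in fig8_scores.items()}
```

The preferred algorithm flips between the two images, which is why a fixed choice of de-raining method is not sufficient.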

III Objective Model of DQA

Fig. 9: The comparison of different network architectures for NR-IQA.

Fig. 10: The detailed network structure of the proposed B-FEN. The purple and green dotted lines denote the features passed from and , respectively.

After building the IVIPC-DQA database, we further develop an efficient objective model to predict the human perception of de-rained images. Recently, many deep learning based NR-IQA models [31, 32, 33] have explored various efficient network structures for evaluating uniform distortions, and they achieve state-of-the-art quality prediction accuracy via a common unidirectional feature embedding (UFE) architecture as shown in Fig. 9 (a). However, unlike typical distortions with distinct characteristics and a uniform distribution (e.g., Gaussian blur or white noise), the distortions of de-rained images differ considerably across de-raining algorithms and visual contents, and are therefore hard to capture with a specific global descriptor. Therefore, we propose to learn the quality-aware features and regressor by jointly considering the local and global information using a bi-directional feature embedding network (B-FEN) as shown in Fig. 9 (b). More specifically, the forward feature embedding aims to extract the global information related to perceptual quality, which is similar to existing methods [31, 32, 33]. By contrast, the backward feature embedding attempts to incorporate the global information into multiple local features, which can be captured from the intermediate layers of a convolutional neural network (CNN) [34, 35, 36].

For clarity, Fig. 10 shows the detailed network structure of the proposed B-FEN. Our “forward path” subnetwork is composed of four cascaded Dense Blocks (DB) [37], which share the same structure as DenseNet-161 except for the channel sizes. As a de-rained image goes deeper through our “forward path”, the resolution of the feature maps gradually decreases after a succession of pooling operations, which discard the semantically irrelevant information and squeeze more quality-aware global features into the top layer [34]. Specific to the DQA task, global and uniform artifacts, such as the blurriness caused by a low-pass filter, can be well captured from the top layer features. However, the local information of the image is lost in this process, although it plays a vital role in describing the non-uniform quality degradation across different regions.

Recently, many region- and pixel-level image representation methods [38, 36, 35] have verified the efficiency of extracting local features from the intermediate-layer feature maps of a CNN. Inspired by these works, we further develop the “backward path” to unfold the way of incorporating the low-resolution global features into the high-resolution local features. In addition, since the importance of local and global features may vary across different image contents, we adopt a gated fusion module to adaptively determine the weights assigned to the multi-resolution feature maps, and merge them into a comprehensive feature vector to feed the quality regressor.

Let $\mathcal{F} = \{F_1, F_2, F_3, F_4\}$ denote the “forward path” features output by the four DBs, where a larger index $i$ denotes a deeper layer. Then, we reuse these feature maps in our “backward path”. Let $\mathcal{B} = \{B_1, B_2, B_3, B_4\}$ denote the features generated from the “backward path”. Each element of $\mathcal{B}$ is computed by integrating the feature maps from the current layer to the top layer, i.e.,

$$B_i = \sum_{j=i}^{4} \mathcal{U}_{2^{j-i}}\big(\mathcal{C}(F_j)\big),$$

where $\mathcal{C}(\cdot)$ denotes a convolution and ReLU operation, and $\mathcal{U}_{s}(\cdot)$ denotes the upsampling operation with factor $s$, which is conducted by the transposed convolution. In this way, we sequentially embed the semantic-related information of the top layer into the detail-related information of the previous layers across different scales.
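To make the data flow of the backward path concrete, the following NumPy sketch shows one possible reading of the description above. It is not the released implementation and relies on several assumptions: the integration is taken to be a sum, the convolution is taken to be 1×1, nearest-neighbour upsampling stands in for the learned transposed convolution, each deeper feature map is assumed to have half the spatial size of the previous one, and all names and toy shapes are hypothetical.

```python
import numpy as np


def conv1x1_relu(x, weight):
    """x: (C_in, H, W); weight: (C_out, C_in). A 1x1 convolution followed by ReLU."""
    return np.maximum(np.einsum("oc,chw->ohw", weight, x), 0.0)


def upsample(x, factor):
    """Nearest-neighbour upsampling, a stand-in for the learned transposed convolution."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def backward_path(forward_feats, weights):
    """forward_feats[i] has half the spatial size of forward_feats[i - 1]."""
    # Project every forward feature map to a common channel count.
    reduced = [conv1x1_relu(f, w) for f, w in zip(forward_feats, weights)]
    n = len(reduced)
    # Each backward feature sums the current layer and all deeper layers,
    # with deeper layers upsampled to the current resolution.
    return [
        sum(upsample(reduced[j], 2 ** (j - i)) for j in range(i, n))
        for i in range(n)
    ]


# Toy forward features: channels grow and resolution halves with depth.
shapes = [(8, 16, 16), (16, 8, 8), (32, 4, 4), (64, 2, 2)]
feats = [np.ones(s) for s in shapes]
weights = [np.full((4, s[0]), 1.0 / s[0]) for s in shapes]  # project to 4 channels
backward = backward_path(feats, weights)
```

In this toy setting every projected map equals one, so the shallowest backward feature accumulates four contributions while the deepest keeps only its own, which shows how global information is pushed into the high-resolution maps.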

It is worth noting that the elements of the backward-path feature set $\mathcal{B} = \{B_1, \dots, B_4\}$ initially present different resolutions, which simulate the perception of various receptive field sizes [39]. Straightforwardly concatenating or merging these features would bias the result towards the high-resolution feature maps. To cope with this issue, as shown in Fig. 10, we first rescale all elements of $\mathcal{B}$ to equal-length feature vectors via spatial pyramid pooling (SPP) [40], which applies regular $4\times4$ and $2\times2$ max-pooling grids to each feature map and reshapes the results into the feature vectors $v_i$. For clarity, we denote this operation by

$$v_i = \mathrm{SPP}(B_i),$$

where the dimension of $v_i$ is 5120, i.e., $(4\times4 + 2\times2)\times256$.
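The equal-length property of the pooled vectors can be checked with a small NumPy sketch of two-level spatial pyramid pooling. The bin boundaries (floor for the start, ceiling for the end) and the flattening order are assumptions, and the function names are hypothetical; only the 4×4 and 2×2 levels and the 256 channels come from the text.

```python
import numpy as np


def adaptive_max_pool(x, bins):
    """x: (C, H, W) -> (C, bins, bins), max over a regular grid of regions."""
    c, h, w = x.shape
    out = np.empty((c, bins, bins), dtype=x.dtype)
    for r in range(bins):
        r0, r1 = (r * h) // bins, -((-(r + 1) * h) // bins)  # floor start, ceil end
        for s in range(bins):
            s0, s1 = (s * w) // bins, -((-(s + 1) * w) // bins)
            out[:, r, s] = x[:, r0:r1, s0:s1].max(axis=(1, 2))
    return out


def spp(x, levels=(4, 2)):
    """Concatenate the flattened pooling results of all pyramid levels."""
    return np.concatenate([adaptive_max_pool(x, b).reshape(-1) for b in levels])


# Two feature maps with 256 channels but different resolutions.
v_small = spp(np.random.rand(256, 10, 10))
v_large = spp(np.random.rand(256, 40, 40))
# A single-channel 4x4 map makes the pooled values easy to inspect.
v_tiny = spp(np.arange(16.0).reshape(1, 4, 4))
```

Both 256-channel inputs produce vectors of length (16 + 4) × 256 = 5120 regardless of their spatial size, which is what allows features from different scales to be stacked afterwards.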

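Then, the weight $w_i$ of each pooled feature vector $v_i$ is computed as a nonlinear mapping of its response to a learnable convolution, i.e.,

$$w_i = \sigma\big(\mathcal{G}(v_i)\big),$$

where $\mathcal{G}(\cdot)$ denotes the learnable convolution, $\sigma(\cdot)$ is the sigmoid function, and $w_i$ shares the same dimension with $v_i$. For brevity, we stack the feature and weight vectors into matrix form, i.e., $V = [v_1; v_2; v_3; v_4]$ and $W = [w_1; w_2; w_3; w_4]$, which share the same dimension of $4\times5120$. In the following, we assign the weights $W$ to $V$ via an element-wise product operation, and generate the fused feature vector $f$ with a convolution $\mathcal{H}(\cdot)$, i.e.,

$$f = \mathcal{H}(W \odot V),$$

where $\odot$ denotes the element-wise product operation.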
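A minimal NumPy sketch of the gating idea is given below. The exact form of the two convolutions is not recoverable from the text, so this sketch makes its own simplifying assumptions: the gate convolution is modeled as a per-scale affine map, and the fusing convolution as a learned weighted combination across the four scales that yields a 5120-dimensional vector. All parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(V, gate_scale, gate_bias, mix):
    """V: (4, D) stacked features; gate_scale, gate_bias: (4, 1); mix: (4,)."""
    W = sigmoid(gate_scale * V + gate_bias)  # gates in (0, 1), same shape as V
    return mix @ (W * V)                     # weighted sum over scales -> (D,)


V = np.ones((4, 5120))
gate_scale = np.zeros((4, 1))   # zero pre-activation gives a gate of 0.5
gate_bias = np.zeros((4, 1))
mix = np.full(4, 0.25)          # equal mixing of the four scales
f = gated_fusion(V, gate_scale, gate_bias, mix)
```

Because the gates depend on the features themselves, the contribution of each scale can change from image to image, which matches the stated motivation for the gated fusion module.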

Finally, the comprehensive feature vector $f$, which collects both the local and global quality-aware information, is fed to three cascaded fully connected (FC) layers and a sigmoid function to generate the predicted quality score $\hat{q}$. Let $q$ denote the ground-truth quality score. The learning objective of our B-FEN model is to minimize the loss between $\hat{q}$ and $q$, i.e.,

$$\mathcal{L} = \frac{1}{N} \sum_{k=1}^{N} \ell(\hat{q}_k, q_k),$$

where $\ell(\cdot,\cdot)$ is the per-sample regression loss, $N$ is the number of all training samples, and $\hat{q}_k$ and $q_k$ denote the predicted and ground-truth quality scores of the $k$th de-rained image, respectively.

IV Experiments

To evaluate the performance of the proposed B-FEN model, we conduct the experiments on our IVIPC-DQA database. Due to the absence of specific quality metric for de-rained image, we compare the proposed B-FEN model with our previous B-GFN [16] and some representative general-purpose image quality assessment models, which include 10 opinion-aware (OA) metrics (i.e., BIQI [18], BLIINDS II [41], BRISQUE [22], DIIVINE [21], M3 [42], NFERM [43], TCLT [19], MEON [31], DB-CNN [32] and WaDIQaM [33]), and 4 opinion-unaware (OU) metrics (i.e., NIQE [44], ILNIQE [45], QAC [46], and LPSI [47]). Meanwhile, two popular unidirectional feature embedding networks, i.e., DenseNet-161 [37] and ResNet-152 [48] are also involved in our comparison, which are categorized as OA metric in the following section.

IV-A Implementation Details

All OA metrics need training process to determine the parameters of quality assessment model. Following the setup of [32, 31, 33], we randomly separate the IVIPC-DQA database into the non-overlapped training and testing sets, which include 80% and 20% images respectively.

To train our B-FEN model efficiently, generic label-preserving transformations, including random cropping and horizontal flipping [49], are used for augmenting the training data, where the cropped patch size is 320×320. The four dense blocks are pre-trained on the ImageNet database [50], and the weights/biases of all the other convolutional layers are initialized following the recommendation of [51]. We employ the SGD optimizer [52] for model learning with a mini-batch size of 16. The base learning rate is set to 0.01. In addition, the momentum and weight/bias decay parameters are set to 0.9 and 0, respectively.

TABLE III: The evaluation results of all quality assessment models

The DenseNet-161 [37] and ResNet-152 [48] models are pre-trained on the ImageNet database [50] and then fine tuned on our IVIPC-DQA database. All the other OA metrics are directly re-trained on our IVIPC-DQA database, whose training settings follow the descriptions in their literatures [18, 41, 22, 21, 42, 43, 19, 31, 32, 33]. Since the OU metrics do not require quality labels to learn the parameters, we directly use the models released by the authors [44, 45, 46, 47] to predict the image quality in the following experiments.

Similar to [32, 31, 33], the random split is repeated 10 times and the median results of four popular indicators across all trials are reported for evaluating the DQA performance, which include Pearson’s linear correlation coefficient (PLCC), Spearman’s rank correlation coefficient (SRCC), Kendall’s rank correlation coefficient (KRCC) and the perceptually weighted rank correlation (PWRC) [53]. Note that the PWRC indicator provides an overall performance measure, i.e., AUC, and a confidence-varying performance measure, i.e., the SA-ST curve.
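The evaluation protocol can be sketched with the standard library alone. The code below implements PLCC, SRCC and KRCC (PWRC is omitted) together with a repeated 80%/20% random split whose results are summarized by the median. The function names are hypothetical, the tau-a form of Kendall's coefficient is an assumption, and whether the split is drawn over rain images or over de-rained samples is not specified in the text, so the split here simply operates on item indices.

```python
import random
from math import sqrt
from statistics import mean, median


def plcc(x, y):
    """Pearson's linear correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def ranks(v):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return r


def srcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return plcc(ranks(x), ranks(y))


def krcc(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            d = (x[i] - x[j]) * (y[i] - y[j])
            s += (d > 0) - (d < 0)
    return s / (n * (n - 1) / 2)


def random_split(n_items, train_ratio=0.8, seed=0):
    """Non-overlapping random train/test index split."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_ratio * n_items))
    return idx[:cut], idx[cut:]


def median_over_splits(n_items, evaluate, n_trials=10):
    """evaluate(train_idx, test_idx) trains a model and returns one indicator value."""
    return median(evaluate(*random_split(n_items, seed=t)) for t in range(n_trials))
```

A caller would pass an `evaluate` function that retrains the quality model on the training indices and returns, for example, the SRCC between predictions and MOS on the test indices.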

IV-B Consistency Evaluation

In this section, we first compare the consistency between the subjective ratings and the predictions of different quality assessment algorithms towards the de-rained images. In Table III, we report the overall prediction accuracies of all metrics, where the deep learning based methods are denoted by italics and the best results are highlighted by the boldface.

Fig. 11: The SA-ST curves of different quality assessment models.

It is seen that all existing general-purpose NR-IQA models perform poorly in the DQA task. For the hand-crafted feature based models, the SRCC values are all smaller than 0.4. Meanwhile, limited by the image prior learned from rainless scenes and by inflexible parameter settings, the OU metrics perform much worse, with SRCC values that are all close to or even smaller than 0. This shows that the DQA task is more challenging than the traditional evaluation of uniform distortions, where the OU metrics can achieve comparable or even better performance than the OA metrics [44, 45, 46, 47].

Fig. 12: Illustration of ranking results from different deep image quality assessment models. In each column, six de-rained versions of an image are ranked in descending order from top to down by each DQA model. The MOS value of each image is labeled in its top left corner.

Due to their great capability of jointly learning discriminative features and regressors, all deep learning based OA metrics report much better results, with SRCC values larger than 0.4. More specifically, MEON, WaDIQaM and ResNet-152 directly pass the feature maps from the shallow layers to the deep layers, and their performance improvements are still moderate. DenseNet-161 and DB-CNN further enhance the quality-aware global representation by feature reuse and fusion. Due to this superiority, the SRCC values of ResNet-152 and DB-CNN rise to 0.57, which outperform MEON and WaDIQaM. It is worth noting that these representative deep learning based NR-IQA models (such as MEON, DB-CNN and WaDIQaM) and the popular CNNs (such as DenseNet-161 and ResNet-152) all employ a unidirectional feature embedding architecture, whose feature map size gradually decreases from the shallow layers to the deep layers and whose local information is erased after successive pooling operations. By contrast, we incorporate the local details into the global features via our bi-directional feature embedding network, which is quite beneficial for describing the non-uniform distortions in de-rained images. As a result, our B-GFN and B-FEN models produce much better quality prediction results in terms of all indicators, with SRCC values that exceed 0.6 and approach 0.7. In addition, since the B-FEN enhances the feature reuse in the ‘backward path’, it reports better performance than our previous B-GFN model.

Fig. 11 further plots the SA-ST curves of different quality assessment models, where the deep and handcrafted IQA models are drawn with solid and dotted lines, respectively. It is seen that our B-FEN significantly outperforms existing general-purpose NR-IQA models and our previous B-GFN across a wide range of confidence intervals. That is, it performs better in correctly ranking the high quality image pairs no matter whether their perceptual difference is small or large [53]. This is important for recommending perceptually preferred de-raining results in various real-world applications. For clarity, an illustration of ranking results from different deep image quality assessment models is given in Fig. 12, where our B-FEN model produces the same quality ranking as the human ratings and all the other models show several ranking errors.

IV-C Rain-remover Independency

Besides the consistency investigation for the proposed objective model, we also conduct leave-one-out cross validation [54] to verify that the accuracy of our B-FEN does not depend on any specific de-raining algorithm. More specifically, for each de-raining algorithm, we take its 206 de-rained images as the test set, and the remaining images produced by all the other de-raining algorithms are used for training our B-FEN model. We repeat this trial 6 times until all de-raining algorithms have been separately tested in our experiment. Let $P_m$ denote the overall performance of the $m$th quality assessment model, and $p_{m,n}$ denote the accuracy of the $m$th quality assessment model towards the $n$th de-raining algorithm. Following the criteria of [54], we represent $P_m$ by

$$P_m = \frac{1}{6} \sum_{n=1}^{6} p_{m,n},$$

where this overall performance is computed across all indicators, i.e., SRCC/PLCC/KRCC/PWRC.
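The fold construction of this leave-one-algorithm-out protocol can be sketched as follows. The sample tuple layout, the function names, and the use of a plain average to combine the six per-fold accuracies are assumptions made for illustration; the algorithm labels are those used in the paper.

```python
ALGORITHMS = ["Ding16", "Kang12", "Luo15", "Li16", "Deng17", "Fu17"]


def leave_one_remover_out(samples):
    """samples: list of (image_id, algorithm, mos). Yields (held_out, train, test)."""
    for held_out in ALGORITHMS:
        test = [s for s in samples if s[1] == held_out]
        train = [s for s in samples if s[1] != held_out]
        yield held_out, train, test


def overall_performance(per_fold_scores):
    """Combine the per-algorithm accuracies of one model (simple mean assumed)."""
    return sum(per_fold_scores) / len(per_fold_scores)


# Placeholder database: 206 rain images, six de-rained versions each.
samples = [(i, a, 50.0) for i in range(206) for a in ALGORITHMS]
folds = list(leave_one_remover_out(samples))
```

Each fold tests on 206 images from one de-raining algorithm and trains on the 1030 images from the other five, so the test-time artifacts are never seen during training.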

TABLE IV: The evaluation results of all quality assessment models

Fig. 13: The SA-ST curves for rain-remover independency investigation.

Table IV shows the independency investigation results for all quality assessment models. It is seen that the deep learning based methods still perform better than the handcrafted models, and the OA metrics significantly outperform the OU metrics. This demonstrates that a powerful learning capability is necessary for bridging the gap from the uniform distortion measure to the non-uniform distortion measure.

In addition, we find that the straight-through UFE networks, i.e., MEON, WaDIQaM and ResNet-152, are still inferior to the feature fusion based UFE networks, such as DenseNet-161 and DB-CNN. Meanwhile, the proposed B-FEN also performs best in the independency investigation. This confirms that a more comprehensive local and global quality-aware feature representation is the key for DQA no matter which de-raining algorithm is applied to the rainy image.

Similar to Section IV-B, we also show the SA-ST curves of different quality assessment models in this rain-remover independency investigation. As shown in Fig. 13, our B-FEN still outperforms all the other NR-IQA models across different confidence intervals. It demonstrates that the proposed DQA model offers better de-rained recommendations to the users no matter which de-raining algorithm is used in current application.

IV-D Complexity Analysis

TABLE V: Complexity analysis for deep quality assessment models. ‘M’ and ‘B’ represent the units of million and billion, respectively.

Besides the evaluation accuracy, we further compare the complexities of different deep quality assessment methods in terms of the number of parameters (#Params.), the number of floating-point operations (FLOPs) [55], and the actual running speed (Images/sec.). We implement the proposed B-FEN method with the PyTorch library, and perform the experiments on a workstation with an Intel Xeon E5-2660 CPU and an NVIDIA TITAN X GPU.

(a) SROCC versus #Params.
(b) SROCC versus FLOPs.
(c) SROCC versus Images/sec.
Fig. 14: Performance and complexity analysis for deep quality assessment models. ‘M’ and ‘B’ represent the units of million and billion, respectively.

Table V shows the statistical results of all deep IQA models. MEON and WaDIQaM show the lowest complexities: both are developed with shallow network architectures and report the smallest #Params. and FLOPs. The complexity of the proposed B-FEN is moderate. Its #Params. is smaller than that of ResNet-152, and its FLOPs are lower than those of both ResNet-152 and DB-CNN. This benefits from the use of multiple 1×1 convolutions and limited channels in our backward path, which only slightly increase the parameter size and computation cost in comparison with DenseNet-161. Meanwhile, the proposed B-FEN adds more feature reuse in the ‘backward path’, which leads to a slightly higher complexity than our previous B-GFN. DB-CNN uses fewer convolution layers and presents a smaller #Params. than the proposed model. However, since multiple high dimensional features are employed in its fully connected layers, the FLOPs of DB-CNN are significantly higher than those of our B-FEN. Finally, the running speed of our B-FEN model reaches 14.14 Images/sec., which is close to DenseNet-161 and much faster than DB-CNN.
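To illustrate why 1×1 convolutions with limited channels add little cost, the sketch below counts the parameters and multiply-accumulate operations (MACs) of a standard convolution layer. The channel counts and feature map size are hypothetical and are not the actual B-FEN configuration. Note also that FLOPs conventions differ across papers: some count one MAC as one operation and others as two.

```python
def conv2d_cost(c_in, c_out, kernel, out_h, out_w, bias=True):
    """Return (params, macs) for a standard 2-D convolution layer."""
    # Each output channel holds one kernel x kernel filter per input channel.
    weights = c_in * c_out * kernel * kernel
    params = weights + (c_out if bias else 0)
    # Every weight is applied once at every output position.
    macs = weights * out_h * out_w
    return params, macs


if __name__ == "__main__":
    # Hypothetical layer: 256 -> 64 channels on a 28 x 28 feature map.
    p1, m1 = conv2d_cost(256, 64, kernel=1, out_h=28, out_w=28)
    p3, m3 = conv2d_cost(256, 64, kernel=3, out_h=28, out_w=28)
    print("1x1 conv:", p1, "params,", m1, "MACs")
    print("3x3 conv:", p3, "params,", m3, "MACs")
    print("MAC ratio (3x3 / 1x1):", m3 / m1)
```

For the same channel counts, a 3×3 convolution needs nine times the weights and MACs of a 1×1 convolution, and reducing the number of output channels scales both quantities down linearly.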

Fig. 14 further investigates the relationship between performance and complexity for the deep IQA models, where the SROCC is used as the performance indicator and #Params., FLOPs, and Images/sec. are used to measure complexity. The proposed B-FEN balances evaluation accuracy and complexity well. More specifically, it achieves the highest SROCC with moderate memory and computation costs.
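For readers unfamiliar with the performance indicator, the following minimal sketch computes the SROCC as the Pearson correlation between the ranks of the predicted and subjective scores, with tied values sharing their average rank. The function names and the example scores are hypothetical and are not taken from our database.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srocc(predicted, subjective):
    """Spearman rank-order correlation between two score lists."""
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    rx, ry = average_ranks(predicted), average_ranks(subjective)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("SROCC is undefined for constant scores")
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Hypothetical predicted scores and subjective ratings.
    model_scores = [0.2, 0.9, 0.4, 0.7]
    human_scores = [1.5, 4.8, 3.9, 3.1]
    print(srocc(model_scores, human_scores))
```

Because only the rank order matters, the SROCC is unaffected by any monotonic rescaling of the predicted scores, which makes it suitable for comparing models whose outputs lie on different scales.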

V Conclusion

Single image rain removal has received extensive attention recently. However, very little work has been dedicated to the quality assessment of de-rained images. In this paper, we first build a new database that collects human-rated scores for the de-rained versions of various authentic rain images. Then, a bi-directional feature embedding network (B-FEN) is proposed to predict human perception of different de-rained images. Experimental results show that the de-raining quality assessment task is quite challenging, and that all of the existing general purpose BIQA models fail to accurately predict the perceptual de-raining quality. By means of the enriched global and local feature representation, our proposed B-FEN produces very promising DQA results, and significantly outperforms many representative BIQA models and state-of-the-art deep neural networks. Our new database and B-FEN metric are helpful for evaluating and developing perceptually preferred de-raining algorithms for authentic rain scenes.


  • [1] D. Chen, C. Chen, and L. Kang, “Visual depth guided color image rain streaks removal using sparse coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 8, pp. 1430–1455, Aug 2014.
  • [2] H. Zhang, V. Sindagi, and V. M. Patel, “Image de-raining using a conditional generative adversarial network,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019, in Press.
  • [3] P. C. Barnum, S. Narasimhan, and T. Kanade, “Analysis of rain and snow in frequency space,” International Journal of Computer Vision, vol. 86, no. 2, p. 256, Jan 2009.
  • [4] A. K. Katsaggelos, Digital Image Restoration.   Springer Publishing Company, Incorporated, 2012.
  • [5] H. Liu, R. Xiong, X. Zhang, Y. Zhang, S. Ma, and W. Gao, “Nonlocal gradient sparsity regularization for image restoration,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 9, pp. 1909–1921, Sep. 2017.
  • [6] K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE TPAMI, vol. 35, no. 6, pp. 1397–1409, June 2013.
  • [7] X. Ding, L. Chen, X. Zheng, Y. Huang, and D. Zeng, “Single image rain and snow removal via guided l0 smoothing filter,” Multimedia Tools and Applications, vol. 75, no. 5, pp. 2697–2712, Mar 2016.
  • [8] X. Zheng, Y. Liao, W. Guo, X. Fu, and X. Ding, “Single-image-based rain and snow removal using multi-guided filter,” in ICONIP, 2013, pp. 258–265.
  • [9] Y. L. Chen and C. T. Hsu, “A generalized low-rank appearance model for spatio-temporally correlated rain streaks,” in IEEE International Conference on Computer Vision, Dec 2013, pp. 1968–1975.
  • [10] L. W. Kang, C. W. Lin, and Y. H. Fu, “Automatic single-image-based rain streaks removal via image decomposition,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1742–1755, April 2012.
  • [11] Y. Luo, Y. Xu, and H. Ji, “Removing rain from a single image via discriminative sparse coding,” in IEEE International Conference on Computer Vision, Dec 2015, pp. 3397–3405.
  • [12] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown, “Rain streak removal using layer priors,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 2736–2744.
  • [13] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan, “Joint rain detection and removal via iterative region dependent multi-task learning,” CoRR, vol. abs/1609.07769, 2016. [Online]. Available:
  • [14] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley, “Clearing the skies: A deep network architecture for single-image rain removal,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2944–2956, June 2017.
  • [15] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.
  • [16] Q. Wu, L. Wang, K. N. Ngan, H. Li, and F. Meng, “Beyond synthetic data: A blind deraining quality assessment metric towards authentic rain image,” in IEEE International Conference on Image Processing, Sep. 2019, pp. 2364–2368.
  • [17] Q. Wu, H. Li, K. N. Ngan, and K. Ma, “Blind image quality assessment using local consistency aware retriever and uncertainty aware evaluator,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2078–2089, Sep. 2018.
  • [18] A. K. Moorthy and A. C. Bovik, “A two-step framework for constructing blind image quality indices,” IEEE Signal Processing Letters, vol. 17, no. 5, pp. 513–516, May 2010.
  • [19] Q. Wu, H. Li, F. Meng, K. N. Ngan, B. Luo, C. Huang, and B. Zeng, “Blind image quality assessment based on multichannel feature fusion and label transfer,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 3, pp. 425–440, March 2016.
  • [20] Q. Wu, H. Li, Z. Wang, F. Meng, B. Luo, W. Li, and K. N. Ngan, “Blind image quality assessment based on rank-order regularized regression,” IEEE Transactions on Multimedia, vol. 19, no. 11, pp. 2490–2504, Nov 2017.
  • [21] A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, Dec 2011.
  • [22] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, Dec 2012.
  • [23] Q. Wu, H. Li, F. Meng, K. N. Ngan, and S. Zhu, “No reference image quality assessment metric via multi-domain structural information and piecewise regression,” J. Vis. Commun Image R., vol. 32, no. Supplement C, pp. 205 – 216, 2015.
  • [24] L. Deng, T. Huang, X. Zhao, and T. Jiang, “A directional global sparse model for single image rain removal,” Applied Mathematical Modelling, vol. 59, pp. 662–679, 2018.
  • [25] ITU-R, “Recommendation bt.500-13: Methodology for subjective assessment of the quality of television pictures,” [Online] Available:, 2012.
  • [26] ITU-T, “Recommendation p.910: Subjective video quality assessment methods for multimedia applications,” [Online] Available:, 2008.
  • [27] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, “Study of subjective and objective quality assessment of video,” IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, Jun. 2010.
  • [28] L. Ma, W. Lin, C. Deng, and K. N. Ngan, “Image retargeting quality assessment: A study of subjective scores and objective metrics,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 626–639, Oct 2012.
  • [29] A. K. Moorthy, K. Seshadrinathan, R. Soundararajan, and A. C. Bovik, “Wireless video quality assessment: A study of subjective scores and objective algorithms,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 4, pp. 587–599, April 2010.
  • [30] D. J. Sheskin, Handbook of parametric and nonparametric statistical procedures.   CRC Press, 2003.
  • [31] K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang, and W. Zuo, “End-to-end blind image quality assessment using deep neural networks,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1202–1213, March 2018.
  • [32] W. Zhang, K. Ma, J. Yan, D. Deng, and Z. Wang, “Blind image quality assessment using a deep bilinear convolutional neural network,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2018.
  • [33] S. Bosse, D. Maniry, K. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206–219, Jan 2018.
  • [34] N. Akhtar and U. Ragavendran, “Interpretation of intelligence in cnn-pooling processes: a methodological survey,” Neural Computing and Applications, Jul 2019. [Online]. Available:
  • [35] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang, “Single image dehazing via multi-scale convolutional neural networks,” in European conference on computer vision.   Springer, 2016, pp. 154–169.
  • [36] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in IEEE Winter Conference on Applications of Computer Vision, March 2018, pp. 1451–1460.
  • [37] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” CoRR, vol. abs/1608.06993, 2016. [Online]. Available:
  • [38] Z. Zhang, X. Zhang, C. Peng, D. Cheng, and J. Sun, “Exfuse: Enhancing feature fusion for semantic segmentation,” CoRR, vol. abs/1804.03821, 2018. [Online]. Available:
  • [39] W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4898–4906.
  • [40] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” CoRR, vol. abs/1406.4729, 2014. [Online]. Available:
  • [41] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the dct domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, Aug 2012.
  • [42] W. Xue, X. Mou, L. Zhang, A. C. Bovik, and X. Feng, “Blind image quality assessment using joint statistics of gradient magnitude and laplacian features,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4850–4862, Nov 2014.
  • [43] K. Gu, G. Zhai, X. Yang, and W. Zhang, “Using free energy principle for blind image quality assessment,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50–63, Jan 2015.
  • [44] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a completely blind image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, March 2013.
  • [45] L. Zhang, L. Zhang, and A. C. Bovik, “A feature-enriched completely blind image quality evaluator,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2579–2591, Aug 2015.
  • [46] W. Xue, L. Zhang, and X. Mou, “Learning without human scores for blind image quality assessment,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp. 995–1002.
  • [47] Q. Wu, Z. Wang, and H. Li, “A highly efficient method for blind image quality assessment,” in IEEE International Conference on Image Processing, Sept 2015, pp. 339–343.
  • [48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available:
  • [49] L. Taylor and G. Nitschke, “Improving deep learning using generic data augmentation,” CoRR, vol. abs/1708.06020, 2017. [Online]. Available:
  • [50] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
  • [51] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in IEEE International Conference on Computer Vision, Dec 2015, pp. 1026–1034.
  • [52] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning, 2013, pp. III–1139–III–1147.
  • [53] Q. Wu, H. Li, F. Meng, and K. N. Ngan, “A perceptually weighted rank correlation indicator for objective image quality assessment,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2499–2513, May 2018.
  • [54] M. W. Browne, “Cross-validation methods,” Journal of Mathematical Psychology, vol. 44, no. 1, pp. 108 – 132, 2000.
  • [55] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient transfer learning,” CoRR, vol. abs/1611.06440, 2016. [Online]. Available: