Learn to Evaluate Image Perceptual Quality Blindly from Statistics of Self-similarity


Wufeng Xue, Xuanqin Mou, and Lei Zhang

This work was supported in part by the National Natural Science Foundation of China under Grant 61172163, Grant 90920003, and Grant 61271294, and in part by HK RGC GRF grant (under no. PolyU 5313/13E).

W. Xue and X. Mou are with the Institute of Image Processing and Pattern Recognition, Xi’an Jiaotong University, Xi’an, China (Email: xwolfs@hotmail.com, xqmou@mail.xjtu.edu.cn). X. Mou is also with the Beijing Center for Mathematics and Information Interdisciplinary Sciences (BCMIIS), Beijing, China.

L. Zhang is with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong (Email: cslzhang@comp.polyu.edu.hk).

Among the various image quality assessment (IQA) tasks, blind IQA (BIQA) is particularly challenging due to the absence of knowledge about the reference image and the distortion type. Features based on natural scene statistics (NSS) have been successfully used in BIQA, and the quality relevance of the features plays an essential role in the quality prediction performance. Motivated by the fact that the early processing stage in the human visual system aims to remove signal redundancies for efficient visual coding, we propose a simple but very effective BIQA method that computes the statistics of self-similarity (SOS) in an image. Specifically, we calculate the inter-scale similarity and intra-scale similarity of the distorted image, extract the SOS features from these similarities, and learn a regression model to map the SOS features to the subjective quality score. Extensive experiments demonstrate the very competitive quality prediction performance and generalization ability of the proposed SOS based BIQA method.

Blind image quality assessment, natural scene statistics, self-similarity, image redundancy.

I Introduction

Image quality assessment (IQA) aims to measure the extent to which an observer is satisfied with the perceptual quality of a given image. IQA has become increasingly important because of its many uses, including image quality monitoring, parameter tuning of image processing algorithms, and serving as a yardstick for evaluating the performance of image processing systems. With the proliferation of high-speed networks and portable multimedia devices, the demand for reliable and efficient IQA algorithms keeps growing.

In the past decade, a variety of IQA methods have been proposed, which can be generally classified into three categories according to how much information about the original reference image is available [1]: full reference (FR), reduced reference (RR) and no reference (NR). The FR methods have high prediction accuracy [2, 3, 4, 5, 6] because the pristine reference image is available. In RR methods, a brief description of the reference image is available, for example, features based on natural scene statistics (NSS). By matching the statistics of the reference image and the distorted image, RR methods can also achieve good quality prediction accuracy [7, 8, 9, 10, 11]. However, both FR and RR methods are hard to use in many practical applications, where the reference image information is completely inaccessible. Therefore, it is highly desirable to develop NR methods that predict the quality of a distorted image without prior information.

The NR methods can be categorized into distortion-specific (DS) methods and non-distortion-specific (NDS) ones. DS methods assume that the image degradation procedure is known, and descriptors that are capable of capturing the artifacts are employed to measure the quality. A review of DS methods can be found in [12]. In NDS methods, the distortion procedure is unknown, which is the case in most practical applications. Usually, this class of NR methods is also called blind IQA (BIQA).

Fig. 1: Flowchart of the proposed BIQA method by statistics of self-similarity (SOS).

Existing solutions to BIQA have achieved good performance with the help of machine learning methods such as support vector regression and neural networks. A survey of recent BIQA methods can be found in [13]. These methods differ from each other mainly in how the quality-aware features are extracted. It is widely accepted that natural images are highly sparse in the high-dimensional space. Once a natural image is distorted, its characteristics deviate from those of the original image. This provides the underlying motivation of most BIQA methods. To capture a quality-aware representation, images are usually decomposed into multiple frequencies and orientations by using wavelets [14, 15, 16], contourlets [17], the discrete cosine transform (DCT) [18, 19], etc. Compared with the pixel-based representation, redundancies among the coefficients in these transformed domains are largely reduced. The resulting transform coefficients follow a high-kurtosis, heavy-tailed distribution. Distortions present in an image lead to deviations from this distribution, which can be used to predict the image’s quality.

To further reduce the redundancy of images for a more effective representation (from the viewpoint of encoding), contrast normalization is introduced into multi-scale and multi-orientation image decomposition [20]. It is shown in [21] that the mean-subtracted contrast normalized (MSCN) coefficients are decorrelated and follow a Gaussian distribution. This finding can be used to model the contrast masking effect in early human vision. Mittal et al. [22] parameterized MSCN coefficients with a Generalized Gaussian Distribution (GGD) and the pairwise products of MSCN coefficients with an asymmetric GGD (AGGD). The resulting method, called BRISQUE, obtains state-of-the-art BIQA performance. Inspired by this contrast normalization, Xue et al. [23] proposed a jointly adaptive normalization (JAN) scheme to reduce the redundancy in the domains of Laplacian of Gaussian (LOG) response and gradient magnitude (GM). After the JAN operation, the statistics of LOG and GM become more similar among natural images of different contents while becoming more different from those of unnatural images. The model M3 proposed in [23] shows better performance than BRISQUE on two benchmark databases.

Most existing BIQA methods [14, 15, 16, 17, 18, 19] calculate the image statistics in a transformed domain where the image redundancies are much reduced. In contrast to these methods, we find that measuring image redundancy directly in the pixel-based spatial domain can lead to an efficient BIQA method with promising performance. The human visual system has evolved to describe natural images economically through efficient redundancy reduction [24]. The redundancy in an image can be reflected by how well a pixel’s intensity can be predicted from its neighboring pixels [25]. Natural images generally have high spatial correlation and multi-scale correlation; that is, a natural image looks similar to its translated, zoomed-in or zoomed-out versions. Therefore, we measure the image redundancy by computing the intra-scale and inter-scale self-similarities of the image, and propose a BIQA method called Statistics of Self-similarity (SOS). It is worth noting that SOS is very different from the previous work M3 in [23]. First, SOS aims to describe the degree of redundancy of an image, while M3 aims to capture the local contrast. Second, the SOS features are based on the distributions of similarity maps, while in M3 the GM and LOG features are used and jointly normalized to obtain a more robust feature representation. Finally, SOS provides a general framework for BIQA in which any similarity function can be employed.

The rest of this paper is organized as follows. Section II presents the proposed SOS computation framework in detail and demonstrates the high relevance of the SOS based features to image quality. Section III gives the experimental settings. Extensive experimental results and analysis are presented in Section IV. Section V concludes the paper.

II Statistics of Self-Similarity for BIQA

The flowchart of the proposed SOS based BIQA method is illustrated in Fig. 1. It consists of the following main steps: local self-similarity map calculation, SOS feature extraction, and regression model learning, which are described in detail as follows.

(a) Reference image.
(b) Inter-scale LSM 1 for (a)
(c) Inter-scale LSM 2 for (a)
(d) Intra-scale LSM 1 for (a)
(e) Intra-scale LSM 2 for (a)
(f) Distorted image
(g) Inter-scale LSM 1 for (f)
(h) Inter-scale LSM 2 for (f)
(i) Intra-scale LSM 1 for (f)
(j) Intra-scale LSM 2 for (f)
Fig. 2: The LSMs of a reference image and its JPEG2000 compressed image.

II-A Calculation of Local Self-similarity Map

For an input image I, the redundancy can be described by its self-similarity. In particular, the image self-similarity can be measured from two aspects. The first one is intra-scale self-similarity. Due to the spatial correlation, I will be similar to its translated versions, denoted by , where m and n are the translations along vertical and horizontal directions. Refer to Fig. 1, in this paper we employ four translated versions of I by setting .

Apart from intra-scale self-similarity, natural images also exhibit inter-scale self-similarity. It is well known that natural images have the property of scale invariance, i.e., an image usually looks similar to its scaled versions. Considering the fact that the scale space of the human visual system can be well approximated by Gaussian filtering [26], we produce a series of smoothed versions of I by:

$$I_s(x, y) = I(x, y) \ast G(x, y, s) \tag{1}$$

where x, y are the spatial location, and

$$G(x, y, s) = \frac{1}{2\pi s^2} \exp\left(-\frac{x^2 + y^2}{2 s^2}\right) \tag{2}$$

is the 2D Gaussian filter with scale s. Refer to Fig. 1, we compute four smoothed versions of I with .
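
The construction of the translated and smoothed versions can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: images are nested lists of floats, borders are handled by replicating edge pixels, the Gaussian is truncated at three times its scale, and the offsets in `SHIFTS` and the scales in `SCALES` are placeholder values chosen for the example, not the settings of the proposed method.

```python
import math

def translate(img, m, n):
    """Return img shifted by m rows and n columns, replicating border pixels."""
    h, w = len(img), len(img[0])
    return [[img[min(max(y + m, 0), h - 1)][min(max(x + n, 0), w - 1)]
             for x in range(w)] for y in range(h)]

def gaussian_kernel(s):
    """Sampled 2D Gaussian with scale s, normalised so the weights sum to 1."""
    radius = max(1, math.ceil(3 * s))  # assumption: truncate at 3 * s
    k = [[math.exp(-(x * x + y * y) / (2 * s * s))
          for x in range(-radius, radius + 1)]
         for y in range(-radius, radius + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

def smooth(img, s):
    """Filter img with the Gaussian kernel, replicating border pixels."""
    k = gaussian_kernel(s)
    r = len(k) // 2
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += k[dy + r][dx + r] * img[yy][xx]
            out[y][x] = acc
    return out

# Hypothetical settings, used only to make the example concrete.
SHIFTS = [(0, 1), (1, 0), (1, 1), (1, -1)]
SCALES = [0.5, 1.0, 1.5, 2.0]

image = [[float((3 * x + 5 * y) % 17) for x in range(8)] for y in range(8)]
references = ([translate(image, m, n) for m, n in SHIFTS]
              + [smooth(image, s) for s in SCALES])
```

Each of the eight entries of `references` plays the role of one comparison image for the input; a local similarity function is then evaluated between the input and each of them.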

Denote by R any one of the four translated images and the four smoothed images . Both the intra-scale and inter-scale self-similarity can be calculated by computing the similarity between I and R in any local region, leading to a Local Similarity Map (LSM). Intuitively, the similarity functions used in many existing FR IQA methods to compute the local quality map (LQM) can all be used to compute this LSM. In this paper, we adopt the similarity functions in two representative FR-IQA methods, i.e., Structural SIMilarity (SSIM) [27] and ratio of non-shift edge (rNSE) [2, 28].

Fig. 3: The distributions of the LSMs for Fig. 2(a) (left) and Fig. 2(f) (right).

| Feature | Description | Dimension (for each of the 8 LSMs) |
|---|---|---|
| | Mean of the elements in an LSM | 1 (8 in total) |
| d | Standard deviation of the elements in an LSM | 1 (8 in total) |
| h | Histogram of a quantized (10 bins) LSM | 10 (80 in total) |

TABLE I: The SOS features considered in the proposed method.

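SSIM is a benchmark FR IQA method. In SSIM, the local similarity at each location is calculated by [27]:

$$\mathrm{SSIM}(x, y) = [L(x, y)]^{\alpha} \cdot [C(x, y)]^{\beta} \cdot [S(x, y)]^{\gamma} \tag{3}$$

where the constants $\alpha$, $\beta$ and $\gamma$ mediate the relative importance of the three components. L, C and S measure the similarities of luminance, contrast and structure between I and R:

$$L(x, y) = \frac{2\mu_I \mu_R + C_1}{\mu_I^2 + \mu_R^2 + C_1} \tag{4}$$

$$C(x, y) = \frac{2\sigma_I \sigma_R + C_2}{\sigma_I^2 + \sigma_R^2 + C_2} \tag{5}$$

$$S(x, y) = \frac{\sigma_{IR} + C_3}{\sigma_I \sigma_R + C_3} \tag{6}$$

where $\mu_I$ and $\mu_R$ are the local means of I and R; $\sigma_I$ and $\sigma_R$ are the local standard deviations of I and R; and $\sigma_{IR}$ is the local covariance between I and R. All these computations are applied using a local Gaussian window with a specified scale parameter as the weighting factor. $C_1$, $C_2$ and $C_3$ are small constants that keep the denominators away from zero. In this work, we follow [27] for the configuration of all these parameters.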
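
To make the luminance, contrast and structure comparison concrete, the following minimal sketch evaluates SSIM between two small patches. It rests on several simplifying assumptions: uniform weights replace the Gaussian window, all three exponents are set to 1, and the structure constant is half the contrast constant, in which case the contrast and structure terms merge into a single factor. The constants are the defaults commonly associated with [27] for 8-bit images; the patch values are made up.

```python
def local_ssim(patch_i, patch_r, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM between two equally sized patches given as flat intensity lists."""
    n = len(patch_i)
    mu_i = sum(patch_i) / n
    mu_r = sum(patch_r) / n
    var_i = sum((a - mu_i) ** 2 for a in patch_i) / n
    var_r = sum((b - mu_r) ** 2 for b in patch_r) / n
    cov = sum((a - mu_i) * (b - mu_r) for a, b in zip(patch_i, patch_r)) / n
    luminance = (2 * mu_i * mu_r + c1) / (mu_i ** 2 + mu_r ** 2 + c1)
    # Contrast and structure combined (valid when C3 = C2 / 2).
    contrast_structure = (2 * cov + c2) / (var_i + var_r + c2)
    return luminance * contrast_structure

patch = [52, 55, 61, 59, 79, 61, 76, 61, 60]
shifted = [55, 61, 59, 79, 61, 76, 61, 60, 58]  # same content, moved by one pixel
same = local_ssim(patch, patch)
different = local_ssim(patch, shifted)
```

Evaluating such a function in a window around every pixel, between the input image and one of its translated or smoothed versions, yields one local similarity map.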

rNSE [2, 28] is a recently proposed FR-IQA method, which measures image quality as the ratio of the number of non-shift edges after distortion to the number of original edges. The rNSE index is computed as:

$$\mathrm{rNSE}(D, A) = \frac{|E_D \cap E_A|}{|E_A|} \tag{7}$$

where $E_D$ and $E_A$ are the sets of edges of the distorted image D and the reference image A, respectively. The edge sets are computed by detecting the zero-crossings of the Laplacian of Gaussian (LOG) response [2, 28]. "$\cap$" denotes the intersection of the sets $E_D$ and $E_A$, i.e., the non-shift edge points between A and D; "$|\cdot|$" counts the number of edges in a set. Clearly, rNSE is a ratio between 0 and 1. Since the reference image A is not available in the context of BIQA, we modify the rNSE index as follows to calculate the desired LSM:


where and are respectively the sets of edge points of I and R in a local square window centered at .
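
The set ratio behind the original rNSE index (non-shift edges over reference edges) can be illustrated with Python sets. This sketch covers only that global ratio, not the windowed variant used to build the LSM; edge detection is skipped, the edge coordinates are made up, and returning 0 for an empty reference set is an assumption.

```python
def rnse(edges_distorted, edges_reference):
    """Ratio of edge points that keep their position to all reference edge points."""
    if not edges_reference:
        return 0.0  # assumption: no reference edges means no measurable overlap
    return len(edges_distorted & edges_reference) / len(edges_reference)

# Hypothetical edge maps stored as sets of (row, column) coordinates.
reference_edges = {(1, 1), (1, 2), (2, 2), (3, 2)}
distorted_edges = {(1, 1), (1, 2), (2, 3), (3, 3)}
ratio = rnse(distorted_edges, reference_edges)  # two of four edges are unshifted
```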

Fig. 4: 2-D visualization of BIQA features. For each BIQA feature, a linear transform is learned from the LIVE database. The transform is then applied to the corresponding features in the CSIQ and TID2013 databases. The pristine images are represented as green dots, while the distorted images are represented as "+". The color of "+" encodes the subjective score, i.e., DMOS for CSIQ and MOS for TID2013. The BIQA features used include BIQI, DIIVINE, BLIINDS2, BRISQUE, SOS-H-rNSE and SOS-H-SSIM.

With either Eq. 3 or Eq. 8, we could calculate eight LSMs of I. In Fig. 2, we show the LSMs of a reference image and its distorted image (JPEG2000 compressed). SSIM is used as the similarity function. Two intra-scale LSMs and two inter-scale LSMs are shown. Note that the distortion in Fig. 2(f) is moderate and the artifacts generated by compression are nearly invisible. However, from the LSMs, we can easily tell the difference between the reference image and the distorted one. Those LSMs reflect the local correlation of the distorted image along different orientations (for intra-scale self-similarity) and in scale space (for inter-scale self-similarity), implying that the image perceptual quality can be well inferred from them.

II-B SOS Features

From the 8 SOS based LSMs computed above, features can be extracted to predict the image quality. Clearly, the most important statistics of the LSMs are their mean and standard deviation, which are computed as follows:


where N is the number of elements in the LSM. We call the method that uses the mean and standard deviation SOS-MD. One advantage of SOS-MD is its low feature dimensionality (16 in total for all 8 LSMs). Note that the mean and standard deviation pair completely characterizes the statistics of an LSM if its elements follow a Gaussian distribution. However, in practice the distributions of the LSMs are far from Gaussian (see Fig. 3 for example distributions), and the mean and standard deviation alone cannot describe them accurately. Therefore, we quantize the LSMs into several levels and use their normalized histograms as the SOS features. We call this method SOS-H. Since both the SSIM and rNSE indices range between 0 and 1, we quantize each LSM into 10 bins (with step length 0.1), resulting in a 10-dimensional histogram h for each LSM.
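
The three kinds of statistics can be computed from a single LSM as in the sketch below. It assumes the LSM is a nested list with values in [0, 1], uses the population form of the standard deviation (dividing by N), and places a value of exactly 1 in the last bin; the function name and the toy map are illustrative only.

```python
import math

def sos_features(lsm, bins=10):
    """Mean, standard deviation and normalised histogram of an LSM in [0, 1]."""
    values = [v for row in lsm for v in row]
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    hist = [0] * bins
    for v in values:
        # A value of exactly 1.0 falls into the last bin.
        hist[min(int(v * bins), bins - 1)] += 1
    return mean, std, [count / n for count in hist]

lsm = [[0.95, 0.91, 0.42],
       [0.99, 1.00, 0.75],
       [0.05, 0.88, 0.93]]
mean, std, hist = sos_features(lsm)
```

Concatenating the histograms of all 8 LSMs gives the 80-dimensional SOS-H vector, while concatenating the mean and standard deviation pairs gives the 16-dimensional SOS-MD vector.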

In TABLE I we list the three types of SOS features: the mean, the standard deviation d and the histogram h. Using each type of feature or a combination of them, we can learn a regression model for BIQA.

II-C Regression Model Learning

With the SOS feature vector of an image, denoted by f, a regression function F can be learned to map f to the subjective quality score q of the image, i.e., $q = F(\mathbf{f})$. To this end, we need a set of training images whose subjective quality scores are available. Such a training dataset can be extracted from the existing IQA databases. We can construct a training set of k images with their feature vectors and subjective scores: $\{(\mathbf{f}_i, q_i)\}_{i=1}^{k}$. Machine learning tools such as support vector regression (SVR), neural networks and random forests can be used to learn the mapping F. In this work, we adopt SVR with a radial basis function kernel [29]. Readers may refer to [29] for the details of SVR and its implementation. Once the regression model F is learned, we can use it to estimate the perceptual quality of any input image.

II-D Comparison with Other NSS based Features

As discussed in Section I, the differences between the SOS based features and the NSS features used in previous BIQA methods are twofold. First, instead of transforming the image into another redundancy-reduced domain, we use the translated and smoothed versions of the image in the spatial domain for feature extraction. Second, the statistics of self-similarity maps are used as the quality-aware features. To illustrate the power of SOS features in BIQA, we use neighborhood components analysis (NCA) [30] to transform the high-dimensional BIQA features into 2-dimensional (2-D) points, and then plot these 2-D points to reveal the essential structure of the data.

We first learn a projection matrix via NCA on the LIVE database [31] for each BIQA feature, then apply this matrix to features on databases of CSIQ [3] and TID2013 [32]. The scatter plots of the resulting 2-D points are shown in Fig. 4. In each plot, the reference images are represented as green dots, while the distorted images as ”+”. The color of ”+” encodes the subjective score of each distorted image. The used BIQA features include BIQI [14], DIIVINE [16], BLIINDS-II [19], BRISQUE [22], SOS-H-rNSE and SOS-H-SSIM. The first row shows the scatter plots of the 2-D points which are obtained when the LIVE-learned projection matrix is applied to the CSIQ database, while the second row shows the results on the TID2013 database. The third row in Fig. 4 shows the corresponding plots for SOS-H-rNSE and SOS-H-SSIM.

From these plots, we can draw the following conclusions. 1) The distributions of the 2-D points for BRISQUE, SOS-H-rNSE and SOS-H-SSIM show obvious quality relevance. The points for reference images are close to or overlap with the points of slightly distorted images and lie far from the heavily distorted images. The intermediately distorted images are by and large ordered according to their quality. 2) In the distributions for BIQI, DIIVINE and BLIINDS-II, weak or no quality relevance can be observed. Distorted images are not located sequentially according to their quality. These observations reveal the advantages of BRISQUE and the proposed SOS-H methods over the other methods.

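| Database | Distortion type | SOS-MD-rNSE PCC | SOS-MD-rNSE RMSE | SOS-MD-rNSE SRC | SOS-MD-SSIM PCC | SOS-MD-SSIM RMSE | SOS-MD-SSIM SRC | SOS-H-rNSE PCC | SOS-H-rNSE RMSE | SOS-H-rNSE SRC | SOS-H-SSIM PCC | SOS-H-SSIM RMSE | SOS-H-SSIM SRC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LIVE | ALL | 0.918 | 10.764 | 0.911 | 0.891 | 12.373 | 0.871 | 0.946 | 8.837 | 0.943 | 0.926 | 10.234 | 0.921 |
| LIVE | JP2K | 0.923 | 9.516 | 0.904 | 0.849 | 13.119 | 0.825 | 0.950 | 7.900 | 0.933 | 0.923 | 9.656 | 0.906 |
| LIVE | JPEG | 0.963 | 8.564 | 0.945 | 0.937 | 11.088 | 0.911 | 0.974 | 7.159 | 0.958 | 0.962 | 8.672 | 0.942 |
| LIVE | WN | 0.978 | 5.811 | 0.967 | 0.983 | 5.194 | 0.973 | 0.989 | 4.121 | 0.980 | 0.981 | 5.486 | 0.971 |
| LIVE | GB | 0.912 | 7.673 | 0.877 | 0.925 | 6.976 | 0.891 | 0.923 | 7.018 | 0.903 | 0.917 | 7.372 | 0.875 |
| LIVE | FF | 0.884 | 13.113 | 0.844 | 0.855 | 14.296 | 0.738 | 0.918 | 11.094 | 0.890 | 0.903 | 12.109 | 0.868 |
| CSIQ | ALL | 0.926 | 0.104 | 0.903 | 0.890 | 0.129 | 0.863 | 0.933 | 0.101 | 0.910 | 0.926 | 0.106 | 0.901 |
| CSIQ | WN | 0.946 | 0.054 | 0.937 | 0.954 | 0.050 | 0.933 | 0.938 | 0.059 | 0.923 | 0.941 | 0.056 | 0.920 |
| CSIQ | JPEG | 0.967 | 0.078 | 0.922 | 0.927 | 0.114 | 0.885 | 0.961 | 0.084 | 0.906 | 0.957 | 0.088 | 0.915 |
| CSIQ | JP2K | 0.935 | 0.110 | 0.912 | 0.875 | 0.152 | 0.842 | 0.941 | 0.106 | 0.915 | 0.936 | 0.111 | 0.907 |
| CSIQ | GB | 0.917 | 0.112 | 0.879 | 0.906 | 0.119 | 0.851 | 0.929 | 0.103 | 0.902 | 0.935 | 0.099 | 0.905 |
| TID2013 | ALL | 0.931 | 0.514 | 0.927 | 0.902 | 0.603 | 0.871 | 0.937 | 0.488 | 0.928 | 0.938 | 0.488 | 0.920 |
| TID2013 | WN | 0.905 | 0.302 | 0.886 | 0.924 | 0.270 | 0.906 | 0.940 | 0.243 | 0.923 | 0.934 | 0.252 | 0.914 |
| TID2013 | GB | 0.925 | 0.466 | 0.912 | 0.894 | 0.556 | 0.876 | 0.923 | 0.475 | 0.913 | 0.923 | 0.469 | 0.914 |
| TID2013 | JPEG | 0.971 | 0.358 | 0.923 | 0.928 | 0.547 | 0.842 | 0.969 | 0.367 | 0.906 | 0.965 | 0.394 | 0.901 |
| TID2013 | JP2K | 0.945 | 0.556 | 0.929 | 0.918 | 0.688 | 0.872 | 0.950 | 0.535 | 0.931 | 0.948 | 0.540 | 0.923 |

TABLE II: Performance comparison of SOS-MD and SOS-H on the three databases. Results for the overall database and for each individual distortion are listed.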

III Experiments Configuration

III-A Image Databases and Evaluation Criteria

We evaluate the performance of the proposed SOS based BIQA methods in terms of their ability to predict the subjective score of distorted images. Three publicly available large-scale databases are employed for this evaluation.

The LIVE database [31]: A total of 779 distorted images are generated by applying 5 distortion operations at different levels to 29 pristine images. The distortions include: JPEG2000 compression (JP2K), JPEG compression, white noise (WN), Gaussian blurring (GB) and simulated fast fading Rayleigh channel (FF).

The CSIQ database [3]: A total of 866 distorted images are generated by applying 6 distortion operations at different levels to 30 pristine images. The distortions include: JPEG, JP2K, additive pink noise, WN, GB and global contrast decrements.

The TID2013 database [32]: A total of 3000 distorted images are generated by applying 24 distortion operations at 5 levels to 25 pristine images. The distortions in TID2013 reflect a broad range of image impairments, such as edge smoothing, block artifacts, additive and multiplicative noise, chromatic aberrations, denoising and contrast change, etc. Details of the distortions can be found in [32].

The ground truth quality of each image is given by the subjective score, i.e., (Difference) Mean Opinion Score (DMOS/MOS). To evaluate the performance of BIQA methods, three indexes are usually computed by using the subjective scores and the model-predicted scores: the Spearman rank order correlation coefficient (SRC), the Pearson correlation coefficient (PCC) after a logistic regression [33], and the root mean squared error (RMSE) between the subjective score and the predicted score after the regression. Note that this logistic regression accounts for the different range of the objective and subjective scores, as well as the nonlinearity of human perception in extreme distortions.
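
The three criteria can be sketched in plain Python as follows. This is a simplified illustration: the logistic regression step applied before PCC and RMSE is omitted, ties in the rank computation receive average ranks, and the score lists are made-up numbers rather than results from any database.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def rmse(a, b):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

subjective = [20.0, 35.0, 50.0, 65.0, 80.0]  # hypothetical subjective scores
predicted = [22.0, 30.0, 55.0, 60.0, 90.0]   # hypothetical model outputs
src = spearman(subjective, predicted)
pcc = pearson(subjective, predicted)
err = rmse(subjective, predicted)
```

Because the predictions preserve the ordering of the subjective scores, SRC equals 1 here even though PCC is below 1 and the RMSE is non-zero, which is why the criteria are reported together.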

III-B Implementation Details and Parameter Setting

In the implementation of the proposed BIQA method, the scale of the Gaussian window in SSIM and the scale of the LOG filter in rNSE are both set to 0.5 for computing the intra-scale self-similarity. For the inter-scale self-similarity, we set the scale parameters of SSIM and rNSE to the same values as the four smoothing scales in Eq. 2. With the LSMs available, the SOS features (mean, standard deviation d and histogram h) can be extracted and then fed into the SVR to train a regression model for quality prediction. We adopt the SVR algorithm with an RBF (radial basis function) kernel, using the source code from LibSVM [34]. The SVR parameters are tuned by a 2D grid search in logarithmic space.

In our experiments, 80% of the images are used for training and the remaining 20% for testing. The training and test sets are split according to the reference image to guarantee that the image content in the training set is independent of that in the test set. This splitting is repeated 1,000 times and the median results are used to evaluate the final performance.
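
A content-independent split of this kind can be sketched as below. The sketch assumes that each distorted image carries the identifier of its reference image and that the 80/20 ratio is applied to the reference images; `evaluate` is a hypothetical callback that would train a model on the training indices and return a score such as SRC on the test indices.

```python
import random
import statistics

def split_by_reference(ref_ids, train_fraction=0.8, rng=random):
    """Split distorted-image indices so that no reference image is shared."""
    refs = sorted(set(ref_ids))
    rng.shuffle(refs)
    n_train = int(round(train_fraction * len(refs)))
    train_refs = set(refs[:n_train])
    train = [i for i, r in enumerate(ref_ids) if r in train_refs]
    test = [i for i, r in enumerate(ref_ids) if r not in train_refs]
    return train, test

def median_score(ref_ids, evaluate, repeats=1000, seed=0):
    """Median of evaluate(train, test) over repeated random splits."""
    rng = random.Random(seed)
    scores = [evaluate(*split_by_reference(ref_ids, rng=rng))
              for _ in range(repeats)]
    return statistics.median(scores)

# Toy data: 10 hypothetical reference images with 4 distorted versions each.
ref_ids = [r for r in range(10) for _ in range(4)]
train, test = split_by_reference(ref_ids, rng=random.Random(1))
```

Splitting on reference identifiers rather than on individual images keeps every distorted version of a scene on the same side of the split, so the test score reflects content the model has never seen.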

IV Results and Discussions

IV-A Performance of SOS-MD and SOS-H

We first compare the performance of SOS-MD and SOS-H on the three databases. The results are listed in TABLE II. Note that in this experiment we only consider the distortion types common to all three databases, i.e., JP2K, JPEG, WN, and GB.

From TABLE II, we can observe that both SOS-MD and SOS-H show good performance on the three databases in terms of PCC and SRC. The best results on all the three databases are achieved by the method SOS-H-rNSE, with PCC values 0.946, 0.933, and 0.937 on the three databases, respectively. For every single distortion type, it also demonstrates PCC and SRC values consistently higher than 0.9. Note that due to the scale difference of subjective scores on the three databases, the resulting RMSE values range differently.

As for the two similarity functions, rNSE shows clear advantages over SSIM. We highlight in boldface the better one of rNSE and SSIM in each row. The SOS-MD and SOS-H methods with rNSE as similarity function outperform those with SSIM in most of the distortion types. This may be due to the fact that rNSE emphasizes more on edge structure, which is crucial for human visual perception. Besides, benefiting from the richer information in histograms, SOS-H always exhibits better performance than SOS-MD on the three databases.

| Methods | Feature domain | JP2K | JPEG | WN | GB | FF | ALL |
|---|---|---|---|---|---|---|---|
| BIQI [14]* | Wavelet | 0.7849 | 0.8801 | 0.9157 | 0.8367 | 0.7023 | 0.8084 |
| GRNN [35] | Fourier+Spatial | 0.8156 | 0.8721 | 0.9794 | 0.8331 | 0.7354 | 0.8268 |
| LD-GS [36] | Wavelet | 0.8317 | 0.8339 | 0.9134 | 0.8751 | 0.8588 | 0.8414 |
| LD-TS [36] | Wavelet | 0.8202 | 0.8334 | 0.9556 | 0.9251 | 0.8863 | 0.8833 |
| DIIVINE [16]* | Wavelet | 0.8418 | 0.8926 | 0.9617 | 0.8792 | 0.8202 | 0.8816 |
| CBIQ-I [37] | Gabor | 0.912 | 0.963 | 0.959 | 0.918 | 0.885 | 0.896 |
| BLIINDS-II [19]* | DCT | 0.9258 | 0.95 | 0.9477 | 0.9132 | 0.8736 | 0.9302 |
| CBIQ-II [37] | Gabor | 0.919 | 0.965 | 0.933 | 0.944 | 0.912 | 0.93 |
| BRISQUE [22]* | Spatial | 0.9175 | 0.9655 | 0.9789 | 0.9479 | 0.8854 | 0.943 |
| M3 [23] | Gradient magnitude+LOG | 0.9283 | 0.9659 | 0.9853 | 0.9359 | 0.9008 | 0.9511 |
| SOS-H-rNSE | Spatial+Scale | 0.9328 | 0.9582 | 0.9802 | 0.9026 | 0.8899 | 0.9434 |
| SOS-H-SSIM | Spatial+Scale | 0.906 | 0.9415 | 0.9711 | 0.8754 | 0.8679 | 0.9212 |
| PSNR | | 0.9081 | 0.8923 | 0.984 | 0.8111 | 0.8941 | 0.8839 |
| SSIM | | 0.9606 | 0.9739 | 0.9693 | 0.9515 | 0.9551 | 0.9481 |

TABLE III: Median SRC of the existing BIQA methods on the LIVE database. The results of PSNR and SSIM are listed for reference. (*The results are computed by using the codes provided by the authors.)
Fig. 5: Box plot of the BIQA methods’ performance on LIVE database in terms of SRC.
BIQI 1 1 1 1 1 1
DIIVINE 0 1 1 1 1 1
BLIINDS-II 0 0 1 0.79 1 1
BRISQUE 0 0 0 0 1 1
SOS-H-SSIM 0 0 0.21 1 1 1
SOS-H-rNSE 0 0 0 0 0 1
M3 0 0 0 0 0 0
TABLE IV: P-values of t-test between each pair of BIQA methods.

IV-B Performance Comparison with Existing BIQA Methods

In TABLE III, we compare the performances of the proposed SOS-based methods with existing state-of-the-art BIQA methods, including BIQI [14], BLIINDS-II [19], DIIVINE [16], GRNN [35], visual codebook based method (CBIQ) [37], local dependency based method (LD-GS and LD-TS) [36], BRISQUE [22] and M3 [23]. The results of these competitors are either sourced from their original publications or computed by using the source codes provided by the authors. The results of the classical PSNR and SSIM indices are also presented for reference. To save space, only the result of SRC index is shown in TABLE III. The top three results are highlighted with boldface for each column.

When the entire LIVE database is considered, the two proposed SOS-H methods show very competitive performance compared with the state-of-the-art BIQA methods. The top three methods on the LIVE database are M3, SOS-H-rNSE and BRISQUE. SOS-H-SSIM beats all the wavelet-based methods. Box plots of the results on the LIVE database are presented in Fig. 5 for a more intuitive comparison. To investigate the significance of the differences between the performances of these BIQA methods, a right-tailed t-test with a significance level of 0.01 is conducted for each pair of BIQA methods. The null hypothesis is that the mean SRC values of the two methods are equal. The alternative hypothesis is that the mean SRC value of the method in the row is greater than that of the method in the column. The resulting p-values of the tests are shown in TABLE IV, and a small p-value favors the alternative hypothesis. Again, we can see that the proposed SOS-H-rNSE delivers excellent performance, and it is only beaten by M3.

CSIQ 0.857 0.888 0.899 0.911 0.898 0.907
TID2013 0.860 0.895 0.891 0.923 0.897 0.913
TABLE V: The database independency of SOS-H. The models are trained on LIVE and applied on CSIQ and TID2013. Only the four distortions common to all the three databases are considered.
Additive Gaussian noise 0.821 0.774 0.778 0.628 0.709 0.769
Additive noise more in color 0.501 0.542 0.554 0.357 0.431 0.583
Spatially correlated noise 0.728 0.761 0.830 0.689 0.816 0.783
Masked noise 0.261 0.307 0.172 0.281 0.111 0.504
High frequency noise 0.876 0.893 0.855 0.772 0.816 0.884
Impulse noise 0.739 0.698 0.815 0.607 0.789 0.718
Quantization noise 0.579 0.748 0.695 0.639 0.535 0.819
Gaussian blur 0.863 0.826 0.856 0.855 0.915 0.872
Image denoising 0.803 0.693 0.551 0.797 0.723 0.771
JPEG compression 0.832 0.748 0.756 0.706 0.725 0.819
JPEG2000 compression 0.898 0.793 0.780 0.850 0.861 0.873
JPEG transmission errors 0.435 0.165 0.231 0.409 0.343 0.423
JPEG2000 transmission errors 0.565 0.633 0.695 0.696 0.717 0.723
Non eccentricity pattern noise 0.182 0.131 0.126 0.176 0.134 0.212
Local block-wise distortions 0.146 0.206 0.203 0.290 0.298 0.280
Mean shift 0.127 0.217 0.112 0.185 0.197 0.090
Contrast change 0.161 0.056 0.058 0.085 0.347 0.301
Change of color saturation 0.099 0.175 0.092 0.022 0.213 0.185
Multiplicative Gaussian noise 0.695 0.720 0.621 0.626 0.666 0.709
Comfort noise 0.142 0.021 0.165 0.084 0.265 0.229
Lossy compression of noisy images 0.628 0.639 0.531 0.454 0.677 0.704
Image color quantization with dither 0.837 0.815 0.827 0.789 0.802 0.858
Chromatic aberrations 0.678 0.707 0.731 0.596 0.775 0.644
Sparse sampling and reconstruction 0.847 0.819 0.807 0.861 0.843 0.922
ALL 0.608 0.569 0.558 0.576 0.593 0.687
TABLE VI: The performance of SOS-H with more distortion types in the TID2013 database.

When each distortion type is considered, SOS-H-rNSE shows top performance on JP2K, WN, and FF, while SOS-H-SSIM gives lower SRC values. More specifically, on JP2K images, all methods that are based on wavelet features [14, 16] fail to give a high SRC value, while the DCT based BLIINDS-II, the gradient and LOG based M3 [23] and the proposed SOS-H methods show excellent performance. On JPEG images, the two SOS-H methods perform better than the wavelet based methods. On WN images, all the methods based on spatial features show a clear advantage over the wavelet, Gabor and DCT based methods. This is because the pixel based representation in the spatial domain is more appropriate for additive noise. On GB images, BRISQUE, M3 and CBIQ-II give the best performance, while the proposed SOS-H methods are still better than the wavelet-based methods. On FF images, it is hard to capture the intrinsic characteristics for quality prediction because FF simultaneously introduces structure shifting, blurring, ringing and color contamination. Among the competing methods, CBIQ-II performs the best, followed by SOS-H-rNSE and BRISQUE. SOS-H-SSIM only achieves acceptable performance.

Compared with the FR methods PSNR and SSIM, the two SOS-H methods show an obvious advantage over PSNR but are inferior to the SSIM index.

IV-C Database Independency

We examine the database independency of the proposed SOS-H methods as follows: we train a quality prediction model with the SOS-H features on the LIVE database and then test the model on CSIQ and TID2013. Note that for CSIQ and TID2013, only images with the four distortions shared with LIVE are considered. The results are shown in TABLE V. The two SOS-H methods generalize well across databases. When the LIVE-trained models are tested on CSIQ and TID2013, SOS-H-rNSE gives SRC values competitive with M3, and SOS-H-SSIM works on par with BRISQUE. DIIVINE shows less stable performance in this case.

IV-D More Distortion Types

We further test the performance of the proposed SOS-H methods with more distortions by using the TID2013 database, which covers a wide range of distortions. As in subsection III-B, 80% of the images are used for training and the remaining 20% for testing. The obtained median SRC values are presented in TABLE VI. The top two methods in each row are highlighted in bold for each distortion. (Detailed explanations of the distortions can be found in [32].) As can be seen, SOS-H-rNSE shows better performance than the other BIQA methods, except for M3, on the entire TID2013 database. On some distortions, all BIQA methods fail to give acceptable performance. We shade the rows in TABLE VI where all methods obtain SRC values less than 0.5. Examples of these distortions include JPEG transmission errors, non-eccentricity pattern noise, block-wise distortion with different intensity, mean shift, contrast change, color saturation change, and comfort noise. The failure in these cases may be due to the fact that all the current BIQA methods make use of structural features, which are not capable of capturing non-structural distortions such as color aberration and mean shift.

IV-E Similarity Function of MSE

To further validate the effectiveness of the proposed SOS framework, we take MSE as the similarity function in SOS:


Note that we take a logarithm transform of MSE in order to better compute the histogram of LSM. The features for quality prediction are extracted in the same way as that in SOS-H-rNSE and SOS-H-SSIM. The resulting SOS-based method is denoted as SOS-H-MSE.
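Since the equation is not reproduced in this text, the following is only a minimal sketch of one plausible log-MSE similarity, assuming a local MSE over a square window between the image and a shifted copy, followed by log(MSE + eps). The window size, the eps constant, the histogram range and bin count, and all function names are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def local_mean(a, win):
    """Mean of `a` over every win x win window (valid region), via an integral image."""
    s = np.cumsum(np.cumsum(a, axis=0), axis=1)
    s = np.pad(s, ((1, 0), (1, 0)))  # leading zero row/column
    total = s[win:, win:] - s[:-win, win:] - s[win:, :-win] + s[:-win, :-win]
    return total / (win * win)

def log_mse_map(image, shift=(0, 1), win=7, eps=1e-6):
    """Log-MSE local similarity map between an image and its shifted copy.

    `shift` = (dy, dx) with non-negative entries; `win` and `eps` are assumed values.
    """
    img = np.asarray(image, dtype=float)
    dy, dx = shift
    h, w = img.shape
    a = img[:h - dy, :w - dx]   # original, cropped to the overlap
    b = img[dy:, dx:]           # copy displaced by (dy, dx)
    mse = local_mean((a - b) ** 2, win)
    return np.log(mse + eps)

def lsm_histogram(lsm, bins=20, value_range=(-14.0, 12.0)):
    """Normalized histogram of the map; the range is an assumption for 8-bit images."""
    hist, _ = np.histogram(lsm, bins=bins, range=value_range)
    return hist / max(hist.sum(), 1)

# Demonstration on synthetic images: a smooth ramp versus uniform noise.
rng = np.random.default_rng(0)
ramp = np.tile(np.arange(64, dtype=float), (64, 1))
noise = rng.uniform(0, 255, size=(64, 64))
print("mean log-MSE, ramp :", log_mse_map(ramp).mean())
print("mean log-MSE, noise:", log_mse_map(noise).mean())
```

The same `log_mse_map` idea could be applied to an image and its smoothed version by replacing the shifted copy with the smoothed one; that variant is not shown here.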

Database  Distortion  SOS-H-rNSE  SOS-H-SSIM  SOS-H-MSE
LIVE      ALL         0.943       0.921       0.944
LIVE      JP2K        0.933       0.906       0.946
LIVE      JPEG        0.958       0.942       0.959
LIVE      WN          0.980       0.971       0.981
LIVE      GB          0.903       0.875       0.928
LIVE      FF          0.890       0.868       0.865
CSIQ      ALL         0.910       0.901       0.902
CSIQ      WN          0.923       0.920       0.917
CSIQ      JPEG        0.906       0.915       0.918
CSIQ      JP2K        0.915       0.907       0.905
CSIQ      GB          0.902       0.905       0.895
TID2013   ALL         0.928       0.920       0.919
TID2013   WN          0.923       0.914       0.918
TID2013   GB          0.913       0.914       0.915
TID2013   JPEG        0.906       0.901       0.893
TID2013   JP2K        0.931       0.923       0.923
TABLE VII: SRC performance comparison of SOS-H with rNSE, SSIM and MSE as the similarity functions.

TABLE VII compares the performance of the three SOS-H based methods. Under the SOS framework, MSE performs similarly to SSIM on CSIQ and TID2013, and nearly identically to rNSE on LIVE, where SOS-H-MSE even gives a slightly higher SRC on the whole database (0.944 versus 0.943). The effectiveness of the SOS framework is thus demonstrated by all three similarity functions: rNSE, SSIM and MSE. For individual distortion types, SOS-H-rNSE and SOS-H-MSE show similar results. The good results of SOS-H-MSE can be explained as follows. MSE computes the squared difference between the original image and its shifted or smoothed version. This is similar to the computation of the image gradient, which has been shown to be very effective for image quality assessment [4]. Other similarity functions under the SOS framework may achieve even better performance.
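The link between MSE and the image gradient can be checked numerically for the simplest case, a one-pixel horizontal shift. The sketch below is an illustration under that assumption only (the smoothed-version case is not covered), and the function names are hypothetical: the MSE between a patch and its shifted copy coincides with the mean squared forward difference, which is a basic gradient estimate.

```python
import numpy as np

def shift_mse(patch):
    """MSE between a patch and its copy shifted one pixel to the right."""
    patch = np.asarray(patch, dtype=float)
    left, right = patch[:, :-1], patch[:, 1:]
    return float(np.mean((left - right) ** 2))

def mean_squared_forward_diff(patch):
    """Mean squared horizontal forward difference (a simple gradient estimate)."""
    grad_x = np.diff(np.asarray(patch, dtype=float), axis=1)
    return float(np.mean(grad_x ** 2))

rng = np.random.default_rng(0)
patch = rng.uniform(0, 255, size=(8, 9))
print(shift_mse(patch), mean_squared_forward_diff(patch))
```

Both printed numbers agree because the two quantities differ only in the sign of the difference, which the squaring removes.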

IV-F Discussions

It has been found that the function of ganglion and lateral geniculate nucleus (LGN) neurons can be modeled by principal component analysis (PCA) based whitening, and that the role of PCA is similar to that of the DCT for natural images [38]. The responses of simple cells in the primary visual cortex (V1) are similar to the WT outputs and approximate the independent components of natural images [39]. However, these transforms may be neither effective nor efficient for representing distorted images in the context of IQA. The scatter plots in the first two rows of Fig. 4 clearly show that the features extracted from the DCT and WT domains cannot distinguish well between distorted images and their reference counterparts, and that these features have low relevance to the subjective quality of the image.

This fact motivated us to find a different method for NSS calculation. Instead of applying a transform, we directly compute the intra-scale and inter-scale redundancy in the spatial domain. Interestingly, as shown in the third row of Fig. 2, the proposed SOS features better distinguish the original natural images from their distorted counterparts, and the scatter plots show better relevance to subjective quality. Our experiments in the previous sections also validated that the SOS features can predict perceptual quality very well. Moreover, the proposed SOS features work robustly, with no strict restriction on the similarity function. Whether our results imply a new physical model of how the HVS senses image quality is an interesting open problem, but it is outside the scope of this paper.

V Conclusion

It is well known that a proper representation makes image processing tasks easier, and the same holds for image quality assessment. In this paper, we proposed a new feature representation framework that aims to capture the statistics of self-similarity (SOS) of natural images. Unlike previous methods, SOS directly measures the redundancy existing in an image, rather than describing the structure in a redundancy-reduced domain. The computed local similarity map (LSM) portrays the local correlation across space and scales, both of which are altered by image distortion. The statistics of these LSMs, i.e., the SOS features, were shown to capture the degree of distortion more effectively than previous features based on image decomposition. In particular, when the LSM histogram features are used, very competitive performance is achieved on the benchmark databases. New similarity functions can be introduced or designed under the SOS framework for better performance in future work.

References


  • [1] Z. Wang and A. C. Bovik, “Modern image quality assessment,” Synthesis Lectures on Image, Video, and Multimedia Processing, vol. 2, no. 1, pp. 1–156, 2006.
  • [2] M. Zhang, X. Mou, and L. Zhang, “Non-shift edge based ratio (nser): An image quality assessment metric based on early vision features,” Signal Processing Letters, IEEE, no. 99, pp. 1–1, 2011.
  • [3] E. C. Larson and D. M. Chandler, “Most apparent distortion: full-reference image quality assessment and the role of strategy,” Journal of Electronic Imaging, vol. 19, no. 1, p. 011006, 2010.
  • [4] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: a highly efficient perceptual image quality index,” Image Processing, IEEE Transactions on, vol. 23, no. 2, pp. 684–695, 2014.
  • [5] S. Li, F. Zhang, L. Ma, and K. N. Ngan, “Image quality assessment by separately evaluating detail losses and additive impairments,” Multimedia, IEEE Transactions on, vol. 13, no. 5, pp. 935–949, Oct 2011.
  • [6] L. Zhang, X. Mou, and D. Zhang, “Fsim: A feature similarity index for image quality assessment,” Image Processing, IEEE Transactions on, no. 99, pp. 1–1, 2011.
  • [7] Q. Li and Z. Wang, “Reduced-reference image quality assessment using divisive normalization-based image representation,” Selected Topics in Signal Processing, IEEE Journal of, vol. 3, no. 2, pp. 202–211, 2009.
  • [8] W. Xue and X. Mou, “Reduced reference image quality assessment based on weibull statistics,” in Quality of Multimedia Experience (QoMEX), 2010 Second International Workshop on.   IEEE, 2010, pp. 1–6.
  • [9] X. Mou, W. Xue, and L. Zhang, “Reduced reference image quality assessment via sub-image similarity based redundancy measurement,” Proceedings of SPIE, vol. 8291, p. 82911S, 2012.
  • [10] L. Ma, S. Li, F. Zhang, and K. N. Ngan, “Reduced-reference image quality assessment using reorganized dct-based image representation,” Multimedia, IEEE Transactions on, vol. 13, no. 4, pp. 824–829, Aug 2011.
  • [11] J. Wu, W. Lin, G. Shi, and A. Liu, “Reduced-reference image quality assessment with visual information fidelity,” Multimedia, IEEE Transactions on, vol. 15, no. 7, pp. 1700–1705, Nov 2013.
  • [12] M. Shahid, A. Rossholm, B. Lövström, and H.-J. Zepernick, “No-reference image and video quality assessment: a classification and review of recent approaches,” EURASIP Journal on Image and Video Processing, vol. 2014, no. 1, pp. 1–32, 2014.
  • [13] R. A. Manap and L. Shao, “Non-distortion-specific no-reference image quality assessment: A survey,” Information Sciences, vol. 301, pp. 141–160, 2015.
  • [14] A. Moorthy and A. Bovik, “A two-step framework for constructing blind image quality indices,” Signal Processing Letters, IEEE, vol. 17, no. 5, pp. 513–516, 2010.
  • [15] H. Tang, N. Joshi, and A. Kapoor, “Learning a blind measure of perceptual image quality,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.   IEEE, 2011, pp. 305–312.
  • [16] A. Moorthy and A. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” Image Processing, IEEE Transactions on, vol. 20, no. 12, pp. 3350–3364, Dec. 2011.
  • [17] W. Lu, K. Zeng, D. Tao, Y. Yuan, and X. Gao, “No-reference image quality assessment in contourlet domain,” Neurocomputing, vol. 73, no. 4, pp. 784–794, 2010.
  • [18] M. Saad, A. Bovik, and C. Charrier, “A dct statistics-based blind image quality index,” Signal Processing Letters, IEEE, vol. 17, no. 6, pp. 583–586, 2010.
  • [19] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the dct domain,” Image Processing, IEEE Transactions on, vol. 21, no. 8, pp. 3339–3352, 2012.
  • [20] P. Teo and D. Heeger, “Perceptual image distortion,” in Image Processing, 1994. Proceedings. ICIP-94., IEEE International Conference, vol. 2.   IEEE, 1994, pp. 982–986.
  • [21] D. L. Ruderman and W. Bialek, “Statistics of natural images: Scaling in the woods,” Physical review letters, vol. 73, no. 6, p. 814, 1994.
  • [22] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” Image Processing, IEEE Transactions on, vol. 21, no. 12, pp. 4695–4708, 2012.
  • [23] W. Xue, X. Mou, L. Zhang, A. C. Bovik, and X. Feng, “Blind image quality assessment using joint statistics of gradient magnitude and laplacian features,” Image Processing, IEEE Transactions on, vol. 23, no. 11, pp. 4850–4862, 2014.
  • [24] F. Attneave, “Some informational aspects of visual perception.” Psychological review, vol. 61, no. 3, p. 183, 1954.
  • [25] D. Kersten, “Predictability and redundancy of natural images,” JOSA A, vol. 4, no. 12, pp. 2395–2400, 1987.
  • [26] E. Dam and B. ter Haar Romeny, “Front end vision and multi-scale image analysis,” Deep Structure I, II & III, no. 1-4020, pp. 1507–0, 2003.
  • [27] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” Image Processing, IEEE Transactions on, vol. 13, no. 4, pp. 600–612, 2004.
  • [28] W. Xue and X. Mou, “An image quality assessment metric based on non-shift edge,” in Image Processing (ICIP), 2011 18th IEEE International Conference on.   IEEE, 2011, pp. 3309–3312.
  • [29] A. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and computing, vol. 14, no. 3, pp. 199–222, 2004.
  • [30] S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighbourhood component analysis,” in Neural Information Processing Systems, vol. 17, pp. 513–520.
  • [31] H. Sheikh, Z. Wang, L. Cormack, and A. Bovik, “Live image quality assessment database release 2 (2005).”
  • [32] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, L. Lin, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. Jay Kuo, “Color image database tid2013: Peculiarities and preliminary results,” Advances of Modern Radioelectronics, vol. 10, no. 10, pp. 30–45, 2009.
  • [33] VQEG, “Final report from the video quality experts group on the validation of objective models of video quality assessment, phase ii,” VQEG, Aug, 2003.
  • [34] C. Chang and C. Lin, “Libsvm: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
  • [35] C. Li, A. Bovik, and X. Wu, “Blind image quality assessment using a general regression neural network,” Neural Networks, IEEE Transactions on, vol. 22, no. 5, pp. 793–799, 2011.
  • [36] F. Gao, X. Gao, D. Tao, X. Li, L. He, and W. Lu, “Universal no reference image quality assessment metrics based on local dependency,” in Pattern Recognition (ACPR), 2011 First Asian Conference on.   IEEE, 2011, pp. 298–302.
  • [37] P. Ye and D. Doermann, “No-reference image quality assessment using visual codebooks,” Image Processing, IEEE Transactions on, vol. 21, no. 7, pp. 3129–3138, 2012.
  • [38] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” Computers, IEEE Transactions on, vol. 100, no. 1, pp. 90–93, 1974.
  • [39] S. Fischer, F. Šroubek, L. Perrinet, R. Redondo, and G. Cristóbal, “Self-invertible 2d log-gabor wavelets,” International Journal of Computer Vision, vol. 75, no. 2, pp. 231–246, 2007.

Wufeng Xue received the B.Sc. degree in automatic engineering from the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China, in 2009. He is currently pursuing the Ph.D. degree with the Institute of Image Processing and Pattern Recognition, Xi’an Jiaotong University. His research interest focuses on perceptual quality of visual signals.

Xuanqin Mou (M’08) has been with the Institute of Image Processing and Pattern Recognition (IPPR), Electronic and Information Engineering School, Xi’an Jiaotong University, since 1987. He has been an Associate Professor since 1997, and a Professor since 2002. He is currently the Director of IPPR. Dr. Mou has served as a member of the 12th Expert Evaluation Committee for the National Natural Science Foundation of China, a member of the 5th and 6th Executive Committee of the China Society of Image and Graphics, and the Vice President of the Shaanxi Image and Graphics Association. He has authored or co-authored more than 200 peer-reviewed journal or conference papers. He has received the Yung Wing Award for Excellence in Education, the KC Wong Education Award, the Technology Academy Award for Invention from the Ministry of Education of China, and the Technology Academy Awards from the Government of Shaanxi Province, China.

Lei Zhang (M’04, SM’14) received the B.Sc. degree in 1995 from Shenyang Institute of Aeronautical Engineering, Shenyang, P.R. China, and the M.Sc. and Ph.D. degrees in Control Theory and Engineering from Northwestern Polytechnical University, Xi’an, P.R. China, in 1998 and 2001, respectively. From 2001 to 2002, he was a research associate in the Dept. of Computing, The Hong Kong Polytechnic University. From Jan. 2003 to Jan. 2006 he worked as a Postdoctoral Fellow in the Dept. of Electrical and Computer Engineering, McMaster University, Canada. In 2006, he joined the Dept. of Computing, The Hong Kong Polytechnic University, as an Assistant Professor. Since Sept. 2010, he has been an Associate Professor in the same department. His research interests include Image and Video Processing, Computer Vision, Pattern Recognition and Biometrics, etc. Dr. Zhang has published about 200 papers in those areas. Dr. Zhang is currently an Associate Editor of IEEE Trans. on CSVT and Image and Vision Computing. He was awarded the 2012-13 Faculty Award in Research and Scholarly Activities. More information can be found on his homepage http://www4.comp.polyu.edu.hk/~cslzhang/.
