Learn to Evaluate Image Perceptual Quality Blindly from Statistics of Self-similarity
Abstract
Among the various image quality assessment (IQA) tasks, blind IQA (BIQA) is particularly challenging due to the absence of knowledge about the reference image and distortion type. Features based on natural scene statistics (NSS) have been successfully used in BIQA, and the quality relevance of the features plays an essential role in the quality prediction performance. Motivated by the fact that the early processing stage in the human visual system aims to remove signal redundancies for efficient visual coding, we propose a simple but very effective BIQA method that computes the statistics of self-similarity (SOS) in an image. Specifically, we calculate the interscale similarity and intrascale similarity of the distorted image, extract the SOS features from these similarities, and learn a regression model to map the SOS features to the subjective quality score. Extensive experiments demonstrate the very competitive quality prediction performance and generalization ability of the proposed SOS based BIQA method.
I Introduction
Image quality assessment (IQA) aims to measure to what extent an observer is satisfied with the perceptual quality of a given image. IQA has become increasingly important due to its versatile uses, including image quality monitoring, parameter tuning of image processing algorithms, and serving as a yardstick for evaluating the performance of image processing systems. With the proliferation of high speed networks and portable multimedia devices, the demand for reliable and efficient IQA algorithms keeps growing.
In the past decade, a variety of IQA methods have been proposed, which can be generally classified into three categories according to the available information about the original reference image [1]: full reference (FR), reduced reference (RR) and no reference (NR). The FR methods have high prediction accuracy [2, 3, 4, 5, 6] because of the availability of the pristine reference image. In RR methods, a brief description of the reference image is available, for example, features based on natural scene statistics (NSS). By matching the statistics between the reference image and the distorted image, RR methods can also achieve good quality prediction accuracy [7, 8, 9, 10, 11]. However, both FR and RR methods are hard to use in many practical applications, where the reference image information is completely inaccessible. Therefore, it is highly desirable to develop NR methods that predict the quality of a distorted image without prior information.
The NR methods can be categorized into distortion-specific (DS) methods and non-distortion-specific (NDS) ones. DS methods assume that the image degradation procedure is known, and descriptors that are capable of capturing the artifacts are employed to measure the quality. A review of DS methods can be found in [12]. In NDS methods, the distortion procedure is unknown, which is the case in most practical applications. Usually, this class of NR methods is also called blind IQA (BIQA).
Existing solutions to BIQA have achieved good performance with the help of machine learning methods such as support vector regression and neural networks. A survey of recent BIQA methods can be found in [13]. These methods differ from each other mainly in how the quality aware features are extracted. It is widely accepted that natural images are highly sparse in the high dimensional space. Once a natural image is distorted, its characteristics will accordingly deviate from those of the original image. This provides the underlying motivation of most BIQA methods. To capture the quality aware representation, images are usually decomposed into multiple frequencies and orientations by using wavelet [14, 15, 16], contourlet [17] and discrete cosine transform (DCT) [18, 19], etc. Compared with the pixel based representation, redundancies among the coefficients in these transformed domains are largely reduced. The resulting transform coefficients follow a high kurtosis, heavy tailed distribution. Distortions present in an image will lead to a deviation from this distribution, which can be used to predict image quality.
To further reduce the redundancy of images for a more effective (from the viewpoint of encoding) representation, contrast normalization is introduced into multiscale and multiorientation image decomposition [20]. It is shown in [21] that the mean-subtracted contrast normalized (MSCN) coefficients are decorrelated and follow a Gaussian distribution. This finding can be used to model the contrast masking effect in early human vision. Mittal et al. [22] parameterized MSCN coefficients with a Generalized Gaussian Distribution (GGD) and the pairwise products of MSCN coefficients with an asymmetric GGD (AGGD). The resulting method, called BRISQUE, obtains state-of-the-art BIQA performance. Inspired by this contrast normalization, Xue et al. [23] proposed a jointly adaptive normalization (JAN) scheme to reduce the redundancy in the domains of Laplacian of Gaussian (LOG) response and gradient magnitude (GM). After the JAN operation, the statistics of LOG and GM become more similar among natural images of different contents while becoming more different from those of unnatural images. The proposed model M3 in [23] shows better performance than BRISQUE on two benchmark databases.
Most existing BIQA methods [14, 15, 16, 17, 18, 19] calculate the image statistics in a transformed domain where the image redundancies are much reduced. Contrary to these methods, we find that measuring image redundancy directly in the pixel based spatial domain can lead to an efficient BIQA method with promising performance. The human visual system has evolved to economically describe natural images by efficient redundancy reduction [24]. The redundancy in an image can be reflected by how well a pixel's intensity can be predicted from its neighboring pixels [25]. Natural images generally have high spatial correlation and multiscale correlation; that is, a natural image looks similar to its translated, zoomed-in or zoomed-out versions. Therefore, we measure the image redundancy by computing the image intrascale and interscale self-similarities, and propose a BIQA method called Statistics of Self-similarity (SOS). It is worthwhile to note that SOS is very different from the previous work M3 in [23]. First, SOS aims to describe the degree of redundancy of an image, while M3 aims to capture the local contrast. Second, the SOS features are based on the distributions of similarity maps, while in M3 the GM and LOG features are used and jointly normalized to obtain a more robust feature representation. Finally, SOS provides a general framework for BIQA in which any similarity function can be employed.
The rest of this paper is organized as follows. Section II presents in detail the proposed SOS computation framework, and demonstrates the high relevance of SOS based features to image quality. Section III gives the experimental settings. Extensive experimental results and analysis are presented in Section IV. Section V concludes the paper.
II Statistics of Self-Similarity for BIQA
The flowchart of the proposed SOS based BIQA method is illustrated in Fig. 1. It consists of the following main steps: local self-similarity map calculation, SOS feature extraction, and regression model learning, which are described in detail as follows.
II-A Calculation of Local Self-similarity Map
For an input image I, the redundancy can be described by its self-similarity. In particular, the image self-similarity can be measured from two aspects. The first one is intrascale self-similarity. Due to spatial correlation, I will be similar to its translated versions, where m and n denote the translations along the vertical and horizontal directions. As shown in Fig. 1, we employ four translated versions of I in this paper, each with its own setting of (m, n).
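As an illustration of the translation step, the following minimal sketch extracts the overlapping region of an image and its copy translated by (m, n) pixels, so that the two crops can be compared location by location. The function name `translated_pair` and the offsets used in the usage note are hypothetical; the translation settings actually used in this paper are not reproduced here.

```python
import numpy as np

def translated_pair(image, m, n):
    """Return the overlapping crops of `image` and its copy shifted by (m, n).

    m is the vertical and n the horizontal translation in pixels; the shifted
    crop at position (x, y) holds the original pixel at (x - m, y - n).
    """
    h, w = image.shape
    top, left = max(m, 0), max(n, 0)
    bottom, right = h + min(m, 0), w + min(n, 0)
    original = image[top:bottom, left:right]
    shifted = image[top - m:bottom - m, left - n:right - n]
    return original, shifted
```

For example, `translated_pair(img, 0, 1)` pairs every pixel with its left-hand neighbor; the offset (0, 1) is only an illustrative choice.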
Apart from intrascale self-similarity, natural images also exhibit interscale self-similarity. It is well known that natural images have the property of scale invariance, i.e., an image usually looks similar to its scaled versions. Considering that the scale space of the human visual system can be well approximated by Gaussian filtering [26], we produce a series of smoothed versions of I by:
$$I_s(x, y) = I(x, y) \otimes G(x, y, s) \tag{1}$$
where x, y are the spatial location, $\otimes$ denotes convolution, and
$$G(x, y, s) = \frac{1}{2\pi s^2} \exp\left(-\frac{x^2 + y^2}{2 s^2}\right) \tag{2}$$
is the 2D Gaussian filter with scale s. As shown in Fig. 1, we compute four smoothed versions of I, one for each of four scales s.
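The Gaussian smoothing described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the kernel is truncated at three times the scale, borders are handled by reflective padding, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(s, truncate=3.0):
    """Sampled, normalized 1-D Gaussian with scale s (radius = truncate * s)."""
    radius = max(1, int(np.ceil(truncate * s)))
    t = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(t ** 2) / (2.0 * s ** 2))
    return k / k.sum()

def gaussian_smooth(image, s):
    """Smooth a 2-D array with a Gaussian of scale s; output has the input shape."""
    k = gaussian_kernel_1d(s)
    r = len(k) // 2
    # Assumption: reflective padding at the borders.
    padded = np.pad(image.astype(float), r, mode="reflect")
    # The 2-D Gaussian is separable: filter along rows, then along columns.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)
```

A set of smoothed versions is then obtained with, e.g., `[gaussian_smooth(img, s) for s in (1.0, 2.0)]`, where the listed scales are placeholders rather than the values used in this paper.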
Denote by R any one of the four translated images or the four smoothed images. Both the intrascale and interscale self-similarity can be calculated by computing the similarity between I and R in each local region, leading to a Local Similarity Map (LSM). Intuitively, the similarity functions used in many existing FR IQA methods to compute the local quality map (LQM) can all be used to compute this LSM. In this paper, we adopt the similarity functions of two representative FR IQA methods, i.e., Structural SIMilarity (SSIM) [27] and ratio of non-shift edge (rNSE) [2, 28].
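TABLE I: SOS features extracted from each of the 8 LSMs.

| Feature | Description | Dimension (for each of the 8 LSMs) |
| --- | --- | --- |
| $\mu$ | Mean of the elements in an LSM | 1 (8 in total) |
| d | Standard deviation of the elements in an LSM | 1 (8 in total) |
| h | Histogram of a quantized (10 bins) LSM | 10 (80 in total) |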
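SSIM is a benchmark FR IQA method. In SSIM, the local similarity at each location is calculated by [27]:
$$\mathrm{SSIM}(x, y) = [L(x, y)]^{\alpha} \cdot [C(x, y)]^{\beta} \cdot [S(x, y)]^{\gamma} \tag{3}$$
where the constants $\alpha$, $\beta$ and $\gamma$ mediate the relative importance of the three components. L, C and S measure the similarities of luminance, contrast and structure between I and R:
$$L(x, y) = \frac{2 \mu_I \mu_R + C_1}{\mu_I^2 + \mu_R^2 + C_1} \tag{4}$$
$$C(x, y) = \frac{2 \sigma_I \sigma_R + C_2}{\sigma_I^2 + \sigma_R^2 + C_2} \tag{5}$$
$$S(x, y) = \frac{\sigma_{IR} + C_3}{\sigma_I \sigma_R + C_3} \tag{6}$$
where $\mu_I$ and $\mu_R$ are the local means of I and R; $\sigma_I$ and $\sigma_R$ are the local standard deviations of I and R; and $\sigma_{IR}$ is the local covariance between I and R. All these quantities are computed using a local Gaussian window with a specified scale parameter as the weighting function. $C_1$, $C_2$ and $C_3$ are small constants that prevent the denominators from being zero. In this work, we follow [27] for the configuration of the parameters $\alpha$, $\beta$, $\gamma$, $C_1$, $C_2$ and $C_3$.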
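To make the computation of an SSIM-based LSM concrete, the sketch below implements the commonly used simplified form of SSIM from [27], in which the three exponents equal 1 and the structure constant is half the contrast constant, so that the product of the three terms collapses into a single ratio. The default constants (0.01 and 0.03 times the dynamic range, squared) are the usual choices of [27]; the window scale default, the reflective border handling and the function names are assumptions of this sketch.

```python
import numpy as np

def _gauss_filter(a, sigma):
    """Gaussian-weighted local average (separable, reflective padding)."""
    r = max(1, int(np.ceil(3 * sigma)))
    t = np.arange(-r, r + 1, dtype=float)
    k = np.exp(-t ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    p = np.pad(a, r, mode="reflect")
    p = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, p)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, p)

def ssim_map(img_i, img_r, sigma=1.5, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Local SSIM map between two equally sized 2-D arrays.

    Simplified form with exponents equal to 1 and C3 = C2 / 2.
    """
    img_i = img_i.astype(float)
    img_r = img_r.astype(float)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_i, mu_r = _gauss_filter(img_i, sigma), _gauss_filter(img_r, sigma)
    var_i = _gauss_filter(img_i * img_i, sigma) - mu_i ** 2
    var_r = _gauss_filter(img_r * img_r, sigma) - mu_r ** 2
    cov = _gauss_filter(img_i * img_r, sigma) - mu_i * mu_r
    num = (2 * mu_i * mu_r + c1) * (2 * cov + c2)
    den = (mu_i ** 2 + mu_r ** 2 + c1) * (var_i + var_r + c2)
    return num / den
```

Passing the image and one of its translated or smoothed versions as the two arguments yields one LSM; comparing an image with itself returns a map of ones.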
rNSE [2, 28] is a recently proposed FR IQA method, which measures image quality as the ratio of the number of non-shift edges after distortion to the number of original edges. The rNSE index is computed as:
$$\mathrm{rNSE}(D, A) = \frac{|E_D \cap E_A|}{|E_A|} \tag{7}$$
where $E_D$ and $E_A$ are the sets of edges of the distorted image D and the reference image A, respectively. The edge sets are obtained by detecting the zero-crossings of the Laplacian of Gaussian (LOG) response [2, 28]. "$\cap$" denotes the intersection of the sets $E_D$ and $E_A$, i.e., the non-shift edge points between A and D; "$|\cdot|$" counts the number of edges in a set. Clearly, rNSE is a ratio between 0 and 1. Since the reference image A is not available in the context of BIQA, we modify the rNSE index as follows to calculate the desired LSM:
(8) 
Here the edge sets of I and R are taken within a local square window centered at each location.
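The global rNSE index described in the text (the fraction of original edge points that remain at the same position after distortion) can be sketched as follows, assuming the two edge maps are already available as boolean arrays. Edge detection by LOG zero-crossings and the local, windowed variant used for the LSM are not shown; the function name and the convention for an empty reference edge set are assumptions of this sketch.

```python
import numpy as np

def rnse(edges_d, edges_a):
    """Ratio of non-shift edges: |E_D intersect E_A| / |E_A| for boolean edge maps."""
    edges_d = np.asarray(edges_d, dtype=bool)
    edges_a = np.asarray(edges_a, dtype=bool)
    n_ref = edges_a.sum()
    if n_ref == 0:
        # Assumption: define the ratio as 0 when the reference has no edges.
        return 0.0
    return float(np.logical_and(edges_d, edges_a).sum() / n_ref)
```

Applying the same counting inside a sliding square window, with I and R in place of the distorted and reference images, gives one value per location.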
With either Eq. 3 or Eq. 8, we can calculate eight LSMs of I. In Fig. 2, we show the LSMs of a reference image and its distorted image (JPEG2000 compressed). SSIM is used as the similarity function. Two intrascale LSMs and two interscale LSMs are shown. Note that the distortion in Fig. 2(f) is moderate and the artifacts generated by compression are nearly invisible. However, from the LSMs, we can easily tell the difference between the reference image and the distorted one. Those LSMs reflect the local correlation of the distorted image along different orientations (for intrascale self-similarity) and in scale space (for interscale self-similarity), implying that the image perceptual quality can be well inferred from them.
II-B SOS Features
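From the 8 SOS based LSMs computed above, features can be extracted to predict the image quality. Clearly, the most important statistics of the LSMs are their mean and standard deviation, which are computed over the elements $\mathrm{LSM}_i$ of each map as follows:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} \mathrm{LSM}_i \tag{9}$$
$$d = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\mathrm{LSM}_i - \mu\right)^2} \tag{10}$$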
where N is the number of elements in the LSM. We call the method that uses the mean and standard deviation SOSMD. One advantage of SOSMD is its low feature dimensionality (16 in total for all the 8 LSMs). Note that the pair of mean and standard deviation completely characterizes the statistical information of an LSM if its elements follow a Gaussian distribution. However, in practice the distributions of the LSMs are far from Gaussian (please refer to Fig. 3 for example distributions of the LSMs), and the mean and standard deviation alone cannot accurately describe them. Therefore, we quantize the LSMs into several levels, and use their normalized histograms as the SOS features. We call this method SOSH. Since both the SSIM and rNSE indices range between 0 and 1, we quantize each LSM into 10 bins (with step length 0.1), resulting in a 10 dimensional histogram h for each LSM.
In TABLE I we list the three types of SOS features: the mean, the standard deviation d and the histogram h. By using each type of feature or a combination of them, we can learn a regression model for BIQA.
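The three feature types can be sketched as below: the mean, the standard deviation, and a normalized 10-bin histogram over [0, 1] for each LSM. Clipping values to [0, 1] before binning and the function names are assumptions of this sketch.

```python
import numpy as np

def sos_features(lsm, bins=10):
    """Return (mean, standard deviation, normalized histogram) of one LSM."""
    v = np.asarray(lsm, dtype=float).ravel()
    # Assumption: values outside [0, 1] are clipped into the end bins.
    hist, _ = np.histogram(np.clip(v, 0.0, 1.0), bins=bins, range=(0.0, 1.0))
    return v.mean(), v.std(), hist / hist.sum()

def sos_feature_vector(lsms):
    """Concatenate the features of all LSMs (8 LSMs give 8 + 8 + 80 = 96 values)."""
    means, stds, hists = zip(*(sos_features(m) for m in lsms))
    return np.concatenate([means, stds, np.concatenate(hists)])
```

Using only the first 16 entries of the vector corresponds to the mean and standard deviation features, and the last 80 entries to the histogram features.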
II-C Regression Model Learning
With the SOS feature vector of an image, denoted by f, a regression function F can be learned to map f to the subjective quality score q of the image, i.e., $q = F(f)$. To this end, we need a set of training images whose subjective quality scores are available. Such a training dataset can be extracted from the existing IQA databases. We can construct a training set of k images with their feature vectors and subjective scores: $\{(f_1, q_1), \ldots, (f_k, q_k)\}$. Machine learning tools, such as support vector regression (SVR), neural networks and random forests, can be used to learn the mapping F. In this work, we adopt the SVR with a radial basis function kernel [29]. Readers may refer to [29] for the details of SVR and its implementation. Once the regression model F is learned, we can use it to estimate the perceptual quality of any input image.
II-D Comparison with Other NSS-based Features
As discussed in Section I, the differences between the SOS based features and the NSS features used in previous BIQA methods are twofold. 1) Instead of transforming the image into another redundancy-reduced domain, we use the translated and smoothed versions of the image in the spatial domain for feature extraction. 2) The statistics of self-similarity maps are used as the quality aware features. To illustrate the power of SOS features in BIQA, we use neighborhood components analysis (NCA) [30] to transform the high dimensional BIQA features into 2 dimensional (2D) points, and then plot the scatter of these 2D points to reveal the essential structure of the data.
We first learn a projection matrix via NCA on the LIVE database [31] for each BIQA feature, and then apply this matrix to the features extracted on the CSIQ [3] and TID2013 [32] databases. The scatter plots of the resulting 2D points are shown in Fig. 4. In each plot, the reference images are represented as green dots, while the distorted images are shown as "+". The color of each "+" encodes the subjective score of the distorted image. The BIQA features used include BIQI [14], DIIVINE [16], BLIINDSII [19], BRISQUE [22], SOSHrNSE and SOSHSSIM. The first row shows the scatter plots of the 2D points which are obtained when the LIVE-learned projection matrix is applied to the CSIQ database, while the second row shows the results on the TID2013 database. The third row in Fig. 4 shows the corresponding plots for SOSHrNSE and SOSHSSIM.
From these plots, we can draw the following conclusions. 1) The distributions of the 2D points for BRISQUE, SOSHrNSE and SOSHSSIM show obvious quality relevance. The points for reference images are close to or overlap with the points of slightly distorted images and are far from those of heavily distorted images. The intermediately distorted images are by and large ordered according to their quality. 2) In the distributions for BIQI, DIIVINE and BLIINDSII, little or no quality relevance can be observed. Distorted images are not located sequentially according to their quality. These observations reveal the advantages of BRISQUE and the proposed SOSH methods over the other methods.

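TABLE II: Performance (PCC, RMSE and SRC) of SOSMD and SOSH with rNSE and SSIM as the similarity function on the three databases.

| Database | Distortion | SOSMDrNSE PCC | SOSMDrNSE RMSE | SOSMDrNSE SRC | SOSMDSSIM PCC | SOSMDSSIM RMSE | SOSMDSSIM SRC | SOSHrNSE PCC | SOSHrNSE RMSE | SOSHrNSE SRC | SOSHSSIM PCC | SOSHSSIM RMSE | SOSHSSIM SRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LIVE | ALL | 0.918 | 10.764 | 0.911 | 0.891 | 12.373 | 0.871 | 0.946 | 8.837 | 0.943 | 0.926 | 10.234 | 0.921 |
| LIVE | JP2K | 0.923 | 9.516 | 0.904 | 0.849 | 13.119 | 0.825 | 0.950 | 7.900 | 0.933 | 0.923 | 9.656 | 0.906 |
| LIVE | JPEG | 0.963 | 8.564 | 0.945 | 0.937 | 11.088 | 0.911 | 0.974 | 7.159 | 0.958 | 0.962 | 8.672 | 0.942 |
| LIVE | WN | 0.978 | 5.811 | 0.967 | 0.983 | 5.194 | 0.973 | 0.989 | 4.121 | 0.980 | 0.981 | 5.486 | 0.971 |
| LIVE | GB | 0.912 | 7.673 | 0.877 | 0.925 | 6.976 | 0.891 | 0.923 | 7.018 | 0.903 | 0.917 | 7.372 | 0.875 |
| LIVE | FF | 0.884 | 13.113 | 0.844 | 0.855 | 14.296 | 0.738 | 0.918 | 11.094 | 0.890 | 0.903 | 12.109 | 0.868 |
| CSIQ | ALL | 0.926 | 0.104 | 0.903 | 0.890 | 0.129 | 0.863 | 0.933 | 0.101 | 0.910 | 0.926 | 0.106 | 0.901 |
| CSIQ | WN | 0.946 | 0.054 | 0.937 | 0.954 | 0.050 | 0.933 | 0.938 | 0.059 | 0.923 | 0.941 | 0.056 | 0.920 |
| CSIQ | JPEG | 0.967 | 0.078 | 0.922 | 0.927 | 0.114 | 0.885 | 0.961 | 0.084 | 0.906 | 0.957 | 0.088 | 0.915 |
| CSIQ | JP2K | 0.935 | 0.110 | 0.912 | 0.875 | 0.152 | 0.842 | 0.941 | 0.106 | 0.915 | 0.936 | 0.111 | 0.907 |
| CSIQ | GB | 0.917 | 0.112 | 0.879 | 0.906 | 0.119 | 0.851 | 0.929 | 0.103 | 0.902 | 0.935 | 0.099 | 0.905 |
| TID2013 | ALL | 0.931 | 0.514 | 0.927 | 0.902 | 0.603 | 0.871 | 0.937 | 0.488 | 0.928 | 0.938 | 0.488 | 0.920 |
| TID2013 | WN | 0.905 | 0.302 | 0.886 | 0.924 | 0.270 | 0.906 | 0.940 | 0.243 | 0.923 | 0.934 | 0.252 | 0.914 |
| TID2013 | GB | 0.925 | 0.466 | 0.912 | 0.894 | 0.556 | 0.876 | 0.923 | 0.475 | 0.913 | 0.923 | 0.469 | 0.914 |
| TID2013 | JPEG | 0.971 | 0.358 | 0.923 | 0.928 | 0.547 | 0.842 | 0.969 | 0.367 | 0.906 | 0.965 | 0.394 | 0.901 |
| TID2013 | JP2K | 0.945 | 0.556 | 0.929 | 0.918 | 0.688 | 0.872 | 0.950 | 0.535 | 0.931 | 0.948 | 0.540 | 0.923 |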
III Experimental Configuration
III-A Image Databases and Evaluation Criteria
We evaluate the performance of the proposed SOS based BIQA methods in terms of their ability to predict the subjective scores of distorted images. Three publicly available large-scale databases are employed for this evaluation.
The LIVE database [31]: A total of 779 distorted images are generated by applying 5 distortion operations at multiple levels to 29 pristine images. The distortions include: JPEG2000 compression (JP2K), JPEG compression, white noise (WN), Gaussian blurring (GB) and simulated fast fading Rayleigh channel (FF).
The CSIQ database [3]: A total of 866 distorted images are generated by applying 6 distortion operations at multiple levels to 30 pristine images. The distortions include: JPEG, JP2K, additive pink noise, WN, GB and global contrast decrements.
The TID2013 database [32]: A total of 3000 distorted images are generated by applying 24 distortion operations at 5 levels to 25 pristine images. The distortions in TID2013 reflect a broad range of image impairments, such as edge smoothing, block artifacts, additive and multiplicative noise, chromatic aberrations, denoising and contrast change, etc. Details of the distortions can be found in [32].
The ground truth quality of each image is given by the subjective score, i.e., the (Difference) Mean Opinion Score (DMOS/MOS). To evaluate the performance of BIQA methods, three indices are usually computed from the subjective scores and the model-predicted scores: the Spearman rank order correlation coefficient (SRC), the Pearson correlation coefficient (PCC) after a logistic regression [33], and the root mean squared error (RMSE) between the subjective score and the predicted score after the regression. Note that this logistic regression accounts for the different ranges of the objective and subjective scores, as well as the nonlinearity of human perception under extreme distortions.
III-B Implementation Details and Parameter Setting
In the implementation of the proposed BIQA method, the scale of the Gaussian window in SSIM and the scale of the LOG filter in rNSE are both set to 0.5 for computing the intrascale self-similarity. For the interscale self-similarity, we set the scale parameters of SSIM and rNSE to the same values as the four smoothing scales in Eq. 2. With the LSMs available, the SOS features (mean, standard deviation d and histogram h) can be extracted and then fed into the SVR to train a regression model for quality prediction. We adopt the SVR algorithm with an RBF (radial basis function) kernel, using the source code of LibSVM [34]. The parameters of the SVR are tuned by a 2D grid search in logarithmic space.
In our experiments, 80% of the images are used for training and the remaining 20% for testing. The training and test sets are split according to the reference image to guarantee that the image contents of the training set and the test set are independent. This splitting is repeated 1,000 times and the median results are used to evaluate the final performance.
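A content-independent split of this kind can be sketched as below. The sketch assigns whole reference images (with all their distorted versions) to either the training or the test set, so the 80/20 ratio holds over reference images; the function names and the `evaluate` callback, which stands for training and testing a model on one split, are hypothetical.

```python
import random
from statistics import median

def split_by_reference(ref_ids, train_fraction=0.8, rng=random):
    """Split image indices so that no reference image appears in both sets.

    ref_ids[i] is the identifier of the reference image of distorted image i.
    """
    refs = sorted(set(ref_ids))
    rng.shuffle(refs)
    n_train = round(train_fraction * len(refs))
    train_refs = set(refs[:n_train])
    train = [i for i, r in enumerate(ref_ids) if r in train_refs]
    test = [i for i, r in enumerate(ref_ids) if r not in train_refs]
    return train, test

def median_over_splits(ref_ids, evaluate, repeats=1000, seed=0):
    """Repeat the random split and return the median of evaluate(train, test)."""
    rng = random.Random(seed)
    return median(
        evaluate(*split_by_reference(ref_ids, rng=rng)) for _ in range(repeats)
    )
```

Here `evaluate` would typically fit the regression model on the training indices and return, for instance, the SRC on the test indices.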
IV Results and Discussion
IV-A Performance of SOSMD and SOSH
We first compare the performance of SOSMD and SOSH on the three databases. The results are listed in TABLE II. Note that in this experiment we only consider the distortion types common to all three databases, i.e., JP2K, JPEG, WN, and GB.
From TABLE II, we can observe that both SOSMD and SOSH show good performance on the three databases in terms of PCC and SRC. The best results on all the three databases are achieved by the method SOSHrNSE, with PCC values 0.946, 0.933, and 0.937 on the three databases, respectively. For every single distortion type, it also demonstrates PCC and SRC values consistently higher than 0.9. Note that due to the scale difference of subjective scores on the three databases, the resulting RMSE values range differently.
As for the two similarity functions, rNSE shows clear advantages over SSIM. We highlight in boldface the better one of rNSE and SSIM in each row. The SOSMD and SOSH methods with rNSE as the similarity function outperform those with SSIM on most of the distortion types. This may be because rNSE places more emphasis on edge structure, which is crucial for human visual perception. Besides, benefiting from the richer information in histograms, SOSH consistently exhibits better overall (ALL) performance than SOSMD on the three databases.
TABLE III: SRC of the competing methods on the LIVE database.

| Methods | Feature domain | JP2K | JPEG | WN | GB | FF | ALL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BIQI [14]* | Wavelet | 0.7849 | 0.8801 | 0.9157 | 0.8367 | 0.7023 | 0.8084 |
| GRNN [35] | Fourier+Spatial | 0.8156 | 0.8721 | 0.9794 | 0.8331 | 0.7354 | 0.8268 |
| LDGS [36] | Wavelet | 0.8317 | 0.8339 | 0.9134 | 0.8751 | 0.8588 | 0.8414 |
| LDTS [36] | Wavelet | 0.8202 | 0.8334 | 0.9556 | 0.9251 | 0.8863 | 0.8833 |
| DIIVINE [16]* | Wavelet | 0.8418 | 0.8926 | 0.9617 | 0.8792 | 0.8202 | 0.8816 |
| CBIQI [37] | Gabor | 0.912 | 0.963 | 0.959 | 0.918 | 0.885 | 0.896 |
| BLIINDSII [19]* | DCT | 0.9258 | 0.95 | 0.9477 | 0.9132 | 0.8736 | 0.9302 |
| CBIQII [37] | Gabor | 0.919 | 0.965 | 0.933 | 0.944 | 0.912 | 0.93 |
| BRISQUE [22]* | Spatial | 0.9175 | 0.9655 | 0.9789 | 0.9479 | 0.8854 | 0.943 |
| M3 [23] | Gradient magnitude+LOG | 0.9283 | 0.9659 | 0.9853 | 0.9359 | 0.9008 | 0.9511 |
| SOSHrNSE | Spatial+Scale | 0.9328 | 0.9582 | 0.9802 | 0.9026 | 0.8899 | 0.9434 |
| SOSHSSIM | Spatial+Scale | 0.906 | 0.9415 | 0.9711 | 0.8754 | 0.8679 | 0.9212 |
| PSNR | - | 0.9081 | 0.8923 | 0.984 | 0.8111 | 0.8941 | 0.8839 |
| SSIM | - | 0.9606 | 0.9739 | 0.9693 | 0.9515 | 0.9551 | 0.9481 |
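TABLE IV: p-values of the right-tailed t-test between each pair of BIQA methods (row versus column) on the LIVE database.

| | BIQI | DIIVINE | BLIINDSII | BRISQUE | SOSHSSIM | SOSHrNSE | M3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BIQI | - | 1 | 1 | 1 | 1 | 1 | 1 |
| DIIVINE | 0 | - | 1 | 1 | 1 | 1 | 1 |
| BLIINDSII | 0 | 0 | - | 1 | 0.79 | 1 | 1 |
| BRISQUE | 0 | 0 | 0 | - | 0 | 1 | 1 |
| SOSHSSIM | 0 | 0 | 0.21 | 1 | - | 1 | 1 |
| SOSHrNSE | 0 | 0 | 0 | 0 | 0 | - | 1 |
| M3 | 0 | 0 | 0 | 0 | 0 | 0 | - |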
IV-B Performance Comparison with Existing BIQA Methods
In TABLE III, we compare the performance of the proposed SOS-based methods with existing state-of-the-art BIQA methods, including BIQI [14], BLIINDSII [19], DIIVINE [16], GRNN [35], the visual codebook based method (CBIQ) [37], the local dependency based methods (LDGS and LDTS) [36], BRISQUE [22] and M3 [23]. The results of these competitors are either sourced from their original publications or computed by using the source codes provided by the authors. The results of the classical PSNR and SSIM indices are also presented for reference. To save space, only the SRC results are shown in TABLE III. The top three results are highlighted in boldface for each column.
When the entire LIVE database is considered, the two proposed SOSH methods are very competitive with the state-of-the-art BIQA methods. The top three methods on the LIVE database are M3, SOSHrNSE and BRISQUE. SOSHSSIM beats all the wavelet-based methods. Box plots of the results on the LIVE database are presented in Fig. 5 for a more intuitive comparison. To investigate the significance of the differences between the performances of these BIQA methods, a right-tailed t-test with a significance level of 0.01 is conducted for each pair of BIQA methods. The null hypothesis is that the mean SRC values of the two methods are equal. The alternative hypothesis is that the mean SRC value of the method in the row is greater than that of the method in the column. The resulting p-values of the tests are shown in TABLE IV, and a small p-value favors the alternative hypothesis. Again, we can see that the proposed SOSHrNSE delivers excellent performance, and it is only beaten by M3.
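TABLE V: SRC of the models trained on LIVE and tested on CSIQ and TID2013.

| Database | DIIVINE | BLIINDSII | BRISQUE | M3 | SOSHSSIM | SOSHrNSE |
| --- | --- | --- | --- | --- | --- | --- |
| CSIQ | 0.857 | 0.888 | 0.899 | 0.911 | 0.898 | 0.907 |
| TID2013 | 0.860 | 0.895 | 0.891 | 0.923 | 0.897 | 0.913 |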
TABLE VI: Median SRC of the competing methods for each distortion type of the TID2013 database.

| Distortion | SOSHrNSE | SOSHSSIM | BRISQUE | BLIINDSII | DIIVINE | M3 |
| --- | --- | --- | --- | --- | --- | --- |
| Additive Gaussian noise | 0.821 | 0.774 | 0.778 | 0.628 | 0.709 | 0.769 |
| Additive noise more in color | 0.501 | 0.542 | 0.554 | 0.357 | 0.431 | 0.583 |
| Spatially correlated noise | 0.728 | 0.761 | 0.830 | 0.689 | 0.816 | 0.783 |
| Masked noise | 0.261 | 0.307 | 0.172 | 0.281 | 0.111 | 0.504 |
| High frequency noise | 0.876 | 0.893 | 0.855 | 0.772 | 0.816 | 0.884 |
| Impulse noise | 0.739 | 0.698 | 0.815 | 0.607 | 0.789 | 0.718 |
| Quantization noise | 0.579 | 0.748 | 0.695 | 0.639 | 0.535 | 0.819 |
| Gaussian blur | 0.863 | 0.826 | 0.856 | 0.855 | 0.915 | 0.872 |
| Image denoising | 0.803 | 0.693 | 0.551 | 0.797 | 0.723 | 0.771 |
| JPEG compression | 0.832 | 0.748 | 0.756 | 0.706 | 0.725 | 0.819 |
| JPEG2000 compression | 0.898 | 0.793 | 0.780 | 0.850 | 0.861 | 0.873 |
| JPEG transmission errors | 0.435 | 0.165 | 0.231 | 0.409 | 0.343 | 0.423 |
| JPEG2000 transmission errors | 0.565 | 0.633 | 0.695 | 0.696 | 0.717 | 0.723 |
| Non eccentricity pattern noise | 0.182 | 0.131 | 0.126 | 0.176 | 0.134 | 0.212 |
| Local blockwise distortions | 0.146 | 0.206 | 0.203 | 0.290 | 0.298 | 0.280 |
| Mean shift | 0.127 | 0.217 | 0.112 | 0.185 | 0.197 | 0.090 |
| Contrast change | 0.161 | 0.056 | 0.058 | 0.085 | 0.347 | 0.301 |
| Change of color saturation | 0.099 | 0.175 | 0.092 | 0.022 | 0.213 | 0.185 |
| Multiplicative Gaussian noise | 0.695 | 0.720 | 0.621 | 0.626 | 0.666 | 0.709 |
| Comfort noise | 0.142 | 0.021 | 0.165 | 0.084 | 0.265 | 0.229 |
| Lossy compression of noisy images | 0.628 | 0.639 | 0.531 | 0.454 | 0.677 | 0.704 |
| Image color quantization with dither | 0.837 | 0.815 | 0.827 | 0.789 | 0.802 | 0.858 |
| Chromatic aberrations | 0.678 | 0.707 | 0.731 | 0.596 | 0.775 | 0.644 |
| Sparse sampling and reconstruction | 0.847 | 0.819 | 0.807 | 0.861 | 0.843 | 0.922 |
| ALL | 0.608 | 0.569 | 0.558 | 0.576 | 0.593 | 0.687 |
When each distortion type is considered, SOSHrNSE shows top performance on JP2K, WN, and FF, while SOSHSSIM gives inferior SRC values. More specifically, on JP2K images, all methods that are based on wavelet features [14, 16] fail to give a high SRC value, while the DCT based BLIINDSII, the gradient and LOG based M3 [23] and the proposed SOSH methods show excellent performance. On JPEG images, the two SOSH methods perform better than the wavelet based methods. On WN images, all the methods based on spatial features show a clear advantage over the wavelet, Gabor and DCT based methods. This is because the pixel based representation in the spatial domain is more appropriate for additive noise. On GB images, BRISQUE, M3 and CBIQII give the best performance, while the proposed SOSH methods are still better than the wavelet-based methods. On FF images, it is hard to capture the intrinsic characteristics for quality prediction because FF simultaneously introduces structure shifting, blurring, ringing and color contamination. Among the competing methods, CBIQII performs the best, followed by SOSHrNSE and BRISQUE. SOSHSSIM only achieves an acceptable performance.
When compared to the FR methods PSNR and SSIM, the two SOSH methods show an obvious advantage over PSNR, but are inferior to the SSIM index.
IV-C Database Independence
We examine the database independence of the proposed SOSH methods as follows: we train a quality prediction model with the SOSH features on the LIVE database and then test the model on CSIQ and TID2013. Note that for CSIQ and TID2013, only images with the four distortions in common with LIVE are considered. The results are shown in TABLE V. The two SOSH methods show good database independence. When the LIVE-trained models are tested on CSIQ and TID2013, SOSHrNSE gives SRC values competitive with M3, and SOSHSSIM works on par with BRISQUE. DIIVINE shows less stable performance in this case.
IV-D More Distortion Types
We further test the performance of the proposed SOSH methods on more distortions by using the TID2013 database, which has a wide range of distortions. 80% of the images are used for training and the remaining 20% for testing. The procedure is the same as described in Section III-B. The obtained median SRC values are presented in TABLE VI. The top two methods are highlighted in bold for each distortion. (Detailed explanations of the distortions can be found in [32].) As can be seen, SOSHrNSE shows better performance than the other BIQA methods, except for M3, on the entire TID2013 database. On some distortions, all BIQA methods fail to give acceptable performance. We shade the rows in TABLE VI where all methods obtain SRC values less than 0.5. Examples of these distortions include JPEG transmission errors, non-eccentricity pattern noise, blockwise distortion with different intensity, mean shift, contrast change, color saturation change, and comfort noise. The failure in these cases may be because all the current BIQA methods make use of structure features, which are not capable of capturing nonstructural distortions such as color aberration and mean shift.
IV-E Similarity Function of MSE
To further validate the effectiveness of the proposed SOS framework, we take MSE as the similarity function in SOS:
(11) 
Note that we take a logarithm transform of MSE in order to better compute the histogram of the LSM. The features for quality prediction are extracted in the same way as in SOSHrNSE and SOSHSSIM. The resulting SOS-based method is denoted as SOSHMSE.
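TABLE VII: SRC of the SOSH methods with rNSE, SSIM and MSE as the similarity function.

| Database | Distortion | SOSHrNSE | SOSHSSIM | SOSHMSE |
| --- | --- | --- | --- | --- |
| LIVE | ALL | 0.943 | 0.921 | 0.944 |
| LIVE | JP2K | 0.933 | 0.906 | 0.946 |
| LIVE | JPEG | 0.958 | 0.942 | 0.959 |
| LIVE | WN | 0.980 | 0.971 | 0.981 |
| LIVE | GB | 0.903 | 0.875 | 0.928 |
| LIVE | FF | 0.890 | 0.868 | 0.865 |
| CSIQ | ALL | 0.910 | 0.901 | 0.902 |
| CSIQ | WN | 0.923 | 0.920 | 0.917 |
| CSIQ | JPEG | 0.906 | 0.915 | 0.918 |
| CSIQ | JP2K | 0.915 | 0.907 | 0.905 |
| CSIQ | GB | 0.902 | 0.905 | 0.895 |
| TID2013 | ALL | 0.928 | 0.920 | 0.919 |
| TID2013 | WN | 0.923 | 0.914 | 0.918 |
| TID2013 | GB | 0.913 | 0.914 | 0.915 |
| TID2013 | JPEG | 0.906 | 0.901 | 0.893 |
| TID2013 | JP2K | 0.931 | 0.923 | 0.923 |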
TABLE VII compares the performance of the three SOSH based methods. Under the SOS framework, MSE gives performance similar to SSIM on the CSIQ and TID2013 databases, and nearly the same performance as rNSE on LIVE. We can conclude that the effectiveness of the SOS framework is demonstrated by all three similarity functions, rNSE, SSIM and MSE. The MSE-based SOSH even shows slightly better performance in terms of SRC on the LIVE database. For each distortion type, SOSHrNSE and SOSHMSE show similar results. The good results of SOSHMSE can be explained as follows. MSE computes the squared difference between the original image and its shifted or smoothed version. This is similar to the computation of the image gradient, which has been shown to be very effective for image quality assessment [4]. Better performance may be achieved by other similarity functions under the SOS framework.
IV-F Discussion
It was found that the function of ganglion and lateral geniculate nucleus (LGN) neurons can be modeled by principal component analysis (PCA) based whitening, while the role of PCA is similar to DCT for natural images [38]. The responses of simple cells in the primary visual cortex (V1) are similar to the WT outputs and approach to the independent components of natural images [39]. However, these transforms may not be effective and efficient to represent distorted images in the context of IQA. The scatter plots in the first two rows of Fig. 4 show clearly that the features extracted from DCT and WT domains cannot distinguish well the distorted image and their reference counterpart, and their 2D scatter plots show low relevance to the subjective quality of image.
This fact motivated us to find a different method for NSS calculation. Instead of applying a transform, we directly compute the intrascale and interscale redundancy in the spatial domain. Interestingly, as shown in the third row of Fig. 2, the proposed SOS features better distinguish the original natural images from their distorted counterparts, and the scatter plots show better relevance to subjective quality. Our experiments in the previous sections also validated that the SOS features can predict the perceptual quality very well. Besides, the proposed SOS features work robustly, with no strict restriction on the similarity function. Whether our results imply a new physical model of how the HVS senses image quality is an interesting open problem, but it is outside the scope of this paper.
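To make the spatial-domain computation concrete, the sketch below builds an intrascale and an interscale similarity map and summarizes each with a normalized histogram. It is an illustration under stated assumptions only: the box filter standing in for a coarser scale, the ratio-type similarity function, the wrap-around one-pixel shift, and the bin count are all hypothetical choices and are not the definitions used in this paper.

```python
import numpy as np

def box_smooth(image, radius=1):
    """Box filter with edge padding (assumed stand-in for a coarser scale)."""
    size = 2 * radius + 1
    padded = np.pad(image, radius, mode="edge")
    out = np.zeros_like(image, dtype=np.float64)
    # Sum the size*size shifted views of the padded image, then average.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (size * size)

def similarity_map(a, b, c=1e-3):
    """Pixelwise similarity in (0, 1] for non-negative inputs; 1 means identical.

    Illustrative ratio form (assumption), not the rNSE/SSIM/MSE definitions.
    """
    sim = (2.0 * a * b + c) / (a ** 2 + b ** 2 + c)
    return np.clip(sim, 0.0, 1.0)

def sos_histogram(image, bins=10):
    """Concatenate normalized histograms of intrascale and interscale similarity maps."""
    image = np.asarray(image, dtype=np.float64)
    # Intrascale: compare the image with a copy shifted by one column (wraps around).
    intra = similarity_map(image, np.roll(image, 1, axis=1))
    # Interscale: compare the image with its smoothed version.
    inter = similarity_map(image, box_smooth(image))
    feats = []
    for sim in (intra, inter):
        hist, _ = np.histogram(sim, bins=bins, range=(0.0, 1.0))
        feats.append(hist / sim.size)
    return np.concatenate(feats)

# Toy data (assumption): a 16x16 image with values in [0, 1).
rng = np.random.default_rng(0)
toy_image = rng.random((16, 16))
features = sos_histogram(toy_image)
```

In a full pipeline, a feature vector of this kind would be passed to a learned regressor that maps it to a quality score; that step is omitted here.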
V Conclusion
It is well known that a proper representation makes image processing tasks easier, and the same holds for image quality assessment. In this paper, we proposed a new feature representation framework which aims to capture the statistics of self-similarity (SOS) for natural images. Different from previous methods, SOS directly measures the redundancy existing in an image, rather than describing the structure in a redundancy-reduced domain. The computed local similarity map (LSM) can portray the local correlation across space and scales, both of which are altered by image distortion. The statistics of these LSMs, i.e., the SOS features, were shown to capture the degree of distortion more effectively than previous features based on image decomposition. In particular, when the LSM histogram features are used, very competitive performance can be achieved on the benchmark databases. In future work, new similarity functions can be introduced or designed under the SOS framework for better performance.
References
 [1] Z. Wang and A. C. Bovik, “Modern image quality assessment,” Synthesis Lectures on Image, Video, and Multimedia Processing, vol. 2, no. 1, pp. 1–156, 2006.
 [2] M. Zhang, X. Mou, and L. Zhang, “Non-shift edge based ratio (nser): An image quality assessment metric based on early vision features,” Signal Processing Letters, IEEE, no. 99, pp. 1–1, 2011.
 [3] E. C. Larson and D. M. Chandler, “Most apparent distortion: full-reference image quality assessment and the role of strategy,” Journal of Electronic Imaging, vol. 19, no. 1, p. 011006, 2010.
 [4] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: a highly efficient perceptual image quality index,” Image Processing, IEEE Transactions on, vol. 23, no. 2, pp. 684–695, 2014.
 [5] S. Li, F. Zhang, L. Ma, and K. N. Ngan, “Image quality assessment by separately evaluating detail losses and additive impairments,” Multimedia, IEEE Transactions on, vol. 13, no. 5, pp. 935–949, Oct 2011.
 [6] L. Zhang, X. Mou, and D. Zhang, “Fsim: A feature similarity index for image quality assessment,” Image Processing, IEEE Transactions on, no. 99, pp. 1–1, 2011.
 [7] Q. Li and Z. Wang, “Reduced-reference image quality assessment using divisive normalization-based image representation,” Selected Topics in Signal Processing, IEEE Journal of, vol. 3, no. 2, pp. 202–211, 2009.
 [8] W. Xue and X. Mou, “Reduced reference image quality assessment based on weibull statistics,” in Quality of Multimedia Experience (QoMEX), 2010 Second International Workshop on. IEEE, 2010, pp. 1–6.
 [9] X. Mou, W. Xue, and L. Zhang, “Reduced reference image quality assessment via subimage similarity based redundancy measurement,” Proceedings of SPIE, vol. 8291, p. 82911S, 2012.
 [10] L. Ma, S. Li, F. Zhang, and K. N. Ngan, “Reduced-reference image quality assessment using reorganized dct-based image representation,” Multimedia, IEEE Transactions on, vol. 13, no. 4, pp. 824–829, Aug 2011.
 [11] J. Wu, W. Lin, G. Shi, and A. Liu, “Reduced-reference image quality assessment with visual information fidelity,” Multimedia, IEEE Transactions on, vol. 15, no. 7, pp. 1700–1705, Nov 2013.
 [12] M. Shahid, A. Rossholm, B. Lövström, and H.-J. Zepernick, “No-reference image and video quality assessment: a classification and review of recent approaches,” EURASIP Journal on Image and Video Processing, vol. 2014, no. 1, pp. 1–32, 2014.
 [13] R. A. Manap and L. Shao, “Non-distortion-specific no-reference image quality assessment: A survey,” Information Sciences, vol. 301, pp. 141–160, 2015.
 [14] A. Moorthy and A. Bovik, “A two-step framework for constructing blind image quality indices,” Signal Processing Letters, IEEE, vol. 17, no. 5, pp. 513–516, 2010.
 [15] H. Tang, N. Joshi, and A. Kapoor, “Learning a blind measure of perceptual image quality,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 305–312.
 [16] A. Moorthy and A. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” Image Processing, IEEE Transactions on, vol. 20, no. 12, pp. 3350–3364, Dec. 2011.
 [17] W. Lu, K. Zeng, D. Tao, Y. Yuan, and X. Gao, “No-reference image quality assessment in contourlet domain,” Neurocomputing, vol. 73, no. 4, pp. 784–794, 2010.
 [18] M. Saad, A. Bovik, and C. Charrier, “A dct statistics-based blind image quality index,” Signal Processing Letters, IEEE, vol. 17, no. 6, pp. 583–586, 2010.
 [19] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the dct domain,” Image Processing, IEEE Transactions on, vol. 21, no. 8, pp. 3339–3352, 2012.
 [20] P. Teo and D. Heeger, “Perceptual image distortion,” in Image Processing, 1994. Proceedings. ICIP-94., IEEE International Conference, vol. 2. IEEE, 1994, pp. 982–986.
 [21] D. L. Ruderman and W. Bialek, “Statistics of natural images: Scaling in the woods,” Physical review letters, vol. 73, no. 6, p. 814, 1994.
 [22] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” Image Processing, IEEE Transactions on, vol. 21, no. 12, pp. 4695–4708, 2012.
 [23] W. Xue, X. Mou, L. Zhang, A. C. Bovik, and X. Feng, “Blind image quality assessment using joint statistics of gradient magnitude and laplacian features,” Image Processing, IEEE Transactions on, vol. 23, no. 11, pp. 4850–4862, 2014.
 [24] F. Attneave, “Some informational aspects of visual perception.” Psychological review, vol. 61, no. 3, p. 183, 1954.
 [25] D. Kersten, “Predictability and redundancy of natural images,” JOSA A, vol. 4, no. 12, pp. 2395–2400, 1987.
 [26] E. Dam and B. ter Haar Romeny, “Front end vision and multiscale image analysis,” Deep Structure I, II & III, no. 14020, pp. 1507–0, 2003.
 [27] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” Image Processing, IEEE Transactions on, vol. 13, no. 4, pp. 600–612, 2004.
 [28] W. Xue and X. Mou, “An image quality assessment metric based on non-shift edge,” in Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011, pp. 3309–3312.
 [29] A. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and computing, vol. 14, no. 3, pp. 199–222, 2004.
 [30] S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighbourhood component analysis,” in Neural Information Processing Systems, vol. 17, pp. 513–520.
 [31] H. Sheikh, Z. Wang, L. Cormack, and A. Bovik, “Live image quality assessment database release 2 (2005).”
 [32] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, L. Lin, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. Jay Kuo, “Color image database tid2013: Peculiarities and preliminary results,” Advances of Modern Radioelectronics, vol. 10, no. 10, pp. 30–45, 2009.
 [33] VQEG, “Final report from the video quality experts group on the validation of objective models of video quality assessment, phase ii,” VQEG, Aug, 2003.
 [34] C. Chang and C. Lin, “Libsvm: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
 [35] C. Li, A. Bovik, and X. Wu, “Blind image quality assessment using a general regression neural network,” Neural Networks, IEEE Transactions on, vol. 22, no. 5, pp. 793–799, 2011.
 [36] F. Gao, X. Gao, D. Tao, X. Li, L. He, and W. Lu, “Universal no reference image quality assessment metrics based on local dependency,” in Pattern Recognition (ACPR), 2011 First Asian Conference on. IEEE, 2011, pp. 298–302.
 [37] P. Ye and D. Doermann, “No-reference image quality assessment using visual codebooks,” Image Processing, IEEE Transactions on, vol. 21, no. 7, pp. 3129–3138, 2012.
 [38] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” Computers, IEEE Transactions on, vol. 100, no. 1, pp. 90–93, 1974.
 [39] S. Fischer, F. Šroubek, L. Perrinet, R. Redondo, and G. Cristóbal, “Self-invertible 2d log-gabor wavelets,” International Journal of Computer Vision, vol. 75, no. 2, pp. 231–246, 2007.
Wufeng Xue received the B.Sc. degree in automatic engineering from the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China, in 2009. He is currently pursuing the Ph.D. degree with the Institute of Image Processing and Pattern Recognition, Xi’an Jiaotong University. His research interest focuses on perceptual quality of visual signals. 
Xuanqin Mou (M’08) has been with the Institute of Image Processing and Pattern Recognition (IPPR), Electronic and Information Engineering School, Xi’an Jiaotong University, since 1987. He has been an Associate Professor since 1997, and a Professor since 2002. He is currently the Director of IPPR. Dr. Mou has served as a member of the 12th Expert Evaluation Committee for the National Natural Science Foundation of China, a member of the 5th and 6th Executive Committees of the China Society of Image and Graphics, and the Vice President of the Shaanxi Image and Graphics Association. He has authored or coauthored more than 200 peer-reviewed journal or conference papers. He has received the Yung Wing Award for Excellence in Education, the KC Wong Education Award, the Technology Academy Award for Invention by the Ministry of Education of China, and the Technology Academy Awards from the Government of Shaanxi Province, China. 
Lei Zhang (M’04, SM’14) received the B.Sc. degree in 1995 from Shenyang Institute of Aeronautical Engineering, Shenyang, P.R. China, and the M.Sc. and Ph.D. degrees in Control Theory and Engineering from Northwestern Polytechnical University, Xi’an, P.R. China, in 1998 and 2001, respectively. From 2001 to 2002, he was a research associate in the Dept. of Computing, The Hong Kong Polytechnic University. From Jan. 2003 to Jan. 2006 he worked as a Postdoctoral Fellow in the Dept. of Electrical and Computer Engineering, McMaster University, Canada. In 2006, he joined the Dept. of Computing, The Hong Kong Polytechnic University, as an Assistant Professor. Since Sept. 2010, he has been an Associate Professor in the same department. His research interests include Image and Video Processing, Computer Vision, Pattern Recognition, and Biometrics. Dr. Zhang has published about 200 papers in those areas. Dr. Zhang is currently an Associate Editor of IEEE Trans. on CSVT and Image and Vision Computing. He was awarded the 2012-13 Faculty Award in Research and Scholarly Activities. More information can be found on his homepage http://www4.comp.polyu.edu.hk/ cslzhang/. 