dipIQ: Blind Image Quality Assessment by Learning-to-Rank Discriminable Image Pairs

Kede Ma, Wentao Liu, Tongliang Liu, Zhou Wang, and Dacheng Tao

K. Ma, W. Liu, and Z. Wang are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: {k29ma, w238liu, zhou.wang}@uwaterloo.ca).

T. Liu and D. Tao are with the UBTech Sydney Artificial Intelligence Institute and the School of Information Technologies in the Faculty of Engineering and Information Technologies at The University of Sydney, J12 Cleveland St, Darlington, NSW 2008, Australia (e-mail: {tongliang.liu, dacheng.tao}@sydney.edu.au).

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract

Objective assessment of image quality is fundamentally important in many image processing tasks. In this work, we focus on learning blind image quality assessment (BIQA) models, which predict the quality of a digital image with no access to its original pristine-quality counterpart as reference. One of the biggest challenges in learning BIQA models is the conflict between the gigantic image space (whose dimension equals the number of image pixels) and the extremely limited reliable ground truth data for training. Such data are typically collected via subjective testing, which is cumbersome, slow, and expensive. Here we first show that a vast amount of reliable training data in the form of quality-discriminable image pairs (DIP) can be obtained automatically at low cost by exploiting large-scale databases with diverse image content. We then learn an opinion-unaware BIQA (OU-BIQA, meaning that no subjective opinions are used for training) model using RankNet, a pairwise learning-to-rank (L2R) algorithm, from millions of DIPs, each associated with a perceptual uncertainty level, leading to a DIP inferred quality (dipIQ) index. Extensive experiments on four benchmark IQA databases demonstrate that dipIQ outperforms state-of-the-art OU-BIQA models. The robustness of dipIQ is also significantly improved, as confirmed by the group MAximum Differentiation (gMAD) competition method. Furthermore, we extend the proposed framework by learning models with ListNet (a listwise L2R algorithm) on quality-discriminable image lists (DIL). The resulting DIL inferred quality (dilIQ) index achieves an additional performance gain.

Index Terms: Blind image quality assessment (BIQA), learning-to-rank (L2R), dipIQ, RankNet, quality-discriminable image pair (DIP), gMAD.

I Introduction

Objectively assessing image quality is of fundamental importance due in part to the massive expansion of online image volume. Objective image quality assessment (IQA) has become an active research topic over the last decade, with a large variety of IQA models proposed [1, 2]. They can be categorized into full-reference models (FR, where the reference image is fully available when evaluating a distorted image) [3], reduced-reference models (RR, where only partial information about the reference image is available) [4], and blind/no-reference models (NR, where the reference image is not accessible) [5]. In many real-world applications, reference images are unavailable, making blind IQA (BIQA) models highly desirable in practice.

Many BIQA models are developed by supervised learning [6, 7, 8, 9, 10, 11, 12, 13, 14] and share a common two-stage structure: 1) perception- and/or distortion-relevant features are extracted from the test image; and 2) a quality prediction function that maps the features to a quality score is learned by statistical machine learning algorithms. The performance and robustness of these approaches rely heavily on the quality and quantity of the ground truth data used for training. The most common type of ground truth data is the mean opinion score (MOS), which is the average of quality ratings given by multiple subjects. Therefore, these models are often referred to as opinion-aware BIQA (OA-BIQA) models and may incur the following drawbacks. First, collecting MOSs via subjective testing is slow, cumbersome, and expensive. As a result, even the largest publicly available IQA database, TID2013 [15], provides only 3,000 images with MOSs. This limited number of training images is extremely sparsely distributed in the entire image space, whose dimension equals the number of pixels and is typically on the order of millions. As such, the generalizability of BIQA models learned from small training samples is questionable on real-world images. Second, among thousands of sample images, only a few dozen source reference images can be included, considering the combinations of reference images, distortion types, and levels. For example, the TID2013 database [15] includes only 25 source images. It is extremely unlikely that this limited number of reference images sufficiently represents the variations that exist in real-world images. Third, since these BIQA models are trained on individual images to make independent quality predictions, the cost function is blind to the relative perceptual order between images. As a result, the learned models are weak at ordering images with respect to their perceptual quality.

In this paper, we show that a vast amount of reliable training data in the form of so-called quality-discriminable image pairs (DIP) can be generated by exploiting large-scale databases with diverse image content. Each DIP is associated with a perceptual uncertainty measure to indicate the confidence level of its quality discriminability. We show that such DIPs can be generated at very low cost without resorting to subjective testing. We then employ RankNet [16], a neural network-based pairwise learning-to-rank (L2R) algorithm [17, 18], to learn an opinion-unaware BIQA (OU-BIQA, meaning that no subjective opinions are used for training) model by incorporating the uncertainty measure into the loss function. Extensive experiments on four benchmark IQA databases demonstrate that the DIP inferred quality (dipIQ) indices significantly outperform previous OU-BIQA models. We also conduct another set of experiments in which we train the dipIQ indices using different feature representations as inputs and compare them with OA-BIQA models using the same representations. The generalizability and robustness of dipIQ are improved across all four IQA databases and verified by the group MAximum Differentiation (gMAD) competition method [19], which examines image pairs optimally selected from the Waterloo Exploration Database [20]. Furthermore, we extend the proposed pairwise L2R approach for OU-BIQA to a listwise L2R one by invoking ListNet [21] (a listwise L2R extension of RankNet [16]) and transforming DIPs to quality-discriminable image lists (DIL) for training. The resulting DIL inferred quality (dilIQ) index leads to an additional performance gain.

The remainder of the paper is organized as follows. BIQA models and typical L2R algorithms are reviewed and categorized in Section II. The proposed dipIQ approach is introduced in Section III. Experimental results using dipIQ on four benchmark IQA databases compared with state-of-the-art BIQA models are presented in Section IV, followed by an extension to the dilIQ model in Section V. We conclude the paper in Section VI.

II Related Work

We first review existing BIQA models according to their two-stage structure: feature extraction and quality prediction model learning. We then review typical L2R algorithms. Details of RankNet [16] are provided in Section III.

II-A Existing BIQA Models

From the feature extraction point of view, three types of knowledge can be exploited to craft useful features for BIQA. The first is knowledge about our visual world that summarizes the statistical regularities of undistorted images. The second is knowledge about degradation, which can then be explicitly taken into account to build features for particular artifacts, such as blocking [22, 23, 24], blurring [25, 26, 27] and ringing [28, 29, 30]. The third is knowledge of the human visual system (HVS) [31], namely perceptual models derived from visual physiological and psychophysical studies [32, 33, 34, 35]. Natural scene statistics (NSS), which seek to capture the natural statistical behavior of images, embody the three-fold modeling in a rather elegant way [5]. NSS can be extracted directly in the spatial domain or in transform domains such as DFT, DCT, and wavelets [36, 37].

In the spatial domain, edges are presumably the most important image features. The edge spread can be used to detect blurring [38, 39], and the intensity variance in smooth regions close to edges can indicate ringing artifacts [28]. Step edge detectors that operate at block boundaries measure the severity of discontinuities caused by JPEG compression [22]. The sample entropy of intensity histograms is used to identify image anisotropy [40, 41]. The responses of image gradients and the Laplacian of Gaussian operators are jointly modeled to describe the destruction of statistical naturalness of images [12]. The singular value decomposition of local image gradient matrices may provide a quantitative measure of image content [42]. Mean-subtracted and contrast-normalized pixel value statistics have also been modeled using a generalized Gaussian distribution (GGD) [8, 43, 44, 45], inspired by the adaptive gain control mechanism seen in neurons [33].

Statistical modeling in the wavelet domain resembles the early visual system [32], and natural images exhibit statistical regularities in the wavelet space. Specifically, it is widely acknowledged that the marginal distribution of wavelet coefficients of a natural image (regardless of content) has a sharp peak near zero and heavier than Gaussian tails. Therefore, statistics of raw [46, 4, 6, 47] and normalized [48, 49] wavelet coefficients, and wavelet coefficient correlations in the neighborhood [29, 10, 50, 51, 52] can be individually or jointly modeled as image naturalness measurements. The phase information of wavelet coefficients, for example expressed as the local phase coherence, is exploited to describe the perception of blur [26] and sharpness [53].

In the DFT domain, blur kernels can be efficiently estimated [50, 54, 51] to quantify the degree of image blurring. The regular peaks at feature frequencies can be used to identify blocking artifacts [23, 55]. Moreover, it is generally hypothesized that most perceptual information in an image is stored in the Fourier phase rather than the Fourier amplitude [56, 57]. Phase congruency [58] is such a feature that identifies perceptually significant image features at spatial locations where Fourier components are maximally in phase [40].

In the DCT domain, blocking artifacts can be identified in a shifted block [24]. The ratio of AC coefficients to DC components can be interpreted as a measure of local contrast [59]. The kurtosis of AC coefficients can be used to quantify the structure statistics. In addition, AC coefficients can also be jointly modeled using a GGD [7].

There is a growing interest in learning features for BIQA. Ye et al. learned quality filters on image patches using K-means clustering and adopted filter responses as features [9]. They then took one step further by supervised filter learning [45]. Xue et al. [60] proposed a quality-aware clustering scheme on the high frequencies of raw patches, guided by an FR-IQA measure [61]. Kang et al. investigated a convolutional neural network to jointly learn features and nonlinear mappings for BIQA [62].

From the model learning perspective, SVR [63, 64] is the most commonly used tool to learn the quality prediction function for BIQA [6, 10, 52, 9, 45, 12]. The capabilities of neural networks to pre-train a model without labels and to easily scale up have also been exploited for this purpose [40, 62, 51, 47]. Another typical quality regression approach is the example-based method, which predicts the quality score of a test image using a weighted average of the training image quality scores, where the weights encode the perceptual similarity between the test and training images [52, 60, 14]. Saad et al. jointly modeled the extracted features and MOS using a multivariate Gaussian distribution and performed prediction by maximizing the probability of MOS conditioned on the features [59, 7]. Similar probabilistic modeling strategies have been investigated in [43, 65]. Pairwise L2R algorithms have also been used to learn BIQA models [66, 67]. However, in these methods, DIP generation relies solely on MOS availability, which limits the number of DIPs produced; moreover, their performance is inferior to that of existing BIQA methods. Other advanced learning algorithms include topic modeling [68], Gaussian processes [51], and multi-kernel learning [69, 67].

II-B Existing L2R Algorithms

Existing L2R algorithms can be broadly classified into three categories based on the training data format and loss function: pointwise, pairwise, and listwise approaches. An excellent survey of L2R algorithms can be found in [17]. Here we only provide a brief overview.

Pointwise approaches assume that the importance degree of each instance is known, and the loss function usually examines the prediction accuracy of each individual instance. In an early attempt at L2R, Fuhr [70] adopted linear regression with a polynomial feature expansion to learn the score function. Cossock and Zhang [71] utilized a similar formulation with some theoretical justifications for the use of the least squares loss function. Nallapati [72] formulated L2R as a classification problem and investigated the use of maximum entropy models and support vector machines (SVMs) to classify each instance into two classes, relevant or irrelevant. Ordinal regression-based pointwise L2R algorithms have also been proposed, such as PRanking [73] and SVM-based large margin principles [74].

Pairwise approaches assume that the relative order between two instances is known or can be inferred from other ground truth formats. The goal is to minimize the number of misclassified instance pairs; in the extreme case, if all instance pairs are correctly classified, they will be correctly ranked [17]. In RankSVM [75], Joachims creatively generated training pairs from clickthrough data and reformulated SVM to learn the score function from instance pairs. Proposed in 2005, RankNet [16] was probably the first L2R algorithm used by commercial search engines; its skeleton is a classical neural network with a weight-sharing scheme. Tsai et al. [76] replaced RankNet's loss function [16] with a fidelity loss originating from quantum physics. In this paper, RankNet is adopted as the default pairwise L2R algorithm to learn OU-BIQA models, for reasons that will be described later. RankBoost [77] is another well-known pairwise L2R algorithm, based on AdaBoost [78] with an exponential loss.

Listwise approaches provide the opportunity to directly optimize ranking performance criteria [17]; representative algorithms include SoftRank [79] and RankGP [81], among others [80]. Another subset of listwise approaches chooses to optimize listwise ranking losses. For example, as a direct extension of RankNet, ListNet [21] duplicates RankNet's structure to accommodate an instance list as input and optimizes a ranking loss based on the permutation probability distribution [21]. In this paper, we also employ ListNet to learn OU-BIQA models as an extension of the proposed pairwise L2R approach.


Fig. 1: Illustration of the perceptual uncertainty of quality discriminability of DIPs as a function of the score difference \Delta q. The left images of all DIPs have better quality in terms of the three FR-IQA models. However, the quality discriminability differs significantly. All images originate from the training set and are cropped for better visibility.

III Proposed Pairwise L2R Approach for OU-BIQA

In this section, we elaborate on the proposed pairwise L2R approach to learning OU-BIQA models. First, we propose an automatic DIP generation engine, where each DIP is associated with an uncertainty measure that quantifies the confidence level of its quality discriminability. Second, we detail RankNet [16] and extend its capability to learn from the generated DIPs with uncertainty.

III-A DIP Generation

Our automatic DIP generation engine is described as follows. We first choose three of the best-trusted FR-IQA models, namely MS-SSIM [82], VIF [83], and GMSD [84]. A logistic nonlinear function suggested in [85] is adopted to map the predictions of the three models to the MOS scale of the LIVE database [86]. After that, the score ranges of the three models roughly coincide, with higher values indicating better perceptual quality. We associate each candidate image pair with a nonnegative quantity \Delta q, equal to the smallest score difference among the three FR models. Intuitively, the perceptual uncertainty level of quality discriminability should decrease monotonically with the increase of \Delta q. By varying \Delta q, we can generate DIPs with different uncertainty levels. To quantify the level of uncertainty, we employ a raised-cosine function given by

u(\Delta q) =
\begin{cases}
\dfrac{1}{2}\left[1 + \cos\left(\dfrac{\pi \Delta q}{T}\right)\right], & 0 \le \Delta q < T \\
0, & \Delta q \ge T
\end{cases}
\qquad (1)

where u(\Delta q) lies in [0, 1], with a higher value indicating a greater degree of uncertainty, and T is a constant above which the uncertainty goes to zero. In the current implementation, we set T to a fixed constant, whose legitimacy can be validated from two sources. First, the average standard deviation of MOSs on LIVE is approximately half of T, therefore guaranteeing the perceived discriminability of two images whose score difference exceeds T. Second, based on the subjective experiments conducted by Gao et al. [67] on LIVE, the consistency between subjects on the relative quality of a pair increases with the absolute score difference and, once the difference exceeds T, the consistency approaches its maximum. Fig. 1 shows the shape of the uncertainty function as a function of \Delta q and some representative DIPs, where the left images have better quality in terms of the three chosen FR-IQA models. All the shown DIPs are generated from the training image set that will be described later. It is clear that setting \Delta q close to zero produces the highest level of uncertainty of quality discriminability. Careful inspection of Fig. 1(a) and Fig. 1(b) reveals that the uncertainty manifests itself in two ways. First, the right image in Fig. 1(a) has better perceived quality to many human observers compared with the left one, which disagrees with the three FR-IQA models. Second, both images in Fig. 1(b) have distortions that are barely perceptible to the human eye; in other words, they have very similar perceptual quality. The perceptual uncertainty generally decreases as \Delta q increases and, when \Delta q exceeds T, the DIP is clearly discriminable, further justifying the selection of T.
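
To make the DIP generation engine concrete, the following Python sketch pairs images and attaches the raised-cosine uncertainty of Eq. (1). It assumes the three FR-IQA scores have already been computed and mapped to a common scale; the threshold value, the requirement that all three models agree on the ordering, and all function names are illustrative assumptions rather than details taken from the paper.

```python
import itertools
import numpy as np

T = 20.0  # placeholder threshold; the paper's exact value is not given in this text

def uncertainty(delta_q, thresh=T):
    """Raised-cosine uncertainty of Eq. (1): 1 at delta_q = 0, 0 for delta_q >= thresh."""
    d = np.clip(float(delta_q), 0.0, thresh)
    return 0.5 * (1.0 + np.cos(np.pi * d / thresh))

def generate_dips(scores):
    """scores: (N, 3) array of MS-SSIM, VIF, and GMSD predictions, each mapped to a
    common (MOS-like) scale. Returns (better_idx, worse_idx, uncertainty) triplets,
    keeping a pair only when all three FR-IQA models agree on the ordering."""
    dips = []
    for i, j in itertools.combinations(range(len(scores)), 2):
        diff = scores[i] - scores[j]
        if np.all(diff > 0):                 # all models say image i is better
            dips.append((i, j, uncertainty(diff.min())))
        elif np.all(diff < 0):               # all models say image j is better
            dips.append((j, i, uncertainty((-diff).min())))
    return dips
```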

III-B RankNet [16]

Given a number of DIPs, a pairwise L2R algorithm would make use of their perceptual order to learn quality models while taking the inherent perceptual uncertainty into account. Here, we revisit RankNet [16], a pairwise L2R algorithm that was the first of its kind used by commercial search engines [17], and extend it to learn from DIPs associated with uncertainty. Fig. 2 shows RankNet's architecture, which is based on classical neural networks and has two parallel streams to accommodate a pair of inputs. The two-stream weights are shared, which is achieved by using the same initializations and the same gradients during backpropagation [16]. The quality prediction function f(\cdot), namely the dipIQ index, is implemented by one of the streams, and the loss function is defined on a pair of images with the help of both streams. Specifically, let s_i = f(x_i) and s_j = f(x_j) be the outputs of the first and second streams for the i-th and j-th input images, whose difference is converted to a probability using

p_{ij} = \frac{\exp(s_i - s_j)}{1 + \exp(s_i - s_j)} \qquad (2)

based on which we define the cross entropy loss as

l(x_i, x_j; \bar{p}_{ij}) = -\bar{p}_{ij} \log p_{ij} - (1 - \bar{p}_{ij}) \log (1 - p_{ij}) \qquad (3)

where \bar{p}_{ij} is the ground truth label associated with the training pair consisting of the i-th and j-th images. In the case of DIPs described in Section III-A, \bar{p}_{ij} is always 0 or 1, indicating that the quality of the i-th image is worse or better than that of the j-th one, respectively. Within the mini-batch stochastic gradient descent framework, we define the batch-level loss function using the perceptual uncertainty of each DIP as a weighting factor

L_B = \frac{1}{|B|} \sum_{(i,j) \in B} \left(1 - u_{ij}\right) l(x_i, x_j; \bar{p}_{ij}) \qquad (4)

where B is the mini-batch containing the DIP indices currently being trained on and u_{ij} is the perceptual uncertainty of the corresponding DIP. As Eq. (4) makes clear, DIPs with higher uncertainty contribute less to the overall loss. With some derivations, we obtain the gradient of L_B with respect to the model parameters, collectively denoted by w, as follows

\frac{\partial L_B}{\partial w} = \frac{1}{|B|} \sum_{(i,j) \in B} \left(1 - u_{ij}\right) \left(p_{ij} - \bar{p}_{ij}\right) \left(\frac{\partial s_i}{\partial w} - \frac{\partial s_j}{\partial w}\right) \qquad (5)

In the case of a linear dipIQ model f(x) = w^T x, containing no hidden layers and no nonlinear activations, Eq. (3) reduces to

l(x_i, x_j; \bar{p}_{ij}) = -\bar{p}_{ij}\, w^T (x_i - x_j) + \log\left[1 + \exp\left(w^T (x_i - x_j)\right)\right] \qquad (6)

which is easily recognized as logistic regression on the feature difference x_i - x_j. The convexity of Eq. (6) ensures the global optimality of the solution. We investigate both linear and nonlinear dipIQ models with the cross entropy as the loss. In fact, other measures of the discrepancy between probability distributions can be adopted as alternatives; for example, Tsai et al. [76] proposed a fidelity loss originating from quantum physics. We find in our experiments that the fidelity loss impairs performance, so we use the cross entropy loss throughout the paper.
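
The weight-shared two-stream structure and the uncertainty-weighted cross-entropy loss of Eqs. (2)-(4) can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the hidden-layer sizes are placeholders, and the (1 - u) weighting reflects our reading of Eq. (4).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DipIQStream(nn.Module):
    """One stream f(x); RankNet applies the same module (shared weights) to both images."""
    def __init__(self, in_dim, hidden=(256, 64, 3)):   # hidden sizes are illustrative only
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        layers.append(nn.Linear(prev, 1))               # scalar quality score
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

def pairwise_loss(f, x_i, x_j, label, u):
    """label = 1.0 if image i is of better quality than image j, else 0.0;
    u in [0, 1] is the perceptual uncertainty of each DIP."""
    logit = f(x_i) - f(x_j)                             # Eq. (2): p_ij = sigmoid(s_i - s_j)
    ce = F.binary_cross_entropy_with_logits(logit, label, reduction="none")  # Eq. (3)
    return ((1.0 - u) * ce).mean()                      # Eq. (4): higher uncertainty, lower weight
```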

We select RankNet [16] as our first choice of pairwise L2R algorithm for two reasons. First, it is capable of handling a large number (millions) of training samples using stochastic or mini-batch gradient descent algorithms. By contrast, the training of other pairwise L2R methods such as RankSVM [75], even with a linear kernel, is painfully slow. Second, since RankNet [16] embodies classical neural network architectures, we embrace the latest advances in training deep neural networks [87, 88] and can easily upscale the network by adding more hidden layers to learn powerful nonlinear quality prediction functions.


Fig. 2: The architecture of dipIQ based on RankNet [16].

IV Experiments

In this section, we first provide thorough implementation details of RankNet [16] to learn OU-BIQA models. We then describe the experimental protocol based on which a fair comparison is conducted between dipIQ and state-of-the-art BIQA models. After that, we discuss how to extend the proposed pairwise L2R approach for OU-BIQA to a listwise one that could possibly boost the performance.


Fig. 3: Sample source images in the training set. (a) Human. (b) Animal. (c) Plant. (d) Landscape. (e) Cityscape. (f) Still-life. (g) Transportation. All images are cropped for better visibility.

IV-A Implementation Details

IV-A1 Training Set Construction

We collect a large set of high-quality and high-resolution natural images to represent the scenes we see in the real world. They can be roughly clustered into seven groups: human, animal, plant, landscape, cityscape, still-life, and transportation. Sample source images are shown in Fig. 3. We preprocess each source image by down-sampling it with a bicubic kernel so that its height and width do not exceed a fixed maximum resolution. Following the procedures described in [19], we add four distortion types, namely JPEG and JPEG2000 (JP2K) compression, white Gaussian noise contamination (WN), and Gaussian blur (BLUR), each with five distortion levels. We randomly hold out a subset of the source images and their corresponding distorted versions as the validation set. For the remaining images, we adopt the proposed DIP generation engine to produce millions of DIPs, which constitute our training set.
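
As a rough illustration of how the distorted versions might be synthesized, the sketch below applies white Gaussian noise, Gaussian blur, and JPEG compression at several levels using NumPy and Pillow; JP2K compression is omitted, and all parameter values are illustrative rather than those used in the paper.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def add_white_noise(img, sigma):
    """White Gaussian noise contamination in the pixel domain."""
    arr = np.asarray(img, dtype=np.float64)
    arr = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def gaussian_blur(img, radius):
    return img.filter(ImageFilter.GaussianBlur(radius=radius))

def jpeg_compress(img, quality):
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Five levels per distortion type; the parameter values below are illustrative only.
DISTORTIONS = {
    "WN":   (add_white_noise, [2, 5, 10, 20, 40]),
    "BLUR": (gaussian_blur,   [1, 2, 4, 8, 16]),
    "JPEG": (jpeg_compress,   [80, 50, 25, 10, 5]),
}

def distort_all(img):
    return {(name, lvl): fn(img, p)
            for name, (fn, params) in DISTORTIONS.items()
            for lvl, p in enumerate(params, start=1)}
```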

IV-A2 Base Feature

We adopt CORNIA features [9] to represent test images because they appear to be highly competitive in a recent gMAD competition on the Waterloo Exploration Database [19]. In addition, a top performing OU-BIQA model, BLISS [89], also chooses CORNIA features as input and trains on synthetic scores. As such, we offer a fair testing bed to compare dipIQ learned by a pairwise L2R approach (RankNet [16]) against BLISS [89] learned by a regression method (SVR).

IV-A3 RankNet Instantiation

We investigate both linear and nonlinear dipIQ models, denoted by dipIQ_L and dipIQ, respectively. The input dimension of RankNet equals the dimension of the CORNIA feature vector [9]. The loss layer is implemented by the cross entropy function in Eq. (3). For dipIQ_L, the input layer is directly connected to the output layer without adding hidden layers or going through nonlinear transforms; the use of the cross entropy loss then ensures the convexity of the optimization problem. For the nonlinear dipIQ, we add three fully connected hidden layers, each followed by rectified linear units (ReLU) [90] as nonlinear activations. We choose the node number of the third hidden layer to be three so that we can visualize the three-dimensional embedding of test images. Other architectural choices are somewhat ad hoc, and a more careful exploration of alternative architectures could potentially lead to significant performance improvements.

The RankNet training procedure generally follows Simonyan and Zisserman [91]. Specifically, the training is carried out by optimizing the cross entropy function using mini-batch gradient descent with momentum, with the weights of the two streams in RankNet shared. The training is regularized by weight decay, and the learning rate is kept fixed. Since we have plenty of DIPs (millions) for training, each DIP is exposed to the learning algorithm once and only once, and the learning stops when the entire set of DIPs has been swept. The weights that achieve the lowest loss on the validation set are used for testing.
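
A minimal training-loop sketch under the above settings (a single sweep over the DIPs with validation-based model selection) might look as follows. The hyper-parameter values, the periodic evaluation schedule, and the data-loader interface are assumptions, not the paper's settings; loss_fn can be, for example, the pairwise_loss sketched earlier.

```python
import copy
import torch

def train_dipiq(model, loss_fn, train_loader, val_loader, eval_every=1000,
                lr=1e-4, momentum=0.9, weight_decay=5e-4):   # placeholder hyper-parameters
    """Single sweep over all DIPs with periodic validation-based model selection."""
    opt = torch.optim.SGD(model.parameters(), lr=lr,
                          momentum=momentum, weight_decay=weight_decay)
    best_val, best_state = float("inf"), copy.deepcopy(model.state_dict())
    for step, (x_i, x_j, label, u) in enumerate(train_loader):   # each DIP seen exactly once
        opt.zero_grad()
        loss = loss_fn(model, x_i, x_j, label, u)
        loss.backward()
        opt.step()
        if step % eval_every == 0:                                # periodic validation check
            model.eval()
            with torch.no_grad():
                val = sum(loss_fn(model, *batch).item() for batch in val_loader)
            model.train()
            if val < best_val:
                best_val, best_state = val, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)                             # lowest-validation-loss weights
    return model
```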

IV-B Experimental Protocol

IV-B1 Databases

Four IQA databases are used to compare dipIQ with state-of-the-art BIQA measures: LIVE [86], CSIQ [92], TID2013 [15], and the Waterloo Exploration Database [20]. The first three are small subject-rated IQA databases that are widely adopted to benchmark objective IQA models, where each test image is associated with an MOS representing its perceptual quality. In our experiments, we only consider the distortion types that are shared by all four databases, namely JP2K, JPEG, WN, and BLUR, and restrict LIVE [86], CSIQ [92], and TID2013 [15] to the corresponding subsets. The Exploration database contains 4,744 reference and 94,880 distorted images. Although the MOS of each test image is not available in the Exploration database, innovative evaluation criteria are employed to compare BIQA measures, as will be specified next.

IV-B2 Evaluation Criteria

We use five evaluation criteria to compare the performance of BIQA measures. The first two are included in previous tests carried out by the video quality experts group (VQEG) [93]; the others are introduced in [20] to take into account image databases without MOS. Details are given as follows, and a small computational sketch is provided after the list.

  • Spearman's rank-order correlation coefficient (SRCC) is defined as

    \mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)} \qquad (7)

    where N is the number of images in a database and d_i is the difference between the i-th image's ranks in the MOS and in the model prediction.

  • Pearson linear correlation coefficient (PLCC) is computed by

    \mathrm{PLCC} = \frac{\sum_{i=1}^{N} (q_i - \bar{q})(\hat{q}_i - \bar{\hat{q}})}{\sqrt{\sum_{i=1}^{N} (q_i - \bar{q})^2 \sum_{i=1}^{N} (\hat{q}_i - \bar{\hat{q}})^2}} \qquad (8)

    where q_i and \hat{q}_i stand for the MOS and the model prediction of the i-th image, respectively, and \bar{q} and \bar{\hat{q}} denote their respective means.

  • Pristine/distorted image discriminability test (D-test) considers pristine and distorted images as two distinct classes, and aims to measure how well an IQA model is able to separate the two classes. More specifically, indices of pristine and distorted images are grouped into sets S_p and S_d, respectively. Given a threshold T on the predicted quality \hat{q}_i, images are classified such that S'_p = \{i \mid \hat{q}_i > T\} and S'_d = \{i \mid \hat{q}_i \le T\}. The average correct classification rate is defined as

    R(T) = \frac{1}{2}\left(\frac{|S_p \cap S'_p|}{|S_p|} + \frac{|S_d \cap S'_d|}{|S_d|}\right) \qquad (9)

    The value of T should be optimized to yield the maximum correct classification rate, which results in a discriminability index

    D = \max_{T} R(T) \qquad (10)

    D lies in [0, 1], with a larger value indicating better separability between pristine and distorted images.

  • Listwise ranking consistency test (L-test) evaluates the robustness of IQA models when rating images with the same content and the same distortion type but different distortion levels. The assumption is that the quality of an image degrades monotonically with the increase of the distortion level for any distortion type. Given a database with S source images and D distortion types, each with multiple distortion levels, the average SRCC is used to quantify the ranking consistency between distortion levels and model predictions

    L = \frac{1}{S D} \sum_{i=1}^{S} \sum_{j=1}^{D} \mathrm{SRCC}(l_{ij}, q_{ij}) \qquad (11)

    where l_{ij} and q_{ij} represent the distortion levels and the corresponding distortion/quality scores given by a model to the set of images that originate from the same (i-th) source image and share the same (j-th) distortion type.

  • Pairwise preference consistency test (P-test) compares the performance of IQA models on a number of DIPs, whose generation is similar to what is described in Section III-A but with a stricter rule [20]. A good IQA model should give preferences concordant with the DIPs. Assuming that an image database contains M DIPs and that the number of concordant pairs for an IQA model (meaning that the model predicts the correct preference) is C, the pairwise preference consistency ratio is defined as

    P = \frac{C}{M} \qquad (12)

    P lies in [0, 1], with a higher value indicating better performance. We also report the number of incorrect preference predictions, given by M - C.
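
For reference, the criteria above can be computed along the following lines. This is a simplified sketch: quality scores are assumed to be "higher is better", the D-test threshold search is a plain grid over observed scores, and the absolute SRCC in the L-test is an assumption so that either score polarity works.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def srcc_plcc(mos, pred):
    """Eqs. (7) and (8) via SciPy."""
    return spearmanr(mos, pred)[0], pearsonr(mos, pred)[0]

def d_test(q_pristine, q_distorted):
    """Eq. (10): best average correct-classification rate over candidate thresholds."""
    candidates = np.unique(np.concatenate([q_pristine, q_distorted]))
    rates = [0.5 * (np.mean(q_pristine > t) + np.mean(q_distorted <= t))
             for t in candidates]
    return max(rates)

def l_test(groups):
    """Eq. (11): groups is a list of (levels, scores) pairs, one per
    (source image, distortion type) combination."""
    return float(np.mean([abs(spearmanr(lv, sc)[0]) for lv, sc in groups]))

def p_test(dips, pred):
    """Eq. (12): dips is a list of (better_idx, worse_idx) pairs."""
    concordant = sum(pred[i] > pred[j] for i, j in dips)
    return concordant / len(dips)
```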

SRCC JP2K JPEG WN BLUR ALL4
PSNR 0.908 0.894 0.984 0.814 0.883
SSIM [94] 0.961 0.974 0.970 0.952 0.947
QAC [60] 0.876 0.951 0.925 0.911 0.869
NIQE [43] 0.924 0.945 0.972 0.941 0.920
ILNIQE [65] 0.901 0.944 0.979 0.927 0.918
BLISS [89] 0.925 0.956 0.967 0.936 0.945
dipIQ_L 0.946 0.956 0.976 0.962 0.952
dipIQ 0.956 0.969 0.975 0.940 0.958
PLCC JP2K JPEG WN BLUR ALL4
PSNR 0.912 0.896 0.987 0.812 0.874
SSIM [94] 0.968 0.980 0.972 0.951 0.937
QAC [60] 0.876 0.960 0.895 0.912 0.855
NIQE [43] 0.932 0.956 0.979 0.951 0.912
ILNIQE [65] 0.912 0.966 0.976 0.936 0.913
BLISS [89] 0.933 0.972 0.978 0.948 0.945
dipIQ_L 0.958 0.953 0.951 0.950 0.948
dipIQ 0.964 0.980 0.983 0.948 0.957
TABLE I: Median SRCC and PLCC results across sessions on LIVE [86]
SRCC JP2K JPEG WN BLUR ALL4
PSNR 0.941 0.901 0.943 0.936 0.928
SSIM [94] 0.962 0.956 0.912 0.965 0.935
QAC [60] 0.884 0.913 0.850 0.839 0.840
NIQE [43] 0.926 0.882 0.836 0.908 0.883
ILNIQE [65] 0.924 0.905 0.867 0.867 0.887
BLISS [89] 0.932 0.927 0.879 0.922 0.920
dipIQ_L 0.938 0.926 0.887 0.925 0.924
dipIQ 0.944 0.936 0.904 0.932 0.930
PLCC JP2K JPEG WN BLUR ALL4
PSNR 0.954 0.908 0.961 0.937 0.918
SSIM [94] 0.973 0.983 0.908 0.956 0.930
QAC [60] 0.898 0.942 0.865 0.855 0.847
NIQE [43] 0.944 0.946 0.824 0.935 0.900
ILNIQE [65] 0.942 0.956 0.880 0.903 0.914
BLISS [89] 0.954 0.970 0.895 0.947 0.939
dipIQ_L 0.955 0.971 0.903 0.951 0.946
dipIQ 0.959 0.975 0.927 0.958 0.949
TABLE II: Median SRCC and PLCC results across sessions on CSIQ [92]
SRCC JP2K JPEG WN BLUR ALL4
PSNR 0.898 0.929 0.942 0.965 0.924
SSIM [94] 0.950 0.935 0.896 0.969 0.924
QAC [60] 0.883 0.885 0.668 0.879 0.837
NIQE [43] 0.901 0.873 0.854 0.821 0.812
ILNIQE [65] 0.912 0.873 0.890 0.815 0.881
BLISS [89] 0.906 0.893 0.856 0.872 0.836
dipIQ_L 0.909 0.903 0.854 0.884 0.857
dipIQ 0.926 0.932 0.905 0.922 0.877
PLCC JP2K JPEG WN BLUR ALL4
PSNR 0.933 0.925 0.963 0.958 0.911
SSIM [94] 0.970 0.968 0.902 0.958 0.927
QAC [60] 0.892 0.929 0.719 0.877 0.829
NIQE [43] 0.912 0.928 0.859 0.848 0.819
ILNIQE [65] 0.929 0.944 0.899 0.816 0.890
BLISS [89] 0.930 0.963 0.863 0.872 0.862
dipIQ_L 0.937 0.963 0.851 0.892 0.894
dipIQ 0.948 0.973 0.906 0.928 0.894
TABLE III: Median SRCC and PLCC results across sessions on TID2013 [15]
Fig. 4: The noisiness of the synthetic score [89]. (a) Synthetic score = 10. (b) Synthetic score = 10. (c) Synthetic score = 40. (a) has worse perceptual quality than (b), which in turn has approximately the same quality as (c). Both cases are in disagreement with the synthetic scores [89]. Images are selected from the training set.
Fig. 5: Three dimensional embedding of the LIVE database [86]. (a) Color encodes distortion type. (b) Color encodes quality; the warmer, the better. The learned features from the third hidden layer of dipIQ are able to cluster images based on distortion types and align them in a perceptually meaningful way.

SRCC and PLCC are applied to LIVE [86], CSIQ [92], and TID2013 [15], while the D-test, L-test, and P-test are applied to the Waterloo Exploration Database. Note that the use of PLCC requires a nonlinear function to map raw model predictions to the MOS scale. Following Mittal et al. [8] and Ye et al. [89], in our experiments we randomly choose a subset of reference images along with their corresponding distorted versions to estimate the parameters of the nonlinear mapping, and use the remaining images for testing. This procedure is repeated many times, and the median SRCC and PLCC values are reported.
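
The nonlinear mapping step can be sketched with SciPy as below. The four-parameter monotonic logistic is one commonly used form in IQA evaluation and is not necessarily the exact function of [85]; the helper names are our own.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic4(x, b1, b2, b3, b4):
    """A commonly used monotonic logistic for mapping raw predictions to the MOS scale."""
    return (b1 - b2) / (1.0 + np.exp(-(x - b3) / b4)) + b2

def fit_quality_mapping(pred_fit, mos_fit):
    """Fit the mapping on a held-out subset, then apply it to test predictions."""
    p0 = [np.max(mos_fit), np.min(mos_fit), np.mean(pred_fit), np.std(pred_fit) + 1e-6]
    params, _ = curve_fit(logistic4, pred_fit, mos_fit, p0=p0, maxfev=10000)
    return lambda x: logistic4(np.asarray(x, dtype=float), *params)
```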

IV-C Experimental Results

IV-C1 Comparison with FR and OU-BIQA Models

We compare dipIQ with two well-known FR-IQA models, PSNR (whose largest value is clipped in order to perform a reasonable parameter estimation) and SSIM [94] (whose implementation used in the paper involves a down-sampling process [95]), and with previous OU-BIQA models, including QAC [60], NIQE [43], ILNIQE [65], and BLISS [89]. The implementations of QAC [60], NIQE [43], and ILNIQE [65] are obtained from the original authors. To the best of our knowledge, the complete implementation of BLISS [89] is not publicly available. Therefore, to make a fair comparison, we train BLISS [89] on the same reference images and their distorted versions that have been used to train dipIQ. The labels are synthesized using the method in [89], and the training toolbox and parameter settings are inherited from the original paper [89].

Model D-test L-test P-test # incorrect pairs
PSNR 1.0000 1.0000 0.9995 620,071
SSIM [94] 1.0000 0.9992 0.9991 1,131,457
QAC [60] 0.9226 0.8699 0.9779 28,447,590
NIQE [43] 0.9109 0.9885 0.9937 8,127,941
ILNIQE [65] 0.9084 0.9926 0.9927 9,435,319
BLISS [89] 0.9080 0.9801 0.9996 562,925
dipIQ_L 0.9209 0.9863 0.9996 465,069
dipIQ 0.9346 0.9846 0.9999 129,668
TABLE IV: The D-test, L-test and P-test results on the Waterloo Exploration Database [20].
PLCC PSNR SSIM QAC NIQE ILNIQE BLISS dipIQ_L dipIQ
PSNR - 0 1 0 0 0 0 0
SSIM [94] 1 - 1 1 1 0 0 0
QAC [60] 0 0 - 0 0 0 0 0
NIQE [43] 1 0 1 - - 0 0 0
ILNIQE [65] 1 0 1 - - 0 0 0
BLISS [89] 1 1 1 1 1 - 0 0
dipIQ_L 1 1 1 1 1 1 - 0
dipIQ 1 1 1 1 1 1 1 -
TABLE V: Statistical significance matrix based on the hypothesis testing. A symbol “1” means that the performance of the row algorithm is statistically better than that of the column algorithm, a symbol “0” means that the row algorithm is statistically worse, and a symbol “-” means that the row and column algorithms are statistically indistinguishable

Tables I, II, and III list comparison results between dipIQ and existing OU-BIQA models in terms of median SRCC and PLCC values on LIVE [86], CSIQ [92], and TID2013 [15], respectively. Both dipIQ_L and dipIQ outperform all previous OU-BIQA models on LIVE [86] and CSIQ [92], and are comparable to ILNIQE [65] on TID2013 [15]. Although both dipIQ_L and BLISS [89] learn a linear prediction function using CORNIA features as inputs [9], we observe consistent performance gains of dipIQ_L over BLISS [89] across all three databases. This may be because dipIQ learns from more reliable data (DIPs) with uncertainty weighting, whereas the training labels (synthetic scores) for BLISS are noisier, as exemplified in Fig. 4. It is not hard to observe that Fig. 4(a) has clearly worse perceptual quality than Fig. 4(b), which in turn has approximately the same quality as Fig. 4(c); both cases are in disagreement with the synthetic scores [89].

To ascertain that the improvement of dipIQ is statistically significant, we carry out a two-sample t-test (at the 95% confidence level) between PLCC values obtained by different models on LIVE [86]. After comparing every possible pair of OU-BIQA models, the results are summarized in Table V, where a symbol “1” means the row model performs significantly better than the column model, a symbol “0” means the opposite, and a symbol “-” indicates that the row and column models are statistically indistinguishable. It can be observed that dipIQ is statistically better than dipIQ_L, which in turn is better than all previous OU-BIQA models.

Table IV shows the results on the Waterloo Exploration Database. dipIQ_L and dipIQ outperform all previous OU-BIQA models in the D-test and P-test, and are competitive in the L-test, where their performance is slightly inferior to that of NIQE [43] and ILNIQE [65]. By learning from examples with a variety of image content, dipIQ brings the number of incorrect preference predictions in the P-test down to around 130 thousand out of more than one billion candidate DIPs.

In order to gain intuition on why the generalizability of dipIQ is excellent even without MOSs for training, we visualize the three-dimensional embedding of the LIVE database [86] in Fig. 5, using the learned three-dimensional features from the third hidden layer of dipIQ. We can see that the learned representation is able to cluster test images according to distortion type and, meanwhile, to align them with respect to their perceptual quality in a meaningful way, where high-quality images are clustered together regardless of image content.

SRCC JP2K JPEG WN BLUR ALL4
BRISQUE [8] 0.894 0.916 0.934 0.915 0.909
dipIQ^B 0.938 0.938 0.934 0.943 0.926
DIIVINE [10] 0.844 0.819 0.881 0.884 0.835
dipIQ^D 0.930 0.939 0.904 0.920 0.912
CORNIA [9] 0.916 0.919 0.787 0.928 0.915
dipIQ 0.944 0.936 0.904 0.932 0.930
PLCC JP2K JPEG WN BLUR ALL4
BRISQUE [8] 0.937 0.960 0.947 0.936 0.937
dipIQ^B 0.956 0.974 0.945 0.959 0.943
DIIVINE [10] 0.898 0.818 0.903 0.909 0.855
dipIQ^D 0.949 0.973 0.924 0.944 0.942
CORNIA [9] 0.947 0.960 0.777 0.953 0.934
dipIQ 0.959 0.975 0.927 0.958 0.949
TABLE VI: Median SRCC and PLCC results across sessions, training on LIVE [86] and testing on CSIQ [92]. The superscripts B and D indicate that the input features of dipIQ are from BRISQUE [8] and DIIVINE [10], respectively
SRCC JP2K JPEG WN BLUR ALL4
BRISQUE [8] 0.906 0.894 0.889 0.886 0.883
dipIQ^B 0.927 0.921 0.921 0.917 0.883
DIIVINE [10] 0.857 0.680 0.879 0.859 0.795
dipIQ^D 0.912 0.889 0.887 0.905 0.872
CORNIA [9] 0.907 0.912 0.798 0.934 0.893
dipIQ 0.926 0.932 0.905 0.922 0.877
PLCC JP2K JPEG WN BLUR ALL4
BRISQUE [8] 0.919 0.950 0.886 0.884 0.901
dipIQ^B 0.942 0.957 0.923 0.906 0.883
DIIVINE [10] 0.901 0.696 0.882 0.860 0.794
dipIQ^D 0.945 0.947 0.881 0.896 0.892
CORNIA [9] 0.923 0.960 0.778 0.934 0.904
dipIQ 0.948 0.973 0.906 0.928 0.894
TABLE VII: Median SRCC and PLCC results across sessions, training on LIVE [86] and testing on TID2013 [15]
Model D-test L-test P-test # incorrect pairs
BRISQUE [8] 0.9204 0.9772 0.9930 9,004,685
dipIQ^B 0.9265 0.9753 0.9996 503,911
DIIVINE [10] 0.8538 0.8908 0.9540 59,053,011
dipIQ^D 0.9191 0.9588 0.9983 2,124,199
CORNIA [9] 0.9290 0.9764 0.9947 6,808,400
dipIQ 0.9346 0.9846 0.9999 129,668
TABLE VIII: The D-test, L-test and P-test results on the Exploration database [20], training on LIVE [86]
Fig. 6: gMAD competition between dipIQ and BRISQUE [8]. (a) best BRISQUE for fixed dipIQ. (b) worst BRISQUE for fixed dipIQ. (c) best dipIQ for fixed BRISQUE. (d) worst dipIQ for fixed BRISQUE.

IV-C2 Comparison with OA-BIQA Models

In the second set of experiments, we train dipIQ using different feature representations as inputs and compare it with OA-BIQA models that use the same representations and MOSs for training. BRISQUE [8] and DIIVINE [10] are selected as representative features extracted from the spatial and wavelet domains, respectively. We also compare dipIQ with CORNIA [9], whose features are adopted as the default input to dipIQ. We re-train BRISQUE [8], DIIVINE [10], and CORNIA [9] on the LIVE database, with learning tools and parameter settings following their respective papers. We adjust the dimension of the input layer of dipIQ to accommodate features of different dimensions and train the resulting models, denoted by dipIQ^B and dipIQ^D, on the reference images and their distorted versions, as described in Section IV-A. All models are tested on CSIQ [92], TID2013 [15], and the Exploration database [20]. From Tables VI, VII, and VIII, we observe that dipIQ consistently performs better than the corresponding OA-BIQA model on CSIQ [92] and the Exploration database, and is comparable on TID2013 [15]. The reason we do not obtain noticeable performance gains on TID2013 [15] may be that TID2013 [15] has reference images originating from LIVE [86], on which the OA-BIQA models have been trained; this creates dependencies between the training and testing sets. We may also draw conclusions about the effectiveness of the feature representations based on their performance under the same pairwise L2R framework: generally speaking, CORNIA [9] features > BRISQUE [8] features > DIIVINE [10] features.

We further compare dipIQ and BRISQUE [8] using the gMAD competition methodology on the Waterloo Exploration Database. Specifically, we first find a pair of images that have the maximum and minimum dipIQ values from a subset of images in the Exploration database that BRISQUE [8] rates as having the same quality. We then repeat this procedure, but with the roles of dipIQ and BRISQUE [8] exchanged. The two image pairs are shown in Fig. 6, from which we conclude that the images in the first row exhibit approximately the same perceptual quality (in agreement with dipIQ) while those in the second row have drastically different perceptual quality (in disagreement with BRISQUE [8]). This verifies that the robustness of dipIQ is significantly improved over BRISQUE [8], which uses the same feature representation but MOSs for training. Similar gMAD competition results are obtained across all quality levels, and for dipIQ versus DIIVINE [10] and dipIQ versus CORNIA [9].
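
The pair-selection principle behind gMAD can be sketched as follows: fix one model (the defender) to a narrow quality level and pick the images that the other model (the attacker) rates best and worst within that level. This is a simplified illustration of the idea, not the full gMAD procedure of [19]; the quantile-based level construction is our own assumption.

```python
import numpy as np

def gmad_pairs(defender, attacker, n_levels=10):
    """defender, attacker: 1-D arrays of quality scores over the same image set.
    Returns, for each defender quality level, the (attacker-best, attacker-worst) pair."""
    edges = np.quantile(defender, np.linspace(0.0, 1.0, n_levels + 1))
    pairs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((defender >= lo) & (defender <= hi))[0]
        if len(idx) < 2:
            continue
        best = idx[np.argmax(attacker[idx])]    # attacker says this image is the best
        worst = idx[np.argmin(attacker[idx])]   # attacker says this image is the worst
        pairs.append((best, worst))
    return pairs
```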

In summary, the proposed pairwise L2R approach is shown to learn OU-BIQA models with improved generalizability and robustness compared with OA-BIQA models that use the same feature representations and MOSs for training.

V Listwise L2R Approach for OU-BIQA

In this section, we extend the proposed pairwise L2R approach for OU-BIQA to a listwise one. Specifically, we first construct three-element DILs by concatenating DIPs. For example, given two DIPs (x_i, x_j) and (x_j, x_k) with the same level of uncertainty, we create a list (x_i, x_j, x_k), indicating that the quality of the i-th image is better than that of the j-th image, whose quality is in turn better than that of the k-th image. The uncertainty level is transferred to the list as well. We then employ ListNet [21], a listwise L2R extension of RankNet [16], to learn OU-BIQA models. The major differences between ListNet and RankNet are twofold. First, ListNet can have multiple streams with the same weights to accommodate a list of inputs, where each stream is implemented by a classical neural network architecture similar to RankNet, as shown in Fig. 2. In this paper, we instantiate a three-stream ListNet to fit three-element DILs. Second, the loss function of ListNet is defined using the concept of permutation probability. More specifically, we define a permutation \pi on a list of n instances as a bijection from \{1, 2, \ldots, n\} to itself, where \pi(k) denotes the instance at position k in the permutation. The set of all possible permutations of n instances is termed \Omega_n. We define the probability of a permutation \pi given the list of predicted scores s = (s_1, \ldots, s_n) as

P(\pi \mid s) = \prod_{k=1}^{n} \frac{\exp\left(s_{\pi(k)}\right)}{\sum_{l=k}^{n} \exp\left(s_{\pi(l)}\right)} \qquad (13)

which satisfies P(\pi \mid s) \ge 0 and \sum_{\pi \in \Omega_n} P(\pi \mid s) = 1, as proved in [21]. The loss function can then be defined as the cross entropy between the ground truth and predicted permutation probability distributions

L = -\sum_{\pi \in \Omega_n} \bar{P}(\pi) \log P(\pi \mid s) \qquad (14)

When n = 2, the loss function of ListNet [21] in Eq. (14) becomes equivalent to that of RankNet [16] in Eq. (3). In the case of three-element DILs, the ground truth distribution \bar{P}(\pi) equals 1 for the permutation that matches the ground truth quality order and 0 otherwise. Therefore, the loss function in Eq. (14) can be simplified as

L(x_i, x_j, x_k) = -\log\left[\frac{\exp(s_i)}{\exp(s_i) + \exp(s_j) + \exp(s_k)} \cdot \frac{\exp(s_j)}{\exp(s_j) + \exp(s_k)}\right] \qquad (15)

based on which we define the batch-level loss as

L_B = \frac{1}{|B|} \sum_{(i,j,k) \in B} \left(1 - u_{ijk}\right) L(x_i, x_j, x_k) \qquad (16)

where u_{ijk} is the uncertainty level of the list, transferred from the corresponding DIPs. The gradient of Eq. (16) with respect to the model parameters can be easily derived. Note that ListNet [21] does not add new parameters relative to RankNet.
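
For a three-element DIL with ground truth quality order i, then j, then k, the loss of Eqs. (15)-(16) can be written compactly in PyTorch as below; as in the pairwise sketch, the (1 - u) weighting is our reading of the uncertainty transfer rather than a confirmed implementation detail.

```python
import torch

def dil_loss(f, x_i, x_j, x_k, u):
    """Listwise loss for a 3-element DIL whose ground truth quality order is i > j > k."""
    s = torch.stack([f(x_i), f(x_j), f(x_k)], dim=-1)        # (batch, 3) predicted scores
    log_p = (s[..., 0] - torch.logsumexp(s, dim=-1)          # i ranked first among {i, j, k}
             + s[..., 1] - torch.logsumexp(s[..., 1:], dim=-1))  # j ranked before k
    return ((1.0 - u) * (-log_p)).mean()                     # Eq. (16): uncertainty weighting
```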

We generate millions of DILs from the available DIPs as the training data for ListNet [21]. The training procedure is exactly the same as that of RankNet [16], and the training stops when the entire set of image lists has been swept once. The weights that achieve the lowest validation set loss are used for testing.

We list the comparison results between dilIQ, trained by ListNet [21], and the baseline dipIQ on LIVE [86], CSIQ [92], TID2013 [15], and the Exploration database in Tables IX, X, XI, and XII, respectively. Noticeable performance improvements are achieved on the overall results of CSIQ and TID2013, possibly because the ranking position information is made explicit to the learning process. dilIQ is comparable to dipIQ on LIVE and the Exploration database.

SRCC JP2K JPEG WN BLUR ALL
dipIQ 0.956 0.969 0.975 0.940 0.958
dilIQ 0.956 0.966 0.976 0.953 0.958
PLCC JP2K JPEG WN BLUR ALL
dipIQ 0.964 0.980 0.983 0.948 0.957
dilIQ 0.964 0.978 0.985 0.956 0.954
TABLE IX: Median SRCC and PLCC results across sessions on LIVE [86], using ListNet [21] for training
SRCC JP2K JPEG WN BLUR ALL
dipIQ 0.944 0.936 0.904 0.932 0.930
dilIQ 0.930 0.925 0.893 0.939 0.936
PLCC JP2K JPEG WN BLUR ALL
dipIQ 0.959 0.975 0.927 0.958 0.949
dilIQ 0.954 0.968 0.920 0.960 0.954
TABLE X: Median SRCC and PLCC results across sessions on CSIQ [92], using ListNet [21] for training
SRCC JP2K JPEG WN BLUR ALL
dipIQ 0.926 0.932 0.905 0.922 0.877
dilIQ 0.918 0.849 0.905 0.925 0.891
PLCC JP2K JPEG WN BLUR ALL
dipIQ 0.948 0.973 0.906 0.928 0.894
dilIQ 0.948 0.923 0.903 0.929 0.915
TABLE XI: Median SRCC and PLCC results across sessions on TID2013 [15], using ListNet [21] for training
Model D-test L-test P-test # incorrect pairs
dipIQ 0.9346 0.9846 0.9999 129,668
dilIQ 0.9346 0.9893 0.9998 198,650
TABLE XII: The D-test, L-test and P-test results on the Exploration database [20], using ListNet [21] for training

VI Conclusion and Future Work

In this paper, we have proposed an OU-BIQA model, namely dipIQ, using RankNet [16]. The inputs to dipIQ training are an enormous number of DIPs, which are not obtained by expensive subjective testing but are automatically generated at low cost with the help of the most trusted FR-IQA models. Extensive experimental results demonstrate the effectiveness of the proposed dipIQ indices, with higher accuracy and improved robustness to content variations. We also learn an OU-BIQA model, namely dilIQ, using a listwise L2R approach, which achieves an additional performance gain.

The current work opens the door to a new class of OU-BIQA models and can be extended in many ways. First, novel image pair and list generation engines may be developed to account for situations where reference images are not available (or never existed). Second, advanced L2R algorithms are worth exploring to improve quality prediction performance. Third, in practice, a pair of images may be regarded as having indiscriminable quality. Such knowledge could be obtained either from subjective testing (e.g., paired comparison between images) or from the image source (e.g., two pristine images acquired from the same source), and is informative in constraining the behavior of an objective quality model. The current learning framework needs to be improved in order to learn from such quality-indiscriminable image pairs. Fourth, given the powerful DIP generation engine developed in the current work and the remarkable success of recent deep convolutional neural networks, it may become feasible to develop end-to-end BIQA models that bypass the feature extraction process and achieve even stronger robustness and generalizability.

Acknowledgment

The authors would like to thank Zhengfang Duanmu for suggestions on the efficient implementation of RankNet, and the anonymous reviewers for constructive comments. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada, and the Australian Research Council Projects FT-130101457, DP-140102164, and LP-150100671. K. Ma was partially supported by the CSC.

References

  • [1] H. R. Wu and K. R. Rao, Digital Video Image Quality and Perceptual Coding.   CRC press, 2005.
  • [2] Z. Wang and A. C. Bovik, Modern Image Quality Assessment.   Morgan & Claypool Publishers, 2006.
  • [3] S. J. Daly, “Visible differences predictor: An algorithm for the assessment of image fidelity,” in SPIE/IS&T Symposium on Electronic Imaging: Science and Technology, 1992, pp. 2–15.
  • [4] Z. Wang, G. Wu, H. R. Sheikh, E. P. Simoncelli, E.-H. Yang, and A. C. Bovik, “Quality-aware images,” IEEE Transactions on Image Processing, vol. 15, no. 6, pp. 1680–1689, Jun. 2006.
  • [5] Z. Wang and A. C. Bovik, “Reduced- and no-reference image quality assessment: The natural scene statistic model approach,” IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 29–40, Nov. 2011.
  • [6] A. K. Moorthy and A. C. Bovik, “A two-step framework for constructing blind image quality indices,” IEEE Signal Processing Letters, vol. 17, no. 5, pp. 513–516, May 2010.
  • [7] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, Aug. 2012.
  • [8] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
  • [9] P. Ye, J. Kumar, L. Kang, and D. Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1098–1105.
  • [10] A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, Dec. 2011.
  • [11] Q. Wu, Z. Wang, and H. Li, “A highly efficient method for blind image quality assessment,” in IEEE International Conference on Image Processing, 2015, pp. 339–343.
  • [12] W. Xue, X. Mou, L. Zhang, A. C. Bovik, and X. Feng, “Blind image quality assessment using joint statistics of gradient magnitude and Laplacian features,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4850–4862, Nov. 2014.
  • [13] K. Gu, G. Zhai, X. Yang, and W. Zhang, “Using free energy principle for blind image quality assessment,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50–63, Jan. 2015.
  • [14] Q. Wu, H. Li, F. Meng, K. N. Ngan, B. Luo, C. Huang, and B. Zeng, “Blind image quality assessment based on multi-channel features fusion and label transfer,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 3, pp. 425–440, Mar. 2016.
  • [15] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J. Kuo, “Image database TID2013: Peculiarities, results and perspectives,” Signal Processing: Image Communication, vol. 30, pp. 57–77, Jan. 2015.
  • [16] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to rank using gradient descent,” in International Conference on Machine Learning, 2005, pp. 89–96.
  • [17] T.-Y. Liu, “Learning to rank for information retrieval,” Foundations and Trends in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009.
  • [18] L. Hang, “A short introduction to learning to rank,” IEICE Transactions on Information and Systems, vol. 94, no. 10, pp. 1854–1862, Oct. 2011.
  • [19] K. Ma, Q. Wu, Z. Wang, Z. Duanmu, H. Yong, H. Li, and L. Zhang, “Group MAD competition a new methodology to compare objective image quality models,” in IEEE Conference on Computer Vsion and Pattern Recognition, 2016, pp. 1664–1673.
  • [20] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo Exploration Database: New challenges for image quality assessment models,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 1004–1016, Feb. 2017.
  • [21] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, “Learning to rank: From pairwise approach to listwise approach,” in International Conference on Machine Learning, 2007, pp. 129–136.
  • [22] H. R. Wu and M. Yuen, “A generalized block-edge impairment metric for video coding,” IEEE Signal Processing Letters, vol. 4, no. 11, pp. 317–320, Nov. 1997.
  • [23] Z. Wang, A. C. Bovik, and B. L. Evan, “Blind measurement of blocking artifacts in images,” in IEEE International Conference on Image Processing, 2000, pp. 981–984.
  • [24] S. Liu and A. C. Bovik, “Efficient DCT-domain blind measurement and reduction of blocking artifacts,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 12, pp. 1139–1149, Dec. 2002.
  • [25] H. Tong, M. Li, H. Zhang, and C. Zhang, “Blur detection for digital images using wavelet transform,” in IEEE International Conference on Multimedia and Expo, 2004, pp. 17–20.
  • [26] Z. Wang and E. P. Simoncelli, “Local phase coherence and the perception of blur,” in Advances in Neural Information Processing Systems, 2003.
  • [27] X. Zhu and P. Milanfar, “A no-reference sharpness metric sensitive to blur and noise,” in International Workshop on Quality of Multimedia Experience, 2009, pp. 64–69.
  • [28] S. Oğuz, Y. Hu, and T. Q. Nguyen, “Image coding ringing artifact reduction using morphological post-filtering,” in IEEE Workshop on Multimedia Signal Processing, 1998, pp. 628–633.

Kede Ma (S’13) received the B.E. degree from the University of Science and Technology of China, Hefei, China, in 2012, and the M.A.Sc. degree from the University of Waterloo, Waterloo, ON, Canada, where he is currently working toward the Ph.D. degree in electrical and computer engineering. His research interests lie in perceptual image processing and computational photography.

Wentao Liu (S’15) received the B.E. and M.E. degrees from Tsinghua University, Beijing, China, in 2011 and 2014, respectively. He is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. His research interests include perceptual quality assessment of images and videos.

Tongliang Liu is currently a Lecturer with the School of Information Technologies in the Faculty of Engineering and Information Technologies, and a core member of the UBTech Sydney AI Institute, at The University of Sydney. He received the B.Eng. degree in electronic engineering and information science from the University of Science and Technology of China, and the Ph.D. degree from the University of Technology Sydney. His research interests include statistical learning theory, computer vision, and optimization. He has authored and co-authored more than 20 research papers in venues including IEEE T-PAMI, T-NNLS, T-IP, ICML, and KDD.

Zhou Wang (S’99-M’02-SM’12-F’14) received the Ph.D. degree from The University of Texas at Austin in 2001. He is currently a Professor in the Department of Electrical and Computer Engineering, University of Waterloo, Canada. His research interests include image processing, coding, and quality assessment; computational vision and pattern analysis; multimedia communications; and biomedical signal processing. He has more than 100 publications in these fields with over 30,000 citations (Google Scholar).

Dr. Wang serves as a Senior Area Editor of IEEE Transactions on Image Processing (2015-present) and an Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology (2016-present). Previously, he served as a member of the IEEE Multimedia Signal Processing Technical Committee (2013-2015), an Associate Editor of IEEE Transactions on Image Processing (2009-2014), Pattern Recognition (2006-present), and IEEE Signal Processing Letters (2006-2010), and a Guest Editor of IEEE Journal of Selected Topics in Signal Processing (2013-2014 and 2007-2009). He is a Fellow of the Canadian Academy of Engineering, and a recipient of the 2016 IEEE Signal Processing Society Sustained Impact Paper Award, the 2015 Primetime Engineering Emmy Award, the 2014 NSERC E.W.R. Steacie Memorial Fellowship Award, the 2013 IEEE Signal Processing Magazine Best Paper Award, the 2009 IEEE Signal Processing Society Best Paper Award, and the 2009 Ontario Early Researcher Award.

Dacheng Tao (F’15) is Professor of Computer Science and ARC Future Fellow in the School of Information Technologies and the Faculty of Engineering and Information Technologies, and the Inaugural Director of the UBTech Sydney Artificial Intelligence Institute, at The University of Sydney. He mainly applies statistics and mathematics to artificial intelligence and data science. His research interests spread across computer vision, data science, image processing, machine learning, and video surveillance. His research results are documented in one monograph and 500+ publications in prestigious journals and at prominent conferences, such as IEEE T-PAMI, T-NNLS, T-IP, JMLR, IJCV, NIPS, CIKM, ICML, CVPR, ICCV, ECCV, AISTATS, ICDM, and ACM SIGKDD, with several best paper awards, including the Best Theory/Algorithm Paper Runner-Up Award at IEEE ICDM’07, the Best Student Paper Award at IEEE ICDM’13, the 2014 ICDM 10-Year Highest-Impact Paper Award, and the 2017 IEEE Signal Processing Society Best Paper Award. He received the 2015 Australian Scopus-Eureka Prize, the 2015 ACS Gold Disruptor Award, and the 2015 UTS Vice-Chancellor’s Medal for Exceptional Research. He is a Fellow of the IEEE, OSA, IAPR, and SPIE.
