Empirical evaluation of full-reference image quality metrics on MDID database
In this study, our goal is to give a comprehensive evaluation of 32 state-of-the-art FR-IQA metrics using the recently published MDID. This database contains distorted images derived from a set of reference, pristine images using random types and levels of distortions. Specifically, Gaussian noise, Gaussian blur, contrast change, JPEG noise, and JPEG2000 noise were considered.
Full-reference image quality assessment
The goal of objective image quality assessment is to design mathematical models that are able to predict the perceptual quality of digital images. Objective image quality assessment algorithms are classified by the accessibility of the reference image. If the reference image is unavailable, the method is considered a no-reference (NR) one. Reduced-reference (RR) methods have only partial information about the reference image, while full-reference (FR) algorithms have full access to it.
Research on objective image quality assessment demands databases that contain images with corresponding mean opinion score (MOS) values. To this end, a number of image quality databases have been made publicly available. Roughly speaking, these databases can be categorized into three groups. The first group contains a smaller set of pristine reference images and artificially distorted images derived from them using different artificial distortions at different intensity levels. The second group contains only digital images with authentic distortions collected from photographers, so pristine images cannot be found in such databases. Virtanen et al. were the first to introduce this type of database for images by releasing CID2013. As a consequence, the development of FR methods is connected to the first group of databases. The third group, which includes the Waterloo Exploration and KADIS-700k databases, is meant to provide an alternative evaluation of objective image quality assessment models by means of paired comparisons. For this reason, these databases contain a set of reference (pristine) images, distorted images, and distortion levels but, in contrast to other databases, they do not provide MOS values. Information about the major publicly available image quality assessment databases is summarized in Table 1.
In this study, we provide a comprehensive evaluation of 32 full-reference image quality assessment (FR-IQA) algorithms on the MDID database. In contrast to other available image quality databases, the images in MDID contain multiple types of distortions simultaneously.
The rest of this study is organized as follows. There are a number of publicly available image quality databases, such as IVC, LIVE IQA, A57, Toyoma, TID2008, CSIQ, IVC-LAR, MMSP 3D, IRSQ, TID2013, CID2013, LIVE In the Wild, Waterloo Exploration, MDID, KonIQ-10k, KADID-10k, and KADIS-700k. In Section 2, we give a brief introduction to each of them. In Section 3, we give a comprehensive evaluation of 32 full-reference image quality assessment (FR-IQA) algorithms on the MDID database. Finally, a conclusion is drawn in Section 4.
2 Image quality databases
The IVC database (http://www2.irccyn.ec-nantes.fr/ivcdb/) consists of 10 pristine images and 235 distorted images, including four types of distortions (JPEG, JPEG2000, locally adaptive resolution coding, blurring). Quality score ratings (1 to 5) are provided in the form of MOS.
The LIVE Image Quality Database (LIVE IQA, http://www.live.ece.utexas.edu/research/quality/subjective.htm) has two releases, Release 1 and Release 2. The Laboratory for Image and Video Engineering (University of Texas at Austin) conducted an extensive experiment to obtain scores from human subjects for a number of images distorted with different distortion types. Release 2 has more distortion types: JPEG (169 images), JPEG2000 (175 images), Gaussian blur (145 images), white noise (145 images), and bit errors in the JPEG2000 bit stream (145 images). The subjective quality scores in this database are DMOS (Differential MOS), ranging from 0 to 100.
The A57 Database (http://vision.eng.shizuoka.ac.jp/mod/page/view.php?id=26) has 3 pristine images and 54 distorted images, including six types of distortions (JPEG, JPEG2000, JPEG2000 with dynamic contrast-based quantization, quantization of the LH subbands of DWT, additive Gaussian white noise, Gaussian blurring). Quality score ratings (0 to 1) are provided in the form of DMOS.
Toyoma Database  consists of 14 pristine images, and 168 distorted images, including two types of distortions (JPEG, JPEG2000). Quality score ratings (1 to 5) are provided in the form of MOS.
Tampere Image Database 2008 (TID2008, http://www.ponomarenko.info/tid2008.htm) contains 25 reference images and 1,700 distorted images (25 reference images $\times$ 17 types of distortions $\times$ 4 levels of distortions). The MOS was obtained from the results of 838 experiments carried out by observers from three countries. The 838 observers performed 256,428 comparisons of the visual quality of distorted images, or 512,856 evaluations of relative visual quality in image pairs. A higher MOS value (0 is minimal, 9 is maximal; the MSE of each score is 0.019) corresponds to higher visual quality of the image. The enclosed file “mos.txt” contains the Mean Opinion Score for each distorted image.
The Computational and Subjective Image Quality (CSIQ, http://vision.eng.shizuoka.ac.jp/mod/page/view.php?id=23) database consists of 30 original images, each distorted using one of six types of distortions, each at four to five different levels of distortion. The images were subjectively rated based on a linear displacement of the images across four calibrated monitors placed side-by-side with equal viewing distance to the observer. The database contains 5,000 subjective ratings from 35 different observers, both male and female. Quality score ratings (0 to 1) are provided in the form of DMOS.
The IVC-LAR database (http://ivc.univ-nantes.fr/en/databases/LAR/) contains 8 pristine images (4 natural images and 4 art images) and 120 distorted images, consisting of three types of distortions (JPEG, JPEG2000, locally adaptive resolution coding). Quality score ratings (1 to 5) are provided in the form of MOS.
The Wireless Imaging Quality (WIQ, https://computervisiononline.com/dataset/1105138665) Database consists of 7 reference images and 80 distorted images. The subjective quality scores are given in DMOS, ranging from 0 to 100.
In contrast to other publicly available image quality databases, the MMSP 3D Image Quality Assessment Database (https://mmspg.epfl.ch/downloads/3diqa/) consists of stereoscopic images with a resolution of pixels. Specifically, 10 indoor and outdoor scenes were captured with a wide variety of colors, textures, and depth structures. Furthermore, 6 different stimuli have been considered corresponding to different camera distances (10, 20, 30, 40, 50, and 60 cm) for each scene.
The Image Retargeting Subjective Quality (IRSQ, http://ivp.ee.cuhk.edu.hk/projects/demo/retargeting/index.html) Database consists of 57 reference images grouped into four attributes, specifically face and people, clear foreground object, natural scenery, and geometric structure. Moreover, ten different retargeting methods (cropping, seam carving, scaling, shift-map editing, scale and stretch, etc.) are applied to generate retargeted images. In total, 171 test images can be found in this database.
Tampere Image Database 2013 (TID2013, http://www.ponomarenko.info/tid2013.htm) contains 25 reference images and 3,000 distorted images (25 reference images $\times$ 24 types of distortions $\times$ 5 levels of distortions). MOS (Mean Opinion Score) is provided as the subjective score, ranging from 0 to 9.
The CID2013 database (http://www.helsinki.fi/psychology/groups/visualcognition/) contains 474 images with authentic distortions captured by 79 imaging devices, such as mobile phones, digital still cameras, and digital single-lens reflex cameras.
The LIVE In the Wild Image Quality Challenge Database (http://live.ece.utexas.edu/research/ChallengeDB/) contains widely diverse authentic image distortions on a large number of images captured using a representative variety of modern mobile devices. The LIVE In the Wild Image Quality Database has over 350,000 opinion scores on 1,162 images evaluated by over 8,100 unique human observers.
The Waterloo Exploration database (https://ece.uwaterloo.ca/~k29ma/exploration/) consists of 4,744 reference images and 94,880 distorted images created from them. Instead of collecting MOS for each test image, the authors introduced three alternative test criteria to evaluate the performance of IQA models: the discriminability test (D-test), the listwise ranking consistency test (L-test), and the pairwise preference consistency test (P-test).
In contrast to other databases considering artificial distortions, MDID (https://www.sz.tsinghua.edu.cn/labs/vipl/mdid.html) obtains distorted images from reference images with random types and levels of distortions. In this way, each distorted image contains multiple types of distortions simultaneously. Gaussian noise, Gaussian blur, contrast change, JPEG noise, and JPEG2000 noise were considered.
The main challenge in applying state-of-the-art deep learning methods to predict image quality in-the-wild is the relatively small size of existing quality scored datasets. The reason for the lack of larger datasets is the massive resources required in generating diverse and publishable content. In KonIQ-10k (http://database.mmsp-kn.de/koniq-10k-database.html), a new systematic and scalable approach is presented to create large-scale, authentic image datasets for image quality assessment. KonIQ-10k consists of 10,073 images, on which large-scale crowdsourcing experiments have been carried out in order to obtain reliable quality ratings from 1,467 crowd workers (1.2 million ratings). During the test, users exhibiting unusual scoring behavior were removed.
KADID-10k (http://database.mmsp-kn.de/kadid-10k-database.html) consists of 81 pristine images and 10,125 distorted images derived from the pristine images considering 25 different distortion types at 5 intensity levels ($81 \times 25 \times 5 = 10{,}125$). In contrast, KADIS-700k contains pristine images and distorted images derived from them using different distortion types at 5 intensity levels, but MOS values are not given in this database.
Table 1: Major publicly available image quality assessment databases.
| Database | Year | Reference images | Test images | Distortion type | Subjective score |
|---|---|---|---|---|---|
| IVC | 2005 | 10 | 235 | artificial | MOS (1-5) |
| LIVE IQA | 2006 | 29 | 779 | artificial | DMOS (0-100) |
| A57 | 2007 | 3 | 54 | artificial | DMOS (0-1) |
| Toyoma | 2008 | 14 | 168 | artificial | MOS (1-5) |
| TID2008 | 2008 | 25 | 1,700 | artificial | MOS (0-9) |
| CSIQ | 2009 | 30 | 866 | artificial | DMOS (0-1) |
| IVC-LAR | 2009 | 8 | 120 | artificial | MOS (1-5) |
| WIQ | 2009 | 7 | 80 | artificial | DMOS (0-100) |
| MMSP 3D | 2009 | 9 | 54 | artificial | MOS (0-100) |
| IRSQ | 2011 | 57 | 171 | artificial | MOS (0-5) |
| TID2013 | 2013 | 25 | 3,000 | artificial | MOS (0-9) |
| CID2013 | 2013 | 8 | 474 | authentic | MOS (0-9) |
| LIVE In the Wild | 2016 | - | 1,162 | authentic | MOS (1-5) |
| Waterloo Exploration | 2016 | 4,744 | 94,880 | artificial | - |
| MDID | 2017 | 20 | 1,600 | artificial | MOS (0-8) |
| KonIQ-10k | 2018 | - | 10,073 | authentic | MOS (1-5) |
| KADID-10k | 2019 | 81 | 10,125 | artificial | MOS (1-5) |
3 Experimental results
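Table 2: PLCC, SROCC, and KROCC values measured over the MDID database.
| Method | Year | PLCC | SROCC | KROCC |
|---|---|---|---|---|
| MDSI (’mult’) | 2016 | 0.8130 | 0.8278 | 0.6441 |
| MDSI (’sum’) | 2016 | 0.8249 | 0.8363 | 0.6527 |
| SSIM CNN | 2018 | 0.8706 | 0.8804 | 0.6992 |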
The evaluation of objective visual quality assessment is based on the correlation between the predicted and the ground-truth quality scores. Pearson’s linear correlation coefficient (PLCC) and Spearman’s rank order correlation coefficient (SROCC) are widely applied to this end. Furthermore, some authors also report Kendall’s rank order correlation coefficient (KROCC).
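The PLCC between data sets $A$ and $B$ is defined as

$$\mathrm{PLCC}(A,B) = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2}\,\sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}},$$

where $\bar{A}$ and $\bar{B}$ stand for the averages of sets $A$ and $B$, and $A_i$ and $B_i$ denote the $i$th elements of sets $A$ and $B$, respectively. For two ranked sets $A$ and $B$, SROCC is defined as

$$\mathrm{SROCC}(A,B) = \frac{\sum_{i=1}^{n} (A_i - \hat{A})(B_i - \hat{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \hat{A})^2}\,\sqrt{\sum_{i=1}^{n} (B_i - \hat{B})^2}},$$

where $\hat{A}$ and $\hat{B}$ are the middle ranks of sets $A$ and $B$. KROCC between data sets $A$ and $B$ can be calculated as

$$\mathrm{KROCC}(A,B) = \frac{C - D}{\frac{1}{2}\,n\,(n-1)},$$

where $n$ is the length of the input vectors, $C$ is the number of concordant pairs between $A$ and $B$, and $D$ is the number of discordant pairs between $A$ and $B$.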
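The three correlation coefficients described above can be illustrated with a minimal, standard-library-only Python sketch. The function names (`plcc`, `average_ranks`, `srocc`, `krocc`) and the sample scores are assumptions made for illustration and are not taken from the study. SROCC is computed here as the Pearson correlation of average ranks, and KROCC follows the pair-counting form given in the text, so tied pairs count as neither concordant nor discordant.

```python
import math


def plcc(a, b):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (math.sqrt(var_a) * math.sqrt(var_b))


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def srocc(a, b):
    """Spearman's rank order correlation: PLCC applied to the ranks."""
    return plcc(average_ranks(a), average_ranks(b))


def krocc(a, b):
    """Kendall's rank order correlation: (C - D) / (n (n - 1) / 2)."""
    n = len(a)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Hypothetical metric outputs and MOS values, for illustration only.
    predicted = [0.2, 0.4, 0.5, 0.9]
    mos = [1.0, 3.0, 2.0, 4.0]
    print(plcc(predicted, mos), srocc(predicted, mos), krocc(predicted, mos))
```

The quadratic pair loop in `krocc` is fine for a database of a few thousand images; library implementations are typically faster and may apply a tie correction, so their values can differ slightly when ties are present.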
We collected 31 FR-IQA metrics whose source code is available online. Furthermore, we reimplemented SSIM CNN (https://github.com/Skythianos/Pretrained-CNNs-for-full-reference-image-quality-assessment) in MATLAB R2019a. In Table 2, we present the PLCC, SROCC, and KROCC values measured over the MDID database. The results show that there is still considerable room for improvement of FR-IQA algorithms, because only HaarPSI was able to produce PLCC and SROCC values higher than 0.9. Furthermore, only three methods (FSIM, FSIMc, and HaarPSI) were able to produce KROCC values higher than 0.7.
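The evaluation protocol behind such a table can be sketched as a loop that scores every (reference, distorted) pair with a metric and then correlates the predictions with the subjective scores. The Python sketch below is only an illustration under stated assumptions: PSNR is used as a simple stand-in metric (it is not claimed to be one of the evaluated methods), the tiny 2×2 "images" and MOS values are invented toy data, and the helper names `psnr`, `pearson`, and `evaluate` are hypothetical.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """Stand-in FR metric: peak signal-to-noise ratio (dB) of two grayscale images
    given as equally sized lists of rows."""
    squared_errors = [
        (r - d) ** 2
        for row_r, row_d in zip(reference, distorted)
        for r, d in zip(row_r, row_d)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def pearson(a, b):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


def evaluate(metric, samples):
    """Correlate metric predictions with MOS over (reference, distorted, mos) triples."""
    predicted = [metric(ref, dist) for ref, dist, _ in samples]
    mos = [score for _, _, score in samples]
    return pearson(predicted, mos)


# Toy data (assumption): one reference image and three brightness-shifted copies,
# paired with invented MOS values that fall as the distortion grows.
reference = [[100, 120], [140, 160]]


def shifted(offset):
    return [[pixel + offset for pixel in row] for row in reference]


samples = [
    (reference, shifted(2), 7.5),
    (reference, shifted(8), 5.0),
    (reference, shifted(20), 2.0),
]

if __name__ == "__main__":
    print("PLCC of the stand-in metric on the toy data:", evaluate(psnr, samples))
```

Swapping `psnr` for any other function with the same `(reference, distorted)` signature reuses the loop unchanged, which is the main reason for passing the metric in as an argument.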
4 Conclusion
First, we gave information about the most widely used image quality databases. Subsequently, we extensively evaluated 32 state-of-the-art FR-IQA methods on the MDID database, whose images contain multiple types of distortions simultaneously. We demonstrated that there is still considerable room for improvement of FR-IQA algorithms, because only HaarPSI was able to produce PLCC and SROCC values higher than 0.9.
- (2006) Image quality assessment based on local variance. In 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 4815–4818.
- (2018) Reviving traditional image quality metrics using cnns. In Color and Imaging Conference, Vol. 2018, pp. 241–246.
- (2009) Subjective quality assessment of lar coded art images. Note: http://www.irccyn.ec-nantes.fr/~autrusse/Databases/
- (2015) Image quality assessment based on dct subband similarity. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 2105–2109.
- (2007) VSNR: a wavelet-based visual signal-to-noise ratio for natural images. IEEE transactions on image processing 16 (9), pp. 2284–2298.
- (2000) Image quality assessment based on a degradation model. IEEE transactions on image processing 9 (4), pp. 636–650.
- (2006) New full-reference quality metrics based on hvs. In Proceedings of the Second International Workshop on Video Processing and Quality Metrics, Vol. 4.
- (2009) Reduced-reference metric design for objective perceptual quality assessment in wireless imaging. Signal Processing: Image Communication 24 (7), pp. 525–547.
- (2010) Subjective quality assessment for wireless image communication: the wireless imaging quality database. In International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM).
- (2005) A content-based image quality metric. In International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing, pp. 231–240.
- (2015) Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing 25 (1), pp. 372–387.
- (2010) Impact of acquisition distortion on the quality of stereoscopic images. In Proceedings of the International Workshop on Video Processing and Quality Metrics for Consumer Electronics.
- (2011) Quaternion structural similarity: a new quality index for color images. IEEE Transactions on Image Processing 21 (4), pp. 1526–1536.
- (2010) Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging 19 (1), pp. 011006.
- (2005) Subjective quality assessment irccyn/ivc database. Note: http://www.irccyn.ec-nantes.fr/ivcdb/
- (2018) KonIQ-10k: towards an ecologically valid and large-scale iqa database. arXiv preprint arXiv:1803.08489.
- (2019) KADID-10k: a large-scale artificially distorted iqa database. In 2019 Tenth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–3.
- (2019) KADID-10k: a large-scale artificially distorted iqa database. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–3.
- (2017) Waterloo exploration database: New challenges for image quality assessment models. IEEE Transactions on Image Processing 26 (2), pp. 1004–1016.
- (2012) Study of subjective and objective quality assessment of retargeted images. In Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pp. 2677–2680.
- (2012) Image retargeting quality assessment: a study of subjective scores and objective metrics. IEEE Journal of Selected Topics in Signal Processing 6 (6), pp. 626–639.
- (2016) Mean deviation similarity index: efficient and reliable full-reference image quality evaluator. IEEE Access 4, pp. 5579–5590.
- (2013) Color image database tid2013: peculiarities and preliminary results. In Visual Information Processing (EUVIP), 2013 4th European Workshop on, pp. 106–111.
- (2009) TID2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics 10 (4), pp. 30–45.
- (2007) On between-coefficient contrast masking of dct basis functions. In Proceedings of the third international workshop on video processing and quality metrics, Vol. 4.
- (2017) Ms-unique: multi-model and sharpness-weighted unsupervised image quality estimation. Electronic Imaging 2017 (12), pp. 30–35.
- (2018) A haar wavelet-based perceptual similarity index for image quality assessment. Signal Processing: Image Communication 61, pp. 33–43.
- (2009) Complex wavelet structural similarity: a new image similarity index. IEEE transactions on image processing 18 (11), pp. 2385–2401.
- (2016) Crowd workers proven useful: a comparative study of subjective video quality assessment. In QoMEX 2016: 8th International Conference on Quality of Multimedia Experience.
- (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15 (11), pp. 3440–3451.
- (2017) MDID: a multiply distorted image database for image quality assessment. Pattern Recognition 61, pp. 153–168.
- (2015) PerSIM: multi-resolution image quality assessment in the perceptually uniform color domain. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 1682–1686.
- (2016) BLeSS: bio-inspired low-level spatiochromatic similarity assisted image quality assessment. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
- (2016) CSV: image quality assessment based on color, structure, and visual system. Signal Processing: Image Communication 48, pp. 92–103.
- (2019) Perceptual image quality assessment through spectral analysis of error representations. Signal Processing: Image Communication 70, pp. 37–46.
- (2008) Impact of the subjective dataset on the performance of image quality metrics. In IEEE International Conference on Image Processing 2008. ICIP 2008.
- (2014) CID2013: a database for evaluating no-reference image quality assessment algorithms. IEEE Transactions on Image Processing 24 (1), pp. 390–402.
- (2016) Multiscale contrast similarity deviation: an effective and efficient index for perceptual image quality assessment. Signal Processing: Image Communication 45, pp. 1–9.
- (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612.
- (2002) A universal image quality index. IEEE signal processing letters 9 (3), pp. 81–84.
- (2003) Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402.
- (2013) Gradient magnitude similarity deviation: a highly efficient perceptual image quality index. IEEE Transactions on Image Processing 23 (2), pp. 684–695.
- (2012) SR-sim: a fast and high performance iqa index based on spectral residual. In 2012 19th IEEE international conference on image processing, pp. 1473–1476.
- (2014) VSI: a visual saliency-induced index for perceptual image quality assessment. IEEE Transactions on Image Processing 23 (10), pp. 4270–4281.
- (2011) FSIM: a feature similarity index for image quality assessment. IEEE transactions on Image Processing 20 (8), pp. 2378–2386.
- (2010) RFSIM: a feature based image quality assessment metric using riesz transforms. In 2010 IEEE International Conference on Image Processing, pp. 321–324.
- (2013) Edge strength similarity for image quality assessment. IEEE Signal processing letters 20 (4), pp. 319–322.
- (1997) Color image quality metric s-cielab and its application on halftone texture visibility. In Proceedings IEEE COMPCON 97. Digest of Papers, pp. 44–48.